Posts Tagged ‘dedupe’

Sticky Policies & Data Classifications

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 28, 2010 @ 12:02 AM

If there is a capability conspicuously missing from the legacy backup products of today (and yesterday), it would be sticky policies that follow data wherever it goes, and automatic data classification that occurs when a file is created.

The capabilities to serve these needs are so obviously missing from the old backup paradigms (still with us today, mind you), that 2 things are starting to happen:

1) People know this stuff is missing. They feel it in their bones. They also feel it when they need to look at piles and piles of data, and want to somehow make sense of it. But they can’t. They also want protection to happen the moment it is needed, with a simple policy.  Not when a backup product tells it to.

2) Vendors know this stuff is missing as well. Many of them operate in the world of block and volume data, and simply have no chance to manage information while they are backing up or replicating blocks of bits, instead of information.  Others have no way to manage metadata intelligently or actively. So they try to “market” their way around it.

The problem is impossible to fix within today’s legacy backup infrastructures. And it’s not going away. It will simply grow, exponentially, unless you start getting after it.

AIMstor from Cofio was created with policy based data management, and classification of data in mind.

Giving users control of what they want to do with data (Live Backup, CDP, Real-Time Replication, Tracking, Archive, etc.) is one thing. Giving users the ability to do it intelligently, such as with workflow and data flow, is something else altogether.
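To make the idea of classification at creation and a policy that "sticks" a little more concrete, here is a minimal sketch. It is not AIMstor's implementation: the class names, the extension table, the policy fields and the helper functions are all hypothetical, chosen only to illustrate the concept.

```python
from dataclasses import dataclass, replace
from pathlib import PurePath

# Hypothetical classification table: file extension -> data class.
CLASS_BY_SUFFIX = {".docx": "office", ".xlsx": "office", ".pst": "email"}

# Hypothetical policies, one per data class.
POLICY_BY_CLASS = {
    "office": {"protect": "versioning", "retention_days": 365},
    "email": {"protect": "cdp", "retention_days": 2555},
}
DEFAULT_POLICY = {"protect": "backup", "retention_days": 30}


@dataclass(frozen=True)
class TrackedFile:
    path: str
    data_class: str
    policy: dict


def classify_on_create(path: str) -> TrackedFile:
    """Classify a file once, at creation, and attach the matching policy."""
    data_class = CLASS_BY_SUFFIX.get(PurePath(path).suffix.lower(), "unclassified")
    policy = POLICY_BY_CLASS.get(data_class, DEFAULT_POLICY)
    return TrackedFile(path, data_class, policy)


def move(tracked: TrackedFile, new_path: str) -> TrackedFile:
    """Moving or renaming changes only the path; class and policy stay attached."""
    return replace(tracked, path=new_path)


report = classify_on_create("C:/Users/ann/report.docx")
archived = move(report, "D:/old/report.bak")
print(archived.data_class, archived.policy["protect"])
```

The point of the sketch is the `move` function: the classification is made once and travels with the data, rather than being re-derived from wherever the file happens to live.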

CDP is a Dog -> Unless it’s UNIFIED with Backup

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 14, 2010 @ 03:32 PM

It’s true.  CDP is tremendous technology, offering granular point-in-time restore that backups simply cannot do. But CDP (Continuous Data Protection) has severe retention and data management limitations, so backup is absolutely necessary.

But – why do CDP if you cannot get it FULLY UNIFIED with your backup solution??  I don’t mean “integrated”.  Any moron can “integrate” a CDP product with their legacy backup product (and, many have, mind you). You just tell the people in marketing to make the box look the same, and update the user manual.

THE TRUTH: CDP is ONLY worth doing, if it comes FULLY UNIFIED with a next-generation backup solution (optimally, with inherent deduplication). That way, they share the same data mover, the same repository, the same metadata, the same underlying data structure and supporting infrastructure.

I don’t know any other product besides Cofio’s AIMstor that does this. You get granularity of CDP, with smart retention flexibility of AIMstor’s next-gen Backup, and all the great policy driven capabilities that come together with it. You can also empower Bare Metal Restore from your backup and CDP sets, which are fully single-instanced for huge capacity savings.

More importantly, because AIMstor auto-classifies data, you can SELECT what you want to CDP, and what you want to Backup, and what retention you want for very specific types of data, or whole categories of data. Standalone CDP  products are kinda, well, dumb. They like to move . . . everything. Optimal? Uhm, not.

So what happens if you buy CDP that is NOT unified with your backup solution? Triple the data movers, double the repository setup and capacity usage, double the overhead to servers and clients, double the admin time, double the infrastructure. Plus, you probably can’t select what you really want, so you will just end up wasting even more resources.  Why do it?

The Legacy Backup Bubble (Part II)

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Tue, Jan 12, 2010 @ 05:14 PM

Legacy Backup is a major market in the data protection space, and is still going strong. Regardless of its inefficiencies, people still buy it, and add onto their existing Legacy Backup environment. However, users are starting to take notice.

User backup forums regularly point to the failure of Legacy Backup products to deliver any upstream value, their typical failure rates as a result of server-dependent architectures, and their terrible storage inefficiency.

In addition, many environmental factors have crept into the woodwork at user sites (business intelligence, eDiscovery needs, compliance requirements, etc.), and now that the paint is off, people are finally getting a look at what’s underneath the hood of Legacy Backup products. It won’t be long.

Deduplication was a key first mover that really made people question the insanity of Legacy Backup. Why create something so inherently inefficient that it required such a huge level of clean-up? (remember, 20X or greater is the typical deduplication cleanup rate).
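To put the 20X figure in perspective, a deduplication ratio can be converted into the fraction of stored capacity that turned out to be redundant. The function name below is just an illustrative choice; the arithmetic is simply 1 - 1/ratio.

```python
def capacity_saved(dedupe_ratio: float) -> float:
    """Fraction of raw backup capacity eliminated at a given dedupe ratio."""
    if dedupe_ratio < 1:
        raise ValueError("a dedupe ratio below 1 would mean the data grew")
    return 1 - 1 / dedupe_ratio


# At 20X, only 1/20th of what the backup product wrote needs to be kept.
print(f"{capacity_saved(20):.0%} of the written data was duplicate")
```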

Cloud architectures will soon expose even more inadequacies in the Legacy Backup camp, forcing many vendors to accommodate Cloud storage in strange, non-optimal ways.

Virtual machine sprawl has added more headaches to the Legacy Backup camp because of I/O and overhead issues created by Legacy Backup, and multiplied by VMs.

Additionally, users are becoming more reliant on other tools within the market to make up for the lack of flexible recovery capability of Legacy Backup. CDP, Replication, Bare Metal Restore, and others, are coming into play in the mid-market. As are technologies that help manage information: index/search tools, data classification, policy management, and tools that control data for added layers of security or monitoring.

There are many others, but these ones stick out. When things be

The Legacy Backup Bubble (Part I)

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Tue, Jan 05, 2010 @ 07:10 PM

The terrible inefficiency of Legacy Backup has created new markets and new companies over the past decade in the storage backup space.  Many are fixes applied to Legacy Backup itself, many others are another form of Legacy Backup, that solve some issues for a key market or vertical. Many have been proven to solve real world problems, caused, of course, by Legacy Backup.

So, what is Legacy Backup? You are probably using it right now in your data center, your remote office, or your SMB, and most certainly, in your enterprise. It’s a product that protects your data by doing several things based on a schedule, then sends a copy of some processed data to disk or tape. Unfortunately, it batch copies data, creates massive and unnecessary duplication of data, and has no ability to share its repository, its processes, policies, metadata, data movement, or any of its significant infrastructure with other data protection products (like CDP, Replication, Archive, etc.).

The great thing about inefficiency is that it creates need. And where there is need, there is opportunity. But the reason for the need, it is now being learned, is that Legacy Backup is the problem. Like any boom or bubble, Legacy Backup will . . . ultimately . . . pop.

Don’t Shop with Barbarians: More on Volume CDP/Replication

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Fri, Nov 27, 2009 @ 05:21 PM

In response to our CTO’s last blog (Why Volume CDP and Replication products are so Wasteful), and because it is Black Friday, a day with a serious shopping theme, we had a few comments out there from a few Volume Replication Vendors, so I thought I would answer them here and keep things better organized:

Comment 1:  I guess it depends on what the volume contains. And the purpose of doing it in the first place. Certainly replicating or CDPing a system volume doesn’t seem to make much sense unless the reason for it is Disaster Recovery at a remote site. But replicating or CDPing a volume that only contains business critical data could be meaningful particularly in compliant heavy environments. Were the donkeys nodding?

Answer 1: There are some noisy applications.  In this particular example it was Sophos anti-virus.   But the OS can do very well all on its own too. OS vendors even call out noisy directories that should be avoided during backups, because there is no value in a restore, and there is an obvious cost to replicate it.  It is also not untypical for an application to want to create temporary files that have no business value on the same volume as the database. You want to replicate all that too?

With Volume CDP grabbing it all, that extra 40GB-50GB per machine gets expensive. Multiply that by a large number of machines, and the overhead is very large. Plus, the extra CPU, energy, and bandwidth spent sending wasteful and unneeded data is another big cost that adds up quickly, and keeps growing once you consider the enterprise. That is the essence of the problem with Volume CDP and Replication. It is indiscriminate by nature and grabs everything.
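As a rough back-of-the-envelope illustration of how that adds up, the sketch below multiplies the 40GB-50GB per-machine figure by a fleet size. The machine count of 200 is purely a hypothetical number for the example, not a measurement.

```python
def wasted_gb(machines: int, noise_gb_per_machine: float) -> float:
    """Total useless data captured if every machine contributes the same noise."""
    return machines * noise_gb_per_machine


MACHINES = 200  # hypothetical fleet size, for illustration only
low = wasted_gb(MACHINES, 40)
high = wasted_gb(MACHINES, 50)
print(f"{low:,.0f} to {high:,.0f} GB of data with no restore value")
```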

Kind of like a starving barbarian with a big shopping cart at the grocers on double-coupon day, she can’t even resist taking the trash with her.

The donkeys weren’t nodding, but they were chuckling.

Comment 2: Couple of things that are puzzling to me in your blog are the fact that there were 40GB of wasted capacity in a single server during 1 week? That would certainly not be the norm and if it was there would be other useful conversations to have with a client.  As for CDP being intelligent enough to distinguish useful data. Great idea and most enterprise CDP solutions will have this ability now or in the near future. Even more important when considering replication is evaluating solutions that will compare data on the local and remote site and deduplicate before replicating the changes across the wire. We have customer examples that were able to shave 70%+ off replicated data!

Answer 2: We are simply saying that it’s a good idea to avoid sending all that unneeded data, in the name of simple logic, speed and efficiency.  The only effective way of combating this is by understanding the data (which is what AIMstor has solved).

I’d be interested in seeing how the Volume replication vendors address this.  I suggest that they can’t.

Volume replication arguments have generally been that the “customer” ought to reconfigure their system to suit the replication technology’s inability to address data types or data classifications. Have a volume for one thing, another volume for another, etc. While it certainly may make sense to partition your system, the point is, the customer shouldn’t be forced to because of the failings of the CDP product. Let the customer partition storage based on what makes sense to his application, not because of the inability of the volume CDP product.

The fact is also, CDP shouldn’t just be for the application. Why shouldn’t it be used for the system volume as it provides a good DR image as well? Or something even more radical, why not provide a hybrid: periodic transfers of parts of the system but CDP granularity for other parts of the system. Imagine you have a volume that holds both the OS and the application (an OK example, normally for smaller setups): you could take periodic images of the OS, but then CDP the application data. This will minimize data transmitted and provide very nice and granular application restore, with a safe set of periodic images of the OS. You also get big overhead reduction, plus savings on CPU, energy, bandwidth, etc.

Bringing up the de-duplication topic is interesting too. Understanding the data you are de-duplicating, like we do, substantially increases the de-duplication rates. That is also why Data Domain excels: it distinguishes the data boundaries and doesn’t treat everything as a dumb block. It would be good to know how much of that 70% replicated data savings you mention was just white space elimination, which should have never been transferred in the first place. If so, I am puzzled, because that approach, which is typical among all Volume-approach vendors, seems to be making a mistake, and then the vendor congratulates himself for later correcting his mistakes.
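The hybrid idea can be sketched as a small rule table that picks a protection mode per path on a single shared volume. This is only an illustration: the `C:\SQLData` directory, the mode names and the function are hypothetical, not AIMstor configuration.

```python
from pathlib import PureWindowsPath

# Hypothetical rules, checked in order; the first matching root wins.
RULES = [
    (PureWindowsPath(r"C:\SQLData"), "cdp"),       # application data: continuous
    (PureWindowsPath("C:\\"), "periodic_image"),   # everything else on C: (the OS)
]


def protection_mode(path: str) -> str:
    """Return the protection mode for a path on a mixed OS + application volume."""
    p = PureWindowsPath(path)
    for root, mode in RULES:
        if p == root or root in p.parents:
            return mode
    return "unprotected"


print(protection_mode(r"C:\SQLData\orders.mdf"))
print(protection_mode(r"C:\Windows\System32\kernel32.dll"))
```

Because the decision is made per path rather than per volume, nobody has to repartition storage to get different granularity for the OS and the application.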
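One way to check how much of a claimed saving is only white space is to count the all-zero blocks in a volume image. The sketch below assumes a fixed 4096-byte block size and uses made-up sample data; both are assumptions for illustration, not figures from any vendor.

```python
BLOCK_SIZE = 4096  # assumed block size for the illustration


def zero_block_fraction(volume: bytes, block_size: int = BLOCK_SIZE) -> float:
    """Fraction of blocks in a volume image that contain nothing but zero bytes."""
    blocks = [volume[i:i + block_size] for i in range(0, len(volume), block_size)]
    if not blocks:
        return 0.0
    empty = sum(1 for block in blocks if not any(block))
    return empty / len(blocks)


# Made-up sample: 7 empty blocks followed by 3 blocks of real data.
sample = b"\x00" * (7 * BLOCK_SIZE) + b"x" * (3 * BLOCK_SIZE)
print(f"{zero_block_fraction(sample):.0%} of blocks are white space")
```

If most of a vendor's reduction comes from blocks like these, the saving says more about what should never have been sent than about the quality of the deduplication.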

And that’s supposed to be a “solution”?

Why Volume CDP and Replication products are so Wasteful

Friday, March 19th, 2010
  • Originally Posted by Fabrice Helliker on Tue, Nov 03, 2009 @ 11:24 AM

I’m often bewildered by the prevalence of volume CDP  or volume replication products.  This is the type of replication that works at either the whole disk or the partition level.    At this level, everything that is replicated is a dumb block.  There is no context as to “what” the blocks are . . . so, everything is replicated.

So let’s talk about something fundamental – wasted data transfers, wasted storage, and unnecessary system loads.

First let me describe a real world problem. We had AIMstor set up to Backup, Version and CDP an assortment of machines. We’d select the whole machine so that we could perform point in time bare metal restores in conjunction with file versioning of user documents. Many of the machines were office systems, although what we’ve observed would have been exactly the same for a file server.

We decided to analyze the weekend traffic.  Note: because it was the weekend, we really didn’t expect an awful lot of traffic as the systems weren’t in use.  What surprised us however, is the amount of useless data that collected over this period.  We know operating systems can generate noise in the way of unwanted, temporary files, but for this test, we turned “off” all of the filtering within AIMstor.  What shocked us though was the incredible amount of useless data that was generated that has absolutely zero value.

One system alone, generated a staggering 40GB of temporary files.  A large amount of this was created by a virus checker.  Fortunately, because AIMstor works at a very granular level, this type of waste and noise can be easily filtered out.

Take your average Windows OS and you will find a lot of data written to disk that has no value to the business. The system’s pagefile and prefetch files are constantly being written to. This is before you apply virus checkers or user applications like Skype (yes it writes a lot to disk), Temporary Internet Files, etc.

And this is where volume level replication is so wasteful.   With Volume Replication everything is transferred and stored.  Factor a CDP system and then you are looking at capturing, transferring and storing a lot of unnecessary data.

Consider also that every block transferred is a load on the source system, the network and storage subsystem. There is an awful lot of energy and resources that goes into supporting Volume Replication and Volume CDP products . . . for no good reason.
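File-level filtering of this kind of noise can be sketched in a few lines. The pattern list below is an assumption built from the examples named in this post (pagefile, prefetch, temporary files); it is not AIMstor's actual filter set, and the sample change sizes are invented.

```python
from fnmatch import fnmatchcase

# Hypothetical noise patterns, matched against lower-cased Windows paths.
NOISE_PATTERNS = [
    "*\\pagefile.sys",
    "*\\windows\\prefetch\\*",
    "*\\temporary internet files\\*",
    "*\\temp\\*",
    "*.tmp",
]


def is_noise(path: str) -> bool:
    """True if the path matches a pattern with no restore value."""
    lowered = path.lower()
    return any(fnmatchcase(lowered, pattern) for pattern in NOISE_PATTERNS)


def bytes_worth_sending(changes: list[tuple[str, int]]) -> int:
    """Sum the sizes of changed files that survive the noise filter."""
    return sum(size for path, size in changes if not is_noise(path))


changes = [
    (r"C:\pagefile.sys", 4_000_000_000),
    (r"C:\Windows\Prefetch\SKYPE.EXE-1A2B3C4D.pf", 120_000),
    (r"C:\Users\ann\Documents\plan.docx", 85_000),
]
print(bytes_worth_sending(changes))
```

A block-level product sees all three changes as identical dirty blocks; only a product that knows the file names can drop the first two.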

The 4-Step Dedupe for Backup

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Sat, Aug 15, 2009 @ 08:36 PM

So, we all know now that the old legacy backup solutions create HUGE waste and require deduplication appliances. But now, for users considering upgrading their environments for “intelligent backup”, DR, replication, or archive, or simply to herd the cats of unstructured data, there seems to be confusion among users about the issues of:

-Source Deduplication (doing the dedupe at primary data), versus

-Target Deduplication (doing the dedupe at the repository).

A number of articles and postings have been put out there, but it typically comes down to the same question: What is best, doing it at Source or Target?

We asked the same question. Then we said, heck, why not BOTH?

This is why AIMstor was originally architected to provide 4-step Dedupe.  Source level dedupe via 2 methods, and Target level dedupe via 2 other separate methods:

Source Side Dedupe:
Step 1-Duplicate Transfer Avoidance: At the initial sync between node and repository, if the repository has the data already (from a previous node), it tells the node to transfer only “new” data. This saves a lot of time and network bandwidth, and minimizes initial repository capacity.

Step 2-Real-Time Changed Byte Transfer: At the same time, with subsequent Backups, CDP, Replication and Versions, AIMstor will only transfer changed bytes from the node. That reduces network traffic and load on the node. Because AIMstor is real-time, there is no scan or trawl. So data constantly trickles from the node when it changes, and hits the repository according to the RPO settings for the backup.
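The two source-side steps can be sketched roughly as follows. This is an illustration of the general technique, not AIMstor's code: the use of SHA-256 digests and the function names are assumptions, and where AIMstor is described as capturing changes in real time without a scan, the second function simply compares two versions to show what "changed bytes" means.

```python
import hashlib


def digest(chunk: bytes) -> str:
    return hashlib.sha256(chunk).hexdigest()


def chunks_to_send(node_chunks: list[bytes], repo_digests: set[str]) -> list[bytes]:
    """Step 1 (sketch): send only the chunks whose digest the repository lacks."""
    return [chunk for chunk in node_chunks if digest(chunk) not in repo_digests]


def changed_ranges(old: bytes, new: bytes) -> list[tuple[int, bytes]]:
    """Step 2 (sketch): (offset, data) runs where `new` differs from `old`.

    Bytes appended past the end of `old` count as changed; truncation is
    not handled in this simplified version.
    """
    ranges = []
    start = None
    for i in range(len(new)):
        differs = i >= len(old) or old[i] != new[i]
        if differs and start is None:
            start = i
        elif not differs and start is not None:
            ranges.append((start, new[start:i]))
            start = None
    if start is not None:
        ranges.append((start, new[start:]))
    return ranges


# The repository already holds the OS image from an earlier node.
repo = {digest(b"os-image")}
print(chunks_to_send([b"os-image", b"user-doc"], repo))
print(changed_ranges(b"hello world", b"hellO world!"))
```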

Target Dedupe:
Step 3-Multi-Level Single Instance Storage: Because the AIMstor repository is unified across Backups, Replication, CDP and Versions, it allows only a single instance of a file, no matter where it came from.

Step 4-Global Object Deduplication: Also in the repository, AIMstor runs a final step post processing deduplication algorithm across all data sets from all machines. Thereby finalizing the deduplication with four complete steps to best reduce the total amount of data capacity used in repositories.
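The two target-side steps can be sketched in the same spirit. Again this is an assumed illustration rather than AIMstor's design: the post does not say how "objects" are defined, so the sketch uses fixed-size pieces, and the class and function names are hypothetical.

```python
import hashlib


class SingleInstanceStore:
    """Step 3 (sketch): identical files are stored once, whatever their origin."""

    def __init__(self):
        self.blobs = {}    # content digest -> file content (one copy)
        self.catalog = {}  # (machine, path) -> content digest

    def put(self, machine: str, path: str, content: bytes) -> None:
        key = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(key, content)
        self.catalog[(machine, path)] = key

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


def dedupe_objects(blobs: dict, object_size: int = 4096):
    """Step 4 (sketch): post-process pass that shares sub-file objects globally."""
    objects = {}  # object digest -> object bytes (one copy)
    recipes = {}  # blob digest -> ordered object digests needed to rebuild it
    for blob_id, content in blobs.items():
        recipe = []
        for i in range(0, len(content), object_size):
            piece = content[i:i + object_size]
            object_id = hashlib.sha256(piece).hexdigest()
            objects.setdefault(object_id, piece)
            recipe.append(object_id)
        recipes[blob_id] = recipe
    return objects, recipes


store = SingleInstanceStore()
store.put("pc1", "a.doc", b"AAAABBBB")
store.put("pc2", "copy.doc", b"AAAABBBB")  # same content: no new blob
store.put("pc2", "b.doc", b"AAAACCCC")     # shares its first half with a.doc
objects, recipes = dedupe_objects(store.blobs, object_size=4)
print(store.stored_bytes(), sum(len(o) for o in objects.values()))
```

In this toy run, 24 bytes arrive from the nodes, single instancing keeps 16, and the object pass keeps 12, which mirrors how each step trims a bit more.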

The AIMstor repository automates all this as part of any policy. The good news is that it is downloadable now, and available for Windows environments.

Agent vs Agent-Free in Data Protection, Archive, other stuff . . .

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Wed, Jul 29, 2009 @ 04:09 PM

Lots of talk out there these days about the value of agent versus agent-less approaches to deploying data protection and data management tools.

After all is said and done, I think it boils down to a simple question of value versus convenience.

Agent-less designs are convenient for deployment, say, for very basic backup capabilities. That is about where the advantage ends. They cannot deliver much value other than what the OS and network tools allow. This becomes confining and inflexible to users with greater data management needs at the SME and enterprise levels. Agent-less designs also come with some level of painful “cost” to users (probes, scans, queries, etc., to get a look at data, and then extract it).

This is why all major vendors have deployed their solution in agent format. They can deliver more value, in a less painful way, with greater control on added value. We have not seen any proof of a next-generation agent-less architecture out there. OS’s and network protocols are basically the same, but there have been a few workarounds to offer more value, which is great. Still, compared to the relatively basic value you get with even the most legacy agent based products, it’s not much.

However, there is a very BIG problem with agents: There are simply TOO MANY of them.

Vendors issue agents to each machine for backup, for replication, for file archive, for VMware protection, for changed data tracking, for forensic monitoring, for data leak blocking, and other things. All these agents are redundant to some degree, but vendors cannot figure out how to get them to work together. Yet, while there are too many agents, the ability of agent-less solutions (yes, including the mighty cloud) to provide equivalent levels of value down to physical or virtual machines that are on-premises is simply not there. The obvious approach is to limit agents, and unify the value of multiple agents into fewer agents (a single agent?).

Basically, create a platform that allows more and more tools to be added. If the user wants to activate them, great. If not, then just use the one or two tools you do want.  Use the others next year if you need them. The bottom line, is that you have limited the number of agents proliferating throughout your network, and increased value on a per agent basis several-fold.  And, you don’t have to deal with any of the agent-free negatives we put together down below.

Agent-less approach, the bad side:

  • File transfers via the network share APIs are much much slower than through an agent designed for that purpose.
  • An agent-less solution consumes more CPU because the methods of accessing the data have many layers of protocols designed for general purpose.
  • The systems need to be polled constantly to find out what is new.  The polling method won’t be nice, it will require their server to basically log on, scan the logs etc and then move on.
  • Because the network interface lets you know what file has changed only, you can only get the whole file.  This makes it useless for any database application or email server, and frankly it isn’t very good for such things as PST files, which are regularly 1-2GB in size.
  • This only works when you are connected.  Data isn’t journalled so there is no tracking or versioning when you are disconnected.
  • You won’t be able to compress the network traffic.
  • Sometimes vendors mask the fact that “transient agents” or “security code” must be deployed to clients, which can be as much of a hassle to deploy and manage as real agents, while delivering far less value.
  • If someone changes a password, services can be interrupted.
  • For Virtual Machine protection, the agent-less polls and scans will multiply the pain and workload on the physical systems, multiplied by the number of VMs, where I/O issues already exist.
  • Some agent-less vendors market CDP. Truthfully, it is a poll on the file journal logs, it’s not real time.  It’s not really CDP.
  • For data leakage, detailed monitoring, and content security, the problems multiply by a huge amount. How do you deliver any level of compliance without real-time understanding of granular data changes?
  • Too many issues to name without writing a book.
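The polling point in the list above can be illustrated with a small simulation: a poller captures at most one version of a file per poll, however many times the file was written in between, whereas a journal kept by an agent records every write. The write times, poll interval and function name are invented for the example.

```python
def versions_seen_by_polling(write_times: list[int], poll_interval: int, horizon: int) -> int:
    """Count the versions a poller captures: at most one per poll with changes."""
    seen = 0
    last_poll = 0
    poll_time = poll_interval
    while poll_time <= horizon:
        if any(last_poll < w <= poll_time for w in write_times):
            seen += 1
        last_poll = poll_time
        poll_time += poll_interval
    return seen


# Hypothetical writes to one file, in minutes since the start of the day.
writes = [1, 2, 3, 14, 15, 31]
polled = versions_seen_by_polling(writes, poll_interval=10, horizon=40)
journalled = len(writes)  # a real-time journal records every write
print(polled, journalled)
```

On top of the lost versions, each of those polled captures moves the whole file, so a handful of small edits to a 1-2GB PST file means several gigabytes on the wire.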