Posts Tagged ‘ILM’

Sticky Policies & Data Classifications

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 28, 2010 @ 12:02 AM

If there is one capability conspicuously missing from the legacy backup products of today (and yesterday), it would be sticky policies that follow data wherever it goes, and automatic data classification that occurs when a file is created.

The capabilities to serve these needs are so obviously missing from the old backup paradigms (still with us today, mind you), that two things are starting to happen:

1) People know this stuff is missing. They feel it in their bones. They also feel it when they need to look at piles and piles of data, and want to somehow make sense of it. But they can’t. They also want protection to happen the moment it is needed, with a simple policy.  Not when a backup product tells it to.

2) Vendors know this stuff is missing as well. Many of them operate in the world of block and volume data, and simply have no chance to manage information while they are backing up or replicating blocks of bits, instead of information.  Others have no way to manage metadata intelligently or actively. So they try to “market” their way around it.

The problem is impossible to fix within today’s legacy backup infrastructures. And it’s not going away. It will simply keep growing unless you start getting after it.

AIMstor from Cofio was created with policy-based data management and classification of data in mind.

Giving users control of what they want to do with data (Live Backup, CDP, Real-Time Replication, Tracking, Archive, etc.) is one thing. Giving users the ability to do it intelligently, such as with workflow and data flow, is something else altogether.
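To make the idea concrete, here is a minimal sketch of classification at file creation with a policy that stays attached when the file moves. It is not AIMstor's API. The rule tables, the `TaggedFile` type, and the function names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Hypothetical rule table: file extension -> data class.
CLASS_BY_SUFFIX = {".pst": "email", ".mdb": "database", ".docx": "document"}

# Hypothetical policy table: data class -> protection actions.
POLICY_BY_CLASS = {
    "email": ("cdp", "archive"),
    "database": ("cdp", "replicate"),
    "document": ("live_backup",),
    "unclassified": (),
}


@dataclass(frozen=True)
class TaggedFile:
    path: str
    data_class: str
    policy: tuple


def classify_on_create(path):
    """Classify a file the moment it is created and attach its policy."""
    suffix = PurePath(path).suffix.lower()
    data_class = CLASS_BY_SUFFIX.get(suffix, "unclassified")
    return TaggedFile(path, data_class, POLICY_BY_CLASS[data_class])


def move(tagged, new_path):
    """The policy is 'sticky': a move changes the path, not the class or policy."""
    return TaggedFile(new_path, tagged.data_class, tagged.policy)
```

In this sketch a `.pst` file is tagged as email when it is created, and it keeps its CDP-plus-archive policy even after `move()` renames it to something the suffix rules would no longer recognize.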

Don’t Shop with Barbarians: More on Volume CDP/Replication

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Fri, Nov 27, 2009 @ 05:21 PM

In response to our CTO’s last blog (Why Volume CDP and Replication products are so Wasteful), we had a few comments out there from a few Volume Replication vendors. Because it is Black Friday, a day with a serious shopping theme, I thought I would answer them here and keep things better organized:

Comment 1:  I guess it depends on what the volume contains. And the purpose of doing it in the first place. Certainly replicating or CDPing a system volume doesn’t seem to make much sense unless the reason for it is Disaster Recovery at a remote site. But replicating or CDPing a volume that only contains business critical data could be meaningful particularly in compliant heavy environments. Were the donkeys nodding?

Answer 1: There are some noisy applications.  In this particular example it was Sophos anti-virus.   But the OS can do very well all on its own too. OS vendors even call out noisy directories that should be avoided during backups, because there is no value in restoring them, and there is an obvious cost to replicating them.  It is also not unusual for an application to create temporary files that have no business value on the same volume as the database. You want to replicate all that too?

With Volume CDP grabbing it all, that extra 40GB-50GB per machine gets expensive.  Multiply that by a large number of machines, and the overhead is very large.  Plus, the extra CPU, energy, and bandwidth spent sending wasteful and unneeded data is another big cost that adds up quickly, and then balloons once you consider the whole enterprise.  That is the essence of the problem with Volume CDP and Replication. It is indiscriminate by nature and grabs everything.

Kind of like a starving barbarian with a big shopping cart at the grocer’s on double-coupon day: she can’t even resist taking the trash with her.

The donkeys weren’t nodding, but they were chuckling.
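To put the per-machine overhead from Answer 1 in perspective, here is a back-of-the-envelope calculation. The 40GB-50GB range comes from the example above. The fleet size of 200 machines is a made-up number for illustration only.

```python
def fleet_overhead_gb(per_machine_gb, machines):
    """Unneeded data replicated across a fleet, in GB."""
    return per_machine_gb * machines


MACHINES = 200  # hypothetical fleet size, for illustration only

low = fleet_overhead_gb(40, MACHINES)
high = fleet_overhead_gb(50, MACHINES)
print(f"{low:,} GB to {high:,} GB of unneeded data")
```

The function only counts capacity. The CPU, energy, and bandwidth spent moving that data come on top of it.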

Comment 2: Couple of things that are puzzling to me in your blog are the fact that there were 40GB of wasted capacity in a single server during 1 week? That would certainly not be the norm and if it was there would be other useful conversations to have with a client.  As for CDP being intelligent enough to distinguish useful data. Great idea and most enterprise CDP solutions will have this ability now or in the near future. Even more important when considering replication is evaluating solutions that will compare data on the local and remote site and deduplicate before replicating the changes across the wire. We have customer examples that were able to shave 70%+ off replicated data!

Answer 2: We are simply saying that it’s a good idea to avoid sending all that unneeded data, in the name of simple logic, speed and efficiency.  The only effective way of combating this is by understanding the data (which is what AIMstor has solved).

I’d be interested in seeing how the Volume replication vendors address this.  I suggest that they can’t.

Volume replication vendors have generally argued that the “customer” ought to reconfigure their system to suit the replication technology’s inability to address data types or data classifications: have a volume for one thing, another volume for another, etc.   While it certainly may make sense to partition your system, the point is that the customer shouldn’t be forced to because of the failings of the CDP product. Let the customer partition storage based on what makes sense for the application, not because of the limitations of the volume CDP product.

The fact is also that CDP shouldn’t just be for the application.  Why shouldn’t it be used for the system volume, as it provides a good DR image as well?  Or something even more radical: why not provide a hybrid, with periodic transfers of some parts of the system but CDP granularity for other parts?  Imagine you have a volume that holds both the OS and the application (an OK example, normally for smaller setups). You could take periodic images of the OS, but then CDP the application data.  This will minimize data transmitted and provide very nice and granular application restore, with a safe set of periodic images of the OS. You also get a big overhead reduction, plus savings on CPU, energy, bandwidth, etc.
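The hybrid idea can be sketched as a small rule table that maps paths on one volume to a protection mode. This is only an illustration under assumed names. The paths, the mode strings, and `protection_mode` are hypothetical, not the configuration format of any product.

```python
from pathlib import PureWindowsPath

# Hypothetical rules for one mixed OS + application volume.
# Checked in order; the first matching root wins.
RULES = [
    (PureWindowsPath("C:/Windows/Temp"), "exclude"),   # noisy, no restore value
    (PureWindowsPath("C:/AppData"), "cdp"),            # application data
    (PureWindowsPath("C:/"), "periodic_image"),        # everything else (the OS)
]


def protection_mode(path):
    """Return the protection mode for a path on the volume."""
    p = PureWindowsPath(path)
    for root, mode in RULES:
        # Windows path comparison is case-insensitive.
        if p == root or root in p.parents:
            return mode
    return "exclude"  # paths outside the volume are not protected here
```

With these rules, a database file under `C:\AppData` gets CDP granularity, system files fall through to periodic images, and temporary files are never transmitted at all.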

Bringing up the de-duplication topic is interesting too.  Understanding the data you are de-duplicating, as we do, substantially increases the de-duplication rates.  That is also why Data Domain excels: it distinguishes the data boundaries and doesn’t treat everything as a dumb block.   It would be good to know how much of that 70% replicated data savings you mention was just white space elimination, which should never have been transferred in the first place.  If so, I am puzzled, because that approach, which is typical among all Volume-approach vendors, amounts to making a mistake and then congratulating yourself for later correcting it.

And that’s supposed to be a “solution”?

The 4-Step Dedupe for Backup

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Sat, Aug 15, 2009 @ 08:36 PM

So, we all know now that the old legacy backup solutions create HUGE waste and require deduplication appliances. But now, for users considering upgrading their environments for “intelligent backup”, DR, replication, or archive, or simply to herd the cats of unstructured data, there seems to be confusion about the issues of:

-Source Deduplication (doing the dedupe at primary data), versus

-Target Deduplication (doing the dedupe at the repository).

A number of articles and postings have been put out there, but it typically comes down to the same question: What is best, doing it at Source or Target?

We asked the same question. Then we said, heck, why not BOTH?

This is why AIMstor was originally architected to provide 4-step Dedupe.  Source level dedupe via 2 methods, and Target level dedupe via 2 other separate methods:

Source Side Dedupe:
Step 1-Duplicate Transfer Avoidance: At the initial sync between node and repository, if the repository has the data already (from a previous node), it tells the node to transfer only “new” data. This saves a lot of time and network bandwidth, and minimizes initial repository capacity.

Step 2-Real-Time Changed Byte Transfer: At the same time, with subsequent Backups, CDP, Replication and Versions, AIMstor will only transfer changed bytes from the node. That reduces network traffic and load on the node. Because AIMstor is real-time, there is no scan or trawl.  So data constantly trickles from the node when it changes, and hits the repository according to the RPO settings for the backup.
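The two source-side steps can be sketched roughly as follows. This is an illustration of the general technique, not AIMstor's implementation. It assumes content is identified by a SHA-256 digest and that a captured change arrives as an offset plus the changed bytes. Both are assumptions, and the function names are hypothetical.

```python
import hashlib


def sha(data):
    """Content digest used to recognize data the repository already holds."""
    return hashlib.sha256(data).hexdigest()


def initial_sync(node_files, repo_blobs):
    """Step 1 sketch: transfer only content whose digest the repository lacks.

    node_files: dict of name -> bytes on the node.
    repo_blobs: dict of digest -> bytes in the repository (updated in place).
    Returns the number of bytes that had to cross the wire.
    """
    sent = 0
    for data in node_files.values():
        digest = sha(data)
        if digest not in repo_blobs:
            repo_blobs[digest] = data
            sent += len(data)
    return sent


def apply_write(replica, offset, data):
    """Step 2 sketch: apply one captured write to the repository's copy.

    replica: bytearray holding the repository's copy of the file.
    Only len(data) bytes are transferred, not the whole file.
    """
    end = offset + len(data)
    if end > len(replica):
        replica.extend(b"\0" * (end - len(replica)))  # the file grew
    replica[offset:end] = data
    return len(data)
```

In this sketch, a second node that shares a 100-byte system file with the first node sends only its own unique files at initial sync, and a 5-byte edit to a large file later costs 5 bytes on the wire.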

Target Dedupe:
Step 3-Multi-Level Single Instance Storage: Because the AIMstor repository is unified across Backups, Replication, CDP and Versions, it keeps only a single instance of a file, no matter where it came from.

Step 4-Global Object Deduplication: Also in the repository, AIMstor runs a final post-processing deduplication algorithm across all data sets from all machines. This completes the four steps and best reduces the total amount of data capacity used in repositories.
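The two target-side steps might look something like the sketch below. The post does not say what an "object" is in Step 4, so this example assumes fixed-size chunks, which is a stand-in technique and not a description of the AIMstor algorithm. All names are hypothetical.

```python
import hashlib


def sha(data):
    return hashlib.sha256(data).hexdigest()


def single_instance(files):
    """Step 3 sketch: keep one stored copy per distinct file content.

    files: iterable of (source, name, data), where source is e.g. a backup
    or a replication job. Returns (store, catalog): store maps a digest to
    the data, catalog maps (source, name) to the digest.
    """
    store, catalog = {}, {}
    for source, name, data in files:
        digest = sha(data)
        store.setdefault(digest, data)
        catalog[(source, name)] = digest
    return store, catalog


def global_dedupe(store, chunk_size=4):
    """Step 4 sketch (post-process): keep each distinct chunk only once.

    Fixed-size chunking is an assumption made for this example.
    Returns (chunks, recipes): recipes lists the chunk digests per file.
    """
    chunks, recipes = {}, {}
    for file_digest, data in store.items():
        recipe = []
        for i in range(0, len(data), chunk_size):
            piece = data[i:i + chunk_size]
            chunk_digest = sha(piece)
            chunks.setdefault(chunk_digest, piece)
            recipe.append(chunk_digest)
        recipes[file_digest] = recipe
    return chunks, recipes


def rebuild(file_digest, chunks, recipes):
    """Reassemble a file from its chunk recipe."""
    return b"".join(chunks[c] for c in recipes[file_digest])
```

Feeding the same file in from a backup job and a replication job leaves one stored instance after `single_instance`, and `global_dedupe` then removes chunks shared between different files while `rebuild` still returns the original bytes.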

The AIMstor repository automates all this as part of any policy.  The good news is that it is downloadable now, and available for Windows environments.

Agent vs Agent-Free in Data Protection, Archive, other stuff . . .

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Wed, Jul 29, 2009 @ 04:09 PM

Lots of talk out there these days about the value of agent versus agent-less approaches to deploying data protection and data management tools.

After all is said and done, I think it boils down to a simple question of value versus convenience.

Agent-less designs are convenient for deployment, say, for very basic backup capabilities. That is about where the advantage ends.  They cannot deliver much value other than what the OS and network tools allow. This becomes confining and inflexible for users with greater data management needs at the SME and enterprise levels. Agent-less designs also come with some level of painful “cost” to users (probes, scans, queries, etc., to get a look at data, and then extract it).

This is why all major vendors have deployed their solutions in agent format.  They can deliver more value, in a less painful way, with greater control over added value. We have not seen any proof of a next-generation agent-less architecture out there. OS’s and network protocols are basically the same, but there have been a few workarounds to offer more value, which is great.  Still, compared to the relatively basic value you get with even the most legacy agent-based products, it’s not much.

However, there is a very BIG problem with agents: There are simply TOO MANY of them.

Vendors issue agents to each machine for backup, for replication, for file archive, for VMware protection, for changed data tracking, for forensic monitoring, for data leak blocking, and other things. All these agents are redundant to some degree, but vendors cannot figure out how to get them to work together. Yet, while there are too many agents, the ability of agent-less solutions (yes, including the mighty cloud) to provide equivalent levels of value down to physical or virtual machines that are on-premises is simply not there.  The obvious approach is to limit agents, and unify the value of multiple agents into fewer agents (a single agent?).

Basically, create a platform that allows more and more tools to be added. If the user wants to activate them, great. If not, then just use the one or two tools you do want.  Use the others next year if you need them. The bottom line is that you have limited the number of agents proliferating throughout your network, and increased value on a per-agent basis several-fold.  And you don’t have to deal with any of the agent-free negatives we put together down below.
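One way to picture such a platform is a single agent that captures each change once and hands it to whichever tools are switched on. The sketch below is purely illustrative. `UnifiedAgent` and its methods are hypothetical names, not part of any product.

```python
class UnifiedAgent:
    """Hypothetical single agent per machine; tools are activated on demand."""

    def __init__(self):
        self._tools = {}      # tool name -> handler callable
        self._active = set()  # names of tools the user has switched on

    def register(self, name, handler):
        """Make a tool available without turning it on."""
        self._tools[name] = handler

    def activate(self, name):
        """Switch on a registered tool, now or next year."""
        if name not in self._tools:
            raise KeyError(name)
        self._active.add(name)

    def on_change(self, event):
        """Fan one captured change out to every active tool."""
        return {name: self._tools[name](event) for name in sorted(self._active)}
```

The point of the design is that the data is observed once per machine. Adding archive or tracking later means activating another handler, not installing another agent.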

Agent-less approach, the bad side:

  • File transfers via the network share APIs are much much slower than through an agent designed for that purpose.
  • An agent-less solution consumes more CPU because the methods of accessing the data go through many layers of protocols designed for general-purpose use.
  • The systems need to be polled constantly to find out what is new.  The polling method won’t be nice: it will require their server to basically log on, scan the logs, etc., and then move on.
  • Because the network interface only lets you know which file has changed, you can only get the whole file.  This makes it useless for any database application or email server, and frankly it isn’t very good for such things as PST files, which are regularly 1-2GB in size.
  • This only works when you are connected.  Data isn’t journalled, so there is no tracking or versioning when you are disconnected.
  • You won’t be able to compress the network traffic.
  • Sometimes vendors mask the fact that “transient agents” or “security code” must be deployed to clients, which can be a similar hassle to real agents to deploy / manage, but with far less value.
  • If someone changes a password, services can be interrupted.
  • For Virtual Machine protection, the agent-less polls and scans will multiply the pain and workload on the physical systems, multiplied by the number of VMs, where I/O issues already exist.
  • Some agent-less vendors market CDP. Truthfully, it is a poll on the file journal logs; it’s not real time.  It’s not really CDP.
  • For data leakage, detailed monitoring, and content security, the problems multiply by a huge amount. How do you deliver any level of compliance without real-time understanding of granular data changes?
  • Too many issues to name without writing a book.
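The whole-file problem from the list above is easy to see in a small model. This sketch assumes a polling scanner that only sees each file's modification time and size, and so has to fetch every changed file in full. The file names, sizes, and timestamps are invented for illustration.

```python
def poll_changes(previous, current):
    """Compare two scans; each maps a file name to (mtime, size_in_bytes).

    Returns the names of files that are new or whose metadata changed.
    The scan cannot tell which bytes inside a file changed.
    """
    return sorted(name for name, meta in current.items()
                  if previous.get(name) != meta)


def whole_file_transfer_bytes(changed, current):
    """Bytes an agent-less poller must move: the full size of each changed file."""
    return sum(current[name][1] for name in changed)


# Hypothetical scans, one polling interval apart.
before = {"mail.pst": (100, 2_000_000_000), "notes.txt": (100, 500)}
after = {"mail.pst": (160, 2_000_000_100), "notes.txt": (100, 500)}

changed = poll_changes(before, after)
polled_bytes = whole_file_transfer_bytes(changed, after)
```

Here one small message appended to a 2GB PST file forces the poller to move the whole file again, where an agent that captures changed bytes would send only the appended data.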