Posts Tagged ‘disaster recovery’

Sticky Policies & Data Classifications

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 28, 2010 @ 12:02 AM

If there is anything ominously absent from the legacy backup products of today (and yesterday), it would be sticky policies that follow data wherever it goes, and auto-data classification that occurs when a file is created.

The capabilities to serve these needs are so obviously missing from the old backup paradigms (still with us today, mind you), that 2 things are starting to happen:

1) People know this stuff is missing. They feel it in their bones. They also feel it when they need to look at piles and piles of data, and want to somehow make sense of it. But they can’t. They also want protection to happen the moment it is needed, with a simple policy.  Not when a backup product tells it to.

2) Vendors know this stuff is missing as well. Many of them operate in the world of block and volume data, and simply have no chance to manage information while they are backing up or replicating blocks of bits, instead of information.  Others have no way to manage metadata intelligently or actively. So they try to “market” their way around it.

The problem is impossible to fix within today’s legacy backup infrastructures. And it’s not going away. It will simply grow, exponentially, unless you start getting after it.

AIMstor from Cofio was created with policy based data management, and classification of data in mind.

Giving users control of what they want to do with data (Live Backup, CDP, Real-Time Replication, Tracking, Archive, etc.) is one thing. Giving users the ability to do it intelligently, such as with workflow and data flow, is something else altogether.

CDP is a Dog -> Unless it’s UNIFIED with Backup

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Thu, Jan 14, 2010 @ 03:32 PM

It’s true.  CDP is tremendous technology, offering granular point-in-time restore that backups simply cannot do. But CDP (Continuous Data Protection) has severe retention and data management limitations, so backup is absolutely necessary.

But – why do CDP if you cannot get it FULLY UNIFIED with your backup solution??  I don’t mean “integrated”.  Any moron can “integrate” a CDP product with their legacy backup product (and, many have, mind you). You just tell the people in marketing to make the box look the same, and update the user manual.

THE TRUTH: CDP is ONLY worth doing, if it comes FULLY UNIFIED with a next-generation backup solution (optimally, with inherent deduplication). That way, they share the same data mover, the same repository, the same metadata, the same underlying data structure and supporting infrastructure.

I don’t know any other product besides Cofio’s AIMstor that does this. You get granularity of CDP, with smart retention flexibility of AIMstor’s next-gen Backup, and all the great policy driven capabilities that come together with it. You can also empower Bare Metal Restore from your backup and CDP sets, which are fully single-instanced for huge capacity savings.

More importantly, because AIMstor auto-classifies data, you can SELECT what you want to CDP, and what you want to Backup, and what retention you want for very specific types of data, or whole categories of data. Standalone CDP  products are kinda, well, dumb. They like to move . . . everything. Optimal? Uhm, not.
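To make the idea of selecting protection by data class concrete, here is a minimal sketch. It is not AIMstor's implementation or API; the suffix rules, class names, policy names and retention values are all hypothetical, chosen only to show a classification driving the choice between CDP, backup, or nothing at all.

```python
from dataclasses import dataclass
from pathlib import PurePath

# Hypothetical classification rules: file suffix -> data class.
CLASS_BY_SUFFIX = {
    ".mdf": "database",
    ".pst": "email",
    ".docx": "office",
    ".tmp": "scratch",
}


@dataclass(frozen=True)
class Policy:
    method: str          # "cdp", "backup", or "skip"
    retention_days: int  # how long protected copies are kept


# Hypothetical policies per data class.
POLICY_BY_CLASS = {
    "database": Policy("cdp", 30),
    "email": Policy("cdp", 90),
    "office": Policy("backup", 365),
    "scratch": Policy("skip", 0),
}

# Applied to anything the rules above do not recognise.
DEFAULT_POLICY = Policy("backup", 30)


def classify(path: str) -> str:
    """Return the data class for a file, based only on its suffix."""
    return CLASS_BY_SUFFIX.get(PurePath(path).suffix.lower(), "unclassified")


def policy_for(path: str) -> Policy:
    """Pick the protection policy that follows from the file's class."""
    return POLICY_BY_CLASS.get(classify(path), DEFAULT_POLICY)
```

With rules like these, a database file gets CDP, an office document gets ordinary backup with long retention, and scratch files are never moved at all.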

So what happens if you buy CDP that is NOT unified with your backup solution? Triple the data movers, double the repository setup and capacity usage, double the overhead to servers and clients, double the admin time, double the infrastructure. Plus, you probably can’t select what you really want, so you will just end up wasting even more resources.  Why do it?

The Legacy Backup Bubble (Part II)

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Tue, Jan 12, 2010 @ 05:14 PM

Legacy Backup is a major market in the data protection space, and is still going strong. Regardless of its inefficiencies, people still buy it, and add onto their existing Legacy Backup environment. However, users are starting to take notice.

User backup forums often point to the failure of Legacy Backup products to deliver any upstream value, to their typical failure rates as a result of server-dependent architectures, and to their terrible storage inefficiency.

In addition, many environmental factors have crept into the woodwork at user sites (business intelligence, eDiscovery needs, compliance requirements, etc.), and now that the paint is off, people are finally getting a look at what’s underneath the hood of Legacy Backup products. It won’t be long.

Deduplication was a key first mover that really made people question the insanity of Legacy Backup. Why create something so inherently inefficient that it required such a huge level of clean-up? (remember, 20X or greater is the typical deduplication cleanup rate).

Cloud architectures will soon expose even more inadequacies in the Legacy Backup camp, forcing many vendors to accommodate Cloud storage in strange, non-optimal ways.

Virtual machine sprawl has added more headaches to the Legacy Backup camp because of I/O and overhead issues created by Legacy Backup, and multiplied by VMs.

Additionally, users are becoming more reliant on other tools within the market to make up for the lack of flexible recovery capability of Legacy Backup. CDP, Replication, Bare Metal Restore, and others, are coming into play in the mid-market.  As are technologies that help manage information: index/search tools, data classification, policy management, and tools that control data for added layers of security or monitoring.

There are many others, but these ones stick out. When things be

The Legacy Backup Bubble (Part I)

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Tue, Jan 05, 2010 @ 07:10 PM

The terrible inefficiency of Legacy Backup has created new markets and new companies over the past decade in the storage backup space.  Many are fixes applied to Legacy Backup itself, many others are another form of Legacy Backup, that solve some issues for a key market or vertical. Many have been proven to solve real world problems, caused, of course, by Legacy Backup.

So, what is Legacy Backup?  You are probably using it right now in your data center, your remote office, or your SMB, and most certainly, in your enterprise.  It’s a product that protects your data by doing several things based on a schedule, then sends a copy of some processed data to disk or tape. Unfortunately, it batch copies data, creates massive and unnecessary duplication of data, and has no ability to share its repository, its processes, policies, metadata, data movement, or any of its significant infrastructure with other data protection products (like CDP, Replication, Archive, etc.).

The great thing about inefficiency is that it creates need.  And where there is need, there is opportunity. But the reason for the need, it is now being learned, is that Legacy Backup is the problem.  Like any boom or bubble, Legacy Backup will . . . ultimately . . . pop.

How Underdogs Win: Real-Time versus Batch Data Protection

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Wed, Dec 02, 2009 @ 06:02 PM

The New Yorker has a great read for anyone considering the strategic aspects of real-time versus batch processes (from databases to running a girls’ basketball team) in this article titled “How David Beats Goliath: When underdogs break the rules”.

In the world of storage and information management and protection, the parallels to current legacy point products are impressive. Today’s leading backup products reside completely upon legacy architectures.  They are still, by and large, run as batch processes, are not searchable, do not provide real-time differencing, and have no real-time capability to tie into other data movement or data management capabilities. You could say many of the same things about many other tools used within the IT dept.

It would be nice to turn a key, and make it all real-time, but that won’t happen. Fundamentally, it requires changes in the way systems, physical or Virtual Machines, are managed, and how responsibilities are distributed (if they are).  The legacy client/server approaches rely completely on outdated policy distribution communications (batch), where connectivity must remain intact to execute their “server -to- slave server -to- media server -to- client” laundry list of batch “stuff to do”.  They need a lot of hand holding in order for things to happen, and for policies to be executed. A short list of issues with legacy products:

o- Batch methods require scans, trawls, polls, etc., all of which drag down resources

o- Batch I/O stacks up fast on VMs, and goes medieval on their host systems

o- Data changes can be discerned, but data touches cannot be tracked

o- Data classification, if any, is after the fact, instead of at “point of creation”

o- Compliance is via “batch” time slices, not real world “second-by-second” views

o- Metadata consistency is always a day late and a dollar short

o- Repository data always has a “window” of difference with primary data

o- Deduplication remains after the fact and separate
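The “window” of difference in the list above can be put into numbers with a small sketch. It assumes a plain fixed-interval batch schedule, and the change times are made up; it only shows how long a change sits unprotected until the next scheduled run picks it up.

```python
import math


def batch_lag(change_time: float, interval: float) -> float:
    """Hours a change waits until the next scheduled batch run picks it up."""
    next_run = math.ceil(change_time / interval) * interval
    return next_run - change_time


# Hypothetical change times, in hours since the last nightly run.
changes = [0.5, 9.0, 23.5]
lags = [batch_lag(t, 24.0) for t in changes]
print(lags)  # prints [23.5, 15.0, 0.5]
```

A file changed just after the nightly run waits almost a full day in this model, which is the gap a real-time approach is meant to close.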

If users want to explore the road of real-time, they will need to seek new solutions that are outside of the realm of their current vendor portfolio, because vendor leaders  just have too much invested in existing legacy code bases. New architectures, which provide self-managing nodes, together with scalable and distributed storage, are the key to deploying more value across the enterprise, on a granular, simple and cost effective basis.  . . . Did I mention . . uhm . . . AIMstor?

Don’t Shop with Barbarians: More on Volume CDP/Replication

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Fri, Nov 27, 2009 @ 05:21 PM

In response to our CTO’s last blog (Why Volume CDP and Replication products are so Wasteful), and because it is Black Friday, a day with a serious shopping theme, we had a few comments out there from a few Volume Replication Vendors, so I thought I would answer them here and keep things better organized:

Comment 1:  I guess it depends on what the volume contains. And the purpose of doing it in the first place. Certainly replicating or CDPing a system volume doesn’t seem to make much sense unless the reason for it is Disaster Recovery at a remote site. But replicating or CDPing a volume that only contains business critical data could be meaningful particularly in compliant heavy environments. Were the donkeys nodding?

Answer 1: There are some noisy applications.  In this particular example it was Sophos anti-virus.   But the OS can do very well all on its own too. OS vendors even call out noisy directories that should be avoided during backups, because there is no value in a restore, and there is an obvious cost to replicate it.  It is also not untypical for an application to want to create temporary files that have no business value on the same volume as the database. You want to replicate all that too?

With Volume CDP grabbing it all, that extra 40GB-50GB per machine gets expensive.  Multiply that by a large number of machines, and the overhead is very large.  Plus, the extra CPU, energy, and bandwidth sending wasteful and unneeded data is another big cost that adds up quickly, and then goes exponential once you consider the enterprise.  That is the essence of the problem with Volume CDP and Replication. It is indiscriminate by nature and grabs everything.
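To put a rough number on that multiplication, here is a back-of-the-envelope sketch. The 40GB-50GB per machine figure comes from the example above; the fleet size of 500 machines is purely an assumption for illustration.

```python
def wasted_gb(machines: int, gb_per_machine: float) -> float:
    """Total unneeded data captured across all machines, in GB."""
    return machines * gb_per_machine


MACHINES = 500  # hypothetical fleet size, not a figure from the post

low = wasted_gb(MACHINES, 40)
high = wasted_gb(MACHINES, 50)
print(f"{low / 1000:.0f}-{high / 1000:.0f} TB of unneeded data")
# prints "20-25 TB of unneeded data"
```

That figure covers capacity only; the CPU, energy and bandwidth spent moving the same data come on top of it.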

Kind of like a starving barbarian with a big shopping cart at the grocers on double-coupon day, she can’t even resist taking the trash with her.

The donkeys weren’t nodding, but they were chuckling.

Comment 2: Couple of things that are puzzling to me in your blog are the fact that there were 40GB of wasted capacity in a single server during 1 week? That would certainly not be the norm and if it was there would be other useful conversations to have with a client.  As for CDP being intelligent enough to distinguish useful data. Great idea and most enterprise CDP solutions will have this ability now or in the near future. Even more important when considering replication is evaluating solutions that will compare data on the local and remote site and deduplicate before replicating the changes across the wire. We have customer examples that were able to shave 70%+ off replicated data!

Answer 2: We are simply saying that it’s a good idea to avoid sending all that unneeded data, in the name of simple logic, speed and efficiency.  The only effective way of combating this is by understanding the data (which is what AIMstor has solved).

I’d be interested in seeing how the Volume replication vendors address this.  I suggest that they can’t.

Volume replication arguments have generally been that the “customer” ought to reconfigure their system to suit the replication technology’s inability to address data types or data classifications.  Have a volume for one thing, another volume for another, etc.   While it certainly may make sense to partition your system, the point is, the customer shouldn’t be forced to because of the failings of the CDP product. Let the customer partition storage based on what makes sense to his application, not because of the inability of the volume CDP product.

The fact is also, CDP shouldn’t just be for the application.  Why shouldn’t it be used for the system volume as it provides a good DR image as well?  Or something even more radical, why not provide a hybrid: periodic transfers of parts of the system but CDP granularity of other parts of the system.  Imagine you have a volume that is both the OS and the application (OK example normally for smaller setups), you could take periodic images of the OS, but then CDP the application data.  This will minimize data transmitted and provide very nice and granular application restore, with a safe set of periodic images of the OS. You also get big overhead reduction, plus, savings on CPU, energy, bandwidth, etc.

Bringing up the de-duplication topic is interesting too.  Understanding the data you are de-duplicating, as we do, substantially increases the de-duplication rates.  That is also why Data Domain excels: it distinguishes the data boundaries and doesn’t treat everything as a dumb block.   It would be good to know how much of that 70% replicated data savings you mention was just white space elimination, which should have never been transferred in the first place.  If so, I am puzzled, because that approach, which is typical among all Volume-approach vendors, seems to be making a mistake, and then the vendor congratulates himself for later correcting his mistakes.

And that’s supposed to be a “solution”?

The 4-Step Dedupe for Backup

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Sat, Aug 15, 2009 @ 08:36 PM

So, we all know now that the old legacy backup solutions create HUGE waste and require deduplication appliances. But now, for users considering upgrading their environments for “intelligent backup”, DR, replication, or archive, or simply to herd the cats of unstructured data, there seems to be confusion among users about the issues of:

-Source Deduplication (doing the dedupe at primary data), versus

-Target Deduplication (doing the dedupe at the repository).

A number of articles and postings have been put out there, but it typically comes down to the same question: What is best, doing it at Source or Target?

We asked the same question. Then we said, heck, why not BOTH?

This is why AIMstor was originally architected to provide 4-step Dedupe.  Source level dedupe via 2 methods, and Target level dedupe via 2 other separate methods:

Source Side Dedupe:
Step 1-Duplicate Transfer Avoidance: At the initial sync between node and repository, if the repository has the data already (from a previous node), it tells the node to transfer only “new” data. Saves a lot of time, network bandwidth, and initial repository capacity is minimized.

Step 2-Real-Time Changed Byte Transfer: At the same time, with subsequent Backups, CDP, Replication and Versions, AIMstor will only transfer changed bytes from the node. That reduces network traffic and load on the node. Because AIMstor is real-time, there is no scan or trawl.  So data constantly trickles from the node when it changes, and hits the repository according to the RPO settings for the backup.
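As an illustration of the Step 1 idea (and only that), here is a minimal sketch of duplicate transfer avoidance. It assumes content is identified by a SHA-256 digest and compared at whole-file level. How AIMstor actually identifies data is not described here, and the class and function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to recognise identical data."""
    return hashlib.sha256(data).hexdigest()


class Repository:
    """Toy repository that keeps each unique piece of content once."""

    def __init__(self) -> None:
        self._store: dict[str, bytes] = {}

    def missing(self, digests: list[str]) -> set[str]:
        """Tell a node which of its digests the repository lacks."""
        return {d for d in digests if d not in self._store}

    def put(self, data: bytes) -> None:
        self._store[digest(data)] = data


def initial_sync(node_files: dict[str, bytes], repo: Repository) -> int:
    """Send only content the repository does not hold; return bytes sent."""
    by_digest = {digest(data): data for data in node_files.values()}
    wanted = repo.missing(list(by_digest))
    sent = 0
    for d in wanted:
        repo.put(by_digest[d])
        sent += len(by_digest[d])
    return sent
```

In this sketch, a second node whose files already exist in the repository sends only digests for them, so the initial sync moves just the content nobody has supplied before.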

Target Dedupe:
Step 3-Multi-Level Single Instance Storage: Because the AIMstor repository is unified across Backups, Replication, CDP and Versions, it allows only a single instance of a file, no matter where it came from.

Step 4-Global Object Deduplication: Also in the repository, AIMstor runs a final step post processing deduplication algorithm across all data sets from all machines. This finalizes the deduplication with four complete steps to best reduce the total amount of data capacity used in repositories.

The AIMstor repository automates all this as part of any policy.  The good news is that it is downloadable now, and available for Windows environments.
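The single-instance idea behind Step 3 can be sketched as a content-addressed store with a catalog on top. This is a toy model under assumed names, not the AIMstor repository format, and it leaves out the post-processing pass of Step 4 entirely.

```python
import hashlib


class SingleInstanceStore:
    """Toy repository: many (machine, path) entries, one copy per content."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}            # digest -> content
        self.catalog: dict[tuple[str, str], str] = {}  # (machine, path) -> digest

    def add(self, machine: str, path: str, data: bytes) -> None:
        key = hashlib.sha256(data).hexdigest()
        self.objects.setdefault(key, data)  # content is stored only the first time
        self.catalog[(machine, path)] = key

    def logical_bytes(self) -> int:
        """Size the data would occupy with one copy per catalog entry."""
        return sum(len(self.objects[k]) for k in self.catalog.values())

    def stored_bytes(self) -> int:
        """Size actually held after single-instancing."""
        return sum(len(v) for v in self.objects.values())
```

Comparing `logical_bytes()` with `stored_bytes()` gives the capacity saved when the same file arrives from several machines or from several protection methods.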

Agent vs Agent-Free in Data Protection, Archive, other stuff . . .

Friday, March 19th, 2010
  • Originally Posted by Tony Cerqueira on Wed, Jul 29, 2009 @ 04:09 PM

Lots of talk out there these days about the value of agent versus agent-less approaches to deploying data protection and data management tools.

After all is said and done, I think it boils down to a simple question of value versus convenience.

Agent-less designs are convenient for deployment, say, for very basic backup capabilities. That is about where the advantage ends.  They cannot deliver much value other than what the OS and network tools allow. This becomes confining and inflexible to users with greater data management needs at the SME and enterprise levels. Agent-less designs also come with some level of painful “cost” to users (probes, scans, queries, etc, to get a look at data, and then extract it).

This is why all major vendors have deployed their solution in agent format.  They can deliver more value, in a less painful way, with greater control on added value. We have not seen any proof of a next-generation agent-less architecture out there. OS’s and network protocols are basically the same, but there have been a few workarounds to offer more value, which is great.  Still, compared to the relatively basic value you get with even the most legacy agent based products, it’s not much.

However, there is a very BIG problem with agents: There are simply TOO MANY of them.

Vendors issue agents to each machine for backup, for replication, for file archive, for VMware protection, for changed data tracking, for forensic monitoring, for data leak blocking, and other things. All these agents are redundant to some degree, but vendors cannot figure out how to get them to work together. Yet, while there are too many agents, the ability of agent-less solutions (yes, including the mighty cloud) to provide equivalent levels of value down to physical or virtual machines that are on-premises is simply not there.  The obvious approach is to limit agents, and unify the value of multiple agents, into fewer agents (a single agent?)

Basically, create a platform that allows more and more tools to be added. If the user wants to activate them, great. If not, then just use the one or two tools you do want.  Use the others next year if you need them. The bottom line is that you have limited the number of agents proliferating throughout your network, and increased value on a per agent basis several-fold.  And, you don’t have to deal with any of the agent-free negatives we put together down below.

Agent-less approach, the bad side:

  • File transfers via the network share APIs are much much slower than through an agent designed for that purpose.
  • An agent-less solution consumes more CPU because the methods of accessing the data have many layers of protocols designed for general purpose.
  • The system needs to be polled constantly to find out what is new.  The polling method won’t be nice, it will require their server to basically log on, scan the logs etc and then move on.
  • Because the network interface lets you know what file has changed only, you can only get the whole file.  This makes it useless for any database application or email server, and frankly it isn’t very good for such things as PST files which are regularly 1-2GB in size.
  • This only works when you are connected.  Data isn’t journalled so there is no tracking or versioning when you are disconnected.
  • You won’t be able to compress the network traffic.
  • Sometimes vendors mask the fact that “transient agents” or “security code” must be deployed to clients, which can also be a similar hassle to real agents to deploy / manage, but with far less value.
  • If someone changes a password, services can be interrupted.
  • For Virtual Machine protection, the agent-less polls and scans will multiply the pain and workload on the physical systems, multiplied by the number of VMs, where I/O issues already exist.
  • Some agent-less vendors market CDP. Truthfully, it is a poll on the file journal logs, it’s not real time.  It’s not really CDP.
  • For data leakage, detailed monitoring, and content security, the problems multiply by a huge amount. How do you deliver any level of compliance without real-time understanding of granular data changes?
  • Too many issues to name without writing a book.
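The whole-file point in the list above is easy to quantify. The sketch below compares sending an entire file with sending only the blocks that differ. The fixed 4096-byte block size and the helper names are assumptions, and the comparison is only there to measure the difference: it does not show how an agent detects changes, and it ignores the case where the file shrinks.

```python
def changed_blocks(old: bytes, new: bytes, block_size: int = 4096):
    """Yield (offset, block) for each block of `new` that differs from `old`."""
    for offset in range(0, len(new), block_size):
        block = new[offset:offset + block_size]
        if block != old[offset:offset + block_size]:
            yield offset, block


def transfer_sizes(old: bytes, new: bytes, block_size: int = 4096):
    """Return (whole_file_bytes, changed_block_bytes) for one update."""
    delta = sum(len(block) for _, block in changed_blocks(old, new, block_size))
    return len(new), delta
```

For a 1,000,000-byte file with a single byte changed, `transfer_sizes` reports 1,000,000 bytes for the whole-file route against 4,096 bytes for the changed block, and the gap only widens for a 1-2GB PST file.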