Recently, we experienced a fairly catastrophic SAN failure: we lost two drives of a RAID-5 array. Needless to say, recovery was time-consuming, but it also exposed some general issues with the disaster recovery, business continuity, and overall architectures of many virtual environments. Luckily, we were able to restart one of the failed drives, let the hot-spare take over for the second failure, and recover the vast majority of our data. There was still corruption, though, and that is where our backups, and the dependencies required to restore from them, came in. How do you recover from a catastrophic failure? Do you fail over automatically to a hot site or cloud environment? Even if you fail over, how do you recover from the failure afterward?
Here are some of the problems we faced on recovery:

  • Backups were incomplete; we backed up our critical data, or more to the point, what we thought was our critical data
  • Our Virtual Storage Appliance was dependent on our hardware storage device
  • The system dependencies for restore were greater than expected

Recovery Dependencies

This was an eye-opener. We thought our recovery was pretty solid, but when we went to use our recovery tool, we found that in order to access the database and filesystem used for recovery, we had a dependency on another subsystem, Active Directory (AD). Without Active Directory, none of our backups were easily accessible from our recovery system. We had backed up Active Directory as a critical system, and it was the first thing we needed to restore; but to restore it, we needed the recovery console. This was a catch-22. The solution was to rebuild our recovery engine without joining a domain and then import our backups into it. This was not the only dependency we found, but it raises the question, “Do we know all the dependencies for the tens of thousands of systems we back up?” And this is where modern backup tools have issues.
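To make the catch-22 concrete: once restore dependencies are written down as a simple graph, a loop like ours can be found automatically. The sketch below is purely illustrative; the system names, the edges, and the helper function are all hypothetical and are not taken from any real backup product.

  # Minimal sketch: detecting a restore-order catch-22 in a dependency graph.
  # The systems and edges are hypothetical, chosen to mirror the AD /
  # recovery-console loop described above; no real backup-tool API is used.
  # "X depends on Y" means Y must be restored before X can be restored.
  depends_on = {
      "recovery-console": ["active-directory"],  # console needs AD to authenticate
      "active-directory": ["recovery-console"],  # but AD restore needs the console
      "file-server":      ["active-directory"],
      "database":         ["active-directory", "file-server"],
  }

  def find_cycle(graph):
      """Return one dependency cycle as a list of systems, or None."""
      visiting, visited = set(), set()

      def dfs(node, path):
          visiting.add(node)
          path.append(node)
          for dep in graph.get(node, []):
              if dep in visiting:               # back edge: a catch-22
                  return path[path.index(dep):] + [dep]
              if dep not in visited:
                  cycle = dfs(dep, path)
                  if cycle:
                      return cycle
          visiting.discard(node)
          visited.add(node)
          path.pop()
          return None

      for node in graph:
          if node not in visited:
              cycle = dfs(node, [])
              if cycle:
                  return cycle
      return None

  cycle = find_cycle(depends_on)
  if cycle:
      print("Catch-22 found, break it by hand first:", " -> ".join(cycle))

In our case, breaking the loop meant rebuilding the recovery engine outside the domain, which removed its edge back to Active Directory.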
Today’s backup tools require you to enter every system to be backed up yourself; in other words, you need to know your dependencies when you create your backup jobs. For small environments, this may be possible; however, in large environments there are different dependencies for different people, and the backup administrator really does not know them all. Thus, if one is missed, it is generally due to lack of knowledge, not lack of capability. As an example, we have the following people involved in setting backup policy for an application:

  • The Application Owner (usually the business owner) desires a backup of an application. This person does not truly know what is involved in backing up or restoring the application; he just needs a recovery scenario based on some corporate policy.
  • The Application Developer lets the backup administrator know that a specific application needs to be backed up. The developer knows his or her code, not the system, and as such has a Java implementation that needs to be backed up. The developer may not even know how many systems this code spans.
  • The Virtualization Administrator lets the backup administrator know that a set of VMs with either a specific naming convention, resource pool, subnet, or the like needs to be backed up. The virtualization administrator does not really know what applications are within the virtual machines.

Yet, everyone listed is doing what is required to meet some corporate policy. Perhaps they mention a few dependencies but miss some of the rather large ones, such as Active Directory, recovery tools, and changes made since the backup administrator first created the backup job.
Given all this, it is possible for an application to be insufficiently backed up, leading to poor recovery times. There are just too many humans involved, and the lag from one to the other can be a huge issue within a complex system. This is a perfect opportunity for automation, but where should the automation be performed? In development, at the virtualization layer, or as part of backup and recovery testing?

Finding the Dependencies

What we need is a set of tools that will automatically map out the backup dependencies, tell us what is missing from our backups, and suggest how to proceed, with automatically generated recovery plans that can be stored somewhere safe (perhaps in a physical safe, off site, etc.) so that we can perform a catastrophic disaster recovery. Yet this type of analytics seems to be missing from backup, recovery, and disaster recovery today. Actually, other than pass/fail status and backup speeds, there is a dearth of analytics applied to virtualization, cloud, and all forms of backup today. We began this discussion in our Utopian Disaster Recovery article, but it needs some expansion.
Specifically, we need not only to be application aware, but also to track every dependency so that we know in which order to properly restore our systems, in which order to boot our systems, and how each system is dependent on others. Many application performance management tools, such as VMware Virtual Infrastructure Navigator, can get us some of this data. However, we need this data not just for the virtual environments, but for our cloud environments as well. A complete picture is required.
Furthermore, we need a method, preferably automated, for starting our recovery during a catastrophic failure. Our backup tools need to create those scripts for the most critical systems so that the necessary subsequent restores can be performed. In our case, we needed AD and the recovery system restored first; then we could use our recovery system to restore all other components as necessary. You could say that it is the backup administrator’s job to know all this and make recovery simpler, but backup administrators have a tough enough job just ensuring that there are proper backups and testing them.
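To give a rough idea of what that automation could look like, here is a minimal sketch that turns a discovered dependency map into an ordered recovery plan. The systems and dependencies are hypothetical placeholders; a real tool would pull them from whatever dependency-mapping product the environment actually runs, and only the ordering step is shown.

  # Minimal sketch: turn a discovered dependency map into a restore/boot order.
  # All systems and edges below are hypothetical placeholders.
  from graphlib import TopologicalSorter  # standard library, Python 3.9+

  # "X depends on Y" means Y must be restored and booted before X.
  depends_on = {
      "recovery-engine":  [],                  # rebuilt stand-alone, no domain join
      "active-directory": ["recovery-engine"],
      "database":         ["active-directory"],
      "app-server":       ["active-directory", "database"],
      "web-frontend":     ["app-server"],
  }

  print("Recovery plan (restore and boot in this order):")
  for step, system in enumerate(TopologicalSorter(depends_on).static_order(), 1):
      print(f"  {step}. {system}")

Output like this is exactly the sort of document that should be generated at backup time and stored off line, because by the time you need it, the tool that knows the dependencies may itself be down.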
All these tasks should be automated:

  • Application and dependency mapping, so that the backup or replication tool knows what to back up or replicate instead of relying on by-hand mechanisms
  • Creation of scripts and repositories to start a recovery in a catastrophic failure, beginning with the very first dependency and recovery tools
  • Creation of documents to tell us what should be recovered and in what order
  • Reports on how well our backup and replication are doing, not as a pass/fail, but as a percentage of the entire application and set of dependencies

For now, these tasks are performed entirely by overworked humans, which means they may not be complete, or even possible.
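For the last item on that list, a coverage-style report might look something like the sketch below. The applications, their dependency sets, and the backup inventory are all made-up examples; the point is the percentage view rather than a per-job pass/fail.

  # Minimal sketch: report backup coverage per application as a percentage of
  # its mapped dependencies, rather than a pass/fail per backup job.
  # All application names, dependencies, and the backup inventory are hypothetical.
  applications = {
      "payroll":  {"web-frontend", "app-server", "database", "active-directory"},
      "intranet": {"web-frontend", "file-server", "active-directory"},
  }
  backed_up = {"web-frontend", "app-server", "database"}  # systems with current backups

  for app, deps in sorted(applications.items()):
      covered = deps & backed_up
      missing = ", ".join(sorted(deps - backed_up)) or "none"
      pct = 100 * len(covered) / len(deps)
      print(f"{app}: {pct:.0f}% of dependencies backed up (missing: {missing})")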

Closing Thoughts

It is time to bring backup and replication tools into the modern age. We should apply some level of automation and analytics to backup and replication, at a minimum to determine each application’s or system’s dependencies. This will speed up recovery, and it will allow our backups and replication to be application-centric as well as able to handle continually changing situations.

5 replies on “Recovery Lessons Learned from Storage Failure”

  1. This is an example of everything this infographic says:
    http://bit.ly/1pkao0O
    It looks at the cloud skills gap, but it is completely applicable to storage issues, too. In-house IT simply doesn’t always have the time or expertise to handle issues like a failure. I think it’s a good argument for bringing in at least some level of IT-as-a-Service to augment what you’re doing in house.
    –KB

    1. Hello Karen,
      I agree, there is a lack of knowledge about many physical aspects of a virtual/cloud environment. Many companies, large and small, no longer have intimate knowledge of their hardware and as such could not actually fix this problem without a call to support or a consultant. Moving to ITaaS may not solve that problem either, granted ITaaS would make results reproducible. However, moving to a cloud-based service could assist if you can afford the cost. Even with this type of failure, how would you recover your systems into a cloud? That requires a bit more planning and forethought. Even with replication to a cloud, there is a lack of getting all the necessary dependencies to even run within a cloud.
      Best regards,
      Edward Haletky

  2. One can’t count on a recovery process until it has been tested with an exercise. Doing that will bring out the dependencies.

      So very true. This also depends on people restoring to a working application and not just a bootable machine. If they do application testing, they will need to backfill all the appropriate controls into whatever backup tool is in use. I am not sure that a) backup testing of this nature is done by enough companies, or b) even if they find a dependency, it works its way back into the configuration for the next backup. All of these items (backup, testing, dependency finding, updating the backup configuration from the results) can and should be automated. Humans should just be the approvers of any such changes.
      Best regards,
      Edward L. Haletky

  3. “…there is a lack of getting all the necessary dependencies to even run within a cloud.”
    Good point, especially if you’re using a variety of cloud-based services that utilize the same on-premises data.
    –KB
