At The Virtualization Practice, we have systems running in the cloud as well as on-premises. We run a 100% virtualized environment, with plenty of data protection, backup, and recovery options. These are all stitched together using one architecture: an architecture developed through painful personal experiences. We just had an interesting failure—nothing catastrophic, but it could have been, without the proper mindset and architecture around data protection. Data protection these days does not just mean backup and recovery, but also prevention and redundancy. 
Coming up with a proper architecture that affords a good level of redundancy requires thinking outside the box: really looking at what could go wrong. Some of the things we have seen go wrong include:

  • bad computing hardware, such as arrays, drives, computers, or blades
  • bad networking hardware, such as switches or cables
  • bad infrastructure, such as electricity or cooling

I have seen disasters happen in each of these areas due to circumstances including:

  • lightning strikes
  • array failures
  • compute system failures
  • cooling issues

For any architecture, it is important to look at the entire system to ensure that there is enough redundancy and basic capability to run the most important applications in the most adverse conditions. This requires first that you realize what those most important applications are, how they integrate, and what their ultimate dependencies are. Once you know this, you can work out a reliable architecture for data protection—a plan to handle all cases.
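One way to make "what are their ultimate dependencies" concrete is to treat the must-run set as the transitive closure of the business-critical applications over their dependency graph. The sketch below is a minimal illustration of that idea; the application names and the dependency map are hypothetical examples, not a recommendation.

```python
# Minimal sketch: derive the "must-run" VM set from application dependencies.
# The application names and dependency map below are hypothetical examples.

deps = {
    "web-frontend": ["app-server"],
    "app-server": ["database", "auth"],
    "auth": ["database"],
    "database": [],
    "mail": ["auth"],
    "build-server": [],          # not business critical
}

critical_apps = {"web-frontend", "mail"}   # what the business says must stay up

def must_run(apps, deps):
    """Walk the dependency graph and return every app the critical set relies on."""
    needed, stack = set(), list(apps)
    while stack:
        app = stack.pop()
        if app not in needed:
            needed.add(app)
            stack.extend(deps.get(app, []))
    return needed

print(sorted(must_run(critical_apps, deps)))
# ['app-server', 'auth', 'database', 'mail', 'web-frontend']
```

Whatever lands in that closure, plus the supporting infrastructure it relies on, is the reduced set you plan around.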
So, if we normally run two hundred virtual machines, our reduced set could be twenty or more virtual machines. These twenty or more would need to run somewhere if there is a failure. Let us look at a few cases:

Case 1: Local Systems (perhaps just a single system)

In this case, we have to run our reduced set of virtual machines somewhere, and we have chosen a local system or set of systems for this purpose. Since we may have reduced power, networking, or cooling, using a large disk array may be an issue. Thus, locally attached storage may be best. We need enough storage that is readily available, or can easily be made available, to store and run the VMs. Yet that storage also needs protection: perhaps the ability to mirror data between a set of arrays during normal operation, such as with HP StoreVirtual virtual storage appliances or other VSAs. These work quite well, presenting as iSCSI or NFS to your virtualization hosts over local networking, and they have built-in replication. If you go the VSA route, you can use Storage vMotion and similar functionality to move VMs from one storage unit to another. Such storage could also be used for other purposes.
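If the VSA presents its storage as just another datastore, moving a must-have VM onto it is an ordinary storage relocation that can be scripted ahead of time. Below is a minimal sketch using pyVmomi; the vCenter address, credentials, VM name, and datastore name are placeholders, and error handling is omitted.

```python
# Minimal sketch: Storage vMotion a VM onto a VSA-backed datastore with pyVmomi.
# Host, credentials, VM name, and datastore name are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVim.task import WaitForTask
from pyVmomi import vim

def find_by_name(content, vimtype, name):
    """Return the first managed object of the given type with a matching name."""
    view = content.viewManager.CreateContainerView(content.rootFolder, [vimtype], True)
    try:
        for obj in view.view:
            if obj.name == name:
                return obj
        raise LookupError(f"{name!r} not found")
    finally:
        view.Destroy()

ctx = ssl._create_unverified_context()  # lab only; validate certificates in production
si = SmartConnect(host="vcenter.example.local", user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    vm = find_by_name(content, vim.VirtualMachine, "critical-db-01")
    ds = find_by_name(content, vim.Datastore, "vsa-local-ds01")
    spec = vim.vm.RelocateSpec(datastore=ds)     # move the VM's disks to the VSA datastore
    WaitForTask(vm.RelocateVM_Task(spec))
finally:
    Disconnect(si)
```

The same relocation, run in the other direction once normal operations resume, is the "back again" half of the plan.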
Alternatively, you could set up replication so that data is always ready on a secondary storage device attached to the machines. Tools like Veeam Backup & Replication, Zerto, VMware SRM, and others provide this functionality. However, you must have enough storage to hold everything, and that storage could not really be used for other purposes: it would need to be reserved for use in case of failure.
In either case, you should think through your storage requirements for your must-have VMs and determine how to migrate them or start them up on that storage during planned or unavoidable outages. The major costs and planning should be geared toward ensuring that you have enough local storage so that you can start those most-important VMs. Since the redundant storage resides within your own data center, your existing policies and controls should apply.
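A rough capacity check helps here: sum the provisioned disks of the must-have VMs and add headroom for snapshots, replica deltas, and near-term growth. The VM sizes and headroom factors below are illustrative assumptions, not recommendations.

```python
# Minimal sketch: size the reserved local storage for the must-have VMs.
# The VM sizes and headroom factors are illustrative assumptions.

must_have_vms_gb = {          # provisioned disk per critical VM, in GB
    "critical-db-01": 500,
    "app-server-01": 120,
    "auth-01": 60,
    "web-frontend-01": 80,
}

snapshot_headroom = 0.25      # room for snapshots and replica deltas during the outage
growth_headroom = 0.15        # near-term growth so the plan does not age out immediately

base_gb = sum(must_have_vms_gb.values())
required_gb = base_gb * (1 + snapshot_headroom + growth_headroom)
print(f"Reserve at least {required_gb:.0f} GB of local storage ({base_gb} GB provisioned)")
# Reserve at least 1064 GB of local storage (760 GB provisioned)
```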

Case 2: Remote Systems (perhaps into a cloud)

The second case is to do the same thing as we did with the local systems, but to have those systems available to run within the cloud with the proper levels of security and compliance. Additional security measures, such as encryption, may be necessary within the cloud. However, the key is to get the data to the cloud in some fashion. The cloud is not for immediate use, as it takes time to transmit the bits. I cannot simply vMotion to just any cloud; I need to transfer the bits beforehand and have them waiting so that I can start up my most important applications within the cloud. This could be accomplished using any of a number of forms of replication, such as those from HotLink, Veeam, Zerto, Unitrends, Quantum, or Symantec.
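It is worth making "it takes time to transmit the bits" concrete with a back-of-the-envelope estimate of the initial seeding time. The data size, link speed, and efficiency factor below are illustrative assumptions; substitute your own numbers.

```python
# Minimal sketch: estimate how long seeding the cloud copy takes.
# Data size and link speed are illustrative assumptions.

data_tb = 2.0                 # size of the must-have VM set to replicate
link_mbps = 100               # usable WAN bandwidth toward the cloud
efficiency = 0.7              # protocol overhead, competing traffic, etc.

bits = data_tb * 1e12 * 8
seconds = bits / (link_mbps * 1e6 * efficiency)
print(f"Initial seed: roughly {seconds / 3600:.0f} hours ({seconds / 86400:.1f} days)")
# Initial seed: roughly 63 hours (2.6 days)
```

Incremental replication after the initial seed is much smaller, but the first copy determines how long you wait before the cloud copy is usable at all.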
However, before the data is replicated into a cloud, it may need to be encrypted, unless the final location is encrypted. Most of the tools mentioned here have encrypted repositories into which data can go. Even so, once the data is in the cloud, the virtual machines would then need to be deployed and made available. This is not a normal part of your workflows unless you are using tools like HotLink, and it is a process that may be initiated once a year or less often for production reasons. It should, however, be exercised once a month for testing purposes.
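If the chosen tool or repository does not handle encryption for you, the data can be encrypted client-side before it leaves your site. Below is a minimal sketch using Python's cryptography library; the file names are placeholders, and in practice the hard part is key management, keeping the key recoverable without storing it next to the data.

```python
# Minimal sketch: encrypt a backup artifact client-side before it leaves for the cloud.
# File names are placeholders; key management (where this key lives) is the hard part.
from cryptography.fernet import Fernet

key = Fernet.generate_key()           # keep this in your key-management system, not beside the data
fernet = Fernet(key)

with open("critical-db-01.vbk", "rb") as src:
    ciphertext = fernet.encrypt(src.read())

with open("critical-db-01.vbk.enc", "wb") as dst:
    dst.write(ciphertext)             # only the encrypted copy is replicated to the cloud
```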
When working with the cloud, there are additional data protection concerns: costs and, as previously discussed, security. Costs go up with the number of systems you place in the cloud, and storage costs go up with the number of recovery points and systems you keep there. What those costs are depends entirely on the cloud in question.

Which Case Is for You?

I am a big fan of There and Back Again data protection. In other words, if you move your data to some place other than its original location, you should be able to get it back with minimum fuss, and that should be part of your data protection architecture. Minimally, I suggest the following:

  • Think There and Back Again for all data protection needs.
  • Consider local storage that can be used at any time for recovery needs (and perhaps other needs).
  • Consider the cloud, but first consider the costs of the cloud vs. the costs of getting new hardware.
  • If you are already using a cloud, consider how to get data to another cloud or locally (There and Back Again).
  • Consider all failure modes, even those that do not remove all capacity, but reduce capacity to a single machine or set of machines.
  • Determine what you consider to be the most important applications to keep the business running.
  • Build a feedback loop (data protection analytics) to ensure you have included all dependencies and business requirements; a minimal sketch of such a check follows this list.
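
A feedback loop does not have to be elaborate to be useful. The minimal sketch below flags must-have VMs whose most recent recovery point is missing or older than the agreed RPO; the VM names, timestamps, and RPO value are illustrative assumptions, and a real check would pull this data from your backup or replication tool.

```python
# Minimal sketch of a data protection feedback loop: flag must-have VMs whose
# last recovery point is missing or older than the agreed RPO.
# VM names, timestamps, and the RPO value are illustrative assumptions.
from datetime import datetime, timedelta, timezone

rpo = timedelta(hours=24)
now = datetime(2014, 6, 1, 12, 0, tzinfo=timezone.utc)   # fixed "now" so the example is reproducible

last_recovery_point = {
    "critical-db-01": datetime(2014, 6, 1, 2, 0, tzinfo=timezone.utc),
    "app-server-01": datetime(2014, 5, 28, 2, 0, tzinfo=timezone.utc),   # stale
    "auth-01": None,                                                      # never protected
}

for vm, ts in sorted(last_recovery_point.items()):
    if ts is None:
        print(f"{vm}: no recovery point found")
    elif now - ts > rpo:
        print(f"{vm}: recovery point is {(now - ts).days} day(s) old, exceeds RPO")
    else:
        print(f"{vm}: OK")
```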

When you sit down and ask all these questions, you will have the start of an architecture. Start with your planning and architecture, then fit the product to the architecture. If no product exists that does what you desire, then I suggest reworking the architecture.
Recently, we had a cooling outage that forced us into a reduced-capacity mode. Our data protection architecture allowed us to move VMs to local storage on a set of hosts and turn everything else off, thereby reducing our cooling needs. Could your data protection architecture survive such a failure? Can you work under reduced capacity? Have you worked out with the business which VMs are the most important?