What do you do when everything fails—not just typical application failure, but failure of more than one subsystem all at once? Recently, I had such a failure. The air-conditioning plant failed, and the new 10G switch failed, causing a loop that brought down the entire storage network. Then, the storage redundancy failed due to a then-unknown upgrade error, and nodes refused to boot without being reinstalled. Any one of these issues leads to a bad day, but when they all happen at once, or within a short period of time, everything comes apart. What do you do? Do you have contingency plans?

As you hear more and more about the patching required for Meltdown and Spectre, what are your contingency plans? In my case, too many subsystems failed at once, and a bad day became a terrible one. We’d all like to think cascading failures like these are highly unlikely, but my recent experience proves otherwise. So, what do you do? What can you do to keep the business running? Some questions come to mind:

  • Do you move more to the cloud?
  • Do you get just enough resources running to keep the business afloat?
  • Do you move things to different data centers?
  • Do you place an emergency order for new equipment?
  • Are you satisfied with your backups?
  • Are you satisfied with your redundancy?
  • Are you satisfied with your testing of the patch, fix, upgrade, etc.?

What are the questions you would ask?
In a small enterprise, these questions are fairly important to answer. In a larger enterprise, with multiple data centers and massive redundancy across them, many of these questions may be less critical. The last three questions in the list are the most important ones to answer. In my case, I was satisfied with my backups, I thought I was satisfied with my redundancy, and I thought I had tested everything.
The problem is that there are many hidden pieces in our enterprises today, as we move up the stack toward automation. Many of the nuts and bolts we used to work with directly are no longer things we can control, or in some cases even see. The visibility into internal issues has gone away, and we are left looking for failures at higher levels. In my case, the storage in use was supposed to have redundancy. However, due to an upgrade issue, the redundancy was no longer there, and no errors were reported anywhere. Things worked acceptably until the wrong node was brought down, and then everything went crazy. The lesson: verify redundancy independently, instead of trusting the stack to report on itself.
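Here is a minimal sketch of that kind of independent check, using Linux software RAID (/proc/mdstat) as a stand-in for whatever storage layer you actually run. The mdstat format is real; the storage backend and the alerting hook are yours to swap in:

    #!/usr/bin/env python3
    import re
    import sys

    def degraded_arrays(mdstat_path="/proc/mdstat"):
        """Return the md devices whose member status shows a missing disk.

        In /proc/mdstat, a healthy two-disk mirror reports "[UU]"; an
        underscore, as in "[U_]", means a member is failed or missing.
        """
        degraded = []
        current = None
        with open(mdstat_path) as fh:
            for line in fh:
                device = re.match(r"^(md\d+)\s*:", line)
                if device:
                    current = device.group(1)
                    continue
                status = re.search(r"\[([U_]+)\]", line)
                if current and status and "_" in status.group(1):
                    degraded.append(current)
                    current = None
        return degraded

    if __name__ == "__main__":
        bad = degraded_arrays()
        if bad:
            # Wire this into whatever alerting you already trust.
            print("DEGRADED: " + ", ".join(bad), file=sys.stderr)
            sys.exit(1)
        print("All arrays report full redundancy.")

Run it from cron or your monitoring system; the point is that the check is yours, not the vendor’s.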
Unfortunately, timing also matters. If you have to wait for the physical plant to be repaired, you may have to run the data center on a shoestring: just enough capacity to keep cooling adequate for the remaining components while running only the most business-critical applications.
Moving to the cloud is a solution, but not one to attempt in the heat of battle unless you already have plans drawn up. A hasty migration can send your cloud costs skyrocketing and still leave your business without redundancy. All clouds rebooted their systems in response to Meltdown and Spectre, which caused availability issues for applications that did not span multiple zones within the cloud. Checking whether your own deployment spans zones is straightforward, as the sketch below shows.
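A minimal sketch of such an audit, assuming AWS, the boto3 library, and a hypothetical "App" tag used to group instances; adapt the filter to your own tagging scheme:

    from collections import Counter

    import boto3

    def zones_for_app(app_name, region="us-east-1"):
        """Count running instances per availability zone for one tagged app."""
        ec2 = boto3.client("ec2", region_name=region)
        zones = Counter()
        paginator = ec2.get_paginator("describe_instances")
        pages = paginator.paginate(
            Filters=[
                {"Name": "tag:App", "Values": [app_name]},  # hypothetical tag
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )
        for page in pages:
            for reservation in page["Reservations"]:
                for instance in reservation["Instances"]:
                    zones[instance["Placement"]["AvailabilityZone"]] += 1
        return zones

    if __name__ == "__main__":
        zones = zones_for_app("web")
        print(dict(zones))
        if len(zones) < 2:
            print("WARNING: single-zone deployment; one zone-wide reboot takes you down.")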
Is moving to the cloud the only good solution to catastrophic business-critical failures? I do not think so. Yes, business continuity is a must. Some use of clouds is appropriate, but not for everything; it depends mostly on your business and regulatory compliance requirements. While a move to the cloud is on everyone’s mind, and there are many companies that can help you, I must repeat that it is not something to rush into, at least not for IaaS. You may be able to migrate to a SaaS service for email, websites, etc. If you need infrastructure, however, that tends to be another matter entirely.
So, back to the disaster: redundancy is key, but keeping your business running is the real goal. The heart of successful business continuity is knowing the minimum set of applications and resources required to run the business, and where those assets will spin up when needed. What levels of failure or risk are you willing to absorb, and ultimately, what is your plan to recover? If your business continuity plan is to move data and workloads to an existing cloud environment already preseeded with both, that is great. If it is just to spin up more instances of workloads already in the cloud, you are golden! However, if you have not already migrated some things to the cloud, consider carefully before you treat it as the easy answer. Whatever the target, write the minimum footprint down in a form you can act on, as in the sketch below.
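One way that might look as a machine-readable plan; every service name, tier, and number here is a placeholder, and the point is only that the minimum footprint and recovery order live in a versioned file rather than in someone’s head:

    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str           # what the business calls it
        tier: int           # 1 = must run for the business to stay afloat
        min_instances: int  # smallest footprint that still works
        depends_on: tuple   # services that must come up first

    # Placeholder plan: replace with your own services and tiers.
    PLAN = [
        Service("auth",      tier=1, min_instances=1, depends_on=()),
        Service("orders",    tier=1, min_instances=2, depends_on=("auth",)),
        Service("reporting", tier=3, min_instances=0, depends_on=("orders",)),
    ]

    def recovery_order(plan, max_tier=1):
        """List the services at or above the given priority, dependencies first."""
        wanted = {s.name: s for s in plan if s.tier <= max_tier}
        ordered, seen = [], set()

        def visit(svc):
            if svc.name in seen:
                return
            for dep in svc.depends_on:
                if dep in wanted:
                    visit(wanted[dep])
            seen.add(svc.name)
            ordered.append(svc)

        for svc in wanted.values():
            visit(svc)
        return ordered

    if __name__ == "__main__":
        for svc in recovery_order(PLAN):
            print(f"bring up {svc.name} (min {svc.min_instances} instance(s))")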
My solution was to get the A/C fixed and then fix the storage; both happened within hours of each other. A good plan and knowledge of your business are crucial to successful business continuity.
What do you consider during a business continuity effort? What solutions do you use?