The recent Amazon Web Services Simple Storage Service (S3) outage has taught us quite a bit about fragile cloud architectures. While many cloud providers will make hay of it during the next few weeks, the truth is that current cloud architectures, including modern hybrid cloud architectures, are fragile. We need to learn from this outage to design better systems: ones that are not fragile, ones that can recover from an outage. Calling the cloud fragile is not naysaying; it is a chance to do better! What can we do better?

The obvious answer is to use multiple clouds: to build agile, outage-tolerant applications that can reroute data paths on the fly. Unfortunately, the vast majority of systems in use are just not there yet. New systems are being developed that meet those goals, but at the moment, we are not quite there. So, what have we learned?

  • Clouds go down; read the SLA!
  • Free does not imply always available.
  • Data protection is crucial.
  • You are in charge of your own fate.

The last item on the list is crucial, as is the first. Each cloud has an SLA; Amazon's is 99.5% uptime per month. That means it can, and will, go down. Expect the cloud to go down. You are in charge of your own fate: if you design a hybrid cloud, it must account for outages. The cloud will go down, become inaccessible, or otherwise be unavailable. Count on it.
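To put that SLA figure in perspective, here is a quick back-of-the-envelope calculation (a minimal Python sketch; the uptime percentages are illustrative) of how much downtime a monthly uptime promise still allows:

```python
# Downtime budget implied by a monthly uptime SLA.
# Even 99.5% uptime over a 30-day month still permits ~3.6 hours of outage.
def downtime_budget_hours(uptime_percent, days_in_month=30):
    hours_in_month = days_in_month * 24
    return hours_in_month * (1 - uptime_percent / 100)

for sla in (99.5, 99.9, 99.99):
    print(f"{sla}% uptime -> {downtime_budget_hours(sla):.2f} hours of allowed downtime per month")
```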
Our hybrid cloud has many parts outside our control; even our own data centers have parts outside our control. This is why we have redundancy, multiple sites, and the like. This is why disaster recovery and business continuity plans exist. We all know that. Yet these plans often fail to account for a cloud outage.
How do we design or architect around such a problem? We need to consider the business as well as the technology. Some systems seem to be always on, such as DNS. While DNS is ultimately outside our control, we do control some aspect of it: the little bit that is local to our applications. Can we use that in some way? We could use it to quickly repoint a host from Amazon S3 to an S3-compatible object store running inside Azure, or even to S3 in a different region. There are a number of things we could do using automation within our hybrid cloud to quickly fix detectable problems. This ability to detect and fix things quickly, perhaps even before a problem is noticeable, is the true strength of automation.
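As a concrete illustration, here is a minimal sketch of that detect-and-repoint idea, assuming a hypothetical standby endpoint and a simple HTTP health probe; a real deployment would update a short-TTL DNS record or service-discovery entry rather than print a result:

```python
# Minimal sketch of automated endpoint failover: probe the primary object
# store and repoint the application at a standby S3-compatible endpoint if
# the primary stops answering. The endpoint URLs are hypothetical placeholders.
import urllib.request
import urllib.error

PRIMARY   = "https://s3.us-east-1.amazonaws.com"    # primary object store
SECONDARY = "https://objects.fallback.example.com"  # hypothetical S3-compatible standby

def is_healthy(endpoint, timeout=5):
    """Return True if the endpoint answers HTTP at all."""
    try:
        urllib.request.urlopen(endpoint, timeout=timeout)
        return True
    except urllib.error.HTTPError:
        return True   # we got an HTTP response, so the service is reachable
    except (urllib.error.URLError, OSError):
        return False  # no response at all: treat as down

def pick_endpoint():
    """Choose the object-store endpoint the application should use right now."""
    return PRIMARY if is_healthy(PRIMARY) else SECONDARY

if __name__ == "__main__":
    endpoint = pick_endpoint()
    print(f"Pointing application storage at: {endpoint}")
    # In practice this decision would drive a local DNS update or a change to
    # the application's configuration, ideally before users notice a problem.
```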
Scale is also an issue. On high-scale sites, we often offload images and other parts of our data using cloud services such as S3. Maintaining multiple copies of that data can be difficult. This is where copy data solutions, which maintain and manage all those data copies, can come into play. These solutions ensure that there are enough copies in well-known locations to keep the service running.
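To make the idea tangible, here is a simplified sketch, assuming hypothetical bucket names and a hypothetical S3-compatible fallback endpoint, of keeping a second copy of offloaded objects at another provider with boto3; a real copy data product would add verification, versioning, and lifecycle management on top of this:

```python
# Simplified sketch: copy every object from a primary S3 bucket to a bucket
# on a second, S3-compatible provider. Bucket names and the fallback endpoint
# are hypothetical placeholders.
import boto3

primary = boto3.client("s3")  # Amazon S3, using default credentials/region
secondary = boto3.client(
    "s3",
    endpoint_url="https://objects.fallback.example.com",  # S3-compatible target
)

SRC_BUCKET = "prod-images"       # bucket holding the offloaded site assets
DST_BUCKET = "prod-images-copy"  # standby copy at the second provider

def sync_bucket(src_bucket, dst_bucket):
    """Copy all objects from the primary bucket to the secondary store."""
    paginator = primary.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=src_bucket):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            body = primary.get_object(Bucket=src_bucket, Key=key)["Body"].read()
            secondary.put_object(Bucket=dst_bucket, Key=key, Body=body)
            print(f"copied {key}")

if __name__ == "__main__":
    sync_bucket(SRC_BUCKET, DST_BUCKET)
```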
We are currently in the "fragile cloud" phase of our hybrid cloud. Still, many companies are moving more and more critical components to the cloud, and to many clouds. This is not a bad thing; it is a good thing. Business will ultimately decide which way to go. However, we need to move far past the fragile cloud and into a well-formed, highly adaptive hybrid cloud: one that contains more than one cloud, with applications that can repoint from one cloud to another with the flip of a switch, or perhaps live in both clouds at once.

Closing Thoughts

The more I think about scale, the more I think about data protection and scaling out to multiple clouds. In the future, depending on one service, even one spread across multiple availability zones, will not be the answer. We need to consider more than one service. We need copy data solutions that understand data in cloud repositories. We need security and data protection that works well within a hybrid cloud of multiple clouds. We need… We need…
The list is endless, but now we can learn from this outage and start asking those tough questions. Where will we be if our cloud of choice goes down? Do we have the automation in place to react quickly? Is the data we need to stay running stored in multiple locations? Is it even critical data?