As the dust settles on the Amazon Cloud Outage (or the mist lifts, or whatever cloud-related metaphorical cliché you prefer) I’d like to draw a number of conclusions related to scalability, performance, reliability and openness.
For those of you who haven’t followed the minutiae of the story, it appears that Amazon failed because a network event caused Elastic Block Storage (EBS) to start re-mirroring itself, which in turn saturated the network and caused more mirroring events in a cascade that made EBS unavailable.
How did Amazon not fail?
In fact, in terms of its Service Level Agreement (SLA), Amazon didn’t come anywhere near failing. In the SLA, Amazon is very specific about what it means for its cloud to be Unavailable:
“Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances.
In fact, all instances would have had external connectivity. Some of them may not have been “running”, as a result of their dependency on EBS, but Amazon can easily argue that it was your choice to architect in that dependency. “Unavailable” does not seem to cover failure of EBS, which (given the unsavoury nature of the underlying architecture discussed below) seems like a sensible business decision on Amazon’s part. If you had architected without a dependency on EBS you would have been fine: EBS is internal connectivity, not external connectivity. All Amazon says about EBS is as follows:
“volumes that operate with 20 GB or less of modified data since their most recent Amazon EBS snapshot can expect an annual failure rate (AFR) of between 0.1% – 0.5%, where failure refers to a complete loss of the volume. This compares with commodity hard disks that will typically fail with an AFR of around 4%, making EBS volumes 10 times more reliable than typical commodity disk drives.”
Whilst we note that they are comparing with unRAIDed commodity disks (the failure rate for RAID arrays is much lower, and any commodity server supports RAID these days), the more important issue here is that there has been no “complete loss of the volume”. All that happened was that the instances became unable to access the volume temporarily, so even if there had been an EBS SLA couched in the same terms as these guidelines, you wouldn’t have got a payout for the outage that just occurred.
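To put those guideline figures in perspective, here is a back-of-the-envelope calculation using the quoted AFRs; the fleet size is purely an illustrative assumption, and “failure” here means complete loss of a volume, not temporary inaccessibility.

```python
# Expected annual "complete loss" events for a hypothetical fleet of 1,000
# volumes/disks, using the AFR guideline figures Amazon quotes above.
fleet_size = 1000  # illustrative assumption, not a figure from Amazon

for label, afr in [("EBS volume (0.1% AFR)", 0.001),
                   ("EBS volume (0.5% AFR)", 0.005),
                   ("UnRAIDed commodity disk (4% AFR)", 0.04)]:
    print(f"{label}: ~{fleet_size * afr:.0f} complete losses per year")
```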
Elastic Block Storage as Elastoplast
EBS is widely used in EC2 because it provides a filesystem which persists when an instance is shut down. In that way it behaves more or less the way you expect a filesystem to behave (although its performance characteristics are a little different). The peculiarity of EBS is that although it is remote from the instance (like, for example, an NFS or CIFS-mounted share) it isn’t actually shared. You can have multiple EBS volumes per instance, but not multiple instances per EBS volume.
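As a minimal sketch of that one-volume-to-one-instance model, assuming boto3; the region, Availability Zone, instance ID and device name below are placeholders, not details from the outage.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# An EBS volume lives in a single Availability Zone and attaches to at most
# one instance at a time, although an instance can mount many volumes.
volume = ec2.create_volume(AvailabilityZone="us-east-1a", Size=20)  # size in GiB

# Wait until the new volume is ready before trying to attach it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume["VolumeId"]])

# Placeholder instance ID and device name -- substitute your own.
ec2.attach_volume(VolumeId=volume["VolumeId"],
                  InstanceId="i-0123456789abcdef0",
                  Device="/dev/sdf")
```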
However, a remote storage model, even within these sharing constraints, does not fit with the physically-shared-nothing architecture that is required in order to ensure elastic scalability in public clouds. They really should call it “Elastoplast” Block Storage: it’s covering up something fairly unsavoury. This is a point we have touched on in other posts. If you follow the VMware or Red Hat blueprints for how to do cloud, you put shared storage at the bottom of the stack. However, the shared storage then defines the limits of the scalability within which you have elasticity, so this approach can’t be applied in public clouds and you are forced to do your shared storage across the network layer. This has two implications:
- Your instance performance is dependent on a fairly uncontrolled shared resource (i.e. the network) that can be saturated.
- Elasticity (i.e. spin-up of new instances) is also critically dependent on this same saturable layer.
The use of local instance storage backed by an S3-based image store would seem to be a better approach (at least S3 has an SLA), subject to an appropriate application-level mechanism for handling long-term persistence, because local instance storage disappears when the instance is terminated.
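The kind of application-level persistence mechanism meant here might look something like the following sketch, assuming boto3; the bucket name and local path are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

BUCKET = "my-app-state"             # hypothetical bucket name
LOCAL_PATH = "/mnt/data/state.db"   # lives on ephemeral instance storage

def checkpoint_to_s3():
    """Copy local instance-store state to S3 so it survives instance termination."""
    s3.upload_file(LOCAL_PATH, BUCKET, "checkpoints/state.db")

def restore_from_s3():
    """Pull the last checkpoint back onto a freshly launched instance at boot."""
    s3.download_file(BUCKET, "checkpoints/state.db", LOCAL_PATH)
```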
You have to assume everything will fail
One key conclusion here is that you need to architect for failure at every layer in the stack. In other words, assume that failure is possible, in AWS terms, not only at the Instance level but also in S3, in EBS, at the Availability Zone level, at the Region level, and for AWS as a whole.
In fact AWS as a whole did not go down, and it seems the problem was constrained to a number of Availability Zones within one Region, but the point is AWS has lost its aura of invincibility. From a governance perspective it is no longer possible to believe AWS can’t go down.
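Acting on that assumption at the Availability Zone level means, at minimum, being able to launch replacement capacity somewhere else. A minimal sketch, assuming boto3; the AMI ID, instance type and zone list are illustrative, not taken from the incident.

```python
import boto3
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2", region_name="us-east-1")

AMI_ID = "ami-0123456789abcdef0"                    # illustrative AMI
PREFERRED_ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

def launch_with_az_fallback():
    """Try each Availability Zone in turn rather than assuming any one is up."""
    for zone in PREFERRED_ZONES:
        try:
            result = ec2.run_instances(
                ImageId=AMI_ID,
                InstanceType="m1.small",
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": zone},
            )
            return result["Instances"][0]["InstanceId"]
        except ClientError as err:
            print(f"Launch failed in {zone}: {err}")
    raise RuntimeError("No preferred Availability Zone could launch an instance")
```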
It is at this point that we introduce our discussion of cloud APIs. Approaches like Deltacloud (which offers a common API across many clouds) or OpenStack (which offers an open-standard approach to a common API across multiple providers) can allow “second-sourcing” of cloud from multiple providers, although significant thought is required when building such a solution, particularly when architecting for performance.
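The shape of abstraction these projects encourage can be sketched as a minimal provider-neutral interface; the names below are illustrative and are not any particular library’s API.

```python
from abc import ABC, abstractmethod

class CloudProvider(ABC):
    """Minimal provider-neutral surface an application codes against."""

    @abstractmethod
    def launch_instance(self, image_ref: str) -> str:
        """Start an instance and return a provider-specific identifier."""

    @abstractmethod
    def terminate_instance(self, instance_id: str) -> None:
        """Stop and discard the given instance."""

# Concrete implementations would wrap EC2, a Deltacloud or OpenStack endpoint,
# or an internal Eucalyptus installation, so an application written against
# this interface can be pointed at a second supplier without rework.
```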
DNS is a key asset
As we saw in the case of WikiLeaks, if anyone is to find you, your DNS entry (particularly the one that Google points to) has to work. The standard AWS solution is to provide Elastic IP addresses that can be mapped and remapped fairly dynamically to instances within the AWS cloud, and to which DNS is bound fairly statically. If AWS is down, that DNS entry doesn’t work. The alternative approach is to use a dynamic DNS service to do the mapping to a dynamically-assigned IP address on the instance. This approach can suffer from DNS propagation delays, but something along these lines will be required to allow second-sourcing of cloud, so it is worth designing it into the architecture from the outset.
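As a sketch of the dynamic-DNS side of this, repointing a record at a replacement address might look like the following with boto3 and Amazon Route 53; the hosted zone ID, record name and TTL are placeholders, and the same idea applies to any dynamic DNS provider outside AWS.

```python
import boto3

route53 = boto3.client("route53")

def repoint_dns(new_ip):
    """Swap the A record over to a replacement instance's address."""
    route53.change_resource_record_sets(
        HostedZoneId="Z0123456789ABCDEFGHIJ",   # placeholder hosted zone ID
        ChangeBatch={
            "Comment": "Failover to replacement capacity",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "www.example.com.",
                    "Type": "A",
                    "TTL": 60,  # keep the TTL low to limit propagation delay
                    "ResourceRecords": [{"Value": new_ip}],
                },
            }],
        },
    )
```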
It’s quite good news for Eucalyptus
One of the open source companies we have been tracking, Eucalyptus, offers an API-compatible AWS software platform for internal use. Clearly, if you were running Eucalyptus in your own data center, you wouldn’t be impacted by a failure of Amazon’s EBS. Similarly, if you are cloud-bursting to AWS from Eucalyptus in a data center, you would most likely choose not to back images with EBS but rather use snapshots via S3. So we are in an interesting position where, in principle, a company that built a solution entirely in AWS using EBS would suffer an outage, which could then cause a competitor to gain additional traffic, which that competitor could satisfy by cloud-bursting its Eucalyptus system into AWS (because it didn’t depend on EBS).
All of this is really very complicated
What is emerging in the cloud discussion is a hype-meets-reality moment, where the technical issues that we all really knew were there underneath the Elasticity (or Elastoplast) are emerging and causing real dollar impact. Cloud isn’t a magic bullet. You do need to understand it, and you need to architect on the basis of that understanding.
Mike,
Regardless of what is in the SLA, in this case, Amazon is providing a 100% credit for 10 days of service for “…customers with an attached EBS volume or a running RDS database instance in the affected Availability Zone in the US East Region at the time of the disruption, regardless of whether their resources and application were impacted or not, we are going to provide a 10 day credit equal to 100% of their usage of EBS Volumes, EC2 Instances and RDS database instances that were running in the affected Availability Zone. These customers will not have to do anything in order to receive this credit, as it will be automatically applied to their next AWS bill. Customers can see whether they qualify for the service credit by logging into their AWS Account Activity page.”
So, basically, if you had an EBS vol in the affected zone, you are getting compensated.
Hi Greg,
Yes, thanks for pointing this out. I think we were always expecting Amazon to make some form of compensation, although it is obviously a good thing that they have done so. I drafted the post at the point when Amazon was silent on the matter (even though it may actually have been published after the Amazon announcement happened).
Amazon also posted a lot of good technical detail about how this happened. Additional clarifications are helpful but I don’t think the later information from Amazon actually contradicts my post, so I decided to leave the post in its current state as a “point-in-time” in the discussion, rather than updating it.
There’s a lot of commentary from the other analysts on the site following on from the Amazon outage, and I suspect this reflects the significance both of the event and of Amazon itself.