Whenever AWS has an outage, it makes the news. In fact, AWS said the recent issue wasn’t even an outage, and it still made the news. S3 returned a high rate of errors in the US-EAST-1 region, which caused application problems for a few hours. Personally, it affected my morning routine. I start the day reading blog posts using NewsBlur. NewsBlur wouldn’t show me any blogs. Instead, it reported server errors caused by this S3 issue, so my usual source of news couldn’t tell me that there was news about an AWS S3 issue. Before we start talking about how unreliable the cloud is, let us ask: who among us has private infrastructure that is without fault? While cloud service outages make the headlines, on-premises outages happen all the time, too. Also, who cares if your application isn’t available for a few hours every couple of years? Not every application needs 100% uptime. It may be the right business decision to accept an application outage when there is an infrastructure outage.
IT failures happen everywhere, especially when people are involved. This S3 failure was caused by a human error that was not anticipated by the AWS systems. Notice that it was the AWS systems that were blamed, not the individual who typed the wrong command. This is another of the differences between cloud IT and enterprise IT. Cloud IT expects people to make mistakes. The safeguard that AWS has built to avoid a repeat is a better-automated process. I think most IT professionals have had those moments when they realize they made a mistake. They know the icy finger that runs down the spine as they evaluate just how visible a mess they have made. In most private IT environments, these mistakes don’t make the newspaper or sites like Hacker News. But make the same kind of mistake in a public cloud platform, and the world will hear about it. The surprising thing is just how few outages AWS has, at least outages that are visible to customers. AWS has over thirty Availability Zones (AZs), each with at least one data center. It has dozens of different services and (we think) well over a million physical servers. It is amazing to me that we only see one or two outages per year from AWS. I also find it interesting that around half the outages are related to extreme weather. Snowmageddon in the Virginia region and a storm in Sydney caused two of the AWS issues. I would be stunned if no corporate data centers were affected by the same weather.
One of the discussion points whenever AWS has an outage is that application design can mitigate the risk. Applications can be designed and deployed to be highly available in either enterprise or cloud infrastructure. But is this even necessary? High availability is essential for some applications, but for others, the value does not justify the effort and cost. You can design a cloud-native application on AWS to tolerate both AZ and region outages. Netflix does this, and when AWS has an infrastructure outage, Netflix just keeps streaming. But building applications to cope with AZ and region outages requires a lot of specialist skill, and such applications are expensive to operate. Sometimes, not coping with failures is the right business decision. I doubt that NewsBlur lost many customers, even though it was out of operation for hours. I imagine that most users treated it the way I did. It was a minor inconvenience not to be able to read my news feeds, but it didn’t make me look for another provider. Basically, there isn’t enough business value for NewsBlur to justify spending the money on developing region resilience.
Outages are a fact of life in IT, whether they are caused by people, process, technology, or Mother Nature. The ability of an application or service to continue to operate when there is an outage is a fundamental part of good IT design. Another part is deciding how much availability makes sense for each application, and then designing for the required level of availability. Deciding to accept the possibility, or even the probability, of an application outage in exchange for a lower capital or operational cost is a valid choice. Not every application or service needs to be available 100% of the time.
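To put "a few hours every couple of years" into numbers, here is a minimal sketch of the arithmetic. The four-hour outage and two-year window are my own illustrative assumptions, not figures reported by AWS or NewsBlur, and the function names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def availability(downtime_hours: float, period_years: float) -> float:
    """Fraction of the period during which the application was up."""
    total_hours = period_years * HOURS_PER_YEAR
    return 1 - downtime_hours / total_hours


def allowed_downtime_hours(target: float, period_years: float = 1.0) -> float:
    """Hours of downtime an availability target permits over the period."""
    return (1 - target) * period_years * HOURS_PER_YEAR


if __name__ == "__main__":
    # Assumed scenario: one 4-hour outage over 2 years.
    print(f"Availability: {availability(4, 2):.4%}")

    # Downtime budget per year for some common targets.
    for target in (0.99, 0.999, 0.9999):
        hours = allowed_downtime_hours(target)
        print(f"{target:.2%} allows {hours:.2f} hours of downtime per year")
```

Under those assumptions, the application still comes in at roughly 99.98% availability, which is comfortably inside a "three nines" budget of about 8.76 hours per year. Whether closing the remaining gap is worth the cost of region resilience is exactly the business decision discussed above.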