We are still coming to grips with the impact of the Xen and Bash shell issues that have sprung up lately. The issues are enough to make us realize that there are some serious pitfalls to cloud computing—or more to the point, pitfalls to using only one cloud service provider. We talk about using live migration and other tools to alleviate downtime, but have we really thought through the use of these tools at cloud scale? What was the impact on your environment, and how have you decided to alleviate that impact? Those are the questions that come out of the latest set of issues with cloud computing.
Virtualization administrators take for granted that we can live migrate/vMotion workloads from host to host, and we therefore expect cloud providers to do the same to save us from downtime. However, this is not always the case. Take our cloud provider: during its recent Xen upgrade, we lost access to our systems. The systems themselves were not down, and the portal said they had been live migrated elsewhere. Yet we did lose network access, which makes me wonder whether the systems had actually been moved to another host or simply put to sleep until the Xen upgrade could complete.
Either way, we were offline for nearly two hours with no warning. The messages we had been sent implied that there would be no outage and that the upgrade was minor. I beg to differ. It really makes you think about the advisability of investing in just one cloud for your applications. Clouds have scale issues that most virtualization administrators today just cannot fathom: providers are dealing with thousands of virtualization hosts and millions of virtual machines. Outside of Amazon and the like, not many enterprises even approach that scale. So, can we expect them to follow the same practices we do within our own data centers?
I think we should hold them to a certain standard with regard to notification and availability. Notifications should not state that there may be an outage, but that there will be an outage and what type it will be. We should not find out about an outage related to a planned upgrade when it happens, but rather beforehand. These were not sudden changes. There was plenty of time to test all the changes, find all the pitfalls, and warn customers. Instead, it appears that many cloud service providers ran a test, confirmed that the fix worked, and stopped there rather than planning a proper upgrade.
Even with a plan, when humans are involved, mistakes are made. In our case, we lost connectivity to our VMs—thankfully not during our peak time, but still quite annoying.
We talk about automation—the need for automation so that human mistakes are not made. But is there enough time to fully vet and test the automation before it is put into play? Are the automation techniques incomplete? Are there too many security issues that break automation?
We begin to see issues now: things we thought were automated have ended up not being so, and things we thought were not, are. Automation takes time to perfect, and it takes time to cover all contingencies. Further, emergency security updates are hard to automate for all contingencies. While there is basic automation, when problems happen, people become involved.
So really, can you trust just one cloud? If you are working on a next-generation application, that application should be cloud-aware and, as such, able to self-heal in some fashion. If you are using a more traditional application, it may be worth using multiple clouds or multiple zones within a cloud, with automatic failover between the services. This is where some very interesting new tools come into play that form mesh networks between clouds and provide cloud-to-cloud data protection mechanisms.

Mesh Networks

One solution to our problem is to use a mesh network–type product to join multiple cloud instances into one mesh. Such a mesh may use geographic load balancing, latency detection, and other means to ensure that if one part of the network is unavailable, its data and queries are redirected to a part that is still active. While great for a medium to large business, this option may be costly for small businesses. Geographic load balancing by itself would solve most of the problems, but it would not redirect traffic from one geographic location to another based on failure or latency. For that, you need to either fail over by hand or automate the failover.
Several systems that can solve this problem come to mind:

  • Kemp Technologies Load Balancer has a mesh capability but also does latency-based load balancing along with the standard types of load balancing. This way, if a site is slow or down, traffic can be automatically redirected to another site. A layered approach is needed to make this work, however, with one layer outside the sites and the others inside. With Kemp load balancers at every layer, latency is handled intelligently.
  • Silverpeak has a mesh network that could also be used to move traffic between load balancers. While it is not a load balancer per se, the mesh network could join together multiple load balancers and send traffic where it needs to go based on availability or other criteria.
  • Barracuda also has a mesh network that can work with all its products. While Barracuda is traditionally a security play, its products can all talk to one another, and if one part of the system is unavailable, they can redirect traffic further up the stack.
  • Finally, you could build your own automation on top of a hierarchical load-balancing network, which would require software-defined network control to determine where traffic goes; a minimal sketch of this kind of failover logic follows this list.
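
To make that last option concrete, here is a minimal sketch of latency- and health-based failover logic in Python. The endpoints and the DNS hook are hypothetical placeholders, and in practice this decision would live in the load balancer or SDN controller rather than in a standalone script.

```python
# Minimal sketch of latency-based failover logic. The site endpoints and the
# update_dns() hook are hypothetical; a real deployment would delegate this
# decision to the load balancer or SDN controller described above.
import time
import urllib.request

SITES = {
    "us-east": "https://us-east.example.com/healthz",   # hypothetical endpoint
    "eu-west": "https://eu-west.example.com/healthz",   # hypothetical endpoint
}
LATENCY_THRESHOLD = 0.250  # seconds; anything slower is treated as "latent"

def probe(url, timeout=2.0):
    """Return response latency in seconds, or None if the site is unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            if resp.status != 200:
                return None
    except OSError:
        return None
    return time.monotonic() - start

def choose_site():
    """Pick the healthiest site: reachable, and preferably under the latency bar."""
    results = {name: probe(url) for name, url in SITES.items()}
    healthy = {name: lat for name, lat in results.items() if lat is not None}
    if not healthy:
        raise RuntimeError("no site is reachable; escalate to a human")
    fast = {name: lat for name, lat in healthy.items() if lat <= LATENCY_THRESHOLD}
    pool = fast or healthy          # prefer fast sites, fall back to any healthy one
    return min(pool, key=pool.get)  # lowest latency wins

if __name__ == "__main__":
    target = choose_site()
    print(f"routing traffic to {target}")
    # update_dns(target)  # hypothetical hook into geo-DNS or the SDN controller
```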

Cloud-to-Cloud Data Protection

As the Code Spaces attack showed, there is a need to get data outside of any one cloud, either to a data center or to another cloud, one with a separate management interface and different authentication. However, getting data out of one cloud and into another via some form of replication is currently not as easy as it sounds. There are plenty of ways to get data to the cloud using virtual and other data-protection tools such as Veeam, HotLink, Datto, Zerto, VMware, and others, but going between clouds is a bit behind the times. Cloud also needs better data protection, not just cloud-to-cloud, but within one cloud, between availability zones, and perhaps using different users in different tenancies. An attacker wanting to remove all data from Amazon would then have to break into multiple tenancies. This would have made what happened to Code Spaces harder to do, though unfortunately not impossible if people have bad password practices and lack two-factor authentication.
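
As an illustration of the separate-tenancy idea, here is a minimal sketch, assuming AWS S3 and the boto3 SDK, that copies backup objects into a second account using entirely different credentials (ideally different IAM users with MFA enforced). The bucket and profile names are hypothetical, and a real job would need to handle very large objects and errors.

```python
# Sketch: copy backup objects into a second tenancy that uses entirely separate
# credentials, so compromising one set of keys cannot destroy both copies.
# Assumes AWS S3 via boto3; bucket and profile names are hypothetical.
import boto3

src = boto3.Session(profile_name="primary-account").client("s3")
dst = boto3.Session(profile_name="backup-account").client("s3")  # separate IAM users, MFA enforced

SRC_BUCKET = "prod-backups"     # hypothetical
DST_BUCKET = "offsite-backups"  # hypothetical, owned by the second tenancy

paginator = src.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=SRC_BUCKET):
    for obj in page.get("Contents", []):
        # Stream each object out of the source account and write it into the
        # destination account. put_object is fine for modest object sizes;
        # very large backups would need multipart uploads.
        body = src.get_object(Bucket=SRC_BUCKET, Key=obj["Key"])["Body"]
        dst.put_object(Bucket=DST_BUCKET, Key=obj["Key"], Body=body.read())
        print(f"copied {obj['Key']} to the second tenancy")
```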

  • HotLink can move data from one cloud to another, as well as from one cloud to a data center and vice versa. The key is to move the data while translating it so that it is ready to use if necessary. Just putting data in a cloud is no longer enough; that data needs to be ready to run at the push of a button, including networking. The DR product is limited to just Amazon, but the management product has no such limitation. Still, failover using the Express product would require a bit of scripting.
  • VMware Connector with vCloud Air has the capability to copy data from a data center to a cloud, but not yet to another cloud. However, given the number of vCloud Air–like cloud service providers, this would be a very nice addition. This solution is limited to those running VMware vSphere and cannot be used cross-cloud where the target cloud's hypervisor is not vSphere.
  • Veeam has the ability to copy data to any share it can see, regardless of where it is. Tie this to Veeam SureBackup, and you have images ready to run, as long as those images are on the same hypervisor from which they were taken. The lack of translation from one cloud to another hampers cross-cloud utilization.
  • Datto translates the images stored within the cloud into the form required to run the image, but it is currently limited to just its own cloud and not another cloud. However, since it can translate, restores could target another cloud with just a bit of scripting.
  • CloudEndure is a relatively young player, but it can replicate data from Amazon to other clouds or within Amazon. Movement of data is a big part of this puzzle, and CloudEndure has solved it between clouds.

The rest of the cloud-to-cloud tools fall into the above categories, but there is a decided lack of automation around data protection. What is missing is a tool that, if I went from Cloud A to Cloud B, regardless of provider, would detect the provider, translate the VMs accordingly, and set up the networks and security as required. Currently, networking and security (outside of Amazon) require by-hand implementation, which could slow down recovery and failover when a situation arises.
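
The kind of automation described above might look something like the following skeleton. Every function here is a hypothetical placeholder rather than a real provider API; the point is the flow: detect the target provider, lay down networking and security, then translate each VM.

```python
# Skeleton of the missing cross-cloud automation: detect the target provider,
# recreate networking and security, and translate each VM image.
# All functions are hypothetical placeholders, not real provider APIs.
from dataclasses import dataclass

@dataclass
class VM:
    name: str
    source_format: str  # e.g. "vmdk", "ami", "vhd"

def detect_provider(endpoint: str) -> str:
    """Hypothetical: inspect the target endpoint and return a provider name."""
    return "aws" if "amazonaws" in endpoint else "other"

def apply_network_and_security(provider: str) -> None:
    """Hypothetical: recreate subnets, firewall rules, and load-balancer config."""
    print(f"recreating networks and security groups on {provider}")

def translate_image(vm: VM, provider: str) -> str:
    """Hypothetical: convert the VM image to the target provider's format."""
    target = {"aws": "ami", "other": "ovf"}[provider]
    print(f"translating {vm.name}: {vm.source_format} -> {target}")
    return target

def fail_over(vms, target_endpoint):
    provider = detect_provider(target_endpoint)
    apply_network_and_security(provider)   # networking first, so VMs land ready to run
    for vm in vms:
        translate_image(vm, provider)
    print("environment is ready to power on")

if __name__ == "__main__":
    fail_over([VM("web01", "vmdk"), VM("db01", "vmdk")],
              "https://ec2.us-east-1.amazonaws.com")
```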

When in Doubt

When in doubt about a cloud service provider, which seems to happen more and more, set up a second instance of your application within another cloud in a hot-ready state. Replicate application data between the clouds and set up any required networking ahead of time. In addition, ensure that the two clouds are in different regions of the country, on different networks, and on different power grids. Once you have done this, verify that your recovery or failover works by testing it often. Take a page out of the Netflix playbook and employ a simian army to take out major components, ensuring the remaining parts fail over automatically.
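
A recurring failover drill in that spirit could be as simple as the sketch below: deliberately stop the primary, wait, and verify that the standby answers. The stop/start hooks and URLs are hypothetical stand-ins for your provider's API calls and your real health checks.

```python
# Sketch of a recurring failover drill in the Simian Army spirit: take the
# primary offline on purpose and verify that the standby copy answers.
# The stop/start hooks and URLs are hypothetical stand-ins.
import time
import urllib.request

PRIMARY_URL = "https://app.cloud-a.example.com/healthz"    # hypothetical
SECONDARY_URL = "https://app.cloud-b.example.com/healthz"  # hypothetical

def is_up(url, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def stop_primary():
    """Hypothetical hook: call the provider API to stop the primary instances."""
    print("stopping primary on purpose")

def start_primary():
    """Hypothetical hook: bring the primary back after the drill."""
    print("restarting primary")

def failover_drill(grace_seconds=60):
    stop_primary()
    time.sleep(grace_seconds)  # give DNS and load balancers time to react
    if is_up(SECONDARY_URL):
        print("PASS: secondary answered while primary was down")
    else:
        print("FAIL: secondary did not take over; fix this before you need it")
    start_primary()

if __name__ == "__main__":
    failover_drill()
```
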
How do you fail over if your cloud service has issues? Have we learned enough from the current batch of surprises to ensure we have an anti-fragile cloud-based application and experience?