Learning from What Went Wrong: The Affordable Care Act Web Portal

Those of us who work on complex computer systems know how daunting it can be to get all the different systems to communicate and work properly. The bigger the infrastructure gets, the more complex it becomes. Now take the most complex system you have designed or worked with, increase its complexity a hundredfold, and you might have an idea of the complexity involved in the design and deployment of the Affordable Care Act web portal.

Troubleshooting from the Trenches

The day seemed to start like any other, but as soon as I logged in, issues arose. I had lost one of my VMware ESX 3.5 servers, and all the virtual machines on the host were knocked offline. This should not have been a big deal since HA was enabled, but Murphy has a way of making life really interesting. When I logged into the vCenter client, I noticed that the host in question was in a disconnected state and all of its virtual machines showed up as disconnected. In the past, I have seen HA recover the virtual machines in under five minutes after a host failure. So I waited and waited, thinking HA should have kicked in by now. Time for a little further investigation!