VMware HA: What’s New in vSphere 6?

There are a few vSphere features that I really found myself taking for granted until they had enhancements added to their base technology. How about you? Are there any features that you simply don’t think about anymore? You know, ones that just work and have been around and used in best practices for a good while now? Well, for me personally, those features are vMotion and High Availability (HA). Both of these features have been enhanced in vSphere 6.0.

Let me digress for a second and ask another question. Who remembers the demo from when vMotion was first released? VMware would stream a movie from a virtual machine while that virtual machine migrated from host to host. My last article covered VMware’s enhancements to vMotion, and this article will focus on its enhancements to VMware HA.

In my opinion, VM Component Protection (VMCP) is one of the best enhancements added to vSphere 6.0. I am willing to bet that most of you who are reading this post have had to deal with the situations commonly known as APD and PDL, or “all paths down” and “permanent device loss.” For those who have never experienced this rite of passage, let me offer an explanation:

Permanent device loss occurs when connectivity to a storage device is lost and the storage system reports that the device is not expected to return. A response to this condition was already present within HA, but it had to be configured from the command line. The expected response is that the moment the storage system issues a PDL signal, the virtual machine is restarted instantly.

“All paths down” is a somewhat different scenario: the host has lost every path to the storage, has no idea what happened to it, and has to assume that the storage could return at any time. That is an unknown that HA has to deal with. The workflow, if you will, is a series of timer countdowns before the command to restart the virtual machine is given. Once the APD condition is declared, the first timer starts a countdown of 140 seconds. Once that timer has been exhausted, the HA timer starts, and once that timer has reached 180 seconds (configurable), the virtual machine is cleared for restart and recovery.
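To make the two countdowns concrete, here is a minimal Python sketch of the timeline described above. The function and constant names are my own illustrative assumptions and not part of any VMware API; the only real values are the 140-second and 180-second figures quoted in this article.

```python
# Illustrative sketch only: these names are hypothetical, not a VMware API.
APD_TIMEOUT_SECONDS = 140        # first countdown, as quoted in the article
HA_FAILOVER_DELAY_SECONDS = 180  # second countdown, the configurable HA timer


def seconds_until_restart(apd_timeout=APD_TIMEOUT_SECONDS,
                          ha_delay=HA_FAILOVER_DELAY_SECONDS):
    """Total time from the start of the APD condition until restart is allowed."""
    return apd_timeout + ha_delay


def apd_phase(elapsed, apd_timeout=APD_TIMEOUT_SECONDS,
              ha_delay=HA_FAILOVER_DELAY_SECONDS):
    """Return which stage of the countdown applies after `elapsed` seconds."""
    if elapsed < apd_timeout:
        return "apd-timeout-running"
    if elapsed < apd_timeout + ha_delay:
        return "ha-delay-running"
    return "cleared-for-restart"


if __name__ == "__main__":
    print("seconds until restart:", seconds_until_restart())
    for t in (0, 139, 140, 319, 320):
        print(t, apd_phase(t))
```

With the default values, a virtual machine is cleared for restart 320 seconds (140 + 180) after the APD condition begins, so shortening the HA delay is the only lever in this sketch that brings recovery forward.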

Five settings are presented to be configured for VMCP:

Response for host isolation
Response for datastore with PDL
Response for datastore with APD
Delay for the VM failover for APD
Response for APD recovery after APD timeout

If you are in the process of looking at or setting this up, be aware of the different scenarios VMCP addresses. With the aggressive setting, HA tries to restart virtual machines even if it does not know the current state of the other hosts; it could try to restart virtual machines when there is no path to the storage on any of the hosts. The more conservative setting, on the other hand, only restarts the virtual machines affected by the APD when HA has determined that another host is able to run them.
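The difference between the two approaches can be sketched as a small decision function. This is only my reading of the behavior, written in Python with hypothetical names; it assumes the conservative policy additionally requires that HA knows of a host able to take over the virtual machine, so check the vSphere documentation before relying on it.

```python
from enum import Enum


class ApdPolicy(Enum):
    """Hypothetical labels for the two restart approaches discussed above."""
    CONSERVATIVE = "conservative"
    AGGRESSIVE = "aggressive"


def should_restart(policy, vm_affected, failover_host_known):
    """Decide whether a VM would be restarted after the APD timers expire.

    vm_affected:         the VM sits on the datastore that is in APD
    failover_host_known: HA knows of another host able to run the VM
                         (assumption: only the conservative policy checks this)
    """
    if not vm_affected:
        return False                 # unaffected VMs are left alone
    if policy is ApdPolicy.AGGRESSIVE:
        return True                  # restart even without knowing other hosts' state
    return failover_host_known       # conservative: restart only with a known target
```

For example, `should_restart(ApdPolicy.AGGRESSIVE, True, False)` returns `True`, while the same call with `ApdPolicy.CONSERVATIVE` returns `False`, which is exactly the case where an aggressive restart might find no usable path to the storage.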

All in all, I believe this is a great and often-overlooked feature of vSphere 6.0. I for one am extremely excited to see VMware present a solution to the APD situation. Hopefully, it will be something the next generation of engineers won’t ever have to deal with.