Does your hardware keep up with technology? Technology advancements are moving at an incredible pace, with new and exciting features added in each new release or update of a product. Unfortunately, technology can outpace our physical hardware, which can leave us in a true troubleshooting nightmare. There is one specific example that I have seen a few times and is worth sharing.
Physical servers tend to get refreshed every three to five years or so, but backend storage hardware tends to have a longer life cycle, and that longer life cycle can leave our storage infrastructure in place for much longer than the physical servers themselves. Clients I have been working with have been busy upgrading and/or deploying the latest and greatest version of VMware vSphere against older storage hardware, and this is where the problems began.
VMware vSphere Storage APIs for Array Integration (VAAI) allow certain I/O operations to be offloaded from the ESXi hosts to the physical array itself. VMware first started working on VAAI back in 2008, and VAAI was initially implemented as vendor-specific commands, but VMware has been working with its partners and vendors to establish standards for the technology moving forward so the APIs would be available to all. VAAI is coming into its fifth year, but that is still far less than most storage array life cycles.
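If you want to see what ESXi thinks the array can do, the host reports the hardware acceleration status it has detected for each storage device. The lines below are a minimal sketch from the ESXi shell of a 5.x host; the naa identifier is just a placeholder for one of your own devices.

# Show the VAAI (hardware acceleration) status ESXi has detected for every device
esxcli storage core device vaai status get

# Or narrow it to a single device (placeholder identifier shown here)
esxcli storage core device vaai status get -d naa.60060160xxxxxxxxxxxxxxxxxxxxxxxx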
I have clients that have recently upgraded or deployed the latest and greatest version of vSphere on brand-new physical hardware but connected it to an "older" storage array. During the vSphere install VAAI is enabled by default, but unfortunately the older storage array does not support VAAI, and this is where the fun begins.
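The "enabled by default" part is easy to confirm for yourself: the VAAI block primitives live as advanced settings on each ESXi host. A quick check from the ESXi shell, using the standard advanced option paths, looks something like this.

# On a default install each of these reports an Int Value of 1,
# meaning the host will attempt the offload regardless of what the array supports
esxcli system settings advanced list --option /DataMover/HardwareAcceleratedMove
esxcli system settings advanced list --option /DataMover/HardwareAcceleratedInit
esxcli system settings advanced list --option /VMFS3/HardwareAcceleratedLocking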
At first, I uncovered SCSI reservation conflicts as well as long VMFS3 rsv time errors in the vmkernel log. What followed were latency errors, which also meant performance had deteriorated. Even though VAAI is enabled by default, ESXi is able to detect when VAAI is unsupported on a datastore and is supposed to fall back to non-VAAI commands, but in fact it will still try to use the VAAI commands. This is what can really create a big problem.
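One way to take VAAI out of the picture while you troubleshoot is to turn the primitives off on the affected hosts. This is only a sketch using the same standard advanced options shown above, not a blanket recommendation; set the values back to 1 to re-enable the offloads once the array side is sorted out.

# Disable full copy (XCOPY), block zeroing (WRITE SAME) and hardware-assisted locking (ATS)
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedMove
esxcli system settings advanced set --int-value 0 --option /DataMover/HardwareAcceleratedInit
esxcli system settings advanced set --int-value 0 --option /VMFS3/HardwareAcceleratedLocking

# Then keep an eye on the vmkernel log for the reservation and rsv messages mentioned above
grep -i "reservation" /var/log/vmkernel.log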
I have seen specific cases where the "older" storage array that did not support VAAI actually used some of the same commands for different tasks. That is where the "fun" can really begin. In one specific case, a VAAI command issued to the storage array caused the array to try to work out what it should do with the command, and from that, backpressure on the storage area network started. This backpressure slowly kept building, causing latency and connectivity issues to the LUNs themselves on multiple hosts, until finally the pressure built up enough that nearly all devices on the storage area network crashed with pretty blue or purple screens of death. Let me tell you, as you're watching the infrastructure crash and burn in front of your eyes, you develop a sick feeling in the pit of your stomach and you hope and pray the root cause does not come pointing back to you.
That was the worst case I have seen to date; more recently the issue was caught much earlier, thanks to the latency, reservation conflicts and rsv times showing up in the logs. The main point to take from this is that as technology becomes more converged and infrastructure life cycles do not line up, this can and will be a problem, now and tomorrow, as one technology blazes ahead of the others. It is something to be aware of when considering new technologies and/or features for your environment.