My infrastructure recently underwent a catastrophic failure from which recovery was more tedious than difficult. An iSCSI server running as a VM on a KVM host that uses Open vSwitch decided to go south. Why? I am still trying to figure that out. The long and short of it is that Open vSwitch running on a CentOS 7 KVM host has a pretty major performance glitch. I hope to fix that soon. So, what happened?
Simply put, I kicked off a storage rescan against the iSCSI server, and my vSphere host started to lag badly. None of the management tools were accessible, including the management console on the box itself: once you logged in, you could not do anything else. It looked as if hostd had died (it had not). The right move was to shut down the iSCSI server and wait until everything cleared up, but once more impatience got in the way, and I shut down the vSphere node instead. Since the host was lagging, though, the normal HA activities just did not happen; even HA communication was broken.
The node rebooted, and the services were there once more. Thinking the problem was specific to that node, I tried again from a different node. Same result, except this one would not reboot at all: it was stuck. I thought it was VSAN at first, but alas, it was not. Shutting down the iSCSI server alleviated the serious lag within the host.
The tedious part was the need to go onto each storage device attached to the remaining hosts, manually add the VMs back to hosts, and get them booted. Once more, HA did not do its stuff. Tedious, but doable; the steps were as follows:
- Connect the vSphere Client directly to vCenter or to a host (or even to several hosts using multiple clients) if vCenter is not running
- Go to Storage Configuration and view the datastores
- Find the directory for each VM that is no longer running
- Double-click the .vmx file to register the VM with the host
- Boot the newly registered VM and, when prompted, tell vSphere that you “Moved” it
Do this first for vCenter if necessary, and then switch to using vCenter directly. I did vCenter second, as I had to bring my primary firewall up first so that I still had network connectivity. Even though we have all the tools in the world and all the automation we can use, sometimes we just have to go back to doing this by hand (though the same steps can be scripted, as sketched below).
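For the record, here is a minimal sketch of those steps using pyVmomi: it registers an orphaned .vmx directly against a host, powers it on, and answers the “moved or copied” question. The host name, credentials, VM name, and datastore path are placeholders for my lab, and error handling is kept to a minimum.

```python
# Minimal pyVmomi sketch of the manual recovery steps above. Host name,
# credentials, and the datastore path are placeholders.
import ssl
import time

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab host with a self-signed certificate
si = SmartConnect(host="esxi01.example.lab", user="root", pwd="secret", sslContext=ctx)
content = si.RetrieveContent()

# A standalone host presents a single datacenter with one compute resource.
dc = content.rootFolder.childEntity[0]
cr = dc.hostFolder.childEntity[0]
host, pool = cr.host[0], cr.resourcePool

def wait_for(task):
    """Poll a vSphere task until it finishes and return its result."""
    while task.info.state not in (vim.TaskInfo.State.success, vim.TaskInfo.State.error):
        time.sleep(1)
    if task.info.state == vim.TaskInfo.State.error:
        raise task.info.error
    return task.info.result

# Register the .vmx found while browsing the datastore.
vm = wait_for(dc.vmFolder.RegisterVM_Task(
    path="[datastore1] someVM/someVM.vmx", asTemplate=False, pool=pool, host=host))

# Power on; the task blocks on the "moved or copied?" question, so poll for
# the question and answer "I Moved It" to keep the UUID and MAC addresses.
vm.PowerOnVM_Task()
for _ in range(30):
    q = vm.runtime.question
    if q:
        moved = next(c.key for c in q.choice.choiceInfo if "moved" in c.label.lower())
        vm.AnswerVM(questionId=q.id, answerChoice=moved)
        break
    time.sleep(2)

Disconnect(si)
```

Pointing the same script at vCenter instead of an individual host works the same way once vCenter is back.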
Once I killed the offending iSCSI server, the non-booting node came up, which pointed squarely at the iSCSI server as the cause.
The fix was to bypass Open vSwitch. I used SR-IOV within KVM to map the 10 GbE Intel network card directly into the VM, and voilà, iSCSI worked once more. This implies a KVM networking issue, and in my case the KVM network path is 100% Open vSwitch. That bothers me, as Open vSwitch is the basis for multi-hypervisor VMware NSX, OpenStack Neutron, and even Arista switches. Was this a problem with my specific Open vSwitch configuration, such as bonding?
My first attempted fix was to remove the bond; the problem persisted. My second was to ensure LACP was set up properly on the bond; that did not fix the problem either. The only solution that seemed to work was to bypass Open vSwitch altogether.
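For completeness, this is roughly what those two attempts looked like on the KVM host. It is only a sketch: the bridge (br0), bond (bond0), and uplink names (eth0 and eth1) are assumptions from my setup, and the real work is done by the standard ovs-vsctl and ovs-appctl tools, with Python merely sequencing the calls.

```python
# Sketch of the bond/LACP experiments against Open vSwitch. Bridge, bond, and
# NIC names are assumptions; assumes bond0 already exists on br0.
import subprocess

def run(*cmd):
    print("+ " + " ".join(cmd))
    print(subprocess.check_output(cmd).decode())

# First attempt: drop the bond and fall back to a single uplink.
run("ovs-vsctl", "del-port", "br0", "bond0")
run("ovs-vsctl", "add-port", "br0", "eth0")

# Second attempt: rebuild the bond with active LACP and L4-hash balancing.
run("ovs-vsctl", "del-port", "br0", "eth0")
run("ovs-vsctl", "add-bond", "br0", "bond0", "eth0", "eth1",
    "lacp=active", "bond_mode=balance-tcp")

# Verify what OVS actually negotiated with the physical switch.
run("ovs-appctl", "bond/show", "bond0")
run("ovs-appctl", "lacp/show", "bond0")
```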
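And this is roughly what the bypass looks like, sketched with the libvirt Python bindings. It assumes SR-IOV virtual functions are already enabled on the Intel NIC (via its sriov_numvfs entry in sysfs), that the guest is named iscsi-server, and that 0000:03:10.0 is one of the resulting VFs; all of those names are placeholders for my environment.

```python
# Sketch of the SR-IOV bypass: attach a virtual function of the Intel 10 GbE
# NIC directly to the iSCSI guest so its traffic never touches Open vSwitch.
# The domain name and the VF's PCI address are placeholders.
import libvirt

VF_XML = """
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x10' function='0x0'/>
  </source>
</interface>
"""

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("iscsi-server")

# Persist the device in the domain definition, and hot-plug it if running.
flags = libvirt.VIR_DOMAIN_AFFECT_CONFIG
if dom.isActive():
    flags |= libvirt.VIR_DOMAIN_AFFECT_LIVE
dom.attachDeviceFlags(VF_XML, flags)
conn.close()
```

The trade-off, of course, is that a VF handed straight to the guest no longer participates in the Open vSwitch bridge, so bonding, mirroring, and flow rules no longer apply to that traffic.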
After a bit of research, I discovered that if I waited thirty minutes or so, the node (and indeed the initial cause of the issue) would sort itself out, as vSphere would eventually time out and give up trying to attach to the iSCSI node. Thirty minutes is too long to wait; the time-out should be much shorter.
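If I do go hunting for a faster time-out, the knobs I would look at first are ESXi's all-paths-down (APD) handling and the software iSCSI adapter's session time-outs. Here is a read-mostly sketch, assuming it runs in the ESXi shell (which ships a Python interpreter) and that vmhba64 stands in for the software iSCSI adapter; the values mentioned are the stock defaults, not a recommendation.

```python
# Sketch of where to look for shorter storage time-outs, run from the ESXi
# shell. vmhba64 is a placeholder for the software iSCSI adapter.
import subprocess

def esxcli(*args):
    cmd = ("esxcli",) + args
    print("+ " + " ".join(cmd))
    print(subprocess.check_output(cmd).decode())

# All-Paths-Down handling: confirm it is enabled and see how long the host
# waits before fast-failing I/O to a dead device (default 140 seconds).
esxcli("system", "settings", "advanced", "list", "-o", "/Misc/APDHandlingEnable")
esxcli("system", "settings", "advanced", "list", "-o", "/Misc/APDTimeout")

# Software iSCSI session time-outs on the adapter itself.
esxcli("iscsi", "adapter", "param", "get", "--adapter=vmhba64")
# Example change: lower the session recovery time-out (default is 10 seconds).
# esxcli("iscsi", "adapter", "param", "set", "--adapter=vmhba64",
#        "--key=RecoveryTimeout", "--value=10")
```

None of this changes the root cause, but it should shrink the window during which a dead iSCSI target can hang the management stack.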
Disturbingly, a standard Open vSwitch configuration had issues. I noticed performance problems between VMs on the same KVM host, but not between hosts. The node is definitely not overloaded, so now I need to consider whether the default configuration needs to be tweaked.
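When I do that tuning pass, the first thing I plan to check is whether inter-VM traffic is staying in the OVS kernel datapath or constantly missing the flow cache and being punted up to userspace. A hedged sketch of the read-only checks on the KVM host:

```python
# Read-only diagnostics: is inter-VM traffic being switched in the kernel
# datapath (flow-cache hits), or is it missing and going up to ovs-vswitchd?
import subprocess

def run(*cmd):
    print("+ " + " ".join(cmd))
    print(subprocess.check_output(cmd).decode())

run("ovs-vsctl", "show")        # bridge/bond layout as OVS sees it
run("ovs-appctl", "dpif/show")  # which bridges map to which datapath
run("ovs-dpctl", "show")        # kernel datapath stats: hit/missed/lost lookups
```

A large "missed" or "lost" count relative to "hit" in the datapath statistics would point at flow-setup overhead rather than raw NIC throughput.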