I run multiple iSCSI servers, ranging from HPE StoreVirtual (most trusted) and a Synology server (tertiary) to my own CentOS 7-based iSCSI server (least trusted). All run over 10Gb links. In general, iSCSI works quite well. But for some reason, my CentOS 7 iSCSI server would cause the management agents to fail and vSphere to disconnect from vCenter. This would go on until the iSCSI server was shut down. I use those 10TB of storage for testing data protection tools and for emergencies. This does not bode well for the continued support of generic iSCSI. This is also a vSphere Upgrade Saga entry.
Now a little more background:
- vSphere supposedly has a 30-minute or so timeout for iSCSI failures. I found this not to be the case: it never times out. That could be because the iSCSI server is responding, just not the way it should be. (See the sketch after this list for how to inspect the relevant initiator timeouts.)
- The iSCSI server actually runs on a KVM node connected via a 10G link to the 10G iSCSI network. Yes, it runs on KVM.
- The HPE StoreVirtual runs on multiple vSphere nodes.
- It did work for years with no issues.
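For what it is worth, you can inspect those initiator timeouts from the ESXi shell. A minimal sketch, assuming the software iSCSI adapter is vmhba64 (yours will differ, and the exact parameter names can vary by ESXi build):

```
# Find the software iSCSI adapter's vmhba name
esxcli iscsi adapter list

# Show the adapter-level parameters, which include values such as
# RecoveryTimeout, NoopOutInterval, and NoopOutTimeout
esxcli iscsi adapter param get --adapter=vmhba64
```

Whatever those values say, in my case the host never actually gave up on the misbehaving target.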
So, with that background, I could rule out KVM itself; it had been working for years. I could rule out the Open vSwitch running the network stack within KVM as well; it has been that way since day one. But other VMs had network problems, so I pinned the 10G network ports directly to the VM using SR-IOV. For a little while that alleviated the problem, but it did not eliminate it completely.
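For reference, the pinning amounts to carving virtual functions out of the 10G NIC and handing one to the guest as a hostdev-backed interface. A rough sketch, assuming a host interface named ens2f0, a VF at PCI address 0000:03:10.1, and a VM named iscsi-server (all hypothetical):

```
# Create virtual functions on the 10G port (hypothetical interface name)
echo 4 > /sys/class/net/ens2f0/device/sriov_numvfs

# Describe one VF as a hostdev-backed interface for the guest
cat > vf-nic.xml <<'EOF'
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0000' bus='0x03' slot='0x10' function='0x1'/>
  </source>
</interface>
EOF

# Attach it to the iSCSI server VM persistently (hypothetical VM name)
virsh attach-device iscsi-server vf-nic.xml --config
```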
In fact, the problem got so bad that the 10TB iSCSI server was offline until recently (at least one month), which means all my data protection test nodes were also offline. Now, however, it is back online.
I believe the culprit was actually oVirt and, more importantly, the VDSM daemon. I found that its fingers were in everything, most notably the startup of libvirtd. I wanted to use oVirt in self-hosted mode, and that apparently started a chain reaction.
oVirt made my Open vSwitch implementation unstable. As of the last update, libvirtd would not even start. Removing all of the oVirt and VDSM pieces fixed my generic iSCSI implementation running on KVM.
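The cleanup itself was not complicated. Roughly, on the CentOS 7 KVM host (package names are from memory and may not be exhaustive):

```
# Stop the VDSM daemons so they stop rewriting libvirtd's configuration
systemctl stop vdsmd supervdsmd

# Remove the oVirt/VDSM packages (wildcards catch the subpackages)
yum remove 'vdsm*' 'ovirt-*'

# Restart the pieces VDSM had its fingers in
systemctl restart openvswitch libvirtd
```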
This is related to vSphere only in that if access to your iSCSI server is iffy, the software iSCSI initiator vSphere 6.0 uses goes catatonic, hanging all the management agents.
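When that happens, the usual recovery, once the offending iSCSI server is off the network, is to restart the management agents from the ESXi shell (or via the DCUI). Roughly:

```
# Restart the ESXi management agents (hostd and vpxa)
/etc/init.d/hostd restart
/etc/init.d/vpxa restart
```

Be warned that with a hung iSCSI path still present, even these restarts can stall until the target is actually unreachable.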
To attempt a fix, I tried to use HPE StoreVirtual running on KVM, backed by a 1.2TB Fusion-io card. That unfortunately caused serious performance issues as the fioa_scan kernel process kicked in, taking almost all of the CPU available on the KVM host. I still do not know why the HPE StoreVirtual KVM VM has its CPU pegged, but as long as the fioa_scan kernel process does not take up all the CPU, all the other VMs on the node behave properly. However, I now also have a 1TB HPE StoreVirtual available on very fast storage. I just need to watch that kernel process a bit.
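Watching it is simple enough. Something along these lines, run on the KVM host, is all I mean by keeping an eye on it:

```
# Check whether the fioa_scan kernel thread is eating CPU on the KVM host
top -b -n 1 | grep -i fioa
```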
Altogether, this is not a good state of affairs, so the move to oVirt will have to wait. Perhaps it is time to give HotLink another try instead.