The latest vSphere Upgrade Saga happened a week or so after my upgrade to vSphere 6.5. VMware vCenter just up and died on me. I looked at everything and eventually had to call in VMware Support. That is a rare action for me these days, but it is nearly impossible to debug vCenter without their help.
I had entered vNetwork Distributed Switch h-e-double hockey sticks! A broken vNetwork Distributed Switch (vDS) makes for a bad day for anyone.
The situation was exacerbated by the fact that I use every cluster service VMware provides: VSAN, DRS, HA, vMotion, and so on. When vCenter crashes, VSAN and HA keep working, but the rest do not.
Before calling VMware, I looked at the hosts and reset each management appliance. I pored over log files until they were coming out of my ears, and still nothing showed up as the culprit. Plenty of false positives, but no smoking gun. The vCenter logs in /var/log/vmware/vpxd/vpxd.log had a large number of errors related to:
duplicate name com.vmware.vsan.clusterstate
This led me down the path of it being a clustering problem. Well, of course. So what do you do? No cluster services can be managed without vCenter. A call to VMware, several hours, and a batch of sanitized logs later, we came to the conclusion that it was indeed clustering, so all the cluster services had to be disabled.
We disabled HA, Predictive HA, VSAN, and every vDS. Yep, we disabled all the vDS in the cluster by migrating their uplinks and VMs to a vSS. I used the following PowerCLI commands to aid in that action, after first creating the standard virtual switches and appropriate portgroups within vCenter:
PowerCLI C:\> $vmnic = "vmnicX"
PowerCLI C:\> Get-VMHost | Get-VMHostNetworkAdapter -Physical -Name $vmnic | Remove-VDSwitchPhysicalNetworkAdapter
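For completeness, here is a minimal sketch of the per-host vSS and portgroup creation that precedes the uplink move; the host, switch, portgroup, and VLAN names below are placeholders, so adjust them to your environment and repeat for each host:

PowerCLI C:\> $esx = Get-VMHost -Name "esxi01.example.com"
PowerCLI C:\> $vss = New-VirtualSwitch -VMHost $esx -Name "vSwitch1"
PowerCLI C:\> New-VirtualPortGroup -VirtualSwitch $vss -Name "NewPortGroup" -VLanId 100
PowerCLI C:\> # Once the vmnic has been detached from the vDS (above), attach it to the new standard switch
PowerCLI C:\> Add-VirtualSwitchPhysicalNetworkAdapter -VirtualSwitch $vss -VMHostPhysicalNic (Get-VMHostNetworkAdapter -VMHost $esx -Physical -Name $vmnic) -Confirm:$false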
Then, to move VMs between old and new portgroups:
PowerCLI C:\> $OldNetwork = "OldPortGroup"
PowerCLI C:\> $NewNetwork = "NewPortGroup"
PowerCLI C:\> Get-VM | Get-NetworkAdapter | Where {$_.NetworkName -eq $OldNetwork} | Set-NetworkAdapter -NetworkName $NewNetwork -Confirm:$false
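Before moving on, it is worth confirming nothing was missed. A quick check, using the same variables, lists any adapters still pointing at the old portgroup; an empty result means the move is complete:

PowerCLI C:\> Get-VM | Get-NetworkAdapter | Where {$_.NetworkName -eq $OldNetwork} | Select-Object Parent, Name, NetworkName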
Then, to move a specific resource pool of VMs to the new portgroups:
PowerCLI C:\> $OldNetwork = "OldNetwork"
PowerCLI C:\> $NewNetwork = "NewNetwork"
PowerCLI C:\> $ResourcePool = "ResourcePool"
PowerCLI C:\> Get-ResourcePool -Name $ResourcePool | Get-VM | Get-NetworkAdapter | Where {$_.NetworkName -eq $OldNetwork} | Set-NetworkAdapter -NetworkName $NewNetwork -Confirm:$false
When all VMs were on a vSS and no vDS had a vmnic attached, vCenter stayed up. Granted, DRS was still configured at this time.
2nd Try!
I really wanted a vDS, so once more I tried to enable it. After many tests, I came to the conclusion that I could have VSAN, DRS, and HA or I could have VSAN and vDS, but not HA or DRS with a vDS. This had me puzzled.
In between the first try and the second try, I had migrated my vCenter with an embedded Platform Services Controller to one with an external Platform Services Controller. More on that in a different post. I also recreated vCenter from scratch: I redeployed the appliance, re-added each host, and recreated my resource pools and cluster. This still did not help, which implied the problem was something external to vCenter, perhaps on each node. Since no vDS was in use, I did the following:
- Remove .dvsData folders per http://www.davidhill.co/2010/07/what-is-the-purpose-of-dvsdata/
- Re-initialize the local vDS database on each host per http://buildvirtual.net/utilize-net-dvs-to-troubleshoot-vnetwork-distributed-switch-configurations using the following commands on each host:
[root@hostname:~] net-dvs
[root@hostname:~] net-dvs -l   # list
[root@hostname:~] net-dvs -i   # reinit
[root@hostname:~] net-dvs -T   # test
Sidebar Issue
Then I ran into another issue, hopefully unrelated: the VCSA root password needed to be reset. Thankfully, the steps were easy and can be found at https://www.vladan.fr/how-to-reset-root-password-in-vcenter-server-appliance-6-5/. For my environment, I replaced Step 3 and Step 4 with the following, as I also wanted to clear the failed login attempts for root:
root [ / ]# pam_tally2 --user root -r   # reset the failed-login counter for root
root [ / ]# pam_tally2 --user root      # verify the counter is back to zero
root [ / ]# umount /
root [ / ]# reboot -f
Back to vDS
Now I was ready to recreate a vDS. The first thing I did was create it in vCenter without attached hardware. That worked. It actually ran for three days with no problems. However, on the third day, I attached a vmnic to the vDS as an uplink and BOOM! Within minutes, vCenter once more crashed.
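For reference, attaching the uplink can be done in the vSphere Web Client or scripted. A rough PowerCLI equivalent, assuming the host is already a member of the vDS and using placeholder switch, host, and vmnic names, looks like this:

PowerCLI C:\> $vds = Get-VDSwitch -Name "vDS0"
PowerCLI C:\> $pnic = Get-VMHostNetworkAdapter -VMHost "esxi01.example.com" -Physical -Name "vmnicX"
PowerCLI C:\> Add-VDSwitchPhysicalNetworkAdapter -DistributedSwitch $vds -VMHostPhysicalNic $pnic -Confirm:$false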
Remove the vDS within the roughly thirty-second window allowed, and everything comes streaming back just fine. Disable both HA and DRS and, voilà, it also works. Disable just HA and BOOM! Disable just DRS and BOOM!
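Given how short that window is, it helps to pre-stage the tear-down as a one-liner that can be fired the moment vCenter responds again. A sketch, using a placeholder switch name (pulling just the uplink with Remove-VDSwitchPhysicalNetworkAdapter, as shown earlier, is the gentler option):

PowerCLI C:\> Get-VDSwitch -Name "vDS0" | Remove-VDSwitch -Confirm:$false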
Now I have a new Platform Services Controller, a new vCenter Server Appliance, and a newly minted, recreated cluster. Questions arise: Is it the hardware? But then, how would that affect vCenter, given that it is vCenter that sends the DRS and vDS configuration data down to the vSphere hosts?
Solution
A new crash, this time with much more of the environment newly rebuilt and error free, was sent to VMware under my existing support case. Within twenty-four hours of receiving the new information (and a call to VMware Support management), I had an answer. It is not a very good answer, but it works.
A VM that I had recently upgraded from Windows 7 to Windows 10 apparently had a bad network descriptor. In fact, when I looked at the VMX file itself, there was no attached portgroup. I imagine the portgroup definition was corrupted within vCenter, and since the two did not match, the VM caused DRS calculations and assessments to fail.
A few things to note: I did see that my Veeam backup tool could no longer communicate with the VM. I also saw communication issues within Horizon View, but I attributed those to the upgrade, not to the network mismatch failure.
I did ask how to scan for such issues, a VM health check so to speak, and there was nothing available. The best I could think of was to create a plugin for Alan Renouf's (and others') vCheck script. If anyone has the time to write the plugin, let me know!
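As a starting point, here is a rough sketch of what such a check might look like in PowerCLI. It simply flags any VM network adapter whose portgroup name is blank or does not match a portgroup vCenter knows about; the logic is my assumption of what a "bad network descriptor" looks like, not a tested vCheck plugin:

PowerCLI C:\> # Gather every portgroup name vCenter knows about (standard and distributed)
PowerCLI C:\> $portGroupNames = @(Get-VirtualPortGroup | Select-Object -ExpandProperty Name) + @(Get-VDPortgroup | Select-Object -ExpandProperty Name)
PowerCLI C:\> # Flag any VM adapter whose portgroup is blank or unknown to vCenter
PowerCLI C:\> Get-VM | Get-NetworkAdapter | Where { -not $_.NetworkName -or ($portGroupNames -notcontains $_.NetworkName) } | Select-Object Parent, Name, NetworkName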