Distributed Virtual Switch Failures: Failing-Safe

In my virtual environment recently, I experienced two major failures. The first was with the VMware vNetwork Distributed Switch and the second was related to the use of VMware vShield. Both led to catastrophic failures that could easily have been avoided if these two subsystems failed-safe instead of failing-closed. VMware vSphere is all about availability, but when critical systems like these fail, not even VMware HA can assist in recovery. You have to fix the problems yourself, usually by hand. Now that the problem has been solved and should not recur, I began to wonder how I missed this, which led me to the total lack of information on how these subsystems actually work. So without further ado, here is how they work and what I consider to be the definition of fail-safe.

There are three failure modes available for security tools:

  • Fail-Closed – This is generally the state you want for firewalls and many other security devices when they fail. This mode implies that no traffic will route through the device until the failure is corrected.
  • Fail-Open – This is generally the state you want for network switches. This mode implies that all traffic will be routed through the device; how it routes depends on what actually failed.
  • Fail-Safe – This is a growing concept that implies that should a device fail, another device or mechanism within the device picks up the traffic and sends it on to its destination with the proper checking and routing (see the sketch after this list).
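
To make these modes concrete, here is a minimal Python sketch of the decision a device makes about traffic when its primary inspection engine fails. The mode names come from the list above; the function, return values, and backup flag are purely my own illustration:

```python
from enum import Enum

class FailureMode(Enum):
    FAIL_CLOSED = "closed"   # drop everything until the device recovers
    FAIL_OPEN = "open"       # pass everything, uninspected
    FAIL_SAFE = "safe"       # hand traffic to a backup path that still inspects it

def handle_traffic(engine_healthy, mode, backup_available=False):
    """Decide what happens to traffic when the primary inspection engine fails."""
    if engine_healthy:
        return "inspect-and-forward"
    if mode is FailureMode.FAIL_CLOSED:
        return "drop"                          # firewalls: nothing gets through
    if mode is FailureMode.FAIL_OPEN:
        return "forward-uninspected"           # switches: keep traffic moving
    # Fail-Safe: a redundant engine or mechanism picks up the work with full checking
    return "inspect-and-forward-via-backup" if backup_available else "drop"
```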

Fail-Safe is the holy grail of security tools. They all want to fail-safe. This is achieved in hardware by having redundant systems that can take over the workload when the primary device fails. Any good security and networking plan builds for such failures. This is why we often see network switches in pairs, why load balancers are in use, why clustering technology is in use, etc.
Networking

Figure 1: Network Control Planes
However, when we enter a virtual environment our network is flattened, and no matter how we try to change this, the network remains flat. I have described in detail how the network stack works in several other articles (Blade Physical-Virtual Networking and Virtualization Security, Rethinking vNetwork Security), but now it is time to consider the VMware vNetwork Distributed Switch in more detail. In all of my other diagrams it is nothing more than a layer within the stack, but in reality it is much more than that.
In reality, the vNetwork Distributed Switch (vDS) control plane extends from the hypervisor into VMware vCenter, as shown in Figure 1. While the traditional VMware vSwitch’s control plane lives wholly within the hypervisor, the vDS’s does not. This implies that for the vDS to function properly, VMware vCenter must ALWAYS be running. This is a case of a Fail-Closed system. The following catch-22 can occur.
Problem
If the host running vCenter dies, you are using a vDS, and VMware vCenter itself is connected to that vDS, HA will effectively fail, as the vCenter VM cannot reboot on a new host. Why? Because VMware vCenter is not running, and VMware vCenter performs the port assignment of VMs to the vDS in use; only vCenter knows whether or not a port on the vDS is available.
Solution
The solution is a Fail-Safe style of solution. The vDS code within vCenter should download the current port assignments to each host so that HA can work, and should also assign each host participating in the vDS a port range that can be used for new VMs until vCenter comes back up, including a certain number of In Case of Emergency (ICE) ports over the normal allotment for a given host. For example, if you set up a vDS to allow only 10 ports and you have a 2-node cluster, the split may be 5 ports per host, but you may need all 10 on a single host if the other node does not boot. In this case the ICE allotment would be 2x the normal allotment, or 10 ports per node. Yes, this could exceed the 10 ports normally allowed on this vDS, but it would maintain a running virtual environment. How many ICE ports to reserve should be a configurable option; a sketch of this allocation logic follows.
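
To make the proposal concrete, here is a minimal Python sketch of the allocation logic I have in mind. Nothing here reflects what vCenter or the vDS actually does today; the function names and the 2x ICE multiplier are assumptions used only to illustrate the scheme:

```python
def allocate_vds_ports(total_ports, hosts, ice_multiplier=2):
    """Sketch of the proposed per-host port cache for a vDS.

    Each host caches its normal share of the configured ports plus an
    In Case of Emergency (ICE) cap it may grow to while vCenter
    (the vDS control plane) is unreachable.
    """
    normal = total_ports // len(hosts)        # e.g. 10 ports, 2 hosts -> 5 each
    ice_cap = normal * ice_multiplier         # e.g. 2x -> up to 10 per host in an emergency
    return {host: {"normal": normal, "ice_cap": ice_cap} for host in hosts}

def can_power_on(alloc, ports_in_use, vcenter_up):
    """A host may exceed its normal share, up to the ICE cap, only while vCenter is down."""
    limit = alloc["normal"] if vcenter_up else alloc["ice_cap"]
    return ports_in_use < limit

# Example: 10-port vDS on a 2-node cluster; one node is dead and vCenter is down.
ports = allocate_vds_ports(10, ["esx01", "esx02"])
print(can_power_on(ports["esx01"], 5, vcenter_up=True))    # False: normal allotment exhausted
print(can_power_on(ports["esx01"], 5, vcenter_up=False))   # True: ICE ports keep HA working
```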
If HA does not work, then we run into a serious catch-22. Currently the only way to avoid this problem is to ensure VMware vCenter is running on something besides a vDS and that, in an HA scenario, it boots first, followed by all other VMs, so that the vDS control plane is available to perform port assignments. In my particular case, I had to go into the host at the management console (Service Console in this case), reconfigure a VMware vSwitch to connect to the uplink used by the vDS, and move the VMware vCenter VM to this vSwitch. Thankfully, management appliance access was still available, so I was using both the vSphere Client and console access to fix this issue. The reason is that when you play around with the ports for the management console, you absolutely need to do this from the console, as you temporarily disrupt communication outside the host.
Security
Figure 2: VMware vShield in the Stack
The other issue I had was another availability problem, this time related to vMotion slowing to a crawl: it took 15-20 minutes to perform a vMotion. The reason for this was interesting, but once more came down to a Fail-Closed system. The security devices I was using, VMware vShield Manager and the VMware vShield Endpoint security module, were both on networks not seen by or available to the service console. Because of this, the vShield Endpoint vSCSI filter (VFILE) was not available. As you can see in Figure 2, VMware vShield Endpoint sends its data through the SCSI driver using the VFILE filter to the EPSec Transport layer and the EPSec Driver. The transport layer then routes the traffic to the EPSec virtual appliance (marked as DSVA) for consideration. The path of which I speak is the dashed red line.
If for any reason the DSVA (EPSec virtual appliance) is unreachable (or VMware vShield Manager for that matter), the VFILE filter times out and retries several times before giving up. By default, the timeout, the number of retries, or both are set quite high.
Problem
The problem occurs if virtual networking has an issue (such as the vDS Fail-Closed behavior discussed above), if the appliance dies for some reason, or if VMware vShield Manager becomes unreachable, as it is consulted for licensing checks. The culprit during a vMotion is that the virtual machine needs to be quiesced, which requires data to be written from memory to disk. Any file that is opened or closed within the virtual disk gets sent over the EPSec Transport by the VFILE filter, which caused all the timeouts I had seen.
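
The arithmetic behind the slowdown is simple. The sketch below is purely illustrative; the real VFILE timeout and retry counts are not documented, so the numbers are assumptions chosen only to show how a per-file retry budget multiplies out across the files touched during a quiesce:

```python
def worst_case_delay(files_touched, timeout_s, retries):
    """Each file open/close during quiesce goes over the EPSec Transport;
    if the DSVA is unreachable, every one of them burns the full retry budget."""
    per_file = timeout_s * retries
    return files_touched * per_file

# Illustrative numbers only: 60 touched files, a 5-second timeout, 3 retries.
total = worst_case_delay(files_touched=60, timeout_s=5, retries=3)
print(f"{total} seconds, or about {total / 60:.0f} minutes")   # 900 seconds, or about 15 minutes
```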
Solution
The solution to this problem is to use a different networking structure for the transport so that the EPSec virtual appliance does not need to be reachable over the network. VMsafe-Net does this for firewalls; EPSec should do this as well. It should also have an administratively settable timeout, and if the EPSec virtual appliance is not reachable, it should not retry in the midst of a vMotion: just sync the memory to disk, mark the files to be checked once the EPSec virtual appliance is reachable, and proceed. EPSec should NOT get in the middle of critical actions or impact them. This solution would make EPSec and vShield Manager failures Fail-Safe instead of Fail-Closed. Furthermore, licensing should be cached within the hypervisor for these and other functions for a short period of time. Once more, make use of ICE-style licensing, as you never know if the EPSec virtual appliance will be running on the host to which VMs have been moved.
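
A rough Python sketch of the defer-and-mark behavior I am describing follows. This is my own illustration of the design, not VMware’s implementation; the queue, function names, and return value are all invented for the example:

```python
import queue
import time

deferred_scans = queue.Queue()   # files to re-check once the DSVA is reachable again

def on_file_io(path, dsva_reachable, scan):
    """Fail-Safe interception: scan inline when the appliance is up,
    otherwise mark the file for later and let the I/O proceed."""
    if dsva_reachable:
        return scan(path)                      # normal inline check
    deferred_scans.put((path, time.time()))    # Fail-Safe: never block a quiesce or vMotion
    return "allowed-pending-scan"

def drain_deferred(scan, dsva_reachable):
    """Once the appliance is back, work through everything we let past."""
    while dsva_reachable and not deferred_scans.empty():
        path, _queued_at = deferred_scans.get()
        scan(path)
```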
Actually, when you use EPSec, VMware will insist the best practice is to have EPSec installed on every host in a cluster, but this does not account for disaster recovery situations where you JUST need to get things running. When I finally was able to remove VMware vShield from my hosts for further testing, I found that the VMs would not boot because the VFILE filter was no longer available. This is a Fail-Closed system; once more we need it to Fail-Safe and allow the boot, but start collecting files to check within the EPSec Driver installed within each VM.
Trend Micro has a Fail-Safe model within their Deep Security 7.5 product: if EPSec is not available, it switches off that aspect and uses the Deep Security Agent within the VM for all other aspects of security. Eventually Trend Micro will have a Fail-Safe mode for Anti-Malware as well. However, Deep Security makes use of vShield Endpoint and has all the limitations we have discussed above. This is not a Trend Micro issue, but a VMware vShield Endpoint issue. VMware vShield Endpoint must Fail-Safe.
More Networking
With Cisco pushing more of the vNetwork into hardware (see Cisco Pushing More vNetwork into Hardware), Fail-Safe within the virtual network becomes all-important. No one has said much about how this technology will work, other than that it is a paravirtualized driver based on VMXNET3. But these questions remain:

  • Does this technology make use of the VMware introspection APIs (VMsafe-Net)?
  • Will it be secured much the same way as virtualized components are today, will this truly be a physical-only security model, or will it be some combination?

I do not see this technology moving to a physical-only security model, considering VMware’s investment in VMware vShield Edge for use with vCloud Director, but rather a hybrid model: the vNIC still connects to the portgroup and from there may go direct to hardware, passing through the Cisco Nexus 1000V. As you can see in Figure 2, the purple dashed line shows this path. The VMsafe-Net introspection APIs would still come into play. What happens if the VMsafe-Net virtual appliance also dies? The third-party vendors solved this issue with a Fail-Safe model: they download the majority of the firewall rules into the VMsafe-Net driver installed on every host in a cluster rather than keeping them in the firewall virtual appliance. vShield Edge, App, and Zones do not do this. So when the vShield Edge or App virtual appliance dies, we are once more Fail-Closed, locking you out of the system. Perhaps with this change coming from Cisco, Cisco can work with VMware to create a hybrid hardware/software device that Fails-Safe as needed.
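
The third-party Fail-Safe model is easy to picture: the appliance distributes rules, but a copy lives in the kernel-resident VMsafe-Net driver on every host, so enforcement survives the appliance’s death. The sketch below is my own illustration of that idea; the data structures and function names are invented and do not represent any vendor’s actual driver:

```python
host_rule_cache = []   # rule copies held by the kernel-resident VMsafe-Net driver

def push_rules_from_appliance(rules):
    """The firewall appliance distributes the rule set; each host keeps a local copy."""
    global host_rule_cache
    host_rule_cache = list(rules)

def filter_packet(packet):
    """Enforcement uses the host-resident cache, so it does not depend on the appliance being up."""
    for rule in host_rule_cache:
        if rule["match"](packet):
            return rule["action"]              # 'allow' or 'drop'
    return "drop"                              # default deny, decided locally even if the appliance is dead
```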
Putting it Together
Critical systems such as networking and security within virtual and cloud environments need to Fail-Safe! Any product that does not have a fail-safe mode should be considered incomplete, which unfortunately includes many virtualized components that make up modern hypervisors, such as the VMware vNetwork Distributed Switch. I call on each vendor to improve their products to account for all failure modes. In a disaster, Fail-Safe could save you quite a bit of time and energy.

12 replies on “Distributed Virtual Switch Failures: Failing-Safe”

    1. Thanks David. I fixed this in my environment by using a standard VMware vSwitch for an administrative network, then setting the DRS/HA restart priority to High for vCenter and everything vCenter requires. That seems to have fixed the problem so far.
      Best regards,
      Edward

  1. AHHH! Ok. That makes sense. I usually already use standard vSwitches for Management and vmkernel ports. Adding a port group (like vsmgmt for vShield) sounds like it will help. I am just wondering how this situation will fare in a larger environment with a dedicated management virtual data center. I guess the management vDC would be better served with standard vSwitches.
    Dave

    1. Hello David,
      Yes, in an organization with a separate management cluster, that cluster would also have to use something besides a dvSwitch for vCenter given this issue.
      — Edward

  2. Hello David,
    Sorry to say that, but even with vCenter down, HA will occur, because even if the vDS control plane is not present, HA doesn’t care.
    It will start the VM because its network doesn’t change; it’s just impossible to make modifications or choose a dv-portgroup. I just tested it to be sure of what I’m saying.

    1. Hello,
      Absolutely, HA will occur as long as the service console or POSIX environment for ESXi can reach the outside; that depends on whether you have security measures within the environment, how your switching is configured, etc. The issue is not that HA does not work, it did in my environment; it is that no VM can boot unless the dvSwitch control plane is also available. Once vCenter is available, all is once more well with the world. HA, yes, it always works as expected, granted that other nodes can still access each other and the storage device, but the boot of the VMs is the concern.
      Best regards,
      Edward L. Haletky

    1. Hello,
      If the VM does not reboot on a new host, for all intents and purposes HA has failed, whether or not that is the root cause of the issue. Granted, in this case it is really an issue with vCenter not being available. The VMs just did not reboot regardless of what was in the .dvsData hierarchy. In essence, it looked like HA had failed; I agree it was not an issue with HA, it just looked like it.
      The key thing to remember is that all aspects of the environment MUST fail-safe; availability is crucial to a virtual environment. Tracking back to where the actual problem lies is sometimes a bit tricky. However, when this failure hit me 2 times in a row, it was time to change the design and go with something quite a bit different than originally desired, one that alleviates this problem altogether.
      Best regards,
      Edward L. Haletky

  3. Hi, did you play with the latest version of vShield App? I’m trying to gather information, as the concept has changed. I can’t find how to “not” protect port groups, and whether it’s possible for these unprotected VMs to keep their network connections when the local vShield agent on the ESX host fails. From a design perspective, it cannot be acceptable that unprotected VMs could be impacted by a fail-closed scenario!

    1. Hello,
      The VMsafe-Net driver for vShield App would need to know what to do if the vShield App appliance is not available: whether to fail-open or fail-closed. Since it is a firewall device, most such devices fail-closed. I do know there are a few vendors of VMsafe-Net tools that allow you the choice; I think Altor is one, and Reflex may be another (not positive here). I am unsure about the vShield App appliance, however, as I did not test that option. I would post this question in the VMware Communities vShield forum for more details.
      Best regards,
      Edward L. Haletky

  4. Just ran into this problem with the dvSwitch. This system had been built by a tech who is no longer here; vCenter and the DHCP server were both virtual and on the dvSwitch. The building was powered down, so we “found” the problem when powering back up. We finally got the port moved back to a standard vSwitch and vCenter back up and running. This article helped confirm the problem. Thanks.
