I was recently called in to examine performance issues on a VMware VDI cluster. I was told that all the hosts in the cluster would run at 100% CPU utilization for an extended period of time, and the client wanted an explanation and a recommendation. I had a good idea what the problem was before I ever started looking at the hosts. I know this topic has been covered many times before, but it does not seem to have been covered enough.
There are two surefire ways to bring a virtual host to its knees as far as CPU utilization goes: backups and virus scanning. They are the two biggest performance killers, and they are used in just about every infrastructure out there. When backups and virus scans kick off on all, or a good number, of your virtual machines at the same time, the host will slow to a crawl, to say the least. I have seen hosts in a cluster report 101% and 102% CPU utilization when the backup cycle kicked off on a heavily loaded cluster.
In this specific case, it was a four-node VDI cluster running around 300 virtual machines. Every Tuesday night the virus scan would kick off and spike the CPU utilization to the point that the cluster was basically unresponsive for over twelve hours. Using the performance graphs from vCenter, I was able to look at the CPU performance for the past two weeks and see what was going on.
From the graph I could see that the issue happened on Tuesdays, and I also knew that the client had recently upgraded their antivirus software. After some research, I noticed that the spikes were highest after the upgrade was completed. Below is the CPU performance from the last two months, and you can tell when the upgrade happened.
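If you would rather confirm a weekly pattern like this from numbers than by eye, here is a minimal Python sketch. It assumes you have already exported CPU samples as (timestamp, percent) pairs; the function names and the sample values are made up for illustration, and nothing here talks to vCenter.

```python
from collections import defaultdict
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def average_cpu_by_weekday(samples):
    """Return the mean CPU percentage for each weekday present in samples.

    samples is an iterable of (datetime, cpu_percent) pairs.
    """
    totals = defaultdict(lambda: [0.0, 0])  # weekday name -> [sum, count]
    for timestamp, cpu_percent in samples:
        day = WEEKDAYS[timestamp.weekday()]  # weekday() returns 0 for Monday
        totals[day][0] += cpu_percent
        totals[day][1] += 1
    return {day: total / count for day, (total, count) in totals.items()}


def busiest_weekday(samples):
    """Return the weekday name with the highest average CPU percentage."""
    averages = average_cpu_by_weekday(samples)
    return max(averages, key=averages.get)


# Hypothetical samples: two Mondays and two Tuesdays at 10 PM.
samples = [
    (datetime(2024, 1, 1, 22, 0), 35.0),
    (datetime(2024, 1, 2, 22, 0), 98.0),
    (datetime(2024, 1, 8, 22, 0), 40.0),
    (datetime(2024, 1, 9, 22, 0), 100.0),
]
print(busiest_weekday(samples))
```

Averaging by weekday is a blunt tool, but it is enough to show whether one night of the week stands out from the rest.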
So the moral of this story is to stagger your backup and virus scanning tasks so that you do not overwhelm your hosts. Set up different start times for your backups and different days throughout the week for your virus scans. By doing this you will help keep your hosts and your environment performing optimally.
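To make the staggering idea concrete, here is a rough Python sketch that splits a list of virtual machines into batches and spreads the batch start times evenly across a window. The VM names, batch size, and window are assumptions for illustration only. The function just computes a schedule; you would still have to enter the resulting times in whatever backup or antivirus console you use.

```python
from datetime import datetime, timedelta


def stagger_start_times(vm_names, window_start, window_hours, batch_size):
    """Map each VM name to a start time, one evenly spaced slot per batch."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Split the VM list into consecutive batches of batch_size.
    batches = [vm_names[i:i + batch_size]
               for i in range(0, len(vm_names), batch_size)]
    if not batches:
        return {}

    # Spread the batch start times evenly across the window.
    step = timedelta(hours=window_hours) / len(batches)
    schedule = {}
    for index, batch in enumerate(batches):
        start = window_start + index * step
        for vm in batch:
            schedule[vm] = start
    return schedule


# Hypothetical example: 300 desktops, 75 per batch, over an 8 hour window.
vms = [f"vdi-{n:03d}" for n in range(300)]
schedule = stagger_start_times(vms, datetime(2024, 1, 2, 20, 0), 8, 75)
print(schedule["vdi-000"], schedule["vdi-075"], schedule["vdi-299"])
```

The same function works for virus scans if you pass a longer window, for example 96 hours with four batches, so that each batch lands on a different day.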