As we move through the year, there are monthly and quarterly upgrade cycles for our virtual and cloud environments. These are driven by security issues, routine refreshes of hardware and software, and even application updates. Application updates are now continuous, delivered through continuous integration and deployment, while hardware and other upgrades arrive more slowly. Cloud upgrades can be incredibly impactful, as all subsystems need to be restarted. Yet there is a cycle to this: there is a need to control what is happening, and a need not to break compliance, security, data protection, or other policies.

The upgrade cycle for any part of our hybrid cloud is neither easy nor fully automated today. There are so many permutations that human oversight is often required. This needs to change. Even the upgrade cycles for our products need to be automated, especially in the cloud. When a system reboots, there needs to be a check that nothing is broken and that everything is working properly before moving on to the next system. Yet when I look at the upgrades I have done in the past, many were not fully automated.
The full upgrade cycle for a major release of a new hypervisor often involves the following (see the sketch after this list):

  • Upgrade some part of the management subsystem
  • Then, for each hypervisor:
    • If possible, migrate virtual machines off the hypervisor
    • Upgrade the hypervisor
    • Reboot the hypervisor
    • Upgrade the firmware of the physical box to required levels
    • Reboot once more
  • Upgrade the rest of the management subsystem
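
To make the per-hypervisor portion of this cycle concrete, here is a minimal Python sketch of a rolling upgrade loop with a health check gating progress to the next host. The helper functions (evacuate, upgrade_hypervisor, upgrade_firmware, reboot, healthy) are hypothetical placeholders for whatever management API or CLI your environment actually provides; only the control flow is the point.

```python
#!/usr/bin/env python3
"""Minimal sketch of a rolling hypervisor upgrade loop.

The helpers below are hypothetical placeholders for whatever management
API or CLI your environment provides; they are not real library calls.
"""
import time

HOSTS = ["hyp01.example.com", "hyp02.example.com", "hyp03.example.com"]

def evacuate(host: str) -> None:
    """Placeholder: live-migrate all virtual machines off the host, if possible."""
    print(f"evacuating VMs from {host}")

def upgrade_hypervisor(host: str) -> None:
    """Placeholder: apply the hypervisor upgrade."""
    print(f"upgrading hypervisor on {host}")

def upgrade_firmware(host: str) -> None:
    """Placeholder: bring the system firmware up to the required level."""
    print(f"upgrading firmware on {host}")

def reboot(host: str) -> None:
    """Placeholder: reboot the host and wait until it responds again."""
    print(f"rebooting {host}")

def healthy(host: str) -> bool:
    """Placeholder: verify the host and its workloads are working properly."""
    print(f"checking health of {host}")
    return True

def rolling_upgrade(hosts: list[str]) -> None:
    # One host at a time; a failed health check stops the rollout so a
    # bad upgrade never propagates across the whole cluster.
    for host in hosts:
        evacuate(host)
        upgrade_hypervisor(host)
        reboot(host)
        upgrade_firmware(host)
        reboot(host)
        if not healthy(host):
            raise RuntimeError(f"{host} failed its post-upgrade health check")
        time.sleep(5)  # settle time before moving on to the next host

if __name__ == "__main__":
    rolling_upgrade(HOSTS)
```

The key design choice is that nothing proceeds to the next host until the current one passes its health check, which is exactly the step human operators perform by hand today.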

This process is the same for cloud service providers as it is for organizational data centers. However, cloud service providers often don’t migrate the virtual machines off the hypervisor; instead, they shoot the host in the head, taking down each VM with the host.
In-cloud and on-premises approaches differ quite a bit here, and the in-cloud approach requires you to keep more resources in use within certain clouds so that workloads survive a host being taken down. Amazon and other cloud service providers address this with multiple availability zones within a region. Rolling upgrades of hardware and hypervisors within the cloud force organizations to rethink availability and resilience.
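As one hedged illustration, the sketch below uses boto3 to spread otherwise identical instances across several availability zones so a zone-level or host-level disruption does not take down every copy. The AMI ID, instance type, region, and zone names are placeholders to adapt to your own account.

```python
import boto3

# Placeholders: substitute your own AMI, instance type, region, and zones.
AMI_ID = "ami-0123456789abcdef0"
INSTANCE_TYPE = "t3.micro"
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

ec2 = boto3.client("ec2", region_name="us-east-1")

# Launch one instance per availability zone so any single zone (or the
# hosts behind it) can be rebooted without losing the whole service.
for zone in ZONES:
    ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
```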
One of the more interesting steps of any hypervisor upgrade, the system firmware upgrade, is also often ignored. Firmware upgrades are commonly seen as too difficult to perform and too hard to automate. For small environments this is fairly straightforward, but in large environments, unless there is a problem, firmware upgrades are routinely skipped. This can lead to issues down the line.
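A first step toward automating this is simply auditing firmware levels against policy. The sketch below shells out over SSH to dmidecode to collect BIOS versions and flags hosts below a required level; it assumes key-based SSH access and dmidecode on the hosts, and the host names and required version are placeholders. The version comparison is deliberately naive and would need a real version parser in practice.

```python
import subprocess

# Placeholders: your host list and the firmware level your policy requires.
HOSTS = ["esx01.example.com", "esx02.example.com"]
REQUIRED_BIOS = "2.7.1"

def bios_version(host: str) -> str:
    """Read the BIOS version via SSH; assumes key-based access and dmidecode."""
    out = subprocess.run(
        ["ssh", host, "sudo dmidecode -s bios-version"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()

for host in HOSTS:
    version = bios_version(host)
    # Naive string comparison; replace with proper version parsing.
    status = "OK" if version >= REQUIRED_BIOS else "NEEDS UPGRADE"
    print(f"{host}: BIOS {version} ({status})")
```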
Recent reboots within Amazon and SoftLayer were due to issues within the Xen hypervisor. Every system running a Xen hypervisor was rebooted. This is not something we can control within the cloud, so we need to control availability by using multiple clouds, multiple zones, or hybrid clouds.
Does this change with containers? Not really: containers still run on an OS, no matter how small, and that OS, or the underlying container host, needs patches. However, with containers it may be far easier to deploy a new instance, start new containers, and then shoot the older ones in the head. This is often referred to as the “pets vs. cattle” approach. However, this really only works when the containers run within virtual machines, not on physical hardware. With physical hardware you are limited by the number of chassis currently available; it is far easier to deploy virtual machines than new hardware just for upgrades.
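To illustrate the cattle approach at the container level, the sketch below uses the Docker SDK for Python to start a replacement container from a newer image and then stop and remove the old one, rather than patching in place. The image tag and container names are placeholders, and a real rollout would health-check the new container before killing the old one.

```python
import docker

# Placeholders: the running service's container name and the newly built image tag.
OLD_CONTAINER = "web-old"
NEW_IMAGE = "registry.example.com/web:2.0"

client = docker.from_env()

# Start a replacement container from the new image first ("cattle":
# we never patch the running container in place).
new = client.containers.run(NEW_IMAGE, name="web-new", detach=True)

# In a real rollout you would health-check `new` here before proceeding.

# Then shoot the old container in the head and remove it.
old = client.containers.get(OLD_CONTAINER)
old.stop()
old.remove()
```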
The upgrade cycle continues with containers as it does for virtual machines. The underlying hardware (or hardware abstraction) still needs to be upgraded to meet policy. How that happens should be automated as much as possible.
Upgrade automation outside of application CI/CD is still in its infancy. We can deploy an application hundreds of times a day, but when it comes to upgrading the underlying layers of hypervisors and hardware, we do not do that very often, even now. We should, as year-long uptimes are NOT a badge of honor. They are a sign of poor security practices and a sign that things need to change.
How do you handle your firmware and hardware upgrades? Do you use the same process as you do for applications, or are these more manual?