The old way of delivering software was to bundle it up and ship it, sell it off the shelf, or let customers download and install it. In the “shipping model”, it was the buyer’s responsibility to install the software, manage uptime, patch, monitor, and manage capacity. Sometimes the buyer performed all of those tasks themselves; sometimes they hired a third party to handle it for them. In either case, the buyer of the software had total control over if and when the software was updated, and at what time a planned outage would occur in order to perform the patches or upgrades.
The Dilemma
Fast-forward to today, where software is delivered in an “elastic cloud” model and the services are always on. Cloud service consumers (buyers) now live in a multi-tenant world where they no longer have control over when updates occur. The challenge this presents to cloud service providers (sellers) is that there is never a good time for a planned outage anymore: customers are scattered across many time zones, and each customer has their own maintenance windows that never line up with a time that is good for the CSP. So how do we deal with this dilemma?
Continuous Operations
Gartner defines continuous operations as “those characteristics of a data-processing system that reduce or eliminate the need for planned downtime, such as scheduled maintenance. One element of 24-hour-a-day, seven-day-a-week operation”. Continuous operations manages software and hardware changes in a way that is non-disruptive to end users. Even though software and servers may be taken offline during planned maintenance, it is done in a way where customers continue to be served by the previous version of the application and are switched over to the newer version only once it has been deployed and successfully smoke tested.
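As a rough illustration, the “switch only after a successful smoke test” gate can be as simple as a scripted health check against the freshly deployed environment. This is just a minimal sketch; the URL and `/healthz` endpoint below are hypothetical, not from any particular platform.

```python
import urllib.request

def smoke_test(base_url: str) -> bool:
    """Return True if the freshly deployed environment answers its health check."""
    try:
        with urllib.request.urlopen(f"{base_url}/healthz", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    # Hypothetical internal URL for the newly deployed environment.
    if smoke_test("https://new-release.internal.example.com"):
        print("Smoke test passed - safe to start switching customers over.")
    else:
        print("Smoke test failed - keep serving customers from the current version.")
```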
To accomplish non-disruptive upgrades like this, a high level of automation is required. Organizations should already have mastered continuous integration and continuous delivery, and may also have already implemented continuous deployments (see definitions in my previous post called How DevOps can turn Fragile to Agile). This is what DevOps is all about. As I have mentioned many times, DevOps is not a role, it is…
a culture shift or a movement that encourages great communication and collaboration (aka teamwork) to foster building better-quality software more quickly and more reliably.
So how can we deliver code with zero downtime and no user interruptions?
Canary Releases
In the book “Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation” by Jez Humble and David Farley, the authors highlight that canary releases offer the following advantages for zero-downtime deployments:
- Easy to roll back
- Useful for A/B testing
- Low-risk approach to testing capacity and performance
Here is how it works. A new production environment is provisioned alongside the existing production system, also known as the baseline system. A small percentage of traffic is routed to this new environment to validate the new software release.
I have seen the following strategies used to partition traffic to the new production environment (see the sketch after this list):
- Internal user acceptance testers
- Beta test groups
- A small percentage of real production users selected randomly
- A specific demographic of users (location, customer segment, etc.)
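A minimal sketch of how such a traffic split might be implemented is below. The function name, percentage, and beta-tester handling are hypothetical; the point is that hashing the user ID keeps each user pinned to the same environment across requests.

```python
import hashlib

def route(user_id: str, canary_percent: float = 5.0,
          beta_testers: frozenset = frozenset()) -> str:
    """Return 'canary' or 'baseline' for a given user."""
    if user_id in beta_testers:
        # Beta test groups and internal acceptance testers always see the canary.
        return "canary"
    # Hash the user ID into a stable bucket from 0-99, then compare to the
    # configured canary percentage for a small, random-looking slice of users.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "baseline"

if __name__ == "__main__":
    print(route("user-42"))                                        # baseline or canary
    print(route("qa-internal-1", beta_testers=frozenset({"qa-internal-1"})))  # canary
```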
Regardless of how the group of users or user sessions is chosen, those sessions run independently on the new infrastructure with the newest release. Metrics are gathered and compared against baseline metrics to monitor the behavior of the new release. Once the new release is certified and stable, the remaining sessions are routed to the new environment and the baseline environment is taken offline but not deleted, in case a rollback is needed later on. If there are issues with the release, traffic is no longer routed to the new environment and the release is postponed pending further investigation. This method is often called “fail fast and fail forward”. By failing forward I mean that no code is rolled back out of the production environment. Backing code out is risky, especially in highly distributed real-time systems. Instead, entire environments are removed and the baseline environment continues on as normal.
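To make the “certify or fail fast” decision concrete, here is a minimal sketch that compares canary metrics against the baseline and decides whether to promote the release or abandon the new environment. The metric names and thresholds are hypothetical placeholders, not taken from any specific monitoring tool.

```python
def evaluate_canary(baseline: dict, canary: dict,
                    max_error_increase: float = 0.001,
                    max_latency_ratio: float = 1.10) -> str:
    """Return 'promote' if the canary looks healthy, else 'abandon'."""
    error_delta = canary["error_rate"] - baseline["error_rate"]
    latency_ratio = canary["p95_latency_ms"] / baseline["p95_latency_ms"]

    if error_delta <= max_error_increase and latency_ratio <= max_latency_ratio:
        return "promote"   # route the remaining sessions to the new environment
    return "abandon"       # stop routing traffic; the baseline continues unchanged

if __name__ == "__main__":
    baseline = {"error_rate": 0.002, "p95_latency_ms": 180}
    canary = {"error_rate": 0.0025, "p95_latency_ms": 190}
    print(evaluate_canary(baseline, canary))  # -> promote
```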
The trick to pulling this off is to separate database changes from software changes. For example, if a database change is required, deploy only the new database structural change without the corresponding code change that leverages it. This requires that database changes be backwards compatible. The next deployment then includes the application code that takes advantage of the new database structure. Without this approach, rolling back to the baseline environment can be a complex and sometimes impossible exercise in manipulating data to accommodate different data models.
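As a toy illustration of what a backwards-compatible, additive schema change looks like (using an in-memory SQLite database purely to keep the example runnable; the table and column names are made up), the first release adds the new column without touching the code that reads the table, and the code that uses the column ships in a later release:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('alice')")

# Release N: add the new column only. It is nullable with a default, so the
# baseline code, which never references it, keeps working unchanged.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT DEFAULT NULL")

# Baseline code path: the original query still succeeds against the new schema.
print(conn.execute("SELECT id, name FROM users").fetchall())

# Release N+1: application code starts reading and writing the new column.
conn.execute("UPDATE users SET email = 'alice@example.com' WHERE name = 'alice'")
print(conn.execute("SELECT id, name, email FROM users").fetchall())
```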
Another tip for making this work is to strive for small releases that are deployed frequently. By keeping releases small and simple, changes can be added to the environment incrementally, lowering the risk and increasing the success rate of deployments. Customers get the benefit of frequent feature enhancements with no service interruptions. To see a real-life example of a canary release, read the latest Netflix post on continuous delivery that came out while I was writing this article.
I’ll be at Puppet2013 this week in SF if anyone would like to meet up and talk tech. I’ll be live-tweeting the conference on my Twitter handle @madgreek65.
Just curious to know how data integrity and data consistency are taken care of when a small amount of production data is routed to the latest environment. How is the data migration handled?
Both environments point to the same DB, so there is no data migration.