I spent two days at PuppetConf 2013 in San Francisco this week, and the common themes were automate everything, monitor everything, provide feedback early in the process, and focus on culture. All four of those topics aligned with the DevOps movement, with the goal of faster and more reliable deliveries. Companies that can deliver software more frequently with fewer issues have a competitive advantage over those who can’t.
Why is automation so important?
There are many reasons why automation makes sense, but the top two are quality and agility. Traditionally, IT has taken a waterfall approach to development where developers build software, perform some basic unit tests, and then throw it over the wall to QA for testing. QA finds bugs and sends them back to development for fixing. This process repeats, usually many times, until the software is deemed good enough to ship. The problem with this model is that we waste too much time going back and forth between development and QA, recoding, rebuilding, and retesting. This often leads to carrying forward many open bugs because there was not enough time to fix them all. Then the next set of work starts, and these open bugs get prioritized alongside the new features; some of them never get a high enough priority to be resolved. In essence, with each release we create more defects that never get fixed, leading to a degradation of quality over time.
Continuous integration attempts to fix this. With CI, developers are required to write unit tests, and an automated build process is created. Any time a developer checks in code, the unit tests are automatically executed, and if any of them fail, the build fails. The motto of CI is “carry no bugs forward”. This improves on two issues from the old model. First, fewer bugs reach QA, decreasing the back-and-forth time wasted on constantly fixing bugs. Second, the build only contains working code, greatly reducing the backlog of defects that carry over and never get fixed. By automating unit tests and the build process, we eliminate human error and improve time to market by eliminating waste (bottlenecks).
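To make the idea concrete, here is a minimal sketch of that kind of CI gate, assuming a Python project tested with pytest. The commands and the packaging step are illustrative stand-ins, not anything prescribed at the conference or tied to a particular CI tool.

```python
# Minimal sketch of a CI gate: every check-in runs the steps below,
# and the build fails the moment any step fails ("carry no bugs forward").
# Assumes pytest is installed and the project has a packaging step;
# both are hypothetical examples of the real commands your build would run.
import subprocess
import sys


def run_step(cmd: list[str]) -> None:
    """Run one build step and fail the whole build if it fails."""
    print(f"==> {' '.join(cmd)}")
    result = subprocess.run(cmd)
    if result.returncode != 0:
        print("Step failed: failing the build so no broken code moves forward.")
        sys.exit(result.returncode)


if __name__ == "__main__":
    run_step(["pytest", "-q"])          # unit tests must pass first
    run_step(["python", "-m", "build"]) # then produce the artifact
    print("Build succeeded: only working code goes forward.")
```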
Another traditional bottleneck in the software development life cycle (SDLC) is dealing with environmental issues. How many times have we seen a developer release a working piece of code into a QA, staging, or production environment, only to watch it fail immediately even though it worked perfectly in development? This happens when the setup and creation of the environments are not fully automated and repeatable. Environment hell is one of the biggest bottlenecks in many shops, and countless hours are wasted troubleshooting environment problems that broke working code.
Continuous delivery attempts to fix this. With CD, environments are built by automated processes. Code goes through the CI process (automated unit testing and build), then through automated acceptance testing, possibly even automated user acceptance testing, and is then deployed directly to an environment (dev, test, QA, stage, or prod) that is an exact match of every other environment in the SDLC. This greatly reduces testing time because it eliminates the pesky environment issues and removes the traditional wait for someone to manually create environments.
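Below is a toy sketch of what such a pipeline looks like end to end. The provision(), deploy(), and run_acceptance() functions are hypothetical stand-ins for your configuration management runs, deployment scripts, and acceptance suite; only the environment names come from the paragraph above.

```python
# Toy sketch of a delivery pipeline that promotes one artifact through
# identically built environments. Every environment is provisioned from the
# same automated definition, so what works in dev also works in prod.
from dataclasses import dataclass

ENVIRONMENTS = ["dev", "test", "qa", "stage", "prod"]


@dataclass
class Artifact:
    version: str


def provision(env: str) -> None:
    # Same automated recipe for every environment keeps them identical.
    print(f"provisioning {env} from the shared environment definition")


def deploy(artifact: Artifact, env: str) -> None:
    print(f"deploying {artifact.version} to {env}")


def run_acceptance(env: str) -> bool:
    print(f"running automated acceptance tests in {env}")
    return True  # placeholder result; a real suite would report pass/fail


def promote(artifact: Artifact) -> None:
    for env in ENVIRONMENTS:
        provision(env)
        deploy(artifact, env)
        if not run_acceptance(env):
            raise SystemExit(f"acceptance tests failed in {env}; stop the line")


if __name__ == "__main__":
    promote(Artifact(version="1.4.2"))
```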
So now we are building cleaner code and cleaner environments, but production deployments are still risky and error prone. Automating the deployment-to-production process is crucial. In my previous post I discussed canary releases, one of many methods of deploying with zero downtime. In this model, the entire deployment process is hands-off. A new environment is automatically provisioned, and the code is automatically deployed to it. A small subset of traffic is diverted to this new environment to perform what I call a pilot test. Monitoring systems collect information and compare the data against baseline metrics. If anomalies are encountered, an automated process stops routing traffic to the canary cluster. The team then resolves the issues and deploys a new build to try again. This is the “fail fast and fail forward” (FFFF) method. With FFFF, we release small chunks of code and pilot test them. If they work, we move all traffic to the new cluster and retire the old cluster. If they fail, we shut off the new cluster, fix the issues in development, and then run our CI, CD, and continuous deployment processes again.
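Here is a simplified sketch of that canary decision in Python. The metric names, thresholds, and routing functions are assumptions for illustration; in practice they would come from your monitoring system and your load balancer’s API.

```python
# Simplified canary decision loop: compare the canary cluster's metrics to a
# baseline and either shift all traffic to it or stop routing to it.
import random

BASELINE = {"error_rate": 0.01, "p95_latency_ms": 250}
TOLERANCE = 1.2  # allow the canary to be at most 20% worse than baseline


def collect_canary_metrics() -> dict:
    # Stand-in for querying the monitoring system for the canary cluster.
    return {"error_rate": random.uniform(0.0, 0.03),
            "p95_latency_ms": random.uniform(200, 400)}


def canary_is_healthy(metrics: dict) -> bool:
    return all(metrics[name] <= BASELINE[name] * TOLERANCE for name in BASELINE)


def route_all_traffic_to_canary() -> None:
    print("canary healthy: shifting 100% of traffic and retiring the old cluster")


def stop_canary_traffic() -> None:
    print("anomaly detected: stop routing to the canary, fix forward, redeploy")


if __name__ == "__main__":
    metrics = collect_canary_metrics()
    if canary_is_healthy(metrics):
        route_all_traffic_to_canary()
    else:
        stop_canary_traffic()
```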
All of this is only possible through automation. Manual tasks in processes like these create tremendous risk. Companies that have put these processes in place deliver software faster and more reliably, and waste little time dealing with huge backlogs of defects or troubleshooting environmental issues.
DevOps: How do we get there?
This all sounds great, but how can we change our organization to embrace this lean approach to software development? The one thing we should not do is create another silo called DevOps. What we need to do is break down silos and get development and operations working together more closely. Operations should be involved early in the sprint process alongside developers, and the two should collaborate on building and automating build and provisioning scripts. Both teams should be accountable for delivery from end to end. A great quote from Jez Humble at PuppetConf was “you can’t hire culture”. In other words, hiring “DevOps” people does not get you CI or CD. Instead, you must create a culture of collaboration and ownership so that everyone is bought into making better software by eliminating bottlenecks through automation. Another great quote from the conference was “if you can’t automate it, don’t do it”. Puppet Labs CTO Nigel Kersten’s motto was “automate the pain away”.
Summary
In closing, time to market is critical in this day and age. We live in a world of reusable cloud services where companies can get products to market faster than ever, with less effort than ever before. To stay competitive we must improve our SDLC and eliminate the bottlenecks within our processes. Identifying those bottlenecks and removing them is key to improving throughput. Automating everything that is repeatable is a must. Our goal should be to make everything we can repeatable so that it can be automated; anything that can’t be repeated should be analyzed to see if it is waste. Speed to market and reliability depend heavily on automation. Let’s look into our systems, identify the waste, and then automate the pain away.