Many companies use some flavor of an agile methodology today, with mixed results. I have written about agile fail patterns in the past, but some companies do an excellent job of applying agile best practices yet still suffer from missed dates, botched deployments, and low quality. Why is that, you may ask? Because most agile methodologies only address the development side of the house and ignore the operations side. The two need to work in tandem to produce the desired results, which is the goal of DevOps.
When operations is not incorporated into the software development lifecycle (SDLC), the following patterns emerge:
Environment Hell
Systems administrators are often very talented, but reading developers’ minds is usually not a core competency. I have too often seen a development team do a great job of iterating through a sprint, only to be blocked by environment issues: either the environments are not provisioned in time, or they are not configured properly, or both. This happens because there has been little upfront communication with the systems administrators. The result is that as code moves from the developers’ laptops to the QA environment, things that worked in development stop working in QA. Time is then spent debugging the environment instead of the code. Once those issues are resolved, the code is deployed to the staging environment, and new issues emerge. You can only guess what happens when the code goes to production. Environment Hell is pure waste. All of the time spent configuring, deploying, debugging, reconfiguring, and redeploying can be eliminated by changing the SDLC to include the systems administrators as part of the team instead of as a silo that receives a handoff at the end of development.
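To make that idea concrete, here is a minimal sketch of one way to catch environment drift early: capture each environment’s configuration as data and diff it before code is promoted. The manifest file names and the JSON format below are assumptions for illustration, not the output of any specific tool.

```python
import json
import sys

def load_manifest(path):
    """Load a package -> version manifest captured from an environment."""
    with open(path) as f:
        return json.load(f)

def diff_environments(dev, qa):
    """List packages whose versions differ or that exist in only one environment."""
    problems = []
    for pkg, version in dev.items():
        if pkg not in qa:
            problems.append(f"{pkg}: in dev ({version}) but missing from QA")
        elif qa[pkg] != version:
            problems.append(f"{pkg}: dev has {version}, QA has {qa[pkg]}")
    for pkg, version in qa.items():
        if pkg not in dev:
            problems.append(f"{pkg}: in QA ({version}) but missing from dev")
    return problems

if __name__ == "__main__":
    # The manifests are assumed to be exported by each environment's
    # provisioning tooling (e.g., a package list dumped to JSON).
    issues = diff_environments(load_manifest("dev-manifest.json"),
                               load_manifest("qa-manifest.json"))
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```

Run as a gate before each promotion, a check like this turns “works on my machine” surprises into an explicit, fixable diff.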
Doomsday Deployments
Inconsistent environments often lead to discovering issues in production that were not uncovered in testing. This can create a downward spiral of doom: critical bugs appear that need to be fixed immediately, and the team must either roll back the deployment or rush out emergency fixes. Either path can introduce new issues, which create more urgency to take more risks. It doesn’t need to end this way.
First, implement a process for creating consistent builds and environments. Continuous integration requires that developers check code into the trunk daily and that unit tests execute and pass before the code makes it into the build; CI ensures that the build always contains working code. Continuous delivery enforces consistency across environments: the build is deployed onto an automated, consistently configured environment, eliminating the configuration nightmares described above.
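As a rough illustration of that CI gate, the sketch below runs the unit test suite and only packages a build artifact when every test passes. The pytest invocation and the tar packaging command are assumptions; substitute whatever test runner and packaging step your pipeline actually uses.

```python
import subprocess
import sys

def run_tests():
    """Run the unit test suite; a non-zero exit code means a failing test."""
    result = subprocess.run([sys.executable, "-m", "pytest", "tests/"])
    return result.returncode == 0

def build_artifact():
    """Package the application only after tests pass (command is illustrative)."""
    subprocess.run(["tar", "czf", "app-build.tar.gz", "src/"], check=True)

if __name__ == "__main__":
    if not run_tests():
        print("Unit tests failed; build rejected.")
        sys.exit(1)
    build_artifact()
    print("Tests passed; build artifact created.")
```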
Second, the deployment itself must be totally automated. A famous quote I heard at PuppetConf this year was “If you can’t automate it, don’t do it.” Not only should the deployment be automated, but the rollback plan should be automated and tested as well. All of these tasks must be included as user stories within the sprints. The problem I often see with agile is that product managers think only about the business and development user stories and ignore the operations stories. Don’t make that mistake.
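Here is a hedged sketch of what an automated deploy-with-rollback step might look like: deploy, run a health check, and roll back automatically on failure. The deploy.sh and rollback.sh scripts and the /health endpoint are placeholders for whatever tooling the team actually uses (Puppet runs, orchestration scripts, and so on).

```python
import subprocess
import sys
import urllib.request

HEALTH_URL = "http://localhost:8080/health"  # placeholder health endpoint

def run(cmd):
    """Run a command and raise if it fails."""
    subprocess.run(cmd, check=True)

def healthy():
    """Return True if the service answers its health check."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    # deploy.sh and rollback.sh stand in for the team's real automation.
    run(["./deploy.sh", "release-candidate"])
    if healthy():
        print("Deployment healthy.")
    else:
        print("Health check failed; rolling back automatically.")
        run(["./rollback.sh", "last-good-release"])
        sys.exit(1)
```

The point is that the rollback path is code, exercised and tested like everything else, not a runbook someone improvises at 2 a.m.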
Running Blind
So now we’ve got the environments straightened out and the deployments automated. We are done, right? Wrong. Now we are simply running blind. We successfully wrote and deployed code and met our dates, but we have no clue whether what we deployed today will still be running tomorrow. What we need is proactive monitoring. Sure, we probably have infrastructure monitoring tools like Nagios running that warn us when we are running out of disk, memory, or CPU cycles. But none of those monitors tell us when our services are starting to degrade, when our queries are taking too long, or when we are sending too much data over the wire. What we need now is application monitoring.
The first step is to establish baseline metrics. For example, let’s say the expected performance of an API is 10ms. We may want to set a monitor that alerts us when the performance of that API is consistently off by 10% (sketched below). We can set up similar monitors to watch the performance of database writes and reads so that we can proactively tune the database before the problem is big enough to be noticed by users.

Application metrics are just as important. Let’s say that our system is a consumer-facing website, and history has shown that we should anticipate 10,000 concurrent users throughout the day. After our last deployment, our monitoring alerts us to the fact that we are seeing only 500 concurrent users. Something is seriously wrong, and we should investigate immediately. Maybe some new feature is not working properly. Maybe page load times have increased dramatically. Maybe the new workflow we just implemented is deterring users from the system. Whatever the problem is, if we had not alerted ourselves to the delta from our baseline metrics, we might go days without knowing that something is drastically wrong.

Maybe we should immediately roll back the deployment with our automated processes, restore the system to its last good state, and stand up a testing environment with the data we captured while the application was deployed so that we can troubleshoot the issues. We can make this call because we invested the time in automating and testing our rollback strategy, and it allows us to identify the problem without disrupting the business.
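Here is a minimal sketch of that baseline alert using the numbers above. The interpretation of “consistently” (five consecutive samples here), the sampling approach, and the hard-coded readings are all assumptions for illustration.

```python
BASELINE_MS = 10.0        # expected API response time from the example above
THRESHOLD = 0.10          # alert when consistently 10% over baseline
CONSECUTIVE_SAMPLES = 5   # "consistently" interpreted as 5 samples in a row

def should_alert(samples_ms):
    """True if the last CONSECUTIVE_SAMPLES readings all exceed baseline + 10%."""
    limit = BASELINE_MS * (1 + THRESHOLD)
    recent = samples_ms[-CONSECUTIVE_SAMPLES:]
    return len(recent) == CONSECUTIVE_SAMPLES and all(s > limit for s in recent)

# Example: latency creeping up after a deployment
readings = [9.8, 10.1, 11.5, 11.8, 12.0, 12.2, 12.5]
if should_alert(readings):
    print(f"ALERT: API latency above {BASELINE_MS * (1 + THRESHOLD):.1f}ms "
          f"for {CONSECUTIVE_SAMPLES} consecutive samples")
```

Requiring several consecutive samples over the limit keeps a single slow request from paging anyone, while still catching sustained degradation long before users complain.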
There are many great monitoring tools that inspect the entire application stack and help pinpoint bottlenecks. It is almost impossible to design the perfect system. Application monitoring will help the team identify application bottlenecks as traffic flows in and out of the system. Tuning the system is a practice that never ends and should be accounted for when product owners commit to the next sprint. This is another reason why operations must be included as part of the overall team, so that product and development can be educated about the underlying performance of the overall system. This proactive approach allows the team to continuously improve the system and reduces the odds of being totally blindsided by an unexpected system bottleneck.
Agile and Ops
Many implementations of agile methodologies focus solely on software development and exclude operations. Delivering software quickly does not have great value if the system is not reliable. Too often, deploying software is like rolling the dice. It should not be that way. By embracing a DevOps mindset where operations is baked into the product throughout the SDLC, deployments become more predictable, and quality and reliability improve greatly. Being agile is only half of the equation. Excelling in operations is the other half.