We live in interesting times. If I were to chart the increase in the number of customers asking for help with DevOps, that chart would look like a hockey stick, the same kind of hockey stick our CFOs are always dreaming of. If I added another line for the percentage of those companies that actually knew what DevOps was, it would be a flat line near the bottom of the chart. Everyone wants DevOps, but not everyone knows why, or exactly what DevOps means.
Many people, including myself, have written endlessly on this topic with a unified message: DevOps is not a role or a person. It is an approach geared toward building better software, faster, and more reliably. But to many higher-ups, the phrase “building better software, faster, and more reliably” sounds like they are buying yet another methodology, which they usually have little patience for. “I already have an agile shop” is a common response when I describe what DevOps is really all about. To explain the value of DevOps, you need to shift the conversation away from the “what” and drive home the “why.” And to understand the “why,” it is critical to also explain “why now?” What has changed that makes DevOps suddenly so important? Here is my attempt at answering that.
Why DevOps?
Many development shops have adopted agile methods or frameworks such as Scrum, Kanban, XP (Extreme Programming), and others with the goal of delivering software in smaller, incremental intervals. Years of death-march waterfall projects that dragged on for months, with missed dates, overrun budgets, and unsatisfied requirements, led many practitioners to search for answers. Agile methodologies, when managed properly, improve the software delivery process by delivering smaller change sets in shorter time frames. However, the shorter intervals often meant less up-front architecture and more frequent changes to the operating environment (servers, networks, deployments, disk, help desk, etc.).
At the same time, architectures were becoming increasingly complex and more distributed, creating a lot of pressure on administrators and operations to “keep the lights on” in the organization. As agile teams started aiming for more frequent deployments, the operations side of the house was blindsided with numerous last-minute requests for environmental changes. This led to inconsistent dev, test, and prod environments, kindly referred to as “Environment Hell,” which often had developers and testers chasing bugs that were the result of environment issues, not the code. Environment Hell created tons of waste in the development life cycle and often led to poorer-quality software than was being delivered in the pre-agile days.
Multiply these issues by the number of agile development teams, and now you have a high frequency of low-quality deployments happening all across the enterprise. The end result looks like a day in the life of Bill Palmer of Parts Unlimited, the fictional character in The Phoenix Project, before they put an end to their daily chaos.
DevOps is all about taking the waste (bottlenecks) out of the entire system (development and operations), so that frequent changes become a good thing, not a disruptive thing. To accomplish this admirable goal, a shift is required in the way a culture approaches delivering software and services. In the old model, developers owned writing code by a certain date, testers owned completing testing by a certain date, and operations owned the mess that was usually delivered to them at the end of that process. In the new model, there is a shared responsibility for the product (notice I did not say for the code). The silo method of dumping stuff over the wall from one group to the next gives way to a collaborative environment where the different domain experts work together throughout the software lifecycle to ensure that quality and reliability get delivered along with the agility.
What has changed?
Many people may say that the problems I mentioned above have been around for years, so why does this DevOps thing all of a sudden magically make things better? Well, here is what has changed now that cloud computing has entered the mainstream:
In the cloud, infrastructure has been abstracted and offered up to us as a collection of APIs: “Infrastructure as Code,” as some people like to call it. In the pre-cloud days, a significant amount of planning was required in order to procure and install physical hardware. In the cloud, virtual hardware can be provisioned in minutes.
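To make “Infrastructure as Code” concrete, here is a minimal sketch, assuming AWS and the boto3 Python SDK; the AMI ID, instance type, and tag value are placeholders for illustration, not a recommendation. The point is that a server that once required a purchase order now comes from an API call:

```python
# Minimal "Infrastructure as Code" sketch: provision a virtual server
# with an API call instead of a hardware procurement cycle.
# Assumes AWS credentials are already configured; the AMI ID and
# instance type are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder AMI
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "Name", "Value": "web-01"}],
    }],
)

print("Provisioned:", response["Instances"][0]["InstanceId"])
```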
Big deal, you might say. Well, it is a big deal. Now that infrastructure can be provisioned quickly and be automated with code, architects can and will solve technical problems much differently than in the past. Many cloud architectures are much more distributed in nature, taking advantage of many smaller nodes. In addition, it is now much more feasible to separate different layers of the architecture onto dedicated servers. For example, in the old days a typical infrastructure layout was geared toward huge, massively parallel boxes such as mainframes, minis, or large MPP boxes. Today’s cloud architectures look more like the diagram below, with dedicated web, app, cache, worker, and database farms, each scaling horizontally and independently of the others.
You can see from the diagram that today’s modern cloud architectures are made up of many independent parts. Even the classic three-tier architectures are much simpler than what we are deploying in the cloud.
These distributed architectures have some distinct pros and cons.
Pros:
- More scalable, including auto-scaling capabilities
- Loosely coupled
- Component isolation, which results in higher reliability
Cons:
- More complex
- Harder to manage
To address these complexity and manageability concerns, cloud architectures require a high level of automation at all layers. To automate everything in a complex environment, the people doing the automation must have a deep understanding of the overall system, and to gain that understanding, all participants in the system must be really good at collaborating and sharing information. Sharing this information late in the development lifecycle does not work with complex and highly distributed systems. The automation must be designed in from the beginning, which requires both dev and ops folks to architect systems together. What needs to be automated, you might ask? Almost everything: builds, tests, deployments, alerts, KPIs, SLAs, cloud cost accounting, server patching, you name it.
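As one small illustration of what “automate everything” can look like in practice, here is a hedged sketch of a scheduled health check that raises an alert before a customer notices a problem; the endpoint URL and latency threshold are assumptions made up for the example.

```python
# Sketch of one small piece of "automate everything": probe a service
# endpoint and alert when it is down, erroring, or slow.
# The URL and threshold below are illustrative assumptions.
import time
import urllib.request

HEALTH_URL = "https://example.com/healthz"  # hypothetical endpoint
LATENCY_THRESHOLD_SECONDS = 2.0


def alert(message: str) -> None:
    # In a real pipeline this would page on-call or post to a chat channel.
    print(f"ALERT: {message}")


def check_health(url: str) -> None:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status = resp.status
    except Exception as exc:
        alert(f"{url} check failed: {exc}")
        return
    elapsed = time.monotonic() - start
    if status != 200:
        alert(f"{url} returned HTTP {status}")
    elif elapsed > LATENCY_THRESHOLD_SECONDS:
        alert(f"{url} is slow: {elapsed:.2f}s")


if __name__ == "__main__":
    check_health(HEALTH_URL)
```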
Shifting from shipped software to the SaaS model
Another drastic change is the new delivery model that many of us now operate in. I discussed this topic in detail in my new book. Before SaaS, we would package up our software product and ship it to the customer (or allow them to download it themselves). In that model, major releases typically occurred a few times a year (annually, biannually, or quarterly), with a few patches along the way. The customer was responsible for capacity planning, patching, installing, monitoring, and all of the day-to-day operations. In the new SaaS model, software is deployed frequently and often in flight, with no downtime. We are now responsible for 24x7x365 operations and support and are often held to high SLAs, various regulatory controls (HIPAA, SOC 2, PCI, FERPA, etc.), monitoring, reporting, and much more. The net result is that we must build “better software, faster, and more reliably,” because it is expected that SaaS software just works, all the time.
The days of duct tape architectures are gone. The man behind the curtain has been sent away, and our software must work for our customers the way eBay, PayPal, and Amazon.com work. Heroic firefighting efforts must give way to highly automated, monitored systems. Operations must be a proactive job instead of a reactive one. We must analyze metrics and patterns and detect things before our customers do. In some cases, we must automatically scale up when increased traffic comes at us unexpectedly. It takes a total team effort consisting of many different domain experts in development, security, infrastructure, and operations to build this type of architecture that runs itself 24×7.
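For the “scale up automatically when unexpected traffic arrives” piece, one possible sketch (assuming AWS EC2 Auto Scaling via boto3 and a pre-existing Auto Scaling group; the group name and target value are hypothetical) is a target-tracking policy that keeps average CPU near a set point so capacity follows load without a human in the loop:

```python
# Hedged sketch: attach a target-tracking scaling policy to an assumed,
# pre-existing EC2 Auto Scaling group so capacity follows average CPU load.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-farm",        # hypothetical group name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        "TargetValue": 50.0,  # aim to keep average CPU around 50%
    },
)
```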
What you don’t know can kill you
The cloud can be very deceiving. A small team can stand up an environment and deliver a robust product in a short amount of time, thanks to the high levels of abstraction in the cloud. With a handful of virtual machines, the team can easily manage and operate the system without the depth of automation I discussed above. Over time, more and more products and features are added to the cloud, and managing these systems becomes more challenging as the virtual machine count approaches 100. Going from 100 servers to 1,000 is a dramatic operational challenge, and attempting it without a DevOps mindset is career suicide. DevOps is paramount for achieving scale in the cloud. That, my friends, is my selling point to C-level people inquiring about DevOps. It is that simple. Getting a few apps to run in the cloud is possible with whatever processes a company uses today, but to achieve scale in the cloud, it’s DevOps or die.
Summary
DevOps is a hot topic these days, despite the fact that the term is broadly misunderstood. DevOps is not a person, it is not a bunch of systems administrators writing Chef scripts, and it is not just another software methodology. DevOps is an approach to building better software, faster and more reliably, and it relies on a culture of teamwork and collaboration across all technical domains. But most importantly, a DevOps mindset is a key enabler for achieving scale in the cloud.