Antifragile Systems: Designing for Agility vs. Stability

For many years, IT focused on building robust systems and invested heavily in avoiding failures. To accomplish this goal, methodical processes guided IT through a list of known use cases so that systems could avoid failing and have a recovery plan if a failure did occur. This approach often led to long delivery cycles and a high degree of complexity, which in reality increased the risk of failure and created fragile systems. Fragile systems are those that cannot adapt to unpredictable, volatile, and random events, often referred to as “shocks to the system”.

In recent years, IT has been focusing more on agility to keep up with the speed of business. Agile methodologies have been embraced in an effort to release software more frequently. Often, the shift to agile methodologies came at the expense of stable systems. Many steps that ensured stability were skipped or not fully vetted, causing various “cracks in the armor” of systems. The DevOps movement emerged at the intersection of agility and stability, where we aim to deliver software faster but with a high degree of automation, measurement, and quality.

What is an Antifragile System?

In the book “Antifragile: Things That Gain From Disorder”, Nassim Nicholas Taleb points out that stable (fragile) systems resist shocks and stay the same, while antifragile systems get better. This occurs because fragile systems try to predict outcomes and can only deal with the knowns, while antifragile systems deal with randomness and try to accommodate the unknowns. There is a fundamental difference between designing systems that do not fail and designing systems that expect every part of the system to fail.

Taleb coined the term “Black Swan” problem:

The impossibility of calculating the risks of consequential rare events and predicting their occurrence.

He goes on to state that depriving systems of volatility, randomness, and adaptation to shocks actually harms them and ultimately causes them to weaken, die, or blow up. Taleb proposes that systems should be built to handle Black Swan events – unpredictable and irregular events of massive consequence – instead of being built around known, predictable patterns.

Our bodies are a great example of an antifragile system. When a shock to our system occurs, be it an infection, a cut to a finger, or getting winded from a 100-yard dash, the body adapts, heals, and recovers. The body can accomplish this because it has not been constrained by a set of rules based on predictable use cases. IT systems, on the other hand, are purposely designed to deal with known risks, so when a shock occurs that was not predicted, all hell breaks loose.

Shifting from Fragile to Antifragile

It is time to rethink how we design systems. With advancements in cloud computing and by embracing the DevOps movement, we can now solve problems in different ways than we could before. Take the classic disaster recovery methodology, for example. In the past, we would build out our system in a primary datacenter and go through great pains to design for failover to a secondary datacenter in another location. In this model, we would try to predict how to recover from catastrophic failures and practice failing over to the secondary datacenter once or twice a year (maybe). Of course, in doing this we were only testing for what we believed a failure would look like. When real catastrophic failures occur, like the attack on the World Trade Center or the recent massive flooding caused by the storm Sandy, the results are often far less predictable and recovery is much more challenging than a well-rehearsed practice run. Instead of trying to predict failure, our systems should embrace failure just like our bodies do. Instead of designing for an entire system to fail, we should design for each component to fail, like a cell within an organism. Instead of designing for active-passive failover, we should design to heal and never fail in the first place. (Note: Not all systems require this level of reliability.)
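The idea of healing at the component level, rather than failing over an entire system, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Component` class and the `heal` function are hypothetical names invented for this example, and a real system would rely on its platform's health checks and orchestration rather than an in-memory flag.

```python
class Component:
    """A hypothetical unit of the system that can fail and be restarted on its own."""

    def __init__(self, name):
        self.name = name
        self.healthy = True
        self.restarts = 0

    def restart(self):
        # Stand-in for replacing a failed instance with a fresh one.
        self.healthy = True
        self.restarts += 1


def heal(components):
    """Restart only the components that have failed and return their names."""
    healed = []
    for component in components:
        if not component.healthy:
            component.restart()
            healed.append(component.name)
    return healed


components = [Component("web"), Component("db"), Component("cache")]
components[1].healthy = False   # simulate a shock to a single component
healed = heal(components)       # only "db" is restarted; the rest keep running
```

The point of the sketch is the scope of recovery: one failed cell is replaced while the rest of the organism keeps working, instead of the whole system being moved to a passive copy.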

Instead of implementing rigorous methodologies like waterfall, ITIL, etc. that seek to reduce the frequency of changes to a system, we should embrace things like agile and DevOps that encourage more frequent changes. By moving to smaller changesets implemented in shorter timeframes, we reduce the risks and complexities to the system. Of course, to accomplish this goal, we need to change our approach to software development.

DevOps is a Key to Antifragile Systems

If our bodies had to go through a rigorous, time-consuming process to deal with shocks to our system, we would not survive. Our system would not respond fast enough to heal and improve. Unfortunately, we don’t treat our businesses the way our bodies treat us. Instead, the business is at the mercy of the processes and controls put in place by IT to build stable, predictable systems in an unstable, unpredictable business environment. This must change.

DevOps emphasizes speed, efficiency, and quality through automation and lean thinking. But there are no silver bullets. A culture change is required to embrace the DevOps mentality. Silos between the business, development, and operations must be broken down. Repeatable processes must be identified and completely automated. The more agile a system is, the more it must rely on automation. I have written in the past about how DevOps can Turn Fragile to Agile. In a future post, I will talk about strategies for fostering a DevOps culture within an organization.

We also must design systems differently. To handle shocks in the system, software must not be tightly coupled to hardware. Software must support elasticity. If a large surge in traffic occurs, additional infrastructure must be automatically spun up and the software must be able to adapt to the new infrastructure. In the same breath, if infrastructure fails, software should be able to execute on the next available piece of infrastructure.
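A minimal Python sketch of the two behaviors described above, elasticity and running on the next available piece of infrastructure, might look like the following. The function names, the per-instance capacity figure, and the instance limits are all assumptions made for illustration; they do not correspond to any particular cloud provider's API.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Return how many instances the current load calls for, within fixed bounds."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


def run_on_first_healthy(task, nodes):
    """Try the task on each node in order, moving to the next node on failure."""
    errors = []
    for node in nodes:
        try:
            return node(task)
        except Exception as exc:  # any node failure triggers a move to the next node
            errors.append(exc)
    raise RuntimeError(f"all {len(nodes)} nodes failed: {errors}")


def failing_node(task):
    # Hypothetical node that is down.
    raise ConnectionError("node unavailable")


def working_node(task):
    # Hypothetical node that is up.
    return f"done:{task}"


instances = desired_instances(950, 100)                           # a traffic surge
result = run_on_first_healthy("job", [failing_node, working_node])
```

Neither function knows anything about specific hardware. The software only asks how much capacity it needs and which node will accept the work, which is the loose coupling the paragraph above calls for.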

Buyers Beware

Building antifragile systems requires focusing on architecture. Drag-and-drop “programmers” need not apply. If an IT organization does not have the engineering chops to build loosely coupled systems, then “don’t try this at home”. Building antifragile systems is not for every organization or every business case. But for those business cases that support it and those organizations that have the capacity to build it, antifragile systems can deliver the adaptability and the speed to market that our business partners crave.

Sources: http://www.pwc.com/techforecast and “Antifragile: Things That Gain From Disorder” by Nassim Nicholas Taleb

2 replies on “Antifragile Systems: Designing for Agility vs. Stability”

Simon says:

August 7, 2013 at 3:40 am

Would you equate building anti-fragile systems/code to how we build houses to withstand earthquakes? Build them stable and unmoving and they usually collapse when an earthquake hits, build them so they can move/flex and they can escape undamaged (or at least without collapsing altogether)?

Mike Kavis says:

August 7, 2013 at 9:38 am

Not really. Building anti-fragile systems is made possible because we don’t need to focus on physical infrastructure anymore when we build systems. The infrastructure is being delivered as a service, giving us the ability to treat infrastructure as code. Building a house is not virtualized, so engineering a house that is anti-fragile, if possible at all, would likely cost an enormous amount.
