Eric Wright of VMTurbo wrote about the death of root cause analysis (RCA) with the rise of microservices. I take exception to this, as microservices aren’t really all that new. Even what’s being called “serverless computing” isn’t particularly new. However, that’s a discussion for another time. The point of RCA is to find the real reasons for failures. I don’t see how using microservices changes this. All we’ve done is add more layers to delve through to find the true root cause of a problem.

Granted, Eric’s point is that our new systems are so resilient and distributed, and so unlikely to collapse, that root cause analysis is no longer as important as it once was. I can buy that, but let me present some cases to the contrary. I’ll start with an example drawn from a midcap firm doing four billion queries a day. To say the least, its environment is both highly resilient and highly distributed. Each node handles roughly 9,000 threads across its processes at any given time of day. It is continually hammered.
Due to the sheer volume, resiliency is key, as is using key performance indicators (KPIs) to determine whether that volume is holding steady. If a KPI shows a dip in volume, the company investigates. Now, this firm has the capacity to handle double its normal volume, so these investigations exist to make sure it is not going off the deep end somewhere.
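As a rough illustration, a KPI dip check of that sort can be as simple as comparing current query volume against a rolling baseline. The metric (queries per minute), window size, and threshold below are my assumptions, not the firm’s actual tooling.

```python
# Minimal sketch of a KPI dip check, assuming queries-per-minute as the KPI.
# The window size and threshold are illustrative, not the firm's real values.
from collections import deque

WINDOW = 60           # minutes of history used as the baseline
DIP_THRESHOLD = 0.85  # investigate when volume drops below 85% of baseline

history = deque(maxlen=WINDOW)

def volume_dipped(current_qpm: float) -> bool:
    """Return True when query volume dips enough to warrant an investigation."""
    if len(history) < WINDOW:
        history.append(current_qpm)   # still building a stable baseline
        return False
    baseline = sum(history) / len(history)
    history.append(current_qpm)
    return current_qpm < baseline * DIP_THRESHOLD
```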
These investigations have uncovered some very interesting root causes, all surfaced by the same KPIs and supporting logs.

  • Use of low performance coding practices (Application)
  • Resources not configured correctly (Operating System)
  • Networking not configured correctly (Operating System + Hardware)
  • Memory with incorrect latency values (Hardware)
  • Overheating processors (Hardware)
  • Too many DNS queries (Application + OS); this required building tiered DNS caching to handle the load (a sketch of the in-process tier follows this list)
  • Thread errors (Application)
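On the DNS item above, a minimal sketch of what the first, in-process tier of such a cache might look like follows; the fixed TTL is an assumption, and further tiers (a local caching resolver, for example) would sit behind the system resolver.

```python
# Minimal sketch of the first (in-process) tier of a DNS cache, assuming a
# fixed TTL and the system resolver as the next tier down.
import socket
import time

TTL_SECONDS = 60.0
_cache = {}  # hostname -> (timestamp, ip)

def resolve(hostname: str) -> str:
    """Return an IP for hostname, consulting the in-process cache first."""
    now = time.monotonic()
    hit = _cache.get(hostname)
    if hit and now - hit[0] < TTL_SECONDS:
        return hit[1]                       # tier 1: in-process cache
    ip = socket.gethostbyname(hostname)     # tier 2: system/caching resolver
    _cache[hostname] = (now, ip)
    return ip
```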

The list goes on. Many of these root causes are due to misconfigurations and bad coding, but an equal number apply to hardware.
The application in question uses some fairly modern concepts, such as microservices, message queues, external services, and internal services. While the system scales out quite nicely, its additional complexity demands heavily customized forms of root cause analysis rather than stock solutions. It takes extensive domain knowledge to determine the real cause of a problem, such as:

  • Ensuring proper debugging is in place to determine why services sometimes cannot be created (ongoing RCA)
  • Ensuring message queues are operating as expected; the team currently sees a symptom that points to the message queues but has not yet figured it out (ongoing RCA)

And that is on top of constantly watching every service involved to ensure it is running, including autoscaling out to more service instances and autorecovering by creating new instances on the fly.
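A minimal sketch of that watchdog pattern follows; the callables it takes are hypothetical stand-ins for whatever orchestrator API is actually in use, not the firm’s implementation.

```python
# Minimal sketch of a service watchdog. The callables passed in
# (list_expected_services, is_healthy, start_replacement, scale_out,
# load_is_high) are hypothetical stand-ins for a real orchestrator API.
import time

def watchdog_loop(list_expected_services, is_healthy, start_replacement,
                  scale_out, load_is_high, interval: float = 10.0) -> None:
    """Continuously verify services are running; recover and scale as needed."""
    while True:
        for svc in list_expected_services():
            if not is_healthy(svc):
                start_replacement(svc)   # autorecovery: create a new instance
        if load_is_high():
            scale_out()                  # autoscaling: add service instances
        time.sleep(interval)
```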

Root Cause Analysis Is Not Dead

Root cause analysis is not dead, but it has changed significantly. We now have tools that will point out issues in the upper levels of our microservices code, yet they will not delve deep enough to tie those issues back to the operating system, the underlying code, our assumptions, and the hardware.
In some cases, we cannot even see the hardware except remotely, through the lens provided by the virtual machine, the container, or some other abstraction we may not even know about today.
Three real-world cases in point:

  • Could a tool tell me why our laptop encryption overhead was 2% in the lab, yet when placed in a container within the cloud, it was 20% or more? Most tools will state that CPU utilization is up and that we need to either move the workload or add more CPU capacity. The actual answer: the AES-NI CPU features on the laptop were not available to the cloud-based containers. The performance problem would never have gone away in the container, but it could have cost the company 20% more in CPU utilization for each deployment. That can mean serious money. (A simple availability check is sketched after this list.)
  • Could the tool tell me why I lost access to all my underlying disk mechanisms at the same time across an entire cluster of nodes? Most will tell us that there is a storage problem. We know that: we lost access to all storage across multiple nodes all at once. The answer: a Fibre Channel HBA was failing. Since the hardware only alerts on outright failure, the degrading HBA never raised an alarm, and the systems all locked their storage. The solution was not that costly: use more than one physical HBA instead of relying on a single HBA to handle the workload.
  • Could the tool tell me why CPU utilization suddenly went up by 110% relative to the normal behavior we had seen for months? Most will claim there is a CPU problem and not go much further. When we moved the workload to another node, the same usage spike happened, but when we moved it to a third node, everything went back to normal. The answer: the memory in each physical box was not the same, and lower-quality memory had been used in two of the three systems.
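For the first case, here is a minimal sketch of the kind of pre-flight check that would have surfaced the problem early: verify that the "aes" CPU flag is actually exposed inside the container. It is Linux-specific, assumes /proc/cpuinfo is readable, and is an illustration rather than the tooling we used.

```python
# Minimal sketch of an AES-NI availability check (Linux-specific; assumes
# /proc/cpuinfo is readable inside the container).
def aes_ni_available(cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the 'aes' CPU flag is exposed to this environment."""
    with open(cpuinfo_path) as cpuinfo:
        for line in cpuinfo:
            if line.startswith("flags"):
                return "aes" in line.split()
    return False

if __name__ == "__main__":
    print("AES-NI exposed:", aes_ni_available())
```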

Some of these items we have 100% control over within our own data centers. That control is lost when we move to the cloud, as is the ability to run any form of test or to look deeper into the infrastructure.

Resilience Requires New Knowledge

That is why resiliency requires workarounds to gain the knowledge we need to make decisions. Some companies run basic tests in the cloud before launching an application to ensure it will perform well enough. If it doesn’t, they destroy that image and start over, hoping the new image lands somewhere else.
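A minimal sketch of such a pre-launch performance gate might look like the following; the helper functions and the retry limit are assumptions, since every provider’s API differs.

```python
# Minimal sketch of a pre-launch performance gate. The helpers
# (launch_instance, run_benchmark, destroy_instance, deploy) and the retry
# limit are hypothetical; real gates would use provider-specific APIs.
MAX_ATTEMPTS = 5

def launch_with_gate(launch_instance, run_benchmark, destroy_instance,
                     deploy, min_score: float) -> bool:
    """Keep launching until an instance benchmarks well enough, then deploy."""
    for _ in range(MAX_ATTEMPTS):
        instance = launch_instance()
        score = run_benchmark(instance)   # e.g. encryption or I/O throughput
        if score >= min_score:
            deploy(instance)
            return True
        destroy_instance(instance)        # hope the next image lands elsewhere
    return False
```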
This type of pre-testing before launch lets you address many issues. At the same time, it doesn’t really solve the problems: it just moves them around. The cloud provider doesn’t learn anything new, and we’re just working around the cloud provider’s inability to let us do true root cause analysis.
RCA requires knowledge. How do we gain that knowledge?