Containers and other technologies are moving administrators, developers, and even operations folks up the stack. In other words, we have abstracted out the hardware and abstracted out the operating system; next, we will abstract out middleware and, eventually, everything but the code to run. However, when we do that, we no longer train people to be systems engineers, and we lose the ability to do root cause analysis. We have seen this many times in recent years, and it may only get worse. Root cause analysis is part knowledge and part tools, but most of all it is an understanding of the system underneath the code. We are fast approaching a time when this skill may become a lost art.

Root cause analysis is mostly a skill, plus the proper use of tools to refine our knowledge of the system until we can pinpoint the cause of a failure. However, the tools themselves often cannot tell us the real root cause; they can only point us in a direction. To determine the real root cause, you need knowledge of the system. Take the following examples:

Example 1

A cluster of virtualization hosts loses access to all storage, each host within microseconds of the others. In effect, the entire storage subsystem is hung, and so are all the virtual machines.
Is this an issue with the array, the virtualization hosts, the switch, or a virtual machine? We can surmise that it lies within the storage subsystem, but the real questions are: where exactly is the failure, what caused it, and how can we correct it?
Answer: It was actually a failing HBA.
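One place a systems engineer would start is the hypervisor's own kernel and storage logs, looking for path errors that cluster around a single adapter. Below is a minimal sketch of that idea, assuming plain-text kernel log lines (such as dmesg output); the keywords and adapter naming patterns are illustrative assumptions, not taken from any particular product.
```python
"""Minimal sketch: scan hypervisor kernel logs for storage-path errors.

Assumes plain-text logs, one event per line (e.g., dmesg or /var/log/messages),
where HBA/SCSI trouble shows up as link-down, abort, or timeout messages.
"""
import re
import sys
from collections import Counter

# Keywords that typically accompany a failing HBA or dying storage path.
STORAGE_ERRORS = re.compile(
    r"(link down|loop down|rport.*blocked|scsi.*abort|scsi.*timeout|host reset)",
    re.IGNORECASE,
)
# Pull out the adapter name (vmhba0, host2, ...) so errors can be grouped per HBA.
ADAPTER = re.compile(r"\b(vmhba\d+|host\d+)\b", re.IGNORECASE)

def suspect_adapters(log_path):
    """Count storage-path errors per adapter; the outlier is the first suspect."""
    counts = Counter()
    with open(log_path, errors="replace") as log:
        for line in log:
            if STORAGE_ERRORS.search(line):
                match = ADAPTER.search(line)
                counts[match.group(1) if match else "unknown"] += 1
    return counts

if __name__ == "__main__":
    for adapter, errors in suspect_adapters(sys.argv[1]).most_common():
        print(f"{adapter}: {errors} storage-path errors")
```
If one adapter accounts for nearly all of the errors, that HBA becomes the first suspect, which is exactly where Example 1 ended up.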

Example 2

The workload performs poorly when deployed into a cloud, yet it performs well in development.
Is this an issue with the application? The cloud? How do we find the clues to tell us where to start our investigation?
Answer: The problem was encryption: the AES-NI features of the CPU were not being passed through to the virtual machine.
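Knowing the answer, the check itself is simple: see whether the guest can actually see the AES-NI feature. The sketch below is a minimal illustration for a Linux guest, which exposes its CPU flags through /proc/cpuinfo; the hard part, as the rest of this article argues, is knowing to look here at all.
```python
"""Minimal sketch: check whether a Linux guest can see the AES-NI CPU feature.

If the hypervisor masks or fails to pass through AES-NI, the 'aes' flag is
missing from /proc/cpuinfo and encryption falls back to a much slower software path.
"""

def guest_has_aesni(cpuinfo_path="/proc/cpuinfo"):
    with open(cpuinfo_path) as cpuinfo:
        for line in cpuinfo:
            if line.lower().startswith("flags"):
                return "aes" in line.split(":", 1)[1].split()
    return False

if __name__ == "__main__":
    if guest_has_aesni():
        print("AES-NI visible to this guest: hardware-accelerated encryption available")
    else:
        print("AES-NI NOT visible: encryption will run in software and perform poorly")
```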

Both of these problems demonstrate that hardware issues can show up as odd, seemingly unrelated symptoms, and no tool today tells us the exact root cause. We can only get close to it. Even then, we know that the VM in Example 2 had a performance issue, but how would we know it was the encryption algorithm? For that, our root cause analysis needs to include application knowledge. At the same time, we need hardware and hypervisor knowledge.
Today, the most we can hope for is tools that point us in the proper direction. For example, New Relic can indicate the code that is having issues by telling us exactly how much time is spent in a code segment. We can then look at that segment and determine that it was an encryption issue. But how do we jump from code to hardware? That takes system-level knowledge: systems engineering.
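To make that concrete, here is a minimal, hypothetical sketch of the same idea using Python's built-in profiler rather than any particular APM product; the handler and the encryption stand-in are made-up names used purely to show how a profile names the hot code segment.
```python
"""Minimal sketch: profile a request handler to see where the time goes.

encrypt_payload() and handle_request() are hypothetical stand-ins; the point is
that the profiler output names the hot segment (here, encryption), which is the
clue that sends us looking at the crypto path and, from there, the hardware.
"""
import cProfile
import hashlib
import pstats

def encrypt_payload(payload: bytes) -> bytes:
    # Stand-in for a real cipher: repeated hashing simply burns CPU the way a
    # software-only encryption path would.
    digest = payload
    for _ in range(20_000):
        digest = hashlib.sha256(digest).digest()
    return digest

def handle_request(payload: bytes) -> bytes:
    return encrypt_payload(payload)

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    handle_request(b"example payload")
    profiler.disable()
    # The top entries point at the hot code segment, much as an APM tool would.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)
```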
Any tool that can become the systems engineer is the tool to use. Any tool that can drive us down the proper path and not a false path is a good tool. For example, would ExtraHop realize that it was an encryption code issue? Probably not, as it looks at the network. But it could show a delay in sending network packets, which could also be used to guide us to our answer.
However, the first step toward answering either question is to get a topological view of our system. If we do not have this, we may not know all the pieces involved. VMware vCenter has some of this, VMware Infrastructure Navigator has some more, and SIOS has even more for VMware vSphere environments. Cirba, ScienceLogic, and vCenter plus HotLink provide more for clouds.
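Conceptually, a topology is just a dependency graph from the workload down to the hardware. The sketch below is a hypothetical, hard-coded illustration; all of the component names are invented, and a real topology would be discovered through tools such as those named above rather than typed in by hand.
```python
"""Minimal sketch: a topology as a dependency graph, walked from the workload down.

All names (vm-web01, esxi-07, ...) are made up for illustration only.
"""

# Each component maps to the components it depends on, one layer down the stack.
TOPOLOGY = {
    "app:webstore":      ["vm:vm-web01", "vm:vm-db01"],
    "vm:vm-web01":       ["host:esxi-07"],
    "vm:vm-db01":        ["host:esxi-07"],
    "host:esxi-07":      ["hba:vmhba2", "switch:fc-sw-a"],
    "hba:vmhba2":        ["array:san-array-1"],
    "switch:fc-sw-a":    ["array:san-array-1"],
    "array:san-array-1": [],
}

def dependency_path(component, depth=0):
    """Print everything a component ultimately depends on, down to the hardware."""
    print("  " * depth + component)
    for dependency in TOPOLOGY.get(component, []):
        dependency_path(dependency, depth + 1)

if __name__ == "__main__":
    dependency_path("app:webstore")
```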
Once we have a topology, we know what our system comprises, all the way down to the hardware. Then we can drill down further and further until we reach the root cause of the problem. For many issues, we need to correlate events and log entries with performance data to gain an understanding of the real cause. We also need to eliminate false positives, as going down a false path is just as bad as never finding a path to the root cause at all.
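At its simplest, that correlation is a matter of lining up timestamps: which performance symptoms follow which logged events closely enough to be worth a human's attention. The sketch below illustrates the idea with invented events and samples; real log and metric sources, and a smarter notion of "closely enough," are assumed.
```python
"""Minimal sketch: correlate log events with performance samples by timestamp.

The events and samples are made up; in practice they would come from a log
pipeline and a metrics store. The idea is to flag any latency sample that lands
within a short window after an error event, so humans review those pairs first.
"""
from datetime import datetime, timedelta

WINDOW = timedelta(seconds=30)  # how long after an event we still consider it relevant

log_events = [
    (datetime(2015, 6, 1, 10, 0, 5), "scsi abort on vmhba2"),
    (datetime(2015, 6, 1, 10, 4, 40), "rport blocked on fc-sw-a"),
]
perf_samples = [
    (datetime(2015, 6, 1, 10, 0, 10), "datastore latency 850 ms"),
    (datetime(2015, 6, 1, 10, 2, 0), "datastore latency 12 ms"),
    (datetime(2015, 6, 1, 10, 5, 0), "datastore latency 1200 ms"),
]

def correlate(events, samples, window=WINDOW):
    """Yield (event, sample) pairs where the sample follows the event closely."""
    for event_time, event in events:
        for sample_time, sample in samples:
            if event_time <= sample_time <= event_time + window:
                yield event, sample

if __name__ == "__main__":
    for event, sample in correlate(log_events, perf_samples):
        print(f"possible cause: '{event}'  ->  symptom: '{sample}'")
```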
Furthermore, the tools need to gather and correlate data for us humans, presenting the clues from which we can eventually leap to the right conclusion. In rare cases, with the proper training, tools can make that leap for us today. While no tool today can solve either of the examples above on its own, the tools do exist to send us down the correct path toward the answer.
To solve either, we not only need to look at log files; the tools must look at them for us, understand them, and correlate them with the performance and event data. This calls for a big data approach; the problem has been solved for siloed systems and very specific questions, but not for the odd issues that find their way into our environments daily. For that, we need anomaly detection.
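The core of anomaly detection can be sketched very simply: learn what normal looks like, then surface whatever deviates from it. The example below uses a rolling mean and standard deviation over invented latency samples; production anomaly detection over logs, events, and metrics is far more sophisticated, but the principle is the same.
```python
"""Minimal sketch: flag anomalous samples with a rolling z-score.

The latency values are made up. The idea is to learn a recent baseline and
surface whatever deviates sharply from it for a human to review.
"""
import statistics

def anomalies(samples, window=10, threshold=3.0):
    """Yield (index, value) for samples far outside the recent rolling baseline."""
    for i in range(window, len(samples)):
        recent = samples[i - window:i]
        mean = statistics.mean(recent)
        stdev = statistics.pstdev(recent) or 1e-9  # avoid division by zero
        if abs(samples[i] - mean) / stdev > threshold:
            yield i, samples[i]

if __name__ == "__main__":
    latencies_ms = [12, 11, 13, 12, 14, 11, 12, 13, 12, 11, 12, 13, 850, 12, 11]
    for index, value in anomalies(latencies_ms):
        print(f"sample {index}: {value} ms looks anomalous against the recent baseline")
```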
SIOS and other tools are starting us down this path, a path where the tools can take over for some of our systems engineering needs, but not for all. We need our tools to know and learn more about our systems, to pull data from multiple places, and to inform us of the broken pieces so we can get them fixed!
As we move up the stack, we forget about what is below, and for that, we need tools to remind us—to do the work we used to do. The trouble with moving up the stack is that we can never forget what is below—not if we want to find the root causes of our problems.