IT operations analytics (ITOA) is the emerging practice of building analytics into IT operations. It is a necessity in today’s environments, where even small labs generate terabytes of data a day: logs from applications, network sensors, security devices and products, automation tools, and more. The list of possible data streams is endless, and it falls to the IT operations folks to make sense of this never-ending flow. This is where analytics steps in. Analytics without knowledge, however, often leads to chasing rabbits down holes, because it can produce a large number of false positives.
Analytics, applied well, can also alleviate those false positives. Tools claim to be able to get to the root cause of a problem, but the real question is not whether they can discover a root cause; it is whether they are looking at, correlating, and analyzing the data streams that allow them to see the true root cause. The problem is not whether the tools can help, but whether they are being fed valuable data.
A few years ago, I was asked to investigate a troubling issue. The cause ended up being hardware. This got me thinking: why didn’t the tools pick up on the issue? The failure was not in the algorithms in use; they would have caught the problem if they had had the proper input. The failure was in the sources of the data they ingested: the original issue simply did not show up in the streams they were watching.
So, how do we decide which streams of data are useful? In the past, we used tools that favored one type of data over all others. Some still do, such as ExtraHop and other network analytics tools; these are very valuable, and ExtraHop itself has since grown from a pure networking tool into much more. Even so, we need tools that can look at multiple streams of data, make sense of them, and spot the anomalies or the bad things that are happening.
Purely behavioral analysis may not be needed for every item, but it is the best place to start. We want to map the behavior of users, system identities, devices, and even patterns of usage for everything within our IT stack: operations is focused not just on the user but on the entire stack. Only when we see the breadth of the data can we spot true patterns. Having only half the pieces of the puzzle means we can never complete it, and we grow frustrated trying.
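To make that concrete, here is a minimal sketch of behavioral baselining: learn what “normal” looks like for a single metric stream and flag observations that stray too far from it. The metric (hourly logins for one user), the data, and the thresholds are illustrative assumptions, not taken from any of the tools discussed here.

```python
# A minimal behavioral-baseline sketch: compute a baseline from past behavior,
# then flag new observations that deviate from it by more than a few
# standard deviations. All numbers here are made up for illustration.

from statistics import mean, stdev

def find_anomalies(history, observations, sigma=3.0):
    """Flag observations more than `sigma` standard deviations from the baseline."""
    baseline = mean(history)
    spread = stdev(history) or 1.0   # avoid a zero threshold on a flat history
    return [(i, value) for i, value in enumerate(observations)
            if abs(value - baseline) > sigma * spread]

# Hypothetical hourly login counts: a stretch of normal behavior, then today.
normal_week = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 4]
today = [5, 4, 6, 41, 5]   # the 41 is the kind of spike worth investigating

for hour, count in find_anomalies(normal_week, today):
    print(f"hour {hour}: {count} logins deviates from the weekly baseline")
```

The same idea extends to any stream you can baseline, whether it is user logins, device chatter, or API call rates; the hard part is having all of those streams in the first place.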
Tools that claim root cause analysis (RCA) can only give you RCA down to the bottom of the stack of data they see and collect; they may not be able to get to the real root cause. Getting there consistently requires more data, not less, and it requires correlating that data across streams (a small sketch of which follows the list below). This raises the following questions about all of these tools:

  • What data do they ingest?
  • Where does that data come from?
  • How do we get the data to them?

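As a rough illustration of what answering those questions buys you, the sketch below correlates two hypothetical streams by timestamp: application response times and hardware temperature readings. Neither stream alone explains a slowdown; lining them up in time is what lets a person, or a tool, see that the latency spikes coincide with the hardware running hot. The feeds, numbers, and thresholds are all invented for the example.

```python
# A minimal cross-stream correlation sketch with two hypothetical feeds,
# each keyed by timestamp. Values and thresholds are illustrative only.

app_latency_ms = {          # timestamp -> application response time
    "12:00": 110, "12:05": 120, "12:10": 950, "12:15": 980, "12:20": 115,
}
cpu_temp_c = {              # timestamp -> temperature from a hardware sensor
    "12:00": 55, "12:05": 56, "12:10": 92, "12:15": 94, "12:20": 57,
}

LATENCY_LIMIT_MS = 500
TEMP_LIMIT_C = 85

for ts in sorted(app_latency_ms):
    slow = app_latency_ms[ts] > LATENCY_LIMIT_MS
    hot = cpu_temp_c.get(ts, 0) > TEMP_LIMIT_C
    if slow and hot:
        print(f"{ts}: latency {app_latency_ms[ts]} ms while CPU at {cpu_temp_c[ts]} C"
              f" -- hardware is a candidate root cause")
    elif slow:
        print(f"{ts}: latency {app_latency_ms[ts]} ms with no matching hardware signal")
```

A tool that ingests only the first dictionary will report a slow application; a tool that ingests both has a shot at the true root cause.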
Let us consider New Relic, which looks primarily at code-level data for performance throughout the code stack. It has tools to look at systems as well, though not necessarily at hardware directly, and it can examine specific bits of the application through its plugin mechanism. All told, New Relic reveals a large part of the puzzle, including real user monitoring that attempts to measure wait times and thus gauge user experience.
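If the puzzle piece you are missing lives outside the code stack, that plugin and agent extensibility is the way to push it in. The sketch below assumes the New Relic Python agent and its custom metric API, along with a configured newrelic.ini; the fan-speed reading and its metric name are purely illustrative placeholders for whatever hardware data you can actually reach.

```python
# A rough sketch of feeding an extra data stream into New Relic via the
# Python agent's custom metric API. Assumes the `newrelic` package is
# installed and configured; the metric and value are hypothetical.

import newrelic.agent

newrelic.agent.initialize('newrelic.ini')      # load the agent configuration
application = newrelic.agent.register_application(timeout=10.0)

def report_fan_speed(rpm):
    # Custom metric names conventionally start with "Custom/".
    newrelic.agent.record_custom_metric('Custom/Hardware/FanSpeedRPM', rpm,
                                        application=application)

report_fan_speed(4200)   # hypothetical reading pulled from a hardware sensor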
Zenoss, on the other hand, looks at all aspects of the system, all the way down to the individual bits of hardware. Zenoss can collect SNMP and log data, analyze it, and point out issues occurring all the way down to the bottom of the stack. But it can only do this if the data is exposed by the sources it queries. For example, a hardware issue tied to a chipset feature may not be visible to Zenoss unless some enterprising user exports that data into it; it is not there by default with the current set of sensors. Nor would Zenoss see errors that never appear in hardware queries but instead show up in kernel logs, which are, once more, not entirely within its purview. However, Zenoss, like New Relic, is highly extensible with new sensor data.
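Bridging that kernel-log gap is exactly the kind of work Zenoss’s extensibility invites from an enterprising user. The sketch below shows one hypothetical approach: scan the kernel log for hardware-error patterns and forward the matches to syslog, where a log collector such as Zenoss could be pointed at them. The log path, the error patterns, and the syslog address are assumptions about a typical Linux host, not anything Zenoss-specific.

```python
# A minimal sketch of exporting kernel-log hardware errors to syslog so a
# log-collecting tool can ingest them. Paths, patterns, and the syslog
# destination are assumptions to adapt to your environment.

import re
import logging
import logging.handlers

HARDWARE_PATTERNS = re.compile(
    r'(machine check|mce:|hardware error|ecc error|i/o error)', re.IGNORECASE)

# Forward matches to the local syslog daemon, which a log collector can watch.
syslog = logging.getLogger('hardware-export')
syslog.setLevel(logging.WARNING)
syslog.addHandler(logging.handlers.SysLogHandler(address='/dev/log'))

def export_hardware_errors(path='/var/log/kern.log'):
    with open(path, errors='replace') as log:
        for line in log:
            if HARDWARE_PATTERNS.search(line):
                syslog.warning('kernel hardware event: %s', line.strip())

if __name__ == '__main__':
    export_hardware_errors()
```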
VMware vRealize Operations, by contrast, generally sees only within the virtual environment unless you pull in a series of sensors and configure them to gather data from outside it. The out-of-the-box version, however, is very good at seeing all layers of the virtual infrastructure, and it is customizable through a set of connectors. Those connectors need to be intelligent about hardware issues for vRealize Operations to act on them.
Cirba looks at the entirety of the system, hybrid or otherwise, to do better capacity and workload management. Cirba is also extensible, but its main focus seems to be capacity utilization and prediction within the hybrid cloud.
Xangati is another networking-based tool that has grown to cover not just the network but the entire virtual environment. Its goal is to find storms of anomalous traffic, report on them, and tell you the reasons as well as how to fix them. Its algorithms continue to be refined, and its existing data can be put to many uses. While it does not allow plugins per se, the system is becoming much more application-aware.
Lastly, VMTurbo looks through a different lens. While not as extensible, it does a very good job of examining virtual environments and finding causes within the same realm as vRealize Operations: the virtual environment. To go past that would require a bit more data.

Closing Thoughts

The data you see within an ITOA tool may not be complete. We need to know the what, where, how, when, and why of the data to get true RCA; RCA without all the components is like building a puzzle with only half the pieces. We are fighting this battle today, and we need tools and data streams that can truly alleviate false positives.
Which tool do you use and why? What data is it missing?