Where Is My Operations Swamp-Drain-O-Matic?

As SF author William Gibson said, “The future is already here—it’s just not very evenly distributed.” Some IT infrastructure teams live in a future where they are resolving every issue before there are problems for end users. These teams live in a nirvana where help-desk tickets are all requesting new accounts to be created for staff who start work next week. Phone calls bring praise from line-of-business managers. Personally, I have never seen these IT teams. Maybe they exist; maybe they are just a dream. Many IT infrastructure teams work in a very different world: a world of hurt and pain, where application performance is unpredictable. The help-desk call queue sometimes spirals out of control. When the team is this deep in alligators, it can be hard to see how to drain the swamp. A crucial first step is getting the lay of the land and some idea of where the problems are coming from. The next step is to start dealing with the root causes of issues before they cause problems.

The value of a good monitoring system should be self-evident. Issues are identified fast. The root cause is immediately visible. The path to resolution is clear. As issues are resolved rapidly, it becomes easier to move to a proactive model. Then, the monitoring tool should lead us to latent issues that may impact us later, allowing us to resolve issues before a crisis occurs. The IT nirvana comes into sight, and the whole team can go for Friday afternoon drinks.

Why is it, then, that I have seen IT organizations with multiple monitoring tools, none of which are used in normal operations? I have seen an IT team with huge screens mounted on walls, each with its own massive dashboard, yet the team never used the monitoring tools. Most often, each monitoring tool has a champion, a person who thinks that this is the best monitoring tool in the world. The champion has usually invested a lot of time in the tool. They have learned the meaning of its counters and metrics. They have spent hours customizing the dashboard to answer their exact needs. Unfortunately, the rest of the team’s needs aren’t being met by this tool. Worse yet, sometimes the tool is championed by a manager who never spends time hands-on with it. There is no sense in having a different monitoring tool for each team member. There should be a single source of truth that is used by everyone on the team.

I think the root cause of this disuse is that most monitoring tools are designed around the data they collect. This is quite natural, since collecting and managing the data is a large part of the function of the software. But this focus on data can come at the expense of the software’s purpose. Monitoring tools exist to help me understand my environment, resolve or prevent issues, and plan for changes. Showing me that fifty VMs all have the same issue isn’t enough. I need a tool that identifies why they have the same issue and advises me on how to resolve it. If it is a security issue caused by an out-of-date software version, then the monitoring tool should show me how to upgrade the software or reconfigure the old version to resolve the issue. My job is not done when I am told there is an issue; it has only just begun. An operations management tool that provides me with possible actions to take to address a problem is far more helpful than one that just shows me the problem.
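To make the difference concrete, here is a minimal sketch in Python of a report that groups VMs by their shared issue and attaches a suggested action, rather than listing raw alerts. Everything in it is hypothetical: the VM names, the issue labels, the remediation text, and the `summarize` function are invented for illustration and do not come from any particular monitoring product.

```python
from collections import defaultdict

# Hypothetical findings, as a monitoring tool might emit them: one row per VM.
FINDINGS = [
    {"vm": "vm-01", "issue": "outdated-agent"},
    {"vm": "vm-02", "issue": "outdated-agent"},
    {"vm": "vm-03", "issue": "disk-nearly-full"},
    {"vm": "vm-04", "issue": "outdated-agent"},
    {"vm": "vm-05", "issue": "clock-drift"},
]

# Hypothetical mapping from an issue to the action an operator could take.
REMEDIATIONS = {
    "outdated-agent": "Upgrade the agent to the current release",
    "disk-nearly-full": "Extend the volume or purge old logs",
}

DEFAULT_ACTION = "No recommendation available; investigate manually"


def summarize(findings, remediations):
    """Group findings by issue and pair each group with a suggested action."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding["issue"]].append(finding["vm"])

    report = []
    # Most widespread issue first; ties are broken alphabetically by issue name.
    for issue, vms in sorted(grouped.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        report.append({
            "issue": issue,
            "vms": sorted(vms),
            "action": remediations.get(issue, DEFAULT_ACTION),
        })
    return report


report = summarize(FINDINGS, REMEDIATIONS)
for entry in report:
    print(f"{entry['issue']} ({len(entry['vms'])} VMs): {entry['action']}")
```

The grouping itself is trivial. The value lies in the mapping from issue to action, which is exactly the part that data-centric tools tend to leave to the operator.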

Good information about the state of infrastructure is an important tool. Waiting for help-desk calls is not the hallmark of a high-performing operations team. Thinking ahead of the issues is the only way to be high performing. Operations tools need not only to provide information but also to recommend possible actions to resolve issues. Ideally, we would like the operations software to take some corrective actions on its own. This is where “desired state” tools come into play. The operations team defines the desired state and allows the operations management software to bring the environment into compliance with it.
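The desired-state idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular product works: the host names, setting keys, and the `plan` and `apply_changes` functions are all hypothetical, and the "environment" is just a dictionary in memory rather than real servers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """One corrective action: set `key` to `value` on `host`."""
    host: str
    key: str
    value: str


def plan(desired, actual):
    """Compare desired and actual state; return the changes needed to comply."""
    changes = []
    for host, wanted in sorted(desired.items()):
        current = actual.get(host, {})
        for key, value in sorted(wanted.items()):
            if current.get(key) != value:
                changes.append(Change(host, key, value))
    return changes


def apply_changes(changes, actual):
    """Apply each change to the (simulated) environment in place."""
    for change in changes:
        actual.setdefault(change.host, {})[change.key] = change.value


# Hypothetical desired state, defined once by the operations team.
desired = {
    "web01": {"ntp": "enabled", "agent_version": "2.1"},
    "web02": {"ntp": "enabled", "agent_version": "2.1"},
}

# Hypothetical actual state, as discovered by the tool.
actual = {
    "web01": {"ntp": "enabled", "agent_version": "2.0"},
    "web02": {"ntp": "disabled", "agent_version": "2.1"},
}

changes = plan(desired, actual)
apply_changes(changes, actual)
remaining = plan(desired, actual)  # empty once the environment is compliant
```

The useful property is that the team describes where it wants to end up, not the steps to get there. Running the same comparison again after the changes are applied yields nothing to do, which is what makes it safe to run on a schedule.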