A few weeks ago, Tom Howarth asked me one of those simple questions that lead to a very long answer. Like all good questions, it made me think. He asked, “In an ideal world, what would you monitor?” Of course, the analyst in me immediately turned that around. Context is key, and such a broad question needs a much narrower definition before it can be of any use. Monitoring a VDI environment will mean measuring different metrics than monitoring a web farm, or a database cluster, or a WAN. The things that concern the end user are different from those that concern a lowly sysadmin like myself, and very, very much different from those that matter to a CxO. There is one thing, though, that is true across all of these metrics: there’s no point measuring something you can do nothing about.

Of course, ultimately, there is nothing we can’t do something about. Throw enough money at a problem and we can replace or add to hardware, have code rewritten, or move wholesale to new infrastructure. But the amount of money we have to throw at any given problem is always limited. The obvious response to Tom’s question is “monitor everything,” but that way lies madness. A given server is performing billions of CPU operations per second and hundreds or thousands of disk reads and writes, and consuming GBs or tens of GBs of bandwidth. Add to this hundreds of switches, SAN devices, firewalls, and intrusion detection systems. We can monitor physical conditions such as temperature and humidity, as well as virtual functions such as file access and permissions changes. Something as trivial as a DHCP server can produce GBs of logs per year. Trying to monitor even a small fraction of these things leads to servers and storage dedicated to monitoring systems, and these in themselves need to be monitored. By far the biggest cost of this monitoring, though, isn’t the use of CPU cycles that could be better put to use solving our business problems; it is the human cost. Monitoring these functions, even with deep learning and alerts and alarms to take most of the strain, takes human effort to separate the wheat from the chaff. Paying your staff to monitor your systems, and pretty much nothing else, isn’t a very good strategy.
 
So, if we can’t monitor everything, how do we decide what we can monitor? Again, the analyst in me speaks up and says “What do you want to achieve from the monitoring?” However, another small voice speaks up and says “No, first, stop monitoring the things you can’t, or rather, don’t need to, do anything about.” So where do we start in deciding which problems we can do something about? The first class of things to remove from consideration are those we simply don’t care about. In the campus, we wouldn’t monitor the link state of every user’s desktop port. Laptops spend the vast majority of their time away from the desk (or the user wouldn’t need a laptop!), so the sheer false positive rate on that metric would be huge. We wouldn’t monitor the uptime of a laptop for a similar reason: we simply don’t care how long the device has been turned on. The second class of things not to monitor could be said to be “those that lie.” There isn’t much point measuring the CPU time of a virtual CPU that spends a good chunk of its time suspended by the hypervisor. There are better tools, further down the stack, that give more reliable results. The third class to disregard is the set of metrics that are artificially limited. We don’t monitor the bandwidth of connections between virtual machines for a few reasons. Firstly, if they are on the same host, then the host will switch at far greater than the 10 Gb/s the VM thinks it has available. Secondly, if they are on different hosts, then the 10 Gb/s they think they have is almost certainly contended. Thirdly, on public cloud, we quite simply can’t do anything about that metric. It is much better to monitor the host bandwidth, or the perimeter bandwidth of public cloud. These are the pressure points where we can fix the problem, by introducing more hosts or purchasing more bandwidth.
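
For anyone who prefers to see that triage written down, here is a minimal sketch in Python of the three exclusion classes applied to a metric catalogue. Everything in it is an assumption made for illustration: the `Metric` fields, the `triage` function, and the metric names are hypothetical, and the flags are judgments you would set by hand for your own environment, not values any real monitoring tool reports.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    KEEP = "keep"
    DONT_CARE = "we don't care about it"
    LIES = "the reading lies"
    ARTIFICIAL = "artificially limited"


@dataclass(frozen=True)
class Metric:
    name: str                   # hypothetical metric name
    cared_about: bool           # would anyone act on a change in this value?
    trustworthy: bool           # does the reading reflect what is really happening?
    artificially_limited: bool  # is the ceiling a nominal figure we cannot change here?


def triage(metric: Metric) -> Verdict:
    """Apply the three exclusion classes in the order the article lists them."""
    if not metric.cared_about:
        return Verdict.DONT_CARE
    if not metric.trustworthy:
        return Verdict.LIES
    if metric.artificially_limited:
        return Verdict.ARTIFICIAL
    return Verdict.KEEP


# Illustrative catalogue built from the examples in the text.
CATALOGUE = [
    Metric("desktop_port_link_state", cared_about=False, trustworthy=True, artificially_limited=False),
    Metric("laptop_uptime", cared_about=False, trustworthy=True, artificially_limited=False),
    Metric("guest_vcpu_time", cared_about=True, trustworthy=False, artificially_limited=False),
    Metric("vm_to_vm_bandwidth", cared_about=True, trustworthy=True, artificially_limited=True),
    Metric("host_bandwidth", cared_about=True, trustworthy=True, artificially_limited=False),
    Metric("cloud_perimeter_bandwidth", cared_about=True, trustworthy=True, artificially_limited=False),
]


def keepers(catalogue: list[Metric]) -> list[str]:
    """Return the names of the metrics that survive all three culls."""
    return [m.name for m in catalogue if triage(m) is Verdict.KEEP]


if __name__ == "__main__":
    for m in CATALOGUE:
        print(f"{m.name}: {triage(m).value}")
```

The order of the checks is a design choice rather than a rule: asking “do we care?” first means nobody spends time arguing about the accuracy of a metric that would never be acted on anyway.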
 
Monitoring is an expensive pastime, whether in software costs, storage costs, or staffing costs. It’s almost certain that your network is diligently capturing information that you cannot use. Now is the season to cull that. Open the windows for a good spring clean.