Will 2017 be the year of the self-healing data center? One might consider this the holy grail of IT operations: an infrastructure that can, to a certain degree, maintain itself and resolve issues as they arise.

So let’s imagine what it would take to reach this holy grail of a self-healing data center. First, let me establish some parameters for this conversation: we’re referring to the day-to-day operations of maintaining a modern-day cloud computing platform, whether it spans public cloud, private cloud, or both. The building and decommissioning of hosts are out of scope for this discussion.

To accomplish our goals, we need to use at least three distinctly different methods of monitoring the infrastructure for triggers:

  • Alerting: An event triggered when a measured value differs from its expected value.
  • Logging: Monitoring of logs in real time, looking for predefined alert messages.
  • Trending: Monitoring of values over an extended period of time.

Let me give an example to show the differences between these types of monitoring. We could call this “fishing for triggers,” and from there, we can start putting everything into perspective.

An alert is an event that is triggered when values differ from specified values. The most common alert, seen repeatedly in countless data centers around the world, is “Disk is running out of space.” An alert like that could be set to trigger when the amount of free space becomes, say, less than twenty percent. An alert is triggered so that action can be taken to address the problem—preferably some kind of automated action.
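
To make that concrete, here is a minimal sketch of such a threshold check in Python. The path, the twenty-percent threshold, and the shape of the alert payload are all illustrative assumptions rather than any particular monitoring product's API:

    import shutil

    # Hypothetical threshold: raise an alert when less than 20% of the volume is free.
    FREE_SPACE_THRESHOLD = 0.20

    def check_disk(path="/var"):
        """Return an alert payload when free space drops below the threshold."""
        usage = shutil.disk_usage(path)
        free_ratio = usage.free / usage.total
        if free_ratio < FREE_SPACE_THRESHOLD:
            return {
                "type": "disk_space",
                "path": path,
                "message": f"Disk {path} is running out of space ({free_ratio:.0%} free)",
            }
        return None

    if __name__ == "__main__":
        alert = check_disk()
        if alert:
            print(alert["message"])  # in practice, hand this off to the alerting pipeline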

Monitoring that addresses trending over time gives us a much better understanding of why the disk is getting full. Trending gives us insight into how quickly the drive is filling up and how much time remains before it runs out of space entirely.

The type of alert and where it originated play an integral part in determining what action will be carried out. Trending over time offers awareness of impending issues, and it allows the team to plan for additional storage as part of what could be considered a normal operational rate of growth.
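
As a rough sketch of what trending adds, the fragment below fits a simple straight-line trend to a handful of hypothetical usage samples and extrapolates how many days remain before the volume fills. The sample data, the one-terabyte capacity, and the weekly sampling interval are all invented for illustration:

    from datetime import datetime

    # Hypothetical samples collected by the trending system: (timestamp, bytes used).
    samples = [
        (datetime(2017, 1, 1), 400_000_000_000),
        (datetime(2017, 1, 8), 460_000_000_000),
        (datetime(2017, 1, 15), 525_000_000_000),
    ]
    CAPACITY = 1_000_000_000_000  # a 1 TB volume, also an assumption

    def days_until_full(samples, capacity):
        """Fit a least-squares line to usage over time and extrapolate to capacity."""
        t0 = samples[0][0]
        xs = [(t - t0).total_seconds() / 86400 for t, _ in samples]  # elapsed days
        ys = [used for _, used in samples]
        n = len(xs)
        mean_x, mean_y = sum(xs) / n, sum(ys) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
                 / sum((x - mean_x) ** 2 for x in xs))
        if slope <= 0:
            return None  # usage is flat or shrinking; nothing to project
        intercept = mean_y - slope * mean_x
        return (capacity - intercept) / slope - xs[-1]

    print(f"Roughly {days_until_full(samples, CAPACITY):.0f} days until the disk is full")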

The last piece is the live monitoring of logs. In this scenario, specific log files are watched for specific events and patterns, and the logging tools trigger a set of actions when a particular pattern is found. If an alert comes from live log monitoring, chances are that a completely different kind of issue is occurring, at a rate of change that puts it outside the bounds of normal operations. The alert instigates a more aggressive automated approach: finding the source of the data, stopping the troubled service or operation, and freeing up space to get back to normal operations. In my opinion, the monitoring of logs is truly the secret sauce of the self-healing data center. It allows you to discover and act on a root event before it has the opportunity to escalate into something that will be picked up by the other monitoring methods.
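
A bare-bones version of that live log watch might look like the sketch below. The log path, the pattern being matched, and the handle_event hook are placeholders for whatever your logging tool actually exposes:

    import re
    import subprocess

    # Illustrative pattern: messages that suggest something is flooding the disk.
    PATTERN = re.compile(r"No space left on device|disk quota exceeded", re.IGNORECASE)

    def handle_event(line):
        # A real pipeline would raise an incident and kick off remediation here;
        # this placeholder only reports what was matched.
        print(f"pattern matched, escalating: {line.strip()}")

    def watch_log(path="/var/log/app/app.log"):
        """Follow a log file (like `tail -F`) and react when a known pattern appears."""
        proc = subprocess.Popen(["tail", "-F", path], stdout=subprocess.PIPE, text=True)
        for line in proc.stdout:
            if PATTERN.search(line):
                handle_event(line)

    if __name__ == "__main__":
        watch_log()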

The monitoring technology needed has been around for a while; what’s been missing is the integration of different solutions and the orchestration across different products. You may have several applications capable of firing off alerts, monitoring logs, or trending the environment, but you still need centralization, orchestration, or both to fully resolve the kinds of issues a modern-day data center faces. Many of these products ship with some out-of-the-box automation, but automation by itself is never the whole answer, because in most environments an incident or change authorization is required before any work can be performed. It is this orchestration that is the missing link, and it takes a lot of custom coding to make it happen.
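
To show why that glue matters, here is a deliberately simplified sketch of the orchestration flow: no remediation runs until a change record exists and has been authorized. Every function stands in for a real integration (a ticketing API, an approval workflow, a remediation runbook), and treating this class of alert as a pre-approved standard change is an assumption about policy, not a recommendation:

    import uuid

    def open_change_record(alert):
        """Placeholder for the change/incident system integration (e.g. a ticket API)."""
        change_id = str(uuid.uuid4())
        print(f"opened change {change_id} for: {alert['message']}")
        return change_id

    def is_authorized(change_id):
        """Placeholder approval check; a real system would poll or receive a callback."""
        return True  # assume a pre-approved standard change for this class of alert

    def remediate(alert):
        """Placeholder remediation, e.g. rotate logs or stop the offending service."""
        print(f"remediating {alert['type']} on {alert['source']}")

    def orchestrate(alert):
        """Tie the pieces together: no remediation without an authorized change."""
        change_id = open_change_record(alert)
        if is_authorized(change_id):
            remediate(alert)
        else:
            print(f"change {change_id} not approved; leaving the alert for a human")

    orchestrate({"type": "disk_space", "source": "web-01", "message": "Disk /var below 20% free"})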

As automation continues to become increasingly relevant in the data center, let’s see if 2017 turns out to be the year that shows us what a self-healing data center is all about, and what the logic and automation can really do.