In my opinion, automation in the modern-day data center falls into three main areas, or segments. The first segment is provisioning, the next is second-day operations, and the last, to complete lifecycle management, is the decommissioning process. Data centers are broadly similar to one another; what makes each different is the choice of technologies used in its environment. In this article, I focus on philosophies of automation used in data centers.
When it comes to automation, the philosophies applied correspond directly to the orchestration engines used. Let’s consider the idea that each technology silo has its own methods and tools for automation. As such, each silo is responsible for automating its own technologies. The storage team is responsible for the automation of anything storage related, the network team for network configuration, the Windows administrators for the configuration of Windows operating systems, and the Linux team for the configuration of Linux-based servers. Each team manages its part of the automation inside the data center. When it comes to provisioning and decommissioning, almost all of the silos’ automation is assembled to create the complete process. When one team’s automation part is finished, the automation makes an API call to kick off the next process, and so on until the provisioning or decommissioning has completed. If the automation process for one of the technology silos fails, overall automation stops until the appropriate team can address the issue, restart its automation, verify that it has completed successfully, and start the automation for the next silo.
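As a rough illustration, the silo-to-silo handoff described above can be sketched in Python. Everything named here is hypothetical: the `make_step` helper, the three silo actions, and the `request` dictionary are stand-ins, and a plain function call takes the place of the API call that would kick off the next silo's tooling in a real environment.

```python
from typing import Callable, Optional

Request = dict
Step = Callable[[Request], bool]


def make_step(name: str, action: Step, next_step: Optional[Step] = None) -> Step:
    """Wrap one silo's automation so it hands off to the next silo on success."""

    def step(request: Request) -> bool:
        request.pop("stopped_at", None)  # clear any earlier failure marker
        if not action(request):
            # A failure halts the chain until the owning team restarts its step.
            request["stopped_at"] = name
            return False
        request.setdefault("completed", []).append(name)
        # Assumption: a real handoff would be an API call to the next silo.
        return next_step(request) if next_step else True

    return step


# Hypothetical silo actions; each returns True on success.
def provision_storage(request: Request) -> bool:
    request["volume"] = "allocated"
    return True


def configure_network(request: Request) -> bool:
    if "vlan" not in request:  # simulated failure: no VLAN was supplied
        return False
    request["network"] = "configured"
    return True


def configure_os(request: Request) -> bool:
    request["os"] = "configured"
    return True


# The chain is wired from the last step back to the first.
os_step = make_step("os", configure_os)
network_step = make_step("network", configure_network, os_step)
storage_step = make_step("storage", provision_storage, network_step)
```

In this sketch, calling `storage_step({})` completes the storage step and then stops at the network step. Once the network team has fixed the problem (here, by adding a `vlan` entry to the request), calling `network_step(request)` resumes the chain from that point, which mirrors the manual restart described above.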
This approach has a number of positive aspects. Each silo’s technology team can create and control its specific piece of the puzzle, and automation will not continue unless a step completes successfully. The approach does need a technical person who establishes and defines the overall workflow and leaves it to each team to figure out the best way to complete its part.
A slightly different philosophy applied to this theme is to have a centralized automation engine that makes the calls and evaluates the returned output to determine whether there has been a failure and whether the workflow should continue. In my opinion, this is a better approach. A centralized model is in a better position to handle any exceptions or failures. The centralized system can connect to any of the previous systems to gather any information needed to recover from the failure and keep the workflow moving forward. I am always looking to create and add exception automation for any failures discovered. The workflow should not have to stop and wait for an administrator to manually resolve an issue to continue; rather, automation should have the ability to resolve things on its own in an agile manner. To make this work, exceptions must be coded in a timely manner, as they appear, instead of being added to a list of tasks to be worked on for the next release. Agility is truly the key to success.
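A minimal sketch of the centralized model follows, again in Python and again under stated assumptions: the `StepFailure` exception, the failure codes, and the `run_workflow` function are hypothetical names chosen for illustration and do not come from any particular orchestration product. The engine runs each step, looks up an exception handler for any failure it sees, and retries the step after the handler has run. It halts only when no handler exists or the retry budget is spent.

```python
from typing import Callable, Dict, List, Tuple

Context = dict


class StepFailure(Exception):
    """Raised by a step to report a failure under a short, known code."""

    def __init__(self, code: str) -> None:
        super().__init__(code)
        self.code = code


def run_workflow(
    steps: List[Tuple[str, Callable[[Context], None]]],
    handlers: Dict[str, Callable[[Context], None]],
    context: Context,
    max_retries: int = 1,
) -> List[Tuple[str, str]]:
    """Run steps in order; on failure, apply a matching handler and retry."""
    log: List[Tuple[str, str]] = []
    for name, step in steps:
        attempts = 0
        while True:
            try:
                step(context)
                log.append((name, "ok"))
                break
            except StepFailure as failure:
                handler = handlers.get(failure.code)
                if handler is None or attempts >= max_retries:
                    # No coded exception path yet: stop and wait for a person.
                    log.append((name, f"halted:{failure.code}"))
                    return log
                handler(context)  # exception automation repairs the cause
                log.append((name, f"recovered:{failure.code}"))
                attempts += 1
    return log


# Hypothetical step and handler used only to exercise the engine.
def allocate_ip(context: Context) -> None:
    if context.get("free_ips", 0) <= 0:
        raise StepFailure("ip_pool_exhausted")
    context["free_ips"] -= 1
    context["ip_assigned"] = True


def expand_ip_pool(context: Context) -> None:
    context["free_ips"] = 10
```

Adding a new entry to `handlers` as soon as a new failure code shows up in the log is the code-level equivalent of coding exceptions as they appear rather than deferring them to the next release.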
Now, when it comes to second-day operations, each of the technology teams evaluates the most repeated tasks and issues it must address day after day. Each team’s support metrics are usually generated from the number of incidents and changes that are tallied on a monthly basis. Each team also has multiple, differing priorities to contend with. As such, if the skill set is available, each works to resolve and automate the tasks behind the main metrics in its specific technology.
When I ask my peers which automation philosophy they subscribe to and how they support it, I usually find that a person or small group of people makes up the automation team for the company. It is this group that writes the code for each of the different steps, mainly for the provisioning and decommissioning processes. When it comes to second-day operations, though, I have found more often than not that the automation team starts from a script or similar tool that it and the technology team are already using. They encapsulate the script into a format that can be used within the automation engine as they work to expand the operations automation.
When it comes to day-to-day operations, I believe there are three main areas that make up automation. Creating automation to handle the majority of requested changes and incidents is part of what I call “reactionary automation.” Changes, incidents, and alerts are all issues that are reactionary in nature. The flip side of that coin is to be proactive by using automation that scans the environment, finding and resolving issues before they become manually reported incidents or changes. Finally, a number of monitoring solutions can trend the growth in the environment. With that information, the monitoring systems can further enhance proactive automation by taking action on the trended information.
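To make the idea of acting on trended information concrete, here is a small Python sketch. It assumes evenly spaced daily usage samples and a simple average growth rate, which is a deliberate simplification of what a real monitoring system would calculate. The function names, the threshold, and the `remediate` callback are hypothetical.

```python
from typing import Callable, List, Optional


def days_until_full(samples_gb: List[float], capacity_gb: float) -> Optional[float]:
    """Project days until capacity is reached from average daily growth.

    Returns None when there are too few samples or usage is not growing.
    """
    if len(samples_gb) < 2:
        return None
    growth_per_day = (samples_gb[-1] - samples_gb[0]) / (len(samples_gb) - 1)
    if growth_per_day <= 0:
        return None
    return (capacity_gb - samples_gb[-1]) / growth_per_day


def proactive_check(
    samples_gb: List[float],
    capacity_gb: float,
    threshold_days: float,
    remediate: Callable[[], None],
) -> bool:
    """Run the remediation when projected headroom drops below the threshold."""
    remaining = days_until_full(samples_gb, capacity_gb)
    if remaining is not None and remaining <= threshold_days:
        remediate()  # e.g. expand a volume or open a change automatically
        return True
    return False
```

With five daily samples of 70, 72, 74, 76, and 78 GB on a 100 GB volume, the average growth is 2 GB per day, so the projection is 11 days of headroom. A threshold of 14 days would trigger the remediation, while a threshold of 7 days would not.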
Whichever automation philosophy you subscribe to, it will still point to the same destination. For all practical purposes, this will lead us to a world of hands-off IT, where most configurations and changes are not handled by clicking boxes in the user interface, but rather by the automation itself. The first to arrive at this automated destination will be the companies with the largest data center infrastructures, for the simple reason that you cannot support an environment of hundreds or thousands of servers with any kind of consistency unless automation is the primary tool. In the meantime, during your journey to that destination, what is your automation philosophy?