Simplification has been a trend among IT infrastructure startups for a while. We have seen a great deal of it in how IT infrastructure is managed, and we have also seen a trend toward infrastructure products that are easy to deploy. Is easy to deploy a good thing? Does easy deployment lead to easy operations? What does easy to operate mean for a large enterprise?
Simplification is something that the deep geeks tend to scoff at. It’s easy to do, so anyone can do it. If anyone can do it, then it cannot be sufficiently powerful for our complex use case. However, there is what I call advanced simplification. The product itself does something very complex, but the user interface hides the complexity and allows the operator to focus on what the infrastructure delivers. This is the kind of simplification we need in order to manage complex systems.
At one end of the complexity scale is building your own OpenStack deployment from the source code. There are plenty of stories of deploying OpenStack that end in failure because the process is just too hard for enterprise customers. That’s not to say that OpenStack cannot be made simple to deploy. ZeroStack will deliver a pre-built OpenStack deployment as a physical appliance to your data center. Similarly, Platform9 will run the OpenStack infrastructure in the cloud for you and let you use it to manage your on-premises infrastructure. All of the complexity of OpenStack is still there. These vendors simply handle the complexity and hide it from their customers.
Easy to deploy for a test or PoC is one thing. Easy to deploy and manage at scale in a production-ready configuration is quite another. There are a few products that are very simple to deploy and set up, but hard to get full value out of once they are deployed. One example is VMware vRealize Operations (vROps), which is sold for monitoring and troubleshooting a vSphere environment. The basic deployment and configuration are pretty simple. You do need to be careful to deploy enough resources for vROps to operate properly with the number of VMs you have. But once deployed, it can be left alone for a couple of weeks; then, some pretty dashboards light up with symbols and colors. This pretty front page looks great in demos and for the sales process. It is even good to put in front of management. However, the pretty front page doesn’t get you to the root cause of a performance problem with a database server that only occurs at 3 pm every second Tuesday. Nor does it let you isolate the VMs and resources that are related to your SAP environment in order to do some capacity planning with that project team. These things are doable with vROps, but they require quite an investment in expertise and in configuring custom dashboards. Simple to deploy doesn’t mean simple to get value.
To achieve simplified management, we need to move away from handcrafted perfection and toward policy. We need to define outcomes, then allow the underlying systems to manage all of the complexity required to deliver those outcomes. Take security compliance, for example. There is a series of well-defined standards: DISA STIG, PCI DSS, and the like. Most of the time, we define a collection of configuration settings that match the requirement, and we apply the settings at build time. Usually, we also have an audit process in which we periodically check that the settings in use match the requirement. If the settings don’t match, then we correct the individual settings. In some places, configuration management tools like Chef, Puppet, or Ansible are used to automate these processes. A better solution would be for the platforms themselves to have policy engines and template policies for these compliance regimes. The platform should then check compliance and remediate any noncompliance, automatically and without human intervention. People should only be involved when the platform determines that compliance is impossible.
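To make the idea concrete, here is a minimal sketch of the check-and-remediate loop described above. It is not how any particular platform works. The names Rule, Policy, and reconcile are hypothetical, and the settings and values are purely illustrative; they are not taken from DISA STIG, PCI DSS, or any other standard.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    setting: str             # name of the configuration setting (illustrative)
    required: object         # value the policy requires
    remediable: bool = True  # False means the platform cannot fix it on its own


@dataclass
class Policy:
    name: str
    rules: list


def reconcile(policy, actual):
    """Compare actual settings to the policy and fix what can be fixed.

    Returns two lists of setting names: those that were remediated
    automatically, and those that need a human because the platform
    cannot bring them into compliance.
    """
    remediated = []
    escalations = []
    for rule in policy.rules:
        if actual.get(rule.setting) == rule.required:
            continue  # already compliant, nothing to do
        if rule.remediable:
            actual[rule.setting] = rule.required  # automatic remediation
            remediated.append(rule.setting)
        else:
            escalations.append(rule.setting)      # hand off to a person
    return remediated, escalations


# A made-up template policy and a made-up host, for illustration only.
policy = Policy("example-baseline", [
    Rule("ssh_root_login", "no"),
    Rule("password_min_length", 14),
    Rule("disk_encryption", "enabled", remediable=False),
])
host = {
    "ssh_root_login": "yes",
    "password_min_length": 14,
    "disk_encryption": "disabled",
}

remediated, escalations = reconcile(policy, host)
print("remediated:", remediated)
print("needs a human:", escalations)
```

The operator only writes the policy. The loop decides what to change, and people see only the items that land in the escalation list, which is the division of labor the paragraph above argues for.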
Simplification of operational tasks is crucial to managing complex and large-scale systems. To operate at scale, humans must define the policies and machines must implement them. We still have quite a way to go before policy-based management is the standard in IT operations.