I am intrigued by the design decisions that are made as products are developed. I find it amazing how often problems are solved in completely different ways in different products. Sometimes these decisions show up when you are not expecting them. I encountered one such example at a vBrownBag TechTalk presentation at the OpenStack Summit in Barcelona last month. The presentation was about deploying OpenStack in multiple telco point of presence (PoP) data centers to deliver network functions virtualization (NFV). It was a joint presentation between AT&T and Mirantis; you can find the complete video here on YouTube.
The basic point of the presentation was that OpenStack was deployed in a space-constrained environment for a moderate workload. To deliver a full OpenStack deployment, the architecture used Ceph running inside VMs for shared storage. The Ceph VMs consumed local storage inside the compute nodes and turned it into redundant shared storage. This is a basic hyperconverged architecture that will be familiar from the enterprise IT environment. To be clear, this architecture did not deliver the simplification of management that is usually a central part of enterprise hyperconverged products. Instead, a lot of automation was built by the support teams, which was viable because there were hundreds of PoPs that all needed the same infrastructure deployed. In this use case, the hyperconverged architecture was about minimizing the data center footprint in the PoP. Since the platform was deployed into hundreds of PoPs, any cost saving was multiplied many times over.
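To make the shape of that architecture concrete, here is a minimal sketch of one PoP compute node in this design. It is purely illustrative: the names, disk paths, and counts are my assumptions, not details from the talk.

```python
# Illustrative sketch of one PoP compute node in this design; the names,
# disk paths, and sizes are my assumptions, not details from the talk.
from dataclasses import dataclass

@dataclass
class PopComputeNode:
    name: str
    local_disks: list[str]   # spindles physically inside this compute node

    def ceph_osd_vm(self) -> dict:
        # Each compute node runs a Ceph VM that consumes the node's local
        # disks as OSDs; replication across the PoP's nodes is what turns
        # them into redundant shared storage.
        return {"role": "ceph-osd", "osds": self.local_disks}

node = PopComputeNode("pop-compute-01",
                      local_disks=["/dev/sdb", "/dev/sdc", "/dev/sdd"])
print(node.ceph_osd_vm())
# Workload VMs on the same nodes then consume Ceph RBD volumes instead of local disk.
```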
The interesting design decision I picked up on was that the Ceph VMs were built with a limited number of disk spindles. We usually hear about Ceph nodes being built with a lot of SSD to accelerate storage performance, so a handful of spinning hard disks was an interesting choice. The rationale was that limiting storage performance also limited the Ceph VMs’ demand for CPU, leaving more CPU for running workload VMs. Wow: the customer chose to limit storage performance in order to reduce the “hyperconverged” overhead on the nodes. Personally, I have not seen that decision made in any hyperconverged system.
So, just how much CPU time does it take to handle high-performance storage, whether Ceph or anything else? The first thing to realize is that the standard Linux storage stack was designed to work with hard disks and moderate storage performance. It isn’t designed to extract every last IOPS out of each GHz of CPU time. The result is that the standard Linux I/O stack adds latency but doesn’t tax the CPU too much.
How do other hyperconverged solutions handle the CPU load of storage and prevent it from interfering with workload VMs? The first thing is that they usually don’t use the standard Linux storage stack. Each hyperconverged vendor spends a lot of time replacing and optimizing the Linux storage software, or places the storage software in the hypervisor kernel rather than in a Linux guest VM. Either way, CPU utilization is lower for a given workload, or more performance is delivered for a given CPU load. Next, most hyperconverged vendors use expensive CPUs with many, many cores. If the physical server has thirty-six cores and I let the storage stack have all of four cores, then there is still 89% of the CPU time available for workload VMs. Bear in mind that a single fast SSD at full performance takes roughly an entire Intel CPU to drive it. That balance favors maximum storage throughput at the price of high CPU utilization.
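That trade reads very differently on a smaller server. Here is the budget arithmetic as a quick sketch: the 36-core node and four storage cores are the example above, while the 12-core node anticipates the smaller PoP servers discussed next.

```python
# Back-of-the-envelope CPU budget for a hyperconverged node.
def workload_share(total_cores: int, storage_cores: int) -> float:
    """Fraction of the node's CPU left for workload VMs after storage takes its cores."""
    return (total_cores - storage_cores) / total_cores

# Big enterprise hyperconverged node: 36 cores, 4 handed to the storage stack.
print(f"36 cores, 4 for storage: {workload_share(36, 4):.0%} left for workloads")
# Smaller two-socket, six-core node: giving away the same 4 cores hurts far more.
print(f"12 cores, 4 for storage: {workload_share(12, 4):.0%} left for workloads")
```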
I suspect that the servers deployed into these PoPs were using Intel’s “value” CPUs. Using lower-cost CPUs requires that the rest of the design be careful with CPU consumption. Two sockets of six cores each are the sweet spot for price and moderate performance. With this lower core count, and probably lower clock speed, we do not want to give away whole cores to the hyperconverged storage VMs. Allocating four or six hard disks to the Ceph VMs means that each VM only demands a few hundred MHz of CPU time, leaving plenty of CPU time for the workload VMs. Just don’t expect great storage performance: six spindles of hard disk aren’t going to compete with a single fast SSD. Luckily, the workload in these PoPs is largely NFV, which is likely to be light on storage and heavier on loading the network.
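Those six spindles also cap how hard the Ceph VMs can ever be pushed, which is what keeps their CPU demand so low. Here is a rough sense of the scale difference, using generic rule-of-thumb IOPS figures that are my assumptions rather than numbers from the presentation.

```python
# Rough ceiling on the I/O the Ceph VMs can be asked to serve; OSD CPU demand
# scales with the I/O actually served, so a low ceiling means low CPU demand.
HDD_IOPS = 150       # assumed random IOPS for a single 7.2k spindle
SSD_IOPS = 50_000    # assumed random IOPS for a single fast SSD

spindle_ceiling = 6 * HDD_IOPS
ssd_ceiling = 1 * SSD_IOPS

print(f"Six spindles top out around {spindle_ceiling} IOPS")
print(f"One fast SSD tops out around {ssd_ceiling} IOPS, "
      f"roughly {ssd_ceiling / spindle_ceiling:.0f}x more work for the storage VM to drive")
```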
System design is always a complex problem with a lot of variables. Design decisions are often a tradeoff of one resource against another. I love to see unusual design decisions, as they always help me to think differently about problems I face. In this case, an unusual decision fits well with the requirements of the workload and deployment.