In a previous article, I wrote that customers don’t care whether a hyperconverged solution uses a VSA or runs the storage cluster in-kernel. I stand by that assertion. One of the comments pointed out that I had missed an area of discussion: that of the resource requirements of the VSA itself. I still don’t think that customers care, but for completeness, I’ll examine them. The point here is that the VSA that most HCI vendors use to provide shared storage is usually a fairly beefy VM. The resources allocated to the VSA are not available to run workload VMs. This logic says that the VSA-based HCI can run fewer VMs than an in-kernel-based HCI. The problem with this argument is that most of the VSA resources are doing storage cluster work. Moving the same storage cluster into the kernel requires almost the same resources. The big difference with in-kernel resource usage is that there isn’t something you can easily point to as taking up these resources. VSA resource usage is all assigned to the VSA; in-kernel resource usage can’t be accounted to a single object. There is no smoking gun of resource usage.
The first thing to do is quantify the size of these VSA VMs. In my experience, the VSA on each HCI node uses between 24 GB and 96 GB of RAM. The exact number depends on the amount of storage being managed and the data-efficiency features in use. More persistent storage capacity means more RAM is required simply to keep track of the objects being stored; usually it is the storage capacity in each node, rather than the total size of the cluster, that drives the RAM requirement. Adding deduplication increases the amount of RAM used, because there is additional metadata that must live in the fastest storage media available. Tiered storage (SSD plus HDD) also means more RAM used for metadata. The other great use of RAM in the VSA is optimizing performance: spare RAM is usually used as a read cache to give outstanding performance for the most frequently accessed data blocks. These RAM requirements aren’t specific to an HCI platform. Pretty much every modern storage array uses a bunch of RAM for exactly the same purposes; it’s not unusual for the storage controllers in a mid-range array to have 256 GB of RAM each. And there is my key point: to do the work the VSA is doing, you need a certain amount of resources. Do the same work in a storage array and you need similar resources; do the same work in-kernel and you need the same resources.
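To show how those drivers stack up, here is a rough sketch of a VSA RAM estimate in Python. Every coefficient is a hypothetical placeholder, chosen only so the examples land inside the 24 GB to 96 GB range above; none of it comes from any vendor’s sizing guide.

```python
# Rough, illustrative VSA RAM estimator. Every coefficient below is a
# hypothetical placeholder, not a figure from any vendor's sizing guide.

def estimate_vsa_ram_gb(capacity_tb, dedup=False, tiered=False, read_cache_gb=8):
    ram = 16.0                    # assumed base footprint of the VSA itself
    ram += 1.5 * capacity_tb      # assumed metadata cost per TB of local capacity
    if dedup:
        ram += 1.0 * capacity_tb  # assumed extra metadata for deduplication
    if tiered:
        ram += 0.5 * capacity_tb  # assumed extra metadata for SSD + HDD tiering
    ram += read_cache_gb          # spare RAM used as a read cache
    return ram

# A small node with 4 TB of local capacity and no data-efficiency features:
print(estimate_vsa_ram_gb(4))                                              # -> 30.0
# A larger node with 20 TB, deduplication and tiering enabled, bigger cache:
print(estimate_vsa_ram_gb(20, dedup=True, tiered=True, read_cache_gb=16))  # -> 92.0
```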
For the vast majority of HCI customers, the VSA will use around 10% to 20% of the resources of each HCI node. This is a resource demand that needs to be accounted for in capacity planning. Your HCI nodes will deliver fewer compute resources to workload VMs than if the same physical servers accessed a Fibre Channel SAN. Whether the work is done in-kernel or by a VSA may make a difference of a couple of gigabytes of RAM and a little CPU time on each node. This isn’t a significant amount of resources to most customers.
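To make that capacity-planning impact concrete, here is a minimal back-of-the-envelope sketch. The node and VSA sizes are illustrative assumptions consistent with the 10% to 20% range above, not measurements from any particular HCI product.

```python
# Back-of-the-envelope capacity planning for a VSA-based HCI node.
# All figures are illustrative assumptions, not vendor specifications.

def usable_resources(node_ram_gb, node_cores, vsa_ram_gb, vsa_cores):
    """Return the RAM and CPU left for workload VMs after the VSA takes its share."""
    workload_ram = node_ram_gb - vsa_ram_gb
    workload_cores = node_cores - vsa_cores
    overhead_pct = 100.0 * vsa_ram_gb / node_ram_gb
    return workload_ram, workload_cores, overhead_pct

# Hypothetical node: 256 GB RAM, 24 cores, with a 32 GB / 4-vCPU VSA.
ram, cores, overhead = usable_resources(256, 24, 32, 4)
print(f"Workload RAM: {ram} GB, workload cores: {cores}, "
      f"VSA RAM overhead: {overhead:.1f}%")
# -> Workload RAM: 224 GB, workload cores: 20, VSA RAM overhead: 12.5%
```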
Circling back to the original point, customers don’t care whether your storage cluster is VSA or in-kernel. They do care how much of each node’s physical resources are available to run VMs. This is the RAM and CPU capacity that they want on your HCI spec sheet. Installed RAM capacity that they cannot use for workload VMs is not interesting. The whole point of HCI is to focus on the workload VMs.
What about the small number of customers for whom the few GB of resources and few GHz of CPU time is important? A customer whose entire VM estate uses 64 GB of RAM (e.g., eight VMs with 8 GB of RAM each) is likely to be concerned with every last physical GB. If they need a three-node cluster and each node loses 8 GB of RAM to the VSA operating system, then the overhead is significant. These customers still don’t care if it’s a VSA or in-kernel. What they care about is that your storage cluster scales down its resource usage: scales it way down. This scaling is hard; making the same solution work for both ten VMs and ten thousand VMs is difficult. I would be surprised if the hyperconverged products that support thousands of VMs scaled down to do a great job of running ten VMs. Similarly, I doubt that a product designed to support a dozen VMs will effectively solve the problems of a customer with fifty dozen VMs.
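As a rough illustration of why a fixed per-node overhead stings at this scale, the sketch below uses the figures from the example above (three nodes, 8 GB lost per node, a 64 GB VM estate); the 64 GB node size is my own assumption.

```python
# Illustrative arithmetic for the small-cluster case described above.
# Figures follow the example in the text (3 nodes, 8 GB lost per node,
# 64 GB total workload); the 64 GB node size is an assumption.

nodes = 3
vsa_ram_per_node_gb = 8   # RAM lost to the storage layer on each node
workload_ram_gb = 64      # entire VM estate: e.g. eight VMs with 8 GB each
node_ram_gb = 64          # hypothetical small-node size

total_vsa_ram = nodes * vsa_ram_per_node_gb
total_cluster_ram = nodes * node_ram_gb

print(f"Storage overhead: {total_vsa_ram} GB "
      f"({100 * total_vsa_ram / total_cluster_ram:.1f}% of the cluster, "
      f"{100 * total_vsa_ram / workload_ram_gb:.1f}% of the workload footprint)")
# -> Storage overhead: 24 GB (12.5% of the cluster, 37.5% of the workload footprint)
```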
Customers do not care if your HCI uses an in-kernel storage cluster or a VSA. They care whether your HCI is a good solution to their problems. Can your platform run their VMs in a cost-effective way? How easy is your HCI to operate, upgrade, and maintain? Is there more value in replication, backups, and DR that will solve their problems? Customers should not have to care how your HCI works; it’s a red herring. Sell your customers on what your HCI does for them.
Alastair, I have enjoyed reading your articles. I think it is very important that vendors try to win over customers on the merits of their HCI offering and not focus on semantics such as how those merits work in the background.
However, I am not sure I agree with one of your core pieces of rationale for why VSA=Kernel: “Moving the same storage cluster into the kernel requires almost the same resources.” In my experience, this has not proven to be the case, but personal experiences aside, I think the time is right to perform an empirical study comparing the resource consumption of select VSA and in-kernel HCI offerings to view those differences objectively. This is not to suggest that the outcome should or will provide any sort of “smoking gun”, simply that it would back up these suppositions with real-world figures.
Also, the other tenet you put forward is: “Whether the work is done in-kernel or by a VSA may make a difference of a couple of gigabytes of RAM and a little CPU time on each node. This isn’t a significant amount of resources to most customers.” The question of resource consumption amplification matters not just to SMB customers, or even to a small number of Enterprise customers, but more broadly to the applicability of the HCI solution. Customers purchasing an HCI solution for branch offices, UAT, test/dev, and so on may have smaller environments within a much larger scope. This resource overhead is therefore important not only for capacity planning but also for procurement, as they may be forced to purchase a larger, more expensive set of hardware (a bigger collection of nodes) because of the overhead. The delta could end up being tens of thousands of dollars, depending on the models required.
Hello Chip,
While we would very much like to do such research, it is almost impossible to do without picking an in-kernel or VSA implementation that an HCI vendor is NOT using, and therefore opening such research to the same type of comments. Could we use VSAN and StoreVirtual as an example? Yes, we could, but that is not the same as picking the internals of Nutanix, Scale, Simplivity, etc. There are not many in-kernel options that we can just install and use. We would need a stack that is A) 100% software and B) un-entangled by hardware restrictions/requirements. Now, that being said, if we use Hyper-V with DataCore we are in-kernel, and if we use vSphere with VSAN we are also in-kernel. Those options are possible, but do they match the in-kernel implementations used by various HCI vendors closely enough to make a difference in the discussion? My feeling is that if you have the proper underlying hardware and use cache memory and SSD appropriately, both will fare just as well. Actually, that is what happens in our own environment, where we run VSAN + StoreVirtual. Both perform adequately, and both require certain sets of resources to work effectively as primary storage, namely redundancy. But would such a test be acceptable to HCI vendors who do not use VSAN? Or even Hyper-V?
Speeds and feeds only go so far; they give the ‘potential’, perhaps not the reality, of usage. So what workload would be ‘real’ enough?
Best regards,
Edward Haletky
Perhaps it would be a good start to gather figures on the most common configurations that are seen and deployed. I agree it is somewhat problematic either way, but the figures don’t necessarily have to be framed in a competitive manner; they can simply be test results from which others are free to draw their own conclusions.
The key point remains that customers want to know what they can do with your HCI product rather than the decisions you made to create the product.
Lab testing anything gets problematic in a couple of ways. First, how do you replicate a customer workload? After all, we have seen plenty of useless benchmarks that don’t reflect any real-world workload. Next, consider that HCI products do not have a uniform feature set, so which features should we include in testing? What about real street prices for comparison? Nobody pays list price, but different vendors use different discount schemes. Then, once customers start to trust a specific test, the vendors will start to game that test.
These aren’t reasons not to create a good HCI comparison benchmark, just reasons it would be hard. I would love to see a way of comparing CI and HCI products using real workloads.