This is the third article in a series about data locality in hyperconverged infrastructure (HCI). In the first article, I discussed basic math around storage network IO and the effect of data locality on the storage network load. In the second, I examined the impact of HCI configurations with more than two copies of data. I’ll wrap up with a look at how data locality is usually incomplete and at a special case of data locality.
Loss of Data Locality
vMotion is a central part of a vSphere cluster, allowing running VMs to move from one host to another. Naturally, this can get in the way of data locality, at least on an HCI that has locality; if the HCI does not try to achieve data locality, then nothing changes after a vMotion. On an HCI with locality, the destination host will only have coincidental locality for the VM after a vMotion. Making the whole VM data set local again is a costly (lots of network load) and slow activity, so locality is generally not reestablished proactively. Instead, most HCI products reestablish data locality only for the blocks that the VM accesses. On the first read IO to a block that is remote, the host keeps a local copy so that future IO to that block is local. Any write IO happens both locally and remotely, and the local write provides data locality for that block. Over time, data locality is reestablished, at least for the data blocks that the VM is accessing.

Remember that data locality is only important for data that is accessed; unaccessed, remote data doesn't cause any issue. This is why I have focused on VM IOs rather than the size of the VM's disk: it is the actively accessed blocks that need to be local. In an environment with DRS in use, VMs routinely move around the cluster, so most VMs never achieve 100% data locality; around 80% is common. Just how much data locality is lost depends on the environment; the rate of vMotions and the rate of change of the VM's active disk blocks drive the loss of locality.
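To make the lazy rebuild concrete, here is a minimal sketch of that read and write path, assuming a simple per-host block map. The class and method names are hypothetical and do not correspond to any vendor's implementation.

```python
# Toy sketch of lazy locality rebuilding after a vMotion. All names here are
# made up for illustration; this is not any vendor's API.

class RemoteReplica:
    """Stands in for another host reached over the storage network."""
    def __init__(self):
        self.blocks = {}

    def fetch(self, block_id):
        return self.blocks.get(block_id)

    def store(self, block_id, data):
        self.blocks[block_id] = data


class LocalDataPath:
    """Data path on the host a VM has just vMotioned to."""
    def __init__(self, replica_hosts):
        self.local_blocks = {}            # blocks this host already holds
        self.replica_hosts = replica_hosts

    def read(self, block_id):
        if block_id in self.local_blocks:               # local hit: no storage network IO
            return self.local_blocks[block_id]
        data = self.replica_hosts[0].fetch(block_id)    # first read of a remote block crosses the network
        self.local_blocks[block_id] = data              # keep a copy so future reads are local
        return data

    def write(self, block_id, data):
        self.local_blocks[block_id] = data              # local write restores locality for this block
        for host in self.replica_hosts:                 # replica writes still cross the network
            host.store(block_id, data)
```

The point of the sketch is simply that locality comes back one accessed block at a time, which is why the actively used working set, not the full disk size, determines how quickly it recovers.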
I would like to be able to do the comparison math again, but it simply isn’t possible; the degree to which data locality is compromised depends on multiple factors. My expectation is that data locality will reduce as the cluster size increases, and the loss will be greater in a cluster with a lot of vMotion activity. The application type also drives how much IO ends up going to nonlocal blocks. Still, I would be stunned if there were an environment where an HCI without locality didn’t produce more storage network IO than an HCI that did have data locality.
Special thanks to Josh Odgers for checking my IO math and raising the issue of incomplete data locality.
Double Locality
There is a special case of data locality in which the complete primary and secondary copies of a VM’s data are held on two hosts. All two-node HCI clusters work this way, as there are only two nodes to hold the two copies. Some HCI products, specifically SimpliVity, work this way with larger clusters. Each VM has complete data locality on two of the nodes. You end up with a set of host pairings, and some VMs are stored across each pairing. In a three-node cluster, there are three pairings. In a six-node cluster, there are fifteen pairings. Each pairing holds a fraction of the cluster’s VMs. The VMs on a pairing can vMotion within the pairing and retain complete data locality. However, a VM that is vMotioned outside the pairing loses all data locality. Clearly, in that architecture, it is important to constrain VMware DRS to keep the VMs where their data is local. There are interesting availability effects from this double locality, but this isn’t the article for that discussion.
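For reference, those pairing counts follow directly from choosing two hosts out of the cluster's n nodes:

```latex
\text{pairings}(n) \;=\; \binom{n}{2} \;=\; \frac{n(n-1)}{2},
\qquad \binom{3}{2} = 3, \qquad \binom{6}{2} = 15
```

The number of pairings grows much faster than the node count, which is part of why constraining DRS matters more as the cluster gets larger.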
Bringing It Home
In a small environment, data locality is not going to be a big issue. However, with a large number of nodes in a single cluster, it can be critical. The effect of having data locality is that the total cluster performance can scale linearly as the node count increases. Without data locality, the storage network performance can easily bottleneck a large cluster. Consistent performance as you scale out is an important characteristic of an HCI.
The big question here is whether the VMware vMotion and vSAN developers worked together. If not, they could end up with something “optimal” for each technology but sub-optimal when put together.
Oversimplified: if you add some parameters to vMotion DRS to prefer moving VMs to the hosts holding the most replica data, you should statistically get the lowest locality miss rate.
And if vSAN tried to keep the VM data on as few hosts as possible, then a “smart” vMotion shouldn’t hurt that much.
Of course, there are other things you might miss out on with fewer replica hosts, like the number of disk spindles, network connections, and PCI buses, or the VM IOPS pressure over time, that might affect performance, but those could just be more parameters for vMotion DRS.
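Purely as an illustration of that idea (DRS exposes nothing like this today, and the names and weighting below are invented), a replica-aware placement score might look something like this:

```python
# Toy replica-aware placement score, illustrative only. Real DRS balances
# CPU and memory; the locality term here is an invented extension.

def placement_score(host, vm, locality_weight=0.5):
    """Score a candidate host for a VM; higher is better."""
    cpu_headroom = 1.0 - host["cpu_used"] / host["cpu_total"]
    mem_headroom = 1.0 - host["mem_used"] / host["mem_total"]
    # Fraction of the VM's active blocks already held on this host.
    local_fraction = host["replica_blocks"].get(vm["id"], 0) / max(vm["active_blocks"], 1)
    balance = (cpu_headroom + mem_headroom) / 2
    return (1 - locality_weight) * balance + locality_weight * local_fraction

def pick_host(hosts, vm):
    """Prefer the host with the most headroom and the most local replica data."""
    return max(hosts, key=lambda host: placement_score(host, vm))
```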
My point is that data locality and cluster size are things that we as “users” of the system should not have to worry about (beyond being aware that they can sometimes affect performance). The developers should put at least some effort into automating optimal placement, not just use the standard CPU/memory parameters.
Unfortunately, a few years back I talked to a VMware developer about the optimal placement of VMs with the same or similar operating system for more efficient TPS (transparent page sharing). They told me they didn’t do that, which resulted in less than optimal TPS.
I really hope VMware puts more effort into DRS heuristics with vSAN.
As I understand it, vSAN does not have data locality as a design principle. As a result, there is no way to have DRS take data locality into account.
Bear in mind that DRS only accounts for CPU and memory; IO is not a factor in DRS recommendations at all. If you want this sort of intelligence, you have to look at things like VMturbo, sorry, I mean Turbonomic, for more holistic load accounting.
On the point of whether users should care, I would say that some users care and some do not. At some point, the users who do not care are relying on experts who care very much. AWS is a classic example: I don’t need to know or care how they deliver their services, but there are a lot of experts at AWS who know in excruciating detail exactly how they deliver those services. If it weren’t for the experts at AWS, I might need to build my own services, and then I would have to care.