In a recent Twitter discussion, we talked about data locality. What started the conversation was a comment (paraphrased) that if you think about data locality, you think about a specific vendor. My response was that when I think about data locality, I really do not think about that vendor. This led to further comments and discussion. All in all, it was interesting. The result is this look at data locality. My premise is that in order to have a true definition of data locality, one must consider scale, scope, and data governance (security, control, transformation, and protection). The simple definition no longer applies in the world of the secure hybrid cloud.
That simple definition is:
Data locality is the application and data being as close together as possible.
In general, the above statement implies moving data to the application using various methods. Those methods are often described as operating within a cluster of nodes, or intracluster. Intracluster techniques are fairly well defined, and a few of the approaches many companies take are described below. In each of these cases, however, the companies are really depending on reads of new data being few in number, with many reads of old data and few writes. That limitation implies only a small amount of data to move around to bring it closer to an application. But what if there is a large amount of data to read, or the data is not nearby? How do we handle those cases?
In Figure 1, we see workloads and storage as close together as possible, within the same nodes. This is typical of most standalone systems without remote storage: just local disks.
Figure 2 introduces a form of data locality where all net-new reads happen remotely, while all repeat reads and writes happen locally. In this case, if the data is all new, manipulated in memory, and then written once, you will end up reading all of the data across the network, although the writes will be local. This approach assumes that the number of net-new reads is far smaller than the number of repeat reads or writes, and it requires a fair amount of low-latency bandwidth between nodes. The ultimate form of this is SGI Link, where the latency between each node in a rack is only 10ns. In effect, each node is close enough that the remote read of new data is just not a problem if backed by SSD. If the latency between nodes is low enough, then every node is effectively local to every other node. Nutanix is an example of this approach using low-latency, non-SGI Link networking.
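To make the Figure 2 pattern concrete, here is a minimal Python sketch, assuming a hypothetical RemoteStore that stands in for data held on other nodes: the first read of a block crosses the network, while repeat reads and all writes stay on the local node. This is an illustration of the pattern only, not any vendor's implementation.

```python
# Figure 2 pattern: net-new reads are remote, repeat reads and writes are local.
# RemoteStore and the block-id scheme are hypothetical.

class RemoteStore:
    """Stands in for data held on other nodes in the cluster."""
    def __init__(self, blocks):
        self.blocks = blocks          # block_id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]  # crosses the network in a real system


class LocalityNode:
    def __init__(self, remote):
        self.remote = remote
        self.local = {}               # block_id -> bytes, node-local copies

    def read(self, block_id):
        if block_id in self.local:    # repeat read: stays on this node
            return self.local[block_id]
        data = self.remote.read(block_id)   # net-new read: remote fetch
        self.local[block_id] = data         # keep a local copy for next time
        return data

    def write(self, block_id, data):
        self.local[block_id] = data   # writes always land locally


if __name__ == "__main__":
    remote = RemoteStore({"a": b"cold data"})
    node = LocalityNode(remote)
    print(node.read("a"))    # first read goes across the "network"
    print(node.read("a"))    # second read is served locally
    node.write("b", b"new")  # write stays local
```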
Figure 3 introduces the concept of a cache that is kept consistent between nodes. If data is not in the cache, we fall back to what happens in Figure 2. This approach once more assumes that reads of existing data in the cache are more prevalent. It also implements a write cache in which all writes happen locally and are propagated across the cache. While new data is read via a slower path, you also need a fair amount of bandwidth to keep the cache consistent. Datrium is an example of this approach.
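Below is a minimal sketch of the Figure 3 pattern, assuming a hypothetical BackingStore and a naive push-to-peers update to keep the caches consistent: a cache miss falls back to the slower path, and every write goes through the local cache to the store and to each peer. This is illustrative only and is not how Datrium implements its cache.

```python
# Figure 3 pattern: per-node caches kept consistent across nodes, with
# write-through to backing storage and a slower path on cache misses.

class BackingStore:
    """Stands in for the slower, non-cached storage path."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value


class CachingNode:
    def __init__(self, store):
        self.store = store
        self.cache = {}
        self.peers = []               # other CachingNode instances

    def read(self, key):
        if key in self.cache:         # hit: existing data, served locally
            return self.cache[key]
        value = self.store.read(key)  # miss: slower path to backing storage
        if value is not None:
            self.cache[key] = value
        return value

    def write(self, key, value):
        self.cache[key] = value       # local write ...
        self.store.write(key, value)  # ... written through to the store ...
        for peer in self.peers:       # ... and pushed to peers so every
            peer.cache[key] = value   #     cache stays consistent


if __name__ == "__main__":
    store = BackingStore()
    a, b = CachingNode(store), CachingNode(store)
    a.peers, b.peers = [b], [a]
    a.write("k", "v1")
    print(b.read("k"))   # served from b's cache, already consistent
```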
In Figure 4, we add the concept of erasure coding, where data written to one node is encoded into data and parity fragments that are simultaneously distributed across the other nodes. This approach can scale indefinitely as long as the mesh between nodes is maintained. For performance, the mesh between the nodes often has to be a low-latency network, although depending on the workloads, that may not be required. ioFABRIC and Western Digital’s ActiveScale system are examples of this approach.
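As a toy illustration of the idea (not the codes these products actually use), the Python sketch below splits a block into data fragments plus a single XOR parity fragment, as if each fragment were placed on a different node, and rebuilds any one lost fragment from the survivors. Production systems use stronger codes such as Reed-Solomon that tolerate more than one failure.

```python
# Toy erasure coding: split a block into data fragments plus one XOR parity
# fragment; any single lost fragment can be rebuilt from the others.

def encode(block: bytes, data_fragments: int):
    """Split a block into equal-size data fragments plus one XOR parity fragment."""
    size = -(-len(block) // data_fragments)   # ceiling division
    frags = [block[i * size:(i + 1) * size].ljust(size, b"\0")
             for i in range(data_fragments)]
    parity = bytearray(size)
    for frag in frags:
        for i, byte in enumerate(frag):
            parity[i] ^= byte
    return frags + [bytes(parity)]


def rebuild(fragments, missing_index):
    """Recover one lost fragment by XOR-ing all of the surviving fragments."""
    size = len(next(f for f in fragments if f is not None))
    out = bytearray(size)
    for idx, frag in enumerate(fragments):
        if idx == missing_index or frag is None:
            continue
        for i, byte in enumerate(frag):
            out[i] ^= byte
    return bytes(out)


if __name__ == "__main__":
    fragments = encode(b"hello data locality!", data_fragments=4)
    lost = fragments[2]          # pretend the node holding fragment 2 failed
    fragments[2] = None
    assert rebuild(fragments, 2) == lost
    print("rebuilt fragment:", rebuild(fragments, 2))
```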
In Figure 5, we switch from moving the data around to moving the workloads around. In this approach, we have to know where our data resides. During workload scheduling, the data location is taken into account, and the workload is scheduled on the node that holds the data. This is the approach used by Diamanti, via Kubernetes, for containers.
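Here is a minimal sketch of that scheduling decision, assuming a hypothetical inventory of which node holds which dataset; Diamanti does this through Kubernetes scheduling itself rather than with code like this.

```python
# Figure 5 pattern: move the workload to the data, not the data to the workload.
# The inventory and node/dataset names below are illustrative only.

DATA_INVENTORY = {
    "node-1": {"orders-db"},
    "node-2": {"clickstream", "orders-db-replica"},
    "node-3": {"ml-features"},
}

def schedule(workload: str, dataset: str) -> str:
    """Return the node that already holds the dataset the workload needs."""
    for node, datasets in DATA_INVENTORY.items():
        if dataset in datasets:
            return node
    raise LookupError(f"no node currently holds {dataset!r}")

if __name__ == "__main__":
    print(schedule("reporting-job", "orders-db"))   # -> node-1
    print(schedule("training-job", "ml-features"))  # -> node-3
```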
In Figure 6, we marry several constructs together: a mesh that maintains data mobility and data protection, plus workload placement. This approach requires us to know where all data is at any given time. Yet it has the strength that we can start workloads anywhere within our environment, even if it is a hybrid cloud. Data would move between the nodes per standard data locality; if the data does not yet exist at the chosen location, our workload placement logic would start the workload where the data currently resides. Once the data is properly migrated, we have the choice of migrating the workload between nodes, clouds, on-site, etc., or recreating the workload in the proper location. There are suggestions of this possibility from a number of vendors, but nothing is in production yet.
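Since nothing like this ships today, the following is only a sketch under assumed names: a placement function that consults a data catalog, starts the workload beside the data when the preferred location does not yet hold it, and records the follow-up migration or recreation step.

```python
# Figure 6 idea (not yet available from any vendor): placement logic that
# knows where every dataset lives and plans the data and workload moves.
# All locations, dataset names, and the plan format are assumptions.

DATA_LOCATIONS = {          # dataset -> location currently holding it
    "orders-db": "on-site-cluster",
    "clickstream": "cloud-a",
}

def place_workload(dataset: str, preferred: str) -> dict:
    """Decide where to start a workload and whether a follow-up move is needed."""
    current = DATA_LOCATIONS.get(dataset)
    if current is None:
        raise LookupError(f"no catalog entry for {dataset!r}")
    if current == preferred:
        return {"start_at": preferred, "then": "run in place"}
    # Data is elsewhere: start beside the data, migrate the data, then move
    # or recreate the workload at the preferred location.
    return {"start_at": current,
            "then": f"migrate data to {preferred}, then migrate or recreate workload"}

if __name__ == "__main__":
    print(place_workload("orders-db", "cloud-a"))
    print(place_workload("clickstream", "cloud-a"))
```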
Final Thoughts
The above lists the various forms of data locality that exist and some of the vendors that supply each approach, excepting the last approach, which is a combination that is not yet available. The data locality approach to use depends on your application as well. If your application is heavy on reads of new data, then you need to look at the later approaches to data locality. If it is heavy on reads of old data, then any approach will apply. Know your application to choose the best approach for you. Even the type of application, such as high-performance analytics versus a LAMP stack, will change your choices on data locality.
As you can see, the discussion went from single nodes (Figure 1), to clusters of nodes (Figures 2–3), to clouds (Figures 4–6) pretty quickly—and we did not even mention the real concerns of data locality and migration between clouds or on-site to a cloud.
Those concerns are becoming increasingly important. They are, in no particular order:
- Data Protection (security, retention, signatures, recovery, continuity)
- Data Transformation (encryption, redaction, byte-order, header/footer changes, etc.)
- Data Jurisdiction (legal)
- Data Integrity (who changed what, when, where, and how; blockchain)
- Data Catalog (catalog of catalogs, data investigation, what is where)
The last issue seems to be where copy data solutions are heading, but they are not quite there yet. The others have been important for ages and are not going away anytime soon. In essence, as the scope of our data locality increases to span clouds, our need to gain control of our data increases. The control starts with knowledge.
How do you handle data locality today? Is it cluster based? Site based? Cloud to cloud? How?