We are all moving toward the future. The election has, hopefully, forced us to rethink some of the basic fundamentals of society. Individuals are usually easy to deal with; larger and larger groups are much harder. Data scale changes everything. Isaac Asimov had this in mind when he wrote the Foundation trilogy: in his case, scale worked to smooth out predictions. We are not at that scale yet, though perhaps one day we will be. Our data, however, has already far exceeded the scale we take for granted. So let us think about scale. What is high scale to you? For some businesses, high scale might mean a few hundred million queries, and the associated records, per day. For others, it is tens of billions of queries and ten times that many records per day. Where do you fit? As your application scales, what do you need to consider?
No matter what we consider to be high scale, we have to change how we deal with it. In many cases, this means using more in-memory solutions, higher-performing low-latency devices, or a combination of the two. For traditional server approaches, memory becomes attractive. Memory is cheap and easy to set up as a cache, but it is also usable for message passing. Message passing moves small amounts of data very quickly. It can even be used to aggregate and coalesce data long before it is written to disk.
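As a rough illustration of that idea, consider the sketch below. It is not tied to any particular product; the names (producer, consumer, events.jsonl, BATCH_SIZE) are my own, purely for illustration. Several producers pass small messages over an in-memory queue to a single consumer, which aggregates them into batches so that many tiny records become a handful of larger disk writes.

```python
import json
import queue
import threading

messages = queue.Queue()
BATCH_SIZE = 1000  # aggregate this many messages per disk write

def producer(worker_id, count):
    """Simulate a service emitting many small records as messages."""
    for i in range(count):
        messages.put({"worker": worker_id, "seq": i})
    messages.put(None)  # sentinel: this producer is done

def consumer(num_producers, path):
    """Coalesce many small messages into far fewer, larger writes."""
    done, batch = 0, []
    with open(path, "a") as out:
        while done < num_producers:
            msg = messages.get()
            if msg is None:
                done += 1
                continue
            batch.append(msg)
            if len(batch) >= BATCH_SIZE:
                out.write("\n".join(json.dumps(m) for m in batch) + "\n")
                batch.clear()
        if batch:  # flush whatever remains at shutdown
            out.write("\n".join(json.dumps(m) for m in batch) + "\n")

if __name__ == "__main__":
    workers = [threading.Thread(target=producer, args=(w, 5000)) for w in range(4)]
    writer = threading.Thread(target=consumer, args=(4, "events.jsonl"))
    for t in workers + [writer]:
        t.start()
    for t in workers + [writer]:
        t.join()
```

Twenty thousand messages arrive over the queue, but the disk sees only a few dozen writes.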
Data scale changes how we need to think about processing. The high-performance computing folks have known this for years, as they deal not only with data scale but also with the limits of current storage. They use services to reduce data so that what is written to disk is the smallest possible quantity. Should all these services live in the application, or in the storage layers?
Today, they stay in the application. However, software-defined storage is all about data services. Because those services can be defined, they can be moved into storage layers, even if those storage layers exist solely in memory. Write coalescing is not new: it is the practice of filling a full block in memory so that a single write covers many chunks of data. Coalescing data for services is also convenient. The real goal is to deal with data scale issues as close to the data as possible, while keeping compute in close proximity. Data locality becomes the Achilles' heel of data scale.
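A minimal sketch of write coalescing, assuming a 4 KiB block size and a hypothetical CoalescingWriter class of my own naming: small chunks accumulate in a memory buffer, and a write is issued only when a full block is available.

```python
import os

BLOCK_SIZE = 4096  # assumed block size for this sketch

class CoalescingWriter:
    """Buffer small chunks in memory; issue one write per full block."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.buffer = bytearray()

    def write(self, chunk: bytes):
        self.buffer.extend(chunk)
        while len(self.buffer) >= BLOCK_SIZE:
            os.write(self.fd, bytes(self.buffer[:BLOCK_SIZE]))
            del self.buffer[:BLOCK_SIZE]

    def close(self):
        if self.buffer:  # flush the partial final block
            os.write(self.fd, bytes(self.buffer))
        os.close(self.fd)

if __name__ == "__main__":
    w = CoalescingWriter("coalesced.bin")
    for i in range(10_000):  # many tiny application writes...
        w.write(f"record-{i}\n".encode())
    w.close()                # ...become a few dozen block-sized device writes
```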
Data service products are on the horizon. ioFABRIC, Hedvig, and others are thinking about data services now. As storage moves to general compute platforms, data services become a major factor in using scale-out storage solutions. What is your strategy?
One strategy that works well with billions of data elements is to write data locally to memory and bucket it by time, perhaps at the per-minute mark, coalescing on time rather than size. The next step is to write each block of data to persistent storage somewhere, perhaps using one of the new NoSQL databases. The heavy lifting on the data is done elsewhere, leaving the primary systems write-heavy while the secondary systems end up read-heavy. These secondary systems become the analytics engines. They may not need to be as large a cluster as the write-heavy systems.
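Here is one way that strategy might look in code, kept deliberately generic: events are grouped into per-minute buckets in memory, and each closed bucket is handed to a persistence callback that could front a NoSQL database. The MinuteBucketer class and persist_bucket callback are assumptions for illustration, not any vendor's API.

```python
import time
from collections import defaultdict

def minute_key(ts: float) -> int:
    """Bucket key: the epoch timestamp truncated to the minute."""
    return int(ts) // 60

class MinuteBucketer:
    """Coalesce events by time (per minute), not by size."""

    def __init__(self, persist_bucket):
        self.buckets = defaultdict(list)
        self.persist_bucket = persist_bucket

    def add(self, event, ts=None):
        self.buckets[minute_key(ts if ts is not None else time.time())].append(event)

    def flush_closed(self, now=None):
        """Persist every bucket older than the current minute."""
        current = minute_key(now if now is not None else time.time())
        for key in sorted(k for k in self.buckets if k < current):
            self.persist_bucket(key, self.buckets.pop(key))

if __name__ == "__main__":
    def persist_bucket(key, events):  # stand-in for a NoSQL batch insert
        print(f"bucket {key}: {len(events)} events")

    b = MinuteBucketer(persist_bucket)
    base = time.time() - 180          # three minutes of synthetic events
    for i in range(300):
        b.add({"seq": i}, ts=base + i * 0.6)
    b.flush_closed()
```

The write path stays cheap; the heavier analytical reads happen wherever the persisted buckets land.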
Each system does its job. We have many services running within a server to handle the data write: services that read messages, coalesce the data into a usable format, and even perform the write itself. Some of these services end up more intelligent than others. The real question becomes: “What strategy will you use to scale up your data services to handle your data scale issues?”
Any approach we take needs to be distributed and to involve thinking outside the box. We need to consider latency within clouds, between clouds, and within our hybrid cloud. We need to consider the costs of moving data around. We also need to understand the data itself and the policies that surround it. We may need even more data services to tokenize, translate, encrypt, and otherwise transform data before we write it. We also have to consider disaster recovery at data scale, as well as the management of all that data.
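To make the idea of pre-write data services concrete, the sketch below chains a tokenization stage (an HMAC over a sensitive field, using only the Python standard library) ahead of the write; an encryption or translation stage would slot into the same pipeline. The field name, secret, and file name here are assumptions, not part of any product.

```python
import hashlib
import hmac
import json

SECRET = b"replace-with-a-managed-key"  # assumed key, illustrative only

def tokenize(record, field="user_id"):
    """Replace a sensitive field with a keyed, irreversible token."""
    raw = str(record[field]).encode()
    record[field] = hmac.new(SECRET, raw, hashlib.sha256).hexdigest()
    return record

# Additional stages (translate, encrypt, ...) would be appended here.
PIPELINE = [tokenize]

def write_record(record, out):
    """Run every data service over the record before it is written."""
    for stage in PIPELINE:
        record = stage(record)
    out.write(json.dumps(record) + "\n")

if __name__ == "__main__":
    with open("scrubbed.jsonl", "a") as out:
        write_record({"user_id": "alice@example.com", "bytes": 512}, out)
```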
How do you handle data scale? Are you at billions of data elements a day? Past that?