Connecting the costs we incur to the business value we deliver is something we in IT need to do better. One of the dimensions of this is the value of a single piece of data compared to the cost of storing that piece of data. I think it’s safe to say that not all data is created equal. A contract document for a ten-million-dollar sale is a lot more valuable than a single tweet. What are the implications for IT infrastructure if we know the value of data? Can we make more sensible decisions about how we build our infrastructure based on knowledge of data value?
One of the common sayings in storage is “There is no data recession.” No matter what happens to macroeconomic conditions, more data is stored every year. Back in the 1970s, all of the data was in the mainframe, and only very valuable processes were computerized. The data on the mainframe was, and is, very valuable. In the 1990s the PC revolution came, and a whole lot more data was created. Word documents, spreadsheets, and PowerPoint presentations proliferated. This is still quite valuable data. The web revolution followed, and more data was created, much of it in emails, photos, and GPS data. This decade has brought the social revolution, with its vast numbers of selfies and videos of cats. How valuable is each of these pieces of information?
We are expecting the next deluge of data to be created by the Internet of Things (IoT). Your light bulbs will be creating data, and so will your coffeepot. The value of each individual piece of data must be shrinking as the volume of data keeps increasing. I’m sure that the rate of data growth is outstripping the rate at which the cost of capacity is falling. This means that the cost of storage is never going to be “effectively free,” as some predict. Managing the cost of storage to match the value of data will continue to be a role in IT.
One of the central expectations of IT is that we should keep data forever. This has particularly been the mantra for any organization considering a big data solution. This is a sensible idea for the valuable data, such as that contract and the mainframe data. But what about the vast amount of unstructured data from social media? How about all the IoT data? Is all of it valuable and impossible to recreate, such that we should protect it from disk failure? Should we back it up? Store a copy of it at a second site? All of this data protection is expensive; it requires us to store multiple copies of the data and to move copies around.
One way to reduce storage cost is to use lower-reliability storage: maybe use a disk system with a broadly spread erasure code (EC) where less capacity is consumed by redundancy. Another option is to use a storage architecture with no disk redundancy at all. What is the cost to the business if all of the IoT records on one hard disk are lost? Maybe we lose 4 TB out of a 2 PB data repository, which is 0.2 percent of the data. Would that be a better outcome than buying more disks for redundancy? Bear in mind that disks most often fail after years of service. Thus, the data we might lose is old and may have the least value. It is worth considering the blast radius of a failure. A 2 TB disk loss has less impact than a 4 TB disk loss. If we built a RAID0 array out of ten of the 2 TB disks, then losing one disk invalidates the data on all ten disks, so the blast radius grows to 20 TB.
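The blast-radius arithmetic above is simple enough to sketch in a few lines of Python. This is only an illustration: the names FailureDomain, fraction_lost, and raw_per_usable are hypothetical, capacities use decimal units (1 PB = 1000 TB), and the 10+2 erasure-code layout in the usage lines is an assumed example, not a figure from any particular product.

```python
from dataclasses import dataclass

TB_PER_PB = 1000  # assumption: decimal units, as disk capacities are usually quoted


@dataclass(frozen=True)
class FailureDomain:
    """A set of disks whose data is all lost when any one of them fails."""
    disk_tb: float
    disks: int = 1  # 1 = standalone disk, >1 = striped without redundancy (RAID0)

    @property
    def blast_radius_tb(self) -> float:
        # With no redundancy, one failed disk takes the whole stripe with it.
        return self.disk_tb * self.disks


def fraction_lost(domain: FailureDomain, repository_pb: float) -> float:
    """Share of the repository lost when one disk in the domain fails."""
    return domain.blast_radius_tb / (repository_pb * TB_PER_PB)


def raw_per_usable(data_fragments: int, parity_fragments: int) -> float:
    """Raw capacity bought per unit of usable capacity for a k+m erasure code."""
    return (data_fragments + parity_fragments) / data_fragments


if __name__ == "__main__":
    single_4tb = FailureDomain(disk_tb=4)
    raid0_10x2tb = FailureDomain(disk_tb=2, disks=10)

    print(f"{fraction_lost(single_4tb, 2):.1%}")    # 0.2% of a 2 PB repository
    print(f"{fraction_lost(raid0_10x2tb, 2):.1%}")  # 1.0% of a 2 PB repository

    # Capacity overhead: mirroring (1+1) versus an assumed wide 10+2 erasure code.
    print(raw_per_usable(1, 1))   # 2.0
    print(raw_per_usable(10, 2))  # 1.2
```

Comparing the fraction lost against the extra raw capacity that redundancy would require gives a rough starting point for the question of whether protecting low-value data is worth the money.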
Another consideration is how easy it will be to re-create the data. If we have a backup on tape, then recovery may take hours, but it will still be possible. The data loss will be transitory, but we have the cost of the tape system and tape media lifecycle to manage. Yet another consideration is whether we need fast access to low-value data. We may have a primary copy on non-redundant fast disk. We could use a copy on redundant slower and cheaper disk for recovery if the fast disk fails. Buying less fast disk will save money.
Not all data has the same value, so not all data warrants expensive storage. The more you know about the business value of the data, the better you can tailor the infrastructure for that data. We again see the value of understanding the business when we design infrastructure for that business.