Recently, I opened two interesting support requests, each with a different company. Both companies asked for the output of many different commands and log files, and both balked once I explained my organization’s security policy. The policy reads simply:

No non-anonymized data shall be delivered to a 3rd party.

It is a simple statement, but it has a powerful effect on all data being delivered to third parties, even for support. It implies that all user, machine, and service identifiers must be tokenized, encrypted, or outright removed. What must truly remain anonymous within our data? This is not simply a support question, but rather a major issue with all data today. Do we even know what is in our data? Do you?

I have built many anonymization scripts. Some have been generalized, and some have been specific to particular product logs and use cases. All of them, however, have stripped out identifying information: the kind of information that helps attackers compromise systems and build up knowledge about users, and that exposes company and personal data. Most of this data is metadata—data about data—but it is extremely useful to attackers, to all the bad guys, and to any person or entity that simply wants information, such as a surveillance state.
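As a rough illustration of what such a script does (a minimal sketch, not any particular product's tool; the log format, field patterns, and salt handling are my own assumptions), the core idea is to replace identifiers with consistent tokens so a support engineer can still correlate events without ever seeing the real values:

```python
import hashlib
import re

SECRET_SALT = b"rotate-this-per-engagement"  # assumption: a per-engagement salt kept in-house

def token(kind: str, value: str) -> str:
    """Replace a real identifier with a stable, non-reversible token."""
    digest = hashlib.sha256(SECRET_SALT + value.encode("utf-8")).hexdigest()[:8]
    return f"{kind}-{digest}"

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
USER_RE = re.compile(r"user=(\S+)")
HOST_RE = re.compile(r"host=(\S+)")

def anonymize_line(line: str) -> str:
    """Strip user, machine, and network identifiers from one log line."""
    line = IP_RE.sub(lambda m: token("ip", m.group(0)), line)
    line = USER_RE.sub(lambda m: "user=" + token("user", m.group(1)), line)
    line = HOST_RE.sub(lambda m: "host=" + token("host", m.group(1)), line)
    return line

if __name__ == "__main__":
    sample = "2024-05-01 12:00:01 host=db01.corp.example user=jdoe login from 10.1.2.3"
    print(anonymize_line(sample))
```

Because the same identifier always maps to the same token, support staff can still follow a host or user across log lines without learning which host or user it actually was.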
Data, and the metadata that describes it, needs to be controlled, whether for support, data management, privacy, or intellectual property reasons. Every company needs to step up and add anonymization to any output that might be sent to a third party. However, this is just the tip of the iceberg. There are other reasons to put data controls in place, and they apply not only to metadata—which is essentially what logs contain—but to real data as well.
This is the future of data services. Just as we have network functions virtualization, I expect we will see data functions virtualization. We are already starting to see a data pipeline of sorts being created. Companies like Dataguise protect data on ingest. Even the copy data solution Actifio allows a form of chaining to happen; it does not yet hook into enough functions, but it is a start. Orchestrating data functions is a part of any application today, but do data functions need to live in the application? As an application scales up, this becomes an issue.
Data functions virtualization is the movement of discrete data services into a service chain for use by the application, perhaps by pushing those services lower in the stack, out of the application and into middleware or even hardware. Some of those functions, such as encryption, compression, and deduplication, already reside in hardware. Others, such as tokenization, indexing, and masking, sit higher up the stack. Still others form the basis of today’s analytics engines.
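To make the service-chain idea concrete, here is a minimal sketch (the function names and interfaces are my own assumptions, not any vendor's API) of discrete data functions composed into a chain that an application, a middleware layer, or a storage tier could apply to outbound data without embedding the logic itself:

```python
import gzip
import re
from typing import Callable, Iterable

# A discrete data function: bytes in, bytes out.
DataFn = Callable[[bytes], bytes]

def chain(functions: Iterable[DataFn]) -> DataFn:
    """Compose discrete data functions into a single service chain."""
    def run(payload: bytes) -> bytes:
        for fn in functions:
            payload = fn(payload)
        return payload
    return run

def mask_card_numbers(payload: bytes) -> bytes:
    """Illustrative masking step; real masking rules would be far richer."""
    return re.sub(rb"\b\d{12,19}\b", b"[MASKED]", payload)

def compress(payload: bytes) -> bytes:
    return gzip.compress(payload)

# An encryption step (via a vetted crypto library) would slot in the same way.
outbound = chain([mask_card_numbers, compress])

if __name__ == "__main__":
    record = b"charge card 4111111111111111 for order 42"
    print(len(record), "bytes in ->", len(outbound(record)), "bytes out, masked and compressed")
```

The point of the sketch is that each function is independent of the application; where the chain runs, in the application, in middleware, or in hardware, becomes a deployment decision rather than a code change.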
There is now a need to anonymize data, whether through tokenization, encryption, redaction, or outright removal, and this applies to the metadata about our data as well as to PII, PHI, and PCI-controlled data. As systems become more complex and more API driven, logs become even more important for debugging. Protecting your data should happen at all levels—not just within the data, but within the metadata as well.
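As a hedged example of the redaction and removal options (the field names below are illustrative, not a standard schema), a record can be cleaned before it ever leaves the organization:

```python
import json

# Assumption: illustrative field names, not a standard schema.
REDACT_FIELDS = {"username", "email", "ssn", "card_number"}   # value hidden, field shape kept
DROP_FIELDS = {"internal_hostname", "source_ip"}              # field removed entirely

def clean_record(record: dict) -> dict:
    """Redact or remove identifying fields before a record is shared externally."""
    cleaned = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue
        cleaned[key] = "[REDACTED]" if key in REDACT_FIELDS else value
    return cleaned

if __name__ == "__main__":
    event = {"email": "jane@example.com", "source_ip": "10.1.2.3",
             "action": "login", "status": "ok"}
    print(json.dumps(clean_record(event)))
    # {"email": "[REDACTED]", "action": "login", "status": "ok"}
```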
Should data services be part of the application, or should they be part of something else?
In all cases, this is a serious question that needs more thought. How would one implement data functions virtualization? Should the basics of data transformation, deduplication, compression, protection, or time-based coalescing be part of such data functions? Should these data services be part of data storage, the application, middleware, or all three?
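As a thought experiment only (the interfaces are assumptions, not a proposal for how such services must look), two of the basics named above, deduplication and time-based coalescing, are easy to picture as standalone data functions:

```python
import hashlib
from collections import defaultdict
from typing import Iterable

def deduplicate(chunks: Iterable[bytes]) -> list[bytes]:
    """Keep only the first copy of each chunk, keyed by its content hash."""
    seen: set[bytes] = set()
    unique: list[bytes] = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).digest()
        if digest not in seen:
            seen.add(digest)
            unique.append(chunk)
    return unique

def coalesce_by_time(events: Iterable[tuple[float, str]], window_s: int = 60) -> dict[int, list[str]]:
    """Group (timestamp, message) events into fixed-size time windows."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for ts, message in events:
        buckets[int(ts // window_s) * window_s].append(message)
    return dict(buckets)
```

Whether functions like these belong in storage, in middleware, or in the application itself is exactly the open question.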
We have been watching data repositories grow. Now we need to consider how best to handle our data in the future, how to protect it, and how to transform it into more useful bits.
Anonymization is one such transformation. There are others. What do you need done to your data to make it safer to share, or more powerful for use? Do you have a data sharing and management policy?