We are seeing more and more cloud-based big data solutions for security, business analysis, application performance management, and many other things whose results we see every day, from our searches on Google, Bing, and the like to the email we receive from various marketing campaigns. We know that governments and many others are using big data, whether in cloud or on-premises form, to correlate various forms of data to determine who we are, where we are going, what we are doing, how we are doing it, and sometimes why we are doing it at all. So with all this data out there in the hands of ‘others’, how can privacy be achieved for the individual? We touched on this within the Internet of Things: Expectation of Privacy article, where we spoke about the handling of personally identifiable information (PII).
As we look at the privacy of big data within any cloud, on-premises, or mixed environment, we need to realize that the amount of data could be so large that retroactively redacting it is itself a big data problem. Redacting well-defined PII is possible on ingest, and tools like DataGuise can redact, encrypt, or tokenize such data retroactively as another big data task. But that only handles well-known PII. How do we handle derived PII?
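To make the retroactive case concrete, here is a minimal sketch of tokenizing well-known PII in records that are already stored. It is only an illustration, not how DataGuise or any particular product works; the field names and the in-memory token vault are assumptions.

    import secrets

    # Fields we already know are PII; everything else passes through untouched.
    # These field names are illustrative assumptions, not a standard.
    KNOWN_PII_FIELDS = {"ssn", "email", "phone"}

    # A simple in-memory token vault (value -> token). In a real deployment this
    # mapping would live in a secured store so authorized users could reverse it.
    token_vault = {}

    def tokenize(value):
        """Replace a PII value with a random token, remembering the mapping."""
        if value not in token_vault:
            token_vault[value] = secrets.token_hex(8)
        return token_vault[value]

    def retokenize_record(record):
        """Retroactively tokenize known PII fields in an already-stored record."""
        return {key: tokenize(val) if key in KNOWN_PII_FIELDS else val
                for key, val in record.items()}

    # Stand-in for iterating over the existing pool; at scale this would be
    # another distributed big data job rather than a Python loop.
    stored_records = [
        {"ssn": "123-45-6789", "email": "jane@example.com", "purchase": "laptop"},
        {"ssn": "987-65-4321", "email": "joe@example.com", "purchase": "phone"},
    ]
    print([retokenize_record(r) for r in stored_records])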

Privacy and Derived PII

Let us look at Figure 1, which shows a typical big data application and the controls that exist around it today. First, we can ingest the data. On ingest we can mask, redact, tokenize, or encrypt any well-known PII as defined by various compliance standards such as PCI DSS. We can furthermore use role-based access controls (RBAC) for applications and people accessing the big data from outside. I am not exactly sure why a person would need to access this data directly, but it does happen, although I see applications as the main tool for accessing the data. We may also have the pool of big data encrypted at rest using self-encrypting drives or some other out-of-band means.

Figure 1: Privacy and Derived PII
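On the ingest side of Figure 1, masking well-known PII before a record ever lands in the pool can be as simple in concept as pattern matching. The sketch below is a minimal illustration under that assumption; the patterns are deliberately simplistic and by no means exhaustive.

    import re

    # Regular expressions for a few well-known PII patterns (illustrative only).
    PII_PATTERNS = {
        "card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
        "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    }

    def mask_on_ingest(raw_text):
        """Mask well-known PII before the record is written to the data pool."""
        masked = raw_text
        for label, pattern in PII_PATTERNS.items():
            masked = pattern.sub("[{} REDACTED]".format(label.upper()), masked)
        return masked

    # Example ingest of a single record from a stream.
    incoming = "Order from jane@example.com, card 4111 1111 1111 1111, SSN 123-45-6789"
    print(mask_on_ingest(incoming))

Real redaction tools go far beyond regular expressions, but the point is that the transformation happens before the data ever joins the pool.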
In Figure 1, Access and Ingest are currently the only places where big data is secured. However, there is a new concern. When you correlate multiple streams of data (represented by the multi-colored circles within Figure 1), you can end up with new bits of data that represent some form of PII. How much data you need to correlate is a problem for the mathematicians of the world. At HP Discover, Kevin Bacon told us about the six degrees of separation. Perhaps all we need is six different bits of data to derive who a person is, which means that, combined, these seemingly innocuous bits of data represent PII, or what I refer to as “Derived PII”.
For example, many search companies can take your behavior, search history, cookies, and other seemingly dissimilar bits of data and use them to form an identifier that uniquely distinguishes an individual from all other individuals. Could this identifier be a new form of PII, or derived PII, since it uniquely identifies an individual? Would this data, combined with, say, a ZIP code from another data stream, further point to a person?
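As a toy illustration of how correlation produces derived PII, consider two streams that share nothing obviously identifying; every attribute and value below is made up, but the combination of them can still single a person out.

    from collections import Counter

    # Two seemingly innocuous data streams, keyed only by an anonymous session id.
    browsing_stream = [
        {"session": "a1", "zip": "99501", "browser": "Chrome/124"},
        {"session": "b2", "zip": "10001", "browser": "Firefox/125"},
        {"session": "c3", "zip": "99501", "browser": "Chrome/124"},
    ]
    purchase_stream = [
        {"session": "a1", "birth_year": 1971},
        {"session": "b2", "birth_year": 1984},
        {"session": "c3", "birth_year": 1971},
    ]

    # Correlate the streams into a quasi-identifier: attributes that are harmless
    # on their own but potentially identifying in combination.
    by_session = {p["session"]: p for p in purchase_stream}
    quasi_ids = [(r["zip"], r["browser"], by_session[r["session"]]["birth_year"])
                 for r in browsing_stream]

    # Any combination that occurs exactly once points at a single individual:
    # the combination itself has become derived PII.
    for combo, count in Counter(quasi_ids).items():
        print(combo, "UNIQUE -> derived PII" if count == 1 else "shared by %d" % count)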
The current problem is how to recognize such derived data as PII and refuse to output it except in a redacted, tokenized, encrypted, or masked form that hides the identity of the innocent. What we do on ingest should also be done on output to protect privacy.

Privacy Opt-In/Opt-Out

In conjunction with handling derived PII, we should allow a person to opt in or out of inclusion of their data. If a person opts out of inclusion in a big data pool, how is this handled? Only on ingest, or can the tools retroactively opt out an individual? If too many opt out, the data itself could become suspect as well. Is there some limit to the amount of opt-out? Privacy is about the individual; if we opt out of submitting data to an organization, and that data is shared by many organizations, such as in a community cloud, is that data opted out for all organizations within the cloud or just one? How do you handle deletion of data on a full opt-out? For example, if you remove your accounts from a system permanently, does this delete the data in some fashion? Can this be implemented at all, and if so, at what layer do we implement it?
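Mechanically, a retroactive opt-out looks something like the sketch below; the subject identifier and opt-out list are assumptions for illustration, and the much harder questions of shared community clouds and true deletion remain open.

    # Individuals who have opted out of inclusion in the data pool.
    opted_out_subjects = {"user-42", "user-77"}

    data_pool = [
        {"subject": "user-42", "event": "search", "term": "flights"},
        {"subject": "user-13", "event": "search", "term": "weather"},
        {"subject": "user-77", "event": "purchase", "item": "book"},
    ]

    def apply_opt_out(records, opted_out):
        """Drop every record tied to an opted-out subject.

        On ingest the same predicate can reject records before they are stored;
        retroactively, at scale, it becomes another distributed big data job.
        """
        return [r for r in records if r["subject"] not in opted_out]

    print(apply_opt_out(data_pool, opted_out_subjects))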

Where to Implement Privacy

This leads to the ongoing question of where privacy should be implemented. Today we do our best with individual streams that must meet specific compliance requirements, but nothing more general.

  • Inherent within the big data pool? Within something like Hadoop, can we tag bits of data as being part of something bigger that should therefore be considered private, and can we then use those tags to put in place RBAC based on who can access a certain type of data? Today we do RBAC by access to all data, not by type of data. (See Figure 1: Access.) What about deletion of data?
  • Inherent within ingest? Should ingest modify all possible forms of PII, redacting, masking, tokenizing, or encrypting them as a matter of course, even if your current compliance stance does not require it? For example, you may not participate in PCI DSS compliance, but what about the case where you receive data from a group that does, such as in a community cloud?
  • Inherent within the application? Should the application be responsible for handling privacy, and therefore for not only the well-defined PII but also the derived PII?
  • Inherent in the barrier between the inside and outside of the pool of data? Should there be some filter within the tool that holds the data, such as the Hadoop filesystem, to handle cases of derived PII, so that if any output could identify an individual it is simply not allowed out? For example, if exactly one result is found, no results are returned, but if hundreds of results are found, then all results are returned (a minimal sketch of such a filter follows this list). This filter would perhaps need to be disabled for law enforcement doing legal forensic analysis. HP’s HAVEn loosely coupled products include the concept of data governance, but does this data governance extend to privacy rules enacted by the individual? In fact, what we do know about HP’s HAVEn and others that offer data governance is that it is about the possible loss of said data, the risk of that data getting out into the wild, or a breach of that data, which is part of the barrier between the inside and outside of the pool of data; it is not truly about privacy.
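Here is a minimal sketch of that output-side barrier, with the threshold chosen arbitrarily for illustration:

    # Suppress any result set small enough to single out an individual.
    MIN_RESULT_COUNT = 10   # illustrative threshold, not a standard value

    def guarded_output(results):
        """Release results only when the set is large enough to hide any one person."""
        if len(results) < MIN_RESULT_COUNT:
            # Returning nothing (or an error) keeps derived PII inside the pool.
            return []
        return results

    print(guarded_output(["record-1"]))                                # suppressed
    print(len(guarded_output(["record-%d" % i for i in range(250)])))  # released

This is essentially the minimum cell size rule long used in statistical disclosure control, and, as the bullet above notes, it would need a controlled override for legitimate forensic analysis.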

Conclusion

Privacy of big data has the same issues that privacy within the cloud does. We are trying to keep our data private, whether it is individual data or corporate data. Once data enters the cloud, unless it is encrypted beforehand, there is no way to prevent it from being accessed for the cloud’s use. The whole goal of secure multi-tenancy is to keep not only other organizations but also the cloud providers, or in this discussion the big data providers, from accessing your data or your customers’ data. Privacy for big data is pretty sparse at the moment, limited to ingest (the same as for the cloud) and data access limits.