I have written many times about the need for application-centric data protection and data-centric security. But what these both require is that our data protection, security, management, and networking are data-aware. We use applications, but we thrive on data. The more data we have, the more chance we can make use of it, which has resulted in big data tools and big data extensions, even to hypervisors. We talk constantly about moving data closer to processing, with flash and other techniques at the storage layer. But we have not grown other aspects of our systems to be data-aware. It is time this changed.
Being data-aware entails that security, management, and data protection are application-agnostic. To implement the tools, we just need to know where our data is located. Actually, it is far easier than that: we need to follow the users to where they have put the data, and from there create policies to protect, secure, and manage that data. Being data-aware and following the user also means following the secure hybrid cloud and secure software-defined data center reference architecture, about which I have written before. (Shown below in Figure 1.)
Data-Aware Products
At the time of this writing, only a few truly data-aware products are even available; yet, there are some that manage data in generic buckets without any need for a view into the data. There are cases to be made for both approaches. These product types fall into distinct categories: compliance (note: not security) and data protection.
Data-Aware Compliance
The only product currently available for data-aware compliance is from DataGravity. In our architecture, DataGravity could fall into a storage cloud or a single-tenant data center, but it has a far-reaching impact. DataGravity can tell us who modified our data, how they did it, when they did it, what they did, and where they did it. That alone is invaluable. Currently, we have to rely on next-generation firewalls to talk to our identity store to determine the “who.” We also rely on applications to spit out appropriate logfiles to tell us when, what, how, and where. However, if we put all this at the storage layer, there is no need to count on logfiles and guesswork based on correlating time within these logfiles.
The goal is not only to become data-aware but also to have an authoritative source for data-modification audit information. DataGravity can determine what type of data resides within storage in an unencrypted state. If that data is personally identifiable information (PII), then it reports on the type of data that results in a compliance issue or breach of defined policy for such information.
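To make the idea concrete, here is a minimal sketch of a storage-layer audit record that carries the who, what, when, where, and how of a write and flags unencrypted PII. Everything in it is an assumption made for illustration: the `AuditRecord` fields, the `record_write` helper, and the two regular expressions are hypothetical and do not describe DataGravity's actual implementation or API.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical, deliberately simple patterns; real PII detection needs far more care.
PII_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}


@dataclass(frozen=True)
class AuditRecord:
    who: str          # identity that performed the write
    what: str         # object that was modified
    when: datetime    # time the storage layer saw the write
    where: str        # origin of the request, e.g. a client address
    how: str          # access method, e.g. a storage protocol
    pii_types: tuple  # names of PII patterns found in the written content


def find_pii(text):
    """Return the sorted names of all PII patterns that match the text."""
    return tuple(sorted(name for name, pat in PII_PATTERNS.items() if pat.search(text)))


def record_write(who, path, source, protocol, content):
    """Build one audit record for a write observed at the storage layer."""
    return AuditRecord(
        who=who,
        what=path,
        when=datetime.now(timezone.utc),
        where=source,
        how=protocol,
        pii_types=find_pii(content),
    )
```

Because the record is produced where the data lands, nothing has to be correlated across application logfiles afterward; a non-empty `pii_types` is the point at which a compliance report would be raised.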
Yet, we all know that compliance is not security. So what would need to happen to improve security?
Data-Aware Security
Data-aware security is a bit different from compliance, but it uses similar techniques. If we can produce an audit log, we can also impose a specific security policy based on the data in use by a given user, device, application, and even location of access. This policy could impose a level of encryption, pop up a request for another identifying factor, or establish a lockout based on location. However, this cannot happen unless we first develop that policy, which is based on data classification.
Data classification is the first step toward data-aware security. Once we have classified the data, we will know what security policies apply to it. Our data-aware security tools should understand these policies and impose, quietly and behind the scenes, the necessary security to keep our data safe, regardless of where it is located (cloud, data center, personal device, etc.). Data-aware security allows us to associate security context and policy with data directly.
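As a rough sketch of how classification could drive policy, the example below maps a classification to a policy and then decides what to do with an access request. The classification names, the `Policy` fields, the location labels, and the `decide` function are all hypothetical; they only illustrate the encryption, extra-factor, and location-lockout outcomes described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    encrypt: bool                 # data must be encrypted when served
    require_second_factor: bool   # ask for another identifying factor
    allowed_locations: frozenset  # an empty set means "any location"


# Hypothetical classifications and the policies attached to them.
POLICIES = {
    "public": Policy(False, False, frozenset()),
    "internal": Policy(True, False, frozenset()),
    "regulated": Policy(True, True, frozenset({"data-center", "office"})),
}


def decide(classification, location, second_factor_presented):
    """Return the action for one access request, based only on the data's class."""
    policy = POLICIES[classification]
    if policy.allowed_locations and location not in policy.allowed_locations:
        return "lockout"
    if policy.require_second_factor and not second_factor_presented:
        return "challenge"
    return "allow-encrypted" if policy.encrypt else "allow"
```

For example, `decide("regulated", "cafe", True)` yields a lockout, while the same user in the office without a second factor is challenged. The decision follows the data's classification rather than the system the data happens to sit on.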
Data-Aware Forensics (and Court of Law)
Data-aware tools such as those from DataGravity help with eDiscovery when a warrant is issued for specific data or even for an internal investigation. However, there could also be a means by which to perform forensic data analysis and acquisition once data is modified or specific types of data are discovered while being written to a storage array. Some of the key issues with forensics are how to handle forensics within the cloud, how to identify specific tenants’ data, and how to query this data for forensics purposes.
Various types of data exist that, if discovered, require informing the authorities immediately (this is the law in some countries). However, this reprehensible data is not something most organizations wish to store. Without real-time forensic analysis, whether such data is found would be hit or miss. This could be the future of data-aware analytics for forensic reasons.
Since such processing happens below the operating system, application, and user, we would also have the ability to properly fix, quarantine, and even delete malware and virus data. Malware examined this way cannot act upon the raw storage, because it is not running within its native operating system, and it therefore cannot further infect the environment.
Data-Aware Protection
Our last class of technology is data-aware protection and data-aware recovery, whether for business continuity or disaster reasons. At the moment, we have backup of various containers, whether they are physical or virtual systems, groups of systems, or specific systems. But we do not have anything more robust than that. Instead, we have partial data awareness: we replicate or back up only those blocks that have changed since the last backup. However, we have nothing further about those blocks of data.
It would be very interesting to adjust our data protection in accordance with what the data contains, when it was accessed, and its classification. Instead of treating each system as this or that, we could concentrate on the data. We could adjust recovery time and recovery point objectives by the data in question, not by the system.
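A small sketch of what protection driven by data rather than by system might look like follows. The classifications, the objective values, and both helper functions are assumptions chosen for illustration; each classification is simply mapped to a recovery point objective (how stale a backup may be) and a recovery time objective (how quickly the data must be back).

```python
from datetime import timedelta

# Hypothetical mapping: classification -> (recovery point objective, recovery time objective).
OBJECTIVES = {
    "regulated": (timedelta(minutes=15), timedelta(hours=1)),
    "internal": (timedelta(hours=4), timedelta(hours=8)),
    "public": (timedelta(hours=24), timedelta(hours=72)),
}


def needs_backup(classification, age_of_last_backup):
    """True when the last backup is at least as old as the data's recovery point objective."""
    return age_of_last_backup >= OBJECTIVES[classification][0]


def recovery_order(datasets):
    """Order (name, classification) pairs so the tightest recovery time objective comes first."""
    ranked = sorted(datasets, key=lambda item: OBJECTIVES[item[1]][1])
    return [name for name, _ in ranked]
```

With this arrangement, two data sets living on the same virtual machine can be backed up at different intervals and restored in a different order, which is not possible when the container is the unit of protection.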
Figure 1: Secure Hybrid Cloud
There are three parts to our secure hybrid cloud that are of interest:
- Transition: The transitional component of a secure hybrid cloud contains all items that allow access to or move data between multiple cloud instances, between those clouds and a data center or centers, or between the end user computing device and clouds and data centers. The transitional component is fairly fluid, yet traditional security approaches can play within this arena if the transition is contained within a controlled area. Unfortunately, that may not actually be the case.
- Cloud: The cloud includes all places outside our immediate control where data could end up or from which data could be taken. In some cases, it is even used to further our transitional goals. This is where APIs tend to live. However, the chances of adding traditional security to this aspect of the secure hybrid cloud are generally low unless one is willing to go to great expense (and end up in a managed hosted environment over a cloud).
- Data Center: The data center is generally within our control. It could be a private cloud or just a collection of virtual and physical machines. The data center may transfer data between multiple data centers or back and forth to and from the cloud. Within the data center, which is generally under our control, we can attempt to add in traditional security approaches.
Data-Aware Futures
It would be ideal if our data protection, compliance, and security systems reacted to the data regardless of location and type. We should be able to automate the application of policy based on the classification of our data, the type of data, access patterns for the data, and who should have access to the data. If we were data-aware, we would be able to find malware and to audit, perform forensics on, secure, and protect our data at a much finer-grained resolution. Granted, we need applications to access the data, but we ultimately want to secure and protect it. When a breach occurs, it is the data that the bad actors wish to gain access to for their nefarious plans.
Which leads to a question: Do you have a data-aware policy but implement it at a system level?
It seems to me that if hyperconverged really catches on, vendors in that space stand to gain a lot from your vision. The local-distributed storage model (a la VSAN, Nutanix, etc.) grabs that storage layer.
Maybe DataGravity can integrate into the vSphere layer (aka Datastore-level)? That would allow DataGravity to provide a solution for all virtualized infrastructures.
Just a thought.
Hello Steve,
It would be interesting if DataGravity licensed their software, I agree. But as the storage layer, it works with all virtual and physical environments currently in use. I initially see it being the storage for highly regulated data (virtual or physical) for compliance reasons. With a good API, others can call it to do specific items or react to specific use cases. For example, if you have a class of virtual machines that have HIPAA, PCI, or other compliance requirements, you can place them on a DataGravity storage device. Actually, you could also just present the data to the specific VMs and get more than what a VMDK would give you. Lots of use cases out there.
It does unlock a new way to think about what you can do with data, however.
Best regards,
Edward L. Haletky