For the last several years, and really ever since public repositories and cloud storage were first used, API keys and other confidential data have been leaking. The treasure trove as the starting point of an attack is now commonplace (most recently for Accenture, DXC Technology, and now the drone maker DJI). These treasure troves contain leaked credentials, intellectual property, or leaked API keys. Is such leakage a technology, people, or process problem?
We always come back to this: people, process, or technology. In this case, we will need to fix all three to solve the problem.
People: People are being forced to do things faster; they no longer have the time to be as diligent as they once were. More importantly, many people do not know what counts as a secret or confidential information. To developers, it is either data or code: not secrets, confidential information, or even cost. It is about efficiency. Even the DevOps movement pushes people to use more APIs, to be more efficient. At the same time, there is no real push to use secrets management, which means we end up with embedded API keys, S3 buckets without protections, or any number of other human mistakes in the treatment of data. This is not just a developer issue: anyone who accesses a service, writes a macro, or just wants to make their job easier may be at risk of leaking critical data.
Process: We have done away with the organized code review, the use of internal repositories, and many of the checks and balances of the past, all in the name of efficiency and the need to get things done. These processes were there to help, not to bog things down. Unfortunately, they were often abused: code reviews would get stuck on variable names rather than the intent of the code. Meanwhile, more and more people got involved in creating time-saving macros and scripts, so those who are not officially developers now manage and write code. As we move to more "as Code" approaches (Infrastructure, Security, Testing, Network, Storage, etc.), our processes must keep up and still maintain good code hygiene.
Technology: The technology has not caught up. I had expected that by now, GitHub would have added filters to find secrets and either remove them or swap in a vault product automatically. There are several projects on GitHub that do this already. They are not 100% accurate, but they are a start and cover many of the issues. The next questions are: How do you put such tools into play? Where do they belong? Are they part of the service (GitHub, S3, etc.), part of a gateway to those services maintained by your organization, or something on everyone's desktop to ensure proper hygiene is maintained?
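The scanning projects mentioned above generally work by pattern-matching file contents against known secret formats. As a minimal sketch of the idea, here is a regex-based detector; the patterns are illustrative only, and real tools combine many more rules plus entropy analysis:

```python
import re

# Illustrative patterns only; real scanners use far larger rule sets
# plus entropy checks to catch random-looking strings.
SECRET_PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "generic_api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][A-Za-z0-9]{16,}['\"]"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
}

def scan_text(text):
    """Return (line_number, rule_name) pairs for every suspected secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings

sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\napi_key: "abcd1234abcd1234abcd"\n'
print(scan_text(sample))  # flags both lines
```

A check like this can run at any of the points discussed above: inside the service, at an organizational gateway, or on the desktop as a pre-commit hook.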
The key here is how intrusive such tools and processes can be. If they are intrusive in any way, then you end up with the same problem we always end up with: people. People will find a way around anything they consider to be intrusive. Therefore, security should be invisible and allow for use as appropriate.
Code, data, and other artifacts dumped into GitHub, S3, Box, Dropbox, Google, and many other services are important to understand—perhaps not to stop, but to understand, classify, and apply the proper policy. Is this a data loss prevention (DLP) issue? Maybe not. DLP-style tools can help here, but the reality is that this is a data classification issue, with data transformation as a form of remediation. The process should not interfere with the people, but should impose the appropriate process and technology upon the artifacts.
Some of the transformations that could happen are automated encryption, the exchange of secrets for a proper vault reference, or even tokenization. The goal should be to protect the data, not to punish those who do not know better. The other problem is finding an existing treasure trove of data. How do we solve that problem?
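The exchange-for-reference transformation can be sketched as follows. This is an illustration only: the in-memory dictionary stands in for a real secrets manager such as HashiCorp Vault, and the token format is made up for the example:

```python
import re
import uuid

# Stand-in for a real secrets manager; in practice storing the value
# would be an API call to something like HashiCorp Vault.
vault = {}

AWS_KEY = re.compile(r"AKIA[0-9A-Z]{16}")

def tokenize_secrets(text):
    """Replace each detected secret with a reference token, keeping
    the real value only in the vault."""
    def swap(match):
        token = f"vault:secret/{uuid.uuid4().hex[:8]}"  # hypothetical token format
        vault[token] = match.group(0)   # real value lives server-side
        return "{{ " + token + " }}"    # only a reference remains in the file
    return AWS_KEY.sub(swap, text)

cleaned = tokenize_secrets('key = "AKIAIOSFODNN7EXAMPLE"')
print(cleaned)
```

The person who committed the key keeps a working artifact; the secret itself never reaches the public repository.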
How does your company protect its data? Is it draconian? Is it invisible? Do you trust the measures? Or do you believe there is a treasure trove of data available for your company?