I have spoken and written quite a bit on the delegate user problem facing cloud and virtual environments. It is a growing problem, as we delegate actions from logged-in users to service accounts to implement changes on our systems. Any system, for example, that proxies administrative requests suffers from the delegate user problem. In essence, when we go to determine who did what, when, where, and how, forensics leads us to a delegate user or service account. We do not know beyond a shadow of a doubt who the user really was. We can correlate multiple log files, and based on time we may be able to come up with a set of users who could have done the deed. However, unless only one user was involved, we just end up with a set of users. Those sets of users, themselves, can be other service accounts—other delegate users, abstracting the real user.
The solution to this problem is to include enough context at the lower levels so that correlation is based not just on time but also on some sort of identifier that can be used to determine which user actually did the deed. For a subset of products, VMware has actually solved this problem. Granted, it is just for vCenter and vSphere, but it is a start in the proper direction. If you look at vSphere 6 logs, you will find that they have introduced a user field that includes both the user called within vSphere (the service account) and the calling user from within vCenter. Here is an example:
2015-07-16T14:53:53.665Z esxi.example.com Hostd: verbose hostd[4DAC4B70] [Originator@6876 sub=Default opID=742806f0-bf-5d14 user=vpxuser:DOMAIN\User] AdapterServer: target='vim.PerformanceManager:ha-perfmgr', method='summarizeStats'
We know several things from this log entry, but the most important part is [Originator@6876 sub=Default opID=742806f0-bf-5d14 user=vpxuser:DOMAIN\User], as it contains information vital to ultimately solving the delegate user problem for vSphere implementations. Specifically, the user field contains both the service account (vpxuser) and the domain user that requested the action, and the opID field carries an operation ID that can be used to correlate based not just on time but also on event data. With time and other items to correlate further up the stack, we no longer end up with just a set of users, but with the exact user who perpetrated the action. For forensics and auditing, this is a crucial step.
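To make the correlation step concrete, here is a minimal Python sketch that pulls the operation ID, the delegate user, and the calling user out of the entry above. The function name parse_originator and the returned field names are my own, and the parsing assumes the layout of this one sample (space-separated key=value pairs, with the two users separated by a colon); other log formats or user names containing spaces would need more careful handling.

```python
import re

# Sample entry copied from the text above (with plain ASCII quotes).
LINE = (
    "2015-07-16T14:53:53.665Z esxi.example.com Hostd: verbose hostd[4DAC4B70] "
    "[Originator@6876 sub=Default opID=742806f0-bf-5d14 user=vpxuser:DOMAIN\\User] "
    "AdapterServer: target='vim.PerformanceManager:ha-perfmgr', method='summarizeStats'"
)

# Matches the bracketed block and captures its key=value pairs.
BLOCK = re.compile(r"\[Originator@\d+ (?P<fields>[^\]]*)\]")


def parse_originator(line):
    """Return timestamp, opID, delegate user, and calling user, or None."""
    match = BLOCK.search(line)
    if match is None:
        return None
    # Assumption: pairs are separated by whitespace and values contain no spaces.
    fields = dict(
        pair.split("=", 1) for pair in match.group("fields").split() if "=" in pair
    )
    # Assumption: the user field is "<service account>:<calling user>".
    delegate, _, caller = fields.get("user", "").partition(":")
    return {
        "timestamp": line.split(" ", 1)[0],
        "op_id": fields.get("opID"),
        "delegate_user": delegate or None,
        "calling_user": caller or None,
    }


record = parse_originator(LINE)
print(record)
```

Once each entry is reduced to a record like this, entries from different log files can be grouped on op_id rather than on timestamps alone.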
While this is solved for vCenter to vSphere communication and the vCenter delegate user used within vSphere, it is not solved for the greater set of products. For that, you need to ensure that there is a single service account per management tool used within your virtual and cloud environments. This implies that if you are using a third-party backup tool such as Veeam Backup & Replication, it needs a service account of its own, not one shared with other tools.
This helps in problem solving and deep root cause analysis, as well as in forensics for a court of law. Now, maintaining those service accounts could be difficult. My lab environment has something like thirty service accounts for all the VMware and third-party products in use, but they are crucial for determining who did what, when, where, and how. The “why” we may have to infer from the action taken and from where.
The delegate user problem affects not just VMware products, but all virtualization and cloud products. One sure sign of delegate user issues is the use of APIs. Users log into applications, but the API is hooked to the application and not to the actual user. So, logging within the application needs to contain some ID used within the API, so that events within a cloud can be correlated back to the application and the ultimate user who performed the action. Our delegate user problems can also be fairly complex, as per the following diagram:
As you can see from Figure 1, we could have many different products involved. While VMware has solved the CMS-to-hypervisor link (per our diagram), we still have to solve the bigger picture. However, this is a step in the proper direction!
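For the other links in that chain, the idea of logging an ID that is also used within the API can be sketched in a few lines of Python. Everything here is hypothetical: the svc-app service account, the cloud_api function, and the two in-memory lists standing in for the application and cloud logs. The point is only that the application records the real user alongside an ID it generates, passes that ID down with the delegated call, and can later resolve a cloud-side event back to a person.

```python
import uuid

app_log = []    # stands in for the application's audit log
cloud_log = []  # stands in for the cloud provider's log

SERVICE_ACCOUNT = "svc-app"  # hypothetical delegate user


def cloud_api(action, request_id):
    """Hypothetical cloud API: it only ever sees the service account."""
    cloud_log.append(
        {"user": SERVICE_ACCOUNT, "action": action, "request_id": request_id}
    )


def perform(real_user, action):
    """Application entry point: record the real user, then delegate."""
    request_id = str(uuid.uuid4())
    app_log.append(
        {"user": real_user, "action": action, "request_id": request_id}
    )
    cloud_api(action, request_id)
    return request_id


def who_did(cloud_event):
    """Resolve a cloud-side event back to the real user via the shared ID."""
    for entry in app_log:
        if entry["request_id"] == cloud_event["request_id"]:
            return entry["user"]
    return None
```

Two users performing the same action at the same moment would be indistinguishable by time alone, yet who_did still returns the right one for each cloud event because the ID travels with the request.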
As you design your cloud applications, how are you handling logging? Are you making it easy to determine who did what, when, where, and how? Can you find this information if you need to for your virtual and cloud environments?