Monitoring - The basics of the Cloud

“What do you wish to monitor?” is often my response when someone states that they need to monitor the virtual environment. Monitoring, however, becomes much more of an issue when you enter the cloud. Some of my friends have businesses that use the cloud, specifically private IaaS clouds, and deciding what the cloud provider should monitor and what the tenant should monitor has been a struggle and a debate for them.
So what does this tenant wish to monitor?

Hardware functionality with predictive failure alerts, à la Dell OpenManage or HP Insight Manager
Host State
Virtual Machine State (up/down)
Virtual Machine Statistics
Application State
Application Statistics

These seem a reasonable set of requirements for monitoring by a cloud provider, yet delivering them also seems to be a difficult task for some cloud providers. When an enterprise enters a cloud, much of this monitoring should already be in place, but apparently it may also be continually changing.
My friends are actually having issues with monitoring within the cloud. It has gotten to the point that all they wish the cloud provider to monitor is the physical hardware and any dependencies, while they monitor the application and virtual environment themselves, mainly because of the mechanism by which the provider monitors application state and the total lack of usable statistics. It is simply a fact that most infrastructure monitoring solutions are designed for the owner of the infrastructure, which is fine for an on-premises infrastructure owned by an enterprise. In a cloud, however, the low-level “physical” statistics collected by infrastructure monitoring solutions are not easily broken up along the lines by which resources are sliced up for each customer. It is therefore difficult, if not impossible, for the cloud vendor to give a customer an accurate picture of how the slice of infrastructure allocated to that customer is actually performing at any one point in time.
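To make the slicing problem concrete, here is a minimal sketch of the roll-up a provider would need in order to show each tenant its own slice. It assumes the provider can already collect a CPU figure per virtual machine and knows which tenant owns each one. The function names, the MHz unit, and the "unassigned" bucket are all hypothetical and not taken from any particular monitoring product.

```python
from collections import defaultdict


def usage_by_tenant(vm_cpu_mhz, vm_owner):
    """Sum per-VM CPU usage (MHz) into one total per tenant.

    vm_cpu_mhz: {vm_name: cpu_mhz_in_use}   (hypothetical input)
    vm_owner:   {vm_name: tenant_name}      (hypothetical input)
    VMs with no known owner are grouped under "unassigned".
    """
    totals = defaultdict(float)
    for vm, mhz in vm_cpu_mhz.items():
        totals[vm_owner.get(vm, "unassigned")] += mhz
    return dict(totals)


def slice_utilisation(totals, allocation_mhz):
    """Fraction of each tenant's allocated CPU that is currently in use."""
    return {
        tenant: totals.get(tenant, 0.0) / allocated
        for tenant, allocated in allocation_mhz.items()
        if allocated > 0  # skip tenants with no allocation to avoid dividing by zero
    }
```

Even this simple roll-up only covers what each virtual machine consumed. It says nothing about contention on the shared host, which is part of why a host-level view does not translate cleanly into a tenant-level one.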
Since the IaaS environment is 100% Linux for this group, with judicious use of virtual machines, they need application statistics to know when or if they need to add more capacity and to determine trends. The peaks and valleys of the trends determine when new systems are needed as well as when maintenance can be performed, since they are a 24/7 shop. Application-level statistics can be gathered by a variety of solutions that target cloud-based applications. Both AppDynamics and New Relic offer solutions that can monitor Java-based applications running in cloud-hosted environments. However, these solutions are about telling the team that owns the applications how the application is performing and where in the code the problem most likely lies, not about whether the infrastructure owned by the cloud provider is causing a problem.
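As a rough illustration of reading peaks and valleys out of application statistics, the sketch below averages load by hour of day, picks the quietest hour as a candidate maintenance window, and flags hours that run close to capacity. The sample format, the function names, and the 80% headroom figure are my own assumptions for the example, not something any of the tools above prescribe.

```python
from collections import defaultdict
from statistics import mean


def hourly_profile(samples):
    """Average load per hour of day from (hour_of_day, load) samples."""
    buckets = defaultdict(list)
    for hour, load in samples:
        buckets[hour].append(load)
    return {hour: mean(loads) for hour, loads in sorted(buckets.items())}


def quietest_hour(profile):
    """Hour with the lowest average load: a candidate maintenance window."""
    return min(profile, key=profile.get)


def hours_over_capacity(profile, capacity, headroom=0.8):
    """Hours whose average load exceeds headroom * capacity.

    The 0.8 default is an arbitrary illustrative threshold.
    """
    limit = capacity * headroom
    return [hour for hour, load in profile.items() if load > limit]
```

With samples such as `[(2, 10), (2, 20), (14, 90), (14, 70), (20, 50)]`, hour 2 comes out as the valley. Whether hour 14 counts as a capacity warning depends entirely on the capacity and headroom you choose.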
The cloud provider determines state by verifying that various ports are available, but not necessarily whether the application is available. Ports will be available as long as the daemons for the application, in this case Apache and MySQL, are running. However, if the application is non-responsive the port could still be open, so the monitoring fails to report an issue. The issues show up within the trends, but the trending software looks at statistics over time, not necessarily as they happen, which implies its response to a failure would be late at best.
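The difference between the two kinds of check can be sketched in a few lines of Python using only the standard library. The first function does what the provider does today and only confirms that something is listening. The second asks the application for a page and looks for text that a working application would return. The function names and the idea of a page containing a known marker string are assumptions for illustration; a real check would target whatever page or query exercises the application end to end.

```python
import http.client
import socket
import urllib.error
import urllib.request

# Opener that ignores any proxy settings, so the check talks to the host directly.
_DIRECT = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    This only proves that a daemon is listening, not that it can do useful work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def app_is_healthy(url, expected_text, timeout=2.0):
    """Return True only if the URL answers 200 and the body contains expected_text."""
    try:
        with _DIRECT.open(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
            return response.status == 200 and expected_text in body
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Error statuses, timeouts, refused connections and malformed URLs all count as unhealthy.
        return False
```

An Apache process whose backing MySQL database has hung would still pass `port_is_open` on port 80, while a check along the lines of `app_is_healthy` against a page that queries the database would fail, which is the signal the tenant actually cares about.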
Monitoring of cloud environments is very important. Some providers use Zenoss and others use other tools, but cloud providers do not necessarily expose all the functionality of such tools to their tenants. For example, this one provider makes use of Zenoss but does not give tenants the ability to receive alerts directly. Alerts first go through a group that may or may not be checking its email. These types of situations produce delays in solving problems.
Tenants want to know that problems are being worked, but they also wish to know when there is a problem so that they can work it as well, perhaps by bringing up redundant systems or starting their DR process. However, how much of this monitoring should be performed by the tenant? Is what the cloud provider provides sufficient for the tenant's needs? In my friends' case, the monitoring was NOT sufficient.
As a tenant, putting your trust in the cloud provider is something you may have to do to get the footprint you require to run the business, but there must be some form of local disaster recovery and business continuity in place for when the cloud provider has long-term problems.
If you are using the cloud, what are your Monitoring, Business Continuity, and Disaster Recovery plans?