A Service Level Agreement (SLA) is an excellent expectations-managing mechanism, but it’s important to manage your own expectations of what an SLA can realistically accomplish. I know that just those three words, “Service”, “Level” and “Agreement”, are often an attention turn-off: SLAs are to infrastructure bods what documentation is to developers. Yet, when considering taking up cloud and utility services, many feel that the SLAs offered aren’t reliable, if they exist at all. So the SLA becomes the blocker: ‘If I move services out of my data centre, how will I guarantee availability and performance?’
Are SLAs for Cloud services really worthless and, if they are, will the wider adoption of cloud services suffer because of it?
What is a Service Level Agreement?
In a great article on Establishing SLAs, Naomi Karten gives a number of points about what an SLA is. An SLA can be considered:
- A communications tool. The value of an agreement is not just in the final product; the very process of establishing an SLA helps to open up communications.
- A conflict-prevention tool. An agreement helps to avoid or alleviate disputes by providing a shared understanding of needs and priorities. And if conflicts do occur, they tend to be resolved more readily and with less gnashing of teeth.
- An objective basis for gauging service effectiveness. An SLA ensures that both parties use the same criteria to evaluate service quality.
- A living document. This is one of its most important benefits. The agreement isn’t a dead-end document. On a predetermined frequency, the parties to the SLA review the agreement to assess service adequacy and negotiate adjustments.
That said, the process of planning, establishing, and implementing an agreement is typically a many-month process of information-gathering, analyzing, documenting, presenting, educating, negotiating, and consensus-building – and the process must involve both suppliers and customers. If customers are not part of the process, it’s not an agreement.
In this respect Cloud and Utility computing SLAs could be considered “worthless”. There may well be a statement on a website proposing a level of availability for a cost, or an indication of compute power or storage for a cost. But how do those statements relate to what you (as the customer) need in order to meet the expectations of your users or customers (or both)? Where is your agreement with the service provider? Is it your implied acceptance by signing up? If you find that the service needs to be improved, how do you go about doing that? ‘If I need a better service I’ll move my provision.’ Really? Before you commit to that utility service, find out how easy that is to do.
Cloud or Utility Computing?
The concept of ‘utility computing’ is that purchasing IT resources should be like ‘buying electricity’ or ‘buying a phone service’: how it is provided is none of your concern. A core benefit of utility computing is better economics. Traditionally, data centres have been notoriously underutilized; it wasn’t unusual to find server resources idle 85% of the time. This is due to over-provisioning, i.e. buying more hardware than is needed on average in order to handle peaks, to handle expected future loads and to prepare for unanticipated surges in demand. This is why companies such as VMware found such a great response to their virtualisation offerings: they improved data centre resource utilisation. Utility computing allows companies to pay only for the computing resources they need, when they need them. However, there is rarely a guarantee that those services will always be there to be consumed.
If your electricity supply fails, that’s it: game over. When the London tube bombers struck, external phone lines and mobile networks in the city went down under the load. There are solutions to these problems, sure, but all take time, money and planning. More importantly, when utility computing fails (i.e. doesn’t meet its service guarantee) you are entitled to have the utility service costs refunded, not the cost to your business of the IT failure, which is likely to be much more. In terms of an SLA, your utility supplier may offer a service level guarantee, but how are you managing that? Are you testing for outages? Are you monitoring the performance of applications? In terms of application performance, who is responsible for improvement? How regularly are you reviewing capacity needs?
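To make the gap between a refund and the real cost of an outage concrete, here is a minimal sketch of the arithmetic. The function names, the 730-hour month, and every figure in the usage note are assumptions chosen for illustration; they are not taken from any provider's actual terms.

```python
def allowed_downtime_minutes(availability_pct, period_hours=730.0):
    """Minutes of downtime a provider may incur in one period and still meet its guarantee.

    period_hours defaults to an assumed average month of 730 hours.
    """
    if not 0.0 <= availability_pct <= 100.0:
        raise ValueError("availability_pct must be between 0 and 100")
    return period_hours * 60.0 * (100.0 - availability_pct) / 100.0


def outage_exposure(outage_minutes, monthly_fee, credit_pct, business_cost_per_minute):
    """Compare the provider's refund with the customer's own loss for a single outage.

    All inputs are hypothetical: credit_pct is the share of the monthly fee refunded,
    business_cost_per_minute is the customer's own estimate of lost business.
    """
    credit = monthly_fee * credit_pct / 100.0
    business_loss = outage_minutes * business_cost_per_minute
    return {
        "credit": credit,
        "business_loss": business_loss,
        "uncovered": business_loss - credit,
    }


if __name__ == "__main__":
    print(allowed_downtime_minutes(99.9))
    print(outage_exposure(120, 500.0, 10.0, 50.0))
```

With these made-up numbers, a 99.9% guarantee tolerates roughly 43.8 minutes of downtime a month, and a two-hour outage refunds a small fraction of what it costs the business. The useful exercise is plugging in your own figures before you sign up.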
Cloud computing is a broader concept than utility computing. Cloud computing is accessing resources and services needed to perform functions with dynamically changing needs. An application or service developer requests access from the Cloud rather than a specific endpoint or named resource. The Cloud automatically manages multiple infrastructures across multiple organizations. That’s not to say the services that underpin Cloud computing are magic; they suffer the same issues (failure, under-provisioning, poor performance) as any other service. The complication is that any SLA to a customer is difficult to manage because it is unknown who is providing your service: the very nature of Cloud services means that the components can be provisioned from an ever-changing set of providers.
How do you consistently measure a service when the provider of a service can readily change? When you’re accessing an application, is it the application provider’s failure that is causing poor performance, or is it the hardware, or the network? From a supplier perspective, how do you measure and deliver on application response times rather than raw CPU and storage values?
Does this mean that it’s not possible to have an SLA for Cloud services? Is it worthless because it’s just too hard to do? It is undoubtedly difficult, and that is typically why service agreements focus on up/down availability rather than ‘user experience’ response times. But it is not impossible, as we discussed in Virtualization Performance Management – Linking Response Time, Load and Chargeback.
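As an illustration of what a response-time service level could look like in practice, the sketch below checks a set of measured response times against a percentile target. The nearest-rank percentile method, the function names, and the 95th-percentile default are assumptions made for this example, not a definition used by any particular provider.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 < pct <= 100.0:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100.0)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_response_target(samples_ms, target_ms, pct=95.0):
    """True if the chosen percentile of response times is within the (hypothetical) target."""
    return percentile(samples_ms, pct) <= target_ms


if __name__ == "__main__":
    measured = [100] * 19 + [900]  # made-up response times in milliseconds
    print(meets_response_target(measured, target_ms=200))
    print(meets_response_target(measured, target_ms=200, pct=100.0))
```

A percentile is used rather than an average because one very slow request can hide behind many fast ones in a mean, while users notice the slow one. Which percentile and which target to agree on is exactly the kind of negotiation an SLA is meant to capture.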
Worthless?
The question may be ‘If I move services out of my data centre, how will I guarantee availability and performance?’, and the first response should be “how did you guarantee it before?” What was the nature of your SLA: was it just about availability? Was performance about response times, or was it about not running out of disk space and keeping CPU use on the Exchange server low?
Will the lack of collaborative and dynamic SLAs hamper the move to Utility Computing and a wider use of cloud services? Unlikely. SMBs/SMEs aren’t in a position to afford the high costs of designing or using the services and creating an SLA that may take months to agree and require tender loving care to maintain. But then, they weren’t before either. Larger organisations will likely further delay any complete reliance on Cloud and utility services, opting perhaps to use them for less critical line-of-business applications; but such organisations were likely to be doing that regardless. The economic benefits of utility computing come with a cost vs service trade-off. What is important to understand is where that trade-off impacts the business.
What is interesting is how SLAs are managed in a Cloud services environment. Traditionally, a service has been provided by one organisation, and the agreement is managed with entities that rarely change over time. As Cloud services develop, this is going to be increasingly difficult to manage in that traditional sense. Cloud offers the opportunity to monitor and manage a service and move that service to another supplier, not because it is cheaper, but because it is more responsive and more reliable.
Cloud and Utility services today offer Service Level Guarantees, but these are guarantees, not an agreement. For many this is not important up until that guarantee fails, but in all fairness your utility IT failing is no different from your own IT failing. It may be all well and good that, technically, the Amazon Cloud didn’t actually fail, but suppliers pointing at an SLA and saying “hey, we didn’t technically fail” leaves a bitter taste for customers. It could be argued that service providers sometimes want to create an SLA to suppress customer complaints. Maybe Amazon will change their SLA as part of the resolution of the problem, but the piece highlights the fact that when you’re using such services you need to understand the impact of each component’s failure on your service levels to your users. It is not that Cloud service agreements are worthless; it is perhaps more that these new services are expected to do many things: to be magic. A Service Level Agreement (SLA) is an excellent expectations-managing mechanism, but it’s important to manage your own expectations of what an SLA can realistically accomplish. An SLA is not a pair of ruby slippers but, if you take the time to understand what your business needs, you will better understand the level of service the business requires and so can make better choices about when to deploy internal or external services.
The dynamic nature of Cloud services will mean that they become more prevalent, that is for sure. Understanding how to bridge the service requirements of the business to the services available, and treating that relationship as a dynamically changing thing, is going to be much more important for IT executives.
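One way to reason about the impact of each component on the service level your users see is to combine the individual guarantees. The sketch below assumes, as a simplification, that every component is required for the service to work and that components fail independently; real services often violate both assumptions. The function name and the percentages are hypothetical.

```python
import math


def composite_availability(component_pcts):
    """Availability (%) of a service that needs ALL of its components to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    if not component_pcts:
        raise ValueError("at least one component is required")
    fraction = math.prod(p / 100.0 for p in component_pcts)
    return fraction * 100.0


if __name__ == "__main__":
    # Made-up guarantees for compute, storage and network.
    print(composite_availability([99.95, 99.9, 99.9]))
```

Under these assumptions the combined figure is always lower than the weakest individual guarantee, which is why a supplier can meet every component guarantee while your users still see a worse service than any single number suggests.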
It doesn’t help Amazon’s customers at all that they ‘didn’t technically fail’. What counts is the availability of the service. Cloud SLAs and EULAs need to be tightened up to be more appropriate to an enterprise or it will definitely slow the uptake of services.
It comes down to the impact on the end user experience and whether users were affected. What was, and what will be, the typical performance of the application for a client before and after the migration to the cloud? You would think it would be better… but how would you know without a baseline…
Hi Kevin,
Agree completely – end user experience is the key metric. As a matter of fact, the next post will be on end user experience and real user monitoring.
Cheers,
Bernd