Eucalyptus/Terracotta: a scalable Java Cloud Platform?

We have a series of posts (SpringSource/VMware, 3Tera, Eucalyptus, Hadoop/Cloudera…) about applications that directly target a distributed virtual machine, one which abstracts over the virtualization layer and/or operating system. Essentially these are aimed at those who are building or adapting applications for the cloud, rather than starting from the premise of virtualizing existing infrastructure.

It must be said there is no clear model yet emerging for how you do this. The 3Tera solution is slick and allows you to define your infrastructure at a logical (application) level and grow or shrink your architecture graphically on commodity hardware, but ultimately there are limits to the horizontal scalability of the layers in the architecture that comprise your application. When we last looked at Eucalyptus it was driving in a similar direction with packaged VMs and its own scalable filesystem, but wasn’t really dealing with the tiers of an application as logical entities.

We recently received a presentation on a combined solution from Eucalyptus and Terracotta. Initially we were suspicious, because they clearly share an investor – Benchmark Capital. Was this a PowerPoint integration dreamt up by two venture capitalists over a power breakfast? However, the combined solution was presented by some very plausible techies with a real, live demo, and it does look as though it starts to provide a generally useful abstraction over which to deploy scalable applications (specifically Java stacks), and it too works with commodity hardware. It’s not as slick as the 3Tera solution, more of a command-line approach, but it potentially has the edge in scalability.

When I used to run a performance-testing company, we always tended to find that the customer’s problems could mysteriously be solved by the biggest computer that HP or Dell or IBM or Sun, or whoever, could sell them at that point in time. Over the ten years I was in the business, the performance of that machine increased massively, but the number of users it could support increased far less. In reality, programmers get lazy, frameworks get fat, people wrap one framework up in another, bigger framework, and security and/or UI requirements get more complex; code bloats out more or less enough to remove any benefit from the increased performance of the hardware. So the reality of scalability stays the same, and you end up throwing replicated hardware (horizontal scalability) at the problem to respond to a business requirement, rather than paring down software to make it faster and thus require only a single server.

In one of those strange Möbius strips, scalability and performance are both adjacent and in opposition to each other, and nowhere does this show up more than in the choice between stateful and stateless objects in the middle tier. To put it simply: if you only have one server, and assuming you aren’t profligate with memory, you can write your code so that all the objects that contain the state associated with a user session are held in memory on that server. If, however, you want to scale the middle tier across multiple servers, you can no longer assume that a given user’s session will be routed through a specific server, so you have to write your code so that it persists the application state back to the database after each page request and re-constitutes that state into an appropriate set of objects on (potentially) a different server.
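To make the contrast concrete, here is a minimal sketch of the two styles in Java; the Cart type, the CartDao interface and its load/save methods are purely illustrative assumptions, not any particular framework’s API:

```java
import java.util.ArrayList;
import java.util.List;

import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpSession;

public class CartHandler {

    // Minimal stand-in for per-user session state -- purely illustrative.
    public static class Cart {
        public final List<String> items = new ArrayList<String>();
    }

    // Hypothetical DAO; a real one would map the cart to and from database rows.
    public interface CartDao {
        Cart loadCart(String userId);
        void saveCart(String userId, Cart cart);
    }

    // Single-server, stateful style: the cart lives in this JVM's memory, so it is
    // only valid while every request from this user is routed to this server.
    public Cart handleStateful(HttpServletRequest request) {
        HttpSession session = request.getSession();
        Cart cart = (Cart) session.getAttribute("cart");
        if (cart == null) {
            cart = new Cart();
            session.setAttribute("cart", cart);
        }
        return cart;
    }

    // Multi-server, stateless style: state is reconstituted from the database on every
    // request and written back afterwards, possibly on a different server each time.
    public Cart handleStateless(String userId, CartDao dao) {
        Cart cart = dao.loadCart(userId);   // expensive: query plus object reconstitution
        // ... apply this request's changes to the cart ...
        dao.saveCart(userId, cart);         // expensive: persist the state back every request
        return cart;
    }
}
```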

In reality you haven’t actually solved your scalability problem, you’ve just pushed it down the stack into the database, where there is now a lot of activity around certain tables. But databases are highly-crafted entities (within the limits of the relational model): they’ve been engineered to run well on “big iron”, and you can also get a certain distance by clustering database servers. You’ve also put a large performance hit on the middle tier, because reconstituting that object (via a connection to a database) is typically a very computationally expensive process. And so you get hit by a “double whammy” of costs, because we are back to big iron for the database, and we have a large number of servers for the middle tier.

To give you a sense of this, you could easily find that in moving your “non-scalable” stateful single-server code to “scalable” stateless multi-server code, you have added a factor of 20 to the overall CPU resource required to process a transaction. You get that back by scaling your single server out to 20 servers. (This is not an exaggeration.) It costs you a lot of hardware and software licences to get the performance back, so the industry as a whole tends to suggest that “scalability” is what you need here (rather than performance through efficient code).

But there is another way that can help claw back some of the performance. You can work with “sticky sessions”, that is, arrange for your load balancer to route a given user session back to the server that handled it last time, and only occasionally to a different one. This in turn means that if you introduce a cache (e.g. Terracotta’s open-source Ehcache) between the application and the database that caches pre-built objects, the chances are you won’t have to hit the database or reconstitute the object for the majority of page hits. Your response time will drop significantly, the throughput (number of transactions per second) that a given server can sustain will increase, so you will need fewer servers, and the load on the database itself will decrease significantly.
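As a rough illustration, a cache-aside lookup with the (then-current) Ehcache 2.x API might look something like the sketch below; the cache name “sessionState”, the session key, and the loadSessionFromDatabase helper are assumptions for the sake of the example, and the cache itself would be declared in ehcache.xml:

```java
import net.sf.ehcache.Cache;
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Element;

public class SessionStateRepository {

    private final Cache cache;

    public SessionStateRepository() {
        // Assumes a cache named "sessionState" is declared in ehcache.xml on the classpath.
        CacheManager manager = CacheManager.create();
        this.cache = manager.getCache("sessionState");
    }

    public Object getSessionState(String sessionId) {
        // Cache-aside: try the in-memory copy first...
        Element element = cache.get(sessionId);
        if (element != null) {
            return element.getObjectValue();
        }
        // ...and only fall back to the expensive database reconstitution on a miss.
        Object state = loadSessionFromDatabase(sessionId);  // hypothetical helper
        cache.put(new Element(sessionId, state));
        return state;
    }

    private Object loadSessionFromDatabase(String sessionId) {
        // Placeholder for the query-and-rebuild work described above.
        return new Object();
    }
}
```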

However, back to our “lazy” programmers. The trick is to get a caching abstraction in place so that the programmer doesn’t need to care whether the object is being served from the cache or is being reconstituted from database-resident state (in the occasional case that the object is out of cache). Also, where there are multiple application servers, we don’t want the programmer to have to go through the bother of retrieving state from a different server; it should all happen automagically. This is where the original Terracotta product, now bundled with Ehcache, comes in: Terracotta and Ehcache together give you this.
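To sketch what that transparency might look like in code: Ehcache’s SelfPopulatingCache gives a read-through cache where the loading logic is supplied once as a factory, so calling code never needs to know whether a value came from memory or was just rebuilt from the database; clustering that cache across the application servers is then, as we understand it, a matter of Ehcache/Terracotta configuration rather than application code. Again, the cache name and the loadSessionFromDatabase helper are illustrative assumptions:

```java
import net.sf.ehcache.CacheManager;
import net.sf.ehcache.Ehcache;
import net.sf.ehcache.constructs.blocking.CacheEntryFactory;
import net.sf.ehcache.constructs.blocking.SelfPopulatingCache;

public class TransparentSessionCache {

    private final SelfPopulatingCache cache;

    public TransparentSessionCache(CacheManager manager) {
        Ehcache underlying = manager.getEhcache("sessionState");
        // The factory is only invoked on a miss; callers never know whether the
        // value came from memory or was just reconstituted from the database.
        this.cache = new SelfPopulatingCache(underlying, new CacheEntryFactory() {
            public Object createEntry(Object sessionId) throws Exception {
                return loadSessionFromDatabase((String) sessionId); // hypothetical helper
            }
        });
    }

    public Object getSessionState(String sessionId) {
        return cache.get(sessionId).getObjectValue();
    }

    private Object loadSessionFromDatabase(String sessionId) {
        return new Object(); // placeholder for the real query-and-rebuild work
    }
}
```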

Then, as long as you over-specify the database server against the (massively reduced) workload, you can forget about the limits of scalability of the database for all practical purposes and use a cloud computing layer (like Eucalyptus, with its Amazon EC2 API abstraction) to add and remove servers from the mid tier to cope with variation in demand, even spilling out into the public cloud if you need to. In any case your overall requirement for horizontal scalability in the mid tier is reduced, because each server is able to support more users.
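Because Eucalyptus speaks the EC2 API, growing and shrinking the mid tier can in principle be scripted with any EC2 client; the sketch below uses the AWS SDK for Java pointed at a private Eucalyptus front end. The endpoint URL, image id, and instance type are placeholder assumptions:

```java
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.ec2.AmazonEC2Client;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.TerminateInstancesRequest;

public class MidTierScaler {

    private final AmazonEC2Client ec2;

    public MidTierScaler(String accessKey, String secretKey, String eucalyptusEndpoint) {
        this.ec2 = new AmazonEC2Client(new BasicAWSCredentials(accessKey, secretKey));
        // Point the EC2 client at the private Eucalyptus front end instead of AWS,
        // e.g. "http://cloud.internal:8773/services/Eucalyptus" (placeholder URL).
        this.ec2.setEndpoint(eucalyptusEndpoint);
    }

    // Add one more application-server instance to the mid tier.
    public void addAppServer(String imageId) {
        RunInstancesRequest request = new RunInstancesRequest()
                .withImageId(imageId)          // placeholder image id, e.g. "emi-12345678"
                .withInstanceType("m1.small")
                .withMinCount(1)
                .withMaxCount(1);
        ec2.runInstances(request);
    }

    // Shrink the tier again when demand drops.
    public void removeAppServer(String instanceId) {
        ec2.terminateInstances(new TerminateInstancesRequest().withInstanceIds(instanceId));
    }
}
```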

It should work, unless of course we have simply added three new layers of framework (Object Cache, Distributed Cache, Virtualization) whose own overheads remove all the performance gains.

To reiterate our views on this space: there is definitely something here that can abstract away from the virtualization layer, and indeed the traditional operating system, and provide a platform for new (or at least modern) applications to move into public or private clouds (or both), but there is no clear winner in either architectural or vendor terms. Ehcache itself is only one of many open-source Java caches. Or you can take the open-source Spring from VMware in hosted form as SpringSource CloudFoundry, based on the Amazon cloud (which doesn’t seem to have any ESXi in it). And then you can put Ehcache onto Spring (possibly, since CloudFoundry is still in beta). So many choices…
