Coming Full Circle on Scale Out vs. Scale Up

When I first started with virtualization, the only option available at the time was single-core processors in the hosts. Scale up versus scale out was the hotly debated topic when designing your infrastructure.  On one side of the coin, the idea was to scale up: get a few of the biggest servers you could find and load them up with as much memory and as many processors as you could fit in the box.  The end result was some very expensive servers able to run a lot of virtual machines for their time.  The other side of the coin presented the idea that it was better to scale out with more, smaller servers making up the cluster.  I have worked in both types of environments, and with both attitudes, over the years, and personally I aligned myself with the scale-out philosophy.  The simple reason for siding with the scale-out group was host failure.  When you have sixty to eighty virtual machines per host and lose that host, that is a lot of eggs in one basket, and it takes some time to recover.  With more, smaller servers, the shock of losing a host is not as severe because fewer virtual machines are running on any single host, and recovery takes less time.  This was during the time before vCenter, vMotion, HA and DRS, when it was just you and the VMware ESX hosts.

Fast forward to today, and multi-core processors come standard in pretty much every server.  I found a VMmark Performance Brief that HP released in April 2010 about the then-new dual-socket ProLiant DL385 G7, with twelve cores per socket for a total of twenty-four cores per machine.  The VMmark benchmark score was 30.96@22 tiles.

“This outcome means the server can run 132 virtual machines”

as announced in the brief.  So, have we once again come full circle?  Back in the day, the HP DL380 and DL385 were the scale-out models I liked to recommend.  Granted, with HA and DRS to automate the recovery of the virtual machines on a failed host, it will still take some time to restart over a hundred virtual machines.  In fact, one environment that I had designed with the scale-out approach lost a host, and with it around thirty virtual machines; VMware HA recovered them in about five minutes, fast enough that the alerts never even triggered.  That was the moment that really solidified my view.
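
For anyone curious about the arithmetic behind that quote, my understanding is that a VMmark 1.x tile represents six workload virtual machines, which is where the 132 figure would come from.  Here is a quick back-of-the-envelope sketch; the six-VMs-per-tile count is my assumption from the VMmark 1.x methodology, not something stated in the HP brief:

    # Back-of-the-envelope math behind the 132-VM claim
    vms_per_tile = 6    # assumption: a VMmark 1.x tile is six workload virtual machines
    tiles = 22          # from the 30.96@22 tiles score in the HP brief
    total_vms = vms_per_tile * tiles
    print(total_vms)    # prints 132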

If we are going to better serve our customers, should speed of recovery be one of the top considerations in the design?  I understand the need for server consolidation, and in some datacenters the absolute need to reduce the physical footprint, but where do we draw the line?  I would really like to hear your thoughts: when you design a system, what is an acceptable recovery time, and what is the current recovery time in your environment?  I understand that we need multi-core processors to handle the increased demand for multi-processor virtual machines, but when does that become overkill, or does it?  Inquiring minds want to know.