A Look at Stratus Technology

There used to be a FedEx commercial with a slogan along the lines of “when it just has to be there overnight”. What if we did a play on words and changed the slogan to fit fault tolerance or high availability? It would be something like “when it just has to remain running overnight”.

Most business environments today demand both performance and very high availability. Virtual environments already include some high-availability options, such as the ability to restart the virtual machines that were running on a host that failed. This still has a limitation: the virtual machines have to be restarted, and the restart itself means downtime. How much downtime depends on things like the number of virtual machines to be restarted and the number of hosts available to take them on. It could be as short as five minutes or as long as thirty.
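To make those variables a little more concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption made for illustration: the function name, the idea that each surviving host restarts a fixed number of virtual machines at a time, and the timing figures. None of it is a measurement or a VMware default.

```python
import math


def estimate_restart_downtime_minutes(
    vms_to_restart: int,
    surviving_hosts: int,
    concurrent_restarts_per_host: int = 4,  # assumption, not a VMware default
    detection_minutes: float = 1.0,         # assumption: time to notice the failure
    boot_minutes_per_vm: float = 4.0,       # assumption: time for one VM to boot
) -> float:
    """Rough time until the LAST virtual machine is back up after a host failure."""
    if surviving_hosts < 1:
        raise ValueError("at least one surviving host is needed to restart anything")
    if vms_to_restart <= 0:
        return 0.0
    # Virtual machines restart in "waves": each wave fills every available slot.
    slots = surviving_hosts * concurrent_restarts_per_host
    waves = math.ceil(vms_to_restart / slots)
    return detection_minutes + waves * boot_minutes_per_vm


small = estimate_restart_downtime_minutes(vms_to_restart=8, surviving_hosts=2)
large = estimate_restart_downtime_minutes(vms_to_restart=56, surviving_hosts=2)
print(f"8 VMs on 2 hosts:  about {small:.0f} minutes")
print(f"56 VMs on 2 hosts: about {large:.0f} minutes")
```

The point is only the shape of the relationship: more virtual machines or fewer surviving hosts means more restart waves, and the last virtual machine in line waits the longest.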

In some environments, even five minutes of downtime is not acceptable. Consider financial or medical environments as examples. When a host goes down, the virtual machines go down with it, and anything that was in memory is lost. What if that was a financial transaction, or a message from a hospital monitoring machine that sends out an alert when a patient is in distress? If the transaction or alert was in memory at the time of the crash, it is gone, and that could cost millions of dollars or even someone’s life.

The Stratus fault-tolerance solution solves exactly this problem, and it does so in hardware, out of the box. VMware Fault Tolerance (VMware FT) can also do this, but it currently has some serious limitations: a protected virtual machine can have only a single virtual processor, and VMware does not recommend, or really support, running all of your virtual machines under VMware FT. The Stratus approach has no such limitations and is designed to keep all your virtual machines operating without data loss and with nearly unchanged performance in the guests.

What is the secret sauce in this solution? I spent some time talking with the Stratus team at VMworld to get the inside scoop. Picture this: what if you could take something like two blade servers, put them together in a single case, and run the servers themselves in a RAID 1 configuration? By that I mean the servers themselves are mirrored and kept in lock-step. As a test, the Stratus team pulled the power on one of the hosts, and all the virtual machines kept on running. The Stratus server immediately phones home to report the problem, and a Stratus engineer then calls to verify it and ships out whatever is needed for the fix. It reminds me of OnStar calling after your car has been in an accident.

As an example, let’s say the motherboard failed. Stratus would send out the part, and you would pull out the faulty unit and replace the part. Once the repair is complete, you slide the unit back into place, the server RAID (or mirror) is rebuilt, and lock-step is restored in a relatively short amount of time. All this happens with the virtual machines never skipping a beat or having any idea there was ever a problem.
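For readers who think better in code, here is a toy model of that mirror, fail, replace, and resync cycle. It is only a software analogy under loud assumptions: the real product does this in hardware, and the class and method names below (`Replica`, `LockstepPair`, `replace_and_resync`) are made up for illustration and are not Stratus APIs.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One half of the mirrored pair (hypothetical, for illustration only)."""
    name: str
    state: dict = field(default_factory=dict)
    healthy: bool = True


class LockstepPair:
    """Toy model: every write goes to both halves, so either can carry on alone."""

    def __init__(self) -> None:
        self.replicas = [Replica("A"), Replica("B")]

    def write(self, key, value) -> None:
        live = [r for r in self.replicas if r.healthy]
        if not live:
            raise RuntimeError("both halves are down")
        for replica in live:  # apply the same operation to every healthy half
            replica.state[key] = value

    def read(self, key):
        for replica in self.replicas:
            if replica.healthy:  # any healthy half can answer
                return replica.state[key]
        raise RuntimeError("both halves are down")

    def fail(self, index: int) -> None:
        self.replicas[index].healthy = False

    def replace_and_resync(self, index: int) -> None:
        survivor = self.replicas[1 - index]
        if not survivor.healthy:
            raise RuntimeError("no healthy half to resync from")
        # The repaired unit is rebuilt from a copy of the survivor's state.
        self.replicas[index] = Replica(self.replicas[index].name, dict(survivor.state))

    def in_lockstep(self) -> bool:
        a, b = self.replicas
        return a.healthy and b.healthy and a.state == b.state


pair = LockstepPair()
pair.write("alert", "patient in distress")
pair.fail(0)                    # pull the power on one half
pair.write("transaction", 100)  # work continues on the surviving half
pair.replace_and_resync(0)      # slide the repaired unit back in
```

After the last line both halves hold the alert and the transaction again, which is the software equivalent of lock-step being restored while the workload never noticed the failure.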

Knowing that your servers will continue running without issue: priceless!