Virtualizing High Performance Computing

There has been a recent set of VMware Communities questions that has me thinking about the prospect of virtualizing high performance computing (vHPC) and whether or not this is practical, reasonable, and would give any gains to HPC. I think there are some gains to be made, but as with everything there are some concerns as well. This is of interest to me because at one time I was deep into High Performance Technical Computing, and marrying virtualization to HPC/HPTC would be a very interesting option.

When considering whether or not to use a machine for HPC, there are several up-front concerns. The first is the number of CPUs within the system, as HPC applications are heavily threaded. More processors imply more possibilities for race conditions, locking issues, and other thread-specific errors. It is common that once you go over 4 processors there tend to be more thread race conditions to solve than when you stay below 4, and the same holds true for going over 8 processors.
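For those who have never chased one down, the classic illustration is an unsynchronized counter shared between threads: the more hardware threads running at once, the more often the updates interleave and the total comes up short. A minimal sketch in plain C with pthreads (nothing HPC- or VMware-specific about it):

```c
/* Classic data race: two threads increment a shared counter without a lock.
 * On a multi-core box the final count usually comes up short because the
 * read-modify-write of counter++ interleaves between threads.
 * Build: gcc -pthread race.c -o race
 */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;              /* shared, unprotected state */

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++)
        counter++;                    /* not atomic: load, add, store */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("expected 2000000, got %ld\n", counter);
    return 0;
}
```

Run it a few times on a multi-core machine and the printed total will usually fall short of 2,000,000.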
The APIs used by HPC applications use some form of message passing over high speed interconnects (Quadrics, InfiniBand, 10G, etc.). The MPI and PVM APIs were designed specifically for these high speed interconnects, which exist in addition to the standard network connections used to manage the cluster.
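For reference, here is a minimal sketch of what message passing looks like from the application side using MPI; the choice of interconnect (InfiniBand, 10G, or a plain Ethernet path) is hidden beneath the MPI library, so the same code runs over any of them:

```c
/* Minimal MPI ping: rank 0 sends a value to rank 1, which sends it back.
 * The underlying interconnect is selected by the MPI library, not by this code.
 * Build and run with any MPI distribution, e.g.:
 *   mpicc ping.c -o ping && mpirun -np 2 ./ping
 */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got %d back\n", value);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}
```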
HPC makes use of CPU Parallelism in many ways. But how does this work within a virtual environment?
VMware ESX, and for that matter most other hypervisors, do the following:

When a VM is scheduled, each vCPU within that VM is assigned to a specific Core or Hyperthread within the CPU subsystem of the server. If there are enough Cores or Hyperthreads to run more than one VM at a time, then those VMs run in parallel.
Within each VM, VMware ESX supports Virtual SMP (Symmetric Multi-Processing), which means more than one vCPU (up to 8) can be assigned to a given VM, and if the Guest Operating System within the VM can make use of each of the presented vCPUs, it will. There is no difference between this and a physical box: if the Guest Operating System (Linux, Windows, *BSD, Solaris, etc.) supports parallelism, then your application can take advantage of it, as the sketch below shows.
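As a quick illustration, the guest cannot tell whether its processors are physical cores or vCPUs; a threaded program simply asks the OS how many processors are online and scales to that. A minimal Linux sketch:

```c
/* The guest OS reports however many vCPUs ESX presents; a threaded
 * application just asks and scales accordingly.
 * Build: gcc -pthread cpus.c -o cpus
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void *worker(void *arg)
{
    printf("worker %ld running\n", (long)arg);
    return NULL;
}

int main(void)
{
    long ncpus = sysconf(_SC_NPROCESSORS_ONLN);   /* processors seen by the guest */
    pthread_t *threads = malloc(ncpus * sizeof(*threads));

    printf("guest sees %ld processors\n", ncpus);
    for (long i = 0; i < ncpus; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < ncpus; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    return 0;
}
```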

There is an odd bird when it comes to CPU Parallelism and that is VMware Fault Tolerance.

Its sole intention is to keep a vCPU on one Host in lockstep with a vCPU on another Host. This means you have two vCPUs on different hosts acting, for all practical purposes, as one whenever that VM's vCPU is scheduled to run on a given Core or Hyperthread.

HPC is mainly compute intensive, so one of the aspects of CPU Parallelism discussed above is not something you actually want to enable: CPU Hyperthreading. Why? Hyperthreading runs multiple hardware threads within a Core, but the only time a thread is swapped out is when it needs to go to main memory. Since that access takes 10-20 times as long as running a single instruction, the hyperthread runs while the CPU would otherwise just be waiting for main memory to respond.
Yet in compute intensive HPC clusters main memory access is minimized, so only one thread is actually running within a Core, not several. So for vHPC, hyperthreads should be disabled in the hardware and only real Cores should be used.
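One way to double-check from inside a Linux node that hyperthreading really is off is to compare the "siblings" and "cpu cores" fields in /proc/cpuinfo; when the two match, each core exposes a single hardware thread. A small sketch (Linux/x86 only, and only a sanity check, since the actual switch lives in the server's BIOS):

```c
/* Reads /proc/cpuinfo on Linux/x86: if "siblings" equals "cpu cores",
 * each physical core exposes a single hardware thread, i.e. hyperthreading
 * is off (or absent).
 */
#include <stdio.h>
#include <string.h>

int main(void)
{
    FILE *f = fopen("/proc/cpuinfo", "r");
    char line[256];
    int siblings = 0, cores = 0;

    if (!f) {
        perror("/proc/cpuinfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        if (!strncmp(line, "siblings", 8))
            sscanf(line, "siblings : %d", &siblings);
        else if (!strncmp(line, "cpu cores", 9))
            sscanf(line, "cpu cores : %d", &cores);
        if (siblings && cores)
            break;
    }
    fclose(f);

    if (siblings > cores)
        printf("hyperthreading appears ON (%d threads over %d cores)\n",
               siblings, cores);
    else
        printf("hyperthreading appears OFF (%d threads, %d cores)\n",
               siblings, cores);
    return 0;
}
```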
The key to making vHPC work is to choose a box with more Cores than you would normally have in a single pHPC cluster node. If you normally have dual quad-core boxes, then your vHPC box should have more than 8 Cores; otherwise you get no benefit from virtualization. A fully loaded 64-core DL785 G2 seems a perfect candidate for a vHPC node. With 64 cores, one could easily build a 15-node vHPC cluster of 4-vCPU virtual machines. Why only 15 VMs for this one box when there are 64 cores? Because the hypervisor and virtual networking also require CPU, as does perhaps one extra VM.
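The sizing math is simple enough to put in a few lines. The 4-core reserve for the hypervisor, virtual networking, and the extra VM below is my own working assumption rather than a VMware sizing rule:

```c
/* Back-of-the-envelope vHPC sizing: how many N-vCPU nodes fit on a box,
 * leaving some cores aside for the hypervisor, virtual networking, and
 * one extra (interconnect) VM. The 4-core reserve is an assumption.
 */
#include <stdio.h>

int main(void)
{
    int total_cores    = 64;   /* e.g. a fully loaded DL785 */
    int vcpus_per_node = 4;
    int reserved_cores = 4;    /* hypervisor + vSwitch + extra VM (assumed) */

    int nodes = (total_cores - reserved_cores) / vcpus_per_node;
    printf("%d-core box -> %d nodes of %d vCPUs (%d cores reserved)\n",
           total_cores, nodes, vcpus_per_node, reserved_cores);
    return 0;
}
```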
Now that we have fifteen 4-vCPU nodes for our vHPC cluster, how do we manage the interconnect? The interconnect is really what makes many HPC clusters sing; it has to be fast. In steps VMCI. The VMCI interface bypasses the virtual switching layer and connects VMs on the same host together in memory instead. What this means, however, is that PVM and MPI need a new interconnect module to make this work. Alternatively, you could use a single virtual switch to act as the interconnect and do everything using standard networking protocols. The vSwitch wire runs at the speed of the device to which it is connected and of the drivers used within the VM, so dedicating a pair of load-balanced 10Gb Ethernet ports just to the VMs sounds very good. If you add more ports you get more load balancing, but not necessarily aggregation. VMCI, on the other hand, runs within memory and is extremely fast.
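To give a feel for what bypassing the virtual switching layer looks like to code, here is a sketch of a listening endpoint using the Linux AF_VSOCK socket family, which is how VMCI sockets are exposed to Linux guests; the port number is a placeholder, and an MPI or PVM interconnect module could wrap sockets like this in place of TCP over the vSwitch:

```c
/* Sketch of a VMCI/vSockets server endpoint using the Linux AF_VSOCK
 * family (backed by VMware's VMCI transport in ESX guests). The port
 * number is a placeholder, not a standard.
 */
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <linux/vm_sockets.h>

#define VHPC_PORT 5000   /* placeholder port */

int main(void)
{
    int s = socket(AF_VSOCK, SOCK_STREAM, 0);
    struct sockaddr_vm addr;
    char buf[64];

    memset(&addr, 0, sizeof(addr));
    addr.svm_family = AF_VSOCK;
    addr.svm_cid    = VMADDR_CID_ANY;  /* accept from any peer context ID */
    addr.svm_port   = VHPC_PORT;

    if (s < 0 || bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(s, 1) < 0) {
        perror("vsock setup");
        return 1;
    }

    int peer = accept(s, NULL, NULL);
    ssize_t n = read(peer, buf, sizeof(buf) - 1);
    if (n > 0) {
        buf[n] = '\0';
        printf("received: %s\n", buf);
    }
    close(peer);
    close(s);
    return 0;
}
```

The point is that the endpoint is addressed by a context ID rather than an IP address, and the traffic never touches the vSwitch.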
So what do you do if you want more than 15 nodes for your vHPC cluster? Perhaps another fully loaded DL785 G2 is available. Remember the extra VM I mentioned? You could make an interconnect VM on each box; the two talk over the wire between the DL785s and in turn pass the data on locally using VMCI. If you use purely a 10Gb network interconnect, this would not be necessary.
Will this be a true HPC cluster? Absolutely. There will be thirty 4-vCPU VMs available for compute tasks. With 10Gb links you have a speedy interconnect; if you use VMCI you may be even faster (memory speeds), but that requires the special interconnect VMs.
At the beginning of this post I mentioned race conditions. Most of the time new race conditions show up when you go over 4, 8, or 16 CPUs per physical server. By limiting the VMs to 4 or 8 vCPUs we can limit the possibility of race conditions while getting more use out of the physical hardware. Granted, under virtualization the timing of the application will once more change, which may surface other issues.
There may exist a proof of concept within Platform Computing's ISF product, with its various Allocation Policies across physical and virtual hardware. The interesting question is whether it is designed for HPC or for general cloud usage. Its heritage is most likely the management tools Platform is known for within the HPC realm. Could this be the perfect management tool for vHPC?
Even so, this is one demo I would like to see.