With Project Catapult, Microsoft is testing a new server technology that is likely to play an important role in future cloud computing environments: one that can dramatically increase the performance of some data center workloads and breathe fresh life into Moore’s Law, all without significantly increasing server cost or power consumption. Microsoft Research’s Project Catapult pairs Intel Xeon CPUs with high-performance field-programmable gate arrays (FPGAs) configured to perform a set of predefined, resource-intensive calculations that form the core of the Bing search engine’s page-ranking service.

Testing Project Catapult

To test the idea, Microsoft built and deployed 1,632 Catapult servers across seventeen racks. The pilot achieved a 95% improvement in ranking throughput while requiring only a 10% increase in power consumption and a 30% increase in total cost of ownership.
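Taken together, those figures imply a sizeable efficiency win. A quick back-of-envelope calculation, using only the numbers quoted above, shows the throughput delivered per watt and per total-cost-of-ownership dollar:

    # Rough efficiency arithmetic from the quoted pilot figures.
    throughput_gain = 1.95   # 95% more ranking throughput
    power_increase  = 1.10   # 10% more power consumed
    tco_increase    = 1.30   # 30% higher total cost of ownership

    # Work delivered per watt and per TCO dollar, relative to plain servers.
    perf_per_watt   = throughput_gain / power_increase   # ~1.77x
    perf_per_dollar = throughput_gain / tco_increase     # ~1.50x

    print(f"Throughput per watt:   {perf_per_watt:.2f}x")
    print(f"Throughput per dollar: {perf_per_dollar:.2f}x")

Roughly 1.8 times the work per watt and 1.5 times the work per dollar is exactly the kind of gain that scaling CPUs alone can no longer deliver.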

Cloud services like Bing are under constant pressure to control and, wherever possible, reduce costs, even as the size of the service continues to grow. Until very recently, Moore’s Law has proven a reliable guide to processor performance, with improvements in design and manufacturing techniques more or less doubling CPU performance every eighteen months. However, after fifty years, Moore’s Law is breaking down. The rate of increase in clock speed has slowed to a crawl. At the same time, transistor energy efficiency has peaked, and physical transistor sizes are approaching their lower limits as it becomes increasingly difficult to prevent source-drain leakage.

Now, concerns are growing that increasing the number of cores per processor will become more difficult and deliver fewer gains than it has in the past. Without major advances in microprocessor manufacturing techniques, which are by no means assured (extreme ultraviolet lithography, or EUV, is years behind schedule), the economic benefits that Moore’s Law has delivered will disappear, significantly increasing the cost of web-scale computing and beyond. However, as Project Catapult is beginning to show, there are ways to avoid the Moorepocalypse.

Specialty Tasks

Mainstream CPUs are fast general-purpose devices, supporting a rich set of instructions that allow them to perform most tasks reasonably quickly. However, as general-purpose processors, they can only do so much. For specialty tasks, specialist processors are required. Bitcoin mining, for example, is little more than a single, albeit complex, mathematical calculation. Because that calculation is fixed, it’s possible to forgo software and run it in hardware. The bitcoin currency was conceived such that, over time, the amount of computing resources needed to mine bitcoins would increase. To maintain a commercially viable bitcoin mine in which the cost of production—chiefly power—does not outstrip the value of coins produced (fluctuations in exchange rate aside), the hardware used has had to undergo rapid evolution, from servers with high-performance CPUs, through GPUs, then FPGAs, before adopting ASICs as the fastest, most energy-efficient way of cranking out bitcoins.
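To see why the calculation maps so naturally onto hardware, consider what a miner actually computes: a double SHA-256 hash of a block header, repeated with different nonce values until the result falls below a difficulty target. The sketch below is a simplified Python rendering (a real block header embeds the nonce at a fixed offset, and real miners never run this in software), but it shows the entire fixed loop that mining hardware implements directly:

    import hashlib

    def mine(block_header: bytes, target: int, max_nonce: int = 2**32):
        """Brute-force search for a nonce whose double-SHA-256 hash falls
        below the network's difficulty target. This single, fixed
        calculation is what mining hardware implements in silicon."""
        for nonce in range(max_nonce):
            candidate = block_header + nonce.to_bytes(4, "little")
            digest = hashlib.sha256(hashlib.sha256(candidate).digest()).digest()
            if int.from_bytes(digest, "little") < target:
                return nonce, digest
        return None  # no valid nonce in this search space

Because the loop has no branching logic or data dependencies beyond the header itself, it can be unrolled into thousands of parallel hash pipelines on a single chip, which is precisely what mining ASICs do.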

ASICs trade flexibility for performance, running calculations one thousand to ten thousand times faster than software. FPGAs sit between ASICs and general-purpose microprocessors on the performance/flexibility curve. They are frequently used to prototype designs before committing them to silicon as an ASIC, but they are also used in smaller production runs for which the high tooling cost of manufacturing an ASIC is not warranted. Compared with a design burned into silicon, an FPGA typically runs at about one tenth the speed of an equivalent ASIC, yet it still outperforms a software implementation by a factor of one hundred to one thousand.

What Microsoft is doing with Project Catapult is taking the computationally intensive parts of the Bing page-ranking process and moving them from software into an FPGA. The current Catapult implementation is built from high-density 1U half-width rackmount servers, with the Catapult FPGA and its interconnects on a PCIe card attached to each server via a mezzanine board.
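Conceptually, the host side of that offload follows the familiar PCIe accelerator pattern: configure the card once, then stream feature data to it and read scores back. The sketch below is purely illustrative; the catapult module and every function on it are hypothetical stand-ins for an accelerator driver, not a published API:

    # Illustrative host-side offload loop. The `catapult` module and its
    # functions are hypothetical names, standing in for a PCIe FPGA driver.
    import catapult

    fpga = catapult.open_device(slot=0)        # hypothetical PCIe handle
    fpga.load_bitstream("ranking_stage.bit")   # configure the fabric once

    def score_documents(feature_vectors):
        # DMA the candidate documents' features to the card,
        # then block until the computed relevance scores return.
        fpga.write(feature_vectors)
        return fpga.read(len(feature_vectors))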

Catapult Server Spec

 Processor      2 x Intel Xeon 12-core Sandy Bridge CPUs
 Server Memory  64 GB of DRAM
 Disk           2 x SSD, 4 x HDD

Project Catapult employs an elastic architecture that links the FPGAs together in a six-by-eight, two-dimensional torus network that provides 20 Gb/s of peak bidirectional bandwidth at sub-microsecond latency and allows the FPGAs to share data directly without having to go back through the host servers.
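To make the topology concrete, the sketch below (an illustration of torus addressing, not Microsoft’s routing code) shows how each of the 48 FPGAs in a half-rack pod can be mapped to grid coordinates and locate its four directly wired neighbors:

    ROWS, COLS = 6, 8  # 48 FPGAs per half-rack pod

    def neighbors(index: int):
        """Return the four torus neighbors of an FPGA, given its linear
        index 0..47. Edges wrap around, so every node has exactly four
        links and no transfer needs to transit a host server."""
        r, c = divmod(index, COLS)
        return [
            ((r - 1) % ROWS) * COLS + c,  # north
            ((r + 1) % ROWS) * COLS + c,  # south
            r * COLS + (c - 1) % COLS,    # west
            r * COLS + (c + 1) % COLS,    # east
        ]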

By using FPGAs rather than the much faster ASICs, Microsoft can reprogram the Catapult fabric to adapt to new ranking algorithms without the time and cost of developing custom ASICs, and without the disruption of pulling servers from production to install new chips. That flexibility is a primary consideration when operating cloud-scale data centers. Cloud data centers see far greater uniformity than their enterprise equivalents, especially at the largest scale, where capacity is provisioned not by the rack but by the shipping container. More than anything else, standardization at this scale simplifies management and allows workloads to be scaled on demand without the risk that, while servers may be available, they are the “wrong” type for a given workload.

Future Development

Future development spurred by this initiative might include taking greater advantage of the programmable nature of FPGAs. Whereas today Microsoft’s Project Catapult servers are used exclusively to generate Bing page rankings, the technology could readily be applied to other, similar activities. By virtualizing the FPGA fabric in much the same way that server virtualization is used to provision standard OS and application stacks, it would be possible to reconfigure FPGAs to perform different tasks on demand. The biggest challenge to widespread adoption of FPGAs for data center workloads today lies in the lack of integrated development tools: FPGA development currently requires extensive hand-coding and manual tuning, increasing development costs and restricting opportunities for use. However, if this approach proves successful and FPGA use expands, better tooling will inevitably be developed.
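One way to picture such virtualization is a pool manager that treats bitstreams the way a hypervisor treats OS images, reprogramming idle devices as demand shifts. The sketch below is speculative, and all of its names (FpgaPool, load_bitstream, and so on) are hypothetical:

    # Speculative sketch of an FPGA pool that reprograms idle devices on
    # demand, much as a hypervisor provisions OS images onto free hosts.
    class FpgaPool:
        def __init__(self, free_devices, bitstreams):
            self.free = list(free_devices)  # unconfigured device handles
            self.bitstreams = bitstreams    # task name -> bitstream file
            self.loaded = {}                # task name -> configured devices

        def acquire(self, task):
            # Reuse a device already configured for this task if one exists.
            if self.loaded.get(task):
                return self.loaded[task][-1]
            # Otherwise reprogram an idle device; reconfiguring an FPGA
            # takes seconds, versus months to respin and install an ASIC.
            device = self.free.pop()
            device.load_bitstream(self.bitstreams[task])
            self.loaded.setdefault(task, []).append(device)
            return device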

FPGAs can’t be used for every workload, but for well-understood, repetitive mathematical functions, there is considerable merit in this approach. Microsoft researcher Doug Burger predicts, “This portends a future where systems are specialized dynamically by compiling a good chunk of demanding workloads into hardware…I would imagine that a decade hence, it will be common to compile applications into a mix of programmable hardware and programmable software.”