Ask any virtualization administrator what their major pain points are and the first thing on the list will be storage. It isn’t surprising. Storage was likely the first major bottleneck for virtualization, back when it was “the Internet” and not “the cloud.” And as any IT person can tell you, there are two ways storage can be a bottleneck: performance and capacity. Traditionally, the problem of capacity is less complicated to solve than that of performance. To gain capacity you just add disk. To gain performance you need to select a disk form factor (2.5″ or 3.5″), a connection technology (SAS, iSCSI, Fibre Channel), a rotational speed (7200, 10000, 15000 RPM), sometimes a controller (do I get the Dell PERC with 512 MB of cache or 1 GB?), and then do the math to figure out how many disks you need to match both the problem of your I/O and its corollary: the problem of your budget. Complicating things, virtualization turned most I/O into random I/O. What might be a nice sequential write from each virtual machine looks pretty random in aggregate. Of course, random I/O is the hardest type of I/O for a disk to do.
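To make “do the math” concrete, here’s a rough back-of-the-envelope sketch of the spindle-count arithmetic; the workload figures, read/write mix, RAID write penalty, and per-disk IOPS below are hypothetical placeholders, not measurements from any real environment.

```python
import math

# Hypothetical workload and hardware numbers -- substitute your own.
required_iops = 5000      # aggregate random IOPS the VMs generate
read_fraction = 0.7       # 70% reads, 30% writes
raid_write_penalty = 4    # RAID 5; RAID 10 is roughly 2, RAID 6 roughly 6
iops_per_disk = 180       # ballpark figure for a 15,000 RPM SAS drive

# Writes cost extra back-end I/Os because of RAID parity or mirroring.
backend_iops = (required_iops * read_fraction
                + required_iops * (1 - read_fraction) * raid_write_penalty)

disks_needed = math.ceil(backend_iops / iops_per_disk)
print(f"~{backend_iops:.0f} back-end IOPS -> roughly {disks_needed} disks")
```

And then the budget corollary follows immediately: multiply that disk count by the price of 15K SAS spindles and a controller with enough cache, and the quote gets uncomfortable fast.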
Then a few things happened. First, RAM modules got bigger and cheaper, in a Moore’s Law sort of way, and we started thinking that $5000 of extra RAM in a server might be way cheaper than an upgrade to the SAN. So we started using memory to take the load off our perpetually overworked storage arrays simply by adding more memory to the operating systems and growing Oracle’s SGA, Java’s heap size, or countless buffers in our environments. Sometimes we recoded our applications to use in-memory databases and key-value stores like memcached. And nowadays, maybe we use some of that RAM explicitly as a storage tier, inserted either as a shim between our slow arrays and fast CPUs to keep more data local, or as the whole array itself.
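As an illustration of that “recode the application” approach, here’s a minimal cache-aside sketch against memcached, using the pymemcache client; the key scheme and load_report_from_db() are hypothetical stand-ins for whatever slow, I/O-heavy query you’re trying to keep off the array.

```python
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))   # memcached, serving hot data from RAM

def load_report_from_db(report_id):
    # Imagine a slow, I/O-heavy database query here.
    return b"...lots of rows..."

def get_report(report_id):
    key = f"report:{report_id}"
    value = cache.get(key)             # hit: answered from memory, no disk I/O
    if value is None:                  # miss: go to the database once...
        value = load_report_from_db(report_id)
        cache.set(key, value, expire=300)   # ...then keep it in RAM for 5 minutes
    return value
```

Every cache hit is a read the storage array never sees, which is exactly the point.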
Second, solid state drives (SSDs) showed up, and over a decade manufacturers steadily worked through the reliability and speed issues. Reliability was solved through a lot of little changes: longer lifespans at the semiconductor level plus better wear-leveling techniques in the drive firmware. SSD-friendly behavioral training for operating systems was also crucial. The Logical Block Addressing scheme PCs used to overcome early BIOS limitations caused filesystems to not be aligned with the underlying storage on a power-of-two boundary (like 2048-byte blocks). Instead, I/O was offset, so that every read and every write ended up having to touch two blocks on disk. This wasn’t such a problem with traditional magnetic media, but an SSD is built of units that are sized in powers of two and have a very limited number of write cycles. Inadvertently doing twice as much I/O caused your device to live half as long and be half as fast. Even now you see SSDs and flash labeled as read-optimized or write-optimized, indicating that different choices have been made to prolong the lifespan of the device. Dell Compellent arrays even use both types internally.
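To see why those old partition offsets hurt, here’s a toy alignment check; the 4,096-byte flash page size is an assumption for illustration, and real erase blocks are larger still.

```python
SECTOR = 512          # bytes per logical sector
FLASH_PAGE = 4096     # assumed flash page size for illustration

def is_aligned(start_sector):
    """A partition is aligned if its byte offset is a multiple of the page size."""
    return (start_sector * SECTOR) % FLASH_PAGE == 0

# Legacy MS-DOS partitioning started at sector 63; modern tools use sector 2048 (1 MiB).
for start in (63, 2048):
    offset = start * SECTOR
    status = "aligned" if is_aligned(start) else "misaligned: blocks straddle two pages"
    print(f"partition starting at sector {start} (byte {offset}): {status}")
```

Misaligned like that, a single 4 KB filesystem write lands across two flash pages, which is where the “half as fast, half the lifespan” penalty comes from.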
Speaking of speed, the British computer scientist David Wheeler once said that “all problems in computer science can be solved by another level of indirection.” For SSDs the opposite is the case. By removing abstractions like RAID, SAS and SATA, port expanders, and all the other baggage of traditional disks, and strapping flash memory straight to the PCI Express bus, we’ve been able to create incredibly fast pools of data. They aren’t a panacea, though, as these fast pools come at a price far beyond that of magnetic media. They’re also local, which presents problems for reliability and distributed computing. But in the quest to move data closer to the CPUs and decouple storage capacity from storage performance they’re a great tool to have, and they may still be cheaper than upgrading your storage arrays.
Earlier this year my colleague Edward wrote about caching throughout the stack, and about deciding how to move data closer to where computing is done. VMworld 2013 re-emphasized this, with VMware adding native read caching to vSphere 5.5 and startups like Infinio and PernixData winning awards for their own approaches. From SanDisk FlashSoft to Condusiv V-locity, there are options for all budgets and architectures. It is the year of caching, and over the next few weeks I’ll be posting about the interesting things going on with caching within the virtualization community.
I hope that in twenty years our storage will be a giant lump of flash sitting in our data center, and we can look back at all these caching hacks with nostalgia. For now we are stuck marrying the fast, expensive flash to the slow, economical spinning disk as best we can, balancing locality with redundancy and price with performance.