2009 – the “Long View”: Hadoop and Cloudera

In summarizing 2009 from a 2010 perspective, Bernd naturally looked at the continuing emergence of virtualization platforms, management, and the VMware/Microsoft fight. It is also interesting to consider what would be the seminal event of 2009 when viewed from a 2020 perspective, and (without fear of immediate contradiction) I would like to suggest that 2009 will be noted for the emergence of a commercial MapReduce framework. The first such product was the Cloudera Desktop for Apache Hadoop.

MapReduce is not an area we have covered yet at The Virtualization Practice, but it is a widely used algorithmic framework for very large-scale data analysis. It provides a mechanism for dividing large problems into hierarchical sub-problems, which are solved independently and then hierarchically combined to provide a solution to the problem as a whole. Many problems in data analysis can be defined this way and (assuming there is enough data) can then be efficiently executed in multiple pieces – and thus across very large clusters of distributed computers.
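To make the split/solve/combine pattern concrete, here is a toy word count expressed as independent map steps, a grouping ("shuffle") step, and a final reduce. This is a minimal single-process sketch of the pattern, not Hadoop's actual API; the function names and input chunks are invented for illustration.

```python
from collections import defaultdict

# Map: each input chunk is processed independently (in a real cluster,
# on a different machine), emitting (key, value) pairs.
def map_chunk(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: group values by key across all the independent map outputs.
def shuffle(mapped):
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups

# Reduce: combine each key's values into a final result.
def reduce_groups(groups):
    return {key: sum(values) for key, values in groups.items()}

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_groups(shuffle([map_chunk(c) for c in chunks]))
print(counts["the"])  # 3
```

The point of the structure is that the map calls share no state, so the data can stay distributed and only the small intermediate (key, value) pairs move across the network.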

Whereas the relational database is based around tables, and was built in the days when data was assumed to reside in tabular form on a disk, MapReduce is based around trees and/or graphs (within which there are trees), and works on the assumption that data is distributed around a network. It is unlikely that most large-scale data warehouse applications built around a “star schema” or similar would easily migrate to a MapReduce implementation, but it is useful for anything with a natural hierarchy (e.g. a bill of materials) that requires recursive subqueries. A good application (familiar to IT) is log file analysis, but there are applications in areas as diverse as gene sequence analysis and energy networks (for improved efficiency and security of supply). It is also generally useful for anyone trying to do complex intermediation of supply and demand – a good example being last.fm looking to match the supply of music to customers’ preferences.
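The bill-of-materials case shows why hierarchies suit the model: each subtree can be costed independently and the results combined upward, which is exactly the recursive subquery that is awkward in a flat table. A toy sketch, with invented part names and costs (nothing to do with Hadoop itself):

```python
# Hypothetical bill of materials: part -> list of (subpart, quantity).
bom = {
    "bike": [("frame", 1), ("wheel", 2)],
    "wheel": [("rim", 1), ("spoke", 32)],
    "frame": [], "rim": [], "spoke": [],
}
# Invented unit costs for leaf parts.
unit_cost = {"frame": 100.0, "rim": 20.0, "spoke": 0.5}

# Recursive rollup: a part's cost is its own cost plus its subparts' costs.
# Each subtree is independent, so the subcalls could run on separate nodes.
def rollup(part):
    own = unit_cost.get(part, 0.0)
    return own + sum(qty * rollup(sub) for sub, qty in bom[part])

print(rollup("wheel"))  # 20.0 + 32 * 0.5 = 36.0
```

Because the subtrees share no state, the same computation maps naturally onto a distributed cluster, whereas a star-schema query engine would express it as repeated self-joins.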

There is an Apache open-source framework for implementing MapReduce, known as Hadoop, built largely by Yahoo. Basically, you install the Hadoop service onto a bunch of commodity Linux servers and (as long as your application is naturally map-reducible) the cluster behaves like a supercomputer. It’s important to understand that in most applications MapReduce is a batch process. Analysis is done offline, and generates a data structure (e.g. an index) which is then queried online.
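The offline/online split above can be sketched in a few lines: a batch phase builds an inverted index from raw log lines (standing in for a Hadoop job writing its output to disk), and the online phase queries only that precomputed index. The log lines and function names are invented for illustration; this is not Hadoop code.

```python
from collections import defaultdict

# Offline "batch" phase: scan the raw data once, build an inverted index
# mapping each token to the line numbers where it appears.
def build_index(lines):
    index = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in line.split():
            index[token].add(line_no)
    return index

# Online phase: queries hit the precomputed index, never the raw data.
def query(index, token):
    return sorted(index.get(token, set()))

logs = ["GET /home 200", "GET /login 500", "POST /login 200"]
index = build_index(logs)
print(query(index, "/login"))  # [1, 2]
```

The expensive scan happens once, on the cluster's schedule; the online query is a cheap lookup – which is why batch latency is acceptable for search-style workloads.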

Of course the Hadoop cluster needs managing, and management is basically about scheduling jobs and adding/removing computers from the cluster. If we draw an analogy to Xen, the open-source hypervisor, Cloudera is the associated company (analogous to XenSource), providing a commercial implementation along with management tooling, support, and services.

Cloudera is up to $11M in venture funding (as of the middle of 2009), which isn’t a silly number by Silicon Valley standards. However, unlike, for example, MySQL, it doesn’t fully own its core intellectual property (IP), which is collaboratively developed within Apache. The best analogy is XenSource, which didn’t fully own Xen. The calculation amongst the venture capitalists is that (as Citrix did with XenSource) someone will pay over the odds for the management tooling company in order to establish a position in the space. Given that in mid-2008 Microsoft paid $100M for Powerset (a startup semantic search engine built using Hadoop), the current expectation is that the venture capitalists will do well.

Our view on our notional 10-year horizon is that MapReduce and Hadoop will do well, but that the lessons of the XenSource acquisition will have been learned – it doesn’t make sense for an incumbent in one space to enter another by paying ridiculous amounts of money for some management “garnish” around an inherently open-source infrastructure layer. Much smarter to do what Oracle did with Virtual Iron: wait for it to run out of money and then pick up the brand, the IP, and the people for pennies.

We do, however, expect vast fortunes to be made in the provision of services based on Hadoop.  Internet search, obviously, but also personal genomics and optimized forms of supply chain and portfolio management.