DataStax - Three Ways to access the same Big Data

We recently had a conversation with DataStax about their DataStax Enterprise product, which got us thinking a little about the nature of Big Data and Cloud. DataStax is the company behind the open source Cassandra NoSQL database. It provides technical direction and the majority of committers to the Apache Cassandra project. Cassandra in turn is a column-family database along the lines of Google’s BigTable. If you are a SQL programmer, its defining feature is… it doesn’t do joins.

DataStax was founded in 2010 to commercialize Cassandra. It is based in San Mateo, California, and has recently opened an office in Europe to allow direct support of European customers. We were expecting the usual conversation around permissively licensed open source business models, with DataStax providing an enterprise-supported version of Cassandra, which in a sense they do. However, the conversation didn’t go quite as we expected, because it focused much more widely on Big Data.

In most people’s minds NoSQL isn’t the same as Big Data, since Big Data means Hadoop or some variant of MapReduce. However, in practical terms, Big Data comes from somewhere: it may be coming in from a website or a sensor or a game, and at the point of capture the data usually has some significance in the context of the application that is capturing it, so it needs to be recorded and processed as it happens. Thereafter, it may need to be analyzed, and perhaps it may need to be searched. DataStax’s big idea is that it should be possible to access the same data in situ for all three of these purposes. So whilst your application may be managing session state and user activity in Cassandra, and Cassandra may be partitioned across multiple servers in a cloud, the same data may be accessed using the same partitioning scheme via two additional APIs: Hadoop for analytics and Lucene/Solr for search. Under the covers there is a single Cassandra database, rather than a separate filesystem or data store for each purpose.
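To make the "one store, three access paths" idea concrete, here is a minimal Python sketch. It is a toy in-memory model, not the Cassandra, Hadoop or Solr APIs: the ToyStore class, its method names and the sample session rows are all hypothetical, and the search path is a naive scan standing in for a real index. The point it illustrates is only that operational reads and writes, MapReduce-style analysis and search can all run against the same partitions without copying the data elsewhere.

```python
import zlib
from collections import defaultdict


class ToyStore:
    """Toy partitioned column-family store (hypothetical, not a real API)."""

    def __init__(self, num_partitions=3):
        # Each partition maps row key -> {column name: value}.
        self.partitions = [dict() for _ in range(num_partitions)]

    def _partition_for(self, row_key):
        # Deterministic hash so every access path agrees on placement.
        return zlib.crc32(row_key.encode("utf-8")) % len(self.partitions)

    # 1. Operational access: the application reads and writes rows by key.
    def put(self, row_key, columns):
        partition = self.partitions[self._partition_for(row_key)]
        partition.setdefault(row_key, {}).update(columns)

    def get(self, row_key):
        return self.partitions[self._partition_for(row_key)].get(row_key)

    # 2. Analytic access: map over every partition in place, then reduce.
    def map_reduce(self, map_fn, reduce_fn):
        grouped = defaultdict(list)
        for partition in self.partitions:
            for row_key, columns in partition.items():
                for key, value in map_fn(row_key, columns):
                    grouped[key].append(value)
        return {key: reduce_fn(values) for key, values in grouped.items()}

    # 3. Search access: naive whole-word match over the same rows.
    def search(self, term):
        term = term.lower()
        hits = []
        for partition in self.partitions:
            for row_key, columns in partition.items():
                if any(term in str(v).lower().split() for v in columns.values()):
                    hits.append(row_key)
        return sorted(hits)


store = ToyStore()
store.put("session:1", {"user": "alice", "action": "viewed pricing page"})
store.put("session:2", {"user": "bob", "action": "viewed docs page"})
store.put("session:3", {"user": "alice", "action": "started free trial"})

user = store.get("session:1")["user"]  # 'alice'
events_per_user = store.map_reduce(lambda key, cols: [(cols["user"], 1)], sum)
pricing_hits = store.search("pricing")  # ['session:1']
```

Note that nothing is exported between the three calls at the end: the count of events per user and the search both read the rows exactly where the application wrote them.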

The key benefit for customers is the removal of the requirement for an Extract, Transform and Load (ETL) step in the migration of data from operational systems to analytic systems. This can clearly reduce the complexity of the application architecture, and may reduce software, hardware and management costs. It also gives the opportunity to reduce the cycle time of analytics inside the application: data can be analyzed more or less as it is being created, without a requirement to extract it to a second system.

DataStax will supply a management tool called OpsCenter Enterprise, which provides a single point of control for all of the Cassandra, Hadoop and Solr functionality across one or more clusters. The underlying Cassandra database technology provides fault tolerance and support for multiple data centers. Of course you could also do this the other way round: you can provide BigTable-like NoSQL functionality on top of the Hadoop filesystem using another Apache project, HBase, which competes with Cassandra. However, DataStax’s argument is that the symmetry of Cassandra’s storage layer, in which every node plays the same role, leads to no single points of failure or potential bottlenecks, unlike the native Hadoop filesystem. Plus DataStax will provide you with support for the combined system, including search.
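The symmetry argument can be illustrated with a simplified sketch of ring-based placement, in which every node is a peer and each row is stored on more than one of them. This is an illustration under stated assumptions, not Cassandra's actual partitioner or replication strategy: the Ring class, the node names and the replication factor of 2 are all hypothetical.

```python
import bisect
import hashlib


def token(value):
    # Hash a string to a position on the ring.
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy peer-to-peer ring: no node is special (hypothetical sketch)."""

    def __init__(self, nodes, replication_factor=2):
        if replication_factor > len(nodes):
            raise ValueError("replication factor exceeds number of nodes")
        self.replication_factor = replication_factor
        # Nodes are ordered by the token derived from their name.
        self.ring = sorted((token(node), node) for node in nodes)

    def replicas(self, row_key):
        # The first node clockwise from the key's token owns the row;
        # the following nodes on the ring hold the extra copies.
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect_right(tokens, token(row_key)) % len(self.ring)
        return [
            self.ring[(start + i) % len(self.ring)][1]
            for i in range(self.replication_factor)
        ]

    def live_replicas(self, row_key, down):
        # Replicas that can still serve the row when some nodes are down.
        return [node for node in self.replicas(row_key) if node not in down]


nodes = ["node-a", "node-b", "node-c"]
ring = Ring(nodes, replication_factor=2)
keys = ["session:%d" % i for i in range(100)]

# With any single node down, every key still has a live replica.
survives_single_failure = all(
    ring.live_replicas(key, down={failed})
    for failed in nodes
    for key in keys
)
```

Because any node can locate the replicas for a key from the ring alone, there is no separate master that every request has to pass through, which is the property the argument above relies on.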

So, what does all of this have to do with Cloud? DataStax actually sells this as licensed software in the usual “open core” way for permissive licenses. You can install it on bare metal or virtualized servers or a private IaaS cloud in your data center. If you want to put it into a public cloud you can do that on Amazon or Rackspace or wherever, but DataStax does not offer a hosted Cassandra service with added Hadoop and Solr.

However, if you did put DataStax into the cloud you would end up with something interesting: data that is born in the cloud stays in the cloud, and actually does not move very fast inside the cloud. After all, it is Big Data. You may need to replicate it, you may need to partition it, and sometimes you may need to migrate partitions, but you may not need to run ETL on it. ETL is a familiar tool at the boundary between operational and data warehouse/data mart systems, but it may be an artifact of SQL and MDDB/ROLAP architectures and not required for NoSQL and MapReduce, i.e. for on-line closed-loop analytics rather than dashboards. If this turns out to be the case, we need to think of a cloud as a single database and a range of APIs by which it can be accessed, rather than as a platform for running machine images (IaaS) or even applications (PaaS).