In this brave new world of big data, a database technology called "Bigtable" would seem to be worth considering -- particularly if that technology is the creation of engineers at Google, a company that should know a thing or two about managing large quantities of data. If you believe that, two Apache database projects -- Cassandra and HBase -- have you covered.
Bigtable was originally described in a 2006 Google research publication. Interestingly, that paper doesn't describe Bigtable as a database, but as a "sparse, distributed, persistent multidimensional map" designed to store petabytes of data and run on commodity hardware. Rows are uniquely indexed, and Bigtable uses the row keys to partition data for distribution around the cluster. Columns can be defined within rows on the fly, making Bigtable for the most part schema-less.
Cassandra and HBase have borrowed much from the original Bigtable definition. In fact, whereas Cassandra descends from both Bigtable and Amazon's Dynamo, HBase describes itself as an "open source Bigtable implementation." As such, the two share many characteristics, but there are also important differences.
Born for big data
Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.
Both are designed to manage extremely large data sets. HBase documentation proclaims that an HBase database should have hundreds of millions or -- even better -- billions of rows. Anything less, and you're advised to stick with an RDBMS.
Both are distributed databases, not only in how data is stored, but also in how the data can be accessed. Clients can connect to any node in the cluster and access any data.
Both claim near linear scalability. Need to manage twice the data? Then double the number of nodes in your cluster.
Both safeguard data loss from cluster node failure via replication. A row written to the database is primarily the responsibility of a single cluster node (the row-to-node mapping being determined by whatever partitioning scheme you've employed). But the data is mirrored to other cluster members called replica nodes (the user-configurable replication factor specifies how many). If the primary node fails, its data can still be fetched from one of the replica nodes.
Both are referred to as column-oriented databases, which can be confusing because it sounds like a relational database that's been conceptually rotated 90 degrees. What's even more confusing is the fact that data is apparently arranged initially by row, and a table's primary key is the row key. However, unlike a relational database, no two rows in a column-oriented database need have the same columns. As mentioned above, you can add columns to a row on the fly (after the table has been created). In fact, you can add lots of columns to a row. The exact upper limit is hard to calculate, but it's unlikely you'll hit the limit even if you add tens of thousands of columns.
Apart from characteristics derived from the Bigtable definition, Cassandra and HBase share other similarities.
Both implement similar write paths that begin with logging write operations to a log file to ensure durability. Even if the remainder of the write fails, the operation saved in the log can be replayed. The data is written next to a memory cache, then finally to disk via a large, sequential write (essentially a copy of the memory cache). The overall memory-and-disk data structure used by both Cassandra and HBase is more or less a log-structured merge tree. The disk component in Cassandra is the SSTable; in HBase it is the HFile.
Both provide command-line shells implemented in JRuby. Both are written largely in Java, which is the primary programming language for accessing either -- though client libraries are available for both in many other programming languages.
Of course, both Cassandra and HBase are open source projects managed under the Apache Software Foundation, and both are available free under an Apache version 2 license.
Clusters and contrasts
Nevertheless, in spite of all these parallels, you'll find a number of key differences.
While nodes in both Cassandra in HBase are symmetrical -- meaning that clients can connect to any node in the cluster -- the symmetry is not complete. Cassandra requires that you identify some nodes as seed nodes, which serve as concentration points for intercluster communication. Meanwhile, on HBase, you must press some nodes into serving as master nodes, whose job it is to monitor and coordinate the actions of region servers. Thus, Cassandra guarantees high availability by allowing multiple seed nodes in a cluster, while HBase guarantees the same via standby master nodes -- one of which will become the new master should the current master fail.
Cassandra uses the Gossip protocol for internode communications, and Gossip services are integrated with the Cassandra software. HBase relies on Zookeeper -- an entirely separate distributed application -- to handle corresponding tasks. While HBase ships with a Zookeeper installation, nothing stops you from using a pre-existing Zookeeper ensemble with an HBase database.
While neither Cassandra nor HBase support real transactions, both provide some level of consistency control. HBase gives you strong record-level (that is, row-level) consistency. In fact, HBase supports ACID-level semantics on a per-row basis. Also, you can lock a row in HBase, though this is not encouraged, not only because it hampers concurrency, but also because a row lock will not survive a region split operation. In addition, HBase has a "check and put" operation, which provides atomic "read-modify-write" semantics on a single data element.
Meanwhile, though Cassandra is described as having "eventual" consistency, both read and write consistency can be tuned, not only by level, but in extent. That is, you can configure not only how many replica nodes must successfully complete the operation before it is acknowledged, but also whether the participating replica nodes span data centers.
Further, Cassandra has added lightweight transactions to its repertoire. Cassandra's lightweight transaction is a "compare and set" mechanism roughly comparable to HBase's "check and put" capability; HBase also has a "read-check-delete" operation for which Cassandra has no counterpart. Finally, Cassandra's 2.0 release added row-level write isolation: If a client updates multiple columns in a row, other clients will see either none of the updates or all of the updates.
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is therefore important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition -- such as a column that stores the state field of a customer's mailing address. HBase lacks built-in support for secondary indexes, but offers a number of mechanisms that provide secondary index functionality. These are described in HBase's online reference guide and on HBase community blogs.
As stated earlier, both databases have "command line" shells for issuing data manipulation commands. Both HBase's and Cassandra's shells are built on the JRuby shell, so you can write scripts that employ all of the JRuby shell's resources to interact with specific APIs provided by the databases. In addition, Cassandra has defined CQL, modeled after SQL. CQL is far richer than the query language used by HBase, and it can be executed directly in Cassandra's shell.
In fact, Cassandra is moving toward CQL as the database's primary programming interface, though Cassandra still supports the Thrift API. (Thrift is language independent, but it's now considered a legacy API.) The Cassandra documentation lists drivers for Java, C#, and Python, all of which employ CQL version 3. Finally, a JDBC driver is also available for Cassandra. It uses CQL in place of SQL as its data definition and data management languages.
HBase's native Java API provides the richest functionality to programmers, though HBase also sports the language-agnostic Thrift interface, as well as a RESTful Web service interface. While the data manipulation commands of HBase are not as rich as CQL, HBase does have a "filter" capability that executes on the server side of a session and improves scanning (search) throughput.
HBase has also introduced "coprocessors," which allow the execution of user code in the context of the HBase processes. The result is roughly comparable to the relational database world's triggers and stored procedures. Cassandra currently has no counterpart to HBase's coprocessors.
Cassandra's documentation is noticeably better than HBase's, and good documentation certainly flattens the learning curve. In my experience, setting up a development Cassandra cluster is simpler than setting up an HBase cluster. Of course, this is only important for development and testing purposes.
The win column
The real work appears when you must tune a cluster for your particular application. Given the size of the data sets involved and the complexity of building and managing a multinode cluster (that often spans multiple data centers), tuning is hardly straightforward. It demands a solid understanding of the interplay of the cluster's memory caching, disk storage, and internode communications, and it requires careful monitoring of cluster behavior.
It's true that HBase's reliance on Zookeeper -- a separate application -- introduces an additional point of failure (and the attendant difficulties troubleshooting the source of a problem) that Cassandra avoids. But it isn't the case that tuning a Cassandra cluster is orders of magnitude less difficult. In the end, comparing the travails of cluster tuning of both databases, it's probably a wash.
Which means, as usual, there is no clear winner or loser. You'll find zealots for both databases, and each camp will present compelling evidence demonstrating the superiority of their system. And as usual, you'll face the chore of taking each for a test drive and benchmarking them against your target application. But given the scope of these technologies, could there be any other way?