Get to know Cassandra, the NoSQL maverick

Cassandra may not get the NoSQL spotlight, but it's fantastic for certain jobs -- just ask Netflix and Instagram

A few years back, as other NoSQL players such as MongoDB exploded on the scene, Apache Cassandra's star was fading. The company that developed it, Facebook, had dumped it. Its community seemed to be lagging. But Cassandra's fortunes changed when Netflix decided to move from Oracle in its own data centers to Cassandra in the Amazon cloud. Not long after, Facebook picked up Cassandra again, this time to power its Instagram acquisition. Reddit, Twitter, and WebEx all use Cassandra in some capacity.

One of Cassandra's challenges is that it's almost in a category of its own. Like HBase, it's a column-family database, but it also has fundamental differences. It's a different species altogether than NoSQL document databases such as MongoDB or Couchbase, or key-value-pair databases such as DynamoDB, Redis, or Riak.

What's a column-family database anyhow?

A column-family database stores data as row keys and sets of tuples. This looks very much like a table, and often people familiar with relational databases are led down a path of misunderstanding, especially since Google's famous column-family implementation is called BigTable. Don't be fooled: Column-family database "tables" do not act like RDBMS tables.

For one thing, they are essentially schema-less: you can add another column at will. Unlike RDBMS tables, which are optimized for storage and need to know a reasonable size for each row, column families are optimized for distribution. The rowkey serves as more than a convenient lookup; it also lets the system shard the data effectively around the cluster. Below is the structure of a standard column family expressed in JSON-like notation (the names are arbitrary).

ColumnFamilyName {
  rowkey1 = {column1:"value1", column2:"value2"},
  rowkey2 = {column1:"value2.1", column2:"value2.2", column3:"value2.3"},
  rowkey3 = {column2:"value3.2", column3:"value3.3"},
  rowkey4 = {column1:"value4.1", column3:"value4.3"}
}
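To make the rowkey-based sharding concrete, here's a toy Python sketch of the idea. It is purely illustrative -- Cassandra itself uses a partitioner that maps keys onto a token ring, not this naive modulo scheme -- but it shows how hashing the rowkey deterministically assigns each row to a node, with no central index needed for lookups:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical three-node cluster

def node_for(rowkey: str) -> str:
    # Hash the rowkey and map it onto a node -- a toy stand-in for
    # Cassandra's partitioner, which maps keys onto a token ring.
    digest = int(hashlib.md5(rowkey.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# Different rowkeys spread across the cluster; the same rowkey always
# lands on the same node, so any client can compute where a row lives.
placement = {rk: node_for(rk) for rk in ["rowkey1", "rowkey2", "rowkey3", "rowkey4"]}
print(placement)
```

Because placement is a pure function of the key, adding capacity is a matter of rebalancing key ranges rather than rebuilding an index.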


To compound matters, this is a simple or "standard" column family, which is fairly limited on its own. Many column-family databases, including Cassandra, add the concepts of a supercolumn and a supercolumn family. Structurally, it's a bit like a turducken inside a turducken. The example below shows a supercolumn family containing column families.

SuperColumnFamily {
  ColumnFamily1 {
    cf1rowkey1 = {column1:"value1", column2:"value2"},
    cf1rowkey2 = {column1:"value2.1", column2:"value2.2", column3:"value2.3"},
    cf1rowkey3 = {column2:"value3.2", column3:"value3.3"},
    cf1rowkey4 = {column1:"value4.1", column3:"value4.3"}
  },
  ColumnFamily2 {
    cf2rowkey1 = {column1:"value1", column2:"value2"},
    cf2rowkey2 = {column1:"value2.1", column2:"value2.2", column3:"value2.3"},
    cf2rowkey3 = {column2:"value3.2", column3:"value3.3"},
    cf2rowkey4 = {column1:"value4.1", column3:"value4.3"}
  }
}


Moreover, you'll find other structures, such as composite columns, and other distinctions, such as static versus dynamic column families. There are also special column types like counters and timestamps.

When should you use Cassandra?

Cassandra has turned out to be especially useful for a well-defined set of applications. You'll hear about these most often:

  • Time series data comes up a lot when discussing column-family databases. It can include anything from temperature sensor readings to schedules, stock prices, signal-processing telemetry, epidemiology records, or blogs. This is exactly the sort of data where document databases such as MongoDB tend to fall short.
  • Product catalogs are sometimes powered by Cassandra, although other databases play in this space too. You might also use a document database or even a straight search engine such as Elasticsearch or Solr, both based on Lucene (arguably sort of a document database itself). But people pick Cassandra for a reason, chief among them its reliable architecture. If you can trade away immediate atomic consistency -- less of a concern with product catalog data -- you gain a higher level of system reliability, meaning any node can be contacted and can get to the data. This is why high-scale media services such as Netflix, Hulu, and Sky in Europe use Cassandra for catalogs.
  • Recommendations are another area, though once again other technologies are frequently in the mix. For example, Mahout (the machine-learning project that runs atop Hadoop) can be used with Cassandra. One reason to use a column-family database for recommendations instead of a graph database is scale: with a graph database, you have to do a lot of work -- generally manual sharding or partitioning -- to run at large scale. What a recommendation engine often needs is not complex data or complex relationships, but very simple row-based data at high volume. Cassandra delivers.
  • Fraud and spam detection is somewhat related to recommendations and often involves time series data. Again, it may also demand a machine-learning tool like Mahout. Spammers and fraudsters are often more motivated -- and faster -- than you, so you'd better have a system that adapts quickly and can handle ever-increasing amounts of data. For a high-scale service like eBay or Eventbrite, you don't need superconsistent, atomic reads and writes. What you need is a whole lot of them.
  • Back-end storage for messaging, especially across data centers (so-called WAN replication), is a big deal. Messaging isn't a straightforward column-family use case on its own, but Cassandra's caching support changes the equation: column family plus cache plus WAN replication is a powerful combination. It isn't right for every messaging scenario, but if you need to persist and read messages across multiple data centers at massive scale in an active-active configuration, as the New York Times or Comcast does, you might want to give Cassandra a spin.
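To illustrate the time series pattern from the first bullet above: a common column-family idiom is one wide row per source per time bucket, with one column per reading. Here's a minimal Python sketch of that layout using plain dicts (the names and the day-sized buckets are illustrative choices, not anything Cassandra mandates):

```python
from collections import defaultdict

# Wide-row time-series layout: rowkey = "<sensor>:<day>", and each
# column within the row is keyed by the reading's timestamp. Keeping
# columns sorted by timestamp lets a day's readings be sliced in order.
store = defaultdict(dict)

def record(sensor: str, timestamp: str, value: float) -> None:
    day = timestamp[:10]  # bucket rows by day so no single row grows forever
    store[f"{sensor}:{day}"][timestamp] = value

def readings_for_day(sensor: str, day: str) -> list:
    row = store.get(f"{sensor}:{day}", {})
    return [row[ts] for ts in sorted(row)]  # a column slice, in time order

record("temp-01", "2013-06-01T09:00:00", 21.5)
record("temp-01", "2013-06-01T09:05:00", 21.7)
record("temp-01", "2013-06-02T09:00:00", 20.9)
print(readings_for_day("temp-01", "2013-06-01"))  # [21.5, 21.7]
```

The win is that each day's readings for a sensor live under one rowkey, so a range query touches a single, sequentially stored row instead of scanning a table.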

What about HBase?

HBase isn't a Cassandra replacement and Cassandra isn't merely better than HBase. They each have their strengths and weaknesses. If you already run a Hadoop-oriented shop and have extensive Hadoop expertise and infrastructure, HBase may be a more natural fit.

While Cassandra lives in the Hadoop ecosystem, it rolls its own in a lot of places, and many of those choices work to your advantage. HBase's modularity can also be read as complexity, with configuration side effects: a lot of knobs must be turned at various layers that aren't always aware of one another. Cassandra is more monolithic, which in some ways means it was thought out as one consistent design.

The crux of it will come down to your read-to-write ratio. Cassandra was designed for high workloads of both writes and reads where millisecond consistency isn't as important as throughput. HBase is optimized for reads and greater write consistency. To a large degree, Cassandra tends to be used for operational systems and HBase more for data warehouse and batch-system-type use cases. There are crossovers, exceptions, and places where it doesn't matter -- or where it's simply a matter of default configuration being more conservative in one than the other.

Speaking of transactions ...

Developers often get confused about when atomic consistency is really needed. Also, if you start with an RDBMS, you'll be confused because RDBMSes require more operations across more places to get the same amount of work done. That is, it isn't natural that a "person" be broken into multiple tables just because they have a variable number of phone numbers and addresses -- which is exactly what a good RDBMS schema requires. Most other database types (document, column family, you name it) would be able to handle variable amounts of phone numbers or whatever in one entity (document, table, rowkey, and so on).

Generally, any database can make an operation to a single entity consistent. Thus, many operations that require a "transaction" in an RDBMS are naturally atomic in other databases.
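The "person with several phone numbers" case makes this concrete: normalized into multiple tables, adding a phone number means multiple writes that must be wrapped in a transaction; kept as one entity, the same change is a single, naturally atomic write. A hypothetical sketch of the contrast (plain dicts standing in for the two storage models):

```python
# Normalized, RDBMS-style: a "person" spans two tables, so adding a
# phone number means writing to a second place -- and keeping the two
# tables consistent requires a transaction.
people = {1: {"name": "Ada"}}
phones = [(1, "555-0101"), (1, "555-0102")]
phones.append((1, "555-0103"))  # separate write; must coordinate with `people`

# Single-entity style (document or column family): the variable-length
# phone list lives inside the one entity, so updating it is one write
# to one rowkey -- atomic without any transaction machinery.
person_row = {"name": "Ada", "phones": ["555-0101", "555-0102"]}
person_row["phones"] = person_row["phones"] + ["555-0103"]  # one atomic write

print(len(person_row["phones"]))  # 3
```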

In any modern system, you make compromises about consistency no matter which database you use. No RDBMS really offers long-running transactions lightweight enough for "Internet scale" or "Web scale" systems. Few people run at the serializable isolation level, open a connection per user, and hold transactions open across user think time -- which is what the most atomic consistency would require. Moreover, any multithreaded batch or analytical system must make some compromise on consistency.

In many cases, a millisecond (or 500 of them) doesn't make much difference. If I change a couple of rowkeys, Cassandra will eventually make them consistent, but in the meantime I could get a phantom read. How problematic is that for, say, Netflix? It's unlikely to happen at all. Even if it does, you're unlikely to notice. Even if you notice, you're unlikely to complain about a momentary glitch where a show appears to have no episodes, then you reload the page and suddenly it does. You're far more likely to notice if every catalog change brings the system to a screeching halt. Down-to-the-millisecond consistency matters less than scale and performance, given the nature of the data and its use case.
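That read-after-write window can be sketched in a few lines. This is a deliberately crude model -- two in-memory dicts standing in for replicas, with replication deferred until we explicitly trigger it, whereas Cassandra propagates writes in the background -- but it shows where the phantom read comes from:

```python
replica_a, replica_b = {}, {}

def write(key, value):
    # The write lands on one replica immediately; the other catches up
    # later. We return the deferred replication step so the "lag" is
    # explicit in this toy model.
    replica_a[key] = value
    return lambda: replica_b.update({key: value})

sync = write("show:episodes", 10)
stale = replica_b.get("show:episodes")  # phantom read: replica_b hasn't caught up
sync()                                  # replication catches up
fresh = replica_b.get("show:episodes")  # now both replicas agree
print(stale, fresh)  # None 10
```

A client that happened to read from the lagging replica in that window sees the missing episodes; a retry a moment later sees the data, which is exactly the "reload the page" experience described above.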

Choose your weapon

Why make such a trade-off when you could have perfection? Quite simply: because you have to. Any guarantee of consistency requires some kind of lock or reduction in concurrency. That cost might come out of threads, out of how widely the data can be distributed across machines or disks, or out of whether it can be replicated across a WAN. Forget the CAP theorem; consistency always trades off against concurrency.

Cassandra is a great tool for data sets where scale is more important than immediate consistency and where you have a great deal of both reads and writes. While it has not received as much attention as other NoSQL databases and slipped into a quiet period a couple of years back, Cassandra is widely used and deployed, and it's a great fit for time series, product catalogs, recommendations, and similar applications. If you have those sorts of problems at scale, Cassandra should do the trick.