NoSQL showdown: MongoDB vs. Couchbase

Which NoSQL database has richer querying, indexing, and ease of use?


Couchbase Server is available in both Enterprise and Community editions. The Enterprise edition undergoes more thorough testing than the Community edition, and it receives the latest bug fixes. Also, hot fixes are available, as is 24/7 support (with the purchase of an annual subscription). Nevertheless, the Enterprise edition is free for testing and development on any number of nodes or for production use on up to two nodes. The Community edition, as you might guess, is free for any number of production nodes.

Pros and cons: Couchbase Server 2.0
 
Pros
  • Provides legacy Memcached capabilities
  • Supports spatial data and views
  • Now a true document database
Cons
  • Indexing mechanisms not well developed
  • JSON support is relatively immature
  • Does not support range sharding

MongoDB

MongoDB is about three years old, first released in late 2009. The goal behind MongoDB was to create a NoSQL database that offered high performance without casting aside the good aspects of working with RDBMSes. For instance, the way queries are designed and optimized in MongoDB is similar to how it would be done in an RDBMS. MongoDB's designers also wanted to make the database easier for application developers to work with -- for example, by allowing developers to change the data model quickly. MongoDB, whose name is short for "humongous," stores documents in BSON (Binary JSON), an extension of JSON that allows for the use of data types such as integers, dates, and byte arrays.

Two primary processes are at work in a MongoDB system, mongod and mongos. The mongod process is the real workhorse. In a sharded MongoDB cluster, mongod can be found playing one of two roles: config server or shard server. The config server tracks the cluster's metadata. (In a sharded MongoDB cluster, there must be at least three config servers for redundancy's sake.) Each config server knows which server in the cluster is responsible for a given document or, more precisely, where a given contiguous range of shard keys (called a chunk) belongs in the cluster.

Other mongod processes in the cluster run as shard servers, and these handle the actual reading and writing of the data. For failover purposes, the mongod processes serving a given shard run as a replica set: one process is the primary, and the others are secondaries. All write requests go to the primary, while read requests can go to either the primary or a secondary.

Secondaries are updated asynchronously from the primary so that they can take over in the event of a primary's crash. This, however, means that some read requests (sent to secondaries) may not be consistent with write requests (sent to primaries). This is an instance of MongoDB's "eventual consistency." Over time, all secondaries will become consistent with write operations on the primary. Note that you can guarantee consistent read/write behavior by configuring a MongoDB system such that all I/O -- reads and writes -- go to the primary instances. In such an arrangement, secondaries act as standby servers, coming online only when the primary fails.

The mongos process, which runs at a conceptually higher level than the mongod processes, is best thought of as a kind of routing service. Database requests from clients arrive first at a mongos process, which determines which shard(s) in a sharded cluster can service each request. The mongos process dispatches I/O requests to the appropriate mongod processes, gathers the results, and returns them to the client. Note that in a nonsharded cluster, clients talk directly to a mongod process.

MongoDB scaling and replication
MongoDB doesn't have an explicit memory caching layer. Instead, all MongoDB operations are performed through memory-mapped files. Consequently, MongoDB hands off the chore of juggling memory caching versus persistence-to-disk to the operating system. You can tweak various flush-to-disk settings for optimal performance, however. For example, MongoDB maintains a journal of database operations (for recovery purposes) that is flushed to the disk every 100ms. Not only is this interval configurable, but you can configure the system so that write operations return only after the journal has been written to disk.
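
As a minimal sketch of the latter behavior in the mongo shell (the orders collection here is hypothetical, and this assumes the 2.4-era getLastError semantics), a write can be acknowledged only after the journal entry has reached disk:

    // Insert a document, then ask getLastError to wait for the journal flush
    db.orders.insert({ item: "widget", qty: 5 })
    db.runCommand({ getLastError: 1, j: true })   // returns only after the journal is on disk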

Documents are placed in named containers called collections, which are roughly equivalent to Couchbase's buckets. A collection serves as a means of partitioning related documents into separate groups. The effects of many multidocument operations in a MongoDB database are restricted to the collection in which those operations are performed. MongoDB supports sharding at the collection level, which means -- should requirements dictate -- you could construct a database with unsharded and sharded collections. Of course, only a sharded collection is protected against a single point of failure.

A document's membership in a particular shard is determined by a shard key, which is derived from one or more fields in each document. The exact fields can be specified by the database administrator. In addition, MongoDB provides autosharding, which means that, once you've configured sharding, MongoDB will automatically manage the storage of documents in the appropriate physical location. This includes rebalancing shards as the number of documents grows or the number of mongod instances changes.

As of the 2.4 release, MongoDB supports both hash-based sharding and range-based sharding. As you might guess, hash-based sharding hashes the shard key, which creates a relatively even distribution of documents across the cluster. With range-based sharding (the sole sharding type prior to 2.4), a given member of a MongoDB sharded cluster will store all the documents within a given subrange of the shard key's overall domain. More precisely, MongoDB defines a logical container, called a chunk, which is a subset of documents whose shard keys fall within a specific range. The mongos process then dictates which mongod process will manage a given chunk.
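
As a sketch (the database and collection names are hypothetical), enabling both styles from a mongo shell connected to a mongos might look like this:

    sh.enableSharding("salesdb")                              // allow collections in salesdb to be sharded

    // Range-based sharding on a compound shard key
    sh.shardCollection("salesdb.orders", { state: 1, customerId: 1 })

    // Hash-based sharding (new in 2.4) for an even spread of documents
    sh.shardCollection("salesdb.events", { _id: "hashed" })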

Typically, you permit the load balancer to determine which cluster member manages a given shard range. However, with version 2.4, you can associate tags with shard ranges (a tag being nothing more than an identifying string). Once that's done, you can specify which member of a cluster will manage any shard ranges associated with a tag. In a sense, this lets you override some of the load balancer's decision making and steer identifiable subsets of the database to specific servers. For example, you could put the data most frequently accessed from California on the cluster member in California, the data most frequently accessed from Texas on the cluster member in Texas, and so on.
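
A hedged sketch of that tagging, reusing the hypothetical salesdb.orders collection sharded above (the shard names are whatever sh.status() reports for your cluster):

    // Tag two shards by region
    sh.addShardTag("shard0000", "CA")
    sh.addShardTag("shard0001", "TX")

    // Pin each state's range of shard keys to the tagged shard
    sh.addTagRange("salesdb.orders", { state: "CA", customerId: MinKey },
                                     { state: "CA", customerId: MaxKey }, "CA")
    sh.addTagRange("salesdb.orders", { state: "TX", customerId: MinKey },
                                     { state: "TX", customerId: MaxKey }, "TX")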

MongoDB's locking is at the database level, whereas it was global prior to version 2.2. The system implements shared-read, exclusive-write locking (many concurrent readers, but only one writer), with priority given to waiting writers over waiting readers. MongoDB avoids contention via yield operations within locks. Predictive yielding was added in the 2.2 release: if a process requests a document that is not in memory, it yields its lock so that other processes -- whose documents are in memory -- can be serviced. Long-running operations will also periodically yield their locks.

You'll find no clear notion of transactions in MongoDB. Certainly, you cannot perform pure ACID transactions on a MongoDB installation. Database changes can be made durable if you enable journaling, in which case write operations are blocked until the journal entry is persisted to disk (as described earlier). And MongoDB defines the $atomic isolation operator, which imposes what amounts to an exclusive-write lock on the document involved. However, $atomic is applied at the document level only. You cannot guard multiple updates across documents or collections.
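
For illustration only (the accounts collection and account number are hypothetical), the operator goes in the query document of an update:

    // $atomic asks the server not to yield its write lock while this
    // single-document update runs. It does not provide a transaction
    // across documents or collections.
    db.accounts.update(
        { accountNumber: "12345678", $atomic: true },
        { $inc: { balance: -100 } }
    )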

MongoDB indexing and queries
MongoDB makes it easy to create secondary indexes for all document fields. A primary index always exists on the document ID. As with Couchbase Server, this is automatically generated for each document. However, with MongoDB, you can specify a separate field as being the document's unique identifier. For example, a database of bank accounts might use the bank's generated account number as the document ID field. Indexes exist at the collection level, and they can be compound -- that is, created on multiple fields. MongoDB can also handle multikey indexes. If you index a field that includes an array, MongoDB will index each value in that array. Finally, MongoDB supports geospatial indexes.
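
A few index definitions in the 2.4-era shell syntax (the collection and field names are hypothetical):

    db.accounts.insert({ _id: "ACCT-12345678", lastName: "Smith" })   // supply your own document ID
    db.accounts.ensureIndex({ accountNumber: 1 }, { unique: true })   // secondary index, unique
    db.accounts.ensureIndex({ lastName: 1, firstName: 1 })            // compound index
    db.accounts.ensureIndex({ tags: 1 })                              // multikey: tags is an array field
    db.branches.ensureIndex({ location: "2dsphere" })                 // geospatial index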

MongoDB's querying capabilities are well developed. If you're coming to MongoDB from the RDBMS world, the online documentation shows how SQL queries map to MongoDB operations. For example, in most cases, the equivalent of SQL's SELECT can be performed by the find() function. The find() function takes two arguments: a query document and a projection document. The query document specifies filter conditions on document fields, determining which documents are fetched -- you could use it to request that only documents with a quantity field greater than, say, 100 be returned. The query document therefore corresponds to the WHERE clause in an SQL statement. The projection document identifies which fields are to be returned in the results, allowing you to request that, say, only the name and address fields of matching documents be returned. The sort() function, which can be executed on the results of find(), corresponds to SQL's ORDER BY clause.
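
As an illustration (using a hypothetical inventory collection), the following is roughly SELECT name, address FROM inventory WHERE qty > 100 ORDER BY name:

    db.inventory.find(
        { qty: { $gt: 100 } },            // query document: the WHERE clause
        { name: 1, address: 1, _id: 0 }   // projection document: the columns to return
    ).sort({ name: 1 })                   // ORDER BY name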

You can locate documents with the command db.<collection>.find(), possibly the simplest query you can perform. The find() command will return the first 20 members of the result, but it also provides a cursor, which allows you to iterate through all the documents in the collection. If you'd like to navigate the results more directly, you can reference the elements of the cursor as though it were an array.
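
A brief sketch of both styles, again using the hypothetical inventory collection:

    var cursor = db.inventory.find()                    // the shell prints the first batch of results
    cursor.forEach(function (doc) { printjson(doc) })   // or iterate the cursor yourself
    var fifth = db.inventory.find()[4]                  // array-style access to a result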

More complex queries are possible thanks to MongoDB's set of prefix operators, which can describe comparisons as well as boolean connections. MongoDB also provides the $regex operator in case you want to apply regular expressions to document fields in the result set. These prefix operators can be used in the update() command to construct the MongoDB equivalent of SQL's UPDATE ... WHERE statement.
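
A sketch of these operators at work on a hypothetical products collection:

    // Find products that are low in stock or whose names start with "wid"
    db.products.find({ $or: [ { qty: { $lt: 10 } },
                              { name: { $regex: /^wid/i } } ] })

    // The MongoDB equivalent of UPDATE products SET status = 'reorder' WHERE qty < 10
    db.products.update(
        { qty: { $lt: 10 } },
        { $set: { status: "reorder" } },
        { multi: true }                   // apply to every matching document
    )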

In the 2.2 release, MongoDB added the aggregation framework, which allows for calculating aggregated values without having to resort to mapreduce (which can be overkill if all you want to do is calculate a field's total or average). The aggregation framework provides functionality similar to SQL's SUM and AVG functions. It can also calculate computed fields and mimic the GROUP BY operator. Note that the aggregation framework is declarative -- it does not employ JavaScript. You define a chain of operations, much in the same way you might perform Unix/Linux shell programming, and these operations are performed on the target documents in stream fashion.
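
A small pipeline over a hypothetical orders collection, roughly SELECT custId, SUM(amount) FROM orders WHERE status = 'complete' GROUP BY custId:

    db.orders.aggregate([
        { $match: { status: "complete" } },                          // filter first
        { $group: { _id: "$custId", total: { $sum: "$amount" } } }   // then group and total
    ])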

One of the more significant new features in MongoDB's 2.4 release is the arrival of text search. In the past, developers accomplished this by integrating Apache Lucene with MongoDB, which piled on considerable complexity. Adding Lucene in a clustered system with replication and fault tolerance is not an easy thing to do. MongoDB users now get text search for free. The new text search feature is not meant to match Lucene, but to provide basic capabilities such as more efficient Boolean queries ("dog and cat but not bird"), stemming (search for "reading" and you'll also get "read"), and the automatic culling of stop words (for example, "and", "the", "of") from the index.

You can define a text index on multiple string fields, but there can be only a single text index per collection, and indexes do not store word proximity information (that is, how close words are to one another, which can affect how matches are weighted). In addition, the text index is fully consistent: when you update data, the index is also updated.
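
As a sketch of the 2.4-era mechanics (the articles collection is hypothetical, and text search must first be switched on via the textSearchEnabled server parameter):

    db.adminCommand({ setParameter: 1, textSearchEnabled: true })   // enable the feature

    db.articles.ensureIndex({ title: "text", body: "text" })        // one text index per collection

    // Search terms; the leading minus excludes documents that mention "bird"
    db.articles.runCommand("text", { search: "dog cat -bird" })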

Ease-of-use features have been added to version 2.4 as well. For example, you can now define a "capped array" as a data element, which works sort of like an ordered circular buffer. If, for example, you're keeping track of the top 10 entries in a blog, using a capped array will allow you to add new entries, and (based on the specified ordering) previous entries will be removed to cap the array at 10 or whatever number you specify.
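
A sketch with a hypothetical blogs collection, keeping only the 10 highest-scoring entries (2.4 requires $each alongside $slice, and the slice value is negative):

    db.blogs.update(
        { _id: "my-blog" },
        { $push: { top10: {
            $each:  [ { title: "New post", score: 42 } ],   // the entry being added
            $sort:  { score: 1 },                           // order the array by score
            $slice: -10                                     // then keep only the last (highest) 10
        } } }
    )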

MongoDB 2.4 also has improved geospatial capabilities. For example, you can now perform polygon operations, which would allow you to determine if two regions overlap. The spherical model used in 2.4 is improved too; it now takes into account the fact that the earth is not perfectly spherical, so distance calculations are more accurate.
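
A sketch using a hypothetical regions collection with GeoJSON boundaries:

    db.regions.ensureIndex({ boundary: "2dsphere" })

    // Find stored regions that overlap the query polygon (coordinates are made up)
    db.regions.find({
        boundary: { $geoIntersects: { $geometry: {
            type: "Polygon",
            coordinates: [[ [-97.8, 30.2], [-97.6, 30.2], [-97.6, 30.4],
                            [-97.8, 30.4], [-97.8, 30.2] ]]
        } } }
    })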

MongoDB mapreduce
In Couchbase Server, the mapreduce operation's primary job is to provide a structured query and information aggregation capability on the documents in the database. In MongoDB, mapreduce can be used not only for querying and aggregating results, but as a general-purpose data processing tool. Just as a mapreduce operation executes within a given bucket in Couchbase Server, mapreduce executes within a given collection in a MongoDB database. As in Couchbase Server, mapreduce functions in MongoDB are written in JavaScript.

You can filter the documents passed into the map function via comparison operators, or you can limit the number of documents to a specific number. This allows you to create what amounts to an incremental mapreduce operation. Initially, you run mapreduce over the entire collection. For subsequent executions, you add a query function that includes only newly added documents. From there, set the output of mapreduce to be a separate collection, and configure your code so that the new results are merged into the existing results.
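
A sketch of that pattern over a hypothetical orders collection (lastRun stands in for the timestamp recorded by the previous run):

    var lastRun = ISODate("2013-04-01T00:00:00Z")

    var mapFn = function () { emit(this.custId, this.amount); };
    var reduceFn = function (key, values) {
        return values.reduce(function (a, b) { return a + b; }, 0);   // sum per customer
    };

    db.orders.mapReduce(mapFn, reduceFn, {
        query: { created: { $gt: lastRun } },   // only documents added since the last run
        out:   { merge: "order_totals" }        // merge new results into the existing output
    })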

Further speed/size trade-offs are possible by choosing whether the intermediate results (the output of the map function, sent to the reduce function) are converted to BSON objects or remain JavaScript objects. If you choose BSON, the BSON objects are placed in temporary, on-disk storage, so the number of items you can process is limited only by available disk space. However, if you choose JavaScript objects, then the system can handle only about 500,000 distinct keys emitted by the map function. But as there is no writing to disk, the processing is faster.

You have to be careful with long-running mapreduce operations, because their execution involves lengthy locks. As mentioned earlier, the system has built-in facilities to mitigate this. For example, the read lock on the input collection is yielded every 100 documents. The MongoDB documentation describes the various locks that must be considered -- as well as mechanisms to relieve the possible problems.

Managing MongoDB
Management access to a MongoDB database goes through the interactive mongo shell. Very much a command-line interface that lets you enter arbitrary JavaScript, it is nonetheless surprisingly facile. The MongoDB-related commands are uncomplicated, but at the price of being dangerous if you're careless. For example, to select a database, you enter use <databasename>. But that command doesn't check for the presence of the specified database; if you mistype the name and proceed to enter documents into that database, you might not know what's going on until you've put a whole lot of documents into the wrong place. The same goes for collections within databases.

Other useful command-line utilities include mongostat, which returns information on the number of operations -- inserts, updates, deletes, and so on -- performed within a specific time period. The mongotop utility likewise returns statistical information on a MongoDB instance, this time focusing on a specific collection. You can see the amount of time spent reading or writing in the collection, for instance.

In addition, 10gen offers the free cloud-based MongoDB Monitoring Service (MMS), which provides a monitoring dashboard for MongoDB installations. Built on the SaaS model, MMS requires you to run a small agent on your MongoDB cluster that communicates with the management system.

10gen's MongoDB Monitoring Service shows statistics -- in this case, for a replica set -- but management of the database is done from the command line.

In addition to the new text search and geospatial capabilities discussed above, MongoDB 2.4 comes with performance and security improvements. The performance enhancements include the working set analyzer. The idea is that you want to configure your system so that the working set -- the subset of the database accessed most frequently -- fits entirely in memory. Until now, though, it wasn't easy to figure out your working set or how much memory you need. The working set analyzer, which operates like a helper function, provides diagnostic output to help you discover the characteristics of your working set and tune your system accordingly. In addition, the JavaScript engine has been replaced by Google's open source V8 engine. In the past, the JavaScript engine was single-threaded; V8 permits concurrent mapreduce jobs and brings general speed improvements.
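
The analyzer's output is exposed through serverStatus; a minimal sketch, assuming the 2.4 option:

    db.serverStatus({ workingSet: 1 }).workingSet   // pages touched recently and over what interval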

Finally, the Enterprise edition adds Kerberos-based authentication. In all editions, MongoDB now supports role-based privileges, which give you finer-grained control over users' access and operations on databases and collections.
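
A sketch of the 2.4-style syntax (the user and database names are hypothetical):

    use salesdb
    db.addUser({ user: "appUser", pwd: "s3cret", roles: [ "readWrite" ] })   // read/write on salesdb only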

10gen's release of MongoDB 2.4 is accompanied by new subscription levels: Community, Basic, Standard, and Enterprise. The Community subscription level is free, but it's also free of any support. The other subscription levels provide varying support response times and hours of availability. In addition, the Enterprise subscription level comes with the Enterprise version of MongoDB, which has more security features and SNMP support. It has also undergone more rigorous testing.

Pros and cons: MongoDB 2.4
 
Pros
  • New release incorporates text search
  • New release adds improved JavaScript engine
  • Free MongoDB training courseware available from 10gen
Cons
  • Text index doesn't store proximity information
  • No GUI-based management console
  • Kerberos authentication available in Enterprise edition only

 

Mongo or Couch?
As usual, which product is the best choice depends heavily on the target application. Both are highly regarded NoSQL databases with outstanding pedigrees. On the one hand, MongoDB has spent much more of its lifetime as a document database, and its support for document-level querying and indexing is richer than that in Couchbase. On the other hand, Couchbase can serve equally well as a document database, a Memcached replacement, or both.

Happily, exploring either Couchbase or MongoDB is remarkably simple. A single-node system for either database server is easily installed. And if you want to experiment with a sharded system (and have enough memory and processor horsepower), you can easily set up a gang of virtual machines on a single system and lash them together via a virtual network switch. The documentation for both systems is voluminous and well maintained. 10gen even provides free online MongoDB classes for developers, complete with video lectures, quizzes, and homework.

This article, "NoSQL showdown: MongoDB vs. Couchbase," was originally published at InfoWorld.com. Follow the latest developments in application development, data management, cloud computing, and open source at InfoWorld.com. For the latest business technology news, follow InfoWorld.com on Twitter.

