Do Twitter analysis the easy way with MongoDB


For simple Twitter analysis, try MongoDB's aggregation framework

With all the World Cup excitement, I found myself wondering what the Twitter-scape looked like. Who is tweeting? What are they tweeting about? Where are they? What language are they tweeting in?

Obviously, such questions can apply to any tweet-worthy event. Along with the idly curious like me, various types of businesses from tech startups to local restaurants might want to know: What's my most vocal demographic? What time of day are people tweeting about my service most often? How do people feel about the new website?

Collecting all this data and analyzing it might seem like a big investment, but with the right tools it becomes trivially easy. In this article, I show you how to analyze tweet data using MongoDB as both the data store and the analytics engine. MongoDB has powerful analytics tools and straightforward pluggability into Hadoop for when you have a question that needs a more generic tool. I'm using tweets about the World Cup to demonstrate, but the concepts are generic and can be easily applied to your own data set.

The aggregation framework

MongoDB has long had an implementation of MapReduce, but it's all done in JavaScript and not particularly fast. All the processing happens in V8 (Google's JavaScript engine), which is a black box to MongoDB, so it can't do any optimization on your MapReduce code.

The alternative is to use the MongoDB aggregation framework, which is a data-processing tool based on the concept of pipelines. You're given a handful of simple building blocks that expect one or more documents as input, perform an operation on that input, and output one or more documents. They can then be chained together to form arbitrarily complex queries. Because it's a part of the database and not running in a third-party virtual machine (which is what's happening with V8), it's much faster than the MongoDB MapReduce implementation.

All the pipeline operators (the individual commands that get strung together) are well documented, so I won't go into all of the inner workings here. I will, however, give a quick description of the ones I'm using:

  • $match: Filters incoming documents using the full MongoDB query syntax.
  • $sort: Sorts incoming documents based on a given field.
  • $limit: Limits the output to a specified number of documents.
  • $project: Reshapes documents -- add fields, remove fields, add subdocuments, and so on.
  • $group: Aggregates documents on a given field (or fields). It also lets you take counts, sums, averages, and more of fields in the documents being grouped together.
  • $unwind: A handy tool with no direct SQL equivalent. It operates on an array field of the input documents, outputting one copy of the document for each element in the array, with the array field replaced by that single element.

An example will clarify the $unwind operator, starting with the input.


{
  _id: 'myDoc1',
  tags: ['pie', 'chocolate']
}

The operator would then be:


{$unwind: '$tags'}

Finally, the output:


{
  _id: 'myDoc1',
  tags: 'pie'
},
{
  _id: 'myDoc1',
  tags: 'chocolate'
}
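The same semantics can be sketched in a few lines of plain JavaScript. This is an illustration of what $unwind does, not MongoDB's actual implementation:

```javascript
// Illustrative sketch of $unwind semantics: emit one copy of the document
// per array element, with the array field replaced by that single element.
function unwind(docs, field) {
  const out = [];
  for (const doc of docs) {
    for (const value of doc[field]) {
      out.push({ ...doc, [field]: value });
    }
  }
  return out;
}

// The example from the text:
const input = [{ _id: 'myDoc1', tags: ['pie', 'chocolate'] }];
console.log(unwind(input, 'tags'));
// [ { _id: 'myDoc1', tags: 'pie' }, { _id: 'myDoc1', tags: 'chocolate' } ]
```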

Asking the questions

To start, I put together an application in Node.js from open source libraries that pulls tweets from the streaming Twitter API and loads them into MongoDB. Looking for tweets containing "worldcup" and "fifa" (two of the most used hashtags for the subject), I pulled in about a million tweets over the course of five games. Each tweet arrives from the API as a rich JSON document with dozens of fields.
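The ingestion app itself isn't shown here, but a hypothetical shaping step might look like the following: keep only the tweet fields the later queries rely on. The field names (text, lang, entities.hashtags, created_at) come from the Twitter API; the helper itself is illustrative, not taken from the original app:

```javascript
// Hypothetical helper: trim a raw API tweet down to the fields the
// aggregation queries below will use. Not from the original app.
function toTweetDoc(rawTweet) {
  return {
    text: rawTweet.text,
    lang: rawTweet.lang,
    entities: { hashtags: rawTweet.entities.hashtags },
    created_at: rawTweet.created_at,
  };
}
```

With the official Node.js MongoDB driver, each shaped document could then be stored with something like `db.collection('allTweets').insertOne(toTweetDoc(tweet))`.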

Now that you have a handle on the tool and a data set, it's time to formulate your questions in the query language.

To find out how many tweets were created in each language, you want to group the tweets by language and count how many tweets are in each group. This can be done in a single step with $group. (On each tweet document, there is a 'lang' field that contains an abbreviated language string.)


db.allTweets.aggregate([
  { $group: { 
     _id: '$lang',
     count: {$sum: 1}
  }},
]);

The result:


{ "_id" : "am", "count" : 2 }
{ "_id" : "zh", "count" : 370 }
{ "_id" : "vi", "count" : 667 }
{ "_id" : "ur", "count" : 33 }
{ "_id" : "hy", "count" : 4 }
{ "_id" : "bg", "count" : 112 }
{ "_id" : "el", "count" : 210 }
{ "_id" : "gu", "count" : 4 }
{ "_id" : "ne", "count" : 63 }
...
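What this $group stage computes can be sketched in plain JavaScript (illustrative, not MongoDB internals): one output document per distinct lang value, with a running count incremented by 1 per input document, which is exactly the {$sum: 1} idiom:

```javascript
// Illustrative sketch of {$group: {_id: '$lang', count: {$sum: 1}}}:
// accumulate a count per distinct lang value.
function countByLang(tweets) {
  const groups = {};
  for (const t of tweets) {
    groups[t.lang] = (groups[t.lang] || 0) + 1;
  }
  return Object.entries(groups).map(([lang, count]) => ({ _id: lang, count }));
}

console.log(countByLang([{ lang: 'en' }, { lang: 'es' }, { lang: 'en' }]));
// [ { _id: 'en', count: 2 }, { _id: 'es', count: 1 } ]
```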

This almost gives us what we want, but it's not in order, so let's add a step in the pipeline to sort on the count field in descending order:


db.allTweets.aggregate([
  { $group: {
    _id: '$lang',
    count: {$sum: 1}
  }},

  { $sort: {
    count: -1
  }}
]);

The result:


{ "_id" : "en", "count" : 516745 }
{ "_id" : "es", "count" : 262056 }
{ "_id" : "pt", "count" : 55117 }
{ "_id" : "ar", "count" : 36122 }
{ "_id" : "fr", "count" : 30003 }
{ "_id" : "ja", "count" : 24851 }
{ "_id" : "in", "count" : 17930 }
{ "_id" : "it", "count" : 15876 }
{ "_id" : "ro", "count" : 9878 }
...

What if you want to see the top five languages used to tweet about World Cup games? Add another step to the pipeline:

The query:


db.allTweets.aggregate([
  { $group: {
    _id: '$lang',
    count: {$sum: 1}
  }},

  { $sort: {
    count: -1
  }},

  { $limit: 5 }
]);

The result:


{ "_id" : "en", "count" : 516745 } 
{ "_id" : "es", "count" : 262056 } 
{ "_id" : "pt", "count" : 55117 } 
{ "_id" : "ar", "count" : 36122 } 
{ "_id" : "fr", "count" : 30003 }

Notice that I'm $sum-ing 1, which adds one to the count for each document in the group. I could also sum values from fields on the documents themselves. If there were a timesViewed property and we wanted to see how many times tweets in a given language were viewed, it would look like this:


db.allTweets.aggregate([
  { $group: {
    _id: '$lang',
    totalViews: {$sum: '$timesViewed'}
  }}
]);
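The difference between the two accumulators can be sketched in plain JavaScript (again illustrative, and timesViewed is a hypothetical field, as in the text above): instead of adding 1 per document, add that document's timesViewed value to the group's total:

```javascript
// Illustrative sketch of {$sum: '$timesViewed'}: accumulate the value of a
// field per group, rather than a per-document count of 1.
function totalViewsByLang(tweets) {
  const totals = {};
  for (const t of tweets) {
    totals[t.lang] = (totals[t.lang] || 0) + t.timesViewed;
  }
  return Object.entries(totals).map(
    ([lang, totalViews]) => ({ _id: lang, totalViews })
  );
}

console.log(totalViewsByLang([
  { lang: 'en', timesViewed: 10 },
  { lang: 'en', timesViewed: 5 },
  { lang: 'es', timesViewed: 3 },
]));
// [ { _id: 'en', totalViews: 15 }, { _id: 'es', totalViews: 3 } ]
```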

What if you want all of the languages that have more than a certain threshold of tweets -- 10,000, for example? Remove the $limit and add a $match to filter the grouped results, much as SQL's HAVING clause filters after a GROUP BY.


db.allTweets.aggregate([
  { $group: {
    _id: '$lang',
    count: {$sum: 1}
  }},

  { $match: {
    count: { $gt: 10000 }
  }},

  { $sort: {
    count: -1
  }}
]);

The result:


{ "_id" : "en", "count" : 516745 } 
{ "_id" : "es", "count" : 262056 } 
{ "_id" : "pt", "count" : 55117 } 
{ "_id" : "ar", "count" : 36122 } 
{ "_id" : "fr", "count" : 30003 } 
{ "_id" : "ja", "count" : 24851 } 
{ "_id" : "in", "count" : 17930 } 
{ "_id" : "it", "count" : 15876 }

Now, if you were feeding these results through to some service that consumes JSON, you'd probably want to rename the _id field to clarify what the data actually means. Instead of doing application-level processing to do this, you can add another step in the pipeline:

The query:


db.allTweets.aggregate([
  { $group: {
    _id: '$lang',
    count: {$sum: 1}
  }},

  { $match: {
    count: { $gt: 10000 }
  }},

  { $sort: {
    count: -1
  }},

  { $project: {
    language: '$_id',
    count: 1,
    _id: 0
  }}
]);

The result:


{ "count" : 516745, "language" : "en" } 
{ "count" : 262056, "language" : "es" } 
{ "count" : 55117, "language" : "pt" } 
{ "count" : 36122, "language" : "ar" } 
{ "count" : 30003, "language" : "fr" } 
{ "count" : 24851, "language" : "ja" } 
{ "count" : 17930, "language" : "in" } 
{ "count" : 15876, "language" : "it" }

In the $project operator, I'm creating a new language field that takes the value of the _id field, keeping the count field, and dropping the _id field. For most fields, simply omitting them from the $project operator is enough to drop them. The one exception is _id, which is passed through by default, so you have to exclude it explicitly if you don't want it.
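Per document, the reshaping this $project stage performs amounts to the following plain-JavaScript sketch (illustrative, not MongoDB internals):

```javascript
// Illustrative sketch of {$project: {language: '$_id', count: 1, _id: 0}}:
// copy _id into a new language field, keep count, drop _id.
function renameLangField(doc) {
  return { count: doc.count, language: doc._id };
}

console.log(renameLangField({ _id: 'en', count: 516745 }));
// { count: 516745, language: 'en' }
```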

Dealing with arrays

Let's change course and find the top five hashtags in use. On each document, the hashtags are stored as an array of subdocuments under the entities field. To group on the hashtag text, you first have to get them out of the array.

The query:


db.allTweets.aggregate([
  { $unwind: '$entities.hashtags' },

  { $group: {
    _id: '$entities.hashtags.text',
    tagCount: {$sum: 1}
  }},

  { $sort: {
    tagCount: -1
  }},

  { $limit: 5 }
]);

The result:


{ "_id" : "WorldCup", "tagCount" : 399139 } 
{ "_id" : "Brasil2014", "tagCount" : 172419 } 
{ "_id" : "worldcup", "tagCount" : 98049 } 
{ "_id" : "CRC", "tagCount" : 70970 } 
{ "_id" : "FRA", "tagCount" : 67226 }

As a pipeline, it's pretty easy to string together simple building blocks to get arbitrarily complex queries. What if you want to know who the top five tweeters are, what hashtags they use, and how many times they use them? All you'd have to do is use these simple building blocks to work the data into what you want it to be. Instead of including that query here, I'll let it be an open challenge for the adventurous. Tweet me a GitHub gist with an answer at @freethejazz.

Hopefully you've seen that it's easy to start getting the answers to important questions using MongoDB and the aggregation framework. Of course, depending on your needs, the framework might not be the only solution. If you wanted to start doing sentiment analysis on the tweets, for example, you're not going to do it with MongoDB. You could easily plug into Hadoop, using the MongoDB connector for Hadoop, and do the heavy lifting on a Hadoop cluster. For many reporting tasks, however, the aggregation framework will do everything you need it to.

This story, "Do Twitter analysis the easy way with MongoDB," was originally published by InfoWorld.
