Evernote buys into big data analytics for a song

How Evernote transitioned from MySQL to Hadoop and ParAccel

When a flood of data threatened to swamp Evernote's analytics system, the company modernized its analytic environment to handle big data -- without breaking the budget. The provider of popular personal organization and productivity applications has moved from a conventional data warehouse to a modern hybrid of Hadoop and ParAccel, a massively parallel processing (MPP) analytic database.

Evernote collects and analyzes enormous quantities of data about its users. Since 2008, more than 36 million people have created Evernote accounts, generating hundreds of terabytes of data, including 1.2 billion "notes" and 2 billion attachments. Notes can be text, Web pages, photos, voice memos, and so on -- and they can be tagged, annotated, edited, sorted into folders, and manipulated in other ways.

[ Download InfoWorld's Big Data Analytics Deep Dive for a comprehensive, practical overview of this booming field. | Harness the power of Hadoop with InfoWorld's 7 top tools for taming big data. ]

To determine how to optimize the Evernote user experience, the company now analyzes 200 million events per day through a combination of Hadoop and the ParAccel database. In addition, the open source JasperReports Server Community Edition is used for generating reports and charts.

Hitting the data volume limit

Evernote's original data warehouse was based on an OLTP (online transaction processing) relational database. The data warehouse used a star schema, meaning the data is arranged with respect to queries as opposed to transactions. This solution is usually good enough for reporting and some analysis as long as data volumes on MySQL top out at several terabytes. But as data spills over that limit, less history can be retained and data warehouse query speeds, flexibility, and affordability suffer.

This was the case with Evernote, whose data warehouse was powered by MySQL on a large RAID10 array on the same network as the application's main servers. Every night, over a period of 9 to 18 hours, a series of batch processes dumped increments of several operational database tables combined with parsed structured event logs from the application servers. After manual creation and tuning, reports were then distributed by email.

By the beginning of 2012, however, the analytics team realized its existing solution could no longer handle the load. With a primary table of more than 40 billion rows, it was impractical to access more than a few days' data at a time. The reporting database was slow, difficult to maintain, and did not lend itself to ad hoc queries.

The Evernote team set out to build an analytics environment that could efficiently store the full history of the data, generate dozens of standard daily reports, easily manage ad hoc queries, and continue to scale in the future. And it needed to respect a tight budget.

Baking a hybrid analytic environment

Evernote settled on a new approach composed of Hadoop and ParAccel.

A 10-node Hadoop cluster now stores all historical data and handles data preparation for analytics. Hadoop was seen as an affordable solution, thanks to open source licensing and ability to scale out using commodity hardware.

As an MPP analytic database, ParAccel excels at superfast ad hoc queries. At Evernote, a three-node ParAccel columnar analytic database handles queries over sets of derived tables. The nodes are SuperMicro boxes, each with dual L5630 quad-core processors, 192GB of RAM, 10Gbps networking, and a RAID5 of solid-state drives manually provisioned and configured with Red Hat Enterprise Linux.

Finally, as one of the most popular open source reporting solutions, JasperReports was an easy call. The team chose Jaspersoft's open source JasperReports Server for querying the ParAccel server and generating dozens of daily reports in a variety of formats. (Recently, the combination of ParAccel and JasperReports Server received a de facto endorsement from Amazon, which uses the two to power its Redshift hosted analytics environment.)

Evernote uses JasperReports Server to generate dozens of charts and reports every day.

Evernote uses JasperReports Server to generate dozens of charts and reports every day.

For security reasons, the analytics environment is on a separate network with no connections to the production application servers. Daily online data is securely pushed into the reporting environment through a one-way network connection.

Building the Hadoop deployment

All raw data first goes to Hadoop, where it is both archived and prepared for loading into ParAccel for daily reporting as well as ad hoc analysis. Evernote uses the Cloudera Hadoop distribution, with Puppet employed for configuration management.

The Hadoop cluster includes six data nodes with eight 500GB drives, for a total of 24TB of raw storage. Two eight-core processors and 64GB of RAM run 132 MapReduce tasks across the cluster with more than 2GB of RAM for each task.

In addition, Evernote runs a single Hadoop Job Tracker on a pair of servers for redundancy, along with one client node for running Hive and Hue, two key open source tools for Hadoop. The Hadoop cluster is accessed through the Hive abstraction layer, which provides a SQL-like interface for querying. Hue is a Web-based interface for Hadoop that includes a number of utilities, such as a file browser, a job tracker interface, a cluster health monitor, and more -- as well as an environment for building custom Hadoop applications.

Working together

User activity data captured from Hive is loaded into ParAccel every night, along with the reference tables from the online production database. Using Hive, derived tables are created that contain presliced information for optimal representations in common reports. For example, a country summary table contains just one row per country each day with a sum of the daily, weekly, and monthly active users as of that date.

This ParAccel database and its tables are tuned for quick aggregation of data, so Evernote can answer many types of questions much faster than using Hive alone. For example, it takes three seconds to see which versions of Evernote Windows were most widely used in Germany during a particular week.

Now the team has a modern analytic environment with room to grow. Thanks to Hadoop, the team can archive unprecedented quantities of operational and log data -- and, more important, load and transform hundreds of millions of records in two hours instead of the 10 or more hours that were once required. Thanks to ParAccel, Evernote can perform much more complex analyses of user trends, with JasperReports Server delivering the final, polished results.

The ability to store all historical data, achieve fast ad hoc queries, and automate quality reports on a daily basis is giving Evernote new insight into how its customers use its products -- and how those products can be continually improved.

This article, "Evernote buys into big data analytics for a song," was originally published at InfoWorld.com. Read more of Andrew Lampitt's Think Big Data blog, and keep up on the latest developments in big data at InfoWorld.com For the latest business technology news, follow InfoWorld.com on Twitter.

This story, "Evernote buys into big data analytics for a song" was originally published by InfoWorld.