Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

The monitoring is also rudimentary. You can see that the nodes in your cluster are running and the components have started, but you don't get any cool dials or widgets that show the load or the progress. If you ask for the "details" about a component, you get a popup with some Log4J lines related to that component. A Java programmer won't bat an eye, but others might find it spare and uninviting.

There are a number of better tools in the enterprise edition. The aforementioned BigSheets, a so-called spreadsheet running on Hadoop, will let you play around with the data in the Hadoop cluster just as you would experiment with the data in Excel. There are also a number of tools for connecting your cluster with other databases and data sources throughout the enterprise. The basic edition is good for trying out a pretty standard version of Hadoop, while the enterprise edition adds a slew of features that go far beyond the open source core.

MapR M3 and M5
Whereas Cloudera is run by folks who come from Hadoop strongholds such as Yahoo, MapR's corporate team is filled with people who hail from Google, EMC, Microsoft, and Cisco, companies with plenty of experience with big data sets, even if they're not steeped in Hadoop's traditional way of working with them.

The new talent is also bringing more sophistication to the stack. The MapR distribution of Hadoop includes a better version of the file system with snapshots, mirroring, and direct NFS access if you need it. MapR also offers a more resilient architecture that won't go down if the central controller locks up. MapR calls all of this "high availability" and charges for it.

MapR comes in two flavors: M3 and M5. Is there an M4? Apparently not, but that's marketing for you. The real distinction is between the free community edition (M3) and the proprietary version with all the extra, high-availability features (M5). While some of the other companies are effectively selling tools for monitoring and reporting, MapR is selling a more sophisticated layer under the hood. In other words, whereas the others are wrapping more features around the open source Hadoop, MapR is rebuilding it.

The value of this approach will depend on the seriousness of your job. If your Hadoop data is mission critical or simply has to be ready most of the time, you'll definitely be interested in the extra features for preserving the data and keeping the cluster up. But if you're processing log files and generating reports that can wait a few hours or even days, there's not much need for them. Restarting your cluster when the NameNode fails is kind of a pain, but not if you have the slack in your system to begin again.

The value will also depend on the nature of your calculations. If you do many short calculations, then restarting isn't a big problem. But if your jobs last hours, days, or especially weeks, the ability to store snapshots becomes more and more valuable.

There is a cost for some of these features. I couldn't install the M3 distribution on my cluster of machines in the Rackspace cloud because it requires access to "physical hard drives." In other words, the NFS code from MapR burrows fairly deeply into the file system to generate the performance gains. It can't work its magic with all of the layers of virtualization in some environments. This won't be an issue if you're using real machines with real disks, but it can be a roadblock in some of the new virtual worlds.



Resources