Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

I think Amazon's extra features are good options for two classes of users. The first is anyone who already has most of the relevant data in Amazon's cloud: Elastic MapReduce makes it easy to spin up jobs to analyze it, because the plumbing is already in place.

The other group is those who don't need a cluster most of the time but want to run short, intensive calculations once a week, once a month, or once a quarter. It's not much work to create a full Hadoop cluster using the other tools in this review, but it's kind of silly to request new machines from scratch every now and again. Amazon offers a nice shortcut: upload a Python script or a JAR file and go straight to computation.
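To give a feel for what such a Python script might look like, here is a minimal word-count sketch in the style of Hadoop Streaming, where records move between steps as tab-separated text lines. The function names and the `run_locally` helper are my own illustrative choices, not anything Amazon prescribes, and the local run only imitates the sort that Hadoop performs between the map and reduce steps.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(records):
    """Sum the counts per word; records must arrive sorted by word."""
    parsed = (record.rstrip("\n").split("\t", 1) for record in records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Hypothetical helper: imitate map -> sort -> reduce in one process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    for record in run_locally(["Big data big", "data"]):
        print(record)
```

In a real streaming job the mapper and reducer would typically be separate scripts that read from standard input and print to standard output; the in-process version above is only meant for checking the logic on a laptop before paying for machines.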

Cloudera CDH, Manager, and Enterprise
Cloudera is a startup that has collected Hadoop experts from all of the major companies using Hadoop. The CTO came from Yahoo, the chief scientist from Facebook, and the CEO from Oracle. The staff is filled with people who learned Hadoop by building it.

The company is selling training, support, professional services, and some tools for managing your cluster. The Cloudera distribution and basic manager are free for clusters with fewer than 50 machines, while the subscription-based enterprise edition offers many more features for handling standard data formats.

The free version is quite useful for starting up a cluster and monitoring the jobs as they flow through the system. The manager takes a list of IP addresses, logs into all of them with SSH, and installs the major tools.
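The fan-out pattern described here (one host list, one SSH login per host, one install step) can be sketched roughly as follows. This is not Cloudera's code; it is an illustration under my own assumptions: the function name, the addresses, the key path, and the package list are all placeholders, and `yum` is assumed because the machines run CentOS. The sketch only builds the command lines and does not run them.

```python
import shlex


def build_install_commands(hosts, packages, user="root", key_path=None):
    """Return one ssh argument list per host that would install the packages."""
    # The remote command is a single string, so quote each package name.
    remote = "yum install -y " + " ".join(shlex.quote(p) for p in packages)
    commands = []
    for host in hosts:
        argv = ["ssh"]
        if key_path:
            argv += ["-i", key_path]  # the same key is reused for every host
        argv += [f"{user}@{host}", remote]
        commands.append(argv)
    return commands


if __name__ == "__main__":
    # Placeholder addresses and key path, for illustration only.
    for argv in build_install_commands(
        ["10.0.0.1", "10.0.0.2"], ["zip"], key_path="/root/.ssh/id_rsa"
    ):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Each list could be handed to `subprocess.run` to execute it. Keeping command construction separate from execution makes it easy to review exactly what would be run on every machine before touching the rack.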

The automation makes it pretty easy to run the Cloudera distro, but I still had to patch a few glitches to install it on CentOS. One component wanted a certain version of zip, and the install ground to a halt until I logged into the machines and installed it myself. At another point, the Web-based graphical user interface wouldn't work until I logged in again and installed a widget library, ExtJS. My guess is that the library isn't bundled because its open source license wasn't compatible with the rest of the distribution.

The logging in reminded me of a small point. The IBM installer can use a different root password for each machine. Cloudera's installer wants to use either the same root password or the same RSA key. This meant I had to log into all of the machines and change the password because I was using a stock version of CentOS to start up the rack.

The fact that I noticed this small point and remembered it says much about what is for sale here. The tools are open source and the companies are selling ease-of-use. Little delays can multiply when you're not running exactly the same code.

I think Cloudera has done a better job of making its tools work with different Linux distros. It lists Ubuntu, SUSE, Red Hat, CentOS, and Debian. Although I had to do a bit of patching with CentOS, it was relatively simple.

The difference between the free and enterprise versions is bigger than I usually see. The proprietary version not only handles more than 50 machines, but also includes plenty of monitoring, reporting, and data analysis tools.

