
Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs


Page 4 of 7

In other words, the free version is a great way to start up a Hadoop cluster and make sure that everything is running, but you'll have to do some poking around to monitor it. The enterprise version includes more tools that automate the poking around and double-checking.

IBM InfoSphere BigInsights
IBM bundles Hadoop into something it calls InfoSphere BigInsights. The word "Hadoop" is on the main page, but the advertising copy clearly suggests that this is a product to help people who want "deep insights" into "big data." It's a tool for data analysis that just happens to use Hadoop for all of the structure.

There are two tiers: basic and enterprise. The basic edition is free, though you can buy support if you like. The enterprise edition, available through a commercial license, includes a number of extra features like BigSheets, a spreadsheet-like tool for drilling down into the data sitting in the cluster.

The collection includes all of the usual suspects and a few that aren't always mentioned -- such as Lucene. Lucene makes sense because BigInsights includes more than a few mechanisms for taking apart text. There's an entire collection of TextExtractors that will do things like search for addresses and flag certain words. The meat of the text analytics is in the enterprise edition.

IBM's literature says the BigInsights package is for Linux, but I found that it ran smoothly only on Red Hat's Enterprise distribution. The installation script would limp to the finish on a few of the others I tried, but it often reported that it had failed to install entire tools like Hive or Pig. Even CentOS wasn't close enough to get much running. I think it may still be possible to get BigInsights running if you're adept with Linux and happy to poke around the log files, but it achieves labor savings only if you're running Red Hat Enterprise.

There are several nice touches in this installation script, by the way. As I plowed along looking for a good distribution, the software was careful to remember all of my inputs, so it wouldn't need to be reconfigured each time. This should be useful in a cloud where people may try to spin up a cluster, then tear it down. The software also includes a number of little features, like the ability to remember a different root password for each node; these can be quite helpful.

The center of the IBM tool is a console that helps you set up some jobs and kick them off. It's completely browser-based -- like the install script -- and you can simply upload your JAR files directly through the Web browser. You can even drill down into the HDFS file system layer and read the results without leaving the browser.

The Web GUI is a big advance over using the command line, but I easily found a number of ways that the console in the basic edition could be improved. As far as I can tell, there's no way to delete the old jobs. The information for each job includes basic details about the start and stop time, but almost everything else is just dumped as raw text. It wouldn't be too hard to parse some of this and do a nicer job displaying the log information.
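To give a sense of how little parsing a nicer display might take, here is a minimal sketch in Java. It assumes, purely for illustration, that each raw log line carries KEY="value" pairs; the class name JobLogParser, the field names, and the sample line are all hypothetical and are not taken from BigInsights or its console output.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JobLogParser {
    // Matches KEY="value" pairs. This line format is an assumption made
    // for illustration, not the documented BigInsights log format.
    private static final Pattern PAIR = Pattern.compile("(\\w+)=\"([^\"]*)\"");

    /** Returns the key/value pairs found on one line, in order of appearance. */
    public static Map<String, String> parseLine(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = PAIR.matcher(line);
        while (m.find()) {
            fields.put(m.group(1), m.group(2));
        }
        return fields;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; the field names are invented for this sketch.
        String sample = "JOBNAME=\"wordcount\" START_TIME=\"1000\" FINISH_TIME=\"4500\"";
        Map<String, String> f = parseLine(sample);
        long elapsed = Long.parseLong(f.get("FINISH_TIME"))
                - Long.parseLong(f.get("START_TIME"));
        System.out.println(f.get("JOBNAME") + " ran for " + elapsed + " ms");
    }
}
```

Once the fields are in a map, a console could show them in a table or compute summaries such as elapsed time instead of dumping the raw text.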

