Enterprise Hadoop: Big data processing made easier

Amazon, Cloudera, Hortonworks, IBM, and MapR mix simpler setup of Hadoop clusters with proprietary twists and trade-offs

I ended up doing my testing with a VMware version that worked with MapR's version of Ubuntu.

I have to say that direct access through NFS is a nice feature. Although it's always possible to move data in and out of HDFS with the regular tools, it's much easier to integrate the system with a direct NFS link. I wouldn't be surprised if some errant tool occasionally introduces a bug because the data is not run through HDFS, but I'm guessing the occasional problems will be worth the trade-off.
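To make the integration point concrete, here is a minimal sketch of what an NFS link allows: a plain Java program that writes and reads cluster data with ordinary file I/O and no Hadoop client classes. The class name, the file name, and the idea of passing the mount point as an argument are all assumptions made for illustration; the actual mount path depends on how the cluster is exported. With no argument, the sketch falls back to a temporary directory so it runs anywhere.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical example class; nothing here comes from MapR's own tooling.
public class NfsMountSketch {

    // Write records with standard java.nio calls. If dir sits on an
    // NFS-mounted cluster, the data lands in the cluster's file system
    // without passing through the HDFS command-line tools.
    static Path writeRecords(Path dir, String fileName, List<String> lines) throws IOException {
        Files.createDirectories(dir);
        Path target = dir.resolve(fileName);
        return Files.write(target, lines, StandardCharsets.UTF_8);
    }

    // Count the lines in a file, again using only standard file I/O.
    static long countRecords(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the first argument, if given, is the directory where
        // the cluster is mounted over NFS. Otherwise use a temp directory.
        Path dir = args.length > 0
                ? Paths.get(args[0])
                : Files.createTempDirectory("nfs-sketch");

        Path file = writeRecords(dir, "events.txt", Arrays.asList("a,1", "b,2", "c,3"));
        System.out.println(countRecords(file) + " records in " + file);
    }
}
```

Without the mount, the same file would first be written locally and then copied in with a command such as hadoop fs -put. The convenience is also the risk noted above: any tool that can write a file can now write into the cluster, whether or not it behaves well.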

It's clear that MapR is putting most of its effort into the code under the hood. The Web console for monitoring jobs is perfectly nice, but it lacks some of the gaudier features you'll find in other distributions. I even found myself kicking off jobs by typing "hadoop" into a command line. There's nothing missing that will get in the way of serious work, but the interface isn't as accessible for new users as some of the others.

MapR also has some interesting partnerships. Many have noted that EMC is almost certainly repackaging MapR and selling it as part of the Greenplum collection of big data analytics tools. This suggests that we're already starting to see these stacks disappearing inside of other packages.

Hortonworks Data Platform
I wanted to test the Hortonworks distribution, but it wasn't ready when I was writing. The company plans to concentrate on selling training and support and to avoid creating proprietary extensions.

"We are an open source company," Eric Baldeschwieler, the CEO, told me. "The only product we have is open source. We won't commit to never selling anything, but you won't see anything in the next year. We're committed to a complete open source, horizontal platform. We want people to be able to download everything they want for free. That differentiates us from everyone else in the market."

Indeed, the company employs a number of people with deep knowledge of Hadoop gained from years at Yahoo. The company formally separated from Yahoo last year, and now it's looking for partnerships to support its work.

Hortonworks is currently running a private beta. I couldn't join it, but perhaps your company will be able to participate. In the meantime, you can grab Hadoop directly from Apache. It's guaranteed to be pretty close to what Hortonworks will be shipping, at least for the next year.

Choosing a Hadoop
There's no easy way to summarize the quickly shifting space. Each of these companies is pointed in a slightly different direction. They may all agree that the Hadoop collection of software is a great way to spread out work over a cluster, but they each have different visions of who would want to do this and, more important, how to accomplish it. The similarities are fewer than you might expect.

The biggest differences may be in how you handle your data. The idea of making your data accessible through NFS may be one of the neatest innovations, but MapR is introducing some risk by breaking from the pack and adding its own proprietary extensions. MapR's claims for great speed and better throughput are tantalizing, but there's also the danger of bugs or mistakes appearing because of incompatibility. Just as in horror movies, bad things can happen when you split up and strike off on your own.


Resources