16 for '16: What you must know about Hadoop and Spark right now

Amazingly, Hadoop has been redefined in the space of a year. Let's take a look at all the salient parts of this roiling ecosystem and what they mean


New technologies to watch

I wouldn't throw these technologies into production yet, but you should certainly know about them.

Kylin: Some queries need low latency, which is where HBase fits. Larger analytics queries are often a poor fit for HBase, which is where Hive comes in. Moreover, joining a few tables over and over to calculate a result is slow, so “prejoining” and “precalculating” that data into cubes is a major advantage for such datasets. This is where Kylin comes in.

Kylin is this year’s up-and-comer. We’ve already seen people using Kylin in production, but I’d suggest a bit more caution. Because Kylin isn’t for everything, its adoption isn't as broad as Spark's, but Kylin has similar energy behind it. You should know at least a little about it at this point.
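To make the “precalculating” idea concrete, here is a minimal sketch in plain Python. It is not Kylin's API or implementation; the rows, dimension names, and the build_cube and query helpers are all hypothetical. It only shows the general technique: aggregate once for every combination of dimensions, then answer queries with a lookup instead of a fresh join and scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical, already-joined fact rows: (region, product, sales amount).
ROWS = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
    ("west", "widget", 3),
]
DIMENSIONS = ("region", "product")


def build_cube(rows, dimensions):
    """Precompute SUM(amount) for every subset of the dimensions."""
    cube = defaultdict(int)
    for row in rows:
        # zip stops at the shorter input, so only the dimension columns are paired.
        values = dict(zip(dimensions, row))
        amount = row[-1]
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                key = tuple((d, values[d]) for d in dims)
                cube[key] += amount
    return dict(cube)


def query(cube, **filters):
    """Answer an aggregate query by dictionary lookup, with no scan or join."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    return cube.get(key, 0)


CUBE = build_cube(ROWS, DIMENSIONS)
```

With the cube built, query(CUBE, region="east") or query(CUBE, product="widget") is a single lookup. The trade-off is that the cube grows with the number of dimension combinations, which is one reason this approach suits some datasets and not others.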

Atlas/Navigator: Atlas is Hortonworks’ new data governance tool. It isn’t even close to fully baked yet, but it's making progress. I expect it will probably surpass Cloudera’s Navigator, but if history repeats itself, it will have a less fancy GUI. If you need to know the lineage of a table or, say, map security without having to do so on a column-by-column basis (tagging), then either Atlas or Navigator could be your tool. Governance is a hot topic these days. You should know what one of these doohickies does.

Technologies I'd rather forget

Here's the stuff I am happily throwing under the bus. I have that luxury because new technologies have emerged to perform the same functions better.

Oozie: At All Things Open this year, Ricky Saltzer from Cloudera defended Oozie and said it was good for what it was originally intended to do -- that is, chain a couple of MapReduce jobs together -- and that dissatisfaction with Oozie stemmed from people overextending its purpose. I still say Oozie was bad at all of it.

Let's make a list: error-hiding, features that don’t work or work differently than documented, totally incorrect documentation with XML errors in it, a broken validator, and more. Oozie simply blows. It was written poorly, and even elementary tasks become week-long travails when nothing works right. You can tell who actually works with Hadoop on a day-to-day basis versus who only talks about it because the professionals hate Oozie more. With NiFi and other tools taking over, I don’t expect to use Oozie much anymore.

MapReduce: The processing heart of Hadoop is on the way out. Running a job as a directed acyclic graph (DAG) of stages is a better use of resources. Spark does this in memory with a nicer API. The economic reasons that justified sticking with MapReduce recede as memory gets ever cheaper and the move to the cloud accelerates.

Tez: To some degree, Tez is a road not taken -- or a Neanderthal branch of the evolutionary tree of distributed computing. Like Spark, it's a DAG execution engine, although one of its developers described it as an assembly language.

As with MapReduce, the economic rationale (disk versus memory) for using Tez is receding. The main reason to continue using it: The Spark bindings for some popular Hadoop tools are less mature or not ready at all. However, with Hortonworks joining the move to Spark, it seems unlikely Tez will have a place by the end of the year. If you don’t know Tez by now, don’t bother.

Now's the time

The Hadoop/Spark realm changes constantly. Despite some fragmentation, the core is about to become a lot more stable as the ecosystem coalesces around Spark.

The next big push will be around governance and application of the technology, along with tools to make cloudification and containerization more manageable and straightforward. Such progress presents a major opportunity for vendors that missed out on the first wave.

It's a good time, then, to jump into big data technologies if you haven't already. Things evolve so quickly that it's never too late. Meanwhile, vendors with legacy MPP cube analytics platforms should prepare to be disrupted.

This story, "16 for '16: What you must know about Hadoop and Spark right now" was originally published by InfoWorld.
