Today a study will come out saying that Spark is eating Hadoop -- really! That's like saying SQL is eating RDBMSes or HEMIs are eating trucks.
Spark is one more execution engine on an overall platform built of various tools and parts. So, dear pedants, if it makes you feel better, when I say "Hadoop," read "Hadoop and Spark" (and Storm and Tez and Flink and Drill and Avro and Apex and ...).
The major Hadoop vendors say Hadoop is not an enterprise data warehouse (EDW) solution, nor does it replace EDW solutions. That's because Hadoop providers want to co-sell with Teradata and IBM Netezza, despite hawking products that are increasingly eating into the market established by the big incumbents.
Not that we lack a legitimate reason to put the two in separate camps. A few years back, the ACM published what was then considered the seminal paper on the issue, which concluded:
"Parallel DBMSes excel at efficient querying of large data sets; MR-style systems excel at complex analytics and ETL tasks. Neither is good at what the other does well. Hence, the two technologies are complementary, and we expect MR-style systems performing ETL to live directly upstream from DBMSes."
That was five years ago. A lot can happen in that time. For example, Miley Cyrus was still Hannah Montana when that paper was written. Her current persona was a twerkle in her eye.
I'm the first to admit that the ACM paper still holds some truth. Certainly parallel DBMSes like Teradata excel at running SQL across large data sets -- and certainly they aren’t great at stringy, pattern-matchy, loose queries. Hadoop is not quite as efficient at doing SQL across large data sets, and joining tables is never going to be as efficient with Hive as it would be with Netezza or Teradata.
Yet the gap is closing. A modern ORC or Parquet-centered Hive using Tez will do the job a lot faster than yesterday's Hive over text files with MapReduce. Meanwhile, Spark blows most everything in the Hadoop ecosystem out of the water (except while streaming). The new world is “in memory,” and memory is a whole lot cheaper than it used to be.
Hadoop is a different beast than a Netezza or a Teradata. When we refer to the modern Hadoop distribution, it's a distributed execution platform, and on that platform there are many tools. Some are for running SQL, doing analytics, and streaming, and some are for organizing data into neat little tables.
Is Hadoop going to be as fast or as efficient as a special-purpose, single-minded appliance running SQL on large data sets? Probably not.
On the other hand, Hadoop doesn’t have to be as fast as Teradata. It doesn’t even have to be half as fast as Teradata. When the ACM did its test, Hadoop wasn’t in the ballpark -- but now is a different time.
This reminds me of when Sun Microsystems showed us it could blow Intel out of the water, especially for business apps and particularly with Java workloads. What it didn’t mention was that we could buy lots more computing power in Intel/AMD-land for a whole lot less money. The market ultimately decided the latter was good enough.
Cheaper by the cluster
With Hadoop, your buy-in is cheap. You can deploy the stuff on Amazon for zero capital investment or on your favorite commodity hardware for a small investment. While expertise is not cheap, it isn’t any less expensive on the Teradata/Netezza side.
With Spark and Tez, the performance gap is growing narrower by the day, and if they aren't quick enough, there are other structures -- such as those used by Splice Machine and Apache Phoenix -- that rely on HBase. Either should beat the $3,000 to $34,000 per terabyte claimed by Teradata.
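To put that per-terabyte range in perspective, here is a back-of-the-envelope sketch in Python. The only real numbers in it are the $3,000 and $34,000 figures quoted above; the 50TB data volume and the function name are made up purely for illustration, and it makes no attempt to model the Hadoop side of the bill.

```python
# Teradata's claimed price range per terabyte, as quoted in the text above.
TERADATA_USD_PER_TB = (3_000, 34_000)


def teradata_cost_range(terabytes):
    """Return the (low, high) cost in USD for a given data volume.

    Hypothetical helper: it simply scales the quoted per-terabyte
    range linearly, which ignores discounts, support, and staffing.
    """
    low, high = TERADATA_USD_PER_TB
    return terabytes * low, terabytes * high


# Assumed example volume (not from the article): a 50 TB warehouse.
low, high = teradata_cost_range(50)
print(f"50 TB: ${low:,} to ${high:,}")
# prints "50 TB: $150,000 to $1,700,000"
```

Even at the low end of the range, that is the number a Hadoop cluster has to come in under for the "good enough" argument to hold.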
In the real world, I’ve seen few companies replace Teradata or Netezza with Hadoop. But I’ve also seen relatively few adding to the Teradata or Netezza platform. If it's already there and working, they keep at it, but when a new project comes up, they use Hadoop and/or Spark. These new projects sometimes involve social media or streams of data and sometimes they’re run-of-the-mill BI projects. Companies that don’t have Teradata or Netezza tend to consider Hadoop first.
Beyond this, Teradata and Netezza were always too pricey for a big chunk of the midmarket, but that market can afford Hadoop. A lot of companies with small IT budgets and midsized data can now use Tableau or Pentaho to plumb their own data lake.
The pace of change
Data is growing at a fantastic rate and in forms that don’t really lend themselves to studiously designing your EDW and only then, when you’re all done, beginning your analysis.
Today’s business and data sets are more like “OMFSM, let’s get the data, hurry! Here’s the use case. Let’s try and do this before it overflows or we miss something important,” because your hair is always on fire, there is always an ASAP timeline, and the data is always messy and often voluminous.
Last week, when I wrote about the latest Hadoop survey and took my shot at debunking Gartner’s pandering, anti-Hadoop analysis, I noted that BI is the No. 1 application of Hadoop. Your major BI tools have all added support for Hive and are now moving to support Spark.
Clearly, companies are making this switch -- slowly and conservatively, but this is the direction of things to come. If you ignored the warning label and kind of did actually use Hadoop instead of an EDW or “augmented and replaced” key functionality of your EDW, you’re in good company. Inevitably, that's where the future is headed.
This story, "Hadoop is slowly eating conventional analytics," was originally published by InfoWorld.