Spark, big data's brightest star, needs to grow up

Spark is hotter than the ecosystem that spawned it, but needs clearer direction to succeed

Spark is on the ascent in the big data world, and rightly so. It's faster than MapReduce by far, and with its SQL interface, it's faster than Hive. Though operationally different from both, Spark can replace either in many instances.

The company behind Spark, Databricks, hopes to carve out a niche for itself in the big data world. Yet all of the major Hadoop vendors have announced support for Spark as well. At the recent Spark Summit East, I asked Databricks' head of customer engagement, Arsalan Tavakoli, how the company plans to compete:

It is really two different segments. I think the Hadoop ecosystem is alive and kicking. Hortonworks, MapR, Cloudera are all very focused in the on-premise world. We don’t have a distribution of Spark in the on-premise world. Actually, all of those guys leverage Databricks for their L2, L3 support for Spark. When they go to a customer and sell Spark support, they rely on our expertise because we have the core braintrust around that.

This is a rosy, if not well-rehearsed, answer to the question, but the truth is more complicated. Paco Nathan, Databricks' director of community engagement, made several unfavorable references to Hadoop during a Databricks cloud training session at Spark Summit East. He said he saw several companies “jumping over Hadoop” and “skipping the big YARN deploy” to go straight to Spark. He went further, saying that Hadoop would be over in a few years.

What does Databricks sell exactly?

According to Tavakoli, “When we built the company, we said two things: Our focus is, one, entirely on the cloud, and two, it's about something broader than 'here is an open source product and we’re going to wrap some professional services around it.'”

Translation: The company has a cloud-based, Spark-based platform that uses the concept of a “notebook” in which you write both markup and code in what amounts to a Web page, then “execute” the notebook across the cluster. It looks like Interwoven TeamSite (an old, fat CMS) ate IPython Notebook but forgot about security.

You can embed HTML, SQL, Python, and Scala in a notebook, then store the notebook in a folder. You can’t, however, secure a folder or notebook, which was demonstrated comically during introductory training at the conference. Someone didn’t pay close attention to the instructions; rather than copy the course material to their own folder, they edited the instructor's copy, introduced garbage, and made it so that only we “advanced students” could complete the lesson.

Your notebooks stay in the cloud and are executed across it. According to Tavakoli, unlike a typical multitenant SaaS architecture, Databricks deploys as a fully managed service inside a virtual private cloud. The product is currently on Amazon, but it will be available on other clouds.

The product is far from mature. During the training, I watched the product throw a stack trace. It also had a really annoying habit of saying your page was executing, only to hang and fail to return the results, so you had to refresh. Admittedly, this might have been due to the crappy hotel Wi-Fi, but if so, the page should notice a bum connection, which didn’t always seem to be the case. Folder permissions, version control, and other “I’m not working with one other person” features, all currently missing, are going to be essential for Databricks' cloud to reach the company’s sales targets.

Looking ahead

The company sees “solutions” as the future. Everyone is supposed to say that even if they’re a platform company. According to Tavakoli:

You don’t want to just say, hey it's great, I got a big data platform and deployed a BI tool and ETL. I deployed these things that I [ascribe] real business value to. That’s something that I feel really hindered the Hadoop ecosystem and big data so far. Our goal is to get more and more to those solutions, but do it in a way that is more productized and automated rather than you brought an army of 1,000 consultants to build you a custom solution so you could only do one or two.

This is a long way from the product Databricks has today. The Databricks Cloud is really a platform for great mathematicians who can do crappy coding or people who have more love for Python than sense. It is far from the Tableau of data science.

By Tavakoli’s math, with the company’s 3,500-person waiting list and his estimate of maybe 1,000 to 1,400 paid Hadoop installations worldwide, the future is bright. But a waiting list and dollars aren’t the same. Moreover, as a strategy, Databricks is counting on two things for now: that it hired the brains behind Spark, who are all tied together by academic relationships at MIT and Berkeley, and that everyone plays nice.

The first is indeed a challenge. The second inevitably falls apart as soon as Hortonworks or Cloudera loses a big deal and calculates that coming up with its own “notebook” and building its own Spark team is a better solution than relying on Databricks. Meanwhile, Google has Dataflow (which competes with part of the Databricks product) and Google Docs. If Databricks gets traction, why not put the two together and compete directly?

The real question is the viability of the “solutions” vision, where a marketing manager can use machine learning against a big data cluster without becoming a mathematician. To turn that dream into reality, is your most appropriate commercial entry into the market a tool that lets mainly Python developers embed code into an HTML page and execute it across the cluster?

I think it is clear that Spark will do well. It's also possible that Databricks Cloud will grab a decent niche market, but I’ll be watching closely for a pivot in this company’s future.

This story, "Spark, big data's brightest star, needs to grow up" was originally published by InfoWorld.
