My company has been considering using Spark on a client project that calls for lower latency than Hadoop can offer. The evaluation has centered on maturity, how data flows in and out of memory, and the advantages of Spark over more established technologies. For this project, we didn't seriously consider Hadoop because of its batch-oriented nature.
Over the past week as Spark hit 1.0, a lot of lazy journalists have been describing Spark as a "Swiss army knife" of data processing and rattling off a bunch of press release marketing malarkey. Come on, people! If you're going to copy a press release into your column, at least reword it a bit! Nothing I read actually bothered to explain exactly what Spark is, why you would use it, or where it would fit into your Hadoop stack.
Spark is a low-latency, in-memory data processing framework that can be set up on a Hadoop (YARN and HDFS) cluster. It provides parallel, in-memory processing, whereas traditional Hadoop is focused on MapReduce and distributed storage. Cost-wise, Spark requires a greater investment in memory than Hadoop does, though both fall well short of "big iron." Why Spark? Because many problems do not lend themselves to the two-step process of map and reduce, and for those that do, Spark can run map and reduce much faster than Hadoop can, since it keeps intermediate results in memory rather than writing them to disk between stages.
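If you've never looked at the two-step model up close, here is a minimal sketch in plain Python -- illustrative only, not Spark's or Hadoop's actual API. The map step emits key-value pairs; the reduce step aggregates them per key. The performance difference comes down to where `mapped` lives: Spark would hold it in memory across stages, while classic Hadoop MapReduce spills it to disk.

```python
from functools import reduce

lines = ["spark runs in memory", "hadoop writes to disk", "spark beats hadoop"]

# Map step: emit a (word, 1) pair for every word in every line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce step: sum the counts per key, as a MapReduce shuffle-and-reduce would.
def reducer(acc, pair):
    word, count = pair
    acc[word] = acc.get(word, 0) + count
    return acc

counts = reduce(reducer, mapped, {})
print(counts["spark"])   # 2
print(counts["hadoop"])  # 2
```

Word count is the canonical example because it fits the two-step shape perfectly; the point of the surrounding paragraph is that plenty of real problems don't.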
Other solutions are remarkably similar to Spark but don't hit quite the same marks. JBoss Infinispan and GemFire (now GemFire XD), to name two, let you create various types of in-memory data grids.
There are a number of applications for Spark, including direct ports from Hadoop. Spark has an answer to Hive called Shark that allows you to run SQL queries on Spark data. Indeed, Shark is compatible with Hive. Spark has its own machine-learning library similar to Mahout (which I covered last week) called MLlib. Spark goes beyond this with a highly scalable, distributed graph algorithm library called GraphX. It also improves on Hadoop's memory caching through a related project called Tachyon, a memory-centric distributed storage layer.
Spark's biggest value is for data streaming. Whereas with Hadoop you might ask a question, get back a big batch of data, and be done, with Spark you might ask "continue to give me answers to this question." When new data arrives, Spark keeps delivering updated answers. Obviously, you can't really "reduce" this stream to a single final answer. Instead, think of Spark as a database version of the publish-and-subscribe model of messaging.
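The publish-and-subscribe analogy can be sketched in a few lines of plain Python -- again, this is an illustration of the model, not Spark Streaming's API. A "continuous query" registers a callback; each new batch of data updates a running answer and notifies every subscriber, much as Spark Streaming reprocesses each micro-batch as it arrives. The `ContinuousQuery` class and its running-total query are made up for the example.

```python
class ContinuousQuery:
    """A standing query: new data keeps producing updated answers."""

    def __init__(self):
        self.subscribers = []
        self.total = 0

    def subscribe(self, callback):
        # Register an interested party, as in pub-sub messaging.
        self.subscribers.append(callback)

    def publish(self, batch):
        # New data arrives: update the running answer, notify everyone.
        self.total += sum(batch)
        for callback in self.subscribers:
            callback(self.total)

answers = []
query = ContinuousQuery()
query.subscribe(answers.append)
query.publish([1, 2, 3])  # running total is now 6
query.publish([4])        # running total is now 10
print(answers)            # [6, 10]
```

Note that the query never "finishes" -- there is always another answer coming, which is exactly why a one-shot reduce doesn't fit.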
Until recently, when the high-speed trading crowd, or anyone else who needed low latency, asked about Hadoop, a reasonably competent professional would answer, "This stuff isn't for you." With Spark, you now have software that can process large data sets, do data streaming, and apply machine learning -- all with an acceptable latency.
Should we all ditch our shiny new Hadoop clusters and go Spark? Not really. If you have offensive amounts of money to burn per node, you can of course do more in memory, but "permanent" distributed storage has obvious advantages too. More likely, you will mix in Spark for low-latency, parallel computing, and streaming problems. You will keep Hadoop -- or more correctly HDFS -- to store large data sets; indeed, Spark has hooks for reading from HDFS. And you'll keep Hadoop around for tasks that truly lend themselves to MapReduce, which it handles at a fraction of the cost of provisioning enough all-memory nodes, or where loading the data into memory first wouldn't be a big win.
In short, Spark makes the Hadoop ecosystem an even more general-purpose platform that can support more industries and types of problems. It isn't a Swiss Army knife at all, but a specific, complementary toolset for a specific set of problems.
This story, "Straight talk on Apache Spark -- and why you should care" was originally published by InfoWorld.