For many developers investigating Storm, the first question will be: How does it differ from Hadoop? The simple answer is that Storm analyzes realtime data while Hadoop analyzes offline data. In truth, the two frameworks complement one another more than they compete.
Hadoop provides its own file system (HDFS) and manages both data and code/tasks. It divides data into blocks and when a "job" executes, it pushes analysis code close to the data it is analyzing. This is how Hadoop avoids the overhead of network communication in loading data -- keeping the analysis code next to the data enables Hadoop to read it faster by orders of magnitude.
Hadoop's programming paradigm is the famous MapReduce. Basically, Hadoop partitions data into chunks and passes those chunks to mappers that map keys to values, such as the hit count for a resource on your website. Reducers then assemble those mapped key/value pairs into a usable output. The MapReduce paradigm operates quite elegantly, but it is targeted at offline data analysis. In order to leverage the full power of Hadoop, application data must be stored in the HDFS file system.
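To make the map and reduce phases concrete, here is a plain-Java sketch of the idea (this is not Hadoop's actual API; the class and method names are illustrative only). The mapper emits a (resource, 1) pair for each logged hit, and the reducer sums the values for each key:

```java
import java.util.*;
import java.util.stream.*;

// A plain-Java sketch of the MapReduce idea (not Hadoop's actual API).
public class MapReduceSketch {

    // Map phase: turn each log line into a (key, value) pair.
    static List<Map.Entry<String, Integer>> map(List<String> logLines) {
        return logLines.stream()
                .map(line -> Map.entry(line, 1))
                .collect(Collectors.toList());
    }

    // Reduce phase: group the pairs by key and sum the values.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> hits = List.of("/index.html", "/about.html", "/index.html");
        // The reducer's output is the hit count per resource.
        System.out.println(reduce(map(hits)));
    }
}
```

In real Hadoop the mapper and reducer run as distributed tasks over HDFS blocks, but the shape of the computation is the same: map to key/value pairs, then reduce by key.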
Storm solves a different problem altogether. Storm is interested in understanding things that are happening in realtime -- meaning right now -- and interpreting them. Storm does not have its own file system and its programming paradigm is quite a bit different from Hadoop's. Storm is all about obtaining streams of data from sources known as spouts (like a Twitter feed or live web traffic to your site) and passing that data through various processing components, known as bolts. Storm's data processing mechanism is extremely fast and is meant to help you identify live trends as they are happening. Unlike Hadoop, Storm doesn't care what happened yesterday or last week.
Some use cases have shown that Storm and Hadoop can work beautifully together (see Resources). For instance, you might use Storm to dynamically adjust your advertising engine to respond to current user behavior, then use Hadoop to identify the long-term patterns in that behavior. The important point is that you don't have to choose between Storm and Hadoop; rather, work to understand the problem you are trying to solve and then choose the best tool for the job.
Storm has its own vernacular, but if you've studied Hadoop and other distributed data processing systems you should find its basic architecture familiar.
At the highest level, Storm is composed of topologies. A topology is a graph of computations: each node contains processing logic, and each edge indicates how data is passed from one node to the next.
Inside topologies you have networks of streams, which are unbounded sequences of tuples. Storm provides a mechanism to transform streams into new streams using spouts and bolts. Spouts generate streams, pulling data from a source such as Twitter or Facebook and publishing it in an abstract format. Bolts consume input streams, process them, and then optionally generate new streams.
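The spout-and-bolt pipeline can be sketched in plain Java with hypothetical interfaces (this is not Storm's actual API, and `Spout`, `Bolt`, and `run` are names invented for illustration). A spout emits a stream of tuples; each bolt consumes a stream and emits a new one:

```java
import java.util.*;
import java.util.stream.*;

// A plain-Java sketch of Storm's spout/bolt idea using hypothetical
// interfaces (not Storm's actual API).
public class TopologySketch {

    // A spout produces a stream of tuples from some source -- here a
    // canned list stands in for a live feed such as Twitter.
    interface Spout {
        Stream<String> emit();
    }

    // A bolt consumes an input stream and produces an output stream.
    interface Bolt {
        Stream<String> process(Stream<String> in);
    }

    // Wiring spout -> splitter bolt -> normalizer bolt forms a tiny topology.
    static List<String> run(List<String> messages) {
        Spout feed = messages::stream;
        Bolt splitter = in -> in.flatMap(msg -> Arrays.stream(msg.split(" ")));
        Bolt normalizer = in -> in.map(String::toLowerCase);
        return normalizer.process(splitter.process(feed.emit()))
                         .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("Storm IS fast", "Hadoop is batch")));
    }
}
```

In real Storm the spout would never run dry and the bolts would run in parallel across a cluster, but the shape is the same: streams flow from a spout through a chain of bolts, each transforming its input into a new stream.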