Storm, a top-level Apache project, is a Java framework designed to help programmers write real-time applications that run on distributed clusters. Created at BackType and open-sourced by Twitter, Storm excels at processing high-volume message streams to collect metrics, detect patterns, or trigger actions when certain conditions appear in the stream. Typical Storm scenarios sit at the intersection of real time and high volume, such as analyzing financial transactions for fraud or monitoring cell-tower traffic to maintain service-level agreements.
Traditionally these sorts of systems have been constructed as a network of computers connected by a message bus (such as JMS). What makes Storm different is that it combines the message-passing and processing infrastructure into a single conceptual unit known as a "topology" and runs it on a cluster of machines. This means that Storm clusters offer linear scalability and fault tolerance comparable to Hadoop's, without the need to reconfigure a messaging bus when increasing capacity.
When working with teams new to Storm, I have found it helpful to approach system design from three dimensions: operations, topology, and data. These roughly map onto the corresponding concerns in traditional enterprise applications, translated into the Hadoop world. A Storm topology is a processing workflow, analogous to a set of steps in a processing pipeline that would be managed by Oozie in a multipurpose Hadoop cluster.
The topology is the fundamental unit of deployment in Storm. It consists of two types of components: spouts (message sources) and bolts (message processors). Ready-made spouts are available for many common data sources such as JMS, Kafka, and HBase.
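The spout/bolt pipeline can be sketched in plain Java without any Storm dependencies. This is a minimal illustration of the model only, not real Storm code: in an actual topology, `SentenceSpout` and `WordCountBolt` would implement Storm's `IRichSpout` and `IRichBolt` interfaces and be wired together with a `TopologyBuilder`, and Storm itself would distribute the tuple flow across the cluster. The class and method names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of Storm's spout/bolt model (hypothetical names, no Storm APIs).
public class TopologySketch {

    // Spout: a message source. In Storm this would implement IRichSpout.
    static class SentenceSpout {
        private final String[] sentences = {
            "storm processes streams", "storm counts words"
        };
        private int index = 0;

        // Emit the next tuple, or null when the source is exhausted.
        String nextTuple() {
            return index < sentences.length ? sentences[index++] : null;
        }
    }

    // Bolt: a message processor. In Storm this would implement IRichBolt.
    static class WordCountBolt {
        final Map<String, Integer> counts = new HashMap<>();

        // Process one tuple: split the sentence and tally each word.
        void execute(String sentence) {
            for (String word : sentence.split("\\s+")) {
                counts.merge(word, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        SentenceSpout spout = new SentenceSpout();
        WordCountBolt bolt = new WordCountBolt();

        // The "topology": route tuples from spout to bolt. Storm performs
        // this routing across worker processes on the cluster.
        String tuple;
        while ((tuple = spout.nextTuple()) != null) {
            bolt.execute(tuple);
        }
        System.out.println(bolt.counts.get("storm")); // prints 2
    }
}
```

The point of the sketch is the division of labor: spouts only emit tuples, bolts only process them, and the topology declares how they connect, leaving distribution, parallelism, and fault tolerance to the framework.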