Open source Java projects: Storm

Parallel realtime computation for unbounded data streams


A stream can be simple, consumed by a single bolt, or complex enough to require multiple streams and hence multiple bolts. A bolt can do almost anything: run functions, filter tuples, perform stream aggregation, join streams, or even execute calls to a database.
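To make that concrete, here is a minimal bolt sketch. The class name, field names, and filtering logic are hypothetical, and the imports assume the backtype.storm package layout used by early Storm releases:

import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// Hypothetical filtering bolt: only tuples representing article pages
// are passed downstream to the next bolt.
public class ArticleFilterBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        // Read named values from the incoming tuple
        String pageType = input.getStringByField("pageType");
        if ("article".equals(pageType)) {
            // Emit a new tuple for the next bolt in the topology
            collector.emit(new Values(input.getStringByField("url")));
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Declare the field name of the tuples this bolt emits
        declarer.declare(new Fields("url"));
    }
}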

Figure 1 shows the relationship between spouts and bolts within a topology.

Figure 1. Spouts and Bolts within a Storm topology

As Figure 1 illustrates, a Storm topology can have multiple spouts, and each spout can feed one or more bolts. Additionally, each bolt can stand on its own or have additional bolts listening to it. For example, you might have a spout that observes user behavior on your site and a bolt that uses a database field to categorize the types of pages users are accessing. Another bolt might handle article views, adding a tick to a recommendation engine for each unique view of each article type. You would simply configure the relationships between spouts and bolts to match your business logic and application demands.
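In code, that wiring is done with Storm's TopologyBuilder. The component names and spout/bolt classes below are hypothetical stand-ins for the scenario just described, and this fragment would live inside your topology-building method:

import backtype.storm.topology.TopologyBuilder;

// Wire hypothetical spouts and bolts together; the numeric arguments
// are parallelism hints for each component.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("user-activity", new UserActivitySpout(), 2);
builder.setBolt("page-categorizer", new PageCategorizerBolt(), 4)
       .shuffleGrouping("user-activity");    // listens to the spout
builder.setBolt("recommendation-ticker", new RecommendationTickerBolt(), 2)
       .shuffleGrouping("page-categorizer"); // listens to the previous bolt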

As previously mentioned, Storm's data model is represented by tuples. A tuple is a named list of values, and each value can be of any type: Storm supports all primitive types, Strings, and byte arrays out of the box, and you can build your own serializer if you want to use your own object types. Your spouts will "emit" tuples and your bolts will consume them. Your bolts may also emit tuples if their output is destined to be processed by another bolt downstream. Emitting tuples, in short, is the mechanism for passing data from a spout to a bolt, or from one bolt to another.
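The emitting side looks like this in a spout. The sketch below hard-codes a single page-view tuple purely for illustration; a real spout would read from a queue, log, or API:

import java.util.Map;

import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;

public class PageViewSpout extends BaseRichSpout {
    private SpoutOutputCollector collector;

    @Override
    public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void nextTuple() {
        // Emit a named list of values -- the tuple -- into the stream
        collector.emit(new Values("user-42", "/articles/storm-intro"));
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Give each position in the tuple a name that bolts can query
        declarer.declare(new Fields("userId", "url"));
    }
}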

A Storm cluster

I've covered the main vocabulary for the structure of a Storm-based application, but deploying to a Storm cluster introduces another set of terms.

A Storm cluster is somewhat similar to a Hadoop cluster, but while a Hadoop cluster runs MapReduce jobs, a Storm cluster runs topologies. As mentioned earlier, one of the primary differences between MapReduce jobs and topologies is that a MapReduce job eventually ends, while a topology is destined to run until you explicitly kill it. Storm clusters define two types of nodes:

  • Master node: This node runs a daemon process called Nimbus. Nimbus is responsible for distributing code across the cluster, assigning tasks to machines, and monitoring the success and failure of units of work.
  • Worker nodes: These nodes run a daemon process called the Supervisor. A Supervisor listens for work assigned to its machine, then starts and stops worker processes accordingly. Each worker process executes a subset of a topology, so that the execution of a topology is spread across many worker processes running on many machines. (A sketch of submitting a topology to such a cluster follows this list.)
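Once a topology is defined, you hand it to Nimbus, which distributes it across the worker nodes. A minimal sketch, reusing the hypothetical spout and bolt classes from earlier:

import backtype.storm.Config;
import backtype.storm.StormSubmitter;
import backtype.storm.topology.TopologyBuilder;

public class SubmitPageViewTopology {
    public static void main(String[] args) throws Exception {
        // Wire up the topology as shown earlier
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("user-activity", new UserActivitySpout(), 2);
        builder.setBolt("page-categorizer", new PageCategorizerBolt(), 4)
               .shuffleGrouping("user-activity");

        Config conf = new Config();
        conf.setNumWorkers(4); // spread execution across four worker processes

        // Nimbus distributes the code and assigns the work to Supervisors
        StormSubmitter.submitTopology("page-view-topology", conf, builder.createTopology());
    }
}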

Storm and ZooKeeper

Sitting between Nimbus and the various Supervisors is ZooKeeper, an Apache open source project. ZooKeeper's goal is to enable highly reliable distributed coordination; in a Storm cluster it acts as the centralized service through which Nimbus and the Supervisors coordinate.
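Concretely, each node finds the ZooKeeper ensemble (and the Nimbus host) through its storm.yaml configuration file. A minimal sketch with placeholder hostnames, using key names from the Storm releases current at the time of writing:

storm.zookeeper.servers:
  - "zk1.example.com"
  - "zk2.example.com"
storm.zookeeper.port: 2181
nimbus.host: "nimbus.example.com"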


Resources
  • Download the source code for this article's demonstration app.
