Streaming data is a rapidly emerging category for new application development. The interest in apps that use streaming data is driven by the growth of machine data (think IoT and machine-to-machine communications) and the desire to improve online customer engagement through real-time personalization techniques. Definitions of streaming data applications vary, but they typically revolve around three essential capabilities: real-time data ingestion, real-time analysis, and automated decision making.
Stream processing solutions can also take many forms. Common architectures include tuple-at-a-time stream processing tools (aka complex event processing tools); batch-oriented stream processing solutions that trade latency for flexibility and robustness; and in-memory databases that can transact in real time. While these approaches can be used to solve many problems, they aren't all created equal in terms of developer effort, maintenance overhead, capabilities, or performance. In particular, it's important to look at the number of components required to provide the functionality you need. The amount of development, testing, and operations work to run a system is proportional to the number of components that must be glued together to solve a problem.
Case in point: Apache Storm is one of the most popular tools available to developers working on streaming apps, but a stream processing solution built using Storm is often composed of much more than Storm alone. As I discussed in a previous article, Storm typically works with Kafka to manage ingestion. Both Storm and Kafka require ZooKeeper for management and failure handling. Finally, as I discuss below, a fourth system is required to manage state. Thus, when all is said and done, a three-node Storm cluster could easily require 12 nodes!
Apache Storm and fast data applications
Apache Storm is a streaming data framework that can run snippets of user code against every data event in near-real time. It’s really a series of pipes that can be connected together. Essentially developers connect an incoming data source to a back-end data store while running some code on an intermediate path. Storm’s latency is closer to CEP (complex event processing) than to Hadoop-based systems. The do-it-yourself nature of writing simple code to process events while letting the framework manage the broader picture is appealing to developers who have the time to tinker.
Storm is often used for simple analytics tasks, such as counting, as well as for cleaning, normalizing, and preparing ingested data for long-term storage. However, Storm is not capable of stateful operations, which are essential in making real-time decisions. The analysis of stateful data is at the heart of fast data applications such as personalization, customer engagement, recommendation engines, alerting, authorization, and policy enforcement. Without additional components such as ZooKeeper and Cassandra, Storm is unable to look up dimension data, update an aggregate, or act directly on an event (that is, make real-time decisions).
These examples are part of a broader class of applications pertaining to "fast" data. Developers are recognizing the value of understanding and acting on data in real time, and fast data tools are augmenting simple streaming approaches and traditional analytic stores to realize that value. Considering the problems and not the technologies involved, what are the common themes of a fast data application?
First, there’s the ingestion piece. Whether it’s log data, sensor data, financial ticks, click streams, online behavior, or something else, fast data applications have the ability to ingest a fire hose of data.
The second step is to extract value from that ingestion in real time. Many systems can capture a stream and load it into an OLAP system for analysis, but that kind of analysis can involve long delays between the source and the resulting action, undermining the ability to make decisions in real time or on a per-event basis. Fast data systems provide the ability to act on a per-event basis but also can leverage existing state, such as dimension tables and historical data, to support that processing. This processing can support dashboards and alerting for human decision support, but the low-latency nature of the system is well suited to automated decision making -- say, for this type of user with X resources and Y conditions, recommend action Z.
Lastly, fast data integrates with big data. One of the fundamental assumptions of the fast data infrastructure is that data has a lifecycle. Fast data systems extract real-time value, while OLAP and Hadoop/HDFS-based systems provide deeper analysis, including machine learning, back-testing, and ad-hoc historical queries. Fast data systems can not only enrich data and forward it into these long-term stores, but also consume the output, often learning real-time processing rules as deduced by a rule-learning engine in a deep-analysis system.
What Storm solves and what it doesn’t
To understand where Storm fits when trying to tackle fast data applications, it’s easiest to look at what problems Storm set out to solve and what problems it left to developers. The Storm framework is designed to move data from sources to user processing in a horizontally scalable and failure-tolerant way. It provides at-least-once or at-most-once ingestion semantics, and it has the power to restart work if processes fail.
What’s notably absent is any awareness of persistent state. Fast data applications need state for lots of reasons. Keeping previously seen tuples around, in raw or aggregate form, allows the processing engine to look for patterns in data. Having dimension data allows the processing engine to enrich tuples with broader understanding before performing any analytics. Often decisions are based on dynamic rules that reside in persistent storage.
Storm punts on this problem, letting developers manage access to persistent state on their own. There are two approaches to persistent state in Storm. First, each worker process in a Storm cluster can store its own state by adding a local third-party store. Memcached and Redis are popular local stores. Adding local storage allows the processing of each tuple to consider previous tuples that this particular worker process has seen. The downside to this build-it-yourself approach: It isn’t integrated into Storm’s fault-tolerance code. Storm must handle failed worker processes or failed hardware by restarting a new worker process on a running machine and migrating the tuples from the old worker to the new. This migration can’t include any local state, whether it resides in the process or in a system like Memcached. The replacement worker process loses all context the failed worker had built.
Additionally, it is difficult or impossible to directly query the distributed state split among all of these worker processes. If each process is maintaining aggregate statistics, those statistics must be pushed to another system to be queried with powerful tools. Direct fetches of some data may be possible using distributed remote procedure calls in Storm, but these are severely limited compared to the query functionality in modern SQL or NoSQL data stores.
The second approach to state is to connect Storm worker processes to a central or distributed persistent store, such as Cassandra or HBase. With this approach, Storm processing code can access gigabytes or terabytes of state, and the persistent store can be queried easily and independently from the Storm system. This solves both problems with the local storage described above: State doesn’t need to be rebuilt in the wake of worker process failure, and the store likely will support more powerful queries. While this approach solves these problems, it introduces significant complexity and reliability issues.
It’s important to note that the coupling of Storm and separate storage can cripple performance. Processing a tuple in memory is easy to do at tremendous speed. Maintaining redundant state storage with indexing to support queries, while likely offering some kind of consistency guarantees, is much more expensive. The Storm system will be limited by the performance of the distributed store. Using an external store for queries also means substantial network latency for every processed tuple. Hiding that latency means more parallelism, more management, and more complexity. And now you have two systems that can fail.
The Storm worker process needs to handle the case where the distributed store is not available, adding complexity. When using at-least-once semantics in Storm, developers also need to use at-least-once semantics in the distributed store, if this is even possible without complex user code.
Scale-out SQL and fast data processing
Modern in-memory, ACID-compliant, SQL-relational databases can power previously impossible applications more simply and easily than Storm and other alternatives. Relational databases with ACID transactions simplify data ingestion when either at-most-once or at-least-once semantics are desired. Because operations are atomic, they either complete or roll back entirely.
Many relational systems allow database-hosted, transactional logic; some even allow the use of Java code, just like Storm. Thus, much of the logic and flexibility afforded by Storm can be had using relational systems. But unlike in Storm, server-side logic in a relational database has a direct and carefully managed path to stateful data, eliminating the challenges that arise from mixing Storm and a distributed or partitioned persistent store.
What motivated Nathan Marz to create Storm when systems like PostgreSQL existed? PostgreSQL and its ilk are neither horizontally scalable nor capable of the tuple processing throughput of a system like Storm. But the last five years have seen several new entries into the transactional storage space, built specifically to address horizontal scaling, fault-tolerance, and raw throughput. Some of these systems are as scalable and fault tolerant as Storm, as well as offer the powerful processing semantics of transactions combined with direct state manipulation at the point of processing.
In short, these new platforms offer the processing/logic of Storm, the stateful capability of Cassandra, and the ingestion semantics of Kafka. Even more important, a distributed, transactional (ACID), SQL-relational database can process thousands to millions of incoming events (or application requests) per second.
For fast data applications, developers seek a more complete stream processing platform than Storm and other stream processing alternatives. Storm requires a companion database to serve any analytics it produces. By contrast, with certain scale-out, SQL databases, developers can calculate, store, and query real-time analytics directly from the database. Data ingestion pipelines that filter, enrich, and group incoming event streams require accessing look-up data in addition to the incoming event feed. Storm requires a companion database to store that look-up data.
Storm falls short of meeting the requirements developers have for fast data applications. A distributed, in-memory relational database offers a simpler, more capable, more interactive solution suited to a broader set of fast data use cases.
John Hugg, VoltDB senior architect, has spent his career working with databases and information management. As the first engineer on the VoltDB product, he liaised with the team of academics at MIT, Yale, and Brown building H-Store, VoltDB's research prototype. John also helped build the world-class engineering team at VoltDB to continue development of the open source and commercial products.
This story, "Beyond Storm for streaming data applications" was originally published by InfoWorld.