What is Stream Processing?
Stream processing is a programming paradigm defining applications which, when receiving a sequence of data, treat it as a collection of elements, or datapoints, and rather than group and process them together, process each datapoint by itself. In stream processing, each datapoint is processed as it arrives and independently from other datapoints, unlike batch processing, where datapoints are usually buffered and processed together, in bulk. Therefore, stream processors have become an important building block of real-time applications, as they facilitate acting on event data in real-time, allowing a user access to the real-time state of a system and its data, rather than allowing access to periodical snapshots of it.
Alooma's main product is a data-pipeline, that allows our users to send or pull data from multiple sources, perform computations on the data (e.g., to handle schema changes, clean corrupt data, etc.), and load it to a data warehouse. Stream processing allows our pipeline to have a much shorter latency compared to a batch processing approach. Stateful stream processing enriches the types of computations our users are able to perform on the stream. For example, by aggregating the number of events with certain attributes and counting them over time windows, it is possible to create real-time dashboards of the data in the stream, without ever needing to load the data to a data warehouse.
Stateless vs. Stateful stream processing
In a Stateless stream, the way each event is handled is completely independent from the preceding events. Given an event, the stream processor will treat it exactly the same way every time, no matter what data arrived beforehand.
Stateful stream processing means that a "state" is shared between events and therefore past events can influence the way current events are processed. This state can usually be queried from outside the stream processing system as well. For example, stateful systems can keep track of user sessions (aggregate events coming from the same session, and output only session-level metrics, when the session ends), perform aggregated counts (e.g., count the number of errors in every time window) and more.
Stateless stream processing is easy to scale up, because, by definition, events are processed independently. The stream can be processed by multiple identical processors, with a simple load balancing between them. When the system needs to process a higher throughput of events, you simply launch more processors.
The challenge of stateful stream processing
Stateful stream processing is much more difficult to scale up, because you need the different workers to share the state. A simple solution would be to use an external store (such as a database), but then the performance of the external store limits the performance of your stream processing. Another option is to partition the stream: instead of randomly sending events to processors, you can send events to processors according to some attribute of the events. For example, in the sessions use-case, one could send all events from the same user to the same processor. This way, each processor can handle its own state, significantly improving performance. However, using this approach means you now have multiple states instead of one, so querying it from outside of the stream processing engine becomes more complex.
A couple of months ago we were discussing the reasons behind increasing demand for distributed stream processing. I also stated there was a number of available frameworks to address it. Now it's a time have a look at them and discuss their similarities and differences and their, from my opinion, recommended use cases.
In the previous post we went through the necessary theory and also introduced popular streaming framework from Apache landscape - Storm, Trident, Spark Streaming, Samza and Flink. Today, we're going to dig a little bit deeper and go through topics like fault tolerance, state management or performance.
In Apache Spark 1.6, we have dramatically improved our support for stateful stream processing with a new API. In this blog post, we are going to explain mapWithState in more detail as well as give a sneak peek of what is coming in the next few releases.
Questions? Requests? Leave us a comment below.