As the scale of data grows across organizations, with terabytes and petabytes flowing into systems every day, running ad hoc queries across an entire dataset to generate metrics and intelligence is no longer feasible. Once the volume of data crosses a threshold, even simple questions such as "what is the distribution of request latencies?" become infuriatingly slow under the usual SQL-and-database model. Imagine running such a query over Facebook's request data: it would take days to complete.
Stream processing offers a solution for anyone looking to manage an ever-increasing volume of data.
What is stream processing?
Stream processing is the handling of data on a record-by-record basis or over sliding time windows. This limits the amount of data processed at any one time and offers a sharply contrasting way of looking at data and analytics: complexity is traded for speed. The analytics performed with stream processing are typically filtering, correlations, sampling, and aggregations over the data.
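To make this concrete, here is a minimal sketch in plain Python of a record-by-record aggregation over a sliding time window. The 60-second window, the latency events, and the `on_record` helper are all assumptions for illustration; a real system would consume from a message queue rather than a list.

```python
import time
from collections import deque

# A sliding one-minute window of (timestamp, latency_ms) records.
# Each record is handled exactly once on arrival, and old records
# are evicted, so memory stays bounded however large the stream grows.
WINDOW_SECONDS = 60
window = deque()

def on_record(latency_ms, now=None):
    """Process a single latency measurement as it arrives."""
    now = now if now is not None else time.time()
    window.append((now, latency_ms))
    # Evict records that have slid out of the window.
    while window and window[0][0] < now - WINDOW_SECONDS:
        window.popleft()
    # A simple rolling aggregate over only the windowed data.
    return sum(lat for _, lat in window) / len(window)

# Feeding records one at a time, e.g. as they come off a queue:
for i, latency in enumerate([120, 95, 310, 88]):
    print(f"rolling avg latency: {on_record(latency, now=i):.1f} ms")
```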
Examples of the diverse applications of stream processing include:
- Alerting on sensor data on IoT devices
- Log analysis and statistics on web traffic
- Risk analysis with movements of money and orders in Fintech
- Click stream analytics on Ad Networks
Comparison to batch processing (figure 1):

| | Batch processing | Stream processing |
| --- | --- | --- |
| Scope | Queries over all or most of the data in the dataset. | Queries or processing over data within a rolling time window, or over only the most recent data record. |
| Size | Large numbers of records. | Individual records or micro-batches consisting of a few records. |
| Performance | Latencies in minutes to hours. | Requires latency on the order of seconds or milliseconds. |
| Analyses | Complex analytics. | Simple response functions, aggregates, and rolling metrics. |
Be truly real-time
Stream processing is the only way applications can be truly real-time with their data. For any aggregation over the whole universe of data, a batch job will keep taking longer as the amount of data increases. If your application relies on real-time metrics and lacks a stream processing solution, its view of the world will always be delayed.
Flow data over queries
To harness the power of stream processing, engineers and data scientists need to evolve their model from one where they run queries over their data to one where the data runs over the queries. This is a powerful shift in how we look at data applications.
The key shift is from tables to streams, and to performing operations on those streams (sketched in code after the list), such as:
- filtering
- joining
- aggregating
- windowing
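Here is a rough sketch of that composable style in plain Python, with generators standing in for a real streaming framework. The event shape, the window logic, and the function names are assumptions for illustration only.

```python
from collections import defaultdict

# Each operation consumes a stream (an iterator of events) and
# yields a new stream, so operations compose like a pipeline.

def filtering(events, predicate):
    """Drop events that don't match; the rest flow through."""
    return (e for e in events if predicate(e))

def windowing(events, size_s):
    """Group events into consecutive, non-overlapping time windows."""
    window, window_end = [], None
    for e in events:
        if window_end is None:
            window_end = e["ts"] + size_s
        if e["ts"] >= window_end:
            yield window
            window, window_end = [], e["ts"] + size_s
        window.append(e)
    if window:
        yield window

def aggregating(windows, key, fn):
    """Reduce each window to per-key aggregates."""
    for window in windows:
        groups = defaultdict(list)
        for e in window:
            groups[e[key]].append(e)
        yield {k: fn(v) for k, v in groups.items()}

# Compose: filter down to clicks, window by 60s, count per user.
events = [{"ts": 1, "user": "a", "type": "click"},
          {"ts": 2, "user": "b", "type": "view"},
          {"ts": 70, "user": "a", "type": "click"}]
clicks = filtering(iter(events), lambda e: e["type"] == "click")
for counts in aggregating(windowing(clicks, 60), "user", len):
    print(counts)
```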
Let's compare batch processing and stream processing on an example application: calculating the distribution of the number of user sessions per user per day on a website.
In batch processing: a complicated query would define which events to include in a user session and the maximum delay allowed between events within a session, then compute a unique count over them. Since events in tables can occur out of order, even if a dedicated table were maintained, every record in it would need to be traversed to produce an answer.
In stream processing: a stream of events would simply be filtered to produce a stream of relevant events; this stream would then be windowed over a delay window to produce a stream of unique sessions. Finally, this stream would be aggregated into unique counts per user, with the relevant number simply being appended to a key-value store.
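A minimal sketch of that pipeline in plain Python follows. The session gap, the event shape, and the dict standing in for a key-value store are all assumptions for illustration, and the per-day bucketing is omitted for brevity.

```python
SESSION_GAP_S = 30 * 60   # assumed maximum delay between events in a session
session_counts = {}        # stands in for a key-value store
open_sessions = {}         # user -> timestamp of their last event

def on_event(event):
    """Handle one event from the stream as it arrives."""
    # 1. Filter: keep only events that can belong to a session.
    if event["type"] not in {"click", "pageview"}:
        return
    user, ts = event["user"], event["ts"]
    # 2. Window: a gap larger than SESSION_GAP_S opens a new session.
    last_ts = open_sessions.get(user)
    if last_ts is None or ts - last_ts > SESSION_GAP_S:
        # 3. Aggregate: bump the per-user count in the store.
        session_counts[user] = session_counts.get(user, 0) + 1
    open_sessions[user] = ts

for e in [{"user": "a", "ts": 0, "type": "click"},
          {"user": "a", "ts": 100, "type": "click"},
          {"user": "a", "ts": 4000, "type": "pageview"}]:
    on_event(e)
print(session_counts)   # {'a': 2}: the long gap opened a second session
```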
As this example demonstrates, a less computationally intense and more real-time answer can be obtained simply by switching the way the data is perceived.
Easier than ever before
Managing stream processing applications is not easy, with all the moving pieces involved in fault tolerance, partitioning, scaling, and durability of the data. Over the last seven years, many systems have emerged, a number of them open source, that handle these issues almost entirely. This leaves the developer to worry only about implementing the application logic.
Some of the most exciting examples have only launched in the last couple of years: Spark Streaming (2015), Apache Beam (2016), and Kafka Streams (2016). A full comparison of the popular options is in figure 2.
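As one illustration of how little code is left to write, here is a minimal sketch using the Apache Beam Python SDK. The in-memory input stands in for a real source such as Kafka, and the default local runner is assumed; in production the same pipeline would run unchanged on a distributed runner.

```python
import apache_beam as beam

# The framework handles partitioning, fault tolerance, and scaling;
# the pipeline below is the entire application logic.
with beam.Pipeline() as pipeline:
    (pipeline
     | "Read" >> beam.Create([
           {"user": "a", "type": "click"},
           {"user": "b", "type": "view"},
           {"user": "a", "type": "click"}])
     | "Filter" >> beam.Filter(lambda e: e["type"] == "click")
     | "KeyByUser" >> beam.Map(lambda e: (e["user"], 1))
     | "CountPerUser" >> beam.CombinePerKey(sum)
     | "Write" >> beam.Map(print))
```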
Anyone developing applications with big data should keep the models of stream processing in mind as they choose the architecture and stack for their system. While stream processing is not a panacea for every large-scale data processing task, it offers a new and important way to look at data, one that enables scalable, real-time analytics.
Credits:
Figure 1: Information obtained from AWS Documentation
Figure 2: Image Courtesy: Ian Hellström on Twitter