If You Care About Big Data, You Care About Stream Processing

As the scale of data grows across organizations with terabytes and petabytes coming into systems every day, running ad hoc queries across the entire dataset to generate important metrics and intelligence is no longer feasible. Once the quantum of data crosses a threshold, even simple questions such as what is the distribution of request latencies becomes infuriatingly slow with the usual sql and database model. Imagine running such a request on the Facebook request data, the query would take days to complete.

Stream processing offers the solution for anyone looking to manage an ever increasing volume of data.

What is stream processing

Stream processing is the handling of units of data on a record-by-record basis or over sliding time windows. This methodology limits the amount of data processed and offers a very contrasting way of looking at the data and the analytics. Complexity is traded in for speed. The nature of the analytics with stream processing are usually filtering, correlations, sampling and aggregations over the data.

Examples of the diverse applications of stream processing include:

Alerting on sensor data on IoT devices

Log analysis and statistics on web traffic

Risk analysis with movements of money and orders in Fintech

Click stream analytics on Ad Networks

Comparison to Batch Processing:

	Batch processing	Stream processing
Scope	Queries over all or most of the data in the dataset.	Queries or processing over data within a rolling time window, or on just the most recent data record.
Size	Large numbers of records	Individual records or micro batches consisting of a few records.
Performance	Latencies in minutes to hours.	Requires latency in the order of seconds or milliseconds.
Analyses	Complex analytics.	Simple response functions, aggregates, and rolling metrics.

Be truly real-time

Stream processing is the only way applications can truly be real time with their data. As soon as we talk about aggregations over the universe of data, any batch process job will continue to take longer as the amount of data increases. If you have an application that relies on real time metrics, without a stream processing solution, your view of the world is always going to be delayed.

Flow data over queries

To harness the power of stream processing, engineers and data scientists need to evolve their model from one where they run queries over their data to one where the data runs over the queries. This is a powerful shift in the way to look at the data applications.

The key shift is from tables to streams and doing operations on those streams such as

filtering

joining

aggregating

windowing

Lets compare the contrast in the batch processing world vs the stream processing world in the example application of Calculating the distribution of number of user sessions per user in a day on the website.

In batch processing: a complicated query would define what events to include in a user session, what the maximum delay between events should be for a session and computing a unique count over them. Since events in tables can occur out of order, even if a new table is maintained every record in the table would need to be traversed for an answer

In stream processing: A stream of events would be simply be filtered to produce a stream of relevant events, this stream would then be windowed over a delay window to produce a stream of unique sessions. Finally this stream would be aggregated upon for unique counts per user with the relevant number simply being appended to a key value store.

As this example demonstrates, a less computationally intense and more real-time answer can be obtained simply be switching the way the data is perceived

Easier than ever before

Managing stream processing applications is not easy with all the moving pieces of managing issues such as fault-tolerance, partitioning and scaling and durability of the data. Over the last 7 years a lot of systems have emerged; many of which are open source and handle a lot of these issues entirely. This leaves the developer to only worry about implementing the application logic.

Some of the most exciting examples of have only launched in the last couple of years: Spark Streaming (2015), Apache Beam (2016) and Kafka Streams (2016). A full comparison of the various popular options are in figure 2.

Any person developing applications with big data should keep the models of stream processing their mind as they choose the architecture and stack for their system. While stream processing is not a panacea for all the tasks around large scale data processing, it offers a new and important way to look at data that allows scalable real-time

Credits:

Figure 1: Information obtained from AWS Documentation

Figure 2: Image Courtesy: Ian Hellström on Twitter

Like this article? Subscribe to our weekly newsletter to never miss out!

Follow @DataconomyMedia

If you care about Big Data, you care about Stream Processing

Related Posts

What 53,000 Churches Reveal About the Digital Transformation of Faith Communities

Xenco Medical wins back-to-back honors with Fast Company’s 2026 World Changing Ideas Award and Time Magazine 2026 Impact Award

Data Sovereignty and Document Security: Where Does the Data Actually Live?

How Public Web Data Can Strengthen Environmental Protection

How automation tools are being integrated into professional networking

Autonomous agentic UI orchestration for high-throughput enterprise ecosystems

LATEST NEWS

Bitcoin climbs above $65,000 as sentiment improves

X to open-source full codebase in August, Musk says

God of War Laufey launches on February 16th

Apple delays smart glasses over privacy concerns

Huawei expected to unveil new devices in August

Apple prepares biggest iPhone redesign since iPhone X

BEST AI MODELS LEADERBOARD

LATEST TOOLS

Amanda AI

InterviewBot

VernAI

MyLoans

Essay Grader AI

Cover Letter AI

Animate Old Photos

Resume.io

MonAI

AIEngine Plugin

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.