Dataconomy

Amazon Kinesis vs. Apache Kafka For Big Data Analysis

by Christopher Low
May 26, 2017
in Big Data, Data Science

Data processing today is done in the form of pipelines: a series of steps such as aggregation, sanitization, and filtering, followed by the application of statistical models to generate insights. Amazon Kinesis is a platform for building pipelines that handle streaming data at the scale of terabytes per hour. Parts of the Kinesis platform compete directly with the Apache Kafka project for big data analysis. The platform is divided into three separate products: Firehose, Streams, and Analytics. Each solves a different problem, as discussed below:

Table of Contents

  • How to load huge amounts of data into the pipeline?
  • How to direct live input streams into the pipeline?
  • How to cherry-pick data from the pipeline (and obtain long-running queries)?

How to load huge amounts of data into the pipeline?

This problem is solved by Firehose, the entry point of data into the AWS ecosystem. Kafka does not have an equivalent to Firehose. This product is specific to Amazon's ecosystem of offerings: Firehose can load incoming data from various sources, buffer it into larger chunks, and forward it to other AWS services such as S3, Redshift, and Lambda. Firehose also solves the backpressure problem, which arises when the input buffer of a service cannot keep up with the output buffer of the service feeding data into it. Firehose automatically scales up or down to match its throughput to the data it receives. It can also batch, compress, and encrypt the data before feeding it into other AWS services. Amazon recently added several other types of data transformations that run in the pipeline before the data is loaded.
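
As a rough illustration, here is a minimal sketch of how an application might hand records to Firehose with the boto3 SDK. The delivery stream name and event shape are hypothetical; the stream itself and its destination (S3, Redshift, etc.) are configured separately in AWS.

```python
# A minimal sketch, assuming boto3 is configured with AWS credentials.
# The delivery stream name "clickstream-to-s3" and the event shape are
# hypothetical examples.
import json
import boto3

firehose = boto3.client("firehose", region_name="us-east-1")

def send_event(event: dict) -> None:
    # Firehose buffers these records into larger chunks before
    # delivering them to the configured destination.
    firehose.put_record(
        DeliveryStreamName="clickstream-to-s3",
        Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
    )

send_event({"user_id": 42, "action": "page_view", "path": "/pricing"})
```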

How to direct live input streams into the pipeline?

Amazon Kinesis Streams is very similar to Kafka in that it is built to work with live input streams. It stores the streams that are sent to it, and those streams can then be consumed by custom applications written with the Kinesis Client Library.

Kafka “topics” are roughly equivalent to Kinesis streams. Both represent an ordered, immutable list of messages. Each message has a unique identifier and is appended to the list as it arrives. Kafka messages can be retrieved from the last known message onwards using an offset; to get the same functionality in Kinesis, the user has to build it themselves using the API and message sequence numbers.
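
A sketch of what that looks like in practice, assuming boto3, a stream named "events", and hypothetical process() and checkpoint() helpers supplied by the application: the last handled sequence number is persisted and reading resumes with an AFTER_SEQUENCE_NUMBER shard iterator, whereas a Kafka consumer would simply commit and seek to an offset.

```python
# A sketch, assuming boto3 and application-supplied helpers.
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def resume_from(shard_id: str, last_seq: str) -> None:
    # AFTER_SEQUENCE_NUMBER picks up right after the last record the
    # application checkpointed -- the Kinesis analogue of a Kafka offset.
    iterator = kinesis.get_shard_iterator(
        StreamName="events",
        ShardId=shard_id,
        ShardIteratorType="AFTER_SEQUENCE_NUMBER",
        StartingSequenceNumber=last_seq,
    )["ShardIterator"]
    while iterator:
        resp = kinesis.get_records(ShardIterator=iterator, Limit=100)
        for record in resp["Records"]:
            process(record["Data"])               # hypothetical handler
            checkpoint(record["SequenceNumber"])  # hypothetical persistence
        iterator = resp.get("NextShardIterator")
```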


Topics in Kafka consist of one or more partitions, which are similar to shards in Kinesis Streams. Kafka is scaled by watching for hot partitions and adding partitions (and brokers) as needed; Kinesis Streams is scaled by splitting and merging shards. Because Kafka is self-hosted, it carries extra operational overhead compared to Kinesis: managing the clusters, setting up monitoring and alerting, updating packages, tuning, and failover management. The actual cost comparison, however, depends on other factors such as expected payload size, flow density, and retention period. Kafka can be fine-tuned to achieve sub-second latencies, while Kinesis Streams typically has a latency of 1-3 seconds. Kinesis Streams is the better fit when payloads are large and throughput is high, and latency matters less.
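
To make the scaling model concrete, here is a hedged sketch of splitting a hot shard with boto3 (the stream name "events" is hypothetical). Splitting at the midpoint of the shard's hash-key range divides its traffic roughly in half, assuming uniformly distributed partition keys.

```python
# A sketch, assuming boto3 and a stream named "events".
import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

def split_hot_shard(shard: dict) -> None:
    lo = int(shard["HashKeyRange"]["StartingHashKey"])
    hi = int(shard["HashKeyRange"]["EndingHashKey"])
    kinesis.split_shard(
        StreamName="events",
        ShardToSplit=shard["ShardId"],
        NewStartingHashKey=str((lo + hi) // 2),  # midpoint of the range
    )

# Shards and their hash-key ranges can be listed first, e.g.:
shards = kinesis.describe_stream(StreamName="events")["StreamDescription"]["Shards"]
split_hot_shard(shards[0])
```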

Kafka provides durability by replicating data over multiple broker nodes, while Kinesis Streams does so by replicating data across multiple availability zones. With Amazon's offering the user is not required to configure a replication strategy, leaving more time to focus on the big data analysis itself.

A good use case for streaming data is analyzing clickstreams on websites and serving real-time recommendations based on the insights gained. Streams are very useful in financial trading as well: fintech requires very high-throughput data processing that can surface patterns and exceptions on the fly. This can be used to build automated trading systems that make decisions on real-time data. Financial models rely on the analysis of historical data, and Streams lets the user update those models as the data arrives. The same minute-by-minute analysis is needed for real-time dashboards that monitor activity and uptime of critical services across various businesses. Streams are also useful in real-time billing and metering systems operating at scales where accounting for individual unit consumption by hand is not feasible; anomalies can be easily spotted along the way.

But what about sending the data to Streams? Kinesis Streams can take in input data from thousands of endpoints at once. The concept of producers is the same as in Kafka, although the implementations differ. The producer accepts records from higher-level applications, performs batching, splits records by partition/shard, and forwards them to Kinesis Streams or Kafka. Some organizations might require more fine-grained control over the data producer. In that case, a data aggregator like Segment can be used to pull in data from various services, such as mobile and web apps via the Mixpanel integration, and pre-process it before sending it on. Segment has an Amazon Kinesis plugin and a Heroku Kafka plugin for this specific use case.
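
For a sense of how similar the two producer APIs feel, here is a side-by-side sketch sending the same record to Kinesis via boto3 and to Kafka via the third-party kafka-python package (an assumption; the article does not prescribe a client library, and the stream, topic, and broker address are illustrative).

```python
# A side-by-side producer sketch with hypothetical names and endpoints.
import json
import boto3
from kafka import KafkaProducer

record = {"user_id": 42, "action": "click"}
payload = json.dumps(record).encode("utf-8")

# Kinesis: the PartitionKey maps the record to a shard.
boto3.client("kinesis", region_name="us-east-1").put_record(
    StreamName="events",
    Data=payload,
    PartitionKey=str(record["user_id"]),
)

# Kafka: the message key plays the same role, mapping it to a partition.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=str(record["user_id"]).encode(), value=payload)
producer.flush()
```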

How to cherry-pick data from the pipeline (and obtain long-running queries)?

Kafka sits at the ingestion stage of the data processing pipeline and does not have in-built analytics abilities. Kinesis Analytics takes its input from Firehose (data is injected once loaded) or from Streams (data is injected as it arrives). SQL queries can be run against these sources to filter out the relevant data. The results can be processed further with an AWS service like Lambda, and the generated insights routed to other AWS services such as DynamoDB for storage or to services like Slack for notifications. Analytics supports streaming queries, so output is produced in real time rather than pushed in bulk once the job finishes, as with traditional SQL queries.
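
As an illustration of the kind of continuous query Analytics runs, here is a hedged sketch: a tumbling-window SQL aggregation deployed with boto3's kinesisanalytics client. The application name, columns, and window are hypothetical, and a real deployment also needs input/output configuration (stream ARNs, IAM role, record schema) that is omitted here.

```python
# A sketch, assuming boto3. Inputs/Outputs configuration is omitted.
import boto3

application_code = """
CREATE OR REPLACE STREAM "ERROR_COUNTS" (status INTEGER, error_count INTEGER);

CREATE OR REPLACE PUMP "ERROR_PUMP" AS
  INSERT INTO "ERROR_COUNTS"
  SELECT STREAM status, COUNT(*) AS error_count
  FROM "SOURCE_SQL_STREAM_001"
  WHERE status >= 500
  GROUP BY status,
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);
"""

boto3.client("kinesisanalytics", region_name="us-east-1").create_application(
    ApplicationName="error-rate-monitor",  # hypothetical name
    ApplicationCode=application_code,
)
```
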
Pipeline-based data processing is becoming the norm in big data analysis. Because of the modular design of pipeline-based systems, additional processing points can easily be added to the same pipeline, or in parallel processing units to form non-linear pipelines. Ingestion systems like Kinesis Streams and Kafka provide a layer of asynchronous data flow between producers and consumers, Kinesis Firehose helps buffer and inject huge amounts of data into the system, and Kinesis Analytics is used to draw the insights. Choosing between Kinesis Streams and Kafka is an important decision, and it comes down to which system best matches the metrics your workload demands.

 


Tags: Amazon Kinesis, Apache Kafka, AWS, big data analysis, data pipeline, streaming data


Comments (1)

  1. Matthias J. Sax says:
    6 years ago

    This blog post seems to refer to a rather old version of Kafka. Kafka 0.9 (released 2015) added the Connect API, which allows data to be imported to and exported from Kafka. Furthermore, Kafka 0.10 added the Streams API, which enables sophisticated stream processing (event-time support, time/session windows, stream/table duality, stream joins as well as stream/table and table/table joins, fault-tolerant stateful operators, interactive queries, and dynamic scaling). For more details check out the latest docs for 0.10.2.1: http://docs.confluent.io/current/

    Thus “Kafka does not have an equivalent to Firehose.” and “Kafka […] does not have in-built analytics abilities.” are not quite true.

    Last but not least, I want to mention the upcoming 0.11 release that adds exactly-once processing semantics to Kafka and its Streams API. This is also an important factor to consider when choosing a system. (cf. https://www.confluent.io/blog/log-compaction-highlights-in-the-apache-kafka-and-stream-processing-community-may-2017/)

