Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Glossary
    • Whitepapers
  • Newsletter
  • + More
    • Conversations
    • Events
    • About
      • About
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
  • AI
  • Tech
  • Cybersecurity
  • Finance
  • DeFi & Blockchain
  • Startups
  • Gaming
Dataconomy
  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Glossary
    • Whitepapers
  • Newsletter
  • + More
    • Conversations
    • Events
    • About
      • About
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

What Is Google Cloud Dataflow?

byEileen McNulty
August 8, 2014
in Articles
Home Resources Articles

To say that the cloud computing market was exploding would be an understatement. In July, we heard multiple reports supporting the proclamation of cloud as the next revolution in the computing industry. The IDC claimed the cloud computing market at the close of the year would be worth $4 billion in EMEA. In the UK, 78% of organisations have “formally” adopted one or more cloud-based services. Fujitsu recently announced they’ve set aside $2 billion to expand their cloud portfolio. Evidently, the cloud is big business.

The market continues to be dominated by Amazon Web Services, with Microsoft and IBM making serious inroads. But there’s one industry giant missing from this list: Google. In Q2, Microsoft’s cloud infrastructure revenue grew by 164%; Google lagged at only 47%. But Google have a secret weapon in their cloud portfolio, whose release may sky-rocket their market share- Google Cloud Dataflow.

What is Google Cloud Dataflow?

What is Google Cloud Dataflow 2

Cloud Dataflow keynote slide; source
In short, Cloud Dataflow allows you to build pipelines, monitor their execution, and transform & analyse data, all in the cloud. In a “sneak peek” blogpost, Google stated Cloud Dataflow will allows you to gain “actionable insights from your data while lowering operational costs without the hassles of deploying, maintaining or scaling infrastructure.” It’s still currently in private beta, but here’s an overview of what we know so far:

  • It’s multifunctional- As a generalisation, most database technologies have one speciality, like batch processing or lightning-fast analytics. Google Cloud Dataflow counts ETL, batch processing and streaming real-time analytics amongst its capabilities.
  • It aims to address the performance issues of MapReduce when building pipelines- Google was the first to develop MapReduce, and the function has since become a core component of Hadoop. Cloud Dataflow has now largely replaced MapReduce at Google, which the company apparently stopped using “years ago”, according to Urs Hölzle, Google’s Senior VP of Technical Infrastructure.
  • It’s good with big data- Hölzle stated that MapReduce performance started to sharply decline when handling multipetabyte datasets. Cloud Dataflow apparently offers much better performance on large datasets.
  • The coding model is pretty straightforward- The Google blog post describes the underlying service as “language-agnostic”, but the first SDK is for Java. All datasets are represented in PCollections (“parallel collections”). It includes a “rich” library of PTransforms (parallel transforms), including ParDo (similar to Map and Reduce functions and WHERE in SQL), and GroupByKey (similar to the shuffle step of MapReduce and GROUPBY and JOIN in SQL). A starter set of these transforms can be used out of the box, including Top, Count and Mean.
  •  It “evolved” from Flume and Millwheel- Flume lets you develop and run parallel pipelines for data processing. Millwheel allows you to build low-latency data-processing applications.

What Does Cloud Dataflow Mean For Existing Google Cloud Customers?

What is Google Cloud Dataflow

Google Cloud Dataflow keynote slide; source

Dataflow is designed to complement the rest of Google’s existing cloud portfolio. If you’re already using Google BigQuery, Dataflow will allow you to clean, prep and filter your data before it gets written to BigQuery. Dataflow can also be used to read from BigQuery if you want to join your BigQuery data with other sources. This can also be written back to BigQuery.

Are Google the Only Major Players Tapping into Data Flow?


Facebook have already developed a data flow architecture called Flux. The video above, explaining Facebook’s data flow project Flux, is a pretty good example of a data flow architecture, and demonstrates theirs at work within the Facebook messaging system. As the video explains, Flux “avoids cascading affects by preventing nested updates”- simply put, Flux has a single directional data flow, meaning additional actions aren’t triggered until the data layer has completely finished processing.

FlumeJava, from which Cloud Dataflow evolved, is also involved the process of creating easy-to-use, efficient parallel pipelines. At Flume’s core are “a couple of classes that represent immutable parallel collections, each supporting a modest number of operations for processing them in parallel. Parallel collections and their operations present a simple, high-level, uniform abstraction over different data representations and execution strategies.”

Many see Cloud Dataflow as a competitor to Kinesis, a managed service designed for real-time data streaming developed by industry leaders Amazon Web Services. Kinesis allows you to write applications for processing data in real-time, and works in conjunction with other AWS products such as Amazon Simple Storage Service (Amazon S3), Amazon DynamoDB, or Amazon Redshift.

Does Google Cloud Dataflow Mean the Death of Hadoop and MapReduce?

Since Cloud Dataflow is being used in place of MapReduce in the Google offices, and Google have marketed Cloud Dataflow as having “evolved” from MapReduce, many have been proclaiming the death of MapReduce, and also Hadoop, of which MapReduce is the core component.

On the subject, Ovum analyst Tony Baer told InfoWorld Cloud Dataflow forms “part of an overriding trend where we are seeing an explosion of different frameworks and approaches for dissecting and analyzing big data. Where once big data processing was practically synonymous with MapReduce, you are now seeing frameworks like Spark, Storm, Giraph, and others providing alternatives that allow you to select the approach that is right for the analytic problem.”

It is true MapReduce use in the decline. But that’s why Hadoop 2.0 introduced YARN, which allows you to circumvent MapReduce and run multiple other applications in Hadoop which all share common cluster management. One application that’s gained considerable attention is Spark; as InfoWorld states, which can perform map and reduce in-memory, making it much faster than MapReduce. Of course, such applications can run on top of Hadoop, so whilst there are now many different approaches to MapReduce, it doesn’t mean Hadoop is dead. Current Hadoop users have all of their data stored on-premise, and it’s unlikely that a considerable number of these users are going to migrate all of their data to the cloud to use Cloud Dataflow. In short: Hadoop is safe for now.

This post will be updated as and when further updates about Cloud Dataflow are announced, to give you an up-to-date guide on advancements ahead of its release.

Follow @DataconomyMedia


Eileen McNulty-Holmes – Editor

1069171_10151498260206906_1602723926_n

Eileen has five years’ experience in journalism and editing for a range of online publications. She has a degree in English Literature from the University of Exeter, and is particularly interested in big data’s application in humanities. She is a native of Shropshire, United Kingdom.

Stay Ahead of the Curve!

Don't miss out on the latest insights, trends, and analysis in the world of data, technology, and startups. Subscribe to our newsletter and get exclusive content delivered straight to your inbox.

Email: [email protected]


Tags: awscloudFlumeGooglegoogle cloud dataflowWeekly Newsletter

Related Posts

Best ELD devices and fleet management tools 2025: Top picks for trucking companies

Best ELD devices and fleet management tools 2025: Top picks for trucking companies

September 18, 2025
Zen Media and Optimum7 Merge to Create AI-Native Growth Agency: Why Data Is at the Core

Zen Media and Optimum7 Merge to Create AI-Native Growth Agency: Why Data Is at the Core

September 18, 2025
How wedding photographers save hours with SoftOrbits batch editing

How wedding photographers save hours with SoftOrbits batch editing

September 11, 2025

Digital inheritance technology by Glenn Devitt addresses the $19T asset transfer problem

September 5, 2025
Earn Stable Crypto Passive Income in 2025 with 5 Best AI Crypto Coin Staking Cloud Mining Platforms

Earn Stable Crypto Passive Income in 2025 with 5 Best AI Crypto Coin Staking Cloud Mining Platforms

September 4, 2025
Why BPM tools are essential for the future of Business Process Automation

Why BPM tools are essential for the future of Business Process Automation

September 3, 2025
Please login to join discussion

LATEST NEWS

Xiaomi to launch 17, 17 Pro, and 17 Pro Max series in China on September 25

Next-gen PCIe 8.0 standard promises 1TB/s bandwidth for AI and quantum workloads

Nvidia Drive AGX Thor to power robotaxi project

Poll: Half of Taiwan fears TSMC becoming US-SMC

From Pilot to Policy: RYT Gathers Global Leaders at TOKEN2049

Nvidia and OpenAI announce landmark $100 billion partnership, igniting global stock rally

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy

Follow Us

  • News
    • Artificial Intelligence
    • Cybersecurity
    • DeFi & Blockchain
    • Finance
    • Gaming
    • Startups
    • Tech
  • Industry
  • Research
  • Resources
    • Articles
    • Guides
    • Case Studies
    • Glossary
    • Whitepapers
  • Newsletter
  • + More
    • Conversations
    • Events
    • About
      • About
      • Contact
      • Imprint
      • Legal & Privacy
      • Partner With Us
No Result
View All Result
Subscribe

This website uses cookies. By continuing to use this website you are giving consent to cookies being used. Visit our Privacy Policy.