Big data processing platforms: What’s next?

by Kostas Tzoumas
September 8, 2014

“Big Data” has been a major technology trend over the past years, and buzz around the term continues strong today. By now, it is widely accepted that the “Big” in Big Data does not refer only to the terabytes and petabytes of data to be processed, but more broadly to the complexity of such data. The standard definition of Big Data is a data management problem that cannot be solved with traditional technology because of a combination of some of the “4 Vs”, four characteristics of the complexity of Big Data:

  • Volume: The data is too much to be handled by traditional technologies.
  • Velocity: The data is coming (streaming) too fast to be ingested by traditional technologies.
  • Variety: There are too many different sources or formats of data that cannot be integrated by traditional technologies.
  • Veracity: The inherent uncertainty in the data cannot be handled by traditional technologies.

Big data can create tremendous value for the companies that put it to use. An often-overlooked factor, however, is that a bunch of hard drives full of data are useless (and thus carry no value) by themselves. What drives the value of data is the analysis that companies conduct on it in order to derive valuable insights. According to Gartner, companies that employ advanced forms of analysis on their data can achieve a higher return on investment and are expected to grow more than their competitors. A key trend is that it is not just the data that is becoming more complex, but increasingly the analysis itself. Along the lines of the four “Vs”, let us try to characterize the complexity of current “Big Analysis” requirements using four “Is”.

  • In-situ: Analysis operates directly on the data where it sits, without requiring an expensive ETL (Extract, Transform, Load) process.
  • Iterative: Analysis iterates over the data several times in order to build and train a model of the data rather than just extract summaries (often referred to as predictive versus descriptive analysis); a small sketch of this distinction follows the list.
  • Incremental: Analysis needs to maintain such models under high data arrival rates, updating them as new data arrives rather than recomputing them from scratch.
  • Interactive: Analysts work interactively with the data, formulating the next question depending on the results of the previous one.
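
To make the “iterative” point concrete, here is a minimal, self-contained sketch in plain Java (with made-up toy data) contrasting a descriptive summary, which needs a single pass over the data, with a predictive model that has to iterate over the same data many times before it converges:

```java
public class IterativeVsDescriptive {
    public static void main(String[] args) {
        // Toy data: (x, y) pairs; in a real setting this would be a large, distributed data set.
        double[] x = {1, 2, 3, 4, 5};
        double[] y = {2.1, 3.9, 6.2, 8.1, 9.8};

        // Descriptive analysis: one pass over the data is enough.
        double sum = 0;
        for (double v : y) sum += v;
        System.out.println("mean(y) = " + sum / y.length);

        // Predictive analysis: fit the model y ≈ w * x by gradient descent.
        // The model parameter is refined over many full passes (iterations) over the data.
        double w = 0.0;
        double learningRate = 0.01;
        for (int iteration = 0; iteration < 100; iteration++) {
            double gradient = 0;
            for (int i = 0; i < x.length; i++) {
                gradient += (w * x[i] - y[i]) * x[i]; // derivative of the squared error w.r.t. w
            }
            w -= learningRate * gradient / x.length;
        }
        System.out.println("fitted slope w = " + w);
    }
}
```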

Apache Hadoop is the major technology available today for Big Data management. Does Hadoop live up to the challenge of Big Analysis? The answer is yes and no. Hadoop today has evolved far beyond the original system modeled after Google’s MapReduce research paper. A Hadoop installation is nowadays a collection of several distinct products, ranging from data storage to tools for stream processing and interactive querying, to data visualization, reporting, and monitoring tools. The original Hadoop MapReduce system performs poorly on many emerging use cases, and new technologies are on the rise to fill this gap.

Broadly speaking, “Hadoop and co” is today a diverse set of data processing technologies that interoperate with each other on the same data. The figure below (from Hortonworks) gives an example of such a Hadoop 2 installation. All data is stored in a reliable common storage layer called the Hadoop Distributed File System (HDFS), and cluster resources are managed by a common resource manager called YARN (Yet Another Resource Negotiator, a tongue-in-cheek name of the kind common in developer circles). On top of that we see a diversification of tools that support different kinds of analysis. That way, the “in situ” requirement is satisfied, as the data is stored in a single place; a small sketch of reading from this shared storage layer follows the figure.

[Figure (Hortonworks): an example Hadoop 2 stack, with HDFS and YARN underneath multiple processing engines]
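
As an illustration of that shared storage layer, the following sketch uses Hadoop’s standard FileSystem API to read a file directly from HDFS; the namenode address and file path are hypothetical and would need to match a real cluster:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadFromHdfs {
    public static void main(String[] args) throws Exception {
        // Hypothetical namenode address -- adjust to your cluster.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Hypothetical data file; every engine in the stack can read these same files.
        Path file = new Path("/data/clicks/2014-09-08/part-00000");
        try (FSDataInputStream in = fs.open(file);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
            // Print the first few lines as a smoke test.
            for (int i = 0; i < 5; i++) {
                String line = reader.readLine();
                if (line == null) break;
                System.out.println(line);
            }
        }
    }
}
```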


Among the analysis platforms themselves, we see several tools targeting different types of analysis:

  • Hadoop MapReduce itself is still the most robust and widely-used tool for very large-scale batch data processing.
  • Apache Storm (http://storm.incubator.apache.org/) was created for applications that process data “on the move” rather than in batches. It was first open sourced by Twitter.
  • Apache Tez (http://hortonworks.com/hadoop/tez/), Apache Drill (http://incubator.apache.org/drill/), and Impala from Cloudera (http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/impala.html) aim to offer interactive SQL or SQL-style querying on top of Hadoop data. They are inspired, interestingly, by relational database technology such as columnar storage, exactly the technology that Hadoop initially competed with, and by Google’s Dremel system.
  • Apache Giraph (https://giraph.apache.org/) was inspired by Google’s Pregel system and is a good match for applications that need to analyze graph data.
  • Apache Spark (https://spark.incubator.apache.org/) was initially created at UC Berkeley. Spark is based on the premise that if most data fits in main memory, and operations on data communicate via main memory, a lot of Big Data applications can become much faster. As such, Spark covers a wide variety of the aforementioned use cases with one system (a minimal sketch follows this list).
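
To illustrate Spark’s in-memory premise, here is a minimal sketch using Spark’s Java API, in the style of the classic log-mining example; the input path is hypothetical and the local master is used only for illustration (on a cluster the job would typically run under YARN):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkLogMining {
    public static void main(String[] args) {
        // Run locally for illustration; on a real cluster the master would be YARN.
        SparkConf conf = new SparkConf().setAppName("log-mining").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical path -- the same HDFS files that the other engines can read.
        JavaRDD<String> lines = sc.textFile("hdfs://namenode:8020/logs/2014-09-08");

        // cache() asks Spark to keep the filtered data in cluster memory, so that
        // every subsequent query runs against RAM instead of re-reading from disk.
        JavaRDD<String> errors = lines.filter(line -> line.contains("ERROR")).cache();

        // Several analyses over the same in-memory data set:
        long allErrors = errors.count();
        long timeouts  = errors.filter(line -> line.contains("timeout")).count();
        long oom       = errors.filter(line -> line.contains("OutOfMemory")).count();

        System.out.printf("errors=%d, timeouts=%d, out-of-memory=%d%n",
                allErrors, timeouts, oom);
        sc.stop();
    }
}
```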

While there is a wealth of choice already, the above list is by no means complete; a plethora of other tools and systems are springing up. While this is a sign of a healthy market and a fast pace of innovation, choosing the right platform for the problem at hand can be mind-boggling for many companies. Each of these platforms has its own programming interface, and programs written for one of them cannot be transferred to another. Add to this the problems of managing a Hadoop installation in the first place and of finding qualified people to conduct the data analysis, which is widely considered one of the major obstacles companies face in realising their Big Data strategy.

A solution to this problem is yet to be seen. Some systems are developing compatibility layers so that programs written for one can be transferred to another. While this may keep this “tower of Babel” of technologies from collapsing, a need is emerging for more general-purpose platforms that can cover a wide variety of use cases better than the original Apache Hadoop design, having learned the lessons of history so far. As Mike Olson, CSO of Cloudera, recently put it: “What the Hadoop ecosystem needs is a successor system that is more powerful, more flexible and more real-time than MapReduce. While not every current (or maybe even future) application will abandon the MapReduce framework of today, new applications could use such a general-purpose engine to be faster and to do more than is possible with MapReduce.”

The evolution in this space is history that is waiting to be written. Here in Berlin, we are writing our part of that history with the Stratosphere (www.stratosphere.eu) open source platform for data analytics. Stratosphere is a general-purpose platform that can cover a wealth of analytical use cases and is compatible with Hadoop installations. It is based on the premise that, by combining technological ideas from Hadoop MapReduce, parallel databases, and compilers, we can create an efficient yet very powerful system for the next generation of data analysis needs.



Dr. Kostas Tzoumas is a post-doctoral researcher at TU Berlin. Kostas is currently co-leading the Stratosphere project, which develops an open-source Big Data analytics platform, the only one of its kind in Europe. His research focuses on system architectures, programming models, processing, and optimization for Big Data. Kostas has published extensively on the topic in leading database conferences such as ICDE, SIGMOD, and CIKM. He regularly serves as a Program Committee member for leading conferences such as ACM SIGMOD, VLDB, and ICDE, and initiated the workshop on Data Analytics in the Cloud. Before joining TU Berlin, Kostas received his PhD from Aalborg University, Denmark, visited the University of Maryland, College Park, and received his Diploma in Electrical and Computer Engineering from the National Technical University of Athens, Greece.


Image credit: Raymond Bryson

Tags: Berlin, Hadoop, Stratosphere, TU Berlin, What is Big Data
