Dataconomy

3 Reasons Why In-Hadoop Analytics are a Big Deal

by Ben Porterfield
December 8, 2016
in Data Science, Technology & IT

Recent technology advances within the Apache Hadoop ecosystem have provided a big boost to Hadoop’s viability as an analytics environment—above and beyond just being a good place to store data. Leveraging these advances, new technologies now support SQL on Hadoop, making in-cluster analytics of data in Hadoop a reality. This is a big deal – it meets a huge demand, it shows how rapidly the technologies have evolved, and it delivers on one of the most significant unmet promises of big data analytics.

Table of Contents

  • 1. We Need It. Yesterday.
  • 2. Technologies Are Converging on It
  • 3. It Delivers on the Promise of Big Data

1. We Need It. Yesterday.

Say you have a lot of data that you need to analyze, sitting in a Hadoop cluster. How do you go about that? There are two ways.

The first approach has been kind of a default for many organizations. You can use the Hadoop environment for collecting and transforming data, and then export that data into a relational analytical database—for example Redshift or Vertica—for the actual analysis. So you get the advantages of high-speed, powerful analysis…but only after you’ve moved the data out of Hadoop and into a more familiar environment. This approach is more complicated than it needs to be. Plus it’s pricey, and getting more so as data volumes grow.

The second approach is more elegant. You can analyze the data directly within the Hadoop cluster using one of the several available SQL-on-Hadoop technologies, eliminating the need to move the data into a separate database. Unfortunately, it hasn’t been easy to get that to work. Until recently, these tools have not provided the speed or the depth of capability that most organizations need.


Between the complexity and expense of the first option and the inadequate performance and capabilities of the second, the need for reliable in-cluster analytics on Hadoop has become widespread and increasingly urgent.

2. Technologies Are Converging on It

The past 18 months have seen a series of major improvements in the leading SQL-on-Hadoop technologies. The various development communities have added performance enhancements, new functions, compatibility with other SQL dialects, aliases for functions and data types, and improved support for columnar storage formats. These solutions are providing more complete support for SQL JOIN clauses and EXPLAIN queries. They’ve also added new utilities to track memory usage and to manage tasks and query plans, and security features to make them more enterprise-ready.

Let’s take a closer look at the progress made by four of the major offerings in this space: Apache Hive, Cloudera Impala, Apache Spark, and Presto.

Hive. The original SQL-on-Hadoop technology has cut achievable query response times to less than a second. Hive now supports SQL INSERT, UPDATE, and DELETE statements as well as SQL:2011 standard query syntax. Hive has also added a third execution engine, now supporting Spark along with MapReduce and Tez.

Impala. With improved support for subqueries and the addition of window functions, Impala has upped its analytics game considerably. Joins and aggregations can now spill to disk instead of failing when memory runs low. Impala has increased compatibility across SQL environments by adding math, string, date/time, and bit functions used by other dialects, along with the STRUCT, ARRAY, and MAP complex types already supported by Hive. Cloudera has also introduced Ibis, a Python data analysis framework for data scientists that is built to work with Impala. Additionally, Cloudera benchmarks show significant improvements in speed and concurrency.
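
To make the spill-to-disk idea concrete, here is a minimal Python sketch of an aggregation that writes partial results to temporary files when it runs out of its memory budget. This is not Impala's implementation. The function names, the key-count budget, and the number of partitions are all assumptions chosen for illustration, and keys are assumed to be strings.

```python
import json
import tempfile
import zlib
from collections import defaultdict


def _flush(partial, partitions):
    """Append in-memory partial sums to the partition file chosen by key hash."""
    for key, value in partial.items():
        index = zlib.crc32(key.encode("utf-8")) % len(partitions)
        partitions[index].write(json.dumps([key, value]) + "\n")
    partial.clear()


def aggregate_with_spill(rows, max_keys_in_memory, num_partitions=4):
    """Yield (key, total) pairs, spilling partial sums to disk when memory is full."""
    partial = defaultdict(float)
    partitions = None  # created lazily, only if a spill is needed

    for key, value in rows:
        # Spill only when a new key would exceed the in-memory budget.
        if key not in partial and len(partial) >= max_keys_in_memory:
            if partitions is None:
                partitions = [
                    tempfile.TemporaryFile(mode="w+") for _ in range(num_partitions)
                ]
            _flush(partial, partitions)
        partial[key] += value

    if partitions is None:
        # Everything fit in memory, so no disk was involved.
        yield from partial.items()
        return

    _flush(partial, partitions)
    # Each partition holds a disjoint subset of keys, so it can be merged alone.
    for spill in partitions:
        spill.seek(0)
        merged = defaultdict(float)
        for line in spill:
            key, value = json.loads(line)
            merged[key] += value
        spill.close()
        yield from merged.items()


# Hypothetical input: (key, value) pairs with more distinct keys than the budget.
rows = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("a", 4.0), ("d", 5.0), ("b", 6.0)]
totals = dict(aggregate_with_spill(rows, max_keys_in_memory=2))
```

Hashing keys into partitions means each spilled file can be merged on its own, so the merge phase never needs every key in memory at once. The trade is extra disk I/O in exchange for a query that finishes rather than fails.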

Spark. Like Impala, Spark has added window functions and has improved its compatibility across the ecosystem with a number of new date/time, string, and math functions—over 100 of them. Spark has also added compatibility with most versions of Hive’s metastore, making it easier to read tables written by Hive. Additionally, last year Spark SQL introduced Spark packages, which allow access to many other data sources from CSV to Avro to several NoSQL data stores. More than 200 of these packages are listed on spark-packages.org. Spark is also coming on strong as an alternative to MapReduce, demonstrating its viability as a standalone compute framework that supports an efficient and easily implemented analytics workflow.
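
For readers who have not used window functions, which both Impala and Spark have added, the sketch below shows the kind of query they make possible: a running total per group without collapsing the rows. It uses Python's built-in sqlite3 module purely as a stand-in SQL engine (it needs SQLite 3.25 or later), not Impala or Spark, and the orders table and its columns are made up for the example.

```python
import sqlite3

# In-memory SQLite database used only as a convenient stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, day TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("east", "2016-12-01", 100.0),
        ("east", "2016-12-02", 150.0),
        ("west", "2016-12-01", 80.0),
        ("west", "2016-12-02", 40.0),
    ],
)

# The OVER clause computes a running sum within each region, ordered by day,
# while still returning one output row per input row.
running_totals = conn.execute(
    """
    SELECT region, day,
           SUM(revenue) OVER (PARTITION BY region ORDER BY day) AS running_revenue
    FROM orders
    ORDER BY region, day
    """
).fetchall()
conn.close()
```

The same SELECT statement should be close to what you would write in a SQL-on-Hadoop engine that supports window functions, though function coverage and syntax details vary by engine and version.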

Presto. With significant support from Teradata, Presto is emerging as the SQL-on-Hadoop technology of choice for highly complex SQL queries, supporting the demanding multi-petabyte analytics needs of such players as Airbnb, Dropbox, Groupon, Netflix, and Facebook. Like the others, Presto has added multiple functions to make it more compatible with other dialects. Its support for Hive tables now includes INSERT, DELETE, and CREATE statements, including for partitioned tables.

3. It Delivers on the Promise of Big Data

One of the key tenets of big data in general, and of Hadoop in particular, is schema-on-read. If you’re going to put all your varieties of data into one repository, it’s not feasible for that repository to try to enforce some overarching structure or schema onto the data. Instead, you apply the schema as you analyze the data, meaning that you have the flexibility to define and redefine that schema on the fly based on changing circumstances and what you’re learning.
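
Here is a small Python sketch of what schema-on-read means in practice. The raw events, field names, and the two schemas are invented for illustration. The point is only that the stored data stays untyped, and each analysis applies whatever schema it needs at read time.

```python
import json
from datetime import datetime

# Raw events stored as-is; no schema was enforced when they were written.
RAW_EVENTS = [
    '{"user": "ana", "amount": "19.99", "ts": "2016-12-01T10:15:00"}',
    '{"user": "raj", "amount": "5", "ts": "2016-12-01T11:00:00", "coupon": "X1"}',
    '{"user": "li", "ts": "2016-12-02T09:30:00"}',
]

# A schema here is just a mapping from field name to a parser, chosen at read time.
SALES_SCHEMA = {"user": str, "amount": float}
ACTIVITY_SCHEMA = {"user": str, "ts": datetime.fromisoformat}


def read_with_schema(lines, schema):
    """Parse each raw line and project it onto the schema; missing fields become None."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: parse(record[field]) if field in record else None
            for field, parse in schema.items()
        }


# Two different views over the same raw data, with no rewrite of the stored records.
sales = list(read_with_schema(RAW_EVENTS, SALES_SCHEMA))
activity = list(read_with_schema(RAW_EVENTS, ACTIVITY_SCHEMA))
total_sales = sum(row["amount"] or 0.0 for row in sales)
```

Redefining the schema later, say to start reading the coupon field, means changing one dictionary rather than reloading the data.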

Schema-on-read is also a core tenet of emerging technologies that are architected from the start to support that kind of fast, flexible in-database analysis.

Until recently, that promise went largely unmet, because in-cluster analysis was too slow and too limited for most organizations. That has now changed. Thanks to the improvements the SQL-on-Hadoop technologies have achieved in recent months, technologies such as Looker, Tableau, Qlik, and others (many of which act like an intranet for business metrics) are delivering the future of enterprise analytics. These technologies enable users to analyze and model all their Hadoop data in-cluster, putting the “big” into big data analytics without sacrificing speed or ease of use.

So now if you have a lot of data in Hadoop and you’re looking to analyze it, the choice becomes clear. You don’t have to continue bumping up against the limits of the relational database you’re moving the data into—or how much of it you can afford to use. With new technologies, you can leverage the power of the cluster you already have in place, expanding and accelerating what you can do while saving you time and money. And that is truly a big deal.

image credit: Skasuya


Tags: Analytics, Apache, Big Data, Hadoop, Looker, SQL


Comments 2

  1. glenshef . says:
    7 years ago

    Unlike Spark, Impala, and Hive – Presto is not an Apache project – so it is not “Apache Presto”


COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.
