Dataconomy
  • News
  • AI
  • Big Data
  • Machine Learning
  • Trends
    • Blockchain
    • Cybersecurity
    • FinTech
    • Gaming
    • Internet of Things
    • Startups
    • Whitepapers
  • Industry
    • Energy & Environment
    • Finance
    • Healthcare
    • Industrial Goods & Services
    • Marketing & Sales
    • Retail & Consumer
    • Technology & IT
    • Transportation & Logistics
  • Events
  • About
    • About Us
    • Contact
    • Imprint
    • Legal & Privacy
    • Newsletter
    • Partner With Us
    • Writers wanted
Subscribe
No Result
View All Result
Dataconomy
  • News
  • AI
  • Big Data
  • Machine Learning
  • Trends
    • Blockchain
    • Cybersecurity
    • FinTech
    • Gaming
    • Internet of Things
    • Startups
    • Whitepapers
  • Industry
    • Energy & Environment
    • Finance
    • Healthcare
    • Industrial Goods & Services
    • Marketing & Sales
    • Retail & Consumer
    • Technology & IT
    • Transportation & Logistics
  • Events
  • About
    • About Us
    • Contact
    • Imprint
    • Legal & Privacy
    • Newsletter
    • Partner With Us
    • Writers wanted
Subscribe
No Result
View All Result
Dataconomy
No Result
View All Result

Presto versus Hive: What You Need to Know

by Kiyoto Tamura
April 30, 2015
in Data Science
Home Topics Data Science
Share on FacebookShare on TwitterShare on LinkedInShare on WhatsAppShare on e-mail

There is much discussion in the industry about analytic engines and, specifically, which engines best meet various analytic needs. This post looks at two popular engines, Hive and Presto, and assesses the best uses for each.

Presto vs Hive 1

Table of Contents

  • How Hive Works
  • How Presto Works
  • How to Best Use Hive and Presto
  • Hive vs. Presto

How Hive Works

Hive translates SQL queries into multiple stages of MapReduce and it is powerful enough to handle huge numbers of jobs (Although as Arun C Murthy pointed out, modern Hive runs on Tez whose computational model is similar to Spark’s). MapReduce is fault-tolerant since it stores the intermediate results into disks and enables batch-style data processing. Many of our customers issue thousands of Hive queries to our service on a daily basis. A key advantage of Hive over newer SQL-on-Hadoop engines is robustness: Other engines like Cloudera’s Impala and Presto require careful optimizations when two large tables (100M rows and above) are joined. Hive can join tables with billions of rows with ease and should the jobs fail it retries automatically. Furthermore, Hive itself is becoming faster as a result of the Hortonworks Stinger initiative.

How Presto Works

In some instances simply processing SQL queries is not enough—it is necessary to process queries as quickly as possible so that data scientists and analysts can use Treasure Data for quickly gaining insights from their data collections. For these instances Treasure Data offers the Presto query engine. Presto is an in-memory distributed SQL query engine developed by Facebook that has been open-sourced since November 2013.Presto has been adopted at Treasure Data for its usability and performance.


Join the Partisia Blockchain Hackathon, design the future, gain new skills, and win!


BEST OF HIVEBEST OF PRESTO
Large data aggregationsInteractive queries (where you want to wait for the answer)
Large Fact-to-Fact joinsQuickly exploring the data (e.g. what types of records are found in the table)
Large distincts (aka de-duplication jobs)Joins with a large Fact table and many smaller Dimension tables
Batch jobs that can be scheduled

How to Best Use Hive and Presto

HIVEPRESTO
Optimized forThroughputInteractivity
SQL Standardized fidelityHiveQL (subset of common data warehousing SQL)Designed to comply with ANSI SQL
Window functionsYesYes
Large JOINsVery good for large Fact-to-Fact joinsOptimized for star schema joins (1 large Fact table and many smaller dimension tables)

Hive is optimized for query throughput, while Presto is optimized for latency. Presto has a limitation on the maximum amount of memory that each task in a query can store, so if a query requires a large amount of memory, the query simply fails. Such error handling logic (or a lack thereof) is acceptable for interactive queries; however, for daily/weekly reports that must run reliably, it is ill-suited. For such tasks, Hive is a better alternative.

In terms of data-processing models, Hive is often described as a pull model, since its MapReduce stage pulls data from the preceding tasks. Presto follows the push model, which is a traditional implementation of DBMS, processing a SQL query using multiple stages running concurrently. An upstream stage receives data from its downstream stages, so the intermediate data can be passed directly without using disks. If the query consists of multiple stages, Presto can be 100 or more times faster than Hive.

Hive vs. Presto

Learn how Treasure Data customers can utilize the power of distributed query engines without any configuration or maintenance of complex cluster systems.


kiyoto_tamuraKiyoto Tamura leads marketing at Treasure Data and is a maintainer of Fluentd, the open source data collector to unify log management. He spends much of his day at Treasure Data as a developer marketer/community manager. He also is a math nerd turned quantitative trader turned software engineer turned open source community advocate and cherishes American brunch and Japanese game shows.


Photo credit: andryn2006 / Source / CC BY-SA

Tags: FluentdHiveMapReducePrestoSQLTreasure DataWeekly Newsletter

Related Posts

What is storage automation

Mastering the art of storage automation for your enterprise

March 17, 2023
Can Komo AI be the alternative to Bing?

Can Komo AI be the alternative to Bing?

March 17, 2023
GPT-4 powered LinkedIn AI assistant explained. Learn how to use LinkedIn writing suggestions for headlines, summaries, and job descriptions.

LinkedIn AI won’t take your job but will help you find one

March 16, 2023
OpenAI released GPT-4, the highly anticipated successor to ChatGPT

OpenAI released GPT-4, the highly anticipated successor to ChatGPT

March 15, 2023
What is multimodal AI: Understanding GPT-4

Tracing the evolution of a revolutionary idea: GPT-4 and multimodal AI

March 15, 2023
What is Reimagine Home AI with examples? Learn how to use Reimagine Home AI and find out how AI can help interior designers. Keep reading...

Reimagine Home AI wants to redesign your home

March 15, 2023

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

LATEST ARTICLES

The silent spreaders: How computer worms can sneak into your system undetected?

Mastering the art of storage automation for your enterprise

Can Komo AI be the alternative to Bing?

LinkedIn AI won’t take your job but will help you find one

Where does your data go: Inside the world of blockchain storage

OpenAI released GPT-4, the highly anticipated successor to ChatGPT

Dataconomy

COPYRIGHT © DATACONOMY MEDIA GMBH, ALL RIGHTS RESERVED.

  • About
  • Imprint
  • Contact
  • Legal & Privacy
  • Partnership
  • Writers wanted

Follow Us

  • News
  • AI
  • Big Data
  • Machine Learning
  • Trends
    • Blockchain
    • Cybersecurity
    • FinTech
    • Gaming
    • Internet of Things
    • Startups
    • Whitepapers
  • Industry
    • Energy & Environment
    • Finance
    • Healthcare
    • Industrial Goods & Services
    • Marketing & Sales
    • Retail & Consumer
    • Technology & IT
    • Transportation & Logistics
  • Events
  • About
    • About Us
    • Contact
    • Imprint
    • Legal & Privacy
    • Newsletter
    • Partner With Us
    • Writers wanted
No Result
View All Result
Subscribe

This website uses cookies. By continuing to use this website you are giving consent to cookies being used. Visit our Privacy Policy.