Recent technology advances within the Apache Hadoop ecosystem have provided a big boost to Hadoop’s viability as an analytics environment—above and beyond just being a good place to store data. Leveraging these advances, new technologies now support SQL on Hadoop, making in-cluster analytics of data in Hadoop a reality. This is a big deal – it meets a huge demand, it shows how rapidly the technologies have evolved, and it delivers on one of the most significant unmet promises of big data analytics.
1. We Need It. Yesterday.
Say you have a lot of data that you need to analyze, sitting in a Hadoop cluster. How do you go about that? There are two ways.
The first approach has been kind of a default for many organizations. You can use the Hadoop environment for collecting and transforming data, and then export that data into a relational analytical database—for example Redshift or Vertica—for the actual analysis. So you get the advantages of high-speed, powerful analysis…but only after you’ve moved the data out of Hadoop and into a more familiar environment. This approach is more complicated than it needs to be. Plus it’s pricey, and getting more so as data volumes grow.
The second approach is more elegant. You can analyze the data directly within the Hadoop cluster using one of the several available SQL-on-Hadoop technologies, eliminating the need to move the data into a separate database. Unfortunately, it hasn’t been easy to get that to work. Until recently, these tools have not provided the speed or the depth of capability that most organizations need.
Between the complexity and expense of the first option and the inadequate performance and capabilities of the second, the need for reliable in-cluster analytics on Hadoop has become widespread and increasingly urgent.
2. Technologies Are Converging on It
The past 18 months have seen a series of major improvements in the leading SQL-on-Hadoop technologies. The various development communities have added performance enhancements, new functions, compatibility with other SQL dialects, aliases for functions and data types, and improved support for columnar storage formats. These solutions are providing more complete support for SQL JOIN clauses and EXPLAIN queries. They’ve also added new utilities to track memory usage and to manage tasks and query plans, and security features to make them more enterprise-ready.
Let’s take a closer look at the progress made by four of the major offerings in this space: Apache Hive, Cloudera Impala, Apache Spark, and Presto.
Hive. The original SQL-on-Hadoop technology has cut achievable query response times to less than a second. Hive now supports SQL INSERT, UPDATE, and DELETE statements as well as SQL:2011 standard query syntax. Hive has also added a third execution engine, now supporting Spark along with MapReduce and Tez.
Impala. With improved support for subqueries and the addition of window functions, Impala has upped its analytics game considerably. Joins and aggregations can now spill data to disk instead of failing when memory runs low. Impala has increased compatibility across SQL environments by adding math, string, date/time, and bit functions used by other dialects, and by supporting complex types such as STRUCT, ARRAY, and MAP that Hive already supports. Impala has also introduced Ibis, a Python data analysis framework for data scientists. Additionally, Cloudera benchmarks show significant improvements in speed and concurrency.
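To make the window-function point concrete, here is a minimal sketch of a running total per group. It assumes Python's built-in sqlite3 module (SQLite 3.25 or later) as a local stand-in for a SQL-on-Hadoop engine, and the `sales` table and its columns are hypothetical. The `SUM(...) OVER (PARTITION BY ... ORDER BY ...)` form is standard SQL, so a similar query should carry over to the engines discussed here with little change, though each engine's own documentation is the authority on exact syntax.

```python
import sqlite3

# In-memory database used purely as a stand-in for a SQL-on-Hadoop engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, day INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("east", 1, 10.0), ("east", 2, 30.0), ("east", 3, 20.0),
     ("west", 1, 5.0), ("west", 2, 15.0)],
)

# Running total per region: every input row is kept (no GROUP BY collapse),
# while SUM(...) OVER (...) accumulates across the rows of its partition.
query = """
    SELECT region, day, amount,
           SUM(amount) OVER (PARTITION BY region ORDER BY day) AS running_total
    FROM sales
    ORDER BY region, day
"""
rows = conn.execute(query).fetchall()
conn.close()

for row in rows:
    print(row)
```

Without window functions, the same result typically needs a self-join or a correlated subquery, which is part of why their arrival matters for analytics workloads.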
Spark. Like Impala, Spark has added window functions and has improved its compatibility across the ecosystem with a number of new date/time, string, and math functions—over 100 of them. Spark has also added compatibility with most versions of Hive’s metastore, making it easier to read tables written by Hive. Additionally, last year Spark SQL introduced Spark packages, which allow access to many other data sources from CSV to Avro to several NoSQL data stores. More than 200 of these packages are listed on spark-packages.org. Spark is also coming on strong as an alternative to MapReduce, demonstrating its viability as a standalone compute framework that supports an efficient and easily implemented analytics workflow.
Presto. With significant support from Teradata, Presto is emerging as the SQL-on-Hadoop technology of choice for highly complex SQL queries, supporting the demanding multi-petabyte analytics needs of such players as Airbnb, Dropbox, Groupon, Netflix, and Facebook. Like the others, Presto has added multiple functions to make it more compatible with other dialects. Its support for Hive tables now also includes INSERT, DELETE, and CREATE statements, with support for partitioned tables.
3. It Delivers on the Promise of Big Data
One of the key tenets of big data in general, and of Hadoop in particular, is schema-on-read. If you’re going to put all your varieties of data into one repository, it’s not feasible for that repository to try to enforce some overarching structure or schema onto the data. Instead, you apply the schema as you analyze the data, meaning that you have the flexibility to define and redefine that schema on the fly based on changing circumstances and what you’re learning.
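Schema-on-read can be illustrated with a small sketch in plain Python. This is not Hadoop code: the raw events, the field names, and the `read_with_schema` helper are all hypothetical, chosen only to show the idea that data is stored as it arrives and a schema is applied, and later changed, at query time.

```python
import json

# Raw events are stored exactly as they arrived; no schema is enforced on write.
RAW_EVENTS = [
    '{"user": "ana", "action": "view", "ms": "120"}',
    '{"user": "raj", "action": "buy", "ms": "340", "amount": "19.99"}',
    '{"user": "ana", "action": "buy", "amount": "5.00"}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: select fields and cast them.

    Fields missing from a record come back as None instead of failing.
    """
    for line in lines:
        record = json.loads(line)
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# First question: how long do actions take? Read "ms" as an integer.
latency_schema = {"user": str, "ms": int}
latency_rows = list(read_with_schema(RAW_EVENTS, latency_schema))

# Later question: how much was spent? Same raw data, different schema.
revenue_schema = {"user": str, "amount": float}
revenue_rows = list(read_with_schema(RAW_EVENTS, revenue_schema))

total_spent = sum(r["amount"] for r in revenue_rows if r["amount"] is not None)
print(latency_rows)
print(total_spent)
```

The stored data never changes between the two questions; only the schema applied while reading does, which is the flexibility the paragraph above describes.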
Schema-on-read is also a core tenet of emerging technologies that are architected from the start to support that kind of fast, flexible in-database analysis. Until recently, though, the SQL-on-Hadoop engines underneath them lacked the speed and depth of capability to make that practical on data held in Hadoop.
That has now changed. Thanks to the improvements the SQL-on-Hadoop technologies have achieved in recent months, technologies such as Looker, Tableau, Qlik, and others (many of which act like an intranet for business metrics) are delivering the future of enterprise analytics. These technologies enable users to analyze and model all their Hadoop data in-cluster, putting the “big” into big data analytics without sacrificing speed or ease of use.
So now if you have a lot of data in Hadoop and you’re looking to analyze it, the choice becomes clear. You don’t have to continue bumping up against the limits of the relational database you’re moving the data into—or how much of it you can afford to use. With new technologies, you can leverage the power of the cluster you already have in place, expanding and accelerating what you can do while saving you time and money. And that is truly a big deal.
image credit: Skasuya
Unlike Spark, Impala, and Hive, Presto is not an Apache project, so it is not “Apache Presto.”