eBay’s latest offering to the open-source community is designed to accelerate analytics on Hadoop. It provides an SQL interface and multi-dimensional analysis (OLAP) on Hadoop, supports extremely large datasets, and allows the use of SQL-compatible tools.
Dubbed Kylin, the project addresses the following problem, according to an eBay blog post:
“When data becomes bigger, the pre-calculation processing becomes impossible – even with powerful hardware. However, with the benefit of Hadoop’s distributed computing power, calculation jobs can leverage hundreds of thousands of nodes. This allows Kylin to perform these calculations in parallel and merge the final result, thereby significantly reducing the processing time.”
It is currently used in production by various business units at eBay. “Our largest use case is the analysis of 12+ billion source records generating 14+ TB cubes. Its 90% query latency is less than 5 seconds. Now, our use cases target analysts and business users, who can access analytics and get results through the Tableau dashboard very easily – no more Hive query, shell command, and so on,” the blog explains.
eBay has also proposed Kylin as an Apache Incubator project.
According to eBay, the platform offers the following features for big data analytics:
- Extremely fast OLAP engine at scale: Kylin is designed to reduce query latency on Hadoop for 10+ billion rows of data.
- ANSI SQL on Hadoop: Kylin supports most ANSI SQL query functions in its ANSI SQL on Hadoop interface.
- Interactive query capability: Users can interact with Hadoop data via Kylin at sub-second latency – better than Hive queries for the same dataset.
- MOLAP cube query serving on billions of rows: Users can define a data model and pre-build it in Kylin with more than 10 billion raw data records.
- Seamless integration with BI Tools: Kylin currently offers integration with business intelligence tools such as Tableau and third-party applications.
- Open-source ODBC driver: Kylin’s ODBC driver is built from scratch and works very well with Tableau. eBay has open-sourced the driver to the community as well.
Derrick Harris of GigaOm wonders how Kylin would fare against “the next-generation versions of Hive, Spark SQL and other options for SQL analysis in Hadoop that have emerged as a result of the YARN resource manager available in the latest versions of Apache Hadoop.” He believes that it might be slower but more scalable than in-memory options, while giving Hadoop users who are running earlier versions of the software another choice.
Over the past 30 years, other technologies have relied on the same theory that underlies Kylin to accelerate analytics. According to the blog, these methods ‘store pre-calculated results to serve analysis queries, generate each level’s cuboids (referencing cuboid typology) with all possible combinations of dimensions, and calculate all metrics at different levels.’
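To make the pre-calculation idea concrete, here is a minimal, single-machine sketch in Python. It is not Kylin's code: the fact table, the dimension names, and the `build_cube` and `query` helpers are all hypothetical, and a real deployment would distribute the aggregation across a Hadoop cluster. The sketch only shows the principle of building one cuboid per combination of dimensions and then answering a query with a lookup instead of a scan.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact table: (year, country, category, sales).
ROWS = [
    (2013, "US", "Toys", 10),
    (2013, "US", "Books", 5),
    (2013, "DE", "Toys", 7),
    (2014, "US", "Toys", 3),
]
DIMENSIONS = ("year", "country", "category")


def build_cube(rows, dimensions):
    """Pre-calculate SUM(sales) for every combination of dimensions."""
    cube = {}
    for level in range(len(dimensions) + 1):
        # Each combination of dimension indexes defines one cuboid.
        for dims in combinations(range(len(dimensions)), level):
            totals = defaultdict(int)
            for row in rows:
                key = tuple(row[i] for i in dims)
                totals[key] += row[-1]  # last column is the sales metric
            names = tuple(dimensions[i] for i in dims)
            cube[names] = dict(totals)
    return cube


def query(cube, **filters):
    """Answer an equality-filtered SUM with a lookup in the matching cuboid."""
    names = tuple(d for d in DIMENSIONS if d in filters)
    key = tuple(filters[d] for d in names)
    return cube[names].get(key, 0)


cube = build_cube(ROWS, DIMENSIONS)
print(len(cube))                                   # 8 cuboids (2**3)
print(query(cube, year=2013))                      # 22
print(query(cube, country="US", category="Toys"))  # 13
```

The number of cuboids doubles with each added dimension, which is why the build step is expensive on large datasets and benefits from being spread over many nodes, while the queries themselves stay cheap.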