HIVE 0.14 Cost Based Optimizer (CBO)

Analysts and data scientists⎯not to mention business executives⎯want Big Data not for the sake of the data itself, but for the ability to work with and learn from that data. As other users become more savvy, they also want more access. But too many inefficient queries can create a bottleneck in the system.

The good news is that Apache™ Hive 0.14—the standard SQL interface for processing, accessing and analyzing Apache Hadoop® data sets—is now powered by Apache Calcite. Calcite is an open source, enterprise-grade Cost-Based Logical Optimizer (CBO) and query execution framework.

The main goal of a CBO is to generate efficient execution plans by examining the tables and conditions specified in the query, ultimately cutting down on query execution time and reducing resource utilization. Calcite has an efficient plan pruner that can select the cheapest query plan. All SQL queries are converted by Hive to a physical operator tree, optimized and converted to Tez/MapReduce jobs, then executed on the Hadoop cluster. This conversion includes SQL parsing and transforming, as well as operator-tree optimization.

Query optimizers tend to have the biggest performance impact in a data warehouse system, since generating the right (or wrong) execution plan could mean the difference of seconds, minutes, or even hours in query execution time.

Performance testing of Hive 0.14 shows an average speedup of 2.5 times for TPC-DS benchmarked queries against a 30TB TPC-DS dataset. In our tests, total workload runtime shrank from 20.6 to 9 hours. In the current world of cloud computing, this improved performance directly translates into reduced operating costs.

In this post, we will dig into how this CBO works, including the background and challenges of query optimization, such as cardinality estimation and join ordering. We’ll also provide in-depth analysis of TPC-DS 30TB results.


Details: https://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/

Comments