Improving the performance of Hadoop/Hive by sharing scan and computation tasks

Download
2013
Özal, Serkan
MapReduce is a popular model of executing time-consuming analytical queries as a batch of tasks on large scale data. During simultaneous execution of multiple queries, many oppor- tunities can arise for sharing scan and/or computation tasks. Executing common tasks only once can reduce the total execution time of all queries remarkably. Therefore, we propose to use Multiple Query Optimization (MQO) techniques to improve the overall performance of Hadoop Hive, an open source SQL-based distributed warehouse system based on MapReduce. Our framework, SharedHive, transforms a set of correlated HiveQL queries into new global queries that can produce the same results in remarkably smaller total execution times. It is ex- perimentally shown that SharedHive outperforms the conventional Hive by %20-90 reduction, depending on the number of queries and percentage of shared tasks, in the total execution time of correlated TPC-H queries.
Citation Formats
S. Özal, “Improving the performance of Hadoop/Hive by sharing scan and computation tasks,” M.S. - Master of Science, Middle East Technical University, 2013.