Improving the performance of Hadoop/Hive by sharing scan and computation tasks

2013
Özal, Serkan
MapReduce is a popular model for executing time-consuming analytical queries as a batch of tasks on large-scale data. During simultaneous execution of multiple queries, many opportunities can arise for sharing scan and/or computation tasks. Executing common tasks only once can remarkably reduce the total execution time of all queries. Therefore, we propose to use Multiple Query Optimization (MQO) techniques to improve the overall performance of Hadoop Hive, an open source SQL-based distributed warehouse system based on MapReduce. Our framework, SharedHive, transforms a set of correlated HiveQL queries into new global queries that can produce the same results in remarkably smaller total execution times. It is experimentally shown that SharedHive outperforms conventional Hive with a 20-90% reduction in the total execution time of correlated TPC-H queries, depending on the number of queries and the percentage of shared tasks.
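The idea of sharing a scan across correlated queries can be illustrated with a small, self-contained Python sketch. It is not SharedHive's implementation and does not use Hive or Hadoop; the names `Query`, `CountingTable`, `run_separately`, and `run_shared` are hypothetical and exist only for this illustration. It assumes each query is a filtered sum over a single in-memory table, and it shows that evaluating all queries during one pass gives the same results as running them one by one while scanning the data only once.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, float]


@dataclass
class Query:
    """A toy query: SUM(column) over rows matching a predicate (hypothetical)."""
    name: str
    predicate: Callable[[Row], bool]  # plays the role of a WHERE clause
    column: str                       # column that the query sums


class CountingTable:
    """In-memory table that counts how many full scans were started."""

    def __init__(self, rows: Iterable[Row]) -> None:
        self.rows: List[Row] = list(rows)
        self.scans = 0

    def scan(self) -> Iterator[Row]:
        self.scans += 1
        return iter(self.rows)


def run_separately(table: CountingTable, queries: List[Query]) -> Dict[str, float]:
    """Baseline: every query performs its own scan of the table."""
    results: Dict[str, float] = {}
    for q in queries:
        results[q.name] = sum(r[q.column] for r in table.scan() if q.predicate(r))
    return results


def run_shared(table: CountingTable, queries: List[Query]) -> Dict[str, float]:
    """Shared scan: one pass over the table feeds all queries."""
    results: Dict[str, float] = {q.name: 0 for q in queries}
    for row in table.scan():
        for q in queries:
            if q.predicate(row):
                results[q.name] += row[q.column]
    return results


if __name__ == "__main__":
    rows = [
        {"qty": 10, "price": 5},
        {"qty": 30, "price": 7},
        {"qty": 50, "price": 9},
    ]
    queries = [
        Query("q1", lambda r: r["qty"] > 20, "price"),
        Query("q2", lambda r: r["qty"] < 40, "qty"),
    ]
    separate, shared = CountingTable(rows), CountingTable(rows)
    print(run_separately(separate, queries), "scans:", separate.scans)
    print(run_shared(shared, queries), "scans:", shared.scans)
```

In this toy setting the number of scans drops from one per query to one in total. In a real system the actual saving would depend on how much of the work the queries share, which matches the dependence on the percentage of shared tasks noted in the abstract.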

Suggestions

A Cost-Aware Strategy for Query Result Caching in Web Search Engines
Altıngövde, İsmail Sengör; Ulusoy, Oezguer (2009-01-01)
Search engines and large scale IR systems need to cache query results for efficiency and scalability purposes. In this study, we propose to explicitly incorporate the query costs in the static caching policy. To this end, a query’s cost is represented by its execution time, which involves CPU time to decompress the postings and compute the query-document similarities to obtain the final top-N answers. Simulation results using large Web crawl data and a real query log reveal that the proposed strategy impr...
Using object-oriented materialized views to answer selection-based complex queries
Alhajj, R; Polat, Faruk (1999-09-01)
Presented in this paper is a model that utilizes existing materialized views to handle a wide range of complex selection-based queries, including linear recursive queries. Such queries are complex because it is almost impossible for naive users to predict the formulation of their predicate expressions. Object variables bound to objects in the result of a query are allowed to appear in the predicate of that query. Also, the predicate definition is extended to make it possible to have in the output only a sub...
Improving the efficiency of distributed information retrieval using hybrid index partitioning
Hafızoğlu, Fatih; Altıngövde, İsmail Sengör; Department of Computer Engineering (2018)
Selective search with traditional partitioning has advantages over exhaustive search in terms of total query cost. However, by its nature it can often suffer from query latency and load imbalance. To overcome these issues, we proposed a new partitioning method in this thesis, namely Hybrid partitioning. Our studies show that it is possible to obtain significant savings in query latency with this new partitioning methodology. In addition to that, query processing with Hybrid partitioning...
Improving forecasting accuracy of time series data using a new ARIMA-ANN hybrid method and empirical mode decomposition
Buyuksahin, Umit Cavus; Ertekin Bolelli, Şeyda (Elsevier BV, 2019-10-07)
Many applications in different domains produce large amounts of time series data. Accurate forecasting is critical for many decision makers. Various time series forecasting methods exist that use linear and nonlinear models separately or a combination of both. Studies show that combining linear and nonlinear models can be effective in improving forecasting performance. However, some assumptions that those existing methods make might restrict their performance in certain situations. We provide a new Au...
Efficient computation of strong partial transitive-closures
Toroslu, İsmail Hakkı (1993-01-01)
The development of efficient algorithms to process the different forms of transitive-closure (TC) queries within the context of large database systems has recently attracted a large volume of research effort. In this paper, we present a new algorithm suitable for processing one of these forms, the so-called strong partially-instantiated, in which one of the query's arguments is instantiated to a set of constants and the processing of which yields a set of tuples that draw their values from both of the q...
Citation Formats
S. Özal, “Improving the performance of Hadoop/Hive by sharing scan and computation tasks,” M.S. - Master of Science, Middle East Technical University, 2013.