Efficient Big Data Processing in Hadoop MapReduce
Jens Dittrich
Jorge-Arnulfo Quiané-Ruiz
Information Systems Group
Saarland University
http://infosys.cs.uni-saarland.de

ABSTRACT
This tutorial is motivated by the clear need of many organizations, companies, and researchers to process big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems; today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude, and in this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, ranging from job optimization to physical data organization such as data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and parallel DBMSs. Furthermore, we will point out unresolved research problems and open issues.
1. INTRODUCTION
Nowadays, dealing with datasets on the order of terabytes or even petabytes is a reality [24, 23, 19]. Therefore, processing such big datasets efficiently is a clear need for many users. In this context, Hadoop MapReduce [6, 1] is a big data processing framework that has rapidly become the de facto standard in both industry and academia [16, 7, 24, 10, 26, 13]. The main reasons for this popularity are the ease of use, scalability, and failover properties of Hadoop MapReduce. However, these features come at a price: the performance of Hadoop MapReduce is usually far from the performance of a well-tuned parallel database [21]. Therefore, many research works from both industry and academia have focused on improving the performance of Hadoop MapReduce jobs in many aspects. For example, researchers have proposed different data layouts [16, 9, 18], join algorithms [3, 5, 20], high-level query languages [10, 13, 24], failover algorithms [22], query optimization techniques [25, 4, 12, 14], and indexing techniques [7, 15, 8]. The latter includes HAIL [8], an indexing technique presented at VLDB 2012 that improves the performance of Hadoop MapReduce jobs by up to a factor of 70, without requiring an expensive index creation phase. Over the past years, researchers have actively studied the different performance problems of Hadoop MapReduce. Unfortunately, users do not always have deep knowledge of how to exploit the different techniques efficiently.
In this tutorial, we discuss how to reduce the performance gap to well-tuned database systems. We will point out the similarities and differences between the techniques used in Hadoop and those used in parallel databases. In particular, we will highlight research areas that have not yet been exploited. In the following, we present the three parts in which this tutorial is structured.
2. HADOOP MAPREDUCE
We will focus on Hadoop MapReduce, which is the most popular open source implementation of the MapReduce framework proposed by Google [6]. Generally speaking, a Hadoop MapReduce job mainly consists of two user-defined functions: map and reduce.
The input of a Hadoop MapReduce job is a set of key-value pairs (k, v), and the map function is called once for each of these pairs. The map function produces zero or more intermediate key-value pairs (k', v'). The Hadoop MapReduce framework then groups these intermediate key-value pairs by intermediate key k' and calls the reduce function once for each group. Finally, the reduce function produces zero or more aggregated results. The beauty of Hadoop MapReduce is that users usually only have to define the map and reduce functions. The framework takes care of everything else, such as parallelization, fault tolerance, and the grouping and shuffling of intermediate results.
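To make the programming model concrete, the following word-count job is a minimal sketch in Java against the org.apache.hadoop.mapreduce API; it is illustrative only, and the class names and the whitespace tokenization are our own choices rather than part of the tutorial. The map function emits an intermediate pair (word, 1) for every word in its input line, and the reduce function receives all counts for one word and emits their sum.

    // Minimal word-count sketch (illustrative; names are not from the tutorial).
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // map: for each input pair (offset, line), emit (word, 1) per word.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE); // intermediate pair (k', v')
            }
          }
        }
      }

      // reduce: called once per distinct word with all its counts; emit the sum.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Optional map-side pre-aggregation; valid here because sums are
        // associative and commutative.
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note that the driver code in main only declares the job's components; the framework decides how many map and reduce tasks to run, where to run them, and how to recover from task failures.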