Optimizing Hadoop parameter settings with gene expression programming guided PSO

Hadoop MapReduce has become a major computing technology in support of big data analytics. The Hadoop framework has over 190 configuration parameters, and some of them can have a significant effect on the performance of a Hadoop job. Manually tuning the optimum or near optimum values of these parameters is a challenging task and also a time consuming process. This paper optimizes the performance of Hadoop by automatically tuning its configuration parameter settings. The proposed work first employs gene expression programming technique to build an objective function based on historical job running records, which represents a correlation among the Hadoop configuration parameters. It then employs particle swarm optimization technique, which makes use of the objective function to search for optimal or near optimal parameter settings. Experimental results show that the proposed work enhances the performance of Hadoop significantly compared with the default settings. Moreover, it outperforms both rule‐of‐thumb settings and the Starfish model in Hadoop performance optimization. © 2016 The Authors. Concurrency and Computation: Practice and Experience Published by John Wiley & Sons Ltd.


INTRODUCTION
Many organizations are continuously collecting massive amounts of datasets from various sources such as the World Wide Web, sensor networks, and social networks. The ability to perform scalable and timely analytics on these unstructured datasets is a high priority for many enterprises. It has become difficult for traditional database systems to process these continuously growing datasets. Hadoop MapReduce has become a major computing technology in support of big data analytics [1,2]. Hadoop has received a wide uptake from the community because of its remarkable features such as high scalability, fault-tolerance, and data parallelization. It automatically distributes data and parallelizes computation across a cluster of computer nodes [3][4][5][6][7].
Despite these remarkable features, Hadoop is a large and complex framework, which has a number of components that interact with each other across multiple computer nodes. The performance of a Hadoop job is sensitive to each component of the Hadoop framework including the underlying hardware, network infrastructure, and Hadoop configuration parameters, which are over 190. Recent researches show that the parameter settings of the Hadoop framework play a critical role in the performance of Hadoop. A small change in the configuration parameter settings can have a significant impact on the performance of a Hadoop job [8]. Manually tuning the optimum or near optimum values of these parameters is a challenging task and also a time consuming process. In addition, the Hadoop framework has a black box like feature, which makes it extremely difficult to find a mathematical model or an objective function, which represents a correlation among the parameters. The large parameter space together with the complex correlations among the configuration parameters further increases the complexity of a manual tuning process. Therefore, an effective and automatic approach to tuning Hadoop parameters has become a necessity.
A number of research works have been proposed to automatically tune Hadoop parameter settings. The rule-of-thumb (ROT) settings proposed by industrial professionals [9][10][11] are a common practice for tuning Hadoop parameters. The Starfish optimizer [12,13] optimizes the performance of a Hadoop job based on the job profile and a cost model [14]. The job profile is collected at a fine granularity with detailed information. However, collecting the detailed execution profile of a job incurs a high overhead, which leads to overestimated values for some configuration parameters. Moreover, the Starfish optimizer divides the search space into subspaces in the optimization process, which ignores the correlations among the configuration parameters. PPABS (Profiling and Performance Analysis-based System) [15] automatically tunes Hadoop parameter settings based on executed job profiles. PPABS employs K-means++ to classify jobs into equivalent classes. It applies simulated annealing to search for optimum parameter values and implements a pattern recognition technique to determine the class that a new job belongs to. However, PPABS is unable to tune the parameter settings for a new job that does not belong to any of the pre-classified classes. Gunther, a search-based system proposed in [16], automatically searches for optimum values of the configuration parameters using a genetic algorithm. One critical limitation of Gunther is that it does not have a fitness function in the implemented genetic algorithm. Instead, Gunther evaluates the fitness of a set of parameter values by physically running a Hadoop job, which is a time consuming process. Panacea [17] optimizes Hadoop applications by tuning the configuration parameter settings. Similar to Starfish, Panacea also divides the search space into subspaces and then searches for optimal values within pre-defined ranges.
The work presented in [18] proposes a performance evaluation model, which focuses on the impact of the Hadoop configuration settings from the aspects of hardware, software, and network.
Tuning the configuration parameters of Hadoop requires knowledge of the internal dynamics of the Hadoop framework and the inter-dependencies among its configuration parameters. This is because the value of one parameter can have a significant impact on the other parameters. It should be pointed out that none of the aforementioned works considers the inter-dependencies among Hadoop configuration parameters. In this paper, we optimize the performance of Hadoop by automatically tuning its configuration parameter settings. The major contributions of this paper are as follows:
• Based on the running records of Hadoop jobs, which can be either CPU intensive or IO intensive, we employ the gene expression programming (GEP) technique to build an objective function that represents a correlation among the Hadoop configuration parameters. To the best of our knowledge, this is the first work that mathematically describes the inter-dependencies among the Hadoop configuration parameters when tuning the performance of Hadoop.
• For the purpose of configuration parameter optimization, particle swarm optimization (PSO) [19,20] is employed, which makes use of the GEP constructed objective function to search for a set of optimal or near optimal values of the configuration parameters. Unlike other optimization works that divide the search space into subspaces, the implemented PSO considers the whole search space in the optimization process in order to maintain the inter-dependencies among the configuration parameters.
To evaluate the performance of the proposed work, we run two typical Hadoop MapReduce applications, that is, WordCount and Sort, which are CPU and Input/Output (I/O) intensive, respectively. The performance of the proposed work is initially evaluated on an experimental Hadoop cluster configured with eight Virtual Machines (VMs) and subsequently on another Hadoop cluster configured with 16 VMs. The experimental results show that the proposed work enhances the performance of Hadoop by on average 67% on the WordCount application and 46% on the Sort application, respectively, compared with its default settings. The proposed work also outperforms both ROT and the Starfish model in Hadoop performance optimization.
The remainder of this paper is organized as follows. Section 2 introduces the set of core Hadoop configuration parameters considered in this work. Section 3 presents the design and implementation of GEP in generating an objective function that represents a correlation of the Hadoop parameters. The implementation of PSO for the optimization of parameter settings is presented in Section 4. Section 5 evaluates the performance of the proposed work on two experimental Hadoop clusters. Section 6 discusses a number of related works. Section 7 concludes the paper and points out some further work.

HADOOP CORE PARAMETERS
The Hadoop framework has more than 190 tunable configuration parameters that allow users to manage the flow of a Hadoop job in different phases during the execution process. Some of them are core parameters and have a significant impact on the performance of a Hadoop job [12,16]. The core parameters are briefly presented in Table I.

io.Sort.Factor
This parameter determines the number of files (streams) to be merged during the sorting process of map tasks. The default value is 10; increasing this value improves the utilization of physical memory and reduces the overhead of IO operations.

io.Sort.mb
During a job execution, the output of a map task is not directly written into the hard disk but is written into an in-memory buffer, which is assigned to each map task. The size of the in-memory buffer is specified through the io.sort.mb parameter. The default value of this parameter is 100 MB. The recommended value for this parameter is between 30% and 40% of the Java_Opts value and should be larger than the output size of a map task, which minimizes the number of spill records [11].
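As a quick illustration of the sizing rule above, the following sketch (the function name and the example figures are ours, not from the paper) returns the recommended range for io.sort.mb given a child JVM heap size and an expected map output size:

```python
def recommended_io_sort_mb(java_opts_mb, map_output_mb):
    # Rule of thumb above: 30%-40% of the child JVM heap (Java_Opts),
    # and at least the map task output size, to minimize spill records.
    low = max(0.30 * java_opts_mb, map_output_mb)
    high = 0.40 * java_opts_mb
    return low, high

# e.g. a 512 MB child heap and a ~160 MB map output:
lo, hi = recommended_io_sort_mb(512, 160)  # lo = 160, hi ~ 204.8
```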

io.Sort.Spill.Percent
The default value of this parameter is 0.8 (80%). When an in-memory buffer is filled up to 80%, the contents of the in-memory buffer (io.sort.mb) are spilled into the hard disk. It is recommended that the value of io.sort.spill.percent should not be less than 0.50.

Mapred.Reduce.Tasks
This parameter can have a significant impact on the performance of a Hadoop job [21]. The default value is 1. The optimum value of this parameter mainly depends on the size of the input dataset and the number of reduce slots configured in a Hadoop cluster. Setting a small number of reduce tasks for a job decreases the overhead in setting up tasks on a small input dataset, while setting a large number of reduce tasks improves the hard disk IO utilization on a large input dataset. The recommended number of reduce tasks is 90% of the total number of reduce slots configured in a cluster [8].

Mapred.Tasktracker.Map.Tasks.Maximum, Mapred.Tasktracker.Reduce.Tasks.Maximum
These parameters define the number of map and reduce tasks that can be executed simultaneously on each cluster node. Increasing the values of these parameters increases the utilization of the CPUs and physical memory of a cluster node, which can improve the performance of a Hadoop job. The optimum values of these parameters depend on the number of CPUs, the number of cores in each CPU, multi-threading capability, and the computational complexity of a job. The recommended value for these parameters is the number of CPU cores minus 1, as long as the cluster node has sufficient physical memory [9][10][11]. One CPU is reserved for other Hadoop services such as DataNode and TaskTracker.
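The two rules of thumb above can be sketched as follows (the helper names and the example cluster are ours, for illustration only):

```python
def recommended_slots_per_node(cpu_cores):
    # Cores minus one, reserving one CPU for Hadoop services
    # such as DataNode and TaskTracker.
    return max(1, cpu_cores - 1)

def recommended_reduce_tasks(total_reduce_slots):
    # 90% of the total reduce slots configured in the cluster [8].
    return int(0.9 * total_reduce_slots)

# An 8-node cluster of quad-core workers:
slots = recommended_slots_per_node(4)        # 3 slots per node
tasks = recommended_reduce_tasks(8 * slots)  # 21 reduce tasks
```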

Mapred.Child.Java.Opts
This is a memory-related parameter and the main candidate for JVM tuning. The default value is -Xmx200m, which gives at most 200 MB of physical memory to each child task. Increasing the value of Java_Opts reduces the spill operations that write map results to the hard disk, which can improve the performance of a job. By default, each worker node utilizes 2.8GB of physical memory [11]. The worker node assigns 400 MB to the map phase (i.e., two map slots), 400 MB to the reduce phase (i.e., two reduce slots), and 1000 MB to each of the DataNode and TaskTracker that run on the worker node.
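The per-node memory budget quoted above can be verified with simple arithmetic (values in MB):

```python
# Default worker-node memory budget described above (values in MB).
map_slot_heap = 2 * 200     # two map slots at -Xmx200m each
reduce_slot_heap = 2 * 200  # two reduce slots at -Xmx200m each
datanode = 1000             # DataNode daemon
tasktracker = 1000          # TaskTracker daemon
total_mb = map_slot_heap + reduce_slot_heap + datanode + tasktracker
print(total_mb)  # 2800 MB, i.e. the 2.8GB figure quoted from [11]
```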

Mapred.Compress.Map.Output, Mapred.Output.Compress
These two parameters are related to hard disk IO and network data transfer operations. Boolean values determine whether or not the map output and the reduce output are compressed. Enabling the compression of the map and reduce outputs for a job can speed up hard disk IO and minimize the overhead of data shuffling across the network.

MINING HADOOP PARAMETER CORRELATIONS WITH GEP
Gene expression programming [22] is a relatively new type of evolutionary algorithm [23]. It is developed based on ideas similar to those of genetic algorithms [24] and genetic programming [25]. Using a special format of solution representation, GEP overcomes some limitations of both genetic algorithms and genetic programming. GEP brings significant improvements to problems such as combinatorial optimization, classification, time series prediction, parametric regression, and symbolic regression.
GEP has been applied to a variety of domains such as data analysis in high energy physics, traffic engineering for IP networks, designing electronic circuits, and evolving classification rules. It has also been applied to the data mining field, especially for the investigation of internal correlations among the involved parameters. Gene expression programming uses a combined chromosome and expression tree structure [22] to represent the targeted problem being investigated. The factors of the targeted problem are encoded into a linear chromosome format together with some potential functions, which can be used to describe a correlation of the factors. Each chromosome generates an expression tree, and the chromosomes containing these factors are evolved during the evolutionary process.

Gene expression programming design
The execution time of a Hadoop job can be expressed as shown in Eq. (1), where x0, x1, …, xn are the configuration parameters of Hadoop:

T = f(x0, x1, …, xn)    (1)
In this work, we consider 10 core parameters of Hadoop as listed in Table II. Based on the data types of these Hadoop configuration parameters, the mathematical functions shown in Table III are used in GEP. A correlation of the Hadoop parameters can be represented by a combination of these mathematical functions. Figure 1 shows an example of mining a correlation of two parameters (x0 and x1), which is conducted in the following steps in GEP:
• Based on the data types of x0 and x1, find a mathematical function that has the same input data type as either x0 or x1 and has two input parameters.
• Calculate the estimated execution time of the selected mathematical function using the parameter setting samples.
• Find the mathematical function between x0 and x1 that produces the estimated execution time closest to the actual execution time. In this case, the Plus function is selected.
Similarly, a correlation of x0, x1, …, xn can be mined using the GEP method. The chromosome and expression tree structure of GEP is used to hold the parameters and mathematical functions. A combination of mathematical functions, which takes x0, x1, …, xn as inputs, is encoded into a linear chromosome, which is maintained and developed during the evolution process. Meanwhile, the expression tree generated from the linear chromosome produces a form of f(x0, x1, …, xn), based on which an estimated execution time is computed and compared with the actual execution time. A final form of f(x0, x1, …, xn) is produced at the end of the evolution process whose estimated execution time is the closest to the actual execution time.
In GEP, a chromosome can consist of one or more genes. For simplicity of computation, each chromosome has only one gene in this work. A gene is composed of a head and a tail. The elements of the head are selected randomly from the set of Hadoop parameters (listed in Table II) and the set of mathematical functions (listed in Table III), whereas the elements of the tail are selected only from the Hadoop parameter set. The length of a gene head is set to 20, which covers all the possible combinations of the mathematical functions. The length of a gene tail can be computed using Eq. (2):

tail_length = head_length × (n − 1) + 1    (2)
where n is the number of input arguments of the mathematical function that has the most input arguments among the functions. Figure 2 shows an example of a chromosome and expression tree structure taking into account five parameters x0, x1, x2, x3, x4. In Figure 2, the size of the gene head is 4 and n is 2, so the size of the gene tail is 5 based on Eq. (2). Four mathematical functions (+, −, /, pow) are selected to represent a correlation of the parameters x0, x1, x2, x3, x4. As a result, a form of f(x0, x1, x2, x3, x4) is generated from the expression tree, as illustrated in Eq. (3). In the following section, we present how the GEP method evolves in mining a correlation among the Hadoop configuration parameters.
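To make the encoding concrete, the sketch below decodes a single-gene chromosome written in Karva notation (breadth-first order, as in the GEP literature [22]) and evaluates it. The function set here is a small assumed subset of Table III, and the code is our illustration, not the authors' implementation:

```python
import operator

# A small assumed subset of the function set (all of arity 2).
FUNCS = {'+': operator.add, '-': operator.sub,
         '*': operator.mul, 'pow': pow}

def tail_length(head_length, max_arity):
    # Eq. (2): tail = head * (n - 1) + 1, n being the largest arity.
    return head_length * (max_arity - 1) + 1

def evaluate_gene(gene, values):
    """Decode a gene (a list of symbols in breadth-first order) into an
    expression tree and evaluate it; `values` maps terminals to numbers."""
    nodes = [{'sym': s, 'kids': []} for s in gene]
    queue, i = [nodes[0]], 1
    while queue:                       # assign children level by level
        node = queue.pop(0)
        need = 2 if node['sym'] in FUNCS else 0
        node['kids'] = nodes[i:i + need]
        queue.extend(node['kids'])
        i += need
    def ev(n):                         # recursive tree evaluation
        if n['sym'] in FUNCS:
            return FUNCS[n['sym']](ev(n['kids'][0]), ev(n['kids'][1]))
        return values[n['sym']]
    return ev(nodes[0])

# The Figure 1 example: the Plus function over x0 and x1.
t = tail_length(4, 2)                                     # 5, as in Figure 2
y = evaluate_gene(['+', 'x0', 'x1'], {'x0': 2, 'x1': 3})  # 5
```

Unused tail symbols are simply never reached by the decoder, which is what allows GEP to mutate chromosomes freely while always producing a valid expression tree.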

Gene expression programming implementation
Algorithm 1 shows the implementation of the GEP method. The input of Algorithm 1 is a set of Hadoop job running samples, which are used as a training dataset. To build the training dataset, we conducted 320 experiments on a Hadoop cluster, which is presented in Section 5.1. We ran two typical Hadoop applications (i.e., WordCount and Sort) to process input datasets of different sizes ranging from 5GB to 15GB. For each experiment, we manually tuned the configuration parameter values, ran the two applications three times each, and took an average of the execution times. A small portion of the training dataset is presented in Table IV.
In Algorithm 1, Lines 1 to 5 initialize the first generation of 500 chromosomes, which represent 500 possible correlations among the Hadoop parameters. Lines 8 to 29 implement the evolution process, in which a single loop represents one generation. Each chromosome is translated into an expression tree. Lines 11 to 17 calculate the fitness value of a chromosome. For each training sample, GEP produces an estimated execution time of a Hadoop job and compares it with the actual execution time of the job. If the difference is less than a pre-defined bias window, the fitness value of the current chromosome is increased by 1. The size of the bias window is set to 50 s, which allows a maximum of 10% error relative to the actual execution time of a Hadoop job sample. Line 18 shows that the evolution process terminates in the ideal case when the fitness value is equal to the number of training samples. Otherwise, the evolution process continues, and the chromosome with the best fitness value is kept, as shown in Lines 20 to 23. At the end of each generation, as shown in Lines 24 to 25, a genetic modification is applied to the current generation to generate variations of the chromosomes for the next generation.
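The fitness computation in Lines 11 to 17 can be sketched as follows (the predictor and the samples are toy assumptions for illustration, not training data from the paper):

```python
def chromosome_fitness(predict, samples, bias_window=50.0):
    # Count the training samples whose estimated execution time lies
    # within the bias window (50 s) of the actual execution time.
    return sum(1 for params, actual in samples
               if abs(predict(params) - actual) <= bias_window)

# A toy predictor and three samples; two estimates fall inside the window.
predict = lambda p: 10 * p[0]
samples = [((30,), 310.0), ((50,), 495.0), ((70,), 900.0)]
fitness = chromosome_fitness(predict, samples)  # 2
```

In the ideal case the fitness equals len(samples), which is exactly the termination condition checked in Line 18 of Algorithm 1.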
We varied the number of generations from 20000 to 80000 in the GEP evolution process and found that the quality of a chromosome (the ratio of the fitness value to the number of training samples) eventually exceeded 90%. As a result, we set the number of generations to 80000. The genetic modification parameters were set using the classic values [22], as shown in Table V. After 80000 generations, GEP generates Eq. (4), which represents a correlation of the Hadoop parameters listed in Table II.

HADOOP PARAMETER OPTIMIZATION WITH PSO
In this section, we employ PSO to optimize Hadoop parameter settings. We use Eq. (4) generated by the GEP method in Section 3.2 as an objective function in PSO optimization.
Particle swarm optimization is an evolutionary computational algorithm introduced by Eberhart and Kennedy in 1995. The algorithm is inspired by the social behaviors of bird flocking, fish schooling, and swarm theory [19,20]. PSO has been successfully applied in a wide range of problem domains due to its rapid convergence towards an optimum solution [26][27][28][29][30]. In PSO, particles can be considered as agents that fly through a multidimensional search space and record the best solution that they have discovered. Each particle of the swarm adjusts its path according to its own flying experience and the flying experiences of its neighboring particles in the multidimensional search space.

Let
• d be the number of dimensions of the search space. In this work, d is set to 9, representing the 9 Hadoop configuration parameters listed in Table II.
• n be the total number of particles in a swarm.
• Xi,j be the position of particle i in dimension j of the search space.
• Pi,j be the list of the locally best positions of particle i, Pi = (pi,1, pi,2, …, pi,d).
• Vi,j be the velocity of particle i in dimension j.
• G be the list of the globally best positions of the swarm, G = (g1, g2, …, gd).
To implement the PSO algorithm, we first initialize the positions of the particles randomly within the bounds of the search space so that the search space is uniformly covered, while the velocities of the particles are initialized to zeros as suggested in [31]. Then the PSO algorithm updates the swarm by updating the velocity and position of each particle in every dimension using Eq. (5) and Eq. (6), respectively:

v^{t+1}_{i,j} = w · v^t_{i,j} + c1 · r1 · (p^t_{i,j} − x^t_{i,j}) + c2 · r2 · (g^t_j − x^t_{i,j})    (5)

x^{t+1}_{i,j} = x^t_{i,j} + v^{t+1}_{i,j}    (6)
where
• r1 and r2 are cognitive and social randomization parameters, respectively. They take random values between 0 and 1.
• c1 and c2 are local and global weights, respectively. They are acceleration constants.
• w is an inertia weight that balances the global and local search capabilities [32].
• t is a relative time index.
• v^{t+1}_{i,j} is the velocity of particle i at time step t + 1.
• v^t_{i,j} is the velocity of particle i at time step t.
• p^t_{i,j} is the locally best position of particle i at time step t.
• x^t_{i,j} is the current position of particle i at time step t.
• g^t_j is the globally best position visited by any particle at time step t.
• x^{t+1}_{i,j} is the new position of particle i at time step t + 1.
In each iteration, the new position of a particle is evaluated using the objective function f(x0, x1, …, x9). The locally best value is compared with the new fitness value and updated accordingly; the globally best position is updated similarly. In the PSO algorithm, clamping the velocity and position of a particle within the feasible search area is a challenging task, which becomes even more complicated when the optimization problem has bounds: particle positions and velocities that fly out of the feasible area (i.e., out of boundary) must be handled. Moreover, it has been shown that as the number of problem parameters increases, the probability of the particles flying out of the feasible space increases dramatically [33,34]. For this purpose, we employ the nearest method presented in [34] to handle bound violations.
To handle bound violations of a particle, we define v_j,min and v_j,max, which represent the lower and upper bounds of the velocity of a particle, respectively. Similarly, we define x_j,min and x_j,max, representing the lower and upper bounds of the position of a particle. The values of the lower and upper bounds of the position of a particle are set according to the range of each Hadoop parameter listed in Table VI. Setting the values for the lower and upper bounds of the velocity of a particle, however, is problem dependent, and the values can be found empirically. We set v_j,min to −10% of (x_j,max − x_j,min) and v_j,max to +10% of (x_j,max − x_j,min). Each particle moves in the search space within the upper and lower bounds of its position and velocity. If a particle strays out of these bounds, its velocity and position values are set back to the nearest bound values. Algorithm 2 shows the PSO implementation.
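The nearest-bound handling described above can be sketched per dimension as follows (our illustration of the method in [34]; the example parameter range is assumed, not taken from Table VI):

```python
def clamp_nearest(x_j, v_j, x_min, x_max):
    # Velocity bounds are +/-10% of the position range, as in the text;
    # any value outside its range is reset to the nearest bound.
    v_max = 0.10 * (x_max - x_min)
    v_j = min(max(v_j, -v_max), v_max)
    x_j = min(max(x_j, x_min), x_max)
    return x_j, v_j

# A particle overshooting an assumed parameter range of [100, 500]:
x, v = clamp_nearest(620.0, 75.0, 100.0, 500.0)
```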
It is worth pointing out that PSO can sometimes be trapped in a local optimum. This issue can be avoided by adjusting the inertia weight (w) factor used in Eq. (5). Instead of using a constant value for w, we use a dynamic inertia weight that linearly decreases in every iteration to overcome the local optima problem [32]. The dynamic inertia weight can be computed using Eq. (7):

w^t = w_max − ((w_max − w_min) × t) / t_max    (7)

where w_min = 0, w_max = 1, and t_max is the total number of iterations.
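The linearly decreasing inertia weight described above can be sketched as (w_min = 0 and w_max = 1 as stated; t_max denotes the total number of iterations):

```python
def inertia_weight(t, t_max, w_min=0.0, w_max=1.0):
    # w decreases linearly from w_max down to w_min over the run,
    # shifting the search from global exploration to local refinement.
    return w_max - (w_max - w_min) * t / t_max

# With 100 iterations: w starts at 1.0, reaches 0.5 halfway, ends at 0.0.
```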

PERFORMANCE EVALUATION
The performance of the GEP guided PSO optimization work was initially evaluated on an experimental Hadoop cluster using a single Intel Xeon server machine configured with eight VMs and subsequently on another Hadoop cluster using two Intel Xeon server machines configured with 16 VMs. The motivation for using two Hadoop clusters was to evaluate the proposed work intensively by taking into account the network overhead across the two server machines. In this section, we first give a brief introduction to the experimental environments that were set up in the evaluation process and then present the performance evaluation results.

Experimental set up
We set up a Hadoop cluster using one Intel Xeon server machine.

The impact of Hadoop parameters on performance
We ran the WordCount application as a Hadoop job to evaluate the impact of the configuration parameters listed in Table I on Hadoop performance. From Figure 3, it can be observed that the execution time of the job decreases as the io.sort.mb value increases. The larger the value of this parameter, the fewer operations are incurred in writing the spill records to the hard disk, leading to a lower output overhead.
The io.sort.factor parameter determines the number of data streams that can be merged in the sorting process. Initially, the execution time of the job goes down with an increasing value of the parameter, as shown in Figure 4, where the value of 200 represents the best value of the parameter. Subsequently, the execution time goes up when the value of the parameter increases further. This is because there is a tradeoff between the reduced overhead incurred in IO operations when the value of the parameter increases and the added overhead incurred in merging the data streams. Figure 5 shows the impact of the number of reduce tasks on job performance. There is a tradeoff between the overhead incurred in setting up reduce tasks and the performance gain in utilizing resources. Initially, increasing the number of reduce tasks better utilizes the available resources, which leads to a decreased execution time. However, a large number of reduce tasks incurs a high overhead in the setup process, which leads to an increased execution time.
Increasing the number of map and reduce slots better utilizes the available resources, which leads to a decreased execution time, as can be observed in Figure 6 when the number of slots increases from 1 to 2. However, resources might be over-utilized when the number of slots increases further, which slows down job execution.
Increasing the value of the Java_Opts parameter utilizes more memory, which leads to a decreased execution time, as shown in Figure 7. However, a large value of this parameter would over-utilize the available memory space; in this case, the hard disk is used as virtual memory, which slows down job execution. Figure 8 shows the impact of the compression parameters on the performance of a Hadoop job. The results generated by map tasks or reduce tasks can be compressed to reduce the overhead in IO operations and data transfer across the network, which leads to a decreased execution time. It is worth noting that the performance gap between the compressed and uncompressed cases widens with an increasing size of the input data.

Particle swarm optimization set up
The parameters used in the PSO algorithm are presented in Table VIII. We set the particle swarm size to 20 and the number of iterations to 100, as suggested in the literature [35,36]. The values of c1 and c2 were set to 1.4269 as proposed in [37], the value of w was set dynamically between 0 and 1, and the values of r1 and r2 were selected randomly between 0 and 1 in every iteration. The PSO algorithm processes real number values, while some of the Hadoop configuration parameters accept only integer values (e.g., the number of map slots); we rounded the values of these parameters to integers. We set the two configuration parameters that have a Boolean value (i.e., mapred.compress.map.output and mapred.output.compress) to True, because we found empirically that setting these two parameters to True significantly improves the performance of a Hadoop job, as shown in Figure 8. Table IX presents the PSO recommended configuration parameter settings for a Hadoop job with an input dataset of varied sizes ranging from 5GB to 20GB.

Starfish job profile
In order to collect a job profile for the Starfish optimizer, we first ran both WordCount and Sort in the Starfish environment with the profiler enabled. Both applications processed an input dataset of 5GB. The Starfish optimizer was then invoked to generate configuration parameter settings. The configuration parameter settings recommended by Starfish for the two applications are presented in Table X and Table XI, respectively.

Experimental results on Hadoop performance
In this section, we compare the performance of the proposed work with that of Starfish, ROT, and the default configuration parameter settings in Hadoop optimization. Both the WordCount and Sort applications were deployed on the Hadoop cluster with eight VMs to process input datasets of four different sizes varying from 5GB to 20GB. We ran both applications three times each using the PSO recommended parameter settings and took an average of the execution times. The performance results of the two applications are shown in Figure 9 and Figure 10, respectively. It can be observed that overall, the implemented PSO improves the performance of the WordCount application by an average of 67% in the four input data scenarios compared with the default Hadoop parameter settings, 28% compared with Starfish, and 26% compared with ROT. The improvement reaches a maximum of 71% when the input data size is 20GB. The performance improvement of the PSO optimization on the Sort application is on average 46% over the default Hadoop parameter settings, 16% over Starfish, and 37% over ROT. The improvement reaches a maximum of 65% when the input data size is 20GB.
It should be pointed out that the implemented PSO algorithm considers both the underlying hardware resources and the size of an input dataset when recommending configuration parameter settings for the two applications, whereas the ROT work only considers the underlying hardware resources, as shown in Table XII. It is worth noting that ROT performs slightly better than Starfish on the WordCount application. This is because Starfish suggests a large number of reduce tasks, which generates a high overhead in setting up these reduce tasks, especially in the case of a small input dataset (e.g., 5GB), whereas ROT suggests a small number of reduce tasks, which are completed in a single wave, generating a low overhead in setting up the reduce tasks. ROT estimates the number of reduce tasks based on the total number of reduce slots configured in the Hadoop cluster.
We have further evaluated the performance of the PSO optimization work on another Hadoop cluster configured with 16 VMs. From Figure 11 and Figure 12, it can be observed that the PSO work improves the performance of both applications on average by 65% and 86% compared with ROT and the default Hadoop settings, respectively. The improvement reaches a maximum of 87% on the WordCount application when the input data size is 35GB. The performance gains of the PSO work over the Starfish model on the WordCount application and the Sort application are on average 20% and 21%, respectively. It is worth noting that the Starfish model performs better than ROT in the case of using 16 VMs, in which a large dataset with a size varying from 25GB to 40GB was used. ROT recommends False for the mapred.output.compress parameter (as shown in Table XII). As a result, both applications took a long time in the reduce phase when writing the reduce task outputs into the hard disk. For example, WordCount took 19 min in the map phase and 61 min in the reduce phase to process the 40GB dataset following the ROT recommended parameter settings, whereas it took 13 min in the map phase and only 23 min in the reduce phase to process the same amount of data following the Starfish recommended parameter settings. This is because Starfish enabled the mapred.output.compress parameter, which reduces the overhead in writing the reduce task outputs into the hard disk.

Figure 11. The performance of the WordCount application using 16 virtual machines.

RELATED WORK
In recent years, numerous studies have been carried out to optimize the performance of Hadoop from different aspects. The methodologies of these studies are diverse, ranging from optimizing Hadoop job scheduling mechanisms to tuning the configuration parameter settings. For example, many researchers have focused on developing adaptive load balancing mechanisms [38][39][40][41] and data locality algorithms [42][43][44][45] to improve the performance of Hadoop.
A group of researchers have proposed optimization approaches for particular types of jobs such as short jobs and query-based jobs [46][47][48][49]. Jahani et al. proposed the MANIMAL model [46], which automatically analyzes a Hadoop program using a static analyzer tool for optimization. However, the MANIMAL model only focuses on relational-style programs employing the selection and projection operators and does not consider text-processing programs. Moreover, it only optimizes the map phase in Hadoop. Elmeleegy et al. presented Piranha [49], a system that optimizes short jobs (i.e., query-based jobs) by minimizing their response times. They suggested that fault-tolerance facilities are not necessary for short-running jobs because the jobs are small and unlikely to incur failures. The works presented in [47,48] focus on optimizing short Hadoop jobs by enhancing task execution mechanisms. They optimized the task initialization and termination stages by removing the constant heartbeat, which is used for the task setup and cleanup process in Hadoop. They proposed a push model for heartbeat communication to reduce delays between the JobTracker and a TaskTracker, and implemented an instant communication mechanism between the JobTracker and a TaskTracker in order to separate message communication from the heartbeat.
Many researchers have also investigated resource provisioning for Hadoop jobs. Palanisamy et al. presented the Cura model [50], which allocates an optimum number of VMs to a user job. The model dynamically creates and destroys the VMs based on the user workload in order to minimize the overall cost of the VMs. Virajith et al. [51] proposed Bazaar, which predicts Hadoop job performance and provisions resources in terms of VMs to satisfy user requirements. A model proposed in [52] optimizes Hadoop resource provisioning in the Cloud. The model employs a brute-force search to find optimum values for map slots and reduce slots over the resource configuration space. Tian et al. [53] proposed a cost model that estimates the performance of a Hadoop job and provisions the resources for the job using a simple regression technique. Chen et al. [54] further improved the cost model and proposed CRESP, which employs a brute-force search technique to provision optimal resources in terms of map slots and reduce slots for Hadoop jobs. Lama et al. [55] proposed AROMA, a system that automatically provisions the optimal resources of a job to achieve service level objectives. AROMA builds on a clustering technique to group jobs with similar behaviors. It employs a support vector machine to predict the performance of a Hadoop job and a pattern search technique to find an optimal set of resources for a job to achieve the required deadline with a minimum cost. However, AROMA cannot predict the performance of a job whose resource utilization pattern is different from any previous ones. More importantly, AROMA does not provide a comprehensive mathematical model to estimate a job execution time.

Figure 12. The performance of the Sort application using 16 virtual machines.
There are a few other sophisticated models, such as [12,13,15-17], that are similar to the proposed work in the sense that they optimize a Hadoop job by tuning the configuration parameter settings. Wu et al. proposed PPABS [15], which automatically tunes the Hadoop framework configuration parameter settings based on executed job profiles. The PPABS framework consists of an analyzer and a recognizer. The analyzer trains PPABS to classify jobs with similar performance into a set of equivalent classes, using K-means++ for clustering and simulated annealing to find the optimal settings for each class. The recognizer classifies a new job into one of these equivalent classes using a pattern recognition technique: it first runs the new job on a small dataset using the default configuration settings and then applies pattern recognition to classify it. Each class has its best configuration parameter settings, so once the recognizer determines the class of a new job, it automatically loads the best configuration settings for that job. However, PPABS is unable to find fine-tuned configuration settings for a new job that does not belong to any of these equivalent classes. Moreover, PPABS does not consider the correlations among the configuration parameters. Herodotou et al. proposed Starfish [12,13], which employs a mixture of a cost model [14] and a simulator to optimize a Hadoop job based on previously executed job profile information. Starfish divides the search space into subspaces, considers the configuration parameters independently for optimization, and combines the optimum configuration settings found in each subspace into a group of optimum configuration settings. Starfish collects the running job profile information at a fine granularity for job estimation and automatic optimization. However, collecting detailed job profile information with a large set of metrics generates an extra overhead.
As a result, the Starfish model is unable to accurately estimate the job execution time, and consequently it overestimates the values of some configuration parameters, especially the number of reduce tasks. Moreover, because Starfish divides the configuration parameter space into subspaces, it may ignore the correlations among the parameters. Liao et al. proposed Gunther [16], a search-based model that automatically tunes the configuration parameters using a genetic algorithm. One critical limitation of Gunther is that it does not have a fitness function in the implemented genetic algorithm; the fitness of a set of parameter values is evaluated by physically running a Hadoop job with these parameters, which is a time-consuming process. Liu et al. [17] proposed Panacea, which takes two approaches to optimizing Hadoop applications. In the first approach, it optimizes the compiler at run time, and a new API was developed on top of Soot [56] to reduce the overhead of iterative Hadoop applications. In the second approach, it optimizes a Hadoop application by tuning the Hadoop configuration parameters: it divides the parameter search space into sub-spaces and then searches for optimum values by iteratively trying different parameter values within each range. However, Panacea does not provide a sophisticated search technique or a mathematical function that represents a correlation among the Hadoop configuration parameters. Li et al. [18] proposed a performance evaluation model for whole-system optimization of Hadoop. The model analyzes the hardware and software levels and explores the performance issues in these layers. It mainly focuses on the impact of different configuration settings on job performance rather than tuning the configuration parameters.

CONCLUSION
Running a Hadoop job with the default parameter settings can lead to performance issues. This paper optimized Hadoop performance by automatically tuning its configuration parameters. The proposed work employed GEP to build an objective function that represents a correlation among the Hadoop configuration parameters, and implemented PSO, which uses this objective function to search for optimal or near optimal parameter settings. The proposed work improves Hadoop performance significantly in comparison with the default settings. In addition, it performs better than existing representative works such as ROT and the Starfish model in Hadoop performance optimization.
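The tuning loop summarized above can be sketched in a few lines. This is a minimal, generic PSO sketch, not the paper's actual implementation: the quadratic objective stands in for the GEP-derived execution-time model, and all names, bounds, and hyperparameter values are illustrative assumptions.

```python
import random

def pso_minimize(objective, bounds, n_particles=20, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimal PSO sketch: search the parameter box `bounds` for a
    setting that minimizes `objective` (a stand-in for the
    GEP-derived model of job execution time)."""
    rng = random.Random(seed)
    dim = len(bounds)
    pos = [[rng.uniform(lo, hi) for lo, hi in bounds]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]              # per-particle best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]   # global best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                lo, hi = bounds[d]
                pos[i][d] = min(max(pos[i][d] + vel[i][d], lo), hi)
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

# Stand-in objective with a known minimum at (3, 5); in the paper's
# setting each dimension would be a Hadoop configuration parameter.
obj = lambda p: (p[0] - 3) ** 2 + (p[1] - 5) ** 2
best, best_val = pso_minimize(obj, [(0, 10), (0, 10)])
```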
We have previously implemented work on Hadoop job estimation and resource provisioning [21]. One piece of future work will be to integrate the two together for resource provisioning in optimized Hadoop clusters.