Efficient exact algorithms for continuous bi-objective performance-energy optimization of applications with linear energy and monotonically increasing performance profiles on heterogeneous high performance computing platforms

Performance and energy are the two most important objectives for optimization on heterogeneous high performance computing platforms. This work studies a mathematical problem motivated by the bi-objective optimization of data-parallel applications on such platforms for performance and energy. First, we formulate the problem and present an exact algorithm of polynomial complexity solving the problem where all the application profiles of objective type one are continuous and strictly increasing, and all the application profiles of objective type two are linear increasing. We then apply the algorithm to develop solutions for two related optimization problems of parallel applications on heterogeneous hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. Our proposed solution methods are then employed to solve the two bi-objective optimization problems for two data-parallel applications, matrix multiplication and gene sequencing, on a hybrid platform employing five heterogeneous processors, namely, two different Intel multicore CPUs, an Nvidia K40c GPU, an Nvidia P100 PCIe GPU, and an Intel Xeon Phi.

Research works [9][10][11][12] propose application-level solution methods that employ decision variables, which include the number of processes, number of threads, loop blocking factor, and workload distribution. The solution methods proposed in References 10,11 solve the bi-objective optimization problem of an application for performance and energy on homogeneous clusters of modern multicore CPUs. The solution method 9 considers the effect of heterogeneous workload distribution on bi-objective optimization of data analytics applications by simulating heterogeneity on homogeneous clusters.
Khaleghzadeh et al. 12 study bi-objective optimization of data-parallel applications for performance and energy on heterogeneous processors.
The main contribution of this work is a bi-objective optimization algorithm for the case of discrete performance and energy functions with any arbitrary shape. The algorithm returns the Pareto front of load imbalanced solutions and best load balanced solutions. The authors also briefly study the continuous bi-objective optimization problem but only for the simple case of two heterogeneous processors with linear execution time and linear dynamic energy functions. They propose an algorithm to find the Pareto front and show that it is linear, containing an infinite number of solutions.
While one solution is load balanced, the rest are load imbalanced. However, they do not present an algorithm to determine the workload distribution (solution in the decision vector space) corresponding to a point in the Pareto front (solution in the objective space).
In Reference 13, we study a more general continuous bi-objective optimization problem for a generic case of k heterogeneous processors.
The problem is motivated by the bi-objective optimization for the performance and energy of data-parallel applications on heterogeneous HPC platforms. We now present a use case that highlights the problem.
Consider, for example, the bi-objective optimization of a highly optimized matrix multiplication application on a heterogeneous computing platform. The application's dynamic energy consumption is equal to the platform's total energy consumption ($E_T$) minus the static energy consumption. The static energy consumption is the idle (static) power of the platform ($P_S$) multiplied by the application's execution time ($t$). The static and dynamic energy consumptions during an application execution are obtained using power meters, which is considered the most accurate method of energy measurement. 14 The execution time function shapes are continuous and strictly increasing. The energy function shapes can be approximated accurately by linear increasing functions. While the execution time profiles of the two CPUs are close to each other, the energy profile of CPU_1 is significantly higher than that of CPU_2.

FIGURE 1 Specifications of the five heterogeneous processors: Intel Haswell multicore CPU, Nvidia K40c, Intel Xeon Phi 3120P, Intel Skylake multicore CPU, and Nvidia P100 PCIe

[Figure caption fragment] ... are the same except that they do not contain the profiles for Xeon Phi, whose energy profile dominates the other energy profiles.

The optimization goal is to find workload distributions of the workload size $n$ ($\{x_0, \dots, x_4\}$, $\sum_{i=0}^{4} x_i = n$) minimizing the execution time ($\max_{i=0}^{4} f_i(x_i)$) and the dynamic energy consumption ($\sum_{i=0}^{4} g_i(x_i)$) during the parallel execution of the application. In Reference 13, we solve the continuous optimization problem for such shapes of performance and dynamic energy functions.
We first formulate the mathematical problem, which for a given positive real number $n$ aims to find a vector $X = \{x_0, \dots, x_{k-1}\} \in \mathbb{R}^k_{\geq 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, minimizing the maximum of a $k$-dimensional vector of functions of objective type one and the sum of a $k$-dimensional vector of functions of objective type two. We then propose an exact algorithm of polynomial complexity solving the case where all the functions of objective type one are continuous and strictly increasing, and all the functions of objective type two are linear increasing.
In this work, we apply our proposed algorithm to solve two related optimization problems of parallel applications on heterogeneous hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. We believe both optimization problems are pertinent to optimizing applications on modern heterogeneous hybrid HPC platforms for the following reasons. First, the thermal design power (TDP) of multicore CPUs and accelerators has either remained the same or increased with each new generation. For example, the TDPs of the K40 and P100 GPUs used in our experiments are 235 and 250 W, respectively. The TDP of the latest generation A100 GPU is 250 W. The TDPs of the Intel Xeon Gold 6152 and Intel Xeon E5-2670 v3 used in our experiments are 140 and 120 W, respectively. The TDP of the latest generation Icelake Xeon Gold 6354 is 205 W. Second, although the static energy consumption of devices is decreasing with every new generation, the total static energy consumption of a heterogeneous hybrid HPC platform with two or more such devices is still significant. Finally, the improvements in dynamic power consumption are not similar in magnitude to those for static power consumption.
We first formulate and solve a bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and dynamic energy. The solution to the problem is a straightforward application of our algorithm proposed in Reference 13.
We then formulate a bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and total energy. We prove a theorem that lays the foundation for our algorithm solving the problem. The theorem states that a solution vector $X$ is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector $X_1$ such that $X_1$ is Pareto-optimal for execution time and dynamic energy and $E_T(X_1) < E_T(X)$. Then, we propose an algorithm for solving the problem. The correctness of the algorithm follows from the theorem. Finally, we prove the algorithm has polynomial complexity.
The proposed algorithms (in Reference 13 and this work) are then employed to solve the two bi-objective optimization problems for two data-parallel applications, matrix multiplication and gene sequencing, employing five heterogeneous processors, two Intel multicore CPUs, an Nvidia K40c GPU, an Nvidia P100 PCIe GPU, and an Intel Xeon Phi (Figure 1). For the workloads and the platform employed in our experiments, the algorithms provide continuous piecewise linear Pareto fronts for performance and dynamic energy and performance and total energy where the performance-optimal point is the load balanced configuration of the application.
Based on our experiments, the maximum dynamic energy savings can be up to 17% while tolerating a performance degradation of 5% (an energy savings of 106 J for an execution time increase of 0.05 seconds) for the matrix multiplication application. The maximum total energy savings is 8%.
The dynamic energy and total energy savings for the gene sequencing application accepting a 1% performance hit are 23% and 16%.
The main original contributions of this work are:
• Mathematical formulations of the bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and dynamic energy and for performance and total energy;
• A theorem that lays the foundation for our algorithm solving the bi-objective optimization problem for performance and total energy. The theorem states that a solution vector $X$ is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector $X_1$ such that $X_1$ is Pareto-optimal for execution time and dynamic energy and $E_T(X_1) < E_T(X)$;
• An exact algorithm of polynomial complexity solving the bi-objective optimization problem for performance and total energy and whose correctness follows from the theorem;
• Experimental study of the practical efficacy of our proposed algorithms for optimization of two data-parallel applications, matrix multiplication and gene sequencing, on a platform comprising five heterogeneous processors that include two multicore CPUs, two GPUs, and one Intel Xeon Phi. We demonstrate that the algorithms provide continuous piecewise linear Pareto fronts for performance and dynamic energy and performance and total energy for the workloads and the platform employed in our experiments.
The rest of the paper is organized as follows. We discuss the related work in Section 2. The formulation of the bi-objective optimization problem is presented in Section 3. In Section 4, we propose an efficient and exact algorithm solving the bi-objective optimization problem. Section 5 presents application of our proposed algorithm to optimization of heterogeneous parallel applications for performance and energy. Section 6 contains the experimental results. Finally, we conclude the paper in Section 7.

Bi-objective optimization: background
A bi-objective optimization problem can be mathematically formulated as: 15,16

$$\text{minimize } \{T(X), E(X)\} \quad \text{subject to } X \in \mathcal{S},$$

where there are two objective functions, $T : \mathbb{R}^k \to \mathbb{R}$ and $E : \mathbb{R}^k \to \mathbb{R}$. We denote the vector of objective functions by $\mathcal{F}(X) = (T(X), E(X))^T$. The decision vectors $X = (x_0, \dots, x_{k-1})^T$ belong to the (nonempty) feasible region (set) $\mathcal{S}$, which is a subset of the decision variable space $\mathbb{R}^k$. We denote the image of the feasible region by $\mathcal{Z}$ ($= \mathcal{F}(\mathcal{S})$), and call it a feasible objective region. It is a subset of the objective space $\mathbb{R}^2$. The elements of $\mathcal{Z}$ are called objective (function) vectors or criterion vectors and denoted by $\mathcal{F}(X)$ or $z = (z_1, z_2)^T$, where $z_1 = T(X)$ and $z_2 = E(X)$ are objective (function) values or criterion values.
The objective is to minimize both the objective functions simultaneously. The objective functions are at least partly conflicting or incommensurable, due to which it is impossible to find a single solution that would be optimal for all the objectives simultaneously.

Definition 1.
A decision vector $X^* \in \mathcal{S}$ is Pareto optimal if there does not exist another decision vector $X \in \mathcal{S}$ such that $T(X) \leq T(X^*)$, $E(X) \leq E(X^*)$ and either $T(X) < T(X^*)$ or $E(X) < E(X^*)$ or both. 15 An objective vector $z^* \in \mathcal{Z}$ is Pareto optimal if there is not another objective vector $z \in \mathcal{Z}$ such that $z_1 \leq z_1^*$, $z_2 \leq z_2^*$ and $z_j < z_j^*$ for at least one index $j$.
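The definition can be illustrated on a finite set of objective vectors. The following is a minimal sketch, not part of the proposed algorithms: the names `dominates` and `pareto_front` and the sample points are hypothetical, and the quadratic-time filter is chosen only for clarity.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (execution time, energy)


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in both objectives and strictly better in one."""
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by execution time."""
    return sorted(p for p in points if not any(dominates(q, p) for q in points))


# Hypothetical objective vectors (time in seconds, energy in joules).
candidates = [(1.0, 9.0), (2.0, 5.0), (2.5, 6.0), (3.0, 4.0), (3.0, 7.0)]
front = pareto_front(candidates)
print(front)
```

Here (2.5, 6.0) and (3.0, 7.0) are dominated by (2.0, 5.0), so they are not part of the front.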
There are several classifications for methods solving bi-objective optimization problems. 15,16 Since the set of Pareto optimal solutions is partially ordered, one classification is based on the involvement of the decision-maker in the solution method to select specific solutions. There are four categories in this classification: no preference, a priori, a posteriori, and interactive. The algorithms solving bi-objective optimization problems can be divided into two major categories, exact methods and metaheuristics. While branch-and-bound is the dominant technique in the first category, genetic algorithm (GA) is popular in the second category.

Bi-objective optimization for performance and energy on HPC platforms
There are two principal categories of methods for optimizing applications on HPC platforms for performance and energy.

System-level methods
The first category of system-level solution methods aims to optimize the performance and energy of the executing environment of the applications. The dominant decision variable in this category is dynamic voltage and frequency scaling (DVFS). DVFS reduces the dynamic power consumed by a processor by throttling its clock frequency.
Rong et al. 17 present a runtime system (CPU MISER) based on DVFS that provides energy savings with minimal performance degradation by using a performance model. Huang et al. 18 propose an eco-friendly daemon that employs workload characterization as a guide to DVFS to reduce power and energy consumption with little impact on application performance. Mezmaz et al. 19 propose a parallel bi-objective GA to maximize the performance and minimize the energy consumption in cloud computing infrastructures. Fard et al. 1 present a four-objective case study comprising performance, economic cost, energy consumption, and reliability for optimization of scientific workflows in heterogeneous computing environments. Beloglazov et al. 20 propose heuristics that consider twin objectives of energy efficiency and Quality of Service for provisioning data center resources.
Durillo et al. 3 propose a multi-objective workflow scheduling algorithm for optimization of applications executing in heterogeneous high-performance parallel and distributed computing systems. Performance and energy consumption are among the objectives. Das et al. 21 propose task mapping to optimize for energy and reliability on multiprocessor systems-on-chip with performance as a constraint. Kolodziej et al. 5 propose multi-objective GAs that aim to maximize performance and minimize energy consumption of applications executing in green grid clusters and clouds. The performance is modeled using the computation speed of a processor. The decision variable is the DVFS level. Vaibhav et al. 22 present a runtime system that performs both processor and DRAM frequency scaling and demonstrate total energy savings with minimal performance loss. Abdi et al. 23 propose multicriteria optimization where they minimize the execution time under three constraints: the reliability, the power consumption, and the peak temperature. DVFS is a key decision variable in all of these research works.
The methods proposed in References 6-8 optimize for performance under an energy budget or optimize for energy under an execution time constraint. The methods proposed in References 2,3,5 solve bi-objective optimization for performance and energy with no time constraint or energy budget.

Application-level methods
The second category, application-level solution methods, 9-12,24,25 uses application-level decision variables and models. The most popular decision variables include the loop tile size, workload distribution, number of processors, and number of threads.
Tarplee et al. 26 employ task mapping as a decision variable for bi-objective optimization of applications for performance and energy on an HPC platform. Aba et al. 27 present an approximation algorithm for bi-objective optimization of parallel applications running on a system of heterogeneous resources for performance and total energy. The decision variable is task scheduling. Their algorithm ignores all solutions where energy consumption exceeds a given constraint and returns the solution with minimum execution time.
Reddy et al. 11,25 study bi-objective optimization of data-parallel applications for performance and energy on homogeneous clusters of multicore CPUs employing only one decision variable, the workload distribution. They propose an efficient solution method. The method accepts as input the number of available processors, the discrete function of the processor's energy consumption against the workload size, and the discrete function of the processor's performance against the workload size. It outputs a Pareto-optimal set of workload distributions.
Chakraborti et al. 9 consider the effect of heterogeneous workload distribution on bi-objective optimization of data analytics applications by simulating heterogeneity on homogeneous clusters. The performance is represented by a linear function of problem size and the total energy is predicted using historical data tables. Khaleghzadeh et al. 12 propose a solution method for the bi-objective optimization problem of data-parallel applications for performance and energy on heterogeneous processors. The method comprises two principal components. The first component is a data partitioning algorithm that takes as input discrete performance and dynamic energy functions with no shape assumptions. The second component is a novel methodology employed to build the discrete dynamic energy profiles of individual computing devices, which are input to the algorithm.
Khokhriakhov et al. 28 propose a novel solution method for bi-objective optimization of multithreaded data-parallel applications for performance and dynamic energy on a single multicore processor. The method uses two decision variables, the number of identical multithreaded kernels (threadgroups) executing the application and the number of threads per threadgroup, with a given workload partitioned equally between the threadgroups.

FORMULATION OF THE BI-OBJECTIVE OPTIMIZATION PROBLEM
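Given a positive real number $n \in \mathbb{R}_{>0}$ and two sets of $k$ functions each, $F = \{f_0, f_1, \dots, f_{k-1}\}$ and $G = \{g_0, g_1, \dots, g_{k-1}\}$, where $f_i, g_i : \mathbb{R}_{\geq 0} \to \mathbb{R}_{\geq 0}$, $i \in \{0, \dots, k-1\}$, the problem is to find a vector $X = \{x_0, \dots, x_{k-1}\} \in \mathbb{R}^k_{\geq 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, minimizing the objective functions $T(X) = \max_{i=0}^{k-1} f_i(x_i)$ and $E(X) = \sum_{i=0}^{k-1} g_i(x_i)$. We use $T \times E$ to denote the objective space of this problem, $\mathbb{R}_{\geq 0} \times \mathbb{R}_{\geq 0}$. Thus, the problem can be formulated as follows:

$$\mathrm{BOPGV}(n, k, F, G): \quad \min_{X} \{T(X), E(X)\} \quad \text{subject to } \sum_{i=0}^{k-1} x_i = n, \quad x_i \geq 0.$$

We aim to solve BOPGV by finding both the Pareto front containing the optimal objective vectors in the objective space $T \times E$ and the decision vector for a point in the Pareto front. Thus, our solution finds a set of triplets $\Psi = \{(T(X), E(X), X)\}$ such that $X$ is a Pareto-optimal decision vector, and the projection of $\Psi$ onto the objective space $T \times E$ is the Pareto front symbolized by $\Psi{\downarrow}_{T \times E}$.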
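The two objectives can be evaluated directly for any candidate workload distribution. The following is a minimal sketch under stated assumptions: the function name `evaluate` and the two-processor instance (coefficients `a` and `b`) are hypothetical and serve only to show how the max and the sum are formed.

```python
from typing import Callable, Sequence, Tuple

Func = Callable[[float], float]


def evaluate(x: Sequence[float], F: Sequence[Func], G: Sequence[Func]) -> Tuple[float, float]:
    """Return (T(X), E(X)) for a workload distribution x."""
    t = max(f(xi) for f, xi in zip(F, x))  # parallel time: the slowest processor
    e = sum(g(xi) for g, xi in zip(G, x))  # energy adds up across processors
    return t, e


# Hypothetical instance: f_i(x) = a_i * x (time), g_i(x) = b_i * x (energy).
a = [0.5, 1.0]
b = [4.0, 2.0]
F = [lambda x, ai=ai: ai * x for ai in a]
G = [lambda x, bi=bi: bi * x for bi in b]

t, e = evaluate([6.0, 4.0], F, G)  # workload n = 10 split as 6 + 4
print(t, e)
```

For this split the second processor finishes last (4 time units against 3), so it determines T(X), while both processors contribute to E(X).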

BI-OBJECTIVE OPTIMIZATION PROBLEM FOR MAX OF CONTINUOUS FUNCTIONS AND SUM OF LINEAR FUNCTIONS
In this section, we solve BOPGV for the case where all functions in the set $F$ are continuous and strictly increasing, and all functions in the set $G$ are linear increasing, that is, $g_i(x) = b_i \times x$, $b_i \in \mathbb{R}_{>0}$, $i \in \{0, \dots, k-1\}$. Without loss of generality, we assume that the functions in $G$ are sorted in the decreasing order of coefficients, $b_0 \geq b_1 \geq \dots \geq b_{k-1}$.

Our solution consists of two algorithms, Algorithms 1 and 2. The first one, which we call LBOPA, constructs the Pareto front of the optimal solutions in the objective space, $\Psi{\downarrow}_{T \times E}$. The second algorithm, which we call PARTITION, finds the decision vector for a given point in the Pareto front.
The inputs to LBOPA (see Algorithm 1 for pseudo-code) are two sets of $k$ functions each, $F$ and $G$, and an input value, $n \in \mathbb{R}_{>0}$. LBOPA constructs a continuous Pareto front, consisting of $k - 1$ segments $\{s_0, s_1, \dots, s_{k-2}\}$. Each segment $s_i$ has two endpoints, $(t_i, e_i)$ and $(t_{i+1}, e_{i+1})$, which are connected by a continuous curve. Figure 3 illustrates the functions in the sets, $F$ and $G$, when all functions in $F$ are linear, $f_i(x) = a_i \times x$. In this particular case, the Pareto front returned by LBOPA will be piecewise linear, as shown in Figure 3.
The main loop of Algorithm 1 computes the $k$ endpoints of the segments of the Pareto front (Lines 3-7). In iteration $i$, the minimum value of objective $T$, $t_i$, is obtained using an algorithm solving the single-objective min-max optimization problem, $\min_X \{\max_{j=i}^{k-1} f_j(x_j)\}$. We do not present the details of this algorithm. Depending on the shapes of the functions $\{f_0, \dots, f_{k-1}\}$, one of the existing polynomial algorithms solving this problem can be employed. 29,30 The endpoint $(t_{min}, e_{max}) = (t_0, e_0)$ represents decision vectors with the minimum value of objective $T$ and the maximum value of objective $E$, while the endpoint $(t_{max}, e_{min}) = (t_{k-1}, e_{k-1})$ represents decision vectors with the maximum value of objective $T$ and the minimum value of objective $E$ (as illustrated for the case of all linear increasing functions in Figure 3).

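For the special case where every function in F is linear, $f_i(x) = a_i \times x$, the min-max subproblem has a closed-form load-balanced solution, which makes the endpoint computation easy to sketch. The following is an illustration under that assumption only, not the general algorithm: the function name `lbopa_linear` and the three-processor instance are hypothetical, and for general continuous strictly increasing functions the closed form would have to be replaced by one of the min-max solvers cited above.

```python
from typing import List, Sequence, Tuple


def lbopa_linear(n: float, a: Sequence[float], b: Sequence[float]) -> List[Tuple[float, float]]:
    """Endpoints (t_i, e_i), i = 0..k-1, of the Pareto front segments.

    Assumes f_i(x) = a_i * x, g_i(x) = b_i * x, with b sorted in decreasing order.
    """
    k = len(a)
    if any(b[i] < b[i + 1] for i in range(k - 1)):
        raise ValueError("energy coefficients must be sorted in decreasing order")
    points = []
    for i in range(k):
        # Iteration i uses processors i..k-1 only. Balancing their load,
        # a_j * x_j = t for all j >= i, with sum x_j = n gives the minimum time.
        inv_sum = sum(1.0 / a[j] for j in range(i, k))
        t_i = n / inv_sum
        e_i = sum(b[j] * (t_i / a[j]) for j in range(i, k))
        points.append((t_i, e_i))
    return points


# Hypothetical instance: three processors, workload n = 12.
points = lbopa_linear(12.0, a=[0.5, 1.0, 1.0], b=[3.0, 2.0, 1.0])
print(points)
```

The first endpoint is the load-balanced, performance-optimal point; the last one places the whole workload on the processor with the smallest energy coefficient.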
Given a point $(t, e)$ from the Pareto front, Algorithm 2 (PARTITION) first initializes the vector $X$ (Line 2) so that $f_i(x_i) = t$ for all $i$. The sum $\sum_{i=0}^{k-1} x_i$ may be either equal to $n$ or greater than $n$. If $\sum_{i=0}^{k-1} x_i = n$, then this initial $X$ will be the only decision vector such that $\sum_{i=0}^{k-1} x_i = n$ and $\max_{i=0}^{k-1} f_i(x_i) = t$. Otherwise, $\sum_{i=0}^{k-1} x_i > n$. In that case, this initial vector $X$ will maximize both $\sum_{i=0}^{k-1} x_i$ and $\sum_{i=0}^{k-1} g_i(x_i)$. The algorithm then iteratively reduces elements of vector $X$ until their sum becomes equal to $n$, where the excess to be removed is $n_{plus} = \sum_{i=0}^{k-1} x_i - n$. Obviously, each such reduction will also reduce $\sum_{i=0}^{k-1} g_i(x_i)$. To achieve the maximum reduction of $\sum_{i=0}^{k-1} g_i(x_i)$, the algorithm starts from the vector element $x_i$ whose reduction by an arbitrary amount $\Delta x$ results in the maximum reduction of $\sum_{i=0}^{k-1} g_i(x_i)$. In our case, it will be $x_0$, as the functions in $G$ are sorted in the decreasing order of coefficients $b_i$. Thus, at the first reduction step, the algorithm will try to reduce $x_0$ by $n_{plus}$. If $x_0 \geq n_{plus}$, it will succeed and find a Pareto-optimal decision vector $X = \{x_0 - n_{plus}, x_1, \dots, x_{k-1}\}$. If $x_0 < n_{plus}$, it will reduce $n_{plus}$ by $x_0$, set $x_0 = 0$ and move to the second step. At the second step, it will try to reduce $x_1$ by the reduced $n_{plus}$, and so on. This way the algorithm minimizes $\sum_{i=0}^{k-1} g_i(x_i)$.

Algorithm 2. Algorithm finding a Pareto-optimal decision vector $X = \{x_0, x_1, \dots, x_{k-1}\}$ for the problem BOPGV($n$, $k$, $F$, $G$), where functions in $F$ are continuous and strictly increasing and functions in $G$ are linear increasing, for a given point $(t, e)$ from the Pareto front of this problem, $(t, e) \in \Psi{\downarrow}_{T \times E}$.

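The reduction procedure described for Algorithm 2 can be sketched as follows. This is an illustration rather than the original pseudo-code: the function name `partition`, the `tol` parameter, and the convention of passing the inverse functions $f_i^{-1}$ as callables are assumptions, and the example reuses a hypothetical three-processor instance with linear time functions.

```python
from typing import Callable, List, Sequence


def partition(n: float, t: float, f_inv: Sequence[Callable[[float], float]],
              tol: float = 1e-9) -> List[float]:
    """Workload distribution with sum(x) = n and max_i f_i(x_i) = t.

    f_inv[i] is the inverse of f_i; processors are assumed to be ordered
    by decreasing energy coefficient b_i.
    """
    k = len(f_inv)
    x = [inv(t) for inv in f_inv]      # largest loads finishing within time t
    n_plus = sum(x) - n                # excess workload to remove
    if n_plus < -tol:
        raise ValueError("t is too small for the given n")
    i = 0
    while n_plus > tol and i < k - 1:
        if x[i] >= n_plus:             # the excess fits into x_i
            x[i] -= n_plus
            n_plus = 0.0
        else:                          # empty x_i and carry the rest over
            n_plus -= x[i]
            x[i] = 0.0
            i += 1
    if n_plus > tol:
        raise ValueError("t is too large for the given n")
    return x


# Hypothetical instance: f_i(x) = a_i * x, so f_i^{-1}(t) = t / a_i.
a = [0.5, 1.0, 1.0]
inverses = [lambda t, ai=ai: t / ai for ai in a]
x_fast = partition(12.0, 4.0, inverses)
x_slow = partition(12.0, 8.0, inverses)
print(x_fast, x_slow)
```

With energy coefficients b = (3, 2, 1), the two distributions above cost 24 and 16 energy units respectively, so a longer allowed time shifts work away from the most energy-hungry processor first.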
The correctness of LBOPA and PARTITION is proved in Theorem 1.

Theorem 1. Consider the bi-objective optimization problem BOPGV($n$, $k$, $F$, $G$) where all functions in $F$ are continuous and strictly increasing and all functions in $G$ are linear increasing, $g_i(x) = b_i \times x$. Then, the piecewise function $S$, returned by LBOPA($n$, $k$, $F$, $G$) (Algorithm 1) and consisting of $k - 1$ segments, is the Pareto front of this problem, $\Psi{\downarrow}_{T \times E}$, and for any $(t, e) \in \Psi{\downarrow}_{T \times E}$, Algorithm 2 returns a Pareto-optimal decision vector $X$ such that $T(X) = t$ and $E(X) = e$.

Proof. First, consider Algorithm 2 and arbitrary input parameters $n > 0$ and $t > 0$. If after initialization of $X$ (Line 2) we have $\sum_{i=0}^{k-1} x_i < n$, it means that $t$ is too small for the given $n$, and for any vector $Y = \{y_0, y_1, \dots, y_{k-1}\}$ such that $\sum_{i=0}^{k-1} y_i = n$, we have $\max_{j=0}^{k-1} f_j(y_j) > t$. In this case, there is no solution to the optimization problem, and the algorithm terminates abnormally.
Otherwise, the algorithm enters the while loop (Line 6). If $i < k - 1$ upon exit from this loop, then the elements of vector $X$ will be calculated as $x_j = 0$ for $j < i$, $x_i = n - \sum_{j=i+1}^{k-1} f_j^{-1}(t)$, and $x_j = f_j^{-1}(t)$ for $j > i$, and therefore satisfy the conditions $\sum_{j=0}^{k-1} x_j = n$ and $\max_{j=0}^{k-1} f_j(x_j) = t$. Moreover, the total amount of $n$ will be distributed in $X$ between vector elements with higher indices, which have lower $G$ cost, $g_i(x)$, because $b_i \geq b_{i+1}$, $\forall i \in \{0, \dots, k-2\}$. Therefore, for any other vector $Y = \{y_0, y_1, \dots, y_{k-1}\}$ satisfying these two conditions, we will have $\sum_{j=0}^{k-1} g_j(y_j) \geq \sum_{j=0}^{k-1} g_j(x_j)$. Indeed, such a vector $Y$ can be obtained from $X$ by relocating certain amounts from vector elements with higher indices to vector elements with lower indices, which will increase the $G$ cost of the relocated amounts. Thus, when the algorithm exits from the while loop with $i < k - 1$, it returns a Pareto-optimal vector $X$.
If the algorithm exits from the while loop with $i = k - 1$, it will mean that $t$ is too big for the given $n$. We would still have $n_{plus} > 0$ to take off the last vector element, $x_{k-1}$, but if we did it, we would make $\max_{j=0}^{k-1} f_j(x_j) < t$. This way we would construct for the given $n$ a decision vector that minimizes $\sum_{i=0}^{k-1} g_i(x_i)$ but whose $\max_{j=0}^{k-1} f_j(x_j)$ is less than $t$, which means that no decision vector $X$ such that $\max_{j=0}^{k-1} f_j(x_j) = t$ can be Pareto optimal. Therefore, in this case the algorithm also terminates abnormally.
Thus, for any $t \in T$, Algorithm 2 either finds a Pareto-optimal decision vector $X$ such that $T(X) = t$ and $E(X) = \sum_{i=0}^{k-1} b_i \times x_i = e$, or returns abnormally if such a vector does not exist. Let Algorithm 2 return normally, and the loop variable $i$ be equal to $s$ upon exit from the loop. Then, according to the algorithm, $e$ is a function of $t$ in which $s$, $n$, $b_i$, $b_s$, $a_i$ are all known constants. Therefore, the Pareto front $e = P_f(t)$ can be expressed analytically, and this analytical expression is that of the piecewise function constructed by Algorithm 1 (LBOPA).

Theorem 2. The time complexity of LBOPA is $O(k^3 \times \log_2 n)$, and the time complexity of PARTITION is $O(k)$.

Proof. The for loop in LBOPA (Algorithm 1, Lines 3-7) has $k$ iterations. At each iteration $i$, the computation of $t_i$ has a time complexity of $O(k^2 \times \log_2 n)$, 29 the computation of $e_i$ has a time complexity of $O(k)$, and the insertion of the point in the output set has complexity $O(1)$. Therefore, the time complexity of the loop is $O(k^3 \times \log_2 n)$. The time complexity of the loop (Lines 8-10) is $O(k)$. Therefore, the time complexity of LBOPA is $O(k^3 \times \log_2 n)$.
Let us consider the PARTITION algorithm. The initialization of $X$ (Line 2) and the computation of $n_{plus}$ have time complexity $O(k)$ each. The while loop (Lines 6-15) iterates as long as $n_{plus} > 0$ and $i < k - 1$, of which $i < k - 1$ is the worst-case scenario. The time complexity of the loop is, therefore, $O(k)$. Therefore, the time complexity of PARTITION is bounded by $O(k)$. ▪
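The analytical form of the Pareto front used in the proof of Theorem 1 can be worked out from the description of the PARTITION algorithm. The derivation below is a reconstruction under stated assumptions, not a quotation of the original equations: it assumes that PARTITION initializes $x_j = f_j^{-1}(t)$ and exits the while loop with loop index $s$, and the last line additionally assumes linear time functions $f_j(x) = a_j \times x$.

```latex
% Decision vector returned by PARTITION when the loop exits with index s
x_j =
\begin{cases}
  0, & j < s,\\
  n - \sum_{l=s+1}^{k-1} f_l^{-1}(t), & j = s,\\
  f_j^{-1}(t), & j > s.
\end{cases}

% Energy of this vector, using g_j(x) = b_j \times x
e = \sum_{j=0}^{k-1} b_j x_j
  = b_s \Bigl( n - \sum_{j=s+1}^{k-1} f_j^{-1}(t) \Bigr)
    + \sum_{j=s+1}^{k-1} b_j f_j^{-1}(t)
  = b_s n - \sum_{j=s+1}^{k-1} (b_s - b_j)\, f_j^{-1}(t).

% Linear special case, f_j(x) = a_j x, so f_j^{-1}(t) = t / a_j
e = b_s n - t \sum_{j=s+1}^{k-1} \frac{b_s - b_j}{a_j}.
```

Because $b_s \geq b_j$ for $j > s$, the slope with respect to $t$ is non-positive on every segment, which is consistent with a decreasing, piecewise linear front in the linear case.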

APPLICATION OF THE BI-OBJECTIVE ALGORITHMS TO OPTIMIZATION OF HETEROGENEOUS PARALLEL APPLICATIONS FOR PERFORMANCE AND ENERGY
In this section, we apply the LBOPA and PARTITION algorithms to optimization of heterogeneous parallel applications for performance and energy.
We look at two bi-objective optimization problems, performance and dynamic energy and performance and total energy.

Bi-objective optimization for performance and dynamic energy
The bi-objective optimization problem for performance and dynamic energy is a direct application of BOPGV where the functions in $F$ and $G$ are the execution time and dynamic energy functions of the processors. If the application requires integer solutions, $X_I = \{x_0, \dots, x_{k-1}\} \in \mathbb{Z}^k_{\geq 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, we will find the closest approximation to the real-valued solution vector, $X$, output by PARTITION, in the Euclidean space.
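One way to obtain such an integer approximation is largest-remainder rounding: take the floor of every element and hand the missing units to the elements with the largest fractional parts, which keeps the sum equal to $n$ while staying as close as possible to the real-valued vector. The sketch below is an assumption about how this step could be carried out, not the procedure used in the experiments; the function name is hypothetical, and it presumes an integer $n$ and non-negative inputs that sum to $n$.

```python
import math
from typing import List, Sequence


def round_to_integer_distribution(x: Sequence[float], n: int) -> List[int]:
    """Round a real-valued distribution (sum n) to integers that still sum to n."""
    floors = [math.floor(v) for v in x]
    remainder = n - sum(floors)  # number of units lost by flooring
    # Indices sorted by decreasing fractional part.
    order = sorted(range(len(x)), key=lambda i: x[i] - floors[i], reverse=True)
    result = list(floors)
    for i in order[:remainder]:
        result[i] += 1
    return result


x_int = round_to_integer_distribution([4.6, 3.3, 4.1], 12)
print(x_int)
```

After rounding, the execution time and energy of the integer vector can differ slightly from the selected Pareto point, so it may be worth re-evaluating both objectives for the rounded distribution.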

Bi-objective optimization for performance and total energy
We start with the formulation for bi-objective optimization problem for performance and total energy. It is an extension of BOPGV.

Problem formulation
Consider a workload size $n$ executed using $k$ heterogeneous processors, whose execution time and dynamic energy functions are given by the two sets, $F$ and $G$. Let $P_S \in \mathbb{R}_{+}$ be the static power consumption of the platform, which is a constant. The problem is then to find a vector $X = \{x_0, \dots, x_{k-1}\} \in \mathbb{R}^k_{\geq 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, minimizing the objective functions $T(X)$ and $E_T(X) = E(X) + P_S \times T(X)$. We use $T \times E_T$ to denote the objective space of this problem, $\mathbb{R}_{\geq 0} \times \mathbb{R}_{\geq 0}$.
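Thus, the problem can be formulated as follows:

$$\mathrm{BOPPTE}: \quad \min_{X} \{T(X), E_T(X)\}, \quad T(X) = \max_{i=0}^{k-1} f_i(x_i), \quad E_T(X) = \sum_{i=0}^{k-1} g_i(x_i) + P_S \times T(X), \quad \text{subject to } \sum_{i=0}^{k-1} x_i = n, \quad x_i \geq 0.$$

Our solution for BOPPTE finds a set of tuples, $\Psi_{TE} = \{(T(X), E_T(X), X)\}$, where $X$ is a Pareto-optimal decision vector, and $T(X)$ and $E_T(X)$ are the execution time and the total energy consumption corresponding to $X$. The projection of $\Psi_{TE}$ onto the objective space $T \times E_T$ is the Pareto front symbolized by $\Psi_{TE}{\downarrow}_{T \times E_T}$. The projection of $\Psi_{TE}$ onto the decision vector space $X$ is the set of solutions (workload distributions) represented by $\Psi_{TE}{\downarrow}_{X}$.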

Solution using the bi-objective algorithms LBOPA and PARTITION
In this section, we propose a solution that employs the LBOPA and PARTITION algorithms to solve BOPPTE for the case where all functions in the set F are continuous and strictly increasing, and all functions in the set G are linear increasing.
The solution is based on the following observations. Let $\mathcal{S}$ be the space of all feasible solutions to the BOPGV problem, that is, a set consisting of all vectors $X = \{x_0, \dots, x_{k-1}\} \in \mathbb{R}^k_{\geq 0}$ such that $\sum_{i=0}^{k-1} x_i = n$. Then, $\mathcal{S}$ will also be the space of all feasible solutions to the BOPPTE problem. However, its image in the BOPPTE objective space, $\mathcal{Z}_{T \times E_T}$, will be different, obtained by moving each point $(t, e)$ of its image in the BOPGV objective space, $\mathcal{Z}_{T \times E}$, vertically by $P_S \times t$. This transformation guarantees that no solution $X \in \mathcal{S}$ that is not Pareto optimal for BOPGV will become Pareto optimal for BOPPTE. Indeed, as we consider the case when all functions in $F$ are continuous and strictly increasing, and all functions in $G$ are linear increasing, the Pareto front constructed by LBOPA for the BOPGV problem will be a continuous decreasing function. Therefore, there exists a BOPGV Pareto-optimal $X^*$, which dominates $X$ so that $T(X^*) = T(X)$ and $E(X^*) < E(X)$. The images of $X^*$ and $X$ in the BOPPTE objective space will be $(T(X), E(X^*) + P_S \times T(X))$ and $(T(X), E(X) + P_S \times T(X))$, respectively. As $E(X^*) + P_S \times T(X) < E(X) + P_S \times T(X)$, $X^*$ will dominate $X$ in the BOPPTE space as well.
Thus, solutions, which are not BOPGV Pareto optimal, cannot be BOPPTE Pareto optimal. It means that if a solution is BOPPTE Pareto optimal, then it must be BOPGV Pareto optimal; that is, the BOPPTE set of Pareto-optimal solutions is a subset of the BOPGV set of Pareto-optimal solutions. By construction, the image of the set of BOPGV Pareto-optimal solutions in the BOPPTE objective space will be a continuous function of objective T but not necessarily decreasing. Therefore, different points belonging to this function will have different T coordinates but may have the same E T coordinate. Obviously, if we have two points with the same E T coordinate, the one with greater T coordinate will be the image of the inferior solution, which should be removed from the BOPPTE set of Pareto-optimal solutions.
These observations can be summarised as the following theorem.

Theorem 3. A solution vector $X \in \mathcal{S}$ is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector $X_1$ such that $X_1$ is Pareto-optimal for execution time and dynamic energy and $E_T(X_1) < E_T(X)$.

[LBOPA-TE pseudo-code: only fragments survive, including a test against $e_{min}$ followed by "continue" at Line 12.]

... is Pareto-optimal and is added to $S_{TE}$. Line 18 represents the case of the line segment whose points satisfy Pareto-optimality. Therefore, it is stored in $S_{TE}$ at index $i$.
If no solutions are added to S TE in the for loop, then S TE will contain only the performance-optimal point corresponding to the load-balanced workload distribution.
We illustrate LBOPA-TE using the example shown in Figure 4. The number of processors employed in the example is four. The static power consumption, $P_S$, is assumed to be 5 W. The Pareto front for execution time and dynamic energy ($\Psi_{DE}{\downarrow}_{T \times E}$) is given by the blue line in Figure 4A.
It contains four segments. The static energy consumption as a function of execution time ($5 \times t$) is shown as an orange line. Figure 4B shows the execution time versus total energy curve (highlighted in green) obtained by adding the static energy consumptions to the energies in the execution time and dynamic energy Pareto front. In Figure 4C, the solutions highlighted in red are the non-Pareto-optimal solutions removed by LBOPA-TE in Lines 13-17. The output Pareto front for execution time and total energy, $\Psi_{TE}{\downarrow}_{T \times E_T}$, is shown in Figure 4D.

Hence, the time complexity of LBOPA-TE is $O(k^3 \times \log_2 n)$. ▪
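The transformation illustrated in Figure 4 can be sketched for a piecewise linear dynamic-energy front. The following is a simplified illustration of the idea, not the LBOPA-TE pseudo-code: the function name `total_energy_front`, the representation of the front as a list of vertices sorted by time, and the sample values are assumptions. It adds $P_S \times t$ to every vertex and keeps only the parts of the resulting curve that fall strictly below the lowest total energy reached at any earlier time.

```python
from typing import List, Tuple

Point = Tuple[float, float]      # (execution time, energy)
Segment = Tuple[Point, Point]    # (start, end) of a linear piece


def total_energy_front(vertices: List[Point], p_static: float) -> Tuple[Point, List[Segment]]:
    """Return the performance-optimal point and the Pareto-optimal pieces.

    vertices: vertices of the time / dynamic-energy front, sorted by time.
    """
    pts = [(t, e + p_static * t) for t, e in vertices]  # shift by static energy
    first = pts[0]                  # minimum time: always Pareto-optimal
    e_min = first[1]                # lowest total energy seen so far
    segments: List[Segment] = []
    for (t0, e0), (t1, e1) in zip(pts, pts[1:]):
        if e1 >= e_min:
            continue                # nothing on this piece improves the energy
        if e0 > e_min:
            # Only the tail below e_min is optimal. The computed start is a
            # limit point; it is itself dominated by an earlier solution.
            t_start = t0 + (e_min - e0) / (e1 - e0) * (t1 - t0)
            start = (t_start, e_min)
        else:
            start = (t0, e0)
        segments.append((start, (t1, e1)))
        e_min = e1
    return first, segments


# Hypothetical dynamic-energy front and two static power values.
front_de = [(3.0, 27.0), (6.0, 18.0), (12.0, 12.0)]
low_static = total_energy_front(front_de, 2.0)
high_static = total_energy_front(front_de, 5.0)
print(low_static, high_static)
```

With the larger static power every increase in execution time costs more static energy than it saves in dynamic energy, so only the performance-optimal point remains, which matches the degenerate case described above.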

EXPERIMENTAL RESULTS AND DISCUSSION
We analyze the proposed algorithms for two data-parallel applications, matrix multiplication and gene sequencing, executed on a platform comprising the five heterogeneous processors illustrated in Figure 1.
We first describe the methodology to construct the discrete execution time and dynamic energy profiles based on system-level physical power measurements using power meters for the processors involved in the execution of our applications. We then present the applications and the experimental results.
Our platform is equipped with WattsUp Pro power meters between the wall A/C outlets and the input power sockets. The power meters capture the total power consumption of the node. They have data cables connected to USB ports of the node. A Perl script collects the data from the power meter using the serial USB interface. The execution of this script is nonintrusive and consumes insignificant power. The power meters are periodically calibrated using an ANSI C12.20 revenue-grade power meter, Yokogawa WT210. The maximum sampling speed of the power meters is one sample every second. The accuracy specified in the data sheets is ±3%. The minimum measurable power is 0.5 W. The accuracy at 0.5 W is [...].
To ensure the reliability of our results, we follow a statistical methodology where a sample mean for a response variable (energy, time, PMC, utilization variables) is obtained from multiple experimental runs. The sample mean is calculated by executing the application repeatedly until it lies in the 95% confidence interval and a precision of 0.025 (2.5%) is achieved. For this purpose, Student's t-test is used assuming that the individual observations are independent and their population follows the normal distribution. We verify the validity of these assumptions using Pearson's chi-squared test.
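The repeated-measurement procedure can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the tooling used in the experiments: the function names, the `min_runs` and `max_runs` limits, and the callback `run_once` are hypothetical, and the critical values beyond 30 degrees of freedom fall back to the normal approximation of 1.96.

```python
import math
import statistics

# Two-sided 95% Student's t critical values for 1..30 degrees of freedom.
T_95 = [12.706, 4.303, 3.182, 2.776, 2.571, 2.447, 2.365, 2.306, 2.262, 2.228,
        2.201, 2.179, 2.160, 2.145, 2.131, 2.120, 2.110, 2.101, 2.093, 2.086,
        2.080, 2.074, 2.069, 2.064, 2.060, 2.056, 2.052, 2.048, 2.045, 2.042]


def t_critical(dof):
    """Return the 95% critical value; normal approximation above 30 dof."""
    return T_95[dof - 1] if dof <= 30 else 1.96


def measure_until_precise(run_once, precision=0.025, min_runs=3, max_runs=1000):
    """Repeat run_once() until the 95% confidence half-width is within
    `precision` times the sample mean, or max_runs is reached.

    Returns (sample_mean, number_of_runs).
    """
    samples = []
    while len(samples) < max_runs:
        samples.append(run_once())
        n = len(samples)
        if n < min_runs:
            continue
        mean = statistics.fmean(samples)
        half_width = t_critical(n - 1) * statistics.stdev(samples) / math.sqrt(n)
        if half_width <= precision * abs(mean):
            break
    return statistics.fmean(samples), len(samples)
```

Here `run_once` would execute the application once and return the observed response variable, for example the execution time in seconds. The sketch does not perform the normality check mentioned above.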

Methodology to construct execution time and dynamic energy profiles
We employ an experimental methodology 14,31 that accurately models the energy consumption of a hybrid data-parallel application executing on a heterogeneous HPC platform containing different computing devices, using system-level power measurements provided by power meters. The automated software tool, HCLWattsUp, 32 provides the dynamic and total energy consumptions based on system-level physical power measurements using power meters. The tool has no overhead and, therefore, does not influence the energy consumption of the application. HCLWattsUp gives the static power consumption of our platform when it does not execute any application. Based on our measurements, it is 410 W.
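The relationship between total and dynamic energy can be illustrated with a short Python sketch. It is not the HCLWattsUp API: the function name and arguments are hypothetical, and it assumes that power samples arrive at a fixed period (one per second, matching the meters' maximum sampling speed) and that dynamic energy is the total energy minus the static power multiplied by the execution time.

```python
def energies_from_samples(power_samples_w, static_power_w, sample_period_s=1.0):
    """Return (total_energy_j, dynamic_energy_j) for one application run.

    power_samples_w: node power readings in watts taken during the run.
    static_power_w: idle power of the platform in watts.
    sample_period_s: time between readings in seconds (assumed constant).
    """
    execution_time = len(power_samples_w) * sample_period_s
    # Rectangle-rule integration of power over time gives the total energy.
    total_energy = sum(power_samples_w) * sample_period_s
    # Dynamic energy excludes the energy the idle platform would consume anyway.
    dynamic_energy = total_energy - static_power_w * execution_time
    return total_energy, dynamic_energy
```

With the hypothetical readings `[500.0, 520.0, 510.0]` and the measured static power of 410 W, the sketch yields a total energy of 1530 J and a dynamic energy of 300 J.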
A data-parallel application executing on this heterogeneous hybrid platform consists of several kernels (generally speaking, multithreaded) running in parallel on different computing devices of the platform. The proposed algorithms for solving the bi-objective optimization problem for performance and energy require individual performance and dynamic energy profiles of all the kernels.
Due to tight integration and severe resource contention in heterogeneous hybrid platforms, the load of one computational kernel may significantly impact others' performance to the extent of preventing the ability to model the performance and energy consumption of each kernel in hybrid applications individually. 33 Therefore, we only consider configurations where one CPU kernel or accelerator kernel runs on the corresponding device. Each group of cores executing an individual kernel of the application is modeled as an abstract processor so that the executing platform is represented as a set of heterogeneous abstract processors. We ensure that the sharing of system resources is maximized within groups of computational cores representing the abstract processors and minimized between the groups. This way, the contention and mutual dependence between abstract processors are minimized.
We thus model our platform by five abstract processors, CPU_1, GPU_1, XeonPhi_1, CPU_2, and GPU_2. CPU_1 contains 22 (out of the total 24 physical) CPU cores. GPU_1 involves the Nvidia K40c GPU and a host CPU core connected to this GPU via a dedicated PCI-E link. CPU_2 comprises 10 (out of the total 12 physical) CPU cores. XeonPhi_1 is made up of one Intel Xeon Phi 3120P and its host CPU core connected via a dedicated PCI-E link. GPU_2 involves the Nvidia P100 PCI-E GPU and a host CPU core connected to this GPU via a dedicated PCI-E link. Since there should be a one-to-one mapping between the abstract processors and computational kernels, any hybrid application executing on the platform should consist of five kernels, one kernel per abstract processor, running in parallel. Because the abstract processors contain CPU cores that share some resources such as main memory and QPI, they cannot be considered entirely independent. Therefore, the performance of these loosely coupled abstract processors must be measured simultaneously, thereby taking into account the influence of resource contention.
The execution time profiles of the abstract processors are built experimentally by an automated procedure that uses OpenMP threads, with one thread mapped to each abstract processor. To account for the influence of resource contention, all the abstract processors execute the same workload simultaneously, and the execution time of each is measured. The execution time for accelerators includes the time taken to transfer data between the host and the devices.
The dynamic energy profiles of the abstract processors are constructed using the additive approach. 12 In the additive approach, the dynamic energy profiles of the five processors are constructed serially. The combined profile where the individual dynamic energy consumptions are totaled for each data point is then obtained. Then, the dynamic energy profile employing all the processors in parallel is built.
The difference between the parallel and combined dynamic energy profiles is then examined. We find that the average difference between the parallel and combined dynamic energy profiles is around 2.5% for the applications, which is within the statistical accuracy threshold set in our experiments. Both the parallel and combined profiles also follow the same pattern. Therefore, we conclude that the processors in our experiments satisfy the additive hypothesis: the abstract processors are loosely coupled and do not interfere with one another during application execution. Thus, the dynamic energy profiles of the five processors can be constructed serially or in parallel for our experimental platform and applications.
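The comparison between the combined and parallel profiles can be sketched in Python as follows. The function name and the data layout (one list of dynamic energies per processor, all sampled at the same data points) are assumptions made for illustration, and the relative difference is taken with respect to the parallel profile, which the text does not specify.

```python
def additive_gap(serial_profiles, parallel_profile):
    """Compare the sum of serially measured profiles with the parallel one.

    serial_profiles: list of per-processor dynamic energy lists (joules),
        all sampled at the same data points.
    parallel_profile: dynamic energies measured with all processors
        running in parallel, at the same data points.
    Returns (combined_profile, mean_relative_difference).
    """
    # Combined profile: total the individual consumptions at each data point.
    combined = [sum(values) for values in zip(*serial_profiles)]
    # Relative difference per data point, with the parallel run as reference.
    diffs = [abs(p - c) / p for p, c in zip(parallel_profile, combined)]
    return combined, sum(diffs) / len(diffs)
```

Under these assumptions, a mean relative difference that stays within the precision of the measurements (2.5% in the experiments above) would be consistent with the additive hypothesis.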

Precautions to rule out interference of other components in dynamic energy consumption
Several precautions are taken in computing energy measurements to eliminate any potential interference from computing elements that are not part of the given abstract processor running the given application kernel. First, we define the abstract processors so that a given abstract processor comprises solely the computing elements involved in running a given application kernel. Hence, the dynamic energy consumption will solely reflect the work done by the computing elements of the given abstract processor executing the application kernel.
Consider the DGEMM application kernel executing on the abstract processor CPU_1, which comprises CPU and DRAM. The HCLWattsUp API function gives the total energy consumption of the server during the execution of an application. The energy consumption includes the contribution from all components such as NIC, SSDs, and fans. We ensure that the application exercises only the CPUs and DRAM and not the other components so that the dynamic energy consumption reflects the contribution of only these two components. The following steps are employed to achieve this goal:
• The disk consumption is monitored before and during the application run, and we ensure that no I/O is performed by the application using tools such as sar and iotop;
• The problem size used in executing an application does not exceed the main memory, so that swapping (paging) does not occur;
• We verify that the application does not use the network by monitoring it with tools such as sar and atop;
• The application kernel's CPU affinity mask is set using the SCHED API's system call, sched_setaffinity(). To bind the DGEMM application kernel, we set its CPU affinity mask to 11 physical CPU cores of Socket 1 and 11 physical CPU cores of Socket 2.
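The affinity step can be sketched with Python's wrapper around the same Linux system call. This is an illustration only: the helper names are hypothetical, and the core numbering (cores numbered contiguously within each socket, 12 per socket) is an assumption, since the actual numbering depends on the machine's topology.

```python
import os


def affinity_cores(sockets=2, cores_per_socket=12, used_per_socket=11):
    """Return the core IDs to bind to, leaving one core per socket free.

    Assumes cores are numbered contiguously per socket (hypothetical layout).
    """
    return {s * cores_per_socket + c
            for s in range(sockets)
            for c in range(used_per_socket)}


def bind_current_process(cores):
    """Restrict the calling process to `cores` where the platform allows it.

    Returns the resulting affinity set, or None if nothing was changed.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # os.sched_setaffinity is only available on some Unix systems
    usable = cores & os.sched_getaffinity(0)
    if not usable:
        return None  # none of the requested cores is available to this process
    os.sched_setaffinity(0, usable)  # pid 0 denotes the calling process
    return os.sched_getaffinity(0)
```

With the default arguments, `affinity_cores()` returns 22 core IDs, 0 to 10 and 12 to 22, mirroring the 11 cores per socket described above. Intersecting with the currently allowed cores is a defensive choice of this sketch, not part of the described methodology.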
Fans are also a significant contributor to energy consumption. On our platform, fans are controlled in two zones: (a) zone 0: CPU or System fans, (b) zone 1: Peripheral zone fans. There are four levels to control the speed of fans:
• Standard: BMC control of both fan zones, with the CPU zone based on CPU temp (target speed 50%) and Peripheral zone based on PCH temp (target speed 50%);
• Optimal: BMC control of the CPU zone (target speed 30%), with Peripheral zone fixed at low speed (fixed 30%);
• Heavy IO: BMC control of CPU zone (target speed 50%), Peripheral zone fixed at 75%;
• Full: all fans are running at 100%.
We set the fans at full speed before launching the experiments to rule out fans' contribution to dynamic energy consumption. When set at full speed, the fans run consistently at ∼ 13,400 rpm. In this way, fans consume the same amount of power that is included in the static power of the platform. Furthermore, we monitor the server's temperatures and the fans' speeds with the help of Intelligent Platform Management Interface (IPMI) sensors, both with and without the application run. We find that there are no significant differences in temperature and the speeds of fans are the same in both scenarios.
Thus, we ensure that the dynamic energy consumption measured reflects the contribution solely by the abstract processor executing the given application kernel.

Applications used in the experiments
The gene sequencing application employs the five heterogeneous processors, CPU_1, GPU_1, XeonPhi_1, CPU_2, and GPU_2. The application invokes optimized SW routines provided by SWIPE for multicore CPUs, 37 CUDASW++3.0 for Nvidia GPU accelerators, 38 and SWAPHI for Intel Xeon Phi accelerators. 39 All the computations are in-card.
The performance and dynamic energy profiles for the matrix multiplication and gene sequencing applications are shown in Figures 2 and 5.
The input performance and dynamic energy functions, (F, G), to LBOPA and PARTITION are linear approximations of the profiles.
To demonstrate the practical efficacy and the most interesting aspects of our algorithms, we select two workloads, 12,352 × 10,112 and 15,552 × 10,112, for the matrix multiplication application and the workload 29,312 × 163,841 for the gene sequencing application. First, the workloads provide shapes of Pareto fronts with steep slopes and a wide range of performance-energy tradeoffs for both performance and dynamic energy and performance and total energy. Second, the workloads allow us to demonstrate scenarios where the set of Pareto-optimal solutions for performance and total energy is equal to the set of Pareto-optimal solutions for performance and dynamic energy (Ψ TE ↓ X = Ψ DE ↓ X ) and where it is only a proper subset (Ψ TE ↓ X ⊂ Ψ DE ↓ X ).
Figure 6 shows the Pareto fronts for the matrix multiplication application for the two workloads, 12,352 × 10,112 and 15,552 × 10,112. [...] For the gene sequencing application, the total energy savings accepting a 1% performance hit is 16%.

Discussion
Following are our salient observations:
• The set of Pareto-optimal solutions (workload distributions) for execution time and dynamic energy is optimal for total energy only for the first two linear segments starting from the performance-optimal endpoint for the matrix multiplication application. The third and fourth linear segments with the positive slopes contain non-Pareto-optimal solutions due to high static energy consumptions;
• The shapes of the two Pareto fronts for execution time and dynamic energy and execution time and total energy are similar, suggesting that the qualitative conclusions apply for all workloads;
• For the gene-sequencing application, the set of Pareto-optimal solutions (workload distributions) for execution time and dynamic energy is also optimal for total energy for the workload employed in our experiments;
• Based on an input user-specified energy-performance tradeoff, one can selectively focus on a specific segment in the Pareto fronts to return the Pareto-optimal solutions (workload distributions). A steep slope in the line segment with the load-balanced solution as the performance-optimal endpoint will provide significant energy savings while tolerating little performance degradation. It signifies that introducing a small load imbalance can provide good energy savings;
• The execution times of our proposed algorithms range from milliseconds to 1 s to find Pareto-optimal solutions for the workload sizes used in the experiments. These execution times are insignificant compared to the execution times of the applications where our proposed algorithms are employed to find the workload distribution.
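The selection of a solution for a user-specified tradeoff can be sketched on a piecewise linear front. The following Python sketch is illustrative and is not part of the proposed algorithms: the function name is hypothetical, and it assumes the front is given by its vertices sorted by increasing execution time with decreasing energy, starting at the performance-optimal (load-balanced) point.

```python
def cheapest_within_tolerance(front, tolerance):
    """Find the lowest-energy point whose time is within the tolerated slowdown.

    front: vertices (time, energy) of a piecewise linear Pareto front, sorted
        by increasing time and decreasing energy (assumed input format).
    tolerance: accepted performance degradation as a fraction, e.g. 0.05.
    Returns (time, energy, relative_energy_saving).
    """
    t_opt, e_opt = front[0]
    t_max = t_opt * (1.0 + tolerance)
    best_t, best_e = t_opt, e_opt
    for (t0, e0), (t1, e1) in zip(front, front[1:]):
        if t0 >= t_max:
            break
        if t1 <= t_max:
            # The whole segment is affordable; move to its far endpoint.
            best_t, best_e = t1, e1
        else:
            # The time budget ends inside this segment; interpolate linearly.
            frac = (t_max - t0) / (t1 - t0)
            best_t, best_e = t_max, e0 + frac * (e1 - e0)
            break
    return best_t, best_e, (e_opt - best_e) / e_opt
```

For the hypothetical front `[(10.0, 600.0), (12.0, 400.0), (20.0, 350.0)]` and a tolerance of 10%, the budget ends halfway along the steep first segment, giving roughly 500 J at 11 s, a saving of about one sixth of the energy at the load-balanced point.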

CONCLUSION
Performance and energy are the two most important objectives for optimization on heterogeneous HPC platforms. Khaleghzadeh et al. 12 studied bi-objective optimization of data-parallel applications for performance and energy on heterogeneous processors. They proposed an algorithm for the case of discrete performance and energy functions with any arbitrary shape. They also briefly studied the continuous bi-objective optimization problem but only for the simple case of two heterogeneous processors with linear execution time and linear dynamic energy functions. They proposed an algorithm to find the Pareto front and showed that it is linear, containing an infinite number of solutions. While one solution is load balanced, the rest are load imbalanced.
In Reference 13, we studied a more general continuous bi-objective optimization problem for a generic case of k heterogeneous processors.
The problem is motivated by the bi-objective optimization for the performance and dynamic energy of data-parallel applications on heterogeneous HPC platforms. We first formulated the problem, which for a given positive real number $n$ aims to find a vector $X = \{x_0, \cdots, x_{k-1}\} \in \mathbb{R}^k_{\geq 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, minimizing the max of a $k$-dimensional vector of functions of objective type one and the sum of a $k$-dimensional vector of functions of objective type two. We then proposed an exact algorithm of polynomial complexity solving the problem where all the functions of objective type one are continuous and strictly increasing, and all the functions of objective type two are linear increasing.
In this work, we applied the problem and the algorithm proposed in Reference 13 to solve two related optimization problems of parallel applications on heterogeneous hybrid platforms, one for performance and dynamic energy and the other for performance and total energy. First, we formulated and solved the bi-objective optimization problem for performance and dynamic energy. The problem and the solution are a direct application of the problem and algorithm proposed in Reference 13.
We then formulated the bi-objective optimization problem of parallel applications on heterogeneous hybrid platforms for performance and total energy. We proved a theorem that states that a solution vector $X$ is Pareto-optimal for execution time and total energy if and only if it is Pareto-optimal for execution time and dynamic energy and there is no solution vector $X_1$ such that $X_1$ is Pareto-optimal for execution time and dynamic energy and $E_T(X_1) < E_T(X)$. Finally, we proposed an algorithm of polynomial complexity that solves the problem and whose correctness follows from the theorem.
Using the algorithms (proposed in Reference 13 and this work), we solved the two bi-objective optimization problems for two applications, matrix multiplication and gene sequencing, employing five heterogeneous processors, two Intel multicore CPUs, an Nvidia K40c GPU, an Nvidia P100 PCIe GPU, and an Intel Xeon Phi. For the workloads and the platform employed in our experiments, the algorithms provide continuous piecewise linear Pareto fronts for performance and dynamic energy and performance and total energy where the performance-optimal point is the load balanced configuration of the application.
Finally, a 17% dynamic energy saving was achieved while tolerating a performance degradation of 5% (a saving of 106 J for an execution time increase of 0.05 s) for the matrix multiplication application. The dynamic energy and total energy savings for the gene sequencing application accepting a 1% performance hit were 23% and 16%, respectively.
In our future work, we will study bi-objective performance-energy optimization of applications with both continuous performance and energy profiles on heterogeneous hybrid HPC platforms.

ACKNOWLEDGMENTS
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number 14/IA/2474. Open access funding provided by IReL.