The search‐based scheduling algorithm HP* for parallel tasks on heterogeneous platforms

Scheduling is a widely used method in parallel computing that assigns tasks to the compute resources of a parallel environment. In this article, we consider independent parallel tasks to be scheduled onto a heterogeneous execution platform consisting of a set of multicores of different architectures. Each parallel task has an internal potential parallelism which allows a parallel execution on any of the multicore processors. However, the execution time may differ due to the different computation speeds of the multicores. We propose a new search-based scheduling algorithm, Heterogeneous Parallel task scheduling based on A* (called HP*), to solve the problem of scheduling independent parallel tasks onto heterogeneous multicore platforms. Specifically, we propose a heuristic cost function needed for an informed search. In addition, three pruning techniques are proposed, which are shown to significantly reduce the search space of HP*. Performance measurements on a heterogeneous platform are performed and the results of HP* are compared with the scheduling results of other popular scheduling methods. The performance results with benchmark tasks from the SPLASH-3 benchmark suite demonstrate the good scheduling quality and the improvements achieved by HP*.

Heterogeneous platforms comprise multicores that have different computing speeds, possibly leading to a different performance when executing the same parallel task. Achieving a high efficiency on such a heterogeneous system requires a scheduling method that takes the heterogeneity into account. For the scheduling of parallel tasks, two properties of the compute nodes are particularly important: the number of processor cores and the computational speed of each node.
Many proposed task scheduling methods focus on sequential tasks, where each task is assigned to exactly one processor of a compute node.
Scheduling algorithms for independent parallel tasks usually have to take more mapping possibilities into account, since a task can be executed on an arbitrary number of cores and a concurrent execution of different tasks is possible. Thus, such a scheduling algorithm becomes increasingly complex due to the growing number of options for placing tasks. For example, even for only two tasks, they can either be placed on all available cores and executed one after another, or executed concurrently on subsets of the cores. The solution of the scheduling problem determines for each task the particular compute node and the number of processor cores to be used on this node.
Since task scheduling is an NP-complete problem, many of the proposed scheduling methods are based on heuristics, see, for example, References 1-5. Heuristic scheduling methods may find solutions of the scheduling problem that are acceptably good for a specific use case but are usually not optimal. To find an optimal solution, a search of the entire solution space may be required in the worst case. Since the computation time required to find an optimal solution can be extremely long, informed search-based algorithms which prune the search space are advantageous. 6 Some recent search-based scheduling algorithms are based on the A* (pronounced "A-star") algorithm. The A* algorithm is an informed search algorithm for graph traversal and path search, which strongly relies on an estimation of the cost for the rest of the search. It has been shown that informed search algorithms, such as the A* search algorithm, 7 find an optimal solution if a so-called admissible and consistent cost function is used. 8 The cost estimation has to underestimate the actual cost in order to guarantee an optimal solution. A scheduling algorithm for sequential dependent tasks on homogeneous platforms based on the A* search algorithm has been proposed in Reference 9, together with several pruning techniques to reduce the search space. A scheduling algorithm for the same scheduling problem with an improved cost function has been proposed in Reference 10, with additional pruning techniques in Reference 11. In contrast to these earlier works, we consider the scheduling of independent parallel tasks onto heterogeneous platforms with the much larger number of possible task mappings described above and propose a novel search-based scheduling algorithm HETEROGENEOUS PARALLEL TASK SCHEDULING BASED ON A* (HP*).
The algorithm HP* is also inspired by the A* algorithm. For the search, we define a scheduling decision tree consisting of incomplete schedules which are completed by the informed search. The crucial point was to find an estimation function that underestimates the actual cost. Since our tasks are parallel tasks, these costs are parallel costs, which means that the parallel execution time of each task has to be modeled for a varying number of cores and varying core speeds. We have also proposed pruning techniques that are able to reduce the search space of HP*. HP* has been implemented and its performance has been measured on a heterogeneous multicore platform using the parallel programs of the SPLASH-3 benchmark suite as parallel tasks. The results are compared with popular scheduling algorithms for parallel tasks.
The contributions of this article are:
• A new task scheduling method HP* for solving the problem of scheduling independent parallel tasks onto heterogeneous platforms. The algorithm HP* is inspired by the A* search algorithm.
• A modeling of a heuristic cost function for parallel tasks that guides the search of HP*.
• Three pruning techniques that significantly reduce the search space and thus increase the performance of HP*.
• Theoretical as well as practical performance evaluations.
• An evaluation of the effects of the pruning techniques on the size of the search space, both for each pruning technique in isolation and for all techniques combined.
• A performance comparison with three popular heuristic task scheduling techniques.
• Experiments with programs from the SPLASH-3 benchmark suite 12 used as parallel tasks on a heterogeneous multicore cluster.
The rest of the article is organized as follows: Section 2 defines a scheduling problem for independent parallel tasks and describes the modeling of the task execution times. Section 3 presents the new search-based scheduling algorithms for parallel tasks HP*. Section 4 proposes techniques for pruning the search space of HP*. Section 5 presents experimental results. Section 6 discusses related work and Section 7 concludes the article.

SCHEDULING OF PARALLEL TASKS ON HETEROGENEOUS PLATFORMS
In this section, we present the scheduling problem for independent parallel tasks: Section 2.1 defines the problem to be solved, and Section 2.2 presents the model of the cost function.

Scheduling problem
The scheduling problem considered in this article comprises n_T independent parallel tasks T_i, i = 1, … , n_T. A parallel task can be executed on a single compute node utilizing an arbitrary number of processor cores. The number of cores used by each task is fixed during the task execution. This type of task is also known as a moldable task. The tasks are nonpreemptive, that is, their execution cannot be interrupted. For each task T_i, its parallel execution time using p cores of compute node N_j, j ∈ {1, … , n_N}, is denoted by t_i,j(p).
The heterogeneous platform considered consists of n_N multicore compute nodes N_1, … , N_n_T. The heterogeneity of the platform results from the different architectures of the compute nodes. Thus, each compute node N_j might have a different computational speed and a different number of processor cores p_j. We assume that each processor core can execute only one task at a time. Thus, each parallel task may be executed on 1 to p_j cores of a compute node N_j. However, several tasks can be executed on a compute node at the same time, depending on the number of cores each of them utilizes.
A solution for the scheduling problem described above is an assignment of the tasks T_i, i = 1, … , n_T to the compute nodes N_j, j = 1, … , n_N. For each task T_i, the resulting schedule S provides the following information:
• the compute node and the number of cores to be utilized;
• the calculated start time s_i and finish time e_i.
The makespan M(S) of a schedule S is the difference between the earliest start time and the latest finish time of all tasks. We assume that the execution of the first task starts at time 0. Thus, the makespan is identical to the latest finish time of all tasks. This can be expressed as

M(S) = max_{i=1, … , n_T} e_i.

The goal is to determine a schedule S such that the makespan M(S) is minimized.

Cost model for parallel tasks
The decisions made by scheduling methods are usually based on predictions of the execution times of single tasks. These predictions can be completely determined by benchmark measurements or can be calculated using a specific cost model. Since the program structures of the parallel tasks are unknown, existing cost models for parallel programming, such as PRAM, 13 BSP, 14 or LogP, 15 cannot be used for the considered scheduling problem. Thus, we use the following runtime formula to model the execution time t_i,j of each task T_i, i = 1, … , n_T on a compute node N_j, j = 1, … , n_N depending on the number of utilized processor cores p:

t_i,j(p) = (1 / f_j) ⋅ (a_i / p + b_i + c_i ⋅ log p).     (1)

The parameter f_j denotes the performance factor of compute node N_j that describes its computational speed in relation to a reference compute node N_r. It is defined as the ratio between the sequential execution time of a task on the reference node N_r and on the compute node N_j. Since the reference node N_r is also used to determine the other parameters of the runtime formula, the compute node with the highest number of cores is used as the reference node. The remaining part of Equation (1) represents the execution time of task T_i on the reference node N_r. The structure of this part was chosen to cover the runtime behavior of typical parallel tasks: it consists of a parallel computation time a_i that decreases linearly with the number of cores p, a constant sequential computation time b_i, and a parallelization overhead c_i that increases logarithmically with the number of cores p (eg, for synchronization or communication). To determine the parameters a_i, b_i, and c_i of a task T_i, the execution times are first measured on the reference node with different numbers of cores. Then the concrete values of the parameters are calculated based on a least squares fit of these execution times. In Reference 16, it is shown that this cost model achieves a good prediction of the task execution time.
Table 1 summarizes the notations used to describe the scheduling problem.
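The least squares fit described above can be sketched as follows. This is an illustrative implementation, not the authors' code; the names fit_runtime_model and predicted_time are placeholders. It fits the parameters a_i, b_i, and c_i of Equation (1) to execution times measured on the reference node:

```python
import math

def fit_runtime_model(samples):
    """Fit t(p) = a/p + b + c*log2(p) to measured (cores, time) pairs
    via linear least squares (normal equations, three unknowns)."""
    # Design matrix rows: [1/p, 1, log2(p)]
    rows = [([1.0 / p, 1.0, math.log2(p)], t) for p, t in samples]
    # Build the normal equations A^T A x = A^T b
    ata = [[sum(r[i] * r[j] for r, _ in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * t for r, t in rows) for i in range(3)]
    # Gaussian elimination with partial pivoting on the augmented matrix
    m = [ata[i] + [atb[i]] for i in range(3)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, 3):
            factor = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * 3
    for i in reversed(range(3)):
        x[i] = (m[i][3] - sum(m[i][j] * x[j] for j in range(i + 1, 3))) / m[i][i]
    return tuple(x)  # (a, b, c)

def predicted_time(a, b, c, p, f=1.0):
    """Execution time on p cores of a node with performance factor f, cf. Equation (1)."""
    return (a / p + b + c * math.log2(p)) / f
```

With noise-free synthetic measurements, the fit recovers the model parameters exactly; with real measurements it minimizes the squared prediction error.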

SEARCH-BASED SCHEDULING ALGORITHM
In this section, we propose HP*, a new search-based task scheduling method for assigning parallel tasks optimally to heterogeneous platforms. HP* is derived from the search algorithm A*, which is briefly summarized in Section 3.1. The algorithm of the new scheduling method HP* is proposed in Section 3.2, and its time complexity is analyzed in Section 3.3. Table 2 lists the notation used to describe the A* search algorithm and HP*.

The A* search algorithm
The A* search algorithm 7 is commonly used to find the shortest path in a directed graph with positive edge weights. The goal of the algorithm is to find the shortest path in a graph G from a start node s to a nonempty set of goal nodes T. For its search, the algorithm uses a function f(n) representing the cost of a path from s to a goal node via node n. The function f(n) consists of two parts: the actual cost g(n) of the path from s to n and the estimated cost h(n) of the shortest path from n to a goal node. In Reference 8, it was shown that using an admissible and consistent function f(n), the A* search algorithm is guaranteed to find an optimal solution.

Definition 1 (Admissible cost function). The cost function f(n) = g(n) + h(n) is said to be admissible if the heuristic function h(n) underestimates the exact cost h*(n) for each node n, that is, if the following holds:

h(n) ≤ h*(n) for each node n.

Definition 2 (Consistent cost function). The cost function f(n) = g(n) + h(n) is said to be consistent if f(n) never decreases along any path. This is the case if for any pair of adjacent nodes x and y with edge weight d(x, y), the following holds:

h(x) ≤ d(x, y) + h(y).

Algorithm 1 shows the pseudocode of the A* search algorithm presented in Reference 7. First, the start node s is marked "open" and the cost function f(s) is evaluated. All "open" nodes have been discovered but have not been visited yet. Nodes that have been visited are marked "closed" to avoid revisiting them. Then, the "open" node n with the smallest cost f(n) is selected and marked "closed." Each unmarked successor u of n is marked "open" and f(u) is calculated. Nodes u that are "closed" are marked "open" again if the current cost f(u) is lower than the cost when they were marked "closed." The algorithm continues selecting the next node n with the smallest cost f(n) (line 2) until a goal node is reached.
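The search just described can be sketched as follows. This is a generic, minimal A* implementation, not specific to scheduling; the names a_star and successors are illustrative:

```python
import heapq

def a_star(start, successors, h, is_goal):
    """Generic A* search: successors(n) yields (neighbor, edge_cost) pairs,
    h(n) is an admissible and consistent heuristic, is_goal(n) tests for goal
    nodes. Returns (cost, path) of a shortest path, or None if none exists."""
    open_heap = [(h(start), 0.0, start, [start])]  # entries: (f, g, node, path)
    closed = {}                                    # node -> best g when expanded
    while open_heap:
        f, g, node, path = heapq.heappop(open_heap)
        if is_goal(node):
            return g, path
        if node in closed and closed[node] <= g:
            continue  # already expanded with equal or lower cost
        closed[node] = g
        for nxt, w in successors(node):
            heapq.heappush(open_heap, (g + w + h(nxt), g + w, nxt, path + [nxt]))
    return None
```

For example, on a four-node graph with an admissible heuristic, the search returns the cheapest path from s to the goal t together with its cost.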

Scheduling parallel tasks with HP*
For the scheduling of parallel tasks onto heterogeneous platforms, we propose the new scheduling method HP* (HETEROGENEOUS PARALLEL TASK SCHEDULING BASED ON A*), which is based on the A* search algorithm. HP* is a search-based scheduling algorithm for independent parallel tasks and exploits a graph data structure that guides the search for a schedule. The approach is derived from the A* search algorithm in the following sense: the scheduling problem described in Section 2.1 is defined as a directed graph in the form of a tree, referred to as the scheduling decision tree in the following.

Definition 3.
(Scheduling decision tree). A scheduling decision tree is a tree where each graph node n represents a (partial) schedule S_n. The root node represents an empty schedule, that is, a schedule in which no tasks have been scheduled. The successors of a graph node represent all schedules in which exactly one more task is scheduled. The leaf nodes of a scheduling decision tree are complete schedules, that is, schedules in which all tasks have been scheduled. The weight d(n, u) of the edge between two nodes n and u is the difference between the makespans of the schedules S_n and S_u, that is,

d(n, u) = M(S_u) − M(S_n).

According to the A* search algorithm, the cost function f(n) = g(n) + h(n) used by HP* consists of two parts:
• g(n), the makespan M(S_n) of the schedule S_n corresponding to node n;
• h(n), a heuristic for the difference between the makespans M(S_n) and M(S_l) for a node n and a leaf node l.
Thus, the graph structure supports a targeted search for a complete schedule and the result of HP* is the leaf node at the end of a shortest path in this graph. The path from the empty schedule (root node) to the solution represents the construction of the schedule found by HP*. For the calculation of the heuristic cost function h(n), it is assumed that the remaining tasks can be distributed "optimally" to all cores. This means that the execution of a task is not restricted to a single compute node and the number of processor cores used by a task can change during its execution. It is also assumed that each remaining task can be executed fully in parallel, which means that its sequential execution time is divided by the number of processor cores utilized. Without loss of generality, it is assumed that in a node n the tasks T_x, … , T_n_T with x ∈ {1, … , n_T} have not been scheduled yet. Using Equation (1) with respect to the reference node N_r, the remaining sequential workload W_s is then calculated as

W_s = Σ_{i=x}^{n_T} t_i,r(1).

The computational capacity available on all cores of the compute nodes with respect to a schedule S_n is defined as

C(S_n) = Σ_{j=1}^{n_N} Σ_{k=1}^{p_j} (M(S_n) − max_{T_i ∈ C_j,k} e_i) ⋅ f_j,

where, for a compute node N_j, j = 1, … , n_N, the set C_j,k denotes all tasks that have been assigned to core k of this node. For each node n, the heuristic cost function h(n) is defined as follows: if the remaining workload is bigger than the available computational capacity, then h(n) is set to the difference divided by the total compute power, that is, the sum of p_j ⋅ f_j over all nodes N_j, j = 1, … , n_N:

h(n) = max { 0, (W_s − C(S_n)) / Σ_{j=1}^{n_N} p_j ⋅ f_j }.     (6)

Otherwise, there is enough computational capacity available for the remaining workload, which leads to h(n) = 0.
Figure 1 illustrates the calculation of the proposed cost function f(n) = g(n) + h(n) with already scheduled tasks (gray) and the remaining workload (blue). In this example, the remaining workload is either lower (left) or greater (right) than the computational capacity.
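The calculation of h(n) can be sketched as follows. This is a simplified illustration of Equation (6), assuming the per-core finish times and performance factors are given; heuristic_h is a placeholder name:

```python
def heuristic_h(makespan, core_finish, perf, remaining_seq):
    """Heuristic h(n) for a partial schedule, cf. Equation (6).
    core_finish[j] lists the latest finish time of each core of node j,
    perf[j] is the performance factor f_j, and remaining_seq is W_s,
    the sequential time of all unscheduled tasks on the reference node."""
    # Free capacity until the current makespan, weighted by node speed.
    capacity = sum((makespan - e) * perf[j]
                   for j, cores in enumerate(core_finish) for e in cores)
    # Total compute power: sum of p_j * f_j over all nodes.
    total_power = sum(len(cores) * perf[j]
                      for j, cores in enumerate(core_finish))
    # If the remaining workload exceeds the capacity, the makespan must grow.
    surplus = remaining_seq - capacity
    return surplus / total_power if surplus > 0 else 0.0
```

For instance, with a makespan of 10, a fast two-core node with finish times 10 and 4 (f = 1.0), a slow one-core node with finish time 6 (f = 0.5), and W_s = 20, the free capacity is 8 and h(n) = 12 / 2.5 = 4.8.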
Algorithm 2 shows the pseudocode of HP*. If s_cur is already part of L_closed, s_cur is skipped and the algorithm continues with the next node. Otherwise, s_cur is added to L_closed and a task T is selected that has not been scheduled in s_cur yet. For each possible assignment of each unscheduled task T, a new node s is created. This is done by an iteration over all compute nodes N_j, j = 1, … , n_N, and all numbers of cores p from 1 to p_j. In each step of this iteration, all possible assignments of task T to p cores of node N_j are generated. Each assignment is added to the schedule used in s_cur and the resulting schedule is represented by a new node s. Then the value f(s) of this new node s is calculated and s is added to L_open.

Lemma 1. HP* uses a heuristic cost function h(n) (Equation (6)) that is admissible and consistent.
Proof. Within the scheduling decision tree, the successor m of a node n represents a schedule S_m where exactly one more task is scheduled than in S_n. It is assumed that this task is assigned to p, 1 ≤ p ≤ p_j, processor cores of compute node N_j, j = 1, … , n_N. h(n) denotes the estimated increase of the makespan M(S_n) of the schedule S_n based on the remaining tasks. The exact cost of this increase is denoted by h*(n). It can be shown that for each node n, the following holds: h(n) ≤ h*(n). For the calculation of h(n), Equation (6) is used assuming that the sequential workload W_s is bigger than the computational capacity. Since all costs are nonnegative values, h(n) = 0 always underestimates h*(n), that is, 0 = h(n) ≤ h*(n) for each node n. For the calculation of the exact cost h*(n), the parallel workload W is considered instead of the sequential workload. This means that the execution of task T_i requires p cores with a performance factor f_j for the execution time t_i,j(p). According to Equation (1), the execution time t_i,j(p) on compute node N_j can also be expressed as the execution time t_i,r(p) on the reference node divided by the performance factor f_j. The total compute power Σ_l (p_l ⋅ f_l) is always greater than or equal to p ⋅ f_j. For common parallel tasks, it can be assumed that t_i,r(1) is always less than or equal to t_i,j(p) ⋅ p. As a consequence, h(n) ≤ h*(n) holds for each node n and thus, h(n) is an admissible heuristic. By definition, both g(n) and h(n) are nonnegative for each node n. Together with the admissibility of h(n), this implies that f(n) is a nondecreasing function along any path. Consequently, f(n) is a consistent function. ▪

Lemma 2. Using the heuristic cost function h(n) (Equation (6)), HP* finds an optimal solution.
Proof. In Reference 8, it is shown that the A* search algorithm is guaranteed to find a shortest path if an admissible and consistent function f(n) is used. Since the cost function used by HP* is admissible and consistent, HP* is guaranteed to find a shortest path in the scheduling decision tree. By definition of the scheduling decision tree, the leaf node at the end of this path is also a schedule with a minimum makespan.
In each step, HP* selects the node with the smallest cost from all nodes discovered so far. The first complete schedule (leaf node) that is visited by HP* is returned as the solution of the search. Since always the node with the smallest cost is visited and f(n) is consistent, no other node in the scheduling decision tree has a lower cost. This implies that the schedule found by HP* is an optimal solution. ▪
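The overall search can be illustrated by the following simplified sketch. It is not the article's full HP* algorithm: each task is placed on the p earliest-available cores of the chosen node, h(n) = 0 is used for brevity (trivially admissible), and the cost model is passed in as a function time(i, j, p):

```python
import heapq
import math

def hp_star(tasks, nodes, time):
    """Simplified HP*-style search (an illustration, not the article's full
    algorithm): tasks is a list of task ids, nodes a list of core counts,
    and time(i, j, p) the execution time of task i on p cores of node j.
    A task is placed on the p earliest-available cores of the chosen node;
    h(n) = 0 is used for brevity. Returns the minimum makespan found."""
    n_tasks = len(tasks)
    # A search node is (scheduled task set, per-node sorted core finish times).
    start = (frozenset(), tuple(tuple(0.0 for _ in range(c)) for c in nodes))
    open_heap = [(0.0, start)]  # entries: (f = g = makespan, node)
    closed = set()
    while open_heap:
        g, (done, cores) = heapq.heappop(open_heap)
        if len(done) == n_tasks:
            return g  # first complete schedule popped is optimal for h = 0
        if (done, cores) in closed:
            continue  # duplicate node, skip (cf. Section 4)
        closed.add((done, cores))
        for i in set(range(n_tasks)) - done:
            for j, node_cores in enumerate(cores):
                for p in range(1, len(node_cores) + 1):
                    free = sorted(node_cores)
                    finish = free[p - 1] + time(i, j, p)  # start when p cores are free
                    free[:p] = [finish] * p
                    new_cores = list(cores)
                    new_cores[j] = tuple(sorted(free))  # sorting normalizes core order
                    heapq.heappush(open_heap,
                                   (max(g, finish), (done | {i}, tuple(new_cores))))
    return math.inf
```

For example, two tasks with execution time 4/p + 1 on a single two-core node are scheduled concurrently on one core each (makespan 5) rather than one after another on both cores (makespan 6).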

Time complexity of HP*
The time complexity of a scheduling method is important for its practical use. Scheduling is an NP-hard problem even for sequential tasks. Thus, unless P = NP, the problem of scheduling parallel tasks onto a heterogeneous platform cannot be solved in polynomial time. The worst-case time complexity of HP* is exponential in the number of tasks, since it searches for an optimal solution.

Proof. In the worst case, HP* has to explore the whole search space to find an optimal solution. Thus, the worst-case execution time of HP* depends on the size of the search space. As stated in Section 3.2, the search space of HP* has the structure of a tree. The root of this scheduling decision tree is an empty schedule. Starting from the root, in each level of the scheduling decision tree, one more task is assigned to the schedule of the respective predecessor. Thus, the level of the scheduling decision tree indicates the number of tasks that have been scheduled and the depth of the tree is n_T. Each inner node of the scheduling decision tree has a successor for the assignment of each remaining task, where each task can be assigned to each combination of 1, … , p_j cores of each compute node N_j, j = 1, … , n_N, that is, to any nonempty subset of the cores of a node. Thus, with r remaining tasks, the branching factor of the scheduling decision tree can be calculated as

b(r) = r ⋅ Σ_{j=1}^{n_N} (2^{p_j} − 1).

Since exactly one task is scheduled in each level, the number of remaining tasks r decreases with the depth of the scheduling decision tree.
Thus, with b = Σ_{j=1}^{n_N} (2^{p_j} − 1), the maximum number of nodes of the scheduling decision tree can be bounded by

Σ_{l=0}^{n_T} Π_{r=n_T−l+1}^{n_T} b(r) ≤ (n_T + 1) ⋅ n_T! ⋅ b^{n_T}. ▪

The heuristic function h(n) has a strong influence on the time complexity, since it guides the search of HP* through the scheduling decision tree. h(n) uses information about the remaining tasks to prevent HP* from expanding unpromising nodes. Pruning techniques as presented in Section 4 can also reduce the size of the scheduling decision tree drastically. Although the complexity of HP* is exponential in the worst case, the use of a good heuristic function and pruning techniques can reduce its practical running time considerably. With these optimizations, HP* is appropriate for small to medium-sized scheduling problems.
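The counting argument above can be checked numerically with a small sketch. It is based on our reading of the branching factor, where each task can be mapped to any nonempty core subset of each node; worst_case_tree_size is a placeholder name:

```python
def worst_case_tree_size(n_tasks, cores_per_node):
    """Upper bound on the number of nodes of the scheduling decision tree:
    at each level, any of the r remaining tasks may be scheduled next, and
    each task has sum_j (2^{p_j} - 1) possible core-subset mappings."""
    b = sum(2 ** p - 1 for p in cores_per_node)  # mappings per task
    total, level_nodes = 1, 1                    # count the root
    for level in range(1, n_tasks + 1):
        level_nodes *= (n_tasks - level + 1) * b  # nodes on this level
        total += level_nodes
    return total
```

For example, two tasks on a single two-core node already give a tree of 25 nodes (1 root, 6 partial schedules, 18 complete schedules before pruning).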

PRUNING TECHNIQUES AND OPTIMIZATIONS
As described in Section 3.2, the structure that is searched by HP* is the scheduling decision tree. The scheduling decision tree contains all possible schedules and thus many unpromising nodes that will not lead to an optimal solution. Pruning techniques eliminate such nodes from the search space without losing the ability to find an optimal solution. The advantages are a reduced memory consumption of HP* and a reduction of the number of nodes that need to be visited. In this section, we present three techniques to prune the search space of HP*. In Section 4.1, a technique to remove all duplicate nodes from the scheduling decision tree is described. A technique for removing all equivalent nodes from the scheduling decision tree is given in Section 4.2. In Section 4.3, a technique that uses a heuristic upper bound to reduce the memory consumption of HP* is presented.

Removing duplicate nodes
Since HP* considers all possible orders and assignments of the tasks to the processor cores, the scheduling decision tree can contain a lot of duplicate nodes.

Definition 4.
(Duplicate nodes). Two nodes in the scheduling decision tree are said to be duplicates if the schedules that are represented by these nodes are equal. Two schedules are equal if each processor core has been assigned the same tasks in the same order.

Figure 2 shows how multiple paths in the scheduling decision tree can lead to duplicate nodes. In the example, the scheduling problem consists of two tasks (yellow and green), which are to be assigned to two compute nodes, N_1 with two processor cores and N_2 with one processor core. In the first case, one task (yellow) is assigned to compute node N_1 and afterward, the second task (green) is assigned to compute node N_2. HP* also considers the reverse case, where the green task is assigned to N_2 before the yellow task is assigned to N_1. Both cases lead to duplicate nodes (marked blue) that are detected and skipped by HP* using the list L_closed (line 10 of Algorithm 2).
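Duplicate detection via a closed list can be sketched as follows. This is an illustration of Definition 4, not the article's implementation; schedule_key and visit are placeholder names, and a schedule is represented as a dictionary mapping (node, core) to an ordered task list:

```python
def schedule_key(schedule):
    """Canonical key for duplicate detection: equal keys mean equal schedules,
    that is, the same tasks in the same order on each core (Definition 4)."""
    return tuple(sorted((node, core, tuple(tasks))
                        for (node, core), tasks in schedule.items()))

closed = set()  # the closed list L_closed of already visited schedules

def visit(schedule):
    """Return True if the schedule is new, False if a duplicate was visited."""
    key = schedule_key(schedule)
    if key in closed:
        return False  # duplicate node: skip it
    closed.add(key)
    return True
```

Hashing a canonical key avoids comparing each new node against every previously visited node one by one.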

Removing equivalent nodes
For each task, the scheduling decision tree contains assignments to all permutations of the processor cores of a compute node. Since the processor cores of each compute node are homogeneous, the execution time of a task on a compute node is the same for the same number of cores, regardless of the assignment to specific cores. This means that every permutation of the cores in a schedule represents an equivalent schedule, and the scheduling decision tree can therefore contain many equivalent nodes. A schedule can be normalized by sorting the processor cores of each compute node according to the maximum finish time of all tasks assigned to each core.

Definition 5.
(Equivalent nodes). Two nodes in the scheduling decision tree are said to be equivalent if they represent the same schedule except for a permutation of the processor cores. To be more precise, two nodes n_i and n_j in the scheduling decision tree are said to be equivalent if the following conditions are true:
• f(n_i) = f(n_j), that is, the nodes have the same cost;
• W_s(n_i) = W_s(n_j), that is, the remaining sequential workload is the same;
• the normalized schedules S_n_i and S_n_j are equal.

FIGURE 2 Section of a scheduling decision tree for two tasks (yellow and green) and two compute nodes N_1 (two processor cores) and N_2 (one processor core). Duplicate nodes are marked blue and equivalent nodes are marked red
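The normalization by sorting cores can be sketched as follows (illustrative; normalize is a placeholder name, core_assignments lists the tasks on each core of one node, and finish_time maps tasks to their finish times):

```python
def normalize(core_assignments, finish_time):
    """Normalize a per-node schedule by sorting its cores by the latest
    finish time of the tasks assigned to each core (cf. Section 4.2), so
    that all core permutations map to one canonical representative."""
    def core_end(tasks):
        # Latest finish time on this core; 0 for an idle core.
        return max((finish_time[t] for t in tasks), default=0.0)
    return tuple(tuple(c) for c in sorted(core_assignments, key=core_end))
```

Two schedules that differ only in a permutation of the cores yield the same normalized form and can therefore be detected as equivalent by a simple equality check.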

Using a heuristic upper bound
A nonoptimal solution for the considered problem of scheduling parallel tasks onto heterogeneous platforms can be found in polynomial time using a heuristic scheduling method. The makespan M_heur of such a schedule can be used as an upper bound for the cost f(n) of each node n in the scheduling decision tree. Since the cost of an optimal solution has to be lower than or equal to M_heur, graph nodes with higher costs will never be visited and can thus be removed from the search space. This optimization does not reduce the number of graph nodes that need to be visited. However, removing such unpromising graph nodes reduces the memory consumption of HP*. The better the solution found by the heuristic method, the more effective this optimization strategy is.
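The upper-bound pruning can be sketched as follows. The greedy list scheduler shown is a stand-in for any polynomial heuristic that yields M_heur, not the method used in the article; greedy_makespan and prune are placeholder names:

```python
def greedy_makespan(task_times, n_cores):
    """A simple longest-task-first list scheduler (sequential tasks on one
    homogeneous node) whose makespan serves as the upper bound M_heur."""
    cores = [0.0] * n_cores
    for t in sorted(task_times, reverse=True):   # longest task first
        k = min(range(n_cores), key=lambda i: cores[i])
        cores[k] += t                            # place on least-loaded core
    return max(cores)

def prune(open_nodes, m_heur):
    """Drop open nodes whose lower-bound cost f(n) already exceeds M_heur:
    since the optimal makespan is at most M_heur and f(n) underestimates the
    final cost, such nodes can never lead to an optimal schedule."""
    return [(f, n) for f, n in open_nodes if f <= m_heur]
```

The pruned nodes would never be expanded anyway, so only memory is saved, in line with the observation above.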

Corollary 1. The presented pruning techniques for removing duplicate and equivalent nodes as well as using a heuristic upper bound have no influence on the optimality of HP*.
Proof. When HP* proceeds from each of the duplicate nodes, the same remaining tasks are assigned to all combinations of processor cores. This leads to the same subtrees in the scheduling decision tree, each of which contains the same possible solutions. Graph nodes are only skipped by HP* if a duplicate node has already been visited. Thus, at least one of the subtrees of the duplicate nodes remains in the scheduling decision tree and no possible solution is eliminated. Therefore, removing duplicate nodes reduces the search space without eliminating optimal solutions. Equivalent nodes are equal except for the naming of the processor cores which has no influence on the makespan. Proceeding from each equivalent node, HP* assigns the same remaining tasks to all combinations of processor cores. This leads to equivalent subtrees in the scheduling decision tree, each of which contains equivalent solutions with the same makespan. Since one of the equivalent subtrees is kept in the scheduling decision tree, no possible solution is eliminated. Thus, equivalent nodes can be removed from the search space without eliminating optimal solutions.
In HP*, a node n is skipped if its cost f(n) = g(n) + h(n) is higher than the heuristic upper bound b. The heuristic cost function h(n) (Equation (6)) underestimates the exact cost h*(n) for each node n. Thus, for each node n, the cost f(n) is lower than or equal to the exact cost f*(n). A solution n_opt is optimal if no other solution has lower cost, that is, if f*(n_opt) ≤ f*(n) holds for each node n. As a consequence, the heuristic upper bound b is greater than or equal to f*(n_opt). Thus, the optimal solution is not skipped by HP*. ▪

Proof. Duplicate nodes can be detected by checking whether a node has already been visited. This means that each node has to be compared with each previously visited node. Thus, for N visited nodes, a total of N ⋅ (N − 1)/2 comparisons are needed to detect duplicate nodes.
Equivalent nodes can be detected during the expansion of the current node: for each compute node, the generated schedules have to be compared pairwise. Thus, in each step of HP*, the number of comparisons needed to detect equivalent nodes is quadratic in the number of schedules generated during the node expansion. ▪

EXPERIMENTAL RESULTS WITH PARALLEL TASKS ON A HETEROGENEOUS COMPUTE CLUSTER
This section presents experimental results of HP* for the execution of independent parallel tasks on a heterogeneous compute cluster. The experimental setup is described in Section 5.1. In Section 5.2, the makespans of the produced schedules of HP* are measured and compared to the results of three other popular scheduling methods. The influence of the pruning techniques from Section 4 on the performance of HP* is evaluated in Section 5.3. In Section 5.4, HP* is compared with the newly implemented metaheuristic scheduling methods TS* and SA* in terms of the makespans of the produced schedules and the computing times of the scheduling algorithm itself.

Experimental setup
The heterogeneous compute cluster used consists of three nodes with a total of 16 processor cores. Table 3 lists the properties of these compute nodes. The compute node sb1 is used as reference node for the determination of the parameters described in Section 2.2. The scheduling method described in Section 3.2 is implemented in C++ using the gcc compiler with optimization level 2. Additionally, we have implemented three existing heuristic scheduling methods that are suitable for the scheduling of parallel tasks on heterogeneous platforms:

HCPA:
The HETEROGENEOUS CRITICAL PATH AND ALLOCATION method 17 transforms a heterogeneous compute cluster with individual computational speeds of the processors into a "virtual" homogeneous cluster with equal speed. Then, an existing method for homogeneous compute clusters (ie, CPA 18 ) is used for the scheduling. HCPA is a two-phase scheduling algorithm: one phase determines the number of processors allocated to each task, and a second phase schedules the tasks on the platform using a list scheduling algorithm.

WLS:
The WATER-LEVEL-SEARCH method 21 combines list scheduling with a search-based approach. The method uses a limit for the predicted makespan that must not be exceeded by the finish time of any task. First, a list scheduling approach is applied repeatedly while the limit is increased until all tasks are scheduled. The computed finish times of all tasks are collected in a set of candidate limits. Then a binary search on this set is performed to find the smallest limit for which all tasks can be scheduled.
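The binary search over the set of limits can be sketched as follows (illustrative; feasible(limit) stands for one run of the list scheduler under the given limit and is assumed to be monotone in the limit):

```python
def water_level_search(limits, feasible):
    """Binary search over the sorted set of candidate limits for the
    smallest limit under which the list scheduler can place all tasks."""
    limits = sorted(set(limits))           # candidate finish times as limits
    lo, hi = 0, len(limits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(limits[mid]):
            hi = mid                       # a feasible limit: try smaller ones
        else:
            lo = mid + 1                   # infeasible: the limit must grow
    return limits[lo]
```

Because feasibility is monotone (raising the limit never makes scheduling harder), the binary search needs only logarithmically many list-scheduling runs.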
A separate front-end node of the compute cluster is responsible for the scheduling and for starting the task execution using SSH connections to the compute nodes. This front-end node is equipped with an Intel Xeon E5-2683 CPU and 128 GB of RAM. The tasks are executed according to the determined schedule and the total execution time is measured. A task is started as soon as all of its predecessors in the schedule have finished.
The measurements are performed five times and the average result is shown.
As parallel tasks, two application tasks and two kernel tasks from the SPLASH-3 benchmark suite 12 are used. Unless otherwise stated, the default parameters or the provided "parsec-simlarge" parameter sets are used for the different benchmark tasks. The following application and kernel tasks were selected:
• BARNES (application): Barnes-Hut algorithm for a simulation of a particle system of 2^18 particles.

• FMM (application): Fast multipole method for a simulation of a particle system of 2^19 particles.

• CHOLESKY (kernel): Cholesky factorization of a sparse matrix.

• LU (kernel): LU factorization of a dense matrix. The size of the input matrix is set to 4096×4096.

Performance results with benchmark tasks
In the following, the search-based scheduling method (HP*) proposed in Section 3.2 and the scheduling methods (HCPA, Δ-CTS, WLS) described in the previous subsection are investigated in several measurements. These methods are used to determine schedules for the execution of the SPLASH-3 benchmark tasks on a heterogeneous cluster. The heterogeneous cluster used for the following measurements consists of all compute nodes listed in Table 3; Figure 3 shows the measured total execution times depending on the number of tasks. For the CHOLESKY tasks, the execution times using HCPA are up to 87% higher than the best results. A reason for these significant differences might be that HCPA favors a parallel task execution that uses many cores for each task. However, the execution times of the CHOLESKY tasks are too small to achieve a proper reduction of the parallel execution time for increasing numbers of cores. The other methods achieve very similar results, except for task numbers between 3 and 7, where Δ-CTS leads to execution times that are up to 42% higher. For the LU tasks, the differences between the results of the methods used are smaller. The execution times using HCPA and Δ-CTS are slightly higher than for WLS and HP*, with large increases for 7 and 13 tasks. As for the application tasks, the execution times for 7 and 8 tasks are up to 11% lower using HP* compared with WLS. All in all, HP* leads to lower or equal execution times with a steadier increase compared with the other methods.
FIGURE 3 Top: Measured total execution times of BARNES application tasks (left) and FMM application tasks (right) depending on the number of tasks using all compute nodes of Table 3. Bottom: Measured total execution times of CHOLESKY kernel tasks (left) and LU kernel tasks (right) depending on the number of tasks using all compute nodes of Table 3

Evaluation of the pruning techniques
In this section, the benefit of the pruning techniques presented in Section 4 is evaluated. As the scheduling time of HP* depends on the hardware and implementation used, we evaluated the number of nodes that are generated and the number of nodes that are visited by HP* to find a solution. These numbers are hardware- and implementation-independent and approximately proportional to the runtime of HP*. The scheduling problem considered in this evaluation consists of 1 to 16 BARNES tasks that have to be assigned to the heterogeneous compute cluster listed in Table 3. Table 4 shows the numbers of nodes that are generated and visited by HP* using different pruning techniques depending on the number of tasks.
The nodes that are generated and also the nodes that have been visited have to be stored during the execution of HP*. Whenever too many nodes have to be stored, the system runs out of memory. The missing entries in Table 4 are the result of such a situation. Thus, without pruning HP* is applicable only to small problems. The pruning technique of removing duplicate nodes described in Section 4.1 reduces the number of nodes significantly.
With duplicate removal, HP* visits and generates up to 94% fewer nodes to find a solution. Nonetheless, for more than 6 tasks, the system runs out of memory. The biggest impact of all presented pruning techniques is achieved by removing equal nodes as described in Section 4.2. With equal nodes removed, HP* can schedule up to 11 tasks before running out of memory. The reason is that the number of nodes that need to be handled by HP* in this case is reduced by up to 99% compared with the version without pruning. The WATER-LEVEL-SEARCH method presented in Section 5.1 is used to determine the heuristic upper bound used in this evaluation. As described in Section 4.3, the heuristic upper bound has no influence on the number of nodes visited by HP*. However, the number of nodes generated is reduced by up to 99% compared with the version without pruning. Even this significant reduction does not prevent the system from running out of memory for more than 7 tasks.
Each pruning technique itself can reduce the number of nodes significantly. However, even for small problems, the memory consumption is still too high. Since each pruning technique removes a certain set of unpromising nodes, a combination of all pruning techniques might further reduce the memory consumption. Therefore, we integrated all presented pruning techniques in HP* to apply a complete pruning of the search space. The heuristic cost function h(n) presented in Section 3.2 also influences the number of nodes in the scheduling decision tree. A version of HP* using h(n) is compared with a version without a heuristic cost function, that is, where h(n)=0. Table 5 shows the numbers of nodes that are generated and visited by HP* using a combination of all pruning techniques depending on the number of tasks. The results of this complete pruning are presented with and without the heuristic cost function compared to HP* without pruning. The complete pruning removes almost all unpromising nodes from the scheduling decision tree except for the nodes that are most relevant for finding an optimal solution.
Compared with HP* without pruning, this leads to a huge reduction of visited and generated nodes of up to about 99.9%. A comparison with the results presented in Table 4 shows that the combination of all pruning techniques reduces the search space much more than each single pruning technique alone.

Performance comparison of search-based scheduling methods
The search-based scheduling method (HP*) proposed in Section 3.2 is compared with the newly implemented metaheuristic scheduling methods TS* and SA*. TS* and SA* are methods for scheduling parallel tasks onto a heterogeneous platform based on a tabu search and simulated annealing, respectively. These methods are used to determine schedules for the execution of the SPLASH-3 benchmark tasks on a heterogeneous cluster. The heterogeneous cluster used for the following measurements consists of all compute nodes listed in Table 3.
Figure 4 (top) shows the measured total execution times of the BARNES application tasks (left) and FMM application tasks (right) of the SPLASH-3 benchmark depending on the number of tasks. For both types of application tasks, the measured times using the HP* method are lower than or equal to the results of the metaheuristic scheduling methods TS* and SA*. Using HP*, the measured times of both application tasks are up to 22% lower than for TS* and SA*. In four cases, for 27 and 28 BARNES tasks and for 22 and 23 FMM tasks, the scheduling of HP* was aborted due to too high memory consumption. One reason might be that the heuristic cost function h(n) (Equation (6)) underestimated the exact cost of the remaining tasks too much. As a result, HP* has to consider many more possible solutions in its search space. A more precise heuristic cost function may avoid such situations. Except for 12 and 13 BARNES tasks, the measured times using SA* are lower than or equal to those of TS*. For the FMM tasks, both metaheuristics show similar results with slightly lower execution times using SA*. Figure 4 (bottom) shows the measured total execution times of the CHOLESKY kernel tasks (left) and LU kernel tasks (right) depending on the number of tasks using all compute nodes of Table 3. For both types of kernel tasks, the execution times for the HP* method are lower than or equal to those of the metaheuristic scheduling methods TS* and SA*. For the CHOLESKY tasks, the execution times using HP* are up to 17% lower than for TS* and up to 22% lower than for SA*. The scheduling of HP* for 25-30 CHOLESKY tasks and 24, 25, and 32 LU tasks was aborted due to too high memory consumption. Especially for the short CHOLESKY tasks, the heuristic cost function h(n) (Equation (6)) underestimates the exact cost of the remaining tasks too much. Thus, HP* needs to search a wider range of the search space. For 1 to 22 CHOLESKY tasks, SA* achieves lower execution times than TS*, whereas for 23 to 32 tasks, the measured times for TS* are lower than for SA*. Especially for 29 to 32 CHOLESKY tasks, the execution times are up to 27% higher using SA*.
For the LU tasks, there are fewer differences between the results of the three methods than for the other tasks. The execution times using HP* are up to 10% lower than for TS* and up to 31% lower than for SA*. For more than 20 LU tasks, the execution times for SA* are up to 8% lower compared with TS*. Figure 5 shows the measured computing times of the search-based scheduling methods depending on the number of BARNES application tasks using all compute nodes of Table 3. The increase in the computing time of TS* and SA* is steadier than for HP*. A reason for this might be the use of the heuristic cost function h(n) (Equation (6)) and the pruning methods described in Section 4. The influence of the heuristic cost function and the pruning methods differs strongly with the concrete scheduling problem. Influencing factors are the size and structure of the heterogeneous platform as well as the number of tasks and their runtime formula. Except for 16, 22, 30, 31, and 32 tasks, the computing time of HP* is lower than for SA*. For small numbers of tasks, HP* needs even less computing time than TS*. On average, TS* is 73 times faster than SA*.

FIGURE 4 Top: Measured total execution times of BARNES application tasks (left) and FMM application tasks (right) depending on the number of tasks using all compute nodes of Table 3. Bottom: Measured total execution times of CHOLESKY kernel tasks (left) and LU kernel tasks (right) depending on the number of tasks using all compute nodes of Table 3

FIGURE 5 Measured computing times of the scheduling methods depending on the number of BARNES application tasks using all compute nodes of Table 3

RELATED WORK
In Reference 22, a classification of parallel tasks into rigid tasks, moldable tasks, and malleable tasks is given. For rigid tasks, the number of processors used is fixed a priori. For moldable tasks, the number of processors is chosen before the execution starts and remains fixed during the execution. Malleable tasks can change the number of processors during their execution. Most parallel applications are moldable.
Search-based methods are promising approaches to solve scheduling problems. A task scheduling problem can be formulated as the search for an optimal assignment of a set of tasks onto a set of processors, such that the total execution time, also called makespan, is minimized. In the past decade, different types of search-based scheduling algorithms, so-called metaheuristics, such as genetic algorithms, simulated annealing, or tabu search, have been proposed. A comparison of search-based and heuristic approaches for scheduling independent tasks onto heterogeneous systems can be found in Reference 2. Different approaches for assigning sequential tasks to heterogeneous machines are presented, such as genetic algorithms, simulated annealing, tabu search, and A*. All approaches focus on the scheduling of sequential tasks and, thus, cannot be applied to the scheduling problem of parallel tasks.
An experimental comparison of several scheduling algorithms, including A*, genetic algorithms, simulated annealing, tabu search, as well as popular list scheduling heuristics, is given in Reference 23. Since the work considers the problem of mapping sequential tasks with dependencies onto a homogeneous cluster, the algorithms are not suitable for the scheduling of parallel tasks.
Reference 9 proposed a scheduling algorithm for the assignment of sequential tasks with dependencies to homogeneous platforms based on the A* search algorithm. Several pruning techniques to reduce the search space as well as a parallelization of the algorithm are presented. Reference 10 proposed a scheduling algorithm based on the A* search algorithm that uses an improved cost function along with several pruning techniques to reduce the search space. In Reference 11, additional pruning techniques for the same scheduling problem are proposed. In contrast to these works, we consider the scheduling of independent parallel tasks onto heterogeneous platforms. To the best of our knowledge, the A* search algorithm has not been applied to the scheduling of parallel tasks onto a heterogeneous platform. Although some pruning techniques proposed in References 9 and 11, such as the heuristic upper bound, can be applied to this problem, specific techniques are needed to reduce the search space.
In Reference 24, a method based on tabu search for scheduling independent rigid tasks (i.e., parallel tasks with a fixed number of cores) onto a homogeneous cluster is proposed. Since the number of processors per task is fixed a priori, such a scheduling problem can be seen as a two-dimensional packing problem. The tabu search approach is compared with three greedy strategies for this packing problem. We focus on the scheduling of moldable tasks, which is more complex and allows more choices.
A simulated annealing approach for the scheduling of sequential tasks with dependencies onto heterogeneous multiprocessor systems is proposed in Reference 25. In Reference 26, the best practices for defining the temperature, the acceptance functions, and move heuristics are presented. Although the given information is helpful for scheduling with simulated annealing in general, the scheduling approaches themselves are not suitable for parallel tasks.
Genetic algorithms are inspired by biological evolution, such as reproduction, mutation, recombination, and selection. In Reference 27, a genetic algorithm for the scheduling of independent sequential tasks onto a heterogeneous cluster is proposed. In genetic algorithms, the schedules are represented by chromosomes on which genetic operators such as crossover or mutation are applied. However, the chromosomal representation of schedules for sequential tasks is not suitable for parallel tasks.

CONCLUSION
In this article, we have proposed the task scheduling method HP* for assigning parallel tasks to heterogeneous platforms, which is based on the A* search algorithm. In addition, a heuristic cost function has been proposed that is able to reduce the search space of our algorithm. The pruning techniques presented are able to significantly reduce the search space and improve the efficiency of HP*. Theoretical results as well as practical measurements have been presented. Measurements with benchmark tasks from the SPLASH-3 benchmark suite have been performed, and the scheduling results of HP* have been compared to three heuristic scheduling methods as well as to newly implemented scheduling methods based on tabu search and simulated annealing. The performance results demonstrate that the use of HP* leads to a reduction of the total execution times (makespan) of the resulting schedules in comparison with the other scheduling algorithms. We have shown that the computing time of HP* is reasonable for small to medium-sized scheduling problems. The evaluation of the pruning techniques has shown that each pruning technique reduces the search space significantly. The highest reduction in the search space is achieved by a combination of all pruning techniques. Moreover, this complete pruning can prevent the system from running out of memory and allows the computation of larger scheduling problems.