SimAS: A Simulation-assisted Approach for the Scheduling Algorithm Selection under Perturbations

Many scientific applications consist of large and computationally-intensive loops. Dynamic loop self-scheduling (DLS) techniques are used to parallelize and to balance the load during the execution of such applications. Load imbalance arises from variations in the loop iteration (or task) execution times, caused by problem, algorithmic, or systemic characteristics. The variations in systemic characteristics are referred to as perturbations, and can be caused by other applications or processes that share the same resources, or by a temporary system fault or malfunction. Therefore, the selection of the most efficient DLS technique is critical to achieve the best application performance. The following question motivates this work: Given an application, an HPC system, and their characteristics and interplay, which DLS technique will achieve improved performance under unpredictable perturbations? Existing studies focus on variations in the delivered computational speed only as the source of perturbations in the system. However, perturbations in available network bandwidth or latency are inevitable on production HPC systems. Simulation-assisted scheduling algorithm selection (SimAS) is introduced as a new, control-theoretic-inspired approach to dynamically select DLS techniques that improve the performance of applications executing on heterogeneous HPC systems under perturbations. The present work examines the performance of seven applications on a heterogeneous system under all the above system perturbations. SimAS is evaluated as a proof of concept using native and simulative experiments. The performance results confirm the original hypothesis that no single DLS technique can deliver the absolute best performance in all scenarios, whereas the SimAS-based DLS selection resulted in improved application performance in most experiments.

Guided self-scheduling (GSS) delivers consistent performance under various system perturbations. However, GSS does not achieve the best performance under all perturbations, as shown in Figure 1B. As shown in the figure, no single technique delivers the best performance in all execution scenarios. 17 Adaptive DLS techniques include adaptive WF 9 (AWF), its variants 10 batch (AWF-B), chunk (AWF-C), batch-like (AWF-D), and chunk-like (AWF-E), as well as adaptive factoring 11 (AF), among others.
An a priori optimal selection of the most appropriate DLS technique for a given application and system is not trivial and oftentimes infeasible, given the various sources of load imbalance and the different load balancing properties of the DLS techniques. This represents the algorithm selection problem 12 in the context of scheduling. The goal of this work is to solve the algorithm selection problem for the scheduling of computationally intensive applications on HPC systems under perturbations. Earlier work studied the flexibility of DLS (taken as robustness to variable delivered computational speed) and the selection of the most robust DLS using machine learning 13 with the SimGrid (SG) 14 simulation toolkit. The selection of DLS techniques for synthetic time-stepping scientific workloads using reinforcement learning was also studied using simulative experimentation. 15 We define simulative experiments as the counterpart of native experiments* to describe performance results obtained from simulations rather than direct execution on computing systems, ie, native performance. The aforementioned work focuses on one source of perturbations, namely, the variation in the delivered computing speed, in time-stepping applications to learn from previous time steps. In addition, that approach is not directly applicable to noniterative applications. Scheduling solutions using static optimizations, eg, using evolutionary and genetic algorithms, 16 cannot dynamically adapt to the perturbations encountered during execution. Modern HPC systems are often heterogeneous production systems typically shared by many users. Therefore, perturbations in the available network bandwidth and latency are unavoidable in such systems.
The study of the performance of scientific applications with DLS under perturbations revealed that the most robust DLS technique, identified as the DLS technique that results in the least variation of the application execution time under various perturbations, does not always achieve the best performance in all execution scenarios. 17 Figure 1 shows the simulative performance of PSIA (cf, Section 4.1) on 696 cores of miniHPC (cf, Section 4.4) under perturbations (cf, Section 4.6). According to these results, GSS is the most robust DLS technique due to the minimal variation of its performance under perturbations ( Figure 1A); however, it results in poor application performance under perturbations. Even the next most robust DLS technique, WF, is outperformed by SS and AWF-C in certain perturbation scenarios, as can be seen in Figure 1B. These results suggest that even if the most robust DLS technique is known a priori, which could be challenging, the application performance degrades in different execution scenarios due to perturbations. Therefore, a methodology for the dynamic selection of DLS techniques is needed to achieve the highest possible performance in all execution scenarios. In the present work, in an effort to select the most appropriate DLS dynamically for a given application and system under perturbations, the simulation-assisted scheduling algorithm selection (SimAS) approach is proposed. The performance of two scientific applications (PSIA 18 and Mandelbrot 19 ) executed in single-sweep and time-stepping modes, and five synthetic workloads is studied on a heterogeneous HPC system using nonadaptive and adaptive DLS techniques, in the presence of perturbations. The synthetic workloads are used to cover a broader spectrum of application load imbalance profiles beyond what is encountered in practice.
*Direct experiments on real HPC systems.
The present work makes the following contributions. First, it extends and optimizes the Simulator in the Loop 20 (SiL) approach into the SimAS † approach for dynamically selecting a DLS technique during execution, based on the application characteristics and the present (monitored or predicted) state of the computing system. Second, it extends a dynamic load balancing tool (DLB_tool) from the literature 21 with additional DLS techniques and integrates the SimAS approach into the resulting tool, the DLS4LB. This work is structured as follows. Section 2 contains a brief review of the selected DLS techniques, the SG simulation toolkit, as well as the work related to the performance of scheduling scientific applications with DLS in the presence of perturbations. The proposed SimAS approach is conceptually described in Section 3. The factorial design of experiments, together with details about the DLS techniques and the SimAS implementation in the DLS4LB, the HPC system characteristics, and the perturbations injected in native and simulative experiments, is presented in Section 4. The analysis of the applications' performance under perturbations is presented in Section 5. Section 6 concludes the work and outlines potential future work. This paper extends the authors' prior work. 20 Specifically, an additional real-world application (Mandelbrot set) is used (in Section 4.1) to evaluate the performance of thirteen DLS techniques under perturbations. Two versions of the PSIA and Mandelbrot applications are considered, single-sweep and time-stepping (Section 4.1). The DLB_tool 21 is extended with four more DLS techniques (described in Section 4.2) and is used in this work to parallelize and load balance the real-world applications (see Section 4.1). In addition, the proposed SimAS approach is integrated into the DLS4LB (as clarified in Section 4.3). The performance of the applications of interest using DLS under perturbations is examined in native as well as simulative experiments (see Section 5.1).
The native experiments confirm the conclusions drawn from the simulative experiments, both for the performance of the applications under perturbations and for the evaluation of the proposed SimAS approach.

BACKGROUND AND RELATED WORK
Scheduling. The aim of scheduling is to achieve a load-balanced execution among parallel PEs with minimum scheduling overhead. A loop iteration is a parallel independent task that represents the smallest unit of work to schedule and execute. WF divides a batch of iterations into unequally-sized chunks, proportional to the relative PE speeds (called weights). The PE weights need to be determined prior to execution and are assumed not to change during execution. This work considers the practical implementations of FAC and WF. All nonadaptive DLS techniques account for variations in the iteration execution times due to application characteristics.
The adaptive DLS techniques monitor the performance of the application during execution and adapt the chunk calculation accordingly.
Adaptive DLS techniques include AWF 9 and its variants 10 AWF-B, AWF-C, AWF-D, and AWF-E, as well as AF, 11 among others. AWF is designed for time-stepping applications. It improves WF by adapting the relative PE weights during execution, monitoring the PE performance in each time step. AWF-B removes the time-stepping requirement of AWF and measures the performance after each batch to update the PE weights.
AWF-C is similar to AWF-B, with the weight updates performed upon the completion of each chunk instead of each batch. AWF-D is similar to AWF-B but considers the total chunk time (equal to the sum of the iteration execution times in the chunk plus the associated overhead of the PE to acquire the chunk) and all bookkeeping operations to calculate and update the PE weights, whereas AWF-B and AWF-C only consider the execution times of the chunk iterations. AWF-E is similar to AWF-C in updating the PE weights on every chunk, yet also similar to AWF-D in considering the total chunk time. Unlike FAC, AF estimates the mean and the standard deviation of the iteration execution times during execution and updates them based on the measured performance of the PEs on the executed loop iterations.
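To illustrate how nonadaptive DLS techniques compute chunk sizes, the following sketch implements SS, GSS, and a practical factoring variant (FAC with a batch of half the remaining iterations, often denoted FAC2). The formulas follow the published DLS literature; the code itself is an illustrative assumption, not the DLS4LB implementation.

```python
import math

def ss_chunks(N):
    # Self-scheduling: one iteration per chunk
    return [1] * N

def gss_chunks(N, P):
    # Guided self-scheduling: each chunk is the number of remaining
    # iterations divided by the number of PEs P
    chunks, remaining = [], N
    while remaining > 0:
        c = max(1, math.ceil(remaining / P))
        chunks.append(c)
        remaining -= c
    return chunks

def fac_chunks(N, P):
    # Practical factoring (FAC2): half of the remaining iterations
    # forms a batch, divided equally into P chunks
    chunks, remaining = [], N
    while remaining > 0:
        batch = math.ceil(remaining / 2)
        c = max(1, math.ceil(batch / P))
        for _ in range(P):
            if remaining <= 0:
                break
            c_now = min(c, remaining)
            chunks.append(c_now)
            remaining -= c_now
    return chunks

print(gss_chunks(100, 4))  # decreasing chunk sizes
print(fac_chunks(100, 4))  # equal chunks within a batch, halved across batches
```

GSS yields geometrically decreasing chunks, while FAC assigns equal chunks within each batch, trading scheduling overhead against load balance.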
Loop scheduling in simulation. SG 14 is a versatile event-based simulation toolkit designed for the study of the behavior of large-scale distributed systems. It provides ready-to-use application programming interfaces (APIs) to represent applications and computing systems through different interfaces: MSG (SG-MSG), SimDag (SG-SD), and SMPI (SG-SMPI). SG uses a simple and fast CPU computation model and verified, more complex network models, 23 rendering it well suited for the study of computationally-intensive parallel and distributed scientific applications.
Various studies have used SG to evaluate the performance of applications with DLS techniques in different scenarios. 13,15 To attain high trustworthiness in the performance results obtained with SG, the implementation of the nonadaptive DLS techniques in SG-SD has been verified 24 by reproducing the results presented in the work that introduced factoring. 7 The accuracy of the performance results obtained by simulative experiments against native experiments has also recently been quantified. 25 The present work employs the SG-SD interface to study the performance of scientific applications on heterogeneous HPC systems under perturbations.
Related work. Scheduling of applications on large HPC systems involves many sources of uncertainty, eg, task execution times and perturbations in the computing system. Therefore, many studies have focused on robust schedules that maintain certain performance requirements despite fluctuations in the behavior of the system. 26 Robust scheduling of tasks with uncertain execution and communication times was studied 16,27 using a multiobjective evolutionary algorithm and dynamic scheduling, respectively. Moreover, the flexibility of DLS techniques, defined as robustness to perturbations, was examined 28 in an effort to select the most flexible technique using machine learning. However, a robust scheduling technique may not always guarantee the best performance in all possible execution scenarios and for all application parameters 17 (eg, problem size and data distribution). Thus, dynamically selecting the best performing DLS technique is of paramount importance, given the broad spectrum of available DLS techniques, each with unique properties. Selecting the best performing DLS technique for time-stepping applications using reinforcement learning was introduced 15 by adapting to the variations in the delivered computational speed during previous time steps.
In addition, machine learning and decision trees were used to dynamically select the best performing DLS technique from a portfolio of DLS techniques 13 and for multithreaded applications parallelized with OpenMP 29 or with Charm++. 30 Scheduling solutions based on optimization techniques, such as genetic and evolutionary algorithms, cannot adapt to perturbations during execution. None of the aforementioned efforts considered perturbations in network bandwidth or latency. This work complements the previous efforts by studying the performance of scientific applications using nonadaptive and adaptive DLS techniques under different perturbation scenarios (variations in delivered computational speed, network bandwidth, and network latency) on an HPC system. A new approach, namely SimAS, is introduced to dynamically select DLS techniques that improve the performance of applications on HPC systems under multiple sources of perturbations, which are mostly only known during execution.

SimAS: SIMULATION-ASSISTED SCHEDULING ALGORITHM SELECTION
SimAS is inspired by control theory, where a controller (scheduler) is used to achieve and maintain a desired state (load balance) of the system (parallel loop execution). Its concept is motivated by the well-known control strategy, model predictive control (MPC). 31 The MPC predicts the performance of the system with different control signals to optimize system performance. As shown in Figure 2, a call to a loop simulator is inserted inside a typical scheduling loop. SimAS leverages loop simulators to predict the performance of the application with various DLS techniques. The system monitor and estimator components read the system state during the execution and update the computing system representation accordingly to feed the simulator with the current perturbations on the system. SimAS examines the predicted performance by the simulator with different DLS techniques and selects the DLS technique that achieves the best performance in the current execution scenario.
The above steps may be repeated several times during the execution of the loop, and the SimAS call frequency can be aligned with the occurrence of perturbations (monitored or predicted). The main idea of the SimAS approach (inspired by MPC) is to use the simulator (system model in MPC terms) to test different DLS techniques (control signals in MPC terms) on the loop execution (the system in MPC terms), before actually employing the selected DLS in the executing application.
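The MPC-inspired selection step described above can be sketched as follows: the simulator (the MPC system model) predicts the execution time for each candidate DLS technique (the control signals) under the current system state, and the technique with the shortest predicted time is selected. The function and parameter names are illustrative assumptions, not the actual SimAS interface.

```python
def simas_select(simulate, dls_portfolio, system_state, remaining_iterations):
    # Predict the performance of each candidate DLS technique with the
    # simulator, under the current (monitored) system state
    predictions = {}
    for dls in dls_portfolio:
        # 'simulate' plays the role of the MPC system model: it predicts
        # the loop execution time for one candidate control signal (DLS)
        predictions[dls] = simulate(dls, system_state, remaining_iterations)
    # Select the DLS technique with the shortest predicted execution time
    return min(predictions, key=predictions.get)

# Toy performance model standing in for the loop simulator
toy_model = {"SS": 120.0, "GSS": 95.5, "AWF-C": 88.2}
best = simas_select(lambda d, s, n: toy_model[d], toy_model.keys(), None, 1000)
print(best)  # AWF-C: the shortest predicted execution time
```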
The advantage of SimAS is that it leverages the use of already developed state-of-the-art simulators to predict the performance dynamically during execution. The prediction accuracy of a simulator is strongly influenced by the representation of both the applications and the systems in simulation as well as by the subsystem models it comprises. 25 Since loop simulators predict the performance of load imbalanced computationally-intensive loops, the influence of the memory subsystem (eg, complex memory hierarchy) on their performance is minimal.
Therefore, application performance can accurately be predicted via simulation. For instance, the percent error between native and simulative executions for a given application (PSIA 18 ) using the SG-SD interface was found to be between 0.95% and 2.99%. 25 The percent error is calculated as the absolute difference between the native and the simulative execution times, divided by the native execution time, and multiplied by 100. 32 It is expected that the accuracy and speed of the simulators employed by SimAS will improve as simulators in general are continuously being developed and refined.
The cost of frequent calls to SimAS can be amortized by launching parallel SimAS instances on dedicated resources to derive predictions for various DLS techniques. In addition, this cost can entirely be mitigated by asynchronously calling SimAS, ie, the application does not block (nor wait) until SimAS simulations complete.
The system monitor and estimator components can be implemented with a number of system monitoring tools, 33 such as collectl. Such tools can periodically be instantiated to measure the PE and network loads and to update the system representation in the simulator for the next call to SimAS. The measured chunk execution times can also be used to estimate the current PE computational speeds. The implementation details of the loop simulator and of SimAS are described in Section 4. The PE loads can be estimated and predicted using autoregressive integrated moving average (ARIMA) models. 34
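As a minimal sketch of the speed estimation step described above, the following function turns measured chunk execution times into a PE speed estimate. Exponential smoothing is used here as a simple stand-in for the cited ARIMA-based prediction; the function name and the smoothing factor are illustrative assumptions.

```python
def estimate_speed(flop_per_chunk, chunk_times, alpha=0.5):
    # Estimate a PE's delivered computational speed (FLOP/s) from
    # measured chunk execution times, exponentially smoothing
    # successive observations (alpha weights the newest observation)
    estimate = None
    for flop, t in zip(flop_per_chunk, chunk_times):
        observed = flop / t
        estimate = observed if estimate is None else alpha * observed + (1 - alpha) * estimate
    return estimate

# A PE computing 1e9-FLOP chunks that slows down under perturbation:
# the estimate tracks the drop from 1 GFLOP/s toward 0.5 GFLOP/s
print(estimate_speed([1e9, 1e9, 1e9], [1.0, 2.0, 2.0]))  # 625000000.0
```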

DESIGN OF EXPERIMENTS
In this work, we employ a factorial design of experiments, due to the numerous parameters and values to explore. The design of the factorial experiments 35 is presented in the following (cf, Table 1), along with details of the DLS techniques and the SimAS implementation, the implemented loop simulator, the computing system under test, and the perturbations injected in native and simulative experiments.

Applications
This work considers two real-world applications (executed as single-sweep applications and as time-stepping applications) and five synthetic (single-sweep) workloads.
Real applications.
1. PSIA. The parallel spin-image algorithm 18 (PSIA) is a computationally-intensive application from computer vision. PSIA is an embarrassingly parallel and algorithmically load imbalanced application, where the computational effort of a loop iteration depends on the input data. The performance of PSIA has been studied in prior work 18 and was enhanced for execution on a heterogeneous cluster by using nonadaptive DLS techniques. The total number of parallel loop iterations in PSIA is 400 000 (cf, Table 1), chosen to avoid extremely long execution times.

Synthetic workloads.
Five synthetic workloads are examined in this work. Each of the five synthetic workloads contains 400 000 parallel loop iterations. The number of floating point operations (FLOP count) in each loop iteration is assumed to follow five different probability distributions, namely: constant, uniform, normal, exponential, and gamma probability distributions. The probability distribution parameters used to generate these FLOP counts are also given in Table 1. The synthetic workloads are used to stress test the performance and usefulness of the proposed approach and to cover a broader spectrum of application load imbalance profiles than what may be encountered in practice. 15,37
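A sketch of how such per-iteration FLOP counts could be generated with the Python standard library is shown below. The function name and the distribution parameters are illustrative placeholders, not the values from Table 1.

```python
import random

def generate_flop_counts(n, dist, seed=42, **params):
    # Generate per-iteration FLOP counts for a synthetic workload from
    # one of the five probability distributions named in the text
    rng = random.Random(seed)
    samplers = {
        "constant":    lambda: params["value"],
        "uniform":     lambda: rng.uniform(params["low"], params["high"]),
        "normal":      lambda: rng.gauss(params["mean"], params["stddev"]),
        "exponential": lambda: rng.expovariate(1.0 / params["mean"]),
        "gamma":       lambda: rng.gammavariate(params["shape"], params["scale"]),
    }
    sample = samplers[dist]
    # FLOP counts must be positive integers
    return [max(1, int(sample())) for _ in range(n)]

# Illustrative gamma-distributed workload with 400 000 iterations
flops = generate_flop_counts(400_000, "gamma", shape=2.0, scale=5.0e6)
print(len(flops), min(flops) >= 1)
```

Varying the distribution (and its parameters) varies the load imbalance profile of the resulting synthetic workload, from perfectly balanced (constant) to highly skewed (exponential, gamma).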

Loop scheduling
Thirteen loop scheduling techniques are used to assess the performance of the above nine applications under various execution scenarios. These techniques represent a wide range of static and dynamic loop scheduling approaches. The DLS techniques can further be distinguished into five adaptive and seven nonadaptive techniques.

DLS4LB employs MPI two-sided communications for work distribution among processes and implements a master-worker execution model, where the master is responsible for handling work requests from workers. In addition, the master also acts as a worker and checks for outstanding work requests with a certain adjustable frequency. DLS4LB is designed to parallelize an application with minimum changes. Listing 1 shows, in green font color, the lines that need to be added to the application code to parallelize it.

Simulation-assisted scheduling Algorithm Selection
DLS4LB is extended to support the SimAS approach as the 14th option in DLS4LB. Following the DLS4LB approach of minimal application code changes, an application can use SimAS by inserting only two function calls, shown in red font color in Listing 1.
SimAS_setup function sets up the main data structure SimAS_info that holds important information, such as the number of PEs, the number of loop iterations, the path to the simulator, the FLOP file that contains the FLOP count per loop iteration, and the platform file that describes the computing system. In addition, SimAS_setup asynchronously starts the simulation of the application performance immediately with a portfolio of DLS techniques in parallel. SimAS_setup sets the scheduling technique to a default DLS (WF in this work), to allow the application to start and avoid delaying the application execution.
SimAS_update checks (every 5 seconds in this work) whether the simulation has finished and, if so, selects the DLS technique that allows the application to finish the largest number of tasks in the shortest time by manipulating the DLS_info structure; otherwise, it keeps the selected DLS technique unchanged. SimAS_update reruns the simulation if 50 seconds (the SimAS calling frequency) have passed since the simulator was previously called, or at every new time step for time-stepping applications. SimAS_update does not start a new instance of the simulator while an earlier one is still running, or when the number of remaining unscheduled iterations is less than or equal to the number of PEs.
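The SimAS_update decision logic described above can be sketched as follows. The helper callables (sim_finished, best_dls, start_sim) and the info dictionary are assumptions standing in for the actual DLS4LB structures, not its real API.

```python
def simas_update(info, now, sim_finished, best_dls, start_sim,
                 rerun_interval=50):
    # Sketch of SimAS_update: once the asynchronous simulation finishes,
    # adopt the best predicted DLS technique, then rerun the simulator if
    # the calling frequency has elapsed and enough iterations remain
    if sim_finished(info):
        info["selected_dls"] = best_dls(info)
        if (now - info["last_sim_start"] >= rerun_interval
                and info["remaining_iterations"] > info["num_pes"]):
            start_sim(info)
            info["last_sim_start"] = now
    return info["selected_dls"]

# Toy usage: the simulation finished and predicted AWF-B as fastest
info = {"selected_dls": "WF", "last_sim_start": 0.0,
        "remaining_iterations": 10_000, "num_pes": 128}
dls = simas_update(info, now=60.0,
                   sim_finished=lambda i: True,
                   best_dls=lambda i: "AWF-B",
                   start_sim=lambda i: None)
print(dls)  # AWF-B
```

Note that the default technique (WF in this work) remains in effect until the first simulation completes, so the application never blocks on SimAS.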

SimAS improvements
Several measures were taken in this work to mitigate the overhead of simulation during execution, such as running the simulation in parallel and asynchronously to the application, to avoid stalling the application execution.

Computing system in native and simulative experiments
The native experiments are executed on miniHPC, a heterogeneous cluster comprising Intel Xeon and Intel Xeon Phi nodes (cf, Table 1). The same system is represented in simulation for the simulative experiments.

Simulation details
Applications in simulation. LoopSim, # an SG-SD-based simulator, is used to simulate the applications of interest, where the loop iterations in the application code are represented as tasks. 25 To represent the computational effort associated with an application's loop iterations, the number of floating point operations (FLOP) of each loop iteration is counted using PAPI counters. 38 The FLOP count per iteration is then read by LoopSim during execution to simulate the computation per iteration. All DLS techniques supported by the DLS4LB are also implemented in LoopSim, and tasks are assigned to free and requesting simulated cores, similar to the native execution. Computing system in simulation. A computing system is represented in SG via an XML file, denoted as the platform file, in which each processor core is represented as a host. The computational speed of a processor core is estimated by measuring a loop execution time and dividing it by the total number of floating point operations included in the loop. 25 A Xeon core was found to be four times faster than a Xeon Phi core, as indicated by the relative core weights (cf, Table 1). The network bandwidth and latency represented in the platform file are calibrated with the SG calibration procedure. ‖
# https://github.com/unibas-dmi-hpc/LoopSim
¶ miniHPC is a fully controlled nonproduction HPC cluster at the Department of Mathematics and Computer Science at the University of Basel, Switzerland.
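For illustration, a minimal SG platform file might look as follows. The host identifiers, speeds, and link parameters are illustrative assumptions (reflecting only the 4:1 Xeon to Xeon Phi speed ratio noted above); the actual values are obtained through the SG calibration procedure.

```xml
<?xml version="1.0"?>
<!DOCTYPE platform SYSTEM "https://simgrid.org/simgrid.dtd">
<platform version="4.1">
  <zone id="miniHPC" routing="Full">
    <!-- Illustrative speeds only: a Xeon core is represented as
         four times faster than a Xeon Phi core -->
    <host id="xeon-core-0" speed="4Gf"/>
    <host id="phi-core-0"  speed="1Gf"/>
    <link id="net" bandwidth="12.5GBps" latency="1us"/>
    <route src="xeon-core-0" dst="phi-core-0">
      <link_ctn id="net"/>
    </route>
  </zone>
</platform>
```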

Injected perturbations
Three different categories of perturbations are considered in this work, namely, perturbations in the delivered computational speed, in the available network bandwidth, and in the network latency. Two intensities, mild and severe, are considered for each category. Two scenarios are considered for each intensity, where the value of the delivered computational speed is either constant or exponentially distributed.
All perturbations (cf, Table 1) are considered to occur periodically, with a period of 100 seconds, where the perturbations affect the system only during 50% of the perturbation period. The network (bandwidth and latency) perturbations commence with the application execution, whereas the delivered computational speed perturbations begin 50 seconds after the start of the application. Another perturbation scenario is created by combining all the perturbations from the individual categories.
Perturbations in simulative experiments. All perturbations are enacted in SG during simulation via the availability, bandwidth, and latency files, referenced from the platform file, to represent perturbations in the delivered computational speed, the available network bandwidth, and the network latency, respectively (cf, Table 1). Given that the applications of interest are computationally intensive and the data size communicated between the application's processes is minimal, perturbations in the network bandwidth do not have a significant effect on the application performance, as can be seen from the simulative experiments below. Therefore, perturbations in the network bandwidth are excluded from the native experimentation.
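For illustration, a periodic perturbation in the delivered computational speed can be expressed as an SG availability trace of the following form, where each line pairs a timestamp with an availability factor. The 0.5 availability factor shown is an illustrative value, not an intensity from Table 1; the trace encodes full speed during the first 50 seconds of each 100-second period and reduced speed during the second half.

```
PERIODICITY 100
0 1.0
50 0.5
```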
A combined perturbation scenario is created for the native execution by combining the PE availability perturbations and the network latency perturbations. As both perturbation distributions (constant and exponential) have a comparable effect on performance, with the impact of the constantly distributed perturbations being more evident, only the constant distribution of perturbations is considered in the native experiments.

EXPERIMENTAL EVALUATION AND DISCUSSION
The performance results of the execution of the applications with different loop scheduling techniques under different execution scenarios are illustrated and discussed. We also show the need for and the importance of the proposed SimAS approach.

Performance of scientific applications under perturbations
Simulative experiments. The simulative performance results of the two real applications, PSIA and Mandelbrot, under perturbations are shown in Figure 3. Under perturbations, WF cannot accommodate the variability in the system, especially the perturbations in the delivered computational speed of the PEs, as the PE weights are constant. The performance of FSC and mFSC is, in general, better than that of STATIC, GSS, TSS, and FAC. However, FSC and mFSC are highly affected by the perturbations in the PE availability. SS is resilient to perturbations in the delivered computational speed of the PEs. However, it is significantly influenced by the network latency variations, as can be seen in Figure 3 lat-cs and lat-es. Specifically, SS outperforms WF in Figure 3B pea-cs and pea-es.

These results suggest that no single DLS technique outperforms all other techniques in all execution scenarios and that even the most robust DLS technique can result in suboptimal performance in certain execution scenarios. Therefore, the best strategy is the dynamic selection of DLS techniques based on the current application and system states. SimAS is called every 50 seconds, as well as when there is a work request, to select the best performing DLS technique.
In other cases, the application performance with SimAS was slightly poorer than the best execution time independently achieved by other DLS techniques. This is due to the fact that loop scheduling is, by definition, nonpreemptive: the execution of already scheduled loop iterations cannot be preempted to be resumed with the newly selected (expectedly more suitable) DLS technique. The results for the time-stepping applications are shown in Figure 5. Similar to the single-sweep versions of PSIA and Mandelbrot, SimAS improved the performance of the applications in most cases. One can note that no single DLS technique always achieves the best performance; therefore, a dynamic selection of the DLS technique according to the current perturbations in the system is needed. The overhead of calling SimAS is, in general, below 0.5% of the execution time, except for PSIA_TS, for which the overhead is at most 2.7%. This is due to the short execution time of PSIA_TS compared to its non-time-stepping version.
The nonpreemptive scheduling approach of the DLS techniques significantly impacted the performance of the applications with SimAS. The execution of already scheduled chunks of loop iterations is not preempted to be resumed with the newly selected DLS technique. As shown in Figure 4A, even though SimAS selected DLS techniques with shorter execution times in the lat-cs case with the PSIA application on 128 cores, the execution time with SimAS was even longer than that of SS, which was not selected by SimAS.
For time-stepping applications, the effect of frequently switching the DLS technique and the nonpreemption overhead is much less than for single-sweep applications. Therefore, the performance of time-stepping applications with SimAS under perturbations is better than that of the single-sweep versions of the same applications as shown in Figure 5A and Figure 5B. It is planned to study the preemption of scheduled (yet not executed) loop iterations in the future, to switch, without further delay, from one DLS to another during execution.

Discussion
Even though the applications considered are computationally-intensive and only communicate loop indices with the master, perturbations in network latency had a significant impact on performance. The implementation choice of certain scheduling techniques, such as STATIC, which is implemented in an SS fashion, degraded their performance in scenarios with network perturbations.
Selecting the best performing DLS technique before execution might not deliver the best performance, as the perturbations in the HPC system are unknown a priori. For instance, the best DLS technique for Mandelbrot that could be identified before execution, ie, in the np execution scenario, is SS, which is outperformed by SimAS in lat-cs and pea+lat-cs in Figure 4B. A similar change in the best DLS technique can be seen from the results in Figure 5B. Since there is no high load imbalance in PSIA or PSIA_TS, there is no high variation in the performance of the different DLS techniques. Since the best DLS technique cannot be known before execution, SimAS improved the performance by dynamically selecting the DLS technique with the best performance based on the simulation predictions.
In general, the DLS techniques are designed to be efficient. However, in certain cases, efficiency comes at the cost of robustness, due to the reduced tolerance of the efficient techniques to uncertain events. Uncertainty is ineradicable, and it manifests in HPC systems as perturbations.
Perturbations due to nonfatal errors and system interference significantly degrade application performance on large-scale HPC systems. 1,39,40 This highlights the importance of a careful choice of the DLS technique for each application, system size, and execution scenario. The dynamic selection of DLS techniques ensures that each DLS technique is employed where and when it is the most efficient.
The SimAS approach can proactively select the best suited DLS technique before any perturbations manifest in the system, whenever perturbations can be predicted in advance. SimAS leverages the use of already developed simulators, instead of requiring the development of novel prediction techniques. The DLS selection decisions taken by SimAS can then be used to create a rule-based DLS selection mechanism for a given combination of application, system, and execution scenario, to improve application performance dynamically without the need for online simulation. The scheduling of scientific applications has traditionally been approached without preemption. However, operating systems perform preemption of the processes and threads that execute a certain application task. We believe that application-level scheduling can further improve performance if the scheduling strategies employ task preemption. It is planned in the future to experiment with preempting scheduled, yet not executed, loop iterations, to switch from one DLS technique to another during execution without further delay.