Domain knowledge specification for energy tuning

To overcome the challenges of energy consumption of HPC systems, the European Union Horizon 2020 READEX (Runtime Exploitation of Application Dynamism for Energy‐efficient Exascale computing) project uses an online auto‐tuning approach to improve energy efficiency of HPC applications. The READEX methodology pre‐computes optimal system configurations at design‐time, such as the CPU frequency, for instances of program regions and switches at runtime to the configuration given in the tuning model when the region is executed. READEX goes beyond previous approaches by exploiting dynamic changes of a region's characteristics by leveraging region and characteristic specific system configurations. While the tool suite supports an automatic approach, specifying domain knowledge such as the structure and characteristics of the application and application tuning parameters can significantly help to create a more refined tuning model. This paper presents the means available for an application expert to provide domain knowledge and presents tuning results for some benchmarks.

Pre-analysis first filters out regions with too short execution time to prevent excessive measurement overheads. Coarse granular regions that constitute most of the execution time are selected for dynamic tuning and are identified as significant regions 5 using a READEX tool called readex-dyn-detect. readex-dyn-detect then computes the tuning potential of significant regions for dynamic tuning.
If the pre-analysis step finds significant tuning potential, Design-Time Analysis (DTA) determines the application's tuning model. READEX targets applications that exhibit iterative behavior in the form of a main progress loop, which is called the phase region, and whose individual time steps are called phases. DTA exploits this iterative structure in its auto-tuning approach: within a single execution of the application, different system configurations are assessed in different program phases. This yields the best configurations for the runtime situations (rts's), ie, the individual invocations of significant regions. Finally, the rts's are clustered according to their best configurations, and the information is written to the tuning model file.
This tuning model file then guides the Runtime Application Tuning. At the start of a region, the rts is first determined based on the observed region identifiers, and its best configuration is taken from the tuning model. The current system configuration is then switched to the recommended one.
The implementation of DTA is based on the Periscope Tuning Framework (PTF) 6 and the Score-P 7, 8 monitoring system. Runtime Application Tuning is implemented as an extension of Score-P called the READEX Runtime Library (RRL). This paper presents new features of the READEX tool suite, enabling the application developer to specify domain knowledge that is used in DTA and runtime tuning for additional application tuning. This domain knowledge covers additional identifiers for the application structure and characteristics and application-level tuning parameters, such as the types of solvers, access strides, or number of subdomains for exploiting the dynamicity.
READEX uses the so-called identifiers to predict at runtime the characteristics of an upcoming rts. The READEX Domain Knowledge Specification Interface supports region identifiers to distinguish rts's, phase identifiers to distinguish phase characteristics, and input identifiers to distinguish executions with different application inputs. Without these identifiers, rts's of a significant region cannot be distinguished during runtime tuning, and thus would end up with the same system configuration even if they have a different behavior. Hence, these identifiers will improve the tuning model by distinguishing rts's and assigning them to different system configurations.
Section 2 first gives a general overview of the implementation of DTA, and Section 3 outlines related work. Section 4 then presents the integration of the domain knowledge specification in the design-time workflow. It describes how domain knowledge, such as the program regions and the phase region (Section 4.1.1), application characteristics (Section 4.2), and the ATP (Section 4.3) specifications can be added by the application expert. Section 5 provides details on the implementation of the identifiers and the ATPs and how DTA uses the domain knowledge to generate a more refined tuning model. The static and dynamic energy savings obtained after applying region identifiers for the Multi-Grid (MG) benchmark from the NAS parallel benchmark suite 9 are discussed in Section 6.1, followed by a description of the trend in the energy consumption for the multi-zone version of the Block Tri-diagonal solver (BT) for different input sets in Section 6.2. An evaluation of the results obtained for the ESPRESO library 10 using preconditioners, domain decomposition, and iterative solver switching as the Application Tuning Parameters is presented in Section 6.3.
Finally, the paper summarizes the main results and draws conclusions.

DESIGN-TIME ANALYSIS
DTA is implemented using the Periscope Tuning Framework (PTF) that was developed at the Technical University of Munich, Germany. PTF is a distributed framework consisting of the frontend, the tuning plugins, the experiment execution engine, and a hierarchy of analysis agents. 6 PTF provides several tuning plugins for different tuning aspects, such as energy consumption and compiler flags. A tuning plugin determines a number of system configurations based on expert knowledge and then assesses the configurations via experiments that measure the objective function.
A novel tuning plugin called the READEX Tuning Plugin was developed for PTF to perform DTA by running experiments that currently evaluate three tuning parameters, ie, CPU frequency, uncore frequency, and the number of OpenMP threads within a single program run. Each experiment is an execution of the phase region. First, the plugin reads the ranges (minimum, maximum, and the step size) of the tuning parameters, the search algorithm, the objective to tune the application for, and the Application Tuning Parameters (ATPs). The plugin uses one of several predefined search algorithms to walk the search space consisting of the cross-product of the tuning parameters and ATPs.
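As a rough illustration of the search space described above, the following Python sketch expands (minimum, maximum, step) ranges into value lists and forms their cross-product, which is what an exhaustive search would walk. The parameter names and the numeric ranges are hypothetical and do not reflect the READEX configuration file format.

```python
from itertools import product


def expand(minimum, maximum, step):
    """Return all values from minimum to maximum (inclusive) in increments of step."""
    return list(range(minimum, maximum + 1, step))


# Hypothetical ranges (minimum, maximum, step); frequencies in MHz.
ranges = {
    "cpu_freq": (1200, 2400, 400),
    "uncore_freq": (1400, 2200, 400),
    "omp_threads": (12, 24, 12),
}

names = list(ranges)
values = [expand(*ranges[name]) for name in names]

# Each element is one candidate system configuration.
search_space = [dict(zip(names, combo)) for combo in product(*values)]
```

Other search algorithms would visit only a subset of `search_space`; ATPs, when present, would add further dimensions to the same cross-product.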
For each selected configuration, an experiment is executed in which the effect of the system configuration on the rts's of significant regions is measured. The measurements for the phase as well as the objective values for the rts's are retrieved from Score-P and propagated to the tuning plugin. The best system configuration for the rts's of the significant regions is returned for the selected objective. The result of this search, including the objective values, the execution time, the number of instances, and the best found configuration for each rts, is then stored in an RTS database.
Finally, a classifier groups the rts's with similar or identical best found configurations into scenarios using a similarity score that is computed by aggregating the closeness of system configurations. The clustering limits the number of scenarios and thus reduces the associated runtime overhead. A selector then chooses the best configuration for each scenario for the specified objective. The knowledge obtained during DTA, such as the best-found system configurations for individual scenarios, is encapsulated in a tuning model (TM).
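The grouping of rts's into scenarios can be pictured with the following Python sketch. It is only one possible reading of the description: the distance measure (a sum of range-normalised parameter differences), the greedy assignment, the threshold, and all rts names and values are assumptions made for illustration, not the classifier implemented in READEX.

```python
def distance(a, b, spans):
    """Sum of per-parameter differences, each normalised by that parameter's range."""
    return sum(abs(a[key] - b[key]) / spans[key] for key in spans)


def cluster(best_configs, spans, threshold):
    """Greedily assign each rts to the first scenario whose representative is close enough."""
    scenarios = []  # list of (representative configuration, [rts names])
    for rts, config in best_configs.items():
        for representative, members in scenarios:
            if distance(config, representative, spans) <= threshold:
                members.append(rts)
                break
        else:
            # No existing scenario is close enough, so this rts starts a new one.
            scenarios.append((config, [rts]))
    return scenarios


# Hypothetical parameter ranges (MHz) and best configurations per rts.
spans = {"cpu_freq": 1200, "uncore_freq": 800}
best = {
    "interpolate/k=3": {"cpu_freq": 2400, "uncore_freq": 1400},
    "interpolate/k=4": {"cpu_freq": 2400, "uncore_freq": 1400},
    "interpolate/k=8": {"cpu_freq": 1200, "uncore_freq": 2200},
}
scenarios = cluster(best, spans, threshold=0.25)
```

In this toy input, the first two rts's share a scenario and the third forms its own, so only two configurations would have to be distinguished at runtime.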
For production runs, the tuning model is forwarded to the READEX Runtime Library (RRL), which performs runtime tuning by detecting the scenarios and dynamically switching to the best configurations for upcoming rts's.

RELATED WORK
Only a limited number of automatic tuning approaches and tools are available for runtime optimizations using domain knowledge. Here, we present the most recent auto-tuning approaches used in research projects.
The AutoTune project 6 developed a static tuning approach based on Dynamic Voltage and Frequency Scaling (DVFS), which involves increasing or decreasing both the voltage and the frequency. During this project, a specialized tuning plugin was implemented using the energy consumption as the tuning aspect. The plugin generates models using the performance data of a certain platform to predict the energy consumption, time, and power at different CPU frequencies. The plugin uses the enopt library 11 to configure the CPU frequency for different regions and to collect energy measurements. The library updates a timer counter every 10 seconds to collect energy measurements. This is a static tuning approach where the CPU frequency remains unchanged for a whole program region. In the READEX approach, both the CPU frequency and uncore frequency can be configured dynamically for all the program regions.
The MATE 12 framework implemented a tuning environment to tune applications dynamically during runtime. It monitors the application and detects performance bottlenecks. The application is then adapted dynamically to the observed behavior changes without recompiling or re-running. The tuning is driven by aspect-specific tuning components. For example, in master/worker type applications, the number of workers is adapted to the number of tasks. However, the framework limits the tuning approach to MPI processes. In contrast to READEX, the MATE framework focuses on performance improvements. It is the responsibility of the expert to provide the aspect-specific tuning components. READEX provides a general tuning strategy independent of the application structure.
Calore et al 13 used static Dynamic Voltage and Frequency Scaling (DVFS) to tune modern processors, namely, an NVIDIA K80 Graphics Processing Unit (GPU) and an Intel Haswell CPU. They target the energy costs of HPC applications by tuning CPU and GPU frequencies. In this work, they focused on a hardware tuning approach. Although they considered a few software tools for energy optimization, they preferred to avoid the software approach, and thus could not exploit, eg, application tuning parameters.
The ANTAREX project 14 uses a domain-specific language (DSL) for tuning applications targeting energy-efficiency on multi-core CPUs and accelerators. This DSL approach uses adaptive and auto-tuning strategies to enforce runtime application tuning. The DSL template interacts with the runtime resource and power manager to configure software tuning parameters (such as application parameters, code transformations, and code variants) for the application regions. An additional compilation step is required to translate the DSL template into the target language. This DSL approach applies only to ARM-based multi-cores and GPGPUs, while the READEX tuning approach targets different HPC platforms. Furthermore, READEX targets standard HPC applications, while ANTAREX requires applications to be programmed in the DSL.
The use of domain knowledge in the READEX methodology for energy efficiency significantly extends our current dynamic tuning approach.
None of the aforementioned works have yet explored this aspect to the best of our knowledge.

DOMAIN KNOWLEDGE SPECIFICATION
The READEX programming paradigm developed the Domain-level Knowledge Specification Interface in order to enhance the tuning process. This is done by enabling the application expert to provide user-level specifications to expose domain-level knowledge, such as information characterizing application dynamism and the parameters to be used for tuning. The specification allows the identification of new system scenarios, and covers the following aspects of domain knowledge: 1. Specification of application structure (Section 4.1), 2. Specification of application characteristics (Section 4.2), and 3. Specification of application tuning parameters (Section 4.3).
The three aspects of the domain knowledge specification are integrated at different stages of the DTA workflow, as shown in Figure 1. As the first step, the application expert specifies the application structure, which includes the phase region and additional program regions that should be considered for tuning. Then, the application dynamism is detected with the help of a tool called readex-dyn-detect. Based on the result of the dynamism detection, READEX tuning is either stopped in case of no available dynamism, or the application expert may identify application characteristics to support DTA in case there is enough dynamism. These characteristics are marked by identifiers, such as region identifiers, phase identifiers, and input identifiers. The identifiers will allow DTA to generate a more sophisticated application tuning model by using a specific setting of the tuning parameters for certain application characteristics. While the region and phase identifiers are specified in the source code via Score-P macros, input identifiers are added in files accompanying the original input file.
Finally, the Application Tuning Parameters (ATPs) are specified in the application source files via API calls to the ATP library. When the application is executed, the ATPs are written to the ATP description file and are then read by PTF at the start of the analysis. Separate tuning model files are generated for each input specification, which are then merged into the final application tuning model. DTA terminates after processing all the inputs. The generated merged tuning model is then read during the runtime tuning stage.

Region and phase specification
The READEX approach is based on tuning program regions. The decomposition of the program into program regions is typically defined by the syntax of the programming language. Common types of regions are subroutines, loops, and structured blocks. Besides standard regions, the application expert might be able to identify additional regions that are not represented as a standard region but may have interesting characteristics or tuning potential. Typical use-cases for user regions are as follows.
• User regions can combine several calls to different functions, which belong together and are too fine granular for switching the configuration individually.
• They can identify certain parts of an algorithm that would otherwise not be a target of tuning because this part is not represented by a standard region type of the programming language.
• If instrumentation is too fine granular and leads to a lot of overhead, automatic instrumentation can be switched off and significant regions can be manually instrumented.
The READEX tuning approach targets applications that have a central progress loop, called the phase region, that iteratively performs some computation. The phase region is a program region that defines the phases of execution. Although this loop can be implemented by a standard loop construct of the programming language, automatic detection of the phase region is difficult. Score-P 7 offers an online access phase region to enable external tools to configure Score-P dynamically when a phase is started. This configuration mechanism is used in DTA to perform experiments for evaluating different system configurations. Both the start and the end of the phase region entail a barrier synchronization of all participating processes when an online tool like PTF is connected to Score-P.
The phase region and user regions are defined in READEX using Score-P macros. These macros can enclose arbitrary code and are instrumented automatically. As a result, the annotated regions can be handled by the READEX tool suite like any other program region.
First, the region handles are defined with the macro SCOREP_USER_REGION_DEFINE. Then, the start and end of the user region are marked with the macros SCOREP_USER_REGION_BEGIN and SCOREP_USER_REGION_END, while the phase region is enclosed in the macros SCOREP_USER_OA_PHASE_BEGIN and SCOREP_USER_OA_PHASE_END.

Use-Case
A use-case for the phase region and a user region specification for the MG (MultiGrid) benchmark from the NAS parallel benchmark suite 9 is shown in Listing 1.

Region identifier specification
Region identifiers, which may be specified by the application expert, along with the region name and the call path, can be used to distinguish runtime situations with different characteristics. Without these identifiers to help the DTA identify and distinguish runtime situations, the tuning model cannot specify different configurations. Region identifiers are specified as Score-P user parameters and can be of type integer or string. They are defined using the Score-P macros SCOREP_USER_PARAMETER_INT64(handle,name,value) and SCOREP_USER_PARAMETER_STRING(handle,name,value), respectively. The parameters determine region identifiers for a program region. In FORTRAN, first, a handle for each parameter has to be declared with a SCOREP_USER_PARAMETER_DEFINE macro. Then, the handle is used to associate a name and a value with the parameter.

Use-Case
In Listing 1, the size of the grid processed in the call to interpolate( … , k) in the MG benchmark grows when going from the minimum grid level to the maximum. At a certain grid level, the computation switches from being compute bound to memory bound. To enable DTA to determine special system configurations for compute and memory bound runtime situations, the application expert can add a region identifier for the grid level inside the interpolate region. First, the identifier is defined in line 39 of Listing 1, and then associated with the value of the grid level, as shown in line 41.

Phase identifier specification
Phase identifiers identify phases with different characteristics or behavior and can be used in the tuning model to distinguish rts's based on the variation in the behavior of the phase. Phases that have similar characteristics can be grouped together into a cluster using phase identifiers, such as the degree of sparsity of the matrix or the arithmetic intensity of the phase region, thus enabling the prediction or selection of different best configurations for different phase clusters. Moreover, different configurations can be set for rts's in different phase clusters.
Phase identifiers are provided in the same way as region identifiers (Section 4.2.1) via Score-P user parameters that are attached to the phase region. Phase identifiers should be chosen carefully as they have a high impact in selecting the best configuration for the phases in a cluster. Typically, these should be provided by the application expert who knows how the application behaves when certain aspects in the code are modified. Phase identifiers such as the Compute Intensity and the Conditional Branch Instructions enable DTA to cluster phases that have a similar behavior into a group. Hence, phases that have a higher compute intensity depicted by the spikes in Figure 2 may be grouped into one cluster. DTA may now use the phase identifiers to select a different best configuration for each cluster for the specified objective.

Input identifier specification
The READEX approach characterizes variations in application executions for different input sets via input identifiers provided by the application expert. As a result, DTA will be able to identify more rts's with different characteristics, which will eventually improve the tuning model. In the MG benchmark, while the grid level switches from a coarser grid to the next finer grid, the compute intensity characteristics might change, eg, from memory bound to compute bound. On which grid level this switch happens, depends, eg, on the grid size given in the input file. Thus, the region identifier for the grid level k (Section 4.2.1) for the rts's of the regions is not sufficient to improve the tuning model for different input sizes.
As the optimal configuration is dependent on the size of the finest grid, this grid size has to be taken into account when selecting the configuration for a certain grid level at runtime. In addition to the size of the finest grid, the number of processes is also important. The more processes are used, the better the data is distributed over the caches and the earlier, in terms of grid level, the application switches between memory bound and compute bound.
The numbers of processes and threads are considered as standard input identifiers and need not be given in the specification file.
LISTING 2 Input identifier specification file IID_SPEC

Application Tuning Parameters
In addition to hardware and system tuning parameters, READEX also targets application level parameters, ie, parts of the code itself, which could be used as tuning parameters. The simplest example of this is the case where different implementations of the same algorithm are available, each having its own impact on performance and energy. Examples of application tuning parameters include the choice between different decomposition algorithms, preconditioners, or blocking factors. The aim of using Application Tuning Parameters (ATPs) is to exploit the possibility to switch between the different implementations or, in a more general sense, the possibility for READEX to choose between code level alternatives at runtime.
In order to optimize energy efficiency by choosing the best settings for ATPs, the application developer needs to pass the knowledge on to the tuning system. READEX provides an annotation API, ie, the source code is annotated at specific locations to provide the tuning system with the necessary details on the tuning parameters.
The ATP component ensures support for ATP handling in the READEX tool suite. The overall process by which ATPs are handled can be summarized by the following steps.

1. Annotation and instrumentation: The annotation step is primarily done by the application developer. It consists of exposing the control variables that drive the switching between the possible alternatives by marking them as ATPs with API functions. The marking also allows additional details, such as the range of values, to be provided to the tuning system at runtime. All API functions are implemented in the ATP library.

2. Design-Time Analysis: During the first phase of the application in DTA, the ATP library writes the information about the ATPs to a file, and the ATP component provides valid settings for the ATPs to the READEX tuning plugin in PTF. During an experiment, the ATP library retrieves the specified setting and assigns its value to the control variable, or assigns the default value given with the ATP specification.

3. Runtime Application Tuning: In a production run, the behavior of the ATP library changes. The annotation part (API parts involved in parameter detail gathering) is shut down, thus leaving only the part related to value assignment active. Hence, the ATP component receives the values from the RRL and assigns them to the control variables.

ATP declaration and retrieval
The control variables used for ATPs can be of simple data types, such as integers, booleans or floats, as well as complex types such as arrays or structures. The current version of the READEX tool suite is set to handle integer data types, as exploring integer ranges and solving constraints between them requires a reasonable amount of compute power compared with floats and complex structures. Hence, in the cases where a tuning parameter is a complex structure, the user may, when possible, simplify it by mapping its possible values to integer variables.

Domains
An application source code may contain several ATPs. Ideally, these would be independent from each other, enabling independent tuning. In practice, this is not always the case; the values of a parameter may depend on those of another one, and hence, this would translate into the notion of constraints between parameters. Therefore, the application developer may indicate to READEX that a number of parameters have constraints between them by putting them in the same domain.
In the API, domain declaration is included within the parameter declaration call, where both parameter name and domain name need to be supplied. In addition, if no domain name is supplied, the parameter is assigned to the default domain.

Constraints
In the case where two or more ATPs have constraints between them, it is necessary to express the constraints to READEX, as this allows only valid values to be generated for the tuple of parameters. In order to take dependences between ATPs into account, the ATP description formalism allows these dependences to be expressed mathematically in the form of logical constraints. An example of such constraints is shown in Figure 3.
Constraint declaration goes through a single API function call, as illustrated in line 6 of Listing 3, where the constraint is given in the form of a logical expression. Line 29 in Listing 4 illustrates the declaration of a constraint between the ATPs solver and mesh.
In order to avoid increased complexity in constraint solving, READEX accepts affine function based constraints only. These are sufficient to handle a wide range of constraints. Therefore, in addition to logical operators, addition, subtraction, multiplication, and division are accepted to form affine functions.
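To make the effect of such constraints concrete, the Python sketch below enumerates the tuples of two ATPs placed in the same domain and keeps only those that satisfy a logical combination of affine conditions. The value ranges and the constraint itself are invented for illustration, and the brute-force filter merely stands in for a real constraint solver; it is not the READEX API.

```python
from itertools import product

# Hypothetical integer ranges for two ATPs that belong to the same domain.
atp_ranges = {"solver": range(0, 3), "mesh": range(0, 81, 20)}


def is_valid(point):
    """Hypothetical constraint: solver 0 accepts any mesh, other solvers need mesh >= 40."""
    return point["solver"] == 0 or point["mesh"] >= 40


names = list(atp_ranges)
candidates = [dict(zip(names, combo)) for combo in product(*atp_ranges.values())]

# Only tuples that satisfy the constraint are handed on as valid search points.
valid_points = [point for point in candidates if is_valid(point)]
```

Of the 15 candidate tuples in this toy domain, 11 remain, so no experiment would be spent on an invalid combination.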

Exploration
The ATP library API provides a function call for the application developer to give hints to the READEX tuning system about what search algorithm to use in the DTA, eg, exhaustive, individual, random, or genetic. The goal is to let the developer add knowledge about the search space. The call passes an ordered list of exploration heuristics to DTA. It should be noted that the hints are tied to a domain, which means that all the parameters included in the domain would be subject to the same hints. Line 7 of Listing 3 shows the prototype of the exploration API call.

IMPLEMENTATION
First, the domain knowledge specification is added to the application to specify the phase region, as shown in Figure 1. Pre-analysis steps are then performed by readex-dyn-detect, which returns the significant regions and computes the tuning potential at the end of the dynamism analysis step. readex-dyn-detect stores the dynamism information for each significant region along with tags to specify the objectives, the tuning parameters, and the search algorithm in a READEX configuration file. Figure 4 gives an overview of the interaction of the components of the READEX methodology with the domain knowledge during DTA and runtime tuning. The components are described as follows.

Score-P
After the pre-analysis steps, the significant regions are annotated using Score-P macros (Section 4). The Online Access Interface allows online tools like PTF to connect to the Score-P monitoring infrastructure linked to the application processes when the phase region is entered for the first time.
The important program regions are then stored in PTF.

ATP Library
The ATP library provides an annotation API to the application expert. The API allows specific control variables in the source code to be marked as application parameters. It also provides the instrumentation interface, which assigns the parameter values generated by PTF and prepared by the runtime library to the right control variable. At the end of the first phase of the application execution, the ATP library generates the ATP description file in the JSON format, which consists of details about the declared application parameters.

ATP Server
In contrast to the ATP library, which is tied to the application itself, the ATP server is launched by PTF and reads the contents of the ATP description file. The primary task of the ATP server is to respond to PTF requests on ATPs. Some requests are simple, such as querying the list of ATP parameters, and the server can get the information by looking in the content of the ATP description file. Other requests are more complex, such as the generation of a list of valid points for the parameters, which require the resolution of the constraints held between parameters. These require the server to query a third party constraint solver software called the Omega Calculator. 16

Omega Calculator
The Omega Calculator is a constraint solver for affine functions. It generates the valid values for each ATP under the recorded constraints by filtering out the tuning parameter values that do not satisfy them. The software consists of the Omega Library, which constitutes the core of the solver, and a text interpreter for querying the library. It is released under the BSD license, and the source code can be downloaded freely from GitHub. 17 One big advantage of the Omega Calculator is the short computation time needed to solve affine-function-based constraints, which makes it suitable for solving the constraints at runtime.
After the execution of the phase, Score-P returns the measurements for the rts's, and partial Calling Context Trees (CCTs) * are generated at each analysis agent for the MPI processes controlled by it. The partial trees are then gathered to create the complete tree in the PTF frontend. The phase region represents the root node of the CCT, and its children represent call sites of the significant regions. Each child node can be viewed as an rts, ie, an invocation of a significant region. If the region identifier is specified, an rts can be identified by its region name, the call path, and the region identifier. In Listing 1, a separate node representing an rts of interpolate is created for each grid level. Hence, the rts's of interpolate can now be distinguished in PTF via the call path, which includes the region identifiers (represented as parameter_name=value). If no region identifier is given, these rts's would be merged into a single node in the tree and would be indistinguishable for each grid level.
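The role of the region identifier in distinguishing rts's can be sketched in a few lines of Python. The key layout, the identifier name `level`, and the call path are assumptions made for this illustration and do not mirror the internal data structures of PTF.

```python
def rts_key(region, call_path, identifiers=None):
    """Build a hashable key from region name, call path, and optional region identifiers."""
    params = tuple(sorted((identifiers or {}).items()))
    return (region, tuple(call_path), params)


# Hypothetical call path from the phase region to the significant region.
call_path = ["phase", "interpolate"]

# With a region identifier, every grid level yields its own rts.
with_identifier = {rts_key("interpolate", call_path, {"level": k}) for k in range(1, 5)}

# Without it, all invocations collapse into a single rts.
without_identifier = {rts_key("interpolate", call_path) for _ in range(1, 5)}
```

Here `with_identifier` holds four distinct keys, whereas `without_identifier` holds one, which corresponds to the merged node described above.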

READEX Tuning Plugin
The measurements for the phase and the objective values for the rts's are then propagated to the tuning plugin.
For the Energy objective, the plugin computes the consumed energy per experiment by aggregating the values returned by the designated processes of all the nodes for an MPI application. Since energy is a global metric of a node, only a designated single process in each node returns this value to Score-P. The plugin then outputs the best configuration for both the phase and the valid rts's. The plugin also performs additional experiments for verifying the savings incurred. The static saving for the phase is computed as the improvement in the energy consumption for the best setting of the tuning parameters over the energy consumed for the default setting of the tuning parameters. The static energy saving for the rts's is the improvement in the energy consumed by the rts's for the best static configuration for the phase over the energy consumed for the default configuration.
The dynamic savings are computed as the improvement in the energy consumption for the best configuration for the rts's over the energy consumption for the best static configuration.
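The two kinds of savings can be written down compactly. The sketch below assumes that a saving is reported as the relative improvement over its baseline in percent; the energy values are invented and serve only to show which measurement acts as the baseline in each case.

```python
def saving(baseline, tuned):
    """Relative improvement of tuned over baseline, in percent."""
    return 100.0 * (baseline - tuned) / baseline


# Hypothetical energy readings in joules for one phase.
energy_default = 1000.0      # default setting of the tuning parameters
energy_best_static = 900.0   # best single configuration for the whole phase
energy_best_dynamic = 855.0  # best configuration applied per rts

# Static saving: best static configuration versus the default setting.
static_saving = saving(energy_default, energy_best_static)

# Dynamic saving: per-rts configurations versus the best static configuration.
dynamic_saving = saving(energy_best_static, energy_best_dynamic)
```

Note that the dynamic saving is measured against the best static configuration, not against the default, so the two percentages refer to different baselines.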

RTS Database
The information about each rts, including its call path, region identifiers, default objective values, and the best and the worst setting of the tuning parameters for the READEX tuning plugin is stored in the RTS database.

Tuning Model Generation
Finally, a tuning model is generated from the knowledge stored in the RTS database. The tuning model generation clusters the rts's into scenarios using a classifier. The classifier maps each valid rts onto a unique scenario based on the best configuration. The rts's may also be clustered based on similar best configurations using a similarity score that determines if the objective values are close to each other. Thus, instead of treating the two rts's as different scenarios, they can be merged into one scenario to reduce the overhead of scenario detection and switching at runtime. For each scenario, a selector is generated that returns the best configuration for that scenario with respect to the chosen objective. The tuning model encapsulates this knowledge as a JSON file.
* A context-sensitive version of a call graph.

Tuning Model Merger
For each input identifier specified in the input specification file, a separate tuning model is generated at the end of DTA for the application. The Tuning Model Merger then merges all the tuning models into one merged application tuning model file, which is then read at runtime to perform dynamic switching.
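A minimal sketch of the merge step is given below. It assumes that each per-input tuning model is a plain dictionary and that the merged file keys the models by their input identifier; this layout and the function names are hypothetical. The two configurations are the best configurations reported for BT-MZ Class B and Class C in the evaluation.

```python
from typing import Dict, Optional


def merge_tuning_models(models_by_input: Dict[str, dict]) -> dict:
    """Combine one tuning model per input identifier into a single structure."""
    return {
        "tuning_models": [
            {"input_identifier": input_id, "model": model}
            for input_id, model in sorted(models_by_input.items())
        ]
    }


def lookup_model(merged: dict, input_id: str) -> Optional[dict]:
    """Return the tuning model that matches the input identifier of the current run."""
    for entry in merged["tuning_models"]:
        if entry["input_identifier"] == input_id:
            return entry["model"]
    return None


# One model per problem class; only the phase configuration is shown.
merged = merge_tuning_models({
    "B": {"phase": {"CPU_Freq": 1.2, "Uncore_Freq": 1.0}},
    "C": {"phase": {"CPU_Freq": 1.3, "Uncore_Freq": 2.0}},
})
print(lookup_model(merged, "C"))
```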

READEX Runtime Library (RRL)
The READEX Runtime Library (RRL) is used during both design-time and runtime. At design-time, RRL only switches the tuning parameter settings. During production runs, it uses the ATP library and the merged application tuning model to look up the best configuration for each input identifier and ATP upon encountering a valid rts. If the expected energy saving for an rts significantly exceeds the switching overhead, the configuration is switched. RRL employs a separate parameter control plugin for each tuning parameter to perform configuration switching. 18 RRL may also perform calibration, using different machine learning techniques to predict the optimal configuration if it encounters rts's that were not seen at design-time.
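The switching decision described above can be sketched as a simple comparison. The function name, the `margin` factor that stands in for "significant", and the numeric values are assumptions made for illustration; they do not reflect the actual RRL interface.

```python
from typing import Dict

Config = Dict[str, float]


def should_switch(current: Config, target: Config,
                  expected_saving_j: float, switching_overhead_j: float,
                  margin: float = 2.0) -> bool:
    """Switch only if the expected saving clearly outweighs the switching overhead.

    `margin` is a hypothetical factor: the saving must be at least `margin`
    times the overhead to count as significant.
    """
    if current == target:
        return False  # already running with the best configuration
    return expected_saving_j >= margin * switching_overhead_j


current = {"CPU_Freq": 2.5, "Uncore_Freq": 3.0}  # hypothetical current setting
target = {"CPU_Freq": 1.2, "Uncore_Freq": 1.0}   # configuration from the tuning model

print(should_switch(current, target, expected_saving_j=12.0, switching_overhead_j=1.5))
print(should_switch(current, target, expected_saving_j=2.0, switching_overhead_j=1.5))
```

In the second call, the saving exceeds the overhead but not by the assumed margin, so the configuration is left unchanged.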

EVALUATION
The evaluation of DTA using domain knowledge was performed on three applications: Multi-Grid (MG), 9 Block Tri-diagonal solver (BT-MZ), 9 and ESPRESO. 10 First, regions that are too fine-grained were filtered out using a Score-P filter file, which lists all the regions to be excluded from instrumentation. Then, readex-dyn-detect was run on the three applications separately, and the significant regions of each application were written to a READEX configuration file. The phase region and the significant regions were annotated using Score-P macros (Section 4.1.1). The Energy objective was specified for all three applications.
The domain knowledge specification for MG was the grid level k, which was specified as the region identifier for the interpolate subroutine.
For BT-MZ, two problem classes (B and C) were specified as the input identifiers. For ESPRESO, application tuning parameters, such as preconditioners, iterative solver methods, and domain decompositions were specified. DTA was performed for each application using the READEX tuning plugin, and the savings for the specified objective were computed. Finally, tuning models were generated for MG and ESPRESO, while a merged application tuning model was generated for BT-MZ. Table 1 presents the results obtained from readex-dyn-detect and DTA for the MG benchmark after specifying the grid level as a region identifier.

Region identifiers
MG was executed for Class C with grid size 512×512×512 using 4 MPI processes and 1 OpenMP thread on a single node of the Taurus HPC system.
After applying readex-dyn-detect, the significant regions that were obtained are rprj3, interp, psinv, and resid. The valid rts's for each significant region of MG are presented in Table 1. The READEX tuning plugin executed a total of 70 experiments using the exhaustive search strategy.
The best CPU frequency and uncore frequency settings for each valid rts of a significant region are shown in columns 5 and 6, respectively, along with the total energy consumed in column 3 and the execution time in column 4. As can be seen, the best setting for the phase is not necessarily the best setting for all the rts's, and the best configuration varies for the rts's of interp. The static energy saving for the rts's amounts to 6.9% for all the significant regions and 4.5% for the phase region as compared to the default energy values. The dynamic saving obtained with READEX tuning of the rts's using domain knowledge was 2.2%.
Although there are both static and dynamic savings for MG, there is a trade-off between the energy consumption and the execution time.

Input identifiers
DTA was performed on BT-MZ with two problem classes, ie, B (64 zones) and C (256 zones), where each zone is assigned to a process for parallel execution. Both classes are ultimately defined by their zones, whose workload is unevenly distributed across the X, Y, and Z dimensions.
The READEX tuning plugin performed two different DTA runs for two input identifier files containing the specifications for Class B and Class C.
The READEX plugin executed 38 experiments with 8 MPI processes and 1 OpenMP thread on one node of the Taurus HPC system. Figures 5 and 6 show the trend in the energy consumption for 64 and 256 zones, respectively. The energy consumption in each of the graphs is presented for different settings of the CPU frequency and uncore frequency. As seen, for Class B, the energy consumption increases slowly from lower to higher CPU frequency settings, while it follows quite an irregular trend for Class C. This indicates that the work distribution of zones across processes is not the same. Thus, the best configuration for Class B is {CPU_Freq=1.2, Uncore_Freq=1.0} and for Class C is {CPU_Freq=1.3, Uncore_Freq=2.0}.
Thus, an input identifier for the class size makes it possible to distinguish the two situations in which different best configurations are obtained. This knowledge helps in the generation of a better tuning model by including information about the input identifier and thus returning the correct best configuration at runtime.
Figure 6. Energy consumption of BT-MZ for Class C for different CPU frequency and uncore frequency settings during DTA

Application Tuning Parameters
This section demonstrates the tuning potential and a comparison of the energy consumption obtained after using ATPs for ESPRESO.
Using a simplified approximation, we can state that, among the preconditioners listed in Section 6, the more computationally demanding the preconditioner is, the more numerically efficient it is, ie, the more it reduces the number of iterations needed to solve the problem. In ESPRESO, we can dynamically switch between any of these at runtime. If a preconditioner is not used, one iteration contains an action of a FETI operator (cost is 30.9 J and 0.12 s) and an application of a projector (cost is 0.7 J and 0.005 s). If a preconditioner is used, each iteration contains one more projector application in addition to the preconditioner action.
We evaluated a projected conjugate gradient (PCG) solver with the preconditioners on a structural mechanics (linear elasticity) problem with 2.3 million unknowns on a single compute node using 24 MPI processes. The results in Table 2 show that the solution can be reached in 5.46 s when using the Light Dirichlet preconditioner, despite the fact that it needs more iterations than the Dirichlet preconditioner. The Light Dirichlet preconditioner saved 15.9 s and 4091.5 J in comparison to solving the problem without any preconditioner.
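The per-iteration cost model described above can be written down as a small sketch. The FETI operator and projector costs are the values given in the text; the preconditioner cost and the iteration counts in the example are hypothetical placeholders, not the values from Table 2.

```python
from typing import Optional, Tuple

# Per-action costs stated in the text: (energy in J, time in s).
FETI_OPERATOR = (30.9, 0.12)
PROJECTOR = (0.7, 0.005)


def iteration_cost(preconditioner: Optional[Tuple[float, float]] = None) -> Tuple[float, float]:
    """Energy (J) and time (s) of one solver iteration.

    Without a preconditioner: one FETI operator action and one projector application.
    With a preconditioner: one additional projector application plus the
    preconditioner action, whose (energy, time) cost is passed in.
    """
    energy = FETI_OPERATOR[0] + PROJECTOR[0]
    time = FETI_OPERATOR[1] + PROJECTOR[1]
    if preconditioner is not None:
        energy += PROJECTOR[0] + preconditioner[0]
        time += PROJECTOR[1] + preconditioner[1]
    return energy, time


def solve_cost(iterations: int,
               preconditioner: Optional[Tuple[float, float]] = None) -> Tuple[float, float]:
    """Total energy and time for a given number of iterations."""
    energy, time = iteration_cost(preconditioner)
    return iterations * energy, iterations * time


# Hypothetical comparison: the preconditioner makes each iteration more
# expensive but is assumed to cut the iteration count from 160 to 40.
plain = solve_cost(160)
preconditioned = solve_cost(40, preconditioner=(5.0, 0.02))
print(f"no preconditioner:   {plain[0]:.1f} J, {plain[1]:.2f} s")
print(f"with preconditioner: {preconditioned[0]:.1f} J, {preconditioned[1]:.2f} s")
```

The sketch makes the trade-off explicit: a preconditioner pays off only if the reduction in the number of iterations outweighs the higher cost of each iteration.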
ESPRESO can switch not only the preconditioners but also the iterative solver methods at runtime. The presented results show good trends and savings but are not generally applicable to every problem that can be solved by ESPRESO. Different physics, different problem sizes, and material properties will affect the savings that can be achieved. For instance, the domain decomposition will play a more significant role for larger problems.
This paper first presents the domain knowledge specification for the application structure, namely, the region and phase specifications. The specification of a user region makes it possible to aggregate overly fine-grained regions for intra-phase tuning, ie, for exploiting the dynamism between rts's within the phase.
The phase specification makes it possible to explore inter-phase dynamism, ie, dynamism between phases, in addition to the intra-phase dynamism of the application. Region identifiers are used to define specific characteristics of a computation and can be used to enhance the tuning model by distinguishing more rts's. Input identifiers characterize different input behaviors and enable DTA to generate a global merged application tuning model. The Domain Knowledge Specification Interface also enables the application expert to specify new tuning parameters via the ATP library. DTA then takes these tuning parameters into account and creates a more extensive search space for energy tuning. This paper demonstrates a 2.2% dynamic energy saving using region identifiers for the MG benchmark, since DTA is able to distinguish more rts's than usual. Trends in the energy consumption for two different input identifiers specified by Class B and Class C for the BT-MZ benchmark are also discussed. Finally, the performance improvements are assessed with respect to energy consumption and runtime for ESPRESO by using different application tuning parameters such as preconditioners, iterative solver methods, and domain decompositions. For the preconditioner ATP, the Light Dirichlet preconditioner shows a 74.37% saving in energy consumption with respect to other preconditioners. Similarly, using the ORTH_PCG_CP solver method and 192 elements per domain, the application expert saved 25.8% and 22.7% energy in comparison to the default (PCG) solver and 24 elements per domain, respectively.

CONCLUSIONS
As of today, the major challenge in HPC research is reducing energy consumption for Exascale computing. This paper has given an overview of the READEX project, which aims to improve the energy efficiency of HPC applications through a dynamic tuning approach. Furthermore, it emphasizes that the domain knowledge specification approach introduced in the READEX methodology does result in dynamic savings. Domain knowledge complements the auto-tuning approach and yields an enhanced tuning model that guides the dynamic switching of the tuning parameters. Application owners can provide their expert knowledge about the application domain by specifying identifiers and application-level tuning parameters that significantly increase the tuning potential. We present some preliminary results of different domain knowledge aspects and evaluate their impact on the tuning process.