SimAnMo—A parallelized runtime model generator

In this article, we present the novel features of the recent version of SimAnMo, the Simulated Annealing Modeler. The tool creates models that correlate the size of one input parameter of an application to the corresponding runtime, and thus SimAnMo allows predictions for larger input sizes. A focus lies on applications whose runtime grows exponentially in the input parameter size. Such programs are, for example, of high interest in cryptanalysis for analyzing the practical security of traditional and post-quantum secure schemes. However, SimAnMo also generates reliable models for the widespread case of polynomial runtime behavior and for the important case of factorial runtime increase. SimAnMo's model generation is based on a parallelized simulated annealing procedure and heuristically minimizes the costs of a model. These costs may be based on different quality metrics. Insights into SimAnMo's software design and its usage are provided. We demonstrate the quality of SimAnMo's models for different algorithms from various application fields. We show that our approach also works well on ARM architectures.

constant runtime. Detailed knowledge of those sub-algorithms' cost development increases the understanding of the behavior of the overall program.
We call a value of the input parameter p a feasible choice if that value allows us to solve the problem in an acceptable average runtime and with the hardware resources of our test systems.
In this article, we demonstrate SimAnMo's applicability for all mentioned types of runtime behavior. For programs with exponential runtime behavior, we focus on algorithms from the cryptanalysis context, where exponential runtime behavior often occurs (cf. Section 5.3). The use-cases for polynomial and factorial runtime are well-known basic problems, for example, dense linear algebra and combinatorial problems, including a parallel, heuristic algorithm (cf. Sections 5.2 and 5.4).

Linear-logarithmic approach for exponential runtime behavior
A common approach to create models for the runtime of programs with exponential runtime behavior is the following: one measures the runtime for feasible values of p, takes the binary logarithm of the measured times, and fits a linear function c_0 + c_1 · p to the transformed data by linear regression. Theoretically, one could then estimate the runtime for higher values of p by evaluating 2^(c_0 + c_1 · p). But Ducas4 and Albrecht et al.5 advise not to employ extrapolations of their data for predictions of higher order. However, we think that it is important to get insights into the behavior of the most advanced algorithms to understand the practical hardness of the underlying mathematical problems. For example, Reference 6 (Fig. 13) demonstrates that different theoretical estimates for the runtime of a specific problem may differ by many orders of magnitude.
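The lin-log pipeline just described can be condensed into a short sketch. This is a minimal illustration, not SimAnMo code: it fits c_0 + c_1 · p to the binary logarithms of measured runtimes by ordinary least squares and extrapolates the model 2^(c_0 + c_1 · p).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Least-squares fit of log2(runtime) = c0 + c1 * p, the classic lin-log
// approach: the exponential model 2^(c0 + c1*p) is then evaluated at larger p.
std::pair<double, double> linlog_fit(const std::vector<double>& p,
                                     const std::vector<double>& t) {
    double n = static_cast<double>(p.size());
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t m = 0; m < p.size(); ++m) {
        double y = std::log2(t[m]);  // linearize the exponential data
        sx += p[m]; sy += y; sxx += p[m] * p[m]; sxy += p[m] * y;
    }
    double c1 = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    double c0 = (sy - c1 * sx) / n;
    return std::make_pair(c0, c1);
}

// Extrapolated runtime prediction of the lin-log model.
double linlog_eval(double c0, double c1, double p) {
    return std::pow(2.0, c0 + c1 * p);
}
```

Note that the fixed, linear order of p in the exponent is exactly what limits this approach for runtimes that grow super-exponentially, as the motivating example below shows.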

Motivating example
We start with an illustrative example to highlight the properties of programs with exponential runtime behavior, the drawbacks of the lin-log approach and the quality of the models we create with SimAnMo.
The runtime of the Schnorr-Euchner enumeration algorithm (cf. Section 5.3.0.6) is expected to grow as 2^Θ(n²), with n being the dimension of the enumerated lattice.9 As an example, we assume that the dependence between the runtime of a program ENUM implementing the enumeration and its input parameter n follows the function:

g_ENUM(n) = 0.4 + 10^-3 · 2^(0.01 · n²).  (1)

The offset 0.4 emulates a constant initialization time for the program and the coefficients 10^-3 and 0.01 are chosen arbitrarily. The only prerequisite for them is that the result values of g_ENUM(n) lie in a range that is representable by double precision numbers for the values of n considered.
We evaluate g ENUM (n) for n ∈ {20, 21, 22, 23, 24, 25} since in this region the resulting theoretical runtime is still short but the effect of the constant 0.4 does not dominate the actual behavior. The resulting values are represented by the diamonds in Figure 1.
Afterward, we create a model with the lin-log approach by linear regression, shown in orange. Our alternative model created by SimAnMo is shown in blue.
The comparison in Figure 1 shows that SimAnMo's model nearly coincides with the diamonds. In contrast, the lin-log model lies between the points since it lacks the flexibility to follow their curvature. This mainly results from the fixed order of 1 for n in the exponent −2.05 + 0.038 · n^1, a natural limitation of the lin-log approach.
In the next step, we extrapolate the model function to higher values of n and compare them to the values resulting from evaluating Equation (1) at the respective n-values. The results are visualized in Figure 2. For the y-axis, we employ a logarithmic scale.
FIGURE 1 Models for g_ENUM(n) created from the training points resulting from Equation (1)
FIGURE 2 Prediction and evaluation for g_ENUM(n) models
We see a huge difference between the models considered. The SimAnMo model fits the additional points quite exactly, while the lin-log model underestimates the values by more than 200 orders of magnitude. The coefficients in the diagram are rounded to two significant digits. The difference between our model and the real data point is smaller than 7% for n = 250.
A second lin-log model with suffix "(m)" is shown dashed. It is based on training points for 20 ≤ n ≤ 100 in steps of 5. Although the model performs better, it underestimates the real value at n = 250 by more than 100 orders of magnitude. Again, the fixed order of n prevents an acceptable fit to the extended training data, resulting in the wrong prediction.
In addition, we also assess the quality of two further models, labeled Exp and Exp (m). They are so-called exponential models of the form f(p) = c_0 + c_1 · 2^(k·p) with c_0, c_1, k ∈ R, and they are based on the same training data. We discuss them in Section 3.1.
This example suggests the accuracy of our models and highlights the lack of flexibility of the linear-logarithmic approach for a real algorithm.
Our contributions are:
1. We propose a new type of modeling functions for exponential runtime behavior.
2. We demonstrate SimAnMo's extensibility and flexibility by integrating and evaluating factorial runtime models.
3. We show that the resulting predictions of the models are more accurate when the target is to minimize the mean absolute percentage error (MAPE) instead of the standard residual sum of squares (RSS).
4. We highlight the ease-of-use of SimAnMo.

RELATED WORK
Performance modeling is mainly employed to analyze and improve the performance of an application on a specific hardware platform. Besides analytical models, the most famous approach is the Roofline performance model10 as a combination of the real system behavior (memory bandwidth) and knowledge of the code or the instructions it generates on a machine (number of FLOPs related to the number of memory accesses, i.e., computational intensity). It gives an upper bound for the reachable performance on a given architecture. The Roofline model was extended to take the cache hierarchy into account, for example, by Ilic et al.11 Those cache-aware models were successfully applied to applications on large single compute nodes12 or GPGPUs.13 Another approach for Roofline models on GPUs was presented by Ding and Williams.14 Their Instruction Roofline methodology takes the various cache hierarchies of NVIDIA GPUs into account and considers more metrics than the computational intensity, like strided memory accesses or instruction throughput. Two other extensions to the Roofline model were proposed by Lorenzo et al.,15 which consider multiple measurements during different stages of the program execution and include the memory latency in the model.
A different approach is empirical performance model generation. Such models can, for example, be based on performing benchmarks with the target applications and employing machine learning16 to generate the models. Another method is based on small-scale experiments and on automatically extrapolating models from the results obtained. These techniques are applied in the tool Extra-P,17,18 which was created to assist developers of parallel programs in identifying scalability bugs by creating performance models of critical regions of the code. This is, in particular, important for the optimal usage of HPC systems. The approach is hardware agnostic in the sense that the modeler does not require knowledge about system parameters like the memory bandwidth, the configuration of the cache hierarchy, or the GPU architecture employed. Also, no detailed knowledge about properties of the code, like the number of memory writes in specific parts, is required.
In this article, our goal is not to analyze or improve the performance of applications by performance models on a given system but to generate models which estimate the runtime of a given algorithm with exponential runtime behavior for problem sizes which are infeasible to solve. It was already demonstrated that parts of the methodology of Extra-P can be employed to generate reliable models for the runtime of programs with polynomial runtime behavior 19 and that those models can be generated with low effort.

MODEL GENERATION METHODOLOGY
In this section, we first present the approach to automatically generate runtime models from training data. It is followed by the definition of different metrics to evaluate the quality of models. Afterward, we present all types of models integrated into SimAnMo and finally discuss SimAnMo's software design.

Model generation
SimAnMo is based on principles of Extra-P.20 There, functions are generated from profiled runtime data. These functions describe the relation of the runtime of different code regions to different variables, for example, the size of the input data. All models of Extra-P are of the form shown in Equation (2):

f(p) = c_0 + Σ_{k=1}^{n} c_k · p^{i_k} · log2^{j_k}(p),  (2)

where p is the input parameter whose influence is considered and c_0 a constant term. Reisert et al.21 have shown that in most cases n = 1 is sufficient to achieve a good result. That means the model looks like:

f(p) = c_0 + c_1 · p^i · log2^j(p),  (3)

and, in particular, those single-term models are even the most accurate ones. When c_0 = 0 and c_1 = 1, the remaining part of the model is called a constant model. Its value depends on the other coefficients i and j as well as the value of p.
Appropriate values for the parameters c_0, c_1, i, and j ∈ R are determined in a two-step approach. The exponents i and j are set by iterative refinement. After i and j are chosen, c_0 and c_1 are set such that a given quality metric is minimized for the already fixed exponents. This means c_0 and c_1 are set in order to optimize the fit of the constant model to the training data points.
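For illustration, the single-term model f(p) = c_0 + c_1 · p^i · log2^j(p) and its constant model part p^i · log2^j(p) can be evaluated as follows. This is a sketch for clarity, not SimAnMo's actual implementation:

```cpp
#include <cassert>
#include <cmath>

// Constant model part of a single-term pol-log model: p^i * log2(p)^j.
// This is the model with c0 = 0 and c1 = 1, as described in the text.
double const_model(double p, double i, double j) {
    return std::pow(p, i) * std::pow(std::log2(p), j);
}

// Full single-term pol-log model: f(p) = c0 + c1 * constModel(p).
double pollog_eval(double p, double c0, double c1, double i, double j) {
    return c0 + c1 * const_model(p, i, j);
}
```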
An experimental version of Extra-P also allows creating models of programs with exponential runtime behavior. The shape of those model functions is:

f(p) = c_0 + c_1 · 2^(k·p),  (4)

and we call those exponential models from our extended Extra-P version exp models in the following. Independent of the model function employed, the workflow to create a model is the same. In the first step, so-called training data is generated. It is represented by profiled runtime measurements with l varying p-values. In the second step, Extra-P's model generator determines the best model that describes the training data points by regression, cross-validation, and iterative refinement (cf. Calotoiu et al.20,21 for details). The regression minimizes the residual sum of squares (RSS) between the training data points and the model function, which is defined as:

RSS = Σ_{m=1}^{l} (y_m − f_m)^2,  (5)

where y_m is the mth measurement and f_m is the evaluation of the model function f at the p-value corresponding to the mth measurement.
The RSS is the standard least squares residual to evaluate the quality of a model. If RSS = 0, then y_m = f_m for all m and each point of the training data lies directly on the fitting curve.
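As a concrete illustration, the RSS defined above amounts to a few lines of code (a sketch, not SimAnMo's internal routine):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Residual sum of squares between the l measurements y_m and the
// model evaluations f_m: RSS = sum over m of (y_m - f_m)^2.
double rss(const std::vector<double>& y, const std::vector<double>& f) {
    double s = 0.0;
    for (std::size_t m = 0; m < y.size(); ++m) {
        double r = y[m] - f[m];
        s += r * r;
    }
    return s;
}
```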

Quality assessment
SimAnMo is also based on collecting training data points and then creating a model function which minimizes a given cost function. Since one drawback of RSS is that the errors are summed up, that is, adding more measurements always increases the RSS value, SimAnMo supports several different quality metrics to analyze the quality of fit with the training data.
For our own metric anRSS, we start from the absolute normalized RSS (nRSS). This was presented by Ilyas et al.21 and is defined as:

nRSS = √RSS / ȳ,  (6)

with ȳ being the mean value of all y_m. From the nRSS, we define the average normalized RSS (anRSS) over the number of points:

anRSS = nRSS / l.  (7)

The anRSS represents the mean value of the error per measurement point. The anRSS also normalizes the RSS to the scale of the measurement values and hence is a good indicator for the quality of the model.
However, SimAnMo also integrates two well-established metrics. A metric frequently employed to determine the quality of the fit is R². We use the following definition:

R² = 1 − RSS / Σ_{m=1}^{l} (y_m − ȳ)².  (8)

R² normally lies between 1, indicating no difference between the model and the points, and 0. In no case in this article does R² lie outside this range.
Another common metric is the root-mean-square error (RMSE). It is often employed to rate the predictive power of a generated model. The RMSE is defined as:

RMSE = √(RSS / l).  (9)

It is on the same scale as the data to be analyzed. However, it is relatively sensitive to outliers.22 One important point about all metrics mentioned above is that points with higher values have a higher influence on their overall value.
Consequently, for the generation of the model, points with higher p-values have a higher influence on the model's shape.
To remove this property, there exists the mean absolute percentage error (MAPE) metric:

MAPE = (100 / l) · Σ_{m=1}^{l} |y_m − f_m| / |y_m|.  (10)

It is a relative measure of the overall error in which the individual relative errors of all points are weighted equally. In that way, large relative errors at points p with low y-values have the same influence on the overall metric as large relative errors at points p with high y-values. Because of these properties, we expect, in general, the MAPE to be higher than the anRSS for the same model. The ideal value for both metrics is 0.
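A direct transcription of the MAPE definition above shows the equal weighting: a 10% error contributes the same amount regardless of the magnitude of y_m (again a sketch, not SimAnMo's code):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean absolute percentage error: each point contributes its relative
// error, so small-y and large-y points are weighted equally, unlike RSS.
double mape(const std::vector<double>& y, const std::vector<double>& f) {
    double s = 0.0;
    for (std::size_t m = 0; m < y.size(); ++m)
        s += std::fabs(y[m] - f[m]) / std::fabs(y[m]);
    return 100.0 * s / static_cast<double>(y.size());
}
```

In the example in the test, a 10% error on y = 100 and a 10% error on y = 1 yield a MAPE of 10, whereas the RSS would be dominated almost entirely by the larger point.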
In general, the training data has to be carefully collected. This means that no side-effects for small p, like initialization procedures or reading data from the file system, may dominate the actual runtime of the algorithm in that range.

Integrated model types
SimAnMo contains three types of model functions: polynomial logarithmic (pol-log) models (cf. Section 3.3.1), which also cover the special case of constant runtime; exponential polynomial (exp-pol) models to cope with exponential growth of the runtime (cf. Section 3.3.2); and factorial runtime models (cf. Section 3.3.3).

Polynomial logarithmic models
Like Extra-P,20 SimAnMo can model polynomial runtime behavior. Following the results of Reisert et al.,23 the shape of those models is given by Equation (11):

f(p) = c_0 + c_1 · p^i · log2^j(p).  (11)

Hence, the values of c_0, c_1, i, and j are determined by iterative refinement and by minimizing the costs of the constant model after i and j are set.
Since this case was discussed by Burger et al. 19 we omit the details.

Exponential polynomial models
We propose a new shape for modeling functions for exponential runtime behavior in Equation (12):

f(p) = c_0 + c_1 · 2^(k·p^i), with c_0, c_1, k, i ∈ R.  (12)

The two main ideas behind it are: (1) The enumeration algorithm we already considered demonstrates that the variable in the exponent may grow faster than linearly (quadratically in that case). (2) Previous work about the LLL lattice reduction algorithm19 shows that the practical behavior of programs with polynomial runtime can be accurately modeled by polynomials with a real exponent, although the complexity class of the underlying algorithm is given by integer exponents. For example, the complexity class of an algorithm ALG1 may be given by Θ(p³), but the model describing its behavior best is α_1 · p^2.61 and not α_2 · p^3 or α_3 · p^2 for α_1, α_2, α_3 ∈ R.
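Assuming the exp-pol shape f(p) = c_0 + c_1 · 2^(k·p^i) (our reading of Equation 12), evaluating such a model is straightforward. With c_0 = 0.4, c_1 = 10^-3, k = 0.01, and i = 2, this reproduces the g_ENUM function of the motivating example:

```cpp
#include <cassert>
#include <cmath>

// Exponential-polynomial model f(p) = c0 + c1 * 2^(k * p^i): in contrast
// to the lin-log form, the exponent may grow faster than linearly in p.
double exppol_eval(double c0, double c1, double k, double i, double p) {
    return c0 + c1 * std::pow(2.0, k * std::pow(p, i));
}
```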
We again consider Figures 1 and 2. The green curve shows the exp model as used in the extended Extra-P version, based on Equation (4). This type of model is also created by SimAnMo, which we extended with exp models following the methods described in Section 4.3. In that way, SimAnMo automatically determines the coefficients. Figure 1 demonstrates that the exp model performs better than the lin-log model: its graph coincides with our novel model and hits all the training points. However, the value of its prediction for n = 250 is too low by 150 orders of magnitude, as observable in Figure 2.
The dashed lines show a lin-log and an exp model, respectively, which employ n ∈ {20, … , 100} as training data. Although those training points would also cause infeasible runtimes and the resulting models perform better, they do not give reasonable predictions for n = 250. Their values are too low by 103 and 70 orders of magnitude, respectively. In contrast, our exp-pol model, employing the small training data range, correctly predicts the function's behavior.

Factorial models
Compared to the SimAnMo version presented by Burger et al.,24 we also integrated a new type of model: it is capable of representing factorial runtime behavior. This means that the growth of the runtime is mainly directed by the term p!. An example for factorial runtime behavior is finding the permutations of a data set or a list. Another example is the traveling salesman problem (TSP). It consists of searching for a tour in a graph which visits all nodes exactly once while minimizing the overall length of the tour. This problem occurs, for example, when routing vehicles or in chip design. The shape of those factorial models is given in Equation (13):

f(p) = c_0 + c_1 · p! · p^i · log2^j(p).  (13)

The factorial function p! mandates p to only take integer values. The terms p^i · log2^j(p) allow more flexibility for fitting the training data. Their insertion is motivated by our previous results from modeling exponential runtime behavior.24 Test cases based on these models which justify this decision are evaluated in Section 5.4. The rapid growth of the factorial term, taken together with the results of Figure 2, suggests that a lin-log approach fails even more severely in the case of factorial behavior.
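Assuming the factorial-model shape f(p) = c_0 + c_1 · p! · p^i · log2^j(p) (our reading of Equation 13), a sketch of its evaluation looks as follows. The factorial is computed with 64-bit integers, as SimAnMo does internally, which overflows for p > 20 and thus bounds the usable p-values:

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// p! with 64-bit unsigned integers; the result exceeds the 64-bit
// range for p > 20, which limits the representable training points.
std::uint64_t factorial(unsigned p) {
    std::uint64_t r = 1;
    for (unsigned v = 2; v <= p; ++v) r *= v;
    return r;
}

// Factorial model: the p^i * log2^j(p) terms add flexibility around
// the dominating p! growth.
double fac_eval(double c0, double c1, double i, double j, unsigned p) {
    return c0 + c1 * static_cast<double>(factorial(p))
              * std::pow(static_cast<double>(p), i)
              * std::pow(std::log2(static_cast<double>(p)), j);
}
```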

THE SimAnMo FRAMEWORK
SimAnMo is written in C++11 and is open-source software.* It contains features to generate model functions, to visualize the fit to the training data, and to assess the quality of the prediction if additional measurement points are available. The overall process to create the models is semi-automatic. The training data can be generated via scripts. Afterward, the user has to decide which training points are the input for the model generator. Also, the model type (pol-log, exp-pol, fac) has to be selected. The rest is performed automatically by SimAnMo without further user intervention. The final result is a PDF file with the diagrams in the same manner as they are shown in this article. The report also contains the values of the quality metrics employed. The printed digits are limited to a reasonable number; internally, all calculations are performed with at least double precision. Additionally, the development of the quality of the model during the iterative process can be visualized.

Software architecture
SimAnMo can be employed as a stand-alone console application or alternatively as a static library. The behavior of the application is configured via more than thirty command line parameters which are described in the documentation. This includes the configuration of the models themselves, like their type or the possible range of their coefficients. Furthermore, users can configure the simulated annealing process, for example, the starting temperature, the number of steps per temperature value, or the number of threads employed. The only mandatory parameter is, however, the path to the input file with the training data and the optional measurement points. The remaining parameters are experimentally set to default values which result in reasonable behavior in all tested cases. The mandatory input file has to be structured as shown in Listing 1.
(p-value of first training data point; its runtime)
(p-value of second training data point; its runtime)
...
#ADDPOINTS
(p-value of first measurement point; its runtime)
(p-value of second measurement point; its runtime)
...

Listing 1: Layout of a SimAnMo input file
This structure indicates that SimAnMo can create other models than those relating the size of one input parameter to the resulting runtime of a program. In principle, every domain in which the relation of the value p to the modeled value y can be expressed by one of the available types of models f(p) is a possible field of application. An additional model type f(p) for the respective domain can also be added.
In its default configuration, SimAnMo searches for a polynomial-logarithmic model that minimizes the RSS metric of the training points in the pre-defined coefficient range. If the quality of the model is not sufficient, the best fitting exponential-polynomial model is searched and returned if it is better.
The library version of SimAnMo is mainly based on the function SimAnMo::findModel. It returns a SimAnMo::FunctionModel object. This object contains all the necessary information, like the resulting model, its type and its cost.
SimAnMo::findModel has three parameters as shown in Listing 2. The first two, of type std::map<double, double>&, provide the pairs of p-values and runtimes of the training data points and the measurement points, respectively. The third, of type std::string, corresponds to the input parameters of the console application.
SimAnMo::FunctionModel SimAnMo::findModel(
    std::map<double, double>& training_points,
    std::map<double, double>& measurement_points,
    std::string options);

Listing 2: Usage of the SimAnMo library

Calculating the models
We extended the previous version of the code presented by Burger et al.19 with the capability to search for exponential models. They possess the shape of Equation (12). We also added factorial models following Equation (13). Due to SimAnMo's software design, the main code structure remains the same when adding new types of models. The quality of the fit of the model to the training data can be assessed by two metrics: the standard RSS and the MAPE, which was also newly integrated into the framework.
MAPE-based models can thus be compared, for example, to RSS-based ones, as RSS was used for the programs with polynomial runtime behavior.19 In fact, the RSS is the standard metric to minimize within the Extra-P application and the traditional lin-log approach. For the other models in this article, we stick to the MAPE in Sections 5.2 and 5.4.
SimAnMo's model generation approach is based on parallelized simulated annealing. As discussed in Section 3.1, models are generated in a two-step approach: first by iterative refinement, and second by setting the values of c_0 and c_1 to minimize the cost metric between the training data points and c_0 + c_1 · constModel(p_m), 1 ≤ m ≤ l, that is, the respective model values. Both steps are implicitly performed in each simulated annealing step.
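The two-step scheme inside the annealing loop can be condensed into the following sketch. It is a simplified, sequential stand-in for SimAnMo's parallelized procedure: the one-dimensional parameter x plays the role of the refined exponents, and the cost callback stands in for fitting c_0, c_1 and evaluating the quality metric. All constants (seed, cooling factor, step width) are illustrative choices, not SimAnMo's defaults.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <random>

// Minimal simulated-annealing skeleton: each step perturbs the
// exponent-like parameter x (iterative refinement) and accepts worse
// candidates with a temperature-dependent Metropolis probability.
double anneal(const std::function<double(double)>& cost, double x0,
              double t_start = 10.0, int steps_per_temp = 50) {
    std::mt19937 gen(42);  // fixed seed keeps the sketch reproducible
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::normal_distribution<double> perturb(0.0, 0.1);
    double x = x0, best = x0;
    for (double t = t_start; t > 1e-4; t *= 0.9) {      // cooling schedule
        for (int s = 0; s < steps_per_temp; ++s) {
            double cand = x + perturb(gen);             // random neighbor
            double d = cost(cand) - cost(x);
            if (d < 0 || u(gen) < std::exp(-d / t))     // Metropolis rule
                x = cand;
            if (cost(x) < cost(best)) best = x;         // track best model
        }
    }
    return best;
}
```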

4.2.1 Iterative refinement

Iterative refinement is performed within simulated annealing, where randomized neighboring models of the current model are generated. To that end, the coefficients within the constant model part, that is, i, j, and k depending on the model type, are modified by a small randomized adjustment.
Boundary checks limit the acceptable values of i, j, and k. The maximum allowed range of the changes was determined experimentally.

4.2.2 Determining c_0 and c_1

SimAnMo provides two different approaches for setting c_0 and c_1, depending on the cost metric employed. The first, RSS-based approach is comparatively cheap, since minimizing the RSS for fixed exponents reduces to a linear least-squares problem in c_0 and c_1. The second approach, which determines c_0 and c_1 for MAPE, is more compute intensive. The main idea is to minimize the MAPE equation over the choices of c_0 and c_1. This is done with the BLEIC (boundary, linear equality-inequality constraints) algorithm of the ALGLIB library.26 The underlying optimization algorithm is a nonlinear conjugate gradient method. It also requires the gradient of the function to minimize, which is calculated by analytical differentiation of this function. Due to its complexity, this second approach may be less stable than the RSS-based variant. For the conjugate gradient method, an adequate stopping criterion is required. At the moment, this is realized by setting the EpsX parameter in the call of BLEIC's minbleicsetcond function to 10^-9. That means the conjugate gradient algorithm stops when the Euclidean norm ||v|| of the scaled step vector v is smaller than EpsX = 10^-9.
On average, the computation with MAPE takes about ten times as long as the RSS-based variant. For example, the models generated in Section 5.4 require between 4 and 7 s in the RSS-based case and 60 to 75 s in the MAPE-based case. However, tuning the stopping criterion for the conjugate gradient can reduce this difference. Another possibility to improve the performance and to increase the stability is the function minbleicsetscale, which sets the scaling of the variables. The scaling is set to [1, 1] by default. This is not optimal since, in particular in the case of exponential and factorial runtime behavior, the sizes of the variables differ by several orders of magnitude.
Switching between the RSS and the novel MAPE minimization is realized by just adding one command line parameter.

Optimizations for simulated annealing
We also added two heuristics to the annealing process to improve the convergence and the quality of our models: (1) Backtracking: Each thread records whether the modified solution generated in the current iteration is better than the former solution. If q_1 successive solutions (the parameter can be set by the user), starting from the last solution S_imp which resulted in an improvement in iteration t, are worse than S_imp, the actual solution S_act in iteration t + q_1 is reverted to S_imp. The overall model quality reacts very sensitively to changes of the values in the exponent. There are apparently many dead ends for the annealing process from which it can hardly escape, and backtracking simplifies escaping them.
(2) Adaptive random modifications: Depending on the overall allowed search space for each coefficient and its current value, SimAnMo adapts the extent of the modifications to this coefficient when generating random neighboring solutions. Additionally, for exp-pol models (Equation 12), the values for k and i are both modified at once since systematic experiments showed that this method improves the convergence behavior.
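Heuristic (1) can be illustrated with a small, hypothetical sequential sketch; SimAnMo's real implementation keeps this bookkeeping per thread and operates on full model objects rather than bare costs:

```cpp
#include <cassert>
#include <vector>

// Backtracking sketch: if q1 successive candidates since the last
// improving solution S_imp are all not better, revert the current
// solution S_act to S_imp. A "solution" is reduced to its cost here;
// candidate costs arrive one per annealing step.
std::vector<double> run_with_backtracking(const std::vector<double>& candidates,
                                          double initial, int q1) {
    std::vector<double> trace;  // accepted cost after each step
    double s_act = initial, s_imp = initial;
    int worse = 0;
    for (double cand : candidates) {
        s_act = cand;
        if (s_act < s_imp) { s_imp = s_act; worse = 0; }      // improvement
        else if (++worse == q1) { s_act = s_imp; worse = 0; } // revert
        trace.push_back(s_act);
    }
    return trace;
}
```

In the test, after three non-improving steps (q_1 = 3) the solution is reverted to the last improving one with cost 5.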

Integrating new model types
Integrating a new model type into SimAnMo is simplified through a uniform interface. Two basic constructors for the model object are required. The first one is the default constructor returning an empty solution. The second constructor takes given values for c_0, c_1, i, j, and optionally k and creates a model for those. A third, additional constructor generates a random solution based on the input data. Its costs must already lie in a feasible range.
Those models are the starting point for the simulated annealing procedure.
Additionally, a routine is required that slightly and randomly modifies a given model for the generation of a random neighbor within simulated annealing. This routine must also check that the returned model is still feasible. Furthermore, two routines to evaluate the model and the constant model at a position p must be provided. For the optional output of the results, SimAnMo requires methods to represent the model as a string and as a LaTeX formula.
For the case of factorial models, these are fewer than 400 lines of code, and fewer than 250 when excluding the printing functionality. The main challenge for all integrated model types is the generation of a random feasible solution and the adjustment of the appropriate extent of the random modifications.

Numerical aspects
Most internal calculations concerning the quality metrics are performed with double precision. The only exception is the calculation of p!, which is done with 64-bit integers. To avoid numerical problems, the following measures are taken: 1. We limit the range of the coefficients c_0 and c_1 in Equations (11)-(13). This avoids that they become arbitrarily small or arbitrarily large and that the models overfit. The default limits are 10^-11 and 10^40, respectively, which worked for all cases we considered. Hence, we are confident that those experimentally defined bounds will also work well for other models. If a value is smaller or larger than the limits, it is set to the respective limit. Then, the costs are re-calculated before returning the model. Additionally, a warning code is returned by the routine which determines c_0 and c_1 to indicate the overflow.
2. When c_0 and c_1 are to be determined with the MAPE approach, the coefficients are first set with the RSS approach. In case a warning is returned (as explained above), the MAPE approach is stopped, because the chance is very high that the conjugate gradient will not converge or will return a very bad result. An error is returned that signals the caller that for this choice of i, j, and k no adequate c-values can be found and that i, j, and k have to be changed.
3. When modifying or randomly creating a solution, the annealing process checks whether its costs are feasible before taking the solution into account. Feasible means in this context that the RSS or the MAPE are not allowed to be many orders of magnitude higher than the (squared)
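The coefficient clamping of measure 1 can be sketched as follows; the warning flag mirrors the warning code mentioned above, and the default bounds are the 10^-11 and 10^40 stated in the text:

```cpp
#include <cassert>

// Clamp a coefficient into the allowed range; out-of-range values are
// set to the nearest bound and a warning is flagged so the caller can
// re-calculate the costs or abort the MAPE approach.
double clamp_coeff(double c, bool& warn,
                   double lo = 1e-11, double hi = 1e40) {
    if (c < lo) { warn = true; return lo; }
    if (c > hi) { warn = true; return hi; }
    return c;
}
```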

EVALUATION
In this section, we evaluate the quality of SimAnMo's generated models for the three different types of models presented in Section 3.3. We start with the polynomial logarithmic models. Afterward, we discuss the exponential polynomial models beginning with a test suite of synthetic tests.
Those provide the advantage that we can control the parameters and know their real behavior a priori. Additionally, four real-world programs with exponential runtime behavior are analyzed. Finally, we discuss the factorial models with one theoretical example and two real-world problems. For all models, we show that our approach also works well on ARM architectures. We assess the quality of the fit to the training data, in general, based on the RSS and the MAPE metrics. In some cases, we additionally employ the anRSS to highlight details. As mentioned in Section 3.2, SimAnMo also calculates the RMSE and R². Since they offer no additional insights in the cases investigated, we omit them for clarity and focus on the other metrics.

Polynomial logarithmic
As demonstrated by Burger et al., 19 SimAnMo creates reliable pol-log models for different implementations of the LLL lattice reduction algorithm with various types of input lattices. The models generated are much more accurate than the existing theoretical models. To underpin the previous results, 19 we evaluate the dense matrix matrix multiplication as one additional application with pol-log runtime behavior.
Because of the number of primitive arithmetic operations, the runtime is expected to grow as Θ(p³). We implemented a naive multiplication without optimization techniques like blocking. As data type, we employ cpp_dec_float_50 from boost::multiprecision. It guarantees a precision of 50 decimal digits (double precision guarantees 15 digits). This choice increases the demand for memory and includes a non-standard data type in the runtime model. We added an OpenMP parallelization of the outermost for-loop. It runs over the rows of the left matrix and the columns of the right matrix. Thus, the model includes the effect of employing multiple threads. We used 12 threads in this case on SystemAMD. The results are summarized in Figure 3. We also performed the matrix-matrix multiplication on the Raspberry Pi 3B. Figure 4 shows the model generated by SimAnMo in relation to the data points. The model again fits the training data p ∈ {135, 140, … , 240} well. It predicts the highest value tested, p = 1150, with an error of 8.8%.
Because of the much smaller cache size of just 512 kB on the Pi 3B, there are far fewer side effects for smaller problem sizes. Hence, already from p = 135 on, the training points form the base for a reliable model. Since the error at the last measurement point is slightly larger on the Pi 3B, the robust model fits the higher points better. However, the difference of 0.14 in the exponent of the polynomial factor is still small.
In both cases of multiplication discussed, SimAnMo reliably predicts values for p which are about five times higher than the last training data point.
This example and those in earlier evaluations 19 highlight that SimAnMo generates reliable pol-log models with comparable or better accuracy than existing solutions.

Exponential polynomial
Here, we demonstrate the superiority of SimAnMo's exp-pol models over two existing approaches: linear-logarithmic regression and the exponential models as employed in Extra-P.

Synthetic tests
We employed a C++ code which performs a fixed series of calculations on randomly chosen elements of a 50,000 × 50,000 matrix filled with random values. The evaluation of the models on the right side of Figure 5 shows that the prediction of the exp-pol model is the most accurate: it underestimates the measurement by less than 20%. At the same time, the predictions of the lin-log model and the exp model are too low by factors of 7 and 3.5, respectively.
Finally, we also employed the Raspberry Pi 4B. We repeated the test Synthetic 1. The model in Figure 6 is generated from the training points p ∈ {138, 139, …, 147} since the runtimes are much higher than on the desktop system.
However, the resulting situation is very similar to the x86 system: the lin-log model considerably overestimates the measurements, while the predicted values of the exp-pol model only slightly overestimate them. For p = 420, the exp-pol prediction is too high by 20%. In contrast, the lin-log model's value is too high by a factor of 20 and the exp model's by a factor higher than 5. Hence, this example again demonstrates that our model generation approach is directly applicable to different architectures.
We summarize that the additional degree of freedom in the exponent resulting from Equation (12) considerably increases the quality of the generated models compared to the exp model. Hence, we do not consider the exp model any further, in particular since it is just a special case of the more generic exp-pol model. The lin-log models, in general, struggle to predict the runtimes of functions where the exponent of p differs from 1.
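The relationship between the two model types can be made concrete by writing down the exp-pol modeling function of Equation (12) directly; the coefficient values used in the sketch below are arbitrary placeholders, not fitted values.

```cpp
#include <cmath>

// Exp-pol modeling function from Equation (12): f(p) = c0 + c1 * 2^(k * p^i).
double exp_pol(double p, double c0, double c1, double k, double i) {
    return c0 + c1 * std::exp2(k * std::pow(p, i));
}

// The plain exp model is recovered as the special case i = 1, which is
// why it is subsumed by the more generic exp-pol model.
double exp_model(double p, double c0, double c1, double k) {
    return exp_pol(p, c0, c1, k, 1.0);
}
```

The extra exponent i is exactly the degree of freedom that lets exp-pol models follow runtimes growing like 2^(k·p^i) with i ≠ 1, which a lin-log fit cannot capture.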

SVP and methodology
Three programs we employ for our further evaluation are so-called SVP solvers. They try to find the shortest non-zero vector in a lattice.6 The basis B determining a lattice consists of d basis vectors of length n, where n is called the dimension of the lattice and d its rank. In our test cases, n always equals d. The SVP is known to be NP-hard under randomized reductions28 and is not expected to be efficiently solvable by quantum computers.29(p. 151) The hardness of the problem and the runtime to solve the SVP grow exponentially with the dimension of the lattice. The growth of the complexity when increasing the dimension is mainly estimated theoretically for the different SVP solving algorithms.30 The input parameter to generate the runtime models is the dimension n of the lattice.
Practically determining a runtime model is complicated by the exponential runtime behavior: training data points can only be generated systematically for low dimensions, and verifying measurements for high dimensions are infeasible. Additionally, the RAM requirements of the sieving algorithms are very high (up to 1 TB). This excludes our ARM system from these tests.
A second challenge is that all SVP solvers are heuristic algorithms. They contain probabilistic elements, for example, sampling random vectors or applying small transformations to lattice bases with random unimodular matrices. Hence, the runtime of each algorithm to solve a problem instance varies for each attempt. This variation can span more than one order of magnitude. 5,31 Thus, at least 20 runs are performed for the same configuration determined by the tuple (algorithm, n, random seed of lattice) and averaged.
The third challenge is that the practical hardness of the problems also varies with the different instantiations possible for each dimension. This includes the type of the random lattice but also different random seeds for the same type.31,32 The input bases we employ come from the Darmstadt SVP challenge.† We employ the lattice generator provided at the challenge homepage.‡

Results for SVP solvers
In the following, we compare the exp-pol models created by SimAnMo with the lin-log model approach on the SVP solver codes. Additionally, we again perform robustness tests.

SubSieve
SubSieve4 considerably reduces runtime and memory consumption compared to previous sieving variants. This is because it only requires solving (n − ℓ)-dimensional (ℓ ∈ ℕ) sub-SVPs to solve the overall n-dimensional SVP.
To generate the training data, we performed 100 runs for each n ∈ {70, 72, … , 80}. The 100 runs were evenly distributed over random lattices with five different random seeds. For each dimension, the average of the runtimes is the training point. We passed the length of the shortest vector to SubSieve to indicate termination. SubSieve is a serial code. Our resulting model is compared to the lin-log model in Figure 7. For the evaluation, we continued the scheme of 100 runs per dimension with five different seeds for each dimension up to n = 90. Additionally, for n = 94, 96, 98, ten runs with seed 0 were performed. For higher dimensions, the runtime is expected to exceed 24 hours in most cases, which hinders a systematic evaluation.
The resulting prediction is shown in Figure 7. In general, the exp-pol model predicts the measurements more accurately than the lin-log approach. However, there is an exception at n = 96. In that case, the problem seems to be rather easy since the time to solution is not much higher than for n = 94. Furthermore, this data point lies outside all possible curves of exponential functions that attempt to fit all data points. The robustness test shows stable coefficients for the model.

G6K
The General Sieve Kernel (G6K) 5 is the fastest known SVP solver. It holds the first places in the Darmstadt SVP Challenge.

F I G U R E 7 Prediction of both SubSieve models
To create a model for G6K, we first generated the training points on SystemHaswell with 24 threads for n ∈ {90, 100, 110, 120, 130}. The G6K code is configured to solve the SVP challenge, that is, to find a vector shorter than the given bound. We chose five different random seeds for each dimension and executed three runs per seed.
For the further measurement points in Figure 9, we rely on the SVP challenge entries related to Albrecht et al.5 and the number of threads given in the hall of fame. We performed a scaling analysis for the code on our system and extrapolated the scaling behavior to the number of cores and threads given in the SVP challenge entries. Implicitly, we use the fact that our system runs at about the same clock speed and that the processors employed show no major architectural differences. So the only variable considered is the number of threads employed for the respective run. In that way, we obtain measurement points for up to dimension 155. There, the time to solution was more than 14 days, and a total CPU time of 1056 days was spent.
This procedure may introduce some errors, and the entries in the hall of fame result from only a single run.
Nevertheless, our comparison of the models and their predictions in Figures 8 and 9 shows that the training data and the extrapolated measurement points follow a sensible pattern. We see that both models underestimate the last two points, n = 153 and n = 155. The prediction of the exp-pol model is too low by a factor of 1.33 for n = 153 and 1.14 for n = 155, respectively. The lin-log model is much further away from the real values and underestimates the runtime by a factor of 11.63 for n = 153 and 11.90 for n = 155, respectively. Thus, the difference in the predictions of both models for the last measurement point n = 155 is already about one order of magnitude, and it quickly grows to several orders of magnitude when further increasing n. Again, the robustness test shows stable coefficients.
F I G U R E 8 Fitting of the G6K models for the training points n ∈ {90, 100, 110, 120, 130}
F I G U R E 9 Prediction of both G6K models

Enumeration with extreme pruning
Our last test case is an SVP solver based on enumeration with extreme pruning. Those algorithms build a search tree for all possible coefficients of the shortest vector. They heuristically cut off large parts of that tree where the chance to find the shortest vector is low.9,31 If the shortest vector was removed, the procedure has to be repeated with a slightly shuffled lattice basis.6 Hence, several heuristics and random elements are included in the algorithm. They result in highly varying runtimes for the same problem when solving it multiple times.31 The implementation employed for our experiments is the progressive BKZ (pBKZ) library presented by Aono et al.6 We employ the first part of the runtime data presented by Aono et al.6 to predict the remaining points. The runtimes for the training data n ∈ {102, 104, 106, 108, 111} and the models created are summarized in Figure 10. The data has an outlier at n = 106. From the diagram, it is not obvious which graph fits the training data better; it is even ambiguous when considering the quality metrics. Figure 11 shows our evaluation. We also include the three points for n ∈ {138, 142, 146} from Reference 6. They are taken from the Darmstadt SVP challenge hall of fame, and we employ them to validate our model against higher dimensions. The lin-log model lies further away from the measurements than the exp-pol model in all cases but n = 112 and n = 123. For n = 123, the exp-pol model predicts a considerably higher runtime than the measurement and the lin-log model. The data suggest that n = 123 was one of the runs which required only few randomizations of the input basis: it is much faster than the run for n = 119 and about as fast as the run for five dimensions less at n = 118. The fact that the lin-log model underestimates all measurements after n = 123 is another indicator for this assumption.
F I G U R E 10 Exp-pol model and the lin-log model for training points n ∈ {102, 104, 106, 108, 111}
In green, we see the model from Aono et al. which was created from the data of their cost simulator for the runtime of progressive BKZ. Simulation points were created for up to n = 150 in steps of five. Their evaluation shows that the simulator gives very good estimates for the runtime of their pBKZ library on SVP challenge problems. The line is just a linear regression to the logarithmized measurement points.
We see that the model which SimAnMo creates from n ∈ {102,104, 106,108, 111} gives a good prediction for the real time to solution up to n = 146.

Factorizing products of prime numbers
Finally, we analyze a different problem from the cryptanalysis context: the factorization of big integer values. To that end, we generate two prime numbers p and q which fulfill several properties recommended for the Rivest-Shamir-Adleman (RSA) cryptosystem.33 We calculate the product n = p ⋅ q and then perform an OpenMP-parallelized brute-force attack which tries to guess the right p. The data type employed is cpp_int from the boost library. The input parameter p is the bit length employed for the independent_bits_engine random number generator within the prime number generator, which is based on the Miller-Rabin primality test. The tests are performed on SystemCascade, and the runtimes for a fixed p have a very high variance. For example, for p = 42, the fastest run is 30 times faster than the slowest. Figure 12 shows that the fit to the training data is quite good for both our model and the lin-log model. The RSS of the lin-log model is even somewhat better (1378 vs. 1628) but its MAPE is worse (0.100 vs. 0.083).
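The core of such a brute-force attack can be sketched as follows. This is a hedged, simplified stand-in: the actual code uses boost's arbitrary-precision cpp_int and an OpenMP parallelization, whereas 64-bit integers and a serial trial-division loop are substituted here to keep the sketch self-contained.

```cpp
#include <cstdint>

// Brute-force search for the smaller prime factor of n = p * q by
// trial division over odd candidates up to sqrt(n). For RSA-style n,
// the loop body is where an OpenMP parallelization over candidate
// ranges would attack the exponential (in the bit length) search space.
std::uint64_t smallest_prime_factor(std::uint64_t n) {
    if (n % 2 == 0) return 2;
    for (std::uint64_t cand = 3; cand * cand <= n; cand += 2)
        if (n % cand == 0) return cand;   // found the smaller factor
    return n;                             // n itself is prime
}
```

Since the candidates grow with 2^(bits/2), the expected runtime grows exponentially in the bit length, which is exactly the behavior the exp-pol model is fitted to above.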
The predictive quality is analyzed in Figure 13. Both types of models give good estimates of the real runtime. The error at the last measurement point p = 49 is 16% for our model and 21% for the lin-log model; in this case, both models work well. We summarize that the exp-pol models describe the training data better in all of our test cases. For SubSieve, the differences between the exp-pol model and the lin-log model are less significant than for G6K-Sieve and the pBKZ library. The experiments show that the exp-pol approach is more flexible than the lin-log models when trying to accurately model the behavior of the training data. Due to the nature of exponential functions, the number of measurement points for verifying the predictions is limited. However, the experiments performed strongly indicate that the predictions of the exp-pol models are more realistic than those of previous approaches.

Factorial polynomial
We start the evaluation of the new model for factorial runtime behavior with a synthetic test case, which is f_synth(p) = 20 + 0.13 ⋅ p!. As training points, we employ p ∈ {1, 2, …, 6} and let the model predict up to p = 20. The results are shown in Figure 14, assuming a theoretical runtime in nanoseconds. A stair-like pattern of the model on a logarithmic y-axis reflects the factorial behavior. The coefficients i and j quickly converge to 0, and the resulting model predicts p = 20 with an error of 0.46%. For p = 21, the value of p! exceeds the range of 64-bit integers, which gives an impression of how fast factorial runtimes grow even in comparison to the (super-)exponential functions discussed above. In this theoretical case, we included a lin-log model in orange. As derived from the theory, the lin-log approach is unable to fit such rapid growth and fails to give meaningful predictions or even to fit the training data. Hence, we omit it in the following evaluation to keep the diagrams clearer.
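The synthetic test function can be reproduced directly; the 64-bit limit at p = 21 mentioned above is visible in the factorial helper of this sketch.

```cpp
#include <cstdint>

// Synthetic factorial test case: f_synth(p) = 20 + 0.13 * p!.
// p! fits into an unsigned 64-bit integer only up to p = 20;
// p = 21 already overflows, as noted in the text.
std::uint64_t factorial(unsigned p) {
    std::uint64_t f = 1;
    for (unsigned k = 2; k <= p; ++k) f *= k;
    return f;
}

double f_synth(unsigned p) {
    return 20.0 + 0.13 * static_cast<double>(factorial(p));
}
```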
Our second test case evaluated in this section is the permutation of an std::string. We internally employ the std::rotate function to modify the order of the characters and perform all permutations recursively. Our test string is ABC … XYZ truncated to the respective target length. Figure 15 summarizes the results of our experiments with p ∈ {4, 5, …, 10} as training data. Here, the RSS of 46,822 seems high, but we have to consider that the runtime is given in microseconds. For p = 10, about 18,500 microseconds are required, which puts the RSS value into perspective, and the small MAPE shows that the fit of the model is really good.
In that case, a small additional polynomial term occurs in the model, and the model correctly predicts four additional measurement points, for example, p = 14 with an error of 7.9%, although the last training point only requires 0.02 s while the last measurement point requires about 1.5 h.
The string permutation test was repeated on the Raspberry Pi 3B, and the results are shown in Figure 16. The prediction for the last measurement point is too low by 33%, which is still very accurate when taking into account that the difference in runtime between the last training point p = 10 and p = 13 is more than three orders of magnitude. In this test, the quality of the model differs most when changing the procedure to determine c0 and c1 from the RSS-based to the MAPE-based mode (cf. Section 4.2). Hence, Figure 16 also visualizes the MAPE-based graph in orange. In that case, the error at p = 13 is reduced to 9%. Furthermore, the fit to the first training points is better for the MAPE-based graph. This is mainly due to the high c0 = 12.4 resulting from the RSS-based approach. Since the absolute values in the region p ∈ {5, 6, 7, 8} are considerably smaller than 10, they do not have a large influence on the overall RSS cost metric, and the algorithm focuses on fitting the higher training data points. This is a vivid example of how SimAnMo can be configured for the desired tradeoff between runtime and model quality: the MAPE of the fit to the training data is reduced from 22.9 to 0.18 while the runtime increases from 5.1 to 61 s.
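A minimal sketch of such a recursive std::rotate-based enumeration is shown below; instead of timing the permutations as the benchmark does, this version merely counts them, so the p! growth appears as the return value. The function name and counting variant are our own illustration, not taken from the benchmark code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Recursively enumerate all permutations of s[first..] by rotating each
// remaining character to position `first`, recursing, and rotating back.
// Returns the number of permutations visited, that is, (s.size()-first)!.
std::uint64_t permute(std::string& s, std::size_t first) {
    if (first + 1 >= s.size()) return 1;          // 0 or 1 character left
    std::uint64_t count = 0;
    for (std::size_t i = first; i < s.size(); ++i) {
        // bring s[i] to position `first`
        std::rotate(s.begin() + first, s.begin() + i, s.begin() + i + 1);
        count += permute(s, first + 1);
        // undo the rotation so the next iteration sees the original order
        std::rotate(s.begin() + first, s.begin() + first + 1, s.begin() + i + 1);
    }
    return count;
}
```

Because each level of the recursion multiplies the work by the number of remaining characters, the total work is exactly p!, which is the behavior the factorial model captures.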
Our last experiment was performed with a C++ code to solve the traveling salesman problem (TSP).§ It was executed on SystemCascade. We focused on the brute-force solver which enumerates all tours in a systematic fashion. For the first benchmark, we deactivated the integrated optimization which aborts a path as soon as its current length exceeds that of an already registered path visiting all nodes once. As Figure 17 shows, SimAnMo finds a well-fitting factorial model for p ∈ {8, 9, …, 13} which is able to predict the runtime for p = 15 with an error of 7.6%. Runs with higher values of p would last more than one day.
For a second benchmark, we re-activated the early exit and parallelized the brute-force solver's execution. The parallelization is realized with OpenMP tasks on the decision for the second node in the path, that is, the node following the fixed starting node: all possibilities for this node are processed in parallel. The length of the shortest tour found so far is shared between all threads and write-protected by a critical region. The parallel execution with 24 threads leads to varying runtimes between executions even if the distance map between the nodes is the same. This is due to the order in which tours are processed and, hence, the order in which the upper bounds for the overall length are updated. Our approach is to re-create a random distance map for each execution and to perform twelve runs for each choice of p. For the training data points and the additional measurements, we employ the average values.
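A serial sketch of the brute-force solver with the early exit active may look as follows. The OpenMP-task parallelization over the second node and the critical region protecting the shared bound are omitted for brevity; node 0 is fixed as the starting node, and all names are our own illustration rather than the repository code.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using DistMap = std::vector<std::vector<double>>;

// Depth-first enumeration of all tours with the early exit: a partial
// tour is abandoned as soon as its length reaches the best complete
// tour found so far.
void search(const DistMap& d, std::vector<int>& tour,
            std::vector<bool>& used, double len, double& best) {
    if (len >= best) return;                        // early exit (pruning)
    if (tour.size() == d.size()) {                  // all nodes visited:
        best = std::min(best, len + d[tour.back()][tour.front()]);
        return;                                     // close the tour
    }
    for (int next = 0; next < static_cast<int>(d.size()); ++next) {
        if (used[next]) continue;
        used[next] = true;
        tour.push_back(next);
        search(d, tour, used,
               len + d[tour[tour.size() - 2]][next], best);
        tour.pop_back();
        used[next] = false;
    }
}

double solve_tsp(const DistMap& d) {
    std::vector<int> tour{0};                       // fix the starting node
    std::vector<bool> used(d.size(), false);
    used[0] = true;
    double best = std::numeric_limits<double>::infinity();
    search(d, tour, used, 0.0, best);
    return best;
}
```

In the parallel variant described above, the loop over `next` at recursion depth 1 becomes a set of OpenMP tasks, and the update of `best` is guarded by a critical region; since the pruning depends on how early a tight bound is found, the runtimes vary between executions.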
The code for the TSP solver can be found in SimAnMo's repository. Figure 18 summarizes the results. Interestingly, the best model that can be found is an exp-pol model which predicts p = 21 with an error of only 3.5%. For p = 22, we also started 12 runs. Nine finished within the hard limit of 24 h while three exceeded this boundary and were aborted by the scheduler. Considering the ranges from the fastest to the slowest run per p value, this distribution for p = 22 is expected since the maximum runtime for p = 22 can be estimated at 31-33 h. This additional maximum point is drawn in Figure 18, which also shows that the exp-pol model predicts the average point at p = 22 well.
The best fitting factorial model is shown dashed. It only approximately hits p ∈ {13, 20, 21} while already overestimating p = 21 by 95% (17,808 s vs. 9090 s). The model also hits several boundaries for the ranges of the coefficients, like −2 for i and j as well as 10⁻¹¹ for c1, which renders the existence of a useful factorial model very unlikely. Even relaxing these boundaries further does not improve the quality of the model.
Hence, the practical behavior of this TSP variant seems to be best described by an exponential function and not by a factorial one, as was expected. Overall, this section demonstrates that the novel factorial type works well for modeling algorithms with factorial runtime behavior. The easy integration of this new model type also underpins the flexibility of SimAnMo's software design.

COMPARISON OF QUALITY METRICS
In this section, we analyze the effect of using the MAPE as the metric minimized by the simulated annealing process. To that end, we additionally generated models for the practical examples from Section 5 which minimize the RSS between the model and the training points. We also include another sieving algorithm, called HashSieve, in the considerations since the required training points and measurements are already available.24 The progressive BKZ example is not considered here because of the drawbacks already discussed.
Then, we took the graphs of the resulting predictions and calculated several quality metrics with regard to the measurement points. The results for the exponential runtime examples from Section 5.3 are summarized in Table 1. For the string permutation, the minimization of the MAPE considerably improves the predictive quality of the models: for example, the resulting RSS is reduced by two orders of magnitude and the RMSE by a factor of 10. In contrast, in the case of the TSP, the RSS approach results in a slightly better model. However, we want to highlight that in all cases our exponential-polynomial and factorial models have a much better predictive quality than the traditional linear-logarithmic approach.
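For reference, the two metrics compared throughout this section can be computed as in the following sketch; the MAPE is expressed as a fraction here (multiply by 100 for a percentage), matching the values such as 0.100 vs. 0.083 quoted above.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// RSS: residual sum of squares between measurements y and model values yhat.
double rss(const std::vector<double>& y, const std::vector<double>& yhat) {
    double s = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i) {
        const double r = y[i] - yhat[i];
        s += r * r;
    }
    return s;
}

// MAPE: mean absolute percentage error, as a fraction. Note that it
// weights small measurements as strongly as large ones, which is why
// minimizing it improves the fit at small p (cf. Section 5.4).
double mape(const std::vector<double>& y, const std::vector<double>& yhat) {
    double s = 0.0;
    for (std::size_t i = 0; i < y.size(); ++i)
        s += std::fabs((y[i] - yhat[i]) / y[i]);
    return s / static_cast<double>(y.size());
}
```

The difference in weighting explains the observations above: the RSS is dominated by the largest training points, while the MAPE treats relative errors at all points equally.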

CONCLUSION
We presented the new features of SimAnMo, a highly configurable and flexible generator for models of a program's runtime depending on the size of an input parameter p. The SimAnMo framework is based on parallelized simulated annealing. Its software design makes it easily extendable with new types of models, as we demonstrated with the factorial runtime models, or with new types of cost metrics, as we showed for the MAPE metric.
The accuracy of the models generated was demonstrated on many theoretical and practical examples on different hardware architectures. The evaluation of the exponential runtime models shows their superiority over the linear-logarithmic approach and over the exponential models as employed in an extended version of Extra-P. Our exp-pol models are based on the modeling function f(p) = c0 + c1 ⋅ 2^(k⋅p^i) relating the size of p to the actual runtime. Our novel models meet the runtime behavior of the codes in practice. For polynomial runtime behavior, we extended previous results from cryptanalysis19 to the application domain of dense linear algebra, where SimAnMo also generates well-fitting and reliable models. The novel factorial models f(p) = c0 + c1 ⋅ p! ⋅ p^i ⋅ log^j(p) are a good example of the easy integrability of new models. They also highlight that reliable models can be generated for this type. By now combining polynomial, exponential, and factorial runtime behavior, SimAnMo covers all important types of runtime complexity.
In the future, SimAnMo will be integrated into the tool PIRA. 34 PIRA automatically refines the instrumentation for profiling in a given code. This refinement is based on models for the runtime of different code regions.
Future extensions of SimAnMo include the automation of the whole model creation process. In particular, we focus on the selection of the training data points: currently, manual inspection is required to remove heavy outliers or to detect ranges of p where the program behaves differently than for high p-values. Furthermore, the numerical stability and the performance of the MAPE approach to determine c0 and c1 will be improved, either by tuning the parameters of the conjugate gradient process or by switching to another minimization procedure.

ACKNOWLEDGMENTS
We thank the participants and organizers of the PMBS2020 workshop for their interesting questions concerning SimAnMo, which motivated much of the research and the results presented here. We thank the anonymous reviewers of this article for their valuable feedback. We also thank Y. Wang for the fruitful discussions about the pBKZ library and for granting us access to the measurement data. Our work is partially funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation)-SFB 1119-236615297. Calculations for this research were conducted on the Lichtenberg high-performance computer of the TU Darmstadt.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available in