Well‐scaled, a‐posteriori error estimation for model order reduction of large second‐order mechanical systems

Model order reduction vastly speeds up simulations, but it also introduces an error into the simulation results which needs to be controlled. The a-posteriori error estimator of Ruiner et al. for second-order systems, which is based on the residual, has the advantage of having provable upper bounds and being usable independently of the reduction method. Nevertheless, a bottleneck is found in the offline phase, making it unusable for larger models. We use the spectral theorem, power series expansions, monotonicity properties, and self-tailored algorithms to speed up the offline phase by one polynomial order, both in computation time and in storage complexity. All properties are proven rigorously. This eliminates the aforementioned bottleneck. Hence, the error estimator of Ruiner et al. can finally be used for large, linear, second-order mechanical systems reduced by any model reduction method based on Petrov-Galerkin reduction. The examples show speedups of up to 28,000 and the ability to compute much larger systems with a fixed amount of memory.


INTRODUCTION
Mechanical systems modeled with the Finite Element Method (FEM) become more and more detailed with increasing demands for precision. The finite element model of a car nowadays usually has over 10 million degrees of freedom. Simulating such large systems in time, e.g., for car crashes, becomes a challenge even for large automobile manufacturers on high-performance computers when small modifications need to be checked quickly regarding crash safety or when optimizations need to be performed. These simulations are still run overnight, not satisfying an engineer's need for quick feedback on their changes.
Model Order Reduction (MOR) is a tool that is commonly used to significantly reduce the computation time and storage complexity of the simulations. Certain computer-aided engineering (CAE) tasks would not be possible without it. In model order reduction, the time-discretized ordinary differential equation is usually projected onto a space of much smaller dimension through Petrov-Galerkin reduction while retaining as many system properties as possible. We refer to, e.g., Schilders et al. [1] and Benner et al. [2], who give overviews of model order reduction. We will focus on linear mechanical systems given in second-order form. Model reduction techniques for this type of system can be found, e.g., in Fehr [3] and Nowakowski et al. [4].
Since information is typically lost when projecting to a lower-dimensional space, an error is introduced to the simulation of the reduced model compared to the full-order model. Error estimators are developed to quantify the magnitude of this error, usually as an upper bound, which we will call an error bound. Without any estimation of the error, model reduction would be of no use since bad approximations may not be noticed. This is especially important for safety-critical systems like car crash simulations. Error estimators can also be used in an adaptive way to tune parameters of the used model reduction method [5]. A-priori error bounds are able to estimate the error before the solution of the reduced system is available. The estimation cannot be better than that of a-posteriori error estimators, which also take the individual result of the reduced simulation into account.
Error estimation as part of certified model reduction is an important topic to ensure the reliability and efficiency of the reduced-order model. Numerous works deal with residual-based error bounds / estimates in the context of PDEs; e.g., in the reduced basis (RB) framework, most focus on parabolic PDEs [6][7][8] and fewer focus on hyperbolic PDEs [9][10][11]. The publications of Billaud-Friess and Nouy [12], Yano [13], or Urban and Patera [14] use the space-time RB framework to consider linear or quadratically nonlinear parabolic problems. Also for the PGD-based reduction framework [15,16] or the MOR framework in FE² / homogenization multiscale simulations [17], error bounds are essential issues. Most of the error estimators [6-8,18] are derived in the functional space in the framework of the finite element (FE) or finite volume discretization. The error estimator by Zhang et al. [11], like our work, instead aims at error estimation for projection-based MOR. For example, the error estimation in Zhang et al. [11] builds a connection between the norm of the residuals and the error in approximating the state by defining a dual system and aims at quantifying the error in an output quantity of interest.
There are not many error estimators dedicated to second-order systems. Some of them are specialized estimators for, e.g., Krylov reduction [19], Gramian-based reduction [20], the Reduced Basis method [21], or for the relative error in the frequency domain [5,22]. When focusing only on a quantity of interest / output, a machine learning-based estimator exists [23]. Additionally, an a-priori error estimator for every Galerkin reduction has been published [24]. The estimator improved here [25] instead delivers a-posteriori bounds in the time domain and can be used with all Galerkin reduction methods. A similar error estimator is not known to the authors.
We focus on the a-posteriori error estimator in the time-domain described in Ruiner et al. [25], which is based on the work of Haasdonk and Ohlberger [26]. Our work will provide a vast speedup of this error estimator. The error is defined as a weighted norm of the difference between reconstructed and full state. The error estimator, on the other hand, is an upper bound on this error for each time instance after applying an arbitrary time integration scheme. Its main advantage is the general usability. While most error estimators are tailored for a specific model reduction method, the estimator by Ruiner et al. [25] can be used for any Petrov-Galerkin projection independently of how the projection matrices were tailored. The quality of the error estimation was already discussed in Ruiner et al. [25] and Fehr et al. [27]. The value of the bounds lies in the fact that they are rigorous, i.e., if the bound is small, then the error is also small. Therefore, we solely focus on the computational performance and scalability for large systems. For completeness, one example comparing the true error and error bound will be given in Section 6.
The error estimator of Ruiner et al. [25] involves taking the inverse of the mass matrix of the full-order system and computing the matrix exponential of a large matrix in each time step in order to compute specific coefficients in the offline phase. These computations are the bottleneck with respect to storage and computation time, respectively. We show that a straightforward computation scales like 𝒪(n_t N³) in time and 𝒪(N²) in storage, with N being the dimension of the full-order system and n_t the number of time steps in the discretized setting. This is infeasible for large-scale systems. The limits of a standard desktop computer are already reached for N around 4000. It may be argued that arbitrary computation time is allowed in the offline phase, but this is infeasible for real-world applications. Additionally, the storage bottleneck cannot be overcome with unlimited time resources. Therefore, the bad scalability in the offline phase is a major problem once the error estimator is applied for real-world, large-scale applications.
The novelty of this work is to speed up the offline phase of the error estimator of Ruiner et al. [25]. We improve the computational time to scale like 𝒪(N^1.8) + 𝒪(n_t) with a storage complexity of 𝒪(N) in the case of a proportionally damped, second-order, linear mechanical system and a common choice for the error norm. The results for the undamped, less interesting case are similar. This way, the error estimator can be used on much larger systems. The examples show a speedup of up to 28,000 with the ability to solve 35 times larger systems before running out of memory on a standard desktop computer. For other examples and more powerful hardware, these numbers increase due to the better scalability of one order in time and storage complexity. This vast performance improvement is not only of practical nature; the theory developed may also serve as a template for performance improvements of other problems involving the calculation of a weighted spectral norm and the matrix exponential. This article is structured as follows: In Section 2, the system of interest as well as the error estimator of Ruiner et al. [25] are described. The section ends with a short analysis of the performance to identify the bottlenecks, which are then tackled in Section 3: We prove a special case of the spectral theorem and apply it to the power series of the terms used to compute two coefficients needed in the online phase. These coefficients can now be described as the maximum of one-dimensional functions over the spectrum of a suitable system matrix. While this is still a considerable improvement since only one-dimensional operations are involved, the spectrum still needs to be computed fully. Hence, Section 4 focuses on analyzing these one-dimensional functions for monotonicity properties and finding bounds in certain areas. With these properties, algorithms are tailored which only need a handful of eigenvalues of the aforementioned system matrix.
The examples presented in Section 6 show that the theoretical performance analysis of these algorithms in Section 5 is indeed in correspondence with the numerical experiments, which show a huge performance improvement. The paper ends with the conclusion and an outlook in Section 7. Most of the technical proofs are located in the appendix so as not to lose track of the overall idea.

A-POSTERIORI ERROR ESTIMATION
We first introduce the a-posteriori error estimator by Ruiner et al. [25] used as basis for our improvements. After an analysis of the scalability, we argue that the error estimator in its original form is not usable for large systems and identify the cause.

System of interest
The system of interest is any linear, second-order system that can be written as a multiple-input, multiple-output (MIMO) system of the form

M q̈(t) + D q̇(t) + K q(t) = B u(t),    y(t) = C q(t),    (1)

with system dimension N, where M ∈ ℝ^{N×N} is a symmetric, positive definite (SPD) mass matrix, D ∈ ℝ^{N×N} a symmetric, positive semi-definite (SPSD) damping matrix, K ∈ ℝ^{N×N} an SPSD stiffness matrix, q(t) ∈ ℝ^N the flexible coordinates describing the elastic deformation of the mechanical system in time t ∈ 𝕋_cont := [0, t_end] with initial values q₀, q̇₀ ∈ ℝ^N, and u(t) ∈ ℝ^{N_i} a time-dependent input with input matrix B ∈ ℝ^{N×N_i}. The output y(t) ∈ ℝ^{N_o} is a linear combination of the state q(t) described by the output matrix C ∈ ℝ^{N_o×N}. We will omit the time dependence in the following if it is clear from the context, i.e., we may write q instead of q(t). The formulation (1) usually arises from discretizing a continuum mechanical system in space with the finite element method (FEM). Many finite element programs in engineering like ANSYS Mechanical [28], Abaqus FEA [29], LS-DYNA [30], etc. allow to extract the system matrices M, D, and K.
The second-order, ordinary differential equation (ODE) given in (1) can be transformed to the state-space representation given by the first-order ODE

ẋ(t) = A x(t) + [0; M⁻¹B] u(t),    x(t) := [q(t); q̇(t)] ∈ ℝ^{2N},    (2)

with the state matrix

A := [ 0, I_N ; −M⁻¹K, −M⁻¹D ] ∈ ℝ^{2N×2N}.    (3)

Linear model order reduction
Model Order Reduction (MOR) transforms the systems (1) and (2) of dimensions N and 2N to systems of the same form with much smaller dimensions n ≪ N and 2n ≪ 2N, respectively. The goal consists in having the same system behavior for the reduced-order system as for the full-order system. Despite some special cases, the reduced-order system always constitutes an approximation rather than an exact replacement of the full-order system. Most model order reduction methods use the Petrov-Galerkin ansatz described in the following for the second-order system (1). Assuming a so-called reduction matrix V ∈ ℝ^{N×n}, the state is approximated by q̃ := V q_r, where q_r ∈ ℝ^n with n ≪ N is the so-called reduced state and q̃ ∈ ℝ^N the reconstructed state. It is clear that q̃ = q is only possible for q in the column span of V. Simply substituting q with q̃ = V q_r in (1), therefore, introduces a residual between the left and right hand side. Hence, a second reduction matrix W ∈ ℝ^{N×n} is multiplied to the equation from the left, yielding the reduced system

M_r q̈_r(t) + D_r q̇_r(t) + K_r q_r(t) = B_r u(t),    ỹ(t) = C_r q_r(t).    (4)

The reduced matrices M_r := Wᵀ M V, D_r := Wᵀ D V, K_r := Wᵀ K V, B_r := Wᵀ B, and C_r := C V can be precomputed. The system (4) is now solved for q_r, which yields the approximated state q̃ = V q_r ≈ q and approximated output ỹ(t) ≈ y(t). The reduced system only depends on the system dimension n instead of N, which is expected to yield a faster integration in time.
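The projection step described above can be sketched in a few lines. This is a minimal illustration; the function name and argument order are chosen here and are not from the paper:

```python
import numpy as np

def petrov_galerkin_reduce(M, D, K, B, C, V, W):
    """Project the full second-order system (M, D, K, B, C) onto the
    trial basis V with test basis W (W = V gives a plain Galerkin
    projection), yielding the reduced matrices of the reduced system."""
    Mr, Dr, Kr = W.T @ M @ V, W.T @ D @ V, W.T @ K @ V
    Br, Cr = W.T @ B, C @ V
    return Mr, Dr, Kr, Br, Cr
```

For W = V and SPD M, the reduced mass matrix Wᵀ M V stays symmetric positive definite, so the reduced model remains a valid mechanical system.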
An equivalent reduction for the first-order system (2) can be defined [25] with the biorthogonal (W_sᵀ V_s = I_{2n×2n}) reduction matrices V_s, W_s ∈ ℝ^{2N×2n}, yielding the reduced first-order system (5). The approximated outputs ỹ(t) of (4) and (5) are the same since the reconstructed states coincide. The choice of the reduction matrices V and W is an important characteristic of each reduction method. Since the error estimator discussed later can be used with any reduction matrix, we only mention a few common choices for linear, second-order mechanical systems:
• Modal reduction, which truncates eigenmodes above a certain frequency [31];
• Model reduction based on Krylov subspaces, which are used to match certain moments in the frequency response of the model [32];
• Balanced truncation, which removes states that are badly observable and controllable [33];
• The Craig-Bampton method [34] and its extension CMS-Gram [35], which both combine different types of component modes;
• Proper Orthogonal Decomposition (POD), which takes the most important principal components of simulation snapshots as basis [36].
The error of the state and output is now defined as the difference between the original and approximated quantities, i.e.,

e(t) := q(t) − q̃(t),    e_out(t) := y(t) − ỹ(t).    (6)

It is more feasible to analyze the error as a scalar. Therefore, a suitable norm has to be chosen.

Lemma 1. Every symmetric, positive definite matrix G ∈ ℝ^{N×N} has a unique symmetric, positive definite square root G^{1∕2}; its inverse G^{−1∕2} exists and is symmetric.

Proof. Since G is symmetric and positive definite, there exists exactly one symmetric and positive definite square root, which we will call G^{1∕2} [37]. The inverse (G^{1∕2})^{−1} exists and is symmetric since G^{1∕2} is positive definite and symmetric.
With similar arguments, the matrix (M⁻¹)^{1∕2} exists uniquely. □

Definition 1 (Weighted norm). For a symmetric, positive definite weighting matrix G ∈ ℝ^{N×N}, the G-weighted norm of a vector v ∈ ℂ^N and the induced, G-weighted norm of a matrix A ∈ ℂ^{N×N} are defined as

‖v‖_G := ‖G^{1∕2} v‖₂,    ‖A‖_G := sup_{v≠0} ‖A v‖_G ∕ ‖v‖_G = ‖G^{1∕2} A G^{−1∕2}‖₂,    (7)

respectively. The norm ‖·‖₂ is the Euclidean norm for vectors and the induced spectral norm for matrices.
Proof. The last equality in (7) is worth a proof: the substitution w = G^{1∕2} v is bijective since G (and therefore G^{1∕2}) is regular, see Lemma 1. □ A common choice for the weighting matrix G in Definition 1 is the mass matrix M for finite element models to compensate for the different units of displacements and rotations in the state [25]. We will make use of this choice most of the time but keep the results as general as possible. It is interesting to note that some results only hold for G = M, the most common choice. It will be clear from the context whether we mean the scalar ‖e(t)‖_G or the vector e(t) when we are talking about the error.
In order to prevent the inversion of G^{1∕2} during the computation of the weighted matrix norm ‖A‖_G from Definition 1, we instead solve a generalized eigenvalue problem [27].

Proposition 1. With the assumptions from Definition 1, the weighted norm ‖A‖_G is the square root of the largest eigenvalue λ of the generalized eigenvalue problem Aᵀ G A v = λ G v.
Proof. As already proven for Definition 1, ‖A‖_G = ‖G^{1∕2} A G^{−1∕2}‖₂. The spectral norm of a matrix is its largest singular value, which is the square root of the largest eigenvalue of the matrix transposed and multiplied by itself:

‖G^{1∕2} A G^{−1∕2}‖₂² = λ_max(G^{−1∕2} Aᵀ G A G^{−1∕2}) = λ_max(G⁻¹ Aᵀ G A).

In the last step, we used that similar matrices (in this case transformed with G^{1∕2}) have the same eigenvalues; λ_max(G⁻¹ Aᵀ G A) is the largest eigenvalue of the generalized eigenvalue problem Aᵀ G A v = λ G v, which proves the proposition. □ Even though we have now found an easy way to compute the weighted norm ‖A‖_G with Proposition 1, the error defined in (6) cannot be computed directly since the computation of q(t) for every t is equivalent to solving the full-order system; but obviously, there is no need for a reduced-order system once the full-order solution is known. Hence, we will learn how to estimate the error without computing the full-order solution in the next section.
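Proposition 1 can be checked numerically. The following is a minimal sketch (the helper name `weighted_norm` is hypothetical) using a dense solver for clarity:

```python
import numpy as np
from scipy.linalg import eigh

def weighted_norm(A, G):
    """||A||_G as the square root of the largest eigenvalue of the
    generalized eigenvalue problem  A^T G A v = lam G v."""
    # eigh solves  a v = lam b v  for symmetric a and SPD b,
    # returning the eigenvalues in ascending order
    lam = eigh(A.T @ G @ A, G, eigvals_only=True)
    return float(np.sqrt(lam[-1]))
```

For large sparse systems one would only compute the largest eigenvalue iteratively instead of the full spectrum, as discussed in Section 5.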

A-posteriori error estimator
We now present the error estimator of Ruiner et al. [25], which gives upper bounds on the errors ‖e(t)‖_G and ‖e_out(t)‖₂ of (6). It is originally based on an error estimator for first-order systems by Haasdonk and Ohlberger [26]. Simply transforming the second-order system (1) to the equivalent first-order system (2) and applying this error estimator may lead to a high overestimation of the original error by the so-called hump phenomenon [25]. This is circumvented by writing the error estimator in a second-order form which does not contain the term that is responsible for the high overestimation. We will only give a short summary of how the error estimator is derived. First, a differential equation for the error of the first-order system is found by differentiating (6) with respect to time and substituting (2) and the residual of (5).
Given initial conditions e₀ = (I − V Wᵀ) q₀ for the error in the state and ė₀ = (I − V Wᵀ) q̇₀ for the error in the velocity, the differential equation can be solved analytically. Taking only the upper part representing e(t) yields the explicit solution for the error

e(t) = Φ₁₁(t) e₀ + Φ₁₂(t) ė₀ + ∫₀ᵗ Φ₁₂(t − s) r̃(s) ds    (8)

with the mass-inverted residual r̃ and the fundamental matrix Φ(t) = exp(A t) (10) of the system ẋ(t) = A x(t). The block matrices Φ_{ij}(t) of Φ(t) are of size N × N. Taking the norm ‖·‖_G on both sides of (8), using the triangle inequality of norms as well as the definition of the matrix norm, and taking the norm inside the integral yields

‖e(t)‖_G ≤ Δ(t) := ‖Φ₁₁(t)‖_G ‖e₀‖_G + ‖Φ₁₂(t)‖_G ‖ė₀‖_G + ∫₀ᵗ ‖Φ₁₂(t − s)‖_G ‖r̃(s)‖_G ds.    (11)

The coefficients Φ̄₁₁(t) := ‖Φ₁₁(t)‖_G and Φ̄₁₂(t) := ‖Φ₁₂(t)‖_G will play an important role in the following. After these coefficients and the initial errors ‖e₀‖_G, ‖ė₀‖_G are precomputed in an offline phase, the error bound Δ(t) for the state can be computed by inserting the reduced simulation result into the residual r̃ and integrating over time according to (11). An error estimator Δ_out(t) := ‖C‖_{G,2} Δ(t) for the output follows directly from (6). It depends linearly on the error estimator of the state. Therefore, we will focus only on the error estimation of the state in the following. Note that improved output error bounds are possible by a suitable adjoint problem and output correction term [26]. The described error estimator has several advantages. Obviously, it is a-posteriori since it only depends on the reduced state and its derivative. Most noticeably, the error estimator may be applied for any Petrov-Galerkin reduction independently of the chosen reduction matrices. This allows for an approach to find the optimal reduction where even the reduction method can be varied without changing the error estimator. It can be implemented in an offline / online fashion. The coefficients Φ̄₁₁(t) and Φ̄₁₂(t) need to be precomputed only once for a specific full-order system. Then the error can be estimated by (11) for an arbitrary reduction method during the simulation since the error bound at time t does not depend on quantities of future time instances.
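On a uniform time grid, the online evaluation of such a residual-based bound reduces to a discrete convolution plus two coefficient lookups per step. A sketch with the trapezoidal rule, assuming precomputed coefficient samples (all array and argument names are illustrative):

```python
import numpy as np

def error_bound(phi11, phi12, res_norms, e0_norm, de0_norm, dt):
    """Evaluate the residual-based bound at every step of a uniform grid.

    phi11[k], phi12[k] : precomputed coefficients at t_k
    res_norms[k]       : G-norm of the mass-inverted residual at t_k
    e0_norm, de0_norm  : G-norms of the initial state / velocity errors
    """
    nt = len(res_norms)
    delta = np.empty(nt)
    for k in range(nt):
        # convolution integral over [0, t_k]: integrand at s_j is
        # phi12(t_k - s_j) * ||r~(s_j)||, integrated by the trapezoidal rule
        f = phi12[k::-1] * res_norms[:k + 1]
        integral = dt * (f.sum() - 0.5 * (f[0] + f[-1])) if k > 0 else 0.0
        delta[k] = phi11[k] * e0_norm + phi12[k] * de0_norm + integral
    return delta
```

Since step k only uses residual norms up to t_k, the bound can be accumulated during the reduced simulation itself.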

Analysis of scalability
We will now roughly analyze the computational complexity and scalability for N → ∞ of the error bound (11). Section 5 will cover this topic in greater detail. Assuming that the system matrices and initial values of (1) as well as the projection matrices V, W are known, the initial errors e₀ = (I − V Wᵀ) q₀ and ė₀ = (I − V Wᵀ) q̇₀ are easy to compute. Even though these computations scale with the full-order dimension N, they only involve matrix multiplications and are performed only once for each reduced simulation run with changed initial values.
During the online phase, the residual r̃ needs to be computed for each discretized time instance of the simulation run. The residual itself is a quantity in ℝ^N, but only its norm is saved. The vector r(t) is usually also sparse. The left-multiplication by M⁻¹ is performed by solving the equivalent linear system with the sparse matrix M and does not need the explicit inversion of M. Instead, a suitable decomposition of M like the LU decomposition is calculated once in the offline phase and can then be reused in the online phase to solve the linear system. The integral in (11) can be approximated by any numerical integration method that only needs evaluations of the integrand at timesteps available during simulation.
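This factorize-once, solve-many pattern can be sketched with SciPy's sparse LU; the helper name and closure structure are illustrative, not from the paper:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

def make_residual_norm(M, G):
    """Offline: factorize the sparse mass matrix M once.
    Online: evaluate ||M^{-1} r||_G without ever forming M^{-1}."""
    lu = splu(sp.csc_matrix(M))       # sparse LU, reused every time step
    def residual_norm(r):
        r_tilde = lu.solve(r)         # solves  M r_tilde = r
        return float(np.sqrt(r_tilde @ (G @ r_tilde)))
    return residual_norm
```

The factorization cost is paid once in the offline phase; each online step only performs cheap sparse triangular solves.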
As proven in Proposition 1, the computation of the matrix norm ‖ ⋅ ‖ is equivalent to solving for the largest eigenvalue of a generalized eigenvalue problem. This is usually accomplished with the Lanczos method for Hermitian matrices as will be discussed later in Section 5. The computation of the weighted norm of a vector involves by definition only matrix multiplications and the usually sparse matrix .
Only the coefficients Φ̄₁₁(t) and Φ̄₁₂(t) as part of the offline phase remain to be analyzed. One problem lies in the matrix A, of which we need to take the matrix exponential according to (10). The matrix A needs to be computed directly. Its definition in (3) involves the inverse of M. While M itself is a sparse matrix coming from an FE discretization, its inverse is usually a dense matrix scaling like N² in terms of memory consumption. Even for sparse, e.g., diagonal M, the matrix exponential is usually a dense matrix. Additionally, the matrix exponential (10) needs to be computed for each discretized time instance, leading to a high computational complexity. Altogether, the computation of these coefficients in the offline phase is infeasible for large systems, which will be discussed in greater detail in Section 5 and demonstrated in the numerical examples of Section 6.
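For reference, a direct implementation of this straightforward (and deliberately naive) offline phase might look as follows. It is only usable for small N and illustrates the dense 2N × 2N matrix exponential, costing 𝒪(N³) flops and 𝒪(N²) memory per time step; the function name is hypothetical:

```python
import numpy as np
from scipy.linalg import expm, sqrtm

def coefficients_naive(M, D, K, ts):
    """Straightforward offline phase: one dense matrix exponential per
    time step, whose blocks are measured in the M-weighted norm."""
    N = M.shape[0]
    A = np.block([[np.zeros((N, N)), np.eye(N)],
                  [-np.linalg.solve(M, K), -np.linalg.solve(M, D)]])
    Gh = sqrtm(M).real                  # M^(1/2) for the M-weighted norm
    Ghi = np.linalg.inv(Gh)
    a11, a12 = [], []
    for t in ts:
        Phi = expm(A * t)               # dense 2N x 2N matrix exponential
        a11.append(np.linalg.norm(Gh @ Phi[:N, :N] @ Ghi, 2))
        a12.append(np.linalg.norm(Gh @ Phi[:N, N:] @ Ghi, 2))
    return np.array(a11), np.array(a12)
```

Everything in this loop is dense, which is exactly the bottleneck the remainder of this work removes.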
The remainder of this work will largely improve the computational complexity of the coefficients Φ̄₁₁(t) and Φ̄₁₂(t), which constitutes the main novelty of this work. With little to no additional inaccuracy, they will be computable even for large systems within decent time and memory limitations. First, we will rewrite the coefficients as the solution of a one-dimensional problem involving generalized eigenvalues.

ALTERNATIVE FORMS OF THE COEFFICIENTS
As has been argued in the previous section, the computation of the coefficients Φ̄₁₁(t) and Φ̄₁₂(t) is a bottleneck for large systems. Instead of computing exp(A t) directly and taking the norm ‖exp(A t)‖_G, we will derive an alternative form of this norm as a supremum over the eigenvalues of a specific eigenvalue problem. The computation of the norm is then reduced to a one-dimensional problem once these eigenvalues are known. It will be discussed in Section 4 how we are able to calculate the supremum efficiently without knowing all eigenvalues.
The damped case turns out to be more challenging than the undamped case. Therefore, we will sometimes look at both systems separately. Additionally, we will restrict ourselves to proportional or Rayleigh damping, i.e., D = α M + β K with the damping parameters α, β ≥ 0 [38]. The choice α = β = 0 constitutes an undamped system. For damped systems, we assume α > 0 and β ≤ 1 in order to circumvent some special cases in the proofs and algorithms, which will be summarized in Section 7.
Until now, we have only worked with symmetric matrices. This will change, and we need to extend the common definition of positive definiteness to non-Hermitian matrices.
We define the matrices K̄ := M⁻¹K and D̄ := M⁻¹D∕2 as mass-inverted forms of K and D∕2 for easier notation. This yields the new notation of the state matrix (3):

A = [ 0, I_N ; −K̄, −2D̄ ].

Remark 1. We use the notation f(X̃) for the image of the subset X̃ of the domain X of any arbitrary mapping f with codomain Y, and f⁻¹(y) := {x ∈ X : f(x) = y} for the fiber of y ∈ Y.
In the remainder of this work, we will write σ(A) ⊆ ℂ for the spectrum of a matrix A, i.e., the set of all its complex eigenvalues. The spectral theorem, see e.g. Reed and Simon [39], will play a crucial part in the following. We will only need a variant dealing with power series.

Theorem 1 (Spectral Theorem for Power Series). For any Hermitian matrix H ∈ ℂ^{N×N} and power series

f(x) = Σ_{k=0}^∞ a_k x^k

with real coefficients a_k ∈ ℝ for all k ∈ ℕ₀ and infinite radius of convergence, it holds:

(a) ‖H‖₂ = max_{λ ∈ σ(H)} |λ|;
(b) f(H) := Σ_{k=0}^∞ a_k H^k is a well-defined, Hermitian matrix;
(c) σ(f(H)) = f(σ(H));
(d) ‖f(H)‖₂ = max_{λ ∈ σ(H)} |f(λ)|,

i.e., the spectral norm of a power series of a Hermitian matrix can be easily computed from its eigenvalues.

Proof.
(a) The proof can be found in standard text books [40].
(b) Let f_m(H) := Σ_{k=0}^m a_k H^k be the partial sum of the power series to the m-th order. The polynomial f_m(H) in H is then defined with H⁰ := I. It is Hermitian due to the properties of the conjugate transpose and the coefficients being real. First we note that the Cauchy criterion holds for all Banach spaces. For the convergence of f(H), we therefore only have to check the Cauchy criterion for the sequence (f_m(H))_m in the Banach algebra (ℂ^{N×N}, ‖·‖₂). This immediately follows from the inequality

‖f_{m₂}(H) − f_{m₁}(H)‖₂ ≤ Σ_{k=m₁+1}^{m₂} |a_k| ‖H‖₂^k

and the fact that the power series f converges in ℝ, i.e., it fulfills the Cauchy criterion for each x, since we assumed an infinite convergence radius. Therefore, f(H) is well-defined. It still needs to be shown that f(H) is Hermitian, i.e., a self-adjoint operator. The scalar product in ℂ^N induces the matrix norm ‖·‖₂, which was used to define f(H). Since each partial sum f_m(H) is Hermitian and the scalar product is continuous, it follows immediately that f(H) is also Hermitian.

(c) Let λ ∈ σ(H) with eigenspace E_λ(H). Then for every v ∈ E_λ(H), we have H v = λ v. It follows immediately f_m(H) v = f_m(λ) v and, by taking the limit on both sides, also f(H) v = f(λ) v. Therefore, not only f(λ) ∈ σ(f(H)) but also E_λ(H) ⊆ E_{f(λ)}(f(H)). Since N = Σ_{λ ∈ σ(H)} dim E_λ(H) = Σ_{μ ∈ σ(f(H))} dim E_μ(f(H)) for the Hermitian matrices H and f(H), and since eigenvectors to distinct eigenvalues are linearly independent, it follows ⊕_{λ ∈ f⁻¹(μ)} E_λ(H) = E_μ(f(H)) for every μ ∈ σ(f(H)) as well as σ(f(H)) = f(σ(H)) by the pigeonhole principle. The sum of the eigenspaces is necessary since f may not be injective.

(d) Follows immediately from (a) and (c). □

The goal is now to extract Φ₁₁ and Φ₁₂ from the power series of exp(A t), see Equations (10) and (15). We define the matrix S := D̄² − K̄ for easier notation. It is easy to see that K̄, D̄, and S commute pairwisely.

Theorem 2. The blocks of the fundamental matrix can be written as

Φ₁₁(t) = exp(−D̄ t) ( c(t; S) + D̄ s(t; S) ),    (17)
Φ₁₂(t) = exp(−D̄ t) s(t; S),    (18)

with the power series

s(t; S) := Σ_{k=0}^∞ t^{2k+1} S^k ∕ (2k + 1)! = sinh(S^{1∕2} t) S^{−1∕2},    (19)
c(t; S) := Σ_{k=0}^∞ t^{2k} S^k ∕ (2k)! = cosh(S^{1∕2} t).    (20)

The power series in Equations (17) and (18) have an infinite radius of convergence. We define 0⁰ := 1 in the sum notation of (19) and (20). The terms sinh(S^{1∕2} t) S^{−1∕2} and cosh(S^{1∕2} t) are to be understood symbolically.

Proof. Since K̄, D̄, and S commute pairwisely, one can argue with the scalar case N = 1 w.l.o.g.
to obtain the series expansions (17) and (18) symbolically. The technical calculations are left to the reader. Since the hyperbolic sine and cosine have a convergence radius of infinity, this is also true for s(t; ·) and c(t; ·). The matrices D̄ and S can be written as polynomials in K̄, see (14) and (16). Inserting a polynomial into a power series with radius of convergence of infinity makes it again a power series with the same radius of convergence. The same is true for the multiplication of two power series, e.g., exp(−D̄ t) and s(t; S). This shows that each of the Equations (17) and (18) can indeed be written as a power series of K̄ with radius of convergence of infinity. □

These power series will help us in Section 4 to compute the coefficients Φ̄₁₁(t) and Φ̄₁₂(t). For this, we need to take the weighted norm of Φ₁₁ and Φ₁₂, which will eventually transform the power series of Equations (17) and (18) into power series of the matrix G^{1∕2} K̄ G^{−1∕2}. In order to apply the Spectral Theorem 1, we need to show that this matrix is Hermitian. This is in general only true for G = M since K̄ is usually not symmetric, see Remark 5. M^{1∕2} K̄ M^{−1∕2} = M^{−1∕2} K M^{−1∕2} is symmetric since it reads the same from right to left as from left to right and each factor is symmetric, see Lemma 1. Therefore, M^{1∕2} D̄ M^{−1∕2} is also symmetric. Similar arguments hold for M^{1∕2} S M^{−1∕2}: we already know that M^{−1∕2} K M^{−1∕2} is symmetric, and the square root of this symmetric matrix is again symmetric, which makes the transformed matrix symmetric as a sum of symmetric matrices. □

All these preparations allow us to write the coefficients Φ̄₁₁(t) and Φ̄₁₂(t) in an alternative form that only depends on the solution of the generalized eigenvalue problem K v = λ M v.
Theorem 3 (Alternative Representation of Φ̄₁₁(t) and Φ̄₁₂(t)). For G = M, the coefficients Φ̄₁₁(t) and Φ̄₁₂(t) can be computed as

Φ̄₁₁(t) = max_{λ ∈ σ(T)} | exp(−d(λ) t) ( cosh(ω(λ) t) + d(λ) sinh(ω(λ) t) ω(λ)^{−1} ) |,    (26)
Φ̄₁₂(t) = max_{λ ∈ σ(T)} | exp(−d(λ) t) sinh(ω(λ) t) ω(λ)^{−1} |,    (27)

with T := M⁻¹K, d(λ) := (α + βλ)∕2, and ω(λ) := (d(λ)² − λ)^{1∕2}.

Proof. The time t is assumed to be fixed. First, we will incorporate the weighting matrix M in the power series of Theorem 2 with the help of Definition 1. We already know from Theorem 2 that exp(−D̄t)(c(t; S) + D̄ s(t; S)) and exp(−D̄t) s(t; S) are indeed power series w.r.t. K̄. Substituting K̄ by M^{1∕2} K̄ M^{−1∕2} and taking the spectral norm leads to (26) and (27). Since we know from Lemma 2 that D̄ and S are polynomial in K̄, (26) and (27) can indeed be written as the spectral norm of a power series, called f₁₁(t; ·) and f₁₂(t; ·) for now, w.r.t. M^{1∕2} K̄ M^{−1∕2}, which is symmetric. This allows the application of the Spectral Theorem 1 for fixed t with H = M^{1∕2} K̄ M^{−1∕2} and f = f₁₁ or f = f₁₂, respectively.
The last equality in each line uses that the spectra of the similar matrices M^{1∕2} K̄ M^{−1∕2} and T = K̄ are the same. For λ ∈ σ(T), we can now reorder the power series f₁₁ and f₁₂ again to have a form like (17) and (18), respectively. The difference is that we are not inserting the matrices but their eigenvalues into the power series. They can be interchanged freely according to Theorem 1(c) and since d and ω were defined such that D̄ = d(K̄) and S = ω(K̄)². □
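The eigenvalue-based representation lends itself to a numerical sanity check against the direct matrix exponential. Below is a sketch assuming Rayleigh damping D = αM + βK; the function name and the helpers d, w for the two scalar functions are chosen here for illustration:

```python
import numpy as np
from scipy.linalg import eigh

def coefficients_from_spectrum(lams, alpha, beta, t):
    """Coefficients as a maximum of one-dimensional functions over the
    generalized eigenvalues lams of  K v = lam M v  (Rayleigh damping)."""
    lam = np.asarray(lams, dtype=complex)
    d = 0.5 * (alpha + beta * lam)         # modal damping d(lam)
    w = np.sqrt(d * d - lam)               # omega(lam), possibly imaginary
    small = np.abs(w) < 1e-12
    w_safe = np.where(small, 1.0, w)
    s = np.where(small, t, np.sinh(w_safe * t) / w_safe)  # sinh(wt)/w -> t
    a11 = np.max(np.abs(np.exp(-d * t) * (np.cosh(w * t) + d * s)))
    a12 = np.max(np.abs(np.exp(-d * t) * s))
    return float(a11), float(a12)
```

Only the (real, non-negative) generalized eigenvalues enter; no 2N × 2N matrix is ever formed, and the limit sinh(ωt)/ω → t covers the ω = 0 case.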
Theorem 3 constitutes one of the main results of this work since it allows to compute the coefficients Φ̄₁₁(t) and Φ̄₁₂(t) as a maximum over the eigenvalues of T. We will see in Section 4 how to find this maximum quickly without computing the full spectrum of T. On some occasions, it will be more convenient to use a variant of this theorem which depends on the spectrum of S instead of T. This is the statement of the following corollary.
Proof. The proof is skipped since it is almost the same as the proof of Theorem 3. □

Remark 2. While the terms sinh(S^{1∕2} t) S^{−1∕2} and cosh(S^{1∕2} t) were only to be understood symbolically in Theorem 2, their scalar counterparts are now valid, and ω(λ) ≠ 0 can be assumed in Theorem 3 and Corollary 1. See the next remark for the case ω(λ) = 0. Due to their structure, they yield the same result for any square root of d(λ)² − λ. For easier notation, we will define ω(λ) to be the unique principal value of this square root in the rest of this work.
Remark 3. It is allowed to divide by zero in Theorem 3 and Corollary 1 since

lim_{ω → 0} sinh(ω t) ∕ ω = t

according to L'Hôpital's rule in ℂ, which is obviously the same value as inserting ω = 0 into the corresponding power series (19).
With Theorem 3 and Corollary 1, we got rid of the direct inversion of M and are left with a one-dimensional maximization problem over the spectrum of T. The question remains whether the full spectrum of T needs to be computed in order to find the maximum. We will find a way to circumvent this in the next section.

FAST COEFFICIENT COMPUTATION AND APPROXIMATION
This section constitutes the second main contribution of this work. The goal is to find the maximum in Theorem 3 and Corollary 1 with low computational costs since computing the full spectrum of T is infeasible. The computational costs will be discussed in detail in Section 5. For now it is enough to assume that the k ≪ N lowest / highest eigenvalues of T = M⁻¹K can be efficiently computed by solving the generalized eigenvalue problem K v = λ M v with the Lanczos method. Therefore, we aim at using as few eigenvalues of T as possible.
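With SciPy, such extreme generalized eigenvalues can be obtained via ARPACK's Lanczos implementation; a minimal sketch (the function name is hypothetical, and shift-invert with sigma=0 assumes a nonsingular K):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def extreme_eigenvalues(K, M, k):
    """k smallest / largest eigenvalues of  K v = lam M v  via the
    Lanczos method (ARPACK), never forming T = M^{-1} K explicitly."""
    K, M = sp.csc_matrix(K), sp.csc_matrix(M)
    # shift-invert around sigma=0 targets the eigenvalues nearest zero
    lo = eigsh(K, k=k, M=M, sigma=0, which='LM', return_eigenvectors=False)
    hi = eigsh(K, k=k, M=M, which='LM', return_eigenvectors=False)
    return np.sort(lo), np.sort(hi)
```

Only sparse factorizations and matrix-vector products are involved, so the cost grows far more slowly with N than a full eigendecomposition.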

Undamped systems
Throughout this section, we assume an undamped system, i.e., α = β = 0, which leads to D = D̄ = 0 as well as S = −K̄. It is of importance for the following results to analyze the spectrum of T = K̄ before applying Theorem 3.
Lemma 3. The matrix T = M⁻¹K is real and positive semi-definite; in particular, all its eigenvalues are real and non-negative.

Proof. M⁻¹ is real, symmetric, and positive definite since M is real, symmetric, and positive definite. Therefore, M⁻¹ has a real, symmetric, and positive definite square root, which we will call M^{−1∕2}, see Lemma 1.
As already shown for the proof of Lemma 2, M^{−1∕2} K M^{−1∕2} is real and symmetric. It is also positive semi-definite, which can be seen by looking at the inner product: for arbitrary v ∈ ℂ^N,

v^H M^{−1∕2} K M^{−1∕2} v = (M^{−1∕2} v)^H K (M^{−1∕2} v) ≥ 0.

Therefore, all eigenvalues of T = M⁻¹K are real and non-negative since both matrices are similar. Thus, T = M⁻¹K is positive semi-definite and real as a product of two real matrices. □

Remark 4. For positive definite K, it can be shown with only small changes to the above proof that T is also positive definite.
In the undamped case, the coefficients 11 ( ) and 12 ( ) take a much simpler form, as shown in the following corollary, which follows directly from Theorem 3 and the previous lemma:

Corollary 2. Under the assumptions of this section, it holds
Proof. The first equalities in (34) and (35) follow directly by setting = = 0 in Theorem 3. According to Lemma 3, ( ) = √ − = √ is purely imaginary with, w.l.o.g., non-negative imaginary part, i.e., ( ) ∈ ℝ ≥0 , see Remark 2. Dividing by zero is again defined via the limit (Remark 3). Due to known properties of the hyperbolic sine and cosine, we can simplify (34) and (35) even further using known trigonometric identities for the hyperbolic sine and cosine.

Since it is assumed that we cannot compute the full spectrum, we are unable to predict the value of (34): there is always the possibility that one eigenvalue fulfills √ ∈ ℤ, leading to the maximum absolute value of the cosine. But we can use this to our advantage: since the spectrum usually has ≫ 1 distinct eigenvalues scattered across several orders of magnitude (see the examples in Section 6), there is a high probability that the maximum will be near 1. Therefore, we are fine with using 1 as an upper bound for 11 ( ) in the undamped case. The estimation for 12 ( ) will be a little more involved.
be the increasingly ordered elements of the spectrum of . Additionally, define ∶= √ . If for arbitrary 1 ≤ ≤ and all ≥ 0. Proof.
F I G U R E 1 This figure helps to understand the motivation behind the proof of Theorem 4 for = 1, = 4, and positive definite . The blue function | sin( )| −1 is not monotonically decreasing. We want to find the maximum function value over all . It may happen that it has a small value for the first few, like for 1 in this figure. Therefore, we take the maximum at the points 1 , … , 3 (which is at 2 in this example) and approximate the function values at the remaining 4 , … , by −1 4 , which is the black, dashed line.

(b) Due to the monotonicity of the square root, the 's are also ordered monotonically increasingly. They are also non-negative according to Lemma 3. From Corollary 2, we know that (37) is less than or equal to for ≥ 0. Equality holds for = 0. Therefore, 12 ( ) ≤ . Equality holds for 1 = 0, which proves the first statement. For 1 > 0, we can even find a better bound. Let 1 ≤ ≤ be the lowest index for which the maximum (37) is reached, i.e., 12 ( ) = | sin( )| −1 . We differentiate two cases: (37) is in the set from which the maximum is taken in (36).
• If ≥ , then we can bound the sine by 1. This can be combined with the bound from above | sin( )| −1 ≤ to min { 1 , } , which constitutes the last value in the set of (36). Both cases yield (36).

The case 0 ∈ ( ) does not need any algorithmic consideration since 11 ( ) = 1 and 12 ( ) = hold for every according to Theorem 4. For the case 0 ∉ ( ), the results of Theorem 4 still allow us to find a simple algorithm that calculates the bounds 11,bound ( ) and 12,bound ( ) for 11 ( ) and 12 ( ), respectively. These bounds are then used in the definition of the error estimator in (11) instead of the original 11 ( ) and 12 ( ). The algorithm is given as Algorithm 1 for a discrete time set disc in the setting of a time-discretized system. The computational complexity of computing the smallest eigenvalues will be analyzed in Section 5. The algorithm is guaranteed to finish with 12 ( ) according to (35). If the highest value so far is larger than the monotonically decreasing bound − 1 2 to the right of the next eigenvalue (see line 5), then no higher value can be found and the algorithm finishes earlier in line 7. Usually, only a few eigenvalues need to be computed before the precise value of 12 ( ) is known. We will analyze this in great detail in Sections 5 and 6 for similar algorithms for the damped case.
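The early-termination rule just described can be condensed into a few lines. The following is a hypothetical sketch (not the paper's Algorithm 1 verbatim; the function name is ours) that maximizes f(λ) = |sin(√λ t)|/√λ over an increasingly sorted spectrum and stops once the running maximum dominates the decreasing envelope 1/√λ of all remaining eigenvalues:

```python
import numpy as np

def kappa12_bound(eigs_sorted, t):
    """Hypothetical condensed form of the early-termination idea:
    maximize f(lam) = |sin(sqrt(lam)*t)| / sqrt(lam) over an increasingly
    sorted spectrum; once the running maximum dominates the decreasing
    envelope 1/sqrt(lam) of all remaining eigenvalues, stop early."""
    best = 0.0
    for i, lam in enumerate(eigs_sorted):
        if lam == 0.0:
            return t  # limit value of sin(sqrt(lam)*t)/sqrt(lam) for lam -> 0
        best = max(best, abs(np.sin(np.sqrt(lam) * t)) / np.sqrt(lam))
        if i + 1 < len(eigs_sorted) and best >= 1.0 / np.sqrt(eigs_sorted[i + 1]):
            break  # no remaining eigenvalue can beat the current maximum
    return best
```

Since f(λ) ≤ 1/√λ and 1/√λ is decreasing, every eigenvalue to the right of the break point is dominated, so the early exit does not change the returned maximum.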

Damped systems
For damped systems, i.e., > 0 and ≤ 1, we will work in the context of Corollary 1, i.e., everything is written with respect to the spectrum of .
It will also be of great benefit to separate ( ) into four parts depending on the monotonicity and sign of the real function from (30). Therefore, we start by collecting some properties of this function.
has the following properties.
(a) has its only local and global minimum at min ∶= 1 with minimum value − 1 2 ≤ 0 and zero only for = = 1.
(b) is monotonically decreasing for ≤ min and monotonically increasing for ≥ min .
(c) has its only zeros at
(d) The domain  can be divided into the four connected sets

As can be seen in Figure 2, these sets are chosen such that the first subscript describes the sign of ( ) and the second subscript the orientation to the minimum, i.e., the monotonicity of in this set. They may intersect at the border for easier notation. We also use the notations

Proof. The left border of  was chosen as the lowest possible eigenvalue of . It is 2 since is positive semi-definite (see Lemma 3) and since (14) holds. The assumption ≤ 1 for the damping parameters will be used several times.
(a) Obviously, − −1 is the smallest for = min , which is in the domain  because min = −1 ≥ ≥ 2 . Since is a parabola, this local minimum is also global. The minimum value − 1 2 = 1 ( − −1 ) is non-positive since − −1 ≤ 0 is fulfilled by the condition ≤ 1 on the damping coefficients.
• There exist monotonically decreasing bounds for | 11 (⋅; )| and | 12 (⋅; )| on Ω +,right , see Theorem 8. These properties together with the theoretical result of Corollary 1 are then leveraged in Algorithm 3 and Algorithm 4 to efficiently compute 12 ( ) and 11 ( ), respectively. The proofs of the properties listed above are given in the appendix.

Algorithms
The properties of 11 ( ) and 12 ( ) stated in the previous section and proven in the appendix are now utilized to construct Algorithms 3 and 4, which compute 11 ( ) and 12 ( ) quickly and memory-efficiently. See Section 5 for a discussion of the computational complexity. The difference to Algorithm 1 (and partly Algorithm 2) for undamped systems is that 12 ( ) and 11 ( ) are calculated precisely for the damped system instead of using only sharp bounds. Again, we assume a time-discretized system with discrete time steps disc . The condition in line 5 of Algorithm 3 needs some explanation: Recall that 12 ( ) is defined in (29) as the maximum value of | 12 (⋅; )| over the spectrum. The function | 12 (⋅; )| has its largest value in Ω +,left ∩ ( ) at min,left since 12 (⋅; ) is monotonically decreasing and positive in Ω +,left , see Theorem 5. Additionally, | 12 (⋅; )| will not take larger values on Ω − than on Ω +,left (Theorem 6), which makes 12 ( min,left ; ) the largest value on (Ω +,left ∪ Ω − ) ∩ ( ). If it is also larger than min{ 12,+ ( ), 12,+,right ( min,right ; )}, an upper bound on Ω +,right ∩ ( ) according to Theorems 7 and 8, then 12 ( min,left ; ) is the largest value of | 12 (⋅; )| on ( ), which is 12 ( ) by definition.
If the previously discussed condition is not satisfied, so that the algorithm cannot finish right at the beginning, we have to loop over the spectrum in increasing order in line 10. This guarantees that the algorithm finds 12 ( ) in the worst case by looping over all eigenvalues in the spectrum. But this is rarely needed due to the theory from the appendix: For each newly calculated , we check in line 19 if any higher value for | 12 ( ; )| is possible for the remaining eigenvalues larger than . Since 12 (⋅; ) is monotonically decreasing in Ω +,left , no higher eigenvalue is possible from this area. We already discussed above the highest possible value in Ω +,right . If ∈ Ω +,right , then we can update bound 12,+,right (see line 17) since 12,+,right (⋅; ) is monotonically decreasing (see Theorem 8) and the bound is only used for the remaining points in the spectrum larger than the current value . It only remains to be discussed how large | 12 (⋅; )| can get on Ω − to the right of . Only if ( ) ∩ Ω +,right = ∅ does the bound from Theorem 11 need to be checked. Otherwise, we know from Theorem 6 that no eigenvalues from Ω − will achieve a higher value for | 12 (⋅; )| than the one already saved in the variable named best. In this case, the bound is deactivated by setting bound 12,− to −∞.
Algorithm 4 for computing 11 ( ) is quite similar to Algorithm 3. There are a few differences in the theory which lead to alterations in the algorithm: Theorem 5 does not state that 11 (⋅; ) is monotonically decreasing in Ω +,left ; only a monotonically decreasing bound is given in Theorem 5(b). Therefore, skipping the loop over ( ) as in line 5 of Algorithm 3 is not possible anymore. Additionally, we need to take the bound on Ω +,left into account at line 31 of Algorithm 4. Theorem 10 allows the algorithm to finish early if the conditions of increasing monotonicity are met, see line 20, or if we know that a sufficiently large area of decreasing monotonicity lies to the right of , see line 24.
Remark 7. Algorithms 3 and 4 are not given in full detail to keep them understandable. They can be optimized even further.
• Some variables may not exist, e.g., min,left if Ω +,left = ∅. In these cases, the corresponding parts of the algorithm need to be skipped.

• The algorithms put a lot of effort into finding the precise values for 12 ( ) and 11 ( ). If a small additional error is tolerable, then the bounds can be used to break off the algorithm early, resulting in a slightly higher value. Choose any given tol > 0; then lines 19 and following in Algorithm 3 can be changed to: if best ⋅ (1 + tol) ≥ max{bound 12,− , bound 12,+,right }, then set 12,bound ( ) to best ⋅ (1 + tol). This modified algorithm guarantees that 12 ( ) ≤ 12,bound ( ) ≤ (1 + tol) ⋅ 12 ( ). Similar modifications can be made for Algorithm 4 to trade accuracy for speed. The relative increase in Δ ( ) is bounded by the tolerance since Δ ( ) in (11) depends linearly on the coefficients 11 ( ) and 12 ( ).

• Leveraging the monotonicity in certain areas allows skipping certain eigenvalues in line 10 of Algorithm 3 and line 6 of Algorithm 4.
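The tolerance modification in the second bullet can be sketched generically. The helper below is hypothetical (names and signature are ours, not the paper's): it maximizes a function over a sorted spectrum and stops early once best ⋅ (1 + tol) dominates a monotonically decreasing upper bound on the tail, returning a value that overestimates the true maximum by at most the factor 1 + tol:

```python
def max_with_tolerance(eigs_sorted, f, tail_bound, tol):
    """Generic form of the tolerance modification: scan the sorted
    spectrum and stop as soon as best*(1+tol) dominates tail_bound, a
    monotonically decreasing upper bound for f on everything to the
    right. The returned value v satisfies  max <= v <= (1+tol)*max
    for the true maximum of f over the spectrum."""
    best = 0.0
    for i, lam in enumerate(eigs_sorted):
        best = max(best, f(lam))
        if i + 1 < len(eigs_sorted) and best * (1.0 + tol) >= tail_bound(eigs_sorted[i + 1]):
            return best * (1.0 + tol)  # remaining values cannot exceed this
    return best
```

On early exit, the unseen tail is bounded by tail_bound(next) ≤ best ⋅ (1 + tol), so the returned value is an upper bound; since best never exceeds the true maximum, the relative overestimation is at most tol.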

IMPLEMENTATION AND COMPUTATIONAL COMPLEXITY
In this section, we discuss the implementation and computational complexity of Algorithms 1, 2, 3, and 4, which will be called the improved methods in the following. The direct method, as formulated in Equations (3), (10), and (12), instead consists of computing first , then ( ) as a matrix exponential, and finally 11 ( ), 12 ( ) by taking the -weighted norm of parts of ( ). We focus on the implementation in MATLAB by MathWorks since MATLAB is used later for the examples in Section 6.
In the improved method, we have broken down the computation of the coefficients 11 ( ), 12 ( ) into basic, one-dimensional computations over the spectrum of ( ) and ( ). Therefore, the question remains how to efficiently compute the spectrum of ( ) and ( ); due to the relation between the two spectra, it suffices to compute one of them. In all algorithms, we are only interested in computing a handful of these eigenvalues with the smallest / largest absolute value or nearest to a given value. The examples will indeed show that only a small number of all eigenvalues is needed. This can be accomplished by the iterative Arnoldi (for non-Hermitian matrices) or Lanczos (for Hermitian matrices) method based on Krylov subspaces, see Sorensen [41]. These algorithms are more advanced than the power iteration and faster than computing all eigenvalues, see Saad [42]. Since the matrices and are symmetric, the Lanczos method, or more precisely the Implicitly Restarted Lanczos Method (IRLM), is discussed in the following. One famous implementation of the IRLM is ARPACK (ARnoldi PACKage) [43], which heavily uses LAPACK (Linear Algebra PACKage) [44] and BLAS (Basic Linear Algebra Subprograms, see Dongarra et al. [45] and the references therein). MATLAB used ARPACK for eigs until version 2017a and its own implementation beginning with version 2017b. The latter is used in this work. Other software for sparse eigenvalue problems is listed in Hernández et al. [46]. ARPACK needs only ( ) + ( 2 ) storage to compute a subset of ≪ eigenvalues [43]. Since is negligible compared to , we can assume a storage complexity of ( ). Unfortunately, it is not possible to know in advance how many steps the iterative IRLM needs to converge to the desired subset of eigenvalues [43]. But the examples in Section 6 will show a generally fast computation.
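As a small illustration of such an IRLM call (written in Python/SciPy rather than MATLAB; the symmetric pencil below is a made-up stand-in, and the target value is arbitrary), the eigenvalue nearest a given interior value can be obtained through the shift-invert mode exposed by ARPACK-style solvers:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

# Made-up SPD pencil standing in for the FE matrices; target is arbitrary.
n = 300
K = sp.diags([-np.ones(n - 1), 2.0 * np.ones(n), -np.ones(n - 1)],
             [-1, 0, 1], format="csc")
M = sp.identity(n, format="csc")
target = 0.5

# Shift-invert: internally, (K - target*M) is factorized once and the
# Lanczos iteration runs on its inverse, whose dominant eigenvalue
# nu = 1/(lam - target) belongs to the lam nearest the target.
lam_near = eigsh(K, k=1, M=M, sigma=target, return_eigenvectors=False)[0]
```

Only one factorization and a few matrix-vector solves are needed, instead of the full spectrum.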
Sometimes single eigenvalues like min,right = min( ( ) ∩ Ω +,right ) in Algorithm 4 need to be computed. This is not a problem for the IRLM since the subset of eigenvalues can be defined by an arbitrary condition, but certain implementations like eigs in MATLAB only allow retrieving the smallest / largest eigenvalues or the eigenvalues nearest to a specific value. Due to this limitation, some tricks need to be applied. By definition, min,right is the one positive eigenvalue of nearest to 0,right . Since it is easier to work with , which prevents the introduction of additional numerical error from terms with , we are equivalently interested in the positive eigenvalue of nearest to min,right = 2 min,right − . The positive eigenvalue nearest to min,right of the problem = is equivalent to the largest positive eigenvalue of the transformed problem with = −1 + min,right . Here, we simply used that is near min,right if and only if − min,right is near zero if and only if ∶= ( − min,right ) −1 is large. The eigenvalue problem on the right-hand side of (42) can again be solved with common implementations of the IRLM. Similar spectral transformations can be used to compute min,left and the next larger eigenvalue in ( ) needed in line 10 of Algorithm 3 and line 6 of Algorithm 4. When assessing the direct computation of the coefficients 11 ( ), 12 ( ) with Equations (3), (10), and (12), we need to make a fair comparison to the IRLM:

• Computing with Equation (3) involves computing = −1 . Since and are banded matrices coming from the finite element method, the linear solver SGBTRS from LAPACK for general banded matrices can be used. Its version without partial pivoting has a time complexity of ( 2 ) with ≪ being the bandwidth of [47]. During the forward and back substitution stage, the algorithm does not take into account that is also banded. Modifying the algorithm in this way should then lead to a complexity of ( 2 ).
In general (with a diagonal as an exception), = −1 is a dense matrix, leading to a storage complexity of ( 2 ).

• Taking the matrix exponential of the 2 × 2 block matrix in (10) is the bottleneck of the direct computation. The current implementation of expm in MATLAB 2017b uses the scaling and squaring algorithm based on a Padé approximation of order 13. It needs 6 matrix multiplications and involves matrix powers up to 13 , see Higham [48]. Even for sparse in the case of diagonal , this power easily becomes dense, i.e., it has a storage complexity of ( 2 ): For a simple beam model, later referred to as Model B, with = 2457 degrees of freedom, a diagonal mass matrix, and banded matrices and with a bandwidth of 89, the power 13 has only 44% zero elements and exp( ) has no zero elements at all. Therefore, the time complexity depends on the matrix multiplication, which scales with ( 3 ).

• Computing the -weighted norm in (12) needs neither the inverse of = nor the matrix square root. Instead, we proved in Proposition 1 that the largest eigenvalue of a generalized eigenvalue problem needs to be found. We already discussed that this can be accomplished relatively easily with the IRLM.
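The banded-solver argument from the first bullet can be illustrated with a minimal Python/SciPy sketch (LAPACK's banded Cholesky solver, exposed as solveh_banded, stands in for SGBTRS here, and the SPD band matrix is made up):

```python
import numpy as np
from scipy.linalg import solveh_banded

# Made-up SPD banded matrix in LAPACK's upper-banded storage (bandwidth 1):
# row 0 holds the superdiagonal (first entry unused), row 1 the diagonal.
n = 6
ab = np.zeros((2, n))
ab[0, 1:] = -1.0
ab[1, :] = 2.0
b = np.ones(n)

# Banded Cholesky solve (LAPACK *pbsv): O(b^2 n) factorization plus
# O(b n) substitution per right-hand side; the dense inverse is never formed.
x = solveh_banded(ab, b)
```

The key point of the complexity discussion is visible in the storage format itself: only the band (2 rows here) is stored, never the dense n × n matrix or its inverse.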
In summary, we have argued that the storage complexity of the algorithms presented in this work is only ( ) compared to ( 2 ) for the direct computation. Since the computation of 11 ( ) and 12 ( ) is part of the offline phase, scaling with the dimension of the full-order system is acceptable. We have argued that the direct computation has a time complexity of ( 3 ). While the time complexity of the iterative IRLM is unpredictable and the number of eigenvalues computed with the IRLM was assumed to be negligible compared to , we will find numerical evidence in Section 6 to support these claims and observe an overall time complexity of ( 1.8 ) for Algorithms 3 and 4.
The discussion so far only involved one time step ∈ disc . Another advantage of the improved method over the direct computation of the coefficients lies in the almost linear scaling with the number of time steps with unchanged cont . The direct computation involves recomputing ( ) in (10) and taking the -weighted norm in (12) for every , which is only trivial for disc ⊆ ℤ. Thus, the direct method has computational complexity of ( 3 ).
Algorithm 1 only calls the IRLM once for a subset of ≪ eigenvalues. The computation of the coefficients for each time step then consists only of one-dimensional computations based on the eigenvalues computed before. Even though this scales with the number of time steps, it can be easily parallelized and each step is very cheap to compute. Algorithms 2, 3, and 4 have the same structure: an outer loop over all time steps and an inner loop over all eigenvalues of or . At first glance, one might suspect a high computational complexity, but these algorithms are designed in a smart way. The two loops are almost independent since the spectrum does not depend on the current time step. Again, only one-dimensional computations are performed, with the computation of the spectrum as an exception. Luckily, the spectrum does not need to be computed fully. Instead, we use the IRLM to compute one eigenvalue after the other. These time-independent eigenvalues are saved and reused for other time steps at no additional cost. Further, the design of the algorithms uses as much knowledge of 11 (⋅; ) and 12 (⋅; ) as possible to abort the loop over the spectrum as early as possible. As will be seen in Section 6, only a handful of eigenvalues need to be computed at all. The design of the algorithms based on the behavior of the continuous functions 11 ( ; ) and 12 ( ; ) w.r.t. guarantees that the same number of eigenvalues needs to be computed independently of the discretization detail of the time interval cont = [0, end ]. For few time steps , the IRLM will dominate the computation. Since a finer time discretization usually does not change the small number of eigenvalues needed and these eigenvalues are reused in all time steps, the improved method scales like ( 1.8 ) + ( ) compared to ( 3 ) for the direct method. The exponent 1.8 was obtained experimentally since the IRLM, as an iterative algorithm, has no fixed time complexity.
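The eigenvalue reuse across time steps can be mimicked with a small cache. This is a toy sketch (class name, stand-in spectrum, and the stopping rule, patterned after the undamped bound, are all ours): even with 500 time steps, the stand-in "IRLM" is invoked at most once per eigenvalue:

```python
import numpy as np

class SpectrumCache:
    """Eigenvalue reuse across time steps: eigenvalues of the
    time-independent pencil are computed once, on demand, in increasing
    order and shared by all time steps. `compute_next` stands in for
    one (expensive) IRLM call."""

    def __init__(self, compute_next):
        self._compute_next = compute_next
        self._eigs = []
        self.solver_calls = 0

    def eig(self, i):
        # compute eigenvalues lazily up to index i, caching them
        while len(self._eigs) <= i:
            self._eigs.append(self._compute_next(len(self._eigs)))
            self.solver_calls += 1
        return self._eigs[i]

# Toy, made-up spectrum; each compute_next call mimics one IRLM invocation.
spectrum = [1.0, 4.0, 9.0, 16.0, 25.0]
cache = SpectrumCache(lambda j: spectrum[j])

# Outer loop over time steps, inner loop over as few eigenvalues as needed
# (stopping rule patterned after the undamped bound |sin(w*t)|/w <= 1/w).
for t in np.linspace(0.01, 5.0, 500):
    best, j = 0.0, 0
    while True:
        lam = cache.eig(j)
        best = max(best, abs(np.sin(np.sqrt(lam) * t)) / np.sqrt(lam))
        j += 1
        if j == len(spectrum) or best >= 1.0 / np.sqrt(cache.eig(j)):
            break
```

The inner loop runs 500 times, but the cache guarantees that the number of solver calls is bounded by the number of distinct eigenvalues ever requested, independently of the time discretization.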
Algorithms 2, 3, and 4 are implemented together with the error estimator described in Section 2 in the software package CCMOR. The implementation consists of a single MATLAB class, which can be plugged non-invasively into any ODE solver of MATLAB through the callback function OutputFcn, provided disc is fixed in the offline stage. This way, any existing model reduction method based on Petrov-Galerkin reduction can easily be extended by this error estimator.

EXAMPLES
This section presents numerical examples for different finite element models to support the claims of Section 5 with regard to time and storage complexity. For brevity, Algorithms 1 and 2 for the undamped case are not investigated since the damped case is more challenging. The authors would like to emphasize that the error estimator itself as described in Section 2, including its online performance and overestimation, is not the topic of this work. Instead, we solely focus on the fast and well-scalable computation of the coefficients 12 ( ) and 11 ( ) in the offline phase. Other properties of this error estimator are presented in the references Ruiner et al. [25] and Fehr et al. [27].

TA B L E 1 Summary of the example models used in Section 6

| Model | Dimension | Description | ( ) | ( ) | Notes | Reference |
| Model A | 114 | Stabilization linkage of a car front suspension | 10⁻⁷ to 10⁻¹ | 10² to 10⁹ | - | [49] |
| Model B | 2 457 | 0.1 m × 0.5 m × 1.2 m aluminium beam | 10⁻² to 10⁻¹ | 10⁻¹ to 10⁵ | mass matrix is diagonal | - |
| Model C | 4 131 | Two stiffly coupled beams | 10⁻⁴ to 10⁻² | 10³ to 10¹¹ | - | [50] |
| Model D | 7 338 | Piston rod of a crank drive from a combustion engine | 10⁻⁶ to 10⁻³ | 10⁵ to 10¹⁰ | Very small 11 ( ), 12 ( ) | [51] |
| Model E | 22 680 | 1 m × 2 m × 0.4 m aluminium block | 10⁻³ to 10⁻¹ | 10⁴ to 10¹⁰ | is adjustable | [52] |
| Model F | 50 784 | Support in the shape of a wishbone | 10⁻⁷ to 10⁻⁴ | 10¹ to 10⁹ | - | [53] |

Algorithms 3 and 4 are designed to return the exact values of 12 ( ) and 11 ( ), respectively. A comparison to the direct computation with Equations (3), (10), and (12) will be performed whenever possible to validate the results. We refrain from interpreting any small differences in value since the direct computation cannot serve as ground truth for these values. In fact, the direct computation is deemed to give a less precise result due to the numerical errors introduced by the direct matrix inversion. Since the error bound ‖ ( )‖ from (11) depends linearly on the coefficients, the error bound itself will differ only on the same small scale due to numerical differences. We will see that this numerical difference is under 0.1%. Additionally, the algorithms have been validated with random matrices, which is not part of this work.

Models and settings
Models A to F are given in Table 1. The local parameter of Theorem 9 is set according to Remark 8, as also pointed out in Algorithm 4. The weighting matrix is set to be the mass matrix, i.e., = .
All computations are performed on a Dell Precision T1600 personal computer with an Intel Xeon E3-1245 v1 CPU and 16 GB RAM running a Debian 9 Stretch operating system. MATLAB 2017b is used for all calculations, which has approx. 12.5 GB RAM available on this particular system. This number will be important for calculations that abort because of insufficient memory.

Results
In Section 5, we argued that the improved computation with Algorithms 3 and 4 outperforms the direct method (Equations (3), (10), and (12)) for many time steps disc , i.e., large . Hence, we first investigate the performance of both computational approaches for only one time step = 0.01. Tables 2 and 3 share the same structure; hence, the meaning of each column is only described once. The first and second columns state the names of the example models and their system sizes . The next two columns give the number of iterations the improved Algorithms 4 and 3 spend in the loops beginning in lines 6 and 10, respectively. Tables covering several time steps give the minimum, mean, and maximum loop iterations over all time steps as a triple. The total computation times of the improved method of this work and the direct method are shown in columns 5 and 6, respectively. They are followed by the speedup of the improved method over the direct method. The relative difference is shown in the last column as the maximum over all and over both coefficients 11 ( ) and 12 ( ). The three last columns do not show any values if and only if the direct method needs more memory than available on the computer described in Section 6.1 or the computation takes more than one day. All values are given to three significant digits.

TA B L E 2 Comparison of the improved and direct method used to compute the coefficients for a single time step = 0.01 in the offline phase of the error estimator

Table 2 shows the superiority of the improved method over the direct computation. Since most of the time only one loop iteration is needed in lines 6 and 10 of Algorithms 4 and 3, only a small fraction of the spectrum needs to be computed. This leads to a vast speedup for Models B to F. The improved method shows a great performance boost in comparison to the direct method. This will be investigated later in greater detail.
Table 3 and Figure 3 show the results for the default time discretization disc = {0.01, 0.02, … , 5.00} defined in Section 6.1. The direct method already becomes impractical for Model B with a computation time of over 11 hours and uncomputable for any larger model due to the memory restrictions of a standard PC. The improved method, on the other hand, shows roughly the same computation times for these = 500 time steps as for the single time step computations in Table 2.

F I G U R E 3 Computation times for all six models for disc = {0.01, 0.02, … , 5.00}. The data is taken from Table 3.

As already discussed in Section 5, the IRLM has a computational complexity which cannot be quantified easily. Hence, we will analyze the scalability with respect to the system dimension. The aluminium block of Model E is modified to be 1 m in length, width, and height. In each of these dimensions, it is discretized with el solid elements with 8 nodes having 3 degrees of freedom each. Since one side of the aluminium block is fixed, the system dimension can be computed as = 3 el ( el + 1)². The computation times for the direct and improved method can be seen in Figure 4 with el varied from 1 to 31, leading to system dimensions between 12 and 95 232. Only the single time step = 0.01 is analyzed. As already discussed above, the performance improvement would be even higher for the multi-time-step scenario.
The direct method runs out of memory on the testing computer for el ≥ 10 ( ≥ 3 630) and the improved method only for el ≥ 32 ( ≥ 104 544). This limit can be increased by using better hardware or out-of-core algorithms. Both methods yield the same results for 1 ≤ el ≤ 5. For 6 ≤ el ≤ 9, the direct method is unable to compute a result for 11 but yields the same result as the improved method for 12 . The improved method is much faster than the direct method, with a maximum speedup of 719 for el = 9. If the direct method did not run out of memory on the testing computer, much higher speedups would be possible. With a time budget of 12 min, the improved method is able to compute 11 and 12 for a system with a 35 times higher dimension (95 232 instead of 2 700). The data points, covering several orders of magnitude in system dimension, allow an extrapolation shown as dashed lines in Figure 4. The computation time of the direct method scales approximately with 1.70 ⋅ 10⁻⁸ ⋅ (system dimension)^2.94. This polynomial dependency of order close to 3 was already predicted in Section 5. The improved method instead scales with 1.21 ⋅ 10⁻⁶ ⋅ (system dimension)^1.76 on the testing computer, i.e., one order less than the direct method.
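The extrapolation shown as dashed lines amounts to a least-squares fit of log(time) against log(dimension). A minimal sketch with synthetic data (the timings below are fabricated to follow the reported exponent 1.76, with small multiplicative noise; they are not the measured values from Figure 4):

```python
import numpy as np

# Fabricated timing data following the reported exponent 1.76 with small
# multiplicative noise; NOT the measured values.
n = np.array([12.0, 96.0, 768.0, 3630.0, 12000.0, 95232.0])
t = 1.21e-6 * n**1.76 * (1.0 + 0.05 * np.cos(n))

# Power-law fit t ~ c * n^p via linear least squares in log-log space:
# log t = log c + p * log n.
p, logc = np.polyfit(np.log(n), np.log(t), 1)
c = np.exp(logc)
```

Because the data spans almost four orders of magnitude in the dimension, the fitted exponent is robust against small noise in the individual timings.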
The improvement of one order in time complexity was only analyzed for one time step. As already discussed in Section 5, the improved method also scales very well with the number of time steps for a constant time interval cont . This will now be shown numerically for Models A and C for cont = [0, 5] and disc = {Δ , 2Δ , … , 5} with varying time step size Δ ∈ {10⁰, 10⁻¹, 10⁻², … , 10⁻⁶}. Figure 5 shows that the total computation time is first dominated by the IRLM, which needs more time for Model C due to its roughly 40 times larger system dimension, as can be seen in Table 2. Beginning with around 10 000 time steps, the one-dimensional computations dominate because they have to be repeated that many times. From this point on, the computation time scales linearly in the number of time steps independently of the system dimension, which confirms the time complexity of ( 1.8 ) + ( ). The one-dimensional computations can be accelerated by parallelization. It has to be noted that the number of loop iterations over the spectrum (lines 6 and 10 of Algorithms 4 and 3, respectively) increases from 2 and 1 to 11 and 3 for Model A and Model C, respectively, with 10⁶ times the number of time steps. While this also means that the IRLM needs to be called around 10 times more often, this number is comparably small to the increase in time steps and therefore negligible. Now, end is varied from 5 ⋅ 10⁻⁷ to 5 ⋅ 10⁷ with a fixed number of time steps = 500 for the improved algorithms. The results are depicted in Figure 6 for Models A and C as examples. For small end , an increase in computational time can be observed. This is due to a decreasing derivative of 11 (⋅; ) in absolute value for decreasing end , which makes it difficult for Algorithm 4 to leverage certain monotonicity properties. This leads to a higher number of eigenvalues needed, here 46 (out of 114) and 21 (out of 4131) with end = 5 ⋅ 10⁻⁷ compared to only 1 and 1 with end = 5 for Model A and Model C, respectively.
This effect would be even stronger without the check for decreasing monotonicity in line 24 of Algorithm 4. It diminishes with an increased number of time steps for fixed end as discussed above. While we have focused so far only on the faster calculation of the coefficients 11 ( ) and 12 ( ) as the major part of the offline phase, it is also of interest how the error estimator performs in a full simulation scenario including the online phase. In Figure 7, we apply the error estimator to Model A with a force of ( ) = 10 ⋅ sin(2 ) applied to node 10 of the stabilization linkage in -direction for disc = {0.01, 0.02, … , 5.00} and damping parameters = 5 ⋅ 10⁻³, = 1 ⋅ 10⁻³. A second-order, Gramian matrix reduction technique described in Fehr et al. [5] is used with = 10. Ten POD snapshots in the frequency range of [0, 1500] are used. The true error of this online simulation is given as a reference and shows the same time evolution, but the effectivity Δ ( )∕‖ ( )‖ is about 3 orders of magnitude for this example. Note that for time-dependent problems, conservative bounds of several orders of magnitude are quite common. Effectivities of around 1 have also been observed for this error estimator [27].
The full-order system (1) of the above example takes 64 s to simulate on a standard desktop computer, whereas the reduced-order system (4) is already solved after 3.4 s, which results in a speedup of 19.1. The offline phase of the error estimator takes only 1.6 s. Keep in mind that once the offline phase is finished, the computed quantities, including 11 ( ) and 12 ( ), can be reused for several online simulations with different inputs (the force ( ) in this example) and even different reduction methods. Only when the system matrices , , or are changed does the offline phase of the error estimator need to be repeated for the new system. The online phase of the error estimator, on the other hand, only influences the computation time of the reduced system by a small amount. In this example, the simulation of the reduced model takes only 0.023 s or 0.7% longer with the error estimator activated. These computation times are in correspondence with the theoretical estimates of Section 2.4.
The example in Figure 7 also demonstrates that the error estimator (11) does not need to be monotonically decreasing in time. Indeed, 11 ( ) and 12 ( ) are not necessarily monotonically decreasing, as can be seen from their continuous forms given in Theorem 3 and Corollary 1. The integral in (11) is a convolution of 12 (⋅) and the (not necessarily monotonically decreasing) residual ‖˜(⋅)‖ , which means that for different evaluation times, different values of 12 (⋅) and the residual are multiplied under the integral.
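Numerically, such a convolution term is evaluated on the sampled time grid; the following hedged sketch (function name and trapezoidal discretization are ours, not the paper's implementation) illustrates how each evaluation time weights the residual history with a different reversed slice of the coefficient:

```python
import numpy as np

def convolution_bound(kappa12, res_norm, dt):
    """Trapezoidal evaluation of a convolution term of the form
    int_0^{t_k} kappa12(t_k - s) * r(s) ds on the uniform grid t_k = k*dt;
    kappa12 and res_norm are arrays sampled on that grid."""
    n = len(res_norm)
    out = np.zeros(n)
    for k in range(n):
        # kappa12(t_k - s_j) * r(s_j) for j = 0, ..., k
        integrand = kappa12[k::-1] * res_norm[: k + 1]
        # trapezoidal rule (the k = 0 integral is zero)
        out[k] = dt * (integrand.sum() - 0.5 * (integrand[0] + integrand[-1]))
    return out
```

Because each output value picks a different reversed slice of kappa12, the result inherits the non-monotone behavior of both factors, matching the observation that the bound need not decrease in time.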
The results of this section are summarized in Table 4 for damped systems.

CONCLUSIONS AND OUTLOOK
After recalling the error estimator of Ruiner et al. [25] and discussing its computational downsides, we were able to derive terms for the coefficients 11 ( ) and 12 ( ) -the bottleneck of the error estimator -only depending on the spectrum of − and one-dimensional computations in Section 3. These terms were then analyzed in detail in the appendix for certain properties like monotonicity in the spectrum, which were then leveraged to develop Algorithms 1 and 2 for undamped and Algorithms 3 and 4 for (proportionally) damped systems. These algorithms try to use as few eigenvalues as possible such that only a fraction of the spectrum needs to be computed with the fast Implicitly Restarted Lanczos Method. The time and storage complexity of the algorithms for the more difficult damped case were then analyzed in Section 5 and confirmed by numerical experiments with models of largely different system dimensions in Section 6.
It has been shown that the improved method has a storage complexity of ( ) instead of ( 2 ) and time complexity of ( 1.8 ) + ( ) instead of ( 3 ) with being the system dimension and the number of time steps in the discretized setting. This constitutes a vast improvement and makes it possible for the first time to use the error estimator of Ruiner et al. [25] with non-academic models since the offline phase is not a bottleneck anymore in terms of time and storage. We refrained from analyzing the quality of the error estimator and referenced other scientific contributions instead.
There are several possibilities to continue this work:
• The hump phenomenon described in Ruiner et al. [25] may be analyzed by looking at the power series expansion of ‖Φ21(t)‖ in the same way as was done for ‖Φ11(t)‖ and ‖Φ12(t)‖.
• An extension to parameter- and time-dependent system matrices may be possible. The algorithms presented in this work cannot be used efficiently without modifications, since the eigenvalues can no longer be reused. If a change in parameter or time only perturbs the system matrix by a small quantity, then perturbation theory for eigenvalues, which studies the stability of eigenvalues under perturbation, may be applied.
• For damped systems, we assumed proportional damping with damping parameters α, β ≥ 0. For technical reasons, we additionally assumed one of the parameters to be strictly positive (used, e.g., in Corollary 1) and bounded by one (used, e.g., in Lemma 4, Lemma 5(b), Lemma 6(a), Theorem 10, and Theorem 11); these restrictions may be dropped with only small modifications. A possible extension to a general symmetric positive semi-definite (SPSD) damping matrix is not known to the authors and is deemed to be hard.
• The condition on the weighting matrix required by Theorem 3 plays a crucial role in the overall proof. It may be possible to prove similar results for different weighting matrices.
• The algorithms presented in this work can still be improved in terms of performance. Possible gains include the improvements listed in Remark 7, parallelization over Ωdisc, and a faster variant of the IRLM (see Hernández et al. [46] for an overview).

ACKNOWLEDGEMENTS
The authors would like to thank Benjamin Fröhlich, Darko Milaković, and Nadine Walker for providing the example models in Section 6 and Henrik Ebel for partially proofreading this work.

AUTHOR CONTRIBUTIONS
Bernard Haasdonk provided the idea for the first-order error estimator. Jörg Fehr adapted the error estimator to second-order systems as described in Section 2 and provided some of the models from Section 6. Dennis Grunert developed the speedup of the error estimator as presented in Sections 3 and 4; he also carried out the implementation (Section 5) and the numerical examples (Section 6) and wrote this publication, while Jörg Fehr and Bernard Haasdonk contributed to the structuring, correcting, and editing of the article.

FINANCIAL DISCLOSURE
The authors would like to thank the German Research Foundation (DFG) for the financial support of the project at the University of Stuttgart under grants FE 1583/2-1 and HA 5821/5-1.

CONFLICT OF INTEREST
The authors declare no potential conflict of interest.

APPENDIX: ADDITIONAL THEOREMS AND THEIR PROOFS
The appendix covers the proofs of all properties of Φ11 and Φ12 listed in Section 4.2. Most of the following proofs focus on Φ12 and then give a shorter proof for Φ11 when it is analogous. We start by showing that |Φ12(⋅; t)| is monotonically decreasing and that |Φ11(⋅; t)| has a monotonically decreasing bound on Ω+,left. We omit the absolute-value bars in the case μ ∈ Ω+, since then ω(μ), δ(μ) ≥ 0, which yields Φ11(μ; t), Φ12(μ; t) ≥ 0 with (32) and (33). Keeping this inequality in mind, we now prove that the sinh-term is monotonically decreasing in ω̃ ≥ 0. For ω̃ = 0, this follows from the limit limω̃→0 sinh(ω̃t)/ω̃ = t together with Inequality (A.2); for ω̃ > 0, we inspect the derivative. Since ω(μ) is monotonically decreasing in μ ∈ Ω+,left by definition, Φ11,+,left(μ; t) is also monotonically decreasing in μ ∈ Ω+,left: first, ω(μ) is monotonically decreasing for increasing μ ∈ Ω+,left by definition, and so is cosh(ω(μ)t). For the remaining factor of the form exp(−x)(1 + x), we show that its derivative is non-positive for all μ ≥ μ̃ with μ̃ ∈ Ω+,left due to this monotonicity, which proves the last statement of the theorem. □ The next theorem states that we cannot find any larger value for |Φ11(⋅; t)| and |Φ12(⋅; t)| on Ω− than on Ω+,left.
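The two elementary facts used in the proof above can be made explicit. In assumed notation (ω̃ for the frequency-type variable, x for the non-negative exponent of the remaining factor), a short worked derivation:

```latex
% Limit of the sinh-term for \tilde\omega \to 0 via its power series:
\frac{\sinh(\tilde\omega t)}{\tilde\omega}
  = \sum_{k=0}^{\infty} \frac{\tilde\omega^{2k}\, t^{2k+1}}{(2k+1)!}
  \;\longrightarrow\; t
  \quad (\tilde\omega \to 0).
% Non-positive derivative of the remaining factor e^{-x}(1+x):
\frac{\mathrm{d}}{\mathrm{d}x}\Bigl[e^{-x}(1+x)\Bigr]
  = -e^{-x}(1+x) + e^{-x}
  = -x\,e^{-x} \le 0
  \quad \text{for } x \ge 0.
```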
(a) We first prove two bounds with Lemma 4(e) and the definition of the hyperbolic sine. The statement of the lemma then follows immediately from the decreasing monotonicity of g(x; t) := x⁻¹(1 − exp(−xt)) in x > 0 for all t > 0: we can bound (A.8) by g(2δmin,r; t), since g(2δmin,r; t) ≥ g(2δ(μ); t) = (A.8) for every μ ∈ Ω+ (equivalently, every real δ(μ)) due to this monotonicity. For the decreasing monotonicity of g(⋅; t), we show that its derivative is non-positive; the boundary value at x → 0 follows with L'Hôpital's rule. With the same limit argument, the decreasing monotonicity of g(⋅; t) can be extended to [0, ∞), which proves the statement for δmin,r = 0. □ If we restrict ourselves to Ω+,right instead of Ω+, we can find an even better bound, which holds for every μ ≥ μ̃ with μ̃ ∈ Ω+,right. First, we need some preparations, which are contained in a separate lemma.
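The monotonicity of the auxiliary function used above can be checked directly (a worked verification of the derivative and the L'Hôpital limit for g as defined above):

```latex
g(x; t) := \frac{1 - e^{-xt}}{x}, \qquad
\frac{\partial g}{\partial x}(x; t)
  = \frac{e^{-xt}(1 + xt) - 1}{x^{2}} \le 0,
% since e^{y} \ge 1 + y applied to y = xt gives e^{-xt}(1 + xt) \le 1;
% boundary value by L'Hôpital's rule:
\lim_{x \to 0} g(x; t)
  = \lim_{x \to 0} \frac{t\,e^{-xt}}{1} = t.
```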
Proof. Let μ ∈ Ω+,right for the entire proof. We start by differentiating Φ11(μ; t) with respect to μ and simplifying the condition for the derivative to be non-positive, i.e., for Φ11(⋅; t) to be monotonically decreasing.
It is hard to solve Inequality (A.13) exactly. Instead, we restrict ourselves to all μ ∈ Ω+,right which satisfy tanh²(ω(μ)t) ≥ c for arbitrary 0 < c < 1. This way, we can rewrite (A.13) into a stronger condition and thereby reduce the size of the region of known decreasing monotonicity; the advantage is that tanh is no longer involved. First, we square Inequality (A.13), substitute tanh²(ω(μ)t) by c, and insert the definition of ω(μ)². This leads to Inequality (A.14), which is stronger than (A.13) due to tanh²(ω(μ)t) ≥ c. Inequality (A.14) can be rewritten as p(μ) ≥ 0 by sorting the coefficients by monomials of μ and factoring out the zero at the origin; this lengthy computation is omitted for brevity. The polynomial p has at least two real zeros, since one zero is 0 ∈ ℝ and the remaining three complex zeros cannot all be purely imaginary. We call the largest real zeros x1 and x2 with x1 ≥ x2. Since p(μ) → −∞ for μ → ∞, p is non-negative on the interval [x2, x1]. Together with the condition tanh²(ω(μ)t) ≥ c, which restricts μ again, Inequality (A.14) is satisfied at least for all μ ∈ [x2, x1] ∩ {μ ∈ Ω+,right : tanh²(ω(μ)t) ≥ c}. Since this inequality is stronger than (A.13), these μ also satisfy Inequality (A.13), which is equivalent to the non-positivity of the derivative of Φ11(⋅; t), the condition for decreasing monotonicity. □ Remark 8. Even though Theorem 9 does not deliver all domains in Ω+,right with decreasing monotonicity, it can be used iteratively to enlarge the region of known decreasing monotonicity. We start with an arbitrary μ̃ ∈ Ω+,right with μ̃ ≥ x2 for which the derivative of Φ11(⋅; t) at μ̃ is non-positive. Setting c = tanh²(ω(μ̃)t), we know that tanh²(ω(μ)t) ≥ tanh²(ω(μ̃)t) = c for all μ ≥ μ̃ due to the increasing monotonicity of ω(⋅) ≥ 0 in Ω+,right. Applying Theorem 9 returns the interval [μ̃, x1] in which Φ11(⋅; t) is monotonically decreasing. We can now repeat this by setting μ̃ = x1, defining c = tanh²(ω(μ̃)t), and applying Theorem 9 again.
Since c will now be larger than in the previous iteration, another interval will be found, which can be appended to the interval already found if they overlap. This can be repeated until the interval cannot be extended any further with the help of Theorem 9. Each iteration is cheap to compute, since only basic, one-dimensional computations are performed and the zeros of p can be found with the explicit solution formula for cubic equations, as the zero at the origin is already factored out. For Φ11, we will also show that Φ11(⋅; t) is at least monotonically increasing in a certain interval under some conditions; Theorem 8 only showed the monotonicity of an upper bound. First, we need some preparations, which are presented in a separate lemma: (a) ω(μ) ≤ μ − 1 for μ ∈ Ω+,right.
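The iteration of Remark 8 can be sketched generically. The names below are placeholders, not the paper's formulas: `theorem9_upper` stands for the map of Theorem 9 that, given a start point and the level c, returns the largest zero x1 up to which monotonic decrease is guaranteed, and `c_of` stands for the level c = tanh²(ω(⋅)t):

```python
def extend_monotone_interval(mu_start, c_of, theorem9_upper, max_iter=50):
    """Chain Theorem-9-style intervals [mu, x1]: each call guarantees
    monotonic decrease up to x1 = theorem9_upper(mu, c); restarting at x1
    with the larger level c = c_of(x1) may extend the interval further."""
    mu = mu_start
    for _ in range(max_iter):
        x1 = theorem9_upper(mu, c_of(mu))
        if x1 <= mu:          # no further extension possible
            break
        mu = x1               # append the new interval and repeat
    return mu                 # monotone decrease guaranteed on [mu_start, mu]

# toy stand-ins (NOT the paper's formulas): the level c grows with mu, and
# the guaranteed interval end approaches a fixed point at mu = 3
c_of = lambda mu: min(mu / 3.0, 1.0)
theorem9_upper = lambda mu, c: mu + (1.0 - c)
mu_end = extend_monotone_interval(1.0, c_of, theorem9_upper)
```

Each iteration only performs cheap scalar computations, mirroring the remark's observation that the procedure costs little compared to computing additional eigenvalues.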
(a) ω(μ) ≥ 0 since μ ∈ Ω+,right. Additionally, μ ≥ μ0,right ≥ 1. Therefore, both sides of ω(μ) ≤ μ − 1 are non-negative, and the inequality to be shown is in fact equivalent to β ≤ 1, which is a global assumption. (b) ω(μ) ≥ 0 since μ ≥ μ0,right ⇔ μ ∈ Ω+,right. Below the threshold, the inequality is fulfilled trivially since the left-hand side is non-negative and the right-hand side negative; from now on, we assume the converse. (c) From the definition of μ0,right in (40), we see μ0,right ≤ 2 ≤ s. First, we show that the conditions from Lemma 6(b) are met for s ≥ 2 and t ≥ s: then μ ≥ max{2 − 2/(s − 1), μ0,right} follows from t ≥ s and s ≥ 2. In the second case, for μ0,right ≤ μ below the threshold, the inequality is trivially fulfilled since the left-hand side is non-negative and the right-hand side negative. Hence, the statement holds for every μ ≥ μ0,right and t ≥ 2. □ With these preparations, we are now able to find an area in Ω+,left in which |Φ11(⋅; t)| is monotonically increasing. The derivative expression is non-negative if and only if it is non-negative when multiplied with the positive factor ω(μ)³ exp(−(δ(μ) − ω(μ))t) ≥ 0; so far, only equivalence transformations have been applied. Both sides are non-negative due to t ≥ 2 and β ≤ 1. We now substitute ω(μ) by the inequality of Lemma 6(c) (with t ≥ 2) on the left-hand side and by the inequality of Lemma 4(e) on the right-hand side, which yields a stronger statement instead of an equivalence transformation. This proves that also the first summand of Φ11 is monotonically increasing under the above conditions. □ Remark 9. The start of the interval with known monotonicity in Theorem 10 can be shifted by using Lemma 6(b) instead of Lemma 6(c) in the proof. Then, s needs to be optimized such that 2 − 2/(s − 1) is as low as possible under the condition that the parabola in Theorem 10 has either only imaginary zeros or real zeros that are as small as possible. The switching points μswitch,left/right exist for the discriminant Δ ≥ 0, and there are no real solutions otherwise. In the latter case, we only use B1 as upper bound, i.e., B12,−(μ; t) := B1(μ; t), since B2(μ; t) would be larger.
Then, B12,−(μ; t) is monotonically decreasing in μ. Therefore, we use the lowest possible value for μ, which is μ̃, to obtain a bound that is independent of μ. Let us now assume Δ ≥ 0. Then, μswitch,left/right ∈ Ω− by direct comparison with (39) and (40). One could now define B12,−(⋅; t) as B1(⋅; t) on [μ0,left, μswitch,left] ∪ [μswitch,right, μ0,right] and as B2(⋅; t) on [μswitch,left, μswitch,right]. But then the bound for |Φ12(⋅; t)| would still depend on μ, which would be a disadvantage. Instead, we seek a monotonically decreasing bound, which allows us to bound Φ12(μ; t) independently of μ.
The goal is to find an interval [μmono,left, μmono,right] in which B2(⋅; t) is monotonically decreasing. By squaring and inverting, we reduce this to the quadratic in (A.20); both of its zeros are real due to β ≤ 1, and since the coefficient in front of μ² in (A.20) has a positive sign, B2(⋅; t) is monotonically decreasing in [μmono,left, μmono,right]. This interval does not necessarily need to be a subset of Ω−. By comparison with (39), we see that μmono,left ≤ μ0,left ≤ μswitch,left.
The idea outlined in Figure A.1 is as follows: we know the monotonicity of B2(⋅; t) only up to μmono,right, but in the end we want to reach the point μswitch,right with a bound that is as low as possible and monotonically decreasing at the same time. This leads to the following definition:
B12,−(μ; t) :=
  B1(μ; t)                              if μ0,left ≤ μ < μswitch,left,
  max{B2(μ; t), B2(μswitch,right; t)}   if μswitch,left ≤ μ ≤ μswitch,right,
  B1(μ; t)                              if μswitch,right < μ ≤ μ0,right.
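The max-construction used here can be mimicked generically: a bound that first decreases and then increases is made monotonically decreasing by clamping it from below with its value at the right endpoint. A small numerical sketch (the names `f` and the parabola are illustrative stand-ins, not the paper's B2):

```python
import numpy as np

def monotone_envelope(f, mu_grid, mu_right):
    """Clamp f from below by f(mu_right) so that the result is
    monotonically decreasing whenever f is decreasing-then-increasing
    on [mu_grid[0], mu_right]."""
    return np.maximum(f(mu_grid), f(mu_right))

# toy bound: decreasing, then increasing (a parabola in mu)
f = lambda mu: (mu - 2.0) ** 2 + 1.0
mu = np.linspace(0.0, 3.0, 301)
env = monotone_envelope(f, mu, 3.0)
print(np.all(np.diff(env) <= 1e-12))   # envelope is monotonically decreasing
```

The envelope follows f on its decreasing part and stays constant at f(mu_right) once f would dip below that level, exactly the role of the max{B2(μ; t), B2(μswitch,right; t)} term above.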
In summary, we know |Φ12(μ; t)| ≤ B12,−(μ; t) for μ ∈ Ω− and that B12,−(⋅; t) is monotonically decreasing on Ω−. Due to this monotonicity, we have B12,−(μ; t) ≤ B12,−(μ̃; t) and therefore |Φ12(μ; t)| ≤ B12,−(μ̃; t) for any μ̃ ∈ Ω− and μ̃ ≤ μ ≤ μ0,right, which proves the statement. (b) The proof for Φ11 is similar to the one for Φ12; therefore, we only sketch it and note the differences. We redefine most of the variables used in the proof above unless they are already fixed in the formulation of the lemma (like μmono,left). Again, two bounds are available: the bound B1(⋅; t) is monotonically decreasing, as can be seen from its derivative. The meaning of μswitch,left and μswitch,right stays the same, which already proves the first case in (A.19).