Stage-parallel preconditioners for implicit Runge–Kutta methods of arbitrarily high order, linear problems

Fully implicit Runge–Kutta methods offer the possibility to use high order accurate time discretization to match space discretization accuracy, an issue of significant importance for many large scale problems of current interest, where we may have fine space resolution with many millions of spatial degrees of freedom and long time intervals. In this work, we consider strongly A-stable implicit Runge–Kutta methods of arbitrary order of accuracy, based on Radau quadratures. For the arising large algebraic systems we introduce efficient preconditioners, that (1) use only real arithmetic, (2) demonstrate robustness with respect to problem and discretization parameters, and (3) allow for fully stage-parallel solution. The preconditioners are based on the observation that the lower-triangular part of the coefficient matrices in the Butcher tableau has larger in magnitude values, compared to the corresponding strictly upper-triangular part. We analyze the spectrum of the corresponding preconditioned systems and illustrate their performance with numerical experiments. Even though the observation has been made some time ago, its impact on constructing stage-parallel preconditioners has not yet been done and its systematic study constitutes the novelty of this article.

Runge-Kutta (IRK) methods, based on Radau (also referred to as Radau IIA) quadratures.To be specific, consider a system of evolutionary equations of the form M u(t) t + Ku(t) = f(t), (1) arising after semidiscretization in space of some non-stationary linear partial differential equation of the type u t − Δu + b ⋅ ∇u = f (t), equipped with appropriate initial and boundary conditions.In (1), M is a mass matrix, K is a stiffness matrix and u(t), f(t) are vectors.
As the name suggests, evolution equations describe the time evolution of a physical system starting from given initial data.
In many real-life problems, where we need to resolve fine features of the solution during the time evolution, Such problems are, for instance, convection-diffusion problems where various types of layers are developing in time.In this study, we exclude from consideration explicit time integrators.When we use a stable implicit time-integrator such as the backward Euler, the trapezoidal method or the Crank-Nicolson method (cf.Reference 1), the time-step must still be small to guarantee a sufficiently small time-integration error that balances a small space discretization error.Therefore, there are strong reasons to apply higher order time-integration methods, which can enable the usage of much fewer time-steps and combine small total discretization error with fast integration in time.
As is also well known, see for example, Reference 2, classical multistep methods cannot have a higher order than two, otherwise they are not stable for all eigenvalues of the evolution operator M −1 K, that is, they are not A-stable.This can be a severe limitation because in many problems, such as network problems and time-harmonic Maxwell's equations there can appear rapidly changing oscillations, leading to the appearance of eigenvalues, widespread in the whole right half complex plane.On the other hand, in Reference 3 see also Reference 4, it is proven that there exist implicit Runge-Kutta methods of an arbitrary high order that are A-stable.
Implicit Runge-Kutta methods, known also as collocation methods, are first presented in Reference 5, giving the methods the name "IRK."Their general form to solve a problem of the form u ′ = F(t, u(t)) reads: and the so-called stage variables are computed as v (k) i = u (k) +  ∑ q j=1 a ij F(t k + c j , v (k) j ), i = 1, … , q.The method ingredients are the number of stages q, the Runge-Kutta matrix A q = [a ij ], the weights b i that depend on the particular integration method, and the stage nodes c i , which are the zeros of the shifted Legendre polynomials of degree q, defined in the interval [0, 1] and normalized at 0, satisfying ∑ q i=1 c i = 1.Further,  is the time step, u (k) and u (k+1) are the numerically computed solutions at time-steps k and k + 1.
Independently, such methods are presented in Reference 6 and it is shown that due to their high order of approximation and stability properties, they could be considered as global integration methods, that is, it could suffice to use just one or very few time-steps, that is, very large time-step intervals.In these original papers the A-stability property of the methods is not shown, but it is shown later in References 7 and 3.
There are several versions of IRK methods. 8-11The most familiar methods are based on inner interval integration points being the zeros of some particular polynomials, namely, the Gauss integration method (G), the Radau or Gauss-Radau integration method (R) and the Lobatto or Gauss-Lobatto integration method (L).
The approximation order (p) at the endpoints of each time-interval of methods (G), (R), and (L) is correspondingly p = 2q, 2q − 1, and 2q − 2. The approximation order at the interior integration points, however, is only of order q, compare, for example, Reference 5. We stress, that IRK are particularly important for large time intervals where we are somewhat less interested in the behavior of the solution in intermediate time instances but rather mainly of the solution of the underlying evolution systems at the final time in the considered time interval.
All methods (G), (R), (L) are A-stable, compare, for example, Reference 12.However, methods (G) and (L) are not strongly A-stable because for them the absolute value of the stability function converges to unity, implying, in the presence of large eigenvalues of K or the Jacobian matrix, at least a linear growth of rounding errors with increasing number of repeated time steps.In this work, we advocate the Radau method which is the only strongly A-stable (also called strongly L-stable) method, for a definition, see for instance. 6,9,13Furthermore, the Radau method (referred to as stiffly accurate, compare, e.g., Reference 9) does not suffer from order reduction, that can occur for instance in the solution of systems of differential-algebraic equations, see, for example, References 14 and 15.
The high accuracy of the IRK methods makes them very attractive, in particular, in the case when the underlying problem requires a very fine space resolution and the time interval to integrate over is large.The drawback with IRK is that in each timestep we have to solve the arising large algebraic system of order qn, where q is the number of stages of the method, that is, equals the degree and number of zeros of the Legendre or the combination of Legendre polynomials and n is the size of M and K.For general IRK methods all stages are coupled and cannot be straightforwardly decomposed in some easier to handle form.The solution of this system can be costly and somewhat involved, which has been the major reason why IRK methods are more rarely used.In practice, the system must be solved by some preconditioned iterative method, preferably, implemented efficiently in a parallel computer environment.
We mention for completeness, that in order to avoid the complications when employing implicit time-integration methods, there have been efforts in applying Richardson extrapolation and similar techniques in combination with explicit Runge-Kutta methods.The extrapolation still enables extension of the stability region, see, for example, Reference 16 and 17.However, in addition to be forced to choose time-steps to make a stable recursion, the methods do not have the high order of discretization error, inherent in the IRK methods.
For an earlier discussion of solution techniques for IRK methods, see References 18-20, also References 21 and 22, where diagonally implicit Runge-Kutta (DIRK) methods are presented.DIRK methods allow parallel implementation but have a much lower order of approximation, compared with the full IRK methods and, therefore, force the use of smaller time-steps.Since some time an alternative approach to solve time-dependent partial differential equations have been used, see, for example, References 23-25, based on combined time-space finite elements.This means that a 2D space partial differential operator is solved in a 3D space-time domain and a 3D space partial differential operator is solved in a 4D space-time finite element mesh.Clearly this complicates the implementation of the method, but such methods enable the use of adaptive mesh resolution methods in both time and space.
As stated above, in this article, we consider the solution of time-dependent partial differential equations and, to motivate the need of large timesteps, assume that the time interval is large.
The major topic of this study is the construction of preconditioning methods for full IRK.The preconditioners considered here differ from other similar preconditioners, being based on a simple observation from Reference 6, which requires only real computations, enables very numerically efficient preconditioning techniques, which are also fully parallel with respect to the stages.The novelty in this study is that we formulate, analyze, implement the suggested preconditioning methods and show their robustness and good parallelization properties on problems of varying difficulty.The construction of the preconditioners is based on the inherent properties of the Runge-Kutta matrices and not on purely algebraic manipulations with the matrices as in other studies.To our knowledge, this is the first real-arithmetic, discretization-robust, stage-parallel fully implicit Runge-Kutta preconditioning approach.
In this study, we implement and show the performance of the methods on large scale distributed memory high performance computer platform for the stage-serial case.Stage-parallel implementations using a memory efficient matrix-free FEM framework-including comparisons to the stage-serial case and implementation details for example, necessary communication patterns, virtual typologies and so forth warrant a separate study, available in Reference 26.(For completeness we state that the idea is first illustrated in Reference 27 without rigorous theoretical background and parallel performance tests.) The article is structured as follows.In Section 2, we present the Radau type of IRK methods.Section 3 summarizes earlier approaches to solve the large IRK matrices efficiently and in parallel.The preconditioners, suggested in this article are stated in Section 4 and analyzed in Section 5. Section 6 discusses implementation issues and Section 7 contains illustrative numerical experiments.Conclusions and outlook are found in Section 8.
For more details we refer to the technical report in Reference 28.

IMPLICIT RUNGE-KUTTA METHODS
Time-integration methods are repeated on each time-step, whereby the end solution of the previous step is used as the initial solution for the next time interval.Therefore, we do not use overlapping time-steps but instead advocate to utilize high order methods and very long time steps.To describe the method it suffices to consider just a single time-integration interval, (0, ),  > 0, without any overlap.
In the subsequent derivations I m denotes the identity matrix of order m.We utilize also the following tensor algebra identity with ⊗ being the matrix tensor product, known as well as Kronecker product, (2)

Discrete tensor product matrix forms
To give an instance, consider the differential equation (1), that is, This can be written as where v(t) = du(t) dt , that is, u(t) = ∫ t 0 v(s)ds + u 0 .Following Reference 3, to solve (3) we approximate the integral term as ∑ k i=0 a ik v(c i ), by using Radau quadrature.Let  q be the shifted Legendre polynomial of degree q, normalized at 0. We integrate from 0 to c i , i = 1, … , q, where c i are the zeros of  q (t) −  q−1 (t), 0 < t ≤ 1, use the Lagrange interpolation polynomials for the points As shown in for example, References 29 and 30, the matrix A q = [a ik ] q i,k=1 can be presented in the form of a product of four matrices, where C = diag{c 1 , c 2 , … , c q }, R = diag{1, 1∕2, … , 1∕q} and V is the Vandermonde matrix, generated by c i , that is, Since the zeros {c i } of the Legendre polynomials are distinct, V is nonsingular, thus, invertible.Clearly the IRK quadrature matrix A q is nonsingular too.
Having computed A q , by use of the numerical integration ∫ c i 0 ṽ(s) ds = ∑ q k=1 a ik ṽ(c k ), i = 1, 2, … , q, and using tensor product representation, we obtain the algebraic form of (3), ) and e q is a vector of length q with all components 1.The matrices M and K are real of size n × n.The block vector v of length qn has block components v i , i = 1, … , q and each v i is of length n.Multiplying (4) from the left by A −1 q ⊗ I n and utilizing (2) we obtain the alternatively transformed form of (4), This transformation is suggested in Reference 31, see also Reference 29.The systems (4) and ( 5) involve the qn × qn block matrices, respectively, The IRK framework is conveniently formulated via the so-called Butcher tableau, 32 that is, ] .We note here, that for the Radau quadrature method c q = 1, then b i = a qi , i = 1, 2, … , q.For further comments, see Reference 29.Explicit forms of the matrices A q and A −1 q for various values of q are given in Reference 28 Appendix A.

A SUMMARY OF SOME EARLIER PRESENTED STAGE-PARALLEL PRECONDITIONING METHODS
Clearly, for large scale algebraic problems, to solve systems with the matrices  1 or  2 in (6), using an exact block matrix factorization method in a straightforward manner is infeasible as it would lead to full matrices and would be clearly too expensive in computer time and memory demands.The way to reduce the computer resource demands is, instead of direct solution methods, to use a preconditioned iterative solution method, such as the generalized conjugate gradient method (GCG) 33 or the generalized minimum residual (GMRES) method, 11 entailing the task to construct an efficient preconditioner.
Over many years, there have been various attempts to construct parallel solution methods for IRK problems, either by approximating the method by some lower order method on simpler form, by constructing a parallel preconditioner or by extending the method to a multistage method on several time steps.We list here some of these methods.
There are three general types of approaches that achieve some parallelism-across the method (across stages), across the problem, and across the time-steps.
For a more general discussion of parallelization in Runge-Kutta methods, we refer to Reference 34.
An early attempt to achieve parallelism across the method is to apply diagonally-implicit Runge-Kutta methods, also called DIRK methods, see, for example, References 35 and 19, also Reference 22.However, as already mentioned, the methods have lower order of approximation (for a given number of stages), and thus require smaller and, therefore, relatively many time steps.Another possibility is in using a diagonal matrix as a preconditioner to the IRK method.An early experience is found in Reference 36.There, the linear systems are decoupled into subsystems, which means that the cost of the full LU factorization of the global matrix is reduced to q factorizations of the diagonal blocks of size n.
Examples of approaches to achieve parallelism across the problem can be found in Reference 37, see also the references therein, exploring the possibility to solve the problem in subintervals in parallel.Further, as pointed out in Reference 36, the construction of the preconditioner can be based on the Vandermonde matrix representation of the quadrature matrix.
The parareal approach, for example, in Reference 38 is to seek parallelism across time.

Using decompositions of A q or A −1 q
One strategy to construct a stage-parallel preconditioner is based on the spectral decomposition of the quadrature matrix A q , 1 ,  (1)  2 , … ,  (1)  q }.
The matrix  1 (6) can then be transformed into a factorized form, namely, The second option is to consider 1 ,  (2)  2 , … ,  (2)   q } and obtain We see that  1,ii and  2,ii are block-diagonal matrices that decouple the stages.However, some of the eigenvalues of A q or A −1 q are complex, which entails the usage of complex arithmetic and increases the computational cost.We refer to Reference 39, where a Jordan normal form based approach is suggested, which, if A q is diagonalizable, allows the transformation to a block diagonal system, enabling stage parallelism on the cost of, in general, solving complex systems for the complex-conjugate pairs of eigenvalues in real arithmetic.One can exploit the complex-conjugate property in the solver but some standard techniques for solving complex systems limit the achievable performance-see Reference 26 for a detailed comparison.
In Reference 31, ILU-based stage-coupled and stage-parallel preconditioners are presented.There, the stage-parallel preconditioner is the block-diagonal part of the block-LU factorization of the system matrix, where the blocks are ILU-factorized themselves.The experiments indicate that the stage-parallel (referred to as stage-uncoupled) preconditioner is numerically less efficient than that of the stage-coupled version.
A comparison study of a series of LDU-based preconditioning approaches, applied on the non-transformed Equation ( 4) is recently presented in Reference 40.The authors test various combinations of the LDU factorization of A q .The preconditioners are implemented in a block-Jacobi and Gauss-Seidel fashion.The numerical results, presented there, indicate that the  LD option is most robust in terms of small condition number.The approach does not utilize any particular property of the Runge-Kutta matrix and does not provide estimates of the spectrum or the condition number of the arising preconditioned matrices.
Another recent work on preconditioning IRK, related to the considered framework, is found in Reference 41.The preconditioner is based on a form of the system matrix, similar to that of  2 , and on a closed form of the inverse of the matrix A −1 q ⊗ I n + I q ⊗ M −1 K.The method presented there exploits the complex-conjugate property of the eigenvalues.It is inherently stage-serial as it requires the sequential solution of stage(-pairs).

Transforming A q to avoid complex arithmetic
To construct a preconditioner that requires only real arithmetic, it is suggested to use some transformation of A q , for instance, where the entries of W are computed as w ij =  j−1 (c i ) with  i (x) being the shifted Legendre polynomial of degree i and The matrix X is tridiagonal.The above W-transformation is first published in Reference 8 and also advocated in Reference 42.In this way the complex arithmetic is avoided and the decoupling cost for the parallel implementation of this approach corresponds to that for the implicit Euler method.In Reference 42, the factorization ( 9) is used to obtain a preconditioner of block-tridiagonal form, which is then LU-factorized and simplified using a method parameter to be determined.

Other approaches
In Reference 43, standard preconditioners for low-order time discretizations are used to construct order-optimal diagonal block Jacobi preconditioners for high order discretizations.The convergence properties of the methods are improved in Reference 44 by employing block Gauss-Seidel techniques.In Reference 45, optimal complex and real Schur-based preconditioners are compared, together with a block Jordan-form preconditioner and a near optimal singly diagonal approximate block real Schur decomposition is derived.The latter in particular has memory requirements and setup cost comparable to singly DIRK methods.
An example of a parallel across the problem method is the waveform relaxation technique, see for instance, Reference 46.A drawback to mention here is that for stiff problems the convergence of these methods to the exact solution can be very slow.A similar technique is to use a time-harmonic expansion of the solution, which is a natural approach for problems with alternating source functions, such as in time-harmonic electromagnetic problems, compare References 47 and 48 and the references therein.In such an approach the problem decouples into independent subproblems per angular frequency and in this way there is no need to use any time-integration method at all.Another approach to use Fourier expansions is to extend the interval [0, ] to [0, 2] and use the backward form of the IRK method on the symmetrical interval [, 2].Here the solution can be expanded in e i k t terms, which, again leads to uncoupled problems for each frequency  k and can be solved fully in parallel, compare Reference 48.
For parabolic problems of the form one can use the Laplace transform with parameter s.We then have The solution u(t) can be computed via the inverse Laplace transform u(t) = 1 2 i ∫ Γ e st û(s) ds, where Γ is a contour in the right half plane, that is, contains no eigenvalues of sI + A.Here the integral is approximated with a quadrature formula with nodes s j .Then one only needs to compute û(s) at all s j , which can be done fully in parallel.For details, see References 49 and 50.
Another possible parallelization method is the boundary value technique, that has appeared in Reference 51.Here the whole time interval is solved by computing the equation as arising in the implicit trapezoidal method with a backward midpoint rule at the endpoint.The whole system then becomes block tridiagonal and can be solved in various ways by common methods for two-point boundary value problems.
Since long, multigrid methods have successfully been used to solve space discretized problems and also have been shown to be efficiently parallelized.Parallel-in-time, that is, parallelizable across time methods have also been developed, such as multiple-shooting techniques, compare, for example, Reference 52.The parareal framework has further renewed the interest in such methods, see References 38 and 53.We claim that if one can solve the full IRK method in a stage-parallel manner in each time step, there would less need for such multiple timestep methods.However, these can be useful to obtain solution with higher accuracy also in interior integration points.A GPU implementation of the fully implicit IRK method is found in Reference 54, applied to a system of ODEs.There, the matrices to be solved have the same structure as in (7) and the preconditioner has a tridiagonal structure.
Related to IRK methods, however out of the context of parallel implementation, preconditioners for quadratic matrix polynomials are analyzed in Reference 55.
In Reference 56, a monolithic multigrid for IRK methods, applied to incompressible flows is studied.The coupled systems are solved by the designed multigrid, choosing all computational ingredients of the solver to be parallelizable, however, not aiming to achieve parallelization across the stages.

STAGE-PARALLEL PRECONDITIONING METHODS FOR THE RADAU TENSOR PRODUCT SYSTEM
From now on we consider IRK methods based on Radau quadratures (Radau IIA).The derivations apply to other IRK methods for which the Butcher matrices A possess the same properties as those for Radau IIA, discussed in Reference 6.
It is straightforwardly observed that the first term in the matrix  1 , respectively, the second term of the matrix  2 in (6) involve a block-diagonal matrix and a plausible aim is to get a simpler form of the other matrix term as well.The approach to use the spectral decomposition of A q achieves that, however, on the cost of using complex arithmetic.
In this study, we retain the framework, however, not using the spectral decomposition of A q .As shown in Reference 6 (p.75), for IRK methods based on Radau quadratures, A q has a dominating lower-triangular part in the sense that it has larger in magnitude entries than the strictly upper triangular part.This property holds also for A −1 q as well as the corresponding exact LU-factorizations of A q (A q = L (1)  q U (1) q ) and A −1 q (A −1 q = L (2) q U (2) q ).Therefore, a possibility to precondition  1 and  2 is by the matrices  1 = I q ⊗ M + L (1)  q ⊗ K and  2 = L (2)  q ⊗ M + I q ⊗ K, (10) respectively, where L (1)  q and L (2)  q are the lower-triangular factors in the exact LU-factorizations of A q , respectively, of A −1 q .This leads to achieving a block lower-triangular form of  1 and  2 , however, in this way the solution with them involves successively solving systems with the block-diagonal part (I q ⊗ M + (L (1)  q ⊗ K)) ii of  1 and correspondingly for  2 , and does not allow for any parallelism between the stages of the method.
Clearly, a preconditioner based on directly using the lower-triangular part is mainly sequential and although there exist general techniques to handle solutions with block lower-triangular matrices in parallel, the amount of parallelism to achieve is rather limited and we leave this approach out of consideration.
Next we make a choice between  1 and  2 .For  1 the elimination requires solutions with matrices (I q ⊗ M + (L (1)  q ⊗ K)) ii .When K is ill-conditioned then these block matrices are also ill-conditioned and we need a good preconditioner of these matrices too, for instance of algebraic multilevel (AMLI) type or of algebraic multigrid (AMG) type method. 57,58There exists various possibilities to employ parallel solvers for these inner systems, but we do not discuss this any further.
The same type of successive elimination method can be applied also for (5).Here the block-diagonal matrices equal  2,ii which may have a much better form for the efficiency of the inner solution than  1,ii .Furthermore, (i) the off-diagonal blocks arise now from the matrix term A −1 q ⊗ M, which is in general sparser than A q ⊗ K, in particular if M is a lumped mass matrix, hence, we save also in memory demand; (ii) in addition, any ill-conditioning of K does not harm the off-diagonal blocks.
Since solving (5) instead of (4) can significantly reduce the computational cost, it can be preferable and this is also the approach taken in this article.Thus, we consider only the form (5). To simplify the notations, we drop the subscript 2 in  2 and in  2 , and the superscript (2) in L (2)  q , thus, we deal with  = A −1 q ⊗ M + I q ⊗ K.We consider here a particular LU factorization of A −1 q , namely, we let A −1 q = L q U q , where the factor L q is real-valued lower-triangular and the upper-triangular factor U q has a unit diagonal, thus, U q = I q + Ûq with Ûq strictly upper-triangular.To take an example, for q = 2, then we have ][ that is, || Û2 || = 1∕3.For q = 3 and q = 4 we have correspondingly, || Û3 || = 0.4098, || Û4 || = 0.4779 and We notice, that the lower-triangular part of A −1 q and L q itself possess the property that their diagonals have alternating signs.Therefore, they can be seen as some sort of difference matrices.There holds also that L −1 q are positive matrices (with all entries positive).The property is observed for all q up to 12.

Preconditioner, based on the spectral decomposition of L q
Consider the matrix in (10) (right) to be used as a preconditioner to , The subscript ′ L ′ denotes that this preconditioner uses the complete factor L q .If applied straightforwardly, solutions of systems with  L do not allow parallelism across stages.Similarly to some earlier attempts towards overcoming this and aiming to obtain a fully stage-parallel preconditioner, compare, for example, References 39 and 59, instead of using the spectral decomposition of A −1 q we use that of L q .So, we assume that L q is a good approximation to A −1 q and construct the spectral decomposition L q = T q Λ q T −1 q .Here, T q contains the eigenvectors of L q and Λ q is the diagonal part of L q that contains the eigenvalues of L q , all real.We then where Then, the solution of the generic system  L x = f is performed in the following steps: (1) Solve The eigenvectors of L q (the columns of T q ) can be computed by a simple recursion, as shown in Reference 28 Appendix B, or obtained by some available linear algebra software.

Preconditioner, based on a diagonal approximation of L q
Let  = max{L q,ii , i = 1, … , q} be the maximum eigenvalue of L q .Consider the matrix as a preconditioner to .The consequence of the approach ( 13) is that all blocks in  I q ⊗ M are the same and in the computations we do not need to involve the eigenvector matrices T q .The numerical examples in Reference 27 are made using the latter framework.There, no theoretical analysis is provided and some preliminary experiments in Matlab are included.We analyse the spectral properties of the corresponding preconditioned matrix in Section 5.2.
Remark 2. Clearly, the approach can be applied to A q , which is tested in Reference 40.The variant  LD from that article would correspond to our approach when applied to A q .As mentioned, there it is found empirically that this variant is numerically most efficient, without considering any particular properties of A q .

SPECTRAL PROPERTIES OF THE PROPOSED PRECONDITIONERS
We analyze the spectral properties of the preconditioned matrices  −1 L  and  −1 D .

Spectral properties of the preconditioned matrix  −1 L 
We make first the general observation that where Ûq is the strictly upper triangular part of U q .Denote Q TA B L E 1 Some characteristics of A −1 q , L q , and Ûq for different number of stages q, S q = We next attempt to derive an upper bound of the spectral radius of the preconditioned matrix  −1 L .To this end, as an analysis tool we use the field of values (A), defined as (A) = {x * Ax ∶ x ∈ C, ||x|| = 1}, where x * denotes the conjugate transpose of the vector x.Let A and B denote generic square matrices, (A) be the spectrum of A and (A)its field of values, (A) and (A) be the spectral and the numerical radius of A, respectively, defined as
The spectrum { j }, j = 1, … , n depends on the time-independent part of the equation at hand, whether it is of diffusion with or without anisotropy, convection-diffusion, or only of advection type, as well as of the chosen discretization in space.The mass matrix M is symmetric positive definite and in the case of a space operator of the form −Δu + b ⋅ ∇u, K is also positive definite.For piece-wise linear finite element discretization in space with a mesh-size h and for arbitrary real vector x, such that ||x|| = 1, we can estimate  j from the following relations, Here c i > 0, d i > 0, i = 1, 2 and c1 = d 1 ∕c 2 ,c 2 = d 2 ∕c 1 are constants, not depending on h.The estimates (16) of the eigenvalues of the global FEM matrices are expressed in terms of the extremal nonzero eigenvalues of the element stiffness and mass matrices and the maximal degree of the nodes in the discretization mesh, compare, for example, Reference 63.
We state the following conjecture.
Conjecture 1.Consider the matrix  = A −1 q ⊗ M + I q ⊗ K, where K and M are positive definite matrices, corresponding to a finite element discretization of a second order elliptic operator.Further, let A q be the matrix,  L  -q = 2, q = 6, q = 9, circle with radius ||Û q || and centered at 1. corresponding to a fully implicit Runge-Kutta method of Radau type with q stages and I q be the identity matrix of order q.Let A −1 q = L q U q be the exact LU factorization of A −1 q , such that U q has unit diagonal.Denote Û = U q − I q , strictly upper triangular and assume that || Ûq || < 1. Define the matrix  L = L q ⊗ M + I q ⊗ K. Consider the generalized eigenvalue problem

F I G U R E 2
Then, all eigenvalues  are contained in a circle, centered at 1 and with radius || Û||, strictly less than one.Furthermore, there are at least n eigenvalues that are exactly equal to 1. Remark 3. Conjecture 1 is supported by strong numerical evidence.Figures 1 and 2 illustrate the spectrum of  −1 L , showing that it is favorably clustered around 1. Figure 3 illustrates the spectrum for the pure convection case from Problem 3. Note that in the latter case the assumptions of the conjecture are not satisfied.The circular bound of the spectra is also shown in the figures.However, using various relations between spectral and numerical radius, norms, singular values, induced norm techniques, rigorous derivation has not been possible to show.We sketch below one of the possible approaches.
Given that  > 0, to estimate the smallest |r 1 | it is enough to show that all r ∈ (( Since M is symmetric and positive definite, the following similarity relation holds true, ) .
Therefore, (19) As K is symmetric positive definite it induces a norm and an inner product.Therefore, in order to show that S ≥ 0, it is enough to have that |S od | ≤ S d .We have: where we use the Cauchy-Schwarz inequality on the right relation.Now it suffices to show For q = 2, in order for (21)

Spectral properties of the preconditioned matrix  −1 D 
For the spectrum of the preconditioned matrix  −1 D  the following result holds true.
Theorem 2. Consider the matrix  = A −1 q ⊗ M + I q ⊗ K, where K and M are positive definite matrices, corresponding to a finite element discretization of a second order elliptic operator.Further, let A q be the matrix, corresponding to a fully implicit Runge-Kutta method of Radau type with q stages and I q be the identity matrix of order q.Let A −1 q = L q U q be the exact LU factorization of A −1 q , such that U q has unit diagonal.Let  be the largest eigenvalue of L q .Define the matrix  D = I q ⊗ M + I q ⊗ K and consider the generalized eigenvalue problem Then, all eigenvalues  are contained in a circle, centered at 1 and with radius  q > 1, where  q depends only on q.In addition, the real parts of the eigenvalues are larger than a constant C q > 0, depending only of the number of stages q.Proof.To analyze the spectrum of the preconditioned matrix  −1 D  we consider the generalized eigenvalue In order to derive spectral bounds we rewrite the matrices in matrix-tensor product form: Then, using (P5) we obtain From Table 1 we see that 1 <  q < 1.5 for q ≤ 12. Since the bound in (23) is larger than 1 and we want to estimate  =  + 1, we derive a bound of the real part of .
Due to the matrix-tensor form of Q D = Z 1 ⊗ Z −1 2 , using (P5), we have that  =  1  2 , where  1 is an eigenvalue of Z 1 and  2 is an eigenvalue of Z −1 2 .From (P6) we are in a position to estimate  2 , which are real and positive, namely, Further, from Table 1 we see that, at least up to q = 12, | 1 | < g q < 1.Therefore, Re( 1 ) < g q and Re() = Re(1

Choice of inner solvers
As already mentioned, the systems (4) and ( 5) are in general very large scale, of order qn but with the approach taken here we solve systems only of size n.In typical problems, in particular for three dimensional problems in space, n itself is very large.We assume that we can either construct a feasible approximation of the diagonal blocks  j M + K, j = 1, 2, … , q or can solve the inner systems with some efficient iterative solution method.A suitable approximation could for instance be based on a modified incomplete factorization, see References 64 and 65 with sparse matrix factors.When solving systems with the diagonal blocks we can apply some algebraic multilevel technique, see, for example, References 66, 67 and 60, or an algebraic multigrid (AMG) solver, which we utilize in this study, using library-provided AMG implementations.L  (red)-q = 3, 9, circle with radius ||Û q || and centered at 1.

Parallelization aspects
As the target solution method is a preconditioned Krylov Subspace iteration, the two major ingredients are the matrix-vector multiplication with the matrix  and the solution of linear systems with the preconditioning matrix  L or  D .The matrices ,  L , and  D are of dimension qn × qn.They are never explicitly stored but rather we utilize their structure to implement the above two operations efficiently in parallel.We aim at distributed memory MPI-based implementation of the computations.In the chosen parallel computing environment there are two general strategies to administer the fully implicit IRK method and the preconditioner  L in (12).
The first strategy is to distribute the space discretization mesh among a number of processes.This entails that the solution of each system on the diagonal of  d is done in parallel in space, using well-known parallelizable methods, such as multilevel, multigrid, domain-decomposition techniques and so forth implemented and optimized in some of the established scientific computing libraries.This is the implementation chosen for the numerical tests in this work and we do not implement parallelism across the stages.
When it comes to implementing simultaneous parallelization across stages and in space, different strategies could be employed.This aspect of the implementation is a topic of an independent study. 26The strategy, chosen there, is based on allocating q groups of processes and within group carrying out space parallelism, using matrix-free operations and matrix-vector multiplications in a block-fashion.The tests include discretization with higher order finite elements, comparisons between a stage-parallel and stage-serial implementation as well as between the performance of  L using only real arithmetic with that of the complex factorization from Reference 39.

NUMERICAL TESTS
The parallel tests are run on the Rackham cluster at the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX).Rackham consists of 486 nodes, each node having two 10-core Intel Xeon E5 2630 v4 at 2.20 GHz/core.We use nodes with 128 GB of memory.The test problems are implemented in the deal.II FEM library, 68 interfacing with PETSc 69 to utilize the available preconditioned iterative methods.The space discretization for Problems 1 and 2 is done by piece-wise linear and for Problem 3-by third-order conforming finite element basis functions on quadrilateral meshes.The arising systems for the stage variables are solved by an inner-outer solution procedure.Due to the latter the outer solver must allow for variable preconditioners.Our choice of outer solution method is GCR (Algorithm 3.1 in Reference 57), which, together with GCG and FGMRES, is capable of handling variable preconditioning.For Problems 1 and 2 the inner block systems in  L and  D are solved by AMG-preconditioned CG (P1)/GCR (P2).For Problem 3 the blocks are solved using the MUMPS sparse direct solver.
The tests, presented here, utilize only space parallelism, that is, the block systems in the preconditioners are solved one after another.The full stage-and space-parallel implementation is outlined in Section 6.2 and implemented and tested in Reference 26.
The performance of the IRK method and the proposed preconditioner are illustrated using the following three test problems.For the tests we choose a x = a y = 2 and a t = 0.5.For the outer solver we use a variant of GCR with relative stopping criteria 10 −8 .For the q block-systems in the preconditioner we solve using CG preconditioned by an AMG, with a relative stopping criteria of 10 −6 .We fix  = 0.1 and take five time-steps.The tests are with two and nine stages.
with  = 1, initial condition u(x, y, 0) = u 0 -the pyramid function shown in Figure 5, boundary conditions 5 sin(kt), g(x, y, t) = 2y(1 − y) sin(kt) and f (x, y, t) = 2e −x .As we deal with non-homogeneous and time-dependent boundary conditions, we follow the general practice to construct a partial solution (x, y, t) that satisfies the boundary conditions and then reformulate the problem to find v = u − .In order to homogenize the boundary conditions we choose (x, y, t) = e −x x 2 y(1 − y) sin(kt).We find then that the function v(x, y, t) = u(x, y, t) − (x, y, t) is the solution of the initial boundary problem t − Δ + b ⋅ ∇, boundary conditions v = 0 on y = 0, y = 1, x = 1 and v n = 0 on x = 0, initial condition v(x, y, 0) = u 0 (x, y) − (x, y, 0) = u 0 (x, y).In detail, e −x sin(kt).The additional complication with the -scaling is taken into account when constructing the preconditioner as follows.The system equation, analogous to (4), now reads where Σ q is a diagonal matrix, containing the values of  in the integration points.Multiplying the latter equation by A −1 q Σ −1 q from the left we obtain As for Problem 1, we then select the preconditioner to be  L ∶= (LΣ −1 q ⊗ M + I q ⊗ K).Finally we decompose L q Σ −1 q = TΛT −1 and obtain an analogous form of the preconditioner used in Problem 1 with the difference that the eigenvalue decomposition is computed in every time step.The cost of this is low, however, as the matrix in question is of size q × q.The outer solver is GCR with stopping criteria ||Ax − b|| 2 < 10 −8 .For the q block-systems in the preconditioner we now use GCR preconditioned by an AMG with stopping criteria ||Ax − b|| 2 < 10 −6 .Problem 3 (70).Consider the convection-diffusion equation (24)  Here x c , y c , and  are the center of the pulse and the standard deviations, respectively.The corresponding analytical solution for a constant diffusion  is given by , where x = x cos(4t) + y sin(4t) and y = −x sin(4t) + y cos(4t).The time interval is [0,  2 ], equal to the time for one complete rotation of the pulse.
As noted in Reference 70, the behavior of this problem changes from advection-dominated in most of the domain to diffusion-dominated close to the origin.We solve this problem in the purely advection case, namely, for the numerical experiments we choose  = 0, x c = −0.5, y c = 0, and  = 0.0447.As the value of the analytical solution on the boundary of the domain is less than the machine accuracy we choose homogeneous boundary conditions.The inner multigrid solver used in Problems 1 and 2 is not suitable for the pure advection case and to keep this work contained and focused on the proposed block preconditioners, we replace the inner multigrid-based solver by a parallel direct solver and present the outer iterations.Note: Errors evaluated at  F = 0.5.

TA B L E 3
Problem 1  L : Parallel run times and errors for q = 9, solving for 5 time steps with  = 0.1.Note: Errors evaluated at  F = 0.5.

Results for Problems 1 and 2
For Problem 1 we carry out tests for the parallel performance of the preconditioners  L and  D .The results for  L are found in Table 2 for q = 2 and in Table 3-for q = 9.The corresponding results for  D are shown in Tables 4 and 5.
From the timing results for q = 2 in Table 2, column 2, we see that speedup in the strong scaling, also referred to as the the fixed-size speedup, is found to be 203∕130 ≈ 1.6, compared to the ideal factor 2 and 99∕203 ≈ 2 compared to ideal factor 3. Analogously, from the timing results for q = 9 in Table 3, column 2, we compute the fixed-size speedup as 2246∕1414 ≈ 1.6, compared to the ideal factor 2 and 2246∕762 ≈ 2.9 close to the ideal factor 3. Comparing the performance of  L from Tables 2 and 3 with that of  D in Tables 4 and 5, we see that  D does not require more outer iterations and is superior to  L in terms of total solution time.The main reason for this is that, as  D assumes a constant block-diagonal, it needs only to set up a single AMG in comparison to the q different AMG solvers for  L .We find that using constant diagonal blocks does not harm the convergence of the outer iterative solver in the case of Problem 1.The effect of saving Note: Errors evaluated at  F = 0.5.

TA B L E 5
Problem 1  D : Parallel run times and errors for q = 9, solving for 5 time steps with  = 0.1.Note: Errors evaluated at  F = 0.5.

No. CPUs
is less pronounced in the case of full stage parallelism as the implementation needs q separate multigrid solvers by construction.As shown in Reference 26 the benefits of stage-parallelism are seen at the scaling limit, hence in many settings a stage-serial implementation may be preferable and  D may be preferred over  L due to its lower computational cost.Problem 2 is tested only for  L .The corresponding results for q = 2 are shown in Table 6.

Results for Problem 3
As we consider the pure advection case with  = 0, CG/AMG is no longer an efficient inner solver.For the purpose of testing the outer preconditioner, as an inner solver we use here a sparse direct solver.As in Reference 41 we use Q 3 elements in space, select a fixed time-step and a number of stages, and then choose h such that  2q−1 ≈ h 4 .Plots of the solution of this problem are shown in Figure 6.The outer iterations are given in Table 7.
TA B L E 7 Problem 3: Outer iterations for the pure advection case using a direct solver as inner solver, space discretized by Q 3 elements.In contrast to the results for Problem 1, Table 7 shows that  D exhibits a significantly worse convergence for the more difficult advection case.The iteration counts for both preconditioners grow with the number of stages but  L reaches convergence in substantially fewer iterations.Remark 4. As already mentioned, the use of implicit time-stepping methods for convective problems is justified in applications not aiming to resolving features of the solution that require fine time resolution.Therefore we leave out of consideration issues, such as adaptive meshes, appropriate discretization using stabilization, CFL-condition.

CONCLUDING REMARKS
Given their fundamental significance in many branches of science, solving time-dependent partial differential equations has been an important question for centuries and it remains an issue with high impact in many scientific computing applications.When handling such problems, the use of high order accurate implicit Runge-Kutta methods of Radau type can be numerically very efficient since the methods are strongly A-stable, enable use of large time-steps and can handle highly ill-conditioned matrices with a widespread spectrum.In this work, for the strongly A-stable IRK methods of Radau IIa type, applied to linear problems we show that thanks to the particular properties of the quadrature matrices A q , originally derived in Reference 6 and holding true also for A −1 q , the arising globally coupled linear systems on each time-step can be solved efficiently by a preconditioned iterative method with a stage-parallel (block-diagonal) preconditioner.Based on these properties, for the construction of the preconditioner we consider a high quality approximation of A −1 q , namely, its lower-triangular part having real eigenvalues, and in this way fully avoid complex arithmetic.The implementation of the idea for two types of preconditioners, the analysis of the properties of the spectrum of the arising preconditioned nonsymmetric matrices (although currently not in its full generality for any number of stages) and the comprehensive performance tests are the major contributions of this article.
In this work, we consider only linear problems and constant timesteps.However, the approach allows to use automatic time step-size control for hybrid implicit-explicit methods enabling fine time resolution for a certain time interval and coarse time resolution in other parts of the time domain, switching to the considered here IRK method.As a suitable example we mention the simulation of the so-called glacial isostatic adjustment problem, where the uplift of the earth surface after melting of continental ice sheets is simulated for a period of about tens to hundreds of years and there the dynamic of the earth uplift is fast in the beginning but very slow in the larger of the time interval, see for instance Reference 71 for some details.

F I G U R E 1
Problem 1: Eigenvalues of  −1

F I G U R E 4
Problem 1: Eigenvalues of  −1 D  (blue) and of  −1

F I G U R E 5
Problem 2: Initial condition.

)
10991506, 0, Downloaded from https://onlinelibrary.wiley.com/doi/10.1002/nla.2532by Uppsala University Karin Boye, Wiley Online Library on [21/09/2023].See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions)on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License to be small since || Ûq || is reasonably less than 1.For illustration, in Table1we show the value of || Ûq || for a number of stages 2 ≤ q ≤ 12. Further, if, for instance, K has a positive definite symmetric part, then ||W −1 1 || is likely smaller than unity and particularly small for large time-steps .
TA B L E 2