Why diffusion-based preconditioning of Richards equation works: spectral analysis and computational experiments at very large scale

We consider here a cell-centered finite difference approximation of the Richards equation in three dimensions, averaging for interface values the hydraulic conductivity $K=K(p)$, a highly nonlinear function, by arithmetic, upstream, and harmonic means. The nonlinearities in the equation can lead to changes in soil conductivity over several orders of magnitude and discretizations with respect to space variables often produce stiff systems of differential equations. A fully implicit time discretization is provided by \emph{backward Euler} one-step formula; the resulting nonlinear algebraic system is solved by an inexact Newton Armijo-Goldstein algorithm, requiring the solution of a sequence of linear systems involving Jacobian matrices. We prove some new results concerning the distribution of the Jacobians eigenvalues and the explicit expression of their entries. Moreover, we explore some connections between the saturation of the soil and the ill-conditioning of the Jacobians. The information on eigenvalues justifies the effectiveness of some preconditioner approaches which are widely used in the solution of Richards equation. We also propose a new software framework to experiment with scalable and robust preconditioners suitable for efficient parallel simulations at very large scales. Performance results on a literature test case show that our framework is very promising in the advance towards realistic simulations at extreme scale.


INTRODUCTION
Groundwater flow in the unsaturated zone is a highly nonlinear phenomenon that can be modeled by the Richards equation, and there is a significant amount of research concerning different formulations and algorithms for calculating the flow of water through unsaturated porous media [1][2][3][4][5] .
The Richards equation is a time-dependent Partial Differential Equation (PDE) whose discretization leads to large nonlinear systems of algebraic equations, which often include coefficients showing large variations over different orders of magnitude.Typically, the variation in the coefficients is due to the use of a geostatistical model for hydraulic conductivity, allowing changes of many orders of magnitude from one cell to the next (heterogeneity) as well as correlations of values in each direction (statistical anisotropy).High heterogeneity and anisotropy in the problem and the presence of strong nonlinearities in the equation's coefficients make the problem difficult to be solved numerically.
The main contribution of the present work is twofold.First, we investigate the spectral properties of the sequence of Jacobian matrices arising in the solution of the non linear systems of algebraic equations by a quasi-Newton method; these properties allow us to formulate a new theoretical justification for some preconditioning choices in solving Richards equation which are indeed current usual practices in the literature; to the best of our knowledge, this is the first theoretical formulation to support such common practices.Then, on the basis of the theoretical indications, we discuss a scalable and efficient parallel solution of a modified inexact quasi-Newton method with Krylov solvers and Armijo-Goldstein's line search, i.e.Newton-like algorithms where the Newton correction linear equations are solved by a Krylov subspace method.To this aim, we interface the KINSOL package available from the Sundials project 6 with the most recent version of libraries for solving sparse linear systems by parallel Krylov solvers coupled with purely algebraic preconditioners 7 , currently being extended in some EU-funded projects.
The work is organized as follows.First, in Section 2 we discuss the discretization of the Richards equation by means of cellcentered finite differences.Then, in Section 3, we summarize the inexact quasi-Newton method as implemented in the KINSOL framework.In Section 4 we propose an analysis of some spectral properties of the sequence of Jacobian matrices produced by the underlying Newton method.In Section 5, we describe some of the main features of the PSCToolkit parallel software framework, which we use to implement the computational procedure described in this paper.
The information produced by the spectral analysis of the Jacobian matrices is used in Section 5.2 to devise a preconditioning strategy for the Krylov subspace method and to present the preconditioners used in this study.Some numerical examples highlighting the computational performance of the proposed methods are then presented in Section 6.Finally, in Section 7 we draw our conclusion and briefly discuss future extensions.

FORMULATING THE DISCRETE PROBLEM
Flow in the vadose zone, which is also known as the unsaturated zone, has rather delicate aspects such as the parameters that control the flow, which depend on the saturation of the media, leading to the nonlinear problem described by the Richards equation.The flow can be expressed as a combination of Darcy's law and the principle of mass conservation by in which ( ) is the saturation at pressure head of a fluid with density in terrain with porosity , and is the volumetric water flux.By using Darcy's law as = − ( ) (∇ + ̂ ) , for ( ) the hydraulic conductivity, and the cosine of the angle between the downward axis ̂ and the direction of the gravity force, we obtain that the overall equation has a diffusion as well as an advection term, the latter being related to gravity.For the sake of simplicity, we consider cases in which = 1, i.e. the advection acts only in the direction.There are two different forms of the Richards equation that differ in how they deal with the nonlinearity in the time derivative.The most popular form, which is the one considered here for a fluid that will always be water, permits to express the general equation as where ( ) is the pressure head at time , ( ) is water saturation at pressure head , is water density, is the porosity of the medium, ( ) the hydraulic conductivity, represents any water source terms and is elevation.The equation is then completed by adding boundary and initial conditions.This formulation of the Richards equation is called the mixed form 1,3 because the equation is parameterized in (pressure head) but the time derivative is in terms of (water saturation).Another formulation of the Richards equation is known as the head-based form, and is popular because the time derivative is written explicitly in terms of the pressure head 2,3 .
It is important to note that in unsaturated flow both water content, ( ), and hydraulic conductivity, ( ), are functions of the pressure head .There are several empirical relations used to relate these parameters, including, e.g., the Brooks-Corey 8 model and the Van Genuchten model 2 .The Van Genuchten model is slightly more popular because it containts no discontinuities in the functions, unlike the Brooks-Corey model, and therefore avoids the risk of losing the uniqueness of the solution of the semidiscretized equation in space by the well-known Peano Theorem.A version of the Van Genuchten-Richards equation in mixed form model by Celia et al. 3 reported the following choices: and where the , , , and are fitting parameters that are often assumed to be constant in the media; and are the residual and saturated moisture contents, and is the saturated hydraulic conductivity.Small changes in the pressure head can change the hydraulic conductivity by several orders of magnitude, and ( ) is a highly nonlinear function.The water content curve is also highly nonlinear as saturation can change drastically over a small range of pressure head values.It should be noted that these functions are only valid when the pressure head is negative; that is, the media is semi-saturated; when the media is fully saturated, = , ( ) is equal to the porosity, and the Richards equation reduces to the Darcy linear equation; see the example in Figure 1.Typical uses of the Richards equation are to simulate infiltration experiments in the laboratory and in the field.These begin with initially dry soil, then water is added to the top of the core sample (or ground surface).These experiments require the simulation of fast changes in pressure head and saturation over the most nonlinear part of the Van Genuchten curves.

Cell-centered finite difference discretization
At the beginning of an infiltration experiment, the pressure head can be close to discontinuous.These large changes are also reflected in the nonlinear terms and ( ).Taking into account the effect of the initial conditions, the time step can be severely limited if an inappropriate time discretization is chosen.Hydrogeologists are often interested in following the process until a steady state is achieved, which may take a long time; therefore, large time steps should be used to avoid excessively expensive simulations.The presence of stiffness and the desire to use long time steps, lead to implicit time integration schemes.In this paper we consider an implicit backward Euler numerical scheme; higher-order implicit methods are not considered because the uncertainty associated with the fitting parameters in the Van Genuchten models and the possible low smoothness of (i.e., the max order of differentiability) can severely reduce the effective order of a (theoretically) high order numerical method.
Successful discretizations in space for Richards equation are based on finite differences 3,5 ; using finite elements may require mass lumping in order to recover possible large mass balance errors and undershoot errors ahead of the infiltration front 3 .
We consider here a discretization of the Richards equation on a cell-centered finite difference tensor mesh on a parallelepiped discretized with = ( , , ) nodes, in such a way that the most external nodes are on the physical boundary of the domain.
We then denote the cell centers as { , , = ( ℎ , ℎ , ℎ )} −1 , , =0 , for = (ℎ , ℎ , ℎ ) = ( , , )∕( − 1), and with the interfaces located at midpoints between adjacent nodes.The time direction is then discretized by considering uniform time steps, i.e., the grid { = Δ } −1 =0 for Δ = 1∕( − 1); the approximation on the cell ( , , ) at time step of the pressure head is denoted as ( ) , , .With this choice, the discretization of the mixed form of the Richards equation in (1) on the internal nodes of the grid for ≥ 1 is written, for , , = 1, … , − 2, as 2ℎ , in which the generic term AV , ( ) represents an average at the interfaces of the function ( ) in (3).The selection of the form of the average term best suited to more realistic simulations does depend on the problem and is still an open issue 9,10 .In particular, the choice is dependent on the type of fluid infiltration one needs to deal with.If we denote by and the two values of the function on the opposite sides of the interface, e.g., = ( ( ) , +1, ) and = ( ( ) , , ), the most frequently used formulations are the arithmetic mean 3 , i.e., ARIT , ( ) = ( + )∕2, the geometric mean 11 , i.e., ( ) = √ , the upstream-weighted mean 12 and the integral mean 13 , i.e., , otherwise.It is also possible to employ a combination of the above, using two different means, one for the horizontal and another for the vertical direction, or to use an algorithm that computes the internodal conductivity by using different approaches depending on the terrain and the value of the pressure head 9 .

APPLYING THE INEXACT NEWTON METHOD
We apply at each time step a quasi-Newton method for the solution of the nonlinear system of equations (4) as implemented in the KINSOL parallel library 6 .Let be the current iterate of pressure head, for each node of the computational mesh and for each time step.A Newton method computes an increment as the solution of the following equation: where ( ) is the Jacobian matrix of .Specifically, we consider an inexact Newton solver in which we update the Jacobian matrix as infrequently as possible.This means that a first Jacobian is computed in the initialization phase, i.e. in the first step = 0 of the algorithm; then, a new one is built if and only if at least one of the following conditions are met The linear solver failed to converge with the previous Jacobian (backtracking), or The line search failed with outdated Jacobian information, (7)   where we are using the scaled norm ‖ ⋅ ‖ ,∞ = ‖ ⋅ ‖ ∞ for a diagonal matrix such that the entries of the scaled vector are almost of the same magnitude when is close to a solution.Similarly, ‖ ⋅ ‖ ,∞ = ‖ ⋅ ‖ ∞ for a diagonal matrix when we are far from the solution.The value of the step length is computed via the Armijo-Goldstein line search strategy, i.e., is chosen to guarantee a sufficient decrease in with respect to the step length as well as a minimum step length to the initial rate of decrease of .
The Jacobian matrix = can then be recovered by direct computation from (4) using finite-difference approximations to the derivatives of the constitutive equations in (2), and (3) given by At the core of the parallel procedure we tackle the solution of the (right preconditioned) linear system where could be either a freshly computed Jacobian for the vector , or the Jacobian coming from a previous step.The linear iterative solver for the Newton equations should handle the tradeoff between using a tolerance that is large enough to avoid oversolving, and reducing the number of iterations to attain the prescribed convergence.In the following we focus on a choice for the preconditioner in (9) which allows to balance accuracy and efficiency on parallel distributed-memory architectures when very large scale simulations have to be carried out.
In Section 4 we investigate the asymptotic spectral properties of the sequence of the Jacobian matrices.This information will then be used to devise an asymptotically spectrally equivalent sequence of symmetric and positive definite matrices.Then, to approximate the action of the inverses of the approximating sequence, we will employ some preconditioners from the PSCToolkit framework described in Section 5.The aim to experiment the functionalities of PSCToolkit for building and apply preconditioners for the linear systems arising in the quasi-Newton procedure by KINSOL, we developed some KINSOL modules that enable it to use the solvers and preconditioners from PSCToolkit inside its Newton-based non-linear procedure.These interfaces are written in C from the KINSOL library and use the C/Fortran2003 interfaces from the PSCToolkit libraries, guaranteeing a full interoperability of the data structures, i.e., they do not require producing any auxiliary copy of KINSOL objects for translation into PSCToolkit objects; everything can be manipulated from KINSOL directly into the native formats for PSCToolkit.The details about the implementation of the relevant APIs, and the operators made available by the interfacing are described in the documentation for the interface that can be downloaded from https://github.com/Cirdans-Home/kinsol-psblas.This interfacing had the twofold aim of extending PSCToolkit to handle non-linear algebraic equations, as well as to extend the KINSOL library with new methods for solving sparse linear systems on high-end supercomputers.

SPECTRAL ANALYSIS OF THE JACOBIAN SEQUENCE
In general, the construction of the preconditioner in (9) depends on the choice of the average for the discretization used in (4).To formulate a proposal for , we divide the discussion in two steps: first we look for a suitable matrix to precondition the Jacobian matrix associated to the different averages; then we discuss how we can efficiently setup and apply −1 inside the Krylov subspace method on a high-end parallel computer.

Tools for the spectral analysis
Our idea to compute starts by investigating the distribution of the eigenvalues, { ( )} =( , , ) = , for the Jacobian matrix of size = .Specifically, we look for a measurable function ∶ ⊂ ℝ → ℂ to associate to the sequence { } for which we can prove the following asymptotic relation where (⋅) is the Lebesgue measure on ℝ , 0 < ( ) < ∞, and (ℂ) is the space of continuous functions with compact support.Despite the apparently technical form of the previous expression, we can easily summarize the information contained in .If we assume that is large enough, then the eigenvalues of the matrix , except possibly for ( ) outliers, are approximately equal to the samples of over a uniform grid in , i.e., the function , that we will call symbol of the sequence of matrices { } , provides an accurate description of their spectrum asymptotically.
In order to achieve this result, we need to introduce some preliminary tools.To simplify notation, let us start by focusing on the one dimensional problem to select only an expression for AV , related to the flux oriented toward the axis.For these cases, we are interested in the overall behavior for both the eigenvalues and the singular values of the Jacobian matrices.Formally, we are interested in the computation of the so-called singular and spectral value symbol for the sequence of the Jacobians.Definition 1.Let { } be a sequence of matrices and let ∶ ⊂ ℝ → ℂ be a measurable function defined on a set with 0 < ( ) < ∞.
• { } has a singular value distribution described by , and we write { } ∼ , if In this case, is called the singular value symbol of { } .
• { } has a spectral (or eigenvalue) distribution described by , and we write { } ∼ , if In this case, is called the spectral (or eigenvalue) symbol of { } .
If { } has both a singular value and a spectral distribution described by , we write { } ∼ , .
Moreover, we refer to a sequence of matrices { } such that { } ∼ 0 as a zero-distributed sequence.Examples of sequences for which we can easily compute such symbols are the th diagonal sampling matrix generated by ∶ [0, 1] → ℂ and ∈ ℕ, that is the × diagonal matrix given by , and the Toeplitz sequences, i.e., for a ∈ ℕ and where the numbers are the Fourier coefficients of , ) and is real then { ( )} ∼ .A similar relationship holds for the case of multilevel Toeplitz matrices.
To compute the asymptotic spectral/singular value distribution of more general matrix sequences we need to expand our set of tools, and for this task we use the Generalized Locally Toeplitz (GLT) sequences 14,15 .GLT sequences are are sequences of matrices equipped with a measurable function ∶ [0, 1] × [− , ] → ℂ called symbol; we will use the notation { } ∼ GLT to indicate that { } is a GLT sequence with symbol .We can characterize the sequences by the following list of properties.• every is Hermitian, • ‖ ‖, ‖ ‖ ≤ for some constant independent of , We call unilevel sequences those for which the dimension is characterized by a single index , we call -level those in which the dimension is characterized by a multi-index ∈ ℕ .All definitions can be transparently generalized to this context.
Remark 1.In all the following analysis, to simplify the notation, we remove from the vector ( , +1) the index denoting the dependence on the iterate of the Newton method.The spectral analysis uses only the entries ( +1) and does not depend on the step at which such a vector has been obtained.

Arithmetic average
The simplest choice for the AV , is given by the arithmetic mean of the values of in (3) at the two sides of the interface, i.e., ARIT ( +1) Then, the Jacobian matrix is, as in all the other one dimensional cases, a tridiagonal matrix with entries where, To perform the spectral analysis we first rewrite the Jacobian matrices as a sum of matrices that are easier to investigate, then we use then the * -algebra (GLT4) and perturbation techniques (GLT2) to obtain the spectral information we look for.In particular, to analyze the Jacobian in (10) we separate it into three parts, one for the time-stepping, one taking into account the Darcian diffusion, and the last one related to the transport along the -axis, i.e., where { ARIT } is the scaled diagonal matrix sampling ′ ( ) on the function approximated at the ( + 1)th time step, i.e., We can express the Darcian diffusion part as while the remaining part is given by ARIT Lemma 1.The matrix sequence {ℎ 2 ARIT } has GLT symbol ′ ( ( ( ))), for ′ in (8), for ( ) the function mapping the [0, 1] interval to the domain of definition for the variable, and = lim , →+∞ ℎ 2 ∕Δ .
Proof.Observe that the ratio above is such that i.e., it is bounded whenever we impose the compatibility conditions for the relation between the time and space discretization.We observe that ARIT is the ( − 2)th diagonal sampling matrix for the function ( ( )).Therefore, it is one of the sequences of which we know the distribution (GLT3), and we conclude that Lemma 2. The matrix sequence {ℎ 2 ARIT } has GLT symbol where ( ) is the function mapping [0, 1] interval to the domain of definition for the variable.
Proof.Let us start from the first tridiagonal matrix , to produce its GLT symbol, we can consider the matrix sequence If we then perform a direct comparison of we observe that the only nonzero entries are the ones on the lower and upper diagonals, in which we have the differences ) ).From this, we bound the modulus of each off-diagonal entries of by the modulus of continuity of 1 2 ( ( )) (ℎ ) and then the 1-norm and the ∞-norm of the difference Then, a direct application of GLT3 and GLT4 tells us that ∼ GLT 1∕2 ( ( ( )))(2 − 2 cos( )).With minor modifications, the same arguments can be applied to the other parts of {ℎ 2 ARIT } , and by means of the *-algebra properties from GLT4, we can write Remark 2. In the computation of the GLT symbol of the sequence ARIT we have that a part of the distribution simplifies itself.This appears as if there was a simplification of the lower order terms induced by the arithmetic mean, i.e., in a chain-rule style the correction due to the term ′ ( ) cancels out.
Lemma 3. The matrix sequence {ℎ 2 ARIT } is a sequence distributed as zero.
Proof.By construction, the matrices of the sequence {ℎ 2 ARIT } are such that Theorem 1.The matrix sequence {ℎ 2 ARIT } is distributed in the eigenvalue sense as the function where ( ) is the function mapping [0, 1] interval to the domain of definition for the variable.
Proof.The proof follows in two step from applying GLT4 with the symbols obtained in Lemmas 1, 2, and 3, we first find that {ℎ 2 ARIT } ∼ GLT ( , ) = ′ ( ( ( ))) + ( ( ( )))(2 − 2 cos( )).Then, by GLT2, that holds in virtue of Lemma 3, we have that the distribution holds also in the eigenvalue sense.Remark 3. From Theorem 1 we infer that the ill-conditioning in the Jacobian matrices is determined by two main factors.On the one hand, we have the ill-conditioning due to the diffusion operators that drives the eigenvalues to zero as an (ℎ 2 ).On the other, we observe that this effect is enhanced by the behavior of the hydraulic conductivity ( ) at the current time step/Newton iterate.Indeed, for lower values of the pressure head, this further enhances the decay to zero of the eigenvalues.This implies also that, when the soil becomes more saturated, the ill-conditioning is due only to the diffusion.Now, with the choice of this average for all the spatial dimensions, we can explicitly write the nonzero elements of the Jacobian matrix by maintaining the local ( , , )-indices, i.e., in a way that is indepedent of the selected ordering; see the supplementary materials Section B for the complete expression.To complete the discretization we only need to impose and discretize the boundary conditions.For the test case considered here we focus on Dirichlet boundary conditions that can be either homogeneous, to identify water table boundary conditions, or time dependent.
Proof of Theorem 2. Moving from the results in Theorem 1 to the ones in Theorem 2 is just a technical rewriting.The proofs in Lemma 1, and Lemma 3 remain the same.We need to rewrite the decomposition of ARIT exploiting the fact that the tensor structure in the grid corresponds to a Kronecker structure in the matrix.
We can prove the exact same statement given in Theorem 2 also for the upstream mean together with the expression of the complete Jacobian.By virtue of the mechanism applied, the proof is completely analogous and requires only some technical adjustments related to the corrections in rank.The details and full Jacobian expression for this case are given in the supplementary materials, see sections C and D. Theorem 3. The sequence { ( , ) } obtained with the choice of the upstream average (entries in (C2)) or with the arithmetic average (entries in (B1)), for ( ), ( ) in ( 2), (3) is distributed in the sense of the eigenvalues as the function where ( ) is the function mapping the cube [0, 1] 3 to the physical domain and = lim , →∞ Δ .
We can conclude that, independently of the mean used, the asymptotic behavior of the spectrum remains the same and it is dominated by the diffusion operator.This suggests the idea of how to define an appropriate sequence of preconditioners for the underlying problem.Specifically, we use the matrix sequence generated by the considered discretization of the Jacobians of (4) having modified the flux term to be prec.= − ( )∇ .In practice, we disregard all the terms related to the fluid motion along the ̂ axis thus using the Jacobians of the nonlinear diffusion operators.using the Jacobian matrices of the nonlinear diffusion operators only.Indeed, these diffusion matrices have the same spectral distribution we have proved with Theorem 3, and therefore by GLT5 and GLT4 they provide an optimal preconditioner for the underlying sequence of matrices.We note that, to the best of our knowledge, this is the first theoretical justification of a very common approach in preconditioning Newton correction linear systems for solving the Richards equation 5,16,17 .We are now left with the problem of inverting the asymptotically spectrally equivalent sequence we have just built from the nonlinear diffusion.To face this task, we will perform a further approximation of the theoretical operators sequence by employing the preconditioning library available in PSCToolkit, a software framework recently proposed for scalable simulations on high-end supercomputers. 1

SOFTWARE FRAMEWORK FOR VERY LARGE-SCALE SIMULATIONS
In this work we employ the recently proposed PSCToolkit software framework for parallel sparse computations on current petascale supercomputers.PSCToolkit is composed of two main libraries, named PSBLAS (Parallel Sparse Basic Linear Algebra Subprograms) 18,19 , and AMG4PSBLAS (Algebraic MultiGrid Preconditioners for PSBLAS) 7 .PSBLAS implements all the main computational building blocks for iterative Krylov subspace linear solvers on parallel computers made of multiple nodes; a plugin for NVIDIA GPUs allows the exploitation of these devices in main sparse matrix and vector computations on hybrid architectures.The toolkit also implements a number of support functionalities to handle the construction of sparse matrices and of their communication structure.
AMG4PSBLAS is a package of preconditioners leveraging on PSBLAS kernels and providing one-level Additive Schwarz (AS) and Algebraic MultiGrid (AMG) preconditioners for PSBLAS-based Krylov solvers.For the present work, we developed a new set of interfaces allowing a seamless integration of linear solvers and preconditioners from PSBLAS and AMG4PSBLAS into the SUNDIALS/KINSOL library 6 , in order to use the KINSOL version of the modified Newton methods to solve the discretized Richards nonlinear equations.Interfacing with KINSOL provides a double advantage: the extension of PSCToolkit to handle non-linear algebraic equations, as well as the extension of the KINSOL library with new methods for solving sparse linear systems on high-end supercomputers.
In the following we describe the details of the proposed parallel Richards solver implemented in C by using PSCToolkit facilities for data generation and distribution and for preconditioner setup and application inside Krylov solvers, and on top of KINSOL facilities for the quasi-Newton procedure as described in Section 3.

Domain decomposition and data partitioning
In our parallel solution procedure of the discrete problem in (4), we employ a 2D block decomposition of the parallelepipedal domain Ω of size [0, ] × [0, ] × [0, ], i.e., we partition only in the horizontal directions.Indeed, in realistic simulations, one often needs to work with a wide horizontal domain, with a fixed size and resolution for the vertical layer and the vertical physical fields.The above domain decomposition corresponds to a mesh partitioning in which each parallel process owns all of degrees of freedom (dofs) in the direction, while uniform partitioning is applied in the and directions.PSBLAS provides all the functionalities needed for very general parallel mesh partitioning and to set up parallel data structures by compact and easy to use interfaces.All PSBLAS routines have a pure algebraic interface, where the main data structures are a distributed sparse matrix and a corresponding communication descriptor.The procedure for mesh partitioning builds a local sparse matrix of possibily non-contiguous rows of the global matrix, where each row corresponds to a local mesh dof and is handled through a local numbering scheme, and automatically builds an index map object contained in the communication descriptor to keep track of the correspondence between local and global indices.The communication descriptor also includes information needed for data communication among processes which is automatically generated after the mesh partitioning procedure and handled internally by the software.

AMG4PSBLAS preconditioners
The results in Theorem 2 suggest that we can expect to achieve good convergence in the iterative solution of the linear systems of the quasi-Newton steps (6) by using a matrix with symbol given by ( , ) as a preconditioner.By means of the * -algebra properties in GLT4 and GLT5, this would guarantee a sequence { −1 ( ( ) )} ∼ 1. Informally, by the spectral symbol, this implies that we can expect the spectrum of the sequence of the preconditioned matrices to have a cluster of eigenvalues in 1.While this would guarantee the theoretical properties we look for, we also need to approximate the inverse of the symmetric and positive definite (spd for short) preconditioner.To this aim, we exploit some of the methods available in AMG4PSBLAS 7 .
AMG4PSBLAS includes iterative methods for the approximation of the inverse of the preconditioner , including domain decomposition techniques of Additive Schwarz (AS) type and AMG methods based on aggregation of unknowns; details on the methods and their parallel versions are available in the literature 7,20,21 .In the following we describe the main features of the preconditioners selected for the experiments discussed in section 6.
In the AS methods, the index space, i.e., the set of row/column indices of , Ω = {1, 2, … , }, is divided into , possibly overlapping, subsets Ω of size .For each subset, we can define the restriction operator which maps a vector ∈  to the vector ∈  made of the components of having indices in Ω , and the corresponding prolongation operator = ( ) .The restriction of to the subspace Ω is then defined by the Galerkin product = .The Additive Schwarz preconditioner for is defined as the following matrix: where is supposed to be nonsingular and an inverse (or an approximation of it) can be computed by an efficient algorithm.We observe that in the case of non-overlapping subsets Ω , the AS preconditioner reduces to the well-known block-Jacobi preconditioner.
AMG4PSBLAS includes all the functionalities for setup and application of the preconditioner in (13) in a parallel setting, where the original index set Ω has been partitioned into subsets Ω , each of which is owned by one of the parallel processes, so that the inverse of can be locally computed by a LU factorization or variants of incomplete LU factorizations and sparse approximate inverses.
AS methods can also be applied as smoothers in an AMG procedure, which is the main focus of the AMG4PSBLAS software framework.In this case, the inverse of the preconditioner matrix is defined by the recursion below.Let be a sequence of spd coarse matrices obtained from by a Galerkin product +1 = ( ) , = 0, … , − 1, with 0 = , and a sequence of prolongation matrices of size × +1 , with 0 = and +1 < .Let be a convergent smoother with respect to the −inner product, the well known symmetric V-cycle, with one smoother iteration applied at each level before and after the coarse-level correction, defines the inverse of the preconditioner matrix as a sequence of the following matrices: assuming that ≈ ( ) −1 is an approximation of the inverse of the coarsest-level matrix.AMG methods are characterized by the coarsening procedure applied to setup the sequence of coarse matrices, i.e., the corresponding prolongation matrices; they use only information from the original (fine) matrix, with no additional information related to the geometry of the problem.In AMG4PSBLAS two different parallel coarsening procedures for spd matrices are available; both employ disjoint aggregates of fine dofs to form the coarse dofs, and the prolongation matrices are piecewise-constant interpolation matrices (unsmoothed aggregation) or a smoothed variant thereof (smoothed aggregation).For details on the coarsening procedures implemented in AMG4PSBLAS are available in the literature 7,21 .Here we note that in section 6 we refer to the two acronims: DSVMB: the smoothed aggregation scheme introduced by Vaněk, Mandel and Brezina in 22 , and applied in a parallel setting by a decoupled approach, where each process applies the coarsening algorithm to its subset of dofs, ignoring interactions with dofs owned by other processes 21 .
SMATCH: the smoothed aggregation scheme introduced by D'Ambra and Vassilevski 7,23,24 .It relies on a parallel coupled aggregation of dofs based on a maximum weighted graph matching algorithm, where the maximum size of aggregates can be chosen in a flexible way by a user-defined parameter so that computational complexity and convergence properties of the final preconditioner can be balanced.
The parallel smoothers available in AMG4PSBLAS can be applied both as one-level preconditioners as well as in an AMG procedure.These include variants of the AS methods described above, weighted versions of the simple Jacobi method, such as the so-called 1 −Jacobi, and a hybrid version of Gauss-Seidel, which acts as the Jacobi method among matrix blocks assigned to different parallel processes and updates the unknowns in a "Gauss-Seidel style" within the block owned by a single process 7,25 .Some of the above methods are well suited for use with the PSBLAS plugin for GPU accelerators in the preconditioner application phase.

NUMERICAL EXPERIMENTS
In this section we investigate the parallel performances of our Richards equation solver relying on the PSCToolkit software framework using preconditioners discussed in Section 5.2.The section is divided in two parts: in Section 6.1 we analyse strong scalability by measuring how the overall computational time decreases with the number of processes for a fixed problem size, then in Section 6.2 we focus on weak scalability, i.e., we measure how the solution time varies by increasing the number of processes, while keeping fixed the problem size per process, so that the global problem size proportionally increases with the number of processes.Experiments validating the spectral analysis are reported in the supplementary materials sections E, and F.
All the experiments are executed on the CPU cores of the Marconi-100 supercomputer (21 st in the June 2022 TOP500 list2 ), with no usage of hyperthreading.Marconi's nodes are built on an IBM Power System AC922, they contain two banks of 16 cores IBM POWER93 3.1 GHz processors and are equipped with 256 GB of RAM.The inter-node communication is handled by a Dual-rail Mellanox EDR Infiniband network by IBM with 220/300 GB/s of nominal and peak frequency.All the code is compiled with the gnu/8.4.0 suite and linked against the openmpi/4.0.3 and openblas/0.3.9 libraries.We use only inner functionalities of the PSBLAS 3.7.0.2 and AMG4PSBLAS 1.0 libraries, with no use of optional third party libraries.
We discuss results of the overall solution procedure for solving the discretized Richards equation, by comparing parallel performance and convergence behavior when the AS preconditioner in (13) and the two AMG preconditioners based on the two different coarsening strategies discussed in Section 5.2 are employed for solving the linear systems within the modified Newton procedure.For the one-level preconditioner we use one layer of mesh points in each direction as overlap among the subdomains Ω , each of them assigned to different processes, and apply an Incomplete LU factorization with no fill-in for computing the local subdomain matrix inverses −1 .In the case of AMG preconditioners, we apply a symmetric V-cycle with 1 iteration of hybrid backward/forward Gauss-Seidel as pre/post-smoother at the intermediate levels.As a coarsest-level solver we use a parallel iterative procedure based on the preconditioned Conjugate Gradient method with block-Jacobi as preconditioner, where ILU with 1 level of fill-in is applied on the local diagonal blocks.The coarsest-level iterative procedure is stopped when the relative residual is less than 10 −4 or when a maximum number of 30 iterations is reached.In both the DSVMB and SMATCH procedures, the coarsening is stopped when the size of the coarsest matrix includes no more than 200 dofs per core; in the case of the SMATCH coarsening, a maximum size of aggregates equal to 8 is required.In the following, the two AMG preconditioners are referred as VDSVMB and VSMATCH, respectively.In order to reduce the setup costs of the AMG preconditioners in nonlinear and/or time-dependent simulations, we support a re-use strategy of operators that is often used in our Richards solution approach.Namely, we compute the multilevel hierarchy of coarser matrices only on the first Jacobian of the sequence (at the first time step), then for subsequent Jacobians we just update the smoother on the current approximation of the matrix, while keeping fixed the coarser matrices and transfer operators.Moreover, at each subsequent time step, we reuse the Jacobian (and its related approximation) from the previous one; we leave KINSOL control over the possible need to recompute a Jacobian at each new Newton iteration.
All the preconditioners are applied as right preconditioner to the PSBLAS-based Restarted GMRES with restarting step equal to 10 -GMRES (10).We stop the iterations when the relative residual satisfies ‖ + ‖ < ‖ ‖ with = 10 −7 or when a maximum number of iterations equal to 200 is done.
To solve the Richards equation with upstream averages on a parallelepipedal domain Ω of size [0, ] × [0, ] × [0, ], we apply water at height = such that the pressure head becomes zero in a square region at the center of the top layer, ( 4 ≤ ≤ 3 4 , 4 ≤ ≤ 3 4 ), and is fixed to the value ℎ = ℎ on all the remaining boundaries, that is 3 4 ]×[ 4 , 3 4 ] ( , , ) , where we denote by Ω the characteristic function of the set Ω.The associated initial condition is given by ( , , , 0) = ℎ .In all cases we run the simulation for ∈ [0, 2] and = 10.This means that the number of time steps is fixed independently from the number of processes , thus the performance analysis will be done on the quantities averaged on the number of time steps relative to the given .
All the timings reported in the figures of the following sections are in seconds.

Strong scalability
In this section we discuss parallel performance results of our solution procedure when we fix the target domain as the parallelepiped [0, 64] × [0, 64] × [0, 1], discretized with = = 800 mesh points in the and directions, and = 40 mesh points in the vertical direction, for a total number of 20 millions of dofs, on a number of computational cores from 1 to 256; in particular we used a number of cores = 4 = 0, … , 4, so that the computational domain is uniformly partitioned in an increasing number of vertical subdomains with square basis, for increasing number of parallel cores.
We start by looking at the average number of linear iterations done for the Newton correction by using the three different preconditioners, and the time needed per each linear iteration in Figure 2. We observe that, as expected, the AMG preconditioners require a smaller number of iterations than the AS method; the latter shows an increase in the number of iterations as the number of cores, and therefore of subdomains, increases.VDSVMB always requires the smallest number of iterations showing the ability of the DSVMB coarsening procedure to setup a good quality matrix hierarchy on the first Jacobian.On the other hand, we observe that when increasing the number of cores, the number of linear iterations also increases, due to the decoupled parallel approach of the coarsening.The VSMATCH algorithm, thanks to its coupled approach, produces a number of iterations that is essentially unaffected by the number of processes, even though it is always larger than VDSVMB.
If we look at the time per linear iteration in Figure 2 (b), we observe, as expected, that the AS preconditioner has the smallest time per iteration.Indeed, for his one-level nature, its application cost is smaller than that of the multigrid methods.It also shows a regular decreasing for increasing number of parallel cores.If we look at the time per iteration of the AMG preconditioners, we observe that VDSVMB and VSMATCH have very similar behavior, and a regular decreasing for larger number of cores.On the other hand, VDSVMB always achieves a smaller time per iteration than VSMATCH, due to its better coarsening ratio.Indeed, it coarsens the original fine matrix in an efficient way, producing coarse matrices that are both smaller and of good quality.This feature makes the VDSVMB preconditioner competitive with respect to the AS method.Both preconditioners produce similar global solution times, as shown in Figure 3.We can observe that, in all the cases, the overall simulation has a regular decreasing computational time for increasing number of cores, leading to a global speedup ranging from 150 to 169, depending on the preconditioner, on 256 computational cores.This corresponds to a satisfactory parallel efficiency ranging from 59% to 66%.For the sake of completeness, in Table 1 we report the total number of computed Jacobians and of the Newton iterations required by the global simulation, when the different preconditioners are employed.We can see that the choice of the preconditioner also affects the KINSOL nonlinear procedure, and that the best behavior always corresponds to the VDSVBM preconditioner.

Weak scalability
In this section we look at the weak scalability of the overall solution procedure.The ideal goal is to obtain a constant execution time when the number of cores increases, while the computational work per core is kept fixed.Unfortunately, it is in general impossible to get an exactly constant execution time.Indeed, we would need to have a perfect algorithmic scalability of the preconditioners, i.e., the ability to keep the number of linear iterations perfectly constant with an increasing problem size.Moreover, the parallelization efficiency of sparse matrix kernels, which are communication bound, is also affected by the increase in the number of processes and hence in the communication requirements.
To achieve a scaling of the computational work that is meaningful with respect to the physical properties of the underlying problem, we consider a growing domain increasing = 0, … , 7, = 0, … , 6, and a corresponding mesh with a total of points equal to ( × ) = (2 , 2 , ), where = = 50, and = 40.In this way, each process has the same amount of computational and communication load, and the overall problem size increases with the total number of processes .We run experiments up to 8192 CPU cores for solving a problem with a global size up to about 829 millions of dofs.
As in the previous section, we first discuss convergence behavior and efficiency of the linear solvers for the Newton corrections, when the different preconditioners are employed.In Figure 4 we show the average number of linear iterations along the overall simulation and the average execution time per each linear iteration.We observe that, as expected, the AS method has the worst algorithmic scalability, showing a general increase of the linear iterations needed to solve problems of increasing size.The average number of iterations has a slow and small decrease from 16 to 1024 cores, while there is a rapid increase from 150 to 203 iterations going from 1024 up to 8192 cores, confirming that convergence properties of the one-level Schwarz-type domain decomposition methods are dependent on the number of involved subdomains.On the other hand, if we look at the AMG preconditioners, where a better coupling among the subdomains is considered, we see that the average number of iterations is smaller and has a very small variation for increasing number of subdomains.In particular, we observe that for the VDSVMB preconditioner the average number of linear iterations has an increase from 67 to 84 iterations going from 1 to 8192 cores.The VSMATCH preconditioner, after an initial small increase, displays a decrease which shows that the coupled aggregation algorithm based on matching is able to produce an effective preconditioner with very good algorithmic scalability properties.This makes VSMATCH promising for exploring extreme scalability in this type of simulation procedure for the Richards model.
If we look at the average execution time per linear iteration, for all the preconditioners we observe a very small increase, which demonstrates a good implementation scalability for all the sparse computations involved in the linear solver phase.We now turn to the evaluation of the overall time to solution and scaled efficiency3 , of the global simulation.In Figure 5(a) we see that, as for linear solvers, also for the global procedure there is a general small increase for increasing number of cores and problem size for all the employed preconditioners, except for the VSMATCH method going from 4096 to 8192 cores, where a better convergence behavior of the linear solver also produces a small reduction in nonlinear iterations (cfr Table F3 in the supplementary materials).The best time to solution is in general obtained by using the AS preconditioner till to 1024 cores, due to the smallest computational cost of this preconditioner.On the other hand, its worst algorithmic scalability (i.e., the large increase in the number of linear iterations) results in larger global execution times with respect to the most effective VDSVMB preconditioner when the number of cores increases.
If we look at the scaled efficiency shown in percentage scale in Figure 5, the AMG preconditioners show a similar behavior, while VSMATCH confirms his better efficiency when the largest number of cores is used.To give a complete picture of the performance of the various parts of the solution procedure we also report the percentage of the execution time spent in the different computational kernels.The barplots in Figure 6 deliver several useful information.First 2 0 2 2 2 4 2 6 2 8 2 10 2 12

FIGURE 6
Percentage of the overall time spent in the different phases for the overall solution process.We sum all the partial times and normalize it against the total to account for the different computational kernels. of all, as expected, the two main parts of the efforts are represented by the time spent in actually solving the linear systems (6) (LinSol) and evaluating the non-linear function encoding the discretization (Feval), on which our implementation efforts were more focused.In particular, in the Feval kernel we exploited many of the support functionalities of PSCToolkit.To increase the efficiency, at the expense of a marginal increase in the use of memory, we experimented with memorizing the local-to-global map on the MPI tasks, thus avoiding the overhead of explicit calls to the functions that permit the user to explore the index space.This is made possible by the fact that we are using a communicator for data distribution which is fixed through the time steps and iterates of Newton's method; see the discussion in Section 5.1.Thanks to the flexibility of the underlying software framework, we were able to optimize the procedure for building the vectors and matrices at each nonlinear time step, making good reuse of local data.
We have a small and almost equal time that is spent in building the distributed Jacobian (Jacobian) and auxiliary matrices (Auxiliary).The rebuild of the local portions of the matrices is well optimized and is very close to the absolute minimum that is necessary to compute the coefficient values.A reasonably small amount of time is spent in building and updating the preconditioners (Setup).In particular, it is possible to reuse and apply the aggregation hierarchy, so that we only need to recompute the smoothers.The other (almost invisible) bar is represented by the halo exchanges: this is the data exchange among the parallel processes that happens before each build of the Jacobian, build of auxiliary matrices and function evaluation.It is a communication that is necessary for having an agreement on all the quantities needed by the processes to perform their computations, namely the exchange of the boundary data; this operation is persistent, in the sense that the pattern of the data exchange is determined by the discretization mesh structure.The last overhead is all the remaining computations that are made inside the Newton method, i.e., vector updates and vector norm computations, namely the Newton direction updates, convergence checks and line-searches, and are in charge to the KINSOL library.

CONCLUSIONS AND PERSPECTIVES
In this paper we focused on two main objectives: to prove some spectral properties of the sequence of Jacobian matrices generated discretizing the Richards equation in mixed form for simulation of unsaturated subsurface flows by a Newton-type method, and to prove the efficiency, flexibility and robustness of a software framework for parallel sparse matrix computations.
The theoretical results we obtained are consistent with expectations and justify some preconditioner choices already used in the literature for solving Richards equations.We used several functionalities of the PSCToolkit for iterative solution of sparse linear systems, some of the support routines for re-using preconditioners in solving sequences of linear systems and for an efficient setup and update of data structures needed for Jacobian and right-hand side computations in Newton iterations.The performances of our strategy on one of the most powerful supercomputers currently available are quite promising in view of exploring extreme scalability and confirm the benefits of using multigrid preconditioners when number of processing cores largely increases.
Our plans for future work include the extension of the PSCToolkit interface to KINSOL, in order to use the ability of the PSCToolkit linear solvers in exploiting GPU architectures, and the integration of the software stack into the Parflow code 26 for realistic simulations in hydrological applications.
To perform the spectral analysis we need to first rewrite the Jacobians as the sum of matrices that are more amenable to be treated, and use then the * -algebra and perturbation properties to obtain the spectral information on the different Jacobian expressions.
As for the previous case, the three parts are one relative to the time-stepping, one relative to the Darcian diffusion, and the last one relative to the transport along the -axis, i.e., where { UP } is the scaled diagonal matrix sampling the function ′ ( ) on approximated at the ( + 1)th time step that is unchanged from ARIT in (11).This means that { UP } = ARIT has a spectral distribution given again by Lemma 1.In the same way, the matrix UP is unchanged from ARIT in (12), i.e., UP = ARIT , whose spectra distribution is given in Lemma 3. The only part for which we encounter a difference is the one relative to the Darcian diffusion.To shorten the notation and make the subsequent spectra proof easier we pose Δ +1 = ( +1) +1 − ( +1) , and Δ −1 = ( +1) − ( +1) −1 , then the decomposition reads as where ( ) is the function mapping [0, 1] to the domain of definition for the variable .
Proof.The proof is similar to the one for Lemma 2. We need to consider that the entries in the rows of the matrices are evaluation of the ( ( )) and ( ) ′ ( ( )) functions on different stencils depending on the flux between the interfaces of the solution vector ( , +1) .Let us start from the first tridiagonal matrix , to produce its GLT symbol, we can consider the matrix sequence ( ( ( +1) )) (2 − 2 cos ).If we then perform a direct comparison of we observe that four different occurrences are possible if we consider the previous difference on the th row from which we find again that the 1-norm and the ∞-norm of the difference are bounded by a constant multiple of the continuity module ( ( )) (ℎ ) of ( ( )).Thus, 2ℎ .

E A COMPARISON BETWEEN SPECTRAL CHARACTERIZATION AND TRUE EIGENVALUES
To illustrate the results depicted in Theorem 1.We consider a small 1D example for having a small MATLAB implementation, and being able of looking the comparison between the eigenvalues of the Jacobian matrices for the full Newton method and their distribution function.That is, we check how good the asymptotic description of the spectrum behaves with respect to the true eigenvalues.Specifically, we consider a 1D water infiltration test 27 for a column of soil with parameters for the Van Genuchten model ( 2 small grid size thus namely far away from the asymptotic regime, there is a good matching between the predicted eigenvalues and the computed ones.We consider the same test problem also for the case in which we use the upstream-weighted mean of the values of in (3).Again the result proved in Theorem 4 is clearly depicted in Figure E2, in which we observe a good accordance between the spectral symbol and the computed eigenvalues already at coarse grid.
To have a comparison in between the approximate and full sequence we consider also the only preconditioner we can apply on both sequence, namely the AS preconditioner, and compare the average number of iteration across the time steps in the weak scaling regime described in Section 6.2.From the results in Table E1 we observe readily that the spectrally equivalent sequence clearly catches the properties we are interested in.
Original sequence Approximating sequence

F WEAK SCALABILITY
In this section we complement the results discussed in Section 6.2 by considering further details on the weak-scalability analysis.
In first place, we want to investigate numerically the consequences of Theorem 3. Indeed, from the theoretical analysis we expect that if we would use just the discretization of the diffusive part of the equation, then a cluster of eigenvalues at 1 would appear and suggest good performances for the underlying Krylov solver.Since this is not feasible, we resorted to using either the multigrid hierarchies to further approximate the diffusive part.Therefore, we investigate here numerically the capability of such approximating sequence with respect to the original one.To have an homogeneous situation with respect to the coarse grid, and diminish the number of variables that could change the interpretation of the result, we resort to using just 30 iterations of a Block-Jacobi method with the ILU(0) factorization as a coarse solver.From the results collected in Table F2, we observe that

F.1 Finer details on the build and solve phases
To deepen the analysis, we look with a finer degree of detail at the performances of every slices of the bar-plots in Figure 6.Let us start from the bottom and look at the build phase of the Jacobian matrices; Figure F3.Up to 4096 cores the assembly phase stays around the 70% of efficiency, and is only sligthly diminished to around the 60% on 8192 cores.Indeed, even if we are using a method that needs to assembly the matrices for building the preconditioners, we observe that this doesn't hampers the overall efficiency of the method.This is even more true if we also consider the efficiency of the function evaluation routine (that is instead implemented in a matrix-free way) given in Figure F4.We notice that this mimics the overall efficiency of the whole procedure.As it usually happens in nonlinear processes, what characterizes the overall performance is indeed the number and efficiency of the function evaluation calls.That, as we see from Table F3, are very stable with respect to the number of employed cores when the VSMATCH preconditioner is used for Newton correction, while shows some oscillations for the other preconditioners.Finally, let us look further into the scalability performances of the solution of the linear systems associated with the Newton correction.In our strategy this is divided into three separate phases.A complete setup in which we build the hierarchies for the first time, an update phase in which we recompute just the smoothers when there has been a change in the Jacobian, and the application phase inside the Krylov iterative method.For what concern the first two phases, Figures F5a, F5b show that the update strategy reduces effectively the time needed to prepare the preconditioner, while keeping a reasonable quality for the aggregation procedure.In order to better understand the results on the execution time per iteration of the AMG preconditioners discussed in the paper, we also report (average) operator complexity of the two different methods.We recall that the operator complexity of an AMG method is defined as opc = and is usually both a measure of memory occupancy and of the time complexity for the method application in a V-cycle 7 .
Looking also at the overall behavior of the solving times is useful.If we compare the results in Figure F6b with the ones in Figure 5b, we observe that the efficiency for the linear system solve phase reflects the global efficiency and is not hampered by the fact that we are building Jacobians and auxiliary matrices, see again Figure F3, or by the way in which we are performing the function evaluations.

FIGURE F5
Weak scaling.Setup phases of the preconditioners in which subsequent calls are updates.

FIGURE 2
FIGURE 2 Strong scaling.Average number of linear iterations and execution time per linear iteration with different preconditioners.

FIGURE 4
FIGURE 4 Weak scaling.Average number of linear iterations and execution time per linear iteration with different preconditioners.

FIGURE 5
FIGURE 5 Weak scaling: Total execution time and Scaled efficiency shown in percentage for the solution of the whole problem (all time steps), KINSOL with different linear solvers.

FIGURE E1
FIGURE E1 Arithmetic average.Comparison of the eigenvalues and spectral symbol for the coefficients in (E4) with ℎ = 40∕( − 1), Δ = 0.1, and = 800 on different time steps and for different iterates of the full Newton method.

FIGURE E2
FIGURE E2Upstream-weighted average.Comparison of the eigenvalues and spectral symbol for the coefficients in (E4) with ℎ = 40∕( − 1), Δ = 0.1, and = 800 on different time steps and for different iterates of the full Newton method.

FIGUREF3
FIGURE F3Weak scaling.Speedup (left) and scaled efficiency (right) for setup of the Jacobian matrices.

FIGUREF6
FIGURE F6Weak scaling.Average operator complexity of AMG preconditioners and average time ( ) for linear solver.

TABLE 1
Strong scaling.Total execution time and Speedup for the solution of the whole problem (all time steps), KINSOL with different linear solvers.Strong scaling.Number of nonlinear iterations (NLin It.s), and number of computed Jacobians (N Jac.s) for the three preconditioners.

TABLE E1
Weak scaling.Comparison of the (rounded) average number of linear iteration employed by the AS preconditioner when built directly on the original sequence and when built on the approximating sequence obtained from the spectral analysis.

TABLE F3
Weak scaling.Speedup (left) and scaled efficiency (right) for the evaluation of the function.Weak scaling.Number of nonlinear iterations (NLin It.s), and number of computed Jacobians (N Jac.s) for the three preconditioners.