Randomised preconditioning for the forcing formulation of weak constraint 4D-Var

There is growing awareness that errors in the model equations cannot be ignored in data assimilation methods such as four-dimensional variational assimilation (4D-Var). If allowed for, more information can be extracted from observations, longer time windows are possible, and the minimisation process is easier, at least in principle. Weak constraint 4D-Var estimates the model error and minimises a series of linear least-squares cost functionsfunctions, which can be achieved using the conjugate gradient (CG) method; minimising each cost function is called an inner loop. CG needs preconditioning to improve its performance. In previous work, limited memory preconditioners (LMPs) have been constructed using approximations of the eigenvalues and eigenvectors of the Hessian in the previous inner loop. If the Hessian changes significantly in consecutive inner loops, the LMP may be of limited usefulness. To circumvent this, we propose using randomised methods for low rank eigenvalue decomposition and use these approximations to cheaply construct LMPs using information from the current inner loop. Three randomised methods are compared. Numerical experiments in idealized systems show that the resulting LMPs perform better than the existing LMPs. Using these methods may allow more efficient and robust implementations of incremental weak constraint 4D-Var.


Introduction
In numerical weather prediction, data assimilation provides the initial conditions for the weather model and hence influences the accuracy of the forecast [Kalnay, 2002].Data assimilation uses observations of a dynamical system to correct a previous estimate (background) of the system's state.The statistical knowledge of the errors in the observations and the background is incorporated in the process.A variational data assimilation method called weak constraint 4D-Var provides a way to also take into account the model error [Trémolet, 2006], which can lead to a better analysis (e.g.Trémolet [2007]).
We explore the weak constraint 4D-Var cost function.In its incremental version, the state is updated by a minimiser of the linearised version of the cost function.The minimiser can be found by solving a large sparse linear system.The process of solving each system is called an inner loop.Because the second derivative of the cost function, the Hessian, is symmetric positive definite, the systems may be solved with the conjugate gradient (CG) method [Hestenes and Stiefel, 1952], whose convergence rate depends on the eigenvalue distribution of the Hessian.Limited memory preconditioners (LMPs) have been shown to improve the convergence of CG when minimising the strong constraint 4D-Var cost function [Fisher, 1998, Tshimanga et al., 2008].Strong constraint 4D-Var differs from the weak constraint 4D-Var by making the assumption that the dynamical model has no error.
LMPs can be constructed using approximations to the eigenvalues and eigenvectors (eigenpairs) of the Hessian.The Lanczos and CG connection (Section 6.7 of Saad [2003]) can be exploited to obtain approximations to the eigenpairs of the Hessian in one inner loop, and these approximations may then be employed to construct the preconditioner for the next inner loop [Tshimanga et al., 2008].This approach does not describe how to precondition the first inner loop, and the number of CG iterations used on the ith inner loop limits the number of vectors available to construct the preconditioner on the (i + 1)st inner loop.Furthermore, the success of preconditioning relies on the assumption that the Hessians do not change significantly from one inner loop to the next.
In this paper, we propose addressing these drawbacks by using the easy to implement subspace iteration methods (see Chapter 5 of Saad [2011]) to obtain approximations of the largest eigenvalues and corresponding eigenvectors of the Hessian in the current inner loop.The subspace iteration method first approximates the range of the Hessian by multiplying it with a start matrix (for approaches to choosing it see e.g.Duff and Scott [1993]) and the speed of convergence depends on the choice of this matrix (e.g.Gu [2015]).A variant of subspace iteration, which uses a Gaussian random start matrix, is called the Randomised Eigenvalue Decomposition (REVD).REVD has been popularised by probabilistic analysis [Halko et al., 2011, Martinsson andTropp, 2020].It has been shown that REVD, which is equivalent to one iteration of the subspace iteration method, can often generate a satisfactory approximation of the largest eigenpairs of a matrix that has rapidly decreasing eigenvalues.Because the Hessian is symmetric positive definite, a randomised Nyström method for computing a low rank eigenvalue decomposition can also be used.It is expected to give a higher quality approximation than REVD (e.g.Halko et al. [2011]).We explore these two methods and another implementation of REVD, which is based on the ritzit implementation of the subspace method [Rutishauser, 1971].The methods differ in the number of matrix-matrix products with the Hessian.Even though more computations are required to generate the preconditioner in the current inner loop compared to using information from the previous inner loop, the randomised methods are block methods and hence easily parallelisable.
In Section 2, we discuss the weak constraint 4D-Var method and, in Section 3, we consider LMPs and ways to obtain spectral approximations.The three randomised methods are examined in Section 4. Numerical experiments with linear advection and Lorenz 96 models are presented in Section 5, followed by a concluding discussion in Section 6.

Weak constraint 4D-Var
We are interested in estimating the state evolution of a dynamical system x 0 , x 1 , . . ., x N , with x i ∈ R n , at times t 0 , t 1 , . . . ,t N .Prior information about the state at t 0 is called the background and is denoted by x b ∈ R n .It is assumed that x b has Gaussian errors with zero mean and covariance matrix B ∈ R n×n .Observations of the system at time t i are denoted by y i ∈ R qi and their errors are assumed to be Gaussian with zero mean and covariance matrix R i ∈ R qi×qi (q i ≪ n).An observation operator H i maps the model variables into the observed quantities at the correct location, i.e. y i = H i (x i ) + ζ i , where ζ i is the observation error.We assume that the observation errors are uncorrelated in time.
The dynamics of the system are described using a nonlinear model M i such that where η i+1 is the model error at time t i+1 .The model errors are assumed to be uncorrelated in time and to be Gaussian with zero mean and covariance matrix The forcing formulation of the nonlinear weak constraint 4D-Var cost function in which we solve for the initial state and the model error realizations, is (2) where x i satisfies the model equation (1) [Trémolet, 2006].The analysis (approximation of the state evolution over the time window) x a 0 , x a 1 , . . ., x a N can be obtained from the minimiser of (2) using constraints (1).

Incremental 4D-Var
One way to compute the analysis is to approximate the minimum of ( 2) with an inexact Gauss-Newton algorithm [Gratton et al., 2007], where a sequence of quadratic cost functions is minimised.In this approach, we update x 0 and the model error where T is the jth approximation and δp (j) = (δx (j)T 0 , δη ) T is calculated with (1) using p (j) .The update δp (j) is obtained by minimising the following cost function where ||a|| 2 A −1 = a T A −1 a and the covariance matrices are block diagonal, i.e. q×q , where q = Σ N i=0 q i .We use the notation (following Gratton et al. [2018]) +1) , where H (j) i is the linearised observation operator, and . . . where i is the linearised model, i.e.M (j)  i,l denotes the linearised model integration from time t i to t l+1 , (L −1 ) (j) ∈ R n(N +1)×n(N +1) , x (j) , δx (j) , b (j) ∈ R n(N +1) and d (j) ∈ R q .The outer loop consists of updating (3), calculating x (j) , b (j) , d (j) , and linearising H i and M i for the next inner loop.
The minimum of the quadratic cost function (4) can be found by solving a linear system where A (j) is the Hessian of (4), which is symmetric positive definite.These large sparse systems are usually solved with the conjugate gradient (CG) method, whose convergence properties depend on the spectrum of A (j) (see Section 3.1 for a discussion).In general, clustered eigenvalues result in fast convergence.We consider a technique to cluster eigenvalues of A (j) in the following section.From now on we omit the superscript (j).

Control Variable Transform
A control variable transform, which is also called first level preconditioning, maps the variables δp to δ p, whose errors are uncorrelated (see, e.g.Section 3.2 of Lawless [2013]).This can be denoted as the transformation D 1/2 δ p = δp, where D = D 1/2 D 1/2 and D 1/2 is the symmetric square root.The update δ p is then the solution of A pr is the sum of the identity matrix and a rank q positive semi-definite matrix.Hence, it has a cluster of n(N + 1) − q eigenvalues at one and q eigenvalues that are greater than one.Thus, the spectral condition number κ = λmax/λ min (here λmax and λ min are the largest and smallest, respectively, eigenvalues of A pr ) is equal to λmax.We discuss employing second level preconditioning to reduce the condition number while also preserving the cluster of the eigenvalues at one.In the subsequent sections, we use notation that is common in numerical linear algebra.Namely, we use A for the Hessian with first level preconditioning, x for the unknown and b for the right hand side of the system of linear equations.Thus, we denote (10) by where the right-hand side b is known and x is the required solution.We assume throughout that A is symmetric positive definite.

Preconditioned conjugate gradients
The CG method (see, e.g.Saad [2003]) is a popular Krylov subspace method for solving systems of the form (11).A well known bound for the error at the ith CG iteration where κ is the spectral condition number and i Aǫ i (see, e.g., Section 5.1.of Nocedal and Wright [2006]).Note that this bound describes the worst-case convergence and only takes into account the largest and smallest eigenvalues.The convergence of CG also depends on the distribution of the eigenvalues of A (as well as the right hand side b and the initial guess x 0 ); eigenvalues clustered away from zero suggest rapid convergence (Lecture 38 of Trefethen and Bau, III [1997]).Otherwise, CG can display slow convergence and preconditioning is used to try and tackle this problem (Chapter 9 of Saad [2003]).Preconditioning aims to map the system (11) to another system that has the same solution, but different properties that imply faster convergence.Ideally, the preconditioner P should be both cheap to construct and to apply, and the preconditioned system should be easy to solve.
If P is a symmetric positive definite matrix that approximates A −1 and is available in factored form P = CC T , the following system is solved where x = C −1 x.Split preconditioned CG (PCG) for solving (13) is described in Algorithm 1 (see, for example, Algorithm 9.2 of Saad [2003]).Note that every CG iteration involves one matrix-vector product with A (the product Ap j−1 is stored in step 3 and reused in step 5) and this is expensive in weak constraint 4D-Var, because it involves running the linearised model throughout the assimilation window through the factor L −1 .

Limited memory preconditioners
In weak constraint 4D-Var the preconditioner P approximates the inverse Hessian.Hence, P can be obtained using Quasi-Newton methods for unconstrained optimization that construct an approximation of the Hessian matrix, which is updated regularly (see, for example, Chapter 6 of Nocedal and Wright [2006]).A popular method to approximate the Hessian is BFGS (named after Broyden, Fletcher, Goldfarb, and Shanno, who proposed it), but it is too expensive in terms of storage and updating the approximation.Instead, the so-called block BFGS method (derived by Schnabel [1983]) uses only a limited number of vectors to build the Hessian, and when new vectors are added older ones are dropped.This is an example of a limited memory preconditioner (LMP), and the one considered by Tshimanga et al.
An LMP for an n A × n A symmetric positive definite matrix A is defined as follows where S is an n A × k (k ≤ n A ) matrix with linearly independent columns s 1 , . . ., s k , and In A is the n A × n A identity matrix [Gratton et al., 2011].
In data assimilation, we have k ≪ n A , hence the name LMPs.P k is called a balancing preconditioner in Tang et al. [2009].
A potential problem for practical applications of ( 14) is the need for expensive matrix-matrix products with A. Simpler formulations of ( 14) are obtained by imposing more conditions on the vectors s 1 , . . ., s k .Two formulations that Tshimanga et al. [2008] calls spectral-LMP and Ritz-LMP have been used, for example, in ocean data assimilation in the Regional Ocean Modeling System (ROMS) [Moore et al., 2011] and the variational data assimilation software with the Nucleus for European Modelling of the Ocean (NEMO) ocean model (NEMOVAR) [Mogensen et al., 2012], and coupled climate reanalysis in Coupled ECMWF ReAnalysis (CERA) [Laloyaux et al., 2018].
To obtain the spectral-LMP, let v 1 , . . ., v k be orthonormal eigenvectors of A with corresponding eigenvalues 14) is the spectral-LMP P sp k (it is called a deflation preconditioner in Giraud and Gratton [2006]), which can be simplified as (15) Tshimanga [2007]) In many applications, including data assimilation, exact eigenpairs are not known, and their approximations, called Ritz values and vectors, are used (we discuss these in the following section).If u 1 , . . ., u k are orthogonal Ritz vectors, then the following relation holds Each application of P Rt k requires a matrix-matrix product with A. If the Ritz vectors are obtained by the Lanczos process (described in Section 3.4 below), then (17) can be further simplified, so that no matrix-matrix products with A are needed (see Section 4.2.2. of Gratton et al. [2011] for details).
An important property is that if an LMP is constructed using k vectors then at least k eigenvalues of the preconditioned matrix C T AC will be equal to 1, and the remaining eigenvalues will lie between the smallest and largest eigenvalues of A (see Theorem 3.4 of Gratton et al. [2011]).Moreover, if A has a cluster of eigenvalues at 1, then LMPs preserve this cluster.This is crucial when preconditioning (10): because the LMPs preserve the n(N + 1) − q smallest eigenvalues of A pr that are equal to 1, the CG convergence can be improved by decreasing the largest eigenvalues.Hence, it is preferable to use the largest eigenpairs or their approximations.
In practice, both the spectral-LMP and Ritz-LMP use Ritz vectors and values to construct the LMPs.It has been shown that the Ritz-LMP can perform better than the spectral-LMP in a strong constraint 4D-Var setting by correcting for the inaccuracies in the estimates of eigenpairs [Tshimanga et al., 2008].However, Gratton et al. [2011] (Theorem 4.5) have shown that if the preconditioners are constructed with Ritz vectors and values that have converged, then the spectral-LMP acts like the Ritz-LMP.

Ritz information
Calculating or approximating all the eigenpairs of a large sparse matrix is impractical.Hence, only a subset is approximated to construct the LMPs.This is often done by extracting approximations from a subspace, and the Rayleigh-Ritz (RR) procedure is a popular method for doing this.
Assume that Z ⊂ R nA is an invariant subspace of A, i.e.Az ∈ Z for every z ∈ Z, and the columns of is an eigenpair of A (see, e.g.Theorem 1.2 in Chapter 4 of Stewart [2001]).Hence, eigenvalues of A that lie in the subspace Z can be extracted by solving a small eigenvalue problem.
However, generally the computed subspace Z with orthonormal basis as columns of Z is not invariant.Hence, only approximations ṽ to the eigenvectors v belong to Z.The RR procedure computes approximations u to ṽ.We give the RR procedure in Algorithm 2, where the eigenvalue decomposition is abbreviated as EVD.Approximations to eigenvalues λ are called Ritz values θ, and u are the Ritz vectors.Eigenvectors of K = ZT A Z, which is the projection of A onto Z, are denoted by w and are called primitive Ritz vectors.
Algorithm 2 Rayleigh-Ritz procedure for computing approximations of eigenpairs of symmetric A Input: with approximations to eigenvectors of A as its columns, and diagonal Θ ∈ R m×m with approximations to eigenvalues of A on the diagonal 3.4 Spectral information from CG Tshimanga et al. [2008] use Ritz pairs of the Hessian in one inner loop to construct LMPs for the following inner loop, i.e. information on A (0) is used to precondition A (1) , and so on.Success relies on the Hessians not changing significantly from one inner loop to the next.Ritz information can be obtained from the Lanczos process that is connected to CG, hence information for the preconditioner can be gathered at a negligible cost.
The Lanczos process is used to obtain estimates of a few extremal eigenvalues and corresponding eigenvectors of a symmetric matrix A (Section 10.1 of Golub and Van Loan [2013]).It produces a sequence of tridiagonal matrices T j ∈ R j×j , whose largest and smallest eigenvalues converge to the largest and smallest eigenvalues of A. Given a starting vector f 0 , it also computes an orthonormal basis f 0 , . . ., f j−1 for the Krylov subspace K j = span{f 0 , Af 0 , . . ., A j−1 f 0 }.Ritz values θ i are obtained as eigenvalues of a tridiagonal matrix, which has the following structure: The Ritz vectors of A are u i = F j w i , where F j = (f 0 , . . ., f j−1 ) and an eigenvector w i of T j is a primitive Ritz vector.Eigenpairs of T j can be obtained using a symmetric tridiagonal QR algorithm or Jacobi procedures (e.g.Section 8.5 of Golub and Van Loan [2013]).Saad (Section 6.7.3 of Saad [2003]) discusses obtaining entries of T j when solving Ax = b with CG.At the j-th iteration of CG, new entries of T j are calculated as follows and the vector f j = r j /||r j ||, where ||r j || 2 = r T j r j and α j , β j and r j are as in Algorithm 1.Hence, obtaining eigenvalue information requires normalizing the residual vectors and finding eigenpairs of the tridiagonal matrix T j .In data assimilation, the dimension of T j is small, because the cost of matrix-vector products restricts the number of CG iterations in the previous inner loop.Hence its eigenpairs can be calculated cheaply.However, caution has to be taken to avoid 'ghost' values, i.e. repeated Ritz values, due to the loss of orthogonality in CG (Section 10.3.5 of Golub and Van Loan [2013]).This can be addressed using a complete reorthogonalization in every CG iteration, which is done in the CONGRAD routine used at the European Centre for Medium Range Weather Forecasts [ECMWF, 2020].This makes every CG iteration more expensive, but CG may converge in fewer iterations [Fisher, 1998].

Randomised eigenvalue decomposition
If the Hessian in one inner loop differs significantly from the Hessian in the previous inner loop, then it may not be useful to precondition the former with an LMP that is constructed with information from the latter.Employing the Lanczos process to obtain eigenpair estimates and use them to construct an LMP in the same inner loop is too computationally expensive, because each iteration of the Lanczos process requires a matrix-vector product with the Hessian, thus the cost is similar to the cost of CG.Hence, we explore a different approach.
Subspace iteration is a simple procedure to obtain approximations to the largest eigenpairs (see, e.g., Chapter 5 of Saad [2011]).It is easily understandable and can be implemented in a straightforward manner, although its convergence can be very slow if the largest eigenvalues are not well separated from the rest of the spectrum.The accuracy of subspace iteration may be enhanced by using a RR projection.
Such an approach is used in the Randomised Eigenvalue Decomposition (REVD) (see, e.g., Halko et al. [2011]).This takes a Gaussian random matrix, i.e. a matrix with independent standard normal random variables with zero mean and variance equal to one as its entries, and applies one iteration of the subspace iteration method with RR projection, hence obtaining a rank m approximation A ≈ Z 1 (Z T 1 AZ 1 )Z T 1 , where Z 1 ∈ R nA×m is orthogonal.We present REVD in Algorithm 3.An important feature of REVD is the observation that the accuracy of the approximation is enhanced with oversampling (which is also called using guard vectors in Duff and Scott [1993]), i.e. working on a larger space than the required number of Ritz vectors.Halko et al. [2011] claim that setting the oversampling parameter to 5 or 10 is often sufficient.

Algorithm 3 Randomised eigenvalue decomposition, REVD
Input: symmetric matrix A ∈ R nA×nA , target rank k, an oversampling parameter l Output: orthogonal U 1 ∈ R nA×k with approximations to eigenvectors of A as its columns, and diagonal Θ 1 ∈ R k×k with approximations to the largest eigenvalues of A on the diagonal k+l) , elements of Θ 1 are sorted in decreasing order 6: Remove last l columns and rows of Θ 1 , so that Θ 1 ∈ R k×k 7: Remove last l columns of W 1 , so that W 1 ∈ R (k+l)×k 8: Randomised algorithms are designed to minimise the communication instead of the flop count.The expensive parts of Algorithm 3 are the two matrix-matrix products AG and AZ 1 in steps 2 and 4, i.e. matrix A has to be multiplied with 2(k + l) vectors, which in serial computations would be essentially the cost of 2(k + l) iterations of unpreconditioned CG.However, note that these matrix-matrix products can be parallelised.
In weak constraint 4D-Var, A is the Hessian, hence it is symmetric positive definite and its eigenpairs can also be approximated using a randomised Nyström method (Algorithm 5.5. of Halko et al. [2011]), which is expected to give much more accurate results than REVD [Halko et al., 2011].We present the Nyström method in Algorithm 4, where singular value decomposition is abbreviated as SVD.It considers a more elaborate rank m approximation than in REVD: A ≈ (AZ 1 )(Z T 1 AZ 1 ) −1 (AZ 1 ) T = FF T , where Z 1 ∈ R nA×m is orthogonal (obtained in the same way as in REVD, e.g. using a tall skinny QR (TSQR) decomposition [Demmel et al., 2012]) and F = (AZ 1 )(Z T 1 AZ 1 ) −1/2 ∈ R nA×m is an approximate Cholesky factor of A, which is found in step 6.The eigenvalues of FF T are the squares of the singular values of F (see section 2.4.2 of Golub and Van Loan [2013]).In numerical computations we store matrices E (1) = AZ 1 and E (2) = Z T 1 E (1) = Z T 1 AZ 1 (step 4), perform the Cholesky factorization of E (2) = C T C (step 5) and obtain F by solving the triangular system FC = E (1) .

Algorithm 4 Randomised eigenvalue decomposition for symmetric positive semidefinite A, Nyström
Input: symmetric positive semidefinite matrix A ∈ R nA×nA , target rank k, an oversampling parameter l Output: orthogonal U 2 ∈ R nA×k with approximations to eigenvectors of A as its columns, and diagonal Θ 2 ∈ R k×k with approximations to eigenvalues of A on the diagonal Note that the Ritz vectors given by Algorithms 3, 4 and 5 are different.Although Algorithm 5 accesses the matrix A only once, it requires an additional orthogonalisation of a matrix of size n A × (k + l).
In Table 1, we summarise some properties of the Lanczos, REVD, Nyström and REVD ritzit methods when they are used to compute Ritz values and vectors to generate a preconditioner for A in incremental data assimilation.Note that the cost of applying the spectral-LMP depends on the number of vectors k used in its construction and is independent of which method is used to obtain them.The additional cost of using randomised algorithms arises only once per inner loop when the preconditioner is generated.We recall that in these algorithms the required EVD or SVD of the small matrix can be obtained cheaply and the most expensive parts of the algorithms are the matrix-matrix products of A and n A × (k + l) matrices.If enough computational resources are available, these can be parallelised.In the best case scenario, all k + l matrix-vector products can be performed at the same time, making the cost of the matrix-matrix product equivalent to the cost of one iteration of CG plus communication between the processors.
When a randomised method is used to generate the preconditioner, an inner loop is performed as follows.Estimates of the Ritz values of the Hessian and the corresponding Ritz vectors are obtained with a randomised method (Algorithm 3, 4 or 5) and used to construct an LMP.Then the system (11) with the exact Hessian A is solved with PCG (Algorithm 1) using the LMP.The state is updated in the outer loop using the PCG solution.

Cholesky factorization
None None

None
Triangular solve None None Deterministic SVD None None

None
Table 1: A summary of the properties of the different methods of obtaining k Ritz vectors and values to generate the preconditioner for a n A × n A matrix A in the ith inner loop.Here l is the oversampling parameter.* applies for CG with reorthogonalization.

Numerical experiments
We demonstrate our proposed preconditioning strategies using two models: a simple linear advection model to explore the spectra of the preconditioned Hessian and the nonlinear Lorenz 96 model [Lorenz, 1996] to explore the convergence of split preconditioned CG (PCG) .We perform identical twin experiments, where x t = ((x t 0 ) T , . . ., (x t N ) T ) T denotes the reference trajectory.The observations and background state are generated by adding noise with covariance matrices R and B, respectively, to H i (x t i ) and x 0 .We use direct observations, thus the observation operator H i is linear.We use covariance matrices R i = σ 2 o Iq i , where q i is the number of observations at time t i , Q i = σ 2 q Cq, where Cq is a Laplacian correlation matrix [Johnson et al., 2005], and B = σ 2 b C b , where C b is a second-order auto-regressive correlation matrix [Daley, 1993].
We assume that first level preconditioning has already been applied (recall ( 10)).In data assimilation, using Ritz-LMP as formulated in ( 17) is impractical because of the matrix products with A and we cannot use a simple formulation of Ritz-LMP when the Ritz values and vectors are obtained with the randomised methods.Hence, we use the spectral-LMP.However, as we mentioned in Section 3.2, the spectral-LMP that is constructed with well converged Ritz values and vectors acts like Ritz-LMP.When we consider the second inner loop, we compare the spectral-LMPs with information from the randomised methods with the spectral-LMP constructed with information obtained with the Matlab function eigs in the previous inner loop.eigs returns a highly accurate estimate of a few largest or smallest eigenvalues and corresponding eigenvectors.We will use the term randomised LMP to refer to the spectral-LMPs that are constructed with information from the randomised methods, and deterministic LMP to refer to the spectral-LMP that is constructed with information from eigs.
The computations are performed with Matlab R2017b.Linear systems are solved using the Matlab implementation of PCG (function pcg), which was modified to allow split preconditioning to maintain the symmetric coefficient matrix at every loop.

Advection model
First, we consider the linear advection model: where z ∈ [0, 1] and t ∈ (0, T ).An upwind numerical scheme is used to discretise (21) (see, e.g.Chapter 4 of Morton and Mayers [1994]).To allow us to compute all the eigenvalues (described in the following section), we consider a small system with the linear advection model.The domain is divided into n = 40 equally spaced grid points, with grid spacing ∆z = 1/n.We run the model for 51 time steps (N = 50) with the time step size ∆t = 1/N , hence A is a 2040 × 2040 matrix.The Courant number is C = 0.8 (the upwind scheme is stable with C ∈ [0, 1]).The initial conditions are Gaussian u(z, 0) = 6 exp − (z−0.5) 2 2×0.1 2 , and the boundary conditions are periodic u(1, t) = u(0, t).We set σo = 0.05, σq = 0.05 and σ b = 0.1.Cq and C b have length scales equal to 10∆y.Every 4th model variable is observed at every 5th time step, ensuring that there is an observation at the final time step (100 observations in total).Because the model and the observational operator H i are linear, the cost function ( 2) is quadratic and its minimiser is found in the first loop of the incremental method.

Eigenvalues of the preconditioned matrix
We apply the randomised LMPs in the first inner loop.Note that if the deterministic LMP is used, it is unclear how to precondition the first inner loop.We explore what effect the randomised LMPs have on the eigenvalues of A. The oversampling parameter l is set to 5 and the randomised LMPs are constructed with k = 25 vectors.
The Ritz values of A given by the randomised methods are compared to those computed using eigs (Figure 1(I)).The Nyström method produces a good approximation of the largest eigenvalues, REVD gives a slightly worse approximation, except for the largest five eigenvalues.The REVD ritzit method underestimates the largest eigenvalues significantly.The largest eigenvalues of the preconditioned matrices are smaller than the largest eigenvalue of A (Figure 1(II)).However, the smallest eigenvalues of the preconditioned matrices are less than one and hence applying the preconditioner expands the spectrum of A at the lower boundary (Figure 1(III)), so that Theorem 3.4 of Gratton et al. [2011], which considers the non-expansiveness of the spectrum of the Hessian after preconditioning with an LMP, does not hold.This happens because the formulation of the spectral-LMP is derived assuming that the eigenvalues and eigenvectors are exact, while the randomized methods provide only approximations.Note that even though REVD ritzit gives the worst approximations of the largest eigenvalues of the Hessian, using the randomised LMP with information from REVD ritzit reduces the largest eigenvalues of the preconditioned matrix the most and the smallest eigenvalues are close to one.Using the randomised LMP with estimates from Nyström gives similar results.Hence, the condition number of the preconditioned matrix is lower when the preconditioners are constructed with REVD ritzit or Nyström compared to REVD.
The values of the quadratic cost function at the first ten iterations of PCG are shown in Figure 1(IV).Using the randomised LMP that is constructed with information from REVD is detrimental to the PCG convergence compared to using no preconditioning.Using information from the Nyström and REVD ritzit methods results in similar PCG convergence and low values of the quadratic cost function are reached in fewer iterations than without preconditioning.The PCG convergence may be explained by the favourable distribution of the eigenvalues after preconditioning using Nyström and REVD ritzit, and the smaller than one eigenvalues when using REVD.These results, however, do not necessarily generalize to an operational setting as this system is well conditioned while operational settings are not.This will be investigated further in the next section.

Lorenz 96 model
We next use the Lorenz 96 model to examine what effect the randomised LMPs have on PCG performance.In the Lorenz 96 model the evolution of the n components X j , j ∈ {1, 2, . . ., n} of x i is governed by a set of n coupled ODEs: where X −1 = X n−1 , X 0 = X n and X n+1 = X 1 and F = 8.The equations are integrated using a fourth order Runge-Kutta scheme [Butcher, 1987].We set n = 80 and N = 150 (the size of A is 12080 × 12080) and observe every 10th model variable at every 10th time step (120 observations in total), ensuring that there are observations at the final time step.The grid point distance is ∆X = 1/n and the time step is set to ∆t = 2.5 × 10 −2 .
For the covariance matrices we use σo = 0.15 and σ b = 0.2.C b has length scale equal to 2∆X.Two setups are used for the model error covariance matrix: • σq = 0.1 and Cq has length scale Lq = 2∆X (the same as C b ); • σq = 0.05 and Cq has length scale Lq = 0.25∆X.
In our numerical experiments, the preconditioners have very similar effect using both setups.Hence, we present plots for the case σq = 0.1 and Lq = 2∆X in the following sections, except Figure 3.
The first outer loop is performed and no second level preconditioning is used in the first inner loop, where PCG is run for 100 iterations or until the relative residual norm reaches 10 −6 .In the following sections, we use randomised and deterministic LMPs in the second inner loop.PCG has the same stopping criteria as in the first inner loop.

Minimising the inner loop cost function
In Figure 2, we compare the performance of the randomised LMPs with the deterministic LMP.We also consider the effect of varying k, the number of vectors used to construct the preconditioner.We set the oversampling parameter to l = 5.Because results from the randomized methods depend on the random matrix used, we perform 50 experiments with different realizations for the random matrix.We find that the different realizations lead to very similar results (see Figure 2(I)).
Independently of the k value, there is an advantage in using the second level preconditioning.The reduction in the value of the quadratic cost function is faster using the randomised LMPs compared to the deterministic LMPs, with REVD ritzit performing the best after the first few iterations.The more information we use in the preconditioner (i.e. the higher k value), the faster REVD ritzit shows better results than the other methods.The performances of the REVD and Nyström methods are similar.Note that as k increases, the storage (see Table 1) and work per PCG iteration increase.Examination of the Ritz values given by the randomised methods shows that REVD ritzit gives the worse estimate of the largest eigenvalues, as was the case when using the advection model.We calculated the smallest eigenvalue of the preconditioned matrix (C sp 5 ) T AC sp 5 using eigs.When C sp 5 is constructed using REVD ritzit or Nyström the smallest eigenvalue of (C sp 5 ) T AC sp 5 is equal to one, whereas using REVD it is approximately 0.94.This may explain why the preconditioner constructed using REVD may not perform as well as other randomised preconditioners, but it is not entirely clear why the preconditioner that uses REVD ritzit shows the best performance.
The PCG convergence when using the deterministic LMP and the randomised LMP with information from REVD ritzit with different k values is compared in Figure 3 for both setups of the model error covariance matrix.For the deterministic LMP, varying k has little effect, especially in the first iterations of PCG.However, for REVD ritzit, increasing k results in a greater decrease of the cost function in the first iterations of PCG.Also, at any iteration of PCG we obtain a lower value of the quadratic cost function using the randomised LMP with k = 5 compared to the deterministic LMP with k = 15, which uses exact eigenpair information from the Hessian of the previous loop.

Effect of the observation network
To understand the sensitivities of the results from the different LMPs to the observation network, we consider a system with the same parameters as in the previous section, where we had 120 observations, but we now observe • every 5th model variable at every 5th time step (480 observations in total); • every 2nd variable at every 2nd time step (3000 observations in total).The oversampling parameter is again set to l = 5 and we set k = 5 and k = 15 for both observation networks.Since the number of observations is equal to the number of eigenvalues that are larger than one and there are more observations than in the previous section, there are more eigenvalues that are larger than one after the first level preconditioning.Because all 50 experiments with different Gaussian matrices in the previous section were close to the mean, we perform 10 experiments for each randomised method, solve the systems and report the means of the quadratic cost function.
The results are presented in Figure 4. Again, the randomised LMPs perform better than the deterministic LMP.However, if the preconditioner is constructed with a small amount of information about the system (k = 5 for both systems and k = 15 for the system with 3000 observations), then there is little difference in the performance of different  methods can be used to cheaply construct the preconditioner in the current inner loop, with no dependence on the previous inner loop, and are parallelisable.Numerical experiments with the linear advection and Lorenz-96 models have shown that the randomised LMPs constructed with approximate eigenpairs improve the convergence of PCG more than deterministic LMPs with information from the previous loop.The quadratic cost function reduces more rapidly when using a randomised LMP rather than a deterministic LMP, even if the randomised LMP is constructed with fewer vectors than the deterministic LMP.Also, for the randomised LMPs, the more information about the system we use (i.e. more approximations of eigenpairs are used to construct the preconditioner), the greater the reduction in the quadratic cost function.Using more information to construct a deterministic LMP may not result in larger reduction of the quadratic cost function,  especially in the first iterations of PCG, which is in line with results in Tshimanga et al. [2008].However, if not enough information is included in the randomised LMP, then preconditioning may have no effect on the first few iterations of PCG.
Of the randomised methods considered, the best overall performance was for REVD ritzit.However, if we run a small number of PCG iterations, the preconditioners obtained with different randomised methods give similar results.The performance was independent of the choice of the random Gaussian start matrix and it may be improved with oversampling.
In this work we apply randomised methods to generate a preconditioner, which is then used to accelerate the solution of the exact inner loop problem (11) with PCG method (as discussed in Section 4).A different approach has been explored by Bousserez and Henze [2018] and Bousserez et al. [2020], who presented and tested a randomised solution algorithm called the Randomized Incremental Optimal Technique (RIOT) in data assimilation.RIOT is designed to be used instead of PCG and employs a randomised eigenvalue decomposition of the Hessian (using a different method than the ones presented in this paper) to directly construct the solution x in (11), which approximates the solution given by PCG.
The randomised preconditioning approach can also be employed to minimise other quadratic cost functions, including the strong constraint 4D-Var formulation.Further exploration of other single-pass versions of the randomised methods for the eigenvalue decomposition, that are discussed in Halko et al. [2011], may be useful.In particular, the single-pass version of the Nyström method is potentially attractive.If a large number of Ritz vectors are used to construct the preconditioner, more attention can be paid to choosing the value of the oversampling parameter l in the randomised methods.In some cases a better approximation may be obtained if l linearly depends on the target rank of the approximation Nakatsukasa [2020].

Figure 1 :
Figure 1: Advection problem.((I)) 25 largest eigenvalues of A (eigs) and their estimates given by randomised methods, the largest eigenvalues and their estimates given by REVD and Nyström coincide.((II)) largest eigenvalues of A (no LMP, the same as eigs in ((I))) and (C sp 25 ) T AC sp 25 , where C sp 25 is constructed with Ritz values in ((I)) and corresponding Ritz vectors.((III)) smallest eigenvalues of A and (C sp 25 ) T AC sp 25 .((IV)) quadratic cost function value versus PCG iteration when solving systems with A and (C sp 25 ) T AC sp 25 .
Figure3: A comparison of the values of the quadratic cost function at every PCG iteration when using deterministic LMP with information from the previous loop (eigs) and the randomised LMP with information from REVD ritzit for different k values (5, 10 and 15).No second level preconditioning is also shown (case ((I)) is the same as in Figure2).In cases ((I)) and ((II)) the model error covariance matrices are constructed using parameters σq and Lq.

Figure 4 :
Figure 4: As in Figure2, but for two systems with q observations and 10 experiments are done for each randomised method and the mean values plotted.

Figure 5 :
Figure 5: Standard deviation of the quadratic cost function at every iteration of PCG when the spectral-LMP is constructed with different randomised methods.For every randomised method we do 50 experiments.Here σq = 0.1 and Lq = 2∆X.