New bounds on the condition number of the Hessian of the preconditioned variational data assimilation problem

Data assimilation algorithms combine prior and observational information, weighted by their respective uncertainties, to obtain the most likely posterior estimate of the state of a dynamical system. In variational data assimilation the posterior is computed by solving a nonlinear least squares problem. Many numerical weather prediction (NWP) centers use full observation error covariance (OEC) weighting matrices, which can slow convergence of the data assimilation procedure. Previous work revealed the importance of the minimum eigenvalue of the OEC matrix for conditioning and convergence of the unpreconditioned data assimilation problem. In this article we examine the use of correlated OEC matrices in the preconditioned data assimilation problem for the first time. We consider the case where there are more state variables than observations, which is typical for applications with sparse measurements, for example, NWP and remote sensing. We find that similarly to the unpreconditioned problem, the minimum eigenvalue of the OEC matrix appears in new bounds on the condition number of the Hessian of the preconditioned objective function. Numerical experiments reveal that the condition number of the Hessian is minimized when the background and observation lengthscales are equal. This contrasts with the unpreconditioned case, where decreasing the observation error lengthscale always improves conditioning. Conjugate gradient experiments show that in this framework the condition number of the Hessian is a good proxy for convergence. Eigenvalue clustering explains cases where convergence is faster than expected.

In this article we examine the effect of correlated observation error covariance (OEC) matrices on the convergence of the preconditioned variational data assimilation problem. We develop new bounds on the condition number of the Hessian of the linearized preconditioned objective function. Numerical experiments allow us to compare the bounds to the computed condition number. We also investigate the relationship between conditioning, the full spectrum of the Hessian and convergence of a linear data assimilation test problem to assess the suitability of using the condition number of the Hessian as a proxy for convergence in this setting.
We now define the variational data assimilation objective function of interest for this article. For a time window [t_0, t_n], we let x_i^t ∈ R^N be the true state of the dynamical system of interest at time t_i, where N is the number of state variables. The prior, or background state, x_b, is valid at the initial time t_0 and can be written as an approximation to the true state as x_b = x_0^t + η_b. We assume that the background errors η_b ∼ 𝒩(0, B), where B ∈ R^{N×N} is the background error covariance matrix. As observations can be made at different locations, or of different variables to those in the state vector x_i, we define an observation operator h_i : R^N → R^{p_i} which maps from state variable space to observation space at time t_i. Observations y_i ∈ R^{p_i} at time t_i are similarly expressed as y_i = h_i[x_i^t] + η_i for i = 0, … , n: the sum of the model equivalent h_i[x_i^t] and an observation error η_i ∼ 𝒩(0, R_i), where R_i ∈ R^{p_i×p_i} are the OEC matrices. We additionally assume that the observation and background errors are mutually uncorrelated. The total number of observations across the whole time window is given by p = Σ_{i=0}^n p_i. The state x_{i−1} at time t_{i−1} is propagated to the next observation time t_i using a nonlinear forecast model operator, ℳ, to obtain

x_i = ℳ(t_{i−1}, t_i; x_{i−1}).   (1)

In variational data assimilation the analysis, x_0, or most likely state at the initial time t_0, minimizes the full 4D-Var objective function, given by

J(x_0) = (1/2)(x_0 − x_b)^T B^{−1}(x_0 − x_b) + (1/2) Σ_{i=0}^n (y_i − h_i[x_i])^T R_i^{−1}(y_i − h_i[x_i]),   (2)

subject to the model constraint (1). In applications such as numerical weather prediction (NWP), the nonlinear objective function (2) is typically minimized using an iterative method. The most common implementation is the incremental formulation, which solves the variational data assimilation problem via a small number of nonlinear outer loops, and a larger number of inner loop iterations which minimize a linearized least squares problem. 1 This procedure is equivalent to a Gauss-Newton method [2][3][4] and will be presented in Section 2.
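For the simplest special case, n = 0 (a 3D-Var problem) with a linear observation operator, the objective function and its minimizer can be written in a few lines. The following Python sketch is purely illustrative (the article's experiments use Matlab); the random covariance matrices, operator, and data are hypothetical stand-ins, and the analysis is obtained by solving the normal equations whose system matrix is the Hessian B^{−1} + H^T R^{−1} H:

```python
import numpy as np

rng = np.random.default_rng(3)
N, p = 12, 5  # more state variables than observations

# Hypothetical SPD covariance matrices and a linear observation operator.
X = rng.standard_normal((N, N)); B = X @ X.T + N * np.eye(N)
Y = rng.standard_normal((p, p)); R = Y @ Y.T + p * np.eye(p)
H = rng.standard_normal((p, N))
x_b = rng.standard_normal(N)   # background state
y = rng.standard_normal(p)     # observations

def J(x):
    """3D-Var objective: the n = 0 case of the 4D-Var cost function (2)."""
    db = x - x_b
    do = y - H @ x
    return 0.5 * db @ np.linalg.solve(B, db) + 0.5 * do @ np.linalg.solve(R, do)

# The analysis x_a solves the normal equations A x = B^{-1} x_b + H^T R^{-1} y,
# where A = B^{-1} + H^T R^{-1} H is the Hessian of J.
A = np.linalg.inv(B) + H.T @ np.linalg.solve(R, H)
rhs = np.linalg.solve(B, x_b) + H.T @ np.linalg.solve(R, y)
x_a = np.linalg.solve(A, rhs)
```

At the minimizer the gradient B^{−1}(x_a − x_b) − H^T R^{−1}(y − H x_a) vanishes, which provides a direct numerical check of the computation.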
For many systems in the geosciences and neurosciences 5,6 the number of state variables, N, can be of the order of 10^9. In this article we consider the case where the number of state variables is greater than the number of observations, that is, N > p, an assumption which holds for applications with sparse measurement data. The large dimension of the state space motivates the use of a control variable transform (CVT) to model the background error covariance matrix, B, implicitly. 7 The CVT uses the square root of B as a variable transform to obtain a modified objective function [8, sec 9.1], and can be interpreted as a form of preconditioning. The transformation diagonalizes the weighting on the first term of (2), making the transformed state variables uncorrelated. We refer to the incremental variational problem with the CVT as the preconditioned data assimilation problem for the remainder of this article.
As the inner iterations of the incremental 4D-Var algorithm solve a linear least squares problem, the conjugate gradient method can be used for the minimization of the linearized objective function. [9][10][11] Convergence of a conjugate gradient method can be bounded by the condition number of the Hessian of the objective function; 12-14 therefore the condition number of the linearized Hessian can be considered as a proxy to study how changes to a data assimilation method are likely to affect convergence of the inner loop. The Hessian of the linearized preconditioned objective function is given by

Ŝ = I_N + Σ_{i=0}^n B^{1/2} (H_i M_i ⋯ M_1)^T R_i^{−1} (H_i M_i ⋯ M_1) B^{1/2},   (3)

where H_i ∈ R^{p_i×N} is the linearized observation operator, and M_i ∈ R^{N×N} is the linearization of the model operator (1). The Hessian (3) is a low rank update of the identity matrix, and hence is typically better conditioned than the Hessian corresponding to the unpreconditioned problem (as its minimum eigenvalue is one). However, the distribution of the full spectrum, and not just the extreme eigenvalues, is important for the conjugate gradient method (see [12, theorem 38.4; 15, theorems 38.3, 38.5]). In this article we will therefore consider how the condition number of the Hessian relates to convergence of the conjugate gradient method in an idealized numerical framework, and examine the distribution of the full spectrum of (3). In recent years there has been a rise in the introduction of correlated OEC matrices (R_i in (2)) at NWP centers (e.g., [16][17][18]). The use of correlated OEC matrices brings benefit to applications by allowing users to include more observations 19,20 at higher resolutions. Correlated OEC matrices also lead to greater information content of observations, particularly on smaller scales. 19,[21][22][23] However, the move from uncorrelated (diagonal) to correlated (full) covariance matrices has caused problems with the convergence of the data assimilation procedure in experiments at NWP centers. 17,24,25

Previous studies of the conditioning of the preconditioned Hessian have focused on the case of uncorrelated OEC matrices. 14,26 In this article we extend this theory to the case of correlated OEC matrices.
Tabeart et al. 27 considered the effect of using correlated (full) OEC matrices within the unpreconditioned data assimilation problem. The minimum eigenvalue of the correlated OEC matrix was found to be important in determining the conditioning of the Hessian of the objective function both theoretically and numerically. The condition number of the Hessian was found to be a good proxy for convergence in this framework. Haben et al. 14,26 developed bounds on the condition number of the Hessian for both the unpreconditioned and preconditioned problems in the case of uncorrelated (diagonal) OEC matrices. In the preconditioned case, reducing the observation error variance increases both the bounds on and numerical value of the condition number of the Hessian. The choice of observation network was also shown to be important for determining the conditioning and convergence of the preconditioned problem.
In this article we consider the conditioning of the preconditioned variational data assimilation problem in the case of correlated OEC matrices. We extend the analysis of Tabeart et al. 27 to the preconditioned case where there are fewer observations than state variables, that is, p < N. We begin in Section 2 by defining the problem and introducing existing mathematical results relating to conditioning. In Section 3 we present new theoretical bounds on the condition number of the preconditioned Hessian in terms of its constituent matrices. In Section 4 we introduce the numerical framework that will be used for our experiments. We present the results of these experiments and related discussion in Section 5. These experiments reveal that the ratio between background and observation error correlation lengthscales strongly influences the conditioning of the Hessian, with minimum condition numbers occurring when the two lengthscales are equal. This contrasts with the unpreconditioned case, where the condition number of the Hessian could always be reduced by decreasing the lengthscale of the observation error covariances. We find cases where the new bounds represent the qualitative behavior of the conditioning well, as well as cases where bounds from Haben 14 are tighter. For many cases the condition number of the Hessian is a good proxy for convergence of a conjugate gradient method. Cases where convergence is much faster than expected can be explained by a single large eigenvalue with the remainder clustering around unity. Our conclusions are presented in Section 6.

2.1 The CVT formulation of the data assimilation problem

In this section we define the preconditioned 4D-Var data assimilation problem and introduce further notation that will be used in this article. We recall 28 that covariance matrices can be decomposed as B = Σ_B B̃ Σ_B and R_i = Σ_{R_i} R̃_i Σ_{R_i}, where Σ_B, Σ_{R_i} are diagonal matrices containing standard deviations, and B̃, R̃_i are correlation matrices with unit entries on the diagonal. By definition covariance matrices are symmetric positive semidefinite. However, we will assume in what follows that B̃, R̃_i, B, and R_i are strictly positive definite, and therefore their inverses are well defined. We now derive the linearized incremental objective function. In this formulation, instead of finding the state which minimizes the objective function (2) directly, subject to the model constraint (1), we minimize a sequence of linearizations of the objective function to obtain a sequence of increments to the background, x_b. Typically this is done via a series of outer loops, where the forecast model and observation operators are linearized about the current best estimate of x_0.
For the lth outer loop we define x_0^{(l+1)} = x_0^{(l)} + δx_0^{(l)}. We then consider the Taylor expansion of ℳ(t_{i−1}, t_i; x_{i−1}^{(l)}) and obtain the linearization δx_i^{(l)} = M_i δx_{i−1}^{(l)}. We then write the linearized objective function in terms of δx_0^{(l)} as

J̃(δx_0^{(l)}) = (1/2)(δx_0^{(l)} − [x_b − x_0^{(l)}])^T B^{−1}(δx_0^{(l)} − [x_b − x_0^{(l)}]) + (1/2) Σ_{i=0}^n (d_i^{(l)} − H_i M_i ⋯ M_1 δx_0^{(l)})^T R_i^{−1}(d_i^{(l)} − H_i M_i ⋯ M_1 δx_0^{(l)}),   (4)

where d_i^{(l)} = y_i − h_i[x_i^{(l)}] are the innovation vectors. These measure the misfit between the observations and the linearized state, using the full nonlinear observation operator.
In order to simplify the notation in what follows we can group the linearized forecast model and observation operator terms together as a single linear operator. We define the generalized observation operator as

Ĥ = ( H_0^T, (H_1 M̂_1)^T, …, (H_n M̂_n)^T )^T ∈ R^{p×N},   (5)

where the linearized forward model from time t_0 to time t_i is given by

M̂_i = M_i M_{i−1} ⋯ M_1, with M̂_0 = I_N.   (6)

Finally we let R̂ ∈ R^{p×p} denote the block diagonal matrix with the ith block consisting of R_i. This allows us to write the Hessian of the linearized objective function, (4), in the simplified form

S = B^{−1} + Ĥ^T R̂^{−1} Ĥ.   (7)

The formulation of the objective function given by (4) is too expensive to be used in practice, both in terms of computation and storage. The number of state variables, N, is very large and typically B cannot be stored explicitly.
The CVT formulates the objective function in terms of alternative "control variables," which means that the background matrix B does not need to be stored explicitly. The CVT is described in detail by Bannister,7,29 and is often used in NWP applications.
The CVT may be applied to the incremental form of the variational problem (4), via the change of variable δz_0^{(l)} = B^{−1/2} δx_0^{(l)}. This yields the objective function

Ĵ(δz_0^{(l)}) = (1/2)(δz_0^{(l)} − B^{−1/2}[x_b − x_0^{(l)}])^T(δz_0^{(l)} − B^{−1/2}[x_b − x_0^{(l)}]) + (1/2)(d^{(l)} − Ĥ B^{1/2} δz_0^{(l)})^T R̂^{−1}(d^{(l)} − Ĥ B^{1/2} δz_0^{(l)}),   (8)

where

d^{(l)} = ( (d_0^{(l)})^T, (d_1^{(l)})^T, …, (d_n^{(l)})^T )^T ∈ R^p   (9)

is a vector made up of the innovation vectors. This yields a Hessian for the incremental 4D-Var problem with the CVT given by

Ŝ = I_N + B^{1/2} Ĥ^T R̂^{−1} Ĥ B^{1/2}.   (10)

Therefore using the CVT is equivalent to pre- and postmultiplying the Hessian of the incremental data assimilation problem (7) by B^{1/2} (the uniquely defined, symmetric square root of B). The exact value of B^{−1/2} is not computed, but rather an approximation is constructed using physical and statistical knowledge of the system of interest. 7 The CVT can be interpreted as preconditioning the Hessian by B^{1/2}. The data assimilation formulation described in (8) is often referred to as the preconditioned data assimilation problem, and this naming convention will be used throughout the remainder of the article. We note that as we assume B and R̂ are strictly positive definite, Ŝ is also symmetric positive definite.
The preconditioned Hessian (10) highlights the computational benefit of using the CVT. For most NWP applications there are fewer observations than state variables (typically a difference of two orders of magnitude 5 ), meaning that the second term in (10) is rank deficient. Therefore the preconditioned Hessian is a low-rank update to the identity, and hence its minimum eigenvalue is unity. This guarantees that the preconditioned Hessian will not suffer from small minimum eigenvalues that often result in ill-conditioning for the unpreconditioned problem. This improved conditioning is expected to lead to faster convergence of the associated data assimilation algorithm.
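Both properties, the equivalence of (10) to pre- and postmultiplying (7) by B^{1/2}, and the unit minimum eigenvalue of the preconditioned Hessian when p < N, are easy to verify numerically. A small illustrative Python sketch (the random covariance matrices and operator are hypothetical stand-ins, not the article's experimental setup):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 20, 8  # more state variables than observations (p < N)

# Hypothetical SPD background and observation error covariance matrices.
X = rng.standard_normal((N, N)); B = X @ X.T + N * np.eye(N)
Y = rng.standard_normal((p, p)); R = Y @ Y.T + p * np.eye(p)
H = rng.standard_normal((p, N))  # linearized observation operator

# Symmetric square root of B via its eigendecomposition.
w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T

# Preconditioned Hessian: identity plus a rank-p update.
S_hat = np.eye(N) + B_half @ H.T @ np.linalg.solve(R, H) @ B_half

# Equivalent to pre- and postmultiplying the unpreconditioned Hessian by B^{1/2}.
A = np.linalg.inv(B) + H.T @ np.linalg.solve(R, H)
S_hat_alt = B_half @ A @ B_half

min_eig = np.linalg.eigvalsh(S_hat).min()
```

Because the second term of the Hessian has rank at most p < N, the remaining N − p eigenvalues are exactly one, so `min_eig` is unity up to rounding.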
In this article we study the conditioning of the Hessian of the CVT objective function (8) as a proxy for convergence of the preconditioned data assimilation problem. We develop bounds on the condition number of (10) in terms of its constituent matrices. Separating the contribution of each matrix in the bounds allows us to investigate the effect of changes to each component of the data assimilation system on conditioning and convergence. In particular, we focus on the introduction of correlated OEC matrices within the preconditioned framework.

2.2 Some inequalities for the eigenvalues of the product of positive semidefinite Hermitian matrices
For the remainder of this article, we use the following ordering of eigenvalues: for a matrix A ∈ R^{k×k} let the eigenvalues λ_i be ordered such that λ_max(A) = λ_1(A) ≥ λ_2(A) ≥ ⋯ ≥ λ_k(A) = λ_min(A). In this section we introduce theoretical results from linear algebra. These will be used in Section 3 to develop new bounds on the condition number of the preconditioned Hessian in terms of its constituent matrices. We also present existing bounds on the Hessian of the preconditioned 3D-Var problem. The 3D-Var problem is obtained by setting n = 0 in (2). In the numerical experiments of Section 5 we will compare these existing bounds with the new bounds developed in Section 3.
We begin by formally defining the condition number.
Definition 1 (13, sec. 2.7.2). For A ∈ R^{k×k} symmetric positive definite we define the condition number κ(A) by

κ(A) = ‖A‖ ‖A^{−1}‖.   (11)

We then characterize the condition number in the 2-norm of A as

κ(A) = λ_1(A)/λ_k(A),   (12)

where λ_i are the eigenvalues of A. The condition number in the 2-norm shall be referred to as the condition number and denoted κ(A) for the remainder of this work.
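As a concrete illustration of Definition 1, for a symmetric positive definite matrix the 2-norm condition number is simply the ratio of the extreme eigenvalues. A minimal Python sketch (the 2×2 matrix is an arbitrary example):

```python
import numpy as np

def cond_spd(A):
    """2-norm condition number of a symmetric positive definite matrix:
    the ratio of its largest to smallest eigenvalue, as in (12)."""
    eigs = np.linalg.eigvalsh(A)  # eigenvalues in ascending order
    return eigs[-1] / eigs[0]

A = np.array([[4.0, 1.0], [1.0, 3.0]])
```

For SPD matrices this agrees with the general definition (11) in the 2-norm, since ‖A‖ = λ_1(A) and ‖A^{−1}‖ = 1/λ_k(A).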
We recall our additional assumption that both B and R̂ are strictly positive definite. Since Ŝ is symmetric positive definite we apply the characterization of the condition number given by (12) throughout this article.
We present two results which bound the eigenvalues of a matrix product in terms of the product of the eigenvalues of the individual matrices. These will be used in Section 3 to separate the contribution of the background and OEC matrices to κ(Ŝ).

Theorem 1. Let F, G ∈ C^{d×d} be positive semidefinite Hermitian matrices and let i_1, … , i_k denote an ordered subset of the integers {1, … , d}. Then

Π_{j=1}^k λ_{i_j}(FG) ≤ Π_{j=1}^k λ_{i_j}(F) λ_j(G).   (13)

Proof. The proof is given by Wang and Zhang, 30 theorem 3. ▪

Theorem 2. Let F, G ∈ C^{d×d} be positive semidefinite Hermitian and let i_1, … , i_k denote an ordered subset of the integers {1, … , d}. Then

Π_{j=1}^k λ_{i_j}(FG) ≥ Π_{j=1}^k λ_{i_j}(F) λ_{d−j+1}(G).   (14)

Proof. The proof is given by Wang and Zhang, 30 theorem 4. ▪

These results will be used to develop bounds on the condition number of the Hessian (10). We now present an existing bound on the condition number of the 3D-Var preconditioned Hessian, κ(Ŝ), from Haben. 14

Theorem 3. Let B ∈ R^{N×N} be the background error covariance matrix and R ∈ R^{p×p} be the OEC matrix with p < N. Then the following bounds are satisfied by the condition number of the preconditioned 3D-Var Hessian Ŝ = I_N + B^{1/2} H^T R^{−1} H B^{1/2}:

1 + (1/p) Σ_{i=1}^p Σ_{j=1}^p (R^{−1/2} H B H^T R^{−1/2})_{ij} ≤ κ(Ŝ) ≤ 1 + ‖R^{−1/2} H B H^T R^{−1/2}‖_∞.   (15)

Proof. The proof is given by Haben, 14 theorem 6.2.1. ▪

We note that the result of Theorem 3 extends naturally to the 4D-Var problem by replacing R with R̂ and H with Ĥ. As discussed at the end of Section 2.1, we want to develop bounds that separate the contribution of each constituent matrix. This will allow us to study how altering a single term, particularly the OEC matrix, is likely to affect the conditioning and convergence of the preconditioned data assimilation system. As well as being interesting from a theoretical perspective, improved understanding of the influence of individual terms will be useful for practical applications. For example, when introducing new observation operators or OEC matrices into operational systems, bounds which separate the role of each matrix will provide insight into how the conditioning of the preconditioned 4D-Var problem is likely to change.
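For the case k = 1, i_1 = 1 that is used repeatedly below, Theorems 1 and 2 reduce to λ_1(F) λ_d(G) ≤ λ_1(FG) ≤ λ_1(F) λ_1(G). This can be checked numerically; the sketch below uses hypothetical random positive semidefinite matrices, and works with the symmetric matrix F^{1/2} G F^{1/2}, which has the same (real, nonnegative) spectrum as FG:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 10
A = rng.standard_normal((d, d)); F = A @ A.T  # positive semidefinite
C = rng.standard_normal((d, d)); G = C @ C.T

# FG is similar to F^{1/2} G F^{1/2}, so it has the same eigenvalues.
w, V = np.linalg.eigh(F)
F_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
prod_eigs = np.sort(np.linalg.eigvalsh(F_half @ G @ F_half))[::-1]

f = np.sort(np.linalg.eigvalsh(F))[::-1]  # descending, matching the article
g = np.sort(np.linalg.eigvalsh(G))[::-1]

upper = f[0] * g[0]    # Theorem 1 with k = 1, i_1 = 1
lower = f[0] * g[-1]   # Theorem 2 with k = 1, i_1 = 1
```

The clipping of `w` guards against tiny negative rounding errors before taking the square root.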
However, as the bounds given by (15) do not separate out each term, they are likely to be tighter than the new bounds which are presented in Section 3. In Section 5 we will numerically compare the bounds given by (15) with those developed in Section 3.

3 THEORETICAL BOUNDS ON THE HESSIAN OF THE PRECONDITIONED PROBLEM
In this section we develop new theoretical bounds on the condition number of the Hessian of the preconditioned variational data assimilation problem, following similar methods to the unpreconditioned case in Tabeart et al. 27 These bounds will all be presented in terms of the Hessian of the preconditioned 4D-Var problem (10). For the case n = 0, R̂ ≡ R_0 ∈ R^{p_0×p_0} and Ĥ ≡ H_0 ∈ R^{p_0×N}, meaning that the bounds in this section will also apply directly to the preconditioned 3D-Var Hessian. This relation will be used in the numerical experiments presented in Section 5.
Key Assumption. The total number of observations across the time window, p, is smaller than the number of state variables, that is, p < N.
The first result shows that the condition number can be calculated via the eigenvalues of the rank-p update B^{1/2} Ĥ^T R̂^{−1} Ĥ B^{1/2}.

Lemma 1. Following the Key Assumption we can express the condition number of Ŝ as

κ(Ŝ) = 1 + λ_1(B Ĥ^T R̂^{−1} Ĥ)   (16)
     = 1 + λ_1(R̂^{−1} Ĥ B Ĥ^T).   (17)
Proof. We begin by showing that κ(Ŝ) = 1 + λ_1(B^{1/2} Ĥ^T R̂^{−1} Ĥ B^{1/2}), as was presented in Haben, 14 equation (4.2). We define C = B^{1/2} Ĥ^T R̂^{−1} Ĥ B^{1/2} and write Ŝ = I + C. Let γ_1 ≥ γ_2 ≥ · · · ≥ γ_N be the eigenvalues of C, with corresponding eigenvectors v_i. As p < N, C is rank deficient and therefore γ_N = 0. Therefore λ_N(Ŝ) = 1, and κ(Ŝ) = λ_1(Ŝ) = 1 + γ_1 = 1 + λ_1(C). Matrices AB and BA have the same nonzero eigenvalues, 31 and therefore we can write

λ_1(B^{1/2} Ĥ^T R̂^{−1} Ĥ B^{1/2}) = λ_1(B Ĥ^T R̂^{−1} Ĥ) = λ_1(R̂^{−1} Ĥ B Ĥ^T).

Hence, we obtain the result. ▪

The result of Lemma 1 shows that computing κ(Ŝ) only requires the computation of the maximum eigenvalue of a single matrix product. We also note that the matrix products that appear in (16) and (17) are of different dimensions: B Ĥ^T R̂^{−1} Ĥ ∈ R^{N×N} and R̂^{−1} Ĥ B Ĥ^T ∈ R^{p×p}. Additionally, by the Key Assumption (as p < N) the first matrix product is always rank deficient, whereas for the case that Ĥ^T R̂^{−1} Ĥ is rank p, the second matrix product is full rank.
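Lemma 1 can be verified numerically: the condition number of Ŝ computed directly agrees with 1 + λ_1 of either matrix product, even though the two products have different dimensions (N×N versus p×p). An illustrative Python sketch with hypothetical random matrices:

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 12
X = rng.standard_normal((N, N)); B = X @ X.T + N * np.eye(N)
Y = rng.standard_normal((p, p)); R = Y @ Y.T + p * np.eye(p)
H = rng.standard_normal((p, N))

w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T
S_hat = np.eye(N) + B_half @ H.T @ np.linalg.solve(R, H) @ B_half
kappa = np.linalg.cond(S_hat)

# N x N and p x p formulations, as in (16) and (17): both nonsymmetric,
# but with the same nonzero (real, nonnegative) eigenvalues.
lam_N = np.linalg.eigvals(B @ H.T @ np.linalg.solve(R, H)).real.max()
lam_p = np.linalg.eigvals(np.linalg.solve(R, H @ B @ H.T)).real.max()
```

In large applications the p×p formulation is the cheaper of the two, since p < N.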
Previous studies 14,27 have considered the effect of separately changing the variances and correlations associated with the background and observation covariances. In our numerical experiments in Section 5, we will focus on the role of the correlations in B and R in the conditioning of the preconditioned assimilation problem and assume the variances are constant, that is, Σ_B = σ_b I_N and Σ_{R_i} = σ_o I_{p_i}. In that case it is known 26 that κ(Ŝ) increases and decreases with the ratio of the background variance to the observation variance. We therefore assume in the experiments that the variances all take unit values and examine how changes to the background and observation correlations affect the conditioning.

3.1 General bounds on the condition number
We now develop bounds on the condition number of Ŝ in terms of its constituent matrices. We assume that B and R̂ are strictly positive definite, and that the Key Assumption holds. Otherwise we make no further restrictions on the structure of the constituent matrices in this section.

Theorem 4. Given the Key Assumption we can bound κ(Ŝ) by

1 + max{ λ_1(Ĥ^T R̂^{−1} Ĥ) λ_N(B), λ_1(Ĥ B Ĥ^T)/λ_1(R̂), λ_p(Ĥ B Ĥ^T)/λ_p(R̂) } ≤ κ(Ŝ) ≤ 1 + min{ λ_1(B) λ_1(Ĥ^T R̂^{−1} Ĥ), λ_1(Ĥ B Ĥ^T)/λ_p(R̂) }.   (20)
Proof. We write κ(Ŝ) as in the statement of Lemma 1. To obtain the upper bound of (20), we use the result of Theorem 1 to separate the contribution of the background and observation terms:

λ_1(B Ĥ^T R̂^{−1} Ĥ) ≤ λ_1(B) λ_1(Ĥ^T R̂^{−1} Ĥ).   (21)

Similarly the alternative formulation from Lemma 1 yields

λ_1(R̂^{−1} Ĥ B Ĥ^T) ≤ λ_1(R̂^{−1}) λ_1(Ĥ B Ĥ^T) = λ_1(Ĥ B Ĥ^T)/λ_p(R̂).   (22)

Combining these two expressions yields the upper bound in the theorem statement.

To compute the lower bound of (20), we apply the result of Theorem 2 to (16):

λ_1(B Ĥ^T R̂^{−1} Ĥ) ≥ max{ λ_1(Ĥ^T R̂^{−1} Ĥ) λ_N(B), λ_1(B) λ_N(Ĥ^T R̂^{−1} Ĥ) } = λ_1(Ĥ^T R̂^{−1} Ĥ) λ_N(B).   (23)

This last equality is due to the fact that Ĥ^T R̂^{−1} Ĥ is rank deficient. It follows from fact 5.11.14 of Bernstein 32 that λ_N(Ĥ^T R̂^{−1} Ĥ) = 0. Applying the result of Theorem 2 to (17) with k = 1, i_1 = 1, d = p we obtain

λ_1(R̂^{−1} Ĥ B Ĥ^T) ≥ max{ λ_1(Ĥ B Ĥ^T) λ_p(R̂^{−1}), λ_1(R̂^{−1}) λ_p(Ĥ B Ĥ^T) } = max{ λ_1(Ĥ B Ĥ^T)/λ_1(R̂), λ_p(Ĥ B Ĥ^T)/λ_p(R̂) }.   (24)

Combining the results of (21)-(24) yields (20) as required. ▪

We can separate the contribution of the OEC matrix from the observation operator to give the following bound.

Corollary 1. Under the same conditions as in Theorem 4, we can bound κ(Ŝ) by

1 + λ_N(B) max{ λ_p(Ĥ Ĥ^T)/λ_p(R̂), λ_1(Ĥ Ĥ^T)/λ_1(R̂) } ≤ κ(Ŝ) ≤ 1 + λ_1(B) λ_1(Ĥ Ĥ^T)/λ_p(R̂).   (25)
Proof. We begin by considering the upper bound of (20). By theorem 21.10 of Harville, 31 Ĥ^T R̂^{−1} Ĥ has precisely the same nonzero eigenvalues as R̂^{−1} Ĥ Ĥ^T. It follows from fact 5.11.14 of Bernstein 32 that λ_1(Ĥ^T R̂^{−1} Ĥ) = λ_1(R̂^{−1} Ĥ Ĥ^T). Applying Theorem 1 for k = 1, i_1 = 1, d = p yields

λ_1(B) λ_1(Ĥ^T R̂^{−1} Ĥ) ≤ λ_1(B) λ_1(R̂^{−1}) λ_1(Ĥ Ĥ^T) = λ_1(B) λ_1(Ĥ Ĥ^T)/λ_p(R̂).   (26)

By theorem 21.10 of Harville, 31 Ĥ B Ĥ^T has precisely the same nonzero eigenvalues as B Ĥ^T Ĥ. Applying Theorem 1 for k = 1, i_1 = 1, d = N yields

λ_1(Ĥ B Ĥ^T)/λ_p(R̂) = λ_1(B Ĥ^T Ĥ)/λ_p(R̂) ≤ λ_1(B) λ_1(Ĥ^T Ĥ)/λ_p(R̂) = λ_1(B) λ_1(Ĥ Ĥ^T)/λ_p(R̂).   (27)

The final equality arises as the nonzero eigenvalues of Ĥ Ĥ^T are equal to those of Ĥ^T Ĥ. Therefore the two cases from Theorem 4 yield the same "factorized" upper bound, which gives the upper bound in (25).

We now consider the first term in the lower bound of (20) and bound λ_1(Ĥ^T R̂^{−1} Ĥ) below. We separate the contribution of R̂ and Ĥ Ĥ^T using Theorem 2 for k = 1, i_1 = 1, d = p. This yields

λ_1(Ĥ^T R̂^{−1} Ĥ) = λ_1(R̂^{−1} Ĥ Ĥ^T) ≥ max{ λ_1(R̂^{−1}) λ_p(Ĥ Ĥ^T), λ_1(Ĥ Ĥ^T) λ_p(R̂^{−1}) } = max{ λ_p(Ĥ Ĥ^T)/λ_p(R̂), λ_1(Ĥ Ĥ^T)/λ_1(R̂) }.   (28)

Multiplying this by λ_N(B) gives the two terms that appear in the lower bound of (25). We now consider the second term of (20) and bound λ_1(Ĥ B Ĥ^T) below. We separate the contribution of B and Ĥ^T Ĥ using Theorem 2 for k = 1, i_1 = 1, d = N. This yields

λ_1(Ĥ B Ĥ^T) = λ_1(B Ĥ^T Ĥ) ≥ max{ λ_1(Ĥ^T Ĥ) λ_N(B), λ_1(B) λ_N(Ĥ^T Ĥ) } = λ_1(Ĥ Ĥ^T) λ_N(B).   (29)

The last equality follows as Ĥ^T Ĥ is not full rank and therefore λ_N(Ĥ^T Ĥ) = 0. Multiplying this result by 1/λ_1(R̂) gives the same value as the second term in (25).

Finally, we bound the third term of the lower bound in (20). By theorem 21.10 of Harville, 31 λ_p(Ĥ B Ĥ^T) = λ_p(B Ĥ^T Ĥ). Applying Theorem 2 for k = 1, i_1 = p, d = N yields

λ_p(B Ĥ^T Ĥ) ≥ max{ λ_p(B) λ_N(Ĥ^T Ĥ), λ_p(Ĥ^T Ĥ) λ_N(B) } = λ_p(Ĥ^T Ĥ) λ_N(B).   (30)

Multiplying the second term of (30) by 1/λ_p(R̂) gives the first term in (25), as λ_p(Ĥ^T Ĥ) = λ_p(Ĥ Ĥ^T). ▪

In general it is not possible to determine which term in the lower bound of (25) is larger, as this will depend on the choice of B, Ĥ, and R̂. However, we are able to comment on how the bounds are likely to be altered by changes to individual matrices. As we increase λ_p(R̂), both the upper bound and the first term in the lower bound decrease. Increasing λ_1(R̂) will lead to a decrease in the second term of the lower bound. As λ_N(B) increases, the lower bound will increase but the upper bound will remain unchanged. Increasing λ_1(B) will yield a larger upper bound and has no effect on the lower bound. Larger values of λ_p(Ĥ Ĥ^T) will lead to increases of the first term in the lower bound, and larger values of λ_1(Ĥ Ĥ^T) will lead to increases of the upper bound and the second term of the lower bound. In the experiments in Section 5 we will study how each of these terms changes with interacting parameters, and assess which lower bound is tighter in a variety of situations.
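The bounds of Corollary 1 separate the extreme eigenvalues of B, R̂, and Ĥ Ĥ^T, so they can be evaluated cheaply for a given system. A Python sketch checking the bound terms against the computed condition number (hypothetical random matrices, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 30, 12
X = rng.standard_normal((N, N)); B = X @ X.T + N * np.eye(N)
Y = rng.standard_normal((p, p)); R = Y @ Y.T + p * np.eye(p)
H = rng.standard_normal((p, N))

w, V = np.linalg.eigh(B)
B_half = V @ np.diag(np.sqrt(w)) @ V.T
S_hat = np.eye(N) + B_half @ H.T @ np.linalg.solve(R, H) @ B_half
kappa = np.linalg.cond(S_hat)

b_eigs = np.linalg.eigvalsh(B)        # ascending order
r_eigs = np.linalg.eigvalsh(R)
hht_eigs = np.linalg.eigvalsh(H @ H.T)

# Lower bound: 1 + lambda_N(B) * max(lambda_p(HH^T)/lambda_p(R),
#                                    lambda_1(HH^T)/lambda_1(R)).
lower = 1 + b_eigs[0] * max(hht_eigs[0] / r_eigs[0], hht_eigs[-1] / r_eigs[-1])
# Upper bound: 1 + lambda_1(B) * lambda_1(HH^T) / lambda_p(R).
upper = 1 + b_eigs[-1] * hht_eigs[-1] / r_eigs[0]
```

Note that `eigvalsh` returns eigenvalues in ascending order, whereas the article orders them descending, so λ_1 corresponds to the last entry and λ_N (or λ_p) to the first.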

3.2 Bounds on the condition number in the case of circulant error covariance matrices
The theoretical bounds presented in Section 3.1 apply for any choice of observation and background error covariance matrices. However, for a given numerical framework, general bounds can typically be improved by exploiting specific structure of the matrices being used. 14 In this section we will show that under additional assumptions on the structure of the error covariance matrices and observation operator, the bounds given by (15) yield the exact value of κ(Ŝ).
We begin by defining circulant matrices. Circulant matrices are a natural choice for spatial correlation matrices on a one-dimensional periodic domain, as they yield correlation matrices that are homogeneous and isotropic. 14 We will make use of this structure in the numerical experiments presented in Section 5.
Definition 2 (33). A circulant matrix C ∈ R^{N×N} is a matrix of the form

C = [ c_0      c_1      ⋯  c_{N−1}
      c_{N−1}  c_0      ⋯  c_{N−2}
      ⋮                 ⋱  ⋮
      c_1      c_2      ⋯  c_0 ],

where each row is a cyclic shift of the row above it, so that C is fully determined by its first row (c_0, c_1, … , c_{N−1}).

One computationally beneficial property of circulant matrices is that their eigenvalues can be calculated directly via a discrete Fourier transform. As we shall see in Theorem 5, all circulant matrices of dimension N share the same eigenvectors.

Theorem 5. Let C ∈ R^{N×N} be a circulant matrix with first row (c_0, c_1, … , c_{N−1}). Then the eigenvalues of C are given by

λ_m = Σ_{k=0}^{N−1} c_k ω^{mk}, for m = 0, … , N − 1,

with corresponding eigenvectors

v_m = (1/√N)(1, ω^m, ω^{2m}, … , ω^{(N−1)m})^T,

where ω = e^{−2πi∕N} is an Nth root of unity.
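The Fourier characterization in Theorem 5 is straightforward to check numerically: build a circulant matrix from its first row and compare a direct eigendecomposition with the discrete Fourier transform of that row. A Python sketch with an illustrative symmetric first row (so all eigenvalues are real):

```python
import numpy as np

# First row of a symmetric circulant matrix with unit diagonal; the
# particular values are an arbitrary illustrative choice.
c = np.array([1.0, 0.5, 0.2, 0.1, 0.2, 0.5])
N = len(c)

# Build the full circulant matrix: row i is the first row cyclically shifted by i.
C = np.array([np.roll(c, i) for i in range(N)])

# Eigenvalues via the discrete Fourier transform of the first row (Theorem 5)...
fft_eigs = np.sort(np.fft.fft(c).real)
# ...agree with a direct eigendecomposition.
direct_eigs = np.sort(np.linalg.eigvalsh(C))
```

This O(N log N) eigenvalue computation is what makes circulant covariance matrices so convenient in the numerical experiments that follow.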

Proof. See Reference 34 for full derivation. ▪
Our numerical experiments in Section 5 will use circulant background and OEC matrices. When both Ĥ B Ĥ^T and R̂ are circulant, with some additional assumptions on the entries of matrix products, we can prove that the upper and lower bounds given by Theorem 3 are equal and yield the exact value of κ(Ŝ).

Corollary 2. Let Ĥ B Ĥ^T and R̂ be circulant, and suppose that all entries of the product R̂^{−1/2} Ĥ B Ĥ^T R̂^{−1/2} are positive. Then the upper and lower bounds of (15) coincide, and hence yield the exact value of κ(Ŝ).

Proof. The product of circulant matrices is a circulant matrix, the inverse of a circulant matrix is circulant, 34 and the square root of a circulant matrix is also circulant. 35 Therefore if the product Ĥ B Ĥ^T is circulant then the product R̂^{−1/2} Ĥ B Ĥ^T R̂^{−1/2} is circulant, as R̂ is circulant by assumption of the corollary.
The lower bound of (15) computes the average row sum of the product R̂^{−1/2} Ĥ B Ĥ^T R̂^{−1/2}. As the product is circulant, each row has the same sum, given by Σ_{k=0}^{p−1} c_k, where c_k is the kth entry of the first row of the circulant matrix (as introduced in Definition 2).

The upper bound of (15) returns the maximum absolute row sum of the product. As the product is circulant with only positive entries, all absolute row sums are identically equal to Σ_{k=0}^{p−1} |c_k| = Σ_{k=0}^{p−1} c_k. Hence the upper and lower bounds of (15) are equal, and both yield the exact value of κ(Ŝ). ▪

The requirement that all entries of the product R̂^{−1/2} Ĥ B Ĥ^T R̂^{−1/2} are positive is less straightforward to guarantee a priori, and depends on the specific structure of the three matrices being considered. In particular, even if all of the entries of R̂, Ĥ, and B are positive, entries of the product R̂^{−1/2} Ĥ can still be negative. The result of Corollary 2 will be used in the numerical experiments in the next section to compare the performance of the new bounds given by (25) and the existing bounds given by Theorem 3.

4 NUMERICAL FRAMEWORK
In this section we describe the numerical framework that will be used to study how the bounds on the preconditioned Hessian (10) compare with the actual value of κ(Ŝ). Although the bounds developed in Section 3 apply to the 4D-Var problem, the numerical experiments presented in Section 5 will be conducted for a 3D-Var problem. This allows us to use the framework that was introduced in Tabeart et al., 27 and directly compare the preconditioned and unpreconditioned formulations in the same numerical setting. We note that in the case of 3D-Var, R̂ and Ĥ simplify to the standard OEC matrix R and observation operator H, respectively, in (8), (10) and all the bounds in Section 3. The Hessian that is used for the experiments in this section is therefore given by Ŝ = I_N + B^{1/2} H^T R^{−1} H B^{1/2}. We now define the different components of the numerical framework. Our domain is the unit circle, and we fix the ratio of the number of state variables to observations as N = 2p, that is, twice as many state variables as observations. Similarly to Tabeart et al. 27 we define both the observation and background error covariance matrices to have a circulant structure with unit variances. Circulant matrices are a natural choice for correlations on a periodic domain with evenly distributed state variables. They also admit useful theoretical properties, as was discussed in Section 3.2. The use of circulant error covariance matrices allows us to better understand the interaction between different terms in the Hessian, and to isolate the impact of parameter changes.
The experiments presented in this article will use circulant matrices arising from the second order autoregressive (SOAR) correlation function. 36,37 SOAR matrices are used in NWP applications as a horizontal correlation function 20 and are fully defined by a correlation lengthscale for a given domain. We remark that we substitute the great circle distance in the SOAR correlation function with the chordal distance 38,39 to ensure that the properties of positive definiteness are satisfied and that we obtain a valid correlation matrix.

Definition 3. The SOAR error correlation matrix on the unit circle is given entrywise by

C_{ij} = (1 + d_{ij}/L) exp(−d_{ij}/L),

where L > 0 is the correlation lengthscale, θ_{ij} denotes the angle between grid points i and j, and d_{ij} = 2 sin(θ_{ij}/2) is the chordal distance between those grid points. The chordal distance between adjacent grid points is given by 2 sin(Δθ/2), where N is the number of gridpoints and Δθ = 2π/N is the angle between adjacent gridpoints.

Both the background and OEC matrices for the experiments presented in Section 5 will be SOAR with constant unit variance. We will denote their respective lengthscales by L_B and L_R.
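A SOAR correlation matrix of this form can be built in a few lines. The sketch below is illustrative Python (not the article's Matlab code); it constructs the matrix on evenly spaced points of the unit circle using the chordal distance and checks that the result is a symmetric, circulant, positive definite correlation matrix:

```python
import numpy as np

def soar_matrix(N, L):
    """SOAR correlation matrix on the unit circle with lengthscale L.
    Entries are (1 + d_ij/L) * exp(-d_ij/L), where d_ij is the chordal
    distance between evenly spaced grid points i and j."""
    theta = 2 * np.pi * np.arange(N) / N
    # Chordal distance between points i and j on the unit circle.
    dist = np.abs(2 * np.sin((theta[:, None] - theta[None, :]) / 2))
    return (1 + dist / L) * np.exp(-dist / L)

C = soar_matrix(40, 0.3)
```

Since the grid is evenly spaced and periodic, the entries depend only on the index difference modulo N, which is exactly the circulant structure exploited in Section 3.2.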
We now introduce the observation operators that will be used for the 3D-Var experiments. Three of our observation operators are the same as those used in Tabeart et al. 27 The fourth observation operator, H_4, is illustrated in Figure 1: observations are spread over the whole domain, but are clustered rather than evenly distributed. Figure 2 of Tabeart et al. 27 shows a representation of a low dimensional version of the observation operator structure for H_1, H_2, and H_3. For the numerical experiments in Section 5, we will use p = 100 observations and N = 200 state variables. In the unpreconditioned case, structure in the observation operator, such as regularly spaced observations, was important for the tightness of bounds and convergence of a conjugate gradient method. 27 We therefore consider H_4 as an operator without strict structure. This will allow us to see how structure (or the lack of it) affects the preconditioned problem.

Changes to the condition number of the Hessian
Our first set of experiments considers how different combinations of parameters alter the value of κ(Ŝ) and the bounds given by (25). We compute the condition number of the Hessian (10) using the Matlab 2018b function cond 40 and compare against the values given by our bounds. Previous work 27 shows that increasing the lengthscale of a SOAR correlation matrix will reduce its smallest eigenvalue and increase its largest eigenvalue. The maximum and minimum eigenvalues of both error covariance matrices appear in (25). We can therefore predict how the bounds will change with varying parameter values.
• As L_R (the lengthscale of the correlation function used to construct R) increases, λ_p(R) decreases. This means that both the upper bound and the first term in the lower bound of (25) will increase. However, λ_1(R) increases with L_R, meaning that the second term in the lower bound will decrease. It is therefore not possible to determine in general whether the lower bound will increase or decrease with increasing L_R.
• For the case of direct observations, all eigenvalues of HH^T are equal to unity. 14 Therefore the first term in the lower bound of (25) will always be greater than the second term. Both H_1 and H_2 correspond to direct observations; hence for these choices of observation operator the first term in the lower bound of (25) gives the lower bound on κ(Ŝ).
• As L_B (the lengthscale of the correlation function used to construct B) increases, λ_1(B) increases. This means that the upper bound of (25) will increase with L_B. As L_B increases, λ_N(B) decreases. This means that both terms in the lower bound of (25) will decrease with increasing L_B. Hence, the bounds (25) will diverge as L_B increases.

We wish to assess whether the qualitative behavior of κ(Ŝ) agrees with the qualitative behavior of the bounds for our experimental framework. Additionally, we are interested in determining which term in the lower bound of (25) is largest, and whether this depends on the choice of B, R, and H.
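These predictions can be checked numerically. The sketch below is our own construction: the two lower-bound terms follow the forms discussed in the bullets above, and the upper-bound form 1 + λ_1(HH^T) λ_1(B) (λ_p(R))^−1 is our reading of the earlier statement of (25), so treat it as an assumption. The observation operator is a direct operator observing every other state variable, standing in for H_1; lengthscales are illustrative.

```python
import numpy as np

def soar(n, L):
    """SOAR correlation matrix on the unit circle (chordal distance)."""
    th = 2 * np.pi * np.arange(n) / n
    dth = np.abs(th[:, None] - th[None, :])
    d = 2 * np.sin(np.minimum(dth, 2 * np.pi - dth) / 2)
    return (1 + d / L) * np.exp(-d / L)

N, p = 200, 100
B = soar(N, 0.2)              # background error covariance, L_B = 0.2
R = soar(p, 0.1)              # observation error covariance, L_R = 0.1
H = np.eye(N)[::2]            # direct observations of every other variable

# Symmetric square root of B via its eigendecomposition
wB, V = np.linalg.eigh(B)     # ascending: wB[0] = lambda_N(B), wB[-1] = lambda_1(B)
B_half = V @ np.diag(np.sqrt(np.maximum(wB, 0.0))) @ V.T

# Preconditioned Hessian: S_hat = I_N + B^{1/2} H^T R^{-1} H B^{1/2}
S_hat = np.eye(N) + B_half @ H.T @ np.linalg.solve(R, H @ B_half)
kappa = np.linalg.cond(S_hat)

wR = np.linalg.eigvalsh(R)        # wR[0] = lambda_p(R), wR[-1] = lambda_1(R)
wH = np.linalg.eigvalsh(H @ H.T)  # all unity for direct observations

lower = max(1 + wH[0] * wB[0] / wR[0],    # first term of the lower bound
            1 + wH[-1] * wB[0] / wR[-1])  # second term of the lower bound
upper = 1 + wH[-1] * wB[-1] / wR[0]       # assumed form of the upper bound
print(lower <= kappa <= upper)            # -> True
```

Note that eigvalsh returns eigenvalues in ascending order, so wB[0] plays the role of λ_N(B) and wR[0] that of λ_p(R). For these parameters the bounds hold but are wide, consistent with the discussion of Figure 3 below.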
In Section 5 we compare the bounds given by (25) with those of (15). As discussed at the end of Section 2, although we expect the bounds given by (15) to be tighter in many cases, separating the contribution of constituent matrices by using (25) will be qualitatively informative.

Convergence of a conjugate gradient algorithm
Although the conditioning of a problem is often used as a proxy for convergence, there are well-known situations where the condition number gives a pessimistic indication of convergence speed, notably in the case of repeated or clustered eigenvalues (e.g., [12, theorem 38.4; 15, theorems 38.3, 38.5]). We therefore wish to investigate how well the condition number of the preconditioned system reflects the convergence of a conjugate gradient method in our experimental framework. Following a similar method to section 5.3.2 of Tabeart et al., 27 we study how the speed of convergence of a conjugate gradient method applied to the linear system Ŝx = b changes with the parameters of the system. We define x as a vector with features at a variety of scales, calculate b = Ŝx, and then recover x. We use the Matlab 2018b routine pcg.m to recover x via the conjugate gradient method. As we are studying a preconditioned system, convergence is fast; in order to make the differences between parameter choices more evident we use a tolerance of 1 × 10^−10 on the relative residual. In Section 5 we show results for one particular realization of x to enable a fair comparison between different choices of R, B, and H. A number of other values of x were tested, with similar results. We consider how changes to the lengthscales and the observation operator alter the convergence of the conjugate gradient method. For cases where convergence behaves differently to conditioning, we study the spectrum of Ŝ to understand why these differences occur.
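A minimal version of this experiment might look as follows. This is our own sketch: a hand-rolled conjugate gradient loop stands in for Matlab's pcg.m, and the multi-scale vector x and all parameter values are illustrative.

```python
import numpy as np

def soar(n, L):
    """SOAR correlation matrix on the unit circle (chordal distance)."""
    th = 2 * np.pi * np.arange(n) / n
    dth = np.abs(th[:, None] - th[None, :])
    d = 2 * np.sin(np.minimum(dth, 2 * np.pi - dth) / 2)
    return (1 + d / L) * np.exp(-d / L)

def conjugate_gradient(A, b, rtol=1e-10):
    """Plain CG for symmetric positive definite A; returns the solution
    and the number of iterations taken to meet the relative residual."""
    x, r = np.zeros_like(b), b.copy()
    p_dir, rs = r.copy(), r @ r
    for k in range(1, len(b) + 1):
        Ap = A @ p_dir
        alpha = rs / (p_dir @ Ap)
        x = x + alpha * p_dir
        r = r - alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= rtol * np.linalg.norm(b):
            return x, k
        p_dir = r + (rs_new / rs) * p_dir
        rs = rs_new
    return x, len(b)

# Build the preconditioned Hessian for one parameter choice
N, p = 200, 100
wB, V = np.linalg.eigh(soar(N, 0.2))
B_half = V @ np.diag(np.sqrt(np.maximum(wB, 0.0))) @ V.T
H = np.eye(N)[::2]                 # direct observations of every other variable
R = soar(p, 0.1)
S_hat = np.eye(N) + B_half @ H.T @ np.linalg.solve(R, H @ B_half)

# Right-hand side built from a vector with features at several scales
t = 2 * np.pi * np.arange(N) / N
x_true = np.sin(t) + 0.5 * np.sin(5 * t) + 0.1 * np.sin(25 * t)
b = S_hat @ x_true

x_rec, iters = conjugate_gradient(S_hat, b)
print(iters, np.linalg.norm(x_rec - x_true) / np.linalg.norm(x_true))
```

Repeating the last few lines while varying L_B, L_R, and H and recording the iteration counts reproduces the structure of the experiments reported in Section 5.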

3D-VAR EXPERIMENTS
In this section we present the results of our numerical experiments. Figures will be plotted as a function of changes to the correlation lengthscales of both B and R. We recall that increasing the lengthscale of a SOAR correlation matrix will reduce its smallest eigenvalue and increase its largest eigenvalue. 27,41

Figure 2 shows how the condition number of the preconditioned Hessian (10) changes with the lengthscales of B and R for different choices of H. For H_1, increasing L_R increases the value of κ(Ŝ). Changes with L_B are much smaller, but increases to L_B lead to a slight decrease in κ(Ŝ). For both H_2 and H_3, large values of κ(Ŝ) occur for very large values of L_R and small values of L_B. For a fixed value of L_R, increasing L_B results in a rapid decrease in the value of κ(Ŝ). For small fixed values of L_R (L_R < 0.1), this decrease is followed by a slow increase in κ(Ŝ) with increasing L_B. The minimum value of κ(Ŝ) occurs when L_R = L_B; in this case HBH^T = R to machine precision for both H_2 and H_3. The qualitative behavior for H_2 and H_3 is very similar, with smaller values of κ(Ŝ) for H_3 than for H_2. This is also the case in the unpreconditioned setting, 27 and occurs because H_3 can be considered a smoothed version of H_2. Qualitatively, the behavior for H_4 is a compromise between H_1 and H_2; we can reduce κ(Ŝ) by increasing L_B or decreasing L_R. In the unpreconditioned case, decreasing either lengthscale always reduces κ(Ŝ). However, in the preconditioned setting the ratio between the background and observation lengthscales is important, meaning that in some cases increasing L_B or L_R will reduce κ(Ŝ).

Figure 3 shows the value of κ(Ŝ), the terms in the bounds (25), and the bounds (15) for various combinations of H, R, and B. The second term in the lower bound (25), given by 1 + λ_1(HH^T) λ_N(B) (λ_1(R))^−1, is not shown, as it performs worse than the first term of (25), given by 1 + λ_p(HH^T) λ_N(B) (λ_p(R))^−1, for all parameter combinations studied.
Both the upper and lower bounds of (25) increase with L_R. They capture the increase in κ(Ŝ) that occurs for L_R ≥ L_B for H_2 and H_3, and for larger values of L_R for H_1 and H_4. The initial decrease of κ(Ŝ) with increasing L_R is not represented by the bounds of (25). Although some of the qualitative behavior is well represented, the bounds are very wide. Notably, for larger values of L_B the lower bound given by (25) is very close to 1 for all values of L_R. By contrast, the bounds given by (15) represent the initial decrease in κ(Ŝ) for small values of L_R well, both qualitatively and quantitatively. The upper bound of (15) then increases with increasing L_R and remains tight for all parameter combinations. The lower bound of (15) is monotonically decreasing, and hence does not represent the behavior of κ(Ŝ) well for larger values of L_B and L_R. We note that for H_2 and H_3 the upper and lower bounds of (15) are equal for L_B > L_R. This results from Corollary 2, as HBH^T is circulant when H = H_2 or H = H_3 and all entries in the product R^{−1/2} HBH^T R^{−1/2} are positive for L_B ≥ L_R. For panels (j) and (k) this means that the bounds given by (15) are equal to κ(Ŝ) for all plotted values of L_R.

Comparing the bounds given by (25) and (15), we find that the upper bound of (15) performs better for all parameters studied. The best lower bound depends on the choice of L_B and L_R: for lower values of L_B and larger values of L_R the first term of (25) is the tightest; otherwise the bound given by (15) is the tightest in this setting. Although the bounds given by (15) represent the behavior of κ(Ŝ) well, we note that the numerical framework considered here has a very specific structure that is unlikely to occur in practice.
Observation operators are likely to be much less smooth and to have less regular structure: for example, observations may not occur at the locations of state variables, observations and state variables may not be evenly spaced, and data may be missing, leading to different observation networks at different times or time windows. This may make a difference to the performance of both sets of bounds.

We now consider how altering the data assimilation system affects the convergence of a conjugate gradient method for the problem introduced in Section 4.2. Figure 4 shows how convergence of the conjugate gradient problem changes with L_B, L_R, and H. For all choices of H, the largest number of iterations occurs when L_R is large and L_B is small. Similarly to the unpreconditioned case, 27 we see that in many cases κ(Ŝ) is a good proxy for convergence: for H_2, H_3, and H_4, reductions in κ(Ŝ) and in the number of iterations required for convergence occur for the same changes to L_R and L_B. The main difference in behavior is seen for H_1, where increasing L_R increases κ(Ŝ) for all choices of L_B, but makes no difference to the number of iterations required for convergence for L_B ≥ 0.4. This difference can be explained by considering the full distribution of the eigenvalues of Ŝ rather than just the condition number. Convergence of the conjugate gradient method depends on the distribution of the entire spectrum, and we expect faster convergence where eigenvalues are clustered (see [12, theorem 38.4; 15, theorems 38.3, 38.5]). The eigenvalues of the full Hessian are given by 1 + λ(B^{1/2} H^T R^{−1} H B^{1/2}), together with N − p further unit eigenvalues. Figure 5 shows the nonzero eigenvalues of the low-rank update to the identity, B^{1/2} H^T R^{−1} H B^{1/2}, for L_B = 0.1 and L_R = 0.1, 0.4, 0.7, 1. For all choices of H, increasing L_R leads to an increase in the maximum eigenvalue of this product.
Additionally, the spectrum is distributed smoothly with few clusters, meaning that the condition number is a good indicator of convergence of a conjugate gradient method. This explains why increasing L_R for L_B = 0.1 leads to an increase in the number of iterations required for convergence for all choices of H. In the absence of such clustering, 42 the condition number is more likely to be a good proxy for convergence of a conjugate gradient method in this setting. We conclude that in this framework changing L_B has a larger effect on the convergence of a conjugate gradient method than changing L_R. This contrasts with the unpreconditioned case, where changes to both L_B and L_R had a large impact on convergence. 27 For all choices of observation operator, small values of L_B lead to poor convergence. Although changing L_R affects κ(Ŝ), due to an increase in clustered eigenvalues these changes do not always affect the convergence of a conjugate gradient method. Overall, the condition number is a good proxy for convergence in this framework.
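The spectral structure invoked above, N − p unit eigenvalues plus p eigenvalues shifted by the low-rank update, can be verified directly. The sketch below is ours, with an illustrative direct observation operator and lengthscales:

```python
import numpy as np

def soar(n, L):
    """SOAR correlation matrix on the unit circle (chordal distance)."""
    th = 2 * np.pi * np.arange(n) / n
    dth = np.abs(th[:, None] - th[None, :])
    d = 2 * np.sin(np.minimum(dth, 2 * np.pi - dth) / 2)
    return (1 + d / L) * np.exp(-d / L)

N, p = 200, 100
wB, V = np.linalg.eigh(soar(N, 0.2))
B_half = V @ np.diag(np.sqrt(np.maximum(wB, 0.0))) @ V.T
H = np.eye(N)[::2]                 # direct observations of every other variable
R = soar(p, 0.1)

# Low-rank update and full preconditioned Hessian
P = B_half @ H.T @ np.linalg.solve(R, H @ B_half)
S_hat = np.eye(N) + P

eigs_S = np.linalg.eigvalsh(S_hat)
n_unit = int(np.sum(np.abs(eigs_S - 1.0) < 1e-9))
print(n_unit)                      # N - p = 100 unit eigenvalues

# The p remaining eigenvalues of S_hat are 1 plus the nonzero
# eigenvalues of the rank-p update P
eigs_P = np.linalg.eigvalsh(P)
print(eigs_P[-1], eigs_S[-1] - 1.0)   # these agree
```

Plotting eigs_P for several values of L_R reproduces the kind of spectral comparison shown in Figure 5; where the shifted eigenvalues bunch together, fast convergence despite a large condition number is to be expected.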

CONCLUSIONS
The inclusion of correlated observation errors in data assimilation is important for high resolution forecasts, 22,23 and to ensure we make the best use of existing data. 19,20,43 However, multiple studies have found issues with convergence of data assimilation routines when introducing correlated OEC matrices. 16,24,44 Earlier work considers the preconditioned data assimilation problem in the case of uncorrelated OEC matrices. 14 In this article we study the effect of introducing correlated OEC matrices on the conditioning and convergence of the preconditioned variational data assimilation problem. This extends the theoretical and numerical results of a previous study by Tabeart et al. 27 that considered the use of correlated OEC matrices in the unpreconditioned variational data assimilation framework.
In this article, we developed bounds on the condition number of the Hessian of the preconditioned variational data assimilation problem for the case where there are fewer observations than state variables. We then studied these bounds numerically in an idealized framework. We found that:
• As in the unpreconditioned case, decreasing the observation error variance or increasing the background error variance increases the condition number of the Hessian.
• The minimum eigenvalue of the OEC matrix appears in both the upper and lower bounds. This was also true for the unpreconditioned case.
• For a fixed lengthscale of the observation (background) error covariance matrix, L, the condition number of the Hessian is smallest when the lengthscale of the background (observation) error covariance matrix is also equal to L. This is in contrast to the unpreconditioned case, where for a fixed lengthscale of the observation (background) error covariance matrix, the condition number of the Hessian is smallest when the lengthscale of the background (observation) error covariance is minimized.
• Our new lower bound represented the qualitative behavior better than an existing bound for some cases. The upper bound from Haben 14 was shown to be tight for all parameter choices. We proved that under additional assumptions the upper and lower bounds from Haben 14 are equal.
• For most cases the conditioning of the Hessian performed well as a proxy for the convergence of a conjugate gradient method. However in some cases, clustered eigenvalues (induced by the specific structure of the numerical framework) meant that convergence was much faster than predicted by the conditioning.
We remark that our findings about clustered eigenvalues occur as our numerical framework has very specific structures. In particular, the eigenvectors of the background and OEC matrices are strongly related. Other experiments not presented in this article considered the use of the Laplacian correlation function for either or both of the observation and background error covariance matrices. 14 Qualitative conclusions were very similar to those shown in Section 5, even though the negative entries of the Laplacian correlation function do not satisfy the additional assumptions required for the bounds to be equal. In applications, we are likely to have more complicated observation operators, and the background and OEC matrices are less likely to have complementary structures. Satellite observations for NWP often have interchannel correlation structures that are different from the typical spatial correlations of background error covariance matrices. 17,19 We also note that our state variables were evenly distributed and homogeneous, which will not be the case for nonuniform grids.
In the unpreconditioned case, using a similar numerical framework, Tabeart et al. 27 found that improving the conditioning of the background or OEC matrix separately would always decrease κ(Ŝ). The preconditioned system is more complicated: in some cases decreasing the condition number κ(B) or κ(R) increases the condition number κ(Ŝ). We expect the relationship between each of the constituent matrices to be complicated for more general problems. This is relevant for practical applications, as estimated OEC matrices typically need to be treated via reconditioning methods before they can be used. 16,24 Currently the use of reconditioning methods is heuristic, 28 meaning that there may be flexibility to select a treated matrix that will result in faster convergence in some cases. However, popular reconditioning techniques work by increasing the small eigenvalues of the OEC matrix. In the preconditioned setting, such techniques will not automatically reduce the value of κ(Ŝ), because of the multiplication of the background and observation error covariances. This means that reconditioning techniques may perform differently for the preconditioned data assimilation problem than in the unpreconditioned setting.
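As an example of such a technique, one ridge-style approach adds a scalar δ to every eigenvalue of R, with δ chosen so that the treated matrix attains a target condition number. The sketch below is our own (the target value and lengthscale are illustrative); it fixes κ(R) exactly but, as discussed above, says nothing by itself about the effect on κ(Ŝ):

```python
import numpy as np

def soar(n, L):
    """SOAR correlation matrix on the unit circle (chordal distance)."""
    th = 2 * np.pi * np.arange(n) / n
    dth = np.abs(th[:, None] - th[None, :])
    d = 2 * np.sin(np.minimum(dth, 2 * np.pi - dth) / 2)
    return (1 + d / L) * np.exp(-d / L)

p = 100
R = soar(p, 0.5)                 # a poorly conditioned OEC matrix
w = np.linalg.eigvalsh(R)        # ascending eigenvalues
kappa_target = 100.0             # illustrative target condition number

# Ridge-style reconditioning: R_new = R + delta * I, with delta chosen
# so that (w_max + delta) / (w_min + delta) equals the target.
delta = (w[-1] - kappa_target * w[0]) / (kappa_target - 1.0)
R_new = R + delta * np.eye(p)

print(np.linalg.cond(R), np.linalg.cond(R_new))  # conditioning of R improves
```

Because the shift raises the small eigenvalues of R, it is guaranteed to improve κ(R); whether κ(Ŝ) improves depends on the interaction with B and H, which is exactly the complication noted above.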
Although the numerical experiments in this article consider a limited choice of matrices and parameters, we note that the theory and bounds presented in this work are general and apply to any choice of covariance matrices B and R, and any linear observation operator (or generalized observation operator in the case of 4D-Var). We could consider the numerical results presented here as a "best case" due to the circulant structure of both covariance matrices. For more general choices of B and R any eigenvalue clustering is likely to be less extreme, and hence conditioning may be more influential for the convergence of a conjugate gradient method. Increased eigenvalue clustering occurred for observation operators with regular structure, whereas in practice the "randomly observed" experiment is more realistic. For the 4D-Var problem the generalized observation operator Ĥ also accounts for model evolution, and hence the structure of the linearized model is also expected to be important when considering clustering and convergence of a conjugate gradient method. Previous work has also shown that for the unpreconditioned problem, the qualitative behavior of an operational system 25 largely followed the linear theory. 27 Similarly, for the case of uncorrelated OEC matrices, the behavior of preconditioned 4D-Var experiments broadly coincided with theory from the linear setting. 14,26 This indicates that conclusions arising from the study of linear data assimilation problems can often provide insight for a wider range of practical implementations, even if theoretical results are not directly applicable.

CONFLICT OF INTEREST
This study does not have any conflicts to disclose.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no datasets were generated or analyzed during the current study.