The conditioning of least squares problems in variational data assimilation

In variational data assimilation a least-squares objective function is minimised to obtain the most likely state of a dynamical system. This objective function combines observation and prior (or background) data weighted by their respective error statistics. In numerical weather prediction, data assimilation is used to estimate the current atmospheric state, which then serves as an initial condition for a forecast. New developments in the treatment of observation uncertainties have recently been shown to cause convergence problems for this least-squares minimisation. This is important for operational numerical weather prediction centres due to the time constraints of producing regular forecasts. The condition number of the Hessian of the objective function can be used as a proxy to investigate the speed of convergence of the least-squares minimisation. In this paper we develop novel theoretical bounds on the condition number of the Hessian. These new bounds depend on the minimum eigenvalue of the observation error covariance matrix and the ratio of background error variance to observation error variance. Numerical tests in a linear setting show that the location of observation measurements has an important effect on the condition number of the Hessian. We identify that the conditioning of the problem is related to the complex interactions between observation error covariance and background error covariance matrices. Increased understanding of the role of each constituent matrix in the conditioning of the Hessian will prove useful for informing the choice of correlated observation error covariance matrix and observation location, particularly for practical applications.


INTRODUCTION
Data assimilation combines the output from a numerical model of a dynamical system, the background or prior, with observations of the system to yield an accurate description of the current dynamical state (analysis).
Contributions from observations and the background are weighted according to their relative uncertainty via error covariance matrices, meaning that assessing and quantifying observation error are crucial in order to obtain an accurate analysis sufficiently quickly. 1,2One of the most well-known applications of data assimilation is to numerical weather prediction (NWP), where observations of the atmosphere and ocean are combined with a prior model state of the atmosphere in order to produce the initial conditions for a weather forecast.Until recently, diagonal observation error covariance matrices have been used operationally at all major NWP centres, 3 a choice that is only valid in the case that observation errors are uncorrelated.5][6][7][8] However, using diagnosed full observation error covariance matrices directly in the assimilation has been shown to cause problems with the speed of convergence of the assimilation scheme. 9ariational assimilation, a popular data assimilation method, [10][11][12] finds the analysis by minimising a nonlinear least-squares objective function.This objective function, which is dependent on both observations and the background field, is minimised by an iterative method, such as the Gauss-Newton method. 13,14This consists of an outer loop that solves the full nonlinear problem, and an inner loop that solves the linearised problem, often via a conjugate gradient method. 157][18] Hence, it can be used as a rough estimate for the number of iterations needed to solve the inner loop problem.We note, however, that this worst-case bound on convergence can be improved significantly in the case of clustered eigenvalues. 17,19The magnitude of the condition number also provides an indication of the sensitivity of the system to perturbations in the data. 10Speed of convergence is critical in practise due to the need to provide timely forecasts.In this work, we investigate how introducing correlated observation errors affects the condition number of the Hessian and examine the associated speed of convergence of a conjugate gradient method.
Correlated observation error statistics have been diagnosed for certain observation types (e.g., see other works [20][21][22][23][24][25][26][27] ), although there are problems associated with their use.In particular, the methods used to diagnose observation error covariance matrices are imperfect, and the quality of these estimates is unclear.Due to unknown observation error statistics and in order to reduce the computational cost of operational assimilation, in practise, the majority of observation errors are assumed uncorrelated.However, empirical evidence from simple model experiments indicate that even approximate correlation structures give significant benefit in terms of analysis accuracy. 7,28Similar conclusions can be drawn for practical implementations. 4n the works of Stewart 6 and Stewart et al., 26 it was shown that there were problems with the use of diagonal observation error covariance matrices in the variational data assimilation for certain instruments.Motivated by this work, in 2011, the UK Met Office first trialled the use of correlated observation errors in their operational system. 3However, there were problems with the convergence of the minimisation algorithm, which necessitated the "reconditioning" of observation error covariance matrices (by altering their eigenvalues), prior to their use in the system.In the works of Weston 3 and Weston et al., 9 it was suggested that slow convergence was caused by the very small minimum eigenvalues of the diagnosed observation error covariance matrix.This work provides motivation to investigate further the role of the minimum eigenvalue of the observation error covariance matrix on the conditioning of the variational data assimilation problem; in turn, developing this crucial understanding will permit the optimal use of correlated observation errors in data assimilation systems.
Even in the case of uncorrelated observation errors, the minimisation problem for any large system is very ill conditioned.Preconditioning, where the original problem is transformed into an equivalent but less ill-conditioned problem, is used operationally to mitigate against the slow convergence of the minimisation. 29In data assimilation, the most common method of preconditioning is the control variable transform, 16,30 where the preconditioner is based on the background error covariance matrix.The optimal choice of preconditioning depends on the formulation of the data assimilation problem, 31 and practical constraints may require the use of a less computationally intensive preconditioner. 32In this work, an unpreconditioned framework will be used, as it is unknown whether the introduction of correlated observation errors will alter the optimal choice of preconditioner.This framework also has practical relevance, as the UK Met Office uses an unpreconditioned 1D-Var routine, where each observation is assimilated individually, for quality control purposes.Hence, the bounds and conclusions presented here will apply directly to that case.
In this article, we develop a new theory for bounding the condition number of the Hessian of the least-squares objective function.This theory applies to both uncorrelated and correlated choices of observation error.We investigate the impact of introducing these correlations via small-scale numerical tests, which illustrate the influence of observation correlations associated with a physical length scale.We begin in Section 2 by defining a notation common to data assimilation and the condition number.We explain why the conditioning of the system and the rate of convergence of the minimisation are linked and present results from linear algebra that will be used to construct the bounds discussed in Section 3. Three new sets of bounds will be introduced in Section 3; these will have a varying number of additional constraints on the constituent matrices.Bounds that separate the contribution of each of the constituent terms have been developed for both general matrices and matrices with additional assumptions on observation location and observation error correlations.In Section 4, we discuss our numerical framework for the experiments of Section 5.The results of these numerical tests support the theoretical conclusions presented in Section 3. In particular, we see that the minimum eigenvalue of the observation error covariance matrix and the ratio of background variance to observation variance are important terms for controlling the conditioning of the variational problem for both the bounds in Section 3 and the numerical results from Section 5. We conclude in Section 6 that even in a simple linear setting, the choice of observation operator has a significant effect on the conditioning.The theoretical conclusions indicate how correlated error statistics in the observation and background can be expected to interact, and highlight areas where reconditioning and similar techniques could be used to reduce the increased computational cost associated with using correlated observation errors operationally.Although the primary motivation for the investigation of the impact of correlated observation errors arises from their application in meteorology, the theory and conclusions presented here are very general and apply to any other application of variational data assimilation such as in neuroscience 33,34 and ecology. 35,36

Notation
In data assimilation, information from observations, y ∈ R  , is combined with information from a background, or "prior", field, x b ∈ R N .The analysis, x a ∈ R N , or posterior, is found by weighting each of the two components using their respective error statistics.It is assumed that observation errors and background errors are unbiased and mutually uncorrelated.The background and observation error covariance matrices are denoted by the symmetric positive semidefinite matrices B ∈ R N×N and R ∈ R × , respectively (although in practise, we assume B and R are positive definite matrices).Usually, there are far fewer observations than state variables, that is, p ≪ N. Observation and background information may describe different variables or be situated at different locations in space.The observation operator h ∶ R N → R  , which may be nonlinear, is used to map from state space to observation space to allow the comparison of observations with the background; in particular, y will be compared with h [x].
For variational assimilation methods, the analysis is found by minimising an objective function.In this work, we focus on 3D-Var, a particular variational assimilation method, which assimilates variables at a single fixed time in the assimilation window over the entire spatial domain. 37In the case of 3D-Var, the objective function is given by the following: The state vector x a that minimises this objective function is then used as the initial condition to produce a forecast.When h is linear, this equation has an analytic solution (see equation 2.4 in the work of Haben 16 ), but (1) is too expensive to be solved explicitly on an operational scale.In NWP, where observation operators can be nonlinear and high dimensional, a gradient descent algorithm, such as the Gauss-Newton method, is used to solve a sequence of linearised problems, in order to converge iteratively to the solution, x a . 10We note that x a corresponds to the maximum a posteriori estimate under the assumption that all probability distributions are Gaussian. 38,39

Condition number
In practice, to solve the nonlinear problem, the Gauss-Newton method is used to solve a sequence of linearised problems, often via a conjugate gradient method. 29We will now consider the linearised problem, where the nonlinear problem given by ( 1) is linearised about x a , the optimal solution.As the linearisation of ( 1) is a quadratic function, 37 finding x a is equivalent to solving a linear system of the form where w ∈ R N and b ∈ R N are given by (3.10) in Section 3.2 in the work of Haben. 16(This formulation will be used in numerical experiments in Section 5.) Here, S ∈ R N×N is the Hessian of the linearisation of the objective function (1)  given by where H ∈ R ×N is the Jacobian of the observation operator h linearised about the optimal state.The Hessian can be used to study the sensitivity of the solution to small changes in observation or background data, by considering its condition number (see Section 2.7 in the work of Golub et al. 18 ).As B and R are symmetric positive definite, S is also symmetric positive definite, and hence, the L 2 condition number of S can be represented in terms of its eigenvalues.

Eigenvalue theory
For the remainder of the paper, the following ordering of eigenvalues of matrix D will be used: For a matrix Theorem 1.If S ∈ R N×N is a symmetric and positive definite matrix, then we can write the condition number in the L 2 norm as where  1 (S) and  N (S) correspond to the largest and smallest eigenvalues of S, respectively.
Proof.(See Section 2.7.2 in the work of Golub et al. 18 ) Henceforth,  2 (S) will be referred to as the condition number of S and will be denoted (S).
In order to determine the bounds on the condition number of the Hessian we make use of the following result from linear algebra.Theorem 2. Consider two symmetric matrices S 1 , S 2 ∈ R N×N .The kth eigenvalue of the matrix sum S 1 + S 2 satisfies the following: ( Proof.(See Chapter 2, Theorem 44 in the work of Wilkinson. 40) This result allows us to separate the contributions of B −1 and H T R −1 H when bounding the condition number of S given by (3) and is discussed in Section 3.
A result bounding the eigenvalues of matrix products in terms of the eigenvalues of the constituent matrices is given by the following.
Proof.(See Section 9 H.1.a. in the work of Marshall et al. 41 ) with equality for k = N.
Proof.(See the work of Wang et al. 42 )

THEORETICAL RESULTS
We now present new bounds on the condition number of the Hessian given by (3).We begin in Section 3.1 by considering the general case: B and R are general covariance matrices, and H is any linear observation operator.In Section 3.2, we then introduce further assumptions that constrain H to only observe state variables.Finally, in Section 3.3, we restrict the form of B and R to have a particular structure.

General bounds on the condition number
We begin by introducing bounds on the eigenvalues of S in terms of the eigenvalues of B, R, and H. Lemma 1.For S = B −1 + H T R −1 H, where B ∈ R N×N , R ∈ R × are symmetric positive definite covariance matrices, and H ∈ R ×N with p < N, we can bound the eigenvalues of S below by and above by where λ k (S) is the kth eigenvalue of S.
Proof.The bounds follow immediately from the result of Theorem 2 by exchanging the order of addition.Note that As we wish to bound the condition number of S, we are primarily interested in bounding λ 1 (S) and λ N (S).In this case, the bounds given by ( 8) and ( 9) then simplify to We note that this applies to any choice of correlation matrices B and R and for any linear choice of observation operator H.This suggests that we expect the eigenvalues, and hence condition number, of S to vary based on the interactions between B and R. We now introduce a new bound on the condition number of (3) for 3D-Var for the most general choice of B, R, and H. Theorem 5. Let the background and observation error covariance matrices, B ∈ R N×N and R ∈ R × , respectively, be symmetric positive definite covariance matrices, with p < N. Additionally, let H ∈ R ×N be the observation operator.Then, the following bounds are satisfied by the condition number of the Hessian (given by ( 3)): (This is a slightly modified form of Theorem 6.1.1 in the work of Haben. 16) Proof.To obtain an upper bound for the condition number of (3), we take the upper bound for λ 1 (S) in (11) and the lower bound (10) for λ N (S), using the fact that (λ 1 (B −1 )) −1 = λ N (B).We can obtain a lower bound for the condition number similarly by taking the lower bound for λ 1 (S) in (11) and the upper bound for λ N (S) in (10).This gives two possible bounds for (S), depending on which of the two terms is larger, using the fact that (λ 1 (B)) −1 = λ N (B −1 ).Combining these inequalities completes the proof.
We note that the two terms in (14) are reciprocals.This means that the lower bound will always be greater than or equal to one.Any condition number is bounded below by one. 18e now extend this result to write it in a form that explicitly separates the role of the observation error covariance matrices and the observation operator.This makes it easier to investigate how changes in R, B, and H affect the condition number of the Hessian.Corollary 1.Let B ∈ R N×N and R ∈ R × , with p < N, be the background and observation error covariance matrices, respectively.Additionally, let H ∈ R ×N be the observation operator.Then, the following bounds are satisfied by the condition number of the Hessian (given by ( 3)): Proof.Using Theorem 21.10.1 in the work of Harville, 43 we see that H T R −1 H has precisely the same nonzero eigenvalues as R −1 HH T .Applying the same result, H T R −1 H also has the same nonzero eigenvalues as . Applying Theorem 3 for k = 1 and i = 1 yields the following bound: as λ 1 (R −1 ) = 1∕λ N (R).To bound λ 1 (R −1 HH T ) below, we apply Theorem 4 for k = 1 and i 1 = 1 to obtain two lower bounds, as follows: Substituting ( 16) and ( 17) into the upper and lower bounds of Theorem 5 gives the desired result.
We note that the upper bound in (15) increases as λ N (R) decreases.It is not immediately clear how the lower bound will change with R.This will be discussed in Section 4.3, which provides a summary of how the bounds given by ( 15) vary with R and B for the numerical framework tested in Section 5.

Bounds on the condition number with additional restrictions on the choice of observation operator
We now develop a further bound, which applies in the case that additional assumptions are made regarding the choice of observation operator.In particular, we restrict the observation operator to direct observations of single state variables.We note that if observations are restricted to direct observations of single state variables, then H T H is diagonal with (H T H) i,i = 1 if variable i is observed and zero otherwise, as shown by Haben et al. 44 Under this stricter assumption, we show that the value of λ 1 (HH T ) is the same, irrespective of the choice of observations.Lemma 2. If H T H ∈ R N×N is a diagonal matrix with p < N units on the diagonal and the remaining elements zero, then HH T is the p × p identity matrix.
Proof.As H T H is diagonal, we can calculate its eigenvalues directly; they are simply its diagonal elements.Hence, H T H has p unit eigenvalues and N − p zero eigenvalues.By Theorem 21.10.1 in the work of Harville, 43 HH T has the same nonzero eigenvalues as H T H, that is, p units.As HH T is symmetric, these eigenvalues correspond to p linearly independent eigenvectors.We now write HH T in terms of its eigendecomposition.Let  = diag(λ 1 , … , λ N ) ∈ R × be the matrix of eigenvalues of HH T , and V ∈ R × be the corresponding matrix of eigenvectors of HH T .As the eigenvalues of HH T are all units,  = I p , the p × p identity.Then, Hence, under the assumptions on H T H, HH T is the p × p identity matrix.
Hence, if observations are restricted to single state variables, then HH T = I  .Eliminating the H and H T terms from the bound given by Corollary 1 reduces the number of matrix multiplications required for evaluation.This result is now used to obtain a bound for the case where observation and background error covariances are correlated and where observations are limited to model variables.We additionally assume that observation variance,  2 o , and background variance,  2 b , are uniform variances, and hence, the covariance matrices can be written as a scalar variance multiplied by a correlation matrix.
, where C and D are symmetric positive definite correlation matrices, and  2 b and  2 o are positive scalars denoting the background and observation error variances, respectively.In addition, let H T H be a diagonal matrix with p < N units on the diagonal and the remaining elements zero.Then, the following bound on the condition number of S (given by (3)) holds: Proof.Using (15) with the definitions of B and R in the theorem statement along with the result of Lemma 2 yields the desired result immediately.
The bounds given by ( 19) are equal to those given by ( 15) for the case of direct observations, so the comments concerning how the bounds change with R and B following Corollary 1 also apply here.In general, it is not possible to comment on how the lower bound given by ( 19) will behave with changing B and R. In Section 5, we provide an overview for how the terms in (19) change for some specific choices of B, R, and H.
We note that the ratio appears in both bounds, meaning that as the observations get more accurate, and the variance o decreases, we will see an increased upper bound and a decreased lower bound for λ 1 (B) > λ N (B) + λ 1 (R), with an increased lower bound for λ 1 (B) < λ N (B) + λ 1 (R).This was also observed theoretically and numerically by Haben 16 for the case that R is uncorrelated.Both of these results assume the same variance for all observations, which is not true in general.However, they indicate the general behaviour we would expect for an increase in accuracy across a wide range of observing systems.

Bounds on the condition number for circulant error covariance matrices
In this section, we present a lower bound that is tighter than those of (15) for a given matrix framework.Improved bounds are obtained for this specific case by exploiting the eigenvalue and eigenvector properties of a particular matrix structure.It is feasible that for other matrix structures, similar properties could be used to compute tighter bounds for other classes of matrices.However, as the results from Section 3.1 are general and apply to any choice of covariance matrices, we do not consider other specialised bounds in this work.
It is often desirable for error correlations to be homogeneous and isotropic, meaning that the correlation between two points is determined solely by the distance between them. 45This makes circulant matrices a natural choice for correlation matrices on a one-dimensional periodic domain.For the numerical tests discussed in Section 5, both B and R will be chosen to be circulant matrices, although the bounds given by Theorem 5, Corollary 1, and Corollary 2 apply for any valid choice of correlation matrix.Definition 1. (See the work of Davis. 46) A circulant matrix D ∈ R N×N is a matrix of the form As described by Gray, 47 the structure of a circulant matrix of the form given by Definition 1 permits the rapid calculation of eigenvalues and eigenvectors via a discrete Fourier transform.In practise, this means that we can calculate the eigenvalues of D directly via the following formula.Theorem 6.The eigenvalues of a circulant matrix D, as given by Definition 1, are given by with corresponding eigenvectors where  = e −2i/N is an Nth root of unity.
Proof.(See the work of Gray 47 for full derivation.) To avoid confusion, the eigenvalues of a circulant matrix calculated using (20) will be denoted by  j rather than λ j , as they are ordered in terms of wavenumber rather than size.We can see from (21) that the eigenvectors only depend on N, the dimension of the circulant matrix.Therefore, any N × N circulant matrix will have the same set of eigenvectors.
We now use this matrix structure to consider a further restriction to the case that observation error is assumed to be uncorrelated, and the background observation error matrix is required to be circulant.In particular, in the following theorem, R is taken to be a scalar multiple of the identity.We note that Theorem 7 was presented by Haben et al. 45 without proof.
where C is a symmetric positive definite circulant matrix, and R =  2 o I  , where I  ∈ R × is the identity matrix.Both  2 b and  2 o are positive scalars.In addition, let H T H be a diagonal matrix with p < N units on the diagonal and the remaining elements zero.Then the following bounds on the condition number of S (given by ( 3)) hold: where λ 1 (C) and λ N (C) are the largest and smallest eigenvalues of the matrix C, respectively.
Proof.By the assumptions on the matrices in the theorem, we can write . If we substitute these into the upper bound of (12), we obtain which establishes the upper bound.Rather than repeat this procedure with the lower bound, we produce an improved estimate by applying the Rayleigh quotient, R S (x), x ∈ C N (defined in Section 5.9 in the work of Süli et al. 48).Let v 1 ∈ C N be the eigenvector corresponding to the largest eigenvalue of C −1 .Because C −1 is circulant, then all the components of the eigenvectors of C −1 lie on the unit circle in C (see (21)).In particular, this implies that for an eigenvector, v m , of where K denotes the positions of the nonzero diagonal elements of H T H, and v † denotes the conjugate transpose of v.The maximum value obtained by the Rayleigh quotient of S occurs at the eigenvector corresponding to the largest eigenvalue of S (see Section 5.9 in the work of Süli et al. 48).Hence, Similarly, the minimum value of the Rayleigh quotient occurs at the eigenvector corresponding to the smallest eigenvalue of S. Let v N be the eigenvector corresponding to the smallest eigenvalue of C −1 .Then, again using the Rayleigh quotient, we find Combining ( 25) and ( 26), we find giving the lower bound on the condition number.This completes the proof.
We note that the lower bound presented here is tighter than the others introduced in this section.This comes from the restriction on the form of S when additional assumptions are made on R and H and does not generalise to the other results presented in this work.We also observe that the lower bound ( 22) has an explicit dependence on the number of observations, p, meaning that (S) increases with p.This was studied in detail by Haben. 16Additionally, the ratio appears in both bounds, meaning that the discussion following the result of Corollary 2 also applies to the result of Theorem 7.
We now have bounds that require minimal matrix multiplications for evaluation and that separate the contributions of B, R, and H.In the following sections we will test these bounds numerically and discuss the impact of changing each of the constituent matrices in turn.

NUMERICAL FRAMEWORK
We now outline the experimental framework that will be used in Section 5 to numerically investigate the bounds presented in Section 3. In particular, in Section 4.1, we introduce specific matrix structures that will be used to generate covariance matrices.These structures have been chosen as they permit easy calculation of eigenvalues of the resulting matrices.We note that these correlation structures illustrate the case where there is a physical length scale associated with our observation and background error correlations, as in the case of horizontal correlations.Different choices of observation operator will then be presented in Section 4.2.Finally, in Section 4.3, we define the experiments that will be studied in Section 5 and discuss the choice of parameters to be used in these tests in detail.

Correlation and second-order auto-regressive correlation matrices
This work will make use of the second-order auto-regressive correlation (SOAR) function, which is used by the UK Met Office as a horizontal correlation function, as detailed in the work of Simonin et al. 49 It is also commonly used to model background error correlations, 7 as its relatively long tails coincide well with estimates of correlation structure.Additionally, these longer tails ensure that SOAR matrices are better conditioned for inversion than Gaussian matrices. 10,16he SOAR function, defined by Daley, 50 is homogeneous and isotropic and naturally extends to a circulant form when we have equally spaced observations on a periodic domain, such as a latitude circle on the Earth.We define the SOAR error correlation matrix for a 1D model with state variables (or observations) given by equally spaced gridpoints on a fixed domain (the unit circle of radius a = 1), following the procedure given in the works of Haben 16 and Waller et al. 51 This makes use of a substitution of a chordal distance for a "great circle distance" to ensure that we obtain a valid correlation model on the circle, as discussed by Gaspari et al. 52 and Jeong et al. 53 Definition 2. The SOAR error correlation matrix on the finite domain is given by

FIGURE 1
Eigenvalues of an error correlation matrix defined by the second-order auto-regressive function given in (28) for N = 20 and a = 1 where L > 0 is the correlation length scale,  i, j denotes the angle between gridpoints i and j, and a is the radius of the domain.The chordal distance between adjacent gridpoints is given by where N is the number of gridpoints, and  =  2N is the angle between adjacent gridpoints.As SOAR matrices are circulant by construction, we can calculate their eigenvalues directly using Equation 20.The distribution of eigenvalues is symmetric and, as shown in Figure 1, decreases monotonically towards the central value.This means that only two eigenvalues need to be calculated in order to obtain the maximum and minimum eigenvalues of any SOAR matrix:  1 and  N/2 (if N is even) or  (N+1)/2 (if N is odd).The circulant structure can hence be exploited to reduce the number of computations required for computing the bounds given by ( 15) and (19) for the condition number of the Hessian.
For the numerical experiments, we alter the length scales of the SOAR matrices corresponding to background and observation error.Figures will be plotted in terms of the maximum eigenvalues of B −1 and R −1 (recalling that for any matrix, D ∈ R m×m , λ 1 (D −1 ) = 1∕λ N (D)).We note that this also means that λ 1 (D −1 ) =  N∕2 (D −1 ) for N even (or λ 1 (D −1 ) =  (N+1)∕2 (D −1 ) for N odd), using the notation established in Theorem 6.The relationship between the increasing length scale and the spectrum of a SOAR matrix is shown in Figure 1, namely that as the length scale, L, increases, the minimum eigenvalue of the SOAR matrix decreases and the maximum eigenvalue increases.This means that the maximum eigenvalue of the inverse of a SOAR matrix increases with length scale, and its minimum eigenvalue decreases.
Having described the choice of correlation matrices that will be used in the numerical tests in Section 5, in the next section we discuss the different choices of observation operator that will be tested in our experiments.

Choice of observation operator
Most previous research into the impact of correlated observation errors on the variational assimilation problem does not investigate the impact of using different observation operators systematically.Either the operational observation operator is used (e.g., see other works 9,21 ) or the experiments are carried out in a simple linear case where H is taken to be a variant of the identity, as in other works. 5,7,8,54In this paper, we compare how the condition number of the Hessian is affected by different choices of linear observation operator in order to gain some theoretical insight into the role played by this operator.We define three choices of the observation operator that will be investigated in detail numerically.We are particularly interested in how important our choice of H is in determining both the true condition number of S and the value of the bounds given by (15).Firstly, we note that all bounds presented in this work require the assumption that the observation operator, H, is linear, and the bounds given by ( 19) and ( 22) have the restriction that observations are only of single state variables.All the choices of H that are tested in the numerical experiments presented in this work are linear, and two correspond to direct observations of single model variables.
The choice of H = H 1 corresponds to observing the first p state variables and to making no observations in the second half of the state space.Choosing H = H 2 corresponds to making observations at alternate state variables over the entire model domain.The observation operator H = H 3 is a smoothed version of H 2 ; state variables at alternate gridpoints are smoothed over five adjacent points in state space with equal weighting.This can be thought of as a simplified version of a satellite weighting function (see Section 2.4.1 in the work of Stewart 6 and Section 2.1.3 in the work of Rodgers 39 ), which measures average radiation over several model levels of the atmosphere.In Figure 2, these observation operators are depicted for a small-scale example when p = 10 and N = 20.
The choice of H 1 was made as a check to allow the comparison of preliminary numerical tests with those from Chapter 6 in the work of Haben. 16The bounds given by Corollary 2 in Section 3 require that H T H be a diagonal matrix with p units on the diagonal.The observation operator H 1 satisfies this requirement, as does H 2 , meaning that we can apply the bounds of Corollary 2 for these two cases.Additionally, by Lemma 2, This means that for fixed choices of B and R, both H = H 1 and H = H 2 will yield the same upper and lower bounds.We wish to see whether there will be a significant difference in the true condition number of S for H = H 1 and H = H 2 .
As H = H 3 does not satisfy the condition in the statement of Corollary 2, we must apply the more general bound given by ( 15) in Corollary 1.We would like to be able to use the same bounds to compare each of the three choices of observation operator.A short calculation reveals that we have equality of the bounds given by Corollary 1 and Corollary 2 when observations are restricted to model gridpoints for the framework described here.Hence, for what follows, we will be comparing the bounds given by ( 15) irrespective of the observation network chosen.

Experimental design
We now discuss the experimental framework, which will be used for the numerical tests presented in Section 5.In particular, we motivate the range of parameters that will be investigated.
We fix the ratio between p, the number of observations, and N, the number of state variables, to be N = 2p for all the experiments discussed below.The same ratio was used for numerical testing by Haben 16 and is not representative of what is used in practise, where observations are much less dense.Unless stated otherwise, the values N = 200 and p = 100 were used for all the plots presented here.Other choices of p and N were studied in detail; as qualitative results were similar for all cases considered, they will not be shown here.
Both the background error covariance matrix, B ∈ R N×N , and the observation error covariance matrix, R ∈ R × , are chosen to be SOAR correlation matrices (see Section 4.1) with fixed variances The domain for the tests is the unit circle (a = 1).In the experiments that follow, we will vary L R , the correlation length scale of the SOAR matrix defining R, and L B , the correlation length scale of the the SOAR matrix defining B, over a regular grid, but figures will be plotted in terms of λ 1 (B −1 ) and λ 1 (R −1 ).In addition to studying the impact of changing the length scale of B and R for both sets of experiments, we also consider the effect of using the different choices of H presented in Section 4.2.

Condition number testing
In the numerical tests, we consider how the condition number of S (calculated using the Matlab 2016b function cond) and the bounds given by ( 15) change as the minimum eigenvalues of both error covariance matrices change.Of particular interest is the interaction between changes to both B and R. For the results presented in this paper, the length scales of both B and R were varied between 0.1 and 1.The equivalent eigenvalues of R and B for these parameters are given in Table 1.
Table 1 presents the values of the terms that appear in (15) and depend on the background and observation error matrices for typical values of L B and L R used in the experiments.We observe the following: • As L R increases, λ N (R) decreases; hence, the first term in the lower bound of (15) will increase with increasing L R , and the third term in the lower bound of (15) will decrease with increasing L R .It is therefore not possible in general to determine how the lower bound will change with increasing L R .• As L R increases, λ 1 (R) increases, meaning that the second term in the lower bound of (15) will decrease with increasing L R .• As L B increases, the difference between its minimum and maximum eigenvalues increases, meaning that the condition number of B increases with L B .• In this setting, the upper bound of (15) will increase as L R or L B increases, as λ 1 (B) and (B) increase with L B and 1 ) increases, meaning that for the fixed L R , the first and second terms of the lower bound of (15) will decrease, and the third term will increase.Therefore, increasing L B for fixed L R will cause both bounds to increase.It is not possible at this stage to say whether the upper and lower bound will move closer together or further apart as L B increases.It is also not clear which term in the lower bound of (15) will be the largest for a general choice of B, R, and H.This means that we cannot say how the lower bound of (15) will change with L R .We will investigate how the bounds change numerically with B and R in Section 5.Although we understand the effect of changing L B and L R on the bounds of the condition number, we now want to investigate their influence on the actual value of (S).

Convergence of a conjugate gradient routine
In addition to studying how the condition number of the Hessian changes with B, R, and H, it is of interest to determine the effect of these same changes on the rate of convergence of the minimisation of the objective function.In order to do this, we consider the convergence rate of a conjugate gradient method applied to the linear system (2) associated with the 3D-Var cost function (1).
To do this, we follow the same method that is used in Chapter 6 in the work of Haben 16 ; we construct a vector w that has small-and large-scale features, calculate b = Sw, and then recover w by applying a linear solver, in this case, the conjugate gradient method, to Sw = b.Here we used the Matlab conjugate gradient routine, pcg, 55 to investigate the change in the number of iterations to convergence.In exact arithmetic, the conjugate gradient method should converge to the true solution in exactly n iterations for an n-dimensional problem. 17We note that in finite precision, convergence in n iterations may not occur, as the search directions lose conjugacy due to roundoff errors. 56Operationally, however, even n iterations are too many in order to obtain a solution in reasonable computational time.This problem is usually solved by preconditioning, but for this paper, we are interested in the unpreconditioned problem as discussed in Section 1.We use a tolerance of 1 × 10 −6 on the relative residual for all results presented in the next section.
We expect that the impact of changing B and R on the condition number of the Hessian will be similar for both sets of experiments (condition number and conjugate gradient convergence) due to the theoretical link between the condition number and the convergence of the conjugate gradient method. 18,19In addition to investigating the impact of changing the length scale on the convergence of 3D-Var, we are interested in how the choice of observation operators introduced in Section 4.2 influences 3D-Var in terms of both the condition number and the convergence of the conjugate gradient method.

NUMERICAL TESTING
Our experiments focus on how (S) changes with both λ 1 (R −1 ), for R correlated, and λ 1 (B −1 ) for each of the choices of observation operator introduced in Section 4.2 (recalling that for any matrix D ∈ R N×N , λ 1 (D −1 ) = 1∕λ N (D)).This extends the experiments of Haben 16 where the effect of the length scale of B on the conditioning of the Hessian was considered for uncorrelated R. As the terms λ 1 (B −1 ) = 1∕λ N (B) and λ 1 (R −1 ) = 1∕λ N (R) appear in both upper and lower bounds of (15), we investigate the relationship between changing the eigenvalues of B and R and the condition number of S. We also investigate how correlations in B and in R interact in terms of both the bounds and the true conditioning of the Hessian.We then test our conclusions in terms of a minimisation problem, to assess the impact of changing correlation length scales on the number of iterations required for the convergence of a conjugate gradient routine.We present and discuss the results for H = H 1 , H 2 , and H 3 separately before comparing the different cases.

Investigating changing length scales: observing the first p variables (H = H 1 )
In Figure 3a, we plot the condition number of S (colour) with the maximum eigenvalue of B −1 , shown along the x-axis, and the maximum eigenvalue of R −1 , shown on the y-axis, for the case H = H 1 .Both axes and the colour values are shown with a logarithmic scale.We recall that as length scale increases, λ 1 (B −1 ) and λ 1 (R −1 ) both increase.
We observe the following: • For a fixed value of λ 1 (R −1 ), increasing λ 1 (B −1 ) results in an increased value of (S).This behaviour is also seen in the work of Haben 16 for an uncorrelated choice of R. The effect of this increase depends on the size of λ 1 (R −1 ); larger values of λ 1 (R −1 ) lead to smaller gradients in the contours of (S).The inclusion of correlated observation errors therefore results in a more complex dependence of (S) on B. • For a small fixed value of λ 1 (B −1 ), increasing λ 1 (R −1 ) results in an increased value of (S), whereas for a large fixed value of λ 1 (B −1 ), increasing λ 1 (R −1 ) has minimal impact on the value of (S).• In general, the impact of changing λ 1 (B −1 ) on (S) is larger than when changing λ 1 (R −1 ).
We hence note that interactions of λ 1 (B −1 ) and λ 1 (R −1 ) have an important effect on the condition number of S. This agrees with the results of Corollary 1, which showed that depending on the relationship between the largest eigenvalues of B −1 and H T R −1 H, there are two distinct bounds on the eigenvalues of S, one in terms of λ 1 (B −1 ) and the other in terms of λ 1 (H T R −1 H).
In Figure 3b, we see the number of iterations required for the conjugate gradient method to solve the problem described in Section 4.3.The values of λ 1 (R −1 ) plotted in Figure 3b are shown in Figure 3a as horizontal lines for 80 values of λ 1 (B −1 ).Firstly, for λ 1 (B −1 ) < 4, increasing λ 1 (B −1 ) for fixed λ 1 (R −1 ) results in an increase in the number of iterations required for convergence.Additionally, for fixed λ 1 (B −1 ), increasing λ 1 (R −1 ) results in a clear increase in the number of iterations.This behaviour agrees well with the qualitative conclusions from the condition number experiment in Figure 3a.For λ 1 (B −1 ) > 4, we see a decrease in the number of iterations as λ 1 (B −1 ) increases.In this range, the value of (S) is similar across each of the horizontal lines shown in Figure 3a, so we could expect the number of iterations to convergence to be similar.Additionally, the Hessian is extremely ill conditioned, which, combined with a small tolerance in the conjugate gradient routine, could explain the noisy values for large λ 1 (B −1 ).We also show 10 equally spaced contours (solid lines) and horizontal lines (corresponding to the lines plotted in b, d, and f).The solid, dotted, and dash-dotted lines represent log 10 (λ 1 (H T R −1 H)) = 3.24, 4.14, and 4.67, respectively

Investigating changing length scales: observing p alternate state variables (H = H 2 )
In Figure 3c, we see how changing B and R affects the condition number of S for the case H = H 2 .The changes in (S) with λ 1 (R −1 ) and λ 1 (B −1 ) are qualitatively similar to the case H 1 described in Section 5.1.Again, we see that the interaction between B and R has an important effect on (S), in agreement with the results of Corollary 1.However, for H = H 2 , the change of behaviour of (S) does not occur smoothly; we observe a discontinuity in the gradient of the contours.As λ 1 (R −1 ) increases, the value of λ 1 (B −1 ) at which this "kink" occurs also increases linearly.We will investigate this kink further in Section 5.6 and show that it is caused by a change in regime.
In Figure 3d, we see the number of iterations required for the conjugate gradient method to converge for the case H = H 2 .
• For fixed values of λ 1 (R −1 ) we observe a change in behaviour as λ 1 (B −1 ) increases; for smaller values of λ 1 (B −1 ) we see a decrease in the number of iterations as λ 1 (B −1 ) increases, and for larger values of λ 1 (B −1 ) the number of iterations increases with λ 1 (B −1 ).This does not agree with the results for the condition number of S in Figure 3c, where an increase in λ 1 (B −1 ) causes an increase in (S) for all values of λ 1 (B −1 ).• For smaller values of λ 1 (B −1 ), increasing λ 1 (R −1 ) leads to an increase in the number of iterations required for convergence.For larger values of λ 1 (B −1 ), which occur to the right of the kink, increasing λ 1 (R −1 ) decreases the number of iterations.Again, this is unlike the results seen for the condition number, where increasing λ 1 (R −1 ) leads to an increase in both the actual value and the upper bound of (S) for all values of λ 1 (B −1 ).
We note that the value of λ 1 (B −1 ), where this change in behaviour occurs, is the same as the value of λ 1 (B −1 ), where the change in gradient of the contours occurs in Figure 3c, indicating that the kink is caused by an underlying change in regime.If we consider the eigenvalues of S (not shown here), the clustering of eigenvalues increases as the kink is approached.The clustering of eigenvalues is important for the convergence of a conjugate gradient method 19 and is not detected by the condition number.This explains the difference in behaviour between Figure 3c and Figure 3d with increasing λ 1 (B −1 ).

Investigating changing length scales: observing p alternate variables smoothed over five state variables (H = H 3 )
In Figure 3e, we see how changing B and R affects the condition number of S for the case H = H 3 .The behaviour of (S) with changing λ 1 (B −1 ) and λ 1 (R −1 ) is qualitatively similar to the case H = H 2 .However, for H = H 3 and fixed λ 1 (B −1 ), only changes to very large values of λ 1 (R −1 ) result in a significant change to (S), and this is true for only the smallest values of λ 1 (B −1 ).Again, interaction between λ 1 (B −1 ) and λ 1 (R −1 ) has an important impact on (S) but to much less of an extent than in the previous two cases.This agrees with the results of Corollary 1, as the value of and hence, L R will need to take a much larger value in order that λ 1 (H T 3 R −1 H 3 ) + λ N (B −1 ) > λ 1 (B −1 ) .A discontinuity in gradient similar to the one observed for the case H = H 2 is seen here but for much larger values of λ 1 (H T R −1 H) than for Figure 3c.
In Figure 3f, we see the number of iterations required for the conjugate gradient to converge for the problem described in Section 4.3 when H = H 3 .Similar to Figure 3d, we see an initial decrease in the number of iterations required for convergence, before a turning point where the number of iterations increases with λ 1 (B −1 ).This turning point occurs for the same values of λ 1 (B −1 ) as the discontinuity in gradient that was seen in Figure 3e.As the value of λ 1 (B −1 ) at which this kink occurs is much smaller than for the case H = H 2 , for most values of λ 1 (B −1 ), increasing λ 1 (R −1 ) decreases the number of iterations.As in the case H = H 2 , the clustering of the eigenvalues of S increases as we approach the kink.The structure of the eigenvalues is more important in determining the convergence of a conjugate gradient method than the condition number in this case.

Investigating bounds and actual value of 𝜅(S) for different choices of observation operator
We now compare the effect of changing the observation operator on both the condition number of S and the bounds of S introduced in Section 3. Of particular interest is how tight the bounds are for different values of λ 1 (B −1 ).For clarity, the Hessian for the cases H = H 1 , H = H 2 , and H = H 3 will be referred to as S 1 , S 2 , and S 3 , respectively.Figure 4 displays the actual value of the condition number and the bounds from (15) for a fixed choice of R with L R = 0.33 for all three choices of H.We recall (Section 4.2) that the bounds for the cases H = H 1 and H = H 2 are equal, with tighter bounds for the case H = H 3 .This is because the maximum eigenvalue of H 3 H T 3 , which appears in both upper and lower bounds, is 0.52 rather than 1.
• Figure 4 shows cases where both the upper and lower bounds given by ( 15) are tight.The upper bound is close to the actual value of (S) for H 1 , particularly when λ 1 (B −1 ) is small.For small values of λ 1 (B −1 ), the actual value of (S) for H 3 is much closer to the lower bound than the bound.• The kink that was observed in for H = H 2 can also be seen in Figure 4.The kink occurs at the location where (S 2 ) coincides with (S 3 ).For values of λ 1 (B −1 ) greater than the kink, (S 2 ) and (S 3 ) are very close to each other.• For all choices of H shown in Figure 4, increasing λ 1 (B −1 ) leads to the upper bound moving away from both the lower bound and the actual value of (S).
We note that we have found different choices of B, R, and H, where the actual values of S are close to both the upper and lower bounds given by (15).We now discuss the implications of changing B, R, and H in terms of the condition number of S and the number of iterations required for the conjugate gradient to converge.

Comparison of results
In this section, we compare the results of the previous sections for different choices of observation operator H, as well as different choices of B and R. We recall that λ 1 (B −1 ) = 1∕λ N (B) and λ 1 (R −1 ) = 1∕λ N (R).
We begin by considering how the lower bounds given by Lemma 1 for λ 1 (S) change depending on whether λ 1 (B −1 ) or λ 1 (H T R −1 H) + λ N (B −1 ) is the larger term.
• For a fixed value of L R and changing L B : For small values of λ 1 (B −1 ), the lower bound of λ 1 (S) from ( 8) is given by λ 1 (H T R −1 H) + λ N (B −1 ), meaning that the maximum eigenvalue of H T R −1 H is most important for determining λ 1 (S).• As L B increases, at some point, λ 1 (B −1 ) will be larger than λ 1 (H T R −1 H) + λ N (B −1 ), meaning that λ 1 (B −1 ) will be the most important term for determining λ 1 (S).• Alternatively, fixing L B and changing L R , we observe similar behaviour: For smaller values of L R , we see less impact on (S) when changing L R than for larger values of L R , where a change in λ 1 (H T R −1 H) has a significant effect on the value of (S).
This behaviour is seen for all choices of H in Figure 3.This bound also provides justification for the variation with λ 1 (B −1 ) and λ 1 (R −1 ) in the gradient of the contours seen in Figure 3a,c,e.
We now consider the similarities between different choices of observation operator for the two experiments, as follows: • For a fixed choice of H, there are strong similarities between the effect of increasing λ 1 (B −1 ) on the convergence of the conjugate gradient method and the effect on the condition number of the Hessian.In particular, the kink in the condition number (Figure 3c,e) and the change in gradient for convergence (Figure 3d,f) occur at the same values of λ 1 (B −1 ) and λ 1 (R −1 ) for both H = H 2 and H = H 3 .This indicates that the kink is due to a change in the underlying structure of S. • The effect of varying λ 1 (R −1 ) and λ 1 (B −1 ) for H 1 , H 2 , and H 3 was broadly similar in terms of (S), with the main difference being the discontinuity in the contours of (S) seen for H 2 and H 3 but not for H 1 .
We also see some large differences between the two experiments.The main dissimilarity between the graphs for condition number (Figure 3a,c,e) and for convergence (Figure 3b,d,e) is that increasing λ 1 (B −1 ) uniformly results in an increase in the condition number of S, but it is not always linked to an increase in the number of iterations required for convergence.This difference was explained in Sections 5.1-5.3 by the clustering of eigenvalues near the kink for H 2 and H 3 .
For the conjugate gradient experiments, conclusions for the cases H = H 2 and H = H 3 were very different from those of the case H = H 1 .Both H = H 2 and H = H 3 have block-circulant structures, meaning that in these cases S will have a block-circulant structure.We suggest that this is the reason for the difference in eigenvalue clustering behaviour compared with the case H = H 1 .This was tested through the use of an additional noncirculant observation operator made by observing 100 random state variables.The behaviour in this case is very similar to that observed for H = H 1 .The fact that qualitative behaviour for the case H = H 1 is the same as for the randomly selected observation operator supports the conjecture that the rapid convergence of the conjugate gradient seen for H = H 2 and H = H 3 is caused by the inherent block-circulant structure of S 2 and S 3 .

Understanding the discontinuity in the gradient for H = H 2 and H = H 3
We now return to discuss the discontinuity in the gradient, or kink, that was observed for H = H 2 and H = H 3 for both the condition number of S (Figure 3c,e) and the convergence of the conjugate gradient method (Figure 3d,f).We explain this theoretically and discuss why the discontinuity in gradient is observed for H 2 and H 3 but not for H 1 .We begin by considering the bounds for the eigenvalues of S in terms of the eigenvalues of B −1 and R −1 , using the bounds given by Corollary 1 and the discussion that follows in Section 3.1.
Equations ( 8)-( 11) explain the variation with λ 1 (B −1 ) and λ 1 (R −1 ) that was observed in Figure 3.However, as the bounds in (10) and (11) apply to all choices of H, they do not explain the difference between the choices of H for which the kink is observed (H 2 and H 3 ) and the choices of H that have smoothly varying values of (S) (namely H 1 ).
In order to illustrate why kink occurs for some choices of H but not for others, we present a tighter upper bound for the specific framework used in the numerical experiments for two cases, beginning with H 1 .By expressing S in terms of the difference between a circulant matrix and a low-rank update, we use (20) to directly compute the eigenvalues of the circulant component via a direct Fourier decomposition.This allows us to show that the kink occurs when there is a significant change in the wavenumber corresponding to the largest eigenvalue of S. Lemma 3. We define C 1 as in the Appendix.For H = H 1 , we can bound the eigenvalues of S above by the following: where the eigenvalues of C 1 are given by the following: where  = e −2i/N .Recall (using the notation introduced in Section 3.3) that the  j s are ordered in terms of wavenumber rather than by decreasing eigenvalue.
Proof.See the Appendix.
Lemma 3 yields an expression that is a sum of an eigenvalue of B −1 , plus a term depending on the coefficients of R −1 and the structure of H 1 .The choice of H = H 1 is important in determining the wavenumber at which the maximum value of the second term of (34) is achieved.From Section 4.1, we recall that the largest eigenvalue of B −1 occurs for the pth wavenumber,  N∕2 (B −1 ), for N = 2p, or  (N±1)∕2 (B −1 ), for N = 2p + 1.The eigenvalues of B −1 ordered by the wavenumber are shown by circles in Figure 5.The crosses in Figure 5 show the second term of (34) ordered by the wavenumber.For H 1 , the largest value of the second term of (34) occurs for the same wavenumber as the largest eigenvalue of B −1 .The maximum value of this term is equal to λ 1 (R −1 ).This means that as λ 1 (S 1 ) changes from being controlled by λ 1 (R −1 ) to λ 1 (B −1 ), the change appears smooth, as the wavenumber associated with the frequency of the largest eigenvalue remains constant.It is clear that increasing L B will have a significant effect on the value of this bound, as changing L B increases λ 1 (B −1 ) significantly, and hence the upper bound given by (33).Therefore, for both regimes, changing L B has a large impact on both bounds for λ 1 (S).
We now present a similar bound for H = H 2 .
Lemma 4. For H = H 2 , the eigenvalues of S are bounded above by the following: where the eigenvalues of C 2 are given by the following:

FIGURE 5
Plots of the contribution of the background and observation terms to the eigenvalues of the circulant matrix made up of the first row of S 1 and S 2 for L R = 0.7 and L B = 0.3.Circles denote the eigenvalues of B −1 (which is a term in both ( 34) and ( 36)), crosses denote the contribution of R −1 in the second term of (34) (i.e., for H = H 1 ), and pluses denote the contribution of R −1 in the second term of (36) (i.e., for H = H 2 ) Recall (Section 3.3) that  j s are ordered in terms of the wavenumber rather than by the maximum eigenvalue.
Proof.See the Appendix.
Lemma 4 also yields an upper bound that is the sum of an eigenvalue of B −1 and a term depending on R −1 and the choice of H 2 .We note that the values of the second term of (36) take the same values as the second term of (34) but in a different order.These are shown by the pluses in Figure 5, where we see that in the order of wavenumber j, the second term of (36) yields the spectrum of R −1 twice.The second term of ( 36) is maximised for j = p∕2 and j = 3p∕2.These are different wavenumbers to the value of j = p, which maximises the first term.
Hence, we can bound λ 1 (S) above by λ . In this case, increasing L B has a very small effect on the upper bound for λ 1 (S), as λ N∕4 (B −1 ) does not change significantly with L B .However, when λ , small changes to L B will have a larger impact on λ 1 (B −1 ) for all choices of H. Similar behaviour is observed for fixed L B and changing L R .This change in the wavenumber of the largest eigenvalue explains why the kink occurs in the case of H 2 .
Finally, we discuss why the kink occurs for different values of L B and L R for H 2 and H 3 .We have shown that the kink occurs when λ 1 (B −1 ) becomes larger than λ 1 (H As the contribution of B −1 is not affected by the choice of observation operator, changing from H 2 to H 3 increases the value of L R necessary for λ 1 (H T R −1 H) to be greater than λ 1 (B −1 ).Hence, the kink is only visible (see Figure 3e) for L R ≫ L B for the choice H = H 3 .

CONCLUSIONS
Data assimilation is an important technique for combining information from observations with model data for the purpose of state estimation.One application of this is in NWP, where data assimilation is used to combine observations of the atmosphere with a numerical model, in order to obtain an accurate description of the current state of the atmosphere.In this case, correct specification of the uncertainty of each term is needed to produce the best forecast.The introduction of correlated observation error terms at operational NWP centres motivates investigation into the influence of observation error covariance on the convergence of the data assimilation procedure.We emphasise that the results presented here are general and are relevant for any application of variational data assimilation.Improved knowledge of the role of correlated observation error covariances will be of use in the context of engineering, 33 neuroscience, 33,34 and ecology. 35,36n this work, we developed theoretical bounds on the condition number of the Hessian of the 3D-Var objective function, which can be studied as a bound on the speed of convergence of the minimisation.These bounds were then tested in a simple numerical framework.We found the following: • The bounds separate the contributions of the (correlated) observation error, background error, and observation operator, allowing us to better understand the role played by each term.We note that Theorem 5 and Corollary 1 in particular are general bounds applying to any valid covariance matrices and any choice of observation network.
• Numerical experiments for simple linear choices of observation network revealed interaction between observation error and background error terms.This interaction was also demonstrated theoretically for any choice of observation network and error covariance matrices.• The structure of the observation network was seen to be crucial for determining how the observation and background errors interact.• Both bounds and experiments revealed that the minimum eigenvalue of the two error covariance matrices is important for determining the conditioning of the Hessian, as well as the number of iterations required for the convergence of a minimisation procedure.This agrees with the findings of Weston et al., 9 where small minimum eigenvalues of the observation error covariance matrix caused convergence problems in a practical setting.• The ratio of the variances was also shown to be influential, although this was not investigated in detail in this work.
This was also seen in the work of Haben. 16 emphasise that many of the theoretical results and conclusions presented in this work are general and apply to any valid choice of background and observation error covariance matrices and any linear observation operator.In particular, although the theoretical results presented in this paper focus on the 3D-Var problem, a natural extension to 4D-Var is obtained by replacing the observation operator H with the generalised 4D observation operator that incorporates dynamical model information. 16It is therefore expected that the eigenvalues of this model will also be important for the conditioning of the Hessian in this framework.
The importance of the choice of observation operator was revealed by the numerical tests, both for the condition number of the Hessian and in terms of interaction between observation and background error covariances.Even for two observation networks with identical theoretical bounds, very different behaviour was observed numerically.This was explained by the existence of underlying structures in the Hessian, induced by the structures of the constituent error covariance and observation operator matrices.Better understanding of these interactions will be important for predicting the response of operational systems to the introduction of correlated observation errors.This is particularly applicable in practical applications where diagnosed correlated observation error covariance matrices must be adapted prior to their use in order to ensure the convergence of the minimisation of the objective function.
In the numerical experiments presented in this paper in Section 5, the observation and background error covariance matrices were altered by changing the length scales of the underlying correlation functions.This approach is mainly applicable for spatial correlations, where correlation length scales have a physical interpretation.There is significant research investigating spatial correlations, 20,23,27,51 but much current work concerns the practical implementation of interchannel correlations for satellite observations. 3,6,9,21,22,26Although the theory presented in Section 3 applies directly to the case of interchannel correlations, it would be of interest to extend our numerical testing to the interchannel covariance case.In particular, practical experiments have revealed that the minimum eigenvalue of the observation error correlation matrix is important for the conditioning of the Hessian in the case of interchannel correlations, 3,9 which coincides with the theoretical and experimental results presented in this work.This is of particular interest as the correlation structure used by Weston et al. 9 is not circulant and demonstrates that, even beyond the numerical framework presented in this paper, our qualitative conclusions provide useful insight.
An additional area of future interest is investigation into how the best choice of preconditioning changes with the introduction of correlated observation error.Bounds on conditioning for the preconditioned case could be found by extending the results presented here, using similar theoretical techniques to those used in this work.The numerical and theoretical results discussed in this paper suggest that interactions between observation and background correlations are also likely in that framework.It is expected that understanding how the introduction of correlated observation error covariance affects the unpreconditioned 3D-Var problem will provide insight for suitable preconditioning methods in the correlated setting.One question of particular interest is whether the use of the background error covariance term as a preconditioner, as is done for the control variable transform, 30 remains optimal.One example of an operational problem that is not preconditioned is the 1D-Var used at the UK Met Office for quality control 5 ; the conclusions from this paper apply directly to that implementation.The application of these results to the UK Met Office system will be discussed in a future paper.

FIGURE 2 5 Definition 3 .
FIGURE 2Visualisation of the observation operators described in Definition 3 for the case p = 10 and N = 20.Shading indicates the value of the entry in the matrix; in the case of H 1 and H 2 , all nonzero entries are 1, and for H 3 , all nonzero entries are1   5

TABLE 1
Summary of how terms that appear in(15)change with the length scales L B and L R for B ∈ R 200×200 and R ∈ R 100×100 Bounds (dashed lines) and condition number (solid lines) of S for H 1 (cross), H 2 (triangle), and H 3 (circle) for L R = 0.33.The bounds are calculated using(15)for all choices of H.We note that the bounds for the cases H 1 and H 2 are the same