Numerical linear algebra in data assimilation

Data assimilation is a method that combines observations (ie, real world data) of a state of a system with model output for that system in order to improve the estimate of the state of the system and thereby the model output. The model is usually represented by a discretized partial differential equation. The data assimilation problem can be formulated as a large scale Bayesian inverse problem. Based on this interpretation we will derive the most important variational and sequential data assimilation approaches, in particular three‐dimensional and four‐dimensional variational data assimilation (3D‐Var and 4D‐Var) and the Kalman filter. We will then consider more advanced methods which are extensions of the Kalman filter and variational data assimilation and pay particular attention to their advantages and disadvantages. The data assimilation problem usually results in a very large optimization problem and/or a very large linear system to solve (due to inclusion of time and space dimensions). Therefore, the second part of this article aims to review advances and challenges, in particular from the numerical linear algebra perspective, within the various data assimilation approaches.

Even if we did have perfect knowledge of a system, often initial conditions or boundary conditions are not known to high accuracy.
In addition to the model, measurements of the system variables, perhaps indirect, are available at different locations in time and space.
Both the model and the observations obtained through measurements have errors.The data assimilation problem is therefore an inverse problem that uses incomplete and erroneous data and knowledge about a model which is also imperfect, in order to find the best possible estimate of the state of a system (or the best possible approximation of an unknown parameter, in which case we speak of parameter estimation).Using this state estimate one can then use the model to make predictions about the future states of the system, and, as more observations become available, update the state estimate using cycled data assimilation schemes.
Many algorithms for data assimilation have been developed over the past century, we introduce the main ones in the next section.However many challenges remain, in particular as more and more data become available and larger, more complex and higher dimensional problems need to be solved.
This review article will not address all problems arising within data assimilation, but will focus on challenges in data assimilation for numerical linear algebra.
The article is structured as follows: The most common methods for data assimilation are introduced in Section 2: Based on Bayes' theorem we derive variational and sequential data assimilation techniques.Section 3 focuses on the solution to the optimisation problem and the linear system arising within variational data assimilation.Approximations, in particular low-rank approximations, and other variants of the Kalman filter are considered in Section 4. Section 5 reviews several dimension reduction approaches for variational and sequential data assimilation methods, and in Section 6 we briefly consider various other aspects, such as the connection between data assimilation, Bayesian inference and Tikhonov regularisation.Finally the last two sections give a short survey on data assimilation software and a conclusion.By no means we consider this article a complete survey on data assimilation techniques, we rather focus on a selected methods and approaches where linear algebra plays an important role.

Basic methods and algorithms for state estimation
The goal of data assimilation is to incorporate measured observations into a model of a dynamical system in order to produce estimates of the current system state (and future system states) which are as accurate as possible.In that sense data assimilation can be defined as an approximation of a true state of a physical system at a given time, by combining time-distributed observations with the dynamical system model in some optimal way.
One can view the data assimilation problem as a Bayesian inference problem.Let x ∈ R n be a model state that we would like to estimate.In Bayesian statistics, we model x as a realisation of a random variable (here a random vector), X : Ω → R n .If Θ : Ω → R p is another random variable with mean zero modelling the observational noise, then we can also model the observed variable Y ∈ R p as a random variable, defined by Y = h(X) + Θ, where h : R n → R p is a (in general nonlinear) continuous map which models the transformation of the system space to the observation space.X and Θ are assumed to be independent.We would like to infer information about states x given realisations y of Y , which is a Bayesian inverse problem [12,162,57,2,34,35,6].
If we assume that Θ ∈ R p is a random variable with probability density π, and y as well as x realisations of the random variable Y and X, respectively, then the probability of y given x is given by π Y |X (y|x) = π(y − h(x)).This is often referred to as the data likelihood.Further let π X (x) be the probability density function of X, which describes our prior believes about the distribution of X.Then, by Bayes' formula, π X|Y (x|y), the posterior conditional probability density function of x given the observations y is given by where π Y (y) = R n π Y |X (y|x)π X (x)dx is a normalisation constant depending only on y (see, for example [162,119,149]).In general it is hard to obtain the entire probability density π X (x|y), in particular in higher dimensions.However, if we make some assumptions about the probability density functions of the prior and the likelihood, the problem of finding the posterior density can be simplified, which leads to data assimilation algorithms in the traditional sense, which we discuss in the next sections.We will distinguish between variational and sequential data assimilation methods.

Variational data assimilation
If the prior density in the Bayesian inference problem (1) is Gaussian, that is X ∼ N (x B , B) with mean x B ∈ R n (often called the background vector) and positive definite background error covariance matrix B ∈ R n×n , then we have Here x ∈ R n is the state vector we would like to estimate.If, in addition, the observation error is also Gaussian, that is Θ ∼ N (0, R) (or Y ∼ N (h(X), R)) with positive definite error covariance matrix R ∈ R p×p , then the likelihood function can be written as and hence where z 2 R −1 := z T R −1 z is a weighted norm, the so-called Mahalanobis distance to zero.The maximum a posteriori (MAP) estimator for x ∈ R n is then given by argmin x∈R n J(x), where J which is a weighted nonlinear least squares problem.The minimisation problem in (2) is what is known as the three dimensional variational data assimilation problem (3D-Var ).If h : R n → R p is a linear operator, represented by a matrix H ∈ R p×n , the solution to (2) can be computed immediately using the (sometimes called Kalman gain) matrix K given by , using the Sherman-Morrison-Woodbury formula [89], and the minimum in ( 2) is then given by This direct solve of the optimisation problem ( 2) is sometimes referred to as optimal interpolation [56].We note that the minimisation problem in ( 2) is a generalised form of Tikhonov regularisation [82,134,12,102,138,111], the most commonly used form of regularisation for inverse problem.In standard Tikhonov regularisation problems, the operator R = I, where I is the identity matrix, x B = 0, B = 1 λ I, where λ > 0 is a regularisation parameter, and often h is linear.The generalised form (2) is a nonlinear Tikhonov regularisation in a weighted inner product space (see, [82,35,175]).In recent years the dynamic aspect of inverse problems (naturally leading to a variety of data assimilation problems) has become of interest in the inverse problems community, we especially refer to [158,112] and references therein.
In practice the matrices involved in the computation of the Kalman gain are very large and direct inversion is not feasible, often it is not even possible to store the full matrices.In addition, if the operator h is nonlinear, then an iterative optimisation procedure is required in order to solve the minimisation problem (2), we will discuss more details about this optimisation in Section 3.
In variational data assimilation one uses a descent algorithm in the direction of the gradient and an adjoint approach for the computation of the gradient in order to solve the minimisation problem.
The problem ( 2) is a stationary one, there is no time-dependence.Instead of a state x fixed in time, let us consider a set of state estimates x = [x T 0 , . . ., x T N ] T , where x i ∈ R n refers to a state at time t i .Define a new observation operator H : R n(N +1) → R p(N +1) , with H : [x T 0 , . . ., x T N ] T → [H 0 (x 0 ) T , . . ., H N (x N ) T ] T and let y = [y T 0 , . . ., y T N ] T , where y i ∈ R p denotes a set of observations at time t i , i = 0, . . ., N .Furthermore, let R ∈ R p(N +1)×p(N +1) now be a block diagonal matrix with R i ∈ R p×p , i = 0, . . ., N , on the diagonal blocks.Again assuming the observation errors are distributed according to a normal distribution with error covariance matrix R i we define the cost function In addition we introduce a discrete model for the evolution of the underlying physical system from time t i to time t i+1 , described by the dynamical system equations where M i : R n → R n can be time-dependent and nonlinear, describing the evolution of the dynamical system.Usually this forward operator requires the solution of a time-dependent partial differential equation.The problem argmin subject to the nonlinear model dynamics is a constrained optimisation problem and is called 4D-Var, four dimensional variational data assimilation.
Here, the subscripts refer to the time index.Note that in (4) the regularisation term only involves the initial state x 0 at time t 0 .The minimisation provides the initial condition of the model that most closely fits the data.
Figure 1 illustrates the difference between 4D-Var and 3D-Var (for illustration purposes the state dimension is n = 1).The 3D-Var problem finds the best estimate using an observation and a background state x B at one specific time point (here at the initial time), the best estimate is called x A 3D-Var .In 4D-Var observations are obtained throughout a time window -often called the assimilation window -and the optimisation is done over a time window.The best estimate, taking into account the background state, that is the previous forecast and the observations within the assimilation window is computed, where we have to take the dynamical forecast model into account.The best estimate at the initial state is then x A 4D-Var , which of course is not necessary the same as x A 3D-Var .Both estimates can be used to evolve the model forward and to obtain a better (updated) forecast.
The cost function in (2) or ( 4) is minimised using an iterative method (e.g.steepest descent, conjugate gradient, a quasi-Newton method etc. [141]).We choose where α (ℓ) is a damping parameter obtained through line search, for example, and ℓ is the iteration index.The search direction is d 0 ) for a quasi-Newton method (where ∇ 2 J(x   Clearly, 4D-Var (4)-( 5) is a constrained optimisation problem, and the gradient ∇J(x (ℓ) 0 ) is obtained via an adjoint approach (see [8]): We introduce Lagrange multipliers λ i ∈ R n at time t i , i = 1, . . ., N and define the Lagrangian Necessary conditions for the minimum of the 4D-Var cost function subject to the constraint are then found by taking the variations of L with respect to λ i and x i (KKT conditions, see [141]) where M i ∈ R n×n and H i ∈ R p×n are the Jacobians of the forward operator M i and the observation operator H i , evaluated at x i , respectively, that is The adjoint equations for the adjoint variables λ i , i = 0, . . ., N + 1, that measure the sensitivity of the cost function to changes in x i , are then given by λ N +1 = 0 (6) They provide an efficient method to compute the gradient of the objective function (4), which is then given by ∇J( The general method for solving the 4D-Var optimisation problem is sketched in Algorithm 1.

Algorithm 1 4D-Var
Input: error covariance matrices R i and B, routines to apply model and observation operators M i and H i and their linearisations M i and H i , respectively, and observations y i for i = 0, . . ., N , maximum number of iterations ℓ max .
Initialise the iteration ℓ = 0 and x 0 ) using the forward model.Compute ∇J(x (ℓ) 0 ) using the adjoint equations.Apply descent method (gradient descent, Newton, quasi-Newton . ..) to compute descent direction d (ℓ) .Update the initial condition x The problem in (4)-( 5) is often called strong constraint 4D-Var.In the presence of uncertainty we can model the imperfect state dynamics by where η i ∼ N (0, Q i ) is the model error which is assumed to be Gaussian with error covariance Q i , uncorrelated in time and uncorrelated with the background and observation errors.The relaxation of the strong constraint is commonly used in sequential data assimilation as we will see in Section 2.2.For variational data assimilation a weak constraint was proposed in [156], however, due to high computational costs is not used heavily in practice.In the past decades, with the availability of increasing computing power, there has been greater interest in this method [77,80,90,168].The cost function for weak constraint 4D-Var is given by where x = [x T 0 , . . ., x T N ] T , and x i is the model state at time step t i , i = 0, . . ., N .The resulting optimisation problem argmin x∈R n(N +1) J(x) can become extremely large (in three spatial and the additional time dimension) and computationally expensive to solve, as M i usually arises from a the discretisation of a nonlinear partial differential equation.We will discuss more detailed solution approaches to (9) in Section 3. Note that the variational approach assumes that prior and likelihood have Gaussian distributions.As such the posterior is also Gaussian if the observation operators h (for 3D-Var) or H (for 4D-Var, in which case it also includes the model operator) are linear.These are quite strong assumptions and, by minimising the cost function J in (2), (3), or (4)-( 5) we only compute the posterior mean (the MAP estimator).
We are able to approximate the posterior covariance matrix A by computing the inverse of the Hessian, A = (∇ 2 J(x)) −1 .This covariance matrix is exact for linear models and Gaussian distributions, in which case it is given by A . ., R N ) for 4D-Var, respectively (see more details about this at the end of Section 2.2 and in reference [84]).If the models are nonlinear or the distributions non-Gaussian, the inverse of the Hessian (∇ 2 J(x)) −1 is only an approximation of the posterior covariance matrix.
To conclude this section we emphasise that variational data assimilation is essentially based on optimal control theory, that is we minimise a cost function subject to a constraint arising from the state dynamics.The minimisation requires techniques from numerical optimisation.For more insight into the relation of variational data assimilation and PDE-constrained optimisation we refer to [8,76,98].

Sequential data assimilation and Kalman filters
In sequential data assimilation we correct the model state estimate whenever observations are available, that is we incorporate the observations sequentially.Recall the Bayesian inference problem (1) with x i ∈ R n and y i ∈ R p observations at time t i , for i = 0, . . ., N .Given a prior uncertainty π X (x i ), find the updated conditional posterior uncertainty π X|Y (x i |y i ) of x i , given the observation y i .Using the likelihood π Y |X (y i |x i ) and Bayes' formula we would like to compute Again, ideally we wish to compute the the entire probability density function π X|Y (x i |y i ) at every time step t i .We consider a discrete-time linear system dynamics, where w i ∼ N (0, Q i ) and v i ∼ N (0, R i ) represent model and observation/measurement errors and are assumed to be independent, and Gaussian with error covariance matrices Q i ∈ R n×n and R i ∈ R p×p respectively.Hence the errors are uncorrelated in time and between each other.Moreover, model and observation operators are assumed to be linear here, however, extensions to nonlinear models exist.The Kalman filter (as well as many other sequential data assimilation schemes) then follow two steps: A forecast or prediction step and an analysis or correction step, illustrated in Figure 2. At time step t i we have an analysis x A i available, which we assume to be the best estimate of the state at time t i (we have incorporated all our knowledge about observations and previous forecasts at time t i ).Then the state dynamics are evolved forward in time using (11) producing a forecast x F i+1 , which becomes the background state at step i + 1.We also have a set of observations y i+1 available which we combine with x F i+1 in order to get the best estimate of the state, a so-called analysis, x A i+1 .This analysis is obtained using the best linear unbiased estimate (BLUE), see [107].This is again used to evolve the state forward using the model dynamics.

Let x t
i ∈ R n be the unknown exact true state of the system.Then we define the error in the forecast at time t i by e F i = x F i − x t i , and similar, the error in the analysis at time t i by e A i = x A i − x t i .Moreover, the forecast and analysis error covariance matrices can then be expressed as respectively.
Forecast/predictor step Given a previous analysis state estimate x A i at time t i , an estimate of x t i+1 at time t i+1 is given by the application of the model dynamics ( 11) For the corresponding error covariance matrix Since the model is linear and the analysis error e A i and model error w i are assumed to be uncorrelated we obtain (using properties of the expected value and Analysis/corrector step Given an a-priori estimate x F i and observations y i at time step i, the goal of the analysis step is to compute the optimal a-posteriori estimate, x A i as a linear combination of x F i and y i of the form where K i ∈ R n×p is called the Kalman gain matrix [114] and the general form of ( 14) arises from the assumption that the estimate x A i of the true vector x t i should be both linear and unbiased.The a-posteriori error covariance is given by , where we have used the observation equation (12).With the definition of where we have also used that the forecast error e F i and observation error v i are uncorrelated, random variables, that is E[e F i v T i ] = 0.The Kalman gain matrix K i in ( 15) is chosen to minimise the a-posterior variance tr(P A i ).We evaluate ∂P A i ∂Ki = 0, and using results from matrix differential calculus we obtain the Kalman gain for details of the derivation, see [8,107,13].Substituting K i into (15) then yields Note that, for P F i small (or zero) the Kalman gain is also small (or zero) and the a-posteriori estimate x A i is heavily geared towards x F i .Similarly, if the observation error covariance matrix R i is zero (for perfect observations) the estimate x A i is steered towards the observations (in particular, for H i square and nonsingular We remark that the error covariance matrices in the Kalman filter equations satisfy discrete algebraic Riccati equations [117,100]. Clearly the Kalman Filter has some shortcomings, as it is only optimal for linear models and observation operators, and Gaussian error statistics (in which case the mean and error covariances, which are propagated in the Kalman Filter, are sufficient to describe the probability density function of the state estimates).
Several methods have been proposed to overcome these issues, some we mention here, for more details we refer to [8].The Extended Kalman Filter (EKF) applies to systems of the form where M i and H i are nonlinear.The nonlinearities are dealt with in the filter equations by linearising the model and observation operator about x F i and x A i , in the computations of the Kalman gain and error covariance matrices, that is For computing the analysis state estimate and the forecast state estimate the nonlinear operators are used.We obtain an approximation to the best linear unbiased estimate, see [85,110].An algorithmic description of the EKF is given in Algorithm 2.

Algorithm 2 Extended Kalman filter
Input: error covariance matrices R i and Q i , routines to apply model and observation operators M i and H i and their linearisations M i and H i , respectively, and observations y i for i = 0, . . ., N .Initialise the system state x F 0 and the corresponding error covariance matrix P F 0 .for i = 0, . . ., N do Compute Kalman gain There are a lot of similarities between the Kalman filter discussed in this section and variational data assimilation discussed in Section 2.1.In particular, one can interpret the analysis/corrector step in the Kalman filter in a variational form: the Kalman filter state estimate x A i from ( 14) with the Kalman gain ( 16) exactly represent the equations we obtained for computing the solution to the 3D-Var problem in Section 2.1.Hence, for the variational formulation of the Kalman filter analysis step we have which is a quadratic problem for a linear observation operator H i and hence has a unique solution which can be obtained by setting the gradient ∇J i (x i ) = 0.For a nonlinear observation operator the solution may not be unique.The Hessian of this cost function (for linear observation operator H i ) is given by and, with the Sherman-Morrison-Woodbury formula we have which is precisely the a-posteriori error covariance matrix P A i given in (17).Therefore, the inverse of the Hessian is equal to the posterior covariance (in the linear case with Gaussian errors).For nonlinear problems, ∇ 2 J i (x i ) −1 is an approximation to the posterior covariance.Solving (18) iteratively is more advantageous than computing the Kalman gain directly for very large problems with sparsity structure.More details on this idea can be found in [13].
To complete this section we add a remark about Kalman smoothers.Filtering algorithms make use of observations as they become available and provide the best estimate of a state given all past information and the current observation.Smoothing algorithms provide the best estimate of a system, using past, present and future information.So the Kalman smoother is equivalent to the Kalman filter at the final time step.Moreover, it can be shown that for linear problems with linear observation operator and model dynamics, and the same initial background error covariance, the Kalman filter and 4D-Var result in the same state estimate for the same time step (when the same observations have been used), see [82,127,75].
In the next sections we give an overview of important approaches and contributions from numerical linear algebra to solve variational and sequential data assimilation problems efficiently and provide an outlook for challenges ahead.

Solutions to the optimisation problem arising in variational data assimilation
Variational data assimilation leads to large PDE-constrained optimisation problems.We will first concentrate on 4D-Var (4)-( 5) and later discuss weak constraint 4D-Var (8).
Incremental 4D-Var and Gauss-Newton method Since the full nonlinear minimisation (4)-( 5) is difficult to solve, special algorithms for optimisation and linear algebra are required.In geoscience applications an incremental approach [50] was proposed, which is merely a Gauss-Newton method for nonlinear least squares problems [121,122,94].In incremental variational data assimilation the solution to the nonlinear optimisation problem is approximated by a sequence of minimisations of quadratic (and hence convex) cost functions, which are obtained by linearising both the model and the observation operators.Let x (ℓ) 0 be the ℓth estimate to the solution at x 0 .We linearise the cost function (4) around the model trajectory from this estimate: we linearise the model and observation operators about x (ℓ) 0 and then obtain the next iterate by the increment δx We use this expansion and substitute it into the nonlinear cost function ( 4), which we then linearise about the model trajectory obtained from x (ℓ) 0 .We then see that the increment δx 0 is obtained by minimising the quadratic problem (the incremental cost function) is the nonlinear trajectory computed from the current estimate at the initial time x where M i is the linear solution operator of the nonlinear model M i linearised around the nonlinear trajectory.We therefore obtain a so-called inner-outer iterative method.The outer iteration (with iteration index ℓ) is represented by the update of the nonlinear trajectory (19).The minimisation of the quadratic cost function (20) is the inner iteration, it essentially amounts to the solution of a large linear system.At each iteration the forward model, and its linearisation, are evaluated in order to compute the cost function (20), and the adjoint model (7) is applied in order to compute the gradient of the cost function.
Incremental variational data assimilation can be shown to be a version of the Gauss-Newton method applied to the original nonlinear cost function (3).In order to illustrate this, consider a general nonlinear least squares problem with f : R n → R q a twice continuously differentiable function [f 1 (x), . . ., f q (x)] and • denoting the Euclidean norm.Let G(x) = f ′ (x) be the q × n Jacobian of f (x).The gradient and Hessian of Φ(x) are then given by The Gauss-Newton method discards the expensive second term in the Hessian when applying Newton's method to ∇Φ(x) = G(x) T f (x) = 0.The Gauss-Newton method is described in Algorithm 3. Note that the

Algorithm 3 Gauss-Newton method
Input: Routines to compute f (x) and its Jacobian G(x), maximum number of iterations ℓ max .
Initialise the iteration ℓ = 0 and ). Update x (ℓ+1) = x (ℓ) + δx (ℓ) .Set ℓ = ℓ + 1. end while second step of the Algorithm is equivalent to solving the linearised least squares problem at each iteration ℓ.If we define , then the general cost function (21), with x = x 0 is equivalent to the 4D-Var cost function ( 4)- (5), where here q = n + (N + 1)p.When applying the Gauss-Newton method to (21), then the linearised least squares problem ( 22) is equivalent to finding the solution to the quadratic cost function (20) with x = x 0 .More details and also suitable stopping criteria for the inner-outer iteration arising within the Gauss-Newton method were discussed in [121,122,123,94].Besides Gauss-Newton, there are of course standard methods to solve nonlinear least squares problems, such as Levenberg-Marquardt and Quasi-Newton methods (in particular BFGS), see, for example [76,141].
The inner iteration and preconditioning The quadratic cost function (20) (see also (22)) within the Gauss-Newton (incremental) approach can be minimised using a conjugate gradient (CG) method [104,89,141,88].As the problem is usually very large, in practice, the inner loop, the CG iteration, is applied to a system with lower spatial resolution and with a preconditioner [76].One of the earliest approaches is the multilevel setting in [50], where the quadratic cost function is replaced by a lower resolution model, in order to obtain a multi-resolution scheme.
In addition, two-level preconditioning is usually applied within 4D-Var in the following way.We consider the cost function where , where H 0 (x 0 ) = h(x 0 ) for 3D-Var, respectively.The formulation in (23) aims to treat 3D-Var and 4D-Var simultaneously.
At each step of the Gauss-Newton process an equation of the form ) needs to be solved, which results in , where H depends on x (ℓ) .Note that H is the linearisation of H at x (ℓ) .For 4D-Var H includes the model operator, )p×(N +1)p , for 3D-Var H = H 0 and R = R 0 .We exclude the superscript ℓ for ease of notation.The Hessian of the 4D-Var cost function at the ℓth linearised problem is given by B The first level preconditioning employs a linear change in variables, δx = Lδx, where B = LL T , that is, the Cholesky factor of the background error covariance matrix.The Hessian of the cost function then becomes I + L T H T R −1 HL, which ensures that the smallest eigenvalue of the transformed Hessian is one.Moreover, since the rank of L T H T R −1 HL is smaller than the dimension of the system, there are many unit eigenvalues.Note that another option is to employ a transformation with B to obtained the transformed Hessian I + BH T R −1 H, which is no longer symmetric (but the use of this transformation may be necessary if the factorisation of B is not available).Both transformed Hessians share the same eigenvalues [65]; they are all greater than or equal to one.Detailed analysis on the conditioning of the Hessian can be found in [101,163].
At a second preconditioning level the spectrum of the (symmetric) Hessian is used.This utilises the fact that a few of the dominant eigenvalues and corresponding eigenvectors can be obtained using the Lanczos method [118,54], and, one can compute those Hessian eigenpairs within the CG iteration itself [89].After k steps of the CG algorithm, a few approximate leading eigenpairs (λ i , w i ), i = 1, . . ., k of the Hessian are available and one can approximate the Hessian by where k is much smaller than the dimension of the linear system.Hence, after one step of the Gauss-Newton method (the outer iteration), approximations to the Hessian, C, are available for the next step of the Gauss-Newton iteration, and preconditioners C −1 , or C − 1 2 for H can be obtained by replacing λ i in (24) by 1/λ i or 1/ √ λ i , respectively, see [4,133,167].In [65] this procedure was extended to non-symmetric Hessians by using the bi-conjugate gradient method.
The work in [28] takes up the idea of approximating the Hessian by eigenpairs obtained from the Lanczos procedure.Matrix-vector products with the Hessian are expensive (since they require the evaluation of the linearised forward and adjoint models), so even obtaining the limited memory representation (24) may be computationally infeasible for large k.Therefore [28] propose a multilevel version of the limited memory Hessian, where the eigenvalue decompositions are obtained from several coarser levels and fed through to the finer level, which enhances the algorithm in terms of reducing the number of matrix vector products at the fine grid levels.In addition, multigrid solvers and multigrid preconditioners for the solution of the variational data assimilation problem were considered recently in [59,95].
The limited memory preconditioners (LMP) discussed above are often referred to as spectral LMP, as they use the spectral information of the Hessian approximation (24).More general versions of LMP were investigated in [170] and [93].In the latter the authors also show the equivalence of a certain reduced version of 4D-Var and the SEEK filter, discussed in Section 4, and use this equivalence to accelerate the convergence of the Gauss-Newton method.
The dual formulation of 4D-Var As observed in the previous paragraph, both for 3D-Var and for 4D-Var, each inner iteration of the Gauss-Newton method essentially attempts to minimise the cost function see (20), which is equivalent to solving Here we have neglected all sub-and superscripts for simplicity.The minimisation of the cost function in (25) is often referred to as the primal approach, as the minimisation takes place in model space.Using the Sherman-Morrison-Woodbury formula we write the solution as which can be obtained by solving the smaller system in observation space (HBH T + R)λ = d and setting δx = BH T λ.This method is known as dual formulation of 3D-Var/4D-Var, or PSAS (Physical-space Statistical Analysis System) [48].It is easy to see that the cost function for the dual formulation is given by If the dimension of the observation space p is significantly smaller than that of the model state space n, the dual formulation can reduce both memory usage and computational cost compared to the primal approach.Preconditioned conjugate gradient methods for the dual approach were considered in [96,92] and convergence properties of the primal and dual approaches were investigated in [64,1].Most recently, so called B-preconditioned minimisation algorithms for variational data assimilation were introduced in the paper [99], which also contains a good literature review on the dual formulation and PSAS.
Weak constraint 4D-Var and saddle point formulation of the inner iteration Weak constraint 4D-Var requires minimisation over all state variables within the assimilation window and is therefore more computationally expensive.The incremental approach [50] for the more general weak constraint 4D-Var cost function ( 8) can be formulated as follows.We approximate the 4D-Var cost function by a quadratic function of an increment where denotes the ℓth iterate of the Gauss-Newton algorithm applied to weak-constraint 4D-Var.This increment δx (ℓ) is a solution to the minimisation of the linearised cost function where M i and H i , are linearisations of M i and H i about the current state trajectory x (ℓ) , and b i+1 .Dropping the superscript for the ℓth iterate for simplicity, the linearised cost function (27) can be written more concisely as where δx = δx T 0 , δx T 1 , • • • , δx T N T and L and H are matrices of size (N + 1)n × (N + 1)n and (N + 1)p × (N + 1)n, respectively: which can be thought of as all-at-once model and observation operators over the assimilation window.Here we assume y i ∈ R p , but this can be generalised to y i ∈ R pi .We assume there is no correlation between the errors at each time steps, and hence the covariance matrices are block diagonal matrices Moreover, the vectors b and d are given by The system above can be written as a saddle point problem [72,77,73,74], a form that recently has seen a lot of interest for data assimilation problems, see also [80,97].With new variables the gradient of the cost function (28) provides a constraint and altogether the coupled linear system a very large, sparse, symmetric indefinite saddle point system needs to be solved at every inner iteration.A vast amount of literature on saddle point problems and their solution via Krylov methods and preconditioners is available, see for example [21] and references therein.For this particular saddle point problem low-rank limited memory preconditioners exploiting the structure of the saddle point problem were proposed and analysed in [73] (see also [91]).In the work [80,97] the Kronecker structure of the saddle point problem was used in order to compute low-rank solutions to GMRES.The convergence of the saddle point formulation of weak constrained 4D-Var was reviewed in [90], and spectral estimates for the saddle point system were obtained in [58].
Hybrid methods and inexact approaches It has been observed that incremental weak constraint 4D-Var for minimising the large scale cost function ( 8) may diverge, hence the authors in [131] add a regularisation term and thereby replace the Gauss-Newton approach by the Levenberg-Marquardt method.
In addition they use an ensemble Kalman smoother (see Section 4 for a discussion on ensemble methods) within the minimisation process and thereby apply a hybrid approach.Such hybrid methods combining variational and sequential ensemble approaches are popular as ensemble methods are naturally parallelisable and do not require adjoint operators [45].A parallel-in-time approach for solving the strong constraint 4D-Var optimisation problem was proposed in [147].
When both the model operator and the gradient are not available exactly, inexact methods need to be used.In [23,18] a Levenberg-Marquardt method is proposed and investigated for dealing with inexact gradients and Jacobians.

The Kalman filter and low-rank approximations
The Kalman filter is impractical for large dimensional systems.It requires the storage and evolution of large covariance matrices P A/F i , which are both not feasible for very high dimensional systems (for example, in oceanography and numerical weather prediction, the state dimension is of the order of 10 7 and higher).The propagation of the error covariance in (13) requires a number of integrations of the forward model equal to the dimension of the system.Moreover matrix inversion within the Kalman gain computation ( 16) is expensive.Hence a range of approximate Kalman filters have been developed for large systems, either by using a simplified or reduced order model [60,46,71] to propagate the covariance matrices (that is, to propagate the error statistics) or by using a reduced state space or error space [36,146,172].Many of the approaches are quite similar and most of them rely on low-rank approximations of the error covariance matrices.We discuss some methods below.

Reduced-rank Kalman filters
The singular evolutive extended (SEEK) Kalman filter algorithm (see, for example [146,27,153,129]) is one of the best known reduced rank square root (RRSQRT) filters.It is assumed that the covariance matrices P arising within the algorithm have low-rank form and can be written as P = SS T , where S ∈ R n×r with r ≪ n.The factorisation can be obtained via a truncated eigenvalue decomposition, for example.The Kalman filter equations are then rewritten using the matrices S F i and S A i , the low rank approximations of the forecast error and analysis error covariance matrix, respectively.The equation for the Kalman gain (16) becomes or, using the Sherman-Morrison Woodbury identity Note that often R i is a (block) diagonal matrix, and the latter version of the low rank Kalman gain can be computed at lower cost in r ≪ n dimensions.The analysis increment is , and therefore a linear combination of the columns of S F i .Substituting the low rank Kalman gain into (17) yields , where the inverse of the square root is taken in the lower dimensional space of dimension r.
For the forecast step, the propagation of the error covariance matrix is done via or, for nonlinear models, via the finite difference approximation where {•} ℓ refers to the ℓth column.In order to write P F i+1 = S F i+1 (S F i+1 ) T , and conserving the rank r some assumptions need to be made about Q i (see [174,27] for details), otherwise a rank reduction, for example via computing an SVD, may be required at every step in order to keep the rank of the covariance matrices small.

Algorithm 4 SEEK filter
Input: error covariance matrices R i and Q i , routines to apply model and observation operators M i and H i and their linearisations M i and H i , respectively, and observations y i for i = 0, . . ., N .Initialise the system state x F 0 and the corresponding error covariance matrix in low-rank form

. , N do
Compute low rank Kalman gain

Compute error covariance estimate P
). Compute the forecast error covariance P F i+1 = SF i+1 ( SF i+1 ) T + Q i where SF i+1 is given by (32).end for The SEEK filter, shown in Algorithm 4 is just one example of a reduced rank square root filter (RRSQRT), see [172,39].Another example, shown in Algorithm 5 is a more general version of the SEEK filter and makes sure the rank of the error covariance matrix does not increase during the iteration.Again, we drop the time index, and assume a low rank approximation of the error covariance matrix P = SS T is possible, where S ∈ R n×r with r ≪ n.The Kalman gain can then be written as and simple computations (again using the Sherman Morrison Woodbury identity), show that and using a matrix square root (of a smaller r × r matrix), one can write . Several algorithms are available for computing the matrix square root, e.g.[105,62], but often a Cholesky factorisation is used.After the analysis step the dimension of the system is reduced, keeping only r − s eigenmodes of (S A i ) T S A i , where s < r.This avoids an increase of the rank of the covariance matrix when variability through the model error (with covariance matrix Q i ) is introduced.The details Algorithm 5 Reduced rank square root filter (RRSQRT) Input: error covariance matrices R i and Q i , routines to apply model and observation operators M i and H i and their linearisations M i and H i , respectively, and observations y i for i = 0, . . ., N .Initialise the system state x F 0 and the corresponding error covariance matrix in low-rank form P F 0 = S F 0 (S F 0 ) T and decompose Q i = T i T T i , a rank s factorisation of Q i , with s < r. for i = 0, . . ., N do Compute low rank Kalman gain

Compute the low rank factor of error covariance S
. Compute an eigenvalue decomposition (or low-rank factorisation) V ΛV T = (S A i ) T S A i .Select largest r − s eigenvalues and corresponding eigenvectors Ṽ := V (:, 1 : of the RRSQRT filter are given in Algorithm 5. Reduced rank filters are cheaper to implement than the full filtering algorithm, by using low rank approximations of covariance matrices.Only r forward model integrations are necessary.However, they still use linearisations of the nonlinear operators M i and H i in order to propagate the error covariance matrices.The ensemble Kalman filter overcomes this issue and utilises the full nonlinear model to propagate the covariances.

Ensemble Kalman filter
The need to propagate a probability distribution is an important feature of the ensemble Kalman Filter (EnKF).It is also a major challenge as the propagation of large covariance matrices of size n is very expense.The ensemble Kalman filter is also a reduced rank method, as it requires the propagation and analysis of a small number of ensemble members.It was proposed in [69] (see also [106,70,24,165]) and is essentially a Monte Carlo implementation of the Bayesian update.There are plenty of variants of the ensemble Kalman filter, with the same idea behind all of them, the difference is in the implementation detail.We describe two versions briefly, the stochastic ensemble Kalman filter (EnKF) and the ensemble transform Kalman filter (ETKF), where, in the latter case, the linear algebra is performed in the ensemble subspace which is of much smaller dimension than the state space or the observation space.
Suppose we have r ensemble members, or prior samples, {x F k } r k=1 .Note that here the subscript k denotes the ensemble index, we neglect the time index in this explanation for simplicity.The forecast error covariance can be estimated using the empirical covariance and [X F ] k the kth column of the n × r matrix X F .Each of the ensemble members is then updated using ( 14): When computing the sample posterior covariance P A = X A (X A ) T it turns out this is underestimated compared to the BLUE in (15).A way around this is to perturb the observation vector, Algorithm 6 Ensemble Kalman Filter (EnKF) Input: error covariance matrices R i , routines to apply forward model and observation operators M i and H i , respectively, and observations y i for i = 0, . . ., N .Initialise the ensemble states x F k,0 where k = 1, . . ., r (via random perturbations from the initial conditions x F 0 , for example).
Compute the ensemble means Compute the rectangular normalised ensemble matrices Compute the (approximate Kalman) gain:

r. end for
Then it can be shown [8] The Kalman gain can be computed using K = X F (L F ) T (L F (L F ) T ) −1 .Moreover it can be shown that within the algorithm we only require the application of the nonlinear operator H to the ensemble members, rather than its linearised version H.
The full version of the (stochastic) EnKF is given in Algorithm 6.Note that k is the ensemble index, i is, as before, the time index.For more details on the algorithm we refer to [8,103].An error analysis for the ensemble Kalman filter analysis step was performed in [116].

Algorithm 7 Ensemble Transform Kalman Filter (ETKF)
Input: error covariance matrices R i , routines to apply forward model and observation operators M i and H i , respectively, and observations y i for i = 0, . . ., N .V ∈ R r×r an orthogonal matrix such that V 1 = 1.Initialise the ensemble states {x F k } k=1,...,r at time i = 0 and set Compute ensemble mean xF = E F 1/r and ensemble matrix Compute normalised innovation vector d = R

Update state estimate ensemble E
For the deterministic version of the ensemble Kalman filter [24], the ensemble transform Kalman filter (ETKF) which is illustrated in Algorithm 7, we again write the forecast error covariance in low rank form (33), with ensembles {x F k } r k=1 .It is then assumed that the state estimate x A is of the form x A = xF +X F w A , where w A is a vector of coefficients in the small dimensional ensemble subspace R r , and X F ∈ R n×r .Using (14), the mean of the analysis vector is given by x A = xF + K(y − H(x F )). Hence, we obtain xF + X F w A = xF + K(y − H(x F )), and with the low rank approximation P F = X F (X F ) T within the Kalman gain K, and using the Sherman-Morrison-Woodbury formula again, we derive , where L F = HX F , and hence the Kalman gain is computed in the low dimensional ensemble space.For computing the posterior ensemble covariance matrix we proceed as in the RRSQRT derivation.This yields where V is an arbitrary orthogonal matrix.To ensure that X A 1 = 0, that is, the updated perturbations are centred at x A (similar to X F 1 = 0) it is sufficient for V 1 = 1 to hold.Here 1 is the vector of all ones.The posterior ensemble is then given by Note that within the ETKF, all matrix inversions and matrix square roots are carried out in the lower dimensional ensemble subspace of dimension r.Moreover, with a small number of ensemble members, only r applications of the expensive forward model M i and the operator H i are necessary.
Using RRSQRT filters and EnKFs results in the increments being confined in a subspace spanned by the columns of the low rank matrix .There are localisation methods which overcome these issues, for details we refer to [8,142,108] and references therein.A whole range of variations of the ensemble and square root filters have been developed over the years, see the references in [8,137].
Iterative solvers within the Kalman filter As mentioned at the end of Section 2, a variational form of the Kalman filter can be formulated and state estimate and posterior covariance are given by minimum and inverse Hessian of a quadratic cost function, respectively [13].When the preconditioned CG method is used, this leads to a conjugate gradient ensemble Kalman filter, which has been discussed in [15].Low-rank approximations to covariance and inverse covariance matrices can be obtained by exploiting the connection between conjugate gradient and Lanczos iterations [14].Moreover, in a similar way, the use of limited memory BFGS [141] within the Kalman filter estimate has been investigated in [10].

Model reduction and dimension reduction approaches
Data assimilation problems are often large, in particular when the model is described by discretised timedependent partial differential equations and when large amounts of data need to be assimilated.This leads to either a large optimisation problem for variational data assimilation, or, the solution to large linear systems, eventually, for the optimisation problem and the Kalman filter.Over the years several reduction techniques have been proposed for data assimilation problems (or, inverse problems) which we will describe here briefly.
We will collect results from two different ideas.Usually the state space dimension n in data assimilation is very large.This leads to very large covariance matrices which need to be stored, manipulated and inverted, which is expensive.Therefore, one approach is the reduction of the dimension of these covariance matrices, usually by using low-rank approximations.
A second approach considers the expensive, PDE-based model operator M (or M i , i = 0, . . ., N ).In order to solve the optimisation problem in 4D-Var, or, in order to apply sampling based methods such as the ensemble Kalman filter (or, more general Markov Chain Monte Carlo methods), this model has to be evaluated many times at several points in space, which is expensive.Therefore reducing the dimension of that model operator is a second way of applying dimension reduction.
Reduced rank methods -reducing the dimension of the covariance matrix A key idea in dimension reduction for Bayesian inference is to exploit the fact that in updating the prior to the posterior, some directions in the high-dimensional state space are more important than others.In the case of Gaussian posteriors, this fact is quantified by the rate of decay of the eigenvalues of the Hessian of the data misfit [78,31].The decay leads to a low-rank approximation of the Hessian (and hence the posterior covariance).Quantitative error analysis of approximation methods for the posterior distribution in Bayesian inference have been performed in [160].
We have already explained several reduced rank approaches applied to the Kalman Filter in Section 4, which are based on reduced rank covariance matrices, so we will not repeat details here.
In recent years, a whole range of hybrid methods were developed, which combine ensemble (and hence lowrank) Kalman filters with variational data assimilation.This can be done in several ways and we point to [11] for a review of those ideas.One such method (often referred to EnVar) uses the low rank covariance matrix that arises within the ensemble Kalman filter (EnKF), and applies it within variational data assimilation as the background error covariance matrix, B = SS T , where S ∈ R n×r and r ≪ n.The resulting system has the same form as preconditioned 4D-Var, however with a low-rank version of the Hessian, and hence the optimisation problem is solved in the reduced system dimension, see [93].
As discussed above, it is known that for linear inverse problems, the inverse of the Hessian is an approximation of the posterior error covariance.From a control theoretic view point, there is an equivalence between the Hessian and the observability Gramian.Hessian based model reduction using this viewpoint was investigated in [128,16,17] In [20], the Kronecker product structure of a discretised linear partial differential operator and a low-rank Arnoldi method were used to approximate posterior error covariance matrices of Gaussian posteriors using the idea of low-rank Hessians [78,31,160].
Model order reduction applied to the forward model operator A second approach for reducing the cost within sequential and variational data assimilation is reducing the cost of the forward and adjoint models by applying reduced order models (ROMs).The aim of model order reduction (MOR) is to find models that approximate and reflect the dynamics of the underlying large-scale system accurately, in ways that enable the reduction process to be implemented efficiently.There is a large number of MOR techniques available, and many of them have been very popular in the system theoretic community [19].
The earliest reduction approaches for 4D-Var suggest using a simplified operator or a coarser grid within the minimisation of incremental 4D-Var [109,166].
In [71] balanced truncation [132], a control theoretic approach for MOR, is applied to the Kalman filter.The linearised model and observation operators M i ∈ R n×n and H i ∈ R p×n are projected onto lower dimensional subspaces, where U ∈ R n×r and V ∈ R n×r are projection matrices satisfying U T V = I, obtained through balanced truncation.Balanced truncation is a method that retains the dominant observable and reachable states, which are the important ones for the system dynamics (after transforming both state and observation equation so that reachable and observable states can be expressed in the same basis).In [71] it is assumed that the time-dependent system underlying the problem has a time-invariant dominant part on which balanced truncation is performed.MOR via balanced truncation was also proposed for incremental 4D-Var in [25,26,125,124].Model and observation operators are projected as in (34), δx is restricted to δx = U T δx ∈ R r , r ≪ n, and the background error covariance matrix, B, is projected onto U T BU .The approach was extended to weak constraint 4D-Var in [81].
In [63,152] it is assumed that the initial state x 0 in 4D-Var is contained in a space of reduced dimension r ≪ n about the background state, where c i are real coefficients and w i are linearly independent vectors containing the variability in the system.Minimisation of the reduced cost function then takes place in the reduced space of dimension r.The authors in [7] use a truncated SVD approach to solve the 4D-Var problem in reduced spaces.Proper orthogonal decomposition (POD) methods were applied to variational data assimilation by several authors [173,37,55,61].Snapshots are taken at various time steps from the model trajectory.An SVD is then taken of the matrix of snapshots X = [x 1 , . . ., x r ], that is, X = U ΣV T .Finally, x 0 is then projected onto the POD space spanned by the r most important left singular vectors, that is, the ones corresponding to the largest singular values.The number of those left singular vectors is chosen significantly smaller than the dimension of the state space.POD was applied in [3] to the adjoint in order to compute a reduced model.The work in [161] refines this work on POD by considering the discrete empirical interpolation method (DEIM) within POD for nonlinear dynamical systems.
A challenge for applying MOR techniques within data assimilation is the nonlinearity and time-dependence of the forward model operator.One idea is to use a data driven and online approach applied to the ROMs, where updates to the basis vectors of the ROMs are computed from combinations of reduced solutions, snapshots and adjoint information [144].
Dimension reduction is an important tool for Bayesian inverse problems as one often has to perform Markov chain Monte Carlo (MCMC) sampling to access the posterior distribution.However, each MCMC sample requires an expensive forward model solve.In [51,52,53], methods for dimension reduction for nonlinear Bayesian inverse problems were described.The goal of these method is to approximate the likelihood using a reduced model.In [67], POD-DEIM is successfully applied to the likelihood function in order to reduce the cost of each MCMC draw.A review of multi-fidelity methods was recently published [145].For additional ideas for dimension reduction methods within data assimilation we refer to [8] and references therein.
6 Bayesian inference and Tikhonov regularisation, and other aspects within data assimilation One aspect we have not considered in detail in this article is the link between data assimilation and Tikhonov regularisation.We refer to [82] and references therein.A key task in the Bayesian inverse problem framework is finding an informative prior.This corresponds to finding a computationally efficient penalty term in the regularisation approach, see [35,33].Most of the work in this area considers Gaussian priors and linear, but very large scale static inverse problems.In reference [41] methods based on Golub-Kahan bidiagonalisation for computing solutions based on Tikhonov regularisation are used, which avoid computing inverses and square roots of large covariance matrices.The idea was generalised to dynamic inverse problems, which also allows the use of a wide class of spatio-temporal priors [42,154].
When designing an observation or sensor network for data assimilation another important aspect not considered here is the placement of sensors.An important tool for this is computing the so-called observation impact, which provides a measure for important or redundant information.Mathematically this results in analysing the sensitivity of states with respect to the data, and eventually in a large linear system to solve.For more information and solution methods, including low-rank approaches, we refer to [169,44,43] and references therein.
Data assimilation using different regularisation terms was considered in [30,148], efficient solution techniques for such approaches require ideas from numerical linear algebra.
There is a whole range of ideas for data assimilation we have not considered here.For literature on particle filters for Bayesian inference, for example, we refer to [8,40] and references therein.

Software
Software development for data assimilation algorithms is very much driven by geoscientists.Several packages are available to download and test algorithms, these also often provide examples.We list a few here.PDAF, a software environment (written in FORTRAN) for ensemble data assimilation was developed by researchers from the Alfred Wegener Institute for Polar and Marine Research and is freely available [136].DATes is a recently developed data assimilation testing suite written in PYTHON [9].Finally, DART [5] is a FORTRAN software package developed and maintained at the National Center for Atmospheric Research.
MATLAB code, which is applied to a range of simple examples, both for Kalman filters and variational data assimilation, is available in the book by Law et al [119].Further MATLAB codes are provided which accompany the book [150].

Conclusions
In the era of "big data" it is important to be able to process and evaluate data, gain insight from data, extract knowledge from data and make predictions from data.However, making important decisions based only on data can be misleading as they might be incomplete or erroneous.Therefore, it is often crucial to combine data with mathematical models.This approach leads to the important area of data assimilation which is evolving rapidly and continuously.
Data assimilation uses tools from many different fields of mathematics, such as statistics, optimisation, machine learning, numerical linear algebra, mathematical modelling and scientific computing, to name a few.
For example, classical optimisation tools in variational data assimilation are constantly adapted to faster algorithms and new supercomputers.Variational data assimilation typically only provides a point estimate, but no uncertainty quantification.However, approximations of a-posteriori covariance matrices can be computed using efficient numerical linear algebra techniques.
On the other hand, statistical learning techniques, such as Kalman filters and ensemble methods are regularly seeing improvements.Those techniques do provide uncertainty quantification as they are merely Monte Carlo implementation of the Bayesian update.
We have seen that in the field of data assimilation, traditional computational scientific modelling meets the area of statistics and data science in order to produce new and better algorithms and methods.
In this review we have stated and explained the most established data assimilation methods, starting from the point of view of Bayesian inference.We have focused on problems arising within the numerical linear algebra for these methods, that is, the solution to linear systems, preconditioning, and matrix methods for filtering and model reduction, and thereby given a extended (but by no means complete) review of the existing literature in numerical methods for data assimilation.
With the increasing size and complexity of datasets and larger models after discretisation of partial differential equations, many data assimilation problems involve operations on matrices with millions or billions of elements.This amount of large matrices brings new computational challenges to classical numerical linear algebra algorithms, for example solving large systems of linear equations, large linear (and nonlinear) regression problems, constructing low-rank matrix approximation etc.
The efficiency of these linear algebra applications is essential for the performance of data assimilation methods.
Hence, this review shows that efficient numerical linear algebra techniques and matrix computation tools are crucial within the subject of data driven modelling and data assimilation, even more so when the data and models become even larger.

Figure 1 :
Figure 1: Illustration of 3D-Var and 4D-Var for state estimation in one dimension.

Figure 2 :
Figure 2: Illustration of sequential data assimilation for the Kalman Filter in one dimension.
the nonlinear model trajectory.H i is the linearisation of the observation operator H i about x (ℓ) i .The perturbations δx (ℓ) i satisfy the linearised constraint δx (ℓ)