Conditions for successful data assimilation

Authors

  • Alexandre J. Chorin,

    1. Department of Mathematics, University of California, Berkeley, California, USA
    2. Lawrence Berkeley National Laboratory, Berkeley, California, USA
  • Matthias Morzfeld

    Corresponding author
    1. Department of Mathematics, University of California, Berkeley, California, USA
    2. Lawrence Berkeley National Laboratory, Berkeley, California, USA
    • Corresponding author: M. Morzfeld, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd., Berkeley, CA 94720, USA. (mmo@math.lbl.gov)


Abstract

[1] We show, using idealized models, that numerical data assimilation can be successful only if an effective dimension of the problem is not excessive. This effective dimension depends on the noise in the model and the data, and in physically reasonable problems, it can be moderate even when the number of variables is huge. We then analyze several data assimilation algorithms, including particle filters and variational methods. We show that well-designed particle filters can solve most of those data assimilation problems that can be solved in principle and compare the conditions under which variational methods can succeed to the conditions required of particle filters. We also discuss the limitations of our analysis.

1 Introduction

[2] Many applications in science and engineering require that the predictions of uncertain models be updated by information from a stream of noisy data [see, e.g., Doucet et al., 2001; van Leeuwen, 2009; Bocquet et al., 2010]. The model and data jointly define a conditional probability density function (pdf) p(x0:n|z1:n), where the discrete variable n = 0, 1, 2, … can be thought of as discrete time, xn is a real m-dimensional vector to be estimated, called the "state", x0:n is shorthand for the set of vectors {x0, x1, …, xn}, and where the data zn are k-dimensional vectors (k ≤ m). All information about the state at time n is contained in this conditional pdf, and a variety of methods are available for its study, e.g., the Kalman filter [Kalman, 1960], the extended and ensemble Kalman filters [Evensen, 2006], particle filters [Doucet et al., 2001], or variational methods [Talagrand and Courtier, 1987; Bennet et al., 1993]. Given a model and data, each of these algorithms will produce a result. We are interested in the conditions under which this result is reasonable, i.e., consistent with the real-life situation one is modeling.

[3] We say that data assimilation is feasible in principle if it is possible to calculate the mean of the conditional probability density that it defines with a small-to-moderate uncertainty; we discuss what we mean by “moderate” below after we develop the appropriate tools. If data assimilation is feasible in this sense, it is possible to find an estimate of the state of a system whose distance from an outcome of the physical experiment described by the dynamics is small to moderate, with a high probability, i.e., reliable conclusions can be reached based on the results of the assimilation. We consider a data assimilation algorithm, e.g., a particle filter or a variational method, to be successful if it can produce an accurate estimate of the state of the system. A data assimilation algorithm can only be successful if data assimilation is feasible in principle. Our definition of success is in line with what is required in the physical sciences, where one wants to make reliable predictions given a model and data. We do not consider data assimilation to be successful if the posterior variance is reduced (e.g., when compared to the variance of the data) but remains large.

[4] Generally, we restrict the analysis to linear state space models driven by Gaussian noise and supplemented by a synchronous stream of data perturbed by Gaussian noise, i.e., the noisy data are available at every time step of the model and only then. We further assume that all model parameters (including the covariance matrices of the noise) are known, i.e., we consider state estimation rather than combined state and parameter estimation. We study this class of problems because it can be examined in some generality, and we can explain qualitatively its important aspects; however, we also discuss its limitations.

[5] In section 2 we derive conditions under which data assimilation is feasible in principle, without regard to a specific algorithm. We define the effective dimension of a Gaussian data assimilation problem as the Frobenius norm of the steady state posterior covariance and show that data assimilation is feasible in the sense described above only if this effective dimension is moderate. We argue that realistic problems have a moderate effective dimension.

[6] In the remainder of the paper, we discuss the conditions under which particular data assimilation algorithms can succeed in solving problems (where success is defined as above) that are solvable in principle. In section 3 we briefly review particle filters. In section 4, we use the results of Snyder [2011] to show that the optimal particle filter (which in the linear synchronous case coincides with the implicit particle filter [Atkins et al., 2013; Chorin et al., 2010; Morzfeld et al., 2012]) performs well if the problem is solvable in principle, provided that a certain balance condition is satisfied. We conclude that optimal particle filters can solve many data assimilation problems even if the number of variables to be estimated is large. Building on the results in Snyder et al. [2008], Bengtsson et al. [2008], and Bickel et al. [2008], we show that another commonly used filter, the SIR filter, fails under conditions that are frequently met. Thus, how a particle filter is implemented is very important, since a poor choice of algorithm may lead to poor performance. In section 5 we consider particle smoothing and variational data assimilation and show that these methods as well can only be successful under conditions comparable to those we found in particle filtering. We discuss limitations of our analysis in section 6 and present conclusions in section 7.

[7] The effective dimension defined in the present paper is different from the effective dimensions introduced in Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011]. The effective dimensions in Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011] are defined for particular particle filters, whereas the effective dimension defined in the present paper is a characteristic of the model and data stream, i.e., independent of the data assimilation algorithm used. We show in particular that the effective dimension (as defined in the present paper) remains moderate for realistic models, even when the state dimension is large (asymptotically infinite) and that numerical data assimilation can be successful in these cases; in particular, a moderate effective dimension in our sense can imply moderate effective dimensions in the sense of Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011] for a suitable algorithm.

2 The Effective Dimension of Linear Gaussian Data Assimilation Problems

[8] We consider autonomous, linear Gaussian state space models of the form

x^{n+1} = A x^n + w^n,   (1)

where n = 0, 1, 2, … is the discrete time, A is a given m×m matrix, and wn are independent and identically distributed (iid) Gaussian random variables with mean zero and given covariance matrix Q, which we write as wn ~ N(0, Q). The initial conditions may be random, and we assume that their pdf is also Gaussian, i.e., x0 ~ N(μ0, Σ0), with both μ0 and Σ0 given. We assume further that the data satisfy

z^n = H x^n + v^n,   (2)

where H is a given k×m matrix (k ≤ m) and the vn ~ N(0, R) are iid, where R is a given k×k covariance matrix. The wn and vn are independent of each other and also independent of x0.

[9] In principle, but not necessarily in practice, the covariance matrix Pn of the state xn conditioned on the data z1:n can be computed recursively, starting with P0 = Σ0:

P_{n+1} = (I_m − K_{n+1} H)(A P_n A^T + Q),
K_{n+1} = (A P_n A^T + Q) H^T (H (A P_n A^T + Q) H^T + R)^{−1},

where Im is the identity matrix of order m and the m×k matrix Kn is often called the "Kalman gain." This is the Kalman formalism. We assume that the pair (H,A) is d-detectable and that (A,Q) is d-stabilizable. Detectability and stabilizability can respectively be interpreted (roughly) as requiring that the observation operator be sufficiently rich to determine the dynamics and that the noise can affect the whole dynamics (see Lancaster and Rodman [1995], pp. 90–91 for technical definitions). These assumptions allow unstable dynamics, as often encountered in geophysics, but also make it possible to perform a steady state analysis because the covariance matrix reaches a steady state, so that

P_n → P = (I_m − K H) X   as n → ∞,

where X is the unique positive semi-definite solution of the discrete algebraic Riccati equation (DARE)

X = A (X − X H^T (H X H^T + R)^{−1} H X) A^T + Q,

and where

K = X H^T (H X H^T + R)^{−1}

is the “steady state” Kalman gain. Note that the steady state covariance matrix P is independent of the initial covariance matrix Σ0 and that the rate of convergence to this limit is at least linear, in many cases quadratic [see Lancaster and Rodman, 1995, p. 313]. This means that after a relatively short time, the samples of the state given the data are normally distributed with mean μn and covariance matrix P (the mean μn of the variables is not needed here, but it can also be computed using Kalman's formulas).

[10] The steady state covariance matrix P=(pij) determines the posterior uncertainty, i.e., the uncertainty that remains after the data have been taken into account. If P is "large," the uncertainty is large, which translates to a large spread of the samples in state space. We propose to measure this uncertainty by the Frobenius norm of P, ||P||F, because this norm determines the spread of the posterior samples in state space.

[11] To see this, consider the random variable y = (xn − μn)^T (xn − μn), where xn ~ N(μn, P), i.e., consider the squared distances of the samples from their mean (their most likely value). Let U be an orthogonal m×m matrix whose columns are the eigenvectors of P. Then

y = (x^n − μ^n)^T U U^T (x^n − μ^n) = s^T s = s_1² + … + s_m²,

where s = U^T (xn − μn) ~ N(0, Λ) and Λ = U^T P U is a diagonal matrix whose diagonal elements are the m eigenvalues λj of P. It is now straightforward to compute the mean and variance of y because the sj (the elements of s) are independent:

E(y) = Σ_{j=1}^m λj,   Var(y) = 2 Σ_{j=1}^m λj².

Note that y = r², where r is the distance from the sample to the most likely state (the mean). Assuming that m is large, we obtain, using a Taylor expansion of the square root of y/E(y) around 1 and assuming that λj=O(1), that

E(r) ≈ (Σ_{j=1}^m λj)^{1/2},   Var(r) ≈ (Σ_{j=1}^m λj²) / (2 Σ_{j=1}^m λj).

The techniques in Bickel et al. [2008] can be used to extend the above formulas to m → ∞ with λj=O(1), i.e., to the case for which the moments of y do not necessarily exist. We use standard inequalities to show that

(Σ_{j=1}^m λj²)^{1/2} ≤ Σ_{j=1}^m λj ≤ √m (Σ_{j=1}^m λj²)^{1/2},

and, with these, obtain bounds for E(r) and Var(r):

(||P||F)^{1/2} ≤ E(r) ≤ m^{1/4} (||P||F)^{1/2},   ||P||F/(2√m) ≤ Var(r) ≤ ||P||F/2.

The Frobenius norm of a symmetric matrix is the square root of the sum of its squared eigenvalues, i.e., ||P||F = (Σ_{j=1}^m λj²)^{1/2}. Thus, the above bounds indicate that the Frobenius norm of P determines the mean and variance of the distance of a sample from the most likely state, i.e., the spread of the samples in the state space.

[12] Based on the calculations above, we now investigate what a large posterior covariance, i.e., a large spread of posterior samples, means for data assimilation. Suppose that m is large and that λj=O(1) for j = 1, …, m; then E(r) = O(m^{1/2}) and Var(r) = O(1). This means that the samples collect on a shell of thickness O(1) at a distance O(m^{1/2}) from their mean and are distributed over a volume O(m^{(m+1)/2}), i.e., for large m, the predictions spread out over a large volume at a large distance from the most likely state. By considering both the model (1) and the data (2), one concludes that the true state is likely to be found somewhere on this shell. However, since this shell is huge, the various states on it can correspond to very different physical situations. Knowing that the state is somewhere on this shell is not satisfactory if one wants to compute a reliable estimate of the state; the uncertainties in the model and the observation error are too large.

[13] What we have shown is that data assimilation makes sense, according to our definitions, only if the Frobenius norm of the posterior steady state covariance matrix is moderate. We thus define the effective dimension of the Gaussian data assimilation problem defined by equations (1) and (2) to be this Frobenius norm:

meff = ||P||F = (Σ_{j=1}^m λj²)^{1/2}.

Data assimilation can only be successful if this effective dimension is moderate. The precise value of the effective dimension that cannot be exceeded if one wants to reach reliable conclusions varies from one problem to the next and, in particular, depends on the level of accuracy required, so that it is very difficult to pin down an upper bound for the effective dimension in general. In cases where one can interpret the data assimilation problem defined by (1) and (2) as an approximation to an infinite dimensional problem, e.g., in problems that arise from partial differential equations (PDE), our requirements imply that the effective dimension remains bounded as m → ∞. This is connected to the well posedness, stability, and accuracy of infinite dimensional Bayesian inverse problems discussed in Stuart [2010].

[14] We expect that the effective dimension is moderate in practice, since the data assimilation problem reflects an experimental situation, and we wish that the numerical samples behave like experimental samples: if the uncertainty is large, one will observe that the outcomes of repeated experiments exhibit a large spread; if the uncertainty is small, then the spread in the outcomes of the experiments is also small. Since the outcomes of repeated experiments rarely exhibit large variations, one should expect that the samples of numerical data assimilation all fall into a small "low-dimensional" ball centered around the most likely state, i.e., the radius, E(r), is comparable to the thickness, (Var(r))^{1/2} (see below).

[15] For the remainder of this section, we investigate conditions for successful data assimilation by studying conditions on the errors in the model (1), represented by the covariance matrix Q, and on the errors in the data (2), represented by the covariance matrix R, that lead to a moderate effective dimension.

[16] Finally, we point out that the effective dimension defined above is different from the effective dimensions defined in Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011], which came up in connection with specific particle filters. The effective dimension defined here is defined from the posterior pdf and thus is independent of a data assimilation technique; it is a characteristic of the model (1) and data stream (2). However, since we consider the posterior pdf of linear Gaussian data assimilation problems (for which the Kalman formalism gives the answer), our analysis is valid only for such models. We discuss the limitations of our analysis in more detail in section 6.

2.1 Bounds on the Effective Dimension

[17] To discover the real-life interpretation of the effective dimension, we study its upper bounds in terms of the Frobenius norms of Q and R. From Khinchin's theorem [see, e.g., Chorin and Hald, 2009], we know that the Frobenius norms of Q and R must be bounded as m, k → ∞ or else the energies of the noises are infinite, which is unrealistic. We show that moderate Frobenius norms of Q and R can lead to a moderate effective dimension. We start with a simple example, which is also useful in the study of data assimilation methods in later sections.

2.1.1 Example

[18] Put A=H=Im and let Q=qIm, R=rIm. The Riccati equation can be solved analytically for this example, and we find the effective dimension

meff = ||P||F = √m (√(q² + 4qr) − q)/2.

In a real-life problem, we would expect ||P||F and thus meff to grow slowly, if at all, when the number of variables increases. In fact, we have just shown that meff must be moderate or else data assimilation cannot be successful.
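As a consistency check, the closed-form expression above can be compared with the fixed point of the scalar covariance recursion; the sketch below does this for arbitrary illustrative values of q, r, and m (the numbers are hypothetical, chosen only to exercise the formula).

```python
import numpy as np

# Sketch: for A = H = I_m, Q = q I_m, R = r I_m the analysis covariance is
# p I_m with p the fixed point of  p <- (p + q) r / (p + q + r),
# and the effective dimension is m_eff = ||P||_F = sqrt(m) * p.

def scalar_steady_state(q, r, tol=1e-14):
    p = 1.0
    while True:
        p_new = (p + q) * r / (p + q + r)
        if abs(p_new - p) < tol:
            return p_new
        p = p_new

q, r, m = 0.1, 0.01, 1000               # arbitrary illustrative choices
p_iter = scalar_steady_state(q, r)
p_closed = (np.sqrt(q**2 + 4*q*r) - q) / 2.0
print("fixed point:", p_iter, " closed form:", p_closed)
print("m_eff = sqrt(m) * p =", np.sqrt(m) * p_closed)
```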

[19] The condition of moderate effective dimension induces a “balance condition” between the errors in the model (represented by q) and the errors in the data (represented by r). In this simple example, an O(1) effective dimension gives rise to the balance condition

(√(q² + 4qr) − q)/2 ≤ 1/√m,

where the 1 in the numerator of the right-hand side stands for a constant; we set this constant equal to 1 because this already captures the general behavior. The constant cannot be pinned down precisely because an acceptable level of accuracy may vary from one application to the next; the balance condition above and its generalizations below do however provide guidance as to what can be done.

[20] Figure 1 illustrates the condition for successful data assimilation and shows a plot of the function defined by the left-hand side of the above inequality, as well as three level sets corresponding to m=5, 10, 100, respectively; for a given dimension m, all values of q and r below the corresponding level set lead to an O(1) effective dimension, i.e., to a scenario in which data assimilation is feasible in principle.

Figure 1. Conditions for successful sequential data assimilation.

[21] The condition implies that for fixed m, the smaller the errors in the data (represented by r), the larger the uncertainty in the model (represented by q) can be, and vice versa. Moreover, note that for very small q, the boundaries for successful data assimilation are (almost) vertical lines. The reason is that if the model is very good, neither accurate nor inaccurate data can improve it, i.e., data assimilation is not necessary. If the model is poor, only nearly perfect data can help. We will encounter this balance condition (in more complicated forms) again in the general case in the next section and also in the analysis of particle filters and variational data assimilation.

[22] Finally, note that the Frobenius norms ||Q||F and ||R||F increase with the number of dimensions unless q or r or both decrease with m as shown in Figure 1. We will argue in section 2.2 that in realistic cases, the Frobenius norms of Q and R are moderate even if m or k are large (asymptotically infinite). We also expect, but cannot prove in general, that a balance condition as in Figure 1 is valid in the general case (arbitrary A,H,Q,R), with q and r replaced by the Frobenius norms of Q and R.

2.1.2 The General Case

[23] In the general case, the condition for successful data assimilation that must be satisfied by uncertainties in the model (||Q||F) and data (||R||F) is more complicated because the effective dimension is the Frobenius norm of the solution of a Riccati equation which in general does not admit a closed form solution.

[24] However, if the covariance matrices Q and R have moderate Frobenius norms, then the effective dimension of the problem can be moderate even if m and k are large and, thus, data assimilation can be successful. To see this, let X and P be, respectively, the solution of the DARE and the steady state covariance matrix of a given (A,Q,H,R) data assimilation problem, and consider a second problem (A,Q*,H,R*) such that Q*−Q and R*−R are symmetric positive semi-definite (SPD). Then, by the comparison theorem [Lancaster and Rodman, 1995, Theorem 13.3.1], X ≤ X*, where X* is the solution of the DARE associated with the (A,Q*,H,R*) data assimilation problem. From the Kalman formulas, we know that

P = X − X H^T (H X H^T + R)^{−1} H X = (I_m − K H) X,

which implies that P ≤ X. Moreover, for two SPD matrices C and D, C ≤ D implies ||C||F ≤ ||D||F. Thus, the smaller the Frobenius norms of Q and R, the smaller is the upper bound ||X||F on the effective dimension.

[25] However, the requirement that these Frobenius norms be moderate is not sufficient to ensure that the effective dimension of the problem is moderate; in particular, it is evident that the properties of A must play a role. For example, if the L2 norm of A exceeds unity, the model (1) is unstable, and successful data assimilation is unlikely unless the data are sufficiently rich to compensate for the instabilities [see also Stuart, 2010]. We have assumed such difficulties away by assuming the pair (H,A) to be d-detectable and (A,Q) to be d-stabilizable. However, unstable dynamics should be treated carefully and in specific cases (for nonlinear problems) as in Brett et al. [2013].

[26] While the model, or A, is implicitly accounted for in X, the solution of the DARE, one can construct sharper bounds on the effective dimension by accounting for the model (1) and data stream (2) more explicitly. To that end, we construct matrix bounds on P from matrix bounds for the solution of the DARE [Kwon et al., 1992]. Let Xl ≤ X ≤ Xu be lower and upper matrix bounds for the solution of the DARE; for example, we can choose the lower bound in Komaroff [1992]

display math

and the upper bound in Kwon et al. [1992]

display math

where X = A(η^{−1} + H^T R^{−1} H)^{−1} A^T + Q, η = f(−λ1(A) − λn(H^T R^{−1} H) λ1(Q) + 1, 2λn(H^T R^{−1} H), 2λ1(Q)), with the function f as given in Kwon et al. [1992], and λ1(C) and λn(C) are, respectively, the largest and smallest eigenvalues of the matrix C.

display math

The Frobenius norm of this upper matrix bound is an upper bound for the effective dimension.

2.2 The Real-World Interpretation of Effective Dimension

[27] We have shown that there is little hope for reaching reliable conclusions unless the effective dimension of the data assimilation problem defined by equations (1) and (2) is moderate. We now give more detail about the physical interpretation of this result.

[28] Suppose the variables x that one is estimating are point values of, for example, the velocity of a flow field (as they often are in applications). The Frobenius norm of the covariance matrix Q is proportional to the specific kinetic energy of the noise field that is perturbing an underlying flow. This energy should be a small fraction of the energy of the flow, or else there is not enough information in the model (1) to examine the flow one is interested in. We can thus assume that the Frobenius norm of Q is moderate. By the same arguments, we can assume that the Frobenius norm of R is moderate, or else the noise in the data equation overpowers the actual measurements. Since moderate Frobenius norms of Q and R often imply a moderate Frobenius norm of P, we typically are dealing with a data assimilation problem with a moderate effective dimension, even if m and k are arbitrarily large.

[29] Point values of a flow field usually come from a discretization of a stochastic differential equation. As one refines this discretization, one can expect the correlation between the errors at neighboring grid points to increase. These errors are represented by the covariance matrix Q, and from Khinchin's theorem [see, e.g., Chorin and Hald, 2009], we know that a random field with sufficiently correlated components has a finite energy density (and vice versa). This implies for the finite dimensional case that the Frobenius norm of Q does not grow without bound as we increase m.

[30] Another and perhaps even more dramatic instance of this situation is one where the random process we are interested in is smooth, so that the spectrum of its covariance matrix decays quickly [Adler, 1981; Rasmussen and Williams, 2006]. For practical purposes, one may then consider m−d of the eigenvalues to be equal to zero (rather than just very small). This is an instance of "partial noise" [Morzfeld and Chorin, 2012], i.e., the state space splits into two disjoint subspaces, one of dimension d, which contains the state variables, u, that are directly driven by Gaussian noise, and one of dimension m−d, which contains the remaining variables, v, which are (linear) functions of the random variables u. Thus, the steady state covariance matrix is of size d×d, and the effective dimension is independent of the state dimension and moderate even if m is large. Smoothness of the random perturbations may be particularly important in data assimilation for PDE (e.g., in fluid mechanics), since the PDE itself can require regularity conditions [Stuart, 2010].

[31] Note that the key to a moderate effective dimension in all of the above cases is the correlation among the errors, and indeed, the data assimilation problems derived by various practitioners and theorists show a strong correlation of the errors [see, e.g., van Leeuwen, 2003; Ganis et al., 2008; Zhang and Lu, 2004; Rasmussen and Williams, 2006; Adler, 1981; Miller and Cane, 1989; Miller et al., 1995; Richman et al., 2005; Morzfeld and Chorin, 2012; Bennet and Budgell, 1987]. The correlations are also key to the well posedness of infinite dimensional problems [Stuart, 2010], where the spectra of the covariances (which are compact operators in this case) decay; a well correlated noise model was obtained from an infinite dimensional problem in Bennet and Budgell [1987].

[32] The geometrical interpretation of this situation is as follows: because of correlations in the noise, the probability mass is concentrated on a d-dimensional manifold, regardless of the dimension m ≥ d of the state space. In addition, one must be careful that the noise in the observations not be too strong; otherwise, the data can push the probability mass away from the d-dimensional manifold (i.e., the data increase uncertainty, instead of decreasing it). This assumption is reasonable because typically the data contain information and not just noise. Similar observations were reported for infinite dimensional, strong constraint problems with low observation noise (the covariance of the error in the data goes to 0) [see Stuart, 2010, Theorem 2.5].

[33] Next, suppose that the vector x in (1) and (2) represents the components of an abstract model, with the several components representing various indicators, for example, of economic activity (so that the concept of energy is not well defined). It is unreasonable to assume that each source of error affects only one component of x. As an example of what happens when each source of error affects many components, consider a model where Gaussian sources of error are distributed with spherical symmetry in the space of the x's and have a magnitude independent of the dimension m. In an m-dimensional space, the components of a unit vector have magnitude of order O(m^{−1/2}), so that the variance of each component must decrease like m^{−1}. Thus, the covariance matrices in (1) and (2) are proportional to m^{−1} Im, and the effective dimension (for A=H=Im) is (√5 − 1)/(2√m), which is small when m is large. This is a plausible outcome, because the more data and indicators are considered, the less uncertainty there should be in the outcome (because the new indicators provide additional information).
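The arithmetic of this example is easy to reproduce; the short sketch below (an illustration that reuses the closed-form expression from the simple example of section 2.1.1) evaluates the effective dimension for A = H = Im and Q = R = m^{-1} Im and shows that it decreases as m grows.

```python
import numpy as np

# Sketch: with A = H = I_m and Q = R = (1/m) I_m (each source of error spread
# over all components), the scalar steady-state variance is
#   p = (sqrt(q^2 + 4qr) - q)/2  with  q = r = 1/m,
# so m_eff = sqrt(m) * p = (sqrt(5) - 1)/(2*sqrt(m)).

for m in (10, 100, 1000, 10000):
    q = r = 1.0 / m
    p = (np.sqrt(q**2 + 4*q*r) - q) / 2.0
    print(m, np.sqrt(m) * p, (np.sqrt(5) - 1) / (2 * np.sqrt(m)))
```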

3 Review of Particle Filters

[34] In importance sampling one generates samples from a hard-to-sample pdf p (the "target" pdf) by producing weighted samples from an easy-to-sample pdf, π, called the "importance function" [see, e.g., Kalos and Whitlock, 1986; Chorin and Hald, 2009]. Specifically, if the random variable one is interested in is x ~ p, one generates samples Xj ~ π, j = 1, …, M (we use capital letters for realizations of random variables) and weighs each by the weight

Wj = p(Xj)/π(Xj).

The weighted samples {Xj,Wj} (called particles in this context) form an empirical estimate of the target pdf p, i.e., for a smooth function u, the sum

EM(u) = Σ_{j=1}^M u(Xj) Ŵj,

where Ŵj = Wj / Σ_{i=1}^M Wi, converges almost surely to the expected value of u with respect to the pdf p as M → ∞, provided that the support of π includes the support of p.
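A minimal importance sampling sketch is given below; the one-dimensional target and importance function are hypothetical choices made only for illustration, and the self-normalized weighted average approximates the expected value of u with respect to p.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# target p = N(1, 0.5^2), importance function pi = N(0, 1); both are
# arbitrary illustrative choices
rng = np.random.default_rng(1)
M = 100_000
X = rng.normal(0.0, 1.0, size=M)                      # X_j ~ pi
W = gauss_pdf(X, 1.0, 0.5) / gauss_pdf(X, 0.0, 1.0)   # W_j = p(X_j) / pi(X_j)
W_hat = W / W.sum()                                   # self-normalized weights

u = lambda x: x ** 2
print("estimate of E_p[u(x)]:", np.sum(u(X) * W_hat))  # exact value: 0.5^2 + 1^2 = 1.25
```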

[35] Particle filters apply these ideas to the recursive formulation of the conditional pdf:

p(x0:n+1 | z1:n+1) ∝ p(x0:n | z1:n) p(xn+1 | xn) p(zn+1 | xn+1).

This requires that the importance function factorizes in the form

π(x0:n+1 | z1:n+1) = π0(x0) Π_{k=1}^{n+1} πk(xk | x0:k−1, z1:k),   (3)

where the πk are updates for the importance function. The factorization of the importance function leads to the recursion

W_j^{n+1} ∝ W_j^n p(X_j^{n+1} | X_j^n) p(z^{n+1} | X_j^{n+1}) / π_{n+1}(X_j^{n+1} | X_j^{0:n}, z^{1:n+1}),   (4)

for the weights of each of the particles, which are then scaled so that their sum equals one. Using "resampling" techniques, i.e., replacing particles with small weights by particles with large weights (see, e.g., Doucet et al. [2001] and Gordon et al. [1993] for resampling algorithms), makes it possible to set W_j^n = 1/M when one computes W_j^{n+1}. Once one has set W_j^n = 1/M but before sampling a new state at time n+1, each of the weights can be viewed as a function of the random variable X_j^{n+1} and is therefore a random variable.
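The bookkeeping described above (normalization followed by resampling) can be written in a few lines; the sketch below uses multinomial resampling, one of several standard schemes, and is an illustration rather than the algorithm of any particular reference.

```python
import numpy as np

def normalize(logw):
    """Scale the weights so that they sum to one (work with logs for stability)."""
    w = np.exp(logw - logw.max())
    return w / w.sum()

def resample(particles, weights, rng):
    """Replace low-weight particles by copies of high-weight ones; reset weights to 1/M."""
    M = len(weights)
    idx = rng.choice(M, size=M, p=weights)   # multinomial resampling
    return particles[idx], np.full(M, 1.0 / M)
```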

[36] The weights determine the efficiency of particle filters. Suppose that before the normalization and resampling step, one weight is much larger than all others; then upon rescaling of the weights, such that their sum equals one, one finds that the largest normalized weight is near 1 and all others are near 0. In this case the empirical estimate of the conditional pdf by the particles is very poor (it is a single, often unlikely point) and the particle filter is said to have collapsed. The collapse of particle filters can be studied via the variance of the logarithm of the weights, and it was argued rigorously in Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011] that a large variance of the logarithm of the weights leads to the collapse of particle filters. The choice of importance function π is critical for avoiding the collapse, and many different importance functions have been considered in the literature [see, e.g., Weir et al., 2013; Weare, 2009; Vanden-Eijnden and Weare, 2012; van Leeuwen, 2010; Ades and van Leeuwen, 2013; Chorin and Tu, 2009; Chorin et al., 2010; Morzfeld et al., 2012]. Here we follow Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011] and discuss two particle filters in detail.

3.1 The SIR Filter

[37] A natural choice for the importance function is to generate samples with the model (1), i.e., to choose πn+1=p(xn+1|xn). When a resampling step is added, the resulting filter is often called a sequential importance sampling with resampling (SIR) filter [Gordon et al., 1993] and its weights are

W_j^{n+1} ∝ W_j^n p(z^{n+1} | X_j^{n+1}).

It is known that the SIR filter collapses if the probability measure induced by the importance function πn+1=p(xn+1|xn) and the probability measure induced by the target pdf, p(zn+1|xn+1)p(xn+1|xn), have supports such that an event that has significant probability in one of them has a very small probability in the other. This can happen even in one-dimensional problems; however, the situation becomes more dramatic as the dimension m increases. A rigorous analysis of the asymptotic behavior of the weights of the SIR filter (as the number of particles and the dimension go to infinity) is given in Snyder et al. [2008], Bengtsson et al. [2008], and Bickel et al. [2008], and it is shown that the number of particles required to avoid the collapse of the SIR filter grows exponentially with the variance of the log likelihood (the logarithm of the weights).
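For the linear Gaussian model (1)-(2), one SIR update can be sketched as follows; the function is an illustration (the Cholesky factor of Q and the inverse of R are assumed to be precomputed) and would be combined with the normalization and resampling routines sketched in section 3.

```python
import numpy as np

def sir_step(particles, logw, A, H, Q_chol, R_inv, z, rng):
    """One SIR update for model (1)-(2): propose with the model,
    weight with the likelihood p(z^{n+1} | x^{n+1}). Illustrative sketch."""
    M, m = particles.shape
    particles = particles @ A.T + rng.normal(size=(M, m)) @ Q_chol.T   # x^{n+1} = A x^n + w^n
    innov = z[None, :] - particles @ H.T                               # z^{n+1} - H x^{n+1}
    logw = logw - 0.5 * np.einsum("ij,jk,ik->i", innov, R_inv, innov)  # add log-likelihood
    return particles, logw
```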

3.2 The Optimal Particle Filter

[38] One can avoid the collapse of particle filters in low-dimensional problems by choosing the importance function wisely. If one chooses an importance function π so that the weights in (4) are close to uniform, then all particles contribute equally to the empirical estimate they define. In Doucet et al. [2000], Zaritskii and Shimelevich [1975], Liu and Chen [1995], and Snyder [2011], the importance function πn+1(xn+1|x0:n, z1:n+1) = p(xn+1|xn, zn+1) is discussed, and it is shown that this importance function is "optimal" in the sense that it minimizes the variance of the weights given the data and the particle positions Xjn at time n. For that reason, a filter that uses this importance function is called the "optimal particle filter," and the optimal weights are

W_j^{n+1} ∝ W_j^n p(z^{n+1} | X_j^n).

For the class of models and data we consider, the optimal particle filter is identical to the implicit particle filter [Atkins et al., 2013; Morzfeld et al., 2012; Chorin et al., 2010]. The asymptotic behavior of the weights of the optimal particle filter was studied in Snyder [2011], and it was found that the optimal filter collapses if the variance of the logarithm of its weights is large. A connection to the collapse of the implicit particle filter (for linear Gaussian models) was made in Ades and van Leeuwen [2013].
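For the linear, synchronous model considered here, both the optimal importance function p(x^{n+1} | x^n, z^{n+1}) and the weight p(z^{n+1} | x^n) are Gaussian and can be written down explicitly; the sketch below is one way to implement a single step (an illustration under the paper's assumptions, not the authors' implementation).

```python
import numpy as np

def optimal_step(particles, logw, A, H, Q, R, z, rng):
    """One step of the optimal particle filter for the linear Gaussian model:
    sample from p(x^{n+1} | x^n, z^{n+1}) and weight by p(z^{n+1} | x^n).
    Illustrative sketch."""
    M, m = particles.shape
    Qinv, Rinv = np.linalg.inv(Q), np.linalg.inv(R)
    Sig = np.linalg.inv(Qinv + H.T @ Rinv @ H)        # proposal covariance
    L = np.linalg.cholesky(Sig)
    Sinv = np.linalg.inv(H @ Q @ H.T + R)             # inverse covariance of z given x^n
    pred = particles @ A.T                            # A x^n for each particle
    mean = (pred @ Qinv + z @ Rinv @ H) @ Sig         # proposal means (Qinv, Rinv, Sig symmetric)
    new = mean + rng.normal(size=(M, m)) @ L.T        # samples from the optimal proposal
    innov = z[None, :] - pred @ H.T                   # z^{n+1} - H A x^n
    logw = logw - 0.5 * np.einsum("ij,jk,ik->i", innov, Sinv, innov)
    return new, logw
```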

4 The Collapse and Non-Collapse of Particle Filters

[39] The conditions for the collapse have been reported in Snyder et al. [2008], Bengtsson et al. [2008], and Bickel et al. [2008] for SIR and in Snyder [2011] for the optimal particle filter; here we connect these to our analysis of effective dimension.

4.1 The Case of The Optimal Particle Filter

[40] It was shown in Snyder [2011] that the optimal particle filter collapses if the Frobenius norm of the covariance matrix of (H Q H^T + R)^{−1/2} H A x^{n−1} is large (asymptotically infinite as k → ∞). However, if this Frobenius norm is moderate, then the variance of the logarithm of the weights is also moderate, so that the optimal particle filter works just fine (i.e., it does not collapse) even if k is large. We now investigate the role that the effective dimension of section 2 plays in the collapse of the optimal particle filter.

[41] Following Snyder [2011] and assuming that the conditional pdf p(xn|z1:n) has reached steady state, i.e., that the covariance of xn−1 is P, the steady state solution of the Riccati equation, one finds that the Frobenius norm of the symmetric matrix

Σ = (H Q H^T + R)^{−1/2} H A P A^T H^T (H Q H^T + R)^{−1/2}   (5)

governs the collapse of the optimal particle filter. If the Frobenius norm of Σ is moderate, then the optimal particle filter will work, even for large m and k. A condition for successful data assimilation with the optimal particle filter is thus that the Frobenius norm of Σ is moderate. This condition induces a balance condition between the errors in the model and in the data, which must be satisfied or else the optimal particle filter will fail; the situation is analogous to what we have observed in section 2.

[42] To understand the balance condition better, we consider again the simple example of section 2, i.e., we set H=A=Im and Q=qIm, R=rIm. We already computed P in section 2 and find from (5) that

Σ = [(√(q² + 4qr) − q)/(2(q + r))] Im,

so that the balance condition becomes

√m (√(q² + 4qr) − q)/(2(q + r)) ≤ 1,

where 1 in the numerator again stands for a constant O(1), which we set equal to 1 because this already captures the general behavior. Note that for m fixed, the left-hand side depends only on the ratio of the covariances of the noise in the model and in the data so that the level sets are rays. In Figure 2 (middle), we superpose these rays, for which optimal particle filtering can be successful, with the (q,r) region in which data assimilation is feasible in principle (as computed in section 2). Figure 2 (left) shows what is, in principle, possible for comparison.

Figure 2. Conditions (left) for successful sequential data assimilation and for successful particle filtering, (center) for optimal/implicit particle filtering, and (right) for SIR filtering. The broken ellipse in Figure 2 (right) locates the area where the SIR filter works.

[43] We find that the optimal particle filter can successfully solve most of the data assimilation problems that are feasible to solve in principle (see section 2). The exceptions are problems for which q ≈ r, i.e., for which the noise in the model and the noise in the data are equally strong.

[44] Another way to see this is to set ε=q/r so that the balance condition for successful optimal particle filtering becomes

√m (√(ε² + 4ε) − ε)/(2(ε + 1)) ≤ 1,

which we solve for m and then plot the maximum dimension m as a function of the ratio of the noise in the model and the noise in the data; all values smaller than this maximum dimension are shown in Figure 3 as the light blue area.

Figure 3. Maximum dimension for two particle filters.

[45] We conclude that the optimal particle filter works for high-dimensional data assimilation problems if ε is either small or large. The case of large ε is the case typically encountered in practice. The reasons are as follows: if ε is small, then the model is very accurate. In this case, neither accurate nor inaccurate data can improve the model predictions (this case corresponds to the vertical line in Figure 2), i.e., data assimilation is unnecessary since one can simply trust the predictions of model (1). If ε is large, then the uncertainty in the data is much less than the uncertainty in the model, i.e., we can learn a lot from the data. This is the interesting case, and the optimal particle filter (or the implicit particle filter) can be expected to work in such scenarios. However, problems occur when ε≈1. We expect this case to occur infrequently because typically the data are more accurate than the model.

[46] It is however important to realize that the collapse of the optimal particle filter for ε≈1 does not imply that Monte Carlo sampling in general is not applicable in this case. Particle filtering introduces variance into the weights because of its recursive problem formulation, and this variance can be reduced by particle smoothing. The reason is as follows: the variance of the weights of the optimal particle filter depends only on the variance of the particles' positions at time n (see section 4.1), i.e., each particle is updated to time n+1 such that no additional variance is introduced (this is why this filter is called optimal); however, the particles at time n may be unlikely in view of the data at n+1 (due to accumulation of errors up until this point). In this case, one can go back and correct the past, i.e., use a particle smoother (see also section 5). However, the number of steps one needs to go back in time for successful smoothing is problem dependent, and thus, we cannot provide a full analysis here (given that we work in a restrictive linear setting, it seems more realistic to do this analysis on a case-to-case basis). In particular, it was indicated in two independent papers [Vanden-Eijnden and Weare, 2012; Weir et al., 2013] that smoothing a few steps backwards can help with making Monte Carlo sampling applicable in situations where particle filters fail or perform poorly. In Vanden-Eijnden and Weare [2012], particle smoothing for the "low-noise regime" (which is an instance of the case where ε≈1) is considered in connection with an application in oceanography. In Weir et al. [2013], particle smoothing was found to give better results than particle filtering for combined parameter and state estimation, again in connection with an application in oceanography. However, the approximations required for (optimal) particle smoothers become difficult and computationally expensive as the problems become nonlinear.

[47] Moreover, we note that the collapse of the optimal particle filter occurs when one samples the joint density p(x0:n+1|z1:n+1). The Kalman filter, on the other hand, computes the marginal p(xn+1|z1:n+1). Computing a marginal is an easier problem than computing a joint density. For example, if (Xj,Yj), for j = 1, …, M, are samples from the joint density p(x,y), then the Xj and Yj are samples from the marginals p(x) and p(y); however, the reverse is not true: given samples of the marginals, one cannot obtain samples from the joint distribution by combining them. The samples of the optimal particle filter at time n+1 are thus samples of the marginal p(xn+1|z1:n+1) and consistent with what one obtains via Kalman filtering, but the reverse is not true. Basically, in this paper we take the Kalman filter as the gold standard.

[48] In the general case (arbitrary A,H,Q,R), we can simplify the balance condition for successful particle filtering by using the upper bound for the Frobenius norm of Σ:

display math

If we require that this upper bound is less than math formula, then we find, using the upper bound

display math

that

display math

is a sufficient condition that the Frobenius norm of Σ is moderate. As in section 2, we find that the balance condition in terms of ||R||F and ||Q||F is simple in simple cases but delicate in general.

4.2 The Case of the SIR Filter

[49] The collapse of the SIR filter has been studied in Snyder et al. [2008], Bengtsson et al. [2008], and Bickel et al. [2008], and it was shown that for a properly normalized model and data equation, this collapse is governed by the Frobenius norm of the covariance of Hxn; undoing the scaling and noting that xn−1 has covariance P (the steady state solution of the Riccati equation), we find that the Frobenius norm of

Σ = R^{−1/2} H (A P A^T + Q) H^T R^{−1/2}

governs the collapse of SIR filters. If ||Σ||F is moderate, the SIR filter can work even if m or k is large. This condition induces a balance condition for the covariance matrices of the noises which must be satisfied or else the SIR filter fails. For the simple example considered earlier (A=H=Im, Q=qIm, R=rIm), this condition becomes

√m (√(q² + 4qr) + q)/(2r) ≤ 1.

For m=100, the (q,r) region for which data assimilation with an SIR filter can be successful is plotted in Figure 2 (right). We observe that this region is very small compared to the region for which data assimilation is feasible with an optimal particle filter.

[50] We can also set ε=q/r and obtain

√m (√(ε² + 4ε) + ε)/2 ≤ 1,

which we solve for m so that we can plot the maximum dimension for which SIR particle filtering can be successful as a function of the covariance ratio ε (see Figure 3). Again, we observe that the SIR particle filter can only be useful in a limited class of problems. In particular, we find that the SIR particle filter works in high-dimensional problems only if the model is very accurate (compared to the data). However, we argued before that this case is somewhat unrealistic, since we expect that the errors in the model are typically larger than the errors in the data (or else the data are not very useful, or particle filtering is unnecessary because the model is already very good). In these realistic scenarios, the SIR particle filter collapses, and we conclude that as the dimension m increases, it becomes more and more important to use the optimal importance function or a good approximation of it (see, e.g., Morzfeld et al., 2012; Weir et al., 2013; Weare, 2009; and Vanden-Eijnden and Weare, 2012 for approximations of the optimal filter).
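The two balance conditions of this section can be turned into curves like those of Figure 3 directly; the sketch below uses the example's conditions as written above (A = H = Im, Q = qIm, R = rIm, ε = q/r) and solves √m f(ε) ≤ 1 for m. It is an illustration of the calculation, with arbitrary values of ε.

```python
import numpy as np

# Sketch: maximum dimension m for which sqrt(m) * f(eps) <= 1, for the
# optimal particle filter and the SIR filter in the simple example.

def m_max_optimal(eps):
    f = (np.sqrt(eps**2 + 4*eps) - eps) / (2.0 * (eps + 1.0))
    return 1.0 / f**2

def m_max_sir(eps):
    f = (np.sqrt(eps**2 + 4*eps) + eps) / 2.0
    return 1.0 / f**2

for eps in (0.01, 0.1, 1.0, 10.0, 100.0):      # arbitrary illustrative ratios
    print(eps, m_max_optimal(eps), m_max_sir(eps))
```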

[51] In the general case, we can use an upper bound, e.g.,

display math

and if we require that this bound is less than math formula, we obtain the simplified balance condition

display math

The above condition implies that the Frobenius norm of the covariance matrix of the model noise, Q, must be much smaller than the Frobenius norm of the covariance matrix of the errors in the data, which is unrealistic.

4.3 Discussion

[52] We wish to point out differences and similarities between our work and the asymptotic studies of Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011]. Clearly, the results of Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011] are used in our analysis of the optimal particle filter (section 4.1) and the SIR filter (section 4.2). Moreover, our analysis confirms Snyder's [2011] findings that the optimal particle filter is more robust in applications with large m and k because it "dramatically reduces the required sample size" (by lowering the exponent in the relation between the number of particles and the state dimension). In Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011], it was shown that the number of particles required grows exponentially with the variance of the logarithm of the weights; the variance of the logarithm of the weights is governed by the Frobenius norms of covariance matrices (which are different for SIR and the optimal particle filter). Our main contribution is to study the connection of these Frobenius norms with the effective dimension of section 2: if the effective dimension is moderate, then these Frobenius norms can be small even if m or k is large. Thus, one can find conditions under which the SIR and optimal particle filters work. We also explain the physical interpretation of our results and conclude that the optimal/implicit particle filter can work for many realistic and large-dimensional problems.

5 Particle Smoothing and Variational Data Assimilation

[53] We now consider the role of the effective dimension in particle smoothing and variational data assimilation. The idea here is to replace the step-by-step construction of the conditional pdf in a particle filter (or Kalman filter) by direct sampling of the full pdf p(x0:n|z1:n), i.e., all available data are assimilated in one sweep. Particle smoothers apply importance sampling to obtain weighted samples from this pdf, and in variational data assimilation, one estimates the state of the system by the mode of this pdf.

[54] It is clear that either method can only be successful if the Frobenius norm of the covariance matrix of the variables conditioned on the data is moderate (even if m or k are large) or else the samples of numerical or physical experiments collect on a thin shell far from the most likely state (to obtain this result, one has to repeat the steps in section 2). We now determine the conditions under which this Frobenius norm is moderate. As is customary in data assimilation, we distinguish between the “strong constraint” and “weak constraint” problem.

5.1 The Strong Constraint Problem

[55] In the strong constraint problem, one considers a “perfect model,” i.e., the model errors are neglected, and we set Q=0 [see, e.g., Talagrand and Courtier, 1987]. Since the initial conditions determine the state trajectory, the goal is to obtain initial conditions that are compatible with the data, i.e., we are interested in the pdf

p(x0 | z1, …, zn) ∝ p(x0) Π_{k=1}^n p(zk | x0).

Straightforward calculation shows that this pdf is Gaussian (under our assumptions), and its covariance matrix is

Σ = (Σ0^{−1} + Σ_{k=1}^n (H A^k)^T R^{−1} (H A^k))^{−1}.

As explained above, successful data assimilation for the Gaussian model requires that the Frobenius norm of Σ is moderate so that the samples collect on a small- and low-dimensional ball, close to the most likely state. The condition for successful data assimilation is a moderate ||Σ||F, which in turn induces a condition between the errors in the prior (represented by Σ0) and the data (represented by R), which can be satisfied even if m and k are large. The situation is analogous to the balance conditions we encountered before in sequential data assimilation.

[56] We illustrate the balance condition for the strong constraint problem by considering a version of the simple example we used earlier, i.e., we set A=H=Im, Q=0, R=rIm, and in addition, n=1, Σ0=σ0Im. In this case, we can compute Σ and its Frobenius norm:

Σ = (σ0 r/(σ0 + r)) Im,   ||Σ||F = √m σ0 r/(σ0 + r).

Figure 4 shows the values of r and σ0 which lead to an O(1) Frobenius norm of Σ. Three level sets indicate the state dimensions m=10, 100, and 1000; for a given state dimension, the values of r and σ0 below the corresponding curve lead to an O(1) value of ||Σ||F. We observe that for a fixed m, a larger error in the prior knowledge (corresponding to larger values of σ0) can be tolerated if the error in the data is very small (corresponding to small values of r) and vice versa. Similar observations were made in Haben et al. [2011a, 2011b] in connection with the condition number in 3-D Var. Moreover, our analysis confirms what we know from the infinite dimensional problem [Stuart, 2010]: as the error in the observations (r) goes to zero, the prior (σ0) plays no role; however, the prior remains important for small but nonzero observation noise.

Figure 4. Conditions for successful data assimilation (strong constraint).

[57] Variational data assimilation (strong 4-D Var) represents the conditional pdf by its mode, i.e., by a single point in the state space. The smaller the ball on which the samples collect (i.e., the smaller the Frobenius norm of Σ), the more applicable strong 4-D Var is. Particle smoothers, on the other hand, construct an empirical estimate of the pdf via sampling. Under our assumptions, we can construct an optimal particle smoother (minimum variance in the weights) by directly sampling the Gaussian posterior pdf (the weights of the particle smoother have zero, thus minimum, variance). We conclude that under realistic conditions (moderate ||Σ||F), the optimal particle smoother can be expected to perform well, even if m or k is large, because it can efficiently represent the pdf one is interested in.
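For the linear Gaussian strong constraint problem, the posterior p(x0 | z1, …, zn) can be assembled explicitly, and an optimal particle smoother then amounts to drawing equally weighted samples from it. The sketch below is an illustration of this construction under the assumptions stated above; the function and variable names are ours, not the authors'.

```python
import numpy as np

def strong_constraint_posterior(A, H, R, mu0, Sigma0, zs):
    """Gaussian posterior p(x^0 | z^1, ..., z^n) for the perfect model (Q = 0):
    z^k = H A^k x^0 + v^k.  Illustrative sketch."""
    m = A.shape[0]
    Rinv = np.linalg.inv(R)
    prec = np.linalg.inv(Sigma0)       # posterior precision, built up term by term
    rhs = prec @ mu0
    Ak = np.eye(m)
    for z in zs:                       # zs = [z^1, ..., z^n]
        Ak = A @ Ak                    # A^k
        G = H @ Ak                     # observation operator for z^k
        prec += G.T @ Rinv @ G
        rhs += G.T @ Rinv @ z
    Sigma = np.linalg.inv(prec)
    mu = Sigma @ rhs
    return mu, Sigma

# an "optimal particle smoother" then draws equally weighted samples directly:
#   samples = rng.multivariate_normal(mu, Sigma, size=M)
```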

[58] The situation is different for other particle smoothers. Consider, for example, the SIR-like particle smoother that uses p(x0) as its importance function. This smoother produces weights whose negative logarithm is given by

−log W_j = (1/2) Σ_{k=1}^n (z^k − H A^k X_j^0)^T R^{−1} (z^k − H A^k X_j^0) + const.

For n=1, the variance of these weights depends on the Frobenius norm of the matrix H A Σ0 A^T H^T R^{−1}, which has the upper bound

display math

If we require that this upper bound is less than math formula, then we obtain (using math formula) the condition

display math

which implies that the errors before we collect the data must be smaller than the errors in the data, which is unrealistic. In particular, for the simple example considered above, we find that the condition becomes √m σ0/r ≤ 1, i.e., σ0 must be much smaller than r. We conclude that as in particle filtering, particle smoothing is possible under realistic conditions only if the importance function is chosen carefully.

[59] Note that the results we obtained here are different from those we would obtain by simply putting Q=0 in the Kalman filter formulas of section 2. It is easy to show that for Q=0, the steady state covariance matrix converges to the zero matrix, provided that the dynamics are stable. What this means is that with enough data, one can wait for steady state and then accurately estimate the state at large n. What we have done in this section is to consider the consequences of having access to only a finite data set, i.e., of making predictions before steady state is reached.

[60] Finally, note that in contrast to the sequential problem, the minimum variance of the weights of the smoothing problem is zero, whereas particle filters always produce nonzero variance weights. This variance is induced by the factorization of the importance function π, and since this factorization is not required in particle smoothing, this source of variance can disappear (or be reduced) by a clever choice of importance functions. As indicated in section 4.1, the reason for the reduction in variance of the weights is that the data at time n may render the particle positions at time n−1 unlikely; the smoother can make use of this information while the filter cannot, since it is "blind" toward the future. However, as the data sets get larger (and one eventually runs out of memory), one will have to assimilate the data in more than one sweep, thus inducing additional variance. Ultimately, smoothing as many data sets at a time as feasible cannot be a (complete) solution to the data assimilation problem.

5.2 The Weak Constraint Problem

[61] In the weak constraint problem [see, e.g., Bennet et al., 1993], one is interested in estimating the full state trajectory given the data, i.e., in the pdf

p(x0:n | z1:n) ∝ p(x0) Π_{k=1}^n p(xk | xk−1) p(zk | xk).

An easy calculation reveals that this pdf is Gaussian, and its covariance matrix is

display math

By the same arguments as before, successful data assimilation requires that the Frobenius norm of Σ is moderate. This condition implies (again) a delicate balance condition between the errors in the prior knowledge (||Σ0||F), the errors in the model (1) (||Q||F), and the errors in the data (2) (||R||F). If this condition is satisfied, data assimilation is possible even if m or k is large.
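The covariance matrix of the weak constraint posterior can be obtained by assembling the joint (block tridiagonal) precision matrix of x0, …, xn and inverting it; the following sketch illustrates this under the Gaussian assumptions of the paper. It is a brute-force construction intended only to make the structure explicit; large problems would exploit the sparsity instead.

```python
import numpy as np

def weak_constraint_covariance(A, H, Q, R, Sigma0, n):
    """Covariance of the Gaussian pdf p(x^{0:n} | z^{1:n}) for model (1)-(2):
    assemble the block-tridiagonal precision matrix and invert it. Sketch."""
    m = A.shape[0]
    Qinv, Rinv = np.linalg.inv(Q), np.linalg.inv(R)
    S0inv = np.linalg.inv(Sigma0)
    N = (n + 1) * m
    prec = np.zeros((N, N))
    prec[:m, :m] = S0inv                     # prior on x^0
    HRH = H.T @ Rinv @ H
    for k in range(n):                       # model term (x^{k+1} - A x^k)
        i, j = k * m, (k + 1) * m
        prec[i:i+m, i:i+m] += A.T @ Qinv @ A
        prec[j:j+m, j:j+m] += Qinv + HRH     # data term for z^{k+1}
        prec[i:i+m, j:j+m] += -A.T @ Qinv
        prec[j:j+m, i:i+m] += -Qinv @ A
    return np.linalg.inv(prec)

# ||Sigma||_F then plays the same role as the effective dimension above:
#   np.linalg.norm(weak_constraint_covariance(A, H, Q, R, Sigma0, n), ord="fro")
```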

[62] As in the strong constraint problem, variational data assimilation (weak 4-D Var) represents the conditional pdf by its mode (a single point), and the smaller the Frobenius norm of Σ, the more applicable this approximation is. An optimal particle smoother can be constructed for this problem by sampling the Gaussian conditional pdf directly (zero variance weights). For the same reasons as in the previous section, we can expect an optimal particle smoother to perform well under realistic conditions, but we can also expect difficulties if the choice of importance function is poor.

6 Limitations of the Analysis

[63] We wish to point out limitations of the analysis above. To find the conditions for successful data assimilation, we study the conditional pdf and we rely on the Kalman formalism to compute it. Since the Kalman formalism is only applicable to linear Gaussian problems, our results are at best indicative of the general nonlinear/non-Gaussian case. However, we believe that the general idea that the probability mass must concentrate on a low-dimensional manifold holds in the nonlinear case as well. Since Khinchin's theorem is independent of our linearity assumption and since we expect that correlations among the errors also occur in nonlinear models, one can speculate that the probability mass does collect on a low-dimensional manifold (under realistic assumptions on the noise). However, finding (or describing) this manifold in general becomes difficult and is perhaps best done on a case-to-case basis so that special features of the model at hand can be exploited.

[64] We have further assumed that all model parameters, including the covariances of the errors in the model and data equations, are known. If these must be estimated simultaneously (combined parameter and state estimation), then the situation becomes far more difficult, even in the case of a linear model equation (1) and data stream (2). It seems reasonable that estimating parameters using data at several consecutive time points (as is done implicitly in some versions of variational data assimilation or particle smoothing) would help with the parameter estimation problem and perhaps even with model specification.

[65] Concerning particle filters, we have examined in detail only two choices of importance function: the one in SIR, where the samples are chosen independently of the data, and at the other extreme, one where the choice of samples depends strongly on the data. There is a large literature on importance functions [see Weir et al., 2013; Doucet et al., 2000; Weare, 2009; Vanden-Eijnden and Weare, 2012; van Leeuwen, 2010; Ades and van Leeuwen, 2013; Chorin and Tu, 2009; Morzfeld et al., 2012; Chorin et al., 2010]; it is quite possible that other choices can outperform the optimal/implicit particle filter even in the present linear synchronous case once computational costs are taken into account. In nonlinear problems, the optimal particle filter is hard to implement and the implicit particle filter is suboptimal, so further analysis may be needed to see what is optimal in each particular case (see also Weare [2009] and Vanden-Eijnden and Weare [2012] for approximations of the optimal filter).

[66] More broadly, the analysis of particle filters in the present paper is not robust as assumptions change. For example, if the model noise is multiplicative (i.e., the covariance matrices are state dependent), then our analysis does not hold, not even for the linear case. Moreover, the optimal particle filter becomes very difficult to implement, whereas the SIR filter remains easy to use. Similarly, if model parameters (the elements of A or the covariances Q and R) are not known, simultaneous state and parameter estimation using an optimal particle filter becomes difficult, but SIR, again, remains easy to use. While the filters may not collapse in these cases, they may give a poor prediction. The existence of such important departures is confirmed by the fact that the ensemble Kalman filter in the “perturbed observations” implementation [Evensen, 2006] and the square root filter [Tippet et al., 2003] differ substantially in their performance if the effects of nonlinearity are severe [Lei et al., 2010]. However, our analysis indicates that if (1) and (2) hold, the ensemble Kalman filter, the Kalman filter, and the optimal particle filter are equivalent in the non-collapse region of the optimal filter.

[67] Similarly, variational data assimilation or particle smoothing can be successful if (1) and (2) hold. We expect that variational data assimilation and particle smoothing can be successful in the nonlinear case, provided that the probability mass concentrates on a low-dimensional manifold. In particular, particle smoothing has the potential of extending the applicability of Monte Carlo sampling to data assimilation, since the variance of weights due to the sequential problem formulation in particle filters is reduced (the data at time 2 may label what one thought was likely at time 1 as unlikely). This statement is perhaps corroborated by the success of variational data assimilation in numerical weather prediction. However, the number of observations that should be assimilated per sweep depends on the various and competing time scales of the problem and, therefore, must be found on a case-to-case basis.

[68] Finally, it should be pointed out that we assumed throughout the paper that the model and data equations are “good,” i.e., that the model and data equations are capable of describing the physical situation one is interested in. It seems difficult in theory and practice to study the case where the model and data equations are incompatible with the data one has collected (although this would be more interesting). For example, it is unclear to us what happens if the covariances of the errors in the model and data equations are systematically underestimated or overestimated, i.e., if the various data assimilation algorithms work with the “wrong” covariances.

7 Conclusions

[69] We have investigated the conditions under which data assimilation can be successful, according to a criterion motivated by physical considerations, regardless of the algorithm used to do the assimilation. We quantified these conditions by defining an effective dimension of a Gaussian data assimilation problem and have shown that this effective dimension must be moderate or else one cannot reach reliable conclusions about the process one is modeling, even when the linear model is completely correct. This condition for successful data assimilation induces a balance condition for the errors in the model and data. This balance condition is often satisfied for realistic models, i.e., the effective dimension is moderate, even if the state dimension is large.

[70] The analysis was carried out in the linear synchronous case, where it can be done in some generality; we believe that this analysis captures the main features of the general case, but we have also discussed the limitations of the analysis.

[71] Building on the results of Snyder et al. [2008], Bengtsson et al. [2008], Bickel et al. [2008], and Snyder [2011], we studied the effects of the effective dimension on particle filters in two instances; one in which the importance function is based on the model alone, and one in which it is based on both the model and the data. We have three main conclusions:

  1. [72] The stability (i.e., non-collapse of weights) in particle filtering depends on the effective dimension of the problem. Particle filters can work well if the effective dimension is moderate even if the true dimension is large (which we expect to happen often in practice).

  2. [73] A suitable choice of importance function is essential or else particle filtering fails even when data assimilation is feasible in principle with a sequential algorithm.

  3. [74] There is a parameter range in which the model noise and the observation noise are roughly comparable and in which even the optimal particle filter collapses, even under ideal circumstances.

[75] We have then studied the role of the effective dimension in variational data assimilation and particle smoothing, for both the weak and strong constraint problem. It was found that these methods too require a moderate effective dimension, or else no accurate predictions can be expected. Moreover, variational data assimilation or particle smoothing may be applicable in the parameter range where particle filtering fails, because the use of more than one consecutive data set helps reduce the variance which is responsible for the collapse of the filters.

[76] These conclusions are predicated on the linearity of the model and data equations and on the assumption that the generative and data models are close enough to reality.

Acknowledgments

[77] We thank P. Bickel of UC Berkeley for many interesting discussions, for making our thoughts more rigorous (where possible), and for helping us recognize the limitations of our analysis. We thank R. Miller of Oregon State University for very helpful discussions and help with the literature. We thank J. Weare for an interesting discussion. This work was supported in part by the Director, Office of Science, Computational and Technology Research, U.S. Department of Energy under contract DE-AC02-05CH11231 and by the National Science Foundation under grant DMS-1217065.
