Bayesian design of control space for optimal assimilation of observations. Part I: Consistent multiscale formalism

Authors

  • M. Bocquet (corresponding author)
    1. Université Paris-Est, CEREA, Joint laboratory École des Ponts ParisTech and EDF R&D, Champs-sur-Marne, France
    2. INRIA, Paris Rocquencourt Research Centre, France
    Correspondence to: CEREA, École des Ponts ParisTech, 6–8 avenue Blaise Pascal, Cité Descartes Champs-sur-Marne, 77455 Marne la Vallée Cedex, France.
  • L. Wu
    1. Université Paris-Est, CEREA, Joint laboratory École des Ponts ParisTech and EDF R&D, Champs-sur-Marne, France
    2. INRIA, Paris Rocquencourt Research Centre, France
  • F. Chevallier
    1. Laboratoire des Sciences du Climat et de l'Environnement/IPSL, CEA-CNRS-UVSQ, Gif-sur-Yvette, France

Abstract

In geophysical data assimilation, the control space is by definition the set of parameters that are estimated through the assimilation of observations. It has recently been proposed to design the discretization of control space so as to assimilate observations optimally. The present paper describes the embedding of that formalism in a consistent Bayesian framework. General background errors are now accounted for. Scale-dependent errors, such as aggregation errors (which lead to representativeness errors), are consistently introduced. The optimal adaptive discretizations of control space optimize a criterion over a dictionary of grids. New criteria are proposed: the degrees of freedom for the signal (DFS), built on the averaging kernel operator, and an observation-dependent criterion.

These concepts and results are applied to atmospheric transport of pollutants. The algorithms are tested on the European tracer experiment (ETEX), and on a prototype of CO2 flux inversion over Europe using a simplified CarboEurope-IP network. New types of adaptive discretization of control space are tested such as quaternary trees or factorised trees. Quaternary trees are proven to be both economical, in terms of storage and CPU time, and efficient on the test cases. This sets the path for the application of this methodology to high-dimensional and noisy geophysical systems. Part II of this article will develop asymptotic solutions for the design of control space representations that are obtained analytically and are contenders to exact numerical optimizations. Copyright © 2011 Royal Meteorological Society

1. Introduction

1.1. The resolution issue

Researchers using inverse modelling techniques in atmospheric chemistry have faced the so-called ‘resolution problem’.

A first example is given by gridded emission inventories, which are multidimensional fields and key components of the models. Unfortunately, the uncertainty of these fields is quite high (of the order of 40% for the ozone precursors in air quality at continental scale, for instance). Observations could help to constrain the emission fields through inverse modelling and reduce this uncertainty; see Elbern et al. (2007) for an application to the precursors of ozone, or Davoine and Bocquet (2007) for an application to an accidental release of radionuclides. Both the model equations and the control space of the emission field need to be discretised at some predefined space and time resolution. The space and time resolutions of the discretised control space are not necessarily the same as those of state space. There is a non-trivial choice of resolution to be made. Furthermore, inventories are built at a given resolution, the model runs at another, and the data assimilation scheme injects the information of the observations into the system at still another scale, depending on the nature of the instruments: ground-based, satellite, radar, lidar, etc. Therefore the system should ideally be considered multiscale.

Another example pertains to the inverse modelling of greenhouse gases. Early carbon flux inversions relied on a partition of the globe (the control space of fluxes) into about 20 sub-domains representing several types of continental or oceanic exchange with the atmosphere, with an annual or monthly time resolution (e.g. Fan et al., 1998; Bousquet et al., 2000). This was necessary because of the limited computational power, together with a limited number of precise observations of CO2 concentration. However, such gross partitioning led to severe aggregation errors (Trampert and Snieder, 1996; Kaminski et al., 2001). Thus it is tempting to increase the space and time resolutions of control space. But the total number of variables could then dramatically exceed the total number of observations. Besides, because of the nature of transport and dispersion, the inverse modelling problem is ill-posed. Therefore a regularisation is needed (Rödenbeck et al., 2003), which can be written as a Tikhonov regularising term, as is usually done in geophysical data assimilation. This regularisation, which spatially and temporally correlates the errors, may stem from real physical correlations due, for instance, to similar ecosystems (Chevallier et al., 2006). But it may also be artificial and correspond to a smooth aggregate of variables. Note that this distinction is not always made clear in the literature.

In both cases, there is a difficult choice to be made on the resolution of control space. To make the problem worse, Bocquet (2005) has shown that, for atmospheric dispersion problems, the estimation of an atmospheric pollutant source from inversions using pointwise measurements depends strongly on the control space resolution, even when a proper classical Tikhonov regularisation is used (a background-error term of quadratic form in the cost function).

1.2. Multiscale approach

To partially solve this resolution issue, a multiscale framework for such inversions was proposed (Bocquet, 2009). It is at the crossroads between a coarse partitioning of control space subject to aggregation errors and a highly resolved control space where regularisation is decisive. The method consists of constructing an adaptive grid of control space (also called a representation of control space in the following). This adaptive grid is optimal in the sense that it is designed to optimally capture the information carried by the observations and injected into control space through a model and the assimilation system. This is achieved by maximizing an objective function, which measures the reduction of uncertainty granted by the observations, over the space of all potential adaptive grids (later called a dictionary or class).

The method quantifies how observational information is propagated into control space. It diagnoses poorly observed areas. It informs how space- and time-scales should be related for the problem at hand. It also has strong algorithmic implications. Indeed, the method shows how to devise adaptive grids of control space that have significantly fewer grid cells than the original finest regular grid, but that can still capture most of the information content of the observations. Such an adaptive grid was built and tested on the European Tracer Experiment (ETEX; Nodop et al., 1998). The inversion of the source term of this dispersion event was performed much faster, with an optimization over about 100 times fewer independent variables in this adaptive grid, and with results very similar to those obtained with a regular fine grid.

The method also offers a starting point for a general conceptual and mathematical framework for multiscale data assimilation in atmospheric chemistry, or in other areas of geophysics.

This two-part article aims to extend this formalism, improve its potential, and prepare for large-scale applications. The first part explores a few essential questions that have remained unanswered, such as:

  • Can the Bayesian approach that is currently used in geophysical data assimilation be made consistent with the multiscale framework of the method?

  • Can a non-diagonal background-error covariance matrix be taken into account in this formalism? Such matrices are often used in air quality, greenhouse gas flux inversions and, more generally, in data assimilation schemes for geophysical forecasting systems.

  • Can scale-dependent errors be accounted for?

  • Can one use other grid optimization objective functions, such as DFS, or observation-dependent criteria?

  • Can one perform the optimization within a simpler or more economical dictionary of adaptive grids than the so-called tiling dictionary introduced by Bocquet (2009)?

The results are obtained with a view to applications in atmospheric chemistry and air quality, but most of the findings are more general and could be applied outside this scope whenever the choice of control space is complex and decisive.

1.3. Outline

The conceptual and mathematical framework will be presented in section 2. The multiscale description of control space is made consistent with the assimilation of observations using Bayesian principles.

Section 3 deals with errors which may enter the inversions, and which are scale-dependent. Of particular interest are the aggregation errors occurring when grid cells are merged. They lead to representativeness errors.

The construction of optimal representations of control space requires the definition of a criterion that ranks adaptive grids in a given dictionary of representations. In addition to the so-called Fisher criterion introduced by Bocquet (2009), we add two new criteria in section 4. One is based on the DFS, which measures the theoretical information gain in the analysis. A third criterion is defined with an objective function that depends not only on the prior statistics but also on the observations themselves.

In section 5, most of the developments will be illustrated on two test cases: the ETEX-I dispersion event using real measurements and realistic physics (from a chemistry and transport model), and another demonstration case based on a simplified European CO2 network (CarboEurope-IP).

In section 6, the dictionary of tilings is compared to a dictionary of quaternary tree structures (later called qtrees). Although suggested in Bocquet (2009), the quaternary tree structure was not tested and studied there.

Finally, in section 7, we summarise the results. We discuss the connection of this formalism with other multiscale formalisms introduced very recently in data assimilation. We also discuss the scope of the method and its extension to nonlinear models. Elements that justify the need for Part II of this work (Bocquet et al., 2011) are explained.

2. Multiscale modelling

This section extends the multiscale approach developed in Bocquet (2009). It goes further on several points, and unifies the concepts using a Bayesian methodology.

2.1. Data assimilation context

A simplified, typical data assimilation set-up is employed. For the data assimilation problem at hand, the control space, whose domain is denoted Ω, is discretised into the cells of a grid ω. This grid can be regular (grid cells of equal size in a given system of coordinates) or not. It may have several space dimensions, and possibly one time dimension. For instance, in atmospheric chemistry inverse modelling, the control space is often the vector space of emission gas fluxes from the ground, at any time. Therefore there are two space dimensions plus the time dimension (2D+T). A vector representing a discretised flux or source field is denoted σ.

A clear distinction is made between control space and state space, which could be discretised at different resolutions, although they could share a subspace or even be identical.

We assume that the measurement vector μ is related to the source σ through a Jacobian matrix H, which stands for both the system's evolution model and the observation operator. It could result from the linearisation of models, but for simplicity we shall hypothesise that the models are linear. The equation that links the observation to the source via the models reads

μ = H σ + ε,    (1)

where ε is the vector of errors (of any type). Note that space and time are not split up in this simplified equation, so that H links elements in space and time. We assume that σ follows a Gaussian prior probability density function (pdf): σ ∼ N(σ^b, B), where σ^b is the first guess, and B the background-error covariance matrix. The errors are supposed to be unbiased and to follow a normal pdf: ε ∼ N(0, R), where R is the observational-error covariance matrix. We shall designate by σ^a and P^a the analysed source and the analysis-error covariance matrix respectively. Both result from a data assimilation or inverse modelling analysis.

2.2. Multiscale framework

The control space discretization is discussed now, as well as the way to define a multiscale Jacobian.

2.2.1. Multiscale mesh

A multiscale mesh is defined first. It is assumed that the domain Ω is discretised into a fine-resolution regular grid, which represents the finest available discretization. The number of grid cells in this grid is N_fg. Grid cells at coarser scales will be obtained by dyadic coarse-grainings of cells in the finest grid, i.e. two grid cells that are adjacent along one direction can be merged into a coarser cell (hence the adjective dyadic, which qualifies this binary grouping). The dyadic coarse-grainings can be performed in each space or time direction of the domain Ω. The number of coarse-grainings in each direction is limited by the number of accessible scales, denoted n_x, n_y and n_t for a 2D+T domain (ETEX-I case), or n_x and n_y for a 2D domain (simplified CarboEurope-IP case). In the ETEX-I case, and similarly in the simplified CarboEurope-IP case, each coarse-grained cell has an intrinsic scale vector of integers

l = (l_x, l_y, l_t),

where 0 ≤ l_x < n_x, 0 ≤ l_y < n_y, and 0 ≤ l_t < n_t. The scale levels are set by l_x, l_y and l_t. For each direction, label 0 corresponds to the finest scale. For instance, the cells in the finest regular grid all share the same scale vector l = (0,0,0).

2.2.2. Multiscale Jacobian

Correspondingly, the Jacobian defined in Eq. (1) is generalized to a multiscale Jacobian. H is usually computed in the finest regular grid, using either direct forward simulations or backward adjoint simulations. Then dyadic coarse-grainings of the Jacobian are performed by simple averaging. Note that the multiscale Jacobian could also be defined using several Jacobians obtained at different scales from different models, or different versions of the same core model.
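As an illustration, here is a minimal NumPy sketch (function names ours, not from the paper) of one dyadic coarse-graining step by simple averaging, applied to a gridded field and to the columns of the corresponding Jacobian. Repeated along each direction, it generates the whole multiscale hierarchy. Whether coarse Jacobian columns should be averages or sums of the fine columns depends on the convention adopted for the prolongation; simple averaging is used here, following the text.

    import numpy as np

    def coarsen_field(sigma, axis):
        # One dyadic step: average adjacent pairs of cells along `axis`.
        s = np.moveaxis(sigma, axis, 0)
        s = 0.5 * (s[0::2] + s[1::2])
        return np.moveaxis(s, 0, axis)

    def coarsen_jacobian(H, grid_shape, axis):
        # Apply the same pairwise averaging to the columns of H, which are
        # laid out on a grid of shape `grid_shape`.
        p = H.shape[0]
        cols = H.reshape((p,) + grid_shape)
        cols = np.moveaxis(cols, 1 + axis, 1)
        cols = 0.5 * (cols[:, 0::2] + cols[:, 1::2])
        cols = np.moveaxis(cols, 1, 1 + axis)
        return cols.reshape(p, -1), cols.shape[1:]

For a 2D+T field, applying coarsen_field successively along the three axes yields a cell's ancestors at all scale vectors l.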

2.2.3. Adaptive representations

Using this multiscale framework, one can build representations (adaptive grids) of Ω. A representation ω is a set of cells of many sizes (depending on the scale of each cell) that cover Ω. A set, or dictionary, of representations will generically be denoted 𝒟(Ω). Besides, a representation will be called admissible if it is a strict partition of Ω, that is, if a single grid cell corresponds to each point of Ω.

Several kinds of multiscale structures were contemplated in Bocquet (2009). In each case, successive time coarse-grainings were represented by a binary tree. 2D space could be considered as the tensor product of two binary trees, one for each space direction. This means that the grid cells, or tiles, of such a representation are the Kronecker products of two 1D elements of binary trees, one for each direction. This led to the so-called tiling representations.

In the case of two directions of space, one could use instead a quaternary tree, called qtree later. This means that each mother tile can be refined into four daughter tiles, instead of two. This reduces the space occupied by the multiscale Jacobian at the expense of a smaller (therefore less rich) dictionary 𝒟(Ω). Note that the dictionary of qtrees is included in the dictionary of tilings (any qtree is a tiling). In Figure 1, a 2D tiling made of the tensorial product of two binary trees, one for each space direction, is plotted. An example of a qtree is also shown. A third type of representation, which is the direct product of two binary trees (called factorised tree or ftree later), is also displayed.

Figure 1.

Schematic illustration of three types of structure, in two dimensions. (a) shows the tensorial product structure of two binary trees, one for Ox, one for Oy, producing a collection of 2D tilings. (b) shows the direct product structure of two binary trees (or ftree), one for Ox, one for Oy. (c) shows a quaternary tree structure, or qtree. In each case, an example of a generated grid is drawn.

2.3. Restriction and prolongation

To climb up or down the scales of the multiscale ladder, one needs to define a restriction operator, which tells how a source is coarse-grained, and a prolongation operator, which tells how the source is refined through the scales. Rodgers (2000) gives an in-depth discussion of the topic.

First, let us consider the restriction operator. Assume σ is a source vector which is known in the finest regular grid. Let ω be an adaptive representation of a dictionary 𝒟(Ω). The coarse-graining of σ in ω is defined by σ_ω = Γ_ω σ, where Γ_ω stands for the coarse-graining operator. This operator is supposed to be unambiguously defined. In most of the article, we suppose it identifies with simple averaging. But the formalism does not rule out more complex coarse-graining, with an associated prolongation operator given by a spline interpolation, or model-specific coarser Jacobians.

A source can also be refined thanks to a prolongation operator Γ*_ω, which refines σ_ω into σ = Γ*_ω σ_ω. This operator is ambiguous, since additional information is needed to reconstruct a source at a higher resolution. One possible choice, which we shall call the deterministic one, is to set Γ*_ω = Γ^T_ω (Γ_ω Γ^T_ω)^{-1}; for the averaging Γ_ω, this assigns to each fine grid cell the value of the coarse cell that contains it. A schematic of the use of the restriction and prolongation operators is displayed in Figure 2.

Figure 2.

Schematic of the restriction and prolongation operators from the finest regular grid to a representation (adaptive grid) ω, and vice versa.

However, in this data assimilation framework, one has prior information on the source that may be exploited. The pdf q(σ) gives prior information on σ. Following the statistical assumptions after Eq. (1), it is chosen to be Gaussian: q(σ) ∼ N(σ^b, B). From this prior defined in the finest regular grid, one can infer, thanks to Γ_ω, the prior pdf of σ in representation ω:

q_ω(σ_ω) ∼ N(σ^b_ω, B_ω),    (2)

with

σ^b_ω = Γ_ω σ^b,   B_ω = Γ_ω B Γ^T_ω.    (3)

Conversely, assume one knows σ_ω in representation ω. Since the problem is underdetermined, one could opt for the most likely refinement. It is given by the mode of q(σ|σ_ω). From Bayes' rule, it is clear that

q(σ|σ_ω) ∝ δ(σ_ω − Γ_ω σ) q(σ),    (4)

where δ is the Dirac distribution. Then the mode of this posterior Gaussian distribution is given by

Γ*_ω σ_ω = σ^b + B Γ^T_ω (Γ_ω B Γ^T_ω)^{-1} (σ_ω − Γ_ω σ^b).    (5)

Thus Γ*_ω would be an affine operator. We denote by Λ*_ω its tangent linear component:

Λ*_ω = B Γ^T_ω (Γ_ω B Γ^T_ω)^{-1}.    (6)

Moreover, we define

Π_ω = Λ*_ω Γ_ω,    (7)

so that we can choose as a prolongation operator

Γ*_ω σ_ω = (I − Π_ω) σ^b + Λ*_ω σ_ω,    (8)

where I is the identity operator. Since the refinement is now a probabilistic process, errors are attached to it. The corresponding error covariance matrix is

(I − Π_ω) B (I − Π_ω)^T = (I − Π_ω) B.    (9)

As expected, if the representation ω is close to the finest grid, {N_fg − Rank(Π_ω)}/N_fg ≪ 1, the refinement error is negligible. If the representation is coarse, Rank(Π_ω)/N_fg ≪ 1, the refinement error is limited by that of the background.

These operators first satisfy

Γ_ω Γ*_ω = I_ω,    (10)

which is a consistency identity. Any reasonable prolongation operator should satisfy it. Then, one verifies that

Π_ω = Λ*_ω Γ_ω = B Γ^T_ω (Γ_ω B Γ^T_ω)^{-1} Γ_ω.    (11)

The linear operator Π_ω is a projector, since it can be checked that Π_ω² = Π_ω. Besides, it is B^{-1}-symmetric since

⟨Π_ω x, y⟩_{B^{-1}} = ⟨x, Π_ω y⟩_{B^{-1}},    (12)

where ⟨x, y⟩_{B^{-1}} = x^T B^{-1} y is the scalar product built on B^{-1}. In matrix form, this is equivalent to

B^{-1} Π_ω = Π^T_ω B^{-1}.    (13)

Π_ω cannot be the identity because the coarse-graining implies a loss of information that, in general, cannot be fully recovered. This approach will be called the Bayesian, or probabilistic, prescription of Γ*_ω.
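The algebra of this subsection is easy to check numerically. The following sketch (a toy setting of our own, with an arbitrary admissible representation and a generic B) builds Γ_ω, Λ*_ω and Π_ω and verifies Eqs (7), (10) and (13):

    import numpy as np

    rng = np.random.default_rng(0)
    Nfg = 8                                    # cells in the finest grid
    tiles = [[0, 1], [2, 3, 4, 5], [6], [7]]   # an admissible representation omega

    # Restriction by simple averaging: (Gamma)_{c,i} = 1/|tile c| if cell i is in tile c.
    Gamma = np.zeros((len(tiles), Nfg))
    for c, tile in enumerate(tiles):
        Gamma[c, tile] = 1.0 / len(tile)

    A = rng.standard_normal((Nfg, Nfg))
    B = A @ A.T + Nfg * np.eye(Nfg)            # a generic background-error covariance

    Lambda_star = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)   # Eq. (6)
    Pi = Lambda_star @ Gamma                                         # Eq. (7)

    Binv = np.linalg.inv(B)
    assert np.allclose(Gamma @ Lambda_star, np.eye(len(tiles)))      # Eq. (10)
    assert np.allclose(Pi @ Pi, Pi)                                  # Pi is a projector
    assert np.allclose(Pi.T @ Binv, Binv @ Pi)                       # Eq. (13)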

2.4. Observation equation in any representation

The mathematical formalism being laid out, the observation equation Eq. (1) can be written in any representation ω of 𝒟(Ω). The Jacobian H becomes H_ω = H Γ*_ω. Inheriting from Γ*_ω, H_ω is an affine operator. The observation equation reads

μ = H_ω σ_ω + ε_ω.    (14)

The error ε_ω has been made scale-dependent, because several sources of errors depend on the scale, such as the aggregation errors, or the errors in model subgrid physical parametrisations.

2.5. Reduction of the correlated case

When B is not diagonal, correlations between the errors of values defined on different tiles occur. Non-zero covariances in B may come from true physical correlation in the errors. They may also come from imposed correlations between variables, a form of variable aggregation, or coarse-graining. This second case is discarded here because our coarsening scheme already copes with it explicitly.

Off-diagonal terms in B, which induce correlations between tiles, complicate the optimization scheme considerably. In particular, the calculation of Γ*_ω entails the representation-dependent computation of the inverse of B_ω.

One way out of this is to redefine the original coarsening scheme Γ_ω so that B induces no error cross-correlations between coarse-grained tiles. To do so, one defines a new coarse-graining operator, a substitute for Γ_ω:

Γ̃_ω = Γ_ω B^{-1/2}.    (15)

This implies that the adaptive grid cells no longer represent a partition of the control space domain. Instead, they are (a priori) statistically independent linear combinations of the former cells. Coarse-graining is now applied to these combinations, maintaining the property of statistical independence, rather than to the original grid cells.

As a result, redefining Γ_ω into Γ̃_ω = Γ_ω B^{-1/2}, the background-error covariance matrix in representation ω becomes

B_ω = Γ̃_ω B Γ̃^T_ω = Γ_ω Γ^T_ω.    (16)

Also, the prolongation operator Eq. (8) changes according to

Γ̃*_ω σ_ω = (I − Π̃_ω) σ^b + Λ̃*_ω σ_ω.    (17)

As a consequence, one obtains

Λ̃*_ω = B^{1/2} Γ^T_ω (Γ_ω Γ^T_ω)^{-1},    (18)
Π̃_ω = B^{1/2} Γ^T_ω (Γ_ω Γ^T_ω)^{-1} Γ_ω B^{-1/2}.    (19)

One checks also that Π̃_ω is B^{-1}-symmetric.
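Continuing the toy sketch above (B, Gamma), one can check the effect of the substitution of Eq. (15): with the whitened restriction Γ_ω B^{-1/2}, the coarse-grained background covariance of Eq. (16) is indeed free of cross-correlations. SciPy's matrix square root is assumed to be available.

    from scipy.linalg import sqrtm

    B_inv_sqrt = np.linalg.inv(sqrtm(B).real)
    Gamma_t = Gamma @ B_inv_sqrt               # substitute restriction, Eq. (15)

    B_omega = Gamma_t @ B @ Gamma_t.T          # Eq. (16): equals Gamma Gamma^T
    assert np.allclose(B_omega, Gamma @ Gamma.T)
    # Diagonal, since tiles are disjoint: no cross-correlations remain.
    assert np.allclose(B_omega, np.diag(np.diag(B_omega)))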

3. Accounting for scale-dependent errors

The observations in μ are representative of some scale. This scale may not be accessible to modellers. Vector μ is related to σ at the finest scale, but also to σ_ω at a coarser scale, through Eq. (1) and Eq. (14):

μ = H σ + ε = H_ω σ_ω + ε_ω.    (20)

Then consistency would impose that the errors are scale-dependent (hence the notation ε_ω), because the numerical model is.

3.1. Scale-free errors

In Bocquet (2009), only scale-independent errors ε were considered. It means that these errors are attached to the observations themselves (instrumental errors), or pertain to model errors that are scale-free. For the sake of consistency, the measurements themselves are to be scale-dependent in this case:

μ_ω = H_ω σ_ω + ε.    (21)

This is a natural standpoint in a synthetic data assimilation experiment performed at several scales. In this context, each Jacobian at a different scale is assumed to be derived from a perfect model, so that discretization errors are discarded. The synthetic measurements of such an experiment are

μ_ω = H_ω σ^t_ω + ε,    (22)

where σ^t_ω = Γ_ω σ^t is the coarse-graining of the true source σ^t. These synthetic measurements are possibly made noisy. This is the point of view adopted by Bocquet (2005) and Saide et al. (2011). H_ω could either be obtained by coarsening of H, or from several models at several resolutions that are assumed perfect.

Since scale-dependent errors are discarded, this type of study is ideal to assess the signal in the observations without bothering about scale-dependent biases in the model, especially representativeness errors.

3.2. Errors due to aggregation only

Let us assume that errors are specified at the finest grid level, ε = μ − Hσ, and that they may originate from many sources. Then, errors at a larger scale, ε_ω = μ − H_ω σ_ω, are supposed to be solely due to this original error, plus errors entirely due to coarsening, i.e. the aggregation error that leads to representativeness error. In that case, the model scaling is entirely explained by the coarsening H_ω = H Γ*_ω. Since μ = Hσ + ε = Hσ^b + HΠ_ω(σ − σ^b) + ε_ω, the aggregation error, or scale-covariant error, can be identified:

ε_ω = ε + H (I − Π_ω)(σ − σ^b).    (23)

Assuming independence of the error and source error priors, the computation of the covariance matrix of these errors yields

R_ω = E[ε_ω ε^T_ω] = R + H (I − Π_ω) B H^T.    (24)

The fact that Π_ω is B^{-1}-symmetric has been used in the derivation. Since H (I − Π_ω) B H^T is a positive matrix, the mean variance of the errors always increases because of the aggregation.

Intuitively, the statistics of the innovation vector μ − Hσ^b should not depend on the scale. However, when written in terms of errors, the innovation depends formally on the representation ω:

μ − H σ^b = H_ω (σ_ω − σ^b_ω) + ε_ω.    (25)

We have used the fact that

H_ω σ^b_ω = H σ^b,    (26)

where H_ω is here the affine observation operator of Eq. (14). This paradox is only superficial, since one can check that the statistics of the innovation are truly scale-independent:

R_ω + H_ω B_ω H^T_ω = R + H B H^T.    (27)
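These two identities are easy to verify numerically. Continuing the toy sketch above (Gamma, Lambda_star, Pi, B), with an arbitrary fine-grid Jacobian and observation-error covariance:

    p = 5
    H = rng.standard_normal((p, Nfg))          # a toy fine-grid Jacobian
    R = np.eye(p)

    R_omega = R + H @ (np.eye(Nfg) - Pi) @ B @ H.T     # Eq. (24)
    H_omega = H @ Lambda_star                          # tangent linear coarse Jacobian
    B_omega = Gamma @ B @ Gamma.T                      # Eq. (3)

    # Eq. (27): the innovation statistics do not depend on the representation.
    assert np.allclose(R_omega + H_omega @ B_omega @ H_omega.T, R + H @ B @ H.T)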

More generally, an analysis performed in the representation ω is obtained by coarsening the analysis at the finest scale. Hence, in this case, the multiscale formalism has no theoretical benefit compared to performing data assimilation in the finest grid (although there are major practical advantages). This can be understood by applying Bayes' rule directly, using Gaussian statistics:

p(σ_ω|μ) ∝ p(μ|σ_ω) q_ω(σ_ω).    (28)

This leads to the estimate

σ^a_ω = Γ_ω σ^a,    (29)

with σ^a the emission estimation in the finest grid. The analysis-error covariance matrix transforms similarly according to

P^a_ω = Γ_ω P^a Γ^T_ω,    (30)

where P^a is the analysis-error covariance matrix in the finest grid. This can also be consistently obtained through the finest scale:

p(σ_ω|μ) = ∫ dσ δ(σ_ω − Γ_ω σ) p(σ|μ),    (31)

which yields Eqs (29) and (30) by a simple convolution of Gaussian pdfs.

3.3. Scale-dependent model errors

As a first step, the errors were assumed to be scale-free, ε_ω ≡ ε, for instance coming from the observation: instrumental errors. Then, in addition, aggregation errors were taken into account by coarse-graining from the fine resolution: ε_ω = ε + ε^a_ω, where ε^a_ω = H (I − Π_ω)(σ − σ^b).

A third decomposition could involve (i) the scale-independent observation error ε^o, which would also include model error that could be scale-independent, (ii) an error due to discretization (coarse-graining), ε^a_ω, and (iii) a model error that would be scale-dependent, ε^m_ω:

ε_ω = ε^o + ε^a_ω + ε^m_ω.    (32)

On the one hand, ε^a_ω would be decreasing as the resolution increases. On the other hand, ε^m_ω may have various behaviours, depending on how the physics of the problem is parametrised and how the errors of the parametrisations depend on scale.

For instance, and for the latter source of errors, a large error increase is observed in atmospheric dispersion, when increasing the resolution of the atmospheric dispersion model beyond the reliable resolution of the meteorological fields used to drive the simulations.

In the rest of the article, we shall assume that the errors that are modelled account for scale-independent errors of all kinds, plus the scale-covariant aggregation errors. Additional scale-dependent model errors will not be considered.

4. Optimality criteria and optimization

4.1. Three optimality criteria

In addition to a multiscale formalism, the dependence of errors on the scale has been studied. Now, the optimal design of the representation of control space can be introduced. Three possible criteria of optimality are tested.

4.1.1. The Fisher criterion

Given our original incentive, which is to construct an adaptive grid of control space, optimal for data assimilation, the optimality criterion must be a measure of the quality of the analysis. In Bocquet (2009), the following criterion was chosen:

J = Tr(B^{1/2} H^T R^{-1} H B^{1/2}).    (33)

It is inspired by the Fisher information matrix, normalised by the background-error covariance matrix, so that the criterion is invariant by a change of coordinates in control space (for a given grid). Specifically, it measures the reduction of uncertainty granted by the observations.

In a representation ω, the criterion reads

J(ω) = Tr(B^{1/2}_ω H^T_ω R^{-1}_ω H_ω B^{1/2}_ω).    (34)

The operator H_ω = H Λ*_ω appearing here is the tangent linear part of the affine operator H_ω of Eq. (14) (which explains the difference of notation in the original typesetting). Because only the linear part of the affine operator survives when averaging over the errors to obtain second-order moments, the tangent linear operator appears in the criterion rather than the affine one.

If one assumes that the errors are essentially scale-independent, then R_ω ≃ R. In that case, J(ω) can be written in terms of Π_ω using the machinery developed earlier:

J(ω) = Tr(R^{-1} H Π_ω B Π^T_ω H^T).    (35)

Using the Bayesian prolongation operator Γ*_ω that makes use of the prior, one obtains further

J(ω) = Tr(Π_ω B H^T R^{-1} H),    (36)

owing to the B^{-1}-symmetry of Π_ω.

But, if the errors are scale-covariant following Eq. (23), the Fisher criterion Eq. (33) reads

J(ω) = Tr[{R + H (I − Π_ω) B H^T}^{-1} H Π_ω B H^T],    (37)

which is more difficult to optimize because of the nonlinear dependence of R_ω on Π_ω. The additional term is expected to increase the trust in the finest grid descriptions rather than the coarser ones.

4.1.2. Degrees of freedom for the signal

The dependence on Π_ω is actually simpler if the criterion (to be maximized) is chosen to be

J(ω) = Tr[(R + H B H^T)^{-1} H Π_ω B H^T],    (38)

using the innovation statistics scaling, Eq. (27).

This criterion J(ω) is known to measure the number of DFS, i.e. the information load that helps resolve the parameter space. It is actually more common in the data assimilation literature than the cost function Eq. (36). In the absence of any source of errors, the DFS are equal to the number of scalar observations that are assimilated (p here). In the presence of errors, the DFS range between 0 and the number of observations p, because the information of the observations is also used to resolve the noise (Rodgers, 2000). So the maximization of J(ω) entails maximizing these degrees of freedom, which seems very natural. Note that criterion Eq. (36) is the limiting case of this DFS criterion when R is inflated or when B vanishes.

In this vein, given an admissible representation ω of N tiles,

J(ω)/N    (39)

would represent the number of degrees of freedom per grid cell, or tile. It is an objective measure of the data density (Rodgers, 2000) in parameter space.
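Using the toy quantities of the earlier sketches (H, B, R, Pi), the DFS criterion of Eq. (38) is a one-line computation; the final check expresses that an adaptive representation can only capture part of the DFS available in the finest grid:

    def dfs(H, B, R, Pi):
        # Eq. (38): DFS captured by the representation with projector Pi,
        # under scale-covariant errors.
        S = R + H @ B @ H.T            # scale-invariant innovation covariance, Eq. (27)
        return np.trace(np.linalg.inv(S) @ H @ Pi @ B @ H.T)

    # The finest grid (Pi = I) captures the maximum achievable DFS.
    assert dfs(H, B, R, Pi) <= dfs(H, B, R, np.eye(Nfg)) + 1e-9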

4.1.3. Data-dependent criterion

One could consider the relative entropy, that is to say a gain in information, attached to the reconstructed parameters of control space (such as source variables). When the inference leading to the reconstructed source is Bayesian, and when the statistics are Gaussian, this information gain is (Kleeman, 2002)

equation image(40)

whereas in a maximum entropy inference context, only the third term of Eq. (40) appears (Bocquet, 2008):

equation image(41)

This term is a measure of the gain of information on the estimate of the source, whereas the additional terms in Eq. (40) focus on the gain in the knowledge of the uncertainty of this estimate. The former measures the information gain on the first-order moment, while the latter measures the information gain on the second-order moments. The average of the Bayesian result over all potential μ is

E_μ[K] = (1/2) ln det{B (P^a)^{-1}},    (42)

whereas in the maximum entropy case, it is

E_μ[K] = (1/2) Tr[(R + H B H^T)^{-1} H B H^T],    (43)

which is half of the DFS.

Therefore, Eq. (41) could be used as a criterion for its simplicity and its physical interpretation. Applied to a representation ω, and defining the innovation δμ = μ − H σ^b, it reads

J(ω) = (1/2) δμ^T S^{-1}_ω H_ω B_ω H^T_ω S^{-1}_ω δμ,  with S_ω = R_ω + H_ω B_ω H^T_ω,    (44)

where δμ = μ − H_ω σ^b_ω = μ − H σ^b is scale-independent. The choice of the scale-covariant error Eq. (23) leads to

J(ω) = (1/2) Tr(Π_ω B H^T S^{-1} δμ δμ^T S^{-1} H),  with S = R + H B H^T,    (45)

where the cyclic property of the trace operator has been used. Contrary to the Fisher and DFS criteria, this criterion depends on the observation vector μ. By Eq. (43), when averaged over all possible sources and errors following the prior statistics, it yields half of the DFS criterion.

The total gain of information, both on the source and on the errors, in the maximum entropy inference is

K_tot = (1/2) δμ^T S^{-1} δμ.    (46)

Using a scale-covariant error Eq. (23) implies that K_tot is scale-invariant. However, J(ω) is not. Therefore the information is distributed differently depending on the scale or, more generally, the representation ω.

4.2. Reduction of the criteria in the correlated case

When B is not necessarily diagonal, a redefinition of the original restriction operator Γ_ω into Γ̃_ω = Γ_ω B^{-1/2} was advocated. Let us take the example of the Fisher criterion. With this redefinition, the optimality criterion becomes

J(ω) = Tr(Π_ω B^{1/2} H^T R^{-1} H B^{1/2}),    (47)

where Π_ω is now reduced to

Π_ω = Γ^T_ω (Γ_ω Γ^T_ω)^{-1} Γ_ω,    (48)

where Γ_ω is the original coarse-graining restriction operator obtained by simple averaging. Similar results can be obtained for the other two criteria.

In the following, we shall assume that either B is proportional to the identity, or that one applies the above redefinition to Γ_ω. Although this reduction of the correlated case is to be used in future work, and needed to be addressed in this methodological article, it will not be directly used in the following test cases.

4.3. Algebraic formalism

Since the main goal is to optimize the representations of control space, we need to transform this abstract description of the multiscale structure and errors into numerical mathematics. For each tile at scales l, a vector v_{l,k} in R^{N_fg} is defined. Here l = (l_x,l_y,l_t) represents the scales of the tile, and k is the tile index in the set of tiles of the same type (i.e. of the same scales l). Recall that the finest regular grid is made of the tiles of scales l = (0,0,0). By construction, these tiles are in one-to-one correspondence with the canonical basis vectors {e_{i,j,h}} of R^{N_fg}. At a coarser scale, a tile at scales l can be partitioned into finer grid cells of the finest regular grid. Correspondingly, its vector v_{l,k} is defined as the sum of the canonical vectors {e_{i,j,h}} representing the finest grid cells that compose the tile:

v_{l,k} = Σ_{i=i_k}^{i_k+2^{l_x}−1} Σ_{j=j_k}^{j_k+2^{l_y}−1} Σ_{h=h_k}^{h_k+2^{l_t}−1} e_{i,j,h},    (49)

where i_k, j_k and h_k are the smallest indices of the finest cells composing tile (l,k).

From Eq. (7), and using the fact that B is proportional to the identity, by definition or after the above redefinition Eq. (15), one obtains an explicit formula for Π_ω:

Π_ω = Σ_l Σ_{k=1}^{n_l} α^ω_{l,k} (v_{l,k} v^T_{l,k}) / (v^T_{l,k} v_{l,k}),    (50)

where n_l is the number of tiles in the set of tiles with scale vector l, which runs over all predefined scales 0 ≤ l_x < n_x, 0 ≤ l_y < n_y and 0 ≤ l_t < n_t. The coefficients α^ω_{l,k} define representation ω: α^ω_{l,k} is 1 when tile (l,k) belongs to the representation ω, and is zero when it does not. Equation (50) can be checked by applying the projector of Eq. (7) to any vector v_{l,k}. From now on, the superscript ω on α^ω_{l,k} will be dropped to simplify the notation.

Since B is (truly or effectively) diagonal, for any two vectors, v^T_{l,k} v_{l′,k′} is non-zero if and only if the two vectors correspond to overlapping tiles. If they belong to an admissible representation (the tiles form a partition of Ω), then the matrix element is non-zero only if (l,k) = (l′,k′).

Then, inserting Eq. (50) into Eq. (47), the cost function reads

J(ω) = Σ_l Σ_{k=1}^{n_l} α_{l,k} E_{l,k},    (51)

where E_{l,k} = (v^T_{l,k} W v_{l,k}) / (v^T_{l,k} v_{l,k}). For instance, in the case of the Fisher criterion one has W = B^{1/2} H^T R^{-1} H B^{1/2}. The local energy E_{l,k} is a local measure of the contribution of the cell to the cost function.

In the following subsection we assume that J(ω) is of this form.
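A sketch of the local energies (variable names ours; B is taken as the identity, as effectively assumed here after the redefinition, so that W = H^T R^{-1} H for the Fisher criterion, with the toy H and R of the earlier sketches):

    def local_energy(W, v):
        # E_{l,k} = (v^T W v) / (v^T v): the tile's contribution to Eq. (51).
        return (v @ W @ v) / (v @ v)

    W = H.T @ np.linalg.inv(R) @ H      # Fisher criterion weight with B = I
    e = np.eye(Nfg)
    mother = e[0] + e[1]                # a dyadic mother tile and its two daughters
    # Splitting a tile can only increase the summed energy (cf. section 5.3):
    assert (local_energy(W, e[0]) + local_energy(W, e[1])
            >= local_energy(W, mother) - 1e-12)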

4.4. Solving for optimal representations

The goal is to optimize the functional Eq. (51) over all admissible representations. In order to lift the constraint of admissibility (the tiles cannot overlap), one introduces a Lagrangian. A fixed number of tiles is imposed thanks to a single multiplier ζ. The one point:one tile requirement is imposed thanks to a vector λ of N_fg multipliers. Each multiplier is associated with one grid cell of the finest regular grid. The Lagrangian reads

L(ω,λ,ζ) = J(ω) − Σ_{k=1}^{N_fg} λ_k (Σ_l α_{l,s_l(k)} − 1) − ζ (N − Σ_l Σ_{k=1}^{n_l} α_{l,k}).    (52)

The sum on k = 1,…,N_fg runs over all cells of the finest grid. In this sum, α_{l,s_l(k)} is the coefficient attached to the tile at scale l that covers cell k of the finest grid. This tile has index s_l(k) among the n_l tiles related to scale l. The Lagrangian can also be written as

L(ω,λ,ζ) = Σ_l Σ_{k=1}^{n_l} α_{l,k} (E_{l,k} − v^T_{l,k} λ + ζ) + Σ_{k=1}^{N_fg} λ_k − ζ N.    (53)

Then the maximum can formally be taken over all representations, admissible or not, with any number of tiles in [N_cg, N_fg], where N_cg is the number of grid cells in the coarsest regular grid. As a first step, the optimization is performed on the set of coefficients α_{l,k}, which have been freed from the constraints through the multipliers. This is made easier (to a limited extent) by the fact that α_{l,k} can only be 0 or 1. Then one obtains an effective cost function of the Lagrange parameters:

L̂(λ,ζ) = Σ_l Σ_{k=1}^{n_l} max(0, E_{l,k} − v^T_{l,k} λ + ζ) + Σ_{k=1}^{N_fg} λ_k − ζ N.    (54)

Because this cost function is dual to ℒ(ω), it needs to be minimized, not maximized (Borwein and Lewis, 2000). Note that the cost function is not smooth since it is non-differentiable on the edges of a polytope. Hence, optimization on the Lagrange parameters cannot make direct use of gradient-based minimization techniques. Besides, this functional may not be convex, nor is it guaranteed that it has a single minimum. To overcome these potential problems, a regularisation of this effective cost function is needed.

A statistical mechanics analogy was used earlier by Bocquet (2009) to solve this problem. We develop here an equivalent analytical approach through information theory. We look for the least committed representation, described by a pdf q(α) on the vector α, given that all constraints are satisfied on average. At finite temperature β^{-1}, the optimal pdf is the one that maximizes the criterion with a weight β, minus the relative entropy of the representation pdf with respect to the (non-admissible) geometry where all tiles are equiprobable, of pdf q_0:

q = argmax_q { β Σ_α q(α) L(α;λ,ζ) − Σ_α q(α) ln [q(α)/q_0(α)] }.    (55)

A first optimization on q leads to

q(α) ∝ q_0(α) exp{β L(α;λ,ζ)}.    (56)

Here, L(α;λ,ζ) is shorthand for the Lagrangian L(ω,λ,ζ) of Eq. (53), seen as a function of the coefficients α. The substitution of q given by Eq. (56) into Eq. (55) leads to a dual Lagrangian

L̂_β(λ,ζ) = (1/β) ln Z_β,    (57)

where the partition function Z_β is given (after factorisation) by

Z_β = e^{β(Σ_k λ_k − ζN)} Π_l Π_{k=1}^{n_l} [1 + e^{β(E_{l,k} − v^T_{l,k} λ + ζ)}].    (58)

This leads to the dual Lagrangian, a function of the Lagrange parameters:

L̂_β(λ,ζ) = (1/β) Σ_l Σ_{k=1}^{n_l} ln[1 + e^{β(E_{l,k} − v^T_{l,k} λ + ζ)}] + Σ_{k=1}^{N_fg} λ_k − ζN.    (59)

This cost function is the one that was obtained in Bocquet (2009) using the statistical mechanics analogy. From the minimization of this free energy, yielding λ and ζ, one obtains the filling factors

ᾱ_{l,k} = [1 + e^{−β(E_{l,k} − v^T_{l,k} λ + ζ)}]^{-1}.    (60)

When β goes to infinity, the filling factors ᾱ_{l,k} converge to either 0 or 1, and Eq. (59) reduces to the zero-temperature dual Eq. (54). An alternative statistical regularisation is proposed in the Appendix.
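A minimal sketch (names ours) of the joint evaluation of the smoothed dual cost Eq. (59) and filling factors Eq. (60), for a dictionary given by its tile vectors and local energies; np.logaddexp keeps the evaluation stable for large β:

    import numpy as np

    def dual_and_filling(lmbda, zeta, tile_vectors, energies, beta, N):
        # x_{l,k} = E_{l,k} - v^T lambda + zeta for every tile of the dictionary.
        x = np.array([E - v @ lmbda + zeta
                      for v, E in zip(tile_vectors, energies)])
        # Eq. (59): (1/beta) sum log(1 + exp(beta x)) + sum(lambda) - zeta N.
        cost = np.logaddexp(0.0, beta * x).sum() / beta + lmbda.sum() - zeta * N
        # Eq. (60): Fermi-Dirac-like filling; tends to {0,1} as beta -> infinity.
        filling = 1.0 / (1.0 + np.exp(-beta * x))
        return cost, filling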

5. Illustrations

The formalism described in the previous sections will be illustrated on two examples related to the transport and fate of atmospheric constituents.

5.1. Simplified CarboEurope-IP network

5.1.1. Set-up

The CarboEurope-IP network routinely measures CO2 concentrations over Europe at a precision of 0.1 ppm, and is part of the global monitoring network of greenhouse gases. The observations from this network of 22 stations can be used to perform inverse modelling of CO2 sources and sinks (http://www.carboeurope.org). Here we will use a much simpler prototype to apply the above formalism to this issue. Firstly, we shall use only one annual-mean observation for each station (for a total of 22). Secondly, we will use a drastically simpler model to construct the Jacobian H, made of the influence functions for each of those observations. Each influence function c_i attached to an observation i is assumed to be an average power law

c_i(r) ∝ r^{−α},    (61)

where r is the great-circle distance separating the observation location and the point where this sensitivity is being computed. The exponent α ≃ 2.4 is chosen heuristically, following Roustan and Bocquet (2006). As an average midlatitude footprint, it bears some realism. The Jacobian entries are given by [H]_{ik} = [c_i]_k, where [c_i] is the discretised influence function. One is then looking for an optimal stationary adaptive representation.

A multiscale structure of six levels for each direction is defined. The domain Ω of control space is 22°W–42°E, 34–66°N. Its finest regular grid has dimensions N_x = 128 and N_y = 64, with grid-cell sizes Δx = Δy = 0.50°. The total number of cells in this grid is therefore N_fg = 8192.
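A sketch of the prototype Jacobian construction follows; the station coordinates below are placeholders, not the actual CarboEurope-IP sites, and the power law is singular at r = 0, which the staggering of grid-cell centres avoids here:

    import numpy as np

    def great_circle_km(lat1, lon1, lat2, lon2, radius=6371.0):
        # Spherical law of cosines, clipped for numerical safety.
        lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
        c = (np.sin(lat1) * np.sin(lat2)
             + np.cos(lat1) * np.cos(lat2) * np.cos(lon2 - lon1))
        return radius * np.arccos(np.clip(c, -1.0, 1.0))

    alpha = 2.4
    # Cell centres of the 128 x 64 finest grid (22W-42E, 34-66N, 0.5 deg cells).
    lon = np.arange(-22.0, 42.0, 0.5) + 0.25
    lat = np.arange(34.0, 66.0, 0.5) + 0.25
    lon2d, lat2d = np.meshgrid(lon, lat)

    stations = [(48.8, 2.3), (52.5, 13.4)]      # placeholder station coordinates
    H = np.stack([great_circle_km(slat, slon, lat2d, lon2d) ** (-alpha)
                  for slat, slon in stations])  # one row per annual-mean observation
    H = H.reshape(len(stations), -1)            # N_fg = 8192 columns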

With such simple assumptions, this example only represents a prototype of the kind of results that could be achieved with a more realistic physical model and observation set. It allows us to test the ideas presented in this article, as well as sketch a future computationally demanding full-scale application.

5.1.2. DFS criterion for the simplified CarboEurope-IP network case

It is first assumed that the model and observations are perfect (the error covariance matrix R ≃ 0 is negligible). The background-error covariance matrix is taken diagonal. For simplicity it is assumed that σ^b = 0. Figure 3(a) shows the optimal adaptive grid with N = 512, which represents 6% of the number of cells in the finest grid. The DFS obtained is 21.514, as compared to p = 22 observations. This means that this representation is able to capture all the degrees of freedom that could have been obtained in the finest grid. The densification of the grid close to Scandinavia is due to two outlying stations: Pallas and Zeppelin, while the densification in the Atlantic is due to the outlying station of Ivittuut, Greenland. These three stations are used in the optimization, but do not lie in the part of the domain that is shown here.

Figure 3.

Optimal adaptive grids over the tiling dictionary for the CarboEurope-IP prototype with N = 512. (a) corresponds to an optimization using the DFS criterion with negligible errors, R ≃ 0. (b) corresponds to an optimization using the DFS criterion with errors. (c) corresponds to an optimization using criterion Eq. (36).

Then we assume a diagonal, non-null error covariance matrix R, such that the theoretical maximum DFS is 5.89, much lower than the available p = 22 observations, which is quite realistic. The optimal grid that is obtained with N = 512 reaches a DFS of 5.88. The result is displayed in Figure 3(b). The main difference is that the grid is even more peaked around the stations. Indeed, in Eq. (38), R acts as a threshold below which the propagation of information from control space to the observations, represented by HBH^T, becomes less relevant (the denominator is dominated by R rather than by HBH^T). As the errors represented by R increase, the information is propagated over shorter distances. The criterion Eq. (36) is the limiting case of Eq. (38) when observation/model errors dominate the background errors: R is much larger than HBH^T for any reasonable norm. It consistently leads to the grid design displayed in Figure 3(c), which is even slightly more peaked than that of Figure 3(b).

5.2. ETEX-I dispersion experiment

5.2.1. Set-up

The second example is the European Tracer Experiment (ETEX), and in particular its first campaign, ETEX-I. Organised by the Joint Research Centre at Ispra, Italy, it dates back to 12 October 1994, 1600 UTC, when 340 kg of perfluoromethylcyclohexane were released uniformly over 12 h, at Monterfil, in Brittany, France. 168 stations of the World Meteorological Organisation (WMO) monitored the subsequent plume throughout Europe. The weather conditions (low pressure over Scotland) were selected so that the plume would be advected eastward toward the stations.

The measurements were intensively used to benchmark chemistry and transport models (Nodop et al., 1998), but also, more recently, for tests of inverse modelling methodologies (Krysta et al., 2008). In particular, it was shown that, with a considerable reduction of the number of grid cells, the optimal tiling leads to inversions very similar to the one obtained with a fine regular grid.

A multiscale structure of five levels for each direction is defined. The finest regular grid is 20.8125°W–15.1875°E, 36.5625–54.5625°N, with N_x = 64, N_y = 32, and N_t = 160. The number of cells of size Δx = Δy = 0.5625° and Δt = 1 h in the finest grid is N_fg = 327680.

Contrary to the example of CarboEurope-IP, H is obtained from a realistic Eulerian chemistry and transport model (Bocquet, 2007, gives modelling details). Also, the adaptive grid will be dynamic: it will be optimized both on the ground and in time.

5.2.2. Comparing designs with ETEX-I

The differences between the data-dependent criterion and the DFS, data-free, criterion are illustrated on the ETEX-I case. For both criteria, a scale-covariant error is assumed. The data-free criterion is therefore Eq. (38) while the data-dependent criterion is Eq. (45). A limited dataset of 201 real observations of tracer concentration is used. The same set was employed by Bocquet (2009), but with criterion Eq. (36).

We seek optimal grids of the same size N = 402 as in Bocquet (2009). The optimal 2D+T grid obtained from the data-free criterion is displayed in Figure 4, while the optimal 2D+T grid obtained from the data-dependent criterion is displayed in Figure 5. N = 402 offers a tight compromise for the data-free criterion optimization, in terms of high DFS and significant reduction of the tile numbers (0.1% of the total number of grid cells in the finest grid, N_fg = 327680). The resulting DFS is 75.70, while the maximum achievable DFS in the finest grid is 157.5. Since error statistics have been taken into account, it is lower than the 201 DFS of the perfect-model case. It is also the maximum of criterion Eq. (45), that is, the maximum of the achievable information gain via a maximum entropy on the mean inference.

Figure 4.

Snapshots of the 2D+T optimal adaptive grid with N = 402 tiles for a selection of 201 concentration observations of the ETEX-I dispersion event. The criterion is given by the data-independent cost function Eq. (38). Time is indicated in the top left-hand corner of each panel. The triangles indicate the WMO stations that reported at least one of these 201 observations. The disk indicates the true source location of ETEX-I.

Figure 5.

As Figure 4, but the optimality criterion is now given by the data-dependent cost function Eq. (45).

The main difference is seen over Ireland. Indeed, the values of the concentrations at the stations in Brittany are high and do not rule out a source upwind, near Ireland or in the Atlantic, so the grid is refined there. On the contrary, the data-free criterion accounts for any set of values compatible with the prior. The true observation set used in the data-dependent criterion is only one specific set, so a refinement near the monitoring network is preferred at the expense of a refinement over Ireland and the Atlantic.

5.2.3. Inverse modelling with ETEX-I

Inverse modelling is performed using several adaptive grids and the results are reported in Table I. The details of the set-up of the inverse modelling are the same as those reported by Bocquet (2009), and they are not repeated here.

Table I. Results of source inverse modelling experiments on ETEX-I, using several types of regular or adaptive grids built from the criteria introduced in this article. The total mass of tracer released during ETEX-I was 340 kg at a point location.
Grid type | Criterion type | N | Inversion type | m0 (kg) | χ (ng m−3) | Local mass (kg) | Total mass (kg)
Regular | — | 20480 | Gaussian | 0.025 | 0.25 | 234 | 680
Regular | — | 20480 | Non-Gaussian | 5 | 0.25 | 220 | 327
Tiling | Fisher | 402 | Gaussian | 0.025 | 0.25 | 270 | 1005
Tiling | Fisher | 402 | Non-Gaussian | 5 | 0.25 | 205 | 238
Tiling | DFS | 402 | Gaussian | 0.025 | 0.25 | 268 | 1136
Tiling | DFS | 402 | Non-Gaussian | 5 | 0.25 | 200 | 252
Tiling | Data-dependent | 402 | Gaussian | 0.025 | 0.25 | 173 | 599
Tiling | Data-dependent | 402 | Non-Gaussian | 5 | 0.25 | 134 | 195

Two types of inversion are considered: Gaussian and non-Gaussian. The Gaussian type is based on Gaussian background errors, such as those assumed in this article, while the non-Gaussian type is based on non-Gaussian background errors (following Bocquet et al., 2010, and references therein) that ensure positiveness of the source. The total retrieved mass and the mass retrieved near the location of the release site are reported in the table. Scalar m0 is a mass scale that parametrises the background-error term, whereas χ is the prior observation-error standard deviation. The results obtained for the DFS criterion are similar to those obtained with the Fisher criterion, and the remarks of Bocquet (2009) remain valid. However, for the grid obtained from the data-dependent criterion, the inversions lead to a very good localization of the source (not shown here, but it can be inferred from the figures in the table) and an underestimation of the retrieved mass. The better localization is due to a stronger refinement of the grid close to the release site. On the downside, it probably strengthens the importance of the measurements performed at nearby sites, known to be largely overestimated by Eulerian dispersion models on ETEX-I, leading to an underestimation by the data assimilation scheme because of this model error. The inversions with adaptive grids are also compared to those performed with the regular grid of resolution 2.25° × 2.25° × 1 h in Table I.

5.3. An optimal number N of tiles?

When one adds one more tile to an optimal adaptive grid, there is a marginal gain in the objective function, defined mathematically by ∂J/∂N. It can be accessed numerically through the parameter −ζ, conjugate to the number of tiles, because

∂J/∂N = −ζ,    (62)

which can be checked by differentiating Eq. (54) with respect to N. If there is an optimal number N* of tiles which is non-trivial (i.e. strictly N_cg < N* < N_fg), then ∂J/∂N vanishes at N*. To obtain such a grid, with an optimal N*, it is sufficient to get rid of the tile-number constraint in the optimization of Eq. (54). Unfortunately, the existence of such a non-trivial N* is not a simple issue.

Consider the generic objective function J(ω) = Tr(Π_ω Ω), where Π_ω is the projector onto representation ω and Ω is a positive definite matrix. If ω*_N is the optimal representation for this criterion with N tiles, then J(ω*_N) is an increasing function of N. Suppose ω*_N has been determined for N ≤ N_fg − 1, and let us look for a better representation ω_{N+1} with N + 1 tiles. Take any tile of ω*_N and split it into two sub-tiles. This leads to a representation ω_{N+1} that does not have to be optimal. If the eigensystem of Ω is {(ζ_i, u_i)}, then

J(ω) = Σ_{(l,k)∈ω} Σ_i ζ_i (v^T_{l,k} u_i)² / (v^T_{l,k} v_{l,k}).    (63)

It is not difficult to show that the sum of the quantity (v^T u_i)²/(v^T v) over two sub-tiles is greater than or equal to the same quantity for the mother tile. Since the ζ_i are all positive, one concludes that J(ω_{N+1}) ≥ J(ω*_N), so that J(ω*_{N+1}) ≥ J(ω*_N).
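The sub-tile inequality invoked above is an instance of the Cauchy–Schwarz inequality (document notation): for a mother tile v = v_1 + v_2 with disjoint, hence orthogonal, sub-tile vectors, and for any vector u,

(v^T u)² / (v^T v) = (v_1^T u + v_2^T u)² / (v_1^T v_1 + v_2^T v_2) ≤ (v_1^T u)²/(v_1^T v_1) + (v_2^T u)²/(v_2^T v_2).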

As a consequence, criteria Eqs (36) and (38) are monotonically increasing functions of N over the optimal representations ω*_N. The maximum of the objective function is reached in the finest regular grid. This was numerically checked by Bocquet (2009) for the first objective function. It will be checked in section 6 for the second objective function, based on the DFS.

This also applies to the data-dependent objective function Eq. (45), because in this case Ω is of the form Ω = u u^T, where

u = (1/√2) B^{1/2} H^T (R + H B H^T)^{-1} (μ − H σ^b).    (64)

However, such monotonic behaviour may not be satisfied for an arbitrary objective function. The DFS and data-dependent objective functions used in this article account for aggregation errors, which decrease with N, and for estimation errors, which increase with N (for a given dataset). The net result is an error reduction with increasing N. In an even more realistic context, one should also take into account scale-dependent model errors ε^m_ω that are not of aggregation type, as discussed in section 3. Then there may be an optimal N, as illustrated in Figure 6. This paradigm has been established in the greenhouse gas inversion community (e.g. Peylin et al., 2001).

Figure 6.

Schematic of the posterior error (arbitrary units), or of a reasonable criterion resulting from the aggregation, model and estimation errors, as a function of the resolution.

Such a non-trivial optimum should also exist when the errors are scale-free (as discussed in section 3). For instance, in the case of the data-dependent cost function, and without taking aggregation (scale-covariant) errors into account, it was shown by Bocquet (2005) that the objective function vanishes when N goes to infinity. For a finite resolution limit (large but finite N_fg), the objective function is expected to ultimately decrease to a finite limiting value imposed by the finest accessible resolution. Taking aggregation errors into account counteracts this gain of information at coarse resolutions, because fields on coarser grids are not trusted as much as fields defined in the finest grid. Again, this trust in the finest grid is likely to be mitigated by taking into account realistic scale-dependent model errors, yielding a non-trivial N*.

6. General tilings versus qtrees and ftrees

In the previous sections, a multiscale framework has been defined and a data assimilation system was made consistent with it, including scale-covariant aggregation errors. This allowed optimal representations of control space for the assimilation of observations to be built. Up to this point, the adaptive grids were optimized on a dictionary of general tilings. For a 2D+T parameter field, and when employing a dyadic multiscale structure, storing the multiscale Jacobian in memory requires up to eight times the size of the Jacobian of the finest grid. It is thus of practical concern to use a smaller, but still efficient enough, dictionary of representations.

6.1. Qtrees

If one adopts a quaternary tree structure (qtree) for the spatial part, instead of the tensor product of two dyadic structures, while keeping a binary tree multiscale structure for time, then storing the multiscale Jacobian in memory requires at most 8/3 times the size of the Jacobian of the finest grid (the spatial quaternary hierarchy contributes a geometric factor 1 + 1/4 + 1/16 + … < 4/3, doubled by the binary hierarchy in time). Note that the set of qtrees built on the same domain is a subset of the tilings.

In order to compare the results, we use the ETEX-I example again to visually illustrate the qtree representations. Figure 7 displays an optimal representation under the same assumptions and for the same criterion as in the example of Figure 5. The corresponding tiling and qtree representations are consistently refined at the same spots in space and time.

Figure 7.

As Figure 4, but the optimal representation is searched in the qtree set.

Figure 8 displays the DFS of optimal tilings, optimal qtrees, and regular grids, for a wide range of N. The optimal tilings and optimal qtrees are far superior to regular grids: much more information is captured with the same number of cells in an optimal adaptive grid. Besides, for a fixed N, the optimal qtree captures fewer DFS than the corresponding optimal tiling. This must be so, since qtrees form a subset of tilings. Nevertheless, the drop in performance is very moderate. Moreover, the optimization times for these computations were roughly two times shorter for the qtrees than for the tilings. Their respective numerical efficiencies will be discussed in Part II of this article. Therefore, we believe that optimizing on qtrees is a good substitute for an optimization on tilings.

Figure 8.

Degrees of freedom for the signal of optimal tilings, optimal qtrees, optimal ftrees and regular grids versus the number of grid cells in the representation (ETEX-I example).

6.2. Ftrees

A factorised tree, or ftree is defined as the direct product of binary trees. In the 2D case, the ftree is the direct product of two binary trees, one for each of the two directions. An example of such an adaptive grid is displayed in Figure 1(b). It is similar to the grid used by global numerical weather prediction models or chemical transport models that require zooming onto some region, such as Arpège (Action de Recherche Petite Echelle Grande Echelle) by Météo-France or LMDZ (Laboratoire de Météorologie Dynamique ‘Zoom’).

Contrary to the qtree dictionary, the generation of all ftrees requires the computation of the value of the Jacobian for any tile, so the same amount of memory is required as for the dictionary of general tilings.

This dictionary of ftrees has considerably fewer degrees of freedom than both the dictionary of tilings and the dictionary of qtrees. Moreover, the optimization algorithm over the set of ftrees requires an adaptation of the general algorithm used for the tilings and for the qtrees. Two vectors of filling factors, say αx and αy, one for each direction, are required. The global filling factor, at scales (lx,ly) and position (kx,ky) could thus be a product of the two directional ones (other choices are possible):

\[
\alpha_{(l_x,l_y),(k_x,k_y)} = \left[\alpha_x\right]_{l_x,k_x} \left[\alpha_y\right]_{l_y,k_y}. \qquad (65)
\]

The application of our optimization algorithm leads to the computation of the partition function of Eq. (58). However, its computation is less simple for the ftrees because, on the one hand, α is factorised into two contributions (one for each direction) and, on the other hand, the energies ε_{(l_x,l_y),(k_x,k_y)} cannot be factorised. So there is no trivial factorisation of the partition function according to the two directions.

We have opted to solve this optimization problem iteratively, as sketched below. At first, one of the directions (say Ox) is frozen and the vector αx is fixed; one then solves for αy, using our algorithm applied to a 1D problem. In turn, direction Oy is frozen with the newly obtained αy, and one solves for a new estimate of αx, and so on, until convergence. We have also contemplated a variant of the algorithm where one imposes a fixed number of tiles for each direction, N_x and N_y, the global number of tiles being N = N_x N_y.
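A minimal sketch of this alternating scheme is given below, in Python. The function solve_1d stands in for the 1D application of our optimization algorithm (which depends on the energies and the partition function, not reproduced here); the initial guesses and the convergence test are illustrative choices, not those of the actual implementation.

```python
import numpy as np

def optimize_ftree(solve_1d, n_x, n_y, max_iter=100, tol=1e-8):
    """Coordinate-descent optimization of the two directional
    filling-factor vectors (alpha_x, alpha_y) of an ftree.

    solve_1d(axis, alpha_other) is assumed to return the optimal
    filling-factor vector for one direction, the other direction
    being frozen at alpha_other.
    """
    # Illustrative uniform initial guesses.
    alpha_x = np.full(n_x, 1.0 / n_x)
    alpha_y = np.full(n_y, 1.0 / n_y)
    for _ in range(max_iter):
        alpha_y_new = solve_1d("y", alpha_x)      # freeze Ox, solve for alpha_y
        alpha_x_new = solve_1d("x", alpha_y_new)  # freeze Oy, solve for alpha_x
        converged = (np.max(np.abs(alpha_x_new - alpha_x)) < tol
                     and np.max(np.abs(alpha_y_new - alpha_y)) < tol)
        alpha_x, alpha_y = alpha_x_new, alpha_y_new
        if converged:
            break
    return alpha_x, alpha_y
```

In the variant with fixed directional tile numbers, solve_1d would in addition enforce the prescribed N_x or N_y at each pass.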

This geometry is tested on the ETEX-I example and its performance is compared to the skills of the tilings and the qtrees in Figure 8. The results are significantly inferior to the performance of the qtrees, while the optimization is substantially more complicated. That is why we recommend qtrees over ftrees in this context.

7. Summary, discussion and future work

7.1. Summary

In this article, we have developed a consistent Bayesian framework for the optimal design of control space in geophysical data assimilation. Prior information on the parameters of control space, including correlation of errors, is now accounted for and embedded in a multiscale framework. Prior information is also consistently used in the prolongation operator, so that every bit of available information is used when moving up and down the scale ladder. Note that, since the control space parameters can depend on both space and time, this framework accounts for space and time together.

Observation errors originating from aggregation were also explicitly considered in this framework. These scale-covariant errors consistently yield scale-invariant innovation statistics. The impact of observation errors on the optimal design of the representation was illustrated in a CO2 flux inversion context using a simplified CarboEurope-IP monitoring network. More general scale-dependent errors, such as complex model errors, could not be studied here since they are case-specific.

New objective functions to rank the adaptive grids of a dictionary of representations of control space have been defined. The first one is a normalised measure of the uncertainty, Tr(B^{-1}P^a), which is similar to the criterion at the heart of the Best Linear Unbiased Estimator (BLUE) approach used in most current data assimilation schemes. Minimizing it is equivalent to maximizing the degrees of freedom for the signal (DFS). This DFS measure, together with scale-covariant errors, leads to an elegant criterion which is easier to optimize.
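For reference, these quantities are connected by the standard BLUE identities (a sketch in the usual notation, with n the dimension of control space, H the Jacobian of the observation operator, and R the observation-error covariance matrix):

\[
\mathbf{P}^{\rm a} = \left( \mathbf{B}^{-1} + \mathbf{H}^{\rm T} \mathbf{R}^{-1} \mathbf{H} \right)^{-1},
\qquad
\mathbf{K} = \mathbf{P}^{\rm a} \mathbf{H}^{\rm T} \mathbf{R}^{-1},
\]
\[
{\rm DFS} = {\rm Tr}(\mathbf{KH}) = n - {\rm Tr}\!\left( \mathbf{B}^{-1} \mathbf{P}^{\rm a} \right),
\]

where KH is the averaging kernel operator; the last equality follows from P^a = (I_n − KH)B, so that minimizing Tr(B^{-1}P^a) indeed maximizes the DFS.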

However, this DFS criterion is an implicit statistical average over all potential observation sets prescribed by the prior. That is why an observation-dependent criterion has been defined, which corresponds to a gain of information in the inference. Application to the real tracer dispersion campaign ETEX-I has shown that the optimal grid obtained from this new criterion is not only refined around the observation sites and upwind of those stations, but also in areas that an inversion of these observations would point to. However, one may object that an inversion crime is committed when using such a data-dependent cost function, since the adaptive grid that is used to perform Bayesian inverse modelling with a set of observations has been constructed with the help of the same set of observations. A solution to this subtle issue is left for future work.

The existence of an optimal number of tiles N was also discussed. All the well-controlled examples given here lead to choosing the largest numerically affordable N. But it was shown that taking into account a more complex model error may lead to a finite optimal N. So this issue remains very dependent on the physical context and on the specification of the model across the various scales.

The choice of the representation dictionary on which these criteria are optimized is another issue of practical concern. General tilings, where grid cells are defined as Kronecker products of leaves of 1D binary tree structures, offer a rich set, but the numerical optimization scheme can be computationally demanding. As an alternative, we have implemented and tested a qtree structure, where spatial tiles belong to a quaternary tree. Qtrees form a subset of the dictionary of tilings; it is more economical and faster to optimize within that subset. Furthermore, it has been shown on the ETEX-I example that optimal qtrees can be almost as efficient as optimal tilings. This is not so for another class of representations, the ftrees, whose skills are significantly inferior, at the price of a greater complexity in the optimization algorithm.

7.2. Connection with other multiscale data assimilation approaches

The introduction of consistent multiscale formalisms is very recent in data assimilation, even though the inner and outer loops of 4D-Var can be seen as a precursor methodology (Courtier, 1994). Exploiting the framework developed by Willsky (2002), Zhou et al. (2008) have introduced a multiscale tree structure, with a model operating at a different scale assigned to each level of the tree. Using conditional probabilities and Bayes' rule, the information carried by the observations is propagated up and down the tree. This formalism is meant to be efficient with ensemble Kalman filtering. In a variational context, a fully consistent 4D-Var scheme has been developed on top of a two-way nested model (Simon et al., 2011). It has been used to propagate information back and forth between the coarser and the finer grids. Multigrid methods used in the numerical solution of partial differential equations are also percolating into data assimilation, although making them consistent with a data assimilation method is a challenge. As a preliminary step, Neveu et al. (2010) have tested such a scheme on a Burgers equation. The main advantage of multigrid methods is the acceleration of the convergence of the data assimilation scheme; in particular, it was shown to outperform the inner- and outer-loop scheme.

These formalisms, as well as the one of this article, are derived from first Bayesian principles. Therefore, to a large extent, they should be equivalent. However, they most naturally apply to different data assimilation schemes: Kalman filters, 4D-Var, or BLUE matrix equations in our case. That is why making connections between them cannot be a simple task.

7.3. Extension of the formalism to general data assimilation problems

The formalism developed in this article is expected to suit environmental problems whose monitoring network is known a priori, so that a grid optimization of control space can be performed prior to any inference. Yet the observation sites are not necessarily fixed, since the optimal representations can be dynamical. It is expected to be of primary interest for systems with sparse and inhomogeneous observations, and for data assimilation systems where the observational information is not propagated far, or is propagated anisotropically. At the very least, the methodology can help assess the areas which are poorly resolved (by the conjunction of models and observations).

The formalism of this study has been developed using the Jacobian of the system. In geophysics, the computation of the Jacobian is not always affordable, especially when the evolution model is nonlinear. To generalize our methodology to the nonlinear forecasting context of meteorology or oceanography, one needs to optimize the representations and compute the required pieces of the Jacobian when needed, as in a standard 4D-Var. The main difficulty is that the optimal grids require access to second-order sensitivities, which are related to the Hessian B^{-1} + H^T R^{-1} H. We anticipate that this might be achieved using the randomization techniques of Desroziers et al. (2005) or a stochastic gradient algorithm. With ensemble Kalman filter methods, the error covariance matrix is more easily accessed, since it is given by the empirical statistics of the ensemble. With such filters, as opposed to 4D-Var, control or state space representations can only be optimized in space, not in time. After adequate inflation and localization, the methodology could be used to optimize the representations of state space and of control space. Furthermore, the methodology might be seen as a substitute for the localization of the raw empirical error covariance matrix: instead of choosing an adequate localization length, one chooses an adequate number of grid cells for the representation. The selection of an optimal representation would project the raw covariance matrix onto the active (real) degrees of freedom of the problem, curing any rank deficiency. The methodology is adaptive and can capture structured regions of the error covariance matrix that might be smoothed out by standard uniform localization methods.
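As an illustration of the kind of randomization that could serve this purpose, the sketch below estimates a trace such as DFS = Tr(KH) from matrix–vector products alone, which tangent-linear and adjoint codes can provide. This is a generic Hutchinson-type estimator, given as a minimal sketch; it is not the specific scheme of Desroziers et al. (2005), and matvec is a hypothetical handle to the operator of interest.

```python
import numpy as np

def hutchinson_trace(matvec, n, n_samples=64, seed=0):
    """Estimate Tr(M) for an operator M available only through
    matrix-vector products, using Rademacher probe vectors:
    E[z^T M z] = Tr(M) when the entries of z are +/-1 with
    equal probability.
    """
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        z = rng.choice((-1.0, 1.0), size=n)  # Rademacher probe vector
        total += z @ matvec(z)               # one application of M
    return total / n_samples
```

For instance, matvec could apply z -> KHz through one tangent-linear integration followed by the gain, so that the DFS of a candidate representation can be compared without ever forming the Jacobian explicitly.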

7.4. Towards computationally efficient designs

An optimization on tilings or qtrees could still be quite time-consuming when the number of grid cells in the finest grid reaches several hundred thousand, and when the hierarchical structure is deep. The application of the theory to large-dimensional systems, such as those contemplated earlier, may therefore be computationally challenging. As a shortcut, an analytical approach based on the asymptotic properties of the optimal grids has been developed to offer an approximate but quick solution to the Bayesian design of control space. This will be reported in Part II of this article.

Acknowledgements

The authors would like to thank the associate editor, Dan Cornford, and two anonymous reviewers for their constructive and extensive comments and suggestions. The authors are grateful to Michel Ramonet for providing the CarboEurope-IP network characteristics. They thank Peter Rayner for stimulating discussions, and Monika Krysta for a careful reading of the manuscript. This article is a contribution to the MultiScale Data Assimilation in Geophysics (MSDAG) project supported by the Agence Nationale de la Recherche, grant ANR-08-SYSC-014.

Appendix

Alternate statistical regularisation

Before enforcing the tile number and the one tile–one point constraints, a tile can either be selected or not in the representation, with a probability that depends on its energy εl,k. Following this idea, one is led to a derivation similar to that of subsection 4.4, but with a more physical touch. It is assumed that the prior distribution (before imposing the constraints) of the tiles follows a Bernoulli law: tile (l,k) is a priori selected with probability

\[
p_{l,k} = \frac{1}{1 + e^{\epsilon_{l,k}}}, \qquad (A.1)
\]

which is the standard distribution factor of systems following Fermi–Dirac statistics. The prior law is then

\[
\nu(\omega) = \prod_{(l,k)} p_{l,k}^{\,\omega_{l,k}} \left(1 - p_{l,k}\right)^{1-\omega_{l,k}}, \qquad (A.2)
\]

where ω = {ω_{l,k}} and ω_{l,k} ∈ {0,1} indicates whether tile (l,k) is selected.

One should then minimize the gain of information (maximize the entropy) from the prior distribution to the equilibrium distribution of tiles that satisfies the constraints. The information gain is measured by the Kullback–Leibler divergence (Kullback, 1959)

\[
\mathcal{K}(p,\nu) = \sum_{\omega} p(\omega) \ln \frac{p(\omega)}{\nu(\omega)}. \qquad (A.3)
\]

The resulting cost function to be maximized is

\[
\mathcal{J}(p) = -\mathcal{K}(p,\nu) + \beta \left( N - \sum_{\omega} p(\omega)\, n(\omega) \right), \qquad (A.4)
\]

where n(ω) is the number of tiles selected in configuration ω and β is the Lagrange multiplier enforcing the tile-number constraint.

The rest of the derivation is then unchanged, with the same intermediate and final results.
