Journal of Geophysical Research: Atmospheres

An ensemble-based explicit four-dimensional variational assimilation method

Authors


Abstract

[1] The adjoint and tangent linear models in the traditional four-dimensional variational data assimilation (4DVAR) are difficult to obtain if the forecast model is highly nonlinear or the model physics contains parameterized discontinuities. A new method (referred to as POD-E4DVAR) is proposed in this paper by merging the Monte Carlo method and the proper orthogonal decomposition (POD) technique into 4DVAR to transform an implicit optimization problem into an explicit one. The POD method is used to efficiently approximate a forecast ensemble produced by the Monte Carlo method in a 4-dimensional (4-D) space using a set of base vectors that span the ensemble and capture its spatial structure and temporal evolution. After the analysis variables are represented by a truncated expansion of the base vectors in the 4-D space, the control (state) variables in the cost function appear explicitly, so that the adjoint model, which is used to derive the gradient of the cost function with respect to the control variables in the traditional 4DVAR, is no longer needed. The application of this new technique significantly simplifies the data assimilation process and retains the two main advantages of the traditional 4DVAR method. Assimilation experiments show that this ensemble-based explicit 4DVAR method performs much better than the traditional 4DVAR and ensemble Kalman filter (EnKF) methods. It is also superior to another explicit 4DVAR method, especially when the forecast model is imperfect and the forecast error comes from both the noise of the initial field and the uncertainty in the forecast model. The computational costs of the new POD-E4DVAR are about twice those of the traditional 4DVAR method, but 5% lower than those of the other explicit 4DVAR method and much lower than those of the EnKF method. Another assimilation experiment conducted within the Lorenz model indicates potential wider applications of this new POD-E4DVAR method.

1. Introduction

[2] The four-dimensional variational data assimilation (4DVAR) method [Johnson et al., 2006; Kalnay et al., 2007; Tsuyuki and Miyoshi, 2007] has been a very successful technique used in operational numerical weather prediction (NWP) at many weather forecast centers [Bormann and Thepaut, 2004; Park and Zou, 2004; Caya et al., 2005; Bauer et al., 2006; Rosmond and Xu, 2006; Gauthier et al., 2007]. The 4DVAR technique has two attractive features: (1) the physical (forecast) model provides a strong dynamical constraint, and (2) it has the ability to assimilate the observational data at multiple times. However, 4DVAR still faces numerous challenges in coding, maintaining and updating the adjoint model of the forecast model, and it requires the linearization of the forecast model. Usually, the control variables (or initial states) are expressed implicitly in the cost function. To compute the gradient of the cost function with respect to the control variables, one has to integrate the adjoint model, whose development and maintenance require significant resources, especially when the forecast model is highly nonlinear and the model physics contains parameterized discontinuities [Xu, 1996; Mu and Wang, 2003]. Many efforts have been devoted to avoiding the integration of the adjoint model or reducing the expensive computational costs [Courtier et al., 1994; Kalnay et al., 2000; Wang and Zhao, 2005]. Nevertheless, the tangent linear model of the forecast model is still required in all these methods. On the other hand, the usual ensemble Kalman filter (EnKF) [e.g., Evensen, 1994, 2003; Kalnay et al., 2007; Beezley and Mandel, 2008; also see Appendix A] has become an increasingly popular method because of its simple conceptual formulation and relative ease of implementation. For example, it requires no derivation of a tangent linear operator or adjoint equations, and no integrations backward in time.
Furthermore, the computational costs are affordable and comparable with other popular and sophisticated assimilation methods such as the 4DVAR method. By forecasting the statistical characteristics, EnKF can provide flow-dependent estimates of the background errors using the Monte Carlo method, but it lacks the dynamic constraint of 4DVAR. Heemink et al. [2001] developed a variance-reduced EnKF method by using a reduced-rank approximation technique to reduce the large computational cost. Farrell and Ioannou [2001] also proposed a reduced-order Kalman filter based on the balanced truncation model-reduction technique. Uzunoglu et al. [2007] modified a maximum likelihood ensemble filter method [Zupanski, 2005] through an adaptive methodology. Generally, the three methods mentioned above belong to the family of Kalman filters. Vermeulen and Heemink [2006] have attempted to combine 4DVAR with EnKF; however, the tangent linear model is still needed in their method. How to retain the two primary advantages of the traditional 4DVAR while avoiding the need for an adjoint or tangent linear model of the forecast model has become a roadblock in advancing data assimilation. Recently, Qiu et al. [Qiu and Chou, 2006; Qiu et al., 2007a, 2007b] proposed a new method for 4DVAR (more details below) using the singular value decomposition (SVD) technique based on the theory of the atmospheric attractors. Cao et al. [2007] have applied the proper orthogonal decomposition (POD) technique [Ly and Tran, 2001, 2002; Volkwein, 2008; also see Appendix C] to 4DVAR to reduce the order of the forecast model and thereby the computational costs, but the adjoint integration is still necessary in their method. Luo et al. [2007] also applied the POD technique to the tropical ocean reduced gravity model.

[3] Here we resort to the idea of the Monte Carlo method and the POD technique. The basic idea of the POD technique is to start with an ensemble of data, called snapshots, collected from an experiment or a numerical procedure of a physical system. The POD technique is then used to produce a set of base vectors which span the snapshot collection. The goal is to represent the ensemble of the data in terms of an optimal coordinate system. That is, the snapshots can be generated by the smallest possible set of base vectors. On the basis of this approach, an explicit new 4DVAR method is proposed in this paper: it begins with a 4-D ensemble obtained from the forecast ensembles at all times in an assimilation time window produced using the Monte Carlo method. We then apply the POD technique to the 4-D forecast ensemble, so that the orthogonal base vectors not only capture the spatial structure of the state but also reflect its temporal evolution. After the model state is expressed by a truncated expansion of the base vectors obtained using the POD technique, the control variables in the cost function appear explicitly, so that the adjoint or tangent linear model is no longer needed.

[4] Our new method was motivated by the need to merge the Monte Carlo method into the traditional 4DVAR to transform an implicit optimization problem into an explicit one. Our method not only simplifies the data assimilation procedure but also maintains the two main advantages of the traditional 4DVAR. This method is somewhat similar to Qiu et al.'s SVD-based method (referred to as SVD-E4DVAR hereafter, see Appendix B for details) because they both begin with a 4-D ensemble obtained from the forecast ensembles. However, they differ significantly in several aspects as discussed in section 2.2. Hunt et al. [2004], John and Hunt [2007] and Szunyogh et al. [2008] also developed a 4-D ensemble Kalman filter that infers the tangent linear model dynamics from the ensemble instead of the tangent-linear map as done in the traditional 4DVAR, in which the model states are expressed by the linear combinations of the ensemble samples directly rather than some orthogonal base vectors of the ensemble space. This method is also largely Kalman filtering, with the generation of its ensemble space being different from our method.

[5] We conducted several numerical experiments using a one-dimensional (1-D) soil water equation and synthetic observations to evaluate our new method in land data assimilation. Comparisons were also made between our method, SVD-E4DVAR [Qiu and Chou, 2006; Qiu et al., 2007a, 2007b], traditional 4DVAR, and EnKF. We found that our new ensemble-based explicit 4DVAR (referred to as POD-E4DVAR) performs much better than the usual EnKF method in terms of both increasing the assimilation precision and reducing the computational costs. It is also better than the traditional 4DVAR and SVD-E4DVAR, especially when the forecast model is not perfect and the forecast error comes from both the noise of the initial field and the uncertainty in the forecast model. We also evaluate this approach using the Lorenz model. The corresponding assimilation experiments show that POD-E4DVAR can adjust the forecast state to approach the true Lorenz curve rapidly only by assimilating the observations twice in an assimilation cycle, which indicates its potential applications in other fields.

2. Methodology

2.1. POD-E4DVAR Method

[6] In principle, the traditional, implicit 4DVAR (referred to as I4DVAR) analysis of $x_0$ is obtained through the minimization of a cost function J that measures the misfit between the model trajectory $H_k(x_k)$ and the observation $y_k$ at a series of times $t_k$, k = 1, 2, ⋯, m:

$$J(x_0) = (x_0 - x_b)^T B^{-1} (x_0 - x_b) + \sum_{k=1}^{m} [y_k - H_k(x_k)]^T R_k^{-1} [y_k - H_k(x_k)] \tag{1}$$

with the forecast model $M_{0 \to k}$ imposed as strong constraints, defined by

$$x_k = M_{0 \to k}(x_0) \tag{2}$$

where the superscript T stands for a transpose, b is a background value, index k denotes the observational time, $H_k$ is the observational operator, and matrices B and R are the background and observational error covariances, respectively. The control variables are the initial conditions $x_0$ (at the start of the assimilation time window) of the model. In the cost function (1) the control variable $x_0$ is connected with $x_k$ through forward integration of (2) and expressed implicitly, which makes it difficult to compute the gradient of the cost function with respect to $x_0$. Assuming there are S time steps within the assimilation time window (0, T), generate N random perturbation fields using the Monte Carlo method and add each perturbation field to the initial background field at $t = t_0$ to produce N initial fields $x_n(t_0)$, n = 1, 2, ⋯, N. Integrate the forecast model $x_n(t_i) = M_i(x_n(t_{i-1}))$ with the initial fields $x_n(t_0)$ (n = 1, 2, ⋯, N) throughout the assimilation time window to obtain the state series $x_n(t_i)$ (i = 0, 1, ⋯, S − 1) and then construct the perturbed 4-D fields (snapshots) $X_n$ (n = 1, 2, ⋯, N) over the assimilation time window:

$$X_n = (x_n(t_0), x_n(t_1), \cdots, x_n(t_{S-1})), \quad n = 1, 2, \cdots, N \tag{3}$$

It is obvious that such vectors can capture the spatial structure of the model state and its temporal evolution. All the perturbed 4-D fields $X_n$ (n = 1, 2, ⋯, N) span a finite-dimensional space $\Omega(X_n)$. Similarly, the analysis field $x_a(t_i)$ (i = 0, 1, 2, ⋯, S − 1) over the same assimilation time window can also be stored in the following vector:

$$X_a = (x_a(t_0), x_a(t_1), \cdots, x_a(t_{S-1})) \tag{4}$$

When the ensemble size N is increased by adding random samples, the ensemble space could cover the analysis vector $X_a$, i.e., $X_a$ is approximately assumed to be embedded in the linear space $\Omega(X_n)$. Let $X_{b_n}$ (n = 1, 2, ⋯, K; K ≤ N) be the base vectors of this linear space; the analysis vector $X_a$ can then be expressed as a linear combination of this set of base vectors since it lies in this space, i.e.,

$$X_a = \sum_{n=1}^{K} \beta_n X_{b_n} \tag{5}$$

Substituting (4) and (5) into (1), the control variable becomes $\beta = (\beta_1, \cdots, \beta_K)^T$ instead of $x(t_0)$, so the control variable is expressed explicitly in the cost function and the computation of the gradient is simplified greatly. The tangent linear model or adjoint model is no longer required. To minimize the cost function, equation (1) is transformed into an explicit optimization problem with the variable vector $\beta = (\beta_1, \cdots, \beta_K)^T$, which can be solved by the usual optimization algorithms, such as the quasi-Newton method. It is noted that, unlike EnKF, only one analyzed field is obtained in each analysis procedure in POD-E4DVAR, and the initial condition should be perturbed at the start time of the assimilation in each cycle.
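The ensemble construction described in paragraph [6] can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_snapshots`, the Gaussian form of the perturbations, the parameter `sigma`, and the stand-in model `toy_step` are hypothetical and are not taken from the paper.

```python
import numpy as np

def build_snapshots(x_b0, step, n_members, n_steps, sigma, seed=0):
    """Return the (m * S) x N snapshot matrix whose columns are the 4-D fields X_n.

    x_b0      : background initial state, shape (m,)
    step      : callable advancing a state by one model time step
    n_members : ensemble size N
    n_steps   : number of time levels S stored per member (t_0 .. t_{S-1})
    sigma     : std. dev. of the initial perturbations (Gaussian form is assumed)
    """
    rng = np.random.default_rng(seed)
    m = x_b0.size
    snapshots = np.empty((m * n_steps, n_members))
    for n in range(n_members):
        x = x_b0 + sigma * rng.standard_normal(m)   # perturbed initial field x_n(t_0)
        series = [x]
        for _ in range(n_steps - 1):
            x = step(x)                              # x_n(t_i) = M_i(x_n(t_{i-1}))
            series.append(x)
        snapshots[:, n] = np.concatenate(series)     # X_n stacks all time levels
    return snapshots

def toy_step(x):
    # Hypothetical stand-in for the forecast model: weak nonlinear damping.
    return x - 0.1 * x**3
```

With a 10-component state, S = 48 and N = 60, the result is a 480 × 60 matrix, matching the vector size quoted for the soil water experiments.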

[7] How to obtain the appropriate base vectors remains the only task left. We found that the POD technique is a good choice for doing this. It can produce a set of base vectors spanning the ensemble of data in certain least squares optimal sense [Ly and Tran, 2001, 2002].

[8] The average of the ensemble of snapshots is given by

$$\bar{X} = \frac{1}{N} \sum_{n=1}^{N} X_n$$

We form a new ensemble by focusing on deviations from the mean as follows

$$\delta X_n = X_n - \bar{X}, \quad n = 1, 2, \cdots, N$$

which form the matrix A (M × N), where M = Mg × Mv × S, and Mg and Mv are the number of model spatial grid points and the number of model variables, respectively. To compute the POD modes, one must solve an M × M eigenvalue problem:

$$(A A^T)_{M \times M} \, \phi_k = \lambda_k \phi_k$$

In practice, the direct solution of this eigenvalue problem is often not feasible if M ≫ N, which occurs often in numerical models. We can transform it into an N × N eigenvalue problem through the following transformations:

equation image
equation image
equation image
equation image

In the method of snapshots, one then solves the N × N eigenvalue problem

$$T V_k = \lambda_k V_k, \quad k = 1, \cdots, N$$

where $T = (A^T A)_{N \times N}$, $V_k$ is the kth column vector of V, and $\lambda_k$ is the corresponding kth eigenvalue. The eigenvectors $V_k$ associated with the nonzero eigenvalues $\lambda_k$ (1 ≤ k ≤ N) may be chosen to be orthonormal, and the POD modes are given by $\phi_k = A V_k / \sqrt{\lambda_k}$ (1 ≤ k ≤ N).
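A minimal NumPy sketch of the method of snapshots is given below. The function name `pod_modes` and the relative tolerance used to discard numerically zero eigenvalues are assumptions made for illustration; the normalization by the square root of each eigenvalue follows the expression for the POD modes above.

```python
import numpy as np

def pod_modes(snapshots):
    """POD modes of an M x N snapshot matrix via the N x N eigenvalue problem."""
    mean = snapshots.mean(axis=1, keepdims=True)
    A = snapshots - mean                      # deviations from the ensemble mean
    T = A.T @ A                               # N x N matrix T = A^T A
    lam, V = np.linalg.eigh(T)                # symmetric solver: orthonormal eigenvectors
    order = np.argsort(lam)[::-1]             # sort eigenvalues in descending order
    lam, V = lam[order], V[:, order]
    keep = lam > 1e-10 * lam[0]               # drop numerically zero eigenvalues (assumed tolerance)
    lam, V = lam[keep], V[:, keep]
    modes = (A @ V) / np.sqrt(lam)            # phi_k = A V_k / sqrt(lambda_k)
    return mean[:, 0], lam, modes
```

Because the ensemble mean is removed, at most N − 1 eigenvalues are nonzero. The eigenvalues equal the squared singular values of A, which is the sense in which the POD and SVD techniques are closely related.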

[9] The truncated reconstruction of the analysis variable in the four-dimensional space, $X_a$, is given by

$$X_a = \bar{X} + \sum_{j=1}^{P} \alpha_j \phi_j \tag{9}$$

where P (the number of the POD modes) is defined as follows

$$P = \min\left\{ p : \frac{\sum_{i=1}^{p} \lambda_i}{\sum_{i=1}^{N} \lambda_i} \ge \gamma \right\}, \quad 0 < \gamma \le 1$$

It is well known [Ly and Tran, 2001, 2002] that the expansion (9) is optimal. In particular, among all linear combinations, the POD is the most efficient, in the sense that, for a given number of modes P, the POD decomposition captures the most possible kinetic energy. The solution for the analysis problem is approximately expressed by a truncated expansion of the POD base vectors in the 4-D space. Substituting (9) and (4) into (1), the control variable becomes $\alpha = (\alpha_1, \alpha_2, \cdots, \alpha_P)^T$ instead of $x_0$, so the control variable is expressed explicitly in the cost function and the tangent linear model or adjoint model is no longer needed.
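The choice of P and the truncated expansion can be illustrated with a short sketch. It assumes that P is the smallest number of leading modes whose eigenvalues account for at least a fraction γ of the total, and that the expansion (9) is the ensemble mean plus a combination of the leading modes; the function names are hypothetical.

```python
import numpy as np

def truncation_rank(eigenvalues, gamma):
    """Smallest P whose leading eigenvalues hold at least a fraction gamma of the total."""
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    fraction = np.cumsum(lam) / lam.sum()     # cumulative captured fraction
    p = np.searchsorted(fraction, gamma) + 1  # first index reaching gamma, as a count
    return int(min(p, lam.size))              # guard against round-off when gamma is 1

def reconstruct(mean, modes, alpha):
    """Truncated expansion: mean plus the first len(alpha) modes weighted by alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return mean + modes[:, :alpha.size] @ alpha
```

For example, with eigenvalues 6, 2.5, 1 and 0.5 the cumulative fractions are 0.6, 0.85, 0.95 and 1, so a threshold of 0.8 keeps two modes and a threshold of 0.9 keeps three.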

[10] In the above formulations, the usual optimization algorithms to find the solution of α = (α1, ⋯, αP)T still need an iterative procedure and likely result in higher computational costs. This issue is addressed as follows:

[11] Form the POD mode matrix

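$$\Phi = (\phi_1, \phi_2, \cdots, \phi_P) \tag{11}$$

where $\phi_j = (\phi_j(t_0), \phi_j(t_1), \cdots, \phi_j(t_{S-1}))^T$, j = 1, 2, ⋯, P. Transform (11) into the following format

$$\Phi = (\Phi_0, \Phi_1, \cdots, \Phi_{S-1})^T$$

where $\Phi_k = (\phi_1(t_k), \phi_2(t_k), \cdots, \phi_P(t_k))$. Equation (9) is rewritten as follows: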

equation image

where α = (α1, α2, ⋯, αP)T. The cost function (13) can be transformed into the following

equation image

where Hj is the tangent linear observation operator.

[12] Given the vector of measurements $y_j = (y_{j,1}, \cdots, y_{j,m_j})^T$, where $m_j$ is the size of $y_j$, we can define the N vectors with perturbed observations as

$$y_j^{(i)} = y_j + \varepsilon_i, \quad i = 1, \cdots, N$$

where $\varepsilon_i = (\varepsilon_{i,1}, \varepsilon_{i,2}, \cdots, \varepsilon_{i,m_j})^T$. The ensemble of perturbations, with ensemble mean equal to zero, can be stored in the matrix $E_j = (\varepsilon_1, \varepsilon_2, \cdots, \varepsilon_N)$. The measurement error covariance matrix can be estimated by

$$R_j = \frac{E_j E_j^T}{N - 1}$$
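The perturbed-observation ensemble and the sample estimate of the measurement error covariance can be sketched as below. The Gaussian perturbations, the parameter `obs_std`, and the N − 1 normalization are assumptions in the style of the usual EnKF formulation rather than details confirmed by the text.

```python
import numpy as np

def perturbed_observations(y, obs_std, n_members, seed=0):
    """Return perturbed observations, the perturbation matrix E_j and the estimate of R_j."""
    rng = np.random.default_rng(seed)
    E = obs_std * rng.standard_normal((y.size, n_members))
    E -= E.mean(axis=1, keepdims=True)        # enforce zero ensemble mean
    Y = y[:, None] + E                        # column i holds y_j + eps_i
    R = E @ E.T / (n_members - 1)             # sample covariance of the perturbations
    return Y, E, R
```

The estimate of R_j is symmetric by construction, which is the property used when the gradient of the cost function is derived.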

Because Rj−1 is symmetrical, the gradient of the cost function is obtained through simple calculations

equation image

One can solve the optimization problem as follows

equation image

and

equation image

Equation (18) can be solved directly without an iterative procedure (see Figure 1 for a flowchart of the outlined method).
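To illustrate why no iteration is needed, the following sketch solves for α in closed form. It rests on assumptions that go beyond what the text shows: the cost function is taken to contain only the observation terms, J(α) = Σ_j [y_j − H_j(x̄_j + Φ_j α)]^T R_j^{-1} [y_j − H_j(x̄_j + Φ_j α)], with linear H_j, so that setting the gradient to zero yields a P × P linear system. The function name and argument layout are hypothetical.

```python
import numpy as np

def solve_alpha(H_list, Phi_list, R_list, y_list, xbar_list):
    """Solve the normal equations for alpha under the assumed quadratic cost.

    Each list has one entry per observation time j:
    H (m_j x m), Phi (m x P), R (m_j x m_j), y (m_j,), xbar (m,).
    """
    P = Phi_list[0].shape[1]
    lhs = np.zeros((P, P))
    rhs = np.zeros(P)
    for H, Phi, R, y, xbar in zip(H_list, Phi_list, R_list, y_list, xbar_list):
        G = H @ Phi                           # modes mapped into observation space
        RinvG = np.linalg.solve(R, G)         # R^{-1} G without forming the inverse
        lhs += G.T @ RinvG                    # accumulate G^T R^{-1} G
        rhs += RinvG.T @ (y - H @ xbar)       # G^T R^{-1} (y - H xbar), R symmetric
    return np.linalg.solve(lhs, rhs)
```

The cost of the solve depends on the number of retained modes P rather than on the state dimension M, which is one plausible reading of why the direct solution is cheaper than an iterative minimization.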

Figure 1.

The algorithmic flowchart of the POD-E4DVAR method.

2.2. Difference Between the POD- and SVD-Based Methods

[13] The two explicit (i.e., the POD- and SVD-based) methods both essentially construct a forecast ensemble $\Omega(X_n)$ in a 4-D space with a view to covering the analysis state over each assimilation time window. Since the 4-D analysis vector is assumed to be in the linear space, it can be expressed by a set of base vectors of this space. How to extract the base vectors becomes the key to the two methods: neither of the two methods decomposes the ensemble $A_{M \times N} = (X_1, X_2, \cdots, X_N)$ directly (i.e., $\delta X_n = X_n - 0$, n = 1, ⋯, N, referred to as Full-E4DVAR hereafter); instead, each forms another ensemble by focusing on deviations from the mean vector (POD-E4DVAR) or from the background vector (SVD-E4DVAR). The two different vector transformations ($\delta X_n = X_n - \bar{X}$ versus $\delta X_n = X_n - X_b$) differentiate them clearly and affect their assimilation performance considerably even though the POD and SVD techniques are closely related (see http://www.uni-graz.at/imawww/volkwein/svd.ps):

[14] Given a forecast ensemble matrix $A_{M \times N} = (X_1, X_2, \cdots, X_N)$, $X_n \in R^M$, one can define the following L2 norm ∥·∥:

equation image

As in (X. Tian and Z. Xie, Effects of sample density on the performance of an explicit four-dimensional variational data assimilation method, submitted to Science in China (D), 2008), we define the mean norm equation image of the ensemble A

equation image

For a given sample number N, we propose a concept of sample density as follows:

equation image

where Γ is the Gamma function. We want to look for a vector X* ∈ RM to minimize

equation image

which leads to the maximum sample density ρmax = equation image. The assimilation performance would be most efficient if ρ = equation image reaches its maximum value ρmax for the given sample ensemble, which means one single sample vector can represent the least “true space” and then gain the best assimilation effect. Straightforward calculations show that the function (22) is minimized when $X^* = \frac{1}{N} \sum_{n=1}^{N} X_n$. It should be noted that even minor deviations, which always exist, of the background vector $X_b$ from the mean vector $\bar{X}$ would result in the sample density difference being magnified greatly, by M orders (M ∼ 10^6 in the commonly used climate models):

equation image
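The statement that the ensemble mean is the minimizer can be checked numerically. The sketch below assumes that function (22) is, up to a constant factor, the average squared L2 distance between the samples and the candidate vector X*; the function name is hypothetical.

```python
import numpy as np

def mean_squared_distance(snapshots, center):
    """Average of ||X_n - center||^2 over the columns X_n of an M x N matrix."""
    d = snapshots - center[:, None]
    return float(np.mean(np.sum(d * d, axis=0)))
```

Under this assumption, shifting the center away from the ensemble mean by a vector δ increases the value by exactly ∥δ∥², so any offset between the background vector and the mean enlarges the mean norm and, through the power M, reduces the sample density.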

The above POD-E4DVAR method also differs from SVD-E4DVAR significantly in two technical aspects:

[15] 1. The 4-D sample in the SVD-E4DVAR method is only composed of the state vectors at the observational times over the assimilation time window, while it is composed of the state vectors at all the time steps over the assimilation time window in the POD-E4DVAR method. The latter contains the most possible forecast information in the assimilation time window.

[16] 2. The application of the matrix transformation technique in the POD-E4DVAR greatly lowers the computational costs by reducing the decomposition to an N × N eigenvalue problem (N ≪ M).

3. Evaluations in a 1-D Soil Water Model

[17] In this section, the applicability of this new method is evaluated through several assimilation experiments with a simple 1-D soil water equation model used in the NCAR Community Land Model (CLM) [Oleson et al., 2004]. In addition, we also compare assimilation results using the Full-E4DVAR, SVD-E4DVAR, I4DVAR, and EnKF methods.

3.1. Setup of Experiments

[18] The volumetric soil moisture (θ) for 1-D vertical water flow in a soil column in the CLM is expressed as

$$\frac{\partial \theta}{\partial t} = -\frac{\partial q}{\partial z} - E - R_{fm} \tag{24}$$

where q is the vertical soil water flux, E is the evapotranspiration rate, and Rfm is the melting (negative) or freezing (positive) rate, (for simplicity, E, Rfm are taken as zero in the experiments), and z is the depth from the soil surface. Both q and z are positive downward.

[19] The soil water flux q is described by Darcy's law [Darcy, 1856]:

equation image

where $k = k_s (\theta / \theta_s)^{2b+3}$ is the hydraulic conductivity, $\phi = \phi_s (\theta / \theta_s)^{-b}$ is the soil matrix potential, and $k_s$, $\phi_s$, $\theta_s$ and b are constants. The CLM computes soil water content in the 10 soil layers through (24) and (25) (see Oleson et al. [2004] for details). The upper boundary condition is

equation image

where q0(t) is the water flux at the land surface (referred to as infiltration), and the lower boundary condition is ql = 0. The time step Δt is 1800 s (0.5 hour).

[20] We took a site at (47.43°N, 126.97°E) as the experimental site. The soil parameters ks, ϕs, θs and b at this site were calculated by the CLM using the high-resolution soil texture data released with the CLM by NCAR: θs = 0.46 m³/m³, ks = 2.07263 × 10⁻⁶ m/s, b = 8.634, ϕs = −3.6779 m. We then ran the model at the site forced with observation-based 3-hourly forcing data [Qian et al., 2006; Tian et al., 2007] from 1 January 1992 to 31 December 1993, after a ten-year spin-up, to obtain a two-year time series of simulated infiltration (i.e., the water flux q at the surface, cf. equation (25b)) for driving the soil water hydrodynamic equation (24). We used the first year (1 January 1992 to 31 December 1992) of CLM-simulated infiltration as the “perfect” infiltration series, and took the second year as the “imperfect” infiltration series (Figure 2). In our experiments, we integrated the soil water hydrodynamic equation (24) forced by the two infiltration time series for 365 days separately: equation (24) forced by the “perfect” infiltration series represents the perfect forecast model, whose forecast error comes only from the noise in the initial (soil moisture) field; in contrast, equation (24) forced by the “imperfect” infiltration series acts as the “imperfect” forecast model, whose forecast error comes not only from the noise of the initial field but also from the uncertainty in the forecast model itself.
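For orientation, the hydraulic functions can be evaluated with the site parameters quoted above. The sketch assumes the Clapp and Hornberger power-law forms, with exponent 2b + 3 for the conductivity and −b for the matrix potential; the sign of the latter exponent is an assumption, and the function names are hypothetical.

```python
THETA_S = 0.46        # saturated volumetric water content, m3/m3
K_S = 2.07263e-6      # saturated hydraulic conductivity, m/s
B = 8.634             # exponent parameter b
PHI_S = -3.6779       # saturated matrix potential, m

def hydraulic_conductivity(theta):
    """k = k_s * (theta / theta_s) ** (2b + 3)."""
    return K_S * (theta / THETA_S) ** (2 * B + 3)

def matrix_potential(theta):
    """phi = phi_s * (theta / theta_s) ** (-b); the exponent sign is assumed."""
    return PHI_S * (theta / THETA_S) ** (-B)
```

The large exponent (2b + 3 is about 20 for this soil) makes the conductivity drop by several orders of magnitude as the soil dries, which gives a sense of how strongly nonlinear the soil water equation is.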

Figure 2.

The “perfect” (solid line) and “imperfect” (dashed line) infiltration time series used in the assimilation experiments.

[21] Figure 3 shows the “imperfect” and the “perfect” initial soil moisture profiles, which were obtained by randomly taking two CLM-simulated soil moisture profiles generated while producing the infiltration series. These profiles represent the initial fields with and without noise. The “perfect” (or “true”) state was produced by integrating the “perfect” model with the “perfect” initial soil moisture profile for 365 days. The “observations” were generated by adding 3% random error perturbations to the time series of the “perfect” state (i.e., “observation” = (1 + ɛ) × “perfect”, where ɛ is a real random number varying from −3% to 3%), and these “observations” were assimilated using the various methods in the assimilation experiments (but not in the forecast experiments). In addition, two separate forecast states were produced by integrating the perfect and imperfect models with the “imperfect” initial soil moisture separately: in the former case, the forecast error comes only from the noise in the initial field, but in the latter case it comes from both the noise and the uncertainty in the forecast model.
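The generation of the synthetic “observations” can be sketched as follows. The uniform distribution of ɛ over [−3%, 3%] is an assumption (the text only gives the range), and the function name is hypothetical.

```python
import random

def make_observations(truth, max_rel_error=0.03, seed=0):
    """Return (1 + eps) * truth for each value, with eps drawn from [-max, +max]."""
    rng = random.Random(seed)                 # seeded for reproducibility
    return [(1.0 + rng.uniform(-max_rel_error, max_rel_error)) * v for v in truth]
```

Because the error is multiplicative, its absolute size scales with the soil moisture value itself.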

Figure 3.

The “perfect” (solid line) and “imperfect” (dashed line) initial soil moisture profiles used in the assimilation experiments.

[22] The length of an assimilation time window in our experiments is one day (48 time steps), i.e., S = 48. The size of Xn = (xn(t1), xn(t2), ⋯, xn(tS− 1)) in our method is 480, where xn(ti) = (θn1(ti), θn2(ti), ⋯, θn10(ti)) and Mg = 10, Mv = 1. The background and observational error covariance matrices used in the E/I4DVAR methods can be obtained by using the ensemble covariance matrices defined by equations (A4) and (A8), respectively. We used γ = 0.90 in our experiments.

[23] Two groups of experiments were done: The perfect model with the “imperfect” initial field as Group 1 and the imperfect model with the “imperfect” initial field as Group 2. Three observation sampling frequencies (hourly, 2-hourly, and 3-hourly) were tested in each group's experiments. For simplicity, the Full-E4DVAR is only tested in Group 2. The ensemble size used in the Full-, POD- and SVD-E4DVAR and EnKF methods was 60 in this study (the impact of the ensemble size on the assimilation results will be discussed in another study). The linearization of the soil moisture equation (24) follows the format of Zhang et al. [2006].

3.2. Experimental Results

[24] To evaluate the performance of the five algorithms (Full-E4DVAR, SVD-E4DVAR, POD-E4DVAR, I4DVAR and EnKF), a relative error is defined as follows

equation image

where the index t0→S−1 denotes an assimilation time window (one day in our experiments), S is the length of an assimilation window (S = 48 in our experiments), f and a denote the forecast state (without assimilation of the “observations”) and the analysis state, respectively, t represents the “true” (“perfect”) state. Thus a relative error of 1% for a given assimilation method would mean that the mean error of the analyzed soil moisture is only 1% of that in the forecast case.
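A relative error of this kind can be sketched as below. The exact formula is not reproduced here; the sketch assumes a ratio of root-sum-square errors of the analysis and of the forecast with respect to the “true” state over one assimilation window, and the function name is hypothetical.

```python
import math

def relative_error(analysis, forecast, truth):
    """Ratio of analysis error to forecast error over one window (assumed RMS form)."""
    num = sum((a - t) ** 2 for a, t in zip(analysis, truth))
    den = sum((f - t) ** 2 for f, t in zip(forecast, truth))
    return math.sqrt(num / den)
```

With this definition, a value below 1 means that assimilation improved on the pure forecast, and a value above 1 means that it degraded it.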

[25] Figures 4 and 5 show that the POD/SVD-E4DVAR methods perform much better than the EnKF and I4DVAR methods in both groups of experiments. The two explicit 4DVAR methods perform almost the same in the Group 1 experiments. Their relative errors for analyzed soil moisture are very small (<1%) when the forecast model is perfect, in which case the forecast error comes only from the noise of the initial field (Figure 4). However, the relative errors of the EnKF method are many times higher than those of POD/SVD-E4DVAR, around 1–2%. The traditional 4DVAR method performs even worse than EnKF, which is consistent with the results of Reichle et al. [Reichle and Entekhabi, 2001; Reichle et al., 2002a, 2002b]. This is expected because the soil water hydrodynamic equation (24) is a highly nonlinear system and the tangent linear operator used in the usual 4DVAR can only propagate analytically with first-order precision, which introduces large errors in variable estimation and leads to sub-optimal performance.

Figure 4.

Relative error (En) for analyzed soil moisture in the assimilation experiments by the perfect model with the “imperfect” initial field.

Figure 5.

Relative error (En) for analyzed soil moisture in the assimilation experiments by the imperfect model with the “imperfect” initial field.

[26] When the forecast model is imperfect, its forecast error comes from both the noise of the initial field and the uncertainty in the model itself. The relative errors of the four methods all become larger in this case (Figure 5), presumably due to the reduced effect of data assimilation under a poorly constrained model. Nevertheless, the relative errors for POD-E4DVAR are substantially smaller than those of the other methods, including SVD-E4DVAR, which performs similarly to the EnKF in this case. The relative errors of POD-E4DVAR are still in the range of 0–6%; however, the relative errors of the I4DVAR and SVD-E4DVAR methods are higher than 6%, and some are even up to 10%. It is also a bit surprising that the SVD-based method is apparently inferior to POD-E4DVAR in some assimilation time windows and even worse than the EnKF method (Figure 4). The relative errors of the Full-E4DVAR method fluctuate in magnitude between 20 and 60% (not shown). Direct comparisons between the Full-E4DVAR assimilations, the simulations and the truth show that Full-E4DVAR sometimes performs even worse than the pure simulation. This can be explained by the difference between the mean norms of Full-, SVD-, and POD-E4DVAR: the time-averaged mean norms of the SVD- and POD-based methods are around 1.8 × 10⁻² and 1.7 × 10⁻², respectively, in the Group 2 experiments (Figure 6). In contrast, the mean norms of Full-E4DVAR are all higher than 2.06 (not shown). The huge difference between their (Full- and POD/SVD-E4DVAR) mean norms leads to a very small sample density ratio (equation image or equation imageM, M = 240). The higher mean norm of Full-E4DVAR results in its lower sample density and poor assimilation performance. Similarly, the ratio between the mean norms of the POD- and SVD-E4DVAR methods is about 0.94, which also affects their sample densities (ρPOD:ρSVD = equation imageM, M = 240) and makes POD-E4DVAR outperform SVD-E4DVAR. Figures 4 and 5 also show that the observation frequency has a larger impact in the I4DVAR method than in the POD-E4DVAR method.
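The effect of raising a mean-norm ratio to the power M can be made concrete with a small calculation. It assumes, as the ratios quoted above suggest, that the sample density scales as the mean norm to the power −M; the function name is hypothetical.

```python
import math

def density_ratio(norm_ratio, dimension):
    """rho_1 / rho_2 for mean norms with r_1 / r_2 = norm_ratio, assuming rho ~ r ** (-M)."""
    # Work in log space so that very large dimensions do not underflow prematurely.
    return math.exp(-dimension * math.log(norm_ratio))
```

Calling `density_ratio(0.94, 240)` shows that a mean-norm ratio only 6% below unity already corresponds to a density ratio of several million under this assumption.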

Figure 6.

Time series of mean norm for the Full-E4DVAR (solid line), SVD-E4DVAR (long dashed line), and POD-E4DVAR (short dashed line) methods in Group 2 experiments.

[27] For the two groups of experiments, the ratio of the computational costs for the four methods (POD-E4DVAR, SVD-E4DVAR, I4DVAR, and EnKF) is about 1:1.05:0.5:30. The high computational cost of the EnKF method is mainly due to the fact that the analysis process involves large matrices and the computation has to be repeated whenever there are observations in the assimilation time window, while in POD-E4DVAR the computation is performed only once in each cycle. The 5% reduction in POD-E4DVAR compared with SVD-E4DVAR results from the application of the matrix transformation technique described in section 2. We also implemented a usual iterative optimization method [Liu and Nocedal, 1989] in POD-E4DVAR within the same framework to investigate how much the direct solution method proposed in this paper reduces the computational costs. The experiments show that the ratio of the computational costs of the direct solution method to those of the iterative method varies between about 1:10 and 1:5. Of course, this conclusion is not absolute but case-dependent, because the scale of the minimization of the cost function and the number of iterations during each assimilation cycle vary greatly among numerical models. The main computational costs of POD-E4DVAR come from the ensemble integrations over the assimilation time window, which can be done on parallel computers. Thus the additional costs of POD-E4DVAR compared with the traditional 4DVAR should not result in real difficulties, and it still costs only one thirtieth of that of the EnKF method in our experiments.

4. Evaluations Within the Lorenz Model

[28] In this section, our approach (POD-E4DVAR) is further evaluated within the Lorenz model to investigate its wider applicability. The Lorenz model is widely used in the data assimilation community to test newly proposed methods: e.g., Xiong et al. [2006] used it to test the performance of the EnKF and PF (particle filter) methods. Their results show that the PFGR (PF with Gaussian resampling) method possesses good stability and accuracy and is potentially applicable to large-scale data assimilation problems.

4.1. Setup of Experiments

[29] The Lorenz system under a chaotic regime is used as a test problem, which is given by the equations (e.g., see http://www.taygeta.com/perturb/node2.html):

$$\frac{dx}{dt} = s(y - x)$$
$$\frac{dy}{dt} = rx - y - xz$$
$$\frac{dz}{dt} = xy - bz$$

For the numerical experiments, the Lorenz system with parameters s = 10, r = 28, b = 8/3 was integrated using a second-order Runge-Kutta method, with Δt = 0.1 and initial conditions x(0) = −1.5, y(0) = −1.5, z(0) = 25 for the true solution and x(0) = −1.52, y(0) = −1.3, z(0) = 27 for the background solution (a priori forecast). Observations are inserted every 12 time steps. The length of each assimilation time window is 24 time steps.
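A single integration step of the Lorenz system can be sketched as follows. The midpoint variant of the second-order Runge-Kutta scheme is an assumption (the text does not say which variant was used), b = 8/3 is the standard value, and the constant and function names are hypothetical.

```python
S_PARAM, R_PARAM, B_PARAM = 10.0, 28.0, 8.0 / 3.0

def lorenz_rhs(state):
    """Right-hand side of the Lorenz equations for state (x, y, z)."""
    x, y, z = state
    return (S_PARAM * (y - x), R_PARAM * x - y - x * z, x * y - B_PARAM * z)

def rk2_step(state, dt):
    """Advance the state by one midpoint (second-order Runge-Kutta) step."""
    k1 = lorenz_rhs(state)
    mid = tuple(s + 0.5 * dt * k for s, k in zip(state, k1))   # half-step estimate
    k2 = lorenz_rhs(mid)
    return tuple(s + dt * k for s, k in zip(state, k2))        # full step with midpoint slope
```

A quick sanity check of any such integrator is that the fixed points of the system, for instance the origin, are left unchanged by a step.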

4.2. Experimental Results

[30] Figure 7 shows time series of the Lorenz curve coordinates (x, y, z) from observations, assimilations and background forecasts: the assimilated Lorenz curve is adjusted to approach the true curve rapidly by the end of the first assimilation cycle, even though there are only two observations in each assimilation time window. In contrast, the pure forecast state without assimilation begins to deviate from the true solution after about 60 time steps (Figure 7), even though the noise of the initial field (x, y, z) results in only small departures from the true state in the first 48 time steps, i.e., two assimilation time windows.

Figure 7.

Time series of the Lorenz curve coordinates (x, y, z) from observations (solid line), assimilations (long dashed line), and background forecasts (short dashed line).

5. Summary and Concluding Remarks

[31] To retain the main strength of traditional 4DVAR while avoiding the need of an adjoint or tangent linear model of the forecast model in data assimilation, we have developed an ensemble-based explicit 4DVAR method in this paper (called POD-E4DVAR). This new method merges the Monte Carlo method and the proper orthogonal decomposition (POD) technique into the 4DVAR to transform an implicit optimization problem into an explicit one. The POD method efficiently approximates a forecast ensemble produced by the Monte Carlo method in a 4-D space using a set of base vectors that span this ensemble and capture its spatial structure and temporal evolution. After the analysis variables are represented by a truncated expansion of the base vectors in the 4-D space, the control (state) variables in the cost function appear explicit, so that the adjoint model, which is used to derive the gradient of the cost function with respect to the control variables in traditional 4DVAR, is no longer needed. This new method significantly simplifies the data assimilation process and retains the two main advantages of the traditional 4DVAR (i.e., dynamic constraint and assimilation of multiple time observations).

[32] Several numerical experiments performed with a simple 1-D soil water equation show that the new POD-E4DVAR method performs much better than the traditional 4DVAR and EnKF methods, with assimilation errors reduced to a fraction of those of the latter two. It is also superior to SVD-E4DVAR, another explicit 4DVAR method developed by Qiu et al. [2007a, 2007b], especially when the forecast model is imperfect and the error comes from both the noise of the initial field and the uncertainty in the forecast model. In our experiments, the traditional (implicit) 4DVAR method performs worst because the tangent linear operator it uses propagates perturbations with only first-order precision. Another assimilation experiment conducted within the Lorenz model also indicates the potential of POD-E4DVAR for application in numerical weather or climate models. The results show that the POD-E4DVAR method provides a promising new tool for data assimilation.

[33] Several issues, such as the impacts of the ensemble size and the initial perturbation fields on the assimilated results and the actual performance of this new method in real numerical forecast models, still need to be addressed. It also should be pointed out that since this method begins with a 4-D ensemble obtained from the perturbed ensembles, the quality of the results relies heavily on the perturbation method. How to generate a reasonable perturbed field is a critical step in using this method. This aspect also requires further investigation.

Appendix A: Ensemble Kalman Filter (EnKF) Method

A1. Ensemble Representation for Covariance Matrix

[34] One can define the matrix holding the ensemble members $x_i \in \mathbb{R}^{n}$ as

$$A = (x_1, x_2, \ldots, x_N) \in \mathbb{R}^{n \times N}, \tag{A1}$$

where N is the number of ensemble members and n is the size of the model state vector.

[35] The ensemble mean is stored in each column of $\bar{A}$, which can be defined as

$$\bar{A} = A\,1_N, \tag{A2}$$

where $1_N \in \mathbb{R}^{N \times N}$ is a matrix in which each element is equal to 1/N. One can then define the ensemble perturbation matrix as

$$A' = A - \bar{A} = A\,(I - 1_N). \tag{A3}$$

The ensemble covariance matrix $P_e \in \mathbb{R}^{n \times n}$ can be defined as

$$P_e = \frac{A'\,(A')^{T}}{N - 1}. \tag{A4}$$

A2. Measurement Perturbations

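[36] Given a vector of measurements $y \in \mathbb{R}^{m}$, with m being the number of measurements, one can define N vectors of perturbed observations as

$$y_j = y + \epsilon_j, \quad j = 1, \ldots, N, \tag{A5}$$

which can be stored in the columns of a matrix

$$Y = (y_1, y_2, \ldots, y_N) \in \mathbb{R}^{m \times N}, \tag{A6}$$

while the ensemble of perturbations, with ensemble mean equal to zero, can be stored in the matrix

$$\Upsilon = (\epsilon_1, \epsilon_2, \ldots, \epsilon_N) \in \mathbb{R}^{m \times N}, \tag{A7}$$

from which we can construct the ensemble representation of the measurement error covariance matrix

$$R_e = \frac{\Upsilon\,\Upsilon^{T}}{N - 1}. \tag{A8}$$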

A3. Analysis Equation

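[37] The analysis equation, expressed in terms of the ensemble covariance matrices, is

$$A^{a} = A + P_e H^{T} \left( H P_e H^{T} + R_e \right)^{-1} (Y - H A), \tag{A9}$$

where A holds the ensemble members, Y holds the perturbed observations, and H is the measurement operator that maps the model state to the observations. Using the ensemble of innovation vectors defined as

$$Y' = Y - H A \tag{A10}$$

and the definitions of the ensemble error covariance matrices in equations (A4) and (A8), the analysis can be expressed as

$$A^{a} = A + A' (A')^{T} H^{T} \left( H A' (A')^{T} H^{T} + \Upsilon \Upsilon^{T} \right)^{-1} Y', \tag{A11}$$

where A′ is the ensemble perturbation matrix and ϒ is the matrix of observation perturbations. When the ensemble size, N, is increased by adding random samples, the analysis computed from this equation will converge toward the exact solution of equation (A9) with $P_e$ and $R_e$ replaced by the exact covariance matrices P and R.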
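As an illustration of the perturbed-observation analysis step described in this appendix, a minimal NumPy sketch is given below. It is not the implementation used in the experiments. All names (`enkf_analysis`, `ensemble`, `obs_std`, and so on) are hypothetical, the observation operator is assumed to be a linear matrix `H`, the observation errors are assumed to be Gaussian and uncorrelated, and a pseudo-inverse is used for the matrix inversion as a simple safeguard against rank deficiency.

```python
import numpy as np

def enkf_analysis(ensemble, y, H, obs_std, rng):
    """Perturbed-observation EnKF analysis (illustrative sketch).

    ensemble : (n, N) array, one member per column
    y        : (m,) measurement vector
    H        : (m, n) linear observation operator (assumed linear)
    obs_std  : observation error standard deviation (assumed scalar)
    """
    n, N = ensemble.shape
    m = y.shape[0]

    # Ensemble perturbations about the ensemble mean
    pert = ensemble - ensemble.mean(axis=1, keepdims=True)

    # Observation perturbations, re-centred so their ensemble mean is zero
    eps = rng.normal(0.0, obs_std, size=(m, N))
    eps -= eps.mean(axis=1, keepdims=True)
    perturbed_obs = y[:, None] + eps

    # Innovations for every member
    innov = perturbed_obs - H @ ensemble

    # The 1/(N-1) factors of the two covariance matrices cancel in the gain
    h_pert = H @ pert
    c = h_pert @ h_pert.T + eps @ eps.T
    return ensemble + pert @ h_pert.T @ np.linalg.pinv(c) @ innov

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prior = rng.normal(5.0, 1.0, size=(3, 40))   # hypothetical 3-variable ensemble
    H = np.eye(3)[:2]                            # observe the first two variables
    posterior = enkf_analysis(prior, np.array([4.0, 6.0]), H, 0.5, rng)
    print(posterior.mean(axis=1))
```

Because the update has to be recomputed each time observations become available inside the window, the cost of the matrix products and the inversion above is paid repeatedly, which is consistent with the cost comparison discussed in the main text.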

Appendix B: SVD-E4DVAR Method

[38] Assume there are m observations $\mathbf{y}_i$ (i = 0, 1, ⋯, m − 1) at times $t = t_0, \cdots, t_i, \cdots, t_{m-1}$ during the assimilation time window. Generate N random perturbation fields, add each to the initial background field, and integrate the model to produce a perturbed 4-D field over the analysis time window. The ith difference field is then given by $\delta\mathbf{x}_i = \mathbf{x}_i - \mathbf{x}_b$ at times $t = t_0, \cdots, t_i, \cdots, t_{m-1}$, where $\mathbf{x}_b$ and $\mathbf{x}_i$ denote the background and the perturbed fields, respectively. Consider an ensemble of column vectors represented by the matrix $A = (\delta\mathbf{x}_1, \delta\mathbf{x}_2, \cdots, \delta\mathbf{x}_N)$, where the ith column vector $\delta\mathbf{x}_i$ represents the ith sampled data field in a discrete four-dimensional analysis space. The length of the vector $\delta\mathbf{x}_i$ is $M_g \times M_v \times m$, where $M_g$ and $M_v$ are the number of model spatial grid points and the number of model variables, respectively. The SVD of A yields

$$A = B \Lambda V^{T}, \tag{B1}$$

where Λ is a diagonal matrix composed of the singular values of A with $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$ and $\lambda_{r+1} = \lambda_{r+2} = \cdots = 0$; $r \le \min(M_g \times M_v \times m, N)$ is the rank of A; and B and V are orthogonal matrices composed of the left and right singular vectors of A, respectively. The SVD in (B1) gives $C = A^{T}A = V\Lambda^{2}V^{T}$ and $Q = AA^{T} = B\Lambda^{2}B^{T}$. Thus the ith column vector of V, denoted by $V_i$, is the ith eigenvector of C, while the jth column vector of B, denoted by $b_j$, is the jth eigenvector of Q and is called the singular vector of A.

[39] The truncated reconstruction of the analysis variable $\vec{\mathbf{x}}_a$ in 4-D space is given by

$$\vec{\mathbf{x}}_a = \vec{\mathbf{x}}_b + \sum_{j=1}^{P} \alpha_j b_j, \tag{B2}$$

where P (≤ r) is the truncation number, which can be obtained through equation (10) in section 2, and $\vec{\mathbf{x}}_b = (\mathbf{x}_b, \mathbf{x}_b, \cdots, \mathbf{x}_b)$ is composed of m vectors ($\mathbf{x}_b$).

[40] Substituting (B2) into equation (1) in section 2, the control variable becomes $\alpha = (\alpha_1, \cdots, \alpha_P)$ instead of $\mathbf{x}_0$, so the control variable is expressed explicitly in the cost function.
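The SVD step and the truncated expansion described in this appendix can be illustrated with a short NumPy sketch. It is an illustration only. The function names are hypothetical, the cumulative-energy truncation rule is an assumed stand-in for equation (10) of section 2 (which is not reproduced here), and, because the cost function of equation (1) is also not reproduced here, the coefficients α are obtained by simple projection of a given increment onto the leading singular vectors rather than by minimizing that cost function.

```python
import numpy as np

def svd_basis(A, energy=0.99):
    """SVD of the perturbation matrix A (one 4-D difference field per column).

    Returns the leading P left singular vectors, all singular values,
    the right singular vectors, and P. The energy-based choice of P is an
    assumption, not the criterion of the paper.
    """
    B, sing, Vt = np.linalg.svd(A, full_matrices=False)
    frac = np.cumsum(sing**2) / np.sum(sing**2)
    P = min(int(np.searchsorted(frac, energy)) + 1, len(sing))
    return B[:, :P], sing, Vt.T, P

def project_increment(B_P, increment):
    """Expansion coefficients of an increment in the truncated basis.

    The columns of B_P are orthonormal, so the least-squares coefficients
    are simply the projections B_P^T * increment.
    """
    alpha = B_P.T @ increment
    return alpha, B_P @ alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(60, 8))          # hypothetical: 60 4-D points, N = 8
    B_P, sing, V, P = svd_basis(A, energy=0.9)
    alpha, approx = project_increment(B_P, A[:, 0])
    print(P, np.linalg.norm(A[:, 0] - approx))
```

Once the analysis increment is written in terms of the few coefficients α, any cost function that depends on the 4-D state becomes an explicit function of α, which is what removes the need for an adjoint model.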

Appendix C: Proper Orthogonal Decomposition

C1. Continuous Case

[41] Let $U_i(\mathbf{x})$, i = 1, 2, ⋯, N, denote the set of N observations or simulations (also called snapshots) of some physical process taken at position $\mathbf{x} = (x, y)$. The average of the ensemble of snapshots is given by

$$\bar{U}(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} U_i(\mathbf{x}). \tag{C1}$$

[42] We form a new ensemble by focusing on deviations from the mean as follows:

$$V_i(\mathbf{x}) = U_i(\mathbf{x}) - \bar{U}(\mathbf{x}), \quad i = 1, 2, \cdots, N. \tag{C2}$$

[43] We wish to find an optimal compressed description of the sequence of data (C2). One description of the process is a series expansion in terms of a set of base functions. Intuitively, the base functions should in some sense be representative of the members of the ensemble. Such a coordinate system is provided by the Karhunen-Loève expansion, where the base functions Φ are, in fact, admixtures of the snapshots and are given by

$$\Phi = \sum_{i=1}^{N} a_i V_i(\mathbf{x}). \tag{C3}$$

Here, the coefficients $a_i$ are to be determined so that Φ given by (C3) will resemble the ensemble $\{V_i(\mathbf{x})\}_{i=1}^{N}$ most closely. More specifically, we look for a function Φ to maximize

$$\frac{1}{N} \sum_{i=1}^{N} \left| (V_i, \Phi) \right|^{2} \tag{C4}$$

subject to (Φ, Φ) = ∥Φ∥² = 1, where (·,·) and ∥·∥ denote the usual L2 inner product and L2 norm, respectively.

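[44] It follows that the base functions are the eigenfunctions of the integral equation

$$\int C(\mathbf{x}, \mathbf{x}')\,\Phi(\mathbf{x}')\,d\mathbf{x}' = \lambda\,\Phi(\mathbf{x}), \qquad C(\mathbf{x}, \mathbf{x}') = \frac{1}{N} \sum_{i=1}^{N} V_i(\mathbf{x})\,V_i(\mathbf{x}'). \tag{C5}$$

Substituting (C3) into (C5) yields the eigenvalue problem

$$\sum_{j=1}^{N} L_{ij}\,a_j = \lambda\,a_i, \quad i = 1, 2, \cdots, N, \tag{C6}$$

where $L_{ij} = \frac{1}{N}(V_i, V_j)$ is a symmetric and nonnegative matrix. Thus our problem amounts to solving for the eigenvectors of an N × N matrix, where N is the ensemble size of the snapshots. Straightforward calculation shows that the cost function

$$\frac{1}{N} \sum_{i=1}^{N} \left| (V_i, \Phi) \right|^{2} \tag{C7}$$

is maximized when the coefficients $a_i$ of (C3) are the elements of the eigenvector corresponding to the largest eigenvalue of L.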
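The reduction to an N × N eigenvalue problem can be illustrated with a small NumPy sketch of this "method of snapshots". It is a sketch under stated assumptions, not the authors' code: the function name is hypothetical, the continuous L2 inner product is replaced by the Euclidean dot product on a discretized field, and modes with numerically zero eigenvalues are discarded using an arbitrary relative threshold.

```python
import numpy as np

def pod_snapshots(U):
    """POD by the method of snapshots (illustrative sketch).

    U : (n_points, N) array, one snapshot per column.
    Returns eigenvalues (descending), orthonormal base vectors (columns),
    and the mean-removed snapshots.
    """
    N = U.shape[1]
    V = U - U.mean(axis=1, keepdims=True)      # deviations from the ensemble mean
    L = (V.T @ V) / N                          # L_ij = (1/N) (V_i, V_j), an N x N matrix
    lam, a = np.linalg.eigh(L)                 # symmetric eigenproblem, ascending order
    order = np.argsort(lam)[::-1]
    lam, a = lam[order], a[:, order]
    keep = lam > 1e-12 * lam[0]                # drop numerically null modes (assumed threshold)
    lam, a = lam[keep], a[:, keep]
    Phi = (V @ a) / np.sqrt(N * lam)           # admixtures of snapshots, scaled to unit norm
    return lam, Phi, V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    U = rng.normal(size=(200, 10))             # hypothetical: 200 grid points, 10 snapshots
    lam, Phi, V = pod_snapshots(U)
    captured = np.sum((V.T @ Phi[:, 0]) ** 2) / U.shape[1]
    print(lam[0], captured)                    # the two numbers should agree
```

The benefit of this formulation is that the eigenproblem has the size of the ensemble rather than the size of the state, which matters when the number of grid points greatly exceeds the number of snapshots.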

C2. Discrete Case

[45] We consider the discrete Karhunen-Loève expansion to find an optimal representation of the ensemble of snapshots. In the two-dimensional case, each sample of snapshots $U_i(x, y)$ (defined on a set of n × n nodal points (x, y)) can be expressed as an $n^2$-dimensional vector $\mathbf{u}_i$ as follows:

$$\mathbf{u}_i = (u_{i1}, \cdots, u_{ij}, \cdots, u_{in^2})^{T}, \tag{C8}$$

where $u_{ij}$ denotes the jth component of the vector $\mathbf{u}_i$. Here the discrete covariance matrix of the ensemble $\mathbf{u}$ is defined as

$$C_u = E\left\{ (\mathbf{u} - \mathbf{m}_u)(\mathbf{u} - \mathbf{m}_u)^{T} \right\}, \tag{C9}$$

where

$$\mathbf{m}_u = E\{\mathbf{u}\} \tag{C10}$$

is the mean vector and E is the expected value. Equations (C9) and (C10) can be replaced by

$$C_u = \frac{1}{N} \sum_{i=1}^{N} \mathbf{u}_i \mathbf{u}_i^{T} - \mathbf{m}_u \mathbf{m}_u^{T} \tag{C11}$$

and $\mathbf{m}_u = \frac{1}{N} \sum_{i=1}^{N} \mathbf{u}_i$, respectively.
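As a small illustration of the discrete case, the sketch below flattens two-dimensional snapshots into vectors and forms the sample mean and sample covariance. It assumes that the expected value is replaced by an ensemble average with 1/N normalization; the function name and array layout are hypothetical.

```python
import numpy as np

def sample_mean_and_covariance(snapshots):
    """Sample mean vector and covariance matrix of flattened snapshots.

    snapshots : (N, n, n) array of N two-dimensional fields.
    Each field is flattened into an n^2-dimensional vector. The 1/N
    normalization is an assumption of this sketch.
    """
    N = snapshots.shape[0]
    vecs = snapshots.reshape(N, -1)            # row i is the flattened ith snapshot
    mean = vecs.mean(axis=0)
    cov = vecs.T @ vecs / N - np.outer(mean, mean)
    return mean, cov

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    fields = rng.normal(size=(12, 4, 4))       # hypothetical: 12 snapshots on a 4 x 4 grid
    mean, cov = sample_mean_and_covariance(fields)
    print(mean.shape, cov.shape)
```

The covariance matrix has dimension n² × n², so for realistic grids it is usually cheaper to work with the N × N snapshot matrix of the continuous case instead of forming this matrix explicitly.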

Acknowledgments

[46] This work was supported by the National Natural Science Foundation of China under grant 40705035, the Knowledge Innovation Project of Chinese Academy of Sciences under grants KZCX2-YW-217 and KZCX2-YW-126-2, the National Basic Research Program under grant 2005CB321704, and the Chinese COPES project (GYHY200706005). The National Center for Atmospheric Research is sponsored by the U.S. National Science Foundation.
