### Abstract

[1] Data assimilation is increasingly being used to merge remotely sensed land surface variables such as soil moisture, snow, and skin temperature with estimates from land models. Its success, however, depends on unbiased model predictions and unbiased observations. Here a suite of continental-scale, synthetic soil moisture assimilation experiments is used to compare two approaches that address typical biases in soil moisture prior to data assimilation: (1) parameter estimation to calibrate the land model to the climatology of the soil moisture observations and (2) scaling of the observations to the model's soil moisture climatology. To enable this research, an optimization infrastructure was added to the NASA Land Information System (LIS) that includes gradient-based optimization methods and global, heuristic search algorithms. The land model calibration eliminates the bias but does not necessarily result in more realistic model parameters. Nevertheless, the experiments confirm that model calibration yields assimilation estimates of surface and root zone soil moisture that are as skillful as those obtained through scaling of the observations to the model's climatology. Analysis of innovation diagnostics underlines the importance of addressing bias in soil moisture assimilation and confirms that both approaches adequately address the issue.

### 1. Introduction

[2] Land data assimilation systems merge satellite or in situ observations of land surface fields (such as soil moisture, snow and skin temperature) with estimates from land surface models. Observations are often discontinuous in space and time, and their incorporation into the modeled estimates helps generate spatially complete and temporally continuous estimates of land surface fields. The process of combining observations and model forecasts is typically carried out by weighting each on the basis of their respective errors. The uncertainty in model states results from model structural deficiencies, errors in model parameter specifications and input forcings. Similarly, observational data also suffer from errors caused by instrument noise and errors associated with the retrieval models. A key assumption in most data assimilation techniques is that the errors in observations and model forecasts are strictly random and that on average, the observations and model estimates agree with the true estimates. In reality, however, biases are unavoidable and it is difficult to attribute the bias to the model or the observations. Nevertheless, the proper treatment of such systematic errors is critical for the success of data assimilation systems [*Dee and da Silva*, 1998].
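For a single scalar state, the error-based weighting described above can be written in a standard textbook form. This is an illustrative sketch, not a formula taken from this paper, and the symbols are introduced here: $x^{f}$ is the model forecast, $y$ the observation, $x^{a}$ the merged (analysis) estimate, and $\sigma_f^{2}$, $\sigma_o^{2}$ the forecast and observation error variances.

$$
x^{a} = x^{f} + K\,\bigl(y - x^{f}\bigr), \qquad K = \frac{\sigma_f^{2}}{\sigma_f^{2} + \sigma_o^{2}}
$$

If both errors have zero mean, $x^{a}$ is unbiased. If instead the observation carries a systematic offset $b$ relative to an unbiased forecast, the analysis inherits an average offset of $K\,b$, which is why biases have to be treated separately from the random errors that the weighting is designed for.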

[3] A number of prior studies have described techniques to address the treatment of bias errors in data assimilation systems. *Dee* [2005] characterizes the data assimilation systems as either “bias blind” or “bias aware” on the basis of their treatment of systematic errors. The bias-blind systems are designed to correct random, zero-mean errors and assume the use of unbiased observations relative to the model-generated background. For soil moisture, the absolute levels of continental-scale estimates from land surface models and satellite observations differ significantly [*Reichle et al.*, 2004, 2007], which implies a need for “bias-aware” approaches to soil moisture assimilation. An often used method to address such biases is to rescale the observations prior to data assimilation in such a way that the observational climatology matches that of the land model [*Reichle and Koster*, 2004; *Drusch et al.*, 2005; *Crow et al.*, 2005; *Slater and Clark*, 2006; *Reichle et al.*, 2007; *Draper et al.*, 2009; *Kumar et al.*, 2009; *Reichle et al.*, 2010; *Liu et al.*, 2011; *Draper et al.*, 2011]. Put differently, these so-called a priori scaling approaches assimilate normalized deviates or percentiles instead of the raw observations. A priori scaling is easy to implement as a preprocessing step to the data assimilation system and does not make assumptions about whether the climatology of the model or that of the observations is more correct. Although the resulting analyses are produced in the model's climatology, they can be scaled back to the observational climatology, if needed. However, since the computation of the climatologies is conducted as a preprocessing step, the corrections cannot easily be adjusted to dynamic changes in bias.
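The two flavors of a priori scaling mentioned above (normalized deviates and percentiles) can be illustrated with a minimal sketch. The function names and the nearest-rank percentile lookup are assumptions made for illustration; the cited studies may compute the climatologies and the mapping differently (for example per grid cell or per season).

```python
from bisect import bisect_right
from statistics import mean, pstdev


def scale_standard_normal(obs, obs_clim, model_clim):
    """Rescale obs so that its mean and standard deviation match the model record."""
    mu_o, sd_o = mean(obs_clim), pstdev(obs_clim)
    mu_m, sd_m = mean(model_clim), pstdev(model_clim)
    # Convert to a standard normal deviate, then express it in model units
    return [mu_m + (y - mu_o) / sd_o * sd_m for y in obs]


def scale_cdf_match(obs, obs_clim, model_clim):
    """Map each observation to the model value with the same empirical percentile."""
    obs_sorted = sorted(obs_clim)
    model_sorted = sorted(model_clim)
    n_o, n_m = len(obs_sorted), len(model_sorted)
    scaled = []
    for y in obs:
        # Empirical non-exceedance probability of y in the observation record
        p = bisect_right(obs_sorted, y) / n_o
        # Nearest-rank model value at (approximately) the same percentile
        idx = min(max(int(round(p * n_m)) - 1, 0), n_m - 1)
        scaled.append(model_sorted[idx])
    return scaled
```

Both functions act only on the observations, so they can run as a preprocessing step without touching the model, which is the property the paragraph above highlights.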

[4] Dynamically bias-aware assimilation systems, on the other hand, incorporate specific assumptions about the nature of biases and are specifically built to estimate and correct them. These strategies typically attribute the bias to either the model or the observations and use the analysis increments in the data assimilation system to estimate the bias. Variants of such dynamic bias correction strategies have been used in soil moisture assimilation studies [*De Lannoy et al.*, 2007a, 2007b] and for land surface temperature assimilation by *Bosilovich et al.* [2007] and *Reichle et al.* [2010]. In these studies, the observations are assumed to be unbiased, and the bias is attributed exclusively to the model. In reality, however, the retrievals from different sensors may be biased against each other [*Reichle et al.*, 2007; *Trigo and Viterbo*, 2003]. The key advantage of the dynamic bias estimation and correction approaches is their ability to adapt to transient changes in bias.

[5] In this paper, we explore an alternative strategy for a priori bias correction that has not been used for continental-scale soil moisture assimilation: the a priori calibration of land surface model (LSM) parameters. We use optimization algorithms to estimate model parameters that minimize the bias between model forecasts and observations. Similar to the a priori scaling methods discussed above, the a priori calibration approach complements the state update steps of the data assimilation system. In the latter, the model forecast is modified only when observations are present. In the absence of observational information, the model will revert back to its original climatology. Adjusting model parameters offers a way to bring the model's climatology in line with that of the observations, including at times and locations where observations are intermittently absent. Like a priori scaling, a priori model calibration does not adjust dynamically to changes in model or observation bias.

[6] Model parameters have long been recognized as a key source of errors in model predictions, and many LSM studies have focused on the application of techniques to estimate them [*Duan et al.*, 1992; *Burke et al.*, 1997; *Gupta et al.*, 1999; *Hogue et al.*, 2005; *Liu et al.*, 2004, 2005; *Santanello et al.*, 2007; *Peters-Lidard et al.*, 2008; *Lambot et al.*, 2009; *Gutman and Small*, 2010; *Nearing et al.*, 2010]. These studies estimate LSM parameters using independent observations of variables such as soil moisture, streamflow and surface temperature. In addition, data assimilation studies have also recognized the need to update and estimate model parameters for improving the model's predictive skills. A number of studies have examined the potential of parameter estimation in conjunction with state estimation in sequential data assimilation systems [*Boulet et al.*, 2002; *Moradkhani et al.*, 2005a, 2005b; *Qin et al.*, 2009; *Yang et al.*, 2009; *Nie et al.*, 2011; *Montzka et al.*, 2011; *DeChant and Moradkhani*, 2011]. These approaches, known as joint estimation or state augmentation methods, estimate the model parameters concurrently with the model states. Such approaches, however, have difficulties in handling the relative time invariance of parameters (compared to model states) and very large parameter spaces [*Liu and Gupta*, 2007]. *De Lannoy et al.* [2007a] note that in some situations it may be better to estimate the bias separately rather than correct it using state augmentation methods. An approach that employs the simultaneous use of optimization and data assimilation was described by *Vrugt et al.* [2005], where the model parameters are estimated through the recursive calibration over a data assimilation instance. This method considers the estimation of model parameter sets for generating the best possible forecasts, when model states are also adjusted through sequential data assimilation. The advantages and limitations of these joint state and parameter estimation approaches are discussed in detail by *Liu and Gupta* [2007].

[7] Here we compare, in the context of data assimilation, the approach of bias mitigation through the estimation of model parameters against a priori bias correction strategies that rescale the observations to conform to the model's climatology. The parameter estimation is performed in a “batch calibration” mode, where a set of observational data is used to estimate time-invariant model parameters with the objective of minimizing the climatological differences between the model and the observations. The model with the calibrated parameters is subsequently employed in the data assimilation system to assimilate the raw, unscaled observations. In contrast, the scaling approaches essentially assimilate the anomaly information instead of the raw observations. We investigate these methods with a soil moisture assimilation case study. A new generation of satellite soil moisture retrievals is becoming available from the recently launched Soil Moisture and Ocean Salinity (SMOS) [*Kerr et al.*, 2010] and the planned Soil Moisture Active Passive (SMAP) [*Entekhabi et al.*, 2010b] missions. The results from our study are directly relevant to the effective utilization of these new observations in land data assimilation systems.

[8] The experiments presented in this paper are conducted using the NASA Land Information System (LIS) [*Kumar et al.*, 2006; *Peters-Lidard et al.*, 2007], which is a multiscale modeling system for hydrologic applications developed with the goal of integrating satellite- and ground-based observational data products and advanced land surface models and techniques to generate improved estimates of land surface conditions. LIS includes a suite of subsystems to support land surface modeling for a variety of applications, including a comprehensive sequential data assimilation system, based on the NASA Global Modeling and Assimilation Office's infrastructure [*Reichle et al.*, 2009; *Kumar et al.*, 2008b]. More recently, a generic optimization subsystem has been developed within LIS, with the goal of combining the use of optimization and data assimilation in an integrated framework. This new extension to LIS will be described in detail below and was used to facilitate the experiments discussed here.

[9] The paper is organized as follows. The design and capabilities of the optimization subsystem within LIS are presented first (section 2). This is followed by the description of the experiment setup that evaluates the use of parameter estimation in data assimilation (section 3). The results from the data assimilation integrations are presented in section 4. Finally, section 5 discusses the conclusions from the study.

### 2. Optimization Subsystem in LIS

[10] LIS is designed as an object-oriented framework, where all functional extensions (such as land surface models, data assimilation algorithms, meteorological inputs, observational data, etc.) are implemented as abstract, extensible components [*Kumar et al.*, 2006, 2008a]. A large suite of modeling extensions has been incorporated in LIS using this design paradigm. The optimization subsystem in LIS is designed in a similar interoperable manner.

#### 2.1. Optimization Abstractions

[11] Generically, an optimization instance can be stated as a problem of determining unknown parameters by minimizing or maximizing an objective function subject to a number of constraints. The optimization subsystem in LIS defines three functional abstractions on the basis of this generic form, shown in Figure 1: (1) objective function, (2) decision/parameter space, and (3) algorithm used to solve the optimization problem. In the instance of parameter estimation, the decision space is defined by the list of LSM parameters (or a subset thereof). The objective function object represents the function or criterion to be maximized or minimized. Examples include the minimization of squared residuals and the maximization of likelihood measures. Finally, the optimization algorithm abstraction represents the actual search strategy used to find the optimal solution. The interconnections between these three generic pieces are handled within the LIS core, which is the unit that enables the integrated use of various extensible components in LIS. Custom implementations of each of these three abstractions constitute a specific instance of an optimization problem.
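The separation into objective function, decision space, and search algorithm can be illustrated with a small sketch. None of the names below (`DecisionSpace`, `squared_residuals`, `grid_search`) come from LIS; they are hypothetical stand-ins, and the exhaustive grid search is only a placeholder for the real search strategies.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Params = Dict[str, float]
ObjectiveFunction = Callable[[Params], float]


@dataclass
class DecisionSpace:
    """Abstraction 2: names and (lower, upper) bounds of the parameters to estimate."""
    bounds: Dict[str, Tuple[float, float]]


def squared_residuals(simulate: Callable[[Params], Sequence[float]],
                      observations: Sequence[float]) -> ObjectiveFunction:
    """Abstraction 1: an objective that sums squared model-minus-observation residuals."""
    def objective(params: Params) -> float:
        return sum((s - o) ** 2 for s, o in zip(simulate(params), observations))
    return objective


def grid_search(space: DecisionSpace, objective: ObjectiveFunction,
                steps: int = 11) -> Params:
    """Abstraction 3: a deliberately simple search strategy (exhaustive grid)."""
    names = list(space.bounds)
    axes: List[List[float]] = [
        [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
        for lo, hi in space.bounds.values()
    ]
    candidates = (dict(zip(names, values)) for values in product(*axes))
    return min(candidates, key=objective)


# Toy "model" (hypothetical): the state responds linearly to a forcing series.
forcing = [1.0, 2.0, 3.0]
observations = [0.3 * f for f in forcing]
space = DecisionSpace(bounds={"gain": (0.0, 1.0)})
objective = squared_residuals(lambda p: [p["gain"] * f for f in forcing], observations)
best = grid_search(space, objective)
```

Because the search routine only sees a `DecisionSpace` and a callable objective, any of the three pieces can be swapped without touching the other two, which is the point of the abstraction.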

[12] Similar to the design of the LIS data assimilation subsystem [*Kumar et al.*, 2008b], the data exchanges between these abstractions are handled through the constructs of the Earth System Modeling Framework (ESMF) [*Hill et al.*, 2004]. ESMF provides a standardized, self-describing format for data exchange between these components. Three search algorithms of varying complexity are implemented in this infrastructure: (1) Levenberg-Marquardt (LM) [*Levenberg*, 1944; *Marquardt*, 1963], (2) shuffled complex evolution from the University of Arizona (SCE-UA) [*Duan et al.*, 1992, 1993], and (3) genetic algorithm (GA) [*Holland*, 1975]. LM is a gradient-based search technique and is suited only for deterministic convex optimization problems, whereas SCE-UA and GA are more suited for difficult combinatorial optimization problems such as LSM parameter estimation.

#### 2.2. Genetic Algorithm

[13] In this article, we employ GA for estimating LSM parameters. GAs are stochastic search techniques that use heuristics-based principles of natural evolution and genetics. The algorithm works by employing a population of individuals (or candidate solutions), each of which is represented by a set of values of the problem's variables that need to be estimated (also called decision space). By applying operations that are based on natural evolution concepts, such as selection, recombination and mutation, the population evolves toward better solutions over several generations (or iterations).

[14] Figure 2 depicts a flowchart showing the sequence of GA operations during optimization. A fitness value that reflects the quality of the solution and its ability to satisfy constraints and objectives of the problem is associated with each potential solution. The selection operator simulates the “survival of the fittest” behavior by preferentially selecting the solutions with higher fitness to be present in the subsequent populations. As a result, solutions with good traits survive and solutions with bad traits are eliminated. Each pair of selected solutions then undergoes the recombination step, where two new solutions are generated by combining the “genes” of the parent solutions. The mutation operator is used to infuse the population with gene values that may not be present in the population. The recombination and mutation rates define the probability of crossover between the two solutions in a pair and the probability of a gene undergoing mutation, respectively. To ensure that the best solution in any generation is not lost through these probabilistic recombination and mutation operations, a strategy named elitism is used. Under elitism, the best solution from the previous generation is compared with the worst solution in the current generation and replaces it if it is better. These steps are repeated through several iterations (or generations) until the specified convergence criteria are met.
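The sequence of operations described above can be sketched as a small real-coded GA. This is an illustration under stated assumptions, not the LIS implementation: binary tournament selection, uniform crossover, uniform-redraw mutation, a fixed number of generations in place of a convergence test, and all parameter defaults are choices made here. Fitness is expressed as a cost to be minimized.

```python
import random


def genetic_algorithm(cost, bounds, pop_size=40, generations=60,
                      crossover_rate=0.9, mutation_rate=0.1, seed=0):
    """Minimize cost(individual) over box bounds; returns (best individual, best cost)."""
    rng = random.Random(seed)

    def tournament(pop, costs):
        # Selection: the fitter (lower-cost) of two random individuals is chosen
        i, j = rng.randrange(len(pop)), rng.randrange(len(pop))
        return pop[i] if costs[i] <= costs[j] else pop[j]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    costs = [cost(ind) for ind in pop]

    for _ in range(generations):
        # Remember the best solution of this generation before it can be lost
        elite_idx = min(range(pop_size), key=costs.__getitem__)
        elite, elite_cost = pop[elite_idx][:], costs[elite_idx]

        children = []
        while len(children) < pop_size:
            a, b = tournament(pop, costs)[:], tournament(pop, costs)[:]
            if rng.random() < crossover_rate:
                # Recombination: swap each gene between the parents with probability 0.5
                for g in range(len(bounds)):
                    if rng.random() < 0.5:
                        a[g], b[g] = b[g], a[g]
            for child in (a, b):
                for g, (lo, hi) in enumerate(bounds):
                    if rng.random() < mutation_rate:
                        # Mutation: draw a gene value that may not exist in the population
                        child[g] = rng.uniform(lo, hi)
                children.append(child)

        pop = children[:pop_size]
        costs = [cost(ind) for ind in pop]

        # Elitism: the previous best replaces the current worst if it is better
        worst_idx = max(range(pop_size), key=costs.__getitem__)
        if elite_cost < costs[worst_idx]:
            pop[worst_idx], costs[worst_idx] = elite, elite_cost

    best_idx = min(range(pop_size), key=costs.__getitem__)
    return pop[best_idx], costs[best_idx]
```

A call such as `genetic_algorithm(lambda p: (p[0] - 0.3) ** 2 + (p[1] + 0.2) ** 2, [(-1, 1), (-1, 1)])` searches a two-parameter space. With elitism in place, the best cost in the population cannot increase from one generation to the next.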

[15] GAs do not rely upon local or gradient information and are able to deal with complexities in the search space such as the presence of local optima and discontinuities. GAs are also well suited to handling discrete decision variables and nonlinearity in the simulation models. The problem-independent structure of the algorithm has enabled its application in many areas of science and engineering [*Goldberg*, 1989]. GAs, however, require the evaluation of many simulation runs to obtain the best solution, making them computationally intensive. The high-performance computing tools in LIS are employed to mitigate this limitation (section 4.3).

### 5. Summary

[48] Data assimilation methods such as the EnKF require that the errors in the model and the observations are strictly random. As a result, the presence of systematic or bias errors needs to be addressed separately within the data assimilation system. In this study, we evaluate a number of bias mitigation strategies in the context of assimilating surface soil moisture retrievals. Specifically, we examine the use of land model parameter estimation as a bias correction strategy prior to data assimilation. This strategy is compared to the approach of scaling the assimilated observations to the land model's climatology prior to data assimilation. The study is conducted using a fraternal twin experiment setup, where synthetic observations generated using the Catchment LSM are assimilated into the Noah LSM. Five different data assimilation experiments are conducted, each using a different strategy to correct (or not) for bias prior to data assimilation. The resulting soil moisture estimates are evaluated against the corresponding synthetic truth fields from the Catchment LSM.

[49] Our results indicate that a priori land model calibration is an effective strategy for bias mitigation in soil moisture assimilation. The domain-averaged skill estimates (in terms of anomaly R values) for the Noah open loop simulation without any data assimilation are 0.47 for surface soil moisture and 0.45 for root zone soil moisture. These skill estimates improve to 0.63 for surface soil moisture and 0.54 for root zone soil moisture when assimilation is conducted without any bias correction (DA-NOSC). When observations are assimilated after rescaling to the model climatology, the assimilation skill improves further. Two approaches for a priori scaling are considered: (DA-STDN) using standard normal deviates and (DA-CDF) matching the CDF of the observations to that of the model. Assimilation using these a priori scaling approaches yields domain-averaged skill values of 0.71 and 0.73 for surface soil moisture and 0.62 and 0.63 for root zone soil moisture, respectively. Similar improvements in the surface and root zone soil moisture estimates are observed with the assimilation runs that employ optimized model parameters but ingest unscaled observations. Two sets of optimized parameters are used in the experiments: (DA-OPT1) parameters estimated from a single year of calibration and (DA-OPT6) parameters estimated from 6 years of calibration. When data assimilation is conducted using parameters from a single year of calibration, skill estimates of 0.73 for surface soil moisture and 0.62 for root zone soil moisture are obtained. The use of the 6 year based parameters further improves these skill measures to 0.75 for surface soil moisture and 0.63 for root zone soil moisture.
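As a rough illustration of the anomaly R skill metric quoted above, the sketch below correlates two time series after removing a mean seasonal cycle from each. It assumes that anomalies are deviations from each series' own calendar-month mean; this summary does not spell out the exact anomaly definition, and the function names are hypothetical.

```python
from math import sqrt


def monthly_anomalies(values, months):
    """Subtract each calendar month's mean from the values that fall in that month."""
    totals, counts = {}, {}
    for v, m in zip(values, months):
        totals[m] = totals.get(m, 0.0) + v
        counts[m] = counts.get(m, 0) + 1
    return [v - totals[m] / counts[m] for v, m in zip(values, months)]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def anomaly_r(estimate, truth, months):
    """Correlation between the anomalies of an estimate and of the truth."""
    return pearson_r(monthly_anomalies(estimate, months),
                     monthly_anomalies(truth, months))
```

Because the seasonal cycle of each series is removed separately, a constant offset or a different mean seasonal cycle between estimate and truth does not lower this metric; it measures agreement in the timing and sign of departures from climatology.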

[50] It was also observed that spatial variability in the skill scores across the domain is reduced with the use of optimized parameters, resulting in more spatially consistent skill enhancements. The skill improvements in surface fluxes were found to be comparable for data assimilation following a priori scaling and a priori calibration. Similar trends in skill scores are also observed if the unbiased RMSE metric is used instead of anomaly R for evaluating the results. Finally, the analysis of innovation diagnostics also demonstrates that without the use of suitable bias correction, the assimilation system performs in a less than optimal manner and that all four bias mitigation strategies adequately address the bias issue.
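The unbiased RMSE and the innovation diagnostics mentioned above can be sketched as follows. The assumptions are loud ones: the observation operator is taken to be the identity, the forecast error variance is supplied by the caller (for example from ensemble spread), and the function names are hypothetical. For a filter whose error assumptions hold, the normalized innovations are expected to have a mean near 0 and a variance near 1; an uncorrected bias shifts the mean away from 0.

```python
from math import sqrt
from statistics import mean, pvariance


def unbiased_rmse(estimate, truth):
    """RMSE computed after removing the mean difference (bias) from the errors."""
    errors = [e - t for e, t in zip(estimate, truth)]
    bias = mean(errors)
    return sqrt(mean([(err - bias) ** 2 for err in errors]))


def normalized_innovation_stats(observations, forecasts, forecast_var, obs_var):
    """Mean and variance of innovations scaled by their expected standard deviation.

    Assumes an identity observation operator, so the innovation is simply
    observation minus forecast, with expected variance forecast_var + obs_var.
    """
    normalized = [
        (y - xf) / sqrt(pf + r)
        for y, xf, pf, r in zip(observations, forecasts, forecast_var, obs_var)
    ]
    return mean(normalized), pvariance(normalized)
```

A quick check of the bias sensitivity: adding a constant offset to every observation leaves the variance of the normalized innovations unchanged but moves their mean away from zero, which is the signature of the suboptimal behavior described for the run without bias correction.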

[51] In the suite of synthetic experiments presented in this article we are in effect calibrating the Noah surface soil moisture climatology to that of the Catchment LSM. It must be stressed that this approach is chosen not because one model (Catchment) is more correct than the other (Noah). A similar argument holds when satellite soil moisture retrievals are assimilated. In that case, the climatology of the retrievals is not necessarily more correct than that of the model. However, when brightness temperatures are assimilated in radiance space instead of the retrievals, the model should be calibrated to the observed brightness temperature climatology. The long-term biases can be mitigated through calibration and the remaining shorter-term biases can be addressed with a priori scaling. The combined use of these strategies will be examined in future radiance-based data assimilation experiments.

[52] Though effective, the approach of using parameter estimation for bias correction also suffers from the limitations of the a priori scaling approaches. Since the parameters are estimated in advance of data assimilation, any subsequent changes in model behavior will not be captured, unlike in the dynamic bias estimation algorithms. The optimization formulation does not constrain the estimated parameters to conform to the traditional, look-up table–based definitions of parameters. Here no attempt was made to ensure the physical realism of the estimated parameters. The calibration might also require additional constraints to ensure that the behavior of related variables is not adversely affected. Note, however, that we have found that the estimates of the latent and sensible heat fluxes were comparable for the assimilation integrations with bias correction (DA-STDN, DA-CDF, DA-OPT1, and DA-OPT6). Furthermore, our results suggest that using model parameter estimation could be a viable strategy for bias mitigation in cases of relatively short (i.e., 1 year) satellite records. This result is important for expediting the use of soil moisture retrievals becoming available from SMOS and SMAP.

[53] The study also demonstrates the advanced capabilities of the NASA LIS framework, including the development of a new subsystem for optimization. This extension encapsulates a range of advanced search algorithms suited for both convex and nonconvex optimization problems. In this particular study, the genetic algorithm, a heuristic search technique based on principles of evolutionary computing, is employed for estimating model parameters. The optimization infrastructure within LIS is currently being enhanced with a suite of uncertainty estimation algorithms based on Bayesian methods. In contrast to the optimization techniques that have already been implemented in LIS and generate a single solution for parameters, the newer uncertainty estimation tools infer distributions of parameters based on the observational information. These parameter distributions can then be used to condition the ensembles used in the data assimilation system. The joint use of optimization and data assimilation tools presented here and future LIS advancements will enable the increased exploitation of observational data for improving hydrological modeling.