Keywords:

  • uncertainty;
  • modeling;
  • water balance;
  • Bayesian statistics;
  • data assimilation;
  • model structure estimation;
  • estimation

Abstract

  1. Abstract
  2. 1. Introduction
  3. 2. Model Structure Identification Through Data Assimilation: Theory
  4. 3. Model Structure Identification by Data Assimilation: Application to Leaf River Basin
  5. 4. Summary and Conclusions
  6. Appendix A: Mixture of Multivariate Normal Distributions Approximation
  7. Appendix B: Illustration of the Model Structure Estimation Algorithm Using a Simple Example
  8. Appendix C: Extended State-Space Formulation
  9. Acknowledgments
  10. References
  11. Supporting Information

[1] When constructing a hydrological model at the macroscale (e.g., watershed scale), the structure of this model will inherently be uncertain because of many factors, including the lack of a robust hydrological theory at that scale. In this work, we assume that a suitable conceptual model structure for the hydrologic system has already been determined; that is, the system boundaries have been specified, the important state variables and input and output fluxes to be included have been selected, the major hydrological processes and geometries of their interconnections have been identified, and the continuity equation (mass balance) has been assumed to hold. The structural identification problem that remains, then, is to select the mathematical form of the dependence of the output on the inputs and state variables, so that a computational model can be constructed for making simulations and/or predictions of the system input-state-output behavior. The conventional approach to this problem is to preassume some fixed (and possibly erroneous) mathematical forms for the model output equations. We show instead how Bayesian data assimilation can be used to directly estimate (construct) the form of these mathematical relationships such that they are statistically consistent with macroscale measurements of the system inputs, outputs, and (if available) state variables. The resulting model has a stochastic rather than deterministic form and thereby properly represents both what we know (our certainty) and what we do not know (our uncertainty) about the underlying structure and behavior of the system. Further, the Bayesian approach enables us to merge prior beliefs in the form of preassumed model equations with information derived from the data to construct a posterior model. As a consequence, in regions of the model space for which observational data are available, the errors in the preassumed mathematical form of the model can be corrected, improving model performance.
For regions where no such data are available, the “prior” theoretical assumptions about the model structure and behavior will dominate. The approach, entitled Bayesian estimation of structure, is used to estimate water balance models for the Leaf River Basin, Mississippi, at annual, monthly, and weekly time scales, conditioned on the assumption of a simple single-state-variable conceptual model structure. Inputs to the system are uncertain observed precipitation and potential evapotranspiration, and outputs are estimated probability distributions of actual evapotranspiration and streamflow discharge. Results show that the models estimated for the annual and monthly time scales perform quite well. However, model performance deteriorates at the weekly time scale, suggesting limitations in the assumed form of the conceptual model.

1. Introduction

[2] When constructing a hydrological model at the macroscale (e.g., watershed scale), the structure of this model will inherently be uncertain because of many factors. One important reason for this is the lack of a robust hydrological theory at the macroscale, and in fact most “physically based” models of hydrologic systems are based on an implicit upscaling premise: that the behavior at the system/model scale can be described by governing equations inferred from small-scale physics, field studies, and observations (with spatial averaging of the state variables and use of “effective” parameters) [e.g., Arain et al., 1996, 1997; Beven and Binley, 1992; Freer et al., 1996; Thiemann et al., 2001; Wagener et al., 2003]. Given the strong nonlinearities and heterogeneities in a hydrological system, the upscaling assumption is clearly questionable, and we might therefore expect the large-scale “effective” governing equations for the system to be different in form (not just different in parameters) from the equations inferred at the small scale [Amir and Neuman, 2001; Tartakovsky et al., 2003; Ye et al., 2004].

[3] Recognizing the upscaling problem, several approaches have been proposed that seek to build an appropriate model at the system scale directly from the data. Some diverse examples include the data-based mechanistic approach developed by Young [2001], the self organizing linear output artificial neural network approach by Hsu et al. [2002], and the macroscale hydrology model approach of Sivapalan [2003a, 2003b], to name just a few. The general challenge, of course, is to properly incorporate current understanding of the underlying physics of the system while also conditioning the model structure estimation process on the available data. The model will then provide predictions that are consistent with available observations when operating within the range of the data used for model estimation, while providing reasonable extrapolative predictions that are consistent with physical understanding when operating outside of the historical data range. A further challenge is to incorporate a proper representation of all relevant data and model uncertainties so that proper estimates of prediction accuracy and precision can be made. One objective of the current study is to explore how these abilities might be achieved.

[4] In attempting to understand model uncertainty, it is useful to treat the model estimation problem as having two sequential stages: a system conceptual structure identification stage and a system equation structure identification stage. The outcome of these two stages is a computational model that can be used to make simulations and/or predictions of system input-state-output behavior. For clarity, and because these terms are understood differently by various people and in various modeling approaches, we define these terms as follows. System conceptual structure identification is the process of specifying (somehow, by assumption and/or inference) one or more suitable conceptual models for the hydrologic system of interest [see, e.g., Ye et al., 2004; Young, 2001]; this requires clear definition of the spatial boundaries of the system, selection of which input and output fluxes crossing those boundaries are important and must be represented, definition of the important state variables to be included (involving decisions regarding their number, type, and spatial geometry), a list of the major hydrological processes that must be included, and a diagram describing the assumed geometry of interconnectedness between inputs, states and outputs. Further, we assume that the input-state-output continuity equation holds, so that mass/energy is conserved (as appropriate). Note that this step involves a clear and explicit hypothesis of what is important and should therefore be included in the model, while conversely also implying a hypothesis of what is not important and can be ignored. Clearly, this step involves considerable subjectivity, and the outcome is affected by the uncertainties involved in perception and conceptual analysis. Further, this step does not require formal specification of the mathematical (equation) forms of the relationships linking the input, state and output variables (beyond the continuity equation as mentioned above). 
The result is a “conceptual” model of the system.

[5] System equation structure identification is the subsequent process of somehow selecting (by assumption and/or inference) the remaining mathematical equations (and/or rules) linking the input, state, and output variables in such a way as to be simultaneously consistent with both the previously defined conceptual model structure and the historical data. The conventional approach to this problem is to preassume some fixed (and possibly erroneous) mathematical structures for the model equations, based on existing (e.g., small-scale) hydrological theory. These equations are usually represented as being deterministic and therefore perfectly certain. Uncertainty, when it is included, is typically represented (1) as a stochastic distribution over the equation parameters [e.g., Beven and Binley, 1992; Freer et al., 1996; Thiemann et al., 2001; Wagener et al., 2003], (2) as a random additive “equation error” belonging to some assumed probability distribution [e.g., Wagener et al., 2004; Moradkhani et al., 2005], or (3) as a set of alternative (deterministic) model equations to choose from [Woolhiser et al., 1990]. Thus, watershed models can exist that are “conceptually” identical (or very similar) while differing by virtue of the equations used to carry out the computations.

[6] This paper is focused on the problem of system equation structure identification. We shall assume here, without loss of generality, that one appropriate conceptual model has previously been selected that properly incorporates all relevant “first principle” knowledge about the hydrological system; in section 3 we illustrate this using a conceptual model suitable for dynamical watershed-scale water balance computations. We will not, therefore, consider (in this paper) prediction uncertainties that are caused by errors or uncertainties in the conceptualization of the system; we leave this topic for a subsequent discussion. This paper will discuss and demonstrate how, given a conceptual model, the mathematical structural form of the macroscale model equations can be estimated from available system input-state-output observations, in a way that properly reflects both what we know about the system (the certainty in our knowledge) and what we do not know (the uncertainty in our knowledge) about the system. To be clear, what we know refers primarily to the presumed conceptual structure of the system; this might perhaps be called a “hard” prior, in the Bayesian sense that the presumed conceptual structure is assigned a prior probability equal to one, and we are therefore not looking to update this prior. Secondarily, if we wish, what we know can also include some prior statements (assumptions) regarding what we think the mathematical form(s) of the input-state-output relationships might be; this might be called a “soft” prior, in the Bayesian sense that the form of the presumed equations is treated as uncertain and is to be either reinforced or modified via the data assimilation process. Hence, what we do not know refers both to our uncertainty in the input-state-to-output mapping that is derived (from the error-corrupted data) via Bayesian data assimilation, and to the prior uncertainty in the model equations as mentioned above.

[7] The proposed approach is entitled the Bayesian estimation of structure (BESt) method. The key to understanding this approach is to recognize that everything we know (or think we know) about the relationships between the relevant system variables (inputs, states and outputs), and particularly our degree of certainty (and conversely, uncertainty) in these relationships, can be compactly represented in terms of the joint probability density function (hereafter called the pdf) of all the system variables, conditioned on the observed data and on our conceptual understanding of system physics. Once this joint pdf is known, the mathematical form of the predictive “model” of the system can be readily determined by deriving from it the conditional pdf of the outputs given knowledge of the inputs and current state, via Bayes law. In this paper, we discuss how this formulation enables us to apply Bayesian data assimilation techniques to the problem of estimating, refining and/or correcting the model structural equations (and our estimate of the model uncertainty) over time, on the basis of new information as it becomes available.
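As a toy illustration of how a predictive model follows from a joint pdf by conditioning (the bivariate normal assumption and all parameter values below are purely illustrative, not the paper's model):

```python
import numpy as np

# If the joint pdf of a state x and an output y is bivariate normal, the
# predictive conditional pdf p(y | x) follows in closed form from the joint
# parameters. The mean and covariance here are toy numbers.
mu = np.array([1.0, 2.0])            # joint mean of (x, y)
cov = np.array([[1.0, 0.6],
                [0.6, 0.5]])         # joint covariance of (x, y)

def conditional_y_given_x(x):
    """Mean and variance of p(y | x) derived from the joint pdf."""
    mean = mu[1] + cov[1, 0] / cov[0, 0] * (x - mu[0])
    var = cov[1, 1] - cov[1, 0] ** 2 / cov[0, 0]
    return mean, var

m, v = conditional_y_given_x(1.5)    # prediction with quantified uncertainty
```

Knowing the joint pdf is enough to produce both a prediction and a numerical statement of its uncertainty, which is the sense in which the joint pdf compactly encodes the model.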

[8] Specifically, the approach taken in this paper is to derive a “semiparametric” mathematical estimate of the joint pdf of the system variables from available error-corrupted data, and then use the conditional density derived from this estimate to make predictions, either by itself, or in combination with some prior mathematical model. For regions of the model space where historical data are available, the data-based estimate of the system equation structure will dominate and can be used to make predictions with numerically quantified uncertainty. Conversely, for regions of the model space where no historical data are available, the prior assumptions (conceptual and equation) regarding model structure will dominate and can be used to make predictions whose uncertainty can only be subjectively quantified. Importantly, the data assimilation process can be used to detect and correct errors in the prior beliefs about the input-state-output dependencies (wherever the data conflict with the prior model). Further, the approach has the desirable property of permitting the representation of (and discrimination between) three important sources of uncertainty, namely initial condition uncertainty, input uncertainty, and model structure uncertainty.

[9] The scope of this paper is to discuss the concepts and mathematics of the BESt approach, and to demonstrate how it can be used to identify the mathematical form of the structural equation of a simple conceptual hydrological model that reflects a basic understanding of water balance processes at the watershed scale. We will restrict the discussion presented here to the case where the equation form of the input-state relationship is “known,” so that our task is to identify the form of the state-output equation; future work will generalize this result. At the watershed scale we are able to exploit the principle of continuity (or mass balance) for water entering and leaving the catchment, with the catchment acting as a water storage body.

[10] The paper is organized as follows. Section 2 presents a brief theoretical development of the proposed method for Bayesian Estimation of Structure; for further details please see Bulygina [2007]. Section 3 provides an illustrative example in which the method is applied to watershed-scale water balance modeling for the 1944 km² Leaf River Basin, Mississippi, at annual, monthly, and weekly time scales. Our model is derived to be conditional on the assumption of a simple single-state-variable conceptual model structure, having precipitation and potential evapotranspiration as inputs, and actual evapotranspiration and streamflow discharge as outputs. Storage components and mechanisms associated with groundwater recharge/discharge are not represented explicitly in the mass balance. In section 4, the conclusions and implications of this work are discussed along with ideas for further development and other possible applications.

2. Model Structure Identification Through Data Assimilation: Theory

2.1. Data Assimilation

[11] Data assimilation for dynamic systems is a process whereby information is extracted from sequential observations and assimilated into the model. Given a sequence of observational data from some time in the past up to the present, three types of estimation problems can be considered: (1) forecasting, to predict the system state at some future point in time; (2) filtering (or analysis), to characterize the system state at the current time; and (3) smoothing, to estimate the system state at some past time. While conventional data assimilation applications focus on observation assimilation for the purposes of estimating uncertain states and/or parameters, this paper is concerned with the problem of jointly estimating the states and the uncertain mathematical structure of the model, given a presumed conceptual structure. We will find the method of smoothing to be particularly useful because it provides improved estimates of the extended model state that help in the recursive estimation of the model's mathematical structure. Also, the method of particle filtering [Doucet et al., 2000; Arulampalam et al., 2002; Hürzeler and Künsch, 1998] will be used to construct efficient estimates of the pdfs representing the mathematical structure of the model. Convergence of the estimates of the model structure pdfs will be tested using the Kullback-Leibler divergence measure [Kullback and Leibler, 1951].
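The Kullback-Leibler convergence check can be sketched as follows for discretized pdf estimates (a minimal illustration; the smoothing constant `eps` and the toy distributions are assumptions, not values from the paper):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete Kullback-Leibler divergence D(p || q) in nats.

    p and q are probability vectors over the same support; eps guards
    against log(0) in (near-)empty bins.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Convergence test between two successive estimates of a structure pdf:
p_prev = np.array([0.25, 0.25, 0.25, 0.25])
p_curr = np.array([0.24, 0.26, 0.25, 0.25])
d = kl_divergence(p_curr, p_prev)   # near zero once the estimates agree
```

Iteration can be stopped when the divergence between successive estimates falls below a chosen tolerance.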

2.2. Model Components in Systems Theory

[12] Few, if any, hydrologic studies reported in the literature comply with a common set of definitions and terminology related to system components. In this work we will follow the notation proposed by Liu and Gupta [2007] (see notation section), and consider a system (Figure 1) to be composed of five different components: initial states (x0), structure (M), inputs (u), states (x), and outputs (y).

Figure 1. A systems diagram of model components.

[13] In this setup, we define inputs u and outputs y to be fluxes of mass and/or energy into and out of the system boundary, and states x to be time-varying quantities of mass and/or energy stored within the system boundary. For example, in hydrological modeling u may refer to the time-varying two-dimensional spatial distribution of precipitation flux over the catchment and of potential evaporation and transpiration from the surface of the catchment; y may refer to the time-varying two-dimensional distribution of streamflow flux at all points along the river network and of actual evaporation and transpiration from the surface of the catchment; and x may refer to three-dimensional time-varying spatial distribution of surface and subsurface moisture stored within the catchment boundaries.

[14] The model structure M consists of two components, the vector functional relationships Mx and My, where Mx represents the input-to-state mapping, and My represents the input-state-to-output mapping. For example, Mx may refer to the coupled equations describing the three-dimensional evolution of surface and subsurface moisture state variables in response to catchment inputs (precipitation and potential evaporation); whereas My may refer to the coupled equations describing the dependence of catchment outputs (evaporation, transpiration and outflow) on system states and inputs (precipitation, potential evaporation). These mappings can be formulated in a continuous time differential equation manner as:

    dx(t)/dt = Mx(x(t), u(t)),   y(t) = My(x(t), u(t))    (1)

or in a discrete time difference equation manner (using k to represent discrete moments in time) as:

    xk = Mx(xk−1, uk),   yk = My(xk, uk)    (2)

Without loss of generality, and since computer models are designed to make predictions at discrete moments of time, we shall use the discrete time formulation (2) in all subsequent discussion. Note that when integrating at different time scales the discrete time structural forms M of the model can be quite different.
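For concreteness, the discrete-time formulation (2) for a single-state-variable water balance might look like the following sketch. The closure relationships used for `Mx` and `My` (a capped evapotranspiration rate and a linear-reservoir discharge) are illustrative assumptions only: they are exactly the kind of preassumed equation forms that BESt treats as uncertain.

```python
def My(x, u):
    """Input-state-to-output mapping: yk = My(xk, uk)."""
    p, pet = u                    # precipitation and potential ET fluxes
    et = min(pet, 0.5 * x)        # assumed actual-ET closure
    q = 0.2 * x                   # assumed linear-reservoir discharge
    return et, q

def Mx(x_prev, u):
    """Input-to-state mapping: xk = Mx(xk-1, uk), closing the mass balance."""
    p, pet = u
    et, q = My(x_prev, u)
    return x_prev + p - et - q

x = 100.0                                            # initial storage (mm)
for u in [(30.0, 10.0), (0.0, 12.0), (55.0, 8.0)]:   # (precip, PET) per step
    x = Mx(x, u)                  # propagate the state
    et, q = My(x, u)              # outputs at the new state
```

Replacing these fixed functional forms with estimated conditional pdfs is the subject of the remainder of section 2.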

2.3. Errors (Uncertainties) in Different Model Components

[15] Of the five model components illustrated in Figure 1, three must be specified before the model can actually be run in forward mode (i.e., u, x0, and M), while the remaining two (x and y) are computed by running the model. Each of the three components may be uncertain in various characteristic ways, and the consequence of these uncertainties will be mapped (implicitly) into the model states and outputs. Uncertainties in initial conditions and inputs are called data (or measurement, or observation) errors. Errors in the output observations used to evaluate the model results and errors in measurable parameters (if any) are also considered to be data errors.

[16] Data errors in hydrology usually consist of three components: (1) instrumental error, due to imperfect measurement devices; (2) representativeness error, due to differences between measurement and model scales; and (3) interpretive error, due to the use of other models to interpret an actual measurement (e.g., conversion of river stage to streamflow rates, or of neutron probe readings to soil moisture). In this work we assume the stochastic properties of the combined effect of these data errors to be known in the form of probability density functions; relaxations of this assumption will be explored in future work.

[17] Structural errors arise because of model assumptions and simplifications in approximating the complex reality. This can involve inappropriate conceptual model structures, as well as inappropriate mappings and relationships between the model inputs, states, and outputs. In this work we address model structural uncertainty by fixing (assuming) a conceptual structure for the model, and estimating the (thereby conditioned) form of the mappings between inputs, states, and outputs, using available measurements of the inputs and of some output variables. Note that while we assume in this paper that no direct state measurements are available, this assumption can also be relaxed later.

2.4. Extended State-Space Formulation

[18] Following equations (2), the true values of the inputs uk, state variables xk, and outputs yk are some nonrandom numbers at each time k. However, because of data and structural uncertainties these values cannot be known exactly, so we treat the estimates of these quantities as random variables, distinguished by the use of capital letters (Xk, Yk, and Uk) and their sampled realizations by the use of lowercase letters (xk, yk, and uk). Our goal is to estimate the triple (xk, yk, and uk) from available information. Therefore, let Sk = (Xk, Yk, Uk) represent the extended state vector. Since the mathematical structure of the system is unknown, we will represent the evolution of S in general form as an extended first-order Markov state-space model (consistent with the first-order dependence represented by equation (2)) with state Sk+1 and observations Zk+1 distributed as follows:

    Sk+1 ∼ M(sk+1 | sk),   Zk+1 ∼ H(zk+1 | sk+1)    (3)

where the probabilistic operator M = (Mu, Mx, My) (not to be confused with the deterministic mapping M in (2)) expresses the temporal propagation of the system from time tk to time tk+1 but itself remains static, in the sense that the form of the mapping does not change with time, so that:

    M(sk+1 | sk) = Mu(uk+1 | sk) Mx(xk+1 | uk+1, sk) My(yk+1 | xk+1, uk+1, sk)    (4)

where each component Mu, Mx, and My of the probabilistic operator is given by the pdf for the input, state, and output, correspondingly. The observation triplet Zk+1 = (Zk+1u, Zk+1x, Zk+1y) is a random variable related to the augmented state Sk+1 through an observation probabilistic operator H = (Hu, Hx, Hy), where Zk+1u ∼ Hu(zk+1u | sk+1) is an input observation, Zk+1x ∼ Hx(zk+1x | sk+1) is a state observation, and Zk+1y ∼ Hy(zk+1y | sk+1) is an output observation. Here, we will assume that the input is completely observable, that there may or may not be any state observations, and that possibly only some of the output variables are observed.

[19] The first-order Markov assumption about system (3) means that the state sk, when conditioned on all previous states, depends only on the state at time k − 1; that is, p(sk | s1:k−1) = p(sk | sk−1). Our uncertain knowledge about the initial state of the system will be represented by the prior pdf p(s0).

[20] Another critical assumption usually made in sequential data assimilation [Wikle and Berliner, 2007] is that the observation triplets are mutually independent in time provided one knows the true values; that is, the observation distribution is fully defined if the corresponding true value is given, so that the joint pdf of the observed values is given by the product:

    p(z1:T | s1:T) = ∏k=1,…,T H(zk | sk)    (5)

[21] To set up an assimilation system using the state-space formulation presented above requires that some assumptions be made about the form of the observation probabilistic operator H introduced by equation (3). Because the form of H depends on the nature of the observational process, it is usually specified so that the mean (expected value) of the observation distribution corresponds to the actual system input/output values, while the observation noise is generally assumed to be additive and independent.
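A minimal sketch of this additive, independent noise assumption for H (the true value, the noise level, and the sample size below are toy choices):

```python
import numpy as np

# Sketch of the additive observation-noise assumption: each observation Z
# is centered on the true value s, so its expected value equals s.
rng = np.random.default_rng(1)

def observe(true_value, sigma):
    """One draw of Z ~ H(z | s), here z = s + e with e ~ N(0, sigma^2)."""
    return true_value + rng.normal(0.0, sigma)

z = np.array([observe(10.0, 0.5) for _ in range(5000)])
sample_mean = z.mean()               # approaches the true value 10.0
```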

[22] To proceed, the assumed (and fixed) conceptual model can be used to define the input propagation operator Mu and the input-to-state propagation operator Mx, whereas the operator My relating the output to the inputs and states is not known (or is perhaps incompletely known) and must therefore be estimated. Therefore, in this (fairly common) case, the input Mu and input-to-state Mx operators are fixed as a part of the model conceptualization process and need not be identified hereafter. The probabilistic operator My will be considered as an estimate of the unknown output mapping My (see equation (2)) and will provide information regarding the uncertainty in the form of the mapping, conditioned on the prior assumptions and available information. We will, of course, assume that all prior information regarding the output operator My is adequately reflected by the conceptual model of the system, so as to represent our best current knowledge about the dynamics of the system. Our task, therefore, is to estimate the output mapping My via Bayesian data assimilation, so as to summarize all current (prior and new) available information about the global relationship between the variables in a probabilistic manner.

2.5. Estimation of the Joint Probability Density Function p(s | z1:T)

[23] Suppose we have a batch of observations z1:T. From a Bayesian perspective, the task of estimating the output operator My can be viewed as a problem of predicting the output value given some values for the conditioning variables (i.e., yt | xt, ut, st−1). Therefore, we must compute the predictive posterior operator My(yt | xt, ut, st−1, z1:T) conditioned on the observations z1:T. Note that the operator My is a joint pdf of the outputs and therefore can be represented as a product of conditional density functions:

    My(yt | xt, ut, st−1, z1:T) = ∏j=1,…,dy p(yj,t | y1:j−1,t, xt, ut, st−1, z1:T)    (6)

where yt = (y1,t, …, ydy,t), dy is the output dimension, and y1:j,t = (y1,t, …, yj,t). Here, each term p(yj,t | y1:j−1,t, xt, ut, st−1, z1:T) represents the conditional pdf for the jth output variable at time t given values for the other (from 1 to j − 1) output variables, states, inputs, previous extended states, and data.

[24] There are several ways that a probability density function can be estimated given a batch of data (e.g., kernel density estimation [Silverman, 1986]). Our approach is to use our current estimate of My (we will keep the same notation for its estimate) to generate a reasonable inference regarding the future behavior of the system, so that the current system knowledge is not simply being extrapolated to make predictions under conditions that are new or rare in the observational record. For conditions not supported by the available data, this current estimate is based on prior information derived from theoretical considerations regarding system structure and behavior. More formally, we want the estimate of the operator My to have the following property: under new conditions, or conditions rarely experienced before (as represented by the y1:j−1,t, xt, ut values), yj,t is drawn from some prior distribution, and otherwise it is described by dependencies extracted from the historical data. An estimate with such properties was proposed by Ferguson [1983] for discrete pdf approximation using Dirichlet processes. Here, we seek an estimate for each conditional density component of the operator My in (6) as a weighted sum of a part given by the prior estimate and a part derived from the observations:

    p(yj,t | y1:j−1,t, xt, ut, st−1, z1:T) ∝ αj p0(yj,t | y1:j−1,t, xt, ut, st−1) + T p(y1:j−1,t, xt, ut | z1:T) p(yj,t | y1:j−1,t, xt, ut, z1:T)    (7)

where p0(·) is the prior estimate of the density, {αj, j = 1, …, dy} are the weights, and p(·) is the density function derived from the observations, so that ∫U(s0) p(s | z1:T) ds is the proportion of the historical extended states s that lies in a region U(s0). Note that the part of the density function derived from the observations does not depend on the extended state at the previous time since, if the mapping My were known, the output would depend only on the current state and input (equation (2)). In contrast, the part of the density given by the prior depends on the extended state at the previous time, and this helps to constrain the values of the output variables through the behavioral restrictions imposed by the form of the conceptual model (see the example in section 3.1).

[25] In this study we treat the weights {αj} as a part of a prior that does not depend on observations, but it is possible to extend the analysis by specifying a hyperprior on {αj}, as is implemented, for example, by Müller et al. [1996]. To decide whether to make a draw from the prior density or from the data-derived one, we compare the weights αj with T·p(y1:j−1,t, xt, ut | z1:T), j = 1, …, dy. The quantity T·p(y1:j−1,t, xt, ut | z1:T) is proportional to the number of historical points having values of output, state, and input within a sufficiently close neighborhood of (y1:j−1,t, xt, ut), and αj represents the number of observations we would be willing to trade against our prior information. As more data points become available, the data-derived contribution will become more dominant in the estimate of the mapping My.
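The draw-selection rule described above can be sketched as follows; the toy densities, the weight `alpha`, and the helper names are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_output(cond_point, alpha, T, p_cond, sample_prior, sample_data):
    """Draw one output value from the two-part density.

    The prior part is used with probability alpha / (alpha + T * p_cond),
    mirroring the comparison of alpha_j with T * p(conditioning point | data).
    """
    w_prior = alpha / (alpha + T * p_cond(cond_point))
    if rng.random() < w_prior:
        return sample_prior(cond_point)   # new/rare conditions: trust theory
    return sample_data(cond_point)        # well-sampled conditions: trust data

# Toy stand-ins for the three ingredients (illustrative assumptions only):
p_cond = lambda c: 0.4 if abs(c) <= 2 else 0.0   # density of conditioning point
sample_prior = lambda c: rng.normal(0.0, 1.0)    # theory-based prior draw
sample_data = lambda c: rng.normal(0.5, 0.1)     # data-derived draw

y_rare = draw_output(3.0, 5.0, 100, p_cond, sample_prior, sample_data)
y_common = draw_output(0.0, 5.0, 100, p_cond, sample_prior, sample_data)
```

For the rare conditioning point the historical density is zero, so the draw always comes from the prior; for the well-observed point the data-derived density dominates.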

[26] Further, since the prior is to be specified directly from the conceptual model (see the example in section 3), we only need to compute the data-derived part of the density. For this it is sufficient to compute the pdf p(x, y, u | z1:T) = p(s | z1:T) for the system, represented as an equally weighted sum of local posterior density functions for each time step:

    p(s | z1:T) = (1/T) ∑k=1,…,T pk(s | z1:T)    (8)

where pk(·) is the pdf for time step k, so that ∫U(s0) pk(s | z1:T) ds is the proportion of the extended states s at time k that lies within the region U(s0), and hence the corresponding weighted average is the proportion of the historical extended states s that lies within the region U(s0). This defines the portion of the pdf derived from the observations.

[27] A computationally efficient and compact way to represent the data-derived part of the pdf, one that also enables analytical representation of the marginal densities, is to use a mixture of multivariate normal distributions as described in Appendix A, so that:

    p(s | z1:T) ≈ ∑i=1,…,n ωi N(s; μi, Ωi)    (9)

where θi = (μi, Ωi), i = 1, …, n, are the parameters (means μi and covariance matrices Ωi) of the normal distributions, and the weights ωi sum to one (∑i ωi = 1). Of course, all of these parameters depend on the observations z1:T, but for notational simplicity this dependence is omitted.
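Evaluating such a mixture approximation is straightforward; the component weights, means, and covariances below are toy values, not estimates from the Leaf River application:

```python
import numpy as np

def mixture_pdf(s, weights, means, covs):
    """Evaluate a mixture of multivariate normal densities at point s."""
    s = np.asarray(s, dtype=float)
    d = s.size
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        diff = s - mu
        norm = ((2.0 * np.pi) ** d * np.linalg.det(cov)) ** -0.5
        expo = -0.5 * diff @ np.linalg.solve(cov, diff)
        total += w * norm * np.exp(expo)
    return total

# Two components in a 2-D extended-state space (toy parameters):
weights = [0.6, 0.4]                       # must sum to one
means = [np.zeros(2), np.array([2.0, 1.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]
density = mixture_pdf([0.0, 0.0], weights, means, covs)
```

Because each component is Gaussian, marginal and conditional densities of the mixture are again Gaussian mixtures, which is what makes this representation analytically convenient.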

[28] The main assumption used here in the estimation of the posterior pdf is the availability, at each time k = 1,T, of the pdf pk(s | z1:T). While the observation operator H provides information about the connections between the observations and the different variables of the system, it does not characterize the nature of their interdependence (i.e., the pdf pk(s | z1:T), k = 1,T, is not directly knowable from measurements alone). Its structure could, of course, be derived using the My operator if that operator were known; since it is not, we use a recursive algorithm to estimate both My and pk(s | z1:T), k = 1,T.

[29] The concept underlying this algorithm is similar to that used in the Gibbs sampling method [Casella and George, 1992; Gelfand and Smith, 1990; Gelman et al., 1995], which approximates draws from some multivariable pdf p(x,y) when only the forms of the conditional distributions p(x | y) and p(y | x) are known. It proceeds as follows: make an initial draw, say x0, so that it is possible to sample y0 from p(y | x0); use this to sample x1 from p(x | y0), and so on. After some number N of sequential draws, the sequence {xN+i, yN+i}i>0 will behave as though it comes from the underlying multivariable pdf p(x,y). In our case, this is the joint probability density function p(θs, θM | z1:T), where θs and θM are the parameters used to define the joint distribution of all T extended states s1:T and the output mapping estimate My, respectively. Since we are able to sample only from the conditional densities, i.e., from p(θs | θM, z1:T) and p(θM | θs, z1:T), and not from the full distribution, we propose to provide an initial guess (estimate) of the output operator structure (corresponding to some prior density function p0), draw joint state distribution parameters conditional on that estimate of the output operator, condition the subsequent sampling of the output operator parameterization on this estimate of the state distribution (as above), and so on. Under the assumption that a unique output operator structure exists (i.e., that the existence of an output model given the conceptual model is not statistically ambiguous), the sampling scheme is expected to lead to convergence of the output operator structure (statistical equality).
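The alternation between conditionals can be illustrated on the textbook case of a bivariate normal with correlation ρ, where both full conditionals are known Gaussians. This toy sketch (not the BESt sampler itself) shows how repeated conditional draws recover the joint distribution:

```python
import random

def gibbs_bivariate_normal(rho, n_samples, n_burn=500, seed=1):
    """Gibbs sampling from a standard bivariate normal with correlation rho,
    using only the conditionals p(x|y) = N(rho*y, 1 - rho^2) and vice versa."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    sd = (1.0 - rho ** 2) ** 0.5
    draws = []
    for i in range(n_burn + n_samples):
        x = rng.gauss(rho * y, sd)   # sample x | y
        y = rng.gauss(rho * x, sd)   # sample y | x
        if i >= n_burn:
            draws.append((x, y))
    return draws

samples = gibbs_bivariate_normal(0.8, 5000)
```

After the burn-in period, the retained (x, y) pairs have approximately zero mean and sample correlation close to ρ, even though the joint density was never evaluated directly.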

2.6. Recursive Algorithm for Estimation of the Model Structure Operator My

[30] Following the approach of using a prior structure My,− for the output operator to estimate pk(s | z1:T), k = 1,T, we recursively build a posterior structure My,+ as described in the previous section. Each new estimate of the posterior becomes the prior for the next iteration, and the procedure is repeated to convergence. Representing the prior and posterior density estimators at the lth iteration as Ml,− and Ml,+ respectively, and the individual time step density estimators as pkl(s | z1:T), k = 1,T, the recursive algorithm proceeds as follows:

[31] In step 0, begin with a prior M0,− = p0. Use this prior to construct pk0(s | z1:T), k = 1,T. Use these densities to compute the posterior M0,+.

[32] In step 1, assign the prior M1,− = M0,+. Use this prior to construct pk1(s | z1:T), k = 1,T. Use these densities to compute the posterior M1,+.

[33] In step l + 1, assign the prior Ml+1,− = Ml,+. Use this prior to construct pkl+1(s | z1:T), k = 1,T. Use these densities to compute the posterior Ml+1,+.

[34] Iterate to convergence.
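The steps above can be sketched as a generic prior-to-posterior iteration; `construct_densities`, `compute_posterior`, and `distance` below are hypothetical stand-ins for the BESt steps, illustrated on a toy fixed-point problem rather than the actual density estimation:

```python
def estimate_structure(prior, construct_densities, compute_posterior,
                       distance, tol=1e-6, max_iter=100):
    """Recursive loop of section 2.6: prior -> per-step densities -> posterior,
    with the posterior becoming the next prior, repeated to convergence."""
    M_minus = prior                                  # step 0: M0,- = p0
    for _ in range(max_iter):
        densities = construct_densities(M_minus)     # pk^l(s | z_{1:T})
        M_plus = compute_posterior(densities)        # M^{l,+}
        if distance(M_minus, M_plus) < tol:          # convergence test
            return M_plus
        M_minus = M_plus                             # posterior becomes new prior
    return M_plus

# Toy illustration: the "posterior update" is a contraction with fixed point 2.0
result = estimate_structure(
    prior=0.0,
    construct_densities=lambda M: M,
    compute_posterior=lambda d: 0.5 * d + 1.0,
    distance=lambda a, b: abs(a - b))
```

Under the contraction assumption the loop converges regardless of the initial prior, mirroring the expectation that the output operator structure converges when it is uniquely identifiable.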

[35] This method combines the separate estimates of pk(s | z1:T) and thereby summarizes all the available information in the observational data regarding the codependence of the system variables. Appendix B illustrates how this algorithm works for the “cartoon” case of a very simple input-state-output dynamical model.

[36] The method described above requires estimation of the pdf's pk(s | z1:T), k = 1,T conditioned on the current approximation of My, which can be done via a process of data assimilation [Wikle and Berliner, 2007]. To approximate the forecasting, analysis and smoothing distributions, we use the method of particle filtering [Doucet et al., 2000; Arulampalam et al., 2002; Hurzeler and Kunsch, 1998], which does not require assumptions to be made about the specific forms of the model structure or of the posterior distributions.
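A minimal bootstrap particle filter (one common variant of the methods cited above, not the specific implementation used here) can be sketched as follows; `propagate`, `likelihood`, and `init` are hypothetical model-specific callables:

```python
import math
import random

def bootstrap_particle_filter(y_obs, propagate, likelihood, init,
                              n_particles=500, seed=0):
    """Minimal bootstrap particle filter: propagate each particle through the
    state model, weight by the observation likelihood, then resample."""
    rng = random.Random(seed)
    particles = [init(rng) for _ in range(n_particles)]
    filtered_means = []
    for y in y_obs:
        particles = [propagate(x, rng) for x in particles]      # forecast
        weights = [likelihood(y, x) for x in particles]         # analysis
        total = sum(weights)
        weights = [w / total for w in weights]
        particles = rng.choices(particles, weights=weights,     # resampling
                                k=n_particles)
        filtered_means.append(sum(particles) / n_particles)
    return filtered_means

# Toy scalar system: x_{k+1} = 0.9 x_k + noise, observed with unit-variance noise
means = bootstrap_particle_filter(
    y_obs=[1.0, 1.1, 0.9],
    propagate=lambda x, rng: 0.9 * x + rng.gauss(0.0, 0.3),
    likelihood=lambda y, x: math.exp(-0.5 * (y - x) ** 2),
    init=lambda rng: rng.gauss(0.0, 1.0))
```

No Gaussian or linearity assumptions enter the filter itself, which is why the approach places no restrictions on the forms of the model structure or posterior distributions.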

[37] To terminate the recursive procedure, we test for closeness of successive estimates pl−1,+(s | z1:T) and pl,+(s | z1:T) using the Kullback-Leibler (KL) divergence measure (also known as information divergence, information gain, or relative entropy) [Kullback and Leibler, 1951]. The KL divergence measures the difference between two densities p and q as:

  KL(p, q) = ∫ p(s) log [p(s) / q(s)] ds

The value of KL divergence is always nonnegative and equals zero if and only if p = q.
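For densities discretized on a common grid the divergence reduces to a sum; a small sketch, with hypothetical three-bin densities:

```python
import math

def kl_divergence(p, q):
    """Discrete KL(p || q) = sum_i p_i * log(p_i / q_i); zero iff p == q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
d_pq = kl_divergence(p, q)   # strictly positive, since p != q
d_pp = kl_divergence(p, p)   # exactly zero
```

Note that KL divergence is not symmetric (KL(p, q) ≠ KL(q, p) in general), which is why a fixed direction of comparison between successive iterates must be chosen.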

2.7. Making Predictions Using the Estimated Model Structure

[38] Having estimated the mathematical structure of the model, the data-derived pdf p(s | z1:T) can be used for model prediction in several ways: (1) by itself with no prior estimate of the mathematical structure of the model, (2) in conjunction with a noninformative (or poorly informative) prior estimate of the model structure, and (3) in conjunction with some informative prior estimate of the model structure (that may be inaccurate and therefore needs to be corrected/updated) using the formulation:

  p(yj,t | y1:j−1,t, xt, ut, z1:T) ∝ αj p0(yj,t | y1:j−1,t, xt, ut) + T p(y1:j−1,t, xt, ut | z1:T) pdata(yj,t | y1:j−1,t, xt, ut)    (11)

where pdata(·) denotes the conditional density derived from the data-based mixture estimate above.

The case of no prior model structure estimate corresponds to αj = 0, and can be used when the data length T is large and represents all possible cases of interest, or when the data are believed to represent the model dynamics sufficiently well. The case of a noninformative prior corresponds to the situation where the prior pdf p0 reflects only our conceptual prior descriptions of the behavioral restrictions in the system, without dependence on the extended state at the previous time. Following the discrete time formulation (2), the output at time k depends only on the system input and state at the same time k. Since the output in (11) does not depend on the previous extended state, as it does in (7), this representation can be used as an estimate of the output mapping in (2).
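The trade-off between the prior weight α and the T data points can be illustrated with a schematic sketch (this is a simplified stand-in for the formulation above, not the exact BESt formula; `p0` and `p_data` are hypothetical density callables):

```python
def blended_density(alpha, T, p0, p_data):
    """Schematic prior/data blend: the prior density p0 carries weight alpha
    (in 'equivalent observations') against the T data points behind the
    data-derived density p_data."""
    w_prior = alpha / (alpha + T)
    return lambda y: w_prior * p0(y) + (1.0 - w_prior) * p_data(y)

# alpha = 0 recovers the pure data-derived case discussed above
pred = blended_density(alpha=0.0, T=120,
                       p0=lambda y: 1.0,        # flat (noninformative) prior
                       p_data=lambda y: 2.0 * y)
```

As α grows relative to T the prediction is pulled toward the prior; as more data accumulate (T grows), the data-derived part dominates, matching the behavior described in paragraph [25].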

[39] In the case of no prior, any prediction outside the range of the observational data used for model estimation is purely an extrapolation, and can therefore be poor when new conditions are encountered that were not well represented in that data. In the case of the noninformative prior, a further degree of prediction uncertainty is introduced that reflects the uncertainty attributable to lack of knowledge of the mathematical form of the model equations; its influence on the prediction depends on the strength of the prior weight α. However, the most interesting case is that of the informative prior, where the prediction reflects the relative weighting of the prior model and the data; in this case if the prior model is in fact incorrect (biased), the BESt procedure will correct (update) its form (via the joint pdf) to reflect the information assimilated from the observations.

3. Model Structure Identification by Data Assimilation: Application to Leaf River Basin


[40] In this section we demonstrate and evaluate the BESt methodology using data from the Leaf River basin. This humid 1944 km2 watershed, located near Collins in southern Mississippi, has been investigated intensively [e.g., Sorooshian et al., 1983; Brazil, 1988; Gupta et al., 1998]. The data set we use consists of 40 years of mean areal daily precipitation (mm/d), potential evapotranspiration (mm/d), and streamflow (mm/d). Our interest is to derive a spatially lumped model of the structure and behavior of the basin water balance at several time scales: annual (case A), monthly (case M) and weekly (case W). We begin, therefore, with a simple conceptual mass balance description of the processes in this system (Figure 2); the system is driven by two inputs (potential evapotranspiration and precipitation), has a single state variable (representing aggregate soil moisture storage), and produces two outputs (evapotranspiration and outflow). In this study we assume that this conceptual model is correct and therefore ignore any uncertainty associated with its specification. However, an analysis using the false nearest neighbors test [Kennel et al., 1992; Hegger and Kantz, 1999] suggested that while a one-dimensional state variable may be sufficient to describe the dynamics of the system at the annual and monthly time scales [see Bulygina, 2007], this conceptual representation may be inadequate at the weekly scale; we will return to this point later. Further, the test revealed that at the monthly and weekly time scales the watershed response to driving forces is characterized by a 2-day lag (i.e., the monthly and weekly accumulated outflows depend most strongly on the corresponding precipitation and potential evapotranspiration accumulations over a period offset by a 2-day shift).

Figure 2. Schematic representation of a simple conceptual mass balance model. The system is driven by two inputs (precipitation and potential evapotranspiration), has a single state variable (representing aggregate soil moisture storage), and produces two outputs (evapotranspiration and outflow).

3.1. Conceptual Model

[41] The conceptual model for Leaf River basin has a five-dimensional extended state-space sk = (u1,k, u2,k, xk, y1,k, y2,k) at each time k, where u1 represents potential evapotranspiration (PET; mm/time step), u2 represents precipitation (PPT; mm/time step), x is the total soil moisture storage in the basin (SM; mm), y1 represents actual evapotranspiration (ET; mm/time step), and y2 represents the outflow rate (QQ; mm/time step). The model preserves overall system water balance by relating the inputs, state variables and outputs via the continuity equation xk+1 = xk + dt·(u2,k − y1,k − y2,k), where dt corresponds to the length of the integration time step. Since the time step is assumed to be constant and equal to unity, the simpler form of this equation, xk+1 = xk + u2,k − y1,k − y2,k, will be used hereafter.
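Stepping the continuity equation forward is a direct mass-balance update; a minimal sketch with illustrative (hypothetical) values in mm:

```python
def step_water_balance(x, ppt, et, qq):
    """One step of the continuity equation x_{k+1} = x_k + u2 - y1 - y2
    (unit time step), with all quantities in mm."""
    return x + ppt - et - qq

x = 500.0  # illustrative initial storage (mm)
for ppt, et, qq in [(120.0, 60.0, 40.0), (10.0, 70.0, 15.0)]:
    x = step_water_balance(x, ppt, et, qq)
# after a wet step and a dry step, storage is 500 + 20 - 75 = 445 mm
```

Because every flux enters the update exactly once, the scheme conserves mass by construction regardless of how the output mapping is estimated.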

[42] To preserve the physical meaning of the driving forces, state variables and system responses, the following natural restrictions must be applied to the system variables:

[43] 1. Condition 1 is nonnegativity: u1,k ≥ 0, u2,k ≥ 0, xk ≥ 0, y1,k ≥ 0, y2,k ≥ 0. This condition requires all variables to be nonnegative. Further, application of the continuity equation restricts the sum of ET and QQ to not exceed the sum of PPT and SM at each moment of time.

[44] 2. Condition 2 is boundedness: condition 2a is for the state, xk ≤ M*, and condition 2b is for ET, y1,k ≤ min {u2,k + xk, u1,k}. Condition 2a ensures that the storage is bounded above by a constant fixed capacity M*, conceived as the maximum possible accumulated water depth in the upper 2 m layer of soil. Condition 2b ensures that actual evapotranspiration does not exceed demand (potential evapotranspiration) or the actually available water depth (stored water plus precipitation depth).

[45] 3. Condition 3 is monotonicity (ET):

  ∂y1,k / ∂(u2,k + xk) ≥ 0
  ∂y1,k / ∂u1,k ≥ 0

These conditions reflect the conceptual idea that as more water is available and as evapotranspiration demand increases, the rate of actual evapotranspiration also increases. The probabilistic extended state-space formulation for the conceptual model described here appears in Appendix C.
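Conditions 1 and 2 can be checked directly for any proposed extended state; a sketch with a hypothetical capacity M* = 700 mm (condition 3 constrains the mapping rather than a single state, so it is not tested here):

```python
def satisfies_constraints(u1, u2, x, y1, y2, M_star=700.0):
    """Check conditions 1, 2a, and 2b on one extended state
    (u1 = PET, u2 = PPT, x = storage, y1 = ET, y2 = outflow; all mm)."""
    nonneg = all(v >= 0 for v in (u1, u2, x, y1, y2))   # condition 1
    bounded_state = x <= M_star                          # condition 2a
    bounded_et = y1 <= min(u2 + x, u1)                   # condition 2b
    return nonneg and bounded_state and bounded_et

ok = satisfies_constraints(u1=5.0, u2=10.0, x=400.0, y1=4.0, y2=6.0)
bad = satisfies_constraints(u1=5.0, u2=10.0, x=400.0, y1=6.0, y2=6.0)  # ET > PET
```

In a prior density, states violating these restrictions would simply receive zero probability, which is how the conceptual behavioral restrictions enter the estimation.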

3.2. Data Used

[46] The following data were used for estimation of the system equation structure and for performance evaluation:

[47] 1. For the annual model, there are 40 annual data points available (WY 1949–1988), so all points were used for model structure estimation. Since the amount of annual data is limited, the data set used for model performance evaluation was constructed by applying a 6-month time shift to the daily data before aggregation. These two data sets are therefore not truly independent, but may help reveal whether the estimated model structure is sensitive to the choice of aggregation period.

[48] 2. For the monthly model, the model structure was estimated using 10 years of aggregate monthly data (WY 1954–1963; 120 data points), and evaluated using the subsequent 10 years of aggregate monthly data (WY 1964–1973; 120 data points).

[49] 3. For the weekly model, the model structure was estimated using a 3 year period of data (WY 1960–1962; 156 data points) which included a dry year, a moderate year and a relatively wet year. A separate 3-year period (WY 1972–1974; 156 data points) was used for model evaluation.

[50] In all cases we assume that the observation error characteristics of the estimation and evaluation period data are the same.
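The aggregation of the daily series into annual, monthly, or weekly accumulations, including the shifted annual evaluation set, can be sketched as follows (the 182-day offset is an illustrative approximation of the 6-month shift, and the data here are synthetic):

```python
import numpy as np

def aggregate(daily, period, offset=0):
    """Sum daily values (mm/d) into consecutive blocks of `period` days,
    optionally shifting the start of the series by `offset` days."""
    d = np.asarray(daily)[offset:]
    n = len(d) // period
    return d[:n * period].reshape(n, period).sum(axis=1)

daily = np.ones(730)                          # two years of synthetic 1 mm/d data
annual = aggregate(daily, 365)                # two annual totals of 365 mm each
shifted = aggregate(daily, 365, offset=182)   # one total from the shifted series
```

Trailing days that do not fill a complete block are discarded, so the shifted series yields one fewer annual point than the unshifted one.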

3.3. Measures Used for Performance Evaluation

[51] The mathematical form of the system equation estimated by the BESt procedure is a five-dimensional pdf that describes the joint stochastic relationship between the five system variables. Running the model in prediction mode therefore results in a time series of pdf's of the model outputs (as opposed to deterministic values) that reflect the effects of the uncertain inputs, uncertain model structure, and uncertain state values. In this work we characterize these prediction pdf's in several ways: by their expected value, maximum likelihood value (density maximum), and the means of the 25%, 50%, 75% and 95% probability mass intervals. For each of these “predictors” of actual outflow we report the overall bias statistic, and the Nash-Sutcliffe (NS) performance measure, computed as:

  NS = 1 − Σk (y2,k − ŷ2,k)² / Σk (y2,k − ȳ2)²

where ŷ2,k denotes the predicted outflow and ȳ2 denotes the average of the observed outflow values. The Bias statistic is an indicator of the overall tendency to preserve water balance, and the NS statistic is a variance-normalized measure of model accuracy (for a discussion of the strengths and weaknesses of the NS measure, see Schaefli and Gupta [2007]).
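Both statistics are straightforward to compute; a sketch with small illustrative series:

```python
def nash_sutcliffe(obs, sim):
    """NS = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2)."""
    mean_obs = sum(obs) / len(obs)
    ss_res = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def percent_bias(obs, sim):
    """Overall bias as a percentage of total observed flow."""
    return 100.0 * (sum(sim) - sum(obs)) / sum(obs)

obs = [1.0, 2.0, 3.0, 4.0]
ns_perfect = nash_sutcliffe(obs, obs)            # 1.0 for a perfect predictor
bias = percent_bias(obs, [1.1, 2.1, 3.1, 4.1])   # consistent over-prediction
```

NS equals 1 for a perfect prediction, 0 for a predictor no better than the observed mean, and can be negative; Bias is positive when flow is over-predicted in total.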

3.4. Results

[52] First we show the structural form of the system equation estimated by the procedure. Figures 3 and 4 present response surface maps showing how the expected value of outflow and the expected value of evapotranspiration vary with potential evapotranspiration, precipitation, and stored water. Because the probabilistic estimate of the output mapping My provides a representation of the overall statistical uncertainty in our estimate of the “true” mapping My (see (2)), it cannot be fully characterized by a single surface. Also, although the estimated system equation is a five-dimensional pdf characterizing the joint stochastic relationship between all the variables, we can only show visualizations of the three-dimensional projections of the pdf conditioned on assumed fixed values for the other variables. Characterizing the results in one of many possible ways, we show here only the three-dimensional response surfaces of the statistical “expectations” of the output relationships, these being akin to the deterministic relationships described by conventional model equations. The response surface plots show a more or less monotonic dependence of the outputs on the inputs and state. As expected, higher precipitation rates result in higher outflow and evapotranspiration rates (Figure 3a), potential evapotranspiration has a stronger effect on actual evapotranspiration at higher precipitation rates (Figure 3b), and larger system storage results in higher outflow and evapotranspiration rates (Figures 3a and 3b).

Figure 3. Output response surfaces for the annual model: (a) how the expected value of outflow varies with PET, precipitation, and state (from bottom to top, x = 450, 550, and 650 mm) and (b) how the expected value of evapotranspiration varies with the same quantities. The black dots represent projections onto the surfaces of the data used in structure estimation. The ellipses highlight data-scarce regions.

Figure 4. Output response surfaces for the monthly model: (a) how the expected value of outflow varies with PET, precipitation, and state (from bottom to top, x = 550, 600, and 650 mm) and (b) how the expected value of evapotranspiration varies with the same quantities. The black dots represent projections onto the surfaces of the data used in structure estimation. The ellipses highlight data-scarce regions.

Download figure to PowerPoint

[53] Figures 3 and 4 also show that some regions of the mapping space (highlighted by ellipses) are associated with relatively low data densities, and in these regions the mapping estimates can be misleading since they are based on extrapolation: for example, Figure 3b shows a loss of the monotonic increase of annual evapotranspiration with increasing precipitation, and Figure 4a shows a nonmonotonic dependence of monthly outflow on potential evapotranspiration rate and a sharp change in monthly outflow rate at high precipitation values. This highlights the fact that model structure mappings estimated using sparsely representative data sets will generally require conditioning by means of some prior information (such as a noninformative or informative prior), so that physical (conceptual) realism can be preserved when extrapolating into hydrologic regimes not covered by the historical data.

[54] Figures 5–7 show plots of the time evolution of uncertainty in the model output predictions for both the estimation and evaluation periods; these plots show the total output uncertainty associated with uncertainty in the specification of initial states, observation uncertainty in the inputs, and estimation uncertainty associated with the model structure (Table 1 provides the corresponding statistics). At the annual and monthly time scales the quality of the prediction remains consistent from estimation to evaluation period (as measured by the NS efficiency values reported in Table 1), indicating that the hypothesis regarding conceptual model structure is not contradicted. However, at the weekly time scale, the model performance deteriorates when going from estimation to evaluation period (the NS efficiency of the expected value predictor reduces from 0.93 to 0.69), indicating that the hypothesis regarding conceptual model structure is not supported: note, for example, that the prediction looks “flashy” at low flows when compared to the observations. This behavior at the weekly time scale is likely due to the lack of streamflow routing and/or interception components [Savenije, 2004] or other processes in the conceptual model, and/or perhaps due to a higher sensitivity of weekly response to (unrepresented) spatial patterns in the rainfall. We also note a consistent tendency to overestimate the evaluation period outflow at the monthly and weekly time scales, leading to incorrect water partitioning (Table 1), which may be due to the poor quality of the assimilated information regarding ET (all we impose is that ET is bounded above by PET) [Bulygina, 2007], due to incorrect conceptual model structure, or due to incorrect measurement error structure. Overall, the use of the expected value of system output as a predictor gives the best performance in terms of the NS statistic (residual error variance) but the worst performance in terms of the Bias statistic (Table 1).

Figure 5. Time series predictions of outflow at the annual time step: (a) estimation period uncertainty and (b) evaluation period uncertainty. Results are shown for 95% confidence (light gray), 75% confidence (medium gray), 50% confidence (dark gray), and 25% confidence density regions (black). The circles indicate the data along with their 95% confidence intervals, the squares indicate the expected value predictions, and the triangles indicate the maximum likelihood predictions.

Figure 6. Time series predictions of outflow at the monthly time step: (a) estimation period uncertainty and (b) evaluation period uncertainty. Results are shown for 95% confidence (light gray), 75% confidence (medium gray), 50% confidence (dark gray), and 25% confidence density regions (black). The circles indicate the data along with their 95% confidence intervals, the squares indicate the expected value predictions, and the triangles indicate the maximum likelihood predictions.

Figure 7. Time series predictions of outflow at the weekly time step: (a) estimation period uncertainty and (b) evaluation period uncertainty. Results are shown for 95% confidence (light gray), 75% confidence (medium gray), 50% confidence (dark gray), and 25% confidence density regions (black). The circles indicate the data along with their 95% confidence intervals, the squares indicate the expected value predictions, and the triangles indicate the maximum likelihood predictions.

Table 1. Model Performance Statistics

              Expected Value    Max Likelihood    95% Mass        75% Mass        50% Mass        25% Mass
Period        Bias (%)   NS     Bias (%)   NS     Bias (%)  NS    Bias (%)  NS    Bias (%)  NS    Bias (%)  NS

Annual
  Estimation    3.8    0.88      −1.6    0.88       2.1   0.7      −0.7   0.81     −0.2   0.86     −0.2   0.87
  Evaluation    5.2    0.88      −4.2    0.88       3.4   0.64      0.1   0.79     −2.6   0.83     −3.4   0.86

Monthly
  Estimation    5.2    0.94      −6.4    0.94       3.1   0.87     −0.1   0.92     −0.6   0.93     −6.4   0.94
  Evaluation   18.7    0.86      10.2    0.78      16.3   0.67     12.3   0.72      9.6   0.73     10.2   0.78

Weekly
  Estimation   10.4    0.93       7.9    0.93       8.5   0.9       7.3   0.92      6.5   0.93      7.9   0.93
  Evaluation   17.7    0.69       5.6    0.67      14.8   0.6      10.6   0.64      8.7   0.67      5.6   0.67

[55] Finally, Figure 8 shows the impact of each of the three sources of uncertainty (specification of initial conditions, observation of the inputs, and estimation of model structure) on the output predictions during the evaluation period; here we show symmetric 95% confidence intervals. The light gray region indicates the prediction uncertainty arising from initial condition uncertainty alone; the effects of initial condition uncertainty die out very quickly (the gray region becomes essentially a line after the first two months). This is expected since the watershed is a highly stable system. The medium gray region indicates the prediction uncertainty arising from both initial condition uncertainty and model structural uncertainty. Finally, the dark gray region indicates the total prediction uncertainty caused by all three sources taken together. Note that the uncertain model predictions track the data (red dots, shown with 95% confidence intervals) well. Note also that input uncertainty makes a much stronger contribution to the total uncertainty at the annual time scale, indicating that the system is strongly sensitive to the driving forces, while at the monthly and weekly time scales the impact of structure uncertainty dominates (except for some of the flow peaks).

Figure 8. Components of the evaluation period prediction uncertainty for outflow at the (a) annual, (b) monthly, and (c) weekly time steps. Symmetric 95% confidence intervals are shown for uncertainty caused by poor knowledge of initial conditions (light gray), uncertainty due to both uncertain initial conditions and estimated model structure (medium gray), and uncertainty due to all three sources (initial conditions, structure, and input, dark gray). The circles indicate the data along with their 95% confidence intervals.

3.5. Time Scale Issues

[56] A reviewer of this manuscript raised a question regarding how much information can be extracted at different sampling rates. One of the concerns was related to the Nyquist-Shannon sampling theorem [Shannon, 1949] describing the signal-sampling frequency necessary for a signal to be perfectly reconstructed from the samples taken at that rate. We should therefore stress that our analysis is not conducted with a decimated signal (i.e., instantaneous samples taken periodically each year, month or week), it uses accumulated data, and, further, that we do not seek to reconstruct the signal at times in between the sampling points (that would be unreasonable because of high signal frequency).

[57] A further concern raised was that the rainfall-flow relationship at the annual time scale could degenerate to a static input-output relationship, with no dependence on state and no significant time variation of the state. While our analysis estimates a form for the input-state-to-output dependence that could appear to be static (Figure 3 shows that at the annual time scale the outflow is not strongly sensitive to soil moisture), and while the estimate for the output mapping in (2) appears to perform well, it does not reveal the full extent of the dynamics of the catchment; that is, there is no direct indication that the mapping is static or that the state varies little. To explore the full catchment functioning, a more comprehensive analysis than the one performed here would be required; here we selected an a priori form for the conceptual model and set as our goal only the estimation of the output mapping structure.

4. Summary and Conclusions


[58] In this work, we discuss a Bayesian data assimilation approach that can be used to estimate the uncertain (probabilistic) mathematical structure of the system equations, valid at the scale of available system observations and conditioned on some prior conceptual understanding regarding the process physics. The uncertain relationships among the system variables are estimated in the form of a joint pdf conditioned on the prior model and the data, and the prediction model is derived as the conditional pdf of the system outputs given knowledge of the system inputs and other variables. The recursive algorithm, entitled Bayesian estimation of structure (BESt), enables a prior mathematical model of the system to be corrected (updated) using information contained in historical system data. The approach is not limited to any particular class of conceptual system structures, and has the following strengths:

[59] 1. It facilitates identification of the mathematical form of the system equations given only conceptual knowledge of the system structure, and without the need for strong additional assumptions to be made regarding the mathematical form of the system equations.

[60] 2. It facilitates inference of the nature of the relationships among system variables using a Bayesian probabilistic framework as an alternative to the conventional approach involving scaling up of small-scale governing equations.

[61] 3. It provides a method for representing the fact that our knowledge of the form of the system equations is at best uncertain; that is, instead of assuming known mathematical structures for the system equations, the model summarizes both what we know (certainty) and what we do not know (uncertainty) about the structural form and behavior of the system.

[62] 4. It facilitates an evaluation of the uncertainty in the predicted system response arising from uncertainties due to three major components (uncertainty in knowledge of initial conditions, uncertainty in input observations, and uncertainty in knowledge of the mathematical form of the system equations); future work will include effects of uncertainty in system conceptualization.

[63] 5. It provides a mechanism for correcting existing (preassumed) forms for the mathematical system equations while inferring the degree of uncertainty remaining in our formal characterization of the mapping relationships.

[64] A preliminary demonstration of the capabilities of the BESt method was performed by estimating simple water balance models at annual, monthly and weekly time scales for the Leaf River basin near Collins, Mississippi. The spatially lumped input-state-output response of the watershed was conceptualized as a “finite leaky bucket” having a single soil moisture storage state variable, driven by precipitation and potential evapotranspiration, and generating outflow and evapotranspiration as system outputs. On the basis of prior (conceptual) knowledge of the watershed, it was considered reasonable to exclude representations of channel routing, groundwater recharge/discharge, pumping, and snow accumulation/melting from the conceptual description of the system; future work will explore the marginal value of using progressively more sophisticated conceptual model structures. The results indicate the following:

[65] 1. Predictions computed for hydrological conditions not represented in the historical data can be highly uncertain and misleading (as might be expected; e.g., see the estimates of evapotranspiration). To make interpolations and extrapolations beyond the conditions represented by historical data we must condition the predictions on some prior assumptions regarding model conceptual structure and the form of the system equations.

[66] 2. At the annual and monthly time scales, the performance of the estimated model remains consistent when going from estimation to evaluation data periods, indicating that the hypothesis regarding conceptual model structure is not contradicted.

[67] 3. At the weekly time scale, the model performance deteriorates when going from estimation to evaluation period, and the streamflow predictions are biased. This indicates that the simple conceptual model is not adequate at shorter time scales, and that other hydrological processes (such as channel routing) may need to be included.

[68] 4. The model provides estimates of the prediction uncertainty associated with different sources that compare with the uncertain output observations in a statistically reasonable way. In particular:

[69] 5. Output uncertainty due to uncertainty in knowledge of initial conditions decays rapidly indicating that the system is highly stable.

[70] 6. At the annual time scale, the largest component of prediction uncertainty arises primarily from errors in the observations of system input.

[71] 7. At the monthly and weekly time scales, the largest component of prediction uncertainty arises primarily from uncertain knowledge of system structural form, except during peak flows when the effect of input observation error becomes dominant.

[72] In general, the results indicate that the system conceptualization assumed here is suitable for representing the dynamics of the Leaf River watershed system at annual and monthly time scales, but may be unsuitable for weekly and shorter time scales.

[73] An important characteristic of the approach presented here is that it can be used to facilitate the development of system models at the scale of available observations, for comparison with existing “physical” models derived in the conventional manner via the scaling up of small-scale governing equations. Because the approach does not depend on strong prior assumptions regarding the mathematical form of the system equations, and because information regarding that mathematical form can be extracted from system-scale data as they become available, the estimated posterior model can provide a progressively more accurate representation of both what we know (certainty) and what we do not know (uncertainty) about the structure and behavior of the system. The system equations are derived from the “posterior” joint probability density function of the system variables, and are constructed in such a way that data assimilation can help to correct stated “prior” beliefs regarding the mathematical forms of the dependencies. Meanwhile, under conditions for which no system data are available, “prior” knowledge about the system can be incorporated and will dominate. The key assumptions involved are that an adequate conceptual structure for the system can be specified and that the error characteristics of the observational data can be quantified. In ongoing work we are extending the approach to (1) include conceptualizations of hydrologic processes such as channel routing, groundwater recharge, pumping, and snow accumulation and melt; (2) incorporate the effects of conceptual model uncertainty; (3) investigate the interplay between inadequate conceptualization and insufficiency of information in the observational data; (4) explore how the approach may be used to facilitate diagnosis of conceptual model inadequacies (see Gupta et al. [2008] for a discussion of diagnostic model evaluation); (5) explore use of the method to detect and correct errors in preexisting models; (6) investigate the effects of varied assumptions regarding the structure of errors in the input and output observations; and (7) explore methods for inferring (and correcting assumptions regarding) the structures of the data error via Bayesian inference. As always, we invite dialog and collaboration regarding these and other issues of model estimation.

Appendix A: Mixture of Multivariate Normal Distributions Approximation

[74] In this appendix a pdf is said to be approximated by a mixture of multivariate normal distributions if the mixture is constructed on the principle of maximum likelihood for draws from the approximated distribution. Given a batch of observations z_{1:T}, the data-derived part of the pdf p(s ∣ z_{1:T}) can be approximated via a mixture of multivariate normal distributions as described by Müller et al. [1996]:

  p(s ∣ z_{1:T}) ≈ ∫ p(s ∣ θ_{M·T}, s_{1:M,1:T}) p(θ_{M·T} ∣ s_{1:M,1:T}) dθ_{M·T}    (A1)

where s_{1:M,1:T} are independent realizations of random variables from the p_k(s ∣ z_{1:T}), k = 1, …, T, density functions, so that s_{m,k} ∼ p_k(s ∣ z_{1:T}), m = 1, …, M, k = 1, …, T; that is, the s_{1:M,1:T} are draws corresponding to the approximated density (8). Here θ_{M·T} = (θ_1, …, θ_{M·T}), where θ_i = (μ_i, Ω_i), i = 1, …, M·T, are the parameters of a normal mixture with means μ_i and correlation matrices Ω_i. In any set θ_{M·T} there will be k ≤ M·T distinctive values, denoted θ* = (θ_1*, …, θ_k*), with T_j the number of occurrences of θ_i = θ_j*, j = 1, …, k, so that T_1 + … + T_k = M·T. Since the draws s_{1:M,1:T} depend on the observations z_{1:T}, the parameters of the mixture of multivariate normal distributions (means, correlation matrices, and numbers of occurrences) also depend on the observations.

[75] From a Bayesian perspective, the second posterior density p(θ_{M·T} ∣ s_{1:M,1:T}) in (A1) can be estimated using a Dirichlet process prior [Ferguson, 1983; Müller et al., 1996]. A Gibbs sampler [Gelfand and Smith, 1990; Müller et al., 1996] can be implemented to sample from this density function, with draws denoted {θ_q^{M·T}}_{q=1}^{P}. The first posterior density in (A1) is

  p(s ∣ θ_{M·T}, s_{1:M,1:T}) = [α / (α + M·T)] ∫ N(s ∣ μ, Ω) dG₀(μ, Ω) + [1 / (α + M·T)] Σ_{i=1}^{M·T} N(s ∣ μ_i, Ω_i)    (A2)

which for a large number of observations can be reduced to the following sum of multivariate normal distributions [Müller et al., 1996]:

  p(s ∣ θ_{M·T}, s_{1:M,1:T}) ≈ Σ_{j=1}^{k} (T_j / (M·T)) N(s ∣ μ_j*, Ω_j*)    (A3)

Using draws from p(θ_{M·T} ∣ s_{1:M,1:T}) and equations (8) and (A1)–(A3), the pdf p(s ∣ z_{1:T}) is estimated as

  p(s ∣ z_{1:T}) ≈ (1/P) Σ_{q=1}^{P} Σ_{j=1}^{k(q)} (T_j(q) / (M·T)) N(s ∣ μ_{q,j}*, Ω_{q,j}*)    (A4)

where θ_q* = (θ_{q,1}*, …, θ_{q,k(q)}*) is the set of distinctive values in θ_q^{M·T} = (θ_{q,1}, …, θ_{q,M·T}), with T_j(q) the number of occurrences of θ_{q,i} = θ_{q,j}*, j = 1, …, k(q), i = 1, …, M·T. Thus, the posterior joint probability density function p(s ∣ z_{1:T}) is expressed as a weighted sum of multivariate normal distributions.
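The weighted-sum representation in (A4) can be sketched numerically as follows. This is a minimal numpy illustration of evaluating a weighted mixture of multivariate normals given its parameters; the component weights, means, and covariance matrices below are hypothetical stand-ins for the Gibbs sampler output, not the Dirichlet process machinery of Müller et al. [1996] itself.

```python
import numpy as np

def mvn_pdf(s, mu, cov):
    """Multivariate normal density N(s | mu, cov) evaluated at a single point s."""
    d = len(mu)
    diff = s - mu
    norm = 1.0 / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

def mixture_density(s, weights, means, covs):
    """Weighted sum of multivariate normals: sum_j w_j N(s | mu_j, Omega_j)."""
    return sum(w * mvn_pdf(s, mu, cov)
               for w, mu, cov in zip(weights, means, covs))

# Hypothetical two-component mixture; the weights T_j/(M*T) must sum to 1.
weights = [0.6, 0.4]
means = [np.array([1.0, 2.0]), np.array([3.0, 0.5])]
covs = [np.eye(2), 0.5 * np.eye(2)]

# Density of the extended state at one hypothetical point.
p = mixture_density(np.array([1.0, 2.0]), weights, means, covs)
```

In the full algorithm, one such mixture is produced per Gibbs draw q and the results are averaged over the P draws, as in (A4).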

Appendix B: Illustration of the Model Structure Estimation Algorithm Using a Simple Example

[76] Consider a hydrologic model (see Figure 1) having a single input (precipitation, u), a single state variable (soil moisture storage, x), and a single output (outflow, y). We assume that the following statements describe what we know about the system and thereby constitute the conceptual model for our system: (1) precipitation is independent of the extended system state at previous times; (2) outflow y_k depends only on the total available water x_k + u_k; (3) the discrete-time continuity equation holds: x_{k+1} = u_k + x_k − y_k; (4) all model variables are restricted to be nonnegative: u_k ≥ 0, x_k ≥ 0, y_k ≥ 0; (5) the magnitude of the state is bounded from above: x_k ≤ M*; (6) the state-outflow relationship is monotonic; that is, the higher the available water, the higher the outflow, and vice versa.
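A minimal numerical sketch of this conceptual model may help fix ideas. The linear outflow rule y_k = c·(x_k + u_k) below is purely hypothetical (the true mapping is precisely what the algorithm is designed to estimate); only the continuity equation, nonnegativity, and the upper bound M* come from the conceptual model itself.

```python
import numpy as np

def simulate(u, x0, c=0.3, M_star=800.0):
    """Propagate the conceptual water balance x_{k+1} = u_k + x_k - y_k,
    using a hypothetical monotonic outflow rule y_k = c * (x_k + u_k)."""
    x, xs, ys = x0, [], []
    for uk in u:
        yk = c * (x + uk)              # monotonic in available water; 0 <= y <= x + u
        x = min(uk + x - yk, M_star)   # continuity equation; state bounded above by M*
        xs.append(x)
        ys.append(yk)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
u = rng.uniform(0.0, 20.0, size=100)   # synthetic precipitation forcing
xs, ys = simulate(u, x0=100.0)
```

Any monotonic rule satisfying 0 ≤ y_k ≤ x_k + u_k would serve equally well here; the point is that restrictions (1)-(6) constrain, but do not determine, the outflow mapping.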

[77] Now, suppose that a time history of observations of the input to and output from the system (precipitation and outflow) is available, z_{1:T} = {(z_k^u, z_k^y)}_{k=1}^{T}, and that the structure of the observation error can be quantified. The goal is to estimate the outflow mapping by combining the prior information represented by the conceptual model with the information contained in the observations. Since it is impossible to derive analytical expressions for the joint pdf's at each time step (even for very simple observation error structures), and since our purpose is to illustrate the steps of the algorithm, only possible shapes for the pdf structures are shown.

[78] Following section 2, the estimate M_y(y ∣ x, u, z_{1:T}) of the outflow mapping is given by equation (7), the data-derived part of which can be obtained from the posterior joint pdf p(x, y, u ∣ z_{1:T}) = p(s ∣ z_{1:T}) given by equation (8). Hence, we need to compute the joint pdf p_k(u, x, y ∣ z_{1:T}) = p(u_k, x_k, y_k ∣ z_{1:T}) = p(s_k ∣ z_{1:T}) given the observed data for each time step k. These pdf estimates, as well as the outflow-mapping estimate M_y(y ∣ x, u, z_{1:T}), are derived (and adjusted at each algorithm iteration) as follows.

B1. Iteration Level 0

[79] The outflow is restricted only by its observed values and by the conceptual form of the model, i.e., p(y_k ∣ x_k, u_k, s_{k−1}) = M_y^{0,−}(y_k ∣ x_k, u_k, s_{k−1}) = p_0(y_k ∣ x_k, u_k, s_{k−1}), with p_0 being the conditional probability density that reflects the conceptual form of the model only.

[80] 1. The forecast pdf for the input-state-output response at time k is given by p(s_k ∣ z_{1:k−1}) = ∫ p(s_k ∣ s_{k−1}) p(s_{k−1} ∣ z_{1:k−1}) ds_{k−1}, where p(s_{k−1} ∣ z_{1:k−1}) is known from the previous time step, and where the term p(s_k ∣ s_{k−1}) is computed as p(s_k ∣ s_{k−1}) = p(u_k ∣ s_{k−1}) p(x_k ∣ u_k, s_{k−1}) p(y_k ∣ u_k, x_k, s_{k−1}), which, under the conceptual model restrictions (1)-(6), simplifies to p(s_k ∣ s_{k−1}) = p(u_k) p(x_k ∣ s_{k−1}) p(y_k ∣ u_k + x_k, s_{k−1}). The distribution for the input is assumed to be uniform over some reasonable range of values, and the state is uniquely defined via the continuity equation given the state at the previous time step k − 1. If the mapping connecting the input and state with the output were known, then the density p(y_k ∣ u_k + x_k, s_{k−1}) would be known. However, since the mapping is to be estimated, the prior knowledge about the process given by the conceptual model (all that we currently know about the dependencies in the model) is used to constrain the outflow conditional pdf p(y_k ∣ u_k + x_k, s_{k−1}), as shown in Figure B1a, with a = 0 and b = min{y_{k−1}, u_k + x_k} if the current available water depth is lower than at the previous time, and a = y_{k−1}, b = u_k + x_k otherwise. In this way the requirements for output nonnegativity and monotonic behavior are met. Consequently, the region associated with 95% probability mass for available water (sum of input and state) and outflow pairs will be as illustrated in Figure B1b.

Figure B1. Illustration of the algorithm for the 0th iteration level: (a) forecasting conditional outflow pdf. (b) The 95% probability mass area (light gray) for the joint forecasting pdf. (c) Filtering conditional outflow pdf. (d) The 95% probability mass area (medium gray) for the joint filtering pdf. (e) The 95% probability mass area (dark gray) for the joint smoothing pdf. (f) Data-derived 95% confidence intervals for the outflow mapping (light gray). The horizontal dotted line indicates the 95% probability mass interval for some fixed value of available water. (g) Data-derived 95% confidence intervals for the outflow mapping after the 1st algorithm iteration cycle (medium gray).

[81] 2. The analysis pdf is given by p(s_k ∣ z_{1:k}) ∝ p(z_k ∣ s_k) p(s_k ∣ z_{1:k−1}), so that the prior understanding of the dependencies at time k is weighted according to the observations at that moment. Figure B1c then represents a possible reweighted probability density function for outflow p(y_k ∣ u_k + x_k, s_{k−1}) with weights given by p(z_k^y ∣ y_k) (compare with Figure B1a). With this observation, the 95% probability mass area for available water (sum of input and state) and outflow pairs might look something like Figure B1d (compare to Figure B1b).

[82] 3. The smoothing pdf is computed as p(s_k ∣ z_{1:T}) = p(s_k ∣ z_{1:k}) ∫ [p(s_{k+1} ∣ s_k) p(s_{k+1} ∣ z_{1:T}) / p(s_{k+1} ∣ z_{1:k})] ds_{k+1}, so that the analysis distribution is refined in a way that is consistent with future observations of system behavior. The 95% probability mass area for available water (sum of input and state) and outflow pairs might then look something like Figure B1e (compare to Figure B1d).
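The forecast and analysis steps above can be sketched with a simple Monte Carlo (particle) approximation. The point values for the previous state, the uniform forecast bounds, and the Gaussian observation likelihood below are illustrative assumptions, not the paper's exact numerical scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 5000

# Forecast: given the previous state and outflow (point values for simplicity),
# sample the input uniformly and the outflow uniformly on its conceptually
# constrained range [a, b] (here the branch where available water decreased).
x_prev, y_prev = 80.0, 12.0
u = rng.uniform(0.0, 20.0, size=N)
avail = x_prev + u                        # total available water u_k + x_k
b = np.minimum(y_prev, avail)             # monotonicity + nonnegativity bounds
y = rng.uniform(np.zeros(N), b)           # forecast outflow particles

# Analysis: reweight the particles by the outflow observation likelihood
# p(z_k^y | y_k), here Gaussian with a 10% heteroscedastic standard deviation.
z_y = 8.0
sigma = 0.1 * z_y
w = np.exp(-0.5 * ((y - z_y) / sigma) ** 2)
w /= w.sum()

posterior_mean = np.sum(w * y)            # analysis estimate of the outflow
```

The reweighted particles concentrate near the observed outflow, mirroring the narrowing from Figure B1a to Figure B1c; a full implementation would also carry the backward smoothing pass over the whole batch.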

[83] Now, the local (for each time step) information about model variable dependencies is combined using equation (8) into a global representation of the connections among the model variables, so that the data-derived 95% confidence intervals for the outflow mapping estimate might look something like Figure B1f; that is, for any fixed available water value, the outflow is found with 95% probability in the interval indicated by the dark gray area (note that there is no subindex k for the global dependence). This data-derived mapping estimate is then weighted with the conceptual model prior pdf p_0, as given in equation (7), to produce M_y^{0,+}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T}) = M_y^{1,−}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T}), and the result is treated as the outflow mapping during the next iteration, i.e., p(y_k ∣ x_k, u_k, s_{k−1}) = M_y^{1,−}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T}).

B2. Iteration Level 1

[84] The outflow is restricted by its observation and mapping estimate derived from the previous iteration level, i.e., by:

  p(y_k ∣ x_k, u_k, s_{k−1}) = M_y^{1,−}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T})    (B1)

[85] 1. The forecasting pdf is found in the same way as for iteration 0, except that the updated pdf for the output given the input, state and previous extended state is used, which might look something like Figure B2a (compare to Figure B1a).

Figure B2. The solid line indicates the updated conditional pdf for the output while the dashed line indicates the conditional pdf from the previous iteration: (a) forecasting and (b) analysis. The dot represents an output observation.

[86] 2. The analysis pdf is computed in the same way as in the previous iteration, except that the updated conditional pdf is weighted with the observation likelihood function, so that its weighted version might look something like Figure B2b (compare to Figure B1c). This analysis pdf combines information from the observations and information about the variable dependencies obtained from the previous iteration level.

[87] 3. The smoothing pdf is calculated using the same equation as before (see iteration zero), by taking into account the updated conditional density for output.

[88] After this, the local (for each time step) information about model variable dependencies is again combined using equation (8) into a global representation of the connections among the model variables, so that the data-derived 95% confidence intervals for the outflow mapping estimate might look something like Figure B1g (compare with Figure B1f). This data-derived mapping estimate is again weighted with the conceptual model prior pdf p_0, as given in equation (7), to produce M_y^{1,+}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T}) = M_y^{2,−}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T}), and the result is treated as the outflow mapping during the next iteration, i.e., p(y_k ∣ x_k, u_k, s_{k−1}) = M_y^{2,−}(y_k ∣ x_k, u_k, s_{k−1}, z_{1:T}). The mapping is iteratively reestimated until convergence is achieved, i.e., until the mapping estimate stabilizes, which can be visualized as the output confidence intervals from two successive iterations covering the same space (in Figure B1g this would correspond to coincidence or near coincidence of the areas shown).
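The convergence test described here can be sketched by comparing the 95% intervals of the mapping estimate from two successive iterations with a simple overlap measure. The interval values below are hypothetical, and a Jaccard-style overlap score is one possible (assumed) formalization of "covering the same space":

```python
import numpy as np

def interval_overlap(lo1, hi1, lo2, hi2):
    """Mean Jaccard overlap of two families of confidence intervals, one pair
    per available-water bin: |intersection| / |union|, averaged over bins."""
    inter = np.clip(np.minimum(hi1, hi2) - np.maximum(lo1, lo2), 0.0, None)
    union = np.maximum(hi1, hi2) - np.minimum(lo1, lo2)
    return float(np.mean(inter / union))

# Hypothetical 95% outflow intervals over 4 available-water bins,
# from iterations l and l+1 of the algorithm.
lo_l,  hi_l  = np.array([0.0, 1.0, 2.0, 4.0]), np.array([2.0, 3.0, 5.0, 8.0])
lo_l1, hi_l1 = np.array([0.1, 1.1, 2.1, 4.2]), np.array([1.9, 2.9, 4.9, 7.8])

score = interval_overlap(lo_l, hi_l, lo_l1, hi_l1)
converged = score > 0.95   # declare convergence when successive intervals nearly coincide
```

The 0.95 threshold is arbitrary; any criterion that detects stabilization of the interval estimates would serve.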

Appendix C: Extended State-Space Formulation

[89] The extended state Sk = (U1,k, U2,k, Xk, Y1,k, Y2,k) is five-dimensional. The input to the system at each time is considered to be independent of the system extended state at the previous time, so that:

  p(u_{1,k}, u_{2,k} ∣ s_{k−1}) = p(u_{1,k}) p(u_{2,k}),   p(u_{i,k}) = U(0, A_i), i = 1, 2    (C1)

where U(0, A) denotes a uniform distribution density function. For our purposes no further specification of the prior distribution is needed, since all information about the input is to be extracted from the observations. The state propagation operator Mx is defined by the deterministic continuity equation:

  p(x_{k+1} ∣ u_k, x_k, y_k) = δ(x_{k+1} − (x_k + u_{1,k} − y_{1,k} − y_{2,k}))    (C2)

By application of Bayes' rule the output probabilistic operator becomes

  p(y_{1,k}, y_{2,k} ∣ u_k, x_k, s_{k−1}) = p(y_{1,k} ∣ u_k, x_k, s_{k−1}) p(y_{2,k} ∣ y_{1,k}, u_k, x_k, s_{k−1})    (C3)

This probabilistic operator My is to be estimated by the method proposed in section 2.5. For equation (7), the prior estimate p0 is defined as follows:

  p_0(y_{1,k} ∣ u_k, x_k, s_{k−1}) = U(a_1, b_1),   p_0(y_{2,k} ∣ u_k, x_k, s_{k−1}) = U(a_2, b_2)    (C4)

with the interval endpoints (a_i, b_i) determined by the constraints listed below (as in Figure B1a).

These two priors reflect the required nonnegativity of state and outputs, a monotonic dependence of evapotranspiration on state and inputs, and the boundedness of the state and actual evapotranspiration.

[90] An essentially noninformative prior was assumed for the initial conditions by applying a widely spread pdf, x_0 ∼ U[50, 750] mm, thereby reflecting a virtual lack of knowledge about the initial water storage conditions. For practical (computational) reasons, the state was assumed not to exceed M* = 800 mm.

[91] In this real data study, no soil moisture (state) or actual evapotranspiration data are available, while observational data are available for potential evapotranspiration, precipitation, and outflow. Further, very little information has been reported in the literature about the actual measurement error structure for areal precipitation, potential evapotranspiration, or outflow. Following Sorooshian and Dracup [1980], we therefore assume the outflow to have a heteroscedastic (nonconstant) error variance. Further, Beven [2006] has proposed that a reasonable approach in such situations is to define an “effective observation error” on the basis of one's experience and qualitative information about the data quality. On the basis of these considerations, and for computational convenience, all observational data were treated as having 10% heteroscedastic, uncorrelated, zero-mean Gaussian error structures for the studies reported here. This treatment reflects the belief that the higher the measured value, the wider the symmetric error range; that is, the observation operator H is defined as:

  H(z_k ∣ s_k) = N(z_k^{u_1} ∣ u_{1,k}, 0.1 u_{1,k}) N(z_k^{u_2} ∣ u_{2,k}, 0.1 u_{2,k}) N(z_k^{y_1} ∣ y_{1,k}, 0.1 y_{1,k})    (C5)

where N(z ∣ μ, σ) is the normal probability density function with mean μ and standard deviation σ, evaluated at the point z.
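The 10% heteroscedastic Gaussian observation model can be sketched as follows; the true values and observations below, and the dictionary variable names, are purely illustrative:

```python
import numpy as np

def obs_likelihood(z, v, rel_err=0.10):
    """Gaussian observation density N(z | v, rel_err * v): the observation z of a
    true value v has zero-mean error with standard deviation proportional to v,
    so larger measured quantities carry wider (symmetric) error ranges."""
    sigma = rel_err * v
    return np.exp(-0.5 * ((z - v) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Joint likelihood of the precipitation, potential ET, and outflow observations
# at one time step, assuming independent (uncorrelated) errors.
truth = {"precip": 15.0, "pet": 4.0, "outflow": 6.0}
obs   = {"precip": 14.2, "pet": 4.3, "outflow": 6.5}
L = np.prod([obs_likelihood(obs[k], truth[k]) for k in truth])
```

Note that the density degenerates as the true value approaches zero; a practical implementation would impose a small floor on sigma for near-zero fluxes.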

Notation
H = (Hu, Hx, Hy)

an observation probabilistic operator with its three components.

Hu

a probability density function for input observations.

Hx

a probability density function for state observations.

Hy

a probability density function for output observations.

M = (Mu, Mx, My)

a probabilistic operator for system propagation.

Mu

a probabilistic operator for input.

Mx

a probabilistic operator for state.

My

a probabilistic operator for output.

My^{l,−}

lth iteration prior estimator for My.

My^{l,+}

lth iteration posterior estimator for My.

Sk = (Uk, Xk, Yk)

an extended state random variable at time k.

T

length of data batch.

Uk

an input random variable at time k.

Xk

a state random variable at time k.

Yk

an output random variable at time k.

Zk = (Zku, Zkx, Zky)

an observation random variable at time k.

Zku

an input observation random variable at time k.

Zkx

a state observation random variable at time k.

Zky

an output observation random variable at time k.

du

an input dimension.

dx

a state dimension.

dy

an output dimension.

p

a general notation for probability density function.

p0

a prior probability density function.

pk

a probability density function for time k.

pkl

a probability density function for time k after lth algorithm iteration.

sk = (uk, xk, yk)

an extended state variable at time k.

uk

an input at time k.

xk

a state at time k.

yk

an output at time k.

z1:T

a batch of observations.

zk = (zku, zkx, zky)

an observation at time k.

zku

an input observation at time k.

zkx

a state observation at time k.

zky

an output observation at time k.

α = (α1, α2, …)

a prior weight.

θi = (μi, Ωi)

a normal distribution parameter set: mean μi and correlation matrix Ωi.

ωi

a weight used in mixture of multivariate normal distributions representation.

Acknowledgments

[92] We gratefully acknowledge support for this work provided by the Department of Hydrology and Water Resources of the University of Arizona, by the National Weather Service Office of Hydrology under grant NA04NWS462001, and by SAHRA under NSF-STC grant EAR-9876800. The first author was partially supported by a doctoral fellowship from the Salt River Project.

References
