 One way to reduce predictive uncertainty due to the model structure is to incorporate information from several different models at once. Model aggregation can lead to a better prediction than a single model structure, as each model provides new information about the processes that are occurring. This paper presents an alternative by further developing an idea introduced by Marshall et al. (2006). A modeling framework that combines a number of individual model structures is presented. The approach, known as the hierarchical mixtures of experts (HME) framework, allows for a more sophisticated method of model aggregation by allowing individual models to be selected based on the preceding catchment conditions and also gives greater flexibility to the specification of the model errors. The modeling framework was previously shown to have potential as a modeling tool for assessing model components and identifying structural deficiencies in existing representations. This paper presents a basis for using the HME rainfall-runoff modeling framework for prediction or simulation. We illustrate the rationale behind the proposed framework using daily rainfall, evapotranspiration, and flow data from two small catchments (<150 km²) in the state of New South Wales, Australia. Two different forms of a conceptual rainfall-runoff model are used to represent the multiple component models of the framework. We investigate the usefulness of different catchment predictors to weight the individual models by assessing the resulting performance in a predictive sense. The usefulness of alternative models for the description of the model errors is shown. The study shows that given careful comparison of the possible mechanisms related to a switch in the catchment “state,” the proposed approach can be a useful predictive tool, giving an aggregated model simulation that is better than any individual model.
 In classic hydrological model development, the modeler seeks to address and reproduce the numerous dynamics occurring within the catchment. Model components may be built and organized according to the individual characteristics of the catchment, the modeling exercise, or the available data. Often model improvement is attempted via the addition to an existing model structure of extra mechanisms that describe the hydrological processes occurring in the catchment. However, the widespread use of more complicated models is often hindered by lack of available data or by the reluctance of practitioners to adopt a model that is too complex given predominant data limitations. Hence most modeling exercises search for a single good model, simple enough to apply to a range of exercises but with sufficient complexity for its predictions beyond the range of calibration data to be deemed “reliable.”
 A modeler's ability to include all the processes occurring in a catchment is also limited by the inherent errors and uncertainties in the modeling process. This problem is related to the limited information available in existing data. Recent advances in hydrological modeling have attempted to characterize the uncertainties in the modeling process, to better assess the risk associated with our model outputs. Approaches include those reporting ensemble predictions, interval estimates, or probability distributions. However, the existence of uncertainty in our calibration data and input data or boundary conditions, and the limits in our understanding of the hydrological system, mean that we are forced to conceptualize or approximate the description of the catchment and the processes occurring.
1.1. Utility of Multimodel Formulations
 Given all the uncertainties in the modeling process, it is well noted that a single model is unlikely to have a consistent level of accuracy at all times and for all events. As a result, hydrologic modelers may try to improve model predictions by combining the results of several models at once, taking, say, a weighted average of individual models to try to capture the benefits of these models. The advantages of combining predictions from several models are well documented in many disciplines, with extensive literature for economics and the social sciences [Clemen, 1989]. In hydrology, established methods of combining models might include a simple or weighted average of models' results, or a nonlinear weighting such as that offered by artificial neural networks [Georgakakos et al., 2004; Shamseldin et al., 1997; Xiong et al., 2001]. For combining multiple models under uncertainty, Bayesian model averaging is an appealing approach [e.g., Neuman, 2003]. Multimodel ensembles are more commonly used in hydrological and climate forecasting, and can provide a forecast with greater skill than that from any single model [Hagedorn et al., 2005]. Established approaches to combining multiple models in forecasting for hydrology and climatology range from a simple average of the individual models, to models averaged but weighted according to their overall perceived skill [e.g., Shamseldin et al., 1997], to models combined using multiple linear regression [Doblas-Reyes et al., 2005].
 Despite the usefulness of these approaches, it cannot be denied that the processes occurring in a hydrological system are highly dynamic and constantly changing. Many studies have examined the use of “dynamic” model structures, or assessed the way in which parameters vary in time [see, e.g., Wagener et al., 2003]. This idea could easily be extended to combining model structures. A method of model aggregation that can take into account the usefulness of different model structures or parameterizations under different hydrologic regimes is desirable.
1.2. Model Calibration and Validation: Specification of the Objective Function or Likelihood
 In identifying different model structures, modelers are also faced with issues in attempting parameter specification given available catchment information. One important area of research has been concerned with using likelihood-based methods for calibration, by making assumptions about the statistical distribution of the data via a probability density function. This is necessary in Bayesian approaches, where parameter and model uncertainty are described probabilistically. These classical statistical approaches are of use in model development and application, as they allow comparison of models which make different likelihood assumptions. This is often not possible with many of the other criteria used in hydrology. However, the choice of probability distribution to define the likelihood is often not thoroughly explored and the model errors are often assumed to be normally, independently, and identically distributed. Alternatively, likelihoods have been used that assume heteroscedastic, correlated errors [Bates and Campbell, 2001; Yapo et al., 1998], usually by transforming the data and fitting autoregressive models.
 Much research has also been based on using different performance criteria to assess different aspects of model fit. It is well recognized that multiobjective approaches allow focus on different important aspects of model predictions [Gupta et al., 1998]. These approaches generally try to provide some trade-off in model fitting by assessing different aspects of the model's predictions (such as minimizing peak flow error as well as overall flow error estimated over the full hydrograph). An issue often ignored is the set of assumptions made about the distribution of the residuals or errors that result from the model fit. It is generally unwise to assume that the error distribution associated with a model calibrated using one objective is the same as that associated with a model calibrated using another.
 A possible solution may be to allow the error distribution to vary with time, or depend on the nature of the runoff generation mechanism that may dominate at a given time step. Several studies have noted the utility of calibrating models on different sections of the data [see, e.g., Boyle et al., 2000]. What if we could specify different error models to represent different sections of the hydrograph? This could allow implementation of formalized likelihood approaches, but also allow focus on different attributes of the hydrograph that the multiobjective approach idealizes.
1.3. Research Motivation
 This traditional approach to model development and calibration shows an important limitation in how we specify hydrological models. How do we model what is inherently a dynamic system, when the majority of our modeling tools are deterministic and static? We generally assume a single static model to represent the dominant observable processes in a catchment. If it is true that the catchment is dynamic, should it not also be true that different assumptions will better simulate the data at different times? These issues can also be extended to our understanding of the model errors. Is it sensible to make the assumption that the structure of the errors in our data does not change over the range of model responses? In this study, we propose a framework within which these issues can be addressed.
2. Hierarchical Mixtures of Experts (HME)
 A way of formally implementing the ideals in the preceding section presents itself in a class of statistical models known as hierarchical mixtures of experts (HME). HME models provide a method of combining the performance of several models in a single framework. The benefit of combining model predictions is that each model provides different information about the process being considered so that overall the process mechanisms are better captured. HME models combine the benefits of model aggregation and a dynamic model structure but provide an improvement on simple combinations of models, by allowing the way that model predictions are combined to depend on predictor variables. In this way, the relative contribution of each model to the overall simulation is allowed to vary in time, according to these predictors. The approach can be viewed as fitting a piecewise function to the data, by dividing the input space and fitting simple surfaces to the data that fall in each region. The HME approach to combining models was initially introduced in the statistical literature as the next evolution of classical regression and classification trees [Hastie et al., 2001]. HME models provided an improvement on traditional approaches to splitting data by allowing splits to be probabilistic, so that data points could lie in multiple regions and hence could be applied to different models simultaneously.
 HME has been applied in various settings and has been shown to be capable of a wide variety of tasks, including speech recognition [Peng et al., 1996], cancer classification [Jacobs et al., 1997], and modeling robot dynamics [Jordan and Jacobs, 1994]. These applications indicate that the HME approach could provide an evolution to existing multimodel frameworks in hydrology. A HME approach to model development in hydrology gives greater flexibility to specification of the model structure (by allowing multiple model structures to exist in a single framework) and to the specification of model errors (by allowing different assumptions to apply depending on the input data or predictor variables).
2.1. Model Structure
 The HME framework consists of individual model structures (known as experts in the statistical literature, and component models here) that are grouped in a tree-like architecture by nodes (known as gating functions). Figure 1 shows the simplest HME framework, consisting of a single level and combining only two component models. The architecture shown may be expanded by recursively dividing the branches to include further levels. The splits at each node are not necessarily binary, so component models can also be added at each node. As the hierarchy is increased, the framework can model data at different levels of resolution.
 In a hydrological setting, each of the component models is specified to have a separate hydrological model structure and assumed error distribution (see, e.g., Figure 2). Simpler choices may be made, and most previous applications of the HME approach in other disciplines used simple linear models at each of the expert nodes (for regression problems) or a logistic regression model (for classification problems). The next section gives a general overview of how HME can be implemented. Further details on implementing the HME approach in a hydrological modeling application are given by Marshall et al.
2.2. Implementing the HME Framework in Rainfall-Runoff Modeling
 Consider the HME framework illustrated by Figure 1. If we let the observed flow for a catchment at time t be $Q_t$, each component model may be specified by a conceptual rainfall-runoff model of the form
$$Q_t = f_{i,t}(x_t; \theta_i) + \varepsilon_{i,t} \tag{1}$$
where $\theta_i$ is the set of unknown parameter values for component model i, $f_{i,t}(x_t; \theta_i)$ is the rainfall-runoff model for component model i, and $\varepsilon_{i,t}$ is the model error for component model i with variance $\sigma_i^2$. Note that each component model has inputs $x_t$ (such as rainfall and evapotranspiration data) that the model relates to the output $f(x_t; \theta)$.
 The HME networks should be considered probabilistically. The overall output $Q_t$ is generated based on a probabilistic weighting of the output of each of the component models that is updated at each time step (see Figure 2a). The probability is based on current catchment indicators that are specified by the modeler to describe the state of the catchment. The respective probabilities are estimated through use of the gating function (Figure 1). The gating function is a mathematical function that uses relevant predictors (denoted $X_t$) to calculate the probability of selecting each of the component models. For a two-component model it has the form
$$g_{t,1} = F(\beta, X_t) \tag{2}$$
where $g_{t,1}$ is the probability of selecting HME component 1 at time step t (so that component 2 is selected with probability $1 - g_{t,1}$) and β is the set of parameters for the gating function. The function $F(\beta, X_t)$ may take a number of forms but in preceding studies has generally been specified as a logistic function. The case study described in subsequent sections gives details on how the gating function and catchment predictors may be investigated.
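As an illustration of the gating idea described above, the following is a minimal sketch of a logistic gating function driven by a single catchment predictor, together with the resulting probability-weighted flow. The two-parameter form (`beta0`, `beta1`), the predictor value, and the component flows are assumptions made for the example and are not taken from the study.

```python
import math

def logistic_gate(beta0, beta1, predictor):
    """Probability of selecting component model 1 (logistic form, assumed here)."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * predictor)))

def mixture_mean(g1, flow_model_1, flow_model_2):
    """Flow obtained by weighting two component models by the gating probability."""
    return g1 * flow_model_1 + (1.0 - g1) * flow_model_2

# Hypothetical gating parameters and predictor (e.g. an antecedent wetness index).
g1 = logistic_gate(beta0=-2.0, beta1=0.05, predictor=40.0)

# Hypothetical simulated flows from the two component models at one time step.
weighted_flow = mixture_mean(g1, flow_model_1=10.0, flow_model_2=2.0)
print(g1, weighted_flow)
```

With these illustrative values the linear term is zero, so both component models receive equal weight; changing the predictor moves the weight smoothly toward one model or the other.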
 The probabilistic nature of the model framework involving latent states means that a Bayesian approach to inference is convenient and attractive. Bayesian approaches describe parameter and model uncertainty probabilistically, the end product being a probability distribution known as the posterior distribution on the parameter space. Using the preceding notation, prior to considering the observed data, the current knowledge about the parameter set θ can be summarized in a distribution P(θ) (called the prior distribution). The posterior distribution P(θ∣Q) of the parameter set may be found through the application of Bayes' theorem:
$$P(\theta \mid Q) = \frac{P(Q \mid \theta)\,P(\theta)}{P(Q)} \tag{3}$$
where P(Q∣θ) is the likelihood function summarizing the model for the data Q given the parameters and P(Q) is a proportionality constant.
2.3. Model Calibration Using Markov Chain Monte Carlo for a Two-Component HME
 The difficulties in applying Bayesian techniques to the HME framework in a hydrological setting lie in calculating the posterior distribution. A method known as Markov chain Monte Carlo (MCMC) is routinely used for estimating the posterior distribution in applied Bayesian statistics in complex problems. The method has been shown to apply well to rainfall-runoff models [Marshall et al., 2004] and has previously been applied to the HME framework [Marshall et al., 2006]. The aim of MCMC sampling is to generate samples of the parameter values from the posterior distribution by simulating a random process that has the posterior distribution as its stationary distribution.
 A number of MCMC algorithms exist. The HME framework presented here is specified via a mixture of Gibbs sampling [Geman and Geman, 1984] and Metropolis sampling [Metropolis et al., 1953]. Consider the HME framework given in Figure 1, where two component models are combined. Using earlier notation, the gating function parameters β, component model parameters $\theta_i$, and model error variance $\sigma_i^2$ (i = 1, 2) are sampled in three blocks using the Metropolis-Hastings algorithm. In a single-component model, the algorithm for sampling the component model parameters $\theta = (\theta_1, \ldots, \theta_d)$ will take the following general form [after Marshall et al., 2004]:
 1. Initialize j = 0.
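 2. (1) Generate a proposed value θ* for θ from a proposal density $q(\theta^* \mid \theta^j)$ depending on the current state. Then calculate the acceptance probability, α, of the proposed value,
$$\alpha = \min\left\{1,\ \frac{P(Q \mid \theta^*)\,P(\theta^*)\,q(\theta^j \mid \theta^*)}{P(Q \mid \theta^j)\,P(\theta^j)\,q(\theta^* \mid \theta^j)}\right\} \tag{4}$$
where P(Q∣θ) is the likelihood and P(θ) is the prior distribution of θ; for a symmetric proposal density the two q terms cancel. (2) Generate u ∼ U[0,1]. (3) If u < α, accept $\theta^{j+1} = \theta^*$; otherwise set $\theta^{j+1} = \theta^j$.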
 3. Repeat step 2.
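The steps above can be sketched as a short random-walk Metropolis sampler. This is a minimal illustration under stated assumptions, not the implementation used in the study: the proposal is a symmetric Gaussian random walk (so the proposal densities cancel in the acceptance ratio), the posterior is supplied on the log scale, and `toy_log_posterior` is a hypothetical stand-in for the log of likelihood times prior of a rainfall-runoff model.

```python
import math
import random

def metropolis(log_posterior, theta0, step, n_iter, seed=0):
    """Random-walk Metropolis sampler with a symmetric Gaussian proposal."""
    rng = random.Random(seed)
    theta = list(theta0)                      # step 1: initial state
    log_p = log_posterior(theta)
    chain = []
    for _ in range(n_iter):                   # step 3: repeat step 2
        # Step 2(1): propose a new value and compute the acceptance probability.
        proposal = [value + rng.gauss(0.0, step) for value in theta]
        log_p_new = log_posterior(proposal)
        alpha = math.exp(min(0.0, log_p_new - log_p))
        # Steps 2(2) and 2(3): draw u ~ U[0, 1) and accept or keep the current state.
        if rng.random() < alpha:
            theta, log_p = proposal, log_p_new
        chain.append(list(theta))
    return chain

def toy_log_posterior(theta):
    """Hypothetical target: a standard normal log density (up to a constant)."""
    return -0.5 * theta[0] ** 2

chain = metropolis(toy_log_posterior, theta0=[3.0], step=1.0, n_iter=5000)
posterior_mean = sum(sample[0] for sample in chain) / len(chain)
print(len(chain), posterior_mean)
```

Working on the log scale avoids numerical underflow when the likelihood is a product over many daily time steps. In a blocked scheme, the same update would be applied in turn to each parameter block with the others held fixed.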
 In multicomponent models there are several steps of the form described in step 2 above, for different blocks of parameters with the remaining parameters held fixed. For instance, in a two-component model we do a cycle of updating by substituting $\sigma^2$ and β for θ in step 2.
 In the HME model sampling, the iteration takes the same form except that we do a cycle of updating $\theta_1$, $\theta_2$, β, $\sigma_1^2$, and $\sigma_2^2$ at step 2.
 In updating parameters for component models, it is convenient to introduce some additional latent variables in the model specification. In particular, a vector of latent variables $z_t$ is defined, such that $z_t = (1, 0)$ if the model output is generated by component model 1 with probability defined by (2), and $z_t = (0, 1)$ if the model output is generated by model 2. The series $z = z_1, \ldots, z_n$, where n is the length of data, is then also generated at step 2 above from the full conditional probability distribution using the Gibbs sampling algorithm. This sampling proceeds as a conditional simulation of independent Bernoulli random variables with probability specified as
$$P(z_t = (1,0) \mid Q_t, \theta_1, \theta_2, \sigma_1^2, \sigma_2^2, \beta) = \frac{g_{t,1}\,\phi(Q_t; f_1(x_t;\theta_1))}{g_{t,1}\,\phi(Q_t; f_1(x_t;\theta_1)) + (1 - g_{t,1})\,\phi(Q_t; f_2(x_t;\theta_2))} \tag{5}$$
where $\phi(Q_t; f_i(x_t;\theta_i))$ is the distribution of $Q_t$ for component model i. Note that with this step, the probability of selecting each model is conditioned on the observed data.
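The Gibbs step for the latent indicators can be sketched as follows. The sketch assumes Gaussian component error densities purely for illustration (the study also considers Student's t errors), and the function names and numerical values are hypothetical.

```python
import math
import random

def normal_pdf(q, mean, var):
    """Gaussian density, used here as an assumed component error model."""
    return math.exp(-0.5 * (q - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def prob_component_1(q_obs, g1, f1, f2, var1, var2):
    """Conditional probability that the observed flow came from component model 1."""
    w1 = g1 * normal_pdf(q_obs, f1, var1)
    w2 = (1.0 - g1) * normal_pdf(q_obs, f2, var2)
    return w1 / (w1 + w2)   # assumes the two densities do not both underflow to zero

def sample_indicator(p1, rng):
    """Draw z_t as a Bernoulli variable: (1, 0) for model 1, (0, 1) for model 2."""
    return (1, 0) if rng.random() < p1 else (0, 1)

rng = random.Random(1)
# Hypothetical time step: observed flow 5.0, model 1 simulates 5.0, model 2 simulates 1.0.
p1 = prob_component_1(q_obs=5.0, g1=0.5, f1=5.0, f2=1.0, var1=1.0, var2=1.0)
z_t = sample_indicator(p1, rng)
print(p1, z_t)
```

Even though the gating probability is 0.5 in this example, the conditional probability strongly favors model 1 because its simulation matches the observation. This is the sense in which the selection probability is conditioned on the observed data.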
 Because the model framework is implemented using Bayesian inference, the output generated is an ensemble of streamflow hydrographs (generated from the parameters' posterior distribution). This means that the output for a single day can consist of multiple processes, resulting from a weighting of each of the component models. For example, an estimate of the catchment's lower storage is obtained as a weighted average of the storage simulated by each individual model.
2.4. Previous Applications of HME and Their Limitations
 A concern in applying the HME framework has been determining the optimum topology of the networks. Various solutions have been proposed, including Bayesian methods based on defining a worth index for model selection [Jacobs et al., 1997], growing the topology adaptively [Waterhouse and Robinson, 1996], and some automatic growing and pruning techniques [Fritsch et al., 1997]. However, the nature of the modeling process specific to hydrology means that these techniques have their limitations. There is often insufficient data available, particularly for rainfall-runoff modeling, which precludes the use of complex structures. There are inherent computational difficulties in specifying hydrological models in a Bayesian setting which are further complicated by aggregating models in a single framework. Thus in this study the tree structure was fixed and simpler component models and gating functions were favored over more complex choices.
 It is evident from previous studies that the HME approach has two main innovations for hydrological modelers:
 1. The first is interpretation of the final model structure during the calibration period. In a hydrological setting, the HME approach can be an ideal tool for model building and for assessing individual model components. In a previous application, the framework was introduced and applied to a range of catchments to illustrate the conditions under which different model structures and model parameters are preferred [Marshall et al., 2006].
 2. The second is as a way of combining models to get a better prediction than would be yielded from a single model structure. Given the difficulties in selecting a single model structure for predicting streamflow, an attractive alternative is to combine the results from several different hydrological models in a single framework. HME provides a more sophisticated way to combine models than that achieved by simple or weighted averages of model outputs.
 Achieving a model performance in prediction that is better than that of the individual component models can be hard. The HME approach can show a significant improvement to the model in the calibration period [Marshall et al., 2006]. This is due to conditioning the probability of selecting one component model on the observed calibration data. In predictive mode, it must be assumed that these data are not available (indeed it is what the model is seeking to reproduce). In light of these difficulties, the aim of this paper is to illustrate how HME can be used for prediction via investigation of different catchment descriptors. The associated computations for assessing the HME predictive performance are also presented.
2.5. An Illustrative Example
 The utility of the HME approach is illustrated via a simple example. Say we fit a daily conceptual rainfall-runoff model to a catchment, with the aim of assessing the model fit and the effect of different assumptions about the model errors. We take data from an existing catchment, so that the underlying distribution of data uncertainties (including input data, boundary conditions, and calibration data) is not known. The catchment, the Orara River at Karangi, has an area of 135 km² and is located in the southeast of New South Wales, Australia. It has a relatively high yield, with a mean annual rainfall of 1793 mm and mean annual runoff of 755 mm. The catchment data show a range of hydrologic conditions, with an underlying low base flow, few peak events, and sustained long dry periods in the dry season.
 The data are to be modeled by a simplified version of the Australian water balance model (AWBM) developed by Boughton. The simplified AWBM (Figure 3) is a conceptual rainfall-runoff model that uses daily rainfall and evapotranspiration to produce estimates of daily streamflow and is used widely in Australia for many applications. The simplified model consists of three parameters: S (surface store capacity), K (recession constant), and BFI (base-flow index).
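To make the role of the three parameters concrete, the following is a rough sketch of a single-store daily water balance in the spirit of the simplified AWBM. Figure 3 is not reproduced here, so the accounting order, the zero initial storages, and the base-flow routing are assumptions of this sketch rather than the model specification used in the study.

```python
def simple_bucket_model(rain, pet, S, K, BFI):
    """Single-store daily water balance (illustrative sketch, not the published AWBM).

    S   : surface store capacity (mm)
    K   : daily base-flow recession constant (0 to 1)
    BFI : fraction of excess rainfall routed to base flow (0 to 1)
    """
    store = 0.0            # surface store content (mm), assumed empty at the start
    baseflow_store = 0.0   # base-flow store content (mm), assumed empty at the start
    flows = []
    for p, e in zip(rain, pet):
        # Add rainfall, remove evapotranspiration, and keep the store non-negative.
        store = max(store + p - e, 0.0)
        # Anything above capacity spills as excess.
        excess = max(store - S, 0.0)
        store -= excess
        # A fraction BFI of the excess recharges the base-flow store.
        baseflow_store += BFI * excess
        # The base-flow store drains at rate (1 - K) per day.
        baseflow = (1.0 - K) * baseflow_store
        baseflow_store -= baseflow
        # Total flow is direct surface runoff plus base flow.
        flows.append((1.0 - BFI) * excess + baseflow)
    return flows

# Hypothetical three-day input: one 50 mm storm with no evapotranspiration.
flows = simple_bucket_model(rain=[0.0, 50.0, 0.0], pet=[0.0, 0.0, 0.0],
                            S=20.0, K=0.5, BFI=0.4)
print(flows)
```

In this toy run the storm fills the store, the 30 mm excess is split between surface runoff and base-flow recharge, and the base-flow store continues to release water on the following dry day.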
 We specify the model using Bayesian inference. Two alternative likelihood functions are used to describe the model errors. In both cases, model errors are assumed to be independent and identically distributed (i.i.d.), assumptions that may not necessarily be satisfied in real applications (see discussions and alternative likelihood formulations relating to the impact of heteroscedasticity and serial correlation in errors by Sorooshian and by Sorooshian and Dracup). The difference in the two alternative likelihoods considered is solely the choice of the probability distribution assumed to characterize the errors. In the first case (equation (6)) the distribution of the errors is assumed to be Gaussian, with corresponding likelihood:
$$p(Q \mid \theta) = \prod_{t=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{\left(Q_t - f(x_t;\theta)\right)^2}{2\sigma^2}\right] \tag{6}$$
where p(Q∣θ) is the likelihood, $Q_t$ is the observed streamflow at time step t, $f(x_t;\theta)$ is the modeled flow at time step t, $x_t$ is the set of inputs at time t (including precipitation and evapotranspiration estimates), θ is the set of model parameters, $\sigma^2$ is the error variance, and n is the length of data.
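 In the second case the errors are assumed to follow a Student's t-distribution with 4 degrees of freedom, with likelihood:
$$p(Q \mid \theta) = \prod_{t=1}^{n} \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right)\sqrt{\pi\nu\sigma^2}} \left[1 + \frac{\left(Q_t - f(x_t;\theta)\right)^2}{\nu\sigma^2}\right]^{-\frac{\nu+1}{2}} \tag{7}$$
where ν is the number of degrees of freedom and $\sigma^2$ is a scale parameter.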
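The two error models can be compared numerically with a short sketch of their log-likelihoods. The standard Gaussian and scaled Student's t densities are used here; the observed and simulated flows are hypothetical values chosen to include one large outlier error.

```python
import math

def gaussian_loglik(obs, sim, sigma2):
    """Log-likelihood for i.i.d. Gaussian errors with variance sigma2."""
    n = len(obs)
    sse = sum((q - f) ** 2 for q, f in zip(obs, sim))
    return -0.5 * n * math.log(2.0 * math.pi * sigma2) - sse / (2.0 * sigma2)

def student_t_loglik(obs, sim, sigma2, nu=4.0):
    """Log-likelihood for i.i.d. Student's t errors with scale sigma2 and nu degrees of freedom."""
    const = (math.lgamma((nu + 1.0) / 2.0) - math.lgamma(nu / 2.0)
             - 0.5 * math.log(math.pi * nu * sigma2))
    return sum(const - 0.5 * (nu + 1.0) * math.log1p((q - f) ** 2 / (nu * sigma2))
               for q, f in zip(obs, sim))

# Hypothetical flows: small errors on four days and one large outlier on the last day.
obs = [1.0, 1.1, 0.9, 1.0, 9.0]
sim = [1.0, 1.0, 1.0, 1.0, 1.0]

ll_gauss = gaussian_loglik(obs, sim, sigma2=1.0)
ll_t = student_t_loglik(obs, sim, sigma2=1.0, nu=4.0)
print(ll_gauss, ll_t)
```

For the same scale, the heavier tails of the t-distribution penalize the single outlier far less than the Gaussian does, which is consistent with the behavior described for peak-flow errors.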
 A plot of the error distribution resulting from the optimal parameter sets corresponding to both likelihoods is shown in Figure 4. Also shown is the assumed probability distribution (assumed normal in Figure 4a and Student's t in Figure 4b) that is fitted to the errors. Note the difference in the variance of the errors corresponding to each likelihood. The assumed Gaussian likelihood has a much higher variance, as the distribution is fitted to several large outlier errors and a dominance of low errors. The Student's t-distribution provides an improvement in fitting the model errors, as observed by the smaller error variance. Also note that the t-distribution shows a better correspondence between the calculated histogram of the model errors and the fitted probability distribution. Compared with a Gaussian distribution, a Student's t-distribution will have higher peaks and heavier tails. This allows for a greater number of observations in the tails of the distribution and for a higher number of low errors as observed at the low simulated flows.
 The results in Figures 4a and 4b illustrate the importance of a good assumption on the nature of model errors that are likely. They assume, however, that the structure of the model errors is static and independent of the model simulation or the timing of the catchment's observed response. Consider a situation where we now model the catchment by fitting a combination of two model structures, using the HME philosophy. This combination of structures is characterized by allowing the model to oscillate between two “states” in a probabilistic manner based on the prevailing antecedent conditions. The overall model output at any time is then effectively a weighted average of two alternate models with the weights being ascertained as functions of the catchment antecedent conditions. We specify each individual model to have the same structure, and so operate under the same conceptual and physical assumptions. Although each model has continuous simulation (so that individual models still satisfy the water balance of the individual model structure), only sections of the data are applied to each model when formulating the objective function. In general terms, one model is calibrated on the high observed flows and sustained low flows, and the other model is calibrated on the recession curve of the hydrograph. (Note that this is somewhat of a simplification of the way in which the model framework is specified as the splitting of the data is probabilistic.)
 Now consider the error distribution when such a HME model is applied to the study catchment, with the individual models (referred to as component models) taking the AWBM structure described before. The model errors resulting from each of the two component models are again assumed to follow a Student's t-distribution, and are illustrated in Figure 2. Figure 2a shows a section of the hydrograph, illustrating the probability of selecting model 1. We can now allow the structure of errors to change depending on which part of the catchment's response we are modeling. The distribution of the errors is shown in Figures 2b–2d for model 1 and model 2. To create these subplots, it was assumed that the model simulation came from model 1 if the mean conditional probability of selecting the model was greater than 0.5. Note that the errors are now better summarized by the two separate probability distributions. By having separate error models, heteroscedasticity in the errors may be modeled. Note that the distribution of errors for model 1 has a much higher variance than model 2. The individual error structures do not show a dominance of very low flows with long tails. By dividing the data, the individual error distributions more closely correspond to Student's t-distributions than a single model does.
 Using an approach like this has a number of attractions. It allows greater flexibility in the specification of the model structure, as the preferred structure can change between time steps. The approach has the recognized benefits of combining the results from several hydrological models but allows greater flexibility than that provided by a simple or weighted average of model results. By combining model structures, it is possible to achieve a model prediction that is better than that provided by a single model. In the suggested approach, it is not a requirement that the model structure be the same for each component model, so it is possible to apply models which assume different catchment mechanisms for different sections of the data. If it is recognized that the catchment is dynamic, with different dominant processes occurring at different times, this ability to switch between models that assume different dominant mechanisms provides a simple yet elegant way of mimicking the dynamic nature of the changing processes in a catchment.
 The approach also allows greater flexibility in the specification of the model errors. Rather than using a multiobjective approach (where the entire length of the data is used to specify each objective), the data are divided and different assumptions about the distribution of the model errors in different sections of the data may be made. The approach can illustrate differences in the characteristics of errors resulting from different models.
3. HME as a Predictive Modeling Tool
 Being able to use the HME modeling framework in prediction relies on being able to accurately estimate the probability of selecting one component model at each time step. This probability is calculated from two elements of the HME framework: a catchment descriptor that summarizes the “state” of the catchment at a time step (the predictor) and the gating function that relates the probability of selecting a model to the predictor.
 An important and indeed beneficial part of implementing the HME framework is assessing the effectiveness of different catchment predictors and gating functions in reproducing the switch from one component model to another. Each coupled component model and error model can be thought of as reproducing a different catchment “state,” where different dominant hydrological processes are driving the catchment's response to rainfall. By assessing different predictors, modelers can interpret what physical processes are related to (or are forcing) the switch from one catchment state to another. This choice of predictor should be made through an elaborate comparison that considers the various mechanisms that could lead to a switch in the dominant processes generating streamflow. With an inappropriate predictor the probability of selecting a particular model may not be approximated according to its suitability. Despite its increased complexity the HME framework may give a fit in prediction that is worse than any individual component model.
 The HME model framework is ideally implemented in a Bayesian framework due to its probabilistic nature. A theoretically sound approach to model comparison using Bayesian inference can be used to compare the performance of different catchment predictors and gating functions. The traditional Bayesian approach, which requires calculation of Bayes factors, has been used in previous hydrological studies [Marshall et al., 2005] but may be hindered by computational difficulties. The more complex nature of the HME framework (including the introduction of latent variables indicating which component model is selected) means Bayes factors are not easily estimated. Other predictive model comparisons can be implemented in a Bayesian framework, and this study has implemented an alternative approach.
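 The Bayesian information criterion (BIC) of Schwarz is an asymptotic approximator of the marginal likelihood of a model. The criterion holds that the model log marginal likelihood is approximately −0.5 BIC, where (if N is the size of the sample)
$$\mathrm{BIC} = -2 \ln \hat{L} + k \ln N \tag{8}$$
Here $\hat{L}$ is the maximized likelihood and k is the number of model parameters.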
The best model is that which minimizes the above criterion. The approach is desirable when comparing models of different complexity as it accounts for overfitting by penalizing the criterion according to the number of parameters in a model. It is not an appropriate measure under all circumstances, and it has been reported that the BIC will tend to favor models that are too simple if insufficient data are available [Marshall et al., 2005]. It is, however, a general and flexible method for comparing models that is generally consistent with more exact methods of calculating Bayes factors and as such is well suited to the HME approach.
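A small sketch shows how the criterion trades goodness of fit against the number of parameters. It uses the standard BIC definition; the log-likelihood values, parameter counts, and sample size are hypothetical and are not results from the study.

```python
import math

def bic(max_loglik, n_params, n_obs):
    """Bayesian information criterion: smaller values indicate the preferred model."""
    return -2.0 * max_loglik + n_params * math.log(n_obs)

# Hypothetical comparison on roughly ten years of daily data.
n_obs = 3650
bic_single = bic(max_loglik=-5200.0, n_params=4, n_obs=n_obs)    # single model
bic_mixture = bic(max_loglik=-5100.0, n_params=10, n_obs=n_obs)  # two-component mixture

preferred = "mixture" if bic_mixture < bic_single else "single"
print(bic_single, bic_mixture, preferred)
```

In this illustration the gain in log-likelihood outweighs the penalty for the six additional parameters, so the more complex model is preferred; a smaller gain would reverse the ranking.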
 In the specification of the model as summarized by section 2, a set of latent variables are introduced. These latent variables are related to each other through the prior distribution, and counting the number of parameters for these kinds of models (where parameters are not independent) as required by BIC is not straightforward. Here we present an alternative approach, where we can integrate over the latent variables and work with the corresponding marginal likelihood. If we define the two expert models in Figure 1 as
then for every MCMC iterate we can calculate
where F(β, Xt) is the gating function defining the probability of selecting component model 1 at time t, ϕ(Qt; f1(xt;θ1)) is the distribution of Qt for model 1, and ϕ(Qt; f2(xt;θ2)) is the distribution of Qt for model 2. The likelihood L(Q∣θ1, θ2, σ1², σ2², β) is used to determine the maximized likelihood for estimating the BIC as in (8).
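The marginal likelihood described above, with the latent indicators integrated out, is a two-component mixture evaluated at every time step and multiplied over t. A minimal Python sketch follows, assuming normal error densities for ϕ and that the two component simulations and the gating probabilities have already been computed for one MCMC iterate. The function names (`hme_log_likelihood`, `bic`) and the way parameters are counted are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def normal_pdf(q, mean, var):
    """Normal density of q with the given mean and variance."""
    return np.exp(-0.5 * (q - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)


def hme_log_likelihood(q_obs, sim1, sim2, var1, var2, gate_prob):
    """Log of prod_t [F_t * phi1(Q_t) + (1 - F_t) * phi2(Q_t)].

    q_obs     : observed flows Q_t
    sim1/sim2 : simulated flows f1(x_t; theta1), f2(x_t; theta2)
    var1/var2 : error variances of the two component models
    gate_prob : F(beta, X_t), probability of selecting model 1 at time t
    """
    dens = (gate_prob * normal_pdf(q_obs, sim1, var1)
            + (1.0 - gate_prob) * normal_pdf(q_obs, sim2, var2))
    return float(np.sum(np.log(dens)))


def bic(max_log_lik, n_params, n_obs):
    """Standard BIC from the maximized log-likelihood (assumed form)."""
    return -2.0 * max_log_lik + n_params * np.log(n_obs)
```

In use, `hme_log_likelihood` would be evaluated for every retained MCMC iterate and the largest value passed to `bic`. Because the latent variables no longer appear, `n_params` would count only the component model parameters, the two error scales, and the gating coefficients.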
4. Case Study
 The example presented in section 2.5 shows how the HME approach may be used as a flexible modeling tool for combining model structures and for modeling heteroscedastic data errors. We present a case study here that extends these results by combining two alternative model structures and assessing different predictors for weighting the models. The aim of this study is to illustrate how HME can be used beyond the scope of calibration as a predictive tool. We present the simplest HME framework and compare two versions of a popular conceptual rainfall-runoff model. We compare the predictive performance of a number of covariates for switching between component models in the modeling framework.
4.1. Study Catchment
 The selected study area was the Never Never River at Glenniffer Bridge, a 51-km2 catchment located on the southeast coast of New South Wales, in Australia. The area consists of agricultural and national park land use, with predominantly alluvial soils. The catchment has relatively high runoff yield, with annual rainfall of 2036 mm and annual runoff of 1114 mm. Approximately 10 years of available daily rainfall, evapotranspiration, and runoff data (from September 1988 to December 1998) were used to specify the model in the study. (Note that this catchment is not the same source of data as used in the example presented in section 2.) An additional 2 years of data spanning from January 1986 to May 1988 were used as a validation period to test the fitted model.
4.2. Model Structure
 As no simple and effective method of determining the tree topology exists for application to hydrological systems, the simplest HME architecture was selected, consisting of only two component models and a single level (Figure 1). The desire was to implement an HME framework that was no more complicated than necessary to obtain a good performance in prediction. A more complex framework runs the risk of overfitting and limits the model's range of applicability. There are also existing computational challenges in specifying the model using a Bayesian approach. On the basis of earlier studies [Marshall et al., 2006] it was observed that a range of catchments were well modeled as a mix of only two states.
 Unlike earlier studies, different error variances were specified for each component model to enable modeling of heteroscedasticity. It was observed in the preceding study [Marshall et al., 2006] that one component model would tend to fit to the model peaks, with the other model fitting the recession curve. It is also recognized in a number of hydrological studies that model errors often exhibit heteroscedasticity proportional to flow magnitude [Sorooshian and Dracup, 1980]. On the basis of these results, it was likely that the component model fitting the peak flows would have a greater variance. Also, use of alternative error models would allow better justification of the assumptions made on the distribution of model errors.
 We compare two model structures in the modeling framework. Each component model is specified to be a conceptual rainfall-runoff model of differing complexity. The Australian water balance model (AWBM [Boughton, 2004]) (see Figure 5) is an eight-parameter model, consisting of three surface storages and a single lower storage. The model takes estimates of daily rainfall and evapotranspiration to simulate daily streamflow. The spatial variation in soil moisture within a catchment is represented in the original model via the three upper storages. The use of three surface storages (as opposed to two or four) to represent the catchment storage capacity is a pragmatic choice by the original model developer to allow a good fit to available data while ensuring a relatively simple structure (see Boughton  for discussion).
 It has been observed, however, that the minimum level of model complexity necessary to simulate catchment runoff is strongly related to catchment wetness [see, e.g., Atkinson et al., 2002; Farmer et al., 2003]. Given the hypothesis that the HME approach would favor one model over peak flows and another over sustained low flows, we combine two versions of the traditional AWBM in the HME framework. The first is the traditional three storage model (with eight parameters, Figure 5), and the second is a simplified single storage version, consisting of only three parameters (as used in section 2, Figure 3).
4.3. Gating Function
 It was desirable to use a simple gating function. The forms of the gating function used were chosen to be simple enough to specify easily (so computational effort was minimized). A simpler function would also be easier to interpret, given the simple nature of the component models and the modeling exercise (i.e., daily runoff prediction rather than storm analysis). Hence the popular logistic regression model was used for the gating function. The general form of the model is
where Xt are the catchment predictors and β is the vector of logistic regression parameters. Different options for the function g(Xt, β) were implemented in the study. A simple linear regression function was used of the form
This was also extended to a polynomial regression function:
The final form of the gating function specified for the case study was a linear spline:
The spline was implemented with continuous piecewise linear basis functions $h_m(X_t)$, with $h_1(X_t) = 1$; $h_2(X_t) = X_t$; and $h_{2+i}(X_t) = (X_t - x_{ti})_+$, where $(z)_+ = \max(0, z)$ denotes the positive part and $x_{ti}$, $i = 1, 2, 3$ (the so-called knot points) are chosen here as the 25th, 50th, and 75th percentiles of the predictor values.
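The three gating options described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the logistic link is taken in its usual form exp(g)/(1 + exp(g)), the polynomial is taken to be quadratic (inferred from the three gating parameters reported for it later in the paper), and all function names are hypothetical. The spline basis follows the definition of h1, h2, and h2+i given in the text.

```python
import numpy as np


def gate_probability(g):
    """Logistic link: probability of selecting component model 1."""
    return 1.0 / (1.0 + np.exp(-np.asarray(g, dtype=float)))


def g_linear(x, beta):
    """Linear regression function with two parameters."""
    return beta[0] + beta[1] * np.asarray(x, dtype=float)


def g_quadratic(x, beta):
    """Polynomial regression function, assumed quadratic (three parameters)."""
    x = np.asarray(x, dtype=float)
    return beta[0] + beta[1] * x + beta[2] * x ** 2


def spline_basis(x, knots):
    """Columns h1 = 1, h2 = x, h_{2+i} = max(0, x - knot_i)."""
    x = np.asarray(x, dtype=float)
    cols = [np.ones_like(x), x] + [np.maximum(0.0, x - k) for k in knots]
    return np.column_stack(cols)


def g_spline(x, beta, knots):
    """Linear spline: five parameters when three knots are used."""
    return spline_basis(x, knots) @ np.asarray(beta, dtype=float)
```

For a predictor series `x`, the knots described in the text would be obtained with `np.percentile(x, [25, 50, 75])`, and the gating probability at each time step with `gate_probability(g_spline(x, beta, knots))`.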
4.4. Specifying the Model
 The data used to calibrate the model will influence the posterior distribution through the likelihood function. A suitable form for the likelihood is dependent on the properties of the data. Different forms of the likelihood will make different assumptions about the distribution of model errors. For this study, two error structures were defined (corresponding to different component models) to describe the model errors. It was hypothesized that this approach would reduce the likely heteroscedasticity in the model errors. Two different likelihood functions were applied in the study for the component models. The first assumed homoscedastic, uncorrelated error terms (equation (6)), and the second assumed the errors follow a Student's t-distribution (equation (7), with the number of degrees of freedom set at 4). As discussed earlier, the Student's t-distribution can provide an improvement in fitting the model errors, as it allows for a greater number of observations in the tails of the distribution. Note that for this distribution the variance of the errors is related to the scale parameter by
where υ are the degrees of freedom, set at 4 for this study. The general form of the MCMC algorithm used to determine the distribution of the model parameters can be found in section 2. Note that the choice of proposal distribution is crucial in defining the MCMC approach. As discussed, the parameters are updated in four separate blocks: the component model parameters (θ1 and θ2, with each parameter set consisting of model parameters K, BFI, the surface storage parameters Si, and the fractional areas Ai as illustrated in Figures 3 and 5); the scale parameters for the model errors (σ1², σ2², corresponding to each component model); the gating function parameters (β); and the latent variables (z).
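For the Student's t error model described above, the relation between the error variance and the scale parameter can be written out as below. This assumes the standard location-scale parameterization of the t-distribution, in which $\sigma^2$ is the squared scale; the parameterization used in the paper's equation (7) may differ in notation.

```latex
\operatorname{Var}(\varepsilon_t) = \frac{\upsilon}{\upsilon - 2}\,\sigma^2 \quad (\upsilon > 2),
\qquad
\upsilon = 4 \;\Rightarrow\; \operatorname{Var}(\varepsilon_t) = 2\sigma^2
```

Under this assumption, with four degrees of freedom the error variance is twice the squared scale parameter.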
 The component model parameters are updated using a multivariate normal proposal distribution with pretuned covariance. The covariance was obtained using a slightly modified version of the adaptive Metropolis algorithm [Haario et al., 2001]. Initially, the modified AWBM model alone was applied to each catchment, and a full MCMC run of 50,000 iterations using the adaptive Metropolis algorithm was performed to obtain an estimate of the parameters' covariance. This covariance matrix was used as the proposal covariance for each HME component model. Note that pretuning runs were required to scale the covariance to obtain good rates of acceptance.
 To generate values of σ2 (the model error variance), a proposal distribution was used following from Bates and Campbell  of the form
where σ2′ is the proposed value, σ2 is the current value of the variance, and σ2 is the proposal variance. Again, the proposal variance was tuned to get desirable acceptance rates. For the gating function parameters, a multivariate normal proposal distribution was used, with mean at the current parameter value and fixed (pretuned) covariance.
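A single update of one parameter block with a multivariate normal proposal centered on the current value, as described above for the component model and gating function parameters, might look like the following Python sketch. It is a generic random-walk Metropolis step, not the authors' code: `metropolis_step` and `log_post` are hypothetical names, the covariance is assumed to be already tuned, and the adaptive Metropolis pretuning itself is not shown.

```python
import numpy as np


def metropolis_step(theta, log_post, cov, rng):
    """One random-walk Metropolis update of a parameter block.

    theta    : current parameter vector for this block
    log_post : function returning the log posterior density (up to a constant)
    cov      : fixed, pretuned proposal covariance matrix
    rng      : numpy random Generator
    Returns the new state and whether the proposal was accepted.
    """
    proposal = rng.multivariate_normal(theta, cov)
    # The proposal is symmetric, so the acceptance ratio is the posterior ratio.
    log_alpha = log_post(proposal) - log_post(theta)
    if np.log(rng.uniform()) < log_alpha:
        return proposal, True
    return theta, False
```

Tracking the fraction of accepted proposals over a pretuning run gives the acceptance rate that the covariance scaling is adjusted against.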
 Convergence was diagnosed using the method suggested by Gelman and Rubin. An underestimate and an overestimate of the target distribution variance for a suitable scalar function of the parameters are formed based on a number of sequences. At convergence, the estimates will be roughly equal. Ten sequences of 50,000 iterations were used to determine convergence. It was determined that the chain had reached convergence if the two estimates of the distribution variance had a ratio less than 1.2.
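The convergence check described above can be sketched as follows, using the usual within-sequence variance as the underestimate and the pooled estimate as the overestimate. This is an illustrative Python version under those assumptions; the function name `variance_ratio` is hypothetical, and the diagnostic is often reported as the square root of this ratio instead.

```python
import numpy as np


def variance_ratio(chains):
    """Ratio of the over- to the underestimate of the target variance.

    chains : array of shape (m, n), m sequences of n draws of a scalar quantity
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    # Underestimate: mean of the within-sequence variances.
    within = chains.var(axis=1, ddof=1).mean()
    # Between-sequence variance of the sequence means, scaled by n.
    between = n * chains.mean(axis=1).var(ddof=1)
    # Overestimate: weighted combination of within and between variances.
    pooled = (n - 1) / n * within + between / n
    return pooled / within
```

A value close to 1 indicates that the sequences are sampling the same distribution; the study's threshold of 1.2 would be applied to this ratio for each scalar quantity monitored.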
 The HME model was then fitted to 10 years of the available data, and the parameter posteriors were determined using a full MCMC run of 50,000 iterations. The first 10,000 iterations from each simulation were discarded and the remaining 40,000 were used in all calculations. A single component model (with AWBM structure) was also run using the adaptive Metropolis algorithm [Haario et al., 2001]. On the basis of these simulations, the BIC was calculated for each different gating function and catchment predictor considered in the study using the method outlined in section 3. The BIC was also calculated for the single model. The BIC results were confirmed using an additional validation period. The fitted HME and AWBM models were run on an additional 2-year period using parameter values obtained from the calibration period corresponding to the point of maximum posterior density. The log-predictive density was then estimated for this period.
5. Results and Discussion
 The HME framework was assessed using two distributions of model errors in terms of its suitability for prediction in comparison to a single model structure. The two likelihood functions used to specify the model were compared to determine which function better modeled the data errors for a single model and the aggregated HME approach. The posterior distributions of the calibrated model parameters were also examined.
Table 1 gives the resulting posterior means for the parameters of the traditional three storage AWBM (when calibrated as a single model, not in the HME framework) and the parameters' posterior variances, indicating the identifiability of the model parameters. The results for both likelihood functions are included.
Table 1. AWBM Mean and Variance of Parameters Posterior Distributions
Distribution of Model Errors
S1, S2, S3
52.5, 128.9, 210.1
6.4E+02, 5.8E+02, 2.6E+00
12.3, 13.7, 15.1
5.4E-01, 9.1E-01, 8.2E-01
Table 2 gives the BIC value assessing the model fit and the log-predictive density for the validation period. Note that we have reported −0.5 BIC so that the results approximate the log-marginal likelihood of the model. A higher value of the marginal likelihood will mean a better model fit. Note that if we are comparing two models (with equal prior probability), the weight of evidence for model 1 over model 2 can be approximated as the ratio exp(−0.5 BIC1)/exp(−0.5 BIC2), where BIC1 and BIC2 are the respective BIC estimates for models 1 and 2.
Table 2. AWBM Predictive Performance Assuming Different Distributions of the Model Errors
Distribution of Model Errors
Log-Predictive Density for Validation Period
 It can be seen in the results in Table 2 that the model which assumes normally distributed errors has a significantly lower value of −0.5 BIC, indicating a better predictive ability for the model which assumes t-distributed errors. While the BIC is estimated for the “training” period, it penalizes model complexity and is a good indicator of the model simulation outside this period. However, the BIC is an asymptotic approximator and for small sample sizes can be unreliable. To assess the model fit for an additional validation period, Table 2 also gives the log-predictive density of the model simulation for an additional 2-year period. Note that the results generally correspond to those given by the BIC. The model errors are better fitted by a Student's t-distribution.
5.1. HME Model With Assumed Independent Model Error Structure
 The resulting BIC for different gating functions and predictors assuming the errors are normally distributed are given in Table 3 along with the log-predictive density for the validation period. For each likelihood function, a ‘null’ predictor (where a variable describing the catchment state is not included) was first implemented. For this case, the probability of selecting each model does not change in time so that the resulting simulation is a weighted average of the two component models. Note that this produces a BIC value that is worse than a single component model. Hence a simple weighted average of the two component models may not give a better out-of-sample prediction than that from a single model. While the individual component models are partially conditioned on the observed data and the gating function, a null predictor does not take into account the conditions under which alternative model structures and error distributions would be preferred in prediction. The usefulness of the HME approach for prediction thus relies on selection of an appropriate predictor that can summarize the hydrologic regime.
Table 3. HME Model Performance Assuming Normally Distributed Errors
Log-Predictive Density for Validation Period
Modeled base storage
Log-antecedent cumulative rainfall (7 days preceding)
 When different predictors are introduced, the model performance improves. In a preceding related study [Marshall et al., 2006] it was observed that the parameter BFI was most sensitive in describing the switch from one catchment state to another. This parameter describes the proportion of excess runoff returning to the base storage component of the model. Hence the modeled base storage was initially selected to describe the catchment “state” when determining the probability for each model. Note that with a linear logistic gating function, this gives a better predictive performance as measured by the BIC than when the AWBM alone is implemented.
 This result was then compared to an antecedent rainfall predictor assuming a linear logistic gating function. This combination gave a better predictive result than using the modeled base storage to describe the catchment state. On the basis of this, different gating functions were investigated using the antecedent rainfall as a predictor. A polynomial logistic gating function did not offer a convincing improvement in predictive performance as measured by the BIC. When a linear spline is implemented, the predictive performance according to the BIC improves, so that the antecedent rainfall with a linear spline gating function gives the best performance by this measure.
 It must be noted that the gating function introduces additional parameters that must be specified in the HME framework. For a linear gating function, there are an additional two parameters, three for a polynomial function, and five for a spline gating function. These parameters are well identified, due largely to the length of data available (10 years) and the simple functions used. For each simulation presented in Table 3, the coefficients did not include zero in the range of their posterior (indicating a definite positive/negative relationship between the predictor and the probability of selecting each model). For example, the parameters β0 and β1 of the linear logistic gating function taking the form of (13) with a log-antecedent cumulative rainfall predictor, had a mean of −11.6 and 2.7, respectively, with a variance of 0.31 and 0.018, respectively.
5.2. HME Model With Assumed Student's T-Distributed Model Error Structure
Table 4 lists the estimated BIC assuming the model errors follow a Student's t-distribution. Initially, the same combination of predictors and gating functions was applied as in the previous case where the errors were assumed normally distributed. Initial results showed that the t-distributed error structure gave a better fit to the model errors (note also that the BIC values for the AWBM and HME model results assuming t-distributed errors are better than those assuming normally distributed errors). For this reason, a more rigorous exploration of catchment predictors and gating functions was performed assuming the likelihood given by (7). We test only individual predictors here out of a desire to keep the entire structure parsimonious, and because the modeling exercise (daily runoff prediction using relatively simple conceptual rainfall-runoff models) means that a more complex choice would be difficult to interpret. Theoretically, any number of predictors could be combined.
Table 4. HME Model Performance Assuming Student's t-Distributed Errors
Log-Predictive Density for Validation Period
Modeled base storage
Log-antecedent cumulative rainfall (7 days preceding)
 Initially, as in the previous case, a null predictor was implemented, which again gave a lower value of −0.5 BIC than the single model (recalling that this is an estimator of the log-marginal likelihood). Other predictors, including the modeled streamflow, the modeled base storage, the observed rainfall, and the (7 days) antecedent rainfall, improve the predictive performance over the single model when used with a linear logistic regression function. Note that in comparison to more complicated gating functions, a linear logistic regression function is sufficient for relating the predictor to the probability of selecting each model, with generally a small difference between BIC and log-predictive density values.
 The mean parameter values of the component model corresponding to the best combination of predictor and gating function are given in Table 5. Note carefully the implications of the parameters in reference to Figure 3 and Figure 5. Component model 2 had a relatively high BFI, implying that more excess runoff is stored in the base flow storage, and correspondingly a small fraction of the rainfall enters the stream as direct runoff. Also, component model 2 is the more complex conceptual model, better suited to modeling the partial area runoff that occurs when the catchment is not entirely saturated. The values for HME component 1 imply the reverse. Component model 1 can be thought of as a “quick flow” process, where more excess runoff is generated in each storm. Component model 2 shows a high recharge state, where excess rainfall is stored in the catchment and released more slowly. Component model 2 is better suited to simulating slower subsurface flow and so is dominant at the recession and sustained low flows rather than peak flows.
Table 5. HME Mean Posterior Model Parameters Assuming Student's t-Distributed Errorsᵃ
ᵃA linear logistic gating function with the 7-day cumulative preceding rainfall as a predictor is used.
HME Component Model 1
HME Component Model 2
S1, S2, S3
1.37, 2.52, 3.56
 The variance of the errors corresponding to each component model is also given in Table 5. Note that the distribution of errors resulting from the quick flow model (model 1) has a much higher variance. This would be expected from a more variable state and a simpler model structure. The more complex recharge model (which tends to be dominant at the sustained low flows) has a corresponding lower variance. These highlight important implications of the predictive ability of the HME approach. As the model framework has been specified using Bayesian inference, we can estimate prediction intervals for each model (and gating function) using the MCMC iterates generated during model calibration. Figure 6 compares these prediction intervals for the single AWBM and the HME model for a section of the fitted hydrograph. The HME model has been specified assuming a log (7 day) cumulative rainfall predictor and a linear logistic regression gating function, as this combination showed a good predictive performance (Table 4). The width of the prediction intervals for the AWBM is reasonably constant over the length of the selected simulation regardless of the magnitude of the simulated runoff. This is due to the assumption that the model errors are independently and identically distributed and do not vary in time. As such, the model variability arises out of parameter uncertainty and the large error variance. Given the length of data used to identify the model (11 years of daily data), the model parameters are well identified. For the HME model, the error bands are generally wider over the peak and early recession of the hydrograph. They narrow over the low simulated flows.
 It must be noted that the individual component models in the HME framework take the same assumption on the distribution of model errors as the AWBM. Why then do we not observe fairly constant prediction intervals over the length of the simulation for the HME approach? This difference arises from the varying probability with which each model is selected. Using the antecedent rainfall as a predictor of the switch between models, the quick flow model (model 1) has a higher probability at the high peaks and recession of the runoff hydrograph. This model has a higher calibrated error variance, and so for periods where this model has a higher weighting (probability) the prediction intervals are wider. Correspondingly, for the periods in which the recharge model is dominant, the error bands are narrower. The width of the error bands for the HME model, however, is intuitively too large over the recession period. This indicates an inability of the model to adequately capture the recession. One solution may be to introduce an additional model, another to model the error variance proportional to the flow magnitude, but these remain the scope of future works.
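The argument above can be made explicit with the variance of a two-component mixture, conditional on the parameter values. In this sketch $p_t$ denotes the gating probability of model 1 at time $t$, $v_1$ and $v_2$ the error variances of the two component models, and $f_{1,t}$, $f_{2,t}$ their simulated flows; these symbols are assumed here for illustration.

```latex
\operatorname{Var}(Q_t) = p_t\, v_1 + (1 - p_t)\, v_2 + p_t (1 - p_t)\,\bigl(f_{1,t} - f_{2,t}\bigr)^2
```

Although $v_1$ and $v_2$ are each constant in time, $p_t$ is not: when the quick flow model with the larger variance carries most of the weight the predictive spread grows, and it shrinks toward $v_2$ when the recharge model dominates.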
 The increased flexibility in the specification of the model errors for the HME approach means that the proficiency of the model is more consistent over the range of possible flow conditions. The reliability and sharpness of the HME and AWBM models can be compared in Table 6. The model reliability can be assessed via the proportion of times in which the observed flow lies outside the 90% prediction interval for different flow regimes. The prediction limits arise out of the assumption that the model errors follow a Student's t-distribution. The model “sharpness” is compared here as the average variance of the width of the prediction interval. Note that because we assume a constant distribution of model errors for the AWBM, the width of its prediction interval does not change. Both models simulate the low flows well. However, the HME model has a significantly sharper prediction for flow regimes below the 50th percentile of observed flow. For higher flow events, the width of the HME prediction limits increases. The HME model reliability is improved for these flows as compared to the AWBM, but the trade-off is a reduced “sharpness” of the prediction intervals. Overall, the AWBM has a good success rate at simulating flows below the average observed flow but a very poor success rate at simulating above average flows. The HME model is more reliable for different flow regimes, with more comparable success rates for simulating runoff above and below the observed average.
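The reliability measure described above can be computed per flow regime as in the following Python sketch. It assumes that lower and upper 90% prediction limits are already available for each day, and it summarizes sharpness simply as the mean interval width, which may differ from the exact sharpness statistic reported in Table 6. The function name `reliability_by_regime` is hypothetical.

```python
import numpy as np


def reliability_by_regime(q_obs, lower, upper, percentile, below=True):
    """Reliability and interval width for one flow regime.

    q_obs      : observed flows
    lower/upper: prediction limits for each time step
    percentile : percentile of observed flow that defines the regime
    below      : True for flows under the threshold, False for flows over it
    Returns (proportion of observations outside the interval, mean width).
    """
    q_obs = np.asarray(q_obs, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    threshold = np.percentile(q_obs, percentile)
    regime = q_obs < threshold if below else q_obs > threshold
    outside = (q_obs < lower) | (q_obs > upper)
    return float(outside[regime].mean()), float((upper - lower)[regime].mean())
```

Calling this with percentiles 10, 25, and 50 (`below=True`) and 50, 75, and 90 (`below=False`) reproduces the six flow regimes listed in Table 6.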
Table 6. AWBM and HME Reliability and Size of Prediction Interval for Different Flow Regimesᵃ
Flow Regime
Proportion of Observed Flow Lying Outside 90% Prediction Interval
Mean Variance of Model Errors
ᵃA linear logistic gating function with the 7-day cumulative preceding rainfall as a predictor is used.
Observed flow < 10% flow
Observed flow < 25% flow
Observed flow < 50% flow
Observed flow > 50% flow
Observed flow > 75% flow
Observed flow > 90% flow
Observed flow < 10% flow
Observed flow < 25% flow
Observed flow < 50% flow
Observed flow > 50% flow
Observed flow > 75% flow
Observed flow > 90% flow
6. Conclusion and Future Work
 Using a single model with a rigid model structure can lead to significant bias in the modeled hydrograph, as evidence exists of a catchment responding differently under different antecedent conditions. To form a better prediction of catchment behavior than would be provided by a single model, a model can be approximated through the combination of a number of different modeling configurations. Each model is adopted at a given time with a probability that depends on the current hydrologic state of the catchment. This framework is known as a hierarchical mixture of experts (HME).
 When applied in a hydrological context, the HME approach has two major functions. It can act as a predictive tool, where simulation is extended beyond the calibration period. This can be useful for a variety of water resource management applications, such as filling in streamflow records, flow forecasting, or designing reservoir operating rules. However, the approach can also be used as a tool for model development and building, by interpreting the final model architecture in calibration. The success of the final model for prediction will also depend on interpretation of how the predictor variables describe the switch between models.
 Application of the HME framework in the case study presented in this paper shows that the catchment behaves differently under different conditions. Evidence was shown that the catchment is well modeled as two different “states” rather than by a single static model. The two HME component models used corresponded to different catchment mechanisms: a high recharge state where the base flow storage is increasing, and a low recharge state in low flow times.
 The challenge in applying the HME framework to catchments for predictive purposes lies in determining which model should be selected depending on the state of the catchment. Estimating the probability of selecting each HME component model is central to the effectiveness of the proposed HME framework. Estimating this probability is reliant on choosing appropriate variables to characterize the catchment state, and a mathematical function that can relate the predictor value to the dominant model. In this study several different gating functions and predictors were investigated. The predictive performances of these were compared via the BIC, a Bayesian-like comparison criterion. Results showed that by comparing different predictors, the modeler can assess which variables are most likely forcing or related to a “switch” in the catchment state. By selecting different predictors, the HME framework can give a better simulation than from a single model (taking into account the increase in model complexity). In this study, simpler gating functions and predictors were selected, but the approach could be extended to a combination of any number of catchment indicators. The cumulative antecedent rainfall (a measure of the catchment's wetness) proved to be the most appropriate predictor for the catchment investigated, given different assumptions about the distribution of the model errors.
 Much of the current literature concerning HME is interested in finding the optimum topology of the network's architecture. Although this may prove a promising way to extend the simple architecture used in this study, computational difficulties will likely arise and there is a desire to keep the model parsimonious in hydrological applications. Complicating the model structure also limits its applicability. By dividing the calibration data space, the individual component models may become overidentified. This is of particular importance in hydrological modeling, where there may be insufficient data available to identify each model.
 This research was partly funded by a UNSW Goldstar research grant. The authors wish to thank the two anonymous reviewers for their helpful suggestions. Francis Chiew at CSIRO is thanked for supplying us with the rainfall-runoff data sets used.