## 1. Introduction

[2] Hydrological models are prone to structural errors, defined, for example, by *Beven* [2005] as a combination of incorrect representations of processes, conceptual errors, processes that are not represented, and implementation errors. As a consequence, several different model structures may exist for a given application, each with several different parameter sets, which may yield equally acceptable, yet imperfect, simulations when compared to the available data [*Beven and Binley*, 1992; *Neuman*, 2003; *Beven*, 2006]. Focusing on a single model structure is, therefore, likely to result in modeling bias and underestimation of model uncertainty [*Neuman*, 2003].

[3] The realization of this fact has recently led to multiple model structures being considered simultaneously (ensemble simulation) in hydrological applications. Ensemble simulation studies have been undertaken in groundwater modeling [e.g., *Neuman*, 2003; *Ye et al.*, 2004; *Poeter and Anderson*, 2005], where different model structures mostly mean different representations of spatially heterogeneous parameterizations. In rainfall-runoff modeling, *Shamseldin et al.* [1997] were among the first to explore simple and weighted averaging, as well as neural networks, as ways to combine the simulations of multiple models into a single output in some optimal way [see also *See and Abrahart*, 2001]. Similar approaches to model combination, following the paradigm of a single optimal output, include multiple-input/single-output linear transfer functions [*Shamseldin and O'Connor*, 1999], (fuzzified) Bayesian inference [*See and Openshaw*, 2000], (fuzzy) rules [*Xiong et al.*, 2001; *Abrahart and See*, 2002], and multimodel super-ensembles [*Ajami et al.*, 2006].

[4] The single model output paradigm, however, misses important information on prediction uncertainty. In contrast, *Georgakakos et al.* [2004] began to analyze the distribution of simulations within rainfall-runoff model ensembles as well as the ensemble mean. *Butts et al.* [2004] followed a similar approach in their analysis of an ensemble of structures within a common modeling framework, which they extended to the investigation of parameter and input/output data uncertainties. *Clark et al.* [2008] took a modular approach to combining the conceptual choices of four models into 79 unique structures, which they analyzed for differences and similarities.

[5] A framework to integrate all sources of uncertainty in modeling is available through Bayesian statistics. Formal Bayesian Model Averaging (BMA) was used in hydrological applications by *Vrugt et al.* [2006], *Duan et al.* [2007], *Ajami et al.* [2007] and *Vrugt and Robinson* [2007]. To overcome the usually static weighting of model structures in BMA, *Marshall et al.* [2006] introduced Hierarchical Mixtures of Experts to allow the weights of two rainfall-runoff model structures to vary dynamically depending on predicted states of a study catchment. *Hsu et al.* [2009] updated the weights of three model structures sequentially based on their performance at newly available observation time steps. An alternative to model averaging within Bayesian statistics is to formulate a model structural error term [*Kennedy and O'Hagan*, 2001; *Vrugt et al.*, 2005; *Kuczera et al.*, 2006; *Huard and Mailhot*, 2006, 2008], although this may be problematic to define [*Beven*, 2005; *Huard and Mailhot*, 2008].
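The averaging step common to these BMA variants can be made concrete: the predictive mean is a combination of the individual model predictions weighted by posterior model probabilities, and the disagreement among models contributes a between-model variance term. The following minimal Python sketch illustrates only this weighting step (the function name and toy data are illustrative, not taken from any of the cited studies; the within-model variance term of full BMA is omitted):

```python
import numpy as np

def bma_mixture(preds, weights):
    """Combine an ensemble of model predictions by weighted averaging.

    preds   : (K, T) array, predictions of K models at T time steps
    weights : (K,) array of posterior model probabilities

    Returns the weighted ensemble mean and the between-model variance
    (full BMA would add each model's own predictive variance to this).
    """
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()            # normalize defensively
    preds = np.asarray(preds, dtype=float)
    mean = weights @ preds                       # ensemble mean, shape (T,)
    between_var = weights @ (preds - mean) ** 2  # spread among models
    return mean, between_var

# two toy "models" predicting three time steps
mean, var = bma_mixture([[1.0, 2.0, 3.0],
                         [3.0, 2.0, 1.0]], [0.75, 0.25])
# mean -> [1.5, 2.0, 2.5], var -> [0.75, 0.0, 0.75]
```

Static weights, as in this sketch, are exactly what *Marshall et al.* [2006] and *Hsu et al.* [2009] sought to relax by letting the weights vary with catchment state or with sequentially updated performance.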

[6] An informal Bayesian framework is the Generalized Likelihood Uncertainty Estimation (GLUE) methodology, which converges to formal Bayesian inference if the required assumptions are made and formal likelihood measures are used [*Beven*, 2006]. The possibility of multiple model structures has always been inherent in the methodology [*Beven and Binley*, 1992], although this paper is the first to explore it in an application. At the heart of GLUE is the concept of rejecting non-behavioral models and weighting the behavioral ones for ensemble simulation. Input data uncertainty can be taken into account through multiple data scenarios, which are propagated through a set of models to form an extended ensemble of simulations [*Pappenberger et al.*, 2005; *Younger et al.*, 2009]. The alternative to input scenarios in a Bayesian framework is an input error term [*Kavetski et al.*, 2003; *Vrugt et al.*, 2005; *Kavetski et al.*, 2006a, 2006b; *Huard and Mailhot*, 2006; *Ajami et al.*, 2007; *Huard and Mailhot*, 2008; *Vrugt et al.*, 2008], which, again, may be difficult to estimate in practice [*Beven*, 2005; *Kavetski et al.*, 2006a]. Uncertainty in the data against which models are evaluated (output data) is usually treated only implicitly when defining model performance measures. Recent efforts, however, have made the specification of output error models more explicit [*Kennedy and O'Hagan*, 2001; *Kavetski et al.*, 2003; *Vrugt et al.*, 2003, 2005; *Beven*, 2006; *Kavetski et al.*, 2006b; *Huard and Mailhot*, 2006; *Vrugt and Robinson*, 2007; *Harmel and Smith*, 2007], although error models have rarely been justified with independent data (see *Pappenberger et al.* [2006], *Huard and Mailhot* [2008], and *Liu et al.* [2009] for exceptions).
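The rejection-and-weighting step at the heart of GLUE can be sketched as follows. The Nash-Sutcliffe efficiency used here as the informal likelihood and the zero rejection threshold are illustrative choices only; GLUE deliberately leaves both to the analyst, and the function name and toy data are hypothetical:

```python
import numpy as np

def glue_weights(sim, obs, threshold=0.0):
    """Informal GLUE weighting of an ensemble of candidate simulations.

    sim : (K, T) array of K candidate simulations over T time steps
    obs : (T,) array of observations

    Candidates with a Nash-Sutcliffe efficiency at or below `threshold`
    are rejected as non-behavioral (weight 0); the remaining behavioral
    candidates are weighted in proportion to their efficiency.
    """
    sim = np.asarray(sim, dtype=float)
    obs = np.asarray(obs, dtype=float)
    sse = ((sim - obs) ** 2).sum(axis=1)
    nse = 1.0 - sse / ((obs - obs.mean()) ** 2).sum()
    w = np.where(nse > threshold, nse, 0.0)   # reject non-behavioral models
    if w.sum() == 0.0:
        raise ValueError("no behavioral models in the ensemble")
    return w / w.sum()                        # normalize over behavioral set

# three candidates against obs = [1, 2, 3]; the second (NSE = -3) is rejected
# glue_weights([[1, 2, 3], [3, 2, 1], [1, 2, 4]], [1, 2, 3]) -> [2/3, 0, 1/3]
```

The resulting weights can then be used to form weighted prediction bounds from the behavioral simulations, rather than a single optimal output.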

[7] This paper demonstrates for the first time how model parameter, structural and data uncertainties can be accounted for explicitly and simultaneously within GLUE. As an example application, different model hypotheses of runoff generation are tested on a set of experimental grassland field-scale lysimeters. Following the notion of models as hypotheses of environmental systems behavior [*Beck*, 1987], this is the starting point of a downward modeling approach [*Klemeš*, 1983], i.e., one that aims first at a parsimonious description of the dynamics reflected in the observed data and then at a disaggregation of these dynamics as a continuing learning process [*Sivapalan and Young*, 2005] in which model improvement and additional data collection are interdependent. As the first iteration in this learning process, this paper is not concerned with prediction, but with model diagnostics aiming at better process representation. Input scenarios are propagated through an ensemble of conceptual models which, accounting for parameter uncertainty, are evaluated against uncertain output data. Model rejection and diagnostics are used to learn about the hydrological behavior of the study fields. Model improvement and additional data collection are suggested for the next iteration of model development.
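The combination described above, in which every pairing of an input scenario, a model structure, and a parameter set yields one candidate simulation to be judged against uncertain output data, can be sketched as a simple nested loop. All names below are hypothetical, and the within-bounds behavioral criterion (a limits-of-acceptability style test) is just one possible evaluation choice:

```python
import itertools
import numpy as np

def run_extended_ensemble(input_scenarios, structures, parameter_sets,
                          obs_lower, obs_upper):
    """Return (simulation, behavioral) pairs for every combination.

    A candidate is flagged behavioral here only if it stays within the
    observation uncertainty bounds at every time step; other GLUE-style
    criteria could be substituted.
    """
    results = []
    for forcing, model, params in itertools.product(
            input_scenarios, structures, parameter_sets):
        sim = model(forcing, params)   # one candidate simulation
        behavioral = bool(np.all((sim >= obs_lower) & (sim <= obs_upper)))
        results.append((sim, behavioral))
    return results

# toy usage: one linear "structure" scaled by a single parameter
model = lambda forcing, params: params * forcing
results = run_extended_ensemble(
    [np.array([1.0, 2.0]), np.array([1.1, 2.2])],  # two input scenarios
    [model],                                       # one model structure
    [1.0, 2.0],                                    # two parameter sets
    np.array([0.9, 1.8]), np.array([1.2, 2.4]))    # output data bounds
```

Rejecting and diagnosing members of such an extended ensemble, rather than predicting with it, is the purpose of the first iteration of the learning process described above.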