## 1. Introduction

[2] As watershed and other environmental simulation models become more widely used, there is greater need for procedures that generate realistic prediction intervals and other representations of uncertainty that describe the likely difference between actual flows and their forecasts, and between estimated parameters and their true values (if such true values exist). Uncertainty analysis has now become common practice in the application of environmental simulation models. This is also a primary goal of the Predictions in Ungauged Basins (PUB) initiative promoted by the *International Association of Hydrological Sciences* [2003] and a fundamental need of most end users [*Montanari*, 2007].

[3] The generalized likelihood uncertainty estimation (GLUE) technique introduced by *Beven and Binley* [1992] is an innovative uncertainty method that is often employed with environmental simulation models. There are now over 500 citations to their original paper, which illustrates its tremendous impact. GLUE's popularity can be attributed to its simplicity and its applicability to nonlinear systems, including those for which a unique calibration is not apparent. *Montanari* [2005] suggests that GLUE's popularity is due to the apparent success it has enjoyed in real-world applications, and that it appears to provide the needed characterization of uncertainty. *Blasone et al.* [2008b, pp. 20–21] point to GLUE's conceptual simplicity, ease of implementation, and its flexibility with different sources of information that can be combined with different criteria to define a likelihood measure.

[4] Recent evaluations of GLUE by *Christensen* [2004], *Montanari* [2005], *Mantovan and Todini* [2006] and this study clearly demonstrate that prediction limits derived from GLUE can be significantly different from prediction limits derived from correct classical and widely accepted statistical methods. *Beven* [2006b] discussed these concerns, and called for additional studies. *Mantovan and Todini* [2006] and *Mantovan et al.* [2007] show that, with the “less formal” likelihood functions generally adopted in most previous GLUE applications, estimates of prediction uncertainty will be what they call “incoherent and inconsistent”, compromising valid statistical inference. In response, *Beven et al.* [2007, 2008] point out problems that result when a likelihood function overestimates the information content of data. The example from *Beven et al.* [2007, 2008] again reinforces a major point made by *Mantovan and Todini* [2006], and by this study: if one wants to correctly understand the information content of the data, one needs to use a likelihood function that correctly represents the statistical sampling distribution of the data.

[5] In GLUE's defense, *Beven and Freer* [2001] argue that GLUE prediction limits should not and will not coincide with limits based on classical statistics. More recently *Beven* [2006a] states that

“These prediction limits will be conditional on the choice of limits of acceptability; the choice of weighting function; the range of models considered; any prior weights used in sampling parameter sets; the treatment of input data error, etc. …However, given the potential for input and model structural errors, they [the choices] will not guarantee that a specified proportion of observations, either in calibration or future predictions, will lie within the tolerance or prediction limits (the aim, at least, of a statistical approach to uncertainty). Nor is this necessarily an aim in the proposed framework.”

[6] If the aim of the GLUE framework is not to generate prediction and uncertainty intervals that contain the specified quantities with the prescribed frequency or probability, then we do not know what the purpose of the analysis is, or what GLUE advocates intend for their uncertainty intervals to represent. If GLUE provides a valid statistical analysis of environmental models when employed as recommended, then we contend that when applied to a very simple model with a classic model error structure, GLUE should reproduce the widely accepted uncertainty intervals generated using both classical and Bayesian statistical methods that provide the correct descriptions of uncertainty in that case. If, as we show, GLUE does not generally reproduce the correct uncertainty intervals when applied to a wide range of simple problems, then there is little reason to believe it will provide reasonable results for difficult problems for which the correct solution is not known.

[7] The aim of this paper is to evaluate GLUE using a linear rainfall/runoff model, so that model calibration is a linear regression problem for which exact expressions for uncertainty are well known and understood. It is common practice to test new methods and theories on old, well-understood problems and special cases to see if the new proposals provide valid solutions and thus are credible. Simple cases are, after all, special cases of complicated situations: one cannot logically claim a method works for complicated situations if it does not work for the simple situations that are special cases.

[8] The statistical and probabilistic interpretation of GLUE analyses and the choice of a likelihood function are the focus of this paper. This paper also shows how to correctly employ GLUE with simulation models to assure that uncertainty analyses produce reasonable prediction limits consistent with traditional statistical methods. In a broader perspective, this paper reflects on the difference between reality and the claims made for GLUE with subjective likelihood measures as a model calibration and sensitivity analysis framework, and the validity of Beven's Equifinality Manifesto [*Beven*, 2006a].

### 1.1. Previous Applications of GLUE

[9] *Beven and Binley*'s [1992] paper introducing GLUE for use in uncertainty analysis of watershed models has now been extended well beyond rainfall-runoff watershed models to flood inundation estimation [*Romanowicz et al.*, 1996], ecological models [*Pinol et al.*, 2005], schistosomiasis transmission models [*Liang et al.*, 2005], algal dynamics models [*Hellweger and Lall*, 2004], crop models [*Tremblay and Wallach*, 2004], water quality models [*Smith et al.*, 2005], acid deposition models [*Page et al.*, 2004], geochemical models [*Zak et al.*, 1997], offshore marine sediment models [*Ruessink*, 2005], groundwater modeling [*Christensen*, 2004], wildfire prediction [*Bianchini et al.*, 2006] and others. Given the widespread adoption of GLUE analyses for a broad range of problems, it is appropriate that the validity of the approach be examined with care. *Christensen* [2004], *Montanari* [2005], *Mantovan and Todini* [2006] and this study provide such reviews.

### 1.2. GLUE Methodology

[10] The Beven-Binley GLUE method is a Monte Carlo approach which is an extension of Generalized Sensitivity Analysis (GSA) introduced by *Spear and Hornberger* [1980]. With GSA, ensembles of model parameters are sampled from distributions, typically with independent uniform or normal distributions for each parameter. The model is then run with many such parameter sets, producing multiple sets of model output. These are used together to generate uncertainty intervals for model predictions. *Spear and Hornberger* [1980] suggest a qualitative criterion to group the generated model parameters into two sets: (1) behavioral sets of model parameters that produce results consistent with the observations, and (2) nonbehavioral sets of model parameters that produce results that contradict the observations. Therefore, they implicitly weighted each model parameter set by giving nonbehavioral sets a probability of zero and all behavioral sets an equal nonzero probability.
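The GSA procedure described above can be sketched in a few lines. The model, parameter ranges, and behavioral cutoff below are purely illustrative assumptions (a toy linear model with a root-mean-square-error criterion), not the models or thresholds used by *Spear and Hornberger* [1980] or in this study:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy "model": y = a*x + b, with synthetic observations (an assumption
# for illustration only; GSA applies to any simulation model)
x = np.linspace(0.0, 10.0, 20)
obs = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, x.size)

# Step 1: sample many parameter sets from independent uniform priors
n_sets = 5000
a = rng.uniform(0.0, 5.0, n_sets)
b = rng.uniform(-5.0, 5.0, n_sets)

# Step 2: run the model for each parameter set
sims = a[:, None] * x[None, :] + b[:, None]      # shape (n_sets, n_obs)

# Step 3: classify each set as behavioral or nonbehavioral using a
# qualitative consistency criterion (here: RMSE below an arbitrary cutoff)
rmse = np.sqrt(np.mean((sims - obs) ** 2, axis=1))
behavioral = rmse < 2.0

# GSA's implicit weighting: probability zero for nonbehavioral sets,
# equal nonzero probability for each behavioral set
weights = np.where(behavioral, 1.0, 0.0)
weights /= weights.sum()

print(f"{behavioral.sum()} of {n_sets} parameter sets are behavioral")
```

The binary weighting in the last step is the feature that GLUE generalizes: rather than all-or-nothing weights, each behavioral set receives a weight proportional to its likelihood measure.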

[11] Like GSA, GLUE is based upon Monte Carlo simulation. Parameter sets may be sampled from any probability distribution, with most reported applications sampling from uniform distributions [*Beven*, 2001]. Each parameter set is used to produce model output; the acceptability of each model run is then assessed using a goodness-of-fit criterion which compares the predicted to observed values over some calibration period. The goodness-of-fit function is used to construct what *Beven and Binley* [1992, p. 283] call a likelihood measure. As with GSA, parameter sets that result in goodness-of-fit/likelihood values below a certain threshold are again termed “nonbehavioral” and are discarded. The remaining “behavioral” parameter sets are assigned rescaled likelihood weights that sum to 1, and thus look like probabilities. Clearly Beven, Binley, Freer and others who have advanced this scheme do not trust their likelihood measure to be able to distinguish between realistic (behavioral) and unrealistic (nonbehavioral) parameter sets, and thus impose an independent “behavioral” threshold criterion. If the statistical analysis were correct, it should be able to distinguish between behavioral and nonbehavioral solutions without the imposition of an arbitrary and rigid cutoff. As we will show, a correct statistical analysis does just that in our example.
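A minimal sketch of the GLUE weighting step follows. The Nash-Sutcliffe efficiency used as the informal likelihood measure, the 0.5 threshold, and the toy linear model are all assumptions chosen for illustration; published GLUE applications employ a variety of such measures and cutoffs:

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy linear model and synthetic observations (illustrative assumption)
x = np.linspace(0.0, 10.0, 20)
obs = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, x.size)

# Sample parameter sets from uniform priors and run the model
n_sets = 5000
a = rng.uniform(0.0, 5.0, n_sets)
b = rng.uniform(-5.0, 5.0, n_sets)
sims = a[:, None] * x[None, :] + b[:, None]

# Informal likelihood measure: Nash-Sutcliffe efficiency of each run
sse = np.sum((sims - obs) ** 2, axis=1)
nse = 1.0 - sse / np.sum((obs - obs.mean()) ** 2)

# Runs below an arbitrary threshold are "nonbehavioral" and discarded
threshold = 0.5
behavioral = nse > threshold

# Rescale the likelihood values of the behavioral sets to sum to 1,
# yielding weights that look like probabilities
weights = np.where(behavioral, np.clip(nse, 0.0, None), 0.0)
weights /= weights.sum()
```

Note that both the choice of measure (NSE here) and the cutoff (0.5 here) are subjective, which is precisely the aspect of the procedure examined in this paper.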

[12] To obtain uncertainty intervals around model predictions using these rescaled likelihood weights, the model outputs are ranked so that the rescaled likelihood weights can be used to form a cumulative distribution for the output variable. From that distribution, quantiles are selected to provide uncertainty intervals for the variable of concern. Clearly this computation reflects only uncertainty arising from model parameter uncertainty. Nothing has been done in constructing the intervals to reflect the precision with which the model could reproduce observed values of the modeled variable over the calibration data set. In the quotation above, Beven referred to structural errors. Structural errors (or, equivalently, model errors) describe the inability of even the best model with optimal parameters to exactly reproduce the target output. *Kuczera et al.* [2006] provide a good example highlighting the fact that poorly determined parameters do not necessarily lead to high predictive uncertainty. Instead, they show that predictive uncertainty is often dominated by the model error component. We show that most previous GLUE applications have not handled this important model error component properly.
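The ranking-and-quantile computation can be sketched as follows, again with an assumed toy linear model and NSE-based weights (illustrative choices only). Note that the resulting interval reflects parameter uncertainty alone, with no model error component:

```python
import numpy as np

rng = np.random.default_rng(7)

# Behavioral runs and rescaled likelihood weights from a GLUE-style
# analysis of an assumed toy linear model (as in the sketches above)
x = np.linspace(0.0, 10.0, 20)
obs = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, x.size)
a = rng.uniform(0.0, 5.0, 5000)
b = rng.uniform(-5.0, 5.0, 5000)
sims = a[:, None] * x[None, :] + b[:, None]
nse = 1.0 - np.sum((sims - obs) ** 2, axis=1) / np.sum((obs - obs.mean()) ** 2)
keep = nse > 0.5
w = nse[keep] / nse[keep].sum()

# One prediction at a new input per behavioral parameter set
x_new = 12.0
pred = a[keep] * x_new + b[keep]

# Rank the outputs and form the weighted cumulative distribution
order = np.argsort(pred)
cdf = np.cumsum(w[order])

# Read off quantiles to obtain, e.g., a 90% uncertainty interval
lower = pred[order][np.searchsorted(cdf, 0.05)]
upper = pred[order][np.searchsorted(cdf, 0.95)]
print(f"90% GLUE interval at x = {x_new}: [{lower:.2f}, {upper:.2f}]")
```

The interval is built entirely from the spread of behavioral-model outputs: nothing in the computation accounts for how far even the best run falls from the observations, which is the model error component discussed above.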

[13] Although GLUE is now very popular, it has frequently been criticized for its large computational demands. *Kuczera and Parent* [1998] note that GLUE “may require massive computing resources to characterize a highly dimensioned parameter space.” *Jia and Culver* [2008] report generating 50,000 parameter sets to find 381 acceptable sets (just 0.8%) for their watershed study. As *Kuczera and Parent* [1998, p. 72] explain, use of a simple and uniform prior probability distribution of model parameters over a relatively large region can result in an algorithm that, even after billions of model evaluations, may not have generated even one good solution. Others have noted that it is difficult to determine how great the computational demand will be, because there is no way of determining a priori how many parameter sets will be necessary to adequately characterize the model response surface [*Carrera et al.*, 2005; *Pappenberger et al.*, 2005]. *Mugunthan and Shoemaker* [2006], *Tolson and Shoemaker* [2007], *Blasone et al.* [2008a, 2008b] and others have developed computationally efficient approaches for performing calibration and uncertainty analysis of complex environmental simulation models. Our focus is on the validity of the GLUE statistical computation, not its computational efficiency, though both are serious concerns.

[14] The next section develops a framework for describing model uncertainty so that the appropriate statistical computation for an uncertainty analysis using GLUE can be understood. Section 3 summarizes the various likelihood measures which have been employed in practice and their use of the residual mean square error. Section 4 describes an evaluation of GLUE performance using a simple linear regression model as an example for which exact and correct analytical uncertainty intervals are available. Section 5 summarizes results of our simulation experiments and shows how use of GLUE with a correct likelihood function can lead to meaningful uncertainty and prediction intervals. Section 6 raises questions regarding Beven's recent manifesto [*Beven*, 2006a] and finally, section 7 provides recommendations for future research for improving GLUE applications.