## 1. Introduction

[1] The *Clark et al.* [2011] paper is an opinion piece. While we respect the authors' opinions and agree with much that is in their paper and what they have published elsewhere [e.g., *Clark and Kavetski*, 2010; *Kavetski and Clark*, 2010], we feel we have to comment on their approach to hypothesis testing, particularly on their characterization of the Generalized Likelihood Uncertainty Estimation (GLUE) methodology. Having had discussions with all the authors in the past, we find their presentation very surprising, not least because the type of Bayesian methodology of hypothesis testing that they propose can be (and has been) incorporated into GLUE as a special case where strong assumptions about the nature of the modeling errors are justified [see *Romanowicz et al.*, 1994, 1996; *Beven et al.*, 2008]. GLUE is general in that sense.

[2] However, in this comment we will consider the authors' discussion of the differences between a formal Bayesian statistical approach and a GLUE rejectionist approach to hypothesis testing. We address the nature of aleatory and epistemic errors in hydrological models, how they are treated in GLUE and formal Bayesian inference, and why we (still) advocate GLUE with its implicit handling of model residuals over formal Bayesian inference for hypothesis testing.

[3] Clark et al. suggest that the rejection criteria in GLUE must necessarily be subjective. That implies that the Bayesian approach that they advocate is a more objective method for hypothesis testing. But how objective is a formal Bayesian statistical methodology? The method depends on the definition of a likelihood function, normally a representation of the structure of the residuals. The analysis can then be objective in the sense that the assumptions made in defining a likelihood function can be verified by a posterior analysis of the model residuals (as strongly recommended by *Clark et al.* [2011]). This should be considered good practice, but such checks are not always reported in papers based on formal Bayesian inference, and there are some where it is clear that the wrong likelihood function has been used [e.g., *Thiemann et al.*, 2001; *Feyen et al.*, 2007]. It is also difficult to do rigorously. When comparing models as hypotheses by making many different model runs, there is no guarantee that the same likelihood function will be appropriate for all model structures (or even for different parameter sets within the same model), because of the way different model components might be activated (and interact with input errors) for different structures or parameter settings.

[4] In fact, a Bayesian statistical approach will never actually reject a hypothesis. Within this framework, models as hypotheses are compared in terms of likelihoods or Bayes ratios (or subjective posterior criteria such as visual comparisons of observed and predicted variables, but remember that the argument is that the framework is objective). Bayes ratios might suggest that one model is better (in terms of the specified likelihood function) than another, but the choice will not be unique with respect to a change of likelihood function nor is there any guarantee that the best model will be an adequate predictor.

[5] There are other reasons for questioning the objectivity of the Bayesian approach. As the authors recognize, hydrological modeling involves both aleatory and epistemic errors. Bayesian inference, however, requires that all errors (after allowing for bias, correlation, heteroscedasticity, etc.) be treated *as if* they are aleatory, with asymptotically stationary characteristics. Apparent nonstationarity can be used as evidence for how to modify a model [e.g., *Bulygina and Gupta*, 2009, 2010] but might also stem directly from the epistemic nature of some errors. Think about estimates of catchment rainfall when the rain gauge network is sparse (as is usually the case), or extrapolations to overbank flows well beyond the measured discharges, or uncertain or nonstationary subsurface catchment boundaries. One source of epistemic error that is commonly ignored is the nonstationarity of rating curves in estimating the discharge observations with which model outputs will be compared [e.g., *McMillan et al.*, 2010; *Westerberg et al.*, 2011a]. Epistemic errors will vary from event to event with changing or inconsistent characteristics in space or time, so that they are very likely to lead to nonstationary error characteristics, especially when, as for epistemic input errors, they are processed through a nonlinear model. In some cases, the data used to force and evaluate a model might be physically inconsistent and introduce disinformation into the inference process [see *Beven et al.*, 2011; *Beven and Westerberg*, 2011; *Beven*, 2012a]. In such cases, even the water balance equation might be difficult to justify as a hypothesis, including within a Bayesian statistical framework. Indeed, for some purposes, such as real-time forecasting, it might not even be an informative hypothesis on which to base a model if the observed inputs are not representative. It is then much better to use data assimilation to compensate for model structure and input errors. This should not, however, be an option in a simulation for the hypothesis testing under discussion here.

[6] The effect of treating epistemic errors *as if* they are aleatory within a formal Bayesian statistical framework is to overestimate the information content of the model residuals. This is the result of defining a formal likelihood function based on the belief that, after allowing for bias, correlation, heteroscedasticity, etc., the residuals come from independent distributions, often considered identical and Gaussian. With these Gaussian assumptions, the dominant term in the likelihood is typically the error variance raised to the power *N*/2, where *N* is the number of observations. Where the observations are a time series, *N* can be a large number. The result is that the likelihood surface is stretched to the extent that two models with very similar root-mean-square error over a long series of calibration time steps may have likelihoods that differ by orders of magnitude (and even hundreds of orders of magnitude). The stretching effect is reduced by allowing for bias, autocorrelation, and heteroscedasticity but will still forcefully differentiate model hypotheses that might be rather similar in performance. If the residuals are not aleatory (and they are not for rainfall-runoff models), this is a form of overfitting, where peaks in the likelihood surface will be sensitive to different realizations of the epistemic errors.

[7] Paradoxically, taking account of input error actually makes this problem worse within the Bayesian statistical framework. Any attempt to identify input errors as part of the inference will also involve interaction with model structural error (see the discussions by *Beven et al.* [2008] and *Beven* [2010, 2012a]). The potential for compensation between model structure and input errors in calibration will, in general, result in better fits to the observational data and more constrained parameter distributions. This can be seen, for example, in the results of *Vrugt et al.* [2008]. Allowing for such compensation does not seem to constitute better hypothesis testing if there is no independent evidence to justify such modifications to the input. In particular, it does not guarantee improved performance in prediction when the sequence of epistemic errors in the input sequence is unknown and might be quite different from the past.

[8] In fact, there should be an *expectation* that an aleatory model with constant parameters will not adequately describe model errors. To assume that it does is itself a (subjective) choice, and a formal Bayesian approach is therefore not free from subjectivity. Bayes himself seems to have been quite happy about using subjective judgments in defining odds in his original formulation [see, e.g., *Howson and Urbach*, 1993]. Subjectivity is a word that carries a lot of shorthand baggage in doing “science” that is not necessarily justified in our type of inexact science. If subjective judgments, particularly in the face of epistemic uncertainties, allow a thoughtful use of common sense (or engineering heuristics) [see *Koen*, 2003] for dealing with these very difficult problems, then it appears to us to be no worse than using a likelihood function with an inappropriate assumption of stationarity of residual characteristics and a strange, even exotic, stretching of the likelihood response surface (even if this stretching follows formally from the strong statistical assumptions made above). Hydrological common sense suggests that it will then be misleading to use a formal Bayesian statistical approach when there are important epistemic sources of error (a position justified by the results presented in the work of *Beven et al.* [2008]).

[9] So what about GLUE? Clark et al. question whether GLUE can provide any insight into how to separate behavioral from nonbehavioral models, and suggest that it consequently does not properly expose hypotheses to scrutiny. They also suggest that the likelihood measures used in GLUE largely mimic the standard specification of error models used within statistical inference. In summary, they suggest that: “The GLUE extensions themselves still do not address the key issue of isolating constituent model hypotheses and subjecting them to independent scrutiny; any new techniques for model decomposition, evaluation, and improvement must be developed separately from GLUE and, as such, could be applied in other, more theoretically grounded, inference frameworks. Moreover, to the extent that the GLUE likelihood function components and rejection thresholds are not subjected to scrutiny and improvement (e.g., as required within a more formal application of Bayesian principles), it is our opinion that the GLUE approach does not adequately address the quest for rigorous evaluation of hydrological hypotheses.” [*Clark et al.*, 2011, p. 5]

[10] These statements paint a picture of GLUE that we do not recognize. GLUE cannot solve the problems of dealing with epistemic error, but it does allow a more common sense approach to model evaluation and hypothesis testing. In the recent use of limits of acceptability methods within GLUE (mostly ignored by the authors in their comments), it can also provide a formal framework that allows testing against individual observations rather than some integral statistical likelihood function [*Beven*, 2006; *Blazkova and Beven*, 2009; *Liu et al.*, 2009; *Winsemius et al.*, 2009; *Krueger et al.*, 2010; *Westerberg et al.*, 2011b] (see also earlier usage of fuzzy measures with rejection limits by *Freer et al.* [2004], *Pappenberger and Beven* [2004], *Pappenberger et al.* [2007], and *Page et al.* [2007]). This includes testing isolated constituent model hypotheses for rejection when the observations are adequate to allow this (see, e.g., the discussions of *Beven* [2006, 2010, 2012a], *Gupta et al.* [2008], and *McMillan et al.* [2011]).

[11] What do we actually need in a hydrological model as a working hypothesis? We want a tool that will be useful in simulation or prediction and that reflects our qualitative perceptual knowledge of real world processes [see *Beven*, 2012b, ch. 1]. The only way of assessing a model in terms of whether it will be useful in prediction is to actually test its consistency with observations in a calibration period or on some surrogate catchment. We should not expect a model that is not consistent with calibration data to be useful, so it should be rejected (assuming we have some belief in the calibration data that are available). Indeed, there may be cases where it is justified to reject all of the models tried [e.g., *Freer et al.*, 2003; *Page et al.*, 2007; *Dean et al.*, 2009; *Choi and Beven*, 2007; *Mitchell et al.*, 2009]. Note that rejection, properly justified, is a *good thing*. It forces a reconsideration of what is causing the failure, which might be either the model hypotheses or disinformative data. It is not really clear to us why the authors argue against a rejectionist framework, especially when they cite Popper in support of more rigorous hypothesis testing. Popper allowed for varying degrees of verisimilitude in matching observations, but also stressed the role of critical observations in hypothesis testing. A “limits of acceptability” approach allows this focus on critical observations when appropriate in testing model hypotheses.

[12] What we must then be careful to avoid is making a type II error under a null hypothesis that the model is correct; that is, rejecting models that might be useful in prediction simply because of errors in the input data and evaluation observations [*Beven*, 2010, 2012a]. The stretching of the likelihood surface in Bayesian statistical methods, and the consequent sensitivity to different realizations of epistemic errors, will increase the possibility of making type II errors because of the consequent overconditioning. It gives the pretense of being able to distinguish rather strongly between different models, but it is misleading if models that would be useful in prediction are discarded relative to others with higher Bayes ratios (though again, the decision to reject is a relative, not an absolute, choice in a Bayesian statistical methodology).

[13] In GLUE we can aim to set limits of acceptability prior to running a model based on what we know about the uncertainty associated with the available observations in a way that is not based on any particular model run. This can be done down to the level of individual observations and for different types of observations (though in the past this was usually done by thresholding some global performance measure, which is what the authors are actually criticizing). The limits should reflect observational error and the commensurability of measured and predicted variables. It is a way of recognizing the nonideal nature of such modeling problems [*Beven*, 2006, 2010, 2012a]. The justification of those limits must be recorded explicitly and is therefore open to full independent scrutiny, including common sense limits for “soft” data that can only be set subjectively [e.g., *Seibert and McDonnell*, 2002; *Winsemius et al.*, 2009]. There is then a formal audit trail independent of any particular model. Hydrological modelers are not used to thinking in rejectionist terms, but that does not mean this is not a formal theoretical framework for model evaluation.

[14] All models as hypotheses can then be considered relative to those limits of acceptability. Those that fail are given a likelihood (as a measure of belief) of zero. Those that pass can be assigned a likelihood proportionate to their performance. Posterior analysis of which measures cause failure can be carried out for individual models or hypotheses. *Freer et al.* [2004], for example, defined limits based on both output and internal state variables with a significant discussion of posterior analysis of the model failures. In the work of *Blazkova and Beven* [2009], failures relative to some 70 different measures from several different types of observations were considered. These provided a guide to posterior analysis. *Westerberg et al.* [2011b] used posterior analysis of different aspects of the simulated hydrographs to assess model performance in detail and to identify periods likely dominated by model structural errors or by data errors. This type of approach is consistent with Bayes' original use of odds in assessing hypotheses but is not limited to combining those odds multiplicatively in reflecting belief in a model conditional on the evidence. Multiplicative combinations of likelihoods will produce zero (model rejection) if any one of the individual likelihood measures is zero. That may be desired, but might also lead to rejection on a measure that does not have great significance to the outputs.
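An evaluation of this kind is simple to sketch. The following minimal illustration uses invented observations and limits (a triangular score within the limits is just one of many possible choices), and records which observations cause a rejection, as needed for the posterior analysis described above:

```python
import numpy as np

# Hypothetical limits-of-acceptability evaluation (all values invented).
obs   = np.array([3.0, 5.5, 9.0, 6.0, 4.0])   # observed discharges
lower = obs * 0.8                              # limits set from observational
upper = obs * 1.2                              # uncertainty, prior to any model run

def evaluate(sim):
    """Zero likelihood if any observation's limits are violated; otherwise
    a score in (0, 1] from a triangular weight that is 1 at the observation
    and falls to 0 at the limits. Also returns the failing observation indices."""
    inside = (sim >= lower) & (sim <= upper)
    if not inside.all():
        return 0.0, np.where(~inside)[0]
    half_width = np.where(sim >= obs, upper - obs, obs - lower)
    weights = 1.0 - np.abs(sim - obs) / half_width
    return weights.mean(), np.array([], dtype=int)

sim_ok  = obs * np.array([1.05, 0.95, 1.10, 1.00, 0.90])
sim_bad = obs * np.array([1.05, 0.95, 1.30, 1.00, 0.90])  # overshoots the peak

score_ok,  failed_ok  = evaluate(sim_ok)
score_bad, failed_bad = evaluate(sim_bad)
print(score_ok, failed_ok)    # behavioural: positive score, no failures
print(score_bad, failed_bad)  # rejected: score 0.0, fails at observation 2
```

Because a model is rejected as soon as any single limit is violated, the record of failing observations is what feeds the posterior analysis of which measures caused the failure.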

[15] *Clark et al.* [2011] repeat some common misinterpretations of the results of a GLUE analysis. The GLUE uncertainty estimates are not trying to mimic a more formal statistical analysis as the authors suggest. They are also not subsuming all modeling uncertainties into parameter uncertainties as the authors and others have suggested. One important difference is that applications of GLUE reveal that marginal parameter distributions are given too much significance in formal statistical inference. Certainly, the marginal distributions can be examined (as can the covariation of parameters in producing behavioral simulations, although the form of that covariation can vary dramatically through the parameter space for some model structures), but it is the time series of model outputs, arising from often complex interactions between the values of the model parameters, that is either behavioral or not. In prediction, the model outputs are weighted by the likelihood in producing a predictive distribution. The residuals associated with a particular model prediction are not neglected, but they are usually considered implicitly, with an assumption that their characteristics in calibration will be similar in prediction. This is analogous to assuming that a statistical error model derived from calibration residuals will be similar in prediction. In neither case is there a guarantee that the assumption will hold in the face of new epistemic errors in prediction but, of course, we have no knowledge of what form those errors might take.
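The likelihood weighting of behavioral model outputs into a predictive distribution can be sketched as follows (a minimal, hypothetical example with invented simulations and likelihoods; the weighted quantile is one common construction for GLUE prediction bounds):

```python
import numpy as np

def weighted_quantile(values, weights, q):
    """Quantile q of the discrete distribution of simulated values,
    with likelihood weights assumed to sum to 1."""
    order = np.argsort(values)
    cdf = np.cumsum(weights[order])
    idx = min(np.searchsorted(cdf, q), len(values) - 1)
    return values[order][idx]

# rows: behavioural models, columns: prediction time steps (invented values)
sims = np.array([[4.1, 6.0, 9.2],
                 [3.8, 5.5, 8.7],
                 [4.4, 6.3, 9.9],
                 [4.0, 5.8, 9.0]])
lik = np.array([0.7, 0.5, 0.2, 0.6])   # likelihoods of the behavioural models
w = lik / lik.sum()                    # rescale to unit sum

bounds = [(weighted_quantile(sims[:, t], w, 0.05),
           weighted_quantile(sims[:, t], w, 0.95))
          for t in range(sims.shape[1])]
for t, (lo, hi) in enumerate(bounds):
    print(f"t={t}: 5%-95% prediction bounds [{lo:.1f}, {hi:.1f}]")
```

Note that the bounds here express belief over the set of behavioral simulations conditioned on the evaluation data, not a formal statistical error model for the residuals.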

[16] In calibration, the predictive distribution over all behavioral models should be consistent with all the limits of acceptability as derived from the observational errors. That does not mean that the simulations will bracket the observations as is expected in a statistical model. They may exhibit nonstationary bias and skew while still being consistent in this sense. Posterior analysis of residuals, including looking for such nonstationarities in residual characteristics, should be performed and may be a guide to model improvements or a reason for querying the usefulness of particular periods of calibration data. This type of posterior analysis has been performed in a number of papers including *Blazkova and Beven* [2009], *Krueger et al.* [2010], and *Westerberg et al.* [2011b].

[17] These types of nonstationarities are themselves an indicator of epistemic error, whether that stems predominantly from the data used or from the model hypotheses. It is also a warning that we should not expect identical epistemic uncertainties in prediction. New event characteristics or modes of model functioning might occur in prediction that have not been seen before in calibration. Success in calibration is no guarantee of success in prediction (which is one reason for the common reduction in performance of hydrological models in “validation” periods relative to calibration). Neither Bayesian statistical methods nor GLUE can guard against changing epistemic errors in hydrological prediction. Thus, the issues raised here are not going to be resolved easily (and will probably not be resolved until we can reduce observational uncertainties dramatically by greatly improved observational techniques). We should thus be wary of the possibility of type II errors by overconditioning. When we did include formal Bayesian statistical likelihoods in GLUE, this was a significant issue [e.g., *Romanowicz et al.*, 1994, 1996] and was a reason why we have continued to seek alternative methods.

[18] In fact, there is not that much difference between Clark et al. and ourselves. What needs to be resolved, in the case of both Bayesian statistical and GLUE methodologies, is what constitutes (an) appropriate likelihood measure(s) that properly reflect(s) the information content in a set of input and evaluation data, allowing, in some way, for disinformation and other epistemic errors of unknown characteristics in both the input and the evaluation data. One potential approach to this, given our inability to constrain critical observational uncertainties (i.e., rainfall), may be to benchmark what might be our expected uncertainties and to share and understand these within our community [see *McMillan et al.*, 2012]. Statistical likelihood functions with their strong assumptions are not an adequate reflection of real information content in the face of epistemic errors, while allowing for input error effects on the GLUE limits of acceptability is also less than satisfactory. We are happy to accept a difference of opinion with Clark et al. as to what might be the best approach, but we would hope that in the future they will be more circumspect about the problems of applying formal Bayes likelihoods to nonideal problems and give a more reasoned account of GLUE and what it is trying to achieve.

[19] The problems of hypothesis testing in these nonideal cases are general. They are not specific to any particular uncertainty estimation methodology. Thus, one way of resolving these issues would be to carry out comparative studies, but there is a difficulty in doing so. As was the case with the hypothetical cases of *Mantovan and Todini* [2006], formal Bayes methods work well for hypothetical simulations with only aleatory uncertainties, although, as shown by *Beven et al.* [2008], even small departures from the correct assumptions can lead to biased inference. Then as soon as a different model structure is used from that used to construct the data set, there is no “correct answer” for that model. This is the simplest form of epistemic uncertainty that actually occurs whenever we try to simulate the “true” response, i.e., the catchment itself, with a model approximation. Further epistemic (and/or aleatory) uncertainty can then be added with respect to the inputs and evaluation data. It is easy to envisage such a staged series of experiments, but more difficult to construct realistic realizations of the epistemic uncertainties. For a true test, those taking part would need to be ignorant of the nature of the uncertainties that have been introduced. We would be happy to work with Clark et al. to see whether such an experiment can be developed for the benefit of the modeling community.