## 1. Introduction

[2] Given the significance of water in terrestrial and aquatic ecosystems, hydrological models are an integral part of virtually all environmental models formulated at the catchment scale. This paper focuses on conceptual rainfall-runoff (CRR) models, which aim to capture the dominant catchment dynamics while remaining parsimonious and computationally efficient. However, their parameters are not directly measurable and must be inferred (“calibrated”) from the observed data.

[3] Characterizing the uncertainty in runoff predicted by a CRR model has attracted the attention of hydrologists over many years [*Beven and Binley*, 1992]. Yet recent reviews of CRR model calibration, for example, *Kuczera and Franks* [2002], *Kavetski et al.* [2002, 2006a, 2006b], *Vrugt et al.* [2005], and *Wagener and Gupta* [2005], note the lack of a robust framework that accounts for all sources of error (input, model structural, and output errors).

[4] The lack of a robust calibration framework raises three problems in CRR modeling: (1) quantifying the predictive uncertainty in runoff and other model outputs remains problematic; (2) the regionalization of CRR parameters continues to be confounded by biases in the calibrated parameters and unreliable assessment of parameter uncertainty; and (3) discriminating between competing CRR model hypotheses is difficult because the precise causes of poor model performance are unclear.

[5] In the quest for a more robust and comprehensive calibration and uncertainty estimation methodology, *Kavetski et al.* [2002, 2006a, 2006b] and *Kuczera et al.* [2006] developed the Bayesian total error analysis (BATEA) framework. Its core ideas are as follows: (1) specify explicit probability models for each source of uncertainty (input, output, and model structural errors); (2) where necessary, use hierarchical techniques to implement these probability models within a Bayesian inference framework; (3) where available, include a priori information about the catchment behavior and data uncertainty; (4) jointly infer the parameters of the CRR model and any unknown parameters of the error models; and (5) examine posterior diagnostics to check the assumptions made in step 1. BATEA allows, and indeed requires, modelers to explicitly hypothesize, infer, and evaluate assumptions regarding each source of uncertainty, and it generates model predictions that account for all uncertainties included in the analysis.

[6] Earlier BATEA studies focused on the derivation of the posterior distribution given specific CRR and uncertainty models [*Kavetski et al.*, 2002, 2006a, 2006b]. Since the CRR model represents hypotheses describing hydrological dynamics and the uncertainty models represent hypotheses regarding the uncertainty in the calibration data, it is critical to evaluate these assumptions a posteriori and identify those that do not stand up to empirical scrutiny.

[7] The objective of this study is to compare and scrutinize the assumptions made in traditional least squares and BATEA calibrations. Specifically, the paper investigates the ability of these methods to provide (1) reliable quantification of predictive uncertainty and (2) consistent parameter estimation. The evaluation of competing CRR model hypotheses depends on first achieving these two goals and will be undertaken in future work.

[8] The empirical assessment is based on a challenging case study of a catchment with markedly ephemeral hydrological dynamics and strong rainfall gradients. The quantification of predictive uncertainty is scrutinized by systematically assessing the credibility of the hypotheses underpinning four different calibration/prediction approaches, including two traditional least squares-based methods and two BATEA-based methods. The consistency of parameter estimates obtained by each calibration method is scrutinized by calibrating the same CRR model to different rainfall gauges and time periods.

[9] Of particular note is the application of a quantile-based diagnostic that directly evaluates whether the predictive distribution is consistent with the observed time series. This type of analysis, originally proposed in probabilistic forecasting [*Laio and Tamea*, 2007], is more comprehensive than traditional evaluation statistics such as the Nash-Sutcliffe index [*Nash and Sutcliffe*, 1970], which do not evaluate whether the predictive uncertainty is consistent with the observed data.
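To make the contrast concrete, the following minimal Python sketch compares the Nash-Sutcliffe index with one common way of building a quantile-based check, the probability integral transform (PIT). It is an illustration under stated assumptions, not the diagnostic implementation used in this paper: the data are synthetic, the predictive distributions are represented by Gaussian ensembles, and the function names (`nash_sutcliffe`, `pit_values`, `uniform_distance`) are hypothetical. If a predictive distribution is consistent with the observations, the PIT values should be approximately uniform on [0, 1].

```python
import random


def nash_sutcliffe(obs, sim):
    """Nash-Sutcliffe index: 1 - SSE / (sum of squares about the observed mean)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    sst = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / sst


def pit_values(obs, ensembles):
    """For each time step, the fraction of predictive samples at or below the observation."""
    return [
        sum(1 for x in ens if x <= o) / len(ens)
        for o, ens in zip(obs, ensembles)
    ]


def uniform_distance(pit):
    """Largest gap between the empirical CDF of the PIT values and the uniform CDF."""
    p = sorted(pit)
    n = len(p)
    return max(max((i + 1) / n - v, v - i / n) for i, v in enumerate(p))


# Synthetic example (assumption): observations scatter around a known
# central prediction with unit standard deviation.
rng = random.Random(1)
n_times, n_samples = 300, 400
centre = [5.0 + 3.0 * rng.random() for _ in range(n_times)]
obs = [rng.gauss(c, 1.0) for c in centre]


def make_ensembles(spread):
    """Gaussian predictive ensembles centred on the same central prediction."""
    return [[rng.gauss(c, spread) for _ in range(n_samples)] for c in centre]


ens_ok = make_ensembles(1.0)      # spread matches the true scatter
ens_narrow = make_ensembles(0.3)  # overconfident: spread far too small

pit_ok = pit_values(obs, ens_ok)
pit_narrow = pit_values(obs, ens_narrow)

# Both ensembles share the same central prediction, so a point-based score
# cannot tell them apart; the PIT-based distance can.
print("Nash-Sutcliffe of central prediction:", nash_sutcliffe(obs, centre))
print("PIT distance, matching spread:", uniform_distance(pit_ok))
print("PIT distance, narrow spread:  ", uniform_distance(pit_narrow))
```

In this sketch the Nash-Sutcliffe index is computed from the central prediction alone and is therefore identical for both ensembles, whereas the PIT-based distance is larger for the overconfident ensemble because too many observations fall in its extreme tails.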

[10] This paper is structured as follows. Section 2 outlines the BATEA framework, including definitions of the error models. Section 3 describes the case study, including the catchment characteristics and the GR4J CRR model [*Perrin et al.*, 2003]. Section 4 outlines the calibration frameworks used in this paper: two traditional methods (standard least squares (SLS) and weighted least squares (WLS) schemes) and two BATEA methods (differing in the assumed error models). Section 5 applies posterior diagnostics to check the adequacy of the predictive distributions, while section 6 checks the consistency of the parameter inference. Section 7 discusses avenues for further improving the characterization of predictive uncertainty, while section 8 discusses the potential of BATEA for model extrapolation and regionalization. Section 9 outlines future applications of BATEA to other types of hydrological models and catchments. The main conclusions are summarized in section 10.