## 1. Introduction

[2] Improved availability and coverage of spatial data have driven developments in distributed, process-based catchment modeling; however, despite the correspondence between modeled and observed processes, it is not usually possible to determine model parameter values directly from field measurements. Instead, the values required are those of the “effective parameters” which represent integrated behavior at the model element scale. These values must be determined through some calibration method. As has been extensively discussed by *Beven* [1993, 2006], *Beven and Binley* [1992], *Wagener and Gupta* [2005] and others, the many sources of uncertainty in a hydrological model application lead to equifinality of parameter sets in providing acceptable model performance with reference to some observed data. These uncertainty sources may include, but are not limited to, input data uncertainty, initial condition uncertainty, model structural error and observed data uncertainty [*Liu and Gupta*, 2007]. Indeed, since it is certain that our hydrological model does not fully represent the complexity of the natural catchment and is therefore “wrong,” we must expect that any calibration technique is a process of identifying some subset of model parameterizations which produce reasonable approximations to some aspects of true catchment behavior under some circumstances.

[3] The aim of a calibration technique should therefore be to enable an efficient search of the parameter space, identifying those regions where model performance is considered satisfactory. The task is made more difficult by the typically complex nature of the model response surface [*Duan et al.*, 1992; *Sorooshian et al.*, 1993] which may be exacerbated by artifacts of model time step and solution techniques [*Kavetski et al.*, 2006a, 2006b]. Difficulties encountered may include multiple local optima in multiple regions of attraction, discontinuous derivatives, parameter interaction and flat areas [*Duan et al.*, 1992]. The nature of these surfaces prohibits standard search mechanisms such as simplex- and Newton-type schemes. Alternative methods such as uniform random sampling suffer from a lack of sampling efficiency and can be extremely costly in terms of model evaluations. They also typically specify the sample space using minimum and maximum values for each parameter, usually on the basis of expert judgment, physical interpretation of the parameter and previous model use. However, with good model performance often occurring up to the boundary of the sample region, this technique may unjustifiably restrict the search.
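The inefficiency of uniform random sampling noted above can be made concrete with a minimal sketch. The parameter names and bounds below are purely hypothetical illustrations of the expert-specified minimum/maximum ranges discussed in the text; every sampled parameter set costs one model evaluation, so coverage of a high-dimensional space degrades rapidly.

```python
import random

def uniform_sample(bounds, n_samples, seed=0):
    """Draw parameter sets uniformly within expert-specified [min, max] bounds.

    Each parameter set costs one model evaluation; because samples are not
    targeted toward good-performance regions, most evaluations are wasted
    as the number of parameters grows.
    """
    rng = random.Random(seed)
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        for _ in range(n_samples)
    ]

# Hypothetical bounds for two conceptual model parameters
bounds = {"storage_capacity": (10.0, 500.0), "recession_k": (0.01, 0.99)}
samples = uniform_sample(bounds, n_samples=1000)
```

Note that any parameterization lying outside these a priori bounds can never be sampled, which is exactly the unjustified restriction of the search described above.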

[4] In recent years, Markov chain Monte Carlo (MCMC) methods have gained increasing popularity, in particular the Metropolis-Hastings (MH) algorithm [e.g., *Chib and Greenberg*, 1995]. These methods enable simulation of complex multivariate distributions by casting them as the invariant distribution of a Markov chain. By finding an appropriate transition kernel which converges to this distribution, samples with the desired posterior distribution can be drawn from the Markov chain. A popular version of the MH algorithm is the adaptive SCEM-UA algorithm [*Vrugt et al.*, 2003] which combines the MH sampler with the SCE-UA optimization method [*Duan et al.*, 1992], using information exchange between multiple sampler chains to improve convergence rates.
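The core of the MH algorithm can be illustrated with a simple random-walk sampler on a one-dimensional toy posterior (a standard normal here, purely for illustration; this is a minimal sketch, not the SCEM-UA scheme, which additionally runs multiple chains with information exchange):

```python
import math
import random

def metropolis_hastings(log_post, theta0, n_iter, step=0.5, seed=0):
    """Random-walk Metropolis sampler.

    Proposes a Gaussian perturbation of the current state and accepts it
    with probability min(1, p(theta')/p(theta)); the resulting Markov
    chain has the target posterior as its invariant distribution.
    """
    rng = random.Random(seed)
    theta, lp = theta0, log_post(theta0)
    chain = []
    for _ in range(n_iter):
        prop = theta + rng.gauss(0.0, step)
        lp_prop = log_post(prop)
        # MH acceptance rule (log form to avoid underflow)
        if math.log(rng.random()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain.append(theta)
    return chain

# Toy target: log-density of a standard normal (up to a constant)
chain = metropolis_hastings(lambda t: -0.5 * t * t, theta0=0.0, n_iter=20000)
```

In a calibration setting, `log_post` would evaluate the (formal or informal) likelihood of the hydrological model for a proposed parameter set.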

[5] All search techniques require a definition of the model response surface to be searched: this is usually couched in terms of “probability of model correctness given observed data” and is assessed via a likelihood measure. The debate continues on the relative advantages of the informal likelihood measures used in the GLUE framework compared with parameter estimation via formal statistical likelihood estimation [e.g., *Mantovan and Todini*, 2006; *Beven et al.*, 2007; *Mantovan et al.*, 2007; *Thiemann et al.*, 2001; *Beven*, 2003; *Gupta et al.*, 2003; *Clarke*, 1994]. If statistical likelihood theory is to be used, the error model between the model-predicted and observed variables must be specified exactly; this specification may include information on heteroscedasticity and autocorrelation [e.g., *Sorooshian*, 1981; *Sorooshian and Dracup*, 1980] and may rely on hierarchical error models [*Kuczera et al.*, 2006]. Under GLUE, the concept of a true model (and error model) against which to compare observations is rejected and it is accepted that many interacting sources of error, without well-defined formulations, combine to give total model error. Models are instead judged against informal likelihood measures, chosen by the hydrologist, which represent their expert perception of model performance in prediction of observed data [*Beven*, 2006].
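The contrast between the two classes of measure can be sketched in a few lines. The formal example below assumes independent Gaussian errors with known standard deviation (deliberately ignoring the heteroscedasticity and autocorrelation discussed above), while the informal example is the familiar Nash-Sutcliffe efficiency; both are simplified illustrations, not the specific formulations used later in this paper:

```python
import math

def gaussian_log_likelihood(obs, sim, sigma):
    """Formal likelihood: assumes independent Gaussian errors with known
    standard deviation sigma (a strong, and often violated, assumption)."""
    n = len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    return -0.5 * n * math.log(2 * math.pi * sigma ** 2) - sse / (2 * sigma ** 2)

def nash_sutcliffe(obs, sim):
    """Informal measure: 1 - SSE / variance of observations.
    1 is a perfect fit, 0 matches the observed mean, negative is worse."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    return 1.0 - sse / ss_tot
```

The formal measure is a probability statement that is only valid if the error model holds; the informal measure makes no such claim and is instead interpreted through the modeler's judgment of what constitutes acceptable performance.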

[6] Although MCMC methods have traditionally used formal likelihood measures to define the response surface [e.g., *Arhonditsis et al.*, 2008; *Marshall et al.*, 2004; *Vrugt et al.*, 2003, 2006; *Thiemann et al.*, 2001], it is also possible to use informal likelihoods [e.g., *Engeland and Gottschalk*, 2002; *Blasone et al.*, 2008; *Vrugt et al.*, 2008]. When informal likelihoods are used, the main difference between MCMC methods and GLUE is that MCMC methods provide targeted sampling of the parameter space. *Blasone et al.* [2008] compared performance of the informal likelihoods in the SCEM-UA method with the traditional GLUE method and demonstrated that the targeted sampling resulted in better predictions of the model output (and that the uncertainty limits were less sensitive to the number of retained solutions). *Vrugt et al.* [2008] compared a formal Bayesian approach that attempts to explicitly quantify the individual sources of uncertainty in the hydrological modeling process with the traditional GLUE method that maps all sources of uncertainty onto the parameter space. They showed that while the estimates of total uncertainty were similar in both methods, the GLUE method produced large estimates of parameter uncertainty which can lead to erroneous conclusions on the identifiability of model parameters.

[7] The formal Bayesian approaches for explicitly quantifying the individual sources of uncertainty suffer from two important limitations. First, as formulated by *Vrugt et al.* [2008] and *Kavetski et al.* [2006a, 2006b], the formal Bayesian methods require solving a high-dimensional optimization problem (i.e., separate multipliers for each storm); a problem that is intractable for distributed hydrological models where it is necessary to quantify uncertainty in the spatial pattern of precipitation events. Second, current methods for quantifying error in model structure are poorly developed; indeed, *Vrugt et al.* [2008] and *Kavetski et al.* [2006a, 2006b] essentially combine error in model inputs and model structure into a single error term. Informal likelihood measures therefore remain an attractive option.

[8] This paper considers the calibration of a distributed rainfall-runoff model (described in section 2.2) in an interesting case study catchment, the Rangitaiki in New Zealand (described in section 2.1). In the Rangitaiki catchment, heterogeneous geology leads to a difficult and high-dimensional calibration problem, where the response surface has multiple optima and strong parameter interactions. These characteristics render the problem unsuitable for solution by uniform Monte Carlo sampling (as per standard GLUE) and require a more targeted sampling strategy. In response to the challenging problem of model calibration for the Rangitaiki, this paper focuses on three objectives:

[9] 1. To compare two strategies for model calibration using MCMC methods (in this case the SCEM-UA algorithm). The first strategy (section 3.1) uses a “formal” likelihood function based on strict assumptions about the error structure; the second strategy (section 3.2) uses an “informal” likelihood based on the modeler's judgment. The two approaches are assessed in terms of their success in achieving full coverage of the response surface.

[10] 2. To investigate an extension to the standard “informal” likelihoods based on the sum of squared errors (e.g., Nash-Sutcliffe efficiency), incorporating the timing error of the simulated hydrograph into the calibration objective function.

[11] 3. To test a “spatially informed” approach to MCMC calibration of a distributed rainfall-runoff model to overcome the difficulties of a multimodal parameter distribution caused by the heterogeneous geology of the catchment. Flow data from subcatchments are used to independently verify model success at reproducing the hydrological response.
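One way to read objective 2 is as a combination of a sum-of-squared-errors measure with an explicit timing term. The sketch below estimates the timing error as the time shift that best aligns the simulated and observed hydrographs, then subtracts a lag penalty from the Nash-Sutcliffe efficiency; the lag-estimation method and the penalty weight are illustrative assumptions, not the formulation developed later in the paper:

```python
def best_lag(obs, sim, max_lag=5):
    """Estimate hydrograph timing error as the shift (in time steps) that
    minimizes the mean squared error between observed and shifted simulated
    flow; a positive lag means the simulated hydrograph peaks too late."""
    def mse_at(lag):
        pairs = [(obs[t], sim[t + lag]) for t in range(len(obs))
                 if 0 <= t + lag < len(sim)]
        return sum((o - s) ** 2 for o, s in pairs) / len(pairs)
    return min(range(-max_lag, max_lag + 1), key=mse_at)

def timing_aware_objective(obs, sim, max_lag=5, weight=0.05):
    """Illustrative combined objective: Nash-Sutcliffe efficiency minus a
    penalty proportional to the estimated timing error (weight is an
    assumed tuning constant)."""
    mean_obs = sum(obs) / len(obs)
    sse = sum((o - s) ** 2 for o, s in zip(obs, sim))
    ss_tot = sum((o - mean_obs) ** 2 for o in obs)
    nse = 1.0 - sse / ss_tot
    return nse - weight * abs(best_lag(obs, sim, max_lag))
```

A simulation that reproduces flow magnitudes well but shifts the flood peak by several time steps would score the same NSE as one with correct timing; the added penalty term distinguishes the two.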