## 1. Introduction

[2] In simulating streamflow from watershed characteristics, physically based watershed hydrologic models benefit from some degree of calibration prior to use. Most modeling efforts involve analysis of a suite of parameters that are modified to affect various watershed processes. These parameters act as “tuning knobs” to adjust output, ideally such that the model closely simulates the watershed's observed hydrological behavior [*Madsen*, 2000]. Because the parameters generally have ostensible physical meaning (related to groundwater storage capacity, runoff potential, etc.), there are bounds within which parameter values must fall to be realistic for any given study area. There is ample discussion in the literature regarding the pitfalls of model calibration and the importance of minimizing subjectivity and overfitting during the calibration process [*Beven*, 2006; *Efstratiadis and Koutsoyiannis*, 2010]. Expert-guided, manual calibration processes are time-consuming and require extensive familiarity with model operation, structure, and underlying assumptions [*Lindström*, 1997; *Madsen et al.*, 2002; *Ewen*, 2011]. Reproducible model evaluation is desirable for management and policy [*Matott et al.*, 2009], and manual calibration is potentially subjective. Given these complications, most hydrologists have come to favor automated calibration processes [*Madsen et al.*, 2002], which typically involve some type of Monte Carlo sampling to explore combinations of calibration parameters across defined ranges. While eliminating many of the problems associated with manual calibration, autocalibration is accompanied by other drawbacks.
Most notably, it may lead to drastically increased computational demands (due to large numbers of simulation replicates), unrealistic parameter values or combinations resulting from inadequate parameter specification or model structure, and greater challenges to model output interpretation due to issues of equifinality [*Beven and Binley*, 1992; *Beven*, 2006].

[3] Methodological advancements for process-based Monte Carlo simulations have resulted in more efficient searches over a multidimensional parameter space to more quickly find a “best” maximum likelihood parameter set. While a number of effective optimization functions are available for finding the parameter set with the highest likelihood, recent advances for efficiently exploring the posterior likelihood distribution have been developed for rejection sampling [*Robert and Casella*, 2004], Markov chain Monte Carlo (MCMC) methods [*Andrieu et al.*, 2003], and particle filtering [*Arulampalam et al.*, 2002]. These efficient search methods are often combined with a summary statistic-based goodness-of-fit function, rather than the joint likelihood product of all the observations [*Hartig et al.*, 2011], and a likelihood-based inference strategy suitable for stochastic simulations [e.g., *Beaumont*, 2010; *Beven*, 2006; *Grimm et al.*, 2005; *Wood*, 2010]. For non-MCMC applications, Latin hypercube sampling (LHS) has been shown to ameliorate the drawback of increased computational requirements. In LHS, the range for each parameter is subdivided and a sample value is taken from each division. This reduces the number of simulations required to explore a representation of the full parameter space, compared to truly random Monte Carlo approaches [*Uhlenbrook and Sieber*, 2005].
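
The stratified draw behind LHS is straightforward to sketch. The following is a generic illustration only (the parameter bounds are hypothetical, and this is not the implementation used in any of the cited studies): each parameter range is split into `n_samples` equal-width strata, one value is drawn per stratum, and stratum order is shuffled independently per parameter.

```python
import random

def latin_hypercube(n_samples, bounds, seed=None):
    """Draw one value from each of n_samples equal-width strata per parameter.

    bounds: list of (low, high) tuples, one per calibration parameter.
    """
    rng = random.Random(seed)
    columns = []
    for low, high in bounds:
        # One uniform draw inside each stratum of [0, 1), then shuffle the
        # stratum order so the parameter columns are mutually uncorrelated.
        strata = [(i + rng.random()) / n_samples for i in range(n_samples)]
        rng.shuffle(strata)
        columns.append([low + u * (high - low) for u in strata])
    # Transpose the per-parameter columns into parameter sets.
    return [list(row) for row in zip(*columns)]

# e.g., 10 parameter sets over two hypothetical parameter ranges
samples = latin_hypercube(10, [(0.0, 1.0), (30.0, 98.0)], seed=1)
```

Because every stratum of every parameter contributes exactly one draw, `n_samples` simulations cover the full marginal range of each parameter, a guarantee that simple random sampling does not provide.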

[4] Calibrations using a single optimized parameter set are susceptible to overfitting, in that the single best fit may represent such a tailored function for the calibration data set that validation fits and prediction accuracy suffer. Despite hydrologists' long-standing awareness of equifinality and other weaknesses associated with single “best-fit” calibrated parameter sets [*Beven*, 1993; *Yapo et al.*, 1998; *Uhlenbrook et al.*, 1999; *Matott et al.*, 2009], the use of simulation sets (e.g., confidence bands, ensembles, likelihood distributions, etc.) remains under-explored. Several studies have demonstrated that simulation sets are an efficient and successful approach [e.g., *Thiemann et al.*, 2001; *Uhlenbrook and Sieber*, 2005]. In addition, Pareto efficient subsets can be drawn from simulation sets to find parameter combinations that are not “dominated” by other members of the simulation set [*Goldberg*, 1989]. Despite the availability of simulation set and Pareto approaches, single best-fit simulation approaches for model calibration continue to be more commonly used [*Beven*, 2006]. In this study we present a novel compromise between these two calibration strategies.
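
The dominance test behind Pareto subsetting is simple to state in code. In this hedged sketch (all scores are invented, and higher values are assumed better), a member of the simulation set is dominated if some other member is at least as good on every objective and strictly better on at least one:

```python
def pareto_front(scores):
    """Return indices of simulations not dominated by any other member.

    scores: one tuple of objective values (higher is better) per simulation.
    """
    front = []
    for i, a in enumerate(scores):
        dominated = any(
            all(bk >= ak for ak, bk in zip(a, b)) and
            any(bk > ak for ak, bk in zip(a, b))
            for j, b in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# e.g., hypothetical (peak-flow fit, low-flow fit) scores for four parameter sets
front = pareto_front([(0.80, 0.60), (0.75, 0.70), (0.85, 0.55), (0.70, 0.50)])
```

In this example the last parameter set is dropped because the first is at least as good on both objectives; the remaining three trade one objective against another and are mutually nondominated.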

[5] While marked advancements have been made in model calibration methods, there remains a major limitation: no single calibration target meets the needs of all modeling applications [*Yapo et al.*, 1998; *Matott et al.*, 2009; *Efstratiadis and Koutsoyiannis*, 2010]. Calibration relies on calculating an objective function (typically a goodness-of-fit statistic), which relates simulated flows to observed flows from a gauged watershed outlet. Commonly used individual objective functions to compare simulated and observed flows are the *R*^{2} coefficient of determination, the Nash-Sutcliffe Efficiency (NSE), and the index of agreement *d* [*Nash and Sutcliffe*, 1970; *Willmott*, 1981]. These metrics square the difference between simulated and observed flows, with the effect of emphasizing peak flows in model calibration [*Legates and McCabe*, 1999]. To attenuate or eliminate overemphasis of large errors on fit scores, modified forms of NSE and *d* have been devised to increase sensitivity to lower values [*Krause et al.*, 2005]. Given the much larger proportion of low flows compared to high flows in natural flow regimes, these modified forms effectively bias calibration toward low flows.
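
The standard and modified forms can be written as a single function with a generalized error exponent, following the modified forms discussed by Krause et al. [2005]: j = 2 recovers the standard NSE, while j = 1 damps the influence of peak-flow errors. The flow values below are invented for illustration.

```python
def nse(obs, sim, j=2):
    """Nash-Sutcliffe efficiency with a generalized error exponent j.

    j=2 is the standard squared-error form, which emphasizes peak flows;
    j=1 reduces that emphasis and is more sensitive to low flows.
    """
    mean_obs = sum(obs) / len(obs)
    num = sum(abs(o - s) ** j for o, s in zip(obs, sim))
    den = sum(abs(o - mean_obs) ** j for o in obs)
    return 1.0 - num / den

# Hypothetical daily flows: one peak (8.0) among mostly low values.
obs = [1.0, 2.0, 8.0, 3.0, 1.5]
sim = [1.2, 1.8, 6.5, 3.1, 1.4]
```

With these values the single peak error accounts for nearly all of the error sum under j = 2, so a calibration against standard NSE is driven by the peak; under j = 1 the low-flow errors carry proportionally more weight.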

[6] Another important calibration target relates to flow variability. Neither the standard nor the modified forms of *R*^{2}, NSE, *d*, or similar metrics explicitly address how successfully simulated flows replicate the dynamics of observed streamflow. Such variability is critically important to aquatic habitat, sustainable water use, and other ecosystem services considerations and is, thus, central to the concept of environmental flows [*O'Keeffe*, 2009; *Poff and Zimmerman*, 2010]. While many metrics quantify flow variability over long time scales [e.g., *Gao et al.*, 2009], long-term summary statistics cannot be used in place of a fit statistic for daily streamflow calibration. Several calibration targets have been suggested to meet this modeling need, generally incorporating the standard deviation of flow as part of the objective function [e.g., *Moriasi et al.*, 2007].
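
One widely cited example of such a target is the RMSE-observations standard deviation ratio (RSR) discussed by Moriasi et al. [2007], which standardizes the model error by the observed flows' variability. A minimal sketch, with invented flow values:

```python
import math

def rsr(obs, sim):
    """RMSE standardized by the standard deviation of the observed flows.

    Values near 0 indicate low error relative to observed variability.
    """
    n = len(obs)
    mean_obs = sum(obs) / n
    rmse = math.sqrt(sum((o - s) ** 2 for o, s in zip(obs, sim)) / n)
    stdev_obs = math.sqrt(sum((o - mean_obs) ** 2 for o in obs) / n)
    return rmse / stdev_obs

# Hypothetical daily flows for illustration.
obs = [1.0, 2.0, 8.0, 3.0, 1.5]
sim = [1.2, 1.8, 6.5, 3.1, 1.4]
```

Because the denominator is the observed standard deviation, a simulation that flattens the observed flow dynamics is penalized even if its mean error is modest.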

[7] The aforementioned individual objective functions may well serve highly specific modeling goals; however, many practitioners seek to simulate streamflows in a manner that reasonably represents multiple hydrograph response modes simultaneously [*Madsen et al.*, 2002; *Fenicia et al.*, 2007]. In such applications, modelers may be willing to accept suboptimal performance of one aspect of flow in order to improve accuracy in one or more other flow modalities [*Tekleab et al.*, 2011]. Calibration approaches relying on multiple simultaneous objective functions, hereafter referred to as multiobjective functions, have become increasingly common in recent years [*Gupta et al.*, 2009; *Matott et al.*, 2009; *Efstratiadis and Koutsoyiannis*, 2010]. Thus, our objectives were to (1) compare model results using a suite of objective functions for calibration, (2) explore a calibration scheme not linked to any one model or study area that aggregates these into a multiobjective function, while allowing for user-defined fit criteria, and (3) develop a methodology that uses strengths from equifinality-based and Pareto optimization approaches. To achieve this, we applied the Soil and Water Assessment Tool (SWAT) watershed model to two adjacent watersheds, with separate calibrations performed for each calibration target in each watershed. We hypothesized that calibration targets known to emphasize certain aspects of flow (e.g., peak flows) would result in more accurate streamflow simulations for that particular response mode, while proving less accurate in others. Furthermore, we expected that calibrations based on calibrated simulation sets would result in better validations and, by extension, better predictive models and/or spatial extrapolations than calibrations using a single best-fit approach.
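
As a generic illustration only (not the specific scheme developed in this study), one way to aggregate several fit statistics under user-defined criteria is to scale each objective by an acceptability threshold and let the worst ratio govern, so that no single response mode can be sacrificed entirely; all scores and thresholds below are hypothetical:

```python
def multiobjective(scores, thresholds):
    """Aggregate several fit statistics (higher is better) into one score.

    Each objective is divided by a user-defined acceptability threshold;
    the minimum ratio governs, so an aggregate >= 1 means every
    user-defined criterion is met simultaneously.
    """
    return min(s / t for s, t in zip(scores, thresholds))

# e.g., hypothetical (NSE, low-flow NSE, R^2) scores against user thresholds
agg = multiobjective((0.81, 0.56, 0.90), (0.75, 0.70, 0.85))
```

Here the low-flow objective falls short of its threshold, so the aggregate score is below 1 even though the other two criteria are met.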