Corresponding author: L. Gudmundsson, Institute for Atmospheric and Climate Science, ETH Zürich, Universitätstrasse 16, CH-8092 Zürich, Switzerland.
Large-scale hydrological models, simulating the terrestrial water cycle on continental and global scales, are fundamental for many studies in earth system sciences. However, due to imperfect knowledge of real world systems, the models cannot be expected to capture all aspects of large-scale hydrology equally well. To gain insights into the strengths and shortcomings of nine large-scale hydrological models, we assessed their ability to capture the mean annual runoff cycle. Unlike most other studies that rely on discharge observations from continental scale river basins, our study is based on observed runoff from a large number of small, near-natural catchments in Europe. We evaluated the models' ability to capture the average magnitude, the amplitude, as well as the timing of the mean annual runoff cycle. Our study revealed large uncertainties when modeling runoff from these small catchments. We identified large differences in model performance; however, the ensemble mean (mean of all model simulations) yielded rather robust predictions. Model performance varied systematically with climatic conditions and was best in regions with little influence of snow. In cold regions, many models exhibited low correlations between observed and simulated mean annual cycles, which can be associated with shortcomings in simulating the timing of snow accumulation and melt. Local (grid cell) scale differences between observed and simulated runoff can be large, with local biases often exceeding 100%. These local uncertainties are contrasted by a relatively good regional average performance, ultimately reflecting the purpose of the models, i.e., to capture regional hydroclimatology.
 The number of models that simulate the terrestrial water cycle on large (continental and global) scales is continuously increasing. These large-scale hydrological models broadly fall into two categories. Land Surface Models (LSM) historically evolved to provide lower boundary conditions for General Circulation Models (GCM) and their development has focused on solving the surface-energy balance. In contrast, Global Hydrology Models (GHM) do not necessarily solve the surface energy balance directly. They rather focus on describing terrestrial water fluxes and on solving the water balance equation in the context of water resources assessment. However, as most GHM and LSM share the same conceptualization of the terrestrial water balance [Haddeland et al., 2011], no further distinction is made between them in the context of this study. Note that large-scale hydrological models are, in contrast to catchment-scale models, usually not calibrated. Instead the model parameters are derived from mapped land properties, such as soil texture or vegetation density. This study therefore essentially assesses the ability of these models to predict runoff characteristics in ungauged basins. This is in line with the Predictions in Ungauged Basins (PUB) initiative, which is a decadal activity of the International Association of Hydrological Sciences (IAHS), focusing international research efforts on a common topic of high scientific and societal relevance [Sivapalan et al., 2003].
1.1. Evaluating Large-Scale Hydrological Models
 River flow, integrating fluxes across hydrological processes throughout the catchment, is a main variable of the terrestrial water cycle and is monitored with relatively high spatial and temporal coverage. Observations of other hydrologic response variables, such as satellite measurements of soil moisture [e.g., de Jeu et al., 2008; Wagner et al., 2007] or estimates of fluctuations in the terrestrial water storage from the Gravity Recovery and Climate Experiment (GRACE) [e.g., Ramillien et al., 2008; Rodell and Famiglietti, 1999; Tapley et al., 2004] are becoming increasingly available. However, these data products often do not allow assessments at the preferred spatial and temporal scales. Thus river flow remains a key variable for evaluating large-scale hydrological models [Dirmeyer, 2011].
Another approach to model evaluation is followed by model intercomparison projects, which compare results from a multitude of models. An underlying assumption is that a high agreement among the models indicates a good (or at least consistent) representation of the relevant processes. Several projects have compared large-scale hydrological models, mainly focusing on evaporation rates and soil moisture content. Efforts date back as far as Polcher et al., who highlighted a large dependence of evaporation, soil moisture and runoff estimates on the specific model structure. In a major effort, the Project for Intercomparison of Land-surface Parametrization Schemes (PILPS) [Henderson-Sellers et al., 1995] assessed a large set of land surface schemes and their effect on GCMs. The key findings, as highlighted by Cornwell and Harvey, were large disagreements between the different model simulations and the fact that no single model was identified as adequately simulating all important variables of the terrestrial water cycle. In recent years, the Global Soil Wetness Project (GSWP) [Dirmeyer et al., 1999, 2006; Dirmeyer, 2011] assessed a large set of models driven by common forcing data. They concluded that the interannual variability and timing of the annual cycles and anomalies of soil moisture were well reproduced throughout all models, although absolute values were poorly simulated [Guo and Dirmeyer, 2006]. Guo et al. highlighted that the use of the (ensemble) mean of all models as a predictor variable led to systematically improved estimates of soil moisture. Although these studies focused on soil moisture, Dirmeyer et al. reported intermediate agreement of runoff simulated by the GSWP models and suggested that the lack of agreement may be attributable to snow processes. One of the most recent efforts to assess and compare large-scale models is the Water Model Intercomparison Project (WaterMIP) [Haddeland et al., 2011]. Recently, Gudmundsson et al. [2012a] evaluated a model ensemble with respect to the interannual variability of low, mean and high flows in Europe, demonstrating that ensemble techniques can be used to improve continental-scale runoff simulations. There is a wide range of other comparison studies more focused on catchment-scale hydrologic models, such as MOPEX and DMIP [Smith et al., 2012; Duan et al., 2006].
1.2. Models as Testable Hypotheses
 Models describing the terrestrial water cycle provide implementations of the current understanding of hydrological systems. Given our gaps in knowledge of the characteristics of specific systems and differences in modeling approaches, model implementations may differ in their conceptualization of these real world processes. Each model rather represents a hypothesis that needs to be tested. Identifying model limitations may thus eventually lead to improved understanding of the relevant hydrological processes [Savenije, 2009; Sivapalan et al., 2003; Wagener and Gupta, 2005].
In the context of (large-scale) hydrological models, various approaches to model evaluation have been suggested, ranging from systematic comparison of the model physics to the quantification of different aspects of the model error. Koster and Milly and Gedney et al., for example, derived analytical equations characterizing the physics underlying a set of large-scale hydrological models. These equations were then used to demonstrate how different formulations of evaporation and runoff processes influenced model dynamics. Although elegant, such analytical approaches are limited to assessing the consistency between different models and a final proof for the superiority of particular models is not possible. Empirical approaches that directly compare simulations to observations typically aim to quantify model performance and uncertainty [e.g., Gupta et al., 2005; Beven, 2006]. Although of high practical relevance, they often do not address the issue of potentially unrealistic model assumptions [Wagener, 2003].
In recent years, various approaches have been suggested to bridge the gap between observation-based model validation and theory development. These studies contribute to a community-wide effort in developing a strategy for a “diagnostic approach to model evaluation” [Gupta et al., 2008]. Such an approach should enable the detection of model components that cause the discrepancy between observed and simulated system behavior [Gupta et al., 2008] with the objective to develop models with a higher degree of realism [Wagener, 2003]. This ambitious task is not yet fully resolved and various ways to move forward have been suggested. In a diagnostic approach, the systematic assessment of model behavior is paired with assumptions about the expected behavior of the underlying true system. Approaches that are based on analyzing the model error usually disaggregate the model residuals, either in time [e.g., de Vos et al., 2010; Mahecha et al., 2010; Reusser et al., 2011] or into error components like bias and variance [e.g., Gulden et al., 2008; Herbst et al., 2009; Rosero et al., 2009; Martinez and Gupta, 2010]. Approaches that are based on strict assumptions on system behavior often define signature indices that measure theoretically relevant properties of the system dynamics [e.g., Gupta et al., 2008; Yilmaz et al., 2008].
The primary objective of this study is to assess the ability of nine large-scale hydrological models to capture important aspects of the runoff climatology in Europe. The runoff climatology is characterized by the mean annual cycle, which is one of the most fundamental signals of large-scale hydrology [Dettinger and Diaz, 2000]. Characteristics of the mean annual cycle allow for insights into the water balance and the seasonality of runoff. Other aspects of runoff variability, such as the interannual variability of low, mean and high flows, are addressed in a parallel study [Gudmundsson et al., 2012a]. Large-scale hydrological models are not built to accurately model runoff at the scale of small catchments and interpreting highly localized model performance may yield misleading results. Therefore, model performance is here quantified by regional averages, allowing for novel insights into model performance for different hydroclimatic regimes (i.e., groups of catchments with similar seasonal cycles).
 A secondary objective is to explore a diagnostic approach to model evaluation for multimodel ensembles of large-scale hydrological models, which potentially could facilitate the attribution of model error to process descriptions or climate. This task is approached by (1) quantifying model performance using different performance metrics that are sensitive to various aspects of runoff climatology; (2) comparing model performance for different hydroclimatic regimes; and (3) testing whether there is a link between various performance metrics and physical catchment properties or climatological conditions.
2. Model Simulations and Observations
2.1. Runoff Simulations
A multimodel ensemble of nine large-scale hydrological models was considered. The ensemble was developed in a joint effort within the framework of the Water and Global Change (WATCH) project (see http://eu-watch.org/). All model simulations were carried out by project partners and the collection of model simulations is identical to the ensemble described by Gudmundsson et al. [2012a]. Table 1 provides an overview of the principal concepts underlying the evaporation, snow and runoff routines. Table 2 lists additional details and relevant literature.
Table 1. Overview of the Participating Models and Their Main Characteristics^a
^a Models written in italics solve the energy balance. Surface runoff (Qs) is in all cases modeled as saturation excess; the following abbreviations refer to specific parametrizations of subgrid variability: ARNO [Todini, 1996]; improved ARNO [Dümenil and Todini, 1992]; PDM [Moore, 1985]. Subsurface runoff (Qsb) is either modeled as a function of soil moisture or groundwater storage, Qsb = f(S), where f(S) denotes potentially nonlinear model specific functions (“Richards”: N-layer approximation of Richards equation). Taken from Gudmundsson et al. [2012a], adapted from Haddeland et al.
: Saturation excess
: Infiltration excess
: Saturation excess
: Infiltration and saturation excess
: Improved ARNO
: Infiltration excess
: Saturation excess
Table 2. Brief Descriptions of the Nine Large-Scale Hydrological Models^a
The GWAVA model [Meigh et al., 1999] is based on the PDM - rainfall runoff model with an analytic approximation of the subgrid variability of soil moisture [Moore, 1985, 2007]. The subsurface features several conceptual storages representing the unsaturated and the saturated zone. Two additional storages are used for routing of water via fast pathways to the cell outlet.
The water movement within a grid cell of HTESSEL [Balsamo et al., 2009] is based on the ARNO infiltration excess scheme [Todini, 1996], which parametrizes subgrid variability of soil moisture as a function of the standard deviation of orography. HTESSEL features a detailed approximation of the unsaturated zone, which is described by several layers and soil moisture is calculated using an approximation of Richards equation.
JULES uses four soil layers to calculate subsurface hydrology, with vertical fluxes of water calculated from a solution of Richards equation including root water uptake [Best et al., 2011; Clark et al., 2011].
LPJmL was developed to model global vegetation dynamics and their coupling to carbon and water fluxes. It features a five-layer soil parametrization where each layer is parametrized as a bucket model that produces saturation excess runoff. Soil moisture responds not only to atmospheric moisture demand, but also to vegetation dynamics (Fader et al., Bondeau et al., and new parameterizations as in S. Schaphoff (unpublished data, 2011)).
The subsurface hydrology of MATSIRO [Takata et al., 2003] is represented by vertical movement of infiltrated moisture through unsaturated soil layers underlain by a groundwater reservoir. The saturated and unsaturated soil zones are dynamically coupled through an exchange of groundwater recharge, and baseflow is generated from the groundwater reservoir (S. Koirala, Global modeling of land surface hydrology with the representation of water table dynamics: 1. Model construction and evaluation, submitted to Journal of Geophysical Research, 2012; S. Koirala, A global modeling of land surface hydrology with the representation of water table dynamics: 2. Parameter estimation, submitted to Journal of Geophysical Research, 2012).
ORCHIDEE has a complex hydrological infiltration scheme [d'Orgeval et al., 2008], which solves the vertical movement of water in the soil using the Fokker-Planck equation with Van Genuchten-Mualem parameters. Subsurface runoff considers orography and surface runoff may reinfiltrate in the same grid cell if the slope is small.
WATERGAP is based on a series of conceptual storages including surface water bodies, soil moisture and ground water [Alcamo et al., 2003]. WATERGAP is the only ensemble member that does not solely rely on input maps for parameter estimation, but also undergoes a very limited calibration procedure [see Hunger and Döll, 2008, for details].
Although different in conceptualization, all models solve the vertical water balance at the grid cell scale, describing changes in storage, $dS/dt$, as

$$\frac{dS}{dt} = P - ET - Q_s - Q_{sb} \qquad (1)$$

where $P$ is precipitation and $ET$ evapotranspiration. $Q_s$ and $Q_{sb}$ denote surface runoff (water leaving the grid cell above ground) and subsurface runoff (water leaving the grid cell below ground), respectively. In this study, the total runoff from individual grid cells ($Q = Q_s + Q_{sb}$) is compared to observation-based runoff estimates from small catchments.
The simulation setup was, except for the time period and the temporal resolution of the output, identical to the setup described by Haddeland et al. [2011]. All models were run on the 0.5° grid defined by the CRU (Climate Research Unit of the University of East Anglia) global land mask. Model runs are available at daily time steps for the time period 1963–2000, preceded by a 5 year spin-up period. The models are not calibrated, with the exception of WATERGAP, which undergoes a limited calibration procedure (see Hunger and Döll [2008] for details).
 All models were forced using the WATCH Forcing Data (WFD) [Weedon et al., 2010, 2011]. The WFD comprise an array of (near) surface variables (wind speed, air temperature at 2 m, surface pressure, specific humidity at 2 m, downward longwave and shortwave radiation, rainfall and snowfall rates) derived from the ERA40 reanalysis data [Uppala et al., 2005]. The WFD are optimized for use in large-scale hydrological modeling by a series of adjustments. In a first step the ERA40 data are interpolated from a 1° grid to the 0.5° defined by the CRU land mask. The variables are corrected for elevation differences between the two grids where appropriate (e.g., temperature is adjusted using the environmental lapse rate, which in turn impacts surface pressure and humidity). Temperature is bias corrected and shortwave radiation is adjusted based on aerosol loading and cloud cover using monthly CRU data, which comprise interpolated observations [New et al., 1999, 2000; Mitchell and Jones, 2005]. Precipitation is bias corrected using the observation based GPCCv4 precipitation grid [Rudolf and Schneider, 2005; Fuchs, 2009; Schneider et al., 2010]. For further technical details the reader is referred to Weedon et al. [2010, 2011]. The WFD are available from 1958 to 2001 corresponding to the complete years of the ERA40 reanalysis.
 Simulations of the nine individual models were augmented by the mean simulation of all models, which in the following will be referred to as the ENSEMBLE. The ENSEMBLE is treated as if it was an additional, independent model throughout the analysis. Modeled daily runoff Q was finally aggregated to monthly climatologies, by averaging the grid cell runoff for each month and over all years.
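The aggregation described above can be illustrated with a minimal sketch. It assumes that each model's daily runoff for one grid cell is available as a plain list together with the calendar month of every value; the function names `monthly_climatology` and `ensemble_mean` are hypothetical and are not taken from the study. Because averaging is linear, the ensemble mean is computed here from the monthly climatologies, which gives the same result as averaging the daily simulations first as long as all models share the same time axis.

```python
from collections import defaultdict
from statistics import fmean


def monthly_climatology(months, runoff):
    """Average daily runoff by calendar month over all years.

    months -- calendar month (1-12) of each daily value
    runoff -- daily runoff values, same length as months
    Returns 12 long-term monthly means, January first.
    """
    by_month = defaultdict(list)
    for month, q in zip(months, runoff):
        by_month[month].append(q)
    return [fmean(by_month[m]) for m in range(1, 13)]


def ensemble_mean(climatologies):
    """Month-by-month mean across the climatologies of several models."""
    return [fmean(values) for values in zip(*climatologies)]
```

The ensemble mean returned by `ensemble_mean` can then be passed through the same evaluation as any individual model.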
 In total 426 runoff series from small near-natural catchments that are not nested and which cover the time window 1963–2000, were included in this study. Most records originate from the European Water Archive (see http://grdc.bafg.de/ewa), a database assembled by the Euro-FRIEND (see http://ne-friend.bafg.de/servlet/is/7413/) program and held by the Global Runoff Data Centre (see http://www.bafg.de). The EWA was recently updated [Stahl et al., 2008, 2010] and complemented by national data from partners in the WATCH project. Each streamflow series was further augmented with information on catchment boundaries, catchment area and mean catchment elevation (estimated using a high-resolution digital elevation model) from the pan-European River and Catchment Database CCM2 [Vogt et al., 2007].
Figure 1a shows a histogram of catchment areas. Most catchments are about one order of magnitude smaller than the area covered by a 0.5° grid cell at mid latitudes. To enable a comparison of the observed streamflow from small catchments with grid cell runoff, flow volumes were converted to equivalent runoff rates per unit area. This quantity will in the following be referred to as “observed runoff” and is considered to be analogous to the simulated grid cell runoff. All gauging stations were further assigned to specific grid cells of the CRU land mask. In case of more than one runoff station per grid cell, the area-weighted average of the runoff values was used. This procedure resulted in a total of 298 grid cells with observation-based runoff estimates. Finally, the runoff series of each grid cell were summarized to monthly averages (long-term mean of Januaries, Februaries, etc.), representing the observed mean annual cycle.
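The conversion and the area-weighted averaging can be sketched as follows. The units are assumptions made for illustration only (discharge in m³/s, catchment area in km², runoff in mm/day), since the text does not state them here, and both function names are hypothetical.

```python
def runoff_rate_mm_per_day(discharge_m3s, area_km2):
    """Convert discharge (m^3/s) to runoff per unit area (mm/day).

    Units are an assumption of this sketch, not taken from the study.
    """
    seconds_per_day = 86400.0
    area_m2 = area_km2 * 1.0e6
    depth_m_per_day = discharge_m3s * seconds_per_day / area_m2
    return depth_m_per_day * 1000.0


def area_weighted_runoff(runoff_rates, areas_km2):
    """Area-weighted mean runoff of all catchments assigned to one grid cell."""
    total_area = sum(areas_km2)
    return sum(q * a for q, a in zip(runoff_rates, areas_km2)) / total_area
```

For a grid cell containing a single gauging station, `area_weighted_runoff` simply returns that station's runoff rate.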
The scale difference between the model grid cells and the relatively small catchments (Figure 1a) may lead to systematic errors, as the area of a grid cell that is covered by the corresponding catchment may differ and also because the models may not represent subgrid variability in sufficient detail. Figure 1b compares the elevation of the grid cells of the CRU land mask to the mean catchment elevation. In general, grid cell and catchment elevation compare reasonably well. However, the mean grid cell elevation tends to be lower, suggesting that the catchments of the EWA data set are biased toward higher altitudes. It is, in this context, important to remember that the models are forced with atmospheric variables derived from a reanalysis data set. Although reanalysis data are considered to be reliable sources of information, they are in principle limited to representing average conditions of large grid cells. This is partly confirmed by the long-term runoff ratio $\bar{Q}/\bar{P}$ (Figure 1c). Values of $\bar{Q}/\bar{P} \geq 1$ indicate that the observed average runoff ($\bar{Q}$) is larger than (or equal to) average precipitation ($\bar{P}$). Such values are physically unlikely and suggest that the WFD underestimate precipitation in the corresponding grid cells. Interestingly, most of the grid cells with unrealistic runoff ratios are located along the northwest coasts or in mountainous regions such as the Alps and Scandinavia. Further, the runoff ratio increases slightly for grid cells with a notably lower elevation than the corresponding catchments (Figure 1d).
Figure A1 in Appendix A displays mapped values of six additional variables that are used for the analysis of model error. These include: mean annual runoff, precipitation and temperature; catchment area and elevation; grid cell elevation and the difference between grid cell and catchment elevation.
3.1. Classification of Hydroclimatic Regimes
It is common to group rivers and streams according to their within-year variability into classes of so-called hydroclimatic regimes [Gottschalk et al., 1979; Haines et al., 1988; Harris et al., 2000]. Here, such classes were defined as groups of streams with similar mean annual cycles, also referred to as classification by shape [Harris et al., 2000]. At each location, the mean annual cycle of observed runoff was identified as the long-term mean of each month and standardized (subtracting the mean and then dividing by the standard deviation). The regime classification was then achieved using Ward's hierarchical clustering algorithm [Ward, 1967].
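The classification by shape can be illustrated with a small, deliberately naive sketch. It is not the implementation used in the study: the function names are hypothetical, the population standard deviation is assumed for the standardization, and Ward's criterion is applied by brute force, merging at each step the two clusters whose union gives the smallest increase in within-cluster sum of squares.

```python
from statistics import fmean, pstdev


def standardize(cycle):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu = fmean(cycle)
    sd = pstdev(cycle)
    return [(x - mu) / sd for x in cycle]


def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's minimum-variance criterion.

    points -- list of equally long numeric sequences (e.g., standardized cycles)
    Returns a list of clusters, each a sorted list of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    centroids = [list(p) for p in points]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                d2 = sum((x - y) ** 2 for x, y in zip(centroids[a], centroids[b]))
                # Increase in within-cluster sum of squares if a and b merge.
                cost = na * nb / (na + nb) * d2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        na, nb = len(clusters[a]), len(clusters[b])
        centroids[a] = [(na * x + nb * y) / (na + nb)
                        for x, y in zip(centroids[a], centroids[b])]
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b], centroids[b]
    return clusters
```

In this setting, `points` would hold one standardized 12-month cycle per grid cell and `n_clusters` would be three, matching the number of regime classes reported in the results.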
3.2. Model Evaluation
3.2.1. Quantifying Model Performance
The seasonal cycle can be characterized by three different properties, each with a different hydrological interpretation. The mean annual value is a water balance measure, giving insight into the long-term water balance. The amplitude, or the spread from the lowest to the highest monthly value, measures how pronounced the seasonal cycle is. The shape of the seasonal cycle provides information on the timing of the months of minimum and maximum flow within a year.
The models' ability to capture the mean annual runoff is quantified using the relative bias:

$$\Delta\mu = \frac{\mu_m - \mu_o}{\mu_o} \qquad (2)$$

where $\mu_m$ and $\mu_o$ denote the mean of the modeled and observed annual cycle, respectively. The optimal value of $\Delta\mu$ is zero and negative (positive) values indicate underestimation (overestimation). An absolute value of one means that the difference is as large as the observed value (i.e., a deviation of 100%).
The amplitudes of the observed and modeled mean annual cycle are compared using the relative difference in standard deviation:

$$\Delta\sigma = \frac{\sigma_m - \sigma_o}{\sigma_o} \qquad (3)$$

where $\sigma_m$ and $\sigma_o$ denote the standard deviation of the modeled and observed mean annual cycle, respectively.
The shape of the observed and modeled mean annual cycle is compared using Pearson's correlation coefficient r, which is sensitive to differences in the shape as well as in the timing of the mean annual cycle. The optimal value is r = 1, whereas a value of r = −1 shows that observed and modeled mean annual cycles are identical, but phase shifted (i.e., the model inverts the wet and the dry season).
Differences in the three performance metrics (the relative bias $\Delta\mu$, the relative difference in standard deviation $\Delta\sigma$, and r) are compared for all models and regime classes (see section 3.1) separately. Average model performance is quantified by the regime class median. The spread, indicating spatial variability in model performance, is shown using box and whisker plots, displaying the interquartile range as well as the range from the 5th to the 95th percentile.
The three performance metrics (the relative bias $\Delta\mu$, the relative difference in standard deviation $\Delta\sigma$, and r) are augmented with information on the relative difference between observed and simulated values for each month separately. These monthly biases provide easily accessible information on the evolution of the model error within a year.
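As a minimal sketch, the metrics described in this section can be computed from two 12-value mean annual cycles as shown below. The sketch assumes that the relative bias and the relative difference in standard deviation are both normalized by the observed value, that the population standard deviation is used, and that the monthly bias is the month-wise relative difference; the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean, pstdev


def relative_bias(mod, obs):
    """(mean of modeled cycle - mean of observed cycle) / mean of observed cycle."""
    return (fmean(mod) - fmean(obs)) / fmean(obs)


def relative_sd_difference(mod, obs):
    """(sd of modeled cycle - sd of observed cycle) / sd of observed cycle."""
    return (pstdev(mod) - pstdev(obs)) / pstdev(obs)


def pearson_r(mod, obs):
    """Pearson correlation between the modeled and observed cycle."""
    mm, mo = fmean(mod), fmean(obs)
    cov = sum((m - mm) * (o - mo) for m, o in zip(mod, obs))
    ss_m = sum((m - mm) ** 2 for m in mod)
    ss_o = sum((o - mo) ** 2 for o in obs)
    return cov / sqrt(ss_m * ss_o)


def monthly_relative_bias(mod, obs):
    """Relative difference between modeled and observed value for each month."""
    return [(m - o) / o for m, o in zip(mod, obs)]
```

A model that doubles every monthly value yields a relative bias and a relative difference in standard deviation of one (100%) while keeping r = 1, which shows that the three metrics respond to different aspects of the error.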
3.2.2. Relative Influence of Error Components on the Total Model Error
The three performance metrics provide insights into the models' ability to capture different aspects of the mean annual cycle. However, they do not allow for an assessment of whether the bias, differences in amplitude or the errors due to imperfect correlation have the largest impact on the total model performance. Total model error is often quantified using the mean square error (MSE). In its basic form

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (m_i - o_i)^2 \qquad (4)$$

($m_i$ modeled and $o_i$ observed values), the MSE does not discriminate between different error components. However, following Gupta et al., the MSE can be decomposed into

$$\mathrm{MSE} = E_\mu + E_\sigma + E_r \qquad (5)$$

with $E_\mu = (\mu_m - \mu_o)^2$, $E_\sigma = (\sigma_m - \sigma_o)^2$, and $E_r = 2 \sigma_m \sigma_o (1 - r)$.

The terms of equation (5) can be directly related to the performance metrics defined above. The squared bias $E_\mu$ and the squared difference in standard deviation $E_\sigma$ are analogous to the relative bias $\Delta\mu$ and the relative difference in standard deviation $\Delta\sigma$. The term $E_r$ summarizes the errors related to imperfect linear correlation (r). In their unnormalized form, the components $E_\mu$, $E_\sigma$, and $E_r$ cannot easily be compared between different locations due to large differences in the magnitude of runoff. However, normalizing the different components as

$$\frac{E_\mu}{\mathrm{MSE}}, \quad \frac{E_\sigma}{\mathrm{MSE}}, \quad \frac{E_r}{\mathrm{MSE}} \qquad (6)$$

allows the relative contribution of each component to the total error to be assessed. This normalization makes it possible to quantify whether bias (a), differences in amplitude (b) or differences related to timing (c) have the most severe impact on overall model performance.
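The decomposition can be checked numerically with a short sketch. It assumes the standard form of the decomposition, in which the MSE equals the squared difference in means, plus the squared difference in (population) standard deviations, plus twice the product of the two standard deviations times (1 − r); the function names are hypothetical.

```python
from statistics import fmean, pstdev


def mse_components(mod, obs):
    """Return (MSE, bias term, amplitude term, correlation term)."""
    n = len(obs)
    mse = sum((m - o) ** 2 for m, o in zip(mod, obs)) / n
    mu_m, mu_o = fmean(mod), fmean(obs)
    sd_m, sd_o = pstdev(mod), pstdev(obs)
    cov = sum((m - mu_m) * (o - mu_o) for m, o in zip(mod, obs)) / n
    r = cov / (sd_m * sd_o)
    bias_term = (mu_m - mu_o) ** 2            # squared difference in means
    amplitude_term = (sd_m - sd_o) ** 2       # squared difference in std. dev.
    correlation_term = 2.0 * sd_m * sd_o * (1.0 - r)
    return mse, bias_term, amplitude_term, correlation_term


def relative_contributions(mod, obs):
    """Share of each component in the total MSE; the three shares sum to one."""
    mse, bias_term, amplitude_term, correlation_term = mse_components(mod, obs)
    return bias_term / mse, amplitude_term / mse, correlation_term / mse
```

The identity only holds exactly when the standard deviations are computed with the 1/n (population) convention, which is why `pstdev` is used here rather than `stdev`.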
3.3. Influence of Catchment Characteristics and Climatic Conditions on Model Error
To understand catchment and local climate controls on model performance, the three measures (the relative bias $\Delta\mu$, the relative difference in standard deviation $\Delta\sigma$, and r) were correlated with catchment area (Area), mean catchment elevation (Elevation), difference between catchment and grid cell elevation, observed mean annual runoff ($\bar{Q}$), precipitation ($\bar{P}$), temperature ($\bar{T}$) and the long-term runoff coefficient ($\bar{Q}/\bar{P}$). Correlations were computed using Spearman's rank correlation coefficient $\rho$ [Spearman, 1987, reprint of Spearman, 1904]. In contrast to Pearson's linear correlation coefficient, $\rho$ is a general measure of monotonic dependence and thus also sensitive to nonlinear relationships and robust against outliers. Spearman's $\rho$ can be thought of as Pearson's correlation coefficient between ranks and ranges from −1 to 1. As all variables entering the correlation analysis have to be expected to exhibit some spatial dependence, normal significance tests will yield unreliable results. Therefore, the significance of Spearman's $\rho$ was tested using a modified t test that takes spatial dependence into account [Clifford et al., 1989; Haining, 1991]. This test has recently been applied to fields of atmospheric variables [Pepin et al., 2011] and in the context of large-scale hydrology [Gudmundsson et al., 2011]. Significance is reported for .
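The rank correlation itself can be sketched as Pearson's correlation applied to ranks, with tied values receiving their average rank. The sketch below covers only the coefficient and does not implement the modified t test for spatially dependent data; the function names are hypothetical.

```python
from math import sqrt
from statistics import fmean


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation between ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = fmean(rx), fmean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    ss_x = sum((a - mx) ** 2 for a in rx)
    ss_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(ss_x * ss_y)
```

Because only ranks enter the calculation, any strictly increasing relationship between a performance metric and a catchment property gives a coefficient of one, regardless of whether the relationship is linear.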
4.1. Hydroclimatic Regime Classes
Figure 2 shows the results of the regime classification and provides summaries of mean annual runoff, precipitation and temperature for each regime class. Grid cells belonging to Regime Class 1 (RC1, 91 grid cells) are grouped into two locations, one in Scandinavia and one in the Alps. Regime Class 1 features a seasonal pattern with a winter (January, February) minimum and late spring (May) maximum. This is typical for snow dominated hydrological systems, where precipitation falling in winter is stored as snow (leading to a winter minimum in runoff) and released in a spring flood during snow melt. The median of observed mean annual runoff is highest in RC1 ( ), only slightly lower than the mean annual precipitation ( ). This winter minimum–spring maximum regime is also the coldest with a median mean annual temperature of 3.8°C. Regime Class 2 (RC2, 57 grid cells) has its regional center toward the east and a seasonal pattern with a spring (April) maximum and an autumn (September) minimum. This regime is the driest with the lowest median mean annual runoff ( ) and precipitation ( ). The median of mean annual temperature is 7.9°C. Most of the stations in Regime Class 3 (RC3, 150 grid cells) are located in western and central Europe. This regime has a seasonal pattern with winter (December, January) maximum and summer (August) minimum, which is typical for hydrological systems with high evapotranspiration in summer. The mean annual climatology is comparable to RC2, with mean annual runoff ( ) being less than half of the incoming precipitation ( ). Regime Class 3 has on average the highest mean annual temperature (8.6°C).
4.2. Model Performance
The following sections summarize the performance of individual models within the three hydroclimatic regimes. For each regime class a set of four graphs is shown, summarizing the strengths and shortcomings of the individual models (e.g., Figure 3). The top panel shows the relative difference in the mean for each month (monthly biases). The three remaining panels show successively the relative error in mean ($\Delta\mu$), the relative difference in standard deviation ($\Delta\sigma$) and the correlation (r) between the mean annual cycle of observed and simulated runoff. Table 3 presents the median values of the three performance metrics for each model and regime class.
Table 3. Median Model Performance for the Three Different Regime Classes (RC) as Measured by the Relative Difference in Mean ($\Delta\mu$), the Relative Difference in the Standard Deviation ($\Delta\sigma$), and the Correlation (r)^a
^a The best models in each line are highlighted in bold.
Regime Class 1
Regime Class 2
Regime Class 3
4.2.1. Regime Class 1 (Winter Minimum—Spring Maximum)
Figure 3 summarizes model performance in the snow dominated regime class (RC1). Only the GWAVA model is able to capture the shape and the amplitude of the mean annual cycle, as reflected by small seasonal differences in the monthly biases. However, the median monthly bias is negative for all months, indicating an underestimation of average runoff. For all other models, a more or less pronounced seasonal cycle in the monthly biases is found. Most models have a similar pattern with an overestimation of runoff (positive monthly bias) in the first months of the year and an underestimation of runoff in summer and early fall. The pattern is most pronounced for H08 and LPJmL, both having a sudden drop in the monthly bias from April to May (4th to 5th box). The models do not only differ in their seasonality, but also in the range of the monthly biases among the grid cells (whiskers of the box plots). Interestingly, models with low seasonality in the monthly biases also have the smallest range among the grid cells (GWAVA, JULES). The difference among grid cells is most pronounced for LPJmL and MPI-HM in April (4th box), where observed runoff is up to six times larger than the simulated runoff.
All models systematically underestimate runoff (negative relative bias $\Delta\mu$). LPJmL and ORCHIDEE have the smallest bias, closely followed by H08. Most of the models underestimate the standard deviation of the annual cycle ($\Delta\sigma$) by almost one third, with the exception of MPI-HM. The correlation coefficient (r) between the observed and the simulated mean annual cycle shows a more complex result. Most models have negative or zero correlations in some cases, indicating deviations in the phasing of the annual cycle. GWAVA has the highest median correlation, followed by WATERGAP, HTESSEL and JULES. Both H08 and LPJmL have a low median r, indicating severe shortcomings with the timing of the annual cycle for more than half of the grid cells. These models also have the most pronounced seasonal cycle in the monthly bias.
4.2.2. Regime Class 2 (Summer Minimum—Spring Maximum)
Figure 4 summarizes model performance in the spring maximum and summer minimum regime class (RC2). Most models show a weak seasonal pattern in the monthly biases, suggesting that they capture monthly runoff climatologies reasonably well. Only H08, LPJmL and ORCHIDEE deviate considerably from this general pattern. H08 overestimates mean monthly runoff during winter and underestimates runoff in the summer months. LPJmL has a similar, but less pronounced pattern, whereas the seasonal pattern of ORCHIDEE has a somewhat opposite shape (it underestimates runoff in winter and overestimates runoff in late summer). Interestingly, the August values of ORCHIDEE have the largest differences among the grid cells, as shown by the whiskers of the box plot. This is contrasted by the spring values of H08, which appear to have the lowest spread.
 The underestimation of runoff is less pronounced than in RC1, and LPJmL even slightly overestimates runoff. Both MATSIRO and ORCHIDEE are almost unbiased, closely followed by H08. The largest bias is found for WATERGAP, which underestimates runoff significantly. There are large differences in how the models capture the standard deviation of the mean annual cycle. Median values range from strong underestimation (GWAVA, WATERGAP) to pronounced overestimation (MPI-HM, H08). H08 also has the largest spread in the relative difference in standard deviation (as seen in the whiskers). Only LPJmL has a median close to the optimal value of zero. The correlations between the observed and simulated mean annual cycle are relatively high for most of the models. The ENSEMBLE has on average the highest correlations and HTESSEL, MATSIRO, MPI-HM, GWAVA and JULES have comparable values. Only H08 and ORCHIDEE, the models with the most pronounced annual cycle in the monthly bias, have relatively low r.
4.2.3. Regime Class 3 (Summer Minimum—Winter Maximum)
 The performance of the ensemble members in RC3 is summarized in Figure 5. The most pronounced seasonal patterns in the monthly bias are found for H08, which underestimates runoff in the summer months, and for MATSIRO and ORCHIDEE, both of which overestimate summer runoff. Overall the seasonal patterns of the monthly bias are comparable to those in RC2.
 As in RC2 there is a slight tendency to underestimate the mean annual runoff, which is most pronounced for JULES and MPI-HM. Only LPJmL clearly overestimates annual runoff. GWAVA, H08 and ORCHIDEE are approximately unbiased. The average model performances with respect to standard deviation are similar to RC2, with large differences among the models. JULES gives the most precise estimate of the standard deviation, followed closely by MPI-HM and LPJmL. The correlation between the observed and the simulated mean annual cycle is on average high and higher than in RC1 and RC2. As in RC2 both the ENSEMBLE and GWAVA correlate almost perfectly with the observations, followed by LPJmL, WATERGAP and MPI-HM. Only ORCHIDEE has a large spread in r.
4.2.4. Regime Class Averages
Figure 6 summarizes the overall model performance for each of the three hydroclimatic regimes. Both the relative bias in mean runoff and the relative difference in standard deviation have negative medians for all regime classes, indicating that the models underestimate average runoff as well as the amplitude of the annual cycle. In both cases the underestimation is most pronounced for RC1. Both measures show a considerable spread for each regime class, indicating large differences among the models in reproducing seasonal runoff. The spread in is almost twice the spread of . The correlation between observed and modeled mean annual cycles (r) is highest in RC3, followed by RC2. The lowest median r as well as the largest spread in r is found in RC1, where a considerable number of grid cells have almost zero or even negative correlations. The large spread of r in RC1 is contrasted by the small variability in RC3, which exhibits an overall good model performance. The r values of RC2 lie between these two.
4.2.5. Trade-Off Between Error Components
Figure 7 (top) shows the mean square error (MSE) for the different regime classes and models. The MSE in RC1 is on average about one order of magnitude larger than in RC2 and RC3. The bottom panels display a decomposition of the median values of the MSE of each model and regime class into the three error components defined in section 3.2.2. The three axes display the relative contribution of the error in the mean, the variance and the error due to imperfect correlation to the MSE (equation (6)). The sum of the axis values of each data point equals one. Points in the center indicate an equal contribution of the three error components. Both panels display the same data, but differ in the color coding to emphasize different aspects in the grouping, i.e., regime class or model.
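The decomposition described above can be illustrated with a short sketch. Equation (6) is not reproduced in this section, so the code assumes the commonly used identity MSE = (difference in means)^2 + (difference in standard deviations)^2 + 2 * sd_sim * sd_obs * (1 - r), which may differ in notation from the authors' formulation. The function name and the example values are hypothetical.

```python
from math import sqrt

def mse_components(obs, sim):
    """Split the MSE into relative contributions from the error in the
    mean, the error in the variance, and imperfect correlation.
    ASSUMPTION: standard three-term identity, population statistics."""
    n = len(obs)
    mo, ms = sum(obs) / n, sum(sim) / n
    so = sqrt(sum((o - mo) ** 2 for o in obs) / n)
    ss = sqrt(sum((s - ms) ** 2 for s in sim) / n)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / n
    r = cov / (so * ss)
    mse = sum((s - o) ** 2 for o, s in zip(obs, sim)) / n
    parts = {
        "mean": (ms - mo) ** 2,
        "variance": (ss - so) ** 2,
        "correlation": 2.0 * so * ss * (1.0 - r),
    }
    # Relative contributions; these are the three axis values of one point.
    fractions = {name: value / mse for name, value in parts.items()}
    return mse, fractions

# Hypothetical cycle where the simulated peak arrives one step too late.
obs = [10.0, 20.0, 60.0, 40.0, 20.0, 10.0]
sim = [10.0, 10.0, 20.0, 60.0, 40.0, 20.0]
mse, fractions = mse_components(obs, sim)
print(mse, fractions)
```

Because the three terms add up to the MSE exactly, the fractions sum to one, which is the property that allows each model and regime class to be drawn as a single point in a three-axis diagram.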
Figure 7 (bottom left) focuses on regime classes. Generally, errors in correlation have the largest contribution to the MSE in the snow dominated RC1, whereas errors in the mean have the largest contribution in the evapotranspiration dominated RC3. Regime Class 2, which features elements of both snow and evapotranspiration dominated mean annual cycles, falls in between. The error in variance appears to be of secondary importance only and is dominated by the two other error components in all but one case.
 There is a large scatter in the relative contribution of the three error components among the models (bottom right panel). MATSIRO is the only model for which the error in variance has a larger influence than the two other error components (for RC1 and RC3). All other models have either a relatively balanced error structure (e.g., JULES) or show a distinct trade-off between mean and correlation errors (e.g., LPJmL).
4.3. Influence of Catchment Characteristics and Climatic Conditions on Model Error
 The large spread of the three performance measures in each regime class (Figures 3, 4, and 5) suggests that the performance of most models has a considerable amount of spatial variability. Figure 8 exemplifies this by displaying the spatial patterns of the relative bias in mean runoff, the relative difference in standard deviation, and r for the ENSEMBLE mean. (Note that absolute values of the two relative measures larger than one were truncated to one, retaining the sign, for the purpose of visualization. Similarly, negative correlation coefficients (r) were set to zero.) There is a large spatial scatter for all three performance metrics. Both relative measures are mainly positive in the center of the spatial domain and negative in the south, west and north. High correlations (r) are found in central and southern Europe and lower r are seen in Scandinavia, the Alps and the Pyrenees.
Figure 9 summarizes the results of the Spearman rank-correlation analysis that relates the three performance metrics to a set of explanatory variables for each regime class separately. Significant correlations are highlighted in the figure. It should be noted that the significance level varies between the pairs, depending on the degree of spatial dependence. The following description focuses only on results where significant effects of catchment properties on performance metrics have been found for at least half of the models (five out of ten). This reduces the risk of accidentally interpreting spuriously high correlations (false discovery rate) in this multiple testing problem.
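As a reminder of what the rank-correlation analysis measures, the following sketch computes Spearman's coefficient as the Pearson correlation of average ranks. It is a generic illustration with hypothetical values and does not reproduce the authors' significance test, which additionally accounts for spatial dependence.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Hypothetical grid cells: an explanatory variable and a performance metric.
explanatory = [0.2, 0.5, 0.9, 1.3]
metric = [0.1, -0.1, -0.4, -0.9]
print(spearman(explanatory, metric))
```

Using ranks makes the measure insensitive to the strongly skewed distributions that are typical for catchment properties, since only the ordering of the grid cells matters.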
 The strongest correlations are negative and found between the relative bias in mean runoff and the runoff ratio (observed runoff divided by the precipitation used to force the models) for all regime classes and all models. A similar pattern is found for the observed average runoff, although with fewer significant correlations. The significant relation between the relative bias and the runoff ratio implies an underestimation of runoff for grid cells with high runoff coefficients. The relative bias is positively correlated with the elevation difference between grid cell and catchment for five models in RC2 and for all models in RC3. This means that runoff is more likely to be underestimated if the grid cell elevation is lower than the mean catchment elevation (recall that grid cell elevation is lower than catchment elevation in most instances; see Figure 1b).
 The relative difference in standard deviation ( ) is negatively correlated with in RC2 and RC3 in eight cases, suggesting that the amplitude of the mean annual cycle of runoff tends to be underestimated if is large. In RC3 significant positive correlations between and are found for nine out of ten models, suggesting that errors in the standard deviation depend on the elevation difference between the catchments and the corresponding grid cell.
 The linear correlation coefficient between the mean annual cycle of observed and simulated runoff (r) is negatively correlated with mean temperature for eight out of ten models in RC1. This shows that r is more likely to have low values for warmer mean temperatures in this snow dominated regime class. In RC1, r is in six cases positively correlated with the elevation difference between grid cell and catchment, suggesting that differences in the shape of the mean annual cycle are less pronounced if the difference between grid cell and mean catchment elevation is small.
5.1. Sources of Uncertainty
 This study revealed large uncertainties associated with modeling the mean annual cycle of runoff on large (continental) scales. These uncertainties are characterized by large differences among the models, and the hydroclimatic regimes, and especially by the large spread in model performance between the individual grid cells within each regime class. In the following sections various sources of error that may contribute to the observed patterns are discussed.
5.1.1. Scale Mismatch and Uncertainty in Observed Runoff
 Most runoff observations correspond to an area which is about one order of magnitude smaller than the corresponding grid cell (Figure 1). Thus it might be expected that some of the differences between observed and simulated runoff values are directly related to this scale mismatch, as distinct aspects of the hydrological response are dependent on the size of the catchment [e.g., Blöschl and Sivapalan, 1997]. However, there is virtually no correlation between catchment area and performance metrics, suggesting that there is no systematic error. The fact that the coordinates of the gauging station were used to allocate catchments to grid cells may introduce some further imprecisions if the majority of the catchment area lies outside the grid cell. Finally, errors in observed runoff are closely related to the configuration of individual gauging stations and uncertainties in the rating curves [e.g., Di Baldassarre and Montanari, 2009; Tallaksen and van Lanen, 2004, chapter 4].
5.1.2. Bias and Uncertainties in Precipitation Input
 The general tendency of most models to underestimate runoff shows that a negative bias is an issue. Further, the systematic differences in the relative bias between the regime classes suggest that this bias depends on the geographic region. Within each regime class the spread in the median relative bias is comparably small, suggesting that the models have similar long-term runoff rates.
 It is widely recognized that gridded atmospheric variables (such as the WFD used to force the models) are generally biased. Errors in precipitation in particular have triggered a large body of literature concerned with bias correction techniques (see Maraun et al. , Winkler et al. [2011a, 2011b] and Gudmundsson et al. [2012b] for comprehensive reviews). If the grid values are estimated using statistical methods, both interpolation technique and station density will impact the interpolation error. This is often most pronounced in mountainous regions, such as the Alps or the Scandinavian West Coast. In these regions, the precise estimation of precipitation amounts is hampered by the large spatial variability of precipitation, paired with a low station density (see Hofstra et al. [2008, 2009, 2010] for empirical evidence in Europe). Another approach to produce gridded variables is reanalysis, which uses dynamical models to estimate atmospheric conditions at locations without observations. The precision of atmospheric models is limited by their spatial resolution [Rauscher et al., 2010] and they are repeatedly reported to have large uncertainties in mountainous regions with complex topography (see, e.g., Adam et al.  for global and Schmidli et al. [2006, 2007]; Barstad et al. ; Themeßl et al. ; Gudmundsson et al. [2012b] for analysis in Europe). The WFD are a hybrid between reanalysis and interpolated data and do not account explicitly for the influence of orographic effects [Weedon et al., 2010, 2011].
 Independent climate observations (not available in this study) would be needed to fully assess the influence of biases in the input data. However, a comparison of long-term averages of observed runoff and estimated precipitation used to force the models allows for a simple check of the water balance components. A runoff ratio larger than one is physically unlikely (but not impossible) and suggests that the estimated precipitation is too low. The number of grid cells with a runoff ratio larger than one for each regime class is RC1: 27, RC2: 9 and RC3: 0. Most grid cells in RC1 are located in mountainous terrain (Alps, Scandinavian mountains), where orography and airflow direction have major influence on precipitation amounts (see Figures 1c and 2). In this context it is also noteworthy that the runoff ratio is the variable yielding the strongest (negative) correlation with the relative bias in mean runoff throughout all regime classes, i.e., the ensemble members consistently underestimate runoff if the runoff ratio is large.
 We cannot rule out that processes such as glacier melting influence the observed differences in the relative bias between the regime classes. The fact that most instances with a runoff ratio larger than one occur in RC1, where glaciers are recurrent features of the landscape, suggests such effects. However, along the Norwegian West Coast, where most of these instances occur, no systematic melting of glaciers has been observed in the study period [Andreassen et al., 2005]. There is, further, some indication that glaciers have only a minor impact on the runoff observations, as the catchments in the EWA data set have little or no influence of glaciers (Kerstin Stahl, personal communication).
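The water balance check described above amounts to counting grid cells whose long-term observed runoff exceeds the long-term forcing precipitation. A minimal sketch, using a hypothetical data layout and invented values rather than the study's data, could look like this:

```python
def count_ratio_above_one(cells):
    """Count, per regime class, the grid cells whose long-term runoff
    ratio (observed runoff / forcing precipitation) exceeds one.
    `cells` is a list of (regime_class, runoff, precipitation) tuples
    with runoff and precipitation in the same units (e.g., mm/yr)."""
    counts = {}
    for regime, runoff, precip in cells:
        counts.setdefault(regime, 0)
        if precip > 0 and runoff / precip > 1.0:
            counts[regime] += 1
    return counts

# Hypothetical long-term averages in mm/yr.
cells = [
    ("RC1", 2100.0, 1800.0),  # more runoff than precipitation: suspicious
    ("RC1", 900.0, 1500.0),
    ("RC2", 400.0, 800.0),
    ("RC3", 300.0, 700.0),
]
print(count_ratio_above_one(cells))
```

A flagged cell does not prove that precipitation is underestimated, since storage changes such as glacier melt can also push the ratio above one, but it identifies locations where the forcing data deserve a closer look.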
5.1.3. Temperature, Snow Dynamics, and Subgrid Variability
 Both evapotranspiration and snow dynamics, which dominate the seasonality of runoff in Europe, are functions of the surface energy balance. Thus, it is noteworthy that the correlation of observed and simulated mean annual cycle (r) is significantly related to the mean temperature, as well as to the elevation difference between grid cell and catchment, in the snow dominated Regime Class 1. The negative correlations between r and mean temperature in RC1 suggest that model deficiencies in simulating snow dynamics become less pronounced under colder conditions. Similarly, positive correlations between the elevation difference and r indicate that these deficiencies gain importance if the mean grid cell elevation is substantially lower than the catchment elevation. In principle, snow dynamics are relatively well understood at small scales and known to be a function of precipitation, energy balance and topographic features [e.g., Liston et al., 2007]. However, describing these processes at a coarser resolution in large-scale hydrological models makes simplifications necessary, which are subject to ongoing research. Modeling approaches range from formulations based on the surface energy balance (see Boone and Etchevers for an intercomparison of snow modules with differing complexity) to empirical equations that describe the snow pack as a function of temperature, like the “degree-day” method [e.g., Molini et al., 2011]. Possible deficiencies in the modeling of snow dynamics range from shortcomings in representing snow physics [e.g., Nijssen et al., 2003; Bowling et al., 2003] to effects of grid cell size and parameterization of subgrid variability [e.g., Boone et al., 2004; Hamlet et al., 2005; Dirmeyer, 2011].
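To make the temperature dependence of the simpler snow schemes concrete, the following sketch implements a generic degree-day formulation. It does not correspond to the snow module of any model evaluated here; the degree-day factor, the threshold temperature and the input series are hypothetical values chosen for illustration.

```python
def degree_day_snow(temps, precips, ddf=3.0, t_thresh=0.0):
    """Generic degree-day snow scheme (illustrative only).
    temps: daily mean air temperature (deg C)
    precips: daily precipitation (mm)
    ddf: degree-day factor (mm per deg C per day), hypothetical value
    t_thresh: temperature separating snowfall from rain and
              freezing from melting (deg C), hypothetical value
    Returns (daily liquid water input to the soil, final snow storage)."""
    swe = 0.0  # snow water equivalent in storage (mm)
    liquid = []
    for t, p in zip(temps, precips):
        if t <= t_thresh:
            swe += p      # precipitation accumulates as snow
            rain = 0.0
        else:
            rain = p      # precipitation falls as rain
        # Potential melt grows linearly with the temperature excess,
        # but cannot exceed the snow that is actually stored.
        melt = min(swe, ddf * max(t - t_thresh, 0.0))
        swe -= melt
        liquid.append(rain + melt)
    return liquid, swe

temps = [-5.0, -2.0, 1.0, 4.0]
precips = [10.0, 10.0, 0.0, 5.0]
liquid, swe_left = degree_day_snow(temps, precips)
print(liquid, swe_left)
```

Because both accumulation and melt switch on a single grid cell temperature, a grid cell that is lower and therefore warmer than the catchment it represents will tend to melt its snow too early, which is consistent with the sensitivity of r to temperature and elevation difference discussed above.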
5.1.4. Decisions Made in the Model Building Process
 Several sources of uncertainty that have not been mentioned so far are directly related to choices made throughout the model building process. Among these are the conceptualizations of the physical system, data sources and equations used to derive model parameters (e.g., use of pedotransfer functions to relate soil texture to soil hydraulic properties) and also decisions made throughout model implementation and setup. Several studies have shown that rather technical aspects such as the data product used to estimate soil hydraulic properties [e.g., Gutmann and Small, 2005; Teuling et al., 2009; Guillod et al., 2012], the methods used for numerical integration [Clark and Kavetski, 2010; Kavetski and Clark, 2010, 2011] and a wide range of further decisions made by the modeler [e.g., Bormann et al., 2011] can have a bigger impact on differences in model outputs than differences in the model physics. Therefore, it was not feasible to attribute model error to specific model components in this study, where the model simulations were analyzed a posteriori. A systematic evaluation of model assumptions, or even a test of different model hypotheses, is in principle possible, but requires full control over each model to ensure comparable simulation setups. This view is increasingly being put forward in the recent hydrological literature, where the “process-based evaluation of model hypotheses” [Clark et al., 2008, 2011a, 2011b] is advocated, paired with the need for transparency on (1) model assumptions, (2) parameter identification, (3) forcing data and (4) numerical implementation.
5.2. Performance of the Ensemble Mean
 The large discrepancies found among the models in this study are contrasted by the fact that the ENSEMBLE, the mean of all model simulations, on average outperforms most individual models and yields consistently good results for all performance metrics and regime classes (Table 3). Similar results are repeatedly reported for various variables of the terrestrial water cycle such as soil moisture [Guo et al., 2006, 2007], discharge from large river basins [Hagemann and Jacob, 2007; Materia et al., 2010] and continental scale summaries of annual low and high flows [Gudmundsson et al., 2012a]. The reasons for the good performance of ensemble means are not fully understood. In the context of climate modeling it has been suggested that the averaging procedure reduces the uncertainty of the different parameterizations [Reichler and Kim, 2008].
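One property of ensemble averaging can be shown directly: for a squared-error measure, the error of the ensemble mean can never exceed the average error of the members, because partly opposing member errors cancel. The sketch below demonstrates this with invented numbers. It does not explain why the ENSEMBLE outperforms most individual models in this study, only why it cannot do worse than the average member.

```python
def mse(obs, sim):
    return sum((s - o) ** 2 for o, s in zip(obs, sim)) / len(obs)

def ensemble_mean(members):
    """Element-wise mean over all member simulations."""
    return [sum(vals) / len(vals) for vals in zip(*members)]

# Hypothetical observed cycle and three hypothetical model simulations.
obs = [10.0, 30.0, 60.0, 40.0, 20.0, 10.0]
members = [
    [15.0, 40.0, 50.0, 30.0, 20.0, 10.0],  # peak too low
    [5.0, 20.0, 70.0, 50.0, 25.0, 10.0],   # peak too high
    [10.0, 25.0, 40.0, 60.0, 30.0, 15.0],  # peak too late
]

ens = ensemble_mean(members)
member_mse = [mse(obs, m) for m in members]
print("member MSE:", member_mse)
print("mean member MSE:", sum(member_mse) / len(member_mse))
print("ensemble-mean MSE:", mse(obs, ens))
```

The gain from averaging depends on how different the member errors are: if all models share the same bias, for example because of a common forcing data set, averaging cannot remove it.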
 Unlike many other studies that have evaluated the performance of uncalibrated large-scale hydrological models using discharge from continental-scale river basins, this study relied on runoff observations from a large number of small, near-natural catchments. In contrast to discharge from continental-scale river basins, which aggregates over large heterogeneous areas, runoff from small (grid cell scale) catchments captures the spatial variability of the terrestrial water cycle and is therefore more closely related to other variables of interest such as soil moisture, evapotranspiration or snow water equivalent.
 Model performance was evaluated for snow dominated, evapotranspiration dominated and mixed hydroclimatic regime classes. Overall model performance was found to be best in the evapotranspiration regime and worst in the snow regime. This shows that climate is a primary control of runoff dynamics and that it is crucial for large-scale hydrological models to capture the associated processes. The use of a set of performance metrics, each associated with different aspects of the annual cycle (mean, amplitude and timing), allowed us to distinguish different aspects of model error. A decomposition of the mean square error showed that the low model performance in the snow dominated regime class is primarily related to issues in the timing of the mean annual cycle. This in turn is associated with temperature data, as well as the parameterization of subgrid variability of elevation with respect to snow dynamics.
 The spatial patterns of biases in simulated runoff suggest that these are related to biases in the forcing data, although other factors cannot be ruled out. This result is most likely related to the fact that the global reanalysis product used to force the models cannot resolve the influence of complex topography on atmospheric variables, such as the orographic effect on precipitation.
 The large differences among the models are contrasted by the overall good performance of the ENSEMBLE mean. This result emphasizes that, given the current imperfect understanding of the terrestrial water cycle on large (continental) scales, the use of ensemble techniques is a pragmatic approach to increase the confidence in model simulations, at least when climatological averages (such as hydroclimatic regimes) are considered.
 A secondary aim of this study was to assess the possibilities for “diagnostic” model evaluation, which would allow model performance to be related to the model assumptions. Despite the pronounced differences in performance among the models, it was not feasible to relate these to distinct model structures within this study. This is partly because each model simulation depends on many choices made by the modeler, which are usually not available for a posteriori analysis of simulation results. However, a systematic evaluation of isolated model assumptions or data sources can be achieved, for example, by developing a modeling workbench for large-scale hydrological models that allows for systematic exchange of model components. Similar modeling workbenches have been suggested for catchment modeling [e.g., Clark et al., 2008]. Such modular frameworks are increasingly used to conduct controlled modeling experiments, where the plausibility of different assumptions can be tested. Given the major uncertainties in current large-scale hydrological modeling, this may help to constrain the degrees of freedom involved in model building, increase our understanding of the real physical system and make global and continental-scale simulations of the terrestrial water cycle more reliable.
Appendix A: Catchment Characteristics and Climatic Conditions
Figure A1 displays maps of climatic conditions and catchment properties that are used throughout the paper, but not displayed in the main body of the article.
 This research contributes to the European Union (FP6) funded Integrated Project WATCH (contract 036946). The provision of streamflow data by all agencies that contributed data to the EWA-FRIEND or to the WATCH project is gratefully acknowledged. The contribution of Thorsten Wagener was partially supported by the Alexander von Humboldt Foundation.