Identification of factors influencing hydrologic model performance using a top‐down approach in a large number of U.S. catchments

Abstract Investigating the performance that can be achieved with different hydrological models across catchments with varying characteristics is a requirement for identifying an adequate model for any catchment, gauged or ungauged, just based on information about its climate and catchment properties. As parameter uncertainty increases with the number of model parameters, it is important not only to identify a model achieving good results but also to aim at the simplest model still able to provide acceptable results. The main objective of this study is to identify the climate and catchment properties determining the minimal required complexity of a hydrological model. As previous studies indicate that the required model complexity varies with the temporal scale, the study considers the performance at the daily, monthly, and annual timescales. In agreement with previous studies, the results show that catchments located in arid areas tend to be more difficult to model. They therefore require more complex models for achieving an acceptable performance. For determining which other factors influence model performance, an analysis was carried out for four catchment groups (snowy, arid, and eastern and western catchments). The results show that the baseflow and aridity indices are the most consistent predictors of model performance across catchment groups and timescales. Both properties are negatively correlated with model performance. Other relevant predictors are the fraction of snow in the annual precipitation (negative correlation with model performance), soil depth (negative correlation with model performance), and some other soil properties. It was observed that the sign of the correlation between the catchment characteristics and model performance varies between clusters in some cases, stressing the difficulties encountered in large sample analyses. Regarding the impact of the timescale, the study confirmed previous results indicating that more complex models are needed for shorter timescales.


| INTRODUCTION
Hydrologic models can be developed following bottom-up or topdown strategies (Littlewood, Croke, Jakeman, & Sivapalan, 2003). The former relies on physically based models that describe the system of interest considering the processes observed at small spatial and temporal scales using relationships derived through observation and experimentation. As this approach has been very successful in physics (e.g., universal law of gravity), there has been much interest in applying it to other areas (Dooge, 1986). However, unlike physics, other disciplines face difficulties in conducting reproducible experiments in which the influential variables can be successfully controlled (Young, Parkinson, & Lees, 1996). This has not prevented its spread to hydrology, a field that already had high expectations for this type of models 45 years ago. It was envisioned that developments resulting from more sophisticated observational methods would lead to improved mathematical descriptions solvable with increasing computing power (Freeze & Harlan, 1969). Much work has been done in this area, and today, physically based models are widely used (Beven, 2002).
The top-down approach, on the other hand, starts by looking at the system at a large scale. Depending on our study target, this might mean that we either (a) use a large number of entities and apply statistical techniques for inferring the properties of individual items (Dooge, 1986) or (b) analyse, for instance, a catchment at a large scale at then try to infer the processes happening at smaller temporal and spatial scales in the same catchment.
Statistical top-down approaches do not explain how systems work; instead, they aim at extracting information about specific catchments using a large databases. Although these applications might be considered black boxes, offering little opportunity for learning about how systems function (Littlewood et al., 2003), it is recognized that this approach has a large potential when adding a postprocessing step for interpreting the results and looking for patterns (Sivapalan, Blöschl, Zhang, & Vertessy, 2003).
Applications of the top-down approach for individual catchments often focus on identifying the required model complexity for a given application. In such cases, the analysis starts with a simple model. The modeller assesses then the performance of the model and tries to identify which missing process might be responsible for the deviations between the observed and measured hydrographs (or other system responses). This process is then added to the model, and the procedure is repeated until model performance is adequate. This top-down approach for individual catchments was introduced into hydrology by Klemeš (1983) and has been widely used in since then (Atkinson, Woods, & Sivapalan, 2002;Bai, Wagener, & Reed, 2009;Eder, Sivapalan, & Nachtnebel, 2003;Farmer, Sivapalan, & Jothityangkoon, 2003;Montanari, Sivapalan, & Montanari, 2006;Sivapalan et al., 2003). It has helped to increase our understanding about the dominant processes relevant at different time scales and locations, as well as about the relationships between model complexity, time scale and catchment, and climate properties. Bai et al. (2009) explain that one goal of these studies is to enable the selection of an adequate model for a given catchment by looking just at the climate and catchment properties and thus not have to rely on discharge measurements. This requires that the analyses are carried out for a high number of catchments covering a large part of the observed variability, so that the relationships between catchment properties and model performance can be identified with confidence (Gupta et al., 2014). This approach has thus elements from the statistical top-down approach (i.e., obtaining information about individual catchments on the basis of insight gained through the analysis of many catchments) and from the top-down approach for each of these catchments (i.e., going from large to smaller scales).
This study is inspired by the work of Atkinson et al. (2002) that looked at the relationship between model complexity and timescale in nine New Zealand catchments. They inferred that more complex models are required for modelling drier catchments and for analyses at higher temporal resolutions. Because Atkinson's study considered only few catchments in a smaller area, it is useful to find out if the conclusions are also valid in other places, or if we need to identify more adequate and local predictors between model performance and catchment properties. Motivated by this question, this study is based on a systematic top down analysis with eight conceptual hydrological models of varying complexity in 574 catchments in the United States.
These catchments cover a large range of climate and catchment properties and are therefore suitable for analysing the relationships between climate/catchment properties and model performance.
Although Atkinson et al. (2002) analysed only few catchments and thus were able to evaluate model performance visually, it was necessary to resort to another approach for dealing with the high number of catchments in the present study. This approach is based on the framework developed by Bai et al. (2009), which allows an automatized analysis of model performance. The main objective of this study is to identify relationships between catchment properties, model complexity, timescale, and model performance. This information might be helpful for identifying the models able to reach an acceptable performance for a given signature in a defined catchment. This is especially important in ungauged basins, where the lack of discharge data precludes an a priori comparison of model performance.

| Data and catchments
This study uses a subset of the GAGES II reference basins. It can be therefore expected that they are only minimally affected by human disturbances. Daily discharge time series were downloaded from the USGS webpage, and all 574 catchments without discharge data gaps between October 1, 1981, andSeptember 30, 2008 (27 years), were considered in this study.
The Daymet 2 dataset was used as source for the weather data (Thornton et al., 2014;Thornton, Running, & White, 1997). This dataset was chosen as a modelling study of U.S. catchments found that a previous Daymet version (Daymet 1) achieved a good performance (Newman et al., 2015). The 1980-2010 timeseries of the daily maximum and minimum temperatures ( C) and precipitation (mm d −1 ) were extracted for each catchment using the GAGES II catchment delineations. The potential evapotranspiration (PET), required as input to the hydrological models, was estimated according to Hargraeves and Samani (1982).
For learning about how catchments properties relate to model performance, the catchments were characterized with respect to climate, geology, landscape, topography, soil, and land use (Table 1).
Summary statistics for the catchments are shown in Table 2. The catchments with a runoff coefficient larger than one were excluded from subsequent analyses after being identified as problematic in a screening step, which aimed at identifying the catchments that could not be properly modelled. Because there is no straightforward way of doing this, it was decided to focus on model performance with respect to bias. This identifies the catchments for which forcing data results in simulations that do not replicate well the long-term water balance.
Although the catchments passing this test are able to replicate the water balance, it must be noted that they still might have errors describing the short-term dynamics. The implementation of this test resulted in the removal of 13 catchments. An analysis of the reasons for the poor performance of these catchments revealed that some of them had runoff coefficients larger than one, whereas the remaining had timing errors attributable to snow related problems.

| Description of the model structures
The models used in this study were initially developed in the context of the work carried out by Atkinson et al. (2002), Farmer et al. (2003), and Jothityangkoon, Sivapalan, and Farmer (2001). They were then used with some minor modifications by Bai et al. (2009). For the present study, it was necessary to add a snow routine computing snow accumulation and melt in 100-m elevation bands using a temperatureindex-based approach.
The hydrological models are composed of different evapotranspiration, storage, and routing modules, which can be combined in various ways for testing alternative hypotheses about catchment behaviour ( Figure 1). There are two evapotranspiration modules, both allowing for evaporation losses due to interception. Another feature shared by both modules is that they estimate separately the evapotranspiration for the fraction of the catchment covered by deeprooted vegetation and for bare soil areas. In this way, it is possible that the actual evapotranspiration equals the PET rate when soil moisture is above field capacity and the soil is covered by deep-rooted vegetation. The difference between both modules is that the ET2 module distinguishes a saturated and an unsaturated zone in the soil bucket, whereas ET1 is composed by a single zone.
Model S1 consists of the evapotranspiration module ET1 and a soil storage producing discharge via saturation excess. Model S2 is additionally able to produce discharge through saturation flow once the water content in the soil storage is above field capacity. The evapotranspiration module ET2 is included in Models S3 and S4. The difference between these two models is an additional deep-water bucket releasing groundwater in Model S4. Models M1 to M4 have the same structure as Models S1 to S4, but they differ in their assumptions about the distribution of the soil storage capacity in the landscape. The models from the S series assume a homogeneous storage and have thus a single bucket. On the other hand, the M models consist of 10 soil stores with varying depths, accounting in this way for variations in the soil storage capacity across catchments. The model equations and description of the parameters are provided in the Supporting Information.
An inspection of the model structures shows that there are two complexity axes: (a) the M/S axis, where the (multistore) M-version of each model is considered more complex than its (single store) S-version, and (b) the increase in complexity from Models 1 to 4 (i.e., S1 < S2 < S3 < S4 and M1 < M2 < M3 < M4). Although an overall summary of model performance for all models is presented in Section 3.1, the remaining analysis only considers one complexity axis and focuses therefore on the results obtained by Models S1 to S4 and Model M4.
The models were run for the period between the October 1, 1981, and September 30, 2008. As the first 3 years were for spinning up, the analysis was done for the remaining 24 years.

| Monte Carlo modelling framework
The models were evaluated within a Monte Carlo (MC) framework. All parameters were sampled from a uniform distribution. The sampling

Geology/Lithology (G)
Hydraulic permeability (Gleeson et al., 2011), 18 classes of surface lithology (Sayre et al., 2009), baseflow index (Wolock, 2003) Landscape/Morphology (M) 10 classes of land surface forms (Sayre et al., 2009) and 5 classes of topographic moisture potential (Sayre et al., 2009) Topography (T) Elevation (mean, minimum, maximum, and range), slopes, and exposition predictors estimated from NED (National Elevation Dataset) Soil (S) 138 soil properties (gSSURGO from Soil Survey Staff, 2015) Land use (L) 16 land use classes from the National Land Cover Database 2001 (Homer et al., 2007) ranges for the evapotranspiration, soil, and routing parameters were adopted from previous studies (Bai et al., 2009), whereas the ranges of snow accumulation threshold were taken from Rajagopal and Harpold (2016). The ranges of the snowmelt parameters were rather wide, accounting for the temperature biases that can be encountered in meteorological datasets (Behnke et al., 2016). F I G U R E 1 Sketch of model structures. All models have as input the precipitation (P), daily mean temperature (Tm), and potential evapotranspiration (PET) time series. Parameter names are in green (see table) and state variables in blue (St, level in soil storage; Sus, level in unsaturated compartment; Ssat, level in saturated compartment; Sgw, level in groundwater storage). The fluxes are in black: Ebs and Ev (evapotranspiration from bare soil and vegetated land); Qse, Qsat, and Qgw (discharge from excess overflow, saturated flow, and groundwater flow); and Qr (recharge) In an initial step, it was tested how many samples were needed for obtaining stable results. As 40,000 samples were found to provide robust estimates under repeat sampling, this number of parameter sets was drawn and then used for all models and catchments.

| Hydrological signatures
Hydrological signatures characterize different aspects of the catchment response, in this case, the discharge, with respect to which models are assessed. The signatures used in this study are based on the set used by Jothityangkoon et al. (2001), Atkinson et al. (2002), and Farmer et al. (2003) for analysing the water balance at different timescales. For the sake of comparability with previous studies and as this set of metrics is still being used, although with some minor modifications (e.g., Bai et al., 2009;Coxon, Freer, Wagener, Odoni, & Clark, 2014), they were the first choice for the current analysis.
Sorted from longer to shorter timescales, the signatures are as follows: • the interannual discharge variability, which is calculated as the total annual discharge (i.e., for a period of 24 years, there are 24 data points).
• Two signatures at the monthly timescale: the Pardé signature and the monthly time series. The first signature is based on the Pardé coefficients (Pardé, 1933), which are used for characterizing the flow regime (seasonal runoff????????). They are estimated as the ratio between the mean monthly discharge and the mean annual discharge (i.e., there are 12 data points, each representing the mean discharge for 1 month). The intra-annual signature consists on the timeseries of the monthly discharge; this means that for a 24-year period, there are 288 data points.
• Two signatures at the daily timescale: the discharge time series and the flow duration curve (FDC). Both signatures are based on the daily discharge, but it is expected that models are more successful at reproducing the FDC, as only the distribution of discharge-and not the timing-needs to be matched.

| Evaluation measures
The Kling-Gupta efficiency (KGE) was used for quantifying model performance across these signatures. In other words, the KGE was used for assessing how well different MC runs are able to reproduce the signatures. It is calculated as (Gupta, Kling, Yilmaz, & Martinez, 2009): This metric is a combination of the bias, standard deviation, and correlation errors, each focusing on different aspects determining how well a model agrees with the observations. The bias and standard deviation error indicate how well models are able reproduce the mean and the variability of the discharge time series: with Q s and Q o representing the simulated and observed mean discharge over the entire period and std(Q s ) and std(Q o ) standing for the standard deviation of the simulated and observed discharge time series, respectively. Pearson's correlation coefficient r characterizes the performance of the models regarding timing and shape. It can be seen that the KGE can reach a best value of 1 when the bias and standard deviation error are zero (i.e., the modelled and observed values are equal) and r is one.

| Study set-up
The first step aimed at replicating the conclusions of Atkinson et al. (2002), stating that more complex models are needed for more arid catchments and shorter timescales. Although Atkinson et al. (2002) relied on a visual assessment for determining which models had an "acceptable performance," it was necessary to define a threshold for automatizing this task in the present study. The choice of a threshold will always be subjective (Diskin & Simon, 1977), but it should be goaloriented and take into account the available data (Guthke, 2017;Laiti et al., 2018). As this study aimed at comparing model performance at different timescales and for different model complexities, the threshold needs to take into account the abilities of the considered models at different timescales. Some test allowed to conclude that the best option was using three different thresholds. This prevents selecting a threshold that is too low, for example, at the interannual scale (meaning that almost all models would be accepted) but, at the same time, too high for models to be accepted at the daily timescale. There is a threshold for the daily timescale (KGE = 0.70), one for the FDC timescale (KGE = 0.85), and another for the annual and monthly timescales (KGE = 0.80). The ranking of the threshold values agrees with the KGE ranking of the timescales, as the best results were achieved for the FDC and the worst for the daily timescale. The threshold values are in between the lowest and highest median KGE values achieved with different models at each timescale (see Table 3), allowing to see the impact of using models with different complexities. It must be considered that it is possible to cluster the catchments according to many factors, for instance, climate and catchment properties, the value of signatures, or model performance.
The results of such an exercise will also vary depending on the selected clustering algorithm, its parameterization, and the considered catchments. This indicates that this clustering of catchments into different groups will never be "objective" and that we should aim at a classification that is reproducible and adequate for fulfilling its intended purpose. In this study, the clusters were defined manually on the basis of maps of model performance and the distribu-  An analysis of the variability of model performance across timescales shows that the differences between models tend to be more pronounced at the daily scale than at the monthly or annual scales.
For example, the KGE difference between the model with best and worst performance is about 0.5 at the daily timescale, 0.16 at the intra-annual timescale, and about 0.05 at the annual scale. This variation in performance depends on the complexity of the considered model, as more complex models have more homogenous results across timescales than simpler models. This can be seen for instance, when looking at the KGE difference between the daily and annual timescales, which equals 0.52 for Model S1, about 0.12 for Models S2 and S3, and 0.08 for Model M4.

| Assessment of the adequacy of using the aridity index a performance predictor
This section investigates if the aridity index can be related to model performance as shown in Atkinson et al. (2002) who found that arid catchments require more complex models for achieving acceptable performances. Figure 2 shows the fraction of catchments for which each model is the simplest one achieving a performance above the threshold. The target, for example, of the Pardé signature in the second aridity class (i.e., aridity index 0.

| Identification of spatial patterns in model performance
The results presented in the previous section highlight that aridity is an important factor determining model performance across catchments. This section explores the impact of other catchment and climate properties, besides aridity, on model performance. This is done by separating the United States into zones with similar model performance and analysing each of these clusters separately. Figure 3 presents the best performance obtained with Models S1, S3, and M4 for each timescale in a series of maps. A pattern that can be identified at each timescale, especially using Model S3, is the variation in model performance from West to East: Model performance decreases from the West coast to the interior until the central part, from which on performance tends to increase again towards the East.
This pattern was used for clustering the catchments into four groups.
In the first step, two clusters were built by separating the western from the eastern catchments. In the second step, each of these groups was separated again into two, totalling four clusters (Figure 4). The clusters were delineated on the basis of the value of three catchment predictors selected according to the procedure described in the methodology, that is, by comparing the spatial patterns of maps with catchment properties with the spatial patterns of model performance maps.
The separation of the catchments into a western and an eastern cluster was carried out on the basis of the percentage of mountains  subsurface water flow (Sawicz, Wagener, Sivapalan, Troch, & Carrillo, 2011;Winter, 2001) and has an influence on climate (Siler, Roe, & Durran, 2013). It might therefore be argued that mountains are closely related to the snow in Cluster 2 and that the higher precipitation in the Western coast (Cluster 1) might be enhanced by orographic precipitation (Roe et al., 2003) justifying in this way the use of this predictor for defining the clusters.  We further see that more complex models have a much smaller potential for improving the performance in Clusters 1 and 4, while being especially successful at improving the performance in snowy catchments (Cluster 2). A comparison of the increase in performance achieved when going from Model S4 to M4 (compare S4-S1 with M4-S1) shows that Cluster 3 (arid) benefits most from a shift to the most complex model.

| Model performance for the clusters
Overall, Models S3 and M4 show much smaller increases in performance than Models S2 and S4. In some cases, Models S3 and M4 even have a lower performance than Models S2 and S4, respectively.
Model S3 should be avoided for modelling snowy catchments, as its performance is considerably lower than the one for Model S2. These catchments should instead be modelled with Model S4, which clearly has the best performance for this cluster.

| Identifying local predictors important for each cluster
The next step is to identify relationships between the predictors and model performance in each cluster. This was done using Spearman's correlation coefficient between the value of the predictors and the KGE. The thick black borders in Figure 7 highlight the relationships with a significant correlation (p < .01). It is interesting to note that, for some predictors, the sign of the correlation varies with the cluster, for suggesting that they have a higher proportion of loamy and silty soils, which are the ones with largest water holding capacity. Because climate is a main factor affecting soil formation, it is not unreasonable to assume that soils in the northern part are the result of the wetter conditions (coevolution). It is thus not possible to know if performance is primarily affected by the soil class or the climate; it is only possible to say that there is a significant correlation of these properties with model performance. Finally, there is a negative correlation between performance and the baseflow index (BFI), which is, however, not easily spotted by looking at the maps.
In Cluster 2, performance is positively correlated with elevation and snow. Because these predictors show no spatial pattern, they are not shown in Figure 8. There is also a positive correlation with precipitation, which decreases from NW to SE, and a negative correlation F I G U R E 6 Average difference in performance (KGE) achieved if Models S2, S3, S4, and M4 are used instead of Model S1. Analysis for all clusters and timescales F I G U R E 5 Boxplots with the KGE performance of Models S1, S2, S3, S4, and M4 for each cluster and timescale. The light grey line identifies the performance threshold of 0.85 KGE. There is one outlier catchment now shown here for Model S1 at the FDC and daily timescales and two outliers not shown for the Pardé and interannual signatures for Models S1, S2, and S3 F I G U R E 7 Spearman's correlation between 21 predictors and model performance (KGE) for each timescale, cluster, and Models S1 to S4 and M4. Values with a thick black border are significant (p < .01) with the PET, which is more pronounced in the North and Eastern catchments and agrees well with the patterns for the aridity index.
Model performance also tends to be negatively correlated with the BFI and positively correlated with the percentage soil in Class A. It was, however, not possible to identify spatial patterns for these properties.  Figure 8). The question of choosing a model of adequate performance has attracted much interest in hydrology. Although it is expected that more complex models can achieve better performances than simpler models, there are different reasons why this might not be the case. Perrin, Michel, and Andreassian (2001) concluded from an analysis in 429 catchments with 20 model structures that more complex models might provide better results during calibration-due to overfittingbut that models with more than three to five parameters did not have a better performance than simpler models during validation. This agrees well with the findings of Jakeman and Hornberger (1993) who estimated that only about three parameters could be identified in a conceptual rainfall-runoff model when only discharge data are available for calibration. Performance, and the supported model complexity, will therefore be different whether we consider the models in calibration or validation mode.
Another reason why more complex models might not give better results than simpler ones is that the increased complexity might not address the reasons for the poor performance of a simpler model. This is a limitation also in this study, as it relies on eight already developed models that were applied in all catchments. Because of the large number of catchments, it was not possible to carry out the analysis as done by Atkinson et al. (2002), who could evaluate each model individually, identify the processes that might be limiting the results in each case, and then propose a model with increased complexity, which addressed these deficiencies. Instead, this study looked at models with increasing complexity, but it is well possible that the additional process (es) considered in more complex models are not responsible for the reduced performance of the simpler model. This seems to happen with Model S3, which in general does not improve performance but even decreases it in some cases ( Figure 6). Future studies using these models could therefore compare the performance of Model S4 with a new model adding a groundwater storage to the S2 structure, which might give better results than Model S4, which combines Model S3 with a groundwater storage.
With respect to the timescale, this study confirms that it is in general easier to obtain good modelling results at more aggregated timescales (Atkinson et al., 2002;Bai et al., 2009;Chiew, Stewardson, & McMahon, 1993). The reason for this is that signatures at higher time resolutions need to account for the short-term flow dynamics, something that often requires the use of more complex models. For lower temporal resolutions, we look at aggregated results and could, therefore, have a low ability for modelling the short-term fluctuations but still get the average right. This has implications for model comparison suggesting that for understanding and being able to appreciate the differences between models, it is necessary to analyse them at daily (or smaller) timescales. It further explains why the KGE values obtained for the different models are more homogeneous at the interannual timescale than at the daily scale (Table 3 and Figure 5).

| Relationship between model performance and catchment properties
There are numerous studies unveiling the relationships between model performance and catchment predictors. The most commonly identified relationship is with catchment aridity, which is found to correlate negatively with model performance (Gudmundsson, Wagener, Tallaksen, & Engeland, 2012;McMahon et al., 2013;van Esse et al., 2013;Weingartner et al., 2013). This study agrees with this, as there is a consistent negative correlation between performance and aridity for all clusters. Viglione et al. (2013) hypothesize that the reduction of performance in arid catchments might be caused by an increasing nonlinearity of the runoff processes in arid catchments, as in dryer catchments rainfall input and runoff might only be coupled for wetter conditions or high rainfall intensities. In addition, rainfall events are more episodic in arid catchments, making them less predictable and more difficult to model (Pilgrim, Chapman, & Doran, 1988). Another explanation might be that an adequate treatment of soil processes becomes more critical for achieving a good performance in more arid catchments. This is because the main task of a model in a wet catchment is to route the water to the outlet, which can be done in many different ways, resulting in a higher degree of flexibility. If there is no, or only a little input of water into the system, it is critical to model what happens to this water for being able to reproduce the hydrograph correctly. This task is hampered by our limited knowledge about the subsurface properties, especially the water holding capacity, which is critical for an adequate representation of these processes (Atkinson et al., 2002).
The importance of snow has also been identified as a factor affecting model performance. More specifically, model performance tends to improve with increasing impact of snow (Merz, Parajka, & Blöschl, 2009). One possible explanation for this relates to the use of day degree melt approaches in lumped models. Because snowmelt depends in this case on the temperature, which is a variable with lower uncertainty than precipitation, we also expect a more reliable output. Another reason is that snowmelt produces a distinct peak which results in a seasonally varying hydrograph. If models are evaluated with the Nash-Sutcliffe efficiency, it is easier to achieve a good performance when there is a seasonal signal in comparison with a catchment with more uniform discharge across the year (Schaefli & Gupta, 2007). In this study, we found that the sign of the correlation for the fraction of snow varies depending on the cluster. Although snowy catchments (Cluster 2) have a positive correlation with performance, this is not the case for the other three clusters. However, as snow does not play an important role in most catchments assigned to the nonsnowy cluster, it is not meaningful to estimate this correlation for those clusters.
The relationship between seasonality and model performance is difficult to assess due to the multiple ways that exist for defining and estimating seasonality (e.g., Berghuijs & Woods, 2016;Laaha & Blöschl, 2006). However, it is expected that, as in the case of snowy catchments, the seasonal variability of the hydrograph affects the ease with which high NSE values can be achieved.
With respect to the subsurface properties, this study found similar results to Coxon et al. (2014), who found that catchments with a lower BFI (less groundwater dominated) tend to be easier to model than catchments in which groundwater plays a more important role.
Although this might be partially explained by the lack of consideration of chalk-related processes the models, which were important in the catchments with high BFI in the Coxon et al. (2014)  found that performance improves with catchment area. This relation is, however, not that clear, as Merz et al. (2009) found no clear trend for catchments smaller than 2,000 km 2 and only a weak relationship for larger catchments. This positive correlation was attributed to the increasing number of meteorological stations considered in larger catchments, which is reflected in a reduced uncertainty in the forcing data. In this study, we found in some cases a negative and in other a positive correlation with area. An analysis of the spatial distribution of the catchment sizes showed that the large catchments in this dataset tend to be concentrated in the central, arid part of the United States, which exhibits a lower performance. As we expect that aridity will have a larger influence on model performance than catchment area, it does not seem helpful to estimate the correlation with area for the whole group of catchments together, as the results will reflect the impact of aridity rather than catchment size. With respect to slope, Parajka, Andréassian, et al. (2013) hypothesized that its relationship with model performance depends on how aridity varies with elevation (and therefore slope, which correlates with elevation).
This study also identified some relationships between soil properties and model performance. Soil depth has a consistent positive correlation with model performance across clusters, models, and timescales. The percentage of soil in Classes A and D also has a significant relationship with performance in many clusters and timescales.
These relationships are, however, not consistent between the clusters as the correlation is, in some cases, positive and, in others, negative.

| Model selection based on catchment properties
Being able to identify an adequate model based on the catchment properties is an important goal in hydrology. Although not an easy task (Ley, Hellebrand, Casper, & Fenicia, 2016;van Esse et al., 2013), there are some starting points that can guide further studies in this area.

| CONCLUSIONS
Focusing on the performance of models with increasing complexity and linking this to catchment predictors is a good way for identifying which factors contribute to making a catchment "easier" or "more difficult" to model. Catchments that are easier to model achieve acceptable results with a large range of models and give us more flexibility when deciding which model to use, also making this decision less relevant. On the other hand, when dealing with catchments that are harder to model, we face more constrains in the number of models we can use and will probably need to invest more effort in modelling for obtaining a satisfactory performance (Clark et al., 2008).
One important finding of this study is that the relationship between catchment predictors and model performance suggested by Atkinson et al. (2002) is also valid for the catchments used in this study. The results further agree with previous studies regarding (a) the predictors identified as important and the (b) relationship between performance and timescale. With respect to the relevant predictors, it was seen that the BFI and the aridity index result in consistent performance patterns across all clusters and timescales. They are both negatively correlated with performance, that is, arid and groundwaterdominated catchments tend to have lower performances. It was also seen that snow is a relevant predictor in areas with a large impact of snow. Another predictor with consistent performance is the soil depth, which correlates positively with performance. Many of these predictors are related to the subsurface properties, indicating that our lack of knowledge about the subsurface characteristics as well as the difficulties involved in "seeing" what happens below the surface are critical factors for improving our models. Regarding the timescales, it seems easier to get acceptable model results for longer than for shorter timescales. It was also seen that the benefit of more complex models becomes evident at shorter timescales.
A limitation of this study is that the results are not easily extrapolated to other hydrological models. Future studies could therefore carry out similar analyses for modular models (e.g., Clark et al., 2008;Fenicia, Kavetski, & Savenije, 2011), which might allow to link the differences in performance to specific structural units. Another route that could be taken by future studies is to determine the complexity of the catchments (e.g., Pande & Moayeri, 2018) and then link this metric to model performance.

DATA AVAILABLITY STATEMENT
This study uses publicly available datasets as described in the methodology. Intermediate results supporting the findings of this study are available from the corresponding author upon reasonable request.