Tradeoffs Between Temporal and Spatial Pattern Calibration and Their Impacts on Robustness and Transferability of Hydrologic Model Parameters to Ungauged Basins

Optimization of spatially consistent parameter fields is believed to increase the robustness of parameter estimation and its transferability to ungauged basins. The current paper extends previous multi‐objective and transferability studies by exploring the value of both multi‐basin and spatial pattern calibration of distributed hydrologic models as compared to single‐basin and single‐objective model calibrations, with respect to tradeoffs, performance and transferability. The mesoscale Hydrological Model (mHM) is used across six large central European basins. Model simulations are evaluated against streamflow observations at the basin outlets and remotely sensed evapotranspiration patterns. Several model validation experiments are performed through combinations of single‐ (temporal evaluation through discharge) and multi‐objective (temporal and spatial evaluation through discharge and spatial evapotranspiration patterns) calibrations with holdout experiments saving alternating basins for model evaluation. The study shows that there are very minimal tradeoffs between spatial and temporal performance objectives and that a joint calibration of multiple basins using multiple objective functions provides the most robust estimations of parameter fields that perform better when transferred to ungauged basins. The study indicates that particularly the multi‐basin calibration approach is key for robust parametrizations, and that the addition of an objective function tailored for matching spatial patterns of ET fields alters the spatial parameter fields while significantly improving the spatial pattern performance without any tradeoffs with discharge performance. In light of model equifinality, the minimal tradeoff between spatial and temporal performance shows that adding spatial pattern evaluation to the traditional temporal evaluation of hydrological models can assist in identifying optimal parameter sets.

Integration of satellite remote sensing data with distributed hydrological models has been a common path toward improving the reliability of hydrological model simulations (Dembélé et al., 2020). This development has followed the progress and availability of remotely sensed data sets, which have evolved significantly over the past decades, although the accuracy of satellite-based data sets remains variable (Ko et al., 2019; Stisen et al., 2021).
Several studies have addressed the impacts of adding remotely sensed observations to streamflow calibration (Odusanya et al., 2022; Rientjes et al., 2013; Sirisena et al., 2020). Nijzink et al. (2018) presented a large modeling effort illustrating the impact of adding different remotely sensed products across five different conceptual models and 27 European catchments. They analyzed 1023 possible model combinations regarding model constraint and showed an added value of remotely sensed data in the absence of streamflow data. In a recent model intercomparison paper, Mei et al. (2023) analyzed different model calibration strategies combining streamflow and global gridded soil moisture and evapotranspiration data sets. They found that adding soil moisture to the streamflow calibration improved evapotranspiration performance. Mei et al. (2023) also included a review of 16 previous papers on the subject of constraining models using a combination of streamflow and remotely sensed data. Both Nijzink et al. (2018) and Mei et al. (2023), as well as 14 of the 16 papers reviewed by Mei et al. (2023), applied spatially averaged time series of the remotely sensed data. With this approach, the spatial information in the satellite data is ignored and the hydrological model evaluation remains limited to the temporal component of the models. This traditional focus on temporal performance, defined as the comparison of simulated and observed time series, is often chosen because of the spatial resolution of the models and the remote sensing data, or because of a lack of spatial performance metrics and optimization frameworks. However, temporal model evaluation, whether in the form of metrics based on discharge records or basin-average remote sensing observations, falls short of evaluating the spatially distributed states or fluxes simulated by distributed hydrological models. Inadequate evaluation of simulated spatial patterns becomes particularly problematic when distributed models are used for detailed spatial predictions such as impacts of land cover change or changes in water management. If a model performs well on discharge at the outlet but misrepresents the relative contributions to streamflow from different regions or land covers, its relevance for predicting impacts of land cover changes will be low.
The fundamental idea behind spatial pattern-oriented calibration is to employ a multi-objective calibration framework that adds to the traditional discharge-based calibration an independent set of objective functions that mainly reflects the observed spatial pattern of key hydrological states or fluxes. This approach differs from multi-objective calibrations based on multiple metrics calculated from the same observation (e.g., streamflow time series) or on basin-average time series of remotely sensed data (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018). In addition, independence in the optimization approach can be obtained by adding the new information source in combination with a Pareto-achieving optimizer, which circumvents the need to join multiple objective functions into a single score (Mei et al., 2023).
A previous study by Zink et al. (2018) incorporated land surface temperature patterns in model calibration and showed that this helped to better constrain the model parameters connected to evapotranspiration when compared to calibrations based on streamflow only. Moreover, in their study the model performance regarding evapotranspiration increased at seven eddy flux measurement sites used for evaluation. Adding new constraints to calibration decreased streamflow performance, yet the authors of that study illustrated how land surface temperature data could secure better results for ungauged basins. For a single Danish basin, Demirel, Mai, et al. (2018) developed a spatial pattern-oriented calibration framework and a new spatial performance metric, and illustrated a small tradeoff between streamflow and spatial pattern performance. Dembélé et al. (2020) applied a similar calibration framework to a model study of the poorly gauged Volta River basin in West Africa. They showed that while streamflow and terrestrial water storage performance decreased by 7% and 6%, respectively, soil moisture and evapotranspiration performances increased by 105% and 26%, respectively, when including the spatial calibration framework with multiple objectives. Soltani, Bjerre, et al. (2021) illustrated how adding spatial pattern optimization to a national scale groundwater model improved evapotranspiration patterns and altered groundwater recharge patterns without deteriorating groundwater head and discharge performance significantly. Other recent studies, such as Xiao et al. (2022) and Ko et al. (2019), have utilized spatial patterns of land surface temperature for hydrological model evaluation. However, in the context of our current study, Xiao et al. (2022) and Ko et al. (2019) did not address the tradeoffs between different optimization strategies and streamflow performance.
As a result of the increased availability of remotely sensed data sets combined with machine learning approaches and computational power, many gridded spatial products are now available (Belgiu & Drăguţ, 2016; Feigl et al., 2022). Despite the varying accuracy of both satellite data and machine learning approaches, these gridded data sets facilitate the spatial characterization of hydrologic variables and fluxes and enable spatial model evaluations. However, to optimize the simulated spatial patterns of a hydrological model, the model parametrization scheme needs to be fully distributed and spatially flexible. In this context, the multi-scale parameter regionalization (MPR) method (Samaniego et al., 2010) represented a significant advancement, which was initially included in the mesoscale Hydrological Model (mHM) (Kumar, Samaniego, & Attinger, 2013; Samaniego et al., 2010). Subsequently, it has been incorporated into several other modeling frameworks (Lane et al., 2021; Mizukami et al., 2017; Tangdamrongsub et al., 2017) and it is available as a stand-alone parametrization tool that can be coupled to hydrological models (Schweppe et al., 2022). Other studies have developed similar flexible parametrization schemes based on pedo-transfer functions using gridded data (Feigl et al., 2020; Ko et al., 2019; Soltani, Bjerre, et al., 2021).
It is well known that streamflow calibration does not guarantee good spatial pattern performance (Rakovec, Kumar, Mai, et al., 2016; Stisen et al., 2011) and that performance on the initial single objective typically drops when additional objectives are added to the calibration. But what is the tradeoff between spatial and temporal performance? How do single- and multi-objective optimization impact parameter transferability, and how does this compare to the impacts of multi-basin optimization? Based on the above, we aim to address the following research gaps:
• What are the tradeoffs between temporal and spatial model performance investigated in a Pareto-achieving optimization framework?
• How do multi-basin and spatial pattern-oriented calibration impact model performance and transferability to ungauged basins?
In this study, we demonstrate the impact of multi-site and multi-objective calibration compared to single-site and single-objective parameter estimation, that is, the most common practice in hydrologic modeling, specifically in the context of parameter transferability to ungauged basins. The single- versus multi-objective comparison specifically addresses temporal versus spatial model evaluation. The impact on parameter transferability of adding spatial patterns to the model calibration is a novel aspect that has not received much attention in the literature so far. The study is conducted for six mesoscale central European basins.
The distributed modeling study is carried out in the framework of a flexible spatial model parameterization scheme in combination with observed spatial patterns of actual evapotranspiration (AET) derived from satellite data. We apply the mHM model code since it suits the applied calibration framework well due to its flexible model parametrization schemes based on pedo-transfer functions to distribute soil parameters and the built-in multi-scale parameter regionalization.
We design a set of model calibration experiments including single- and multi-basin as well as single- and multi-objective calibrations, and two jack-knife experiments, that is, sequentially keeping one or five of the six basins out of the joint calibration. Model simulations are evaluated based on temporal discharge performance and spatial AET performance using long-term average monthly pattern maps and appropriate objective functions. A global multi-objective Pareto-achieving search algorithm is applied to illustrate the exact tradeoff between the two objectives.

Methodology
Catchments, observed data (both Section 2.1), and the hydrologic model (Section 2.2) are presented in this section. The objective functions to evaluate the model performance of simulated discharge and AET, respectively, are described in Section 2.3. A sensitivity analysis performed to determine the most important parameters for model calibration is described in Section 2.4, while Section 2.5 describes the calibration and validation setup including a brief description of the multi-objective calibration algorithm applied.

Catchments and Hydro-Meteorological Data
This study is conducted using six European catchments, that is, Elbe, Main, Meuse, Mosel, Neckar, and Vienne, with drainage areas varying from 12,775 to 95,042 km². The catchments are spread over Central Europe and represent a diversity of soil texture, land use, and land cover. The mean annual rainfall varies from 637 to 874 mm while the mean annual runoff varies from 184 to 398 mm (Table 1). The six catchments are selected based on two criteria: good model performance obtained in previous studies (Rakovec, Kumar, Mai, et al., 2016) and spatial patterns of AET that are likely dominated by land-surface heterogeneity, that is, land cover and soil properties, rather than a strong climate gradient. The latter will facilitate a meaningful model calibration driven by spatial patterns since simulated patterns can be adjusted through the surface parametrization within the hydrological model and are not purely driven by climate (Koch et al., 2022). In basins with a large climate gradient, spatial patterns are typically easier to simulate even with a suboptimal spatial parametrization, since the patterns are to a lesser degree controlled by the model parameters and will display correct overall patterns enforced by the climate forcing data.
Average temperature, precipitation, and reference ET (ETref) data are available at daily time steps and over 0.25° grids for the period extending from 1980 to 2018, whereas the length of the observed daily discharge data varies between catchments. Daily averaged meteorological data (P and ETref) were obtained from the E-OBS and ERA-5 reanalysis data sets (Cornes et al., 2018; Hersbach et al., 2020). Reference ET was estimated with the Hargreaves-Samani model using ERA-5 air temperature data (daily minimum, maximum, and mean) as input (Hargreaves & Samani, 1985). The Hargreaves-Samani model is a parsimonious option with low data demand and reasonable accuracy (Pohl et al., 2023). It was chosen in this study based on previous experiments showing that different formulations of PET/ETref could be applied with no significant impact on how well the simulations capture the inter-annual variability across different European regions (Rakovec, Kumar, Mai, et al., 2016).
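As an illustration of the low data demand of this formulation, the following minimal sketch implements the Hargreaves-Samani equation in its commonly published form (as given, e.g., in FAO-56). It is not taken from the study's processing chain. The function name and the assumption that extraterrestrial radiation is supplied in MJ m-2 day-1 are choices made here for the example only.

```python
import numpy as np

def hargreaves_samani(t_min, t_max, t_mean, ra_mj):
    """Reference ET in mm/day from daily air temperature (deg C) and
    extraterrestrial radiation ra_mj (MJ m-2 day-1).

    Hypothetical helper for illustration; inputs may be scalars or arrays.
    """
    t_min, t_max, t_mean, ra_mj = (
        np.asarray(x, dtype=float) for x in (t_min, t_max, t_mean, ra_mj)
    )
    # 0.408 converts MJ m-2 day-1 into mm/day of equivalent evaporation.
    ra_mm = 0.408 * ra_mj
    # Guard against negative diurnal ranges caused by data errors.
    t_range = np.clip(t_max - t_min, 0.0, None)
    return 0.0023 * ra_mm * (t_mean + 17.8) * np.sqrt(t_range)

# Example with made-up summer-day values.
et_ref = hargreaves_samani(t_min=10.0, t_max=20.0, t_mean=15.0, ra_mj=30.0)
```

Only temperature and a radiation term that depends on latitude and day of year are needed, which is why the method is attractive for large-domain applications.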
In addition to the six outlet discharge gauges used in model calibration, we obtained daily data from 46 gauging stations from the Global Runoff Data Center (GRDC, 2023) for internal validation of the six catchment models. Remotely sensed AET estimates for the period from 2002 to 2014 were obtained using MODIS data and the two-source energy balance method (Norman et al., 1995) as described in Stisen et al. (2021). Digital elevation model (DEM) data were retrieved from the Shuttle Radar Topographic Mission (SRTM, Farr et al., 2007). Soil texture variables (clay content, sand content, and bulk density) were derived from the SoilGrid Database (Rakovec, Kumar, Attinger, et al., 2016; Rakovec et al., 2019). The soil texture data for six layers with varying depths (5, 15, 30, 50, 100, and 200 cm) and a tillage depth of 30 cm are introduced as input to the model. All input data were resampled to a common spatial resolution of 0.001953125° (∼200 m) (Rakovec, Kumar, Mai, et al., 2016). A MODIS-based land use map was reclassified into three classes, namely forest, pervious, and impervious. Long-term monthly leaf area index (LAI) maps used to calculate the spatiotemporally varying crop coefficient are based on the MODIS MOD16A2.v061 product (Running et al., 2021). The original 8-day composite LAI maps were aggregated to long-term monthly means sampled at a matching spatial resolution of ∼200 m.

Hydrologic Model
The spatially explicit mesoscale Hydrologic Model v.5.11.1 (Kumar, Samaniego, & Attinger, 2013; Rakovec et al., 2019; Samaniego et al., 2010; Thober et al., 2019) was used to simulate daily discharge and spatial AET patterns of the six catchments. mHM is considered an appropriate model for these basins because they are mesoscale basins that represent relatively natural flow systems and they have all previously been simulated with mHM with good performance (Rakovec, Kumar, Mai, et al., 2016). The backbone of mHM, that is, the numerical methods used to estimate various states and fluxes, is based on the fusion of two well-known models, that is, HBV and VIC (Samaniego et al., 2010). mHM simulates major components of the hydrologic cycle, that is, evapotranspiration, canopy interception, snow accumulation and melting, soil moisture dynamics, infiltration, percolation, groundwater storage, and surface runoff generation. The model simulates these fluxes on a multi-layer distributed grid using the multi-scale parameter regionalization approach (Kumar, Samaniego, & Attinger, 2013; Samaniego et al., 2010) to account for sub-grid variability of landscape attributes and model parameters based on pedo-transfer functions. MPR is one of the unique features of mHM that facilitates a spatial pattern-oriented calibration. Moreover, AET and soil moisture from different soil layers are modeled based on available soil water and the root fraction of vegetation in each soil layer. Two transfer functions are of particular importance for this work. First, an exponential function (Equation 1) uses monthly LAI maps to link distributed vegetation dynamics to a spatially distributed crop coefficient (Allen et al., 1998), also termed a dynamic scaling function (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018). The spatio-temporally varying crop coefficient (Kc) is then applied for scaling spatially coarse reference ET (ETref). In this implementation, Kc accounts for the deviation of the vegetation, in time and space, from the reference surface (well-watered 10 cm tall grass with Kc = 1). The crop coefficient scales the climate-based reference ET to a potential ET which acts as the upper limit for AET in the mHM model. The scaling performs a spatial downscaling from the meteorological data resolution of 0.25° to the hydrological model grid resolution of 0.015625°, to account for a heterogeneous vegetation cover.
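The dynamic scaling step can be sketched as follows. The exponential form used here, Kc = Kc,min + (Kc,max − Kc,min)(1 − exp(a·LAI)) with a < 0, is an assumption made for illustration, since the equation itself is not reproduced in this text. The function names, parameter values, and the nearest-neighbour replication of the coarse ETref grid are likewise hypothetical and do not describe the mHM source code.

```python
import numpy as np

def crop_coefficient(lai, kc_min, kc_max, a):
    """Assumed exponential LAI-to-Kc function (a < 0):
    Kc rises from kc_min at LAI = 0 toward kc_max for dense vegetation."""
    lai = np.asarray(lai, dtype=float)
    return kc_min + (kc_max - kc_min) * (1.0 - np.exp(a * lai))

def downscale_pet(et_ref_coarse, kc_fine, factor=16):
    """Scale coarse reference ET to fine-grid potential ET.

    factor=16 corresponds to 0.25 deg / 0.015625 deg. The coarse grid is
    simply replicated; all spatial detail comes from the Kc field.
    """
    coarse = np.asarray(et_ref_coarse, dtype=float)
    et_ref_fine = np.repeat(np.repeat(coarse, factor, axis=0), factor, axis=1)
    return kc_fine * et_ref_fine

# Made-up example: a 2 x 2 block of forcing cells and a 32 x 32 LAI map.
rng = np.random.default_rng(0)
lai_map = rng.uniform(0.0, 6.0, size=(32, 32))
kc_map = crop_coefficient(lai_map, kc_min=0.3, kc_max=1.2, a=-0.5)
pet_map = downscale_pet(np.array([[3.0, 3.5], [2.8, 3.2]]), kc_map)
```

In a calibration setting, the coefficients of such a function (here kc_min, kc_max, and a) would be the free parameters that shape the simulated AET pattern.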
Second, another transfer function utilizes spatially distributed soil texture maps to allow for an incorporation of soil physical properties in the spatial parametrization of root fraction coefficients (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018). Hereby, the root fraction coefficient can vary with both vegetation and soil type and is used in mHM to calculate root water uptake as part of the AET reduction factor (Samaniego et al., 2021a, 2021b). During calibration, both transfer functions increase the model flexibility to adjust the simulated spatial AET patterns toward those retrieved from satellite data (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018; Koch et al., 2022). Finally, the total runoff generated at every grid cell is routed to its neighboring downstream cell using the adaptive timestep spatially varying celerity method for the river runoff routing scheme (Thober et al., 2019).
In this study, the following four spatial resolutions are defined in the mHM model: 0.001953125° for the morphological characteristics (L0 scale), 0.015625° for the hydrologic modeling resolution (L1), 0.0625° for runoff routing (L11), and 0.25° for the meteorological forcing (L2). Note that around 50° North these resolutions correspond approximately to 140/430 m, 1.1/3.5 km, 4.5/7 km, and 18/28 km lon/lat for L0, L1, L11, and L2, respectively. Finally, a 13-year period (2002-2014) with a 4-year warm-up period (1998-2001) was simulated at a daily timestep for calibration and evaluation of the discharge performance and spatial pattern match between remote sensing based and simulated AET (from 2002 to 2014). The remote sensing based AET is estimated with the Two-Source Energy Balance method (TSEB) (Norman et al., 1995), using MODIS data including land surface temperature, albedo, and NDVI. For a full description of the AET data set and comparison to other estimates for Europe, the reader is referred to Stisen et al. (2021).

Evaluation Metrics and Objective Functions
The hydrologic model performance was evaluated using two key objective functions (OFs). In this study, we interpret multi-objective calibration as the combination of two completely independent evaluation data sets, that is, discharge time series and spatial AET maps, instead of producing a variety of OFs based on a single variable.
For the temporal evaluation of discharge, the Kling-Gupta-Efficiency (KGE) was applied (Gupta et al., 2009). The KGE is defined as

$$\mathrm{KGE} = 1 - \sqrt{(r - 1)^2 + (\alpha - 1)^2 + (\beta - 1)^2}$$

where r is the Pearson correlation coefficient between observed and simulated streamflow (Q), α is the variability ratio, defined as the ratio of the standard deviations of simulated and observed Q, and β is the bias ratio, defined as the ratio of the means of simulated and observed Q.
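A minimal sketch of the KGE computation for a single gauge is given below, following the standard formulation of Gupta et al. (2009). The function name and the handling of missing values are choices made for this example and are not taken from the mHM or Ostrich code.

```python
import numpy as np

def kge(obs, sim):
    """Kling-Gupta Efficiency between observed and simulated series."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    # Use only time steps where both series have valid values.
    valid = np.isfinite(obs) & np.isfinite(sim)
    obs, sim = obs[valid], sim[valid]

    r = np.corrcoef(obs, sim)[0, 1]   # timing and shape
    alpha = sim.std() / obs.std()     # variability ratio
    beta = sim.mean() / obs.mean()    # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)

# Made-up discharge values; a simulation that doubles every flow has
# perfect correlation but alpha = beta = 2.
q_obs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kge_perfect = kge(q_obs, q_obs)
kge_doubled = kge(q_obs, 2.0 * q_obs)
```

The doubled-flow case illustrates that KGE penalizes bias and variability errors even when the timing of the hydrograph is reproduced exactly.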
For the spatial pattern evaluation of simulated AET, the bias-insensitive Spatial Efficiency metric (SPAEF) was used (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018; Koch et al., 2018). The SPAEF is a reformulation of KGE and is defined as

$$\mathrm{SPAEF} = 1 - \sqrt{(r - 1)^2 + (\Phi - 1)^2 + (\gamma - 1)^2}$$

where r is the Pearson correlation coefficient between observed and simulated spatial patterns of AET, Φ is the ratio of the coefficients of variation of simulated and observed AET, and γ quantifies the fraction of the histogram intersection based on the z-scores of observed and simulated AET fields. Both KGE and SPAEF vary in a range from −∞ to the best value of 1.
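The three SPAEF components can be sketched as below, following the published definition of the metric (Koch et al., 2018). The number of histogram bins, the masking of invalid cells, and the function name are assumptions made for this example.

```python
import numpy as np

def spaef(obs, sim, bins=100):
    """Spatial Efficiency between two maps (2-D arrays or flattened)."""
    obs = np.asarray(obs, dtype=float).ravel()
    sim = np.asarray(sim, dtype=float).ravel()
    # Compare only grid cells that are valid in both maps.
    valid = np.isfinite(obs) & np.isfinite(sim)
    obs, sim = obs[valid], sim[valid]

    # Pattern correlation.
    r = np.corrcoef(obs, sim)[0, 1]
    # Ratio of coefficients of variation (spatial variability).
    phi = (sim.std() / sim.mean()) / (obs.std() / obs.mean())
    # Histogram intersection of z-scores (bias-insensitive distribution match).
    z_obs = (obs - obs.mean()) / obs.std()
    z_sim = (sim - sim.mean()) / sim.std()
    lo = min(z_obs.min(), z_sim.min())
    hi = max(z_obs.max(), z_sim.max())
    h_obs, _ = np.histogram(z_obs, bins=bins, range=(lo, hi))
    h_sim, _ = np.histogram(z_sim, bins=bins, range=(lo, hi))
    gamma = np.minimum(h_obs, h_sim).sum() / h_obs.sum()

    return 1.0 - np.sqrt((r - 1.0) ** 2 + (phi - 1.0) ** 2 + (gamma - 1.0) ** 2)

# Synthetic AET field for illustration.
rng = np.random.default_rng(0)
aet_obs = rng.gamma(2.0, 1.0, size=(50, 50)) + 1.0
aet_scaled = 1.5 * aet_obs                          # same pattern, 50% too high
aet_shuffled = rng.permutation(aet_obs.ravel())     # same values, pattern destroyed
```

A uniformly scaled field keeps its correlation, coefficient of variation, and z-score histogram, so it scores close to 1, whereas the shuffled field retains the value distribution but loses the correlation term. This is the sense in which the metric targets the pattern rather than the magnitude.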
SPAEF is calculated separately for long-term seasonal averages across all years, focusing on the water-limited growing seasons in three 3-month windows, that is, March-April-May (MAM), June-July-August (JJA), and September-October-November (SON). The fourth quarter (December-January-February) is not used because of cloud cover issues in the satellite data in winter and because, under energy-limited conditions, both AET and PET are very low and the simulated AET pattern is completely controlled by the PET input. Finally, we summed the squared residuals of these three seasonal parts as

$$\mathrm{SSR}_{AET} = \sum_{s \in \{\mathrm{MAM},\, \mathrm{JJA},\, \mathrm{SON}\}} \left(1 - \mathrm{SPAEF}_s\right)^2$$

where SSR_AET represents the sum of squared residuals for the seasonal AET pattern performance applying SPAEF as OF. For the joint calibrations across several catchments, SPAEF is calculated on the combined data set as a single value for each season across all catchments.
For streamflow, KGE was calculated at one, five, or six stations (n) at the outlets of the basins and the sum of squared residuals is used:

$$\mathrm{SSR}_{Q} = \sum_{i=1}^{n} \left(1 - \mathrm{KGE}_i\right)^2$$

where SSR_Q represents the sum of squared residuals for the Q performance at the discharge stations using KGE as OF. For the single catchment calibrations, only the corresponding station KGE is utilized.
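The aggregation of station-wise KGE values and season-wise SPAEF values into the two objective functions can be sketched as follows. The residual is taken here as the distance of each score from its optimum of 1, which is an interpretation of "sum of squared residuals" adopted for this example. The numerical scores are made up.

```python
import numpy as np

def ssr(scores):
    """Sum of squared residuals of efficiency scores relative to the optimum 1."""
    scores = np.asarray(scores, dtype=float)
    return float(np.sum((1.0 - scores) ** 2))

# Hypothetical scores for a six-basin calibration.
kge_per_station = [0.92, 0.88, 0.90, 0.85, 0.91, 0.87]    # one KGE per outlet
spaef_per_season = {"MAM": 0.55, "JJA": 0.65, "SON": 0.60}

ssr_q = ssr(kge_per_station)                      # objective 1 (to be minimized)
ssr_aet = ssr(list(spaef_per_season.values()))    # objective 2 (to be minimized)
```

Squaring the residuals means that one poorly performing station or season weighs more heavily than several slightly suboptimal ones, which discourages solutions that sacrifice a single basin.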
For both the KGE and SPAEF metrics we combine individual stations and seasons into single SSR OFs to limit the number of independent OFs (one for KGE and one for SPAEF). It is only feasible to work with a few OFs (2-3) because the number of model evaluations required to define the Pareto front and interpret tradeoffs increases drastically with an increasing number of OFs.

Sensitivity Analysis
Identification of the optimal parameter set through a calibration framework can be cumbersome if the dimension of the search space is not first limited by a sensitivity analysis. mHM has 69 parameters (Samaniego et al., 2021a, 2021b), each increasing the dimension of the search space. Focusing a calibration on parameters that are sensitive regarding the selected OFs is computationally more efficient than calibrating all parameters (Demirel, Koch, Mendiguren, & Stisen, 2018). To reduce the computational burden by narrowing the number of calibration parameters, a one-at-a-time (OAT) sensitivity analysis was conducted to identify the most important parameters for calibration using the PEST Toolbox (Doherty, 2010). Although parameter interactions are not accounted for in this local OAT method, it provides an indication of sensitive parameters, especially if combined with expert opinion, which can complement the assessment of parameter interactions.
The sensitivity analysis was based on an initial parameter set obtained from a previous calibration against KGE for the same model setup across Europe (Rakovec & Kumar, 2022), although with a different parameterization scheme for Kc and root fraction distribution (Demirel, Mai, et al., 2018). We used KGE as OF for discharge performance and SPAEF as OF for spatial pattern performance of the model, and the initial parameter set gave reasonable performance on both OFs. The purpose of the sensitivity analysis was to reduce the number of free parameters to make the optimization more efficient and to ensure that parameters that were important for either KGE or SPAEF were selected. In addition, the new parameters introduced for the Kc and root fraction distribution were included in the analysis. Some parameters, such as routing parameters, have no impact on the spatial pattern performance, whereas other parameters related to the spatial parameter distribution have limited impact on KGE. Each parameter was perturbed twice (increased by 5% and decreased by 5% relative to the initial point) to calculate the average sensitivity index of the OFs for the change in the parameter value. This index value is then multiplied by the absolute parameter value to account for the parameter magnitude in the calculations. Finally, the sensitivities are normalized by the maximum of the group. The analysis is solely used to select parameters for optimization. The sensitivity does not carry over to the optimization, since the DDS algorithm does its parameter search independently of the initial sensitivity analysis. For all subsequent calibration tests the same parameters have been selected, and as such each calibration experiment optimizes the same parameters using the same parameter intervals.
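The one-at-a-time procedure described above can be sketched as follows. This is a simplified illustration of the described steps (±5% perturbation, averaging, scaling by the absolute parameter value, normalization by the group maximum), not the PEST implementation. The toy objective function and the parameter names p_a, p_b, and p_c are hypothetical.

```python
def oat_sensitivity(objective, params, rel_step=0.05):
    """Normalized (0-100%) one-at-a-time sensitivities for one objective.

    objective: callable taking a dict of parameter values, returning a float.
    params:    dict with the initial parameter set.
    """
    base_value = objective(params)
    raw = {}
    for name, value in params.items():
        delta = rel_step * abs(value)
        if delta == 0.0:
            raw[name] = 0.0  # a zero-valued parameter cannot be perturbed relatively
            continue
        up = objective({**params, name: value + delta})
        down = objective({**params, name: value - delta})
        # Average absolute change in the objective per unit parameter change.
        index = 0.5 * (abs(up - base_value) + abs(down - base_value)) / delta
        # Scale by parameter magnitude so differently sized parameters compare.
        raw[name] = index * abs(value)
    largest = max(raw.values())
    return {k: (100.0 * v / largest if largest > 0 else 0.0) for k, v in raw.items()}

# Toy stand-in for a model run returning an OF value.
def toy_objective(p):
    return 2.0 * p["p_a"] + 0.0 * p["p_b"] + p["p_c"] ** 2

sens = oat_sensitivity(toy_objective, {"p_a": 1.0, "p_b": 5.0, "p_c": 3.0})
# Keep parameters at or above a 1% threshold, mirroring the selection rule used here.
selected = [name for name, s in sens.items() if s >= 1.0]
```

With two objective functions, the same routine would be run once per OF, and a parameter would be retained if it passes the threshold for at least one of them.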

Experimental Design of Calibration and Validation
In total, 26 calibration experiments were designed to investigate the potential benefits of incorporating AET to augment a multi-objective and multi-basin calibration framework (Figure 1). Note that SSR_Q is incorporated in all calibration experiments as objective function while SSR_AET is used only for the KSP1, KSP5, and KSP6 calibration experiments. KSP stands for KGE and SPAEF multi-OF calibration, whereas KGE stands for KGE-only single-OF calibration. The indices 1, 5, and 6 show the number of basins included in the calibration experiments as conceptualized in Figure 1. In this study, we did not include an AET-only scenario as it failed to reproduce reasonable water balances in our preliminary tests and in a previous study (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018).
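One way to enumerate such a design is sketched below. The breakdown into six single-basin, six leave-one-out, and one all-basin calibration per objective setting is inferred from the experiment names and the jack-knife description, since Figure 1 is not reproduced here. The dictionary layout is a choice made for this example.

```python
from itertools import combinations

BASINS = ["Elbe", "Main", "Meuse", "Mosel", "Neckar", "Vienne"]

def build_experiments(basins=BASINS):
    """List all calibration experiments with their calibration and holdout basins."""
    settings = (
        ("KGE", ("SSR_Q",)),             # discharge only
        ("KSP", ("SSR_Q", "SSR_AET")),   # discharge plus spatial AET pattern
    )
    experiments = []
    for prefix, objectives in settings:
        for n_calib in (1, 5, 6):
            for calib in combinations(basins, n_calib):
                holdout = tuple(b for b in basins if b not in calib)
                experiments.append(
                    {
                        "name": f"{prefix}{n_calib}",
                        "calibration": calib,
                        "holdout": holdout,   # basins treated as ungauged
                        "objectives": objectives,
                    }
                )
    return experiments

experiments = build_experiments()
n_experiments = len(experiments)
```

Under this reading, each objective setting contributes 6 + 6 + 1 = 13 calibrations, which is consistent with the total of 26 stated above.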
All 26 calibration experiments (cases) were performed with the open-source model-agnostic Ostrich optimization toolbox written in C++ (Matott, 2017). For all 26 calibration experiments, the parallel implementation of the Pareto-Archived Dynamically Dimensioned Search (ParaPADDS) algorithm was used (Asadzadeh & Tolson, 2013). This algorithm is the multi-objective version of the Dynamically Dimensioned Search (Tolson & Shoemaker, 2007) algorithm that identifies a Pareto front of non-dominated optimal solutions, which is most appropriate for our multi-objective calibrations (Beume & Rudolph, 2006; Razavi & Tolson, 2013). Moreover, the ParaPADDS algorithm reached reasonable solutions for both single and multiple OFs; therefore, we used the same search algorithm in all scenarios for consistency. The ParaPADDS algorithm was configured with a user-defined maximum of 750 iterations, 3 parallel nodes (logical processors), a perturbation value of 0.2, and the exact hypervolume contribution as the selection criterion. Note that initial tests for one basin with 200, 500, and 1,000 iterations indicated stable results already at 500, but a somewhat incomplete Pareto front. Based on this and in the interest of saving computation time, we decided on 750 iterations. Like all multi-objective calibration methods, the algorithm does not provide a single best solution for the multiple OF problem. Instead, it offers the modeler a set of possible solutions on the Pareto front (Asadzadeh & Tolson, 2013).
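To make the notion of a Pareto front concrete, the following sketch filters the non-dominated solutions from a set of evaluated runs when both objectives are minimized. It is a plain post-processing filter for illustration and does not reproduce the archiving or selection logic of ParaPADDS. The example values are made up.

```python
import numpy as np

def non_dominated(points):
    """Boolean mask of non-dominated rows; all objectives are minimized.

    A row is dominated if another row is no worse in every objective
    and strictly better in at least one.
    """
    pts = np.asarray(points, dtype=float)
    keep = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        no_worse = np.all(pts <= p, axis=1)
        strictly_better = np.any(pts < p, axis=1)
        if np.any(no_worse & strictly_better):
            keep[i] = False
    return keep

# Columns: (SSR_Q, SSR_AET) for five hypothetical model runs.
runs = np.array([[1.0, 5.0], [2.0, 2.0], [5.0, 1.0], [3.0, 3.0], [2.0, 2.5]])
front_mask = non_dominated(runs)
pareto_front = runs[front_mask]
```

The pairwise comparison is quadratic in the number of runs, which is unproblematic for a few hundred model evaluations per calibration.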
KGE1 and KGE6 calibrations resulted in the single best parameter set that was used to create our final results in the following figures. KSP1 and KSP6 calibrations provided multiple possible solutions on the Pareto front with KGE as one axis and SPAEF as the other axis. To systematically select a best-balanced parameter set, we picked the solution that is closest to the origin by normalizing both axes (SSR_Q and SSR_AET) using min-max normalization and choosing the minimum of the sums, similar to the approach by Martinsen et al. (2022). The normalization is applied to avoid metric-magnitude effects on the selection. KSP1 and KSP6 results presented hereafter are generated using this selected single parameter set. Calibrations were done with the six discharge gauges and three seasonal AET maps (March-November). We used 46 discharge stations from GRDC for internal validation of the six catchment models and we show the results of the KGE5 and KSP5 cases as maps (see Figure 6).
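The selection rule can be sketched as follows: both objectives are min-max normalized over the Pareto solutions and the solution with the smallest sum of normalized values is chosen. The function name and example values are hypothetical.

```python
import numpy as np

def best_balanced(ssr_q, ssr_aet):
    """Index of the Pareto solution with the smallest sum of normalized objectives."""
    def minmax(x):
        x = np.asarray(x, dtype=float)
        span = x.max() - x.min()
        # If all solutions share one value, that objective cannot discriminate.
        return np.zeros_like(x) if span == 0 else (x - x.min()) / span

    total = minmax(ssr_q) + minmax(ssr_aet)
    return int(np.argmin(total))

# Three hypothetical Pareto solutions: discharge-optimal, balanced, pattern-optimal.
ssr_q_front = [0.01, 0.02, 0.10]
ssr_aet_front = [0.90, 0.30, 0.20]
chosen = best_balanced(ssr_q_front, ssr_aet_front)
```

Without the normalization, the objective with the larger numerical range (here SSR_AET) would dominate the sum and bias the choice toward its end of the front.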

Sensitivity Analysis
Table 2 shows the 20 most influential parameters out of 69 mHM parameters selected based on the combined sensitivity of the two metrics. We used these normalized sensitivities varying from 0% to 100% and applied a threshold of 1% for at least one of the OFs for selecting the most sensitive parameters for calibration (20).
Based on the KGE, the five most sensitive parameters controlling discharge are RotFrCofClay, RotFrCofFore, PTFLowConst, Kc,min_pervi, and PTFKsConst, which are parameters mainly controlling the AET and thereby the water balance. KGE is also sensitive to some routing parameters, but generally less so than to the parameters controlling AET levels. The SPAEF OF is most sensitive to the parameters RotFrCofClay, RotFrCofFore, Kc,min_pervi, and Kc,min_forest, which is almost identical to the set of most sensitive parameters for KGE. Additionally, parameters associated with simulated patterns, for example, those related to pedo-transfer functions for soil properties, are important for SPAEF. Conversely, SPAEF has zero sensitivity to routing parameters. Overall, the most sensitive parameters contribute to the spatial heterogeneity of root fraction coefficients, crop coefficients, infiltration factor, and field capacities of the grid cells.

Calibration Results
Figure 2 shows the model calibration results for single basin calibrations using single (KGE1) and multi-objective functions (KSP1). Each calibration is performed using 750 model runs distributed over three parallel processors, where non-dominated runs leading to a Pareto front are identified by the ParaPADDS algorithm (Asadzadeh & Tolson, 2013). A solution is called non-dominated if no other solution is at least as good in all objectives analyzed and strictly better in at least one. Although the calibrations do not depict a clear Pareto front due to the combined plotting of KSP1 and KGE1, the tradeoff between discharge-only and spatial performance is clearly distinguishable from the plots. KGE-only (KGE1) calibrations lead to slightly better KGE performance and much poorer spatial AET pattern performance than the multi-objective calibrations (KSP1). However, KSP1 calibrations enable the identification of a more balanced solution leading to higher SPAEF performance and only slightly poorer KGE performance than KGE1's single solution for each basin (shown as a triangle). While it is well known that good KGE performance does not guarantee a good spatial pattern performance, it is a novel finding that there is a very limited tradeoff between the temporal and spatial performance of the models.
Table 3 shows the KGE performance for the best-balanced solutions from KGE1 and KSP1 along with the SPAEF calculated across all six basins when combining the six single basin calibrations. SPAEF is calculated across all basins, since we are not interested in individual basin SPAEF values for the joint evaluation, but in the spatial pattern similarity across all basins. Generally, all calibrations can lead to KGE performances in the range of 0.84-0.96, as shown in Table 3.
Note. The parameter abbreviations correspond to the names of the parameters in the mHM setup (in the mHM namelist).

Table 2
20 Selected Parameters for Calibration, Their Range and Sensitivity for Both Objective Functions, KGE and SPAEF
Subsequently, a multi-basin calibration was conducted, again with both single (KGE6) and multiple (KSP6) objectives (see Section 2.5 for details). The results are shown in Figure 3 and Table 3. The model performance results mimic the results of the single basin test, with similar KGE performances but a significant performance increase for SPAEF, from 0.02 with KGE6 to 0.61 with KSP6, as would be expected when adding the spatial pattern objective function.
Figure 4 illustrates the spatial AET maps from TSEB (observed) and the various calibration tests. For the multi-objective calibrations (KSP1 and KSP6), the best-balanced solution (closest point to the origin) is chosen for visualization. The maps clearly show the issues related to KGE1 regarding spatial pattern performance. For three out of six basins, that is, Elbe, Mosel, and Vienne, the KGE1 calibration resulted in a strikingly poor spatial AET pattern (compared to KSP1), where distinct low and high AET areas were inverted compared to the TSEB pattern. In contrast, including the SPAEF metric in the optimization (KSP1) prevented such errors without any substantial loss in KGE performance (average KGE of 0.93 for KGE1 and 0.90 for KSP1, Table 3).
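The selection of the best-balanced solution as the closest point to the origin can be sketched as follows. The sketch assumes that both objectives are expressed as quantities to be minimized (here 1 − KGE and 1 − SPAEF) and that plain Euclidean distance is used; the exact objective scaling in the study may differ, and the Pareto points below are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # assumed (1 - KGE, 1 - SPAEF), both minimized


def best_balanced(front: List[Point]) -> Point:
    """Pick the Pareto point with the smallest Euclidean distance to the origin."""
    return min(front, key=lambda p: math.hypot(p[0], p[1]))


# Hypothetical Pareto front (illustrative values only).
front = [(0.05, 0.90), (0.08, 0.45), (0.10, 0.40), (0.30, 0.38)]
choice = best_balanced(front)
```

With these values the point (0.10, 0.40) is selected: it gives up a little discharge performance relative to the best-KGE run in exchange for a much better spatial pattern score.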
Interestingly, the KGE6 calibration, that is, without any spatial pattern constraint, was able to represent the overall pattern to some extent across the six basins, although with a significantly underestimated variance and some substantial differences. This emphasizes the value of joint multi-basin calibration for robustness in spatial parametrization within the MPR parametrization scheme. Adding the SPAEF metric to the multi-basin calibration (KSP6) generated the best spatial similarity to TSEB, although not better than combining the spatial AET maps from the six individual KSP1 calibrations into one map (Figure 4 and Table 3). Comparing the KGE1 and KGE6 calibrations illustrates the reduction in KGE performance, from an average of 0.93 to 0.88, when seeking one common parametrization in KGE6. While the sampling uncertainty in KGE scores (0.01-0.03; see Appendix A) is typically lower than the change in KGE performance (0.05), the two are of a similar magnitude. Analysis of the sampling uncertainty suggests that when moving from the KGE1 to the KGE6 calibration approach, the uncertainties remain the same but are centered around lower KGE values.
The higher KGE performance obtained from single basin optimization does, however, come with a very poor SPAEF performance of −0.45 for KGE1 compared to 0.02 for KGE6. Although the SPAEF for KGE6 is also low, this is mainly attributed to the variance component of SPAEF (Figure 4).
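How an underestimated variance alone can pull SPAEF down is illustrated by the sketch below. It assumes the three-component form commonly given for SPAEF in the literature: correlation, ratio of coefficients of variation, and histogram overlap of z-scored values. Whether this matches the definition in the methods section of this study term by term is an assumption. The number of histogram bins and the example values are hypothetical, and the helper functions are written for illustration only.

```python
import math
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))


def zscores(x):
    """Standardize values to zero mean and unit (population) standard deviation."""
    m, s = mean(x), pstdev(x)
    return [(v - m) / s for v in x]


def histogram(values, lo, hi, bins):
    """Count values in equal-width bins on [lo, hi]; hi falls into the last bin."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def spaef(obs, sim, bins=10):
    """Assumed SPAEF form: 1 - sqrt((a-1)^2 + (b-1)^2 + (g-1)^2).

    Inputs must be non-constant with non-zero means.
    """
    alpha = pearson(obs, sim)                                    # pattern correlation
    beta = (pstdev(sim) / mean(sim)) / (pstdev(obs) / mean(obs))  # ratio of CVs
    zo, zs = zscores(obs), zscores(sim)
    lo, hi = min(zo + zs), max(zo + zs)
    ho, hs = histogram(zo, lo, hi, bins), histogram(zs, lo, hi, bins)
    gamma = sum(min(a, b) for a, b in zip(ho, hs)) / sum(ho)     # histogram overlap
    score = 1 - math.sqrt((alpha - 1) ** 2 + (beta - 1) ** 2 + (gamma - 1) ** 2)
    return score, alpha, beta, gamma


# Hypothetical pattern: simulation perfectly correlated with the observation
# but with only 20% of its spread around the same mean.
obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
m = mean(obs)
sim = [m + 0.2 * (v - m) for v in obs]
score, alpha, beta, gamma = spaef(obs, sim)
```

In this constructed case the correlation term is essentially perfect, yet the variance term (ratio of about 0.2) caps the score at roughly 0.2. This is consistent with the reading that a low SPAEF can coexist with a visually reasonable pattern when the simulated spatial variability is too small.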
Even though the model performance of simulated spatial patterns across the six basins shares some similarities for KSP6 and KSP1, there is a marked difference between the parameter distributions that generate the spatial AET patterns. This is shown in Figure 5, which displays the resulting fields of field capacity and crop coefficient; these are calculated in mHM and represent the key controls of AET simulations. The field capacity and crop coefficients are not parameters that are assigned directly in mHM but are the results of several transfer function parameters. Therefore, field capacity and crop coefficient are not included in Table 2, which lists the transfer parameters that generate them (Kc* and PTF*). Although the KSP1 calibrations generate parameter distributions that have meaningful patterns of field capacity within each basin, they fail to form one consistent, seamless parametrization across the basins (Figure 5). Similarly, for the KGE1 and KGE6 calibrations, the spatial inconsistency resulting from single basin calibration becomes apparent. For field capacity (Figure 5), the parametrizations obtained from KGE6 and KSP6 are relatively similar, although KSP6 results in a slightly larger variance. A different picture emerges for the crop coefficient, where KSP1 generates patterns similar to KSP6. At the same time, KGE6 produces very different patterns, with unreasonably high values for urban areas and little impact of vegetation patterns on crop coefficients. This difference is due to the crop coefficient parameter pattern mainly being constrained by the SPAEF OF, while the KGE OF on discharge also constrains the field capacity parameter.

Cross-Validation Results
To investigate the potential impact of the calibration strategy on the transferability of parameters to ungauged basins, two Jack-knife tests were applied. The two tests hold out either five (KGE1-KSP1) or one (KGE5-KSP5) basins at a time and evaluate only the uncalibrated basins, using parameters obtained by calibrating either one or five other basins. These tests are performed for both single- and multi-objective calibrations, resulting in four parameter transfer tests.
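The two holdout designs can be enumerated with a short sketch. The basin labels are placeholders rather than the actual basin names, and the pairing logic is an illustration of the design as described, not the authors' workflow.

```python
# Placeholder labels for the six basins (hypothetical names).
basins = ["B1", "B2", "B3", "B4", "B5", "B6"]

# Five-basin holdout (KGE1/KSP1): calibrate on one basin, then evaluate the
# resulting parameter set on each of the other five basins.
single_basin_tests = [(cal, ev) for cal in basins for ev in basins if ev != cal]

# Single-basin holdout (KGE5/KSP5): calibrate on five basins jointly, then
# evaluate on the one basin that was left out.
multi_basin_tests = [
    (tuple(b for b in basins if b != held_out), held_out) for held_out in basins
]

n_single = len(single_basin_tests)  # calibration/evaluation pairs, one per transfer
n_multi = len(multi_basin_tests)    # one evaluation per held-out basin
```

The first design yields 6 × 5 = 30 calibration-evaluation pairs, so each basin is evaluated as "ungauged" five times; the second yields six evaluations, one per held-out basin.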
Results for the single-basin calibrations and the subsequent evaluation of parameter transfer to five ungauged basins based on the KGE1 and KSP1 calibrations are shown in Table 4. For each discharge evaluation, KGE is calculated as the average across all basins, each represented in five holdout evaluations (a total of 30 ungauged evaluations). The SPAEF is calculated based on three seasons for six holdouts (a total of 18 pattern evaluations). Table 4 shows that discharge performances, with average KGEs of 0.79 and 0.83 across ungauged basins, are similar between KGE1 and KSP1, although the latter performs better. Compared to the KGE6 and KSP6 calibrations (both with an average KGE of 0.88 in Table 3), relatively little loss in discharge performance is noticed, even for ungauged cases.
For the spatial pattern evaluation, the KGE1 parameter transfer has a low average SPAEF across all basins, while the standard deviations are large across seasons. For KSP1, the SPAEF results are much better, with an average of 0.41. This indicates that single basin calibration with multiple objectives makes more robust predictions for ungauged basins when both discharge and AET patterns are considered in calibration at gauged locations.
The single basin holdout evaluation based on the KGE5 and KSP5 calibrations (Table 4) shows that discharge performances (averages of 0.85 and 0.86) are better than for the five-basin holdout (KGE1 and KSP1) and very similar to the KGE6 and KSP6 calibrations. Again, the multi-objective calibrations seem more robust for parameter transfer when evaluated against discharge only. For the SPAEF performance evaluation, KGE5 performs better than KGE1, indicating better parameter transfer when calibrated against more and diverse basins. However, spatial pattern performances are still considerably better for the ungauged assessment based on multiple objectives in KSP5. Also, KSP5 (SPAEF around 0.5) performs better than KSP1 (SPAEF around 0.4).
In summary, the four ungauged basin tests indicate that discharge can be predicted with average KGEs around 0.79-0.83 across the six selected basins based on parameter transfer from calibration of neighboring basins, even when only a single basin is used to estimate parameters for five neighboring basins. Performances on discharge improve further when including an additional objective function in the form of AET patterns and when calibrating across five basins and evaluating on a single holdout basin. Similarly, spatial patterns can be simulated with average SPAEF values of 0.41 and 0.49, that is, somewhat lower than KSP6 at 0.61, when only accounting for AET patterns from neighboring basins in the parameter estimation. In contrast, spatial patterns are very poorly represented when parameters are based on single-basin and single-objective calibrations (KGE1).
In addition to the jack-knifing validation for ungauged basins, a validation test for internal discharge stations was performed for the KGE5 and KSP5 holdout (ungauged) simulations. This test was intended to analyze the possible added value of spatial pattern calibration for the performance at internal discharge stations compared to a pure discharge calibration. The spatial pattern calibration, based on the bias-insensitive SPAEF metric, does not add a temporal constraint to the model optimization and will not directly influence the temporal performance of the simulated downstream discharge. However, the SPAEF-based calibration will alter the spatial pattern of AET and thereby the internal water balances within the larger basins. Therefore, the internal validation focuses on the discharge bias (Equation 2; β term) alone and not the KGE, in an attempt to quantify a possible reduction of simulated streamflow biases at internal validation points.
Since spatial patterns of AET are only included for the period March-November, they are likely to mainly influence the summer water balance, where AET has the most impact. Hence, annual and summer statistics are estimated separately. Figure 6 illustrates the locations of the 46 internal discharge stations and the difference in absolute bias (%) between the ungauged simulations from the KGE5 and KSP5 holdout experiments. For the annual statistics (Figure 6, top panel), results are very similar (same average bias), and most stations have differences between plus and minus 10%. For the Meuse basin, significant improvements in bias can be detected for KSP5, while KGE5 tends to be better for the Elbe basin. For the summer statistics (Figure 6, bottom panel), KSP5 has a slightly lower average bias, with considerable improvements for the Meuse and Vienne. At the same time, differences for the Elbe basin are more polarized, with some stations performing better under KGE5 and others under KSP5. Overall, the analysis did not show a clear improvement in biases when constraining the models with spatial patterns in the holdout test. When analyzing KGE and its α and r terms (Equation 2), the KGE-only calibrations performed best for internal station validation in the holdout test. This is illustrated by Figure B1 in the supplementary information section, which shows results for KGE and its three components.
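The station-wise comparison can be sketched as follows, assuming the standard three-component form of KGE (correlation r, variability ratio α, and bias ratio β), which is taken here to correspond to Equation 2. The absolute bias in percent is assumed to be 100·|β − 1|, and the sign convention of the difference (KGE5 minus KSP5) is likewise an assumption. The discharge series are hypothetical.

```python
import math
from statistics import mean, pstdev


def kge_components(obs, sim):
    """Return (KGE, r, alpha, beta) in the standard three-component form."""
    mo, ms = mean(obs), mean(sim)
    cov = sum((o - mo) * (s - ms) for o, s in zip(obs, sim)) / len(obs)
    r = cov / (pstdev(obs) * pstdev(sim))  # linear correlation
    alpha = pstdev(sim) / pstdev(obs)      # variability ratio
    beta = ms / mo                         # bias ratio
    kge = 1 - math.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)
    return kge, r, alpha, beta


def abs_bias_percent(obs, sim):
    """Absolute discharge bias in percent, assumed as 100 * |beta - 1|."""
    return 100 * abs(mean(sim) / mean(obs) - 1)


# Hypothetical discharge at one internal station and two ungauged simulations.
obs = [10.0, 14.0, 30.0, 22.0, 12.0, 8.0]
sim_kge5 = [1.10 * q for q in obs]  # stand-in for a KGE5 holdout run (+10% bias)
sim_ksp5 = [0.95 * q for q in obs]  # stand-in for a KSP5 holdout run (-5% bias)

# Positive values mean the KSP5 stand-in has the smaller absolute bias.
diff = abs_bias_percent(obs, sim_kge5) - abs_bias_percent(obs, sim_ksp5)
kge, r, alpha, beta = kge_components(obs, sim_kge5)
```

Because both stand-in simulations are scaled copies of the observations, they differ only in the α and β terms; in practice the three components would vary independently from station to station.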
The model performances presented in this study should be evaluated in light of the uncertainties associated with them. One aspect of this uncertainty is the sampling uncertainty associated with the KGE metric (Clark et al., 2021). The sampling uncertainty represents the uncertainty related to the time window used for the KGE calculation, since the KGE metric is sensitive to the variance of the evaluation period. This uncertainty can be significant and is especially important when evaluating the applicability of a given model for a particular purpose. Even though it is less important for the comparison of different calibration experiments based on the same evaluation periods, the uncertainties associated with each of the evaluation stations used in the study are given in Tables A1 and A2 in Appendix A. The uncertainties are estimated based on the method described in Clark et al. (2021); they vary between stations but are largely correlated between calibration experiments.

Discussion
The single- (temporal) versus multi-objective (temporal and spatial) calibration experiment presented here illustrated a minimal tradeoff in discharge performance when adding the spatial pattern-oriented metric to the traditional KGE objective function (Figure 2). This result is very similar to previous studies (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018; Kumar, Samaniego, & Attinger, 2013; Rakovec, Kumar, Mai, et al., 2016; Soltani, Bjerre, et al., 2021; Zink et al., 2018) and can be attributed to two main factors. First, the metric design, with a long-term average, bias-insensitive spatial pattern metric, introduces limited conflict with matching the discharge biases and no conflict with the temporal dynamics of the discharge simulations. Second, single-objective calibrations based on downstream discharge only are known to constrain the spatial distribution of internal fluxes to a minimal extent (Stisen et al., 2011), causing a high degree of equifinality. Consequently, the addition of a spatial pattern metric can be viewed as a means of selecting the best spatial pattern match among an extensive set of plausible parameter sets (all producing satisfying KGEs). These results on objective function selection are consistent for both the single-basin and multi-basin tests (with six basins, Figure 3). Not surprisingly, it also becomes evident that a good discharge performance (KGE) does not guarantee a good spatial pattern performance.
In light of the low tradeoff for discharge, the results of single-basin versus multi-basin calibrations are best analyzed by comparing the spatial patterns of AET and the resulting parameter fields. Here, it becomes clear that single-basin, single-objective (temporal) calibration can select parameter sets that are entirely inconsistent between the basins (Figure 5) and display internal spatial AET patterns that are the reverse of the observed patterns (Figure 4). Interestingly, the multi-basin KGE calibration (KGE6) shows that simply adding multiple basins in this case enables the model to obtain a somewhat realistic spatial pattern without being constrained specifically to AET. However, the spatial metric must be included to improve this pattern and its spatial variability (KSP6). Logically, one joint calibration (KGE6 and KSP6) also ensures a spatially consistent parameter field (Figure 5) and thereby also spatially consistent AET patterns (Figure 4). This point has previously been highlighted by Samaniego et al. (2017), who illustrated the shortcomings in producing seamless parameter fields based on multiple single basin calibrations without parameter regionalization across Europe. Ultimately, the goal of regional to continental scale distributed hydrologic modeling is to produce scalable spatial patterns of all states and fluxes across the entire model domain.
Moving on to the spatial holdout experiments, first with single basin calibrations (five holdouts) and later with multi-basin calibrations (single holdouts), the parameter transfer to "ungauged" basins results in average KGE values between 0.79 and 0.86 even when transferring parameters from a single basin to five neighboring basins.
For these holdout experiments, the mean KGE for ungauged basins lies around 0.8 (Table 4) compared to 0.88 for the multi-basin calibrations (KGE6 and KSP6 in Table 3). This is probably a result of the considerable similarity between the basins and their relatively large size, all of them encompassing a range of land use, soil texture, and climate conditions. Also, the six basins were chosen because they all fulfilled the criteria of a similar climate and topography, and of previous performance in a Pan-European modeling context (Rakovec, Kumar, Mai, et al., 2016). In this context, the robustness of parameter transferability might be overestimated compared to basins with less similarity.
Other studies have analyzed parameter transferability and the drop in KGE performance through spatial validation in ungauged basins. A recent and very relevant example is the model intercomparison paper by Mai et al. (2022). They explicitly performed a spatial validation test against basins not included in the calibration for a range of different model codes over the Great Lakes region in North America. They reported an average loss in KGE of around 0.26 for locally calibrated models using a simple parameter transfer scheme and a loss of 0.10 KGE for regionally calibrated models. In comparison, our study reports a loss in KGE of 0.14 for the KGE1 holdout, 0.07 for KSP1, and 0.03 and 0.02 for the KGE6 and KSP6 holdouts (evaluated through the KGE5 and KSP5 performances). It is assumed that a simpler parameter transfer scheme will result in a greater performance loss when testing against basins not included during calibration. In addition, the basins used in our study are quite similar regarding climate and topography, which might not have been the case in other studies. In order to truly compare different holdout experiments regarding performance loss, some accounting for basin similarity and possibly parameter transfer schemes would be recommended.
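The reported KGE losses follow directly from the average KGE values quoted in the text for the calibrated and holdout cases, as the short check below shows. The dictionary layout is only an illustrative way of organizing those numbers; the KGE6 and KSP6 holdout entries use the KGE5 and KSP5 averages, as stated above.

```python
# Average KGE at calibration (Table 3 values quoted in the text).
calibrated = {"KGE1": 0.93, "KSP1": 0.90, "KGE6": 0.88, "KSP6": 0.88}

# Average KGE for ungauged evaluation (Table 4 values quoted in the text);
# the KGE6/KSP6 entries are the KGE5/KSP5 single-basin holdout averages.
holdout = {"KGE1": 0.79, "KSP1": 0.83, "KGE6": 0.85, "KSP6": 0.86}

# Loss in average KGE when moving from calibrated to ungauged basins.
loss = {name: round(calibrated[name] - holdout[name], 2) for name in calibrated}
```

The resulting losses are 0.14, 0.07, 0.03, and 0.02 for KGE1, KSP1, KGE6, and KSP6, respectively, matching the values given in the comparison with Mai et al. (2022).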
For the parameter transfer, the experiments including AET during calibration (KSP1 and KSP5) produce better spatial patterns (SPAEF 0.41 and 0.49) when combining ungauged basins than the KGE-only calibrations (SPAEF −0.10 and 0.25); however, KGE5 produced better patterns than KGE1. This is in line with the results of Poméon et al. (2018), who calibrated sparsely gauged basins using remote sensing products. Their study showed that including AET in model calibration significantly improved the performance of the evapotranspiration simulation, whereas soil moisture and total water storage predictions were within a good predictive range.
In this study, SPAEF is used as the evaluation metric for spatial pattern performance. However, other metrics could have been utilized, and further investigations covering other regions, model codes, and spatial observation data should be conducted to gain experience in interpreting the SPAEF metric and to benchmark the spatial pattern performance of distributed models.
The internal validation against 46 discharge stations was intended to evaluate whether adding spatial patterns to the calibration would improve the discharge bias performance within each basin. Somewhat surprisingly, and discouragingly, such a systematic bias improvement could not be verified. A previous study by Conradt et al. (2013) on the Elbe basin revealed large discrepancies between water balance AET (precipitation minus discharge) and remote sensing-based AET at the sub-basin level. This could indicate that sub-basin water balances are in some cases largely controlled by factors other than AET, such as water divergence, abstraction, or inter-basin groundwater flow (Le Mesnil et al., 2020; Soltani, Koch, et al., 2021). Wan et al. (2015) showed that the inter-basin transfer of water could cause significant errors in water balance-based AET calculations. Alternatively, the accuracy of the satellite-based AET might not be sufficient to describe differences at the sub-basin level. Recent analyses using the AET data set used in this study have demonstrated that remote sensing-based AET can reproduce large-scale AET patterns across major European basins (>25,000 km²) (Stisen et al., 2021), while studies like Conradt et al. (2013) and Stisen et al. (2021) indicate substantial deviations for smaller sub-basins (below 200-500 km²).

Conclusions
The need for systematically transferring parameters to ungauged basins while respecting their landscape heterogeneity and water balance motivated us to expand our previous single-basin experiments (Demirel, Koch, Mendiguren, & Stisen, 2018; Demirel, Koch, & Stisen, 2018; Demirel, Mai, et al., 2018) to a regional scale study.
In this study, we elaborated on the value of multi-basin, multi-objective model calibration for distributed hydrologic modelers incorporating readily available global remote sensing data in flexible open-source models with cutting-edge parameter regionalization schemes like the multiscale parameter regionalization (MPR) in mHM. Through this approach, our single- versus multi-objective calibration schemes represented purely temporal evaluations (KGE) versus temporal and spatial pattern evaluations (KGE and SPAEF).
We first selected the most relevant parameters for spatial calibration using a sensitivity analysis. Remotely sensed AET based on the two-source energy budget approach was then used together with outlet discharge time series to constrain the mHM simulations. Through a series of calibration and cross-validation experiments, we identified tradeoffs between objective functions representing temporal and spatial model performance and examined the robustness of parameter transferability to ungauged basins.
We can draw the following conclusions from our results:
• Multi-objective calibrations including both temporal and spatial evaluation metrics, for both individual and multiple basins, resulted in balanced solutions with better spatio-temporal performance than single-objective calibrations focusing on temporal performance alone. Adding new constraints on spatial patterns only leads to a very limited deterioration in discharge performance while improving the model predictions of actual evapotranspiration, illustrating a small tradeoff between temporal and spatial model performance.
• Combining multi-basin and multi-objective calibration has positive impacts on the simulated fluxes and improves the spatial consistency of parameter fields and their transferability to ungauged basins. Multi-basin calibration is found to be the most crucial element of robust parametrizations if only focusing on discharge. However, adding spatial pattern objectives further ensures spatial consistency, performance, and transferability.
Improved model parametrizations in distributed hydrologic models via different transfer functions in combination with appropriate spatial calibration frameworks could facilitate the application of global hyper-resolution models for "everywhere" (Bierkens et al., 2015) and "without an illogical (unseamless) patchwork of states and fluxes" (Mizukami et al., 2017) in the future. Future work should incorporate more than six basins and spatial patterns of other variables readily available from reliable satellite products.

Figure 1. Calibration framework. Six calibration experiments are performed: three using only streamflow observations (left column of panels) with KGE as the performance metric, and three using streamflow and AET patterns (right column of panels) with two objective functions (KGE and SPAEF) in a KGE-SPAEF-Pareto (KSP) experiment. The six basins are calibrated either independently (first row), collectively (second row), or in a one-basin-out-at-a-time approach, that is, a Jack-knife robustness test (third row).

Figure 2. Calibration results for six basins. Note that red symbols in the gray zone (lower left) are the exact OF values (sum of squared errors) used for calibration, whereas gray symbols in the white zone (upper right) are the corresponding metric values. In this way, all calibration results are plotted twice (as numerical metric values and as SSE). We plot both as it is easier to relate and compare actual metric values. Mean SPAEF values refer to the mean of SPAEF calculated for the three seasons.

Figure 3. Multi-basin calibration results across the six basins for KGE only (triangles) and for KGE and SPAEF (circles) as objective functions. Note that the gray zone (lower left panel) shows the exact OF values used for calibration, whereas the white zone (upper right panel) shows the corresponding metric values.

Figure 4. Spatial patterns of normalized AET (average of March-November) from the TSEB model (first row), KGE-only calibration cases (second row), and multi-objective calibration cases using KGE and SPAEF (third row). The left column shows calibration results when the six basins are calibrated independently, while the right column shows the results when the six basins are calibrated collectively.

Figure 5. Spatial patterns of estimated normalized field capacity (top four panels) and crop coefficient (bottom four panels) when calibrating only streamflow for each basin independently (KGE1) or collectively (KGE6), or using streamflow and AET to constrain each basin independently (KSP1) or collectively (KSP6). Note that field capacity and crop coefficient are not direct parameters of mHM but are calculated from several parameters (Kc* and PTF*) in Table 2.

Figure 6. Difference in absolute discharge bias between the KGE5 holdout and the KSP5 holdout for 46 internal discharge gauges in the six basins for the full year (top) and the summer period (May-September) (bottom). Green colors indicate that constraining a model with streamflow and AET leads to better streamflow predictions in ungauged basins than constraining the model with streamflow only. Red colors indicate the opposite.

Figure B1. Ranked scores for 46 internal validation stations for the KGE5 and KSP5 spatial holdouts. Values are the KGE and its three components r, α, and β. Data for β correspond to the results mapped in Figure 6 (top).

Table 1
Main Characteristics of the Six Catchments, That Is, Drainage Area (km²), Annual Precipitation (P in mm), Annual Reference Evapotranspiration (ETref in mm), and Annual Discharge (Q in mm), Calculated Based on the Common Period of 1980-2016

Table 3
Model Performances on KGE and SPAEF (Across All Basins) for Different Calibration Experiments
Table 3 highlights the limited tradeoff for KGE both for individual stations and averages.

Note. Values for the KSP calibrations represent the best-balanced solutions from the Pareto fronts. For KGE1 and KSP1, KGE values are averages across five holdout experiments. Values in parentheses are standard deviations (STD): for single stations, across holdout experiments; for average KGE, across stations and holdout solutions; and for SPAEF, across seasons and holdout solutions.

Table 4
Model Performances on KGE and SPAEF (Across All Basins) for Different Cross-Validation Experiments