The Abuse of Popular Performance Metrics in Hydrologic Modeling

The goal of this commentary is to critically evaluate the use of popular performance metrics in hydrologic modeling. We focus on the Nash-Sutcliffe Efficiency (NSE) and the Kling-Gupta Efficiency (KGE) metrics, which are both widely used in hydrologic research and practice around the world. Our specific objectives are: (a) to provide tools that quantify the sampling uncertainty in popular performance metrics; (b) to quantify sampling uncertainty in popular performance metrics across a large sample of catchments; and (c) to prescribe the further research that is needed to improve the estimation, interpretation, and use of popular performance metrics in hydrologic modeling. Our large-sample analysis demonstrates that there is substantial sampling uncertainty in the NSE and KGE estimators. This occurs because the probability distribution of squared errors between model simulations and observations has heavy tails, meaning that performance metrics can be heavily influenced by just a few data points. Our results highlight obvious (yet ignored) abuses of performance metrics that contaminate the conclusions of many hydrologic modeling studies: it is essential to quantify the sampling uncertainty in performance metrics when justifying the use of a model for a specific purpose and when comparing the performance of competing models.

Equation 5 provides an algebraic decomposition of the MSE that includes the bias in the mean (the first term), the standard deviation (the second term), and the covariance (the third term). Note from Equation 5 that the algebraic decomposition of the MSE is not particularly effective because the second and third terms are not independent of one another (see also Gupta et al., 2009; Mizukami et al., 2019).

The Nash-Sutcliffe Efficiency (NSE)
The  NSE E is an estimator of a standardized skill score that measures the fractional improvement over a benchmark. The theoretical version of NSE is The algebraic decomposition of the NSE can be derived by making use of the decomposition in Equation 3. Substituting Equation 3 into 6 provides a decomposition of the NSE Equation 7 is the estimator version in Murphy (1988), his Equation 11, which is identical to the "new" decomposition of NSE presented by Gupta et al. (2009)  is limited because the variance and correlation terms cannot be separated cleanly.

The Kling-Gupta Efficiency (KGE)
The KGE metric differs from the NSE metric in that it is not derived from the MSE; the KGE is simply one minus the Euclidean distance computed using the coordinates of bias, standard deviation, and correlation, each measured relative to its ideal value of one (Gupta et al., 2009).
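Under the original Gupta et al. (2009) formulation, the KGE can be sketched as follows (a minimal Python illustration; names are ours), with beta the ratio of means, alpha the ratio of standard deviations, and r the correlation:

```python
import numpy as np

def kge(sim, obs):
    """KGE = 1 - Euclidean distance of (r, alpha, beta) from the ideal
    point (1, 1, 1), following Gupta et al. (2009)."""
    sim, obs = np.asarray(sim, float), np.asarray(obs, float)
    r = np.corrcoef(sim, obs)[0, 1]      # linear correlation
    alpha = sim.std() / obs.std()        # variability ratio
    beta = sim.mean() / obs.mean()       # bias ratio
    return 1.0 - np.sqrt((r - 1.0) ** 2 + (alpha - 1.0) ** 2 + (beta - 1.0) ** 2)
```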

Large-Sample Model Simulations for the CAMELS Catchments
In this study we analyze hydrologic model simulations from a large sample of catchments across the contiguous USA (Figure 1). Our analysis uses existing hydrologic model simulations from the Variable Infiltration Capacity model (VIC version 4.1.2h) applied to the 671 catchments in the CAMELS data set (Catchment Attributes and MEteorology for Large-sample Studies). Mizukami et al. (2019) provide details on the large-sample VIC configuration; Newman, Clark, Sampson, et al. (2015) and Addor et al. (2017) provide details on the hydrometeorological and physiographical characteristics of the CAMELS catchments. The CAMELS catchments are those with minimal human disturbance (i.e., minimal land use changes or disturbances, minimal water withdrawals), and are hence almost exclusively smaller, headwater-type catchments (median basin size of 336 km²).
The calibration and evaluation procedure used by Mizukami et al. (2019) is as follows. The VIC model is forced using the daily basin-average meteorological data described by Maurer et al. (2002) and calibrated and evaluated using streamflow data obtained from the USGS National Water Information System server (http://waterdata.usgs.gov/usa/nwis/sw). The VIC model is calibrated using the dynamically dimensioned search algorithm (DDS; Tolson & Shoemaker, 2007). In each of the 671 CAMELS catchments, the VIC model is calibrated separately for NSE and KGE (Mizukami et al., 2019). The hydrometeorological data are split into a calibration period (October 1, 1999-September 30, 2008) and an evaluation period (October 1, 1989-September 30, 1999), with a prior 10-year warm-up period. To maximize the sample size in our analysis, we analyze NSE and KGE computed over the combined 19-year calibration and evaluation period (October 1, 1989-September 30, 2008).

Analysis of the Influence of Individual Data Points
The uncertainties in system-scale performance metrics may be large because the estimates are shaped by a small fraction of the simulation-observation pairs (Clark et al., 2008; Fowler et al., 2018; Lamontagne et al., 2020; McCuen et al., 2006; Newman, Clark, Sampson, et al., 2015; Wright et al., 2019); that is, a small number of simulation-observation pairs have a disproportionate influence on performance metrics. In particular, there is enormous sampling variability associated with streamflow statistics in arid regions (see also Ye et al., 2021). The influence of individual data points can be quantified by successively deleting observations and evaluating their impact on a statistic of interest (e.g., see Efron, 1992; Hampel et al., 1986); such methods are commonly used in applications of the Jackknife method.
It is straightforward and intuitive to calculate the influence of individual data points on the MSE estimates.
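A minimal sketch of both influence calculations in Python (names are ours; this illustrates the idea behind the deleted-point analysis and the contribution of the largest errors, rather than reproducing the paper's exact notation):

```python
import numpy as np

def influence_on_mse(sim, obs):
    """Leave-one-out influence: the change in the MSE estimate when
    simulation-observation pair i is deleted from the sample."""
    sq = (np.asarray(sim, float) - np.asarray(obs, float)) ** 2
    n = sq.size
    mse_deleted = (sq.sum() - sq) / (n - 1)   # MSE with pair i removed
    return sq.mean() - mse_deleted

def topk_error_fraction(sim, obs, k=10):
    """Fraction of the sum-of-squared errors contributed by the k largest
    squared errors (cf. the analysis of the 10 largest daily errors)."""
    sq = np.sort((np.asarray(sim, float) - np.asarray(obs, float)) ** 2)
    return sq[-k:].sum() / sq.sum()
```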

Quantifying Uncertainties in the NSE and KGE Estimates
It is particularly important to quantify the sampling uncertainty in model performance metrics when the error distributions exhibit heavy tails, as is the case with the errors obtained from daily streamflow simulations. Parallels to this problem are in the meteorological community, where it is common to quantify the uncertainty in the performance or skill metrics used to describe probabilistic forecasts of rare events (e.g., Bradley et al., 2008;Jolliffe, 2007).
Some attractive approaches to quantify sampling uncertainty are based on the bootstrap (e.g., Vogel & Shallcross, 1996), because they are relatively easy to implement and understand, and because they replace complex theoretical statistical methods with simple brute-force computations (see Appendix A). Clark and Slater (2006) used bootstrap methods to quantify uncertainties in the performance metrics that they used to evaluate probabilistic estimates of precipitation extremes. Bootstrap methods have also been used to quantify the uncertainty in NSE estimates (Ritter & Muñoz-Carpena, 2013). Bootstrap methods are likely to find increasing use in hydrology due to the ease with which they can be applied compared to more complex methods; given their simplicity, it is indeed surprising how few applications of the bootstrap there have been in hydrology.
The sampling uncertainty in the NSE and KGE estimates is quantified using a mixture of Jackknife and Bootstrap methods. First, we use the Jackknife and Bootstrap methods to compute the standard error in the NSE and KGE estimates. These methods resample from the original data sample using the Non-overlapping Block Bootstrap (NBB) strategy of Carlstein (1986), with data blocks of length one year. Bootstrapping methods are only effective if the blocks used are approximately independent: the use of one-year blocks reduces the issues with substantial seasonal non-stationarity in shorter data blocks, while preserving the within-year autocorrelation and seasonal periodicity of streamflow series. Second, we use the Bootstrap methods to compute tolerance intervals for the NSE and KGE estimates, where the 90% tolerance intervals are defined as the difference between the 95th and 5th percentiles of the empirical probability distribution of the NSE and KGE estimates. Tolerance intervals differ from confidence intervals because tolerance intervals describe the spread of a random variable, rather than being random intervals around some true value. These Bootstrap tolerance intervals are computed using 1,000 bootstrap samples. Finally, we use the Jackknife-After-Bootstrap method (Efron, 1992) to estimate the standard error in the Bootstrap tolerance intervals, which enables us to evaluate how sensitive the resulting uncertainty intervals are to individual years (blocks). The implementation details of the uncertainty quantification methods discussed above are summarized in Appendix A; the open-source "gumboot" package has been developed to quantify the sampling uncertainty in performance metrics (https://github.com/CH-Earth/gumboot; https://cran.r-project.org/package=gumboot).
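The authors' implementation is the R package gumboot; purely to illustrate the mechanics, a one-year block bootstrap of an efficiency metric can be sketched in Python as follows (function names and setup are ours, not from gumboot):

```python
import numpy as np

rng = np.random.default_rng(0)

def nse(sim, obs):
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def block_bootstrap_metric(sim, obs, years, metric=nse, n_boot=1000):
    """Non-overlapping block bootstrap (Carlstein, 1986) with one-year blocks:
    resample whole water years with replacement and recompute the metric.
    `years` holds the water-year label of each daily value."""
    year_ids = np.unique(years)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        pick = rng.choice(year_ids, size=year_ids.size, replace=True)
        idx = np.concatenate([np.flatnonzero(years == y) for y in pick])
        reps[b] = metric(sim[idx], obs[idx])
    return reps
```

The 90% tolerance interval is then `np.percentile(reps, 95) - np.percentile(reps, 5)`, and the bootstrap standard error is simply `reps.std(ddof=1)`.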
It is important to note that the methods implemented here quantify the sampling uncertainty in the NSE and KGE estimates for a given hydrologic model and a given sample of streamflow observations. The model itself will contain uncertainty (e.g., uncertainty in the meteorological inputs; uncertainty in the model parameters and model structure). The observations used to compute the NSE and KGE estimates also contain uncertainty, especially for the high-flow extremes that can have a large influence on these estimates. The model and data uncertainty are not explicitly included in the estimates of sampling uncertainty (we return to this point in Section 5.3).

Results
The probability distribution of squared errors between model simulations and observations has heavy tails, meaning that the estimates of sum-of-squared error statistics can be heavily influenced by a small fraction of the simulation-observation pairs (Clark et al., 2008; Fowler et al., 2018; Lamontagne et al., 2020; Newman, Clark, Sampson, et al., 2015). To document this issue, Figure 2 uses Equation 10 to quantify the influence of the k largest errors on the MSE estimates, repeating the analysis of Newman, Clark, Sampson, et al. (2015) with the VIC model. Figure 2a quantifies the influence of the 10 individual days with the largest errors on the MSE estimates: in many catchments, 10 days in the 19-year period contribute over 50% of the sum-of-squared errors between simulated and observed streamflow. Figure 2b identifies the k largest observations that jointly contribute 50% of the MSE estimate, expressed as a percentage of the total sample length n: in many catchments, 50% of the sum-of-squared errors is caused by less than 0.5% of the simulation-observation pairs. These results suggest that there will be large uncertainty in the NSE and KGE estimates.

Figure 2. Influence of individual data points on the MSE estimate. The upper plot shows the fraction of the MSE estimate contributed by the 10 days with the largest errors; the lower plot shows the percentage of days that contribute 50% of the MSE estimate.

Figure 3 illustrates that the 90% tolerance intervals for both NSE and KGE (as obtained by the bootstrap methods described in Appendix A) are greater than 0.1 for more than half of the CAMELS catchments. The results in Figure 3 also illustrate that the bootstrap and jackknife methods yield consistent standard error estimates. This large uncertainty has important implications when the NSE and KGE metrics are used as a calibration target.

Figure 3. Sampling uncertainty in the NSE and KGE estimates across the CAMELS catchments. The uncertainty is quantified using standard error estimates (×2) obtained using Jackknife and Bootstrap methods (see Appendix A for implementation details), along with tolerance intervals computed as the difference between the 95th and 5th percentiles of the Bootstrap samples. Results are shown for calibrations obtained by maximizing the NSE metric (upper plots) and by maximizing the KGE metric (lower plots).

The jackknife-after-bootstrap methods enable an evaluation of the degree of precision and accuracy associated with the bootstrap tolerance intervals. While there is considerable sampling uncertainty in the tolerance intervals (estimated using the jackknife-after-bootstrap methods; Figure 4), that uncertainty is considerably smaller than the uncertainty associated with the NSE and KGE estimates shown in Figure 3. As we discuss in the next section, the sampling uncertainty depicted in Figure 3 may be under-estimated in situations where there is extremely high skewness in daily streamflows.

It Is Necessary to Quantify the Uncertainty in Performance Metrics
The high uncertainty associated with the NSE and KGE estimators underscores the need to quantify the uncertainty in the performance metric estimators used in hydrologic modeling applications. Quantifying the sampling uncertainty in model evaluation statistics is easily accomplished using appropriate bootstrap methods. Moreover, bootstrap methods can be applied to any performance metric estimator. Quantifying the uncertainty in performance metric estimators should arguably become a routine part of the hydrologic modeling enterprise. As our results show, the width of the 90% tolerance intervals associated with the NSE and KGE estimators is greater than 0.1 in at least half of the analyzed catchments. Such wide 90% tolerance intervals indicate considerable uncertainty associated with each of these metrics. These results imply that the conclusions from many hydrologic modeling studies may not be justified in light of the high sampling uncertainty in system-scale performance metric estimators.

Figure 4. The standard error in the Bootstrap tolerance intervals of Figure 3, estimated using the jackknife-after-bootstrap method of Efron (1992).
In spite of the ease with which the bootstrap may be applied as a post-processing approach to develop uncertainty intervals, there is a need for additional research on methods to quantify sampling uncertainty. Our experiments (not shown) demonstrate that traditional bootstrap methods may severely under-estimate the sampling uncertainty in the NSE and KGE estimators in situations where there is extremely high skewness (see also Chernick & LaBudde, 2011). These under-estimates in uncertainty occur because bootstrap methods "recycle" the observations, and the bootstrap samples do not adequately encapsulate the uncertainty associated with the few extraordinary errors in the thick upper tail of the error distribution. Indeed, our Jackknife-after-Bootstrap analyses demonstrate that there are large standard errors in our bootstrap estimates of uncertainty in NSE and KGE. Thus, given the extremely high skewness of daily streamflow observations in some watersheds, we recommend future research that compares the uncertainty intervals derived from various bootstrap methods against the uncertainty intervals derived from more advanced stochastic methods, for example, methods based on a bivariate lognormal monthly mixture model (e.g., Papalexiou, 2018). Variance reduction methods introduced in the machine learning and statistics literature (e.g., Nelson & Schmeiser, 1986) can be used to improve estimates of the theoretical NSE and KGE performance metrics. More generally, the approaches of bagging and bragging could be tested, where the performance metrics are estimated using the mean or the median of multiple bootstrap samples (Berrendero, 2007). Further work is needed to better understand the characteristics of data points that have high leverage in order to devise methods that improve estimates of the theoretical NSE and KGE statistics.

It Is Necessary to Put Performance Metrics in Context
The growing field of model benchmarking seeks to put performance metrics into context, for example, by asking whether models meet our a-priori expectations, or whether models adequately use the information that is available to them. Recent efforts in model benchmarking have focused on defining lower and upper benchmarks to provide context for model performance (Nearing et al., 2018; Newman et al., 2017; Seibert et al., 2018). Lower benchmarks evaluate the extent to which models surpass expectations (Seibert, 2001), for example, the extent to which model simulations perform better than a benchmark such as climatology, persistence, simulations from another model (Wilks, 2011), or departures from the seasonal cycle (Knoben et al., 2020; Schaefli & Gupta, 2007). A key component of defining the lower benchmark is defining our a-priori expectations of model capabilities. We define the upper benchmark to quantify the predictability of the system, that is, the maximum information content in the forcing-response data (Nearing et al., 2018; Newman et al., 2017). For example, Best et al. (2015) demonstrated that many mechanistic land models were out-performed by simple statistical models, implying that modern land models do not adequately use the information that is available to them. Much work still needs to be done to quantify our expectations for model performance (the lower benchmark) as well as to quantify system-scale predictability (the upper benchmark).
Benchmarking is important in the context of performance metrics because the NSE and KGE embed rather weak a-priori expectations of model performance: the NSE evaluates model simulations against the observed mean flow, a benchmark that is easy to beat in catchments with strong seasonality. The observed mean is often used as a benchmark with the KGE metric as well, imposing old expectations on a new metric. Using stricter, purpose-specific benchmarks can give a clearer idea of model strengths and weaknesses.
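The idea of replacing the observed-mean benchmark with a stricter one is easy to express: the NSE is a special case of a generalized skill score 1 - SSE(model)/SSE(benchmark). The sketch below (Python; function names are ours) uses a day-of-year climatology as a lower benchmark in the spirit of Schaefli and Gupta (2007) and Knoben et al. (2020):

```python
import numpy as np

def benchmark_efficiency(sim, obs, bench):
    """Generalized skill score: 1 - SSE(model) / SSE(benchmark).
    With bench equal to the observed mean this reduces to the NSE;
    a stricter benchmark raises the bar for the model."""
    sim, obs, bench = (np.asarray(a, float) for a in (sim, obs, bench))
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((bench - obs) ** 2)

def seasonal_benchmark(obs, doy):
    """Day-of-year climatology of the observations: the mean observed
    value for each calendar day, used as a seasonal-cycle benchmark."""
    obs, doy = np.asarray(obs, float), np.asarray(doy)
    clim = {d: obs[doy == d].mean() for d in np.unique(doy)}
    return np.array([clim[d] for d in doy])
```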
It is also necessary to evaluate the system-scale performance metrics in the context of the uncertainties in the model inputs (e.g., spatial meteorological forcing data), the uncertainties in the hydrologic model (e.g., uncertainties in model parameters and model structure), and the uncertainties in the system-scale response (e.g., streamflow observations). Many groups are now developing ensemble spatial meteorological forcing fields in order to understand how uncertainties in the model forcing data affect uncertainties in the hydrologic model simulations (e.g., Clark & Slater, 2006; Cornes et al., 2018; Frei & Isotta, 2019; Newman, Clark, Craig, et al., 2015). There is also a wealth of approaches to quantify hydrologic model uncertainty. Vogel (2017) introduced the concept of stochastic watershed models (SWMs), which involve methods for generating likely stochastic traces of daily streamflow from deterministic watershed models. All such methods of developing SWMs reviewed by Vogel (2017), including the very generalized blueprint introduced by Montanari and Koutsoyiannis (2012), may be employed to develop uncertainty intervals associated with either streamflow predictions or other water resource system variables. There is now also substantial effort dedicated to quantifying uncertainty in streamflow observations (e.g., see the comparison of uncertainty techniques by Kiang et al., 2018, and also Coxon et al., 2015, and Mansanarez et al., 2019). The key issue is that the most uncertain observations of streamflow are in the upper tail; these observations also have the most influence on the KGE and NSE metrics. Further research is needed to understand how these sources of uncertainty are manifest in system-scale performance metrics.

It Is Necessary to Understand the Limitations of System-Scale Performance Metrics
It is well known that minimizing the sum-of-squared errors in calibration results in simulated streamflows with smaller variance than the observations (e.g., Gupta et al., 2009). This occurs because of the interplay between the estimates of the variance of the flows and the correlation in the NSE described in Section 2. Mizukami et al. (2019) used NSE as an objective function in a large-sample hydrologic model calibration study. They showed that the calibrated simulations had substantial under-estimates of high flow events, such as the annual peak flows that are used for flood frequency estimation. Underestimation of variance, as well as of all other upper moments, is a general problem associated with simulation models and is not limited to the use of a particular objective function (see Farmer & Vogel, 2016).
There are also problems with the KGE metric, as discussed by Santos et al. (2018). The KGE was developed to address known limitations of the NSE, but we argue that this did not solve but only changed the problems related to system-scale performance metrics. It is important to be aware of the theoretical behavior of system-scale performance metrics, along with their limits of applicability, and to use additional metrics that are tailored to suit specific applications.

It Is Necessary to Use Additional Performance Metrics
A key problem with system-scale performance metrics is that they do not make adequate use of the full information content in the data. Gupta et al. (2008) point out that global calibration of hydrologic models (e.g., using NSE or KGE as the objective function) entails compressing the information in the model output and observations into a single performance metric, and then using that single metric to infer values of multiple model parameters and all aspects of hydrological processes. Such global calibration methods can lead to problems of compensatory parameters, providing the "right" results for the wrong reasons (Kirchner, 2006). Specifically, parameters in one part of the model may be assigned unrealistic values that compensate for unrealistic parameter values in another part of the model, or that compensate for errors in the model forcing data and weaknesses in model structure (Clark & Vrugt, 2006). Addressing this problem requires asking a different question: instead of asking "How good is my model?", it may be more appropriate to ask "What is my model good for?" This second question is more relevant when designing a modeling experiment for a specific application.
In this context, it is worth pointing out that it is straightforward to redefine the KGE metric to address such problems. Another approach is to use additional non-global metrics (e.g., multiple diagnostic signatures of hydrologic behavior). For example, much of the research on model calibration and evaluation now focuses on multi-criteria methods, including analysis of trade-offs among multiple objective functions (e.g., Fenicia et al., 2007; Yapo et al., 1998), analysis of the temporal variability of model errors (Coxon et al., 2014; Reusser et al., 2009), and scrutinizing diagnostic signatures of hydrologic behavior in order to identify model weaknesses (e.g., Rakovec et al., 2016). A key part of this analysis is to understand the sensitivity of different non-global metrics to individual parts of a model (e.g., Markstrom et al., 2016; Van Werkhoven et al., 2009). As such, these alternative metrics can focus attention on aspects of the model that may be more relevant for specific modeling applications.

Conclusions
The goal of this commentary is to critically evaluate the performance metrics that are habitually used in hydrologic modeling. Our focus is on the Nash-Sutcliffe Efficiency (NSE) and the Kling-Gupta Efficiency (KGE) metrics, which are both widely used in science and applications communities around the world. Our contributions in this paper are three-fold:
1. We provide tools to enable hydrologic modelers to quantify the sampling uncertainty in system-scale performance metrics. We use the non-overlapping block bootstrap method to obtain probability distributions and associated tolerance intervals of estimates of NSE and KGE, and we use the jackknife-after-bootstrap method to obtain estimates of the standard error of those bootstrap tolerance intervals. These comparisons enable us to verify that even though the tolerance intervals display sampling variability, that variability is always considerably smaller than the tolerance intervals themselves, providing a useful validation of the precision of the tolerance intervals.
2. We quantify the sampling uncertainty in system-scale performance metrics across a large sample of catchments. Our results show that the probability distribution of squared errors between model simulations and observations has heavy tails, meaning that the estimates of sum-of-squared error statistics can be shaped by just a few simulation-observation pairs (Figure 2). This leads to substantial uncertainty in the NSE and KGE estimators (Figures 3 and 4). The implication of these results is that the conclusions from many hydrologic modeling studies are based on values for these metrics that fall well within the metrics' uncertainty bounds; such conclusions may thus not be justified.
3. We define further research that is needed to improve the estimation, interpretation, and use of system-scale performance metrics in hydrological modeling.
More generally, our commentary highlights the obvious (yet ignored) abuses of performance metrics that contaminate the conclusions of many hydrologic modeling studies. We look forward to additional studies that improve the scientific basis of model evaluation.

Appendix A: The Jackknife and Bootstrap Methods
In this study, we use two resampling methods, the Jackknife and the Bootstrap, to estimate the empirical probability distribution of the NSE and KGE estimators for each of the 671 CAMELS catchments. These methods estimate the empirical probability distribution of a given statistic by drawing or resampling a number of independent samples from the original sample of data.
The following sub-sections describe the implementation of the Jackknife and Bootstrap methods, including the resampling strategies, the Jackknife and Bootstrap estimates of standard error, and the Jackknife estimates of the standard error in the bootstrap-derived empirical probability distributions of NSE and KGE.

A1. The Jackknife and Bootstrap Resampling Strategies
The Jackknife method is a structured approach of resampling without replacement where observations are successively deleted from the original sample of data. The ith Jackknife sample x_(i) is the data set that remains after deleting the ith observation (or the ith block of observations), that is, x_(i) = (x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). The value of the ith Jackknife replicate is the value of the estimator applied to the ith Jackknife sample, that is, θ̂_(i) = s(x_(i)). The Jackknife method is useful in cases where it is desirable to conduct a structured analysis of the deleted-point statistics.
The Bootstrap method is much more flexible than the Jackknife method. The Bootstrap method uses the approach of resampling with replacement. A Bootstrap sample is obtained by using a random number generator to make n independent draws from the original sample of data (Efron & Tibshirani, 1986), and the process is repeated to generate B samples, that is, y*_1, y*_2, ..., y*_B. When implementing these resampling methods, it is necessary to ensure independence between each draw from the original sample of data (Carlstein, 1986; Künsch, 1989; Vogel & Shallcross, 1996). Specifically, the errors in daily streamflow simulations are characterized by substantial periodicity and persistence; this creates complex temporal dependence structures on time scales from days (e.g., errors in the simulations of recessions after a storm event) to seasons (e.g., errors in the simulations of seasonal snow accumulation and melt, or errors in the seasonal cycle of transpiration). To address these issues, we implement a non-overlapping block resampling strategy that was developed for the Bootstrap method, the Non-overlapping Block Bootstrap (NBB) of Carlstein (1986). This approach identifies k sub-series of data, where each sub-series is approximately statistically independent. In our implementation, the k = 19 sub-series are the water years 1990, 1991, ..., 2008, where a water year spans the period October 1-September 30 (e.g., water year 1990 is the period October 1, 1989-September 30, 1990).
The non-overlapping block resampling strategy is used for both the Jackknife and Bootstrap methods. The Jackknife sample for a given water year is the data set that remains after deleting that water year; for example, the Jackknife sample for water year 2002 is used to compute the NSE and KGE estimates using all daily data except those in water year 2002. The Bootstrap method samples water years with replacement: a given Bootstrap sample may include a given water year more than once, or may not include a given water year at all. The Bootstrap samples that do not include a given water year (e.g., all Bootstrap samples without water year 2002) open up opportunities to quantify the standard errors in the Bootstrap estimates of the empirical probability distributions (using the Jackknife-After-Bootstrap method introduced by Efron, 1992; we discuss this implementation in Section A3).

A2. Jackknife and Bootstrap Estimates of Standard Errors in the NSE and KGE Estimates
The Jackknife estimate of standard error can be obtained by first considering the case where the statistic of interest is the sample mean (Efron & Gong, 1983). The average of the ith Jackknife sample, with the ith observation deleted, is x̄_(i) = (n x̄ - x_i) / (n - 1). The Jackknife standard error of x̄ is then (Efron & Gong, 1983) se_jack = sqrt[ ((n - 1)/n) Σ_i (x̄_(i) - x̄_(·))² ], where x̄_(·) = (1/n) Σ_i x̄_(i). Generalizing from the sample mean to an arbitrary statistic of interest (Efron, 1992), the Jackknife estimate of the statistic of interest, θ̂_jack, can be defined as θ̂_jack = n θ̂ - (n - 1) θ̂_(·), where θ̂ is the estimate of the statistic using all observations and θ̂_(·) = (1/n) Σ_i θ̂_(i) is the average of the Jackknife replicates; the standard error follows from the same formula with x̄_(i) replaced by θ̂_(i). The Bootstrap estimate of standard error is more straightforward: it is simply the standard deviation of the B Bootstrap replicates.
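The jackknife standard-error formula can be applied directly to the leave-one-block-out replicates; a short Python sketch (the function name is ours):

```python
import numpy as np

def jackknife_se(replicates):
    """Jackknife standard error from the n leave-one-out (or
    leave-one-block-out) replicates theta_(i):
    se = sqrt( ((n - 1)/n) * sum( (theta_(i) - mean)^2 ) )."""
    t = np.asarray(replicates, float)
    n = t.size
    return np.sqrt((n - 1) / n * np.sum((t - t.mean()) ** 2))
```

For the sample mean, this reproduces the classical result s / sqrt(n), with s the sample standard deviation.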

A3. Jackknife Estimates of Standard Error in the Bootstrap-Derived Probability Distributions
The Bootstrap estimates of the empirical probability distributions create a conundrum: whilst outliers can cause large uncertainty in the NSE or KGE estimates, the outliers can also create large uncertainty in the Bootstrap estimates of the empirical probability distributions. It is hence necessary to estimate the standard error in the Bootstrap methods.
Estimates of the standard error in the Bootstrap methods can be computed easily using the Jackknife-After-Bootstrap method of Efron (1992). In the previous discussion we noted that the non-overlapping block resampling strategy opens up opportunities to quantify the standard errors in the Bootstrap estimates of the empirical probability distributions. Specifically, for a given water year we can compute a Jackknife sample using all of the Bootstrap samples that do not include that water year. When such Jackknife samples are constructed for all water years, the Jackknife method can be used to estimate standard error in the Bootstrap estimates (the Jackknife-After-Bootstrap method).
The Jackknife-After-Bootstrap method is implemented as follows (Efron, 1992). For each Bootstrap sample, we record which observations (blocks) it contains, which allows us to calculate the proportion of each Bootstrap sample that equals a given observation x_i (Efron, 1992). Given this information, we can identify the subset of the B Bootstrap samples that do not include observation x_i, and compute statistics g(·) from that subset of Bootstrap samples, where g(·) may be a statistic such as the 5th or 95th percentile.
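A sketch of the jackknife-after-bootstrap calculation in Python (names are ours; it assumes B is large enough that every block is absent from at least some bootstrap samples):

```python
import numpy as np

def jackknife_after_bootstrap(boot_samples, boot_stats, n_blocks,
                              g=lambda s: np.percentile(s, 5)):
    """Jackknife-after-bootstrap (Efron, 1992): for each block i, evaluate
    g(.) over the bootstrap replicates whose samples do NOT contain block i,
    then apply the jackknife standard-error formula to those values.
    boot_samples: (B, n) array of resampled block indices;
    boot_stats:   (B,) metric value for each bootstrap sample."""
    vals = []
    for i in range(n_blocks):
        omit_i = ~(boot_samples == i).any(axis=1)   # samples without block i
        vals.append(g(boot_stats[omit_i]))
    vals = np.asarray(vals)
    n = n_blocks
    return np.sqrt((n - 1) / n * np.sum((vals - vals.mean()) ** 2))
```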
The Jackknife estimate of standard error then uses the formula of Section A2, with the Jackknife replicates θ̂_(i) replaced by the values of g(·) computed from the Bootstrap samples that exclude the ith block.

Data Availability Statement
The data for the large-domain model simulations are publicly available at the National Center for Atmospheric Research at https://ral.ucar.edu/solutions/products/camels. The source code to quantify the sampling uncertainty in performance metrics (the "gumboot" package) is available at https://github.com/CH-Earth/gumboot.