Ensemble Skill Gains Obtained From the Multi‐Physics Versus Multi‐Model Approaches for Continental‐Scale Hydrological Simulations

Multi‐physics ensembles have emerged as a promising approach to hydrological simulations. As multi‐physics ensembles are constructed by perturbing the model physics, the ensemble members share a substantial portion of the same physics and hence are not independent of each other. It is unknown whether and to what extent this nonindependence affects the skill gain of the ensemble method, especially compared with the multi‐model ensemble approach. This study compares a multi‐physics ensemble configured from the Noah land surface model with multi‐parameterization options (Noah‐MP) with the North American Land Data Assimilation System (NLDAS) multi‐model ensemble. The two ensembles are evaluated in terms of the annual cycle and interannual anomaly at 12 River Forecast Centers over the conterminous United States. The ensemble skill gain is measured by the difference between the performance of the ensemble mean and the average of the ensemble members' performance, and the inter‐member independence is measured by error correlations. Results show that, due to the improved model physics, the Noah‐MP configurations outperform, on average, the NLDAS models, especially in the snow‐dominated areas. The Noah‐MP ensemble almost always obtains an outstanding member that performs the best among the two ensembles, reflecting its dense sampling of the feasible model physics space. However, these two performance superiorities do not lead to a superiority of the ensemble mean. The Noah‐MP ensemble has a lower ensemble skill gain, which corresponds to the lower inter‐member independence. These results highlight the importance of inter‐member independence, particularly when most hydrological ensemble methods have overlooked it.

. This asymmetry indicates that the performance of the ensemble mean (PEM) differs from the arithmetic average of the ensemble members' performance (AEP).
"The key to the success of the multi-model concept lies in combining independent and skillful models, each with its own strengths and weaknesses" (Hagedorn et al., 2005). The importance of inter-model independence has been clearly shown in various analyses from different perspectives. Yoo and Kang (2005) measured the performance by the error correlation coefficient. Using mathematical decomposition of the correlation coefficient, they showed that the difference between the PEM and the AEP increases if the errors of the ensemble members are less correlated with each other. Winter and Nychka (2009) represented model errors as vectors in T dimensions (where T is the number of time steps), the length of which is the root-mean-square error. Their geometric analyses clearly showed that the PEM reaches a maximum (i.e., the mean square error reaches a minimum) if the set of member models can sample the error space evenly. From the perspective of information theory (Goodwell et al., 2020;Goodwell & Kumar, 2017a, 2017b, nonindependent models contain redundant information. On the basis of a measure of mutual information content, Sharma et al. (2019) demonstrated that the skill gains obtained by multi-model ensembles are dominated by model independence.
The North American Land Data Assimilation System (NLDAS) (Mitchell et al., 2004) Phase 2 adopts the concept of the multi-model ensemble and runs four distinct models over the conterminous United States (CONUS). The four models in NLDAS Phase 2 are the Noah land surface model (Noah, version 2.8), Mosaic, the Variable Infiltration Capacity (VIC, version 4.0.3) model, and the Sacramento Soil Moisture Accounting Model. These NLDAS models differ remarkably in model structure and process parameterization (Kumar et al., 2017;. Using confirmatory factor analysis, Kumar et al. (2017) confirmed that these structural differences result in dissimilar model behaviors, especially in the simulations of runoff. However, with a long history of extensive evaluations and improvements (Xia et al., 2019;, the performance of the NLDAS multi-model ensemble has been shown to be high. The ensemble can reproduce the spatiotemporal variations of the observed evapotranspiration (ET) (Long et al., 2014;Zhang et al., 2020), streamflow , soil moisture (Xia et al., 2015), and groundwater  over CONUS. The sound inter-model independence and satisfactory model performance make the NLDAS multi-model ensemble a useful benchmark for new ensemble techniques at a continental scale.
Recently, multi-physics models have emerged as a new tool for performing ensemble hydrological simulations. Such models host different parameterization schemes for several key processes (M. P. Clark, Kavetski, & Fenicia, 2011). An ensemble of model configurations can be generated by selecting different combinations of parameterization schemes for particular land surface processes Yang et al., 2011;Zhang et al., 2016;Zheng & Yang, 2016). Numerous studies using this kind of multi-physics ensemble simulation have already been conducted (M. P. Clark et al., 2010;Coxon et al., 2014;Krueger et al., 2010;McMillan et al., 2010;Oudin et al., 2006), addressing various research questions. By discriminating between competing model parameterizations, the linkage between model parameterizations and catchment type and hydrological signatures can be established (M. P. Clark et al., 2010;M. P. Clark et al., 2016;McMillan et al., 2010). Such analyses improve our understanding of the dominant hydrological processes in various catchments and hydrological events (Coxon et al., 2014;Douinot et al., 2018). The multi-physics ensemble can also benefit uncertainty attribution and model development. The impacts of a specified process on the overall model behavior can be pinpointed, and the interplay between different processes (Koster, 2015;Koster & Milly, 1997) can be revealed You et al., 2020;Zhang et al., 2016;Zheng et al., 2019). Recognizing these advantages in mechanistic understanding and uncertainty attribution, several mainstream models have adopted this concept. Available multi-physics models include the Noah land surface model with multi-parameterization options (Noah-MP) , the Community Land Model version 5 (Lawrence et al., 2019), the Structure for Unifying Multiple Modeling Alternatives (M. P. Clark et al., 2015aClark et al., , 2015b, and the Joint UK Land Environment Simulator D. B. Clark, Mercado, et al., 2011).
physical realism improves the model performance in simulating runoff, groundwater, soil moisture, snow, and total water storage at the basin and global scales. Ma et al. (2017) reported that Noah-MP is more skillful than the NLDAS models in simulating runoff. Xia et al. (2016a) and Cai, Yang, Xia, et al. (2014) also showed that Noah-MP outperforms the NLDAS models in capturing the seasonal cycle of snow and snowmelt runoff due to the inclusion of a multilayer snowpack and the improvement of the turbulence parameterization. The credibility obtained from the improved physical realism and model performance promotes the adoption of Noah-MP in large-scale operational systems, including the Weather Research and Forecasting model (Barlage et al., 2015) for weather forecasts and the National Water Model (Maidment, 2017) for hydrological forecasts. The model is also undergoing extensive tests for the next phase of NLDAS Zhang et al., 2020). Any improvement in the simulation performance from Noah-MP would directly benefit these operational systems.
However, it is largely unknown how the emerging multi-physics approach compares with the well-established multi-model approach for ensemble hydrological simulations. It is often hypothesized that the multi-physics approach has an advantage (M. P. Clark, Kavetski, & Fenicia, 2011). Multi-physics ensembles provide a dense sampling of the feasible model physics space. There is likely a multi-physics ensemble member that can closely simulate the observed hydrological behaviors, leading to superior performance. The inclusion of such a superior-performing member should result in a superior PEM, as a consequence of its asymmetric response to the constituent's performance. Furthermore, the best-performing multi-physics member varies according to the basin and climatic conditions of concern . A multi-physics ensemble that includes these members can offer a systematic performance improvement under various conditions. However, these advantages may be offset by a hidden pitfall. The multi-physics ensemble members share a substantial portion of the same parameterization schemes with each other. These physics overlaps reduce the inter-member independence and may lower the ensemble skill gains. This negative effect has been greatly outlooked. A comprehensive comparison of the multi-physics and multi-model ensemble approaches is needed.
This study presents such a comparison between the Noah-MP multi-physics ensemble and the NLDAS multi-model ensemble. The comparison is aimed at testing three hypotheses: (a) the Noah-MP configurations outperform the NLDAS models due to the improved model physics; (b) the Noah-MP ensemble contains an outstanding member as a result of the dense sampling of the feasible model physics space; and (c) the ensemble skill gain obtained from the Noah-MP ensemble is low owing to the low inter-member independence. The two ensembles are compared in terms of the annual cycle and interannual anomaly, the results of which may benefit studies of long-term water resource projections, seasonal hydrological forecasts, and drought monitoring. The comparisons are performed for the 12 River Forecast Centers (RFCs) over CONUS. This area covers a wide variety of climatic regimes, which should ensure robustness in the comparison (Gupta et al., 2014). In the remainder of this paper, Section 2 details the two ensembles and the observations. Section 3 describes the definitions of the ensemble skill gain and skill metrics. Section 4 presents the results. Section 5 provides conclusions and further discussion.

NLDAS Multi-Model Ensemble
We use three NLDAS models: Noah-2.8, VIC-4.0.3, and Mosaic. They are the only three NLDAS models whose outputs are publicly available from the NASA Goddard Earth Sciences Data and Information Services Center (https://disc.gsfc.nasa.gov/datasets?keywords=NLDAS). A brief introduction to these three models is provided as follows: Noah (Ek et al., 2003) was developed as the land component of the weather and climate forecasting models of the National Centers for Environmental Prediction (NCEP) and National Oceanic and Atmospheric Administration (NOAA). Noah has four soil layers and one snow layer. There is a dominant vegetation type for each grid cell. In terms of model physics, Noah considers a comprehensive time-dependent canopy resistance (Chen et al., 1996), seasonal frozen soils, and the snow accumulation/ablation processes (Koren et al., 1999). Noah 2.8 (the version used in NLDAS-2 and this study) contains many improvements over Noah 2.7.1 (the version used in NLDAS-1), including modification of the albedo formulation by combining snow-albedo decay and liquid-water refreeze (Livneh et al., 2010) and the introduction of seasonal factors to the simulation of warm-season processes (Wei et al., 2013).
Mosaic (Koster & Suarez, 1992) is a land surface model developed for use within general circulation models (Ducharne et al., 1999). It includes three soil layers and one snow layer. Mosaic features the "mosaic" strategy for considering surface heterogeneity. There are several vegetation tiles within each model grid cell, where the energy and water balance are calculated separately (Yang et al., 2003). The calculations are based on the Simple Biosphere model (Sellers et al., 1986), which is analogous to the electrical resistance method in calculating the energy and matter transfer in biophysical processes.
VIC (Liang et al., 1994) is a widely used semi-distributed hydrology model with a full Surface Vegetation-Atmosphere Transfer representation. It features a variable infiltration capacity approach for parameterizing runoff generation. VIC includes three soil layers and two snow layers. Like Mosaic, VIC also considers several vegetation types with a tiling method (Cherkauer et al., 2003). Furthermore, VIC utilizes subgrid elevation bands to realistically describe the dependence of temperature, precipitation, and snow on elevation in the snow-dominated regions (Liang et al., 1994). The version used in this study, 4.0.3, is subject to the constraints of both surface water and energy conservation.
All the NLDAS models run over CONUS (25°-53°N, 125°-67°W) at a spatial resolution of 0.125° × 0.125° . The time step of Noah and Mosaic is 15 min, while that of VIC is 60 min. These models are driven by the same set of atmospheric forcings, topography, vegetation, and soil types. The precipitation forcing data combine the gauge-based precipitation from NOAA's Climate Prediction Center (CPC) with the monthly Parameter-elevation Regressions on Independent Slopes Model topographical adjustment, Doppler Stage II radar precipitation data, NOAA's CPC Morphing Technique data, and the NCEP North American Regional Reanalysis (NARR) precipitation. The nonprecipitation forcings are derived from the NARR analysis fields (Cosgrove et al., 2003). The surface downward shortwave radiation data are also bias-corrected by the hourly 1/8th-degree GOES-based surface downward shortwave radiation fields (1996)(1997)(1998)(1999)(2000) (Pinker et al., 2003). The NLDAS topography data are based on the GTOPO30 Global 30 Arc Second (∼1 km) Elevation Data set. The 14-class University of Maryland (UMD) map at 1 km of the UMD's Land Cover Classification is used as the NLDAS vegetation class data set. The monthly leaf area index (LAI) data set is re-gridded from NOAA's National Environmental Satellite, Data, and Information Service 0.144° monthly climatology LAI (Gutman & Ignatov, 1998). The NLDAS soil data with 16 texture types over CONUS is derived from the 1-km Penn State STATSGO data. More details can be found in Table S1. These data sets have been widely evaluated. Although several deficiencies have been found in the clear-sky solar radiation (Slater, 2016) and the precipitation over the high-elevation western areas (Henn, Clark, et al., 2018;, the data have shown an overall sound performance for large-domain hydrological simulations (Cai, Yang, David, et al., 2014;Cai, Yang, Xia, et al., 2014;Xia et al., 2016aXia et al., , 2016bXia et al., 2015).

Noah-MP Multi-Physics Ensemble
Forty-eight configurations of Noah-MP version 3.6 are constructed by perturbing the model physics of runoff generation, stomatal conductance, soil moisture limitation factor to transpiration (β-factor), and surface-layer turbulence. These processes are selected based on their importance suggested in previous studies Zhang et al., 2016;Zheng et al., 2019). There are two distinct stomatal conductance schemes (Ball-Berry, Jarvis), three β-factor schemes (NOAHB, CLM, SSiB), four runoff schemes (SIMGM, SIMTOP, NOAHR, BATS), and two turbulence schemes (M-O, Chen97) (48 = 2 × 3 × 4 × 2). The four runoff schemes fall into two groups. The first group consists of SIMGM and SIMTOP. They are TOPMODEL-based schemes with a groundwater component. The groundwater in SIMGM is dynamic, which interacts with the soil moisture, whereas the groundwater in SIMTOP is determined by the bottom soil moisture based on an equilibrium assumption. The second group-namely NOAHR and BATS-does not have a groundwater component. NOAHR represents the infiltration excess runoff-generation mechanism, whereas BATS represents the idea of fractional saturation area. Details of the parameterization schemes can be found in Table S2. FEI ET AL.

10.1029/2020WR028846
The Noah-MP multi-physics ensemble is driven by the same atmospheric forcings and static inputs as the NLDAS models. The simulation is also performed at the same resolution as the NLDAS models. The simulations span 30 years from January 1982 to December 2011 with a 15-min time step. The initial condition on January 1, 1982 is generated by a 102-year spin-up run, which consists of 100 runs of the year 1979 and then the following 1980 to 1981.

USGS HUC8 Runoff Data
The monthly 8-digit Hydrologic Unit Code (HUC8) runoff data (http://waterwatch.usgs.gov/) (Brakebill et al., 2011) from 1982 to 2011 are used as the observations. The data are derived from the daily streamgauge observations of the USGS. The derivation is based on two assumptions: (a) the runoff is uniform within each HUC8; (b) the river routing can be ignored because the propagation of flow waves at the basin scale is in days (Allen et al., 2018). With these two assumptions, the calculation is as follows: denotes the runoff of the th k HUC8 basin, and i R is the th i stream-gauge observation.

 
HUC8 k R is weighted across the stream-gauge observations whose catchment intersects with the th k HUC8 basin. ,

RFCs Over CONUS and Their Climatology
The above-mentioned simulations and observations are upscaled into 12 RFCs over CONUS-namely Northeast (NE), Mid-Atlantic (MA), Ohio (OH), Lower Mississippi (LM), Southeast (SE), North Central (NC), Northwest (NW), Arkansas (AB), Missouri (MB), West Gulf (WG), California-Nevada (CN), and Colorado (CB). As discussed in  and Gudmundsson, Wagener, et al. (2012), the spatial aggregation of the observations/simulations from smaller basins (HUC8 and NLDAS grid cells) reduces the measurement errors in observations and modeling errors in spatially varying parameters. Furthermore, as the observations are taken in relatively smaller basins (i.e., HUC8) and used for a relatively long timescale (i.e., monthly), river routing is negligible. Figure 1 shows the boundaries and climatology, including multiyear averaged precipitation, potential evapotranspiration, runoff, runoff ratio, and Budyko's aridity index. The potential evapotranspiration is calculated from the NLDAS daily forcings based on the FAO-56 algorithm (Allen et al., 1998) of the Penman-Monteith equation (Monteith, 1965;Penman, 1956). Budyko's aridity index (Budyko, 1974) is defined as the ratio of potential evapotranspiration over precipitation. The RFC-averaged values of the climatology are shown in Table S3. Consistent with Budyko's hypothesis, the runoff ratio generally decreases with increasing aridity. However, three abnormalities exist in NW, CN, and CB. These abnormally high values hint that NLDAS may underestimate the precipitation in high-elevation areas, as also suggested by Henn, Clark, et al. (2018) and .

Annual Cycle and Interannual Anomaly
The annual cycle and interannual anomaly are evaluated separately. The decomposition of the modeled and observed runoff into a multi-year averaged climatology ( clim r ), mean annual cycle ( ancy,m r ), and interannual anomaly ( anom, , y m r ) is carried out as follows. For a modeled or observed runoff , y m r for the mth y m y m m r r r r (4) The mean annual cycle gives insights into the seasonal variation. The interannual anomaly reflects the yearto-year variation, which reflects a model's responses to the monthly perturbations in atmospheric forcings.

Ensemble Skill Gain
we define the ensemble skill gain (G) as the difference between the PEM (   S x ) and the AEP (S x  ): where S denotes a deterministic skill measure, which is detailed in the next section, and i w is the weight for  As broadly reported, the ensemble weights ( i w ) have a significant influence on the PEM (   S x ) (Ajami et al., 2006;Bohn et al., 2010), and thus on the ensemble skill gain (G). There are a number of sophisticated weighting methods that can optimize the PEM Arsenault et al., 2015;Duan et al., 2007;Hsu et al., 2009;Marshall et al., 2006;Oudin et al., 2006;Vrugt & Robinson, 2007;Vrugt et al., 2006). However, most of these methods assume inter-model independence, which is not appropriate for this study. Furthermore, the derivation of an optimal set of weights heavily depends on the objective functions, the referencing model signatures, the variables of interest, the research catchment, the study period, and the reference data sets. These will unnecessarily complicate this study without improving the universality of the conclusions. Accordingly, in this study, we use equal weights (i.e.,  1 / i w N), which are also widely used without utilizing prior information (Gao & Dirmeyer, 2006) and robust under nonstationary conditions (Ajami et al., 2006;Beck et al., 2017;Georgakakos et al., 2004;Guo et al., 2007).

Skill Measures
We quantify the ensemble skill gains based on four deterministic skill measures: the Taylor Skill Score (TSS) (Taylor, 2001), Nash-Sutcliffe Efficiency (NSE), Kling-Gupta Efficiency (KGE) (Gupta et al., 2009), and correlation coefficient (R). For brevity, we show the analyses mainly based on TSS. The analyses based on NSE, KGE, and R show consistent results, which can be found in the Supporting Information (Tables S3-S7). TSS is derived from the Taylor diagram (Taylor, 2001), which is a two-dimensional plot that visually summarizes multiple aspects of the performance of a model simulation ( f ) relative to the observations (o), including the correlation coefficient (R), normalized unbiased root-mean-square error (nuRMSE), and normalized variability ( f  ): where T is the total time step number, f is the temporal mean of the model simulation ( t f ), and TSS measures the distance of a model's performance apart from a reference point (usually, in a "perfect" simulation,  RMSE 0 and  1 R ) on the Taylor diagram. The definition (Taylor, 2001) is as follows: where 0 R is the maximum correlation coefficient attainable. 0 R is assumed to be 1 in this study. TSS ranges from 0 to 1. A higher value indicates a higher consistency between the model prediction and the observations. A probabilistic measure, the Continuous Ranked Probability Score (CRPS) (Hersbach, 2000;Wilks, 1995), is also used to investigate the effects of independence. CRPS is one of the well-established probabilistic forecast skill scores that describe the consistency between the forecast probability and observed state. Let the forecast variable be denoted by f . Then, CRPS is defined as follows: where o is the value that occurred (in this case, this value is set as the observation), P and o P are the cumulative distribution functions (CDFs), and o P is the Heaviside function. In this study, the empirical CDF was selected to estimate the CDF of runoff simulations: where    y denotes the probability density function forecast by an ensemble system.
The CRPS measures the difference between the predicted and observed cumulative distributions. A smaller CRPS indicates higher skill. CRPS is 0 if and only if  o P P. In this study, we calculated the time-averaged CRPS of the annual runoff cycle and the interannual runoff anomaly.

Error Correlation
The independence between each two ensemble members is measured by the error correlation coefficient (ECC) as follows: where error ( t e ) is defined as the difference between the model simulation and the observations; , The error correlation coefficient ranges from −1 to +1. For a two-member ensemble, if the error correlation equals −1, their errors vary in the opposite directions at each time step. The errors of the ensemble mean may have a lower magnitude but are not zero unless the two ensemble members' errors can cancel each other out completely. If two ensemble members have an error correlation of +1, their errors vary in the same direction. As the errors can add up, we do not expect an ensemble skill gain from the average of the two ensemble members. As the ensemble size increases above two, an ensemble of independent members is expected to have an average error correlation coefficient of 0. A lower absolute value of error correlation indicates strong independence and can potentially generate a higher ensemble skill gain.

Results
We individually examine the performance of the ensemble members to show dissimilarity within each ensemble and the performance comparisons between the two ensembles in Section 4.1. In Section 4.2, the two ensembles' members (51 in total: 48 Noah-MP configurations and three NLDAS models) are ranked. The best member and the ensemble mean are identified. Lastly, inter-member independence and its impacts on the PEM are presented and discussed. Figure 2 compares the simulated annual runoff cycle from the NLDAS multi-model ensemble and the Noah-MP multi-physics ensemble. The NLDAS models produce different timing of the runoff peaks. The differences are the most notable in NE, NC, NW, and CB. These RFCs are either in northern CONUS or in mountainous areas, where the spring runoff peak is dominated by snowmelt. Among the NLDAS models, VIC performs the best in capturing the timing of the runoff peak, with a relatively small bias of approximately one month earlier. Such outperformance may be attributable to the detailed consideration of elevation bands (Liang et al., 1994). Noah performs the worst, with a 1-to 2-month lag in NE and NC and a two-month lead in CB. Compared with the NLDAS models, the Noah-MP ensemble members perform better, especially in the snow-dominated RFCs mentioned above (i.e., NE, NC, NW, and CB). This superiority FEI ET AL.  has been previously reported and considered to be due to the multilayer snowpack physics (Cai, Yang, Xia, et al., 2014;Ma et al., 2017;Xia et al., 2016a).

Performance Difference Within and Between the Two Ensembles
Figure 2 also shows the simulated magnitude (or variability) of the annual cycle. The NLDAS models differ from each other remarkably. The Noah model tends to overestimate the variability, whereas the Mosaic model tends to underestimate it. The VIC model lies between these two models and is closest to the observations. Compared with the NLDAS models, the Noah-MP multi-physics ensemble members again show an overall better performance. This outperformance can also be found in other performance criteria, including NSE, KGE, and R (Tables S2-S4). Interestingly, the inter-member difference tends to be related to the climatic aridity, especially for the Noah-MP multi-physics ensemble (also shown in Figures 3 and 4 and Figure S9). In NE and MA, where the climate is humid, the Noah-MP ensemble members perform similarly to each other. The significant inter-member difference appears in AB, MB, WG, CN, and CB ( Figure S9). All these RFCs are spatially adjacent in southwest CONUS, with a semi-arid or arid climate. The remarkable inter-member differences in arid RFCs may be related to the difficulties in simulating terrestrial water storage anomalies (Cai, Yang, Xia, et al., 2014).
The performance of the NLDAS models and the Noah-MP configurations is shown in Figure 3. The NL-DAS models scatter significantly, in which the climatic aridity and snow play different roles. The spread in semi-arid and arid RFCs (e.g., AB, MB, WG, CN) is mainly manifested in the modeled variability (i.e., spread in the radial direction). The spread in snow-dominated RFCs (e.g., NE, NC, NW) is mainly manifested in FEI ET AL.
10.1029/2020WR028846 10 of 24  the correlation coefficient, which corresponds to the above-mentioned bias in the simulated timing of the spring runoff peak. In CB, where the climate is the driest and where snow is important, the spread is the most pronounced, suggesting a double difficulty in this arid and snow-dominated RFC.
The performance spread among the Noah-MP ensemble members is generally comparable to that among the NLDAS models. In humid RFCs (e.g., NE, MA, OH, LM), the Noah-MP ensemble members are similar to each other and comparable to the observations. The spread increases in the arid RFCs, and the spread increase is mainly manifested in the modeled variability (i.e., spread in the radial direction). However, there are two notable differences between the Noah-MP and NLDAS ensembles. First, in snow-dominated RFCs (e.g., NE, NC, NW), the spread in the correlation coefficient among the Noah-MP ensemble members is small. All Noah-MP configurations correlate well with the observations, suggesting superiority of the multilayer snowpack physics. Second, the Noah-MP ensemble members cluster according to the runoff parameterizations. A Sobol' variance-based analysis (Saltelli & Sobol', 1995;Zheng et al., 2019) shows that the runoff parameterization contributes more than 58% (median 78%) of the TSS variation in all RFCs ( Figure S1). Among the four runoff schemes, SIMTOP (r2) tends to have relatively low variability, whereas BATS (r4) tends to obtain a relatively high variability. SIMGM (r1) separates from the others with a high correlation coefficient in NE and CB but a low correlation coefficient in OH, LM, SE, and AB. It is interesting to note that the SIMTOP scheme (r2) with equilibrium groundwater performs closer to the two runoff parameterization schemes without groundwater (i.e., NOAHR (r3) and BATS (r4)). Figure 4 shows the Taylor diagram for the interannual anomaly. Compared with the annual cycle (Figure 3), the performance spread is slightly reduced for the Noah-MP ensemble but significantly reduced for the NLDAS models. The decrease in the spread is mainly attributable to the decreases in the correlation coefficient spread, which is the most obvious in snow-dominated RFCs (e.g., NE, NC, NW, and CB). This smaller FEI ET AL.
10.1029/2020WR028846 11 of 24 Noah VIC Mosaic Noah-MP ave NLDAS ave Obs spread in the correlation coefficient suggests a tighter control of the atmospheric forcing on the interannual anomalies than on the annual cycle. The cluster of the runoff parameterization schemes is clearly apparent. The Sobol' variance-based analysis reveals that more than 52% (median 65%) of the variance in the performance can be attributed to the runoff parameterization in all RFCs ( Figure S2). The spread of the performance is mainly manifested in the correlation coefficient, and SIMGM (r1) is distinguishable from the others. In most RFCs except CN and CB, SIMGM (r1) obtains a distinctly low correlation coefficient. The other three runoff parameterizations of Noah-MP (i.e., SIMTOP (r2), NOAHR (r3), BATS (r4)) and the three NLDAS models obtain a similar correlation coefficient. They mainly differ in the modeled variability. The BATS runoff scheme (r4) tends to have the largest variability, whereas SIMTOP (r2) tends to have the lowest. Figure 5 summarizes the average TSS obtained from the NLDAS and Noah-MP ensembles. Their performance deteriorates from the humid to arid RFCs. On average, Noah-MP outperforms NLDAS for both the annual cycle and interannual anomaly. For the annual cycle, the Noah-MP ensemble outperforms the NL-DAS ensemble in humid RFCs, but the outperformance is marginal in arid RFCs. In snow-dominated RFCs (e.g., NE, NC, and CB), the NLDAS ensemble further deteriorates, and Noah-MP shows a clearer superiority. For the interannual anomaly, the performance of the Noah-MP ensemble is consistently high in both humid and arid RFCs. The NLDAS ensemble is only marginally worse than Noah-MP in humid RFCs, but the performance deteriorates quickly in arid RFCs. Furthermore, for the interannual variability, snow does not have an obvious impact on the performance of the NLDAS models.
FEI ET AL.
10.1029/2020WR028846 12 of 24  Assimilation System multi-model (Noah, VIC, and Mosaic) ensemble members for simulating the annual runoff cycle over the 12 River Forecast Centers (RFCs). The ranking is across the two ensembles and ranges from 1 to 51 (51 = 48 + 3), from the best to the worst. Stars mark the best member among the two ensembles at each RFC. White indicates the performance median (a ranking of 26). Red indicates better than the median and blue indicates worse than the median. The bottom panel shows the ranking of the ensemble mean, which ranges from 0 to 52. A ranking of 0 (52) indicates that it outperforms (underperforms) all the 51 constituent members. The labels are r1 for SIMGM, r2 for SIMTOP, r3 for NOAHR, and r4 for BATS; b1 for NOAHB, b2 for CLM, b3 for SSiB; t1 for M-O, t2 for Chen97; c1 for Ball-Berry, c2 for Jarvis.
In the other schemes, runoff is directly controlled by soil moisture. The improvements of SIMTOP (r2) and BATS (r4) may suggest that soil moisture plays a more important role in mediating the interannual FEI ET AL.
We also identified the best member among all the members of the two ensembles. As shown in Table 1, the performance of the best ensemble member is remarkably high, with a TSS value higher than 0.9 in all RFCs for both the annual cycle and interannual anomaly. The best ensemble member varies significantly among the RFCs, indicating that its outstanding performance is not likely a result of the accurate representation of the universal model physics. The Noah-MP multi-physics ensemble contains the best member in almost all the RFCs. Only in NW for the annual cycle and in OH for the interannual anomaly does the VIC model marginally outperform the best Noah-MP configuration. This finding is consistent with the hypothesis given in the introduction that the multi-physics ensemble can sample the feasible model physics space densely and thus have more of a chance to include an outstanding-performing member that fits different basins.
The Noah-MP ensemble has a better average ranking ( Figure 5) and always contains the best ensemble members (Table 1). Thus, the Noah-MP ensemble mean should clearly outperform the NLDAS ensemble mean in all RFCs. However, the following results contradict this hypothesis. Figure 6 also shows the ranking of the ensemble means for the annual cycle at the bottom. The NLDAS ensemble mean shows comparable or better performance than the Noah-MP ensemble mean in semi-humid to semi-arid RFCs, including OH, LM, SE, NC, NW, AB, and WG. The superiority of the NLDAS ensemble mean is more apparent for the interannual anomaly (Figure 7) than for the annual cycle (Figure 6) Note. The best members of the two ensembles are shown in italics. The labels are r1 for SIMGM, r2 for SIMTOP, r3 for NOAHR, and r4 for BATS; b1 for NOAHB, b2 for CLM, b3 for SSiB; t1 for M-O, t2 for Chen97; c1 for Ball-Berry, c2 for Jarvis.
Abbreviations: NLDAS, North American Land Data Assimilation System; RFC, River Forecast Center; TSS, Taylor Skill Score.

Table 1 The Best Member of the Noah-MP Multi-Physics and the NLDAS Multi-Model Ensembles in Simulating the Annual Cycle and Interannual Anomaly Over Each RFC
mean suggests a higher ensemble skill gain obtained from the NLDAS models, which is confirmed in Table 2. The ensemble skill gain measured in NSE and KGE is also shown in Table S7. As discussed in the introduction, the relatively lower ensemble skill gain obtained from the Noah-MP multi-physics ensemble hints that the multi-physics ensemble members may be too similar to each other. Figure 8 shows the error correlation for the Noah-MP multi-physics and NLDAS multi-model ensembles. In the RFCs other than NW, CN, and CB, the average values of the error correlations among the NLDAS models are 0.20 for the annual cycle and 0.21 for the interannual anomaly. The models perform independently. This assessment based on error correlation is consistent with the assessment of model similarity by Kumar et al. (2017). However, in NW, CN, and CB, the error correlation is high (on average, 0.86 for the annual cycle and 0.64 for the interannual anomaly), which was not reported by Kumar et al. (2017). This may be partly caused by systematic errors in the forcing and evaluation data. These RFCs have a complex terrain. Zheng et al. (2020) showed that the precipitation, evapotranspiration, and runoff data over these RFCs are not physically consistent. Data set intercomparison and assessment based on streamflow observations (Henn, Clark, et al., 2018; suggest the precipitation data over the western areas with complex topography are highly uncertain. The error stemming from the forcings can propagate into all the model outputs, yielding a high error correlation. The mean error correlations of the Noah-MP multi-physics ensemble are 0.69 for the annual cycle and 0.75 for the interannual anomaly. Compared with the NLDAS models, these high values suggest that the Noah-MP ensemble members are not adequately independent. The high error correlation corresponds well to the low ensemble skill gain shown in Table 2 Note. The ensemble mean is ranked within each of the two ensembles. The ranking of the ensemble mean can be zero, which means that the ensemble mean outperforms all the ensemble members. Abbreviations: NLDAS, North American Land Data Assimilation System; PEM, performance of the ensemble mean; RFC, River Forecast Center; TSS, Taylor Skill Score.

Asymmetric Responses of the PEM to the Inclusion of the Most Versus Least Independent Members
The above analyses show that there is a correspondence between the low ensemble skill gain and low independence. In this section, we further confirm the effects of independence by removing the most and least independent members from consideration. The most independent member has the lowest average error correlation with the other members, whereas the least independent member has the highest average value. Figure 9 shows the performance of the Noah-MP multi-physics ensemble mean in the 12 RFCs. The points immediately to the left of the center denote the cases in which the least independent one or five members are removed, whereas the points immediately to the right denote the cases in which the most independent one or five members are removed. Note that the independence of a member is measured by the average of its error correlation with all the others. Figure 9 also shows the responses of the PEM to the removal of the worst-performing one or five members (left) and the best-performing one or five members (right). The effects of independence and constituent performance can thus be compared.
FEI ET AL.

10.1029/2020WR028846
17 of 24 Figure 8. The error correlation in simulating the annual runoff cycle (Ancy) and interannual anomaly (Anom) for the Noah-MP multi-physics and NLDAS multi-model ensembles. There are 1,128 (  48 47 / 2) samples for the Noah-MP multi-physics ensemble and three (  3 2 / 2) samples for the NLDAS multimodel ensemble. The upper and lower quantile lines show the 75th percentile value and 25th percentile value, respectively. The green and red lines denote the average and median values, respectively. The black dots denote the outliers, which are outside 1.5 times the interquartile range (the upper quantile value minus the lower quantile value) above the upper quantile and below the lower quantile. Figure 9 compares two kinds of asymmetry in the responses of the PEM. First, there is an asymmetry to the removal of the best-versus worst-performing members, which has been well established in previous studies (Ajami et al., 2006). Second, there is an asymmetry to the removal of the most versus least independent members, which is the focus of this study. The PEM degrades if the most independent members are removed, suggesting that the superior performance of the ensemble mean could come from these independent members. Removing the least independent members does not have an obvious impact on the PEM, suggesting that the Noah-MP multi-physics ensemble can be optimized by eliminating these ensemble members. Compared between the two kinds of asymmetry, the independence effect is more pronounced than the performance effect.
Not only the PEM (Figure 9), but also the ensemble performance measured by CRPS, responds asymmetrically ( Figure 10) to the removal of the best-performing (most independent) versus worst-performing (least independent) members. The ensemble performance degrades (an increase in the CRPS value) if the most independent or best-performing members are removed, whereas it shows minor or no changes when the least independent or worst-performing members are removed. As with the asymmetry shown in the PEM (Figure 9), the asymmetry in CRPS (Figure 10) is more pronounced when removing the most/least independent members than when removing the best-/worse-performing members.
FEI ET AL.

10.1029/2020WR028846
18 of 24 Figure 9. The performance of the Noah-MP multi-physics ensemble mean in simulating the annual runoff cycle (Ancy, upper subpanel, solid line) and interannual anomaly (Anom, lower subpanel, dash line) in the 12 River Forecast Centers. In each panel, the least independent or the worst-performing one or five member(s) are removed on the left, whereas the most independent or the best-performing one or five member(s) are removed on the right. The zero indicates the original Noah-MP ensemble with 48 members. The effect of removing the worst/best-performing (highest/lowest Taylor Skill Score) members is shown by the red lines, and the effect of removing the least/most independent (highest/lowest ECC) members is shown by the blue lines. The y-axis ranges of the plot are 0.04 (*except for the upper subpanel of WG (j), which is 0.06).

Conclusions and Discussion
In this study, we compared the Noah-MP multi-physics ensemble and the NLDAS multi-model ensemble in simulating the annual cycle and interannual variability of runoff in 12 RFCs of CONUS. The performance of the constituent members, the best member, the PEM, and how it is affected by inter-member independence was analyzed. The results can be summarized as follows: First, on average, the 48 Noah-MP configurations outperform the NLDAS models, resulting from the improved model physics. In snow-dominated regions of CONUS, Noah-MP can better capture the timing of the spring runoff peak from snowmelt. The models scatter more in arid than in humid RFCs, and the spread in arid RFCs is manifested in the different variability. The Noah-MP configurations with a dynamic groundwater component (SIMGM) produce a distinguishable correlation coefficient to those without groundwater. The correlation coefficient from SIMGM is overall higher in CN and CB, which have an arid climate. However, in other RFCs, SIMGM obtains a lower correlation coefficient. These differences between the Noah-MP configurations are more pronounced for the interannual anomaly than for the annual cycle. Second, multi-physics ensembles can sophisticatedly treat the subtle variations in the dominant processes of different environments with a dense sampling of the feasible model physics space. The Noah-MP multi-physics ensemble contains an outstanding member that performs the best between the two ensembles. The best member varies significantly among the RFCs and differs between the annual cycle and interannual anomaly. Third, however, the arithmetic average of the NLDAS models shows comparable performance to the Noah-MP multi-physics ensemble mean. This indicates that the ensemble skill gain obtained from the Noah-MP multi-physics ensemble is significantly low compared with that obtained from the NLDAS multi-model ensemble. The low ensemble skill gain corresponds well to the high error correlation among FEI ET AL.

10.1029/2020WR028846
19 of 24 Anom high/Low TSS Anom high/Low ECC the ensemble members. Fourth, there is an asymmetry in the responses of the PEM to inter-member independence. The performance of the ensemble mean deteriorates when the most independent members are removed, whereas it shows little change when the least independent members are removed.
This study suggests: (a) hydrological ensembles should include the models with advanced process representations, which can provide skillful simulations over a variety of environments; (b) multi-physics ensembles that can densely sample the feasible model physics space are beneficial to obtaining outstanding members that fit varying environments; and (c) selection of ensemble members should not only consider their performance but also how independent they are from the others.
Removing the low-performing and nonindependent members can reduce the computational cost of an ensemble while retaining its performance. Many previous studies of ensemble optimization Ye et al., 2014) and post-processing Duan et al., 2007) implicitly assume the ensemble members are independent (Abramowitz, 2010;Knutti et al., 2010;Tebaldi & Knutti, 2007). This study shows the assumption does not hold for the multi-physics ensemble since the ensemble members share a significant portion of the same physics (note that ensembles constructed by perturbing parameter values may have the same issue). There have been several measures of independence proposed (Abramowitz & Gupta, 2008;Tebaldi & Knutti, 2007). Methods for selecting independent members have been developed (Evans et al., 2013), but the selection generally requires significant effort in terms of manual screening. Automatic methods that require minimal effort in terms of manual screening are crucial for hydrological ensemble simulations, as the independence may vary with climate conditions, basin characteristics, and hydrological signatures. Future studies for developing such methods deserve to be prioritized.
This study has two other limitations. First, although we have shown the multi-physics ensemble can sample the feasible model physics space more densely, we have not verified whether the coverage of the sampling is broad enough. Unconsidered physics (or unknown unknowns as introduced in Zheng et al., 2020) may have substantial impacts on the assessment of inter-member independence. For instance, in mountainous areas, if there is lateral flow of groundwater among neighboring basins, this unconsidered process will introduce a common error term in all ensemble members and thus lower the assessed independence (higher error correlation). This may also be the case for extreme events, for which we lack a complete understanding of the underlying physics. Second, we have not investigated how calibration affects ensemble performance. While calibration can improve the performance of individual members, it can also decrease the independence among them. There is a need in future studies for a complete picture of the density, coverage, and fidelity of the model physics samples, and how calibration can affect them.

Data Availability Statement
The NLDAS-2 static data and meteorological forcing data were obtained from the Goddard Earth Sciences Data and Information Services Center (https://disc.sci.gsfc.nasa.gov/uui/datasets?keywords=NLDAS). The USGS HUC8 runoff data were downloaded from the USGS Water Watch website (https://waterwatch.usgs. gov/?id=romap3). The model outputs were generated from Noah-MP version 3.6, which can be downloaded from the website of the Research and Applications Laboratory at the National Center for Atmospheric Research (https://ral.ucar.edu/solutions/products/noah-multiparameterization-land-surface-model-noah-mp-lsm).