A comparison of National Water Model retrospective analysis snow outputs at snow telemetry sites across the Western United States

This study compares the US National Water Model (NWM) reanalysis snow outputs to observed snow water equivalent (SWE) and snow‐covered area fraction (SCAF) at snow telemetry (SNOTEL) sites across the Western United States. SWE was obtained from SNOTEL sites, while SCAF was obtained from moderate resolution imaging spectroradiometer (MODIS) observations at a nominal 500 m grid scale. Retrospective NWM results were at a 1000 m grid scale. We compared results for SNOTEL sites to gridded NWM and MODIS outputs for the grid cells encompassing each SNOTEL site. Differences between modelled and observed SWE were attributed to both model errors, as well as errors in inputs, notably precipitation and temperature. The NWM generally under‐predicted SWE, partly due to precipitation input differences. There was also a slight general bias for model input temperature to be cooler than observed, counter to the direction expected to lead to under‐modelling of SWE. There was also under‐modelling of SWE for a subset of sites where precipitation inputs were good. Furthermore, the NWM generally tends to melt snow early. There was considerable variability between modelled and observed SCAF as well as the binary comparison of snow cover presence that hampered useful interpretation of SCAF comparisons. This is in part due to the shortcomings associated with both model SCAF parameterization and MODIS observations, particularly in vegetated regions. However, when SCAF was aggregated across all sites and years, modelled SCAF tended to be more than observed using MODIS. These differences are regional with generally better SWE and SCAF results in the Central Basin and Range and differences tending to become larger the further away regions are from this region. These findings identify areas where predictions from the NWM involving snow may be better or worse, and suggest opportunities for research directed towards model improvements.


| INTRODUCTION
Accurate water supply forecasts will become increasingly crucial as western populations grow and demand more water, and as operational agencies have to manage water under global environmental change (Bhatti et al., 2016;Gergel et al., 2017;Li et al., 2017;Livneh & Badger, 2020;Mote, 2003;Mote et al., 2005;Regonda et al., 2005;Stewart et al., 2004Stewart et al., , 2005. Many scientific challenges in understanding and preparing for global environmental change rest upon our ability to predict streamflow and snowmelt quantity, timing, and spatial patterns that are important for decision making in watersensitive sectors. In the United States, the National Weather Service While Franz et al. (2008) showed that SNOW-17 performed well over the Reynolds Creek Experimental Watershed located in southwestern Idaho, other studies found limitations such as being unable to capture snowmelt timing precisely due to its simple conceptual framework, its inability to represent spatial variability of land properties, and its dependence on extensive calibration for each basin using historical data (Lundquist & Flint, 2006;Shamir et al., 2006;Zalenski et al., 2017). Furthermore, a National Research Council committee identified a gap between what is now considered state-of-the-art modelling capabilities and those used in AHPS (National Research Council, 2006). It concluded that the NWS needs to incorporate more advanced hydrologic science into their hydrologic models.
The increasing availability of distributed geographic data and computer power has made it possible to develop national/continental scale, physically-based, and distributed models. In 2016, NOAA's Office of Water Prediction implemented the National Water Model (NWM) as a physically-based distributed model based on the Weather Research and Forecasting Model Hydrological modelling system (WRF-Hydro) framework (Gochis, Barlage, Cabell, Casali, et al., 2020) to provide nationally consistent operational hydrologic forecasting capability. The main goals of the NWM were to provide forecast streamflow, produce spatially continuous countrywide estimates of hydrologic states (soil moisture, snowpack, etc.), and to implement a modelling architecture that permits rapid infusion of new data and science.
The NWM provides hourly flow forecasts at about 2.7 million locations in the United States. In addition to the increased number of forecast locations, another advantage of the NWM is that it utilizes a specific configuration of the physically-based Noah-MultiParameterization (Noah-MP) land surface model to represent the land-atmosphere interactions including snow processes. There have been several studies evaluating results from the NWM. For instance, Viterbo et al. (2020) evaluated the prediction of flooding in NWM streamflow forecasts. They found that errors were due to both meteorological input errors as well as hydrologic process representation. In another study, Lahmers et al. (2019) improved the performance of WRF-Hydro configured as NWM version 1.1 by implementing a conceptual channel infiltration function into the model architecture. They concluded that accounting for channel infiltration loss in the semi-arid Western United States improves the streamflow behaviour simulated when the model is forced with high-resolution precipitation input.
However, we are not aware of a systematic and thorough evaluation of the NWM snow outputs.
The NWM   are from the analysis fields of the National Centers for Environmental Prediction (NCEP)/North American Regional Reanalysis (NARR), that is, a retrospective dataset, while the precipitation is from the gagebased NCEP/Climate Prediction Center (CPC). As a pre-processing step, the NWM team downscaled the NLDAS2 data and applied a mountain mapper (Hou et al., 2014) adjustment to the precipitation data to adjust the values for climatological variation due to topography and wind directions . The result forcing dataset is a 1 km spatial resolution data layer for each hour which contains incoming short-and longwave radiation, specific humidity, air temperature, surface pressure, near surface wind, and precipitation rate. In terms of snow, outputs include gridded snow water equivalent (SWE), the amount of water stored in a snowpack, and the snowcovered area fraction (SCAF).
Across the Western United States, snow is observed at 808 snow telemetry (SNOTEL) sites that provide data intended to quantify snow and inform water supply forecasts. Illustrative comparisons of NWM-R2 SWE to SNOTEL SWE ( Figure 1) indicate that SWE is well mod-  • What are the potential causes responsible for discrepancies in NWM-R2 SWE, SCAF, and snowmelt timing?
• Are these discrepancies associated with the model input errors or the snow parameterization in the model?
Answers to these questions are needed to further improve the NWM snow components, and ultimately runoff and water supply forecasts in snowmelt-dominated regions. While US based, the NWM is built using the WRF-Hydro modelling framework that has been applied worldwide, and the lessons learned from this comparison across the United States have application to the representation of snow processes in national and continental scale models throughout the world.
The following section-Section 2-first presents a summary of the  (Figure 2b). The difference in scale is a potential source of uncertainty in our comparative analysis, and needs to be recognized in interpretation. There are small differences in elevation between SNOTEL (point elevations) and NWM-R2 (1 km grid elevations), that may impact temperature comparisons due to lapse rate effects, but there does not appear to be any significant bias ( Figure 2c).
where M liq (kg m À2 ) is the storage of liquid water in the canopy, and R intr (kg m À2 s À1 ), R dew (kg m À2 s À1 ), and R eva (kg m À2 s À1 ) are interception rate for rain, dew rate, and evaporation rate, respectively.
R melt (kg m À2 s À1 ) and R frz (kg m À2 s À1 ) are melting and refreezing rates. M ice (kg m À2 ) is the storage of ice in the canopy and R load (kg m À2 s À1 ) and R unload (kg m À2 s À1 ) are snow loading and unloading rates, respectively. R frost (kg m À2 s À1 ) and R sub (kg m À2 s À1 ) are frost and sublimation rates. Heat transported by snow and rain to the vegetation canopy layer, the vegetated ground, and non-vegetated ground is also computed; and is used later in the energy balance computation.   Niu and Yang (2007).
In NWM-R2 calculations of snow-covered area, ρ new and z 0,g are constants set equal to 100 kg m À3 and 0.002 m, respectively. However, the factor m is among the parameters that are adjusted during calibration to minimize differences between modelled and observed streamflow over calibration watersheds (Lahmers et al., 2019;RafieeiNasab et al., 2020).
The functional relationship between SCAF and depth quantifies smallscale variability of snow within a computational grid element which plays an important role in the process governing snow accumulation and ablation. SCAF is used to weight the ground emissivity and ground surface resistance. It also affects the computed snow surface albedo that is modelled using the biosphere-atmosphere transfer scheme (BATS). BATS (Yang & Dickinson, 1996) models direct and diffusive radiation in visible and near-infrared bands separately accounting for fresh snow albedo, snow age, grain size growth, impurity, and solar zenith angle.

| Surface energy balance, radiation, and momentum fluxes
Shortwave radiation is modelled over the entire grid cell using a modified two-stream approximation (Niu & Yang, 2004) treating the vegetation as evenly distributed with gaps. The result is canopyabsorbed and ground-absorbed solar radiation over the grid cell.
Longwave radiation, latent heat, sensible heat, and ground heat fluxes are modelled, using a tile approach that treats vegetated and bare fractions of the cell separately . Noah-MP treats turbulence fluxes between the snowpack, vegetation canopy, and air using Monin-Obukhov similarity theory to model atmospheric stability conditions. Stability corrections of under canopy turbulent transfer account for the strong stable condition of a warmer canopy overlying the snow surface during the melt season (Chen, Barlage, et al., 2014). Precipitation advected heat is also computed separately for the canopy vegetation, vegetated ground surface, and non-vegetated ground surface. The vegetation canopy temperature (T v ), the vegetated ground surface temperature (T g,v ), and the non-vegetated ground surface temperature (T g,b ) are estimated using the Newton-Raphson method with 20 iterations. If the snow depth is greater than a specified snow depth (≥0.05 m) and the ground surface temperature (T g,v /T g,b ) is greater than the freezing point (273.16 K), the ground temperature is updated to  Snow thermal properties including partial volume of ice, partial volume of liquid water, effective porosity, bulk density (based on Lynch-Stieglitz, 1994), volumetric specific heat, and thermal conductivity are computed for each snow layer (Equations 7-12). Energy for phase change (melting/refreezing) is also computed for each layer.
where θ ice,i (m À3 m À3 ) is partial volume ice of snow layer i, Mass ice,i (kg m À2 ) is snow ice mass of snow layer i, ΔZ i (m) is the snow layer thickness of snow layer i, ρ ice (917 kg m À3 ) is ice density, θ e,i (m À3 m À3 ) is the effective porosity of snow layer i, θ liquid,i (m À3 m À3 ) is partial volume of liquid water of snow layer i, Mass liquid,i (kg m À2 ) is liquid water mass of snow layer i, ρ water (1000 kg m À2 ) is liquid water density, specific heat capacity of ice, C liquid (4.188 Â 10 6 J m À3 K À1 ) is specific heat capacity of liquid water, and k i (W m À1 K À1 ) is thermal conductivity of snow layer i.
Heat flux between layers is calculated based on temperature gradient and thermal conductivity, and then this is used to update layer

| SWE and snow depth
The change in SWE is balanced by the input snowfall (Q snow ) reaching the surface in forms of drip and throughfall; and output snowmelt (M), snow sublimation, and snow frost (both expressed as E in Equation 13).
When new snowfall occurs in a time step, the snow depth and snow ice are increased based on the snow depth increasing rate and the input snowfall rate (both outputs of the snow interception module), respectively. After the depth, phase change and compaction calculations, the number of snow layers is adjusted by either combining the neighbour layers or subdividing them following Jordan (1991

| MODIS
The We chose to use MODIS-C6  as a reference to evaluate NWM-R2 SCAF because the improvements/revisions to MODIS-C6 (i.e., accounting for the surface temperature and surface height) led to a notable increase in accuracy of snow cover detection on mountain ranges and low illumination conditions in the Northern Hemisphere during spring and summer (Riggs et al., 2017).
The MODIS-C6 snow algorithm is designed to detect snow cover based on the normalized ratio of the differences in reflectance in band 4 (centred at 0.56 μm, visible green) and band 6 (centred at 1.64 μm) of the MODIS instrument with revisions applied to alleviate snow detection commission errors (reported for previous versions) for which snow detection is uncertain. The MODIS-C6 products include this ratio, the normalized difference snow index (NDSI, product name: NDSI_Snow_Cover) rather than snow cover. This approach allows users to have the option to estimate snow cover using the global empirical model (Equation 14) or develop region-specific models . In this study, we developed a script (Garousi-Nejad & Tarboton, 2022b) run in Google Earth Engine to retrieve NDSI_Snow_Cover for each NWM grid cell containing a SNOTEL site.
Since MODIS output is available on a 500 m grid and NWM grid cells are 1 km in size, the script averaged NDSI_Snow_Cover over the four MODIS grid cells that have their centroid within the NWM grid cell In Equation (14)

| Seasonal
• First day of the month comparisons were used for NWM-R2 SWE/SCAF (modelled) versus SNOTEL SWE and MODIS SCAF (observed) for months Nov-Jun.
• Monthly precipitation and average air temperature were also compared for these months.
These monthly comparisons let us evaluate the seasonal variability of snow in both modelled and observed datasets for data in the period of overlap between NWM-R2 and SNOTEL data.

| SWE and snow-covered area at peak SWE
• Modelled and observed SWE and SCAF were compared on the date of observed peak SWE (same day comparison).
• Modelled and observed peak SWE do not necessarily occur on the same date. We compared both SWE and SCAF on the separate dates where peak SWE was modelled and observed (different day comparison).
• Model input and SNOTEL observed total precipitation accumulated from the start of the water year, Oct 1, to the date of peak SWE were also compared.
Total precipitation was computed to assess the degree to which differences may be attributable to precipitation differences. This was done for both same day (observed peak SWE) and different day (observed and modelled peak day) comparisons. The different peak day comparison addresses the possibility that peak modelled and observed SWE may be close, but appear further apart in same day comparisons due to a timing mismatch. • Presence Absence comparison metrics were used to indicated the degree-of-overlap between modelled and observed datasets (Horritt & Bates, 2002;Sangwan & Merwade, 2015).
The correctness metric (Equation 15) compares the total number of modelled and observed grid cells having some or full snow cover, while the fit metric (Equation 16) quantifies whether modelled and observed locations match, scaled by the total area mapped with snow (either full or some).
where C t and F t are correctness and fit metrics computed for date t, respectively, and Modelled snow and Observed snow are grid cells classified as snowy cells on that date. Correctness (C t ) and Fit (F t ) should both ideally be 1 (100%).

| Commonly used statistics
• Coefficient of determination (r 2 , Equation 17) that ranges from À1 to 1 with 1 indicating a perfect positive linear relationship but insensitive to proportional differences between modelled and observed data; • Bias (Bias, Equation 21), the average of the difference between modelled and observed.
Spearman r ¼ 1 À Bias ¼ where M t is model simulation, O t is observation, N is the total number of simulations or observations, d t is difference between observed and modelled rank, and the overbar indicates average.

| Seasonal (monthly) comparison
We compared the NWM-R2 SWE results with observations from SNOTEL and found a persistent bias in modelled SWE across most months ( Figure 3). Results show that throughout the accumulation phase (November-February), the rank correlation between observed and modelled SWE increases (Spearman r from 0.7 to 0.8).
However, this does not necessarily indicate an acceptable model performance. The discrepancies between the observed and modelled SWE increase as snow accumulates (RMSE 21-135 mm). In the ablation phase (Mar-Jun), the rank correlation decreases, and discrepancies are highest in May (Bias À149 mm, RMSE 292 mm).
The increasing scatter in later months (Figure 3)  Discrepancies between the seasonal pattern of SWE and SCAF are regional and somewhat different for SWE than SCAF (Figures 9

F I G U R E 4 First day of month modelled (National Water Model [NWM]-R2) versus observed (moderate resolution imaging spectroradiometer [MODIS]-C6) snow-covered area fraction (SCAF) for NWM grid cells and MODIS grid cells containing snow telemetry (SNOTEL) sites. Each point is a site and a date within the period of overlap between NWM and MODIS data. Axis histograms depict the SCAF distributions
and 10, respectively). The NWM SWE was better in the Klamath Mountains, Blue Mountains, and Central Basin and Range (region 9, 2, and 5, respectively, in Figure 9) with SWE bias differences tending to become larger further to the north and east across the study region.
However, the NWM SCAF are closer to the observations in the Northern Basin and Range, Sierra Nevada, and Central Basin and Range regions (regions 12, 13, and 5, respectively, in Figure 10), with SCAF differences tending to become larger the further away regions are from the Central Basin and Range region.

| Observed peak SWE (same day and different day) comparison
The scatterplot of modelled versus observed SWE on the date of peak observed SWE (Figure 11a) indicates a general downward bias in modelled SWE. NWM SCAF clusters around 1 on this date (histograms in Figure 11b) while MODIS SCAF is more fractional, and similar to monthly SCAF the point comparisons are scattered and poor.
Precipitation accumulated from October 1 to the date of observed peak SWE indicates model input precipitation generally less than SNOTEL observed (Figure 11c: Bias À111 mm, RMSE 212 mm). This suggests that under estimation of model precipitation inputs may be a contributor to under modelling of peak SWE. This comparison may also be influenced by the fact that observed SWE is at its peak, but modelled SWE is not.
We also compared observed and modelled peak SWE, noting that these do not necessarily occur on the same date ( Figure 12). Results are similar to the observed peak SWE date comparison. Here the accumulated observed and modelled precipitation (Figure 12c) are over the accumulation period, to their respective peak SWE dates, a possible reason for increased scatter and poorer error metrics in this figure.
Under modelling of SWE is also evident when comparing the

F I G U R E 6 Comparison between National Water Model (NWM)-R2 monthly precipitation input (labelled as modelled) and snow telemetry (SNOTEL) monthly precipitation (labelled as observed). Each point is a site and month in the period of overlap between NWM-R2 and SNOTEL data
F I G U R E 7 Comparison between National Water Model (NWM)-R2 monthly average of hourly air temperature input (labelled as modelled) and snow telemetry (SNOTEL) monthly average of mean daily air temperature (labelled as observed). Each point is a site and month in the period of overlap between NWM-R2 and SNOTEL data precipitation accumulated from October 1 to peak observed SWE date ( Figure 13a).

| Direct (binary) comparison of snow presence or absence
The cell by cell binary comparison of snowy grid cells at SNOTEL sites shows that this comparison does not work well for the all-snowpresent condition, that is, when the observed and modelled SCAF thresholds were 0.7 and 0.95, respectively (Figure 14a). We observed that the average C for the entire period of study was 9.4 and average F, 0.11. These are poor degree of overlap statistics, and are due to the fact that MODIS never reports more than about 30% of the area as having full snow.

| Melt timing comparison
For 68% of the site years analysed, the modelled half melt date was earlier than observed. When further classified based on whether modelled half melt dates were close, ahead, behind or far apart from observed melt dates (Figure 15a) we observe that the NWM half melt date was greater than 20 days from observed half melt date, for 34% of the site years, and off by 6 days or more for 75% of site years. For those site years where the difference was between 5 and 20 days, a greater percentage had the model melting ahead, than behind the observed. The site years that have modelled half melt date ahead of observed tend to have lower modelled half melt date SWE (which is by definition half the peak SWE) than observed (Figure 15b).

| DISCUSSION
The   (Swenson & Lawrence, 2012). It was not uncommon to see accumulated precipitation less than SWE at SNOTEL sites (notably for stations at higher elevations), which could be due to either precipitation under-catch, or inflated SWE (Meyer et al., 2012).
This makes using this information for model comparison challenging, as the model cannot accumulate more snow than its precipitation input. This is an unresolvable difference and should be recognized as a source of uncertainty associated with the in situ measurements used in this study.
Our results show a cold (downward) bias for the model input air temperature (based on NLDAS-2) compared to SNOTEL sites' is one of the most sensitive parameterizations in simulating coldregion hydrological processes (Loth et al., 1993), is based on Jordan's (1991) algorithm, which ignores some physical processes controlling precipitation phase by not incorporating humidity. This may lead to biases in SWE, snow depth, and snow cover fraction  et al. (2020) shows that using the precipitation phase partition from the high-resolution rapid refresh (HRRR), in lieu of the operational method (Jordan, 1991) (Mahat & Tarboton, 2014;You et al., 2014).
Overall, NWM-R2 SCAF was difficult to compare to MODIS-C6 SCAF using single SNOTEL sites and days. Some of this difficultymanifested in the scatter in Figures 4, 11, and 12-may reflect the fact that the MODIS and NWM SCAF quantities are not really the same thing. MODIS may be interpreting vegetation as snow free (Steele et al., 2017;X. Wang et al., 2017), while NWM has snow beneath vegetation. In NWM-R2 results, the persistent low and high SCAF (<0.1 and >0.9, respectively) reflects that NWM treats SCAF as a binary metric in mountainous regions. NWM-R2 SCAF values stay near 1 with less variability between December-April for more than 70% of cases. This suggests that once the NWM grid cell (1 km spatial resolution) is more than 90% snow-covered, it is implausible for it to diverge from 1 for the rest of the accumulation phase and early ablation phase. One possible reason for this behaviour is the lack of representation of some factors affecting SCAF such as vegetation type and seasonal change, and topography. These limitations affect the accurate simulation of SCAF and SWE (Helbig et al., 2015;Magand et al., 2014;Swenson & Lawrence, 2012;Wrzesien et al., 2015).
Another possible reason for some of the differences is the lack of any representation of snow drifting processes (i.e., wind-driven redistribution of snow) in the snow model. Snow drifting increases the variability of snow depth within a grid cell, which then, when melting starts leads to intervening (non-binary 0 or 1) snow covered area fractions.
This may be a factor contributing to differences in regions with modelled SCAF less than 10% while the observed SCAF are more than 50% (points along the horizontal axis of SCAF on March 1, April 1, and May 1 in Figure 4).
We recognize that the SCAF mapped from MODIS in this study also has uncertainties and limitations. First, the temporal forward filling approach that we used to fill gaps associated with clouds may miss some of the daily variability of snow cover, particularly in mountainous regions. Second, the parameters of Equation (14), which estimates SCAF from MODIS-C6 NDSI_Snow_Cover product, were those from Salomonson and Appel (2006) and were constant for our entire study region. Adjusting these parameters to improve the snow cover products from MODIS regionally has been suggested (Riggs et al., 2017).
Third, MODIS NDSI_Snow_Cover grids (nominally 500 m) were averaged for 1 km NWM grid cells, using an unweighted approach in the Google Earth Engine platform. This approach selects MODIS grids whose centres fall within the target area (i.e., NWM grid cells). These scale differences may be a further source of uncertainty, compounded by the nonlinearity in Equation (14) (plateau at NDSI >0.7) having an impact on SCAF from averaged NDSI.
Results for the direct (binary) comparison of full snow cover were poor as MODIS never reports more than about 30% of the area as having full snow, while the degree-of-overlap between the modelled F I G U R E 1 3 (a) National Water Model (NWM)-R2 versus snow telemetry (SNOTEL) precipitation accumulated from 1 October to observed and modelled peak snow water equivalent (SWE) dates. This figure is similar to Figure 10a but with colours separating points into two groups.  would also be helpful, for future work, to have a more comprehensive observation data set, beyond the SNOTEL sites, such as possibly Critical Zone Observatory or experimental forest sites, that include snowfall/rainfall measurements, canopy snow interception, turbulence and radiation fluxes above and below the canopy. Another opportunity is to run the model at higher resolution which would involve downscaling the forcing inputs to higher resolution. Higherresolution remotely sensed snow-covered area (e.g., from LANDSAT satellite) could then be used for model evaluation.

CONFLICT OF INTEREST
The authors declare no conflict of interest.

DATA AVAILABILITY STATEMENT
All data sources used in this research are publicly available. comprised of: • Input data and code to get the indices of the NWM grid cells containing SNOTEL sites (Garousi-Nejad & Tarboton, 2022d) • Input data, code to retrieve the NWM-R2 inputs and outputs at SNOTEL sites  • Input data, code and output from post-processing the retrieved NWM-R2 inputs and outputs at SNOTEL sites (Garousi-Nejad & Tarboton, 2022f) • Input data and code to retrieve precipitation, air temperature, and SWE measurements at SNOTEL sites (Garousi-Nejad & Tarboton, 2022c) • Input data and Google Earth Engine code to retrieve averaged MODIS-C6 NDSI snow cover at SNOTEL sites (Garousi-Nejad & Tarboton, 2022b) • Input data, code and output from combining the NWM inputs and outputs with observations form SNOTEL and MODIS at SNOTEL sites (Garousi-Nejad & Tarboton, 2022e) • Input data, code and output used to produce Figures 1-4