We assess the value of dynamical versus statistical downscaling of National Centers for Environmental Prediction's (NCEP) Climate Forecast System (CFS) winter season forecasts for seasonal hydrologic forecasting. Dynamically downscaled CFS forecasts for 1 December to 30 April of 1982–2003 were obtained from the Multi-RCM Ensemble Downscaling (MRED) project that used multiple Regional Climate Models (RCMs) to downscale CFS forecasts. Statistical downscaling of CFS forecasts was achieved by a much simpler bias correction and spatial downscaling method. We evaluate forecast accuracy of runoff (RO), soil moisture (SM), and snow water equivalent produced by a hydrology model forced with dynamically (the MRED forecasts) and statistically downscaled CFS forecasts in comparison with predictions of those variables produced by forcing the same hydrology model with gridded observations (reference data set). Our results show that the MRED forecasts produce modest skill beyond what results from statistical downscaling of CFS. Although the improvement in hydrologic forecast skill associated with the ensemble average of the MRED forecasts (Multimodel) relative to statistically downscaled CFS forecasts is field significant for RO and SM forecasts with up to 3 months lead, the region of improvement is mainly limited to parts of the northwest and north central U.S. In general, one or more RCMs outperform the other RCMs as well as the Multimodel. Hence, we argue that careful selection of RCMs (based on their hindcast skill over any given region) is critical to improving hydrologic forecast skill using dynamical downscaling.
 Improved seasonal climate forecasts offer one of the most promising mechanisms by which climate-related risks to water and drought management can be mitigated [Hamlet et al., 2002; Wood et al., 2002; Steinemann, 2006; Voisin et al., 2006]. Major strides have been made in improving the scientific underpinnings of seasonal climate forecasts over the past two decades, notably through the evolution of coupled global atmosphere-ocean models [Goddard et al., 2001; Barnston et al., 2003; Palmer et al., 2004; Saha et al., 2006]. It remains to be demonstrated, however, that these modeling advances improve climate forecasts (especially of precipitation, the key hydrologic driver), and in turn seasonal hydrologic forecast accuracy. One of the major challenges in using seasonal climate forecasts for hydrologic prediction is their coarse spatial resolution, which results in a mismatch with the scale at which hydrologic models represent the land surface and requires some form of spatial (and sometimes temporal) downscaling [Wood et al., 2002, 2005; Diez et al., 2005; Wood and Lettenmaier, 2006; Luo et al., 2007].
 Two methods of downscaling global climate model outputs are commonly used: statistical downscaling [e.g., Zorita and Von Storch, 1999; Maurer and Hidalgo, 2008; Vrac et al., 2007; Wood et al., 2002, 2005; Yoon et al., 2011] and dynamical downscaling [e.g., Nobre et al., 2001; Diez et al., 2005, 2009; Castro et al., 2007]. Statistical downscaling methods take advantage of observed relationships between climate at fine and coarse resolutions and use those relationships to translate global climate model output to finer resolution [Wood et al., 2002; Maurer and Hidalgo, 2008]. In dynamical downscaling, Regional Climate Models (RCMs) are nested within a global model over a regional domain, with lateral boundary conditions taken from the global model [Castro et al., 2005, 2007; Pielke and Wilby, 2012]. RCMs provide higher detail in both topographic variations and areas of strong contrast in land cover, such as coastal zones or urban areas, than do global models and also allow description of smaller-scale atmospheric processes, which lead to the formation of mesoscale weather phenomena [Leung et al., 2003; Feser et al., 2011].
 Dynamical downscaling clearly is more physically based than statistical downscaling. It therefore is arguably applicable in global climate change scenario analysis when the assumption of climate stationarity inherent in statistical methods may not be valid [Hay and Clark, 2003]. However, dynamical downscaling is much more computationally demanding than statistical downscaling. The computational time required to generate climate forecasts is critically important for medium range to seasonal hydrological forecasts that are made at daily, weekly, and biweekly intervals, and there are questions as to whether computational resources are better allocated to increasing the spatial resolution of the global model using the RCM and/or increasing the number of ensemble members [Hay and Clark, 2003; Saha et al., 2006, 2010].
 Several recent studies have evaluated the value of dynamically downscaled medium range to seasonal climate forecasts relative to statistically downscaled forecasts in terms of potential improvement in precipitation, temperature, and hydrologic forecast skill [Kidson and Thompson, 1998; Wilby et al., 2000; Hay and Clark, 2003; Schmidli et al., 2007]. For example, Hay and Clark  evaluated runoff forecast skill in three snowmelt-dominated basins in the western U.S. using dynamically and statistically downscaled output from the National Centers for Environmental Prediction/National Center for Atmospheric Research reanalysis [Kalnay et al., 1996]. They concluded that even dynamically downscaled climate model output needs to be bias corrected and runoff forecasts based on statistically downscaled climate forecasts were at least as skillful as those based on dynamical downscaling in all three basins. Other studies, such as Wilby et al.  and Wood et al. , also came to similar conclusions.
 The Multi-RCM Ensemble Downscaling (MRED) of the National Centers for Environmental Prediction's (NCEP) Climate Forecast System (CFS) Seasonal Forecasts was undertaken to answer a central question: “Do RCMs add significant regional skill to current global model forecasts?” The MRED project downscaled CFS winter (December through April) seasonal reforecasts for 1982 to 2003 using seven RCMs. This forecast period was chosen to investigate possible improvements in the prediction of winter precipitation, temperature, and in turn snow, which dominate spring/summer runoff production in the western U.S. Furthermore, climate forecast skill during winter is generally higher than that for other seasons; therefore, the MRED focus on dynamical downscaling of winter season climate forecasts in a sense is a “best case” experiment.
 Ten CFS ensemble members, initialized on 21–25 November and 29 November to 3 December (at 00UTC), were downscaled by each of the RCMs. Yoon et al.  evaluated the MRED forecasts and concluded that dynamical downscaling of CFS forecasts does produce finer-scale features in the climatology and anomalies of precipitation and temperature that are missing in the native CFS forecasts. They also observed that the skill of the MRED precipitation forecasts was somewhat higher than that in CFS, mainly over the northwest and north central U.S.
 In this study we evaluate the value of the dynamically downscaled CFS forecasts (hereafter referred to as the MRED forecasts) in terms of their implications for seasonal hydrologic forecast skill. We quantify the skill of the MRED forecasts in comparison with statistically downscaled (using bias correction and spatial downscaling (BCSD) [Wood et al., 2002]) CFS forecasts used to produce similar hydrologic forecasts. Our specific objectives are to (1) estimate the forecast skill of runoff (RO), soil moisture (SM), and snow water equivalent (SWE) obtained via the MRED forecasts (i.e., dynamically downscaled CFS reforecasts) in comparison with hydrologic forecasts that are based on statistical downscaling of the same CFS reforecasts and (2) evaluate whether multi-RCM ensembles improve hydrologic forecast skill relative to the use of a single RCM ensemble.
2 Methods, Data, and Hydrologic Model
 In this section we provide a brief description of the MRED project (section 2.1), the statistical downscaling method (section 2.2), the hydrologic model (section 2.3), the reference data set (section 2.4), the experimental design (section 2.5), and the forecast evaluation metrics (sections 2.6 and 2.7). Figure 1 summarizes the approach we used. We statistically downscaled both CFS forecasts (from their native resolution of T62) as well as the MRED forecasts (from 0.375° latitude-longitude) to 0.125° to force a hydrology model and generate hydrologic reforecasts.
2.1 Dynamical Downscaling
 Dynamical downscaling of winter season (December–April) CFS forecasts was undertaken by the MRED project [Arritt, 2010; Zhang and Juang, 2010; De Sales and Xue, 2012]. The RCMs used by MRED to dynamically downscale CFS forecasts are (1) the Regional Spectral Model–NCEP (RSM-NCEP) [Juang et al., 1997], (2) the Regional Spectral Model–Experimental Climate Prediction Center (RSM-ECPC) [Roads et al., 2010], (3) the Advanced Research–Weather Research and Forecasting Model (WRF-ARW) [Skamarock et al., 2005], (4) Mesoscale Model 5–Iowa State University (IMM5) [Anderson et al., 2007], (5) Climate-Weather Research and Forecasting (CWRF) [Liang et al., 2005], (6) Eta [Xue et al., 2007], and (7) Regional Atmospheric Modeling System (RAMS) [Cotton et al., 2003]. Initial conditions for the 10 common CFS ensemble members used in this study were taken from 21–25 November and 29 November to 3 December, for the years 1982–2003. Each RCM was forced with CFS lateral boundary conditions on those 10 dates, resulting in 10 ensemble members per RCM (70 ensemble members in total). Selection of those forecast initialization dates was intended to capture the evolution of both oceanic and atmospheric initial conditions in a continuous manner [Saha et al., 2006]. The forecast period was from 1 December to 30 April of each forecast year. The spatial domain of the MRED forecasts was the Contiguous United States (CONUS) and the resolution of each RCM was 0.375° latitude-longitude. The temporal resolution of the forecasts was daily. The hydrology model (section 2.3) was implemented at 0.125°; hence, the MRED forecasts had to be further downscaled from their original resolution (i.e., 0.375°). We used the BCSD method (section 2.2) to bias correct and then spatially and temporally downscale monthly means of precipitation (P), surface temperature maximum (Tmax), and surface temperature minimum (Tmin) for each RCM ensemble member individually (i.e., for all 70 ensemble members) to a daily time step. For further details about the MRED project and the experimental setup used to dynamically downscale CFS forecasts, the reader is referred to Yoon et al. .
2.2 Statistical Downscaling
 The BCSD method [Wood et al., 2002, 2005] was used to downscale CFS forecasts from their native resolution of ~1.9047° latitude and 1.875° longitude, and the MRED forecasts from 0.375° (both latitude and longitude), in both cases to 0.125° latitude-longitude (the spatial scale of the hydrologic model). The BCSD method has been widely used to downscale global climate model output to finer resolution for hydrologic simulation purposes [Wood et al., 2002, 2005; Maurer and Hidalgo, 2008]. The method as used in this study can be summarized in the following steps (see Wood et al. [2002, 2005] for details).
 1. The monthly CFS and MRED forecasts of P, Tmax, and Tmin were bias corrected relative to the observed monthly climatology of those variables (derived from the gridded observational data set of Maurer et al. [2002], as described in section 2.4) at their respective native (spatial) resolutions, using a quantile mapping approach [Panofsky and Brier, 1968] on a grid cell by grid cell basis. The climatologies of CFS and the MRED forecasts for each grid cell were taken from hindcast simulations and included all values from the 10 ensemble members for the 1982 to 2003 period (we did not include the forecast ensembles and observations from the target year in the climatologies used for bias correction). Therefore, the number of values in the forecast and observed climatologies for any given target year was 210 (21 years × 10 ensembles) and 21, respectively.
 2. Following bias correction, we spatially interpolated the forecast anomalies (multiplicative anomalies in the case of P and additive anomalies in the case of Tmax and Tmin) to 0.125° spatial resolution.
 3. Finally, we disaggregated the downscaled monthly anomalies to daily values of P, Tmax, and Tmin following a random resampling approach as described in Wood et al. .
 The statistical downscaling was performed separately for each ensemble member from CFS and each of the seven RCMs. This resulted in 10 ensembles of daily P, Tmax, and Tmin values for the 1 December to 30 April forecast period for CFS and each RCM (total seven RCMs) for the period 1982–2003.
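 The bias correction and anomaly steps above can be illustrated with a minimal sketch for a single grid cell and month. This is not the BCSD implementation of Wood et al.; it is an empirical quantile mapping under stated assumptions: the function names `quantile_map` and `anomaly` are hypothetical, Weibull plotting positions are assumed for the empirical quantiles, and forecast values outside the climatological range are simply clamped to the end points.

```python
import numpy as np

def quantile_map(forecast, fcst_clim, obs_clim):
    """Map a forecast value onto the observed climatology at the same quantile.

    fcst_clim: hindcast climatology values (e.g., 210 values for one cell/month)
    obs_clim:  observed climatology values (e.g., 21 values for the same cell/month)
    """
    fcst_sorted = np.sort(np.asarray(fcst_clim, dtype=float))
    obs_sorted = np.sort(np.asarray(obs_clim, dtype=float))
    # Assumed Weibull plotting positions i / (n + 1) for both climatologies
    p_fcst = np.arange(1, fcst_sorted.size + 1) / (fcst_sorted.size + 1)
    p_obs = np.arange(1, obs_sorted.size + 1) / (obs_sorted.size + 1)
    # Non-exceedance probability of the forecast within its own climatology
    # (np.interp clamps values outside the range: a simplification)
    p = np.interp(forecast, fcst_sorted, p_fcst)
    # Observed value at that same probability
    return np.interp(p, p_obs, obs_sorted)

def anomaly(value, obs_clim_mean, multiplicative):
    """Multiplicative anomaly (as for P) or additive anomaly (as for Tmax, Tmin)."""
    return value / obs_clim_mean if multiplicative else value - obs_clim_mean

# Toy example: a forecast at the median of its climatology maps to the observed median
corrected = quantile_map(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
p_anomaly = anomaly(corrected, 20.0, multiplicative=True)
```

 The anomalies, rather than the corrected values themselves, are what would then be interpolated to the 0.125° grid in step 2.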
2.3 Variable Infiltration Capacity (VIC) Model
 The VIC model is a semi-distributed macroscale hydrology model [Liang et al., 1994]. It parameterizes major surface, subsurface, and land-atmosphere hydrometeorological processes and represents the role of subgrid spatial heterogeneity in soil moisture, topography, and vegetation on runoff generation [Liang et al., 1996a, 1996b]. It provides for nonlinear dependence of the partitioning of precipitation into infiltration and direct runoff as determined by soil moisture in the upper layer and its spatial heterogeneity. The subsurface is usually partitioned into three layers. The first layer has a fixed depth of ~10 cm and responds quickly to changes in surface conditions and precipitation. Moisture transfers between the first and second, and second and third soil layers are governed by gravity drainage, with diffusion from the second to the upper layer allowed in unsaturated conditions. Baseflow is a nonlinear function of the moisture content of the third soil layer [Liang et al., 1994; Todini, 1996]. The model was run in water balance mode, which means that the surface temperature is assumed equal to the surface air temperature and is not iterated for energy balance closure (this also implies zero ground heat flux). The snow accumulation and ablation module of the VIC model [Cherkauer et al., 2003; Andreadis et al., 2009] was run at a 3-hourly time step.
2.4 Reference Data Set
 To estimate the hydrologic forecast skill derived from the use of the MRED versus statistically downscaled CFS forecasts, a consistent long-term data set of RO, SM, and SWE is needed. Due to the scarcity of SM and SWE observations (and to a lesser extent runoff), we created long-term VIC model (section 2.3) output for these variables by forcing the model with high-quality gridded forcings of P, Tmax, Tmin, and wind speed and used that output as the “reference data set.” These gridded forcing data were taken from Maurer et al. , extended from 2000 to 2010 (see http://www.engr.scu.edu/~emaurer/gridded_obs/index_gridded_obs.html). This data set was also used as the observational climatology for the bias correction and spatial downscaling of the CFS and RCM output (section 2.2). Other forcings, such as shortwave and longwave radiation, specific humidity, etc., were estimated from the daily temperature range (difference between Tmax and Tmin) by a module in the VIC model that follows the methods of Bohn et al.  and Kimball et al. . The soil, vegetation, and snow band parameters used to run the VIC model were the same as in Maurer et al. . Daily values of RO, SM, and SWE obtained from the VIC simulations were used to calculate total runoff (surface + baseflow) and mean SM and SWE at monthly time steps for each month during the forecast period (December to April) and used as the reference data set, in lieu of observations of those variables, to evaluate the hydrologic forecast skill in this analysis. Starting from a cold state as of 1 January 1948, we used the period up to 1982 as a spin-up period. We also saved the model state on 31 December 1981 to generate the initial hydrologic states for 0000 UTC on 1 December of each year (1982–2003) to initialize hydrological forecasts for the 1 December to 30 April forecast period.
2.5 Experimental Setup
 As depicted in Figure 1, we generated two separate hydrologic reforecast data sets (reforecasts of RO, SM, and SWE) by forcing the VIC model (section 2.3) with statistically downscaled CFS and the MRED forecasts. We also generated a third set of hydrologic forecasts using the Ensemble Streamflow Prediction (ESP) method, which was used as a benchmark for evaluating the improvement in hydrologic forecast skill due to the use of statistically downscaled CFS forecasts (section 3.1). ESP [Day, 1985; Wood et al., 2002, 2005; Shukla and Lettenmaier, 2011] is a method widely used for seasonal hydrologic prediction that runs a physically based hydrology model up to the time of forecast using observation-based forcings, then resamples ensemble forcing members from sequences of past observations so as to form ensemble-based forecasts that derive their skill solely from the knowledge of the initial hydrologic conditions (IHCs).
 The IHCs used to initialize simulations for all three hydrologic reforecast data sets came from the VIC model simulation that was used to generate the reference data set as described in section 2.4.
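 The ESP idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `toy_bucket` is a hypothetical linear-reservoir stand-in for a hydrology model (it is not VIC), the forcing is precipitation only, and the target year is left out of the resampled forcing ensemble.

```python
import numpy as np

def toy_bucket(storage, precip, k=0.1):
    """Hypothetical stand-in for a hydrology model: linear-reservoir runoff."""
    runoff = []
    for p in precip:
        storage += p          # add precipitation to storage
        q = k * storage       # runoff is a fixed fraction of storage
        storage -= q
        runoff.append(q)
    return np.array(runoff)

def esp_forecast(initial_storage, forcing_by_year, target_year):
    """Run the model from one initial state with each past year's forcing sequence."""
    return {
        year: toy_bucket(initial_storage, precip)
        for year, precip in forcing_by_year.items()
        if year != target_year  # assumed: target year excluded from the ensemble
    }

# Toy usage: three years of 5-step forcing, forecast for 1983
forcing = {1982: [0, 5, 0, 2, 1], 1983: [3, 0, 0, 0, 4], 1984: [1, 1, 1, 1, 1]}
members = esp_forecast(100.0, forcing, target_year=1983)
ensemble_mean = np.mean(list(members.values()), axis=0)
```

 Because every member starts from the same initial storage, any skill in the ensemble mean comes from the initial state alone, which is the property that makes ESP a useful benchmark.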
2.6 Forecast Evaluation
 We calculated the hydrologic prediction skill for each hydrologic variable (RO, SM, and SWE) as the Pearson product-moment correlation between the ensemble mean forecasts of those variables (generated by the MRED and statistically downscaled CFS forecasts as described in section 2.5) and the corresponding values obtained from the reference data set (section 2.4). We focus here on those regions and lead times at which the correlation coefficient (forecast skill) was statistically significant at the 95% confidence level. For a sample size of 21 years (by excluding the target year from the 1982–2003 period; degrees of freedom = 19), statistical significance at 95% is achieved for sample correlations r with |r| ≥ 0.46.
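 As a minimal sketch of this skill measure for one grid cell, variable, and lead time, the following computes the correlation between the ensemble mean forecast and the reference values and applies the 0.46 threshold quoted above. The function name and array layout are assumptions for illustration.

```python
import numpy as np

def correlation_skill(ensemble_forecasts, reference):
    """Pearson correlation between the ensemble mean forecast and the reference.

    ensemble_forecasts: array of shape (n_years, n_members)
    reference:          array of shape (n_years,)
    """
    ens_mean = np.asarray(ensemble_forecasts, dtype=float).mean(axis=1)
    return float(np.corrcoef(ens_mean, np.asarray(reference, dtype=float))[0, 1])

def is_significant(r, r_crit=0.46):
    """Apply the threshold on |r| stated in the text."""
    return abs(r) >= r_crit

# Toy usage: three years, two members per year
r = correlation_skill([[1, 3], [2, 4], [3, 5]], [20, 30, 40])
```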
 We also identified the regions and leads at which the difference between the hydrologic forecast skills derived by using statistically downscaled CFS forecasts as contrasted with the MRED forecasts was significant at the 95% confidence level. We did so by transforming the sample correlations r1 and r2 into standard normal deviates (Z1 and Z2) by using Fisher's Z transformation method (equation (1)):
$$Z_i = \frac{1}{2}\ln\left(\frac{1 + r_i}{1 - r_i}\right), \qquad i = 1, 2 \tag{1}$$
 We then estimated the Z value for the difference between Z1 and Z2 (equation (2)):
$$Z = \frac{Z_1 - Z_2}{\sqrt{\dfrac{1}{n - 3} + \dfrac{1}{n - 3}}} \tag{2}$$
where n is the sample size (21).
 For the difference to be statistically significant at 95% confidence, the absolute value of Z should be greater than or equal to 1.96.
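 A minimal numerical sketch of this test follows. It assumes the standard form of Fisher's transformation and the usual statistic for the difference of two transformed correlations with a common sample size n; the function names are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's Z transformation of a sample correlation (equals atanh(r))."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def skill_difference_z(r1, r2, n=21):
    """Z statistic for the difference of two correlations, each from n samples."""
    standard_error = math.sqrt(1.0 / (n - 3) + 1.0 / (n - 3))
    return (fisher_z(r1) - fisher_z(r2)) / standard_error

def significantly_different(r1, r2, n=21, z_crit=1.96):
    """True when |Z| meets the 95% threshold stated in the text."""
    return abs(skill_difference_z(r1, r2, n)) >= z_crit

# Toy usage: a large skill gap versus a small one
large_gap = significantly_different(0.8, 0.3)
small_gap = significantly_different(0.6, 0.5)
```

 With n = 21, the standard error is sqrt(2/18) = 1/3, so two correlations must differ by roughly 0.65 in transformed units before the difference is significant; this is why modest improvements in correlation are rarely locally significant.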
2.7 Field Significance Test
 Hydrologic variables such as gridded RO, SM, and SWE (as used in this study) show widespread spatial coherence. It is likely that the inherent spatial coherence in those variables affects the number of grid cells where the hydrologic forecast skill derived by the MRED versus statistically downscaled CFS forecasts is significantly different (section 2.6). Livezey and Chen  addressed this issue of local (each grid cell) versus field (spatial domain, i.e., CONUS in the case of this study) significance by using a Monte Carlo simulation procedure to determine the critical values of the number of locally significant rejections of a null hypothesis for spatially correlated variables. In this study we used a similar approach (also adopted by Andreadis and Lettenmaier ) to estimate the field significance of the difference in skill (i.e., correlation) of the hydrologic forecasts derived using the MRED versus statistically downscaled CFS forecasts, as described in the following steps:
 1. We created 1000 random series of 21 “observations,” each by resampling the year order in the reference data set. This resampling maintains the spatial correlation (of the actual reference data set) in the random series.
 2. Keeping the hydrologic reforecasts derived from the MRED and statistically downscaled CFS forecasts unchanged, we estimated the correlation (i.e., skill) of each reforecast at 1 to 5 months lead, with the random series of observations created in step 1.
 3. We then estimated the number of grid cells where the correlations of both reforecasts at 1 to 5 months lead were significantly different from each other at the 95% confidence level, for each random series of observations (following section 2.6), resulting in a total of 1000 values for each lead time.
 4. Finally, using the distribution of those 1000 values (for each lead time), we estimated the 95th percentile (exceedance probability 0.05) of the number of grid cells where the skill of both reforecasts (for any given lead time and hydrologic variable) was significantly different purely by chance. If that number was smaller than the actual number of grid cells (calculated using the reference data set described in section 2.4) where the skill of the hydrologic forecast derived from the MRED versus statistically downscaled CFS forecasts is significantly different, then we considered the differences in the skill of both reforecasts to be field significant for our domain.
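 The steps above can be sketched as follows for one variable and one lead time. This is an illustration under stated assumptions rather than the procedure's exact implementation: the year order is resampled by random permutation applied to all grid cells at once, the local test is the Fisher Z test of section 2.6 in its standard form, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def grid_correlation(fcst, ref):
    """Pearson correlation along the year axis (axis 0) for every grid cell."""
    fa = fcst - fcst.mean(axis=0)
    ra = ref - ref.mean(axis=0)
    return (fa * ra).sum(axis=0) / np.sqrt((fa**2).sum(axis=0) * (ra**2).sum(axis=0))

def count_significant(fcst_a, fcst_b, ref, z_crit=1.96):
    """Number of cells where the two skills differ significantly (local test)."""
    n = ref.shape[0]
    # Clip to keep arctanh finite for near-perfect correlations
    r_a = np.clip(grid_correlation(fcst_a, ref), -0.999999, 0.999999)
    r_b = np.clip(grid_correlation(fcst_b, ref), -0.999999, 0.999999)
    z = (np.arctanh(r_a) - np.arctanh(r_b)) / np.sqrt(2.0 / (n - 3))
    return int(np.sum(np.abs(z) >= z_crit))

def field_significance(fcst_a, fcst_b, ref, n_trials=1000, seed=0):
    """Compare the actual count with the 95th percentile of counts under shuffling."""
    rng = np.random.default_rng(seed)
    actual = count_significant(fcst_a, fcst_b, ref)
    null_counts = np.empty(n_trials, dtype=int)
    for i in range(n_trials):
        # One permutation of years shared by all cells keeps the spatial structure
        shuffled = ref[rng.permutation(ref.shape[0])]
        null_counts[i] = count_significant(fcst_a, fcst_b, shuffled)
    threshold = float(np.percentile(null_counts, 95))
    return actual, threshold, bool(actual > threshold)

# Synthetic usage: 21 years, 50 cells; forecast A tracks the reference, B is noise
rng = np.random.default_rng(42)
ref = rng.normal(size=(21, 50))
fcst_a = ref + 0.1 * rng.normal(size=(21, 50))
fcst_b = rng.normal(size=(21, 50))
actual, threshold, is_field_significant = field_significance(fcst_a, fcst_b, ref, n_trials=200)
```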
 In this section, we first present the difference in the hydrologic forecast skill resulting from VIC forced with statistically downscaled CFS precipitation and temperature forecasts relative to ESP forecasts made with VIC. The ESP method derives its hydrologic forecast skill solely from knowledge of the IHCs; hence, the difference between ESP and downscaled CFS skill provides a basis for estimating the value of CFS for hydrologic forecast applications. We then assess how the MRED forecasts (i.e., dynamically downscaled CFS via multi-RCMs) compare with statistically downscaled CFS forecasts in terms of resultant hydrologic forecast skill. In all cases, we assert hydrologic forecast skill improvements only when the results are statistically significant at the 95% confidence level. We also highlight those regions and lead times for which the hydrologic forecast skill derived using the MRED and statistically downscaled CFS forecasts is statistically different from each other at the 95% confidence level. Furthermore, we compare the skill of individual RCMs against the ensemble average of MRED forecasts across all RCMs (hereafter referred to as Multimodel) and the RCM that has the highest hydrologic forecast skill for any given grid cell and lead time (Best Model).
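 The two composites defined above can be expressed compactly. The sketch below assumes a hypothetical array layout (RCM, ensemble member, year, grid cell) for the forecasts and (RCM, grid cell) for the skill at a given lead time; the function names are illustrative.

```python
import numpy as np

def multimodel_mean(forecasts):
    """Average over all RCMs and all ensemble members.

    forecasts: array of shape (n_rcms, n_members, n_years, n_cells)
    returns:   array of shape (n_years, n_cells)
    """
    return np.asarray(forecasts, dtype=float).mean(axis=(0, 1))

def best_model_skill(skill_by_rcm):
    """Highest skill among the RCMs at each grid cell, and which RCM attains it.

    skill_by_rcm: array of shape (n_rcms, n_cells) for one lead time
    """
    skill_by_rcm = np.asarray(skill_by_rcm, dtype=float)
    return skill_by_rcm.max(axis=0), skill_by_rcm.argmax(axis=0)

# Toy usage: two RCMs, two grid cells
best, which = best_model_skill([[0.2, 0.7], [0.5, 0.1]])
```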
3.1 Statistically Downscaled CFS-Based Versus ESP Hydrologic Forecast Skill
3.1.1 Runoff (RO) Forecasts
 Figure 2a shows the difference between the forecast skill of monthly total RO derived from statistically downscaled CFS forecasts and the ESP method, at lead times of 1 to 5 months. For the regions in white, the hydrologic forecast skill derived from the use of CFS is not statistically significant at the 95% confidence level. Regions in light gray show degradation of hydrologic forecast skill for CFS relative to ESP, whereas nongray (and nonwhite) colors indicate improvement in skill relative to ESP. As shown in Yoon et al. , the CFS precipitation forecast skill (of the ensemble mean forecast) at Lead-1 month is significant mainly over the southwestern, north, southeastern, and parts of the north central U.S., whereas the temperature forecast skill at Lead-1 is significant over the entire northern part of the country. Improvements in runoff forecast skill using CFS are most apparent at leads of 3, 4, and 5 months over much of California, the southwestern mountainous regions, and the southeastern U.S. (Figure 2a). The improvement in runoff forecast skill at relatively long leads can be attributed to CFS precipitation skill “rebound” [Guo et al., 2011, 2012] as well as to the low ESP skill at longer lead times (leaving more room for improvement in RO forecast skill at longer lead times than at short lead times). The reader is referred to Yoon et al.  for CFS precipitation forecast skill plots.
3.1.2 Soil Moisture Forecasts
 Figure 2b shows the differences in the SM forecast skill derived by using statistically downscaled CFS and ESP methods. There are no regions in white at Lead-1 month, which indicates high SM forecast skill at short lead time (probably in part due to lower natural variability of soil moisture relative to runoff). As observed in Shukla and Lettenmaier , this significant SM forecast skill across the CONUS, at short lead time (e.g., lead 1 to 2 months) is mainly derived from knowledge of the IHCs (particularly in winter and spring months). This feature results in small room for improvement in skill through use of CFS, which is why the differences in SM forecast skill derived from CFS and ESP are nominal (<0.1), at 1 month lead, across the country (Figure 2b). Nonetheless, there are some regions in the mountainous northwestern U.S. and north central U.S. where the difference between the SM forecast skill derived from CFS and ESP, at 1 month lead is significant at 95% confidence level (not shown here). Similar to the case of runoff forecast skill, the improvement in SM forecast skill for CFS relative to ESP method is also high (Figure 2b) and significant (not shown) at Leads-3, 4, and 5 months over the southwestern and southeastern U.S. This improvement in SM forecast skill (apparent at Lead-3 and 4 months) is due to the increase in CFS precipitation forecast skill as well as reduction of ESP-based SM forecast skill at these leads.
 It is worth mentioning here that the newer version of CFS (CFSv2) [Saha et al., 2010] has been shown to have greater precipitation forecast skill than CFSv1 mainly at 1 month lead [Yuan et al., 2011]. However, in a recent study, Mo et al.  evaluated CFSv2 SM forecast skill relative to ESP-based SM forecasts and found that CFSv2-based forecasts had only slightly better forecast skill than ESP, due to low precipitation forecast skill at Leads-2 and 3 months. CFS forecasts as used in this study show higher precipitation forecast skill over some regions (California/southwest and southeastern U.S.) at Leads-3 and 4 months relative to shorter leads (also shown in Yoon et al. ), which explains why CFS improves SM forecasts over those regions relative to ESP. Precipitation forecast skill at longer leads (greater than 1 to 2 months) is crucial because the hydrologic skill derived solely from the knowledge of the IHCs is generally much lower at longer leads, leaving more room for improvement in SM skill due to precipitation forecast skill.
3.1.3 Snow Water Equivalent Forecasts
 Figure 2c shows the improvement in forecast skill of SWE derived by using statistically downscaled CFS forecasts relative to the ESP method. In this case we focus only on those regions where the long-term mean SWE is greater than 10 mm. Aside from some parts of the northeast, for most regions, the difference in SWE forecast skill at 1 month lead is either <0 (degradation in skill) or nominal (<0.1). Nevertheless, at Leads-3, 4, and 5 months, there are some regions in the mountainous western U.S. where the improvement in SWE forecast skill is significant (not shown). This improvement in SWE forecast skill at the end of the spring could be important for water management during the summer months for streams that head in the mountainous western U.S.
3.2 Hydrologic Forecast Skill Using the MRED Versus Statistically Downscaled CFS Forecasts
 In this section we evaluate the difference in hydrologic forecast skill derived from the MRED forecasts (dynamically downscaled CFS forecasts via multi-RCMs) and statistically downscaled CFS forecasts (Figures 3–8). We show relative differences in hydrologic forecast skill for individual RCMs as well as Multimodel and the skill of the RCM with the highest skill (Best Model).
3.2.1 Runoff Forecasts
 Figure 3a shows the difference in runoff forecast skill derived from the MRED relative to statistically downscaled CFS forecasts. The difference in skill for each individual RCM is shown in the first to seventh rows in the figure (for Leads-1 to 5 months) while the eighth row shows the skill difference for Multimodel; the ninth row shows the skill of Best Model (i.e., the model with the highest skill among all of the individual RCMs). The regions shown in white are the regions where the hydrologic forecast skill derived from the MRED forecasts is not statistically significant at the 95% confidence level. Where there is an improvement in the MRED relative to statistically downscaled CFS-derived RO forecasts, the improvement is mostly nominal (generally less than 0.2). Although overall most RCMs are in general agreement, a few RCMs do stand out with local improvements in skill as large as about 0.5 relative to statistically downscaled CFS. The hydrologic forecast skill obtained from Multimodel is generally higher than the skill of statistically downscaled CFS (Figure 3a); hence, there are fewer regions shown in light gray (i.e., degradation of skill; compare row 8 with rows 1–7 of Figure 3a). The degradation in skill was generally less than 0.2, and we expect that it may have happened locally either due to chance or in some cases due to overestimation of precipitation by the RCMs as suggested by Yoon et al.  (the number of grid cells where the skill degraded was not field significant). Additionally, the highest skill among all the individual RCMs (Best Model) is almost always equal to or higher than the skill of Multimodel. For example, at Lead-1 over the southwestern U.S., Best Model shows an improvement in RO forecast skill of as much as 0.5 relative to statistically downscaled CFS forecasts, whereas the Multimodel-based improvement in skill is less than 0.2 (and also restricted to a much smaller domain). This indicates that over certain regions some RCMs perform much better than other RCMs. For example, the Eta (UCLA) model shows significant improvement in runoff forecast skill at Lead-5 in the lower part of the Upper Mississippi Basin (in Illinois) whereas none of the other models show that. It appears that the improvement in skill may be due to high precipitation forecast skill derived by the Eta model in that region in March, as shown by Yoon et al.  as well.
 Best Model also shows appreciable improvement over parts of California and the Southeast, mainly at Leads-3 to 4, which is of practical importance because these are regions where statistically downscaled CFS forecasts show large improvements relative to ESP (Figure 2a). Figure 3b shows the improvement in runoff forecast skill accumulated over the entire forecast period (December to April). These improvements are generally minor (less than 0.2) but could be of practical importance for those basins that receive the majority of their runoff during the late winter/spring season.
 Figure 4 shows those regions where the RO forecast skill derived from the MRED forecasts is statistically different from that of the statistically downscaled CFS forecasts at the 95% confidence level and the MRED forecasts result in an improvement in RO forecast skill.
 In general, Multimodel (average of all RCMs and all ensembles) shows more grid cells with statistically significant improvement in skill. We performed the field significance test using only Multimodel as described in section 2.7 and observed that the improvement in runoff forecast skill is field significant at up to 3 months lead time. Table 1 shows the number of grid cells where runoff forecast skill improved significantly due to use of Multimodel. The numbers in italics show field significance at the 95% confidence level.
Table 1. Number of One-Eighth Degree Latitude-Longitude Grid Cells (of 55,242 Total) at Which the Hydrologic Forecast Skill Associated With Multimodel (Average of All RCMs and All Ensembles) Is Significantly Different From Statistically Downscaled CFS Forecasts at 1 to 5 Month Lead Times. The numbers in italics are field significant at 95% confidence level.
 The Best Model shows more spatially widespread statistically significant improvements in hydrologic forecast skill relative to statistically downscaled CFS at all leads than does any individual model or Multimodel. Those improvements are mainly apparent over the mountainous western U.S. and north central/Great Plains regions. This indicates that the improvement in runoff forecast skill could be best attained by either using the RCM with the highest skill or devising an averaging scheme in which the RCMs with higher skill receive greater weight.
3.2.2 Soil Moisture Forecasts
 Figure 5 shows improvements in SM forecast skill when the MRED forecasts are used relative to simple statistically downscaled CFS forecasts at 1 to 5 months lead. At 1 month lead, SM forecast skill is significant across the CONUS (hence no white regions); however, the difference in forecast skill relative to statistically downscaled CFS forecasts is small (i.e., <0.1) or less than 0 (degradation of skill) in most cases. This nominal difference could be due to the fact that the baseline skill (i.e., the skill derived from using statistically downscaled CFS forecasts) is high (due to high contributions from the IHCs toward SM forecast skill), leaving little room for improvement (e.g., baseline correlation values can be quite high, so for instance if the baseline skill is 0.9 the improvement in skill cannot be more than 0.1). Similar to runoff forecast skill, Best Model is almost always better than Multimodel and shows at least some improvement across the country.
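 The ceiling argument above can be written compactly. The symbols are introduced here for illustration only and do not appear in the original text: r_stat denotes the baseline correlation skill from statistically downscaled CFS, r_dyn the skill from the MRED forecasts, and Δr their difference. Because a correlation cannot exceed 1, the achievable improvement is bounded by the gap between the baseline and 1:

```latex
\Delta r \;=\; r_{\mathrm{dyn}} - r_{\mathrm{stat}} \;\le\; 1 - r_{\mathrm{stat}},
\qquad
r_{\mathrm{stat}} = 0.9 \;\Rightarrow\; \Delta r \le 0.1 .
```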
 Figure 6 shows the statistical significance of the difference in SM forecast skill for dynamical relative to statistical downscaling of CFS forecasts. There are some regions, mainly scattered over the mountainous northwestern and north central U.S., where the improvement in skill for dynamical downscaling is locally significant at the 95% confidence level. As shown in Table 1, the improvements due to the use of Multimodel are field significant through 3 months lead. Once again, the areas of improvement in skill are more widespread for the Best Model than for Multimodel.
3.2.3 Snow Water Equivalent Forecasts
 Figure 7 shows the difference in SWE forecast skill at Leads-1 to 5 months obtained from the MRED and statistically downscaled CFS forecasts. We show only those regions where the SWE forecast skill is significant at the 95% level and the long-term SWE mean is greater than 10 mm.
 Overall, the improvement in SWE forecast skill obtained by the MRED relative to statistically downscaled CFS forecasts is small (less than 0.1 for most places), except in northern Wisconsin and the mountainous Upper Colorado River Basin, for certain RCMs. Multimodel shows positive skill differences for 1 month lead only. Best Model shows larger areas with positive differences, mainly over mountainous western U.S. regions at all lead times.
 Figure 8 highlights those regions where the difference in SWE forecast skill derived from the MRED versus statistically downscaled CFS forecasts is significant at Leads-1 to 5 months. The area of significant improvement in SWE forecast skill is smaller than in the case of runoff and SM and is limited to the interior part of the mountainous western U.S. (also shown in Table 1). Best Model consistently shows more regions of significant improvement in skill than does any individual model or Multimodel. Parts of the mountainous Upper Colorado River Basin show significant improvement in SWE forecast skill at Leads-3 to 4 months for almost all the models. This improvement in SWE forecast skill could be valuable due to its contribution to summer runoff.
3.3 Comparisons Over Major River Basins/Regions
 In this section we compare the hydrologic skill derived from the Best Model, Multimodel, and statistically downscaled CFS over each of the 13 major CONUS river basins and regions as defined in Maurer et al. We performed this analysis to investigate how the difference in skill among the RCMs varies spatially across the CONUS and with lead time. Figures 9-11 show the distribution of the hydrologic skill for RO, SM, and SWE, respectively, as derived from the Best Model (green), Multimodel (blue), and CFS (red) for all the grid cells in each of the 13 regions/basins, using box-whisker diagrams. Larger differences between the median skill of Multimodel and that of Best Model imply higher variability in skill among RCMs.
 As shown in Figure 9, in terms of RO forecast skill, Best Model has higher median skill (correlation) than Multimodel and CFS over almost all basins and lead times. In general, this difference is higher for longer lead times, which is understandable because at short lead times the hydrologic skill in each case (Best Model, Multimodel, and CFS) is dominated by the IHCs. Some basins such as Lower Mississippi, Arkansas-Red, Ohio, and South Central Gulf stand out because for those basins at long lead times (3 and 4 months), the median skill derived from Multimodel is lower than CFS; however, the Best Model median is much higher than Multimodel. This indicates that over those regions, there is a large range of individual RCM skill and careful selection of the RCMs for hydrological forecast applications is critical.
 Figure 10 is similar to Figure 9, but for SM forecasts. For SM forecasts, the difference between the median skill derived from Best Model, Multimodel, and CFS at Lead-1 is nominal. This is again due to the strong influence of the IHCs on SM forecasts at short lead times, which is why the skill of precipitation and temperature forecasts has little impact on the SM forecast skill. The difference between the SM forecast skill of Best Model (green), Multimodel (blue), and CFS (red) does increase with lead time; however, that difference is not as discernible as in the case of RO forecasts. The difference is higher over eastern U.S. basins such as the Ohio, Lower Mississippi, Arkansas-Red, and the East Coast than for the rest of the country. This may be attributable to the difference in the skill of RCMs, as well as smaller influence of IHCs in those regions relative to the rest of the country.
 Figure 11 shows the distribution of SWE forecast skill for basins/regions that receive snow (we considered only those basins and months where there are at least 100 grid cells with greater than 10 mm mean SWE) for Best Model (green), Multimodel (blue), and CFS (red). Best Model almost always had greater skill than Multimodel and CFS. The difference in skills was lowest at Lead-1 month, due to the influence of initial snow conditions. The basins/regions where the difference between Best Model, Multimodel and CFS skill was highest are the Colorado and Rio Grande Basins, and Great Lakes and East Coast regions.
 In this study “Best Model” for any given grid cell and lead time is the RCM that resulted in the highest level of RO, SM, or SWE forecast skill (i.e., correlation value between the ensemble average and observations). Figures 12-14 show the number of grid cells (in %) for each of the 13 basins/hydrologic regions and CONUS, over which any given RCM was considered Best Model in terms of RO, SM, and SWE forecast skill, respectively. We calculated the % of grid cells by using only those grid cells where the forecast skill of the Best Model was statistically significant at the 95% level. In the case of SWE (Figure 14), we considered only those grid cells where the SWE forecast skill was statistically significant at the 95% level and the long-term mean SWE was >10 mm. (We also focused only on those basins where there were at least 100 such grid cells.)
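 The Best Model selection described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the RCM names, skill values, and the precomputed significance mask are all hypothetical, and ties are broken by taking the first RCM listed.

```python
from collections import Counter

def best_model_per_cell(skill):
    """Return, for each grid cell, the name of the RCM with the highest skill.

    skill: dict mapping an RCM name to a list of correlation values,
           one per grid cell (same cell order for every RCM).
    Ties go to the RCM that appears first in the dict.
    """
    names = list(skill)
    n_cells = len(skill[names[0]])
    return [max(names, key=lambda name: skill[name][i]) for i in range(n_cells)]

def percent_best(best, significant):
    """Percentage of significant grid cells for which each RCM is Best Model.

    significant: list of booleans, True where the Best Model skill passes
                 the significance test (assumed to be computed elsewhere).
    """
    kept = [b for b, s in zip(best, significant) if s]
    if not kept:
        return {}
    counts = Counter(kept)
    return {name: 100.0 * c / len(kept) for name, c in counts.items()}

# Hypothetical correlations for two RCMs over four grid cells.
skill = {
    "RCM_A": [0.6, 0.2, 0.7, 0.1],
    "RCM_B": [0.5, 0.4, 0.3, 0.2],
}
significant = [True, True, True, False]  # hypothetical significance mask

best = best_model_per_cell(skill)
pct = percent_best(best, significant)
print(best)
print(pct)
```

 In a full analysis the same selection would be repeated for each variable (RO, SM, SWE) and each lead time, and the percentages would be computed separately per basin.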
 Figures 12-14 show that for any given basin, in terms of hydrologic forecast skill, some RCMs (for example, Eta and IMM5) are generally Best Model for a larger number of grid cells than the other RCMs. However, for the CONUS as a whole, this difference among RCMs is not as apparent. It is worth mentioning, though, that given the small sample size of 21 years, the differences in hydrologic skill resulting from the MRED forecasts may in some cases be purely due to chance. We have not attempted to investigate the statistical significance of differences in hydrologic forecast skill among different RCMs.
 Finally, we also repeated this analysis with a rank-based metric (i.e., Spearman rank correlation), since Pearson correlations are more sensitive to outliers than rank-based metrics. We found no significant differences in the results and therefore argue that our major findings are relatively insensitive to the skill measure.
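 The sensitivity of the Pearson correlation to outliers, and the robustness of the Spearman rank correlation, can be illustrated with a small synthetic example. The data below are invented for illustration and are unrelated to the study's forecasts; the helper functions are a plain-Python sketch rather than the implementation used by the authors.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = mean_rank
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Synthetic "observations" and "forecasts" (hypothetical values).
obs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fcst_clean = [1.1, 2.0, 2.9, 4.2, 5.1, 5.8, 7.0, 8.1]
fcst_outlier = fcst_clean[:-1] + [80.0]  # one extreme value, order unchanged

pearson_clean = pearson(obs, fcst_clean)
pearson_outlier = pearson(obs, fcst_outlier)
spearman_outlier = spearman(obs, fcst_outlier)
print(pearson_clean, pearson_outlier, spearman_outlier)
```

 Because the outlier does not change the ordering of the forecasts, the rank correlation is unaffected, whereas the Pearson correlation drops noticeably.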
 We have investigated whether seasonal hydrologic forecasts derived from dynamically downscaled CFS winter seasonal forecasts (December–April) can be more skillful than comparable hydrologic forecasts forced with statistically downscaled CFS forecasts. We conclude the following:
 Winter season CFS forecasts as used in this study do provide useful skill for hydrologic forecast application when evaluated relative to the ESP method, but mostly for relatively long lead times when initial condition-derived skill is minimal. The greatest improvements in RO and SM forecast skill relative to ESP were observed over the southwestern and southeastern U.S. at Leads-3 to 5 months. Significant improvements in SWE forecast skill were also found over parts of the mountainous western regions, also at Leads-3 to 5.
 The MRED forecasts (i.e., dynamically downscaled CFS forecasts) somewhat increase RO forecast skill beyond what is achievable by statistically downscaled CFS forecasts. The improvement in skill due to Multimodel is field significant through 3 months lead time; however, these improvements are mainly limited to parts of the mountainous western and north central U.S., and the skill improvements, while statistically significant, are mostly small. In almost all cases and at all lead times across the CONUS, the RCMs with the highest skill (Best Model) showed more widespread areas of improvement in RO forecast skill than did the Multimodel.
 In the case of SM forecasts, the improvements in hydrologic forecast skill derived using the MRED forecasts relative to the skill derived from statistically downscaled CFS forecasts were not as discernible as in the RO forecasts. However, differences were field significant at up to 3 months lead time, when Multimodel was used. Significant improvements in skill were found in parts of the interior mountainous West and some parts of the north central U.S. Again, Best Model showed much more widespread improvement in skill than did Multimodel.
 Finally, the improvements in SWE forecast skill were sparse. Statistically significant improvements were mostly limited to the Great Lakes region, Colorado, and interior northwestern mountainous regions.
 As to the question of whether dynamical downscaling of winter season CFS forecasts provides hydrologically useful information relative to what is achievable by performing a simple statistical downscaling, we find that the answer is a qualified yes, over limited regions of the CONUS. This appears to be so because the overall skill of dynamically downscaled winter season forecasts is mostly limited by the skill of the global model in simulating large-scale climate phenomena. Some modest additional skill might be derived, however, through the use of a combination (averaging) of RCMs where the RCMs with the highest hindcast skill are given higher weights. Also worth mentioning is that the statistically significant improvement in hydrologic forecast skill may not necessarily translate into useful information for decision makers. The level of improvement in skill that is useful for decision making likely varies with the region and stakeholder interests, quantification of which is beyond the scope of this study.
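 One possible form of the skill-weighted averaging suggested above is sketched below. This is an assumption-laden illustration and not a scheme evaluated in this study: the RCM names and values are hypothetical, weights are taken proportional to hindcast correlation, and negative correlations are clipped to zero, which is only one of several reasonable choices.

```python
def skill_weights(hindcast_skill):
    """Convert per-RCM hindcast correlations into normalized weights.

    Negative skill is clipped to zero (an assumption of this sketch).
    If no RCM has positive skill, fall back to equal weights.
    """
    clipped = {name: max(r, 0.0) for name, r in hindcast_skill.items()}
    total = sum(clipped.values())
    if total == 0.0:
        return {name: 1.0 / len(clipped) for name in clipped}
    return {name: v / total for name, v in clipped.items()}

def weighted_forecast(forecasts, weights):
    """Weighted average across RCMs for each forecast year.

    forecasts: dict mapping an RCM name to a list of forecast values
               (e.g., ensemble-mean runoff), one per year.
    """
    n_years = len(next(iter(forecasts.values())))
    return [
        sum(weights[name] * forecasts[name][t] for name in forecasts)
        for t in range(n_years)
    ]

# Hypothetical hindcast skill and forecasts for one grid cell and lead time.
hindcast_skill = {"RCM_A": 0.6, "RCM_B": 0.3, "RCM_C": -0.1}
forecasts = {
    "RCM_A": [10.0, 20.0],
    "RCM_B": [13.0, 17.0],
    "RCM_C": [30.0, 5.0],
}

weights = skill_weights(hindcast_skill)
combined = weighted_forecast(forecasts, weights)
print(weights)
print(combined)
```

 In practice the weights would need to be estimated from hindcast years that are independent of the years being evaluated (for example, through cross-validation); otherwise the apparent skill of the weighted average would be inflated.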
 Next, it is important to emphasize that we have focused on hydrologic forecast skill only at monthly/seasonal lead times. We have not investigated the ability of RCMs to forecast subdaily or daily hydrologic extremes, an application that may better exploit the inherent capabilities of RCMs. Moreover, in this study, we adopted a deterministic approach for the evaluation of forecast skill by focusing on the skill of the mean of the ensemble of hydrologic forecasts generated by dynamically downscaled and statistically downscaled climate forecasts. For certain stakeholders, a probabilistic evaluation of forecast skill (one that takes into account the distribution of the forecast ensemble) might be more useful.
 Finally, in this study we focused on winter season forecasts. Winter precipitation is crucial for water resources in the western U.S. (as it is the source of spring snowpack, which provides natural storage of water at the onset of the warm season during which, for instance, water is used for irrigation), and climate predictability is generally highest in the winter season. We acknowledge that for some other regions in the U.S., summer is the primary rainfall season, and it is well known that climate forecast skill in summer months is generally lower than in the winter season. On the other hand, the stronger influence of local-scale features on summer rainfall could result in more value added by dynamical downscaling in summer than in winter. For this reason, exploration of the implications of dynamical downscaling for summer climate and hydrologic predictability is warranted.
 This work was facilitated through the use of advanced computational, storage, and networking infrastructure provided by the Hyak supercomputer system, supported in part by the University of Washington eScience Institute. At the University of California, Santa Barbara, Shraddhanand Shukla was supported by the Postdoc Applying Climate Expertise (PACE) Fellowship Program, partially funded by the NOAA Climate Program Office and administered by the UCAR Visiting Scientist Programs. We would like to thank Jin-Ho Yoon (Pacific Northwest National Laboratory) for his valuable comments and help with accessing CFS data in their native format. We would also like to thank the entire MRED project team, and Raymond Arritt (Iowa State University) and Laurel Dehaan (University of California at San Diego) in particular, for providing us access to MRED model output. The research reported herein was supported in part by NOAA's Climate Program Office under Cooperative Agreement NA08OAR4320899 with the University of Washington. Finally, we thank all four anonymous reviewers for their comments, which we believe have improved the manuscript.