This study compares two approaches, dynamical and statistical downscaling, for their potential to improve regional seasonal forecasts for the United States (U.S.) during the cold season. In the MultiRCM Ensemble Downscaling (MRED) project, seven regional climate models (RCMs) are used to dynamically downscale the Climate Forecast System (CFS) seasonal prediction over the conterminous U.S. out to 5 months for the period of 1982–2003. The simulations cover December to April of the next year, with 10 ensemble members from each RCM that take their different initial and boundary conditions from the corresponding CFS ensemble members. These dynamically downscaled forecasts are compared with statistically downscaled forecasts produced by two bias correction methods applied to both the CFS and RCM forecasts. Results of the comparison suggest that the RCMs add value in seasonal prediction applications, but the improvements largely depend on location, forecast lead time, variables, and skill metrics used for evaluation. Generally, more improvements are found over the Northwest and North Central U.S. for the shorter lead times. The comparison results also suggest that a hybrid forecast system that combines both dynamical and statistical downscaling methods has the potential to maximize prediction skill.
 It has been well documented that global climate models (GCMs) generally lack the ability to produce realistic regional climate features or extreme events needed for climate and hydrologic applications [e.g., Gutowski et al., 1997; Wood et al., 2002; Zhu et al., 2004]. This is primarily related to the coarser spatial resolution that limits the GCMs' ability to capture regional forcings, such as orography, that play an important role in characterizing regional climate features. To address this limitation, regional climate models (RCMs) have been used as “dynamical downscaling” tools to produce climate simulations at high spatial resolution based on coarser resolution simulations by GCMs or global analyses [e.g., Leung et al., 1999, 2004; Leung and Qian 2005; Roads et al., 2003; Castro et al., 2005; Liang et al., 2007; De Sales and Xue 2011]. Statistical downscaling achieves similar goals by deriving empirical relationships between the observed surface climate and GCM outputs of the upper air and/or near-surface conditions [Wilby and Wigley, 1997]. While dynamical downscaling requires high frequency (typically 6 hourly) GCM outputs and large computing resources, statistical downscaling is computationally efficient, although it demands good quality, high spatial resolution observational data over long periods and corresponding historical global climate simulations to develop the empirical relationships. In operational forecast centers, such as the NOAA Climate Prediction Center (CPC), statistical downscaling methods have been widely used because of their computational efficiency.
 As computational power increases, multiRCM forecast becomes practically feasible to meet the demand for higher resolution climate prediction. Two collaborative projects, Prediction of Regional scenarios and Uncertainties for Defining EuropeaN Climate change risks and Effects (PRUDENCE) [Christensen et al., 2007] and Ensembles-based predictions of climate changes and their impact (ENSEMBLES) [Hewitt, 2008; Weisheimer et al., 2009], have developed ensembles of dynamically downscaled climate change scenarios over Europe using multiple GCMs and RCMs. A similar effort, the North American Regional Climate Change Assessment Project (NARCCAP) [Mearns et al., 2009], produces regional climate change scenarios for North America for assessing uncertainty in climate change scenarios. More recently, NOAA is supporting the MultiRCM Ensemble Downscaling (MRED) project to explore the value of dynamical downscaling using multiple RCMs in seasonal climate forecasting over the U.S. Specifically, seven RCMs are used to downscale the National Centers for Environmental Prediction (NCEP) Climate Forecast System (CFS) seasonal forecasts [Saha et al., 2006].
 An overarching goal of the present study is to determine what kind of additional forecast skill can be provided by the ensembles of high-resolution RCMs in seasonal prediction, which is evaluated against statistical downscaling methods directly applied to NCEP CFS [Wood et al., 2004; Yoon et al., 2012]. The MRED project targets the winter–spring forecast because cold season forecast skill is generally higher as the impacts of El Niño-Southern Oscillation (ENSO) on the U.S. cold season climate are larger [Saha et al., 2006], and higher-resolution RCMs tend to have more skill in simulating atmospheric response to topography during the cold season than the warm season that is highly influenced by convection [Leung et al., 2003a, 2003b]. Thus, a cold season study would be a good first test of added value by RCMs in the context of seasonal climate forecasting.
 Our attention primarily is directed at surface climate variables: precipitation (P) and surface air temperature (T). These two are the most basic variables to be predicted at seasonal time scales to generate seasonal outlook [O'Lenic et al., 2008] and for driving hydrologic models to produce soil moisture or streamflow prediction [Luo and Wood, 2007; Wood et al., 2002]. Also, these two variables have relatively long historical records at higher spatial resolution, so statistical downscaling can be easily applied.
 Evaluation of forecast skill added by both the dynamical and statistical downscaling methods is a goal of the present study. In the following sections, statistical downscaling methods and data used in our study are reviewed. Results are given in Section 3, and a short discussion on limited predictability is featured in Section 4. Finally, concluding remarks are provided in Section 5.
2. Data and Methods
2.1. Observational Data
 The North American Regional Reanalysis (NARR) [Mesinger et al., 2006] is used as a verification data set for upper tropospheric circulation. For precipitation (P), the CPC unified precipitation analyses [Xie et al., 2010] and the University of Washington (UW) precipitation analyses [Maurer et al., 2002] are used. For surface air temperature (T), data from UW [Maurer et al., 2002] are used. An optimal interpolation (OI) scheme with terrain adjustment is applied to the station-reported P and T in both the CPC unified precipitation and UW data sets [e.g., Maurer et al., 2002]. Both data sets have a spatial resolution of 1/8° in latitude and longitude and cover the period from 1950 or earlier to the near present. For verification, monthly mean anomalies (denoted as Panom and Tanom hereafter) are defined for each month as the departure from the 1982–2003 monthly mean climatology.
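The anomaly definition above can be illustrated with a minimal sketch. The array layout (years × lead months × grid) and the function name `monthly_anomalies` are assumptions made for illustration only; the synthetic field stands in for real precipitation data.

```python
import numpy as np

def monthly_anomalies(field):
    """Departure of each monthly value from that month's multi-year mean.

    field: array of shape (n_years, n_months, ny, nx); layout is an assumption.
    """
    climatology = field.mean(axis=0, keepdims=True)  # one mean per month and grid point
    return field - climatology

# Synthetic stand-in: 22 years (1982-2003), 5 months (December-April), 2 x 3 grid
rng = np.random.default_rng(0)
p = rng.gamma(2.0, 1.5, size=(22, 5, 2, 3))
p_anom = monthly_anomalies(p)
```

By construction, the anomalies for any given month and grid point average to zero over the climatology period.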
2.2. NCEP CFS Forecasts
 Since August 2004, a fully coupled atmosphere-ocean dynamical seasonal prediction system, the Climate Forecast System (CFS [Saha et al., 2006]), has been operational in NOAA/NCEP. This system was upgraded to its second version in March 2011 [Saha et al., 2010]. In the MRED project, an early version of CFS is used. The atmospheric model is the 2003 version of the Global Forecast System [Moorthi et al., 2001]. Its ocean component is the Geophysical Fluid Dynamics Laboratory (GFDL) Modular Ocean Model version 3 (MOM3) [Pacanowski and Griffies, 1998], and a two-layer land surface model is used [Mahrt and Pan, 1984]. Atmospheric initial conditions are taken from the NCEP/DOE Global Reanalysis (R2) [Kanamitsu et al., 2002]. Oceanic initial conditions are taken from the NCEP Global Ocean Data Assimilation (GODAS) [Behringer et al., 1998]. Detailed documentation and forecast evaluation can be found in Saha et al. .
 Because high temporal and vertical resolution boundary fields are needed to drive regional climate models, CFS is rerun with November–December initial conditions for the MRED project. Initial conditions are clustered in three, five-member groups: 1) 11–15 November, 2) 21–25 November, and 3) 29 November–3 December. Monthly forecast outputs are generated from December to April of the next year. In our analysis, we focus only on the latter 10 ensemble members, initialized at 21–25 and 29–30 November and 1–3 December due to availability of model outputs from some RCMs. The ensemble mean is the equally weighted mean of all the members.
2.3. RCMs From the MRED Project
 In the present study, seven different RCMs with 10 ensemble members each are analyzed, giving 70 ensemble members in total. Each RCM has its own physics parameterization package and dynamical configuration. Details of the model configuration and general performance of the RCMs are discussed in Arritt . All of the RCMs are forced by 6-hourly boundary conditions and initial conditions obtained from the CFS hindcasts. The model domain of each RCM can be slightly different due to map projection and grid discretization, but we use an analysis domain similar to that of the North American Regional Reanalysis (NARR) on a 0.375° × 0.375° latitude-longitude grid. All of the RCM outputs and observations are linearly interpolated to this common grid. An example of downscaled precipitation (P) produced by the RCMs is shown in Figure S1 in the auxiliary material. The list of models, the research and operational centers that produce the hindcasts, and references are summarized in Table 1.
Table 1. List of the RCMs Used in the Present Study
 In the past, various statistical downscaling methods [Wilby and Wigley, 1997], ranging from simple bilinear interpolation to more complex methods such as the Bayesian merging technique, were used. In this study, two different methods, Bias Correction and Spatial Disaggregation (BCSD) [Wood et al., 2002] and the Bayesian merging technique (Bayesian), are applied [Coelho et al., 2004; Luo et al., 2007]. Generally, these statistical downscaling schemes can be decomposed into two steps. One is to produce a higher spatial resolution data set, and the other is to apply bias correction to the newly created data. Although these can be done together, as in case of the Bayesian merging [Luo et al., 2007], we apply them as two separate steps [Yoon et al., 2012]. Before applying BCSD or the Bayesian merging, all of the P and T hindcasts are bilinearly interpolated to a 35-km resolution grid over the conterminous U.S. After that, BCSD or the Bayesian merging techniques are applied as calibration methods.
 P from the RCMs has more detailed regional structure over the western U.S., where orography plays an important role [Leung et al., 2003a] (see Figure S1). At the same time, several deficiencies are noted in the precipitation over the Pacific Northwest. Given this bias in precipitation simulated by the RCMs, bias correction methods also are applied to the RCM output. The need for bias correction is supported by the joint probability distribution function (PDF) shown in Figure 1 for December precipitation over the western U.S., with observation on the x axis and model forecast on the y axis. Most of the forecast models, except MM5, produce excessive precipitation over the western U.S. This biased PDF hinders the use of raw forecast model output in regional applications, such as hydrologic forecasting [Wood et al., 2004]. Therefore, it becomes necessary to correct both the RCM and GCM output.
2.4.1. Bias Correction and Spatial Disaggregation
 For lead time τ and target year Ty, the cumulative distribution function (CDF) of P(Ty, τ) or T(Ty, τ) is computed at each grid point using all ensemble members. Lead time τ ranges from 1 to 5 months, corresponding to December to April. Data for all of the target years are used in computing CDF, labeled as CDF(Phindcast) or CDF(Thindcast). In the same manner, CDF of the observational data set is obtained, labeled as CDF(Pobs) or CDF(Tobs) from Pobs or Tobs, respectively. We label this period without the target year as the “training period.” At each grid point, the percentile of P and T is determined based on the CDF. Equal quantile mapping is then performed to transform the hindcast value to an observation value. When the percentile value falls above or below the range of the empirical Weibull percentiles, an extreme value type III Weibull function or extreme value type I Gumbel distribution is used as an approximation. The detailed procedures and discussions are provided by Wood et al. .
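The quantile-mapping step described above can be sketched as follows for a single grid point and lead time. This is an illustrative simplification, not the implementation used in the study: the function names are hypothetical, the training sample pools all ensemble members from the non-target years, and values outside the empirical range are simply clamped to the end points rather than extrapolated with fitted Weibull or Gumbel distributions.

```python
import numpy as np

def empirical_percentile(sample, value):
    """Percentile of value within sample, using Weibull plotting positions rank/(n+1)."""
    s = np.sort(np.asarray(sample, dtype=float))
    probs = np.arange(1, s.size + 1) / (s.size + 1.0)
    # np.interp clamps to the end points outside the sample range (simplification)
    return np.interp(value, s, probs)

def quantile_map(hindcast_train, obs_train, value):
    """Map a hindcast value to the observed value at the same percentile."""
    q = empirical_percentile(hindcast_train, value)
    o = np.sort(np.asarray(obs_train, dtype=float))
    probs = np.arange(1, o.size + 1) / (o.size + 1.0)
    return np.interp(q, probs, o)

def calibrate_leave_one_out(hindcast, obs):
    """hindcast: (n_years, n_members); obs: (n_years,). Target year is withheld from training."""
    n_years = hindcast.shape[0]
    corrected = np.empty(hindcast.shape, dtype=float)
    for ty in range(n_years):
        keep = np.arange(n_years) != ty
        corrected[ty] = quantile_map(hindcast[keep].ravel(), obs[keep], hindcast[ty])
    return corrected

# Synthetic example: a wet-biased, over-dispersed hindcast at one grid point
rng = np.random.default_rng(1)
obs = rng.gamma(2.0, 1.5, size=22)
hindcast = 2.0 * obs[:, None] + 3.0 + rng.normal(0.0, 1.0, size=(22, 10))
corrected = calibrate_leave_one_out(hindcast, obs)
```

In this toy case the corrected ensemble stays within the observed range and its mean is much closer to the observed mean than the raw hindcast mean is.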
 An advantage of the BCSD method is that it corrects both the mean and standard deviation of the ensemble hindcasts using transformation to the normal space. Thus, the corrected P and T should have mean and standard deviation similar to the observation. Also, the bias-corrected values are already terrain-adjusted because the observation data sets used in our analysis are already corrected for orography [Maurer et al., 2002; Xie et al., 2010]. Of note, only the calibration part of the BCSD is used in our present study.
 For December 1995, precipitation and surface air temperature anomalies from various RCMs and the BCSD are shown in Figures 2 and S2. Consistent with Figure 1, Panom from most of the RCMs has large amplitude over the contiguous U.S. On the contrary, the BCSD produces Panom with similar amplitude to observation (Figure 2). The Tanom in Figure S2 also supports this notion. The simplicity and robustness in correcting the forecast makes the BCSD method widely used among the hydrologic forecast community [Wood et al., 2002].
2.4.2. Bayesian Merging
 The next method employs Bayes' theorem, which is based on the probability distribution of P and T, historical performance of the hindcast, and ensemble spread of the forecast. Similar to BCSD, both hindcast and the corresponding observation initially are transformed to the normal space (using the previously described methods). Afterward, historical performances, measured by linear regression and ensemble spread of the forecast, are considered to adjust the hindcast and finally are transformed back to the physical domain.
 A detailed description of this method can be found in Coelho et al. [2004] and Luo et al. [2007]. In this paper, we follow the formulation of Yoon et al. . Here, three important aspects are briefly discussed. First, to update the forecast based on Bayes' theorem shown in equation (1), the prior distribution constructed from observation in the training period, labeled as p(Pobs) or p(Tobs), and the unconditional or marginal distribution, p(y), based on the corresponding hindcasts (y), are transformed to the normal distribution and represented as N(μx, σx²), where μx and σx² are the mean and variance determined from x, and x can be Pobs, Tobs, Phindcast, or Thindcast. Then, Bayes' theorem can be applied as in equation (1):
$$p(P \mid y) = \frac{p(y \mid P)\,p(P)}{p(y)} \qquad (1)$$
where p(P) is the prior p(Pobs), and p(P|y) is the posterior distribution of P, i.e., the distribution after the prior has been updated with the information in p(y). The posterior is computed based on p(y|P), the likelihood function, which is modeled using a linear regression between the hindcast and the corresponding observation.
 Second, the likelihood function p(y|P) or p(y|T) is determined by least squares fit of the ensemble mean hindcasts to the corresponding observation in the training period in which the target year is removed [Yoon et al., 2012]
$$\bar{y} = \alpha + \beta x + \varepsilon \qquad (3)$$
where ȳ is the ensemble mean of the hindcasts, x is the corresponding observation, α and β are the regression intercept and slope, and ε is the residual or error of fitting. After this linear fitting, the likelihood function can be expressed as a normal distribution with mean α + βx and variance φε, which is the variance of the residual ε from the linear regression in equation (3).
 Finally, the posterior distribution can be obtained by applying Bayes' theorem (equation (1)). There are two important factors affecting the posterior distribution: 1) skill of the hindcasts, determined by linear fitting, and 2) ensemble spread of the forecast. For example, if the forecast at a certain grid point has historically low skill and large ensemble spread, the posterior would come as an almost zero anomaly forecast, i.e., climatological value of observation. In our present study, two cases of the Bayesian merging techniques are used. One does not consider ensemble spread following [Coelho et al., 2004; Yoon et al., 2012], and the other does [Luo et al., 2007]. By comparing these two cases, one can estimate the influence of ensemble spread on forecast skill. It is well known that the ensemble spread of CFS can be very large over the U.S. [Luo and Wood, 2006], partially due to the fact that initial conditions for the ensemble CFS hindcast are obtained from reanalysis conditions with dates spanning as many as 20 days apart. In other words, initial conditions for November 11 and December 3 could reflect especially varied weather and climate conditions.
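The update described above can be sketched for a single grid point, assuming all quantities have already been transformed to the normal space. With a normal prior N(μo, σo²) and a normal likelihood with mean α + βx and variance φε, the posterior is also normal, with precision 1/σo² + β²/φε. The function names and the `spread_var` argument are hypothetical; adding the ensemble variance to the likelihood variance is only one simple way to represent the "with ensemble spread" case and may differ from the formulation in Luo et al. [2007].

```python
import numpy as np

def fit_likelihood(ybar_train, x_train):
    """Least squares fit ybar = alpha + beta * x + eps over the training years."""
    beta, alpha = np.polyfit(x_train, ybar_train, 1)  # returns slope, then intercept
    resid = ybar_train - (alpha + beta * x_train)
    phi = resid.var(ddof=2)                            # residual variance
    return alpha, beta, phi

def bayes_merge(ybar, x_train, ybar_train, spread_var=0.0):
    """Posterior mean and variance of the observable given ensemble-mean forecast ybar.

    spread_var: optional ensemble variance added to the likelihood variance
    (an assumption of this sketch).
    """
    mu_o, var_o = x_train.mean(), x_train.var(ddof=1)  # prior from observations
    alpha, beta, phi = fit_likelihood(ybar_train, x_train)
    like_var = phi + spread_var
    post_var = 1.0 / (1.0 / var_o + beta**2 / like_var)
    post_mean = post_var * (mu_o / var_o + beta * (ybar - alpha) / like_var)
    return post_mean, post_var

# Synthetic training set: 21 years (target year withheld), moderately skillful hindcast
rng = np.random.default_rng(2)
x_train = rng.normal(size=21)
ybar_train = 0.5 * x_train + rng.normal(0.0, 0.5, size=21)
mean_no_spread, var_no_spread = bayes_merge(1.0, x_train, ybar_train)
mean_spread, var_spread = bayes_merge(1.0, x_train, ybar_train, spread_var=4.0)
```

The posterior mean is a precision-weighted average of the prior mean and the regression-inverted forecast, so a larger likelihood variance (low historical skill or large spread) pulls the result toward the observed climatology, which matches the near-zero anomaly behavior described in the text.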
 The influence of ensemble spread can be easily seen in Panom and Tanom for the December 1995 case (Figures 2 and S2). If ensemble spread is considered, the P anomalies after Bayesian merging are very weak (Figure 2), indicating that precipitation forecasts from either CFS or the multiRCMs have large ensemble spread over the U.S. Thus, the Bayesian merging gives them very small weight. In the case of no ensemble spread, the P anomalies have magnitudes similar to observation. Another important factor in the Bayesian merging is historical forecast skill. Figures 2 and S2 show the importance of this term, where negative forecast skill could reverse the sign of the P anomalies. On the other hand, the T anomalies after Bayesian merging with ensemble spread still are comparable to observation in magnitude. In the case of no ensemble spread, the T anomalies have larger magnitude than the P anomalies. One potential advantage of the Bayesian merging with ensemble spread can be found in the T anomalies over the Eastern U.S., where all the forecast models predict above-normal temperature, in disagreement with observation. Only the Bayesian merging with ensemble spread can capture the bias and make a near-zero anomaly forecast over the Eastern U.S.
2.5. Cross Validation
 The downscaled P and T are validated against historical observation for the analysis period of 1983–2003. There are three metrics used to measure the forecast skill.
 1. Temporal or anomaly correlation (AC): At each forecast lead time, the correlation coefficient at each grid point is computed in the following way:
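$$\mathrm{AC} = \frac{\sum_{\mathrm{year}} \mathrm{hindcast}(\mathrm{year})\,\mathrm{observation}(\mathrm{year})}{\sqrt{\sum_{\mathrm{year}} \mathrm{hindcast}(\mathrm{year})^{2}\,\sum_{\mathrm{year}} \mathrm{observation}(\mathrm{year})^{2}}} \qquad (5)$$
where year includes 1982 to 2003. Hindcast and observation are anomalies from their own climatologies following Saha et al. . The AC is computed at each lead month of forecast. We also provide additional plots in Figures 7 and 8, showing the area (%) where each RCM and statistical method shows positive correlation.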
 2. Spatial correlation: At each year and lead month of forecast, the spatial correlation coefficient can be computed in a similar way as equation (5), except that the x- and y-axes together form a single time series. Spatial correlation indicates how well the forecast reproduces the spatial pattern compared to observation. This spatial correlation largely depends on the climatological features.
 3. Root Mean Squared Error (RMSE): It is necessary to measure how close the forecast anomalies are to observation. Thus, we use RMSE, which indicates the forecast error in terms of anomaly. For example, a smaller value of RMSE indicates better forecast and vice versa. RMSE is computed using the following formula:
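$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{\mathrm{year}} \left[\mathrm{hindcast}(\mathrm{year}) - \mathrm{observation}(\mathrm{year})\right]^{2}} \qquad (6)$$
where N is the number of years, and hindcast and observation are anomalies from their own climatologies, just as with the AC.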
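The three metrics listed above can be sketched as follows. The array layout (years × ny × nx) and the function names are assumptions for illustration. The spatial correlation applies the same product-sum formula as the AC to the flattened grid of a single year, following the description in item 2; whether the spatial mean is removed first is not specified in the text and is not done here.

```python
import numpy as np

def _corr(a, b, axis):
    """Product-sum correlation of anomalies along the given axis."""
    num = (a * b).sum(axis=axis)
    den = np.sqrt((a**2).sum(axis=axis) * (b**2).sum(axis=axis))
    return num / den

def anomaly_correlation(hindcast, obs):
    """Temporal AC at each grid point; inputs are anomalies of shape (n_years, ny, nx)."""
    return _corr(hindcast, obs, axis=0)

def spatial_correlation(hindcast_map, obs_map):
    """Correlation over all grid points of one year and lead month."""
    return _corr(hindcast_map.ravel(), obs_map.ravel(), axis=0)

def rmse(hindcast, obs):
    """Root mean squared error of the anomalies over years, at each grid point."""
    return np.sqrt(((hindcast - obs) ** 2).mean(axis=0))

# Synthetic anomalies: 22 years on a 4 x 5 grid
rng = np.random.default_rng(3)
obs = rng.normal(size=(22, 4, 5))
obs -= obs.mean(axis=0)
hindcast = 0.6 * obs + 0.8 * rng.normal(size=(22, 4, 5))
hindcast -= hindcast.mean(axis=0)

ac_map = anomaly_correlation(hindcast, obs)          # shape (4, 5)
pattern_1995 = spatial_correlation(hindcast[13], obs[13])
rmse_map = rmse(hindcast, obs)                       # shape (4, 5)
```

Because each metric captures a different aspect of skill, a forecast can score well on correlation while still having a large RMSE if its anomaly amplitude is wrong.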
 To assess forecast skill, these metrics are applied to both the dynamically and statistically downscaled P and T. Each skill measure explains one aspect—not the whole. Often, it is advantageous to use various metrics. Also, the probabilistic nature of forecast skill is compared using a reliability diagram [Wilks, 2005] that depicts the probabilities of observation and forecast in a single diagram, indicating whether the predicted probability agrees with the observed frequencies. A diagonal line indicates perfect reliability. The distance from this diagonal line illustrates how reliable a forecast is or whether it is over (or under) confident.
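The points of a reliability diagram can be computed with a short sketch like the one below. The function name, the number of bins, and the synthetic data are assumptions; in practice the forecast probability for an event such as above-normal precipitation could be taken as the fraction of ensemble members predicting that event.

```python
import numpy as np

def reliability_curve(prob_forecast, event_observed, n_bins=5):
    """Mean forecast probability and observed event frequency in each probability bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index for each forecast; probabilities equal to 1.0 fall in the last bin
    idx = np.clip(np.digitize(prob_forecast, edges) - 1, 0, n_bins - 1)
    mean_prob = np.full(n_bins, np.nan)
    obs_freq = np.full(n_bins, np.nan)
    for b in range(n_bins):
        sel = idx == b
        if sel.any():
            mean_prob[b] = prob_forecast[sel].mean()
            obs_freq[b] = event_observed[sel].mean()
    return mean_prob, obs_freq

# Synthetic, perfectly reliable forecasts: the event occurs with the stated probability
rng = np.random.default_rng(4)
probs = rng.uniform(0.0, 1.0, size=20000)
occurred = rng.uniform(size=20000) < probs
mean_prob, obs_freq = reliability_curve(probs, occurred)
```

Plotting `obs_freq` against `mean_prob` gives the reliability curve. An overconfident forecast system produces a curve flatter than the diagonal: high forecast probabilities verify less often than stated, and low ones more often.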
 The example using the December 1995 case demonstrates the difference between the dynamical and statistical approaches. For a more quantitative view of forecast performance, a series of forecast skill measures are analyzed herein. Figure 3 shows the AC between the downscaled P hindcast at one-month lead and the corresponding Pobs for December. The correlation needs to be greater than 0.48 (0.36) to be statistically significant at the 95% (90%) confidence level. Thus, values below 0.36 are not shaded in Figures 3–6. The forecast skill is low overall, and it is regionally dependent. Previous work has shown that winter forecasts are more skillful over the Southeast and North Central U.S. [Yoon et al., 2012], but our results show that forecast skill also is generally higher in the Southwest and Great Basin. For T, all forecasts show higher skill in the northern region (Figure 4). For both T and P, differences in forecast skill among CFS, RCMs, and statistical downscaling are generally small in the Southeast and Southwest. However, compared to CFS, the individual RCM forecasts for P typically have higher AC in the Northwest and North Central regions, where orography and ENSO play an important role in determining precipitation anomalies [Leung et al., 1999]. Applying BCSD to the RCMs also produces high AC for P.
 Furthermore, it is worthwhile to examine how the AC changes at the longer forecast lead in Figures 5 and 6. Generally, compared to December, higher correlation is found over the southwestern U.S. at the long lead in March for both T and P. This sustained or even higher skill in P in March compared to December is found in both CFS and RCMs and could be related to the stronger influence of sea surface temperature (SST) on spring (as opposed to winter) conditions in the southwestern U.S. during ENSO years. There also potentially is a role for land-atmosphere coupling to enhance forecast skill in spring [e.g., Guo et al., 2011], as skill in predicting soil moisture and/or snowpack during the winter could transfer to enhance predictability in T and P in spring, when snowmelt or soil moisture begins to influence atmospheric processes. It would be interesting to investigate the role of land-atmosphere coupling in prediction skill by extending the downscaling study to the summer months in future studies. To summarize forecast skill measured by temporal correlation, the cumulative percentage areas that have correlations higher than a certain value are shown in Figures 7 and 8. At the Month 1 lead, most RCMs, with a couple of exceptions, have more grid points that show improved forecast skill compared to CFS, and BCSD of the multiRCMs also shows better skill. At the longer lead (Month 4), skill improvement is less clear. The Bayesian method shows large improvement in surface air temperature at the longer lead.
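The cumulative percentage area summary can be sketched as a simple count of grid points exceeding each correlation threshold. The function name is hypothetical, and every grid cell is given equal weight, which is an assumption of this sketch rather than something stated in the text.

```python
import numpy as np

def area_percent_above(ac_map, thresholds):
    """Percentage of valid grid points whose correlation exceeds each threshold."""
    valid = ac_map[np.isfinite(ac_map)]  # ignore masked (e.g., ocean) points stored as NaN
    return np.array([100.0 * (valid > t).mean() for t in thresholds])

# Tiny worked example: four grid points, thresholds at 0 and at the 0.36 shading cutoff
ac_example = np.array([[0.1, 0.5], [-0.2, 0.4]])
area = area_percent_above(ac_example, [0.0, 0.36])
# Three of four points exceed 0; two of four exceed 0.36
```

Evaluating this for a range of thresholds and for each model gives one curve per model, which allows the full distribution of skill to be compared at a glance.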
 Next, we use spatial correlation to estimate how well the downscaled P and T reproduce the spatial pattern of observation shown in Figure 9. Consistent with prior works [Saha et al., 2006; Wang et al., 2010], the forecast of T exhibits higher skill than that of P. Also, most RCMs do produce higher spatial correlation, meaning they produce more spatially coherent structure compared to observation than CFS. The only exception is the spatial correlation for ETA_UCLA for precipitation. Notably, ETA_UCLA tends to produce much weaker orographic signature in precipitation than other RCMs (Figures 2 and S1). So, while ETA_UCLA produces more reasonable precipitation amounts, its spatial structure compares less favorably with the observed pattern than all of the RCMs or even CFS (for 1–3 month lead time). All of the statistical methods, as well as the multimodel ensemble, do a good job in capturing the spatial patterns of T or P compared to the raw CFS forecasts.
 Another measure of forecast skill is RMSE, which is used to evaluate how close each forecast is to observation in terms of anomaly magnitude. To evaluate the overall performance of each model, RMSE area-averaged over the contiguous U.S. is displayed in Figure 10. The reduction of RMSE, or enhancement in skill, during spring compared to winter is again noted for T, but not for P. Unlike spatial correlation (Figure 9), RMSE shows a more complicated picture of the performance of the RCMs and statistical downscaling. A couple of RCMs demonstrate higher forecast skill. However, the difference in RMSE between CFS and RCMs is relatively small and depends on which variable is analyzed. For example, the ETA model is superior in RMSE for P (for reasons already discussed), while RAMS stands out in forecasting T. Larger RMSE in the RCMs during the cold season arises mainly from the overpredicted precipitation amounts in the Pacific Northwest and Northern California.
 Conversely, statistical downscaling directly applied to CFS, or bias correction of the RCMs, also performs well as measured by RMSE (Figure 10). Two important findings are: 1) at the first month of forecast, BCSD of all seven RCMs performs the best for both precipitation and surface air temperature, and 2) in the long-lead forecast, Bayesian merging does very well. As measured by RMSE, the raw RCM outputs show forecast errors of similar magnitude to those of CFS. A statistical bias correction method as simple as BCSD applied to the RCMs can provide extra value on top of the raw forecast. Furthermore, BCSD on the RCMs produces lower RMSE than BCSD on CFS for all lead times, suggesting that dynamical downscaling provides additional skill, albeit relatively small compared to the difference between applying or not applying BCSD to the CFS or RCM forecasts.
 It is worthwhile to investigate how each method performs better or worse regionally (Figure 11). At the first month of forecast, all RCMs do well over the western U.S., where topography plays an important role in determining precipitation anomaly. However, except for the ETA model (ETA_UCLA), most RCMs have mixed performance: better on the eastern side of the Cascades and worse on the Pacific side. This feature is related to the general overprediction of precipitation by the RCMs (except the ETA model) on the windward side of the Cascade Range and Sierra Nevada (see Figure S1). Why the ETA model produces much less precipitation along the coastal mountains in general requires further analysis beyond the scope of this study, but it may be related to vertical discretization using the ETA coordinate or convective or microphysical parameterizations used in the model.
 It is clear that RCMs reproduce similar, but generally improved, precipitation (P) and surface air temperature (T) anomalies compared to CFS. However, the improvement is highly dependent on location and forecast lead time. In other words, at some locations and certain lead months, RCMs do add value, but certainly not always and not everywhere. This is further confirmed by the reliability diagram [Wilks, 2005] showing probabilistic forecast skill of both RCMs and CFS (Figure 12). All of the forecasts, whether from CFS or RCMs, are overconfident and show little distinction from one another. For above-normal precipitation forecasts, RCMs are more reliable than CFS in predicting the events that occur more frequently, and vice versa. However, this relationship changes for below-normal precipitation. This is consistent with the general finding that coarse-scale models, such as CFS, tend to have limitations in capturing intense precipitation, but they produce too much drizzle under dry conditions. Therefore, differences between the RCM and CFS skill are largest at the upper and lower ends of the reliability diagram for above- and below-normal precipitation, respectively. For surface air temperature, the results are more complicated. MM5 does a good job in forecasting above-normal temperature for most of the events, but it does poorly for less frequently occurring below-normal temperature events.
 Differences in the climate simulated by the RCMs and CFS are mainly due to the RCMs' ability to better resolve regional forcings, such as orography and physical processes due to the higher spatial resolution. Our results show that RCMs could generally improve the CFS temperature and precipitation forecasts, although the forecasts also are degraded at some locations and months and the improvements are relatively small. Therefore, we consider how different the RCM-simulated climate is compared to the CFS over the U.S. Each RCM is forced by boundary conditions from CFS that are far away from the conterminous U.S. Thus, the CFS boundary conditions' influence should be relatively weak at the central U.S., unless some form of interior nudging is applied. None of the RCMs used in MRED applies interior nudging to constrain the simulations.
 Scatterplots of the geopotential height anomaly at 500 mb over the central U.S. from CFS and RCMs reveal that the first ensemble member of CFS and those of seven RCMs indicate almost a one-to-one relationship (Figure 13). In other words, the RCMs do produce a similar, large-scale circulation pattern compared to CFS even though the central U.S. is far away from the lateral boundaries where CFS forcing is applied. This suggests that in the cold season, large-scale circulation in the RCMs is well constrained by boundary forcing that can be far away from the region of interest because the large-scale circulation is more dominated by synoptic systems than local/regional diabatic heating. On the other hand, scatterplots of the first ensemble member of CFS and the rest of the ensemble members of CFS show that the CFS cannot predict itself [Luo and Wood, 2006] (Figure 13). This indicates that the system's internal variability (noise) is very significant compared to the signal. This sharp contrast clearly indicates that dynamical downscaling is doing what it is supposed to: producing high-resolution information from the low-resolution global model. However, the ability of dynamical downscaling to improve forecast skill is limited by global forecast model biases and the significant internal variability of the climate system (hence, its low predictability).
 One challenge in dynamical downscaling is to overcome errors inherited from the global forecast model compounded by errors associated with the RCM itself. There may be different ways to remove these biases. For example, one can apply bias correction to the global forecast output before it is used to drive the RCM. In our case, bias correction is applied to the downscaled P and T produced by the RCMs. This method seems to perform relatively well based on results from this study.
 Notably, a rebound of forecast skill at longer leads is seen as an increase in the spatial correlation of P (Figure 9) and a reduction in the RMSE of T (Figure 10). Also, long-sustained forecast skill is found in the AC of P and T over the southwestern U.S. (Figures 3–6). The rebound and the long-sustained skill are consistent between the CFS and RCM forecasts, but they depend somewhat on the skill metrics used. Possible reasons for the skill rebound are the stronger SST influence on spring conditions during ENSO years or the transfer of predictability from land to atmosphere [Guo et al., 2011]. Understanding these factors and their relative influence on forecast skill requires more detailed analysis and possibly extending the numerical experiments through the summer, which is beyond the scope of this study.
5. Concluding Remarks
 For practical applications in resource management and planning, high-resolution seasonal prediction is much needed [Hurrell et al., 2009]. Predicting climate anomalies, including extremes a month or a season in advance, can provide invaluable lead time for society to prepare and allocate limited resources. In recent decades, advances in observing, data assimilation, and model prediction systems have all made it possible for coupled GCMs to predict one of the most important interannual variabilities, ENSO, one to three seasons in advance [Saha et al., 2006]. Therefore, it becomes possible to predict the large-scale seasonal anomalies over the U.S. with some skill. However, there are many remaining challenges, including improving the prediction of the large-scale circulation, as well as regional details of precipitation and surface air temperature anomalies. To fill the latter gap, various downscaling methods have been developed and used. In our study, we examine the multiRCM seasonal forecast produced by the MRED project and two different statistical downscaling methods.
 Dynamical downscaling by the multiRCM provides a unique opportunity to evaluate its added value in winter seasonal forecasts. Each RCM produces 10 ensemble forecasts from initial conditions in November–December. With seven RCMs, a total of 70 ensemble members of 5-month-long winter forecasts are produced and analyzed. In this study, we evaluate the forecast skill of these multiRCM outputs and compare it with that of statistical downscaling methods applied directly to the CFS. Statistical bias correction methods are also applied to the multiRCM output in the comparison.
 Our results can be summarized as follows:
 1. Dynamical downscaling by the multiRCM produces finer-scale seasonal prediction based on the coarser resolution global forecast model. In terms of both climatology and anomaly from the long-term mean, the RCMs generate finer-scale features that are missing from CFS. This robust performance of the RCMs in seasonal prediction is consistent with results obtained when RCMs are used to produce regional weather or climate information [Feser et al., 2011] and is in line with a previous study on seasonal prediction [Kanamitsu and DeHaan, 2011].
 2. Forecast skill of the downscaled P and T varies with the metric used in the cross validation. In terms of temporal AC, RCMs and statistical downscaling methods generally score somewhat higher than CFS, especially in the Northwest and North Central regions. For this skill measure, some RCMs can even outperform the multimodel ensemble or the combined dynamical-statistical methods. For skill measured by spatial correlation, RCMs and statistical downscaling also add value over CFS. The Bayesian method performs poorly for AC because of the large ensemble spread in the forecasts.
 3. Using RMSE as the metric, we find that a couple of RCMs can reduce forecast errors compared to CFS, but some RCMs have higher RMSE due to the overprediction of precipitation in the Northwest and Northern California. However, the RCMs combined with statistical bias correction stand out clearly. At the first-month lead, simple BCSD of all seven RCMs does surprisingly well. At the longer leads, the Bayesian merging applied to either CFS or RCMs does a good job. Improvement of forecast skill can be found over the mountainous regions, especially the western U.S. during the winter season. This is consistent with previous work by Yoon et al. Also of note, the RCMs' performance is highly dependent on location, forecast lead time, and the metrics used. To compare probabilistic forecast skill, a reliability diagram is shown, which confirms the finding that RCMs do produce more reliable forecasts for certain, but not all, aspects.
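 The two deterministic metrics referred to in the summary above, temporal AC and RMSE, can be sketched as follows. This is a minimal illustration under assumed conventions, not the evaluation code used in this study: the function names, the array layout (forecast years along the first axis), and the use of the full-sample mean as climatology (rather than a cross-validated one) are all assumptions.

```python
import numpy as np

def anomaly_correlation(forecast, observed):
    """Temporal anomaly correlation (AC) along the first (year) axis.

    Anomalies are taken relative to each series' own sample mean,
    which is an assumption of this sketch (no cross validation).
    """
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    fa = f - f.mean(axis=0)  # forecast anomalies
    oa = o - o.mean(axis=0)  # observed anomalies
    num = (fa * oa).sum(axis=0)
    den = np.sqrt((fa ** 2).sum(axis=0) * (oa ** 2).sum(axis=0))
    return num / den

def rmse(forecast, observed):
    """Root-mean-square error along the first (year) axis."""
    f = np.asarray(forecast, dtype=float)
    o = np.asarray(observed, dtype=float)
    return np.sqrt(((f - o) ** 2).mean(axis=0))

# Toy series: the forecast tracks the observed variability perfectly
# (AC = 1) but is biased low, so the RMSE is still nonzero.
fcst = [1.0, 2.0, 3.0]
obs = [2.0, 4.0, 6.0]
ac_value = anomaly_correlation(fcst, obs)
rmse_value = rmse(fcst, obs)
```

 The toy series shows why the ranking of methods can differ between metrics: AC is insensitive to a mean bias or amplitude error that RMSE penalizes.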
 The multiRCM ensemble provides a unique perspective on the uncertainty of seasonal prediction. However, it faces various challenges. From a practical perspective, it requires a large amount of computing resources. From a theoretical perspective, RCM forecasts suffer from inherent biases in both the global forecast model and the RCMs themselves. To overcome this, we propose applying statistical bias correction methods to the RCM output. As demonstrated, this hybrid approach provides the most skillful forecast at the first-month lead. Therefore, it could be considered an alternative for producing 2–6 week predictions using high-resolution RCMs forced by the operational CFS. However, this approach needs to be further tested for different seasons, especially spring–summer over the U.S., when convective precipitation becomes dominant.
 We would like to thank all of the MRED project's modeling groups for sharing their RCM results. This study is supported by the NOAA Modeling, Analysis, and Prediction Program (MAPP). Thoughtful comments from Shrad Shukla and Dennis Lettenmaier at the University of Washington, Kingtse Mo at CPC/NWS/NOAA, Thomas Reichler at the University of Utah, S.-Y. (Simon) Wang at Utah State University, and Yun Qian at PNNL were helpful at various stages of this project. Editorial assistance by Charity Plata is greatly appreciated. PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.