The utility of multi-system, coupled model-based seasonal predictions of Arctic sea ice area and extent is investigated for combined predictions from the Climate Forecast System version 2 (CFSv2) and Canadian Seasonal to Interannual Prediction System (CanSIPS) operational seasonal forecasting systems, which are among the first to have sea ice as a prognostic variable. Forecast skills for predictions of total anomalies and departures from long-term linear trends are examined both for the individual systems and the combined forecasts, and are compared against simple predictions such as damped anomaly persistence. Results indicate that the tendency for climate forecasts based on combined output from multiple prediction systems to outperform any one system, demonstrated previously for global variables such as temperature and precipitation, is realized for predictions of Arctic sea ice as well.
 Global coupled climate models are increasingly being applied to climate prediction on subseasonal to multi-seasonal time scales [NRC, 2010]. The comprehensive nature of such models, typically with interacting atmospheric, ocean, land and sea ice components, potentially enables a wide range of useful forecast products to the extent that adequate observations and methodologies exist for initialization, the models themselves accurately represent important phenomena, and natural predictability exists within the climate system.
 The recent marked decline in Arctic sea ice, particularly in summer and early fall, has sparked interest in how well coupled climate prediction models can forecast its near-term evolution. Initial model-based predictability studies [e.g., Holland et al., 2011; Blanchard-Wrigglesworth et al., 2011a, 2011b] have suggested that, depending on the season in which the prediction is made, anomalies in Arctic sea ice area and extent might be predictable to an appreciable degree up to a year or more in advance, owing to a tendency for skill to reemerge in certain seasons following its initial decay with increasing lead time. The results of such idealized predictability studies represent estimated upper limits to forecast skill; however, demonstrating how much of this potential skill can be realized in practice by a forecast system initialized with and validated against observational data is a separate matter.
 While some coupled model-based seasonal to interannual climate forecast systems now include prognostic sea ice components, their ability to predict observed sea ice anomalies has only begun to be examined. For example, the prediction of Arctic sea ice extent in the recently developed National Centers for Environmental Prediction (NCEP) Climate Forecast System version 2 (CFSv2) is described by Wang et al., whereas that of Arctic sea ice area in Environment Canada's Canadian Seasonal to Interannual Prediction System (CanSIPS) is documented in Sigmond et al. While both systems demonstrate skill in Arctic sea ice prediction, the skill is to a large extent attributable to the observed long-term Arctic sea ice decline, especially at lead times longer than about 2 months. Nonetheless, both systems by at least some measures outperform simple forecasts based on anomaly persistence.
 The availability of multiple coupled model-based forecasts of future sea ice evolution presents an opportunity for the development of multi-system sea ice predictions. Combinations of climate predictions from different models or forecasting systems have been shown generally to outperform single-model forecasts due to the increased size of the forecast ensemble together with the partial compensation of differing model errors [Hagedorn et al., 2005; Kharin et al., 2009; Kirtman and Min, 2009]. This study explores the utility of multi-system sea ice predictions obtained by combining CFSv2 and CanSIPS forecasts.
2 Prediction Systems and Verification Data
2.1 CanSIPS Forecasts
 CanSIPS, which has provided Environment Canada's operational multi-seasonal predictions since late 2011, is itself a two-model system, combining predictions from the Canadian Centre for Climate Modelling and Analysis (CCCma) CanCM3 and CanCM4 climate models. Both models share the same two-category (thickness-volume) sea ice component, which is formulated on the T63 atmospheric model grid with sea ice thermodynamics governed by energy balance and dynamics by cavitating-fluid rheology [Flato and Hibler, 1992]. Forecast initial conditions are provided by assimilation runs in which atmospheric variables, sea surface temperatures and sea ice concentrations are constrained near observational values, whereas sea ice thickness is initialized very simply by relaxing to a model climatology as described in Merryfield et al. For the hindcasts analyzed here, sea ice concentrations employed for initialization are daily values obtained through interpolation of the monthly HadISST dataset [Rayner et al., 2003]. Ensembles of 10 forecasts from each model are initialized at the start of every calendar month.
2.2 CFSv2 Forecasts
 The NCEP CFSv2 is a fully coupled model representing the interaction between the Earth's oceans, land and atmosphere (S. Saha et al., “The NCEP Climate Forecast System Version 2”, manuscript submitted to J. Clim.). Sea ice thermodynamics are based on Winton, and the elastic-viscous-plastic dynamics on Hunke and Dukowicz. Forecasts are initialized from the Climate Forecast System Reanalysis [CFSR, Saha and Coauthors, 2010], which assimilates sea ice concentration from the NASA Team analysis [Cavalieri et al., 1996] before 1997 and from NCEP analysis starting January 1997, and determines sea ice thickness through the CFSR ice model response to atmospheric and oceanic forcing. Ensembles of 24 forecasts from each initial month (except 28 for November) are used in this study.
2.3 Verification Data and Methodology
 The sea ice concentration dataset used to verify CFSv2 and CanSIPS predictions is based on the application of the Bootstrap algorithm of Comiso to passive microwave satellite data [Comiso, 1999]. This choice is motivated by two considerations. First, it differs from both the HadISST and NASA Team datasets used to initialize CanSIPS and CFSv2, respectively, and therefore offers a modicum of independence that at least partially avoids any enhancement of skill due solely to using identical datasets for initialization and verification. Second, the Comiso concentrations have been argued to be more reliable than NASA Team concentrations due to the latter having been found to be biased low compared to independent observations (D. Notz, “Sea-ice extent provides a limited metric of model performance”, manuscript to be submitted to The Cryosphere), whereas HadISST suffers from inconsistencies due to changes in its input data sources [Meier et al., 2012], which distort the long-term trends [Sigmond et al., 2013]. Impacts of choosing alternative verification data sets are examined in the Supplementary Information.
 Predictions of Northern Hemisphere sea ice area and extent are assessed. Ice area is computed as the sum of grid cell areas weighted by ice concentration, and ice extent as the total area of grid cells having ice concentrations ≥0.15, both common definitions. Forecast values are computed from concentrations on the model grids for each system, and observed values from concentrations on the nominal 25 km grid on which Bootstrap data is provided. In addition to ensemble mean forecasts for each system, a multi-system forecast is computed as the arithmetic mean of the CanSIPS and CFSv2 forecasts. Forecasts having a 10 month range common to both hindcast sets, initialized every month in 1982–2009 for which all verification data are available, are considered. Climatological biases as functions of target month and lead for the two CanSIPS models and the CFSv2 model are shown in Figure S1 in the Supplementary Information.
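 The area and extent definitions and the two-system average described above can be illustrated with a minimal sketch. It is not the code used for the study: the function names, the flattened-grid representation and the sample numbers are all hypothetical, and plain Python lists stand in for gridded concentration fields.

```python
def ice_area(concentration, cell_area):
    """Sum of grid cell areas weighted by ice concentration (fraction, 0 to 1)."""
    return sum(c * a for c, a in zip(concentration, cell_area))


def ice_extent(concentration, cell_area, threshold=0.15):
    """Total area of grid cells whose ice concentration is at least the threshold."""
    return sum(a for c, a in zip(concentration, cell_area) if c >= threshold)


def multi_system_mean(forecast_a, forecast_b):
    """Arithmetic mean of two systems' ensemble-mean forecasts, element by element."""
    return [(a + b) / 2.0 for a, b in zip(forecast_a, forecast_b)]


# Hypothetical flattened grid: concentrations (fractions) and cell areas (km^2).
conc = [0.0, 0.10, 0.15, 0.50, 1.00]
area = [625.0] * 5

print("area  :", ice_area(conc, area))
print("extent:", ice_extent(conc, area))
# Hypothetical ensemble-mean forecasts from two systems for two target months.
print("mean  :", multi_system_mean([1.0, 3.0], [3.0, 5.0]))
```

 Because extent counts the full area of every cell at or above the threshold, it is never smaller than area computed from the same field when cells below the threshold contribute little.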
 Monthly mean anomalies for each target and lead month are considered. Because much of the skill in predicting the total anomalies is attributable to pronounced long-term Arctic sea ice trends [e.g., Lindsay et al., 2008], detrended anomalies consisting of departures from the linear trends are also considered. Skills are assessed using the anomaly correlation coefficient (ACC), which is sensitive to errors in the sign and relative magnitude of the anomalies, and root mean-square error (RMSE), which is sensitive to both random and systematic errors in the magnitude of anomalies. The value of forecasts from the individual and combined systems is assessed through comparison with predictions obtained through two simple means: damped persistence, whereby the observed mean anomaly for the initialization month immediately preceding the forecast is scaled by the lagged correlation between the initialization and target months as in Wang et al., and a trend forecast consisting of the observed linear trend for each calendar month.
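 The detrending, the two skill measures and the damped persistence forecast can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually coded for the study: all function names are hypothetical, each input is a list with one value per forecast year for a fixed target month and lead, and the lagged correlation is computed in sample.

```python
import math


def mean(values):
    return sum(values) / len(values)


def linear_detrend(series):
    """Departures of a yearly series from its least-squares linear trend."""
    t = list(range(len(series)))
    t_bar, s_bar = mean(t), mean(series)
    slope = (sum((ti - t_bar) * (si - s_bar) for ti, si in zip(t, series))
             / sum((ti - t_bar) ** 2 for ti in t))
    return [si - (s_bar + slope * (ti - t_bar)) for ti, si in zip(t, series)]


def acc(forecast, observed):
    """Anomaly correlation coefficient (Pearson correlation over forecast years)."""
    f_bar, o_bar = mean(forecast), mean(observed)
    cov = sum((f - f_bar) * (o - o_bar) for f, o in zip(forecast, observed))
    var_f = sum((f - f_bar) ** 2 for f in forecast)
    var_o = sum((o - o_bar) ** 2 for o in observed)
    return cov / math.sqrt(var_f * var_o)


def rmse(forecast, observed):
    """Root mean-square error over forecast years."""
    return math.sqrt(mean([(f - o) ** 2 for f, o in zip(forecast, observed)]))


def damped_persistence(init_anomalies, target_anomalies):
    """Initialization-month anomalies scaled by the init/target lagged correlation."""
    r = acc(init_anomalies, target_anomalies)
    return [r * a for a in init_anomalies]


# Hypothetical yearly ice-area values (10^6 km^2) with a downward trend.
observed = [7.0, 6.9, 6.6, 6.7, 6.3, 6.4, 6.0, 5.8]
forecast = [6.9, 6.8, 6.7, 6.6, 6.4, 6.3, 6.1, 5.9]
print("total ACC    :", acc(forecast, observed))
print("detrended ACC:", acc(linear_detrend(forecast), linear_detrend(observed)))
print("RMSE         :", rmse(forecast, observed))
```

 In this made-up series the forecast is almost purely a trend, so the correlation of total anomalies is high while the detrended correlation carries the information about year-to-year departures, which is the distinction the text draws.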
 The utility of multi-system sea ice forecasts is illustrated in Figure 1, which shows ACC for detrended anomalies in ice area, with hatching denoting p < 0.05 significance based on comparison with 95th percentile skills of a distribution consisting of 10³ randomizations of the forecast values. Such skills for predicting anomalies from the linear trends are considerably lower than for total anomalies, as has been discussed by Lindsay et al., Wang et al. and Sigmond et al. Relatively little skill remains both for CanSIPS (Figure 1a) and CFSv2 (Figure 1b) after lead two (the third forecast month), although there is some evidence for longer-lead skill, primarily for target months in late fall (mainly for CFSv2) and winter (for both systems). This apparently corresponds to an observed tendency for detrended cold-season anomalies in ice area and extent to be correlated from one year to the next, possibly as a result of slowly varying ocean thermal influences in regions of winter ice cover as discussed by Blanchard-Wrigglesworth et al. [2011a].
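 A randomization test of the kind described for the hatching in Figure 1 can be sketched as below. The details are assumptions: forecast values are shuffled across years, the 95th percentile is taken by nearest rank, and the function names, seed and sample data are hypothetical rather than taken from the study.

```python
import math
import random


def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def randomization_threshold(forecast, observed, n_random=1000,
                            percentile=95.0, seed=0):
    """Nearest-rank percentile of ACC over random reorderings of the forecasts."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    shuffled = list(forecast)
    skills = []
    for _ in range(n_random):
        rng.shuffle(shuffled)  # breaks the pairing between forecast and observed years
        skills.append(correlation(shuffled, observed))
    skills.sort()
    rank = math.ceil(percentile * n_random / 100.0)
    return skills[min(n_random, max(1, rank)) - 1]


def is_significant(forecast, observed, **kwargs):
    """True if the actual ACC exceeds the randomized threshold."""
    return correlation(forecast, observed) > randomization_threshold(
        forecast, observed, **kwargs)


# Hypothetical detrended anomalies for 10 forecast years.
obs = [0.3, -0.2, 0.1, -0.4, 0.5, 0.0, -0.1, 0.2, -0.3, 0.4]
fcst = [0.2, -0.1, 0.0, -0.3, 0.3, 0.1, -0.2, 0.1, -0.1, 0.2]
print("ACC        :", correlation(fcst, obs))
print("threshold  :", randomization_threshold(fcst, obs))
print("significant:", is_significant(fcst, obs))
```

 Shuffling preserves the distribution of forecast values while destroying their alignment with the observations, so the threshold reflects the skill expected by chance for a sample of that size.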
 ACC skills (Figure 1c) are generally higher for the combined forecast: multi-system ACC averaged over target and lead months is 0.37, compared to 0.32 for both CanSIPS and CFSv2. Persistence (Figure 1d), while also showing reemergence of skill at longer leads for target months in fall and winter, is on the whole less skillful than the dynamical forecasts, with a mean of 0.28. The enhancement of skill in the multi-system forecast compared to persistence, shown in Figure S4g, occurs mainly for target months in winter at leads of 1–7 months, corresponding to initialization in the preceding summer and fall, and for June to September forecasts initialized in April and May.
 RMSE skills for detrended ice area anomalies are shown in Figure 2. As for ACC, skill is evident mainly at the shortest leads, whereas at longer leads, the skills tend to reflect the observed seasonal cycle of detrended anomaly standard deviation, which peaks near 0.5 × 10⁶ km² in September and remains below about 0.3 × 10⁶ km² from November through June. The reason for this can be seen by writing RMSE explicitly as
$$\mathrm{RMSE} = \left[\overline{(y - x)^2}\right]^{1/2} = \left(\sigma_y^2 + \sigma_x^2 - 2\,\sigma_y \sigma_x\,\mathrm{ACC}\right)^{1/2},$$
where y are the predicted anomalies, x are the observed anomalies, σy and σx are the corresponding standard deviations, and overbars represent averaging over available forecast years. As ACC decreases at longer leads (Figure 1) and σy decreases due to the predicted ensemble means relaxing toward climatology as predictability is lost (for dynamical forecasts) or the decay of the lagged correlations (for damped persistence), RMSE tends toward σx. This happens most rapidly for damped persistence, for which RMSE becomes essentially σx after lead three or so.
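 The relation between RMSE, ACC and the two standard deviations, RMSE² = σy² + σx² − 2σyσx·ACC for zero-mean anomalies, can be checked numerically. The sketch below is illustrative only: the function names and sample anomalies are hypothetical, and standard deviations are taken as population values (dividing by the number of years), which is what makes the identity exact.

```python
import math


def mean(values):
    return sum(values) / len(values)


def anomalies(values):
    """Remove the mean over forecast years so the series has zero mean."""
    m = mean(values)
    return [v - m for v in values]


def std(anoms):
    """Population standard deviation of a zero-mean series."""
    return math.sqrt(mean([a * a for a in anoms]))


def acc(y, x):
    """Anomaly correlation of two zero-mean series."""
    return mean([a * b for a, b in zip(y, x)]) / (std(y) * std(x))


def rmse_direct(y, x):
    return math.sqrt(mean([(a - b) ** 2 for a, b in zip(y, x)]))


def rmse_from_acc(sigma_y, sigma_x, acc_value):
    """RMSE implied by the standard deviations and ACC for zero-mean anomalies."""
    squared = sigma_y ** 2 + sigma_x ** 2 - 2.0 * sigma_y * sigma_x * acc_value
    return math.sqrt(max(0.0, squared))  # guard against tiny negative round-off


# Hypothetical predicted (y) and observed (x) anomalies, 10^6 km^2.
y = anomalies([0.2, -0.1, 0.4, -0.3, 0.1])
x = anomalies([0.3, -0.2, 0.5, -0.1, -0.2])
print("direct   :", rmse_direct(y, x))
print("from ACC :", rmse_from_acc(std(y), std(x), acc(y, x)))
# Long-lead limit: no correlation and a fully damped forecast leave RMSE at sigma_x.
print("limit    :", rmse_from_acc(0.0, std(x), 0.0), "vs sigma_x =", std(x))
```

 The last line reproduces the limiting behaviour described in the text: once ACC and σy have both decayed, the only term left is the observed anomaly standard deviation.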
 As for ACC, multi-system RMSE is better on average than for either system alone, with a mean of 0.29 as compared to 0.30 for CanSIPS and 0.33 for CFSv2. Mean damped persistence RMSE is nearly the same as for the multi-system forecast, with the latter tending to have lower RMSE at leads zero to three and slightly higher RMSE at longer leads (Figure S4h).
 The overall performance of the multi-system forecasts relative to damped persistence is compared with that of the individual systems in Figure 3, where each symbol indicates the proportion of target and lead months for which the dynamical forecasts (red) or damped persistence (black) have better skills. Although CFSv2 skills are in most respects better relative to persistence than CanSIPS skills (the case of detrended ice area predictions shown in Figures 1 and 2 being an exception), the combined CanSIPS and CFSv2 multi-system forecasts generally match or outperform both individual systems and show the largest advantage relative to damped persistence. Differences in multi-system skill relative to damped persistence are shown as a function of target month and lead in Figure S4, where it is seen that the largest skill enhancements tend to occur for forecasts starting in late fall or early winter (November through January) or in late spring (May and June).
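 The summary statistic plotted in Figure 3, the proportion of target and lead months for which one forecast has better skill than another, could be computed along the following lines. This is a hypothetical sketch: the function name, the nested-list layout (one row per target month, one column per lead) and the treatment of ties as wins for neither side are assumptions, and the numbers are invented.

```python
def fraction_better(skill_a, skill_b, higher_is_better=True):
    """Fraction of (target month, lead) cells in which skill_a beats skill_b.

    Use higher_is_better=True for ACC and False for RMSE. Ties count for neither.
    """
    wins = 0
    total = 0
    for row_a, row_b in zip(skill_a, skill_b):
        for a, b in zip(row_a, row_b):
            total += 1
            if (a > b) if higher_is_better else (a < b):
                wins += 1
    return wins / total


# Hypothetical ACC values for 2 target months and 3 leads.
acc_dynamical = [[0.8, 0.5, 0.2], [0.7, 0.4, 0.3]]
acc_persistence = [[0.7, 0.5, 0.3], [0.6, 0.2, 0.1]]

print("dynamical better  :", fraction_better(acc_dynamical, acc_persistence))
print("persistence better:", fraction_better(acc_persistence, acc_dynamical))
```

 With ties excluded from both counts, the two fractions need not sum to one, which is worth keeping in mind when comparing the red and black symbols.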
 Equivalent skill comparisons relative to another simple forecast obtained by persisting the observed linear trend for each calendar month (so that the detrended anomalies vanish) are presented in the Supplementary Information, where Figures S5 and S6 correspond to Figures S4 and 3 for damped persistence. In this case, the multi-system forecasts are again better relative to the simple forecast than both individual systems, although in predicting total anomalies, they outperform the trend forecasts only at the shortest leads.
 The relatively poor performance of CanSIPS in predicting total anomalies, evident in Figures 3 and S6, appears to be attributable at least partly to the inaccurate representation of long-term trends in the HadISST data set employed for initializing these forecasts (cf. Figures 2e–2f in ). This hypothesis is supported by two observations. First, the mean negative trend in ice extent (averaged over all target and lead months) for the CanSIPS forecasts is 64% smaller than for the Bootstrap verification data, as compared to 45% smaller for HadISST but only 13% smaller for CFSv2. Second, total anomaly skills for CanSIPS become similar to those for CFSv2 when the difference between the NASA Team and HadISST trend lines is added to the CanSIPS forecasts. (An additional factor likely to reduce trends in the CanSIPS forecasts is that the thickness initialization is based on a stationary model climatology and hence does not fully account for the progressive thinning of Arctic sea ice.) The NASA Team data used for initializing CFSv2 also contains biases in ice concentration and hence ice area, which are low relative to the Bootstrap verification data especially near ice minimum. However, these biases do not appear to adversely impact CFSv2 skill in predicting ice area relative to skill in predicting ice extent (Figures 3, S2, S3 and S6).
 In interpreting the above comparisons, it should be kept in mind that damped persistence and trend skills have likely been overestimated, for the following reasons. For damped persistence, the observed lagged correlations were computed based on all years in the sample including the year being forecasted, whereas in practice, they must be estimated based on data from preceding years. Trends were similarly computed in sample rather than from multi-decadal time series preceding each forecast due to a lack of homogeneous verification data prior to 1979. Furthermore, for multi-model dynamical predictions, RMSE can be improved through a rescaling procedure [Kharin and Zwiers, 2002] that was not applied here.
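 The distinction drawn here between in-sample lagged correlations and correlations estimated from preceding years only can be made concrete with a sketch. This variant was not used in the study; the function names, the `min_years` parameter and the sample anomalies are hypothetical and serve only to show what an operationally realizable damped persistence forecast would involve.

```python
import math


def correlation(x, y):
    """Pearson correlation; returns 0.0 if either series has no variance."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    cov = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    var_x = sum((a - x_bar) ** 2 for a in x)
    var_y = sum((b - y_bar) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom > 0.0 else 0.0


def damped_persistence_preceding(init_anoms, target_anoms, min_years=3):
    """Damped persistence using only years before each forecast year.

    Returns one forecast per year, or None where fewer than min_years of
    preceding data are available to estimate the lagged correlation.
    """
    forecasts = []
    for i, init_value in enumerate(init_anoms):
        if i < min_years:
            forecasts.append(None)
            continue
        r = correlation(init_anoms[:i], target_anoms[:i])  # excludes year i
        forecasts.append(r * init_value)
    return forecasts


# Hypothetical initialization-month and target-month anomalies by year.
init = [1.0, -1.0, 2.0, -2.0, 1.5]
target = [0.5, -0.5, 1.0, -1.0, 0.0]
print(damped_persistence_preceding(init, target))
```

 Because each correlation is estimated from a shorter and noisier record than the full sample, forecasts built this way would generally be expected to score no better than the in-sample version, consistent with the statement that the damped persistence skills reported here are likely overestimates.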
 Finally, the choice of verification dataset is a significant consideration because differences in observational errors will affect the skill scores and also because the dataset used in initializing a forecast will tend to imprint its properties on that forecast at least initially. Therefore, the above analyses were carried out for several alternative verification data sets, including those used in initializing CanSIPS and CFSv2, as reported in Figures S2 and S3. Although there are systematic and sometimes large differences in computed skill depending on which dataset is used for verification, the conclusions reached here that the multi-system forecasts on the whole outperform the individual systems and simple forecasts such as damped persistence are not affected by this choice.
4 Discussion and Summary
 This study has analyzed multi-system dynamical predictions of Arctic sea ice area and extent obtained by combining forecasts from the CanSIPS and CFSv2 operational systems. It was found that the multi-system forecasts outperform by most measures each of these systems individually, and that overall, they also outperform simple forecasts, e.g., damped persistence.
 The improvement in multi-system forecasts relative to the individual systems occurs despite the generally higher skill of CFSv2 relative to CanSIPS, which is likely a result of much higher resolution and more detailed multicategory physics in the CFSv2 ice model as compared to the current generation of CanSIPS; results of Chevallier and Salas-Mélia, for example, suggest that climate models such as CFSv2 using multicategory representations of sea ice should be better able to predict September sea ice extent. However, the benefits of sampling different model errors and increasing overall ensemble size through combination of the two systems apparently outweigh these intrinsic differences in skill.
 The present results can be viewed as providing a baseline for multi-system sea ice prediction, since such predictions are virtually certain to improve over time. This will happen both through increases in the number of dynamical forecast systems providing sea ice forecasts, and through continued improvements in individual systems. For example, the National Multi-Model Ensemble (NMME), to which CanSIPS and CFSv2 both contribute [B. Kirtman et al., manuscript submitted to Bull. Amer. Meteor. Soc.], offers a promising avenue for increasing the number of contributing systems and for developing real time multi-system sea ice forecasts.
 Valuable comments and suggestions were provided by Slava Kharin and Greg Flato. Dirk Notz is thanked for helpful discussions and for sharing a pre-publication manuscript. This work was supported by the Government of Canada through the Beaufort Regional Environmental Assessment.