Multi-model ensemble seasonal forecasting systems have expanded in recent years, with a dozen coupled climate models around the world being used to produce hindcasts or real-time forecasts. However, many models share similar atmospheric or oceanic components, which may result in similar forecasts. This raises the questions of whether the ensemble is over-confident if we treat each model equally, and whether we can obtain an effective subset of models that retains predictability and skill. In this study, we use a hierarchical clustering method, based on the inverse cosine of the anomaly correlation of pairwise model hindcasts, to measure the similarities among twelve American and European seasonal forecast models. Though similarities are found between models sharing the same atmospheric component, different versions of models from the same center sometimes produce quite different temperature forecasts, which indicates that detailed physics packages such as radiation and land surface schemes need to be analyzed when interpreting the clustering result. Uncertainties in clustering at different forecast lead times also make reducing redundant models more complicated. Predictability analysis shows that a multi-model ensemble is not necessarily better than a single model, while the cluster ensemble shows consistent improvement over individual models. The eight-model cluster ensemble forecast shows comparable performance to the full twelve-model ensemble in terms of probabilistic forecast skill for accuracy and discrimination. This study also demonstrates that models developed in the U.S. and Europe are relatively independent of each other, suggesting the necessity of international collaboration in enhancing multi-model ensemble seasonal forecasting.
1. Introduction
 The concept of seasonal forecasting originated in Charney and Shukla's discussion of the contributions to variability over low and mid-latitudes. They concluded that short-term flow instabilities account for most of the variability over mid-latitudes, while boundary anomalies in quantities such as sea surface temperature (SST) contribute to a large part of the low-latitude variability. As surmised by Charney and Shukla, the main predictability of the climate system in terms of seasonal forecasting resides in the ocean and land, where SST, snow and soil moisture evolve more slowly than the atmosphere and impart additional memory to the climate system [Koster et al., 2010; Slingo and Palmer, 2011]. However, even the ocean is a chaotic and nonlinear system at seasonal to decadal time scales. For instance, the initiation of El Niño in 2003 was difficult to forecast with coupled atmosphere-ocean-land general circulation models (CGCMs) due to stochastic forcing from the atmosphere [Slingo and Palmer, 2011]. To improve predictability and quantify uncertainty, ensemble forecasting is an effective approach. Operational climate forecast centers generate ensembles from realizations with different initial conditions, either by perturbing wind stress and SST [Weisheimer et al., 2009], or by running the model with different start dates [Saha et al., 2006]. Nevertheless, single model-based ensembles tend to have under-dispersion errors due to model-specific biases [Slingo and Palmer, 2011]. Therefore, multi-model ensembles, which allow the forecast space to be sampled more completely, have received increasing attention over the last decade [Krishnamurti et al., 1999; Barnston et al., 2003; Palmer et al., 2004]. The multi-model superiority lies in its enhanced reliability [Hagedorn et al., 2005] and spatially complementary features [Yuan et al., 2011].
 Multi-model ensemble seasonal forecasts were first made by utilizing all available models at the time of forecast, with each model contributing several realizations from different initial conditions; this is called the ensemble of opportunity. Recently, owing to increased computing capability and enhanced national/international collaboration, a number of operational forecast centers are producing multi-model ensemble seasonal forecasts, such as the European Operational Seasonal to Interannual Prediction (EUROSIP) and the National Multi-Model Ensemble (NMME) [Zhang et al., 2011]. More than 30 years of hindcasts from a dozen CGCMs are being made available to facilitate skill evaluation [Yuan et al., 2011], model diagnosis [DelSole and Shukla, 2012] and real-time forecasting [Wang et al., 2010]. This also provides an opportunity for assessing similarity or dependence in the model space, which is important to avoid over-confident forecasts, especially if some models share the same components or use modules from the same institution (e.g., six of seven NMME models use ocean models developed at NOAA's Geophysical Fluid Dynamics Laboratory (GFDL)). In fact, a number of studies have assessed model similarity or estimated the effective number of climate models in the context of climate change projection [Knutti, 2010; Masson and Knutti, 2011; Pennell and Reichler, 2011]. Increasingly, scientists recognize that the ensemble of opportunity (availability), which ignores inter-model relationships, is not justified, and that clustering of climate models seems to be necessary. However, how does such clustering affect the predictability and skill of ensemble prediction? Though verification for climate change projection is not available, this can be tested for seasonal hindcasts, and its consistency and stability for forecasts at different lead times can also be investigated.
 In this study, a hierarchical clustering method is applied to the distance matrix of pairwise seasonal precipitation and surface air temperature hindcasts from climate models of the ENSEMBLES project [Weisheimer et al., 2009] and the newly established NMME system [Zhang et al., 2011]. The predictability of the ensemble mean, and probabilistic forecast skill in terms of accuracy and discrimination, are evaluated based on the 24-year hindcast dataset.
2. Observation and Hindcast Data
 The observed precipitation data are from the Climate Prediction Center (CPC) Unified Gauge-Based Analysis for the period 1979–2011 [Chen et al., 2008], and surface air temperature is from the Climate Research Unit (CRU) TS3.1 dataset for 1901–2009 [Mitchell and Jones, 2005]. We aggregate them from daily/monthly means to seasonal mean values, and from half-degree to one-degree resolution. The aggregations use observations over land grid cells only.
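As a rough illustration of the two aggregation steps (a hypothetical sketch, not the processing code actually used in this study), NaN can mark non-land cells so that only land observations contribute:

```python
import numpy as np

def aggregate_to_one_degree(field_half_deg):
    """Average a half-degree field onto a one-degree grid (2x2 blocks).

    field_half_deg: 2-D array (nlat, nlon) at 0.5 degrees; NaN marks
    non-land cells, which are ignored so only land observations are used.
    """
    nlat, nlon = field_half_deg.shape
    blocks = field_half_deg.reshape(nlat // 2, 2, nlon // 2, 2)
    return np.nanmean(blocks, axis=(1, 3))  # mean over each 2x2 block

def monthly_to_seasonal(monthly):
    """Average consecutive triplets of monthly means into seasonal means."""
    monthly = np.asarray(monthly, dtype=float)
    return monthly.reshape(-1, 3).mean(axis=1)
```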
 The ENSEMBLES project, as described by Weisheimer et al. [2009], is a seasonal-to-annual multi-model ensemble forecast system containing five models, each with nine ensemble members. The five models are from the Euro-Mediterranean Centre for Climate Change (CMCC-INGV), the European Centre for Medium-Range Weather Forecasts (ECMWF), the Leibniz Institute of Marine Sciences at Kiel University (IFM-GEOMAR), Météo France, and the UK Met Office (UKMO). The available dataset is at 2.5° × 2.5° resolution, and the hindcast period covers 46 years (1960–2005), with 7-month forecasts starting on the 1st of February, May, August and November of each year. We bilinearly interpolate the precipitation and surface air temperature hindcasts to one degree, and aggregate them to seasonal mean values.
 The NMME is a collaborative prediction system that is being developed through the National Centers for Environmental Prediction (NCEP) Climate Test Bed (CTB) for experimental monthly and seasonal prediction at CPC [Zhang et al., 2011]. Currently, seven models from the University of Miami (COLA-RSMAS-CCSM3 [Kirtman and Min, 2009]), GFDL (GFDL-CM2.1 [Delworth et al., 2006]), the International Research Institute for Climate and Society (IRI-ECHAM4.5-AC and IRI-ECHAM4.5-DC [DeWitt, 2005]), the National Aeronautics and Space Administration (NASA-GMAO [Molod et al., 2012]) and NCEP (NCEP-CFSv1 and NCEP-CFSv2 [Saha et al., 2006, 2010]) are participating. During the phase I project, monthly mean precipitation, SST and surface air temperature at one degree are being archived by IRI (http://iridl.ldeo.columbia.edu/SOURCES/.Models/.NMME/). There are 29 years (1982–2010) of hindcast data, with the number of ensemble members per model varying from 6 to 24, and forecast leads of 8 to 12 months. Again, monthly mean values have been aggregated to seasonal means.
 The overlap period among the observations, ENSEMBLES and NMME is 1982–2005. To be consistent with the ENSEMBLES models, only the forecasts starting from the 1st of February, May, August and November are considered, though some ensemble members from NASA-GMAO and NCEP-CFSv1 & v2 are initialized about half a month before the beginning of those months. There are four seasonal hindcasts that cover all calendar months in each year for each forecast lead. For instance, for the 1-month lead, we have seasonal mean hindcasts for MAM, JJA, SON and DJF. Only data over land grid cells are used from both observations and models.
3. Clustering of Seasonal Forecast Models
 To quantify the distance between the hindcasts of two models, we use the inverse cosine (arccosine) of the anomaly correlation (AC), a non-negative measure that decreases toward zero as the similarity increases. AC is a widely used measure of association that operates on pairs of grid point values in the forecast and observed fields [Wilks, 2011]. Here we use a slightly different form of AC, as defined by Saha et al. [2006], which considers the correlation in both space and time, and we calculate the AC between the ensemble means of the hindcasts from two models rather than between observation and hindcast. We use seasonal anomaly time series across years, where the climatology has been removed for each season separately, so there are 96 forecast cases (4 seasons in 24 years) at each grid point in this study. A hierarchical clustering [Wilks, 2011] is applied to the distance matrix of pairwise model dissimilarities (arccosine of AC) to generate a tree diagram.
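A minimal sketch of the distance computation, assuming each model's ensemble-mean anomaly hindcast is stored as a NumPy array (the dictionary layout and helper names here are hypothetical); the AC form used is a simple space-time pattern correlation, an approximation of the Saha et al. definition:

```python
import numpy as np

def anomaly_correlation(x, y):
    """Space-time anomaly correlation of two anomaly fields."""
    x, y = np.ravel(x), np.ravel(y)
    return np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2))

def model_distance_matrix(hindcasts):
    """Pairwise distance matrix d[i, j] = arccos(AC) between ensemble-mean
    hindcast anomalies; zero for identical models, pi for anti-correlated."""
    names = list(hindcasts)
    n = len(names)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            ac = anomaly_correlation(hindcasts[names[i]], hindcasts[names[j]])
            d[i, j] = d[j, i] = np.arccos(np.clip(ac, -1.0, 1.0))
    return names, d

# With SciPy available, the tree diagram follows directly, e.g.:
#   from scipy.cluster.hierarchy import linkage, dendrogram
#   from scipy.spatial.distance import squareform
#   Z = linkage(squareform(d), method='average')
#   dendrogram(Z, labels=names)
```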
Figure 1 shows the hierarchical clustering of twelve seasonal forecast models for precipitation and surface air temperature at different lead times. Consistent with previous studies on Coupled Model Intercomparison Project Phase 3 (CMIP3) models [Masson and Knutti, 2011; Pennell and Reichler, 2011], models from the same institution (e.g., IRI) are very similar as measured by AC, regardless of variable or forecast lead time. For precipitation (Figures 1a, 1c, and 1e), similarities are also strong between models sharing similar atmospheric components (e.g., IRI-ECHAM4.5-AC, IRI-ECHAM4.5-DC, IFM-GEOMAR and CMCC-INGV all use ECHAM as the atmospheric module) and between successive versions of the same model (e.g., CFSv1 and CFSv2). However, the picture becomes more complicated for surface air temperature. For instance, NASA-GMAO and NCEP-CFSv2 have strong similarity in temperature, while CFSv1 and CFSv2 diverge greatly in the tree plot (Figures 1b and 1d), which might be because forecasted surface air temperature is highly dependent on the radiation and land surface modules. Although CFSv2 is the successor of CFSv1, the two use quite different radiation and land surface packages.
 Besides the difference between precipitation and temperature, there are also some uncertainties in the clustering with respect to forecast lead time. For example, COLA-RSMAS-CCSM3 has the strongest dissimilarity with the other models for the 4-month lead forecast (Figures 1c and 1d), while it has high correlations with the others at 0-month lead (Figures 1a and 1b). Given that clustering of climate models for the 0-month lead forecast might be affected by initial conditions, we select models for further evaluation based on their atmospheric and oceanic components and the clustering results for the 4-month lead hindcasts. For precipitation, IRI-ECHAM4.5-DC, NCEP-CFSv2, ECMWF, Météo France, NASA-GMAO, GFDL-CM2.1, UKMO and COLA-RSMAS-CCSM3 are selected to form an eight-model ensemble called the “cluster ensemble”. NCEP-CFSv1 is not selected because of its similarity to the other top four models in Figure 1c, and its similarity to NCEP-CFSv2 in physics (Figure 1e). For surface air temperature, to stay as consistent with precipitation as possible, we only replace NASA-GMAO with NCEP-CFSv1 to form the cluster ensemble, though CMCC-INGV also shows some independence (Figure 1d). Therefore, the cluster ensemble in this study consists of three ENSEMBLES models that form the EUROSIP system and five NMME models. All models in the multi-model ensembles are weighted equally, so in the following sections each member of a model with more ensemble members carries less weight.
4. Predictability of Ensemble Mean
 In this study, the predictability of the ensemble mean is measured by the square of the correlation (R2), i.e., the coefficient of determination. Model-forecasted seasonal mean precipitation and surface air temperature in each year during 1982–2005 are compared with the corresponding observations by calculating R2. The R2 values represent the fraction of variance explained by the forecasts, and are computed for each grid cell, each season and each forecast lead time. R2 values associated with negative correlations are set to zero to avoid noise [Koster et al., 2010], and all arid regions (annual precipitation less than 50 mm) are masked out in the precipitation analysis.
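The per-grid-cell calculation can be sketched as follows (a hypothetical illustration; cases with negative correlation are zeroed as described above):

```python
import numpy as np

def predictability_r2(forecast, observed):
    """Coefficient of determination between the forecast ensemble mean and
    observed seasonal anomalies at one grid cell; cases with negative
    correlation are set to zero to avoid noise (cf. Koster et al., 2010)."""
    r = np.corrcoef(np.asarray(forecast, float),
                    np.asarray(observed, float))[0, 1]
    return r**2 if r > 0 else 0.0
```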
Figure 2 shows the spatial distribution of R2 averaged over four seasons for seasonal mean precipitation and surface air temperature at different lead times. We compare the predictability of CFSv2 with three multi-model ensembles: NMME (seven models), combined NMME and ENSEMBLES (twelve models, referred to as N + E), and the cluster ensemble (eight models selected in the previous section). For precipitation forecasts at 0-month lead, NMME increases predictability relative to CFSv2 over Australia, northern China, northeastern Brazil, and part of the North American Monsoon region (Figures 2a and 2b). However, NMME degrades predictability over Russia, western Asia, northern Europe and central Africa, and ends up with 2% (though not significant) fewer grid cells with predictability larger than 0.05 (Figures 2a and 2b). The advantage of the multi-model approach becomes more consistent for long-lead precipitation forecasts, where NMME is slightly better than CFSv2 over the western U.S., Brazil and Australia (Figures 2e and 2f). Combining NMME and ENSEMBLES (N + E) not only retains the multi-model advantage at long lead time (Figure 2g), but also increases the number of grid cells with significant predictability for the 0-month lead forecast by about 9% compared with CFSv2 (Figure 2c). Similar to Yuan et al. [2011], the improvement of N + E comes from the spatially complementary features of the seasonal forecast models developed in the U.S. and Europe. For instance, ECMWF and UKMO have higher predictability than all seven American models over southern Africa and western Australia (not shown). Figures 2d and 2h demonstrate that removing four models from N + E by using the clustering method does not degrade the predictability of precipitation for either short- or long-lead forecasts. This verifies the usefulness of the clustering method in selecting independent (effective) seasonal forecast models, while retaining predictability that is consistently beyond that of single models.
 The picture becomes clearer for the predictability of surface air temperature (Figures 2i–2p). Due to the obvious improvement of CFSv2 (upgraded radiation and land surface modules) in predicting temperature over the high latitudes of the northern hemisphere [Yuan et al., 2011], NMME has lower predictability than CFSv2 even out to 4–6 months (Figures 2i, 2j, 2m, and 2n). In contrast, N + E and the cluster ensemble outperform CFSv2 globally (Figures 2k, 2l, 2o, and 2p). A common feature of multi-model ensemble temperature forecasts at long lead time (Figures 2n–2p) is strong predictability over the western U.S. and western China, but negligible predictability over the eastern parts of these two countries. The reasons might be threefold: 1) the eastern parts of the U.S. and China are wetter than their western parts, suggesting more clouds, which affect the accuracy of radiation modeling in the eastern regions; 2) snow processes, which are important to seasonal forecasts, contribute more to predictability over the western parts of these two countries; and 3) the western parts are less densely observed, while the eastern parts, though densely observed, might be polluted by the urban heat island effect, which also results in more observational uncertainty that propagates into the predictability analysis.
5. Verification of Probabilistic Forecast Skill
 Besides assessing the predictability of ensemble means, we use the continuous ranked probability skill score (CRPSS) [Wilks, 2011] and the Relative Operating Characteristic skill score (ROCSS) [Wilks, 2011] to verify probabilistic forecasts in terms of accuracy and discrimination, respectively. CRPSS is defined as 1 − CRPS/CRPSClim, where CRPS and CRPSClim are the continuous ranked probability scores (here calculated by comparing deciles of the forecast distribution with observations) from the model and the climatological forecast, respectively. ROCSS is defined as 2A − 1, where A is the area under the ROC curve. For both scores, positive values indicate that climate model forecasts are more skillful than climatological forecasts, with a maximum of 1 for a perfect forecast. To reduce the impact of bias, the climatological mean of each season for both model and observation has been removed before calculating the skill scores.
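Both scores can be sketched as follows. Note that crps_ensemble here uses the standard empirical-ensemble CRPS estimator rather than the decile-based calculation described above, so it is an approximation of the procedure used in this study:

```python
import numpy as np

def crps_ensemble(members, obs):
    """Empirical CRPS of one ensemble forecast against a scalar observation:
    mean |x_i - y| - 0.5 * mean |x_i - x_j| (lower is better, 0 is perfect)."""
    m = np.asarray(members, float)
    return (np.mean(np.abs(m - obs))
            - 0.5 * np.mean(np.abs(m[:, None] - m[None, :])))

def crpss(crps_model, crps_clim):
    """CRPSS = 1 - CRPS/CRPS_clim; positive means skill over climatology."""
    return 1.0 - crps_model / crps_clim

def rocss(hit_rates, false_alarm_rates):
    """ROCSS = 2A - 1, with A the area under the ROC curve (trapezoid rule).
    The rate arrays should include the (0, 0) and (1, 1) endpoints."""
    h = np.asarray(hit_rates, float)
    f = np.asarray(false_alarm_rates, float)
    a = np.sum(0.5 * (h[1:] + h[:-1]) * np.diff(f))
    return 2.0 * a - 1.0
```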
Figures 3a–3d show the percentages of climate model forecasts that are more skillful than the reference forecasts (CRPSS > 0) for seasonal mean precipitation and surface air temperature anomalies over the global land area (except Antarctica) during 1982–2005. Unsurprisingly, both individual climate models and multi-model ensembles have higher skill in forecasting temperature than precipitation. Though CFSv2 is the most skillful of the individual models in NMME and ENSEMBLES, multi-model ensembles do provide additional skill. Specifically, NMME increases the percentage of skillful precipitation forecasts over CFSv2 by about 8–9%, and N + E increases it by about 11–12%. This is consistent with previous findings that more models yield greater accuracy and reliability in probabilistic forecasts, although they do not necessarily increase predictability, as discussed in the previous section. For seasonal surface air temperature forecasts, multi-model ensembles have less of an advantage over CFSv2 than for precipitation, with NMME and N + E increasing the percentage by 3–4% and 7–9%, respectively. Figures 3a–3d also show that the cluster ensemble has comparable performance to N + E. Therefore, although incorporating more climate models enhances accuracy (e.g., from NMME to N + E), there is an effective subset of models (e.g., the clustered models in this study) beyond which the skill enhancement diminishes. In addition, the verification rank histograms averaged over the global land area indicate that the characteristics of the cluster ensemble and N + E are quite similar (not shown).
 Another important attribute of forecast performance is discrimination, which reflects the ability of a forecasting system to produce different forecasts for occasions with different realized outcomes of the predictand [Wilks, 2011]. Here we examine the models' discrimination of extreme seasonal temperature anomalies; precipitation anomalies are not discussed owing to their low skill. Figure 3e shows the ROCSS as a function of forecast lead time for the 90th percentile of surface air temperature anomalies over global land areas. Though some individual models show erratic behavior, with discrimination skill at long lead times not necessarily lower than at short lead times, the multi-model ensembles do show a clear decrease in skill with lead time. Compared with CFSv2, NMME has no advantage in discriminating the 90th percentile of temperature anomalies at 0-month lead, while its advantage emerges and grows as the forecast lead increases. N + E is overall superior to NMME, especially for short-lead forecasts. Again, the cluster ensemble performs similarly to N + E, emphasizing that clustering climate models by analyzing the covariance structure of ensemble means is an effective way to select independent models while retaining ensemble forecast skill.
6. Conclusions and Discussions
 This study analyzes the similarities among different seasonal forecast models using a hierarchical clustering method, and evaluates the predictability and skill of the cluster ensemble forecasts. Consistent with previous studies on CMIP3 models [Masson and Knutti, 2011; Pennell and Reichler, 2011], strong similarities are found between models from the same institution or sharing the same atmospheric component; however, the similarity between successive versions of the same model differs between precipitation and temperature forecasts: NCEP-CFSv1 & v2 show some similarity in precipitation forecasts, while they are quite different for temperature forecasts. Such differences indicate that detailed physics packages, such as radiation and land surface schemes, need to be examined when interpreting the clustering results.
 Predictability analysis shows that the NMME multi-model ensemble is not necessarily better than a single model such as CFSv2, especially for forecasts at short lead time. However, combining NMME and ENSEMBLES (N + E) yields consistent improvement over CFSv2, which suggests that the seasonal forecast models developed in the U.S. and Europe complement each other to some extent, and that more international collaboration is necessary to enhance the predictability of climate models for seasonal forecasting. The multi-model ensemble from clustering, though excluding four models, has comparable performance to N + E in terms of predictability, the continuous ranked probability skill score, and a discrimination score for the 90th percentile of seasonal temperature anomalies. This verifies that the clustering method proposed in this study not only reduces the redundancy of climate models, but also retains ensemble forecast skill.
 This paper also has some implications for the clustering of climate models in seasonal forecasting. First, some uncertainties exist in the clustering with respect to forecast lead time and the fields selected for clustering, which makes reducing redundant seasonal forecast models more complicated. Second, all multi-model ensembles in this study, including the cluster ensemble, have negligible predictability for surface air temperature forecasts at long lead time over the eastern parts of the U.S. and China (Figures 2n–2p); in fact, such deficiency resides in all twelve individual models (not shown). Whether this common deficiency is related to cloud-radiation modeling remains an open question that needs further investigation. One possibility is that state-of-the-art climate forecast models have common shortcomings due to similar parameterizations or the universal use of deterministic truncations [Palmer, 2012]. Indeed, Weisheimer et al. find that an ensemble based on stochastic parameterization of unresolved atmospheric processes outperforms the multi-model ensemble for precipitation forecasts, and Palmer [2012] suggests that developing such stochastic approaches for the land surface and ocean may benefit temperature ensemble forecasts. We believe that a new era for ensemble forecasting will begin if stochastic parameterization can be incorporated into the multi-model ensemble framework. Third, hierarchical clustering is a method to select independent models, which yields 0/1 weights for the different models; this might not be optimal. Other calibration methods, such as conditional distributions, Bayesian approaches or canonical correlation analysis, can be used to implicitly or explicitly weight different models. However, as pointed out by Doblas-Reyes et al., long hindcast time series are needed to achieve a significant improvement in forecast quality.
In fact, we have used the theory of conditional distributions of multivariate normal distributions to merge different models through a cross-validation procedure based on the 24-year hindcast data. We find that the covariance matrices, which measure the correlation between observation and model, or the cross-correlation among different models, differ considerably from year to year. Consequently, the distribution conditioned on multiple single models, where the weights depend on the correlation between each single model and observation, is less robust than the distribution conditioned on the multi-model ensemble mean (i.e., equal weights). However, statistical calibration using conditional distributions is also less beneficial for the multi-model ensemble mean than for the single models, perhaps due to the higher reliability of the former. Furthermore, statistical calibration using hindcasts is challenging since it assumes the current climate is stationary, which is not the case because the climate is changing due to the increase in CO2 that affects temperature [Slingo and Palmer, 2011]. Perhaps dynamical downscaling of seasonal forecasts would be more physically consistent with climate change. Nevertheless, Yuan et al. show that a regional climate model (WRF) can improve forecasts of interannual variations only if the driving CGCM has the correct anomaly signal. Therefore, developing advanced global climate forecast models with greater independence in their physical modules will benefit ensemble seasonal forecasting. Also, long-period hindcasts are critical to reasonably identifying the contribution of particular models to the multi-model ensemble by addressing issues such as natural and anthropogenic climate variability. Finally, clustering is an effective method to reduce model redundancy while retaining predictability and skill.
 We would like to thank IRI for making the NMME forecast information available. We also thank the two anonymous reviewers for their helpful comments. The research presented in this paper was supported by the NOAA Climate Program Office through grants NA17RJ2612, NA12OAR4310090, and NA10OAR4310246.
 The Editor thanks the two anonymous reviewers for assisting in the evaluation of this paper.