In this paper the potential advantages and relative performances of different techniques for constructing multimodel ensemble seasonal predictions are examined. Two commonly used methods of constructing multimodel ensemble predictions are analyzed. Particular emphasis is placed on the analysis of the schemes themselves. In the first technique—simple multimodel ensemble (SME) predictions—equal weights are assigned to the ensemble mean predictions of each of the atmospheric general circulation models (AGCMs). In the second approach—optimal multimodel ensemble (OME) predictions—the weights are obtained using a multiple linear regression. A theoretical analysis of these techniques is complemented by analyses based on seasonal climate simulations for 45 January–February–March seasons over the 1950–1994 period. A comparison of seasonal simulation skill scores between SME and OME indicates that for the bias corrected data, i.e., when the seasonal anomalies of each of the AGCMs are computed with respect to its own mean state, the performance of seasonal predictions based on the simpler SME approach is comparable to that of the more complex OME approach. A major reason for this result is that the data record of historical predictions may not be long enough for a stable estimate of weights at individual geographical locations to be obtained. This problem can be reduced by extending the multiple linear regression approach to include the spatial domain. However, even with this algorithm change, the performance of OME in seasonal predictions does not improve over that using the SME approach. Results, therefore, indicate that the use of more sophisticated techniques for constructing multimodel ensembles may not be any more advantageous than the use of simpler approaches. Results also show that on average the skill scores for the predictions based on multimodel ensemble prediction techniques are only marginally better than those of the best AGCM.
However, an advantage of multimodel ensemble prediction techniques may be that they retain the best performance of each AGCM on a regional basis in the merged forecasts.
1. Introduction

Dynamical seasonal predictions (DSP) with state-of-the-art atmospheric general circulation models (AGCMs) are currently conducted at several meteorological centers around the world. Since most models being used are different, either in their formulation of numerical schemes or in their representation of physical processes, different models have different biases in their simulation of climate and its interannual variability. For this reason the skill scores of DSP are also model dependent [Pavan and Doblas-Reyes, 2000; Shukla et al., 2000]. A very practical question is thus often raised: is it possible to enhance seasonal prediction skill by combining predictions from different AGCMs? Some studies have been carried out to explore this possibility, but these have provided conflicting results.
In weather prediction, the practical usefulness of combining different forecasts into a more skillful single forecast has long been recognized. For example, by linearly combining two independent forecasts with weights determined by the Gaussian method of least squares, Thompson found that the mean square error of the constructed forecasts is less than that of the individual forecasts. Using a similar approach to optimally combine many ensemble members, van den Dool and Rukhovets [1994] demonstrated that for 6–10 day predictions the constructed forecasts showed considerable improvement, both over the individual members and over the equally weighted ensemble means. The statistical approach employed by these authors has generally been multiple linear regression (MLR).
On the other hand, for seasonal predictions various studies have reported conflicting results. Recently, also using the MLR approach, Krishnamurti et al. [1999, 2000] combined weather as well as seasonal climate forecasts from different models into an optimal multimodel ensemble (called a superensemble) forecast, and found the seasonal prediction skill of the superensemble to be significantly higher than that of the individual models. Conversely, Pavan and Doblas-Reyes [2000] combined seasonal forecasts from four different AGCMs and found minimal improvements (see Figures 5 and 6 in their paper). They state “…multimodel ensemble produces a hindcast that is better than, if not comparable to the best single model hindcast” [see Pavan and Doblas-Reyes, 2000, p. 616], indicating that only a marginal improvement was obtained in the seasonal prediction skill of multimodel ensemble predictions. What are the possible reasons for the differing conclusions regarding the potential advantages of multimodel ensemble predictions among these studies? Is it due to the different metrics of seasonal prediction skill used by these authors (e.g., root mean square error by Krishnamurti et al. [1999, 2000], and anomaly correlation by Pavan and Doblas-Reyes [2000])? Or is it because Krishnamurti et al. [1999, 2000] construct multimodel ensemble predictions by combining the individual runs from different AGCMs, whereas Pavan and Doblas-Reyes [2000] construct multimodel ensemble predictions using the ensemble mean of the runs from each AGCM?
Another debate in the construction of multimodel ensembles for seasonal predictions concerns the relative merits of using equal weights for each model versus a more sophisticated approach, such as MLR, where the weights are derived using some statistical procedure. For example, Krishnamurti et al. compared the two approaches and concluded that the skill of multimodel predictions in which the weights for the different models were computed using the MLR approach was superior to that of predictions in which equal weights were assigned to each AGCM. For seasonal predictions, where an ensemble of realizations for each model is usually available, the generality of these conclusions remains in need of further testing.
Toward resolving some of the above-mentioned issues related to multimodel ensemble predictions, the focus of this paper is twofold: (1) to analyze possible reasons for the discrepancy between the conclusions about the potential advantages of multimodel ensemble predictions as reported by Krishnamurti et al. [1999, 2000] and by Pavan and Doblas-Reyes [2000], and (2) to further clarify the role of different multimodel prediction techniques in the skill of seasonal forecasts. Toward this end, we have conducted a study using techniques similar to those employed by the authors cited above, but with more emphasis on the details of the techniques themselves. In particular, two methods of constructing multimodel ensemble predictions are analyzed: in the first technique, equal weights are assigned to each AGCM; in the second technique, the weights are obtained using an MLR approach. While the former technique is very straightforward, it is often argued and expected that use of the MLR approach will lead to further skill improvements. On the other hand, it is also possible that due to the shortness of the historical record from which the weights are derived, the weights based on the MLR approach may not be stable, thus forfeiting the potential advantages this technique might offer. An additional problem with the MLR approach for seasonal predictions is that the predicted anomalies from different AGCMs (which tend to be atmospheric responses to anomalous SSTs) have a large degree of collinearity, which can also lead to unstable weights.
Our analysis of the impact of these two multimodel ensemble techniques is based on seasonal predictions over the Pacific North American (PNA) region during boreal winter. Ensembles of climate simulations from four AGCMs are used. The PNA region was chosen because during the cold season the boundary-forced signal related to interannual variations in the tropical Pacific SSTs is larger there than in other extratropical regions. Descriptions of the AGCMs and the respective simulations are given in section 2. The characteristics of the multimodel prediction techniques are described in section 3. The results and analyses of the performance of the two multimodel ensemble prediction techniques are presented in section 4. A discussion and some concluding remarks are given in section 5.
2. AGCMs and Data
The AGCM data used to construct our multimodel ensemble predictions are the January–February–March (JFM) seasonal mean 500-mb heights from ensembles of climate simulations performed with four different AGCMs. These AGCMs are MRF9 of the National Centers for Environmental Prediction (NCEP) [Kumar et al., 1996], the Scripps/MPI ECHAM3 [Barnett et al., 1994], CCM3 of the National Center for Atmospheric Research (NCAR) [Kiehl et al., 1998], and a version of the Geophysical Fluid Dynamics Laboratory model (hereafter referred to as GFDLR30) [Lau, 1997]. All AGCM simulations are forced by observed SSTs for the 1950–94 period. The ensemble size is 12 for the MRF9, CCM3, and GFDLR30 models, and 10 for the ECHAM3 model.
Within the ensemble for each AGCM, the different individual simulations start from different atmospheric initial states and, with the exception of GFDLR30, are forced with identical SSTs throughout the integration period. For the GFDLR30, the 12 simulations can be grouped into three different categories, with four simulations in each category. Simulations in category 1 are forced with the observed evolution of global SSTs. Simulations in category 2 are forced with the observed SSTs in the tropical Pacific Ocean alone, while the SSTs outside of this region evolve through their climatological seasonal cycle. In category 3, simulations are also forced with the observed SSTs in the tropical Pacific Ocean, but SSTs outside the tropical Pacific are predicted using an oceanic mixed layer model [Lau, 1997]. Since changes in the extratropical mean variability are most sensitive to SST variability in the tropical Pacific [Blade, 1999], and the simulations in all three categories experience the same tropical SST forcing, all 12 AGCM simulations are treated equally in this study, thus neglecting possible differences due to different SST forcing outside the tropical Pacific. A more detailed description of these four AGCMs and the respective climate simulations is given by Kumar et al. The verification data for the JFM seasonal mean 500-mb heights are obtained from the NCEP/NCAR reanalysis. For the purpose of the analysis, both the AGCM and the observed data were interpolated to a 5° × 5° regular grid.
3. Multimodel Ensemble Prediction Techniques and Analysis Procedures
 In this section, the techniques for constructing multimodel predictions are described. Two different methodologies for combining multimodel predictions are used, differing in their specification of weights. In the first technique, ensembles from the different AGCMs are given equal weights. In the second technique, ensembles from different AGCMs are weighted according to the weights obtained using multiple linear regression. In this paper, the predictions based on the former method are referred to as “simple multimodel ensemble (SME),” and the predictions based on the latter technique as “optimal multimodel ensemble (OME).” These are described next.
3.1. Simple Multimodel Ensemble (SME) Predictions
In this method ensemble means from the different AGCMs are weighted equally to construct multimodel ensemble predictions. The concept of this method is borrowed from ensemble predictions based on a single model, where the different members within the ensemble are AGCM integrations from different initial conditions, and are weighted equally to construct the ensemble mean as the prediction. Denoting f_i(s, n) as the ensemble mean for the ith AGCM at spatial location s and for a particular JFM season n, and c_i(s) as some climatology relative to which the anomalies of the JFM seasonal mean for the ith AGCM are computed, the predicted anomaly p′(s, n) is given by

$$ p'(s,n) = \frac{1}{I}\sum_{i=1}^{I}\left[f_i(s,n) - c_i(s)\right] \quad (1) $$

where n = 1…N (= 45) in our case, and I = 4 is the number of AGCMs in the multimodel ensemble prediction.
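To make the SME construction concrete, the equal-weight combination in (1) can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' code; the array names, shapes, and the toy random data are our own assumptions.

```python
import numpy as np

def sme_anomaly(ensemble_means, climatologies):
    """Simple multimodel ensemble: equal-weight average of each AGCM's
    ensemble-mean anomaly, f_i(s, n) - c_i(s), following equation (1).

    ensemble_means: (I, N, S) -- I models, N seasons, S grid points
    climatologies:  (I, S)    -- climatology c_i(s) for each model
    """
    anomalies = ensemble_means - climatologies[:, None, :]  # f_i(s,n) - c_i(s)
    return anomalies.mean(axis=0)                           # equal weights 1/I

# Toy example: I = 4 models, N = 45 seasons, S = 6 grid points
rng = np.random.default_rng(0)
f = rng.normal(size=(4, 45, 6))
c = f.mean(axis=1)   # bias-corrected choice: c_i(s) = model's own climatology
p = sme_anomaly(f, c)
print(p.shape)       # (45, 6)
```

With the bias-corrected choice of c_i(s), the time mean of the SME anomaly is zero at every grid point, consistent with the zero expected prediction error discussed below.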
The corresponding observed anomalies are given by

$$ o'(s,n) = o(s,n) - \overline{o}(s) \quad (2) $$
where ō(s) is the observed climatology for the JFM seasonal mean. It is easy to show that the expected value of the prediction error

$$ e(s) = \overline{p'(s,n) - o'(s,n)} \quad (3) $$

is given by

$$ e(s) = \frac{1}{I}\sum_{i=1}^{I}\left[\overline{f}_i(s) - c_i(s)\right] $$

since the observed anomalies average to zero over the record. In the above expressions overbars stand for the mean over the N = 45 cases.
There are two obvious choices for the specification of the climatology c_i(s) relative to which the anomalies for the ith AGCM are computed. In the first case it could be specified as the climatology for the JFM seasonal mean from the AGCM simulations themselves. In this case c_i(s) = f̄_i(s), and the expected error of the multimodel prediction is zero. Since the anomalies are computed with respect to the model's own climatology, this procedure is often referred to as an a posteriori removal of systematic errors. Hereafter, such predictions are referred to as “bias corrected” predictions.
The other choice for c_i(s) is such that the anomalies for the different AGCMs are computed relative to the observed climatology for the JFM seasonal mean, and hence c_i(s) = ō(s). In this case the expected error of the multimodel prediction is no longer zero and is given by e(s) = (1/I) Σ_i [f̄_i(s) − ō(s)], which is the mean bias of all the AGCMs. Hereafter, these predictions are referred to as “biased” predictions. Although in many seasonal prediction efforts the former choice is preferred for the specification of c_i(s), we will compare the results of multimodel ensemble predictions using both frameworks.
3.2. Optimal Multimodel Ensemble (OME) Predictions

In this technique for constructing multimodel ensemble predictions, the weights are determined using the MLR approach. The multimodel ensemble predicted anomalies are now given by

$$ p'(s,n) = \alpha_0(s) + \sum_{i=1}^{I}\alpha_i(s)\,f'_i(s,n), \qquad f'_i(s,n) = f_i(s,n) - c_i(s) \quad (4) $$
where the αi(s)'s are the weighting coefficients, which depend on the spatial location s. The weighting coefficients are determined by minimizing the mean squared error R between the predicted and the observed seasonal mean anomalies, which over the historical record of N JFM hindcasts is defined by

$$ R(s) = \frac{1}{N}\sum_{n=1}^{N}\left[p'(s,n) - o'(s,n)\right]^2 \quad (5) $$
The equations determining the weighting coefficients at each spatial location are obtained by minimizing the mean squared error R [e.g., see van den Dool and Rukhovets, 1994], and are

$$ \alpha_0(s) = -\sum_{i=1}^{I}\alpha_i(s)\,\overline{f'_i}(s) \quad (6) $$

$$ \sum_{j=1}^{I}\overline{\left(f'_i - \overline{f'_i}\right)\left(f'_j - \overline{f'_j}\right)}\;\alpha_j(s) = \overline{\left(f'_i - \overline{f'_i}\right)o'}, \qquad i = 1, \ldots, I \quad (7) $$

where (6) uses the fact that the observed anomalies o′ average to zero.
Substituting (6) in (4), the predicted anomalies are given by

$$ p'(s,n) = \sum_{i=1}^{I}\alpha_i(s)\left[f_i(s,n) - \overline{f}_i(s)\right] \quad (8) $$
The weights αi(s)'s in the above equation are obtained by solving (7).
From (7) it can also be shown that for both choices of c_i(s) the resulting weights do not change. Given that, from (8) it is easy to see that, unlike for SME, multimodel ensemble predictions based on OME do not depend on the choice of c_i(s) relative to which the AGCM anomalies are computed. In other words, whether the anomalies are computed relative to the observed JFM seasonal means or relative to each AGCM's own seasonal mean, the predicted anomalies in (8) are always the weighted average of the individual AGCM anomalies computed with respect to their mean states. For this reason, the expected value of the prediction error as defined in (3) is always zero.
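The invariance of the OME weights to the choice of c_i(s) follows from the centering of the predictors in (6)–(8), and is easy to check numerically. The following is a hypothetical NumPy sketch of the grid-point regression; the function name `ome1_weights` and the toy data are illustrative only, not the authors' implementation.

```python
import numpy as np

def ome1_weights(model_anoms, obs_anom):
    """Optimal weights at a single grid point by least squares.

    model_anoms: (N, I) -- anomalies f'_i(s, n) for N seasons, I models
    obs_anom:    (N,)   -- observed anomalies o'(s, n)
    Centering the predictors in time makes the solution independent of
    the climatology c_i(s) used to form the model anomalies.
    """
    X = model_anoms - model_anoms.mean(axis=0)   # f'_i - mean(f'_i)
    alpha, *_ = np.linalg.lstsq(X, obs_anom, rcond=None)
    return alpha

# Toy example at one grid point: 4 correlated "models" tracking a common truth
rng = np.random.default_rng(1)
truth = rng.normal(size=45)
models = truth[:, None] + 0.5 * rng.normal(size=(45, 4))
w = ome1_weights(models, truth)
prediction = (models - models.mean(axis=0)) @ w
```

Shifting each model's anomalies by a constant (i.e., swapping one climatology choice for another) leaves the centered predictors, and hence the weights, unchanged.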
A shortcoming of OME is that due to insufficient historical data, the weights may not be sufficiently stable. In other words, with the inclusion or exclusion of a few JFM hindcasts, significant variations in the weights may result. This could be avoided if the MLR is extended to include the spatial domain as well, so that a single weight, independent of spatial location, is obtained for each model. For spatially independent weights, the multimodel ensemble prediction is defined as

$$ p'(s,n) = \alpha_0 + \sum_{i=1}^{I}\alpha_i\,f'_i(s,n) \quad (9) $$
The only difference between (9) and (4) is that the weights are no longer a function of spatial location. Following van den Dool and Rukhovets [1994], the weights are now determined by minimizing the residual

$$ R = \overline{\overline{\left[p'(s,n) - o'(s,n)\right]^2}} \quad (10) $$
defined over both time and space domains. By combining time and space, one effectively increases the sample size, thus arriving at better estimates for the weights. Equations for computing weights are the same as in (6) and (7), except averages denoted as overbars are now performed over both space and time.
Upon elimination of α0 the prediction equation (9) becomes

$$ p'(s,n) = \sum_{i=1}^{I}\alpha_i\left[f'_i(s,n) - \overline{\overline{f'_i}}\right] \quad (11) $$
where double overbars denote averages over both space and time. For bias-corrected predictions, i.e., for c_i(s) = f̄_i(s), the predictions are given by

$$ p'(s,n) = \sum_{i=1}^{I}\alpha_i\left[f_i(s,n) - \overline{f}_i(s)\right] \quad (12) $$

since in this case the space-time average of each model's anomalies vanishes.
On the other hand, for biased predictions, i.e., for c_i(s) = ō(s), the predictions are given by

$$ p'(s,n) = \sum_{i=1}^{I}\alpha_i\left[f_i(s,n) - \overline{o}(s) - b_i\right] \quad (13) $$

where b_i, the space-time average of f_i(s, n) − ō(s), is the domain-averaged mean bias of the ith model.
It is apparent that, unlike for the case of space-dependent weights, the multimodel ensemble predictions now differ depending on whether bias-corrected or biased AGCM anomalies are used. This distinction also extends to the computation of the weights themselves. Another difference between bias-corrected and biased predictions is that while for the former the expected value of the prediction error as defined by (3) is zero both locally and on a domain average basis, for the latter it is only true for domain averages (while a nonzero error exists on a local basis). Hereafter, the results based on space-dependent weights are referred to as OME1, and the results based on space-independent weights are referred to as OME2.
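The OME2 variant differs from OME1 only in pooling space and time into a single sample axis before the regression, which is what increases the effective sample size. A minimal sketch, again with illustrative names and toy data rather than the authors' code:

```python
import numpy as np

def ome2_weights(model_anoms, obs_anom):
    """Single, spatially independent weight per model.

    model_anoms: (I, N, S) -- anomalies for I models, N seasons, S grid points
    obs_anom:    (N, S)    -- observed anomalies
    """
    I = model_anoms.shape[0]
    X = model_anoms.reshape(I, -1).T   # (N*S, I): pool time and space
    X = X - X.mean(axis=0)             # remove the double-overbar mean
    y = obs_anom.ravel()
    alpha, *_ = np.linalg.lstsq(X, y, rcond=None)
    return alpha

# Toy example: 4 models, 45 seasons, 6 grid points -> one weight per model
rng = np.random.default_rng(2)
obs = rng.normal(size=(45, 6))
mods = obs[None] + 0.7 * rng.normal(size=(4, 45, 6))
w = ome2_weights(mods, obs)
print(w.shape)   # one weight per model: (4,)
```

The design choice is a bias-variance trade-off: a single weight per model cannot adapt to regional differences in model skill, but it is estimated from N × S samples rather than N, and is therefore far more stable.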
3.3. Measures of Seasonal Prediction Skill

Two measures are used to compare seasonal prediction skill. The first is the temporal root mean square (RMS) error,

$$ \mathrm{RMS}(s) = \left\{\frac{1}{N}\sum_{n=1}^{N}\left[F'(s,n) - o'(s,n)\right]^2\right\}^{1/2} \quad (14) $$

where F′(s, n) is the predicted anomaly from either individual AGCMs or multimodel ensembles. Since multimodel ensemble predictions using OME are based on minimization of mean squared error, analysis of RMS error is a natural choice with which to assess the impact of multimodel ensembling. Our second measure of comparing seasonal prediction skill is the temporal anomaly correlation (AC), defined as

$$ \mathrm{AC}(s) = \frac{\sum_{n=1}^{N} F'(s,n)\,o'(s,n)}{\left[\sum_{n=1}^{N} F'(s,n)^2 \sum_{n=1}^{N} o'(s,n)^2\right]^{1/2}} \quad (15) $$
Although the possible impact of the different multimodel ensemble prediction techniques on the temporal anomaly correlation is not obvious, it is expected that if the mean bias of the prediction goes down, this will also have a positive impact on the anomaly correlation (see Appendix A). One thing to note is that in the definition of the temporal anomaly correlation, the average of the predicted anomaly over the N cases need not be zero; this is the case for predicted anomalies defined with respect to the observed mean state.
In the above expressions, the summation is over the number of prediction-observation pairs, which in our case is the 45 JFM winters of the 1950–1994 period. Further, since our focus is on the seasonal prediction skill over the PNA region, the spatial domain for our analysis corresponds to the region extending from 150°E to 60°W and from 20°N to 70°N, and all area averages are defined as latitudinally cosine-weighted means.
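The two verification measures and the cosine-weighted area average might be implemented as follows. This is a sketch under our own assumptions about array layout; note that, consistent with the definition above, the AC is computed without removing the time means.

```python
import numpy as np

def rms_error(pred, obs):
    """Temporal RMS error at each grid point; pred, obs: (N, S)."""
    return np.sqrt(((pred - obs) ** 2).mean(axis=0))

def anom_corr(pred, obs):
    """Temporal anomaly correlation at each grid point, with the time
    means deliberately not removed (see the definition in the text)."""
    num = (pred * obs).sum(axis=0)
    den = np.sqrt((pred ** 2).sum(axis=0) * (obs ** 2).sum(axis=0))
    return num / den

def area_average(field, lats_deg):
    """Cosine-of-latitude weighted mean; field: (S,), lats_deg: (S,)."""
    w = np.cos(np.deg2rad(lats_deg))
    return (field * w).sum() / w.sum()

# Sanity check: a perfect prediction has zero RMS error and AC of 1
obs = np.random.default_rng(3).normal(size=(45, 5))
lats = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
print(area_average(rms_error(obs, obs), lats))   # 0.0
print(area_average(anom_corr(obs, obs), lats))   # ≈ 1.0
```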
3.4. Cross Validation Design
 In all the comparative skill evaluations, a cross validation approach is used. Within the cross validation approach, the weights are calculated using the data of all years except for the target year. Using this approach, usually called one-year-out cross validation, verification scores for all 45 JFMs are calculated.
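The one-year-out cross validation can be sketched as below; `fit_weights` is a hypothetical stand-in for whichever weight-estimation scheme (OME1 at a grid point, or OME2) is being verified, and the data are illustrative only.

```python
import numpy as np

def one_year_out(models, obs, fit_weights):
    """Leave-one-out cross validation over N target years.

    models: (N, I) predictor anomalies; obs: (N,) verification.
    fit_weights(X, y) -> weights. Returns the N cross-validated predictions,
    each made with weights trained on the other N - 1 years.
    """
    N = len(obs)
    preds = np.empty(N)
    for n in range(N):
        keep = np.arange(N) != n           # drop the target year
        w = fit_weights(models[keep], obs[keep])
        preds[n] = models[n] @ w           # predict the withheld year
    return preds

# Toy example with an ordinary least-squares fit for the weights
rng = np.random.default_rng(4)
y = rng.normal(size=45)
X = y[:, None] + 0.5 * rng.normal(size=(45, 4))
ols = lambda A, b: np.linalg.lstsq(A, b, rcond=None)[0]
p = one_year_out(X, y, ols)
```

Because each of the 45 fits uses a slightly different 44-year training set, the procedure also yields the 45 sets of weights whose stability is examined in section 4.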
4. Results

4.1. Performance of Individual AGCMs
In this subsection the individual model biases and their simulation skill for the JFM seasonal mean 500-mb heights are first discussed. All computations are based on ensemble means. In Figure 1 the model bias, defined as the difference between the AGCM-simulated and observed seasonal mean JFM heights averaged over the 1950–1994 period, is shown. All four AGCMs have considerable departures from the observed mean state. The area averages of the root mean square of the bias are further compared in Table 1, where it is found that CCM3 has the smallest overall error, and GFDLR30 the largest. One purpose of comparing the root mean square of the bias is to investigate the possible impact of these errors on the seasonal predictions, and to assess the extent to which bias correction as a postprocessing measure can compensate for them.
Table 1. Domain-Averaged Root Mean Square of Model Bias (m), RMS Error (m), and Temporal Anomaly Correlation (AC) of the Different AGCM Simulations Over the PNA Region, Based on 45 JFM Seasons During the 1950–1994 Period. RMS error and temporal anomaly correlations are computed for both biased and bias-corrected AGCM data.
In Table 1, the area averaged RMS error and temporal AC for the biased and bias-corrected AGCM predictions are also shown. It is apparent that the correction for the model's systematic error during the postprocessing stage is very important and effective in decreasing the RMS error and increasing the AC for seasonal mean predictions [Anderson et al., 1999]. While the reason for the reduction in RMS error between the biased and bias-corrected forecasts is obvious, an increase in the average AC score for bias-corrected predictions also occurs, due to the reduction in the root mean square of the amplitude of the AGCM's anomalies themselves (see Appendix A).
 In Table 1 it is also interesting to note that for the bias-corrected AGCM predictions the average temporal AC for the seasonal predictions tends to be higher for AGCMs with less systematic error. Thus CCM3, ECHAM3, MRF9, and GFDLR30, ordered according to increasing area averaged root mean square bias, have decreasing AC for the bias-corrected predictions. Although a posteriori correction of the model's systematic errors does result in improvements in the average AC score, this procedure cannot compensate for the possibility that errors in the mean state may have altered the characteristics of tropical-extratropical interactions, causing unrecoverable errors in the seasonal prediction skill. This provides a strong justification for continued model improvement for seasonal prediction efforts.
For the bias corrected predictions the spatial patterns of the RMS error and the temporal AC skill for the four AGCMs are shown in Figures 2 and 3, respectively. For each AGCM the spatial pattern of RMS error is quite similar, with the centers of largest RMS error coinciding with the locations having the largest interannual variability in the observations (see Figure 2a of Kumar et al.). This is expected since a large fraction of interannual variability is due to atmospheric internal variability, which cannot be simulated from the specification of SSTs alone. This is also to be expected for any prediction system where the amplitude of the predicted anomalies is much smaller than that of the observed anomalies, for example, ensemble mean predictions.
 A common feature of temporal AC score (Figure 3) is a tendency for peak scores in three geographical regions: one over the North Pacific near Gulf of Alaska, another over Canada, and a third over the southeast United States. As expected, these centers are colocated with the three centers of action of the PNA tropical-extratropical teleconnection pattern [Horel and Wallace, 1981; Peng et al., 2000]. Other common features are a lack of skill over the extreme western United States, and widespread high skill in the subtropical latitudes south of 25°N. Different AGCMs also have slightly different locations where the AC skill is maximum, and it is expected (or at least hoped) that multimodel ensemble predictions would be able to extract the best features from each of the models, and combine them into a single superior prediction.
4.2. Comparison of Performance of Multimodel Ensembles
For the analysis of multimodel ensemble predictions we start with a discussion of bias-corrected predictions. Recall that three different multimodel ensemble prediction techniques are used: SME, OME1, and OME2. Spatial maps of RMS error for predictions based on these three techniques are shown in Figure 4. The RMS errors for all three techniques are very similar in pattern, but with slightly different amplitudes. On an area averaged basis, OME2 is the best, while OME1 is the worst (see Table 2); however, the differences are only minor. The spatial pattern of RMS error is also very similar across the different multimodel ensemble techniques, once again indicating that it is a property closely related to the observed interannual variability of the JFM seasonal means (see the discussion above on the similarity of RMS error between the different AGCMs).
Table 2. Area Average of Simulation Skills (RMS Error in Meters, and Temporal AC) Based on Different Multimodel Ensemble Prediction Techniques. Area averages are computed over the PNA region, and the scores are based on 45 JFMs during the 1950–1994 period. SME refers to the simple multimodel ensemble approach, where equal weights are assigned to each of the models. OME1 refers to an optimal multimodel ensemble approach, where weights at each grid point are computed using the multiple linear regression approach. OME2 is similar to OME1, except that the weights do not vary spatially.
The spatial maps of temporal anomaly correlation for the three multimodel ensemble prediction techniques are compared in Figure 5, and these also look quite similar. A comparison of the area averaged temporal AC scores for bias-corrected predictions (Table 2) indicates that while the difference between SME and OME2 is minor (0.46 compared to 0.48), OME1 is the worst. Comparing with the spatial patterns for the individual models, it is also clear that the temporal anomaly correlations for the multimodel ensemble techniques are clearly better than those for MRF9 and GFDLR30, but are comparable to (or marginally better than) those for CCM3 and ECHAM3. This is further evident by comparing the area averaged temporal AC scores for the three different techniques shown in Table 2 with the area averaged AC scores for the bias-corrected AGCM predictions in Table 1. While these scores for CCM3, ECHAM3, MRF9, and GFDLR30 are 0.46, 0.41, 0.32, and 0.27, respectively, for SME, OME1, and OME2 the area averaged AC scores are 0.46, 0.40, and 0.48.
 A comparison of spatial maps of temporal AC scores for multimodel ensemble predictions with the corresponding spatial maps for the individual models indicates that the predictions using the multimodel approach have retained the best features from the individual models. For example, over Canada temporal AC scores for all the techniques are closer to the AC scores for the ECHAM3 model, which over that region has the best score among the four AGCMs. Similarly, over the northeastern part of Asia, the AC scores for the multimodel techniques are similar to the scores for the best AGCM over that region, i.e., CCM3.
Until now, only skill scores based on bias-corrected multimodel ensemble predictions have been discussed. For area averages these scores are next compared with the corresponding scores for the biased predictions, also shown in Table 2. Recall that for OME1, the use of different climatologies to compute the respective AGCM's anomalies does not alter the multimodel ensemble predictions (see equation 8), and hence the results based on biased or bias-corrected data are the same. For OME2 the multimodel ensemble prediction anomalies do depend on the choice of the mean state relative to which the AGCM's anomalies are computed (see equation 11); however, a comparison of the area averaged scores for the biased and bias-corrected data indicates that this impact is small. Also, as was discussed earlier, while for OME2 the bias-corrected predictions have no local or domain averaged mean prediction error, the predictions based on biased data do have a mean prediction error on a local basis. Thus, the area averaged RMS error for biased predictions is slightly larger than the RMS error for the bias-corrected predictions (32 versus 30). For the area averaged temporal AC scores, the AC for the latter is also slightly larger than the area averaged AC for the biased predictions (0.48 versus 0.45).
 The largest difference between biased and bias-corrected multimodel ensemble predictions is seen in the case of the SME technique where the area averaged RMS error for the bias corrected predictions is much smaller than for the biased predictions (31 versus 43), the reason being simply that for the RMS error the model bias is a positive definite contribution to the error. A corresponding improvement in the area averaged temporal AC score also occurs (see Appendix A).
The analysis up to this point indicates that the answer to the question “which method of constructing multimodel ensemble predictions is better?” depends largely on whether or not bias corrected data are used to construct the predicted anomalies. For the biased data, predictions based on the OME techniques are definitely superior to SME. For the bias corrected data, however, the distinction is less obvious. In this case, results based on the much simpler SME technique are very similar to results based on the more complicated OME techniques. Since, for seasonal predictions, the favored method for constructing model anomalies is an a posteriori correction for the AGCM's systematic errors, the use of the simpler SME technique may not be particularly disadvantageous. An additional point to note is that on an area-averaged basis, the skill scores analyzed here are only marginally better than those of the best AGCM in the ensemble. On a regional basis, however, multimodel techniques are able to retain the best features from each of the different AGCMs (for example, for the temporal AC scores).
Against the backdrop of the simplicity of SME, and the result that for bias corrected multimodel ensemble predictions the RMS error and temporal AC scores for SME are comparable to the scores based on the OME techniques, a feature which might further discourage the use of the OME techniques is the instability of the weights from which the multimodel ensemble predictions are constructed. To demonstrate this, note that since the weights are computed using the one-year-out cross-validation approach, for the 45 years of JFM hindcasts, 45 sets of spatial maps of weights for each model are available for OME1. The same is true for OME2. From this data set the mean and the variability of the weights are assessed.
In Figure 6, based on the 45 different estimates of the weights, the spatial maps of their mean and standard deviation using OME1 for each AGCM are shown. In general the weights are largest for CCM3 and ECHAM3, the AGCMs for which the area averaged mean square bias is also the smallest. However, for these AGCMs, over large regions of the North Pacific and North America, the standard deviations of the weights are in excess of 20% of their mean values. The corresponding numbers for OME2 are shown in Table 3. For the models with the largest weights, i.e., CCM3 and ECHAM3, the standard deviation of the weights is only about 5% of their respective mean values. This implies that owing to the increase in the sample size, the weights using the OME2 technique are indeed more stable.
Table 3. Mean and Standard Deviation of Weights Obtained Using OME2, Based on 45 Different Estimates of Weights Obtained Using One-Year-Out Cross Validation
An additional factor which can lead to unstable weights for MLR techniques is collinearity among the predictors. For the case of seasonal predictions this collinearity arises because, on large spatial scales, the atmospheric response to anomalous SSTs in different AGCMs tends to be similar [e.g., see Hoerling and Kumar, 2002]. To minimize undue effects of collinearity among the predictors, a ridging technique [van den Dool and Rukhovets, 1994] was attempted. This, however, did not lead to a reduction in the variability of the weights for OME1 shown in Figure 6, and was not pursued any further.
To show that the limited sample size from which the weights are obtained does degrade the performance of multimodel ensemble predictions based on OME1, and that the reduction in the scores for OME1 in Table 2 relative to the scores for SME and OME2 is indeed due to this, a comparison of scores using a longer data set is made. This example also illustrates a possible cause of the difference between the conclusions about the potential advantages of multimodel ensemble predictions as reported by Krishnamurti et al. [1999, 2000] and by Pavan and Doblas-Reyes [2000].
 Until now we have focused our analysis on the ensemble mean of the AGCM simulations as the predictions. This restricts the length of the data to 45 JFMs. Instead of using ensemble means, we now use the respective AGCM's individual realizations as the forecasts. Treating 10 realizations for each AGCM as individual forecasts, a time series of 450 hindcasts for JFM seasonal mean predictions is constructed. The corresponding verification time series is simply the 45 observed JFMs repeated 10 times. For bias corrected data all three procedures for constructing multimodel ensemble predictions are employed and the verification scores compared. These are shown in Figure 7 as grey bars.
 A comparison of the RMS error between SME and OME1 in Figure 7 indicates that with weights obtained using the longer data set, the RMS error for OME1 is now slightly better than that for SME. A similar result is seen for the temporal AC scores. This is in contrast to a comparison of the same scores between SME and OME1 based on the ensemble means of the AGCM predictions, which are shown in Figure 7 as dashed bars. These results indicate that the limited data from which the weights of OME1 were obtained might indeed have resulted in poorer performance compared to the SME technique.
 In Figure 7 we can also compare the impact of ensemble averaging on the scores for the different AGCMs. For the individual models, the area averaged RMS error with the ensemble mean as the prediction is much smaller than the RMS error based on single AGCM realizations. The fundamental reason for this is the damping of the predicted anomalies due to ensemble averaging. The AC scores for the ensemble mean predictions of the individual AGCMs are also higher than the AC scores based on individual AGCM realizations, and these results are consistent with the analysis of Kumar and Hoerling, where the impact of ensemble size on the expected value of the AC score was analyzed.
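The effect of ensemble size on the expected AC can be illustrated with a toy signal-plus-noise Monte Carlo. The signal and noise amplitudes below are arbitrary assumptions chosen only to make the effect visible, not values fitted to any AGCM:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model of seasonal prediction: every realization is the SST-forced
# signal plus independent internal atmospheric noise.
n_years, noise_amp = 45, 1.5

def mean_ac(ens_size, n_trials=2000):
    """Average temporal AC of an ens_size-member ensemble-mean forecast."""
    acs = []
    for _ in range(n_trials):
        signal = rng.standard_normal(n_years)
        obs = signal + noise_amp * rng.standard_normal(n_years)
        members = signal + noise_amp * rng.standard_normal((ens_size, n_years))
        fcst = members.mean(axis=0)   # ensemble averaging damps the noise
        acs.append(np.corrcoef(fcst, obs)[0, 1])
    return float(np.mean(acs))

ac1, ac10 = mean_ac(1), mean_ac(10)
print(f"AC, single member: {ac1:.2f};  AC, 10-member mean: {ac10:.2f}")
```

Averaging over members leaves the predictable signal intact while suppressing the unpredictable noise, so the expected AC rises with ensemble size, consistent with the comparison in Figure 7.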
 Finally, in Figure 7 we also compare the impact of multimodel ensemble predictions when the input for the predictions is either the ensemble mean of the AGCM simulations or the individual realizations from the different AGCMs. For multimodel predictions based on single AGCM realizations, there is a large reduction in the RMS error compared to the RMS error for the respective AGCMs. This decrease is once again due to the damping of predicted anomalies that is inherent in the multimodel ensemble prediction techniques. The same conclusion holds for the two OME techniques.
 The decrease in the RMS error for the multimodel ensemble predictions starting from the ensemble means of the AGCM simulations, relative to the scores for the corresponding ensemble means themselves, is on the other hand much smaller. This is because, for predictions starting from the ensemble mean of the AGCM simulations, the predicted anomalies have already been damped, and the further ensemble averaging inherent in the multimodel ensemble prediction techniques leads to a much smaller reduction in the amplitude of the predicted anomalies. (It is well known that for independent samples the variance of an ensemble mean decreases as 1/n of the variance of the individual members. Consequently, the largest relative impact of ensemble averaging on the variance of the ensemble mean occurs for small values of n.)
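The 1/n damping in the parenthetical remark is easy to verify numerically; this sketch uses synthetic unit-variance noise, so the member variance is 1 by construction:

```python
import numpy as np

rng = np.random.default_rng(3)

# 20 independent unit-variance members, many Monte Carlo samples.
members = rng.standard_normal((100_000, 20))

# Variance of an n-member mean of independent noise is 1/n of the member
# variance, so the damping is steepest at small n: going from n=1 to n=2
# halves the variance, while n=10 to n=20 removes only a further 0.05.
variances = {n: members[:, :n].mean(axis=1).var() for n in (1, 2, 5, 10, 20)}
for n, v in variances.items():
    print(f"n={n:2d}  variance of ensemble mean ~ {v:.3f}  (1/n = {1/n:.3f})")
```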
 An additional point to note is that, relative to the RMS error, the influence of the multimodel ensemble on the AC is smaller, irrespective of whether one constructs the multimodel ensembles from individual AGCM realizations or from the ensemble means of the AGCM simulations. This implies that the relative impact of the multimodel ensemble on seasonal prediction skill scores can also depend on the choice of skill metric.
 These results may explain the possible discrepancy between the recent conclusions about the potential advantages of multimodel ensemble predictions reported by Krishnamurti et al. [1999, 2000] and by Pavan and Doblas-Reyes [2000]. Using the RMS error as the skill metric and constructing multimodel ensemble predictions from a single realization of each AGCM, Krishnamurti et al. [1999, 2000] concluded that the multimodel ensemble prediction technique leads to large improvements in seasonal prediction skill. On the other hand, using the AC metric and constructing multimodel ensemble predictions starting from an ensemble of realizations for each AGCM, Pavan and Doblas-Reyes [2000] reported only marginal improvements. We should point out that the above exercise of constructing multimodel ensembles starting from single realizations of an AGCM is for the purpose of illustration alone; for making actual predictions, the advantage of using the ensemble means of the AGCM simulations is much larger.
 In this paper the relative merits of two different techniques for constructing multimodel ensemble predictions of seasonal mean atmospheric conditions have been examined. The focus and the motivation for the study were twofold. Our first aim was to assess the relative merits of two techniques for constructing multimodel ensemble predictions: an equal weighting scheme (SME) and an MLR-based weighting scheme (OME). Although the former is a much simpler approach, it was found that for bias-corrected data the performance of multimodel ensemble predictions based on SME is similar to the scores based on the more complex MLR approaches. For biased data, the performance of the MLR techniques was much better than the results based on SME; however, since removal of the model's systematic errors is common practice in seasonal prediction, this advantage in performance is only academic.
 A major disadvantage of multimodel ensemble prediction techniques based on the MLR approach is that for the current suite of dynamical seasonal predictions, the historical database of AGCM predictions may not be large enough to optimally estimate the space-dependent weights. This may result in degraded performance of MLR techniques, especially when the weighting is space dependent. To a certain extent this problem can be remedied by extending the MLR to include the spatial domain; however, a comparison of the performance of SME with the space-independent version of the MLR, i.e., OME2, also indicated that the results based on the former were on par with the latter, providing little justification for the use of OME2 over the simpler SME. From the analysis of AGCM simulations it was also found that conclusions about the influence of multimodel ensembles may be metric dependent.
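The contrast between the space-dependent and space-independent regressions can be sketched as follows. The grid size, number of models, and the true weights of 0.5 are synthetic assumptions, used only to show that pooling the spatial domain stabilizes the weight estimates:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: 45 years at 200 grid points for 3 AGCM ensemble means,
# with observations generated from equal true weights of 0.5 plus noise.
n_years, n_points, n_models = 45, 200, 3
X = rng.standard_normal((n_years, n_points, n_models))   # model anomalies
y = 0.5 * X.sum(axis=-1) + rng.standard_normal((n_years, n_points))

def space_dependent_weights(X, y):
    """OME1-style MLR: a separate weight vector at every grid point
    (only 45 samples per fit, hence noisy estimates)."""
    return np.stack([np.linalg.lstsq(X[:, s, :], y[:, s], rcond=None)[0]
                     for s in range(X.shape[1])])

def space_independent_weights(X, y):
    """OME2-style MLR: pool all years and grid points into one regression
    (45 * 200 samples for a single weight vector)."""
    return np.linalg.lstsq(X.reshape(-1, X.shape[-1]),
                           y.reshape(-1), rcond=None)[0]

w1 = space_dependent_weights(X, y)    # shape (200, 3): scatter around 0.5
w2 = space_independent_weights(X, y)  # shape (3,): tight estimate near 0.5
print(w1.std(axis=0), w2)
```

With identical underlying weights, the per-gridpoint fits scatter widely around the truth while the pooled fit recovers it closely, which is the sampling problem described above.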
 The second aim of our study was to analyze an apparent discrepancy between the conclusions of Krishnamurti et al. [1999, 2000] and of Pavan and Doblas-Reyes [2000] about the potential advantage of multimodel ensemble prediction techniques. We conclude that the cause of the discrepancy may be twofold. The first reason could be related to the data from which the multimodel ensemble predictions are constructed. When the input data are single realizations of the AGCMs, as used by Krishnamurti et al. [1999, 2000], the performance of multimodel ensemble predictions showed a marked improvement over the performance of the individual AGCMs. On the other hand, when the input data are the ensemble means of the AGCMs, consistent with our own results, the improvement in the performance of multimodel ensemble predictions over the best single model is only marginal [Pavan and Doblas-Reyes, 2000]. A second reason for the discrepancy could be that the influence of multimodel ensemble techniques on the skill scores may be metric dependent.
 Although on an area averaged basis the performance scores for predictions based on multimodel ensemble prediction techniques were only marginally better than those of the best AGCM among the set of AGCMs, an advantage of multimodel ensemble prediction techniques may be that seasonal predictions based on them are able to retain the best regional features of each of the individual AGCMs.
 Our conclusions may have been influenced by the particular choice of metrics of seasonal prediction skill used in this paper, i.e., the RMS error and the temporal AC. We have also focused our analysis on the ensemble mean as a deterministic prediction, and issues related to probabilistic predictions based on a multimodel ensemble were not considered. Palmer et al. and Rajagopalan et al. have discussed that multimodel ensemble predictions, by better sampling the spread of seasonal mean states, may also provide improved probabilistic seasonal prediction information.
Appendix A: Impact of Model Systematic Error on Temporal Anomaly Correlation
 If f(s, n) is the AGCM simulated seasonal mean at the spatial location s and for a particular season n, and f_c(s) is some climatology relative to which the seasonal mean anomalies of the AGCM are computed, then the AGCM seasonal mean anomaly is given by

f′(s, n) = f(s, n) − f_c(s).  (A1)

The corresponding observed anomalies are given by

o′(s, n) = o(s, n) − ō(s),  (A2)

where ō(s) is the time mean of the observations over the N seasons. The temporal anomaly correlation (AC) is defined by

AC(s) = Σ_n f′(s, n) o′(s, n) / [Σ_n f′(s, n)² Σ_n o′(s, n)²]^(1/2).  (A3)

After some algebraic manipulation it can easily be shown that the numerator of (A3),

Σ_n f′(s, n) o′(s, n),

is independent of the choice of the climatology f_c(s) relative to which the AGCM's anomalies are computed, because Σ_n o′(s, n) = 0. Similarly, in the denominator it can be shown that the variance of the AGCM anomalies satisfies

(1/N) Σ_n f′(s, n)² = (1/N) Σ_n [f(s, n) − f̄(s)]² + [f̄(s) − f_c(s)]²,  (A4)

where f̄(s) is the time mean of the AGCM simulations. Equation (A4) depends on the choice of the climatology. For f_c(s) = f̄(s), i.e., for bias-corrected anomalies, the second term on the right-hand side is zero. On the other hand, for f_c(s) = ō(s), i.e., for biased anomalies, the second term equals the square of the model's systematic error and is positive definite. Therefore, the absolute value of the temporal AC for biased anomalies is always less than or equal to the corresponding value for the bias-corrected AGCM anomalies.
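This appendix result can be checked numerically: with anomalies computed about a biased climatology, the numerator of the AC is unchanged while the denominator is inflated by the squared systematic error. A minimal sketch with synthetic data and an assumed mean bias of +2.0:

```python
import numpy as np

rng = np.random.default_rng(5)

# One grid point, 45 seasons: model means carry a +2.0 systematic bias;
# observations are correlated with the model's year-to-year variability.
n = 45
f = 2.0 + rng.standard_normal(n)                   # biased model means
o = 0.6 * (f - f.mean()) + rng.standard_normal(n)  # correlated observations

def temporal_ac(f, o, clim):
    """Temporal AC of model anomalies (f - clim) vs observed anomalies."""
    fp = f - clim
    op = o - o.mean()
    return (fp * op).sum() / np.sqrt((fp**2).sum() * (op**2).sum())

ac_corrected = temporal_ac(f, o, f.mean())  # climatology = model's own mean
ac_biased = temporal_ac(f, o, o.mean())     # climatology = observed mean
print(ac_corrected, ac_biased)
```

The numerator is identical in the two cases (the observed anomalies sum to zero), so the entire loss of correlation for the biased anomalies comes from the inflated denominator.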