The problem of combining forecasts has received much attention lately because of the availability of numerous forecasts from several modelling institutions. Virtually every article that has been written on this topic agrees that combining multiple forecasts leads to increased accuracy and reliability. However, there is considerable disagreement on the optimal method for combining forecasts. A question that arises repeatedly is whether weighting forecasts unequally yields significantly better predictions than weighting forecasts equally. Numerous studies have shown that equal weighting, especially the arithmetic multi-model mean, is a competitive method for producing accurate and probabilistically reliable forecasts (Kharin and Zwiers, 2002; Peng et al., 2002; Hagedorn et al., 2005; DelSole, 2007; Weigel et al., 2010). The consistently good performance of the multi-model mean also has been recognized in the economic literature, as documented routinely in the International Journal of Forecasting, Operational Research Quarterly and Journal of Forecasting (Clemen, 1989). Nevertheless, many of these results are based on cross-validation experiments that lack a probabilistic statement as to whether an unequal weighting strategy yields significantly better cross-validated predictions than equal weighting. In addition, cross-validation requires the definition of a specific strategy for unequal weighting, such as ordinary least-squares or ridge regression, with which to compare equal weighting. It would be desirable to test whether unequal weighting is significantly better than equal weighting without having to specify a particular strategy for unequal weighting.
The purpose of this article is to propose a statistical test for whether a multi-model combination based on unequal weighting has significantly smaller errors than one based on equal weighting. The test can be derived from a standard framework for testing hypotheses in linear regression models. Accordingly, we first review the standard framework in section 2, and specialize it to our particular problem in section 3. To illustrate certain aspects of the test, the test is applied to an idealized example based on synthetic data in section 4. The test is then applied to a set of multi-model hindcasts of seasonal mean temperature and precipitation. The dataset is described in section 5 and the results of the test are described in section 6. This article concludes with a summary and discussion of results.
2. Review of hypothesis tests in multiple regression
In this section we review a standard framework for testing hypotheses in linear regression models. This review is selective in the sense that only material relevant to our particular procedure is discussed; more extensive coverage of hypothesis testing in regression models can be found in Graybill (1961), Neter and Wasserman (1974), Mardia et al. (1979), and Seber and Lee (2003). Multiple regression is concerned with identifying a relation between a predictand y and two or more predictors x. The N observed values of the predictand can be collected in the N-dimensional vector y, and those of the K predictors into the N × K matrix X. We assume the predictand is a linear function of the predictors plus random error. This assumption can be expressed mathematically as
$$\underset{N\times 1}{\mathbf{y}} \;=\; \underset{N\times K}{\mathbf{X}}\,\underset{K\times 1}{\boldsymbol{\beta}} \;+\; \underset{N\times 1}{\boldsymbol{\epsilon}}, \qquad (1)$$
where the dimension of each matrix is specified directly below the corresponding matrix, $\boldsymbol{\epsilon}$ is a vector of independent and identically distributed random variables, and β specifies the regression coefficients for predicting y given X. The least-squares estimate of β is
$$\hat{\boldsymbol{\beta}} = \left(\mathbf{X}^T\mathbf{X}\right)^{-1}\mathbf{X}^T\mathbf{y}, \qquad (2)$$
where the hat indicates a sample quantity and superscript T denotes the transpose operation. The sum square error of the model is defined as
$$\mathrm{SSE}_F = \left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right)^T\left(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}\right), \qquad (3)$$
where we use the subscript F to indicate a quantity associated with the ‘full’ model, a distinction that will become clear shortly.
We want to test the hypothesis that the regression coefficients satisfy a set of linear constraints. The constraints considered here can be expressed in the form
$$\underset{Q\times K}{\mathbf{C}}\,\underset{K\times 1}{\boldsymbol{\beta}} = \underset{Q\times 1}{\mathbf{c}}, \qquad (4)$$
where Q is the number of constraints, and C and c contain parameters specifying the constraints. We assume that the rank of C is Q ≤ K, implying that the constraints are linearly independent. The sum square error of the regression model (1) subject to the constraints (4) can be minimized using the method of Lagrange multipliers (Seber and Lee, 2003). However, for the kinds of constraints investigated in this article, it is simpler to impose the constraints directly, and then determine the values of the remaining coefficients by the method of least squares (examples of this will be given below). We assume that this step can be performed, so that the sum square error of model (1) subject to the constraints (4) is SSEC.
To quantify significance, the probability distribution of $\boldsymbol{\epsilon}$ needs to be specified. A family of non-Gaussian distributions can be considered using the framework of generalized linear models, as discussed in McCullagh and Nelder (1989). However, the Gaussian case is attractive because of its simplicity and the existence of closed-form estimates. Hereafter we assume the elements of $\boldsymbol{\epsilon}$ are independently and identically distributed as a normal distribution with zero mean and common variance. In this case, Graybill (1961) and Seber and Lee (2003) show that the likelihood ratio test for the hypothesis that the coefficients satisfy (4) leads to the statistic
$$F = \frac{(\mathrm{SSE}_C - \mathrm{SSE}_F)/Q}{\mathrm{SSE}_F/(N-K)}. \qquad (5)$$
If the hypothesis is true, then this statistic has an F-distribution with Q and N − K degrees of freedom. Large values of F favour rejection of the hypothesis.
An important point is that the test statistic (5) can be interpreted as a comparison of two models. Specifically, SSEC − SSEF is a difference in sum square errors between the constrained and full models. The divisor SSEF can be interpreted as setting the ‘scale’ for measuring differences. If the value of F is ‘small’ (i.e. not significant), then the full model does not have significantly smaller errors than the constrained model. Thus, in this framework, testing whether two models have significantly different errors is equivalent to testing a hypothesis on the coefficients of a linear regression model.
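The comparison of a full and a constrained model via the statistic (5) can be sketched numerically. The following is a minimal illustration, not from the article; the function name `nested_f_test` and the synthetic data in the usage are assumptions.

```python
# Minimal sketch of the general F test (5): compare a "full" least-squares
# model with a "constrained" (nested) one. q is the number of linearly
# independent constraints Q; the full model has K = X_full.shape[1] columns.
import numpy as np
from scipy import stats

def nested_f_test(y, X_full, X_constr, q):
    """Return the F statistic (5) and its p-value for nested models."""
    n, k = X_full.shape
    # Sum square errors of the two least-squares fits
    sse_f = np.sum((y - X_full @ np.linalg.lstsq(X_full, y, rcond=None)[0]) ** 2)
    sse_c = np.sum((y - X_constr @ np.linalg.lstsq(X_constr, y, rcond=None)[0]) ** 2)
    f = ((sse_c - sse_f) / q) / (sse_f / (n - k))
    p = stats.f.sf(f, q, n - k)  # upper tail of F(Q, N-K)
    return f, p
```

Because the constrained model is nested in the full model, the numerator is non-negative, and large F values favour rejection of the constraints.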
It is worth recognizing that many hypothesis tests that are common in climate studies are special cases of the above test. As an illustrative example, the hypothesis that the correlation between x and y vanishes is equivalent to the hypothesis that a = 0 in the model
$$y_n = a x_n + b + \epsilon_n, \qquad (6)$$
where, in contrast to (1), x is a univariate random variable, and a and b are scalars. In this case, the ‘full’ model is given by (6), while the constraint equation (4) specializes to
$$\mathbf{C} = \begin{pmatrix} 1 & 0 \end{pmatrix}, \qquad \mathbf{c} = 0, \qquad \boldsymbol{\beta} = \begin{pmatrix} a \\ b \end{pmatrix}, \qquad (7)$$
and the ‘constrained’ model is
$$y_n = b + \epsilon_n. \qquad (8)$$
It can be shown that the sample correlation is related to the error of the models as
$$\hat{\rho}^2 = \frac{\mathrm{SSE}_C - \mathrm{SSE}_F}{\mathrm{SSE}_C}, \qquad (9)$$
which can be interpreted as the fraction of total variance in y explained by the regression model. It follows that the test statistic (5) can be written as
$$F = (N-2)\,\frac{\hat{\rho}^2}{1-\hat{\rho}^2}, \qquad (10)$$
which is precisely the standard test statistic for the hypothesis that the population correlation ρ vanishes. Similarly, testing the hypothesis that any subset of coefficients vanishes can be formulated in the above framework. Also, analysis of variance (ANOVA) can be formulated in the above framework.
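As a quick numerical check of the special case above, the F statistic with (1, N − 2) degrees of freedom reproduces the usual two-sided p-value for testing zero correlation. The function name and data below are illustrative assumptions.

```python
# The correlation test as a special case of the F test: the statistic
# (N - 2) r^2 / (1 - r^2) with (1, N-2) degrees of freedom yields the
# standard two-sided p-value for the hypothesis that rho vanishes.
import numpy as np
from scipy import stats

def corr_f_test(x, y):
    n = len(y)
    r = np.corrcoef(x, y)[0, 1]        # sample correlation
    f = (n - 2) * r**2 / (1 - r**2)    # statistic for the hypothesis rho = 0
    p = stats.f.sf(f, 1, n - 2)        # upper tail of F(1, N-2)
    return f, p
```

This agrees with the p-value reported by `scipy.stats.pearsonr`, since the square of the usual t statistic for a correlation is exactly this F statistic.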
3. Application to multi-model combinations
We apply the above framework to multi-model combinations. If multiple ensemble members are available from each model, then only ensemble means from the same model should be used as predictors in a linear multi-model regression. To see this, note that ensemble members from the same model are exchangeable, hence the weights attached to these members must be equal. It follows that the multi-model regression can depend only on the sum of forecasts from the same model, which in turn is proportional to the ensemble mean forecast. (Recall that proportionality constants do not affect predictions from a regression.) Incorporating additional ensemble information, such as error or spread, presumably requires more probabilistic or Bayesian approaches. Also, if the number of ensemble members differs between models, it is unlikely that equal weighting of ensemble means is the best strategy for the population.
An important unsolved problem is whether a subset of model forecasts should be removed before constructing a multi-model combination. For instance, if one model is clearly inferior to all the others, perhaps it should be eliminated; if the model forecasts are highly correlated, perhaps some of them can be removed without adversely impacting overall skill. A key difficulty is that using the same data to select the subset and test hypotheses leads to selection bias. In this article, we assume that the pool of models is defined independently of the data (for instance, by agreement among operational forecasting centres) and that our task is to test hypotheses for the given pool.
Let the ensemble-mean dynamical model forecasts be xn1,xn2,…,xnM, where M is the number of models, and let yn be the predictand, for n = 1,2,…,N. The General Linear Model (GLM) is defined as
$$y_n = \mu + \beta_1 x_{n1} + \beta_2 x_{n2} + \cdots + \beta_M x_{nM} + \epsilon_n, \qquad (11)$$
where β1,β2,…,βM are (generally different) weighting coefficients. The coefficient μ is included to account for model biases. This model can be expressed in the form (1) with the identifications
$$\mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_N \end{pmatrix}, \qquad
\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1M} \\ \vdots & \vdots & & \vdots \\ 1 & x_{N1} & \cdots & x_{NM} \end{pmatrix}, \qquad
\boldsymbol{\beta} = \begin{pmatrix} \mu \\ \beta_1 \\ \vdots \\ \beta_M \end{pmatrix}. \qquad (12)$$
Note that the dimension of X is N × (M + 1), implying K = M + 1. The sum square error associated with the GLM model will be denoted SSEGLM.
The above mathematical specification involves certain assumptions that are worth stating explicitly. First, our multi-model combination is a linear combination of forecasts, hence nonlinear combinations, such as might be taken into account using neural networks, are not considered. Second, our multi-model combination is univariate in the sense that it is applied individually and independently at each grid point, and hence ignores spatial correlations that might be advantageous in developing more comprehensive multi-model combinations. We consider the univariate analysis developed here as a useful first step. Third, the regression residuals are assumed to be independent and drawn from the same normal distribution. This assumption is always open to question in practical application and hence will be tested in the examples presented in section 6. The proposed procedure is sufficiently general to account for non-stationary behaviour, as long as the nature of the non-stationarity is specified (e.g. linear, sinusoidal, etc.). However, the basis of any regression model is that ‘past patterns will persist into the future’, so extreme, unpredictable departures from stationarity would invalidate all multi-model approaches.
Various hypotheses suggest themselves in the context of multi-model combinations. For instance, the arithmetic mean of forecasts has proven competitive relative to more sophisticated strategies, as mentioned in the introduction, which suggests testing the hypothesis that the weights equal 1/M. Another common test hypothesis is that the weights vanish, implying that the forecasts have no skill. These two hypotheses are special cases of the general hypothesis that the weights are equal. Moreover, if the hypothesis of equal weights is true, then the M weights for the forecasts reduce to a single weight, resulting in a multi-model combination with fewer parameters to estimate. These considerations lead us to propose that the first step in developing a multi-model combination is to test the hypothesis that the weights are equal. If the hypothesis of equal weights is rejected, then developing an unequal weighting strategy seems defensible. In contrast, if the hypothesis of equal weights cannot be rejected at a prescribed significance level, then this fact would call into question whether an unequal weighting strategy can be justified.
3.1. Test hypothesis of equal weights
Our first question is whether a multi-model combination based on unequal weights produces significantly smaller errors than a multi-model combination based on equal weights. To address this question, we test the hypothesis
$$\beta_1 = \beta_2 = \cdots = \beta_M. \qquad (13)$$
The hypothesis (13) can be expressed in the form (4) with the identifications
$$\mathbf{C} = \begin{pmatrix} 0 & 1 & -1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & -1 & \cdots & 0 \\ \vdots & & & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & 1 & -1 \end{pmatrix}, \qquad \mathbf{c} = \mathbf{0}. \qquad (14)$$
In this case, C is an (M − 1) × (M + 1) matrix and c is an (M − 1)-dimensional vector, implying Q = M − 1. These expressions could be used in the Lagrange multiplier method, but it is easier to impose constraint (13) directly, which leads to the scaled multi-model mean (SMMM) model
$$y_n = \mu + \frac{\alpha}{M}\sum_{m=1}^{M} x_{nm} + \epsilon_n, \qquad (15)$$
where α parametrizes the common weight implied in (13), such that α = 1 corresponds to a multi-model mean. This model has two regression coefficients, namely α and μ, which can be estimated by the least-squares method. The sum square error corresponding to this model will be denoted SSESMMM. The test statistic for hypothesis (13) is therefore
$$F = \frac{(\mathrm{SSE}_{\mathrm{SMMM}} - \mathrm{SSE}_{\mathrm{GLM}})/(M-1)}{\mathrm{SSE}_{\mathrm{GLM}}/(N-M-1)}. \qquad (16)$$
If hypothesis (13) is true, then this statistic has an F distribution with (M − 1) and (N − M − 1) degrees of freedom. We reject the hypothesis of equal weighting if the observed value of F exceeds the appropriate critical value. Rejection of the hypothesis implies that the model with unequal weights has significantly smaller errors than the model with equal weights.
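The equal-weights test of this subsection can be sketched directly: fit the GLM (intercept plus one weight per model) and the SMMM model (intercept plus a single scaling of the multi-model mean), then form the F statistic. This is an illustrative sketch; the function name and array layout are assumptions.

```python
# Sketch of the equal-weights test of section 3.1.
# forecasts: N x M array of ensemble-mean forecasts; y: length-N predictand.
import numpy as np
from scipy import stats

def equal_weights_test(y, forecasts):
    """F statistic and p-value for the hypothesis of equal weights."""
    n, m = forecasts.shape
    ones = np.ones((n, 1))
    # GLM: intercept plus one weight per model (K = M + 1 coefficients)
    X_glm = np.hstack([ones, forecasts])
    sse_glm = np.sum((y - X_glm @ np.linalg.lstsq(X_glm, y, rcond=None)[0]) ** 2)
    # SMMM: intercept plus a single scaling of the multi-model mean
    X_smmm = np.hstack([ones, forecasts.mean(axis=1, keepdims=True)])
    sse_smmm = np.sum((y - X_smmm @ np.linalg.lstsq(X_smmm, y, rcond=None)[0]) ** 2)
    f = ((sse_smmm - sse_glm) / (m - 1)) / (sse_glm / (n - m - 1))
    p = stats.f.sf(f, m - 1, n - m - 1)
    return f, p
```

Since the SMMM model is the GLM with equal weights imposed, the numerator is non-negative, and rejection occurs when F exceeds the critical value of the F distribution with (M − 1) and (N − M − 1) degrees of freedom.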
3.2. Quantitative thresholds
We would like to gain a sense of how much smaller the errors of the GLM model need to be relative to those of the SMMM model to be considered significant. To define this measure, it is helpful to define the climatological model
$$y_n = \mu + \epsilon_n. \qquad (17)$$
The only regression coefficient in this model is μ. It can be shown that the least-squares estimate of μ is
$$\hat{\mu} = \frac{1}{N}\sum_{n=1}^{N} y_n = \bar{y}, \qquad (18)$$
and that the corresponding sum square error is
$$\mathrm{SSE}_{\mathrm{CLIM}} = \sum_{n=1}^{N} (y_n - \bar{y})^2. \qquad (19)$$
$\mathrm{SSE}_{\mathrm{CLIM}}$ is proportional to the sample variance of y.
An intuitive measure of the errors of the GLM model is the squared multiple correlation
$$R^2_{\mathrm{GLM}} = 1 - \frac{\mathrm{SSE}_{\mathrm{GLM}}}{\mathrm{SSE}_{\mathrm{CLIM}}}. \qquad (20)$$
As is well known, the squared multiple correlation is a measure of the variance explained by the predictors. As such, its value is between zero and one. Similarly, the variance explained by the scaled multi-model mean model can be quantified by
$$R^2_{\mathrm{SMMM}} = 1 - \frac{\mathrm{SSE}_{\mathrm{SMMM}}}{\mathrm{SSE}_{\mathrm{CLIM}}}. \qquad (21)$$
This parameter equals the squared correlation between the multi-model mean and predictand. As such, it is bounded between zero and one and measures the amount of variance explained by the SMMM model. It can be shown that $R^2_{\mathrm{GLM}} \geq R^2_{\mathrm{SMMM}}$, consistent with the fact that an unconstrained linear regression model explains at least as much variance as a constrained regression model. A measure of the difference in variances explained by the two models is
$$\delta = \frac{R^2_{\mathrm{GLM}} - R^2_{\mathrm{SMMM}}}{1 - R^2_{\mathrm{SMMM}}}. \qquad (22)$$
The parameter δ is bounded between zero and one, and gives the amount of additional variance explained by the GLM model beyond the variance explained by the SMMM model, relative to the unexplained variance of the SMMM model (i.e. relative to $1 - R^2_{\mathrm{SMMM}}$). Standard algebra shows that the F statistic defined in (16) can be written as
$$F = \left(\frac{N-M-1}{M-1}\right)\frac{\delta}{1-\delta}. \qquad (23)$$
For a given significance threshold for testing equality of weights, this equation can be inverted to obtain the corresponding critical value of δ. The result for different sample sizes N and numbers of models M is shown in Figure 1. The figure shows that, for small sample sizes, the required fractional difference in variance increases rapidly with the number of models M. For example, for a sample size of N = 20, the fractional difference must exceed 34% and 79% for a two- and ten-model regression, respectively, while for a sample size of N = 50, the fractional difference must exceed 14% and 35%, respectively. This strong dependence on sample size may explain why studies based on relatively short periods (e.g. 20 years) of seasonal forecasts almost never conclude that weighted combinations are superior to equally weighted combinations.
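The inversion described above is a one-line calculation once the critical F value is known: since the F statistic is a monotone function of δ, the threshold δ satisfies δ/(1 − δ) = F_crit (M − 1)/(N − M − 1). The sketch below assumes a 5% significance level by default; the article does not restate the level used for Figure 1, so no attempt is made to reproduce its specific numbers.

```python
# Invert the relation between F and delta: find the smallest delta
# (additional explained-variance fraction) that reaches the critical
# F value with (M-1, N-M-1) degrees of freedom. alpha = 0.05 default
# is an assumption.
from scipy import stats

def delta_threshold(n, m, alpha=0.05):
    """Minimum delta needed to reject equal weights at level alpha."""
    f_crit = stats.f.isf(alpha, m - 1, n - m - 1)  # critical F value
    ratio = f_crit * (m - 1) / (n - m - 1)         # equals delta / (1 - delta)
    return ratio / (1.0 + ratio)
```

The qualitative behaviour matches the text: the threshold grows as the sample size N shrinks or the number of models M grows.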
3.3. Interpretation of the equal weighting test
As with any hypothesis test, the interpretation of the equal weighting test is subject to certain caveats and nuances. First, even if the equal weight hypothesis is accepted, some unequal weighting strategy could be significant. For instance, the particular unequal weighting strategy may involve fewer degrees of freedom than the general model, but significantly smaller errors than equal weighting. Also, if the predictors are correlated, then some of the predictors might be eliminated without significantly reducing the explained variances, whereas the reduction in degrees of freedom may increase the F-statistic sufficiently to reject the equal weighting hypothesis.
Second, the significance test can be conducted without pre-specifying a strategy for unequal weighting. More precisely, the significance test controls for type I error, which requires specifying only the null hypothesis, not the alternative hypothesis. Therefore, the test proposed here does not commit the user to a particular strategy for unequal weighting. Third, if the general hypothesis (4) is rejected, this result does not imply any specific value for the regression coefficients. In particular, the test does not imply that the least-squares coefficients derived from the sample should be chosen, even though these coefficients are used to define the test statistic (16). Indeed, experience shows that the least-squares GLM often performs poorly in independent samples (Kharin and Zwiers, 2002). All that can be inferred from rejection of the hypothesis is that the estimated coefficients do not lie in the range of values expected under the null hypothesis. If we find that the differences in weights lie outside the expected range, it does not follow that we know the values of the actual weights, or the best method for estimating them.
The fact that rejecting the hypothesis of equal weights does not imply a specific strategy for unequal weighting is both an advantage and a disadvantage: it is an advantage in that the result of the hypothesis test holds regardless of the unequal weighting strategy, but it is a disadvantage in that rejection of the hypothesis provides no hint as to which unequal weighting strategy should be selected. That the test does not require pre-specifying a specific strategy for weighting forecasts makes it useful for exploring a variety of strategies.
4. An idealized example
The outcome of the equal weighting test will be influenced by numerous aspects of the model forecasts. For instance, a typical rationale for the use of unequal model weighting is varying levels of skill and variance among models. To illustrate the effect of these factors, we apply the equal weighting test to an idealized example based on synthetic data. We note that other aspects of model forecasts, such as sample size or correlations between forecast errors, also may play an important role, but it is not possible to comprehensively survey all possible factors that could influence the test. Accordingly, we restrict our example to the case of forecasts with varying levels of skill and variance, but with independent and Gaussian errors. The parameters are chosen to mimic the ENSEMBLES data discussed in the next section: the number of models is five, and the number of observations is 46. In the first example, all models have the same population skill and variance. By population skill, we mean the population correlation between the individual model forecast and predictand. In the second example, the models are the same as the first, except one model has no skill. In the third example, the models are the same as the first, except one model has no skill and has twice the variance of the other models.
Clearly, if the population properties of the models were known, an optimized unequal weighting scheme could be used in examples 2 and 3 to reduce error relative to that of equal weighting. However, sampling variability makes it difficult to determine the weights with sufficient accuracy to give a significant reduction in error. Figure 2 shows the frequency with which the hypothesis of equal weights is accepted in 10 000 Monte Carlo trials for the three examples, as a function of model skill level. When the models are identical, the acceptance frequency of the equal weights hypothesis is 95%, independent of skill level, as anticipated from theory. In example 2, where one of the models has no skill, the acceptance rate decreases as the skill of the other models increases. Nevertheless, for a skill level of 0.4, the hypothesis of equal weights is accepted over 60% of the time even though it is false. This result merely means that the typical errors of a multi-model regression with unconstrained weights are not significantly different from those of a multi-model regression with equal weights. This result is not unreasonable given that a model with skill level 0.4 explains only 16% of the variance and probably will have a weight close to zero, in which case including the no-skill model (with small weight) has little effect on the errors. However, as the skill level increases, the difference between the skill and no-skill models increases, thereby increasing the errors of the scaled multi-model mean model relative to those of the unconstrained model (which tends to place small weight on the no-skill model). In the third example, where one model has no skill and twice the variance of the other models, the discrepancy in skill is the same as in example 2, but a discrepancy in variance also exists, leading to even lower acceptance of the equal-weights hypothesis. In essence, the greater the discrepancy between models, the more likely the hypothesis of equal weights is rejected.
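The first experiment (identical models) can be reproduced in outline as follows. The data-generating recipe, in which each forecast is correlated with a common Gaussian signal, is an assumption chosen to give every model the same population skill; with identical models, the equal-weights null hypothesis holds exactly and the acceptance rate should sit near 95%.

```python
# Monte Carlo sketch of example 1 in section 4: five exchangeable models,
# N = 46 cases, common population skill rho; count how often the
# equal-weights hypothesis is accepted at the 5% level.
import numpy as np
from scipy import stats

def acceptance_rate(rho=0.5, n=46, m=5, trials=200, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    f_crit = stats.f.isf(alpha, m - 1, n - m - 1)
    accept = 0
    for _ in range(trials):
        z = rng.standard_normal(n)                      # predictable signal
        noise = rng.standard_normal((n, m))
        x = rho * z[:, None] + np.sqrt(1 - rho**2) * noise  # corr(x_i, y) = rho
        y = z
        ones = np.ones((n, 1))
        X_glm = np.hstack([ones, x])
        X_smmm = np.hstack([ones, x.mean(axis=1, keepdims=True)])
        sse_glm = np.sum((y - X_glm @ np.linalg.lstsq(X_glm, y, rcond=None)[0]) ** 2)
        sse_smmm = np.sum((y - X_smmm @ np.linalg.lstsq(X_smmm, y, rcond=None)[0]) ** 2)
        f = ((sse_smmm - sse_glm) / (m - 1)) / (sse_glm / (n - m - 1))
        accept += f < f_crit
    return accept / trials
```

Examples 2 and 3 can be mimicked by zeroing one model's rho or doubling its noise variance, which lowers the acceptance rate as described in the text.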
5. Data

For a more realistic application, we apply the equal weighting test to the seasonal hindcast dataset from the ENSEMBLES project. This dataset, reviewed by Weisheimer et al. (2009), consists of seven-month hindcasts by five state-of-the-art coupled atmosphere–ocean models from the UK Met Office, Météo-France, the European Centre for Medium-range Weather Forecasts (ECMWF), the Leibniz Institute of Marine Sciences at Kiel University, Germany, and the Euro-Mediterranean Centre for Climate Change in Bologna, Italy. All models include major radiative forcings and were initialized using realistic estimates from observations. Only hindcasts initialized on 1 May and 1 November of each year in the 46-year period 1960–2005 were used. Each model produced a nine-member ensemble hindcast. These were averaged to construct an ensemble-mean forecast for each model, as explained in section 3. Further details of these data can be found in Weisheimer et al. (2009).
The variables considered in this study are 2 m surface temperature and precipitation. These variables were provided on a 2.5° × 2.5° grid. We examine the three-month mean hindcasts for November–December–January (NDJ) and May–June–July (MJJ), initialized in November and May respectively. These periods are representative of ‘winter’ and ‘summer’ conditions, and the chosen hindcasts correspond to forecast leads of one to three months. Hindcast skill generally decreases with lead time, so the skill at longer leads is expected to be smaller than that found here. The systematic bias of the hindcasts was not removed because this adjustment is performed automatically by the regression models.
The observation-based surface temperature dataset used for verifying the 2 m temperature hindcasts is the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) reanalysis (Kistler et al., 2001), which is provided on the same grid. The observation-based precipitation dataset used for verifying precipitation over land is the NCEP/Climate Prediction Center data of Chen et al. (2002).
6. Results
The procedure discussed in section 3 was applied to hindcasts of 2 m temperature generated by five models from the ENSEMBLES dataset. Grid points for which the hypothesis of equal weights was rejected at the 5% significance level are shown as blue shading in Figure 3. About 22% of the global area falls in this category. In other words, the hypothesis of equal weights cannot be rejected over about 78% of the globe.
The weighting coefficients for the areas favouring unequal weighting were examined, but no simple conclusions could be drawn. For instance, 75% of the weights for the ECMWF model were positive whereas only 50% of the weights for the Météo-France model were positive. One might conclude from this that the weights for the Météo-France model are ‘small’ and therefore that this model can be dropped from the pool. However, we find that dropping any single model from the multi-model pool yields virtually indistinguishable results. Indeed, as is well known, small or negative weights do not necessarily imply that the model should be dropped from the multi-model pool. All that can be concluded when the hypothesis of equal weights is rejected is that at least one of the weights differs from the others.
The unequal weighting strategy is selected most often in the Tropics and in summer polar regions. One plausible explanation for why unequal weighting may be preferred in the Tropics is that the hindcasts have high skill, hence differences in skill are easier to detect, but also have large differences in variance. For example, a map of the standard deviation of hindcast variances (not shown) has a local maximum along the Equator. Casanova and Ahrens (2009) suggest that unequal weighting is preferred when there are large differences in skill, but this explanation may not apply here because a map of the standard deviation of correlation skill exhibits its smallest values in the Tropics (not shown). The summer polar regions also are characterized by large differences in hindcast variance, and these differences probably contribute to a preference for unequal weights.
The analogous results for precipitation, shown in Figure 4, differ dramatically from those for temperature. Specifically, the hypothesis of equal weights is rejected for about 10% of the global area, in small-scale regions randomly distributed over land. Since the test is applied at the 5% significance level, about 5% of the area would be selected by chance even if the weights were equal everywhere. These results suggest that unequal weighting strategies are unlikely to produce significantly better precipitation predictions than equal weighting strategies over much of the globe.
Recall that the equal weighting test assumes that the regression model residuals are drawn independently from the same normal distribution. To check this assumption, we performed several diagnostic tests on the residuals. First, we performed the Lilliefors test for Gaussianity. In the case of temperature, we found that the Gaussian assumption was rejected for about 9% of the globe in both seasons, which is more than expected at the 5% significance level. However, the rejections occurred at small-scale regions distributed randomly throughout the globe, suggesting that departures from normality were not field significant. In the case of precipitation, the Gaussian assumption was rejected for over 20% of the globe in both seasons. The results of this test, shown in Figure 5, reveal that regions for which the Gaussian assumption is rejected tend to have arid climates. In other words, the Gaussian assumption is a poor approximation in regions where there is little to no rain. If we ignore arid regions, then the Gaussian assumption is found to hold surprisingly well. Second, we examined the linear trend and autocorrelation function of the residuals. In the case of precipitation, the number of significant trends and lag-1 autocorrelations did not exceed the number expected by chance. In the case of temperature, significant trends and autocorrelations were found over a large fraction of the globe (not shown). Examination of individual points with significant trends or autocorrelations revealed that the residuals followed the observations closely, and that the observations themselves exhibited trends, low-frequency variability, and even apparent discontinuities (e.g. a cold spell over one or more years). 
Although the existence of trends and significant autocorrelation implies violation of the stationarity assumption of the residuals, it also suggests that the effective number of degrees of freedom may be smaller than has been assumed, which would make it more difficult to reject the hypothesis of equal weighting.
7. Summary and conclusions
This article proposed a statistical procedure for testing whether a multi-model combination based on unequal weights has significantly smaller errors than that based on equal weights. The procedure is derived from a standard statistical framework in linear regression theory. Remarkably, the test does not require pre-specifying a strategy for unequal weighting in order to conclude that unequal weighting can give significantly better predictions.
To appreciate the importance of the proposed test, it is helpful to place it in context. Many studies have previously addressed the question of whether specific unequal weighting methods improve upon equal weighting methods, e.g. the multi-model mean. The typical procedure is to estimate model weights using one dataset and to compare the skills of the two methods using another independent dataset. The first difficulty with such an approach is that the skills of both methods are functions of the data, and are therefore affected by sampling error. This means that, even if the skills of the competing methods are equal, one method will, by chance, prove superior in a given sample. By attaching a statistical significance to differences in skill, the hypothesis testing approach presented here avoids conclusions based on differences that can be explained by chance. The second difficulty with the usual approach is that, when a particular weighting method is observed to be inferior to equal weighting, there is the possibility that some other method may prove superior. In fact, to arrive at the conclusion that unequal weighting does not improve on equal weighting would require testing all possible weighting methods. On the other hand, the hypothesis testing method presented here is able to answer whether any unequal weighting method gives significantly smaller errors than equal weighting. The obvious drawback of the hypothesis testing method is that it is non-constructive: when unequal weights do give significantly better skill, we do not know the specific method for estimating those weights.
The procedure was applied to multi-model hindcasts of seasonal mean 2 m temperature and precipitation from five models in the ENSEMBLES dataset. For temperature, equal weights were selected for about 77% of the globe. For precipitation, equal weights were selected over about 90% of the land area. These results suggest that strategies for unequal weighting of forecasts are of value only over a small fraction of the globe.
It should be recognized that deciding that a model should have unequal weights does not imply that the model with least-squares weights is preferred. In practice, the least-squares model tends to perform poorly in cross-validation experiments, especially when the number of forecasts being combined is large (Kharin and Zwiers, 2002; DelSole, 2007). When the hypothesis of equal weights is rejected, the only conclusion that is justified is that at least one of the weights differs from the others. This conclusion leaves a wide range of possibilities open, including the possibility that one model is ‘bad’ and should have relatively small weight. The possible scenarios are not easily distinguished by examination of the weights since weights generally cannot be interpreted as indicators of relative model reliability (Kharin and Zwiers, 2002).
The fact that equal weighting often yields competitive forecasts suggests that alternative strategies should constrain the weights to be ‘close’ to each other. DelSole (2007) proposed a Bayesian multi-model method that incorporates this kind of prior information.
The procedure assumes that the residuals between predictand and multi-model combination are independent and normally distributed with identical variances. The Gaussian assumption was found to be reasonable nearly everywhere except for precipitation in arid regions or during dry seasons. Significant winter-to-winter or summer-to-summer autocorrelations were found over a substantial fraction of the globe, but accounting for these autocorrelations by decreasing the degrees of freedom would make the equal weighting hypothesis even more difficult to reject. It also should be recognized that the procedure was applied point-wise, whereas a more comprehensive procedure would address the field significance of the unequal weighting (DelSole and Yang, 2011).
A question that often arises in multi-model combinations is whether the original set of forecasts should be screened such that ‘poor’ models are excluded before combining the forecasts. Unfortunately, the concept of a ‘poor’ forecast model has a subtle dependence on other models, since a model could be nearly useless by itself but very useful when combined with other models (e.g. by cancelling errors). Strategies for screening models prior to combining them would seem to be an important next step.
Acknowledgements

This research was supported by the National Science Foundation (ATM0332910, ATM0830062, ATM0830068), the National Aeronautics and Space Administration (NNG04GG46G, NNX09AN50G), and the National Oceanic and Atmospheric Administration (NA04OAR4310034, NA09OAR4310058, NA05OAR4311004, NA10OAR4310210, NA10OAR4310249). The views expressed herein are those of the authors and do not necessarily reflect the views of these agencies.