Abstract
Summary  We consider forecasting using a combination, when no model coincides with a non-constant data generation process (DGP). Practical experience suggests that combining forecasts adds value, and can even dominate the best individual device. We show why this can occur when forecasting models are differentially misspecified, and is likely to occur when the DGP is subject to location shifts. Moreover, averaging may then dominate over estimated weights in the combination. Finally, it cannot be proved that only non-encompassed devices should be retained in the combination. Empirical and Monte Carlo illustrations confirm the analysis.
1. INTRODUCTION
In the third of a century since Bates and Granger (1969), the combination of individual forecasts of the same event has often been found to outperform the individual forecasts, in the sense that the combined forecast delivers a smaller mean-squared forecast error (MSFE)—see, inter alia, Diebold and Lopez (1996) and Newbold and Harvey (2002) for recent surveys, and Clemen (1989) for an annotated bibliography. Studies such as Newbold and Granger (1974) provided early evidence consistent with that claim. Moreover, simple rules for combining forecasts, such as averages (i.e. equal weights), often work as well as more elaborate rules based on the relative past performance of the forecasts to be combined (see Stock and Watson 1999; Fildes and Ord 2002). Nevertheless, despite some potential explanations (such as Granger 1989), precisely why forecast combinations should work well does not appear to be fully understood. This paper addresses that issue.
There are a number of potential explanations. First, if two models provide partial, but incompletely overlapping, explanations, then some combination of the two might do better than either alone. In particular, if two forecasts were differentially biased (one upwards, one downwards), it is easy to see why combining could be an improvement over either. Similarly, if all explanatory variables were orthogonal, and models contained subsets of these, an appropriately weighted combination could more completely reflect all the information. However, it is unclear why investigators would construct systematically biased or inefficient models; and there are other solutions to forecast biases and inefficiencies than pooling forecasts. Moreover, it is less easy to see why a combination need improve over the best of a group, particularly if there are some decidedly poor forecasts in that group.
Second, in nonstationary time series, most forecasts will fail in the same direction when forecasting over a period within which a break unexpectedly occurs. Combination is unlikely to provide a substantial improvement over the best individual forecasts in such a setting. Nevertheless, what will occur when forecasting after a location shift depends on the extent of model misspecifications, data correlations, the sizes of breaks and so on, so combination might help. Since a theory of forecasting allowing for model misspecification interacting with intermittent location shifts has explained many other features of the empirical forecasting literature (see Clements and Hendry 1999), we explore the possibility that it can also account for the benefits from pooling.
Third, averaging reduces variance to the extent that separate sources of information are used. Since we allow all models to be differentially misspecified, such variance reduction remains possible. Nevertheless, we will ignore sample estimation uncertainty to focus on specification issues, so any gains from averaging also reducing that source of variance will be additional to those we delineate.^{1}
Next, an alternative interpretation of combination is that, relative to a ‘baseline’ forecast, additional forecasts act like intercept corrections (ICs). It is well known that appropriate ICs can improve forecasting performance not only if there are structural breaks, but also if there are deterministic misspecifications. Indeed, Clements and Hendry (1999) present eight distinct interpretations of the role that ICs can play in forecasting, and for example, interpret the crosscountry pooling in Hoogstrate et al. (2000) as a specific form of IC.
Finally, pooling can also be viewed as an application of James–Stein ‘shrinkage’ estimation (see e.g. Judge and Bock 1978). If the unknown future value is viewed as a ‘meta-parameter’ of which all the individual forecasts are estimates, then averaging may provide a ‘better’ estimate thereof. Below, we consider whether data-based weighting will be useful when the process is subject to unanticipated breaks.
Thus, we evaluate the possible benefits of combining forecasts in light of the nature of the economic system and typical macroeconomic models thereof, to discern the properties of the system and models—and the relationships between the two—that result in forecast combination reducing MSFEs. In particular, given that a general theory of economic forecasting which allows for structural breaks and misspecified models has radically different implications from one that assumes stationarity and well-specified models (see Clements and Hendry 1999; Hendry and Clements 2003), we explore the role of forecast combinations in the former framework.
Section 2 confirms that combinations of forecasts are ineffective when forecasting using the correct conditional expectation in a weakly stationary process. Thus, departures from ‘optimality’, due to misspecification, misestimation or nonstationarities, are necessary to explain gains from combination. Section 3 considers whether combination could deliver gains in a weakly stationary process when forecasting models are differentially misspecified by using only subsets of the relevant information. We show there is a range of values of the parameters of the data generation process (DGP) where this can occur, but gains are not guaranteed. Nevertheless, the logic of why gains ensue in such a setting points to why combination might work in general, partly by providing ‘insurance’ against obtaining the worst forecasts. Section 4 notes alternative ways of implementing forecast combinations, and Section 5 considers the role of encompassing—which is violated by the need to pool—and discusses whether only non-encompassed models are worth pooling. If the weights used in any combination are estimated, then they directly reflect a lack of encompassing; however, if pre-fixed weights, such as the average, are used, encompassed models may lower rather than raise the efficiency of the combined forecast. Section 6 extends the analysis to processes subject to location shifts, where the combination can dominate in MSFE. Moreover, previously encompassed models may later become dominant, and the earlier dominant model may fail badly, so averaging across all contenders cannot be excluded as a sensible strategy. Section 7 provides an empirical illustration based on the data set originally used by Bates and Granger (1969), and, by demonstrating the efficacy of ICs, suggests that combination works there because of location shifts of the form underlying our theoretical approach.
The Monte Carlo study of the behaviour in finite samples of our theoretical approximations in Section 8 supports their applicability in practice. Section 9 considers forecast densities after pooling, and Section 10 concludes.
2. FORECASTING BY THE CONDITIONAL EXPECTATION
Consider a weakly stationary n-dimensional stochastic process {x_{t}} with density D_{x}(x_{t} ∣ X_{t−1}, θ), which is a function of past information X_{t−1} = (x_{1}, …, x_{t−1}). Forecasts of x_{T+h} based on the conditional expectation given information up to period T,
 x̂_{T+h∣T} = E[x_{T+h} ∣ X_{T}],  (1)
are conditionally unbiased,
 E[x_{T+h} − x̂_{T+h∣T} ∣ X_{T}] = 0,  (2)
and no other predictor conditional on only X_{T} has a smaller MSFE matrix,
 E[(x_{T+h} − x̃_{T+h})(x_{T+h} − x̃_{T+h})′ ∣ X_{T}] ≥ E[(x_{T+h} − x̂_{T+h∣T})(x_{T+h} − x̂_{T+h∣T})′ ∣ X_{T}],  (3)
for any alternative predictor x̃_{T+h} based only on X_{T}.
Moreover, both (2) and (3) hold for all h. Consequently, on an MSFE basis for forecasting x_{T+h}, the conditional expectation cannot be beaten, as is well known. However, the empirical evidence that combination is useful clearly indicates that the above framework is inappropriate as an analytic basis.
There are several possible explanations for the empirical outcome. First, forecasts might be used that are based on only subsets of the available information X_{T}. Second, the functions of past data used to form those forecasts do not coincide with the conditional expectation. Third, parameter estimation uncertainty is sufficiently large that averaging is advantageous. Finally, the underlying data density D_{x}(x_{t} ∣ X_{t−1}, θ) is not constant, in which case the first two mistakes are almost bound to occur as well, particularly if location shifts are the source of the nonconstancy.^{2} The proliferation of competing forecasting methods and models is also evidence for the first two potential explanations. Here, we first explore the implications of combining the forecasts from misspecified models when D_{x}(·) is constant, then consider what happens when the DGP is subject to intermittent breaks.
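The optimality claim in (2) and (3) is straightforward to check numerically. The following sketch (illustrative, not from the paper; a stationary AR(1) with autoregressive parameter 0.8 is assumed) compares the conditional-expectation forecast with a no-change forecast and with their equal-weight average: mixing anything into the conditional expectation can only worsen the MSFE.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, R = 0.8, 50_000

# draw (y_T, y_{T+1}) pairs from a stationary AR(1): y_t = rho*y_{t-1} + eps_t
y_T = rng.standard_normal(R) / np.sqrt(1 - rho**2)
y_next = rho * y_T + rng.standard_normal(R)

f_cond = rho * y_T                # conditional expectation E[y_{T+1} | y_T]
f_naive = y_T                     # no-change forecast
f_avg = 0.5 * (f_cond + f_naive)  # equal-weight combination

def mse(f):
    return float(np.mean((y_next - f) ** 2))

print(mse(f_cond), mse(f_naive), mse(f_avg))
```

With these parameter values the population MSFEs are 1, about 1.11 and about 1.03 respectively, so the combination is strictly inferior to the conditional expectation, as the analysis above requires.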
3. FORECASTS FROM MISSPECIFIED CONSTANT MODELS
To articulate our approach, we approximate the DGP D_{x}(x_{t} ∣ X_{t−1}, θ) by the constant-parameter first-order vector autoregression (VAR),
 x_{t} = Πx_{t−1} + ε_{t},  (4)
where ε_{t}∼IN_{n}[0, Ω_{ε}]. Section 6 considers the impacts of breaks due to location shifts. We focus on 1-step ahead forecasts for T + 1 from time T purely to simplify the algebra; no issue of principle seems involved in generalizing to multi-step forecasts. Also, we restrict attention to forecasting the scalar y_{t}, which is one element of x_{t}, and in this section, assume that, in the absence of structural breaks, x_{t} in (4) has been reduced to weak stationarity by appropriate transformations. Thus, partitioning x′_{t}= (x′_{1,t} : x′_{2,t}), the model determining y_{t} is given by
 y_{t} = β_{1}′x_{1,t−1} + β_{2}′x_{2,t−1} + e_{t},  (5)
where e_{t}∼IN[0, σ^{2}_{e}], independently of x_{t−1}. Since the processes are all weakly stationary, intercepts are set to zero.
Two investigators, unaware of the nature of the process in (5), fit separate models of the form:
 y_{t} = a′w_{t} + u_{t}, where w_{t} ≡ x_{1,t−1},  (6)
and
 y_{t} = b′z_{t} + v_{t}, where z_{t} ≡ x_{2,t−1}.  (7)
Each model is misspecified by omitting the components which the other includes—the absence of overlapping variables seems an inessential simplification (the switch to w_{t} and z_{t} is to ease notation below, but note that w_{T+1} and z_{T+1} are known at the forecast origin). Moreover, as we believe the explanation for any benefits from combination derives from specification—rather than estimation—issues, we further simplify by neglecting sampling variability in the coefficients a and b where necessary to obtain sharper results. The assumption that the partial models span the information set is to simplify the algebra, and does not seem consequential: Section 8 provides a Monte Carlo illustration.
It must be stressed that in such a constantparameter framework, pooling the information will produce the optimal forecast, as the resulting model coincides with the DGP, whereas pooling the forecasts will not in general (but see Granger (1989) for an example). However, that implication need not generalize to nonconstant DGPs.
Let
 w_{t} = φ_{w,t} + η_{w,t} and z_{t} = φ_{z,t} + η_{z,t},  (8)
where φ_{w,t} and φ_{z,t} are fixed functions of past variables, and
 V[η_{w,t}] = Ω_{ww}, V[η_{z,t}] = Ω_{zz} and C[η_{z,t}, η_{w,t}] = Ω_{zw} = Ω′_{wz}.  (9)
Our interest is in comparing the accuracy of the forecasts from the models in (6) and (7) against that of a pooled forecast, based on MSFEs (as that is the criterion most frequently applied in practice, but see Clements and Hendry 1993). We set φ_{w,t}=φ_{z,t}=0, so both dynamics and deterministic factors are ignored, and this is known to the investigators, so intercepts and further lags are omitted: Section 8 investigates dynamics via Monte Carlo simulations.
The 1-step ahead forecast from (6) is denoted ŷ^{1}_{T+1} = â′w_{T+1}, so the forecast error is
 ê^{1}_{T+1} = y_{T+1} − ŷ^{1}_{T+1} ≈ β_{2}′η_{zw,T+1} + e_{T+1}.  (10)
The corresponding forecast from (7) uses ŷ^{2}_{T+1} = b̂′z_{T+1}, with
 ê^{2}_{T+1} = y_{T+1} − ŷ^{2}_{T+1} ≈ β_{1}′η_{wz,T+1} + e_{T+1}.  (11)
Neither forecast should encompass the other. Section 5 considers testing for non-encompassing prior to combination.
In the Appendix, we detail the derivation of the MSFEs for the two models. Letting M[·] denote MSFE, these are given by
 M[ŷ^{1}_{T+1}] ≈ σ^{2}_{e} + β_{2}′Ω_{η_{zw}}β_{2},  (12)
and
 M[ŷ^{2}_{T+1}] ≈ σ^{2}_{e} + β_{1}′Ω_{η_{wz}}β_{1},  (13)
where the approximations result from ignoring parameter estimation uncertainty, that is, terms of O_{p}(T^{−1}). In these expressions, Ω_{η_{zw}} ≡ V[η_{zw,t}], where V[·] denotes a variance, and η_{zw,t} is defined by:
 z_{t} = Π_{zw}w_{t} + η_{zw,t}, with Π_{zw} = Ω_{zw}Ω^{−1}_{ww}.  (14)
Similarly, Ω_{η_{wz}} ≡ V[η_{wz,t}], where
 w_{t} = Π_{wz}z_{t} + η_{wz,t}, with Π_{wz} = Ω_{wz}Ω^{−1}_{zz}.  (15)
To a first approximation, then, the MSFEs depend on the importance in the DGP of the omitted variables (e.g. for the model given by (6), with MSFE given by (12), this is β_{2}, the coefficient on z_{t}), and will be greater to the extent that the included variables do not explain the excluded (measured by Ω_{η_{zw}} in (12) for the model in (6)). To order the outcomes in terms of accuracy, we assume β_{2}′Ω_{η_{zw}}β_{2} < β_{1}′Ω_{η_{wz}}β_{1}. Consequently, ŷ^{1}_{T+1} would transpire on average to be the more accurate forecast here: equivalent results hold for the opposite ranking.
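The approximations (12) and (13) are easy to verify by simulation. The sketch below (illustrative scalar parameter values, not from the paper) fits the two subset models by least squares, evaluates their out-of-sample MSFEs against the theoretical values σ^{2}_{e} + β′Ω_{η}β, and also evaluates the equal-weight average and the pooled-information regression, which coincides with the DGP and so is optimal here (as noted above).

```python
import numpy as np

rng = np.random.default_rng(1)
beta1, beta2, r = 1.0, 0.5, 0.5      # DGP coefficients and Corr(w, z)
T, H = 100_000, 100_000              # estimation and evaluation sample sizes

def draw(n):
    w = rng.standard_normal(n)
    z = r * w + np.sqrt(1 - r**2) * rng.standard_normal(n)
    y = beta1 * w + beta2 * z + rng.standard_normal(n)
    return w, z, y

w, z, y = draw(T)
a = (w @ y) / (w @ w)                # model (6): y on w alone, omits z
b = (z @ y) / (z @ z)                # model (7): y on z alone, omits w

wf, zf, yf = draw(H)
mse1 = float(np.mean((yf - a * wf) ** 2))
mse2 = float(np.mean((yf - b * zf) ** 2))
mse_avg = float(np.mean((yf - 0.5 * (a * wf + b * zf)) ** 2))

# pooled information: regress y on both w and z (coincides with the DGP)
X = np.column_stack([w, z])
g = np.linalg.lstsq(X, y, rcond=None)[0]
mse_full = float(np.mean((yf - (g[0] * wf + g[1] * zf)) ** 2))

# theory: MSFE_i ~ sigma_e^2 + beta_j^2 * Var(eta), with Var(eta) = 1 - r^2
print(mse1, 1 + beta2**2 * (1 - r**2))   # ~ 1.19
print(mse2, 1 + beta1**2 * (1 - r**2))   # ~ 1.75
print(mse_avg, mse_full)                 # ~ 1.14, ~ 1.00
```

Note that for these values the equal-weight average beats the better subset model, though (as expected) not the pooled-information regression.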
Writing a combined forecast as
 ȳ_{T+1} = λŷ^{1}_{T+1} + (1 − λ)ŷ^{2}_{T+1}, where 0 ≤ λ ≤ 1,  (16)
we derive in the Appendix the associated MSFE as
 (17)
(to the same order of approximation as for the individual forecasts), where
and, for example, Ω_{zw}=E[z_{t}w′_{t}], as indicated. The last line in the above is the matrix analogue of (1−R^{2}_{wz}), and has a negative sign: intuitively, if the regression of z_{t} on w_{t} over- (under-) estimates, the reverse regression will do the opposite.
Stock and Watson (1999) find that a combination obtained by pooling forecasts across many methods does well, using either the mean or median forecast, so we focus on the case where λ= 0.5. Then
 (18)
as against the smaller of the two individual forecast errors:
Therefore
if and only if
Let β_{2}′Ω_{η_{zw}}β_{2} = kβ_{1}′Ω_{η_{wz}}β_{1}, where k < 1 given our ordering; then combination dominance requires
This is more likely to hold if the marginal effects of w and z on y in the DGP are of the same sign and ‘match’ the sign of Ω_{zw}.
In the special case that Ω_{zw}= 0, combination dominance requires 1/3 < k < 1,
so an improvement over the better individual forecast by averaging is possible within that range (and similarly for the alternative ranking). However, the larger forecast error was
as against (18), so when Ω_{zw}= 0, dominance requires (1 + k)/4 < 1,
which is bound to hold. Thus, averaging guarantees ‘insurance’, and may provide dominance when the models are differentially misspecified for a constant DGP.
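The Ω_{zw}= 0 bounds can be checked by direct arithmetic. In the sketch below (notation ours), c stands for β_{1}′Ωβ_{1} and k for the ratio of the two omitted-variable penalties, so the average's MSFE is σ^{2}_{e} + (1 + k)c/4 because the cross term vanishes:

```python
# closed-form MSFEs in the special case Omega_zw = 0
def msfes(k, c=1.0, sig2=1.0):
    m_best = sig2 + k * c            # better subset model (k < 1)
    m_worst = sig2 + c               # worse subset model
    m_avg = sig2 + (1 + k) * c / 4   # cross term vanishes when Omega_zw = 0
    return m_best, m_worst, m_avg

for k in (0.2, 0.5, 0.9):
    m_best, m_worst, m_avg = msfes(k)
    print(k, m_avg < m_best, m_avg < m_worst)
```

Averaging dominates the best model only for k > 1/3, but always beats the worst: the ‘insurance’ property.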
3.1. Scalar case
In the scalar case where n_{1}=n_{2}= 1, somewhat more transparent results can be obtained. Denote the correlation between w and z by r_{wz} and their variances by σ^{2}_{w} and σ^{2}_{z}; then domination by the average over the best requires β^{2}_{1}σ^{2}_{w} − 2β_{1}β_{2}r_{wz}σ_{w}σ_{z} < 3β^{2}_{2}σ^{2}_{z},
for ρ=σ_{w}/σ_{z} > 0 with β^{2}_{2}σ^{2}_{z}=kβ^{2}_{1}σ^{2}_{w}. Normalizing such that β_{1}=β_{2}= 1, then k= 1/ρ^{2}, so ρ > 1 and dominance requires: ρ^{2} − 2r_{wz}ρ < 3.
This is bound to hold when ρ is close to unity, and also for ρ < 3 when r_{wz} is close to +1.
Also, against the larger forecast error (again using the normalized parameter values), dominance requires ρ^{−2} − 2r_{wz}ρ^{−1} < 3,
which must always hold even when r_{wz} < 0. Thus, combination—even by averaging—seems likely to be advantageous here.
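The scalar comparisons can be evaluated directly from the population moments. The helper below (our notation; parameter values illustrative) computes the two MSFEs and that of the average, using the negative covariance between the two projection errors noted above:

```python
def scalar_msfes(beta1, beta2, s_w, s_z, r, sig2=1.0):
    """Population MSFEs for the two scalar subset models and their average."""
    v_zw = s_z**2 * (1 - r**2)           # Var(eta_zw): z unexplained by w
    v_wz = s_w**2 * (1 - r**2)           # Var(eta_wz): w unexplained by z
    cov = -r * s_w * s_z * (1 - r**2)    # Cov(eta_zw, eta_wz) -- negative sign
    m1 = sig2 + beta2**2 * v_zw
    m2 = sig2 + beta1**2 * v_wz
    m_avg = sig2 + 0.25 * (beta2**2 * v_zw + beta1**2 * v_wz
                           + 2 * beta1 * beta2 * cov)
    return m1, m2, m_avg

m1, m2, m_avg = scalar_msfes(1, 1, 1.1, 1.0, 0.5)     # rho near 1, r_wz = 0.5
print(m1, m2, m_avg)        # average dominates both models
m1b, m2b, m_avgb = scalar_msfes(1, 1, 2.0, 1.0, 0.0)  # rho = 2, r_wz = 0
print(m1b, m2b, m_avgb)     # average only 'insures': beats worst, not best
```

With ρ close to unity the average dominates both models; with ρ= 2 and r_{wz}= 0 (so k= 1/4 < 1/3) it beats only the worse one, matching the bounds above.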
5. THE ROLE OF ENCOMPASSING
When fixed weights are used (as in an average), it is easy to illustrate a case where only non-encompassed models are worth pooling. In particular, when (5) is one of the forecasting equations, averaging with any subset model, or models, will produce systematically poorer forecasts. This should hold more generally for weakly stationary processes—since all other forecasts are then inferentially redundant—and suggests testing for forecast encompassing prior to averaging (see Harvey et al. 1998, and Diebold 1989, who relate encompassing to forecast combinations). Ericsson and Marquez (1993) and Andrews et al. (1996) provide empirical examples of forecast-encompassing tests. However, Section 6.4 provides a counter-example in processes subject to location shifts, where an encompassed model may later dominate: since breaks seem pandemic in macroeconomics, no general result can be established.
When weights are estimated, two opposing forces operate. First, under weak stationarity, there is the detrimental effect of the uncertainty added by estimating the weights. Second, there is an offsetting benefit from choosing the best weights. Overall, we suspect estimation probably does not explain much of the success of pooling: whether or not the weights are estimated, combining must be better than the worst of the individual forecasts, and could beat the best. Section 8 shows that this occurs in the Monte Carlo.
When the weights are estimated by regression, then any forecast which contributes to a combination is not encompassed by the others (see Chong and Hendry 1986). Thus, estimated weights assign little role to encompassed forecasts, as their weights will be insignificant. While the need to pool violates encompassing (see Lu and Mizon 1991; Ericsson 1992), and so reveals noncongruence, congruence per se cannot be established as a necessary feature for good forecasting (see Hendry and Clements 2003). Indeed, the next section suggests that averaging might be preferable when unanticipated breaks can occur. Section 8 confirms that estimated weights need not dominate over fixed.
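Weight estimation can be sketched as follows (hypothetical error variances, not from the paper). In this stationary, no-break setting the estimated Bates–Granger weight does beat equal weights—the point of Section 6 is that location shifts can overturn exactly this ranking, because the weights are fitted to a period whose ordering of models need not persist.

```python
import numpy as np

rng = np.random.default_rng(3)

def draw_errors(n):
    # two unbiased forecast-error series: variances 1 and 2, covariance 0.3
    e1 = rng.standard_normal(n)
    e2 = 0.3 * e1 + np.sqrt(2 - 0.3**2) * rng.standard_normal(n)
    return e1, e2

e1, e2 = draw_errors(500)                   # 'past' errors used for estimation
v1, v2 = e1.var(), e2.var()
c12 = np.cov(e1, e2)[0, 1]
lam_hat = (v2 - c12) / (v1 + v2 - 2 * c12)  # estimated weight on forecast 1

f1, f2 = draw_errors(100_000)               # fresh evaluation errors

def mse(lam):
    return float(np.mean((lam * f1 + (1 - lam) * f2) ** 2))

print(lam_hat)                  # population value is 1.7/2.4 ~ 0.71
print(mse(lam_hat), mse(0.5), mse(1.0))
```

Here the estimated weight outperforms both the equal-weight average and the better individual forecast, as the stationary theory predicts.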
6. COMBINING UNDER EXTRANEOUS STRUCTURAL BREAKS
Clements and Hendry (1999), Hendry and Doornik (1997) and Hendry (2000) establish that location shifts are the problematic class of structural breaks in a forecasting context, so we focus on those. We consider a DGP where the regressor processes x_{1,t−1} and x_{2,t−1} in (5) experience breaks at different times, but the forecasting model remains unchanged. Thus, φ_{w,t} and φ_{z,t} in (8) are nonconstant, beyond being functions of past variables. The DGP for the y process in terms of w_{t} and z_{t} remains
 y_{t} = β_{1}′w_{t} + β_{2}′z_{t} + e_{t},  (21)
where e_{t}∼IN[0, σ^{2}_{e}]. As before, dynamics and intercepts are assumed absent merely to simplify the algebra, so prior to forecasting, φ_{z,t}=φ_{w,t}= 0, whereas in the forecast period
 (22)
Again, the investigators fit separate models of the form
 y_{t} = c_{1} + a′w_{t} + u_{t},  (23)
 y_{t} = c_{2} + b′z_{t} + v_{t}.  (24)
Now intercepts are included, to offset any mean values induced by location shifts. We first allow only the z process to shift by φ_{z,T+1}=μ_{z} (redefined to simplify notation), which is in fact a change at the end of the estimation sample, influencing the forecast-period behaviour of y. Since the shifts occur in the processes determining the regressors, we refer to these as extraneous breaks. Any breaks in variables that influence the DGP but are excluded from both (23) and (24) would act to influence them in a similar way to the case we examine, but without the offset from averaging over models that included the breaking variables. Breaks in the intercept of the DGP equation are noted below.
The 1-step ahead forecast from (23) is ŷ^{1}_{T+1} = ĉ_{1} + â′w_{T+1},
so the forecast error is
 ê^{1}_{T+1} = y_{T+1} − ŷ^{1}_{T+1} ≈ β_{2}′μ_{z} + β_{2}′η_{zw,T+1} + e_{T+1}.  (25)
The corresponding forecast from (24) uses ŷ^{2}_{T+1} = ĉ_{2} + b̂′z_{T+1}, with forecast error ê^{2}_{T+1} = y_{T+1} − ŷ^{2}_{T+1}.
Next, we derive the conditional biases and variances of the forecast errors. This requires the relationship equations between the regressors, of which the first is given by
 (26)
so
 (27)
Thus, from the estimation sample, prior to any shifts, and assuming least-squares estimates of in-sample parameters,
so
using (27). Again we ignore O_{p}(T^{−1}) terms arising from estimation increasing MSFEs, so
with
A break may also be induced in the model which includes z when z_{T+1} shifts, because
so κ=−Π_{wz}μ_{z}, whereas Π′_{wz}=Ω^{−1}_{zz}Ω_{zw}, leading to a forecast error of
Then the squared error is
We continue to assume that, prior to the break, the model including w is the more accurate, so β_{2}′Ω_{η_{zw}}β_{2} = kβ_{1}′Ω_{η_{wz}}β_{1} for k < 1. Then, to the approximations involved,
Consequently, ŷ^{2}_{T+1} could be the more accurate forecast here, despite being less accurate prior to the break. This is more likely the larger is μ_{z} and the less correlated are z and w—in the limit when z and w are uncorrelated, b̂ is a consistent estimator of β_{2}, and the term involving μ_{z} drops out of the MSFE for ŷ^{2}_{T+1}.
The average forecast is ȳ_{T+1} = (ŷ^{1}_{T+1} + ŷ^{2}_{T+1})/2,
with error
so
Again, ignoring terms of O_{p}(T^{−1}),
Thus, the combined forecast could beat both individual forecasts, depending on the size of the unmodelled shift in the z process relative to the error variances.
To illustrate this, we consider two simplifications: first Ω_{wz}= 0, then a scalar case in Section 6.1. Against ŷ^{1}_{T+1} (the more accurate forecast in the absence of breaks) in the first simplification, the average forecast dominates when (1 − 3k)β_{1}′Ω_{ww}β_{1} < 3(β_{2}′μ_{z})^{2},
which is bound to hold for k > 1/3 and could hold even for small k. Against the second forecast, the average dominates when (β_{2}′μ_{z})^{2} < (3 − k)β_{1}′Ω_{ww}β_{1}.
If we approximate by k= 1, then both hold when 6β_{1}′Ω_{ww}β_{1} > 3(β_{2}′μ_{z})^{2} > −2β_{1}′Ω_{ww}β_{1}, where the last inequality must be true.
where the last inequality must be true. If instead, k is small, then
Thus, irrespective of whether k is large or small, the average can ‘win’ against both misspecified forecasting devices when the DGP experiences location shifts.
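The Ω_{wz}= 0 case with k near unity can be simulated directly. In the sketch below (illustrative values, not from the paper), the z process shifts by μ_{z}= 1.2 at the forecast origin; the model omitting z then fails badly, the model including z is unaffected (its regressors are observed), and the average beats both:

```python
import numpy as np

rng = np.random.default_rng(4)
beta1, beta2, mu_z = 1.0, 1.0, 1.2   # k ~ 1, and mu_z^2 < 2, inside the range above
T, H = 50_000, 100_000

# in-sample: w and z uncorrelated (the Omega_wz = 0 simplification), no shifts
w, z = rng.standard_normal(T), rng.standard_normal(T)
y = beta1 * w + beta2 * z + rng.standard_normal(T)
a = (w @ y) / (w @ w)                # model using w only
b = (z @ y) / (z @ z)                # model using z only

# forecast period: the z process shifts by mu_z; both w_{T+1} and z_{T+1}
# are observed at the forecast origin, but the w-model cannot use z
wf = rng.standard_normal(H)
zf = mu_z + rng.standard_normal(H)
yf = beta1 * wf + beta2 * zf + rng.standard_normal(H)

def mse(f):
    return float(np.mean((yf - f) ** 2))

m_w, m_z = mse(a * wf), mse(b * zf)
m_avg = mse(0.5 * (a * wf + b * zf))
print(m_w, m_z, m_avg)   # ~ 3.44, ~ 2.0, ~ 1.86: the average beats both
```

Halving the weight on the failing device halves its bias but quarters its squared-bias contribution, which is why the average can win outright after the shift.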
6.1. Scalar illustration
In the scalar case when n_{1}=n_{2}= 1, using the approach in Section 3.1
with
Against ŷ^{1}_{T+1}, the average outperforms in the normalized case if (as r_{wz}=ρπ_{wz} and kρ^{2}= 1)
When ρ is close to unity and r_{wz} is large, this reduces to
 (28)
which must hold. Alternatively, if r_{wz}= 0, then
which will hold when the relative break is sufficiently large.
Against ŷ^{2}_{T+1}, the average dominates if
As before, when ρ is close to unity and r_{wz} is large, we replicate (28). And if r_{wz}= 0, dominance requires
Thus, dominance over both individual models simultaneously requires
We conclude that there is a wide range over which averaging will dominate.
6.2. Later breaks
If, in a later forecast period, there is a break in the other process, then a similar analysis applies with the initial rankings of the individual models reversed. The algebra naturally becomes tedious, but the outcome must depend on both the absolute and relative sizes of the breaks, whether earlier breaks were modelled or not, the robustness of devices to breaks, and the sizes of the signal–noise ratios. There must exist combinations in which the average dominates over individual forecasting devices, on average over repeated forecasting episodes, because other devices swing from good to bad performance. Such later breaks may also vitiate the estimation of weights: when a method is doing well because it had not previously suffered forecast failure, estimation will attribute an above-average weight to it. Any later shift in that ‘current-best’ device would induce poorer performance than just the average.
6.3. Breaks in falsely included variables
If some of the variables that are included with nonzero coefficients in forecasting models are in fact irrelevant, then an analogous derivation is feasible to show that the effects of breaks favour combination. When such variables experience a location shift, the forecasts from that model will be poor, since the dependent variable will not have been affected. Any average will attribute a smaller weight than unity to such a set of forecasts, and so outperform it. Later breaks in other variables in rival models will similarly worsen their performance, leaving the average as the ‘winner’.
6.4. Withinequation breaks
Finally, a break in the y process introduces further complications, depending on the class of models under analysis. When a break occurs after forecasts are announced, all devices will fail, usually in the same direction, so averaging will neither resolve nor exacerbate that problem. However, some methods will continue to fail for many later periods—especially equilibrium-correction models (EqCMs)—again usually in the same direction (see e.g. Clements and Hendry 1999). If the EqCMs were previously the dominant approach, then we have the analogue of the conditions in Section 6, namely a switch in ranking between methods pre- and post-break, precisely the situation when averaging can dominate on average. Now, however, in the sub-periods, the average may or may not dominate. Moreover, estimated weights would emphasize the near encompassing of an EqCM over (say) a first-differenced autoregression, so could do less well than the average. Indeed, when simple—but robust—forecasting devices are encompassed by the EqCM, and so excluded from pooling, we have a counter-example to any claim that only non-encompassed models should be included in the average.
6.5. Pooling information
In the present context, pooling of information should prove more successful than pooling forecasts for all extraneous breaks in correctly included variables, but not for breaks in the equation of interest, however generated. Since there are often many variables involved, the former type of break should be more frequent than the latter, supporting pooling information. On the other hand, false inclusion of variables that later break will be detrimental. In Hendry and Clements (2001), we explore these ideas to investigate the apparent success of ‘factor forecasts’, or diffusion indices, as in Stock and Watson (1999) and Forni et al. (2000).
Moreover, extraneous breaks become endogenous in a system, so our approach also points to an explanation for why multistep (or dynamic) estimation may be advantageous: see Chevillon (2000). Conversely, when different transformations (e.g., log and linear) of the same variable are involved, pooling information seems less likely to dominate.
7. EMPIRICAL ILLUSTRATION
Bates and Granger (1969) provide an example of the usefulness of combining forecasts from linear and exponential trend models of output. Table 1 records an output index for the U.K. gas, electricity and water sectors for the years 1948 to 1965, along with forecast errors from linear and exponential trend models of output {y_{t}}, given by y_{t} = α + βt + error_{t} and ln(y_{t}) = a + bt + error_{t}, where t is a linear time trend. The forecast error in period t (t= 1950, …, 1965) is calculated from a forecast based on estimating the model on data up to t − 1. The results in the table show that although the exponential model forecasts have a much smaller sum of squared errors (SSE) than the linear model, nevertheless, a combination which attaches a small weight to the linear forecasts has a smaller SSE. For example, for a fixed weight of 0.16 on the linear forecasts, the combined forecast SSE is 78.8.^{4} This clearly supports combination, but it is of interest to interpret how the gain comes about given our analysis.
Table 1. Forecasts of output indices, 1950–1965.

                        1-step forecast errors
Year    Actual    Linear    Exponential    Combination    Linear (bias-corrected)    Exponential (bias-corrected)
1948     58.0
1949     62.0
1950     67.0      1.0       0.7            0.77           1.0                        0.7
1951     72.0      0.7       0.1            0.21          −0.3                       −0.6
1952     74.0     −2.5      −3.4           −3.24          −3.3                       −3.8
1953     77.0     −2.2      −3.3           −3.11          −1.9                       −2.4
1954     84.0      2.1       0.8            0.99           2.8                        2.2
1955     88.0      1.0      −0.6           −0.37           1.2                        0.4
1956     92.0      0.4      −1.7           −1.33           0.4                       −0.7
1957     96.0      0.0      −2.5           −2.08          −0.0                       −1.4
1958    100.0     −0.2      −3.2           −2.71          −0.3                       −2.0
1959    103.0     −1.3      −4.8           −4.28          −1.4                       −3.4
1960    110.0      1.9      −2.1           −1.47           2.0                       −0.3
1961    116.0      3.2      −1.4           −0.71           3.1                        0.4
1962    125.0      7.0       1.8            2.60           6.7                        3.5
1963    133.0      8.8       2.8            3.74           8.0                        4.3
1964    137.0      6.1      −0.9            0.26           4.7                        0.3
1965    145.0      8.0      −0.0            1.26           6.3                        1.1
Sample bias        2.1      −1.1           −0.6            1.8                       −0.1
Sum of squared errors    263.3    84.4      78.8         211.9                       77.0
The forecast errors from the linear model become large and positive from around 1961 onwards, indicating that the constant-absolute-increase model is inappropriate. On average, the exponential model over-predicts (negative errors), albeit to a lesser extent. Combination is seen to work by tempering the negative errors of the more accurate exponential model with the predominantly positive errors of the linear model over the 1955–1961 period. This view is supported by the SSEs of the bias-corrected forecast errors (see the last two columns of the table), and the results of combining the bias-corrected forecasts. The bias-corrected forecast of period t is calculated by adding the sample mean of the forecast errors up to period t − 1 to the forecast of period t. Because the bias term is calculated from past forecast errors up to that point, it adapts only slowly to the run of positive errors in the linear forecasts of the 1960s. The SSE of the bias-corrected exponential forecasts is 77, less than the combined forecast SSE of 78.8 (with a weight of 0.16), but more pertinently, we find that any fixed-weight combination of the bias-corrected forecasts, with weights in the interval (0, 1), has a larger SSE than that of the exponential model forecasts.^{5} Of course, the fixed-weight combination forecasts discussed are not feasible, in the sense that they are based on knowledge of the full set of forecast errors. Moreover, fixed weights can also be improved upon by varying-weight schemes, as shown by Bates and Granger (1969). This example shows that gains from combination may disappear if the individual forecasts are first bias-corrected, consistent with our derivation for the no-breaks case, in which combination works by exploiting offsetting biases.
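The recursive scheme just described can be reproduced from the Table 1 data. The sketch below (our implementation of the stated procedure) re-estimates each trend model by OLS on data up to t − 1, forecasts period t, and then combines the errors with the fixed weight of 0.16 on the linear forecasts; it reproduces the reported errors to rounding.

```python
import math

# output index for U.K. gas, electricity and water, 1948-1965 (Table 1)
y = [58, 62, 67, 72, 74, 77, 84, 88, 92, 96, 100, 103, 110, 116, 125, 133, 137, 145]

def trend_forecast(vals):
    """OLS fit of vals on a constant and linear trend; 1-step ahead forecast."""
    n = len(vals)
    tbar = (n - 1) / 2
    ybar = sum(vals) / n
    sxx = sum((t - tbar) ** 2 for t in range(n))
    sxy = sum((t - tbar) * (v - ybar) for t, v in enumerate(vals))
    b = sxy / sxx
    return ybar + b * (n - tbar)

e_lin, e_exp = [], []
for t in range(2, len(y)):   # recursive 1-step forecasts of 1950-1965
    e_lin.append(y[t] - trend_forecast(y[:t]))
    e_exp.append(y[t] - math.exp(trend_forecast([math.log(v) for v in y[:t]])))

lam = 0.16                   # fixed weight on the linear forecast
e_comb = [lam * el + (1 - lam) * ee for el, ee in zip(e_lin, e_exp)]

def sse(errors):
    return sum(e * e for e in errors)

print(sse(e_lin), sse(e_exp), sse(e_comb))   # ~ 263.3, 84.4, 78.8
```

Because the forecasts combine linearly, the combined error is the same weighted average of the individual errors, which is all the combination step requires.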
A final implication, given the autocorrelated forecast errors, is that ICs or differencing should improve the forecasts. For the latter, the SSEs become 73.9 and 59.0 for the linear and exponential models respectively, providing a dramatic improvement for the former, and a smaller—but worthwhile—gain for the latter, which now does better than any combination. Clements and Hendry (1999) treat inappropriate specification or estimation of deterministic terms as near equivalents of shifts in those terms, so such an interpretation is also consistent with the present gains from combination and differencing.
10. CONCLUSION
Practical experience shows that combining forecasts adds value, and can even dominate the best individual device. Thus, we considered selecting a forecasting method by pooling several individual devices when no model coincides with a non-constant DGP.
We first show that averaging guarantees ‘insurance’, and may provide dominance, when the models are differentially misspecified even for a constant DGP. While such a result can occur in weakly stationary processes, we suspect that the empirical findings are better explained by the intermittent occurrence of location shifts in unmodelled explanatory variables. Consequently, we demonstrate that when forecasting time series that are subject to location shifts, the average of a group of forecasts from differentially misspecified models can outperform them all on average over repeated forecasting episodes. Moreover, averaging may well then dominate over estimated weights in the combination. Finally, it cannot be proved that only non-encompassed devices should be retained in the combination.
In practice, trimmed means, or perhaps medians, might be needed to exclude ‘outlying’ forecasts, since otherwise, one really poor forecast could needlessly worsen a combination.
Both the empirical and Monte Carlo simulation illustrations confirmed the theoretical analysis. The average of the levels forecasts outperformed the best individual forecasts in both settings, sometimes spectacularly. However, in the empirical example, bias correcting the forecasts removed much of the benefit of averaging, and other devices for robustifying forecasts to breaks did even better. Thus, although we have established that combination can be beneficial in our theoretical framework, comparisons with other approaches are merited.
Hendry and Clements (2003) present 10 cases where wellknown empirical phenomena in economic forecasting can be explained by a theory of misspecified models of processes that experience intermittent location shifts. The present paper extends that list to 11 cases. We believe that the related results on forecasting using ‘factor models’ can be accounted for by the same general theory, and are also investigating multistep estimation within that framework.