Guest Editors’ Introduction: Information in Economic Forecasting


I. Overview

This Special Issue of the Oxford Bulletin is on the theme Information in Economic Forecasting. It comprises 10 papers, preceded by our introduction, which investigate a number of different aspects of the role of information. Information is manifestly central to all forms of forecasting, yet many questions remain concerning precisely what information should be used, what form it should take, how it should be selected, how models based on it should be formulated, and what role that information should play in those models. Debate persists over whether more information is always better: if so, one must explain why there are repeated claims that large models forecast poorly; if not, why does additional information reduce forecast accuracy? Possible answers include estimation uncertainty, collinearity between explanatory variables, data measurement errors and structural breaks. Alternatively, one might ask what types of information might improve forecasting: theory, causal variables, leading indicators, or something else? Are ‘better’ models of past data also better for forecasting? What information should be collected that is not currently available? The papers reported here, after rigorous reviewing and revision, set out to address the main issues lying behind these, and related, questions.

Unfortunately, ‘information’ is not an unambiguous concept in standard usage. Universally, it denotes the contents of the available information set on which forecasts are conditioned, denoted below by ℐt−1. However, it may also relate to knowledge as to how ℐt−1 enters the conditional distribution of the variables to be forecast, or even which elements of ℐt−1 ‘really matter’ (as some may be irrelevant if ℐt−1 is the σ-field generated by the history of all variables under consideration). This switch in meaning can be confusing, and helps explain the conflict between themes (A) and (B) below, as the former refers to using the largest set of relevant ℐt−1, whereas the latter is more about knowledge as to the relevant components of ℐt−1, and how they affect the forecast distribution.

We perceive four themes, which comprise analysing:

  • (A) the role of more information in forecasting, including the theory and practice of factor forecasting, the advantages and disadvantages of disaggregation (across variables and data frequencies), the role of forecast combinations, and information from leading indicators;
  • (B) the role of less information via imposing restrictions, both theory-based and data-based information, including selecting models for forecasting, and any possible benefits of parsimony in reducing estimation uncertainty and susceptibility to breaks;
  • (C) transformed information, including model transformations (such as differencing and intercept corrections [IC]), transformations of variables (including nonlinearities), and collinearity reductions;
  • (D) the role of evaluation information in forecasting, including historical forecast comparisons, interval and density forecasts, and forecast encompassing.

After presenting the background to these themes in section II, we will review the properties of unpredictability in section III and establish its main implications (based on Hendry, 1997), extending earlier results to non-stationary processes. Two important theorems about the role of information are established therein, and 10 steps linking predictability to forecastability are delineated in a taxonomy. Then section IV considers the implications of that taxonomy for the formulation of forecasting devices, noting two more theorems that act to limit the benefits of additional information in the context of practical forecasting. Section V investigates the specific setting of a cointegrated data generating process (DGP) subject to breaks, and examines some adaptations which might improve robustness in forecasting. Cointegrated vector autoregressions (VARs) are a natural DGP to study from a forecasting perspective, given that VARs in levels, differences, and with unit root and cointegration restrictions, are commonly used in macroeconomic forecasting. Section VI provides a summary and overview of the papers in this issue in relation to our discussion of the role of information in economic forecasting.

II. Background

The historical track record of econometric forecasting is littered both with forecast failures and with episodes in which econometric models were empirically out-performed by ‘naive devices’: see, for example, many of the papers reprinted in Mills (1999). This adverse outcome for econometric systems is surprising since they incorporate inter-temporal economic theory-based causal information representing inertial dynamics in the economy: such models should have smaller prediction errors than purely extrapolative devices – but often do not. Discussions of the problems confronting economic forecasting date from the early history of econometrics: see, inter alia, Persons (1924), Morgenstern (1928) and Marget (1929). To explain such outcomes, Clements and Hendry (1996a, b, 1998, 1999) developed a theory of forecasting for non-stationary processes subject to structural breaks, where the forecasting model differed from the data generating mechanism (extended from a theory implicitly based on the assumptions that the model coincided with a constant-parameter mechanism). They thereby accounted for the successes and failures of various alternative forecasting approaches, and helped explain the outcomes of forecasting competitions (see, e.g. Makridakis and Hibon, 2000; Clements and Hendry, 2001; Fildes and Ord, 2002). Nevertheless, there remained a conflict between the intuitive notion that more relevant information should help in forecasting, and the hard reality that attempts to make it do so have not been uniformly successful.

To set the scene more formally, consider T observations x1,…, xT on a vector random variable, where we wish to forecast the H future values xT+1,…, xT+H. The joint probability of the observed and future xs is DX(x1,…, xT+H|X0, θ), where θ ∈ Θ ⊆ ℜp is the relevant parameter vector, and X0 denotes the initial conditions. Factorizing into conditional (forecast) and marginal (in-sample) probabilities:

DX(x1,…, xT+H|X0, θ) = DX(xT+1,…, xT+H|x1,…, xT, X0, θ) DX(x1,…, xT|X0, θ). (1)

DX(xT+1,…, xT+H|x1,…, xT, X0, θ) is obviously unknown at the forecast origin T, so must be derived from DX(x1,…, xT|X0, θ), which requires the ‘basic assumption’ that:

‘The probability law DX(x1,…, xT+H|X0, θ) of the T + H variables (x1,…, xT+H) is of such a type that the specification of DX(x1,…, xT|X0, θ) implies the complete specification of DX(xT+1,…, xT+H|X0, θ) and, therefore, of DX(xT+1,…, xT+H|x1,…, xT, X0, θ).’ (Haavelmo, 1944, p. 107: our notation).

Haavelmo's formulation highlights the major problems that need to be confronted for successful forecasting. The form of DX(x1,…, xT|X0, θ) and the value of θ in-sample must be learned from the observed data, involving problems of specification of the set of relevant variables {xt}, measurement of the xs, formulation of the joint density DX(x1,…, xT|X0, θ), modelling of the relationships, selection of the relevant connections, and estimation of θ, all of which introduce uncertainties, the baseline level of which is set by the properties of DX(x1,…, xT|X0, θ). When forecasting, the actual future form of DX(xT+1,…, xT+H|x1,…, xT, X0, θ) determines the ‘intrinsic’ uncertainty, growing as H increases (especially for non-stationary data from stochastic trends, etc.), further increased by any changes in the distribution function DX(·), or parameters thereof, between T and T + H (lack of time invariance). These 10 italicized issues structure the analysis of economic forecasting, but Clements and Hendry (1998, 1999) emphasized the importance of the last of these in determining forecast failure.

Here, we develop a complementary explanation based on the many steps between the ability to predict a random variable at a point in time, and a forecast of the realizations of that variable over a future horizon from a model based on an historical sample. This overview spells out those various steps, and also demonstrates that many of the results on forecasting in Clements and Hendry (1998, 1999) have a foundation in the properties of unpredictability. Having established such foundations, we draw some implications for forecasting non-stationary processes using incomplete (i.e. mis-specified) models, via a ‘forecasting strategy’ which uses a combination of ‘causal’ information with ‘robustification’ of the forecasting device. Such a combination could be either by rendering the econometric system robust, or by modifying a robust device using an estimate of any likely causal changes: for the latter, in the policy context, see Hendry and Mizon (2000, 2005).

Themes (A) and (B) above deliberately conflict, matching the contrasting views in the literature. Both cannot be correct, yet both have staunch supporters. Theme (A) includes studies using factor forecasting by Forni et al. (2000), Stock and Watson (2002) and Amstad and Fischer (2004) among many others, which provide empirical support for the value of extensive information sets. Combining forecasts also has a long pedigree (since Francis Galton, according to Surowiecki (2004): see, e.g. Bates and Granger, 1969; Diebold and Pauly, 1987; Clemen, 1989; Diebold and Lopez, 1996; Stock and Watson, 1999; Newbold and Harvey, 2002; Clements and Galvão, 2005a), as well as theories for its success (see Granger, 1989; Hendry and Clements, 2004), and again suggests that more information helps. Theme A also includes disaggregation across variables, with recent examples including Espasa et al. (2002) and Hubrich (2005). Recent research by Ghysels et al. (2004b) and Ghysels et al. (2004a) suggests a way of disaggregating in terms of data frequency: their MIDAS (MIxed Data Sampling) approach allows the regressand and regressors to be sampled at different frequencies, allowing higher frequency data to be used in forecasting. The MIDAS approach can be viewed as an alternative way of exploiting information in intraday data compared to the ‘realized volatility’ approach (see Andersen et al., 2003, inter alia). Clements and Galvão (2005b) show that the MIDAS approach can also be useful in macroeconomic forecasting contexts.
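The pooling intuition behind combining forecasts can be illustrated with a minimal simulation (a sketch of ours, not drawn from any of the papers cited): two stylized unbiased forecasts with independent, equal-variance errors are averaged with equal weights, the simplest Bates–Granger combination, which roughly halves the MSFE.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 10_000
y = rng.normal(size=T)  # illustrative target series

# Two stylized unbiased forecasts with independent, equal-variance errors
f1 = y + rng.normal(scale=1.0, size=T)
f2 = y + rng.normal(scale=1.0, size=T)
combo = 0.5 * (f1 + f2)  # equal-weight (Bates-Granger) combination

def msfe(f):
    """Mean-square forecast error against the realized y."""
    return float(np.mean((y - f) ** 2))

print(msfe(f1), msfe(f2), msfe(combo))
# With independent errors, the combined MSFE is close to half the individual ones.
```

When the component errors are correlated, or the forecasts are biased, the optimal weights differ from equality, but the equal-weight pool remains a surprisingly hard benchmark to beat.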

Theme (B) is a well-known claim in the folklore of economic forecasting, as well as an inference from ‘forecasting competitions’, such as in Makridakis and Hibon (2000) and Fildes and Ord (2002). The forecasting literature relies heavily on historical track records of which devices did well or badly on some (usually many) data sets using many measures, from which inductive inferences are made as to why the given outcomes eventuated. While one must applaud the empirical emphasis, a sound theoretical framework for interpreting the findings is often lacking. As an example, Clements and Hendry (2001) suggest that mistaken inferences concerning the role of parsimony in forecasting can arise when simplicity and robustness are confounded.1

A general framework must allow for an economy lacking time invariance from various forms of non-stationarity, with both slowly evolving and sudden shifts in distributions (so the future does not simply replicate the past), using forecasting devices that are mis-specified for the underlying DGP, and estimated from (possibly mismeasured) historical data. Hendry and Clements (2003) argued that such an approach explained a large fraction of the available evidence about forecast performance, and suggested some solutions that might be effective. Four provable theorems underpin our explanation here. To describe them, we must distinguish predictability from forecastability. The former is a relationship between a random variable νt and an information set ℐt−1, such that νt is predictable if its distribution conditional on ℐt−1 differs from its unconditional distribution. Forecastability concerns our ability to make ‘useful’ statements about a future outcome, where ‘useful’ can vary according to context. Then the first theorem from the theoretical analysis in section III is that:

more information unambiguously does not worsen predictability, even in intrinsically non-stationary processes.

Unfortunately, more information cannot, in general, be shown to improve forecasting. Even in a stationary environment, with no measurement errors and no model mis-specification, when the parameters of the forecasting model have to be estimated from sample data, the estimated DGP need not produce the best forecasts. Our second theorem applies to stationary processes:

forecasting using a model which retains all relevant variables with non-centralities τ2 greater than unity in their t2-distributions for testing the null of irrelevance will dominate in one-step forecast accuracy, at least as measured by mean-square forecast error.

As explained in section IV, τ2 > 1 translates into an expected t2 > 2. That result, related to model selection using the AIC (the Akaike information criterion: see Akaike, 1985), stems from results in Chong and Hendry (1986), extended by Clements and Hendry (1998, 1999). Since ‘conventional’ critical values, such as 5% or 1%, usually entail t2 > 4, much larger models would be retained under this criterion than result from standard practice. Section IV briefly explores model selection for forecasting. Together, these two theorems strongly support theme (A) and clearly run counter to the conventional wisdom that parsimony is necessary in forecasting, as they suggest that more information is usually good.
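The retention trade-off can be illustrated numerically (a hedged sketch under a simple DGP of our own choosing, not the authors' calculation): with yt = bxt + νt and unit error and regressor variances, the non-centrality is τ2 = Tb2, and one-step MSFE comparisons of keeping versus omitting x line up with the τ2 > 1 rule.

```python
import numpy as np

rng = np.random.default_rng(5)
T, reps = 50, 5_000

def one_step_msfe(b):
    """Average 1-step squared errors: keep x (OLS-estimated) vs. omit x."""
    se_keep, se_omit = 0.0, 0.0
    for _ in range(reps):
        x = rng.normal(size=T + 1)
        y = b * x + rng.normal(size=T + 1)
        bhat = (x[:T] @ y[:T]) / (x[:T] @ x[:T])  # OLS on the estimation sample
        se_keep += (y[T] - bhat * x[T]) ** 2      # forecast retaining x
        se_omit += (y[T] - 0.0) ** 2              # forecast omitting x
    return se_keep / reps, se_omit / reps

# Here tau^2 = T * b^2 (unit error and regressor variances).
print(one_step_msfe(0.05))  # tau^2 = 0.125: little or nothing gained by keeping x
print(one_step_msfe(0.50))  # tau^2 = 12.5: keeping x clearly lowers the MSFE
```

Keeping the regressor costs roughly a factor (1 + 1/T) in MSFE from estimation uncertainty, whereas omitting it costs b2, so the comparison turns on whether Tb2 exceeds unity.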

The third theorem in section III returns to the context of predictability, and shows that:

less information should not induce predictive failure.

Such a result implies that the costs of reducing information are primarily inefficiency, not failure, thereby allowing for the possibility that parsimony need not be too expensive. However, the fourth theorem, demonstrated in Clements and Hendry (1998) and noted in section IV, acts as a powerful negative counter to theme (A):

when the process to be forecast lacks time invariance, ‘causal’ models, no matter how significant all their estimated parameters, need not outperform naive ‘robust’ devices in forecasting, even though the ‘causal’ models have dominated in-sample.

Thus, for forecasting variables from processes that lack time invariance, theoretical analysis cannot establish the pre-eminence of more information over less – theme (B). Moreover, re-interpreting the estimation result in the second theorem suggests that estimation and selection uncertainty remain important unless variables play a substantive role. Since τ2 is unknown, it must be inferred from the observed t2. Under the null that a variable is irrelevant, the mean value of t2 is unity, and the probability that t2 > 2 is about 16%. Thus, using the loose criterion of t2 > 2 would entail retaining many irrelevant variables, where their parameter estimates just added noise to a forecast, leading to larger errors and pointing back to more parsimonious models doing better. The practical trade-off between false retention of irrelevant variables and false exclusion of relevant variables needs careful consideration, addressed in section IV.
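The 16% figure quoted above is easy to verify (a numerical aside of ours, assuming the usual asymptotic χ2(1) distribution for t2 under the null): P(χ2(1) > 2) = P(|Z| > √2) = 2(1 − Φ(√2)) = 1 − erf(1).

```python
import math

# Under the null of irrelevance, t^2 is asymptotically chi-squared(1),
# whose mean is unity. The retention probability at the cut-off t^2 > 2 is
# P(chi2_1 > 2) = P(|Z| > sqrt(2)) = 2*(1 - Phi(sqrt(2))) = 1 - erf(1).
p_retain = 1.0 - math.erf(1.0)
print(round(p_retain, 3))  # about 0.157, i.e. roughly 16%
```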

An implication of these four theorems (explored in Hendry, 2005), is that to use congruent encompassing causal models in forecasting mode, it may be essential to transform them to achieve some robustness to ex post breaks, which is theme (C). Section V discusses transforming models for use in forecasting mode.

Now the role of theme (D) becomes clearer: in part, evaluation helps to check in-sample dominance, provide selection rules, and discern the onset of forecast failure ex post. But the value of evaluation information for forecasting also depends partly on how a given model is used in forecasting mode. In-sample congruence may be of no help in either improving the accuracy of forecasting, or dominating rival methods, when location shifts occur. Conversely, in-sample congruence might be invaluable in building an undominated model of the local DGP when the resulting model is robustified before forecasting.

III. Unpredictability: a review and extension

A non-degenerate vector random variable νt is an unpredictable process with respect to an information set ℐt−1 over a period 𝒯 if its conditional distribution Dνt(νt|ℐt−1) equals its unconditional Dνt(νt):

Dνt(νt|ℐt−1) = Dνt(νt) ∀t ∈ 𝒯. (2)
Importantly, unpredictability is a property of νt in relation to ℐt−1 intrinsic to νt, and not dependent on any aspect of our knowledge thereof: this is one of the key gaps between predictability, which holds when equation (2) is false, and ‘forecastability’. Note that 𝒯 may be a singleton (i.e. {t}), and that ℐt−1 always includes the sigma-field generated by the past of νt.

A necessary condition for equation (2) is that νt is unpredictable in mean (denoted Et) and variance (denoted Vt) at each point t in 𝒯, so assuming the relevant moments exist:

Et[νt|ℐt−1] = Et[νt] and Vt[νt|ℐt−1] = Vt[νt]. (3)
The former does not imply the latter (a predictable conditional mean with a randomly heteroscedastic variance), or vice versa (e.g. an autoregressive conditional heteroscedastic – ARCH – process, as in (7) below, affecting a martingale difference sequence). Throughout, we will take the mean of the unpredictable process to be zero: Et[νt]=0 ∀t. Since we will be concerned with the predictability of functions of νt and ℐt−1, such as equation (6) below, any mean otherwise present could be absorbed in the latter. Due to possible shifts in the underlying distributions, both the information set available and all expectation operators must be time dated, which anyway clarifies multistep prediction as in ET+h[νT+h|ℐT] for h > 1. The paper will focus on the first two moments in equation (3), rather than the complete density in equation (2), although extensions to the latter are feasible (see, e.g. Tay and Wallis, 2000): however, for normal distributions, equation (3) suffices.

Unpredictability is only invariant under non-singular contemporaneous transforms: inter-temporal transforms must affect predictability (so no unique measure of forecast accuracy exists: see, e.g. Leitch and Tanner, 1991; Clements and Hendry, 1993a; Granger and Pesaran, 2000a, b). Predictability therefore requires combinations with ℐt−1, as for example:

yt = ft(ℐt−1, νt), (4)

so yt depends on both the information set and the innovation component. Then:

Dyt(yt|ℐt−1) ≠ Dyt(yt). (5)
Two special cases of equation (4) are probably the most relevant empirically in economics, namely (after appropriate data transformations, such as logs):

yt = ft(ℐt−1) + νt (6)

and:

yt = νt ⊙ ϕt(ℐt−1), (7)

where ⊙ denotes element by element multiplication, so that yi,t=νi,tϕi,t(ℐt−1). Combinations and generalizations of these are clearly feasible and are also potentially relevant.

In equation (6), yt is predictable in mean even if νt is not, as:

Et[yt|ℐt−1] = ft(ℐt−1) ≠ Et[yt]

in general. Thus, the ‘events’ which will help predict yt in equation (6) must already have happened, and a forecaster ‘merely’ needs to ascertain what ft(ℐt−1) comprises. The dependence of yt on ℐt−1 could be indirect (e.g. own lags may ‘capture’ actual past causes) as systematic correlations over the relevant horizon could suffice for forecasting – if not for policy. However, such stable correlations are unlikely in economic time series (a point made by Koopmans, 1937). The converse to equation (6) in linear models is well known in terms of the prediction decomposition (sequential factorization) of the likelihood (see, e.g. Schweppe, 1965): if a random variable yt is predictable from ℐt−1, as in equation (6), then it can be decomposed into two orthogonal components, one of which is unpredictable on ℐt−1 (i.e. νt here), so is a mean innovation. Since:

Vt[yt] = Vt[ft(ℐt−1)] + Vt[νt] ≥ Vt[νt] = Vt[yt|ℐt−1], (8)

predictability ensures a variance reduction, consistent with its nomenclature, since unpredictability entails equality in equation (8): the ‘smaller’ the conditional variance matrix, the less uncertain is the prediction of yt from ℐt−1.
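The variance reduction in equation (8) can be checked by simulation (an illustrative sketch of ours: a scalar yt with a linear ft and standard normal components, all assumptions our own):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 100_000
info = rng.normal(size=T)           # stand-in for f_t's argument in I_{t-1}
nu = rng.normal(scale=1.0, size=T)  # unpredictable component
y = 0.8 * info + nu                 # y_t = f_t(I_{t-1}) + nu_t, with f linear

# Unconditional variance of y vs. variance of the unpredictable remainder:
print(float(np.var(y)), float(np.var(nu)))  # roughly 1.64 vs. roughly 1.0
```

The gap between the two variances is V[ft(ℐt−1)] = 0.64 here, exactly the component that conditioning on ℐt−1 removes.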

Although yt remains unpredictable in mean in equation (7):

Et[yt|ℐt−1] = ϕt(ℐt−1) ⊙ Et[νt|ℐt−1] = 0,

it is predictable in variance because:

Vt[yi,t|ℐt−1] = ϕi,t(ℐt−1)²Vt[νi,t] ≠ Vt[yi,t].

A well-known special case of equation (7) of considerable relevance in financial markets is when ℐt−1 is the sigma field generated by the past of yt. For a scalar yt with constant σν² = 1 and ϕ(·)=σt, this yields:

yt = σtνt,

so that (G)ARCH processes are generated by (see, e.g. Engle, 1982; Bollerslev, 1986; Shephard, 1996 provides an excellent overview):

σt² = ω + α1yt−1² + β1σt−1². (9)

Alternatively, ϕ(·)=exp (σt/2) leads to stochastic volatility (here as a first-order process: see, e.g. Taylor, 1986; Kim et al., 1998; and again, Shephard, 1996):

yt = νt exp (σt/2) with σt = γ + ψσt−1 + ηt. (10)
In both classes of models (9) and (10), predictability of the variance can be important in its own right (e.g. pricing options as in Melino and Turnbull, 1990), or for deriving appropriate interval and density forecasts.
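These properties are easy to exhibit by simulating a GARCH(1,1) process of the kind just described (the parameter values are illustrative choices of ours): the level of yt is serially uncorrelated, while yt² is clearly autocorrelated, so the variance, but not the mean, is predictable from the past of yt.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50_000
omega, alpha, beta = 0.1, 0.1, 0.8  # illustrative GARCH(1,1) parameters
y = np.empty(T)
sig2 = np.empty(T)
sig2[0] = omega / (1 - alpha - beta)  # start at the unconditional variance
y[0] = np.sqrt(sig2[0]) * rng.normal()
for t in range(1, T):
    sig2[t] = omega + alpha * y[t - 1] ** 2 + beta * sig2[t - 1]
    y[t] = np.sqrt(sig2[t]) * rng.normal()

def ac1(x):
    """First-order sample autocorrelation."""
    return float(np.corrcoef(x[:-1], x[1:])[0, 1])

# Levels are serially uncorrelated; squares are autocorrelated, so the
# variance (but not the mean) is predictable from the past of y.
print(ac1(y), ac1(y ** 2))
```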

3.1. Prediction from a changed information set

Predictability is obviously relative to the information set used – when 𝒥t−1 ⊂ ℐt−1, it is possible that:

Dνt(νt|𝒥t−1) = Dνt(νt) even though Dνt(νt|ℐt−1) ≠ Dνt(νt). (11)
This result helps underpin both general-to-specific model selection and the related use of congruence as a basis for econometric modelling (see, e.g. Hendry, 1995; Bontemps and Mizon, 2003). In terms of the former, less is learned based on 𝒥t−1 than on ℐt−1, and the variance (where it exists) of the unpredictable component is unnecessarily large. In terms of the latter, a later investigator may discover additional information in ℐt−1 beyond 𝒥t−1 which explains part of a previously unpredictable error.

Given a reduced information set 𝒥t−1 ⊂ ℐt−1, when the process to be predicted is yt=ft(ℐt−1) + νt as in equation (6), less accurate predictions will result, but they will remain unbiased. Since Et[νt|ℐt−1]=0:

Et[yt|ℐt−1] = ft(ℐt−1),

so that:

Et[yt|𝒥t−1] = Et[ft(ℐt−1)|𝒥t−1] = gt(𝒥t−1),

say. Let et=yt − gt(𝒥t−1), then, providing 𝒥t−1 is a proper information set containing the history of the process:

Et[et|𝒥t−1] = Et[yt|𝒥t−1] − gt(𝒥t−1) = 0,

so et is a mean innovation with respect to 𝒥t−1. However, as et=νt + ft(ℐt−1) − gt(𝒥t−1):

Et[et|ℐt−1] = ft(ℐt−1) − gt(𝒥t−1) ≠ 0.

As a consequence of this failure of et to be an innovation with respect to ℐt−1:

Vt[et] = Vt[νt] + Vt[ft(ℐt−1) − gt(𝒥t−1)] ≥ Vt[νt] = Vt[yt|ℐt−1],

so less accurate predictions will result. Nevertheless, that predictions remain unbiased on the reduced information set suggests that, by itself, incomplete information may not be fatal to the forecasting enterprise. Less (relevant) information only reduces the precision of the forecasts. This establishes our first and third theorems in section II.
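A small simulation makes the point concrete (a sketch under a linear DGP of our own: the full information set contains {x1, x2}, the reduced one only {x1}): the reduced-information prediction error stays mean zero, but its variance exceeds that of the full-information error by the variance of the omitted term.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 200_000
x1 = rng.normal(size=T)
x2 = rng.normal(size=T)
nu = rng.normal(size=T)
y = 1.0 * x1 + 0.5 * x2 + nu  # f_t(I_{t-1}) = x1 + 0.5*x2

e_full = y - (1.0 * x1 + 0.5 * x2)  # error using the full information set
e_red = y - 1.0 * x1                # error using only x1: g_t = E[f_t | x1]

print(float(e_red.mean()), float(np.var(e_full)), float(np.var(e_red)))
# Both errors are mean zero (up to sampling error); the reduced-information
# error variance exceeds the full one by V[0.5*x2] = 0.25.
```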

The cost of the loss of information from using 𝒥t−1 relative to ℐt−1 in terms of prediction will depend on the loss function, just as forecast loss depends on the loss function of the user. The sum of the squared bias and the forecast-error variance gives rise to ‘squared-error loss’, with empirical counterpart the mean squared forecast error (MSFE). MSFE is a general-purpose loss function (see the exchanges in Clements and Hendry, 1993a, b). The importance of decision or cost-based assessment of the quality of forecasts has long been recognized (see, e.g. Pesaran and Skouras, 2002), along with the recognition that decision-based forecast evaluation criteria and MSFE need not agree on which of two rival sets of forecasts is superior.2

We can also show that predictability cannot increase as the horizon grows for a fixed event yT based on ℐT−h for h=1, 2,…, H, since the information sets form a decreasing nested sequence going back in time:

ℐT−H ⊆ ℐT−H+1 ⊆ ⋯ ⊆ ℐT−1. (12)
The obverse of the horizon growing for a fixed event yT is that the information set is fixed at ℐT (say), and we consider predictability as the horizon increases for yT+h as h=1, 2,…, H. If a variable is unpredictable according to equation (2) (a ‘one-step’ definition), then it must remain unpredictable as the horizon increases ∀(T + h) ∈ 𝒯: this again follows from equation (11). Equally, ‘looking back’ from time T + h, the available information sets form a decreasing, nested sequence as in equation (12). Beyond these rather weak implications, little more can be said in general once densities can change over time. For example, in the non-stationary process:


we wish to compare the predictability of yT+h with that of yT+h−1 given ℐT for known ρ. Then:


The inequality in equation (14) is strict, and yT+h becomes systematically more predictable from ℐT as h increases. Although DGPs like equation (13) may be unrealistic, specific assumptions (such as stationarity and ergodicity or mixing) are needed for stronger implications. For example, in a dynamic system which induces error accumulation, where error variances do not decrease systematically as time passes (e.g. being drawn from a mixing process), predictability falls as the horizon increases since additional unpredictable components will accrue. This case can be illustrated with the simple first-order autoregression:

yt = ρyt−1 + νt,

such that:

VT+h[yT+h|ℐT] = σν²(1 − ρ²ʰ)/(1 − ρ²),

when |ρ| < 1.
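For the first-order autoregression, the h-step forecast-error variance can be computed directly (a minimal sketch with illustrative values of ρ and σν², both assumptions ours):

```python
rho, sig2_nu = 0.9, 1.0  # illustrative AR(1) parameter and innovation variance

def msfe_h(h):
    """h-step forecast-error variance for y_t = rho*y_{t-1} + nu_t, |rho| < 1."""
    return sig2_nu * (1 - rho ** (2 * h)) / (1 - rho ** 2)

print([round(msfe_h(h), 3) for h in (1, 2, 5, 20)])
# Rises monotonically towards the unconditional variance sig2_nu/(1 - rho**2).
```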

Conversely, disaggregating components of ℐT−h into their elements cannot lower the predictability of a given aggregate yT, where such disaggregation may be across space (e.g. regions of an economy), variables (such as sub-indices of a price measure), or both. Further, since lower-frequency information is a subset of higher-frequency information, and unpredictability is not, in general, invariant to the data frequency, equation (11) ensures that temporal disaggregation cannot lower the predictability of the same entity yT.

In all these cases, DyT+h(yT+h|·) remains the target of interest, and ℐTh is ‘decomposed’, in that additional content is added to the information set. A different, but related, form of disaggregation is of the target variable yT into components yi,T. It may be thought that predictability could be improved by this form of disaggregation. However, nothing is gained unless ℐT−1 increases, which does not necessitate forecasting the disaggregates, but does entail including the disaggregates in ℐT−1, as shown in Hendry and Hubrich (2004).

These attributes sustain general models, and provide a formal basis for including as much information as possible for predictability, being potentially consistent with multi-variable ‘factor forecasting’ and the benefits claimed in the ‘pooling of forecasts’ literature, as noted in section II. Nevertheless, there remains a large gap between predictability and forecasting, an issue we address in section IV. Before that, we discuss the impact of non-stationarity in section 3.2.

3.2. Non-stationarity

In non-stationary processes, unpredictability is also relative to the historical time period considered (which is why the notation above allowed for possibly changing densities), as it is then possible that:

Dνt(νt|ℐt−1) = Dνt(νt) for t ≤ T, (15)

whereas:

Dνt(νt|ℐt−1) ≠ Dνt(νt) for t > T, (16)

or vice versa.
or vice versa. More generally, the extent of any degree of predictability can change over time, especially in a social science like economics (e.g. a move from fixed to floating exchange rates).

A major source of non-stationarity in economics derives from the presence of unit roots. However, these can be ‘removed’ for the purposes of the theoretical analysis by considering suitably differenced or cointegrated combinations of variables, and that is assumed below: section V considers the relevant transformations in detail for a VAR. Of course, predictability is thereby changed – a random walk is highly predictable in levels but has unpredictable changes – but it is convenient to consider such I(0) transformations.

In terms of ft(ℐt−1) in equation (6), two important cases of change can now be distinguished. In the first, ft(·) alters to ft+1(·), so ft+1(·) ≠ ft(·), but the resulting mean of the {yt} process does not change:

Et+1[ft+1(ℐt)] = Et[ft(ℐt−1)]. (17)

In the face of such a change, interval predictions may be different, but their means will be unaltered. In the second case, equation (17) is violated, so there is a ‘location shift’ which alters the means:

Et+1[ft+1(ℐt)] ≠ Et[ft(ℐt−1)]. (18)
Such changes over time are unproblematic for the concept of unpredictability, since yt+j − ft+j(ℐt+j−1) is unpredictable for both periods j=0, 1. The practical difficulties, however, for the forecaster may be immense, an issue to which we now turn.
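The practical difficulty can be dramatized with a toy simulation (an entirely stylized setting of our own): after an unmodelled location shift, a model that keeps using the correct pre-break mean fails systematically, whereas a ‘naive’ random-walk device, though noisier, adjusts within one period.

```python
import numpy as np

rng = np.random.default_rng(4)
T, T_break = 200, 100
mu = np.where(np.arange(T) < T_break, 0.0, 3.0)  # location shift at T_break
y = mu + rng.normal(size=T)

# One-step forecasts for t > T_break:
# (a) a model that keeps the (previously correct) pre-break mean of zero;
# (b) the 'robust' naive device: forecast y_t by y_{t-1}.
idx = np.arange(T_break + 1, T)
e_fixed = y[idx] - 0.0
e_naive = y[idx] - y[idx - 1]
print(float(np.mean(e_fixed ** 2)), float(np.mean(e_naive ** 2)))
# The naive device's MSFE is about 2*sigma^2, while the unadjusted model's is
# about 3^2 + sigma^2: the squared shift dominates its forecast errors.
```

Differencing doubles the error variance in quiet periods, which is the insurance premium paid for immunity to location shifts: exactly the trade-off examined for cointegrated VARs in section V.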

IV. From predictability to forecasting

It is clear that one cannot forecast the unpredictable beyond its unconditional mean, but there may be hope of forecasting predictable events. To summarize, predictability of a random variable like yt in equation (6) from ℐt−1 has six distinct aspects:

  • 1. the composition of ℐt−1;
  • 2. how ℐt−1 influences Dyt(·∣ℐt−1) (or specifically, ft(ℐt−1));
  • 3. how Dyt(·∣ℐt−1) (or specifically ft(ℐt−1)) changes over time;
  • 4. the use of the limited information set 𝒥t−1 ⊂ ℐt−1, ∀t;
  • 5. the mapping of Dyt(·∣ℐt−1) into Dyt(·∣𝒥t−1) (or specifically, ft(ℐt−1) into gt(𝒥t−1));
  • 6. how 𝒥T will enter DyT+h(·∣𝒥T) (or fT+h(𝒥T)).

Forecasts of yT+h from a forecast origin at T are made using a model yt=ψ(𝒥t−1, θ) based on the limited information set 𝒥t−1 with conditional expectation Et[yt|𝒥t−1]=gt(𝒥t−1). The postulated parameters (or indexes of the assumed distribution) θ must be estimated as θ̂ using a sample t=1,…, T of observed information, denoted by 𝒥̂t−1. Doing so therefore introduces four more steps:

  • 7. the approximation of gt(𝒥t−1) by a function ψ(𝒥t−1, θ), ∀t;
  • 8. measurement errors between 𝒥t−1 and its observed counterpart 𝒥̂t−1;
  • 9. estimation of θ in ψ(𝒥̂t−1, θ) from the in-sample data 𝒥̂1,…, 𝒥̂T;
  • 10. forecasting yT+h from ψ(𝒥̂T, θ̂).

We consider these 10 aspects in turn, closely matching the 10 issues described in section II.

Concerning aspect (1), although knowledge of the composition of ℐt−1 will never be available for such a complicated entity as an economy, any hope of success in forecasting with macro-econometric models requires that they actually do embody inertial responses. Consequently, ℐt−1 needs to have value for predicting the future evolution of the variables to be forecast, either from a causal or systematic correlational basis. Evidence on this requirement has perforce been based on using 𝒥t−1, but seems clear-cut in two areas. First, there is a well-known range of essentially unpredictable financial variables, including changes in exchange rates, long-term interest rates, commodity prices and equity prices: if any of these could be accurately forecast for a future period, a ‘money machine’ could be created, which in turn would alter the outcome.3 While these are all key prices in decision taking, forward and future markets have evolved to help offset the risks of changes: unfortunately, there is as yet little evidence supporting the efficacy of those markets in forecasting the associated outcomes. Secondly, production processes indubitably take time, so lagged reactions seem the norm on the physical side of the economy. Less clear-cut is perhaps whether the change in aggregate consumers’ expenditure is unpredictable: Hall (1978) argues for this on theoretical grounds, invoking the Life Cycle-Permanent Income Hypothesis coupled with rational expectations, although the empirical evidence is generally not supportive.4 It seems reasonable to take the view that predictability does not seem to be precluded if ℐt−1 is known.

Learning precisely how ℐt−1 is relevant (aspect (2), albeit via 𝒥̂t−1) has been the main focus of macro-econometric modelling, thereby inducing major developments in that discipline, particularly in recent years as various forms of non-stationarity have been modelled. Even so, a lack of well-based empirical equation specifications, past changes in data densities that remain poorly understood, mis-measured – and sometimes missing – data series (especially at frequencies higher than quarterly), and the present limitations of model selection tools to (near) linear models entail that much remains to be achieved at the technical frontier.

Changes in ft(ℐt−1) over time (aspect (3)) have been discussed above, and earlier research has clarified the impacts on forecasting of shifts in its mean values.

Turning to aspect (4), economic theory is the main vehicle for the specification of the information set 𝒥t−1, partly supported by empirical studies. Any model of Dyt(·|·) embodies gt(·) not ft(·), but section 3.1 showed that models with mean innovation errors could still be developed. Thus, incomplete information about the ‘causal’ factors is not by itself problematic, providing gt(𝒥t−1) is known.

Unfortunately, mapping ft(ℐt−1) into its conditional expectation gt(𝒥t−1) (aspect (5)) is not under the investigator's control beyond the choice of 𝒥t−1. Any changes in ft(ℐt−1) over time will have indirect effects on gt(𝒥t−1) and make interpreting and modelling these shifts difficult. Nevertheless, the additional mistakes that arise from this mapping act like innovation errors.

However, even if aspects (1)–(5) could be overcome in considerable measure, aspect (6) highlights that relationships can change in the future, perhaps dramatically.5 Section 3.2 distinguished between ‘mean-zero’ and ‘location’ shifts in yt, the most pernicious breaks being location shifts (confirmed in the forecasting context by the taxonomies of forecast errors in Clements and Hendry, 1998, 2005, and by a Monte Carlo in Hendry, 2000). Consider h=1, where the focus is on the mean, ET+1[yT+1|𝒥T], which is the integral over the DGP distribution at time T + 1 conditional on a reduced information set 𝒥T, and is unknown at T. Then averaging across alternative choices of the contents of 𝒥T could provide improved forecasts relative to any single method (i.e. better approximate the integral) when the distribution changes from time T, and those choices reflect different sources of information. Of course, unanticipated breaks that occur after forecasts have been announced cannot be offset: the precise form of DyT+h(·|·) cannot be known till time T + h has been reached. However, after time T + h, DyT+h(·|·) becomes an in-sample density, so thereafter breaks could be offset.

Aspect (7) appears to be the central difficulty: gt(·) is not known. First, gt(𝒥t−1) experiences derived, rather than direct, breaks from changes in ft(ℐt−1), making model formulation, and especially selection, difficult. Secondly, empirical modellers perforce approximate gt(𝒥t−1) by a function ψ(𝒥t−1, θ), where the formulation of θ is intended to incorporate the effects of past breaks: most ‘time-varying coefficient’, regime-switching, and non-linear models are members of this class, as are models with (combinations of) impulse indicators. Thirdly, while ‘modelling breaks’ may be possible for historical events, a location shift at, or very near, the forecast origin may not be known to the forecaster; and even if known, may have effects that are difficult to discern, and impossible to model with the limited information available, especially given the next issue.

Measurement errors, aspect (8), almost always arise, as available observations are inevitably inaccurate. Although these may bias estimated coefficients and compound the modelling difficulties, by themselves, measurement errors do not imply inaccurate forecasts relative to the measured outcomes. However, in dynamic models, measurement errors induce negative moving-average residuals, so a potential incompatibility arises: differencing to attenuate systematic mis-specification or a location shift will exacerbate a negative moving average. In section 5.3, we discuss in detail the conflicting demands of measurement errors and location shifts. This result seems to lie at the heart of practical forecasting problems, and may explain the many cases where (e.g.) differencing and intercept correction have performed badly.

Concerning aspect (9), the ‘averaging’ of historical data to estimate θ by θ̂ imparts additional inertia in the model relative to the data, as well as increased uncertainty. More importantly, there are probably estimation biases from not fully capturing all past breaks, which would affect deterministic terms. Updating, moving windows and the like are attempts to mitigate such effects.

Finally, concerning aspect (10), multi-step forecasts have the added difficulty of cumulating innovation errors, although these are no more than would arise in the context of predictability.

Not adapting to location shifts induces systematic misforecasting, usually resulting in forecast failure. To thrive competitively, forecasting models need to avoid that fate, as there are many devices that track (with a lag) and hence are robust to such breaks after they have occurred. Section V considers several such devices. Before that, section 4.1 formalizes the possible forecast errors in a taxonomy both to highlight their relation to sources of information and seek pointers for attenuation of their adverse consequences. Subsequent sections discuss a number of issues which are implicit in the taxonomy.

4.1. Taxonomy of error sources

To forecast yT+h, the in-sample model ψ(𝒥t−1, θ) is developed for some specification of the parameters θ ∈ ℜ, estimated as θ̂ from the full-sample available information {𝒥̂t−1, t=1, …, T}, where 𝒥t−1 ⊆ ℐt−1 is the information set at each point in time, measured by 𝒥̂t−1, such that:

ŷT+h|T = ψh(𝒥̂T, θ̂)  (18)
There are many ways to formulate the function ψh(·) in equation (18) for a dynamic model ψ(·), including ‘powering up’ and multi-step estimation. Below, only the former is considered (on the latter, see Bhansali, 1996, 1997, 1999; Clements and Hendry, 1996a, b; Chevillon and Hendry, 2005, inter alia), but this section allows for any possibility. We focus on the first two moments here rather than the complete forecast distribution.

The key steps that determine the forecast error:

ûT+h|T = yT+h − ŷT+h|T
are as above: the composition of the DGP information sets ℐt−1; how each ℐt−1 enters the DGP Dyt(yt|ℐt−1); how Dyt(yt|ℐt−1) changes over time in-sample; the limited information set 𝒥t−1 ⊆ ℐt−1; the mapping of Dyt(yt|ℐt−1) into Dyt(yt|𝒥t−1) inducing gt(𝒥t−1)=Et[ft(ℐt−1)|𝒥t−1]; how 𝒥T enters DyT+h(·|𝒥T) for a forecast origin at T; the approximation of gt(𝒥t−1) by the model ψ(𝒥t−1, θ); the specification of θ; measurement errors in each 𝒥̂t−1 for 𝒥t−1 (which may themselves change over time); and the estimation of θ by θ̂, which together determine the properties of ψh(·). The first six are aspects of predictability in the DGP; the last four of the formulation of forecasting models which seek to capture that predictability.

From such a formulation, the forecast error ûT+h|T can be decomposed into errors which derive from each of the main reduction or transformation steps, namely:

ûT+h|T = νT+h
  + [fT+h(ℐT+h−1) − fT+h(ℐT)]
  + [fT+h(ℐT) − gT+h(𝒥T)]
  + [gT+h(𝒥T) − gT+h|T(𝒥T)]
  + [gT+h|T(𝒥T) − ψh(𝒥T, θ)]
  + [ψh(𝒥T, θ) − ψh(𝒥̂T, θ)]
  + [ψh(𝒥̂T, θ) − ψh(𝒥̂T, θ̂)]  (19)
where gT+h|T(𝒥T) is the ‘extrapolated’ value of gT+h(𝒥T) holding fixed the forecast-origin parameters in g(·). While decompositions such as equation (19) are not unique, they help pinpoint the potential sources of forecast failure, and the components that are less likely to have a pernicious effect on forecast accuracy.

Taking the seven right-hand side terms in equation (19) in turn, the first four cannot be known (in the absence of a crystal ball), being dependent on the future innovation νT+h defined relative to the information sets ℐT, future information accrual up to ℐT+h−1, the change to the limited information set which is the unknown difference ℐT − 𝒥T, and post-forecast-origin changes in the induced process not captured by the model: all four are, therefore, unpredictable ex ante, affect the forecast-error variance, and may influence its mean. The first, second and third terms have expected values of zero for proper information sets ℐ and 𝒥, so will not affect E[ûT+h|T]. Consequently, we confirm that a lack of knowledge of the complete information set ℐ is not an explanation for forecast failure, although using more (relevant) information will reduce the variance component due to fT+h(ℐT) − gT+h(𝒥T). The second term is only present when h > 1, but then represents the cumulation of the innovation errors {νT+j} for j=1,…, h − 1. However, the fourth term is a potential source of forecast failure when gT+h(𝒥T) ≠ gT+h|T(𝒥T), as occurs from an induced location shift (rather than just structural change, in general, where the change may be zero on average). Conversely, the fourth term would be zero under constant parameters. Overall, knowing even the complete information set ℐT would only remove one of the first four terms.

The next three terms depend on the goodness of the model for the local DGP DyT(yT|𝒥T−1) and on data accuracy, both in-sample and at the forecast origin, as well as the choice of estimator. Specifically, the fifth is a function of the adequacy of the model, the sixth of the data accuracy at T, and the seventh of the properties of the estimator θ̂ for θ when the observed data are used. Thus, the fifth term would be zero for a correctly specified model, the sixth for accurate data, but the seventh only in an infinite sample: hence the focus in many derivations of forecast-error uncertainties on the impacts of parameter estimation and innovation error variances (e.g. Ericsson, 2002). More knowledge about the relevant in-sample changes, about the ‘true’ data, about how the information affects gT+h|T(·), and hence how ψh(·) should be specified, would all help, but there seems no additional role for more information per se (beyond having longer samples of data) in the final three terms.

To clarify the implications, we consider the one-step ahead error, uT+1 = yT+1 − ŷT+1|T, from the forecasting model ŷT+1|T = ψ1(𝒥̂T, θ̂), which can be decomposed into the six basic sources of mistakes:

uT+1 = νT+1  (DGP innovation error)
 + fT+1(ℐT) − gT+1(𝒥T)  (Incomplete information)
 + gT+1(𝒥T) − gT(𝒥T)  (Induced change)
 + gT(𝒥T) − ψ1(𝒥T, θ)  (Approximation reduction)
 + ψ1(𝒥T, θ) − ψ1(𝒥̂T, θ)  (Measurement error)
 + ψ1(𝒥̂T, θ) − ψ1(𝒥̂T, θ̂)  (Estimation uncertainty)

Since νT+1 is an innovation against the DGP information set ℐT, nothing will reduce its uncertainty. The intrinsic properties of νT+1 matter greatly, specifically its variance, as well as any unpredictable changes in its distribution. The baseline accuracy of a forecast cannot exceed that inherited from the DGP innovation error.

There are many reasons why information available to the forecaster is incomplete relative to that underlying the behaviour of the DGP. For example, important variables may not be known, and even if known, may not be measured. Either of these makes 𝒥T a subset of ℐT, although the first (excluding relevant information) tends to be the most emphasized. As shown in section 3.1, incomplete information increases forecast uncertainty over any inherent unpredictability, but by construction:

E[fT+1(ℐT) − gT+1(𝒥T) | 𝒥T] = 0
so, no additional biases result from this source, even when breaks occur.
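This property is easy to illustrate numerically. The sketch below uses a toy DGP of our own devising (not from the text): the forecaster's reduced set omits one relevant variable, so the forecast remains unbiased while its error variance roughly doubles:

```python
import random
import statistics

# Hypothetical DGP: y = x1 + x2 + e, all shocks standard normal and independent.
# The forecaster's reduced set J contains only x1, so
#   f(I) = x1 + x2   (full conditional mean)
#   g(J) = x1        (reduced conditional mean)
random.seed(12345)
N = 200_000
diffs, err_full, err_reduced = [], [], []
for _ in range(N):
    x1, x2, e = (random.gauss(0, 1) for _ in range(3))
    y = x1 + x2 + e
    f, g = x1 + x2, x1
    diffs.append(f - g)            # = x2, mean zero by construction
    err_full.append(y - f)         # = e
    err_reduced.append(y - g)      # = x2 + e

print(round(statistics.mean(diffs), 3))            # near 0: no bias from incomplete information
print(round(statistics.variance(err_reduced)
            / statistics.variance(err_full), 2))   # near 2: the variance component rises
```

The omitted x2 only inflates the forecast-error variance; it contributes nothing to the mean error, matching the claim that incomplete information is not by itself a source of forecast failure.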

Rather, the problems posed by breaks manifest themselves in the next term, gT+1(𝒥T) − gT(𝒥T): section 4.5 below addresses their detection. In-sample, it is often possible to ascertain that a break has occurred, and at worst develop suitable indicator variables to offset it, but the real difficulties derive from breaks at, or very near, the forecast origin. Section 4.5 considers possible approaches; here we note that if ΔgT+1(𝒥T) has a non-zero mean, either an additional intercept (i.e. IC), or further differencing will remove that mean error.

There will also usually be mis-specifications due to the formulation of both ψ(·) and θ as approximations to gT(𝒥T). For example, linear approximations to non-linear responses will show up here, as will dynamic mis-specification (𝒥T assumes all earlier values are available, but models often impose short lag lengths). If the effect is systematic, then an IC or differencing may again reduce its impact, although the required sign may be incompatible with the previous case.

Even if all variables known to be relevant are measured, the observations available may be inaccurate relative to the DGP ‘forces’. A distinction from the case of excluding relevant information is useful, as it matters what the source is: measurement errors in dynamic models tend to induce negative moving average residuals, whereas omitted variables usually lead to positive autoregressive residuals. Thus, again a potential incompatibility arises: differencing will exacerbate a negative moving average, and an IC may need the opposite sign to that for a break.

Finally, estimation uncertainty arising from using θ̂ in place of θ can compound the systematic effects of breaks when θ̂ adjusts slowly to changes induced in θ.

When models are mis-specified by using 𝒥t−1 ⊂ ℐt−1, for a world where ℐt−1 enters the density in changing ways over time, forecasting theory delivers implications that are remarkably different from the theorems that hold for constant processes, as the summary discussion in Hendry and Clements (2003) emphasizes. The taxonomy highlights the gulf between predictability and empirical forecasting, even before we consider the additional problem of selecting the particular model to use (the choice of ψ(·) and 𝒥).

4.2. Model selection

We build on the analyses in Clements and Hendry (1998) and Hendry and Hubrich (2004). Consider a simple constant-parameter DGP where the first equation is a conditional regression model:

yt = γ′zt−1 + εt  where εt ∼ IN[0, σ²ε]  (20)
with the marginal process:

zt ∼ INn[0, Σ]
independently of {εt}. The one-step ahead MSFE for known regressors after estimating equation (20) is:

E[(yT+1 − ŷT+1|T)²] ≃ σ²ε(1 + T⁻¹∑i=1,…,n λ*i/λi)  (21)
where Σ=HΛH′ with H′H=In in-sample, and we allow for the post-sample change that:

E[zT+1z′T+1] = Σ* = HΛ*H′  (22)
In equation (21), let z′t=(z′1,t : z′2,t) of dimensions n1 and n2=n − n1, with γ′=(γ′1 : γ′2) partitioned conformably.

However, the forecasting model only includes the subset z1,t, leading to the one-step ahead forecast for known regressors:


ỹT+1|T = γ̃′1z1,T  (23)



The forecast error ũT+1 = yT+1 − ỹT+1|T is:

ũT+1 = (γ1 − γ̃1)′z1,T + γ′2z2,T + εT+1  (24)
with unconditional expectation E[ũT+1]=0 and approximate MSFE:

E[ũ²T+1] ≃ σ²ε(1 + T⁻¹∑i=1,…,n1 λ*i/λi) + ∑i=n1+1,…,n β²iλ*i  (25)
dropping the term of O(T⁻²), and using equation (22) where H22β2=γ2 with:

τ²j = Tλjβ²j/σ²ε  (26)
which is the noncentrality of the t²-test of the null that γj=0 in equation (20) for j=n1 + 1,…, n.

Although this theoretical derivation is too simple to apply directly to practical forecasting, it nevertheless highlights five key results, most of which are well known but remain important for understanding the role of information in forecasting.

First, an estimated but mis-specified model can have a smaller MSFE than the estimated DGP if enough of the omitted regressors have noncentralities τ²j < 1.

Secondly, changes in the second moments of the regressors can markedly increase the MSFE, irrespective of including or excluding such regressors: thus, omitting collinear regressors need not improve a forecasting model.

Thirdly, orthogonalizing transforms to reduce collinearity need not be useful when data second moments can change as equation (26) only depends on the eigenvalues.

Next, omitting irrelevant variables (τ²j=0 after orthogonalizing) is sufficient for the simplified model to outperform the estimated DGP on MSFE, but is not necessary.

Finally, all regressors with τ²j > 1 should be included in the forecasting equation, especially if τ²j ≫ 1. Of course, τ²j is unknown, and the best estimate is based on the observed t²-statistic, noting that E[t²j] ≃ τ²j + 1. Since ‘statistical significance’ is often determined by the convention that t²j ≥ 4 (the approximate 5% significance level), retaining all regressors with t²j ≥ 1 would again run counter to parsimony in forecasting models. Unfortunately, the situation is much more complicated and a decision rule is far from clear, as we now discuss.

Obviously, the t²j have a sampling distribution, and even if τ²j=0, then P(t²j ≥ 2) ≃ 0.16 (close to the implicit significance level of AIC when T=100 for a range of n: see Akaike, 1973; Campos et al., 2003). Thus, there is a high probability of retaining irrelevant variables using such a decision rule if n is at all large, namely αn ≃ 6 for n=40. Conversely, omitting variables with t²j < 2 could be expensive if their τ²j is large: for example, when τ²j=4, P(t²j < 2) ≃ 0.28, so such variables will still be incorrectly omitted more than a quarter of the time. Thus, knowledge as to which variables really do matter in the sense of τ²j ≫ 1 could be invaluable, and this provides a potential route for economic theory-based restrictions to improve forecasting. Indeed, the orthogonalized analysis above understates the advantage of valid exclusion restrictions, since in the original parameterization, omitting such variables will usually alter the ratios of the eigenvalues, and could do so markedly.
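The retention and omission probabilities quoted above can be checked under the approximation that the t-statistic is distributed N(τ, 1); the helper function below is our own illustration, not part of the text:

```python
from math import erf, sqrt

def norm_cdf(x: float) -> float:
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def p_t2_below(c2: float, tau2: float) -> float:
    """P(t^2 < c2) when t ~ N(tau, 1), i.e. noncentrality tau^2 (an approximation)."""
    c, tau = sqrt(c2), sqrt(tau2)
    return norm_cdf(c - tau) - norm_cdf(-c - tau)

# Irrelevant variable (tau^2 = 0): retention probability under an AIC-like rule t^2 >= 2.
p_retain_irrelevant = 1.0 - p_t2_below(2.0, 0.0)
print(round(p_retain_irrelevant, 3))       # about 0.157

# Expected number retained by chance among n = 40 irrelevant candidates.
print(round(40 * p_retain_irrelevant, 1))  # about 6

# Relevant variable with tau^2 = 4: probability of (wrongly) omitting it.
print(round(p_t2_below(2.0, 4.0), 3))      # about 0.28
```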

4.3. Non-linear models

Specifying an appropriate functional form for the forecasting model, once we allow for possible non-linear dependencies between yt and the explanatory variables, or between yt and its lags (in autoregressive models), greatly complicates the question of model selection. In some instances, ‘stylized facts’ may point towards a non-linear structure, given the information set 𝒥. For example, when forecasting a broad measure of output growth, and with 𝒥 restricted to past changes in output growth, differences in dynamic behaviour between the expansion and contraction phases of the business cycle suggest a non-linear, regime-switching structure (see, inter alia, Albert and Chib, 1993; Diebold et al., 1994; Goodwin, 1993; Hamilton, 1994; Kim, 1994; Krolzig and Lütkepohl, 1995; Krolzig, 1997; Lam, 1990; McCulloch and Tsay, 1994; Phillips, 1991; Potter, 1995; Tiao and Tsay, 1994, as well as the collections edited by Barnett et al., 2000; Hamilton and Raj, 2002, and a Special Issue of the International Journal of Forecasting).6 Clements et al. (2004) review a number of factors which play a role in tempering any improvements which might be expected to arise from allowing for non-linearities.

4.4. Causal models

Clements and Hendry (1998) establish the negative result that causal models which are dominant in-sample need not outperform in processes subject to location shifts. This is especially true of equilibrium-correction models when unmodelled equilibrium mean shifts occur, leading to systematic forecast failure. The susceptibility of this popular class of model to location shifts motivates the analysis in section V of ways of countering the impacts of breaks.

4.5. Diagnosing breaks

A problem for the forecaster hidden in the above formulation is determining that there has been a break. First, data at or near the forecast origin are always less well measured than more mature vintages, and some may be missing. Thus, a recent forecast error may reflect just a data mistake, and treating it as a location shift in the economy could induce systematic forecast errors in later periods. Secondly, a model which is mis-specified for the underlying process, such as a linear autoregression fitted to a regime-switching DGP, may suggest breaks have occurred when they have not. Then, ‘solutions’ such as additional differencing or ICs need not be appropriate. Thirdly, even when a break has occurred in some part of a model, its effects elsewhere depend on how well both the relevant equations and their links are specified: UK M1 results in Hendry and Doornik (1997) provide an example where only the opportunity cost of holding money is misforecast in one version of the model, but real money is misforecast in another, with similar forecast errors. Fourthly, sudden changes to data (e.g. in observed money growth rates) need not entail a break in the associated equation of the model: UK M1 again highlights this. Thus, without knowing how well specified a model is under recently changed conditions, data movements alone are insufficient to guide the detection of breaks or distinguish additive from innovation errors.

4.5.1. Co-breaking

Co-breaking of a subset of relationships over the forecast horizon would be valuable because such variables would move in tandem as a group (see, e.g. Clements and Hendry, 1999; Hendry and Massmann, 2005). Although forecasting the remaining variables would still be problematic, one would not need ICs or transformations for the co-breaking equations, thereby improving the efficiency of the forecasts. Moreover, lagged co-breaking is invaluable: a break in a marginal process which affects the variable to be forecast with a lag, does not induce forecast failure.

4.6. Congruent modelling for forecasting

That we are unable to prove that careful modelling of causal information, using well-specified, congruent models in-sample, will yield more accurate forecasts may be regarded as a nihilistic critique of econometric modelling. What then is the role for orthogonalized, parsimonious encompassing, congruent models capturing causal economic relations? Seven benefits are potentially available, even in the forecasting context, and the need for such a model in the policy context is clear.

  • 1. Rigorous in-sample modelling helps detect, and thereby avoid, equilibrium-mean shifts which would otherwise distort forecasts.
  • 2. Such models deliver the smallest variance for the innovation error defined on the available information set, and hence offer one measure of the ‘best approximation’ to g(·).
  • 3. It is important to remove irrelevant variables which might suffer breaks over the forecast horizon (see, e.g. Clements and Hendry, 2002c).
  • 4. The best estimates of the model's parameters will be invaluable over periods when no breaks occur, and thereby reduce forecast-error variances.
  • 5. An orthogonalized and parsimonious model will avoid a large ratio of the largest to smallest eigenvalue of the second-moment matrix, which can have a detrimental effect on forecast-error variances when second moments alter, even for constant parameters in the forecasting model: see, e.g. Hendry and Hubrich (2004).
  • 6. A dominant parsimonious congruent model offers a better understanding of the economic process by being more interpretable.
  • 7. Such a model also sustains a progressive research strategy and offers a framework for interpreting forecast failure.

Besides which, how such a model is used in the forecast period may also matter, and in section V we discuss a number of ways that the effects of structural breaks can be offset. These come under the general heading of model transformations, namely theme C.

V. Model transformations

We consider a first-order VAR for simplicity, where the vector of n variables of interest is denoted by xt, and its DGP is:

xt = τ + ϒxt−1 + εt  where εt ∼ INn[0, Ω]  (27)
ϒ is an n × n matrix of coefficients and τ is an n-dimensional vector of constant terms. The specification in equation (27) is assumed constant in-sample, and the system is taken to be I(1), satisfying the r < n cointegration relations:

ϒ = In + αβ′  (28)
α and β are n × r full-rank matrices, no roots of |In − ϒL|=0 lie inside the unit circle, and α′⊥β⊥ is full rank, where α⊥ and β⊥ are full column rank n × (n − r) matrices, with α′α⊥=β′β⊥=0. Then equation (27) is reparameterized as a vector equilibrium-correction model (VEqCM):

Δxt = τ + αβ′xt−1 + εt  (29)
Both Δxt and β′xt are I(0) but may have non-zero means. Let:

γ ≡ E[Δxt] and μ ≡ E[β′xt]  (30)
The variables grow at the rate E[Δxt]=γ with β′γ=0; and when β′α is non-singular, the long-run equilibrium is:

β′xt = μ = −(β′α)⁻¹β′τ, so that equation (27) becomes:

Δxt − γ = α(β′xt−1 − μ) + εt  (31)
Thus, in equation (31), both Δxt and β′xt are expressed as deviations about their means. Note that γ is n × 1 subject to r restrictions, and μ is r × 1, leaving n unrestricted intercepts in total. Also, γ, α and μ are assumed to be variation free, although in principle, μ could depend on γ: see Hendry and von Ungern-Sternberg (1981). Then (τ, ϒ) are not variation free, as seems reasonable when γ, α, β and μ are the ‘deep’ parameters: for a more extensive analysis, see Clements and Hendry (1996a, b).

The types of break we consider are shifts in the equilibrium mean, for the reasons detailed in Clements and Hendry (1999, 2005). At T + 1 the DGP becomes:

Δxt = γ + α(β′xt−1 − μ*) + εt,  t > T  (33)
where ∇μ*=μ* − μ, or:

ΔxT+1 = [γ + α(β′xT − μ)] − α∇μ* + εT+1  (34)
The first right-hand side term in equation (34) (namely γ + α(β′xT − μ)) is the constant-parameter forecast of ΔxT+1, so that the expected forecast error is:

E[ΔxT+1 − (γ + α(β′xT − μ))] = −α∇μ*
There are a number of possible solutions to avoiding systematic forecast failure, assuming the equilibrium mean remains at μ* from period T + 1 onwards. One is differencing the VEqCM (31); another is using a ‘constant change forecast’; and finally, making use of forecast-error corrections. None of these actually improves predictability (as the information set is not extended), but they seek to mitigate the impact of breaks.

5.1. Differencing the VEqCM

Since shifts in μ are the most pernicious for forecasting, consider forecasting not from equation (31) itself but from a variant thereof which has been differenced after a congruent representation has been estimated:

Δxt = (In + αβ′)Δxt−1 + Δεt  (35)

Δ²xt = αβ′Δxt−1 + Δεt  (36)
Equation (35) is just the first difference of the original VAR, since (In + αβ′)=ϒ, but with the rank restriction from cointegration imposed. Equation (36) can be interpreted as augmenting a double differenced VAR (DDV) forecast by αβ′Δxt−1, which is zero on average.

To trace the behaviour of equation (35) one period after a break in μ, let:


where from equation (36):


since only at time T does Δμ*=∇μ*. Then:


Thus, the differenced VEqCM (DVEqCM) ‘misses’ for one period only, and does not make systematic, and increasing, errors.
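The ‘misses for one period only’ property is easy to verify in a noise-free simulation. The bivariate parameter values below are our own illustrative assumptions, not taken from the text:

```python
# Noise-free bivariate VEqCM (r = 1) with an equilibrium-mean shift from 0 to 1
# after period T. All parameter values are illustrative assumptions.
alpha = (-0.5, 0.25)
beta = (1.0, -1.0)
gamma = (0.1, 0.1)            # growth rate; satisfies beta'gamma = 0
mu_pre, mu_post = 0.0, 1.0
T = 10                        # break in effect for t > T

def step(x, mu):
    ect = beta[0] * x[0] + beta[1] * x[1] - mu       # equilibrium-correction term
    return tuple(x[i] + gamma[i] + alpha[i] * ect for i in range(2))

x = [(0.0, 0.0)]                                      # starts in equilibrium
for t in range(1, 16):
    x.append(step(x[-1], mu_pre if t <= T else mu_post))

dx = [None] + [tuple(x[t][i] - x[t - 1][i] for i in range(2)) for t in range(1, 16)]

veqcm_err, dveqcm_err = {}, {}
for t in range(T + 1, 15):
    ect = beta[0] * x[t - 1][0] + beta[1] * x[t - 1][1] - mu_pre
    veqcm_err[t] = tuple(dx[t][i] - (gamma[i] + alpha[i] * ect) for i in range(2))
    bdx = beta[0] * dx[t - 1][0] + beta[1] * dx[t - 1][1]
    dveqcm_err[t] = tuple(dx[t][i] - (dx[t - 1][i] + alpha[i] * bdx) for i in range(2))
    print(t, [round(v, 6) for v in veqcm_err[t]], [round(v, 6) for v in dveqcm_err[t]])
```

The VEqCM using the old μ makes the same error −α∇μ* every period after the break, whereas the differenced variant errs only in the first post-break period.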

To consider the impact of differencing on forecast-error variances, in the context of one-step ahead forecasts, let ũT+1|T = ΔxT+1 − Δx̃T+1|T be the forecast error. Then, ignoring parameter estimation uncertainty as Op(T⁻¹ᐟ²):

ũT+1|T = ΔxT+1 − (In + αβ′)ΔxT = ΔεT+1 = εT+1 − εT, so that (when no break occurs) V[ũT+1|T] = 2Ω
Since the system error is {εt}, the additional differencing doubles the one-step error variance. There is a gain from the DVEqCM relative to the DDV, as the forecast-error variance of the latter has a component from the variance of the omitted term (αβ′Δxt), as well as the same disturbance terms.
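The variance doubling follows simply because Δεt = εt − εt−1 has twice the variance of εt; a quick scalar check (our own sketch):

```python
import random
import statistics

# Sketch: if the one-step error of the levels system is e_t, the differenced
# system's one-step error is e_t - e_{t-1}, whose variance is twice as large.
random.seed(7)
e = [random.gauss(0.0, 1.0) for _ in range(200_000)]
de = [e[t] - e[t - 1] for t in range(1, len(e))]
ratio = statistics.variance(de) / statistics.variance(e)
print(round(ratio, 2))   # close to 2
```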

5.2. Using ΔxT to forecast: the DDV

Second differencing removes two unit roots, any intercepts and linear trends, changes location shifts to ‘blips’, and converts breaks in trends to impulses. Most economic time series do not continuously accelerate – entailing a zero unconditional expectation of the second difference:

E[Δ²xt] = 0
and suggesting the well-known ‘minimal information’ forecasting rule:

Δx̃T+1|T = ΔxT  (38)
One key to the success of double differencing is that no deterministic terms remain, so that for time series like speculative prices, where no deterministic terms are present, ‘random walk forecasts’ will be equally hard to beat. However, as discussed below, differencing is incompatible with solutions to measurement errors as it exacerbates negative moving averages.
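The claims that second differencing changes location shifts to ‘blips’ and breaks in trends to impulses can be verified on artificial series (a sketch of our own, not from the text):

```python
def diff(s):
    """First difference of a list-valued series."""
    return [s[t] - s[t - 1] for t in range(1, len(s))]

k = 5
step = [1.0 if t >= k else 0.0 for t in range(10)]        # location shift at t = k
broken_trend = [float(max(0, t - k)) for t in range(10)]  # trend break at t = k

dd_step = diff(diff(step))
dd_trend = diff(diff(broken_trend))
print(dd_step)    # a +1 immediately followed by a -1: a 'blip'
print(dd_trend)   # a single unit impulse
```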

Nevertheless, Hendry (2005) notes a deeper reason why a forecast of the form (38) may generally perform well. Consider the in-sample DGP:

Δxt = γ0 + α0(β′0xt−1 − μ0) + Φ0wt + εt  (39)
where εt ∼ INn[0, Ω] (independently of included variables and their history), with population parameter values denoted by the subscript 0. Here, wt denotes potentially many omitted I(0) effects, possibly all lagged (I(0) from ‘internal’ cointegration, being differenced, or intrinsically stationary). The postulated econometric model is a VEqCM in xt:

Δxt = γ + α(β′xt−1 − μ) + et  (40)
and that model, estimated from T observations, is used for forecasting:

Δx̂T+i|T+i−1 = γ̂ + α̂(β̂′xT+i−1 − μ̂)
However, over the forecast horizon, the DGP is:

ΔxT+i = γ*0 + α*0(β*0′xT+i−1 − μ*0) + Φ*0wT+i + εT+i  (41)
Defining the forecast error as uT+i = ΔxT+i − Δx̂T+i|T+i−1:


All the main sources of forecast error occur: stochastic and deterministic breaks, omitted variables, inconsistent parameter estimates, estimation uncertainty, and innovation errors (measurement errors could be added). It is difficult to analyse equation (42) as its terms are not necessarily I(0), but conditional on (xT+i−1, wT+i−1), uT+i has an approximate mean forecast error (evaluating estimators at their plims, etc.) of:


Moreover, neglecting parameter estimation uncertainty as Op(T−1), uT+i has an approximate conditional error-variance matrix:

V[uT+i | xT+i−1, wT+i−1] ≃ Ω + Φ*0V[wT+i]Φ*0′  (43)
The conditional mean-square forecast error matrix is the sum of E[uT+i|xT+i−1, wT+i−1]E[uT+i|xT+i−1, wT+i−1]′ and equation (43), where the former includes terms in the shifts in the parameters.

Contrast using the sequence of ΔxT+i−1 to forecast ΔxT+i, as in the DDV given by equation (38):

Δx̂T+i|T+i−1 = ΔxT+i−1  (44)
From equation (41), for i > 1, ΔxT+i−1 is given by:

ΔxT+i−1 = γ*0 + α*0(β*0′xT+i−2 − μ*0) + Φ*0wT+i−1 + εT+i−1  (45)
Then, without the forecaster knowing the causal variables or the structure of the economy, or whether there have been any structural breaks or shifts, and without any estimation needed, ΔxT+i−1 reflects all the effects in the DGP. However, there are two drawbacks: the unwanted presence of εT+i−1 in equation (45), which doubles the innovation error variance; and all variables are lagged one extra period, which adds the ‘noise’ of many I(−1) effects (as shown below). Thus, there is a clear trade-off between using the carefully modelled equation (40) and the ‘naive’ predictor (44). In forecasting competitions across many states of nature with structural breaks and complicated DGPs, it is easy to see why ΔxT+i−1 may win.

Let vT+i = ΔxT+i − ΔxT+i−1, then:

vT+i = α*0β*0′ΔxT+i−1 + Φ*0ΔwT+i + ΔεT+i  (46)
All terms in the last line must be I(−1), so will be very ‘noisy’, but no systematic failure should result. Indeed:

E[vT+i] ≃ 0
Neglecting covariances, we have:

V[vT+i] ≃ α*0β*0′V[ΔxT+i−1]β*0α*0′ + Φ*0V[ΔwT+i]Φ*0′ + 2Ω  (47)
which is the mean-square error matrix because E[vT+i] ≃ 0. Although Ω enters equation (47) with a factor of two relative to equation (43), it is not necessarily the case that V[ΔwT+i] in equation (47) is twice V[wT+i] in equation (43). To see this, when {wt} is a stationary VAR (say):

wt = γwt−1 + ηt  where ηt ∼ INk[0, Ωη]

V[Δwt] = 2V[wt] − γV[wt] − V[wt]γ′
so that:

V[ΔwT+i] − V[wT+i] = V[wT+i] − γV[wT+i] − V[wT+i]γ′
which could attain a maximum of V[wT+i] when {wt} is white noise (γ=0), or approach −V[wT+i] when {wt} is highly autoregressive (γ ≃ Ik). Thus, the overall error variance in equation (47) will not necessarily double relative to equation (43), and could be smaller in sufficiently badly specified VEqCMs.
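In the scalar case the comparison is analytic: for wt = g·wt−1 + ηt, V[w] = σ²/(1 − g²) and V[Δw] = 2σ²/(1 + g), so (V[Δw] − V[w])/V[w] = 1 − 2g. A small check (parameter grid is illustrative):

```python
# Scalar AR(1) sketch: w_t = g*w_{t-1} + eta_t with V[eta_t] = s2.
def var_levels(g, s2=1.0):
    return s2 / (1.0 - g * g)          # V[w]

def var_diffs(g, s2=1.0):
    return 2.0 * s2 / (1.0 + g)        # V[w_t - w_{t-1}]

for g in (0.0, 0.5, 0.9, 0.99):
    vw, vdw = var_levels(g), var_diffs(g)
    print(g, round((vdw - vw) / vw, 2))   # equals 1 - 2g: from +1 down towards -1
```

At g=0 (white noise) differencing doubles the variance; as g approaches 1 the differenced series has far smaller variance than the levels, which is why (47) need not exceed (43).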

5.3. Forecast-error-based adaptation

Two techniques that use forecast-error-based information to correct breaks in econometric systems are ICs (which add back recent errors) and exponentially weighted moving averages (EWMA).7 We review both these ‘forecast-error correction’ mechanisms (FErCMs). EWMA is one of the most famous, and we consider its possible transmogrification to econometric systems.

5.3.1. EWMA

The EWMA recursive updating formula is, for λ ∈ (0, 1) and a scalar time series {yt}:

ŷT+1|T = λŷT|T−1 + (1 − λ)yT  (49)
so (e.g.):

ŷT+1|T = (1 − λ)(yT + λyT−1 + λ²yT−2 + ⋯)
with some start-up value ŷ1|0. Hence, for an origin T, ŷT+h|T = ŷT+1|T for all h > 1. One can view this method as ‘correcting’ a random-walk forecast by the latest forecast error (yT − ŷT|T−1):

ŷT+1|T = yT − λ(yT − ŷT|T−1)  (50)
The EWMA forecast function can be seen as approximating the AutoRegressive Integrated Moving Average (ARIMA)(0, 1, 1):

Δyt = ɛt − θɛt−1  (51)
and is the exact forecast function for the ARIMA(0, 1, 1) when θ=λ, so λ can always be set such that EWMA is optimal for an ARIMA(0, 1, 1).

An ARIMA(0, 1, 1) might appear to be a restrictive class of DGP, but it happens to result when the underlying series is a random walk measured with additive error, e.g.

y*t = y*t−1 + vt  and  yt = y*t + wt  (52)
where y*t is the underlying series, yt the measured series and wt the measurement error. Combining these two equations yields the single equation:

Δyt = vt + wt − wt−1
where we can write vt + wt − wt−1 as ɛt − θɛt−1, as in equation (51), with θ depending on σ²v and σ²w: θ lies in the interval [0, 1) and decreases monotonically as σ²v/σ²w increases. Thus, as the signal (σ²v) increases relative to the noise (σ²w), the optimal forecast function approaches the random-walk forecast ŷT+1|T = yT.
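The mapping from the two variances to θ can be made concrete by matching the first-order autocorrelation of Δyt = vt + wt − wt−1, which is −σ²w/(σ²v + 2σ²w), to that of the MA(1), −θ/(1 + θ²). The function below is our own sketch (names and grid values are illustrative):

```python
from math import sqrt

def theta_from_variances(sv2: float, sw2: float) -> float:
    """Invertible MA(1) parameter for Dy_t = v_t + w_t - w_{t-1}: match the
    first autocorrelation -sw2/(sv2 + 2*sw2) to -theta/(1 + theta**2)."""
    r = sw2 / (sv2 + 2.0 * sw2)              # r <= 1/2, so a real root exists
    return (1.0 - sqrt(1.0 - 4.0 * r * r)) / (2.0 * r)

# theta falls monotonically as the signal sv2 rises relative to the noise sw2,
# approaching 0 (a pure random-walk forecast) and approaching 1 as sv2 -> 0.
thetas = [theta_from_variances(sv2, 1.0) for sv2 in (0.01, 0.1, 1.0, 10.0, 100.0)]
print([round(th, 3) for th in thetas])
```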

From equation (51), note that ŷT+1|T = yT − θɛT, and that ŷT|T−1 = yT−1 − θɛT−1 implies that ɛT = yT − ŷT|T−1, thus:

ŷT+1|T = yT − θ(yT − ŷT|T−1)
establishing the equivalence to equation (50) when λ=θ.
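The algebraic equivalence of the smoother form (49) and the error-correction form (50) can be confirmed directly; the short script below (our own sketch, with an illustrative start-up value) runs both recursions on the same data:

```python
# Two renderings of the EWMA recursion: the smoother form and the
# 'random walk corrected by the last forecast error' form.
def ewma_smoother(y, lam):
    f = y[0]                       # illustrative start-up value
    out = []
    for obs in y:
        f = lam * f + (1 - lam) * obs
        out.append(f)              # forecast of the next observation
    return out

def ewma_error_correct(y, lam):
    f = y[0]
    out = []
    for obs in y:
        f = obs - lam * (obs - f)  # y_T - lam * (y_T - yhat_T|T-1)
        out.append(f)
    return out

y = [1.0, 1.2, 0.9, 1.4, 1.1, 2.0, 2.1]
a = ewma_smoother(y, 0.3)
b = ewma_error_correct(y, 0.3)
print(all(abs(u - v) < 1e-12 for u, v in zip(a, b)))   # True
```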

We have established that EWMA is an excellent foil for measurement error when the process to be forecast is approximately a random walk. Now, consider the EWMA scheme when there is a shift in the mean of {y}. The shift will eventually feed through to the forecasts from equation (49): adding back a damped function of recent forecast errors ought, therefore, to be productive when location shifts are common. The speed with which adjustment occurs depends on the degree of damping, λ, where λ=0 corresponds to a random walk forecast. The choice of a large λ prevents the predictor extrapolating the ‘noise’ in the latest observation, but when there is a shift in mean, the closer λ is to zero the more quickly a break will be assimilated in the forecasts.
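The speed-of-adjustment point is visible even without noise: on a pure step series, the post-shift forecast after k post-break observations is 1 − λᵏ, so a smaller λ assimilates the break faster (sketch with illustrative values):

```python
# Noise-free step series: smaller lam assimilates the location shift faster.
def ewma_forecasts(y, lam):
    f, out = y[0], []
    for obs in y:
        f = (1 - lam) * obs + lam * f   # equation-(49)-style update
        out.append(f)
    return out

y = [0.0] * 10 + [1.0] * 10             # location shift from 0 to 1 at t = 10
after_three = {}
for lam in (0.2, 0.8):
    after_three[lam] = ewma_forecasts(y, lam)[12]  # after 3 post-shift points
    print(lam, round(after_three[lam], 3))         # 1 - lam**3
```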

5.3.2. ICs

The role of intercept correction can be more easily seen by abstracting from measurement error, and considering a DGP of the form:

yt = μ + δ1{t>T1} + ρyt−1 + ɛt  (53)
with |ρ| < 1, where 1{t>T1} is an indicator variable with the value zero till time T1 < T, after which it is unity. The interesting case is when T1=T − 1, so the shift has recently occurred. In equation (53):

yT+1 = μ + δ + ρyT + ɛT+1
for which:

ŷT+1|T = μ̂ + ρ̂yT  (54)
Equation (54) is a poor forecast when δ is large, even if the estimated model parameters, μ̂ and ρ̂, coincide with their population values, μ and ρ, which we now assume for simplicity.

At time T, there was a residual of ɛ̂T = yT − μ − ρ(yT−1 − μ), where from equation (53) at time T:

yT = μ + δ + ρ(yT−1 − μ − δ) + ɛT

so that ɛ̂T = (1 − ρ)δ + ɛT.
To set the model ‘back on track’ (i.e. fit the last observation perfectly), the IC ɛ̂T is often added to equation (54) to yield:

ỹT+1 = μ + ρ(yT − μ) + ɛ̂T

with the forecast error:

yT+1 − ỹT+1 = ɛT+1 − ɛT
Thus, the IC forecast changes the forecasting error to its first difference, thereby removing the impact of the shift δ in the equilibrium mean. The unconditional MSFE is:

E[(yT+1 − ỹT+1)²] = E[(ɛT+1 − ɛT)²] = 2σ²ɛ

as against the minimum obtainable (for known parameters) of σ²ɛ.
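A small Monte Carlo sketch of this algebra (our own illustration; the parameter values are arbitrary) confirms that adding the time-T residual back to forecast (54) replaces the shift-contaminated forecast error by ɛT+1 − ɛT, with MSFE close to 2σ²ɛ, whereas the uncorrected forecast carries the bias (1 − ρ)δ.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, rho, delta, sigma = 0.0, 0.8, 10.0, 1.0
R = 50_000                                   # Monte Carlo replications

# Pre-break state: y_{T-2} drawn from the stationary distribution around μ.
y_Tm2 = mu + rng.normal(0, sigma / np.sqrt(1 - rho ** 2), R)
eps = rng.normal(0, sigma, (3, R))           # ε_{T-1}, ε_T, ε_{T+1}

# DGP (53): the equilibrium mean shifts from μ to μ + δ at T1 = T − 1.
y_Tm1 = (mu + delta) + rho * (y_Tm2 - mu) + eps[0]
y_T   = (mu + delta) + rho * (y_Tm1 - (mu + delta)) + eps[1]
y_Tp1 = (mu + delta) + rho * (y_T   - (mu + delta)) + eps[2]

# Forecast (54) with known μ, ρ but no break correction:
f_plain = mu + rho * (y_T - mu)
# Intercept correction: add back the time-T residual of the no-break model,
# which equals (1 − ρ)δ + ε_T.
resid_T = y_T - mu - rho * (y_Tm1 - mu)
f_ic = f_plain + resid_T

msfe_plain = np.mean((y_Tp1 - f_plain) ** 2)
msfe_ic = np.mean((y_Tp1 - f_ic) ** 2)       # theory: 2σ² once δ drops out
print(f"MSFE plain = {msfe_plain:.2f}, MSFE with IC = {msfe_ic:.2f}")
```

Here the uncorrected MSFE is (1 − ρ)²δ² + σ²ɛ, so the IC dominates whenever the shift term exceeds σ²ɛ.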

5.3.3. The relation of EWMA and IC

Four components seem to contribute to the forecasting success of EWMA:

  • adapting the next forecast by the previous forecast error;
  • differencing to adjust to location shifts;
  • the absence of deterministic terms which could go awry;
  • rapid adaptation when λ is small.

The correction of a forecast by a previous forecast error is reminiscent of IC. However, EWMA differs from IC in the sign and size of the damping factor, −λ in place of unity. EWMA may work well when there are large measurement errors at the forecast origin, but less so when there are location shifts. To investigate the implications of this sign change, consider a vector generalization of equation (50) using the forecast from equation (58) (abstracting from parameter estimation):

Δx̂T+1 = γ + α(β′xT − μ)

when augmented by the forecast-error correction:

Δx̃T+1 = Δx̂T+1 − Λ(xT − x̂T)

Assuming the VEqCM (33) was congruent in-sample, then using:

ΔxT = γ + α(β′xT−1 − μ) + ɛT

leads to:

xT − x̂T = ΔxT − Δx̂T = ɛT

which is the last in-sample one-step residual, ɛ̂T. Thus, letting ɛ̂T = xT − x̂T:

Δx̃T+1 = Δx̂T+1 − Λɛ̂T
so Λ=−In corresponds to the IC for ‘setting the forecasts back on track’ at the forecast origin. The sign change is not due to IC being an autoregressive, rather than a moving-average, correction: rather, the aim of the IC is to offset a location shift, whereas EWMA seeks to offset a previous measurement error, using differencing to remove location shifts. Thus, we see an important caveat to the explanations of the empirical success of ICs discussed in Clements and Hendry (1999, Ch. 6): some of the potential roles conflict. In particular, to offset previous mis-specifications or measurement errors requires the opposite sign to that for offsetting breaks.
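The conflict of signs can be illustrated numerically. In the following scalar sketch (our own construction, with hypothetical scenarios and parameter values), a one-step forecast is corrected by c times the forecast-origin residual; the MSFE-minimizing c is negative when the origin observation carries measurement error, but positive after a location shift, matching the caveat above.

```python
import numpy as np

rng = np.random.default_rng(3)
mu, rho, sigma, delta, R = 0.0, 0.8, 1.0, 10.0, 100_000
grid = np.linspace(-1.0, 1.0, 81)        # candidate correction coefficients c

def best_c(base_err, resid):
    """Corrected forecast error is base_err − c·resid; return the grid value
    of c that minimizes the Monte Carlo MSFE."""
    msfe = [np.mean((base_err - c * resid) ** 2) for c in grid]
    return float(grid[int(np.argmin(msfe))])

y_prev = mu + rng.normal(0, sigma / np.sqrt(1 - rho ** 2), R)  # stationary y_{T-1}
eps_T, eps_next = rng.normal(0, sigma, (2, R))

# (a) Measurement error w_T at the forecast origin, no shift: both the base
# forecast and the origin residual are contaminated by w_T.
w_T = rng.normal(0, 2.0, R)
y_T = mu + rho * (y_prev - mu) + eps_T
resid_a = (y_T + w_T) - mu - rho * (y_prev - mu)   # = ε_T + w_T
err_a = (mu + rho * (y_T - mu) + eps_next) - (mu + rho * ((y_T + w_T) - mu))
c_meas = best_c(err_a, resid_a)

# (b) Location shift δ at time T, no measurement error: the origin residual
# carries the break, so adding it back (c > 0) pays off.
yb_T = (mu + delta) + rho * (y_prev - mu) + eps_T
resid_b = yb_T - mu - rho * (y_prev - mu)          # = δ + ε_T
err_b = (1 - rho) * delta + eps_next               # error of μ + ρ(yb_T − μ)
c_shift = best_c(err_b, resid_b)

print(f"best c under measurement error: {c_meas:+.2f}; under a shift: {c_shift:+.2f}")
```

The optimal coefficients differ in sign, so no single fixed correction can serve both roles.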

VI. Papers in this volume

In this section, we provide a brief summary of the papers in the Special Issue, and an overview of the various contributions in terms of the general discussion of the role of information in economic forecasting in sections I–V.

The papers by Favero and Marcellino, and Banerjee, Marcellino and Masten come under theme A: forecasts are generated from diverse information sets; forecasts are pooled; models with multiple indicators are used to generate forecasts; forecasts are derived from factors that are intended to summarize vast amounts of information.

Favero and Marcellino present a comparison of forecasts of Euro-area country fiscal and macroeconomic variables produced using different econometric models, techniques and information sets. They find that forecasts produced by small-scale semi-structural models are generally inferior to those from simple time-series models, and to forecasts obtained by pooling over a number of such models. They suggest that their findings are in tune with recent empirical forecasting exercises, and investigate further by carrying out a simulation study in which the structural models are used to generate the data. Simple ARMA and random walk models still outperform, although the forecast errors associated with the fiscal variables are large and differences in accuracy are seldom significant. The poor performance of the correctly specified models is attributed to parameter estimation uncertainty when the in-sample estimation period is relatively short. Their findings thus suggest that simple time-series models do well because of estimation uncertainty, lending some support to exponents of theme B, although they note that pooling also does well (theme A).

Banerjee, Marcellino and Masten consider the information content in a large number of leading indicators for forecasting Euro-area inflation and GDP growth. They consider forecasts from models based on single indicators, selected groups of indicators, and on single and multiple factors estimated from the leading indicators. They conduct ex ante and ex post analyses, and rolling and recursive forecasting schemes, where in all cases the comparators are autoregressive models. Their measures of forecast accuracy are calculated in such a way as to highlight changes in performance over time. One of their main findings is that particular indicators do perform well at particular times, but that in terms of choosing a single model over the whole forecast period the autoregressions are hard to beat. This again tends to support theme B over A. Their broad conclusion matches the findings of others on different data sets (e.g. Stock and Watson, 2003) and is, of course, consistent with the changes in the composition of the US composite and leading indicators over time.

Harvey and Newbold focus on estimation uncertainty in the context of forecast combination. Whilst Favero and Marcellino use simulations to gauge the effects of estimation uncertainty on their forecasts, this is complemented by Harvey and Newbold's analytical work. They show analytically that forecasts from the DGP may not encompass those from a mis-specified model when the parameters of both the DGP and mis-specified models are unknown and have to be estimated. Specifically, more accurate forecasts on MSFE may be obtained by taking a linear combination of the sets of forecasts. However, the need to estimate the combination weights from past forecasts and actuals will temper the gains that could be realized relative to when the weights are known unless large samples of past forecasts and actuals are available. Their account of the potential efficacy of forecast combination stresses the role of parameter estimation uncertainty, complementing the explanation of Hendry and Clements (2004) in terms of structural breaks and model mis-specification. Their paper also complements the recent work of West (1996, 2001) and West and McCracken (1998, 2002) on the impact of estimation uncertainty on the properties of tests of forecast accuracy and encompassing. That research seeks to make an allowance for estimation uncertainty, such that encompassing tests find that the DGP encompasses mis-specified rival models.

The paper by Castle compares the performance of two automatic model selection algorithms, PcGets and RETINA,8 via a number of Monte Carlo experiments. We note that in their editorial introduction to an earlier Bulletin Special Issue on Model Selection and Evaluation, the Guest Editors (Niels Haldrup, David Hendry and Herman van Dijk) remarked that ‘horse races between RETINA and other automated model selection procedures such as PcGets remain for future research’. It is pleasing to be able to include a paper in this Special Issue that does precisely that. The two sets of algorithms considered by Castle differ in a number of important respects. PcGets is based on an extended general-to-specific modelling approach that seeks to obtain a simpler, more parsimonious representation of an overly general unrestricted model. On the other hand, RETINA operates in a simple-to-general fashion, and aims to uncover a parsimonious model with the objective of forecasting in mind, and with a particular emphasis on capturing possible nonlinear dependencies between the variable being modelled and the set of explanatory variables. One could imagine circumstances in which one approach would perform relatively better than the other. A number of interesting issues are explored, including the effects on the selection algorithms of nonlinear functions of the explanatory variables and outliers. Although much remains to be established about model selection for ex ante forecasting in a world where location shifts and other structural breaks occur intermittently, Castle's general conclusion is that automatic computer packages do have a useful and valuable role to play in model selection.

Castle's contribution, and the paper by Allen and Fildes, come under theme B, on the role of model selection in forecasting. Allen and Fildes consider the recent literature and ask whether it is possible to discern a set of ‘cook-book instructions’ for how and when unit root and cointegration tests and restrictions should be applied in a model-development strategy designed for forecasting. Such a set of instructions would guide the practitioner, and presumably could be codified in automatic model selection strategies of the sort discussed by Castle. They argue that more should be done to establish guidelines for practitioners, and to give a clearer specification of when their preferred strategy – imposing unit root and cointegration restrictions – should be adopted.

The use of information in economic forecasting – in the form of the provision of cook-book instructions (how to deal with unit roots, in the case considered by Allen and Fildes) – is not uncontroversial. A referee of their paper argued persuasively that ‘It depends’ may be the correct answer to ‘What is the best strategy for building econometric forecasting models?’, as a longer answer would typically depend on a large number of conditional statements – witness the diversity of approaches, models and methods occasioned by the presence of seasonality, non-linearities, structural breaks, second-moment dependencies, etc., and the interaction of these with unit-root non-stationarities, as discussed in the recent Compendium on forecasting (Clements and Hendry, 2002a). Nevertheless, in common with Allen and Fildes, we would expect future research to result in improved automatic model-selection algorithms as the quantitatively most important factors are clarified and codified.

Korenok and Swanson provide some evidence on the usefulness of theory information in economic forecasting, and in particular of the usefulness of economic theory-based restrictions. Their paper also comes under theme B, as well as the role of evaluation information, theme D. They evaluate density forecasts of inflation and the output gap from standard sticky-price models, newer generation DSGE models, and simple time-series models, including VARs. The evaluation is based on the approach to predictive accuracy testing of Corradi and Swanson (2005), as well as the more standard MSFE measure and related tests. Their findings run counter to the general view that theory models and theory-based restrictions result in inferior forecasts relative to simple time-series models, providing a more upbeat assessment of the value of theory information.

Clements and Hendry also consider the role of theory information in economic forecasting, but from the perspective of what can be learnt from a model's forecast performance, especially when the model is based on a particular economic theory. If theory information is used to construct or restrict the specification of the forecasting model in some way, then it seems natural to equate a good forecasting performance with support for the theory, and conversely, to view a poor forecasting performance as undermining the theory. But Clements and Hendry argue that there are limits to using forecast performance to test the underlying theories. Out-of-sample forecast performance is not a reliable indicator of whether the empirical model is in some sense a good description of the phenomenon being modelled. The precise way in which the forecasting exercise is undertaken, including a number of aspects routinely thought to be of little importance, may determine whether the exercise is classed as successful, as may the possible presence of structural breaks (as discussed above; for detailed treatments, see Clements and Hendry, 2002b, 2005).

The paper by Anderson and Vahid comes under theme C, as they consider the problem of lag selection in nonlinear models. The greater complexity of nonlinear models, and the large number of possibilities that arise outside the class of linear ARMA models of Box and Jenkins, mean that traditional methods of selecting the lag order may not be appropriate. The authors propose nonlinear autocorrelograms and partial autocorrelograms that are capable of detecting lag structures that standard correlograms may miss. The Monte Carlo and empirical evidence presented in this innovative contribution suggests that their neural network-based measures of dependence should facilitate the use of information in forecasting where a nonlinear structure is appropriate by helping to identify the lag order.

The papers by Wallis, and Hall and Mitchell, come under themes A and D, being concerned with forecast combination, but of intervals and densities rather than point forecasts, and forecast evaluation. Wallis considers ways of extending forecast combination to interval and density forecasts. He suggests the use of finite mixture distributions as an appropriate statistical representation for combined density forecasts, and using the latter to obtain combined interval forecasts. These suggestions are compared to alternative ways of combining densities and intervals. Three important issues are the choice of combination weights in practice, whether it is possible to establish a general optimality result for density forecast combination comparable to that of Bates and Granger (1969) for point forecasts, and what to do when one wishes to retain the (common) distribution functional forms of the individual densities, given that, in general, this will not be preserved by the finite mixture.

The paper by Hall and Mitchell complements that of Wallis by considering a number of ways of obtaining weights for combining densities, including weights based on the Kullback–Leibler information criterion (KLIC). Hall and Mitchell follow Bao, Lee and Saltoğlu (2004) in describing the use of KLIC to test for equal accuracy between competing density forecasts. They suggest the use of KLIC as a unified tool for the evaluation, comparison and combination of density forecasts, and provide an empirical illustration that compares the Bank of England and National Institute density forecasts of UK inflation.

We hope that this Special Issue of the Oxford Bulletin will serve to foster debate and academic enquiry on all aspects of the role of information in economic forecasting.


  • 1

    It is also sometimes argued that one of the routes by which parsimonious models might be derived – a general-to-specific modelling strategy – may often fail to deliver the desired result. A recent, and not untypical, example from a paper that seeks to establish tests and procedures for selecting models is, ‘A sharp result in the tables is that the sequential model selection method is characterized by the worst performance, likely due to its tendency to select over-parameterized models (cases with 40 or more predictors in the final model were not uncommon)’ (see Giacomini and White, 2004). However, it transpires that, despite the crucial development of multipath search procedures, the authors concede, ‘We consider a single reduction path and perform only a subset of the tests used by Hoover and Perez (1999)’. Thus, little of value is learned from their exercise, and two more potentially misleading pieces of ‘folklore’ circulate: that ‘overfitting’ and a lack of parsimony per se are detrimental to forecasting.

  • 2

    On decision-based forecast evaluation in macroeconomics, see Granger and Pesaran (2000a, b), and Clements (2004).

  • 3

    A ‘fixed-point’ analysis (like that proposed by Marget, 1929) is possible, but seems unlikely for phenomena prone to bubbles. However, transaction costs allow some predictability.

  • 4

    An equally influential paper that argues for the predictability of consumption is Davidson et al. (1978).

  • 5

    Cairncross (1969) suggested the example of forecasting UK GNP in 1940 for 1941 – a completely different outcome would have materialized had an invasion occurred. The theoretical analyses discussed above could have helped to formalize many of the issues he raised.

  • 6

    Forecasting Economic and Financial Time Series Using Nonlinear Methods, International Journal of Forecasting, Vol. 20, No. 2 (2004).

  • 7

    This section is based on Clements and Hendry (2003).

  • 8

    Another automated approach that is designed for selecting models for forecasting is based on the work of Phillips (1994, 1995, 1996, 2003), and re-selects the model specification and re-estimates as new information accrues, to ensure adaptability.