Abstract
We compare 330 ARCH-type models in terms of their ability to describe the conditional variance. The models are compared out-of-sample using DM–$ exchange rate data and IBM return data, where the latter is based on a new data set of realized variance. We find no evidence that a GARCH(1,1) is outperformed by more sophisticated models in our analysis of exchange rates, whereas the GARCH(1,1) is clearly inferior to models that can accommodate a leverage effect in our analysis of IBM returns. The models are compared with the test for superior predictive ability (SPA) and the reality check for data snooping (RC). Our empirical results show that the RC lacks power to an extent that makes it unable to distinguish ‘good’ and ‘bad’ models in our analysis. Copyright © 2005 John Wiley & Sons, Ltd.
1. INTRODUCTION
The conditional variance of financial time series is important for pricing derivatives, calculating measures of risk, and hedging. This has sparked an enormous interest in modelling the conditional variance and a large number of volatility models have been developed since the seminal paper of Engle (1982); see Poon and Granger (2003) for an extensive review and references.
The aim of this paper is to examine whether sophisticated volatility models provide a better description of financial time series than more parsimonious models. We address this question by comparing 330 GARCH-type models in terms of their ability to forecast the one-day-ahead conditional variance. The models are evaluated out-of-sample using six different loss functions, where the realized variance is substituted for the latent conditional variance. We use the test for superior predictive ability (SPA) of Hansen (2001) and the reality check for data snooping (RC) by White (2000) to benchmark the 330 volatility models to the GARCH(1,1) of Bollerslev (1986). These tests have the advantage that they properly account for the full set of models, without the use of probability inequalities, such as the Bonferroni bound, that typically lead to conservative tests.
We compare the models using daily DM–$ exchange rate data and daily IBM returns. There are three main findings of our empirical analysis. First, in the analysis of the exchange rate data we find no evidence that the GARCH(1,1) is inferior to other models, whereas the GARCH(1,1) is clearly outperformed in the analysis of IBM returns. Second, our model space includes models with many distinct characteristics that are interesting to compare, and some interesting details emerge from the out-of-sample analysis. The models that perform well in the IBM return data are primarily those that can accommodate a leverage effect, and the best overall performance is achieved by the APARCH(2,2) model of Ding et al. (1993). Other aspects of the volatility models are more ambiguous. While the t-distributed specification of standardized returns generally leads to a better average performance than the Gaussian in the analysis of exchange rates, the opposite is the case in our analysis of IBM returns. The different mean specifications, zero-mean, constant mean and GARCH-in-mean, result in almost identical performances. Third, our empirical analysis shows that the RC has less power than the SPA test. This makes an important difference in our application, because the RC cannot detect that the GARCH(1,1) is significantly outperformed by other models in the analysis of IBM returns. In fact, the RC even suggests that an ARCH(1) may be the best model in many cases, which does not conform with the existing empirical evidence. The SPA test always finds the ARCH(1) model to be inferior, which shows that the SPA test has power in these applications and is therefore more likely to detect superior models when such exist.
Ideally, we would evaluate the models' ability to forecast all aspects of the conditional distribution. However, it is not possible to extract precise information about the conditional distribution without making restrictive assumptions. Instead we focus on the central component of the models—the conditional variance—that can be estimated by the realized variance. Initially, it was common to substitute the squared return for the unobserved conditional variance in out-of-sample evaluations of volatility models. This typically resulted in a poor performance, which instigated a discussion of the practical relevance of volatility models. However, Andersen and Bollerslev (1998) showed that the ‘poor’ performance could be explained by the fact that the squared return is a noisy proxy for the conditional variance. By substituting the realized variance (instead of the squared return), Andersen and Bollerslev (1998) showed that volatility models perform quite well. Hansen and Lunde (2003) provide another important argument for using the realized variance rather than the squared return. They show that substituting the squared returns for the conditional variance can severely distort the comparison, in the sense that the empirical ranking of models may be inconsistent for the true (population) ranking. So an evaluation that is based on squared returns may select an inferior model as the ‘best’ with a probability that converges to one as the sample size increases. For this reason, our evaluation is based on the realized variance.
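The noise argument is easy to illustrate. The simulation below uses made-up numbers (not taken from the paper) and feeds the *true* conditional variance into a Mincer–Zarnowitz regression: even a perfect forecast earns a low R^{2} against the squared-return proxy, while a precise realized-variance proxy reveals its quality.

```python
import numpy as np

def mz_r2(proxy, forecast):
    """R^2 from regressing a volatility proxy on a constant and a forecast."""
    X = np.column_stack([np.ones(len(forecast)), forecast])
    coef, *_ = np.linalg.lstsq(X, proxy, rcond=None)
    return 1.0 - (proxy - X @ coef).var() / proxy.var()

rng = np.random.default_rng(1)
n = 20_000
sigma2 = np.exp(0.5 * rng.standard_normal(n))     # latent conditional variance
r2 = sigma2 * rng.standard_normal(n) ** 2         # squared-return proxy
rv = sigma2 * (1 + 0.1 * rng.standard_normal(n))  # precise RV-style proxy

# Even with the *true* variance as the forecast, the squared-return proxy
# yields a far lower R^2 than the realized-variance proxy.
r2_score = mz_r2(r2, sigma2)
rv_score = mz_r2(rv, sigma2)
```

The gap between `r2_score` and `rv_score` is the point made by Andersen and Bollerslev (1998): the proxy's own noise, not the forecast, drives the apparent ‘poor’ performance.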
Comparing multiple models is a non-standard inference problem, and spurious results are likely to appear unless inference controls for the multiple comparisons. An inferior model can be ‘lucky’ and perform better than all other models, and the more models that are being compared the higher is the probability that the best model (in population) has a much smaller sample performance than some inferior model. It is therefore important to control for the full set of models and their interdependence when evaluating the significance of an excess performance. In our analysis we employ the SPA test and the RC, which are based on the work of Diebold and Mariano (1995) and West (1996). These tests can evaluate whether a particular model (benchmark) is significantly outperformed by other models, while taking into account the large number of models that are being compared. In other words, these tests are designed to evaluate whether an observed excess performance is significant or could have occurred by chance.
This paper is organized as follows. Section 2 describes the 330 volatility models under consideration and the loss functions are defined in Section 3. In Section 4, we describe our measures of realized variance and Section 5 contains some details of the SPA test and its bootstrap implementation. We present our empirical results in Section 6 and Section 7 contains some concluding remarks.
2. THE GARCH UNIVERSE
Given a price process, p_{t}, we define the compounded daily return by r_{t} = log(p_{t}) − log(p_{t−1}), t = −R + 1, …, n. Later we split the sample into an estimation period (the first R observations) and an evaluation period (the last n observations).
The conditional density of r_{t} is denoted by f(r|ℱ_{t−1}), where ℱ_{t−1} is the σ-algebra induced by variables that are observed at time t − 1. We define the conditional mean by µ_{t} ≡ E(r_{t}|ℱ_{t−1}) (the location parameter) and the conditional variance by σ_{t}^{2} ≡ var(r_{t}|ℱ_{t−1}) (the scale parameter), assuming that both are well defined. Subsequently we can define the standardized return, e_{t} ≡ (r_{t} − µ_{t})/σ_{t}, and denote its conditional density by g(e|ℱ_{t−1}). Following Hansen (1994) we consider a parametric specification, f(r|ψ(ℱ_{t−1};θ)), where θ ∈ Θ ⊂ ℝ^{q} is a vector of parameters. It now follows that the time-varying vector of parameters, ψ_{t} ≡ ψ(ℱ_{t−1};θ), can be divided into ψ_{t} = (µ_{t}, σ_{t}^{2}, η_{t}), where η_{t} is a vector of shape parameters for the conditional density of e_{t}. Thus, we have a family of density functions for r_{t}, which is a location-scale family with (possibly time-varying) shape parameters, and we shall model µ_{t}, σ_{t}^{2} and η_{t} individually. Most GARCH-type models can be formulated in this framework, and η_{t} typically does not depend on t.
The notation for our modelling of the conditional mean and variance is m_{t} = µ(ℱ_{t−1};θ) and h_{t}^{2} = σ^{2}(ℱ_{t−1};θ), respectively, and we employ two specifications for g(e|η_{t}) in our empirical analysis. One is a Gaussian specification that is free of parameters, g(e|η_{t}) = g(e), and the other is a t-specification that has the degrees of freedom, υ, as its only parameter, g(e|η_{t}) = g(e|υ). Our specifications for the conditional mean are: m_{t} = µ_{0} + µ_{1}h_{t}^{2} (GARCH-in-mean), m_{t} = µ_{0} (constant mean) and m_{t} = 0 (zero mean).
The conditional variance is the main object of interest, and our analysis includes a large number of parametric specifications for σ_{t} that are listed in Table I. The use of acronyms has not been fully consistent in the existing literature; for example, AGARCH has been used to represent four different specifications. So to avoid any confusion we use ‘AGARCH’ to refer to a model by Engle and Ng (1993) and use different acronyms for all other models, e.g., we use HGARCH to refer to the model by Hentschel (1995). Several specifications nest other specifications, as is evident from Table I. In particular, the flexible specifications of the HGARCH and the AugGARCH, see Duan (1997), nest many of the simpler specifications. An empirical comparison of several of the models that are nested in the AugGARCH model can be found in Loudon et al. (2000).
In our analysis, we have included the four combinations of p, q = 1, 2 for the lag length parameters, with the following exceptions: the ARCH is only estimated for q = 1; HGARCH and AugGARCH are only estimated for (p, q) = (1, 1), because these are quite burdensome to estimate. It is well known that an ARCH(1) model is unable to fully capture the persistence in volatility, and this model is only included as a point of reference, and to verify that the tests, SPA and RC, have power. This is an important aspect of the analysis, because a test that is unable to reject that the ARCH(1) is the best model cannot be very informative about which is a better model. Restricting the models to have two lags (or less) should not affect the main conclusions of our empirical analysis, because it is unlikely that a model with more lags would outperform a simple benchmark in the out-of-sample comparison, unless the same model with two lags can outperform the benchmark. This aspect is also evident from our analysis, where a model with p = q = 2 rarely performs better (out-of-sample) than the same model with fewer lags, even though most parameters are found to be significant (in-sample).
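As a concrete reference point, the benchmark GARCH(1,1) variance recursion can be sketched as follows. This is a minimal illustration: the parameter values passed in are placeholders, whereas in the paper the parameters are obtained by maximum likelihood over the estimation sample.

```python
import numpy as np

def garch11_forecast(returns, omega, alpha, beta, mu=0.0):
    """One-step-ahead conditional variance path of a GARCH(1,1):
    h2[t+1] = omega + alpha * (r[t] - mu)**2 + beta * h2[t]."""
    r = np.asarray(returns, dtype=float)
    h2 = np.empty(r.size + 1)
    h2[0] = r.var()  # common initialization: the unconditional sample variance
    for t in range(r.size):
        h2[t + 1] = omega + alpha * (r[t] - mu) ** 2 + beta * h2[t]
    return h2[1:]    # h2[t] is the forecast made with information up to time t
```

The more elaborate specifications in Table I (APARCH, EGARCH, etc.) replace the single recursion line with richer news-impact terms, but the forecasting mechanics are the same.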
3. FORECAST EVALUATION
A popular way to evaluate volatility models out-of-sample is in terms of the R^{2} from a Mincer–Zarnowitz (MZ) regression, r_{t}^{2} = a + bh_{t}^{2} + u_{t}, where squared returns are regressed on a constant and the model's forecast of the conditional variance, h_{t}^{2}. Or the logarithmic version, log r_{t}^{2} = a + b log h_{t}^{2} + u_{t}, which is less sensitive to outliers, as was noted by Pagan and Schwert (1990) and Engle and Patton (2001). However, the R^{2} of an MZ regression is not an ideal criterion for comparing volatility models, because it does not penalize a biased forecast. For example, a poor biased forecast may achieve a higher R^{2} than a good unbiased forecast, because the bias can be eliminated artificially through estimates of (a, b) that differ from (0, 1).
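The mechanism is easy to demonstrate: the MZ R^{2} is invariant to any affine transformation of the forecast, because the regression coefficients (a, b) absorb the bias. A small sketch with made-up data (not from the paper):

```python
import numpy as np

def mz_r2(proxy, forecast):
    """R^2 from regressing a volatility proxy on a constant and a forecast."""
    X = np.column_stack([np.ones(len(forecast)), forecast])
    coef, *_ = np.linalg.lstsq(X, proxy, rcond=None)
    return 1.0 - (proxy - X @ coef).var() / proxy.var()

rng = np.random.default_rng(0)
true_var = np.exp(rng.standard_normal(500))       # hypothetical variance path
proxy = true_var * rng.standard_normal(500) ** 2  # squared-return proxy
unbiased = true_var
biased = 3.0 * true_var + 1.0                     # badly biased forecast

# (a, b) soak up the bias, so both forecasts obtain exactly the same R^2,
# even though the biased forecast is far worse in squared-error terms.
```

Here `mz_r2(proxy, unbiased)` and `mz_r2(proxy, biased)` coincide to numerical precision, which is why the paper turns to explicit loss functions instead.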
It is not obvious which loss function is more appropriate for the evaluation of volatility models, as discussed by Bollerslev et al. (1994), Diebold and Lopez (1996) and Lopez (2001). So rather than making a single choice we use the following six loss functions in our empirical analysis:

MSE_{1} ≡ n^{−1}Σ_{t=1}^{n}(σ_{t} − h_{t})^{2}    MSE_{2} ≡ n^{−1}Σ_{t=1}^{n}(σ_{t}^{2} − h_{t}^{2})^{2}
QLIKE ≡ n^{−1}Σ_{t=1}^{n}(log h_{t}^{2} + σ_{t}^{2}h_{t}^{−2})    R^{2}LOG ≡ n^{−1}Σ_{t=1}^{n}[log(σ_{t}^{2}h_{t}^{−2})]^{2}
MAE_{1} ≡ n^{−1}Σ_{t=1}^{n}|σ_{t} − h_{t}|    MAE_{2} ≡ n^{−1}Σ_{t=1}^{n}|σ_{t}^{2} − h_{t}^{2}|

where σ_{t}^{2} is substituted by the realized variance in the empirical analysis.
The criteria MSE_{2} and R^{2}LOG are similar to the R^{2} of the MZ regressions, and QLIKE corresponds to the loss implied by a Gaussian likelihood. The mean absolute error criteria, MAE_{2} and MAE_{1}, are interesting because they are more robust to outliers than, say, MSE_{2}. Additional discussions of the MSE_{2}, QLIKE and R^{2}LOG criteria can be found in Bollerslev et al. (1994).
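For concreteness, the six criteria can be computed as follows. This is a sketch based on our reading of the loss functions named above, with `rv2` playing the role of the realized-variance proxy for σ_{t}^{2} and `h2` the model forecast:

```python
import numpy as np

def losses(rv2, h2):
    """Average out-of-sample losses for variance forecasts h2 against the
    realized variance rv2 (both arrays of per-day variances)."""
    rv2, h2 = np.asarray(rv2, float), np.asarray(h2, float)
    rv, h = np.sqrt(rv2), np.sqrt(h2)  # standard-deviation scale for *_1 losses
    return {
        "MSE_1": np.mean((rv - h) ** 2),
        "MSE_2": np.mean((rv2 - h2) ** 2),
        "QLIKE": np.mean(np.log(h2) + rv2 / h2),  # Gaussian-likelihood loss
        "R2LOG": np.mean(np.log(rv2 / h2) ** 2),
        "MAE_1": np.mean(np.abs(rv - h)),
        "MAE_2": np.mean(np.abs(rv2 - h2)),
    }
```

Note that a perfect forecast drives every criterion except QLIKE to zero; QLIKE attains its minimum value rather than zero, which is harmless since only loss differences across models matter for the comparison.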
4. REALIZED VARIANCE
In our empirical analysis we substitute the realized variance for the latent σ_{t}^{2}. The realized variance for a particular day is calculated from intraday returns, r_{t, i, m}, where r_{t, i, m} ≡ p_{t−(i−1)/m} − p_{t−i/m} for i = 1, …, m. Thus r_{t, i, m} is the return over a time interval with length 1/m on day t, and we note that r_{t} = Σ_{i=1}^{m}r_{t, i, m}. It will often be reasonable to assume that E(r_{t, i, m}|ℱ_{t−1}) ≃ 0 and that intraday returns are conditionally uncorrelated, cov(r_{t, i, m}, r_{t, j, m}|ℱ_{t−1}) = 0 for i ≠ j, such that σ_{t}^{2} ≃ E(Σ_{i=1}^{m}r_{t, i, m}^{2}|ℱ_{t−1}), where we have defined the realized variance (at frequency m) RV_{t}^{(m)} ≡ Σ_{i=1}^{m}r_{t, i, m}^{2}. Thus RV_{t}^{(m)} is approximately unbiased for σ_{t}^{2} (given our assumptions above), and it can often be shown that var(RV_{t}^{(m)} − σ_{t}^{2}) is decreasing in m, such that RV_{t}^{(m)} is an increasingly more precise estimator of σ_{t}^{2} as m increases. Further, RV_{t}^{(m)} is (by definition) consistent for the quadratic variation of p_{t}, which is identical to the conditional variance, σ_{t}^{2}, for certain data generating processes (DGPs), such as the ARCH-type models considered in this paper.
Several assets are not traded 24 hours a day, because the market is closed overnight and over weekends. In these situations we only observe f ≤ m (of the m possible) intraday returns. Assume for simplicity that we observe r_{t, 1, m}, …, r_{t, f, m} and define RV_{t}^{(f)} ≡ Σ_{i=1}^{f}r_{t, i, m}^{2}. Since RV_{t}^{(f)} only captures the volatility during the part of the day that the market is open, we need to extend RV_{t}^{(f)} to a measure of volatility for the full day. One resolution is to add the squared close-to-open return to RV_{t}^{(f)}, but this leads to a noisy measure because ‘overnight’ returns are relatively noisy. A better solution is to scale RV_{t}^{(f)}, and use the estimator

RV_{t} ≡ ĉ·RV_{t}^{(f)}, where ĉ ≡ [Σ_{t=1}^{n}(r_{t} − r̄)^{2}]/[Σ_{t=1}^{n}RV_{t}^{(f)}]  (1)

This yields an estimator that is approximately unbiased for σ_{t}^{2} under fairly reasonable assumptions. See Martens (2002), Hol and Koopman (2002) and Fleming et al. (2003), who applied similar scaling estimators to obtain a measure of volatility for the whole day.
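The scaling estimator in (1) is straightforward to compute. The sketch below (array shapes are hypothetical) picks ĉ so that the scaled open-market measure matches the sample variance of the interday returns:

```python
import numpy as np

def scaled_rv(intraday_returns, daily_returns):
    """Scale open-to-close realized variance up to a whole-day measure.
    intraday_returns: (n_days, f) array of intraday returns;
    daily_returns: length-n_days array of interday (close-to-close) returns.
    c_hat is chosen so the scaled RV matches the sample variance of the
    daily returns, as in eq. (1)."""
    rv_f = (np.asarray(intraday_returns, float) ** 2).sum(axis=1)
    r = np.asarray(daily_returns, float)
    c_hat = ((r - r.mean()) ** 2).sum() / rv_f.sum()
    return c_hat * rv_f, c_hat
```

Because ĉ is a single sample-wide constant, the adjustment changes the level of the volatility measure but not the relative ranking of days, which is why the paper argues it cannot favour one model over another.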
5. TEST FOR SUPERIOR PREDICTIVE ABILITY
We divide the observations into an estimation period and an evaluation period:

estimation period: t = −R + 1, …, 0;  evaluation period: t = 1, …, n.
The parameters of the volatility models are estimated using the first R interday observations, and these estimates are used to make one-step-ahead forecasts for the remaining n periods. During the evaluation period we calculate the realized variance from intraday returns and obtain RV_{t} using (1). Thus model k yields a sequence of forecasts, h_{k, t}^{2}, t = 1, …, n, that are compared to RV_{t}, using a loss function L. Let the first model, k = 0, be the benchmark model that is compared to models k = 1, …, l. Each model leads to a sequence of losses, L_{k, t} ≡ L(RV_{t}, h_{k, t}^{2}), t = 1, …, n, and we define the relative performance variables X_{k, t} ≡ L_{0, t} − L_{k, t}, k = 1, …, l.
Our null hypothesis is that the benchmark model is as good as any other model in terms of expected loss. This can be formulated as the hypothesis H_{0}: λ_{k} ≡ E(X_{k, t}) ≤ 0 for all k = 1, …, l, because λ_{k} > 0 corresponds to the case where model k is better than the benchmark. In order to apply the stationary bootstrap of Politis and Romano (1994) in our empirical analysis, we assume that X_{t} = (X_{1, t}, …, X_{l, t})′ is strictly stationary, that E‖X_{t}‖^{r+δ} < ∞ for some r > 2 and some δ > 0, and that X_{t} is α-mixing of order −r/(r − 2). These assumptions are due to Goncalves and de Jong (2003) and are weaker than those formulated in Politis and Romano (1994). The stationarity of {X_{t}} would be satisfied if {r_{t}} is strictly stationary, because {X_{t}} is a function of {r_{t}}. Next, the moment condition is not alarming: {X_{t}} measures the difference in performance of pairs of models, and since the models are quite similar and use the same information, it is unlikely that the predictions would differ so much that the relative loss would violate the moment condition. Finally, the mixing condition for {X_{t}} is satisfied if it holds for r_{t}. It is important to note that we have not assumed that any of the volatility models are correctly specified. Nor is such an assumption needed, since our ranking of volatility models is entirely measured in terms of expected loss. The assumptions about {r_{t}} suffice for the comparison and inference, and it is not necessary to make any reference to the true specification of the conditional variance. On the other hand, there is nothing preventing one of the volatility models from being correctly specified.
The bootstrap implementation can be justified under weaker assumptions than those above. For example, the stationarity assumption about {r_{t}} can be relaxed and replaced by a near-epoch condition for X_{t}, see Goncalves and de Jong (2003). This is valuable to have in mind in the present context, since the returns may not satisfy the strict stationarity requirement. A structural change in the DGP would be more critical for our analysis. While a structural change need not invalidate the bootstrap inference (if the break occurs prior to the evaluation period), it would make it very difficult to interpret the results, because the models are estimated using data that have different stochastic properties.
The SPA test is based on the studentized test statistic T_{n}^{SPA} ≡ max[max_{k} n^{1/2}X̄_{k}/ω̂_{k}, 0], where X̄_{k} ≡ n^{−1}Σ_{t=1}^{n}X_{k, t} and ω̂_{k}^{2} is an estimator of ω_{k}^{2} ≡ var(n^{1/2}X̄_{k}). A closely related test is the RC of White (2000), which employs the non-standardized test statistic T_{n}^{RC} ≡ max_{k} n^{1/2}X̄_{k}. The critical values of the SPA test and the RC are derived in different ways, and this causes the latter to be sensitive to the inclusion of poor and irrelevant models, and to be less powerful, see Hansen (2003) for details. Power is important for our application, because a more powerful test is more likely to detect superior volatility models, if such exist.
Given the assumptions stated earlier in this section, it holds that n^{1/2}(X̄ − λ) →_{d} N(0, Ω), where ‘→_{d}’ denotes convergence in distribution, X̄ ≡ (X̄_{1}, …, X̄_{l})′, λ ≡ (λ_{1}, …, λ_{l})′ and Ω ≡ lim_{n→∞}E[n(X̄ − λ)(X̄ − λ)′]. This result makes it possible to test the hypothesis H_{0}: λ ≤ 0.
5.1. Bootstrap Implementation
Unless n is large relative to l, it is not possible to obtain a precise estimate of the l × l covariance matrix, Ω. It is therefore convenient to use a bootstrap implementation, which does not require an explicit estimate of Ω, and the tests of White (2000) and Hansen (2001) can both be implemented with the stationary bootstrap of Politis and Romano (1994). From the bootstrap resamples, X*_{b, t}, b = 1, …, B, we can construct random draws of quantities of interest, which can be used to estimate the distributions of these quantities. In our setting we seek an estimate of ω_{k}^{2} and estimates of the distributions of T_{n}^{SPA} and T_{n}^{RC}. First we calculate the sample averages, X̄*_{b} ≡ n^{−1}Σ_{t=1}^{n}X*_{b, t}, and it follows from Goncalves and de Jong (2003) that the empirical distribution of n^{1/2}(X̄*_{b} − X̄) converges to the true asymptotic distribution of n^{1/2}(X̄ − λ). The resamples also allow us to calculate ω̂_{k}^{2} ≡ nB^{−1}Σ_{b=1}^{B}(X̄*_{k, b} − X̄_{k})^{2}, which is consistent for ω_{k}^{2}. We seek the distribution of the test statistics, T_{n}^{SPA} and T_{n}^{RC}, under the null hypothesis, λ ≤ 0, so we must recentre the bootstrap variables such that they satisfy the null hypothesis. Ideally, the variables should be recentred about the true value of λ, but since λ is unknown we must use an estimate, and Hansen (2001) proposed the estimates µ̂^{l}, µ̂^{c} and µ̂^{u}:
where A_{k, n} ≡ ¼n^{−1/4}ω̂_{k}(2 log log n)^{1/2}. Thus we define the recentred bootstrap variables Z*_{k, b, t} ≡ X*_{k, b, t} − g_{i}(X̄_{k}), for i = l, c, u, where g_{l}(x) ≡ max(x, 0), g_{c}(x) ≡ x·1_{{x ≥ −A_{k, n}}} and g_{u}(x) ≡ x, such that µ̂_{k}^{i} ≡ g_{i}(X̄_{k}) and the recentred variables satisfy the null hypothesis for i = l, c, u. This enables us to approximate the distribution of T_{n}^{SPA} by the empirical distribution of

T*_{b, n}^{SPA} ≡ max[max_{k} n^{1/2}Z̄*_{k, b}/ω̂_{k}, 0], b = 1, …, B  (2)

where Z̄*_{k, b} ≡ n^{−1}Σ_{t=1}^{n}Z*_{k, b, t},
and we calculate the p-value p̂_{SPA}^{i} ≡ B^{−1}Σ_{b=1}^{B}1_{{T*_{b, n}^{SPA} > T_{n}^{SPA}}}, for i = l, c, u. The null hypothesis is rejected for small p-values. In the event that T_{n}^{SPA} = 0, there is no evidence against the null hypothesis, and in this case we use the convention p̂_{SPA} ≡ 1.
The three choices for µ̂^{i} will typically yield three different p-values, and Hansen (2001) has shown that the p-value based on µ̂^{c} is consistent for the true p-value, whereas µ̂^{u} and µ̂^{l} provide an upper and lower bound for the true p-value, respectively. We denote the three resulting tests by SPA_{l}, SPA_{c} and SPA_{u}, where the subscripts refer to lower, consistent and upper. The purpose of the correction factor, A_{k, n}, that defines µ̂^{c}, is to ensure that the models with λ_{k} = 0 are recentred about zero while the poor models with λ_{k} < 0 are not. This is important for the consistency, because the models with λ_{k} < 0 do not influence the asymptotic distribution of T_{n}^{SPA}, see Hansen (2001). However, the choice of A_{k, n} is not unique, and it is therefore useful to include the p-values of the two other tests, SPA_{l} and SPA_{u}, because they define the range of p-values that can be obtained by varying the choice of A_{k, n}. The p-values based on the test statistic T_{n}^{RC} are obtained similarly. These are denoted by RC_{l}, RC_{c} and RC_{u}, where RC_{u} corresponds to the original RC of White (2000).
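A compact sketch of the consistent SPA bootstrap might look as follows. It is a simplification, not the paper's implementation: the studentizing standard errors are taken from the original sample rather than re-estimated from the bootstrap resamples, and the threshold A_{k, n} follows our reading of Hansen's proposal.

```python
import numpy as np

def stationary_bootstrap_indices(n, avg_block, rng):
    """Politis-Romano stationary bootstrap: resample indices in blocks whose
    lengths are geometric with mean avg_block, wrapping circularly."""
    p = 1.0 / avg_block
    idx = np.empty(n, dtype=int)
    idx[0] = rng.integers(n)
    for t in range(1, n):
        idx[t] = rng.integers(n) if rng.random() < p else (idx[t - 1] + 1) % n
    return idx

def spa_pvalue(X, B=500, avg_block=5, seed=0):
    """Bootstrap p-value in the spirit of the consistent SPA test.
    X is an (n, l) array of relative performance variables X_{k,t}
    (benchmark loss minus model-k loss); H0: E(X_k) <= 0 for all k."""
    rng = np.random.default_rng(seed)
    n, l = X.shape
    xbar = X.mean(axis=0)
    se = X.std(axis=0, ddof=1) / np.sqrt(n)   # standard error of each X-bar
    t_spa = max(np.max(xbar / se), 0.0)
    if t_spa == 0.0:        # no model beats the benchmark in sample:
        return 1.0          # convention p = 1 (no evidence against H0)
    # recentring threshold A_{k,n} (our reading of Hansen's proposal)
    A = 0.25 * n ** (-0.25) * (se * np.sqrt(n)) * np.sqrt(2 * np.log(np.log(n)))
    mu_c = np.where(xbar >= -A, xbar, 0.0)    # recentre only 'relevant' models
    count = 0
    for _ in range(B):
        Xb = X[stationary_bootstrap_indices(n, avg_block, rng)]
        zbar = Xb.mean(axis=0) - mu_c         # imposes the null hypothesis
        count += max(np.max(zbar / se), 0.0) > t_spa
    return count / B
```

Dropping the studentization (replacing `xbar / se` and `zbar / se` by `xbar` and `zbar`) turns this into the RC-style statistic, which is exactly the modification that makes the test sensitive to erratic models with large variances.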
6. DATA AND EMPIRICAL RESULTS
The models are estimated by maximum likelihood using the estimation sample, and the model's forecasts are compared to the realized variance in the evaluation sample.
The first data set consists of DM–$ spot exchange rate data, where the estimation sample spans the period from October 1, 1987 through September 30, 1992 (1254 observations) and the out-of-sample evaluation sample spans the period from October 1, 1992 through September 30, 1993 (n = 260). The realized variance data for the exchange rate have previously been analysed in Andersen and Bollerslev (1998) and are based on m = 288 intraday returns per day. See Andersen and Bollerslev (1997) for additional details. We adjust their measure of realized variance and use RV_{t} ≡ ĉ·RV_{t}^{(288)}, where ĉ = 0.8418 is defined in (1).
The second data set consists of IBM stock returns, where the estimation period spans the period from January 2, 1990 through May 28, 1999 (2378 days) and the evaluation period spans the period from June 1, 1999 through May 31, 2000 (n = 254). The realized variances were constructed from high-frequency data that were extracted from the Trade and Quote (TAQ) database. The intraday returns, r_{t, i, m}, were constructed artificially by fitting a cubic spline to all midquotes of a given trading day, using the time interval 9:30 EST–16:00 EST. From the splines we extract f = 130 artificial three-minute returns per day (out of the hypothetical m = 480 three-minute returns) and calculate RV_{t}^{(f)}. There are several other methods for constructing the realized variance, and several of these are discussed in Andersen et al. (2003). Later we verify that our empirical results are not influenced by our choice of estimator, as we reach the same conclusions by using six other measures of the realized variance.
The estimate of the adjustment coefficient in (1) is ĉ = 4.4938, which exceeds 480/130 ≃ 3.7. This indicates that RV_{t}^{(f)} underestimates the daily variance by more than would be expected if the daily volatility were evenly spread over the 24 hours of the day. There are several possible explanations for the fact that we need to adjust the volatilities by a number different from 3.7. First of all, it could be the result of sample variation, but this seems unlikely, as n is too large for sampling error to explain this large a difference. A second explanation is that our intraday returns are positively autocorrelated. The autocorrelation can arise from market microstructure effects or can be an artifact of the way intraday returns are constructed. A third explanation is that returns are relatively more volatile between close and open than between open and close, measured per unit of time. This requires that more information arrive at the market while it is closed than while it is open, which contradicts the findings of French and Roll (1986) and Baillie and Bollerslev (1989), so we find this explanation to be unrealistic. Finally, a fourth factor that can create a difference between squared interday returns and the sum of squared intraday returns is the omission of the conditional expected value E(r_{t, i, m}|ℱ_{t−1}), i = 1, …, m, in the calculations. Suppose that E(r_{t, i, m}|ℱ_{t−1}) = 0 for i = 1, …, f, but is positive during the time the market is closed. Then the squared interday return would, on average, be larger than RV_{t}^{(f)}, even if intraday returns were independent and homoskedastic. Such a difference between expected returns during the time the market is open and closed could be explained as a compensation for the lack of opportunities to hedge against risk overnight. It is not important which of the four explanations causes the difference, as long as our adjustment does not favour some models over others. Because the adjustment is made ex post and does not depend on the model forecasts, it is unlikely that a particular model would benefit more than other models.
6.1. Results from the Model Comparison
Table II contains the results from the model comparisons in the form of p-values. The p-values correspond to the hypothesis that the benchmark model, ARCH(1) or GARCH(1,1), is the best model. The naive p-value is the p-value that one would obtain by comparing the best performing model to the benchmark without controlling for the full set of models. So the naive p-value is not a valid p-value; it will often be too small, and therefore more likely to indicate an unjustified ‘significance’. The p-values of the SPA test and the RC control for the full set of models. Those of SPA_{c} and RC_{c} are asymptotically valid p-values, whereas those with subscripts l and u provide lower and upper bounds for the p-values. Although the naive p-value is not valid, it can exceed that of SPA_{c}, because the best performing model need not be the model that results in the largest t-statistic.
Table II. SPA and RC p-values for the two benchmark models

Panel A: Exchange rate data (DM/USD), SPA p-values

            Benchmark: ARCH(1)                      Benchmark: GARCH(1,1)
Metric      Naive     SPA_l     SPA_c     SPA_u     Naive     SPA_l     SPA_c     SPA_u
MSE_1       0.0077    0.0179    0.0179    0.0209    0.2911    0.3164    0.4589    0.7887
MSE_2       0.0392    0.0695    0.0748    0.0797    0.2025    0.6006    0.7652    0.9279
QLIKE       0.0067    0.0169    0.0184    0.0194    0.2528    0.5831    0.7707    0.9639
R^2LOG      <0.0001   0.0002    0.0002    0.0002    0.0708    0.2144    0.3269    0.6627
MAE_1       <0.0001   0.0002    0.0002    0.0002    0.0636    0.2274    0.3296    0.6309
MAE_2       0.0002    0.0011    0.0011    0.0012    0.1832    0.2177    0.2920    0.5663

Panel B: IBM data, SPA p-values

            Benchmark: ARCH(1)                      Benchmark: GARCH(1,1)
Metric      Naive     SPA_l     SPA_c     SPA_u     Naive     SPA_l     SPA_c     SPA_u
MSE_1       0.0052    0.0002    0.0002    0.0002    0.0355    0.0245    0.0300    0.0358
MSE_2       0.0061    0.0001    0.0001    0.0001    0.0409    0.0260    0.0288    0.0316
QLIKE       0.0003    <0.0001   <0.0001   <0.0001   0.0213    0.0379    0.0463    0.0528
R^2LOG      0.0108    0.0011    0.0011    0.0014    0.0166    0.0526    0.0630    0.0741
MAE_1       0.0012    0.0080    0.0086    0.0104    0.0026    0.0040    0.0051    0.0058
MAE_2       0.0014    0.0097    0.0100    0.0115    0.0026    0.0054    0.0065    0.0078

Panel C: IBM data, RC p-values

            Benchmark: ARCH(1)                      Benchmark: GARCH(1,1)
Metric      Naive     RC_l      RC_c      RC_u      Naive     RC_l      RC_c      RC_u
MSE_1       0.0052    0.0164    0.0164    0.0164    0.0355    0.1000    0.1499    0.2811
MSE_2       0.0061    0.0205    0.0205    0.0205    0.0409    0.1053    0.1056    0.1472
QLIKE       0.0003    0.0017    0.0017    0.0017    0.0213    0.0943    0.1153    0.3750
R^2LOG      0.0108    0.0601    0.0713    0.0713    0.0166    0.2908    0.3535    0.6039
MAE_1       0.0012    0.0972    0.1227    0.1399    0.0026    0.0505    0.1144    0.1522
MAE_2       0.0014    0.1219    0.1649    0.1941    0.0026    0.0644    0.1135    0.1734
Panel A contains the results for the exchange rate data. The p-values clearly show that the ARCH(1) is outperformed by other models, although the MSE_{2} criterion is a possible exception. However, there is no evidence that the GARCH(1,1) is outperformed, and a closer inspection of the models reveals that the GARCH(1,1) has one of the best sample performances.
Panels B and C contain the results from the IBM return data, based on the SPA test and the RC, respectively. From Panel B it is evident that both the ARCH(1) and the GARCH(1,1) are significantly outperformed by other volatility models in terms of all loss functions, with the possible exception of the R^{2}LOG loss function. Thus there is strong evidence that the GARCH(1,1) is inferior to alternative models. The p-values in Panel C are based on the (non-standardized) test statistic T_{n}^{RC}. The results in Panel C are alarmingly different from those in Panel B, because these p-values suggest the exact opposite conclusion in most cases. Panel C suggests that the GARCH(1,1) is not significantly outperformed, and even the ARCH(1) cannot be rejected as being superior to all other models for three of the six loss functions. The contradicting results are explained by the fact that T_{n}^{RC} is not properly standardized, and this causes the tests RC_{l}, RC_{c} and RC_{u} to be sensitive to erratic models. The problem is that a model with a relatively large var(X̄_{k}) has a disproportionate effect on the distribution of T_{n}^{RC}, in particular the right tail, which defines the critical values, see Hansen (2003). The p-values in the rightmost column (boldface) are those of the original RC by White (2000), and these provide little evidence against the two benchmarks. So the results in Table II confirm that the RC is less powerful than the SPA test.
The realized variance can be constructed in many ways, and different measures of the realized variance could lead to different results. To verify that our results are not sensitive to our choice of RV measure, we repeat the empirical analysis of the IBM return data using six other measures. These measures include: one based on a different spline method and sampling frequency; one based on the Fourier method by Barucci and Reno (2002); two based on the previous-tick method; and two based on the linear interpolation method. The p-values of the SPA_{c} test for the seven different measures of the realized variance are presented in Table III. Fortunately, the p-values do not differ much across the various measures of the realized variance, although most of the alternative measures provide slightly stronger evidence that the GARCH(1,1) is outperformed in terms of the R^{2}LOG loss function, and slightly weaker evidence in terms of the MAE_{1} and MAE_{2} loss functions.
Table III. Results for different measures of realized variance (SPA_c p-values)

            Method for estimating realized variance
Criterion   Spl50      Spl250     Fourier    Linear     Previous   Linear     Previous
            3 min      2 min      M = 85     5 min      5 min      1 min      1 min
MSE_1       0.0271     0.0230     0.0134     0.0125     0.0133     0.0111     0.0103
MSE_2       0.0280     0.0213     0.0135     0.0168     0.0181     0.0082     0.0082
QLIKE       0.0457     0.0350     0.0166     0.0178     0.0175     0.0112     0.0118
R^2LOG      0.0651     0.0998     0.0462     0.0409     0.0505     0.0375     0.0340
MAE_1       0.0039     0.0635     0.0476     0.0690     0.0662     0.0960     0.0881
MAE_2       0.0056     0.0888     0.0724     0.0510     0.0600     0.0707     0.0749
Figures 1–4 show the ‘population’ of model performances for various loss functions (and the two data sets). The plots provide information about how similar or different the models' sample performances were, and show the location of the ARCH(1) and GARCH(1,1) relative to the full set of models. The x-axis is the (negative value of the) average sample loss, such that the right tail represents the models with the best sample performance. Each figure contains four panels. The upper left panel is the performance density of all the models, whereas the other three panels show the performance densities for different ‘types’ of models. The models are divided into groups according to their type: Gaussian vs. t-distributed specification; models with and without a leverage effect; and the three mean specifications.
Figures 1 and 2, which display the results for the exchange rate data, show that the GARCH(1,1) is one of the best performing models, whereas the ARCH(1) has one of the worst sample performances. There are no major differences between the various types of models, although there is a slight tendency in Figure 2 for the t-distributed specification to perform better than the Gaussian specification.
The results for the IBM return data are displayed in Figures 3 and 4. From the SPA test we concluded that the GARCH(1,1) was significantly outperformed by other models, and the two figures also show that the GARCH(1,1) is ranked much lower in this sample. It now seems that the Gaussian specification does better than the t-distributed specification, on average. However, the very best performing model in terms of the MAE_{2} loss function is a model with a t-distributed specification. From our analysis of the IBM data it is evident that models that can accommodate a leverage effect are superior to those that cannot, particularly in Figure 4.
Although the conditional mean µ_{t} = E(r_{t}|ℱ_{t−1}) is likely to be small, it cannot ex ante be ruled out that a more sophisticated specification for µ_{t}, such as the GARCH-in-mean, leads to better forecasts of volatility than the zero-mean specification. However, the performance is almost identical across the three mean specifications, as can be seen from Figures 1–4.
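For reference, the six loss criteria that appear throughout Tables II–III and Figures 1–4 compare the realized-variance proxy σ_{t}^{2} with a model's one-day-ahead forecast h_{t}^{2}. The sketch below states the standard forms of these criteria from the volatility-forecasting literature; the paper's own forecast-evaluation section gives the authoritative definitions, so treat this as a hedged summary rather than a verbatim restatement.

```python
import numpy as np

def losses(sigma2, h2):
    """Average loss under six common volatility-forecast criteria.

    sigma2: realized-variance proxy for the conditional variance;
    h2: variance forecast.  These are the standard forms of the
    criteria named in Tables II-III (stated here as an assumption).
    """
    sigma2 = np.asarray(sigma2, float)
    h2 = np.asarray(h2, float)
    s, h = np.sqrt(sigma2), np.sqrt(h2)        # standard-deviation scale
    return {
        "MSE_1":  np.mean((s - h) ** 2),
        "MSE_2":  np.mean((sigma2 - h2) ** 2),
        "QLIKE":  np.mean(np.log(h2) + sigma2 / h2),
        "R2LOG":  np.mean(np.log(sigma2 / h2) ** 2),
        "MAE_1":  np.mean(np.abs(s - h)),
        "MAE_2":  np.mean(np.abs(sigma2 - h2)),
    }

# A perfect forecast drives every criterion except QLIKE to zero;
# QLIKE is minimized, not zeroed, at h2 = sigma2.
perfect = losses([1.0, 1.0], [1.0, 1.0])
```

The criteria differ mainly in whether they evaluate forecasts on the standard-deviation or the variance scale, and in how heavily they penalize large errors, which is why the rankings in Table III can shift across columns of the same row.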
7. CONCLUSIONS
We have compared a large number of volatility models in terms of their ability to forecast the conditional variance in an out-of-sample setting.
Our analysis was limited to DM–$ exchange rates and IBM stock returns and a universe of models that consisted of 330 different ARCH-type models. The main finding is that there is no evidence that the GARCH(1,1) model is outperformed by other models when the models are evaluated using the exchange rate data. This cannot be explained by the SPA test lacking power, because the ARCH(1) model is clearly rejected and found to be inferior to other models. In the analysis of IBM stock returns we found conclusive evidence that the GARCH(1,1) is inferior, and our results strongly suggested that good out-of-sample performance requires a specification that can accommodate a leverage effect.
The performances of the volatility models were measured out-of-sample using six loss functions, where realized variance was used to construct an estimate of the unobserved conditional variance. The significance of relative performance was evaluated with the test for superior predictive ability of Hansen (2001) and the reality check for data snooping of White (2000). Our empirical analysis illustrated the usefulness of the SPA test and showed that the SPA test is more powerful than the RC.
The SPA test and the RC are not model selection criteria and therefore are not designed to identify the best volatility model (in population). It is also unlikely that our data contain sufficient information to conclude that the model with the best sample performance is significantly better than all other models. Nevertheless, the use of a significance test, such as the SPA test, has clear advantages over model selection criteria, because it allows us to draw strong conclusions. In our setting, the SPA test provided conclusive evidence that the GARCH(1,1) is inferior to other models in our analysis of IBM returns. However, in the analysis of the exchange rate data, there was no evidence against the claim that ‘nothing beats a GARCH(1,1)’.