4.2.1. SPA Tests of Option-Implied Forecasts for Individual Stocks
In this section we present the SPA test results for the three individual stocks, IBM, MSFT and GE, with MF and ATM each used in turn as the benchmark. Comparative results for the S&P500 index are reported in Section 4.2.5. In the spirit of Hansen and Lunde (2006a) and Patton (2006) we use a ‘robust’ criterion to measure the accuracy of forecast j, namely the mean squared forecast error (MSFE) for variance quantities, in which the loss is the squared difference between the volatility measure and forecast j. We provide results for one and 22 days ahead in Tables I and II respectively, with the maturity of the options used to construct MF and ATM matching the forecast horizon in the second case only. In both tables, the results for IBM, MSFT and GE are given respectively in Panels A, B and C. Across the columns of each table we order the eight volatility measures, and associated results, according to the extent to which each measure accommodates noise and/or jumps. Specifically, we report results for measures that do not formally adjust for noise or jumps: RV(5) and RVA(5); measures that adjust for noise only: TSRV, TSRV2, RKERN, OSRV and ALTM; and the measure that adjusts for both noise and jumps: BV. We annotate the results in the following way: (i) if a benchmark is rejected at the 5% level, the SPA p-value appears in bold; (ii) in the case where a benchmark is rejected, the ‘most significant’ forecast model according to the pairwise ‘t statistics’ is indicated in parentheses in the line below;26 (iii) if a benchmark is not rejected and its MSFE loss is the smallest of that of all m + 1 models in the choice set, the p-value is allocated a # superscript.27
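The testing logic described above can be illustrated with a minimal sketch of an SPA-type test under MSFE loss. It is a simplified illustration in the spirit of Hansen's test, not the implementation used for the reported results. The function names, the stationary bootstrap with a mean block length of 10, the bootstrap estimate of the standard errors, and the convention of counting ties in favour of the benchmark are all assumptions of the sketch.

```python
import numpy as np


def stationary_bootstrap_indices(n, n_boot, mean_block, rng):
    """Draw index paths for the stationary bootstrap (geometric block lengths)."""
    idx = np.empty((n_boot, n), dtype=int)
    idx[:, 0] = rng.integers(0, n, size=n_boot)
    for t in range(1, n):
        restart = rng.random(n_boot) < 1.0 / mean_block
        fresh = rng.integers(0, n, size=n_boot)
        # Either start a new block at a random point or continue the current one.
        idx[:, t] = np.where(restart, fresh, (idx[:, t - 1] + 1) % n)
    return idx


def spa_test(actual, benchmark, alternatives, n_boot=500, mean_block=10, seed=0):
    """Return (p_value, pairwise_t_stats) for a benchmark under squared-error loss.

    actual:       (n,) volatility measure (variance units)
    benchmark:    (n,) benchmark forecast
    alternatives: (n, m) competing forecasts
    """
    rng = np.random.default_rng(seed)
    actual = np.asarray(actual, dtype=float)
    loss_bench = (actual - np.asarray(benchmark, dtype=float)) ** 2
    loss_alt = (actual[:, None] - np.asarray(alternatives, dtype=float)) ** 2
    d = loss_bench[:, None] - loss_alt      # positive: alternative beats benchmark
    n = d.shape[0]
    d_bar = d.mean(axis=0)

    idx = stationary_bootstrap_indices(n, n_boot, mean_block, rng)
    d_bar_star = np.stack([d[path].mean(axis=0) for path in idx])  # (n_boot, m)
    # Assumed variance estimator: bootstrap standard deviation of sqrt(n) * mean.
    omega = np.sqrt(n) * d_bar_star.std(axis=0, ddof=1)

    t_stats = np.sqrt(n) * d_bar / omega    # pairwise 't statistics'
    t_spa = max(float(t_stats.max()), 0.0)

    # Re-centre only alternatives that are not clearly inferior to the benchmark.
    keep = t_stats >= -np.sqrt(2.0 * np.log(np.log(n)))
    centre = np.where(keep, d_bar, 0.0)
    z_star = np.sqrt(n) * (d_bar_star - centre) / omega
    t_star = np.maximum(z_star.max(axis=1), 0.0)

    # Ties count in favour of the benchmark (assumption of this sketch).
    p_value = float(np.mean(t_star >= t_spa))
    return p_value, t_stats
```

In this sketch the ‘most significant’ alternative corresponds to `np.argmax(t_stats)`, and the # annotation corresponds to a benchmark that is not rejected and for which every entry of the mean loss differential is negative.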
Table I. SPA p-values: forecasts based on a one-day-ahead forecast horizon. An option-implied volatility forecast is used as benchmark: MF (model free) and ATM (at-the-money). The SPA test is based on a mean squared forecast error (MSFE) loss criterion, for variance quantities. For each dataset the number of models against which the benchmark model is compared (m), plus the number of observations in the forecast evaluation period from which the p-values and sample loss are calculated (N), are as follows: IBM: m = 67; N = 1149; MSFT: m = 63; N = 1154; GE: m = 66; N = 1147. The model set always includes the option-implied forecast that is the alternative to the one being tested as the benchmark. p-values that are associated with rejection of the benchmark forecast at the 5% level are highlighted in bold font. In the case of rejection, the ‘most significant’ alternative forecast, according to the pairwise ‘t statistics’, is reported in parentheses below the p-value. The acronym LMown(cross) denotes a long-memory ARFIMA own (cross)-forecast, while the acronym SMown(cross) denotes a short-memory ARMA own (cross)-forecast. In the case where a benchmark is not rejected, the superscript # indicates that the forecast also has the smallest MSFE loss of all m + 1 forecasts in the choice set
|Benchmark||Measure to be forecasta|
|Panel A: IBM|
|(most sig.)|| ||(SMcross)|
|Panel B: MSFT|
|(most sig.)||(ATM)||(ATM)||(ATM)||(LMown)||(SMcross)||(LMcross)|| ||(ATM)|
|Panel C: GE|
|(most sig.)||(ATM)||(ATM)||(ATM)|| ||(LMcross)||(ATM)|| ||(ATM)|
Table II. SPA p-values: forecasts based on a 22-day-ahead forecast horizon. An option-implied volatility forecast is used as benchmark: MF (model free) and ATM (at-the-money). The SPA test is based on a mean squared forecast error (MSFE) loss criterion, for variance quantities. For each dataset the number of models against which the benchmark model is compared (m), plus the number of observations in the forecast evaluation period from which the p-values and sample loss are calculated (N), are as follows: IBM: m = 67; N = 1149; MSFT: m = 63; N = 1154; GE: m = 66; N = 1147. The model set always includes the option-implied forecast that is the alternative to the one being tested as the benchmark. p-values that are associated with rejection of the benchmark forecast at the 5% level are highlighted in bold font. In the case of rejection, the ‘most significant’ alternative forecast, according to the pairwise ‘t statistics’, is reported in parentheses below the p-value. The acronym LMcross denotes a long-memory ARFIMA cross forecast. In the case where a benchmark is not rejected, the superscript # indicates that the forecast also has the smallest MSFE loss of all m + 1 forecasts in the choice set
|Benchmark||Measure to be forecasta|
|Panel A: IBM|
|(most sig.)|| ||(LMcross)||(LMcross)|| ||(LMcross)|
|Panel B: MSFT|
|Panel C: GE|
|(most sig.)||(ATM)||(ATM)||(ATM)||(ATM)||(ATM)||(ATM)|| ||(ATM)|
The results in Table I provide little evidence that the MF implied volatility is an accurate forecast of actual volatility one day ahead. For IBM the SPA test rejects at the 5% level for all eight measures of volatility. In all cases, ATM is the most ‘significant’ alternative, as based on the individual pairwise ‘t statistics’. For MSFT and GE there is support for MF using the ALTM measure, and a small amount of support in the case of GE using the RKERN measure also; however, in all other cases the MF benchmark is rejected, with ATM again the most ‘significant’ alternative in many instances. Both long-memory and short-memory direct forecasts also feature as the most significant alternatives in some cases.
While the lack of support for the MF benchmark may, superficially, be unsurprising given the mismatch between option maturity (22 trading days) and forecast horizon (one day), the results for the ATM benchmark provide a striking refutation of the maturity explanation. In all but one case (the BV measure for IBM) ATM is not rejected as the superior forecast, with the p-values all exceeding 0.2, usually by a wide margin. In four cases ATM is not only retained as benchmark but also has the smallest MSFE loss of all models considered (as indicated by the # superscript).
Most importantly, given one of the main focuses of this paper, these qualitative results—strong support for ATM and lack of support for MF—are almost completely invariant to the measure used to proxy future volatility. This result is consistent with the robustness results reported by Ghysels and Sinko (2006), in the context of a more limited forecasting analysis of direct intraday returns-based forecasts. The only result that really stands out here is the inability of ATM to forecast the ‘jump-free’ BV measure for IBM, a result that contrasts with all other results in Table I related to this benchmark.
Given the particular maturity associated with the option-implied forecasts—22 trading days—one would anticipate an improved performance when the forecast horizon matches that maturity. As indicated by the results reported in Table II, for the ATM forecast of MSFT and GE volatility this is indeed the case, with the p-values for the ATM benchmark uniformly higher for the 22-day forecast horizon than the corresponding p-values for the one-day horizon, and close to one in many cases. Moreover, the ATM forecast has the lowest MSFE in the forecast set (again, as indicated by the # superscript) for all eight forecast variables, for both the MSFT and GE series. The results for IBM are less clear-cut, although there is still support for the benchmark ATM for the majority of forecast measures. In contrast, the results for the MF benchmark are even weaker at the longer horizon, with only a single failure to reject MF as the superior forecast, across all series and all measures, and that support for MF being only marginal (p-value = 0.057). Once again, both option-implied volatilities fail to successfully predict the BV measure for IBM. The p-value for the ATM forecast of the BV measure, in the case of GE, although very supportive of the ATM benchmark, is the smallest across the alternative measures. The corresponding p-value is amongst the smallest in the case of MSFT.
As with the one-day-ahead predictions, there is some support for direct forecasts, in that for the three instances in which ATM is rejected as the benchmark model, a long-memory direct forecast is the ‘most significant’ according to the pairwise test. For the longer time horizon, short-memory direct forecasts do not feature at all. For neither forecast horizon is any support given to the GARCH-type forecasts based on daily returns. Indeed, although these figures are not reported here, this category of model is consistently ranked amongst the worst performers in terms of MSFE, for all series and measures, and for both forecast horizons.
In the following section we attempt to shed some light on the contrast between the support for the ATM benchmark and the (overall) lack of support for the MF benchmark, by examining the option market information from which the forecasts have been extracted. In Section 4.2.3 we shed further light on the issue via reference to the analysis in Bollerslev and Zhou (2006) of the volatility risk premium.
4.2.2. Implied Volatility Curves
In Figure 1 (a), (c) and (e) we plot one particular volatility measure, OSRV, for each series, against MF.28 In the right-hand panels, (b), (d) and (f) respectively, we plot MF against ATM for each series. The intraday measure reported is for the 22-day-ahead forecast horizon and all volatility measures (both realized and option-implied) are graphed as annualized standard deviation figures.29 Four features in Figure 1, common to all three series, are immediately apparent: (i) there are two distinct sub-periods: a high-volatility period from 30 August 2001 to (approximately) 30 July 2004, and a lower-volatility period from 2 August 2004 to 31 May 2006;30 (ii) the MF forecast tends to exceed realized volatility (overall), and by a greater amount in the high- than in the low-volatility period; (iii) the MF forecast tends to exceed the ATM forecast, again by a greater amount in the high-volatility period; (iv) the MF forecast is excessively noisy, relative to realized volatility, and more so than is the ATM forecast, again in the high-volatility period in particular.
The empirical features of OSRV, MF and ATM, for all three series, and for the full sample period and both sub-periods identified here, are summarized in Table III. Using $\sigma_t^2$ to represent OSRV, setting $f_t$ = MF, ATM (as variance quantities), and using the decomposition of the MSFE as $E[(\sigma_t^2 - f_t)^2] = [E(\sigma_t^2 - f_t)]^2 + \mathrm{var}(\sigma_t^2 - f_t)$, we report sample estimates of the forecast bias and forecast error variance, $E(\sigma_t^2 - f_t)$ and $\mathrm{var}(\sigma_t^2 - f_t)$ respectively, as well as the sample variance of the forecast itself, $\mathrm{var}(f_t)$. The numerical results clearly support the informal graphical evidence: MF is both a more biased forecast and has a larger forecast error variance than ATM, in particular over the high-volatility period. Most notably, the (magnitude of the) bias of MF is approximately twice as large as that for ATM in the high-volatility period, in the case of IBM and MSFT, and more than five times larger for the GE dataset. In the low-volatility period, however, the corresponding bias and variance figures for both forecasts are much more similar, for MSFT and GE in particular. Both options-based forecasts overestimate actual volatility in both the high- and low-volatility sample periods.
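The decomposition of the MSFE into squared bias plus forecast error variance can be checked numerically with a short sketch. The function name and the simulated inputs are hypothetical; the sign convention (actual minus forecast, so that an over-predicting forecast has negative bias) is assumed from the negative bias figures in Table III.

```python
import numpy as np


def msfe_decomposition(actual, forecast):
    """Split the MSFE of a forecast into squared bias and forecast error variance."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    error = actual - forecast            # negative on average if the forecast is too high
    bias = float(error.mean())
    # Population variance (ddof=0) so that msfe == bias**2 + error_variance exactly.
    error_variance = float(error.var())
    return {
        "bias": bias,
        "error_variance": error_variance,
        "forecast_variance": float(forecast.var()),
        "msfe": float(np.mean(error ** 2)),
    }


# Hypothetical data: a forecast that over-predicts realized variance on average.
rng = np.random.default_rng(0)
realized = 0.04 + 0.01 * rng.standard_normal(500)
implied = realized + 0.02 + 0.005 * rng.standard_normal(500)
summary = msfe_decomposition(realized, implied)
```

Using the population variance keeps the identity exact in sample; with a degrees-of-freedom correction the two sides would differ by a factor of order 1/N.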
Table III. Summary statistics for the two option-implied forecasts, over the full sample and the high- and low-volatility sub-periods; realized volatility measured by OSRV
|Full sample period (30 August 2001 to 31 May 2006)|
|−0.0343||−0.0190||−0.0313||−0.0115||−0.0244||−0.0060|
|High-volatility sample period (30 August 2001 to 30 July 2004)|
|−0.0503||−0.0271||−0.0518||−0.0194||−0.0372||−0.0073|
|Low-volatility sample period (2 August 2004 to 31 May 2006)|
|−0.0095||−0.0066||−0.0065||−0.0061||−0.0045||−0.0042|
|1.52e−004||1.32e−004||9.27e−005||8.76e−005||4.36e−005||4.13e−005|
|1.12e−004||8.81e−005||7.58e−005||7.12e−005||3.14e−005||2.89e−005|
From the high- and low-volatility sub-periods we reproduce, in turn, a representative sequence of implied volatility curves from which both MF and ATM have been constructed, as per the explanation in Section 4.1. In Figure 2, all three curves, on each of four representative days from the high-volatility period, give higher implied volatility figures at each moneyness ratio than the corresponding curves for the low-volatility period in Figure 3. Moreover, the former also exhibit a much more pronounced curvature than the latter, with the volatilities associated with very low values of X/Pt (and, in some instances, those associated with very high values of X/Pt) exceeding the near-the-money volatilities (X/Pt ≈ 1) by a large amount. This pattern reflects, in turn, both the existence of quotes for OTM put options (X/Pt low) and OTM calls (X/Pt high), and the assignment of high values to some of those options. In a high-volatility state the market thus places high value on options that pay off only if the asset price either rises or falls by a large amount, i.e., only if the present high-volatility state persists. A positive liquidity premium, associated with the relative lack of liquidity in far-from-the-money options, may also contribute to some of the high volatilities observed at the extreme ends of the moneyness spectrum. Only on one of the chosen days (17 May 2002) do all three implied volatility curves display the downward-sloping skew pattern that is often a feature of equity option data.
Given that ATM is equated to the ordinate of the volatility curve at X/Pt = 1, and MF constructed from a formula that uses all ordinates, the reason why MF tends to exceed ATM by a large amount in the high-volatility period is clear. In addition, an examination of the sequence of implied volatility curves over the entire high-volatility period, of which the graphs in Figure 2 provide a snapshot, highlights a large degree of variation in the away-from-the-money volatilities in particular, a feature that contributes to the large variation in MF reported in Table III, which contributes, in turn, to the large forecast error variance. Again, this noise in the away-from-the-money volatilities is likely to be exacerbated by the lack of liquidity in options far from the money.
In contrast to the rather distinct smile shape that characterizes some of the curves in Figure 2, during the low-volatility period highlighted in Figure 3 the curves tend to be skewed, the majority having the negative slope that typifies equity option graphs, and all exhibiting much less variation across the moneyness spectrum than the curves in Figure 2. The flat curves beyond certain narrow ranges around X/Pt = 1 indicate that no quotes on away-from-the-money options are made at the end of the relevant day, with the implied volatilities at these boundary points simply being extrapolated to the outer boundaries of 0.5 and 1.5 (see Jiang and Tian, 2005). In the low-volatility state, options that have positive pay-offs only if Pt varies substantially from its current value, i.e., if volatility is high over the maturity of the option, are not traded. In this case, there is much less difference between the MF and ATM values, plus much less variation in the MF values, than during the high-volatility state.
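The flat extrapolation of the boundary implied volatilities to the outer moneyness limits of 0.5 and 1.5 can be sketched as follows. The quoted points are hypothetical, and linear interpolation between quotes is a simplifying assumption of the sketch rather than a statement of the curve-fitting method actually used.

```python
import numpy as np


def extend_vol_curve(quoted_moneyness, quoted_vols, lo=0.5, hi=1.5, n_grid=101):
    """Interpolate quoted implied volatilities and hold the end values flat out to [lo, hi].

    quoted_moneyness must be sorted in increasing order.
    """
    grid = np.linspace(lo, hi, n_grid)
    # np.interp returns the first/last quoted value outside the quoted range,
    # which reproduces the flat segments seen beyond the last available quotes.
    vols = np.interp(grid, quoted_moneyness, quoted_vols)
    return grid, vols


# Hypothetical end-of-day quotes in a narrow range around X/Pt = 1.
quoted_m = np.array([0.90, 0.95, 1.00, 1.05, 1.10])
quoted_v = np.array([0.30, 0.26, 0.24, 0.23, 0.22])

grid, vols = extend_vol_curve(quoted_m, quoted_v)
atm_vol = float(np.interp(1.0, quoted_m, quoted_v))   # ordinate of the curve at X/Pt = 1
```

When quotes exist only close to the money, most of the extended curve is flat, which is one way to see why MF and ATM differ little in the low-volatility period.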
In summary, close examination of the volatility smile information from which MF and ATM are extracted provides some explanation for both the discrepancy between the two measures and for the added variability in the MF measure, in particular in times of high volatility.31 In the following section we draw upon the insights of Bollerslev and Zhou (2006) in order to provide an explanation for the positive bias in both measures and for the fact that the magnitude of that bias is larger in the high-volatility period.
4.2.3. Forecasting Bias: Implied Volatility Risk Premium
Bollerslev and Zhou (2006) demonstrate that under the assumption of the square root stochastic volatility model of Heston (1993), the coefficients in the regression

$$\sigma_t^2 = \phi_0 + \phi_1 f_t + u_t \qquad (16)$$

of actual volatility, $\sigma_t^2$, on the option-implied forecast, $f_t$, are functions of the parameters of the risk-neutralized version of the distribution with respect to which $\sigma_t^2$ in (16) is defined. We refer readers to Bollerslev and Zhou for details of the objective and risk-neutral distributions in question and the links between them. It is sufficient to note here that for standard values of the objective parameters, the negative market price of volatility risk that is observed empirically (e.g., Guo, 1998; Eraker, 2004; Forbes et al., 2007) leads unambiguously to $\phi_1 < 1$. Translated into the option context, the negative price means that the risk-neutralized distribution for volatility reverts more slowly to a higher long-run mean, in comparison with the objective distribution. That is, option prices have a positive premium factored in, as a consequence of stochastic volatility. It is this positive premium that leads to the implied volatility measure exceeding, on average, the objective measure of volatility, with the bias in the forecasting regression in (16) being a manifestation of the deviation between the two forms of volatility. As Bollerslev and Zhou demonstrate via simulation experiments, this qualitative result is unaffected by the estimation of $\sigma_t^2$ using observed intraday returns. The empirical results reported in the previous section, in which both option-implied forecasts overestimate, on average, one particular estimate of $\sigma_t^2$, namely OSRV, support this finding.32
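As a rough illustration of how the coefficients of a forecasting regression such as (16) might be estimated, the sketch below fits an ordinary least squares line of actual on option-implied variance. The linear form with an intercept and a single slope, the function name and the simulated data are assumptions made for illustration only.

```python
import numpy as np


def forecast_regression(actual, implied):
    """OLS of actual variance on implied variance; returns (intercept, slope)."""
    actual = np.asarray(actual, dtype=float)
    implied = np.asarray(implied, dtype=float)
    design = np.column_stack([np.ones_like(implied), implied])
    coef, *_ = np.linalg.lstsq(design, actual, rcond=None)
    return float(coef[0]), float(coef[1])


# Hypothetical data in which implied variance embeds a positive premium,
# so that actual variance moves less than one-for-one with it.
rng = np.random.default_rng(0)
implied = 0.03 + 0.02 * rng.random(1000)
actual = 0.002 + 0.8 * implied + 0.003 * rng.standard_normal(1000)

intercept, slope = forecast_regression(actual, implied)
```

A slope estimate below one in such a fit is the pattern that the discussion above associates with a negative market price of volatility risk.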
The assumption of an underlying stochastic volatility process for returns is completely consistent with the implied volatility patterns observed in practice, including for the data analysed here. That is, implied volatility smiles/skews can be linked to the fat tails (and/or skewness) that characterize empirical returns—characteristics that, in turn, can be associated with a stochastic volatility process (see, for example, Heston, 1993; Bakshi et al., 1997; Bates, 2000). The particular shape of the implied volatility curve can be linked to features of the underlying stochastic volatility process, most notably the degree of volatility and the magnitude (and sign) of the instantaneous correlation between volatility and returns. The varying shapes observed over the sample period considered are suggestive of an underlying stochastic volatility model with time-varying parameters, although we attempt no formal investigation here of that observation.33 Certainly, the varying degree of bias, in particular between the high- and low-volatility periods, is indicative of a time-varying risk premium that is a positive function of the level of actual volatility. This empirical feature is consistent with the linear (in volatility) risk premium that is adopted in the Heston stochastic volatility model, along with a negative value for the relevant risk premium parameter (see Carr and Wu, 2004; Bollerslev et al., 2008).
It is the MF measure that is formally consistent with an underlying stochastic volatility model for returns and, hence, legitimately affected by any volatility risk premium via its method of calculation, whereby all available smile information is used. The ATM forecast, on the other hand, approximated by an implied volatility at a single point in the moneyness spectrum, does not formally factor in a risk premium and, as a consequence, exhibits less bias as a forecast of actual volatility, as attested to by the results in Table III.34
In summary, then, any potential additional forecast accuracy associated with the added flexibility of the assumptions underlying the MF forecast appears to be offset by the bias and noise which beset its calculation in practice. As such, it is of interest to ascertain whether or not a truncated version of MF, which retains some of the smile information, but not all, manages to outperform ATM. We investigate this in the following section by reporting SPA test results for three modified versions of MF.
4.2.4. SPA Tests of Truncated MF Forecasts
In Table IV we present the SPA p-values associated with the 22-day-ahead forecasts using five benchmarks: MF and ATM, plus three truncated versions of MF, denoted by MF1.5, MF2.0 and MF2.5. The benchmark MF1.5, for example, is the estimate of MF produced from implied volatilities within a truncated moneyness range. The benchmarks MF2.0 and MF2.5 are defined correspondingly.35 We produce the test results for the full sample period (panel A), as well as results for the low-volatility period identified in Section 4.2.2 (panel B), the idea here being that the reduced bias and variation in all MF estimates in this latter period may lead to these benchmarks being given more support by the SPA test. The results for benchmarks MF and ATM are reproduced under the expanded model set in which MF1.5, MF2.0 and MF2.5 are included as alternatives. Hence, the results in the rows headed MF and ATM in Table IV differ in some cases from the corresponding results reported in Table II. In order to reduce the number of results reported, we focus on only three measures for each series: RKERN, ALTM and BV.
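The effect of truncating the moneyness range can be illustrated with a minimal sketch. It assumes the standard model-free integral of option time values over strikes (of the kind implemented by Jiang and Tian, 2005), zero interest rates and dividends, and Black-Scholes prices recovered from an implied volatility curve. The function names, the truncation bounds and the volatility curves are hypothetical and do not correspond to the definitions of MF1.5, MF2.0 or MF2.5.

```python
import math

import numpy as np

_erf = np.vectorize(math.erf)


def bs_call(spot, strikes, vols, maturity):
    """Black-Scholes call prices with zero interest rate and no dividends."""
    sd = vols * math.sqrt(maturity)
    d1 = (np.log(spot / strikes) + 0.5 * sd ** 2) / sd
    d2 = d1 - sd

    def cdf(x):
        return 0.5 * (1.0 + _erf(x / math.sqrt(2.0)))

    return spot * cdf(d1) - strikes * cdf(d2)


def model_free_variance(spot, maturity, vol_curve, lo=0.5, hi=1.5, n_grid=2001):
    """Model-free implied variance over the option's life, using strikes in [lo, hi] * spot.

    vol_curve maps an array of moneyness ratios X/Pt to implied volatilities.
    """
    moneyness = np.linspace(lo, hi, n_grid)
    strikes = moneyness * spot
    calls = bs_call(spot, strikes, vol_curve(moneyness), maturity)
    # Time value of the call (equal to the OTM put value below the spot price).
    integrand = 2.0 * (calls - np.maximum(spot - strikes, 0.0)) / strikes ** 2
    # Trapezoidal rule over the strike grid.
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(strikes)))


T = 22 / 252


def flat(m):
    return np.full_like(m, 0.2)


def smile(m):
    return 0.2 + 0.5 * (m - 1.0) ** 2


mf_flat = model_free_variance(100.0, T, flat)                    # close to 0.2**2 * T
mf_smile = model_free_variance(100.0, T, smile)                  # exceeds the flat-curve value
mf_narrow = model_free_variance(100.0, T, flat, lo=0.9, hi=1.1)  # truncated range
```

With a flat curve the integral recovers the at-the-money variance, whereas a pronounced smile raises it and a narrower strike range lowers it. Dividing by `T` gives an annualized figure.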
Table IV. SPA p-values: forecasts based on a 22-day-ahead forecast horizon. Alternative option-implied volatility forecasts are used as benchmark: MF, MF2.5, MF2.0, MF1.5 and ATM. The SPA test is based on a mean squared forecast error (MSFE) loss criterion, for variance quantities, with three alternative measures used for the actual volatility: RKERN, ALTM and BV. Results are produced for the full sample and low-volatility periods. p-values that are associated with rejection of the benchmark forecast at the 5% level are highlighted in bold font. In the case of rejection, the ‘most significant’ alternative forecast, according to the pairwise ‘t statistics’, is reported in parentheses below the p-value. The model set always includes all of the option-implied forecasts that are alternatives to the one being tested as the benchmark. Hence, the model set for each series underlying the results in Tables I and II is augmented by three here, to cater for the three additional versions of MF. The acronym LMcross denotes a long-memory ARFIMA cross forecast. In the case where a benchmark is not rejected, the superscript # indicates that the forecast also has the smallest MSFE loss of all m + 1 forecasts in the choice set
|Benchmark||Measure to be forecasta|
|Panel A: Full sample period (30 August 2001 to 31 May 2006)|
|(most sig.)||(MF1.5)||(MF1.5)||(MF1.5)||(MF1.5)||(MF1.5)||(MF1.5)||(MF1.5)|| ||(MF1.5)|
|(most sig.)||(ATM)||(ATM)||(ATM)||(ATM)|| ||(ATM)|| ||(ATM)|
|(most sig.)|| ||(LMcross)|| |
|Panel B: Low volatility period (2 August 2004 to 31 May 2006)|
|(most sig.)||(ATM)||(ATM)||(ATM)||(MF1.5)|| ||(LMcross)||(MF1.5)||(LMcross)||(LMcross)|
|(most sig.)||(ATM)||(ATM)||(ATM)||(MF1.5)|| ||(LMcross)||(MF1.5)||(LMcross)||(LMcross)|
|(most sig.)||(MF1.5)||(MF1.5)||(MF1.5)||(MF1.5)|| ||(LMcross)||(MF1.5)||(LMcross)||(LMcross)|
|(most sig.)||(ATM)||(ATM)||(ATM)||(ATM)|| ||(LMcross)||(LMcross)||(LMcross)||(LMcross)|
|(most sig.)||(LMcross)|| ||(LMcross)||(LMcross)|| ||(LMcross)||(LMcross)||(LMcross)||(LMcross)|
For the full sample period, the truncation of the smile used to estimate the MF implied volatility does nothing to improve its forecast performance in the case of IBM. The MF1.5 benchmark is given limited support for GE and MSFT (for the ALTM volatility measure in particular). However, overall, the ATM forecast remains dominant, even when the model set is expanded to include the added variants of MF.36 For the low-volatility period, as would be anticipated from the results recorded in Table III, the performance of both forms of option-implied forecasts (ATM, plus all variants of MF) is more similar, overall, than is their performance for the full period. However, rather than the performance of the MF forecasts improving when assessed over the low-volatility period, both the ATM and MF-type forecasts are now rejected as benchmarks in virtually all cases. Only for a single measure (ALTM for the IBM and MSFT series) is there any support for an option-implied forecast in the low volatility period.
As is consistent with earlier results, it is the BV measure which has the smallest p-values overall in Table IV, with the majority being zero to three decimal places. As was also the case for the earlier results, a long-memory direct forecast sometimes features as the most significant alternative according to a pairwise test. This is most notable in the low-volatility period. However, the superiority of any particular long-memory forecast, taking into account the multiple alternative forecasts, would need to be formally verified by conducting SPA tests of long-memory benchmarks.
4.2.5. SPA Tests for the S&P500 Index
The small amount of work that has assessed the forecasting performance of the MF implied volatility has done so without formal account being taken of multiple alternative forecasts; see Jiang and Tian (2005) and Bollerslev and Zhou (2006). The analysis has also focused on the volatility of the S&P500 index, with the MF implied volatility being proxied by the VIX in the case of Bollerslev and Zhou. The results reported in Jiang and Tian, in which the MF method is compared with the BS method, give some support to MF. This result is thus in conflict with our SPA test results, which cast doubt on the usefulness of the MF method in forecasting the volatility of individual stocks. It is of interest, therefore, to assess the robustness of our SPA-based conclusions to the shift from individual equities to the index, in particular given that the MF formula is designed for the European-style option data associated with the index. Given that the different forms of noise adjustments that have been used in this paper have their prime motivation in the case of data on traded assets, rather than observations on a constructed index, we conduct SPA tests of the S&P500 implied volatility measures for the case where actual volatility is measured by RV(5) and BV only.37
In Figure 4 we plot, respectively, RV(5) and MF, RV(5) and BS, MF and VIX, and MF2.5 and VIX, for the 22-day-ahead forecast horizon. As is evident from panels (a) and (b), both implied volatility forecasts are very biased, even more so than was the case with the individual stocks. This is consistent with a substantial risk premium being factored into the index options. Panel (c) demonstrates the accuracy with which the VIX reproduces the MF method, with the truncated MF2.5 being virtually indistinguishable from the CBOE measure in panel (d). SPA-based tests of all five benchmarks used in the previous section were conducted, in addition to the test for the VIX benchmark. The tests were conducted over the full and low-volatility periods. The results (not reported here) provide a resounding rejection of all implied volatility benchmarks, with all p-values (to several decimal places) being equal to zero.