Volatility Forecasts Embedded in the Prices of Crude-Oil Options

This paper evaluates and compares the ability of alternative option-implied volatility measures to forecast the monthly realized volatility of crude-oil returns. We find that a corridor implied volatility measure that aggregates information from a narrow range of option contracts consistently outperforms forecasts obtained by the popular Black-Scholes and model-free volatility expectations, as well as those generated by a high-frequency realized volatility model. In particular, this measure ranks favorably in all regression-based tests, delivers the lowest forecast errors under either symmetric or asymmetric loss functions, and generates economically significant gains in volatility timing exercises. Our results also show that the CBOE’s “oil-VIX” (OVX) index performs poorly, as it routinely produces the least accurate forecasts.


Introduction
In economic terms, crude-oil is the most important traded commodity. Unsurprisingly, a wide range of economic agents, from individual investors to policy makers, closely monitor its price and routinely attempt to predict its future path. Unlike standard financial assets, however, one salient feature of crude-oil prices is that they can experience dramatic shifts for reasons that are largely unrelated to global macroeconomic conditions, such as OPEC policy changes or geopolitical instability in oil-producing regions. It is therefore tempting to expand the information set of standard time-series models, which rely exclusively on the record of historical prices, with measures that have "forward-looking" features by construction.
[...] volatilities calculated from different strikes and maturities (Trippi (1977); Chiras and Manaster (1978); Beckers (1981); Gemmill (1986); Fung, Lie, and Moreno (1990)), the consensus is that the simple ATMIV of a contract expiring as close as possible to the forecast horizon appears to provide the most reliable results. More recently, ATMIV forecasts have been compared to the so-called model-free implied volatility (MFIV), which has a number of appealing theoretical properties. The empirical evidence, however, is inconclusive. Jiang and Tian (2005) study the S&P 500 index and find MFIV to be more informative than ATMIV, while the opposite conclusion is reached by Andersen and Bondarenko (2007) for the same underlying asset. Taylor, Yadav, and Zhang (2010) examine individual U.S. stocks and report that ATMIV provides more accurate volatility forecasts than its model-free counterpart. Finally, in their study of three energy markets, Prokopczuk and Simen (2014) find that MFIV is more informative than ATMIV in predicting crude-oil, heating oil and natural gas volatility. They also find that a simple adjustment for volatility risk-premia enhances the forecast performance of all option-implied measures.
When the task at hand is to predict future return variation, MFIV is not without shortcomings, for two main reasons. First, some options included in the calculation of this measure (such as deep out-of-the-money puts for the case of equities) tend to be very sensitive to fluctuations in volatility risk-premia. This can introduce substantial variation in the option-implied measures that is largely unrelated to the forecast target, i.e. integrated variance. Second, calculating MFIV requires that market prices of options with extreme strikes are observed. In practice, this means that either some extrapolation scheme must be implemented, or that options beyond a certain strike range must be excluded from the calculation.
The most popular estimates of MFIV measures are the volatility indices produced and published by CBOE, such as the VIX index for the case of the S&P 500 and the OVX for the case of crude-oil. CBOE's implementation algorithm, which is common for both the VIX and OVX indices, adopts a liquidity-based cut-off point that determines the range of options to be included in the MFIV measure calculation. The choice of this algorithm by CBOE has recently attracted some criticism. Andersen and Bondarenko (2007) were the first to note that the VIX is in fact an ex ante measure of corridor integrated variance, rather than integrated variance. In two comprehensive empirical studies, Andersen, Bondarenko, and Gonzalez-Perez (2015) and Andersen, Fusari, and Todorov (2017) use high-frequency option data and report that the VIX calculation method introduces systematic biases to the extracted measure, including artificial jumps, which become particularly pronounced during periods of market stress. From a different viewpoint, Griffin and Shams (2018) put forth evidence pointing towards market manipulation of the VIX futures market. In essence, this is facilitated by CBOE's adopted cut-off algorithm, as speculators can temporarily boost the liquidity of deep out-of-the-money S&P 500 options, increasing the level of the VIX just before the settlement price for VIX futures is determined. Given that the same methodology is used to calculate both the VIX and OVX indices, all the above raise reasonable concerns regarding the informational efficiency of OVX-based forecasts. In addition, since volatility indices have recently become widespread in the finance industry, a comparison between the OVX and other option-implied alternatives appears to be long overdue.
Our work builds on the study of Andersen and Bondarenko (2007), who propose an alternative measure of the ex ante risk-neutral expectation of volatility, the so-called corridor implied volatility (CIV). Similar to the MFIV, and unlike the Black-Scholes model, this measure aggregates volatility information from several options and does not depend on a particular option pricing model. However, the extracted measure is not a risk-neutral expectation of integrated variance but of corridor integrated variance, i.e., return variation accumulated only when the asset price lies within a corridor defined by two pre-specified price levels. The advantage of this approach is that one can select a corridor width that, while containing a wide range of option prices, excludes those with extreme strikes, avoiding both price extrapolations and liquidity-driven cut-off points that may influence the reliability of the extracted measure.
The contribution of this paper is twofold. First, we examine the forecast performance of CIV measures vis-à-vis a collection of competing alternatives, including HAR, MFIV, OVX and ATMIV forecasts, for the case of crude-oil. Our paper builds, but significantly expands, on the work done by Prokopczuk and Simen (2014) who compare the performance of MFIV and ATMIV forecasts. Besides considering additional option-implied measures, our study also includes models that utilize high-frequency return information, while forecasts are ranked using both statistical and economic criteria. Second, we provide the first empirical evaluation of the OVX index, used in the forecasting study of Haugom, Langeland, Molnár, and Westgaard (2014), against other option-implied alternatives.
Our empirical results provide insights on a number of issues. We find that a particular CIV measure, one that uses a relatively narrow range of option prices, consistently ranks favorably against all other competing measures using a variety of statistical and economic criteria. In particular, model forecasts that utilize this measure achieve the highest $R^2$ in Mincer-Zarnowitz regressions, remain significant in encompassing regression tests, and deliver the most accurate forecasts under both the symmetric and asymmetric loss functions we consider. Moreover, volatility timing exercises show that utilizing this measure results in significant economic gains. With respect to the performance of the CBOE's OVX index, we find clear evidence that this measure is problematic, as it is routinely outperformed by all other option-based alternatives. Finally, in contrast to Prokopczuk and Simen (2014), we find that ATMIV is more informative about future crude-oil volatility than the MFIV measure.⁵
The structure of the paper is as follows. Section 2 discusses various measures of volatility that we use in this study. Section 3 describes the dataset and the details of our methodology. The empirical results are presented in Section 4. Robustness checks can be found in Section 5. Section 6 concludes.
5 The same result is also reported in Andersen and Bondarenko (2007) for the case of the S&P 500, and Taylor et al. (2010), for the case of individual stocks.

Volatility Measures
In this section we describe the alternative volatility measures we use to construct forecasts. However, before doing so, we state the assumptions we make about asset price dynamics and outline the relevant theory on which all our volatility measures are based.

The Dynamics of Futures Prices
Assume that over the period $t \in [0, \mathcal{T}]$ investors can continuously trade in a frictionless and arbitrage-free market. In the filtered probability space $(\Omega, \mathcal{F}, P; \{\mathcal{F}_t\}_{t \in [0, \mathcal{T}]})$, the futures price of a contract expiring at time $T$, with $0 < T < \mathcal{T}$, denoted as $F_t$, evolves according to the following general diffusion,
$$\frac{dF_t}{F_t} = \mu_t \, dt + \sigma_t \, dW_t,$$
where $W_t$ is a Wiener process. The drift $\mu_t$ and volatility $\sigma_t$ can change across time according to the filtration $\mathcal{F}_t$. The constraint imposed on the futures price dynamics is that the stochastic process is a semimartingale without jumps in prices.⁶ It is worth noting that the only restriction imposed on the volatility dynamics is that $\sigma_t$ is a strictly positive (càdlàg) stochastic process, so volatility can exhibit jumps.
According to these price dynamics, the total variation of logarithmic futures price changes from $t = 0$ to $T$ is given by the integrated variance (IVAR), defined as
$$\mathrm{IVAR} = \int_0^T \sigma_t^2 \, dt.$$
Although total return variation is the forecast target of this paper, we also utilize the concept of corridor integrated variance (CIVAR), i.e., variance accumulated only when the underlying asset ($F_t$ in our case) lies between two "barrier" price levels $B_1$ and $B_2$. Defining the indicator function $I_t$ that takes the value of 1 if $B_1 \le F_t \le B_2$ and 0 otherwise, CIVAR is given by the following expression,
$$\mathrm{CIVAR} = \int_0^T \sigma_t^2 \, I_t \, dt.$$
Obviously, when the corridor defined by $B_1$ and $B_2$ is sufficiently wide to contain all levels that the futures price can reach with positive probability under $P$, CIVAR will be equal to IVAR. In other words, IVAR can be seen as a special case of CIVAR, since for $B_1 = 0$ and $B_2 = \infty$ the definitions of the two measures coincide.
6 Price jumps are excluded from this representation because the OIV expectations, discussed later in the paper, will be biased when prices are subject to discontinuous movements. Jiang and Tian (2005) and Carr and Wu (2009) note that the bias will not be sizeable for small or moderate jumps, although large jumps could have a significant impact, as argued by Carr, Lee, and Wu (2012).

Volatility Expectations From Option Prices
Option markets may be informative about future volatility, since observed prices can be utilized to extract forward-looking expectations of the aforementioned volatility measures. In particular, suppose European plain vanilla options, written on an underlying futures contract $F_t$ and expiring at time $t = T$, trade for a continuum of strike prices $K$. As shown in Carr and Madan (1998), Demeterfi et al. (1999) and Britten-Jones and Neuberger (2000), ex ante risk-neutral expectations of the future integrated variance can be obtained by calculating the value of a static position in a portfolio of European options. Specifically, the expected integrated variance from time $t = 0$ to time $T$, under the risk-neutral measure $Q$, can be calculated from:
$$E_0^{Q}\left[\mathrm{IVAR}\right] = 2 \int_0^{\infty} \frac{M_{0,T}(K)}{K^2} \, dK, \tag{4}$$
where $M_{0,T}(K)$ is the price of a European out-of-the-money option (i.e., either put or call), with strike price $K$ and maturity $T$. Since this expectation does not depend on a particular option pricing model (such as the Black-Scholes model for example), it is referred to as Model Free Implied Variance (MFIV). Similarly, as shown in Carr and Madan (1998) and Andersen and Bondarenko (2007), Corridor Implied Variance (CIV), i.e. the risk-neutral expectation of future integrated corridor variance, can be obtained by calculating the value of a static position in European options with strikes ranging from $B_1$ to $B_2$,
$$E_0^{Q}\left[\mathrm{CIVAR}\right] = 2 \int_{B_1}^{B_2} \frac{M_{0,T}(K)}{K^2} \, dK. \tag{5}$$
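As a purely illustrative check of the two static-replication integrals, the sketch below evaluates them numerically on a dense strike grid. It is not the estimation procedure used in the paper: it assumes zero interest rates, generates synthetic option prices from the Black (1976) formula with a flat volatility, and uses hypothetical helper names (`black76_otm_price`, `implied_variance`). Under these assumptions the full-range integral should be close to $\sigma^2 T$, while the corridor integral is strictly smaller.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black76_otm_price(forward, strike, sigma, T):
    """Undiscounted Black (1976) price of the out-of-the-money option at `strike`."""
    sd = sigma * math.sqrt(T)
    d1 = (math.log(forward / strike) + 0.5 * sd * sd) / sd
    d2 = d1 - sd
    if strike < forward:  # out-of-the-money put
        return strike * norm_cdf(-d2) - forward * norm_cdf(-d1)
    return forward * norm_cdf(d1) - strike * norm_cdf(d2)  # out-of-the-money call

def implied_variance(price_fn, lower, upper, n=20000):
    """Trapezoid approximation of 2 * integral of M(K) / K^2 over [lower, upper]."""
    h = (upper - lower) / n
    total = 0.0
    for i in range(n + 1):
        k = lower + i * h
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * price_fn(k) / (k * k)
    return 2.0 * h * total

# Synthetic inputs (assumptions for illustration only).
forward, sigma, T = 50.0, 0.30, 22 / 252

def price(k):
    return black76_otm_price(forward, k, sigma, T)

mfiv = implied_variance(price, 1.0, 500.0)  # wide range stands in for (0, infinity)
civ = implied_variance(price, 45.0, 55.0)   # corridor [B1, B2] = [45, 55]
print(f"MFIV = {mfiv:.6f}, sigma^2 T = {sigma**2 * T:.6f}, CIV = {civ:.6f}")
```

The strike range [1, 500] is an arbitrary truncation that is harmless here because the synthetic prices vanish far from the forward; with market data the truncation or extrapolation choice is exactly the practical difficulty discussed in the text.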

CBOE Crude Oil Volatility Index (OVX)
The last option-based measure we consider is the Crude Oil ETF Volatility Index (OVX), also known as the "Oil VIX". The OVX, which is produced and disseminated by the CBOE, intends to measure the market's (risk-neutral) expectation of crude-oil price volatility over the next month. It is defined as the square root of MFIV, given in Equation 4. The data underlying the OVX computation are options written on the United States Oil Fund (USO), an ETF that is designed to track the price of West Texas Intermediate light sweet crude-oil. For the construction of the OVX the CBOE adopts exactly the same methodology as the one employed for the popular S&P 500 VIX index. Notably, CBOE applies a liquidity criterion to determine the range of option contracts included in the calculation of the index. In particular, moving from high (low) strike, out-of-the-money, put (call) options towards those with lower (higher) strikes, once two contracts with consecutive strike prices have zero bid prices a cut-off point is applied and no further contracts are considered. Therefore, both the VIX and the OVX are, in reality, CIV measures, with a corridor width determined by market liquidity.
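The liquidity criterion described above can be sketched as follows. This is a simplified reading of the rule, not CBOE's production algorithm: the function name `zero_bid_cutoff` and the quote layout are hypothetical, the treatment of the at-the-money strike is omitted, and the sketch assumes that isolated zero-bid strikes are skipped while two consecutive zero bids end the scan on that side.

```python
def zero_bid_cutoff(quotes, forward):
    """Return the strikes retained by a 'two consecutive zero bids' rule.

    quotes  : dict mapping strike -> bid price of the out-of-the-money option
              at that strike (put below the forward, call above it).
    forward : forward/futures level separating OTM puts from OTM calls.
    """
    def walk(strikes):
        kept, zeros_in_a_row = [], 0
        for k in strikes:
            if quotes[k] <= 0.0:
                zeros_in_a_row += 1
                if zeros_in_a_row == 2:
                    break  # cut-off reached: ignore all further strikes
            else:
                zeros_in_a_row = 0
                kept.append(k)
        return kept

    # Puts are scanned from high to low strikes, calls from low to high.
    puts = walk(sorted((k for k in quotes if k < forward), reverse=True))
    calls = walk(sorted(k for k in quotes if k > forward))
    return sorted(puts + calls)

# Hypothetical bid quotes for illustration.
quotes = {30: 0.05, 35: 0.0, 40: 0.0, 45: 0.40, 50: 1.20,
          55: 1.00, 60: 0.30, 65: 0.0, 70: 0.02, 75: 0.0}
kept = zero_bid_cutoff(quotes, forward=52.0)
print(kept, "-> effective corridor:", (min(kept), max(kept)))
```

In this example the strike at 30 is dropped even though it has a positive bid, because the scan stops at the two zero bids at 40 and 35. The retained range therefore depends on quoting activity rather than on a fixed corridor, which is the sense in which the index behaves like a CIV measure with a liquidity-determined width.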

Realized Variance
While our forecast target, i.e. integrated variance, is inherently latent, accurate ex post IVAR estimates can be obtained using high-frequency price observations. In particular, Barndorff-Nielsen and Shephard (2002), Meddahi (2002) and Andersen, Bollerslev, Diebold, and Labys (2003) show that summing squared intraday returns leads to an estimator which converges in probability to IVAR and is referred to as realized variance (RV). To calculate RV, suppose on day $t$ there are $M + 1$ equally spaced intraday price observations at times $t_i$, $i = 0, \ldots, M$. We will also assume the interval between the observations is $1/M$, i.e., the length of a day is standardised to unity. If the log price at time $t_i$ is $p_{t_i}$, then the intraday return between times $t_{i-1}$ and $t_i$ is $r_{t_i} = p_{t_i} - p_{t_{i-1}}$. It is then straightforward to calculate RV on day $t$ as
$$RV_t = \sum_{i=1}^{M} r_{t_i}^2.$$
Theoretically, RV becomes more accurate as $M$ increases, i.e., as more intraday prices are observed over shorter and shorter intervals. However, if prices are observed over very short intervals, RV will be contaminated by microstructure noise, which causes an upward bias (Andersen, Bollerslev, Diebold, and Labys, 2000).⁷ A common remedy is to use prices observed over a relatively coarse set of intraday times so that the effects of microstructure noise are mitigated. Typically, prices recorded over 5-min intervals are used, despite transactions occurring at a much higher frequency. Although using a coarse intraday sampling interval solves the microstructure noise problem, it results in information being discarded.
In order to recover some of this information, so that our RVs are estimated as accurately as possible while still avoiding microstructure noise through a coarse sampling interval, we use sub-sampled RVs (Zhang, Mykland, and Aït-Sahalia, 2005), calculated as in Equation (6),⁸ where $\tilde{r}_{\Delta,t_i} = \sum_{j=1}^{\Delta} r_{t_{i+(j-1)}} = p_{t_{i-1+\Delta}} - p_{t_{i-1}}$ is the $\Delta$-period intraday return between times $t_{i-1}$ and $t_{i-1+\Delta}$, and $M$ and $\Delta$ are chosen such that $M/\Delta$ is an integer.
7 In the context of measuring IVAR, microstructure noise refers to the difference between observed transaction prices and efficient, true prices. Examples of factors that contribute to microstructure noise include: transactions tending to "bounce" between the bid and ask prices, i.e., transactions must occur at either the bid or ask price, neither of which may be the efficient, true price; and price discreteness, i.e., the minimum price change is one cent but the efficient, true price may change by fractions of a cent.
8 Alternative microstructure-robust estimators of IVAR, which use prices observed at higher frequencies, include the kernel realized variance introduced by Barndorff-Nielsen, Hansen, Lunde, and Shephard (2008) and the pre-averaged realized variance developed by Podolskij and Vetter (2009) and Jacod, Li, Mykland, Podolskij, and Vetter (2009). However, in a volatility forecasting context, Liu, Patton, and Sheppard (2015) show that these estimators lead to marginal improvements over realized variance estimated using a 5-min frequency.
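To make the sampling logic concrete, the sketch below computes a plain RV on a coarse grid and a sub-sampled RV that averages the RVs obtained from every possible offset of that grid. This is one common form of sub-sampling and is offered as an assumption: the paper's own estimator may differ in detail, for example through an edge or scaling adjustment. The function names and the simulated price path are hypothetical.

```python
import math
import random

def realized_variance(log_prices, step=1):
    """Sum of squared log returns sampled every `step` observations."""
    sampled = log_prices[::step]
    return sum((b - a) ** 2 for a, b in zip(sampled, sampled[1:]))

def subsampled_rv(log_prices, delta):
    """Average of the `delta` coarse-grid RVs obtained by shifting the grid start.

    No adjustment is made for the shifted grids containing one fewer return
    than the unshifted grid (an assumption of this sketch).
    """
    grids = [realized_variance(log_prices[k:], delta) for k in range(delta)]
    return sum(grids) / delta

# Simulated 30-second log prices for one pit-trading day: M = 660 intervals,
# i.e. 661 observations (illustrative data only).
random.seed(0)
M = 660
step_vol = 0.02 / math.sqrt(M)  # roughly 2% daily volatility
log_prices = [math.log(50.0)]
for _ in range(M):
    log_prices.append(log_prices[-1] + random.gauss(0.0, step_vol))

rv_5min = realized_variance(log_prices, step=10)  # 10 x 30 sec = 5 min
rv_ss = subsampled_rv(log_prices, delta=10)
print(f"5-min RV = {rv_5min:.6f}, sub-sampled RV = {rv_ss:.6f}")
```

With 661 observations and a step of 10, the unshifted grid yields 66 five-minute returns, while each of the nine shifted grids yields 65; averaging across the ten grids uses all 30-second prices without sampling returns at the noisy 30-second frequency.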

Data and Sample Construction
Our main dataset consists of options and high-frequency prices for the WTI Light Sweet Crude Oil futures, which currently trade at CME, the world's most liquid commodity derivatives market. Our options and futures datasets start in January 1996 and end in April 2016.

Option Data
Option contracts are written on futures contracts for physical delivery of light sweet crude-oil. In particular, the underlying asset is the futures contract whose delivery date is three business days after the expiration of the option. These contracts have American-style exercise and are settled in cash. Our option data consist of daily settlement prices, which are recorded at 14:30 ET each day.
Our empirical study focuses on monthly variance forecasts. Accordingly, we study options that have approximately 22 trading days to expiration. Throughout the sample, the day on which we collect option prices is never before the maturity date of the previous option chain we studied, i.e. all our option-based forecasts are non-overlapping.
We make several adjustments to the raw data before we proceed with the estimation of the OIV measures. As the latter require price data on European options rather than American ones for their calculation, we attempt to alleviate this problem in two ways. First, we eliminate all in-the-money options and only keep out-of-the-money options, for which the early exercise premium is significantly lower and, for the case of deep out-of-the-money options, quantitatively almost negligible. Second, we estimate the early exercise premium of each option using the Barone-Adesi and Whaley (1987) American option pricing formula and subsequently calculate the price of its European-style counterpart.
Finally, in order to guard against recording errors and other market microstructure effects, we eliminate options with a price less than $0.01 and filter all call/put prices that violate standard arbitrage bounds. An overview of our final option data sample is provided in Table 1. It is noteworthy that the number of traded option contracts has increased substantially over the last few years.

High-Frequency Futures Data
Our high-frequency data consist of transaction prices recorded at 30-sec intervals. Until June 2006, futures were traded between 09:00 and 14:30 ET using an open outcry system in a trading pit, resulting in 661 price observations per day. Subsequently, they have been traded between 18:00 and 17:00 ET the following day on the electronic GLOBEX platform, resulting in 2,761 price observations per day.
To construct RV we use 5-min intraday returns. This frequency is commonly used in the empirical literature, e.g., see Andersen, Bollerslev, Diebold, and Labys (2001), Andersen et al. (2003) and Liu et al. (2015) among others, as it is deemed to provide an appropriate trade-off between the objective of incorporating as much information as possible from intraday prices and the necessity to avoid contamination from microstructure noise. Hence, in applying our sub-sampled RV estimator in Equation (6) we set M = 660 or M = 2760, depending on which dates the data are from, and ∆ = 10.

Construction of Implied Volatility Measures
The computation of the MFIV and CIV measures requires the existence of options trading for a continuum of strike prices, an assumption which is of course not satisfied in practice. In order to address this, we first estimate a risk-neutral distribution using the prices of observed options for each relevant date, which enables us to subsequently generate option prices for arbitrary strike prices. Our preferred risk-neutral distribution is the flexible Generalized Beta Distribution of the second kind (GB2) which, as discussed in Taylor (2005), has a number of appealing properties.
The calculation of the CIV measures also requires a selection of the relevant corridor width, i.e., the barrier price levels $B_1$ and $B_2$. We consider four CIV measures in total, with the barriers determined by evaluating the quantile function of the risk-neutral distribution at probability levels governed by a parameter $p$. The CIV1 to CIV4 measures are obtained by first setting $p$ = 0.45, 0.35, 0.25, 0.10, respectively, and subsequently evaluating Equation (5).
Figure 1 plots the four CIV measures together with the MFIV estimates during our full sample (1996-2016) period. As expected, all measures exhibit a strong degree of covariation, although narrow CIV measures appear more stable than their wide corridor counterparts. As shown in Table 2, all CIV measures are highly correlated with both MFIV and ATMIV. The autocorrelation patterns for 1, 6, and 12 lags also reveal that CIV measures become less persistent as the corridor widens. As expected, the unconditional moments of the CIV measures approach those of the full MFIV as more options are included in the calculations, while for the case of CIV4 the two measures have almost identical properties.
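The mapping from a probability level to a pair of barriers can be illustrated with a short sketch. Two assumptions are made here that are not taken from the text: the barriers are placed symmetrically at the $p$ and $1 - p$ quantiles of the risk-neutral distribution, and a lognormal density stands in for the fitted GB2 distribution. The function names and numerical inputs are hypothetical.

```python
import math
from statistics import NormalDist

def lognormal_quantile(prob, forward, sigma, T):
    """Quantile of a lognormal risk-neutral density whose mean equals `forward`."""
    sd = sigma * math.sqrt(T)
    z = NormalDist().inv_cdf(prob)
    return forward * math.exp(sd * z - 0.5 * sd * sd)

def corridor_barriers(p, forward, sigma, T):
    """Barriers (B1, B2) at the p and 1 - p quantiles (assumed symmetric rule)."""
    return (lognormal_quantile(p, forward, sigma, T),
            lognormal_quantile(1.0 - p, forward, sigma, T))

# Probability levels quoted in the text for CIV1 to CIV4.
TAIL_PROBS = {"CIV1": 0.45, "CIV2": 0.35, "CIV3": 0.25, "CIV4": 0.10}

forward, sigma, T = 50.0, 0.30, 22 / 252  # illustrative inputs
barriers = {name: corridor_barriers(p, forward, sigma, T)
            for name, p in TAIL_PROBS.items()}
for name, (b1, b2) in barriers.items():
    print(f"{name}: B1 = {b1:.2f}, B2 = {b2:.2f}, width = {b2 - b1:.2f}")
```

Under this reading, a larger $p$ gives a narrower corridor, so CIV1 uses the tightest strike range around the centre of the distribution and CIV4 the widest, which matches the ordering described in the text.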

Volatility Forecast Construction
All our variance forecasts are non-overlapping and correspond to a monthly horizon (22 trading days). We consider two samples of different length, one that includes the OVX and one that does not. If we exclude the OVX from the set of OIVs evaluated, then our sample starts in January 1996 and ends in April 2016. We refer to this as our full sample. Conversely, if we include the OVX, then our sample starts in May 2007, when the OVX is first reported, and ends in April 2016. We refer to this as our OVX sample. Our full and OVX samples consist of 244 and 108 non-overlapping monthly observations, respectively.
Our OIVs provide ex ante expectations of monthly variance, hence they form stand-alone forecasts of variance. In contrast, RV is an ex post estimator of variance. Although we could use RV in a random-walk model, i.e., use the lagged RV as a stand-alone forecast, there is a considerable literature demonstrating the superiority of forecasts generated from time-series models of RV (e.g., Corsi (2009), Patton and Sheppard (2015), Bollerslev, Patton, and Quaedvlieg (2016) and Bollerslev, Hood, Huss, and Pedersen (2018)). We use the heterogeneous autoregressive (HAR) model of Corsi (2009) to generate RV-based forecasts. While various extensions of the HAR model could be considered, our study relies exclusively on the baseline version. This is because, as shown in the comprehensive studies of Sévi (2014) and Prokopczuk et al. (2016), sophisticated HAR extensions do not outperform the simple HAR benchmark. We estimate the HAR model parameters using a rolling window of 60 monthly observations. Forecast evaluation is conducted out-of-sample. Thus, in the following sections, reference to our full sample will be to the 184 out-of-sample observations (244 less the initial 60-month estimation window). Similarly, reference to our OVX sample will be to the 48 out-of-sample observations (108 less 60).
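The rolling estimation scheme can be sketched as follows. The regressors are an assumption based on the usual baseline HAR specification (lagged daily, weekly and monthly RV components); the exact definition used in the paper is not reproduced here. The function names are hypothetical and the data are synthetic, constructed so that the target is an exact linear function of the regressors.

```python
import random

def ols(X, y):
    """OLS coefficients from the normal equations, solved by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented system [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

def rolling_har_forecasts(target, daily, weekly, monthly, window=60):
    """One-step-ahead forecasts from a HAR regression re-estimated on a rolling window.

    target[s] is the RV realized over the month following forecast date s;
    daily[s], weekly[s], monthly[s] are RV components known at date s.
    """
    forecasts = []
    for t in range(window, len(target)):
        X = [[1.0, daily[s], weekly[s], monthly[s]] for s in range(t - window, t)]
        beta = ols(X, target[t - window:t])
        x_new = [1.0, daily[t], weekly[t], monthly[t]]
        forecasts.append(sum(b * x for b, x in zip(beta, x_new)))
    return forecasts

# Synthetic data: 244 monthly forecast dates, as in the full sample.
random.seed(1)
n = 244
daily = [random.uniform(0.5, 1.5) for _ in range(n)]
weekly = [random.uniform(0.5, 1.5) for _ in range(n)]
monthly = [random.uniform(0.5, 1.5) for _ in range(n)]
target = [0.1 + 0.2 * d + 0.3 * w + 0.4 * m for d, w, m in zip(daily, weekly, monthly)]

forecasts = rolling_har_forecasts(target, daily, weekly, monthly)
print(len(forecasts))  # number of out-of-sample forecasts
```

Because the months are non-overlapping, the target for date t - 1 is fully observed by date t, so each window uses only information available when the forecast is made.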

Empirical Results
We examine the forecast accuracy of competing forecasts using several techniques. Firstly, we evaluate the information content of each of our forecasts using Mincer-Zarnowitz regressions. Secondly, we make comparisons between the information content of our forecasts using encompassing regressions. Thirdly, we analyse the prediction errors of our forecasts using statistical loss functions. Lastly, we assess the economic value of our forecasts by implementing a volatility timing exercise.

Mincer-Zarnowitz Regressions
Our first evaluation procedure assesses the information content of our HAR-based forecasts and the OIVs. This is done by running Mincer-Zarnowitz regressions (Mincer and Zarnowitz, 1969), whereby we regress our variance target, the RV calculated over each out-of-sample monthly forecast horizon, against the competing forecasts. More precisely, for each forecast $i$, we run the following regression,
$$RV_{t+1,t+22} = \beta_0 + \beta_1 f^{i}_{t,t+22} + \varepsilon_{t+1,t+22},$$
where $f^{i}_{t,t+22}$ is the forecast from model $i$ using information available up until day $t$ for the variance between days $t + 1$ and $t + 22$. The information content of each forecast is measured by the $R^2$ of this regression.
In Table 3 we present results for the Mincer-Zarnowitz regressions. In Panel A we report results for the full sample, whilst in Panel B we summarize results for the OVX sample. The values of $R^2$ suggest the following. Firstly, OIVs have markedly higher information content. Forecasts based solely on RV, i.e., the HAR forecasts, result in the lowest $R^2$ values, whilst the $R^2$ values for the OIV forecasts are larger by 5-7 percentage points for the full sample and 35-37 percentage points for the OVX sample. Secondly, of the OIVs, CIV1 appears to have the highest information content. The CIV1 forecasts result in the largest $R^2$ values amongst all of the OIVs in both the full and OVX samples.
It should also be noted that the parameter values differ substantially between the forecasts. This is as expected and a consequence of the differing levels of bias in the forecasts. If a forecast is unbiased, then we would expect $\beta_0 = 0$ and $\beta_1 = 1$. Whilst all of the $\beta_0$ parameters are close to zero, there are large differences between the values of $\beta_1$. Overall, the MFIV, ATMIV and OVX are upwardly biased, consistent with the presence of a variance risk-premium, whilst CIV1, CIV2 and CIV3 are downwardly biased, reflecting the fact that they provide risk-neutral expectations for CIVAR. The observed biases are consistent with the mean values for the OIVs reported in Table 2.
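A minimal sketch of a Mincer-Zarnowitz regression illustrates how bias shows up in the slope. The function name and the numbers are hypothetical: the forecast is constructed to overstate realized variance by 25% at every date, so the fitted slope is 0.8 rather than 1 even though the forecast is perfectly informative.

```python
def mincer_zarnowitz(realized, forecast):
    """Regress realized variance on a forecast; return (beta0, beta1, R^2)."""
    n = len(realized)
    mean_f = sum(forecast) / n
    mean_r = sum(realized) / n
    sxx = sum((f - mean_f) ** 2 for f in forecast)
    syy = sum((r - mean_r) ** 2 for r in realized)
    sxy = sum((f - mean_f) * (r - mean_r) for f, r in zip(forecast, realized))
    beta1 = sxy / sxx
    beta0 = mean_r - beta1 * mean_f
    r_squared = sxy * sxy / (sxx * syy)
    return beta0, beta1, r_squared

# Illustrative monthly variances and a forecast that is 25% too high throughout.
realized = [0.004, 0.006, 0.010, 0.007, 0.012, 0.005]
inflated = [1.25 * r for r in realized]

beta0, beta1, r_squared = mincer_zarnowitz(realized, inflated)
print(f"beta0 = {beta0:.4f}, beta1 = {beta1:.4f}, R^2 = {r_squared:.4f}")
```

The example separates the two properties discussed in the text: the $R^2$ measures information content and is unaffected by a proportional bias, whereas the bias itself is absorbed by $\beta_1$ (below one for an upwardly biased forecast, above one for a downwardly biased one).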

Encompassing Regressions
Our second evaluation procedure assesses the relative performance of the alternative forecasts by running encompassing regressions. We make comparisons between two forecasts by running the following bivariate regressions,
$$RV_{t+1,t+22} = \beta_0 + \beta_1 f^{i}_{t,t+22} + \beta_2 f^{j}_{t,t+22} + \varepsilon_{t+1,t+22}.$$
We can determine whether the forecast from one model encompasses the other by examining the significance of the individual regression parameters. If the information contained in the forecast from model $i$ is subsumed by the information in the forecast from model $j$, then we expect $\beta_1$ to be insignificant and $\beta_2$ to be significant. If this occurs, then we say that the forecast from model $j$ encompasses the forecast from model $i$, and vice versa.
We make the following comparisons using our encompassing regressions: (i) we compare the information content of the HAR model forecasts, which are based on RV alone, against our OIVs to examine whether the forward-looking information in our OIVs is useful vis-à-vis the backward-looking information contained in RV; and (ii) we compare the information content of our alternative OIVs to examine which of our OIVs contains the most useful forecasting information.
The results of the encompassing regressions are summarized in Table 4. Results for comparisons between the OIVs and the HAR forecasts are summarized in Panels A and B for the full and OVX samples, respectively. Panel A shows that CIV1, CIV2 and CIV3 encompass the HAR forecasts, at the 1% level. In the remaining encompassing regressions, although the parameters on MFIV, ATMIV and CIV4 are significant at the 1% level, the parameters on HAR are also significant; albeit only at the 10% level for the encompassing regressions involving ATMIV and CIV4. Thus, although they do not encompass the HAR forecasts, there appears to be incremental information in the MFIV, ATMIV and CIV4 OIVs which is not captured by RV. In Panel B, all OIVs encompass the HAR forecasts, at the 1% level.
Results for comparisons between the OIVs are summarized in Panels C and D for the full and OVX samples, respectively. The results in Panels C and D provide more refined insights regarding the relative information content of the OIVs. Focusing on the full sample results in Panel C, it can be seen that the MFIV forecasts are encompassed by all the other OIVs, at the 5% level. This result is consistent with the arguments of Andersen and Bondarenko (2007) that illiquid options with strike prices in the tails of the risk-neutral distribution (RND) and the requirement to extrapolate the tails of the RND reduce the accuracy of the MFIV forecast. Hence, despite its inferior theoretical foundations, the ATMIV is able to provide superior forecasts to the MFIV. Comparisons of the ATMIV and CIVs show that ATMIV is encompassed by CIV1, CIV2 and CIV3, at the 5% level. Thus, there is a trade-off between excluding imprecisely measured information from the tails of the RND and limiting the information used to construct OIVs to that contained in the at-the-money (ATM) option. It appears that CIV1, CIV2 and CIV3 strike a balance that enables them to harness the information contained in the liquid strikes that straddle the ATM contract without being contaminated by measurement errors in the tails of the RND. The encompassing regressions involving CIV1, CIV2, CIV3 and CIV4 show that CIV4 is encompassed by CIV1, CIV2 and CIV3, at the 5% level. This is indicative of the corridor being too wide for CIV4 such that it too is contaminated by noise in the tails of the RND.
The results in Panel D are much weaker than those in Panel C and likely reflect the smaller size of the OVX sample. Nonetheless, although often significant at the 10% level only, the results show that CIV1 encompasses all other OIVs.
Thus, the results from the encompassing regressions support those from the Mincer-Zarnowitz regressions. The information content of the OIVs appears to be superior relative to those based on RV alone, i.e., the HAR forecasts. Further, of the OIVs, the CIV1 appears to be informationally superior.

Bias-corrected and Augmented Models
Unsurprisingly, our previous results showed that our OIV forecasts are biased. By comparing the adjusted $R^2$ values of the encompassing regressions in Table 4 to the $R^2$ values of the Mincer-Zarnowitz regressions in Table 3, we can also see that a linear combination of RVs and OIVs leads to a higher information content than can be attained with any individual forecast. Therefore, we introduce two additional sets of forecasting models. Firstly, instead of using our OIVs as stand-alone forecasts, we use predicted values from univariate regressions of $RV_{t+1,t+22}$ on each of our OIVs. Intuitively, the objective of this set of models is to remove the bias from the OIVs. Hence, we refer to the predicted values from these models as bias-corrected option-implied forecasts. We use BC-MFIV to denote the bias-corrected MFIV forecast, BC-ATMIV to denote the bias-corrected ATMIV forecast and so on.
Secondly, we augment the HAR model by adding each of the OIVs. Thus, each augmented HAR model adds to the HAR specification the regressor $OIV_{t,t+22}$, an OIV calculated on day $t$ for return variation between days $t + 1$ and $t + 22$. We use HAR-MFIV to denote the forecasts from an HAR model augmented with MFIV, HAR-ATMIV the forecasts from an HAR model augmented with ATMIV and so on. Therefore, in total, we examine thirteen competing forecasts. Our benchmark is the HAR model, since this uses information from RV exclusively. We then have six bias-corrected option-implied forecasts and six augmented HAR forecasts.

Forecast Evaluation using Statistical Criteria
Although Mincer-Zarnowitz and encompassing regressions provide insights into the information content of our forecasts, they do not provide much information about their precision. From the perspective of economic agents, quantifying forecast accuracy is of paramount importance. Thus, we now turn to assessing prediction errors by means of statistical loss functions.

Statistical Loss Functions
To evaluate the accuracy of our forecasts, we use one symmetric loss function, the mean squared error (MSE), and one asymmetric loss function, the quasi-likelihood (QLIKE). A lower MSE and/or QLIKE corresponds to smaller prediction errors. These loss functions were chosen because they are commonly employed in the volatility forecasting literature and, as shown in Patton (2011), they are robust to measurement error in the IVAR proxy. More precisely, using the MSE and QLIKE loss functions ensures that the ranking of two forecasts in terms of expected loss is preserved even when the true integrated variance is replaced by a conditionally unbiased, but imperfect, proxy. 15 For forecast $i$, the MSE loss function we use is defined as,
where $\lfloor x \rfloor$ rounds $x$ down to the largest integer that is less than or equal to $x$. Similarly, the QLIKE loss function we use is defined as,
In order to test for significant differences between the MSE and QLIKE of competing forecasts we use Diebold-Mariano tests (Diebold and Mariano, 1995) with Newey-West (Newey and West, 1987) standard errors.
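The two loss functions and the Diebold-Mariano comparison can be sketched as follows. The QLIKE form used here, RV/f - log(RV/f) - 1, is the normalization discussed in Patton (2011) and is an assumption rather than a transcription of the definition above. The function names and the `lags` argument of the Newey-West estimator are likewise hypothetical.

```python
import math


def mse_losses(rv, f):
    """Per-period squared errors between realized variance and forecast."""
    return [(r - x) ** 2 for r, x in zip(rv, f)]


def qlike_losses(rv, f):
    """Per-period QLIKE losses (assumed form): zero when f equals rv,
    and larger for under-prediction than for over-prediction of equal size."""
    return [r / x - math.log(r / x) - 1.0 for r, x in zip(rv, f)]


def diebold_mariano(loss_a, loss_b, lags=0):
    """Diebold-Mariano statistic for equal expected loss.

    A negative value means forecast A has the lower average loss.
    The long-run variance uses Newey-West (Bartlett) weights.
    """
    d = [a - b for a, b in zip(loss_a, loss_b)]
    n = len(d)
    d_bar = sum(d) / n
    e = [x - d_bar for x in d]
    lrv = sum(x * x for x in e) / n
    for k in range(1, lags + 1):
        gamma_k = sum(e[t] * e[t - k] for t in range(k, n)) / n
        lrv += 2.0 * (1.0 - k / (lags + 1)) * gamma_k
    return d_bar / math.sqrt(lrv / n)
```

The statistic is compared with standard normal critical values; in short samples the long-run variance estimate can be imprecise, so the choice of `lags` matters.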

Forecast Evaluation Results
The out-of-sample MSEs and QLIKEs of our competing forecasts are presented in Table 5. From Panel A it can be seen that, overall, the bias-corrected forecasts outperform the corresponding HAR forecasts according to the MSE. The BC-CIV1 forecasts result in the lowest MSE and, of the HAR-based forecasts, the lowest MSE is associated with HAR-CIV1. However, Diebold-Mariano tests show that no forecast results in an MSE which is significantly lower than that generated by the HAR forecasts.
The QLIKE results in Panel A differ slightly from those for the MSE. Under this loss function the HAR-based forecasts outperform the bias-corrected forecasts, with the most accurate forecasts being HAR-CIV1. Moreover, in contrast to the MSE results, we find that the QLIKEs for HAR-CIV1, HAR-CIV2, HAR-CIV3 and HAR-ATMIV are all significantly lower than for HAR at the 5% level. The QLIKEs for HAR-MFIV and HAR-CIV4 are also significantly lower at the 10% level. The fact that we find significant differences when using the QLIKE loss function is most likely associated with the ability of this loss function to more accurately discriminate between competing variance forecasts (Patton and Sheppard, 2009). Overall, the results in Panel A suggest that using information in CIV1 leads to the most accurate variance forecasts.
The results in Panel B are analogous to those in Panel A and support our overall conclusion that incorporating information from CIV1 leads to the most accurate forecasts. Importantly, it can also be seen that forecasts based on OVX perform poorly. When the HAR-OVX forecasts are used as a benchmark, nearly all non-OVX option-implied forecasts produce significantly lower QLIKEs and MSEs at the 5% level. 17 Although the BC-OVX forecasts result in an MSE that is significantly lower than that for the HAR-OVX forecasts, of the bias-corrected forecasts, BC-OVX results in the largest MSE.
In summary, the results show that prediction errors can be minimized when CIV1 is employed. The results are consistent with those from the Mincer-Zarnowitz and encompassing regressions, which showed that the OIVs were informationally superior to forecasts based on RV and that CIV1 had a higher information content relative to other OIVs. Our results also suggest that forecasts based on the OVX perform significantly worse than those based on our alternative OIVs. Thus, the results suggest there is value in constructing our OIVs directly from option prices rather than relying on the CBOE's methodology.

Forecast Evaluation using Economic Criteria
Although evaluating variance forecasts with MSE and QLIKE is common, they are statistical loss functions. It is not clear how minimising the MSE and/or QLIKE translates into economic gains for the forecaster. Therefore, in order to ascertain whether the improved forecast accuracy we observe leads to economic gains, we perform a volatility timing exercise.
We consider an agent whose investment opportunity set consists of the WTI crude-oil futures and a risk-free asset and assume the agent's objective is to maximise the utility of a portfolio consisting of these two assets. We follow the set-up of Bollerslev et al. (2018) and assume Sharpe ratios are constant and that a quadratic utility function provides an accurate approximation to investors' true utility functions. With these assumptions, Equation 9 describes investors' utility per unit wealth (UoW), where $W_t$ is the investor's wealth, $w_t$ is the proportion of the investor's wealth held in crude-oil futures, SR is the Sharpe ratio and γ is the investor's level of risk aversion. 18 It can be shown that constructing a portfolio to maximise expected utility is equivalent to forming a portfolio with a specific volatility target, where the optimal proportion of wealth to invest in crude-oil futures is given by Equation 10.
The ratio in the numerator corresponds to the volatility target and the denominator is the expected volatility. To operationalise the strategy, $E_t(RV_{t+1,t+22})$ in the denominator is replaced with a variance forecast. We also follow Bollerslev et al. (2018) and set SR = 0.4 and γ = 2, which they argue are sensible parameters when forecasting variance over a monthly horizon and which result in a volatility target of 20%. Using Equation 10 to substitute for $w_t$ in Equation 9, replacing $E_t(RV_{t+1,t+22})$ with our variance forecasts, and plugging in our assumed values of SR and γ leads to the expression for the utility per unit wealth based on forecast $f_{t+1,t+22}$ given in Equation 11. Comparisons between models are then made using realized utility, given in Equation 12. Note that realized utility is expressed as a percentage return and, given our assumptions, can take a maximum value of 4%.
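The chain of substitutions described above can be sketched as follows. This is a reconstruction in the spirit of the Bollerslev et al. (2018) set-up, not a statement of the exact equations used here: the mean-variance form of the utility function, the symbol $UoW_t$, and the omission of any annualization constants are assumptions.

```latex
% Quadratic utility with a constant Sharpe ratio (assumed form)
U(w_t) = W_t \left( w_t \, SR \, \sqrt{E_t(RV_{t+1,t+22})}
         - \frac{\gamma}{2} \, w_t^2 \, E_t(RV_{t+1,t+22}) \right)

% First-order condition in w_t: volatility target over expected volatility
w_t^{*} = \frac{SR / \gamma}{\sqrt{E_t(RV_{t+1,t+22})}}

% Use forecast f_{t+1,t+22} in the weight and evaluate at realized variance
UoW_t = \frac{SR^2}{\gamma} \left(
        \frac{\sqrt{RV_{t+1,t+22}}}{\sqrt{f_{t+1,t+22}}}
        - \frac{1}{2} \, \frac{RV_{t+1,t+22}}{f_{t+1,t+22}} \right)
```

With SR = 0.4 and γ = 2, the volatility target is SR/γ = 20% and the scale factor is SR²/γ = 8%. A perfect forecast ($f_{t+1,t+22} = RV_{t+1,t+22}$) then gives 8% × (1 - 1/2) = 4%, which agrees with the maximum value stated in the text.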

Realized utility results
Panels A and B of Table 6 summarize the realized utility results for our full and OVX samples, respectively. The first column of Panel A reports the realized utility. It can be seen that the HAR-based forecasts outperform the bias-corrected forecasts. In addition, of the HAR-based forecasts, HAR-CIV1 generates the highest realized utility.
In the first column of Panel A we also report the results from Diebold-Mariano tests of whether the realized utility associated with a forecast is significantly different to the realized utility generated by the HAR forecasts. It can be seen that HAR-CIV1, HAR-CIV2 and HAR-CIV3 produce realized utilities that are significantly higher, at the 5% level for HAR-CIV1 and HAR-CIV2 and the 10% level for HAR-CIV3, than the realized utility attained with the HAR forecasts. Therefore, the pattern in forecasting performance observed with the statistical loss functions is retained when we use an economic loss function. The results also confirm that the improvements in forecasting accuracy we observed with the statistical loss functions translate into economic benefits.
The difference between the value of the realized utility for the HAR-CIV1 forecasts relative to the HAR forecasts in Panel A is 2 bp. Although this may appear to be a relatively modest difference, there are two reasons why this represents a material economic improvement. Firstly, as highlighted by Bollerslev et al. (2018), there has been a drive by investment management companies, in particular mutual funds and ETFs, towards lowering fees. The fees now charged by low-cost funds are of the order of tens of basis points. As argued by Bollerslev et al. (2018), this means that a single digit basis point increase in fees is relatively substantial. The difference in realized utility of 2 bp means that a fund using the HAR-CIV1 forecasts instead of the HAR forecasts will be able to increase its fees by 2 bp and remain equally attractive to investors.
Secondly, the realized utilities reported are unconditional. Therefore, in any given month the economic benefit of using the HAR-CIV1 over the HAR forecasts could be much larger than 2 bp. In order to examine this further, we report in columns 2-7 of Panel A in Table 6 the 2.5, 5, 10, 25, 50 and 75% quantiles of the UoW for each set of forecasts. Comparing the HAR-CIV1 to the HAR forecasts, it is clear that there are substantial differences, of 5-10 bp, in the UoW at the 2.5, 5 and 10% quantiles. The difference between the lower quantiles of the UoW distributions suggests the HAR-CIV1 forecasts tend to outperform the HAR forecasts precisely when it is most difficult to forecast volatility and when an accurate forecast is in greatest demand, e.g., when there is a volatility shock which causes UoW to be low.
The results for the OVX sample in Panel B of Table 6 are analogous to those for the full sample in Panel A. Again, the HAR-CIV1 results in the highest realized utility, with all the non-OVX augmented HAR forecasts generating a realized utility higher than that attained by the HAR forecasts. The differences between the realized utilities of each non-OVX augmented HAR forecast and the HAR forecasts are also larger, at approximately 2-5 bp.
To examine the economic benefit of using our OIVs over the OVX, in Panel B we test for a significant difference between each forecast and the HAR-OVX forecasts. 19 It can be seen that the realized utilities of all of the non-OVX augmented HAR forecasts are significantly higher, typically at the 1% level, than the realized utility associated with the HAR-OVX forecasts. Therefore, these results further support our conclusion that there is value in constructing our OIVs directly from option prices rather than relying on the CBOE's methodology.
Similar to our findings in Panel A, there are potentially large differences between the conditional values of UoW for the OVX sample. For the 25-50% UoW quantiles the difference between the realized utilities associated with the HAR and augmented HAR forecasts is approximately 4-13 bp, whilst the difference between the HAR-OVX and non-OVX augmented HAR forecasts is approximately 9 bp.
Of course, the magnitude of the economic benefits derived from each forecast depends on the assumptions employed. As Bollerslev et al. (2018) highlight, the value of the realized utility is a linear function of the volatility target. Thus, if the volatility target doubles, e.g., through a doubling of the Sharpe ratio or a halving of the coefficient of risk aversion, the size of the economic benefits also doubles. Nevertheless, the framework above provides a sensible approximation and therefore the magnitude of the economic benefits presented should be reasonable.
In conclusion, the results demonstrate that the improvements in forecasting accuracy observed when using OIVs translate into economic benefits under reasonable assumptions. They also further corroborate our preference for the CIV1 amongst our OIVs.

Realized utility results with transaction costs
In order to make our analysis more realistic, we also take into consideration transaction costs. Specifically, transaction costs are calculated as a proportion of turnover (Equation 13), with the precise level of transaction costs controlled by c. We follow Wang, Liu, Ma, and Wu (2016) and Caldeira, Moura, Nogales, and Santos (2017) and set c to be either 0.033% or 0.15%. It should be noted that transaction costs are low in futures markets (Locke and Venkatesh, 1997) and have decreased markedly over the past few decades. Realized utility net of transaction costs is then given by Equation 14.
Panels A and B of Table 7 summarize the net realized utility and transaction costs for the full and OVX samples, respectively. In Panel A (Panel B) we also use Diebold-Mariano tests to formally evaluate whether the net realized utility and transaction costs associated with each forecast are significantly different to the net realized utility and transaction costs generated by the HAR (HAR-OVX) forecasts.
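The volatility timing exercise with transaction costs can be illustrated with a short Python sketch. It rests on several assumptions that are not taken from the text: the utility per unit wealth follows the Bollerslev et al. (2018) form, variances are annualized decimals (0.04 corresponds to 20% volatility), and turnover is measured as the absolute change in the portfolio weight between consecutive periods. All function names are hypothetical.

```python
import math

SR, GAMMA = 0.4, 2.0  # Sharpe ratio and risk aversion as set in the text


def weight(forecast_var):
    """Volatility-targeting weight: (SR / gamma) over forecast volatility."""
    return (SR / GAMMA) / math.sqrt(forecast_var)


def uow(realized_var, forecast_var):
    """Utility per unit wealth in decimals (assumed functional form)."""
    ratio = realized_var / forecast_var
    return (SR ** 2 / GAMMA) * (math.sqrt(ratio) - 0.5 * ratio)


def net_realized_utility(realized, forecasts, c):
    """Return (net, gross, cost): average UoW, less c times average turnover.

    Turnover is taken to be |w_t - w_{t-1}|, which is an assumption.
    """
    weights = [weight(f) for f in forecasts]
    utilities = [uow(r, f) for r, f in zip(realized, forecasts)]
    turnover = [abs(w1 - w0) for w0, w1 in zip(weights, weights[1:])]
    gross = sum(utilities) / len(utilities)
    cost = c * sum(turnover) / len(turnover) if turnover else 0.0
    return gross - cost, gross, cost
```

Under these assumptions a perfect forecast earns the 4% maximum before costs, and any forecast error lowers the gross figure regardless of its sign.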
In both Panels A and B it can be seen that transaction costs, whether c = 0.015% or c = 0.0033%, do not differ substantially between the competing forecasts. In Panel A, there are no significant differences in transaction costs, whilst in Panel B, BC-CIV1 and BC-CIV2 lead to significantly lower transaction costs at the 10% level. Consequently, because none of the forecasts lead to unusually high transaction costs, the rankings of the forecasts in Panels A and B of Table 7 are identical to those in Panels A and B of Table 6. In particular, the HAR-CIV1 forecasts produce the highest net realized utility in both the full and OVX samples. Within the full sample, the HAR-CIV1 and HAR-CIV2 forecasts result in net realized utilities that are significantly higher than the net realized utility of the HAR forecasts at the 5% level. For the OVX sample, all the non-OVX augmented HAR forecasts result in significantly higher net realized utilities than the HAR-OVX forecasts, typically at the 1% level. Thus, differences in trading volumes are small and do not have a material impact on the relative economic benefits of the forecasts. Therefore, the results also support our conclusions in Section 4.5.1.

Robustness Checks
In this section we analyse the robustness of our results to: (i) the size of the rolling-window used to estimate the models; (ii) the choice of out-of-sample period; and (iii) the method used to bias correct the OIVs.

Estimation window
Thus far, the parameters for our bias-correction and HAR models have been estimated using rolling windows of 60 observations, or approximately five years' worth of data. In this section we analyse the robustness of our results to the choice of estimation window. Specifically, we vary the estimation window between 66 and 96 observations, or between approximately five and eight years, and then re-evaluate our forecasts using the MSE, QLIKE and realized utility loss functions. We do not use an estimation window below 60 observations since this results in poorly estimated parameters and unreliable forecasts. Due to the limited number of observations, we are also unable to conduct this robustness test for the OVX sample. Table 8 summarizes the results of this analysis. Reassuringly, our main conclusions continue to hold. For all estimation windows, the BC-CIV1 model minimizes the MSE, whilst the HAR-CIV1 minimizes the QLIKE and maximises realized utility. Furthermore, our results using an estimation window of 60 observations are superior to those found using estimation windows between 66 and 96 observations.

Sub-sample analysis
In our analysis in Section 4, our out-of-sample period started in January 2001 and ended in April 2016. In this section we check the robustness of our results to variations in the out-of-sample period. Specifically, we vary the out-of-sample start date to be either January 2002, January 2003, . . ., or January 2012. This ensures we have sub-samples that both include and exclude the 2007/8 financial crisis. We do not consider a start date beyond 2012 because this would result in an insufficient number of out-of-sample observations.
In Table 9 we summarize the results from using the MSE, QLIKE and realized utility loss functions. Each column refers to the year in which the out-of-sample evaluation begins, with each analysis commencing in January of the associated year, and also reports the number of observations in each sub-sample. The pattern observed over the full out-of-sample period is retained in each of the sub-samples: the BC-CIV1 provides the most accurate forecasts according to the MSE, whilst the HAR-CIV1 forecasts are the most accurate according to the QLIKE and realized utility. Therefore, we conclude that our results are robust to the choice of out-of-sample period.

Alternative Bias Correction Procedure
Our final robustness check evaluates an alternative bias correction technique. Thus far, the bias in OIVs has been corrected for by using a regression model or by including the OIVs in the HAR model, where the parameter multiplying the OIV is able to modulate the bias. In this section we use a technique inspired by Prokopczuk and Simen (2014), who show that the forecasting performance of the MFIV can be improved significantly by making a non-parametric adjustment for the variance risk premium. We apply their technique to all our OIVs. 20 Since this method relies on averages of ratios, we will refer to it as a relative bias-correction.
More precisely, to implement the relative bias-correction, the average ratio of OIV to RV must be computed to get an estimate of the relative bias, which, in our case, is estimated using a window of τ monthly non-overlapping observations (Equation 15). The relative bias-corrected OIV can then be estimated as in Equation 16.
In Table 10 we summarize the MSE, QLIKE and realized utility for the relative bias-correction forecasts. As benchmarks, we include our, thus far, best-performing (non-relative) bias-corrected and HAR-based forecasts: the BC-CIV1 and HAR-CIV1 forecasts. To evaluate the effect of the window size used in the relative bias-correction, we report results for τ = {6, 12, . . . , 60}. 21 Several conclusions can be drawn. Firstly, the relative bias-corrected forecasts are most accurate when τ = 12. Secondly, the relative bias-corrected forecasts, for all values of τ, are more accurate than the BC-CIV1 and HAR-CIV1 forecasts under the MSE loss function. Except for when τ = 24, the MSE is minimized by the RBC-CIV2 forecasts. Lastly, under the QLIKE and realized utility loss functions, the HAR-CIV1 forecasts are the most accurate for all values of τ. Overall, although there is some evidence that the relative bias-correction may produce more accurate forecasts under the MSE loss function, our conclusions concerning the value of the information content of OIVs and particularly CIV1 remain unaltered.
20 Although the approach can be thought of as a correction for the variance risk premium when using the MFIV, the same interpretation does not hold for all of our OIVs, particularly the CIVs. Instead, we take a pragmatic view and treat it as an alternative bias-correction method.
21 These effectively correspond to window sizes of 0.5, 1, 1.5, . . . , 5 years.
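The relative bias-correction described in this section can be sketched as follows. The sketch follows the verbal description only (an average of OIV-to-RV ratios over the previous τ months, then a rescaling of the current OIV); the function name, the alignment of the series, and the choice to divide the OIV by the average ratio are assumptions made for illustration.

```python
def relative_bias_corrected(oiv, rv, tau=12):
    """Relative bias-correction of option-implied variances (hypothetical helper).

    oiv[k]: option-implied variance observed at the start of month k.
    rv[k]:  realized variance over month k.
    For each k >= tau, the relative bias is the average of oiv/rv over the
    previous tau months, and the corrected forecast is oiv[k] / relative bias.
    """
    corrected = []
    for k in range(tau, len(oiv)):
        ratios = [o / r for o, r in zip(oiv[k - tau:k], rv[k - tau:k])]
        relative_bias = sum(ratios) / tau
        corrected.append(oiv[k] / relative_bias)
    return corrected
```

If the OIV overstated RV by a constant 25% in every month, the average ratio would be 1.25 and the corrected forecast would coincide with realized variance. In practice the ratio varies over time, which is why the choice of τ affects accuracy.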

Conclusion
In this paper we evaluated, using both economic and statistical criteria, the information content of monthly crude-oil volatility forecasts extracted from the prices of traded options. We examined a variety of alternative option-implied measures including Black-Scholes at-the-money implied volatilities (ATMIV), model-free volatility expectations (MFIV), CBOE's oil-VIX (OVX) and, notably, corridor implied volatilities (CIV). Besides stand-alone comparisons, option-implied forecasts were also contrasted, and combined, with those obtained by a realized volatility model (HAR) that utilizes high-frequency return information.
Our key finding is that a particular CIV measure (CIV1), which utilizes a narrow range of option contracts, consistently generates the most accurate forecasts compared to all other alternatives. In Mincer-Zarnowitz regressions, CIV1 achieves the highest $R^2$, whilst encompassing regression tests show that CIV1 subsumes the information contained in ATMIV, MFIV, OVX and HAR forecasts. Furthermore, under either a symmetric (MSE) or an asymmetric (QLIKE) loss function, CIV1-based forecasts deliver the lowest forecast errors. In terms of economic significance, incorporating the CIV1 into the HAR model leads to forecasts that generate a significantly higher realized utility, even when transaction costs are taken into account. All these findings remain intact for both our full sample (1996-2016) and a shorter sub-sample that starts when the OVX index was first disseminated (2007-2016).
Our results also provide valuable insights regarding the information content of the OVX index. In particular, we find that the OVX-based forecasts perform rather poorly. Regression-based tests show that the OVX is encompassed by CIV1, whilst OVX-based forecasts are typically the least accurate according to either statistical or economic loss functions. To the best of our knowledge, this is the first time the reliability of the OVX measure has been scrutinized, so the concerns we raise have direct implications for practitioners who often rely on the CBOE's volatility indices.
Overall, this paper contributes to the academic literature that assesses the forward-looking information embedded in the prices of crude-oil options. Given that measuring crude-oil risk is of paramount importance for a variety of economic agents, our empirical study is of value for policy-makers and investors alike.

(8). Estimates reported in the columns labeled $\beta_1$ are for the forecast listed in the column labeled Forecast 1. Estimates reported in the columns labeled $\beta_2$ are for the forecast listed in the column labeled Forecast 2. Panel A presents results for encompassing regressions involving comparisons between the HAR forecasts and each of the option-implied variances for the full sample, which excludes the OVX forecasts. Panel B presents results for encompassing regressions involving comparisons between the HAR forecasts and each of the option-implied variances for the OVX sample, which includes the OVX forecasts. Panel C presents results for encompassing regressions involving comparisons between the option-implied variances for the full sample. Panel D presents results for encompassing regressions involving comparisons between the option-implied variances for the OVX sample. Newey-West standard errors were estimated. t-statistics are reported in parentheses. *** indicates significance at the 1% level; ** indicates significance at the 5% level; * indicates significance at the 10% level.
Table 6: Realized utility and quantiles of the utility per unit wealth (UoW) for the bias-corrected and HAR-based forecasts. The realized utility is given in Equation (12) and the per period UoW is given in Equation (11). All values in the table are reported as percentages. In the columns reporting realized utility, boldface is used to highlight maximum values. Panel A reports the realized utility and quantiles of the UoW for forecasts calculated using the full sample. Panel B reports the realized utility and quantiles of the UoW for forecasts calculated using the OVX sample. We also report the results of Diebold-Mariano tests for differences between the realized utility of the HAR forecasts, which are the benchmark forecasts, and each of the remaining competing forecasts. Newey-West standard errors were used. ***, ** and * indicate a significant difference to the benchmark at the 1, 5 and 10% levels, respectively.
Table 7: Net realized utility and transaction costs (TC) for the bias-corrected and HAR-based forecasts. The net realized utility is given in Equation (14) and TC is given in Equation (13)
realized utility for option-implied variance forecasts generated using the alternative relative bias-correction procedure described in Equations (15) and (16). Results are reported for alternative window sizes, τ, used in applying the relative bias-correction. Results for the BC-CIV1 and HAR-CIV1 forecasts, which do not employ the relative bias-correction procedure, are also included as benchmarks. All forecast evaluations were conducted using the full sample. Panel A reports the MSE, Panel B the QLIKE and Panel C the realized utility (reported as a percentage) for the forecasts. Boldface is used to highlight minimum values of MSE and QLIKE and maximum values of realized utility. We also report the results of Diebold-Mariano tests for differences between the MSE, QLIKE and realized utility of the HAR-CIV1 forecasts, the benchmark forecasts, and the MSE, QLIKE and realized utility, respectively, of each of the competing relative bias-corrected forecasts. Newey-West standard errors were used in all Diebold-Mariano tests. ***, ** and * indicate a significant difference to the benchmark at the 1, 5 and 10% levels, respectively.