Secular Mean Reversion and Long-Run Predictability of the Stock Market

Empirical financial literature documents the evidence of mean reversion in stock prices and the absence of out-of-sample return predictability over periods shorter than 10 years. The goal of this paper is to test the random walk hypothesis in stock prices and return predictability over periods longer than 10 years. Specifically, using 141 years of data, this paper begins by performing formal tests of the random walk hypothesis in the prices of the real S&P Composite Index over increasing time horizons up to 40 years. Even though our results cannot support the conventional wisdom which says that the stock market is safer for long-term investors, our findings speak in favor of the mean reversion hypothesis. In particular, we find statistically significant in-sample evidence that past 15-17 year returns are able to predict future 15-17 year returns. This finding is robust to the choice of data source, deflator, and test statistic. The paper continues by investigating the out-of-sample performance of long-horizon return forecast based on the mean-reverting model. These latter tests demonstrate that the forecast accuracy provided by the mean-reverting model is statistically significantly better than the forecast accuracy provided by the naive historical-mean model. Moreover, we show that the predictive ability of the mean-reverting model is economically significant and translates into substantial performance gains.


Introduction
Until the late 1980s there was a widespread agreement in the academic community that stock prices follow a random walk. Indeed, a large body of empirical literature seemed to support this point of view (see Fama (1970) and Leroy (1982) for surveys). The efficient market hypothesis is strongly associated with the idea of a random walk in stock prices and loosely says that stock returns are unpredictable. However, during the late 1980s there appeared a series of papers where the authors challenged the random walk hypothesis (see, for example, Summers (1986), Campbell and Mankiw (1987), Fama and French (1988b), Lo and MacKinlay (1988), and Poterba and Summers (1988)). In particular, these authors considered the time series properties of stock returns over increasing time horizons up to 10 years and found the indications of mean reversion 1 and return predictability. For example, Fama and French (1988b) discovered a substantial negative autocorrelation in returns over periods of 3-5 years and concluded that past 3-5 year returns are able to predict future 3-5 year returns. Poterba and Summers (1988) found that stock returns exhibit positive and statistically significant autocorrelation in returns over periods shorter than one year and negative, though not statistically significant at conventional levels (1% or 5%), autocorrelations over longer periods.
However, the conclusions reached in these earlier papers were strongly criticized on statistical grounds. For example, Kim, Nelson, and Startz (1991) demonstrated that due to the small-sample bias the statistical significance of the test statistics in Fama and French (1988b) and Poterba and Summers (1988) was overstated and there was no predictability of future 3-5 year returns on the basis of past 3-5 year returns. Similarly, Richardson and Stock (1989) and Richardson (1993) showed that correcting for the small-sample bias may reverse the results obtained by Fama and French (1988b) and Poterba and Summers (1988).
1 Mean reversion is an ambiguous concept and exists in several different forms. Most often, the concept of mean reversion can be expressed by the common investment wisdom which says that "over time markets tend to return to the mean". For example, when stocks go too far in one direction, they will eventually come back. Another type of mean reversion, which is studied in this paper, implies that the reversion is much more than just returning back to the mean. In reality the movement is far greater. This type of mean reversion incorporates another common investment wisdom which says that "an excess in one direction will lead to an excess in the opposite direction". That is, when stocks go too far in one direction, they will not just come back to the mean, but overshoot in the opposite direction. For example, a period of above average returns tends to be followed by a period of below average returns and vice versa. Throughout the paper, the term "period" is used to denote the period of mean reversion. The term "horizon" is mainly used to denote the average length of a complete cycle of reversion which consists of two periods: a period of higher than average returns and a period of lower than average returns (or vice versa).
Apparently, the statistical power of earlier tests was insufficient to reject the random walk hypothesis. Jegadeesh (1991) suggested a new, more powerful test and detected statistically significant evidence of mean reversion in stock prices (over periods of 4-8 years). In addition, Jegadeesh found evidence of mean reversion not only for the US stock market, but also for the UK stock market. Later, using a panel approach, Balvers, Wu, and Gilliland (2000) found statistically significant evidence of mean-reverting behavior (over periods of 3-3.5 years) in many international stock indices. Thus, mean reversion in stock prices seems to be an international phenomenon. Using the same technique as in Balvers et al. (2000), Gropp (2003) and Gropp (2004) found statistically significant evidence of mean reversion in the prices of portfolios of small cap stocks (over periods of 3.5 years) and industry-sorted portfolios (over periods of 4.5-8 years). Moreover, Balvers et al. (2000), Gropp (2003), and Gropp (2004) showed that parametric contrarian investment strategies that exploit mean reversion outperform buy-and-hold and standard contrarian strategies. This provides further support for the mean reversion findings in these papers.
Thus, the evidence of mean reversion in the prices of some stock portfolios over periods of 3-8 years now seems to be well established. In contrast, the predictability of stock returns is still a source of heated debate within the academic community. Earlier papers that demonstrated the existence of in-sample stock return predictability include, among others, Fama (1981), Campbell (1987), Fama and French (1988b), Fama and French (1988a), Campbell and Shiller (1988), and Fama and French (1989). Again, the conclusions reached in these earlier papers were strongly criticized on statistical grounds. For example, Richardson and Stock (1989) and Nelson and Kim (1993) pointed to the small-sample bias problem, whereas Cavanagh, Elliott, and Stock (1995), Stambaugh (1999), and Lanne (2002) pointed to a neglected near unit root problem. Responding to the critique, Torous, Valkanov, and Yan (2004), Lewellen (2004), Rapach and Wohar (2005), and Campbell and Yogo (2006) developed new tests that are free from the flaws discovered in the earlier tests, and again found some evidence of in-sample predictability. Yet, Bossaerts and Hillion (1999), Goyal and Welch (2003), and Welch and Goyal (2008) demonstrated that, despite evidence of in-sample predictability, the predictive models have no out-of-sample forecasting power. These authors therefore argued that in-sample predictability appears as a result of data mining. It should be noted, however, that in all these tests the longest forecast horizon was 10 years. Consequently, the results of these tests imply only that the predictive models fail to demonstrate statistically significant predictive ability over short-term and medium-term horizons.
To the best knowledge of the author, no one has ever tested the random walk hypothesis in stock prices over periods longer than 10 years. Yet, anecdotal evidence suggests the presence of mean reversion in stock prices over very long horizons. Probably the best known evidence is presented by Siegel (2002) in his famous book "Stocks for the Long Run". In particular, using a historical sample that covers nearly 200 years, Siegel computed the standard deviation of average real annual returns on a broad US stock market index over increasing horizons up to 30 years. Siegel found that the standard deviation declines far faster than predicted by the random walk hypothesis. This led many to conclude that stocks are less risky in the long run. However, so far there have been no studies conducted on whether the decline in the standard deviation over very long horizons is statistically significant.
Another well-known piece of anecdotal evidence, explicitly related to the mean reversion in stock prices over very long horizons, suggests the existence of long-lasting alternating periods of bull and bear markets. These long-lasting bull and bear markets are often termed "secular" bull and bear markets. Alexander (2000), Easterling (2005), Rogers (2005), Katsenelson (2007), and Hirsch (2012), among others, analyzed the dynamics of the real S&P Composite Index since 1870 and found indications of secular stock market trends that last from 5 to 25 years, with an average duration of about 15 years. Motivated by the seeming regularity in the reversion of secular trends, some authors made quite successful forecasts for the long-run US stock market outlook. For example, Alexander (2000) predicted that during the period from 2000 to 2020 the stock market would not beat the money market. So far, this forecast seems to be coming true. This anecdotal evidence suggests, among other things, that a price change over a given long-run period may be able to predict the price change over the subsequent long-run period. This idea motivates a re-examination of the predictive performance of the model introduced by Fama and French (1988b). Even though Kim et al. (1991) demonstrated that this model has no predictive power over increasing periods up to 10 years, as far as the author knows, no one has ever tested this model on periods longer than 10 years. This paper aims to fill these gaps in scientific knowledge about the stock market dynamics over very long horizons.
The first contribution of this paper is to provide, for the first time, statistically significant evidence against the random walk hypothesis over periods longer than 10 years. Even though our results cannot support the anecdotal evidence which says that the stock market is safer for long-term investors, our findings do speak in favor of mean reversion in stock prices over periods of 15-17 years. In particular, using the whole sample of data, we find statistically significant evidence that a given change in price over 15-17 years tends to be reversed over the next 15-17 years by a predictable change in the opposite direction.
This implies the existence of in-sample long-horizon predictability. Since the conventional wisdom says that in-sample evidence of stock return predictability might be a result of data mining, we investigate the performance of out-of-sample long-horizon return forecast.
Besides the mean-reverting model, we investigate the out-of-sample forecast accuracy of a few other competing models which employ, as a predictor for long-horizon returns, the cyclically adjusted price-to-earnings ratio, the price-to-dividends ratio, and the long-term bond yield.
The second contribution of this paper is to demonstrate that the out-of-sample long-horizon forecasts provided by the mean-reverting model and the models that employ the price-to-earnings and price-to-dividends ratios are statistically significantly better than the forecast provided by the historical-mean model. It is worth emphasizing that Welch and Goyal (2008) also used the price-to-earnings and price-to-dividends ratios in their study and found that these models have no predictive ability over forecast horizons up to 5 years. Our results therefore suggest that these models do have predictive ability, but over forecast horizons longer than 10 years. We also demonstrate that the advantages of the models that show predictive ability translate into significant performance gains. For example, we estimate that risk-averse investors would be willing to pay fees of 30 to 77 basis points per year to switch from the historical-mean model to a model with superior forecast accuracy. Moreover, our tests suggest that over the recent past the out-of-sample forecast accuracy provided by the mean-reverting model was substantially better than that provided by the competing models. In addition, we find that the mean-reverting model delivers the highest performance gains when investors have to make long-term allocation decisions.
The rest of the paper is organized as follows. Section 2 presents the data for our study, namely, the returns on the real Standard and Poor's Composite Stock Price Index over the period from 1871 to 2011. In Section 3 we perform the tests of the random walk hypothesis using the S&P Composite Index. In Section 4 we study the out-of-sample predictability of multi-year returns on the S&P Composite Index. Finally, Section 5 summarizes and concludes the paper.

The Data
The data for the study in this paper are the annual log real returns on a broad US stock market index for the period from 1871 to 2011. The returns are adjusted for dividends and computed using the real (i.e., inflation-corrected) Standard and Poor's Composite Stock Price Index data and corresponding dividend data. The inflation adjustment is done using the Consumer Price Index (CPI) for the US. All the data are provided by Robert Shiller. 2 The Standard and Poor's Composite Stock Price index is a value-weighted stock index.
The index for the period from 1871 to 1925 is constructed using the Cowles Commission Common Stock Index series. From 1926 to the present, the index data come from various reports of the Standard and Poor's. From 1957 this index is identical to the Standard and Poor's 500 Index which is intended to be a representative sample of leading companies in leading industries within the US economy. Stocks in the index are chosen for market size, liquidity, and industry group representation. For more details about the construction of the index and its dividend series see Shiller (1989), Chapter 26. Formally, let $(p_0, p_1, \ldots, p_n)$ be observations of the natural log of an inflation-corrected stock index price over $n + 1$ years.
Denote the one-year log return during year $t$, $1 \le t \le n$, by

$$r_t = p_t - p_{t-1}.$$

The resulting sample of $n$ return observations is $(r_1, r_2, \ldots, r_n)$. The probability distribution of $r_t$ is unknown, yet it is well-documented that stock returns are non-normal and heteroscedastic.
In order to check the robustness of findings, in particular, to see whether the results of testing the random walk hypothesis depend on a specific historical period, we divide the total sample period from 1871 to 2011 (141 annual observations) into two overlapping sub-samples of equal length: the first from 1871 to 1956 and the second from 1926 to 2011. 3 Both of these sub-samples cover a span of 86 years. Table 1 presents the descriptive statistics for the annual stock index returns, $r_t$, for the total sample and both sub-samples. Table 2 reports the results of the t-test on the difference in mean returns and the F-test on the difference in standard deviations between the first and the second sub-sample. The descriptive statistics and the results of the tests suggest that the mean and variance of returns on the index were more or less stable during the total sample. Specifically, using a t-test for equal means we cannot reject the hypothesis that the mean returns are alike in both sub-samples. Similarly, using an F-test for equal variances we cannot reject the hypothesis that the variances are alike in both sub-samples. All the series exhibit negative skewness and positive excess kurtosis, which indicates a deviation from normality. Observe also that the return series during the overall sample period exhibits a statistically significant negative autocorrelation at lag 2 (at the 5% level). There are no other indications of serial dependence in the return series.

Methodology
One of the main questions we want to study in this paper is whether the log of the real S&P Composite Stock Price Index follows a random walk. To answer this question we perform two well-known tests. The first test is based on the examination of the first-order autocorrelation function of k-year returns. This test is used by, for example, Fama and French (1988b), Fama and French (1989), and Fama (1990) and is based on the computation of the following test statistic

$$AC1(k) = \frac{Cov(r_{t,t+k},\, r_{t-k,t})}{Var(r_{t-k,t})}, \tag{1}$$

where $r_{i,j}$ is the compounded return from year $i$ to year $j$, $r_{i,j} = p_j - p_i$, $Cov(\cdot, \cdot)$ and $Var(\cdot)$ denote the covariance and variance respectively, and AC1(k) stands for the first-order autocorrelation function of k-year returns. The second test is based on the examination of the variance ratio. This test is very popular and is used by Cochrane (1988), Lo and MacKinlay (1988), Poterba and Summers (1988), and many others afterwards. The test is based on the computation of the following test statistic

$$VR(k) = \frac{Var(r_{t,t+k})}{k \, Var(r_{t,t+1})}. \tag{2}$$

Both tests are motivated by the notion that if the stock returns are independent and identically distributed, then the first-order autocorrelation function is zero and the variance ratio is unity irrespective of the number of years k. In other words, without serial dependence in data, the variance of k-year returns equals k times the variance of one-year returns and there is no correlation between two successive non-overlapping k-year returns. The null hypothesis of a random walk is rejected if the first-order autocorrelation is significantly different from zero or the variance ratio is significantly different from unity.
We want to compute the variance ratio VR(k) for return horizons k from 20 to 40 years and the first-order autocorrelation AC1(k) for periods from 10 to 20 years (note that in the latter case we also study serial dependence in data over time horizons from 20 to 40 years). The fundamental problem with these computations is that we have only a few non-overlapping intervals of length 20-40 years. Therefore in the computations of the two test statistics we employ overlapping intervals (rolling k-year periods). To compute AC1(k) we regress k-year returns $r_{t,t+k}$ on lagged k-year returns $r_{t-k,t}$. That is, we run the following regression

$$r_{t,t+k} = a(k) + b(k)\, r_{t-k,t} + \varepsilon_{t,t+k}. \tag{3}$$

Observe that the slopes of the regression, b(k), k ∈ [10, 20], are the estimated autocorrelations of k-year returns, AC1(k). The variance of k-year returns is likewise computed from the overlapping k-year returns.
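The two statistics on rolling k-year windows can be sketched as follows. This is a minimal illustration, not the paper's code: the function names are hypothetical, the variance uses the ordinary sample formula with an n − 1 denominator (an assumption, since the exact estimator is not spelled out here), and no bias correction is applied at this stage.

```python
def rolling_k_year_returns(r, k):
    """Overlapping k-year log returns: each is the sum of k consecutive one-year returns."""
    return [sum(r[i:i + k]) for i in range(len(r) - k + 1)]


def sample_variance(xs):
    """Ordinary sample variance (n - 1 denominator); an assumed choice."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def ac1(r, k):
    """OLS slope of r_{t,t+k} on r_{t-k,t} over rolling windows (needs len(r) >= 2k + 1)."""
    rk = rolling_k_year_returns(r, k)
    x, y = rk[:-k], rk[k:]  # lagged and subsequent k-year returns
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def variance_ratio(r, k):
    """Variance of rolling k-year returns divided by k times the one-year variance."""
    return sample_variance(rolling_k_year_returns(r, k)) / (k * sample_variance(r))


# Toy input: returns that flip sign every year (extreme one-year mean reversion).
toy = [0.1, -0.1] * 10
print(ac1(toy, 1), variance_ratio(toy, 2))
```

On real data both raw estimates would still need the small-sample corrections discussed next, because the rolling windows share most of their observations.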
The use of overlapping returns leads to some potentially very serious econometric issues which are commonly termed "small-sample bias". In particular, when it comes to the estimation of regression (3), there are two econometric problems. First, the estimates for the slope coefficients are biased. The sources of this bias in the estimation of autocorrelation are described in detail by Orcutt and Irwin (1948) and Marriott and Pope (1954).
More specifically, these authors show that an estimate of autocorrelation obtained using overlapping blocks of data is downward biased. Therefore, the estimates must be corrected for the bias. The second problem is that the standard errors of estimation using overlapping blocks of data are also downward biased, see, for example, Nelson and Kim (1993). Both biases work in the direction of making the values of t-statistic too large so that standard inference may indicate dependence in return series even if none is present. 4 Similarly, the estimate for the variance of multi-year returns, V ar(r t,t+k ), is downward biased when one uses overlapping blocks of data. 5 As an immediate consequence, the estimate for the variance ratio V R(k) becomes also downward biased. Therefore, the estimates for V R(k) must be corrected for the bias. In addition, since the estimate for V R(k) is a random variable, for the purpose of statistical inference we need to know the probability distribution of V R(k). This is necessary in order to be able to estimate standard errors and confidence intervals for V R(k). This is also necessary for performing hypothesis tests about the value of V R(k).
When the nature of the data generating process is unknown, it is generally not possible to tackle the econometric problems described above. However, in the context of the null hypothesis our goal is primarily to test whether or not stock returns are distributed independently of their ordering in time. Since under the null there is no dependence in the return series, in order to estimate the significance level and perform the bias correction of the test statistics, we follow closely Kim et al. (1991) and Nelson and Kim (1993), where the authors employ the randomization method. The randomization method was introduced by Fisher as a way to estimate the probability of obtaining some specific value for an estimator under the null hypothesis of no dependence.
We refer the interested readers to Noreen (1989) and Manly (1997) for extensive discussion of the randomization tests. In a nutshell, randomization consists of reshuffling the data to destroy any dependence and then recalculating the test statistics for each reshuffling in order to estimate its distribution under the null hypothesis of no dependence. The great advantage of the randomization method is that it is very simple and no assumptions are made about the actual distribution of stock returns.
To be more specific, consider the estimation of the significance level and the bias correction of the estimate for the autocorrelation of k-year returns AC1(k). First, we run regression (3) using the original series $(r_1, r_2, \ldots, r_n)$ to obtain the actual historical estimates for AC1(k). Then we randomize the original series to get a permutation $(r^*_1, r^*_2, \ldots, r^*_n)$. This is repeated 10,000 times, each time running regression (3) and obtaining an estimate for $AC1^*(k)$. In this manner we estimate the sampling distribution of AC1(k) under the null hypothesis. Finally, to estimate the significance level for some particular k, we count how many times the computed value for $AC1^*(k)$ after randomization falls below the value of the actual historical estimate for AC1(k). In other words, under the null hypothesis we compute the probability of obtaining a more extreme value for the autocorrelation of k-year returns than the actual historical estimate. Note that in this manner we compute p-values of a one-tailed test. The estimation bias is defined as the difference between the expected and the true value of $AC1^*(k)$. Since the true value is zero under the null hypothesis, the bias correction is done by subtracting the expected value of $AC1^*(k)$ from the actual historical estimate for AC1(k). That is, the bias-adjusted values of the first-order autocorrelation of k-year returns are given by

$$AC1(k) - E[AC1^*(k)].$$

The estimation of the significance level and the bias correction of the estimate for the variance ratio VR(k) is done in a similar manner. First, we use the original series to obtain the actual historical estimates for VR(k). Then we randomize the series and compute $VR^*(k)$ to obtain the sampling distribution under the null hypothesis. Finally, to estimate the significance level for some particular k, we count how many times the computed value for $VR^*(k)$ after randomization falls below the value of the actual historical estimate of VR(k). The estimation bias in this case is given by $E[VR^*(k)] - 1$ since the true value is unity under the null hypothesis. Finally, the bias-adjusted values of the variance ratio are given by

$$VR(k) - \left(E[VR^*(k)] - 1\right).$$

There is ample evidence that the series of stock returns is heteroscedastic, see, for example, Officer (1973) and Schwert (1989). In particular, many researchers document that the variance of stock returns is not constant, but time-varying.
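The randomization steps for a lower-tail test can be sketched as below. This is an illustrative sketch under stated assumptions rather than the author's implementation: `randomization_test`, its parameters, and the fixed seed are hypothetical, and the `ac1` helper is a plain OLS slope on rolling k-year returns. Passing `null_value=0.0` handles the autocorrelation; a variance-ratio statistic would use `null_value=1.0`.

```python
import random


def ac1(r, k):
    """OLS slope of rolling k-year returns on the preceding k-year returns."""
    rk = [sum(r[i:i + k]) for i in range(len(r) - k + 1)]
    x, y = rk[:-k], rk[k:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx


def randomization_test(r, stat, null_value, n_perm=10_000, seed=0):
    """Return (actual estimate, one-tailed p-value, bias-adjusted estimate)."""
    rng = random.Random(seed)      # fixed seed only for reproducibility of the sketch
    actual = stat(r)
    shuffled = list(r)
    draws = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)      # reshuffling destroys any serial dependence
        draws.append(stat(shuffled))
    # Share of reshuffled statistics that fall below the historical estimate.
    p_value = sum(d < actual for d in draws) / n_perm
    # Bias = mean of the statistic under the null minus its true null value.
    bias = sum(draws) / n_perm - null_value
    return actual, p_value, actual - bias


# Toy usage on a sign-flipping series with k = 1 and few permutations.
toy = [0.1, -0.1] * 20
print(randomization_test(toy, lambda s: ac1(s, 1), null_value=0.0, n_perm=500))
```

The same function can be reused for any horizon k by changing the lambda, which keeps the permutation logic separate from the choice of test statistic.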
To see whether a change in the variance of returns might affect the sampling distribution of a test statistic, we follow closely Kim et al. (1991) and Nelson and Kim (1993) and use the stratified randomization.
In the stratified randomization method the total sample (or a sub-sample) is divided into several separate bins (urns) and the randomization is performed within each bin. Such a stratified randomization allows us to see whether the sampling distribution of a test statistic is sensitive to the particular pattern of heteroscedasticity that occurred historically. Further, our results suggest that accounting for heteroscedasticity in stock returns does not influence the outcomes of the randomization tests on the first-order autocorrelations of k-year returns. Regardless of the number of bins in the stratified randomization, the first-order autocorrelation of k-year returns remains statistically significantly different from zero at the 5% level over periods of 15-17 years for the total sample and the second sub-sample.
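Shuffling within bins can be sketched as follows. The text does not say how the bins are formed, so this sketch assumes contiguous time blocks of (nearly) equal length; the function name `stratified_shuffle` is hypothetical.

```python
import random


def stratified_shuffle(r, n_bins, rng):
    """Shuffle returns within contiguous time bins, keeping each bin in its place."""
    n = len(r)
    # Bin edges for (nearly) equal-length contiguous blocks; an assumed binning rule.
    edges = [(i * n) // n_bins for i in range(n_bins + 1)]
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        block = list(r[lo:hi])
        rng.shuffle(block)  # dependence is destroyed only inside the bin
        out.extend(block)
    return out


# Toy usage: values 0..11 in three bins; each bin keeps its own members.
print(stratified_shuffle(list(range(12)), 3, random.Random(0)))
```

Because each observation stays in its original bin, a high-volatility era remains a high-volatility era in every reshuffled series, which is what lets the test reflect the historical pattern of heteroscedasticity.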

Empirical Results
In contrast, stratification of the sample weakens the evidence against the null hypothesis for the value of the variance ratio. In particular, for the total sample and the stratification with either 2, 4, or 5 bins, the variance ratio is not statistically significantly below unity at conventional levels. Similarly, for the second sub-sample and the stratification with either 3 or 5 bins the variance ratio is not significantly below unity at conventional levels. For the first sub-sample the variance ratio is not significantly below unity regardless of the number of bins in the stratified randomization.
Consequently, we do not have strong enough evidence to claim that the variance ratio decreases with increasing investment horizon. Even though without stratification the variance ratio over horizons of 30-34 years is statistically significantly below unity, stratification of the sample suggests that this effect can be attributed to the historical pattern of heteroscedasticity.

Robustness Tests
In order to check the robustness of our findings regarding the statistical significance of the secular mean reversion, we conducted a series of robustness checks whose results are not reported in this paper in order to save space. These additional robustness tests are described below.
First, the results reported in this section are obtained using the annual data provided by Robert Shiller. More specifically, these data are annual series of (average) January values of the real Standard and Poor Composite Stock Price Index. Hence, the results obtained in this section might be affected by seasonality. 6 To test the seasonality problem, we used the monthly data instead and obtained virtually the same levels of statistical significance of the mean-reverting behavior over very long horizons.
Second, Robert Shiller uses the CPI to adjust the nominal returns for inflation. We tested whether our evidence of mean reversion depends on the choice of deflator used to construct real stock returns. 7 For this purpose we constructed the real stock returns using the GDP deflator and value of the Consumer bundle. 8 We found that regardless of the choice of a deflator the evidence on mean reversion remains intact.
Third, since Kim et al. (1991) demonstrated that the mean reversion in the study by Fama and French (1988b) is primarily a phenomenon of the pre-World War II period, which is present in both our sub-samples, we tested whether there is evidence of mean reversion in the post-1940 period. 9 We found that the evidence is weaker (which is natural to expect since the sample length becomes shorter), but is still statistically significant at the 10% level.
6 We thank Ole Gjølberg for pointing this out.
7 We thank an anonymous referee for pointing this out.
8 The data on the GDP deflator and the Consumer bundle are downloaded from www.measuringworth.com. The value of the consumer bundle is defined as the average annual expenditures of consumer units.
9 We thank an anonymous referee for pointing this out.
Fourth, instead of the first-order autocorrelation of multi-year returns test statistic, suggested by Fama and French (1988b), we used the test statistic suggested by Jegadeesh (1991). In particular, instead of regression (3), we used the following regression

$$r_t = a(k) + b(k)\, r_{t-k-1,\,t-1} + \varepsilon_t.$$

Note that in this regression the stock market return at year t is predicted using the aggregated return over the preceding k years. Using this regression we could also reject the random walk hypothesis in stock prices over very long horizons in the post-1926 period.
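A regression of this kind, with a one-year return on the left and the aggregated return over the preceding k years on the right, can be sketched as below. The function name `one_year_on_past_k_slope` is hypothetical, and the sketch returns only the raw OLS slope, without the significance testing that the robustness check relies on.

```python
def one_year_on_past_k_slope(r, k):
    """OLS slope of the one-year return on the summed return of the preceding k years."""
    x = [sum(r[t - k:t]) for t in range(k, len(r))]  # aggregated past k-year return
    y = [r[t] for t in range(k, len(r))]             # return in the following year
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Toy usage: in a sign-flipping series, a positive 3-year sum precedes a negative year.
toy = [0.1, -0.1] * 10
print(one_year_on_past_k_slope(toy, 3))
```

Compared with regression (3), the dependent variable here is a single-year return, so the left-hand side does not overlap across observations.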
Finally, instead of using the data provided by Robert Shiller, we used the real annual returns on the large cap stocks provided by Kenneth French 10 over the period from 1927 to 2012. Again we found that the values of the first-order autocorrelation of multi-year returns are statistically significantly negative over periods of 15-18 years.
Thus, on the basis of the results from numerous robustness tests, we conclude that our evidence on the secular mean reversion is robust to the choice of data, deflator, sample period, and test statistics.

Motivation
The results of the tests performed in the preceding section allow us to reject the hypothesis of a random walk in stock prices. Figure 2 presents a scatter plot of $r_{t,t+15}$ versus $r_{t-15,t}$ for the returns on the real Standard and Poor's Composite Stock Price Index for the total sample period from 1871 to 2011. In addition, a regression line is fitted through these data points. The scatter plot clearly suggests a tendency for the past 15-year returns to predict future 15-year returns. The regression line has a strongly negative slope, and the $R^2$ statistic is 42%.
However, if we use the full sample period to estimate the first-order autocorrelation of multi-year returns, our estimate measures the degree of in-sample (IS) predictability. Yet it is known that in-sample predictability might be spurious (for example, it appears as a result of data mining) and not hold out-of-sample (OOS) (see, for example, Bossaerts and Hillion (1999), Goyal and Welch (2003), and Welch and Goyal (2008)). In order to guard against data mining, in this section we assess the performance of the OOS forecast based on the mean-reverting model given by regression (3). Besides the mean-reverting model, we use several other competing predictive models. We demonstrate that in the OOS tests the mean-reverting model and a few other predictive models perform statistically significantly better than the naive historical-mean model. In addition, we demonstrate that the advantages of the predictive models translate into significant utility gains.
Figure 2: Scatter plot of $r_{t,t+15}$ versus $r_{t-15,t}$ for the returns on the real Standard and Poor's Composite Stock Price Index for the period from 1871 to 2011. In addition, a regression line is fit through these data points. The goodness of fit, as measured by $R^2$, amounts to 42%.

Methodology of Assessing the Performance of OOS Forecasts
Our OOS recursive forecasting procedure is as follows. The initial IS period [1, m], m < n, is used to estimate regression (3) for different period lengths k ∈ [10, 20] years. In this manner we estimate a number of autocorrelations of k-year returns, AC1(k). Then we perform the bias adjustment of AC1(k). Next we select the value of $k = k_1$ which produces the lowest estimate of the bias-adjusted autocorrelation. That is,

$$k_1 = \arg\min_{k \in [10,20]} AC1(k).$$
Presumably, over the initial IS period the evidence of mean reversion is strongest over the period of $k_1$ years. Subsequently, the estimated coefficients from regression (3) with $k_1$ are used to compute the first $k_1$-year ahead return forecast for the period $[m + 1, m + k_1]$. We then expand our IS period by one year (it becomes [1, m + 1]), perform the selection of $k_2$ at which the evidence of mean reversion is strongest over the second IS period, and compute the OOS forecast for the period $[m + 2, m + k_2 + 1]$. We repeat the procedure, increasing our IS window by one year each time, until we compute the last $k_l$-year ahead return for the period $[n - k_l + 1, n]$.
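The expanding-window procedure can be sketched as follows. This is a simplified illustration under several assumptions, not the paper's code: the names `fit_mr` and `recursive_forecasts` are hypothetical, the bias adjustment is passed in as a callable `bias(r_is, k)` that defaults to zero (in practice it would come from a randomization step), and forecast origins whose horizon would run past the end of the sample are simply skipped.

```python
def fit_mr(r, k):
    """Intercept and slope from regressing r_{t,t+k} on r_{t-k,t} over rolling windows."""
    rk = [sum(r[i:i + k]) for i in range(len(r) - k + 1)]
    x, y = rk[:-k], rk[k:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / sum((u - mx) ** 2 for u in x)
    return my - b * mx, b


def recursive_forecasts(r, m, ks=range(10, 21), bias=lambda r_is, k: 0.0):
    """Expanding-window forecasts; returns (origin, k, forecast, realized) tuples."""
    results = []
    for end in range(m, len(r)):
        r_is = r[:end]  # only data known at the forecast origin
        fits = {k: fit_mr(r_is, k) for k in ks if end >= 2 * k + 1}
        if not fits:
            continue
        # Horizon with the lowest bias-adjusted slope, chosen in-sample only.
        k_best = min(fits, key=lambda k: fits[k][1] - bias(r_is, k))
        if end + k_best > len(r):
            continue  # the realized k-year return is not observable in the sample
        a, b = fits[k_best]
        forecast = a + b * sum(r_is[-k_best:])
        results.append((end, k_best, forecast, sum(r[end:end + k_best])))
    return results
```

Each forecast depends only on `r[:end]`, so changing any later observation leaves it unchanged, which is the property the procedure needs in order to avoid look-ahead bias.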
Observe that our OOS forecasting procedure is free from look-ahead bias, since to forecast the return for the period [m + j, m + k j + j − 1], j ≥ 1, we use only information that is available at time m + j − 1. It is worth noting that since we are dealing with a long-horizon forecast, in performing the recursive forecasting procedure we need not just to update the estimates for the coefficients of regression (3), but first of all we need to update the optimal length of the prediction period k. Observe that, in order to avoid the look-ahead bias, the optimal length of the prediction period k is determined using only information that is available at the end of each IS period as well. Thus, our OOS recursive forecasting procedure updates all the values of the model parameters and is able to adapt to changing conditions in the time series. For example, it can accommodate the possibility that the period of mean reversion is monotonically changing over time. 11 To assess the performance of OOS forecast, a common approach in the empirical literature is to run a "horse-race" among several competing predictive models. A standard criterion by which to compare two alternative predictive models is to compare their mean squared prediction errors (MSPE). As a matter of fact, the comparison of the mean squared prediction errors of two alternative models has a long tradition in evaluating which of the two models has a better ability to forecast, see McCracken (2007) and references therein. 
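The MSPE comparison between a candidate model and a benchmark can be sketched as below. The function names are hypothetical, and the sketch only computes the point comparison; it does not implement any formal test of whether the difference in MSPE is statistically significant.

```python
def mspe(forecasts, realized):
    """Mean squared prediction error of a sequence of forecasts."""
    errors = [(f - y) ** 2 for f, y in zip(forecasts, realized)]
    return sum(errors) / len(errors)


def mspe_ratio(model_forecasts, benchmark_forecasts, realized):
    """Values below 1 mean the model beat the benchmark on squared-error loss."""
    return mspe(model_forecasts, realized) / mspe(benchmark_forecasts, realized)


# Toy usage with made-up numbers: the model misses one outcome by 2, the benchmark misses both.
print(mspe_ratio([1.0, 2.0], [0.0, 0.0], [1.0, 4.0]))
```

In a horse race against the historical-mean model, `benchmark_forecasts` would hold the historical-mean forecasts made at the same forecast origins and for the same horizons as the model forecasts.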
In our study, we run OOS horse races involving the mean-reverting model (MR), the historical-mean model (HM), Robert Shiller's model (PE10), which uses the cyclically adjusted price-to-earnings ratio as a predictor for long-horizon returns, the model that uses the price-to-dividends ratio (PD) as a predictor, and the model that uses the long-term bond yield (LTY) as a predictor. Each of the last three models is a predictive regression of the k-year return on its predictor: pe10 for the PE10 model, pd for the PD model, and lty for the LTY model. Here pe10 is the natural log of the ratio of price to the 10-year moving average of earnings (this ratio is usually denoted as CAPE or PE10), pd is the natural log of the price-to-dividends ratio, and lty is the natural log of the long-term bond yield. The data for the price-to-earnings ratio, the price-to-dividends ratio, and the long-term bond yield are also provided by Robert Shiller.
11 Recall that the results presented in the previous section indicate that the period of long-term mean reversion seems to have been increasing over time. In particular, during the first sub-sample the evidence of mean reversion is strongest over horizons of about 24-26 years (judging by the values of the most statistically significant first-order autocorrelation and variance ratio). In contrast, during the second sub-sample the evidence of mean reversion is strongest over horizons of about 34-36 years. Apparently this results in the fact that over the total sample period the evidence of mean reversion is strongest over horizons of about 30-34 years.
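As a reading aid, the five competing models can plausibly be written as single-predictor regressions of the k-year return. This is a sketch under assumptions rather than a reproduction of the original display: the coefficient symbols $\alpha$, $\beta$ and the error term $\varepsilon$ are notation assumed here, and the MR line assumes that regression (3) regresses the k-year return on the preceding k-year return.

```latex
\begin{align*}
\text{MR}:   &\quad r_{t,t+k} = \alpha + \beta\, r_{t-k,t} + \varepsilon_{t,t+k},\\
\text{HM}:   &\quad r_{t,t+k} = \alpha + \varepsilon_{t,t+k},\\
\text{PE10}: &\quad r_{t,t+k} = \alpha + \beta\, pe10_t + \varepsilon_{t,t+k},\\
\text{PD}:   &\quad r_{t,t+k} = \alpha + \beta\, pd_t + \varepsilon_{t,t+k},\\
\text{LTY}:  &\quad r_{t,t+k} = \alpha + \beta\, lty_t + \varepsilon_{t,t+k}.
\end{align*}
```

Written this way, the HM model is the restricted case $\beta = 0$ of each of the other four models.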
Robert Shiller's model was introduced by Campbell and Shiller (1998) and further popularized and developed by Shiller (2000) and Campbell and Shiller (2001). Shiller's model is based on a simple mean reversion theory which says that when stock prices are very high relative to recent earnings, then prices will eventually fall in the future to bring the price-to-earnings ratio back to a more normal historical level. Using this model Campbell and Shiller (1998) predicted the stock market crash of 2000 on the basis of an unreasonably high PE10 ratio. Since that time, Shiller's model has been extremely popular among practitioners. Originally, Campbell and Shiller (1998), Shiller (2000), and Campbell and Shiller (2001) used this model to forecast future 10-year returns. Yet, Asness (2003) demonstrated that the PE10 ratio is a good predictor of future returns over periods from 10 to 20 years. 12 Thus, Shiller's model represents a natural competitor to our long-term mean-reverting model.
The model that uses the price-to-dividends ratio as a predictor for future returns was presented by Fama and French (1988a). This model is also based on a simple mean reversion theory which says that if the price-to-dividends ratio is unusually high or low, then this ratio tends to return to its long-run historical mean. The motivation for the model that uses the long-term bond yield as a predictor is based on a simple idea that stocks and long-term bonds are two major competing assets. Therefore simple logic suggests that the changes in the long-term bond yield must be highly correlated with the changes in the stock market earnings yield (earnings-to-price ratio). If, for example, the bond yield increases, stock prices should decrease and the stock market earnings yield increase. The so-called "Fed model" postulates that the stock's earnings yield should be approximately equal to the long-term bond yield. Empirical support for this model is found in the studies by Lander, Orphanides, and Douvogiannis (1997), Koivu, Pennanen, and Ziemba (2005), Berge, Consigli, and Ziemba (2008), and Maio (2013).
The historical-mean model can be interpreted as a reduced version of any other predictive model. This model uses the historical average of k-year returns to predict the return for the next k years. It is worth emphasizing that Welch and Goyal (2008) also employed in their study the predictive models that use the price-to-earnings ratio, the price-to-dividends ratio, and the long-term bond yield. They found that in out-of-sample tests these models perform worse than the historical-mean model. However, these authors used an increasing forecast horizon of up to 5 years only. In our study the goal is to compare the out-of-sample forecast accuracy of these models at horizons longer than 10 years.
Now we turn to the formal presentation of the test statistic that is employed to assess the performance of OOS forecasts provided by two competing models. Let $r^{AC}_{t,t+k}$, $t > m$, be the actual k-year returns and let $r^{\mathrm{mod}\,1}_{t,t+k}$ and $r^{\mathrm{mod}\,2}_{t,t+k}$ be the OOS forecasts of the k-year returns provided by models 1 and 2. To compute the test statistic, we first compute the OOS prediction errors of the two competing models,
$$e^{\mathrm{mod}\,i}_{t,t+k} = r^{AC}_{t,t+k} - r^{\mathrm{mod}\,i}_{t,t+k}, \qquad i \in \{1, 2\}.$$
Our test statistic is the ratio of the MSPE of model 1 to the MSPE of model 2,
$$\text{MSPE-R} = \frac{\frac{1}{T-m}\sum_{t=m+1}^{T}\left(e^{\mathrm{mod}\,1}_{t,t+k}\right)^2}{\frac{1}{T-m}\sum_{t=m+1}^{T}\left(e^{\mathrm{mod}\,2}_{t,t+k}\right)^2},$$
where $T - m$ is the number of OOS forecasted k-year returns. 13 The null hypothesis in this test is that the forecast provided by model 2 is not better than the forecast provided by model 1. That is, under the null hypothesis the MSPE of model 1 is less than or equal to the MSPE of model 2. Formally, $H_0$: MSPE-R ≤ 1. Consequently, we reject the null hypothesis when the actual estimate of the MSPE ratio is significantly above unity. In our tests, model 1 is always the historical-mean model. Therefore the outcome of our tests is whether a predictive model can "beat" the historical-mean model (a similar approach is used by Goyal and Welch (2003), Welch and Goyal (2008), and many others).
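The MSPE ratio described above is straightforward to compute once the actual returns and the two forecast series are aligned. The sketch below is illustrative only; the function name `mspe_ratio` and the toy numbers are hypothetical.

```python
import numpy as np

def mspe_ratio(actual, forecast_1, forecast_2):
    """MSPE of model 1 divided by MSPE of model 2.
    Values above 1 mean that model 2 had the smaller mean squared prediction error."""
    actual = np.asarray(actual, dtype=float)
    e1 = actual - np.asarray(forecast_1, dtype=float)   # prediction errors, model 1
    e2 = actual - np.asarray(forecast_2, dtype=float)   # prediction errors, model 2
    return np.mean(e1 ** 2) / np.mean(e2 ** 2)

# Toy example: model 1 misses by 1.0 each period, model 2 by 0.5.
ratio = mspe_ratio([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [1.5, 2.5, 3.5])
```

In the toy example the squared errors are 1.0 and 0.25 in every period, so the ratio is 4. Whether a ratio above 1 is statistically significant is a separate question, which the text answers with a bootstrap.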
If the two alternative prediction errors are assumed to be Gaussian, serially uncorrelated, and contemporaneously uncorrelated, then the MSPE-R statistic under the null hypothesis has the usual F-distribution. 14 However, in our case the assumptions listed above are not met. First, because we use overlapping multi-year returns, the prediction errors of all our models are serially correlated. Second, since the historical-mean model is the reduced version of any other predictive model, the prediction errors of the historical-mean model and of any other predictive model are contemporaneously correlated. Finally, the assumption of Gaussian errors also seems untenable. One possibility to obtain correct statistical inference in this case is to perform asymptotically valid tests in the spirit of the seminal tests by Diebold and Mariano (1995). However, because we use relatively small samples, and because of the variable length k of the prediction horizon in our forecasting procedure, we employ a bootstrap method to compute the p-value of the MSPE ratio.
Our bootstrap method follows closely Welch and Goyal (2008). In this method we assume that the returns are serially independent, whereas the log of the PE10, the log of the PD, and the log of the LTY each follow a first-order autoregressive (AR(1)) process. For the log of the PE10, for example, the data generating process is $pe10_t = \alpha_1 + \beta_1\, pe10_{t-1} + w_t$. In this case stock prices follow a random walk 15 and a bootstrapped resample of the return series $r_t$ is generated using the nonparametric bootstrap method. In particular, a random resample $(r^*_1, r^*_2, \ldots, r^*_n)$ is generated by drawing with replacement from the original series $(r_1, r_2, \ldots, r_n)$. In contrast, a bootstrapped resample of any other predictive variable is generated using the semi-parametric bootstrap method. The construction of a bootstrapped resample for the log of the PE10 series, $pe10_t$, is performed as follows. First, the parameters $\alpha_1$ and $\beta_1$ are estimated by OLS using the full sample of observations, with the residuals stored for resampling. Afterwards, to generate a random resample $(pe10^*_1, pe10^*_2, \ldots, pe10^*_n)$ we pick an initial observation $pe10^*_1$ from the actual data at random. Then a series is generated using the AR(1) model and by drawing $w^*_t$ with replacement from the residuals. 16 The construction of a bootstrapped resample for the log of the PD and the log of the LTY series is done in a similar manner.
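The two resampling schemes described above can be sketched as follows. This is a minimal illustration under the stated assumptions (serially independent returns, AR(1) predictor with resampled OLS residuals); the function names are hypothetical and the code is not taken from the paper.

```python
import numpy as np

def resample_returns(r, rng):
    """Nonparametric bootstrap: draw n returns with replacement from the original series."""
    r = np.asarray(r, dtype=float)
    return rng.choice(r, size=len(r), replace=True)

def resample_ar1(x, rng):
    """Semi-parametric bootstrap of a predictor assumed to follow an AR(1) process."""
    x = np.asarray(x, dtype=float)
    # Estimate x[t] = alpha + beta * x[t-1] + w[t] by OLS on the full sample.
    X = np.column_stack((np.ones(len(x) - 1), x[:-1]))
    (alpha, beta), *_ = np.linalg.lstsq(X, x[1:], rcond=None)
    resid = x[1:] - (alpha + beta * x[:-1])          # residuals kept for resampling
    out = np.empty(len(x))
    out[0] = rng.choice(x)                           # random initial observation
    shocks = rng.choice(resid, size=len(x) - 1, replace=True)
    for t in range(1, len(x)):
        out[t] = alpha + beta * out[t - 1] + shocks[t - 1]
    return out
```

Drawing returns and predictor shocks independently, as here, imposes zero contemporaneous correlation between them in the simulated data.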
Now we turn to the description of how we compute the MSPE-R statistic and its p-value. First, using the original series $(r_1, r_2, \ldots, r_n)$ we employ the recursive forecasting procedure described above to obtain the OOS forecasts of the mean-reverting model. Note that one of the outcomes of our recursive forecasting procedure is a sequence of lengths of prediction periods $(k_1, k_2, \ldots, k_l)$. Second, using the same sequence of lengths of prediction periods we obtain the OOS forecasts of all the other models. Afterwards we compute the mean squared prediction errors, and then the MSPE-R statistic. Then we bootstrap the original series to get random resamples. The next crucial step is to generate a sequence of lengths of prediction periods $(k^*_1, k^*_2, \ldots, k^*_l)$. All this is repeated 10,000 times, each time running the recursive forecasting procedures 17 and obtaining an estimate of MSPE-R*.
In this manner we estimate the sampling distribution of the MSPE-R statistic under the null hypothesis. Finally, to estimate the significance level, we count how many times the value of MSPE-R* computed after bootstrapping exceeds the actual estimate of MSPE-R. In other words, under the null hypothesis we compute
$$p = \Pr\left(\text{MSPE-R}^* > \text{MSPE-R}\right),$$
the probability of obtaining a more extreme value for the MSPE ratio than the actual estimate. 18
It is not clear what method should be used to generate a sequence of lengths of prediction periods for each bootstrap simulation. To the best of the author's knowledge, there are no similar forecasting procedures in the relevant scientific literature. Therefore we entertain the four methods listed below. In the first method we always use the original sequence of lengths of prediction periods $(k_1, k_2, \ldots, k_l)$. In the second and third methods the generated sequence $(k^*_1, k^*_2, \ldots, k^*_l)$ is a bootstrapped version of the original sequence: the second method uses the nonparametric bootstrap, whereas the third uses the semi-parametric bootstrap. In the semi-parametric bootstrap we assume that the length of a prediction period is a linear function of time. 19 In the fourth method the sequence of lengths of prediction periods is endogenously determined by the recursive forecasting procedure on the basis of the bootstrapped series $(r^*_1, r^*_2, \ldots, r^*_n)$. We find that the first three methods produce virtually identical p-values, whereas the fourth method produces notably lower p-values. Therefore, when we report the p-values of the MSPE-R statistic, we use the highest p-values. Thus, our statistical inference is based on the "worst case scenario" for the rejection of the null hypothesis. In other words, if we can reject the null in the "worst case scenario", we would reject it in any other case.
15 Note that in this case the historical-mean model is a version of the random walk hypothesis.
16 It should be noted, however, that our data generating process assumes no contemporaneous correlation between the stock return and a predictive variable. In the actual data there is a small but statistically significant correlation between the returns and the price-to-earnings (as well as the price-to-dividends) ratio. To check the robustness of our findings, we also implemented another bootstrap method which retains the historical correlations between the data series. We found that both bootstrap methods deliver similar p-values of our test statistic.
17 Note that this time the recursive forecasting procedures for all the models use the exogenously determined sequence of lengths of prediction periods.
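The p-value computation and the "worst case" rule described above can be sketched as follows, together with one possible reading of the linear-trend bootstrap for the sequence of prediction lengths. The function names are hypothetical, and rounding the resampled lengths to whole years is an assumption made here, not something the text states.

```python
import numpy as np

def bootstrap_p_value(stat_actual, stat_boot):
    """One-tailed p-value: share of bootstrapped MSPE-R* values above the actual MSPE-R."""
    stat_boot = np.asarray(stat_boot, dtype=float)
    return float(np.mean(stat_boot > stat_actual))

def worst_case_p_value(stat_actual, boot_by_method):
    """Highest p-value across the alternative ways of generating the k* sequence."""
    return max(bootstrap_p_value(stat_actual, b) for b in boot_by_method.values())

def resample_horizons_trend(k_seq, rng):
    """Semi-parametric resample of prediction lengths around a fitted linear trend."""
    k_seq = np.asarray(k_seq, dtype=float)
    t = np.arange(len(k_seq))
    slope, intercept = np.polyfit(t, k_seq, 1)       # simple linear trend in time
    resid = k_seq - (intercept + slope * t)          # residuals kept for resampling
    out = intercept + slope * t + rng.choice(resid, size=len(k_seq), replace=True)
    out[0] = k_seq[0]                                # keep the original initial length
    # Assumption: lengths are whole years and at least one year long.
    return np.maximum(np.rint(out), 1).astype(int)
```

Reporting the maximum across methods is conservative: a rejection at that p-value also holds under every other method considered.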

Empirical Results on Performance of OOS Forecasts
Our OOS forecast begins 50 years after the data are available, that is, in 1921, and ends in 1997 with the last forecast for the 15-year period from 1997 to 2011. To check the robustness of the findings, we split the total OOS period into two equal OOS subperiods, the first from 1921 to 1959 and the second from 1959 to 1997. As in Goyal and Welch (2003), we employ a simple graphical diagnostic tool that makes it easy to understand the relative performance of two competing forecasting models. In particular, in order to monitor the predictive power of the unrestricted model relative to that of the restricted model, Goyal and Welch (2003) suggested using the cumulative difference between the MSPE of the restricted model (the HM model in our case) and the MSPE of the unrestricted model, denoted by $CUDIF_t$. By visual examination of the graph of $CUDIF_t$ it is easy to see in which periods the unrestricted model predicts better than the restricted model. Specifically, in periods when the cumulative MSPE difference increases, the unrestricted model predicts better; in periods when it decreases, the unrestricted model predicts worse than the restricted model.
18 Note again that in this manner we compute p-values of a one-tailed test.
19 Indeed, for our OOS period from 1921 to 1997 the length of a prediction period is almost monotonically stepwise increasing from 10 to 15 years. The goodness of fit to the linear function, as measured by $R^2$, amounts to 73%. To perform the semi-parametric bootstrap, we first estimate the simple linear trend model for the original sequence of lengths of prediction periods $(k_1, k_2, \ldots, k_l)$, with the residuals stored for resampling. Afterwards, to generate a random resample of the sequence of lengths of prediction periods, we pick the original initial prediction period $k_1$. The rest of the sequence is generated using the estimated linear trend model by drawing the error terms from the residuals with replacement.
Table 5.
The p-values of the MSPE-R statistic demonstrate that over the total OOS period 3 out of 4 unrestricted models performed statistically significantly better (at the 5% level) than the restricted model. These unrestricted models are the mean-reverting model, the price-earnings model, and the price-dividends model. However, over the first OOS subperiod only the price-dividends model performed statistically significantly better than the historical-mean model. In contrast, over the second OOS subperiod only the mean-reverting and the price-earnings models showed evidence of superior forecasting accuracy as compared to that of the historical-mean model. Our results indicate that the model which uses the long-term bond yield as a predictor performed substantially worse than all the other competing models. Our results on the predictive ability of the long-term bond yield support the conclusions reached in the studies by Estrada (2006) and Estrada (2009) regarding the model that uses the long-term bond yield as a predictor for stock returns.

OOS period HM to MR HM to PE10 HM to PD HM to LTY
The graphs of the cumulative difference between the MSPE of the restricted (historical-mean) model and the unrestricted model allow us to see in which historical periods one model performed better than the other. Visual inspection of these graphs reveals the following. The price-dividends model performed relatively well only until about 1970. After that, the accuracy of the forecast provided by the price-dividends model was substantially worse than that of the historical-mean model. Both the mean-reverting and price-earnings models performed significantly better than the historical-mean model over . From about 1990 the price-earnings model lost its advantage over the historical-mean model. Starting from about 1980 the mean-reverting model performed substantially better than all the other competing models. Only during the 1950s did the mean-reverting model perform notably worse than the historical-mean model.
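The diagnostic series behind these graphs can be sketched in a few lines. The sketch assumes, following the usual reading of Goyal and Welch (2003), that the cumulative difference is the running sum of squared prediction errors of the restricted model minus those of the unrestricted model; the exact normalisation used for the plotted series may differ, and the function name is hypothetical.

```python
import numpy as np

def cumulative_sq_error_difference(actual, forecast_restricted, forecast_unrestricted):
    """Running sum of (restricted squared error - unrestricted squared error).
    The series rises in periods when the unrestricted model predicts better."""
    actual = np.asarray(actual, dtype=float)
    e_r = actual - np.asarray(forecast_restricted, dtype=float)
    e_u = actual - np.asarray(forecast_unrestricted, dtype=float)
    return np.cumsum(e_r ** 2 - e_u ** 2)

# Toy example: squared errors are (1, 4, 9) for the restricted model
# and (0, 1, 4) for the unrestricted model.
cudif = cumulative_sq_error_difference([1, 2, 3], [0, 0, 0], [1, 1, 5])
```

In the toy example the per-period differences are 1, 3 and 5, so the series climbs throughout, which is the pattern that marks a period in which the unrestricted model is the better forecaster.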

Economic Significance of Return Predictability
In the preceding subsection we found statistically significant evidence of long-term predictability.
To estimate the economic significance of return predictability, we follow closely the methodology employed in the studies by Fleming, Kirby, and Ostdiek (2001), Campbell and Thompson (2008), and Kirby and Ostdiek (2012). We consider an investor who, at time t, allocates the proportion $y_t$ of his wealth to the stock market index and the proportion $(1 - y_t)$ to the risk-free asset. The investor revises the composition of his portfolio at time t + q; that is, after q years, q ≥ 1. The investor's return over period (t, t + q) is given by
$$r^{p}_{t,t+q} = y_t\, r_{t,t+q} + (1 - y_t)\, r^{free}_{t,t+q},$$
where $r_{t,t+q}$ and $r^{free}_{t,t+q}$ are the stock market return and the risk-free rate of return over period (t, t + q).
We assume that the investor is equipped with the mean-variance utility function which can be considered as a second-order approximation to the investor's true utility function.
As a result, the investor's realized utility over period (t, t + q) can be written as
$$U_{t,t+q} = y_t\, r_{t,t+q} + (1 - y_t)\, r^{free}_{t,t+q} - \frac{\gamma}{2}\, y_t^2\, \sigma^2_{t,t+q},$$
where $\sigma_{t,t+q}$ is the volatility of the stock market index over period (t, t + q) and $\gamma$ is the investor's coefficient of risk aversion. The total investor's realized utility is found as the sum of single-period utilities,
$$U = \sum_{i=1}^{n} U_{(i-1)q,\, iq},$$
where $n = T/q$ is the number of periods of length q from time 0 to time T (the end of the investment horizon).
The investor's optimal proportion $y_t$, which maximizes the expected utility, is given by (see Bodie, Kane, and Marcus (2007), Chapter 7)
$$y_t = \frac{E[r_{t,t+q}] - r^{free}_{t,t+q}}{\gamma\, \sigma^2_{t,t+q}},$$
where $E[r_{t,t+q}]$ and $\sigma_{t,t+q}$ are the expected return and volatility over (t, t + q) that need to be forecasted at time t. The forecasting of expected returns is done using two competing models, 1 and 2. Specifically, $\hat{r}^{\mathrm{mod}\,1}_{t,t+q}$ and $\hat{r}^{\mathrm{mod}\,2}_{t,t+q}$ denote the return forecasts provided by models 1 and 2 respectively. Since we do not have a specific predictive model to forecast the volatility, the volatility over (t, t + q) is forecasted using the historical-mean model for volatility; the forecasted volatility is denoted by $\hat{\sigma}_{t,t+q}$.
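The allocation rule and the realized utility can be illustrated with a short sketch. It assumes the standard mean-variance form in which utility equals the portfolio return minus $\gamma/2$ times the portfolio variance, which is the form consistent with the textbook optimal weight cited above; the function names and the numerical inputs are hypothetical.

```python
def optimal_weight(expected_return, risk_free, volatility, gamma=5.0):
    """Mean-variance optimal share of wealth in the stock index."""
    return (expected_return - risk_free) / (gamma * volatility ** 2)

def realized_utility(weight, stock_return, risk_free, volatility, gamma=5.0):
    """Realized single-period mean-variance utility for a given stock weight.
    Assumption: the risk penalty is (gamma / 2) * weight^2 * volatility^2."""
    portfolio_return = weight * stock_return + (1.0 - weight) * risk_free
    return portfolio_return - 0.5 * gamma * weight ** 2 * volatility ** 2

# Hypothetical inputs: 6% forecast return, zero real risk-free rate, 20% volatility.
y = optimal_weight(0.06, 0.0, 0.20)
u = realized_utility(y, stock_return=0.10, risk_free=0.0, volatility=0.20)
```

With these inputs the weight is 0.06 / (5 × 0.04) = 0.3, and a realized stock return of 10% gives a utility of 0.03 − 0.009 = 0.021. A better return forecast changes the weight and, through it, the realized utility, which is what the fee comparison in the text measures.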
It is important to observe that our predictive models forecast the stock market returns for a period of k ≥ 10 years. Since generally q ≠ k (most often q < k), the q-year forecasted return $\hat{r}^{\mathrm{mod}\,i}_{t,t+q}$ for model $i \in \{1, 2\}$ is computed from $\hat{r}^{\mathrm{mod}\,i}_{t,t+k}$, the k-year return forecast provided by model i.
As before, model 1 in our study is the historical-mean model. The economic significance of return predictability is measured by equating the total realized utilities associated with the two alternative forecasting models and solving for ∆, the annual fees the investor is willing to pay to switch from predictive model 1 to predictive model 2. Whereas Fleming et al. (2001) and Kirby and Ostdiek (2012) computed the annual fees in this way, Campbell and Thompson (2008) demonstrated that the total realized investor's mean-variance utility can alternatively be measured by means of the Sharpe ratio. That is, the annual fees can also be computed from the Sharpe ratios SR(·) of the portfolios based on the two models.
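A minimal sketch of the fee calculation follows. It assumes that the fee is deducted linearly from the return of the model 2 portfolio in every year, so that equating total utilities reduces to dividing the utility difference by the number of years; this simplification is an assumption made here, and the function names and numbers are hypothetical.

```python
def annual_fee(total_utility_1, total_utility_2, years):
    """Annual fee that equates total realized utilities of the two strategies.
    Assumption: the fee enters utility linearly, once per year, for `years` years."""
    return (total_utility_2 - total_utility_1) / years

def to_basis_points(x):
    """Convert a decimal rate to basis points (1 bp = 0.01%)."""
    return 1e4 * x

# Hypothetical totals over a 100-year horizon.
fee_bp = to_basis_points(annual_fee(1.0, 1.5, 100))
```

In this toy case the utility gain of 0.5 spread over 100 years corresponds to a fee of about 50 basis points per year. A negative value would mean that the investor has to be paid to switch to model 2.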
In our computations we assume that the investor's risk aversion is γ = 5 (as in Kirby and Ostdiek (2012)). Since we do not have data for the real risk-free rate of return, to perform the computations we assume that the nominal annual risk-free rate of return equals the annual inflation rate. Therefore, in real terms, $r^{free}_{t,t+q} = 0$. We measure the annual performance fees over our total OOS period 1921-2011. Table 6 reports the Sharpe ratios associated with each predictive model and the estimated annual fees measured in basis points. The results are reported for two values of q: q = 1 and q = 15. In the first case the investor rebalances his portfolio once a year; in the second case the investor rebalances his portfolio once in 15 years.
First we consider the case where the investor rebalances his portfolio once a year. In this case the Sharpe ratios of all predictive models that perform statistically significantly better than the historical-mean model are higher than the Sharpe ratio of the historical-mean model. The advantages of these predictive models translate into significant utility gains. Specifically, risk-averse investors would be willing to pay fees of 30 to 77 basis points per year to switch from the historical-mean model to a model with superior forecast accuracy. In contrast to these models, our results indicate that the model that uses the long-term bond yield as a predictor demonstrates inferior forecast accuracy as compared with that of the historical-mean model. As a result, not only is the Sharpe ratio of this model lower than that of the historical-mean model, but the investor would also require a payment of 20 basis points per year to switch from the historical-mean model to the bond yield model.
When the investor can rebalance his portfolio once a year, the price-earnings model performs best while the mean-reverting model performs second best. However, when the investor decreases the portfolio revision frequency, the performance gains delivered by the price-earnings model diminish whereas the performance gains provided by the mean-reverting model remain rather stable. When the investor rebalances his portfolio once in 15 years, the performance gains of the price-earnings model virtually disappear. In contrast, the performance gains of the mean-reverting model (as measured in annual fees) remain virtually intact. Therefore in cases where the investor has to make long-term allocation decisions, the mean-reverting model delivers the highest performance gains.

Summary and Conclusions
We started the paper by performing two tests of the random walk hypothesis using the real Standard and Poor's Composite Stock Price Index data for the period from 1871 to 2011. In particular, we investigated the time series properties of the index returns at increasing horizons up to 40 years. In our tests of the random walk hypothesis we used two well-known test statistics: the autocorrelation of multi-year returns and the variance ratio.
In the context of the null hypothesis our goal was to test whether the index returns are distributed independently of their ordering in time. In order to estimate the significance level of the test statistics under the null hypothesis, we employed the randomization methods which are free of distributional assumptions.
Rather surprisingly, considering the seemingly insufficient span of available historical observations of the returns on the stock index, either test statistic allowed us to reject the random walk hypothesis at conventional statistical levels over very long horizons of about 30-34 years. By studying the impact of the sample period on the test statistics we concluded that mean reversion seems to be an extraordinarily strong phenomenon of the post-1926 period. Having performed the same randomization tests with stratification, we found that the results based on the variance ratio are sensitive to the particular pattern of heteroscedasticity that occurred historically, 21 while the results based on the autocorrelation of multi-year returns are not.
21 A similar conclusion is drawn by Nelson and Kim (1993).
Consequently, we do not have strong enough evidence to claim that the variance ratio decreases with increasing investment horizons. In other words, our results cannot support the conventional belief that the stock market is safer for long-term investors. In contrast, we do have convincing evidence that suggests that a given change in price over 15-17 years tends to be reversed over the next 15-17 years by a predictable change in the opposite direction. Overall, our findings support the mean reversion hypothesis as the alternative to the random walk hypothesis. Our evidence of secular mean reversion in stock prices is robust to the choice of data source, deflator used to compute the real prices and returns, sample period, and test statistic.
The results of our tests demonstrated the evidence of in-sample predictability. However, conventional wisdom says that in-sample evidence of stock return predictability might be a result of data mining. In order to guard against data mining, we investigated the performance of the out-of-sample forecast of multi-year returns. We demonstrated that the out-of-sample forecast provided by the mean-reverting model is statistically significantly better than the forecast provided by the historical-mean model. Moreover, the out-of-sample forecast accuracy of the mean-reverting model is comparable to that of Robert Shiller's model, which is very popular among practitioners and uses the cyclically adjusted price-earnings ratio as a predictor for long-horizon returns, and to that of the model that uses the price-dividends ratio as a predictor for long-horizon returns. In addition, we demonstrated that the advantages of these three predictive models translate into significant utility gains. We found that in cases where the investor has to make long-term allocation decisions, the mean-reverting model delivers the highest performance gains. Besides, in the post-1960 period the mean-reverting model showed the best forecast accuracy among all competing models.
Given the main result of our study, it is natural to ask the following question. What causes this long-lasting mean reversion in stock market prices? Put differently, what is the economic intuition behind this result? One possible answer is suggested by previous research on the link between demography and stock market returns and on the long-term variations in birth rates and population growth in the US. In particular, on the one hand, Bakshi and Chen (1994), Dent (1998), Geanakoplos, Magill, and Quinzii (2004), and Arnott and Chaves (2012) observe the interrelationship between demography and US stock market returns and argue that demography determines stock market returns. On the other hand, the evidence presented by Kuznets (1958), Dent (1998), Berry (1999), and Geanakoplos et al. (2004) suggests the presence of secular trends in birth rates in the US that last from 10 to 20 years. Thus, if population growth goes through long-term alternating periods of above-average and below-average rates, and it is demography that determines stock market returns, then it is natural to expect that the stock market also goes through long-term alternating periods of above-average and below-average returns.
A more elaborate model of cyclical dynamics of economic activity, interrelated with similar movements in other elements, is presented by Schlesinger (1949), Schlesinger (1986), Berry (1991), Berry, Elliot, Harpham, and Kim (1998), and Alexander (2004). These authors argue that the dynamics of economic activity in the US has a long-term rhythm (with a period of 12-18 years) of accelerated and retarded secular growth. This cyclical fluctuation in economic activity, in particular the alternation of long-term periods of good and bad economic times, gives rise to similar long-term fluctuations in social and political activities.
In brief, a long-term period of rapid economic growth and technological development coincides with a conservative political wave (era). The conservative politics reduces the scope and the role of government in the life of the nation and frees up business and capital. Such a period is also characterized by a higher population growth, increase in inequality, and deflationary conditions. Yet inevitably a long-term period of economic growth comes to a long-term stagflationary crisis. During such a crisis conservative leaders are replaced by liberal leaders committed to business regulation, social innovation, equity, and redistribution via an enhanced role of government. A liberal era is usually characterized by a lower population growth, decrease in inequality, and inflationary conditions. In our opinion, the secular mean-reverting behavior of the stock market fits nicely into this model of socioeconomic dynamics. It seems to be possible to demonstrate that the conservative political waves are usually associated with above average stock market returns, whereas during the liberal political waves the stock market returns are below average.