Secular Mean Reversion and Long‐Run Predictability of the Stock Market

The empirical financial literature reports evidence of mean reversion in stock prices and the absence of out-of-sample return predictability over horizons shorter than 10 years. Anecdotal evidence suggests the presence of mean reversion in stock prices and return predictability over horizons longer than 10 years, but thus far, there is no empirical evidence confirming such anecdotal evidence. The goal of this paper is to fill this gap in the literature. Specifically, using 141 years of data, this paper begins by performing formal tests of the random walk hypothesis in the prices of the real S&P Composite Index over increasing time horizons of up to 40 years. Although our results cannot support the conventional wisdom that the stock market is safer for long-term investors, our findings speak in favor of the mean reversion hypothesis. In particular, we find statistically significant in-sample evidence that past 15-17 year returns are able to predict the future 15-17 year returns. This finding is robust to the choice of data source, deflator, and test statistic. The paper continues by investigating the out-of-sample performance of long-horizon return forecasting based on the mean-reverting model. These latter tests demonstrate that the forecast accuracy provided by the mean-reverting model is statistically significantly better than the forecast accuracy provided by the naive historical-mean model. Moreover, we show that the predictive ability of the mean-reverting model is economically significant and translates into substantial performance gains.

and Nelson and Kim (1993) highlighted the small-sample bias problem, whereas Cavanagh et al. (1995), Stambaugh (1999), and Lanne (2002) noted the neglect of the potential for a near unit root problem. Responding to such critiques, Lewellen (2004), Torous et al. (2004), Rapach and Wohar (2005), and Campbell and Yogo (2006) developed new tests that are free of the flaws discovered in the earlier tests and again found some evidence of in-sample predictability. Yet, Bossaerts and Hillion (1999), Goyal and Welch (2003), and Welch and Goyal (2008) demonstrated that, despite evidence of in-sample predictability, the predictive models have no out-of-sample forecasting power. These authors therefore argued that in-sample predictability appears as a result of data mining. It should be noted, however, that the longest forecast horizon considered in all these tests was 10 years. Consequently, the results of these tests imply that the predictive models fail to demonstrate statistically significant predictive ability over short- and medium-term horizons.
To the best of our knowledge, no study to date has tested the random walk hypothesis in stock prices over horizons longer than 10 years. Yet, anecdotal evidence suggests the presence of mean reversion in stock prices over very long horizons. Likely the best known evidence is presented by Siegel (2002) in his famous book "Stocks for the Long Run". In particular, using a historical sample that covers nearly 200 years, Siegel computed the standard deviation of average real annual returns on a broad US stock market index over increasing horizons of up to 30 years. Siegel found that the standard deviation declines far more rapidly than predicted by the random walk hypothesis. This led many to conclude that stocks are less risky over the long run. However, thus far, there have been no studies of whether the decline in the standard deviation over very long horizons is statistically significant.
Another piece of well-known anecdotal evidence, explicitly related to mean reversion in stock prices over very long horizons, suggests the existence of long-lasting alternating periods of bull and bear markets. These long-lasting bull and bear markets are often dubbed 'secular' bull and bear markets. Alexander (2000), Easterling (2005), Rogers (2005), Katsenelson (2007), and Hirsch (2012), among others, analyzed the dynamics of the real S&P Composite Index since 1870 and found indications of secular stock market trends that lasted from 5 to 25 years, with an average duration of approximately 15 years. Motivated by the seeming regularity in the reversion of secular trends, some authors made quite successful forecasts of the long-run US stock market outlook. For example, Alexander (2000) predicted that during the period from 2000 to 2020 the stock market would not beat the money market. Thus far, this forecast has proven true. This anecdotal evidence suggests, among other things, that a price change over a given long-run period may be able to predict the price change over the subsequent long-run period. This idea serves as motivation to re-examine the predictive performance of the model introduced by Fama and French (1988b). Although Kim et al. (1991) demonstrated that this model has no predictive power on increasing horizons of up to 10 years, to the best of our knowledge, no study has tested this model for horizons longer than 10 years. The aim of this paper is to fill these gaps in the literature regarding stock market dynamics over very long horizons.
The first main contribution of this paper is to provide, for the first time, statistically significant evidence against the random walk hypothesis over periods longer than 10 years. Although our results cannot fully support the anecdotal evidence stating that the stock market is safer for long-term investors, our findings do suggest mean reversion in stock prices over periods of 15-17 years. In particular, using our full sample, we find statistically significant evidence that a given change in prices over 15-17 years tends to be reversed over the next 15-17 years by a predictable change in the opposite direction. This implies the existence of in-sample long-horizon predictability. Since the conventional wisdom holds that in-sample evidence of stock return predictability might be a result of data mining, we investigate the performance of out-of-sample long-horizon return forecasts. In addition to the mean-reverting model, we investigate the out-of-sample forecast accuracy of a few other competing models that employ, as a predictor of long-horizon returns, the cyclically adjusted price-to-earnings ratio, the price-to-dividends ratio, and the long-term bond yield.
The second main contribution of this paper is to demonstrate that the out-of-sample long-horizon forecasts provided by the mean-reverting model and the models that employ the price-to-earnings and price-to-dividends ratios are statistically significantly better than the forecast provided by the historical-mean model. It is worth emphasizing that Welch and Goyal (2008) also used the price-to-earnings and price-to-dividends ratios in their study and found that these models have no predictive ability over forecast horizons of up to 5 years. Our results therefore demonstrate that these models do have predictive ability, but over forecast horizons longer than 10 years. We also demonstrate that the forecasting advantages of the models with predictive ability translate into significant performance gains. For example, we estimate that risk-averse investors would be willing to pay from 30 to 77 basis points in fees per year to switch from the historical-mean model to a model with superior forecast accuracy. Moreover, our tests suggest that in the recent past, the out-of-sample forecast accuracy provided by the mean-reverting model was substantially better than that provided by the competing models. In addition, we find that the mean-reverting model delivers the highest performance gains when investors have to make long-term allocation decisions.
Overall, our paper contributes to the finance literature in several respects. First, it expands our understanding of stock market dynamics over longer horizons. A major problem with studying stock market behavior over longer horizons is limited data. For example, using data that span 141 years to study the distribution of stock returns over a period of 20 years, researchers have only seven truly independent observations. It is widely believed that tests based on such small sample sizes are bound to have low statistical power. Our paper demonstrates that, using overlapping blocks of data and randomization methods, one can dramatically increase the statistical power of tests. Second, our study reaches several important conclusions for long-term investors. The gain in test power delivered by our approach allows us to find weak support for the conventional wisdom that the stock market is safer for long-term investors and strong support for the existence of secular mean reversion in the stock market. Third, the evidence of stock market return predictability over horizons shorter than 10 years is conflicting and, essentially, negative. In other words, investors cannot rely on short- and medium-term stock market return forecasts. This paper convincingly demonstrates that secular mean reversion can be used to forecast long-term stock market returns. Although the improvement in the forecast accuracy provided by the mean-reverting model, relative to that delivered by other alternative models, is not especially impressive when judged by the mean squared error, our predictive model generates meaningful utility gains for mean-variance investors. Thus, the results reported in this paper allow investors to make better long-term allocation decisions.
The remainder of the paper is organized as follows. Section II presents the data for our study, namely, the returns on the real Standard and Poor's Composite Stock Price Index over the period from 1871 to 2011. In Section III, we perform the tests of the random walk hypothesis using the S&P Composite Index. In Section IV, we study the out-of-sample predictability of multi-year returns on the S&P Composite Index. Finally, Section V summarizes and concludes the paper.

II. THE DATA
The data for the study in this paper are the annual log real returns on a broad US stock market index for the period from 1871 to 2011. The returns are adjusted for dividends and computed using the real (i.e., inflation-adjusted) Standard and Poor's Composite Stock Price Index data and corresponding dividend data. The inflation adjustment is done using the US Consumer Price Index (CPI). All the data are provided by Robert Shiller. 2 The Standard and Poor's Composite Stock Price index is a value-weighted stock index. The index for the period from 1871 to 1925 is constructed using the Cowles Commission Common Stock Index series. From 1926 to the present, the index data come from various reports of Standard and Poor's. From 1957, this index is identical to the Standard and Poor's 500 Index, which is intended to be a representative sample of leading companies in leading industries in the US economy. Stocks in the index are chosen for market size, liquidity, and industry group representation. For further details on the construction of the index and its dividend series, see Shiller (1989), Chapter 26.
Formally, let $(p_0, p_1, \ldots, p_n)$ be observations of the natural log of an inflation-adjusted stock index price over $n + 1$ years. Denote the one-year log return during year $t$, $1 \le t \le n$, by

$$r_t = p_t - p_{t-1}.$$

The resulting sample of $n$ return observations is $(r_1, r_2, \ldots, r_n)$. The probability distribution of $r_t$ is unknown, yet it is well-documented that stock returns are non-normal and heteroskedastic.
To check the robustness of the findings, in particular, to determine whether the results of testing the random walk hypothesis depend on a specific historical period, we divide the total sample period from 1871 to 2011 (141 annual observations) into two overlapping sub-samples of equal length; the first is from 1871 to 1956 and the second is from 1926 to 2011. 3 Both of these sub-samples cover a span of 86 years. Table 1 presents the descriptive statistics for the annual stock index returns, $r_t$, for the full sample and both sub-samples. Table 2 reports the results of the t-test on the difference in mean returns and F-test on the difference in standard deviations between the first and the second sub-sample. The descriptive statistics and the results of the tests suggest that the mean and variance of returns on the index were generally stable over the full sample period. Specifically, using a t-test for equal means, we cannot reject the hypothesis that the mean returns in the two sub-samples are alike. Similarly, using an F-test for equal variances, we cannot reject the hypothesis that the variances of the two sub-samples are alike. All series exhibit negative skewness and positive excess kurtosis, which indicates a deviation from normality. Observe further that the return series during the full sample period exhibits a statistically significant negative autocorrelation at lag 2 (at the 5% level). There are no other indications of serial dependence in the return series.

2 See http://www.econ.yale.edu/~shiller/data.htm. The real dividend-adjusted annual return series on the index are readily available in the file chapt26.xls. Robert Shiller stopped maintaining his database in 2012.
3 The reasons for using overlapping sub-samples are as follows. First, to perform statistical tests on the presence of long-run mean reversion, we need longer time series. Second, the starting point of our second sub-sample coincides with the starting point of the database of historical stock market data provided by the Center for Research in Security Prices. Therefore, the data on the stock market returns over the second sub-sample is much more accurate than that over the first sub-sample.

III.1 Methodology
One of the main questions we seek to study in this paper is whether the log of the real S&P Composite Stock Price Index follows a random walk. To answer this question, we perform two well-known tests. The first test is based on the examination of the first-order autocorrelation function of k-year returns. This test has been used by, for example, Fama and French (1988b), Fama and French (1989), and Fama (1990) and is based on the computation of the following test statistic

$$\mathrm{AC1}(k) = \frac{\mathrm{Cov}(r_{t,t+k},\, r_{t-k,t})}{\mathrm{Var}(r_{t-k,t})}, \tag{1}$$

where $r_{i,j}$ is the compounded return from year $i$ to year $j$, $r_{i,j} = p_j - p_i$, $\mathrm{Cov}(\cdot, \cdot)$ and $\mathrm{Var}(\cdot)$ denote the covariance and variance, respectively, and AC1(k) represents the first-order autocorrelation function of k-year returns. The second test is based on the examination of the variance ratio. This test is very popular and has been used by Cochrane (1988), Lo and MacKinlay (1988), Poterba and Summers (1988), and many others. The test is based on the computation of the following test statistic

$$\mathrm{VR}(k) = \frac{\mathrm{Var}(r_{t,t+k})}{k\,\mathrm{Var}(r_{t,t+1})}. \tag{2}$$

Both of the tests are motivated by the notion that if the stock returns are independent and identically distributed, then the first-order autocorrelation function is zero and the variance ratio is unity irrespective of the number of years, k.
In other words, absent serial dependence in the data, the variance of k-year returns equals k times the variance of one-year returns and there is no correlation between two successive non-overlapping k-year returns. The null hypothesis of a random walk is rejected if the first-order autocorrelation is significantly different from zero or the variance ratio is significantly different from unity. We wish to compute the variance ratio, VR(k), for return horizons, k, from 20 to 40 years and the first-order autocorrelation, AC1(k), for periods from 10 to 20 years (note that in the latter case, we also study serial dependence in the data over time horizons from 20 to 40 years). The fundamental problem with these computations is that we have only a few non-overlapping intervals of length 20-40 years. Therefore, in the computations of the two test statistics, we employ overlapping intervals (rolling k-year periods). To compute AC1(k), we regress k-year returns $r_{t,t+k}$ on lagged k-year returns $r_{t-k,t}$. That is, we run the following regression

$$r_{t,t+k} = a(k) + b(k)\, r_{t-k,t} + \varepsilon_{t,t+k}. \tag{3}$$

Observe that the slopes of the regression, $b(k)$, $k \in [10, 20]$, are the estimated autocorrelations of k-year returns, AC1(k). The variance of k-year returns is computed as the sample variance of the rolling k-year returns $r_{t,t+k}$.
The use of overlapping returns leads to some potentially very serious econometric issues that are commonly termed 'small-sample bias'. In particular, when estimating regression (3), there are two econometric problems. First, the estimates of the slope coefficients are biased. The sources of this bias in the estimation of autocorrelation are described in detail by Orcutt and Irwin (1948) and Marriott and Pope (1954). Specifically, these authors demonstrated that an estimate of autocorrelation obtained using overlapping blocks of data is downward biased. Therefore, the estimates must be corrected for this bias.
The second problem is that the standard errors of estimations using overlapping blocks of data are also downward biased; see, for example, Nelson and Kim (1993). Both biases work in the direction of making the values of the t-statistic too large in absolute value such that standard inference may indicate dependence in the return series even if none is present. 4 Similarly, the estimate of the variance of multi-year returns, $\mathrm{Var}(r_{t,t+k})$, is downward biased when one uses overlapping blocks of data. 5 As an immediate consequence, the estimate of the variance ratio VR(k) also becomes downward biased. Therefore, the estimates of VR(k) must be corrected for this bias.
In addition, since the estimate of VR(k) is a random variable, for the purpose of statistical inference, we need to know the probability distribution of VR(k). This is necessary to be able to estimate standard errors and confidence intervals for VR(k). This is also necessary for performing hypothesis tests concerning the value of VR(k).
4 Specifically, when returns are independent, using overlapping blocks of data produces a negative value of the estimated slope coefficient in regression (3). In addition, the standard error of estimation of the slope coefficient using overlapping blocks of data is downward biased. That is, the estimated standard error is smaller than the true value. The greater the overlap is, the more negative the slope coefficient and the smaller the estimated standard error. As a result, the values of the t-statistic may falsely indicate the presence of dependence in the return series when none is present.
5 Note that this is also related to the second econometric problem in the estimation of regression (3). That is, the standard errors of estimation of slope coefficients using overlapping blocks of data are downward biased because the estimates of the variance using overlapping blocks of data are downward biased. For the sake of motivation, consider what happens to the estimate of $\mathrm{Var}(r_{t,t+k})$ when $k \to n$. Obviously, in the limit, when the length k converges to the sample length, there is only one available block of data to estimate $\mathrm{Var}(r_{t,t+k})$. Therefore, regardless of the nature of the data-generating process, $\mathrm{Var}(r_{t,t+k}) \to 0$ as $k \to n$.
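The two raw test statistics on rolling k-year periods can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the slope is a plain OLS slope of each k-year return on the preceding k-year return, and the variances divide by the number of observations, which the text does not specify.

```python
def k_year_returns(r, k):
    """Rolling (overlapping) k-year log returns: sums of k consecutive annual log returns."""
    return [sum(r[t:t + k]) for t in range(len(r) - k + 1)]


def mean(xs):
    return sum(xs) / len(xs)


def ac1(r, k):
    """OLS slope from regressing each k-year return on the preceding k-year return."""
    rk = k_year_returns(r, k)
    x = rk[:-k]  # lagged k-year returns, r_{t-k,t}
    y = rk[k:]   # subsequent k-year returns, r_{t,t+k}
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def variance_ratio(r, k):
    """Variance of rolling k-year returns divided by k times the one-year variance."""
    rk = k_year_returns(r, k)
    m_k, m_1 = mean(rk), mean(r)
    var_k = sum((v - m_k) ** 2 for v in rk) / len(rk)  # assumed normalization
    var_1 = sum((v - m_1) ** 2 for v in r) / len(r)    # assumed normalization
    return var_k / (k * var_1)
```

Both functions return the uncorrected estimates. As the text explains, these are downward biased when the blocks overlap, so they should not be compared with 0 and 1 directly.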
When the nature of the data-generating process is unknown, it is generally not possible to address the econometric problems described above. However, in the context of the null hypothesis, our goal is primarily to test whether stock returns are distributed independently of their ordering in time. Because, under the null, there is no dependence in return series, to estimate the significance level and perform the bias correction of the test statistics, we closely follow Kim et al. (1991) and Nelson and Kim (1993), who employed the randomization method. The randomization method was introduced by Fisher (1935) and provides a very general and robust approach for computing the probability of obtaining some specific value for an estimator under the null hypothesis of no dependence. We refer interested readers to Noreen (1989) and Manly (1997) for extensive discussions of the randomization tests. In essence, randomization consists of reshuffling the data to destroy any dependence and then recalculating the test statistics for each reshuffling to estimate its distribution under the null hypothesis of no dependence. The great advantage of the randomization method is that it is very simple and requires no assumptions about the actual distribution of stock returns.
Specifically, consider the estimation of the significance level and the bias correction of the estimate of the autocorrelation of k-year returns, AC1(k). First, we run regression (3) using the original series $(r_1, r_2, \ldots, r_n)$ to obtain the actual historical estimates of AC1(k). Then, we randomize the original series to obtain a permutation $(r_1^*, r_2^*, \ldots, r_n^*)$. This is repeated 10,000 times, each time running regression (3) and obtaining an estimate of AC1*(k). In this manner, we estimate the sampling distribution of AC1(k) under the null hypothesis. Finally, to estimate the significance level for some particular k, we count how many times the computed value of AC1*(k) after randomization falls below the value of the actual historical estimate of AC1(k) and divide by the number of reshuffles. In other words, under the null hypothesis, we compute the probability of obtaining a more extreme value for the autocorrelation of k-year returns than the actual historical estimate. Note that in this manner, we compute the p-values of a one-tailed test. The estimation bias is defined as the difference between the expected and true value of AC1*(k). Since the true value is zero under the null hypothesis, the bias correction is achieved by subtracting the expected value of AC1*(k) from the actual historical estimate of AC1(k). That is, the bias-adjusted values of the first-order autocorrelation of k-year returns are computed as

$$\mathrm{AC1}(k) - \mathrm{E}[\mathrm{AC1}^{*}(k)].$$

The estimation of the significance level and the bias correction of the estimate of the variance ratio VR(k) is performed in a similar manner. First, we use the original series to obtain the actual historical estimates of VR(k). Then, we randomize the series and compute VR*(k) to obtain the sample distribution under the null hypothesis. Finally, to estimate the significance level for some particular k, we count how many times the computed value of VR*(k) after randomization falls below the value of the actual historical estimate of VR(k). The estimation bias in this case is given by $\mathrm{E}[\mathrm{VR}^{*}(k)] - 1$ since the true value is unity under the null hypothesis. Finally, the bias-adjusted values of the variance ratio are computed as

$$\mathrm{VR}(k) - \left(\mathrm{E}[\mathrm{VR}^{*}(k)] - 1\right).$$

There is ample evidence that the series of stock returns is heteroskedastic; see, for example, Officer (1973) and Schwert (1989). In particular, many researchers document that the variance of stock returns is not constant, but time-varying. To determine whether a change in the variance of returns might affect the sampling distribution of a test statistic, we closely follow Kim et al. (1991) and Nelson and Kim (1993) and use stratified randomization. In the stratified randomization method, the total sample (or a sub-sample) is divided into several separate bins (urns) and the randomization is performed within each bin. Such a stratified randomization allows us to determine whether the sampling distribution of a test statistic is sensitive to the particular pattern of heteroskedasticity that occurred historically.
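The randomization procedure, with and without stratification, might be sketched as follows. This is a minimal illustration, not the authors' code. The function names, the dictionary keys, and the `null_value` parameter are hypothetical. The bins are assumed to be contiguous time blocks of (nearly) equal length, which the text does not spell out.

```python
import random


def rolling_sums(r, k):
    """Overlapping k-year log returns."""
    return [sum(r[t:t + k]) for t in range(len(r) - k + 1)]


def ac1(r, k):
    """OLS slope of each k-year return on the preceding k-year return."""
    rk = rolling_sums(r, k)
    x, y = rk[:-k], rk[k:]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def stratified_shuffle(r, n_bins, rng):
    """Shuffle within n_bins contiguous blocks (assumed bin layout); n_bins=1 is a plain shuffle."""
    n = len(r)
    edges = [(i * n) // n_bins for i in range(n_bins + 1)]
    shuffled = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        block = list(r[lo:hi])
        rng.shuffle(block)
        shuffled.extend(block)
    return shuffled


def randomization_test(r, k, stat, null_value, n_bins=1, n_shuffles=10000, seed=0):
    """One-tailed p-value and bias-adjusted statistic under the no-dependence null."""
    rng = random.Random(seed)
    actual = stat(r, k)
    draws = [stat(stratified_shuffle(r, n_bins, rng), k) for _ in range(n_shuffles)]
    # Share of reshuffles whose statistic falls below the historical estimate.
    p_value = sum(d < actual for d in draws) / n_shuffles
    # Bias = mean of the statistic under the null minus its true null value (0 for AC1, 1 for VR).
    bias = sum(draws) / n_shuffles - null_value
    return {"actual": actual, "p_value": p_value, "bias_adjusted": actual - bias}
```

For the autocorrelation one would call `randomization_test(returns, 15, ac1, null_value=0.0, n_bins=3)`. A variance-ratio function with the same `(r, k)` signature can be passed with `null_value=1.0`.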

III.2 Empirical Results
Figure 1 plots the sample first-order autocorrelations and variance ratios of the k-year returns on the Standard and Poor's Composite Stock Price Index. The first-order autocorrelations and variance ratios are computed according to formulas (1) and (2), respectively, using overlapping blocks of data. In the full sample and both sub-samples, the first-order autocorrelations and variance ratios generally decline with increasing k. The indications that the null hypothesis should be rejected for very long horizons are stronger (i.e., the declines in the first-order autocorrelations and variance ratios are larger) for the second sub-sample (1926 to 2011) than for the full sample or the first sub-sample (1871 to 1956). Recall, however, that the estimates of both the first-order autocorrelations and the variance ratios presented in Figure 1 are downward biased. As a matter of fact, under the null hypothesis of no serial dependence in the return series, we would expect to observe declining first-order autocorrelations and variance ratios with increasing k. To determine whether the observed declines are statistically significantly different from the expected declines under the null hypothesis, and to correct for estimation bias under the null, we perform the randomization method with and without the stratification. These results are reported in Tables 3 and 4, which report the estimates for the bias-adjusted first-order autocorrelations and variance ratios, respectively, with corresponding p-values. The estimates are based on 10,000 reshuffles and computed using different numbers of bins in the stratification. The number of bins varies from 1 (no stratification) to 5.
Without the stratification (that is, when the number of bins equals one), both test statistics suggest that the return series over the full sample period (1871 to 2011) and the second sub-sample (1926 to 2011) exhibit clear evidence against the random walk on horizons of approximately 30-40 years. In particular, for the full sample, the first-order autocorrelation values are statistically significantly negative at the 5 percent level for periods of 12-20 years (which indicates dependence over 24-40-year horizons). In addition, the values of the variance ratio are statistically significantly below unity at the 5 percent level at horizons of 30-34 years. Thus, both test statistics present evidence against the null hypothesis over horizons of 30-34 years. For the second sub-sample, the first-order autocorrelation values are statistically significantly negative at the 5 percent level for periods of 15-18 years (which indicates dependence over 30-36-year horizons), and the values of the variance ratio are statistically significantly below unity at horizons of 34-36 years. For the first sub-sample, the evidence against the random walk is weaker. However, if we require only a 10 percent significance level, we can reject the null hypothesis of no dependence in the return series at several horizons.
Further, our results suggest that accounting for heteroskedasticity in stock returns does not influence the outcomes of the tests based on examining the first-order autocorrelations of k-year returns. Regardless of the number of bins in the stratified randomization, the first-order autocorrelation of k-year returns remains statistically significantly different from zero at the 5 percent level over periods of 15-17 years for the full sample and the second sub-sample. In contrast, stratification of the sample weakens the evidence against the null hypothesis for the value of the variance ratio. In particular, for the full sample and the stratification with 2, 4, or 5 bins, the variance ratio is not statistically significantly below unity at conventional levels (1 percent and 5 percent). Similarly, for the second sub-sample and the stratification with either 3 or 5 bins, the variance ratio is not significantly below unity at conventional levels. For the first sub-sample, the variance ratio is not significantly below unity regardless of the number of bins used in the stratified randomization.
Consequently, we do not have strong enough evidence to claim that the variance ratio decreases with an increasing investment horizon. Although the variance ratio over horizons of 30-34 years is statistically significantly below unity without stratification, stratification of the sample suggests that this effect can be attributed to the historical pattern of heteroskedasticity (that is, the existence of periods of high and low variance). Thus, our results cannot fully support the anecdotal evidence that states that the stock market is safer for long-term investors. Nevertheless, we do have strong enough evidence to reject the random walk hypothesis in stock prices over periods of approximately 15-17 years. This evidence is based on the first-order autocorrelation of multi-year returns. However, our results suggest that the departure from the random walk over very long horizons is primarily a phenomenon of the post-1926 period.

III.3 Robustness Tests
To check the robustness of our findings regarding the statistical significance of secular mean reversion, we conducted a series of robustness checks, the full results of which are not reported in this paper to save space. These additional robustness tests are described below. First, the results reported in this section are obtained using the annual data provided by Robert Shiller. Specifically, these data are annual series of (average) January values of the real Standard and Poor's Composite Stock Price Index. Hence, the results obtained in this section might be affected by seasonality. 6 To test the seasonality problem, we used the monthly data instead and obtained virtually the same levels of statistical significance of the mean-reverting behavior over very long horizons.
Second, Robert Shiller uses the CPI to adjust the nominal returns for inflation. We tested whether our evidence of mean reversion depends on the choice of deflator used to construct real stock returns. 7 For this purpose, we constructed the real stock returns using the GDP deflator and the value of the consumer bundle. 8 We found that regardless of the choice of a deflator, the evidence on mean reversion remains intact.
Third, since Kim et al. (1991) demonstrated that the mean-reversion in the study by Fama and French (1988b) is primarily a phenomenon of the pre-World War II period, which is included in both of our sub-samples, we tested whether there is evidence of mean-reversion in the post-1940 period. 9 We found that the evidence is weaker (which is reasonable since the sample size decreases) but is still statistically significant at the 10 percent level.
Fourth, instead of the first-order autocorrelation of multi-year returns test statistic, suggested by Fama and French (1988b), we used the test statistic suggested by Jegadeesh (1991). In particular, instead of regression (3), we used the following regression

$$r_t = a(k) + b(k)\, r_{t-k-1,t-1} + \varepsilon_t.$$

Note that in this regression, the stock market return in year t is predicted using the aggregate return over the preceding k years. Using this regression, we were also able to reject the random walk hypothesis in stock prices over very long horizons in the post-1926 period. Finally, instead of using the data provided by Robert Shiller, we used the real annual returns on large-cap stocks provided by Kenneth French 10 over the period from 1927 to 2012. Again, we found that the values of the first-order autocorrelation of multi-year returns are statistically significantly negative over periods of 15-18 years.
Thus, on the basis of the results from numerous robustness tests, we conclude that our evidence on secular mean reversion is robust to the choice of data, deflator, sample period, and test statistics.
6 We thank Ole Gjølberg for bringing our attention to this.
7 We thank an anonymous referee for suggesting this.
8 The data on the GDP deflator and the consumer bundle are downloaded from www.measuringworth.com. The value of the consumer bundle is defined as the average annual expenditures of consumer units.
9 We thank an anonymous referee for suggesting this.
10 See http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html. We used the large-cap stocks because the S&P Composite is a large-cap index.

IV.1 Motivation
The results of the tests performed in the preceding section allow us to reject the hypothesis that the real S&P Composite Stock Price Index follows a random walk. Rather surprisingly, considering a seemingly insufficient span of available historical observations of the returns on the stock index, convincing evidence against the random walk is present over long-lasting periods of approximately 15-17 years. That is, our tests support the alternative hypothesis that there is serial dependence in stock returns. This leads us to the following question: what type of serial dependence is present? In other words, what is the alternative hypothesis? Typically, a statistically significant decrease in the variance ratio with an increasing investment horizon (this effect is sometimes termed 'variance compression') is interpreted as evidence of mean reversion. Unfortunately, the evidence of mean reversion based on the variance ratio test does not appear to be sufficiently strong under a stratified randomization of the data. However, variance compression seems to be a sufficient, but likely not necessary, condition for mean reversion. Fortunately, in addition to the variance ratio, we have another test statistic, namely, the first-order autocorrelation of multi-year returns. The significance of this test statistic is unaffected by the choice of randomization method. The presence of values of the autocorrelation of k-year returns that are statistically significantly below zero suggests mean-reverting behavior in stock prices. Specifically, a given change in price over the first k years tends to be reversed over the next k years by a predictable change in the opposite direction. For the full sample period, evidence for mean reversion comes from the negative and statistically significant values of the first-order autocorrelations, particularly for periods of 15, 16, and 17 years.
Considering the above, the results reported in the previous section suggest the presence of long-term mean reversion over periods of approximately 15-17 years in the real Standard and Poor's Composite Stock Price Index. In this case, if the pattern of first-order autocorrelation of multi-year returns suggests the presence of mean reversion over a horizon of 2k years, there should be some degree of predictability of multi-year returns over half of this horizon, that is, over a period of k years. Indeed, regression (3) is a predictive regression. To demonstrate the predictability of multi-year returns, Figure 2 presents a scatter plot of $r_{t,t+15}$ against $r_{t-15,t}$ for the returns on the real Standard and Poor's Composite Stock Price Index for the full sample period from 1871 to 2011. In addition, a regression line is fitted through these data points. The scatter plot clearly suggests a tendency for the past 15-year returns to predict future 15-year returns. The regression line has a strongly negative slope, and the $R^2$ statistic is 42 percent.
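The predictive regression behind such a scatter plot can be sketched in a few lines. This is only an illustration under stated assumptions: annual log returns are supplied as a plain array, k-year returns are overlapping sums of annual log returns, and the function names `k_year_returns` and `predictive_regression` are hypothetical. No bias adjustment is applied.

```python
import numpy as np

def k_year_returns(annual_log_returns, k):
    """Overlapping k-year log returns; element t sums returns t, ..., t+k-1."""
    r = np.asarray(annual_log_returns, dtype=float)
    csum = np.concatenate(([0.0], np.cumsum(r)))
    return csum[k:] - csum[:-k]

def predictive_regression(annual_log_returns, k):
    """OLS of the future k-year return on the past k-year return.

    Returns (intercept, slope, r_squared). Assumption: plain OLS,
    without the small-sample bias adjustment discussed in the text.
    """
    rk = k_year_returns(annual_log_returns, k)
    past, future = rk[:-k], rk[k:]      # the past window ends where the future one starts
    slope, intercept = np.polyfit(past, future, 1)
    resid = future - (intercept + slope * past)
    r_squared = 1.0 - resid.var() / future.var()
    return intercept, slope, r_squared
```

A negative slope corresponds to the mean-reverting pattern described above: a high past k-year return is followed, on average, by a low future k-year return.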
However, if we use the full sample period to estimate the first-order autocorrelation of multi-year returns, our estimate measures the degree of in-sample (IS) predictability. The issue here is the well-known possibility that in-sample predictability might be spurious (for example, it could appear as a result of data mining) and not hold out-of-sample (OOS) (see, for example, Bossaerts and Hillion, 1999, Goyal and Welch, 2003, and Welch and Goyal, 2008). To guard against data mining, in this section, we assess the performance of the OOS forecast based on the mean-reverting model given by regression (3). In addition to the mean-reverting model, we use several other competing predictive models. We demonstrate that in the OOS tests, the mean-reverting model and a few other predictive models perform statistically significantly better than the naive historical-mean model. In addition, we demonstrate that the advantages of the predictive models translate into significant utility gains.

IV.2 Methodology for Assessing the Performance of OOS Forecasts
Our OOS recursive forecasting procedure is as follows. The initial IS period $[1, m]$, $m < n$, is used to estimate regression (3) for different period lengths $k \in [10, 20]$ years. In this manner, we estimate a number of autocorrelations of k-year returns, AC1(k). Then, we perform the bias adjustment of AC1(k). Next, we select the value of $k = k_1$ that produces the lowest estimate of the bias-adjusted autocorrelation. That is,

$k_1 = \arg\min_{k \in [10,20]} AC1(k)$.
Presumably, over the initial IS period, the evidence of mean reversion is strongest over the period of $k_1$ years. Subsequently, the estimated coefficients from regression (3) with $k_1$ are used to compute the first $k_1$-year-ahead return forecast for the period $[m+1, m+k_1]$. We then expand our IS period by one year (it becomes $[1, m+1]$), select the $k_2$ at which the evidence of mean reversion is strongest over the second IS period, and compute the OOS forecast for the period $[m+2, m+k_2+1]$. We repeat the procedure, each time increasing our IS window by one year, until we compute the last $k_l$-year-ahead return for the period $[n-k_l+1, n]$.
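The recursive procedure can be sketched as an expanding-window loop. This is a simplified illustration, not the exact implementation: it assumes annual log returns, uses the plain OLS slope as the estimate of AC1(k), and omits the bias adjustment. All function names are hypothetical.

```python
import numpy as np

def k_year_returns(r, k):
    """Overlapping k-year log returns; element t sums r[t], ..., r[t+k-1]."""
    csum = np.concatenate(([0.0], np.cumsum(r)))
    return csum[k:] - csum[:-k]

def fit_mean_reverting(r, k):
    """OLS of the future k-year return on the past k-year return; returns (a, b)."""
    rk = k_year_returns(r, k)
    b, a = np.polyfit(rk[:-k], rk[k:], 1)
    return a, b

def recursive_forecasts(r, m, k_grid=range(10, 21)):
    """Expanding-window forecasts as (origin, k_j, forecast) tuples.

    origin is the number of annual returns in the IS sample, so the
    forecast made at origin targets the window r[origin : origin + k_j].
    """
    r = np.asarray(r, dtype=float)
    forecasts = []
    for origin in range(m, len(r)):
        sample = r[:origin]                 # only information available at the origin
        fits = {k: fit_mean_reverting(sample, k)
                for k in k_grid if origin >= 2 * k + 2}   # need at least 3 return pairs
        if not fits:
            continue
        # Strongest mean reversion = most negative slope (assumption: no bias adjustment).
        k_j = min(fits, key=lambda k: fits[k][1])
        if origin + k_j > len(r):
            continue                        # target window is not fully observed
        a, b = fits[k_j]
        forecasts.append((origin, k_j, a + b * sample[-k_j:].sum()))
    return forecasts
```

Because both the coefficients and the horizon are re-estimated from `r[:origin]` at every step, altering the data after a given origin leaves the forecast made at that origin unchanged.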
It is worth emphasizing that our OOS forecasting procedure is free from look-ahead bias: in forecasting the return for period $[m+j, m+k_j+j-1]$, $j \geq 1$, we use only information that is available at time $m+j-1$. Note that since we are conducting a long-horizon forecast, the recursive forecasting procedure must update not only the estimates of the coefficients of regression (3) but also, before that, the optimal length of the prediction period, k. To avoid look-ahead bias, the optimal length of the prediction period, k, is also determined using only information that is available at the end of each IS period. Thus, our OOS recursive forecasting procedure updates all values of the model parameters and is able to adapt to changing conditions in the time series. For example, it can accommodate the possibility that the period of mean reversion is monotonically changing over time.¹¹

A common approach to assessing the performance of an OOS forecast in the empirical literature is to run a 'horse race' among several competing predictive models. A standard criterion for comparing two alternative predictive models is their mean squared prediction errors (MSPE); this comparison has a long tradition in evaluating which of two models has better forecasting ability (see McCracken (2007) and references therein). In our study, we run OOS horse races involving the mean-reverting model (MR), the historical-mean model (HM), Robert Shiller's model (PE10), which uses the cyclically adjusted price-to-earnings ratio as a predictor of long-horizon returns, the model that uses the price-to-dividends ratio (PD) as a predictor, and the model that uses the long-term bond yield (LTY) as a predictor.
These models are given by

MR: $r_{t,t+k} = a(k) + b(k)\, r_{t-k,t} + \varepsilon_{t,t+k}$,
PE10: $r_{t,t+k} = a(k) + b(k)\, pe10_t + \varepsilon_{t,t+k}$,
PD: $r_{t,t+k} = a(k) + b(k)\, pd_t + \varepsilon_{t,t+k}$,
LTY: $r_{t,t+k} = a(k) + b(k)\, lty_t + \varepsilon_{t,t+k}$,
HM: $r_{t,t+k} = a(k) + \varepsilon_{t,t+k}$,

where pe10 is the natural log of the ratio of the price to the 10-year moving average of earnings (this ratio is usually denoted CAPE or PE10), pd is the natural log of the price-to-dividends ratio, and lty is the natural log of the long-term bond yield. The data for the price-to-earnings ratio, price-to-dividends ratio, and the long-term bond yield are also provided by Robert Shiller.

Robert Shiller's model was introduced by Campbell and Shiller (1998) and further popularized and developed by Shiller (2000) and Campbell and Shiller (2001). Shiller's model is based on a simple mean reversion theory that states that when stock prices are very high relative to recent earnings, then prices will eventually fall in the future to return the price-to-earnings ratio to a more normal historical level. Using this model, Campbell and Shiller (1998) predicted the stock market crash of 2000 on the basis of an unreasonably high PE10 ratio. Since that time, Shiller's model has been extremely popular among practitioners. Originally, Campbell and Shiller (1998), Shiller (2000), and Campbell and Shiller (2001) used this model to forecast future 10-year returns. Yet, Asness (2003) demonstrated that the PE10 ratio is a good predictor of the future returns over periods from 10 to 20 years.¹² Thus, Shiller's model represents a natural competitor to our long-term mean-reverting model.

The model that uses the price-to-dividends ratio as a predictor of future returns was presented by Fama and French (1988a). This model is also based on a simple mean reversion theory that states that if the price-to-dividends ratio is unusually high or low, then this ratio tends to return to its long-run historical mean. The motivation for the model that uses the long-term bond yield as a predictor is based on the simple idea that stocks and long-term bonds are two major competing assets. Therefore, simple logic suggests that the changes in the long-term bond yield must be highly correlated with the changes in the stock market earnings yield (earnings-to-price ratio). If, for example, the bond yield increases, stock prices should decrease and the stock market earnings yield increases. The so-called 'Fed model' postulates that the stock market's earnings yield should be approximately equal to the long-term bond yield. Empirical support for this model is found in studies by Lander, Orphanides, and Douvogiannis (1997), Koivu, Pennanen, and Ziemba (2005), Berge et al. (2008), and Maio (2013).

The historical-mean model can be interpreted as a reduced version of any other predictive model. This model uses the historical average of k-year returns to predict the return over the next k years. It is worth emphasizing that Welch and Goyal (2008) also employed predictive models that use the price-to-earnings ratio, price-to-dividends ratio, and the long-term bond yield. They found in OOS tests that these models perform worse than the historical-mean model. However, these authors used an increasing forecast horizon of only up to 5 years. In our study, the goal is to compare the OOS forecast accuracy from these models over horizons longer than 10 years.

Now, we turn to the formal presentation of our test statistic that is employed to assess the performance of OOS forecasts provided by two competing models. Let $r^{AC}_{t,t+k}$, $t > m$, be the actual k-year returns and $r^{mod\,1}_{t,t+k}$ and $r^{mod\,2}_{t,t+k}$ be the OOS forecasts of the k-year returns provided by models 1 and 2. To compute the test statistic, we first compute the OOS prediction errors of the two competing models

$\varepsilon^{mod\,1}_{t,t+k} = r^{mod\,1}_{t,t+k} - r^{AC}_{t,t+k}$, $\quad \varepsilon^{mod\,2}_{t,t+k} = r^{mod\,2}_{t,t+k} - r^{AC}_{t,t+k}$.

Our test statistic is the ratio of the MSPE of model 1 to the MSPE of model 2

$\text{MSPE-R} = \dfrac{\frac{1}{T-m}\sum_{t=m+1}^{T} \left(\varepsilon^{mod\,1}_{t,t+k}\right)^2}{\frac{1}{T-m}\sum_{t=m+1}^{T} \left(\varepsilon^{mod\,2}_{t,t+k}\right)^2}$,

where $T - m$ is the number of OOS forecasted k-year returns.¹³ The null hypothesis in this test is that the forecast provided by model 2 is not better than the forecast provided by model 1. Specifically, under the null hypothesis, the MSPE of model 1 is less than or equal to the MSPE of model 2. Formally, $H_0$: MSPE-R ≤ 1. Consequently, we reject the null hypothesis when the actual estimate of the MSPE ratio is significantly above unity. In our tests, model 1 is always the historical-mean model. Therefore, the outcome of our tests is whether a predictive model can 'beat' the historical-mean model (a similar approach was used by Goyal and Welch (2003), Welch and Goyal (2008), and many others).

If two alternative prediction errors are assumed to be Gaussian, serially uncorrelated, and contemporaneously uncorrelated, then an MSPE-R statistic under the null hypothesis has the usual F-distribution.¹⁴ However, in our case, the assumptions listed above are not met. First, because we use overlapping multi-year returns, the prediction errors of all our models are serially correlated. Second, since the historical-mean model is the reduced version of any other predictive model, the prediction errors of the historical-mean model and any other predictive model are contemporaneously correlated. Finally, the assumption of Gaussian errors also seems to be inappropriate. One possibility for obtaining correct statistical inference in this case is to perform asymptotically valid tests in the spirit of the seminal tests by Diebold and Mariano (1995). However, because we use relatively small samples, and because of the variable length k of the prediction horizon in our forecasting procedure, we employ a bootstrap method to compute the p-value of the MSPE ratio.

¹¹ Recall that the results presented in the previous section indicate that the period of the long-term mean reversion seems to have been increasing over time. In particular, during the first sub-sample, the evidence of mean reversion is strongest over horizons of approximately 24-26 years (judging by the values of the most statistically significant first-order autocorrelation and variance ratio). In contrast, during the second sub-sample, the evidence of mean reversion is strongest over horizons of approximately 34-36 years. This results in the observation that over the full sample period, the evidence of mean reversion is strongest over horizons of approximately 30-34 years.

¹² This conclusion is made on the basis of studying the $R^2$ of the predictive regression at different forecasting horizons. It should be noted, however, that in estimating the coefficient in front of the predictor and its significance level, Asness (2003) did not account for the estimation biases discovered by Cavanagh et al. (1995) and Stambaugh (1999).

¹³ Note that k is not constant but a variable that is exogenously determined by our recursive forecasting procedure. We suppress its dependence on time to simplify the notation.

¹⁴ In this case, testing the null hypothesis largely corresponds to the standard F-test of equal forecast error variances.
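The MSPE ratio defined above reduces to a few lines of array arithmetic. The sketch below assumes that the actual returns and the two sets of OOS forecasts are already aligned element by element; the function name `mspe_ratio` is hypothetical.

```python
import numpy as np

def mspe_ratio(actual, forecast_model1, forecast_model2):
    """Ratio of the MSPE of model 1 to the MSPE of model 2.

    Values above 1 favor model 2. In the horse races, model 1 would be
    the historical-mean model.
    """
    actual = np.asarray(actual, dtype=float)
    err1 = np.asarray(forecast_model1, dtype=float) - actual
    err2 = np.asarray(forecast_model2, dtype=float) - actual
    return np.mean(err1 ** 2) / np.mean(err2 ** 2)
```

Because both MSPEs are averaged over the same number of forecasts, the common factor cancels and the ratio equals the ratio of the sums of squared prediction errors.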
Our bootstrap method closely follows Welch and Goyal (2008). In this method, we assume that the returns are serially independent, whereas the log of the PE10, the log of the PD, and the log of LTY follow a first-order autoregressive (AR(1)) process. Therefore, the data-generating process is assumed to consist of serially independent returns $r_t$ and an AR(1) equation for each predictor, for example,

$pe10_t = \alpha_1 + \beta_1\, pe10_{t-1} + w_t$

for the log of the PE10. In this case, the return series, $r_t$, follows a random walk¹⁵ and a bootstrapped resample is generated using a nonparametric bootstrap method. In particular, a random resample $(r^*_1, r^*_2, \ldots, r^*_n)$ is generated by drawing with replacement from the original series $(r_1, r_2, \ldots, r_n)$. In contrast, a bootstrapped resample of any other predictive variable is generated using a semi-parametric bootstrap method. The construction of a bootstrapped resample for the log of the PE10 series, $pe10_t$, is performed as follows. First, the parameters $\alpha_1$ and $\beta_1$ are estimated by OLS using the full sample of observations, with the residuals stored for resampling. Next, to generate a random resample $(pe10^*_1, pe10^*_2, \ldots, pe10^*_n)$, we select an initial observation $pe10^*_1$ from the actual data at random. Then, a series is generated using the AR(1) model and by drawing $w^*_t$ with replacement from the residuals.¹⁶ The construction of a bootstrapped resample for the log of the PD and the LTY series is done in a similar manner.

Now, we describe how we compute the MSPE-R statistic and its p-value. First, using the original series $(r_1, r_2, \ldots, r_n)$, we employ the recursive forecasting procedure described above to obtain the OOS forecasts of the mean-reverting model. Note that one of the outcomes of our recursive forecasting procedure is a sequence of lengths of prediction periods $(k_1, k_2, \ldots, k_l)$. Second, using the same sequence of lengths of prediction periods, we obtain the OOS forecasts of all other models. Thereafter, we compute the mean squared prediction errors and, then, the MSPE-R statistic.
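The two resampling schemes can be sketched as follows. This is a minimal illustration assuming NumPy's `Generator` for random draws; the function names `resample_returns` and `resample_ar1` are hypothetical, and the AR(1) coefficients are estimated by plain OLS as described above.

```python
import numpy as np

def resample_returns(returns, rng):
    """Nonparametric bootstrap: draw returns with replacement."""
    returns = np.asarray(returns, dtype=float)
    return rng.choice(returns, size=len(returns), replace=True)

def resample_ar1(x, rng):
    """Semi-parametric bootstrap of a predictor that follows an AR(1) process."""
    x = np.asarray(x, dtype=float)
    beta, alpha = np.polyfit(x[:-1], x[1:], 1)     # OLS on the full sample
    residuals = x[1:] - (alpha + beta * x[:-1])    # stored for resampling
    out = np.empty_like(x)
    out[0] = rng.choice(x)                         # random initial observation
    shocks = rng.choice(residuals, size=len(x) - 1, replace=True)
    for t in range(1, len(x)):
        out[t] = alpha + beta * out[t - 1] + shocks[t - 1]
    return out
```

For example, `resample_ar1(pe10, np.random.default_rng(0))` would return one simulated path of the same length as the hypothetical input series `pe10`. Returns and predictors are resampled independently here, so no contemporaneous correlation between them is retained.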
We then bootstrap the original series to obtain random resamples. The next crucial step is to generate a sequence of lengths of prediction periods $(k^*_1, k^*_2, \ldots, k^*_l)$. All of this is repeated 10,000 times, each time running the recursive forecasting procedures¹⁷ and obtaining an estimate of MSPE-R*. In this manner, we estimate the sampling distribution of the MSPE-R statistic under the null hypothesis. Finally, to estimate the significance level, we count how many times the computed value of the MSPE-R* after bootstrapping is above the value of the actual estimate of the MSPE-R. In other words, under the null hypothesis, we compute the probability of obtaining a more extreme value of the MSPE ratio than the actual estimate.¹⁸

It is not clear what method should be used to generate a sequence of lengths of prediction periods for each bootstrap simulation. To the best of our knowledge, there are no similar forecasting procedures in the relevant scientific literature. Therefore, we entertain the four methods listed below. In the first method, we always use the original sequence of lengths of prediction periods $(k_1, k_2, \ldots, k_l)$. In the second and third methods, a generated sequence $(k^*_1, k^*_2, \ldots, k^*_l)$ is a bootstrapped version of the original sequence. Whereas in the second method we use the nonparametric bootstrap, in the third method we use the semi-parametric bootstrap. In the semi-parametric bootstrap, we assume that the length of a prediction period is a linear function of time.¹⁹ In the fourth method, a sequence of lengths of prediction periods is endogenously determined by the recursive forecasting procedure on the basis of the bootstrapped series $(r^*_1, r^*_2, \ldots, r^*_n)$. We find that the first three methods produce virtually identical p-values, whereas the fourth method produces notably lower p-values. When we report the p-values of the MSPE-R statistic, we use the highest p-values. Thus, our statistical inference is based on the 'worst-case scenario' for the rejection of the null hypothesis. In other words, if we can reject the null in the 'worst-case scenario', we would reject it for any other case.

¹⁵ Note that in this case, the historical-mean model is a version of the random walk hypothesis.

¹⁶ It should be noted, however, that our data-generating process assumes no contemporaneous correlation between the stock return and a predictive variable. In the actual data, there is a small but statistically significant correlation between the returns and the price-to-earnings (as well as the price-to-dividends) ratio. To check the robustness of our findings, we also implemented another bootstrap method that retains the historical correlations between the data series. We found that both bootstrap methods deliver similar p-values for our test statistic.

¹⁷ Note that here the recursive forecasting procedures for all the models use the exogenously determined sequence of lengths of prediction periods.
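The counting step and the 'worst-case' rule can be sketched as below, assuming that the bootstrapped MSPE-R* values for each method of generating the prediction-period sequence have already been collected. The function names and the dictionary layout are hypothetical.

```python
import numpy as np

def bootstrap_p_value(actual_stat, bootstrap_stats):
    """One-tailed p-value: share of bootstrapped statistics above the actual one."""
    bootstrap_stats = np.asarray(bootstrap_stats, dtype=float)
    return float(np.mean(bootstrap_stats > actual_stat))

def worst_case_p_value(actual_stat, stats_by_method):
    """Highest p-value across the methods used to generate the k-sequences."""
    return max(bootstrap_p_value(actual_stat, stats)
               for stats in stats_by_method.values())
```

Reporting the maximum across methods is conservative: a rejection at that p-value implies a rejection under each individual method.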

IV.3 Empirical Results on the Performance of OOS Forecasts
Our OOS forecast begins 50 years after the data are available, that is, in 1921, and ends in 1997 with the last forecast for the 15-year period from 1997 to 2011. To check the robustness of findings, we divide the total OOS period into two equal OOS sub-periods, the first from 1921 to 1959 and the second from 1959 to 1997.

As in Goyal and Welch (2003), we employ a simple graphical diagnostic tool that makes it easy to understand the relative performance of two competing forecasting models. In particular, to monitor the predictive power of the unrestricted model relative to that of the restricted model, Goyal and Welch (2003) suggested using the cumulative difference between the MSPE of the restricted model (the HM model in our case) and the MSPE of the unrestricted model

$CUDIF_t = \sum_{i=m+1}^{t} \left[\left(\varepsilon^{mod\,1}_{i,i+k}\right)^2 - \left(\varepsilon^{mod\,2}_{i,i+k}\right)^2\right]$,

where model 1 is the restricted model and model 2 is the unrestricted model. From a visual examination of the graph of $CUDIF_t$, it is easy to understand in which periods the unrestricted model performs better than the restricted model. Specifically, in periods when the cumulative MSPE difference increases, the unrestricted model provides better predictions; in periods when it decreases, the unrestricted model has worse predictive performance than the restricted model.

Figure 3 shows the performance of the unrestricted models versus the performance of the restricted (historical-mean) model. Specifically, the left panels in the figure plot the actual k-year-ahead returns versus the OOS forecasted k-year-ahead returns produced by the unrestricted and restricted models. The right panels in the figure plot the cumulative difference between the MSPE of the restricted model and the MSPE of the unrestricted model.

¹⁸ Note again that we compute p-values of a one-tailed test in this manner.

¹⁹ Indeed, for our OOS period from 1921 to 1997, the length of a prediction period is almost monotonically stepwise increasing from 10 to 15 years. The goodness of fit to the linear function, as measured by $R^2$, amounts to 73 percent. To perform the semi-parametric bootstrap, first, we estimate the simple linear trend model for the original sequence of lengths of prediction periods $(k_1, k_2, \ldots, k_l)$, with the residuals stored for resampling. Thereafter, to generate a random resample of the sequence of lengths of prediction periods, we select the original initial prediction period, $k_1$. The rest of the sequence is generated using the estimated linear trend model by drawing the error terms from the residuals with replacement.

Table 5.
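The cumulative-difference diagnostic can be sketched as a running sum of squared-error differences. The sketch assumes aligned arrays of actual returns and OOS forecasts and works with cumulative squared prediction errors in the spirit of Goyal and Welch (2003); the function name `cudif` is hypothetical.

```python
import numpy as np

def cudif(actual, forecast_restricted, forecast_unrestricted):
    """Cumulative squared-error difference, restricted minus unrestricted.

    The series rises in periods when the unrestricted model predicts
    better and falls when the restricted model predicts better.
    """
    actual = np.asarray(actual, dtype=float)
    err_r = np.asarray(forecast_restricted, dtype=float) - actual
    err_u = np.asarray(forecast_unrestricted, dtype=float) - actual
    return np.cumsum(err_r ** 2 - err_u ** 2)
```

Plotting the returned array against the forecast dates would give a graph of the kind shown in the right panels of Figure 3.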
The p-values of the MSPE-R statistic demonstrate that over the full OOS period, three of the four unrestricted models performed statistically significantly better (at the 5 percent level) than the restricted model. These unrestricted models are the mean-reverting model, the price-earnings model, and the price-dividends model. However, over the first OOS sub-period, only the price-dividends model performed statistically significantly better than the historical-mean model. In contrast, over the second OOS sub-period, only the mean-reverting and price-earnings models showed evidence of superior forecasting accuracy compared to that of the historical-mean model. Our results indicate that the model using the long-term bond yield as a predictor performed substantially worse than all of the other competing models. Our results on the predictive ability of the long-term bond yield support the conclusions reached in the studies by Estrada (2006) and Estrada (2009). Specifically, Estrada argued that the predictive ability of the long-term bond yield is supported by data in the post-1960 period only.²⁰ Prior to 1960, there is no empirical support for a model that uses the long-term bond yield as a predictor of stock returns.
The graphs of the cumulative difference between the MSPE of the restricted (historical-mean) model and the unrestricted model allow us to determine in which historical periods one model performed better than the other. Visual inspection of these graphs reveals the following. The price-dividends model performed relatively well only until circa 1970. Thereafter, the accuracy of the forecast provided by the price-dividends model was substantially worse than that of the historical-mean model. Both the mean-reverting and price-earnings models performed significantly better than the historical-mean model over the period . From circa 1990, the price-earnings model lost its advantage over the historical-mean model. Beginning from circa 1980, the mean-reverting model performed substantially better than all other competing models. Only over the decade of the 1950s did the mean-reverting model perform notably worse than the historical-mean model.

IV.4 Economic Significance of Return Predictability
In the preceding subsection, we found statistically significant evidence of long-term predictability of stock returns. This evidence was obtained by comparing the MSPE of the predictive model with the MSPE of the historical-mean model. However, over the total OOS period, the ratios of the MSPE of the restricted model to the MSPE of the unrestricted model are not substantially above unity. This raises the important question of whether the gains in forecast accuracy are economically meaningful. In other words, statistical significance is not equivalent to economic significance.

²⁰ In all empirical studies that demonstrate the predictive ability of the long-term bond yield, the sample period begins after 1960. In this case, if, for example, the initial IS period is 1960-1980, then over the OOS period 1980-2010, one finds evidence of OOS predictability of stock returns using the long-term bond yield.
To estimate the economic significance of return predictability, we closely follow the methodology employed in the studies by Fleming et al. (2001), Campbell and Thompson (2008), and Kirby and Ostdiek (2012). We consider an investor who, at time t, allocates a proportion $y_t$ of his wealth to the stock market index and a proportion $(1 - y_t)$ to the risk-free asset. The investor revises the composition of his portfolio at time $t + q$; that is, after q years, $q \geq 1$. The investor's return over period $(t, t+q)$ is given by

$R_{t,t+q} = y_t\, r_{t,t+q} + (1 - y_t)\, r^{free}_{t,t+q}$,

where $r_{t,t+q}$ and $r^{free}_{t,t+q}$ are the stock market return and the risk-free rate of return over period $(t, t+q)$.
We assume that the investor behaves according to a mean-variance utility function, which can be considered a second-order approximation of the investor's true utility function. As a result, the investor's realized utility over period $(t, t+q)$ depends on the realized portfolio return, on $\sigma_{t,t+q}$, the volatility of the stock market index over period $(t, t+q)$, and on $\gamma$, the investor's coefficient of risk aversion. The investor's total realized utility is found as the sum of the single-period utilities over the $n = T/q$ periods of length q from time 0 to time T (the end of the investment horizon).
The investor's optimal proportion $y_t$, which maximizes the expected utility, is given by (see Bodie et al., 2007, Chapter 7)

$y_t = \dfrac{E[r_{t,t+q}] - r^{free}_{t,t+q}}{\gamma\, \sigma^2_{t,t+q}}$,

where $E[r_{t,t+q}]$ and $\sigma_{t,t+q}$ are the expected return and volatility over $(t, t+q)$ that need to be forecasted at time t. Expected returns are forecasted using two competing models, 1 and 2. Specifically, $\hat{r}^{mod\,1}_{t,t+q}$ and $\hat{r}^{mod\,2}_{t,t+q}$ denote the return forecasts provided by models 1 and 2, respectively. Since we do not have a specific predictive model to forecast the volatility, the volatility over $(t, t+q)$ is forecasted using the historical-mean model for volatility; the forecasted volatility is denoted $\hat{\sigma}_{t,t+q}$.
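As an illustration of the allocation rule, the sketch below assumes the textbook mean-variance setting, in which the single-period utility is the portfolio return minus half the risk aversion times the portfolio variance. That utility form, the function names, and the numerical inputs are assumptions made for the example rather than the exact expressions of this study.

```python
def optimal_weight(expected_return, risk_free, volatility, gamma=5.0):
    """Mean-variance optimal share of wealth in the stock index."""
    return (expected_return - risk_free) / (gamma * volatility ** 2)

def realized_utility(weight, stock_return, risk_free, volatility, gamma=5.0):
    """Realized mean-variance utility of a stock / risk-free portfolio.

    Assumption: utility = portfolio return - 0.5 * gamma * portfolio variance,
    where the portfolio variance is (weight * volatility) ** 2.
    """
    portfolio_return = weight * stock_return + (1.0 - weight) * risk_free
    return portfolio_return - 0.5 * gamma * (weight * volatility) ** 2
```

With a forecasted return of 6 percent, a zero real risk-free rate, a forecasted volatility of 20 percent, and a risk aversion of 5, the rule allocates 0.06 / (5 × 0.04) = 30 percent of wealth to stocks.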
It is important to observe that our predictive models forecast the stock market returns for a period of $k \geq 10$ years. Because, generally, $q \neq k$ (most often $q < k$), the q-year forecasted returns for model $i \in \{1, 2\}$ are computed from $\hat{r}^{mod\,i}_{t,t+k}$, the k-year return forecast provided by model i. As before, model 1 in our study is the historical-mean model. The economic significance of return predictability is measured by equating the two total realized utilities associated with the two alternative forecasting models, net of the annual fees the investor is willing to pay to switch from predictive model 1 to predictive model 2. Whereas Fleming et al. (2001) and Kirby and Ostdiek (2012) computed the annual fees by equating realized utilities in this manner, Campbell and Thompson (2008) demonstrated that the investor's total realized mean-variance utility can be alternatively measured via the Sharpe ratio. That is, the annual fees can be computed from the Sharpe ratios, SR(·), of the two strategies.

In our computations, we assume that the investor's risk aversion is $\gamma = 5$ (as in Kirby and Ostdiek, 2012). Since we do not have data for the real risk-free rate of return, to perform the computations, we assume that the nominal annual risk-free rate of return equals the annual inflation rate. Therefore, in real terms, $r^{free}_{t,t+q} = 0$. We measure the annual performance fees over our total OOS period 1921-2011. Table 6 reports the Sharpe ratios associated with each predictive model and the estimated annual fees measured in basis points. The results are reported for two values of q: q = 1 and q = 15. In the first case, the investor rebalances his portfolio once a year; in the second case, the investor rebalances his portfolio once in 15 years.
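A Sharpe-ratio-based fee can be sketched under one explicit assumption: for a mean-variance investor holding the optimal weight, expected utility exceeds the risk-free rate by the squared Sharpe ratio divided by twice the risk aversion. The fee is then the difference between the two utility gains. The function name and this functional form are assumptions for illustration and may differ from the exact expression used in the computations.

```python
def annual_fee_bp(sharpe_model1, sharpe_model2, gamma=5.0):
    """Annual fee (in basis points) for switching from model 1 to model 2.

    Assumption: the utility gain of an optimally invested mean-variance
    investor over the risk-free rate equals SR**2 / (2 * gamma), with SR
    an annual Sharpe ratio.
    """
    gain = (sharpe_model2 ** 2 - sharpe_model1 ** 2) / (2.0 * gamma)
    return 1e4 * gain
```

A negative value means that the investor would need to be paid to switch, which is the situation described for a model with inferior forecast accuracy.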
First, we consider the case in which the investor rebalances his portfolio once a year. In this case, the Sharpe ratios of all predictive models that perform statistically significantly better than the historical-mean model are higher than the Sharpe ratio of the historical-mean model. The advantages of these predictive models translate into significant utility gains. Specifically, risk-averse investors would be willing to pay from 30 to 77 basis points in fees per year to switch from the historical-mean model to a model with superior forecast accuracy. In contrast to these models, our results indicate that the model that uses the long-term bond yield as a predictor exhibits inferior forecast accuracy compared with the historical-mean model. As a result, not only is the Sharpe ratio of this model lower than that of the historical-mean model, but also the investor would need to be paid 20 basis points in fees per year to switch from the historical-mean model to the bond yield model.
When the investor can rebalance his portfolio once a year, the price-earnings model performs best while the mean-reverting model performs second best. However, when the investor decreases the portfolio revision frequency, the performance gains delivered by the price-earnings model diminish whereas the performance gains provided by the mean-reverting model remain rather stable. When the investor rebalances his portfolio once in 15 years, the performance gains of the price-earnings model virtually disappear. In contrast, the performance gains of the mean-reverting model (as measured in annual fees) remain virtually intact. Therefore, in cases in which the investor has to make long-term allocation decisions, the mean-reverting model delivers the highest performance gains.

V. SUMMARY AND CONCLUSIONS
We began the paper by performing two tests of the random walk hypothesis using the real Standard and Poor's Composite Stock Price Index data for the period from 1871 to 2011. In particular, we investigated the time series properties of the index returns at increasing horizons of up to 40 years. In our tests of the random walk hypothesis, we used two well-known test statistics: the autocorrelation of multi-year returns and the variance ratio. In the context of the null hypothesis, our goal was to test whether the index returns are distributed independently of their ordering in time. To estimate the significance level of the test statistics under the null hypothesis, we employed randomization methods that are free of distributional assumptions.
Rather surprisingly, given the seemingly insufficient span of available historical observations of the returns on the stock index, both test statistics allowed us to reject the random walk hypothesis at conventional significance levels over very long horizons of approximately 30-34 years. By studying the impact of the sample period on the test statistics, we concluded that mean reversion seems to be an extraordinarily strong phenomenon during the post-1926 period. Having performed the same randomization tests with stratification, we found that the results based on the use of the variance ratio are sensitive to the particular pattern of heteroskedasticity that occurred historically,²¹ while the results based on the use of the autocorrelation of multi-year returns are not.
Consequently, we do not have sufficiently strong evidence to claim that the variance ratio decreases with increasing investment horizons. In other words, our results cannot fully support the conventional belief that the stock market is safer for long-term investors. In contrast, we do have convincing evidence that suggests that a given change in price over 15-17 years tends to be reversed over the next 15-17 years by a predictable change in the opposite direction. Overall, our findings support the mean reversion hypothesis as the alternative to the random walk hypothesis. Our evidence of secular mean reversion in stock prices is robust to the choice of data source, deflator used to compute the real prices and returns, sample period, and test statistic.
The results of our tests provided evidence of in-sample predictability. However, the conventional wisdom states that in-sample evidence of stock return predictability might be a result of data mining. To guard against data mining, we investigated the performance of out-of-sample forecasts of multi-year returns. We demonstrated that the out-of-sample forecast provided by the mean-reverting model is statistically significantly better than the forecast provided by the historical-mean model. Moreover, the out-of-sample forecast accuracy of the mean-reverting model is comparable to that of Robert Shiller's very popular (among practitioners) model that uses the cyclically adjusted price-earnings ratio as a predictor of long-horizon returns and to that of the model that uses the price-dividends ratio as a predictor of long-horizon returns. In addition, we demonstrated that the advantages of these three predictive models translate into significant utility gains. We found that in cases in which the investor has to make long-term allocation decisions, the mean-reverting model delivers the highest performance gains. Furthermore, in the post-1960 period, the mean-reverting model showed the best forecast accuracy among all competing models.

²¹ A similar conclusion was drawn by Nelson and Kim (1993).
Given the main result of our study, it is natural to ask the following question. What caused this long-lasting mean reversion in stock market prices? In other words, what is the economic intuition behind this result? One possible answer is suggested by previous research on the link between demography and stock market returns and on the long-term variations in the birth rates and population growth in the US. In particular, on the one hand, Bakshi and Chen (1994), Dent (1998), Geanakoplos et al. (2004), and Arnott and Chaves (2012) observed an interrelationship between demography and US stock market returns and argued that demography determines stock market returns. On the other hand, the evidence presented by Kuznets (1958), Dent (1998), Berry (1999), and Geanakoplos et al. (2004) suggests the presence of secular trends in birth rates in the US that last from 10 to 20 years. Thus, if population growth exhibits long-term alternating periods of above-average and below-average rates, and demography is what determines stock market returns, then it is natural to expect that the stock market also exhibits long-term alternating periods of above-average and below-average returns.
A more elaborate model of cyclical dynamics of economic activity, interrelated with similar movements in other elements, was presented by Schlesinger (1949), Schlesinger (1986), Berry (1991), Berry et al. (1998), and Alexander (2004). These authors argued that the dynamics of economic activity in the US have a long-term rhythm (with a period of 12-18 years) of accelerated and retarded secular growth. This cyclical fluctuation in economic activity, in particular the alternation between long-term periods of good and bad economic times, gives rise to similar long-term fluctuations in social and political activities. In brief, a long-term period of rapid economic growth and technological development coincides with a conservative political wave (era). The conservative politics reduces the scope and the role of government in the life of the nation and frees up business and capital. Such a period is also characterized by higher population growth, increased inequality, and deflationary conditions. However, a long-term period of economic growth inevitably leads to a long-term stagflation crisis. During such a crisis, conservative leaders are replaced by liberal leaders committed to business regulation, social innovation, equity, and redistribution via an enhanced role of government. A liberal era is usually characterized by lower population growth, decreased inequality, and inflationary conditions. In our opinion, the secular mean-reverting behavior of the stock market fits nicely into this model of socioeconomic dynamics. It seems possible to demonstrate that conservative political waves are usually associated with above-average stock market returns, whereas during the liberal political waves, stock market returns are below average (see footnote 22). The investigation of this topic is left for future research.

22 In the American two-party system, the Democratic Party traditionally adheres to the liberal policy, whereas the Republican Party maintains the conservative policy.
Liberals believe in government action to achieve equal opportunity and equality for all. They advocate for government regulation of the economy, higher taxes (primarily on the wealthy), and redistribution of wealth. In contrast, conservatives believe in free markets, limited government, lower taxes, and individual liberty. There is strong evidence that in the US, stock market returns vary across political cycles; see Hensel and Ziemba (1995) and Santa-Clara and Valkanov (2003). Yet, there are many examples of a Republican president implementing liberal policy and a