Real-time detection of regimes of predictability in the US equity premium



Summary
We propose new real-time monitoring procedures for the emergence of end-of-sample predictive regimes using sequential implementations of standard (heteroskedasticity-robust) regression t-statistics for predictability applied over relatively short time periods. The procedures we develop can also be used for detecting historical regimes of temporary predictability. Our proposed methods are robust to both the degree of persistence and endogeneity of the regressors in the predictive regression and to certain forms of heteroskedasticity in the shocks. We discuss how the monitoring procedures can be designed such that their false positive rate can be set by the practitioner at the start of the monitoring period using detection rules based on information obtained from the data in a training period. We use these new monitoring procedures to investigate the presence of regime changes in the predictability of the US equity premium at the 1-month horizon by traditional macroeconomic and financial variables, and by binary technical analysis indicators. Our results suggest that the 1-month-ahead equity premium has temporarily been predictable, displaying so-called "pockets of predictability," and that these episodes of predictability could have been detected in real time by practitioners using our proposed methodology.

| INTRODUCTION
A large body of empirical research has been undertaken investigating stock return predictability, with a wide array of financial and macroeconomic variables considered as putative predictors for returns. These include valuation ratios such as the dividend-price ratio, earnings-price ratio, book-to-market ratio, various interest rates and interest rate spreads, and macroeconomic variables including inflation and industrial production; see, for example, Fama (1981), Keim and Stambaugh (1986), Campbell (1987), Campbell and Shiller (1988a, 1988b), Fama and French (1988, 1989), and Fama (1990). Focusing on the in-sample predictability of US stock index returns, these studies find relatively weak statistical evidence of predictability over short horizons, but as the forecasting horizon increases the evidence of predictability strengthens, and for longer horizons is strongly statistically significant. Finding that stock returns are predictable using financial ratios and macroeconomic variables does not necessarily mean that stock markets are inefficient. From a linearization of the standard present value model, if the dividend-price ratio for a stock varies over time then it must forecast either the dividend growth rate or returns, to some extent; see, among others, Campbell and Shiller (1988a, 1988b) and Cochrane (2008). More generally, if a stock market is efficient then the expected excess return for the relevant stocks might be predictable using a variety of financial and macroeconomic variables if investors' risk premia are time varying and correlated with the business cycle.
Although consistent with orthodox financial theory, it has been argued that there are statistical reasons to suspect that the strong support for predictability obtained in earlier studies could be spurious. Nelson and Kim (1993) and Stambaugh (1999) showed that high-persistence predictors led to biased coefficients in predictive regressions if the innovations driving the predictors were correlated with returns, as is known to be the case for many of the popular macroeconomic and financial predictors used. Goyal and Welch (2003) showed that the persistence of dividend-based valuation ratios increased significantly over the typical sample periods used in empirical studies of predictability, and argued that, as a consequence, out-of-sample predictions using these variables are no better than those from a no-change strategy. When estimation and inference techniques are used that take account of the high degree of persistence of the typical financial and macroeconomic predictors used, the statistical evidence of short- and long-horizon predictability is considerably weaker and in some cases disappears completely; see, among others, Ang and Bekaert (2007), Boudoukh, Michaely, Richardson, and Roberts (2007), Welch and Goyal (2008), and Breitung and Demetrescu (2015).
The vast majority of empirical studies of stock market predictability are based on the assumption of a constant parameter predictive regression model. However, there are several reasons to suspect that if stock returns are predictable, then it is likely to be a time-varying phenomenon; for example, significant changes in monetary policy and financial regulations could lead to shifts in the relationship between macroeconomic variables and the fundamental value of stocks, via the impact of these changes on economic growth and the growth rates of earnings and dividends. A growing body of empirical evidence is also supportive of this view. For example, Henkel, Martin, and Nardari (2011) found that return predictability in the stock market appeared to be closely linked to economic recessions with dividend yield and term structure variables displaying predictive power only during recessions. Timmermann (2008) argued that for most time periods stock returns are not predictable but that there are "pockets in time" where evidence of local predictability is seen. In particular, if predictability exists as a result of market inefficiency rather than because of time-varying risk premia, then rational investors will attempt to exploit its presence to earn abnormal profits. Assuming that a large enough proportion of the total number of investors are rational, this behavior will eventually cause the predictive power of the relevant predictor to be eliminated. If a variable begins to have predictive power for stock returns, then a short window of predictability might exist before investors learn about the new relationship between that variable and returns, but it will eventually disappear; see, in particular, Paye and Timmermann (2006) and Timmermann (2008). 
It therefore seems reasonable to consider the possibility that the predictive relationship might change over time, so that over a long span of data one may observe some, possibly relatively short, windows of time during which predictability occurs. In such cases, standard predictability tests based on the full sample of available data will have very low power to detect these short-lived predictive episodes.
Several empirical studies find evidence suggesting that parameter instability is a feature of return prediction models. Lettau and Ludvigson (2001) found instability in the predictive ability of the dividend and earnings yield in the second half of the 1990s. Goyal and Welch (2003) and Ang and Bekaert (2007) found instability in prediction models for US stock returns based on the dividend yield in the 1990s. Paye and Timmermann (2006) undertook a comprehensive analysis of prediction model instability for international stock market indices using the Bai and Perron (1998, 2003) structural change tests. They found evidence of structural breaks for many of the countries considered, arguing that "Empirical evidence of predictability is not uniform over time and is concentrated in certain periods" (Paye and Timmermann, 2006, p. 312). They found some evidence of a common break for the USA and UK in 1974-1975, and for European stock markets linked to the introduction of the European Monetary System in 1979. However, it is important to stress that conventional parameter instability tests such as Chow tests and Bai-Perron tests are not valid for use with highly persistent, endogenous predictors. Indeed, Paye and Timmermann (2006) used Monte Carlo simulations to show that in such cases this can cause substantial size inflation in the Bai-Perron tests coupled with a lack of power because of the large amount of noise typically present in predictive regression models. Moreover, traditional regression t-tests for predictability and structural break tests are an ex post tool for detecting the statistical significance of regressors and structural breaks in a historical sample of data.
They are less useful in monitoring for change in real time because their repeated application in prediction models can lead to size distortions (with the probability of at least one of the tests rejecting tending to unity as the number of tests in the sequence increases) and, as a consequence, spurious evidence of in-sample predictive ability; see Inoue and Rossi (2005) for a detailed discussion of this problem in relation to t-tests.
Motivated by this, we develop new statistical monitoring techniques, specifically designed to avoid the spurious detection problems discussed in Inoue and Rossi (2005). We use these methods to monitor the stability of predictive regression models for the US equity premium. As putative predictors we consider various commonly used traditional macroeconomic and financial variables as well as a range of technical analysis rules where only price or volume data is used to predict returns. In an early paper in this direction, Brock, Lakonishok, and LeBaron (1992) studied the ability of moving average and trading range break trading rules to predict the Dow Jones Industrial Average (DJIA) index using daily data from 1897 through to 1986, finding strongly significant evidence that the trading strategies generated abnormal returns that could not be explained by serial correlation or conditional heteroskedasticity in the returns. Sullivan, Timmermann, and White (1999) analyzed a longer data sample on the DJIA, and found that the rules employed by Brock et al. (1992) were unable to identify profitable trading strategies for the period 1987-1996, although there was some evidence that they managed to do so prior to this period. Hudson, Dempsey, and Keasey (1996) undertook a similar analysis to Brock et al. (1992) for UK stock index returns and found that, although the rules examined do have predictive power, their use would not enable investors to make abnormal returns once trading transaction costs were accounted for. More recently, Neely, Rapach, Tu, and Zhou (2014) investigated the in-sample and out-of-sample predictive power of binary technical analysis indicators in a predictive regression-based context. Indicators are constructed from moving-average rules, momentum rules, and on-balance volume rules. They found that the indicators had predictive power that emulated that of the traditional financial and macroeconomic variables. 
They also showed that combining information from technical analysis indicators and macroeconomic variables significantly improved equity risk premium forecasts versus using either type in isolation.
The real-time monitoring procedures we propose are designed with the aim of detecting, as soon as possible after their inception, relatively short windows of predictability arising from shifts in the parameter on the predictor variable in the predictive regression. The presence of short pockets of predictability among long periods of no predictability in US stock returns has recently been documented by Farmer, Schmidt, and Timmermann (2019), using nonparametric methods and employing an R²-type statistic to measure predictability strength. Our analysis is also related to work by Dangl and Halling (2012), who use Bayesian methods to investigate gradual changes in return predictability. Although our procedures are designed to detect short regimes of predictability when the regime change is discrete, they can also be used to detect predictive regimes when the regime change is gradual, and we investigate this issue with Monte Carlo simulations. Our focus is on the real-time detection of such regimes, but the methods we use can also be used for an historical analysis of the stability of predictive regression models. Our detection procedures are based around the sequential application of simple heteroskedasticity-robust regression t-statistics for the significance of the predictor variable calculated over a subsample of fixed length m. When used as simple one-shot tests these statistics can be compared with estimated critical values obtained from a training period using the subsampling-like method of Andrews (2003) and Andrews and Kim (2006). It is important to note that these resulting one-shot tests will be able to detect general structural change in the slope parameter on the predictor variable (in that particular subsample, relative to the rest of the sample), not just a change to predictability within the given subsample.
This is because a rejection will occur where the estimated slope coefficient on the predictor differs significantly between the subsample over which the one-shot test is based and the subsamples used in the critical value generation. Based on the arguments above and the work of Paye and Timmermann (2006) and Timmermann (2008), among others, it seems reasonable to focus attention on the null model of no predictive relationship, such that structural change where it should occur is between no predictability and a short window of predictability. It is this interpretation that we will focus on in motivating and outlining our procedure. In our application to US equity data we first apply standard predictability tests to the full data sets (and indeed the training periods used to obtain the estimated critical value) to check for any evidence of sustained predictability in those samples.
Our approach is based on the sequential application of these one-shot subsample test statistics commencing from a given start date. Because this is based on a sequence of subsample statistics, we need to avoid the issue of spurious detections highlighted by Inoue and Rossi (2005) by allowing the practitioner to control the overall false positive detection rate for the resulting procedure. To this end, we suggest two possible detection procedures, both of which are based on information obtained from the data in the training period. Applied using end-of-sample forms of the subsample predictability tests, both of these approaches can be used to provide a real-time monitoring procedure for the emergence of a regime of predictive ability of a regressor for returns data. The first procedure involves comparing the sequence of statistics in the monitoring period with the extremal value of the statistic (either the most negative, most positive or largest in absolute value, as appropriate to the alternative hypothesis being tested) within the training period. A predictability regime is signaled if one obtains an outcome of the predictability statistic that exceeds this extreme value from the training period. Under the second procedure we discuss, a predictability regime is deemed to have occurred if and when the number of consecutive rejections (at a given marginal significance level using a critical value estimated by subsampling from the training period) by the one-shot tests observed in the monitoring period exceeds the longest run of such rejections in the training period. Both procedures can also be used to form estimates of the locations of the signaled predictive regimes.
The remainder of the paper is organized as follows. Section 2 outlines the time-varying predictive regression model forming the basis for our analysis. Section 3 details our proposed approach to detecting windows of predictability and for dating any predictive regimes signaled, showing how to implement real-time detection procedures whose false positive detection rates can be controlled in practical applications. Section 4 reports the results from Monte Carlo simulations to investigate the finite-sample behavior of our proposed procedures. Section 5 presents an applied investigation into the predictability of the 1-month-ahead equity premium on the S&P Composite index. Section 6 concludes. An online Supporting Information Appendix contains a proof of Proposition 1 as well as additional Monte Carlo results (these results are summarized in Section 4.2) and additional material relating to the empirical application discussed in Section 5.

| THE PREDICTIVE REGIME MODEL
We assume a relationship between the equity premium, y_t, and a single predictor variable, x_t,¹ that can be described by the following data-generating process (DGP):

y_t = μ_y + ∑_{j=1}^{n} β_j d_t(e_j, m_j) x_{t−1} + ε_{y,t},  t = 1, …, T,  (1)

where the (putative) predictor is generated by

x_t = μ_x + s_{x,t},  (2)

s_{x,t} = ρ s_{x,t−1} + ε_{x,t},  (3)

with s_{x,0} = 0, and where d_t(e_j, m_j) is a dummy variable defined such that d_t(e_j, m_j) takes the value 1 for m_j > 0 consecutive values of t, ending with t = e_j, and the value 0 otherwise. The innovation vector ε_t := [ε_{y,t}, ε_{x,t}]′, where the notation "x := y" denotes that x is defined by y, is assumed to be a strictly stationary and uncorrelated mean zero process with unconditional covariance matrix given by

Σ := E(ε_t ε_t′) = [ σ_y², r_xy σ_y σ_x ; r_xy σ_y σ_x, σ_x² ],

where r_xy, |r_xy| < 1, is the correlation between ε_{y,t} and ε_{x,t}. Note that our assumption on ε_t allows for the presence of conditional heteroskedasticity, such as generalized autoregressive conditional heteroskedasticity (GARCH) or stationary autoregressive stochastic volatility, in both ε_{y,t} and ε_{x,t}. In the context of Equation 1, if β_j ≠ 0, then we have a predictive regime of y_t by x_{t−1} of length m_j observations running from t = e_j − m_j + 1 through to t = e_j. The model in Equation 1 allows for n ≥ 0 such predictive regimes. Consistent with the discussion in the Introduction and Paye and Timmermann (2006) and Timmermann (2008), we have in mind scenarios where such regimes are relatively scarce and short lived, so that both the number of predictive regimes, n, and their durations, m_j, j = 1, …, n, are taken to be small relative to the sample size, T. We assume e_j < e_{j+1} − m_{j+1} such that the regimes where predictability holds are ordered (i.e., d_t(e_1, m_1) is the earliest regime) and nonoverlapping. Our proposed predictive regime detection procedure will consider the quantities e_j and m_j, which delimit the start and end dates of the predictive regimes, and the number of regimes, n, to be unknown to the practitioner. Outside of these n predictive regimes the slope parameter in Equation 1 is zero and the DGP is such that y_t = μ_y + ε_{y,t} and, hence, y_t is unpredictable (in mean) due to ε_{y,t} being serially uncorrelated (a standard assumption in this literature). Where n = 0 in Equation 1, y_t is unpredictable at all time periods.

¹ For lucidity, we outline our procedure for the case of a single predictor. Our approach can be extended to the case where multiple predictors feature in Equation 1. Here individual subsample t-statistics, of the form discussed in Section 3.1, associated with each of the predictor variables could be considered along with multiparameter heteroskedasticity-robust regression F-statistics. Consideration would need to be given to the appropriate statistics and decision rules to adopt, and to the usual issues surrounding multiple (significance) testing. Moreover, although we focus on the case where a constant term is included in both Equations 1 and 2, our approach is also valid for a more general deterministic component, such as a polynomial deterministic trend, appearing in both components provided it is included in the test regression in Equation 4 and the t-statistic, τ_{e,m}, in Equation 5 is commensurately redefined.
As is standard in this literature, we have adopted an AR(1) specification for s_{x,t}, and hence for x_t, in Equation 3. As we will discuss in Section 3, the predictive regime detection procedures we propose in this paper can be applied regardless of whether the autoregressive root, ρ, in Equation 3 is such that ρ = 1 (a unit root predictor) or |ρ| < 1 (a stationary predictor). Moreover, ρ is also allowed to be T-dependent such as occurs, for example, in cases where the predictor is strongly persistent, displaying either local or moderate deviations from a unit root; for full-sample predictability tests directed at the latter, see Kostakis, Magdalinos, and Stamatogiannis (2015). The AR(1) specification in Equation 3 is not critical for our analysis, and it could be generalized to allow ε_{x,t} to be a weakly autocorrelated process without affecting the validity of our proposed procedures; see Remark 3 in Section 3.2.1.
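To make the model concrete, the DGP in Equations 1-3 with a single predictive regime (n = 1) can be simulated as follows. This is a minimal sketch: the function name `simulate_dgp` and the illustrative parameter values (rho = 0.98, r_xy = -0.9, unit innovation variances) are our own choices, not values taken from the paper.

```python
import numpy as np

def simulate_dgp(T, beta, e_j, m_j, rho=0.98, r_xy=-0.9,
                 mu_y=0.0, mu_x=0.0, seed=0):
    """Simulate one draw from Equations 1-3 with a single predictive
    regime of length m_j ending at t = e_j (1-based, as in the paper)."""
    rng = np.random.default_rng(seed)
    # correlated innovations (eps_y, eps_x) with unit variances
    cov = np.array([[1.0, r_xy], [r_xy, 1.0]])
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=T + 1)
    # Equations 2-3: x_t = mu_x + s_t, s_t = rho * s_{t-1} + eps_x,t, s_0 = 0
    s = np.zeros(T + 1)
    for t in range(1, T + 1):
        s[t] = rho * s[t - 1] + eps[t, 1]
    x = mu_x + s
    # Equation 1: dummy d_t = 1 for t = e_j - m_j + 1, ..., e_j
    d = np.zeros(T + 1)
    d[e_j - m_j + 1 : e_j + 1] = 1.0
    y = np.zeros(T + 1)
    for t in range(1, T + 1):
        y[t] = mu_y + beta * d[t] * x[t - 1] + eps[t, 0]
    return y[1:], x[1:], d[1:]  # drop the t = 0 initialization entry
```

Outside the regime window the slope on x_{t−1} is zero, so y_t is serially uncorrelated noise; inside it, y_t loads on the lagged predictor with coefficient beta.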
In what follows, to facilitate our later analysis of real-time monitoring for the emergence of predictive regimes, we make a distinction between the end of the monitoring period, which we denote by t = E, and the notional future end of the DGP for y_t, that is, t = T, such that E ≤ T.

| Subsample regression t-statistics
We are interested in detecting the presence of a predictive regime for y_t in real time and propose a way of doing this using subsample regression t-statistics. To that end, consider first selecting a subsample of m observations running from t = e − m + 1 to t = e, where m is a fixed value (independent of the sample size, T) chosen by the user, and run the (generic) ordinary least squares (OLS) regression:

y_t = α + β x_{t−1} + u_t,  t = e − m + 1, …, e.  (4)

We then calculate the regression t-statistic, based around a heteroskedasticity-robust variance estimate (see White, 1982), for the significance of x_{t−1} in Equation 4; that is,

τ_{e,m} := β̂_{e,m} / se(β̂_{e,m}),  (5)

where β̂_{e,m} is the OLS estimate of β from Equation 4 and se(β̂_{e,m}) the associated White heteroskedasticity-robust standard error. Detection of a predictive regime holding between y_t and x_{t−1} for the given subsample t = e − m + 1, …, e can be based on τ_{e,m}. As a particular example, suppose we have data available for t = 1, …, T* + m ≤ T; a test for the presence of a predictive regime in the last m available sample observations would therefore be based on the statistic τ_{T*+m,m}. Standard regime detection tests, such as those outlined in Paye and Timmermann (2006), use asymptotic (in the sample size T) distribution theory to approximate the test's critical value, but this approximation is based on the assumption that the sample window m used in constructing the statistic is a positive fraction of T. This assumption is clearly not consistent with our aim of detecting predictive regimes of short duration. Moreover, even if we were to assume m to be a function of T, the limiting distribution of τ_{e,m} will depend on nuisance parameters in the DGP in Equations 1-3; specifically, the degree of persistence of the predictor variable, x_t, and the correlation, r_xy, between ε_{y,t} and ε_{x,t}. Without knowledge of these, valid asymptotic critical values could not be obtained.
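The statistic in Equation 5 can be computed directly. The sketch below is our own illustration (the function name `tau_stat` and the 0-based indexing convention are assumptions); the variance formula is the standard White heteroskedasticity-robust (HC0) estimator.

```python
import numpy as np

def tau_stat(y, x, e, m):
    """Heteroskedasticity-robust t-statistic for the slope on x[t-1] in an
    OLS regression of y[t] on a constant and x[t-1], over the subsample
    t = e-m+1, ..., e (0-based arrays: y[e-m+1:e+1] paired with x[e-m:e])."""
    ys = y[e - m + 1 : e + 1]
    X = np.column_stack([np.ones(m), x[e - m : e]])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ ys
    u = ys - X @ beta
    # White robust covariance: (X'X)^{-1} (sum u_t^2 x_t x_t') (X'X)^{-1}
    meat = X.T @ (X * (u ** 2)[:, None])
    V = XtX_inv @ meat @ XtX_inv
    return beta[1] / np.sqrt(V[1, 1])
```

With m fixed and small (e.g., m = 30 in the paper's illustration), this costs only a 2-by-2 inversion per subsample, so the full sequence of statistics is cheap to update in real time.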
An alternative approach, which we will consider further in the context of the detection procedure proposed in Section 3.2.2, robust to the degree of persistence and endogeneity of the predictor, can be based on the subsampling approach of Andrews (2003) and Andrews and Kim (2006). In the end-of-sample example above, suppose we have a sample of size T* + m and we form the predictability statistic τ_{T*+m,m}. To obtain a critical value, one uses the training period, t = 1, …, T*, to compute the T* − m analogous statistics {τ_{e,m}}, e = m + 1, …, T*. The (1 − π) sample quantile of these statistics is the estimated significance level-π critical value for the end-of-sample predictability test. By construction, the resulting test is (asymptotically in T) robust to nuisance parameters in Equations 1-3 because the training period statistics have the same functional dependence on those nuisance parameters as τ_{T*+m,m}. This test will have nontrivial power whenever there is predictability in the last m observations, but not in the training period.
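In code, this subsampling critical value is just a sample quantile of the training-period statistics. A sketch, with illustrative function names of our own choosing:

```python
import numpy as np

def subsample_critical_value(training_stats, pi=0.10):
    """Estimated significance level-pi critical value: the (1 - pi)
    sample quantile of the training-period tau statistics."""
    return float(np.quantile(np.asarray(training_stats), 1.0 - pi))

def one_shot_test(end_stat, training_stats, pi=0.10):
    """Signal predictability in the last m observations if the
    end-of-sample statistic exceeds the estimated critical value."""
    return end_stat > subsample_critical_value(training_stats, pi)
```

Because the critical value is estimated from statistics with the same functional form as the end-of-sample statistic, no asymptotic distribution (and hence no knowledge of ρ or r_xy) is required.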
Crucially, though, the discussion above relates to a one-shot predictability test, whereas our goal in this paper is to develop real-time monitoring procedures for the emergence of an end-of-sample predictive regime. To that end, we will construct a sequence of τ_{e,m} statistics, of the form given in Equation 5, calculated for each possible end-of-subsample date e = T* + m, …, E, recalling that E denotes the end of the monitoring period, a parameter set by the practitioner. The predictive regime detection procedures we propose below are based on comparing the behavior of this sequence of statistics with corresponding sequences within the training period, and will be designed such that the theoretical (i.e., large-sample) false positive rate (FPR) of the procedures is known and can be properly controlled, where the FPR represents the probability of incorrectly signaling the presence of at least one predictive regime in the monitoring period.

| The detection procedures
We now detail our predictive regime detection approaches. For transparency, these are presented in the context of upper-tailed testing (i.e., for predictability regimes where β_j > 0), but they can be adapted to lower-tailed or two-tailed testing in an obvious way. We will discuss two procedures, each of which forms a decision rule for rejecting the null of no predictability in the monitoring period based on specific properties of the sequence of τ_{e,m} statistics within the given training period. The first procedure we consider will be based on the largest of the τ_{e,m} statistics observed in the training period, and the second will be based on the longest run of outcomes of the τ_{e,m} statistics in the training period that exceed a given (critical) value.
For both of the procedures that follow, we define the training period as t = 1, …, T*. We assume that no predictive regime occurs within the training period; that is, T* < e_1 − m_1 + 1; further discussion relating to where this assumption might be violated is given in Section 3.4. In what follows we assume that T* and E are such that T* := ⌊λ_1 T⌋ and E := ⌊λ_2 T⌋, with ⌊·⌋ denoting the integer part of its argument, and where 0 < λ_1 < λ_2 ≤ 1.

| The MAX procedure
The first detection procedure we propose, which we will denote by MAX, is based on the maximum value of the sequence of τ_{e,m} statistics taken across the training and monitoring periods (cf. Astill, Harvey, Leybourne, Sollis, & Taylor, 2018). More precisely, with {τ_{e,m}}_{e=m+1}^{T*} and {τ_{e,m}}_{e=T*+m}^{E} constituting the statistics obtained from the training and monitoring periods, respectively, we consider a detection procedure whereby a predictive regime in the monitoring period is signaled if max_{e∈[T*+m,E]} τ_{e,m} exceeds max_{e∈[m+1,T*]} τ_{e,m}; that is, if the largest τ_{e,m} in the monitoring period exceeds the largest τ_{e,m} in the training period.
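Given the two sequences of statistics, the MAX decision rule reduces to a single comparison of maxima; in real time one would simply stop at the first monitoring-period exceedance. A sketch (function name and return convention are our own):

```python
import numpy as np

def max_procedure(training_stats, monitoring_stats):
    """MAX decision rule: signal a predictive regime if the largest tau
    statistic in the monitoring period exceeds the largest in training.
    Returns (detected, index of first exceedance in the monitoring
    sequence, or None)."""
    threshold = np.max(training_stats)
    exceed = np.flatnonzero(np.asarray(monitoring_stats) > threshold)
    detected = exceed.size > 0
    return detected, (int(exceed[0]) if detected else None)
```

The index of the first exceedance is the real-time detection point, which can also serve as an estimate of when the predictive regime is underway.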
We now establish the theoretical (as T → ∞) FPR of the MAX procedure when run out to the end-of-monitoring date, E. This is done by evaluating the limiting probability that max_{e∈[T*+m,E]} τ_{e,m} > max_{e∈[m+1,T*]} τ_{e,m} under the null hypothesis that no predictability is present in the DGP. This result is now given in Proposition 1.

Proposition 1. Let (y_t, x_t) be generated according to Equations 1-3 under the conditions stated in Section 2. Let the MAX decision rule be as given above. If n = 0, such that no predictability is present in the DGP, then as T → ∞,

P( max_{e∈[T*+m,E]} τ_{e,m} > max_{e∈[m+1,T*]} τ_{e,m} ) → α* := (λ_2 − λ_1)/λ_2 = lim_{T→∞} α,

where, for the stated choices of monitoring and training periods,

α := (E − T* − m + 1) / (E − 2m + 1).  (9)

Remark 1. The result in Proposition 1 provides an expression for the theoretical FPR of the MAX decision rule; that is, the limiting probability that the maximum of the τ_{e,m} statistics in the monitoring period exceeds the maximum of the τ_{e,m} statistics in the training period in the case where no predictability occurs. This is seen to be simply the limiting value of the ratio formed by dividing the total number of τ_{e,m} statistics computed in the monitoring period (here E − T* − m + 1) by the total number of τ_{e,m} statistics calculated in the training and monitoring periods combined (here (E − T* − m + 1) + (T* − m) = E − 2m + 1). This result holds more generally when comparing the maxima of the sequences of τ_{e,m} statistics obtained from any two disjoint subintervals of the data whose lengths are both functions of T.
Remark 2. The result in Proposition 1 holds regardless of the degrees of persistence and endogeneity of the regressors in the predictive regression and holds for all conditionally heteroskedastic innovations that satisfy the condition of strict stationarity. In particular, the result in Proposition 1 holds regardless of whether the putative predictor x_t in Equation 3 is: weakly dependent (|ρ| < 1); strongly persistent (ρ = 1 − c/T with the constant c ≥ 0, where c = 0 yields the pure unit root case, while c > 0 corresponds to the local-to-unity case); or moderately persistent (ρ = 1 − cT^{−θ} with c > 0 and θ ∈ (0,1), the moderate deviations from unity case of Kostakis et al., 2015).
Remark 3. As demonstrated in the proof of Proposition 1, the stated result follows from an application of theorem 2.1 of Ferreira and Scotto (2002, p. 478), with r = s = 1 in their notation, which applies to strictly stationary sequences of mixing random variables. To do so we establish that, under the conditions given in Section 2, {τ_{e,m}} forms a strictly stationary and (m − 1)-dependent sequence, the latter therefore satisfying the required mixing condition stated in Ferreira and Scotto (2002, p. 476). We have assumed for simplicity that ε_t is serially uncorrelated, which yields the (m − 1)-dependence result. Weakening this assumption to allow for stationary serial correlation in ε_{x,t} would not alter this result. It is standard in this literature to assume that ε_{y,t} is serially uncorrelated. However, this could be weakened to allow finite MA(k), 0 ≤ k < ∞, behavior in ε_{y,t}, in which case {τ_{e,m}} would be a (k + m − 1)-dependent sequence but would still satisfy the required mixing condition. We cannot formally allow for unconditional heteroskedasticity in ε_t because {τ_{e,m}} would not then form a strictly stationary sequence, and so we could not appeal to theorem 2.1 of Ferreira and Scotto. However, we have still based our approach on heteroskedasticity-robust t-statistics because, although not exactly invariant to any unconditional heteroskedasticity present (which is what would be needed, as m is finite), we expect them to be considerably more robust than the corresponding t-statistics based on OLS standard errors. In Section 4 we will investigate the impact of unconditional heteroskedasticity in ε_t on the finite-sample FPRs of the procedures discussed in this section.
For given values of T* and m, we can use Equation 9 to approximate the empirical FPR that would be obtained in practice for any monitoring horizon E. We observe that α is a monotonically increasing function of E, as ∂α/∂E = (T* − m)/(E − 2m + 1)² > 0. Hence, other things being equal, the longer the monitoring period, the greater the likelihood of spuriously finding a predictive regime. To illustrate, Figure 1 graphs this approximation for the case of T* = 400 and m = 30. So, for example, reading from Figure 1, if we wish to monitor out to E = 680, then the FPR will be about 0.40.
We can also rearrange Equation 9 as E = (T* + m − 1 − α(2m − 1))/(1 − α), (10) which is useful if we wish to know the maximum monitoring horizon E such that the FPR for the MAX procedure is (approximately) controlled at α. For the current illustration, Equation 10 shows that E should be chosen to be no more than about 520 for a choice of α = 0.20 (which is also apparent from Figure 1).
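The FPR approximation and the implied maximum monitoring horizon can be sketched in a few lines of Python (the paper's own code is MATLAB; the function names here are ours, and the optional separation parameter k anticipates the generalization discussed later in this section):

```python
def fpr(E, T_star, m, k=0):
    """Approximate FPR (Equation 9) of the MAX procedure when
    monitoring out to horizon E, with training period of length
    T* ending k periods before monitoring begins."""
    return (E - T_star - m + 1) / (E - 2 * m + 1 - k)

def max_horizon(alpha, T_star, m, k=0):
    """Maximum monitoring horizon E (Equation 10) such that the FPR
    is (approximately) controlled at alpha."""
    return (T_star + m - 1 - alpha * (2 * m - 1 + k)) / (1 - alpha)
```

With T* = 400 and m = 30, `fpr(680, 400, 30)` is about 0.40 and `max_horizon(0.20, 400, 30)` is about 520, matching the illustration read from Figure 1.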

The SEQ procedure
Our second detection procedure, which we denote by SEQ, is based on comparing the length of the longest contiguous sequence of statistics τ e,m in the monitoring period that exceed some threshold value preset by the practitioner with the corresponding measure taken over the training period. An obvious choice for this threshold value, which we will adopt in what follows, is a relevant marginal critical value, for some significance level π, for the one-shot τ e,m test. 2 In doing so we follow the subsampling approach of Andrews (2003) and Andrews and Kim (2006) and calculate an empirical critical value, denoted by cv π in what follows, from the training period. Recalling that the sequence of τ e,m statistics that make use of data within the training period is given by τ e,m for e = m + 1, …, T*, cv π is defined such that (approximately) a proportion π of these training-period statistics exceed it. Under the conditions on the DGP considered by Andrews and Kim, cv π is a consistent (as T → ∞) estimate of the true π significance-level critical value. However, it should be stressed that the SEQ procedure we propose does not rely on this consistency property holding for cv π. Based on cv π, we then consider the maximum number of contiguous values of τ e,m within the training period that exceed cv π. To this end, define R π,e := 1(τ e,m > cv π), where 1(·) denotes the indicator function, and consider the following measure over e = L to e = U, with U ≥ L: R π (L, U) := (U − L + 1) ∏ e = L, …, U R π,e.

2 Any sensible threshold value could in principle be used. A benefit of using such a critical value is that, where the training period contains no predictive regimes, each individual test in our monitoring sequence can be interpreted marginally as a test for predictability in that particular subsample. As such, it makes sense in practice to set π to a conventional significance level; for example, π = 0.05 or π = 0.10.
Here, when R π (L, U) is nonzero, its value, U − L + 1, represents the length of a sequence of contiguous exceedances. The maximum length of contiguous exceedances in the training period is then given by max L,U ∈ [m + 1, T*] R π (L, U). The corresponding measure for the monitoring period is given by max L,U ∈ [T* + m, E] R π (L, U). Our proposed SEQ procedure is then to signal a predictive regime in the monitoring period if max L,U ∈ [T* + m, E] R π (L, U) > max L,U ∈ [m + 1, T*] R π (L, U). Paralleling the result in Proposition 1, when there is no predictability in the training or monitoring periods we conjecture that, in large samples, P(max L,U ∈ [T* + m, E] R π (L, U) > max L,U ∈ [m + 1, T*] R π (L, U)) ≤ α*, (11) where α* is as defined in Proposition 1. Note here that, in contrast to the result for the MAX monitoring procedure, where the large-sample FPR when monitoring up to E is exactly α*, the corresponding quantity for the SEQ procedure is bounded by α*. This arises because max R π (L, U) can only assume integer values, so there is a nonzero probability of a tied value in the training and monitoring periods, even asymptotically. Hence the strict equality obtained for the MAX procedure from Proposition 1 is replaced by the weak inequality in Equation 11. The (approximate) relationship in Equation 10 can also still be considered to hold, but interpreted as the maximum monitoring horizon E such that the FPR for the SEQ procedure is bounded by α. 3 It will be convenient to denote the training-period maximum length of contiguous exceedances, max L,U ∈ [m + 1, T*] R π (L, U), by l π. Note that the first time period at which it would be possible for SEQ to signal a predictive regime is t = T* + m + l π, because this is the first occasion on which R π (L, U) in the monitoring period can exceed l π. In contrast, it is possible for the MAX procedure to signal a predictive regime as early as t = T* + m. However, we can control l π via the choice of π: the larger π is, the smaller cv π is, and hence the larger we would naturally expect l π to be.
This relationship is important, as choosing a large value of π might lead to what is considered an unacceptable delay before being able to detect a predictive regime. This is not a consideration with MAX, however. In fact, MAX can be thought of as an extreme case of SEQ in which we choose cv π = max e ∈ [m + 1, T*] τ e,m (the smallest value of cv π such that π = 0) and hence l π = 0.
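The MAX and SEQ decision rules above can be sketched in Python (a minimal illustration, not the paper's MATLAB implementation; reading cv π as the empirical 1 − π quantile of the training-period statistics is our assumption, and in real-time use the monitoring statistics would of course arrive sequentially rather than as a completed array):

```python
import numpy as np

def longest_run(exceed):
    """Length of the longest contiguous run of True values."""
    best = run = 0
    for x in exceed:
        run = run + 1 if x else 0
        best = max(best, run)
    return best

def detect(tau_train, tau_monitor, pi=0.10):
    """MAX and SEQ signals from sequences of subsample t-statistics
    tau_{e,m} over the training and monitoring periods."""
    tau_train = np.asarray(tau_train, dtype=float)
    tau_monitor = np.asarray(tau_monitor, dtype=float)
    # MAX: some monitoring statistic exceeds the training-period maximum
    signal_max = tau_monitor.max() > tau_train.max()
    # SEQ: longest run of exceedances of cv_pi in monitoring beats l_pi
    cv_pi = np.quantile(tau_train, 1 - pi)
    l_pi = longest_run(tau_train > cv_pi)
    signal_seq = longest_run(tau_monitor > cv_pi) > l_pi
    return bool(signal_max), bool(signal_seq)
```

Setting `pi` close to zero makes `cv_pi` approach the training maximum and `l_pi` approach zero, recovering MAX as the extreme case of SEQ noted above.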

Dating of predictive regimes
In a real-time monitoring context, if the procedure signals the presence of a predictive episode at some time E* ≤ E, then monitoring would of course terminate at that point. However, one could also consider continuing the monitoring procedure up until E, in which case it is possible for both of our proposed MAX and SEQ procedures to detect more than one predictive regime before the notional end-of-monitoring date, E.
Although our focus in this paper is on real-time detection, we can, where at least one predictive regime has been signaled by one of our procedures when run out until the end of the monitoring period, E, provide approximate dates for the location of these predictive regime(s). This should be viewed as a historical dating exercise rather than something that would be done in the context of a real-time monitoring procedure. Detailing this first in the context of the MAX procedure, for e = T* + m, …, E define R 0,e := 1(τ e,m > max s ∈ [m + 1, T*] τ s,m). Next, let D denote an E × 1 vector of zeros, and set D e = 1 whenever R 0,e = 1. Now suppose that D has h consecutive 1s in positions e = j, …, j + h − 1, where j is the earliest date for which R 0,j = 1. Here R 0,j is based on data over the period j − m + 1, …, j, so we might therefore consider j − m + 1 to represent a feasible start date for the first predictive regime. With R 0,j+h−1 representing the final exceedance in D, and this being based on data over the period j + h − m, …, j + h − 1, we might similarly consider j + h − 1 to represent a feasible end date for this predictive regime. By this categorization, then, the predictive regime covers the contiguous set of dates j − m + 1, …, j + h − 1. In some sense, this set of dates is liberal, or weak, in that it is possible that the predictive regime started after j − m + 1 and ended before j + h − 1; for example, only the later data used in R 0,j may be responsible for triggering that exceedance, and only the earlier data used in R 0,j+h−1 responsible for triggering that exceedance. We might therefore consider an alternative dating approach where the predictive regime is characterized by the subset of dates that yield an exceedance every time they appear in a subsample of data being tested. This subset, which we will refer to as strong, is the contiguous set of dates j, …, j + h − m; note that if h ≤ m − 1, the strong set will be empty. A second predictive regime is deemed to exist if R 0,j+h = 0 but R 0,j+h+s = 1 for some s ≥ 1, and weak/strong dates for the second regime can be determined in the same manner as for the first regime. This extends to more than two regimes in an obvious way. In situations where more than one predictive regime has been detected, it is possible that weak dates associated with consecutive regimes overlap, although this possibility cannot arise with the strong dates.

3 We are unable to provide a formal proof of the result in Equation 11; hence the conjecture, made on the basis of extant, but much more limited, theoretical results. A formal proof would be extremely involved, if even tractable, given the complexity of the arguments needed in Ferreira and Scotto (2002) to establish theoretical results relating to the much simpler case of subsample maxima. However, this conjecture is not without foundation. We have conducted extensive Monte Carlo simulation experiments that appear to support it. Furthermore, these simulation results reveal that the empirical FPR of the SEQ procedure is always below, but very close to, α, implying that the probability of tied values in the training and monitoring periods is very small.
For the SEQ procedure, the dating method follows the same process as for the MAX procedure, but with the nonzero elements of the D vector defined as follows: for e = T* + m + l π, …, E, if R π,k = 1 for all k = e − l π, …, e, set D e − l π, …, D e to 1. That is, for all end-of-window dates e that form part of a contiguous run of at least l π + 1 exceedances R π,e, we set the eth element of D to one. The weak and strong dates can then be categorized in exactly the same way as for MAX, based on the R π,e exceedances involved in the D vector.
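The weak/strong dating rules can be illustrated with a short Python sketch (names are ours; 0-based list positions stand in for the paper's calendar dates, and the input D is the 0/1 exceedance vector defined above):

```python
def regime_dates(D, m):
    """Weak and strong date ranges for predictive regimes, given the
    0/1 exceedance vector D (indexed by end-of-window date e) and
    subsample window length m. A run of exceedances at e = j, ...,
    j + h - 1 gives weak dates [j - m + 1, j + h - 1] and strong dates
    [j, j + h - m] (empty when h <= m - 1)."""
    regimes, j = [], None
    for e, flag in enumerate(list(D) + [0]):  # sentinel closes a trailing run
        if flag and j is None:
            j = e                              # run starts
        elif not flag and j is not None:
            h = e - j                          # run length
            weak = (j - m + 1, j + h - 1)
            strong = (j, j + h - m) if h >= m else None
            regimes.append({"weak": weak, "strong": strong})
            j = None
    return regimes
```

As the docstring notes, consecutive regimes can have overlapping weak date ranges, but never overlapping strong ranges.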

Additional discussion
We conclude this section with some observations, which apply in equal part to the MAX and SEQ procedures.
1. Suppose now that, in contradistinction to our maintained assumption so far, one or more predictive regimes in Equation 1 are present within the chosen training period. Provided such regimes are of finite length and finite in number, then the asymptotic (in T and T * ) properties of the MAX and SEQ procedures are unaffected by this. For a finite-length training period, if predictability regimes existed within it, we would expect both max e2½m + 1,T * τ e,m and l π to be increased relative to the case where no predictability is present in the training period, other things being equal. We might therefore anticipate some reduction in the ability of our procedures to detect genuine predictive regimes present in the monitoring period. We will explore the impact on our proposed procedures of a predictive regime holding in the training period as part of our Monte Carlo simulation study in Section 4.

2. Although not consistent with the interpretation we are placing on the DGP in Equation 1, as discussed in the Introduction it is possible in practice that the training period could exhibit predictability throughout its duration, or a large part of its duration. In this case, an upper-tail rejection arising from MAX or SEQ in the monitoring period should be taken to indicate a statistically significant increase in the magnitude of the slope parameter on x t − 1 (and, hence, in the strength of the predictability of y t by x t − 1) vis-à-vis its value in the training period. In practical applications, we therefore recommend prior application of standard full-sample predictability tests to the training period to investigate whether the assumption of no predictability holds there; this will be done in the empirical data analysis undertaken in Section 5.

3. Our discussion thus far has implicitly assumed that the training period runs from the earliest available time period in the data set to the point immediately before the desired start of monitoring. This makes the training period as large as possible, which ensures that, through the role of T* in Equations 9 and 10, the FPR is as small as possible for a given E or, equivalently, E is as large as possible for a given FPR. In cases where a very long history of data is available, it may be prudent to use only relatively recent data, to avoid including historical predictive regimes in the training period. In practice, such regimes might be detected by prior pretesting, an approach we adopt in the empirical application in Section 5. Furthermore, we have so far focused, for simplicity, on the case where there is no separation between the data used for the training period and the data used for monitoring, with the former spanning t = 1, …, T* and the latter starting at t = T* + 1.
More generally, the last time period included in the training period could be T* − k for some k > 0, allowing a separation between the training period and the start of the monitoring period. This might be relevant in cases where a predictive regime was thought to have occurred towards the end of the training period, so that the training period could be redefined to exclude this regime. As noted in Remark 1, an analogous result to Proposition 1 also holds here, and the expressions for α and E in Equations 9 and 10 in this case become, respectively, α = (E − T* − m + 1)/(E − 2m + 1 − k) and E = (T* + m − 1 − α(2m − 1 + k))/(1 − α).

FINITE-SAMPLE PROPERTIES OF THE MONITORING PROCEDURES
We now report the results from four Monte Carlo simulation experiments designed to study the finite-sample properties of our MAX and SEQ procedures. These investigate the FPRs of the two procedures and their power to detect a predictive regime of given length. Extensive additional simulations were also undertaken to study the detection power of MAX and SEQ as a function of m 1 (the length of the predictive regime in the DGP), and to study the robustness of our procedures to different error term assumptions and patterns of heteroskedasticity, to higher order autocorrelation in the predictor, and to gradual regime change. We present these additional results in an online Supporting Information Appendix and briefly discuss the key findings in Section 4.2. 4 In all of the experiments we generated the simulation data according to the DGP given by Equations 1–3, setting μ y = μ x = 0 (without loss of generality) and using negatively correlated error terms with r x,y = −0.90. 5 For the four sets of experiments reported in the main text we generate ϵ y,t ~ N(0, 1) and ϵ x,t ~ N(0, 1). All of the simulation experiments and the empirical application in Section 5 employ the upper-tailed version of our procedure. 6 In each simulation experiment the sample period at which monitoring starts (T* + m) is the same as in the empirical application, T* + m = 302, and for the main experiments m = 30. 7 All of the experiments are undertaken using MATLAB, employing the Mersenne Twister random number generator function and 10,000 replications.
The first set of experiments studies the power of MAX and SEQ to detect a single predictive regime as a function of β 1 = {0.05, 0.10, …, 0.45, 0.50} for ρ = {0.965, 0.995}, setting π = 0.10. 8 When β 1 = 0 (so that n = 0 and, hence, there are no predictive regimes in the data) the detection frequency obtained from the simulations is equivalent to an empirical FPR, and we also report simulation results for this case. In this first set of experiments we assume a short monitoring period that ends at E = 327, which, given the values used for T* and m, is consistent with α = 0.10 (this can be verified using Equation 9). Therefore, when β 1 = 0, the empirical FPR obtained for each procedure should be approximately equal to 0.10. If a predictive regime does occur during the monitoring period, then the power of our procedures to detect its presence will depend not only on how long the relevant predictive regime continues for (m 1) and its strength (measured by the magnitude of β 1), but also on when the predictive regime occurs relative to the start of monitoring. To investigate this issue in more detail, separate results are computed for five different predictive regime start dates: (a) t = 287 (15 observations before the start of monitoring), (b) t = 297 (five observations before the start of monitoring), (c) t = 302 (at the same time as the start of monitoring), (d) t = 307 (five observations after the start of monitoring), and (e) t = 317 (15 observations after the start of monitoring). In each case the length of the predictive regime in the DGP is set to m 1 = 30. 9 In empirical applications, while there might be a particular reason for favoring a short monitoring period, for predictive regimes that start towards the end of a short monitoring period the power of our procedure to detect their presence could be significantly improved if we monitor for a longer period of time.

4 The online Supporting Information Appendix is available from www.sites.google.com/view/pr-supplementary, which also contains the data and MATLAB code used for the paper.

5 In predictive regression models for the equity premium employing valuation ratios as predictors (e.g., the dividend-price ratio, earnings-price ratio) the relevant error terms are strongly negatively correlated; hence our choice of r x,y = −0.90.

6 For the majority of the macroeconomic and financial variables, and for all of the technical analysis indicators used in the empirical application in Section 5, financial theory suggests a positive relationship with the equity premium. For those macroeconomic and financial variables where financial theory suggests a negative relationship with the equity premium (e.g., interest rates) we use −x t − 1 rather than x t − 1 when testing for a predictive regime, so that an upper-tailed test is applicable. This is consistent with recent research on detecting equity premium predictability using orthodox t-tests (e.g., Campbell & Thompson, 2008; Neely et al., 2014).

7 The data sample used for the equity premium application below is monthly, covering the period December 1974 to December 2015 (T = 493). In the application we monitor from January 2000 (hence T* + m = 302). In addition to m = 30, in the empirical application results are also computed for m = 15 and m = 60.

8 This range of values for ρ and β 1 was chosen following a preliminary analysis of the data used for the empirical application in Section 5. Typically, when AR(1) models are estimated for the traditional predictors used in Section 5 (e.g., the valuation ratios), the AR(1) coefficient estimates lie in the range 0.965–0.998.

9 Therefore, in these experiments m = m 1. In the additional simulation experiments discussed in Section 4.2, we investigate the performance of our monitoring procedure when the values of m and m 1 differ.
To investigate this issue in more detail, in the second set of experiments we repeat the first set of experiments employing the same simulation DGP and predictive regime dates, but extending the monitoring period to E = 361, which is consistent with α = 0.20. Hence the empirical FPR obtained from the simulations in this case (when β 1 = 0) should be approximately equal to 0.20.
The first two sets of experiments assume no predictability in the training period. As discussed in Section 3.4, our procedure can still be used for detecting predictive regimes during the monitoring period if predictability exists during the training period, although the FPR and power of the procedure could be affected. If our procedure is applied to data where a regime of positive predictability exists in the DGP during the training period, both the largest value of τ e,m over the training period, and the longest contiguous sequence of right-tailed τ e,m exceedances over the training period, are likely to be larger than the values obtained if the DGP had contained no predictability over the training period but was otherwise identical. It follows straightforwardly in this case that the power of our procedures to detect a predictive regime over the monitoring period (and also the empirical FPRs) will be reduced relative to the case of no predictability over the training period.
The third and fourth sets of experiments investigate this issue in more detail. In these experiments we repeat the first two sets of experiments, again using the DGP given by Equations 1–3, but in addition to the original predictive regime at locations (a)–(e), an earlier predictive regime is imposed in the DGP during the relevant training periods. Specifically, the full DGP for each set of experiments contains two predictive regimes (i.e., we set n = 2 in Equation 1), where the first predictive regime is set to occur during the training period at t = ⌊T*/2⌋ + 1, and we set m 1 = 15 and β 1 = 0.25 (hence the associated predictive regime in the training period continues for 15 observations). The second predictive regime mirrors the original predictive regime in the first two sets of experiments. The length of this second regime, m 2, and the strength of the predictability, β 2, are set to the same values as the relevant parameters in the first two sets of experiments (m 1 and β 1, respectively). Note that in the third and fourth sets of experiments the predictive regime in the training period is relatively short (being half the length of the predictive regime in the monitoring period for the first two sets of experiments). It is particularly important to assess the finite sample performance of our procedures when there is a short predictive regime in the training period, since short predictive regimes are more difficult to identify than long predictive regimes. If a long predictive regime exists over the initial training period chosen by a researcher using our procedures, then it is more likely that the researcher would be aware of its presence (e.g., via a preliminary analysis of the data).
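To fix ideas, a minimal Python sketch of a DGP in the spirit of Equations 1–3 with a single predictive regime follows (the paper's simulations use MATLAB; the function name, default parameter values, and the exact timing convention for the regime dummy multiplying x t − 1 are our assumptions):

```python
import numpy as np

def simulate_dgp(T, rho=0.965, beta1=0.25, regime_start=302, m1=30,
                 r_xy=-0.90, seed=0):
    """Simulate returns y and an AR(1) predictor x, where y loads on
    x_{t-1} only inside a predictive regime of length m1 starting at
    regime_start. Gaussian errors with contemporaneous correlation r_xy."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, r_xy], [r_xy, 1.0]]
    eps = rng.multivariate_normal([0.0, 0.0], cov, size=T)  # [eps_y, eps_x]
    x = np.zeros(T)
    for t in range(1, T):
        x[t] = rho * x[t - 1] + eps[t, 1]   # persistent predictor
    d = np.zeros(T)
    d[regime_start:regime_start + m1] = 1.0  # regime dummy d_t(e1, m1)
    y = np.zeros(T)
    y[1:] = beta1 * d[1:] * x[:-1] + eps[1:, 0]
    return y, x
```

A two-regime DGP, as used in the third and fourth experiment sets, would simply add a second dummy segment inside the training period.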

Main results
The results from the first set of experiments are given in Figure 2, which, as with Figures 3-5, graphs the empirical frequencies with which at least one predictive regime is signaled by our monitoring procedures MAX (solid and dotted red lines) and SEQ (solid and dotted blue lines) when run across the whole monitoring period under consideration. Recall that the end of the monitoring period for the set of experiments relating to Figure 2 is chosen using Equation 9 to be such that α = 0.10. Therefore, when β 1 = 0 we would expect the simulated predictive regime detection frequencies of our procedures to be close to 0.10. It can be seen that each of the curves reported in Figure 2 indeed starts from approximately 0.10. For both the MAX and SEQ procedures, when the predictive regime starts before or at the same time as the start of monitoring (cases (a)-(c)), power rises rapidly with β 1 . When the predictive regime starts after the start of monitoring (cases (d)-(e)), a higher proportion of the subsamples used when computing τ e,m will be data from the period of the DGP when no predictability exists. Furthermore, in these two cases monitoring ends shortly after the predictive regime starts (e.g., for case (e), monitoring ends 11 observations after the predictive regime starts). Therefore, as expected, for both procedures power rises with β 1 at a lower rate than for cases (a)-(c) and ultimately flattens out at a lower value.
Interestingly, these experiments show that the relative finite sample performance of the MAX and SEQ procedures is sensitive to the strength of the predictability (as measured by the magnitude of β 1 ), the location of the predictability regime relative to the monitoring period, and the persistence of the predictor (as measured by the value of ρ). For case (a), when predictability starts 15 observations before the start of monitoring, and for ρ = 0.965, SEQ has more power than MAX, but the difference in power declines as the strength of the predictability increases. Eventually, the power curve for MAX moves above the curve for SEQ (at approximately β 1 = 0.37). For case (a) with ρ = 0.995, the crossing point of the power curves occurs earlier (at approximately β 1 = 0.23). For case (b) the results have a similar pattern to case (a), although MAX has even more power than SEQ when the predictability is strong compared with case (a). Similar results are found for case (c), although power is noticeably lower for all values of β 1 . This is to be expected because in this case the predictability regime starts at the same time as the monitoring, and therefore the initial subsamples used to compute τ e,m contain very few observations from the predictability regime (by definition, the subsamples used to compute τ e,m contain more observations from the period when there is no predictability until half way through the monitoring period). For case (d), and ρ = 0.965, MAX and SEQ have very similar power when the predictability is weak, although as the predictability strengthens the power curve for MAX moves above the curve for SEQ. The same general pattern exists for ρ = 0.995, although the power of both procedures when the predictability is weak is higher than for ρ = 0.965. For case (e), MAX has more power than SEQ for all values of β 1 , and the difference in power increases as the predictability strengthens.
The results from the second set of experiments are given in Figure 3. As expected, when the monitoring period is extended from E = 327 to E = 361 the predictive regime detection frequency as a function of β 1 increases for both MAX and SEQ. Indeed, the detection frequency and relative finite-sample performance of MAX and SEQ are now virtually identical for each of the predictive regime start dates considered here. This reflects the fact that, because of the longer monitoring period, each set of sequential τ e,m statistics now includes a run of statistics computed using subsamples where a high proportion of each subsample is data from when predictability exists in the DGP. When β 1 = 0 the empirical FPRs of MAX and SEQ both increase to approximately 0.20, again as expected. A further interesting feature of our monitoring procedures can be seen by comparing Figures 2a and 3a, relating to the case where the predictive regime starts 15 observations before monitoring begins (of the cases considered, the one where detection power is least dependent on the start date of the predictive regime). Although, as discussed above, the FPR in Figure 3a is roughly double that in Figure 2a for each procedure, very little difference (for a given value of ρ) is seen between the two cases in terms of the efficacy of MAX and SEQ in detecting a predictive regime, except where β 1 is close to zero. Equation 9 shows that, other things being equal, the longer the training period relative to the monitoring period, the smaller the theoretical FPR of the procedure. But as these simulation results highlight, a lower FPR from a longer training period does not entail a decrease in the efficacy of the procedures to detect a true predictive regime in the monitoring period.
The results for the third and fourth sets of experiments are given in Figures 4 and 5. We find that, as expected, due to the presence of a predictive regime during the training period, in each of the individual experiments both max e2½m + 1,T * τ e,m and l π are increased relative to the case where no predictability is present in the training period and, as a result, the power curves are generally lower in these experiments than the corresponding curves in Figures 2 and 3. For both MAX and SEQ, when β 2 = 0 and E = 327 (consistent with α = 0.10), the detection frequency in Figure 4 is approximately 0.05. When β 2 = 0 and E = 361 (consistent with α = 0.20), the detection frequency for both procedures in Figure 5 is approximately 0.10. Similarly, it can be seen in Figures 4 and 5 that for β 2 > 0 the curves are approximately 0.05-0.10 lower than the corresponding curves in Figures 2 and 3. The curves in Figure 4 for E = 327 are sensitive to where the second predictive regime is located. However, it can be seen in Figure 5 that, as in Figure 3, extending the monitoring period to E = 361 reduces the sensitivity of the curves to the exact location of the predictive regime.

Additional simulations
The first set of additional simulations studies the detection power of the MAX and SEQ procedures as a function of m 1 (the length of the predictability regime in the DGP), employing the same DGP used in the main experiments and assuming the other parameters are fixed at their original values. The results are graphed in Figure S1 for E = 327 and in Figure S2 for E = 361. Increases in m 1 from a low value initially lead to an increase in detection power. For larger values of m 1 the power curves flatten out. This occurs because, as m 1 increases, eventually the end of the predictability regime in the DGP lies beyond the end of the monitoring period, which in these experiments is assumed to be fixed. When monitoring ends at E = 327 the point at which the power curves flatten out occurs earlier as we move from start dates (a) to (e), because the value of m 1 such that the end of the predictive regime lies beyond the end of the monitoring period E gets smaller. For the longer monitoring period E = 361 there is very little difference in detection power for the different start dates.
We also carried out an extensive set of robustness checks for MAX and SEQ. The first checks concern the error terms in the DGP. An attractive feature of our monitoring procedure, as Proposition 1 shows, is that for sufficiently large T, in addition to being robust to any degree of contemporaneous correlation of the error terms in the DGP, it is also robust to conditional heteroskedasticity and non-Gaussianity in the errors. To investigate how well these robustness properties hold in finite samples, we repeated the first set of main simulation experiments discussed above using the same DGPs but with a range of error distributions and heteroskedasticity patterns for ϵ y,t in Equation 1: (i) t(10) error terms; (ii) t(5) error terms; (iii) normally distributed GARCH(1,1) error terms with conditional variance σ² y,t = α 0 + α 1 ϵ² y,t−1 + β 1 σ² y,t−1, where α 0 = 0.10, α 1 = 0.10 and β 1 = 0.80; and (iv) t(5) GARCH(1,1) error terms with the same GARCH parameters. Although not formally allowed under the conditions of Proposition 1, we also considered: (v) t(5) error terms with an unconditional volatility shift from σ y = 1 to σ y = 2 halfway through the monitoring period (at t = T* + m + ⌊(E − T* − m)/2⌋ + 1). Reassuringly, the results, which are graphed in Figures S3–S7, are very similar to the first set of main simulation results reported in Figure 2. As discussed in Remark 3, the AR(1) specification for the predictor in Equations 1–3 is not critical for our analysis and, for large T, both the MAX and SEQ procedures remain valid for higher order autoregressive predictors. To investigate this issue in finite samples we report the results from repeating the first set of main simulation experiments given in Figure 2, but using an AR(2) predictor rather than an AR(1); that is, replacing the AR(1) process in Equation 3 by s x,t = ρ 1 s x,t−1 + ρ 2 s x,t−2 + ϵ x,t, t = 1, …, T, setting ρ 1 = 0.595 and allowing ρ 2 = {0.30, 0.40}.
The results are given in Figure S8, and again they are very similar to the main simulation results reported in Figure 2.
As a final robustness check, we investigated the detection power of MAX and SEQ when the regime change in Equations 1–3 is gradual rather than discrete. Specifically, we used the DGP for the first set of main simulation experiments but redefined the dummy variable d t (e 1, m 1) to be the exponential function d t (e 1, m 1) := exp(−γ(t − s)²), which allows for smooth regime change centered around s, where γ controls the speed of the change. We set s = e 1 − 0.50m 1 + 1 and γ = 0.01, so that the main part of the regime change for cases (a)–(e) starts at approximately the same point as in Figure 2 and lasts for approximately 30 observations. The results are given in Figure S9 and show that, as β 1 increases, both MAX and SEQ have good detection power for this form of regime change. Generally, the rate of increase in power with increases in β 1 is slower than in Figure 2 and the curves are slower to flatten out; this occurs because increases in β 1 are effectively weighted by a factor less than one for most of the predictability regime, and hence the β 1 that maximizes power (assuming the other parameters in the DGP are fixed) is higher than for the results in Figure 2.
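The smooth regime-change weight just described is straightforward to compute; a small Python sketch (function name ours) with the parameter values used above:

```python
import math

def smooth_dummy(t, e1, m1, gamma=0.01):
    """Smooth regime-change weight exp(-gamma * (t - s)^2), centered at
    s = e1 - 0.5 * m1 + 1, replacing the discrete dummy d_t(e1, m1)."""
    s = e1 - 0.50 * m1 + 1
    return math.exp(-gamma * (t - s) ** 2)
```

The weight equals 1 at t = s, decays symmetrically on either side, and with γ = 0.01 is negligible roughly 30 observations away from s.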

Data and preliminary analysis
The data set used for the empirical application of our monitoring procedure consists of monthly observations on the equity premium for the S&P Composite index, calculated using CRSP's month-end values, and on 20 different predictors for the period 1974:12-2015:12 (T = 493). We define the equity premium as in Welch and Goyal (2008) and Neely et al. (2014) as the log return on the value-weighted CRSP stock market index minus the log return on the risk-free Treasury bill: y_t = log(1 + R_{m,t}) − log(1 + R_{f,t}), where R_{m,t} is the CRSP return and R_{f,t} is the Treasury bill return. Ten of the predictors are traditional macroeconomic and financial variables (MFVs) and 10 are binary technical analysis indicators (TAIs), also used by Neely et al. (2014) in their analysis of equity premium predictability. Some of the traditional MFVs are in log form (as in Welch & Goyal, 2008; Neely et al., 2014) and each of the predictors is lagged one period. We consider the log dividend yield (dy_{t−1}), the log dividend-price ratio (dp_{t−1}), log earnings-price ratio (ep_{t−1}), book-to-market ratio (bm_{t−1}), short-term yield (st_{t−1}), long-term yield (lt_{t−1}), long-term-short-term yield spread (sp_{t−1} = lt_{t−1} − st_{t−1}), BAA-AAA corporate bond yield spread (dsp_{t−1}), net equity expansion (ntis_{t−1}), and inflation (inf_{t−1}). The TAIs used are four moving average indicators (MAIs), two momentum indicators (MOIs), and four on-balance volume (OBV) indicators. The four moving-average rule indicators (MAI_{s,l,t}) are defined such that MAI_{s,l,t} := 1 if MA_{s,t} ≥ MA_{l,t}, indicating a buy signal, and are defined to be zero otherwise, where MA_{j,t} := (1/j) Σ_{i=0}^{j−1} P_{t−i} for j = {s, l}, with s = {1, 2} and l = {9, 12}, and where P_t is the level of the S&P Composite index. The two l-period momentum rule indicators (MOI_{l,t}) are defined such that MOI_{l,t} := 1 if P_t ≥ P_{t−l}, indicating a buy signal, and are defined to be zero otherwise, where l = {9, 12}.
The four on-balance volume rule indicators (OBV_{s,l,t}) are defined such that OBV_{s,l,t} := 1 if MA^{OBV}_{s,t} ≥ MA^{OBV}_{l,t}, indicating a buy signal, and are defined to be zero otherwise, where MA^{OBV}_{j,t} := (1/j) Σ_{i=0}^{j−1} obv_{t−i} for j = {s, l}, with s = {1, 2} and l = {9, 12}, and obv_t := Σ_{k=1}^{t} VOL_k D_k, where VOL_k is trading volume for the S&P Composite index in period k and D_k is a binary variable such that D_k := 1 if P_k ≥ P_{k−1} and D_k := −1 otherwise.
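The indicator definitions above translate directly into code. The sketch below is a hypothetical implementation of ours, not the authors' code; in particular, the convention D_1 := 1 for the first OBV observation is our assumption.

```python
import numpy as np

def mai(P, s, l):
    """Moving-average indicator: 1 if MA_{s,t} >= MA_{l,t} (buy signal), else 0."""
    P = np.asarray(P, float)
    def ma(j, t):
        return P[t - j + 1 : t + 1].mean()   # MA_{j,t} = (1/j) * sum of last j prices
    return np.array([1 if ma(s, t) >= ma(l, t) else 0 for t in range(l - 1, len(P))])

def moi(P, l):
    """Momentum indicator: 1 if P_t >= P_{t-l} (buy signal), else 0."""
    P = np.asarray(P, float)
    return (P[l:] >= P[:-l]).astype(int)

def obv_indicator(P, VOL, s, l):
    """On-balance-volume indicator: 1 if MA^{OBV}_{s,t} >= MA^{OBV}_{l,t}, else 0."""
    P, VOL = np.asarray(P, float), np.asarray(VOL, float)
    D = np.where(P[1:] >= P[:-1], 1.0, -1.0)   # D_k = 1 if P_k >= P_{k-1}, else -1
    D = np.concatenate(([1.0], D))             # D_1 := 1 (our convention)
    obv = np.cumsum(VOL * D)                   # obv_t = sum_{k<=t} VOL_k * D_k
    def ma(j, t):
        return obv[t - j + 1 : t + 1].mean()
    return np.array([1 if ma(s, t) >= ma(l, t) else 0 for t in range(l - 1, len(P))])
```

For a strictly rising price series all three rules produce buy signals throughout, as the definitions imply.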
The data used to construct the equity premium and the predictors are taken from the updated monthly data set on Amit Goyal's website (http://www.hec.unil.ch/agoyal/), which is an extended version of the data set used by Welch and Goyal (2008). A full list of the predictors is given in Table S1 of the Supporting Information Appendix.
We begin with a preliminary analysis using some popular orthodox methods for detecting predictability. Table S2 in the Supporting Information Appendix reports, for each predictor variable considered, the estimated slope parameter (β), a right-tailed Newey-West t-test of significance (t_NW), and the standard and adjusted R² values for orthodox bivariate regression models applied to the full sample of data using OLS for parameter estimation. For both the MFVs and the TAIs, consistent with many of the previous empirical studies discussed in Section 1, very little evidence of predictability is provided by the t_NW tests run at conventional significance levels, and in all cases the R² values are under 1%. It is important to recognize that, although popular in studies of equity premium predictability, orthodox t-tests (including t_NW) can be misleading in this case because of the highly persistent lagged regressors used (see again the discussion in Section 1); therefore, Table S2 also reports the IV_comb test of Breitung and Demetrescu (2015). This statistic has a standard normal asymptotic null distribution, such that the test is valid irrespective of the persistence of the predictor and any heteroskedasticity present in the errors. As discussed in Remark 4 of Breitung and Demetrescu (2015, p. 364), the IV_comb test can only be validly implemented as a two-tailed test. For the MFVs there is no statistically significant evidence of predictability from IV_comb at conventional significance levels, and there is only a single rejection at the 0.10 significance level for the TAIs.¹⁰

Recall that in outlining our monitoring procedure in Section 3 we assumed, in generating the empirical critical value cv_π, that there was no predictability over the training periods. To assess how this assumption sits with our data sets, we apply the same methods used for obtaining the full-sample results in Table S2 to the training periods employed in the monitoring application below.
Although we present the results for all of the methods used in Table 2, to assess the presence of predictability in these training periods we focus on the IV_comb test. For the monitoring application below, our initial choice of training periods is 12/74-10/98 (for m = 15), 12/74-07/97 (for m = 30), and 12/74-01/95 (for m = 60). These are the implied training periods given by T* = 302 − m, where observation t = 302 is the date at which monitoring starts in the application below, 01/00. If there is statistically significant evidence of predictability for an initial choice of training period, but this is thought to be due to a period of predictability towards the end of that training period, then we recommend ending the training period at an earlier date so as to reduce the likelihood that it contains predictability. The final training periods employed in the monitoring application may therefore end earlier than the initial choices; see the discussion in Section 3.4.¹¹

Our preliminary analysis of the data over the implied training periods reveals that for the two interest rate series st_{t−1} and lt_{t−1}, and for the bond yield spread dsp_{t−1}, there is statistically significant evidence of predictability at conventional significance levels from IV_comb for one or more values of m. However, the rejections obtained do not appear to be driven by predictability at the end of these implied samples. Therefore, in the monitoring application below we continue to use the implied training periods for these three predictors despite the rejections from IV_comb. Statistically significant evidence of predictability from IV_comb is also obtained for ntis_{t−1}, for all values of m. In this case, we find that predictability is concentrated in the data from 01/92 through to the end of the training periods. Hence, for this predictor and for all values of m, we end the relevant training periods at 12/91 in the monitoring application below.
For all of the other MFV and TAI predictors, no statistically significant evidence of predictability is found from IV_comb using the implied training periods. The full set of results from the preliminary analysis of the data over the training periods (using the adjusted training period for ntis_{t−1}) is given in Tables S3 and S4 in the Supporting Information Appendix for the MFVs and TAIs, respectively.
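The orthodox bivariate regressions above can be sketched as follows: an OLS slope estimate with a Bartlett-kernel (Newey-West) HAC t-statistic. This is our own minimal implementation for illustration, not the authors' code; the function name and the lag truncation choice are ours.

```python
import numpy as np

def nw_tstat(y, x, L):
    """t-statistic for the slope in y_t = a + b*x_t + e_t with a
    Newey-West (Bartlett-kernel) HAC variance using L lags."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    T = len(y)
    X = np.column_stack([np.ones(T), x])
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    u = y - X @ beta                     # OLS residuals
    Xu = X * u[:, None]                  # score contributions x_t * u_t
    S = Xu.T @ Xu                        # Gamma_0
    for j in range(1, L + 1):
        w = 1.0 - j / (L + 1.0)          # Bartlett weights
        G = Xu[j:].T @ Xu[:-j]           # Gamma_j
        S += w * (G + G.T)
    V = XtX_inv @ S @ XtX_inv            # HAC sandwich variance
    return beta[1] / np.sqrt(V[1, 1])
```

In a predictive regression application, y would be the equity premium and x the lagged predictor; as the text stresses, this statistic can be misleading when x is highly persistent, which is what motivates the IV_comb test.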

Monitoring results
We assume that a practitioner applies our MAX and SEQ procedures to monitor for the emergence of predictive regimes from 01/00 (so in all cases T* + m = 302). Results are presented assuming that monitoring continues through to the final data observation: 12/15. In real-world applications it is not envisaged that our procedures would be used for continuous monitoring over anything like such a long period, but it is helpful to present the results through to 12/15 to illustrate the relationship between the length of the monitoring period and the FPR. Results are computed for m = {15, 30, 60}. For the SEQ procedure we have computed results for both 0.10 and 0.05 level estimated critical values; that is, cv_π for π = {0.10, 0.05}, but we concentrate here on the results for π = 0.10. The results for π = 0.05 are given in Tables S5 and S6 of the Supporting Information Appendix.

Table 1 reports the number of predictive regimes detected by MAX and SEQ (with π = 0.10), respectively. For each predictor where one or more predictive regimes are detected, Table 2 reports the date at which the first regime is detected and the associated empirical FPR for both MAX and SEQ (using cv_{0.10}). Note that the TAI predictors are 0-1 dummy variables that will often take the same value for several consecutive observations, and consequently the subsample τ_{e,m} values can be undefined when the TAI does not change over the subsample. If τ_{e,m} is undefined during the monitoring period, it simply means that the test statistic is uninformative about the presence of predictability at the relevant observation, but the τ_{e,m} values that are defined can still be used for monitoring. However, a large number of undefined test statistics in the training period could have a detrimental impact on the finite-sample performance of the procedure. For completeness, the results for m = {15, 30} are reported in these tables, although for some of the TAIs undefined test statistics occur quite frequently over the training period with these values of m. In practice, we recommend using m ≥ 60 when using our procedure with these particular TAIs to minimize the number of undefined test statistics over the training period. Alternatively, for a given value of m, reducing the value of l when constructing the TAIs will result in fewer undefined test statistics.

¹⁰ Financial theory suggests negative predictive power for st_{t−1}, lt_{t−1}, ntis_{t−1}, and inf_{t−1}. We therefore multiply each of these predictors by −1 so that a right-sided test (excepting the IV_comb test which, as discussed above, is implemented as a two-tailed test) is appropriate for detecting predictability. See footnote 5 for further details.

¹¹ If predictability is present during the training period, as the simulations in Section 4 demonstrate, our procedure can still be useful for detecting positive predictability over the monitoring period. Note that if negative predictability exists over the training period and a predictability regime change is detected using the upper-tailed version of our procedure, we cannot conclude that the change is to a period of positive predictability without further analysis, because it could be due to a change to a period of no, or less negative, predictability.
In the application here we report results for l = {9, 12} to be consistent with the regression-based analysis of TAIs in Neely et al. (2014), even though for some of the MOIs and OBV indicators with m = 60 and these values of l, τ_{e,m} is occasionally undefined over the training and/or monitoring period. For the MAIs with l = {9, 12} and m = 60 there are no undefined test statistics.
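The MAX detection rule, including the treatment of undefined τ_{e,m} values just described, can be sketched as follows (our own illustration: NaN encodes an undefined statistic, and the function and variable names are ours).

```python
import numpy as np

def max_detect(tau_train, tau_monitor):
    """MAX rule sketch: flag a predictive regime at the first monitoring date
    whose subsample statistic exceeds the largest statistic observed over the
    training period. Undefined statistics (e.g., a TAI that is constant over
    the subsample) enter as NaN and are simply skipped."""
    cv = np.nanmax(tau_train)              # extremal training-period value
    for i, tau in enumerate(tau_monitor):
        if not np.isnan(tau) and tau > cv:
            return i                       # index of the first detection
    return None                            # no predictive regime detected
```

For a lower-tailed or two-tailed version, the extremal training value would instead be the most negative statistic or the largest absolute statistic, as appropriate to the alternative being tested.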
It can be seen from Table 1 that, using the MAX procedure, one or more predictive regimes are detected for several of the 10 MFVs, including inf_{t−1}. Note that the total number of MFVs found to have predictive power is lower for m = 15 than for the larger values of m considered. In total, one or more predictive regimes are detected for all 10 of the TAIs considered, and the number of TAIs found to have predictive power increases with m: from five for m = 15, to six for m = 30, and nine for m = 60. Consider now the results from using the SEQ procedure, also given in Table 1. In total, employing the three subsample sizes m = {15, 30, 60}, one or more predictive regimes are detected by SEQ for seven of the 10 MFVs, and for all of the TAIs. Note that, in contrast to MAX, the total number of MFVs found to have predictive power is largest for m = 15: predictive regimes are detected for seven MFVs when m = 15, two when m = 30, and five when m = 60. The total number of TAIs found to have predictive power increases with m, from two for m = 15, to nine for m = 30, and 10 for m = 60. Our results from both MAX and SEQ are, in general, consistent with the findings in Neely et al. (2014) that stronger evidence of predictability is found for the TAIs than for the MFVs.
It can be seen in Table 2 that for many of the MFV and TAI predictors a predictive regime is first detected around the time of the dot-com bubble/crash in the late 1990s/early 2000s, or the global financial crisis in 2008-2009. Table 2 also shows that, as might be expected, for some of the predictors our procedures detect a predictability regime around the same date and, in some cases, in the same month. Consider, for example, the results using MAX with m = 30. For both dy_{t−1} and dp_{t−1}, predictability is first detected in 02/01; for dsp_{t−1} and ntis_{t−1}, in both cases predictability is first detected in 08/11. The dating procedures discussed in Section 3.3 also provide useful information on the location of the regimes. As an example, Figure 6 graphs τ_{e,m} along with the weak set of dates obtained using MAX with m = 30 for the dividend-price ratio dp_{t−1} as a predictor (note that the strong set of dates is empty in this case). Figures 7 and 8 present the corresponding graphs using MAX with m = 30 for the short-term and long-term interest rates, st_{t−1} and lt_{t−1}. A selection of graphical results for the other predictors for which at least one predictive regime is signaled by either the MAX or SEQ procedures is provided in the Supporting Information Appendix in Figures S10-S13. For presentational purposes, in these graphs we do not display τ_{e,m} over the entire training period and instead start the horizontal axis 5 years before the end of each training period.
FIGURE 6: dp_{t−1}, MAX procedure, m = 30: τ_{e,m}.

Also indicated on these graphs are the end of the training period T*, the date when monitoring starts T* + m, the largest τ_{e,m} in the training period (max_{e∈[m+1,T*]} τ_{e,m}), the date of the first significant rejection for the ith predictive regime, j_i (for the MAX procedure, this is the date at which the ith predictive regime is detected), and the FPR, based on Equation 9, as a function of E. Figure 6 shows that for dp_{t−1} the MAX procedure with m = 30 detects a single predictive regime in 02/01 and the weak set of dates covers the period 09/98-03/01. Thus our results suggest that dp_{t−1} had predictive power for equity returns during the latter years of the dot-com bubble period. Note that the weak set of dates starts before the monitoring period, which can happen for early rejections because the rejection itself is indexed on the end date of the subsample window. For st_{t−1}, the MAX procedure with m = 30 detects one predictive regime, and Figure 7 shows that the weak set of dates covers the period 10/08-03/11. However, for lt_{t−1} the MAX procedure with m = 30 detects three predictive regimes. Figure 8 shows that in this case the weak set of dates covers the periods 11/00-11/04, 03/11-08/13, and 05/11-11/13. Therefore, the weak sets of dates associated with the second and third regimes overlap, suggesting a single period of predictability that begins in 03/11 and ends in 11/13. Figure 8 shows that the first regime detected by MAX for lt_{t−1} in 04/03 follows a gradual increase in τ_{e,m} that began after the dot-com crash and continued through to late 2004. Over this period US interest rates gradually fell and equity markets recovered after the dot-com crash and 2001 recession. Our results suggest that the long-term interest rate had predictive power over this period but the short-term interest rate did not.
The second and third regimes for lt_{t−1} are shorter in duration than the first and are largely driven by a rapid and short-lived increase in τ_{e,m} during 2013. Neely et al. (2014) investigated differences in predictability between macroeconomic recession and expansion periods by computing separate R² statistics for predictive regression models, using the NBER indicator of recessions and expansions to partition the relevant data. They found that for both the MFVs and TAIs predictability was substantially higher over recessions than over expansions. In the light of these findings, it is interesting to compare the subsample τ_{e,m} values over the monitoring period with the NBER indicator to see if our procedure finds a similar pattern of support for predictability over the business cycle. Hence the NBER indicator is also plotted in Figures 6-8. These graphs show that for dp_{t−1} predictability peaks at the start of the 2001 recession but declines during the course of the recession; for st_{t−1} and lt_{t−1} the predictive regimes detected do not appear to be correlated with the business cycle. As shown in the Supporting Information Appendix, for the other predictors, while there is some evidence suggesting that, consistent with the findings in Neely et al. (2014), predictability is stronger during recessions than during expansions, it is not a pattern obtained for all of the predictors.¹²

It is interesting to relate our results to recent research by Farmer et al. (2019), who also focused on detecting short pockets of in-sample predictability in US equity returns. While the sample sizes and the number of predictors they analyzed differ from ours, there are some similarities between their results and ours. For example, for the dividend yield, Farmer et al. found evidence of pockets of predictability in the early 2000s and the early/mid 2010s, and for the Treasury bill rate in the late 2000s and the early/mid 2010s.
These dates are similar to the predictive regime dates obtained for these predictors using our MAX procedure.
The predictive regimes in Figures 6-8 often end quite shortly after each regime is first detected (e.g., in Figure 7, the weak set of dates ends immediately after the regime is detected). Indeed, this general pattern was observed for all of the MFVs and for the majority of the TAIs. Hence the strong set of dates for most of the predictors is empty. This suggests that, although investors using our procedure in real time would have been able to detect predictability in these cases, there may have been very little time after the point of detection to exploit the predictability before it no longer existed. To investigate this point using traditional forecasting methods, for each MFV predictor where one or more predictive regimes are detected by MAX and/or SEQ with m = 30 we computed out-of-sample forecasts exploiting the information from the monitoring procedures. Specifically, for each of these predictors we move forward through the monitoring period 1 month at a time, computing MAX and SEQ at each month along with one-step-ahead forecasts. To compute the forecasts we use a fixed mean benchmark model estimated using an expanding sample of data that starts at the first observation, until the relevant monitoring procedure detects a first predictive regime. When this occurs we use the relevant regression model to compute the forecast for the next month, estimated using an expanding sample of data that starts at the weak start date for the relevant predictive regime. When the first predictive regime ends, we stop forecasting. We compared the forecasts computed in this way with the forecasts obtained using the fixed mean benchmark model for the whole forecasting period. 
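The forecast-switching scheme just described can be sketched as follows. This is a stylized illustration of ours with hypothetical index arguments, not the authors' code: the benchmark is the expanding-sample mean, and once a regime is detected the one-step-ahead forecast comes from the predictive regression estimated on data from the weak start date onwards.

```python
import numpy as np

def regime_forecasts(y, x, detect_at, weak_start, regime_end):
    """One-step-ahead forecasts of y_t over t = 1, ..., regime_end - 1.
    Before detection (t < detect_at): expanding-sample mean of y (benchmark).
    From detection onwards: regression of y_s on x_{s-1}, estimated on an
    expanding sample starting at the weak start date of the regime."""
    fcst = []
    for t in range(1, regime_end):
        if t < detect_at:
            fcst.append(np.mean(y[:t]))                      # benchmark forecast
        else:
            ys = y[weak_start:t]                             # y_s, s = weak_start..t-1
            xs = x[weak_start - 1 : t - 1]                   # matching lagged predictor
            b, a = np.polyfit(xs, ys, 1)                     # slope, intercept
            fcst.append(a + b * x[t - 1])                    # forecast of y_t
    return np.array(fcst)
```

When the first detected regime ends, forecasting stops, and the resulting forecasts can be compared with those from the benchmark model over the whole forecasting period, as in the text.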
The mean squared forecast error (MSFE) for each procedure, along with the Diebold and Mariano (1995) test of equal forecasting accuracy (employing the bias correction of Harvey, Leybourne, and Newbold, 1997, and Student's t critical values), and the out-of-sample R² value for the procedure are reported in Table S7 in the Supporting Information Appendix. As expected, because the predictive regimes end so quickly after they are discovered, for the majority of predictors there is very little difference between the MSFE obtained exploiting our MAX and SEQ procedures in this way and the MSFE for the benchmark model. In some cases the MSFE using our test in this way is lower than that of the benchmark model, but the differences are not statistically significant. Paye and Timmermann (2006) and Timmermann (2008) argue that if predictability reflects market inefficiencies then it is only ever likely to be a short-lived phenomenon because, when it exists, investors will quickly allocate capital to exploit its presence. Our finding of short pockets of predictability that end quickly after being detected is entirely consistent with this view.
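The forecast comparison can be illustrated with a minimal sketch of the Diebold-Mariano statistic under squared-error loss, with the Harvey-Leybourne-Newbold small-sample correction. This is our own implementation for illustration; the resulting statistic is compared with Student's t critical values with T − 1 degrees of freedom.

```python
import numpy as np

def dm_hln(e_bench, e_model, h=1):
    """Diebold-Mariano statistic for equal MSFE with the Harvey-Leybourne-
    Newbold correction, for h-step-ahead forecast errors. A positive value
    indicates that the model beats the benchmark under squared-error loss."""
    e_bench, e_model = np.asarray(e_bench, float), np.asarray(e_model, float)
    d = e_bench**2 - e_model**2            # loss differential
    T = len(d)
    dbar = d.mean()
    # long-run variance of dbar: truncation lag h - 1 (rectangular kernel);
    # for one-step-ahead forecasts (h = 1) this is just the sample variance
    var = np.mean((d - dbar) ** 2)
    for j in range(1, h):
        var += 2.0 * np.mean((d[j:] - dbar) * (d[:-j] - dbar))
    dm = dbar / np.sqrt(var / T)
    k = np.sqrt((T + 1 - 2 * h + h * (h - 1) / T) / T)   # HLN correction factor
    return k * dm
```

For the one-step-ahead case used in the application, the correction factor reduces to √((T − 1)/T).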

CONCLUSIONS
We have developed new real-time monitoring procedures for detecting the emergence of predictive regimes. Our detection procedures are based on the sequential application of standard heteroskedasticity-robust (predictive) regression t-statistics for predictability to end-of-sample data. We have suggested two possible detection rules, both of which are designed to be robust to both the degree of persistence and endogeneity of the regressors in the predictive regression, and both of which allow their false positive rates to be controlled, for a given monitoring period length, using information obtained from data in a training period. We have applied our proposed monitoring procedures to investigate the presence of regime changes in the predictability of the US equity premium at the 1-month horizon by traditional macroeconomic and financial variables, and by binary technical analysis indicators. Our results suggest that the 1-month-ahead equity premium has displayed episodes of temporary predictability and that these episodes could have been detected in real time by practitioners using our proposed methodology.