Viewpoint: Boosting Recessions

Authors


  • I would like to thank the organizers for the opportunity to present this paper as a State of the Art Lecture at the 2013 Canadian Economics Association Meeting in Montreal, Quebec. Financial support from the National Science Foundation (SES-0962431) is gratefully acknowledged. E-mail: serena.ng@columbia.edu

Abstract

This paper explores the effectiveness of boosting, often regarded as the state of the art classification tool, in giving warning signals of recessions 3, 6, and 12 months ahead. Boosting is used to screen as many as 1,500 potentially relevant predictors consisting of 132 real and financial time series and their lags. Estimation over the full sample 1961:1–2011:12 finds that there are fewer than 10 important predictors and that the identity of these variables changes with the forecast horizon. There is a distinct difference in the size and composition of the relevant predictor set before and after mid-1980. Rolling window estimation reveals that the importance of the term and default spreads is recession specific. The Aaa spread is the most robust predictor of recessions 3 and 6 months ahead, while the risky bond and 5-year spreads are important for 12-month-ahead predictions. Certain employment variables have predictive power for the two most recent recessions, when the interest rate spreads were uninformative. Warning signals for the post-1990 recessions have been sporadic and easy to miss. The results underscore the challenge that the changing characteristics of business cycles pose for predicting recessions.


1. Introduction

The central theme of business cycle analysis is to study the reasons why the economy goes through periods of contractions and expansions. In order to do so, we need to document features in the historical data during these two phases of economic activity. This involves determining the dates when recessions began and ended or, in other words, establishing the business cycle chronology. In the United States, this task falls to the NBER Business Cycle Dating Committee (Committee 2008), while the Center for Economic Policy Research (CEPR) has taken up this responsibility for the euro area since 2002. In Canada, the Business Cycle Council of the C.D. Howe Institute not only dates but also grades the severity of each recession. Cross and Bergevin (2012) find that at least for Canada, the 1929 recession was the only one in a century of data deemed to be a category five.

Recessions are understood at a general level to be periods of significant, persistent, and pervasive declines of economic activity, while expansions are periods of prosperity. However, there is no objective measure of economic activity, nor are the notions of pervasiveness and persistence universally defined. Wikipedia cites two consecutive quarters of negative GDP growth, or a 1.5% rise in unemployment within 12 months, as possible definitions of a recession. The U.K. Treasury simply calls a recession when there are two or more consecutive quarters of contraction in GDP. A U.S. recession is officially defined to be the period between a peak and a trough, and an expansion is the period between a trough and a peak. The turning points are then determined by considering monthly industrial production, employment, real income, as well as manufacturing and wholesale-retail sales. The rationale for not focusing on GDP data is that the monthly estimates tend to be noisy, and the quarterly data can be subject to large revisions. Indeed, when the NBER announced that the U.S. was in a recession at the start of 2008, quarterly GDP growth was still positive.

These recession announcements are important signals about the state of the economy and tend to receive a good deal of public attention. However, the committees do not have explicit models or formulas for how they arrive at the dates, and furthermore, the announcements are made retroactively. For example, the NBER announced in December 2008 that economic activity had peaked and in September 2010 that it had bottomed out, in each case roughly a year or more after activity actually peaked (in December 2007) and bottomed (in June 2009). This has spawned a good deal of interest in providing a formal analysis of the business cycle chronology in the hope that a better understanding of the past would enable better predictions of future recessions, even though such events cannot be totally avoided. But three features make the exercise challenging. First, the true duration and turning points of business cycles remain unknown even after the fact. A model could be seen to give a wrong classification relative to the announced dates, but such false positives could be valuable signals of what lies ahead. As such, there is no unique criterion to validate the model predictions. This issue is especially relevant for the U.S., since its reference cycle is not based on any observed variable per se. Second, recessions are time dependent, not single-period events, and there are far fewer recessions than non-recession periods. In the U.S., only 15% of the observations between 1961 and 2012 are deemed to be recession months, which can affect our ability to identify the recessions from the data. Third, while the committees officially look at a small number of variables, it is almost surely the case that many other series are unofficially monitored. A researcher typically pre-selects a few predictors for analysis. Omitting relevant information is a distinct possibility.

But what does an econometrician with lots of data at his disposal have to offer to policy makers on the issue of which variables to monitor? The question is useful even if the answer is “not much,” because we would then know that the information has been used exhaustively. With this in mind, this paper considers the usefulness of a methodology known as boosting in giving warning signals of recessions and, in so doing, identifies the predictors of recessions in the U.S. over the sample 1961:1 to 2011:12. Boosting is an ensemble scheme that combines models that do not perform particularly well individually into one with much improved properties. It was originally developed as a classification rule to determine, for example, whether a message is spam or whether a tumour is cancerous given gene expression data. Subsequent analysis shows that boosting algorithms are useful beyond precise classification. The two features of boosting algorithms that drew my attention are their abilities to perform estimation and variable selection simultaneously, and to entertain a large number of predictors. If N is the number of potential predictors and T is the number of time series observations, boosting allows N to be larger than T.

Boosting is applied in this paper to the problem of predicting recessions. In line with the ensemble nature of boosting, the recession probability estimates are based on a collection of logit models. In my application, each model has only one predictor. This is unlike standard logit models that put all predictors into a single model. The application to recession dates creates two interesting problems, both relating to the dependent nature of the data. The first arises from the fact that some variables lead, some lag, while others move concurrently with the reference cycle. A predictor may be useful at one lag and not at another. The second problem is that parameter instability is a generic feature of economic time series, and the relevant predictor set may well evolve over time. I adapt existing boosting algorithms to accommodate these problems.

The analysis aims to shed light on three problems. The first is to identify which variables and at which lags are informative about recessions. The second is to understand if predictors are recession and horizon specific. The third is to learn more about the characteristics of recent recessions. I find that a handful of variables are systematically important predictors over the 50-year period, but their relative importance has changed over time. While the model provides warning signals for the post-1990 recessions, the signals especially of the 2008 recession are sporadic and easy to miss.

The rest of the paper proceeds as follows. Section 'Related Work' begins with a review of existing work on business cycle dating. Section 'Boosting' then turns to Adaboost – the algorithm that initiated a huge literature in machine learning – before turning to recent boosting algorithms that can be justified on statistical grounds. The empirical analysis is presented in Section 'Application to Macroeconomic Data'. Boosting is far from perfect for analyzing recessions, and the paper concludes with suggestions for future work.

2. Related Work

Academic research on business cycle chronology takes one of two routes: fit existing models using better predictors, or find better models taking a small number of predictors as given. This section gives a brief overview of this work. A more complete survey can be found in Marcellino (2006), Hamilton (2011), and Stock and Watson (2010b).

Let $y^*_t$ be the latent state of the economy. We observe $y_t = 1$ (as determined by the NBER, for example) only if period t is in a recession and zero otherwise. That is,

$$y_t = 1(y^*_t < c),$$

where $c$ is an unknown threshold. As $y^*_t$ is not observed, it seems natural to replace it by observed indicators $x_{t-h}$, allowing for the relation between $y_t$ and x to be phase shifted by h periods. A model for recession occurrence would then be

$$P(y_t = 1 \mid x_{t-h}) = F(x_{t-h}; \beta),$$

where $F(\cdot)$ maps the indicators into a probability. Once x is chosen, a binomial likelihood can be maximized and the estimated probability that $y_t = 1$ can be used for classification, given some user-specified threshold $\bar{p}$. The simplest approach is to take $x_t$ to be scalar. Popular choices of $x_t$ are GDP and labour market data such as unemployment. These variables are also officially monitored by various dating committees. Lahiri and Yang (2013) provide a review of the literature on forecasting binary outcomes.

An increase in the short rate is indicative of economic slowdown due to monetary policy tightening. Because some recessions in the past are of monetary policy origin, interest rate spreads have been a popular recession predictor. Indeed, recessions tend to be preceded by an inverted yield curve, with short-term rates higher than long-term rates.1 The difference between the 10-year and a short-term rate on Treasury bills was used in work by Estrella and Mishkin (1998), Chauvet and Hamilton (2006), Wright (2006), Rudebusch and Williams (2009), among others. Also popular are spreads between corporate bonds of different grades (such as Baa and Aaa bonds) and a risk-free rate. These are considered to be measures of liquidity risk (of selling in thin markets) and credit risk (of default). They are countercyclical, as defaults and bankruptcies are more prevalent during economic downturns. A shortcoming of credit spreads is that the definition of the credit ratings varies over time. On the other hand, credit spreads could better reflect the changing developments in financial markets.

Data provided by the Institute for Supply Management (ISM) are also widely watched indicators of business activity, as documented in Klein and Moore (1991) and Dasgupta and Lahiri (1993). The ISM surveys purchasing managers of 250 companies in 21 industries about new orders, production, employment, deliveries, and inventory. A weighted average of the five components (in decreasing order of importance) is used to construct a purchasing manager index (PMI), which is interpreted as a measure of excess demand. The ISM also produces a NAPM price index measuring the fraction of respondents reporting higher material prices. The index is seen as an inflation indicator. The NAPM data have two distinct advantages: the data are released on the first business day after the end of the month to which they refer, and they are not subject to revisions.

The exercise of predicting recessions would be easy if we had perfect indicators of economic activity, but this, of course, is not the case. As noted earlier, GDP growth was positive when the 2008 financial crisis was in full swing. The NAPM data are limited to manufacturing business activity, which is narrow in scope, especially when manufacturing has been a declining fraction of overall economic activity. The risky spread between commercial paper and Treasury bills worked well prior to 1990 but failed to predict the 1990–91 recession. The yield curve was inverted in August 2006, but as Hamilton (2011) pointed out, this recession signal is at odds with the fact that the level of the three-month rate was at a historical low. Stock and Watson (2001) reviewed evidence of various asset prices and found their predictive power not to be robust. Gilchrist and Zakrajsek (2012) documented that spreads between yields of bonds traded in the secondary market and a risk-free rate are better predictors than standard default spreads, especially in the 2008 recession. The finding is consistent with the view that business cycles are not alike.

An increasing number of studies have gone beyond using a single proxy to incorporate more information by way of diffusion indexes. These are scalar variables constructed as weighted averages of many indicators of economic activity. Examples include the CFNAI (Chicago Fed National Activity Index), comprising 85 monthly indicators, and the ADS index of Aruoba, Diebold, and Scotti (2009), which tracks movements of stocks and flows data at high frequencies. These diffusion indexes are typically based on a dynamic factor model in which the N data series collected into a vector $x_t$ are driven by a common cyclical variable $f_t$ and idiosyncratic shocks. The estimation exercise consists of extracting $f_t$ from $x_t$. Stock and Watson (1989) used data for $N = 4$ series to estimate the parameters of the model by maximum likelihood. A recession is declared if the estimated factor follows a recessionary pattern, such as falling in a set of trajectories designed to mimic the NBER recession dates. This work was re-examined in Stock and Watson (2010a) using different ways to estimate the coincident indexes.

A popular parametric framework for analyzing the two phases of business cycles is the Markov switching model of Hamilton (1989). Chauvet (1998) extends the four-variable dynamic factor model of Stock and Watson (1989) to allow for Markov switching properties. Monthly updates of the smoothed recession probabilities are published on the author's website and in the FRED database maintained by the Federal Reserve Bank of St. Louis. Chauvet and Hamilton (2006) take as their starting point that the mean growth rate of GDP is state dependent, being lower during recession than non-recession months. The sole source of serial dependence in GDP growth is attributed to the persistence in the recession indicator, but the regime-specific parameters are constant over time. As in a dynamic factor model, inference about the latent state can be obtained without directly using the NBER recession dates. Chauvet and Hamilton (2006) suggest that a recession be declared when the smoothed probability exceeds 0.65 and that the recession ends when the probability falls below 0.35. Turning points are then the dates when the probabilities cross a threshold.

Of the non-parametric methods, the algorithm of Bry and Boschan (1971), developed some 40 years ago, remains widely used. The algorithm first identifies candidate peaks as observations in the 12-month moving average of a series that are higher than all other observations in a two-sided window of five months. Analogously, troughs are observations that are the lowest within the five-month window. The algorithm then applies censoring rules to narrow the turning points of the reference cycle. In particular, the duration of a complete cycle must be no less than 15 months, while each phase (peak to trough or trough to peak) must be no less than five months. The Bry-Boschan algorithm treats expansions and recessions symmetrically. Moench and Uhlig (2005) modified the algorithm to allow for asymmetries in recessions and expansions. They find that the identified number of recessions is sensitive to the censoring rules on phase length. It is noteworthy that the censoring rules have not changed since 1971, even though the economy has changed in many dimensions.

Various papers have compared the accuracy of the different methods proposed. Chauvet and Piger (2008) used a real time data set with $x_t$ consisting of employment, industrial production, manufacturing and trade sales, and real personal income, as in Stock and Watson (1989). They find that both parametric and non-parametric methods produce a few false positives and can identify NBER troughs almost six to eight months before the NBER announcements, but these methods cannot improve upon the NBER timing in terms of calling the peaks of expansions. Berge and Jorda (2011) compared the adequacy of the components of diffusion indices vis-à-vis the indices themselves. They find that the turning points implied by the diffusion indices are well aligned with the NBER dates. However, the diffusion indices do not predict future turning points well. Within a year, their predictions are no better than a coin toss. Nonetheless, some components of the Conference Board's leading indicator, notably term spreads and new orders for consumer goods, seem to have some information about future recessions.

Most studies in this literature make use of a small number of highly aggregated predictors, even though calling recessions using aggregate data is different from calling recessions by aggregating information from disaggregate series, as originally suggested by Burns and Mitchell (1946). The reliance on aggregates is likely due to the computational difficulty of parametrically modelling a large number of series, and also to the fact that the Bry-Boschan algorithm is designed for univariate analysis. Harding and Pagan (2006) suggest remedying the second problem by identifying the turning points of the reference cycle from individual turning points, but their analysis remains confined to four variables. As for the first problem, Stock and Watson (2010a, 2010b) assumed knowledge that a recession has occurred in an interval in order to focus on the task of dating the peaks and troughs. They considered a 'date and aggregate' method that estimates the mode (or mean/median) from the individual turning point dates. From an analysis of 270 monthly series, they find that, with the exception of four episodes (defined to be the NBER turning point plus or minus twelve months), the results are similar to an 'aggregate and date' approach that looks at turning points of an aggregate time series constructed from the sub-aggregates.

The present analysis centers on identifying the relevant predictor set. In practice, this means screening lots of potential predictors and selecting only those that are actually relevant. My ‘lots of data’ focus is close in spirit to Stock and Watson (2010b), but their task is dating the peaks and troughs conditional on knowing that a recession has occurred in an interval. My interest is in narrowing down the predictors to only those that are relevant. Since I do not estimate the turning points, determining when economic activity peaked and bottomed is outside the scope of my analysis. I consider out of sample predictions without modelling the latent state. In this regard, my analysis is close in spirit to Berge and Jorda (2011). However, I screen a much larger set of potential predictors and allow the predictor set to change over time.

Business cycles have changing characteristics. Compared with the features documented in Zarnowitz (1985), Ng and Wright (2013) find that the business cycle facts in the last two decades have changed again. An important consideration in my boosting analysis is that predictors useful in one recession may not be useful in other recessions. Before turning to the main analysis, the next section presents boosting first as a machine-learning algorithm and then as a non-parametric model-fitting device.

3. Boosting

For $t = 1, \ldots, T$, let $y_t \in \{0, 1\}$ be a binary variable. In the application to follow, "1" indicates month t was in a recession according to the NBER dating committee. It will also be useful to define $\tilde{y}_t = 2y_t - 1 \in \{-1, 1\}$. The objective of the exercise is to fit $y_t$ with a model given N potentially relevant economic variables and eventually use the model for prediction. For now, I simply denote the predictor set at time t by $x_t$, dropping subscripts that indicate the predictors are lagged values.

Consider the population problem of classifying Y given predictors x using a rule $F(x)$ to minimize a loss function, say, $J(y, F(x))$. If y were continuous, the squared loss function $(y - F(x))^2$ would be the obvious choice. But since both $\tilde{y}$ and F are binary indicators taking values in $\{-1, 1\}$, a more appropriate criterion is the classification margin $\tilde{y}F(x)$, which is negative when a wrong prediction is made. It plays the role of residuals in regressions with continuous data. An algorithm that makes important use of the classification margin is Adaboost, due to Freund (1995) and Schapire (1990).

3.1. Discrete Adaboost

  1. Let $w^{(1)}_t = 1/T$ for $t = 1, \ldots, T$ and set $F_0(x) = 0$.
  2. For $m = 1, \ldots, M$:
    1. Find $f_m(x) = f(x; \hat{\theta}_m)$ from the set of candidate models to minimize the weighted error
      $$\mathrm{err}_m = \sum_{t=1}^T w^{(m)}_t \, 1\big(\tilde{y}_t \ne f_m(x_t)\big).$$
    2. If $\mathrm{err}_m < 1/2$, update $F_m(x) = F_{m-1}(x) + \alpha_m f_m(x)$ and
      $$w^{(m+1)}_t = \frac{w^{(m)}_t \exp\big(-\alpha_m \tilde{y}_t f_m(x_t)\big)}{Z_m},$$
      where $\alpha_m = \frac{1}{2}\log\frac{1 - \mathrm{err}_m}{\mathrm{err}_m}$ and $Z_m$ is a normalizing constant.
  3. Return the classifier $\mathrm{sign}\big(F_M(x)\big) = \mathrm{sign}\Big(\sum_{m=1}^M \alpha_m f_m(x)\Big)$.

An example that uses three predictors to classify twelve recession dates is provided in the Appendix.
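
To make the mechanics of the algorithm concrete, below is a minimal NumPy sketch of the discrete Adaboost loop with decision stumps as weak learners. It is not the appendix example: the simulated data, the exhaustive threshold search, and all function names are illustrative assumptions.

```python
import numpy as np

def fit_stump(X, y, w):
    """Return the decision stump (column, threshold, polarity) minimizing the
    weighted misclassification error; y takes values in {-1, +1}."""
    best_err, best = np.inf, None
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j]):
            for s in (1, -1):                      # s = 1: predict +1 when x > c
                pred = np.where(X[:, j] > c, s, -s)
                err = np.sum(w * (pred != y))
                if err < best_err:
                    best_err, best = err, (j, c, s)
    return best_err, best

def adaboost(X, y, M=25):
    """Discrete Adaboost: reweight the observations and accumulate stumps."""
    T = len(y)
    w = np.full(T, 1.0 / T)                        # step 1: equal weights
    ensemble = []
    for _ in range(M):                             # step 2
        err, (j, c, s) = fit_stump(X, y, w)
        if err >= 0.5:                             # no better than a coin toss
            break
        err = max(err, 1e-12)                      # guard against a perfect stump
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(X[:, j] > c, s, -s)
        w *= np.exp(-alpha * y * pred)             # misclassified points get larger weight
        w /= w.sum()                               # normalization (the Z_m factor)
        ensemble.append((alpha, j, c, s))
    return ensemble

def classify(ensemble, X):
    """Step 3: sign of the weighted majority vote."""
    votes = sum(alpha * np.where(X[:, j] > c, s, -s) for alpha, j, c, s in ensemble)
    return np.sign(votes)

# illustrative use on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.where(X[:, 0] + 0.5 * X[:, 2] > 0, 1, -1)
model = adaboost(X, y)
print("in-sample accuracy:", np.mean(classify(model, X) == y))
```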

Adaboost is an algorithm that has its roots in PAC (probably approximately correct) learning theory. Given covariates x and outcome y, a problem is said to be strongly PAC learnable if there exists a classifier (or learner) F such that the misclassification error $\mathrm{err} = P(F(x) \ne y)$ can be made arbitrarily small: for all $\epsilon > 0$ and all $\delta > 0$, the learner achieves $\mathrm{err} \le \epsilon$ with probability at least $1 - \delta$. Now a random guess has a classification error of one-half. An algorithm is weakly learnable if there exists $\gamma > 0$ such that the learner achieves $\mathrm{err} \le \frac{1}{2} - \gamma$ with probability at least $1 - \delta$. Weak learnability thus only requires the classifier to perform slightly better than random guessing. Obviously, strong learnability implies weak learnability. The question is whether weak learnability implies strong learnability. Schapire (1990) showed that the strong and weak learnable classes are identical. This fundamentally important result implies that a weak classifier that performs only slightly better than random guessing can be boosted to be a strong learner. Adaboost is the first of such algorithms.

In the boosting algorithm, the classifier chosen at stage m is the weak learner, while the strong learner is the one that emerges at the final step M. These are denoted $f_m(x)$ and $F_M(x)$, respectively. A weak learner is a function parameterized by θ that maps the features of x into class labels $\{-1, 1\}$. The weak learner could first obtain least squares estimates of θ and assign $f(x) = \mathrm{sign}(x'\hat{\theta})$. It can also be a decision stump that assigns the label of 1 if the condition $x > \theta$ holds. While each $f_m(x)$ provides a classification, the final class label $\mathrm{sign}(F_M(x))$ is determined by the sign of a weighted sum of the $f_m(x)$. Hence it is a weighted majority vote. The classification margin $\tilde{y}F_M(x)$ is a measure of the confidence of the model. The closer it is to 1, the more confidence there is that the final classification is correct, and the closer it is to −1, the more confidence that it is incorrect. A margin that is close to zero indicates little confidence. The parameter M is a stopping rule to prevent overfitting. By suitable choice of M, we can identify which n of the N predictors are useful.

Dettling (2004) and Breiman (1996, 1998) noted that Adaboost is the best off-the-shelf classifier in the world. The crucial aspect of Adaboost is that it adjusts the weight on each observation so that the misclassified observations are weighted more heavily in the next iteration. This can be seen by noting that

$$\exp\big(-\alpha_m \tilde{y}_t f_m(x_t)\big) = \begin{cases} \exp(\alpha_m) & \text{if } \tilde{y}_t \ne f_m(x_t) \\ \exp(-\alpha_m) & \text{if } \tilde{y}_t = f_m(x_t). \end{cases}$$

Thus, the weight on observation t is scaled up by $\exp(\alpha_m)/Z_m$ in iteration $m+1$ if it is misclassified in iteration m. Correspondingly, observations that are correctly classified previously receive smaller weights. The algorithm effectively forces the classifier to focus on training the misclassified observations. One can interpret $\mathrm{err}_m$ as the sample analog of the expected misclassification rate $E_w[1(\tilde{y} \ne f(x))]$, computed with the $w^{(m)}_t$ as weights. The normalizing factor $Z_m$ is chosen so that the $w^{(m+1)}_t$ sum (over t) to one.

Adaboost is presented above as a classification algorithm, but is it associated with a loss function, and what are its optimality properties? Freund and Schapire (1996) showed that boosting can be interpreted as a two-player game in which a learner has to form a random choice of models to make a prediction in each of a sequence of trials, and the goal is to minimize mistakes. The Adaboost solution emerges upon applying the weighted majority algorithm of Littlestone and Warmuth (1994) to the dual of the game. For our purpose, the interesting angle is that Adaboost turns out to minimize a monotone transformation of the zero-one loss function, defined as

$$J_{01}(y, F(x)) = 1\big(\tilde{y}F(x) < 0\big).$$

As $\tilde{y}F(x)$ is negative only when the sign of $\tilde{y}$ does not agree with the classifier $F(x)$, minimizing $E[J_{01}(y, F(x))]$ is the same as minimizing the misclassification rate. The zero-one loss function is neither smooth nor convex.2 Consider the exponential transformation

$$J_{\exp}(y, F(x)) = \exp\big(-\tilde{y}F(x)\big).$$

Because $1(\tilde{y}F(x) < 0) \le \exp(-\tilde{y}F(x))$, the exponential loss is an upper bound for the zero-one loss; notably, driving the exponential loss to zero also drives the zero-one loss to zero. Minimizing $E[\exp(-\tilde{y}F(x)) \mid x]$ with respect to $F(x)$ gives

$$F^*(x) = \frac{1}{2}\log\frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}. \qquad (1)$$

The classifier defined by

$$\hat{y} = \mathrm{sign}\big(F^*(x)\big)$$

coincides with Bayes classification based on the highest posterior probability. Equivalently, y is labelled 1 if the posterior probability exceeds one-half.
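
For completeness, a short derivation of (1). It is a standard calculation, written out here as a self-contained LaTeX fragment; the notation follows the text but the write-up is mine.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
Conditioning on $x$, the exponential risk is
\[
  E\bigl[e^{-\tilde{y}F(x)} \mid x\bigr]
    = P(y=1\mid x)\,e^{-F(x)} + P(y=-1\mid x)\,e^{F(x)} .
\]
The first-order condition with respect to $F(x)$ is
\[
  -P(y=1\mid x)\,e^{-F(x)} + P(y=-1\mid x)\,e^{F(x)} = 0 ,
\]
so that
\[
  F^{*}(x) = \tfrac{1}{2}\log\frac{P(y=1\mid x)}{P(y=-1\mid x)} ,
\]
which is positive exactly when $P(y=1\mid x) > \tfrac{1}{2}$; hence
$\operatorname{sign}(F^{*}(x))$ reproduces the Bayes classifier.
\end{document}
```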

3.2. The Statistical View

This subsection first presents the statistical underpinnings of Adaboost and then considers generic features of boosting. The key link to statistical analysis is an additive logistic model. Recall that a parametric logit model maps the log-odds ratio to the predictors x via a finite dimensional parameter vector β. With $y_t \in \{0, 1\}$ and class probability defined as

$$p(x_t) = P(y_t = 1 \mid x_t) = \frac{\exp(x_t'\beta)}{1 + \exp(x_t'\beta)}, \qquad (2)$$

the log-odds ratio is modelled as

$$\log\frac{p(x_t)}{1 - p(x_t)} = x_t'\beta. \qquad (3)$$

Given T observations, the sample binomial log likelihood is

$$\ell(\beta) = \sum_{t=1}^T \Big[ y_t \log p(x_t) + (1 - y_t)\log\big(1 - p(x_t)\big) \Big].$$

As is well known, β can be estimated by a gradient descent (Newton-Raphson) procedure that iteratively updates $\beta^{\mathrm{new}} = \beta^{\mathrm{old}} - H^{-1}s$ till convergence, where $s$ and $H$ are the first and second derivatives of $\ell(\beta)$ with respect to β. For the logit model, $s = X'(y - p)$ and $H = -X'WX$. Let W be a $T \times T$ diagonal matrix with the tth entry being the weight $w_t = p(x_t)(1 - p(x_t))$. The updating rule can be written as

$$\beta^{\mathrm{new}} = \beta^{\mathrm{old}} + (X'WX)^{-1}X'(y - p).$$

Upon defining $z_t = x_t'\beta^{\mathrm{old}} + \dfrac{y_t - p(x_t)}{w_t}$ as the adjusted response variable, we also have

$$\beta^{\mathrm{new}} = (X'WX)^{-1}X'WZ,$$

where Z stacks the $z_t$. The parameters can be conveniently updated by running a weighted regression of Z on X.
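
A compact NumPy sketch of this iteratively reweighted least squares update may be useful; the simulated design, the tolerance, and the function name are illustrative, not taken from the paper.

```python
import numpy as np

def logit_irls(X, y, tol=1e-8, max_iter=100):
    """Newton-Raphson / IRLS for the logit model: regress the adjusted
    response z on X with weights p(1-p) until beta converges."""
    T, k = X.shape
    beta = np.zeros(k)
    for _ in range(max_iter):
        eta = X @ beta
        p = 1.0 / (1.0 + np.exp(-eta))           # fitted probabilities
        w = p * (1.0 - p)                        # diagonal of W
        z = eta + (y - p) / w                    # adjusted response
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# illustrative use on simulated data
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=(500, 2))])
true_beta = np.array([-1.0, 2.0, -0.5])
y = rng.binomial(1, 1 / (1 + np.exp(-X @ true_beta)))
print(logit_irls(X, y))                          # should be close to true_beta
```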

With the parametric model as the backdrop, consider now a non-parametric analysis that replaces $x_t'\beta$ by a function $F(x_t)$. Define

$$p(x_t) = \frac{e^{F(x_t)}}{e^{F(x_t)} + e^{-F(x_t)}} = \frac{1}{1 + e^{-2F(x_t)}}. \qquad (4)$$

With $\tilde{y}_t = 2y_t - 1$, the sample binomial log likelihood

$$\ell(F) = \sum_{t=1}^T \Big[ y_t \log p(x_t) + (1 - y_t)\log\big(1 - p(x_t)\big) \Big]$$

is maximized at the true value of $p(x_t)$, or equivalently at

$$F^*(x) = \frac{1}{2}\log\frac{P(y = 1 \mid x)}{P(y = -1 \mid x)}.$$

This solution evidently differs from the standard logit one given in (3) by a factor of two. But observe that this is precisely the Adaboost solution given in (1). This interesting result is not immediately obvious because the exponential criterion is itself not a proper likelihood, but merely an approximation to the zero-one loss. Nonetheless, the two objective functions are second-order equivalent, as

$$\exp\big(-\tilde{y}F(x)\big) = 1 - \tilde{y}F(x) + \frac{\big(\tilde{y}F(x)\big)^2}{2!} + \cdots,$$

while

$$\log\Big(1 + \exp\big(-2\tilde{y}F(x)\big)\Big) = \log 2 - \tilde{y}F(x) + \frac{\big(\tilde{y}F(x)\big)^2}{2} + \cdots.$$

In general, Adaboost imposes a larger penalty for mistakes because the negative binomial log likelihood grows only linearly as $\tilde{y}F(x) \rightarrow -\infty$, while the exponential loss grows exponentially.

Having seen that the Adaboost solution also maximizes the binomial likelihood with $p(x_t)$ defined as in (4), we will now use the insight of Breiman (1999) and Friedman (2001) to see Adaboost from the perspective of fitting an additive model. To maximize the expected binomial log likelihood defined for $\tilde{y} \in \{-1, 1\}$ with $p(x)$ defined as in (4), the method of gradient descent suggests updating from the current fit $F_m(x)$ to $F_{m+1}(x)$ according to

$$F_{m+1}(x) = F_m(x) - \frac{s(x)}{H(x)},$$

where $s(x)$ and $H(x)$ are the first and second derivatives of the expected log likelihood with respect to F, evaluated at $F_m(x)$. But under the assumptions of the analysis, $s(x) = 2E[y - p(x) \mid x]$ and $H(x) = -4E[p(x)(1 - p(x)) \mid x]$. The update

$$F_{m+1}(x) = F_m(x) + \frac{1}{2}\,\frac{E[y - p(x) \mid x]}{E[p(x)(1 - p(x)) \mid x]}$$

is designed to use the weighted expectation of the residual $y - p(x)$ to improve the fit. In practice, this amounts to taking $f_{m+1}(x)$ to be the fit from a weighted regression of the adjusted response $z_t = \dfrac{y_t - p(x_t)}{p(x_t)(1 - p(x_t))}$ on $x_t$ with $w_t = p(x_t)(1 - p(x_t))$ as weights. The important difference compared with the parametric logit analysis is that now the function $F(x_t)$ at each t, not the parameters, is being estimated. For this reason, the procedure is known as functional gradient descent. After M updates, the log-odds ratio is represented by

$$F_M(x) = \sum_{m=1}^M f_m(x),$$

which is an ensemble of M component functions $f_m(x)$. The functional gradient descent algorithm essentially fits a stagewise regression: variables are included sequentially, and no change is made to the coefficients of the variables already included. The size of the ensemble is determined by M. This parameter also controls which predictors are dropped, as variables not chosen in steps one to M will automatically have a weight of zero.

The ensemble feature of boosting is preserved even when the functional gradient algorithm is applied to other objective functions. The generic boosting algorithm is as follows:

Gradient Boosting. For minimizing $E[J(y, F(x))]$:

  1. For $t = 1, \ldots, T$, initialize $F_0(x_t)$ (for example, at a constant) and set $m = 1$.
  2. For $m = 1, \ldots, M$:
    1. Compute the negative gradient $u_t = -\dfrac{\partial J(y_t, F(x_t))}{\partial F(x_t)}\Big|_{F = F_{m-1}}$.
    2. Let $f_m(x)$ be the best fit of $u_t$ using predictor $x_t$.
    3. Update $F_m(x_t) = F_{m-1}(x_t) + f_m(x_t)$.
  3. Return the fit $F_M(x)$ or the classifier $\mathrm{sign}(F_M(x))$.

Step (2a) computes the adjusted response and step (2b) obtains the best model at stage m. Step (2c) uses the negative gradient to update the fit. For quadratic loss $J(y, F(x)) = \frac{1}{2}(y - F(x))^2$, $u_t$ is the residual $y_t - F_{m-1}(x_t)$. Then gradient boosting amounts to repeatedly finding a predictor to fit the residuals not explained in the previous step. By introducing a parameter ν to slow the effect of $f_m(x)$ on $F_m(x)$, step (2c) can be modified to control the rate at which gradient descent takes place:

$$F_m(x_t) = F_{m-1}(x_t) + \nu f_m(x_t).$$

The regularization parameter $\nu \in (0, 1]$ is also known as the learning rate. Obviously, the parameters M and ν are related, as a low learning rate would necessitate a larger M.
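
To illustrate steps (2a)–(2c) under the quadratic loss together with the learning-rate modification, the sketch below implements component-wise gradient boosting with one-variable stumps. The simulated design, the quantile grid for candidate thresholds, and the function names are my own choices, not the paper's.

```python
import numpy as np

def best_stump(x, u):
    """Single-variable stump fit of the gradient u: choose the threshold c
    minimizing squared error, with partition means as the fitted values."""
    best_sse, best = np.inf, None
    for c in np.quantile(x, np.linspace(0.05, 0.95, 19)):
        left = x <= c
        bL, bR = u[left].mean(), u[~left].mean()
        fit = np.where(left, bL, bR)
        sse = np.sum((u - fit) ** 2)
        if sse < best_sse:
            best_sse, best = sse, (c, bL, bR)
    return best_sse, best

def l2_boost(X, y, M=200, nu=0.1):
    """Component-wise gradient boosting under quadratic loss: at each stage,
    fit the current residuals with the best single-variable stump."""
    T, N = X.shape
    F = np.full(T, y.mean())                        # initialize at a constant
    ensemble = []
    for m in range(M):
        u = y - F                                   # negative gradient = residual
        scores = [best_stump(X[:, j], u) for j in range(N)]
        j = int(np.argmin([s[0] for s in scores]))  # predictor giving the best fit
        c, bL, bR = scores[j][1]
        F += nu * np.where(X[:, j] <= c, bL, bR)    # shrunken update, step (2c)
        ensemble.append((j, c, nu * bL, nu * bR))
    return F, ensemble

# illustrative use: y depends on two of ten candidate predictors
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 10))
y = np.sin(X[:, 3]) + 0.5 * X[:, 7] + 0.3 * rng.normal(size=300)
F, ensemble = l2_boost(X, y)
print("selected predictors:", sorted(set(j for j, *_ in ensemble)))
```

Note that with M and ν jointly controlling the amount of fitting, any variable never picked in steps one to M ends up with zero weight, which is the variable-selection feature exploited later in the paper.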

Seen from the perspective of gradient boosting, Adaboost uses $J(y, F(x)) = \exp(-\tilde{y}F(x))$, while the Logitboost of Friedman, Hastie, and Tibshirani (2000) specifies $J(y, F(x)) = \log\big(1 + \exp(-2\tilde{y}F(x))\big)$. The two are second-order equivalent, as discussed earlier.3 Some software packages use the terms interchangeably. There are, however, important differences between a boosting-based logit model and the classical logit model, even though both minimize the negative binomial likelihood. The predictors in a logit model are selected prior to estimation, and the fit is based on a model that considers multiple predictors jointly. In contrast, gradient boosting performs variable selection and estimation simultaneously, and the final model is built up from an ensemble of models.

We have thus seen that Adaboost is gradient boosting applied to a specific loss function. Many variations on this basic theme have been developed. Instead of fitting a base learner to the observed data, a variation known as stochastic boosting randomly samples a fraction of the observations at each step m (Friedman 2002). If all observations are randomly sampled, the bagging algorithm of Breiman (1996) obtains. Bagging tends to yield estimates with lower variances. By letting the subsampling rate be between zero and one, stochastic boosting becomes a hybrid between bagging and boosting. There is a loss of information from returning a discrete weak classifier $f_m(x)$ at each step. To circumvent this problem, Friedman, Hastie, and Tibshirani (2000) proposed the Gentleboost algorithm, which updates the probabilities instead of the classifier. Multiclass problems have also been studied by treating boosting as an algorithm that fits an additive multinomial model.

Associated with each loss function J are model-implied probabilities, but additional assumptions are necessary to turn the probabilities into classifications. These are determined by how the weak learners $f_m(x)$ and the final learner $F_M(x)$ map fitted values into class labels. Both Logitboost and Adaboost label the weak and strong learners using the sign function; that is, the label is 1 when $f_m(x) > 0$ and when $F_M(x) > 0$. As a consequence, these are Bayes classifiers that apply a threshold of one-half to the posterior probability. It might be desirable to choose a different threshold, especially when the two classes have uneven occurrence in the data. Let τ be the cost of misclassifying Y as 1 and $1 - \tau$ be the cost of misclassifying Y as zero. Minimizing the resulting misclassification risk leads to a cost-weighted Bayes rule that labels y as one when $p(x) > \tau$. In this case, step (3) would return

$$F_M(x) = \mathrm{sign}\big(p_M(x) - \tau\big),$$

which is a form of quantile classification (Mease, Wyner, and Buja 2007).

In summary, boosting can be motivated from the perspective of minimizing an exponential loss function, or of fitting an additive logit model using the method of functional gradient descent that implicitly reweights the data at each step. Regularization, subsampling, and cross-validation are incorporated into R packages (such as MBOOST, ADA, and GBM). The analysis to follow uses the Bernoulli loss function as implemented in the GBM package of Ridgeway (2007). The package returns the class probability instead of the classifier. For recession analysis, the probability estimate is interesting in its own right, and the flexibility to choose a threshold other than one-half is convenient.

4. Application to Macroeconomic Data

My analysis uses the same 132 monthly predictors as in Ludvigson and Ng (2011), updated to include observations in 2011, as explained in Jurado, Ludvigson, and Ng (2013). The data cover broad categories: real output and income, employment and hours, real retail, manufacturing and trade sales, consumer spending, housing starts, inventories and inventory sales ratios, orders and unfilled orders, compensation and labour costs, capacity utilization measures, price indexes, bond and stock market indexes, and foreign exchange measures. Denote the data available for prediction of $y_t$ by a $T \times 132$ matrix

$$X = (x_1, x_2, \ldots, x_T)',$$

where each $x_t$ is a $132 \times 1$ vector.

In the recession studies reviewed in Section 'Related Work', dynamic specification of the model is an important part of the analysis. For example, factor models estimate the reference cycle and its dynamics jointly, while a Markov switching model estimates the transition probability between the recession and non-recession states. The standard logit model is designed for independent data and is basically static. Allowing for dynamics complicates the estimation problem even when the number of predictors is small (Kauppi and Saikkonen 2008). To allow for a dynamic relation between $y_t$ and possibly many predictors, I let d lags of every variable be a candidate predictor. The potential predictor set is then a T by N data matrix $X(d)$, where $N = 132 \times d$:

$$X(d) = \big(X_{-(h+1)}, X_{-(h+2)}, \ldots, X_{-(h+d)}\big),$$

where $X_{-k}$ denotes the matrix X lagged k periods. When $h = 3$ and $d = 3$, for example, a forecast of $y_t$ uses data dated $t-4$, $t-5$, and $t-6$.4
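
A small pandas sketch of how such a lagged predictor set might be assembled; the column names, the simulated series, and the helper name make_predictor_set are hypothetical.

```python
import numpy as np
import pandas as pd

def make_predictor_set(X, y, h=3, d=3):
    """Stack lags h+1, ..., h+d of every column of X (and of y) so that the
    row dated t contains only information dated t-(h+1) or earlier."""
    blocks = {}
    for lag in range(h + 1, h + d + 1):
        for col in X.columns:
            blocks[f"{col}_lag{lag}"] = X[col].shift(lag)
        blocks[f"y_lag{lag}"] = y.shift(lag)
    Z = pd.DataFrame(blocks, index=X.index)
    keep = Z.notna().all(axis=1)            # drop rows lost to lagging
    return Z[keep], y[keep]

# illustrative use with fake monthly data
idx = pd.period_range("1961-01", "2011-12", freq="M")
X = pd.DataFrame(np.random.default_rng(3).normal(size=(len(idx), 4)),
                 index=idx, columns=["spread_aaa", "spread_10y", "emp", "pmi"])
y = pd.Series((X["spread_aaa"].shift(6) < -0.5).astype(int), index=idx)
Z, y_trim = make_predictor_set(X, y, h=3, d=3)
print(Z.shape)
```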

So far, the setup applies to h-step-ahead prediction of any variable $y_t$. If $y_t$ were a continuous variable such as inflation or output growth, and the identity of the relevant variables in $X(d)$ were known, a linear regression model would produce what is known as a direct h-period forecast. Even if the relevant predictors were not known, and even with $N > T$, it is still possible to use the variable selection feature of boosting as in Bai and Ng (2009a) to identify the useful predictors. The current problem differs in that the dependent variable is a binary indicator. The log-odds model implies that p is non-linear in x. The next issue is then to decide on the specifics of this non-linear function. A common choice of the base learner is a regression tree defined by

$$f(x_t) = \sum_{j=1}^J b_j \, 1(x_t \in R_j),$$

where $R_1, \ldots, R_J$ are disjoint partitions of the predictor space and the $b_j$ are constants. Given that the number of potential predictors is large, I do not allow for interaction among variables. As a consequence, each learner takes on only one regressor at a time, leading to what Buhlmann and Yu (2003) referred to as component-wise boosting. I further restrict the regression tree to have two terminal nodes, reducing the tree to a stump. Effectively, the variable chosen at stage m, say $x_{(m)}$, assigns each observation to one of two partitions depending on the value of a data-dependent threshold $c_m$:

$$f_m(x_{(m),t}) = b_L \, 1(x_{(m),t} \le c_m) + b_R \, 1(x_{(m),t} > c_m),$$

where $b_L$ and $b_R$ are parameters, possibly the mean of the observations in each partition. At each stage m, the identity of $x_{(m)}$ is determined by considering the N candidate variables one by one and choosing the one that gives the best fit. It is important to remark that a variable chosen by boosting at stage m can be chosen again at subsequent stages. Because the same variable can be chosen multiple times, the final additive model is spanned by $n \le M$ unique variables.

The relative importance of predictor j can be assessed by how much it contributes to the variation in $F_M(x)$. Let $\iota(m)$ be a function that returns the identity of the predictor chosen at stage m, and let $\Delta_m$ be the improvement in squared error achieved at that stage. Friedman (2001) suggests considering

$$\mathcal{I}_j = c \sum_{m=1}^M \Delta_m \, 1\big(\iota(m) = j\big), \qquad (5)$$

where the constant c normalizes the statistic. The statistic is based on the number of times a variable is selected over the M steps, weighted by its improvement in squared error as given by $\Delta_m$. The sum of $\mathcal{I}_j$ over j is 100. Higher values thus signify that the associated variable is important. Naturally, variables not selected have zero importance.
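
In the component-wise stump case, (5) can be tallied directly from the boosting path. A sketch, assuming the path is recorded as (chosen predictor, squared-error improvement) pairs:

```python
import numpy as np

def relative_importance(path, N):
    """Friedman-style importance: sum the squared-error improvement over the
    stages at which predictor j was chosen, scaled so the total equals 100.
    `path` is a list of (j_m, delta_m) pairs recorded along the boosting run."""
    imp = np.zeros(N)
    for j, delta in path:
        imp[j] += delta
    return 100.0 * imp / imp.sum()

# illustrative use: predictor 2 is chosen twice with large improvements
path = [(2, 0.40), (5, 0.10), (2, 0.25), (7, 0.05)]
print(relative_importance(path, N=10).round(1))
```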

4.1. Full Sample Estimation

I consider $h = 3$, 6, and 12 months ahead forecasts. Considering the $h = 12$ case may seem ambitious given that many studies consider shorter horizons, and economic data appear to have limited predictability beyond one year (Berge and Jorda 2011). But the present analysis uses more data than previous studies and the boosting methodology is new. It is thus worth pushing the analysis to the limit. After adjusting for lags, the full sample estimation has $T = 610$ observations from 1961:3 to 2011:12. The size of the potential predictor set is varied by increasing d from (3, 3, 4) to (9, 9, 12) for the three forecast horizons considered, respectively. The largest predictor set (for the $h = 12$ model) has a total of 1596 predictors: $132 \times 12 = 1584$ variables in $X(d)$ as well as 12 lags of $y_t$. Boosting then selects the relevant predictors from this set.

As an exploratory analysis, a model is estimated using the default parameters in the GBM package, which set the learning rate ν to .001 and the subsampling rate (BAGFRAC) for stochastic boosting to .5, with TRAIN = .5 so that half the sample serves as training data. The other half is used for cross-validation to determine the optimal stopping parameter M. The purpose of stopping the algorithm at step M is to avoid overfitting. While overfitting is generally less of an issue for classification analysis, a saturated model that includes all variables will emerge if M is allowed to tend to infinity. By early termination of the algorithm, $F_M(x)$ is shrunk towards zero. Ridgeway (2007) argues in favour of setting ν small and then determining M by cross-validation, or by a criterion such as the AIC.
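
The estimation itself is done with the GBM package in R. Purely as a rough analogue, the sketch below shows how the same design choices (stumps, a small learning rate, subsampling, and a held-out half of the sample to pick M) might be set up with scikit-learn's GradientBoostingClassifier on simulated data; none of the settings or numbers reproduce the paper's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Illustrative stand-in for the design: Z is a lagged predictor matrix and
# y a rare-event indicator (both simulated, not the paper's data).
rng = np.random.default_rng(4)
Z = rng.normal(size=(610, 400))
y = (Z[:, 0] - 0.8 * Z[:, 3] + 0.5 * rng.normal(size=610) > 1.0).astype(int)

train = np.arange(305)                       # first half for fitting
valid = np.arange(305, 610)                  # second half to choose M

gbm = GradientBoostingClassifier(max_depth=1,        # stumps: no interactions
                                 learning_rate=0.01,
                                 subsample=0.5,      # stochastic boosting
                                 n_estimators=1000)
gbm.fit(Z[train], y[train])

# choose the stopping point M by the validation deviance along the path
losses = [log_loss(y[valid], p) for p in gbm.staged_predict_proba(Z[valid])]
M = int(np.argmin(losses)) + 1
print("chosen M:", M)
print("predictors with nonzero importance:", int((gbm.feature_importances_ > 0).sum()))
```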

The default values give an M that is slightly over 2000. While a large number of variables are being selected, many have low values of $\mathcal{I}_j$. I find that by increasing the learning rate ν to 0.01, M can be significantly reduced to around 300, and the number of variables being selected (i.e., n) is even smaller. As an example, the $h = 12$ model has $n = 80$ predictors when $d = 4$, much smaller than M because, as noted earlier, boosting can select the same variable multiple times. Furthermore, as N increases from 532 to 1596 when d increases from 4 to 12, n only increases from 80 to 107. This suggests that the longer lags add little information. In view of this finding, I focus on results for $d = (3, 3, 4)$.

Listed in Table 1 are the variables chosen whose Friedman importance indicator $\mathcal{I}_j$ exceeds two. The table says, for example, that the $h = 3$ model selects lags 6, 5, and 4 (ordered by the value of $\mathcal{I}_j$) of 6mo-FF (the spread between the six-month Treasury bill and the federal funds rate). Also selected is lag 6 of the 1yr-FF spread, which is the difference between the one-year Treasury bond rate and the federal funds rate. There are fewer than 10 variables that pass the threshold of two. Thresholding $\mathcal{I}_j$ at a value bigger than zero allows me to focus on the “reasonably” important predictors and ignore the “barely” important ones. It is reassuring that the important variables identified by boosting listed in Table 1 largely coincide with those used in studies of business cycle chronology. As noted earlier, term and credit/default spreads are generally thought to be useful predictors of recessions. However, previous work tends to consider one spread variable at a time. Here, all spreads in the database are potentially relevant and it is up to boosting to pick out the most important ones. The CP-FF spread lagged up to 16 months is important, while the 10yr-FF and Aaa-FF spreads lagged 13 months have predictive power. It can be argued that the CP-FF spread has an information lead advantage over the other spreads. As seen in Table 1, the information in the different spreads is not mutually exclusive. This is true for all three values of h considered. Take the $h = 12$ case as an example. Conditional on the 10yr-FF spread, the spreads CP-FF and Aaa-FF have additional predictive power.

Table 1. Top Variables Chosen by Cross-Validation

(h, h+d)   No.   Variable name      $\mathcal{I}_j$   Lags k
(3, 6)     97    6mo-FF spread      25.641            6, 5, 4
           96    3mo-FF spread      18.830            4, 5, 6
           133   lagy               18.651            4
           98    1yr-FF spread      15.024            6
           101   Aaa-FF spread       8.099            6
           105   Ex rate: Japan      2.189            5
(6, 9)     97    6mo-FF spread      24.584            7, 8
           101   Aaa-FF spread      22.306            8, 9, 7
           98    1yr-FF spread      19.184            7, 9
           96    3mo-FF spread       5.060            7
           100   10yr-FF spread      2.503            7
           114   NAPM com price      2.043            8
(12, 16)   100   10yr-FF spread     14.326            13
           95    CP-FF spread       12.021            15, 13, 16
           101   Aaa-FF spread      10.242            13
           63    NAPM vendor del    10.020            13
           35    Emp: mining         6.235            16, 15
           99    5yr-FF spread       5.644            13, 16
           64    NAPM Invent         3.494            14
           114   NAPM com price      2.358            13

NOTES

  1. Forecasts for period t are based on predictors at lags $k = h+1, \ldots, h+d$. The column $\mathcal{I}_j$ is the indicator of importance defined in (5). The last column lists the lags k at which the corresponding predictor is chosen, ordered by importance.

The characteristics of the relevant predictor set are evidently horizon specific. While there are more variables exceeding the $\mathcal{I}_j$ threshold of 2 when $h = 12$ than when $h = 3$ and 6, $\mathcal{I}_j$ tends to be lower when $h = 12$. As lags of $y_t$ have no predictive power at either $h = 6$ or 12, only the $h = 3$ model has an autoregressive structure. The 6mo-FF and 1yr-FF spreads are important when h = 3 and 6, but neither of these variables seems to be important when $h = 12$. The NAPM inventories and vendor delivery time are systematically selected when $h = 12$, but neither of these variables is selected for $h = 3$ or $h = 6$. Notably, only nominal variables are selected for the 3 and 6 months ahead forecasts. The only variable common to all three forecast horizons is the Aaa-FF spread. Perhaps the surprise variable in Table 1 is employment in mining. Though not frequently used in business cycle analysis, this variable is robustly countercyclical, as will be seen in the results to follow.

Figure 1 plots the (in-sample) posterior probability of recession, denoted $\hat{p}_t$, along with the NBER recession dates. The estimated probabilities for the pre-1990 recessions clearly display spikes around the NBER recession dates. However, the fitted probabilities for the post-1990 recessions are poor, especially when $h = 12$. The three recessions since 1983 came and went and the estimated probabilities barely spiked. The fitted probabilities based on the $h = 3$ and 6 models fare better, but the spikes after 1990 are still not as pronounced as expected.

Figure 1.

Recession Probabilities: In-Sample

Parameter instability is a generic feature of economic data and can be attributed to changing sources of economic fluctuations over time. The weak learners used in the analysis are stumps, and hence the model for the log-odds ratio is non-parametric. While parameter instability is not a meaningful notion in a non-parametric setup, the possibility of model instability cannot be overlooked. To see if the composition of the predictor set has changed over time, I construct two sets of recession probabilities as follows. The first is based on estimation over the sample $S_1$ = (1962:3, 1986:8), and the second is for the sample $S_2$ = (1986:9, 2011:12). The in-sample fitted probabilities spliced together from the estimation over the two samples are plotted as the solid blue line in Figure 2. Compared with the full sample estimates in Figure 1, the post-1990 recessions are much better aligned with the NBER dates without compromising the fit of the pre-1990 recessions. This suggests that it is possible to fit the data in the two subsamples reasonably well, but different models are needed over the subsamples.

Figure 2.

Recession Probabilities: Split-Sample, Spliced In-Sample Fit (Solid Line) and Spliced Out-of-Sample Fit (Dashed Line)

To gain further evidence of model instability, I use the model estimated over the first subsample to compute out-of-sample fitted probabilities for $S_2$. Similarly, the model estimated for the second subsample is used to compute out-of-sample fitted probabilities for $S_1$. If the model structure is stable, the out-of-sample fitted probabilities, represented by the dashed lines, should be similar to the in-sample fit, represented by the solid lines. Figure 2 shows that the fitted probabilities based on the model estimated up to 1986:8 do not line up with the NBER dates for the post-1990 recessions. The discrepancy is particularly obvious at the longer horizons. The probabilities based on the model estimated after 1986:8 also do not line up well with the actual recession dates in the first subsample. This confirms that the same model cannot fit the data of both subsamples.

The instability in the predictor set can be summarized by examining which variables are chosen in the two subsamples. This is reported in Table 2. The first impression is that while interest rate spreads are the important predictors in the first subsample for $h = 3$ and 6, many real activity variables become important in the second subsample. For $h = 12$, the real activity variables found to be important in the first sample are no longer important in the second sample. The 5yr-FF spread is important in the second sample but not in the first. Few variables are important in both samples. Even for those that are, the lags chosen and the degree of importance are different, as seen from the 1yr-FF spread. In general, many of the important predictors identified in the second sample have lower $\mathcal{I}_j$.

Table 2. Top Variables Chosen by Cross-Validation: Split Sample Estimation

Estimation sample: 1960:3–1986:8

(h, h+d)   No.   Variable name       $\mathcal{I}_j$   Lags k
(3, 6)     97    6mo-FF spread       46.162            6, 5
           133   lagy                19.417            4
           101   Aaa-FF spread       16.997            6
           98    1yr-FF spread        5.930            6
           96    3mo-FF spread        5.468            4
(6, 9)     97    6mo-FF spread       44.165            7, 8
           101   Aaa-FF spread       28.528            8, 7
           98    1yr-FF spread       16.023            7, 9
           114   NAPM com price       3.850            8
(12, 16)   100   10yr-FF spread      25.842            13
           101   Aaa-FF spread       23.854            13
           95    CP-FF spread        18.857            16, 15, 13
           63    NAPM vendor del     10.229            13
           61    PMI                  9.751            14
           64    NAPM Invent          3.744            14
           35    Emp: mining          2.028            16

Estimation sample: 1986:9–2011:12

(h, h+d)   No.   Variable name       $\mathcal{I}_j$   Lags k
(3, 6)     133   lagy                64.898            4
           98    1yr-FF spread       10.252            4, 5
           22    Help wanted/unemp    5.452            5, 4
           15    IP: nondble matls    4.610            5
           79    DC&I loans           4.267            5
           62    NAPM new ordrs       3.109            4
(6, 9)     79    DC&I loans          22.487            7, 8
           99    5yr-FF spread       20.318            9
           15    IP: nondble matls   16.689            7, 9, 8
           101   Aaa-FF spread       10.476            9
           133   lagy                 8.865            4
           62    NAPM new ordrs       5.400            7
           98    1yr-FF spread        4.870            9
           36    Emp: const           4.530            7
           22    Help wanted/unemp    4.456            7
(12, 16)   99    5yr-FF spread       53.407            15, 16, 14
           101   Aaa-FF spread       18.519            16
           79    C&I loans            8.469            14, 15
           102   Baa-FF spread        6.700            16
           98    1yr-FF spread        2.857            13
           44    Emp: FIRE            2.774            15

4.2. Rolling Estimation

The full sample results suggest a change in the dynamic relation between math formula and the predictors, as well as the identity of the predictors themselves. However, the in-sample fit (or lack thereof) does not reflect the out-of-sample predictive ability of the model. Furthermore, the foregoing results are based on the default parameters of the GBM package that randomizes half the sample for stochastic boosting, with M determined by cross-validation as though the data were independent. Arguably, these settings are not appropriate for serially correlated data.

Several changes are made to tailor boosting to out-of-sample predictions that also take into account the time series nature of the data. First, stochastic boosting is disallowed by changing the randomization rate from the default of one-half to zero. Second, rolling regressions are used to generate out-of-sample predictions. These are constructed as follows:

Rolling Forecast. Initialize $t_1$ and $t_2$:

  1. For $t = t_1, \ldots, t_1 + 179$, fit the model for $y_t$ using the predictors in $X(d)$.
  2. For $j = 1, \ldots, N$, record the relative importance $\mathcal{I}_{j,t_2}$ at each $t_2$.
  3. Construct the predicted probability $\hat{p}_{t_2}$. Increase $t_1$ and $t_2$ by one.

There are 407 such rolling regressions, each with 180 observations ending in period $t_2 - h$. The first estimation is based on 180 observations from $t_1$ = 1962:3 to $t_1 + 179$ = 1977:2. When $h = 12$, the first forecast is made for $t_2$ = 1978:2. The next forecast is based on estimation over the sample defined by $(t_1, t_1 + 179)$ = (1962:4, 1977:3), and so on.
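
A schematic version of this rolling loop, again using a scikit-learn boosted stump model as a stand-in for GBM; the window length of 180 follows the text, while the simulated data, dimensions, and model settings are illustrative.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Simulated stand-ins: row t of Z holds the lagged predictors used to explain
# y[t], so a forecast for t2 only requires rows up to t2.
rng = np.random.default_rng(5)
T, N, h, window = 610, 30, 12, 180
Z = rng.normal(size=(T, N))
y = (Z[:, 0] + 0.5 * rng.normal(size=T) > 1.0).astype(int)

probs, importance = [], []
for start in range(T - window - h):                  # one regression per window
    fit_idx = np.arange(start, start + window)
    t2 = start + window - 1 + h                      # period being forecast
    gbm = GradientBoostingClassifier(max_depth=1,    # component-wise stumps
                                     learning_rate=0.01,
                                     subsample=1.0,  # no stochastic boosting
                                     n_estimators=100)
    gbm.fit(Z[fit_idx], y[fit_idx])
    probs.append(gbm.predict_proba(Z[[t2]])[0, 1])   # predicted recession probability
    importance.append(gbm.feature_importances_)      # relative importance, dated t2

importance = np.array(importance)
print(len(probs), importance.shape)
```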

The final change is that, in place of cross-validation, two new indicators are used to determine the set of relevant predictors. The first is the average relative importance of each variable over the 407 rolling subsamples, constructed for $j = 1, \ldots, N$ as

$$\overline{\mathcal{I}}_j = \frac{1}{407}\sum_{t_2} \mathcal{I}_{j,t_2}.$$

The second indicator is the frequency with which variable j is selected in the rolling estimation:

$$\mathrm{freq}_j = \frac{1}{407}\sum_{t_2} 1\big(\mathcal{I}_{j,t_2} > 0\big).$$

Both statistics are dated according to the period for which the forecast is made: $t_2$.
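
Given the per-window importances recorded in a loop such as the one above, the two indicators (and the predictor count $n_{t_2}$ defined in the next paragraph) are simple column and row summaries. A sketch with a simulated importance matrix:

```python
import numpy as np

# `importance` is a (number of windows) x N array of relative importances,
# one row per rolling window; here it is simulated for illustration.
rng = np.random.default_rng(6)
importance = rng.random((407, 145)) * (rng.random((407, 145)) < 0.05)

avg_importance = importance.mean(axis=0)         # average importance of predictor j
freq_selected = (importance > 0).mean(axis=0)    # share of windows selecting predictor j
n_t2 = (importance > 0).sum(axis=1)              # number of predictors selected at each t2
print(avg_importance.argmax(), freq_selected.max().round(2), n_t2.mean().round(1))
```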

Figure 3 plots the number of variables with positive importance in forecasting $y_{t_2}$, defined as

$$n_{t_2} = \sum_{j=1}^N 1\big(\mathcal{I}_{j,t_2} > 0\big).$$

The black solid line indicates the total number of variables selected when lags of the same variable are treated as distinct. The dotted red line indicates the unique number of variables, meaning that variables at different lags are treated as the same variable. On average, the total number of variables selected for the 12 months ahead forecast is between 13 and 16, while the unique number of variables is around 9. These numbers are much smaller than those found in the full sample analysis. The number of relevant predictors $n_{t_2}$ has drifted down since the 1980 recessions and bottomed out in $t_2$ = 1999:2, which roughly coincides with the Great Moderation. However, the downward trend is reversed in 2001. The numbers of relevant predictors for $h = 3$ and 6 generally follow the same pattern as for $h = 12$, with the notable difference that, since the 2008 recession, the number of important predictors at the shorter horizons has been on an upward trend, while that for $h = 12$ is slightly below its pre-recession level.

Figure 3.

Number of Variables Selected in Rolling Estimation, Including Lags (Solid Line) and Unique Variables (Dashed Line)

Table 3 reports the variables with average importance $\overline{\mathcal{I}}_j$ exceeding 2. While the term and default spreads are still found to be valuable predictors, there are qualitative differences between the full sample estimates reported in Table 1 and the rolling estimates reported in Table 3. The Aaa spread is the dominant predictor for $h = 3$ and 6 in rolling estimation, but the 6mo-FF spread is better for full sample predictions. For $h = 12$, the 10yr-FF spread is the most important in-sample predictor, but the 5yr-FF spread performs better in rolling estimation. Berge and Jorda (2011) find that the 10yr-FF spread with an 18-month lag has predictive power. Here, spreads lagged 13–16 months are found to be important.

Table 3. Variables Chosen in Rolling Window Estimation: By Average Importance

(h, h+d)   No.   Variable name       $\overline{\mathcal{I}}_j$   Lags k
(3, 6)     133   lagy                18.596                       4
           101   Aaa-FF spread       12.899                       6
           96    3mo-FF spread       11.190                       4, 6
           61    PMI                  6.449                       4
           97    6mo-FF spread        6.085                       6
           98    1yr-FF spread        5.308                       6
           33    Emp: total           5.087                       4
           95    CP-FF spread         3.566                       4
           102   Baa-FF spread        3.049                       6
(6, 9)     101   Aaa-FF spread       24.155                       8, 7, 9
           98    1yr-FF spread       10.215                       7, 9
           102   Baa-FF spread        9.910                       7, 9
           99    5yr-FF spread        7.157                       9, 8
           100   10yr-FF spread       6.404                       9, 8
           97    6mo-FF spread        5.866                       7
           45    Emp: Govt            4.567                       9, 7
           96    3mo-FF spread        4.122                       7
           37    Emp: mfg             2.221                       7
(12, 16)   99    5yr-FF spread       13.993                       15, 16
           100   10yr-FF spread       8.958                       13
           102   Baa-FF spread        8.464                       13
           101   Aaa-FF spread        6.977                       16, 13
           95    CP-FF spread         6.969                       13, 16
           63    NAPM vendor del      6.518                       13
           97    6mo-FF spread        4.600                       16
           64    NAPM Invent          3.106                       14
           61    PMI                  2.524                       15
           35    Emp: mining          2.180                       15

Table 4 reports the variables whose frequency of being selected (summed over lags) is at least 0.2. The Aaa-FF spread is the most frequently selected variable in the predictor set when $h = 3$ and 6, while the CP-FF spread is the most frequently selected when $h = 12$. These variables have high values of $\mathrm{freq}_j$ because multiple lags are selected, while other variables are selected at only one lag. In this regard, the model chosen by boosting has rather weak dynamics. Tables 3 and 4 together suggest that the Aaa-FF spread is the most robust predictor for $h = 3$ and 6, while the CP and 5yr-FF spreads are most important for $h = 12$. Of the real activity variables, government employment and employment in mining are valuable predictors.

Table 4. Variables Chosen in Rolling Window Estimation: By Frequency

(h, h + d)   No.   Variable name        Frequency   Lags selected (k)
(3, 6)       101   Aaa-FF spread        1.330       6, 5, 4
              97   6mo-FF spread        1.014       6, 4, 5
              96   3mo-FF spread        0.944       4, 6, 5
              45   Emp: Govt            0.893       5, 4, 6
             133   lagy                 0.812       4
              94   Baa bond             0.472       5, 4
              98   1yr-FF spread        0.409       6
              22   Help wanted/unemp    0.398       4
             105   Ex rate: Japan       0.386       5
              33   Emp: total           0.360       4
              65   Orders: cons gds     0.335       5
             102   Baa-FF spread        0.333       6
              93   Aaa bond             0.270       4
              61   PMI                  0.249       4
(6, 9)       101   Aaa-FF spread        1.761       8, 7, 9
              45   Emp: Govt            1.248       9, 8, 7
              98   1yr-FF spread        0.735       7, 9
             102   Baa-FF spread        0.637       7, 9
              96   3mo-FF spread        0.637       7, 9
              95   CP-FF spread         0.607       9, 8
             100   10yr-FF spread       0.585       9, 7
             114   NAPM com price       0.548       8, 7
              99   5yr-FF spread        0.539       9, 7
              97   6mo-FF spread        0.370       7
              63   NAPM vendor del      0.347       9
              37   Emp: mfg             0.300       7
              62   NAPM new ordrs       0.281       7
              64   NAPM Invent          0.255       7
              35   Emp: mining          0.246       9
              66   Orders: dble gds     0.220       7
               5   Retail sales         0.208       7
(12, 16)      95   CP-FF spread         0.871       16, 15, 13
              35   Emp: mining          0.859       16, 13, 15
              81   Inst cred/PI         0.694       16, 15
              99   5yr-FF spread        0.687       15, 16
             100   10yr-FF spread       0.639       13, 15
             101   Aaa-FF spread        0.488       16, 13
             102   Baa-FF spread        0.438       13
              63   NAPM vendor del      0.371       13
              61   PMI                  0.311       15
              64   NAPM Invent          0.273       14
              97   6mo-FF spread        0.270       16
             117   CPI-U: transp        0.246       13
              44   Emp: FIRE            0.234       15
              68   Unf orders: dble     0.218       16
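The two rankings in Tables 3 and 4 can in principle be reproduced from the same window-by-window importance scores: one ranking averages the scores over windows, the other counts how often each variable is selected, summed over its lags. A minimal sketch, reusing the illustrative `importance` DataFrame described earlier (the exact aggregation used in the paper may differ):

```python
import pandas as pd

def rank_predictors(importance: pd.DataFrame) -> pd.DataFrame:
    """Average importance and selection frequency by variable, summed over lags."""
    base = [col.rsplit("_L", 1)[0] for col in importance.columns]  # variable name without lag suffix
    avg = importance.mean(axis=0)             # average importance of each (variable, lag) pair
    freq = (importance > 0).mean(axis=0)      # share of windows in which each lag is selected
    out = pd.DataFrame({
        "avg_importance": avg.groupby(base).sum(),
        "frequency": freq.groupby(base).sum(),
    })
    return out.sort_values("avg_importance", ascending=False)
```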

The out-of-sample fitted probabilities are plotted as the solid lines in Figure 4. These probabilities are generally high around recession times, but there are also false positives, especially around 1985, when the key predictor is found to be the Baa-FF spread. Notably, all out-of-sample fitted probabilities are choppier than the full sample estimates. One explanation is that each rolling window contains far fewer observations than the full sample. The more likely reason is that, except for the h = 3 model, the prediction models have no autoregressive dynamics.

Figure 4.

Recession Probabilities: Rolling Estimates

Additional assumptions are necessary to use the estimated probabilities to generate warning signals for recessions. As mentioned earlier, only 15% of the sample months are recession months, so the Bayes threshold of 0.5 may not be appropriate. Chauvet and Hamilton (2006) suggest declaring a recession when the smoothed probability estimate of the latent recession state exceeds .65 and declaring it over when the probability falls below .35. Chauvet and Piger (2008) require the smoothed probability to move above 0.8 and stay above that level for three consecutive months. Berge and Jorda (2011) search between 0.3 and 0.6 for the optimal threshold such that the Chauvet and Piger (2008) probabilities best fit the NBER dates. In contrast to these studies, I analyze forecasts at horizons h = 3, 6, and 12, and the probability estimates are lower the higher is h. Applying a threshold of .65 would lead to the foregone conclusion of no recession. Yet it is clear that the predicted probabilities have distinguishable spikes.
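For concreteness, the Chauvet and Hamilton (2006) style rule is a two-threshold (hysteresis) rule: enter the recession state when the probability rises above .65 and exit when it falls below .35. A minimal sketch of such a rule (the function name and data layout are mine, not the authors'):

```python
import pandas as pd

def two_threshold_rule(prob: pd.Series, enter: float = 0.65, exit_: float = 0.35) -> pd.Series:
    """Label each month as recession (True) or expansion (False) using a hysteresis rule."""
    in_recession = False
    labels = []
    for p in prob:
        if not in_recession and p > enter:
            in_recession = True       # recession starts when prob rises above `enter`
        elif in_recession and p < exit_:
            in_recession = False      # recession ends when prob falls below `exit_`
        labels.append(in_recession)
    return pd.Series(labels, index=prob.index)
```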

I proceed with the assumption that it is the relative probability over time that matters and use as threshold the mean plus 1.65 times the standard deviation of the fitted probabilities over the sample. This yields thresholds of .435 for h = 3, .444 for h = 6, and .304 for h = 12. The threshold is marked as a line in each of the three panels of Figure 4. I then locate the dates t2 at which the estimated probability exceeds the threshold to get some idea of whether the threshold-crossing dates are near the NBER recession dates. The most important variable used in the prediction at t2 is recorded for reference, though it should be remembered that boosting is an ensemble scheme and prediction is not based on any particular regressor per se.
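The relative threshold used here is straightforward to compute. The sketch below returns the mean-plus-1.65-standard-deviation cutoff and the dates on which the fitted probabilities cross it, assuming `prob` holds the fitted probabilities for one horizon indexed by t2 (again, illustrative names only):

```python
import pandas as pd

def relative_threshold(prob: pd.Series, z: float = 1.65):
    """Mean + z standard deviations of the fitted probabilities, and the crossing dates."""
    cutoff = prob.mean() + z * prob.std()
    crossings = prob.index[prob > cutoff]
    return cutoff, crossings
```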

It is well documented that the risky spreads were successful in predicting recessions prior to 1990 but completely missed the 1990 recession. The h = 6 and h = 12 models produce heightened probability estimates around this episode: one rises above .2 for three of the five months between 1990:3 and 1990:8, though not on a consecutive basis, as the probabilities fall to as low as .06 in between, while the other falls short of crossing its threshold. That the two models did not strongly signal a recession is perhaps not surprising, as the top predictor is found to be the risky Baa-FF spread. The h = 3 model predicts a recession probability of .414 for t2 = 1989:11, with the 3mo-FF spread being the most important predictor; the estimate is still short of the threshold of .436. The probability of recession reaches .534 only at t2 = 1991:6, well after the recession started.

Turning to the 2001 recession, we see in Figure 4 that the probability of recession based on the h = 3 model reaches .568 at t2 = 2002:2. This is largely due to the lagged recession indicator, since 2001:3 to 2001:10 were identified by the NBER as recession months. The estimated probability based on the h = 12 model jumps from .002 at t2 = 1999:11 to .739 at t2 = 2000:1, with the 5yr-FF spread identified as most important. The h = 6 model also gives a recession probability of .449 at t2 = 2001:8, the most important predictor being the Aaa-FF spread, but both signals of recession are short lived.

There has been much discussion of whether signs of the 2008 recession were missed. The h = 3 model gives a probability of .650 at t2 = 2008:7, and the probability remains high for several months, the most important predictor being the PMI. The probability based on the h = 6 model exceeds .6 around t2 = 2007:5 and stays there for several months, with the most important predictor being the 5yr-FF spread. For the h = 12 model, the recession probability is estimated to be .65 at t2 = 2007:2; it returned to lower values before climbing to .54 at t2 = 2007:12, the top predictors being the 10yr-FF and Aaa-FF spreads. Thus, working from data available in mid-2006, the models for h = 6 and 12 see signs of the 2008 recession a whole year before it occurred, but the signals are sporadic and easy to miss.

Overall, the models give elevated probability estimates around, but not always ahead of, the actual recession dates. Instead of eyeballing the probability estimates, I also attempt to automate the dating of the start of recessions as determined by the model. Starting from the peak of economic activity identified by the NBER, I look within a window of τ months to see whether the predicted probabilities exceed the thresholds, and I record the variable with the highest importance at that date. I set τ to 12 for the recessions before 1990 and to 18 for the ones after, to accommodate the observation, made in hindsight, that the signals for the post-1990 recessions appear in the data earlier than those for the pre-1990 recessions (see footnote 5). The resulting dates are reported in column 4 of Table 5. As a point of reference, Table 5 also lists the recession months identified by the NBER, together with the dates on which the NBER announced the corresponding troughs. A noteworthy feature of Table 5 is the changing nature of the predictors across the five recessions: the risky spreads were important in predicting the two pre-1990 recessions but not the subsequent ones, while employment data have been helpful in predicting the post-1990 recessions but not the pre-1990 ones.
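The automated dating exercise can be sketched as follows: for each NBER peak, scan the τ-month window ending at the peak and record the first month in which the fitted probability exceeds the threshold. The helper below assumes monthly PeriodIndex data; the names, the made-up probabilities, and the usage example are illustrative only.

```python
import pandas as pd

def first_warning(prob: pd.Series, cutoff: float, peak: pd.Period, tau: int):
    """First month in the tau-month window ending at the NBER peak with prob > cutoff (None if never)."""
    window = prob.loc[peak - tau: peak]
    hits = window[window > cutoff]
    return hits.index[0] if not hits.empty else None

# Illustrative usage around the 2007:12 peak with made-up probabilities.
months = pd.period_range("2006-01", "2007-12", freq="M")
prob = pd.Series(0.10, index=months)
prob[pd.Period("2007-01", freq="M")] = 0.42   # a single made-up spike
print(first_warning(prob, cutoff=0.304, peak=pd.Period("2007-12", freq="M"), tau=18))
# -> 2007-01
```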

Table 5. Summary of Model Warnings

Recession (NBER dates)   Trough announced   h    t2 (model)   Probability   Top predictor    Threshold exceeded
1: 1980:1–1980:7         1981:7             3    1979:12      .545          Baa bond         Y
                                            6    1979:8       .718          CP-FF spread     Y
                                            12   1979:6       .313          CP-FF spread     N
2: 1981:7–1982:11        1983:7             3    1980:7       .899          Baa bond         Y
                                            6    1980:7       .794          CP-FF spread     Y
                                            12   1980:7       .339          CP-FF spread     Y
3: 1990:7–1991:3         1992:12            3    1989:11      .414          3mo-FF spread    N
                                            6    1989:12      .317          Govt. Emp        N
                                            12   1990:03      .238          Emp. Mining      N
4: 2001:3–2001:11        2003:7             3    2000:09      .224          Govt. Emp        N
                                            6    2001:02      .104          Govt. Emp        N
                                            12   2000:01      .739          5yr-FF spread    Y
5: 2007:12–2009:6        2010:9             3    2006:06      .003          HWI              N
                                            6    2007:05      .606          Govt. Emp        Y
                                            12   2007:01      .416          Emp. Mining      Y

NOTES

1. The business cycle dates are taken from http://www.nber.org/cycles/cyclesmain.html. The "Trough announced" column gives the date on which the NBER announced the trough. The t2 column is the date, within a window since the last peak of economic activity, at which the probability of recession estimated by the model exceeds the threshold of mean + 1.65 standard deviations. If the thresholds of .435, .444, and .304 are not crossed, a model with more lagged predictors is considered. For the first two recessions, the window is 12 months; for the last three recessions, the window is 18 months.

Table 5 shows that all models failed to call the 1990 recession, in agreement with the informal analysis. For the 2008 recession, the h = 3 model does not produce any probability estimate in the 18 months prior to 2007:12 that exceeds the threshold. However, the h = 12 model reports a recession probability of .416 for t2 = 2007:01, and the h = 6 model reports a recession probability of .606 for t2 = 2007:05. That the h = 12 model gives an earlier warning than the h = 3 and 6 models is interesting. The bottom line is that signals of the 2008 recession were in the data as early as mid-2006, but the lack of consensus across models makes it difficult to call a recession with confidence.

5. Conclusion

This analysis sets out to explore what boosting has to say about the predictors of recessions. Boosting is a non-parametric method with little input from economic theory. It has two main features. First, the fit of the model is based on an ensemble scheme. Second, by a suitable choice of regularization parameters, it permits joint variable selection and estimation in a data-rich environment. The empirical analysis finds that, even though many predictors are available, the set with systematic and important predictive power consists of only 10 or so variables. It is reassuring that most variables on the list are already known to be useful, though some less obvious variables are also identified. The main finding is that there is substantial time variation in the size and composition of the relevant predictor set, and that even the predictive power of the term and risky spreads is recession specific. The full sample estimates and rolling regressions give confidence to the 5yr-FF spread and the Aaa and CP spreads (relative to the Fed funds rate) as the best predictors of recessions. The results echo the analysis of Ng and Wright (2013) that business cycles are not alike. This, in essence, is why predicting recessions is challenging.

Few economic applications have used boosting thus far, probably because the terminology and presentation are unfamiliar to economists. But binomial boosting is simply an algorithm for constructing stagewise additive logistic models. It can potentially be used to analyze discrete choice problems, such as whether to retire or which brand to use. In conjunction with other loss functions, boosting can also be used as an alternative to information criteria as a variable selection device. Bai and Ng (2009a, b) exploited these properties to choose instruments and predictors in a data-rich environment. This paper has used boosting in the less common context of serially correlated data. The method is far from perfect, as there were misses and false positives. A weakness of boosting in recession analysis is that it produces fitted probabilities that are not sufficiently persistent. This is likely a consequence of the fact that the model dynamics are entirely driven by the predictors: the autoregressive dynamics needed for the estimated probabilities to evolve slowly are weak or absent altogether, and the predictors are often selected at isolated rather than consecutive lags. My conjecture is that improving the model dynamics would lead to smoother predicted probabilities without changing the key predictors identified in this analysis. How richer dynamics can be incorporated remains very much a topic for future research.

Appendix

A Toy Example

This Appendix provides an example to help in understanding the Adaboost algorithm. Consider classifying whether each of the 12 months in 2001 is a recession month, using three-month lagged data on the help-wanted index (HWI), new orders (NAPM), and the 10yr-FF spread (SPREAD). The data are listed in columns 2–4 of Table A1. The NBER dates are listed in column 5, where 1 indicates a recession month. I use a stump (a two-node decision tree) as the weak learner. A stump splits the data into two partitions using an optimally chosen threshold. This requires setting up a finite number of grid points for HWI, NAPM, and SPREAD, respectively, and evaluating the goodness of fit of each candidate partition.

Table A1. Toy Example

Date       HWI       NAPM     SPREAD     y    sign F1   sign F2   sign F3   sign F4   sign F5
2001:1      0.014    51.100   −0.770    −1    −1        −1        −1        −1        −1
2001:2     −0.091    50.300   −0.790    −1     1        −1        −1        −1        −1
2001:3      0.082    52.800   −1.160    −1    −1        −1        −1        −1        −1
2001:4     −0.129    49.800   −0.820     1     1         1         1         1         1
2001:5     −0.131    50.200   −0.390     1     1        −1         1         1         1
2001:6     −0.111    47.700   −0.420     1     1         1         1         1         1
2001:7     −0.056    47.200    0.340     1     1         1         1         1         1
2001:8     −0.103    45.400    1.180     1     1         1         1         1         1
2001:9     −0.093    47.100    1.310     1     1         1         1         1         1
2001:10    −0.004    46.800    1.470     1    −1         1        −1         1         1
2001:11    −0.174    46.700    1.320     1     1         1         1         1         1
2001:12    −0.007    47.500    1.660    −1    −1         1        −1         1        −1
α                                              .804      1.098     .710      .783      .575
Error rate                                     .167      .100      .138      .155      0

NOTES: HWI, NAPM, and SPREAD are lagged three months; their sample means are −.066, 48.550, and 0.244, respectively. The y column gives the NBER classification. Column sign F_m reports the classification after m rounds; the weak learner added in round m splits on: m = 1, HWI ≤ −.044; m = 2, NAPM < 49.834; m = 3, HWI ≤ −.100; m = 4, SPREAD > −.622; m = 5, NAPM < 47.062.

The algorithm begins by assigning an equal weight of 1/12 to each observation. For each of the grid points chosen for HWI, the sample of Y values is partitioned into two parts depending on whether HWI exceeds the grid point or not. The grid point that minimizes the classification error is found to be −.044. The procedure is repeated with NAPM as the splitting variable, and again with SPREAD. A comparison of the three sets of weighted errors reveals that splitting on HWI gives the smallest weighted error. The first weak learner thus assigns a label of 1 if HWI ≤ −.044. The outcome of this decision is given in column 6. Compared with the NBER dates in column 5, months 2 and 10 are mislabelled, giving a misclassification rate of 2/12 = .167. This is ε1 of step (2a). Direct calculation gives α1 = .804. The weights are then updated to complete step (2b): months 2 and 10 now each have a weight of .25, while the remaining 10 observations each have a weight of .05. Thresholds are again determined for each of the three variables and the weighted errors are computed using the weights w(2). Of the three, the NAPM split gives the smallest weighted error, and the weak learner for round 2 is identified. The classification based on the sign of F2(x) = α1 f1(x) + α2 f2(x) is given in column 7. Compared with column 5, months 5 and 12 are mislabelled; the weighted misclassification rate decreases to .100. The new weights w(3) are .25 for months 5 and 12, .138 for months 2 and 10, and .027 for the remaining months. Three sets of weighted errors are again determined using new thresholds. The best predictor is again HWI, now with a threshold of −.100. The classification based on the sign of F3(x) = F2(x) + α3 f3(x) is given in column 8. The error rate after three rounds actually increases to .138. The weak learner in round four is 1(SPREAD > −.622). After NAPM is selected for one more round, all recession dates are correctly classified. The strong learner is an ensemble of five weak learners that classifies according to the sign of F5(x), where

F5(x) = α1 f1(x) + α2 f2(x) + α3 f3(x) + α4 f4(x) + α5 f5(x)
      = .804 f1(x) + 1.098 f2(x) + .710 f3(x) + .783 f4(x) + .575 f5(x),

and fm(x) ∈ {−1, 1} is the label assigned by the m-th weak learner.
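As a check on the first-round numbers, and assuming the standard discrete AdaBoost formulas for the learner weight and the observation-weight update (presumably the paper's steps (2a) and (2b)), the calculations work out as follows:

```latex
\[
\varepsilon_1 = \tfrac{2}{12} \approx .167,
\qquad
\alpha_1 = \tfrac{1}{2}\ln\!\left(\frac{1-\varepsilon_1}{\varepsilon_1}\right)
         = \tfrac{1}{2}\ln 5 \approx .804,
\]
\[
w^{(2)}_i = \frac{w^{(1)}_i\, e^{\alpha_1}}{Z_1} = .25 \;\; \text{(misclassified months 2 and 10)},
\qquad
w^{(2)}_i = \frac{w^{(1)}_i\, e^{-\alpha_1}}{Z_1} = .05 \;\; \text{(the other ten months)},
\]
\[
Z_1 = \tfrac{2}{12}\,e^{\alpha_1} + \tfrac{10}{12}\,e^{-\alpha_1} \approx .745 .
\]
```

Carrying the same update through round 2, with ε2 = .100, gives the weights of .25, .138, and .027 quoted above.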

The key features of Adaboost are that (i) the same variable can be chosen more than once, (ii) the weights are adjusted at each step to focus on the misclassified observations, and (iii) the final decision is based on an ensemble of models. No single variable can yield the correct classification on its own, which is the premise of an ensemble decision rule.
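For readers who prefer code, here is a minimal sketch of discrete AdaBoost with stump learners, written under the standard formulas rather than lifted from the paper; the function names and the use of the observed values of each predictor as the threshold grid are my own choices.

```python
import numpy as np

def fit_stump(X, y, w):
    """Best weighted stump: returns (feature index, threshold, polarity, weighted error)."""
    best = (0, 0.0, 1, np.inf)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):               # observed values serve as the grid
            for pol in (1, -1):
                pred = np.where(pol * (X[:, j] - thr) <= 0, 1, -1)
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (j, thr, pol, err)
    return best

def adaboost(X, y, rounds=5):
    """Discrete AdaBoost with stumps; y takes values in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                           # equal initial weights
    ensemble = []
    for _ in range(rounds):
        j, thr, pol, err = fit_stump(X, y, w)
        err = max(err, 1e-12)                         # guard against a perfect fit
        alpha = 0.5 * np.log((1 - err) / err)
        pred = np.where(pol * (X[:, j] - thr) <= 0, 1, -1)
        w *= np.exp(-alpha * y * pred)                # up-weight misclassified observations
        w /= w.sum()
        ensemble.append((alpha, j, thr, pol))
    return ensemble

def predict(ensemble, X):
    """Sign of the alpha-weighted sum of the weak learners' labels."""
    F = np.zeros(len(X))
    for alpha, j, thr, pol in ensemble:
        F += alpha * np.where(pol * (X[:, j] - thr) <= 0, 1, -1)
    return np.sign(F)
```

Applied to the 12 observations in Table A1, a loop like this reproduces the spirit of the example, although the exact thresholds and the ordering of the weak learners can differ with the choice of grid.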

1. The importance of the term spread in recession analysis is documented at http://www.newyorkfed.org/research/capital_markets/ycfaq.html.

2. A different approximation to the zero-one loss is given in Buhlmann and Yu (2003).

3. Logitboost uses the initialization w_i = 1/N, F(x) = 0, and p(x_i) = 1/2.

4. An alternative set of potential predictors, defined by averaging each predictor over a range of lags, is also considered. The results are qualitatively similar and hence are not reported.

5. If the probability estimates never exceed the threshold, I consider a model with more predictors (larger d) than the base case. This often gives higher probability estimates, but not high enough to cross the thresholds.
