Comparing smooth transition and Markov switching autoregressive models of US unemployment



Logistic smooth transition and Markov switching autoregressive models of a logistic transform of the monthly US unemployment rate are estimated by Markov chain Monte Carlo methods. The Markov switching model is identified by constraining the first autoregression coefficient to differ across regimes. The transition variable in the LSTAR model is the lagged seasonal difference of the unemployment rate. Out-of-sample forecasts are obtained from Bayesian predictive densities. Although both models provide very similar descriptions, Bayes factors and predictive efficiency tests (both Bayesian and classical) favor the smooth transition model. Copyright © 2008 John Wiley & Sons, Ltd.


US unemployment is characterized by relatively brief periods of rapid economic contraction (rising unemployment) followed by relatively extended periods of slow economic expansion (falling unemployment). In the recent literature, several studies (Rothman, 1998; Montgomery et al., 1998; Koop and Potter, 1999; Van Dijk et al., 2002) have attempted to capture this salient feature by means of well-known nonlinear models such as threshold autoregression (TAR), the closely related logistic smooth transition autoregression (LSTAR), and Markov switching autoregression (MSAR). The first two papers attempt to base model choice on a comparison of forecasting performance.

There are two important conceptual differences between the MSAR and the TAR or LSTAR models. First, MSAR incorporates less prior information than TAR and LSTAR. Indeed, a filtered or smoothed regime probability in an MSAR model can be interpreted as a transition function which is estimated flexibly from the data. By contrast, specifying the transition function in a TAR or LSTAR model necessitates the choice of a transition variable (a difficult problem). Secondly, regime changes are predetermined in a TAR or LSTAR model, but are exogenous in the MSAR: in the latter, even if the model parameters were known, these changes could not be predicted with certainty from past data due to the presence of additional disturbances (in the Markov evolution equation). It is of interest to investigate whether the added flexibility and complexity of the MSAR model result in a superior predictive ability.

For the reasons given by West and McCracken (1998), it may be important to base such an investigation on small-sample predictive densities that take parameter uncertainty into account. In the context of maximum likelihood estimation, this requires using the bootstrap, and involves the repeated estimation of nonlinear models by local optimization algorithms. Convergence difficulties may make this impractical: see, for example, Chan and McAleer (2002, 2003). These difficulties do not arise if Bayesian methods are used: Markov chain Monte Carlo (MCMC) is used for simulating the joint posterior, and the resulting parameter replications are used for dynamic simulations of future observations.

In a Bayesian context, a prior identification constraint on the parameters of Markov switching models should be imposed; if this is not done, the posterior is multimodal (although poor mixing in the MCMC sampler may spuriously hide this multimodality). This constraint can take the form θ1 < θ2 < ⋯ < θK, where θi is a particular population parameter in regime i and K is the number of regimes. The permutation sampler proposed by Frühwirth-Schnatter (2001) provides an effective procedure both for choosing an appropriate prior identification constraint, and for subsequently imposing this constraint. An MCMC posterior simulator for LSTAR models has been proposed by Lopes and Salazar (2006).

Among the authors mentioned in the first paragraph, only Montgomery et al. (1998) investigate the forecasting performance of an MSAR model; and only Koop and Potter (1999) fully rely on Bayesian methods. Even though the MSAR model in Montgomery et al. (1998) is estimated by the Gibbs sampler, the presentation is frequentist: the authors do not present their prior specification (including those aspects of the prior that are relevant for model identification) and only discuss point estimates and point forecasts. Koop and Potter (1999) provide a thorough Bayesian treatment of a TAR model of US unemployment; however, they do not compare its forecasting performance with that of an MSAR model.

An LSTAR formulation can approximate the TAR models used in Montgomery et al. (1998) and Koop and Potter (1999), but can be estimated with standard econometric software (unlike the TAR and MSAR models), and is therefore particularly convenient. Van Dijk et al. (2002) are the only authors to have investigated the forecasting accuracy of an LSTAR model of the US unemployment rate. However, they do not update the parameter estimates as new observations become available, presumably for the reasons given in our third paragraph.

As pointed out by Koop and Potter (1999), there are important benefits in using a logistic transformation of the unemployment rate. This transformation not only guarantees that predictions are restricted to the unit interval (an important consideration if the emphasis is on predictive densities), but also removes the strong residual leptokurticity which plagues a model estimated from untransformed data. Among the four contributions mentioned in the first paragraph, only the paper by Koop and Potter (1999) uses such a transformation.

On these grounds, and since the permutation sampler has only recently become available, it may be argued that the potential of the LSTAR and MSAR models for predicting the US unemployment rate should be examined in more detail, and that true Bayesian predictive densities should be used in the investigation. This is the twofold objective of this paper.

An outline follows. Section 2 presents an MCMC posterior simulator for LSTAR models. It differs from the previous one in two respects. First, an independence Metropolis–Hastings chain is used, rather than the random walk chain used by Lopes and Salazar (2006). Secondly, the autoregressive order p and transition delay parameter d are assumed to be fixed (whereas one of the algorithms proposed by Lopes and Salazar is defined on a space that includes p and d). In our approach, we propose to choose p and the transition variable (or function) according to the criterion of highest marginal likelihood; some potential advantages are discussed. Section 2 therefore also describes our application of the bridge sampling method of Meng and Wong (1996) to the estimation of marginal likelihoods in a STAR model.

Section 3 briefly describes the MCMC estimation of the MSAR model and the bridge sampling estimation of marginal likelihoods for this model.

Section 4 presents estimated marginal likelihoods for 54 possible LSTAR, MSAR, and autoregressive (AR) models, where the dependent variable is a logistic transformation of the monthly US unemployment rate; a sensitivity analysis with respect to the prior parameters is done.

Section 5 discusses Bayesian misspecification diagnostics for the LSTAR and MSAR models that were found, in Section 4, to have the highest marginal likelihoods. The diagnostics are based on posterior predictive p-values for three relevant misspecification indicators.

Section 6 presents the MCMC estimates of the chosen LSTAR model and of its MSAR counterpart; some economic implications of the estimates are discussed.

Section 7 presents, for comparison purposes, maximum likelihood estimates of the models in Section 6, and diagnostics based on generalized residuals.

Finally, Section 8 attempts to discriminate between the MSAR, the LSTAR, and a benchmark AR model by means of simulated out-of-sample prediction exercises. For each model, Bayesian predictive densities are estimated from expanding windows of observations and for horizons of 1 to 6 months. Diagnostics based on probability integral transforms (Diebold et al., 1998; Berkowitz, 2001), on one of the test statistics proposed by Diebold and Mariano (1995), and on efficiency tests based on regressions of observations on point predictions are reported; versions of the efficiency tests are analyzed from both classical and Bayesian standpoints. Section 9 concludes.


The two-state LSTAR model introduced by Teräsvirta (1994) may be written as

$$y_t \;=\; \alpha_1 + \sum_{j=1}^{p} \phi_{1j}\, y_{t-j} \;+\; G(s_t, \gamma, c)\left(\alpha_2 + \sum_{j=1}^{p} \phi_{2j}\, y_{t-j}\right) + u_t \qquad (1)$$


$$G(s_t, \gamma, c) \;=\; \left[\,1 + \exp\!\left(-\frac{\gamma^{2}}{r}\,(s_t - c)\right)\right]^{-1} \qquad (2)$$

where r is a scaling constant that can be set equal to the sample standard deviation of the observable variable st, and where, conditional on st and yt−1, …, yt−p, ut is a random disturbance with distribution N(0, σ²). G(st, γ, c) is called the transition function; st the transition variable; γ the shape parameter; and c the location parameter. The model implies transitions between the two regimes where G(st, γ, c) = 0 (which tends to occur when st < c) and G(st, γ, c) = 1 (which tends to occur when st > c). When γ becomes large, G(st, γ, c) tends to the step function postulated by a two-state TAR model. In (2), the division of γ² by r ensures that γ has a comparable order of magnitude across competing models.
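As a concrete illustration, the transition function in (2) can be sketched in a few lines of Python (an illustrative sketch, assuming numpy; the function name is hypothetical):

```python
import numpy as np

def transition(s, gamma, c, r):
    """Logistic transition function G(s, gamma, c) of equation (2).

    The shape parameter enters as gamma**2 / r, so that gamma has a
    comparable order of magnitude across competing models."""
    return 1.0 / (1.0 + np.exp(-(gamma**2 / r) * (s - c)))

# Behavior around the location parameter c:
s = np.array([-2.0, 0.0, 2.0])                 # transition variable values
g = transition(s, gamma=3.0, c=0.0, r=1.0)
# G equals 0.5 exactly at s = c, is close to 0 below c and close to 1
# above c; as gamma grows, G approaches the step function of a TAR model.
```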

The MCMC algorithm of this section iterates on the full conditional posteriors of the vector:

$$\beta \;=\; (\alpha_1,\; \phi_{11}, \ldots, \phi_{1p},\; \alpha_2,\; \phi_{21}, \ldots, \phi_{2p})' \qquad (3)$$

of σ², and of (γ, c), using the most recently drawn conditioning values.

If γ and c are known, equation (1) collapses to the usual regression model y = Xβ + u, where row t of the T × (2p + 2) matrix X has the form

$$x_t' \;=\; \big(1,\; y_{t-1}, \ldots, y_{t-p},\; G_t,\; G_t\, y_{t-1}, \ldots, G_t\, y_{t-p}\big)$$

with Gt ≡ G(st, γ, c). A multinormal prior on β with expectation vector βa and precision matrix Va, and an independent inverted Gamma prior on σ² with parameters a and b, are assumed. It is then straightforward to show that the full conditional posteriors of β and σ² are respectively multinormal and inverted Gamma:

$$\beta \mid \sigma^{2}, \gamma, c, y \;\sim\; N\big(\bar\beta,\; \bar V^{-1}\big) \qquad (4)$$

$$\sigma^{2} \mid \beta, \gamma, c, y \;\sim\; IG\big(\bar a,\; \bar b\big) \qquad (5)$$

where

$$\bar V = V_a + \sigma^{-2} X'X, \quad \bar\beta = \bar V^{-1}\big(V_a \beta_a + \sigma^{-2} X' y\big), \quad \bar a = a + \frac{T}{2}, \quad \bar b = b + \frac{1}{2}\,(y - X\beta)'(y - X\beta).$$
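These two conjugate updates can be sketched as follows (an illustrative sketch, assuming numpy; the function names and the shape/scale parameterization of the inverted Gamma are assumptions, not the author's code):

```python
import numpy as np

def beta_full_conditional(X, y, sigma2, beta_a, V_a):
    """Moments of the multinormal full conditional of beta, as in (4).

    V_a is the prior precision matrix and beta_a the prior expectation."""
    V_bar = V_a + X.T @ X / sigma2                    # posterior precision
    b_bar = np.linalg.solve(V_bar, V_a @ beta_a + X.T @ y / sigma2)
    return b_bar, V_bar

def sigma2_full_conditional(X, y, beta, a, b):
    """Parameters of the inverted-Gamma full conditional of sigma2, as in (5)."""
    resid = y - X @ beta
    return a + 0.5 * len(y), b + 0.5 * resid @ resid

# With a nearly vague prior (V_a ~ 0) the posterior mean reduces to OLS:
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=50)])
y = X @ np.array([1.0, 2.0])
b_bar, _ = beta_full_conditional(X, y, 1.0, np.zeros(2), 1e-12 * np.eye(2))
```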

The full conditional posterior of the parameters of the transition function is nonstandard. Taking independent normal priors N(γa, s²γ) and N(ca, s²c) for γ and c, and writing ϑ = (γ, c), the kernel of the full conditional posterior p(ϑ|β, σ², data) is

$$p(\vartheta \mid \beta, \sigma^{2}, \mathrm{data}) \;\propto\; \exp\!\left[-\frac{(\gamma - \gamma_a)^{2}}{2 s_\gamma^{2}} - \frac{(c - c_a)^{2}}{2 s_c^{2}} - \frac{1}{2\sigma^{2}} \sum_{t=1}^{T} \big(y_t - x_t'\beta\big)^{2}\right] \qquad (6)$$

where xt′ is row t of X, which depends on ϑ through Gt.

The algorithm for simulating ϑ is based on a Metropolis–Hastings independence chain (Tierney, 1994), using a multivariate Student candidate-generating density with location and scale parameters based on the following linearization, obtained by taking a first-order Taylor expansion of (1)–(2) around (γ*, c*) and regrouping in the left-hand side those terms that do not depend on γ and c:

$$y_t^{*} \;\approx\; x_t^{*\prime}\,\vartheta + u_t$$

with

$$x_t^{*} \;=\; \frac{\partial \mu_t(\vartheta)}{\partial \vartheta}\bigg|_{\vartheta = \vartheta^{*}}, \qquad y_t^{*} \;=\; y_t - \mu_t(\vartheta^{*}) + x_t^{*\prime}\,\vartheta^{*},$$

where μt(ϑ) denotes the conditional expectation of yt implied by (1)–(2).

The anchor point ϑ* = (γ*, c*) is an approximate solution of the Bayesian update equations:

$$\bar V^{*} \;=\; V_{\vartheta} + \sigma^{-2}\, X^{*\prime} X^{*}, \qquad V_{\vartheta} = \mathrm{diag}\big(s_\gamma^{-2},\, s_c^{-2}\big) \qquad (7)$$

$$\vartheta^{*} \;=\; \bar V^{*\,-1}\big(V_{\vartheta}\,\vartheta_a + \sigma^{-2}\, X^{*\prime} y^{*}\big), \qquad \vartheta_a = (\gamma_a,\, c_a)' \qquad (8)$$

where X* is the T × 2 matrix with row t equal to x_t*′ and where y* is the T × 1 vector with elements y_t*. This approximate solution is obtained from a few iterations on (7) and (8), with starting point given by the prior expectations. A candidate ϑ is drawn from a multivariate Student density with kernel

$$q(\vartheta) \;\propto\; \left[\,1 + \frac{1}{\nu}\,(\vartheta - \vartheta^{*})'\,\bar V^{*}\,(\vartheta - \vartheta^{*})\right]^{-(\nu + 2)/2} \qquad (9)$$

and is accepted with probability

$$\min\!\left\{1,\; \frac{p(\vartheta \mid \beta, \sigma^{2}, \mathrm{data})\; q(\vartheta^{\mathrm{old}})}{p(\vartheta^{\mathrm{old}} \mid \beta, \sigma^{2}, \mathrm{data})\; q(\vartheta)}\right\}$$

where ϑold is the most recently drawn vector. If the candidate is rejected, ϑold is retained. The number ν of degrees of freedom in (9) can be chosen by experimentation; in the empirical part of this paper, a value of ν = 3 was chosen and led to acceptance rates of approximately 0.80.
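The acceptance step is the standard independence-chain rule and can be sketched as follows (an illustrative sketch with hypothetical scalar log-kernels; computing in logs avoids numerical underflow):

```python
import math

def mh_independence_accept_prob(log_target, log_proposal, theta_old, theta_cand):
    """Acceptance probability of a Metropolis-Hastings independence chain:
    min{1, [p(cand) q(old)] / [p(old) q(cand)]}, computed in logs.

    log_target and log_proposal are log-kernels of the full conditional
    posterior and of the (Student) candidate-generating density."""
    log_ratio = (log_target(theta_cand) + log_proposal(theta_old)
                 - log_target(theta_old) - log_proposal(theta_cand))
    return min(1.0, math.exp(log_ratio))

# Toy illustration with made-up scalar kernels:
lp = lambda x: -0.5 * x * x       # target log-kernel
lq = lambda x: -0.25 * x * x      # proposal log-kernel
```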

Lopes and Salazar (2006) parameterize equation (2) in terms of γ rather than γ², and ensure the positivity of γ by specifying a prior with positive support (such as a Gamma distribution) for this parameter. They use a random walk Metropolis–Hastings chain, where two tuning parameters must be specified. An advantage of the method proposed in this section is that its implementation can be automatic: choosing ν = 3 in (9) seems to give uniformly good results, with high acceptance rates and well-mixing chains. This advantage will prove decisive in Section 8, where several thousand MCMC estimations will be needed.

Lopes and Salazar (2006) also present a reversible jump MCMC method where the autoregressive lag order and transition delay parameter are included in the parameter space. By contrast, our method treats p, and the transition variable st, as fixed; it is proposed to investigate the choice of p and st by estimating marginal likelihoods for a range of candidate models. Although less ambitious than reversible jump MCMC, this approach easily allows the comparative investigation of models where st is any transition variable, and G(.) is any transition function (in particular, LSTAR and ESTAR models can easily be compared).

The rest of this section describes the method for estimating marginal likelihoods. It is based on the bridge sampling identity (Meng and Wong, 1996), which allows estimation of the ratio of the normalizing constants of two density kernels with overlapping support. For a model with prior p(θ) and likelihood f(y|θ), and given a bridge function α(θ), the marginal likelihood

$$p(y) \;=\; \int p(\theta)\, f(y \mid \theta)\, d\theta$$

is equal to

$$p(y) \;=\; \frac{E_q\big[\,p(\theta)\, f(y \mid \theta)\, \alpha(\theta)\,\big]}{E_{\theta \mid y}\big[\,q(\theta)\, \alpha(\theta)\,\big]} \qquad (10)$$

where q(θ) is a normalized importance density. The numerator in (10) can be estimated by an average of replications of p(θ)f(y|θ)α(θ) where θ is drawn from q(θ), and the denominator by an average of replications of q(θ*)α(θ*), where θ* is drawn from the posterior.

A good choice of α(θ) is important for numerical efficiency. Meng and Wong (1996) recommend

$$\alpha(\theta) \;\propto\; \big[\,n\, q(\theta) + m\, p(\theta \mid y)\,\big]^{-1} \qquad (11)$$

where we have assumed that n replications from q(θ) and m replications from p(θ|y) are available. Since p(y) appears in (11) but is unknown, the bridge function and the estimate of p(y) must be obtained jointly by an iterative procedure.
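This iteration can be sketched generically. The code below (an illustrative sketch, assuming numpy; not the author's implementation) estimates the normalizing constant of an unnormalized kernel k(θ) = p(θ)f(y|θ), given draws from a normalized importance density q and from the posterior, by iterating on (10) with the bridge function (11):

```python
import numpy as np

def bridge_estimate(k, q, draws_q, draws_post, n_iter=50):
    """Iterative bridge sampling estimate of p(y), the normalizing constant
    of the kernel k, using the Meng-Wong bridge alpha ∝ 1/(n q + m k/p(y))."""
    n, m = len(draws_q), len(draws_post)
    k_q, q_q = k(draws_q), q(draws_q)        # kernels at importance draws
    k_p, q_p = k(draws_post), q(draws_post)  # kernels at posterior draws
    py = 1.0                                 # starting value
    for _ in range(n_iter):
        num = np.mean(k_q / (n * q_q + m * k_q / py))
        den = np.mean(q_p / (n * q_p + m * k_p / py))
        py = num / den
    return py

# Toy check: k is an unnormalized N(0,1) kernel, so p(y) = sqrt(2*pi).
rng = np.random.default_rng(1)
k = lambda x: np.exp(-0.5 * x**2)
q = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)  # normalized N(0,1)
est = bridge_estimate(k, q, rng.normal(size=5000), rng.normal(size=5000))
```

In this degenerate example the iteration reaches the exact answer after one step, because q is proportional to k; in realistic problems the accuracy depends on the overlap between q and the posterior.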

For the STAR model, the author chose an importance density q(θ) having the same parametric form as the prior, but with moments that match the empirical posterior moments obtained by MCMC. This choice is appropriate when the marginal posteriors are unimodal, and gave very good results in this case.

Bridge sampling has been shown by Frühwirth-Schnatter (2004) to include as special cases other well-known methods for estimating p(y), such as the method of Gelfand and Dey (1994). The method proposed by Chib and Jeliazkov (2001) has also been shown by Mira and Nicholls (2004) to be a special case of bridge sampling.


The MSAR counterpart of (1)–(2) is

$$y_t \;=\; (1 - G_t)\left(\gamma_1 + \sum_{j=1}^{p} \beta_{1j}\, y_{t-j}\right) + G_t\left(\gamma_2 + \sum_{j=1}^{p} \beta_{2j}\, y_{t-j}\right) + u_t \qquad (12)$$

where γ1 = α1, β1j = ϕ1j, γ2 = α1 + α2, and β2j = ϕ1j + ϕ2j for j = 1, …, p. Given (Gt, yt−1, yt−2, …, yt−p), ut has the distribution N(0, σ²). The parameters αi and ϕij have the same interpretation as in (1), but Gt is here a discrete random variable with a value of zero in the first regime and a value of unity in the second. The prior on (G1, …, GT) is first-order Markov, with P[Gt = 0|Gt−1 = 0] = p and P[Gt = 1|Gt−1 = 1] = q; independent uniform hyperpriors on p and q are assumed.

The MCMC estimation of this model is now well established; see, for example, Albert and Chib (1993), and Chib (1996). Assuming conjugate priors, the full conditional posterior of the regression coefficients is multivariate normal; that of σ2 is inverted Gamma; those of p and q are Beta; and draws of (G1, …, GT) can be obtained by a simulation smoother. The identification of (12), however, has only recently been adequately discussed in the literature. Frühwirth-Schnatter (2001) proposes the permutation sampler, which comes in an unconstrained and a constrained version. In the unconstrained version, each pass of the Gibbs sampler is followed by a random permutation of the regime definitions; this guarantees a balanced sample from the unconstrained posterior. An examination of this sample is used to suggest an appropriate identification constraint; in model (12), one may choose a single constraint of the form α2 > 0, ϕ2j > 0 for one j∈{1, …, p}, or p < q. This examination is followed by an application of the constrained version of the sampler, where the chosen identification constraint is imposed. More details can be found in Frühwirth-Schnatter (2006), where Bayesian and classical methods for estimation and specification search in finite mixture models such as (12) are extensively reviewed.

It is important to note that the permutation sampler requires a prior that is invariant with respect to relabeling. In the context of (12), this implies a prior with p(γ1, γ2) = p(γ2, γ1) and p(β1j, β2j) = p(β2j, β1j) for all j = 1, …, p. When the prior on the regression coefficients is normal, it is of course a simple matter to translate such a prior into an equivalent prior on the αi and ϕij.

Frühwirth-Schnatter (2004, 2006, Ch. 5) recommends using bridge sampling for estimating the marginal likelihood of this model from the unconstrained permutation sampler output. The unconstrained marginal likelihood is proportional to the constrained one, with a factor of proportionality equal to the factorial of the number of regimes; this implies that computing the constrained marginal likelihood is unnecessary, and also that marginal likelihoods cannot be used to investigate the adequacy of an identification constraint.

In the bridge sampling identity (10), the likelihood f(y|θ) is needed. It can easily be computed by integrating out the latent variables (G1, …, GT), as follows:

$$f(y \mid \theta) \;=\; \prod_{t=1}^{T} f(y_t \mid y^{t-1}, \theta) \qquad (13)$$


$$f(y_t \mid y^{t-1}, \theta) \;=\; \sum_{i=0}^{1} P[\,G_t = i \mid y^{t-1}, \theta\,]\; f(y_t \mid G_t = i,\; y^{t-1}, \theta)$$

where y^{t−1} contains all observations y_s up to time t − 1, and where the conditional probabilities of the regimes can be evaluated by the filter described in Hamilton (1994, pp. 692–693).
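A minimal two-state version of this filter can be sketched as follows (an illustrative sketch, assuming numpy; `dens` would hold the regime-conditional Gaussian densities implied by (12)):

```python
import numpy as np

def hamilton_loglik(dens, p, q):
    """Log-likelihood of a two-state Markov switching model via the
    Hamilton filter. dens[t, i] is f(y_t | G_t = i, y^{t-1}); p and q are
    the staying probabilities P[G_t=0|G_{t-1}=0] and P[G_t=1|G_{t-1}=1]."""
    P = np.array([[p, 1 - p], [1 - q, q]])    # transition matrix (rows = from)
    # ergodic distribution of the chain as the initial condition
    prob = np.array([(1 - q) / (2 - p - q), (1 - p) / (2 - p - q)])
    loglik = 0.0
    for t in range(dens.shape[0]):
        pred = prob @ P                       # P[G_t = i | y^{t-1}]
        f_t = pred @ dens[t]                  # predictive density of y_t
        loglik += np.log(f_t)
        prob = pred * dens[t] / f_t           # filtered P[G_t = i | y^t]
    return loglik

# Sanity check: if both regimes imply the same observation density, the
# filter collapses to the single-regime likelihood.
d = np.abs(np.random.default_rng(2).normal(size=(30, 1))) + 0.1
dens = np.column_stack([d, d])
ll = hamilton_loglik(dens, 0.9, 0.8)
```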

The construction of the importance density q(θ) in (10) requires care, due to the multimodality of the unconstrained posterior. The author adopts the suggestion of Frühwirth-Schnatter (2004) to construct q(θ) from a discrete mixture of transition kernels. Specifically, a sample G(1), …, G(S), with G(s) = (G1(s), …, GT(s)), is drawn from the unconstrained posterior. The importance density is

$$q(\theta) \;=\; \frac{1}{S}\sum_{s=1}^{S} k_1\big(p, q \mid G^{(s)}\big)\; k_2\big(\beta \mid \sigma^{2}_{(s)}, G^{(s)}\big)\; k_3(\sigma^{2}) \qquad (14)$$

where k1(.) is the full conditional posterior of the transition probabilities, k2(.) is the full conditional posterior of the regression coefficients, k3(σ²) is an inverted Gamma whose moments match the empirical moments of the MCMC output for σ², and σ²(s) is the previous draw from the full conditional posterior of σ². In practice, for a two-state model, setting S = 20 will suffice.


We will now apply the methods of the two preceding sections to LSTAR and MSAR models of the unemployment rate. The dependent variable is yt = ln[0.01Ut/(1 − 0.01Ut)], where Ut is the civilian male (over 20 years old) deseasonalized monthly US unemployment rate (in percentage points) from 1960:1 to 2004:12, taken from the LRMT20 series in the Haver Analytics USECON database.
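The transformation and its inverse are elementary; as a quick sketch (illustrative Python, not part of the original analysis):

```python
import math

def logistic_transform(U):
    """Map an unemployment rate U (in percentage points) to
    y = ln[0.01*U / (1 - 0.01*U)], the model's dependent variable."""
    u = 0.01 * U
    return math.log(u / (1.0 - u))

def inverse_transform(y):
    """Map a prediction of y back to a rate in (0, 100)."""
    return 100.0 / (1.0 + math.exp(-y))
```

Because the inverse is a bounded monotone map, any predictive density for y induces a density for Ut supported on the admissible range, which is the point made by Koop and Potter (1999).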

Van Dijk et al. (2002) suggest using as transition variable the lagged seasonal difference of the unemployment rate. Indeed, Figure 1 reveals that the series st = Ut−1 − Ut−13 closely reproduces the business cycle, with low values of st corresponding to expansions and high values of st to contractions. However, Van Dijk et al. (2002) do not use the logistic transformation of Ut, and their estimates (equations 45 and 46 in their paper) imply an extreme lack of normality in the residuals, with a p-value of 5.9 × 10⁻⁵ for the Bera–Jarque statistic.

Figure 1. Unemployment series

Table I reports the logarithms of the marginal likelihoods for AR, LSTAR, and two- and three-state MSAR models with dependent variable yt, for various values of the autoregressive order p and of the transition delay parameter d in st = Ut−d − Ut−d−12. The prior parameters were as follows. For the MSAR models, the autoregression coefficients (βij in (12)) are independent standard normal, and the intercepts γi are independent N(0, 0.01). For the LSTAR models, the prior on the intercepts (α1, α2) is bivariate normal with null expectation vector, variances of V(α1) = 0.01 and V(α2) = 0.02, and correlation coefficient −1/√2; and the priors on the autoregression coefficients (ϕ1j, ϕ2j) are bivariate normal with null expectation vector, variances of V(ϕ1j) = 1 and V(ϕ2j) = 2, and correlation coefficient −1/√2. This choice ensures that the priors on the regression coefficients are identical for the MSAR and the LSTAR models. The prior parameters on γ and c are γa = 3, with prior variance s²γ = 0.1, and ca = 0. For the AR models, the prior on the intercept is N(0, 0.01), and the priors on the autoregression coefficients are N(0, 1). In all models, the inverted Gamma prior on σ² is almost improper, with a = b = 10⁻⁶, and prior coefficient independence across covariates is assumed.

Table I. Logarithmic marginal likelihoods and numerical standard errors
                   p = 1      p = 2      p = 3      p = 4      p = 5      p = 6
AR               919.566    916.410    927.651    937.031    937.795    935.637
2-state MSAR     932.093    946.035    949.597    945.624    941.595    937.158
3-state MSAR     934.640    941.001    943.708    939.379    934.560    929.930
LSTAR, d = 1     935.575    956.206    955.722    952.162    946.854    941.455
LSTAR, d = 2     936.308    954.752    953.807    951.360    946.043    940.856
LSTAR, d = 3     928.488    950.139    951.096    948.494    943.463    938.140
LSTAR, d = 4     926.544    945.231    947.103    946.198    941.172    935.886
LSTAR, d = 5     922.249    938.212    942.523    941.437    937.365    932.142
LSTAR, d = 6     919.496    933.870    938.681    938.802    934.502    929.884

An examination of Table I confirms that bridge sampling is very efficient, especially for the AR and LSTAR models. Among the AR specifications, p = 5 is preferred. Among the LSTAR specifications, the evidence is in favor of (p = 2, d = 1), with (p = 3, d = 1) being a close contender. The Bayes factor in favor of the first of the two models is exp(956.206 − 955.722) = 1.623; on the Jeffreys scale, the evidence against the less parsimonious model is ‘not worth more than a bare mention’ (Jeffreys, 1961, Appendix B). Indeed, assuming prior odds of unity, the posterior probability of (p = 2, d = 1) against (p = 3, d = 1) is 1.623/2.623 = 0.619. In the MSAR case, a two-state model with p = 3 is very clearly preferred.
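The Bayes factor arithmetic in this paragraph can be reproduced directly (a trivial sketch):

```python
import math

def bayes_factor(log_ml_1, log_ml_2):
    """Bayes factor in favor of model 1, from logarithmic marginal likelihoods."""
    return math.exp(log_ml_1 - log_ml_2)

def posterior_prob(log_ml_1, log_ml_2, prior_odds=1.0):
    """Posterior probability of model 1 under the given prior odds."""
    odds = prior_odds * bayes_factor(log_ml_1, log_ml_2)
    return odds / (1.0 + odds)

# (p = 2, d = 1) versus (p = 3, d = 1) in Table I:
bf = bayes_factor(956.206, 955.722)       # about 1.623
prob = posterior_prob(956.206, 955.722)   # about 0.619
```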

Marginal likelihoods are well known to be sensitive to the prior specification (and are not defined for improper priors). In Table II, we therefore present some marginal likelihood estimates obtained when all the prior variances (except that of σ2) are doubled. The marginal likelihoods are indeed uniformly lower, but the ranking between models remains unchanged.

Table II. Logarithmic marginal likelihoods and numerical standard errors with looser priors
                   p = 1      p = 2      p = 3      p = 4      p = 5      p = 6
AR               919.136    915.639    926.566    935.586    935.997    933.493
2-state MSAR     931.312    944.596    947.571    942.833    938.350    932.123
3-state MSAR     933.653    938.641    940.267    935.681    929.097    928.811
LSTAR, d = 1     934.715    954.603    953.354    949.074    943.098    937.012
LSTAR, d = 2     935.410    953.186    951.411    948.264    942.274    936.393
LSTAR, d = 3     927.619    948.612    948.802    945.452    939.741    933.724

The conclusions that emerge from Tables I and II are clear. First, Bayes factors present very strong evidence against linearity. Secondly, the LSTAR model with st = Ut−1 − Ut−13 is uniformly, and very strongly, preferred to the two- and three-state MSAR models. Other choices of st, such as Ut−1 − Ut−d−1 for various values of d, yielded lower marginal likelihoods; the same was true for the transition variable (yt−1 − yt−6)/5, found by Koop and Potter (1999) to maximize the posterior odds in their TAR model, albeit for a different sample. An ESTAR model was also tried, with inferior results.

Even though care was taken to ensure comparable priors across models, an element of uncertainty due to prior sensitivity remains. Also, the marginal likelihood criterion is known to favor parsimony; if one treats the discrete latent variables in the MSAR model as unknown parameters, LSTAR becomes much more parsimonious than MSAR. In the author's opinion, an ultimate comparison of both models should therefore rely on complementary evidence. Such evidence might be provided by posterior predictive p-values; this is the topic of the next section.


Let si(x, θ), for i = 1, …, n, be a collection of statistics commonly used for misspecification testing, x being a vector of data assumed to be generated by a given model with parameter vector θ. In the applications of this section, si(x, θ) will be based on the generalized residuals ut(x, θ), defined as

$$u_t(x, \theta) \;=\; \Phi^{-1}\big(P[\,X_t \le x_t \mid x^{t-1}, \theta\,]\big) \qquad (15)$$

where Φ(.) is the standard normal distribution function, and x^{t−1} contains all observations x_s up to time t − 1. If the probabilities on the right-hand side of (15) are indeed those implied by the process generating x, the generalized residuals are independent standard normal. In the LSTAR model, ut(x, θ) is simply the standardized residual implied by (1). In the MSAR model (12), we have

$$u_t(x, \theta) \;=\; \Phi^{-1}\!\left(\sum_{i=0}^{1} P[\,G_t = i \mid x^{t-1}, \theta\,]\;\Phi\!\left(\frac{x_t - \mu_{it}}{\sigma}\right)\right) \qquad (16)$$

where μit is the conditional expectation of xt in regime i, and the conditional probabilities of the regimes are computed as in (13).

The predictive distribution of si(x, θ) can be readily simulated if a posterior sample from p(θ|y) is available: for each replication θ from this sample, one simply generates x by recursively simulating the model, computes the generalized residual series {ut(x, θ)}, and computes the resulting si(x, θ) for i = 1, …, n. The posterior predictive p-value for statistic i is defined as

$$p_i \;=\; P\big[\, s_i(x, \theta) \ge s_i(y, \theta) \,\big|\, y \,\big]$$

and is estimated as the percentage of replications of si(x, θ) that exceed the posterior average of the values computed from the actual data. More details on this approach can be found in Gelman and Meng (1996) or Koop (2003).
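Given posterior replications of the statistics, the estimator just described is a simple Monte Carlo average (a generic sketch, assuming numpy; the function name is hypothetical):

```python
import numpy as np

def posterior_predictive_pvalue(stat_replicated, stat_observed):
    """Estimate the posterior predictive p-value as the share of replicated
    statistics s_i(x, theta) exceeding the posterior average of the
    statistics computed from the actual data."""
    threshold = np.mean(stat_observed)
    return np.mean(np.asarray(stat_replicated) > threshold)

# Deterministic illustration with made-up values:
pv = posterior_predictive_pvalue([0.5, 1.5, 2.5, 3.5], [2.0])
```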

We will compute three statistics si(x, θ) from the generalized residual series {ut(x, θ)}, namely:

  • 1. the Bera–Jarque statistic, used as an indicator of error non-normality;
  • 2. the F statistic for the nullity of the autoregression coefficients in an AR(12) model of the generalized residuals, used as an indicator of error autocorrelation and denoted by AC(12);
  • 3. the F statistic for the nullity of the autoregression coefficients in an AR(12) model of the squared generalized residuals, used as an indicator of error conditional heteroscedasticity and denoted by ARCH(12).
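Of these indicators, the Bera–Jarque statistic has a simple closed form; the sketch below (an illustration assuming numpy, not the author's code) computes it from a residual series:

```python
import numpy as np

def bera_jarque(u):
    """Bera-Jarque normality statistic T*(S^2/6 + (K-3)^2/24), where S and K
    are the sample skewness and kurtosis of the series u. Under normality it
    is asymptotically chi-square with 2 degrees of freedom."""
    u = np.asarray(u, dtype=float)
    T = len(u)
    d = u - u.mean()
    m2 = np.mean(d**2)
    S = np.mean(d**3) / m2**1.5      # sample skewness
    K = np.mean(d**4) / m2**2        # sample kurtosis
    return T * (S**2 / 6.0 + (K - 3.0)**2 / 24.0)
```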

Table III presents the estimated posterior predictive p-values for each statistic and for the LSTAR and MSAR models that were found, in Section 4, to have the highest marginal likelihoods. The MSAR model was identified using the first autoregression coefficient: the constraint ϕ21 > 0 in (12) was very clearly suggested by the output of the unconstrained permutation sampler. In each case, the estimated p-values are based on 10,000 replications. The estimated predictive distributions of the statistics were quite close to the relevant asymptotic chi-square and F distributions, suggesting that asymptotic approximations would be reasonable.

Table III. Posterior predictive p-values
                          Bera–Jarque     AC(12)     ARCH(12)
LSTAR (p = 2, d = 1)         0.6718       0.0935       0.0066
LSTAR (p = 3, d = 1)         0.5872       0.0681       0.0254
2-state MSAR (p = 3)         0.7330       0.0149       0.0468

The estimated p-values in Table III are all larger than 0.01, with the exception of that for ARCH(12) in the LSTAR model with p = 2. So, there is weak evidence of conditional heteroscedasticity in this model.

Since, as discussed in Section 4, the posterior probabilities of the two preferred LSTAR formulations are nearly equal, it is legitimate to use other evidence, such as the one provided in this section, for discriminating between these two models (additional evidence in favor of p = 3 will be given in Section 8). There is another, more subjective, reason for preferring a three-lag LSTAR model: since our ultimate aim is the comparison of LSTAR and MSAR models, it makes sense to choose the same autoregressive order for both. It might otherwise be difficult to disentangle those differences that are due to model order from those that are due to model class.

For these reasons, and for the sake of brevity, we will concentrate in the sequel on the 2-state MSAR model with p = 3 and on the LSTAR model with p = 3 and st = Ut−1 − Ut−13. However, we will occasionally mention results obtained with the two-lag LSTAR model when this appears noteworthy.


In this section, we will present Markov chain Monte Carlo estimates of the LSTAR model (1)–(2) with p = 3 and st = Ut−1 − Ut−13, and of the MSAR model (12) with p = 3 and the identification constraint ϕ21 > 0. The prior distributions are the same as those used for estimating the marginal likelihoods in Table I, and were described in the third paragraph of Section 4.

In Table IV, θα denotes the estimated posterior quantile at probability α and sθ is the estimated posterior standard error. The regression coefficient estimates for both models are generally similar, even though some differences can be noticed (in particular, LSTAR predicts regime switching in the second autoregression coefficient, but this is not the case for MSAR since the confidence interval for ϕ22 contains zero in the second model but not in the first). In the MSAR model, Gt = 0 can be associated with expansion and Gt = 1 with contraction. In the LSTAR model, the data do not appear to be very informative on the shape parameter of the transition function, since the posterior results for γ almost reproduce the N(3, 0.1) prior for this parameter: the 95% prior confidence interval is [2.38,3.62] and the p-value of a Bera–Jarque statistic for testing the normality of the posterior replications of γ is 0.39. This is not surprising, since a wide range of values of γ will typically lead to similar shapes of the transition function, as noted by Teräsvirta (1994); indeed, this reason is commonly invoked to explain the failure of likelihood maximization algorithms in STAR models.

Table IV. Posterior replication summaries for the LSTAR and MSAR models
                        LSTAR                                  MSAR
           θ0.025    θ0.5    θ0.975     sθ       θ0.025    θ0.5    θ0.975     sθ
α1         −0.082   −0.045   −0.006   0.019      −0.078   −0.034    0.013   0.023
ϕ13        −0.055    0.054    0.164   0.056       0.027    0.151    0.280   0.065
α2         −0.115   −0.045    0.023   0.035      −0.117   −0.052    0.019   0.034
ϕ22        −0.591   −0.312   −0.027   0.143      −0.362   −0.098    0.142   0.127
ϕ23        −0.487   −0.294   −0.101   0.098      −0.656   −0.467   −0.261   0.100
σ² × 100    0.128    0.145    0.163   0.009       0.114    0.130    0.148   0.009
p                                                 0.921    0.966    0.987   0.017
q                                                 0.868    0.941    0.978   0.029

In Figure 2, we present kernel density estimates of the marginal posteriors of the intercepts in both states (α1 and α1 + α2) and of the sums of the autoregression coefficients in both states (Σj ϕ1j and Σj (ϕ1j + ϕ2j)). Both models predict a unit root during expansions and a highly persistent process during contractions (the results obtained with the two-lag LSTAR model were almost identical). The presence of unit roots, or roots very close to unity, implies that it would be misleading to speak of an equilibrium unemployment rate in a particular regime.

Figure 2. Kernel density estimates

It is also of interest to compare the estimated transition function of the LSTAR model with the smoothed probability of the second regime in the MSAR model, which is P[Gt = 1|y1, …, yT]. This probability can be estimated from the MCMC output by the percentage of replications of Gt having a value of unity. The two functions are plotted in Figure 3. They are generally similar, but the ‘transition function’ of the MSAR model appears better able to anticipate turning points. This is not too surprising, since the smoothed probability is conditional on all the sample observations whereas the LSTAR transition function uses only past observations. For comparison, the NBER business cycles (available from the NBER website, accessed on January 4, 2007) are also reported as shaded areas.

Figure 3. Transition functions

Finally, Figure 4 reports a kernel density estimate of the ergodic probability of contraction in the MSAR model, given by (1 − p)/(2 − pq); of the expected duration of an expansion, given by 1/(1 − p); and of the expected duration of a contraction, given by 1/(1 − q). The posteriors of the expected durations are strongly leptokurtic and skewed to the right, with a median of approximately 30 months for expansions and 17 months for contractions.
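These long-run quantities are elementary functions of (p, q), so each posterior replication of (p, q) yields one replication of each quantity. Applying them to the posterior medians from Table IV (an illustrative sketch) roughly reproduces the reported durations:

```python
def long_run_implications(p, q):
    """Ergodic probability of contraction and expected regime durations
    implied by the staying probabilities p (expansion) and q (contraction)."""
    ergodic_contraction = (1.0 - p) / (2.0 - p - q)
    expected_expansion = 1.0 / (1.0 - p)       # months
    expected_contraction = 1.0 / (1.0 - q)     # months
    return ergodic_contraction, expected_expansion, expected_contraction

# Posterior medians p = 0.966 and q = 0.941 give expected durations of
# roughly 29 and 17 months, close to the posterior medians reported above.
erg, dur_exp, dur_con = long_run_implications(0.966, 0.941)
```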

Figure 4. Long-run implications of MSAR

The MSAR model (12) assumes that the innovation variance σ2 is constant. In a previous version of this paper, the more general model described in Deschamps (2006) was estimated. This model assumes Student-t disturbances with regime-dependent scale parameters. In this more general model, the point estimates of the Student scale parameters are 0.00126 in expansions and 0.00125 in contractions; and the point estimate of the Student degrees of freedom is equal to 94. The point and interval posterior estimates for the other parameters (regression coefficients and transition probabilities) were almost identical to those in Table IV. Together with the predictive p-values reported in Section 5, this confirms that a normal homoscedastic model is appropriate for this sample.


Table V presents maximum likelihood estimates of the preceding three-lag LSTAR model and two-state, three-lag MSAR model. The LSTAR model was estimated using EViews 5.1; the MSAR model was estimated by the EM algorithm described in Smith et al. (2006, section 2.2). In order to ensure the convergence of this algorithm, it proved necessary to concentrate the likelihood with respect to the regression coefficients and equation variance, and to maximize the concentrated log-likelihood L(p, q) by a bivariate grid search on the square [0.9, 0.99] × [0.9, 0.99], with increments of 0.01 for each coordinate; this was followed by a more refined grid search on [0.986, 0.995] × [0.986, 0.995], with increments of 0.001.
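The two-stage grid search used to stabilize the EM algorithm can be sketched as follows; the concentrated log-likelihood L(p, q) is passed in as a black box, and in this sketch the refined grid is centered on the coarse maximizer rather than fixed at [0.986, 0.995]² as in the text:

```python
import itertools
import numpy as np

def two_stage_grid_search(loglik, coarse=(0.90, 0.99, 0.01),
                          fine_width=0.01, fine_step=0.001):
    """Maximize a concentrated log-likelihood L(p, q) by a coarse bivariate
    grid search on a square, followed by a refined search in a window
    around the coarse maximizer."""
    lo, hi, step = coarse
    grid = np.arange(lo, hi + step / 2, step)
    p0, q0 = max(itertools.product(grid, grid), key=lambda pq: loglik(*pq))
    # Refined grid around the coarse maximizer, restricted to valid probabilities.
    fine_p = np.arange(p0 - fine_width, p0 + fine_width + fine_step / 2, fine_step)
    fine_q = np.arange(q0 - fine_width, q0 + fine_width + fine_step / 2, fine_step)
    fine_p = fine_p[(fine_p > 0.0) & (fine_p < 1.0)]
    fine_q = fine_q[(fine_q > 0.0) & (fine_q < 1.0)]
    return max(itertools.product(fine_p, fine_q), key=lambda pq: loglik(*pq))
```

Concentrating the likelihood leaves only the two transition probabilities to search over, which is what makes this brute-force strategy affordable.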

Table V. Maximum likelihood estimates of the LSTAR and MSAR models

Parameter     LSTAR estimate   Std. error   MSAR estimate   Std. error
α1                −0.047          0.020        −0.031          0.020
α2                −0.050          0.038        −0.054          0.029
ϕ22               −0.323          0.146        −0.101          0.105
ϕ23               −0.279          0.100        −0.462          0.078
σ2 × 100           0.145          0.009         0.132          0.008
p                    —              —            0.990          0.006
q                    —              —            0.991          0.006

The regression coefficient estimates in Table V are quite close to the corresponding ones in Table IV. The estimated standard error of γ̂ in Table V is, however, larger than sγ in Table IV, reflecting the relatively tight prior on γ. Another noteworthy observation is the near equality of the estimated persistence probabilities p̂ and q̂ in Table V; furthermore, these estimates are larger than in Table IV. They imply nearly equal ergodic probabilities, and nearly equal expected durations for both cycles. This is at variance with the behavior exhibited in Figure 4, and conflicts somewhat with historical evidence. However, the smoothed probability graph was very similar to the one in Figure 3.

Figures 5 and 6 present distribution graphs and correlograms of the generalized residuals defined in Section 5. Table VI presents p-values of misspecification diagnostics computed from these residuals; the values for an AR(5) model are also reported. They confirm the previous results: there is evidence against linearity, and an LSTAR model with p = 2 appears slightly misspecified.

Figure 5.

LSTAR (p = 3) maximum likelihood residuals

Figure 6.

MSAR maximum likelihood residuals

Table VI. p-Values of misspecification diagnostics (maximum likelihood)

AR (p = 5)        0.0200   0.1013   0.0008
LSTAR (p = 2)     0.6112   0.1068   0.0096
LSTAR (p = 3)     0.6252   0.1279   0.0484
MSAR (p = 3)      0.7602   0.0171   0.0476


8.1. Predictive Densities

Simulating predictive densities from the LSTAR model is a straightforward exercise. Upon defining β as in (3), and writing f(β, γ, c; yt−1, yt−2, yt−3) for the deterministic part of the right-hand side of (1) with p = 3 and the transition variable st evaluated at the forecast date, we generate, for each posterior replication (β(i), γ(i), c(i), σ(i)) and for h = 1, …, H, draws ỹT+h from normal distributions with expectations f(β(i), γ(i), c(i); ỹT+h−1, ỹT+h−2, ỹT+h−3) and standard error σ(i); here ỹT+h−j is the observed value yT+h−j if j ≥ h, and the previously simulated value otherwise.
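A minimal sketch of this recursive simulation, for one posterior replication, is given below; the deterministic part of the model is passed in as a generic mean function, since the exact LSTAR specification (equations (1) and (3)) is not reproduced here:

```python
import numpy as np

def simulate_lstar_path(y, mean_fn, sigma, horizon, rng):
    """One simulated future path for a single posterior replication.
    `y` holds the series observed through time T; `mean_fn` evaluates the
    deterministic part of the model on the (partly simulated) history;
    `sigma` is the innovation standard error for this draw."""
    path = list(y)
    for _ in range(horizon):
        mu = mean_fn(np.asarray(path))       # uses lagged (possibly simulated) values
        path.append(mu + sigma * rng.normal())
    return np.asarray(path[len(y):])         # the H simulated future values
```

Repeating this over all posterior replications yields a sample from the joint predictive density at horizons 1, …, H.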

Simulating predictive densities from the MSAR model can be done by the method of composition (Albert and Chib, 1993). It involves generating simulated paths G̃T+1, …, G̃T+H from the Markov evolution equation, using as initial regime probabilities P[GT = 0|y1, …, yT; θ(i)] and P[GT = 1|y1, …, yT; θ(i)], which can be computed by the filter described in Hamilton (1994, pp. 692–693), using as input θ(i), the ith MCMC replication of the MSAR parameters. Once the future path G̃T+1, …, G̃T+H has been generated, it is a simple matter to generate predictions ỹT+h from equation (12), by a method analogous to that used for the LSTAR model.
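The regime-path step of the composition method can be sketched as follows, assuming p = P[Gt+1 = 0|Gt = 0] and q = P[Gt+1 = 1|Gt = 1], with the filtered probability of the current regime supplied by Hamilton's filter:

```python
import numpy as np

def simulate_regime_path(p, q, prob_G_T_is_1, horizon, rng):
    """Simulate future regimes G_{T+1}, ..., G_{T+H} of a two-state Markov
    chain, drawing the current regime G_T from its filtered probability."""
    g = int(rng.random() < prob_G_T_is_1)    # draw the current regime G_T
    path = []
    for _ in range(horizon):
        stay = q if g == 1 else p            # persistence probability of current regime
        g = g if rng.random() < stay else 1 - g
        path.append(g)
    return path
```

Given a simulated regime path, the observations ỹT+h are then drawn recursively from the regime-specific autoregression, exactly as in the LSTAR case.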

It may be of interest to present an illustration involving two information sets. The first information set (S1) consists of all observations from 1959 : 10 to 1998 : 12; the second (S2) adds all observations through 2001 : 12. Each information set ends approximately 18 months before a turning point. The author reestimated the previous MSAR and LSTAR models, as well as a benchmark AR(5) model of the logistic transform yt, on the samples defined by S1 and S2. For the MSAR and LSTAR models, the prior described in Section 4 was used; the AR model was estimated with the Gibbs sampler and a fully diffuse prior.

In the top panels of Figure 7 (corresponding to S1) and Figure 8 (corresponding to S2), we plot the predictive medians of Ut obtained from all three models through 2005 : 3, together with the realized values. There are substantial differences in the long-run predictive behavior of the three models. However, in each case, convergence to an equilibrium value is not manifest, reflecting the high persistence of the process.

Figure 7.

Median forecasts and LSTAR prediction intervals (information set S1)

Figure 8.

Median forecasts and LSTAR prediction intervals (information set S2)

The bottom panels of Figures 7 and 8 show the bounds of 95% highest density prediction intervals obtained from the LSTAR model, together with the medians and the observed values. The densities are reasonably symmetric at a horizon of one month, but exhibit considerable skewness at intermediate horizons. The lengths of the intervals turned out to be similar for all three models, and appear to be very conservative.
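A highest density interval can be estimated from the simulated predictive draws as the shortest interval containing the required fraction of them; a minimal sketch, valid for unimodal densities:

```python
import numpy as np

def hpd_interval(draws, coverage=0.95):
    """Shortest interval containing `coverage` of the simulated draws --
    a Monte Carlo highest-density interval for a unimodal density."""
    x = np.sort(np.asarray(draws, dtype=float))
    n = len(x)
    k = int(np.ceil(coverage * n))           # number of draws the interval must cover
    widths = x[k - 1:] - x[:n - k + 1]       # width of every candidate interval
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]
```

Applied to the skewed intermediate-horizon predictive densities, this produces the asymmetric interval bounds visible in the bottom panels of Figures 7 and 8.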

8.2. Classical Forecast Evaluation

A formal investigation of the predictive performance of the MSAR, LSTAR, and AR models will now be presented. Even though the approach of this subsection is classical, the forecasts will be based on Bayesian predictive densities. We consider, in all, 295 information sets, each including a training sample consisting of all observations between 1959 : 10 and 1979 : 11. The tth information set is obtained by adding t subsequent monthly observations to this training sample. We estimate the models on each information set, using the same priors as before, and simulate in each case six out-of-sample predictive densities at horizons h = 1, …, 6. The simulated densities are then analyzed using three complementary approaches, which are now discussed.

The first approach applies only to the 295 one-step forecasts (h = 1). We compute the percentage pt of simulated predictions that are less than the observed value yt, and compute the transform ϕt = Φ−1(pt), where Φ(.) is the normal integral. The percentage pt is of course a Monte Carlo estimate of the probability integral transform discussed in Diebold et al. (1998); and the transforms ϕt should be independent standard normal under the correct predictive distribution, as emphasized by Berkowitz (2001). This can be investigated by Kolmogorov–Smirnov tests and by regressing powers of ϕt on their past values.
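This first approach can be sketched as follows; the clamping of pt away from 0 and 1 (needed because only finitely many draws are available) is an implementation detail not discussed in the text:

```python
import numpy as np
from statistics import NormalDist

def pit_transforms(predictive_draws, realized):
    """Monte Carlo probability integral transforms: p_t is the share of
    simulated predictions below the realized value, and phi_t = Phi^{-1}(p_t).
    Under a correct predictive density the phi_t are i.i.d. standard normal
    (Berkowitz, 2001)."""
    inv_cdf = NormalDist().inv_cdf
    phis = []
    for draws, y in zip(predictive_draws, realized):
        draws = np.asarray(draws, dtype=float)
        p = float((draws < y).mean())
        eps = 0.5 / len(draws)               # keep p strictly inside (0, 1)
        phis.append(inv_cdf(min(max(p, eps), 1.0 - eps)))
    return np.asarray(phis)
```

The resulting series of ϕt can then be subjected to Kolmogorov–Smirnov tests and to regressions of its powers on their past values.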

The second approach, proposed by Diebold and Mariano (1995), is based on a comparison of the mean absolute prediction errors, defined as n−1 Σt |yt − ŷt, h, i|, where ŷt, h, i is the median h-step forecast of yt using model i, based on the information set available at t − h. We test the nullity of the expected loss differentials, using heteroscedasticity and autocorrelation-consistent (HAC) variance estimates. Specifically, for (i, j) = (1, 2), (1, 3), (2, 3) and for h = 1, …, 6, we regress the series:

dt, ij, h = |yt − ŷt, h, i| − |yt − ŷt, h, j|

on a constant term cijh, and estimate the variance of ĉijh by the method of Andrews (1991), using a Bartlett kernel and VAR(1) prewhitening. We then test the null hypothesis that cijh = 0 against a two-sided alternative, using the t statistic. When h = 1, it is well known that the forecast errors based on the true model and the true parameter values are independent; however, this is not necessarily so when the forecasts are based on the true model with unknown parameters, so that a semiparametric variance estimate is appropriate even for one-step forecasts.
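A simplified version of the Diebold–Mariano statistic can be sketched as follows; this sketch uses a plain Bartlett-kernel (Newey–West) long-run variance and omits the VAR(1) prewhitening of Andrews's method:

```python
import numpy as np

def diebold_mariano(e_i, e_j, lags):
    """t statistic for equality of mean absolute prediction errors.
    The loss differential d_t = |e_{i,t}| - |e_{j,t}| is regressed on a
    constant, whose variance is estimated by a Bartlett-kernel (Newey-West)
    long-run variance with `lags` autocovariances."""
    d = np.abs(np.asarray(e_i, float)) - np.abs(np.asarray(e_j, float))
    n = len(d)
    dbar = d.mean()
    u = d - dbar
    lrv = (u @ u) / n
    for l in range(1, lags + 1):
        w = 1.0 - l / (lags + 1.0)           # Bartlett weights
        lrv += 2.0 * w * (u[l:] @ u[:-l]) / n
    return dbar / np.sqrt(lrv / n)
```

Under the null of equal predictive accuracy, this statistic is compared with standard normal critical values.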

The third approach is based on the six regressions of yt on a constant term and on the three covariates ŷt, h, 1, ŷt, h, 2, ŷt, h, 3, for each horizon h = 1, …, 6. For the reasons given in the previous paragraph, HAC variance estimation is appropriate, even when h = 1. Following West and McCracken (1998), predictor i will be called efficient relative to the predictors ji at horizon h if a robust F test of yt = ŷt, h, i + ϵt against the unrestricted model does not reject the null hypothesis.
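The efficiency regression can be sketched as follows; this is a homoscedastic F statistic, whereas the paper uses a HAC-robust version, and the helper name is hypothetical:

```python
import numpy as np

def efficiency_fstat(y, forecasts, i):
    """F statistic for the efficiency of predictor i in the style of West
    and McCracken (1998): the unrestricted model regresses y_t on a constant
    and every forecast, the restricted model is y_t = yhat_{t,i} + eps_t."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y))] +
                        [np.asarray(f, float) for f in forecasts])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr_u = float(((y - X @ beta) ** 2).sum())
    ssr_r = float(((y - np.asarray(forecasts[i], float)) ** 2).sum())
    r = X.shape[1]                # all coefficients are pinned under the null
    dof = len(y) - X.shape[1]
    return ((ssr_r - ssr_u) / r) / (ssr_u / dof)
```

Predictor i is then called efficient relative to the others when this statistic falls below the appropriate F critical value.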

The results are in Tables VII–X. The diagnostics in Table VII strongly indicate that the AR(5) model is misspecified; however, none of the diagnostics for the MSAR and LSTAR models are significant at the 1% level. In Table VIII, the mean absolute prediction errors are systematically lowest for the LSTAR model and highest for the AR model; however, none of the Diebold–Mariano statistics are significant.

Table VII. Analysis of probability integral transforms (p-Values)

AR (p = 5)        0.2147   0.0020   0.0005
LSTAR (p = 2)     0.0697   0.0861   0.0329
LSTAR (p = 3)     0.1182   0.0449   0.0151
MSAR (p = 3)      0.3066   0.0462   0.0307
Table VIII. Diebold–Mariano tests (columns: horizon, mean absolute prediction errors, p-values)
Table IX. Efficiency tests with 3-lag LSTAR model
Table X. Efficiency tests with 2-lag LSTAR model

The clearest evidence is provided by the efficiency tests in Tables IX and X. The LSTAR model is the only one which passes the test at all horizons. It is also noteworthy that, according to this efficiency benchmark, the AR model actually performs better than MSAR at horizons 4, 5, and 6. A comparison between the last columns of Tables IX and X leads to the choice of p = 3 rather than p = 2 for the LSTAR model.

The preceding results suggest that all three models can provide a fairly good description of typical unemployment rate data, so that the nonlinear behavior implied by LSTAR and MSAR can be difficult to detect in practice. They also suggest that the first two approaches lack the power needed to discriminate between the LSTAR and MSAR models, at least for the modest sample sizes typically available in macroeconomic data. However, they suggest that the efficiency test is likely to be an effective tool for model discrimination in our case. Finally, the results suggest that studying predictions at horizons greater than one month is of somewhat marginal additional value; in the opinion of the author, this is due to the high persistence of the process that is investigated, which leads to very high correlations between successive forecasts at horizons larger than one, resulting in loss of power.

8.3. Bayesian Forecast Evaluation

A Bayesian implementation of the tests presented in Section 8.2 involves generating, in the spirit of Section 5, simulated series of unemployment data using posterior parameter samples generated by MCMC. One would then compute the statistics of Section 8.2 for each simulated data sample, and estimate predictive p-values.

In order to compute the efficiency statistics of Section 8.2, it is necessary to estimate all models on each information set. If 1000 data samples are generated from LSTAR processes and 1000 from MSAR processes, and 300 forecasts must be generated using three different models, this would necessitate 3 × 2000 × 300 = 1,800,000 MCMC simulations. This is clearly infeasible.

However, a ‘poor man's version’ of the statistics of Section 8.2 is available. If, as the information set expands, one does not update the posteriors on which the predictives are based, the number of required MCMC simulations falls to 3 × 2000 = 6000, which is well within the capabilities of a recent workstation (optimized FORTRAN programs were used). Since some of the available information is excluded in computing the forecasts, there is no particular reason for classical p-values to be larger than the size of an efficiency test, even if the predictive model is correct; but this becomes irrelevant in the Bayesian context, since the actual distribution of the statistic is simulated. For the same reason, it becomes unnecessary to compute HAC versions of the efficiency or Diebold–Mariano statistics.

The author selected two samples of size 1000 from the 10,000 posterior LSTAR and MSAR replications summarized in Table IV, and generated 2000 corresponding unemployment series ranging from 1959 : 10 to 2004 : 12. The MSAR(3), LSTAR(3), and AR(5) models were then estimated by MCMC using the first 243 observations of each artificial sample. One-step-ahead median forecasts of the 300 remaining observations, ranging from 1980 : 1 to 2004 : 12, were then generated from each model, using expanding information sets but without updating the posteriors.

The estimated predictives of the efficiency and Diebold–Mariano statistics turn out to exhibit high asymmetry and extreme outliers. Unfortunately, 95%, and even 90%, highest density intervals are too wide to be useful. However, conventional boxplots are informative when outliers are omitted. The three panels of Figure 9 present the diagrams corresponding to the MSAR and LSTAR efficiency statistics and to the squared LSTAR-MSAR Diebold–Mariano statistic (at horizon 1); within each panel, the left-hand box corresponds to the MSAR data-generating process (DGP) and the right-hand box to the LSTAR one. The vertical separation of the boxes within each of the top two panels can be given an interpretation similar to the classical notion of power, which is the probability under the alternative that the statistic exceeds a quantile of its distribution under the null.

Figure 9.

Bayesian forecast evaluation

In the top two panels of Figure 9, conventional credible intervals overlap to a very large extent, so that none of the ‘poor man's efficiency tests’ can be called powerful. While this may in part be due to the lack of posterior updating, it may also reflect the fact that the ability of forecast evaluation to discriminate between competing models is only asymptotic (as pointed out by a referee). Nevertheless, concentrating on the interquartile ranges, bounded by the edges of the boxes in Figure 9, allows clear conclusions to emerge. First, it appears easier for the MSAR model to emulate the LSTAR DGP than the converse: the vertical separation of the boxes is larger in the middle panel than in the top panel, and the squared Diebold–Mariano statistic (in the bottom panel) tends to be lower under the LSTAR DGP. Secondly, it is only under the LSTAR DGP that the observed values of the three statistics fall within the interquartile predictive ranges. This observation implicitly sets the size of a classical test equal to 25%; however, many classical statisticians would agree that choosing a large value for the size is legitimate when the test is known to lack power.

Bayesian forecast evaluation thus confirms the results obtained with classical tests: since the observed value of the squared Diebold–Mariano statistic does not exceed the third quartile of either predictive distribution in the bottom panel of Figure 9, this test is not useful for discriminating between the two models; however, both the MSAR and LSTAR efficiency tests point to the predictive superiority of the LSTAR model.


This paper has attempted to compare the performance of LSTAR, MSAR, and AR models for forecasting the monthly US unemployment rate. Unlike many past contributions, it has used fully Bayesian methods for estimating predictive densities.

The LSTAR model incorporates strong prior knowledge on the factors determining the onset of transitions between regimes, in the form of a particular transition function and transition variable; by contrast, in the MSAR model, such prior knowledge only consists in a flexible evolution equation. It is therefore not surprising that an appropriate LSTAR model should make better use of available information and perform better on predictive efficiency tests. The fact that both models yield insignificant misspecification diagnostics and have insignificant mean loss differentials, however, is an additional argument in favor of the particular LSTAR model used in this paper: if two formulations with different prior assumptions yield results that are essentially similar, our confidence in the stronger assumptions increases, and one will tend to favor the more structured alternative. In this sense, the MSAR and LSTAR approaches can be said to be cross-validating and complementary.

It has been pointed out by a referee that the basic MSAR model can be extended to specify transition probabilities that depend on past observables. A simple formulation would model the transition probabilities as logistic normal processes. By a suitable choice of parameters, a logistic normal can approximate a Beta distribution; this leads to a natural candidate-generating density in a Metropolis–Hastings step. This was tried by the author, but did not result in a well-behaved MCMC sampler. The reason appears to be the shape of the logistic function, which is nearly constant over a large range of argument values. A more sophisticated approach, followed by Filardo and Gordon (1998), uses a latent probit process to model the transition probabilities and appears to work well in the MCMC context. Unfortunately, this generalization obviously complicates an MCMC sampler, the bridge sampling estimation of marginal likelihoods, and the estimation of predictive densities; it is not certain that the primary aims of this paper could have been achieved with the more general model. In the opinion of the author, the more limited objective that consists in comparing two relatively simple formulations that clearly differ in their prior assumptions on structural change is also interesting, if only to validate the choice of a particular LSTAR transition variable: indeed, in Table I, the ranking between the MSAR and LSTAR models would have been reversed if d = 4 rather than d = 1 had been chosen.

The data used in this paper are available from January 1948. When the MSAR model is estimated using the full sample from 1948 to 2004, a third regime can be identified. In this regime, the equation variance is approximately six times higher than in the other two; and the autoregression coefficients take values that are intermediate between those for expansions and contractions. The probability that this third regime occurs after 1960, however, is estimated to be less than 20% in any period, most estimates being very close to zero. Adding the pre-1960 data to each of the 295 information sets defined in Section 8.2 and using the three-regime MSAR model for conducting the simulated prediction exercises improves neither efficiency nor mean absolute prediction errors. For this reason, the data prior to 1960 were considered to be mostly of historical interest, and the results obtained with the extended MSAR model are not reported.

The evidence in favor of nonlinear dynamics appears to be stronger in this paper than in previous contributions. This may be partly due to our use of a monthly data frequency, rather than the quarterly frequency in, for example, Rothman (1998) and Montgomery et al. (1998). Indeed, in another context, Klaassen (2005) finds that using lower-frequency data can mask evidence in favor of regime switching.

This paper has also illustrated the importance of allowing all the parameters of a hidden Markov model of the unemployment rate to switch between regimes. This can be seen by comparing our results with those obtained by Bianchi and Zoega (1998). By allowing only the intercept term and the equation variance to vary across regimes, these authors all but eliminate the possibility of detecting Markov switching behavior in the US unemployment rate, and conclude in favor of a one-state model for this series; this conclusion is at considerable variance with the results of the present study.

Finally, this paper has illustrated the potential of regressing observations on out-of-sample point predictions for discriminating between non-nested models. It is perhaps surprising that this simple technique, which appears to have been originally suggested by Granger and Ramanathan (1984), has not been used more often in the literature.


The author is indebted to three anonymous reviewers for helpful comments and suggestions.