Subspace shrinkage in conjugate Bayesian vector autoregressions

Summary Macroeconomists using large datasets often face the choice of working with either a large vector autoregression (VAR) or a factor model. In this paper, we develop a conjugate Bayesian VAR with a subspace shrinkage prior that combines the two. This prior shrinks towards the subspace which is defined by a factor model. Our approach allows for estimating the strength of the shrinkage and the number of factors. After establishing the theoretical properties of our prior, we show that it successfully detects the number of factors in simulations and that it leads to forecast improvements using US macroeconomic data.

How should the researcher decide whether to use a factor model or a large Bayesian VAR? This question can be answered through a comparison of their predictive performance in a pseudo out-of-sample forecasting exercise. Alternatively, marginal likelihoods can be used. But pseudo out-of-sample forecast evaluation can be time consuming, and marginal likelihoods can be sensitive to the prior used. In this paper, we develop an alternative method for choosing between factor models and large VARs.
But why is there a need to choose between them when something in between might lead to better forecast performance? This is another question addressed in this paper. We propose a model which shrinks the VAR coefficients towards the coefficients implied by a factor model, leading to a model which combines the two. We do so using a subspace shrinkage prior; see Shin, Bhattacharya, and Johnson (2020). A conventional prior shrinks the posterior of a coefficient towards its prior mean, which is typically zero. In contrast, a subspace shrinkage prior is a prior on function spaces that shrinks towards a class of functions. In the present paper, we choose the class of functions to be a factor model such as the factor-augmented VAR (FAVAR) or the dynamic factor model (DFM). We stay in the class of conjugate priors (although we will discuss how other VAR priors can be accommodated) and, thus, our methods are simple to implement. They do not require the use of computationally demanding Markov Chain Monte Carlo (MCMC) methods, implying that these techniques are useful in very high dimensional models. We develop a method for estimating the weight put on the DFM restrictions and the number of factors involved in these restrictions. The result is a model which combines the large VAR with a DFM in an optimal way. Alternatively, output from our model can be used to select between the large VAR and the DFM and, if the latter is selected, determine the number of factors.
We consider two versions of our subspace VAR prior. First, the subspace prior can be combined with a conventional informative VAR prior such as the popular Minnesota prior (see Doan, Litterman, and Sims (1984), Litterman (1986), Kadiyala and Karlsson (1997), Sims and Zha (1998) and Banbura, Giannone, and Reichlin (2010) for a natural conjugate implementation). We demonstrate that results from such a model can be interpreted as a weighted average of the Minnesota prior VAR and the factor model. Second, the subspace prior can be combined with a non-informative VAR prior. The result is a new Bayesian VAR prior. In contrast to conventional priors, which shrink towards plausible values for the VAR coefficients, our new prior shrinks towards the factor model.
Our approach is illustrated using synthetic as well as real data. In simulations, we show that our framework accurately detects the number of factors if the true number of factors is small. This finding is independent of the model size. In larger dimensions, and for a larger number of true factors, our model slightly underestimates the true number of factors. To investigate the merits of our approach, we apply it to US macroeconomic data. In a forecasting exercise, the different priors which shrink the VAR towards a factor model improve upon a standard BVAR and the DFM. These improvements are pronounced during the global financial crisis and the Covid-19 pandemic.
The remainder of the paper is structured as follows. Section 2 introduces the econometric framework. After providing a brief overview of conjugate Bayesian VARs and DFMs in Sub-section 2.1, we discuss in Sub-sections 2.2 and 2.3 our subspace shrinkage prior, which can be used to force the coefficients of the VAR towards the restrictions implied by a DFM. Sub-section 2.4 discusses how our approach can be used to estimate the number of factors alongside the remaining model parameters and provides a brief overview of posterior simulation. Section 3 applies our techniques to a big US macroeconomic dataset and illustrates their favorable forecasting properties. Section 4 discusses how alternative Bayesian VAR priors and extensions such as stochastic volatility can be incorporated in our techniques. The final section summarizes and concludes the paper.

Conjugate Bayesian VARs and Dynamic Factor Models
Let $\{Y_t\}_{t=1}^{T}$ denote an M-dimensional vector of macroeconomic and financial quantities. The number of time series can be large and the series can, in addition, display substantial co-movements. One popular approach to modeling this panel of time series is to assume that $Y_t$ follows a VAR(p) process:

$$Y_t = A_1 Y_{t-1} + \dots + A_p Y_{t-p} + \varepsilon_t, \tag{1}$$

where $A_j$ (j = 1, ..., p) is an M × M-dimensional coefficient matrix and $\varepsilon_t$ is a zero mean Gaussian shock with variance-covariance matrix Σ.$^{1}$ Equation (1) can be written as a multivariate regression model as follows:

$$Y_t' = X_t' A + \varepsilon_t',$$

where $X_t = (Y_{t-1}', \dots, Y_{t-p}')'$ and $A = (A_1, \dots, A_p)'$ are K(= Mp)-dimensional and K × M-dimensional, respectively. Stacking $Y_t$, $X_t$ and $\varepsilon_t$ allows us to recast the model in full-data form:

$$Y = XA + \varepsilon, \tag{2}$$

with typical t-th row of Y given by $Y_t'$, of X given by $X_t'$ and of ε given by $\varepsilon_t'$.
Notice that the number of VAR coefficients in a = vec(A) is k = KM, which increases sharply with the number of endogenous variables and/or the number of lags. Since T is moderate for typical macroeconomic datasets, shrinkage is necessary to obtain well behaved estimates and to rule out implausible regions of the parameter space (e.g., regions which would imply explosive roots of the VAR process).
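The stacking step can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `build_var_matrices` is hypothetical, the data are random placeholders, and no intercept is included, so that K = Mp.

```python
import numpy as np

def build_var_matrices(data, p):
    """Stack a (T_full x M) array into Y (T x M) and X (T x K) with K = M * p."""
    data = np.asarray(data, dtype=float)
    T_full, M = data.shape
    if p < 1 or p >= T_full:
        raise ValueError("need 1 <= p < number of observations")
    Y = data[p:]  # rows are Y_t' for the observations that have p lags available
    # Row t of X is (Y_{t-1}', ..., Y_{t-p}'): first lag first, then the second, ...
    X = np.hstack([data[p - j:T_full - j] for j in range(1, p + 1)])
    return Y, X

rng = np.random.default_rng(0)
data = rng.standard_normal((50, 3))   # placeholder data: 50 periods, M = 3
Y, X = build_var_matrices(data, p=2)
K, M = X.shape[1], Y.shape[1]
print(Y.shape, X.shape, "k =", K * M)
```

Even in this toy case the coefficient count k = KM grows with both M and p, which is the motivation for shrinkage given above.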
Bayesian priors on a are often used to provide such shrinkage. If M is large, natural conjugate priors are popular since they allow for fast computation. This arises because they preserve a convenient Kronecker structure for the posterior covariance matrix of a; see Chan (2020). The conjugate prior on a is specified conditionally on Σ and takes a Gaussian form:

$$a \mid \Sigma \sim \mathcal{N}\big(\mathrm{vec}(\underline{A}),\, \Sigma \otimes \underline{V}\big).$$

Here, we let $\underline{A}$ denote a prior mean matrix of dimension K × M and $\underline{V}$ is a K × K matrix. The full conditional posterior distribution of a is also Gaussian with:

$$a \mid \Sigma, Y \sim \mathcal{N}\big(\mathrm{vec}(\bar{A}),\, \Sigma \otimes \bar{V}\big), \qquad \bar{V} = (X'X + \underline{V}^{-1})^{-1}, \qquad \bar{A} = \bar{V}\,(X'Y + \underline{V}^{-1}\underline{A}).$$

The prior on Σ is inverted Wishart with prior degrees of freedom $\underline{\nu}$ and scaling matrix $\underline{S}$ which, when combined with the likelihood, yields a marginal posterior which also follows an inverted Wishart distribution whose posterior moments take a standard form (see, e.g., chapter 21 of Chan et al. (2019)).
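As a rough sketch of why conjugacy is computationally cheap, the standard natural conjugate posterior moments (posterior precision equal to X'X plus the prior precision, posterior mean equal to the precision-weighted combination of data and prior mean) need only K × K matrix operations. The function name and the choice to pass the prior precision rather than the prior covariance are assumptions made for this illustration.

```python
import numpy as np

def conjugate_posterior(Y, X, A_prior, V_prior_inv):
    """Posterior mean A_bar (K x M) and row covariance V_bar (K x K).

    V_prior_inv is the K x K prior precision; a zero matrix gives a flat prior.
    """
    V_bar = np.linalg.inv(X.T @ X + V_prior_inv)
    A_bar = V_bar @ (X.T @ Y + V_prior_inv @ A_prior)
    return A_bar, V_bar

rng = np.random.default_rng(0)
X = rng.standard_normal((40, 3))      # placeholder regressors
Y = rng.standard_normal((40, 2))      # placeholder responses
A_bar, V_bar = conjugate_posterior(Y, X, np.zeros((3, 2)), 10.0 * np.eye(3))
print(A_bar.shape, V_bar.shape)
```

Because the posterior covariance of a is Σ ⊗ V_bar, the KM × KM matrix never has to be formed explicitly.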
A conventional Bayesian VAR prior such as the Minnesota prior would make particular choices for $\underline{A}$ and $\underline{V}$. An alternative would be to exploit the fact that the data might feature a factor structure. That is, the information in X might be characterized by a small number q (q ≪ K) of latent factors. These can be estimated using principal components (PCs), which can be implemented through a singular value decomposition (SVD). The SVD allows us to decompose $X = F_q L_q'$ in terms of a T × q matrix $F_q$, which contains the estimated factors, and a K × q matrix $L_q$, which is a matrix of factor loadings. If the matrix X is of rank q, this equation is exact. In general, if the rank of X exceeds q, $F_q L_q'$ approximates X. Replacing X with $F_q L_q'$ in Eq. (2) shows that the corresponding matrix of regression coefficients $B = L_q' A$ is of dimension q × M, a substantial reduction in the dimension of the state space. Using the Moore-Penrose inverse of $L_q$, $L_q^{\dagger}$, allows us to express A in terms of B and the estimated loadings:

$$A = (L_q^{\dagger})' B.$$

This equation enables us to think about a DFM in terms of an otherwise unrestricted VAR with specific restrictions (which are driven by $L_q$) on the VAR coefficients. In a conventional DFM, these restrictions are always dogmatically imposed. In this paper, our goal is to introduce a shrinkage prior which softly pushes the elements in A towards the implied restrictions of the PC regression model.
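The mapping from a PC regression back to restricted VAR coefficients can be checked numerically. The sketch below assumes the common normalisation in which the factors are orthonormal (the left singular vectors) and the singular values are absorbed into the loadings; the helper name `pc_factors` and the random data are illustrative only.

```python
import numpy as np

def pc_factors(X, q):
    """First q principal components of X via the SVD, with F_q'F_q = I_q."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    F_q = U[:, :q]            # T x q estimated factors
    L_q = Vt[:q].T * s[:q]    # K x q loadings, so that X is approximated by F_q @ L_q.T
    return F_q, L_q

rng = np.random.default_rng(1)
T, K, M, q = 200, 12, 4, 3
X = rng.standard_normal((T, K))   # placeholder regressors
Y = rng.standard_normal((T, M))   # placeholder responses

F_q, L_q = pc_factors(X, q)
B = np.linalg.lstsq(F_q, Y, rcond=None)[0]   # q x M PC-regression coefficients
A_implied = np.linalg.pinv(L_q).T @ B        # K x M VAR coefficients implied by B
print(A_implied.shape, np.allclose(X @ A_implied, F_q @ B))
```

The final comparison confirms that the VAR with coefficients `A_implied` reproduces the fitted values of the PC regression, which is the sense in which the factor model is a restricted VAR.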

Shrinking the flat prior VAR towards a factor model
Shrinking the regression model towards a subspace spanned by, e.g., the principal components can be done in several ways. For instance, Oman (1982) shows how shrinkage estimators can be used to force an unrestricted regression model towards a projection on a subspace (such as the one spanned by the PCs) as opposed to the origin. This approach uses the eigenvalues of $X'X$ to shrink coefficients towards the space spanned by the first q eigenvectors. Our approach is similar but relies on a modified variant of the functional Horseshoe prior stipulated in Shin, Bhattacharya, and Johnson (2020). This is achieved by setting $\underline{A} = 0_{K \times M}$ and $\underline{V}$ as follows:

$$\underline{V}^{-1} = \frac{\omega}{1-\omega}\, X'(I_T - \Phi_0)X. \tag{5}$$

Here, ω ∈ [0, 1] is a tightness parameter and the T × T matrix $\Phi_0 = F_q (F_q' F_q)^{-1} F_q'$ is the projection matrix of $F_q$. Recall that we obtain $F_q$ from the SVD of X.$^{2}$ We let ω be an unknown parameter and estimate it in a data-based fashion as described below. The posterior is given in the preceding sub-section with these particular choices of $\underline{A}$ and $\underline{V}$ inserted.
To see how ω shrinks the VAR towards the factor model, it is convenient to exploit the fact that, if the rank of X is K, the matrix X and the matrix $F_K$ (i.e., the first K principal components of X) span the same column space $\mathcal{C}$. Moreover, notice that $F_K = (F_q, F_{(q+1):K})$, with $F_{(q+1):K}$ storing the final K − q principal components of X. Using these definitions and the result that $\mathcal{C}(X) = \mathcal{C}(F_K)$, the corresponding projection matrices coincide:

$$X(X'X)^{-1}X' = F_K (F_K' F_K)^{-1} F_K'.$$

Notice that, conditional on a standard normalization, we have $F_K' F_K = I_K$ and $F_q' F_{(q+1):K} = 0$. This allows us to rewrite Eq. (5) as:

$$\underline{V}^{-1} = \frac{\omega}{1-\omega}\, X'\Phi_1 X, \qquad \Phi_1 = F_{(q+1):K} F_{(q+1):K}',$$

with

$$X \bar{V} X' = \Phi_0 + (1-\omega)\,\Phi_1, \tag{8}$$

where $\Phi = X(X'X)^{-1}X' = \Phi_0 + \Phi_1$ is the projection matrix of X. Thus, we can substitute $\Phi_1 = \Phi - \Phi_0$ in Eq. (8) and multiply from the right with Y to arrive at:

$$X\bar{A} = (1-\omega)\,\Phi Y + \omega\,\Phi_0 Y, \tag{9}$$

which shows that the posterior mean of the regression function is a convex combination of the VAR fit, ΦY, and the fit of the PCA regression, $\Phi_0 Y$. This result can be used to show that the resulting predictive distributions (or impulse responses) are weighted averages of the ones obtained from estimating an unrestricted VAR and a PC regression, both estimated using OLS.
Larger values of ω imply estimates which are closer to the ones obtained from estimating a PC regression while values of ω closer to zero yield estimates closer to those of a non-informative prior Bayesian VAR.
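The convex-combination property can be verified numerically. The sketch below assumes that the subspace prior has zero prior mean and prior precision ω/(1 − ω) · X'(I − Φ0)X, which is the form that reproduces the weighting described above; the data are random placeholders and X has full column rank.

```python
import numpy as np

rng = np.random.default_rng(2)
T, K, M, q, omega = 120, 8, 3, 2, 0.3
X = rng.standard_normal((T, K))   # placeholder regressors
Y = rng.standard_normal((T, M))   # placeholder responses

U, s, Vt = np.linalg.svd(X, full_matrices=False)
F_q = U[:, :q]
Phi0 = F_q @ F_q.T                          # projection onto the first q PCs
Phi = X @ np.linalg.solve(X.T @ X, X.T)     # projection onto the column space of X

# Assumed prior precision of the subspace prior (prior mean is zero).
V_prior_inv = omega / (1 - omega) * X.T @ (np.eye(T) - Phi0) @ X
A_bar = np.linalg.solve(X.T @ X + V_prior_inv, X.T @ Y)

fit_posterior = X @ A_bar
fit_convex = (1 - omega) * Phi @ Y + omega * Phi0 @ Y
print(np.max(np.abs(fit_posterior - fit_convex)))
```

The printed discrepancy is at the level of floating-point error, so the posterior fit coincides with the weighted average of the OLS VAR fit and the PC-regression fit for this ω.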
Note that in the preceding material we are not incorporating any conventional Bayesian VAR prior such as the Minnesota prior. The prior defined above is a new one which can be used if the researcher wishes to use a prior which only shrinks towards the factor model. We will use the acronym subVAR-Flat to denote this prior, which combines the subspace prior with a flat prior for the VAR coefficients. The fact that ω = 0 yields a flat prior VAR illustrates an important aspect and potential shortcoming of this prior. Flat prior VARs tend to over-fit unless M is very small and, if K > T, as commonly occurs with large VARs, the OLS estimator will not be defined. Adding subspace shrinkage will ensure the posterior is proper, but small values of ω can potentially lead to over-fitting. As we will document in our empirical results, using a non-informative prior for ω can lead to poor forecast performance in large VARs. Hence the need for a suitable prior for ω. This will be provided below.

Shrinking the Minnesota prior VAR towards a factor model
Since our prior is conjugate, we can easily add additional VAR priors to complement our subspace prior. In this sub-section, we show how this can be done for the natural conjugate Minnesota prior as implemented in Banbura, Giannone, and Reichlin (2010).
Let $\tilde{Y} = (Y', \underline{Y}')'$ and $\tilde{X} = (X', \underline{X}')'$ denote dummy-augmented data matrices. The dummies $\underline{Y}$ and $\underline{X}$ can be specified to match features of the different priors in the Minnesota tradition. We assume that these dummies are parameterized by a hyperparameter ϑ, with values of ϑ close to zero implying strong shrinkage towards the prior mean. In our empirical work we set the dummies following Banbura, Giannone, and Reichlin (2010), with $\sigma_j$ (j = 1, ..., M) denoting the OLS residual standard deviation of an AR(p) model for $y_{jt}$, the j-th variable in $Y_t$, $a_j$ the j-th diagonal element of $\underline{A}$, and $J_p = \mathrm{diag}(1, \dots, p)$. Notice that this set of dummies includes the prior for the intercept, which depends on the hyperparameter κ. We set κ to a very small number (0.001 in our empirical work), leading to a weakly informative prior for the intercepts.
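For readers who want to experiment, the following is a sketch of dummy observations in the style of Banbura, Giannone, and Reichlin (2010). It is an assumption that the exact layout used here matches theirs: the block ordering, the placement of the intercept in the last column, the extra block carrying the prior on Σ, and the names `minnesota_dummies`, `theta` (for ϑ) and `kappa` (for κ) are all illustrative choices.

```python
import numpy as np

def minnesota_dummies(sigma, a_diag, p, theta, kappa=0.001):
    """Dummy observations (Y_d, X_d) for a Minnesota-type prior (assumed layout).

    sigma  : AR(p) residual standard deviations, length M
    a_diag : prior means on the own first lag, length M
    theta  : overall tightness (small values give strong shrinkage)
    kappa  : small scalar giving a weakly informative intercept prior
    """
    sigma = np.asarray(sigma, dtype=float)
    a_diag = np.asarray(a_diag, dtype=float)
    M = sigma.size
    J_p = np.diag(np.arange(1, p + 1))
    Y_d = np.vstack([
        np.diag(a_diag * sigma) / theta,   # prior mean of the first own lag
        np.zeros((M * (p - 1), M)),        # zero prior mean for higher lags
        np.diag(sigma),                    # block related to the prior on Sigma
        np.zeros((1, M)),                  # intercept
    ])
    X_d = np.vstack([
        np.hstack([np.kron(J_p, np.diag(sigma)) / theta, np.zeros((M * p, 1))]),
        np.zeros((M, M * p + 1)),
        np.hstack([np.zeros((1, M * p)), np.array([[kappa]])]),
    ])
    return Y_d, X_d

Y_d, X_d = minnesota_dummies(sigma=[1.0, 2.0], a_diag=[1.0, 0.0], p=2, theta=0.2)
A_prior = np.linalg.lstsq(X_d, Y_d, rcond=None)[0]   # prior mean implied by the dummies
print(Y_d.shape, X_d.shape)
print(np.round(A_prior, 6))
```

Regressing the dummy responses on the dummy regressors recovers the implied prior mean: the first-lag block equals diag(a_1, ..., a_M) and all other coefficients, including the intercept, are zero.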
We can add these dummies to Y and X and then combine the result with our subspace shrinkage prior. The posterior covariance matrix of the VAR coefficients then combines the cross-product of the dummy-augmented regressors with the precision of the subspace prior. Adding the Minnesota prior means that the result in (9) no longer holds exactly. Intuitively speaking, if ϑ is set too tight, the Minnesota-type prior overrules the subspace shrinkage prior. However, for reasonable values of ϑ the following will hold approximately:

$$X\bar{A} \approx (1-\omega)\, X\bar{A}_{\text{Minn}} + \omega\,\Phi_0 Y,$$

where $\bar{A}_{\text{Minn}}$ denotes the posterior mean based on a Minnesota prior VAR. This result states that the posterior mean of the regression function is a convex combination of the (OLS) fit of a PCA regression and the posterior mean based on a Minnesota prior VAR.
To investigate the accuracy of this approximation for different values of ϑ, we can compute the average squared approximation error Ξ(ϑ, ω). It is computed column by column, i.e., from the j-th columns (j = 1, ..., M) of the matrices involved, with || • || denoting the Euclidean norm of a vector. This approximation error quickly approaches zero if ϑ becomes moderately large. If ω ≈ 0, the approximation error also vanishes since then we obtain the Minnesota prior BVAR estimate. The interaction between ϑ and ω in determining Ξ(ϑ, ω) is highly non-linear. The key point to take away is that if ϑ approaches zero faster than ω approaches one, the standard Minnesota prior dominates the subspace prior.
These points are illustrated in Figure 1, which plots the approximation error for different values of ϑ and ω using data sets simulated from different data generating processes (DGPs) for different values of M and T. The DGP is a dynamic factor model with q = 3 and the factors evolving according to a multivariate random walk with a full error variance-covariance matrix.$^{3}$ Figure 1 suggests that ω does not have a large effect on the approximation, but that ϑ does. In particular, for values of ϑ > 0.1, the log approximation error is less than −8 for all the different values of ω. In the next section we will specify a prior on ϑ which allocates substantial mass to this region. It is worth stressing that even if ϑ is smaller than this, our prior is still a valid prior combining the Minnesota prior with the subspace prior; it is just that the resulting posterior mean will deviate more from being a linear combination of a posterior mean using the Minnesota prior and a PC regression.

Selecting the number of factors and estimating the hyperparameters
Our prior depends on a choice for the number of factors (q), the weight attached to the PC regression relative to the VAR (ω) and the degree of shrinkage in the Minnesota prior (ϑ). The posterior for these is:

$$p(q, \omega, \vartheta \mid Y) \propto p(Y \mid q, \omega, \vartheta)\, p(q)\, p(\omega)\, p(\vartheta).$$

Estimation is straightforward since the natural conjugate prior leads to an analytical form for the marginal likelihood $p(Y \mid q, \omega, \vartheta)$, which involves the posterior scaling matrix of the inverse Wishart posterior of Σ. This can be multiplied by the prior to produce the posterior. We define discrete grids for ω, ϑ and q and evaluate the posterior at points in the grids. We can then do Monte Carlo integration by sampling from the multinomial distributions that arise. Hence, our predictive densities reflect uncertainty in these parameters. This is similar to a strategy suggested in Giannone, Lenza, and Primiceri (2015) for the Minnesota prior VAR but, as detailed below, avoids carrying out complex matrix operations during posterior simulation and thus offers substantial computational gains (at the cost of approximating a continuous posterior distribution using a discrete one).
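The grid-based sampling step can be sketched as follows. The function below takes an array of unnormalised log posterior values over the (q, ω, ϑ) grid, however they were obtained, normalises them in a numerically stable way and draws grid points with the resulting probabilities. The function name is hypothetical and the values used in the example are random placeholders, not marginal likelihoods from an actual model.

```python
import numpy as np

def sample_grid_posterior(log_post, n_draws, rng):
    """Normalise log posterior values on a grid and sample grid indices.

    Returns the probability array (same shape as log_post) and a tuple of
    index arrays, one per grid dimension.
    """
    log_post = np.asarray(log_post, dtype=float)
    w = np.exp(log_post - log_post.max())   # subtract the max to avoid overflow
    probs = (w / w.sum()).ravel()
    flat = rng.choice(probs.size, size=n_draws, p=probs)
    return probs.reshape(log_post.shape), np.unravel_index(flat, log_post.shape)

rng = np.random.default_rng(3)
log_post = rng.standard_normal((10, 20, 13))   # placeholder values on a 10 x 20 x 13 grid
probs, (iq, iw, it) = sample_grid_posterior(log_post, n_draws=1000, rng=rng)
print(probs.sum(), iq[:5], iw[:5], it[:5])
```

Each draw of indices selects one (q, ω, ϑ) combination, conditional on which the remaining parameters can be drawn from their conjugate posterior.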
It remains to specify the priors. For ϑ we follow suggestions in Giannone, Lenza, and Primiceri (2015) and use a Gamma prior which we set to have mode 0.2 and standard deviation 0.4. This choice implies that the approximation error Ξ(ϑ, ω) is extremely small and the posterior mean of the model fit can be safely interpreted as a convex combination of the BVAR and the DFM fit.
For ω we use a Beta prior: $\mathcal{B}(c_0, c_1)$. In our empirical work we consider two ways of specifying the hyperparameters $c_0$ and $c_1$. The first sets $c_0 = c_1 = 1$, yielding a non-informative uniform prior on ω. The second prior sets $c_0 = \tilde{c}_0 \times M$ and $c_1 = \tilde{c}_1 \times M$, with $\tilde{c}_0, \tilde{c}_1$ being scalars greater than zero. This choice implies that the prior mean of ω is equal to $\tilde{c}_0/(\tilde{c}_0 + \tilde{c}_1)$ and the prior variance equals $(\tilde{c}_0 \tilde{c}_1)/\big((\tilde{c}_0 + \tilde{c}_1)^2 (M(\tilde{c}_0 + \tilde{c}_1) + 1)\big)$. In our empirical work we set $\tilde{c}_0 = 8$ and $\tilde{c}_1 = 6$, yielding a prior mean of ω of around 0.6 and thus placing considerable mass on the factor model restrictions, while the prior variance decreases in M. In large dimensions, this choice increasingly forces the model towards the factor restrictions but still provides sufficient flexibility for individual time series to exhibit VAR dynamics.
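A short calculation makes the effect of scaling the Beta hyperparameters by M concrete. The helper name is hypothetical; the values 8 and 6 and the model sizes 12, 22, 78 and 166 are those used in the empirical work.

```python
def beta_prior_moments(c0_tilde, c1_tilde, M):
    """Mean and variance of a Beta(c0_tilde * M, c1_tilde * M) prior on omega."""
    c0, c1 = c0_tilde * M, c1_tilde * M
    mean = c0 / (c0 + c1)
    var = c0 * c1 / ((c0 + c1) ** 2 * (c0 + c1 + 1))
    return mean, var

for M in (12, 22, 78, 166):
    mean, var = beta_prior_moments(8, 6, M)
    print(M, round(mean, 3), round(var ** 0.5, 4))
```

The prior mean stays at 8/14 for every M, while the prior standard deviation shrinks as M grows, so the prior becomes tighter around the same central value in larger models.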
We assume a discrete uniform prior on q: $q \sim \mathcal{U}(1, q_0)$, which implies that all values up to $q_0$ (which denotes some integer smaller than K set by the researcher) are a priori equally likely. Other choices which utilize sample information (such as the eigenvalues of X) are in principle possible.
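Two hyperparameters remain to be chosen. If we use a flat prior in combination with a subspace prior, we set $\underline{\nu} = M + 2$ and $\underline{S} = \frac{1}{100} I_M$. If we use a Minnesota prior, we set $\underline{\nu}$ equal to the number of rows of $\underline{Y}$ and $\underline{S} = (\underline{Y} - \underline{X}\,\underline{A})'(\underline{Y} - \underline{X}\,\underline{A})$ (see Kadiyala and Karlsson (1997)).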
It is worth noting that, in Bayesian factor analysis, selecting the number of latent factors is a difficult task. Bayesian solutions include using reversible jump MCMC algorithms which treat the number of factors as an unknown quantity (see, e.g., Lopes and West (2004) and Frühwirth-Schnatter and Lopes (2018)). Another strand of the literature estimates an overfitting factor model and applies Bayesian shrinkage priors to force the columns of the factor loadings matrix associated with irrelevant factors to zero; see Bhattacharya and Dunson (2011). However, the following sub-section investigates, using simulated data, the simple and computationally efficient approach given here. These simulations show that, even under a uniform prior, our approach selects the true number of factors successfully.
It is also worth noting that, conditional on ω and ϑ, all quantities used in the Monte Carlo sampling of q can be pre-computed and thus estimation of huge models (i.e., with M > 100) is feasible. This requires specifying a grid for ω, ϑ and q. In all our empirical work we set the grid for q ∈ {1, ..., min(10, L*)}, with L* denoting the Ledermann bound.$^{4}$ The grid on ω is specified to go from 0.01 to 0.99 with a step-size of 0.05. Finally, the grid on ϑ is {0.001, 0.01, 0.025, 0.05, 0.10, 0.20, 0.3, 0.4, 0.5, 2, 3, 4, 5}.
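The grids can be written down directly. In the sketch below the Ledermann bound is computed with the usual formula (2n + 1 − sqrt(8n + 1))/2, rounded down; which dimension n the bound is applied to is not stated in this section, so n is left as an argument and the value used in the example is purely illustrative. Reading the ω grid literally (start 0.01, step 0.05, not exceeding 0.99) gives 20 points ending at 0.96.

```python
import math

def ledermann_bound(n):
    """Largest integer not exceeding (2n + 1 - sqrt(8n + 1)) / 2."""
    return math.floor((2 * n + 1 - math.sqrt(8 * n + 1)) / 2)

def build_grids(n):
    """Grids for q, omega and theta; n is the dimension used in the Ledermann bound."""
    q_grid = list(range(1, min(10, ledermann_bound(n)) + 1))
    omega_grid = [round(0.01 + 0.05 * i, 2) for i in range(20)]   # 0.01, 0.06, ..., 0.96
    theta_grid = [0.001, 0.01, 0.025, 0.05, 0.10, 0.20, 0.3, 0.4, 0.5, 2, 3, 4, 5]
    return q_grid, omega_grid, theta_grid

q_grid, omega_grid, theta_grid = build_grids(n=20)   # n = 20 is an illustrative value
print(len(q_grid), len(omega_grid), len(theta_grid))
print(len(q_grid) * len(omega_grid) * len(theta_grid), "grid points in total")
```

Because the grid is fixed in advance, the quantities needed at each grid point can be computed once and reused across Monte Carlo draws.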
Evaluating marginal likelihoods can be challenging in very large models and they depend on the prior. Accordingly, in our empirical work (which involves forecasting three variables of interest), we also investigate an alternative way of choosing q, ω and ϑ. This is to choose them using the Bayesian Information Criterion (BIC) for the three variables of interest.

Simulated data exercise on selecting the number of factors
In this sub-section we investigate, by means of synthetic data, whether our approach successfully detects the correct number of factors. To analyze how estimation accuracy changes across different datasets, we consider DGPs that vary in the number of variables (M) as well as the number of factors. The DGP is a dynamic factor model given by:

$$Y_t = \Lambda f_{t-1} + \varepsilon_t, \qquad \varepsilon_t \sim \mathcal{N}(0_M, \Sigma),$$

with $f_t$ evolving according to a multivariate random walk with full state-innovation variance Ω, an initial state $f_0 = 0_q$ and $\Sigma = \Lambda \Omega \Lambda' + W$ being a full matrix, with W denoting a diagonal matrix of measurement error variances. Notice that this is a standard dynamic factor model which is rewritten by plugging the random walk state equation into an observation equation which typically includes the contemporaneous factors and uncorrelated measurement errors.
In all our simulations we assume that $\lambda_{ij}$, the (i, j)-th element of Λ, is drawn from a normal distribution with zero mean and variance $0.1^2$ if i ≠ j, or set equal to unity if i = j. Instead of specifying Ω we obtain the lower Cholesky factor of Σ, $A_0^{-1}$, by simulating the off-diagonal elements from a Gaussian distribution with zero mean and variance $0.1^2$; the main diagonal elements are set equal to 0.1.

Table 1: Posterior mean of q. The four blocks correspond to the four subVAR specifications; within each block the columns refer to DGPs with q = 1, 3, 6 and 8.
M = 10  | 1.32 3.03 5.87 2.54 | 1.29 3.11 6.00 8.00 | 1.32 3.04 5.90 2.53 | 1.28 3.06 6.00 8.00
M = 60  | 1.00 2.96 5.23 6.01 | 1.03 2.95 5.34 6.24 | 1.00 2.95 5.26 6.01 | 1.02 2.99 5.28 6.22
M = 120 | 1.00 2.58 3.92 4.54 | 1.00 2.67 3.92 4.59 | 1.00 2.57 3.84 4.50 | 1.00 2.63 3.95 4.61
Notes: subVAR denotes the VAR coupled with the subspace shrinkage prior, Minn is the combination between subspace and Minnesota shrinkage while flat is the subspace shrinkage prior without additional shrinkage. The 0 and 1 attached to the respective label indicate a flat (0) or informative (1) prior on ω. Each number is based on computing the mean of posterior medians across 100 replications from the respective DGPs. For q, we use the posterior median as our point estimate while for ω we use the posterior mean.
We simulate T = 500 observations from small (M = 10), moderate (M = 60) and large (M = 120) datasets. For each of these, we vary the number of factors q ∈ {1, 3, 6, 8}. All simulations are repeated 100 times and, in Table 1, we report averages of posterior medians across these replications.
It can be seen that all of the versions of our subVAR prior are doing a good job of choosing the correct number of factors. It is mainly in the least parsimonious cases (i.e., DGPs with M = 120 and q = 8) where it is slightly underestimating the number of factors. But this is due to the large VAR providing some of the fit, leaving less for the DFM to explain. In the context of these very large models, slight over-shrinkage is better than the over-fitting which would have occurred if the prior had failed to shrink enough.
3 Forecasting Using US Macroeconomic Data

Data
We use a large set of 166 quarterly macroeconomic variables taken from the St. Louis Fed's FRED data base (fred.stlouisfed.org) and discussed in McCracken and Ng (2020). These are listed in the appendix in Table 3. Variables are transformed to stationarity following recommendations there. Our forecasting results focus on three variables of interest: GDP growth (based on real GDP growth, GDPC1), the Fed Funds rate (FEDFUNDS) and inflation (based on the consumer price index, CPIAUCSL).
The data run from 1960:Q1 to 2020:Q3 and, in our forecasting exercise, the evaluation period is 1990:Q3 to 2020:Q3. We adopt a recursive forecasting design. We use the initial estimation period (1960:Q1 to 1990:Q2) to produce one- and iterated four-quarter-ahead forecast distributions for 1990:Q3 and 1991:Q2, respectively. After obtaining these, we expand the initial estimation period by one observation until we reach the end of the sample.
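The recursive design can be laid out explicitly as a list of forecast origins and target quarters. The sketch below is only bookkeeping: the function names are hypothetical, quarters are represented as (year, quarter) tuples, and it is assumed that forecasts whose target falls after the end of the sample are simply not produced.

```python
def shift(period, h):
    """Move a (year, quarter) tuple h quarters forward."""
    year, quarter = period
    idx = year * 4 + (quarter - 1) + h
    return idx // 4, idx % 4 + 1

def recursive_design(first_origin, last_obs, horizons=(1, 4)):
    """List (origin, {h: target}) pairs for an expanding estimation window.

    origin is the last quarter of the estimation sample; targets beyond
    last_obs are dropped because they cannot be evaluated.
    """
    plan = []
    origin = first_origin
    while shift(origin, 1) <= last_obs:
        targets = {h: shift(origin, h) for h in horizons if shift(origin, h) <= last_obs}
        plan.append((origin, targets))
        origin = shift(origin, 1)   # expand the window by one observation
    return plan

plan = recursive_design((1990, 2), (2020, 3))
print(plan[0])
print(len(plan), sum(4 in targets for _, targets in plan))
```

The first entry pairs the estimation sample ending in 1990:Q2 with the targets 1990:Q3 and 1991:Q2, matching the description above; under the stated assumption there are fewer four-quarter-ahead than one-quarter-ahead forecasts because the last few origins have no observed four-quarter-ahead target.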

Models
We present results for four models involving subspace priors (acronym subVAR). These involve two priors for the VAR coefficients: the non-informative one (flat) and the Minnesota prior (Minn). There are also two priors for ω: one non-informative (flat) and one informative. These are indicated by adding a 0 (flat) or a 1 (tight) to the relevant label of the VAR coefficient prior.
For each of these four models, we present results for data sets of four different sizes: small (S, 12 variables), medium (M, 22 variables), large (L, 78 variables) and extra large (XL, 166 variables). Table 3 lists which variable belongs in which category.
For comparison we also present results for Minnesota prior VARs (implemented by setting ω = 0 in the subVAR-Minn) and factor models (labeled DFM). The factor model is a FAVAR for the three variables of interest and the factors, which are estimated by extracting the PCs from the remaining time series within a given model size. The number of PCs is chosen by retaining the PCs with standard deviations greater than unity. The FAVAR is estimated using a relatively non-informative Minnesota prior. The lag length in all models is set to two.
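The rule of retaining PCs with standard deviation above one can be sketched as follows. It is an assumption of this illustration that the series are standardised to zero mean and unit variance before the PCs are extracted (so that the squared PC standard deviations are the eigenvalues of the sample correlation matrix); the function name and the simulated data are hypothetical.

```python
import numpy as np

def n_pcs_sd_above_one(Z):
    """Number of principal components of the standardised data with sd > 1."""
    Z = np.asarray(Z, dtype=float)
    Zs = (Z - Z.mean(axis=0)) / Z.std(axis=0, ddof=1)   # standardise each series
    s = np.linalg.svd(Zs, compute_uv=False)
    sdev = s / np.sqrt(Z.shape[0] - 1)                  # PC standard deviations
    return int(np.sum(sdev > 1.0)), sdev

rng = np.random.default_rng(4)
f = rng.standard_normal((300, 1))                       # one common factor
Z = f + 0.1 * rng.standard_normal((300, 5))             # five series loading on it
n_keep, sdev = n_pcs_sd_above_one(Z)
print(n_keep, np.round(sdev, 3))
```

With a single strong common factor and small idiosyncratic noise, only the first PC has a standard deviation above one, so one factor is retained.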

Summary of forecasting results
We begin by summarizing the results of our pseudo-out-of-sample forecasting exercise in Table 2. This table contains Root Mean Squared Forecast Errors (RMSFEs) and averages (over time) of log predictive likelihoods (LPLs) for our three variables of interest and for two different forecast horizons. The RMSFEs are ratios between the RMSFE of a given model and that of the Minnesota VAR, while the LPLs are differences between a given model and the Minnesota VAR (both for a given model size).
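The two relative measures can be computed as in the following sketch. The function names and the toy numbers are illustrative only; the log predictive likelihood values are taken as given inputs, since how they are evaluated depends on the model's predictive density.

```python
import math

def rmsfe(actual, forecast):
    """Root mean squared forecast error of point forecasts."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

def relative_rmsfe(actual, forecast_model, forecast_benchmark):
    """RMSFE ratio; values below one favour the model over the benchmark."""
    return rmsfe(actual, forecast_model) / rmsfe(actual, forecast_benchmark)

def lpl_difference(lpl_model, lpl_benchmark):
    """Difference in average LPLs; positive values favour the model."""
    return sum(lpl_model) / len(lpl_model) - sum(lpl_benchmark) / len(lpl_benchmark)

actual = [1.0, 2.0, 3.0]   # toy realisations
ratio = relative_rmsfe(actual, [1.0, 2.0, 4.0], [2.0, 2.0, 5.0])
diff = lpl_difference([-1.0, -2.0], [-2.0, -3.0])
print(round(ratio, 4), diff)
```

Read this way, entries below one in the RMSFE panel and positive entries in the LPL panel indicate that a model improves on the Minnesota VAR benchmark.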
Before discussing our subVAR models, consider the comparison between the Minnesota prior VAR and the factor model. For some variables, forecast horizons and model sizes, the VAR yields more precise forecasts. That is, for the interest rate it consistently forecasts better and, for inflation and GDP growth in larger models at longer horizons, its forecasts tend to be more precise than those of the DFM. But for other cases the factor model outperforms the Bayesian VAR. This result raises the possibility that an approach such as ours, which combines the two, could lead to better overall forecast performance than either the BVAR or the DFM individually and, with several exceptions discussed below, this is what we find.
Consider first the most informative subVAR model, which uses the Minnesota prior on the VAR coefficients and the informative prior on ω (Minn1). With some exceptions, this model yields forecasts which are better than the predictions produced by the BVAR and are often the most precise ones. The main exceptions are the one-year-ahead GDP growth predictions, which are marginally worse than the BVAR benchmark. However, this is a case where the DFM and some of the less informative subVAR approaches are forecasting substantially worse than the BVAR.
Consider now the second most informative subVAR approach (Minn0), which retains the Minnesota prior for the VAR coefficients but uses a non-informative prior on ω. Its forecasts are comparable to those of Minn1, but overall are slightly worse. Results are thus robust to the prior on ω. Both of these Minnesota prior subVAR approaches are forecasting well most of the time, and even the few exceptions reveal only slight deterioration in forecast performance relative to the BVAR benchmark.
Using a non-informative prior for the VAR coefficients, however, goes wrong in some cases in larger models. In one sense, this is unsurprising. Non-informative priors work poorly in large VARs since they suffer from severe over-fitting problems. It might have been possible that these would have been corrected by adding the subspace prior shrinking towards the factor model.
But clearly this effect is not strong enough in the L and XL models to counter-balance the over-fitting problem. One might have hoped that estimates of ω would have been pulled towards 1 in these cases, leading to results similar to the DFM, but (unless we use an extremely dogmatic prior on ω) this is not happening in the larger models. Similar to the results based on synthetic data, this is because the larger (unrestricted) VARs explain the majority of variation in the data and leave little variation for the factor model to explain, yielding posterior estimates of ω close to zero. However, the subVAR-Flat models are performing well for our small and medium-sized models and for the one-quarter-ahead forecast horizon for the larger models. It is only the iterated one-year-ahead forecasts that are deteriorating. This suggests that this approach might be found useful by researchers working with VARs up to a dimension of approximately 20 who wish to avoid the use of standard BVAR priors such as the Minnesota prior, particularly if the focus is on short-term forecasts. Such a researcher may also wish to avoid using marginal likelihoods since they are prior-dependent. For them, it is also interesting to note that using the BIC to estimate the prior hyperparameters works roughly as well as using marginal likelihoods.

A deeper examination of forecast performance of subspace VAR methods
To examine more deeply the properties of our subVAR prior, in this sub-section we provide plots over time of predictive Bayes factors against the Minnesota prior VAR. Moreover, we investigate how the estimates of the prior hyperparameters q, ω and ϑ evolve over the hold-out period.

Table 2:
-0.20 -0.73 -8.11 -9.95 -1.98 -1.45
-0.20 -0.74 -8.15 -9.97 -1.37 -1.57
-0.27 -0.96 -8.44 -7.59 -1.13 -1.65
Notes: subVAR denotes the VAR coupled with the subspace shrinkage prior, Minn is the combination between subspace and Minnesota shrinkage while flat is the subspace shrinkage prior without additional shrinkage. The 0 and 1 attached to the respective label indicate a flat (0) or informative (1) prior on ω. DFM is a VAR in the three focus variables augmented with principal components and BVAR refers to a Minnesota VAR. The upper part of the table shows relative root mean squared forecast errors (RMSFEs) between a given model and the Minnesota VAR while the lower part of the table shows differences in average log predictive likelihoods (LPLs) to the Minnesota VAR. The numbers in the BVAR columns include the actual RMSFEs and LPLs of the Minnesota VAR.
Figures 2 and 3 plot the log predictive Bayes factors for the three variables being forecast for the four subspace VAR priors. Figure 2 uses the marginal likelihood for all the variables in the model to estimate the prior hyperparameters and Figure 3 uses the BIC for the three variables of interest. The overall best performance of the priors which combine subspace shrinkage with the Minnesota prior can be seen in both figures. An examination of the main exception to this pattern is informative. This occurs for inflation forecasts, where the combination of the non-informative prior VAR with the subspace shrinkage prior forecasts well. But this result holds only for the one-quarter-ahead horizon. The iterated one-year-ahead forecasts are very poor (see Table 2).
Another pattern worth noting is that substantial changes tend to occur during either the financial crisis (around 2009) or the pandemic (2020). The tendency at these times is for the subVAR-Flat models to do better. Results for GDP growth from larger VARs are particularly striking. The forecast performance of these models was extremely poor up until the pandemic, when the subVAR-Flat models almost caught up to the subVAR-Minn models. The stronger prior information in the latter is a great benefit in normal times, but in the pandemic this makes them less able to adjust to the extreme observations which arise. This is because the corresponding predictive density is narrow, which helps in tranquil periods, while in turbulent times (such as during the pandemic) the variance is too low, rendering outliers less likely under the posterior predictive distribution.
The role of the prior on ω can be seen to be relatively unimportant in most cases, although it is interesting to note that in smaller models it can be beneficial to use the informative prior for ω (see, in particular, the forecasts of inflation for the small and medium VARs).
Figures 4 and 5 plot the posterior mean of q over time for the marginal likelihood-based and BIC-based methods, respectively. The figures illustrate some interesting differences between these two methods. In particular, use of the BIC allows for more time variation in the parameter estimates for the XL model, suggesting it allows for quicker adjustment to new information.
Consider the best-performing approach, which is the subVAR-Minn prior with an informative prior on ω. For this case, using the marginal likelihood leads to a choice of 10 factors for all time periods for the XL model. But using the BIC, there is more variation over time. For the XL model in particular, the number of factors increases gradually from 7 to 9 before quickly collapsing down to a posterior mean near 6 when the financial crisis hits. In general, for the XL model the marginal likelihood consistently leads to large estimates for q which vary little over time. This pattern does not recur for the lower dimensional models, where the marginal likelihood-based estimates of q tend to be lower. For instance, in the smallest model the subVAR-Flat model chooses q = 1 for all periods, which contrasts with much larger BIC-based estimates. Marginal likelihood calculation in large VARs can be unstable and sensitive to the prior, and our results suggest that, in larger models at least, it may be safer to use BIC-based estimates.
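As a rough illustration of how a non-integer posterior mean of q can arise from criterion values on a grid, consider the following sketch. It assumes weights proportional to exp(-BIC/2), the usual BIC approximation to the marginal likelihood, with a flat prior over the grid; this is an assumption made for illustration and not necessarily the exact procedure used in this paper. The BIC values and function names are hypothetical.

```python
import math

def posterior_weights_from_bic(bic_by_q):
    """Map {q: BIC} to normalized weights proportional to exp(-BIC / 2)."""
    best = min(bic_by_q.values())  # subtract the minimum for numerical stability
    unnormalized = {q: math.exp(-0.5 * (bic - best)) for q, bic in bic_by_q.items()}
    total = sum(unnormalized.values())
    return {q: w / total for q, w in unnormalized.items()}

def posterior_mean_q(weights):
    """Weighted average of the number of factors over the grid."""
    return sum(q * w for q, w in weights.items())

# Hypothetical BIC values over a small grid for q (lower is better).
bic_by_q = {5: 1012.0, 6: 1004.0, 7: 1003.0, 8: 1009.0}

weights = posterior_weights_from_bic(bic_by_q)
mean_q = posterior_mean_q(weights)
```

Because two neighbouring values of q have similar BIC values here, the resulting mean lies between them rather than on a grid point. The same weighting could be applied over a joint grid for q, ϑ and ω.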
Figures 6 and 7 present evidence on the estimation of ω. For the XL and L models, we are finding striking differences between the BIC and marginal likelihood-based estimates. Note that for the subVAR-Minn model with the informative prior on ω we are finding the posterior mean of ω to be approximately 0.6 when estimated using the BIC, whereas the marginal likelihood-based estimates are much lower at approximately 0.25/0.35 for the XL/L models. Hence, the BIC-based approach shrinks much more strongly towards the factor model than the marginal likelihood-based one.

Figure 6: Evolution of the posterior mean of ω over the hold-out period when the marginal likelihood is used to select q, ϑ and ω
We can also see the role that the prior for ω has in that estimates using the non-informative prior for ω tend to be substantially lower than those produced using the informative prior. In fact, with rare exceptions, using the non-informative prior does not lead to estimates of ω above 0.2. At least in this data set, it is necessary to use an informative prior for ω to achieve substantial shrinkage towards the factor model. It is interesting to note that, when we do so, we consistently find ω to be in the region [0.25, 0.60]. This is far from the region where one would feel confident selecting either the Minnesota prior VAR (ω = 0) or the factor model (ω = 1), indicating again the potential benefits of our approach which combines the two.

Figure 7: Evolution of the posterior mean of ω over the hold-out period when the BIC over the three focus variables is used to select q, ϑ and ω

Finally, we turn to the main shrinkage parameter of the Minnesota prior, ϑ. This hyperparameter only appears in the approaches involving the Minnesota prior. Note that smaller values of ϑ imply stronger shrinkage. Posterior means are plotted in Figure 8.
The most striking pattern here is that using the marginal likelihood leads to much lower estimates of this hyperparameter than using the BIC, especially for the small and medium models. Consistent with Giannone, Lenza, and Primiceri (2015), we find that larger models generally feature smaller values of ϑ (and thus more shrinkage). If we combine this with the fact that the marginal likelihood-based estimates of ω are lower for these models, we have the interesting finding that the marginal likelihood is choosing to put more weight on a Minnesota prior VAR with more shrinkage. In contrast, the BIC-based weights are closer to a combination of a factor model with a Minnesota prior that is implemented rather loosely.
Another interesting finding is that ϑ tends to increase sharply during the pandemic. This is especially pronounced for small and medium-sized models. Since the variance of the predictive distribution is positively related to ϑ, larger values of ϑ are (all else being equal) accompanied by wider predictive intervals. This explains why some of the models improve appreciably against the benchmark in 2020.

Further Discussion
In this paper, we have worked with two popular models (i.e., the conjugate version of the Minnesota prior VAR with a single shrinkage hyperparameter and the factor model), both of which are homoskedastic. We did this to draw out all the theoretical insights in a clear and simple way and because, in many empirical contexts, simple approaches such as these have been found to work well (see, e.g., Banbura, Giannone, and Reichlin (2010) and Carriero, Clark, and Marcellino (2015)). Furthermore, computation is vastly simplified since analytical results are available and we can avoid the use of MCMC methods.
However, many recent Bayesian VAR papers have used richer econometric structures. These can be classified in two main categories: other priors and other forms for the error covariance matrix. Here we discuss using the subVAR prior in the context of such extensions.
One restrictive feature of our Minnesota prior is that it involves a single shrinkage parameter. Allowing for each equation to have its own shrinkage parameter could be a useful extension of our approach. This raises two issues: i) the resulting priors would no longer be conjugate and ii) the number of prior hyperparameters would become larger, leading to a higher dimensional grid of values to evaluate in the Monte Carlo step. These issues could partly be surmounted ) and writing the VAR in structural form (i.e. with A_0 Y_t on the left hand side and the error covariance matrix being diagonal). Conditional on A_0, the subspace shrinkage prior could be applied in an equation-specific manner and estimation could proceed one equation at a time (which would help reduce the computational burden, although the computational cost of having M different sets of prior hyperparameters would still be large). A_0 could be drawn using MCMC methods. In essence, it would be straightforward to develop an MCMC algorithm for drawing A_0, ω_j, q_j and ϑ_j for j = 1, ..., M. Being an MCMC algorithm, it would be inherently much more computationally demanding than the methods developed in this paper, but could be useful at least in medium-sized VARs.
There have also been many global-local shrinkage priors (e.g., the Horseshoe and Lasso priors) which are conditionally Gaussian (i.e., conditional on some new parameters in the prior they are Gaussian). Estimation proceeds by adding blocks to the MCMC algorithm for drawing these new parameters. Since our Minnesota prior is Gaussian, it is trivial to replace it with any conditionally Gaussian prior. The theory developed in this paper would hold, conditional on the new parameters. Estimation would proceed through an MCMC algorithm which drew these new parameters and then, conditional on each draw, exploited the subVAR methods developed in this paper.
In a similar fashion, the assumption of homoskedasticity could be relaxed to allow for stochastic volatility. This would lead to an MCMC algorithm which involved drawing the volatilities and, conditional on each draw, the results for the subspace VAR prior developed in this paper could be used. Several forms for stochastic volatility in VARs have been proposed in the literature (see, for instance, Carriero, Clark, and Marcellino (2019) for a particularly popular form), and the general strategy outlined here would work with any of them.
In sum, many extensions of the conjugate subVAR approach developed in this paper are possible. However, they would require the use of MCMC methods. Provided the likelihood and prior remain Gaussian conditional on some new parameters, the theory derived in Section 2 would hold, conditional on these new parameters.
Finally, it is worth noting that the error covariance structure proposed in Chan (2020) maintains many of the benefits of the conjugate form. It assumes the VAR error covariance matrix is Σ⊗Ω where Ω can be any positive definite matrix. This nests many possible specifications, including a common stochastic volatility model, moving average errors and non-Gaussian errors. This Kronecker structure in the likelihood matches up with the Kronecker structure in the conjugate prior, leading to derivations which are similar to those in Section 2 of this paper. Roughly speaking, whereas the derivations in Sub-section 2.2 show how the subspace prior leads to a posterior mean which is a combination of the OLS estimate of the VAR with a PC regression, using the model of Chan (2020) leads to a combination of a GLS estimate with a PC regression. Chan (2020) develops a computationally efficient MCMC algorithm for models with this error covariance structure.
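The idea of combining an OLS estimate with a PC regression can be sketched numerically. The sketch below is a deliberate simplification and not the posterior mean derived in Sub-section 2.2: it assumes a flat prior on the VAR coefficients, a simple convex combination with weight ω, and regressors X that have already been standardized. The function names, the simulated data and the symbols X and Y are hypothetical and used only for illustration.

```python
import numpy as np

def ols_coefficients(X, Y):
    """Least-squares coefficients (K x M) from regressing Y (T x M) on X (T x K)."""
    return np.linalg.lstsq(X, Y, rcond=None)[0]

def pc_regression_coefficients(X, Y, q):
    """Coefficients on X implied by regressing Y on the first q principal components of X."""
    # Rows of Vt are the principal directions of X, ordered by singular value.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_q = Vt[:q].T                                  # K x q loadings
    F = X @ V_q                                     # T x q principal components
    G = np.linalg.lstsq(F, Y, rcond=None)[0]        # q x M coefficients on the components
    return V_q @ G                                  # K x M implied coefficients on X

def combined_coefficients(X, Y, q, omega):
    """Convex combination: omega = 0 gives OLS, omega = 1 gives the PC regression."""
    return ((1.0 - omega) * ols_coefficients(X, Y)
            + omega * pc_regression_coefficients(X, Y, q))

# Simulated data for illustration only.
rng = np.random.default_rng(0)
T, K, M = 200, 6, 3
X = rng.standard_normal((T, K))
Y = X @ rng.standard_normal((K, M)) + rng.standard_normal((T, M))

B_ols = ols_coefficients(X, Y)
B_pc = pc_regression_coefficients(X, Y, q=2)
B_half = combined_coefficients(X, Y, q=2, omega=0.5)
```

Replacing the OLS step with a GLS step (for instance, by pre-multiplying X and Y by the inverse of a Cholesky factor of Ω) would give the analogous combination for the Kronecker error structure discussed above, again only as a sketch.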

Conclusions
Macroeconomic researchers with large data sets have traditionally been forced to make a choice between a large VAR or a dynamic factor model. In this paper, we have shown how to combine the two. We have developed a subspace prior for the VAR which shrinks towards a dynamic factor model. A parameter, ω, controls the degree of shrinkage and we have developed methods for estimating it from the data. Thus, we have developed a Bayesian methodology for averaging a large VAR with a factor model or choosing between them.
We illustrate our approach using synthetic and real data. In simulations, we show that our approach accurately detects the number of factors if the true DGP suggests relatively few factors (irrespective of the model size). If the DGP features a large number of factors, our approach slightly underestimates the true number of factors. In a forecasting exercise involving a large number of macroeconomic variables, we demonstrate the benefits of combining the two model classes using our subspace VAR approach. Using subspace shrinkage in combination with a Minnesota prior often yields more precise forecasts than the ones obtained from either the factor model or the VAR.

Figure 1: Log Squared Approximation Error for different DGPs and q = 3

10 Solid lines refer to the model which includes both subspace and Minnesota shrinkage whereas the dashed lines show the variable-specific log-predictive Bayes factor (to the Minnesota prior VAR) of the subspace shrinkage VAR without the Minnesota prior. Black lines refer to models with an uninformative prior on ω. Red lines denote models with an informative prior on ω.

Figure 2: Evolution of the log predictive Bayes factor between subVAR and the Minnesota VAR across focus variables when the marginal likelihood is used to select q, ϑ and ω

Figure 3: Evolution of the log predictive Bayes factor between subVAR and the Minnesota VAR across focus variables when the BIC over the three focus variables is used to select q, ϑ and ω

Figure 8: Evolution of the posterior mean of ϑ over the hold-out period

Table 1: Simulation results for differing values of q and M. Averages across 100 replications from the DGP

Table 2: Forecasting results across focus variables, models and forecast horizons