Combining Shrinkage and Sparsity in Conjugate Vector Autoregressive Models

Conjugate priors allow for fast inference in large dimensional vector autoregressive (VAR) models but, at the same time, introduce the restriction that each equation features the same set of explanatory variables. This paper proposes a straightforward means of post-processing posterior estimates of a conjugate Bayesian VAR to effectively perform equation-specific covariate selection. Compared to existing techniques using shrinkage alone, our approach combines shrinkage and sparsity in both the VAR coefficients and the error variance-covariance matrices, greatly reducing estimation uncertainty in large dimensions while maintaining computational tractability. We illustrate our approach by means of two applications. The first application uses synthetic data to investigate the properties of the model across different data-generating processes, the second application analyzes the predictive gains from sparsification in a forecasting exercise for US data.


INTRODUCTION
This paper deals with estimating vector autoregressive (VAR) models of the following form, where y t = (y 1t , . . ., y mt ) denotes a m-dimensional vector of time series measured in time t = 1, . . ., T, A j ( j = 1, . . ., p) is a (m × m)-dimensional matrix of coefficients associated with the jth lag of y t , C is a m-dimensional intercept vector, and ε t ∼ N (0, Σ) is a Gaussian shock vector with zero mean and a (m×m)dimensional variance-covariance matrix Σ.For further convenience, let a = vec{(A 1 , . . ., A p , C) } denote a vector of dimension k = m(mp + 1) of vectorized coefficients with a i (i = 1, . . ., k) denoting its ith element.This model class has been extensively used for forecasting and policy analysis in central banks (see Alessi et al., 2014) as well as a natural starting point for unveiling stylized time series facts in order to estimate theoretical models (see, inter alia, Hall et al., 2012).Conditional on the first p observations, estimation of the model in Eq. ( 1) can be carried out using ordinary least squares (OLS).In this case, however, overfitting issues arise, translating into imprecise out-of-sample forecasts.As a potential solution, the Bayesian Literature uses informative priors to push the system towards a prior model.For instance, the widely used Minnesota prior assumes that the elements in y t follow a random walk a priori (Doan et al., 1984;Litterman, 1986;Giannone et al., 2015).Theoretically inspired restrictions stemming from structural models can also be used to inform parameter estimates and thus improve inference (Ingram and Whiteman, 1994;Del Negro and Schorfheide, 2004).
The key feature of these priors is that they are conjugate, implying that the likelihood and the prior feature the same distributional from.This yields closed-form solutions for key posterior quantities and, if simulation-based techniques are necessary, greatly improves estimation speed.
In VARs conjugacy requires that each equation in the system features the same set of predictors, potentially leading to model misspecification (see, for example, George et al., 2008;Koop, 2013).This translates into a Kronecker structure in the likelihood, prior, and the resulting posterior distribution, implying that inversion of the posterior variance-covariance matrix of the coefficients is computationally cheap.In contrast, models based on non-conjugate priors allow for different predictors across equations by specifying the prior on the VAR coefficients independently of Σ.This, however, is computationally much more demanding, since the convenient Kronecker structure is lost.1 Apart from reduced flexibility in terms of covariate selection across equations, typical shrinkage priors push many VAR coefficients towards zero.Under continuous shrinkage priors, however, this implies that the probability of observing a coefficient that exactly equals zero is zero (see, for example, Griffin et al., 2010;Carvalho et al., 2010;Bhattacharya et al., 2015;Huber and Feldkircher, 2019;Huber et al., 2019).Using mixture priors that place point mass on zero, leads to a huge dimensional model space that is difficult to explore and convergence is often an issue (see Polson and Scott, 2010).
In recent contributions, Hahn and Carvalho (2015) and Ray and Bhattacharya (2018) propose a way to circumvent insufficient zero shrinkage/variable selection in light of an increasing amount of predictors (i.e. the curse of dimensionality problem).They estimate a large-scale regression model under a suitable shrinkage prior and then post-process a point estimator (the posterior mean) such that the distance 1For recent solutions that allow for estimating large-scale VARs under non-conjugate priors, see Carriero et al. (2019).
between the fit of the model based on the shrinkage prior and a model based on a sparse estimator (i.e. with coefficients set equal to zero) is minimized while accounting for a penalty term that depends on the L1-norm of the coefficients.This approach, labeled decoupled shrinkage and selection (DSS), yields a sparse estimator and is analogous to solving a LASSO-type problem.One key disadvantage, however, is the non-automatic nature of this approach.A semi-automatic approach that is similar in nature is described in Ray and Bhattacharya (2018).In this framework, an optimization problem is solved to efficiently zero out coefficients but only a single tuning parameter needs to be chosen by the researcher.These techniques which combine shrinkage and sparsity have been shown to work well in a wide range of applications ranging from finance (Puelz et al., 2017;2019) to macroeconomics (Huber et al., 2019).
In this paper, we deal with both issues discussed above by proposing a fully conjugate VAR model coupled with the prior proposed in Kadiyala and Karlsson (1997) and Koop (2013), that allows for different covariates across equations, sparsity in terms of the VAR coefficients, and data-based zero restrictions on the covariance parameters in Σ.Instead of post-processing posterior mean/median estimates, we follow Huber et al. (2019) in sparsifying each draw from the posterior distribution, providing a posterior distribution of sparse coefficients.The key advantage is that this significantly reduces estimation uncertainty if the data generating process is sparse.For example, if there is strong evidence that a i equals zero, our proposed framework is capable of selecting this restriction consistently across different draws from the posterior distribution of a i .This implies that the posterior variance of a i , in the limiting case that each draw of a i is set to zero, also equals zero.In terms of forecasting, the reduced estimation uncertainty then translates into more precise predictions, especially in situations where k is large.
The merits of our proposed approach are illustrated by means of two applications.In the first application, we use synthetic data obtained from a set of different data generating processes (DGPs) that differ in terms of sparsity, model size, and number of observations.Across DGPs, we find that i.) our framework successfully detects zero values in both the VAR coefficients and the error variancecovariance matrices and ii.) it outperforms other Bayesian VARs (BVARs) in terms of root mean square errors (RMSEs).In the second application, we forecast US output, inflation, and short-term interest rates using the dataset compiled in McCracken and Ng (2016).The key finding is that sparsification yields more precise point and density forecasts across differing model sizes.
The remainder of the paper is structured as follows.Section 2 introduces the methods to shrink and sparsify Bayesian conjugate VARs and outlines the posterior simulation algorithm.Section 3 discusses model features when applied to synthetic data, section 4 summarizes the results of a forecast exercise with real data.Finally, section 5 concludes the findings of the paper and an appendix provides additional forecast results and details on data.

A Bayesian VAR
Before discussing prior implementation, it is worth noting that Eq. ( 1) can be rewritten as a standard regression model, with x t = (y t−1 , . . ., y t−p , 1) denoting a n(= pm + 1)-dimensional vector of explanatory variables.In terms of full-data matrices Y (with tth row y t ) and X (with tth row x t ), the model reads where A = (A 1 , . . ., A p , C) and E is a (T × m)-dimensional matrix of stacked shocks with tth row given by ε t .
The model in Eq. ( 3) features k parameters in a and w = m(m + 1)/2 free parameters in Σ.If m and p become large, the number of parameters sharply increases, making precise estimation almost impossible.To deal with this issue, Bayesian econometricians rely on informative priors that reduce estimation uncertainty and push the estimates in a and Σ towards some stylized prior model like a random walk.
The general form of the conjugate prior in VARs assumes dependence between a and Σ and is given by where a 0 denotes a k-dimensional prior mean vector and V 0 (δ) is a prior variance-covariance matrix that depends on a lower dimensional set of q hyperparameters in δ.In what follows, we assume that y t is stationary and thus a common choice for the prior mean would be a 0 = 0.For V 0 (δ), we use a variant of the conjugate Minnesota prior (Kadiyala and Karlsson, 1997;Koop, 2013) that can be implemented using a set of dummy observations (Bańbura et al., 2010) that are then concatenated to Y and X: Here, J p = (1, . . ., p) , π is a hyperparameter that determines the prior variance on the intercepts, and φ i (i = 1, . . ., m) represents the prior mean associated with the coefficient on the first, own lag of a given variable (which is consequently set equal to zero).In addition, we let σi denote the OLS variances obtained by estimating m univariate AR(p) models for each element in y t .Finally, θ 1 is a hyperparameter that controls the overall tightness of the prior (i.e.δ = θ 1 ).Lower values of θ 1 imply a stronger prior belief, effectively pushing the elements in a towards a 0 .
The final ingredient is a conjugate prior on Σ.Here, conjugacy implies a prior on Σ that does not depend on a and follows an inverted Wishart distribution: (5) We let s 0 denote prior degree of freedom and S 0 a prior scaling matrix.The main shortcoming of this prior choice is that shrinking specific covariances in Σ to zero is impossible.For instance, even if there exists significant evidence that contemporaneous relations across elements in y t equal zero, this prior is not capable of selecting such restrictions and the resulting posterior estimate of Σ (and of its inverse) will be non-sparse.In the general case (i.e. with any form of the prior hyperparameters), one can show that the conditional posterior of a is given by a|Σ, Y, X ∼ N (vec(A), with Here, we let A 0 denote a (n × m)-dimensional matrix reshaped such that a 0 = vec(A 0 ).Using the Minnesota dummies, the posterior moments can be obtained by applying Theil-Goldberger mixed estimation (Theil and Goldberger, 1961): where Y = (Y , Y) and X = (X, X) denote full-data matrices augmented with dummy observations.Under the prior in Eq. ( 5), the posterior distribution also follows an inverted Wishart distribution, Σ|Y, X ∼ W −1 (s 1 , S 1 ).
The posterior degrees of freedom are denoted by s 1 = T + s 0 and S 1 represents the (m × m)-dimensional posterior scaling matrix, obtained by using the Minnesota-specific dummy observations: A key advantage of conjugacy is the Kronecker structure in Eq. ( 6), which implies that V = Σ ⊗ V is a block-diagonal matrix and computing the inverse or the Cholesky factor is computationally cheap.By contrast, if V were a full (k × k) matrix, computation would quickly become cumbersome and impossible even for moderate values of m and p.One further advantage of the conjugate prior is that the one-stepahead predictive density and the marginal likelihood (ML) are available in closed form (see, for instance, Zellner, 1985).This implies that if interest centers on one-step-ahead forecasts, no posterior simulation is required.22For higher-order forecasts or other quantities such as impulse responses, Monte-Carlo simulation is necessary.
Unfortunately, the conjugate prior has also two important shortcomings.First, each equation must include the same set of covariates (see Eq. ( 2)), a feature that could be unappealing if the researcher wishes to introduce theoretically motivated restrictions across equations.Second, the structure of the prior variance-covariance matrix implies that for each equation j = 1, . . ., m, the prior variance is given by σ 2 j j V 0 (δ), with σ 2 j j denoting the ( j, j)th element of Σ.Thus, the prior variances across equations are forced to be proportional to each other, also limiting the possibility to perform flexible shrinkage in order to select appropriate submodels across equations.

Achieving Sparsity in VAR Models
From a forecaster's perspective, heavily parameterized models, such as large-scale VARs, have another important shortcoming.The continuous shrinkage prior described in subsection 2.1 implies that the probability of observing exact zeros in a equals zero.One could ask whether it makes a big difference to zero out different a i 's as opposed to setting them close to zero.This essentially implies that there exists a lower bound of accuracy one can achieve under the specific prior distribution (Huber et al., 2019).For small-scale systems, this has negligible implications on predictive accuracy.However, if k is large (i.e. of order 1, 000 or 10, 000), parameter uncertainty adds up and potentially dominates the predictive variance.To see this point, notice that under the conjugate prior, the one-step-ahead predictive density follows a multivariate t-distribution (see Koop, 2013) with predictive variance given by: with Var(•) denoting the variance operator, x iT+1 is the ith element of x T+1 and v i j referring to the (i, j)th element in V.
Here, it can be seen that the predictive variance depends on the variance of the reduced-form shocks in ε T+1 and parameter uncertainty arising from the term in the parenthesis.Equation (10) indicates that if n increases, posterior uncertainty rises even if the v i j 's associated with irrelevant predictors are small.This point clearly highlights the difference between sparsity and shrinkage, namely the fact that under a sparse model, v i j would be equal to zero if and only if the relevant predictor is excluded from the model.In the next subsections, we will show how this lower bound on accuracy (determined by small but non-zero values of v i j ) can be removed.
Equation (10), moreover, highlights that the uncertainty associated with coefficients potentially adds up in large-scale models and the variance implied by the reduced-form shocks further influences the predictive variance.Without additional restrictions, these two sources can act in opposite directions.In the case of a large-scale model, the variance-covariance matrix might be underestimated due to overfitting while uncertainty surrounding parameter estimates is too large.The second effect is mainly driven by the fact that in VAR models and with standard macroeconomic datasets, covariates are often highly correlated and this, in combination with insufficient shrinkage, inflates variance estimates of the regression coefficients.Since these two sources play a crucial role in forming accurate forecasts, it is imperative to treat both of those carefully.

Achieving Sparsity on the VAR Coefficients
Since obtaining a sparse representation of a is unfeasible in high dimensions due to the necessity to explore a model space of cardinality 2 k , we follow a different route that combines shrinkage and sparsity.Our approach follows Hahn and Carvalho (2015) and Ray and Bhattacharya (2018) and is based on manipulating an estimator â ex-post by solving the following optimization problem, with Z = (I m ⊗ X), α being a sparse k-dimensional vector and m 2 denoting the Euclidean norm of a vector m.Equation ( 11) consists of two components.The first part measures the Euclidean distance between the fit of an unrestricted model, estimated using the shrinkage prior described in subsection 2.1, and a sparse model determined by α.The second part is a penalty term that penalizes non-zero values in α, with κ j denoting variable-specific penalties.In light of large k (which is almost always the case in moderately sized VARs), choosing the tuning parameters κ j by means of cross-validation becomes computational prohibitive.
To circumvent the necessity to employ cross-validation, we adopt the signal adaptive variable selection (SAVS) estimator proposed in Ray and Bhattacharya (2018).We rewrite Eq. ( 11) in terms of the jth column of Z, Z j , and solve the optimization problem in Eq. ( 11) for each covariate individually, adopting the coordinate descent algorithm (Friedman et al., 2007).This yields the following solution to the optimization problem in Eq. ( 11),3 for j = 1, . . ., k, with sign(c) returning the sign of a real number c and c + = max{c, 0}.
We set the penalty term as follows: which depends on the non-sparse estimate âj and two hyperparameters λ > 0 and ζ ≥ 1. Setting ζ ≥ 1 implies that smaller values of âj receive a larger penalty and are likely to be zeroed out by the SAVS algorithm.
A typical choice, proposed in Ray and Bhattacharya (2018), sets λ = 1 and ζ = 2.The approach stipulated in Hahn and Carvalho (2015) is obtained by setting ζ = 1 while inferring λ by visually inspecting the posterior output.More specifically, Hahn and Carvalho (2015) suggest choosing λ such that the variation-explained by a sparsified linear predictor (which is akin to a standard R 2 ) statistically equals the variation-explained of the non-sparsified model.If the researcher is interested in forecasting, this procedure has to be repeated sequentially over a hold-out period, and its non-automatic nature becomes problematic.
The prior variance under the traditional Minnesota prior becomes smaller for coefficients associated with higher order lags of y t while, for the ith equation, it remains only mildly informative for the first lag of y it .The conjugate prior we use in this paper is not capable of discriminating between coefficients related to y it and y jt for i j and thus treats everything symmetrically.In this paper, we modify the SAVS estimator in order to allow for asymmetric treatment of "own" and "foreign" lags of y t within equation i.
In what follows, we replace λ with a lag-wise parameter that increases the weight associated with coefficients on higher order lags of y t and impose a stronger penalty on coefficients related to lags of other variables in y jt within equation i and for i j.Moreover, we do not sparsify the diagonal elements of A 1 .These parameters are specified for each A l such that for l = 1, . . ., p; i = 1, . . ., m; j = 1, . . ., m.Here, we assume that λ is some lag-invariant scaling parameter and λ l,i j increases quadratically with the lag order.Note that for the first, own lag of a given equation, we set the penalty equal to zero, capturing the notion that this covariate is crucial and should never be set equal to zero (Bańbura et al., 2010).For coefficients on lags of other variables we increase the penalty slightly by multiplying λ with l 2 instead of (l − 1) 2 .4

Sparsification on the Variance-Covariance Matrix
Up to this point, we focused attention on obtaining a sparse representation of the VAR coefficients.In large dimensions, Σ also contains w free elements and, without using more sophisticated shrinkage techniques, the existing estimate would be prone to overfitting.As a potential remedy, we propose post-processing the estimates of the precision matrix Σ −1 (i.e. the inverse of Σ).Friedman et al. (2008) and, more recently, Bashir et al. (2018) propose methods to ex-post sparsify precision matrices using the graphical lasso.We follow this literature and specify a loss function similar to Eq. ( 11) that aims at striking a balance between model fit and parsimony.Let Ω be a sparse estimate of Σ −1 with elements given by ω i j .The loss function is then given by: with Ŝ denoting an estimate of the precision matrix, ρ i j referring to a parameter-specific penalty and log(det(•)) being the log-determinant while tr(•) denotes the trace of a square matrix.The term tr Ω Ŝ − log (det (Ω)) measures the (negative) expected fit whereas i j ρ i j |ω i j | constitutes a penalty term that penalizes non-zero precision parameters in Ω. Similarly to Eq. ( 11), Eq. ( 15) aims to find a sparse precision matrix that describes the data well while being parsimonious.5Optimizing Eq. ( 15) is challenging and suitable penalty parameters need to be defined.We follow Friedman et al. (2008) in adopting the coordinate descent algorithm and state Eq. ( 15) as a set of independent soft-threshold problems which can be solved for each off-diagonal element, respectively.
4In the empirical application, we specify the penalty on the intercept term equal to zero.5Note that, if ω * i j with i j, the (i, j)th element of Ω * , is set to zero, then y it and y jt exhibit no contemporaneous relationship.
To determine the penalty parameter, Friedman et al. (2019) use: with | ŝij | denoting the absolute size of the (i, j)th element of Ŝ and is a scalar penalty parameter while κ ≥ 1 controls the penalty on small precision parameters.Eq. ( 16) nests the specification stipulated in Bashir et al. (2018) if we set κ = 1, ŝij to an initial estimate of the (i, j)th element of the precision matrix, and cross-validate .

Posterior Inference with SAVS
Before discussing our posterior simulation algorithm, it is worth noting that up to this point, the different sparsification techniques have been proposed such that some estimate (i.e. the posterior mean/median) is used and then ex-post sparsified.This technique provides a sparse point estimator of a and Σ but is not capable of controlling for posterior uncertainty conditional on zeroing out the v i j 's.
Following Huber et al. (2019), we sparsify each draw from the joint posterior distribution of a and Σ.This yields a draw from the joint posterior over sparsified coefficients given by In the spirit of Bayesian model averaging (BMA), as an additional inferential opportunity, our approach also allows to estimate a posterior inclusion probability that a given element in â * and some off-diagonal element of Σ * , σ * i j , is non-zero by computing We let R denote the total number of retained draws, the superscript (r) represents the rth draw and d(•) denote an indicator function that equals unity if its argument is true.Equations ( 18) and ( 19) allow us to draw inferences on what covariates to include and the specific form of the relations between the reducedform shocks.It is noteworthy that if p a,i = 0, the corresponding posterior mean is also sparse.One key advantage of sparsifying each draw from the posterior distribution relative to the posterior mean/median is that if p ai = 0, the corresponding posterior of â * i will display no posterior uncertainty since all draws are equal to zero.

Posterior Simulation
Our direct Gibbs sampling algorithm consists of the following steps: For r = 1, . . ., R, repeat the following steps and discard the first M draws as burn-in 1. Sample a (r) from N (a, Σ (r−1) ⊗ V) with a = vec(A) and V.6 Alternatively, a ( j) could also be simulated from a multivariate t-distribution, obtained after integrating out Σ analytically.
Our MCMC scheme yields draws from p( â * , Σ * |Y, X).This distribution can then be used to compute predictive densities, impulse responses, historical decompositions, among other objects of interest.As mentioned in the introductory section, implementing the SAVS algorithm within a wider MCMC algorithm potentially implies that point estimators such as the posterior mean of â * and Σ * are non-sparse.However, this strongly depends on the information contained in the posterior distribution; if there is significant information that a given coefficient is equal to zero, the corresponding point estimator of the sparsified coefficient could also be exactly zero.

SIMULATION-BASED EVIDENCE
We use synthetic data and a set of different data generating processes (DGPs) that vary in terms of dimension (m ∈ {3, 7, 30}), length of the time series (T ∈ {80, 240}) and whether the model is sparse, moderately sparse or dense to analyze whether sparsification improves estimation accuracy.Without loss of generality, all simulated VAR models feature one lag (p = 1), with the coefficient matrix set such that diag(A 1 ) = 0.4 × I m and off-diagonal elements being drawn from N (0, 0.4 2 ) in the case that m ∈ {3, 7} and, for stability reasons, from N (0, 0.1 2 ) for m = 30.Similarly, the non-zero off-diagonal elements of the lower Cholesky factor of Σ are are sampled from N (0, 0.4 2 ).We obtain DGPs that feature different levels of sparsity by randomly zeroing out the off-diagonal elements of A 1 and the lower Cholesky factor of Σ to obtain three levels of sparsity.The moderately dense model features around 40% zeroes in the coefficients while the moderately sparse model features around 60% zeroes.Finally, we also consider an extremely sparse DGP with approximately 90% zeroes, which roughly corresponds to the level of sparsity of a large-sized VAR model.
To assess the sensitivity of the results with respect to different choices of λ and , we compute a range of sparse models and benchmark it to the non-sparse competitor.This non-sparse competitor is a Minnesota-prior BVAR with hyperparameters obtained by optimizing the marginal likelihood of the model over a grid (see, for example, Carriero et al., 2019).
6Note that if m and n is large, computation of the Kronecker product turns out to be computationally challenging.To speed up computation, we use the result that (N ⊗ M)vec(R) = vec(M RN ), with N, N, and R being conformable matrices.
Tab. 1: MAE ratios of coefficients and covariances to non-sparse BVAR estimates.Notes: Bold numbers indicate the smallest MAE ratio.We simulate a DGP for a small-scale (m = 3), medium-scale (m = 7) and large-scale (m = 30) VAR for two different number of observations T and for three different degrees of sparsity (zero parameters as percentage of total number of coefficients k = m(mp + 1) and covariances w = m(m + 1)/2, ranging from a dense DGP to a fully sparse DGP.Coefficients of the diagonal of the first lag a non-sparse, as well as the diagonal elements in the variance covariance matrix.
Table 1 shows (relative) mean absolute error (MAE) between the posterior median of the coefficients for the sparsified BVAR and the true parameter values, averaged across 50 replications per DGP.All MAEs are divided by the MAEs of a Minnesota BVAR without using any form of sparsification.The upper panel presents the results for the VAR coefficients while the lower panel of Tab. 1 displays the MAEs associated with the covariance parameters.In order to investigate how differing values of λ and impact estimation accuracy, we also estimate the model over a grid of values for λ ∈ {.005, 0.01, 0.05, 0.1, 0.5, 1} and set = λ/10.Before proceeding, it is worth noting that we estimate all VAR models with a single lag.
Considering the upper panel of Tab. 1, a few results are worth emphasizing.First, we observe that sparsification pays off in terms of achieving lower estimation errors.This improvement strongly depends on the true level of sparsity, with strong accuracy gains if the DGP is very sparse and the data sample is short.Especially when T is small relative to the number of parameters, sparsification improves against the traditional Bayesian VAR model.
Second, estimation accuracy strongly depends on the specific choice of λ.While we observe accuracy gains for all values of λ (as indicated by ratios smaller than unity), the accuracy gains tend to grow with λ (for the standard choice λ = 1 proposed in Ray and Bhattacharya (2018), improvements in estimation accuracy tend to be largest).
Third, for large models we find that the SAVS estimator yields substantial gains, improving upon the shrinkage-only estimator by large margins.These improvements even arise if the DGP is characterized by relatively few zeros in the VAR coefficients.This finding is not surprising given the fact that the absolute number of zeros increases with the dimension of the parameter space and the small but negligible posterior estimates under a Minnesota-BVAR have a detrimental effect on estimation accuracy.
The lower panel of Tab. 1 provides qualitatively similar insights.Sparsification of the variancecovariance matrix might yield accuracy improvements over its non-sparsified counterpart that range from being small (or in some rare cases even negative) to very large (in the case the DGP is sparse and the model is moderately large).We conjecture that the somewhat smaller improvements in predictive accuracy arise from the Wishart-distributed prior imposed on Σ −1 .This prior, by construction, is not capable of discriminating between relevant and irrelevant covariance parameters since it uniformly pushes the posterior estimate of Σ −1 towards a diagonal matrix.However, given that this prior is simple and conjugate, post-processing the posterior draws of Σ seems to further improve estimates at little additional computational costs.
We stressed one key advantage in subsection 2.2, namely that sparsification reduces estimation uncertainty by zeroing out the coefficient under scrutiny during posterior simulation.Thus, while the discussion in the previous two paragraphs highlights that using sparsification improves estimation performance in terms of point estimators, taking into account higher order moments of the posterior distribution of the coefficients yields further performance improvements.
To investigate whether using sparsification also leads to a more favorable bias-variance relationship, Fig. 1 depicts heatmaps that show the absolute distance between the posterior median and the true coefficients (upper panel) as well as the corresponding posterior standard deviation of the parameters (lower panel) for a single realization of the sparse DGP.
Considering these heatmaps reveals that sparsification improves estimation accuracy and accurately detects zeroes, as evidenced by the abundance of white cells in Fig. 1(a).The slight bias along the main diagonal (which also exists under the shrinkage-only model) stems from the informative prior that is centered around zero.However, note that even with a high degree of shrinkage, the corresponding estimate of a with the Minnesota prior is quite dense.By contrast, applying SAVS yields a very sparse coefficient matrix and, in addition, a sparsified estimate of the variance-covariance specification.
The accuracy gains in terms of producing point estimators is mirrored in the lower panel of Fig. 1.There, we show the posterior standard deviations of a and â * .It is noteworthy that white cells imply that the posterior standard deviation is zero, which is often the case for the Minnesota prior coupled with SAVS.These white cells often correspond to the white cells in panel (a) of the figure, indicating that if SAVS successfully detects that a predictor is irrelevant, it also excludes this covariate across all iterations of the algorithm, thereby setting the posterior variance to zero.This feature is crucial for accurate density predictions.

Design of the Forecasting Exercise and Competitors
In order to emphasize the advantages for forecasters to correctly detect sparsity, our second application focuses on a real data exercise, using the quarterly dataset described in McCracken and Ng (2016) spanning from 1959:Q1 up to 2018:Q4.To investigate whether shrinkage and sparsity pays off in predictive performance, we split the sample into two parts.The first part, called the training sample, runs from 1959:Q1 to 1989:Q4.This initial period is used to compute the h-step-ahead predictive distribution (for h ∈ {1, 4, 8}).After obtaining the predictive density for all h, we expand the initial estimation period by one quarter (i.e. to 1990 :Q1) and repeat this procedure until we reach the penultimate point the sample (i.e.2018 :Q3).The period from 1990 :Q1 to 2018 :Q4 is consequently labeled the hold-out or verification period.For each quarter in the hold-out period, we evaluate the predictive accuracy of the models using root-mean-squared forecast errors (RMSEs) and log predictive scores (LPSs).
The existing literature highlights the necessity to exploit large information sets (see, for instance, Bańbura et al., 2010;Carriero et al., 2015;Giannone et al., 2015;Koop, 2013).Building on this evidence, we apply our techniques to a VAR model that features m = 165 macroeconomic and financial variables.Out of these, we select three traditional target variables, namely output (GDPC1), consumer price inflation (CPIAUCSL) and the Federal Funds Rate (FEDFUNDS).Apart from this large-scale VAR, we investigate how our techniques perform across different model sizes and dimension-reduction techniques.These competing models range from small to medium-scale VARs to dynamic factor models in the spirit of Bernanke et al. (2005).These competing approaches are:7 • SMALL: This specification is inspired by the literature using small-scale three equation VAR models that feature the three target variables exclusively.
• MEDIUM: This model extends the small-scale model by additionally including financial market variables.In total, this model includes m = 21 variables.
• FAVAR: As a competitor that exploits the full information set but reduces the dimensionality of the estimation problem, we use a factor-augmented VAR (FAVAR).This model augments the smallscale VAR by including three principal components extracted from the remaining quantities (see Bańbura et al., 2010;Koop, 2013).

Choice of hyperparameters
Since we use VAR models that do not only differ in the size of their information sets, but also how these information is used during estimation, careful choice of the prior hyperparameters is necessary.Using the prior outlined above, we need to set θ 1 as well as the sparsification parameters λ and .
To set the hyperparameters of the Minnesota prior, we follow two different routes.The first (and simplest) way, is to set θ 1 such that shrinkage increases with the size of the information set (see, for instance, Bańbura et al., 2010;Koop, 2013) and select θ 1 on a grid of potential values.
The second approach is based on optimizing the marginal likelihood (which is available in closed form) a priori (see Carriero et al., 2015).One problem with this strategy, however, is that the marginal likelihood might be ill-behaved, which renders optimization difficult.8Since optimizing the marginal likelihood for the large model is often unfeasible, we assess the sensitivity of the forecasts with respect 7In Appendix B we show a detailed list of the variables included along the transformation codes.8To circumvent this issue Bańbura et al. (2010) and Koop (2013), for example, define a training sample which serves the purpose to calibrate θ 1 by minimizing the distance of the mean square error (MSE) of a large-scale model and a three-variable to three choices of the hyperparameters θ 1 = {0.025,0.5, 0.75}.Using these three values enables us to assess how shrinkage and sparsification interact.For example, if θ 1 is set to 0.025 it is very likely that the SAVS estimator will lead to a sparse model while a shrinkage parameter θ 1 = 0.75 allows for larger elements in a and thus a more dense model under the SAVS estimator.
For the small-and medium-scale models and the FAVAR specification, we define a large grid of for θ 1 ∈ {0.01, 0.025, 0.050, 0.075, 0.10, 0.125, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.75, 1, 2, 5}.Using this grid, we seek the value of θ 1 that maximizes the marginal likelihood.Over the hold-out sample, this procedure yields an average of θ 1 = 0.352 for the small-scale model, θ 1 = 0.162 for the FAVAR specification and θ 1 = 0.154 for the medium-scale model.These values indicate that the larger the model becomes, the more weight needs to be placed on the prior.Similar to Carriero et al. (2015), we find that the hyperparameters tend to display little variation over the hold-out period.For example, in the case of the medium-scale model we find that θ 1 ranges from 0.15 to 0.2.

Results
In this section, we start by considering point forecasting accuracy of the different models and by comparing sparse with non-sparse models.Table 2 depicts the relative RMSEs to a small-scale VAR with a Minnesota prior and without sparsification for the three target variables.

Point forecasts
From the Tab.2, it is visible that there is no single best modeling approach that outperforms all competing alternatives.Model performance is thus strongly dependent on both the variable, one wishes to forecast, and the forecast horizon.Hence, we discuss each case (variable and horizon) separately and focus on the relative performance of the competing models and on whether sparsification pays off for forecasting.
For output and one-step-ahead forecasts, we observe that all large models display relative RMSEs close to one, suggesting that their predictions are comparable to the ones obtained from using the simple small-scale Bayesian VAR.This finding, to some degree, carries over to the FAVAR model.Irrespective of the choice of λ, no single model improves upon the benchmark by more than eight percent in RMSE terms.Turning to the medium-scale models, we find slightly more pronounced improvements relative to the benchmark model, with model performance depending on the specific value of λ.For λ = 0.05, the medium-sized model improves upon all competing models in terms of GDP forecasting.When we consider the accuracy premium obtained from using the SAVS step, in most cases we can find at least a value of λ that improves upon the shrunk but not sparsified model.This difference is very small for the three-equation VAR but quite pronounced in the case of the medium-sized model and for some values of λ.
Considering the one-step-ahead forecasts of inflation, it appears that none of the competing models is capable of improving upon the benchmark small-scale VAR.All relative RMSEs are close to one and, VAR estimated with OLS.Intuitively, this strategy implies that large dimensional models are shrunk to a larger degree than smaller-scale models (De Mol et al., 2008). in fact, often slightly exceed unity.Notice, however, that for the medium-sized model and the FAVAR specification, introducing sparsity through our SAVS estimator sometimes yields more accurate short-run predictions.
The one-quarter-ahead interest rate forecasts reveals two insights.First, we find that medium-to largesized models yield forecasts that tend to more accurate than the ones obtained from the benchmark VAR.This result is particularly pronounced for the moderately-sized VAR and λ = 0.5.Second, comparing sparse with non-sparse shows that, for interest rates, introducing sparsity pays off markedly.This is visible for the small-scale model and λ = 0.5, with RMSEs being over 20 percent lower as compared to the shrinkage-only case.The largest gains, as expected, can be obtained in situations with increased model size.For the medium and large models, the accuracy of sparsification are substantial.We conjecture that these large increases in predictive performance are due to the zero lower bound, which is a prominent feature in our hold-out sample.Methods that shrink but do not sparsify yield forecasts that are non-zero and thus produce larger forecast errors as compared to models that zero out the interest rate equation and thus predict that short-term rates are zero in the next period.This is the main source of the strong accuracy gains obtained by applying the SAVS step.
Next, we consider the one-year-ahead forecasts.For longer-run forecasts, a qualitatively similar picture arises across the three target variables.For output, the medium-sized model provides a good forecasting performance, while for inflation the small model is very difficult to beat.Turning to short-term interest rates, we again observe that a.) moderately-sized VARs with around 20 endogenous variables work well and b.) accuracy gains from introducing sparsity tend to be substantial.
Finally, we consider two-years-ahead forecasts.For this forecast horizon, the main results that hold for one-and four-steps-ahead forecasts do not apply anymore.Instead of observing a forecasting performance that appears to be slightly tilted towards smaller models with sparsification, two-years-ahead predictions profit to some extent from larger information sets.Moreover, at this forecast horizon we find only little evidence that sparsification improves point predictions: we observe slight gains for output and inflation, but these gains appear to be quite small, ranging from almost three percent (for inflation and the large model and λ ∈ {0.1, 0.5}) to around 5.5 percent (for output and the FAVAR specification).
Tab. 2: RMSE ratios relative to the small-scale BVAR with a Minnesota prior.

Density forecasts
When focusing on point forecasting accuracy we necessarily disregard any information on how well the different models capture higher order moments of the predictive distribution.However, the discussion in subsection 2.2 suggests that applying SAVS might improve density forecasts by zeroing out irrelevant throughout MCMC sampling, eventually exerting a positive effect on the full predictive distribution.
Table 3 shows differences in LPSs relative to the small-scale Minnesota VAR (henceforth labeled log predictive Bayes factors (BFs)).Numbers greater than zero imply that a given model improves upon the benchmark while negative values suggest that the benchmark yields more precise density predictions.Instead of focusing exclusively on how well a given model predicts the three focus variables, this table also shows joint BFs over the three target variables, providing a general measure on how well some approach performs in forecasting output, inflation and interest rates jointly.
Starting with the joint predictive BFs and the one-step-ahead horizon, it is worth noting that these numbers can be interpreted as a training sample likelihood (Geweke and Amisano, 2010).For this measure, we find that the best performing model is the moderately-sized sparse VAR with λ = 0.05, suggesting that using SAVS improves density predictions.At the one-step-ahead horizon, varying λ between 0.005 and 0.1 yields a similar forecasting performance.If λ is set too large, forecasting accuracy drops markedly.This can be traced back to the fact that a larger penalty leads to an overly sparse model and this, in turn, decreases the predictive variance too much.With this too narrow predictive distribution, capturing outliers becomes increasingly difficult, hampering overall forecasting performance.
Note that the different large models perform poorly when all three focus variables are considered jointly.To investigate why this is the case, we can look at the marginal predictive BFs.Considering the density forecasting performance for output, we find that the model which does well for the joint predictive BF also excels (i.e. the medium-size VAR with λ = 0.05).For inflation, the best models are the class of FAVARs that exploit the full information set but use dimension reduction.Here, we find only very little gains from using sparsification, with the non-sparse and sparse models (with λ = 0.005) delivering almost identical density forecasts.All the large models perform poorly in predicting inflation.For interest rates, we interestingly find that the best models by far are the large-sized VARs with sparsification.The reasoning is similar to the logic described above in the case of the point forecasts: zeroing out most predictors (or even all of them) implies that during the zero lower bound, our model would predict an interest rate close to (or exactly equal to) zero.Since SAVS with a moderately large penalty parameter λ also reduces the predictive variance sharply, the LPS also displays a pronounced increase and model evidence in favor of the sparse large VAR increases.
After discussing all three marginal predictive BFs we find that large models yield competitive output and very strong interest rate forecasts while inflation forecasts are highly imprecise.Hence, if the agent places little weight on how well our model predicts inflation, the large VARs appear to be competitive.
The finding that large models perform poorly in inflation forecasting while they do well for the remaining two variables carries over to four-and eight-steps-ahead forecast distribution.In fact, for these two forecast horizons we observe that large models with SAVS yield excellent output and interest rate forecasts.Inflation forecasts, however, are among the most imprecise over all models considered.Finally, Fig. 2 shows the evolution of joint log predictive BFs for one-step-ahead predictions over time.The figure contrasts standard BVARs without SAVS (dashed lines) and BVARs post-processed with an additional SAVS step (solid lines) for all considered information sets.For the sparsified models, and for sake of brevity, we show models with the λ that maximizes the LPSs at the end of the hold-out within each model class (see Tab. 3).Moreover, for the large-scale BVAR we depict the evolution of BFs for the different values of θ 1 ∈ {0.025, 0.05, 0.075}.
Here, a few points are worth discussing.First, for the one-step-ahead horizon, we see that the mediumscale BVAR with the additional SAVS step performs extremely well, especially during the first part of the hold-out.The performance of the large model with θ 1 = 0.075 is also strong during that period of time; however, during the global financial crisis (GFC) we observe a pronounced decline in predictive model evidence for the large model with SAVS.During the GFC, evidence in favour of sparsification also decreases for the medium-scale model.This can be seen by comparing the solid and dashed red lines.While the non-sparse medium-sized BVAR is outperformed in the run-up to the crisis, model evidence change during the recession and strongly supports the non-sparse variant.The main reason why predictive accuracy of sparsified models deteriorates in turbulent periods is that the forecast error variance becomes too small and large shocks become increasingly unlikely under this predictive distribution.Additionally, using dense models in the sense that a large number of covariates is included is typically subject to overfitting and this increases the predictive variance.During crisis episodes this helps because the likelihood of observing outlying values is increased.After the GFC, we observe that applying the SAVS step again improves predictions.In particular, the medium-scale BVAR and the large-scale model with θ 1 substantially pick up in density forecast performance.This discussion highlights that during tranquil periods, in which large shocks have a low probability to occur, applying the SAVS step helps predictive accuracy.This can be attributed to the fact that the predictive distribution is quite narrow, which helps in times of low and stable volatility of macroeconomic fundamentals.In recessions, by contrast, predictions from sparse models are too conservative, hurting the density forecast performance.

CONCLUSIONS
This paper proposes methods to shrink and sparsify VAR models with conjugate priors.The main feature of our SAVS approach is that we post-process each draw from the joint posterior by solving an optimization problem to search for a sparse coefficient vector.Without destroying the conjugacy of the model, this approach allows for different predictors across the equations in the VAR.And, instead of pushing coefficients close to zero, our approach introduces exact zeros, removing the lower bound an accuracy one can achieve under a popular shrinkage prior in the Minnesota tradition.Since the error covariance matrix in large VARs also feature a large number of coefficients, we adapt techniques from the literature on graphical models to obtain a sparse estimate of the variance-covariance matrix of the system.Using a synthetic and real-data application, we illustrate the merits of combining shrinkage and sparsification in large multivariate models.
Tab. 3: Bayes factors relative to the small-scale BVAR with a Minnesota prior.q q q q q q Small−scale q q q q q q Small−scale

B. DATA DESCRIPTION
Here, we provide detailed information on the transformation applied for each variable, as we transform the data to stationarity, according to the suggestions of McCracken and Ng (2016).With stationary data the prior is centered on zero, assuming a white noise process for each variable a priori.Moreover, we standardise the data by demeaning each variable and dividing through the standard deviation.Due to the scale-variance of PCs the data is also standardised before extracting the factors.
draws from p( â * , Σ * |a, Σ, Y, X) can then be used to compute highly non-linear functions of the parameters such as impulse responses or predictive distributions.

Fig. 2 :
Fig. 2: Cumulative joint log-predictive Bayes factors for one-step-ahead predictions benchmarked against the small-scale BVAR without SAVS.Dashed lines indicate classic BVARs while solid lines depict the best performing sparsified version within each information set.Gray shaded areas denote NBER recessions.
Fig. A.1: Cumulative joint log-predictive Bayes factors for four-and eight-step-ahead predictions benchmarked against the small-scale BVAR without SAVS.Dashed lines indicate classic BVARs while solid lines depict the best performing sparsified version within each information set.Gray shaded areas denote NBER recessions.
Fig. A.2: Cumulative marginal log-predictive Bayes factors benchmarked against the small-scale BVAR without SAVS.Dashed lines indicate classic BVARs while solid lines depict the best performing sparsified version within each information set.Gray shaded areas denote NBER recessions.
Bold numbers indicate highest Bayes factors for each horizon and target variable (and therefore best performing models over the full hold-out sample in terms of density forecasts).