Sparse Bayesian vector autoregressions in huge dimensions

Correspondence: Gregor Kastner, Institute for Statistics and Mathematics, WU Vienna University of Economics and Business, Welthandelsplatz 1, 1020 Vienna, Austria. Email: gregor.kastner@wu.ac.at

Abstract: We develop a Bayesian vector autoregressive (VAR) model with multivariate stochastic volatility that is capable of handling vast dimensional information sets. Three features are introduced to permit reliable estimation of the model. First, we assume that the reduced-form errors in the VAR feature a factor stochastic volatility structure, allowing for conditional equation-by-equation estimation. Second, we apply recently developed global–local shrinkage priors to the VAR coefficients to cure the curse of dimensionality. Third, we utilize recent innovations to sample efficiently from high-dimensional multivariate Gaussian distributions. This makes simulation-based fully Bayesian inference feasible when the dimensionality is large but the time series length is moderate. We demonstrate the merits of our approach in an extensive simulation study and apply the model to US macroeconomic data to evaluate its forecasting capabilities.


INTRODUCTION
Previous research has identified two important features that macroeconometric models should possess: the ability to exploit high-dimensional information sets (Bańbura et al., 2010; Koop et al., 2019; Rockova & McAlinn, 2017; Stock & Watson, 2011) and the possibility to capture nonlinear features of the underlying time series (Bitto & Frühwirth-Schnatter, 2019; Clark, 2011; Clark & Ravazzolo, 2015; Cogley & Sargent, 2001; Primiceri, 2005). While the literature suggests several paths to estimate large models, the majority of such approaches imply that once nonlinearities are taken into account, analytical solutions are no longer available and the computational burden becomes prohibitive. This implies that high-dimensional nonlinear models can practically be estimated only under strong (and often unrealistic) restrictions on the dynamics of the model. However, especially in forecasting applications or in structural analysis, successful models should generally be able to exploit much information and also control for breaks in the autoregressive parameters or, more importantly, changes in the volatility of economic shocks (Koop et al., 2009; Primiceri, 2005; Sims & Zha, 2006). Two reasons limit the use of large (or even huge) nonlinear models. The first reason is statistical. Since the number of parameters in a standard vector autoregression rises quadratically with the number of time series included and commonly used macroeconomic time series are rather short, in-sample overfitting turns out to be a serious issue. As a solution, the Bayesian literature on vector autoregressive (VAR) modeling (e.g., Ankargren et al., 2019; Bańbura et al., 2010; Clark, 2011; Clark & Ravazzolo, 2015; Doan et al., 1984; Follett & Yu, 2019; George et al., 2008; Koop, 2013; Litterman, 1986; Sims & Zha, 1998) suggests shrinkage priors that push the parameter space towards some stylized prior model like a multivariate random walk. On the other hand, Ahelegbey et al.
(2016) suggest viewing VARs as graphical models and perform model selection drawing from the literature on sparse directed acyclic graphs. This typically leads to much improved forecasting properties and more meaningful structural inference. Moreover, the majority of the literature on Bayesian VARs imposes conjugate priors on the autoregressive parameters, allowing for analytical posterior solutions and thus avoiding simulation-based techniques such as Markov chain Monte Carlo (MCMC). Frequentist approaches often consider multistep approaches (e.g., Davis et al., 2016).
The second reason is computational. Nonlinear Bayesian models typically have to be estimated by means of MCMC, and computational intensity increases vastly when the number of component series becomes large. This increase stems from the fact that standard algorithms for multivariate regression models call for the inversion of large covariance matrices. Especially for sizable systems, this can quickly turn prohibitive since the inverse of the posterior variance-covariance matrix on the coefficients has to be computed for each sweep of the MCMC algorithm. For natural conjugate models, this step can be vastly simplified because the likelihood possesses a convenient Kronecker structure, implying that all equations in the VAR feature the same set of explanatory variables. This speeds up computation by large margins but restricts the flexibility of the model. Carriero et al. (2016), for instance, exploit this fact and introduce a simplified stochastic volatility specification. Another strand of the literature augments each equation of the VAR by including the residuals of the preceding equations (Carriero et al., 2019), which also provides significant improvements in terms of computational speed. Finally, in a recent contribution, Koop et al. (2019) reduce the dimensionality of the problem at hand by randomly compressing the lagged endogenous variables in the VAR.
All papers mentioned hitherto focus on capturing cross-variable correlation in the conditional mean through the VAR part, and the comovement in volatilities is captured by a rich specification of the error variance (Primiceri, 2005) or by a single factor (Carriero et al., 2016). Another strand of the literature, typically used in financial econometrics, utilizes factor models to provide a parsimonious representation of a covariance matrix, focusing exclusively on the second moment of the predictive density. For instance, Pitt and Shephard (1999) and Aguilar and West (2000) assume that the variance-covariance matrix of a broad panel of time series might be described by a lower dimensional matrix of latent factors featuring stochastic volatility and a variable-specific idiosyncratic stochastic volatility process. 1 The present paper combines the virtues of exploiting large information sets and allowing for movements in the error variance. The overfitting issue mentioned above is solved as follows. First, we use a Dirichlet-Laplace (DL) prior specification (see Bhattacharya et al., 2015) on the VAR coefficients. This prior is a global-local shrinkage prior in the spirit of Polson and Scott (2011) that enables us to heavily shrink the parameter space but at the same time provides enough flexibility to allow for nonzero regression coefficients if necessary. Second, a factor stochastic volatility model on the VAR errors grants a parsimonious representation of the time-varying error variance-covariance matrix of the VAR. To deal with the computational complexity, we exploit the fact that, conditionally on the latent factors and their loadings, equation-by-equation estimation becomes possible within each MCMC iteration. 
Moreover, we apply recent advances for fast sampling from high-dimensional multivariate Gaussian distributions (Bhattacharya et al., 2016) that permit estimation of models with hundreds of thousands of autoregressive parameters and an error covariance matrix with tens of thousands of nontrivial time-varying elements on a quarterly US data set in a reasonable amount of time. In a careful analysis, we show to what extent our proposed method improves upon a set of standard algorithms typically used to simulate from the joint posterior distribution of large-dimensional Bayesian VARs.
We first assess the merits of our approach in an extensive simulation study based on a range of different data-generating processes (DGPs). Relative to a set of competing benchmark specifications we show that, in terms of point estimates, the proposed global-local shrinkage prior yields precise parameter estimates and successfully introduces shrinkage in the modeling framework, without overshrinking significant signals.
In an empirical application, we adopt a modified version of the quarterly data set proposed by Stock and Watson (2011) and McCracken and Ng (2016). To illustrate the out-of-sample performance of our model, we forecast important economic indicators such as output, consumer price inflation, and short-term interest rates, amongst others. The proposed model is benchmarked against several alternatives. Our findings suggest that it performs well in terms of one-step-ahead predictive likelihoods. In addition, investigating the time profile of the cumulative log-predictive likelihood reveals that allowing for large information sets in combination with the factor structure especially pays off in times of economic stress.
The remainder of this paper is structured as follows. Section 2 introduces the econometric framework. Section 3 details the Bayesian estimation approach, including an elaborated account of the (shrinkage) prior setup adopted and the corresponding conditional posterior distributions. Section 4 provides an analysis of the computational gains of our algorithm relative to a set of established algorithms. Section 5 presents the results of an extensive simulation study comparing the performance of carefully selected shrinkage priors for different time series lengths and model dimensions within various (sparse and dense) data-generating scenarios. Section 6, after giving a brief overview of the data set used along with the model specification, illustrates our modeling approach by fitting a single-factor model to 215-dimensional quarterly US data. Moreover, we perform a forecasting exercise to assess the predictive performance of our approach and discuss the choice of the number of latent factors. Finally, Section 7 concludes.

ECONOMETRIC FRAMEWORK
Suppose interest centers on modeling an m × 1 vector of time series denoted by y_t with t = 1, …, T. We assume that y_t follows a heteroskedastic VAR(p) process:²

y_t = A_1 y_{t−1} + … + A_p y_{t−p} + ε_t, (1)

where each A_s (s = 1, …, p) is an m × m matrix of autoregressive coefficients. The error term ε_t is assumed to follow a multivariate Gaussian distribution with time-varying variance-covariance matrix Σ_t. To permit reliable and parsimonious estimation when m is large, we decompose the residual covariance matrix into

Σ_t = Λ V_t Λ′ + Ω_t, (2)

where both Ω_t = diag(σ²_{1t}, …, σ²_{mt}) and V_t = diag(e^{h_{1t}}, …, e^{h_{qt}}) are diagonal matrices of dimension m and q, respectively, and Λ denotes an m × q matrix of factor loadings with typical element Λ_{ij} (i = 1, …, m; j = 1, …, q). The logarithms of the diagonal elements of Ω_t and V_t follow AR(1) processes:

h_{jt} = φ_{hj} h_{j,t−1} + e_{hj,t}, (3)
log σ²_{it} = μ_i + φ_i (log σ²_{i,t−1} − μ_i) + e_{i,t}. (4)

To identify the scaling of the elements of Λ, the process specified in Equation (3) is assumed to have mean zero, while μ_i in Equation (4) is the unconditional mean of the log-elements of Ω_t to be estimated from the data (cf. Kastner et al., 2017). The parameters φ_{hj} and φ_i are a priori restricted to the interval (−1, 1) and denote the persistence of the latent log-variances. The error terms e_{hj,t} and e_{i,t} constitute independent zero-mean Gaussian innovations with variances σ²_{hj} and σ²_i, respectively. This specification implies that the volatilities are mean reverting and thus bounded in the limit.

² For simplicity of exposition, we omit the intercept term in the following discussion (which we nonetheless include in the empirical application).
This error structure is known as the factor stochastic volatility model (see, e.g., Aguilar & West, 2000; Pitt & Shephard, 1999). It can be equivalently written by introducing q conditionally independent latent factors f_t ∼ N_q(0, V_t) and rewriting the error term in Equation (1) as

ε_t = Λ f_t + η_t, η_t ∼ N_m(0, Ω_t). (5)

Note that off-diagonal entries of Σ_t exclusively stem from the volatilities of the q factors, while the diagonal entries of Σ_t are allowed to feature idiosyncratic deviations driven by the elements of Ω_t. This specification reduces the number of free elements in Σ_t from m(m + 1)∕2 to mq, where the latter quantity is typically much smaller than the former. In addition, by conditioning on the latent factors, this representation enables us to derive an efficient Gibbs sampler that allows for conditional equation-by-equation estimation. As will be discussed in more detail in Section 3.2, this constitutes a key feature for computationally feasible Bayesian inference when the dimensionality m becomes large.
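To make the dimension reduction concrete, the decomposition in Equation (2) and the factor representation in Equation (5) can be sketched numerically. The dimensions and parameter values below are illustrative, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

m, q = 20, 2  # illustrative dimensions: 20 series, 2 factors

# Hypothetical loadings and time-t log-variances.
Lambda = rng.normal(0.0, 0.3, size=(m, q))    # m x q factor loadings
h_t = rng.normal(0.0, 0.5, size=q)            # factor log-variances h_jt
log_sigma2_t = rng.normal(-1.0, 0.5, size=m)  # idiosyncratic log-variances

V_t = np.diag(np.exp(h_t))                    # q x q
Omega_t = np.diag(np.exp(log_sigma2_t))       # m x m

# Covariance decomposition Sigma_t = Lambda V_t Lambda' + Omega_t  (Eq. 2).
Sigma_t = Lambda @ V_t @ Lambda.T + Omega_t

# Equivalent representation eps_t = Lambda f_t + eta_t  (Eq. 5).
f_t = rng.normal(0.0, np.sqrt(np.exp(h_t)))
eta_t = rng.normal(0.0, np.sqrt(np.exp(log_sigma2_t)))
eps_t = Lambda @ f_t + eta_t

# Free elements: m*q loadings instead of m*(m+1)/2 covariance entries.
print(m * q, m * (m + 1) // 2)  # prints: 40 210
```

With m = 20 and q = 2, the factor structure thus tracks 40 loadings instead of 210 free covariance entries, and the gap widens rapidly as m grows.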
The model described by Equations (1) and (2) is related to several alternative specifications commonly used in the literature. For instance, assuming that V_t = I and Ω_t ≡ Ω for all t leads to the specification adopted in Stock and Watson (2005). Setting q = 1 and Ω_t ≡ Ω yields a specification that is similar to the one stipulated in Carriero et al. (2016), with the difference that our model imposes restrictions on the covariances whereas Carriero et al. (2016) estimate a full (but constant) covariance matrix. In addition, our model implies that the stochastic volatility enters Σ_t in an additive fashion.
Before proceeding to the next section, it is worth summarizing the key features of the model given by Equations (1)-(5). First, we capture cross-variable movements in the conditional mean through the VAR block of the model and assume that comovement in conditional variances is captured by a factor structure. Second, the model introduces stochastic volatility by assuming that a large panel of volatilities may be efficiently summarized through a set of latent heteroskedastic factors. This choice is more flexible than a single-factor model for the volatility, effectively providing a parsimonious representation of Σ_t that is flexible enough to replicate the dynamic behavior of the variances of a broad set of macroeconomic quantities.

INFERENCE IN LARGE-DIMENSIONAL VAR MODELS
Our approach to estimation and inference is Bayesian. This implies that, after specifying a suitable prior distribution on the model parameters, we can combine this prior with the likelihood implied by the data and the model to obtain the corresponding posterior distribution.

A global-local shrinkage prior
For prior implementation, it proves to be convenient to define a k × 1 vector of predictors x_t = (y′_{t−1}, …, y′_{t−p})′ with k = mp, and to rewrite the model in Equation (1) more compactly as y_t = B x_t + ε_t, where B = (A_1, …, A_p) is the m × k matrix of stacked coefficients. Stacking the rows of y_t, x_t, and ε_t yields

Y = X B′ + E, (6)

where Y = (y_1, …, y_T)′, X = (x_1, …, x_T)′, and E = (ε_1, …, ε_T)′ denote the corresponding full data matrices. Typically, the matrix B is a sparse matrix with nonzero elements mainly located on the main diagonal of A_1. In fact, existing priors in the Minnesota tradition tend to strongly push the system towards the prior model in high dimensions. However, especially in large models, an extremely tight prior on B might lead to severe overshrinking, effectively zeroing out coefficients that might be important to explain y_t. If the matrix B is characterized by a relatively low number of nonzero regression coefficients, a possible solution is a global-local shrinkage prior (Polson & Scott, 2011).
A recent variant that falls within the class of global-local shrinkage priors is the Dirichlet-Laplace (DL) prior put forward in Bhattacharya et al. (2015). This prior possesses convenient shrinkage properties in the presence of a large degree of sparsity of the parameter vector b = vec(B). In what follows, we impose the DL prior on each of the K = mk elements of b,

b_j | ψ_j, θ_j, ζ ∼ N(0, ψ_j θ²_j ζ²), ψ_j ∼ E(1∕2), j = 1, …, K, (7)

so that, marginally of ψ_j, b_j | θ_j, ζ ∼ DE(θ_j ζ). Here, DE denotes the double exponential (Laplace) and E the exponential distribution, ψ_j is an auxiliary scaling parameter to achieve conditional normality, and the elements of θ = (θ_1, …, θ_K)′ are local auxiliary scaling parameters that are bounded to the (K − 1)-dimensional unit simplex. A natural prior choice for θ is the (symmetric) Dirichlet distribution with hyperparameter a: θ ∼ Dir(a, …, a). In addition, ζ is a global shrinkage parameter that pushes all elements in B towards zero and exhibits an important role in determining the tail behavior of the marginal prior distribution on b_j, obtained after integrating out the ψ_j's.
Thus we follow Bhattacharya et al. (2015) and adopt a fully Bayesian approach by specifying a gamma distributed prior on the global shrinkage parameter, ζ ∼ G(Ka, 1∕2). It is noteworthy that this prior setup has at least two convenient features that appear to be of prime importance for VAR modeling. First, it exerts a strong degree of shrinkage on all elements of B but still provides additional flexibility such that nonzero regression coefficients are permitted. This critical property is a feature which a large class of global-local shrinkage priors share (Griffin & Brown, 2010; Carvalho et al., 2010; Polson & Scott, 2011) and has recently been adopted in VAR frameworks and, within the general context of state-space models, by Bitto and Frühwirth-Schnatter (2019). Second, implementation is simple and requires relatively little additional input from the researcher. In fact, the prior relies on a single structural hyperparameter that has to be specified with care, namely a.
The hyperparameter a influences the empirical properties of the proposed shrinkage prior along several important dimensions. Smaller values of a lead to heavy shrinkage on all elements of B. To see this, note that lower values of a imply that more prior mass is placed on small values of ζ a priori. Similarly, when a is small, the Dirichlet prior places more mass on values of θ_j close to zero. Since lower values of a also translate into thicker tails of the marginal prior on b_j, the specific choice of a not only influences the overall degree of shrinkage but also the tail behavior of the prior. Letting p̄ denote the number of predictors, Bhattacharya et al. (2015) show that if a is specified as p̄^{−(1+Δ)} for some small Δ > 0, the DL prior displays excellent posterior contraction rates; Pati et al. (2014) discuss the shrinkage properties of the proposed prior within the context of factor models. In our application, p̄ = K (when considering the total number of predictors) or p̄ = k (when considering the number of predictors per equation).
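The role of a can be illustrated by simulating once from the DL hierarchy in Equation (7). The dimension K and the strong-shrinkage choice a = 1∕K below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)

K = 1000          # illustrative total number of VAR coefficients
a = 1.0 / K       # strong-shrinkage choice discussed in the text

# DL hierarchy (Bhattacharya et al., 2015):
#   theta ~ Dir(a, ..., a),  zeta ~ G(K*a, 1/2),
#   psi_j ~ E(1/2),  b_j | psi_j, theta_j, zeta ~ N(0, psi_j * theta_j^2 * zeta^2).
theta = rng.dirichlet(np.full(K, a))
zeta = rng.gamma(shape=K * a, scale=2.0)   # rate 1/2 corresponds to scale 2
psi = rng.exponential(scale=2.0, size=K)   # rate 1/2 corresponds to scale 2
b = rng.normal(0.0, np.sqrt(psi) * theta * zeta)

# Small a concentrates the Dirichlet weights on few coordinates: most
# coefficients are shrunk essentially to zero, a few may remain sizable.
frac_tiny = np.mean(np.abs(b) < 1e-6)
print(frac_tiny)
```

Rerunning with a = 1∕2 instead spreads the Dirichlet weights almost uniformly, which is exactly the weak-shrinkage behavior observed empirically in Section 6.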
For the factor loadings we independently use a standard normally distributed prior on each element, Λ_{ij} ∼ N(0, 1) for i = 1, …, m and j = 1, …, q. In the empirical application (Section 6), we consider in addition the row-wise normal-gamma (NG; Griffin & Brown, 2010) shrinkage prior discussed in Kastner (2019). Furthermore, we impose a normally distributed prior on the mean of the log-volatility, μ_i ∼ N(0, M_μ), with M_μ denoting the prior variance, and the commonly employed beta distributed prior on the transformed persistence parameter of the log-volatility, (φ_s + 1)∕2 ∼ B(a_0, b_0) for s ∈ {h, σ} and a_0, b_0 ∈ R+, to ensure stationarity. Finally, we use a restricted gamma prior on the innovation variances in Equations (3) and (4), σ²_s ∼ G(1∕2, 1∕(2B_σ)). Here, B_σ is a hyperparameter used to control the tightness of the prior. This choice, motivated in Frühwirth-Schnatter and Wagner (2010), implies that if the data are not informative about the degree of time variation of the log-volatilities, we do not bound σ²_s artificially away from zero, effectively applying more shrinkage than the standard inverted gamma prior would.

Full conditional posterior distributions
Conditional on the latent factors and the corresponding loadings, the model in Equation (1) can be cast as a system of m unrelated regression models for the elements in z_t = y_t − Λ f_t, labeled z_it, with heteroskedastic errors:

z_it = B_i• x_t + η_it, η_it ∼ N(0, σ²_it). (8)

Here we let B_i• denote the ith row of B and η_it is the ith element of η_t. The corresponding posterior distribution of B_i• is multivariate Gaussian,

B′_i• | • ∼ N(b̄_i, V̄_i), (9)

with • indicating that we condition on the remaining parameters and latent quantities of the model. The posterior variance and mean are given by

V̄_i = (X̃′_i X̃_i + V_i⁻¹)⁻¹ and b̄_i = V̄_i X̃′_i z̃_i,

where V_i denotes the diagonal prior covariance matrix of the coefficients related to the ith equation, with diagonal elements ψ_j θ²_j ζ² for the respective indices j. Moreover, X̃_i is a T × k matrix with typical row t given by x′_t ∕ σ_it, and z̃_i is a T-dimensional vector with the tth element given by z_it ∕ σ_it. This normalization renders Equation (8) conditionally homoskedastic with standard normally distributed white noise errors.
The full conditional posterior distribution of 1∕ψ_j is inverse Gaussian,

1∕ψ_j | • ∼ iG(θ_j ζ ∕ |b_j|, 1).

The conditional posterior of the global shrinkage parameter ζ follows a generalized inverse Gaussian (GIG) distribution,

ζ | • ∼ GIG(K(a − 1), 1, 2 Σ_{j=1}^K |b_j| ∕ θ_j).

To draw from this distribution, we use the efficient algorithm of Hörmann and Leydold (2013). Moreover, we sample the scaling parameters θ_j by first sampling L_j from L_j | • ∼ GIG(a − 1, 1, 2|b_j|) and then setting θ_j = L_j ∕ Σ_{i=1}^K L_i. The conditional posterior distributions of the factors are Gaussian and thus straightforward to draw from. The factor loadings are sampled using "deep interweaving" (see Kastner et al., 2017), and the parameters in Equations (3) and (4) along with the full histories of the latent log-volatilities are sampled as in Kastner and Frühwirth-Schnatter (2014) using the R packages factorstochvol (Hosszejni & Kastner, 2019) and stochvol (Kastner, 2016).
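The three hyperparameter updates above can be sketched as follows. This sketch uses SciPy's geninvgauss (reparameterized to the GIG(p, a, b) convention used here) instead of the Hörmann-Leydold sampler, and the current coefficient draw b is a random stand-in:

```python
import numpy as np
from scipy.stats import geninvgauss

rng = np.random.default_rng(0)

def rgig(p, a, b, rng):
    """Draw from GIG(p, a, b) with density proportional to
    x^(p-1) * exp(-(a*x + b/x)/2), via scipy's geninvgauss."""
    return np.sqrt(b / a) * geninvgauss.rvs(p, np.sqrt(a * b), random_state=rng)

# Hypothetical current draw of K coefficients and DL hyperparameter a.
K, a_dl = 50, 0.5
b = rng.normal(0.0, 0.1, size=K)

# theta_j: draw L_j ~ GIG(a - 1, 1, 2|b_j|), then normalize.
L = np.array([rgig(a_dl - 1.0, 1.0, 2.0 * abs(bj), rng) for bj in b])
theta = L / L.sum()

# zeta | . ~ GIG(K(a - 1), 1, 2 * sum(|b_j| / theta_j)).
zeta = rgig(K * (a_dl - 1.0), 1.0, 2.0 * np.sum(np.abs(b) / theta), rng)

# 1/psi_j | . ~ iG(theta_j * zeta / |b_j|, 1); numpy's wald is inverse Gaussian.
psi = 1.0 / rng.wald(theta * zeta / np.abs(b), 1.0)

print(theta.sum(), zeta > 0, np.all(psi > 0))
```

The rescaling inside rgig is needed because geninvgauss fixes the two exponential-rate parameters to be equal; drawing from geninvgauss(p, sqrt(ab)) and multiplying by sqrt(b∕a) recovers a GIG(p, a, b) draw.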
Our MCMC algorithm iteratively draws from the conditional posterior distributions outlined above and discards the first J draws as burn-in. In terms of computational requirements, the single most intensive step is the simulation from the joint posterior of the autoregressive coefficients in B. Because this step is implemented on an equation-by-equation basis, speed improvements relative to the standard approach are already quite substantial. However, note that if k is large (i.e., of the order of several thousands), even the commonly employed equation-by-equation sampling fails to deliver a sufficient amount of draws within a reasonable time window. Consequently, we outline an alternative algorithm to draw from a high-dimensional multivariate Gaussian distribution under a Bayesian prior that features a diagonal prior variance-covariance matrix in the upcoming section.

COMPUTATIONAL ASPECTS
The typical approach to sampling from Equation (9) is based on the full system and simultaneously samples from the full conditional posterior of B, implying that the corresponding posterior distribution is a K-dimensional Gaussian distribution with a K × K variance-covariance matrix. Under a nonconjugate prior, the computational difficulties arise from the need to invert this K × K variance-covariance matrix, which requires operations of order O(m⁶p³) under Gaussian elimination.
If a conjugate prior in combination with a constant (or vastly simplified heteroskedastic; see Carriero et al., 2016) specification of Σ_t is used, the corresponding variance-covariance matrix features a Kronecker structure which is computationally cheaper to invert and scales better in large dimensions. Specifically, the manipulations of the corresponding covariance matrix are of order O(m³ + k³), a significant gain relative to the standard approach. However, this comes at a cost since all equations have to feature the same set of variables, the prior on the VAR coefficients has to be symmetric, and any stochastic volatility specification that preserves conjugacy is necessarily overly simplistic.
By contrast, recent studies emphasize the computational gains that arise from utilizing a framework that is based on equation-by-equation estimation. Carriero et al. (2019) and Koop et al. (2019) augment each equation of the system by either contemporaneous values of the endogenous variables of the preceding equations or the residuals from the previous equations. Here, our approach renders the equations of the system conditionally independent by conditioning on the factors. From a computational perspective, the differences between using a factor model to disentangle the equations and an approach based on augmenting specific equations by quantities that aim to approximate covariance parameters are negligible. If we sample from Equation (9) equation by equation, the necessary operations are of order O(m⁴p³). This already constitutes a significant improvement relative to full system estimation.
One contribution of the present paper is the application of the algorithm proposed by Bhattacharya et al. (2016), developed for univariate regression models under a global-local shrinkage prior. Applied to equation i, with design matrix X̃_i, normalized response z̃_i, and diagonal prior covariance matrix V_i, the algorithm cycles through the following steps:

1. Sample u ∼ N(0, V_i) and δ ∼ N(0, I_T).
2. Set v = X̃_i u + δ.
3. Solve (X̃_i V_i X̃′_i + I_T) w = z̃_i − v for w.
4. Set B′_i• = u + V_i X̃′_i w.

This algorithm outperforms all competing variants discussed previously in situations where k ≫ T, a situation commonly encountered when dealing with large VAR models. In such cases, steps 1-4 can be carried out using O(pm²T²) floating point operations. In situations where k ≈ T, the computational advantages relative to the standard equation-by-equation algorithm mentioned above are modest or even negative. However, note that the cost is quadratic in m and linear in p and thus scales much better when the number of endogenous variables and/or lags thereof is increased. More information on the empirical performance of our algorithm can be found in Section 6.4.

SIMULATION STUDY
This section aims at comparing the performance of the DL prior with a range of commonly used alternatives. We investigate sparse, intermediate, and dense DGPs, where T ∈ {50, 100, 150, 200, 250} and m ∈ {10, 20, 50, 100}. The probability of an off-diagonal entry to be nonzero is 0.01, 0.1, and 0.8 in each of the respective scenarios. In all scenarios, each intercept entry has a 0.1 probability of being nonzero and all diagonal elements are nonzero with probability 0.8. The nonzero elements are randomly generated from Gaussian distributions roughly tuned to yield stable VARs. More concretely, both the mean μ_I and the standard deviation σ_I of the intercept are set to 0.01, whereas the means and standard deviations of the diagonal and off-diagonal elements are likewise tuned to keep the generated systems stable. Concerning the errors, we use a single-factor SV specification. The factor loadings are generated from N(0.001, 0.001²) to roughly match the above scaling. The AR(1) processes driving the idiosyncratic log-variances are assumed to have mean μ_i = −12, with persistences φ_i ranging from 0.85 to 0.98 and innovation standard deviations σ_i from 0.3 to 0.1. The process driving the factor log-variance is assumed to be highly persistent, with φ_h1 = 0.99 and σ_h1 = 0.1.
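A sparse coefficient matrix of this kind can be generated as sketched below; the Gaussian means and standard deviations for the diagonal and off-diagonal entries are illustrative placeholders, not the paper's exact tuning:

```python
import numpy as np

rng = np.random.default_rng(2024)

def simulate_var_coeffs(m, p_offdiag, rng):
    """Sketch of the sparse DGP for a VAR(1) coefficient matrix:
    diagonal entries nonzero w.p. 0.8, off-diagonal w.p. p_offdiag.
    The nonzero values come from Gaussians; the means/sds below are
    illustrative choices aimed at stability, not the paper's values."""
    A = np.zeros((m, m))
    mask_diag = rng.random(m) < 0.8
    A[np.diag_indices(m)] = np.where(mask_diag, rng.normal(0.4, 0.2, m), 0.0)
    off = ~np.eye(m, dtype=bool)
    mask_off = (rng.random((m, m)) < p_offdiag) & off
    A[mask_off] = rng.normal(0.0, 0.1, mask_off.sum())
    return A

A = simulate_var_coeffs(m=20, p_offdiag=0.01, rng=rng)  # sparse scenario
print(A.shape, np.mean(A != 0))
```

In practice one would additionally check the spectral radius of the companion matrix and redraw if it exceeds one, so that every simulated VAR is stable.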
For each of the 60 settings, we simulate 10 data sets. For each of these, we run our MCMC algorithm to obtain 2,000 posterior draws after a burn-in of 1,000. The posterior means are then compared to the true values and root mean squared errors (RMSEs) are computed. Finally, the median of each of these is reported in Table 1. Alongside the DL prior with weak (a_DL = a = 1∕2) and strong (a_DL = a = 1∕k and a_DL = a = 1∕K) shrinkage, we also consider the NG prior with a single global shrinkage parameter and a standard conjugate Minnesota prior with a single shrinkage parameter a_M, implemented by using dummy observations. For the NG prior, we specify the prior on the global shrinkage parameter to induce heavy shrinkage (by setting both hyperparameters of the gamma prior equal to 0.01), and the parameter controlling the excess kurtosis, a_NG, is set equal to 1, corresponding to the Bayesian Lasso (see Park & Casella, 2008), as well as to 0.1. The latter choice places significant prior mass around zero but at the same time leads to a heavy-tailed marginal prior. Finally, we report RMSEs of the ordinary least squares (OLS) estimator (if it exists).
As is to be expected, Table 1 reveals strong to severe overfitting of OLS (corresponding to the posterior mode under a flat prior), which can be mitigated to a certain extent when the Minnesota prior with a_M = 0.001 is employed instead. Similarly, the DL prior with weak shrinkage (a_DL = 1∕2) displays a tendency to overfit, in particular when T is small. By contrast, the more aggressive DL and NG shrinkage priors show superior performance. Overall, DL(1∕k) and NG(0.1) exhibit the lowest RMSEs, where DL(1∕k) performs best in the sparse scenarios, NG(0.1) performs best in the intermediate settings, and no clear winner emerges in the dense settings. For further illustration, we showcase four exemplary scenarios in Figures A1-A4 in the Appendix.

EMPIRICAL FORECASTING APPLICATION
In Section 6.1 we first summarize the data set adopted and present the model specification choices made. Section 6.2 estimates a simple one-factor model to outline the virtues of our proposed framework. Section 6.3 presents the main findings of our forecasting exercise and discusses the choice of the number of factors used for modeling the error covariance structure.

Data, model specification and selection issues
The aim of the empirical application is to forecast a set of key US macroeconomic quantities. To this end, we use the quarterly data set provided by McCracken and Ng (2016).

Some empirical key features of the model
To provide some intuition on how our modeling approach works in practice, we first estimate a simple one-factor model (i.e., q = 1) and investigate several features of our empirical model. In the next section we will perform an extensive forecasting exercise and discuss the optimal number of factors in terms of forecasting accuracy. We start by inspecting the posterior distribution of Λ and assessing which variables load heavily on the latent factor.

3 In addition to quarterly observations, McCracken and Ng (2016) also provide a subset of the data which is observed monthly. Of course, our method is analogously applicable to higher frequency observations. However, given that the computational cost of the Bhattacharya et al. (2016) approach is quadratic in T, the run-time gains of their approach in comparison to equation-by-equation estimation are then smaller and can, depending on the number of lags, even become negative. 4 We have also experimented with higher lag orders and found some evidence of signals at lag two for the data set at hand; see Figures A5-A7 in the Appendix for an illustration. However, out-of-sample predictive studies favored one lag only (cf. Section 6.3).
It is worth emphasizing that most quantities 5 associated with real activity (i.e., industrial production and its components, gross domestic product (GDP) growth, employment measures) load heavily on the factor. Moreover, expectation measures, housing markets, equity prices, and spreads also load heavily on the joint factor.
To assess whether spikes in the volatility associated with the factor coincide with major economic events, the bottom panel of Figure 1 depicts the evolution of the posterior distribution of factor volatility over time. A few findings are worth mentioning. First, volatility spikes sharply during the mid-1970s, a period characterized by the first oil price shock and the bankruptcy of Franklin National Bank in 1974. After declining markedly during the second half of the 1970s, the shift in US monetary policy towards aggressively fighting inflation and the second oil price shock again translate into higher macroeconomic uncertainty. Note that from the mid-1980s onward we observe a general decline in macroeconomic volatility that lasts until the beginning of the 1990s. There we observe a slight increase in volatility possibly caused by the events surrounding the first Gulf War. The remaining years up to the beginning of the 2000s have been relatively unspectacular, with volatility levels being muted most of the time. In 2000/2001, volatility again increases due to the burst of the dot-com bubble and the 9/11 terrorist attacks. Finally, we observe marked spikes in volatility during recessionary episodes like the recent financial crisis in 2008.
Finally, we assess how well the DL prior with a = 1∕k performs in shrinking the coefficients in B to zero. The top panel of Figure 2 depicts a heat map that gives a rough feeling of the size of each regression coefficient based on the posterior median of B. The bottom panel of Figure 2 depicts the posterior interquartile range, providing some evidence on posterior uncertainty. 6 The DL prior apparently succeeds in shrinking the vast majority of the approximately 50,000 coefficients towards zero. Even though not discussed in detail to conserve space, we note that at higher lag orders this very strong shrinkage effect is even more pronounced; see also Figures A5-A7 in the Appendix.
The top panel of Figure 3 displays the posterior median estimates when the shrinkage parameter a is chosen to be 1/2 (cf. Bhattacharya et al., 2015, for a discussion of this choice). While a = 1∕2 appears to provide a fair amount of shrinkage in other applications, for our huge-dimensional example this prior exerts only relatively little shrinkage and tends to lead to overfitting. The diagonal pattern in the first lag appears here as well, but there is a considerable amount of nonzero medians elsewhere. Correspondingly, the interquartile ranges visualized in the bottom panel of Figure 3 are also very large compared to those obtained with a = 1∕k.

[Figure 1: 5th, 50th, and 95th posterior percentiles of factor loadings (upper panel) and factor volatility (lower panel).]
Interestingly, for selected time series measuring inflation (both consumer and producer price inflation) we find that lags of monetary aggregates are allowed to load on the respective inflation series. This result points towards a big advantage of our proposed prior relative to standard VAR priors in the Minnesota tradition: While these priors have been shown to work relatively well in huge dimensions (see Bańbura et al., 2010), they also display a tendency to overshrink when the overall tightness of the prior is integrated out in a Bayesian framework, effectively pushing the posterior distribution of B towards the prior mean and thus ruling out patterns observed under the DL prior.
Inspection of the interquartile range also indicates that the proposed shrinkage prior succeeds in reducing posterior uncertainty markedly. Note that the pattern found for the posterior median of B can also be found in terms of the posterior dispersion. We again observe that the coefficients associated with the first, own lag of a given variable are allowed to be nonzero whereas in most other cases the associated posterior is strongly concentrated around zero.

Predictive evidence
We focus on forecasting gross domestic product (GDPC96), industrial production (INDPRO), total nonfarm payroll (PAYEMS), civilian unemployment rate (UNRATE), new privately owned housing units started (HOUST), consumer price index inflation (CPIAUCSL), producer price index for finished goods inflation (PPIFGS), effective federal funds rate (FEDFUNDS), 10-year Treasury constant maturity rate (GS10), US/UK exchange rate (EXUSUKx), and the S&P 500 (SP500). This choice includes the variables investigated by Koop et al. (2019) and some additional important macroeconomic indicators that are commonly monitored by practitioners, resulting in a total of 11 series.
To assess the forecasting performance of our model, we conduct a pseudo out-of-sample forecasting exercise with initial estimation sample ranging from 1959:Q3 to 1990:Q2. Based on this estimation period, we compute one-quarter-ahead predictive densities for the first period in the hold-out (i.e., 1990:Q3). After obtaining the corresponding predictive densities and evaluating the corresponding log-predictive likelihoods, we expand the estimation period and reestimate the model. This procedure is repeated 100 times until the final point of the full sample is reached. The quarterly scores obtained this way are then accumulated.
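The expanding-window scheme itself is model-agnostic and can be sketched in a few lines. The code below uses a deliberately simple univariate Gaussian AR(1) fitted by OLS as a stand-in for the paper's VAR-FSV model; all function names and the simulated data are illustrative.

```python
import numpy as np
from scipy import stats

def expanding_window_lps(y, first_oos, fit, logpdf):
    """For each hold-out point: re-estimate on all data before it, score
    the realized value under the one-step-ahead predictive density, then
    expand the estimation window by one observation."""
    return np.array([logpdf(fit(y[:t]), y[:t], y[t])
                     for t in range(first_oos, len(y))])

# Stand-in model: Gaussian AR(1) with intercept, fitted by OLS.
def fit_ar1(y):
    X = np.column_stack([np.ones(len(y) - 1), y[:-1]])
    beta, *_ = np.linalg.lstsq(X, y[1:], rcond=None)
    sigma = (y[1:] - X @ beta).std(ddof=2)
    return beta, sigma

def ar1_logpdf(model, history, y_next):
    (b0, b1), sigma = model
    return stats.norm.logpdf(y_next, b0 + b1 * history[-1], sigma)

rng = np.random.default_rng(0)
y = np.empty(80)
y[0] = 0.0
for t in range(1, 80):                   # simulate a stable AR(1) process
    y[t] = 0.8 * y[t - 1] + rng.normal()

scores = expanding_window_lps(y, first_oos=60, fit=fit_ar1, logpdf=ar1_logpdf)
cumulative_lps = scores.cumsum()         # accumulated as in the paper
```

In the paper the same loop is run with the full multivariate model and 100 hold-out quarters; only the `fit` and `logpdf` components change.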
Our model with q ∈ {0, 1, … , 4} factors is benchmarked against the prior model, a pure factor stochastic volatility (FSV) model with conditional mean equal to zero (i.e., B = 0, the m × k zero matrix). In what follows we label this specification FSV 0. To assess the merits of the proposed shrinkage prior vis-à-vis a Minnesota prior and an NG shrinkage prior we also include the models described in Section 5. Moreover, we include two models that impose the restriction that A_1 = I_m and A_1 = 0.8 × I_m, respectively, while A_j for j > 1 are set equal to zero matrices in both cases. The first model, labeled FSV 1, assumes that the conditional mean of y_t follows a random walk process, and the second specification, denoted by FSV 0.8, imposes the restriction that the variables in y_t feature a rather strong degree of persistence but are stationary. The exercise serves to evaluate whether it pays to impose a VAR structure on the first moment of the joint density of our data and to assess how many factors are needed to obtain precise multivariate density predictions for our 11 variables of interest.
Overall log-predictive scores (LPSs) are summarized in Table 2. An immediate finding is that ignoring the error covariance structure (using zero factors) produces rather inaccurate forecasts for all models considered. While a single factor model improves predictive accuracy by a large margin, allowing for more factors (i.e., even more flexible modeling of the covariance structure) further increases the forecasting performance. For this specific exercise, we identify two or three factors to be a reasonable choice for most models when the joint log-predictive scores of the aforementioned variables are considered. We would like to stress that this choice critically depends on the number of variables we include in our prediction set. If we focus attention on the marginal predictive densities (i.e., the univariate predictive densities obtained after integrating out the remaining elements in y_t), we find that fewer or even no factors receive more support (see Table 3), whereas in the case of higher dimensional prediction sets more than two factors lead to more accurate density predictions (cf. Kastner et al., 2017, for an investigation of this issue in the context of a standard FSV model). As a general remark, we note that identifying the optimal number of factors in high-dimensional FSV models is a challenging problem in practice. Using the deviance information criterion (DIC; cf. Chan & Grant, 2016) may be an option but is likely to be unstable in very high dimensions. The approach adopted in the paper at hand, namely the decomposition of the marginal likelihood into predictive likelihoods (cf. Geweke & Amisano, 2010), tends to be more stable, in particular when interest is placed on predicting subsets only. Moreover, it can be trivially parallelized, thus becoming computationally feasible on high-performance computing infrastructures.
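Evaluating a predictive likelihood from simulation output is itself straightforward: average the conditional predictive density across MCMC draws, working in log space for numerical stability. The univariate Gaussian-mixture form below is a generic sketch, not the paper's specific implementation.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

def log_predictive_likelihood(y_obs, mu_draws, sd_draws):
    """Rao-Blackwellized estimate of log( (1/S) * sum_s p(y | mu_s, sd_s) ):
    each MCMC draw s supplies the mean and standard deviation of the
    conditional one-step-ahead Gaussian predictive density."""
    comp = stats.norm.logpdf(y_obs, mu_draws, sd_draws)
    return logsumexp(comp) - np.log(len(comp))
```

For a marginal LPS one scores the draws of the corresponding univariate predictive moments; for a joint LPS the component densities are multivariate Gaussians instead.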
To investigate whether forecasting performance is homogeneous over time, Figure 4 visualizes the cumulative LPSs relative to the zero-factor FSV model over time. The benefit of the flexible SV structure in the VAR residuals is particularly pronounced during the 2008 financial crisis which can be seen by comparing the solid lines to the broken lines. During this period, time-varying covariance modeling appears to be of great importance and the performance of models that ignore contemporaneous dependence deteriorates. This finding is in line with Kastner (2019), who reports analogous results for US asset returns. The increase in predictive accuracy can be traced back to the fact that within an economic downturn the correlation structure of our data set changes markedly, with most indicators that measure real activity sharply declining in lockstep. A model that takes contemporaneous cross-variable linkages seriously is thus able to fully exploit such behavior, which in turn improves predictions.
Up to this point, we have focused exclusively on the joint performance of our model for the specific set of variables considered. To gain a deeper understanding of how our model performs for selected quantities of interest, Table 3 displays marginal LPSs for the two most promising prior specifications with one, two, and five lags. The variables we consider are inflation (CPIAUCSL), short-term interest rates (FEDFUNDS), and output growth (GDPC96).
In contrast to the findings based on joint LPSs, we observe that models without a factor structure tend to perform better than models that set q > 0, with the exception of interest rates where all models predict more or less equally badly. This finding corroborates our conjecture stated above, implying that if the set of focus variables is subsequently enlarged, more factors are necessary in order to obtain precise density predictions. Here, we only focus on marginal model performance, implying that for each variable, contemporaneous relations between the elements in y_t are integrated out. This, in turn, implies that the additional gain in model flexibility is offset by the comparatively larger number of parameters. Concerning the difference between VAR priors, it appears that NG slightly outperforms DL for inflation, whereas DL is superior when it comes to predicting output growth.

A note on the computational burden
Even though the efficient sampling schemes outlined in this paper help to overcome absolutely prohibitive computational burdens, the CPU time needed to perform fully Bayesian inference in a model of this size can still be considered substantial. In what follows we shed light on the estimation time required and how it is related to the length of the time series T, the lag length p, and the number of latent factors q ∈ {0, 50}. Figure 5 shows the time needed to perform a single draw from the joint posterior distribution of the 215 + 215²p coefficients and their corresponding 2(215 + 215²p) + 1 auxiliary shrinkage quantities, the qT factor realizations and the associated 215q loadings, alongside (T + 1)(215 + q) latent volatilities with their corresponding 645 + 2q parameters. This amounts to 166,841 random draws for the smallest model considered (one lag, no factors, T = 124) and 776,341 random draws for the largest model (five lags, 50 factors, T = 224) at each MCMC iteration.
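The per-iteration draw counts quoted above follow from simple bookkeeping with m = 215 series; the helper below reproduces them (the function name and the per-series/per-factor breakdown of the SV parameters are our reading of the counts, not code from the paper):

```python
def n_latent_draws(p, q, T, m=215):
    """Number of random quantities drawn per MCMC sweep: VAR coefficients,
    their shrinkage auxiliaries, factors, loadings, log-volatilities,
    and SV parameters."""
    coefs = m + m * m * p        # m intercepts plus m^2 coefficients per lag
    shrink = 2 * coefs + 1       # two local scales per coefficient, one global
    factors = q * T              # latent factor realizations
    loadings = m * q             # factor loadings
    vols = (T + 1) * (m + q)     # T states plus initial state per series/factor
    sv_params = 3 * m + 2 * q    # 645 + 2q for m = 215 (3 per series, 2 per factor)
    return coefs + shrink + factors + loadings + vols + sv_params
```

Evaluating the helper at the two corner cases recovers the figures quoted in the text: 166,841 draws for (p, q, T) = (1, 0, 124) and 776,341 for (5, 50, 224).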
As mentioned above, the computation time rises approximately linearly with the number of lags included. Dotted lines indicate the time in seconds needed to perform a single draw from a model with 50 factors included, while solid lines refer to the time needed to estimate a model without factors and a diagonal time-varying variance-covariance matrix Σ_t. Interestingly, the additional complexity when moving from a model without factors to a highly parametrized model with 50 factors appears to be negligible, increasing the time needed by a fraction of a second on average. The important role of the length of the sample can be seen by comparing the green, red, and black lines. The time necessary to perform a single MCMC draw quickly rises with the length of our sample, consistent with the statements made in Section 4. This dependence on T rather than on the cross-sectional dimension, however, makes the algorithm especially convenient when researchers are interested in combining many short time series or in performing recursive forecasting based on a tiny initial estimation sample.

CLOSING REMARKS
In this paper we propose an alternative route to estimate huge-dimensional VAR models that allow for time variation in the error variances. The Dirichlet-Laplace prior, a recent variant of a global-local shrinkage prior, enables us to heavily shrink the parameter space towards the prior model while providing enough flexibility that individual regression coefficients are allowed to be unrestricted. This prior setup alleviates overfitting issues generally associated with large VAR models. To cope with computational issues we assume that the one-step-ahead forecast errors of the VAR feature a factor stochastic volatility structure that enables us to perform equation-by-equation estimation, conditional on the loadings and the factors. Since posterior simulation of each equation's autoregressive parameters involves manipulating large matrices, we implement an alternative recent algorithm that improves upon existing methods by large margins, rendering a fully fledged Bayesian estimation of truly huge systems possible.
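The paper does not restate the sampler at this point, but algorithms of this kind include that of Bhattacharya et al. (2016), which draws from N(μ, Σ) with Σ = (Φ'Φ + D⁻¹)⁻¹ and μ = ΣΦ'α while only solving an n × n system, a large saving when the number of regressors p far exceeds the number of observations n. The sketch below assumes a diagonal prior covariance D and is our illustration, not the paper's code.

```python
import numpy as np

def fast_gaussian_draw(Phi, d, alpha, rng):
    """Draw theta ~ N(Sigma Phi' alpha, Sigma) with
    Sigma = (Phi'Phi + D^-1)^-1 and D = diag(d), at O(n^2 p) cost
    instead of O(p^3); in the spirit of Bhattacharya et al. (2016)."""
    n, p = Phi.shape
    u = np.sqrt(d) * rng.standard_normal(p)     # u ~ N(0, D)
    delta = rng.standard_normal(n)              # delta ~ N(0, I_n)
    v = Phi @ u + delta
    M = (Phi * d) @ Phi.T + np.eye(n)           # Phi D Phi' + I_n, only n x n
    w = np.linalg.solve(M, alpha - v)
    return u + d * (Phi.T @ w)
```

By the Woodbury identity the resulting draw has exactly the desired posterior mean and covariance; averaging many draws should therefore recover the mean computed from the p × p system directly.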
In an empirical application we first present various key features of our approach based on a single-factor model. This single factor, which summarizes the joint dynamics of the VAR errors, can be interpreted as an uncertainty measure that closely tracks observed factors such as the volatility index. The question whether such a simplistic structure proves to be an adequate representation of the time-varying covariance matrix naturally arises, and we therefore provide a detailed forecasting exercise to evaluate the merits of our approach relative to the prior model and a set of competing models with a different number of latent factors in the errors.
Finally, three potential extensions are worth mentioning. First, systematic and in-depth empirical comparisons of the various recently developed approaches to handling high-dimensional VARs with time-varying contemporaneous covariance in a Bayesian framework (VAR-FSV, VAR-Cholesky-SV, compressed VAR-SV, etc.) are still missing, and it is not clear whether any one of these models dominates the others at all points in time; one could therefore consider averaging or selecting among them dynamically. Second, note that it is trivial to relax the assumption of symmetry for the DL components. In the context of VARs, this might be of particular interest for distinguishing diagonal (a_D large) from off-diagonal (a_O small) elements in the spirit of the Minnesota prior, or for increasing the amount of shrinkage with the lag order (cf. Huber & Feldkircher, 2019, for a similar setup in the context of the normal-gamma shrinkage prior). Third, we would like to stress that our approach could also be used to estimate huge-dimensional time-varying parameter VAR models with stochastic volatility. To cope with the computational difficulties associated with the vast state space, a possible approach could be to rely on an additional layer of hierarchy that imposes a (dynamic) factor structure on the time-varying autoregressive coefficients in the spirit of Eisenstat et al. (2018), thus reducing the computational burden considerably.

APPENDIX A: FURTHER ILLUSTRATIONS
First, we showcase four selected data-generating scenarios (small + sparse, small + dense, large + sparse, large + dense) and visualize the posterior distribution of the VAR coefficients under seven different prior choices in Figures A1-A4. For a comprehensive overview, see Table 1 in the main part of the paper.