A computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional quantitative trait loci discovery

Our work is motivated by the search for metabolite quantitative trait loci (QTL) in a cohort of more than 5000 people. There are 158 metabolites measured by NMR spectroscopy in the 31-year follow-up of the Northern Finland Birth Cohort 1966 (NFBC66). These metabolites, as with many multivariate phenotypes produced by high-throughput biomarker technology, exhibit strong correlation structures. Existing approaches for combining such data with genetic variants for multivariate QTL analysis generally ignore phenotypic correlations or make restrictive assumptions about the associations between phenotypes and genetic loci. We present a computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional data, with cell-sparse variable selection and sparse graphical structure for covariance selection. Cell sparsity allows different phenotype responses to be associated with different genetic predictors and the graphical structure is used to represent the conditional dependencies between phenotype variables. To achieve feasible computation of the large model space, we exploit a factorisation of the covariance matrix. Applying the model to the NFBC66 data with 9000 directly genotyped single nucleotide polymorphisms, we are able to simultaneously estimate genotype–phenotype associations and the residual dependence structure among the metabolites. The R package BayesSUR with full documentation is available at https://cran.r-project.org/web/packages/BayesSUR/

as with many multivariate phenotypes produced by highthroughput biomarker technology, exhibit strong correlation structures. Existing approaches for combining such data with genetic variants for multivariate QTL analysis generally ignore phenotypic correlations or make restrictive assumptions about the associations between phenotypes and genetic loci. We present a computationally efficient Bayesian seemingly unrelated regressions model for high-dimensional data, with cell-sparse variable selection and sparse graphical structure for covariance selection. Cell sparsity allows different phenotype responses to be associated with different genetic predictors and the graphical structure is used to represent the conditional dependencies between phenotype variables. To achieve feasible computation of the large model space, we exploit a factorisation of the covariance matrix. Applying the model | 887 BOTTOLO eT aL.

| INTRODUCTION
Integrating high-dimensional molecular biomarker data sets is a fundamental problem in genetic epidemiology and bioinformatics, in the search for molecular mechanisms mediating the effects of genetic variants on clinical phenotypes. An important step in this search is to find associations between a set of genetic variants and downstream molecular phenotypes such as gene expression, proteomics, metabolomics or epigenetic data (quantitative trait loci or QTL analysis). In the simplest analysis, univariate regressions are performed for each phenotype-genotype pair, needing post hoc adjustment for multiple comparisons and ignoring any correlations between genotypes and between phenotypes. This is unlikely to be the best strategy for data where latent structures induce high levels of correlation between phenotypes, for example serum metabolomic profiles Soininen et al., 2009), imaging and gene expressions. Comparison studies show that using multivariate quantitative phenotypes increases the statistical power in association tests compared to univariate analysis (Fusi et al., 2012;Inouye et al., 2012).
In this paper, we develop a model designed for integrated multivariate QTL analysis, particular aimed at highly correlated molecular phenotypes. Our case study is in metabolomics quantitative trait locus (mQTL), a powerful approach used to identify genes associated with metabolic markers of diseases, where the multivariate response is generally on the order of hundreds of metabolites. The model is also suitable for other multivariate molecular phenotypes. We have a data set from the Northern Finland Birth Cohort 1966 (NFBC66) 31-year follow-up, consisting of 158 nuclear magnetic resonance (NMR) spectroscopy measured metabolites and over 9000 directly genotyped single nucleotide polymorphisms (SNPs) on chromosome 16, with a sample size of more than 5000 people. The metabolites set comprises lipoprotein particle concentrations, low molecular weight metabolites such as amino acids, 3-hydroxybutyrate and creatinine and different serum lipids, including free and esterified cholesterol, sphingomyelin and fatty acid saturation. These data exhibit strong residual correlation Marttinen et al., 2014), even after accounting for the variance explained by all reported SNPs.

BOTTOLO eT aL.
Cholesky factors is also popular in econometrics in both conditional and simultaneous autoregressive models (Datta et al., 2019). Wang et al. (2012) use a similar idea in the context of sampling the parameters for the Bayesian lasso, and Zellner and Ando (2010) used this reparametrisation for a SUR model with fixed covariates that is not resulting from variable selection.
To our knowledge, this is the first model proposed for fully Bayesian analysis for multivariate QTLs allowing for both cell-sparsity and residual dependence. A similar model was proposed by Wang (2010) for autoregressive models in econometrics, with two computational algorithms, an MCMC algorithm using Gibbs updates (George & McCulloch, 1993) for variable selection called 'indirect' and a 'direct' algorithm involving numerical approximation of the marginal likelihood (Chib, 1995) combined with a Metropolis-Hastings algorithm for posterior models exploration. Both approaches work well for small data sets but are computationally prohibitive in high-dimensional space comprising hundreds of responses and predictors such as in molecular QTL work. Our novel contribution to the computational method is that we derive the priors (dense and sparse versions) in the space of the transformed covariance matrix and therefore run the whole computation in this space, enabling parallelisation over the response variables. We are thus able to estimate models for mQTLs with hundreds of response variables and thousands of predictors.
Section 2 introduces the model including the derived priors in the transformed space. Section 3 provides the posterior computations for the parameters and details the MCMC algorithm used to estimate them. In Section 4, the method is validated in an extensive simulation study, where we show that we can estimate larger models with more accuracy in considerably less computational time than the Wang (2010) software. We also obtain similar sensitivity but fewer false positives than the penalized likelihood method with simultaneous estimation of regression coefficients and covariance structure proposed by Rothman et al. (2010). Section 5 presents the results on the NFBC66 mQTL data set, including visualisation both of the genotype-phenotype associations and of the residual dependence structure between metabolites. Further details of model derivations and posterior updates are available in the Supplementary Material. An R package BayesSUR with full documentation is available at https://cran.r-proje ct.org/ web/packa ges/Bayes SUR/.

| MODEL
The model can be seen as a set of regressions for multivariate phenotype responses Y = (y 1 , …, y s ), y k = (y 1k , …, y nk ) T , for k = 1, …, s and corresponding covariate genotype matrices X k with dimensions n × p. We assume independence between samples, but allow for dependence across responses. Moreover, we assume that the same set of predictors is available for all responses. The same set of genotypes may be used for all regressions, but this is not necessary. The predictors may be continuous or categorical, hence the model accommodates the usual additive genetic association models using observed and imputed allele counts or can be extended to more complex genetic models including interactions.
Variable selection is performed on the predictors using binary indicators vector k = ( k1 , …, kp ) T , where kj is 1 if covariate j is included in the regression for response k and 0 if not. We use the shorthand notation X k for the columns of X k selected by the vector k and similar for k . Thus, we can write the set of linked regressions as (1) y k = X k k + u k , k = 1, …, s, but most importantly the residuals will be correlated, that is, We can also write the likelihood for this model as where  = vec(Y), vec(·) being the vectorisation operator, = vec( 1 , …, s ), = vec( 1 , …, s ) and  is a block-diagonal matrix with X k as the k-th diagonal element. Following George and McCulloch (1993), we set kj = 0 conditional on kj = 0, while nonzero coefficients follow a diffuse normal distribution, that is, | ∼ N 0, W − 1 . The precision matrix W is generally decomposed into a shrinking coefficient, say w, and a matrix that governs the covariance structure. Here, we use W = w − 1 | | , with an inverse-gamma prior on w and |γ| the number of selected predictors across all outcomes. A number of sparsity inducing priors for γ have been used in the literature, the most common being kj ∼ Ber( ) with a fixed or random π. Another choice in QTL analysis is the 'hotspot detection' prior used in Bottolo et al. (2011), which decomposes the inclusion probability into an overall sparsity level for each outcome (o k ) and a propensity parameter (π j ) for each predictor, that is, kj ∼ Ber(o k × j ). In this work, we use the hotspot detection model with a beta prior on o k and a gamma prior on the propensity j and its simplified version with j = 1, j = 1, …, p, which corresponds to a beta-binomial sparsity prior for each response.

| Factorisation of the likelihood
If one assumes either a diagonal C or row sparsity for k , with conjugate priors on C and , both and C can be integrated out analytically (Bhadra & Mallick, 2013). In our model, the usual priors on these parameters lose conjugacy and cannot be integrated out. Nonetheless, the full conditionals retain their simple forms, so it is straightforward to write a Gibbs sampler for the posterior distributions (Holmes et al., 2002). The computational time needed is, however, prohibitive for most highdimensional settings.
To overcome this issue, we decompose the covariance matrix C iteratively as for all k = 2, …, s, with C (s) = C and C (1) = c 1 = C 11 (the scalar variance of response 1) and c 1 is null. Thus, each C (k) is the marginal covariance matrix for responses 1, …, k, c k is the variance of response k and c k is the vector of covariances between response k and responses {1, …, k − 1}. With this decomposition, the likelihood can be factorised (e.g. Giri, 2014) as is a matrix consisting of the first k − 1 residuals from the original linked regressions where U (0) is null. For k = 1, the likelihood simplifies to  (y 1 | X 1 1 , 2 1 n ). The parameters 2 k and k are also defined through the reparametrisation of the residual covariance matrix, that is, 2 1 = c 1 and
Note that the joint distribution p( | , , , C) is the same regardless of the order used for the decomposition since we are simply factorising it by chain conditioning. From Equation (2), it is straightforward to see that 2 k is the residual variance of response k conditioned on U (k−1) and k is the (k − 1)-vector of regression coefficients on the same U (k−1) residuals. The likelihood is thus decomposed into a product of independent (conditionally on the new parameters) factors over the outcomes. The novelty of our approach is two-fold. First, we estimate our model completely in the reparametrised space, deriving priors and posterior full conditionals for the parameters 2 k and k . Second, this allows us to update these parameters in parallel, which greatly increases the computational efficiency of this model.

| Prior for dense residual dependence
In order to take advantage of the factorisation of the model across responses, we must also transform the model priors. For modelling dense dependence structure between responses, we use an inverse-Wishart prior on the original covariance matrix C ∼ IW(ν, M). We use standard matrix properties of the inverse-Wishart distribution (Dawid, 1981;Dempster, 1969;Roverato, 2000) to calculate the transformed prior. The C (k) is a submatrix of C, thus it also has an inverse-Wishart distribution. The new parameters are related to the block structure of the inverse of this matrix, 2 k being the Schur complement of C (k−1) in C (k) and for k = 2, …, s, the priors on the changed variables become for k = 2, …, s and 2 1 ∼ IGa(( − s + 1)∕2, m 1 ∕2). From the prior on 2 1 , we can see that we the degrees of freedom in the inverse-Wishart distribution must be chosen to be ν > s − 1. Moreover, from these equations, we can see that we obtain independent priors for 2 k and k for different k. Thus, since both likelihood and prior factorise across responses, the posterior full conditionals also factorise and hence the MCMC update of the residual covariance parameters can be parallelised.

| Prior for sparse residual dependence
To model sparsity in the residual dependency structure, we introduce a decomposable graph G such that variables are conditionally independent if there is no direct edge between them in the graph (Lauritzen, 1996). Conditional on the graph, we use the hyper inverse-Wishart prior on the original covariance matrix C ∼ HIW G ( , M) (Carvalho et al., 2007). This is defined as the distribution such that the covariance matrix for each prime component in the decomposable graph is marginally inverse Wishart, that is, C P q ∼ IW( − (s − | P q | ), M P q ), where P q is the q-th prime component, M P q is the submatrix of M corresponding to P q and | P q | is its cardinality.
In the following, we derive the corresponding prior for the transformed variables 2 k and k under the assumption of a sparse covariance matrix. A more in depth derivation of the following results can be found in the Supplementary Material Section S.1.1. Briefly, the decomposability of the graph G allows us to define a sequence of complete, overlapping, subgraphs (i.e. cliques) called 'prime components' P q , q ∈ (1, …, Q) that can be ordered in such a way that for every q > 1 there exists m < q such that P q ∩ H q ⊂ P m , where H q = ⋃ q − 1 l = 1 P l for q = 2, …, Q. We also define the separators S q = P q ∩ H q and residuals R q = P q �S q for q = 2, …, Q. The nodes of G can also be arranged according to a so-called 'perfect elimination ordering', denoted by ξ, which implies that if Λ (k) (l) = 0 then (k) (l) = 0, where Λ is the precision matrix C − 1 (Paulsen et al., 1989).
Due to this correspondence between Λ (k) (l) = 0 and (k) (l) = 0, the transformation in Equation (3) decomposes across the prime components, so that, for nodes ordered according to a perfect elimination ordering, the new variables 2 k and k are defined within the prime components, that is, is the submatrix of C P q with variable k removed and c k,P q is the final column of C P q without the last element. All other elements of k are zero.
We parameterise G using the junction tree representation of a decomposable graph as its state variable proposed by Green and Thomas (2013). A decomposable graph may have many junction tree representations. However, for a given junction tree J, the implied graph G is uniquely determined.
To write the densities explicitly, we need to order the nodes using the perfect elimination order ξ, which respects the perfect ordering of the prime components. With this ordering, we find that the hyper inverse-Wishart prior on C is transformed to where the ordering of nodes within each residual does not matter. The corresponding prior densities are where q is the index of the prime residual that node k belongs to in the particular node ordering of the graph, for all k, except for the first node, that is, 2 The sets P q and S q are defined above. The index t in Equations (4) and (5) is the index of the node within the graph residual component and is given by Here, we have applied similar arguments as for the dense covariance case presented in Section 2.2, using the properties of block matrices and the inverse-Wishart distribution. Again, we see that the priors factorise over responses, so the posterior full conditionals in Equations (4) and (5) also factorise enabling faster computation of the model through parallelization of the MCMC updates. An important feature of working with this transformation is that we do not need to do any completion operation to fill in the covariances between the separated parts of prime components (Carvalho et al., 2007) since these correspond to the zeros in the k , k = 1, …, s, parameters.
We use a prior on the junction tree J which is proportional to a Binomial distribution on the number of edges |J| in the graph, that is, p(J| ) ∝ ( Since sparser graphs in general have more junction tree representations (Thomas & Green, 2009), this prior favours sparse structures. Finally, we use a conjugate Beta prior on the hyperparameter η.

| Summary of full model
In this section, we summarise the full model with all its conditional dependencies. We provide the version using the sparse covariance structure. The dense covariance case is as below, except that there are no J or η parameters and the distributions for 2 k and k are the simpler expressions presented in Section 2.2. In Sections 2.2 and 2.3, we have illustrated the results in terms of a general matrix M in the (hyper) inverse-Wishart distribution. In our implementation, we use M = s .
The joint distribution is where with U (k−1) is defined as in Equation (2) and 0 is the Dirac delta function centred at 0. The parameter J stands for the junction tree representing the graph and |J| is the number of edges in the graph represented by the junction tree. J is a perfect elimination ordering of the nodes in the graph. Both S (J) q and H (J) q depend on the graph and are defined in Section 2.3. The index q(k) is the index of the prime residual R (J) q that node k belongs to in the current node ordering for graph J.
Finally, the parameters a o , b o , a , b , a w , b w , a , b and a , b are fixed. The degrees of freedom ν > s − 1 in the inverse-Wishart distribution is also fixed.

| POSTERIOR COMPUTATIONS
In the original model space, posterior full conditionals for and C are available analytically (Holmes et al., 2002). However, these updates require inverting at every MCMC update both the |γ| × |γ| quadratic form in the selected columns of the design matrix  and the s × s matrix for the covariance matrix C. Additionally, the update of γ and all other unknowns where the likelihood is involved, require the heavy computation of the non-factorised likelihood. Our approach, based on the reparametrisation of C which leads to the factorisation of the model and the introduction of a sparse precision matrix via the junction tree representation of the decomposable graph G as its state variable, allows us to introduce a much more computational efficient MCMC scheme that scales well in high-dimensional settings.
Zellner and Ando (2010) used the same reparametrisation in a simpler SUR model without variable selection, using Jeffrey's priors. They devised a direct Monte Carlo procedure for β, 2 and ρ. Their method uses an approximation to the full conditionals, with an additional resampling step for the β. However, it is possible to recover the correct posterior full conditional for β, avoiding unnecessary and computationally prohibitive resampling steps as we show below.
To sample from the posterior distribution of the binary indicators vector γ, we use the evolutionary stochastic search (ESS) algorithm Bottolo et al., 2011;Lewin et al., 2016), which uses a particular form of evolutionary Monte Carlo as defined in Liang and Wong (2000). Within this framework, posterior samples of , 2 and ρ are obtained by employing a Gibbs sampler, but used instead in the joint updates of { , } and {J, 2 , }. Specifically, the posterior full conditionals for and 2 , ρ are used as proposal distributions in the joint updates with γ and J, respectively, since it reduces the posterior correlation between γand J-{ 2 , }. In this set-up, the proposal and target densities cancel out in the Metropolis-Hastings acceptance ratios. This is known as 'implicit marginalisation' (Alexopoulos & Bottolo, 2020;Holmes & Held, 2006) since the resulting acceptance ratio does not contain the current and proposed values of or { 2 , ρ} and it has been shown to greatly improve mixing of the structural parameters γ and J which are in our case the main focus of inference.
We have derived the full conditionals for the regression parameters and state the results here. Further details regarding the derivations can be found in the Section S1.5. The posterior conditional for the non-zero regression coefficients is where the subscript '∖k' implies that the vector of the regression coefficients consists of all the elements except those that are related to the kth response, with u k defined as the residuals given in Equation (1).
In the sparse covariance case, the index sets are defined with respect to the perfect elimination order ξ, that is, where l∼k means that nodes l and k are in the same prime component. In the dense case, these reduce to The posterior updates of the reparametrised covariance parameters depend on the ordering of the nodes and prime residuals of the graph with t = J (k) − | H (J) q | as before, M = s + U T U, M P q ,(t−1) and m qt are submatrices of M defined conformally with previous transformations. In the dense covariance case, the posterior updates reduce to the following equations (with any ordering on the outcomes) To efficiently explore the graphical structure G, we use the sampler introduced by Green and Thomas (2013), making use of the junction tree representation of a decomposable graph as its state variable to allow for bolder, multi-edge proposals in the graph space. The edge probability η is updated via a Gibbs perturbation. All other unknowns are updated via Metropolis-within-Gibbs updates with adaptive proposal distributions (Roberts & Rosenthal, 2009). Algorithm 1 provides an overview of the designed MCMC algorithm to sample from the joint posterior distribution p ( , , 2 , , J, , o, w, , |Y).
Note that, even though each sample is from a decomposable model, the sampler allows us to discover non-decomposable graph structures via Bayesian model averaging of the marginal edge inclusion probabilities. As the graph is updated, the perfect elimination ordering ξ changes, hence we do not retain the sampled values of 2 and ρ. Moreover, as we are interested mainly in structure learning, both for variable and covariance selection, we consider the reparametrised covariance as nuisance parameters.

| SIMULATION STUDY
We evaluate the performance of the reparametrised multivariate sparse SUR model and our efficient sampler in simulated mQTL data. We first investigate the effect of allowing for residual dependence in the phenotypes, by comparing dependent and independent covariances within our own model. We then compare our model with sparse dependence structure against other methods that also allow for dependence. For all the work in the simulation study, we employ the same priors as we use for the mQTL analysis of the NFBC cohort data, see Section 5 for details, except in the comparison with other models that allow covariance selection where we use the simplified version of the hotspot detection prior with j = 1, j = 1, …, p.

| Comparison with models without covariance selection
We validate our method against the Hierarchical Evolutionary Stochastic Search (HESS) algorithm of Bottolo et al. (2011) in a synthetic setting. Following Richardson et al. (2010) and Bhadra and Mallick (2013), we set up our simulation study by randomly subsampling p = 300 SNPs from our real -omics data set (see Section 5). This forms our covariate set X and allows us to mimic real correlation effects and linkage disequilibrium between genetic markers that would be difficult to simulate artificially. The observed correlations between predictors range from small to over 0.8 in absolute value. We set | 897 BOTTOLO eT aL. n = 200 and s = 30 and proceed by selecting the correlation structure for the outcomes in form of a graph. We explore three graphical structures, that is, a block diagonal, a decomposable and a nondecomposable model. To present a range of possible association patterns between outcomes and predictors, we fix (conditionally on the selected graph structure) the binary indicators vector γ so that different sets of predictors display associations with, that is, all outcomes (representing true hotspots), all outcomes within each prime component, all outcomes within each residual component (so predictors are linked only with correlated outcomes and not to conditionally independent ones) and finally with a set of selected outcomes that spans multiple components, so that selected predictors are linked to both correlated and (conditionally) independent outcomes. See Figure S2 for an example of the generated structures.
With the structure fixed, we sample the non-zero regression coefficients from a N(5, 1), so that most of them are distinct from zero, and the residuals from a matrix variate normal distribution, that is, MN 0, n , C . C − 1 is sampled from a G-Wishart distribution, W G (s + 2, M), using the R package BDgraph (Mohammadi & Wit, 2019) with M = αR, R a correlation matrix with r on the off-diagonal elements and ∈ ℝ + to obtain data sets with the desired noise.
We consider two summaries of signal to noise, designed to be sensitive to the information contained in the predictors and in the covariance matrix, respectively: where 2 and ρ are the reparametrised values of the covariance matrix C and  is as defined in Equation (7). We observe that SNR C is highly correlated with the off-diagonal (residual) correlation r. Thus, we parameterise the simulation study in terms of G (block-diagonal, decomposable and non-decomposable), r ∈ {0.3, 0.6, 0.9} and SNR . For each value of G and r, we generate multiple data sets with different α values and use the data with resulting SNR within 10% of each of the desired values of 5, 15 and 25. Based on this criteria, we simulate 20 replicates for each combination of the parameters and run both sparse BayesSUR model and HESS for 250,000 iterations of which 50,000 as burn-in.
To report on performance, we focus on posterior marginal inclusion probabilities, that is, the average over the MCMC iterations of kj , k = 1, …, s, j = 1, …, p. Figure 1 shows the average ROC curves over 20 replicates for each simulation set-up corresponding to SNR = 5, the lowest signal-to-noise ratio, for both BayesSUR with covariance selection and HESS. The results correspond to our expectations, that is, HESS is known to perform relatively well even in cases where residuals are correlated (Bottolo et al., 2011;Lewin et al., 2016) as long as r is not too high. In most cases though, especially at higher r, HESS estimates are more noisy and more false positives are picked up due to the confounding effect of the correlations. BayesSUR has a more marked separations between true and false positive signals and overall returns less noisy estimates (see Figure S5, top panels). The ROC curves relative to the other SNR levels are reported in Figures S3 and S4.
BayesSUR is also able to recover simultaneously the conditional (in)dependence structure of the residuals. Table 1 shows the average over 20 simulated replicates of true positive rates (TPR) and false positive rates (FPR) for graph edges found by thresholding Pr(G kk � = 1|data) at 0.5 probability level.
The graphs are in general well estimated. For non-decomposable graphs, there is a tendency towards over-inclusion, as we would expect based on Fitch et al. (2014), who find that graphs constrained to be decomposable converge to a close (in the graph space), more dense, chordal graph alternative (see for example Figure Table S1.

| Comparison with alternative covariance models
We compare the performance of BayesSUR to two different software implementations of a sparse seemingly unrelated regressions (SSUR) model by Wang (2010). SSUR indirect performs posterior computation of the SSUR model using MCMC, where the regression coefficients are sampled using the Gibbs sampler described in George and McCulloch (1993). SSUR direct uses the marginal likelihood approach of Chib (1995) for 'direct' variable selection of important predictors and non-zero entries of the sparse inverse covariance matrix via Metropolis-Hastings steps. The Matlab version of the SSUR code is available from the author web site. For the 'indirect' version, we run the algorithm for 5 × 10 5 iterations with 10 5 as burn-in, storing the MCMC output every 500 iterations. For the 'direct' version, we run the algorithm for 2 × 10 3 iterations with 10 3 as burn-in. In each iteration, the calculation of the marginal likelihood requires 500 extra samples from the Gibbs sampler, including 100 as burn-in. Overall, the algorithm is run for 10 6 iterations. All hyperparameters and proposal densities are left unchanged as originally set-up in the Matlab code. The prior probability of inclusion is set equal to 0.1 in both versions of the Matlab code. We run BayesSUR with covariance selection for 5 × 10 5 iterations with 10 5 as burn-in, two parallel chains in the EES sampler and matching the hyperparameters of the Beta-Binomial prior on the inclusion probability with the prior used in the SSUR algorithms. We simulate three scenarios, with differing levels of sparsity in the inverse covariance matrix. For each scenario, we simulate 20 replicates with n = 150, p = 30 and s = 20. Out of 30 × 20 regression coefficients, 120 (20%) are simulated from an uniform distribution in (−2, 2). We selected at random the same proportion of cells in the 30 × 20 matrix of regression coefficients and assigned to them the simulated values, while the other cells are set to zero. In our experiment, for each response, on average between 2 and 10 non-zero regression coefficients are assigned. To generate the correlated predictors, we follow Rothman et al. (2010) and simulate, for each i = 1, …, n and k = 1, …, s, is the (j, j ′ )th element of V, j, j � = 1, …, p, implying the same unit marginal variance. The inverse error covariance T − 1 is a Toeplitz matrix with value 0.5 in the first principal diagonal, 0.5 and 0.4 in the first two principal diagonals and 0.5, 0.4 and 0.3 in the first three principal diagonals in Scenario 1, 2 and 3, respectively. In all scenario considered, the sparse diagonal inverse error covariance is positive definite with 19 (10%), 37 (19%) and 54 (28%) non-zeros entries in Scenario 1, 2 and 3, respectively, while the corresponding covariance matrices are dense. Finally, the responses are generated from a Normal matrix variate distribution using the simulated matrix of regression coefficients, the predictors matrix and the dense covariance matrices, that is, Y ∼ MN(XB, n , T), where B is the p × q matrix of the simulated regression coefficients and T is the inverse of the q × q Toeplitz matrix. Figure 2 shows the ROC curves obtained from the simulation study distinguishing between the estimation of the non-zeros regression coefficients (top panels) and the estimation of non-zero entries of the inverse error covariance (bottom panels). From the plots, it is apparent that BayesSUR (with or without covariance selection) has better or similar performance to SSUR. It is more efficient than SSUR direct in all scenarios considered due to the expensive computation of the approximate marginal likelihood that prevents running the algorithm for many iterations. BayesSUR performs better than SSUR indirect whose performance deteriorates as the estimation of the sparse inverse error covariance becomes less sparse (Scenarios 2 and 3). A closer inspection of the MCMC output shows that in Scenarios 2 and 3 both versions of the SSUR algorithm incorrectly estimate that the responses are almost independent conditionally on the estimated regression coefficients (results not shown).
We also compare the performance of BayesSUR with MRCE, the penalized likelihood method with simultaneous estimation of the regression coefficients and the covariance structure proposed by Rothman et al. (2010). While the power to detect non-zero regression coefficients and non-zero elements of the precision matrix is similar to BayesSUR, in all simulated scenarios, MRCE seems to include a larger number of false positives, in particular in the covariance selection.
In addition, we examine the computational time of the algorithms presented in this section. For the Bayesian algorithms, we match the values of the sparse priors hyperparameters and, as far as possible, the total number of iterations. For MRCE, we select the option cv, i.e., the penalty parameters for the regression coefficients and for the covariance structure are chosen by using a 5-fold cross-validation procedure. We also specify different dimensions of the candidate penalty vectors to check the impact of this choice on the computation time. All algorithms are run on an Intel(R) CPU 2.60 GHz with 64 Gb memory. Table 2 shows that BayesSUR is 20 times faster than SSUR direct in all scenarios considered. Interestingly, it is also almost 10 times faster than SSUR indirect with the SSVS Gibbs sampler. This is due to the effect of the direct manipulation of the junction tree representation of a decomposable graph (Green & Thomas, 2013) used in this work, in contrast to the computational expensive evaluation of the decomposability after edge perturbation employed in Wang (2010) and originally proposed by Giudici and Green (1999). The different computational efficiency depends also on the programming F I G U R E 2 Averaged (over 20 simulated replicates) ROC curves to compare the selection performance of the non-zero regression coefficients (top panels) and non-zero elements of the precision matrix (bottom panels) for the methods considered: BayesSUR with covariance selection (solid red line), BayesSUR with dense covariance estimation (dashed red line), SSUR Indirect (black line), SSUR Direct (blue line) and MRCE (green dot). For MRCE, each dot represents the averaged specificity and sensitivity of the corresponding penalised likelihood solution. BayesSUR with dense covariance estimation appears only in the top panels since it does not perform covariance selection. Its performance with respect to selecting regression coefficients is equal to that of BayesSUR with sparse covariance selection, hence the red dashed lines are indistinuishable from the red solid lines. Different scenarios are obtained by specifying distinct Toeplitz matrices for the inverse error covariance [Colour figure can be viewed at wileyonlinelibrary.com] | 901 BOTTOLO eT aL.
language used by the two algorithms, C++ and Matlab, respectively. Note that we employ a single core to run BayesSUR in order to make the comparison with other methods fair. However, large computational gains can be achieved by using a multi-core parallel computing architecture such as Message Passing Interface to exploit the parallelization of step 9 in Algorithm 1. The computational time of MRCE greatly depends on the number of candidate values where the fivefold cross-validation procedure is performed. At the default value, 4 equally spaced grid points, the algorithm is very fast, but it becomes slower than BayesSUR when 200 candidate values are specified. Moreover, in contrast to BayesSUR, the sparser the graph, the slower MRCE becomes. Similarly to the number of iterations in MCMC algorithms, for penalised likelihood methods the choice of the number of candidate penalty values depends on the trade-off between accuracy and computational time.
Finally, we repeat the same analysis presented in this section with s = 150 responses to mimic the number of responses in the motivating application presented in the next section. Results are similar to those presented here, although the analysis becomes computationally prohibitive for both SSUR direct and SUUR indirect. Details of the selection performance of the different methods as well as their computational time are shown in Figure S6 and Table S2, respectively.

ANALYSIS IN THE NORTHERN FINNISH BIRTH COHORT
In this section, we present our results of the mQTL analysis of the NFBC66 data. The serum metabolic data are from the 31-year follow-up study of the NFBC66 and based on a widespread metabolomics platform in epidemiology and genetics (Würtz et al., 2017). After quality control and data cleaning, the data consist of p = 9310 directly genotyped SNPs on chromosome 16 and s = 158 metabolite concentrations, measured on n = 5154 individuals. The metabolites are normalised and standardised via the inverse rank-Normal transformation, following Kettunen et al. (2016).
Thanks to growing evidence in favour of pleiotropy (the association of multiple phenotypes with the same locus) in mQTL analysis (Sabatti et al., 2009), we expect these associations to be driven by a handful of SNPs that associate with numerous metabolites. To drive the variable selection procedure we will therefore use the hotspot prior introduced by Bottolo et al. (2011) which expresses the prior probabilities for variable inclusion into overall sparsity level o k for outcome k and a propensity parameter j for each predictor j with kj ∼ Ber that the average model size and its variance for each outcome is small, as we want to enforce a strong sparsity in the model, that is, E(o k ) = 2 and Var(o k ) = 2. These values imply a priori a range of associations for each response between 0 and 6. We use independent N(0, w) priors on non-zero regression coefficients which correspond to W = w − 1 and let the prior matrix in the Inverse-Wishart for the covariance be diagonal, that is, M = s . Since we standardise and centre all responses and predictors, the hyperparameters for τ and w are set such that these variances are centred on small values but, at the same time, allowing the respective prior to be diffuse (a w = b w = 0.1 and a = b = 0.1). Finally, we set a = 1, b = 1 (E(η) = 1/2, Var(η) ≃ 1/12) which a priori does not push for a sparse graph G.
Our model provides us with a rich output that can be summarised in many ways. Regarding the regression structure, one example is that we can use the posterior of the covariate propensity parameter j , j = 1, …, p, to search for hotspots (i.e. genetic variants that are associated with multiple metabolites). Figure 3 shows the posterior expectations of j for each SNP on chromosome 16. In particular, we report rs4985124, rs931406 and most importantly rs12102766 and rs3764261. Analysing the whole binary indicators vector γ gives us a lot more information though, as shown in Figure  all metabolites superimposed. From the plot, we can see how some SNPs are associated with only one or a few outcomes and would thus be missed by only looking at hotspots detection. By thresholding mPIPs at 0.5, we discover a total of 38 associations and the average Bayes FDR (bFDR, see, e.g. Lewin et al., 2016) is ≈0.058. The associations found using the mPIP are presented in Supplementary  Tables S.3 and S.4. By increasing the mPIP threshold to 0.9, we would keep 32 SNP-metabolite associations with bFDR < 0.01. The results obtained by BayesSUR confirm the known association between SNP rs3764261 in the CEPT gene and HDLs Sabatti et al., 2009), but additionally highlights the relevance of the CEPT locus on different lipoproteins. rs4985124, that we report associated with fatty acids, is situated in the PDXDC1 locus, roughly 23 Kb from SNPs rs11075253 and rs11644601 which were previously linked with fatty acids metabolism (Kettunen et al., , 2016. Finally, rs1210276, F I G U R E 5 Network representation of the associations between SNPs (right) and metabolite (left) after thresholding the marginal posterior inclusion probabilities at 0.5 and dependence structure between metabolites estimate from the graph G after thresholding the marginal posterior edge inclusion probabilities at 0.5 [Colour figure can be viewed at wileyonlinelibrary.com] that we report associated with multiple metabolites connected to VLDL, is situated in the proximity of rs74249229, previously reported by Kettunen et al. (2016).
A comparison with MatrixeQTL (Shabalin, 2012), a widely used 'one-SNP-one-trait-at-a-time' method in GWAS analysis is presented in the Supplementary Material. Not accounting for correlations between metabolites, highly reduces the number of findings in the GWAS analysis. While mostly consistent with BayesSUR in terms of detected loci, multiple close-by SNPs are selected by MatrixeQTL as significantly predictive, whereas BayesSUR method picks only one (see Figures S7  and S8). Our method, that accounts for residual correlation, was also able to uncover other potential important associations that warrant further investigations, in particular rs7191766 associated with multiple cholesterol-related phenotypes.
Finally, Figure 5 presents a summary of the estimated graph G. By thresholding the marginal posterior edge inclusion probabilities (mEPIP) at 0.5, we obtain an adjacency matrix that we represent as a network plot, using the R package igraph (Csardi & Nepusz, 2006). In the same plot, we also represent the selected SNPs and their associations with the metabolites.
An interesting feature of the estimate of G is that we recover the three macrogroups mentioned in the Introduction, that is, lipoprotein concentrations (represented by circles), serum lipids (squares) and low molecular weight metabolites (triangles), with the lipoproteins being further separated into two components. HDLs and LDLs in particular seem to be highly associated with serum lipids, while the VLDL and IDL concentrations form a group almost by themselves. There are associations between the serum lipids and low molecular weight metabolites, driven mostly by a couple of low molecular weight hubs. It is important to note that edges here represent non-zero conditional correlations and we thus expect a much sparser graph than would be seen using marginal correlations. The highly sparse estimate of G also implies that considerable computational gains were achieved using the sparse model.

| DISCUSSION
In this work, we present a novel computational method to perform Bayesian variable selection in a multivariate regression setting for QTLoci analysis that takes into account residual correlations between phenotypes while maintaining a flexible association pattern between phenotypes and genotypes. Although conjugacy is lost for this model, by virtue of a crucial reparametrisation of the covariance matrix, our novel results show that: (i) it is possible to obtain simple expressions for the priors distribution of the reparametrised parameters 2 k and k , k = 1, …, s, in both dense and sparse cases, (ii) posterior full conditionals are available in closed-form expression, including for the regression coefficients k , and (iii) since the likelihood is now computed as a product of independent factors, the posterior updates of 2 k and k can be trivially parallelised which greatly increases the computational efficiency of our model. Thus, our method is able to analyse a large number of outcomes and their associations with a large set of predictors, thanks as well to the efficient C++ implementation, as illustrate in the simulation study and in the motivating application. It is moreover possible to introduce further computational gains by assuming that the conditional independence structure between the residuals is sparse, and inference on the resulting graph is straightforward to obtain. We demonstrate this feature in a simulated example with 150 responses, where BayesSUR with covariance selection is 30% faster than the version of the algorithm with dense covariance estimation.
We have also shown in the simulation study that, when there is non-negligible residual correlations between the responses, our method exhibits better performance in selecting relevant predictors than existing methods (Bottolo et al., 2011;Lewin et al., 2016) and is able at the same time to effectively perform covariance selection. Computationally, BayeSUR is faster than existing Bayesian sparse SUR methods with covariance selection, although in the simulated examples we have not shown the reduction in computational time when multiple cores are used in order to exploit the factorisation of the proposed model. When a large number of responses are considered and the graph is very sparse, BayeSUR computational time is almost comparable to penalized likelihood methods, although the output of the former is much richer (full posterior distributions versus point estimates).
Our method is able to scale well in the regime of hundreds of outcomes and thousands of predictors, as demonstrated in the analysis of the NFBC66 mQTL data set; we are able to recover already published and known associations, as well as uncovering some previously unknown associations that might offer new insights into the relationships between chromosome 16 and lipid metabolism.
One might expect the restriction to decomposable graphs to be too stringent for real applications and various attempt have been made to relax such an assumption, using the G-Wishart distribution first introduced by Roverato (2002) (see also Mitsakakis et al., 2011;Mohammadi & Wit, 2015;Wang et al., 2012 and references therein for some recent examples). However, the computational disadvantages connected with a general graph are in general exceedingly high (Jones et al., 2005).
The work of Fitch et al. (2014) concludes that, under model assumptions similar to ours, inference on G will asymptotically converge towards minimal triangulations of the true graph, that is, the decomposable graph with the smallest number of extra edges, and that inference on the covariance matrix is competitive in terms of prediction errors against penalised likelihood methods that estimate unrestricted graphs like the graphical lasso. In practice, assuming decomposability seems to be sensible and inference on the covariance matrix under such an assumption sound. Additionally, Bayesian modelling averaging enables the estimation via marginal edge inclusion probabilities of a general non-decomposable graph.
Thanks to the very general formulation of the SUR models we expect the present work to find applications beyond the mQTL application presented here, for example in finance, econometrics and other biological settings where linked regression models are already widespread.