A simple new approach to variable selection in regression, with application to genetic fine-mapping

We introduce a simple new approach to variable selection in linear regression, and to quantifying uncertainty in selected variables. The approach is based on a new model – the “Sum of Single Effects” (SuSiE) model – which comes from writing the sparse vector of regression coefficients as a sum of “single-effect” vectors, each with one non-zero element. We also introduce a corresponding new fitting procedure – Iterative Bayesian Step-wise Selection (IBSS) – which is a Bayesian analogue of traditional stepwise selection methods. IBSS shares the computational simplicity and speed of traditional stepwise methods, but instead of selecting a single variable at each step, IBSS computes a distribution on variables that captures uncertainty in which variable to select. We show that the IBSS algorithm computes a variational approximation to the posterior distribution under the SuSiE model. Further, this approximate posterior distribution naturally leads to a convenient, novel, way to summarize uncertainty in variable selection, and provides a Credible Set for each selected variable. Our methods are particularly well suited to settings where variables are highly correlated and true effects are very sparse, both of which are characteristics of genetic fine-mapping applications. We demonstrate through numerical experiments that our methods outperform existing methods for this task, and illustrate the methods by fine-mapping genetic variants that influence alternative splicing in human cell-lines. We also discuss both the potential and the challenges for applying these methods to generic variable selection problems.


Introduction
The need to identify, or "select", relevant variables in regression models arises in a diverse range of applications, and has spurred the development of a correspondingly diverse range of methods (e.g., see O'Hara and Sillanpää, 2009;Fan and Lv, 2010;Desboulets, 2018, for reviews). However, variable selection is a complex problem, and so despite considerable work there remain important issues that existing methods do not fully address. Here we develop a new approach to this problem that has several attractive features: it is simple, computationally scaleable, and it provides new, more effective, ways to capture uncertainty in which variables should be selected. Our new approach is particularly helpful in situations involving highly correlated variables, where it may be impossible to confidently select any individual variable, but it may nonetheless be possible to confidently draw useful conclusions such as "either variable A or B is relevant". More generally it may be possible to confidently identify "Credible Sets" of (correlated) variables, that each, with high probability, contain a relevant variable. Our new approach can quickly, simply and reliably identify such sets, as well as prioritize the variables within each set.
Our work, although potentially of broad interest, is particularly motivated by genetic fine-mapping studies (e.g Veyrieras et al., 2008;Maller et al., 2012;Spain and Barrett, 2015;Huang et al., 2017;Schaid et al., 2018), which aim to identify which genetic variants influence some trait of interest (e.g. LDL cholesterol in blood, gene expression in cells). Genetic fine-mapping can be helpfully framed as a variable selection problem, by building a regression model with the trait as the outcome and genetic variants as predictor variables (Sillanpää and Bhattacharjee, 2005). Performing variable selection in this model identifies variants that may causally affect the trait, and this -rather than prediction accuracy -is the main goal here. Fine-mapping is an example of a variable selection problem that often involves highly correlated variables: neighboring genetic variants are often highly correlated, a phenomenon called linkage disequilibrium (Ott, 1999).
Our approach builds on previous work on Bayesian variable selection regression (BVSR) (Mitchell and Beauchamp, 1988;George and McCulloch, 1997), which is already commonly used for genetic fine-mapping and related applications (e.g. Meuwissen et al., 2001;Sillanpää and Bhattacharjee, 2005;Servin and Stephens, 2007;Guan and Stephens, 2011;Bottolo et al., 2011;Maller et al., 2012;Carbonetto and Stephens, 2012;Hormozdiari et al., 2014;Chen et al., 2015;Wallace et al., 2015;Wen et al., 2016;Lee et al., 2018). BVSR has several appealing features compared with other approaches to variable selection. In particular, in principle, BVSR can assess uncertainty about which variables to select, even in the presence of strong correlations among variables. However, applying BVSR in practice remains difficult for two reasons. First, BVSR is computationally challenging, often requiring implementation of sophisticated Markov chain Monte Carlo or stochastic search algorithms (Bottolo and Richardson, 2010;Bottolo et al., 2011;Guan and Stephens, 2011;Wallace et al., 2015;Benner et al., 2016;Wen et al., 2016;Lee et al., 2018). Second, and perhaps more important, the output of existing methods for fitting BVSR is typically a complex posterior distribution, or a sample from it, which can be difficult to distill into easily-interpretable conclusions.
Our new approach addresses these limitations of BVSR through several innovations. First we introduce a new formulation of BVSR, which we call the Sum of Single Effects (SuSiE) model. This model, while similar to existing BVSR models, has a different structure that naturally leads to a very simple, intuitive and fast fitting procedure -Iterative Bayesian Stepwise Selection (IBSS) -which is a Bayesian analogue of traditional stepwise seletion methods. We show that IBSS can be viewed as computing a variational approximation to the posterior distribution under the SuSiE model. Unlike previous variational approaches to sparse regression (Logsdon et al., 2010;Carbonetto and Stephens, 2012), this new approach deals well with correlated variables. Furthermore, the approximate posterior leads immediately to Credible Sets of variables that are designed to be as small as possible while still each capturing a relevant variable. Arguably this is exactly the kind of posterior summary that one would like to obtain from MCMC-based or stochastic search BVSR methods, but doing so would require non-trivial post-processing of their output. In contrast, our method provides this posterior summary directly, and at a fraction of the computational effort.
The structure of the paper is as follows. In Section 2 we provide further motivation for our work, and brief background on BVSR. Section 3 describes the new SuSiE model and fitting procedure. In Section 4 we use simulations, designed to mimic realistic genetic fine-mapping studies, to demonstrate the effectiveness of our approach compared with existing BVSR methods. Section 5 illustrates our methods on fine-mapping of genetic variants affecting splicing, and Section 6 briefly highlights both the promise and the limitations of our methods for other applications using change-point problems. We end with Discussion highlighting avenues for further work.

A motivating toy example
Suppose the relationship between y (an n-vector) and variables X " px 1 , . . . , x p q, an nˆp matrix, is modeled as a multiple regression: y " Xb`e (2.1) e " N p0, σ 2 I n q, in which b is a p-vector of regression coefficients, e is an n-vector of error terms, σ 2 ą 0 is the residual variance, and I n is the nˆn identity matrix. For brevity, we will refer to variables j with non-zero effects (b j ‰ 0) as "effect variables". We assume that exactly two variables are effect variables -variables 1 and 4, say -and that each of these two effect variables are each completely correlated with another non-effect variable, say x 1 " x 2 and x 3 " x 4 . We further suppose that no other pairs of variables are correlated. In this situation, given sufficient data, it should be possible to conclude that there are (at least) two effect variables. However, because the effect variables are completely correlated with other variables, it will be impossible to confidently select the correct variables, even when n is very large. The best we can hope to infer is that pb 1 ‰ 0 or b 2 ‰ 0q and pb 3 ‰ 0 or b 4 ‰ 0q. (2.2) Our goal, in short, is to provide methods that directly produce this kind of inferential statement. Although this example is simplistic, it mimics the kind of structure that occurs in, for example, genetic fine-mapping applications, where it often happens that an association can be narrowed down to a small set of highly correlated genetic variants, and it is desired to provide a quantitative summary about which genetic variants are, based on the data, the plausible effect variables.
Most existing approaches to sparse regression do not provide statements like (2.2). For example, sparse regression methods based on penalized likelihood (e.g., Lasso; Tibshirani, 1996) would, in our example, select one of the four equivalent configurations (combinations of variables) -t1, 3u, t1, 4u, t2, 3u or t2, 4u -and give no indication that other configurations are equally plausible. Attempting to control error rates of false discoveries at the level of individual variables using methods such as stability selection (Meinshausen and Bühlmann, 2010) or the knockoff filter (Barber and Candès, 2015) will result in no discoveries, since no individual variable can be confidently declared an effect variable. One potential solution would be to first cluster the variables into groups of highly correlated variables and then perform some kind of "group selection" (Huang et al., 2012) or hierarchical testing (Meinshausen, 2008;Mandozzi and Bühlmann, 2016). However, although this might work in our stylized example, in general this approach involves ad hoc decisions about which variables to cluster together -an unattractive feature we seek to avoid.
In principle, Bayesian approaches (BVSR; see Introduction for references) can solve this problem. These approaches introduce a prior distribution on the coefficients b to capture the notion that b is sparse, then compute (approximately) a posterior distribution that assesses relative support for each configuration. In our example, the posterior distribution would typically have equal mass (« 0.25) on the four equivalent configurations. This posterior distribution contains exactly the information necessary to infer (2.2). However, even in this simple example, translating the posterior distribution to the simple statement (2.2) requires some effort, and in more complex settings such translations become highly non-trivial. Practical implementations of BVSR typically summarize the posterior distribution by the marginal posterior inclusion probability (PIP) of each variable, PIP j -Prpb j ‰ 0 | X, yq. (2.3) In our example, they would report PIP 1 " PIP 2 " PIP 3 " PIP 4 « 0.5. While not inaccurate, these marginal PIPs nonetheless fail to convey the information necessary to infer (2.2).

Credible Sets
To define our main goal more formally, we introduce the concept of a Credible Set (CS) of variables: Definition 1. In the context of a multiple regression model, we define a level-ρ Credible Set to be a subset of variables that has probability ě ρ of containing at least one effect variable ( i.e., a variable with non-zero regression coefficient). Equivalently, the probability that all variables in the Credible Set have zero regression coefficients is no more than 1´ρ.
Our use of the term "Credible Set" here indicates that we have in mind a Bayesian inference approach, in which the probability statements in the definition are statements about uncertainty in the regression coefficients with respect to the available data and modelling assumptions. One could analogously define a "Confidence Set" by interpreting the probability statements as referring to the set, considered random.
Although the term "Credible Set" has been used in fine-mapping applications before, most previous uses either assumed there was a single effect variable (Maller et al., 2012), or defined a CS as a set that contains all effect variables (Hormozdiari et al., 2014), which is a very different definition (and, we argue, both less informative and less attainable). Our definition here is closer to the "signal clusters" from Lee et al. (2018). It is also related to the idea of "minimal true detection" in Mandozzi and Bühlmann (2016).
With Definition 1, our primary aim can be restated: we wish to report as many CSs as the data support, each as few variables as possible. For example, to infer (2.2) we would like to report two CSs, t1, 2u and t3, 4u. As a secondary goal, we would also like to prioritize the variables within each CS, assigning each a probability that reflects the strength of the evidence for that variable being an effect variable. Our methods achieve both these goals.

The single effect regression model
The building block for our approach is the "single effect regression" (SER) model, which we define as a multiple regression model in which exactly one of the p explanatory variables has a non-zero regression coefficient. This idea was introduced in Servin and Stephens (2007) to fine-map genetic associations, and consequently adopted and extended by others, such as Veyrieras et al. (2008) and Pickrell (2014). Although of very narrow applicability, the SER model is trivial to fit. Furthermore, when its assumptions hold the SER provides exactly the inferences we desire, including CSs. For example, if we simplify our motivating example (Section 2.1) to have a single effect variable -variable 1, for example -then the SER model would, given sufficient data, infer a 95% CS containing both of the correlated variables, 1 and 2, with PIPs of approximately 0.5 each. This CS tells us that we can be confident that one of the two variables has a non-zero coefficient, but we do not know which one.
Specifically, we consider the following SER model, with hyperparameters σ 2 (the residual variance), σ 2 0 (the prior variance of the non-zero effect) and π " pπ 1 , . . . , π p q (a p-vector of prior inclusion probabilities, with π j giving the prior probability that variable j is the effect variable): Here, y is the n-vector of response data; X is an nˆp matrix; b is a p-vector of regression coefficients; e is an n-vector of independent error terms; and Multpm, πq denotes the multinomial distribution on class counts obtained when m samples are drawn with class probabilities given by π. (We assume that y and the columns of X have been centered to have mean zero, which avoids the need for an intercept term; see Chipman et al. 2001.) Under the SER model (2.4-2.8), the effect vector b has exactly one nonzero element (equals to b), so we refer to b as a "single effect vector". The element of b that is non-zero is determined by the binary vector γ, which also has exactly one non-zero entry. The probability vector π determines the prior probability distribution of which of the p variables is the effect variable. In the simplest case, π " p1{p, . . . , 1{pq; we assume this uniform prior here but SER model requires only that π is fixed and known (so in fine-mapping one could incorporate different priors based on genetic annotations; e.g., Veyrieras et al., 2008). To lighten notation, we henceforth make conditioning on π implicit.

Empirical Bayes for SER model
Although most previous treatments of the SER model assume σ 2 0 and σ 2 to be fixed and known, we note here the possibility of estimating σ 2 0 and/or σ 2 by maximum-likelihood before computing the posterior distribution of b. This is effectively an "Empirical Bayes" approach. The likelihood for σ 2 0 and σ 2 under the SER, Lpσ 2 0 , σ 2 q :" ppy | X, σ 2 0 , σ 2 q, (2.12) is available in closed form (Appendix A), and can be maximized over one or both parameters numerically.

The Sum of Single Effects Regression model
We now introduce a new approach to variable selection in multiple regression. Our approach is motivated by the observation that the SER model provides simple inference if there is indeed exactly one effect variable; it is thus desirable to extend the SER to allow for multiple variables. The conventional approach to doing this in BVSR is to introduce a prior on b that allows for multiple non-zero entries (e.g., using the "spike and slab" prior of Mitchell and Beauchamp, 1988). However, this approach no longer enjoys the convenient analytical properties and intuitive interpretation of the posterior distribution under the SER model: posterior distributions become difficult to compute accurately, and CSs even harder.
Here we introduce a different approach, which better preserves the nice features of the SER model. The key idea is straightforward to describe: introduce multiple single-effect vectors, b 1 , . . . , b L , and construct the overall effect vector b as the sum of these single effects. We call this the "SUm of SIngle Effects" (SuSiE) regression model: For generality, we have allowed the variance of each effect, σ 2 0l , to vary among the components l " 1, . . . , L. The special case in which L " 1 recovers the SER model. For simplicity, we initially assume σ 2 and σ 2 0 " pσ 2 01 , . . . , σ 2 0L q are given, and defer estimation of these hyperparameters to Section 3.3.
Note that if L ! p then the SuSiE model is approximately equivalent to a standard BVSR model in which L randomly chosen variables have non-zero coefficients. The only difference is that with some (small) probability some of the b l in the SuSiE model may have the same non-zero co-ordinates, and so the number of non-zero elements in b has some (small) probability to be less than L. Thus at most L variables have non-zero coefficients in this model. We discuss choice of L in Section 3.5.
Although the SuSiE model is approximately equivalent to a standard BVSR model, its novel structure has two major advantages. First, it leads to a simple, iterative and deterministic algorithm for computing approximate posterior distributions. Second, it yields a simple way to calculate the CSs. In essense, because each b l captures only one effect, the posterior distribution on each γ l can be used to compute a CS that has a high probability of containing an effect variable. The remainder of this section details both these advantages, and other issues that may arise in fitting the model.

Fitting SuSiE: Iterative Bayesian stepwise selection
The key to SuSiE model fitting procedure is that, given b 1 , . . . , b L´1 , estimating b L involves fitting the simpler SER model, which is analytically tractable. This immediately suggests an iterative algorithm that uses the SER model to estimate each b l in turn, given estimates of the other b l 1 (l 1 ‰ l). Algorithm 1 details such an algorithm, which is both simple and computationally scalable (computational complexity OpnpLq per iteration).

Algorithm 1 Iterative Bayesian stepwise selection
Require: data y and variable matrix X.
We call Algorithm 1 "Iterative Bayesian Stepwise Selection" (IBSS) because it can be viewed as a Bayesian version of stepwise selection approaches. For example, we can compare it with an approach referred to as "Forward-Stagewise" (FS) selection in Hastie et al. (2009) Section 3.3.3 (although subsequent literature often uses this term slightly differently). In brief, FS first selects the single "best" variable among p candidates by comparing the results of the p univariate regressions. It then computes the residuals from the univariate regression on this selected variable, and selects the next "best" variable by comparing the results of univariate regression of the residuals on each variable. This process continues, selecting one variable each iteration, until some stopping criteria is reached.
IBSS is similar to FS, but instead of selecting a single "best" variable at each step, it computes a distribution on which variable to select, by fitting the Bayesian SER model (Step 5). Similar to FS, this distribution is based on the results of the p univariate regressions, and so IBSS has the same computational complexity as FS (Opnpq per selection). However, by computing a distribution on variables -rather than choosing a single best variable -IBSS captures uncertainty about which variable should be selected at each step. This uncertainty is taken into account when computing residuals (Step 4) by using a model-averaged (posterior mean) estimate for the regression coefficients. In IBSS we incorporate an iterative procedure, whereby early selections are re-evaluated in light of the later selections (as in "backfitting"; Friedman and Stuetzle, 1981). The final output of IBSS is L distributions on variables (parameterized by pα l , µ 1l , σ 1l q; l " 1, . . . , L), in place of the L selected variables output by FS. Each distribution is easily summarized by, for example, a 95% CS for each selection.
To illustrate, consider our motivating example (Section 2.1) with x 1 " x 2 , x 3 " x 4 , and with variables 1 and 4 having non-zero effects. Suppose for simplicity that the effect of variable 1 is substantially larger than the effect of variable 4. Then FS would first select either variable 1 or 2 (which one being arbitrary), and then select variable 3 or 4 (again, which one being arbitrary). In contrast, given enough data, the first step of IBSS would select both variables 1 and 2 (with equal weight, « 0.5 each, and small weights on other variables). The second step of IBSS would similarly select variables 3 and 4 (again with equal weight, « 0.5 each). Summarizing these results would yield two CSs, t1, 2u and t3, 4u, and the inference (2.2) falls into our lap. This simple example is intended only to sharpen intuition; later numerical experiments demonstrate that IBSS works well in realistic settings.

IBSS computes a variational approximation to the SuSiE posterior distribution
We now provide a more formal justification for the IBSS algorithm. Specifically, we show that it is a coordinate ascent algorithm for optimizing a variational approximation to the posterior distribution for b 1 , . . . , b L under the SuSiE model (3.1)-(3.6). This result also leads to a natural extension of the algorithm to estimate the hyperparameters σ 2 , σ 2 0 . The idea behind variational approximation methods for Bayesian models (e.g. Blei et al., 2017) is to find an approximation qpb 1 , . . . , b L q to the posterior distribution p post :" ppb 1 , . . . , b L |yq, by minimizing the Kullback-Leibler (KL) divergence from q to p post , D KL pq, p post q, subject to constraints on q that make the problem tractable. Although D KL pq, p post q is hard to compute, it can be written as: where F is an easier-to-compute function known as the "evidence lower bound" (ELBO). (Note: F depends on the data y, X, but we suppress this dependence to lighten notation.) Because log ppy|σ 2 , σ 2 0 q does not depend on q, minimizing D KL over q is equivalent to maximizing F ; and since F is easier to compute this is how the problem is usually framed. See Appendix B.1 for further details.
Here we seek an approximate posterior, q, that factorizes: Under this approximation b 1 , . . . , b L are independent a posteriori. We make no assumptions on the form of q l , and in particular we do not assume that q l factorizes over the p elements of b l . This is a crucial difference from previous variational approaches for standard multiple regression models (e.g. Logsdon et al., 2010;Carbonetto and Stephens, 2012), and it means that q l can capture the strong dependencies among the elements of b l that are induced by the assumption that exactly one element of b l is non-zero. Intuitively each q l captures one effect variable, and provides inferences of the form "we need one of variables tA, B, Cu but we cannot tell which". Similarly, the approximation (3.8) provides inferences of the form "we need one of variables tA, B, Cu and one of variables tD, E, F, Gu, and . . . ". Under (3.8), finding the optimal q can now be written as: Although jointly optimizing F over (q 1 , . . . , q L ) is not straightforward, it turns out to be very easy to optimize over a single q l (given q l 1 for l 1 ‰ l), by fitting an SER model, as formalized in the following proposition.
For intuition in this proposition, recall that computing the posterior distribution for b l under the SuSiE model if the other effects b l 1 , l 1 ‰ l were known reduces to fitting a SER to residuals y´X ř l 1 ‰l b l 1 . Now consider computing an (approximate) posterior distribution for b l when b l 1 are not known, but we have approximations q l 1 to their posterior distributions. This is, essentially, the problem of finding arg max q l F pq 1 , . . . , q L q. Proposition 1 says that we can solve this using a similar procedure as for known b l 1 , but replacing each b l 1 with its (approximate) posterior meanb l 1 . The proof is given in Appendix B (Proposition 2).
The following corollary is an immediate consequence of Proposition 1: Corollary 1. Algorithm 1 is a coordinate ascent algorithm for maximizing the ELBO F , and therefore for minimizing the KL divergence D KL pq, p post q. Proof.
Step 5 of Algorithm 1 is simply computing the right hand side of equation (3.10). Thus, by Proposition 1, it is a coordinate ascent step for optimizing the lth coordinate of F pq 1 , . . . , q L ; σ 2 , σ 2 0 q (the distribution q l being determined by the parameters α l , µ 1l , σ 1l ).
Optimizing F over σ 2 involves computing the expected residual sum of squares under the variational approximation, which is straightforward; see Appendix B for details.
Optimizing F over σ 2 0 " pσ 2 0l , . . . , σ 2 0L q can be achieved by modifying the step that computes the posterior distribution for b l under the SER model (Step 5) to first estimate the hyperparameter σ 2 0l in the SER model by maximum likelihood. That is, by optimizing (2.12) over σ 2 0 , keeping σ 2 fixed. This is a one-dimensional optimization which is easily performed numerically (we used the R function uniroot).
Algorithm 4 in Appendix B extends IBSS to include both these steps.

Posterior inference: posterior inclusion probabilities and Credible Sets
Algorithm 1 provides an approximation to the posterior distribution on b under the SuSiE model, parameterized by pα 1 , µ 11 , σ 11 q, . . . , pα L , µ 1L , σ 1L q. From these results it is straightforward to compute approximations to various posterior quantities of interest, including PIPs and CSs.

Posterior inclusion probabilities
Under the SuSiE model the effect b j is zero if and only if b lj " 0 for all l. Under the variational approximation the b lj are independent across l, and so: Here, L " tl : σ 2 0l ą 0u, to account for the edge case where some σ 2 0l " 0 (which can happen when σ 2 0 is estimated as in Section 3.3).

Credible Sets
Simply computing the sets CSpα l ; ρq (A.17) for l " 1, . . . , L immediately yields L CSs that satisfy Definition 1 under the variational approximation to the posterior.
If L exceeds the number of detectable effects in the data then in practice it turns out that many of the L CSs are large, often containing the majority of variables. The intuition is that once all the detectable signals have been accounted for, the IBSS algorithm becomes very uncertain about which variable to include at each step, and so the distributions α become very diffuse. CSs that contain very many uncorrelated variables are of essentially no inferential value -whether or not they actually contain a effect variable -and so in practice it makes sense to ignore them. To automate this process, in this paper we discard CSs with purity ă 0.5, where we define purity as the smallest absolute correlation among all pairs of variables within the CS. (To reduce computation for CSs containing ą100 variables we sampled 100 variables at random to compute purity.) The purity threshold 0.5 was chosen primarily for comparability with Lee et al. (2018) who use a similar threshold in a related context. Although any choice of threshold is somewhat arbitrary, in practice we observed that most CSs are either very pure (ą 0.95) or very impure (ă 0.05), with intermediate cases being rare ( Figure S2), and so results are robust to this choice of threshold.

Choice of L
It may seem that choice of suitable L would be crucial. However, in practice key inferences are robust to overstating L; for example in our simulations later the true L is 1-5, but we obtain good results with L " 10. This is because, when L is too large, the method becomes very uncertain about where to place the additional (non-existent) effects; consequently it distributes them broadly among many variables, and so they have little impact on key inferences. For example, setting L too large inflates the PIPs of many variables just very slightly, and leads to some low-purity CSs that are filtered out (see Section 3.4.2).
Although inferences are generally robust to overstating L, we also note that the Empirical Bayes version of our method, which estimates σ 2 0 , also effectively estimates the number of effects: when L is greater than the number of signals in the data, the maximum likelihood estimate of σ 2 0l is often 0 for many l, which forces b l " 0. This is connected to "Automatic relevance determination" (Neal, 1996).

Numerical Comparisons
We use simulations to assess our methods and compare with standard BVSR methods. Our simulations are designed to mimic genetic fine-mapping studies, in particular fine-mapping of expression quantitative trait loci (eQTLs) -eQTLs are genetic variants associated with gene expression.
In genetic fine-mapping, X is a matrix of genotype data, in which each row corresponds to an individual, and each column corresponds to a genetic variant, typically a single nucleotide polymorphism (SNP). In our simulations, we used the real human genotype data from the n " 574 genotype samples collected as part of the Genotype-Tissue Expression (GTEx) project (GTEx Consortium, 2017). To simulate fine-mapping of cis effects on gene expression, we randomly select 150 genes from the ą 20,000 genes on chromosomes 1-22, then take X to be the genotypes for genetic variants nearby the transcribed region of the selected gene. For a given gene, between p " 1,000 and p " 12,000 SNPs are included in the fine-mapping analysis; for more details on how SNPs are selected, see Appendix C.
To assess accuracy of SuSiE inference by comparing estimates against "groundtruth", we generate synthetic gene expression data y under the multiple regression model (2.1), with various assumptions on the effects b. We specify our assumptions about the simulated effects b using two parameters: S, the number of effect variables; and φ, the proportion of variance in y explained by X ("PVE" for short).
We consider two sets of simulations. In the first set, each data set has exactly p " 1, 000 SNPs. We simulate data sets under all combinations of S P t1, . . . , 5u and φ P t0.05, 0.1, 0.2, 0.4u. These settings were chosen to span typical values for eQTL studies. Given choices of S and φ, we take the following steps to simulate gene expression data: (a) Sample the indices S of the S effect variables uniformly at random from t1, . . . , pu. (b) For each j P S, draw b j " N p0, 0.6 2 q, and set b j " 0 for all j R S. (c) Set σ 2 to achieve the desired PVE, φ; specifically, we solve for σ 2 in φ " We simulate two replicates for each gene and each scenario, resulting in a total of 2ˆ150ˆ4ˆ5 " 6,000 simulations.
In the second set of simulations, we generate larger data sets with 3,000 to 12,000 SNPs. To simulate gene expression data, we set S " 10 and φ " 0.3. Again, we simulate two replicates for each gene, so in total, we generate an additional 2ˆ150 " 300 data sets in the second set of simulations.

Illustrative example
We begin with an example to illustrate that, despite its simplicity, the IBSS algorithm (Algorithm 1) can perform well in a challenging situation. This example is given in Figure 1.
We draw this example from one of our simulations in which the variable most strongly associated with y is not one of the actual effect variables (in this particular example, there are two effect variables). This situation occurs because at least one variable has moderate correlation with both effect variables, and these effects combine to make its marginal association stronger than the marginal associations of the individual effect variables. Standard forward selection in this case would select the wrong variable in the first step; by contrast, the iterative nature of IBSS allows it to escape this trap. Indeed, in this example IBSS yields two high-purity CSs, each containing one of the effect variables.
Interestingly, in this example the most strongly associated variable does not appear in either CS. This illustrates that multiple regression can sometimes result in very different conclusions compared to a marginal association analysis. An animation showing the iteration-by-iteration progress of the IBSS algorithm can be found at our manuscript resource repository . Marginal associations  Figure 1: Illustration that IBSS algorithm can deal with a challenging case. Results are from a simulated data with p " 1, 000 variables (the SNPs), of which two are effect variables (labeled as "SNP 1" and "SNP 2", in red). This example was chosen because the strongest marginal association is with a non-effect variable (at position 780 on the x-axis); see the p values in the left-hand panel. Despite its simplicity, the IBSS algorithm converges to a solution in which the two 95% CSs -represented by the light and dark blue open circles in the right-hand panel -each contain a true effect variable. Additionally, neither CS contains the variable that has the strongest marginal association. One CS contains only 3 SNPs, whereas the other CS (in dark blue) contains 37 very highly correlated variables (minimum pairwise absolute correlation of 0.972). In the latter CS, the individual PIPs are small, but the inclusion of the 37 variables in this CS indicates, correctly, high confidence in one effect variable among them.

Posterior inclusion probabilities
Next we seek to assess the effectiveness of our methods more quantitatively. We focus initially on one of the simpler tasks in BVSR: computing posterior inclusion probabilities (PIPs). Most implementations of BVSR compute PIPs, making it possible to compare results across several implementations. Here we compare our methods (henceforth SuSiE for short, implemented in an R package, susieR, version 0.4.29) with three other software implementations aimed at genetic fine-mapping applications: CAVIAR (Hormozdiari et al., 2014, version 2.2), FINEMAP (Benner et al., 2016, version 1.1) and DAP-G (Wen et al., 2016;Lee et al., 2018, GitHub commit ef11b26). These C++ software packages implement different algorithms to fit similar BVSR models, which differ in details such as priors on effect sizes. CAVIAR exhaustively evaluates all possible combinations of up to L non-zero effects among the p variables, whereas FINEMAP and DAP-G approximate this exhaustive approach by heuristics that target the best combinations. Another difference among methods is that FINEMAP and CAVIAR perform inference using summary statistics computed for each dataset -specifically, the marginal association Z scores and the pˆp correlation matrix for all variables -whereas, as we apply them here, DAP-G and SuSiE use the full data. The summary statistic approach can be viewed as approximating inferences from the full data; see Lee et al. (2018) for discussion.
For SuSiE, we set L " 10 for the first set of simulations, and L " 20 for the data sets with the larger numbers of SNPs. We assessed performance when estimating both hyperparameters σ 2 , σ 2 0 , and when fixing one or both of these. Overall, results from these different strategies were similar. In the main text, we show results obtained when estimating σ 2 and fixing σ 2 0l to 0.1Varpyq, to be consistent with data applications in Section 5; other results are found in Supplementary Data ( Figure S4, Figure S5). Parameter settings for other methods are given in Appendix C. Since CAVIAR and FINEMAP were much more computationally intensive than DAP-G and SuSiE, we ran all methods in simulations with S " 1, 2, 3, and only ran DAP-G and SuSiE in simulations with S ą 3.
Since these methods differ in their modelling assumptions, we do not expect their PIPs to agree exactly. Nonetheless, we found generally good agreement ( Figure 2A). For S " 1, the PIPs from all four methods closely agree. For S ą 1, the PIPs from different methods are also highly correlated; correlations between PIPs from SuSiE and other methods vary from 0.94 to 1, and the proportions of PIPs differing by more than 0.1 between methods vary from 0.013% to 0.2%. Visually, this agreement appears less strong because the eye is drawn to the small proportion of points that lie away from the diagonal, while the vast majority of points lie on or near the origin. In addition, all four methods produce reasonably well-calibrated PIPs ( Figure S1).
The general agreement of PIPs from four different methods suggests that: (i) all four methods are mostly accurate for computing PIPs for the size of the problems explored in our numerical comparisons; and (ii) the PIPs themselves are typically robust to details of the modelling assumptions. Nonetheless, non-trivial differences in PIPs are clearly visible from Figure 2A. Visual inspection of the results of these simulations suggests that, in many of these cases, SuSiE assigns higher PIPs to the true effect variables than other methods, particularly compared to FINEMAP and CAVIAR; for non-effect variables where other methods report high PIPs, SuSiE often correctly assigns PIPs close to zero. These observations suggest that the PIPs from SuSiE may better distinguish effect variables from non-effect variables. This is confirmed by our analysis of power vs. False Discovery Rate (FDR) for each method, which is obtained by varying the PIP threshold for each method ( Figure 2B): the SuSiE PIPs always yield comparable or higher power at a given FDR.
Notably, even implemented in R, SuSiE computations are much faster than others implemented in C++: in the S " 3 simulations, SuSiE is roughly 4 times faster than DAP-G, 30 times faster than FINEMAP, and 4,000 times faster than CAVIAR on average (Table 1).
In summary, the results suggest that SuSiE produces PIPs that are as or more reliable than existing methods, and does so at a fraction of the computational cost.

Credible Sets
A key feature of SuSiE is that it yields multiple Credible Sets (CSs), each aimed at capturing an effect variable (Definition 1). The only other BVSR method that attempts something similar -as far as we are aware -is DAP-G, which outputs "signal clusters" defined by heuristic rules (Lee et al., 2018). Although Lee et al. (2018) do not refer to their signal clusters as CSs, and do not give a formal definition of signal cluster, the signal clusters have a similar goal to our CSs, and so for brevity we henceforth refer to them as CSs. We compared the level 95% CSs produced by SuSiE and DAP-G in several ways. First we assessed their empirical (frequentist) coverage levels, i.e. , proportion of CSs that contain an effect variable. Since our CSs are Bayesian Credible Sets, they are not designed or guaranteed to have frequentist coverage 0.95 (Fraser, 2011). Indeed, coverage will inevitably depend on simulation scenario. For example, in completely null simulations (b " 0) every CS would necessarily contain no effect variable, and so coverage would be 0. Nonetheless, one might hope that under reasonable simulations that include effect variables the Bayesian CSs would have coverage near the nominal levels. And indeed, we confirmed this was the case: in these simulations CSs from both methods typically had coverage slightly below 0.95, but usually ą 0.9 (Figure 3; see Figure S3 for additional results).
Having established that the methods produce CSs with similar coverage, we compared them by three other criteria: (i) power (overall proportion of simulated effect variables included in a CS); (ii) average size (median number of variables included in CS) and (iii) purity (here measured as average squared correlation of variables in CS since this is output by DAP-G). By all three metrics the CSs from SuSiE are consistently better: higher power, smaller size and higher purity (Figure 3).
Although the way that we construct CSs in SuSiE does not require that they be disjoint, we note that in practice here CSs rarely overlapped (after filtering out low purity CSs; Section 3.4.2). Indeed, across all simulations there was only one instance of a pair of overlapping CSs.

Genome-wide identification of splice QTL in human cell lines
To illustrate SuSiE on a real fine-mapping problem we analyzed data from Li et al. (2016) aimed at detecting genetic variants (SNPs) that influence splicing (known as "splice QTLs", sQTLs). These authors quantified alternative splicing by estimating, at each intron in each sample, a ratio capturing how often the intron is used relative to other introns in the same "cluster" (roughly, gene). The data involve 77,345 intron ratios measured on lymphoblastoid cell lines from 87 Yoruban individuals, together with genotypes of these individuals. Following Li et al. (2016) we pre-process the intron ratios by regressing out the first 3 principle components of the matrix of intron ratios, which aims to control for unmeasured confounders (Leek and Storey, 2007). For each intron ratio we test for its association with SNPs within 100kb of the intron, which is on average " 600 SNPs. In other words, we run SuSiE on 77,345 data sets with n " 87 and p « 600.
To specify the prior variance σ 2 0l we first estimated typical effect sizes from the data on all introns. Specifically we performed simple (SNP-by-SNP) regression analysis at every intron, and estimated the PVE of the top (strongest associated) SNP. The mean PVE of the top SNP across all introns was 0.096, and so we  applied SuSiE with σ 2 0l " 0.096Varpyq (with the columns of X standardized to have unit variance).
We then applied SuSiE to fine-map sQTLs at all 77,345 introns. After filtering for purity, this yielded a total of 2,652 CSs (level 0.95) which were spread across 2,496 intron units. These numbers are broadly in line with the original study, which reported 2,893 significant introns at 10% FDR. Of the 2,652 CSs identified, 457 contain exactly one SNP, and these represent strong candidates for being actual causal variants that affect splicing. Another 239 CSs contain exactly two SNPs. The median size of CS was 7 and the median purity was 0.94.
The vast majority of intron units with any CS had only one CS (2,357 of 2,496). Thus, at most introns SuSiE could reliably identify (at most) one sQTL. Of the remainder, 129 introns yielded two CSs, 5 introns yielded three CSs, 3 introns yielded four CSs and 2 introns yielded five CSs. This represents a total of 156 (129+10+9+8) additional ("secondary") signals that would be missed in conventional analyses that report only one signal per intron. Although these data show relatively few secondary signals, this is a relatively small study (n " 87); it is likely that in larger studies the ability of SuSiE (and other fine-mapping methods) to detect secondary signals will be more important.

Functional enrichment of association signals
Although in these real data we do not know the true causal SNPs, we can provide indirect evidence that both the primary and secondary signals identified here are enriched for real signals using functional enrichment analysis. To perform this analysis we labelled one CS at each intron the "primary" CS, and we chose the CS with highest purity at each intron as the primary CS; remaining CSs at each intron (if any) were labelled "secondary" CSs. We then tested both primary and secondary CSs for enrichment of SNPs with various biological annotations, by comparing SNPs inside these CSs (with PIPą 0.2) against random control SNPs outside CSs.
We used the same annotations in our enrichment analysis as Li et al. (2016). These were: LCL-specific histone marks (H3K27ac, H3K27me3, H3K36me3, H3K4me1, H3K4me2, H3K4me3, H3K79me2, H3K9ac, H3K9me3, H4K20me1), DNase I hypersensitive sites, transcriptional repressor CTCF binding sites, RNA polymerase II (PolII) binding sites and extended splice sites (defined as 5bp up/down-stream of intron start site and 15bp up/down-stream of intron end site), and intron and coding annotations. Figure 4 shows the enrichments in both primary and secondary CSs, for annotations that were significant at p-value ă 10´4 in the primary signals (Fisher's exact test, two-sided). The strongest enrichment in both primary and secondary signals was for extended splice sites (odds ratio « 5 in primary signals), which is reassuring given that these results are for splice QTLs. Other significantly enriched annotations in primary signals include PolII binding, several histone marks, and coding regions. The only annotation showing a significant depletion was introns. Results for secondary signals were qualitatively similar to those for primary, though all enrichments are less significant due to the much smaller numbers of secondary CSs.
6. An example beyond fine-mapping: change point detection Although our methods were initially motivated by genetic fine-mapping applications, they are also applicable to other sparse regression applications. Here we briefly illustrate this by applying SuSiE to an example that is very different from fine-mapping: change point detection. This application also provides a simple example where the IBSS algorithm produces a poor fit -due to getting stuck in a local optimum -which is something we seldom observed in fine-mapping simulations. We believe that examples where algorithms fail are just as important as examples where they succeed -perhaps more so -and that this example could help motivate future methods development and improvements.
In brief, we consider a simple change point model: where t indexes location in one-dimensional space (or time), the errors e t are independent and identically distributed N p0, σ 2 q, and the mean vector µ :" pµ 1 , . . . , µ T q is assumed to be piecewise constant. Indices, t, at which µ changes (µ t ‰ µ t`1 ) are then called "change-points". The idea that change points are rare can be captured by formulating this model as a sparse multiple regression (2.1), where X has T´1 columns, the tth column being a step function with step at t (x st " 0 for s ď t and 1 for s ą t). The tth element of b then determines the change in the mean at position t, µ t`1´µt , and so the non-zero regression coefficients in this multiple regression model correspond to change points in µ.
Note that the design matrix X has a very different structure here from in fine-mapping applications. In particular, the correlation matrix of the columns of X has its largest elements near the diagonal, and gradually decays moving away from the diagonal -very different from the "blocky" correlation structure that typically occurs in genetic fine-mapping. (A side note on computation: due to the special structure of this X, SuSiE computations can be made OpT Lq rather than OpT 2 Lq which would be achieved by a naive implementation; for example the vector X T y is simply the cumulative sums of the elements of the reverse of y, which can be computed in linear complexity.) Change point detection has a wide range of potential applications, including, for example, segmentation of genomes into regions with different numbers of copies of the genome. Software packages in R that can be used for detecting change points include changepoint (Killick and Eckley, 2014), DNAcopy (Seshan and Olshen, 2018;Olshen et al., 2004), bcp (Erdman and Emerson, 2007) and genlasso (Tibshirani, 2014;Arnold and Tibshirani, 2016); see Killick and Eckley (2014) for a longer list. Of these, only bcp, which implements a Bayesian method, quantifies uncertainty in estimated change point locations, and bcp provides only (marginal) PIPs, not CSs for change point locations. Therefore the ability of SuSiE to provide such CSs is unusual, and perhaps even unique, among existing methods for this problem.
To illustrate the potential of SuSiE for change point estimation we applied it to a simple simulated example from the DNAcopy package. Figure 5 shows both the SuSiE and DNAcopy results. The two methods provide similar estimates for the change point locations, but SuSiE also provides a 95% CS for each change point. In this case every true change point is contained in a reported CS, and every CS contains a true change point. This is true even though our fit assumed L " 10 change points and the truth is only 7 change points: the additional CSs were filtered out here because they contain very uncorrelated variables. (Actually SuSiE reported 8 CSs after filtering, two of them overlapping and containing the same true change point. As reported in Section 4, in fine-mapping applications we found such overlapping CSs very rarely.) To highlight an example where IBSS can converge to a poor local optima, consider the simple simulated example in Figure 5, which consists of two change points in quick succession that approximately cancel each other out (so the mean before and after the change points are equal). We designed this example specifically to illustrate a limitation of IBSS: here introducing any single change point (to the null model of no change point) makes the fit worse, and one really needs to introduce both change points at the same time to improve the fit, which IBSS is not set up to do. Consequently, when run from a null initialization, IBSS finds no change points (and reports no CSs).
We emphasize that this result represents a limitation of the IBSS algorithm for optimizing the objective function, and not a limitation of either the SuSiE model or the variational approximation. To demonstrate this we re-ran the IBSS algorithm, initialized from a solution that contains the two true change points. This yields a fit with two CSs, containing the two correct change points, and a higher value of the objective function than the original fit (-148.2 vs -181.8). Improved fitting algorithms -or more careful initialization of IBSS -could therefore address this problem.

Discussion
We presented a new model (SuSiE) and algorithm (IBSS) which together provide a simple new approach to variable selection in regression. Compared with existing methods, the main benefits of our approach are its computational efficiency, and its ability to provide CSs summarizing uncertainty in which variables should be selected. Our numerical comparisons demonstrate that for genetic fine-mapping our methods outperform existing methods at a fraction of the computational cost.
Although our methods apply generally to variable selection in linear regression, further work may be required to improve performance in difficult settings. Each CS contains a true change point. Bottom panel shows a simple simulated example with two change points in quick succession, designed to show how the IBSS algorithm used to fit SuSiE can converge to a local optimum. The two lines shows the fit from initializing IBSS from the null model with no change points (black), and the true model with two change points (red). The red line is much closer to the truth and attains a higher value of the objective function (-148.2 vs -181.8) In particular, while the IBSS algorithm worked well in our fine-mapping experiments, for change-point problems we showed that IBSS may converge to poor local optima. We have also seen convergence problems in experiments with many effect variables (e.g. 200 effect variables out of 1,000). Such problems may be alleviated by better initialization, for example using fits from convex objective functions (e.g. Lasso) or from more sophisticated algorithms for nonconvex problems (Bertsimas et al., 2016;Hazimeh and Mazumder, 2018). More ambitiously, one could attempt to develop better algorithms to reliably optimize the SuSiE variational objective function in difficult cases. For example, taking smaller steps each iteration, rather than full coordinate ascent, may help.
At its core, the SuSiE model is based on adding up simple models (SERs) to create more flexible models (sparse multiple regression). This additive structure is the key to our variational approximations, and indeed our methods apply generally to adding up any simple models for which exact Bayesian calculations are tractable, not only SER models (Appendix B; Algorithm 3). These observations suggest connections with both additive models and boosting (e.g. Friedman et al., 2000;Freund et al., 2017). However, our methods differ from most work on boosting in that each "weak learner" (here, SER model) itself yields a model-averaged predictor. Other differences include our use of backfitting, the potential to estimate hyper-parameters by maximizing an objective function rather than cross-validation, and the interpretation of our algorithm as a variational approximation to a Bayesian posterior. Although we did not focus on prediction accuracy here, the generally good predictive performance of methods based on model averaging and boosting suggest that SuSiE should work well for prediction as well as variable selection.
It would be natural to extend our methods to generalized linear models (glms), particularly logistic regression. In genetic studies with small effects Gaussian models are often adequate to model binary outcomes (e.g. Zhou et al., 2013). However, in other settings this extension may be more important. One strategy would be to directly modify the IBSS algorithm, replacing the SER fitting procedure with a logistic or glm equivalent. This strategy is appealing in its simplicity, although it is not obvious what objective function the resulting algorithm is optimizing.
For genetic fine-mapping it would also be useful to modify our methods to deal with settings where only summary data are available (e.g. the p simple regression results). Many recent fine-mapping methods deal with this (e.g Chen et al., 2015;Benner et al., 2016;Newcombe et al., 2016) and ideas used by these methods can also be applied to SuSiE. Indeed our software already includes preliminary implementations for this problem.
Finally, we are particularly interested in extending these methods to select variables simultaneously for multiple outcomes (multivariate regression, and multi-task learning). In settings where multiple outcomes share the same relevant variables, multivariate analysis can greatly enhance power and precision to identify relevant variables. The computational simplicity of our approach makes it particularly appealing for this complex task, and we are currently pursuing this by combining our methods with those from Urbut et al. (2018).

Data and resources
SuSiE is implemented in the R package susieR available at https://github.com/stephenslab/susieR. Source code and a website documenting in detail the analysis steps for numerical comparisons and data applications are available at our manuscript resource repository Wang et al. (2018), also available at https://github.com/stephenslab/susiepaper.
Here y is an n-vector of response data (centered to have mean 0), x is an nvector containing values of a single explanatory variable (similarly centered), e is an n-vector of independent error terms with variance σ 2 , b is the scalar regression coefficient to be estimated, and σ 2 0 is the prior variance of b. Given σ 2 , σ 2 0 the Bayesian computations for this model are very simple. They can be conveniently written in terms of the usual least-squares estimate of b, b :" px T xq´1x T y, its variance, s 2 :" σ 2 {px T xq, and the corresponding z score z :"b{s. The posterior distribution for b is b|y, σ 2 , σ 2 0 " N pµ 1 , σ 2 1 q (A.4) where σ 2 1 px; σ 2 , σ 2 0 q :" ps´2`σ´2 0 q´1, (A.5) µ 1 py, x; σ 2 , σ 2 0 q :" pσ 2 1 {s 2 qb. (A.6) And the Bayes Factor (BF) for comparing this model with the null model (b " 0) is: BFpy, x; σ 2 , σ 2 0 q :" The form of BF matches the "asymptotic BF" of Wakefield (2009), but herebecause we consider linear regression and given σ 2 -it is an exact expression and not only asymptotic.)

A.3. Computing Credible Sets
As noted in the main text, under the SER model it is simple to compute a level ρ CS (Definition 1), CSpα; ρq, as described in Maller et al. (2012). For convenience we give the procedure here explicitly.
Given α, let r " pr 1 , . . . , r p q denote the indices of the variables ranked in order of decreasing α j , so α r1 ą α r2 ą¨¨¨ą α rp , and let S k denote the cumulative sum of the k largest PIPs: Now take CSpα; ρq :" tr 1 , . . . , r k0 u (A.17) where k 0 " mintk : S k ě ρu. This choice of k 0 ensures that the CS is as small as possible while still satisfying the requirement that it has at least level ρ.

A.4. Empirical Bayes approach
As noted in the main text, it is possible to take an Empirical Bayes approach to estimating the hyperparameters σ 2 , σ 2 0 . The likelihood is: where p 0 denotes the distribution of y under the "null" that b " 0 (i.e. N p0, σ 2 I n q).
The likelihood (A.18) can be maximized over one or both parameters numerically.

B.1. Background: Empirical Bayes and Variational Approximation
Here we introduce helpful background and notation before applying the ideas to our specific application.

B.1.1. Empirical Bayes as a single optimization problem
Consider problems of the form: where y represent observed data, b represent unobserved (latent) variables of interest, g P G represents a prior distribution for b (which in the Empirical Bayes paradigm is treated as an unknown to be estimated) and θ P Θ represents parameters to be estimated. This formulation also includes as a special case situations where g is pre-specified rather than estimated, simply by making G contain a single distribution. Fitting this model by Empirical Bayes involves the following steps: 1. Obtain estimatesĝ,θ for g, θ, by maximizing the log-likelihood: 2. Compute the posterior distribution for b given these estimates, z p post pbq :" p post pb; y,ĝ,θq where p post pb; y, g, θq :" ppb|y, g, θq. (B.5) This two step procedure can be conveniently written as a single optimization problem: pz p post ,ĝ,θq " arg max is the Kullback-Leibler divergence from q to p and the optimization over q in (B.6) is over all possible distributions. The function F (B.7) is often called the "evidence lower bound", or ELBO, because it is a lower bound for the evidence (log-likelihood). (This follows from the fact that KL divergence is non-negative.) That this single optimization problem (B.6) is equivalent to the usual twostep EB procedure follows from two simple observations: 1. Since the log-likelihood, l, does not depend on q, we have arg max q F pq, g, θ; yq " arg min q D KL pq||p post p¨; g, θ, yqq " p post p¨; g, θ, yq.

B.1.2. Variational approximation
The optimization problem (B.6) is often intractable. The idea of variational approximation is to adjust the problem to make it tractable, simply by restricting the optimization over q to q P Q where Q denotes a suitably chosen class of distributions: pz p post ,ĝ,θq " arg max qPQ,gPG,θPΘ F pq, g, θ; yq. (B.11) From the definition of F it follows that optimizing F over q P Q (for given g, θ) corresponds to minimizing the KL divergence from q to the posterior distribution, and so can be interpreted as finding the "best" approximation to the posterior distribution for b among distributions in the class Q. And the optimization of F over g, θ can be thought of as replacing the optimization of the log-likelihood with optimization of the ELBO, a lower bound to the loglikelihood.
We refer to solutions of the general problem (B.6), in which q is unrestricted, as "EB solutions". We refer to solutions of the restricted problem (B.11) as "Variational EB (VEB) solutions".

B.1.3. Algebraic form for F
It is helpful to note that, by simple algebraic manipulations, the ELBO F in (B.7) can be written as: log ppy, b|g, θq qpbq

B.2. The additive effects model
We now focus on fitting an additive model, M, that includes the SuSiE model as a special case: y " L ÿ l"1 µ l`e (B.14) µ l " g l independently for l " 1, . . . , L (B.15) e " N p0, σ 2 I n q, (B.16) where y P R n , µ l P R n , e P R n , and I n denotes the nˆn identity matrix. We let M l denote the simpler model that is derived from M by setting µ l 1 " 0 for l 1 ‰ l (i.e. M l is the model that contains only the lth additive term), and use L l to denote the marginal likelihood for this simpler model: L l py; g l , σ 2 q :" ppy|M l , g l , σ 2 q. (B.17) The SuSiE model corresponds to the special case of M where µ l :" Xb l and g l is the "single effect prior" in (2.6)-(2.8). Further, in this special case each M l is a "single effect regression" (SER) model.
The key idea is that we can fit M by VEB provided we can fit each simpler model M l by EB. To expand on this: consider fitting the model M by VEB, where the restricted family Q is the class of distributions on pµ l q L l"1 that factorize over l. That is, for any q P Q, qpµ 1 , . . . , µ L q " L ź l"1 q l pµ l q, (B.18) and we can write q " pq 1 , . . . , q L q. For q P Q, using expression (B.13), we obtain the following expression for the ELBO F : (B.19) where y 2 :" y T y and g denotes the priors pg 1 , . . . , g L q. Note that the second term here is the expected residual sum of squares (ERSS) under q, and depends on q only through its first and second moments. Indeed, if we define andμ :" pμ l q L l"1 , Ď µ 2 :" p Ď µ 2 l q L l"1 , then (This expression follows from the definition, and independence across l " 1, . . . , L, by simple algebraic manipulation; see Section B.6). Fitting M by VEB involves optimizing F in (B.19) over q, g, σ 2 . Our strategy is to use "coordinate ascent", using steps that optimize over pq l , g l q (l " 1, . . . , L) while keeping other elements of q, g fixed, and with a separate step to optimize over σ 2 given q, g. This strategy is summarized in Algorithm 2.
Algorithm 2 Coordinate Ascent for F (outline) 1: repeat 2: for l in 1, . . . , L do 3: pq l , g l q Ð arg maxq l ,g l F pq, g, σ 2 ; yq Ź update q l , g l 4: σ 2 Ð arg max σ 2 F pq, g, σ 2 ; yq Ź update σ 2 5: until converged The update for σ 2 in Algorithm 2 is easily obtained by taking partial derivative of (B.19), setting to zero, and solving for σ 2 , givinĝ σ 2 :" ERSSpy,μ, Ď µ 2 q n . (B.23) The update for pq l , g l q turns out to correspond to finding the EB solution to the simpler model M l , but with the data y replaced with the expected residuals, r l :" y´ř l 1 ‰lμ l 1 . The proof of this is given in the next section (Proposition 2).
Substituting these ideas into Algorithm 2 gives Algorithm 3, which is a generalization of the IBSS algorithm (Algorithm 1) in the main text.

B.3. Special case of SuSiE model
The SuSiE model corresponds to the special case µ l " Xb l , in which case M l is a single effects regression model. The first and second moments of µ l ,μ l and where the last line comes from the fact that only one element of b l is nonzero, so the cross terms b lj b lj 1 " 0 for j ‰ j 1 . Because of this we can write ERSSpy,μ, Ď µ 2 q as a function of the first and second moments of the b l -say ERSSpy,b, s b 2 q -and Algorithm 3 can be implemented by working with the posterior distributions of b instead of µ.
For completeness we give this algorithm, which is the one we implemented in the susieR software, explicitly as Algorithm 4. This algorithm is the same as the IBSS algorithm in the main text, but extended to estimate the hyperparameters σ 2 , σ 2 0 .

Algorithm 4 Iterative Bayesian stepwise selection (Extended)
Require: data y and variable matrix X.
We implemented the option Step 5, which is a one-dimensional optimization, using uniroot in R to find the point where the derivative of L SER is 0.
B.4. Update for q l , g l is EB solution of M l In this subsection we establish that Step 3 in Algorithm 2 is accomplished by EB solution of M l (Steps 4-5 in Algorithm 3). This result is formalized in the following Proposition, which generalizes Proposition 1 in the main text: Proposition 2. Optimizing F in (B.19) over q l , g l is achieved by arg max q l ,g l F pq, g, σ 2 ; yq " arg max q l ,g l F l pq l , g l , σ 2 ; y´ÿ l 1 ‰lμ l 1 q. (B.27) where F l denotes the ELBO corresponding to model M l andμ l is as in (B.20).
Note that the optimization of F l over q l , g l on the right hand side of (B.27) does not involve restrictions on q l , and so corresponds precisely to finding the EB solution to M l (see Section B.1.1).
Proof. Let F l denote the ELBO for model M l . Then, from (B.13) we have: F l pq l , g l , σ 2 ; yq "´n 2 logp2πσ 2 q´1 2σ 2 E q l y´µ l 2`E q l " log g l pµ l q q l pµ l q  . (B.28) Further, let µ´l denote the components of pµ 1 , . . . , µ L q omitting µ l , and q´l denote the distribution on µ´l induced by marginalizing q over b l . Finally, let r l denote the residuals obtained by removing all the effects other than l from y, andr l denote its expectation under q´l: r l :" y´ÿ Now, separating F in (B.19) into the parts that depend on q l , g l , and those that do not (here denoted "const"), we have: F pq, g, σ 2 ; yq "´1 2σ 2 E q «ˆr l´µl˙Tˆrl´µl˙ff`Eq l " log g l pµ l q q l pµ l q `c onst (B.31) "´1 2σ 2 E q l "ˆ´2r T l µ l`µ T l µ l˙`Eq l " log g l pµ l q q l pµ l q `c onst (B.32) " F l pq l , g l , σ 2 ;r l q`const. (B.33)

B.5. Computing the evidence lower bound
Although not strictly required to implement Algorithm 3, it can also be helpful to compute the objective function F (e.g. , for monitoring convergence and for comparing solutions obtained from different initial points). Here we outline a convenient approach to computing F in practice. The ELBO F is given by (B.19). Computing the first term is easy, and the second term is the ERSS (B.22). The third term can be computed from the marginal likelihoods L l in (B.17), computation of which is straightforward for M l the SER model, involving a simple sum over the p possible single effects.
Specifically we have the following lemma: Lemma 1. Letq l " arg max q F l pq l , g l , σ 2 ;r l q. Then Eq l " log g l pµ l q q l pµ l q  " log L l pr l ; g l , σ 2 q`n 2 logp2πσ 2 q`1 2σ 2 Eq l r l´µl 2 . (B.34) Proof. Rearranging (B.28) with y replaced byr l , we have E q l " log g l pµ l q q l pµ l q  " F l pq l , g l , σ 2 ;r l q`n 2 logp2πσ 2 q`1 2σ 2 E q l r l´µl 2 . (B.35) The result then follows from noting that F l is equal to log-L l at the optimum q l "q l . That is, F l pq l , g l , σ 2 ;r l q " L l pr l ; g l , σ 2 q.

B.6. Expression for ERSS
The expression (B.22) is derived as follows: ERSS " E q  Coverage were set to nominal levels 75%-99% (X-axis), and the corresponding empirical coverage were computed (Y-axis). Consistent with observation in Figure 3, coverage became lower as more weaker signals were analyzed.  Figure S5: Comparison of 95% credible sets (CS) Same plot as Figure 3, but prior variance σ 2 0 were estimated for SuSiE rather than fixing to σ 2 0l " 0.1.