Assessing the robustness of sisVIVE in a Mendelian randomization study to estimate the causal effect of body mass index on income using multiple SNPs from understanding society

The “some invalid, some valid instrumental variable estimator” (sisVIVE) is a lasso‐based method for instrumental variables (IVs) regression of outcome on an exposure. In principle, sisVIVE is robust to some of the IVs in the analysis being invalid, in the sense of being related to the outcome variable through pathways not mediated by the exposure. In this paper, we consider the application of sisVIVE to a Mendelian randomization study in which multiple genetic variants are used as IVs to estimate the causal effect of body mass index on personal income in the presence of unobserved confounding. In addition to analyzing data from the large‐scale longitudinal household survey Understanding Society, we conduct a simulation study to (a) assess the performance of sisVIVE in relation to that of competing robust methods like “MR‐Egger” and “MR‐Median” and (b) identify scenarios under which its absolute performance is poor. We find that sisVIVE outperforms alternative robust methods, in terms of mean‐square error, across a wide range of scenarios, but that its performance is poor in absolute terms when the presence of indirect pleiotropy leads to failure of the “InSIDE” condition, which is not explicitly required for identification. We argue that this is because the consistency criterion for sisVIVE does not identify the true causal effect when InSIDE fails.

effects using observational data. A genetic variant is a valid IV if it is (a) associated with the exposure, (b) has no direct effect on the outcome, and (c) has no indirect effect through the unobserved confounding variables on the outcome.
The genetic variants used in MR studies are called single-nucleotide polymorphisms (SNPs). SNPs are locations in the DNA sequence that typically comprise a base pair of variant forms called alleles. The alleles at these locations vary between individuals in that individuals can have zero, one or two copies of a specific allele. If the exposure being studied is the expression (or phenotype) of a SNP, then this SNP is, potentially, a valid IV. If only one SNP is used as an IV, the causal effect of the exposure on the outcome is estimated using the ratio of the estimated coefficient of the SNP-obtained from the regression of the outcome on the SNP-to the corresponding coefficient from the regression of the exposure on the SNP. 6 If the SNP is a valid IV, and the causal relationship between exposure and outcome is linear, the ratio estimator will be consistent, but not unbiased, for the true causal effect.
A well-known problem with ratio estimators is "weak instrument bias," that is, the bias that arises when an IV is insufficiently predictive of the exposure. 7,8 This is often true for MR studies where the correlations between SNPs and exposures tend to be small. One strategy for avoiding weak instrument bias is, if available, to use more than one SNP. Multiple SNPs can be combined into what are called allele, genetic or polygenic risk scores 9 to be used as IVs for ratio estimation. Alternatively, one can use two-stage least squares (2SLS), rather than ratio estimation, with the SNPs as multiple IVs. 10 In both cases, the rationale for using multiple SNPs is that the additional information contained by these SNPs will predict the exposure more accurately and thus alleviate weak instrument bias.
In this paper, we use MR to estimate the causal effect of body mass index (BMI) on personal income using data from Understanding Society: the UK Longitudinal Household Study (UKHLS). UKHLS contains rich genetic data from which we obtain 71 out of the 97 common genetic variants found to be associated with BMI in genome-wide association studies (GWASs) at the genome-wide level of significance. 11 We adopt the multiple SNPs strategy here because the single genetic marker most strongly associated with BMI-the FTO gene (rs1558902)-explains only 0.27% of the variation.
The use of multiple SNPs presents two further challenges. The first is that, even if all the SNPs were valid IVs, the combined explanatory power of the additional SNPs could still be small enough to lead to many weak instruments bias. 12 This is certainly possible here because combining all 97 BMI-associated SNPs from the GIANT study explains only 2.7% of the total variation in BMI. 11 A "two-sample" strategy has been proposed to alleviate this problem, that is, rather than estimating the SNP-exposure associations using the analysis sample, estimates of these associations are taken from another, ideally much larger, data set. 13 Provided that the second data set was drawn from the same (or at least a comparable) population as the original, and the estimates were precise, this strategy would eliminate weak instrument bias. Hence, we also consider a two-sample strategy for our analysis using estimates of the associations between the SNPs and BMI from the GIANT consortium. 11 The second, and most important, challenge facing any MR study is that some of the SNPs are not valid IVs satisfying conditions (a) to (c) above. This is usually caused by "pleiotropic" SNPs, that is, SNPs which are related to individual traits other than the exposure. However, there has been recent work on developing "robust" IV methods that adjust for pleiotropic bias. Our main methodological focus is thus on comparing the performance of the recently developed sisVIVE with two established approaches, "MR-Egger" regression 14 and "MR-Median" regression. 15 Our objectives are twofold: to assess the relative performance of sisVIVE in comparison with its competitors and to investigate scenarios under which sisVIVE performs poorly in absolute terms.

Choosing SNPs
A single-SNP MR study will typically involve a SNP for which there is robust evidence that it is a genotype for the exposure. This evidence is usually obtained from a dedicated GWAS. A GWAS involves estimating the associations between a genome-wide set of genetic variants (typically SNPs) and biological traits to identify which of these variants are associated with each trait. GWAS are adjusted for errors due to multiple testing, potential confounders of these associations, and for population groups with different genotype distributions (the so-called population stratification). The genome-wide level of significance is determined statistically; a p-value significance threshold of 5 × 10 −8 has become widely accepted.
The accuracy of a GWAS in determining which SNPs are associated with a given trait depends on the sample size and, particularly, on the adequacy of the confounding and population-stratification adjustments. However, as we have already discussed, there is also a risk that GWAS-identified SNPs will be pleiotropic so that either condition (b) or (c) does not hold; this would lead to biased estimates even if the association between the SNP and exposure were not weak. 16 In the subsequent development, we distinguish between the "direct" pleiotropy that results from failure of condition (b) and the "indirect" pleiotropy that results from failure of condition (c).
As explained above, the UKHLS contains 71 of the 97 SNPs identified by the GIANT consortium as being associated with BMI. 11 The full list of these SNPs is given in Table S1 in the Supplementary Information; the distributions of each  SNP in UKHLS and GIANT consortium are given in the Table.

Modeling assumptions
We denote the outcome variable by Y and the exposure by X, and consider scenarios in which Y can be treated as a continuous variable that is causally related to X by the linear model where X is the causal exposure effect, and Y is the model error comprising the combined effect of every influence (apart from X) on outcome Y. Any adjustments for observed confounding variables associated with exposure and outcome may be included but, for notational simplicity, we have omitted these from (1). It is supposed either way that there remains substantial unobserved confounding because important confounding variables have been omitted. These variables are hence absorbed into Y and lead to an association between Y and X. In such cases, standard regression estimation of (1) using ordinary least squares (OLS) or generalized least squares would be inconsistent and biased for causal parameters like X . The precise interpretation of X depends on the assumptions we make about exposure-effect heterogeneity, that is, between-person variation in the causal effect of BMI on personal income. The estimate of X can be interpreted as the average causal effect of BMI on personal income if (i) the effect of BMI on personal income is the exactly the same for everyone or, more realistically, (ii) the between-person variation in the effect of BMI among people with the same observed BMI is the same for all levels of BMI (after adjusting for the multiple SNPs); see chapter 5.2 in the work of Wooldridge. 17 MR studies involve choosing one or more SNPs to use as IVs. Suppose that we use GWAS studies to identify J SNPs, which we denote by G = (G 1 , … , G J ). Each SNP takes values from {0, 1, 2} depending on the number of times that the allele associated with increased exposure was found at this gene location. This allele is referred to as the "risk" or "effect" allele, and the other as the "base" or "noneffect" allele. We choose the effect allele for each of the 71 SNPs to be the one used by the GIANT consortium; see Table S1 in the Supplementary Information.
Using this notation, we can respectively rewrite conditions (a) to (c) for a single SNP, G j , in a slightly more formal way as follows. (i) The exposure is associated with the SNP such that E(Y |G = g) depends nontrivially on g. (ii) No direct pleiotropy corresponds to no direct effect of the SNP on the outcome such that E(Y |X = x, G = g, Y ) does not depend on g. (iii) No indirect pleiotropy corresponds to no association between the SNP and the unobserved confounding variables such that E( Y |G = g) = 0. Note that conditioning on the other SNPs, and on observed confounding variables, is suppressed. Conditions (i) to (iii) are sometimes referred to as the core conditions 10 ; the SNP G j is a valid IV only if it satisfies all three core conditions. Finally, we take the independence assumption to hold throughout the following development, that is, the multiple SNPs are drawn from distinct gene regions such that the G j are mutually independent at the population level.

Estimation with multiple SNPs that are all valid IVs
The ratio estimator and two-stage least squares: If only one valid SNP, G j , was available, then the ratio estimator would be consistent and efficient for the causal effect X . The numeratorΓ is the OLS estimator of the coefficient of G j in the regression of Y on G j , and the denominatorb is the OLS estimator of the corresponding coefficient in the regression of X on G j . (Note that, if covariates were also included in this analysis, Y and X in the preceding description would be respectively replaced by the residuals obtained from regressing Y and then X on these covariates.) The ratio estimator is a special case of 2SLS. 2SLS is consistent and normally distributed in large samples if all of the SNPs are valid IVs satisfying the core conditions. Stage one of 2SLS for a single exposure X involves fitting a regression model of the form X = zb + X using OLS to obtainb, where z is a row vector of predictors comprising a constant term and the instrumental variable(s), b is a column vector of regression coefficients, and X is the model residual satisfying E( X |G) = 0. Stage two involves usinĝ b to obtainX = zb and then finally regressing Y onX using OLS; the 2SLS estimator̂2 SLS X is simply the OLS estimator of the coefficient ofX. Note that̂2 SLS X is equal tôX ; in (2) if z = (1, G j ). To facilitate our comparison with sisVIVE further on, we write the 2SLS estimator aŝ (1), and z i , X i , and Y i are respectively the observed values of the IVs, exposure, and outcome for individual There are thus two ways to incorporate multiple SNPs into an analysis. Polygenic risk scores: The most straightforward strategy for IV analysis with multiple SNPs is to combine the SNPs to form a polygenic risk score (also known as an allele score or genetic risk score). The ratio estimator using the polygenic risk score as an IV can then be used to estimate X . 9 A polygenic risk score has the general form where w j is a user-specified weight for SNP G j . The simple polygenic risk score (SPRS) is obtained by setting w j = 1 for every SNP. Alternatively, the internally weighted polygenic risk score (IPRS) can be obtained by setting w =b . In either case, the ratio estimator with polygenic score G as the IV can be used to estimate the causal effect, or 2SLS with z = (1, G) in (2) can equivalently be used.
Multiple IVs: An alternative to polygenic risk scores is to treat the J SNPs as multiple IVs for exposure X by setting z = (1, G) in (2).
Bowden et al 14 show that the ratio estimate based on a polygenic risk score G is a weighted sum of thêX ; , that is, the ratio estimate obtained using SNP j as the sole IV for X; the SPRS and IPRS estimates differ in how the SNP-specific estimates are combined. Multiple-IV 2SLS is exactly equivalent to IPRS in large samples if the independence assumption holds, that is, the population correlations between the SNPs and exposure are all equal to zero. IPRS is the basis of the inverse weighted (IVW) estimator for MR studies using summarized data. 14,18

Estimation with multiple SNPs where some are invalid IVs using sisVIVE
If any of the chosen SNPs were invalid IVs, then every estimator discussed in Section 3.1 would be inconsistent and subject to further bias regardless of whether or not the IVs were weak. This is because pleiotropy would lead to model (1) being incorrectly specified. In the presence of pleiotropy, it is assumed that model (1) becomes where model error Y satisfies E( Y |G) = 0, j ≠ 0 if pleiotropic G j fails core condition (ii), and j ≠ 0 if it fails core condition (iii). 14 Without any loss of generality, we shall henceforth assume that Y, X, and G are all mean centered so that 0 = 0. Under model (6), the relationship between the true numerators and denominators of the SNP-specific ratio estimators can be written as where j = j + j is the sum of the pleiotropic errors related to SNP j = 1, … , J. This relationship drives the "consistency criterion" that identifies the causal exposure effect when the identity of the invalid-IV SNPs is unknown. 1 Kang et al 1 developed sisVIVE (the "some invalid some valid instrumental variable estimator") to estimate γ X under model (4) when the set of valid-IV SNPs is unknown. They first show that, while model (6) is not generally identified if the set of valid-IV SNPs is unknown, it can be identified if over half of the SNPs are valid IVs, that is, the "majority rule." Constrained estimation of model (6) subject to the majority rule holding is infeasible, but the closely related problem (̂X is feasible. The terms of (8) are defined as follows: is a scalar "tuning" parameter; || || 1 = ∑ J =1 | | is the 1 -norm; W N is the same weight matrix as used for 2SLS (2) is the value of the residuals in model (4) for individual i; and = ( 1 , … , J ) ′ . If both core conditions (ii) and (iii) held, then Equation (8) would reduce to 2SLS (4) because would equal zero.
The resulting procedure is closely related to the "lasso" a technique used to identify subsets of predictor variables with nonzero regression coefficients. 19 The idea is to constrain the sum of the absolute values of { j } so that the optimizer is forced to shrink toward zero those j that have least impact on the first component (8), that is, the valid-IV SNPs. In doing this, sisVIVE identifies the setV = { ∶̂; = 0} as an estimate of the true set of valid-IV The choice of tuning parameter is crucial: a large value of would force to be heavily penalized unless it were very close to zero, which would result in every SNP being classed as a valid IV; conversely, a small value of would lead to every SNP being treated as an invalid IV. The choice of is made using cross-validation to estimate the prediction error across a range of fixed tuning-parameter values. Rather than choosing the value = min that minimizes the cross-validation error, min is incremented by a value determined by the "one standard error" rule (the standard error being that of the average prediction errors across the different cross-validation samples) in order to reduce the chance of incorrectly identifying invalid-IV SNPs as valid IVs. 1,19 This last point is important because incorrectly treating invalid-IV SNPs as valid IVs has been shown to lead to greater bias than incorrectly identifying valid-IV SNPs as invalid-IV ones. 1 Belloni and Chernozhukov 20 have shown that the shrinkage induced by standard lasso estimators exacerbates the bias of the model-parameter estimate and that sisVIVE is more effective at identifying the set of valid-IV SNPs, V, than estimating and X . They hence proposed the use of "post-lasso" estimators to reduce this bias. In the same spirit, Windmeijer et al 21 considered post-lasso estimators for sisVIVE in which only the SNPs identified as valid IVs inV are used as multiple IVs in 2SLS; they found the resulting estimator to be subject to considerably lower bias than sisVIVE. However, they also derive theoretical results that show that sisVIVE is not always consistent, in the sense thatV ↛ V as n → ∞: in short, if the relative strengths of the valid-IV and invalid-IV SNPs do not satisfy the "irrepresentable condition," 22 then sisVIVE will not be consistent and can perform poorly.
Finally, we make the following note about sisVIVE in the presence of heterogeneous causal effects. In Section 2.2, it was noted for 2SLS that X can be interpreted as the average causal effect if the effect of BMI is homogenous, or there is heterogeneity but the between-individual variation in BMI effects is independent of the exposure given the genetic IVs. Kang et al 1 stated that the same assumptions are required if we wish to interpret X as the average causal effect when using sisVIVE. However, we show that a further constraint is required if model (6) holds because the potential dependence of the mean treatment effect on the SNPs can result in j ≠ 0 even if SNP j is a valid IV, that is, the between-individual variation in BMI causal effects is mean independent of both exposure and the SNPs (see Section S2.1 in the Supplementary  Information).
We now briefly outline the two established robust methods to be compared with sisVIVE.

MR-Egger regression
where Egg is a residual assumed to satisfy E( Egg | | b ) = 0 (noting that expectation here is with respect to the set of chosen SNPs). Model (9) can be fitted using the inverse of the estimated residual variance from the regression of Y on SNP j as weights to account for differential minor allele frequencies (MAFs) such that SNPs with low MAFs contribute little to the estimation of (9).
The requirement that E( Egg | | b ) = 0 is called the "InSIDE" condition. 14  = 0 if the SNPs are all valid IVs. InSIDE is thought to be plausible if core condition (ii) fails but is generally unrealistic if condition (iii) fails because failure leads to both b j and Egg depending on j . 14 MR-Egger also requires thatb is precisely estimated because otherwise γ Egg X will be biased toward zero (see also the discussion of the "no measurement error assumption" in Section 3.3).
MR-Median: Bowden et al 15 proposed MR-Median as an alternative to MR-Egger that, in theory, does not require the InSIDE condition to hold. The authors view MR-Median as a practicable alternative to the sisVIVE estimator to be introduced next. The main strengths of MR-Median are its simplicity and that it can be used if only summary data are available. It is simply the median of the SNP-specific ratio estimates; in other words, letting {̂X ( ) ∶ = 1, … , J} be the ordered set of SNP-specific ratio estimates (ie, such that̂X ( +1) ≥̂X ( ) for all j = 1, … , J − 1), thenγ Med if J is odd. MR-Median is consistent for X if the majority rule holds, that is, less than J/2, or 50%, of the SNPs are invalid IVs but can be biased and does not converge in distribution to a normal random variable.

Two-sample strategies
In the discussion so far, it has been assumed that the analysis will be based on one individual-level data set. It is, however, possible to adopt a two-sample strategy if estimates of the SNP-exposure associations are available from another, preferably much larger, study. [23][24][25] We denote the estimates obtained from this second sample by {b ∶ = 1, … , J}.
Estimates from a second sample can be incorporated into the approaches described above as follows.
In terms of estimator performance, the impact of the usual weak instrument bias is ameliorated usinĝX ; =Γ ∕b rather than̂X ; because, in contrast tob andΓ , there is no association betweenb andΓ (or this association is very weak if the two samples have some individuals in common) 13 ; in fact, ifb = b , then̂would be unbiased because it would reduce to an OLS estimate from the regression of Y on the known and unconfounded variable G j b j . More realistically, we require the standard error ofb to be small, and both samples to be drawn from the same population (or at least, from two different but homogeneous populations), forb ≃ b to hold. 26,27 Such scenarios satisfy what is called the no measurement error (NOME) condition. 13 However, ifb is not precisely estimated, then NOME will not hold. NOME is so-called because the imprecision means thatb = b + , where j behaves like mean-zero measurement error; the estimate is hence subject to an attenuation bias toward zero if NOME fails. This second form of weak-instrument bias can be viewed as a kind of conservative shrinkage, which is less harmful than the first because it makes it more difficult to reject the null hypothesis of no causal effect.

Nonlinear exposure effects
The development above focuses on linear models with linear exposure effects. If the true exposure effect is nonlinear, with a curvilinear or piecewise-linear form, then the ratio estimator is no longer consistent for the average causal effect. In the valid-IV case, 2SLS needs only a small modification to remain consistent: the first stage is unchanged with only the second stage requiring modification to match the nonlinear form of the exposure effect. If the effect of X is piecewise nonlinear, such that the exposure is replaced by dummy variables representing the interval in which the observed exposure lies, identification typically requires at least as many IVs as there are piecewise intervals. The robust methods considered here are all built around model (1) with a single-parameter linear exposure effect. Neither MR-Egger nor MR-Median can be straightforwardly used because both hinge on the ratio estimator, which derives from the linear structural model (1). sisVIVE is based around constrained 2SLS estimation and so, in principle, can be adapted to nonlinear exposure effects. However, care must be taken to establish identification because the results underpinning sisVIVE are all based on the single-SNP ratio estimates. 1

ANALYSIS USING UKHLS DATA
MR has been used recently to understand the mechanism of relationship of high BMI and disadvantage of social-economic status, such as low income: Tyrrell et al 28 used highly self-selected UK Biobank data for their analysis, where income is reported at the household level, and Böckerman et al 29 used a small sample for their analysis, which severely limits the power of that study; we refer the reader to these articles for a justification for the existence of a linear effect of BMI on income. We apply the methods introduced above to the larger and more representative UKHLS sample data to estimate the causal effect of BMI on personal income. For our multiple SNPs, we use 71 out of the 97 common genetic variants identified as being associated with BMI at a genome-wide level of significance. 11 The 16 excluded SNPs were either obtained from analyses involving only men, only women, or including non-Europeans (6 SNPs), or based on imputed data found to be below the 0.9 threshold (10 SNPs). 11 The remaining 71 variants explain 1.6% of the variation in BMI among UKHLS participants; the effect-allele frequencies from both UKHLS and the GIANT consortium 11 are listed in Supplementary Information Table S1.
UKHLS is an annual household-based panel study that started collecting information about the social, economic, and health status of its participants in 2009. The data set used for our analysis is drawn from the General Population Sample (GPS) and the British Household Panel Survey (BHPS) arms of UKHLS (BHPS merged with wave two of UKHLS in 2010). UKHLS collected additional health information, including BMI and blood samples, at wave two (for GPS members) and wave three (for BHPS members). A total of 10 480 individuals were genotyped using the Infinium Human Core Exome Beadchip. After quality-control steps, a sample of 9944 individuals was obtained from which a further 1104 individuals were excluded using the following criteria: genetic relatedness larger than 0.05% (N = 707); BMI greater than 60 kg/m 2 (N = 8); and aged under 25 (N = 389). The number of cases with at least one wave of personal income and no missing values of BMI, SNPs and covariates is N = 8047.
The outcome is average annual personal income (API) for each individual taken over three consecutive waves starting from the wave at which the individual's health information was recorded. For individuals with one or two of these observations missing, we took the mean API over the available waves. We standardized both BMI and API, where API was standardized separately for the GPS and BHPS samples because mean API is higher in the BHPS than in GPS. To control for population stratification, we restricted our analysis to the white population and included the following baseline covariates C: age (at which BMI was measured), gender, and the first 20 genetic principal components to control for ancestry. 27 We do not consider nonlinear effects of BMI in this analysis. Our preliminary analyses did not detect any nonlinear effects of BMI on income, using the semiparametric method proposed by Stanley and Burgess, 30 so we are able to focus on the linear-exposure case and compare the performance of the three robust methods discussed above.
To obtain covariate-adjusted estimates of the causal effect, we do not work with raw API and raw BMI but adjusted API and adjusted BMI. Each adjusted variable is equal to the fitted residual obtained from regressing it on C. For the two-sample strategy, the exposure-SNP estimates {b } come from the GIANT study 11 and so are already adjusted for C. The scatter plots of association of SNP-BMI against associations of SNP-Income for 71 SNPs based on one-sample and two-sample strategies are given in Figure 1.  It can be seen from Table 1 that the potentially confounded association between personal income and BMI has a significant negative value. Interpreting this estimate causally, we estimate that a one standard-deviation increase in BMI (that is, of 5.26 BMI points) is estimated to lead on average to a decline in personal income of −0.032 of a standard deviation (£431.20).
Perhaps surprisingly, sisVIVE did not identify any invalid IVs. This is the case using both one-sample and two-sample strategies, and despite sisVIVE using the one-standard-error rule to choose the tuning parameter, which typically incorrectly identifies valid-IV SNPs as invalid IVs.
We thus conclude that the valid-IV methods discussed in Section 3.1 can be used in conjunction with all 71 SNPs to estimate the causal effect of BMI on personal income. The first pair of one-sample results are ratio estimates obtained using for the SPRS and IPRS as IVs. The next two results use 2SLS and limited information maximum likelihood (LIML) treating the 71 SNPs as multiple distinct IVs. As expected, IRLS and multiple-IV 2SLS give very similar results because the 71 SNPs are drawn from different gene regions and hence mutually independent. All four estimates are larger (negative) than the observational association but are also nonsignificant.
Because sisVIVE detects no invalid-IV SNPs, the valid-IV results above can also be classed as post-sisVIVE estimates. We would reach the same conclusion using the other robust methods: the MR-Egger test (for a nonzero intercept term) was not significant, and neither the MR-Egger nor MR-Median estimates were statistically significant (noting that the large standard errors indicate that both are less powerful than the valid-IV methods when there are no invalid IVs).
Using a two-sample strategy, we find no evidence for any invalid-IVs SNPs but, in contrast to the one-sample results, find a statistically significant large negative causal effect of BMI on personal income. The significance is, to a large extent, a consequence of estimated effect sizes that are larger than for any of the one-sample results. Note also the following points: as expected, the EPRS and two-sample version of multiple-IV 2SLS produce very similar estimates because the 71 SNPs satisfy the independence assumption; and we do not present two-sample LIML results because this estimator is used solely to reduce the impact of many weak instruments bias, which is not an issue for two-sample strategies. The robust estimates obtained using MR-Egger and MR-Median are again far larger (negative) than the post-sisVIVE ones but are statistically significant despite estimated less precisely.
The interpretation of these findings hinges crucially on the following untestable assumptions: the two-sample approach requires that the data in the GIANT study and UKHLS are drawn from homogeneous populations; MR-Median and sisVIVE both require that the total number of invalid-IV SNPs does not exceed 35; and MR-Egger requires that the InSIDE condition holds, that is, there is no relationship between the strength of SNP-BMI association and the size of its pleiotropic effect on personal income in model (6). It is not possible to definitively resolve which of these assumptions holds, but in the next section, we conduct a simulation study to investigate more generally the relative performance of sisVIVE compared with other methods, for different numbers of invalid-IV SNPs increases and different types of pleiotropy.

SIMULATION STUDY
We now carry out a simulation study to compare the effectiveness of sisVIVE with that of the other robust methods introduced above. The basic design of the study is based on those reported by Kang et al 1 and Bowden et al, 14 but here the data are simulated to mimic key characteristics of the UKHLS. Furthermore, the comparison is not just with 2SLS but with other robust methods.
We now set up a simulation study to investigate the following questions raised by the preceding development and data analysis: A. How accurately does sisVIVE identify valid-IV SNPs and how is this accuracy affected as the number of valid-IV SNPs increases? B. Post-sisVIVE estimators involve using sisVIVE to identify the valid-IV SNPs and then applying a valid-IV estimator: in such cases, is 2SLS the best post-sisVIVE estimator? C. Does the best post-sisVIVE estimator perform better than MR-Egger and MR-Median in terms of estimator accuracy and confidence interval coverage? D. How well do post-sisVIVE estimators perform in an absolute sense? Under which scenarios is its performance particularly good or particularly bad?
A final question, cutting across A to D, is the extent to which these answers depend on whether a one-sample or two-sample strategy is used.
A full description of the study design is given the Supplementary Information (Section S1) but we highlight some of its features here. The 71 SNPs G 1 , … , G 71 are generated independently from a trinomial distribution in which the probabilities of G j being 0, 1, and 2 are respectively equal to the proportions of G j being 0, 1, and 2 in the UKHLS data (see Table S1 in the Supplementary Information). The causal association between BMI and SNP j is denoted by j and set to the value taken from the GIANT consortium study. 11 In the one-sample context, these are weak instruments because the average F-statistic of the SNPs with respect to exposure BMI is 2.5 < 10. The true value of the causal effect of BMI on income is X = −0.2. The results presented below are all based on 1000 generated samples of size N = 10 000.
To simulate data subject to unobserved confounding, the error terms in personal-income model (1) and BMI model (3) are respectively decomposed as Y = U U + . Y and X = U + . X , where U is a zero-mean variable representing the unobserved confounding, and . Y and . X are not only mutually independent but jointly independent of (Y, X, G, U). Parameter U = 1 controls the impact of unobserved confounding, that is, the strength and sign of the correlation between Y and X , such that there is no unobserved confounding if U = 0. Values of the outcome and exposure are respectively generated under model (1)  ) > 0 for the pleiotropic/invalid-IV SNPs.

Scenario 3:
As Scenario 2, except that the pleiotropic SNPs are also indirectly pleiotropic with positive s ∼U(0,0.4). The InSIDE assumption necessarily fails in this scenario because the true association between G s and exposure is b s = s + s rather than s (in contrast, for the valid-IV SNPs, b j = j ) and so related to the pleiotropic effect. For each of these scenarios, the performance of the estimators is assessed for S = 10, 20, and 30 invalid IVs, which respectively correspond to 14%, 28%, and 42% of the SNPs; these scenarios all satisfy the requirement that less than 50% of the SNPs are invalid.
For the two-sample strategy, we consider three levels of accuracy for the estimates from the consortium study. The first accuracy level ("True") is the gold standard in whichb is taken to equal the true association b j . The second level ("Precise") has the association estimated with high accuracyb ∼ N ( b , 0.01 2 ) . The third level ("Imprecise") has to represent situations where the consortium study yields noisy estimates. The precise and imprecise accuracy levels reflect degrees of failure of the NOME condition. 13 Finally, the following Stata functions are used to implement the methods described above: ivreg2 to implement SPRS, EPRS, multiple-IV 2SLS, and LIML; mregger from the mrrobust package to implement MR-Egger regression; mrmedian from the mrrobust package for MR-Median. The R package sisVIVE (downloaded from CRAN at https:// cran.r-project.org/web/packages/sisVIVE.html) is used for sisVIVE.
Question A: Selecting valid-IV SNPs: We first assess the performance of sisVIVE in terms of False Select In (FSI) and False Select Out (FSO). For each simulated data set, FSI is the ratio of the number of pleiotropic SNPs incorrectly identified as valid IVs to the total number of pleiotropic SNPs; and FSO is the ratio of valid-IV SNPs incorrectly identified as invalid IVs to the total number of valid-IV SNPs. The results that are shown in Table 2 are the mean FSI and mean FSO cross 1000 simulated data sets for each of the pleiotropy scenarios defined above.
Starting with FSO, when the number of pleiotropic SNPs is 30, sisVIVE incorrectly identifies (on average) up to 71% of the valid-IV SNPs as invalid IVs, but it performs well (no more than 5% of valid-IV SNPs incorrectly identified) when there are only 10 pleiotropic SNPs. The deterioration in FSO as the number of pleiotropic SNPs increases is apparent under both direct pleiotropy scenarios (Scenarios 1 and 2) for both one-and two-sample strategies. The deterioration is also apparent under Scenario 3 (direct and indirect pleiotropy) but FSO exceeds 40% even if there are only 10 pleiotropic SNPs. The performance of sisVIVE in terms of FSO using a two-sample strategy and using a one-sample strategy are generally similar, except under the direct pleiotropy scenarios when the accuracy level of the SNP-BMI associations is imprecise; in this case, failure of the NOME condition appears to prevent valid-IV SNPs being incorrectly classed as invalid IVs.
In terms of FSI, between 39% and 11% of the invalid-IV SNPs are incorrectly identified as valid IVs under the scenarios we consider. Superficially, sisVIVE performs best in terms of FSI for two-sample strategies under Scenario 3 (direct and indirect pleiotropy) when the accuracy level of the SNP-BMI associations is imprecise, but this is counteracted by its FSO performance being the worst among all the scenarios.
Question B: Post-sisVIVE estimators: To assess the performance of alternative post-sisVIVE estimators, we consider the "oracle" properties of the different valid-IV estimators from Section 3.1 (and the two-sample equivalents of these estimators in Section 3.3). In other words, we assess how well each valid-IV estimator performs when the actual identities of the valid-IV SNPs are known. The results are presented in the Supplementary Information (Table S2).
In summary, we found that the one-sample versions of the valid-IV estimators, except for SPRS and LIML, were adversely affected by many weak instruments bias. 12 SPRS outperformed IPRS in terms of bias and MSE, despite IPRS having smaller standard errors. Only SPRS and LIML offer close-to-nominal coverage and were the most powerful at detecting the true causal effects. Using a two-sample strategy, every estimator was nearly unbiased for the true and precise accuracy scenarios, but every method was found to be biased toward zero under the imprecise accuracy scenario, with the consequent impact on coverage and power. The exception was EPRS for which bias was less affected by the failure of the NOME condition than the other estimators, but its standard error was inflated leading to a high MSE and very low power. We hence prefer SPRS as the post-sisVIVE estimator when using a one-sample strategy, and EPRS when using a two-sample strategy. We also, however, present results from 2SLS in both cases for the purposes of comparison.
Question C: Do the best post-sisVIVE estimators perform better than MR-Egger and MR-Median? Table 3 contains the results for scenarios with 10 invalid-IV SNPs using one-and two-sample strategies. Table 3a contains those results under pleiotropy Scenario 2 (unbalanced direct pleiotropy where InSIDE holds) and Table 3b  The picture that emerges from these results is that, across these scenarios, sisVIVE SPRS has the best performance in terms of bias and MSE, closely followed by sisVIVE-2SLS. If InSIDE holds, then MR-Egger performs best in terms of bias and confidence interval coverage using a two-sample strategy, but it performs considerably less well than post-sisVIVE SPRS and 2SLS in terms of mean-square error; this is also the case under the precise and imprecise accuracy levels (see Tables S6-S8 in the Supplementary Information). The poor performance of MR-Egger under the one-sample strategy is thus explained by the impact of many weak instruments bias.
Question D: How well does sisVIVE do in an absolute sense? It is clear that, across all the scenarios, the post-sisVIVE estimators are subject to substantial bias and that these biases are very large if (a) a one-sample strategy is used (in all circumstances apart from the less plausible Scenario 1) and (b) Scenario 3 holds (where InSIDE fails due to the presence indirect pleiotropy). This conclusion holds if performance is also judged in terms of coverage and power of the normal-based confidence intervals.
Overall, the same patterns are found for scenarios with 20 (see Tables S4, S7, and S10 in the Supplementary Information) and 30 (see Tables S5, S8, and S11) invalid-IV SNPs, except that the magnitude of the biases and MSEs, and the extent to which the confidence intervals fail to achieve nominal coverage, becomes worse as the number of invalid-IV SNPs increases.
The poor performance of the post-sisVIVE estimators under Scenario 3 is surprising because, superficially at least, these require only that less than half of the SNPs are invalid. We investigate this further by rerunning the simulations for Scenario 3 but, when using a two-sample strategy, took the true causal effects j of the SNPs on the exposure to be known to us for the pleiotropic SNPs rather than observed association b j = j + j . The results (not presented) show that the bias, MSE, and coverage of the sisVIVE-based approaches are similar to those under Scenarios 1 and 2; the same is also true for MR-Median. This indicates that indirect pleiotropy and the failure of InSIDE have, contrary to theory, a detrimental effect on these methods' performance.
One possible explanation for this is set out by Windmeijer et al. 21 Their proposition 2 contains conditions under which sisVIVE will not satisfy the "irrepresentable condition" of Zou 22 and, hence, cannot be consistent. Crudely put, this condition states that sisVIVE will only be consistent if the strength of the invalid-IV SNPs is greater than those of the valid-IV ones. This is a possible explanation for the poor performance of sisVIVE in our simulations because our scenario 3 simulations do not satisfy it. However, using corollary 1, 21 we set up a new version of our third scenario in which InSIDE fails but proposition 2 is satisfied; the new design and results are presented in the Supplementary Information: the design is described in Section S2.2, and the results for 10, 20, and 30 invalid-IV SNPs respectively presented in Tables S12, S13, and S14. It can be seen from these results that sisVIVE performs as badly as it did when the irrepresentable condition fails, so we conclude that failure of the irrepresentable condition does not explain our results.
We instead argue that sisVIVE's poor performance is because the consistency criterion (defined in theorem 1) 1 , which underpins the identification sisVIVE, cannot be satisfied if indirect pleiotropy leads to failure of the InSIDE condition. In short, even were the identity of the valid-IV and invalid-IV SNPs known, identification would not be possible because the effects of direct pleiotropy-and hence, the combined pleiotropic effects-would be confounded with the invalid-IV SNPs' causal effects on the exposure, so that the consistency criterion would not hold. This result also explains why the performance of sisVIVE improved in the hypothetical situation, explored above, where j (the causal effect of SNP on exposure) were known, because this identifies the combined pleiotropic effect. We provide a more detailed argument in Section S2.3 in the Supplementary Information.

DISCUSSION
Our investigation has revealed that, while it outperforms the alternative approaches we considered, sisVIVE performs well in terms of mean square error but performs particularly poorly if the InSIDE condition fails due to the presence of indirect pleiotropy. This is despite these scenarios apparently satisfying the sufficient condition that more than 50% of the SNPs are valid IVs. We argue that this is due to indirect pleiotropy leading to nonidentification and failure of the consistency criterion of Kang et al. The conclusion from our data analysis, which indicates the true causal effect of BMI could be underestimated by as much as a factor of five, is thus dependent on an assumption that there is no indirect pleiotropy through which the SNPs are related to the confounders. In general, such an occurrence cannot be ruled out because genes are determined at conception, and confounders determined at any point up to the time at which an individual's exposure is determined. The exception to this would be if there was strong scientific evidence that the phenotypical trait of each SNP was functionally related to the exposure of interest.
The performance of sisVIVE in scenarios where InSIDE holds and there is no indirect pleiotropy is more encouraging, but more generally, we would recommend that MR studies should be based on a sensitivity analysis involving robust MR-Egger and "valid IV" methods to capture whether differences between the estimates point to the presence of pleiotropic SNPs. We would also concur with others who have suggest using unweighted polygenic risk scores or a two-sample strategy or both. [31][32][33] For a one-sample strategy, it appears that imprecision in the estimated weights of an IPRS, while improving efficiency, can lead to substantial bias. While SPRSs perform well here, it is important to note that, in further simulations, we found (results not shown) that severe bias was introduced if the effect-allele coding of the SNPs led tô(or̃) being positive when true j would have led us to code it the other way; such "flip flopping" is possible, 34 even in the absence of population stratification, but remains a potential source of bias for SPRS despite its being discounted elsewhere (eg, see page 1883 in the work of Burgess et al). 35 The LIML or CUE estimators would, however, be unaffected by flip-flopping and, unlike 2SLS, not subject to large biases.