Statistical methods for cis-Mendelian randomization with two-sample summary-level data

Mendelian randomization (MR) is the use of genetic variants to assess the existence of a causal relationship between a risk factor and an outcome of interest. Here, we focus on two-sample summary-data MR analyses with many correlated variants from a single gene region, particularly on cis-MR studies which use protein expression as a risk factor. Such studies must rely on a small, curated set of variants from the studied region; using all variants in the region requires inverting an ill-conditioned genetic correlation matrix and results in numerically unstable causal effect estimates. We review methods for variable selection and estimation in cis-MR with summary-level data, ranging from stepwise pruning and conditional analysis to principal components analysis, factor analysis, and Bayesian variable selection. In a simulation study, we show that the various methods have comparable performance in analyses with large sample sizes and strong genetic instruments. However, when weak instrument bias is suspected, factor analysis and Bayesian variable selection produce more reliable inferences than simple pruning approaches, which are often used in practice. We conclude by examining two case studies, assessing the effects of low-density lipoprotein-cholesterol and serum testosterone on coronary heart disease risk using variants in the HMGCR and SHBG gene regions, respectively.


Introduction
Mendelian randomization (MR) is the use of genetic data to assess the existence of a causal relationship between a modifiable risk factor and an outcome of interest (Burgess & Thompson, 2015; Davey Smith & Ebrahim, 2003). It is an application of instrumental variables analysis in the field of genetic epidemiology, where genetic variants are used as instruments. The approach has received much attention in recent years and has been used to identify a large number of cause-effect relationships in the epidemiologic literature (Boef et al., 2015). For example, MR studies have strengthened the evidence for a causal link between lipoprotein(a) and coronary heart disease (CHD) (Kamstrup et al., 2002), but have weakened the evidence for an effect of C-reactive protein on CHD risk (CRP CHD Genetics Collaboration [CCGC], 2011).
An MR analysis requires a set of genetic variants that satisfy the instrumental variable assumptions: genetic variants should be (i) associated with the risk factor of interest, (ii) independent of confounders of the risk factor-outcome association, and (iii) they should only influence the outcome through their effect on the risk factor (the no-pleiotropy assumption). Under these three assumptions, MR offers a framework for assessing whether the risk factor is causally related to the outcome; under additional modeling assumptions, one can also estimate the magnitude of the risk factor-outcome causal effect (Figure 1).
Traditionally, MR studies are implemented using independent genetic variants from across the whole genome. This is especially the case when studying complex polygenic traits such as body mass index, cholesterol or blood pressure. Such genome-wide analyses rely on published results from large-scale genome-wide association studies (GWAS) to identify large numbers of genetic regions associated with the risk factor studied. Genetic variants in each region are pruned for independence, and either the variant with the smallest p value or a small number of weakly correlated variants with small p values are selected. Variants from different regions are then combined to create a genome-wide set of instruments for MR, thus increasing the power of the analysis to detect a causal relationship.
In parallel to the traditional genome-wide MR studies described above, recent years have seen an increased number of MR studies based on a single gene region (Gill et al., 2021; Schmidt et al., 2020). These single-region investigations are sometimes called cis-MR studies, because they use cis-variants (variants in close proximity to the gene of interest) as instrumental variables. The rise in popularity of cis-MR studies has been driven by the increasing appreciation of the fact that such studies can be used to inform drug target discovery and validation (Gill et al., 2021). Cis-MR studies typically use protein expression (or downstream biomarkers strongly associated with protein expression) as a risk factor, since each coding gene region encodes one protein and since proteins are the drug targets of most medicines. By studying the associations of variants in the region with a disease outcome of interest, cis-MR studies can thus provide information on whether the encoded protein can be used as a drug target for the outcome. Their scope is therefore different from that of genome-wide MR analyses, which aim to assess causal relationships between phenotypic traits, and more similar to that of transcriptome-wide and proteome-wide association studies, as we discuss later.
Cis-MR studies typically rely on a single gene region, and therefore have to select genetic instruments among a potentially large number of correlated variants. Using all variants in the region can result in numerically unstable causal effect estimates, so instead a small set of variants that are highly informative for the risk factor is normally used. Throughout the existing cis-MR literature, a variety of techniques have been employed to select these variants. However, to date no review or comparison of the different methods has been published, and applied cis-MR analyses typically use some form of LD-pruning for variable selection. The aim of this paper is to provide an overview of statistical methods for cis-MR studies, focusing on the common case of two-sample MR studies with summary-level data. After reviewing commonly used approaches such as LD-pruning and conditional analysis, we discuss the use of principal components analysis (PCA), factor analysis and stochastic-search variable selection. The use of PCA was proposed for MR with correlated instruments and can be directly extended to cis-MR. The use of factor analysis was recently explored by Patel et al. (2020), building on similar methods for instrumental variables analysis with individual-level data (Bai & Ng, 2002). Stochastic search can be implemented using the JAM algorithm (joint analysis of marginal summary statistics), originally proposed by Newcombe et al. (2016) for fine-mapping densely genotyped regions and recently adapted for MR. We discuss how to implement each method in the context of cis-MR and compare their performance in simulation studies.
We also provide two real-data applications, investigating associations of variants in the SHBG and HMGCR regions with CHD risk. The HMGCR gene encodes the protein HMG-coenzyme A reductase, which is known to play an important role in the cholesterol biosynthesis pathway. Its inhibition with statins is known to reduce LDL-cholesterol levels, and statin treatment is therefore often prescribed to individuals at high CHD risk. The SHBG region encodes the protein sex-hormone binding globulin, and is known to contain genetic associations with testosterone and other sex hormones; there is some uncertainty on whether testosterone levels are associated with CHD risk, although a recent MR study has suggested no causal relationship (Schooling et al., 2018).
Interest in cis-MR has been further motivated by evidence of an increased success rate of drug development programs with genomic support (Nelson et al., 2015).
Cis-MR studies can be used to assess the suitability of proteins as potential drug targets, predict and inform the results of clinical trials, or investigate repurposing opportunities and potential side-effects of existing drugs. For example, an MR analysis of LDL cholesterol and cardiovascular disease risk based on the ACLY region (Ference et al., 2019) was recently used alongside a clinical trial to assess the suitability of the ACLY region as a potential drug target for lowering LDL-cholesterol. Similar analyses have been conducted using the CETP region to instrument lipid fractions and blood pressure (Sofat et al., 2010), using the IL6R region to assess the effect of interleukin 6 reduction on CHD risk (The Interleukin-6 Receptor Mendelian Randomization Analysis [IL6R MR] Consortium, 2012), and using the HMGCR (Swerdlow et al., 2015) and PCSK9 regions (Schmidt et al., 2017) to explore potential effects of LDL-cholesterol lowering treatments on body mass index and type 2 diabetes risk.
Ideally, cis-MR should be conducted using protein expression data for the target protein risk factor (Gill et al., 2021). However, data on protein expression are not always available, and researchers often use either gene expression data for the associated gene region or genetic associations with a downstream biomarker instead. For example, LDL-cholesterol has been used as a downstream risk factor to study the effects of HMGCR inhibition on the risk of type 2 diabetes (Swerdlow et al., 2015).
Genetic variants for cis-MR are selected from a region containing the protein-encoding gene. The gene region is defined to contain the target gene, as well as variants within a few hundred thousand base pairs on either side of the gene, to include possible cis-eQTL variants. In principle, MR of protein expression can be implemented using variants from across the genome. However, since cis-MR is used for drug target discovery and validation, it is more relevant to focus on the protein-encoding gene region (Schmidt et al., 2020). This is especially the case when using a downstream biomarker as a proxy for protein expression, as there may be variants associated with the biomarker but not with the target protein. In the LDL-cholesterol example mentioned above, using variants associated with LDL-cholesterol across the genome would implicate multiple causal mechanisms affecting LDL-cholesterol concentration, many of which may not be related to the function of the HMGCR protein. Therefore, if the aim of the analysis is to assess potential side-effects of HMGCR-inhibiting drugs, it would be better to conduct the analysis using only variants in the HMGCR region.
In addition, variants in a single gene region often exhibit less heterogeneity in their genetic effects than variants from across the genome (see, e.g., figure 4 of Burgess et al., 2017).
Within the limits of a gene region, it is often possible to identify hundreds of correlated variants marginally associated with protein expression. Researchers must then select which of these variants to incorporate in a subsequent MR analysis; including all available variants is usually avoided, as it can introduce numerical approximation errors in estimating the MR causal effect. Some authors have suggested prioritizing variants based on functional annotations; however, Schmidt et al. (2020) investigated this strategy in a simulation study and found little to no difference in causal inferences conducted using functional variants compared to noncoding variants. Therefore, the selection of genetic variants is usually performed based on the strength of their associations with the risk factor, while accounting for LD correlation to reduce numerical approximation errors. In the next section, we review statistical approaches for implementing this variable selection procedure.
Finally, similar to standard MR analyses, genetic variants used for cis-MR must satisfy the instrumental variables assumptions. Violations of the instrumental variable assumptions are often a concern because they have the potential to invalidate tests of causal association between the risk factor and the outcome and to bias causal effect estimates. In principle, violations of the no-pleiotropy assumption are less likely to occur when studying protein expression, because proteins are causally upstream of the metabolites and biomarkers that constitute the most common risk factors used in traditional MR analyses (Schmidt et al., 2020; Swerdlow et al., 2016). This is illustrated in the causal diagram of Figure 2 (which is adapted from figure 1 of Schmidt et al., 2020).
Nevertheless, pleiotropic bias remains a concern even for studies of protein expression; it would arise if variants from the same region were involved in different biological pathways relevant to the outcome, resulting in heterogeneous causal effect estimates. When pleiotropy does occur, standard algorithms for pleiotropy-robust MR (including the JAM algorithm) can be used to address it. However, care must be taken when extrapolating the assumptions made by such algorithms to the framework of many correlated genetic variants. For example, suppose that a region contains a single pleiotropic variant, but that variant is strongly correlated with half of the other variants in the region. This will induce bias in the marginal causal effect estimates of the correlated variants, and hence pleiotropy-robust methods requiring that a majority or plurality of genetic variants in the region are valid instruments will be biased.

Notation
We now present various approaches for selecting genetic variants and computing causal effect estimates in cis-MR studies. Let X denote the risk factor of interest, Y denote the outcome, G_1, …, G_P denote a set of P putative genetic instruments from a single gene region, and let U represent confounders of the risk factor-outcome association. Also, let θ denote the causal effect of the risk factor on the outcome. Our task is to obtain a subset of the genetic variants to use as instruments, an estimator θ̂ of the causal effect parameter and an associated standard error se(θ̂).
We work under the framework of two-sample summary-data MR and assume the availability of two sets of summary statistics, one for SNP-risk factor and one for SNP-outcome associations, obtained from different data sets. We use β_Xj, β_Yj to denote the marginal associations between genetic variant G_j and the two traits, and σ_Xj^2, σ_Yj^2 for the corresponding variances. In addition, p_Xj denotes the p value of association between variant G_j and the risk factor. Finally, we assume that we have access to a reference data set, such as 1000 Genomes or the UK Biobank, from which to estimate correlations between genetic variants.

Single-variant MR
Some studies in the cis-MR literature have used only the variant with the smallest univariate p value in a region as an instrument (Sofat et al., 2010;Swerdlow et al., 2015). With a single genetic variant, the MR causal effect can be estimated using a ratio formula. This avoids the need to account for genetic correlations and to perform complex numerical tasks such as inverting the genetic covariance matrix. On the other hand, if the region contains multiple variants with independent associations with the risk factor, a single-variant MR analysis will not be able to model all the genetic effects in the region. This typically translates into a loss of power for the cis-MR analysis. In addition, identification of the smallest p value variant in the region can be subject to winner's curse bias. Therefore, most recent cis-MR papers have utilized variable selection or dimension reduction approaches to include several variants per region.
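The single-variant ratio estimate described above is straightforward to compute from summary statistics. The following is an illustrative sketch (the function name and the numerical values are ours), using the first-order standard error that ignores uncertainty in the variant-risk factor association:

```python
import numpy as np

def wald_ratio(beta_x, beta_y, se_y):
    """Ratio estimate of the causal effect from a single variant.

    Uses the first-order (delta method) standard error, which ignores
    uncertainty in the variant-risk factor association beta_x.
    """
    theta = beta_y / beta_x
    se = se_y / abs(beta_x)
    return theta, se

# Toy numbers (illustrative only): a variant strongly associated with
# the risk factor and with a modest outcome association.
theta, se = wald_ratio(beta_x=0.5, beta_y=0.1, se_y=0.02)
# theta = 0.2, se = 0.04
```

More refined standard errors (e.g., second-order approximations) exist, but the first-order version shown here is the one most commonly reported.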

LD-pruning
Perhaps the most common approach in the literature for selecting genetic variants for inclusion into a cis-MR study is LD-pruning (Dudbridge & Newcombe, 2015; Schmidt et al., 2020). Given a set of candidate genetic variants G_1, …, G_P, the stepwise pruning procedure iterates between:

• Selecting the variant with the smallest p value and including it in the analysis.
• Identifying all variants whose correlation with the previously selected variant exceeds a correlation threshold ρ and removing them from further consideration.
This is repeated until all the candidate variants have either been selected for inclusion or discarded. Alternatively, the process may be stopped when there are no remaining variants with p values lower than a significance threshold τ, to guard against the inclusion of weak instruments into the analysis. The threshold τ is often taken to be the GWAS significance threshold τ = 5 × 10^{-8}, although there is some evidence that using more relaxed thresholds may be beneficial (Dudbridge, 2013).
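The stepwise pruning procedure above can be sketched in a few lines. This is an illustrative implementation of our own (function name, threshold values, and data layout are assumptions, not any particular software package):

```python
import numpy as np

def ld_prune(pvals, corr, rho=0.4, tau=5e-8):
    """Greedy stepwise LD-pruning from univariate p values.

    pvals: per-variant p values for the variant-risk factor association.
    corr:  P x P genetic correlation matrix (e.g., from a reference panel).
    Variants are selected in order of increasing p value; anything
    correlated above rho with a selected variant is discarded, and
    selection stops once no remaining p value passes tau.
    """
    candidates = set(range(len(pvals)))
    selected = []
    while candidates:
        j = min(candidates, key=lambda i: pvals[i])
        if pvals[j] > tau:
            break
        selected.append(j)
        candidates = {i for i in candidates
                      if i != j and abs(corr[j, i]) <= rho}
    return selected

# Toy example: variants 0 and 1 are highly correlated, so only the one
# with the smaller p value survives; variant 2 is nearly independent.
corr = np.array([[1.0, 0.9, 0.1],
                 [0.9, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])
pvals = np.array([1e-10, 1e-12, 1e-9])
selected = ld_prune(pvals, corr)
# → [1, 2]
```

Raising rho or relaxing tau changes the selected set, which is precisely the sensitivity discussed later in this section.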
Once LD-pruning is implemented and a suitable set of genetic instruments is selected, the MR causal effect can be estimated using an inverse variance weighted (IVW) estimate. If the correlation threshold ρ is small enough, the selected variants can be considered approximately independent and the standard IVW estimate,

θ̂_IVW = ( Σ_j β_Xj β_Yj σ_Yj^{-2} ) / ( Σ_j β_Xj^2 σ_Yj^{-2} ),

can be used. This approach is typically used in genome-wide MR analyses and can also be adapted to account for uncertainty in G-X associations (Bowden et al., 2018). However, for single-region MR, the use of a small correlation threshold may result in the exclusion of causal variants, especially when studying gene regions with multiple independent associations with the risk factor. On the other hand, if a larger threshold is used, the standard IVW estimate will underestimate the causal standard error, since it does not adjust for correlations. In this case it is better to use a modified version of the IVW estimator that takes genetic correlations into account:

θ̂_cor = (β_X^T Ω^{-1} β_X)^{-1} β_X^T Ω^{-1} β_Y,  (1)
se(θ̂_cor) = (β_X^T Ω^{-1} β_X)^{-1/2},  (2)

where Ω denotes the genetic covariance matrix, Ω_{i,j} = Cov(G_i, G_j), and is usually estimated from a reference data set. The estimator (1) can be motivated from the literature on meta-analysis. It is sometimes referred to as the generalized least squares method (Schmidt et al., 2020) because it can be derived by generalized least squares regression of the genetic associations with the outcome on the genetic associations with the risk factor.
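The correlated-instruments IVW estimate can be computed directly from summary statistics. A minimal sketch of our own follows, using one common construction of the weight matrix, Ω_{ij} = σ_Yi σ_Yj ρ_{ij}, built from the outcome standard errors and a reference LD correlation matrix (an assumption on our part, not the paper's exact formula):

```python
import numpy as np

def ivw_correlated(beta_x, beta_y, se_y, corr):
    """IVW estimate accounting for LD, via generalized least squares.

    Builds Omega[i, j] = se_y[i] * se_y[j] * corr[i, j] and regresses the
    variant-outcome associations on the variant-risk factor associations
    with weight matrix Omega^{-1}. Solving linear systems instead of
    explicitly inverting Omega is slightly more stable numerically.
    """
    omega = np.outer(se_y, se_y) * corr
    w_beta_x = np.linalg.solve(omega, beta_x)   # Omega^{-1} beta_X
    precision = beta_x @ w_beta_x               # beta_X' Omega^{-1} beta_X
    theta = (w_beta_x @ beta_y) / precision
    se = np.sqrt(1.0 / precision)
    return theta, se

# Toy data: beta_y is exactly proportional to beta_x, so the estimate
# recovers the proportionality constant regardless of the correlation.
beta_x = np.array([0.5, 0.4])
beta_y = 0.2 * beta_x
se_y = np.array([0.02, 0.02])
corr = np.array([[1.0, 0.3], [0.3, 1.0]])
theta, se = ivw_correlated(beta_x, beta_y, se_y, corr)
# theta = 0.2 exactly, since beta_y is proportional to beta_x
```

When `corr` is close to singular, `np.linalg.solve` will raise or return wildly inflated values, which is the numerical instability discussed in the text.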
The correlated-instruments IVW estimator requires the genetic covariance matrix Ω to be computed and inverted, a process which can become numerically unstable if highly correlated variants are used. This is the main reason why cis-MR analyses cannot use all available variants within a region. Numerical instabilities are less likely to occur when the IVW estimator is used in combination with pruning, since the pairwise correlations of variants selected by LD-pruning are all lower than the specified correlation threshold ρ, although problems can still occur if the genetic covariance matrix is near-singular.
A potential criticism of the LD-pruning approach is that there is no consensus in the literature on how to select the correlation threshold. Implementations of the method using different correlation thresholds can give rise to different results, especially when large thresholds are used. To address this issue, Schmidt et al. (2020) proposed that stepwise pruning should be implemented with a variety of pruning thresholds, as a form of sensitivity analysis. In addition, stepwise methods for variable selection have been criticized for being unstable and getting stuck in local maxima of the model space (Hocking, 1976; Miller, 2002; Tibshirani, 1996), and more flexible variable selection approaches are preferred in some applications of biostatistics (Asimit et al., 2019; Benner et al., 2016; Newcombe et al., 2016), although it is not clear whether the added flexibility provides a substantial benefit in MR studies.

Conditional analysis
Conditional analysis is a process similar to LD-pruning, except that variant selection is based on genetic associations with the risk factor conditional on any previously selected variants, instead of univariate marginal p values. Perhaps the most established algorithm for conditional analysis is the Conditional and Joint (CoJo) algorithm of Yang et al. (2012). The algorithm aims to identify a set of selected variants, to be included in the MR analysis. Initially, this set only contains the genetic variant with the smallest univariate p value in the region. The algorithm then executes the following steps in each iteration:

• Calculate the p values of all candidate genetic variants conditional on the selected variants. We refer to Yang et al. (2012) for more details on how to do this. As with stepwise pruning, the p values only need to be computed for genetic variants whose correlations with previously selected variants do not exceed a prespecified correlation threshold ρ.
• Identify the variant with the smallest conditional p value (provided that this p value is smaller than a significance threshold τ) and add it to the set of selected variants.
• Regress the risk factor on all selected variants and compute the joint-model p values for each selected variant.
• If any of the joint-model p values is larger than the significance threshold τ, drop the variant with the largest p value from the set of selected variants.
The algorithm is run until it can no longer add or remove any variants. Using conditional (and not marginal) p values for variable selection can improve the algorithm's ability to identify variants with independent signals compared to LD-pruning. Upon finishing, the algorithm provides a set of genetic variants which can subsequently be used to conduct MR. A causal effect estimate based on the selected variants is then obtained using the correlated-instruments IVW estimator (1)-(2).
The conditional and joint-model p values can be computed using marginal summary statistics (Yang et al., 2012). The fact that the algorithm only requires summary-level data, along with its implementation as part of the GCTA software, have contributed to conditional analysis becoming a popular option for selecting variants from single genetic regions.
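The key identity that allows conditional and joint effects to be computed from summary-level data is that, for standardized genotypes, the marginal effects are related to the joint effects through the LD correlation matrix. A minimal illustration of our own (not the GCTA implementation, which additionally handles allele frequencies, unequal sample sizes, and trait scaling):

```python
import numpy as np

def joint_from_marginal(beta_marginal, corr):
    """Approximate joint effects from marginal ones using LD.

    For standardized genotypes the marginal effects satisfy
    beta_marginal ≈ R @ b_joint, where R is the LD correlation matrix,
    so the joint effects of a candidate set are recovered by solving
    a linear system against R. This is the identity underlying
    conditional/joint analysis from summary statistics.
    """
    return np.linalg.solve(corr, beta_marginal)

# Toy example: the second variant has a marginal signal that is fully
# explained by its correlation with the first variant.
corr = np.array([[1.0, 0.5], [0.5, 1.0]])
beta_marginal = np.array([0.30, 0.15])
b_joint = joint_from_marginal(beta_marginal, corr)
# b_joint → [0.3, 0.0]
```

Conditional p values then follow from the joint effect estimates and their standard errors within each candidate model.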
In practice, conditional analysis often has a similar performance to LD-pruning (Dudbridge & Newcombe, 2015). Both approaches rely on a correlation threshold parameter ρ and can produce inconsistent results for different values of the correlation parameter, especially for large values (e.g., ρ ≥ 0.9), due to numerical instabilities.

Principal component analysis
PCA is widely used in GWAS to adjust for population stratification. In the context of correlated-instruments summary-data MR, the use of principal components has also been proposed. The method aims to identify linear combinations of genetic variants that are orthogonal to each other and explain as much of the variance in the genetic data set as possible. These linear combinations are then used as genetic instruments to assess the causal relationship between the risk factor and outcome, using an IVW estimate.
PCA is applied to a matrix Ψ with elements

Ψ_{ij} = β_Xi β_Xj ρ_{ij} / (σ_Yi σ_Yj),

where ρ_{ij} = Cor(G_i, G_j). The matrix Ψ is essentially a weighted version of the genetic correlation matrix R = (ρ_{ij}). It is preferred over the untransformed matrix because its entries depend on the precisions σ_Yj^{-1} of the univariate estimates and thus, if two genetic variants are almost perfectly correlated, the variant with the highest precision will be prioritized.
PCA uses an eigendecomposition of the matrix Ψ, identifying the diagonal matrix of eigenvalues Λ = (λ_ii) and the corresponding matrix of eigenvectors W (with eigenvectors as columns) so that Ψ = W Λ W^T.
By transforming the summary statistics, β̃_X = W^T β_X and β̃_Y = W^T β_Y, and the genetic covariance matrix, Ω̃ = W^T Ω W, one can rewrite the IVW estimate (1) in terms of these transformed quantities:

θ̂_cor = (β̃_X^T Ω̃^{-1} β̃_X)^{-1} β̃_X^T Ω̃^{-1} β̃_Y.  (3)

Evaluating (3) is subject to the same numerical difficulties as (1). However, one can construct a numerically stable approximation for θ̂_cor by using only the first k principal components. If W_k is the sub-matrix of W containing only the first k columns, and β̃_X = W_k^T β_X, β̃_Y = W_k^T β_Y, Ω̃ = W_k^T Ω W_k, then the causal effect estimate based on the first k principal components takes the same form as (3). The number k of principal components to use is a tuning parameter for the method. In practice, it is often specified so that the selected principal components explain a specific proportion of variation in the genetic data set; a common criterion is to select enough principal components to explain either 99% or 99.9% of the variation in the data.
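A minimal sketch of the principal components approach follows (our own illustrative implementation; the weighting of Ψ and the variance-explained stopping rule mirror the description above, while the construction of Ω from standard errors and LD correlations is our assumption):

```python
import numpy as np

def ivw_pca(beta_x, beta_y, se_y, corr, var_explained=0.99):
    """IVW causal estimate using the top principal components.

    Applies PCA to the weighted matrix
    Psi[i, j] = beta_x[i] * beta_x[j] * corr[i, j] / (se_y[i] * se_y[j]),
    keeps enough components to explain `var_explained` of its variation,
    projects the summary statistics onto them, and runs correlated-
    instruments IVW in the reduced space.
    """
    psi = np.outer(beta_x, beta_x) * corr / np.outer(se_y, se_y)
    eigvals, eigvecs = np.linalg.eigh(psi)
    order = np.argsort(eigvals)[::-1]            # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    k = int(np.searchsorted(np.cumsum(eigvals) / eigvals.sum(),
                            var_explained)) + 1
    W = eigvecs[:, :k]
    omega = np.outer(se_y, se_y) * corr
    bx, by, om = W.T @ beta_x, W.T @ beta_y, W.T @ omega @ W
    w_bx = np.linalg.solve(om, bx)
    theta = (w_bx @ by) / (bx @ w_bx)
    se = np.sqrt(1.0 / (bx @ w_bx))
    return theta, se, k

# Toy data: proportional associations again recover the constant 0.2.
beta_x = np.array([0.5, 0.4, 0.3])
beta_y = 0.2 * beta_x
se_y = np.full(3, 0.02)
corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.4],
                 [0.2, 0.4, 1.0]])
theta, se, k = ivw_pca(beta_x, beta_y, se_y, corr)
```

Because Ψ is positive semi-definite, `eigh` is the appropriate (symmetric) eigensolver here.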

The JAM algorithm
The JAM algorithm (joint analysis of marginal summary statistics, Newcombe et al., 2016) is a Bayesian stochastic-search variable selection algorithm that was proposed for the purpose of statistical fine-mapping from summary GWAS results. The algorithm aims to identify genetic variants robustly and independently associated with a trait of interest, among a large set of correlated candidate variants. JAM only requires the availability of GWAS summary-level data, and genetic correlations which can be estimated from an external reference data set. Using these summary data, JAM identifies sets of genetic variants that are most plausibly associated with the trait of interest.
The algorithm assumes a linear regression model for the trait X in terms of the genetic variants G = (G_1, …, G_P),

X = G b + ε_X,  (4)

where b denotes the vector of joint effects of the genetic variants on the trait (in contrast to β_X, β_Y, which represent marginal effects of each variant separately). To fit this regression model using summary statistics, JAM employs the transformation z = G^T X, for which (4) implies that

z = G^T G b + G^T ε_X.  (5)

Assuming Hardy-Weinberg equilibrium and that the trait measurements and genetic data have been centered, each element z_j of the vector z can be approximated using the marginal summary statistics β_Xj and effect allele frequencies f_j as

z_j ≈ 2 f_j (1 − f_j) N_1 β_Xj,

where N_1 is the sample size from which the G-X associations are inferred. In practice, an additional Cholesky decomposition is implemented to avoid modeling under the correlated error structure of (5); we refer to Newcombe et al. (2016) for the technical details. In addition, note that G^T G is (N_1 − 1 times) the genetic covariance matrix, which can be estimated from reference data. Hence, equation (5) can be used to construct a summary-data likelihood for the joint genetic effects. A similar likelihood is used by the conditional and joint method (Yang et al., 2012) to conduct the joint analysis step. JAM employs Bayesian analysis instead, and uses conjugate normal-inverse-gamma g-priors p(b, σ_X^2) to facilitate Bayesian inference for the genetic associations with the trait.
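The construction of z from marginal summary statistics can be sketched as follows (an illustrative helper of our own, not the JAM package itself):

```python
import numpy as np

def jam_z(beta_x, freq, n):
    """Approximate z = G'X from marginal summary statistics.

    Under Hardy-Weinberg equilibrium, a variant with effect allele
    frequency f has genotype variance 2 f (1 - f), so with centered
    trait and genotype data z_j = G_j'X is approximately
    n * 2 * f_j * (1 - f_j) * beta_Xj.
    """
    return n * 2.0 * freq * (1.0 - freq) * beta_x

# Toy check: one variant with frequency 0.5 and marginal effect 0.1,
# estimated in a sample of n = 1000.
z = jam_z(np.array([0.1]), np.array([0.5]), 1000)
# → array([50.])
```

The vector z, together with a reference estimate of G^T G, is all the algorithm needs from the risk factor sample.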
The likelihood (5) relies on a fixed set of genetic variants. For variable selection, we assume that the likelihood has been conditioned on a specific set of variants γ and use a Beta-Binomial prior p(γ) for the model configuration to obtain the posterior

p(b, σ_X^2, γ | z) ∝ p(z | b, σ_X^2, γ) p(b, σ_X^2 | γ) p(γ).  (6)

This posterior is difficult to evaluate analytically, but JAM implements stochastic-search model selection via a reversible-jump Markov chain Monte Carlo (RJMCMC) algorithm to sample from (6) and identify the most suitable sets of genetic variants. The stochastic search procedure starts from a set γ^(0) containing only the variant with the smallest p value. Then in each iteration, given a current set of genetic variants γ^(k), JAM randomly proposes a new set γ^(k+1) by either:
• Adding a new variant to the current set γ^(k).
• Swapping a variant in γ^(k) for a variant not in γ^(k).
The algorithm then decides whether to accept the new set by evaluating how well it fits the trait, according to the posterior (6). The process is repeated for a large number of iterations, and the various sets of variants are assigned posterior probabilities according to how often they were visited. The stochastic search procedure allows JAM to evaluate a wide range of different models and efficiently explore the parameter space of genetic configurations.
If the trait studied is binary, z can be derived by mapping univariate log-odds ratios to the univariate effects that would have been estimated if the binary trait was modeled by linear regression, as has been done in other linear-based summary data algorithms (Vilhjalmsson et al., 2015).
For MR, the above procedure can be used to identify variants strongly and independently associated with the risk factor of interest. An IVW causal effect estimate can then be obtained for each set of variants visited by JAM, and these model-specific estimates can be combined into an aggregate causal effect estimate by model averaging.
Since it jointly models all available genetic variants and can account for genetic correlations, JAM is naturally suited for cis-MR. The algorithm discards variants which are associated with the risk factor only through their correlation with other present variants, and its posterior models include variants independently associated with the risk factor. JAM is not completely free of numerical issues when implemented on near-collinear variants, but these issues can be overcome by employing pruning as a first step before calling the algorithm. An advantage over the LD-pruning approach is that JAM is more robust to the choice of pruning thresholds, since it implements the second layer of variable selection after pruning is finished. The pruning step can be implemented using a high correlation threshold (e.g., ρ = 0.9) and no significance threshold. As an alternative, JAM can avoid the need for pruning by using a ridge term for the genetic covariance matrix to make its inversion more stable. The choice of a pruning threshold is then replaced by the tuning of the ridge term parameter.
Here we focus on implementations using pruning as a first step and do not explore the use of a ridge term.
Outside of the context of cis-MR, JAM has been extended to handle violations of the no-pleiotropy assumption when working with multiple independent genetic variants, by augmenting its variable selection with a heterogeneity loss function to penalize and downweight pleiotropic variants. Furthermore, a hierarchical version of the algorithm that is useful for multivariable MR, as well as transcriptome-wide association studies, was recently developed by Jiang et al. (2020).

Factor-based methods
Recently, Patel et al. (2020) proposed the use of factor analysis for MR with correlated weak instruments. The authors model the genetic variants in terms of a factor model,

G = Λ f + ε_f,

where f is a vector of k < P latent factors, Λ is a matrix of factor loadings and ε_f is a vector of idiosyncratic errors. The matrix Λ can be estimated using the first k eigenvectors in the eigendecomposition of the (rescaled) genetic covariance matrix G^T G. This covariance matrix and its eigendecomposition can be approximated if one has access to a reference data set. The number k of latent factors can be selected either empirically, as is common in factor analysis, or in a more disciplined way, for example using the factor penalization method of Bai and Ng (2002).
The factors are then used as genetic instruments in a standard MR model, with mean-zero error terms ε_X and ε_Y in the risk factor and outcome equations. An estimate of the causal effect parameter θ can be obtained by limited-information maximum likelihood (LIML), minimizing a criterion g(θ) that depends on the estimated matrix of factor loadings Λ̂ and on the diagonal matrices D_X, D_Y whose diagonal entries are the sample variances of each genetic instrument; these can be approximated from the reference data set. Minimization of (a weighted version of) g(θ) yields the factor LIML (F-LIML) estimate.
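The loading matrix can be estimated from a reference covariance matrix via its eigendecomposition. A minimal sketch of our own follows, using the standard principal-components estimator of a factor model (an assumed simplification, not Patel et al.'s exact implementation):

```python
import numpy as np

def factor_loadings(cov, k):
    """Estimate factor loadings from a reference genetic covariance matrix.

    Takes the top-k eigenvectors of the covariance matrix, scaled by the
    square roots of their eigenvalues, as the loading matrix Lambda --
    the usual principal-components estimator of a factor model.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order] * np.sqrt(eigvals[order])

# Toy check: a rank-1 covariance is reproduced exactly by one factor.
cov = np.array([[4.0, 2.0],
                [2.0, 1.0]])
L = factor_loadings(cov, 1)
# L @ L.T reproduces cov (loadings are identified up to sign)
```

The number of factors k would then be chosen empirically or via the Bai and Ng (2002) penalization criterion mentioned above.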
Under certain conditions, the F-LIML estimator has been shown to be consistent and asymptotically normal (Patel et al., 2020). Unfortunately, these conditions are often violated under a weak instruments framework and F-LIML may suffer from weak instrument bias. In this scenario, an alternative is to conduct hypothesis tests, using the factors as genetic instruments, to assess the existence of a causal effect. Such tests were first developed in the instrumental variables literature and adapted to work with summary-level data (Patel et al., 2020). They include the Anderson-Rubin (1949), Lagrange multiplier (Kleibergen, 2002), and conditional likelihood ratio (Moreira, 2003) tests. The various tests are more robust to weak instrument bias than the F-LIML estimator, and were shown to have higher power than the principal components method under a weak instruments setting.

Separating variable selection from estimation
The approaches discussed in this section implement both variable selection and causal effect estimation. Here, we have focused on the variable selection part, because this is where most differences between the various approaches are observed. In terms of variable selection, LD-pruning and conditional analysis implement stepwise selection of genetic variants, JAM implements Bayesian stochastic search, and PCA and factor analysis construct linear combinations using all available variants and use those linear combinations as instruments.
We have coupled each variable selection algorithm with the estimation method most commonly associated with it in practice, or (for newly developed algorithms) with the estimation method proposed by its authors. For top-SNP analysis, causal effect estimation is performed using a Wald ratio estimate. For pruning, conditional analysis, PCA, and JAM, we have used the popular IVW formula. For the factor-based approach, Patel et al. (2020) explicitly proposed using likelihood-based methods (LIML and the conditional likelihood ratio test) for estimation using the factors. These three methods (Wald ratio, IVW, and LIML) are the most common estimation methods in two-sample MR with summary-level data (Burgess et al., 2013). Access to individual-level data would facilitate the use of additional estimation methods, including the popular two-stage least squares algorithm.
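For completeness, the correlated-instruments IVW estimate can be computed by generalized weighted least squares, using the G-Y standard errors and the reference correlation matrix to build the covariance of the summary statistics. The Python sketch below is a minimal illustration of this standard formula (our analyses used R); the function name is our own.

```python
import numpy as np

def ivw_correlated(bx, by, se_y, rho):
    """Correlated-instruments IVW estimate (generalized weighted least squares).

    bx, by: G-X and G-Y association estimates; se_y: G-Y standard errors;
    rho: genetic correlation matrix from a reference data set."""
    # Covariance of the G-Y estimates induced by LD: Omega_ij = se_i * se_j * rho_ij
    omega = np.outer(se_y, se_y) * rho
    w = np.linalg.solve(omega, bx)          # Omega^{-1} b_X without an explicit inverse
    theta = (w @ by) / (bx @ w)             # GLS slope estimate
    se = np.sqrt(1.0 / (bx @ w))            # its standard error
    return theta, se
```

With uncorrelated variants (rho equal to the identity) this reduces to the familiar inverse-variance weighted average of Wald ratios.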
In principle, different methods for variable selection and estimation can be combined. For example, the SNPs selected in each iteration of the JAM algorithm could be combined into an allele score, or used for causal effect estimation via the LIML method. Another idea, as hinted in our previous section, would be to combine the variable selection algorithms presented here with pleiotropy-robust MR methods. Although these ideas would be interesting, our goal in this manuscript was to present the current state of the literature, therefore we did not pursue them further.

Simulation design
To compare the various cis-MR approaches, we implemented a simulation study. We generated data under the following simulation model:

X = Σ_j b_Xj G_j + U + ϵ_X,    (8)
Y = θ X + U + ϵ_Y,    (9)

U, ϵ_X, ϵ_Y ~ N(0, σ₀²), independently of each other. Our simulation design was informed by the real-data applications in the next section. We obtained genetic variants from two gene regions: the SHBG region, which encodes the protein sex-hormone binding globulin and is known to be associated with sex hormone traits, and the HMGCR region, which encodes the protein HMG coenzyme-A reductase, plays an important role in the metabolic pathway that produces cholesterol, and is the drug target of statins. We refer to the real-data applications for more information about these regions. SHBG and HMGCR represent two distinctly different genetic correlation structures; correlation heat-maps for the two regions are provided in Figures 3 and 4 below. By basing simulations on two different regions, we guarded against spurious conclusions about the comparative performance of the methods that might arise from subtle properties of the correlation structure in any one region.
We defined the two gene regions as spanning 100 kb on either side of the corresponding gene and used all variants in these regions for which individual-level data were available in the UK Biobank and the minor allele frequency was higher than 1%. This resulted in P1 = 717 genetic variants for the SHBG region and P2 = 590 variants for the HMGCR region. We extracted individual-level genetic observations for the selected variants from the UK Biobank, using a sample of N_ref = 367,643 unrelated individuals of European origin. Aiming to replicate a two-sample MR analysis, we then bootstrapped the UK Biobank data and obtained two genetic matrices G1, G2 of sizes N1, N2, respectively. Using the genetic matrix G1, we simulated risk factor measurements according to (8) and used them to obtain G-X summary statistics. Using the matrix G2 and (8)-(9), we simulated outcome measurements and used them to generate G-Y summary statistics. Finally, the reference correlation matrix was computed using the entire UK Biobank data set.
To generate risk factor measurements, we used information from our real-data applications and the literature. For the SHBG region, we assumed that six independent genetic effects were present in the region. This was inspired by Coviello et al. (2012), who investigated the associations of SHBG variants with concentration of the SHBG protein and suggested that as many as nine independent signals may be present in the region. Three of the nine causal variants identified by the study were not included in our data set, as they were located more than 100 kb away from the SHBG gene; we therefore used the remaining six variants in our simulations. For these six variants, the effects of the risk-increasing allele on the risk factor were generated according to b_Xj ~ |N(0, 0.2)| + 0.1. This created a pattern of univariate SNP-exposure associations similar to what we observed in our real-data application.
For the HMGCR region, we also assumed the existence of six variants independently associated with the risk factor. The six causal signals were placed at genetic variants used in a recent paper (Ference et al., 2019) to construct a genetic score for LDL-cholesterol based on the HMGCR region. Effects were generated according to b Xj ~ |N (0, 0.03)| + 0.03 for the risk-increasing allele; again, this closely resembled the univariate summary statistics obtained in our real-data application.
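The data-generating process described above can be sketched as follows. This is a minimal Python illustration (our simulations were run in R on real UK Biobank genotypes); the small region of 10 independent binomial genotypes and the sample size of 5,000 are hypothetical stand-ins for the real SHBG/HMGCR data.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_summary_stats(G, b_x, theta, sigma0=1.0, rng=rng):
    """Sketch of the generative model (8)-(9): X = G b_X + U + eps_X and
    Y = theta * X + U + eps_Y, with U, eps_X, eps_Y ~ N(0, sigma0^2)."""
    n, p = G.shape
    U = rng.normal(0.0, sigma0, n)                      # shared confounder
    X = G @ b_x + U + rng.normal(0.0, sigma0, n)        # equation (8)
    Y = theta * X + U + rng.normal(0.0, sigma0, n)      # equation (9)
    # Univariate G-X and G-Y summary statistics: one regression per variant.
    bx_hat = np.array([np.polyfit(G[:, j], X, 1)[0] for j in range(p)])
    by_hat = np.array([np.polyfit(G[:, j], Y, 1)[0] for j in range(p)])
    return bx_hat, by_hat

# Hypothetical genotypes: 6 causal variants among 10, effects |N(0, 0.2)| + 0.1
G = rng.binomial(2, 0.3, size=(5000, 10)).astype(float)
b_x = np.zeros(10)
b_x[:6] = np.abs(rng.normal(0.0, 0.2, 6)) + 0.1
bx_hat, by_hat = simulate_summary_stats(G, b_x, theta=0.1)
```

In the real simulation, the G-X and G-Y statistics would be computed from two separate bootstrapped genotype matrices to mimic the two-sample design.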
We considered three different values for the causal effect parameter: θ = 0, 0.05, 0.1. We set the G-X sample size to N1 = 10,000, which is fairly small by modern MR standards but may be reasonable for a cis-MR study using protein expression data for the risk factor.
In Supporting Information material, we report simulation results using a larger sample size of N 1 = 100, 000. The sample size for SNP-outcome associations was set equal to N 2 = 180, 000, similar to that of the CARDIoGRAM-plusC4D data set (CARDIoGRAMplusC4D Consortium, 2015) used in our real-data applications. Finally, we assumed that all the genetic variants are valid instruments and did not generate pleiotropic effects.
One of our goals in the simulations was to assess the effect of weak instrument bias on the performance of the various methods. This form of bias can be of particular concern in cis-MR studies, as focusing on a single region means that there will usually be far fewer genetic instruments to use. A common diagnostic for weak instrument bias is to compute the F statistic for the regression of the risk factor on all the genetic variants; a rule of thumb is that F ≤ 10 is indicative of bias.
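Given a variance explained R², the implied F statistic can be computed directly from the standard formula F = (R²/p) / ((1 − R²)/(n − p − 1)). The Python sketch below illustrates this; with the values used in our design (3% variance explained, n = 10,000, six causal variants), it roughly reproduces the oracle F statistic quoted for the SHBG region.

```python
def f_statistic(r2, n, p):
    """Rule-of-thumb instrument-strength diagnostic: F statistic for the
    regression of the risk factor on p genetic variants in a sample of size n."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Six variants explaining 3% of variance with n = 10,000: strong instruments (F > 10)
f_strong = f_statistic(0.03, 10_000, 6)
# Ten-fold weaker signal (0.3% of variance): weak instruments (F < 10)
f_weak = f_statistic(0.003, 10_000, 6)
```

Using many additional nuisance variants inflates p and therefore deflates F, which is why analyses that retain whole regions of variants are more exposed to weak instrument bias.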
Therefore, we considered two simulation scenarios. In the first scenario, the proportion of variation in the risk factor explained by the genetic variants in each region was specified according to what has been empirically observed for their protein products, and was set equal to 3% for the SHBG region (Jin et al., 2012) and 2% for the HMGCR region (Krauss et al., 2008). This is a "strong instruments" setting, in which an "oracle" MR analysis using only the six causal variants in each region attained an average F statistic of 52 for the SHBG region and 35 for the HMGCR region. In this scenario, the effect of weak instrument bias on the performance of cis-MR methods should be small.
In our second scenario, we reduced the proportion of variation in the risk factor that is explained by the genetic variants to 10% of its value in the previous scenario. For the SHBG region, we assumed that the genetic variants only explain v G = 0.3% of variation in the risk factor, while for the HMGCR region, the genetic variants explained 0.2% of variation. This resulted in a "weak instruments" scenario in which the "oracle" MR analysis had an average F statistic of 6.0 for the SHBG region and 4.2 for the HMGCR region.
For each scenario, the simulation was replicated 1000 times per region and per value of the causal effect.

Methods
In our simulations we implemented the following cis-MR methods:
• Single-instrument MR using only the minimum p value variant in the region.
• Stepwise LD-pruning, followed by the correlated-instruments IVW estimator.
• Principal components analysis.
• The JAM algorithm.
• The factor-based LIML (F-LIML) estimator.
• The factor-based conditional likelihood ratio (CLR) test.
The performance of the conditional method is often quite similar to that of stepwise pruning (Dudbridge & Newcombe, 2015), hence the method was not implemented. A brief comparison between conditional analysis, stepwise pruning, and PCA was conducted in .
Many of the above methods depend on additional tuning parameters. To assess the sensitivity of each method to the specification of its tuning parameter(s), we used a range of different values. For stepwise pruning, we set a correlation threshold of ρ = 0.1, 0.3, 0.5, 0.7, 0.9. For the significance threshold, we used the standard GWAS threshold (τ = 5 × 10⁻⁸) in simulations with "strong instruments"; with "weak instruments" there were simulations in which no variants passed the GWAS significance threshold, so we used a less stringent threshold (τ = 10⁻³) instead. For the PCA method, we used principal components that explained k = 99% or k = 99.9% of the variation in the genetic data. For JAM, we implemented a prepruning step with a correlation threshold of ρ = 0.6, 0.8, 0.9, 0.95 and no significance threshold; note that JAM does not need a significance threshold for the prepruning step, because its variable selection discards variants weakly associated with the risk factor anyway. The algorithm was run for 1 million iterations in each instance. Finally, for the F-LIML method, we allowed the algorithm to determine the number of latent factors to use.
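The stepwise pruning procedure can be sketched as a greedy loop: repeatedly keep the most significant remaining variant and discard everything correlated with it above the threshold. The Python version below is illustrative (our implementation was in R, and its details may differ).

```python
import numpy as np

def ld_prune(pvals, rho, threshold=0.5, sig=5e-8):
    """Greedy stepwise pruning: keep the most significant remaining variant,
    then drop all variants correlated with it above `threshold`.

    pvals: per-variant p values; rho: variant correlation matrix.
    Returns the indices of the retained (pruned) instrument set."""
    candidates = [j for j in range(len(pvals)) if pvals[j] < sig]
    kept = []
    while candidates:
        best = min(candidates, key=lambda j: pvals[j])
        kept.append(best)
        candidates = [j for j in candidates
                      if j != best and abs(rho[best, j]) <= threshold]
    return kept
```

Larger values of `threshold` retain more (and more highly correlated) variants, which is exactly the regime in which the correlation matrix becomes harder to invert.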
The simulations were implemented in the statistical software R. Stepwise pruning and correlated-instruments IVW were implemented manually. To implement the PCA approach, we used code provided in the appendix of . The JAM algorithm was implemented using the R package R2BGLiMS. The F-LIML algorithm and the summary-data CLR test were implemented using code provided to us by the authors of the relevant publication (Patel et al., 2020). Stepwise pruning was the fastest algorithm to implement and the JAM algorithm was the slowest, although the differences were small on an absolute scale and none of the methods required more than a few seconds to run for each data set.
The various MR methods in our simulations were subject to two sources of bias. The first source is weak instrument bias, which is known to act towards the direction of the observational association in one-sample MR analyses and towards the null in two-sample MR (Burgess et al., 2011). As discussed earlier, weak instrument bias should be a concern in our second simulation scenario, where the F statistic is below 10. Note, however, that even in the first scenario, weak instrument bias may still affect methods using many nuisance variants to obtain a causal effect estimate, as using such variants reduces the value of the F statistic. Second, the performance of the various methods in our simulations can be affected by numerical issues, in particular those related to inverting the genetic covariance matrix GᵀG. Inaccurate estimation of (GᵀG)⁻¹ may lead to spurious increases or decreases in genetic correlations.
This form of bias can be detected by computing the condition number of the genetic covariance matrix, but its direction is not straightforward to assess. It is more likely to affect methods using highly correlated variants, such as stepwise pruning with high correlation thresholds.
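The condition number diagnostic is straightforward to compute. The Python sketch below contrasts a near-duplicate pair of variants (r = 0.999) with a weakly correlated pair; for a 2 × 2 correlation matrix, the condition number is (1 + r)/(1 − r), so near-duplicate variants make the inverse extremely sensitive to estimation error.

```python
import numpy as np

def correlation_condition_number(rho):
    """Diagnostic for numerical instability when inverting a genetic
    correlation matrix: large condition numbers flag an ill-conditioned inverse."""
    return np.linalg.cond(rho)

# Two near-duplicate variants (r = 0.999) versus weakly correlated ones (r = 0.2)
ill = np.array([[1.0, 0.999], [0.999, 1.0]])   # condition number near 2000
ok = np.array([[1.0, 0.2], [0.2, 1.0]])        # condition number 1.5
```

In practice the same computation applies to the full P × P reference correlation matrix; pruning at a lower threshold is one way to reduce the condition number before inversion.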

Results
Simulation results are reported in Tables 1 and 2. For each method and each value of the causal effect parameter, we report median causal effect estimates and estimated standard errors. For simulations with θ = 0 we also report the empirical Type I error rates, defined as the proportion of simulations in which the method rejected the null causal hypothesis. For simulations with θ ≠ 0 we also report the empirical coverage of confidence intervals and the empirical power to reject the null causal hypothesis. All methods had very high power in the "strong instruments" scenario with θ = 0.1, and very low power in the "weak instruments" scenario with θ = 0.05, therefore we do not report power in these two scenarios. Finally, the conditional likelihood ratio test does not provide a causal effect estimate or an associated standard error, hence we only report its coverage, power, and Type I error rate.

Table 1 contains simulation results from the "strong instruments" scenario. In this scenario, all methods performed quite well in simulations with a null causal effect. When a positive causal effect was used, PCA, JAM, and F-LIML managed to identify the true value of the causal parameter with decent accuracy in both regions. LD-pruning did the same in most cases, but the method's performance deteriorated for large ρ values, exhibiting bias toward the null. The bias was more pronounced for θ = 0.1 than for θ = 0.05.
The good performance of all methods for θ = 0 suggests that any issues in their performance are due to weak instrument bias, which acts toward the null and hence would only affect simulations with θ ≠ 0. Accordingly, any biases observed for θ ≠ 0 were toward the null.
For LD-pruning, large values of ρ make the inclusion of weak instruments more likely and the numerical computation and inversion of the correlation matrix more challenging.
For small values of ρ the method is more robust to the inclusion of weak instruments. On the other hand, when the genetic region studied contains multiple causal signals, using a small correlation threshold may discard some of the causal variants from the analysis. In our simulations, this translated into a fairly small increase in causal standard errors and a decrease in the method's power to detect a causal effect. We also note that in the simulations of Table 1, LD-pruning was implemented after discarding genetic variants whose p values did not reach genome-wide significance. This is a "best-case scenario" for the method in terms of avoiding weak instrument bias; in practice, pruning is often implemented using less stringent thresholds (Dudbridge, 2013), and the effects of weak instrument bias can be more severe.
As expected, an MR analysis using only the variant with the smallest p value in the region was unbiased, but had larger standard errors and lower power than other methods. This was more pronounced for the SHBG region and less so for the HMGCR region, since genetic correlations were stronger in the HMGCR region, and using only a single variant could partly account for the effects of other variants through correlation. We also note that in our simulations, all causal variants had G-X effects in the same direction and of similar magnitude, and there were no heterogeneous effects toward the outcome. This is a best-case scenario for single-variant analysis; differences between it and other methods are likely to be larger in practice.
The performance of the PCA method was similar for k = 99% and k = 99.9% and was quite accurate. With a causal effect of θ = 0.1, the algorithm exhibited a small reduction in coverage due to weak instrument bias but still performed better than LD-pruning. In general, the effect of weak instruments bias on the algorithm is more pronounced for larger values of the tuning parameter k. Here, the standard values of k = 99% and k = 99.9% worked reasonably well in all simulation scenarios. The power to reject the null causal hypothesis was greater for k = 99.9%, at least for the SHBG region.
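The construction of principal component instruments from the correlation matrix can be sketched as follows. This is an illustrative Python version (our analyses used the R code of the original publication): the smallest number of components whose eigenvalues explain a fraction k of total variation is retained.

```python
import numpy as np

def pca_components(rho, k=0.99):
    """Select the smallest number of principal components of the genetic
    correlation matrix that explain a fraction k of total variation.

    Returns a variants x components matrix of weights; the components
    (linear combinations of variants) are then used as instruments."""
    eigvals, eigvecs = np.linalg.eigh(rho)            # ascending eigenvalues
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort descending
    explained = np.cumsum(eigvals) / np.sum(eigvals)
    m = min(int(np.searchsorted(explained, k)) + 1, len(eigvals))
    return eigvecs[:, :m]
```

Raising k retains more components, including low-variance ones that behave like weak instruments; this is the mechanism behind the tuning-parameter sensitivity discussed above.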
The JAM algorithm exhibited similar performance to PCA, with a small attenuation of causal effect estimates for θ = 0.1. JAM requires a correlation threshold to be specified for the pruning step before running Bayesian variable selection, but the algorithm's performance was quite robust to the value of that tuning parameter, certainly more so than that of LD-pruning. Its empirical power was slightly higher than PCA for the SHBG region, but lower for the HMGCR region.
The F-LIML method does not depend on a tuning parameter, as it can automatically determine the number of latent factors to use. Compared to JAM and PCA, the algorithm yielded slightly more accurate causal effect estimates and slightly better-calibrated confidence intervals for θ = 0.1, but had slightly inflated Type I error rates for θ = 0. The latter issue was addressed by the conditional likelihood ratio test, at the expense of no causal effect estimates and slightly lower power than F-LIML.

Table 2 reports simulation results from the "weak instruments" scenario. Weak instrument bias had a much higher impact in these simulations, with several methods facing attenuation of their causal effect estimates. As expected, any bias occurred only for θ ≠ 0 and was toward the null, while all methods also had increased standard errors and low power.
Top-SNP analysis, LD-pruning, PCA, and JAM all suffered from weak instrument bias in this scenario. The magnitude of bias was similar for these methods. For pruning and PCA, the bias caused poor coverage properties for confidence intervals and Type I error rates above nominal levels. JAM only selected a small number of genetic variants (it selected an average of 1.6 variants per run) and attempted to adjust for the presence of weak instruments by producing wider confidence intervals; in fact, it was rather conservative in our simulations, with coverage rates above 95%, and consequently, its power was quite low.
In terms of their dependence on tuning parameters, the methods behaved similarly to the "strong instruments" scenario: LD-pruning was more susceptible to bias for large correlation thresholds, while having smaller standard errors. PCA and JAM remained robust to the specification of their tuning parameters.
The F-LIML method had the opposite performance compared to JAM: the algorithm provided quite accurate causal effect estimates, but underestimated standard errors and resulted in confidence intervals with inflated Type I error rates and below-nominal coverage. The inefficiency of the algorithm in weak instrument scenarios has been noted by its authors. On the other hand, the conditional likelihood ratio test was the method least affected by weak instrument bias and offered a reliable way of assessing the existence of a causal effect. Its power was lower than that of the (overprecise) F-LIML method but higher than that of the JAM algorithm.

The relation between these results and the estimation algorithms used is worth noting. It has been reported in the literature (Angrist & Pischke, 2008, chapter 4) that likelihood-based methods can yield near-unbiased causal effect estimates even in the presence of weak instruments, a property not shared by the IVW estimator used by JAM. This was mostly confirmed by our simulation results, although the reduced coverage of the F-LIML approach suggests that even likelihood-based methods are not completely free of weak instrument bias.
In summary, our simulations suggest that simple methods such as LD-pruning and single-variant MR analysis can be reliable when weak instrument bias is of lesser concern, but should be used with caution in cis-MR analyses where weak instrument bias is suspected. Instrument strength can be assessed using the F statistic, as usual in MR. Even when LD-pruning is used, MR causal effect estimates should be computed and reported for a range of different correlation thresholds as a form of sensitivity analysis. Between JAM, PCA, and F-LIML, the differences were small in the "strong instruments" simulation. With weak instruments, F-LIML provided accurate causal effect estimates but poor uncertainty quantification; JAM yielded biased causal effect estimates and reasonable confidence intervals but low power; the CLR test offered nominal confidence intervals and decent power, at the expense of providing no point estimates; and PCA was the method most affected by weak instrument bias. A theoretical advantage of JAM compared to the other approaches is that its variable selection gives an indication of which genetic variants are more likely to be causally associated with the risk factor, although it is not clear whether this additional information would be useful in an MR study whose objective is inference about the X-Y causal effect. Finally, it may be possible to improve the performance of some of the methods by removing very weak variants (e.g., with p > 0.05) from the analysis before implementing the methods, but we have not considered that here.

Additional simulations
In Supporting Information material, we report results from a range of additional simulation scenarios. Specifically, we conducted three additional simulations, on which we comment here briefly.
First, we used a larger sample size of N1 = 100,000 from which to obtain genetic associations with the risk factor. This is reminiscent of a cis-MR analysis using a downstream biomarker as a proxy for protein expression and obtaining G-X summary data from a large-scale GWAS. Our results (Supporting Information Table 1) suggested that the various methods perform similarly in this case, and that issues such as weak instrument bias and robustness to the choice of the pruning threshold are less concerning with large sample sizes.
In the second simulation, we experimented with different numbers of causal variants. We used the SHBG region, simulated under the "strong instruments" scenario, and assumed that the region contains either one or three genetic variants causally associated with the risk factor. The performance of the various methods (Supporting Information Table 2) was similar to that observed in our baseline simulations with six causal variants, suggesting that the number of causal signals in the region has little impact on their performance relative to each other, except for the single-variant approach, which, as expected, performed better in simulations with only one causal variant.
In our third Supporting Information simulation, we assessed the impact of only having access to a small reference data set on the various methods. In each iteration, we bootstrapped the UK Biobank data to obtain a small reference sample of size N ref = 1000 and implemented cis-MR using that reference data set (Supporting Information Table 3).
The JAM algorithm was the most sensitive method to this change. The algorithm selected more variants on average, exhibited slight differences in its performance for different values of the correlation threshold, and performed poorly for ρ ≥ 0.9. In such data sets, we recommend using a more stringent correlation threshold for the algorithm. The other methods were less affected in comparison.
We refer to the Supporting Information for more details on these simulations.

5.1 The causal effect of LDL-cholesterol on CHD risk using variants in the HMGCR region

Introduction-We now compare the various cis-MR methods in two real-data applications. In the first application, we use MR to investigate the relationship between low-density lipoprotein (LDL) cholesterol concentration and the risk of CHD using genetic variants from the HMGCR region. The HMGCR region is located on chromosome 5 and encodes the protein HMG coenzyme-A reductase. The protein plays an important role in the metabolic pathway that produces cholesterol, and its inhibition by statins is a common treatment to reduce LDL-cholesterol levels.
This analysis is performed mainly for illustrative purposes: the causal connection between LDL-cholesterol and CHD risk has already been explored in several papers in the literature. Many of these papers have been genome-wide MR studies (Allara et al., 2019; Burgess et al., 2014; Holmes et al., 2014; Linsel-Nitschke et al., 2008; Waterworth et al., 2010).
In the field of cis-MR, associations between HMGCR variants and a range of biomarkers have been used to investigate whether statin treatment increases the risk of type 2 diabetes (Swerdlow et al., 2015). In addition, cis-MR analyses have used the HMGCR region as a benchmark to assess the suitability of other genetic regions as potential drug targets for CHD risk; this includes the PCSK9 (Ference et al., 2016) and ACLY (Ference et al., 2019) regions. The HMGCR region was also used as an applied example in Schmidt et al. (2020).

Data sets and methods-Since a clear link between HMGCR protein expression and the lowering of LDL-cholesterol has been established, we used LDL-cholesterol as the risk factor instead of protein expression. We estimated genetic associations between variants in the HMGCR region and LDL-cholesterol using data from the UK Biobank. Genetic associations were computed based on a sample of N1 = 349,795 unrelated individuals of European ancestry. Associations with CHD risk were obtained from the CARDIoGRAMplusC4D consortium (CARDIoGRAMplusC4D Consortium, 2015), based on a sample of N2 = 184,305 individuals. Finally, genetic correlations were computed using individual-level data for 367,643 unrelated Europeans from the UK Biobank (the sample size for genetic associations with LDL-cholesterol was slightly smaller than the reference sample due to missing values).
We used a wider region than in our simulations and extracted information on variants within 500 kb of the HMGCR gene. In total, 2915 variants were present in both the UK Biobank and the CARDIoGRAMplusC4D data set. Of those 2915 variants, 20 were discarded because they either had missing associations with LDL-cholesterol in the UK Biobank data set or concerned multiple alleles at the same locus that were not detected in both data sets. We did not discard variants with low effect allele frequencies. Our analysis was therefore based on P = 2895 genetic variants. In Figure 3, we visualize the genetic correlations in the HMGCR region and present a Manhattan plot of genetic associations with LDL-cholesterol levels.
We then conducted cis-MR using the minimum p value variant, LD-pruning, principal components analysis, the JAM algorithm, and the factor-based methods. The various methods were implemented using a range of different parameter specifications, similar to those used for our simulation study. For stepwise pruning, we used the values ρ = 0.1, 0.3, 0.5, 0.7, 0.9 for the correlation threshold and a GWAS significance threshold of τ = 5 × 10 -8 . Of the 2895 variants in the region, 1424 had p values below the GWAS threshold. For the PCA method, we used principal components that explained k = 99% or k = 99.9% of variation in the genetic data. For JAM, we implemented a prepruning step with a correlation threshold of ρ = 0.6, 0.8, 0.9, 0.95 and no significance threshold. For F-LIML, we allowed the algorithm to determine the number of latent factors to use.

Results-The results of our analysis are presented in Table 3. We report causal effect estimates, standard errors, and 95% confidence intervals obtained by each method. The reported estimates represent log-odds ratios for CHD per standard deviation increase in LDL-cholesterol levels.
Genetically elevated LDL-cholesterol concentration based on variants in the HMGCR region is known to be associated with increased risk of CHD (Cholesterol Treatment Trialists, 2010; Ference et al., 2019). This was confirmed by the pruning, JAM, F-LIML, and CLR methods. For example, the JAM algorithm using a prepruning threshold of ρ = 0.8 suggested a log-odds ratio of 0.355 (odds ratio 1.426, 95% confidence interval [CI] = (1.110, 1.833)). Interestingly, the principal components method suggested a null effect. The main difference in the implementation of the method compared to our simulation study was the larger number of genetic variants used here. Since the presence of many noncausal variants can exacerbate bias due to weak instruments, we conjecture that this was the issue behind the method's poor performance. To confirm our suspicions, we implemented the PCA method using only the 1424 GWAS-significant variants in the data set. The method produced much more reasonable results, in line with other methods: the log-odds ratio estimates were 0.423 for k = 99% and 0.448 for k = 99.9%, and the null causal hypothesis was rejected on both occasions.
The remaining methods were more consistent in their results. LD-pruning exhibited the usual attenuation toward the null for ρ = 0.7 but yielded a larger causal effect estimate for ρ = 0.9, as a result of numerical errors. Bias due to numerical errors is more likely to be a concern here than in our simulation study, due to the larger number of variants used. For the JAM algorithm, there were small differences between implementations for ρ = 0.6 and larger values of the correlation threshold, but overall the algorithm was more robust to the choice of that threshold than LD-pruning. The algorithm suggested the existence of three to five genetic variants independently associated with LDL-cholesterol in the region. The F-LIML method suggested a slightly higher causal effect than the other methods, although differences were within the margin of statistical error. As an additional form of sensitivity analysis, we implemented both JAM and F-LIML using only GWAS-significant variants; results were similar to those in Table 3.
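For reference, the odds ratios and confidence intervals quoted above follow from exponentiating the log-odds estimates. The Python sketch below illustrates this for the JAM estimate, assuming a standard error of about 0.128 (back-calculated by us from the quoted interval, not taken from Table 3).

```python
import math

def odds_ratio_ci(log_or, se, z=1.96):
    """Convert a log-odds-ratio estimate and its standard error into an
    odds ratio with a (approximate 95%) confidence interval."""
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# JAM estimate quoted above: log-odds ratio 0.355, assumed SE of about 0.128
or_, lo, hi = odds_ratio_ci(0.355, 0.128)
```

This recovers an odds ratio of about 1.426 with an interval of roughly (1.110, 1.833), matching the figures reported in the text.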

5.2 The causal effect of testosterone on CHD risk using variants in the SHBG region

Introduction-In our second application, we apply MR using variants in the SHBG gene region to assess the causal relationship between testosterone levels and CHD risk. The SHBG region is located on chromosome 17 and encodes sex-hormone binding globulin, a protein that inhibits the function of sex hormones such as testosterone and estradiol. The region has been shown to contain strong genetic associations with testosterone levels, as well as a number of sex hormone traits (Coviello et al., 2012; Jin et al., 2012; Ruth et al., 2020; Schooling et al., 2018). In addition, previous research has suggested that the region is likely to contain several genetic variants independently associated with testosterone, with Coviello et al. (2012) claiming that as many as nine independent signals may be present. This implies that a naive approach using only the variant with the smallest p value would underestimate the genetic contributions to testosterone levels. The causal relationship between sex hormone traits and CHD risk using variants in the SHBG region has been studied by  and Schooling et al. (2018), with evidence mostly suggesting no causal relationship. Here, we aimed to replicate the analysis of  using larger data sets.

Data sets and methods-
We used serum testosterone levels as the exposure for our MR analysis. We obtained genetic associations with testosterone using summary-level data from the Neale Lab website. These summary data were derived from the UK Biobank, using a sample of N1 = 312,102 unrelated individuals of European descent. We defined the genetic region to include genetic variants within 500 kb on either side of the SHBG gene.
As in the previous application, we obtained genetic associations with CHD risk from the CARDIoGRAMplusC4D data set (CARDIoGRAMplusC4D Consortium, 2015), using a sample of N 2 = 184, 305 individuals. A total of 3053 genetic variants were present in both data sets and were therefore included in our analysis. We did not conduct separate analyses on males and females, since we did not have access to sex-specific associations with CHD risk. Finally, we used the UK Biobank as a reference data set from which to extract LD correlations between genetic variants. Figure 4 presents a Manhattan plot of associations with testosterone, as well as the genetic correlations in the region.
We then implemented the various cis-MR methods to select genetic instruments and assess whether SHBG variants are causally associated with CHD risk. We used the same setting as in the HMGCR application for the tuning parameters of each method. The stepwise pruning method was implemented using only variants with GWAS-significant associations with serum testosterone levels. A total of 1156 variants had p values lower than the 5 × 10 -8 threshold. The remaining methods were implemented using all 3053 variants in the region. Table 4 reports causal effect estimates, standard errors and 95% confidence intervals for each method. Once again, estimates are reported in the log-odds ratio scale and represent increases in CHD risk per standard deviation increase in serum testosterone levels. All methods with the exception of top-SNP and LD-pruning at 0.7 or 0.9 suggested no causal relationship between testosterone and CHD risk based on the SHBG region, although point estimates were consistently in the risk-decreasing direction.

Results
In this example, the performance of the various methods was less seriously affected by weak instrument bias because there was apparently no causal relationship between serum testosterone levels and CHD risk. This is consistent with our simulation design, where no bias was observed under the "null causal effect" scenario.
A notable inconsistency was that an MR analysis using only the genetic variant with the smallest p value in the region suggested a risk-decreasing, statistically significant causal effect. This may be suggestive of heterogeneity in evidence from different variants within the SHBG region. Therefore, this example empirically demonstrates the pitfalls of using simplistic single-variant analyses when in fact multiple signals exist within a region. The top variant in our analysis was rs1799941, which is known to be associated with testosterone levels (Ruth et al., 2020).
The LD-pruning method was rather inconclusive in this application. Implementations with a low correlation threshold suggested no causal effect. However, the method suggested an effect in the risk-decreasing direction for ρ = 0.7 and in the risk-increasing direction for ρ = 0.9. This was combined with a rather sharp increase in precision around the causal estimates. As in our previous application, this is indicative of numerical issues in computing the correlated-instruments IVW estimate and its standard error.
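The numerical issues described here arise in the correlated-instruments IVW estimator, which requires solving a linear system involving the LD matrix. The following is a minimal sketch of that estimator, not the exact implementation used in our analysis; `bx`, `by`, and `se_y` are hypothetical arrays of variant-exposure associations, variant-outcome associations and outcome standard errors, and `rho` is the LD correlation matrix:

```python
import numpy as np

def ivw_correlated(bx, by, se_y, rho):
    """Correlated-instruments IVW sketch: the estimate solves a weighted
    regression of by on bx with covariance Omega_jk = se_yj * se_yk * rho_jk."""
    omega = np.outer(se_y, se_y) * rho
    # np.linalg.solve avoids forming an explicit inverse, but an
    # ill-conditioned rho still yields unstable estimates and standard errors
    w = np.linalg.solve(omega, bx)
    theta = (w @ by) / (w @ bx)
    se = np.sqrt(1.0 / (w @ bx))
    return theta, se, np.linalg.cond(rho)
```

Monitoring the condition number of `rho`, as returned above, gives a quick diagnostic for the instabilities observed at high pruning thresholds.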
The principal components approach suggested no causal association for both values of its tuning parameter. JAM did the same and was once again consistent with respect to the value of its tuning parameter. The algorithm suggested the existence of about eight to nine independent signals in the region, confirming Coviello et al. (2012). The F-LIML estimate and the CLR confidence interval were in line with the results obtained by JAM.
Our results obtained in this application are similar to those reported by . Similar to the results reported in Table 4, pruning estimates in that paper suggested no causal effect for small correlation thresholds but were unstable for large ρ (in fact, they were more unstable than in our results, possibly due to the smaller sample sizes used in that paper). Estimates from the PCA method were in the risk-decreasing direction but did not achieve statistical significance. Overall, both  and our analysis suggested no causal association between testosterone and CHD risk based on variants in the SHBG region; further evidence of no causality was provided by Schooling et al. (2018), which also used data from the UK Biobank.

Transcriptome- and Proteome-Wide Association Studies (TWAS and PWAS)
Before concluding our paper, we acknowledge the connections between cis-MR studies and two closely related approaches that have emerged in the literature in recent years, namely TWAS and PWAS. TWAS studies (Gamazon et al., 2015; Gusev et al., 2016) aim to identify genes associated with an outcome trait of interest, using eQTL variants to instrument gene expression. This is done through a two-step process: first, eQTL variants for a target gene are identified from existing databases and combined to create a genetic score predicting gene expression; second, the outcome is regressed on predicted expression levels to explore whether the gene relates to the outcome. The original TWAS implementation used individual-level data (PrediXcan; Gamazon et al., 2015), but methods for summary-level outcome data have been developed since then (Barbeira et al., 2018; Gusev et al., 2016).
PWAS studies (Brandes et al., 2020) perform effectively the same analysis as TWAS, except they use protein expression data for protein-coding genes instead of gene expression. PWAS use variants that affect the coding regions of genes as instruments and use a similar two-stage process to assess whether protein expression relates to the outcome studied.
The similarities between cis-MR and TWAS/PWAS studies are obvious. All three approaches aim to identify genetic associations with an outcome of interest at a locus level, and use a similar approach to do so (apart from the difference between using gene expression or protein expression data). In addition, the inference procedure used by TWAS/PWAS methods is effectively a two-stage least squares algorithm. Regarding variable selection, TWAS/PWAS also face the issue of having to select among a group of correlated genetic variants, but these approaches often assume access to individual-level data for gene expression, and the corresponding methods are often linked to individual-level eQTL data sets. This gives TWAS methods greater flexibility to overcome the variable selection issues discussed in the current manuscript.
A few minor differences exist in how TWAS and PWAS studies have been utilized so far in the literature compared to cis-MR studies. For example, cis-MR is typically conducted with a focus on a specific gene region of interest, while TWAS studies are agnostic to the choice of region and are often implemented across the genome. In addition, although cis-MR studies regularly use protein expression data as an exposure, the methods developed for cis-MR can be extended to any MR analysis with correlated instruments. However, these differences should not distract from the strong similarities between cis-MR and TWAS/PWAS studies.
Although the methods were developed under different names, they are effectively different implementations of the same approach to investigate causal links between gene regions and downstream outcomes.
A detailed description of the methods available for TWAS/PWAS studies is beyond the scope of this manuscript. Instead, we refer to the relevant literature for an overview of methods (Li & Ritchie, 2021) and challenges (Wainberg et al., 2019) for such studies.

Discussion
Cis-MR is emerging as a widely applicable approach to support drug development and inform clinical trials. In this manuscript, we have outlined and compared a range of methods from the cis-MR literature to select variants from a genetic region of interest and estimate the MR causal effect. In particular, we have considered MR using only the minimum-p-value variant; LD-pruning, which is one of the most commonly used approaches in practice; principal components analysis; factor-based methods; and stochastic-search variable selection via the JAM algorithm. We compared the various methods across a range of simulation scenarios, which were based on real genetic correlation structures of two different gene regions drawn from the UK Biobank. We also assessed the performance of the different methods in two cis-MR case studies, investigating the effect of LDL-cholesterol on CHD risk using variants in the HMGCR region and the effect of testosterone on CHD risk using variants in the SHBG region.
We have found that LD-pruning can be reliable when using large sample sizes and strong genetic instruments, but can be susceptible to weak instrument bias. Moreover, results obtained using LD-pruning can be sensitive to the correlation threshold used, and using a high correlation threshold can result in numerical instabilities. Based on these results, we would therefore recommend that when the pruning method is used, it is implemented using a range of correlation thresholds as a form of sensitivity analysis.
Although not completely free from weak instrument bias, methods such as PCA, F-LIML, the JAM algorithm and the conditional likelihood ratio test were less affected by such bias than LD-pruning. JAM was also more robust to the choice of its tuning parameter than pruning, while PCA and F-LIML achieved good performance in a range of different scenarios with the default values for their tuning parameters. Therefore, we recommend that some of these methods are implemented as sensitivity tools in applied cis-MR analyses.
Compared to each other, JAM, PCA, F-LIML and CLR exhibited similar performance in simulations with strong genetic instruments. With weak instruments, the CLR test was least affected by weak instrument bias and provided a valid test for the causal null hypothesis, albeit without point estimates. F-LIML obtained accurate causal effect estimates but yielded confidence intervals with poor coverage, while JAM had biased causal effect estimates but better uncertainty quantification. The principal components IVW approach was affected by weak instrument bias to a greater extent than the other methods, although our real-data applications suggest that PCA can still perform well if very weak variants are removed from the analysis before its implementation.
The use of multiple methods as a form of sensitivity analysis can increase the reliability of results of a cis-MR study. This can be further reinforced by triangulating evidence from different study designs (Gill et al., 2021). In addition to TWAS and PWAS studies discussed earlier, this can include colocalization, which has been shown to yield benefits when used in conjunction with MR to prioritize proteins for drug development (Zheng et al., 2020). Here, we have focused on MR and have not discussed colocalization in detail; we refer to a recent review comparing the two approaches instead (Zuber et al., 2022).
Our analysis has a number of limitations. Our simulations were by no means exhaustive; for example, they were based on only two genetic regions. In addition, our simulations have not considered other forms of bias that may arise in cis-MR applications. This includes pleiotropic bias, as discussed in Section 2, as well as selection bias, population stratification or winner's curse bias. The latter is worth discussing further, as it could have affected our simulation study. Winner's curse bias occurs when selection of variants into a study is conducted in the same data set as estimation of instrument-risk factor associations. It has been shown to affect MR studies using independent SNPs as instruments (Sadreev et al., 2021), increasing the magnitude of G-X associations and potentially exacerbating the effects of weak instrument bias. These conclusions also apply to methods that employ variable selection in cis-MR studies. PCA and factor analysis do not perform variable selection explicitly, but may still suffer from a similar type of bias if estimation of the principal component weights or factor loadings is performed in the same data set as estimation of G-X associations. To address this issue, applied analyses should aim to perform variable selection and estimation of instrument-risk factor associations in separate samples, if possible.
Another limitation is that our work has focused on cis-MR methods that can be implemented using two-sample summary-level data. Our methods and results can be extended to one-sample summary-data designs, with a few differences (e.g., weak instrument bias acts towards the direction of the observational association in this case) and the additional note that IVW-based methods can suffer from bias due to sample overlap. On the other hand, with access to individual-level data, a range of additional methods such as two-stage least squares and control function approaches can be implemented. These methods might suffer from similar issues as the summary-level methods outlined here, because they also rely on inverting the genetic correlation matrix to perform inference, but may be able to avoid some of the secondary challenges discussed in this paper, such as the potential bias induced by inaccurately estimating the genetic correlation matrix in a reference data set. It is also possible to compute summary statistics from one-sample individual-level data and then implement the methods presented here, though this may also be biased by sample overlap as mentioned earlier.
Cis-MR is an active area of research, both methodological and applied. As the field develops further, it is likely that new statistical approaches will be created to aid applied investigators in their work. By providing an overview and evaluation of existing methods, we hope that our current paper will contribute to that direction.

Supplementary Material
Refer to Web version on PubMed Central for supplementary material.
A causal diagram representation of the three assumptions of Mendelian randomization. Here, X represents the risk factor, Y the outcome, G the genetic instrument and U denotes confounders of the X-Y relationship.

Directed acyclic graph illustrating why pleiotropic bias is less likely in MR studies of protein expression than in studies using downstream phenotypic traits as risk factors. Here, G denotes the genetic instrument, P protein expression, B the downstream biomarker, Y the outcome and U denotes confounding factors. Both types of MR studies are affected by pleiotropy due to "direct" effects of the genetic instrument on the outcome (G → Y pathway), but standard MR analyses are also subject to pleiotropy due to potential effects of the protein on the outcome (G → P → Y pathway), which is not the case for studies of protein expression. MR, Mendelian randomization.

Performance of cis-MR methods in simulations for various values of the causal effect parameter θ, using genetic variants from two gene regions (SHBG and HMGCR) and "strong" genetic instruments (corresponding F statistics > 10).

Performance of cis-MR methods in simulations for various values of the causal effect parameter θ, using genetic variants from two gene regions (SHBG and HMGCR) and "weak" genetic instruments (corresponding F statistics < 10).

Table 4. Results from various cis-MR methods for the real-data analysis of the effect of serum testosterone levels on CHD risk, using genetic variants in the SHBG region.