simmrd: An open‐source tool to perform simulations in Mendelian randomization

Mendelian randomization (MR) has become a popular tool for inferring causality of risk factors on disease. There are currently over 45 different methods available to perform MR, reflecting this extremely active research area. It would be desirable to have a standard simulation environment to objectively evaluate the existing and future methods. We present simmrd, an open‐source software for performing simulations to evaluate the performance of MR methods in a range of scenarios encountered in practice. Researchers can directly modify the simmrd source code so that the research community may arrive at a widely accepted framework for researchers to evaluate the performance of different MR methods.


| INTRODUCTION
Mendelian randomization (MR) is a genetic instrumental variable method that has become a popular statistical tool for estimating causal effects of risk factors on disease (Richmond & Davey Smith, 2022;Sanderson, Glymour, et al., 2022).The validity of inferences made using MR relies heavily on the satisfaction of three primary assumptions (de Leeuw et al., 2022): the genetic instruments are (i) strongly associated with the exposure(s), (ii) not associated with the outcome conditional on the exposures, and (iii) not associated with any confounders of the exposure(s)-outcome relationship(s) (Davies et al., 2015;Hemani et al., 2018;Morrison et al., 2020;Sanderson et al., 2021).Violating any of these assumptions can lead to bias in causal estimation, which may inflate false positive and/or false negative rates (Bowden et al., 2015).Three additional challenges are present in MR analyses: (a) some individuals may have been present in both the exposure(s) and outcome Genome-Wide Association Study (GWAS) which leads to sample overlap bias (Burgess, Davies, et al., 2016), (b) applying strict instrumental variable (IV) selection criteria to satisfy assumption (i) above can introduce a "winner's curse" bias (Jiang, Gill, et al., 2022), and (c) strong linkage disequilibrium (LD) between IVs (Gkatzionis et al., 2023), and its imprecise estimation using reference panels, can inflate false positive rates (Gleason et al., 2021).
At least 45 statistical methods for performing MR have been introduced in the literature to address different subsets of assumptions (i)-(iii) and scenarios (a)-(c) in either the single-exposure (univariable) or multiple-exposure (multivariable) settings (Boehm & Zhou, 2022).Clearly, the literature is saturated with a variety of MR estimators.In each of the initial studies introducing these methods, simulation was performed to evaluate the performance of new and existing methods.The variability of simulation settings reported in the literature is vast.Some use 50 or fewer single-nucleotide polymorphisms (SNPs) to explain most of phenoptypic (Verbanck et al., 2018;Zhu et al., 2021) or genetic (Lin et al., 2023) variance while others have used hundreds (Lorincz-Comi et al., 2023) or even thousands (Mounier & Kutalik, 2023) of SNPs.Additionally, some have used GWAS sample sizes no larger than 500 (Deng et al., 2022) while others have used GWAS sample sizes no less than 100k (Xu et al., 2021).For example, some have 50 or less IVs in simulation (Lin et al., 2023;Sanderson et al., 2019;Sanderson, Richardson, et al., 2022), but experience with real data analyses with multiple exposures suggests that hundreds or even thousands of SNPs meeting the valid IV conditions may be identified (Davies et al., 2019;Lorincz-Comi et al., 2023).Since weak IV bias generally becomes more severe as more IVs are used (Davies et al., 2015;Lorincz-Comi et al., 2023), multivariable Mendelian randomization (MVMR) simulations in which very few IVs are used may unrealistically reflect optimal performance of MVMR methods which may break down in practice.
Some of these issues arise from researchers reproducing the marginally realistic simulation settings of others as described above (e.g., Zhu et al., 2021, reproduced Verbanck et al., 2018;Lin et al., 2023, reproduced Grant & Burgess, 2021).Other challenges may exist because of a unified framework for performing simulations in MR.Both explanations have produced a literature of results potentially sensitive to the unique simulation conditions that may or may not mirror reality.Indeed, some independent simulations intended to mimic the same reality with nearly identical reported simulation settings have even produced conflicting results.As example, the literature contains some conflicting evidence about the bias in dIVW in the absence of sample overlap and horizontal pleiotropy (Mounier & Kutalik, 2023;Ye et al., 2021); the power of MR-Robust in the presence of 50% invalid IVs, being either less than 20% or greater than 80% (Lin et al., 2023;Slob & Burgess, 2020); or MR-RAPS having Type I error greater than 80% or controlled at 5% in the presence of horizontal pleiotropy (Qi & Chatterjee, 2021;Xue et al., 2021).
We introduce simmrd, an open-source and flexible tool for generating data to use in MR simulations.The last time a tool such as this was introduced was MR_predictor 10 years ago (Voight, 2014), however, MR_predictor cannot accommodate horizontal pleiotropy, sample overlap, multiple exposures, weak instruments, correlated IVs, imprecise LD estimation, or winner's curse.simmrd can accommodate each of these and even more scenarios simultaneously in either the univariable Mendelian randomization (UVMR) or MVMR settings.With no modern standard set in the literature, the purpose of the open-source simmrd software is to begin the construction of a standard simulation environment in which fair comparisons can be made between competing MR estimators in different settings that are encountered in practice.

| SIMULATION MODELS
The simulation data-generating process is based on the directed acyclic graph (DAG) in Figure 1.This DAG can be represented by the following structural models: (3) where U is a confounder of the relationship between exposure(s) x and outcome Y , θ represents the corresponding causal effect(s) of the exposure(s) on the outcome, B represents true associations between g and x, and γ C and γ U are respectively correlated horizontal pleiotropy (CHP) and uncorrelated horizontal pleiotropy (UHP) effects.The aim of MR is to estimate θ using GWAS summary data corresponding to B and α, which at minimum includes association estimates and their corresponding standard error estimates.GWAS summary data can be generated either by drawing individual-level data from Equations (1) to (4) above then performing GWAS on them, or by directly generating the GWAS summary statistics from fully specified distributions which place some assumptions on the models in Equations ( 1)-(4).Since in some cases users of simmrd may want the fully flexibility of the models above, but in others only require a simplified datagenerating process which is more computationally efficient, we provide users with both options.Users can indirectly generate GWAS summary data from individuallevel data using the generate_individual() function, or directly generate GWAS summary data using the generate_summary() function.Algorithm 1 shows the complete data-generating process that generate_individual() uses and Algorithm 2 shows the data-generating process that generate_- summary() uses.Section 5 provides more computational details and outlines the steps required to use each function.The parameters that users can set in the generate_individual() and generate_summary() functions is presented in Figure 2.This figure demonstrates that generate_individual() allows users with greater flexibility in specifying the exact relationships of the confounder with the exposures and outcome compared to generate_summary().These two approaches also differ in their setting of the true causal effect parameters.Using generate_individual (), users specify pairwise conditional variances to parameterize the models in Equations ( 1)-(4).Model parameters are then exactly determined by assuming the exposures, confounder, and outcome each have variance 1.Using generate_summary(), users can directly specify the true causal effect sizes.Additionally, for computational simplicity, all exposures GWAS samples generated using generate_individual() completely overlap, whereas generate_summary() can accommodate an arbitrary overlap structure between all exposures and the outcome.
This figure provides a conceptual overview of the simmrd data-generating process and the parameters to set when using the generate_individual() (in blue) or generate_summary() functions (in red).Brief descriptions of how each parameter is set and used is provided in the appendix and complete examples of how to generate simulated data using generate_summary() and generate_individual() are available at https://github.com/noahlorinczcomi/simmrd.
F I G U R E 1 This DAG is used to produce the simulation models described in Section 2. Users of simmrd can modify all parameters that are present in this DAG using the generate_individual() function, or most using the generate_summary() function.The parameter B represents true genetic associations between the SNP genotypes in g and each of the exposures in x, and θ represents the true causal effects of the exposures on the outcome Y conditional on confounding U , whose effects on the exposures and outcome are respectively π x and π Y .The parameter γ C represents correlated horizontal pleiotropy and γ U represents uncorrelated horizontal pleiotropy.
Algorithm 1: Pseudo-code of simulated GWAS summary-level data using the generate_individual() function. Require: × matrices of phenotypic and genetic correlation between the exposures R xx and R ββ , respective exposure(s) and outcome GWAS sample sizes n 1 and n 0 and their proportion of overlap p 01 , the number of causal exposure SNPs M and their LD structure R, the number of individuals in the LD reference panel n ref , the number of UHP and CHP causal exposure SNPs m U and m C , the variance in Y and U respectively explained by UHP and CHP SNPs σ γ 2 U and σ γ 2 C , LD pruning threshold for IV selection κ 2 , the simulation type (either winner's curse or weak IVs), the type of standardization to apply to the MR summary data (Z-score standardization or no standardization), the p-value threshold for IV selection if performing winner's curse simulations τ , the F-statistic for IV strength of association with exposures (F) if performing a weak instrument simulation.Preliminaries: All phenotypes g Y U x ( , , , ) ⊤ ⊤ will have mean 0 and variance 1. UHP, CHP, and SNPs that are otherwise valid IV candidates are put into mutually exclusive groups (see Figure 2).
and standardize to g , where , and re-scale the columns such that and re-scale each to achieve Step 6. Define sets x  and Y  of simulated individuals to respectively use in the exposure and outcome GWAS, of which p n n × min ( , ) 01 0 1 elements will overlap Step 7. Perform GWAS using OLS on phenotypes Y Y  and x x


Step 8. Standardize all GWAS summary data according to user input (one of "z" or "none") Step 9. Perform IV selection according to LD threshold κ 2 , p-value thresohld τ , and F-statistic F to obtain IV set  Step 10.Draw the estimated LD matrix for set of IVs from n R Wishart( , ) ref  and standardize to be the correlation matrix R   Ensure: The set of IVs  ; their standardized GWAS estimates for the exposures (B   ) and the outcome (α   ) as the R objects bx and by; the corresponding GWAS-estimated standard errors B SE( )    and α SE( )    as bxse and byse; indications for each IV if it was generated as a valid, UHP, or CHP SNP in IVType; true causal effects θ as theta; the matrix of correlations between GWAS estimation errors Ω as RhoME, with elements corresponding to the outcome in the first row and column; the matrices of true and estimated LD between the IVs R  and R   as LDMatrix and LDhatMatrix.
Algorithm 2: Pseudo-code of simulated GWAS summary-level data using the generate_summary() function.
Require: The number of exposures (p), GWAS sample sizes for all exposures ( ⊤ ) and the outcome (n Y ) and their proportions of overlapping participants ( ⊤ ), the p p × matrix of overlap proportions between exposure GWASs (K xx ), the matrix of phenotypic correlations between the outcome and all exposures (R Y x ), SNP heritability of the exposures the matrix of genetic correlation between the exposures (R ββ ), the number of SNPs causally associated with the exposures (M ), the number of causal SNPs which have UHP (m U ) and CHP (m C ) effects, the proportion of total MR variance (see Preliminaries for a definition) due to UHP (ρ U ) and CHP (ρ C ), the correlation between CHP and non-CHP effects (ζ C ), the p-length vector of true causal effects (θ), LD pruning threshold for IV selection (κ 2 ), simulation type (either winner's curse or weak IVs), type of standardization to apply to the MR summary data (Z-score standardization or no standardization), p-value threshold for IV selection if performing winner's curse simulations (τ ), F-statistic for IV strength of association with exposures (F) if performing a weak instrument simulation, the LD structure (R) between the M causal SNPs and the number of independent LD blocks (s), and the sample size of the LD reference panel (n ref ).
Preliminaries: Data is generated under the assumption that all phenotypes Y U x g ( , , , ) ⊤ ⊤ have mean 0 and variance 1. UHP, CHP, and SNPs that are otherwise valid IV candidates are put into mutually exclusive groups (see Figure 2).Where the true MR model for the jth SNP is for M SNPs then fix all but the first m U elements to be 0.
Step 4. Draw CHP effects from then fix all but the next m C elements to be 0, which ensures that UHP and CHP SNPs do not overlap.The purpose of adding the factor θ (− + ζ B 1 ) C is to make the correlation between θ B and γ C negative by default, where the magnitude of this correlation is controlled by the user-specified value of ζ C .See Figure 4 for an example when ζ C was −0.5.
Step 7. Calculate the GWAS estimates α α + w = , where χ a ( ) is random chi-square distributed variate with a degrees of freedom.
Step 9. Apply LD and p-value pruning at the LD threshold κ 2 and p-value threshold τ , or reduce the set of SNPs to just IVs which achieve an average F-statistic of F across each exposure.The new set of SNPs which serve as IVs are placed into the set  .
Step 10.Standardize GWAS estimates according to user input Step 11.Draw the matrix of estimated LD correlations between the IVs R  If the user sets n ref to infinity, then this step is skipped and R   is equal to R  .
Ensure: The set of IVs  ; their standardized GWAS estimates for the exposures (B   ) and the outcome (α   ) as the R objects bx and by; the corresponding GWAS-estimated standard errors B SE( )    and α SE( )    as bxse and byse; indications for each IV if it was generated as a valid, UHP, or CHP SNP in IVType; true causal effects θ as theta; the matrix of correlations between GWAS estimation errors Ω as RhoME, with elements corresponding to the outcome in the first row and column; the matrices of true and estimated LD between the IVs R  and R   as LDMatrix and LDhatMatrix.

| SIMULATION SETTINGS
In this section, we describe the conditions under which GWAS summary data can be generated and how some particular conditions are achieved.Users can generate GWAS summary data that spans a wide spectrum of scenarios defined by GWAS sample overlap, instrument strength, UHP, CHP, LD correlations between IVs, imprecise LD estimation, IV selection, and direct versus indirect causal effect sizes.To make the standardization of simulations performed by the broader community of MR researchers, we have also made a comprehensive set of default setups available for researchers to directly use (see Section 4 for more details).To achieve weak instruments, users can specify the exact F-statistic which their IV set achieves.In the case of multiple exposures, users specify the mean F-statistic which each exposure achieves in the MVMR IV set.Exact F-statistics are achieved by excluding SNPs with the largest effect sizes, thus also reducing the exposure heritability explained by the IVs.Correlated SNPs are generated by randomly sampling the true SNP-exposure(s) associations β B = ( ) from a matrix normal distribution with rowwise covariance that is the LD matrix R and column-wise covariance that is the matrix of genetic correlations between the exposures, Σ ββ .For any covariance matrix input, including the LD matrix, users can specify either independence, autoregressive, or Toeplitz structure using simple string expressions, some examples of which are presented in Figure 3.We finally note here that MR methods which rely on genome-wide GWAS summary statistics (i.e., not just those for the selected IVs) for reasons other than to calculate α B Cov( , )   may not be able to solely use simmrd in their simulations, though it may still provide some support when, for example, creating block-diagonal LD matrices, fixing the F-statistic of the IV set, or pruning SNPs by LD and genetic association p-value.Such methods generally include Bayesian estimators which require genome-wide summary statistics to set and/or update parameterizations of prior distributions (see the appendix for a list).simmrd is technically capable of generating millions of SNP-phenotype association estimates for multiple traits without performing IV selection, but it will not generate any SNP-phenotype associations under null models in which the true association effect size is 0. MR methods which rely on genome-wide GWAS summary statistics only for the purposes of calculating Figure 4 shows an example of data generated using generate_summary() for three exposures and how simmr integrates with the popular Mende-lianRandomization R package (Yavorska & Burgess, 2017).The top panel of this figure displays scatterplots which are the direct output of the plot_- simdata() function included with the simmrd R package.For multiple exposures, plot_simdata() will return the both the bivariate SNP-exposure and SNPoutcome association estimates using exposure-specific IV sets and the association between the linear predictor θ B   and α .A first-order autoregressive covariance structure with correlation parameter a is generated by using "ar1(a)," a Toeplitz structure by using "toeplitz," and an independence structure using "I."Multiple parameters in simmrd receive the specification of a covariance structure as input, and a complete list of all parameters and their input is presented in the appendix for both the generate_ individual() and generate_summary() functions.
simmrd is an R package (R Core Team, 2021) and can be installed from our Github repository: https://github.com/noahlorinczcomi/simmrd.The simmrd software is open-source and intended to be iteratively improved by researchers to allow for greater functionality as new methodological developments are made.Changes to simmrd are automatically, and earlier versions can be restored at any time.At our Github F I G U R E 4 (See caption on next page).
| 65 repository, we have provided the set of default simulation settings in the default_setups folder which researchers can use to perform simulations under pre-defined settings.These setups are stored in files containing parameters to give the generate_- summary() function and cover five basic scenarios: (i) CHP, (ii) UHP, (iii) UHP + CHP, (iv) weak IVs, (v) UHP + CHP+weak IVs.Within each scenario, there are also multiple conditions of GWAS sample size (30 vs. 100K), GWAS sample overlap (0% vs. 100%), number of causal exposure SNPs (100 vs. 500), and number of exposures (1 vs. 3).A complete tutorial of how to use simmrd is also available on our Github repository.

| INPUT AND OUTPUT
Whether users of simmrd are generating GWAS summary data using individual-level data genera- te_individual() or just directly using gener-ate_summary(), they are required to specify a named list of input parameters.The named items in this The top panel of this figure shows an example of the plots that plot_simdata(data,params) produces where params is a user-defined named list of input parameters and data is the stored output of either generate_individual(params) or generate_summary(params).Displayed on the x-axis of the top-left plot is the standardized linear predictor used in multivariable MR ( θ B ˆˆ) and the y-axis is the estimated association of the IVs with the outcome in standardized scale.The blue line is the IVW estimate using only the valid IVs (blue points), whereas the gray line is the IVW estimate using all IVs (blue, orange, and green points).All other plots are scatterplots of estimated SNP-exposure and SNP-outcome associations using only IVs that were selected separately for each exposure using the criteria specified by the user for MVMR.Displayed under the titles of each plot are F-statistics representing instrument strength (Burgess & Thompson, 2011).For the top-left plot which includes multiple exposures, the displayed F-statistic is the median F-statistic across all exposures using the full set of MVMR IVs.Determinations of the IV types ("valid," "CHP," or "UHP") correspond to their status in the full MVMR IV set and not in the exposure-specific IV sets.The left figure in bottom panel contains R code to generate the figure in the top panel and numerical results in the table on the right, showing an example of how simmrd integrates with the popular MendelianRandomization R package (Yavorska & Burgess, 2017).The object params of simmrd input parameters is abbreviated to save space but is available in complete form in the example_params.R file on our Github repository.Note, the MR-CML (Lin et al., 2023) method assumes that the elements of the GWAS estimation error correlation matrix Ω corresponding to the outcome are in the bottom-right position, whereas simmrd automatically places them in the top-left position, hence why the object RhoME was created in the R code.
T A B L E 1 Displayed values are the times in seconds to generate a single set of simulated GWAS summary statistics for three exposures using either generate_summary() or generate_individual() as the GWAS sample size (N) and number of causal exposure SNPs changes.
simmrd run time (s) Causal exposure SNPs or data=generate_summary (params), the following summary data will be present in the named list data: (i) standardized SNP association estimates between all IVs and each exposure in the R matrix object bx, (ii) their corresponding standardized standard errors in bxse, (iii) standardized estimates of SNP association between each IV and the outcome in the vector by, (iv) their corresponding standardized standard errors in byse, (v) all corresponding unstandardized estimates of SNP association between the exposures and outcome and their corresponding standard errors (bx_unstd, bxse_unstd, by_unstd, by- se_unstd), (vi) an indication for each IV if it was generated under a non-UHP/CHP, UHP, or CHP model in the vector IVType, (vii) the matrix of true LD correlations between the IVs (LDMatrix), (viii) the matrix of estimated LD correlations between the IVs (LDhatMatrix), (ix) the true causal effects of each exposure on the outcome in the vector theta, and (x) the matrix of correlations between GWAS estimation errors for the outcome and all exposures in the matrix RhoME.Users can view a summary visualization of their simulated data by executing plot_simdata (da- ta,params) to view the scatterplots of SNP-exposure (s) and SNP-outcome association estimates.

A.2. Using generate_individual() and generate_summary()
To generate data using simmrd, two steps must be completed.First, users create a named list.Each element of this list is a parameter value and the name of each element is the parameter name.commands in R. The returned object stored as data is a named list of elements which are GWAS summary data.We provide a description of each parameter for generate_individual() and generate_summary() below and complete examples of how to use simmrd at our Github repository: https://github.com/noahlorinczcomi/simmrd.
The parameters that generate_individual () uses are the following: 1. [sample_size_Xs]: sample sizes of the p exposure GWASs; either a scalar applied to all exposures or a p-length vector of scalars applied to each exposure element-wise.phenotypic correlations between the exposures; scalar, actual matrix of correlations, or string value which is one of "I" for independent exposures, "toeplitz" for a correlation matrix with Toeplitz structure, or (e.g.) "ar1(0.25)"for a correlation matrix with first-order autoregressive structure with correlation parameter 0.25.6. [genetic_correlation_Xs]: genotypic correlations between the exposures; scalar, actual matrix of correlations, or string value which is one of "I" for independent exposures, "toeplitz" for a correlation matrix with Toeplitz structure, or (e.g.) "ar1(0.25)"for a correlation matrix with firstorder autoregressive structure with correlation parameter 0.25.7. [Xs_variance_explained_by_U]: variance in the exposure(s) explained by the confounder conditional on the causal exposure SNPs (the total variance of the exposure(s) is/are 1); scalar applied to threshold, only the one with the smaller p-value will be considered as a candidate IV; scalar.25. [N_of_LD_ref]: sample size of the LD reference panel; scalar or "Inf."

jCU
Step 2. Draw CHP effects for m C SNPs from γ Uniform(variance in U Step 3. Draw UHP effects for m U SNPs from γ Uniform(variance in Y Step 4. Draw SNP associations with the exposures (B) from a matrix normal distribution with row-wise covariance R and column-wise covariance M DR D ββ −1

⊗
and R Y x , calculate the variance-covariance matrix of GWAS estimation errors (Ω) using the methods in LeBlanc et al. (2018).Step 2. Define the exposure genetic covariance matrix DR D Σ = ββ ββ where h h D = diag( , …, ) p 1 .Draw the Mp ( ) × 1 vector of true effect sizes for causal exposure SNPs from 0 and reshape to be the M p × matrix B in which rows correspond to SNPs and columns to exposures.If R = I, we directly generate M true effect sizes from 0 M Normal( , Σ ) ββ −1 to construct the matrix B.

F
I G U R E 3 This figure provides examples of covariance structures which can be generated using simple expressions in simmrd.
2. [sample_size_Y]: sample size of the outcome GWAS; scalar.3. [prop_gwas_overlap_Xs_and_Y]: proportions of sample overlap between exposure and outcome GWAS; either a scalar applied to all exposure-outcome pairs or a p-length vector of scalars applied to each exposure-outcome pair element-wise.4. [number_of_exposures]: number of exposures; scalar.5. [phenotypic_correlation_Xs]: Table 1 shows the time in seconds required to generate simulated MR data for a range of GWAS sample and 500 causal exposure SNPs.Still, many scenarios can be completed in less than 1 s, and simulation times increase more rapidly as GWAS sample sizes increase than as the number of causal SNPs increase.
The causal SNPs were generated without LD structure and the exposure and outcome GWAS samples did not overlap.Using generate_individual(), the exposure GWASs overlapped completely.list are specific to each data-generating function, and we provide examples of how to create these lists in our tutorial provided at our Github repository.This is the only user input required for users to generate GWAS summary data to use in MR simulations.Here, we demonstrate how this list of parameters is used and state the output of both generate_summary() and Note: simmrd, an open-source software for performing simulations to evaluate the performance of Mendelian Randomization methods.Researchers can directly modify the simmrd source code at our Github repository.It is our intention that the community will use this opportunity to establish an accepted procedure for performing simulations in MR.As simmrd is refined and expanded, we expect it will provide a useful tool to facilitate future MR method development and evaluation.
generate_indi- vidual()and generate_summary() use different lists of parameters, as Figure2demonstrates.Assume for example, that the user has one list of parameters assigned to the R object individual_params which they intend to use with generate_individual() and another named summary_params which they intend to use with generate_summary ().The first few elements of summary_params or individual_params may look like: