Bounding the average causal effect in Mendelian randomisation studies with multiple proposed instruments: An application to prenatal alcohol exposure and attention deficit hyperactivity disorder

Abstract Background As large‐scale observational data become more available, caution regarding causal assumptions remains critically important. This may be especially true for Mendelian randomisation (MR), an increasingly popular approach. Point estimation in MR usually requires strong, often implausible homogeneity assumptions beyond the core instrumental conditions. Bounding, which does not require homogeneity assumptions, is infrequently applied in MR. Objectives We aimed to demonstrate computing nonparametric bounds for the causal risk difference derived from multiple proposed instruments in an MR study where effect heterogeneity is expected. Methods Using data from the Norwegian Mother, Father and Child Cohort Study (n = 2056) and Avon Longitudinal Study of Parents and Children (n = 6216) to study the average causal effect of maternal pregnancy alcohol use on offspring attention deficit hyperactivity disorder symptoms, we proposed 11 maternal SNPs as instruments. We computed bounds assuming subsets of SNPs were jointly valid instruments, for all combinations of SNPs where the MR model was not falsified. Results The MR assumptions were violated for all sets with more than 4 SNPs in one cohort and for all sets with more than 2 SNPs in the other. Bounds assuming one SNP was an individually valid instrument barely improved on assumption‐free bounds. Bounds tightened as more SNPs were assumed to be jointly valid instruments, and occasionally identified directions of effect, though bounds from different sets varied. Conclusions Our results suggest that, when proposing multiple instruments, bounds can contextualise plausible magnitudes and directions of effects. Computing bounds over multiple assumption sets, particularly in large, high‐dimensional data, offers a means of triangulating results across different potential sources of bias within a study and may help researchers to better evaluate and emphasise which estimates are compatible with the most plausible assumptions for their specific setting.

As discussed in the main text, 4b (and 4c) are unlikely to hold in the context of alcohol metabolism-related genetic variants proposed as instruments.Beyond this, it is important to recognize that 4a-4f are all strong and unverifiable assumptions, which should be applied with caution.Nonetheless, where there is evidence that the homogeneity assumptions may be reasonable, researchers might consider presenting bounds alongside point estimates computed under assumptions of additive and multiplicative effect homogeneity, in order to see the fullest possible picture of the potential impact of different forms of heterogeneity on their results.
2 Protocol of target trial and emulation using MR

Note:
Emulation of this target trial using MR involves many strong assumptions.Violations of these assumptions can result of violations of the instrumental conditions.The assumptions necessary for emulation of a target trial using MR are part of the set of assumptions evaluated by the instrumental inequalities.a MR: Mendelian randomization b ADHD: attention deficit-hyperactivity disorder 3 Description of genotyping in ALSPAC ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms.The resulting raw genome-wide data were subjected to standard quality control methods.Individuals were excluded on the basis of gender mismatches; minimal or excessive heterozygosity; disproportionate levels of individual missingness (>3%) and insufficient sample replication (IBD < 0.8).Population stratification was assessed by multidimensional scaling analysis and compared with Hapmap II (release 22) European descent (CEU), Han Chinese, Japanese and Yoruba reference populations; all individuals with non-European ancestry were removed.SNPs with a call rate of < 95% or evidence for violations of Hardy-Weinberg equilibrium (P < 5E-7) were removed.Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1).Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation.9,115 subjects and 500,527 SNPs passed these quality control filters.
ALSPAC mothers were genotyped using the Illumina human660W-quad array at Centre National de Génotypage (CNG) and genotypes were called with Illumina GenomeStudio.PLINK (v1.07) was used to carry out quality control measures on an initial set of 10,015 subjects and 557,124 directly genotyped SNPs.SNPs were removed if they displayed more than 5% missingness or a Hardy-Weinberg equilibrium P value of less than 1 x 10 −06 .Samples were excluded if they displayed more than 5% missingness, had indeterminate X chromosome heterozygosity or extreme autosomal heterozygosity.Samples showing evidence of population stratification were identified by multidimensional scaling of genome-wide identity by state pairwise distances using the four HapMap populations as a reference, and then excluded.Cryptic relatedness was assessed using a IBD estimate of more than 0.125 which is expected to correspond to roughly 12.5% alleles shared IBD or a relatedness at the first cousin level.Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation.9,048 subjects and 526,688 SNPs passed these quality control filters.After combining genotype data in the mothers and the children, SNPs with genotype missingness above 1% were removed due to poor quality (11,396 SNPs removed) and a further 321 subjects were removed due to potential ID mismatches.This resulted in a dataset of 17,842 subjects.Imputation of the target data was performed using Impute V2.2.2 against the 1000 genomes reference panel (Phase 1, Version 3) (all polymorphic SNPs excluding singletons), using all 2186 reference haplotypes (including non-Europeans).This gave 8,196 eligible mothers with available genotype data after exclusion of related subjects using cryptic relatedness measures described previously.

Description of genotyping in MoBa
Genotyping of MoBa participants is currently ongoing, and this analysis was conducted using the first available maternal genetic data.Approximately 17,000 trios from MoBa were genotyped in 3 batches.Samples were selected randomly, and excluded from genotyping if the trio met any of the following exclusion criteria: 1) offspring stillborn, 2) offspring deceased, 3) twin offspring, 4) non-existent Medical Birth Registry data, 5) missing anthropometric measures at birth in Medical Birth Registry, 6) pregnancies where the mother did not answer the first questionnaire (as a proxy for higher fallout rate), 7) missing parental DNA samples.The first batch, comprising 20,664 individuals (including parents and children), was genotyped at the Genomics Core Facility (Iceland) using the Illumina HumanCoreExome (Illumina, San Diego, USA) genotyping array, version 12 1.1.The second batch, comprising 12,874 individuals, was genotyped at the Genomics Core Facility (Iceland) using the Illumina HumanCoreExome (Illumina, San Diego, USA) genotyping array, version 24 1.0.The third batch, comprising 17,949 individuals, was genotyped at Erasmus MC (the Netherlands) using the Illumina Global Screening Array (Illumina, San Diego, USA) version 24 1.Genotypes were called using GenomeStudio (Illumina, San Diego, USA) and converted to PLINK format files.
PLINK version 1.90 beta 3.36 (http://pngu.mgh.harvard.edu/purcell/plink/)was used to conduct the quality control, which has been previous described by Helgeland et al (Helgeland et al. 2019).Known problematic SNPs previously reported by the Cohorts for Heart and Aging Research in Genomic Epidemiology consortium and Psychiatric Genomics Consortium were excluded from each batch.Duplicate samples were removed, and each batch was split into parents and offspring.Quality control was conducted separately for parents and offspring.
Individuals were excluded if they had a genotyping call rate below 955 or autosomal zygosity greater than four standard deviations from the sample mean.SNPs were excluded if they were ambiguous, had a genotyping call rate below 98%, or Hardy-Weinberg equilibrium p-value less than 1 x 10 −6 .Population stratification was assessed using the HapMap phase 3 release 3 as a reference, by principal component analysis using EIGENSTRAT version 6.1.4.Visual inspection identified a homogenous population of European ethnicity, and individuals of non-european ethnicity were removed.A sex check was done by assessing the sex declared in the pedigree with the genetic sex, which was imputed based on the heterozygosity of chromosome X.When sex discrepancies were identified, the individual was flagged.Relatedness was assessed by flagging one individual from each pairwise comparison of identity-by-descent with a pi-hat greater than 0.1.
Parent and offspring datasets were then merged into one dataset per genotyping batch.All individuals passing the genotyping call rate and normal heterozygosity measures were included in the merged datasets, meaning individuals who had previously been flagged or excluded for being a duplicate, having a sex discrepancy, being an ethnic outlier, or having a high level of relatedness, were included.Concordance checks were then conducted on validated duplicates.Duplicate and tri-allelic SNPs, as well as SNPs that were discordant between validated duplicates were excluded.Indivudals with a genotyping call rate below 98% in the merged datasets were removed.Insertions and deletions were excluded.
Phasing was conducted using Shapeit 2 release 837 and the duoHMM approach was used to account for pedigree structure.Imputation was conducted using the Haplotype Reference Consortium release 1.1 as the reference panel.The Sanger Imputation Server was used to perform the imputations with the Positional Burrows-Wheeler Transform.Phasing and imputation were conducted separately for each genotyping batch.
Imputation quality control was performed by initially converting dosages to best-guess genotypes.Individuals were removed if they had a genotyping call rate less than 99% or were of non-European ethnicity.SNPs with a genotyping call rate less than 98%, or a Hardy-Weinberg equilibrium p-value less than 1 x 10 −6 were removed.Relatedness was assessed intergenerationally and across batches by flagging one individual from each pairwise comparison of identity-by-descent with a pi-hat greater than 0.15 (excepting known parent-offspring relationships).Individuals were flagged for removal only if the other member of the pair would otherwise be included in the same analysis.One individual from each pair was flagged at random, except when retaining one individual would keep more duo/trio data available, in which case the other member was dropped.After quality control, a core homogenous sample of European ethnicity (based on PCA of markers overlapping with available HapMap markers) individuals across all batches and array were available for analysis, resulting in a total n of 14,804 mothers prior to analysis-specific exclusions.Each of these assumptions is a slightly different version of the IV conditions used in the literature (Swanson et al. 2018).Under assumption (i), (ii), (iii), or (iv), for all i, j ∈ 0, 1, P (Y ai = j) ≤ g(i, j) where

Expression for Richardson-Robins bounds for all possible combinations of instruments
Because P (Y a0 ) and P (Y a1 ) are variation independent, the average causal effect of A on Y , denoted Returning to our setting with multiple proposed instruments, we can consider the set of proposed instruments B = {b 1 , b 2 , .., b n }.We note that any combination of the proposed instruments in B that are themselves categorical variables can be combined into a single joint instrument Z which takes states {1, 2, .., k}, where each state is a unique possible combination of values of the proposed joint instruments in the subset.Thus the Richardson-Robins bounds can be applied to any joint instrument Z, assuming (i), (ii), (iii), or (iv) hold both individually and jointly for each proposed instrument included in Z.In our application, we considered this for all possible subsets of our set of proposed instruments.

Point estimation procedures
For each proposed joint instrument Z l , point estimates for the average causal effect were estimated using two-stage least squares, using linear regression models for both steps.95% confidence intervals were estimated using basic bootstrap.Two stage squares were estimated using the ivreg() function from the AER package (Kleiber, Zeileis, and Zeileis 2020), and bootstrapping was conducting using the boot.ci()function from the boot package (Ripley 2010).
In the context of categorical exposures and outcomes, two stage least squares using linear regression is vulnerable to measurement error, which can result in predicted values of the exposure outside of the 0-1 range, and may violate the assumption of bivariate normally distributed errors (Rassen et al. 2009).However, some research has suggested this issue may be primarily theoretical, and has limited impact on practical applications (Rassen et al. 2009;Angrist 2001;Johnston et al. 2008) .Some MR researchers attempt to avoid this issue by using models based on logistic regression.However, these approaches will produce estimates of the causal odds ratio, rather than the average causal effect on the risk difference scale.In order to produce point estimates of the average causal effect on the same risk difference scale as the bounds, we therefore chose to use two stage least squares based on linear regression.The resulting point estimates are shown in Supplementary Tables 1-4.

Expression of inverse probability weights for each proposed joint instrument
For each proposed joint instrument Z l , unstabilized inverse probability weights (Robins 1997) to account for 10 principal components were estimated as follows: To estimate W A , we fitted multinomial logistic regression models predicting Z l assuming the principal components contributed additively and linearly on the logit scale.Values were subsequently back-transformed to probabilities, and we calculated 1/P (Z l |P C 1 , P C 2 , P C 3 , P C 4 , P C 5 , P C 6 , P C 7 , P C 8 , P C 9 , P C 10 ) for each individual using these backtransformed probabilities.

Possible violations of the MR assumptions in this analysis
The MR assumptions are strong, and any or all the SNPs proposed as instruments in our analysis may have been affected by a number of different biases.While the complete case approach used in this analysis aligns with common practice in MR, both analytic samples were much smaller than the original recruited cohort, meaning the results may have been affected by selection bias due to loss to followup and missing data.Further, because both cohorts were recruited based on the presence of a pregnancy, and offspring ADHD status can only be evaluated in women who become pregnant and carry to term, the MR conditions would have been violated if a woman's alcohol consumption impacted her probability of becoming pregnant (Diemer et al. 2020).Results were largely consistent when inverse probability weighted for 10 principal components, suggesting that the results were not affected by residual population stratification.The correlation between maternal and paternal genotype was small, implying the study was not strongly biased by assortative mating.
While the MR assumptions can be violated if offspring outcomes are affected by offspring genotype, this path was unlikely to have impacted our analyses, because alcohol dehydrogenase genes are not expressed in fetuses or young children, who process alcohol through a different mechanism (Faassen and Niemelä 2011).The MR conditions could also be violated if maternal genetic variants impacted offspring ADHD through mechanisms other than alcohol consumption, such a consumption of other substances, if maternal alcohol consumption after birth also impacted offspring ADHD, or if the relationship between maternal genotype and alcohol consumption changed over the course of pregnancy.As mentioned in the main text, if the relationship between maternal alcohol use and offspring ADHD was truly continuous, binarizing alcohol use could violate the MR conditions.However, results of the instrumental inequalities were consistent across increasing numbers of categories of alcohol use, suggesting violations of the instrumental inequalities were not primarily due to our categorization of alcohol use.While it is possible that the MR conditions could also be violated as a result of measurement error of exposure, maternal alcohol use, previous research suggests bias from such measurement error is generally mild (Pierce and VanderWeele 2012).
Supplementary figure 7: Cropped visualization of the application of the instrumental inequalities to models for the average causal effect of alcohol consumption during pregnancy on offspring ADHD in the Avon Longitudinal Study of Parents and Children, grouping alcohol consumption into 4 categories.
This visualization has been cropped such that all sets of proposed instruments not shown violated the instrumental inequalities.Note that these analyses were conducted in the unweighted dataset.
Supplementary figure 8: Cropped visualization of the application of the instrumental inequalities to models for the average causal effect of alcohol consumption during pregnancy on offspring ADHD in the Avon Longitudinal Study of Parents and Children, grouping alcohol consumption into 7 categories.

Table 1 :
Specification and emulation of target trial of pregnancy alcohol consumption and offspring ADHD using Mendelian randomization in ALSPAC and MoBa

Table 1 :
Specification and emulation of target trial of pregnancy alcohol consumption and offspring ADHD using Mendelian randomization inALSPAC and MoBa (continued)

10 Supplementary tables 1-5
Supplementary table 1: Bounds and point estimates for the average causal effect of any vs. no alcohol consumption during pregnancy on offspring ADHD in the Avon Longitudinal Study of Parents and Children, inverse probability weighted for 10 principal components

table 2 :
Bounds and point estimates for the average causal effect of any vs. no alcohol consumption during pregnancy on offspring ADHD in the Norwegian Mother and Child Study, inverse probability weighted for 10 principal components

table 3 :
Bounds and point estimates for the average causal effect of moderate vs. no alcohol consumption during pregnancy on offspring ADHD in the Avon Longitudinal Study of Parents and Children, inverse probability weighted for 10 principal components

table 4 :
Bounds and point estimates for the average causal effect of moderate vs. no alcohol consumption during pregnancy on offspring ADHD in the Norwegian Mother and Child Study, inverse probability weighted for 10 principal components

table 5 :
Correlation between maternal and paternal genotypes in the Norwegian Mother, Father, and Child Study.