CpG‐related SNPs in the MS4A region have a dose‐dependent effect on risk of late–onset Alzheimer disease

Abstract CpG‐related single nucleotide polymorphisms (CGS) have the potential to perturb DNA methylation; however, their effects on Alzheimer disease (AD) risk have not been evaluated systematically. We conducted a genome‐wide association study using a sliding‐window approach to measure the combined effects of CGSes on AD risk in a discovery sample of 24 European ancestry cohorts (12,181 cases, 12,601 controls) from the Alzheimer's Disease Genetics Consortium (ADGC) and replication sample of seven European ancestry cohorts (7,554 cases, 27,382 controls) from the International Genomics of Alzheimer's Project (IGAP). The potential functional relevance of significant associations was evaluated by analysis of methylation and expression levels in brain tissue of the Religious Orders Study and the Rush Memory and Aging Project (ROSMAP), and in whole blood of Framingham Heart Study participants (FHS). Genome‐wide significant (p < 5 × 10−8) associations were identified with 171 1.0 kb‐length windows spanning 932 kb in the APOE region (top p < 2.2 × 10−308), five windows at BIN1 (top p = 1.3 × 10−13), two windows at MS4A6A (top p = 2.7 × 10−10), two windows near MS4A4A (top p = 6.4 × 10−10), and one window at PICALM (p = 6.3 × 10‐9). The total number of CGS‐derived CpG dinucleotides in the window near MS4A4A was associated with AD risk (p = 2.67 × 10−10), brain DNA methylation (p = 2.15 × 10−10), and gene expression in brain (p = 0.03) and blood (p = 2.53 × 10−4). Pathway analysis of the genes responsive to changes in the methylation quantitative trait locus signal at MS4A4A (cg14750746) showed an enrichment of methyltransferase functions. We confirm the importance of CGS in AD and the potential for creating a functional CpG dosage‐derived genetic score to predict AD risk.


| INTRODUC TI ON
Much has been learned about the genetic basis of Alzheimer disease (AD), the most common cause of dementia in the elderly. Genome-wide association studies (GWAS) have identified common and rare variants in more than 30 loci that contribute to AD risk Hollingworth et al., 2011;Jakobsdottir et al., 2016;Jun et al., 2017Jun et al., , 2016Lambert et al., 2013;Mez et al., 2017;Naj et al., 2011;Sims et al., 2017). However, these associations explain only a fraction of the heritability of AD, and their functional consequence also remains unclear (Lambert et al., 2013;Ridge, Mukherjee, Crane, & Kauwe, 2013).
Thus, here we investigate AD risk from a different perspective.
Epigenetic phenomena such as DNA methylation may be involved but have not been studied extensively in AD. DNA methylation is intimately associated with genetic variation because of frequent attachment of a methyl group directly to a DNA nucleotide, particularly a dinucleotide comprising a cytosine and guanine (CpG). CpG-related SNPs (CGS) alter the sequence of the primary target sites for DNA Aging Alzheimer's Disease Data Storage Site (NIAGADS) at the University of Pennsylvania and funded by NIA grant U24-AG041689-01. Samples from the National Centralized Repository for Alzheimer's Disease and Related Dementias (NCRAD), which receives government support under a cooperative agreement grant (U24 AG21886) awarded by the National Institute on Aging (NIA), were used in this study. We thank contributors who collected samples used in this study, as well as patients and their families, whose help and participation made this work possible. The NACC database is funded by NIA/NIH Grant U01 AG016976. . ROSMAP data were generated with support from NIA grants P30-AG10161, R01-AG17917, R01-AG36042, and U01-AG46152. This work was also supported by NIA grants R01-AG048927 and RF1-AG057519 (to LAF).
Pathway analysis of the genes responsive to changes in the methylation quantitative trait locus signal at MS4A4A (cg14750746) showed an enrichment of methyltransferase functions. We confirm the importance of CGS in AD and the potential for creating a functional CpG dosage-derived genetic score to predict AD risk.

K E Y W O R D S
Alzheimer disease, DNA methylation, epigenetics, eQTL, genetics, mQTL methylation (Lister et al., 2009) and account for a significant fraction (~38%-88%) of allele-specific methylation (ASM) regions in the human genome (Shoemaker, Deng, Wang, & Zhang, 2010). It has been demonstrated that more than 80% of CGSes have a regulatory role in DNA methylation (Zhi et al., 2013). Recently, we found that a haplotype of multiple CGSes is associated with DNA methylation patterns on a genome-wide scale (Ma et al., 2016). DNA methylation has been shown to influence risk of age-related diseases (Hunter et al., 2012;De Jager et al., 2014). For example, a genome-wide DNA methylation study reported association of AD pathological features with methylation changes at several loci (De Jager et al., 2014). Also, levels of DNA methylation of GSTM1 and GSTM5 have been associated with risk of age-related macular degeneration (Hunter et al., 2012).
In this study, we evaluated the association of AD with CGSes genome-wide and validated significant findings by expression quantitative trait locus (eQTL) and methylation QTL (mQTL) analyses.

| Sliding Window Association of CGSes with AD
Association of AD with CGSes was tested genome-wide using sliding windows that were 1 kb in length, overlapping by 0.5 kb and contained at least two CGSes. These analyses, which were performed using SKAT-O (Lee, Wu, & Lin, 2012), considered the combined effects of all CGSes in a window and weighted rare variants more heavily than common variants. Because the SKAT-O window-based test does not consider the effect direction of the variants in each window, we also tested a model including CGS dosage which was calculated as the total number of CpG dinucleotides created by the CGSes in the window. Genome-wide analysis of 2,288,371 overlapping windows each containing at least two CGSes showed little evidence of inflation (λ = 1.099, Figure S1). SKAT-O and CGS dosage approaches provided similar results across the genome ( Figure     In order to show the unique role of CGSes in these windows, we compared the significance level for the windows under two conditions, (a) including only CGSes and (b) including only non-CGSes, which are the SNPs that do not disrupt CpG dinucleotide formation.
As shown in Table 3, the p values of all the identified AD-associated windows in Table 1 were attenuated when only non-CGSes were included in the test. The number of CGSes and non-CGSes in each top window differed by no more than two except for the windows at MS4A4A and PICALM. There is very modest LD between the top CGS and non-CGS for the most of the windows except for a window at MS4A6A (R 2 = 0.98) which is 165 kb from the top window at MS4A4A (R 2 = 0.37). The attenuation of the significance level was also observed at the individual SNP level for the comparisons of the two types of SNPs in each window, noting that the APOE region did not contain any non-CGSes (Table S4).

| Association of CGSes with DNA methylation and gene expression
Windows containing CGSes located in MA4A4A, PICALM and APOE were associated (p ≤ 0.05) with the degree of DNA methylation in brains (Table 4); however, only the MS4A4A window was significantly associated in brains after correction for the 176 methylation probes tested for association (adjusted p = 2.15 × 10 −9 at cg14750746). This window was also nominally associated with increased methylation in blood after correcting for the same 176 methylation probes (nominal TA B L E 2 Top-ranked windows associated with AD in replication stage F I G U R E 1 Forest plot of dose-response effect of the number of CpG dinucleotides created by the CGSes in the intergenic window close to MS4A4A on the logged odds ratio of AD. The filled square and horizontal line for each population or the filled diamond for the summary denote the estimated logged odds ratio and its 95% CI per unit increase in the number of CpG dinucleotides in the window p = 3.34 × 10 −4 and adjusted p = 0.06). In addition, the number of CpG dinucleotides created by the CGSes in the intergenic window between MS4A4A and MS4A6E was associated with increased expression of MS4A4A in both brain (p = 0.03) and blood (p = 2.53 × 10 −4 ).
The MS4A6A window was associated with DNA methylation (adjusted p = 1.47 × 10 −7 ) and gene expression (p = 5.89 × 10 −26 ) in blood, but rs12226022 was not well imputed in the ROSMAP dataset to test this association in brain.

| Pathway analysis at the MS4A4A window
Transcriptome analysis using RNAseq data from the Religious Order Study and Rush Memory and Aging Project (ROSMAP) brain samples was performed to identify the set of genes whose expression is influenced by methylation of CpG site cg14750746 that was associated with the dosage of MS4A4A CGSes (Table 4). In total, 15,508 protein-coding genes remained in the analysis after removing genes expressed in less than 10% subjects. Although no genes remained significant after correcting for the number tests (threshold p = 3.2 × 10 −6 ), there were 34 nominally associated genes (p < 5×10 -3 ) (Table S5) and pathway analysis showed enrichment in methyltransferase activity (Table 5).

| D ISCUSS I ON
Our study using a sliding-window approach confirmed the importance of CGS in AD and is the first to report dosage effects of CpG dinucleotides created by CGSes on AD risk. In particular, we iden- This observation does not seem to be related to the differences in the number of variants or LD between CGSes and non-CGSes.
The MS4A gene cluster encodes a family of proteins spanning the cellular membrane four times which share similar polypeptide sequence and predicted topological structure. MS4A6A expression in brain is positively associated with AD-related neurofibrillary tangles and neuritic plaques (Karch et al., 2012;Martiskainen et al., 2015). AD risk alleles at these loci were reported to be associated with higher expression in brain (Allen et al., 2012;Karch et al., 2012;Martiskainen et al., 2015). The underlying mechanism for the effects of MS4A genes on AD may be related to their regulation of calcium channels (Walshe et al., 2008), immune system (Zuccolo et al., 2013).
Our findings of an association of the CpG dinucleotide dosage in this region with AD risk suggest a potential novel AD-related mechanism involving MS4A genes. Further experiments examining DNA methylation in the MS4A region are necessary to clarify the exact mechanism.
All of the loci identified in our study using a sliding-window approach were previously reported to be associated with AD through DNA methylation analyses, indicating an overlap between genetic and epigenetic mechanisms. For example, brain DNA methylation levels of CpG sites located in the top-ranked loci have been asso-  (Table S6), but it is unclear why the pairs of methylation probes in MS4A region and APOE are inversely correlated in brain and blood.
All of the genes identified by our analyses have been implicated in inflammation and the immune system. BIN1 knock-out mice were shown to have higher incidence of inflammation during aging (Chang et al., 2007). BIN1 was also reported to be related to inflammation and immunity by its participation in the phagocytic pathway (Gold et al., 2004)  Our findings also suggest that the sliding-window approach focused on CGSes is useful for identifying loci whose influence on disease risk may involve clinically relevant epigenetic mechanisms.
In the large GWAS conducted by the International Genomics of Alzheimer's Project (Lambert et al., 2013), approximately 44% of the top AD-associated SNPs are CGSes. However, not all of these CGSes were significantly associated with AD in our analysis (e.g., CGSes in CR1, CD2AP, and CLU; Table S1). Interestingly, none of these three loci were reported to have significant brain methylation changes re- that is correlated with both of these markers, is responsible for the association. We did not remove CGSes in high LD, which may inflate the number of significant findings. However, some of these associations may be independent because multiple adjacent methylated CpG sites can serve as the platform for chromatin binding proteins that lead to changes in chromatin state (Bartke et al., 2010). Another concern is that despite experimental evidence suggesting an optimal window size of 1kb, it is unknown whether other window sizes may increase power. Also, our selection of the default weights of variants has bias toward rare variants. Finally, we observed that the CGS most significantly associated with AD risk also has significant mQTL and eQTL effects that survive regional multiple test correction but do not achieve genome-wide significance.
In conclusion, we confirmed the importance of CGS in AD and the potential for creating a functional genetic score based on CpG dosage to predict disease risk. However, it is unknown whether these CGS signals act as causative mechanisms in AD progression.
Further replication and mechanistic studies are necessary to validate these findings. Future genome-wide mQTL and eQTL analyses may extend our findings.

| CGS annotation
CGSes were annotated as described previously (Ma et al., 2016). In brief, CGS information was retrieved by Galaxy (Goecks, Nekrutenko, Taylor, & Galaxy, 2010) from UCSC human genome browser based on SNP141 and human hg19 sequence data.

| Discovery stage subjects
The discovery stage included 12,181 unrelated cases and 12,601 controls from 22 cohorts with European ancestry participating in the Alzheimer's Disease Genetic Consortium (ADGC) ( Table S2).
Characteristics of the ADC7 cohort are provided in the Appendix S1, and details of other study cohorts were previously described (Jun et al., 2016;Lambert et al., 2013). Studies of the individual cohorts were approved by the appropriate Institutional Review Boards, and written informed consent for all subjects was provided on behalf of

| Statistical analysis
Details of SNP genotyping and quality control are described elsewhere (Jun et al., 2016;Lambert et al., 2013). SNP genotype imputation was performed using IMPUTE2 with reference haplotypes from the March 2012 release of 1,000 Genomes. Principal component (PC) analysis was conducted using the smartpca program in EIGENSOFT Price et al., 2006) to evaluate population substructure within each dataset. Association of AD risk with CGSes was tested using a sliding-window approach (Tang, Feng, Sha, & Zhang, 2009). Windows spanning 1kb were constructed based on evidence suggesting that sequence variants within 1 kb can affect the methylation status of a gene (Lienert et al., 2011). Consecutive windows with a 500 bp overlap were tested to optimize power for detection of associations and ensure a sufficient number of SNPs in each window. Thus, for example, each unique 2000 bp region contains three overlapping windows. CGSes with imputation quality (r 2 ) ≤0.4 or genotype data available for less than half of the cohorts were removed. Windows with fewer than two CGSes were omitted from the analysis. After these filtering steps, 2,288,371 windows remained for association analyses.
The association of AD with the combined effects of multiple CGSes in each window on the risk of AD was evaluated by logistic regression using the optimal sequence kernel association test (SKAT-O) (Lee et al., 2012) using R package seqMeta (https ://cran.r-proje ct.org/ web/packa ges/seqMe ta/index.html) as implemented in Universal Genome Analyst (UGA) software (https ://github.com/rmkoe stere r/ uga). The fast P value calculation "integration" method was used as a screening tool. Windows with p ≤ 5 × 10 -4 or no reported p value were re-analyzed using the "saddlepoint" method (Duchesne & de Micheaux, 2010). We used the default weights of the seqMeta package to up weight the contributions from rare variants with the aim to identify potential novel loci. The same methodology was applied to the analysis of non-CGSes. SKAT-O is not sensitive to effect direction of the individual variants included in the test and thus does not produce effect estimates. Thus, we also conducted the dose-response effect of the multiple CGSes in the window on AD risk using logistic regression. The allele that creates a CpG dinucleotide was considered as the effect allele and the allele that disrupts the CpG dinucleotide as the reference allele. The sum of the imputed dosages for multiple CGSes in each window was calculated and used as the exposure variable for the logistic regression model with AD status as the outcome.
The summary statistics for regression coefficients and robust standard errors from each cohort were meta-analyzed using an inverse variance-weighted, fixed-effects approach implemented in METAL (Willer, Li, & Abecasis, 2010). Both SKAT-O and dosage analyses were adjusted for age, sex, and PCs. Windows surviving Bonferroni-corrected genome-wide significance level (p ≤ 5 × 10 −8 ) from both methodologies were considered. The genome-wide summary statistics from the two methodologies are provided in Table S8. consortium (Lambert et al., 2013). The protocols and participant consent forms were approved by each institution. The combined effects of multiple CGSes in each window on AD were determined using the GATES method, implemented in the GATES R package (Li, Gui, Kwan, & Sham, 2011). This method extends the Simes test to combine the p-values of the SNPs within a region into an overall regional p value.  HumanMethylation450 BeadChip. Analyses of FHS data were conducted in two stages. A linear mixed model was used to derive the residuals of the DNA methylation of the probe adjusted for the imputed cell types (CD8T, CD4T, NK, B-cell, monocyte), row and column as fixed-effects, chip ID as a random effect at first. Then, each residual was regressed on the CGSes dosage in models including age and sex as fixed-effects and kinship matrix as random effect to account for familial correlation. Analyses of ROSMAP data were conducted with the linear model by adjusting the methylation batch, age at death, sex, post-mortem interval, and study group (ROS or MAP), which was test to be the most appropriate model for the data as reported by De Jager et al. (2014). p-values were adjusted using a

| mQTL Analysis
Bonferroni correction for the total number of probes tested within each window.
Normalized gene expression level was regressed on the sum of dosages of CpG dinucleotides in each window with covariates for age, sex, and the first three PCs of ancestry using a linear mixed model for analyses of FHS data and a general linear model for analyses of ROSMAP data. p-Values were corrected for the seven tests (i.e., 7 genes) performed.

| Pathway analysis
Using the ROSMAP brain methylation and RNAseq data, we performed a genome-wide expression-methylation scan using a general linear model with the methylation of CpG site cg14750476 as the exposure variable and the normalized gene expression levels of all the protein-coding genes as outcomes (n = 15,508), including the same covariates as in the mQTL and eQTL analyses. Genes with p < 0.005 were included in the pathway enrichment analysis implemented in the software of STRINGdb (Szklarczyk et al., 2015), which conducted a hypergeometric test, using the false discovery rate (FDR) to correct for multiple tests (Benjamini, 1995), to query the enrichment of the input gene sets against the background gene list in Gene Ontology database classified as "molecular function".

CO N FLI C T O F I NTE R E S T
None declared.