Novel DNA methylation marker discovery by assumption‐free genome‐wide association analysis of cognitive function in twins

Abstract Privileged by rapid increase in available epigenomic data, epigenome‐wide association studies (EWAS) are to make a profound contribution to understand the molecular mechanism of DNA methylation in cognitive aging. Current statistical methods used in EWAS are dominated by models based on multiple assumptions, for example, linear relationship between molecular profiles and phenotype, normal distribution for the methylation data and phenotype. In this study, we applied an assumption‐free method, the generalized correlation coefficient (GCC), and compare it to linear models, namely the linear mixed model and kinship model. We use DNA methylation associated with a cognitive score in 400 and 206 twins as discovery and replication samples respectively. DNA methylation associated with cognitive function using GCC, linear mixed model, and kinship model, identified 65 CpGs (p < 1e‐04) from discovery sample displaying both nonlinear and linear correlations. Replication analysis successfully replicated 9 of these top CpGs. When combining results of GCC and linear models to cover diverse patterns of relationships, we identified genes like KLHDC4, PAPSS2, and MRPS18B as well as pathways including focal adhesion, axon guidance, and some neurological signaling. Genomic region‐based analysis found 15 methylated regions harboring 11 genes, with three verified in gene expression analysis, also the 11 genes were related to top functional clusters including neurohypophyseal hormone and maternal aggressive behaviors. The GCC approach detects valuable methylation sites missed by traditional linear models. A combination of methylation markers from GCC and linear models enriched biological pathways sensible in neurological function that could implicate cognitive performance and cognitive aging.


| INTRODUC TI ON
Cognitive impairment refers to when a person has trouble remembering, learning new things, concentrating, or making decisions that can affect one's everyday life. Cognitive impairment in the elderly is costly and a key issue for health and social care since aging is the greatest risk factor for it. The level of cognitive functioning is shown to decrease with age accompanied by increased variability in twin pairs (McCartney et al., 1990). Epigenetic modification such as DNA methylation is a promising marker in understanding many age-related phenotypes (Bakulski & Fallin, 2014). Despite the wide-spread performance of epigenetic association studies, limited markers have been detected. The limitation in marker detection might have different reasons such as the distribution of the phenotype of interest which is not always normal and the complex patterns of relationship between epigenetic markers and cognition, possibly involving any non-random patterns not limited to linearity. The multiple assumptions in the conventional statistical models could be responsible for the low replication and limited explanation in the phenotype variation by the identified markers.
Monozygotic (MZ) twins are valuable for controlling the genetic background in identifying epigenetic associations, as they share similar genetic makeups. In addition, differences in DNA methylation levels between discordant monozygotic twins associated with mental disorders have been frequently discussed (Castellani et al., 2015).
However, modeling the correlation in dependent samples like twins requires additional assumptions concerning covariance structure and degree of genetic relatedness.
As an alternative, assumption-free measurements of association or generalized correlation coefficient (GCC) have been proposed for omics studies (Murrell et al., 2016;Reshef et al., 2011). We think that this method has advantages to be considered for analyzing epigenetic data as the associations between methylation values and cognitive function are expected to be complex and more importantly, this method does not rely on strict assumptions such as normality of phenotypes and linear correlation.
In this study, we aim at promoting the use of GCC as a complementary method along with traditional linear models. We here investigate the performances of popular conventional models and the GCC method in an epigenome-wide association study on cognitive function using DNA methylation data measured in blood samples of 400 MZ twins as a discovery sample. Performing single CpG sites differentially methylated regions (DMRs) and pathway analyses.
Also, top single CpG sites and candidate regions in nearby genes are replicated in cognitive function using an independent DNA methylation sample of 206 twins (192 MZ and 14 dizygotic (DZ) twins) and verified in a gene expression data, respectively.

| RE SULTS
The descriptive statistics of the study samples are shown in Table   S1. DNA methylation data for the entire sample passed the quality control (QC) were explained in the method section. We observed significant associations of cognitive measurement scores with age and sex in the middle-aged Danish twin (MADT) sample (Pedersen et al., 2019). The cognitive score declines with age (p = 1.9e-05) and is higher in females in comparison to males (p = 0.003). Additionally, none of the cell counts were associated with cognitive score. This was done by the linear mixed-effect (LME) (Bates et al., 2014) model with cognitive function as outcome and age, sex, cell counts as fixed effects, and twin pairing as a random factor.
Before performing the statistical analyses, we first investigated the performance of three models, GCC (Murrell et al., 2016), LME, and kinship (Therneau, 2012) model by estimating their type I error rates using simulated random methylation data for one marker based on a standard normal distribution. Association of the simulated molecular data with cognitive function in the MADT cohort was assessed (p < 0.05 as significant) with type I error rates estimated as 0.052 for GCC, 0.054 for LME, and 0.055 for kinship model, upon 10,000 replicates. The models are generally unbiased, although the two linear models gave a slightly high type I error rate due to non-optimal adjustment of twin correlation on cognition by the linear models.

| Single CpG epigenome-wide association studies
We applied the three models to examine the correlation between DNA methylation and cognitive function with methylation data in MADT cohort collected using the Infinium Human Methylation 450K BeadChip. After performing QC, 427,409 CpGs remained.
The identified CpGs with p < 1e-4 from GCC, Kinship, and LME were 42, 24, and 18. We chose Kinship results as a linear representative model (because of high consistency with LME) and GCC as a nonlinear method and combined the results by taking the minimum p-values between them. After combining GCC results sites missed by traditional linear models. A combination of methylation markers from GCC and linear models enriched biological pathways sensible in neurological function that could implicate cognitive performance and cognitive aging. mapped to PAPSS2 (p = 1.0173e-06), and cg23988749 mapped to MRPS18B (p = 2.5999e-06). Table S2 shows the summary statistics results for the CpGs p < 0.05 from EWAS analysis. Figure 1 shows density plots between DNA methylation M-value and cognitive function for the top 12 significant CpGs. Only two CpGs,

| Biological pathway analysis
The total number of mapped genes (p < 0.05) was 27,413. These genes were used as input for over-representation analysis of KEGG and REACTOME pathways as well as DisGeNET (human disease) implemented in WebGestalt (Liao et al., 2019). Table 2 shows the top 30 KEGG, REACTOME, and DisGeNET pathways with statistical significance (FDR < 5.88e-04

| Analysis of differentially methylated regions
The EWAS final results from the combined CpGs of GCC and linear model were used for identification of DMRs, using combp algorithm (Pedersen et al., 2012). In total, 15 DMRs with FDR < 0.05 were identified, which is shown in Table S6. The DMRs pattern has been depicted in Figure S4, which among them DMRs in Figure S4A-G, J are hyper-methylated and DMRs in Figure   S4H

| Further verification using gene expression data
Using gene expression data on the same MADT samples, we found 11 genes annotated to the 15 DMRs in Table S6 and investigated association of their expression levels with cognitive function using the combined GCC method and Kinship model. Among the 11 genes, 3 genes, SMC1B, GABBR1, and HCG16 were confirmed associated with cognitive function with p < 0.05. In the gene expression data, two probes (p = 0.02, p = 0.27) were mapped to GABBR1, one probe (p = 0.02) to SMC1B and one probe (p = 0.04) to HCG16.

| DISCUSS ION
This study of DNA methylation data in twin samples indicates the strength of GCC in comparison with the traditional methods characterized by (1) ability to capture nonlinear correlation patterns missed by the linear models, (2) robustness in handling EWAS data on related samples such as twins, and (3) biologically meaningful annotations of identified markers.
Conventionally, linear relationship has been assumed in describing the association between a molecular marker and a phenotype of interest for simplicity in statistical modeling. In a genome-wide association study (GWAS), this assumption translates to the additive genetic effects. The linear assumption suffers intrinsically from low efficiency in handling the biological complexity in the molecular regulation on phenotype expression, which cannot be simplified by just a linear model. This phenomenon has been clearly exemplified in a published microarray time course experiment analyzed using sophisticated parametric modeling (Murrell et al., 2016;Reshef et al., 2011). Instead of complex modeling, GCC provides an alternative assumption-free approach inherently capable of capturing patterns of any non-random relationship. Based on our findings, there were no top CpGs overlapping between GCC and linear models (Figure 3). In fact, linear models are defined and coded for linear relationships and GCC has to sacrifice power for some kinds of relationships to see other kinds. With this consideration, we suggest that GCC could with advantage be applied as a complement to the linear methods to ensure different patterns of correlation be captured with high efficiency. As shown in Table 1, the majority of the CpGs are detected by GCC presumably, because GCC is not limited to any predefined correlation pattern.
Although the single CpG EWAS did not find genome-wide significant sites after correcting for multiple testing, the top rank CpGs were annotated to biologically meaningful genes and pathways.
In Table 1 Among the significant pathways in Table 2, it has been reported that the disturbance in signal transduction in brain cells causes the cognitive decline (Lo Vasco, 2018). Focal adhesion is involved in integrin adhesion, communication between the extracellular matrix and the actin cytoskeleton, and the regulation of many cell types. Loss of cell adhesion can lead to cell death and altered focal signaling has been linked to synaptic loss, which may cause AD (Caltagarone et al., 2007). There is evidence that axon guidance might play a role in some brain disorders such as Parkinson's and AD (Lesnick et al., 2007).
MAPK signaling pathway regulates neuronal apoptosis, β-and γ-secretase activity, and phosphorylation of APP and tau which has a role in the pathogenesis of AD (Kim & Choi, 2010). Neurotrophins, such as brain-derived neurotrophic factor (BDNF), are essential regulators of neuronal survival and lower level of them are related to etiology of Alzheimer's and Huntington's diseases (Mitre et al., 2017). In the literature, there is also evidence of a direct or indirect link of Membrane trafficking, Ion homeostasis to cognitive function or other neurological disorders (Cuomo et al., 2015;Wang et al., 2013).
Analysis of DMRs based on the joint results of GCC and linear models had the power to detect 13 DMRs with FDR < 0.05 and two with 0.05 < FDR < 0.1. Among the genes harbored by these DMRs, three genes SMC1B, GABBR1, and HCG16 were verified using gene expression analysis. The GABBR1 gene is highly expressed in brain, and it is a main inhibitory in the central nervous. Dysfunctional GABA interneuron could represent a core pathophysiological mechanism underlying cognitive dysfunction in schizophrenia (Li et al., 2016;Xu & Wong, 2018). In the processes of learning and memory, changes in GABAergic function could be an important factor in both early and later stages of AD pathogenesis (Govindpani et al., 2017). The two other genes SMC1B and HCG16 have not been reported yet to have a direct link to cognitive function. Most interestingly, functional annotation of the DMRs by GREAT reported highly relevant GO terms as shown in Table S7. Moreover, functional gene annotation analysis using GREAT from top identified DMRs revealed important biological and molecular functional clusters implicated in cognitive function. The link of neurohypophyseal hormone and oxytocin receptor binding with memory process and cognitive function has been already discussed (Fehm-Wolfsdorf et al., 1984;Iovino et al., 2018;Strupp et al., 1983).
Oxytocin is produced in the hypothalamus and released into the circulation through the neurohypophyseal system (Ross & Young, 2009). It is shown that there is an impairing influence of memory by oxytocin (Fehm-Wolfsdorf et al., 1984) and also it modulates social cognition and affiliative behavior in both sexes (Ross & Young, 2009). Oxytocin in brain also regulates anxiety-behaviors, which has been shown to correlate the maternal aggressive behavior (Bosch et al., 2005).
In women, the rate of muscle contraction, especially the rate of velocity development (RVD), is predominantly associated with cognition, particularly in women with low muscle strength (Tian et al., 2019).
There is also evidence of the role of mating, female receptivity, hypermethylation of CpG island, and metabolic process in cognitive function (Haberman et al., 2012;Sharp et al., 2010;Smith et al., 2015).
In Table S8, the 9 replicated CpGs were annotated to genes including EPHA8, PRDM15, PRR25, GPLD1, SLC1A3, DPYSL2, and NHEJ1. Gu et al., (2005) reported that the EphA8 receptor is capable of inducing a sustained increase in MAPK activity, thereby promoting neurite outgrowth in neuronal cells. PRR25 is among the 171 age-related differential expression genes (Lee & Lee, 2017). GPLD1 gene has been reported to interact with Apolipoprotein A1 and APOA4 (Deeg et al., 2001) and Lin et al., 2015 have shown that the low levels of APOA1, APOC3, and APOA4 are associated with risk of AD. Importantly, the GPLD1 is a candidate gene that modulates Aβ production (Seki et al., 2020). Wilmsdorff et al., (2013) reported that the increased SLC1A3 expression indicates facilitated transport and may result in reduced glutamate neurotransmission. The gene DPYSL2 is highly expressed in brain and associated with AD (https://www.genec ards.org/cgi-bin/cardd isp. pl?gene=DPYSL2).
As an extra effort to characterize our identified CpGs, we tested the proportion of CpG-SNPs in the top CpGs with p < 1e-4 and compare it with the proportion of CpG-SNPs from the whole 450k array using hypergeometric test. The proportion of CpG-SNPs in our study is higher than the proportion of CpG-SNPs in the 450k array with borderline significance (p = 0.05). This indicates that DNA methylation could be one of the ways that genetic variations influence cognitive aging as it is discussed in a literature (Shoemaker et al., 2010). Additionally, we found a significant overlap (p = 3.04e-16) of top genes in our study with age-related genes including PAPSS2, MRPS18B,USP35,PRDM15,ODZ3,DNHD1,WDR51A,AGTRAP,OPCML,FLJ3281,CCDC102B,PRR16,and FGF12 in the Lothian Birth Cohorts (LBC) reported by Li et al., (2017). Also, a significant overlap (p = 4.54e-12) of genes KLHDC4, MRPS18B, USP35, EIF2C2, AOAH, COBRA1, OPCML, and SLC1A3 with age-related genes in LSADT reported by Tan et al., (2016). The two genes PRDM15 and SLC1A3 are also among the list of genes in the replication analysis.
A limitation for the GCC method is that its association parameter has no direction compare to the linear regression models that report the coefficient regression with a direction of effect (+ or −). This is because the direction of effect cannot be determined in the case of nonlinear relationship. We propose that, for any significant CpG marker, its direction of effect be roughly determined by the sign of its coefficient from the linear model. This same idea has been implemented in R Bioconductor package RTN (https://www.bioconductor. org/packages/release/bioc/html/RTN.html) which uses a generalized correlation coefficient to estimate correlation between expression of a transcription factor gene and expression of a target gene, but uses the sign of Pearson's correlation (negative or positive) to determine the direction of correlation. Another possible limitation of this study is the use of DNA methylation for the whole blood rather than a brain tissue. However, studies have shown whether blood can be used as a proxy in methylation studies as the brain tissue is usually unavailable. They concluded that the methylation status of many CpGs in the blood mirror those in the brain (Aberg et al., 2013).
Dependent samples can lead to biased statistical assessment if not adjusted properly in association studies. In genome-wide association analysis, the sub-grouping of samples is responsible for inflated statistical significance which can be revealed by QQ plots. In this study, LME and kinship models were applied to account for the correlated structure in our twin pairs. As shown by the QQ plots in Figure 2a, the significance estimates by linear models were not inflated, suggesting that the random effects estimated in the linear models well captured the twin correlation. In practice, however, inflated significances are frequently observed due to model misspecification in parametric modeling. It is encouraging that the GCC as an assumption-free method does not require any specific handling of the correlated twin samples in achieving unbiased assessments. This feature is especially valuable in association analysis of omics data for dealing with sample sub-grouping due to population stratification or batch effect from non-biological experimental variations.

| CON CLUS ION
Through applying both GCC and linear models to EWAS on cognitive function, we identified more and meaningful genes and pathways as well as DMRs that could implicate the cognitive performance and cognitive aging compared with restricting to the popular linear models. The assumption-free GCC is also robust in dealing with correlated samples as in our twins. The intrinsic features of GCC enabled the identification of DNA methylation markers displaying diverse correlation patterns with cognitive function. Our results promote the use of GCC to complement the traditional linear models in EWAS studies for marker discovery and for biological interpretation.

| Population study
The study sample comprises 400 monozygotic (MZ) twins aged 56-80, including 220 male and 180 female pairs (Table S1)  The cognitive test scores were standardized to mean 0 and standard deviation 1 and were summed to calculate the general cognitive composite scores. The cognitive function for each twin pair through plotting cognitive scores for twin 1 versus twin 2 is depicted in Figure S1. The details about sample collection have been described in detail elsewhere (McGue & Christensen, 2001

| DNA methylation profiling
DNA methylation profiles in whole blood samples were analyzed using the Infinium HumanMethylation450 BeadChips (Illumina) containing 485,512 CpG sites across the human genome. The QC of the DNA methylation data was done based on two R packages: MethylAid (Van Iterson et al., 2014) and minfi (Aryee et al., 2014). Probes were removed from the analysis based on the following thresholds: zero signals, low bead count < 3 beads, p-value detection > 0.01, crossreactive probes defined by Chen et al., 2013 and probes missing in > 5% samples, and sex-chromosome and SNP probes were removed and the rest were imputed based on their median. This resulted in 427,409 CpGs left for analysis. Data normalization was done by functional normalization (Fortin et al., 2014). Before the statistical analysis, the methylation beta values were logit transformed to M values (log 2 (β/(1−β)).

| Generalized correlation coefficient and linear regression models
We tested by fitting two linear regression models and GCC whether DNA methylation in whole blood is associated with cognitive function. In the linear models, both kinship model from kinship2 R package (Therneau, 2012) and LME from lme4 R package (Bates et al., 2014) were applied to find the association of DNA methylation levels with cognitive function. Before applying association testing, we adjusted for age, gender, and blood cell compositions. The kinship module calculates a kinship matrix and integrates it in the covariance matrix of the methylation data. The LME model corrects for correlation between twins in a pair by including twin pairing as a random effect in the model.
For GCC analysis, Matie R package was used (Murrell et al., 2016), which GCC was computed based on a ratio of maximum likelihoods for the marginal distribution and maximum weighted likelihoods for the joint distribution.

| Gene-set enrichment analysis
The identified CpGs with p < 0.05 were annotated to genes and used as input for overrepresentation enrichment analysis (ORA) against the genes on the 450K array using the WEB-based GEne SeT AnaLysis Toolkit (WebGestalt). Gene sets with FDR q-value < 0.05 were reported as significant.

| Differentially methylated regions
In addition to single CpG based analysis, we also performed test- The significant DMRs identified in comb-p were mapped to genes, and these genes were verified in the aforementioned gene expression data from the same MADT cohort. The analysis was done by GCC, Kinship, and LME with adjustments on age, sex, and cell counts information similar to the methylation data. Then, the final assessment was done based on both GCC and Kinship models.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no conflict of interest.

AUTH O R CO NTR I B UTI O N S
AM, QT, and JH conceived the study. AM performed data analysis.
AM and QT drafted the manuscript. All authors contributed in the manuscript, commented, and approved the submitted manuscript.

E TH I C S A PPROVA L A N D CO N S E NT TO PA RTI CI PATE
The MADT study was approved by the Regional Committees on Health Research Ethics for Southern Denmark (S-VF-19980072).
Written informed consents were obtained from all participants in the Danish twin studies. Written informed consents were obtained from all LSADT participants and approved by the Regional Committees on Health Research Ethics for Southern Denmark (S-VF-20040241) and was conducted in accordance with the Helsinki II declaration.

DATA AVA I L A B I L I T Y S TAT E M E N T
According to the Danish and EU legislations, transfer and sharing of individual-level data require prior approval from the Danish Data Protection Agency and require that data sharing requests are dealt with on a case-by-case basis. However, we have provided information about sample in this study.