Identifying interactive biological pathways associated with reading disability

Abstract Introduction Past research has suggested that reading disability is a complex disorder involving genetic and environment contributions, as well as gene–gene and gene–environment interaction, but to date little is known about the underlying mechanisms. Method Using the Avon Longitudinal Study of Parents and Children, we assessed the contributions of genetic, demographic, and environmental variables on case–control status using machine learning. We investigated the functional interactions between genes using pathway and network analysis. Results Our results support a systems approach to studying the etiology of reading disability with many genes (e.g., RAPGEF2, KIAA0319, DLC1) and biological pathways (e.g., neuron migration, positive regulation of dendrite regulation, nervous system development) interacting with each other. We found that single nucleotide variants within genes often had opposite effects and that enriched biological pathways were mediated by neuron migration. We also identified behavioral (i.e., receptive language, nonverbal intelligence, and vocabulary), demographic (i.e., mother's highest education), and environmental (i.e., birthweight) factors that influenced case–control status when accounting for genetic information. Discussion The behavioral and demographic factors were suggested to be protective against reading disability status, while birthweight conveyed risk. We provided supporting evidence that reading disability has a complex biological and environmental etiology and that there may be a shared genetic and neurobiological architecture for reading (dis)ability.

Past research has established a genetic contribution for reading disability (Facoetti, Gori, Vicari, & Menghini, 2019;Gialluisi, Guadalupe, Francks, & Fisher, 2017;Landi & Perdue, 2019;Mascheretti et al., 2017;Paracchini, Diaz, & Stein, 2016;Skeide et al., 2015), starting with evidence that reading disability is highly heritable with estimates between 40 and 60 percent (Wadsworth, DeFries, Olson, & Willcutt, 2007;Willcutt et al., 2011). A substantial body of literature has investigated the complex genetic contributions for reading disability, considering single gene associations, gene-gene interactions, and gene-environment interactions. The major findings are briefly reviewed as follows. Multiple studies have associated nine candidate regions and 14 candidate genes with dyslexia, one type of reading disability (Gibson & Gruen, 2008;Newbury, Monaco, & Paracchini, 2014;Willcutt et al., 2011). These genes include the following: DYX1C1, DCDC2, KIAA0319, C2ORF3, MRPL19,ROBO1,FAM176A,FMR1,S100B,DOCK4,KIAA0319L,DIP2A,GTF2I,and GRIN2B. Research has suggested that reading disability is polygenic in nature and that these genes interact with each other and with environmental factors (Friend, DeFries, Wadsworth, & Olson, 2007;Gayán & Olson, 2001;Price et al., 2020). In the context of this research area, environmental factors refer to a broad set of mostly nongenetic predictors of reading disability status, such as biological sex, birthweight, gestational weeks, mother's highest education, and language ability (Mascheretti, Andreola, Scaini, & Sulpizio, 2018). Mascheretti, Bureau, Trezzi, Giorda, and Marino (2015) investigated gene-gene interactions in reading disability and found that KIAA0319/TTRAP and DYX1C1 interact with GRIN2B for predicting performance on short-term memory tasks in children with reading disability. There is evidence that the genes associated with reading disability interact with each other at the functional level, as many of these genes are involved in several brain development process. Past research has linked many of candidate genes to neuronal migration, neurite outgrowth, cortical morphogenesis, and ciliary structure and function (Newbury et al., 2014). Neuronal migration has been suggested as the neurological basis for dyslexia in prior studies (Martin et al., 2015(Martin et al., , 2016. Mascheretti et al. (2018) performed a systematic review of studies examining environmental factors for dyslexia and reported birthweight and gestational weeks were predictive of dyslexia status and possible interplay between genetic risk and teacher quality and parental education. Recent research has provided some insights into how genes and environmental factors may interact (Gu et al., 2018;Kershner, 2019;Mascheretti et al., 2018;Mascheretti et al., 2013). Gu et al. (2018) examined the interaction between genetic variants in CNTNAP2 and environmental factors and found sex specific interaction; specifically, they found that two variants (rs3779031 and rs987456) in CNTNAP2 were associated with reading disability status in females but not males and that the interaction between rs987456 and scheduled reading time was protective in females. In summary, the past research has revealed that the genetic contributions are a complex system with multiple genes involved, as well as gene-gene and gene-environment interactions.
Despite the recent advances in understanding the genetics of reading disability, there is still much that is not understood. Past genetic studies about reading disability are limited because the genetic associations were evaluated on a one-gene-at-a-time basis, which is inefficient for identifying genetic contributions in complex phenotypes due to statistical constraints and the inability to represent complex etiologies. Therefore, past research may have missed important genetic factors that contribute to or protect against reading disability. An additional complication is that environmental and demographic factors interact with genetic factors but a limited number of studies have examined gene-environment interactions (Becker et al., 2017;Gu et al., 2018;Jerrim, Vignoles, Lingam, & Friend, 2015). No studies to our knowledge have integrated genetic, environmental, and demographic data within the same analysis due to constraints imposed by research design and statistical analysis.
Because our understanding of the genetics is limited by prior statistical constraints, we do not know how many genes are relevant for understanding the genetics of reading disability, which biological pathways are crucial, or how including environment and demographic factors influence genetic associations.
In this paper, we perform a novel study different from prior approaches by hypothesizing that there are (a) multiple genetic markers, environmental, and demographic factors involved in reading disability, (b) informative genetic markers are overrepresented within certain biological process pathways, and (c) genetic markers can be positively and negatively associated with reading disability status.
Our approach is to take advantage of modern machine learning developments that provide effective and efficient approaches for big data modeling and analysis. Specifically, we use a sparse learning method called elastic net (Waldmann, Mészáros, Gredler, Fuerst, & Sölkner, 2013) to identify an array of genetic markers-single nucleotide polymorphisms (SNPs, also known as single nucleotide variants)-predictive of reading disability simultaneously. This approach overcomes the limitation of past research primarily dependent on multiple test correction applied to results of univariate analysis. The correction is known to be overly strict and thus having the risk of missing important genetic associations (Stein et al., 2012;Waldmann et al., 2013). Our machine learning model includes not only SNPs but also environmental and demographic variables so that we can identify how these factors jointly affect reading disability. Furthermore, we perform pathway and network analysis for the SNPs that are found by elastic net to investigate possible gene-gene interactions, as genes involved in the same biological process pathway will have functional interactions with each other and biological process pathways potentially interact.
Our study is performed by leveraging the large population-based database, the Avon Longitudinal Study of Parents and Children (ALSPAC; Boyd et al., 2013), which is a longitudinal birth cohort from the UK. The ALSPAC is ideal for testing our initial hypotheses as it contains a large sample of children with genetic, environmental, demographic, and behavioral data. It is the largest publicly available genome-wide data for reading disability. By applying the aforementioned proposed approach to ALSPAC data, our major findings include the association of novel genes and biological process pathways with reading disability. We also provide evidence that a combination of genetic, environment, and demographic factors was informative for predicting reading disability status, with some factors associated with having reading disability, while other factors were associated with not having reading disability. Lastly, we found that the biological process pathways interacted with each other, suggesting that the genetics of reading disability is a highly complex system.

| Participants
ALSPAC is a population-based birth cohort which has been extensively described in various studies Eicher et al., 2013;Fraser et al., 2013;Paracchini et al., 2008). The total sample size was 15,454 pregnancies, resulting in 15,589 fetuses, and 14,901 were alive at 1 year of age. For this study, we used data from 8,071 participants who had behavioral data and genetic data. Measures included parent surveys and clinical data. Please note that the study website contains details of all the data that are available through a fully searchable data dictionary (http://www.brist ol.ac.uk/alspa c/ resea rcher s/our-data/). Data from this study are available through ALSPAC upon approval by executive board.
The inclusion criteria were as follows: (a) no diagnosis of autism spectrum disorder, (b) normal hearing status at Focus at 7, (c) nonverbal intelligence >72 standard score on the Wechsler Intelligence Scales for Children (Wechsler, Golombok, & Rust, 1992), and (d) enough data to classify as case-control (i.e., child had a minimum of 80% of data necessary for classification). These criteria were based on prior studies that used the ALSPAC for genetic analysis (Eicher et al., 2014;Paracchini et al., 2008;Scerri et al., 2012). We used a more lenient nonverbal cutoff than past studies which used a cut-off of 75 standard score. We used a cut-off of 72 because this represents the common cut-off of 75 minus the standard error measurement.
Lastly, for twin pairs one child was randomly selected for analysis to achieve data independence, which resulted in 186 children being removed the analysis set. This was done for both monozygotic and dizygotic twin pairs.

| Demographics
Biological sex and birthweight in grams were reported at birth.
Maternal education was obtained at 32 weeks' gestation and measures the highest degree the mother had obtained by that point: CSE (certificate of secondary education generally obtained by age 16)/ none, vocational, O levels (ordinary-level subject-specific qualifications obtained at age 16), A levels (advanced-level subject-specific qualifications obtained by age 18, required for entry to college), and college degree (any degree beyond A levels). Child's ethnicity was reported by mothers at 32 weeks' gestation and then ALSPAC classified responses as white or non-white. Bilingual language status was obtained via parent report at Focus at 8 as monolingual or bilingual.
Hearing functioning was measured via bone conduction at Focus at 7. Attention-deficit/hyperactivity disorder (ADHD) status was determined at age 7 using parent and teacher questionnaires. ADHD status was coded by subtype inattentive, hyperactive-impulsive, combined, or typical.

| Reading
Reading skill was measured during Focus at 7 and Focus at 9 using a combination of word reading, spelling, and connected text tasks. At Focus at 7 years, children completed the single word reading subtest on the Wechsler Objective Reading Dimensions (Rust, Golombok, & Trickey, 1993), an experimenter-derived spelling task (Bryant, Nunes, & Barros, 2014), and a phoneme deletion task (Rosner & Simon, 1971). Nonword repetition was measured at Focus at 8 (Gathercole, Willis, Baddeley, & Emslie, 1994). At Focus at 9, children completed single word reading, nonword reading (Nunes, Byrant, & Olsson, 2003), and spelling tasks like the ones presented to those at Focus at 7 years but with new word/items. Additionally at age 9, children completed the Neale Analysis of Reading Ability (NARA) (Neale, McKAY, & Childs, 1986), which provided scores for reading rate, accuracy, and reading comprehension.

| Nonverbal intelligence
The WISC-III (Focus at 8) yielded an estimate of nonverbal intelligence. Nonverbal intelligence was used to filter out children with potential cognitive impairments and was included in the sparse machine learning model to determine whether nonverbal intelligence was an important predictor of reading disability case-control status above and beyond genetics. Nonverbal IQ score was derived from the following subtests: Picture Completion, Coding, Picture Arrangement, Block Design, and Object Assembly.

| Language
Receptive language was assessed using the Wechsler Objective Language Dimensions Language Comprehension subtest (WOLD) (Rust, 1996). Vocabulary was measured using the WISC-III vocabulary subtest (Wechsler et al., 1992). These language variables were used to compare the samples and included in sparse machine learning models to determine whether language ability was an important predictor of reading disability case-control status above and beyond genetics.

| Classifying case-control status
We used single word reading at 7, spelling at 7, phoneme deletion, nonword repetition, single word reading at 9, nonword reading at 9, spelling at 9, NARA fluency, NARA accuracy, and NARA reading comprehension to classify children. We computed zscores within the whole dataset and then classified participants.
Children were classified as cases if they scored less than −1 z-score on three or more reading tasks. This classification was based on work by Eicher (Eicher et al., 2014;Paracchini et al., 2008;Scerri et al., 2012) using the ASLPAC dataset. We expanded the number of reading tasks considered so as to better represent the skills that reading disabilities affect. Using this classification, system yielded 1,215 cases and 6,586 controls. Table 1 provides descriptive statistics by group for demographic, reading, nonverbal intelligence, and language variables.
Cryptic relatedness was measured as proportion of identity by descent (IBD > 0.1). Related subjects that passed all other quality control thresholds were retained during subsequent phasing and imputation. 9,115 subjects and 500,527 SNPs passed these quality control filters.

| SNP screening
Because we have a super-high-dimensional dataset in terms of the number of SNPs, a screening procedure was recommended, which warrants a more robust model than putting all the SNPs into a multivariate model to link with reading disability (Fan & Lv, 2008). Following this recommendation, we used genomewide association to screen SNPs. Genome-wide association (GWA) was completed in PLINK (version v2.00a2LM) (Chang et al., 2015) and performed chromosome by chromosome. The top 100 SNPs genome-wide before multiple testing correction and any surviving SNPs after multiple testing correction were included in the subsequent multivariate modeling. A similar screening method was used by other genetic studies employing lasso methods to identify informative SNPs (Cho, Kim, Oh, Kim, & Park, 2009).
In addition to SNPs, we also included, in the multivariate model, nonverbal IQ, vocabulary, receptive language score, ADHD status (inattentive only, hyperactive only, and combined type), birthweight, bilingual language status, mum's highest education, and child's ethnic background. This is to factor out the potential influence from these demographic, environmental, and behavioral variables on reading disability so that the direct association between SNPs and reading disability can be better revealed.

| Multivariate modeling by elastic net
We used an elastic net model to link reading disability (case vs. control) with the SNPs which survived the screening step, as well as demographic, environmental, and behavioral covariates. An elastic net is a regularized regression model to enable simultaneous variable selection (in our case, variables are the SNPs) in high-dimensional setting (Fan & Lv, 2008;Zou, 2006). It adds two regularization terms to the loss function of an ordinary regression model: one L1-norm regularization whose effect is to force the regression coefficients of small effects to be exactly zero, thus enabling variable/SNP selection; another L2-norm regularization to make sure highly correlated SNPs are selected. There are two tuning parameters corresponding to the two regularization terms to balance with the loss function.
Tuning parameters selection is typically done using cross-validation (see below).
The ratio between case and control was high (1,215:6,856); therefore, we decided to use oversampling to guarantee a 1:1 ratio between groups when performing cross-validation. We used fivefold cross-validation to determine the best tuning parameters. In fivefold cross-validation, the sample is split into five random groups, four of which are used to train the model and one for testing. This splitting repeats until every "fold" has served as the test set. Cross-validation was performed 10 times to select tuning parameters. After the best tuning parameters were identified, the model was refit using all the data to generate coefficients.

| Pathway enrichment and network analysis
We mapped informative SNPs from the elastic net to genes using g:SNPense on g:Profiler (Ensembl 90, Ensembl Genomes 37, rev 1741 build date 2017-10-91) (Reimand et al., 2016). After mapping SNPs to genes, we performed enrichment analysis using g:GOSt on g:Profiler. g:Profiler was selected over similar tools because recent comparisons on the available tools showed that g:Profiler has the most up-to-date repository of pathways and draws from multiple curated sources (e.g., KEGG, Reactome). We selected the following settings on g:Profiler: Homo sapiens; significant only; size of functional category between 10 and 500; size of query 3; significance threshold-g:SCS threshold; gene ontology-biological process; and biological pathways-Reactome. However, enrichment analysis alone only provides what pathways are overrepresented in a gene list, it cannot tell us how these pathways interact. Therefore, we used Cytoscape to explore how the pathways were connected (Shannon et al., 2003). Cytoscape performs network analysis on biological pathways and produces visualizations and network statistics. We imported gmt files from g:Profiler into Cytoscape and used enrichment map with standard settings. We applied the elastic net model to the SNPs without and with demographic, environmental, and behavioral variables. We call these "Analysis_1" and "Analysis_2," respectively, hereafter. Ninety-one and 68 SNPs were identified with main effects in Analysis_1 and Analysis_2, respectively. The protective SNPs were located within 16 genes with the majority of variant effects being intronic or noncoding. There were eight genes present in both the risk and protective gene lists. Additionally, many of the SNPs were located within the same gene. For example, there were four intron variant SNPs located within ZNF165 (chr 6).

| Ethical approval
Compared to a null model, the model in Analysis_2 fits the data significantly better (F (1, 76) = 10.35, p < .0001). The risk SNPs were located within 23 genes, while the protective SNPs were located within 13 genes. Non-white children via parent-reported ethnicity, monolingual language status, better receptive language skills, higher nonverbal IQ, increased mother's highest education, and better vocabulary skills were all associated with typical reader status, while low birthweight was associated with reading disability status. Examination of sample distributions suggests that ethnicity and bilingual language status were selected due to higher number for monolingual and non-white children in the typical reader group; therefore, these features may not be as informative as the language features. Although all children of non-European ancestry were removed for the analyses, a small minority of parents still reported their children as "non-white." We compared the results from both analyses to identify which SNPs replicated internally and map the replicated SNPs to genes.

| Pathway analysis
For Analysis_1, the risk gene list was significantly overrepresented in fourteen biological process pathways after multiple testing correction (false discovery rate), while the protective gene list was significantly overrepresented in three biological process pathways. There was one pathway shared by both lists, neuron migration. For Analysis_2, the risk gene list was overrepresented in seven biological process, while the protective gene list was overrepresented in three biological process pathways. All overrepresented pathways were associated with brain and/or dendrite development. Table 3 presents information for all pathways identified.

| Network analysis
We imported the pathway results from g:Profiler into Cytoscape for Analysis_1 and Analysis_2, and performed network analysis.

| D ISCUSS I ON
We investigated the genetic and environmental contributions to reading disability using a combination of GWAS, sparse machine learning, and pathway and network analysis. Using the pipeline in this study, we were able to overcome a common limitation of genetic studies of reading disability, namely lack of significant findings after multiple testing correction. There are three major findings from this study. We identified novel genes associated with reading disability case-control status, some of which were indicative of risk and others were protective. We provide evidence that a combination of genetic, environment, and demographic factors is informative for predicting reading disability case-control status. Lastly, we found that biological process pathways associated with reading disability case-control status interacted with each other, suggesting that the genetics of reading disability is a highly complex system.

| Novel genes
The elastic net models selected SNPs genome-wide as informative for predicting reading disability case-control status. We compared results between the two models in Analysis_1 and Analysis_2 and found that there were 60 SNPs shared by both model results, 41 positively associated with reading disability and 19 negatively associated with reading disability. We found that there were SNPs associated with reading disability and typical reader status. Multiple SNPs from KIAA0319 (Zhao, Chen, Zhang, & Zuo, 2016), DCDC2 (Meng et al., 2005), and other previously identified genes were selected by the elastic net model, providing replication and further support for these genes being involved in reading disability (Poelmans, Buitelaar, Pauls, & Franke, 2011).
We associated 18 novel genes with reading disability case-control status. Of these, RAPGEF2, DLC1, TDP1, and RELN were highly represented in the enriched biological process pathways and are therefore more likely to be involved in the biology of reading disability. The combination of the elastic net findings and the pathway analyses indicate that SNPs in these four genes play a larger role in reading disability case-control status that previously assumed.
RAPGEF2 (chr4q32.1) is a guanine nucleotide exchange factor and is involved in neuron migration and brain development (Maeta et al., 2018). RAPGEF2 helps with the formation of the major forebrain fiber connections for the corpus callosum, the anterior commissure, and the hippocampal commissure during brain development.

F I G U R E 1
This figure is a graphical display of the relationships between overrepresented biological process using gene lists from Analysis_1. Biological process pathways are represented by circles, called nodes. Relationships between pathways are presented by lines, called edges. The more statistically powerful the edge between two nodes, the wider the edge. Blue nodes are biological pathways which were overrepresented for on the dyslexia risk gene list, while green nodes were overrepresented on the protective gene list. Nodes significantly overrepresented in both gene lists are therefore colored blue and green RAPGEF2 has been associated with decrease in the ability to learn and deficits in working memory in knockout mouse models (Maeta et al., 2018). This is the first study to associate markers from RAPGEF2 with reading disability in humans. DLC1 (chr 8p22) encodes a GTPase-activating protein that terminates downstream signaling of GTPases RHOA, RHOB, RHOC, and CDC42 and plays a critical role in cell migration and proliferation. Activation of DLC1 increases cell migration, but reduces directionality (Tai et al., 2008). DLC1 has a CpG island making it a target for methylation and gene silencing.
DLC1 has most frequently been linking to colon cancer, but work in mouse models has shown that it is a critical gene for brain development during embryogenesis (Durkin et al., 2005). TDP2 (chr6p22.3) is involved in DNA repair and protects transcription of genes necessary for neurological development (Hornyak et al., 2016). TDP2 is also known as TTRAP and is located within the DYX2 loci and near DCDC2, a frequently reported gene associated with reading disability (Poelmans et al., 2011). Prior work on DCDC2 has associated SNPs from TDP1/TTRAP with reading disability (Cope et al., 2012;Deffenbacher et al., 2004;Meng et al., 2005). Given their proximity, it is likely that TDP1 and DCDC2 interact with each other. RELN (chr7q22.1) is a protein encoding gene which provides instruction for reelin (Devasenapathy et al., 2018). Reelin is expressed in the brain before and after birth and activates the reelin signaling pathway (Folsom & Fatemi, 2013). The reelin signaling pathway is responsible for neuron migration, as well as axon maintenance and neuronal signaling in adulthood (Folsom & Fatemi, 2013). One previous study has associated a triallelic unit in the RELN gene to a family with reading disability and the interactions between RELN, DCDC2, and ROBO1 (Devasenapathy et al., 2018). Additionally, change in RELN which decrease the production of reelin has been associated with autism spectrum disorder and other neurodevelopmental disorders (Folsom & Fatemi, 2013;Ishii, Kubo, & Nakajima, 2016). As we excluded all children with suspected autism spectrum disorder, we can infer that changes in RELN have broader effects than just deficits in social interaction and communication, but may also affect general learning structures and functions. RAPGEF2, TDP1, DLC1, and RELN all have roles in brain development, function, and maintenance which align with genes more frequently associated with reading disability (e.g., KIAA0319, DCDC2). These findings provide some insight into the underlying biology of reading disability.

F I G U R E 2
This figure is a graphical display of the relationships between overrepresented biological process using gene lists from Analysis_2. Biological process pathways are represented by circles, called nodes. Relationships between pathways are presented by lines, called edges. The more statistically powerful the edge between two nodes, the wider the edge. Blue nodes are biological pathways which were overrepresented for on the dyslexia risk gene list, while green nodes were overrepresented on the protective gene list. Nodes significantly overrepresented in both gene lists are therefore colored blue and green Several genes had multiple SNPs selected by the elastic net; however, SNPs from the same gene did not always have the same directionality. For example, KIAA0319 had six SNPs selected as informative. Of these, two were found to be positively associated with reading disability (rs6935076 and rs16889506) and four were found to be negatively associated with reading disability (rs16889556, rs699463, rs4504469, and rs2038137). One interpretation of this is that some SNPs increase risk of developing reading disability, while other offer protection from reading disability. This interpretation would fit in with recent findings that there are protective alleles for reading disability (Powers et al., 2016;Shao, Niu, et al., 2016). An alternative solution is that dichotomizing reading ability may serve as a barrier to understanding the genetics of reading (dis)ability. This is an avenue for future research.

| Environment and demographic factors
Alongside SNPs, there were environmental and demographic features that were informative for reading disability case-control status. Of these, better receptive language skills, higher nonverbal IQ, increased mother's highest education, and better vocabulary skills were all associated with typical reader status, while low birthweight was associated with reading disability status. All of ademic difficulties has shown that maternal education works as a preventative measure for difficulties (Dollaghan et al., 1999;Lervåg, Dolean, Tincas, & Melby-Lervåg, 2019), but this relationship is complicated (Harding, Morris, & Hughes, 2015). Additionally, for children at family risk of reading disability nonverbal IQ can serve as a protective factor. Lastly, lower birthweight has been associated with a higher risk of reading disability (Mascheretti et al., 2013(Mascheretti et al., , 2018, although the exact nature of this association requires more exploration. Considering the number of genes and biological processes that are linked to reading disability that play a role in brain development before and after birth, any factors which could impair brain development are likely to increase risk for reading disability. In summary, our findings regarding the influence of selected environmental and demographic factors support past research and the assumption that genes potentially with environment and demographic factors.

| Biological process
Our analyses identified several significantly enriched biological processes, including neuron migration, nervous system development, and dendrite development and regulation. Some of these pathways, like neuron migration, have been previously linked to reading disability (Carrion-Castillo, Franke, & Fisher, 2013;Poelmans et al., 2011), while others have not. Additionally using network analysis, we mapped the interactions between pathways. The network analysis revealed that all of the identified pathways interacted with each other either directly or mediated by neuron migration or nervous system development.
Neuron migration (GO:0001764) was the most overrepresented biological process pathway and was identified for both the risk and protective gene lists. Neuron migration is a large biological process pathway that is responsible for organizing neuronal structure and organization in the developing brain. This pathway encompasses a number of smaller pathways, including the reelin signaling pathway discussed earlier. Neuron migration has been previously associated with reading disability in multiple samples (Luciano, Gow, Pattie, Bates, & Deary, 2018;Poelmans et al., 2011); therefore, we provide replication of this finding. Neuronal migration has been hypothesized to be an underlying biological etiology of reading disability because of the differences in brain structure and function in people with reading disability compared to typical readers (McCandliss & Noble, 2003;Niogi & McCandliss, 2006;Waldie et al., 2017). Our evidence for neuron migration fits in with current research examining the gene-brain-behavior model (Landi & Perdue, 2019). An extension from our study is that it is possible that neuron migration was shared for positive and negative associated genes suggesting that the genetic architecture of reading is shared for reading disability and typical reading which raises new questions about the development of reading.
Beyond neuron migration, we also identified several pathways associated with brain development and regulation of dendrite development. The brain development pathways were enriched in the risk gene sets and whose primary roles are structural development from formation to mature structure (e.g., forebrain development, telencephalon development). The dendrite regulation pathways were enriched in protective gene set. The dendrite regulation pathways were primarily responsible for creating axons (e.g., regulation of plasma membrane bounded cell projection organization) and neurons in early development and then maintaining axon and neuron functions throughout the lifespan. Both sets of pathways, brain development and dendrite regulation, interacted with neuron migration furthering the possibility that there is a shared genetic and neurobiological architecture for reading. Additionally, the fact that all of the pathways are associated with brain development furthers the gene-brain-behavior model under investigation (Landi & Perdue, 2019).

| Limitations and future directions
There are a number of limitations for this study. First, several of SNPs selected by the elastic net model were imputed during genotyping. This suggests that either our screening methods were not stringent enough or the model had a tendency to select markers which were correlated due to location. This was overcome by using pathway analysis in conjunction with elastic net as imputed genes were dropped from pathway analysis. Second, reading is not a dichotomous skill-our phenotype does not reflect the nature of reading and dichotomizes a continuous skill using an arbitrary cut point.
This decision results in two problems: (a) We are not adequately reflecting the nature of reading and (b) we decrease power by creating a smaller group of cases. Although our case-control design matches prior research, we recognize these limitations. We attempted to overcome these limitations by using oversampling methods to compensate for the unbalanced design. Third, our model could not quantify interactions between genes and environmental/demographic factors. Instead, our model examined the combined influence, but not how they interacted. This is a limitation that can be overcome in future analyses using alternative models and/or databases. Fourth, this is a small sample size in terms of genetic studies overall, but this sample size is comparable to genetic studies for dyslexia or reading disability. The goal of this study was to test our pipeline and determine some pilot results using a well-known database. Lastly, our findings, especially the novel genes and biological pathways, need to be replicated in other studies with more diverse genetic populations.
The ALSPAC database is a great database for testing new methods and obtaining discovery findings because of its size, breadth of measures, and high data quality, but findings from it can only be generalized to Caucasian genetic and English-speaking populations.
Reading, however, is a worldwide skill and to truly understand the genetic, environmental, and demographic factors of reading (dis) ability we need to include samples with diverse languages, socio-

| Conclusion
Our study suggests there are multiple genes associated with reading disability case-control status, which aligns with numerous studies (Eicher et al., 2014;Fisher et al., 1999;Gu et al., 2018;Hofmeister et al., 2015;Poelmans et al., 2011;Scerri et al., 2011). Our analysis approach allowed us to investigate the impact of multiple genetic markers without losing data due to multiple test correction. In doing so, we identified several novel genes. We provided evidence that mother's education and child language skills may provide protection from genetic risk. Additionally, our pathway and network analyses indicated that neuron migration, brain development pathways, and dendrite regulation pathways are associated with reading disability case-control status and that brain development and dendrite regulation pathways interact with each other through neuron migration.
Our results support the hypothesis that reading disability represents a complex system with multiple genes, environmental, and demographic factors involved in an interactive fashion. Furthermore, our results suggest that there is a shared genetic and neurobiological architecture for reading (dis)ability which requires more research. of America) using support from 23andMe.

CO N FLI C T O F I NTE R E S T
The authors declare that they had no financial or commercial conflicts of interest.

AUTH O R CO NTR I B UTI O N
HSL conceived and designed the study under the mentorship of JL and VD. HSL analyzed and interpreted the data. XL provided critical support for analyzing the data. HSL drafted the manuscript with input and suggestions from JL and VD. HSL approved the final version of the manuscript on behalf of all the authors.

PE E R R E V I E W
The peer review history for this article is available at https://publo ns.com/publo n/10.1002/brb3.1735. Note p-value is the unadjusted p value from genome-wide association tests. Previous literature SNPs were included in genome-wide association analyses, but may not have been in the top 100.

DATA AVA I L A B I L I T Y S TAT E M E N T
Abbreviation: Chr, chromosome.