The amplification of genetic factors for early vocabulary during children’s language and literacy development

The heritability of language and literacy skills increases during development. The underlying mechanisms are little understood, and may involve (i) the amplification of early genetic influences and/or (ii) the emergence of novel genetic factors (innovation). Here, we use multivariate structural equation models to quantify these processes, as captured by genome-wide genetic markers. Studying expressive and receptive vocabulary at 38 months and subsequent language, literacy and cognitive skills (7-13 years) in unrelated children (ALSPAC: N≤6,092), we found little support for genetic innovation during mid-childhood and adolescence. Instead, genetic factors for early vocabulary, especially those unique to receptive skills, were amplified. Explaining as little as 3.9%(SE=1.8%) of the variation in early language, the same genetic influences accounted for 25.7%(SE=6.4%) to 45.1%(SE=7.6%) of the variation in verbal intelligence and literacy skills, but also performance intelligence, capturing the majority of SNP-heritability (≤99%). This suggests that complex verbal and non-verbal cognitive skills originate developmentally in early receptive language.


Introduction
Individual differences in vocabulary during the preschool period are predictive of many later language- and literacy-related skills [1][2][3][4] , an important component of academic achievement 5 . For example, a latent factor consisting of infant expressive and receptive vocabulary size at 16-24 months was found to predict vocabulary size, as well as performance on tests of phonological awareness, reading accuracy and reading comprehension in children five years later 3 . Similarly, infants with a larger expressive vocabulary at 24 months subsequently showed a larger vocabulary as well as better decoding, word recognition, and passage comprehension skills when assessed up to primary school 4 .
Associations between infant vocabulary and language and literacy skills during later life may arise due to shared underlying aetiologies. According to the "simple view of reading" theory, reading comprehension is the product of both printed word recognition (decoding) and oral language comprehension 6 . Early vocabulary is a central component of both these abilities 7 . Decoding is substantially based on phonological awareness (i.e. the awareness of sound structures of speech), which develops in the preschool period and has been shown to be related to vocabulary size; listening comprehension (i.e. the understanding of spoken language), particularly bottom-up processing, necessarily begins with vocabulary comprehension. Spelling performance is also closely related to phonological awareness and other phonological abilities 8 . However, the biological processes that underlie these complex developmental interrelationships are only partially understood.
Variation in expressive and receptive vocabulary size, assessed during the first four years of life, is modestly heritable, while genetic influences on language and literacy skills assessed from mid-childhood to early adolescence are moderate to strong [9][10][11][12] . Specifically, longitudinal twin studies reported heritabilities (twin-h²) of 22%-28% for a combined language measure including expressive vocabulary at 2, 3 and 4 years of age 10 . Similar estimates were obtained for expressive vocabulary at 15-18 and 24-30 months of age in independent population-based samples, using genome-wide single-nucleotide polymorphism (SNP) information (SNP-h²=13%-14%) 11 . In contrast, the heritability for language and literacy skills assessed from mid-childhood onwards is larger, with twin-h² estimates of 47%-72% 9,10 and SNP-h² estimates of 32%-54% 12 . However, developmental stages nonetheless genetically overlap, as shown by moderate genetic correlations reported in twin research 9,10 .
The increase in heritability from early childhood to adolescence has been reported for many cognitive skills 13,14 , suggesting overarching aetiological mechanisms that may involve processes of genetic innovation and amplification 15 . Innovation refers to novel genetic factors emerging during development (i.e. previously unrelated genetic variation becoming associated with a trait). In contrast, amplification refers to genetically stable influences that are active throughout development, explaining increasingly more variation with progressing age 13 . A meta-analysis of twin studies on cognitive abilities suggested that novel genetic influences predominate during the transition from early to middle childhood, but wane quickly, with enhanced genetic stability and amplification processes dominating from 8 years of age onwards 13 . This developmental paradigm is consistent with twin study findings examining genetic links between early language (including expressive vocabulary and syntax skills between 2-4 years of age) and mid-childhood/adolescent language 10 and reading 9 , based on latent factor models. Thus, it is possible that innovation rather than amplification processes will account for the observed increase in heritability during language and literacy development, not only in twins 9,10 , but in all typically developing children.
Furthermore, these processes may represent a developmental paradigm that has relevance not only for language and literacy skills, but cognitive functioning in general, possibly involving "generalist genes" that impact on many related traits 16 .
However, beyond latent factor twin analyses 9,10 , the specific processes genetically linking early vocabulary skills with language, literacy and cognition later in life are little characterised. In particular, genetic relationships with early receptive vocabulary are unknown, and the spectrum of interrelated skills, shaping language, literacy and cognition, affected by amplification processes is only partially understood. Here, we use SNP information from directly genotyped common genetic markers and structural equation models to quantify these genetic mechanisms within a sample of unrelated children from the Avon Longitudinal Study of Parents And Children (ALSPAC, N≤6,092). Specifically, we study expressive and receptive vocabulary at 38 months and a wide range of later language- and literacy-related skills (7-13 years, including reading, spelling, phonemic awareness, listening comprehension, nonword repetition) as well as verbal and non-verbal intelligence scores, seeking evidence for innovation and/or amplification processes.

Participants
All participants were drawn from ALSPAC, a UK population-based longitudinal pregnancy-ascertained birth cohort (estimated birth date: 1991-1992, Supplementary Methods) 17 20 . Parents were asked whether their child was able to say, understand or both say and understand a word from a list of 123 words. Expressive vocabulary was defined as the number of words a child was able to say or say and understand, whereas receptive vocabulary was defined as the number of words a child could understand or say and understand. In total, 6,092 children had both phenotypic and genome-wide genetic data available (Table 1).
Mid-childhood/adolescent language- and literacy-related abilities (LRAs): Thirteen measures capturing reading, spelling, phonemic awareness, listening comprehension, non-word repetition and verbal intelligence were assessed from mid-childhood onwards (7-13 years, N≤5,749) using both standardised and ALSPAC-specific instruments (Table 1, Supplementary Methods). Combined word reading accuracy and comprehension (age 7 years) was measured using the basic reading subtest of the Wechsler Objective Reading Dimensions (WORD) 21 assessment in addition to word and non-word reading accuracy scores (age 9 years) using an ALSPAC-specific measure 22 . Passage reading accuracy and speed (age 9 years) was captured with the revised Neale Analysis of Reading Ability (NARA II) 23
Phenotype transformation: Vocabulary, LRA and PIQ scores were rank-transformed to achieve normality and to allow for comparisons of genetic effects across different psychological instruments. Vocabulary measures were residualised for sex, age, age² and the two most significant ancestry-informative principal components, calculated using EIGENSOFT 29 (v6.1.4). LRAs and PIQ measures were residualised for sex, age (unless measures were derived using age-specific norms) and the two most significant ancestry-informative principal components.
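As an illustration of the rank transformation step, the following minimal sketch maps raw scores to normal quantiles. It assumes a rank-based inverse normal transformation with a Blom offset (c = 3/8) and mean ranks for ties; the text only states that scores were rank-transformed, so the exact variant, the function names and the example scores are assumptions. Residualisation for covariates is not shown.

```python
from statistics import NormalDist


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # mean of the 1-based ranks in the block
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_inverse_normal(values, offset=0.375):
    """Map scores to standard normal quantiles of (rank - c) / (n - 2c + 1).

    The offset c = 3/8 (Blom) is an assumption of this sketch.
    """
    n = len(values)
    std_normal = NormalDist()
    return [
        std_normal.inv_cdf((r - offset) / (n - 2 * offset + 1))
        for r in average_ranks(values)
    ]


# Hypothetical vocabulary counts (words out of a 123-word list), not study data.
scores = [12, 45, 45, 101, 123, 7]
transformed = rank_inverse_normal(scores)
```

The transformed values keep the ordering of the raw scores, give tied children the same value, and are symmetric around zero, which is what makes effects comparable across instruments with different scales.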

Analyses
Phenotypic correlations: Phenotypic correlations (rₚ) were calculated for untransformed and rank-transformed scores using Spearman rank-correlation and Pearson correlation coefficients respectively.  Table 2). Estimates were combined using random-effects meta-regression intercepts, accounting for interrelatedness between LRAs (R:metafor library, R v3.2.0) 35 . For this, a variance/covariance matrix across measures was approximated by including the observed phenotypic correlation matrix, weighted by the standard errors of the path coefficients as estimated by GSEM, analogous to models accounting for correlated phylogenetic histories 36 . As part of sensitivity analyses, the order of the two vocabulary measures at 38 months was reversed within the 13 SEMs (termed "reverse" GSEM, Supplementary Figure 6a). To compare LRA genetic covariance patterns with non-verbal cognitive abilities, we also studied expressive and receptive vocabulary at 38 months together with PIQ at 8 years.
Experiment-wide significance threshold: The effective number of phenotypes was calculated based on phenotypic correlations using matrix Spectral Decomposition (matSpD) 37 , resulting in nine independent measures. This corresponds to an experiment-wide significance threshold of 0.005 (0.05/9).  Table 1).
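The logic of deriving a threshold from correlated phenotypes can be sketched as follows. This assumes the variance-of-eigenvalues estimator commonly associated with matSpD, Meff = 1 + (M - 1)(1 - Var(λ)/M); which estimator the tool reported for this study is not stated here, and the correlation matrix below is hypothetical.

```python
def effective_number_of_tests(corr):
    """Estimate the effective number of independent traits from a correlation matrix.

    Uses Meff = 1 + (M - 1) * (1 - Var(eigenvalues) / M), with the sample
    variance (denominator M - 1) of the eigenvalues.
    """
    m = len(corr)
    if m == 1:
        return 1.0
    # For a symmetric correlation matrix the eigenvalues sum to M (the trace)
    # and their squares sum to the sum of squared matrix entries, so the
    # eigenvalue variance follows without an explicit eigen-decomposition.
    sum_sq = sum(x * x for row in corr for x in row)
    var_eigen = (sum_sq - m) / (m - 1)
    return 1 + (m - 1) * (1 - var_eigen / m)


# Hypothetical phenotypic correlations between three measures, not study data.
corr = [
    [1.0, 0.5, 0.2],
    [0.5, 1.0, 0.3],
    [0.2, 0.3, 1.0],
]
m_eff = effective_number_of_tests(corr)
threshold = 0.05 / m_eff  # Bonferroni-style correction over effective tests
```

Uncorrelated measures give Meff equal to the number of measures, whereas perfectly correlated measures give Meff = 1, so the correction is less severe than dividing by the raw number of phenotypes.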

Phenotypic and genetic descriptives
Squared path coefficients for the first genetic factor (A1) fully explain genetic variance in expressive vocabulary at 38 months (a₁₁) and genetic variance that is shared with receptive vocabulary
Squared path coefficients for the third genetic factor (A3) account for unique genetic variance in the studied LRAs, independent of genetic factors contributing to both expressive and receptive vocabulary at 38 months (a₃₃, Supplementary Figure 4a). Contrary to our initial hypothesis, we found little evidence for novel genetic LRA influences arising after early childhood (Figure 2, Supplementary Figure 4).
A meta-analysis of absolute Cholesky path coefficients across all 13 SEM models (Supplementary Table 2), correcting for phenotypic inter-correlations (Supplementary Figure 2), confirmed the amplification of genetic influences that are unique to receptive vocabulary at 38 months (meta-path-coefficient a₃₂=0.62(SE=0.06), P<1×10⁻¹⁰, Table 2). Nominal evidence was also found for the amplification of genetic influences that capture the entirety of expressive vocabulary at 38 months (meta-path-coefficient a₃₁=0.20(SE=0.08), P=0.009, Table 2), although it did not pass the experiment-wide multiple testing threshold. Consistent with individual GSEM models, there was little meta-analytic evidence for novel genetic influences arising after early childhood (meta-path-coefficient a₃₃=0.34(SE=0.29), P=0.24, Table 2). Literacy-specific meta-analyses of reading measures only and spelling measures only suggested that developmental genetic amplification patterns involve primarily, but not exclusively, reading-related abilities (Table 2).
In reverse GSEM models, the second genetic factor (A2) captures genetic influences that are unique to expressive vocabulary (i.e. independent of receptive vocabulary) and explained an additional 5.9%(SE=3.0%) of its phenotypic and a third of its genetic variance (factorial co-heritability: 0.33(SE=0.17)). Both early genetic factors accounted for phenotypic variation in VIQ, reading and spelling abilities, but also phonemic awareness and/or nonword repetition (Supplementary Figures 6-7).
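The relationship between Cholesky path coefficients, explained variance and factorial co-heritability can be made concrete with a small sketch. It assumes traits standardised to unit phenotypic variance, so that a squared path coefficient is the share of phenotypic variance explained by a genetic factor. All coefficients and the phenotypic correlation are hypothetical and are not estimates from this study; the function names are illustrative only.

```python
def variance_components(paths):
    """Summarise a lower-triangular matrix of Cholesky path coefficients.

    paths[i][j] is the path from genetic factor A(j+1) to trait i+1, so
    paths[2][1] corresponds to a32. Unit phenotypic variance is assumed.
    """
    # Squared paths: phenotypic variance explained by each genetic factor.
    explained = [[a * a for a in row] for row in paths]
    # Row sums: total genetic variance captured by the model (SNP-h2).
    snp_h2 = [sum(row) for row in explained]
    # Factorial co-heritability: share of a trait's genetic variance per factor.
    co_h2 = [[e / h2 for e in row] for row, h2 in zip(explained, snp_h2)]
    return explained, snp_h2, co_h2


def genetic_covariance(paths, i, j):
    """Genetic covariance between traits i and j implied by shared factors."""
    return sum(a * b for a, b in zip(paths[i], paths[j]))


# Hypothetical coefficients, NOT estimates from the study.
# Rows: expressive vocabulary, receptive vocabulary, a later LRA.
paths = [
    [0.4, 0.0, 0.0],
    [0.3, 0.2, 0.0],
    [0.2, 0.5, 0.1],
]
explained, snp_h2, co_h2 = variance_components(paths)

r_p = 0.2  # hypothetical phenotypic correlation, receptive vocabulary vs LRA
bivariate_h2 = genetic_covariance(paths, 1, 2) / r_p
```

In this toy example, factor A2 explains 4% of the variance in receptive vocabulary but 25% in the later trait (amplification), while the trait-specific factor A3 explains only 1% (little innovation).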
To identify the most predictive genetic variance components of early vocabulary using either forward or reverse GSEM models, we studied model-specific factorial co-heritabilities and bivariate heritabilities (which are identical for forward and reverse GSEM). The largest contribution to genetic variance in later LRAs was confirmed for genetic influences uniquely related to receptive vocabulary (A2, forward GSEM, Supplementary Figure 4a), explaining up to 95%(SE=20%) of LRA SNP-h², especially for reading and VIQ scores (Supplementary Table 3

Discussion
This study provides evidence that the amplification of early vocabulary-related genetic factors plays a major role during later language and literacy development. Multivariate variance analyses using genome-wide data showed that genetic influences underlying receptive vocabulary at 38 months, and to a lesser extent expressive vocabulary at the same age, could fully account for genetic variation in many reading and spelling skills, but also verbal and non-verbal cognitive functioning, ascertained later in development. Independent of model specification, there was little evidence for novel LRA-related genetic influences emerging during mid-childhood and adolescence. Thus, despite increases in trait heritability from early childhood to adolescence, developmental variation in language and literacy skills may not fully adhere to a developmental paradigm that exclusively predicts genetic innovation during the transition from early to middle childhood 10,13 .
Instead, the identification of amplification processes is consistent with twin research reporting moderate genetic correlations between latent factors for early language (including expressive vocabulary and syntax skills between 2-4 years of age) and both mid-childhood and/or adolescent latent language 10 and reading 9 . For example, latent factors for early language explained ~12% of the phenotypic variation in a latent factor for mid-childhood reading 9 using individual pathway models. Based on bivariate heritability patterns between latent factors, accounting only for about a third of phenotypic correlations 9,10 , findings have been interpreted as evidence for novel genetic influences emerging during mid-childhood 10 . In the present study, early vocabulary-related genetic factors could explain up to 45.1% of the phenotypic variation in subsequent LRAs, especially for literacy and verbal cognition, accounting for the majority of SNP-h² (≤95%). Bivariate heritability estimations confirmed these findings. Similar amplification patterns were observed between early vocabulary and PIQ, although the evidence for bivariate heritability with PIQ was less strong. This suggests that genetic variance between early vocabulary and subsequent verbal cognition and literacy, but also non-verbal cognition, is shared, showing developmental genetic stability.
However, the striking similarity among structural models for many literacy skills may partially reflect their complex phenotypic interrelatedness.
The largest amplification of genetic variation contributing to later literacy and cognition was identified for a small proportion of genetic influences that is unique to receptive and independent of expressive vocabulary at 38 months of age. Consistently, bivariate heritabilities with early receptive vocabulary accounted for 70-100% of the phenotypic covariance with later reading and cognition skills, although the 95% confidence intervals are wide. In contrast, genetic influences for expressive vocabulary did not substantially contribute to the total genetic variance of later LRAs (based on their factorial co-heritability) except for VIQ scores. Analysing reverse GSEM (where the order of early vocabulary scores is reversed) confirmed these patterns. It is noteworthy that in these reverse GSEM models we also identified evidence for the amplification of a genetic factor that is unique to expressive vocabulary (i.e. independent of receptive vocabulary). However, there was little evidence for a substantial contribution to LRA SNP-h². In addition, the identified bivariate heritability patterns remained unchanged. Thus, our results suggest that genetic variance between early vocabulary and subsequent literacy and cognitive skills is not only shared, but that genetic links are dominated by early receptive vocabulary, suggesting specificity, and thus only partially adhere to the concept of 'generalist genes' 16 . Genetic links with expressive vocabulary still exist, albeit to a lower extent.
The observed differences in genetic overlap with LRAs may reflect differential mechanisms that link receptive and expressive vocabulary-related genetic factors to later reading and cognitive skills. For example, receptive vocabulary may be more strongly related to pre-reading skills, such as phonological awareness and orthographic knowledge, while expressive vocabulary has been previously identified as predictive of word identification 38 . Furthermore, a delay in both expressive and receptive vocabulary is much more likely to lead to problems with later literacy compared to delays in expressive vocabulary alone 39 .
The methodology applied in this study does not allow us to infer specific biological pathways or specific genes encoded by the identified genetic factors. However, it is still possible to speculate about the biological mechanisms that may underlie the observed amplification patterns. Genes are known to have multiple biological functions (pleiotropy), and dynamic gene expression patterns over time and space have been shown for multiple brain-related gene expression modules 40 . The stability of genetic factors across development is furthermore consistent with signalling pathways and genes that contribute to synaptic function and plasticity with important biological roles throughout development 41 , though specifically designed gene-based studies are warranted to confirm such claims.
The increase in SNP-h², comparing early vocabulary skills with later language, literacy and cognitive performance, as observed in this study, may not necessarily involve an increase in genetic variance over time. Instead, it may arise due to genotype-environment correlations, implying an amplification of small genetic differences as children develop, because of environment modification and selection in accordance with their genetic make-up 42  show evidence for both amplification and innovation processes from infancy to later adolescence 46 . A further limitation of the current study is that the CDI Toddler version was developed to assess vocabulary in children up to 30 months 20 , whereas ALSPAC children were assessed at 38 months of age, potentially leading to ceiling effects.
The strength of this work lies in the identification of amplification processes exploiting a temporal sequence of events, suggesting that the developmental origins of later complex cognitive and literacy processes lie in early childhood. Our findings suggest that cheaply and easily administered parent-reported CDI questionnaires, which are widely used to assess children's early language 47 , can be useful instruments to capture common genetic influences affecting individual differences in LRAs many years later in life. Moreover, when applied to large numbers of participants (hundreds of thousands), these parent-reports could become sensitive genetic prediction tools. However, there is a need to improve their predictive power, although moderate to strong correlations between parental judgements of a child's vocabulary and direct assessments of a child's vocabulary suggest instrument validity 48,49 .
In summary, we show that the amplification of a small proportion of genetic influences that uniquely capture early receptive vocabulary plays a major role during later cognitive and literacy development. This suggests genetic stability, with developmental origins of complex cognitive and literacy skills arising early in childhood.

Data availability
The data used is available through a fully searchable data dictionary (http://www.bris.ac.uk/alspac/researchers/data-access/data-dictionary/). Access to ALSPAC data can be obtained as described within the ALSPAC data access policy (http://www.bristol.ac.uk/alspac/researchers/access/).