Age of acquisition impacts the brain differently depending on neuroanatomical metric

Abstract Although researchers generally agree that a certain set of brain areas underlie bilingual language processing, there is discrepancy regarding what effect timing of language acquisition has on these regions. We aimed to investigate the neuroanatomical correlates of age of acquisition (AoA), which has been examined previously, but with inconsistent results, likely influenced by methodological differences across studies. We analyzed gray matter density, volume, and thickness using whole‐brain linear models in 334 bilinguals and monolinguals. Neuroanatomical correlates of AoA differed depending on gray matter metric. Relative to early bilinguals, late bilinguals had thicker cortex in language processing and cognitive control regions, and greater density in multiple frontal areas and the right middle temporal and supramarginal gyri. Early bilinguals had greater volume than late bilinguals in the left middle temporal gyrus. Overall, volume was the least sensitive to AoA‐related differences. Multiple regions not classically implicated in dual‐language processing were also found, which highlights the important role of whole‐brain analyses in neuroscience. This is the first study to investigate AoA and gray matter thickness, volume, and density all in the same sample. We conclude that cognitive models of bilingualism should consider the roles of development and neuroanatomical metric in driving our understanding of bilingual and monolingual language organization.


| INTRODUCTION
Although researchers generally agree that a certain set of brain areas are involved in bilingual language processing, it is unclear what effect timing of language acquisition has on these regions. Twelve peerreviewed studies have examined the relationship between brain structure and age of acquisition (AoA), and results of these studies are conflicting. We suggest that there are three main reasons for differing conclusions across previous studies (see Table 1 for summary): different measures of brain structure (four measure white matter, two measure GM thickness, three measure GM volume, and three measure GM density), different parameterizations of AoA (as a continuous vs. categorical variable with bins that vary across studies), and small group sizes (between n = 12 and n = 39), all of which increase variability in results.
The present study aimed to understand the neuroanatomical correlates of AoA and increase comparability to previous studies through the use of multiple measures of gray matter (GM) structure and larger group sizes.
It has been well-established that brain structure reflects the accumulation of genetic influences and life-long experiences (Alexander-Bloch, Giedd, & Bullmore, 2013). Dual language processing is one skill that should have especially far-reaching effects on the brain because it draws on many domains, including motor skills, emotional processing, and higher-level thinking (Friederici, 2011). This idea has been demonstrated through neuroimaging research, which shows that speaking two languages is related to regions involved in language processing as well as cognitive control regions (Price, 2010) and is reflected in structural changes to these regions (García-Pentón, García, Costello, Duñabeitia, & Carreiras, 2016;Olulade et al., 2016).
Given that there is a sensitive period for processing of sensations such as vision and sound (Brainard & Knudsen, 1998;Harrison, Gordon, & Mount, 2005), it follows that the neural substrates recruited to support L2 acquisition may differ depending on the developmental stage of the individual (Hartshorne, Tenenbaum, & Pinker, 2018). For instance, native-like accent production, phonemic perception, and implicit knowledge of syntax are thought to have an early sensitive period (i.e., before age 7; Johnson & Newport, 1989;Tokowicz & MacWhinney, 2005) due to higher plasticity of certain brain structures (e.g., motor/speech pathways) early in development (Hernandez & Li, 2007). Therefore, it is likely that neuroplasticity manifests differently in relation to age of L2 acquisition (Berken, Gracco, & Klein, 2017), but its exact manifestation differs between studies. Some studies (e.g., Grogan et al., 2012) suggest that the plasticity of certain structures facilitates high language proficiency in bilinguals who learned the L2 early in development. One piece of evidence for this idea comes from a study by Mohades et al. (2012), who found that simultaneous bilinguals (i.e., those who acquired the first and second language before age 3) had higher fractional anisotropy in the left inferior frontal-occipital fasciculus compared to sequential bilinguals and monolinguals. They suggested that this enhanced connectivity permits faster transmission of semantic information. Additionally, Mechelli et al. (2004) found that bilinguals had increased GM density in the left inferior parietal lobule T A B L E 1 Studies on structural correlates of age of acquisition (AoA)  Wei et al. (2015) Volume and area (GM) 36 Bi + R pars orbitalis − R SPL Note: Parameters for inclusion: studies had to examine structural correlates of AoA within bilinguals. For instance, a study that compares only simultaneous bilinguals to monolinguals would not be included because AoA effects cannot be explained; However, a study that examines structural effects of AoA within bilinguals would be included. Measures: DTI, diffusion tensor imaging; FA, fractional anisotropy; GM, gray matter; VBM, voxel-based morphometry; WM, White matter. Groups: Mo, monolingual; Bi, bilingual; Multi, multilingual. Bi Sim , simultaneous bilingual (acquired L1 and L2 from birth). Bi Seq , sequential bilingual (acquired L2 after L1). Bi Early , early bilingual (AoA 4-7 years). Bi Late , late bilingual (AoA >7 years). Results: + indicates that a region increased as AoA increased; − indicates that a region decreased as AoA increased. < and > indicate which group had greater density/volume/thickness in a region. AC-OL, anterior part of corpus callosum projecting to orbital lobe; IFG, inferior frontal gyrus; IFOF, inferior occipitofrontal fasciculus; ILF, inferior longitudinal fasciculus; IPL, inferior parietal lobule; IPPG, inferior posterior parietal gyrus; ITG, inferior temporal gyrus. L, left hemisphere; MFG, middle frontal gyrus; MTG, middle temporal gyrus; PFC, prefrontal cortex; R, right hemisphere; SPL, superior parietal lobule.
(IPL) compared to monolinguals, and that those with the earliest AoA had the highest density. Finally, Grogan et al. (2012) showed that simultaneous bilinguals have greater adaptations to cortical structure than monolinguals and other bilinguals. Specifically, earlier AoA and greater lexical efficiency were both related to greater GM density in the left pars opercularis, possibly due to increased demand for phonological processing (see Table 1 for summary of all structural studies that examine AoA).
On the other hand, some studies suggest that structural changes are more likely to occur in late bilinguals. Berken et al. (2017) suggest that bilinguals with an AoA that is later than the optimal period for language (i.e., nonsimultaneous bilinguals) depend on compensatory processes beyond the activity patterns for a single L1. Based on the sensorimotor hypothesis (Hernandez & Li, 2007), one could surmise that the later an individual begins learning a second language, the more likely it is that structures involved in higher-level cognition  (Berken, Chai, Chen, Gracco, & Klein, 2016a).
Given these mixed hypotheses and results, it is necessary to consider how exactly previous studies operationalize brain structure. Cortical thickness is the measure of the distance between the white matter surface and pial surface. One advantage of this measure is that it is not confounded by head size (unlike volume, which is influenced by surface area; Panizzon et al., 2009). Although the term volume is often used interchangeably with density, these metrics are distinct, with the main difference being that density can be calculated using normalized unmodulated data, whereas volume and CT are calculated using modulated data (Ashburner & Friston, 2000). Modulation accounts for the changes in brain volume that occurred as a result of spatial normalization (Good et al., 2001). Normalization requires the original brain volume to morph (some areas expand and others compress). To account for these changes the voxel intensities are multiplied by a Jacobian determinant. Analyzing unmodulated data yields information about concentration (i.e., density) differences per unit volume in native space; analyzing modulated data yields information about the absolute amount of GM (i.e., volume) (Ashburner & Friston, 2000). The key point here is that MRI-derived structural metrics each reflect unique aspects of experience-dependent plasticity; However, research is still in the process of illuminating how each metric uniquely reflects function, so it is important to consider that a neuroanatomical profile of any demographic variable is likely best explained by examining multiple metrics.
On top of methodological differences, high variability across findings is also likely influenced by small sample sizes, which is a common issue in neuroimaging studies because of the highly exclusive participation eligibility and high time/cost commitments. The research published thus far has advanced our understanding of bilingualism. At the same time, if power is low there is a decreased ability to detect population effects, which "also reduces the likelihood that a statistically significant result reflects a true effect" (Button et al., 2013, p. 365). If it is a true effect, this can result in an exaggerated estimate of the effect's magnitude (Kellmeyer, 2017). One example of this problem comes from a study by Cremers, Wager, and Yarkoni (2017), who demonstrated that inference in functional MRI is disproportionately influenced by small sample sizes. Specifically, the authors took 10,000 horizontal brain slices (i.e., one slice from each of 10,000 subjects) and drew 2,000 random subsamples that ranged between 10 and 150 slices. They found that sample sizes below 30 displayed disproportionately low statistical power, did a poor job of representing true effects that existed in the full sample, and were highly inconsistent in their ability to replicate. This may be one reason for varying results across previous studies, in which the sample size per group was between n = 12 and n = 39 participants (see Table 1).

| THE PRESENT STUDY
The aim of the present study was to investigate the relationship between AoA and structural GM, as well as increase clarity across previous studies. This objective was accomplished by: comparing cortical thickness, volume, and density in the same sample; using both categorical and continuous methods to measure AoA; and doing so in a larger-than-average sample of bilinguals and monolinguals. Based on the bilingual literature at large, we hypothesized that both continuous (regression) and categorical (ANOVA using orthogonal contrasts) analyses would demonstrate that as AoA increases, so does GM volume, thickness, and density, and that these changes would occur specifically in regions related to language processing (Price, 2010) and cognitive control (Abutalebi & Green, 2007;Green & Abutalebi, 2013).
Evidence in favor of this hypothesis would support the idea that late L2 acquisition is an effortful process that requires top-down processing. Although a priori hypothesis are not necessary for wholebrain analyses, we expected that these changes would be seen in the following regions of the cortex: those implicated in cognitive control lingual participants. Fifteen brain scans were removed from analyses (six brain scans were removed due to reconstruction issues, five due to technical MRI scanning issues, three due to not reporting AoA, and one due to brain trauma). This resulted in a final group of 334 brain scans (189 bilingual, 145 monolingual).
Inclusion criteria was based on participants reporting that they were right-handed, had normal or corrected-to-normal vision, and had no history of neurological disorders. Participants were excluded if they were unable to enter the MRI safely (e.g., due to a metal or electronic medical implant or claustrophobia). Participants were males and females living in Houston, Texas (71.34% women and 28.66% men).
Individuals were considered monolingual if they reported no knowledge of languages other than English and fewer than 2 years of foreign language classes. They were asked to self-rate proficiency on a scale of 1-7 (1 = no knowledge, 7 = like a native) their knowledge of English and other languages in which they may have taken classes, and if they self-rated themselves above a 2 on any language other than English, they were excluded from the study.

| Materials and procedure
This study, which gathered and analyzed structural MRI data from eight individual studies, was approved by the University of Houston Institutional Review Board. All participants completed informed consent, background information questionnaires, language history questionnaires, and language proficiency assessments, as well as confirmation that the participant would be able to safely enter the MRI scanner, prior to beginning MRI scanning.

| Background information survey
Participants completed a paper questionnaire on background information that asked about age, gender, birthplace, bilingual/monolingual designation, and SES. SES was determined by coding parents' education status on a scale of 1-6 (1 = no education, 6 = graduate degree) and averaging the two parents' scores. When only one parent's education status was reported, that sole score was used to estimate SES.

| Language history questionnaire
Participants were asked to complete a language history questionnaire, which was used to ensure that they qualified as bilingual or monolingual. The language history questionnaire asked participants to selfreport age of L2 acquisition, daily language use of English and Spanish,

| English and Spanish proficiency measures
Monolinguals, all of whom spoke English as a first language, were asked to complete the Boston Naming Test (Kaplan, Goodglass, & Weintraub, 1983) and/or the following subtests of the

| Design
Within the 12 previous studies that examined this relationship, no two sets of results are the same, which is influenced by the use of different neuroanatomical metrics and parameterizations of AoA across studies. For comparison purposes, we chose to focus on the eight studies that examine GM structure, reasoning that bilingualism is likely to have profound effects on cortical organization, and to permit evaluation of multiple metrics within one type of tissue. We analyzed three different metrics of GM structure and used multiple parameterizations of AoA in our sample in order to increase comparability to previous findings.
This study had a between-subjects design in which participants were grouped as monolingual or bilingual, and bilingual participants were further separated according to AoA (early: AoA = 4-7 years; late: AoA > 7 years), following the parameters used by Klein et al. (2014). The bilingual sample included 121 early and 68 late bilinguals.
Because treating AoA as a categorical variable does not necessarily reflect any concrete age cut-off in development, a regression analysis was also conducted in which AoA was treated as a continuous variable. Examining AoA according to multiple sets of parameters increased the number of previous studies to which we could compare our results.
The dependent variable in this study was GM structure. Three commonly-used metrics of GM include density, volume, and CT. Given that the previous studies on the relationship between AoA and brain structure each use different measures individually (see Table 1), this study is the first to include all three measures in the same sample of bilinguals and monolinguals. Each variable was treated separately to answer three research questions: What is the relationship between AoA (using all above-mentioned grouping methods) and CT; What is the relationship between AoA and GM volume; and what is the relationship between AoA and GM density.

| MRI acquisition
All MRI scans were collected at the Center for Advanced Magnetic Resonance Imaging at Baylor College of Medicine in Houston, Texas.
Brain scans were acquired using a 3. 3.5 | MR image processing

| Image processing in Freesurfer
Both volume and CT analyses were conducted using Freesurfer, version 5.3.0 (available online at http://surfer.nmr.mgh.harvard.edu).
Freesurfer is equipped with semi-automatic preprocessing abilities.
Brain images were preprocessed using cortical-based geometry to determine CT and folding patterns. The program was then used to preprocess images for motion correction and skull-stripping, and to complete the segmentation of cerebrospinal fluid, white matter, and GM. Surface-based registration was used to align cortical folds with a brain template and then volumetric registration was used to reconstruct images. Following the reconstruction process, every MR image was visually inspected slice by slice in horizontal, coronal, and sagittal views using Freeview. Images that had errors were fixed manually, white matter and pial segmentation were reconstructed, and then were inspected once again. At that point, images were either edited again and reconstructed, or processed through a final reconstruction before analysis. 3.6.1 | FreeSurfer-Surface-based cortical thickness and volume

| Image processing in SPM
All analyses were conducted using whole brain analyses. A one-way ANOVA with AoA group as the factor was conducted using FreeSurfer's mri_glmfit command. Additionally, a one-way ANCOVA with proficiency included as a covariate was conducted using the same command. To interpret the results, orthogonal contrasts were created with monolingual, early bilingual (AoA = 4-7 years), and late bilingual (AoA > 7 years) groups. This led to the following contrasts: monolinguals versus bilinguals (i.e., early and late bilinguals), and early versus late bilinguals. The ANOVA and ANCOVA were computed for thickness and volume in each hemisphere. One characteristic of whole-brain analyses is that many voxels will appear significant simply by chance due to the fact that thousands of voxels are being analyzed. Conventional multiple hypothesis testing (e.g., Bonferroni) is often not sensitive enough in brain analyses (see Genovese, Lazar, & Nichols, 2002). This "multiple comparisons problem" was corrected for by setting a voxel-wise false discovery rate was volume) across each hemisphere using an FDR of 0.05. A regression was also conducted with AoA as a predictor of brain structure controlling for English proficiency and percent daily Spanish use. For all ANOVA, ANCOVA, and regression analyses, results that survived the FDR correction were displayed and examined on the three-dimensional brain template that is built into Freesurfer, which was then parceled according to the Desikan-Killiany atlas (Desikan et al., 2006).

| SPM CAT12-Voxel-based volume and density
The same one-way ANOVA and ANCOVA computed in FreeSurfer were used in SPM12, with AoA group as the factor and proficiency as the covariate. The same orthogonal contrasts were used for interpretation: monolinguals versus bilinguals (i.e., early and late bilinguals), and early bilinguals versus late bilinguals. The same regressions conducted in Freesurfer were used in SPM12, with AoA as a predictor of each measure of brain structure (volume or density), and English proficiency and daily Spanish use as nuisance variables. Both volume (modulated) and density (unmodulated) results were extracted from SPM. Results were displayed using xjview, a neuroanatomical viewing toolbox for SPM12.
FDR-correction was applied in SPM, and results with a cluster size larger than or equal to 10 voxels are presented here.

| RESULTS
Whole-brain analyses were conducted on 334 brain scans using general linear modeling to determine differences in brain structure F I G U R E 1 Comparison of ANOVA results showing relationship between AoA and neuroanatomical metric. Three neuroanatomical metrics (gray matter density, volume, and thickness) were analyzed using ANOVA with age of acquisition (AoA) group (monolingual, early bilingual, or late bilingual) as the factor. Warm colors indicate greater density/volume/thickness (bi > mono or late > early) and cold colors indicate lesser density/volume/thickness (bi < mono or late < early  Berken et al., 2016b;Kaiser et al., 2015;Klein et al., 2014;Mohades et al., 2012) and continuous (i.e., Klein et al., 2014;Nichols & Joanisse, 2016;Wei et al., 2015), this study used both group comparisons (ANOVA) and regression analyses (in which AoA was treated as a continuous variable). Analyses of cortical thickness, volume, and density were conducted. A summary of all contrasts that were significant can be seen in Table 3. Results from analyses conducted in Freesurfer (cortical thickness and volume) were displayed on Freesurfer's inflated brain template and labeled following the Desikan-Killiany atlas (Desikan et al., 2006). Results from analyses conducted in SPM12 (volume and density) were displayed using the "Render View" option in the xjview toolbox. The resulting significant clusters were then compared to the hypothesized areas. Results have been summarized in Figures 1 and 2.
As mentioned above, AoA is one of the strongest predictors of proficiency. Parsing these variables out from one another does not always lead to the most informative analyses. In the present study, we include both an ANOVA with AoA as a predictor of brain structure, and an ANCOVA with AoA as a predictor of brain structure and English proficiency as a covariate. This is different from matching participants'

| ANCOVA
An ANCOVA with English proficiency as the covariate revealed that late bilinguals had greater density than early bilinguals in the right IFG.
The ANCOVA also revealed that bilinguals had greater density than monolinguals in a broad set of regions that included bilateral frontal, parietal, and temporal regions (see Figure 2; see Appendix B for details on all clusters). bilinguals showing greater volume than monolinguals in the right cuneus, precuneus, and SPL, and lesser volume in the right MTG, DLPFC, and postcentral gyrus (Figure 1, Table 6; see Appendix C for further details). The same ANOVA in SPM12 showed that bilinguals had greater volume than monolinguals in 41 clusters that were especially concentrated in the temporal lobe; bilinguals had lesser volume than monolinguals in 29 clusters, including the left IFG, MFG, and SFG (FDR-corrected p < .05; Figure 1; see Table 7 for 10 largest clusters;

| Volume
see Appendix D for all clusters). T A B L E 5 Ten largest gray matter clusters where bilingualism related to density--SPM  Figure 2; see Appendix C for all clusters).

| Regression
None of the regression results were significant for thickness, volume, or density.

| DISCUSSION
Although researchers generally agree that a certain set of brain areas underlie bilingual language processing, it is unclear what effect timing of language acquisition has on these regions. Discrepancy among previous findings may be driven by methodological differences, so this study aimed to elucidate this relationship by analyzing a set of three neuroanatomical correlates (GM density, volume, and thickness) of AoA. In the present sample, analyses of GM density revealed AoArelated differences in regions associated with execution of speech and articulation that may occur within a fronto-parietal control network for top-down selection of words, analyses of thickness revealed changes to areas involved in language control, and analyses of volume revealed a smaller set of clusters localized to either the temporal lobes (when analyzed in SPM) or the MTG and DLPFC (when analyzed in Freesurfer). These differences were found when AoA was analyzed categorically in ANOVA analyses (no linear regression results were significant), which implies that language development is better characterized by optimal windows of development than a linear trajectory. Our results demonstrate two main points: (a) models of bilingualism that focus on the neural underpinnings of language control should consider the effects of development on language organization, (b) given that the present findings differed according to neuroanatomical metric and neuroimaging software, neuroscience researchers should consider the role that these parameters play when designing analyses, and (c) although results from previous studies are discrepant, they may each be capturing smaller pieces of a larger picture (see Figure 3).
At the present time, research in the field of bilingualism that has examined the neuroanatomical correlates of AoA has tended to analyze only one metric of brain structure at a time. This is important to acknowledge because in the present study, our findings were driven by which neuroanatomical metric of AoA was used. Changes to neuropil, glia, dendrites, number of soma, and/or underlying white matter structure may all influence each metric (Wenger, Brozzoli, Lindenberg, & Lövdén, 2017), and likely do so in a unique manner. For example, Mechelli and colleagues note that GM density as measured by voxels, "should not be confused with cell packing density measured cytoarchitectonically" (Mechelli, Price, Friston, & Ashburner, 2005, p. 8) because it estimates the overall dimension of an area rather than which cellular adaptations cause that area to expand or compress. Therefore, it is important to avoid characterizing the neuroanatomical correlates of AoA according to only one metric of GM until researchers are able to elucidate the developmental trajectory of each metric.

| Effects of AoA differ according to neuroanatomical metric
Cortical thickness analyses revealed that late bilinguals had thicker cortex than early bilinguals in regions that have been classically implicated in language control. These regions included those that have been implicated in prelexical speech (left STG; Price, 2010), as well as verbal working memory and integrating auditory perception with motor production (left IPL and precentral gyrus; Alain, Arnott, Gillingham, Leung, & Wong, 2015;Simonyan & Fuertinger, 2015). Late bilinguals also had thicker cortex in the left SPL, which is implicated in top-down attentional orienting (Shomstein, 2012). This is in line with Klein et al. (2014), who also found that later AoA relates to thicker left SPL, although in their study this was a continuous (not categorical) relationship.
T A B L E 7 Ten largest gray matter clusters where bilingualism related to volume--SPM The findings of the left SPL, IPL, and STG fit with the argument that late bilinguals in particular may achieve L2 resonance via metacognitive processes that rely on nonlanguage areas (e.g., recoding, rehearsal, and imagery) in order to defend against L1 entrenchment (Berken et al., 2017;Hernandez, Claussenius-Kalman, Ronderos, & Vaughn, 2018;Hernandez, Li, & MacWhinney, 2005).
Density analyses revealed that late bilinguals had greater density (i.e., higher voxel concentration) than early bilinguals in regions related to speech and articulation more so than language control, including the bilateral MFG, IFG, right SMG, and left SFG. The MFG is activated when participants are asked to generate semantically-or phonologically-related words to a stimulus word (Heim, Eickhoff, & Amunts, 2009). Regarding the bilateral IFG, Price (2010) notes that the left pars opercularis of the IFG is involved in processing speech and syntax, but that activation increases in both the left and right pars opercularis when extraction of semantic content in a sentence is particularly difficult. This means that greater use of the pars opercularis may be a product of using top-down predictions for plausible events. Learning to make predictions about the meanings of sentences without understanding every part of the sentence is a common experience for late bilinguals, so it is possible that they use context-based mental predictions in order to process the L2.
Volume analyses showed that late bilinguals had lesser volume than early bilinguals in the left MTG. Interestingly, Kaiser et al. (2015) also found AoA effects of volume, where sequential bilinguals had greater volume than simultaneous bilinguals in the MTG, although their findings were right-lateralized and more posterior than our finding. Price (2010) notes that the middle temporal lobe is likely implicated in making meaning of and integrating multiple semantic concepts in both written and auditory form. For example, activation in the MTG has been shown to occur in response to hearing grammatically correct sentences with plausible (vs. implausible) meanings (Rogalsky & Hickok, 2009), and the anterior MTG has been implicated F I G U R E 3 Summary of previous findings versus present results.
(a) Localizations are for visualization purposes only; Actual size/specificity of cluster may vary. Medial sections not shown. (b) Summary of present findings. Left cerebral hemisphere outline: Gray (1918;public domain) in reading content words with high semantic associations (Davis & Gaskell, 2009;Diaz & McCarthy, 2009). The fact that late bilinguals in this sample showed greater volume in this area may have to do with learning to read in the L2 at the same time as learning the L2 itself, whereas earlier bilinguals on average may have more time to begin speaking the language before learning to read (similar to monolinguals).
Of all three measures, GM volume was overall less sensitive to AoA than density or thickness. It may be the case that GM density and thickness are more sensitive metrics to AoA-related cortical adaptations. Whereas Freesurfer analyses of GM volume revealed that early bilinguals had greater volume than late bilinguals in the left MTG when controlling for English proficiency, SPM analyses did not show any significant difference between these groups. This may be due to differences between Freesurfer's surface-based calculation and SPM's voxel-based calculation. Had the present study only examined GM volume in SPM, it likely would have been concluded that AoA does not have a significant effect on brain structure. In fact, the lack of significant AoA-related volume results in SPM is in line with results from Abutalebi et al. (2015). As far as choice of specific neuroanatomical metric, volume may be less sensitive to AoA than other metrics.
Unlike thickness, volume is influenced by surface area, which is the measure of differences in regional expansion across subjects (Ronan & Fletcher, 2015) and is affected by both the width of sulci and the number of gyrifications in a given area (Im et al., 2008). Given that surface area is included in calculations of volume (and not in calculations of cortical thickness), one possible explanation for the lack of findings is that cortical adaptations in response to timing of language adaptation are less likely to occur in the direction of surface area and more likely to occur in the direction of thickness.

| Development and its role in language organization
Cognitive control (Abutalebi & Green, 2007) and developmental (Hernandez & Li, 2007) models make separate predictions regarding the role of development in determining language organization. The adaptive control hypothesis (Abutalebi & Green, 2007;Green & Abutalebi, 2013) provides an account for how bilinguals control their two languages: the left basal ganglia and ACC modulate left prefrontal cortex activity, which permits the prefrontal cortex and IPL to work in tandem as a top-down control system when executing word selection (Abutalebi & Green, 2007;Green & Abutalebi, 2013). This model accounts for bilingual language control, but further models are needed to predict the effects of development on these cognitive processes.
The sensorimotor hypothesis (Hernandez & Li, 2007) posits that skill acquisition occurs through recruitment of brain structures that are in the midst of development at the time of acquisition. Specifically, skill acquisition in children may occur through sensorimotor processes involving the basal ganglia due to its high plasticity in early development, whereas skill acquisition in adults may be facilitated by laterdeveloping cortical structures involved in higher cognitive processing (such as the temporal and parietal regions and their connections to the medial temporal lobe; Ullman, 2016), and metacognition (such as the frontal lobe; Bradley, King, & Hernandez, 2013 attentional network tests (Kapa & Colombo, 2013). In line with the sensorimotor hypothesis, late bilinguals may depend more on laterdeveloping cortical structures to facilitate the L2, leading to the structural changes seen here.
It is also important to also consider that the neuroanatomical correlates of AoA differed between the ANCOVA (where English proficiency was the covariate) and the ANOVA, as displayed in Figure 3.
This finding demonstrates the point that inclusion/exclusion of particular variables in the field of bilingual research may influence replicability across studies. English proficiency is unique because, instead of being an explanatory variable, it is an outcome variable that is the result of a few main factors (AoA, language exposure, motivation to learn, and potential genetic factors; . Given that, early AoA is a consistently strong predictor of having higher proficiency in both languages (Birdsong, 2018), attempts to separate these two variables may lead to results that are less informative. For instance, matching early and late bilinguals on proficiency would require picking only the early bilinguals with English proficiency that is lower than expected in order to match the late bilinguals who have English proficiency that is higher than expected. In both cases, these individuals would be outliers because earlier AoA is expected to lead to higher L2 proficiency (Granena & Long, 2013;Johnson & Newport, 1989).
In light of the complexity of the decision to include or exclude this variable as a covariate, we analyzed AoA effects both without (ANOVA) and with English proficiency (ANCOVA) as a covariate. As can be seen in Figure 3, significant regions did not overlap between these two analyses. In the ANCOVA, late bilinguals had greater GM density in the right IFG than early bilinguals; in the ANOVA, late bilinguals had greater density in the right STG, SMG, and left MFG.

| Differences between bilinguals and monolinguals
The current study also considered structural differences between bilinguals and monolinguals. GM analyses revealed differences in a broad set of cortical regions that varied depending on the metric. Cortical thickness analyses showed that bilinguals had thicker cortex than monolinguals in a set of clusters concentrated near the occipital and temporal lobes, especially the bilateral LOC, STG, ITG, and right MTG and lingual gyrus. Bilinguals showed greater density than monolinguals in a broad set of bilateral temporal, parietal, and frontal regions (see Figure 1). Volume results were not as straightforward since the results differed between SPM12 and Freesurfer (likely influenced by SPM's voxel-based and Freesurfer's surface-based calculations).
Results from the analyses using SPM12 showed that bilinguals had greater volume than monolinguals in the bilateral temporal lobes and lesser volume in the left IFG, MFG, and SFG. Results using Freesurfer showed that bilinguals had greater volume than monolinguals in the right cuneus and SPL, and lesser volume in the right MTG and DLPFC.
These findings demonstrate that the ability to speak two languages can be linked to differences in brain structure. Multiple factors may play into these adaptations, including the need for top-down selection of words and inhibition of words in the nontarget language (Costa, Pannunzi, Deco, & Pickering, 2016;Hernandez et al., 2019).
In the present case, density analyses were more sensitive to alterations that may occur within a fronto-parietal control network, thickness analyses were more sensitive to alterations in brain areas associated with visual and semantic processing, and volume was more sensitive to either temporal (SPM) or a smaller set of regions in the middle temporal lobe and DLPFC (Freesurfer). Given that bilingualism is often treated as a monolithic variable, it is interesting that the regions seen here did not necessarily overlap with those in the late versus early bilingual comparisons (see Figure 1). These findings emphasize the point that the effects of language differ in relation to individual differences in language experience, and that different neuroanatomical metrics are sensitive to different sets of languagerelated adaptations.

| The role of software
The field of neuroimaging research is fortunate to have access to multiple software packages to conduct analyses that would have been previously not pragmatic or possible due to the complex computational analyses required to measure the thousands of voxels in one brain volume. Although the present study examined two popular soft-  Popescu et al. (2016) found marked differences between packages, with a wide range of intraclass correlation coefficient values. Agreement between Freesurfer and SPM was lowest in the occipital lobes and insula, but the majority of disagreement between programs was with deep GM structures (e.g., thalamus, putamen, hippocampus).
Popescu and colleagues note that a large part of these differences may be the influence of the different atlases used across programs. Given that both surface-and voxel-based methods are valid for measuring the brain (Li et al., 2014), it is important for researchers to consider the retest reliability of analyses conducted with a the neuroimaging software of choice for the population of interest, the effects that software package has on the order of preprocessing steps, and the algorithms used to achieve these steps.

| Replicability in neuroscience
In light of a replication crisis in neuroscience, we suggest that neuroscientists develop agreed-upon metrics by which they will evaluate neuroimaging data in order to decrease variability that is caused solely by using differential analytical methods. The fact that the present study did not replicate many previous findings is not surprising, given the different sample populations and sample sizes across previous studies. For example, Mechelli et al. (2004)  English monolinguals in Houston, TX and did not find this effect (see Table 1 for full summary of previous studies). Button et al. (2013)  findings. In addition to the regions that we hypothesized the wholebrain analyses would reveal, this study also found that AoA was related to many regions that were not hypothesized, which have been listed in the Appendices. These nonhypothesized findings underscore the importance of obtaining enough power to conduct whole-brain analyses. Therefore, inclusion of multiple measures of neuroanatomy (and measuring them using whole-brain analyses) will be important for building our understanding of the neuroanatomical correlates of AoA, especially while we anticipate larger advances in neuroimaging methodologies that will help us better relate constructs such as volume to specific changes in cell biology.

| Limitations, alternative explanations, and future directions
One limitation is that our sample was specific to participants in Houston, Texas, who may differ from other populations because they have exposure to a higher-than-average number of foreign languages (Emerson, Bratter, Howell, Jeanty, & Cline, 2013). Another limitation is that, because later AoA correlates with fewer years of language use, the present results showing cortical expansion in late bilinguals relative to early bilinguals may reflect fewer years of experience with the L2. Wenger et al. (2017) propose that cortical structure follows a U-shaped curve, where increased use of a region at the beginning of skill acquisition leads to initial GM expansion (reflecting neural, glial, and/or dendritic growth), which renormalizes following skill mastery. GM could eventually return to baseline, given enough time (see Hernandez et al., 2019, for nonlinear language development). However, given the vast differences in developmental processes occurring at the time of L2 acquisition (e.g., early development of motor/speech pathways), it is unlikely that language processing in late and early bilinguals would exactly mirror one another. temporal, occipital, and parietal) changes are have been found to peak in adolescence and stabilize in adulthood around age 20 (Raznahan et al., 2011;Shaw et al., 2008;Wierenga, Langen, Oranje, & Durston, 2014). Given that the difference between early and late bilinguals was 3 years and remained within the 20s (as opposed to comparing adolescent and nonadolescent brains), it is not likely that age would explain results across the sample of participants examined in this study. However, future studies may want to examine the relationship between age and language development in the 20s.
This study operated under the assumption that greater use of an area leads to expansion of that area. Most areas of the human cortex have six layers of inputs and outputs (Johnson & de Haan, 2015), some of which communicate with other cortical areas, and others of which communicate with subcortical areas such as the thalamus (Guillery & Sherman, 2002), so one limitation is that the exact reason for cortical expansion may vary by circumstance. Current MRI technology cannot provide information on the structure of cortical layers or individual cells in vivo, which means that neuroanatomical studies should be cautious in making direct inferences about changes to function until researchers are able to elucidate the relationship between regional brain function and structural expansion versus compression.
However, given that there is a strong relationship between function and structural changes (e.g., Boyke, Driemeyer, Gaser, Büchel, & May, 2008;May, 2011), research on bilingual brain structure should not be ignored. It is still clear from our results that the neural adaptations required to support bilingualism are different when the L2 is added late versus early in development.
Ideally, neuroimaging studies would always have an ample number of participants in every group to provide adequate power to examine the hypotheses of interest. Future studies should focus on including simultaneous bilinguals in their analyses, a group which we did not have the power to examine here. Munson and Hernandez (2019) demonstrated the importance of obtaining adequate statistical power by conducting repeated subsampling on an MRI brain dataset to examine the effect of power on consistency in GM volume results.
The researchers found that for neuroimaging studies with subsamples below 40 per group and an alpha = .025, the probability of finding a false positive result was higher than finding a true positive. Repeated subsampling on an MRI brain dataset showed that as subsample size increased to 50, the probability of a true positive increased beyond that of a false positive, and that the probability of a true positive increased steadily as group size increased. Meanwhile, the probability of a false positive remained approximately the same. Although the present paper had group sizes (68 late bilinguals, 121 early bilinguals, and 145 monolinguals) that are overall larger than is the norm for many neuroimaging studies, we still did not have enough power to include 27 simultaneous bilinguals that were dropped from analyses due to the disparity in group sizes. Future research should continue to aim to increase sample sizes, and especially focus on recruiting simultaneous bilinguals.
The fact that differences were only found when bilinguals were grouped categorically by AoA (and no linear regression results were significant) implies that language development is best characterized by optimal windows of development rather than having a linear trajectory, although the possibility remains that linear changes still exist at the functional or subcortical level. Future researchers should continue to explore the possibility that language development does not occur in a linear fashion. This could be done by examining the progression of the relationship between brain structure and AoA throughout the course of development by conducting longitudinal studies that begin in childhood. Future studies may also want to more deeply examine the interaction between AoA and other background variables, such as genetics, SES, and language exposure. Future studies should also examine how subcortical and functional differences culminate in bilinguals according to language experience.

| CONCLUSION
The present study aimed to create clarity regarding previous studies of the relationship between language organization in the brain and timing of second language acquisition. Discrepancy in the findings reported by previous studies is likely due to methodological differences across studies. To summarize the present findings, our results suggest that late bilinguals handle L2 acquisition via structural changes to a set of areas that are involved in cognitive control and language processing, but that the exact changes that occur depend on which measure of GM (density, volume, or thickness) is used. Relative to early bilinguals, late bilinguals showed thicker cortex in regions implicated in both speech articulation and language planning and greater density in regions related to language planning (but not articulation). Volume was overall the least sensitive to AoA-related differences but showed differences in one semantic processing area. The fact that cortical thickness, volume, and density results were each unique demonstrates the importance of considering multiple indices of brain structure in regards to the neural bases of bilingualism and helps explain some discrepancy among previous findings. Finally, this study highlights the importance of obtaining enough power to conduct whole-brain analyses in place of region-of-interest analyses because it found results in a variety of areas that were not initially hypothesized.

SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.