Longitudinally stable, brain‐based predictive models mediate the relationships between childhood cognition and socio‐demographic, psychological and genetic factors

Abstract Cognitive abilities are one of the major transdiagnostic domains in the National Institute of Mental Health's Research Domain Criteria (RDoC). Following RDoC's integrative approach, we aimed to develop brain‐based predictive models for cognitive abilities that (a) are developmentally stable over years during adolescence and (b) account for the relationships between cognitive abilities and socio‐demographic, psychological and genetic factors. For this, we leveraged the unique power of the large‐scale, longitudinal data from the Adolescent Brain Cognitive Development (ABCD) study (n ~ 11 k) and combined MRI data across modalities (task‐fMRI from three tasks, resting‐state fMRI, structural MRI and DTI) using machine learning. Our brain‐based predictive models for cognitive abilities were stable across 2 years during young adolescence and generalisable to different sites, explaining around 20% of the variance in childhood cognition. Moreover, our use of ‘opportunistic stacking’ allowed the model to handle missing values, reducing the exclusion from around 80% to around 5% of the data. We found that fronto‐parietal networks during a working‐memory task drove childhood‐cognition prediction. The brain‐based predictive models significantly, albeit partially, accounted for variance in childhood cognition due to (1) key socio‐demographic and psychological factors (proportion mediated = 18.65% [17.29%–20.12%]) and (2) genetic variation, as reflected by the polygenic score of cognition (proportion mediated = 15.6% [11%–20.7%]). Thus, our brain‐based predictive models for cognitive abilities facilitate the development of a robust, transdiagnostic research tool for cognition at the neural level in keeping with the RDoC's integrative framework.


| INTRODUCTION
According to the Research Domain Criteria (RDoC), cognitive abilities are considered one of the major transdiagnostic domains, cutting across mental disorders (Morris & Cuthbert, 2012). In children and adults, cognitive abilities are related to various mental disorders, including but not limited to depression (Shilyansky et al., 2016), attention-deficit/hyperactivity disorder (ADHD) (Thaler et al., 2013) and psychotic disorders (Sheffield et al., 2018). Cognitive abilities that span across cognitive tasks, such as language, mental flexibility and memory, reflect a trait known as general cognition or the g-factor (Flynn, 2009). Yet, we still do not have predictive models that can robustly capture the relationship between the g-factor and the brain. Having a brain-based predictive model for the g-factor is key for us to adopt the RDoC's integrative approach, that is, to understand cognitive abilities across units of analysis, from behaviours to brain and genes, that reflect the influences of socio-demographic and psychological factors across the lifespan (Insel et al., 2010; National Institute of Mental Health [NIMH], n.d.-a).
Developing brain-based predictive models for children's g-factor to be used in the RDoC framework faces several challenges.
The first challenge is longitudinal stability, which is one of the requirements in the RDoC framework (Insel et al., 2010; NIMH, n.d.-a). Predictive models should not only be generalisable to out-of-sample data (i.e., be predictive of the g-factor of children who were not part of the original sample) but also be developmentally stable in order to capture the g-factor across the lifespan (Tucker-Drob, 2009).
Here, we started to tackle this challenge by using (for the first time, to the best of our knowledge) longitudinal, large-scale data in children, from the Adolescent Brain Cognitive Development (ABCD) study (Yang & Jernigan, n.d.), to demonstrate the longitudinal stability of the brain-based predictive models across 2 years during adolescence.
The second challenge is multimodal integration. So far, brain-based predictive models have mainly been built from a single MRI modality without integrating different sources of information from different MRI modalities. For instance, the g-factor is associated with activity during certain cognitive tasks, such as working memory (Gray et al., 2003; Waiter et al., 2009) (task-based functional MRI; task-fMRI), the intrinsic functional connectivity between different areas (Dubois et al., 2018; Pamplona et al., 2015) (resting-state fMRI [rs-fMRI]) and the anatomy of grey matter (Narr et al., 2007) (structural MRI [sMRI]) and white matter (Genç et al., 2018; Góngora et al., 2020) (diffusion tensor imaging [DTI]). However, recent findings, mainly in adults, have started to show the benefits of integrating data across modalities, rather than relying solely on a single modality (Rasero et al., 2021; Sui et al., 2020). Here, we adapted a machine-learning framework, called stacking (Wolpert, 1992), to integrate information across MRI modalities into a 'stacked' model. Briefly, we separately built models to predict the g-factor based on each brain modality, resulting in one predicted value from each modality for each participant. We then built a 'stacked' model to predict the g-factor based on these predicted values. We tested whether the stacked model indeed enhanced predictive performance over single modalities in predicting children's g-factor.
The third challenge is missing data. Children's neuroimaging data are notoriously affected by movement artefacts (Fassbender et al., 2017). For example, the ABCD study recommended a set of quality control variables for detecting noisy data from each modality (Hagler et al., 2019; Yang & Jernigan, n.d.), resulting in a listwise exclusion of 17% to over 50% of data depending on the modality. If we were to exclude children who have noisy data from any single modality, we would have to exclude almost 80% of the data, strictly limiting the generalisability of our model to children with highly clean data (who are unlikely to be representative of the rest of the sample). We overcame this problem by using a recently developed framework, built on top of the stacking framework, called 'opportunistic stacking' (Engemann et al., 2020). Briefly, we first duplicated the predicted values from each modality-specific model, and imputed the missing values with an arbitrarily high value in one duplicate and an arbitrarily low value in the other. We then used Random Forest (Breiman, 2001) to create a final prediction from the imputed, predicted values. Accordingly, opportunistic stacking allows us to keep the data as long as there is at least one modality available, leaving more data in the model-building process and reducing the risk of missing-data bias.
Beyond demonstrating a robust out-of-sample relationship between the brain and the g-factor, the brain-based predictive models have to demonstrate construct validity, especially for them to be used according to the RDoC framework (Insel et al., 2010). For instance, RDoC stipulates that cognitive abilities are affected by socio-demographic and psychological factors (Morris & Cuthbert, 2012; NIMH, n.d.-b). This is in line with recent studies showing that cognitive abilities are related to factors such as socio-economic status (Farah et al., 2006), mental health (Biederman et al., 2004; Goodall et al., 2018) and extracurricular activities (Kirlic et al., 2021). Accordingly, for the brain-based predictive models to demonstrate RDoC's construct validity, they should be able to explain the associations between the g-factor and these socio-demographic and psychological factors.
Likewise, RDoC stipulates that cognitive abilities should not be studied as a unitary construct, but should rather be studied through different units of analysis, from behaviours to the brain and genes (Insel et al., 2010; Morris & Cuthbert, 2012; NIMH, n.d.-a, n.d.-c).
Thus, the brain-based predictive models for cognitive abilities should be related to the 'gene-based' predictive models for cognitive abilities, given that they both reflect different units of analysis of the same RDoC domain. A polygenic score (PGS), a composite measure of common gene variants, can be considered a predicted value from the gene-based predictive models (Bogdan et al., 2018). For cognitive abilities (Plomin & Deary, 2015), a PGS is based on the associations between several single nucleotide polymorphisms (SNPs) and cognitive abilities in a separate Genome-Wide Association Study (GWAS) (Davies et al., 2011), such as in a recent GWAS among 257,841 adults. Accordingly, for the brain-based predictive models to demonstrate RDoC's construct validity, they should also be able to explain the associations between the g-factor and the PGS of cognitive abilities.
To develop brain-based predictive models for the g-factor, we (i) used behavioural performance from cognitive tasks to derive the g-factor and (ii) built brain-based predictive models to predict this behaviourally derived g-factor from multimodal MRI data. We used the ABCD Release 3.0 (Yang & Jernigan, n.d.), including baseline data (age 9-10 years old) from over 11,000 children and follow-up data (age 11-12 years old) from roughly half of the participants. We first derived children's g-factor from their behavioural performance on six cognitive tasks using confirmatory factor analysis (CFA). We then built brain-based predictive models by treating multimodal MRI data as the features and the children's g-factor derived from behavioural performance as the target. More specifically, in our models, we implemented opportunistic stacking (Engemann et al., 2020) to integrate MRI data across modalities and to deal with missing values from each modality.
There were six modalities in total: three task-based fMRI (working-memory 'N-Back', reward 'Monetary Incentive Delay [MID]' and inhibitory control 'Stop Signal'), rs-fMRI, sMRI and DTI. To determine the robustness and longitudinal stability of the brain-based predictive models, we tested how well the models predicted the g-factor of unseen children at the same ages and at 2 years older as well as at different data-collection sites. Next, to demonstrate whether multimodal integration led to better predictive performance, we applied bootstrapping to compare the stacked model with the best-performing modality-specific model. To explain the feature importance of the final models (i.e., determining brain features that contributed highly to the prediction of the g-factor), we applied several 'explainers', including eNetXplorer (Candia & Tsang, 2019a), conditional permutation importance (CPI) (Strobl et al., 2008) and SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017).
We then conducted mediation analyses to ensure that the brain-based predictive models for the g-factor demonstrated RDoC's construct validity. In these analyses, we tested the extent to which our brain-based predictive models could account for the relationships between the behaviourally derived g-factor and key socio-demographic, psychological and genetic factors. For this purpose, in addition to the brain-based predictive models, we also computed two additional predictive models that predicted the behaviourally derived g-factor, either from (a) 70 socio-demographic and psychological variables (Kirlic et al., 2021) or (b) genes via a PGS of cognitive abilities. The 70 socio-demographic and psychological variables covered children's and/or their parents' socio-demographics, mental health, personality, sleep, physical activity, screen use, drug use, developmental adversity and social interaction. This resulted in three predicted values of the g-factor, based on features of the predictive models: 'brain-based g-factor', 'socio-demography-and-psychology-based g-factor' and 'gene-based g-factor'.
We then computed these predicted values on unseen children at each hold-out data collection site and applied the mediation analyses. Here, we treated (i) the socio-demography-and-psychology-based and gene-based g-factors as the independent variables, (ii) the brain-based g-factor as the mediator and (iii) the behaviourally derived g-factor as the dependent variable. Through these mediation analyses, we quantified the extent to which the brain-based predictive models for cognitive abilities developed in this study mediated the relationships between the behaviourally derived g-factor and socio-demographic, psychological and genetic factors.

| MATERIALS AND METHODS
We employed the ABCD Study Curated Annual Release 3.0 (Yang & Jernigan, n.d.), which included 3 T MRI data and cognitive tests from 11,758 children (female = 5631) at the baseline (9-10 years old) and 5693 children (female = 2617) at the 2-year follow-up (11-12 years old). The study recruited the children from 21 sites across the United States. We further excluded 54 children based on the Snellen Vision Screener (Luciana et al., 2018; Snellen, 1862).
These children either could not read any line, could only read the first (biggest) line, or could read up to the fourth line but indicated difficulty in reading stimuli on the iPad used for administering cognitive tasks (see below). The ethical considerations of the ABCD study, such as informed consent, confidentiality and communication with participants about assessment results, have been detailed elsewhere (Clark et al., 2018). Institutional Review Boards where the data were collected approved the study's protocols.

| The g-factor
We derived the g-factor using children's behavioural performance from six cognitive tasks. These six tasks, collected on an iPad during a 70-min in-session visit outside of MRI (Luciana et al., 2018;Thompson et al., 2019), were available in both baseline and follow-up datasets. First, the Picture Vocabulary measured vocabulary comprehension and language (Gershon et al., 2014). Second, the Oral Reading Recognition measured reading and language decoding (Bleck et al., 2013). Third, the Flanker measured conflict monitoring and inhibitory control (Eriksen & Eriksen, 1974). Fourth, the Pattern Comparison Processing measured the speed of processing (Carlozzi et al., 2013). Fifth, the Picture Sequence Memory measured episodic memory (Bauer et al., 2013). Sixth, the Rey-Auditory Verbal Learning measured memory recall after distraction and a short delay (Daniel & Wahlstrom, 2014).
Similar to previous work (Ang et al., 2020; Pat et al., 2021; Thompson et al., 2019), we applied the second-order model of the g-factor using CFA to encapsulate the g-factor as the higher-order latent variable underlying performance across cognitive tasks. More specifically, our input data were standardised performance from each cognitive task. In our second-order model, we had the g-factor as the second-order latent variable. We also had three first-order latent variables in the model: language (underlying the Picture Vocabulary and Oral Reading Recognition), mental flexibility (underlying the Flanker and Pattern Comparison Processing), and memory recall (underlying the Picture Sequence Memory and Rey-Auditory Verbal Learning).
We fixed latent factor variances to one and applied Maximum Likelihood with Robust standard errors (MLR) using Huber-White standard errors and scaled test statistics. To demonstrate model fit, we used scaled and/or robust indices, including the comparative fit index (CFI), Tucker-Lewis index (TLI), root mean squared error of approximation (RMSEA) and standardized root mean square residual (SRMR), as well as the internal consistency, OmegaL2 (Jorgensen et al., 2018), of the g-factor. To implement the CFA, we used lavaan (Rosseel, 2012) (version 0.6-6) and semTools (Jorgensen et al., 2018) along with semPlot (Epskamp, 2015) for visualisation. Note that, to ensure the robustness of the chosen g-factor model, we also examined the similarity in factor scores of the g-factor based on three different CFA models: the second-order model, the single-factor model, and the mixture between exploratory factor analysis (EFA) and CFA models (Appendix S1).

| Multimodal MRI
We used MRI data from six modalities: three task-based fMRI, rs-fMRI, sMRI and DTI. Note that 'modalities' here refers to sets of features in our predictive models. As such, we treated the three task-based fMRI as separate modalities even though they were all task-based fMRI.
The ABCD study provided detailed procedures on data acquisition and MRI image processing elsewhere (Casey et al., 2018; Hagler et al., 2019; Yang & Jernigan, n.d.). We strictly followed their recommended exclusion criteria based on automated and manual QC review of each modality, listed under the abcd_imgincl01 table (Yang & Jernigan, n.d.). The ABCD study created an exclusion flag for each modality (with a prefix 'imgincl') based on several criteria, involving image quality, MR neurological screening, behavioural performance and the number of repetition times (TRs), among others. We removed participants with an exclusion flag on any MRI index, separately for each modality. We also applied the three interquartile range (3 × IQR) rule (i.e., flagging a datapoint with a value over 3 IQRs away from the nearest quartile) with listwise deletion to remove observations with outliers in any index within each modality. Additionally, to adjust for between-site variability, we used an Empirical Bayes method, ComBat (Fortin et al., 2017; Nielson et al., 2018). We applied ComBat to all modalities except for task-based fMRI, given that between-site variability was found to be negligible for task-based contrasts (Nielson et al., 2018). See below for our approach to mitigate data leakage due to 3 × IQR and ComBat.
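The 3 × IQR rule with listwise deletion described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code (their pipeline used R): the function names, the toy data and the quartile interpolation method (`method="inclusive"`) are all assumptions.

```python
import statistics

def iqr_bounds(values, k=3.0):
    """Return (lower, upper) fences lying k IQRs below Q1 and above Q3."""
    # Quartile interpolation method is an assumption; the source does not specify one.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def listwise_keep(rows, k=3.0):
    """Keep only rows (one per participant) with no outlier in any column."""
    columns = list(zip(*rows))
    bounds = [iqr_bounds(col, k) for col in columns]
    return [
        row for row in rows
        if all(lo <= x <= hi for x, (lo, hi) in zip(row, bounds))
    ]

# Hypothetical data: two MRI indices for eight participants.
rows = [
    (1.0, 10.0), (2.0, 11.0), (3.0, 12.0), (4.0, 13.0),
    (5.0, 14.0), (6.0, 15.0), (7.0, 16.0), (8.0, 500.0),
]
kept = listwise_keep(rows)  # drops the participant whose second index is 500.0
```

In a leakage-aware workflow, a function like `listwise_keep` would be called separately on each data split, so that the fences for one split never depend on observations from another.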

| Three task-based fMRI
We used task-based fMRI from three tasks. First, in the working-memory 'N-Back' task (Barch et al., 2013;Casey et al., 2018), children saw pictures of houses and emotional faces. Depending on the block, children reported if a picture matched either: (a) a picture that was shown 2 trials earlier (2-back), or (b) a picture that was shown at the beginning of the block (0-back). To focus on working-memory-related activity, we used the (2-back vs. 0-back) linear contrast (i.e., high vs. low working memory load).
Second, in the MID task (Casey et al., 2018; Knutson et al., 2000), children needed to respond before the target disappeared. Doing so would provide them with a reward if and only if the target followed the 'reward cue' (but not the 'neutral cue'). To focus on reward anticipation-related activity, we used the (Reward Cue vs. Neutral Cue) linear contrast.
Third, in the Stop-Signal Task (SST) (Casey et al., 2018;Whelan et al., 2012), children needed to withhold or interrupt their motor response to a 'Go' stimulus when it was followed unpredictably by a Stop signal. To focus on inhibitory control-related activity, we used the (Any Stop vs. Correct Go) linear contrast. Note that, for the SST, we used two additional exclusion criteria, tfmri_sst_beh_glitchflag, and tfmri_sst_beh_violatorflag, to address glitches in the task as recommended by the study (Bissett et al., 2020;Garavan et al., 2020). For all tasks, we used the average contrast values across two runs. More specifically, these contrasts were unthresholded, similar to previous work (Bolt et al., 2017). These values were embedded in the brain parcels based on FreeSurfer's (Dale et al., 1999) Destrieux (Destrieux et al., 2010) and ASEG (Fischl et al., 2002) atlases (148 cortical surface and 19 subcortical volumetric regions, resulting in 167 features for each task-based fMRI task).

| Resting-state fMRI
During rs-fMRI collection, the children viewed a crosshair for 20 min.
The ABCD's preprocessing strategy has been published elsewhere (Hagler et al., 2019). Briefly, the study parcellated regions into 333 cortical-surface regions (Gordon et al., 2016) and correlated their time-series (Hagler et al., 2019). They then grouped these correlations based on 13 predefined large-scale networks (Gordon et al., 2016): auditory, cingulo-opercular, cingulo-parietal, default-mode, dorsalattention, frontoparietal, none, retrosplenial-temporal, salience, sensorimotor-hand, sensorimotor-mouth, ventral-attention and visual networks. Note that 'none' refers to regions that do not belong to any networks. After applying the Fisher's r-to-z transformation, the study computed mean correlations between pairs of regions within each large-scale network (n = 13) and between large-scale networks (n = 78) and provided these mean correlations in their Releases (Yang & Jernigan, n.d.). This resulted in 91 features for the rs-fMRI.
Given that the correlations between (not within) large-scale networks were highly collinear with each other (e.g., the correlation between auditory and cingulo-opercular was collinear with that between auditory and default-mode), we further decorrelated them using partial correlation. We first applied the inverse Fisher's r-to-z transformation, then the partial correlation transformation, and then reapplied the Fisher's r-to-z transformation.
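The three-step transformation (inverse Fisher, partial correlation, Fisher) can be illustrated with a small sketch. The source does not say how the partial correlations were computed, so the version below assumes the common route of inverting a correlation matrix and rescaling the resulting precision matrix. The 3 × 3 matrix and all function names are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher's r-to-z transformation."""
    return math.atanh(r)

def inverse_fisher_z(z):
    """Inverse Fisher transformation, mapping z back to a correlation."""
    return math.tanh(z)

def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def partial_correlations(corr):
    """Partial correlation matrix from a correlation matrix (assumed method)."""
    prec = invert(corr)
    n = len(corr)
    return [[1.0 if i == j else -prec[i][j] / math.sqrt(prec[i][i] * prec[j][j])
             for j in range(n)] for i in range(n)]

# Hypothetical Fisher-z connectivity values among three networks.
z_values = {(0, 1): fisher_z(0.5), (0, 2): fisher_z(0.4), (1, 2): fisher_z(0.3)}

# Step 1: inverse Fisher transformation back to correlations.
corr = [[1.0] * 3 for _ in range(3)]
for (i, j), z in z_values.items():
    corr[i][j] = corr[j][i] = inverse_fisher_z(z)

# Step 2: partial correlations; Step 3: Fisher transformation again.
partial = partial_correlations(corr)
z_partial = {pair: fisher_z(partial[pair[0]][pair[1]]) for pair in z_values}
```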

| Diffusion tensor imaging
Here, we focused on fractional anisotropy (FA) (Alexander et al., 2007) of DTI. FA characterises the directionality of the distribution of diffusion within white matter tracts, which can indicate the density of fibre packing (Alexander et al., 2007). The ABCD study segmented major white matter tracts using AtlasTrack (Hagler et al., 2009, 2019).

| Predictive models of multimodal MRI: opportunistic stacking
To integrate multimodal MRI into one predictive model and to control for missing values across modalities, we applied opportunistic stacking (Engemann et al., 2020) (Figure 1). We started with the first-layer training set. Here, we used standardised features from each modality to separately predict the g-factor via a penalised regression. The main advantage of a penalised regression is its ease of interpretation, given that the prediction is made based on a weighted sum of features. Moreover, the predictive performance of penalised regressions for capturing brain-and-behaviour relationships in MRI appeared good, often on par with other more black-box algorithms (Dadi et al., 2019; Dubois et al., 2018; Engemann et al., 2020; Niu et al., 2020; Rasero et al., 2021). Following previous research (Dubois et al., 2018), we used Elastic Net (Zou & Hastie, 2005), a general form of penalised regression, via the glmnet package (Friedman et al., 2010). Elastic Net requires two hyperparameters. First, the 'penalty' determines how strongly the features' slopes are regularised. Second, the 'mixture' determines the degree to which the regularisation is applied to the sum of squared coefficients (known as Ridge) versus to the sum of absolute values of the coefficients (known as LASSO). We tuned these two hyperparameters using a 10-fold cross-validation grid search and selected the model with the lowest mean absolute error (MAE). In the grid, we used 200 levels of the penalty from 10^-10 to 10, equally spaced on the logarithmic-10 scale, and 11 levels of the mixture from 0 to 1 on the linear scale.
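To make the tuning grid concrete, the sketch below enumerates the 200 × 11 candidate hyperparameter pairs and writes out the Elastic Net penalty term. It is an illustration in Python rather than the authors' glmnet/tidymodels code. The function names are hypothetical, and the penalty follows the usual glmnet parameterisation (mixture = 1 is LASSO, mixture = 0 is Ridge), which is assumed to apply here.

```python
def log10_grid(low_exp, high_exp, levels):
    """Values equally spaced on the log-10 scale from 10**low_exp to 10**high_exp."""
    step = (high_exp - low_exp) / (levels - 1)
    return [10 ** (low_exp + i * step) for i in range(levels)]

def linear_grid(low, high, levels):
    """Values equally spaced on the linear scale from low to high."""
    step = (high - low) / (levels - 1)
    return [low + i * step for i in range(levels)]

penalties = log10_grid(-10, 1, 200)   # 10^-10 ... 10
mixtures = linear_grid(0.0, 1.0, 11)  # 0, 0.1, ..., 1
grid = [(p, m) for p in penalties for m in mixtures]

def elastic_net_penalty(coefs, penalty, mixture):
    """Penalty added to the loss: penalty * (mixture * L1 + (1 - mixture) / 2 * L2)."""
    l1 = sum(abs(b) for b in coefs)
    l2 = sum(b * b for b in coefs)
    return penalty * (mixture * l1 + (1.0 - mixture) / 2.0 * l2)
```

Each of the 2,200 pairs in `grid` would be scored by 10-fold cross-validation, with the pair giving the lowest MAE retained.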
Once we obtained the final modality-specific models from the first-layer training set, we applied these models to data in the second-layer training set. This gave us six predicted values of the g-factor from six modalities, and these were the features used to predict the g-factor in the second-layer training set. To handle missing observations when combining these modality-specific features, we applied the opportunistic stacking approach (Engemann et al., 2020) by creating duplicates of each modality-specific feature. After standardisation, we coded missing observations in one duplicate as an arbitrarily large value of 1000 and in the other as an arbitrarily small value of −1000, resulting in 12 features. That is, as long as a child had at least one modality available, we were able to include this child in stacked modelling.
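The duplication-and-coding step can be sketched as below. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the function names, the `_high`/`_low` suffixes, the use of `None` for a missing prediction and the two-modality toy data are all hypothetical.

```python
import statistics

LARGE, SMALL = 1000.0, -1000.0  # arbitrary codes for missing values, as in the text

def standardise(column):
    """Z-score the observed values; missing entries (None) stay missing."""
    observed = [x for x in column if x is not None]
    mean = statistics.fmean(observed)
    sd = statistics.stdev(observed)
    return [None if x is None else (x - mean) / sd for x in column]

def opportunistic_features(predictions):
    """Duplicate each modality's predicted g-factor and code missing values.

    predictions maps a modality name to a list of predicted values,
    one per child, with None where that modality is unavailable.
    Returns the coded features and the indices of the children kept.
    """
    n = len(next(iter(predictions.values())))
    features = {}
    for modality, column in predictions.items():
        z = standardise(column)
        features[modality + "_high"] = [LARGE if x is None else x for x in z]
        features[modality + "_low"] = [SMALL if x is None else x for x in z]
    # Keep a child as long as at least one modality is available.
    keep = [i for i in range(n)
            if any(col[i] is not None for col in predictions.values())]
    return {name: [col[i] for i in keep] for name, col in features.items()}, keep

# Hypothetical predicted values for four children from two modalities.
predictions = {
    "nback": [0.2, None, -0.1, None],
    "rsfmri": [0.1, 0.3, None, None],
}
features, keep = opportunistic_features(predictions)
```

With six modalities, the same function would yield the 12 features passed to the second-layer model. A tree-based learner can then split on the extreme codes to treat "missing" as its own branch.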
Previous research (Engemann et al., 2020) advocated for a more flexible algorithm that can capture non-linear and interactive relationships at the second-layer training set. Here, we used the Random Forests algorithm (Breiman, 2001) from the ranger package (Wright & Ziegler, 2017) to predict the g-factor from the 12 features (Engemann et al., 2020; Josse et al., 2020). Random Forests use a multitude of decision trees on various sub-samples of the data and implement averaging to enhance prediction and to control over-fitting. We used 1000 trees and tuned two hyperparameters. First, 'mtry' is the number of features randomly sampled at each split. Second, 'min_n' is the minimum number of observations in a node needed for the node to be split further. We implemented a 10-fold cross-validation grid search and selected the model with the lowest root mean squared error (RMSE). In the grid, we used 12 levels of mtry from 1 to 12, and 101 levels of min_n from 1 to 1000, both on the linear scale. This resulted in the 'stacked' model that incorporated data across modalities.
To prevent data leakage, we fit the CFA model to the observations in the first-layer training data and then computed factor scores of the g-factor on all training and test data. Note that to demonstrate the stability of the factor scores of the g-factor when applied to unseen data (i.e., not part of the modelling process), we also compared the factor scores of the g-factor estimated from the first-layer training data and the scores estimated from the whole baseline data (Appendix S2). Similarly, we also applied the 3 × IQR rule and ComBat separately for first-layer training, second-layer training, baseline test and follow-up test data. For the machine learning workflow, we used 'tidymodels' (www.tidymodels.org).

| Testing the robustness of the predictive models of multimodal MRI
We examined the predictive ability of the models based on multimodal MRI by comparing predicted and observed g-factor, using Pearson's correlation (r), the coefficient of determination (R², calculated using the sum of squares definition), MAE and RMSE. To investigate the predictive ability of the modality-specific models, we used the models tuned from the first-layer training set. To investigate the predictive ability of the stacked model, we used the model tuned from both the first-layer and second-layer training sets.
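The four performance indices can be written out explicitly. The sketch below is a plain-Python illustration with hypothetical function names and toy values. It takes the sum-of-squares definition of R² to mean 1 − SS_residual / SS_total, which is assumed to be the definition meant here.

```python
import math

def pearson_r(obs, pred):
    """Pearson's correlation between observed and predicted values."""
    mo, mp = sum(obs) / len(obs), sum(pred) / len(pred)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = sum((o - mo) ** 2 for o in obs)
    sp = sum((p - mp) ** 2 for p in pred)
    return cov / math.sqrt(so * sp)

def r_squared(obs, pred):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mo = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1.0 - ss_res / ss_tot

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

# Hypothetical observed and predicted g-factor scores.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 2.5, 3.5]
scores = {
    "r": pearson_r(observed, predicted),
    "R2": r_squared(observed, predicted),
    "MAE": mae(observed, predicted),
    "RMSE": rmse(observed, predicted),
}
```

Unlike the squared Pearson correlation, this R² can be negative when predictions are worse than simply predicting the mean of the observed values.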

| Out-of-sample predictive ability of multimodal MRI: Baseline and follow-up samples
We first split the data into four parts (Figure 1): (1) first-layer training set (n = 3041), (2) second-layer training set (n = 3042), (3) baseline test set (n = 5622) and (4) follow-up test set (n = 5656). Importantly, children who were in the baseline test set were also in the follow-up test set, and none of the children in the first-layer and second-layer training sets was in either of the test sets. We used the baseline test set for out-of-sample, same-age predictive abilities, while we used the follow-up test set for out-of-sample, longitudinal predictive abilities.
FIGURE 1 Longitudinal predictive modelling approach used for out-of-sample predictive ability of multimodal MRI. We split the data into four sets: first-layer training, second-layer training, baseline test, and follow-up test. We used the same participants in the baseline test and follow-up test sets. Modality-specific modelling only used the first-layer training set, while stacked modelling used both training sets to combine predicted values across modalities. At the first training layer, using Elastic Net, we separately predicted the g-factor based on each of the six modalities, resulting in six predicted values. At the second training layer, we applied opportunistic stacking by duplicating these six predicted values, and then imputed missing observations in one as an arbitrarily large value of 1000 and in the other as an arbitrarily small value of −1000, resulting in 12 predicted values. We then used Random Forest to predict the g-factor based on these 12 predicted values. The number of observations was different depending on the quality control of data from each modality. "Data not yet released" reflects the fact that ABCD release 3.0 (Yang & Jernigan, n.d.) only provided half of the follow-up data (age 11-12 years old), while providing the full baseline data (age 9-10 years old). CFA, confirmatory factor analysis; cor, correlation; CV, cross-validation; FA, fractional anisotropy.
To examine the performance of opportunistic stacking as a function of missing values, we further split the test sets based on the presence of each modality. First, Stacked All required data with at least one modality present. This allowed us to examine the stacked model's performance when the missing values were all arbitrarily coded. Second, Stacked Complete required data with all modalities present. This represents the situation when the data were as clean as possible. Third, Stacked Best had the same missing values as the modality with the best prediction. This allowed us to make a fair comparison in performance between the stacked model and the model with the best modality, given their same noise level from missing values. Fourth, Stacked No Best did not have any data from the modality with the best prediction and had at least one modality present. This represents the highest level of noise possible.
2.4.2 | Comparing out-of-sample predictive ability of multimodal MRI between the stacked model and the model with the best modality: baseline and follow-up samples
Here, we made a statistical comparison of the out-of-sample predictive ability between Stacked Best and the modality-specific model with the highest predictive performance, both of which had the same number of missing values in the test sets. We applied bootstrapping with 5000 iterations to examine the differences in performance indices (including r, R², MAE and RMSE) on both baseline and follow-up test sets. If stacking truly led to enhanced predictive performance, then the 95% CI of the bootstrapped differences should exclude 0.
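A percentile bootstrap of the difference in one performance index might look like the sketch below. It is a Python illustration under assumptions, not the authors' procedure: the source does not state the resampling unit or the CI type, so resampling participants with replacement and taking percentile limits are assumed. The simulated data and all names are hypothetical.

```python
import random

def mae(obs, pred):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def bootstrap_difference(obs, pred_a, pred_b, metric, n_iter=5000, seed=0):
    """95% percentile CI of metric(model A) - metric(model B).

    Participants are resampled with replacement, and both models are
    scored on the same resample in each iteration (paired comparison).
    """
    rng = random.Random(seed)
    n = len(obs)
    diffs = []
    for _ in range(n_iter):
        idx = [rng.randrange(n) for _ in range(n)]
        o = [obs[i] for i in idx]
        a = [pred_a[i] for i in idx]
        b = [pred_b[i] for i in idx]
        diffs.append(metric(o, a) - metric(o, b))
    diffs.sort()
    return diffs[int(0.025 * n_iter)], diffs[int(0.975 * n_iter) - 1]

# Simulated example: the 'stacked' predictions are less noisy than the 'single' ones.
rng = random.Random(1)
observed = [rng.gauss(0.0, 1.0) for _ in range(200)]
stacked = [o + rng.gauss(0.0, 0.5) for o in observed]
single = [o + rng.gauss(0.0, 1.0) for o in observed]

lower, upper = bootstrap_difference(observed, stacked, single, mae)
```

Because MAE is an error measure, an interval lying entirely below zero would favour the first model. For r or R², the favourable direction is reversed.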

| Out-of-site predictive ability of multimodal MRI
To examine out-of-site predictive ability, we applied leave-one-site-out cross-validation to the baseline data. This enabled us to extract predicted values of the g-factor based on multimodal MRI data at each hold-out site, and in turn, to examine the generalisability of different models on different data collection sites. Different sites involved different MRI machines and experimenters as well as different demographics across the United States. Moreover, using leave-one-site-out cross-validation also prevented having participants from the same family in both the training and test sets. Here, we first removed data from one site that only recruited 34 participants and removed participants from six families who were separately scanned at different sites. We then held out data from one site as a test set and divided the rest into first- and second-layer training sets. We cross-validated predictive ability across these hold-out sites. We applied the same modelling approach as for the out-of-sample predictive models, except for two configurations to reduce the amount of RAM used and computational time. Specifically, in our grid search, we used 100 levels of penalty (as opposed to 200) for Elastic Net and limited the maximal min_n to 500 (as opposed to 1000) for Random Forests. For the stacked model, we tested its predictive ability on children with at least one modality (i.e., Stacked All). We examined the out-of-site prediction between predicted versus observed g-factor at each hold-out site.
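The leave-one-site-out scheme can be sketched as a generator of splits. The Python illustration below is hypothetical in its names and toy records. The source does not state how the remaining sites were divided between the two training layers, so the sketch assumes a shuffled half-and-half split.

```python
import random
from collections import defaultdict

def leave_one_site_out(records, seed=0):
    """Yield (held_out_site, first_layer_ids, second_layer_ids, test_ids).

    records is a list of (participant_id, site) pairs. Each site is held
    out once as the test set; the remaining participants are shuffled and
    halved into first- and second-layer training sets (an assumed ratio).
    """
    by_site = defaultdict(list)
    for pid, site in records:
        by_site[site].append(pid)
    rng = random.Random(seed)
    for site in sorted(by_site):
        test_ids = list(by_site[site])
        train_ids = [pid for s in sorted(by_site) if s != site for pid in by_site[s]]
        rng.shuffle(train_ids)
        half = len(train_ids) // 2
        yield site, train_ids[:half], train_ids[half:], test_ids

# Hypothetical participants from three sites.
records = [
    ("p1", "site_a"), ("p2", "site_a"),
    ("p3", "site_b"), ("p4", "site_b"),
    ("p5", "site_c"), ("p6", "site_c"),
]
splits = list(leave_one_site_out(records))
```

Each tuple in `splits` would drive one full round of first-layer tuning, second-layer tuning and testing on the held-out site.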

| Feature importance of multimodal MRI models
To understand which features contribute to the prediction of the modality-specific (i.e., Elastic Net) models, we applied permutation from the eNetXplorer (Candia & Tsang, 2019b) package to the first-layer training set of the out-of-sample predictive ability splits (Figure 1). We first chose the best mixture from the previously run grid and fit two sets of several Elastic Net models. The first 'target' models used the true g-factor as the target, while the second 'null' models used the randomly permuted g-factor as the target. eNetXplorer split the data into 10 folds 100 times/runs. For each run, eNetXplorer performed cross-validation by repeatedly training the target models on nine folds and testing them on the leftover fold. Also, in each cross-validation run, eNetXplorer trained the null models 25 times. eNetXplorer then used the mean of non-zero model coefficients across all folds in a given run as a coefficient for each run, k_r.
Across runs, eNetXplorer weighted the mean of a model coefficient by the frequency of obtaining a non-zero model coefficient per run.
Formally, we defined an empirical p-value, p_val, in which run is a run index, n_run is the number of runs, per is a permutation index, n_per is the number of permutations, Θ is the right-continuous Heaviside step function and |β| is the magnitude of a feature coefficient. That is, to establish statistical significance for each feature, we used the proportion of runs in which the null models performed better than the target models. We plotted the target models' coefficients with p_val < .05 on the brain images using the ggseg (Mowinckel & Vidal-Piñeiro, 2020) package.
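To make the definition concrete, the sketch below computes an empirical p-value of this kind for one feature. It assumes the form p_val = (1 + Σ_run Σ_per Θ(|β_null(run, per)| − |β(run)|)) / (1 + n_run × n_per), which is consistent with the symbols listed above but is a reconstruction rather than the formula as printed; the function names and the toy coefficients are hypothetical.

```python
def heaviside_right_continuous(x):
    """Right-continuous Heaviside step: 1 for x >= 0, else 0."""
    return 1 if x >= 0 else 0


def empirical_p_value(target_coefs, null_coefs):
    """Empirical p-value for one feature.

    target_coefs[run] is the target-model coefficient in a given run;
    null_coefs[run][per] is the null-model coefficient for permutation `per`.
    """
    n_run = len(target_coefs)
    n_per = len(null_coefs[0])
    # Count how often a null coefficient is at least as large in magnitude
    # as the target coefficient of the same run.
    exceed = sum(
        heaviside_right_continuous(abs(null_coefs[run][per]) - abs(target_coefs[run]))
        for run in range(n_run)
        for per in range(n_per)
    )
    # The +1 terms are an assumption (a common permutation-test convention)
    # that keeps the p-value strictly above zero.
    return (1 + exceed) / (1 + n_run * n_per)


# Toy example with 2 runs and 3 permutations per run.
target = [0.30, -0.25]
null = [[0.05, -0.40, 0.10], [0.25, 0.00, -0.10]]
p_val = empirical_p_value(target, null)
```

In the analysis described above there would be 100 runs and 25 permutations per run, so the denominator under this assumed form would be 1 + 100 × 25.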
To identify which modalities contributed strongly to the prediction of the stacked (i.e., Random Forests) model, we applied two methods to the second-layer training set: (1) conditional permutation importance (CPI) (Debeer & Strobl, 2020) and (2) SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017). CPI is an explainer designed specifically for Random Forests. We implemented CPI using the 'permimp' package, as detailed elsewhere (Debeer & Strobl, 2020). Briefly, the original permutation importance (Breiman, 2001) shuffled the observations of one feature at a time while holding the target and other features in the same order. Researchers then examined decreases in predictive accuracy in the out-of-bag observations due to the permutation of that feature. Stronger decreases are then assumed to reflect higher importance of the feature. However, this method has been shown to be biased when there are correlated features (Strobl et al., 2007). CPI corrected for this bias by constraining the feature permutation to be within partitions of other features, which was controlled by the threshold 's' value. We used the default s value at 0.95, which assumed dependencies among features (Debeer & Strobl, 2020).
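As a rough illustration of the original, unconditional permutation importance that CPI builds on, the sketch below shuffles one feature at a time and records the increase in mean squared error. It is not the 'permimp' implementation: it uses a fixed stand-in prediction function instead of a Random Forest, scores all observations rather than out-of-bag ones, and does not restrict the permutation to partitions of the other features. All names and the toy data are hypothetical.

```python
import random


def mse(y_true, y_pred):
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, X, y, feature, n_repeats=20, seed=0):
    """Mean increase in MSE after shuffling column `feature` of X."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        # Shuffle one column while the target and other columns stay in place.
        column = [row[feature] for row in X]
        rng.shuffle(column)
        X_perm = [row[:feature] + [v] + row[feature + 1:] for row, v in zip(X, column)]
        increases.append(mse(y, [predict(r) for r in X_perm]) - baseline)
    return sum(increases) / n_repeats


# Toy data: the target depends on feature 0 only; feature 1 is noise.
rng = random.Random(1)
X = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
y = [2.0 * row[0] for row in X]


def predict(row):
    # Stand-in for a fitted model (an assumption for illustration).
    return 2.0 * row[0]


imp_signal = permutation_importance(predict, X, y, feature=0)
imp_noise = permutation_importance(predict, X, y, feature=1)
```

Here the feature that the stand-in model uses receives a positive importance, while the unused feature receives exactly zero. With correlated features this unconditional scheme can be misleading, which is the bias that the conditional variant is meant to address.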
SHAP (Lundberg & Lee, 2017) is a model-agnostic explainer, designed to explain the contribution of each feature to the prediction of any machine-learning model via Shapley values (Roth, 1988). We implemented SHAP using the 'fastshap' package (https://bgreenwell.github.io/fastshap/). Based on cooperative game theory, a Shapley value (Roth, 1988) attributes to each feature its average marginal contribution to a prediction.

2.6 | Testing whether the brain-based predictive models mediated the relationships of the behaviourally derived g-factor with socio-demographic, psychological and genetic factors
Using leave-one-site-out cross-validation, we built three predictive models for the g-factor from (1) multimodal MRI, (2) key socio-demographic and psychological factors and (3) genetic data. These provided three sets of predicted values at each hold-out data collection site: the brain-based g-factor, the socio-demography-and-psychology-based g-factor and the gene-based g-factor, respectively. We then tested whether the brain-based g-factor mediated the relationship that the behaviourally derived g-factor had with the socio-demography-and-psychology-based and gene-based g-factors.

| Key socio-demographic and psychological factors
We performed leave-one-site-out cross-validation to build 'socio-demographic-and-psychological-based' predictive models. These models predicted the behaviourally derived g-factor from key socio-demographic and psychological factors in the baseline data, in the same way that we used leave-one-site-out cross-validation to create the 'brain-based' predictive models above. This enabled us to extract predicted values of the g-factor based on key socio-demographic and psychological factors at each hold-out site, called the socio-demography-and-psychology-based g-factor. Here, we applied a modelling approach similar to the leave-one-site-out cross-validation for multimodal MRI, except that we used only one layer of Elastic Net tuned with 200 levels of the penalty (from 10^−10 to 10) and 11 levels of the mixture (from 0 to 1). For pre-processing, we first imputed missing values of the categorical features via mode replacement and then converted them to dummy variables. We next normalised these dummy variables, all numerical features and the behaviourally derived g-factor. At the last pre-processing step, we used k-nearest neighbours with five neighbours to impute the missing values of the normalised, numerical features.
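The tuning grid described above can be written out explicitly. The sketch below assumes that the 200 penalty levels are evenly spaced on a log10 scale between 10^−10 and 10 and that the 11 mixture levels are evenly spaced between 0 and 1; the spacing is an assumption, and the helper name `log_spaced` is hypothetical.

```python
def log_spaced(low_exp, high_exp, n):
    """n values evenly spaced on a log10 scale from 10**low_exp to 10**high_exp."""
    step = (high_exp - low_exp) / (n - 1)
    return [10 ** (low_exp + i * step) for i in range(n)]


penalties = log_spaced(-10, 1, 200)      # 200 penalty levels (assumed log-spaced)
mixtures = [i / 10 for i in range(11)]   # 0.0 (ridge) to 1.0 (lasso)

# Every (penalty, mixture) pair is one candidate Elastic Net configuration.
grid = [(p, m) for p in penalties for m in mixtures]
```

Under these assumptions the grid holds 200 × 11 = 2200 candidate configurations per training set.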

| Polygenic scores
To extract predicted values of the g-factor based on genetics, we used polygenic scores (PGSs) for adult cognitive ability. The ABCD study provided details on genotyping elsewhere (Uban et al., 2018). Briefly, the study took saliva and whole blood samples and genotyped them using the Smokescreen™ Array. The ABCD applied quality control based on calling signals and variant call rates, ran the Ricopili pipeline and imputed the data with TOPMED. We excluded data from problematic plates and data with a subject-matching issue, as identified by the ABCD. We further quality controlled the data as follows. First, we removed individuals with minimal or excessive heterozygosity. We also excluded SNPs based on minor allele frequency (<5%) and violations of Hardy-Weinberg equilibrium (p < 1E−10). We limited the analysis to 'unrelated individuals' as defined by individuals with low genetic relatedness (more than third-degree relative pairs; identical by descent [IBD] ≥ 0.0422).
We defined alleles associated with the g-factor as those related to cognitive abilities in a large-scale discovery GWAS sample of European ancestry (N = 257,841). Accordingly, we focused our analyses on the p < .01 PGS threshold and treated the PGS at this threshold as our gene-based g-factor.
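For readers unfamiliar with PGS thresholds, the sketch below shows a generic weighted-sum score restricted to SNPs that pass a GWAS p-value threshold. This is a textbook construction offered as an assumption; the text above does not spell out the exact scoring procedure, and steps such as linkage-disequilibrium clumping are omitted. The SNP identifiers, effect sizes and function name are hypothetical.

```python
def polygenic_score(dosages, gwas, p_threshold=0.01):
    """Weighted sum of allele dosages over SNPs passing the p-value threshold.

    `dosages` maps SNP id -> effect-allele count (0, 1 or 2) for one person;
    `gwas` maps SNP id -> (effect_size, p_value) from the discovery sample.
    """
    return sum(
        beta * dosages[snp]
        for snp, (beta, p) in gwas.items()
        if p < p_threshold and snp in dosages
    )


# Hypothetical discovery statistics and one person's genotype.
gwas = {"rs1": (0.5, 0.001), "rs2": (-0.25, 0.005), "rs3": (0.75, 0.2)}
person = {"rs1": 2, "rs2": 1, "rs3": 2}
score = polygenic_score(person, gwas)
```

In this toy case only rs1 and rs2 pass the p < .01 threshold, so rs3 does not contribute to the score.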

| Mediation analyses
To examine the extent to which brain-based, stacked predictive models of the g-factor accounted for the relationship between the behaviourally derived g-factor and the socio-demographic, psychological and genetic g-factors, we applied mediation analyses (MacKinnon et al., 2007). In these mediation analyses, we treated (i) the brain-based g-factor as the mediator, (ii) the socio-demography-and-psychology-based and gene-based g-factors as the independent variables and (iii) the behaviourally derived g-factor as the dependent variable. Note that the behaviourally derived g-factor was computed based on the CFA models in the training data, which were later applied to each hold-out site. While the behaviourally derived g-factor was a latent variable, it represented the only 'observed' value here since the other three g-factors (brain-based, socio-demography-and-psychology-based and gene-based) were 'predicted' values from predictive models.
We conducted three mediation analyses. The first analysis only used the socio-demography-and-psychology-based g-factor as the independent variable. The second analysis only used the gene-based g-factor as the independent variable. The third analysis used both the socio-demography-and-psychology-based and gene-based g-factors as the independent variables, simultaneously in the same model. To control for population stratification in genetics, we also included four PCs as control variables in the mediation analyses involving the gene-based g-factor.
To implement the mediation analyses, we used structural equation modelling (SEM) with 5000 bootstrapping iterations via lavaan (Rosseel, 2012). We specifically calculated the indirect effects to show whether the relationships between the behaviourally derived g-factor and the socio-demography-and-psychology-based and gene-based g-factors were significantly explained by the brain-based g-factor.
Along with the indirect effects, we also computed the proportion mediated to demonstrate the proportion of variance accounted for by the brain-based g-factor.
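The quantities involved can be illustrated with a simplified, single-mediator example. The sketch below is not the lavaan SEM used in the study: it relies on ordinary least squares computed from sample covariances, omits the control variables, uses simulated data and runs fewer bootstrap iterations than the 5000 reported above. Function names and simulation parameters are hypothetical.

```python
import random


def cov(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def mediation(x, m, y):
    """Return (indirect effect a*b, proportion mediated a*b / c)."""
    a = cov(x, m) / cov(x, x)                                    # slope of M ~ X
    denom = cov(x, x) * cov(m, m) - cov(x, m) ** 2
    b = (cov(m, y) * cov(x, x) - cov(x, y) * cov(x, m)) / denom  # M slope in Y ~ X + M
    c = cov(x, y) / cov(x, x)                                    # total effect, Y ~ X
    return a * b, a * b / c


def bootstrap_ci(x, m, y, n_boot=500, seed=0):
    """95% percentile interval of the proportion mediated."""
    rng = random.Random(seed)
    n = len(x)
    props = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample participants
        xs, ms, ys = ([v[i] for i in idx] for v in (x, m, y))
        props.append(mediation(xs, ms, ys)[1])
    props.sort()
    return props[int(0.025 * n_boot)], props[int(0.975 * n_boot) - 1]


# Simulated data: true indirect effect 0.5 * 0.4 = 0.2, direct effect 0.6,
# so the true proportion mediated is 0.2 / 0.8 = 0.25.
rng = random.Random(42)
x = [rng.gauss(0, 1) for _ in range(500)]
m = [0.5 * xi + rng.gauss(0, 1) for xi in x]
y = [0.6 * xi + 0.4 * mi + rng.gauss(0, 1) for xi, mi in zip(x, m)]

indirect, prop = mediation(x, m, y)
low, high = bootstrap_ci(x, m, y)
```

In this analogy, x plays the role of the socio-demography-and-psychology-based or gene-based g-factor, m the brain-based g-factor and y the behaviourally derived g-factor.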

| Data and code availability
We used publicly available data provided by the ABCD study.

See Appendix S1 and S2 for a more detailed CFA of the g-factor. In brief, the second-order model had better fit indices than the single-factor model. Additionally, factor scores of the g-factor from the second-order model, the single-factor model and the mixture between EFA and CFA models were highly correlated with each other (Pearson's rs ≥ 0.987). Accordingly, the choice of g-factor models had only minimal effects on the estimation of the factor scores for the g-factor, and thus our brain-based predictive models should be generalisable to the factor scores of different g-factor CFA models beyond the second-order model. Lastly, the factor scores estimated from the first-layer training data were highly correlated with the factor scores estimated from the full baseline data (Pearson's rs > 0.997), indicating the stability of the factor scores used.
3.2 | How robust are the brain-based predictive models?

| Out-of-sample predictive ability of multimodal MRI
For hyperparameter-tuning results, see Appendix S3.

The differences in the g-factor between participants with versus without missing values (Figure 3) were weaker in magnitude in the Stacked All than in other models with high predictive performance (such as the N-back task-based fMRI and Stacked Complete), as indicated by Cohen's d. Accordingly, by imputing the data via opportunistic stacking (Engemann et al., 2020), we were able to include more participants and were thus less likely to exclude participants with a lower g-factor.
3.2.2 | Comparing out-of-sample predictive ability of multimodal MRI between the stacked model and N-back task-based fMRI
N-back task-based fMRI provided the best out-of-sample predictive ability for both baseline and follow-up test sets, relative to other modality-specific models. Figure 4 compares the predictive ability between the Stacked Best and N-back task-based fMRI using bootstrapped differences. The Stacked Best had significantly higher performance in both baseline and follow-up test sets, reflected by higher Pearson's r and R² and lower MAE and RMSE. This indicates a boost in predictive performance when multiple modalities were integrated, at around 12% for the baseline data and 6% for the follow-up data. Accordingly, the stacked model performed better than the best single modality.
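A bootstrapped comparison of this kind can be sketched as follows for Pearson's r. The example resamples the same participants for both models and reports a 95% percentile interval of the difference. It uses simulated predictions and fewer iterations than the 5000 used in the study, and the function names are hypothetical; the same pattern would apply to R², MAE and RMSE.

```python
import math
import random


def pearson_r(u, v):
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def bootstrap_r_difference(y, pred_a, pred_b, n_boot=1000, seed=0):
    """95% percentile interval of r(y, pred_a) - r(y, pred_b)."""
    rng = random.Random(seed)
    n = len(y)
    diffs = []
    for _ in range(n_boot):
        # Resample participants once and score both models on the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        r_a = pearson_r(ys, [pred_a[i] for i in idx])
        r_b = pearson_r(ys, [pred_b[i] for i in idx])
        diffs.append(r_a - r_b)
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot) - 1]


# Simulated example: model A tracks the target more closely than model B.
rng = random.Random(7)
y = [rng.gauss(0, 1) for _ in range(400)]
pred_a = [yi + rng.gauss(0, 1) for yi in y]
pred_b = [yi + rng.gauss(0, 3) for yi in y]
low, high = bootstrap_r_difference(y, pred_a, pred_b)
```

An interval that excludes zero, as in this simulated case, is what the text above refers to as a significant difference in performance.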

| Out-of-site predictive ability of multimodal MRI
Based on leave-one-site-out cross-validation, the out-of-site predictive ability of the stacked model was highest, explaining on average 21% (SD = 5.2) of the variance in the g-factor across 21 sites (Table 1 and Figure 5). This confirmed the generalisability of the stacked model and supported its use in the subsequent mediation analyses.

Figure 6 shows the feature importance of both the modality-specific and stacked models. For the modality-specific models, we applied eNetXplorer (Candia & Tsang, 2019a) to show brain features that significantly (empirical p < .05) contributed to the prediction. For N-back task-based fMRI, the g-factor prediction was driven by activity in frontal and parietal areas. For the stacked model, we applied CPI (Strobl et al., 2008) and SHAP (Lundberg & Lee, 2017) to examine which of the modalities contributed strongly to the prediction. CPI and SHAP provided similar results. N-back task-based fMRI had by far the highest importance score.

FIGURE 2 Out-of-sample predictive ability of multimodal MRI as a function of modalities in the test sets for baseline (a) and follow-up (b) samples. Stacked All required the test data with at least one modality present. Stacked Complete required the test data with all modalities present. Stacked Best had the same missing values as the modality with the best prediction (N-back task-based fMRI). Stacked No Best did not have any test data from the modality with the best prediction and had at least one modality present.

FIGURE 3 Missing values in each predictive model in the baseline and follow-up test sets. (a) Shows the differences in the g-factor between participants with versus without missing values for each predictive model in the two test sets. **** indicates p-value < .001 based on Welch's t-test. Positive Cohen's d indicates that participants without missing values had a higher g-factor than participants with missing values. Dot and line are the mean and 2 × standard deviation of the g-factor, respectively. (b) Shows the proportion of missing data for each predictive model in the two test sets.

3.4 | Did the brain-based predictive models mediate the relationships of the behaviourally derived g-factor with socio-demographic, psychological and genetic factors?

| Key socio-demographic and psychological factors
Based on leave-one-site-out cross-validation, socio-demographic and psychological factors explained on average 29.7% (SD = 8.1) of the variance in the behaviourally derived g-factor across sites (see [CI95% = 0.18-0.24]). Accordingly, we used the PGS at the p < .01 PGS threshold as the gene-based g-factor for the mediation analyses.

| Mediation analyses
We tested whether the brain-based g-factor mediated the relationships between the behaviourally derived g-factor and the socio-demography-and-psychology-based and gene-based g-factors. We found significant indirect effects (1) when the socio-demography-and-psychology-based g-factor was the sole independent variable (Figure 7d, proportion mediated = 19.1%), (2) when the gene-based g-factor was the sole independent variable (Figure 8c, proportion mediated = 15.6%) and (3) when both the socio-demography-and-psychology-based g-factor (Figure 9, proportion mediated = 15%) and the gene-based g-factor (Figure 9, proportion mediated = 10.75%) were the covaried independent variables.

FIGURE 4 Comparing out-of-sample predictive ability of multimodal MRI between Stacked Best and the model with the best modality (N-back task-based fMRI). Here, we separately applied bootstrapping on the baseline and follow-up test sets. At each of 5000 iterations, we computed performance indices (including r, R², MAE and RMSE) of the Stacked Best and N-back task-based fMRI models and subtracted the performance indices of N-back task-based fMRI from those of Stacked Best. Dotted lines indicate 95% confidence intervals. MAE, mean absolute error; R², coefficient of determination; RMSE, root mean squared error.

FIGURE 5 Out-of-site predictive ability of multimodal MRI via leave-one-site-out cross-validation. We evaluated out-of-site predictive ability between predicted versus observed g-factor in the hold-out site. Note that DTI data were not available from three sites (sites 1, 17 and 19). MAE, mean absolute error; R², coefficient of determination; RMSE, root mean squared error.

| DISCUSSION
Following the RDoC's integrative approach for cognitive abilities (Morris & Cuthbert, 2012), we aimed to develop brain-based predictive models that can (a) improve our current ability to predict children's cognitive abilities and (b) account for the relationships between cognitive abilities and socio-demographic, psychological and genetic factors.
Here, we showed that incorporating data from different MRI modalities into stacked models substantially improved our ability to predict cognitive abilities, operationalised as the behaviourally derived g-factor. Our brain-based, stacked predictive models were stable across years and generalisable to different sites while being able to handle missing values. Moreover, we showed that the brain-based, stacked models significantly, albeit partially, mediated the relationships of the behaviourally derived g-factor with socio-demographic, psychological and genetic factors. Thus, our brain-based predictive models for children's cognitive abilities facilitate the development of a robust, transdiagnostic research tool for cognition at the neural level, in keeping with the RDoC's integrative framework.

FIGURE 6 Feature importance of the modality-specific and stacked models. For the modality-specific models, we applied eNetXplorer (Candia & Tsang, 2019a) permutation and only plotted brain features with empirical p < .05. For the stacked model, we applied conditional permutation importance (CPI) (Debeer & Strobl, 2020) and SHapley Additive exPlanations (SHAP) (Lundberg & Lee, 2017). Both CPI and SHAP were computed based on the second-layer training set. Error bars in the CPI plot show an interval between the 0.25 and 0.75 quantiles of the CPI for each tree in the random forests. The '_large' and '_small' suffixes indicate whether the missing values were coded as a large (1000) or small (−1000) number, respectively. For SHAP, we combined Shapley values across the two coded features of the same modality. We then ranked the modalities according to the absolute value of SHAP; the highest one was N-back task-based fMRI. Note that the grey colour indicates observations with a missing value (coded as 1000 or −1000). ant, anterior; G, gyrus; IFG, inferior frontal gyrus; L, left; Lat, lateral; med, medial; R, right; S, sulcus; Sup, superior.

4.1 | The brain-based, stacked predictive models for the g-factor were (1) predictive, (2) longitudinally stable, (3) robust against missing values and (4) explainable
We developed longitudinal predictive models for children's g-factor from MRI data of different modalities. We built models from the baseline MRI data and tested them on unseen children at the same age and 2 years older. We found similar predictive abilities across these two test sets for all modality-specific and stacked models.
That is, the models that had high out-of-sample prediction on same-age children also had high out-of-sample prediction on older children, suggesting the longitudinal stability of MRI for many modalities. The best model across all performance indices (Pearson's r, R², MAE and RMSE) was the stacked model that incorporated all six modalities, which was followed closely by the N-back task-related fMRI model. Apart from the SST task-related fMRI model, other models (including the MID task-related fMRI, rs-fMRI, sMRI and DTI) performed moderately well. We also found a similar magnitude for out-of-site predictive ability based on leave-one-site-out cross-validation, suggesting the generalisability of MRI not only across ages but also across data collection sites. Overall, the stacked model partially predicted the children's g-factor at around 20% of the variance. This made the stacked model the most generalisable model to out-of-sample, out-of-site children as well as the most longitudinally stable model.

FIGURE 7 (d) Shows a mediation analysis where (1) the socio-demography-and-psychology-based g-factor (the out-of-site predicted values of the g-factor based on the key socio-demographic and psychological factors at all hold-out sites) is the independent variable, (2) the brain-based g-factor (the out-of-site predicted values of the g-factor of the stacked model based on multimodal MRI data at all hold-out sites) is the mediator and (3) the behaviourally derived g-factor (the observed g-factor) is the dependent variable. % under the indirect effect indicates proportion mediated. [] indicates a 95% confidence interval based on bootstrapping. MAE, mean absolute error; R², coefficient of determination; RMSE, root mean squared error.

Beyond generalisability across ages and sites, the stacked model based on opportunistic stacking (Engemann et al., 2020) also allowed us to handle missingness in the data.
This is especially important for children's MRI data given high levels of noise in certain modalities (Fassbender et al., 2017). If we were to use data only from children with all modalities present (i.e., the Stacked Complete), the model would not apply to around 80% of the children. The opportunistic stacking allowed us to use the data as long as one modality was present (i.e., the Stacked All), reducing the exclusion to just around 5%.
Importantly, the predictive performance of Stacked Complete and Stacked All were both relatively high, supporting the ability of opportunistic stacking to deal with the missing data. Furthermore, handling missingness in the data via opportunistic stacking also heightened the chance of including participants with a wider range of the g-factor, including those with a lower g-factor who usually had missingness in the MRI data (perhaps due to high movement artefacts [Fassbender et al., 2017]). Moreover, even without the best modality, the predictive performance of the stacked model (i.e., the Stacked No Best; e.g., R²) appeared stronger in magnitude than that of any non-optimal modality by itself. Accordingly, in settings where not all of the modalities are available, researchers/practitioners can still take advantage of the boosted predictive ability of the stacked models over unimodal models.

FIGURE 9 Mediation analysis with both key socio-demographic and psychological factors as well as genetic factors as independent variables. Specifically, this model treated (1) the socio-demography-and-psychology-based g-factor (i.e., the out-of-site predicted values of the g-factor based on the key socio-demographic and psychological factors at all hold-out sites) and (2) the gene-based g-factor (i.e., the PGS of cognitive abilities at the p < .01 PGS threshold) as two separate independent variables. It treated the brain-based g-factor (i.e., the out-of-site predicted values of the g-factor of the stacked model based on multimodal MRI data at all hold-out sites) as the mediator and the behaviourally derived g-factor (i.e., the observed g-factor) as the dependent variable. Not shown in the figure are four PCs included as the control variables to adjust for population stratification. % under the indirect effect indicates proportion mediated. [] indicates a 95% confidence interval based on bootstrapping. The dotted, double-arrowed line indicates covariation between the two independent variables. PGS, polygenic score.
The stacked model improved predictive ability over and above the best modality, which was the N-back task-based fMRI. This is based on bootstrapping distributions of the differences in performance indices between the N-back task-based fMRI and the stacked model with the same participants (i.e., the Stacked Best). Our finding is consistent with previous studies showing the enhanced predictive power of the stacked model (Engemann et al., 2020; Rasero et al., 2021). Yet, it is important to note that, while the improvement in performance was statistically significant, the magnitude of this improvement was somewhat modest. For instance, in the case of the baseline samples, the Stacked Best led to r = 0.442 and R² = 0.195, which was improved from the N-back task-based fMRI at r = 0.402 and R² = 0.072, rendering the improvement at around r ~ 0.04 and R² ~ 0.123. Accordingly, for researchers who have access to all MRI modalities and several fMRI tasks, including the N-back task, using the stacked model should provide the best possible performance for predicting the g-factor. However, if resources are constrained, the next best option would be using the N-back task-based fMRI along with other modalities that are available.
In addition to predictability, our machine-learning framework allowed for easy-to-explain models, highlighting the neurobiological bases of children's g-factor. Explainability is used here in a specific machine-learning sense (Molnar, 2019), referring to the extent to which a technique allows us to explain the contribution of each brain feature to the prediction. Here, CPI (Debeer & Strobl, 2020) and SHAP (Lundberg & Lee, 2017) allowed us to infer that prediction from the stacked model was driven primarily by N-back task-related fMRI. This indicates the important role of working memory. eNetXplorer permutation (Candia & Tsang, 2019a) further showed us that fMRI activity in the parietal and frontal areas during the N-back task drove the prediction. These areas were similar to the areas previously found in a recent study in adults. Similarly, we also found brain indices from other modalities, from activity during other tasks to cortical thickness and white matter density, that contributed to the prediction of the g-factor, albeit with lower predictive performance.
Unlike previous unimodal (Dubois et al., 2018; Genç et al., 2018; Góngora et al., 2020; Gray et al., 2003; Narr et al., 2007; Pamplona et al., 2015; Waiter et al., 2009) and multimodal studies (Rasero et al., 2021), we were able to compare the ability of task-based fMRI with other modalities in predicting the g-factor. We found that one of the three task-based fMRI models, the N-back, performed exceptionally well. Based on the CPI (Debeer & Strobl, 2020) and SHAP (Lundberg & Lee, 2017), the N-back task-related fMRI appeared to drive the prediction of the stacked model. This finding is consistent with a recent study using adults' data from the Human Connectome Project, showing superior performance of the N-back task in predicting the g-factor, compared to rs-fMRI and other tasks. Showing that task-based fMRI from a certain task could capture cognitive ability across a 2-year gap provided a promising outlook for the use of task-based fMRI as a predictive tool. Our finding contrasts with a more common practice in cognitive neuroscience that usually relies on sMRI (McDaniel, 2005; Mihalik et al., 2019; Pietschnig et al., 2015) or rs-fMRI (Dubois et al., 2018; Rasero et al., 2021) when predicting cognitive abilities. These sMRI and rs-fMRI studies often result in poorer predictive performance (at r < 0.4) than what was found here. Accordingly, we are in agreement with a recent movement (Finn, 2021) for studies on individual differences to move from rs-fMRI and embrace other MRI modalities, including task-based fMRI.
It is important to note that not all fMRI tasks were suitable for predicting certain targets. The N-back task and SST, for instance, were designed to capture working memory (Barch et al., 2013;Casey et al., 2018) and inhibitory control (Casey et al., 2018;Whelan et al., 2012), respectively. Accordingly, both should be related to the g-factor, especially on memory recall and mental flexibility portions of the g-factor. Yet, only the N-back task showed good predictive ability. This may be due to different cognitive processes in each task (i.e., working memory vs. inhibitory control) or to different task configurations. It is entirely possible, for instance, that the block design used in the N-back, as opposed to the event-related design used in the SST, allowed the N-back to have higher predictive power. Accordingly, while task-based fMRI can have high predictive power, systematic comparisons are required in future research to better understand the characteristics of some tasks that make them more suitable for predicting the g-factor and other individual differences.
4.2 | The brain-based, stacked predictive models for the g-factor demonstrated construct validity, according to the RDoC framework (Insel et al., 2010)
Here we tested the construct validity of the brain-based, stacked predictive models for the g-factor according to the RDoC framework (Insel et al., 2010). The RDoC proposes that cognitive abilities are affected by socio-demographic and psychological factors (Morris & Cuthbert, 2012; NIMH, n.d.-b). The RDoC also proposes that cognitive abilities as measured by brain differences belong to the same domain as cognitive abilities as measured by gene differences (Insel et al., 2010; Morris & Cuthbert, 2012; NIMH, n.d.-a, n.d.-c). Accordingly, to satisfy these presuppositions, our brain-based, stacked predictive models for the g-factor should be able to capture the relationship between the behaviourally derived g-factor and socio-demographic, psychological and genetic factors.
We first built a predictive model of the g-factor using 70 socio-demographic and psychological features (Kirlic et al., 2021), resulting in the socio-demography-and-psychology-based g-factor. This model had relatively high performance, accounting for around 30% of the variance in the g-factor. Moreover, the top contributing features are consistent with previous studies, including socio-demographics (Farah et al., 2006) (e.g., parents' education and income) along with children's mental health (Biederman et al., 2004; Goodall et al., 2018) (e.g., attention and social problems) and children's extracurricular activities (Kirlic et al., 2021). More importantly, our mediation analysis showed that the brain-based g-factor captured approximately 19% of the relationship between the behaviourally derived g-factor and the socio-demography-and-psychology-based g-factor.
As for the genetic factor, we first showed that the PGS based on adults' cognitive abilities was related to children's g-factor, consistent with a recent study (Allegrini et al., 2019). This enabled us to use the PGS of cognitive abilities as the gene-based g-factor. Similar to the socio-demography-and-psychology-based g-factor, our mediation analysis showed that the brain-based g-factor accounted for approximately 16% of the relationship between the behaviourally derived g-factor and the gene-based g-factor. In fact, mediation from the brain-based g-factor was still significant when having both socio-demography-and-psychology-based and gene-based g-factors together as independent variables in the model. Altogether, our brain-based, stacked predictive models for the g-factor demonstrated the construct validity of cognitive abilities that is in line with the RDoC framework (Insel et al., 2010).

| Applications, limitations and disclaimers
For applications, our brain-based predictive models for the g-factor  Morris & Cuthbert, 2012; NIMH, n.d.-a, n.d.-c). In fact, the predictive ability of our brain-based predictive models in capturing the behavioural performance of cognitive tasks was considerably higher than that of PGS (multimodal MRI's r ~ 0.4 and R² ~ 0.2 vs. PGS's r ~ 0.21 in our study and R² < 0.1 in another study [Allegrini et al., 2019]), suggesting the potential use of brain-based predictive models for a robust, transdiagnostic, brain-based marker for cognitive abilities.
With opportunistic stacking, those who wish to adapt our brain-based predictive models to compute a transdiagnostic brain-based marker for cognition in their own data, but do not have as many modalities as the ABCD, can still use our models. That is, they can still use the model built from the ABCD and impute missing values of certain modalities to fit their study. Accordingly, our use of opportunistic stacking provides a scalable and flexible approach for future researchers following the RDoC framework (Morris & Cuthbert, 2012).
Our study is not without limitations. We relied on the ABCD study's curated, preprocessed data (Casey et al., 2018;Hagler et al., 2019;Yang & Jernigan, n.d.). This provided certain advantages.
For instance, given that the curated data provided by the ABCD have already been preprocessed, other studies that wish to apply our model of the g-factor to the ABCD data can readily do so without concerns about differences in preprocessing steps. Preprocessed data also enabled us to apply the manual quality control done by the study, a process that required time and well-trained labour (Casey et al., 2018; Hagler et al., 2019; Yang & Jernigan, n.d.). Preprocessing large-scale multimodal data ourselves would not only demand significant computing power and time but would also be prone to error. However, using the preprocessed data only allowed us to follow the choices of processing made by the study. For example, ABCD Release 3 only provided Freesurfer's parcellation (Destrieux et al., 2010; Fischl et al., 2002) for task-based fMRI. While this popular method allowed us to explain task-based activity on subject-specific anatomical landmarks, the regions are relatively large compared to other parcellations. Future studies will need to examine if smaller and/or different parcellations would improve predictive performance. Next, our predictive modelling framework was designed to predict the out-of-sample g-factor, but not the developmental changes in the g-factor, from multimodal MRI. More specifically, we standardised MRI and cognitive data within each age group to satisfy the assumption of our machine-learning algorithms (Zou & Hastie, 2005) and to force behavioural performance from different cognitive tasks onto the same scale. This unfortunately made our predictive models inappropriate for predicting the developmental changes in cognition over years (Moeller, 2015). Future research that aims to capture the developmental changes in cognition would need to employ a different strategy for standardisation (Moeller, 2015).
In terms of important disclaimers, research reporting on cognitive abilities can be misunderstood or misquoted for alien purposes (Suzuki & Aronson, 2005). It is therefore important to clarify the following. First, the fact that measurements taken from the brain were related to cognitive abilities should not be equated with assertions that variability in cognitive abilities is 'purely biological'. Here, we showed that the predictive model for the g-factor based on socio-demographic and psychological variables that were available in the ABCD (Kirlic et al., 2021) already accounted for a larger proportion of variance in the g-factor (~30%) than the predictive models based on the brain (~20%) or genes (<10% [Allegrini et al., 2019]). Moreover, our mediation analysis showed that the brain-based predictive models could only account for approximately 19% of the relationship between cognitive abilities and socio-demographic and psychological factors.
Accordingly, it is very plausible that socio-demographic and psychological circumstances, broadly construed, have at least partial aetiological primacy. Second, it should be clear that socio-demographic, psychological and genetic circumstances may not be independent of one another, as suggested by studies on the complex interplay of genes and environments on cognitive abilities over the course of cognitive development (Tucker-Drob et al., 2013;Tucker-Drob & Briley, 2014). This is shown in our mediation analyses. Here, the brain-based g-factor showed a smaller proportion mediated for the influences of socio-demographic and psychological factors and of genes when they were included together in the model, compared to when they were included in separate models. This suggests interdependency among the brain, genes, and socio-demographic and psychological factors, as proposed by the RDoC (Insel et al., 2010;NIMH, n.d.-a). Third, under no circumstances should the results of this article be interpreted as entailing a value judgement about how people vary in measurements of cognitive abilities. Indeed, it is important to reflect on the fact that the way we measured cognitive abilities, for example, through the g-factor here, reflects norms that are entrenched in cultures and societies of a certain time in history, rather than reflecting some universal truth or a supra-historical marker of cognitive abilities (Flynn, 2009). The value of the g-factor here is as a marker (present in early life) of a series of other important life outcomes in current societal circumstances.
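For readers unfamiliar with the 'proportion mediated' statistic referred to above, the following is a minimal sketch of one common definition for a single mediator under ordinary least squares: the indirect effect a*b divided by the total effect a*b + c'. It is an illustration under simplifying assumptions (one predictor, one mediator, no covariates, no bootstrapped confidence intervals) and is not the mediation procedure used in this study. The function name and the toy data are hypothetical.

```python
from statistics import mean


def centred(xs):
    """Subtract the mean so that intercepts drop out of the regressions."""
    m = mean(xs)
    return [x - m for x in xs]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def proportion_mediated(x, m, y):
    """Return a*b / (a*b + c') from two OLS fits.

    a      : slope of the mediator m regressed on x
    c', b  : slopes of y regressed on x and m together
    """
    x, m, y = centred(x), centred(m), centred(y)
    sxx, smm, sxm = dot(x, x), dot(m, m), dot(x, m)
    sxy, smy = dot(x, y), dot(m, y)
    a = sxm / sxx
    # Solve the 2x2 normal equations for y ~ x + m.
    det = sxx * smm - sxm ** 2
    c_direct = (smm * sxy - sxm * smy) / det
    b = (sxx * smy - sxm * sxy) / det
    indirect = a * b
    return indirect / (indirect + c_direct)


# Hypothetical toy data built so that a = 0.5, b = 0.4 and c' = 0.8.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
noise = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]  # uncorrelated with x by design
m = [0.5 * xi + e for xi, e in zip(x, noise)]
y = [0.8 * xi + 0.4 * mi for xi, mi in zip(x, m)]

pm = proportion_mediated(x, m, y)
print(round(pm, 3))
```

In this toy example the indirect path carries 0.5 × 0.4 = 0.2 of a total effect of 1.0, so the proportion mediated is 0.2, which is the same kind of quantity as the approximately 19% reported for the brain-based models.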

| CONCLUSION
In conclusion, we developed brain-based, stacked predictive models for children's cognitive abilities that were longitudinally stable, generalisable and robust against missingness. More importantly, our brain-based models were able to partially mediate the relationships of childhood cognitive abilities with socio-demographic, psychological and genetic factors. Accordingly, our approach should pave the way for future researchers to employ multimodal MRI as a useful research tool for integrative, RDoC-inspired research in cognition and mental health.

ACKNOWLEDGEMENTS
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood.

CONFLICT OF INTEREST
The authors declare no conflicts of interest.

DATA AVAILABILITY STATEMENT
Data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA). This is a multisite, longitudinal study designed to recruit more than 10,000 children age 9-10 and follow them over 10 years into early adulthood. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/scientists/workgroups/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report.