Can we diagnose mental disorders in children? A large‐scale assessment of machine learning on structural neuroimaging of 6916 children in the adolescent brain cognitive development study

Abstract Background Prediction of mental disorders based on neuroimaging is an emerging area of research with promising first results in adults. However, research on the unique demographic of children is underrepresented and it is doubtful whether findings obtained on adults can be transferred to children. Methods Using data from 6916 children aged 9–10 in the multicenter Adolescent Brain Cognitive Development study, we extracted 136 regional volume and thickness measures from structural magnetic resonance images to rigorously evaluate the capabilities of machine learning to predict 10 different psychiatric disorders: major depressive disorder, bipolar disorder (BD), psychotic symptoms, attention deficit hyperactivity disorder (ADHD), oppositional defiant disorder, conduct disorder, post‐traumatic stress disorder, obsessive‐compulsive disorder, generalized anxiety disorder, and social anxiety disorder. For each disorder, we performed cross‐validation and assessed whether models discovered a true pattern in the data via permutation testing. Results Two of 10 disorders can be detected with statistical significance when using advanced models that (i) allow for non‐linear relationships between neuroanatomy and disorder, (ii) model interdependencies between disorders, and (iii) avoid confounding due to sociodemographic factors: ADHD (AUROC = 0.567, p = 0.002) and BD (AUROC = 0.551, p = 0.002). In contrast, traditional models perform consistently worse and predict only ADHD with statistical significance (AUROC = 0.529, p = 0.002). Conclusion While the modest absolute classification performance does not warrant application in the clinic, our results provide empirical evidence that embracing and explicitly accounting for the complexities of mental disorders via advanced machine learning models can discover patterns that would remain hidden with traditional models.

The most successful studies employing machine learning are those focusing on Parkinson's disease, Alzheimer's disease, schizophrenia, major depressive disorder (MDD), autism spectrum disorder, and ADHD (Arbabshirani et al., 2017; Sakai & Yamada, 2019; Woo et al., 2017). While positive results can be expected for neurodegenerative disorders that involve neuroanatomical changes visible on a macroscopic level (Love et al., 2018), it is notable that similar results were shown for mental disorders for which distinct disease-causing neuroanatomical changes have not been established (Borsboom et al., 2019). For instance, for MDD, median reported accuracies are as high as 79% (Sakai & Yamada, 2019), 82% (Arbabshirani et al., 2017), and 86.7% (Woo et al., 2017), while sensitivity and specificity are within the 70%-90% range (Kambeitz et al., 2017). Promising results are also reported by a few studies exclusively focusing on mental disorders in child and adolescent participants (First et al., 2018). Using structural magnetic resonance imaging (sMRI) data from the ENIGMA consortium, ADHD in children could be detected with a test set area under the curve of 0.64 (Zhang-James et al., 2021). Pediatric depression could be successfully predicted with an accuracy of 78.4% (Wu et al., 2015). However, these previous findings are limited in their generalizability by the fact that small-scale studies that investigate single disorders exclusively in adults still dominate. In particular, research on neuroimaging biomarkers in children is rare and it is doubtful whether biomarker findings obtained on adults can be readily transferred to children.
Building on the ABCD study (Karcher & Barch, 2021) and the ABCD Neurocognitive Prediction Challenge (Pohl et al., 2019), we carry out an explorative analysis to rigorously evaluate the potential of using sMRI and machine learning techniques to predict 10 different psychiatric disorders in an ecologically valid sample of 6916 children. We study MDD, BD, psychotic symptoms, ADHD, oppositional defiant disorder (ODD), conduct disorder (CD), post-traumatic stress disorder, obsessive-compulsive disorder (OCD), generalized anxiety disorder (GAD), and social anxiety disorder (SAD).

Participants
All 11,875 participants from the baseline assessment of the ABCD study (Karcher & Barch, 2021) were considered for inclusion in the present study. The participants of the ABCD study were mainly recruited through the US school systems. The ABCD study focused on ensuring that the sample reflects the diversity of the US population by employing probability sampling of US schools as the primary method for recruiting eligible children.
To this end, school selection was informed by gender, race and ethnicity, socioeconomic status, and urbanicity. Despite the effort in ABCD to match the demographics of the US population, the sample may not be representative in all dimensions that influence a child's development. Participants provided informed consent (parents) and assent (child). First, all participants with missing structural brain measures (see section Imaging Data below) were excluded. Imputation is not an option for recovering missing image features as either all or none of the brain measures are available. Second, we also excluded all participants for whom Kiddie Schedule for Affective Disorders and Schizophrenia (K-SADS) diagnoses for the investigated clinical conditions were missing. To rule out statistical dependencies due to sibling relationships, we excluded all but one randomly selected child from each family.
For a detailed flow diagram of the participant selection, see Figure 1.
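The sibling exclusion step can be sketched in a few lines. This is a minimal illustration and not the authors' code: the mapping `family_of`, the participant and family identifiers, and the seed are all hypothetical.

```python
import random
from collections import defaultdict

def one_child_per_family(family_of, seed=0):
    """Keep exactly one randomly chosen participant per family.

    family_of maps a participant id to a family id (hypothetical names).
    Returns the sorted list of retained participant ids.
    """
    rng = random.Random(seed)
    families = defaultdict(list)
    # Sort first so the result depends only on the seed, not on dict order.
    for participant, family in sorted(family_of.items()):
        families[family].append(participant)
    return sorted(rng.choice(members) for members in families.values())

# Toy example: two sibling pairs and one only child.
family_of = {"p1": "famA", "p2": "famA", "p3": "famB", "p4": "famC", "p5": "famC"}
kept = one_child_per_family(family_of)
```

Fixing the seed makes the random selection reproducible, which matters because every later step depends on which sibling was retained.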

Psychiatric disorders
The psychiatric disorders in the ABCD study were diagnosed using a computerized version of the K-SADS (Kaufman et al., 1997) for DSM-5 (KSADS-5; see Barch et al., 2018, for details regarding the application of this instrument in the ABCD study). The paper-and-pencil KSADS is a well-established diagnostic interview with good to excellent test-retest reliability (Cohen's Kappa = 0.63-1.00) (Ambrosini, 2000) and high interrater agreement (93%-100%) (Kaufman et al., 1997). Preliminary validity data on the computerized KSADS-5 demonstrated that it possesses good to excellent clinical validity (Townsend et al., 2020). Initially, we considered all DSM-5 coded diagnoses (American Psychiatric Association, 2013) that were available in the baseline assessment of the ABCD study (see Table S2 in Barch et al., 2021) as prediction targets. From these, we excluded all "unspecified" and "other specified" diagnoses due to validity concerns and the diagnoses for persistent depressive disorder, panic disorder, agoraphobia, eating disorders, disruptive mood dysregulation disorder, and substance use disorder due to sparse case numbers (Table S1). Separation anxiety disorder was excluded because two common anxiety disorders, GAD and SAD, were already included.
Most psychiatric diagnoses in the ABCD study were assessed by both parental and youth self-report. A well-documented phenomenon in such multi-informant mental health assessments is high rates of disagreement between parent and self-reports (De Los Reyes et al., 2015), with the ABCD study being no exception.
While historically many techniques have been proposed to integrate these discrepant reports into a best estimate of the child's "true" diagnostic state, there exists no established best practice so far (Martel et al., 2017). A frequently applied (prominently in the DSM-IV field trials; Lahey et al., 1994) and robust (Bird et al., 1992; Piacentini et al., 1992) strategy is the OR rule, where a child's symptom is considered present when it is reported by at least one informant. We used the OR rule to aggregate parental and self-report DSM-5 diagnoses into a single binary disorder label for each child. Moreover, we assumed the presence of a disorder when the diagnosis was given at any time state (current, past, or in remission) to capture not only the current state but the lifetime history of the disorder.
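The aggregation described above can be sketched as follows. This is an illustrative sketch only: the function names and the dictionary layout of the reports are assumptions, not the structure of the ABCD data release.

```python
def or_rule(parent: bool, youth: bool) -> bool:
    """OR rule: the diagnosis counts as present if at least one informant reports it."""
    return parent or youth

def lifetime_diagnosis(reports) -> bool:
    """Combine informants and time states into one binary label.

    reports maps a time state ('current', 'past', 'remission') to a
    (parent_report, youth_report) pair; this layout is hypothetical.
    """
    return any(or_rule(parent, youth) for parent, youth in reports.values())

# A child whose only positive report is a past youth self-report.
example = {"current": (False, False), "past": (False, True), "remission": (False, False)}
label = lifetime_diagnosis(example)
```

Because both steps are logical ORs, the resulting label is maximally inclusive: a single positive report from either informant at any time state marks the disorder as present.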
To increase case numbers, we aggregated bipolar I and II disorder into a composite category "bipolar disorder". Lastly, while there is robust evidence for brain structural abnormalities in schizophrenia (van Erp et al., 2018), the baseline assessment of the ABCD study did not include any diagnoses from the Schizophrenia Spectrum and Other Psychotic Disorders DSM-5 category. To still include a surrogate of psychotic symptomatology in our study, we defined "psychotic symptoms" as the presence of at least one of the symptoms of hallucinations, delusions, or associated psychotic symptoms or a diagnosis of unspecified schizophrenia spectrum/other psychotic disorder in the KSADS-5 parent interview.

Imaging data
To capture a diverse range of anatomical changes, we extracted 136 volumetric and thickness measures based on FreeSurfer (Fischl, 2012) (see Table S2) from participants' T1-weighted sMRI. See Appendix S1 for a detailed description of the feature extraction pipeline. We do not include functional MRI data in the analysis due to the very poor reliability of task-related brain activations in the ABCD sample (Kennedy et al., 2022), which largely prevents its use for studying individual differences.

Known confounders
To account for known confounding effects on the relationship between neuroanatomy and disorder, we residualized (see Appendix S1) all 136 neuroanatomical measures to exclude effects due to the sociodemographic variables age, sex, marriage status of parents, study site, highest parental education, ethnicity, and supratentorial brain volume, which were used as known confounders on fluid intelligence in the ABCD Neurocognitive Prediction Challenge (Pohl et al., 2021).
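A minimal sketch of linear residualization is given below. The exact procedure is described in Appendix S1, which is not reproduced here, so several choices are assumptions: ordinary least squares with an intercept, numerically encoded confounders (categorical variables such as study site would need one-hot encoding first), and fitting the regression on the training portion only before applying it to held-out data.

```python
import numpy as np

def fit_residualizer(confounds_train, features_train):
    """Fit one ordinary least squares model per feature on the training data."""
    design = np.column_stack([np.ones(len(confounds_train)), confounds_train])
    coef, *_ = np.linalg.lstsq(design, features_train, rcond=None)
    return coef  # shape: (1 + n_confounds, n_features)

def residualize(confounds, features, coef):
    """Subtract the part of each feature that the confounders explain linearly."""
    design = np.column_stack([np.ones(len(confounds)), confounds])
    return features - design @ coef

# Toy data: 3 confounders and 5 features (stand-ins for the 136 brain measures).
rng = np.random.default_rng(0)
confounds = rng.normal(size=(200, 3))
features = confounds @ rng.normal(size=(3, 5)) + rng.normal(size=(200, 5))

coef = fit_residualizer(confounds[:150], features[:150])
train_resid = residualize(confounds[:150], features[:150], coef)
test_resid = residualize(confounds[150:], features[150:], coef)
```

On the training data the residuals are, by construction, uncorrelated with every confounder, so a downstream classifier cannot exploit their linear effects.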

Machine learning models
We trained ensembles of gradient boosted trees (GBM; Friedman, 2001) to predict the binary psychiatric disorder based on brain structural measures. GBMs currently represent the state-of-the-art in classification algorithms (Zhang et al., 2017). Bayesian hyperparameter optimization (J. Snoek et al., 2012) was used to tune the hyperparameters of the model. To account for interdependencies between disorders, we constructed classifier chains (Read et al., 2011) of 10 GBMs. For each classifier chain, disorders are randomly ordered in a list and 10 GBMs are trained, where the ith GBM is trained to predict the ith disorder in the list, using image-derived measurements and the presence/absence of all disorders preceding i in this list. We considered multiple such random orderings by creating a final ensemble of 10 such chains, which we refer to as GBM-CCE. In addition, we employed a simple logistic regression classifier (LRC) as a linear benchmark model. See Appendix S1 for a detailed description of the machine learning models.
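The chain-ensemble idea can be sketched with scikit-learn. This is an approximation under stated assumptions, not the authors' implementation: it uses scikit-learn's `GradientBoostingClassifier` with arbitrary untuned settings, averages predicted probabilities across chains, and relies on `ClassifierChain`'s default behavior of feeding predicted (not true) labels of preceding disorders at prediction time. The toy data and the function names `fit_chain_ensemble` and `predict_proba_ensemble` are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multioutput import ClassifierChain

def fit_chain_ensemble(X, Y, n_chains=10, seed=0):
    """Fit n_chains classifier chains, each with its own random label order."""
    chains = []
    for i in range(n_chains):
        chain = ClassifierChain(
            GradientBoostingClassifier(n_estimators=20, random_state=seed + i),
            order="random",           # random ordering of the disorders
            random_state=seed + i,
        )
        chains.append(chain.fit(X, Y))
    return chains

def predict_proba_ensemble(chains, X):
    """Average the per-disorder probabilities over all chains (an assumption)."""
    return np.mean([chain.predict_proba(X) for chain in chains], axis=0)

# Toy data: 6 features and 3 correlated binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 6))
Y = np.column_stack([
    X[:, 0] + rng.normal(size=120) > 0,
    X[:, 1] + rng.normal(size=120) > 0,
    X[:, 0] + X[:, 1] + rng.normal(size=120) > 0,
]).astype(int)

chains = fit_chain_ensemble(X, Y, n_chains=3)
proba = predict_proba_ensemble(chains, X)  # shape: (n_samples, n_labels)
```

Using several random orderings avoids committing to one arbitrary sequence of disorders, since the information each link can use depends on its position in the chain.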

Model evaluation
The performance of the GBM-CCE and LRC in predicting each disorder in the test set was measured in terms of the area under the receiver operating characteristic curve (AUROC). To assess statistical robustness and reproducibility, we employed a 30-times repeated 5-fold cross-validation scheme (see Figure 2). In each of the 30 repeats of the outer loop, the dataset was randomly divided into five parts of equal size (folds). In each of the five iterations of the inner loop, model training and validation were performed on four of the folds (80% of the data), and the model was tested on the remaining fold (20%), until each fold had been used as the test set exactly once. Each of these 150 individual cross-validation splits represented an independent experiment with a newly initialized model, so that there was no data leakage. All data splits were stratified with respect to all eight disorders to ensure homogeneous label distributions between splits.
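The splitting scheme can be sketched in plain Python. Note one simplification, stated as an assumption: the sketch stratifies on a single binary label, whereas the study stratified across several disorders at once, which requires a multi-label stratification method not shown here. The function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict

def repeated_stratified_kfold(labels, n_splits=5, n_repeats=30, seed=0):
    """Yield (repeat, train_idx, test_idx), keeping class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        # Deal the shuffled members of each class round-robin into the folds.
        for members in by_class.values():
            shuffled = members[:]
            rng.shuffle(shuffled)
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        # Each fold serves as the test set exactly once per repeat.
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(i for j, f in enumerate(folds) if j != k for i in f)
            yield repeat, train_idx, test_idx

# Toy labels: 20 cases and 80 controls.
labels = [1] * 20 + [0] * 80
splits = list(repeated_stratified_kfold(labels))
```

With 30 repeats of 5 folds this yields 150 train/test splits, and a fresh model would be trained on each `train_idx` and scored on the matching `test_idx`.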
Finally, the resulting 150 unique test set AUROC values were averaged. In addition, we calculated mean balanced accuracy (Brodersen et al., 2010), mean sensitivity, and mean specificity based on the binary classification thresholds corresponding to the highest Youden's J statistic (Youden, 1950) in each individual test set.
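Threshold selection by Youden's J statistic, J = sensitivity + specificity - 1, can be illustrated as follows. The sketch is self-contained and uses made-up scores; the function names are hypothetical, and ties between equally good thresholds are resolved in favor of the lowest one, which is an arbitrary choice.

```python
def youden_threshold(y_true, scores):
    """Return (threshold, sensitivity, specificity) that maximizes Youden's J.

    A sample is classified as positive when its score is >= threshold.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    _, threshold, sens, spec = best
    return threshold, sens, spec

def balanced_accuracy(sens, spec):
    """Balanced accuracy is the mean of sensitivity and specificity."""
    return (sens + spec) / 2

y_true = [0, 0, 0, 1, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.9]
threshold, sens, spec = youden_threshold(y_true, scores)
bacc = balanced_accuracy(sens, spec)
```

Because the threshold is chosen on each test set itself, the resulting sensitivity, specificity, and balanced accuracy describe the best achievable operating point on that fold.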
To statistically test whether the models had found real patterns in the features that are predictive of a particular disorder, we employed the permutation test described by Ojala and Garriga (2010), which yielded a p-value for each model with the null hypothesis that the model did not discover a meaningful relationship between features and labels (see Appendix S1 for details). As we performed 20 independent permutation tests (two models times 10 disorders), we applied Bonferroni correction to obtain the adjusted significance level α_adj = 0.05/20 = 0.0025.
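The logic of the label-permutation test and the Bonferroni adjustment can be sketched as below. Several parts are assumptions: `score_fn` is a hypothetical stand-in for the full pipeline (retrain on the given labels, cross-validate, return the mean AUROC), replaced here by a toy that scores fixed predictions; and the p-value uses the common (k + 1)/(n + 1) estimator, whose smallest value with 500 permutations is 1/501, slightly different from the 1/500 quoted in the Results.

```python
import random

def auroc(y_true, scores):
    """Mann-Whitney estimate of the AUROC; tied scores count one half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(labels, score_fn, n_permutations=500, seed=0):
    """Share of label permutations that score at least as well as the real labels."""
    rng = random.Random(seed)
    observed = score_fn(labels)
    n_at_least = 0
    for _ in range(n_permutations):
        shuffled = list(labels)
        rng.shuffle(shuffled)  # break any link between features and labels
        if score_fn(shuffled) >= observed:
            n_at_least += 1
    return (n_at_least + 1) / (n_permutations + 1)

def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-adjusted significance level."""
    return alpha / n_tests

# Toy stand-in for the pipeline: fixed scores that separate the labels perfectly.
labels = [0] * 10 + [1] * 10
scores = [i / 20 for i in range(20)]
p = permutation_p_value(labels, lambda y: auroc(y, scores))
alpha_adj = bonferroni_alpha(0.05, 20)  # two models times 10 disorders
```

A result would be declared significant only if `p` falls below `alpha_adj`, which is why at least several hundred permutations are needed for the test to be able to reach significance at all.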

Sample characteristics
We used a sample of 6916 children from the ABCD study (Karcher & Barch, 2021) (see Figure 1). Table 1 summarizes the characteristics of the sample used in our study. Data was pooled from 22 study sites with a median of 4.0% (IQR: 3.4%-6.0%) of participants coming from a single site (see Figure 3). The most common disorder was ADHD with a prevalence of 18.7% (see Figure 4, top), followed by ODD (14.8%), and OCD (9.4%). The most common co-occurrence of disorders was that of ADHD and ODD with a prevalence of 2.9%, which was more common than any isolated disorder, except for isolated ADHD or ODD (see Figure 4, bottom).

Figure 2. Model evaluation pipeline. Area under the receiver operating characteristic curve (AUROC) values are obtained by evaluating trained models on the test set, then averaged over all cross-validation folds. This results in one average AUROC for the original dataset and a distribution of 500 average AUROC values representing the distribution under the null hypothesis of "no real pattern has been discovered".

Predictive performance
We studied the predictive performance of detecting each of the binary psychiatric disorders using 136 regional subcortical volumes and cortical volume and thickness measurements derived from sMRI.
Using an ensemble of classifier chains of gradient boosted trees (GBM-CCE) has two major advantages over a traditional LRC. First, by relying on gradient boosted trees (Friedman, 2001), we do not have to specify non-linear feature transformations or interaction terms explicitly, but can learn those implicitly from the data. Second, by using classifier chains (Read et al., 2011), our model considers all 10 disorders concurrently. Specifically, we use classifier chains to leverage interdependencies among disorders, which is particularly relevant due to the high prevalence of comorbidities among mental illnesses in youth (Newman et al., 1998). Moreover, we accounted for confounding due to acquisition site and sociodemographic factors by residualizing MRI-derived measurements. Figure 5 summarizes the predictive performance in terms of the AUROC for all 10 disorders, and whether the model has found a real pattern in the data based on permutation testing. Finally, we repeated our experiments using volumetric measures based on the SRI24 Atlas (Rohlfing et al., 2010).
Results for these experiments are summarized in Appendix S2.
The GBM-CCE achieved a statistically significant prediction performance for two disorders: ADHD (mean AUROC = 0.567, p = 0.002) and BD (mean AUROC = 0.551, p = 0.002). For both disorders, the mean AUROC was higher than all 500 AUROC values on the permuted data sets, resulting in the lowest possible p-value of p = 1/500 = 0.002. Table 2 provides the mean balanced accuracy, sensitivity, and specificity for these disorders. Mean balanced accuracy was 56.1% and 56.2%, mean sensitivity 57.7% and 52.6%, and mean specificity 54.6% and 59.9% for ADHD and BD, respectively.
The two disorders, which were predicted significantly, were also among the four most prevalent.
To investigate the performance improvement due to modeling interdependencies among disorders and non-linear relationships, we evaluated a simple LRC that does not have these desirable properties. As seen in Figure 5, such a model achieved a statistically significant prediction only for ADHD, and with a lower performance (AUROC = 0.539, p = 0.002). For the LRC, ignoring confounding, that is, fitting the model on measurements that were not residualized, inflated the predictive performance for all disorders except SAD. The difference in performance with and without adjustment for confounding was highly significant in seven out of 10 disorders for both the GBM-CCE and the LRC (Table S5).

DISCUSSION
In this study, we evaluated the potential of structural neuroimaging to detect 10 psychiatric disorders in children, based on T1-weighted MRI from 6916 children from the ABCD study (Karcher & Barch, 2021). Compared to many earlier studies, the ABCD study offers the advantage of an exceptionally high degree of heterogeneity and ecological validity due to pooling data from 22 sites and featuring mental disorder prevalence and comorbidity rates that closely approximate those of the general population. Moreover, the ABCD study includes assessments of many sociodemographic measures, which enables a rigorous analysis of confounding effects.
Our evaluation showed that the GBM-CCE was able to learn statistically significant patterns to detect ADHD and BD. These findings are suggestive of the existence of brain structural patterns that are associated with certain mental disorders and corroborate previous findings (Arbabshirani et al., 2017; Rashid & Calhoun, 2020; Sakai & Yamada, 2019; Wolfers et al., 2015; Woo et al., 2017).
Importantly, we only obtained a statistically significant classification performance for BD when accounting for interdependencies and co-occurrence of disorders, and non-linear relationships between neuroimaging biomarkers and disorder. A traditional logistic regression model that accounts for neither interdependencies nor non-linear relationships failed to achieve a statistically significant prediction performance for BD and showed lower classification performance in terms of AUROC for all disorders compared with the GBM-CCE.
First, this confirms that the relationship between brain structure and disorders is highly non-linear (Arbabshirani et al., 2017). Second, it suggests that there is no one-to-one mapping between neuroanatomy and disorder, and that patterns of comorbidity can only be leveraged if appropriate machine learning models are employed. The latter relates to the fact that the etiopathophysiology of psychiatric disease is highly complex due to "functional dependencies among neural systems that yield complex patterns of comorbidity" (Beauchaine & Hinshaw, 2020). [...] LRC (5 of 10 disorders), although none of the results achieved statistical significance. This suggests that the choice of neuroanatomical segmentation algorithm may impact the maximum achievable performance, but that modeling interdependencies between diagnoses is often preferable, independent of the segmentation algorithm.
When repeating those analyses without confounding correction (see dashed diamonds in Figure 5), 14 out of 20 experiments resulted in inflated significance. These results indicate that the relationship between neuroanatomy and mental disorder is often confounded. This is confirmed by our supplementary analyses using the SRI24 Atlas, where 17 of 20 experiments were inflated. Hence, it is paramount to account for common confounders to minimize the risk that machine learning models leverage spurious correlations rather than biologically meaningful concepts in their predictions. As an alternative to residualization, one could control for confounding by including sociodemographic variables as additional predictors. While this will likely increase the overall prediction performance, it would not allow us to accurately evaluate the predictive performance of neuroimaging alone, because neuroimaging and sociodemographic factors will both contribute to the prediction. Common sources of confounding in neuroimaging studies include imaging site, age, gender, and head size (Kirkpatrick et al., 2020; Nielsen et al., 2020; Scheinost et al., 2019; Wachinger et al., 2020). In addition, studies in children are often subject to bias due to sociodemographic factors. For instance, children in households with low parental education or those who are part of racial/ethnic minorities are less likely to participate in clinical research (Kirkpatrick et al., 2020; Reiss, 2013). In this study, we accounted for all of these factors to ensure reported prediction performances are indeed based on biologically meaningful signals.

Table 2. Disorders for which significant predictions were achieved.

Disorders
The fact that two of the four most prevalent diagnoses (see Figure 4, top) could be predicted with statistical significance can in part be attributed to the variance of the AUROC under the null hypothesis of "no real pattern has been discovered". For a fixed sample size n, the variance of the AUROC will be larger for rare disorders compared to common disorders, because the variance of the AUROC scales inversely with n₁ · n₂, where n₁ and n₂ (n = n₁ + n₂) are the numbers of cases and controls, respectively (Hanley & McNeil, 1982). In turn, an increase in variance raises the threshold for statistical significance, which is also evident from Figure 5. Hence, our results should be interpreted as conservative estimates with a focus on reducing the type I error (false positives).
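The dependence on n₁ · n₂ can be made concrete with the variance of the AUROC for a classifier without discriminative ability, (n₁ + n₂ + 1) / (12 · n₁ · n₂), which is what the Hanley and McNeil expression reduces to at AUROC = 0.5. The sketch below compares a common and a rare disorder in a single test fold. The 18.7% prevalence is the ADHD figure from this study; the 2% prevalence is a hypothetical rare disorder, and the fold size is a rough approximation. The null distribution used in the study is that of an average over 150 splits, so its spread differs from this single-fold figure.

```python
import math

def null_auroc_sd(n_cases, n_controls):
    """Standard deviation of the AUROC when scores carry no information.

    Uses var = (n1 + n2 + 1) / (12 * n1 * n2), i.e. the Mann-Whitney
    variance without ties.
    """
    return math.sqrt((n_cases + n_controls + 1) / (12 * n_cases * n_controls))

n_test = 1383                     # about one fifth of 6916 participants
common = round(0.187 * n_test)    # ADHD prevalence reported in this study
rare = round(0.02 * n_test)       # hypothetical rare disorder

sd_common = null_auroc_sd(common, n_test - common)
sd_rare = null_auroc_sd(rare, n_test - rare)
```

With the same fold size, the rarer disorder has the wider null distribution, so a larger observed AUROC is needed before it stands out from chance.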
While our findings are statistically significant, the low classification performance in absolute terms seems to contradict results from previous studies (Arbabshirani et al., 2017; Sakai & Yamada, 2019; Woo et al., 2017). Our highest achieved test set performance was a mean AUROC of 0.567 for ADHD (mean balanced accuracy of 56.1%).
In contrast, the largest previous effort to predict ADHD in children (N = 2454) using sMRI features yielded an AUROC of 0.64 (Zhang-James et al., 2021). Current reviews with a main focus on adults even report a mean classification accuracy of over 77% for ADHD across individual studies (Arbabshirani et al., 2017; Woo et al., 2017). For MDD, which could not be predicted with statistical significance in our study, reviews report an accuracy of over 80% (Arbabshirani et al., 2017; Sakai & Yamada, 2019; Woo et al., 2017). In the following, we highlight three key characteristics of the ABCD study that distinguish it from previous studies and that likely contribute to this marked discrepancy.
First, the recruitment process of the ABCD study ensures a near-representative selection of participants to reduce systemic bias. It included probabilistic sampling of schools within the catchment of the 22 research sites (Garavan et al., 2018) and only minimal exclusion criteria (Thompson et al., 2019). Although the resulting sample should not be considered a fully representative sample (Compton et al., 2019), it closely matches the distribution of the U.S. population as a whole (Garavan et al., 2018), thus reaching an exceptionally high degree of ecological validity. In contrast, many of the previous studies assembled equally sized groups of affected and healthy participants (Arbabshirani et al., 2017; Zhang-James et al., 2021).
The second factor that contributes to the observed discrepancy stems from the high heterogeneity that the ABCD study captures and that is largely absent in previous studies. We distinguish between three sources of heterogeneity: (a) sociodemographic diversity, (b) comorbidities, and (c) scanning devices. (a) The ABCD study comprises a highly heterogeneous sample in terms of sociodemographic diversity, because children with a wide range of ethnic, cultural, and economic backgrounds were included. We accounted for these effects by residualizing the image-derived measures. Our results indicate that confounding effects due to sociodemographic status can be substantial. For the two disorders that could be predicted with statistically significant performance, accounting for confounding effects increased the mean AUROC by 0.023 for ADHD and reduced it by 0.012 for BD (see dashed diamonds in Figure 5). (b) The second source of heterogeneity is due to the high rate of comorbidities (see Figure 4). Since different mental illnesses likely overlap in their neuroanatomical associations (Goodkind et al., 2015) and disorder subtypes can arise (Schnack & Kahn, 2016), accurate classification of subjects with comorbid clinical conditions is inherently more difficult. Nevertheless, the fact that our GBM-CCE outperformed the LRC shows that we can also leverage this overlap to improve accuracy. (c) Finally, the ABCD study is subject to heterogeneity due to scanning devices: it employed 29 different scanning devices by three different vendors, across 22 sites (Casey et al., 2018; Owens et al., 2021). Recent studies revealed that large multi-center studies can be sensitive to confounding effects due to differences in acquisition, thus increasing heterogeneity in acquired scans and the measures extracted from them (Alfaro-Almagro et al., 2021; Wachinger et al., 2020). How to best overcome heterogeneity due to differences in scanners is subject to ongoing research, and no consensus has been reached yet.
Precisely these three sources of heterogeneity are thought to be the reason why recent studies on neuroimaging for mental disorders found the striking result that classification accuracy consistently declines as the sample size of studies increases (Arbabshirani et al., 2017; Rashid & Calhoun, 2020; Sakai & Yamada, 2019; Wolfers et al., 2015; Woo et al., 2017). Smaller studies typically comprise a more homogeneous sample with fewer comorbidities, which in turn does away with many of the aforementioned challenges to discovering patterns for classifying the majority of participants accurately (Schnack, 2019).
In this regard, our results are in line with our experience in the ABCD Neurocognitive Prediction Challenge (Pohl et al., 2019), where our team ranked third but none of the predictions of fluid intelligence, by any team, was accurate enough to be meaningful (Pohl et al., 2019), despite access to data from thousands of children.
The third factor that contributes to the observed discrepancy stems from the fact that our work was conducted exclusively on children. Firstly, diagnoses are more difficult to establish in children, because reports from multiple informants must be considered, and there is no single established best practice for aggregating them (Martel et al., 2017). In clinical practice, diagnoses are often given by mental health professionals. However, when only informant survey reports are available, as in the ABCD study, we must resort to simpler rule-based aggregation, such as the OR rule, which may not replicate the clinician gold standard in all cases. Secondly, the neuroanatomy of children's brains changes continuously and nonlinearly during maturation, and children of the same age can be at different points in this process (Tamnes et al., 2017). This change may obscure structural neuroimaging patterns associated with psychiatric disorders, complicating accurate prediction.
In light of the modest classification performance, we currently do not see a clinical utility of biomarkers derived from structural MRI for the diagnosis of mental disorders in children. This is in line with the conclusion drawn by First et al. (2018), who stated that neuroimaging has yet to make an impact on the diagnosis of psychiatric disorders in a clinical setting. Notably, this statement includes the diagnosis of disorders for which prior research studies have reported high classification accuracy. The lack of clinical utility in classifying individuals is mainly due to the small effect sizes of neuroimaging markers with respect to psychiatric illnesses, yielding insufficient sensitivity and specificity. As a future direction, the combination of neuroimaging markers with non-imaging data may be promising to improve the performance by establishing a more comprehensive picture of a patient. In this regard, exploring how neuroimaging can enhance already established diagnostic tools such as clinical and neurocognitive tests will be of particular clinical relevance.
There are several limitations to this study. First, because the ABCD study was conducted in the U.S., the data are limited to this geographic region. Factors such as symptom reports and ethnographic backgrounds may not be representative of other geographic and cultural contexts. Second, socioeconomic variables in the ABCD study are not exactly representative of the comparable-age U.S. population. Both family income and the proportion of participants with married parents, who are both employed, are higher in the ABCD cohort (Heeringa & Berglund, 2020). Although we adjusted our analysis for parental marriage status, residual confounding may have persisted and may limit the generalizability of our results to the general population. Although we followed the common approach of assuming a linear relationship between confounding factors and neuroanatomical measures (L. Snoek et al., 2019), we cannot prove based on data alone that this is indeed the true (unknown) causal relationship. Moreover, the classifier is dependent on hyperparameters and while we have performed Bayesian hyperparameter optimization, further tuning may improve results. While model explainability was not an explicit focus of our study, future research on predicting mental disorders using neuroimaging could benefit from incorporating feature importance analyses, as this would add knowledge of the associations between features (such as regional brain volumes or comorbid disorders) and the target disorder. Finally, we captured neuroanatomical changes by relying on regional volume and thickness measures. While these measures are relatively robust to imaging noise, they cannot capture the full range of neuroanatomical changes, because multiple geometric structures can have the same volume/thickness. Hence, our models may fail to capture subtle changes in neuroanatomy.

CONCLUSION
Our findings illustrate that detecting psychiatric disorders in children based on structural neuroimaging remains a significant challenge when generalization to large, ecologically valid, and heterogeneous samples is desired. At the same time, we showed that we can leverage comorbidities and interdependencies among symptoms to significantly improve prediction accuracy, although the absolute performance remains modest. To overcome common pitfalls, we argue that researchers should (i) study a heterogeneous sample, (ii) employ advanced machine learning techniques appropriate for the task at hand, and (iii) account for confounding effects due to sociodemographic factors. We analyzed a sample of 6916 children from the ABCD study (Karcher & Barch, 2021), which is the largest and most comprehensive study on psychiatric disorders in children to date. Next, we embraced the complexity of detecting psychiatric disorders by leveraging the capabilities of advanced machine learning models that are better suited for this task than traditional linear models. Finally, we ensured that our models were not significantly impacted by spurious correlations due to common sociodemographic factors by residualizing neuroanatomical measurements. We hope these strategies can form the basis for a push to advance our understanding of the etiopathophysiology of psychiatric disorders.

Key points
• Prediction of mental disorders based on neuroimaging showed promising results in adults.
• It is doubtful whether findings obtained on adults can be transferred to children.
• We carry out an explorative analysis to rigorously evaluate the potential of neuroimaging and machine learning to predict ten different psychiatric disorders in an ecologically valid sample of 6916 children from the Adolescent Brain Cognitive Development (ABCD) study.
• Our results indicate that attention deficit hyperactivity disorder (ADHD) and bipolar disorder (BD) can be predicted with statistical significance if one accounts for the complexities of mental disorders.
• While the overall low classification performance does not warrant application in the clinic, our results highlight that future studies should apply advanced machine learning models that are appropriate for the task at hand.

Figure 1. Flow diagram of participant selection.

Finally, we evaluated the impact of confounding factors by fitting LRC and GBM-CCE models on the original volume and thickness measurements without adjusting them for confounding factors via residualization (see dashed diamonds in Figure 5). The results demonstrate that ignoring confounding resulted in an inflated prediction performance in five out of 10 disorders for the GBM-CCE: MDD, BD, OCD, GAD, and SAD.

Table 1. Demographics of selected participants.

Figure 4. Comorbidities of studied mental disorders. (top) Studied mental health conditions with overall prevalence; (bottom) 14 most common patterns of disorders and their prevalence. Having no disorder at all was most common (63.7% of participants).

Figure 5. Violin plots of cross-validation results. For each disorder and both classifiers, the distribution of area under the receiver operating characteristic curve (AUROC) under the null hypothesis of "no real pattern has been discovered" (in gray) is contrasted with the AUROC value (diamond) on the original dataset. Dashed diamonds show AUROC values on unpermuted data with no adjustment by sociodemographic confounders (see Table S5 for a statistical comparison with original AUROC values). Dashed line at AUROC = 0.5 corresponds to a classifier with no discriminative ability. GBM-CCE, gradient boosting model classifier chain ensemble; LRC, logistic regression classifier.