Generalizability of machine learning for classification of schizophrenia based on resting‐state functional MRI data

Abstract Machine learning has increasingly been applied to classification of schizophrenia in neuroimaging research. However, direct replication studies and studies seeking to investigate generalizability are scarce. To address these issues, we used two independent resting‐state functional magnetic resonance imaging data sets collected from different sites and scanners to assess the within‐site and between‐site generalizability of a machine learning classification framework that achieved excellent performance in a previous study. We established within‐site generalizability of the classification framework in the main data set using cross‐validation. Then, we trained a model in the main data set and investigated between‐site generalization in the validated data set using external validation. Finally, recognizing the poor between‐site generalization performance, we updated the unsupervised algorithm to investigate whether transfer learning using additional unlabeled data was able to improve between‐site classification performance. Cross‐validation showed that the published classification procedure achieved an accuracy of 0.73 using majority voting across all selected components. External validation found a classification accuracy of 0.55 (not significant) and 0.70 (significant) using the direct and transfer learning procedures, respectively. The failure of direct generalization from one site to another demonstrates the limitation of within‐site cross‐validation and points toward the need to incorporate efforts to facilitate application of machine learning across multiple data sets. The improvement in performance with transfer learning highlights the importance of taking into account the properties of data when constructing predictive models across samples and sites. Our findings suggest that machine learning classification results based on a single study should be interpreted cautiously.


| INTRODUCTION
Schizophrenia is a serious mental disorder that imposes a significant burden on society around the world (Charlson, Baxter, Cheng, Shidhaye, & Whiteford, 2016). The clinical symptoms of schizophrenia are heterogeneous (Arango & Carpenter, 2011; Owen, 2014), and its diagnosis is still dependent on the subjective report from patients and assessment by clinicians (Frances, 1994).
With the aim of providing objective assessment to guide clinical practice, integrative psychobiological approaches including the Research Domain Criteria initiative (Kozak & Cuthbert, 2016) and the Hierarchical Taxonomy of Psychopathology (Kotov et al., 2017) suggest breaking the symptom-oriented approach down to different levels of neuroscientific analysis for mental disorders. In line with these new approaches, magnetic resonance imaging (MRI) research has demonstrated functional and structural brain alterations in patients with schizophrenia (Cheng, Newman, et al., 2015; Cheng, Palaniyappan, et al., 2015; Ellison-Wright & Bullmore, 2009; Haijma et al., 2013). Among the various MRI techniques, resting-state functional MRI (rsfMRI) has been widely applied and aberrant brain activity has been reported in schizophrenia patients (Cheng, Newman, et al., 2015; Cheng, Palaniyappan, et al., 2015; Tang et al., 2013; Xu & Lipsky, 2014).
However, results from traditional MRI studies are group-level data and have limited clinical applications. More recently, machine learning has been increasingly utilized to optimize the use of brain imaging data in clinical classification and build predictive models for individualized diagnosis for different psychiatric disorders (Arbabshirani, Plis, Sui, & Calhoun, 2017). Machine learning is defined as the process of enabling computers to acquire the ability to learn patterns of data without being explicitly programmed to do so (Samuel, 1959). Multivariate machine learning approaches are algorithms specialized in recognizing patterns from high-dimensional data like brain imaging data, and they can capture complex relationships between brain regions that univariate methods cannot (Davatzikos, 2004). The general machine learning framework with rsfMRI data for computational psychiatry includes: (a) preprocessing the imaging data; (b) separating the training data and the testing data completely; (c) extracting and selecting features from the training data; (d) training the predictive model; and (e) generalizing the predictive model to the testing data. In this framework, assessing generalizability is one of the most important steps in evaluating predictive models because it simulates the real-world context. Generalizability is assessed in two distinct settings: within-site and between-site. Two main strategies, internal validation (or cross-validation) and external validation, have been used to assess generalizability. Within-site generalizability is typically established using cross-validation. In this setting, a data set collected from a single site is repeatedly split into independent training and test data sets, and the performance of the model is assessed in the test set for each split. In contrast, to assess between-site generalizability, the predictive model is trained on a data set from one site and then applied to an independent data set collected at a separate site using external validation. Using internal validation is practical considering the difficulty in collecting data from different sites, but it could lead to an overestimation of performance due to overfitting of the predictive model to one specific data set, compared with external validation which considers generalizability across different data sets (Woo, Chang, Lindquist, & Wager, 2017). It is therefore important to assess generalizability using both internal and external validation methods.
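The difference between the two validation strategies can be illustrated with a minimal sketch. It is not the procedure evaluated in this study: the nearest-centroid classifier, the simulated two-feature data, and the names `fit_centroids`, `predict`, `loo_accuracy`, `external_accuracy`, and `simulate_site` are all hypothetical stand-ins chosen so that the example runs with the Python standard library alone.

```python
import random

def fit_centroids(X, y):
    """Return the mean feature vector of each class (a stand-in classifier)."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))

def loo_accuracy(X, y):
    """Internal validation: hold out each participant once and score
    the model on the held-out participant only."""
    hits = 0
    for i in range(len(X)):
        model = fit_centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict(model, X[i]) == y[i]
    return hits / len(X)

def external_accuracy(X_train, y_train, X_test, y_test):
    """External validation: train on one site, score on another site."""
    model = fit_centroids(X_train, y_train)
    hits = sum(predict(model, x) == t for x, t in zip(X_test, y_test))
    return hits / len(X_test)

def simulate_site(n_per_class, shift, rng):
    """Toy data: class 1 is offset by +1; `shift` mimics a site effect."""
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(label + shift, 1.0),
                      rng.gauss(label + shift, 1.0)])
            y.append(label)
    return X, y

rng = random.Random(0)
X_main, y_main = simulate_site(30, 0.0, rng)  # simulated "main" site
X_val, y_val = simulate_site(30, 1.5, rng)    # simulated second site, shifted
acc_within = loo_accuracy(X_main, y_main)
acc_between = external_accuracy(X_main, y_main, X_val, y_val)
print(acc_within, acc_between)
```

Because the simulated second site is shifted relative to the first, the between-site estimate in this toy setting tends to fall below the within-site estimate, which is the kind of gap that internal validation alone cannot reveal.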
A number of promising studies have successfully distinguished patients with schizophrenia from healthy controls based on rsfMRI data using machine learning approaches. As shown in Table 1, a growing number of studies have utilized machine learning as a tool to analyze rsfMRI data, investigate the underlying neural mechanisms, recognize specific brain patterns, and distinguish patients with schizophrenia from healthy controls at the individual level with accuracies ranging from 65 to 95% (Arbabshirani, Kiehl, Pearlson, & Calhoun, 2013; Cao et al., 2018; Cheng, Newman, et al., 2015; Cheng, Palaniyappan, et al., 2015; Du et al., 2012; Venkataraman et al., 2012). However, most of these studies have only assessed generalizability using internal validation methods. While two studies have used external validation to assess the generalizability of their classification methods, they did so in independent rsfMRI data sets (Cui, Liu, Song, et al., 2018; Cui, Liu, Wang, et al., 2018; Skatun et al., 2017) and no study has independently replicated the machine learning procedure from a previous study.
In this study, the overall aim was to assess, using both internal and external validation methods, the within-site and between-site generalizability of a previously published machine learning framework for rsfMRI data that has shown promising performance. Among the various previous studies, Du et al. (2012) have reported an excellent classification accuracy of 0.93 using rsfMRI data and 0.98 using fMRI data in identifying schizophrenia patients from healthy controls based on their own machine learning procedure. Therefore, we examined the generalizability of this machine learning procedure in the present study. We first investigated the generalizability of the classification procedure in a main data set from a single site (internal validation) following the exact steps from Du et al.'s study. Then, we assessed the between-site generalizability to a completely independent data set (validated data set) from a different site (external validation) to test whether factors from different sites, such as scanning setting and procedure, would influence generalizability. To further explore the generalizability of the procedure rather than the algorithm itself, we updated the unsupervised part of the algorithm to investigate the degree to which the performance of between-site generalizability could be improved in the new setting. Importantly, the labels (schizophrenia or healthy control) of the testing data set were not used when training the predictive model in the last step, which served to determine the between-site generalizability of the procedure from independent data sets. Given the excellent performance of the procedure reported by Du et al. (2012), we hypothesized that this machine learning procedure could discriminate patients with schizophrenia from healthy controls with promising classification performance and good within-site generalizability. Due to the effect of site on rsfMRI data (Dansereau et al., 2017), we further hypothesized that the between-site generalizability of the procedure would be compromised. Moreover, because some site effects could be taken into account when updating the unsupervised part of the procedure, we further hypothesized that the between-site generalizability would improve in the new setting.

| Participants
Two data sets of patients with schizophrenia and healthy controls were included in this study. In the main data set, 51 patients with schizophrenia and 51 healthy controls were recruited (  (Frances, 1994). The Positive and Negative Syndrome Scale (PANSS) (Kay, Fiszbein, & Opler, 1987) was used to assess the severity of schizophrenia symptoms. Both the clinical assessment and the diagnostic interviews were conducted by experienced psychiatrists. All patients were taking antipsychotic medications. Healthy controls were recruited from the local community via advertisements. They had no personal or family history of mental disorders. Participants with neurological disorders, substance abuse and/or dependence, or head injuries were excluded. The short form of the Chinese version of the Wechsler Adult Intelligence Scale (Gong, 1992) was used to estimate the IQ of all participants.
In addition, independent-samples t tests were conducted to compare age, length of education and estimated IQ between the two healthy control groups (Table 3), and the same tests were used to compare onset age, duration of illness and severity of psychotic symptoms between the two patient groups. Chi-square tests were also used to compare the gender proportion between the two healthy control groups and between the two clinical groups. Finally, a five-factor model of the PANSS (Anderson et al., 2017; Lindenmayer, Grochowski, & Hyman, 1995) comprising negative, positive, disorganized, excited, and anxiety symptom domains was computed and compared between the two clinical groups (Table 3).
The study was approved by the Ethics Committee of the Institute of Psychology, the Chinese Academy of Sciences. All participants gave written informed consent.

| Image acquisition
For the main data set, all participants were scanned in a 3-T Siemens  While participants were in the scanner, they were asked to remain as stationary as possible and their heads were stabilized with foam pads.  (Patel et al., 2014), high-pass filtering and removal of low-frequency drifts and motion by a 24-parameter autoregressive model using realignment regressors and scrubbing (using a 1 mm relative movement threshold and a 1% DVARS threshold (Power et al., 2012)) (Friston, Williams, Howard, Frackowiak, & Turner, 1996). Subsequently, spatial normalization with DARTEL was adopted to normalize the functional images into Montreal Neurological Institute (MNI) space (Ashburner, 2009) in SPM12. Finally, an 8-mm full-width-at-half-maximum isotropic Gaussian kernel was used to smooth the functional images (Table 5).

| Machine learning analysis
After preprocessing, further machine learning analysis was conducted (see Figure 1 for the analysis flowchart). To assess generalizability by internal validation, we followed the procedures from Du et al. (2012) in the main data set. To assess between-site generalizability by external validation, the predictive model built in the main data set was directly applied to the validated data set. To establish transfer learning and to explore factors influencing between-site generalizability, the unsupervised group independent component analysis (ICA) step was updated based on the two data sets and the between-site generalizability was estimated again.
In general, the machine learning framework consisted of the following

| Assessing within-site generalizability
In feature extraction and selection, spatial group ICA was first run based on the entire main data set. Data for all participants were concatenated across time. Then, PCA was used to reduce the data into 40 principal components at the participant level. Afterward, ICA was performed to extract 30 independent spatial components based on the resulting data from the PCA at the group level. The analysis of spatial group ICA employed Infomax ICA (Bell & Sejnowski, 1995) in the GIFT software package for MATLAB (GIFT, 2011). Second, 16 spatial components were selected manually according to the exclusion criteria below:
1. Components originating from artifacts, motion, and respiratory/cardiac cycles, which were identified by considering how much regions known to be associated with these signals were represented in the components. This included edges (motion, respiration), major arteries or veins, the circle of Willis, and the sagittal sinus (cardiac cycle) as well as regions in the vicinity of field inhomogeneity (movement by field inhomogeneity interaction effects).
3. Components that showed low and widespread activation.

| Assessing between-site generalizability
To assess between-site generalizability by external validation, the template of spatial components which was used to select features in assessing within-site generalizability was applied to the validated set of rsfMRI data. The spatial components for each participant in the validated data set were obtained through spatial-temporal reconstruction based on the template which was built from the main data set. In this setting, determination of the threshold t0, kernel PCA and training of the LDA classifier considered only the main data set (training set), and hence the identified classification model was directly applied to the validated data set without any adaptation. Apart from changing the LOO cross-validation to cross-sample validation, all classification parameters and procedures remained identical.
To further explore factors influencing between-site generalizability, we updated the ICA step to consider data from both sites. To achieve this, we identified 30 spatially independent components across both data sets using group ICA and established correspondence to the template components from the within-site generalization procedure using spatial correlation. The matching procedure involved calculating the pairwise spatial correlation between all 30 estimated components and the 16 template components. Then, pairs of components that exhibited the highest absolute correlation were sequentially matched to identify all 16 components in the joint data set. After this matching procedure, the components were visually inspected to confirm that they represented the same brain regions as the template. Importantly, in this setting, the validated data set was only considered in the training of the group ICA model and hence no label information from the validated data set was used in feature extraction or fitting of the classification model. Apart from the different templates of spatial components, the classification procedure was identical to the within-site generalization setting. The significance of the classification performance was assessed using a random permutation test in all settings. An empirical null distribution was obtained by applying an identical classification procedure to the data with permuted class labels.
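The sequential matching of estimated components to template components can be sketched as a greedy assignment on absolute spatial correlation. This is an illustrative reading of the description above, not the code used in the study: the function names `pearson` and `match_components` are hypothetical, spatial maps are represented as flat lists of voxel values, and the six-voxel toy maps are invented for the example.

```python
import math

def pearson(a, b):
    """Pearson correlation between two spatial maps given as flat lists.
    Assumes neither map is constant (otherwise the denominator is zero)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def match_components(templates, estimated):
    """Greedily pair each template map with one estimated map.

    All template/estimated pairs are ranked by absolute correlation, and the
    best remaining pair is accepted until every template has a partner.
    Returns {template_index: (estimated_index, signed_correlation)}.
    """
    pairs = []
    for t, t_map in enumerate(templates):
        for e, e_map in enumerate(estimated):
            r = pearson(t_map, e_map)
            pairs.append((abs(r), r, t, e))
    pairs.sort(key=lambda p: p[0], reverse=True)

    matches, used = {}, set()
    for _, r, t, e in pairs:
        if t not in matches and e not in used:
            matches[t] = (e, r)
            used.add(e)
    return matches

# Toy example: 2 template maps, 3 estimated maps over 6 "voxels".
templates = [[1, 0, 0, 2, 0, 0], [0, 0, 3, 0, 1, 0]]
estimated = [
    [0, 1, 0, 0, 0, 1],          # unrelated map
    [0, 0, -3, 0, -1, 0.1],      # sign-flipped version of template 1
    [1.1, 0, 0, 1.9, 0.1, 0],    # noisy version of template 0
]
matches = match_components(templates, estimated)
print(matches)
```

Using the absolute value matters because the sign of an independent component is arbitrary; a matched pair with a negative correlation would need its map flipped before use, and visual inspection, as described above, remains the final check.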

| Group ICA on the main rsfMRI data set
Based on the aforementioned criteria, 16 components were selected to classify patients with schizophrenia from healthy controls in the analysis. The selected components are shown in Figure 2. The name, spatial location, peak MNI region, and peak MNI coordinates for each component are shown in Table 5.

| Within-site classification performance
For classification based on individual features, the performances including accuracy, sensitivity, and specificity of each component are shown in Table 5. We found that the basal ganglia (the fifth feature)

| Between-site classification performance
Between-site generalizability was assessed using the template of spatial components from the main data set. We found that accuracy generally decreased to chance level for classification from each selected component (see Table 4 and Figure 3). When all 16 features were combined by majority voting, the performance was worse than the within-site performance evaluated by internal validation, with an accuracy of 0.550, a sensitivity of 0.394 and a specificity of 0.741. A random permutation test showed that the accuracy failed to reach statistical significance, with a p value of .183.
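For readers less familiar with these quantities, the following minimal sketch shows how majority voting, the three performance measures, and a permutation p value can be computed. Several choices are assumptions rather than details taken from this study: patients are coded 1 and controls 0, tied votes are assigned to the control class, the p value uses the common (count + 1) / (permutations + 1) convention, and the function names and toy data are hypothetical. The toy `evaluate` callable only rescores fixed predictions, which is a simplification; in the analysis described here the identical classification procedure was rerun on the permuted labels.

```python
import random

def majority_vote(votes_per_component):
    """Combine per-component predictions (lists of 0/1, one list per
    component) into one label per participant. Ties go to 0 (assumption)."""
    n_components = len(votes_per_component)
    return [1 if 2 * sum(votes) > n_components else 0
            for votes in zip(*votes_per_component)]

def scores(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity), with patients coded 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return (tp + tn) / len(y_true), tp / n_pos, tn / n_neg

def permutation_p_value(evaluate, labels, n_perm, rng):
    """Empirical p value: share of label permutations whose accuracy is at
    least as high as the accuracy obtained with the true labels."""
    observed = evaluate(labels)
    shuffled = list(labels)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if evaluate(shuffled) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)

# Toy example: 10 patients, 10 controls, three component-level classifiers.
y_true = [1] * 10 + [0] * 10
votes = [list(y_true), list(y_true), [1 - v for v in y_true]]  # third is always wrong
y_pred = majority_vote(votes)
acc, sens, spec = scores(y_true, y_pred)
p = permutation_p_value(lambda labels: scores(labels, y_pred)[0],
                        y_true, 200, random.Random(0))
print(acc, sens, spec, p)
```

With an even number of components, such as the 16 used here, tied votes are possible, so the tie-breaking rule is a choice that a full implementation would need to state explicitly.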
For the between-site generalizability by external validation using ICA across both data sets, we found that the performance was similar to the within-site classification performance for each selected component (see Table 5 and Figure 3). When all 16 features were combined by majority voting, the performance remained similar to the first classification, with an accuracy of 0.700, a sensitivity of 0.636 and a spec-
FIGURE 3 Accuracy, sensitivity, and specificity based on individual features. Internal validation was conducted within the main data set, external validation 1 was conducted by directly applying the classification procedure from the main data set to the validated data set, and external validation 2 was conducted using an updated group ICA across both data sets but with all other steps being identical
In this study, we first followed a machine learning procedure developed by Du et al. (2012) using rsfMRI data from a sample of schizophrenia patients and healthy controls to assess the within-site generalizability and the reproducibility of the machine learning procedure. The classification followed the same feature extraction, feature selection, and LOO validation steps as Du et al. (2012).  (Anderson & Cohen, 2013; Bassett et al., 2012; Fan et al., 2011; Su et al., 2013; Venkataraman et al., 2012), our results support the reproducibility of the machine learning procedure which can successfully discriminate patients with schizophrenia from healthy controls. preprocessing was performed to reduce noise and limit the effect of artifacts on the analysis (Churchill et al., 2011), these confounding factors could still persist and could affect the classification performance.

| Implication of between-site generalizability
When assessing between-site generalizability in two completely independent data sets from two sites, we obtained a nonsignificant accuracy of 0.550 and a significant accuracy of 0.700 based on all selected features when applying a template of spatial components built based on only the main data set and both data sets, respectively. It is important to note that although the validated data set was utilized with the main data set for generating a new template of spatial components through spatial group ICA, the labels of the validated data set were still kept unseen when training the model to ensure that the training and testing data sets were independent.
More importantly, the findings that the predictive model did not successfully generalize to the novel data set when applying the template only based on the training data set but successfully generalized to the independent testing data set when applying the template based on both data sets substantially extend our understanding of the generalizability of machine learning using rsfMRI data. Failure to generalize does not necessarily mean that a model is invalid, since multiple factors are different between independent data sets (Scheinost et al., 2019). As indicated in Table 3, the schizophrenia patients in the main data set were significantly older, had significantly higher estimated IQ, longer illness duration and higher PANSS positive subscale score, but significantly lower PANSS total, negative and general subscale scores compared with patients in the validated data set. In addition, the healthy controls in the main data set had significantly higher estimated IQ than the healthy controls in the validated data set. Such heterogeneity between the two data sets may be one potential factor contributing to the failure in direct generalization. In addition, the two data sets were acquired independently from different scanners with different acquisition parameters, which could further compromise generalizability. The within the same data set and across different data sets and also found that the classification accuracy across different data sets was lower than within the same data set. However, the difference in accuracy in their study was smaller than in ours. This could be due to differences in participant characteristics. In their study, patients with untreated first-episode schizophrenia were recruited and the demographics were very similar between the two samples; whereas in our study, we recruited patients with chronic schizophrenia and the demographics were different between the two samples.
Taken together, these findings suggest that future research to construct machine learning models should take into account illness heterogeneity.

| Implication for schizophrenia research
The classification performance based on individual features in this study revealed that the striatum yielded the highest accuracy. Spatial components including the lateral occipital cortex, the fusiform gyrus, the temporal lobe, the middle cingulate gyrus, the DMN, and the precuneus could also distinguish patients with schizophrenia from healthy controls with acceptable accuracy. These findings are supported by previous studies using machine learning methods in schizophrenia (Du et al., 2012; Fan et al., 2011; Tang et al., 2012; Yu et al., 2013). Our findings suggest that the striatum may play a key role in schizophrenia. This is consistent with results from previous studies which demonstrated alteration in striatal volume (Chakravarty et al., 2015) and white matter connectivity (James et al., 2016). For the DMN, studies by Fan et al. (2011) and Du et al. (2012) also suggest that it is one of the most informative brain regions for the diagnosis of schizophrenia. Moreover, a recent study found that the DMN interacted with the central executive network and the salience network in schizophrenia patients who smoke, indicating the potential role of the DMN in the symptomatology of the disorder (Liao et al., 2018). At the same time, the structural and functional alterations in the DMN in schizophrenia patients have been shown to be related to impairment in working memory and attention (Garrity et al., 2007; Hu et al., 2017; Pomarol-Clotet et al., 2008; Salgado-Pineda et al., 2011; Whitfield-Gabrieli & Ford, 2012). Previous studies have also suggested that the fusiform gyrus and the temporal lobe may be discriminative features for classification (Tang et al., 2012). In anatomical studies, reduced gray matter volume at the superior temporal gyrus has been reported in schizophrenia patients (Ohi et al., 2016; Sun, Maller, Guo, & Fitzgerald, 2009), which may also be related to hallucinations (Cui, Liu, Song, et al., 2018; Cui, Liu, Wang, et al., 2018).

| Limitation
This study has several limitations. First, the sample size in the present study was small. Larger data sets are needed to avoid overfitting and to build a classifier with better generalizability. Second, only one modality of data (rsfMRI) was utilized, even though both functional and structural brain information may be important for high-accuracy classification using machine learning (Mikolas et al., 2017; Orban et al., 2018; Ota et al., 2012). Third, we did not collect the smoking status of our participants in this study. Considering the high rate of smoking in schizophrenia patients (Liao et al., 2018), which may influence brain activity (Potvin et al., 2016; Tanabe, Tregellas, Martin, & Freedman, 2006), smoking status may be a confounding factor that affects classification performance. Fourth, since our findings suggest the importance of taking scanning setting and characteristics of participants into account when collecting data from different sites, future studies should examine how these two sets of factors affect between-site generalizability.

| CONCLUSION
In this study, we found that the machine learning procedure developed by Du et al. (2012) could successfully classify patients with schizophrenia from healthy controls based on rsfMRI data in internal validation but not external validation. Moreover, we found that a transfer learning procedure based on unsupervised learning was able to improve between-site generalizability and may eventually contribute to the incorporation of machine learning approaches into clinical practice.