Comparing machine and deep learning‐based algorithms for prediction of clinical improvement in psychosis with functional magnetic resonance imaging

Abstract Previous work using logistic regression suggests that cognitive control‐related frontoparietal activation in early psychosis can predict symptomatic improvement after 1 year of coordinated specialty care with 66% accuracy. Here, we evaluated the ability of six machine learning (ML) algorithms and deep learning (DL) to predict “Improver” status (>20% improvement on Brief Psychiatric Rating Scale [BPRS] total score at 1‐year follow‐up vs. baseline) and continuous change in BPRS score using the same functional magnetic resonance imaging‐based features (frontoparietal activations during the AX‐continuous performance task) in the same sample (individuals with either schizophrenia (n = 65, 49M/16F, mean age 20.8 years) or Type I bipolar disorder (n = 17, 9M/8F, mean age 21.6 years)). 138 healthy controls were included as a reference group. “Shallow” ML methods included Naive Bayes, support vector machine, K Star, AdaBoost, J48 decision tree, and random forest. DL included an explainable artificial intelligence (XAI) procedure for understanding results. The best overall performances (70% accuracy for the binary outcome and root mean square error = 9.47 for the continuous outcome) were achieved using DL. XAI revealed left DLPFC activation was the strongest feature used to make binary classification decisions, with a classification activation threshold (adjusted beta = .017) intermediate to the healthy control mean (adjusted beta = .15, 95% CI = −0.02 to 0.31) and patient mean (adjusted beta = −.13, 95% CI = −0.37 to 0.11). Our results suggest DL is more powerful than shallow ML methods for predicting symptomatic improvement. The left DLPFC may be a functional target for future biomarker development as its activation was particularly important for predicting improvement.

to have good treatment responses would help identify individuals who would benefit from the early introduction of alternative and/or supplemental interventions.
To that end, we have previously demonstrated that functional magnetic resonance imaging (fMRI)-based measures of cognitive control-related brain activity collected during initial engagement in treatment predict symptomatic "improvement" (defined as showing at least 20% improvement in total Brief Psychiatric Rating Scale [BPRS] score after 1 year) in early psychosis patients after 1 year with 66% accuracy using logistic regression (Smucny, Lesh, & Carter, 2019).
Although this finding was an important step toward developing a neuroimaging-based prognostic regimen, to be clinically useful a predictive battery should display at least 80% accuracy.
As logistic regression classifies based on a logarithmic function, its accuracy is limited by the fit of that function to the data. Furthermore, logistic regression-based classification is only able to utilize the features/ independent variables/descriptors provided. Machine learning (ML) classifiers utilize more complex functions that may improve performance, although most of these classifiers are "shallow" and therefore again limited to using the features provided. Deep (machine) learning (DL) classifiers, however, have more compositional capabilities to create new features from the features provided by utilizing a layered network, potentially allowing for more accurate representation of the data and thus more precise mapping (Durstewitz, Koppe, & Meyer-Lindenberg,-2019;Montufar, Pascanu, Cho, & Bengio, 2014). Consider a simple illustrative example with two input features-brain activations from "Region 1" and "Region 2." Shallow ML methods will find a classification function solely using activation values from these regions as parameters. DL methods, on the other hand, will try many different combinations of them and learn which combination is the most predictive, in a process referred to as end-to-end learning. Furthermore, explainable artificial intelligence (XAI) methods have been recently developed (Davidson, Gourru, & Ravi, 2018;Sambaturu et al., 2020) that can extract which feature(s) most strongly influenced the deep learner's classification decisions, helping to demystify the "black box" problem inherent in ML methods. As DL is combinatorial, these "rules" for classification may be complex. In the example with two brain regions, XAI may find rules such as (A) "If activation >0.5 in Region 1 classify as B," (B) "If activation <0.5 in Region 1 and >0.5 in Region 2, classify as "B," (C) otherwise classify as "A." Here, we evaluate the performance of six commonly used ML methods as well as a deep learner in predicting symptomatic improvement in early psychosis individuals after 1 year of treatment. The shallow ML methods vary in their complexity and approach. We were particularly interested in determining if the ability of DL to combine features into more complex predictors may show performance enhancement versus shallow architectures and our previous logistic regression-based finding. In order to compare classification accuracies, therefore, here we analyzed the same dataset and frontoparietal features that were used in our previous study using logistic regression. In alignment with previous findings (Krizhevsky, Sutskever, & Hinton, 2012;Le et al., 2012), we hypothesized the best performance would be observed using the deep learner.

| Sample
As described previously (Smucny et al., 2019), complete (baseline and follow-up) neuroimaging AX-CPT data were available for 171 individuals (139 with schizophrenia [SZ], 32 with Type I bipolar disorder [BD] with psychotic features). Of this sample, follow-up clinical data were available for 82 individuals (65 SZ, 17 BD). 138 healthy controls (HCs) were included to verify the task was activating expected frontoparietal regions (see Section 2.4) and as a reference group. Neuroimaging AX-CPT data from the 82 patients with complete (baseline and follow-up) datasets have been used in previous studies as follows: (Lesh et al., 2013)-53 controls and 18 patients, (Lesh et al., 2015)-34 HC and 20 psychosis, (Niendam et al., 2014)-23 HC and 11 psychosis, (Smucny et al., 2018)-52 HC and 43 psychosis, (Yoon et al., 2008) (First, Spitzer, Gibbon, & Williams, 2002) was used for diagnosis of psychopathology. Diagnoses were confirmed by a group of trained clinicians during case conferences. All patients reported psychosis onset within 2 years of the date of informed consent.
Patients were excluded for a diagnosis of major medical or neurological illness, head trauma, substance abuse in the previous 3 months (as well as a positive urinalysis on the day of scanning), Weschler Abbreviated Scale of Intelligence-2 score (Weschler, 1999) score < 70, and magnetic resonance imaging (MRI) exclusion criteria (e.g., claustrophobia, metal in the body). Control participants were excluded for all of the above as well as a history of Axis I mental illness or first-degree family history of psychosis. All participants provided written informed consent and were compensated for participation. The UCD Institutional Review Board approved the study. Medication regimen (type and dose) was assessed by clinical records at baseline and follow-up. Medication compliance was based on self-report. Medicated patients at follow-up all self-reported at least medium compliance with antipsychotic medication during the treatment period (except for two SZ individuals who were missing compliance data at follow-up). Symptoms were assessed using the 24-point BPRS (Ventura et al., 1993) rescaled to a lowest score of zero (i.e., score of 24 = score of 0). At baseline, all patients had BPRS scores > = 5 to ensure sufficient resolution to detect a 20% improvement in score at follow-up.

| Task description
The AX-CPT and associated task parameters have been described in detail elsewhere (Braver, Paxton, Locke, & Barch, 2009;Cohen, Barch, Carter, & Servan-Schreiber, 1999;Henderson et al., 2012;Lesh et al., 2013;Phillips, Salo, & Carter, 2015). Briefly, participants are presented with a series of cues and probes and are instructed to make a target response (pressing a button with the index finger) to the probe letter "X" only if it was preceded by the cue letter "A." All cues and nontarget probes require nontarget responses (pressing a button with the middle finger). Target sequence trials (i.e., "AX" trials) are frequent (60-70% occurrence) and set up a prepotent tendency to make a target response when the probe letter X occurs. As a result, a nontarget sequence trial in which any non-A cue (collectively called "B" cues) is presented and followed by a probe letter X (i.e., "BX" trials) requires proactive cognitive control (e.g., maintenance of the inhibitory rule over the delay time) (Braver et al., 2009). Consistent with prior work (Henderson et al., 2012), individual subject data was only included in analyses if results suggested the subject understood the AX-CPT (specifically, accuracy greater than 44% on AX trials and 50% on BY trials at both baseline and follow-up). Participants were combined across two task protocols collected from two MRI scanners.
Parameters for each protocol (AX-1 and AX-2) are provided in Supple- mentary Table 1a. The task was presented using EPrime2 software (Psychology Software Tools, Inc.). The behavioral index of proactive cognitive control was d-prime context, a function of AX hits minus BX false alarms (Cohen et al., 1999).

| fMRI scanning parameters and preprocessing
Functional images were acquired with a gradient-echo T2* blood oxygenation level-dependent (BOLD) contrast technique as outlined in Supplementary Table 1b. AX-1 was performed in a 1.5 T scanner (GE Healthcare), and AX-2 in a 3.0 T scanner (Siemens). fMRI data were preprocessed using SPM8 (Wellcome Department of Imaging Neuroscience, London). Briefly, images were slicetiming corrected, realigned, normalized to the Montreal Neurological Institute template using a rigid-body transformation followed by nonlinear warping, and smoothed with an 8 mm full-width-halfmaximum Gaussian kernel. All individual fMRI runs had less than 4 mm of translational within-run movement, 3 of rotational withinrun movement, and 0.45 mm of average framewise displacement (calculated using the fsl_motion_outliers tool) (https://fsl.fmrib.ox.ac.uk/ fsl/fslwiki/FSLMotionOutliers). Mean displacement did not differ between Improvers and non-Improvers (t = 1.42, p = .16). All participants had at least two fMRI runs surviving these criteria.
Preprocessing pipelines were identical for AX-1 and 2.

| fMRI analysis and prespecified ROI selection
First-level effects were modeled with a double-gamma function with temporal derivatives using the general linear model in SPM8.
Rigid-body motion parameters were included as single-subject regressors in order to partially account for movement effects. B > A cue (correct trials only) contrast images (parameter estimates) were generated for each subject. The B > A cue contrast measures response under conditions of high versus low proactive cognitive control (Lesh et al., 2013;Lesh et al., 2015). All trial types were modeled (AX/AY/ BX/BY) and only correct responses were used to create first-level images, consistent with previous studies (Lesh et al., 2013;Lesh et al., 2015). Whole-brain analyses across the final sample (HCs and patients with follow-up data) using the B > A contrast were used to confirm significant (height threshold p < .001, cluster threshold p < .05 [whole brain FDR-corrected]) activation in expected brain regions (bilateral DLPFC/SPC) for both protocol versions (AX-1 and AX-2). For ML using first-level images and as extracted in our previous logistic regression study (Smucny et al., 2019), BOLD response was extracted from prespecified bilateral, 5 mm radius spherical DLPFC and SPC ROIs. Although this size was chosen arbitrarily, previous work from our group suggests varying ROI radius between 4 and 8 mm does not substantially affect AX-CPT task-associated response patterns in psychosis (Smucny et al., 2018). The DLPFC ROI was taken from a previous study from an independent dataset (MacDonald III, Cohen, Stenger, & Carter, 2000). The SPC ROIs was taken from a meta-analysis of executive function in SZ (Minzenberg, Laird, Thelen, Carter, & Glahn, 2009).
Mean task-associated activation from these ROIs was extracted using the Marsbar toolbox (Brett, Anton, Valabregue, & Poline, 2002). Activations were adjusted for differences in protocol version prior to further analysis by calculating standardized residuals from the linear regression of protocol version by each measure. The adjusted values from each of the four ROIs were then utilized as input features (i.e., four total inputs) for ML/DL. Unlike our previous study, AX-CPT performance was not used as a predictor as our previous logistic regression study found no association between behavior and outcome. As a supplementary analysis, we also extracted voxelwise data within these ROIs for use as features, adjusted for protocol version as described above.

| Machine learning
"Shallow" and DL algorithms were trained to predict "Improver" (vs. non-Improver) status, defined as >20% decrease in Total BPRS score from baseline rescaled to a lowest score of zero (Howes et al., 2017), as well as change in Total BPRS score.
Shallow algorithms were employed using Weka software (University of Waikato, New Zealand) and included Naive Bayes, support vector machine (SVM), K Star (K *), AdaBoost, J48 decision tree, and random forest. Classifier accuracies were calculated by averaging performance across 1,000 random assortments of 90% training data and 10% test data for each algorithm.

| Naive Bayes
Naive Bayes (Maron, 1961) compares the probability of observing an Improver or non-Improver for each test data point according to the where P(C i ) is the prior probability of class C i (e.g., Improver) occurring, p (x j C i ) is the conditional probability that class C i is associated with feature observation x, and p (x) is the marginal probability that observation x is observed (effectively constant for any given dataset). The joint model (combining all features) can then be expressed as the product of the probabilities for all features, and the algorithm classifies unseen data as Improver or non-Improver based on the highest probability.

| Support vector machine
SVM classifiers find the maximum-margin hyperplane using only those data instances closest to the separation boundary (i.e., "support vectors") to determine classification boundaries (Burges, 1998). Both linear and nonlinear (using a kernel) classifications can be performed.
Polykernel SVM classifiers were evaluated starting with an exponent of one and increasing in size until average accuracy (over all 1,000 allocations of test/training data) plateaued.

| K star (K-nearest neighbor)
The K* algorithm operates by assigning new data instances to the class that occurs most frequently amongst the k-nearest data points, y j , where j = 1,2…k (Hart, 1968). Distance is then used to retrieve the most similar instances from the data set. The K* function is operationalized as K* (y i ,x) = −ln p*(y i ,x), where p* is the probability of all transformational paths from instance x to y, that is, the probability x will arrive at y via a random walk in feature space.

| AdaBoost
AdaBoost operates by creating multiple weak classifiers that are weighed by their effectiveness at classifying data (Viola & Jones, 2001). Initially, a classifier is created with all instances weighted equally. Next, the weight of the incorrectly predicted instances is increased. The instances that are still misclassified are then selected and their weights increased as well, and so forth. After the complete classifier is constructed, each weak classifier then casts a weighted "vote" as to the class membership of each set of individual test data to make a classification decision.

| J48 decision tree
Decision tree classifiers operate hierarchically, with each level representing a feature (e.g., left DLPFC activation) (Alpaydin, 2004;Quinlan, 1993). Based on the value of that feature the tree either classifies immediately or passes the information to the next level of the tree. The C4.5 algorithm (Quinlan, 1993) was used for the J48 decision tree, which uses a measure called "information gain" to select each attribute at each stage. In essence, the J48 tree first chooses the feature that most effectively splits the training data into one class or another using a measure called "information gain" (essentially, the effectiveness of feature at classifying data). After this split, the tree then chooses the next most effective feature to split each resulting partition. The process then iteratively repeats until all training data is classified. Performance of the resulting tree is then evaluated on test data.

| Random forest
A random forest is a group of decision trees made up of random partitions of training data (Breiman, 2001). Each tree casts a "vote" as to the classification of a testing instance and votes are counted to produce the final classification.

| Deep learning
DL was performed using a multi-layer perceptron network trained using the ADAM optimizer in Pytorch (pytorch.org) to construct an architecture with parameters (e.g., weights) for optimal classification accuracy. The network consisted of multiple hidden layers (8-32-16-8) for the four-ROI classifier and (252-128-64-8) for the within-ROI voxelwise classifier with a single neuron as the output layer.
After the network was learned, XAI methods were used to extract rules the deep learner used for classification (Davidson et al., 2018). These rules were in disjunctive normal form in that they consisted of alternative (disjunctions) of a combination (conjunction) of input neurons whose activations cooccur. This procedure was only performed for binary classifiers. Devising rules for continuous classifiers would require dividing the sample into quantiles. This procedure is likely to be underpowered given the already limited sample size of the study. Figure 1 illustrates how these explanations were generated. The training data were divided into Improvers or non-Improvers (+,−).
Each training instance x activated a subset of neurons which can be considered a simple binary vector (row). The collection of all neuron activations for each type of instance (+,−) was then simply two tables of input and hidden layer neuron activations X + and X − with the number of rows being the number of instances of each type and the number of columns being the number of neurons. Hence, the entry (i, j) of X + was set to 1 if and only if Improver instance i activated node j.
Using X + and X − we set up a set-style cover problem to find the subset of hidden nodes (A + ) that activated for X + but that did not activate for X − . Conversely, we found (A − ) which was a subset of hidden nodes that activated for X − but not for X + . Importantly, A − and A + must not overlap on any hidden units. These two subsets were found using an integer linear programming optimization formulation as described in Davidson et al. (2018) and Sambaturu et al. (2020). Notably, XAI only outputs the most predictive rules (and not every rule), and therefore XAI outputs themselves may not perform as well as the entire deep learner.

| Performance comparison
Accuracies were qualitatively examined by calculating mean accuracies (for the binary outcome) or mean root mean square errors (RMSEs, for the continuous outcome) and 95% confidence intervals for performance across the 1,000 random assortments of 90% training data and 10% test data for each algorithm. For the binary outcome, McNemar's test was also performed to determine if the proportion of average correct vs. incorrect predictions (i.e., average accuracies) were statistically significantly different (two-tailed p < .05) between the DL algorithm and each other algorithm. As we were primarily interested in performance of the DL algorithm, McNemar's tests were only performed comparing the DL to each shallow learner or to the pooled (average) performance of all shallow learners. Mean BPRS score at follow-up for the psychosis group was 37.3 (SD = 9.0). Then, 47% of BD and 60% of SZ participants showed greater than 20% decrease in total BPRS score (scaled to a lowest value of zero) at follow-up and were classified as "Improvers". Mean improvement in BPRS score for Improvers was 12.7 (SD = 7.3), corresponding to a 59% decrease.

| Binary classifiers
Accuracy data for ML algorithms for the four ROI feature and voxelwise ROI feature classifiers are presented in Table 3. McNemar's test results are presented in Table 4. For SVM, accuracy peaked at an exponent of 3 for the polynomial kernel; accuracy using this kernel are hereafter reported. Qualitatively, the best overall performance was achieved using DL for both the four ROI feature and voxelwise ROI feature classifiers. Accuracies for the four-ROI feature classifiers were only statistically significantly improved for DL versus the K* and Random Forest learners. Accuracies for the voxelwise ROI feature classifiers were significantly improved for DL vs. the majority of other algorithms as well as the pooled average of all shallow learners. The improvement in DL accuracy was driven by qualitative improvements in predicting both Improvers and non-Improvers, suggesting performance changes were not exclusively driven by positive or negative predictive values. For four ROI feature classifiers, the J48 decision F I G U R E 1 Example neural network for deep learning to illustrate explainable artificial intelligence procedures. The training data were divided into Improvers (green font, +) or non-Improvers (red font, −). Each training instancexactivated a subset of neurons which can be considered a simple binary vector (row). Using X + (Improvers) and X − (non-Improvers) we set up a set-style cover problem to find the subset of hidden nodes A + that activated for X + but that did not activate for X − , as well as the subset A − that activated for X − but not X + (see Section 2for more details) tree algorithm most closely approached the DL in regard to performance. Although the Naive Bayes learner showed identical accuracy to the J48 algorithm, its performance was driven by strong performance in non-Improvers at the expense of performance in Improvers.
For voxelwise ROI classifiers, the Naive Bayes learner most closely approached DL performance levels for overall accuracy, predicting Improvers, and predicting non-Improvers.
For the four ROI feature classifier, XAI revealed the most predictive feature used by the deep learner was left DLPFC activation.
T A B L E 3 ML/DL performance for each method, using either mean BOLD signal within the four frontoparietal ROIs or voxelwise data within the four frontoparietal ROIs as features and Improver/non-Improver status as a binary outcome. Numbers in parentheses represent the 95% CI over 1,000 repetitions of random 90% training/10% test data allocations. "Pooled" = average of all non-DL methods

| Continuous classifiers
RMSE data for ML algorithms for the four ROI feature and voxelwise ROI feature continuous outcome classifiers are presented in Table 5.
In contrast to the binary outcome classifiers above (Improvement/ non-Improvement), these classifiers predicted change in total BPRS score from baseline to follow-up.
As demonstrated for the binary classifiers, DL outperformed all shallow ML algorithms as well as the pooled average of the ML algorithms as demonstrated by qualitatively lower RMSE values (Table 5).
The DL performance differences were magnified for the voxelwise ROI features.

| CONCLUSIONS
The results of this study suggest that DL performs better than shallow ML algorithms in regard to predicting BPRS improvement in recent onset psychosis using cognitive control-associated activations from frontoparietal brain regions, supporting the development of DL-based methods for predicting treatment outcomes in psychosis populations.
The performance enhancement in DL was not specific to either Improvers on non-Improvers, suggesting positive predictive value was not enhanced at the expense of negative predictive value (or vice versa). XAI revealed that the most predictive feature used by the binary DL was left DLPFC activation. Overall, our study therefore illustrates the utility DL in combination with XAI to maximize predictive power while also shedding light on how to solve the "black box" problem inherent in most ML algorithms by revealing the bases for DL-based prediction. The finding that DL was the best method at predicting the continuous outcome (change in BPRS score) that the predictive enhancement observed with DL is nonspecific to any particular definition of 1-year clinical "Improvement" in psychosis.
Improved performance with DL is consistent with prior research comparing DL with shallow architectures (Krizhevsky et al., 2012;Le et al., 2012). Although the reason(s) for this improvement have not been experimentally verified, it is likely related to the ability of DL to combinatorically use input features to improve prediction. This effect may be magnified as features are added to classifier, as evidenced by larger performance differences between DL and shallow architectures using the voxelwise ROI feature sets. It should also be noted that for the four ROI feature binary outcome, although XAI methods found that the most important classification rule was a simple threshold for one feature, additional, more complex combinations of features may have also been utilized by the deep learner to refine its predictions that were not individually predictive enough to be identified by XAI.
Although our study was limited by its sample size (and therefore had limited power to detect statistically significant differences, as evidenced by the mostly nonsignificant McNemar's tests), we purposefully used the same sample as our previous report (Smucny et al., 2019) in order to directly compare accuracies with the prior logistic regression-based approach. As hypothesized, the deep learner outperformed the logit function. As our study was conducted at a single site with a small sample size, however, future studies using more participants and/or multiple sites will be necessary to determine the generalizability of our findings. It is nonetheless notable that such high accuracy (70%) toward predicting outcome could be achieved using brain activations from a single, simple cognitive-control task. Another limitation to consider when interpreting our results was the naturalistic nature of our prospective study. Specifically, we did not use strict guidelines on the use of antipsychotic medications across time, and medication compliance was ascertained by self-report. It is unclear, therefore, if symptomatic change in this study was due to medications or other forms of treatment (e.g., psychotherapy, psychoeducation). For this reason, in this study and our previous work using this sample, we labeled participants with psychosis as "Improvers" and "non-Improvers" rather than responders/nonresponders. On the other hand, the application of this approach in a naturalistic setting where personalized evidenced-based practices are provided does speak to the potential generalizability of this approach and utility for future clinical applications.
Finally, although we argue that 80% accuracy is desirable to achieve clinical utility, others have argued that the effectiveness of a prognostic indicator should be solely based on their ability to change clinical practice (e.g., Perlis, 2011). It is possible, therefore, that the 73% accuracy achieved in this study with DL is sufficient to suggest using fMRI scanning during the described cognitive control task to predict clinical improvement in psychosis. Nonetheless, as neuroimaging studies sometimes have questionable reproducibility (Elliott et al., 2020), our results require replication using a larger sample before strong statements can be made regarding our method's clinical utility (Perlis, 2011). To our knowledge, however, no established method has been developed that can reliably predict long-term symptomatic improvement in early psychosis patients. In combination with our prior work using logistic regression, we believe that the present study represents an important preliminary step toward developing a predictive algorithm for this purpose.

CONFLICT OF INTEREST
Dr I. D. has an active research grant from Google and a previous grant from Yahoo!.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.