Towards a brain‐based predictome of mental illness

Abstract Neuroimaging‐based approaches have been extensively applied to study mental illness in recent years and have deepened our understanding of both cognitively healthy and disordered brain structure and function. Recent advancements in machine learning techniques have shown promising outcomes for individualized prediction and characterization of patients with psychiatric disorders. Studies have utilized features from a variety of neuroimaging modalities, including structural, functional, and diffusion magnetic resonance imaging data, as well as jointly estimated features from multiple modalities, to assess patients with heterogeneous mental disorders, such as schizophrenia and autism. We use the term “predictome” to describe the use of multivariate brain network features from one or more neuroimaging modalities to predict mental illness. In the predictome, multiple brain network‐based features (either from the same modality or multiple modalities) are incorporated into a predictive model to jointly estimate features that are unique to a disorder and predict subjects accordingly. To date, more than 650 studies have been published on subject‐level prediction focusing on psychiatric disorders. We have surveyed about 250 studies including schizophrenia, major depression, bipolar disorder, autism spectrum disorder, attention‐deficit hyperactivity disorder, obsessive–compulsive disorder, social anxiety disorder, posttraumatic stress disorder, and substance dependence. In this review, we present a comprehensive review of recent neuroimaging‐based predictomic approaches, current trends, and common shortcomings and share our vision for future directions.

A notable example of such efforts is the Research Domain Criteria (RDoC) framework (T. Insel et al., 2010). This approach aims at incorporating the most recent findings from clinical and genetic neuroscience, thereby opening the field to a dimensional approach informed by the specific neural pathophysiology underlying psychiatric disorders. By utilizing advanced neuroimaging techniques, it is now possible to study disease-specific structural and functional brain impairments. Neuroimaging modalities, such as magnetic resonance imaging (MRI), magnetoencephalography (MEG), and electroencephalography (EEG), offer tools to noninvasively study the neural underpinnings of psychiatric disorders with high precision. Using these powerful techniques, researchers have begun to understand the complex neural function and structure that may lead to specific disorders.
In recent years, there has been a growing trend toward designing neuroimaging-based prognostic/diagnostic tools. As a result, substantial effort has focused on the use of neuroimaging to automatically discriminate patients with brain disorders from healthy controls (HCs) or from each other. Many of these studies have reported promising prediction performance, with the claim that complex mental illness can be diagnosed robustly, accurately, and rapidly in an automatic fashion. However, until now, these tools have not been integrated into the clinical realm. We believe a key reason for this is that many studies of this nature, despite promising results on a specific research dataset, are not designed to generalize to other datasets, particularly clinical ones.
So far, the most extensive review of major psychiatric disorders is the article by Wolfers et al., in which about 120 pattern recognition studies in schizophrenia (SZ), mood disorders, attention-deficit hyperactivity disorder (ADHD), autism spectrum disorder (ASD), anxiety disorders, and specific phobias were reviewed (Wolfers, Buitelaar, Beckmann, Franke, & Marquand, 2015). While there is some overlap between the aforementioned studies and the current survey, to the best of our knowledge, this is by far the largest survey in the field of major psychiatric disorders based on the number of papers reviewed (about 250). Further, in recent years, there has been exponential growth in predictive analysis studies, and therefore, an updated survey is warranted.
In this review, we provide a general discussion of current trends in the brain-based psychiatric "predictome" and their translational perspectives, along with some of the common challenges and guidelines for future directions. We also discuss emerging trends in neuroimaging such as data sharing, multimodal brain imaging, and differential diagnosis. The main goals of this study are: (a) to review and systematically compare a large number of recent MRI-based mental disorder diagnostic/prognostic studies in SZ, major depressive disorder (MDD), bipolar (BP) disorder, ASD, ADHD, obsessive-compulsive disorder (OCD), social anxiety disorder (SAD), posttraumatic stress disorder (PTSD), and substance dependence (SD); (b) to discuss pitfalls and promises of existing machine learning techniques; and (c) to provide our vision and future directions to address some of the challenges. While a number of challenges remain to be addressed, brain-based predictome studies have made considerable progress in recent years. We hope that, with more sophisticated machine learning approaches integrated with large-scale data, predictive modeling tools will transition from the "proof-of-concept" stage to the "ready for clinical implementation" stage in the near future.
Typically, after feature extraction and selection, a classifier is trained in a supervised or semi-supervised way with a predefined set of labels.
Further model validation is performed either using an independent testing dataset or by incorporating a cross-validation (CV) scheme. Figure 1 presents the most common components of a brain-based predictome pipeline of mental illness prediction using neuroimaging data. While specific pipelines might vary at different preprocessing and postprocessing stages, conventional predictome analyses typically include the following steps: (a) feature extraction and selection/reduction, (b) classifier training, (c) classification and CV, and (d) performance evaluation.
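The four pipeline steps above can be sketched with scikit-learn. This is a minimal illustration on synthetic data; the feature counts and parameter choices are assumptions for demonstration, not values from any surveyed study.

```python
# Minimal sketch of the predictome pipeline steps (a)-(d) on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Feature extraction is assumed done: X holds per-subject neuroimaging features.
X, y = make_classification(n_samples=100, n_features=500,
                           n_informative=20, random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=50)),  # (a) feature selection/reduction
    ("clf", SVC(kernel="linear")),             # (b) classifier training
])

# (c) classification and cross-validation; (d) performance evaluation.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(round(scores.mean(), 2))
```

Because the selector and classifier are wrapped in one `Pipeline`, feature selection is refit inside each CV fold, matching the training/testing separation discussed later in this section.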

| Feature extraction, selection/reduction
The first step of a predictome analysis is to transform neuroimaging data into features (i.e., deciding what to use as features and extracting these feature values from the data). A neuroimaging feature refers to any derived variable containing valuable information about the class labels that can be extracted from the data.

FIGURE 1 Predictome pipeline. An overview of the neuroimaging-based predictome pipeline. (a) Neuroimaging modalities typically used for mental illness prediction. (b) Current approaches for feature extraction, which can include (i) voxel-based, (ii) network-based, (iii) data-driven approaches (e.g., independent component analysis, ICA), or (iv) jointly estimated features from multiple modalities (e.g., fMRI and genomics). (c) Types of feature selection, including automatic or expert selection approaches. (d) Choice of classifiers, which may include support vector machine (SVM), linear discriminant analysis (LDA), Gaussian process classifier (GPC), neural network classifier (NNC), or logistic regression classifier (LRC). (e) Model validation, performed using either a test-validation setup or a k-fold cross-validation scheme. (f) Data-driven subtype identification, which can also be performed for heterogeneous disorders (Gupta et al., 2017; Marquand, Rezek, Buitelaar, & Beckmann, 2016). (g) Various measures for performance evaluation, such as accuracy, sensitivity, specificity, precision, and F1-score. FN, false negative; FP, false positive; TN, true negative; TP, true positive
In this survey, we reviewed and highlighted predictome studies based on the type of features used for classification, including voxel-based, region-based, and brain-network-based feature selection approaches. For example, features can simply be a set of brain voxels within a particular brain network or region of interest (ROI), multivariate data-driven brain networks (e.g., using independent component analysis [ICA]), or jointly estimated multimodal features, as seen in Figure 1b. A voxel-based approach extracts features at the level of individual brain voxels, while a region-based approach identifies and extracts predefined ROIs as features based on a brain atlas (either functional or structural). A network-based feature extraction approach, such as ICA, aims at combining multiple voxels across brain networks (Calhoun, Adali, Pearlson, & Pekar, 2001b; McKeown et al., 1998).
In addition to feature extraction, it is often important to reduce the number of features from high-dimensional neuroimaging data before proceeding with model training. In the context of neuroimaging, feature selection can help achieve higher accuracy rates (Ad-Dab'bagh et al., 2006) and allows a more specific focus on the underlying brain regions that account for between-group differences (Plitt, Barnes, & Martin, 2015). Indeed, the number of features in neuroimaging data is large, with many irrelevant features that do not contribute to the predictive power of the model, and not all disorders affect every brain network in the same way. Thus, some brain-based features might not contribute to the diagnosis labels, and some features may capture redundant information already uncovered by other features. Computational time and model generalization can also be improved by excluding redundant and unrelated features (Dash & Liu, 1997; Guyon & Elisseeff, 2003; Moradi et al., 2015).
Feature reduction approaches (e.g., principal component analysis [PCA]) project the high-dimensional neuroimaging data into a lower-dimensional space with the goal of preserving the model's discriminative power. Although not an essential step, selecting both optimal and meaningful features is important for improving the strength of the prediction algorithm (Chu et al., 2012; Cuingnet et al., 2011). In a supervised learning approach, the most discriminative features are selected to amplify the signal and reduce the noise.
Often, prior information is used to address the dimensionality issue of neuroimaging data. Based on the characteristics of the features and the type of learning problem, a particular feature selection approach is chosen (Mwangi, Tian, & Soares, 2014). Common feature selection approaches include: (a) expert feature selection (based on prior knowledge) and (b) automatic feature selection (based on a feature selection algorithm). A combination of these two approaches can also be used. For example, an expert feature selection approach can first be applied by selecting a previously known disorder-specific ROI, and an automatic feature selection algorithm can then be used to favor discriminative features within the predefined ROI. Note that, to avoid performance bias, feature selection and extraction methods should be fitted on the training dataset only.
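The training-only caveat above can be sketched concretely: fit the dimensionality reduction on the training split alone, then apply the fitted transform to the test split. The data and component count here are illustrative assumptions.

```python
# Avoiding leakage: fit PCA on the training split only, reuse it on the test split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=80, n_features=300, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

pca = PCA(n_components=20)
X_tr_red = pca.fit_transform(X_tr)  # components estimated from training data only
X_te_red = pca.transform(X_te)      # test data projected onto training-derived components

print(X_tr_red.shape, X_te_red.shape)
```

Fitting the PCA on the full dataset before splitting would let test-set information influence the components, which is exactly the performance bias the paragraph warns against.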

| Classifier training
A classifier is a function that takes features as input and generates a class label prediction. Based on the learning function and underlying assumptions, different types of classifiers can be developed. Neuroimaging studies have applied various classifiers for mental illness prediction. The dimensionality issue associated with the relatively large number of features and the small number of samples should be accounted for when applying such classification algorithms. Typically, the classifier learns a rule that separates the underlying classes optimally. Any type of classification or regression algorithm can be used for training, such as linear and logistic regression algorithms, multilayer neural networks, and Gaussian approaches (Bishop, 2006). In the current review, we have limited our focus to classifiers using discrete outcome measures (i.e., diagnostic labels), with the exception of the discussion on translational perspectives and advanced predictive modeling (Sections 5 and 6).

| Nearest-neighbor
The simplest form of classifier is the "nearest neighbor," which does not require any explicit learning of a classification function. Using the nearest-neighbor approach, an independent test sample is classified by identifying the most similar training sample under some measure, for example, the lowest Euclidean distance, and then assigning the label of that training sample (i.e., the nearest neighbor) to the test sample.
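The nearest-neighbor rule described above can be written out directly: compute the distance from each test sample to every training sample and copy the label of the closest one. The data points here are purely illustrative.

```python
# Nearest-neighbor classification: assign the label of the closest training sample.
import numpy as np

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.05, 0.1], [5.1, 5.1]])

# Euclidean distance from every test sample to every training sample.
dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
# For each test sample, take the label of the nearest training sample.
y_pred = y_train[np.argmin(dists, axis=1)]
print(y_pred)  # → [0 1]
```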

| Discriminative and generative models
Other classifiers that require an explicit learning function can be categorized as discriminative or generative models (Trevor, Robert, & Friedman, 2009). A discriminative classifier directly learns to predict labels from the training data using a learning function based on predefined parameters. In contrast, a generative classifier learns a statistical model by modeling the distributions of feature values conditional on the class labels, and derives predictions from that model.

| Support vector machine
During the training stage of supervised learning, data labels are used to optimize the model by finding a hyperplane or decision boundary that maximally discriminates between groups. The most common choice for a simple learning function is to predict class labels based on a linear combination of the features that might influence the outcomes.
A linear classifier can be viewed as learning a line or boundary (i.e., a decision boundary) that separates points in the two classes and discriminates their labels. A linear support vector machine (SVM), for instance, is one such classifier.
Due to its widespread use and promising results in neuroimaging-based prediction, the SVM is the most common classifier in our current survey. The SVM algorithm is typically intended for binary classification and aims at maximizing the margin between different classes, possibly in a higher-dimensional space. Mathematically, the discriminant function for the SVM consists of a weight vector orthogonal to the decision boundary and is specified by the data points that lie closest to the decision boundary, known as support vectors. This decision boundary further defines the classification rule for new, unseen cases.
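A fitted linear SVM exposes exactly the quantities described above: the weight vector orthogonal to the decision boundary and the support vectors that specify it. This sketch uses synthetic two-class data for illustration.

```python
# Linear SVM: inspect the weight vector and support vectors after fitting.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

w = clf.coef_[0]                       # weight vector orthogonal to the boundary
n_sv = clf.support_vectors_.shape[0]   # training points lying closest to the boundary
print(w.shape, n_sv)
```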

| Linear discriminant classifier
Another powerful linear model is the linear discriminant classifier (LDC), which attempts to separate classes by maximizing the ratio of between-class to within-class variance. An example of a probabilistic discriminative model is the logistic regression classifier (LRC), which learns an optimal decision rule by modeling the log-odds as a linear combination of the predictor variables (i.e., features). Both the LDC and LRC yield a probability that a new case belongs to a particular class, as well as a class label.
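As noted above, both classifiers return a class probability alongside the hard label. A minimal sketch on synthetic data (sizes are illustrative):

```python
# LDC and LRC both provide probabilistic predictions plus a class label.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=10, random_state=2)
for clf in (LinearDiscriminantAnalysis(), LogisticRegression(max_iter=1000)):
    clf.fit(X, y)
    proba = clf.predict_proba(X[:1])       # probability of each class for one case
    label = clf.predict(X[:1])             # hard class label
    assert abs(proba.sum() - 1.0) < 1e-9   # class probabilities sum to one
    print(type(clf).__name__, label[0])
```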

| Gaussian process classifier
Further, the Gaussian process classifier (GPC) is a probabilistic model and a Bayesian extension of the LRC (Wolfers et al., 2015). Briefly, the GPC is first trained on the training features to determine an optimized predictive distribution distinguishing between cases and controls. The parameters of this predictive distribution are estimated by maximizing the logarithm of the marginal likelihood on the training features. During the testing stage, the GPC then predicts case or control status by passing the predictive distribution of the test data through a sigmoid function (Frangou, Dima, & Jogia, 2017). For technical details of the GPC, refer to Schrouff et al. (2013).
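The two quantities highlighted above, the maximized log marginal likelihood and the sigmoid-linked predictive probabilities, are both exposed by scikit-learn's GPC. This is an illustrative sketch on synthetic data, not a reproduction of any surveyed analysis.

```python
# Gaussian process classification: log marginal likelihood is maximized during
# fitting; the latent function is squashed through a sigmoid link for prediction.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

X, y = make_classification(n_samples=60, n_features=5, random_state=3)
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0), random_state=3).fit(X, y)

print(round(gpc.log_marginal_likelihood_value_, 2))  # maximized during training
proba = gpc.predict_proba(X[:1])                     # predictive distribution for a test case
```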

| Neural network classifier
Artificial neural network classifiers (NNCs) have also recently become popular for modeling biological networks. A multilayer NNC is an extension of the linear perceptron classifier that can yield complex nonlinear decision boundaries. Typically, the structure of an NNC includes an input layer, hidden layer(s), and an output layer.
Neurons in each of these layers are connected to the neurons of the subsequent layer. A variety of nonlinear transfer functions can be used for the hidden layer neurons (e.g., the sigmoid function).
Briefly, during the training phase, the weights across a set of connected artificial neurons are adjusted using the backpropagation technique (Werbos, 1990) and then used for classification. For example, in the case of mental illness prediction, an NNC analyzes the training labels (i.e., healthy versus disorder) and learns to classify a test example.
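The structure described above, one hidden layer with a sigmoid transfer function trained by backpropagation, can be sketched with scikit-learn's `MLPClassifier`. Data and layer sizes are illustrative assumptions.

```python
# A small multilayer neural network classifier trained by backpropagation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=120, n_features=20, random_state=4)
nnc = MLPClassifier(hidden_layer_sizes=(16,),      # one hidden layer of 16 neurons
                    activation="logistic",         # sigmoid transfer function
                    max_iter=2000, random_state=4)
nnc.fit(X, y)  # weights adjusted via backpropagation
print(round(nnc.score(X, y), 2))
```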

| Random forest
Other recent and more powerful approaches for brain-based prediction include random forest and deep learning classifiers. The random forest classifier, an ensemble of decision tree classifiers, integrates multiple levels of randomization (Breiman, 2001). Each decision tree is grown using a randomized subset of the training data, and each node is subsequently formed by searching through a random subset of training features. For each feature, the classifier estimates a score highlighting the feature's discriminative power (i.e., the Gini Importance [GI] score). The random forest approach offers improved generalization accuracy because it randomizes training subjects, particularly in cases with relatively few training subjects compared to the number of training features. Further, the random forest classifier provides nonlinear decision boundaries, which helps model nonlinear patterns of features during training.
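The per-feature Gini Importance scores mentioned above are available directly from a fitted random forest. A minimal sketch on synthetic data (sizes and parameters are illustrative):

```python
# Random forest: an ensemble of randomized trees with per-feature Gini importances.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100, n_features=30,
                           n_informative=5, random_state=5)
rf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X, y)

gini = rf.feature_importances_        # Gini Importance score per feature
assert abs(gini.sum() - 1.0) < 1e-9   # importances are normalized to sum to one
print(int(np.argmax(gini)))           # index of the most discriminative feature
```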

| Deep learning
Deep learning classifiers have recently become an attractive choice for mental illness prediction (Han, Huang, Zhang, Zhao, & Chen, 2017; Iidaka, 2015; Jang, Plis, Calhoun, & Lee, 2017; J. Kim, Calhoun, Shim, & Lee, 2016; Plis et al., 2014). Deep learning classifiers can learn the features with optimal discriminating power directly from the raw data by using a hierarchical approach (Schmidhuber, 2015; Vieira, Pinaya, & Mechelli, 2017). This provides a great advantage over conventional classifiers that require explicit feature reduction steps. By applying nonlinear transformations to the raw data, deep learning classifiers largely bypass explicit feature selection, which is particularly helpful for high-dimensional features or data lacking prior knowledge.

| Classification and CV framework
Once the classifier learns the decision rule based on features from a training set, the next step is to validate the model on a testing set. In order to mitigate performance bias and overfitting in predictive modeling, it is critical to keep the training and testing datasets independent. During the training stage, the classifier learns to predict the labels from the training features based on the associated learning algorithm. For learning problems without complex, iterative feature selection, the trained classifier is then tested on previously unseen testing data (Wolfers et al., 2015). In order to achieve better model performance, a classifier should be trained with as much training data as possible, which is often a challenging issue in neuroimaging-based prediction studies. CV approaches allow classifiers to be trained with a larger number of training samples. A common CV approach is to repeatedly evaluate model performance using multiple training and testing partitions, a validation approach known as k-fold CV (k: number of data partitions; Kohavi, 1995; Patel, Khalaf, & Aizenstein, 2016). Other popular CV approaches include leave-one-out (LOO-CV) and holdout. LOO-CV is an iterative process, typically used on smaller samples, where k equals the number of samples and every subject in the whole sample is left out once for testing the classifier. Briefly, the LOO-CV procedure includes the following steps: (a) leave one sample out, train on the remaining ones, and make a prediction for this sample; (b) repeat for each sample in turn; and (c) compute the accuracy of the predictions made for all the samples. While a popular choice, leaving each sample out can become computationally expensive, as it requires training as many classifiers as there are samples.
In addition, LOO-CV has also been shown to potentially introduce prediction bias, as it can yield high variance by providing more data during the training stage, which could also result in overfitting (Elisseeff & Pontil, 2003; Refaeilzadeh, Tang, & Liu, 2009). Because of this, the preferred approach is k-fold CV with k < number of samples. Common choices for partitioning are k = 10 or k = 5, corresponding to leaving out 10 or 20% of the total samples during each validation fold. Other important considerations for designing a CV procedure include: (a) inclusion of examples from all classes in the training data for better prediction accuracy, (b) having roughly equal numbers of samples across classes (i.e., balanced classes), and (c) inclusion of correlated samples in the same fold to avoid misleadingly high performance from accurately predicting test samples with a correlated counterpart in the training set (Pereira, Mitchell, & Botvinick, 2009).
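The CV variants above can be sketched side by side: LOO-CV trains one classifier per sample, while stratified k-fold trains only k classifiers and keeps class proportions roughly balanced per fold. Data and classifier choice are illustrative.

```python
# LOO-CV vs. stratified 5-fold CV on the same synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import LeaveOneOut, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=40, n_features=10, random_state=6)
clf = SVC(kernel="linear")

loo_scores = cross_val_score(clf, X, y, cv=LeaveOneOut())         # 40 classifiers trained
kfold_scores = cross_val_score(clf, X, y, cv=StratifiedKFold(5))  # 5 classifiers trained
print(len(loo_scores), len(kfold_scores))
```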
Performance measures, including accuracy, are averaged across iterations for the training and testing phases. For a supervised approach, a model is optimized using labeled data to find a discriminative decision boundary or hyperplane differentiating between case and control groups. The model parameters are optimized for maximum discrimination between groups. The CV approach helps ensure generalization of the training. During the classification stage, the trained model is then used to predict the label for new, unseen observations from the testing set. For unbiased generalization, it is important that the testing data do not overlap with the training data (Lemm, Blankertz, Dickhaus, & Müller, 2011). Further, the new data should be preprocessed in the same way as the training data.
More recently, another type of validation has been introduced, in which multiple types of classifiers are cross-validated on the same training data. For example, "Polyssifier" can be used to cross-validate multiple classifiers, where a baseline is first computed by applying multiple classifiers, such as nearest neighbors, linear SVM, radial basis function (RBF) SVM, decision tree, random forest, logistic regression, naive Bayes, and linear discriminant analysis (LDA; http://mialab.mrn.org/software/#polyssifier).

| Measures for performance evaluation
The most commonly used performance evaluation measures for predictive algorithms include accuracy, sensitivity, specificity, and the receiver operating characteristic (ROC) curve. These measures evaluate how accurately a classifier generalizes to new test samples (i.e., cases). In a clinical context, accuracy indicates the total proportion of cases and controls classified correctly, sensitivity shows the proportion of true positives correctly identified (i.e., what percentage of cases are correctly identified), and specificity shows the proportion of true negatives correctly identified (i.e., what percentage of controls are correctly identified) by the model. High sensitivity means that few patients are falsely labeled as HCs, and high specificity means that few HCs are falsely labeled as patients. The overall performance of the model can be assessed by the ROC curve, which can be summarized by the area under the curve (AUC). ROC curves show the balance between the true positive rate (sensitivity) and the false positive rate (1 - specificity) across a range of decision thresholds within the model. To avoid bias from potential imbalances between groups, a common practice is to report balanced accuracy, obtained by averaging the accuracy for each diagnostic label (Brodersen, Ong, Stephan, & Buhmann, 2010).
A useful way to summarize classification performance is a confusion matrix, which presents actual labels along one axis and predicted labels along the other. This is especially important for models predicting more than two groups (Baldi, Brunak, Chauvin, Andersen, & Nielsen, 2000). Other useful performance measures can be extracted from the confusion matrix, including the positive predictive value (PPV), negative predictive value (NPV), F1-score (harmonic mean of precision and recall), and G-mean (geometric mean of precision and recall; Alberg, Park, Hager, Brock, & Diener-West, 2004). Positive and negative predictive values are important for predictive studies as they directly quantify the potential utility of a classifier for clinical diagnosis. The positive predictive value is the number of times the classifier correctly predicted participants as patients (i.e., a positive diagnosis) divided by the total number of positive predictions. The negative predictive value is the number of times the classifier correctly predicted a negative diagnosis divided by the total number of negative predictions.
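The measures defined above all follow from the four cells of a 2x2 confusion matrix. This sketch computes them from illustrative predicted vs. actual labels (1 = patient, 0 = control):

```python
# Deriving sensitivity, specificity, PPV, NPV, and balanced accuracy
# from a 2x2 confusion matrix (illustrative labels).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])  # 4 patients, 6 controls
y_pred = np.array([1, 1, 1, 0, 0, 0, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)                 # true positive rate
specificity = tn / (tn + fp)                 # true negative rate
ppv = tp / (tp + fp)                         # positive predictive value
npv = tn / (tn + fn)                         # negative predictive value
balanced_acc = (sensitivity + specificity) / 2
print(sensitivity, specificity, round(balanced_acc, 3))
```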

| PREDICTION OF MENTAL ILLNESS USING NEUROIMAGING TECHNIQUES
With recent advancements in medical imaging technology, neuroimaging data is being collected more rapidly and at finer resolution than ever before. In recent years, there has been an increasing interest in leveraging this vast amount of brain data across analytic levels, acquisition approaches, and experimental designs to achieve a deeper understanding of brain structure and function. In this review, we use the term "predictome" to describe the use of multivariate brain network features from one or more neuroimaging modalities to predict mental illness.
In the predictome, multiple brain network-based features (either from the same modality or multiple modalities) are incorporated into a predictive model to jointly estimate features that are unique to a disorder and predict subjects accordingly. Here, we review recent predictomic approaches used for neuroimaging classification and prediction, and provide an overview of studies for prediction of mental illness from their healthy counterparts.
| Survey procedure for the current literature review

The current review is based on a comprehensive literature search for research articles performing MRI-based predictive analyses of psychiatric illnesses. A systematic literature search was performed primarily in PubMed from 1990 to 2018, and more than 550 articles were found. SZ (Calhoun, Kiehl, Liddle, & Pearlson, 2004) was one of the first disorders investigated with predictive analyses, followed by major depressive disorder (MDD; Fu et al., 2008; Marquand, Mourão-Miranda, Brammer, Cleare, & Fu, 2008), BP disorder (Arribas, Calhoun, & Adali, 2010), ADHD (C.-Z. Zhu et al., 2008), ASD, PTSD (Q. Gong et al., 2014), OCD (Weygandt et al., 2012), SAD, and SD (Vergara, Mayer, Damaraju, Hutchison, & Calhoun, 2017; Vergara, Weiland, Hutchison, & Calhoun, 2018). Figure 2 illustrates the systematic literature search process for this study. Briefly, the search consisted of the following steps: (a) different terms related to classification/machine learning as well as their abbreviations (e.g., for support vector machine, the term "SVM"); (b) all terms and abbreviations related to structural, functional, and diffusion MRI (dMRI) combined with the term "biomarker"; and (c) all terms and abbreviations for one of the eight psychiatric disorders mentioned above. These steps were repeated for all disorders, and the identified references were further checked for missed publications, which were included in the review as well. An additional screening step assessed the relevance of the publications for the current review. Finally, we focused on all publications using a predictive analysis approach on MRI-based data in a case-control design of mental illness diagnoses that explicitly evaluated classification performance measures (e.g., overall classification accuracy).
Further, the same search procedure was repeated in Google Scholar to reduce the probability of missing relevant articles. About 250 papers were eventually selected for this survey, comprising 101 SZ, 61 MDD/BP, 35 ADHD, 38 ASD, 1 PTSD, 12 OCD, 2 SAD, and 7 SD studies. We categorized these articles based on a scheme developed for this review, as depicted in Figure 1a-e, and a summary of all articles is presented in Tables 1-8. Further, we limited our search range to journal articles in English published up until December 2018. Search criteria also excluded articles without available full text and similar papers published by the same authors. For each study, key aspects such as imaging modality, classification method, sample size, and type of features were investigated in a quantitative manner, as seen in Figures 3-6.

| Schizophrenia
SZ is a chronic mental disorder (Bhugra, 2005), typically characterized by cognitive problems, disintegration in the perception of reality, auditory and/or visual hallucinations, and a chronic course with lasting impairment (Heinrichs & Zakzanis, 1998). There is currently no standard clinical diagnostic test for SZ, and there has been considerable focus on identifying a biologically based marker using neuroimaging features, which has shown some promise. We surveyed 101 peer-reviewed articles, which are presented in Table 1. Calhoun et al. (2004) and Yushkevich et al. (2005) are among the first studies to perform predictive analyses on SZ using MRI-based neuroimaging data (Table 1).
1 Structural MRI: By utilizing sMRI data, one early study used a voxel-based feature set and applied a high-dimensional nonlinear pattern classification approach to compute the degree of separation between SZ patients and HCs. Using leave-one-out CV (LOO-CV), the authors reported 81% classification accuracy (for gender-wise classification, 82% for women and 85% for men).
Another study by Yushkevich et al. (2005) used an SVM classifier and region-based feature sets to discriminate SZ patients with 72% accuracy. More recently, Koutsouleris et al. (2009) used sMRI and a principal component feature selection approach, in which an optimal number of principal components was identified, based on the overall predictive performance of the feature selection algorithm, to predict SZ. This study is of particular importance as it reportedly predicted different subcategories of SZ reliably, with a three-class classification for SZ showing a maximal accuracy of 82%. Another large-scale study with a sample size of 256 cases and controls, as well as a similarly sized replication cohort, predicted SZ based on sMRI-derived features with an accuracy of about 70% for both the CV and the replication study (Nieuwenhuis et al., 2012).
2 Functional MRI: More recently, a large number of studies have used features from resting-state and task-based functional MRI (fMRI) for predictive modeling of SZ and achieved promising outcomes.
FIGURE 2 The systematic literature review procedure, the inclusion criteria, and the number of surveyed studies for each modality. ADHD, attention-deficit/hyperactivity disorder; ASD, autism spectrum disorder; MDD/BP, major depressive disorder/bipolar disorder; OCD, obsessive-compulsive disorder; PTSD, posttraumatic stress disorder; SAD, social anxiety disorder; SD, substance dependence; SZ, schizophrenia

(a) Task-based: Studies using features from task-based fMRI paradigms include experiments with verbal fluency, working memory, and auditory oddball tasks (Castro et al., 2011; Costafreda et al., 2011; Honorio et al., 2012). One of the first relatively large-scale studies (i.e., sample size >150) classified SZ using three different task-based fMRI datasets (i.e., auditory oddball, Sternberg item recognition, and sensorimotor tasks) from 155 participants across two sites (Demirci, Clark, Magnotta, et al., 2008). The authors applied a projection pursuit algorithm to ICA spatial maps and achieved classification accuracies ranging between 80 and 90%, with the sensorimotor task providing the best performance. Further, using regions with greater synchrony estimated from synchronous hemodynamic independent maps of auditory cortex as features, Calhoun and colleagues used a within-participant subtractive comparison to discriminate SZ from HC with 97% initial accuracy and 94% accuracy after a retest validation using new subjects scanned at a different site. Another study focusing on three-class differential diagnosis of SZ, BP, and HC individuals reported that verbal fluency led to a reliable diagnostic specificity for SZ with 92% accuracy. A related "biotypes"-based approach identified three neurobiologically distinct, biologically defined psychosis categories and showed that the biotypes did not follow a straightforward disease severity continuum, with heritable properties in unaffected first-degree relatives (Clementz et al., 2015).
(b) Resting-state: The rsfMRI studies for SZ prediction used a variety of classifiers, such as SVM, fused lasso, GraphNet, RF, c-means clustering, regularized LDC, and ensembles of SVM classifiers, as presented in Table 1. Overall, the sample sizes were relatively large across these studies, with classification accuracies ranging from 62% to 100%, although the study reporting 100% accuracy had a very small sample size (20 participants), so its results might not generalize to other samples (Pouyan & Shahamat, 2015).
4 Multimodal: Recently, using connectivity measures from multimodal dMRI and sMRI data, Zhu, Shen, Jiang, and Liu (2014) predicted SZ patients with perfect (100%) accuracy. While these studies utilized a variety of MRI-based features, including sMRI, resting-state and task-based fMRI, and dMRI, they provide evidence for brain-based differential diagnosis of MDD and BP. Future studies should employ larger sample sizes and include more BP prediction research.

| Autism spectrum disorder
ASD is a neurodevelopmental disorder characterized by impaired social communication, deficits in social-emotional reciprocity, deficits in nonverbal communicative behaviors used for social interaction, and stereotypic behavior. Since 2010, a few studies have investigated automatic diagnosis of ASD in both male-only and male-female samples (Ecker, Rocha-Rego, et al., 2010; Jiao et al., 2010), with promising results showing accuracies ranging from 81% to 90% (Table 3). We surveyed 30 papers on automatic diagnosis of ASD using MRI-based features, which are listed in Table 3.
Interestingly, Uddin and colleagues applied a searchlight algorithm to sMRI features, in which a small number of voxels within spatial proximity of one another provide the predictive features (Uddin et al., 2011). The resulting model obtained 92% classification accuracy for ASD diagnosis.
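To illustrate the searchlight idea (this is a minimal sketch on synthetic data, not the authors' implementation), each voxel is scored by the cross-validated accuracy of a classifier trained only on its local spherical neighborhood; all names and the toy effect are our own assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def searchlight_scores(X, y, coords, radius=2.0, cv=3):
    """For each voxel, classify using only voxels within `radius` of it,
    and record the cross-validated accuracy of that local model."""
    scores = np.zeros(len(coords))
    for i, c in enumerate(coords):
        mask = np.linalg.norm(coords - c, axis=1) <= radius  # spherical neighborhood
        scores[i] = cross_val_score(SVC(kernel="linear"), X[:, mask], y, cv=cv).mean()
    return scores

# toy example: 40 subjects, 50 "voxels" on a line; only voxels 10-19 carry signal
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 20)
X = rng.normal(size=(40, 50))
X[y == 1, 10:20] += 1.5                      # hypothetical group effect
coords = np.c_[np.arange(50), np.zeros(50), np.zeros(50)]
scores = searchlight_scores(X, y, coords)    # high scores cluster near voxels 10-19
```

Informative regions then appear as spatially contiguous clusters of high local accuracy, which is what makes the searchlight map interpretable.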
2 Functional MRI: (a) Task-based: More recently, features from task-based fMRI combined with SVM, LRC, LDC, and GPC classifiers were used for ASD prediction, with accuracies ranging from 70% to 96% (Table 3).

| Attention-deficit/hyperactivity disorder
ADHD is one of the most common neurodevelopmental disorders. However, given the lack of a biologically based diagnostic approach, ADHD is currently diagnosed based on behavioral symptoms alone. In this review, we surveyed 35 papers on automatic diagnosis of ADHD using MRI-based features, which are listed in Table 4.
1 Structural MRI: Studies based on sMRI features reported accuracies for ADHD classification ranging from about 72% to 93% (Igual et al., 2012; Johnston et al., 2014; Lim et al., 2013). Using a voxel-based feature set from sMRI data and an automatic feature selection approach, Johnston and colleagues trained an SVM classifier and obtained 93% accuracy (Johnston et al., 2014).
Lim and colleagues used whole-brain gray and white matter from sMRI data with a GPC classifier and predicted ADHD with 79.3% accuracy (Lim et al., 2013). 2 Functional MRI: (a) Task-based: To predict ADHD, Hart and colleagues used stop-signal task-based fMRI data (Hart, Chantiluke, et al., 2014; 77% accuracy) and temporal discounting task-based fMRI data (Hart, Marquand, et al., 2014; 75% accuracy). However, these studies were performed on relatively small samples and may therefore lack generalizability to independent samples. Other studies combined modalities, such as rs-fMRI and sMRI data (Bohland et al., 2012; Colby et al., 2012; Dai et al., 2012) and sMRI and task-based fMRI data (Iannaccone et al., 2015), with accuracies ranging from 55% to 80%.

| Obsessive-compulsive disorder
Only a few recent studies have applied classification algorithms to OCD. We surveyed 12 papers focused on automatic diagnosis of OCD using MRI-based features, which are listed in Table 5.

| Social anxiety
To date, only two studies on SAD have been published, both with relatively small samples and accuracies above 80% (e.g., Frick et al., 2014). These studies derived multivariate patterns from different MRI modalities, and both reported that features relevant for SAD were distributed across widespread brain areas rather than localized to regions typically associated with anxiety. We surveyed these two papers on automatic diagnosis of SAD using MRI-based features, which are listed in Table 6.

| Posttraumatic stress disorder
Only one study to date has performed discriminative analysis on PTSD, comparing 50 earthquake survivors with and without PTSD to controls using structural imaging (Q. Gong et al., 2014). Patients with PTSD were classified with an accuracy of 91%, with the most discriminative features found in several brain areas, particularly left and right parietal regions. Table 7 presents the surveyed paper on automatic diagnosis of PTSD using MRI-based features.

| Substance dependence
To date, only a few predictive studies on SD (e.g., alcohol, nicotine, and cocaine addiction) and treatment completion have been performed, with only one study implementing a multimodal imaging approach to predict alcohol consumption and treatment effects. Alcohol dependence was predicted in a recent study using regional gray matter maps from sMRI with weighted robust distance and SVM classifiers (Guggenmos et al., 2018), resulting in 74% classification accuracy.
| Analysis of the survey
In ADHD studies, rsfMRI is the most popular modality. Moreover, compared to dMRI, multimodal studies are more common across these major disorders. Figure 3d shows the overall prediction accuracy against the commonly used classifiers in each disorder type, Figure 3e reports the overall prediction accuracy against each modality and each disorder type, and Figure 3f presents the total sample size against each disorder type for each modality. The SVM classifier was the most popular across all major disorders, followed by LDA. Figure 4a shows the overall accuracy against the total sample size for each disorder and each classifier used in the studies, Figure 4b shows the overall accuracy against the total sample size for each modality and each classifier, and Figure 4c shows the overall accuracy against the total sample size for each disorder and each modality. Interestingly, even with sample sizes smaller than 100, almost all studies reported very high accuracies. Another interesting observation from this summary is that there is growth in multimodal prediction studies for many of these major mental disorders.
4 | TRANSLATIONAL PERSPECTIVE OF BRAIN-BASED PREDICTOME RESEARCH FOR CLINICAL APPLICATION
4.1 | Translating predictive outcomes toward clinical utility
Typically, in a research-based setup, predictive studies are implemented using two or more well-proportioned groups of patients with mental illness and their healthy counterparts. The group labels are carefully diagnosed before training a supervised classification algorithm, and exclusion of subjects with uncertain diagnoses or comorbidities is common practice (Wolfers et al., 2015). In a real clinical population, however, disease diagnosis is a more complex and nuanced process. Thus, considerable improvement in the field of predictive modeling is required before these models can be applied in clinical practice. In many clinical cases, the central question is not simply how to distinguish patients from controls, but rather how to make specific distinctions between different illnesses in the same population (i.e., subtypes). Simply put, a differential diagnostic process is required before the available tools can be accurately implemented in the clinic. Another limitation of current predictive modeling approaches is the lack of appropriate (or any) identification of comorbidities among patients, which is essential for proper treatment assignment.
• Treatment response/outcome prediction: In addition to aiding decision-making in clinical diagnosis, predictive techniques can also be used to predict treatment response (Bzdok & Meyer-Lindenberg, 2018; e.g., Gong et al., 2011) and treatment outcome (e.g., Schmaal et al., 2015). By monitoring treatment outcomes and pursuing potential treatments using predictive modeling, clinical diagnosis can become more cost-efficient.
• Drug trial design: Based on predictive modeling outcomes, future response can also be classified. By selecting subsets of individuals who are most likely to respond to a particular medication, more efficient drug trials can be designed. For example, medication-class response to mood stabilizers (bipolar disorder) or antidepressants (depression) can be classified using machine learning approaches (Osuch et al., 2018).

| Prediction of continuous measures versus categorical diagnoses
Most of the mental illness prediction studies surveyed in this review are based on assignment of discrete or categorical class labels for test samples.
However, the categorical diagnosis approach overlooks continuous measures while predicting a certain disease class, which can lead to misleading outcomes or miss subclinical tendencies that could be useful for predicting risk. For more reliable outcomes, predictive modeling using continuous measures, such as pattern regression, can be a valuable tool.
Moreover, for mental illness prediction using brain-based features, regression-based modeling can be used to estimate disease progression and treatment outcomes, as well as continuous measures (e.g., neuropsychiatric or cognitive scores). To estimate continuous clinical measures from neuroimaging data, Wang and colleagues proposed a framework using the relevance vector machine (RVM) to build regression models, and obtained higher accuracy and better generalizability than support vector regression (Y. Wang, Fan, Bhatt, & Davatzikos, 2010). Another study explored interregional cortical thickness correlations to identify and characterize the Autism Diagnostic Observation Schedule score in ASD.
The results from this study showed that structural covariance measures among multiple brain networks are associated with autistic symptoms.

Further, Tognin and colleagues used relevance vector regression to predict Positive and Negative Syndrome Scale scores of subjects at high risk of psychosis based on gray matter volume and cortical thickness measurements. More recently, studies have started to predict continuous measures of assessment in both health and disease (e.g., under the research domain criteria [RDoC] framework; T. Insel et al., 2010). These studies suggest that promising results can be achieved by using continuous measures for disease prediction in addition to categorical diagnosis.
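As a hedged sketch of this continuous-measure approach (using support vector regression in place of RVM, since the latter is less widely available; the data and score here are synthetic), out-of-sample predictions of a clinical score can be obtained with cross-validation and compared against the observed scores:

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

rng = np.random.default_rng(1)
n_subj, n_feat = 120, 30
X = rng.normal(size=(n_subj, n_feat))              # e.g., connectivity features
w = rng.normal(size=n_feat)                        # hypothetical brain-score weights
score = X @ w + rng.normal(scale=2.0, size=n_subj) # synthetic symptom score

# every prediction is made by a model that never saw that subject
model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=1.0))
pred = cross_val_predict(model, X, score, cv=KFold(5, shuffle=True, random_state=0))
r = np.corrcoef(score, pred)[0, 1]                 # out-of-sample brain-score correlation
```

Reporting the cross-validated correlation (or mean absolute error) rather than in-sample fit is what makes such regression results comparable across studies.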

| Prediction of disease risk (prodromal state)
While challenging, early identification of individuals at high risk of future mental illness is critical for delaying or preventing disease progression. Since most mental illnesses typically have an onset in adolescence or early adulthood (Kessler et al., 2005), early detection could delay, or even prevent, future onset of these severe illnesses in at-risk individuals (Guo, Su, et al., 2014). Also, using machine learning methods, Fan and colleagues explored structural endophenotypes in unaffected family members of SZ patients and reported that family members had structural profiles highly overlapping with those of SZ patients (Fan et al., 2008).

| Prediction of disease onset and treatment outcome/responses
Typically, the prediction approach is applied to predicting disease onset. Traditional univariate analyses assume independent relationships among brain regions (e.g., in functional connectivity), and the univariate approach does not allow estimation of stimulus effects at multiple brain locations. Multivariate neuroimaging approaches take into account the full spatial pattern of brain activity simultaneously at multiple spatial locations, and can detect subtle, spatially distributed measures of brain activity that are not captured by univariate approaches (Allen et al., 2011; Habeck et al., 2005; Habeck et al., 2008; J. Liu & Calhoun, 2014; McIntosh & Lobaugh, 2004; Moeller & Strother, 1991; Narayanan et al., 2015; Sui, Adali, Yu, Chen, & Calhoun, 2012). In contrast to the traditional univariate, model-based approach, which cannot directly address interactions between voxels/regions, the multivariate approach estimates the correlation or covariance of activation across brain regions. Multivariate results can also be more reliably interpreted as a signature of underlying brain networks.
Recent multivariate neuroimaging methods make it possible to analyze the relationship between a stimulus and responses measured simultaneously at many locations, such as spatial response patterns. Examples of multivariate machine learning methods for mental illness diagnosis include classification of SZ using structural MRI data, with classification accuracies ranging from 81% to 93% (Gould et al., 2014; Greenstein et al., 2012; Kawasaki et al., 2007; Sun et al., 2009; Yoon et al., 2008).

| Multimodal studies
Although neuroimaging techniques have become popular tools to identify mental illness-related biomarkers, each imaging technique has its limitations (S. Liu, Cai, et al., 2015). These modality-specific limitations can be partly overcome with multimodal neuroimaging, which combines data obtained from multiple techniques, such as EEG, structural magnetic resonance imaging (sMRI), and fMRI, and provides more informative and reliable results on brain structure and dynamics than unimodal approaches. Multimodal neuroimaging is a relatively new and rapidly expanding field that integrates data from different modalities to understand the pathophysiology of mental illness, for instance by linking genomic variation to brain function, structure, and connectivity. Studies have combined data from rs-fMRI or task-based fMRI and sMRI (Cabral et al., 2016; J. Ford, Shen, Makedon, Flashman, & Saykin, 2002; Qureshi, Oh, Cho, Jo, & Lee, 2017; Yang, He, & Zhong, 2016), fMRI and single nucleotide polymorphisms (SNP; a genomic feature; Yang et al., 2010), and rs-fMRI and MEG, while only a few studies combined data from three or more modalities (Sui et al., 2014), with accuracies ranging from 75% to 100%. Other recent data fusion advances include the integration of multiple task-based fMRI datasets (D. I. Kim et al., 2010; Sui, Adali, Pearlson, & Calhoun, 2009; Sui et al., 2015) from the same participants, where common versus specific sources of activity were identified to a greater degree than with conventional general linear model-based approaches. Using a Fisher linear discriminant classifier, Ford and colleagues classified SZ and HC based on task-based fMRI data with 78% accuracy and sMRI data with 52% accuracy; however, the combined multimodal data (fMRI and sMRI) yielded the highest accuracy of 87% (J. Ford et al., 2002).
Another recent multimodal neuroimaging study by Yang and colleagues integrated rs-fMRI-based connectivity features and sMRI-based structural features extracted using ICA, and used an SVM classifier to compare unimodal versus multimodal accuracy (Yang, Chen, et al., 2016). Results showed that multimodal features achieved higher accuracy (77.91%) than the single-modality accuracy (72.09%). Using multimodal sMRI and rs-fMRI data, Cabral and colleagues classified SZ patients and HC individuals with 75% accuracy, where classification based on multimodal features outperformed both unimodal predictions (69.7% accuracy using sMRI and 70.5% using rs-fMRI; Cabral et al., 2016). Qureshi and colleagues employed a similar approach to classify SZ patients and HC individuals using combined rs-fMRI and sMRI data in a larger sample, and achieved a 10-by-10-fold nested cross-validated prediction accuracy of 99.29% (Qureshi, Oh, Cho, et al., 2017). Note that, in order to use as much training data as possible and mitigate the sample size issue, the framework used nested CV without novel data for testing, which might have introduced classification bias and inflated the prediction accuracy. Methodological limitations aside, these and other studies show the potential of leveraging multimodal imaging data. However, more robust multimodal fusion approaches and validations are required before they can be made available for clinical purposes.
Besides multimodal integration of MRI modalities, recent studies have also combined neuroimaging data with non-neuroimaging features, such as genomic data (e.g., SNPs), and with DTI, MEG, and EEG, for classification of mental illness. Using ICA and an SVM-based classifier ensemble (SVME) in a relatively small sample with an SNP array, Yang and colleagues classified SZ patients and HC with 73.88% accuracy for SNP data, 81.63% accuracy for voxel-level fMRI activations, 82.50% accuracy for ICA component-specific fMRI activations, and 87.25% accuracy for combined fMRI-SNP data (Honghui Yang et al., 2010). Further, in a large dataset, using multiple classifiers including a sparse representation-based classifier (SRC), fuzzy c-means (FCM), and SVM, Cao and colleagues discriminated between SZ patients and HC individuals by combining fMRI and SNP data, finding the best classification accuracy of 89.7% with SRC (L. Cao et al., 2014). Another recent study by Cetin and colleagues integrated rs-fMRI and MEG data to distinguish SZ from HC, and found that the best performance, 87.91% accuracy, was obtained with an ensemble classifier (Figure 8). Using a data fusion technique known as mCCA + jICA and multiple classifiers, Sui and colleagues integrated features from rs-fMRI, sMRI, and DTI (i.e., FA) to classify SZ patients and HC individuals, and achieved a maximum classification accuracy of over 90% using a radial basis function support vector machine (RSVM) on DTI (FA) and sMRI (gray matter) features.
Further, using the same data fusion technique (i.e., mCCA + jICA) and features from rs-fMRI, sMRI, and EEG, Sui and colleagues applied an SVM classifier with recursive feature elimination (SVM-RFE), obtaining 91% accuracy on training data and a 100% prediction rate with the combination of all modalities for classifying SZ patients and HC individuals (Sui et al., 2014). To classify ultra-high-risk individuals for psychosis, first-episode psychosis, and HC, Pettersson-Yeo and colleagues used a multi-step data fusion approach that included an unweighted sum of kernels, multi-kernel learning, prediction averaging, and majority voting, and obtained 86.33% accuracy by combining features from DTI and fMRI.
The results of the above-mentioned studies are encouraging for using multimodal neuroimaging in classification of mental illness, suggesting that data fusion methods combined with advanced machine learning techniques present a promising direction for mental illness prediction.
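The simplest fusion strategy underlying several of these studies, feature concatenation ("early fusion"), can be sketched as follows; synthetic data stand in for two modalities, and the feature counts and effect sizes are arbitrary assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
n = 100
y = np.repeat([0, 1], 50)
# two synthetic "modalities", each carrying a partly independent group effect
smri = rng.normal(size=(n, 40)); smri[y == 1, :5] += 0.8
fmri = rng.normal(size=(n, 60)); fmri[y == 1, 5:10] += 0.8

def cv_accuracy(X):
    # z-scoring inside the pipeline keeps modalities on a common scale
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    return cross_val_score(clf, X, y, cv=5).mean()

acc_smri = cv_accuracy(smri)
acc_fmri = cv_accuracy(fmri)
acc_fused = cv_accuracy(np.hstack([smri, fmri]))  # early fusion by concatenation
```

Standardizing each modality before concatenation matters because otherwise the modality with the larger numeric range dominates the SVM margin; the fused model benefits only when the modalities carry complementary signal.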

| Multi-class classification for disease subtype to reduce diagnoses heterogeneity
While the case (i.e., patient with mental illness) versus control diagnostic approach has been successfully implemented in the existing mental illness prediction literature, it does not address the differential diagnosis aspect of mental illness prediction (i.e., distinguishing between illnesses with overlapping symptoms or subgroup diagnosis).
Traditional case versus control models ignore this heterogeneity. As discussed above, one of the main limitations of traditional case-control prediction is binary disease characterization, where test samples are assigned to either the case or the control category. This approach overlooks the associated disease heterogeneity, commonly described in terms of disease subtypes. Many heterogeneous mental illnesses, including autism and SZ, are defined as spectrum disorders (i.e., a continuum) with multiple disease etiologies lying under the same diagnostic category. While it is common practice to classify such spectrum-like disorders using the generic category to find diagnostic biomarkers, a major issue in the mental illness diagnostic procedure is the lack of differential diagnosis of patients across disease subtypes. Accurate diagnosis of disease subtype is critical for choosing the appropriate course of treatment. For example, in the case of SZ, patients can exhibit similar cognitive deficits but with variable magnitude.
Therefore, to emphasize the phenotypic heterogeneity in SZ, two major subtypes with different genetic and cognitive profiles have been introduced: (a) cognitive deficit and (b) cognitively spared (Green et al., 2013;Jablensky, 2006). However, differential diagnosis of SZ subtypes has been rarely studied, due to limited sample size. The number of subjects in each disease subtype is small in most of the existing datasets, which limits the ability to develop robust subtype predictor to accurately differentiate them. Ingalhalikar and colleagues proposed an unsupervised spectral clustering approach using multi-edge graphs derived from a structural connectivity network among 78 ROIs to classify subtypes of autism and SZ (Ingalhalikar et al., 2012).
Among the surveyed studies in this review, only a few considered the important area of automatic differential diagnosis. Costafreda and colleagues used fMRI with a verbal fluency task for subject-level classification of SZ, BP, and HCs. More recently, using gray matter densities, Schnack and colleagues proposed a classification framework for SZ, BP, and HCs (Schnack et al., 2014). Using gray matter maps from structural MRI, Koutsouleris and colleagues classified SZ versus mood disorder. Ota and colleagues combined volumetric measures derived from structural MRI with FA from dMRI in selected ROIs to classify SZ versus MDD (Ota et al., 2013). Moreover, using gray matter volumes of the caudate and ventral diencephalon, Sacchet, Livermore, et al. (2015) proposed an algorithm to classify MDD, BP, and remitted MDD patients.
Although the limited sample size of most current datasets makes disease-subtype prediction challenging, as the number of subjects in each subtype is small, this work shows the potential for a paradigm shift in the predictive modeling of spectrum-like mental illness beyond discrete, case-control diagnosis.
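Subtype discovery is usually framed as unsupervised clustering of patients' imaging features. A minimal sketch with synthetic data follows (k-means in place of the spectral clustering used by Ingalhalikar and colleagues; the two-subtype structure is simulated, and agreement with the "true" subtypes is scored up to label permutation):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(3)
# synthetic patients drawn from two latent subtypes with distinct feature profiles
true_subtype = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 20))
X[true_subtype == 0, :4] += 1.2    # hypothetical subtype-A profile
X[true_subtype == 1, 4:8] += 1.2   # hypothetical subtype-B profile

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(true_subtype, labels)  # 1.0 = perfect recovery, ~0 = chance
```

In practice the number of clusters is unknown and must itself be selected (e.g., via silhouette or stability criteria), which is one reason small per-subtype samples make this problem hard.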

| Advanced algorithms for brain-based prediction
Recent advanced machine learning algorithms have shown tremendous potential for neuroimaging-based mental illness prediction. For example, a recent study proposed a novel parallel group ICA-based framework to jointly estimate the association between functional network variability and structural covariation in SZ, and to predict several cognitive domain scores based on these associated functional/structural features (Qi et al., 2019; Figure 11). Briefly, by jointly incorporating temporal-domain features from fMRI (extracted with group ICA) and structural MRI features within a parallel group ICA algorithm, functional network variability and structural covariation were jointly estimated to identify between-modality linkage. Using real neuroimaging data, a significant functional-structural component pair was identified that captured group differences in both imaging modalities and further correlated with cognitive scores, suggesting that multimodal brain features can predict multiple cognitive scores. Another recent SZ study proposed a multimodal fusion-with-reference algorithm combining multi-site canonical correlation analysis with reference and joint ICA (MCCAR+jICA) to identify co-varying multimodal feature patterns guided by a reference (specifically, working memory performance) in a three-way data fusion (fMRI, sMRI, and dMRI; Qi et al., 2018). Results identified several brain regions previously linked with working memory deficits in SZ, suggesting that the novel MCCAR+jICA method has great potential for identifying biomarkers of severe mental disorders, such as SZ. Further, Sui and colleagues implemented a constrained fusion approach to predict cognition in SZ (Figure 12).
The assessment of cognition was measured using the MATRICS Consensus Cognitive Battery (MCCB), and multi-set canonical correlation analysis was used to explore the linkage between MCCB scores and brain abnormalities as measured by fractional amplitude of low-frequency fluctuations (fALFF) from resting fMRI, gray matter density (GM) from structural MRI, and FA from dMRI. Findings from this study suggested that the associated functional and structural deficits might be linked to cognitive impairments in SZ. Other recent data-driven advances include biclustering and triclustering ICA approaches that use spatial and temporal variance to cluster mental disorders into homogeneous subgroups. For instance, using gray matter concentration (GMC) from SZ patients, Gupta and colleagues performed source-based morphometry (SBM) decomposition followed by subtype component reconstruction using group information-guided ICA (GIG-ICA), and identified two subtypes (i.e., two distinct subsets of subjects; Gupta et al., 2017). Also, Rahaman and colleagues used structural MRI features from SZ patients to perform multi-component and symptom bi-clustering, and further extended this approach into a triclustering framework using dynamic FNC measures to identify disease subtypes (Rahaman, Damaraju, & Calhoun, 2019).

| Functional connectivity measures for brain-based prediction
In recent years, brain connectivity studies using neuroimaging have become popular to investigate the associations among brain networks.
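Most FC-based prediction studies build their feature vectors the same way: correlate regional time series and keep the upper triangle of the region-by-region correlation matrix. A minimal sketch (the dimensions here are arbitrary; 90 ROIs mirrors the AAL-style atlases used in several surveyed studies):

```python
import numpy as np

def fc_features(timeseries):
    """Vectorize the upper triangle of the correlation matrix computed
    from one subject's (timepoints x regions) data."""
    corr = np.corrcoef(timeseries.T)          # regions x regions correlation matrix
    iu = np.triu_indices_from(corr, k=1)      # upper triangle, diagonal excluded
    return corr[iu]

rng = np.random.default_rng(5)
ts = rng.normal(size=(150, 90))   # 150 timepoints, 90 ROIs
feat = fc_features(ts)            # 90*89/2 = 4005 connectivity features per subject
```

Stacking one such vector per subject yields the subjects-by-edges matrix that the classifiers discussed below take as input; the quadratic growth of edge count with ROI count is a major driver of the dimensionality problems discussed later.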
Arribas and colleagues used ICA spatial maps as features with a probabilistic Bayesian classifier to discriminate among SZ, BP disorder, and healthy individuals, achieving average three-way correct classification rates of 70-72% (Arribas et al., 2010). A few recent studies have also performed classification of mental illnesses with overlapping symptoms, such as SZ, schizoaffective disorder, and BP disorder with psychosis (Cardno & Owen, 2014; Cosgrove & Suppes, 2013; Pearlson et al., 2016). Du and colleagues used GIG-ICA to extract resting-state brain networks and classified SZ patients, BP disorder with psychosis, schizoaffective disorder with manic episodes, schizoaffective disorder with depressive episodes exclusively, and healthy individuals. FNC features were selected using the RFE method and a five-class SVM classifier was used for training, achieving 68.75% classification accuracy (Figure 14). Murdaugh and colleagues used seed-based FC and whole-brain FC in a logistic regression classifier to classify autism and reported 96.3% accuracy with both whole-brain and seed-based FC features (Murdaugh et al., 2012). Further, using FC between three ROI sets from the ABIDE dataset, Plitt and colleagues applied RFE-based feature selection with both logistic regression and SVM classifiers to classify autism and achieved an overall 76.7% accuracy (Plitt et al., 2015). In a multimodal classification study of autism, Deshpande and colleagues used FC estimates and fractional anisotropy (from DTI data) and obtained a maximum classification accuracy of 95.9% with a recursive cluster elimination-based SVM classifier (Deshpande et al., 2013). Another recent resting-state FC-based (between 90 ROIs) prediction study used a deep learning classifier (probabilistic neural network [PNN]) for classification of ASD and achieved a classification accuracy of about 90% (Iidaka, 2015).
Interestingly, FC between signals in different frequency bands was used as features in a recent ASD classification study, where the Slow-4 band (0.027-0.073 Hz) was found to capture the most discriminative features (H. Chen et al., 2016). To discriminate ADHD from healthy individuals, Zhu and colleagues used PCA-based Fisher discriminative analysis (PC-FDA) with regional homogeneity (ReHo) from fMRI data as features, and reported a classification accuracy of 85% (C.-Z. Zhu et al., 2008). Another study by Wang and colleagues also used ReHo from resting-state fMRI data in an SVM classifier, and obtained a classification accuracy of 80% for discriminating ADHD from healthy individuals. Several other studies also used FC measures to successfully classify ADHD versus HCs (Dey et al., 2014; D. Fair et al., 2013; João Ricardo Sato et al., 2012). Moreover, leveraging a large-scale, multi-site resting-state fMRI study of SZ (i.e., the human connectome project), another recent study proposed an ICA-based preprocessing pipeline to extract FNC and spatial map-based imaging features as potential biomarkers; compared to FNC-based features, spatial maps showed better classification performance in all experiments.

Almost all of the papers surveyed in this review performed group-level discriminative analysis followed by subject-level classification. Many of these studies first performed discriminative analyses using statistical tests (e.g., t tests) to extract significant features showing group differences, and then used these features for subject-level classification. However, using the test dataset together with the training dataset during feature selection, extraction, or reduction introduces additional bias into the predictive model.
This process of selecting features based on group differences identified from the whole sample can cause a "double dipping" issue that may lead to biased performance (Bishop, 2006; Demirci, Clark, Magnotta, et al., 2008). Another major issue with group difference-based feature selection is that significance levels are based on the p values of statistical tests, which are not necessarily linearly associated with the discriminative power of the model. An alternative to feature selection based on univariate group-level statistical tests is the use of filter and wrapper methods (Blum & Langley, 1997; Hall & Smith, 1998). Filter methods assign a score to each feature, from which a number of top features can be selected, while wrapper methods treat the selection of a feature set as a search problem. Supervised feature selection methods have been most commonly used in the existing literature. However, since feature selection performance improves with sample size (Jain & Zongker, 1997), classification using a supervised feature selection algorithm on a small dataset may yield suboptimal performance. Further, unsupervised feature reduction methods, such as PCA, have also been applied in neuroimaging studies. As suggested by Osborne and Costello, unsupervised feature reduction on larger datasets could provide additional information for accurately generalizing population trends, which may lead to a more efficient model (Osborne & Costello, 2004).
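The "double dipping" bias is easy to demonstrate: on pure-noise data, where true accuracy is 50%, selecting features on the full sample before cross-validation inflates the estimated accuracy, while performing the same selection inside each training fold does not. A sketch using scikit-learn (data and variable names are ours):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = rng.normal(size=(40, 2000))   # pure noise: no feature truly separates the groups
y = np.repeat([0, 1], 20)

# Biased: select "significant" features on ALL data, then cross-validate
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_sel, y, cv=5).mean()

# Unbiased: selection is refit inside each training fold via a pipeline
pipe = make_pipeline(SelectKBest(f_classif, k=20), SVC(kernel="linear"))
unbiased = cross_val_score(pipe, X, y, cv=5).mean()
```

The biased estimate lands far above chance even though the data contain no signal, which is exactly the failure mode of whole-sample t-test feature selection described above; wrapping selection in a pipeline keeps the test folds untouched.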

| Overfitting
Overfitting refers to a model that fits the training data too well, yielding very good classification performance on the training (i.e., observed) data but poor performance on independent test data (Pereira et al., 2009). Overfitting can be caused by models with a large number of features relative to sample size, or by complex models with many parameters, because such models capture noise in the data in addition to the actual features of interest (Franke et al., 2010; Klöppel et al., 2008). Since neuroimaging datasets on mental illness generally have small samples and many features of interest, predictive models built on these datasets are susceptible to overfitting. The majority of the studies surveyed here performed predictive modeling on a very small number of subjects, which tends to inflate reported accuracies; consistent with this, reported classification accuracy generally decreases as sample size grows. CV and regularization are common approaches for controlling overfitting in predictive modeling of neuroimaging data.
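A short illustration of this small-sample, high-dimension regime: with many more features than subjects and no true signal, a linear SVM can fit the training data almost perfectly while held-out accuracy stays near chance (data are synthetic):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = rng.normal(size=(30, 500))   # 30 "subjects", 500 features, no real group effect
y = np.repeat([0, 1], 15)

clf = SVC(kernel="linear").fit(X, y)
train_acc = clf.score(X, y)      # performance measured on the data used for fitting
test_acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()  # held-out folds
```

The gap between `train_acc` and `test_acc` is the overfitting that cross-validation is designed to expose, and that regularization (e.g., lowering the SVM's C) is designed to reduce.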

| Optimal model selection
In predictive modeling, model selection, more formally known as hyperparameter optimization or tuning, refers to the problem of choosing a set of optimal hyperparameters for a learning algorithm in order to achieve the best performance of that algorithm. Hyperparameter optimization is performed during the training stage, typically within the CV of the training samples. One of the most commonly used classifiers is the SVM, which is designed for binary classification and maximizes the boundary between classes in a high-dimensional space. The linear SVM classifier includes a user-defined soft-margin hyperparameter (often denoted C) that controls the trade-off between errors on the training dataset and margin maximization: a smaller value of C tolerates more training errors in exchange for a larger margin. Nonlinear SVM classifiers include additional hyperparameters depending on the kernel of choice (e.g., sigma/gamma for the RBF kernel, degree for the polynomial kernel). Inefficient hyperparameter optimization can therefore negatively influence model performance.
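A common way to perform this tuning without leaking test-set information is nested CV: an inner grid search selects C and gamma on the training folds only, while an outer loop estimates performance. A minimal sketch (synthetic data; scikit-learn assumed, grid values arbitrary):

```python
# Hedged sketch of nested CV for SVM hyperparameter tuning:
# the inner GridSearchCV picks C and gamma using only training folds,
# while the outer loop yields a less biased performance estimate.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 30))
y = rng.integers(0, 2, 100)

inner = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1]},
    cv=3,                                              # inner loop: selection
)
outer_scores = cross_val_score(inner, X, y, cv=5)      # outer loop: evaluation
print(outer_scores.mean())
```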

| Challenge with reproducibility
In the existing brain-based prediction literature, variability across raw-data processing and analysis streams, feature types, feature selection schemes, choice of classifier, and CV methods may limit the reproducibility of outcomes across independent datasets. Indeed, without a standard approach, the growing flexibility of machine learning pipelines reduces the reliability of replication across studies (Squeglia et al., 2016). Providing optimal diagnostic tools will therefore require standardized, fully documented pipelines and validation on independent datasets.

| Heterogeneity between patients
Another limitation of neuroimaging-based machine learning studies is the substantial heterogeneity that exists between patients. In research-based neuroimaging studies, participants are recruited with well-matched age, sex, or educational background, typically with one particular type of brain pathology. In contrast, participants recruited in clinical settings may present several types of pathology, with variability in disease stage and demographic variables (e.g., age and sex).
As mentioned earlier, classification performance (e.g., accuracy) can be improved by using a larger training sample (Franke et al., 2010; Klöppel et al., 2008), which may also reduce the impact of disease heterogeneity by integrating data across more varied patient populations.
Deep learning models, in turn, behave as black-box systems, which may introduce a lack of transparency during the learning and testing steps (Alain & Bengio, 2016; Yosinski, Clune, Nguyen, Fuchs, & Lipson, 2015). In many cases, it is very difficult to understand the technical and logical bases of the model, and this lack of transparency may limit the interpretability of neuroimaging results. Indeed, the multiple nonlinearities within deep models make it challenging to trace the successive layers of weights back to the original brain data, limiting the ability to localize abnormalities within brain regions (Suk et al., 2015).
Other recurring limitations include small sample sizes (Franke et al., 2010; Klöppel et al., 2009), lack of generalization for diagnostic purposes, inability to address disease heterogeneity, and model overfitting due to poor sample size (Pereira et al., 2009). For optimal evaluation of machine learning methods, therefore, larger sample sizes are required to minimize the variance in estimates of accuracy, sensitivity, specificity, and other performance measures.
To address this limitation within neuroimaging research, multiple ongoing efforts have created dataset repositories. The large-scale or "big data" revolution shows promise for reducing data heterogeneity-related issues in neuroimaging studies (Franke et al., 2010; Klöppel et al., 2008).

| Decentralized repositories
Additionally, to address the legal, ethical, and sociological concerns that might prohibit open data-sharing initiatives, and to avoid re-identification of study subjects, repositories with anonymized raw data are also being established.
Further information on implementation of decentralized algorithms, enhancement of user interface, regression statistic calculation for decentralization, and comprehensive pipeline specifications can be found in Ming et al. (2017).

| Large-scale studies
Moreover, several multi-site studies have also started to share neuroimaging data in a collaborative setup. While these data-sharing initiatives show potential (Milham et al., 2018), several methodological challenges of big data approaches in the field of neuroimaging still need to be addressed.
1 Pooling data across sites introduces scanner- and protocol-related variability. Statistical harmonization methods have been proposed to remove such site effects from dMRI and sMRI data (Fortin et al., 2017; Fortin et al., 2018; Johnson, Li, & Rabinovic, 2007), but further efforts are required to standardize data acquisition parameters across all data-sharing sites for more ideal data pooling. For example, when sMRI data were analyzed, inconsistency in field strength and imaging sequence design produced significant systematic differences in multi-site studies (Fennema-Notestine et al., 2007; Stonnington et al., 2008). Without standardized acquisition parameters, the variability observed across subjects is increasingly driven by variability in scanners and imaging parameters, which could introduce spurious disease-specific effects. Another study of roughly 10,000 subjects showed similar field-strength-related differences between 1.5 T and 3 T scanners, alongside highly consistent age-associated changes (Panta et al., 2016). By designing and maintaining common study protocols across all contributing sites, and by developing analytic approaches that are more robust to site effects (e.g., end-to-end deep learning to predict and remove site effects), the chances of observing true disease-specific effects increase. The benefits of big data are many, and with improved integration between participating sites in terms of acquisition and other parameters, these potential inhomogeneities can be mitigated.
2 Another major issue with "big" imaging data is statistical. These rich datasets are designed to explore a variety of hypotheses, and as researchers investigate multiple imaging modalities, many explore alternative models in search of significance without proper multiple-comparison correction or a CV framework. This makes CV and replication even more essential.
Similarly, effect sizes should also be reported: with null hypothesis testing, big data can produce highly significant results for tiny effect sizes that may not be useful for any individual subject. The use of robust test statistics is also important; for example, nonparametric approaches such as permutation-based tests can be incorporated when examining multiple modalities (Winkler et al., 2016).
3 With big data comes the "curse of dimensionality". Relative to the number of observations, high-dimensional data contain many features, making them susceptible to sparsity, multicollinearity, computational cost, model complexity, and overfitting. One potential solution is to apply feature selection or reduction approaches, such as principal component analysis, prior to analyzing and modeling the data.
4 Neuroimaging data sharing through big data consortiums has raised ethical and privacy concerns, for example, the possibility of facial reconstruction from structural images. This concern can be addressed by removing recognizable facial features with a de-facing approach prior to data sharing. Other ethical concerns include the risk of identifying subjects from their geographical location, since these large-scale studies are typically conducted within a particular region. Adopting a multilayered, restricted data-sharing approach provides more controlled access to the full dataset, reducing the risk of subject identification. Another approach is to use federated learning or decentralized analysis; for example, the COINSTAC tool allows one to perform regression as well as more advanced voxel-wise and machine learning-based analyses in a decentralized framework without requiring the data to be shared.
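The permutation-based testing and effect-size reporting recommended in point 2 above can be sketched as follows (synthetic feature values, hypothetical group sizes): group labels are shuffled to build a null distribution of mean differences, and Cohen's d is reported alongside the p value:

```python
# Sketch of a nonparametric permutation test for a group difference in one
# connectivity feature, reporting an effect size (Cohen's d) with the p value.
import numpy as np

rng = np.random.default_rng(3)
patients = rng.normal(0.3, 1.0, 40)   # hypothetical feature values
controls = rng.normal(0.0, 1.0, 40)

observed = patients.mean() - controls.mean()
pooled = np.concatenate([patients, controls])

n_perm, count = 5000, 0
for _ in range(n_perm):
    rng.shuffle(pooled)               # relabel groups at random
    diff = pooled[:40].mean() - pooled[40:].mean()
    if abs(diff) >= abs(observed):
        count += 1
p_value = (count + 1) / (n_perm + 1)  # add-one correction

# Cohen's d using the pooled standard deviation
sd = np.sqrt((patients.var(ddof=1) + controls.var(ddof=1)) / 2)
print(f"p = {p_value:.4f}, d = {observed / sd:.2f}")
```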
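As a toy illustration of the decentralized idea in point 4 (this is not the COINSTAC implementation; the sites, sizes, and coefficients are invented), each site can share only the sufficient statistics of a linear regression, never raw subject data, and an aggregator solves the pooled normal equations:

```python
# Toy decentralized regression: each "site" transmits only X'X and X'y;
# the aggregate solution equals pooled-data ordinary least squares.
import numpy as np

rng = np.random.default_rng(5)
beta_true = np.array([2.0, -1.0, 0.5])

def site_stats(n):
    X = rng.standard_normal((n, 3))
    y = X @ beta_true + 0.1 * rng.standard_normal(n)
    return X.T @ X, X.T @ y           # only these aggregates leave the site

stats = [site_stats(n) for n in (30, 50, 20)]   # three sites
XtX = sum(s[0] for s in stats)
Xty = sum(s[1] for s in stats)
beta_hat = np.linalg.solve(XtX, Xty)
print(np.round(beta_hat, 2))          # close to beta_true
```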

| Standard machine learning competitions in neuroimaging
Further, in 2018, the predictive analytics competition (PAC; https://www.photon-ai.com/pac) challenged participants to discriminate MDD patients from HCs using structural MRI data. The competition provided training data from 759 MDD patients and 1,033 HCs and unlabeled test data from 448 subjects collected across three publicly available sites; the winners achieved a classification accuracy of 65%. Such standardized machine learning competitions demonstrate the potential of brain-based mental illness prediction, since held-out test sets allow accurate, unbiased estimates of predictive power.

| Benefits of leveraging dynamic connectivity features
Until recently, functional connectivity was assumed to be relatively stable over the scanning session (usually several minutes). While convenient for analysis and interpretation, this oversimplified assumption has been challenged by several studies of time-varying multivariate connectivity patterns (Sakoğlu et al., 2010), as well as by studies using time-frequency analysis methods (C. Chang & Glover, 2010). Several other studies have delved into time-resolved connectivity measures and their successful application to biomarker identification using dynamic connectivity features (Calhoun, Miller, Pearlson, & Adalı, 2014; Rashid et al., 2018; Rashid, Damaraju, Pearlson, & Calhoun, 2014; Zalesky & Breakspear, 2015). These studies report that brain functional connectivity can vary within a short period (e.g., tens of seconds) and can capture connectivity disruptions in disease populations.
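The sliding-window idea underlying most of these dynamic connectivity analyses can be sketched as follows (two synthetic component time courses; the window length and stride are arbitrary choices): correlation is recomputed within short overlapping windows, producing a connectivity time series rather than a single static value.

```python
# Sketch of sliding-window dynamic functional connectivity between two
# synthetic time courses: one static correlation vs a windowed series.
import numpy as np

rng = np.random.default_rng(6)
T, win, step = 300, 40, 5                 # time points, window, stride
ts1 = rng.standard_normal(T)
ts2 = 0.5 * ts1 + rng.standard_normal(T)  # partially coupled signal

static_fc = np.corrcoef(ts1, ts2)[0, 1]
dynamic_fc = [
    np.corrcoef(ts1[t:t + win], ts2[t:t + win])[0, 1]
    for t in range(0, T - win + 1, step)
]
print(f"static r = {static_fc:.2f}, windowed r range = "
      f"[{min(dynamic_fc):.2f}, {max(dynamic_fc):.2f}]")
```

In dFNC studies, the windowed connectivity matrices obtained this way are typically clustered (e.g., with k-means) into recurring connectivity "states".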
Only a few studies have utilized dynamic brain connectivity features to predict mental illness. Using static and dynamic connectivity features, Rashid and colleagues developed a classification framework to distinguish SZ, bipolar, and healthy subjects. The classification performance of static, dynamic, and combined static-and-dynamic connectivity features was compared within a 10-fold CV framework; dynamic FNC-based classification outperformed the static connectivity features.

| Fusion of dynamic connectivity and other data types
Dynamic FNC measures estimated from fMRI data can be further integrated with other data types and modalities, for example, genomic and structural MRI data, to leverage inter-modality features for disease characterization. In a novel imaging-genomic framework, Rashid and colleagues recently modeled the association between dynamic FNC states and genomic features to examine SZ-related inter-modality abnormalities (Figure 17). Specifically, the parallel ICA algorithm (J. Liu et al., 2009) was utilized to combine genetic variants (i.e., single nucleotide polymorphisms [SNPs]) with functional features from fMRI data, namely subject-specific states revealed from the dynamic FNC data using a sliding-window and clustering approach (Rashid et al., 2014). In another study, Abrol and colleagues proposed an mCCA+jICA framework to fuse dynamic FNC from fMRI data and gray matter maps from structural MRI data (Abrol, Rashid, Rachakonda, Damaraju, & Calhoun, 2017). The framework identified associated changes in both modalities, highlighting significantly disrupted links between dynamic FNC and gray matter volumes in SZ patients. Results from this study showed significant group differences in gray matter maps, particularly in the superior parietal lobule, precuneus, postcentral gyrus, medial/superior frontal gyrus, superior/middle temporal gyrus, insula, and fusiform gyrus, and further highlighted alterations in several interregional connectivity strengths in SZ patients. In the field of brain-based prediction of mental illness, fusion approaches using dFNC and other features could increase the discriminative power of the models, although future studies are required to confirm their utility in this regard.

| SUMMARY AND CONCLUSION
Recent brain-based mental illness predictome studies have shown promising results, although some require further validation due to small sample sizes. While many advanced algorithms have been developed and applied to mental illness prediction, many challenging issues must be resolved before these methods can be applied in clinical settings. In this work, we comprehensively reviewed existing brain-based prediction studies in several mental illnesses, including SZ, mood disorders (i.e., MDD and BP), ASD, ADHD, SAD, OCD, PTSD, and substance dependence. We also highlighted a number of existing approaches and future research directions. A major challenge in the field is the phenotypic heterogeneity that characterizes psychiatric disorders; however, recent approaches have started to address disease subtypes, which can improve disease prediction and either provide biological support for existing diagnostic categories or motivate their revision. Another major challenge is the relatively small sample size reported across most studies. Without more robust validation, it is unclear how well these results will generalize to independent datasets.
However, recent data-sharing initiatives have started to improve the sample size issue by offering adequate data to develop more robust and improved prediction models.
While brain-based classification has proven challenging, considerable progress has been made in recent years. With the accelerating growth of large volumes of patient data and data-sharing initiatives in neuroimaging and medicine, we anticipate that diagnostic tools operating on comprehensive biomarker profiles derived from multiple modalities will become available for specific use cases in the near future. With more sophisticated deep learning models integrated with large-scale data, we believe that predictive modeling tools will soon transition from the "proof-of-concept" stage to the "ready for clinical implementation" stage. Further, while patient characteristics appear more homogeneous within relatively small samples (Schnack & Kahn, 2016), failing to capture disease-specific variability, the power of "big" brain data and advanced machine learning algorithms now makes it possible to explore the heterogeneity within a disease (i.e., disease subtypes). It is common practice for clinicians to assume diagnostic homogeneity (i.e., that patients with similar clinical symptoms belong to the same broader diagnostic category).
We can also evaluate this within imaging data using, for example, N-way clustering approaches to identify subgroups of homogeneous data, followed by evaluation of clinical phenotypes. One recent example shows this in SZ, finding enhanced sensitivity to group differences and stronger links to symptom scales than is typically found (Rahaman, Damaraju, & Calhoun, 2019; Rahaman, Turner, et al., 2019). By identifying complex and heterogeneous brain-based disease patterns, predictive modeling can potentially be used clinically for more personalized medicine targeted at specific subtypes or clusters of a disorder with varying symptomatology and disease progression. However, this can only be achieved by integrating clinical and technical expertise, made possible by a back-and-forth feedback loop between experts in both fields, until the tools are optimized and simplified for clinical application. Finally, we expect the brain-based predictome to progress beyond categorical diagnosis (i.e., identifying disease groups) to take into account key continuous measures, such as cognition and behavior, providing a more comprehensive diagnostic approach. We look forward to seeing the full potential of the brain-based predictome realized.
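A minimal sketch of the clustering-based subtyping idea (synthetic features; the two latent subgroups, feature count, and cluster number are all hypothetical): patients are clustered on imaging features, after which the resulting subgroups would be compared on clinical phenotypes.

```python
# Illustrative subtyping sketch: cluster patients' imaging features into
# candidate subgroups; clinical comparisons across clusters come afterward.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
grp_a = rng.normal(0.0, 1.0, (40, 10))   # latent subgroup A
grp_b = rng.normal(1.5, 1.0, (40, 10))   # latent subgroup B, shifted mean
X = StandardScaler().fit_transform(np.vstack([grp_a, grp_b]))

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(np.bincount(labels))               # sizes of the recovered clusters
```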
There are several limitations to this work. In this survey, we restricted our search to MRI-based, English-language journal articles on specific mental disorders. We did not cover prediction studies based on other modalities, such as EEG, MEG, and PET. We also narrowed our focus to mental disorders and did not consider other brain disorders such as Alzheimer's disease, mild cognitive impairment, and Parkinson's disease. Moreover, we mostly reported the best-performing features, classifiers, and experimental setups.

ACKNOWLEDGMENT
This work was supported by NIH grants 1R01EB020407 and 1R01MH118695.

CONFLICT OF INTEREST
The authors declare no conflict of interest related to this work.

DATA AVAILABILITY STATEMENT
Data sharing is not applicable to this article as no new data were created or analyzed in this study.