Regression‐based machine‐learning approaches to predict task activation using resting‐state fMRI

Abstract Resting-state fMRI has shown the ability to predict task activation on an individual basis by using a general linear model (GLM) to map resting-state network features to activation z-scores. The question remains whether the relatively simplistic GLM is the best approach to accomplish this prediction. In this study, several regression-based machine-learning approaches were compared, including GLMs, feed-forward neural networks, and random forest bootstrap aggregation (bagging). Resting-state and task data from 350 Human Connectome Project subjects were analyzed. First, the effect of the number of training subjects on prediction accuracy was evaluated. In addition, the prediction accuracy and Dice coefficient were compared across models. Prediction accuracy increased with the number of training subjects up to 200; however, an elbow in the prediction curve occurred around 30-40 training subjects. All models performed well: correlation matrices, which display the correlation between actual and predicted task activation across subjects, exhibited a strong diagonal trend for all tasks. Overall, the neural network and random forest bagging techniques outperformed the GLM, although these approaches require additional computing power and processing time. These results show that, while the GLM performs well, resting-state fMRI prediction of task activation could benefit from more complex machine-learning approaches.

able to accurately predict language lateralization. In a follow-up study, a GLM was used to predict task activation in a clinical population with lower spatial and temporal resolution data (Parker Jones et al., 2017).
Using a covert category fluency task, they were able to accurately predict patients' activation using a model trained on a group of healthy control subjects.
The question remains as to whether the GLM approach is optimal for this prediction. The GLM is a multiple linear regression approach and is typically used in fMRI to compute task activation. For fMRI, one or more regressors model the task, and additional nuisance regressors (e.g., motion parameters) can be added to remove unwanted signals from the data. As implied by the name, the GLM assumes a linear dependence between the variables. It also assumes that the errors are normally distributed, that the error variance is the same at each measurement, and that the errors are uncorrelated. These assumptions are not necessarily true for fMRI data. For the GLM prediction, the regressors are the rs-fMRI-derived features.
Other regression-based learning models exist, including neural networks (NNs) and ensemble approaches. A typical feed-forward NN comprises an input layer consisting of one or more features to be used for prediction, one or more hidden layers, and an output layer (Sperber & Karnath, 2017). Each hidden layer contains hidden nodes, or neurons, that are connected to the nodes in the subsequent layer via weighted edges. The input layer is connected to the first hidden layer, and the output layer is connected to the last hidden layer. NN machine-learning approaches can model nonlinearities in the data. In ensemble learning, multiple models are combined to improve learning results. One type of ensemble learning that can be used for regression is random forest bootstrap aggregation (RFBag), also referred to as "bagging." Bagging can be used in combination with decision trees (Breiman, 2017), where data are split based on the values of the input features and terminate in leaves that contain the predicted values. In bagging, several different trees are trained on different subsets of randomly chosen data to create a so-called forest of trees, and the values from the different trees are then averaged (Breiman, 1996, 2001).
This reduces the variance. In the RFbag approach, the features are also randomly selected at each split in the decision tree.
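As a concrete illustration, bootstrap aggregation over regression trees can be sketched as follows. This is a minimal example assuming scikit-learn and NumPy; the toy data and tree settings are illustrative and not those used in the study.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                       # toy features
y = X[:, 0] - 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

n_trees = 25
all_preds = []
for _ in range(n_trees):
    # Bootstrap: sample training rows with replacement.
    idx = rng.integers(0, len(X), size=len(X))
    # max_features="sqrt" randomizes the features considered at each split,
    # as in the random forest variant of bagging.
    tree = DecisionTreeRegressor(max_features="sqrt", random_state=0)
    tree.fit(X[idx], y[idx])
    all_preds.append(tree.predict(X))

# Averaging the trees' predictions reduces the variance of the ensemble.
y_bagged = np.mean(all_preds, axis=0)
```

Each individual tree overfits its bootstrap sample; the average over many such trees is the bagged prediction.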
When using the GLM approach, Tavor et al. (2016) found that not all tasks performed equally. In general, although variability existed within domains, the language, relational, and working memory domains performed better than the motor domain. Thus, one goal of this study was to evaluate whether the use of alternative approaches could improve the poorer performing domains.
In their study, Tavor et al. (2016) used only 98 subjects from the much larger HCP dataset to train and predict via a leave-one-out approach. Here, we have repeated the study by Tavor et al. using a larger subset of 350 HCP subjects. We examined the effect of the number of training subjects on the accuracy of the predictions. Also, we expanded on Tavor et al.'s method by comparing the NN and bagging approaches to the GLM approach. We evaluated all approaches in terms of their ability to predict task activation from rs-fMRI based on the spatial correlation and overlap between the actual and predicted maps. Finally, we compared the spatial correlation and Dice coefficient (DC) between the approaches.

| Subjects
Subjects were selected from the Human Connectome Project dataset (www.humanconnectome.org; Van Essen et al., 2013). In total, 350 subjects were selected for this analysis. Subjects were required to have four resting-state scans and data from all seven task domains (described below).

| HCP imaging
The HCP imaging protocol includes resting-state and task fMRI data.
Task fMRI data also were acquired sagittally with TR/TE = 720/30 ms, 2 × 2 × 2 mm isotropic resolution, and MB factor = 8. As with the resting-state scans, scans with both LR and RL phase-encoding directions were collected. Total imaging time differed by task but was approximately 5 min in general. Tasks spanned seven domains: language, motor, emotion, gambling, relational, social, and working memory. These are described in Barch et al. (2013).
For this study, 25 contrasts across all seven domains were chosen for further analysis and are shown in Table 1. Structural and diffusion-weighted scans were collected but were not used in this study.

| HCP preprocessing
The minimally preprocessed datasets, as described in Glasser et al. (2013), were used for all analyses. All functional data were denoised using FIX (Salimi-Khorshidi et al., 2014). The data were then resampled to a set of 91,282 grayordinates in standard space (CIFTI; Glasser et al., 2013). This structure represents the cortex on a surface and the subcortical space and cerebellum on a voxelwise volume basis.

| Feature extraction
All analyses were performed in MATLAB (2018). Resting-state fMRI features were extracted following the methods outlined in Tavor et al. (2016). First, group features were derived from concatenated rs-fMRI time series across the four scans per subject and across subjects.
Group principal component analysis was performed to reduce the dimensionality to 1,000. Next, group independent component analysis was applied to each hemisphere separately to extract 40 components per hemisphere. Features with LR symmetry were kept, resulting in 34 components per hemisphere. These 34 components were then used in a dual-regression analysis to extract individual features for all 350 subjects (Filippini et al., 2009). In the dual-regression step, first, resting-state data for each subject were normalized to zero mean and a standard deviation equal to one. Next, each of the 34 components was used as a spatial regressor in a GLM, and the temporal signal associated with the network was extracted. This signal was then used as a regressor in a second GLM to find the associated spatial maps (32 subcortical plus the L/R hemisphere maps of the 34 ICA components). The final individual feature maps (X_i) were normalized to zero mean with a standard deviation equal to one and then paired with the corresponding individual z-score maps derived from the task fMRI data (y_i).
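The dual-regression step can be sketched as two least-squares fits. This is a minimal NumPy illustration; the array shapes and variable names are assumptions, not the study's code (which was written in MATLAB).

```python
import numpy as np

rng = np.random.default_rng(1)
V, T, K = 500, 120, 34                     # grayordinates, timepoints, group maps
group_maps = rng.normal(size=(V, K))       # group ICA spatial maps (toy data)
data = rng.normal(size=(V, T))             # one subject's normalized rs-fMRI

# GLM 1: spatial maps as regressors -> per-network time courses (K x T).
timecourses, *_ = np.linalg.lstsq(group_maps, data, rcond=None)

# GLM 2: time courses as regressors -> subject-specific spatial maps (V x K).
subject_maps, *_ = np.linalg.lstsq(timecourses.T, data.T, rcond=None)
subject_maps = subject_maps.T

# Normalize each individual feature map to zero mean, unit standard deviation.
subject_maps = (subject_maps - subject_maps.mean(axis=0)) / subject_maps.std(axis=0)
```

The normalized columns of `subject_maps` correspond to the individual feature maps X_i paired with the task z-score maps y_i.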
In addition, the brain was parcellated. Unlike the study by Tavor et al., where parcels were based on a group independent component analysis (ICA), a random parcellation scheme was used in this study with 50 parcels per hemisphere. Parcels were contiguous and were created using a Voronoi tessellation from seeds randomly spread across the surface. Parcels did not correspond to any functional or structural brain network.
Three different approaches were used to map the resting-state features to the task data: GLM, NN, and RFBag. For each approach, data from the training subjects were aggregated to form one large dataset, which was used to train a model using the three methods. This model was then applied to the 100 test subjects. These methods are schematized in Figure 1.

| GLM prediction
A GLM was used to determine β coefficients (β_i) for each training subject or group of training subjects by fitting the features (X_i) to the z-scores (y_i) for each task (Equation (1)). A different GLM was fit for each parcel. Thus, each group of training subjects had 100 × 100 beta values, one for each feature and parcel. Trained β values (β_t) were then matrix multiplied by each test subject's feature maps (X_j) to generate predicted activation maps for that test subject (y_j; Equation (2)).
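A minimal sketch of this per-parcel fit-and-predict scheme, using NumPy as an illustrative stand-in for Equations (1) and (2); the array sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_vox, n_feat = 300, 100          # grayordinates in one parcel, rs-fMRI features

# Training: stack the training subjects' features/z-scores for this parcel
# and fit one beta vector (toy data for 10 subjects shown here).
X_train = rng.normal(size=(n_vox * 10, n_feat))
true_w = rng.normal(size=n_feat)
y_train = X_train @ true_w + rng.normal(scale=0.1, size=n_vox * 10)
beta, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)   # Equation (1)

# Prediction: matrix multiply a test subject's feature maps by the trained betas.
X_test = rng.normal(size=(n_vox, n_feat))
y_pred = X_test @ beta                                     # Equation (2)
```

One such fit is performed per parcel and per task, giving the 100 × 100 beta values described above.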

| NN prediction
The NN approach used the same features, tasks, and parcellation scheme as the GLM approach. A feed-forward NN was created using the fitnet function in MATLAB. A schematic of the NN model is shown in Figure 2.

| Hyperparameter optimization
First, hyperparameter optimization was performed for each task on the 50 hyperparameter optimization subjects. For the hyperparameter optimization step, no parcellation was performed.
Data were subdivided by randomly setting aside 20% for testing and 20% for validation. The test data were used to evaluate model performance using the root mean square error (RMSE). Three parameters were tuned: number of hidden layers, hidden layer size, and learning rate. First, the hidden layer size was iterated from 1 to 50, and the size that minimized the RMSE was selected. Next, the number of hidden layers was iterated from 1 to 3, and the number that minimized the RMSE was selected. Finally, the learning rate was varied; because the RMSE fluctuated randomly with the learning rate, a learning rate of 0.001 was chosen for subsequent analyses. The hidden layer size was varied based on task, but the same value was used regardless of the number of training subjects.

FIGURE 1 Schematic depicting the training and prediction scheme for the NN and RFBag models. In the training step, data from the N training subjects were aggregated and used to train M separate models. Data were randomly split into 60% training, 20% validation, and 20% testing for each model separately. Then, for each test subject, each of the M models was used to predict activation, resulting in M predicted maps per subject. The M predicted maps were then averaged to produce one predicted map for each test subject. Both N and M were varied to examine the effects of the number of training subjects and the number of averages, respectively, on the accuracy of the predicted maps.

FIGURE 2 Schematic of the feed-forward NN employed in this study, from the view command in MATLAB. In this example, the input consisted of 100 rs-fMRI-derived features connected to one hidden layer with 10 neurons; however, the hidden layer size varied with task according to the hyperparameter optimization. The hidden layer was, in turn, connected to the output layer, resulting in one predicted activation value for each grayordinate. W represents the weights of the connections between layers and b is the bias term.
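The sequential search over hidden layer size might look as follows. This sketch uses scikit-learn's MLPRegressor as a stand-in for MATLAB's fitnet; the data and search range are illustrative (the study iterated sizes from 1 to 50).

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 20))                 # toy features
y = np.tanh(X[:, 0]) + 0.1 * rng.normal(size=400)

# 20% of the data held out for testing, as in the text
# (the separate validation split is omitted in this sketch).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_size, best_rmse = None, np.inf
for size in (5, 10, 20):                       # truncated range for brevity
    net = MLPRegressor(hidden_layer_sizes=(size,), activation="tanh",
                       max_iter=500, random_state=0)
    net.fit(X_tr, y_tr)
    rmse = mean_squared_error(y_te, net.predict(X_te)) ** 0.5
    if rmse < best_rmse:                       # keep the size minimizing RMSE
        best_size, best_rmse = size, rmse
```

The same select-by-RMSE loop would then be repeated for the number of hidden layers.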

| Model training
Additional parameters for the feed-forward NN included a tanh activation function for the hidden layer, a Nguyen-Widrow layer initialization function (Nguyen & Widrow, 1990), and the resilient backpropagation training function (Riedmiller & Braun, 1993a). The network was then trained using the train function, with the mean square error as the cost function. The network was trained several times, with the number of training subjects varying from 10 to 200 (see Section 2.5). To control for overfitting, the training data were further subdivided by randomly setting aside 20% of the data for testing and 20% for validation. Training stopped when the mean square error reached a minimum on the validation data or when 1,000 epochs were reached. The majority of cases reached a mean square error minimum.
Trained models were then used to predict activation for the test subjects using the test subjects' features as input.
As another means of controlling for overfitting and to improve prediction accuracy, the model was trained several times for each number of training subjects. Each trained model was used to predict activation for the test subjects and the results from each model were averaged to produce the final predicted activation map ( Figure 1). Thus, for each task, 100 × #averages models were trained, one for each parcel and each average. To examine the effects of the number of averages on the prediction accuracy, 1, 5, 10, 20, and 50 averages were used for 50 training subjects.
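The model-averaging scheme can be sketched as follows, again using scikit-learn's MLPRegressor in place of MATLAB's fitnet; the data sizes and number of averages are illustrative.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 10))                 # stacked training data (toy)
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=300)
X_test = rng.normal(size=(50, 10))             # one test subject's features

M = 5                                          # number of averages (1-50 in the study)
preds = []
for m in range(M):
    # A different random_state gives each model a different initialization,
    # standing in for the random data splits described in the text.
    net = MLPRegressor(hidden_layer_sizes=(10,), activation="tanh",
                       max_iter=500, random_state=m)
    net.fit(X, y)
    preds.append(net.predict(X_test))

# The M predicted maps are averaged into one final prediction per test subject.
y_final = np.mean(preds, axis=0)
```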

| Random forest bootstrap aggregation prediction
Bagging also used the same features, tasks, and parcellation scheme as the GLM approach, and a separate model was trained for each parcel.
Bagging generates many decision trees by randomly selecting subsets of the data with replacement and a random subset of features at each split.
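A minimal per-parcel RFBag sketch, assuming scikit-learn's RandomForestRegressor, which bootstraps samples and randomizes the features considered at each split as described above; the array sizes are illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(5)
X_train = rng.normal(size=(1000, 100))     # stacked training subjects' features
y_train = X_train[:, 0] + 0.1 * rng.normal(size=1000)   # toy z-scores

# Bootstrapped samples plus a random feature subset at each split.
forest = RandomForestRegressor(n_estimators=50, max_features="sqrt",
                               random_state=0)
forest.fit(X_train, y_train)

X_test = rng.normal(size=(200, 100))       # one test subject's parcel
y_pred = forest.predict(X_test)            # prediction averaged over the 50 trees
```

One such forest is trained per parcel, mirroring the per-parcel GLM and NN fits.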

| Hyperparameter optimization
Hyperparameter optimization was also performed for the RFBag approach using the 50 hyperparameter optimization subjects and no parcellation. Data were subdivided by randomly setting aside 20% of the data for testing, which was used to evaluate model performance using the RMSE. Three parameters were tuned, including the number of decision trees.

Also, as with the NN approach, the model was trained several times for each number of training subjects. Each trained model was used to predict activation for the test subjects, and the results from each model were averaged to produce the final predicted activation map. As for the NN, for each task, 100 × #averages models were trained, one for each parcel and each average. To examine the effects of the number of averages on the prediction accuracy, 1, 5, 10, 20, and 50 averages were used for 50 training subjects.

| Model evaluation
To evaluate the prediction accuracy of each model, Pearson correlation coefficients (CC) were calculated between each test subject's predicted maps and all other test subjects' task maps, resulting in a 100 × 100 matrix for each approach and task. For these matrices, a strong diagonal component indicates that actual and predicted activation maps are more similar for the same subject than for other subjects, and therefore that the model predicts individual activation well. The matrices were row and column normalized to account for the different variances of the actual and predicted maps. The correlation between actual and predicted activation was compared across models with a repeated-measures analysis of variance (ANOVA). Maps were also thresholded for each subject and modeled separately using a mixture model consisting of a Gaussian and two gamma functions (Jones et al., 2017), and the DC was computed (Equation (3)). The median of the upper gamma was used as the threshold. The DC was also compared across models using repeated-measures ANOVA. Finally, to examine the effect of the number of training subjects and the number of averages on accuracy, the mean CC value between actual and predicted activation was plotted versus the number of training subjects and the number of averages, respectively, for each approach and task.

FIGURE 6 Non-normalized (top) and normalized (bottom) correlation matrices for the GLM (left), NN (middle), and RFBag (right) approaches. Nine tasks are shown, including at least one task from each of the seven domains. Higher correlation along the diagonal can be seen for all approaches and tasks. The diagonal trend is strongest for the REL and TOM tasks. The diagonal is heightened for the normalized correlation matrices. Qualitatively, the matrices look very similar across approaches, although the diagonal is slightly more prominent for the RFBag approach.
It is important to note that two methods were used for model performance evaluation. During training, model performance was evaluated using the mean square error on the testing subset of data from the training subjects. Once the models were trained, they were used to predict activation for the 100 separate test subjects. Here, prediction accuracy was defined as the Pearson correlation between the actual and predicted task maps.
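The two evaluation metrics can be sketched as follows. This NumPy illustration uses toy data and a simple fixed threshold in place of the Gaussian/two-gamma mixture-model threshold used in the study.

```python
import numpy as np

rng = np.random.default_rng(6)
n_sub, n_vox = 20, 500
actual = rng.normal(size=(n_sub, n_vox))                    # toy task maps
predicted = actual + 0.5 * rng.normal(size=(n_sub, n_vox))  # toy predictions

# Pearson CC between every actual map (rows) and predicted map (columns):
# a subjects x subjects matrix whose diagonal holds within-subject accuracy.
cc = np.corrcoef(actual, predicted)[:n_sub, n_sub:]

# Dice coefficient between binarized maps for one subject
# (fixed threshold of 1.0 here, for illustration only).
a = actual[0] > 1.0
p = predicted[0] > 1.0
dice = 2 * np.sum(a & p) / (np.sum(a) + np.sum(p))
```

A strong diagonal in `cc` corresponds to the diagonal trend reported for the correlation matrices.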

| Prediction accuracy (correlation)
Overall, as in Tavor et al. (2016) and Jones et al. (2017), both the actual and predicted activation maps varied between subjects in terms of the strength and size of activation. For all approaches, prediction accuracy varied across task domains and between tasks within the same domain. For example, for the motor domain, the mean CC between actual and predicted activation for the RFBag approach ranged from 0.544 for the LH task to 0.702 for the CUE task. In general, the gambling, relational, and social domains performed the best, with a mean CC > 0.7, while the motor domain performed the worst, with a mean CC < 0.6. Mean CC values for all methods and tasks are summarized in Table 1. Quantitatively, the RFBag and NN approaches outperformed the GLM approach, with significantly higher CC values for all tasks. In addition, the RFBag approach had significantly higher CC than the NN approach for all tasks.
Correlation maps between actual and predicted activation were constructed and show the correlation between the actual activation map and the predicted activation maps of all other subjects. They showed a higher correlation along the diagonal for all approaches and tasks, indicating the predicted activation maps more closely matched the actual maps for the same subject compared to all other subjects.
This diagonal trend was stronger for the gambling, relational, and social domains compared with the others. Correlation maps were also row and column normalized to allow for comparison across subjects.

FIGURE 7 Example actual (top), predicted (middle), and overlap (bottom) maps for the GLM, NN, and RFBag models for the RANDOM (left), STORY (middle), and LH (right) tasks for one representative subject. Only positive activation is shown. Significant overlap is seen for all three approaches and tasks; however, additional overlap can be seen for the RFBag approach, as illustrated by the red arrows. For the LH case, the GLM approach resulted in more spurious predicted activation compared with the NN and RFBag approaches (white arrow).
The diagonal component was heightened for the normalized matrices.
The normalized and non-normalized CC matrices were qualitatively similar across approaches. Example matrices from nine tasks are shown in Figure 6.

| Dice coefficient
Maps were thresholded using a mixture model method and used to calculate the DC. Overall, the DC results mirrored the CC results. The emotion, gambling, relational, and social tasks performed the best with DC > 0.5 for all tasks within those domains. The RFBag approach performed the best, with a significantly higher DC for the majority of tasks as compared with the other approaches. Quantitative DC results are shown in Table 2. Visually, the overlap was similar across approaches. Figure 7 shows examples of actual, predicted, and overlap maps for the RANDOM, STORY, and LH tasks.

| DISCUSSION AND CONCLUSIONS
In this study, several regression models to predict task activation from rs-fMRI were compared, including GLMs, neural networks, and random forests. First, the effect of the number of training subjects on accuracy was analyzed by varying the number of subjects used for training from 10 to 200 subjects. Accuracy increased with training number for all approaches up to 200 subjects, but an elbow occurred around 30-40 subjects. The effect of averaging the results of multiple models was also investigated for the NN and RFBag approaches.
Increasing accuracy was seen as a function of the number of averages. In addition, the results from Tavor et al. (2016), in which a GLM was used, were replicated on a set of 350 HCP subjects. Similar results were found in our analysis, including a strong diagonal trend in the cross-correlation matrices and variable correlation strengths across tasks. Next, the analysis was expanded using more complex regression models. Significantly higher accuracy, defined as the correlation between actual and predicted activation, was achieved by the RFBag approach compared with the NN and GLM approaches. There are several NN training types available in MATLAB, including resilient backpropagation, Levenberg-Marquardt, and scaled conjugate gradient backpropagation, among others. Of the methods we tested, we found that the resilient backpropagation method worked best for the type of analysis used in this study. Resilient backpropagation is a gradient descent algorithm that depends on only the sign, not the magnitude, of the partial derivatives (Riedmiller & Braun, 1993a). This approach typically converges faster than other gradient descent algorithms.
One issue with NN approaches is that they have the potential to overfit the training data, which leads to a model that fits the training data very well but does not generalize when applied to new data. To account for this, 20% of the training data were set aside as test data and 20% as validation data. Throughout the training, the model was tested on the test data, and the loss function (i.e., mean square error) was computed on the validation data. Training stopped when the validation loss function was minimized. The test data are not seen by the model during training, which allows the model to be evaluated on unseen data. In this way, the generalizability of the model can be maintained. Another approach to reduce NN overfitting is the dropout method, in which nodes and their connections are randomly dropped during training (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). This creates a number of "thinned" networks. During testing, all nodes remain, and their weights are multiplied by the probability that a node is present at training time. It may be worthwhile to compare this method to the NN approach with averaging, as it also creates an NN ensemble.
Although hyperparameter optimization was used for the RFBag approach, another option is feature reduction, which determines the importance of individual features and excludes less important features from the training process. Feature reduction was not used in this study.
The HCP dataset consists of more than 80 task contrasts collected across seven task-fMRI scans. Previous work by Tavor et al. analyzed 43 of these contrasts and found wide variability in the ability of rs-fMRI to predict task activation, with correlation values between actual and predicted task activation ranging from 0.12 for the GAMBLING−PUNISH−REWARD task to 0.80 for the RELATIONAL−MATCH task (Tavor et al., 2016). In this study, 25 tasks were selected to provide a range of correlation values to evaluate the ability of the more advanced regression approaches to improve predictions for both "good" and "bad" tasks across several task domains.

Parker Jones et al. (2017) found promising results by extending the GLM model to a group of patients who performed a category fluency task with only 5 min of resting-state fMRI (TR = 3.5 s). Future studies are underway to apply these models in a patient setting. As mentioned, feature selection was not employed in this study. More extensive hyperparameter tuning and feature selection might be explored in future studies.
In conclusion, advanced regression techniques were used to predict task activation from rs-fMRI data. All models accurately predicted task activation for a wide range of task domains on an individual basis; however, higher correlation between actual and predicted task activation was seen for the RFBag model compared to the NN model and the previously studied GLM.

ACKNOWLEDGMENTS
This work was partially supported by the Medical College of Wisconsin and a grant from the Daniel M. Soref Charitable Trust.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.