A deep learning based approach identifies regions more relevant than resting‐state networks to the prediction of general intelligence from resting‐state fMRI

Abstract Prediction of cognitive ability latent factors such as general intelligence from neuroimaging has elucidated questions pertaining to their neural origins. However, predicting general intelligence from functional connectivity limits hypotheses to that specific domain, remaining agnostic to time-distributed features and dynamics. We used an ensemble of recurrent neural networks to circumvent this limitation, bypassing feature extraction, and predicted general intelligence from resting-state functional magnetic resonance imaging regional signals of a large sample (n = 873) of Human Connectome Project adult subjects. Ablating common resting-state networks (RSNs) and measuring the degradation in performance, we show that model reliance can be mostly explained by network size. Using an approach based on the temporal variance of saliencies, that is, gradients of outputs with regard to inputs, we identify a candidate set of networks that more reliably affect performance in the prediction of general intelligence than similarly sized RSNs. Our approach further allows us to test the effect of local alterations on the data and the expected changes in derived metrics such as functional connectivity and instantaneous innovations.


1.1 | Neural bases of intelligence
The neural bases of intelligence constitute an open scientific question.
Recent advances make neuroimaging a fundamental tool to answer this question. For a recent review, see Dizaji et al. (2021).
Intracranial volume (ICV) and intelligence quotients are substantially correlated (Luders, Narr, Thompson, & Toga, 2009). This phenomenon is reproducible in both sexes and across all age groups (McDaniel, 2005). In a meta-analysis, approximately 10% of the variance of intelligence quotients could be accounted for by differences in brain volumetry alone (McDaniel, 2005). Yet, intelligence depends on verbal, visual-processing, information encoding and retrieval, and executive tasks (Luders et al., 2009). This points to the importance of specialized regions and networks of the brain. While ICV or gray matter volume account for much of the variation in intelligence, the remaining variance might be explained by other neurobiological factors. Potential candidates include connectivity, neuroanatomy and microstructural properties, and metabolism. Indeed, including cortical gray matter thickness estimates and white matter hyperintensity loads almost doubled the explained variance of general intelligence compared to a model accounting only for brain volume in Ritchie et al. (2015).
Previous studies were used to formulate current theories on the neural bases of intelligence. The Parieto-Frontal Integration Theory (P-FIT) proposes the existence of a single network primarily subserving human intelligence, with substantial empirical evidence derived from neuroimaging (Jung & Haier, 2007). The Network Neuroscience Theory (NNT), on the other hand, proposes that fluid and crystallized intelligence and specific skills emerge from the dynamic reorganization of networks (Barbey, 2018). This theory accommodates multiple networks and network dynamics, relating specific network topologies to fluid and crystallized intelligence. For an overview of other competing theories, see Barbey (2018).

1.2 | Functional magnetic resonance imaging and the biological importance of resting-state functional connectivity
Functional magnetic resonance imaging (fMRI) allows the study of cerebral neurophysiology. Even at rest, the brain stays functionally and metabolically active. From the resting-state-fMRI signal, it is possible to define resting-state functional connectivity (RSFC), the temporal coupling between signals in anatomically distinct regions of the brain (Yeo et al., 2011).
This connectivity between regions can be estimated from relatively simple measures, such as the Pearson correlation coefficients.
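As a concrete illustration, Pearson-based RSFC reduces to a single correlation call over regional timeseries; the sketch below uses NumPy with hypothetical dimensions (155 timepoints, 360 regions) and random data, not the actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)
bold = rng.standard_normal((155, 360))  # timepoints x regions (hypothetical sizes)

# RSFC: Pearson correlation between every pair of regional timeseries.
rsfc = np.corrcoef(bold, rowvar=False)  # 360 x 360, symmetric, unit diagonal
```

Each off-diagonal entry is the temporal coupling between one pair of regions; the vectorized upper triangle is what feature-based predictive models typically consume.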
RSFC alterations have been linked to multiple biological processes, such as brain disorders, aging, and cognition. More importantly, intelligence has also been found to correlate with RSFC and graph theoretical measures, such as local efficiency (Pamplona, Santos Neto, Rosset, Rogers, & Salmon, 2015). More recently, fMRI-derived data has been used to perform predictive analyses at the individual-level (Sui, Liu, Lee, Zhang, & Calhoun, 2020).
Other than estimating temporal coupling between time series, it is possible to define spatially independent components of blood-oxygen-level-dependent (BOLD) contrast fluctuations using data-driven blind signal separation techniques such as independent component analysis (ICA; Calhoun, Adali, Pearlson, & Pekar, 2001; Calhoun & Adali, 2012). In addition to structured noise components, neural components have been robustly identified with ICA as resting-state networks (RSNs), with a large body of empirical evidence corroborating their existence. ICA-derived RSN topologies found validation in other techniques, such as magnetoencephalography (Brookes et al., 2011) or community detection with discrete regions (Ito et al., 2017).
These RSNs are often linked to known cognitive, sensory, or motor processes. The connectivity or topology of RSNs is often used to deduce the cognitive effect of the alterations observed.
1.3 | Predicting "g" from resting-state functional imaging

Individualized intelligence estimation from RSFC based on machine learning is already in practice (Dubois, Galdi, Paul, & Adolphs, 2018; Finn et al., 2015). The most successful approaches to the prediction of intelligence are often based on linear modeling with univariate feature filtering. Even though filtering is not theoretically a required step, it was employed by most of the successful approaches (Dizaji et al., 2021).
Under this type of model, once a feature is discarded it can no longer contribute to the predictions, no matter how much it is altered. This implausibility, compounded with our prior knowledge of how the brain works, motivates us to search for other approaches. The brain is a complex entity with complex dynamics, and the interplay between these dynamics and biological processes such as intelligence can be presumed to be complex as well. Aggregate measures such as RSFC can inform highly predictive models with a window for direct interpretability; at the same time, that choice limits the hypotheses about the data. Deep learning can be used to learn predictive representations of data with automatic feature extraction (Abrol et al., 2021). Employing deep learning to learn about the biological bases of "g" opens up new possibilities for interpretation of results, removing the bias due to the choice of features. Human-level interpretability is more challenging due to the increased complexity of the models. A few strategies exist, however, that allow insights to be extracted from modeling. We explore two such strategies, namely feature ablation and saliency (Molnar, 2019).
The effectiveness of deep learning in predictive analyses in cognitive neuroscience over "classical" machine-learning methods, such as linear, kernel-based, and tree-based models, has been a topic of recent debate. While some studies show comparable performance between the two paradigms (He et al., 2020), deep learning methods have also been shown to outperform classic machine learning (Abrol et al., 2020). Deep learning models are able to explore existing nonlinearities in neuroimaging data and automatically extract informative features (Abrol et al., 2021). This allows deep models to often surpass traditional machine learning models in performance. Since highly flexible models are more prone to variance, we control their variance via ensembling, that is, averaging the predictions of networks trained independently on the same data.
In this work, we aim to demonstrate that learning from lower-level data, that is, timeseries instead of RSFC, can bring further insights into the question of the neural bases of intelligence. This type of data requires specialized models, and we opt to use an ensemble of recurrent neural networks (RNNs) for this task. We performed the prediction of "g" from fMRI timeseries obtained from a discrete cortical parcellation. To the best of the authors' knowledge, this is the first work to perform this type of analysis. We then interpret the model predictions on unseen data by performing ablation and saliency studies. We show that the ablation of single anatomical regions does not degrade performance and that the degradation in performance when ablating RSNs is not greater than when ablating random sets of regions of equal size. We calculate model reliance based on the temporal variance of saliencies and show that, when ablating sets of regions ranked by this measure, significant degradation of performance ensues compared with the random sets. We then propagate saliencies to derivative measures and derive neuroscientific insights from the nature of the observed saliencies.

2.1 | Data and preprocessing
Original data were provided by the Human Connectome Project (HCP; Essen et al., 2013). Preprocessed imaging and behavioral data were provided by Dubois et al. (2018). Briefly, test scores from 1,206 subjects were obtained. Tests include seven tasks from the NIH Toolbox for Assessment of Neurological and Behavioral function (dimensional change card sort; flanker inhibitory control and attention; list sorting working memory; picture sequence memory; picture vocabulary; pattern comparison processing speed; oral reading recognition) and three from the Penn Computerized Neurocognitive Battery (Penn progressive matrices; Penn word memory test; variable short Penn line orientation). Twenty-three subjects with missing or incomplete test scores were excluded (n = 1,183). Two subjects that scored 26 or less in the Mini Mental State Examination (MMSE) were also excluded (n = 1,181). These subjects were available for factor analysis. Subjects that completed four imaging sessions (n = 998) were further filtered by excluding 114 subjects with excessive in-scanner head movement (n = 884). Preprocessing included z-standardization, removal of temporal drifts from white-matter and CSF signals, regression of mean white-matter and CSF signals from gray matter, regression of six realignment parameters and their first derivatives, low-pass filtering with a Gaussian with a SD of 720 ms, removal of temporal drifts from the resulting gray matter signal, and finally global-signal regression.
Temporal drift removal was based on third-degree Legendre polynomials. See Dubois et al. (2018) for details. We further removed 11 subjects with less than 1,200 timepoints per session, achieving a final set of 873 subjects with complete imaging and behavioral data. The Schmid-Leiman transformation was used to derive loadings for the general factor. Individual scores were obtained by the Thurstone regression method.
We applied band-pass filtering (BPF) at [0.008 Hz, 0.09 Hz] to the ROI timeseries. This frequency band corresponds to the expected slow BOLD fluctuations, increasing signal specificity, since the hemodynamic response function attenuates higher frequencies (Sun, Miller, & D'Esposito, 2004), typically above 0.15 Hz. The major drawback of BPF is a possible reduction in sensitivity, since higher frequencies might still carry signal of interest. Data were then decimated. This entails truncating the spectra at 0.09 Hz and then transforming the data back to the time domain. We obtained 155 timepoints per session after decimation. Dividing the session length of 864 s by 155 leads to a virtual TR of 5.57 s. This additional procedure reduces the data dimensionality without loss of information and removes the long-range temporal autocorrelation induced by BPF. Individual sessions were kept separate.
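The band-pass plus spectral-decimation step can be sketched as follows; this is an illustrative NumPy reconstruction assuming a 1,200-timepoint session at TR = 0.72 s and random data, not the authors' actual code:

```python
import numpy as np

tr = 0.72                      # HCP repetition time, seconds
n_t = 1200                     # timepoints per session
x = np.random.default_rng(1).standard_normal(n_t)

freqs = np.fft.rfftfreq(n_t, d=tr)
spec = np.fft.rfft(x)
# Band-pass: zero every bin outside [0.008 Hz, 0.09 Hz].
spec[(freqs < 0.008) | (freqs > 0.09)] = 0.0
# Decimation: truncate the spectrum at 0.09 Hz and invert with a shorter length.
keep = freqs <= 0.09
spec_trunc = spec[keep]
n_new = 2 * keep.sum() - 1     # odd output length matching the 78 kept bins
x_dec = np.fft.irfft(spec_trunc, n=n_new)
```

Truncating at 0.09 Hz leaves 78 rFFT bins for an 864-s session, giving 155 timepoints and a virtual TR of about 5.57 s, consistent with the numbers in the text.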

2.2 | Neural network architecture
We opted to build upon a simple RNN based on the long short-term memory (LSTM) module (Hochreiter & Schmidhuber, 1997). The LSTM captures both short- and long-range information in sequences. It has been shown empirically to work efficiently, including in applications to neuroimaging data (Dvornek, Ventola, Pelphrey, & Duncan, 2017).
To capture complex dynamics and also to better distribute gradients in the network, we used the BiLSTM module (Graves & Schmidhuber, 2005). The BiLSTM consists of two LSTM layers applied in parallel to time-distributed data. One of the layers traverses the data forward in time, while the other does the same backward. The activations of both are then combined at each timestep, by summation in our case.
Our architecture, represented in Figure 1, consists of two BiLSTM modules and a linear layer with identity activation. The first BiLSTM has 360 inputs (corresponding to each ROI defined in the atlas) and 256 outputs, while the second has 256 inputs and outputs. The activations of the latter are mean-aggregated, resulting in a 256-dimensional vector per timeseries. This representation is fed to the linear layer, resulting in a single scalar output. Parameters are optimized so that the average value of these scalars best predicts "g." We implemented our architecture in the framework Flux (Innes, 2018), written completely in Julia (Bezanson, Edelman, Karpinski, & Shah, 2017). It has 2,316,545 learnable parameters in total, with 632,320 and 525,824 per LSTM in the first and second layers, respectively, and 257 in the affine layer. Based on this architecture, we trained an ensemble of 50 networks fitted independently with the same data configuration. The final prediction is obtained by averaging the prediction of each member of the ensemble. For comparison, the convolutional architecture for image classification AlexNet (Krizhevsky, Sutskever, & Hinton, 2012) has almost 61 million trainable parameters.
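The stated parameter counts are consistent with an LSTM parameterization of four gate blocks plus learnable initial hidden and cell states (the latter is our assumption about the Flux implementation, since plain gate weights and biases alone fall 512 parameters short); the arithmetic can be checked directly:

```python
# Per-direction LSTM parameters: 4 gate blocks of (input + hidden + bias) x hidden
# weights, plus learnable initial hidden and cell states (2 x hidden) -- the
# initial-state term is an assumption that reproduces the paper's counts.
def lstm_params(n_in, n_hidden):
    return 4 * (n_in + n_hidden + 1) * n_hidden + 2 * n_hidden

first = lstm_params(360, 256)    # one direction of the first BiLSTM
second = lstm_params(256, 256)   # one direction of the second BiLSTM
linear = 256 + 1                 # affine readout: weights + bias
total = 2 * first + 2 * second + linear  # two directions per BiLSTM
```

With these definitions the per-LSTM counts are 632,320 and 525,824, and the total is 2,316,545, matching the text.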

2.3 | Training
During training, we decorrelated general intelligence from gender, age, brain volume, movement from each resting-state session, and reconstruction algorithm version, in accordance with Dubois et al. (2018). This procedure eliminates possible contributions of no interest from "g," augmenting specificity. General intelligence was then rescaled to unit variance and centered to zero mean.
We trained our architecture with backpropagation through time (BPTT; Williams & Peng, 1990). In our learning scheme, each LSTM layer is fed one timepoint of the batch, producing its output and LSTM cell state. At the next timestep, the layer is fed the next batch input together with the previous cell state to produce its output. This is repeated until the end of the sequence, when we calculate gradients in the backward pass on the objective function and update the parameters of the network accordingly.
We used mean-pooling on the outputs of the forward pass during training. Each sequence had up to 20 timesteps removed during each batch optimization, both at the start and at the end, providing some variability to the training scheme. The four sessions of resting-state timecourses were fed separately and in random order to the forward pass of the network and were effectively treated as separate training samples. This subtle difference stimulates the network to obtain the best possible estimate from each session instead of trying to optimize the average across sessions.
We trained the architecture for 50 epochs using the "ADAM" optimization algorithm (Kingma & Ba, 2014) to perform update iterations. The learning rate was set to 0.0005. We also regularized gradient descent by employing weight decay (Krogh & Hertz, 1992) with a decay constant equal to 0.0005. Due to the adaptive nature of "ADAM," weight decay and ℓ2 regularization do not coincide. We used decoupled weight decay (Loshchilov & Hutter, 2019) to overcome this limitation.
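A minimal sketch of the decoupled update (AdamW-style, following Loshchilov & Hutter) in NumPy rather than the Flux/Julia implementation actually used; the single-step function and its names are illustrative:

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=5e-4, beta1=0.9, beta2=0.999,
               eps=1e-8, wd=5e-4):
    """One decoupled-weight-decay Adam update: the decay term wd * w is
    applied directly to the weights instead of being folded into the
    gradient, so it is not rescaled by the adaptive denominator."""
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias corrections
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
w, m, v = adamw_step(w, np.ones(3), m, v, t=1)
```

Folding `wd * w` into `g` instead would make the effective decay depend on `v_hat`, which is exactly the coupling the decoupled formulation avoids.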

2.4 | Validation
We performed 10-fold stratified cross-validation. We stratified families based on the terciles of family-averaged "g" and on family size, with groups of families with 1, 2, and 3 or more members. This stratification was conducted in order to make validation folds more homogeneous. We computed folds within each stratum of families and then recombined these folds across strata to ensure a homogeneous distribution of "g" across folds. The number of families in each stratum is shown in Table 1.
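The stratification can be sketched as follows, with toy family data; the bucketing into "g" terciles and size groups (1, 2, 3+) follows the text, while the round-robin assignment within each stratum is one plausible way to balance folds, not necessarily the authors' exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)
fam_g = rng.standard_normal(300)             # family-averaged "g" (toy values)
fam_size = rng.integers(1, 5, size=300)      # members per family (toy values)

terciles = np.digitize(fam_g, np.quantile(fam_g, [1/3, 2/3]))  # 0, 1, 2
size_grp = np.minimum(fam_size, 3)           # size groups: 1, 2, 3+

folds = np.empty(300, dtype=int)
for t in range(3):
    for s in (1, 2, 3):
        idx = np.where((terciles == t) & (size_grp == s))[0]
        rng.shuffle(idx)
        folds[idx] = np.arange(len(idx)) % 10  # round-robin into 10 folds
```

Because whole families are assigned to folds, related subjects never straddle a train/validation boundary.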
The forward pass during validation was virtually the same as in training, except that outputs were generated by mean-aggregating outputs from all timepoints within a session and then averaging across sessions. The general intelligence of validation data was transformed in accordance with the transformation obtained from training data in each fold. This prevents leakage of validation data into training data. The procedure includes decorrelation of confounder variables, rescaling, and centering.

2.5 | Performance evaluation, comparison, and model exploration
We employed three performance metrics. The mean squared error (MSE) was used during training. To evaluate performance on validation data we computed the coefficient of determination, R² = 1 − Σᵢ(yᵢ − ŷᵢ)² / Σᵢ(yᵢ − ȳ)², and we also report the squared Pearson correlation, ρ², between predicted and observed "g." As a comparison model, we trained a penalized linear model on RSFC features, which were filtered in univariate fashion at each fold. This filtering is based on the correlation between each input feature and "g" for training data. We set the parameter balancing the LASSO and ridge penalties to α = .05. This leads to almost pure ridge regularization. The parameter λ that controls the tradeoff between the loss function and the penalization was optimized in an inner three-fold cross-validation.
This inner optimization was based solely on training data at each outer fold. λ was optimized based on a grid with 50 values. This whole model was validated on the same fold configuration mentioned previously in Section 2.4.
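A compact sketch of the inner-loop selection of λ, using closed-form ridge regression on toy data; this ignores the univariate filtering and the small LASSO component (α = .05 makes the penalty almost pure ridge), so it is an illustration of the nested scheme rather than the exact model:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def inner_cv_lambda(X, y, grid, k=3):
    """Pick lambda by k-fold CV error on the (outer-fold) training data."""
    folds = np.arange(len(y)) % k
    errs = []
    for lam in grid:
        e = 0.0
        for f in range(k):
            tr, va = folds != f, folds == f
            beta = ridge_fit(X[tr], y[tr], lam)
            e += np.sum((y[va] - X[va] @ beta) ** 2)
        errs.append(e)
    return grid[int(np.argmin(errs))]

rng = np.random.default_rng(3)
X = rng.standard_normal((120, 30))
y = X[:, 0] * 0.5 + rng.standard_normal(120) * 0.5
grid = np.logspace(-3, 3, 50)       # 50-value grid, as in the text
best_lam = inner_cv_lambda(X, y, grid)
```

The chosen λ is then used to refit on the full outer-fold training set before scoring the held-out fold.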
To further understand how the ensemble model behaves, we performed two model-agnostic exploration strategies: ablation and saliency (Molnar, 2019). This allows us to understand what information the model relies on to reach the performance we assessed in cross-validation. Both techniques were applied to validation data.

2.6 | Model exploration: ablation study on validation sets
We performed an ablation study on the trained model. Ablation consists of removing information from the input and assessing the respective degradation in performance. We ablated anatomically defined atlas regions and entire functional networks. This procedure should be able to discern model reliance on individual features and on combinations of features. For the networks, we used the network definition established by Ito et al. (2017), which is based on the same MMP cortical atlas we employed.
To account for the fact that networks have different sizes, we compared the statistics obtained with a distribution of statistics.
This distribution was obtained by resampling random networks with sizes matching the network being tested. This way, we can more confidently state that a change in performance after ablation is not simply due to the removal of nodes. The procedure consists of, given a network with M nodes, selecting M nodes at random to be ablated. The statistics associated with the performance are stored and the procedure is repeated, for 300 iterations in our case. We extracted the paired T-statistics, comparing the average performance across the 10 folds with and without ablation, under a null hypothesis of zero difference. We then define a p-value for each hypothesis tested as the proportion of resampled statistics from the pool that are more extreme than the statistic measured empirically.
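The resampling-based p-value reduces to a one-liner; the sketch below assumes a two-sided comparison on absolute statistics, which is one reasonable reading of "more extreme":

```python
import numpy as np

def resampled_p_value(observed_t, null_ts):
    """Proportion of resampled statistics at least as extreme (in absolute
    value) as the empirical one."""
    null_ts = np.asarray(null_ts)
    return np.mean(np.abs(null_ts) >= abs(observed_t))

rng = np.random.default_rng(4)
null = rng.standard_normal(300)    # t-stats from 300 random same-size networks
p = resampled_p_value(3.5, null)
```

With 300 resamples the smallest attainable nonzero p-value is 1/300, which is comfortably below the .05 threshold used after FDR correction.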
We used p < :05 as a significance threshold throughout, after correcting for multiple comparisons at the analysis level with the false discovery rate controlling procedure described in Yekutieli and Benjamini (1999).

2.7 | Model exploration: saliency analysis on validation sets
Saliency measures the degree to which a model's output is locally influenced by perturbations of its inputs. It is often defined as the partial derivative of the outputs of a model with respect to its inputs. We chose to study the saliency of the trained models to understand which features of brain resting-state activity are related to "g." Since our model is fully differentiable, we can compute this measure easily. For each fold, we obtained the saliencies ∂_X N(X) based on validation, that is, unseen, data. Since our output is scalar, the saliencies have the same dimensionality as the input data.
For efficiency reasons, mean centering, unit scaling, and temporal filtering are often performed outside cross-validation, per timeseries. When analyzing saliencies, however, we must take this normalization into account, as it is a constituent part of our model.
Otherwise, we might observe, for example, a non-null gradient at frequency zero, which is not possible under our model, as it assumes all data were standardized. We used the chain rule to propagate derivatives to the standardized, filtered data: ∂_S N(X) = ∂_S X · ∂_X N(X). For more details, see Appendix A. In practice, the procedure is performed automatically using automatic differentiation.
Since we take standardization into account, the average saliency of a region is zero across time, by design. We instead summarize reliance on a regional timeseries by the temporal variance of its saliency, defined, for a region timeseries xᵢ, as Σₜᵀ (∂_{xᵢₜ} N(X))² / (T − 1).
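Given a saliency array of the same shape as the input, the regional reliance measure is simply a per-region variance over time; random values stand in for actual model gradients in this sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
saliency = rng.standard_normal((155, 360))   # dN/dX, same shape as the input

# Regional reliance: temporal variance of each region's saliency timeseries.
reliance = saliency.var(axis=0, ddof=1)      # one value per region
ranking = np.argsort(reliance)[::-1]         # regions by descending reliance
```

Because the per-region saliencies average to zero over time by construction, this variance coincides with the mean squared saliency up to the T − 1 normalization.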

2.8 | Ablating a network defined by the regions with highest saliency
Regions with high average squared saliency should be approximately those the model relies on most for local changes in the estimate of "g." Thus, it makes sense to use this quantity as a per-region proxy of model reliance. Using the temporal variance of saliency per region, we define 12 networks, each having the size of one of the 14 RSNs defined in Ito et al. (2017). For each of these "saliency-based networks" we perform the ablation and permutation study described in Section 2.6. To avoid circular analysis, that is, "double-dipping," we cross-validate this procedure: we employ saliencies from training data to define which regions are ablated in validation data.

2.9 | Functional connectivity saliency
Apart from the original data, we also explored saliencies on functional connectivity. Given that saliencies denote a change in input data, we can propagate this change to functional connectivity. Since our saliencies preserve the mean and variance of the data, we can plug them directly into the definition of functional connectivity. Equation (1) shows the update operator for X, X′ = X + η ∂_X N(X), and, likewise, the updated connectivity C′ = C + η δC + η² C_∂, where δC collects the cross-terms between the data and the saliencies and C_∂ is the connectivity of the saliencies themselves. In Equation (1), since η ≪ 1, the third term of C′ is, again, very small. Therefore, we focus on the second term, δC. This local change in connectivity is proportional to the covariance between the data and the saliencies. δC is the alteration in connectivity associated with changing activity so as to increase the "g" estimates of the model.

[Table 1: Stratification of families into average "g" terciles and family size]
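The first-order behavior of the connectivity update can be checked numerically; the sketch below uses centered toy data and zero-mean "saliencies," with the covariance matrix XᵀX/(n − 1) standing in for the correlation of standardized data:

```python
import numpy as np

rng = np.random.default_rng(6)
n, r = 155, 8
X = rng.standard_normal((n, r))
X -= X.mean(0)                       # centered data (timepoints x regions)
G = rng.standard_normal((n, r))
G -= G.mean(0)                       # saliencies are zero-mean over time

eta = 1e-4
C = X.T @ X / (n - 1)                # baseline connectivity
dC = (X.T @ G + G.T @ X) / (n - 1)   # first-order term: data-saliency covariance
C_new = (X + eta * G).T @ (X + eta * G) / (n - 1)
# C_new = C + eta*dC + eta^2 * G'G/(n-1); the last term is negligible for eta << 1
```

For η = 1e-4 the quadratic remainder is of order 1e-8, so C′ ≈ C + η δC holds to high accuracy.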

2.10 | Propagating saliency to decorrelated inputs through the zero-phase component analysis
When investigating saliency on the original data, one can incorporate possible data-generating processes. We assume that RSFC generates the data by acting on instantaneous innovations in each region. This can be described as X = ZW, where X is the centered and unit-scaled data, with each column a region and each row a timepoint, Z represents the (uncorrelated) innovations and has the same size as X, and W is a square dewhitening matrix. Taking the singular value decomposition (SVD) of X, X = USVᵀ, the inverse square root of the correlation C = XᵀX/(n − 1) is simply C^(−1/2) = √(n − 1) V S^(−1) Vᵀ, so that Z = X C^(−1/2). This procedure, called zero-phase component analysis (ZCA), was described in Krizhevsky (2009).
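The ZCA step can be sketched with an SVD on toy standardized data; the whitened innovations come out exactly uncorrelated:

```python
import numpy as np

rng = np.random.default_rng(7)
n, r = 155, 6
X = rng.standard_normal((n, r))
X = (X - X.mean(0)) / X.std(0, ddof=1)         # centered, unit-scaled data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Inverse square root of C = X'X/(n-1):  sqrt(n-1) * V diag(1/s) V'
C_inv_sqrt = Vt.T @ np.diag(np.sqrt(n - 1) / s) @ Vt
Z = X @ C_inv_sqrt                              # ZCA-whitened innovations
```

Unlike PCA whitening, the symmetric ZCA transform keeps each whitened column maximally aligned with its original region, which is what makes regional interpretation of the innovations meaningful.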

3 | RESULTS
3.1 | Confounder variables explain substantial variance of "g"

As part of our framework, we predict "g" in validation data across folds using the coefficients estimated from training data and perform additional analyses on the residuals of this model. We obtain ρ²_mean = 0.138, ρ²_stderr = 0.0279 across folds. These confounders include gender, age, brain volume, movement from each resting-state session, and reconstruction algorithm version. This result is in line with previous work (Dubois et al., 2018).

3.2 | Ablation of single regions does not affect performance
We show in Figure 3 how removing one region at a time from the validation data affects performance. Colors represent each network assignment defined in Ito et al. (2017).

3.3 | Ablation of RSNs alters performance; their size explains this effect
We also performed an ablation study deleting one RSN at a time. This is shown in Figure 4. We assessed significance with the paired T-test.
We compared the average performance across folds in the baseline with the average performance after ablating each RSN. Results were corrected for multiple comparisons using the procedure described in Yekutieli and Benjamini (1999).
However, when resampling the distribution of t-statistics with random, equally sized networks, we did not observe any significant effect on any performance metric. This implies that the alterations in performance observed in Figure 4 can be explained by the size of the RSNs alone.
3.6 | Peak regional saliency variance describes a network that when ablated significantly deteriorates model performance

When comparing the degradation in performance measured by R², after correcting for multiple comparisons, most saliency-based networks degraded performance significantly more than equally sized random networks. This is shown in Figure 6. Figure 7 shows the expected value of the change in connectivity δC that accompanies increasing "g." As in Ferreira et al. (2016), we choose to categorize saliencies with regard to their average magnitude and the direction of change to ease visualization. Only the highest 5% of effects, defined as the average saliencies divided by the respective SD, are shown. No correlation was found between the values of average δC and average C.

Training the ensemble with leave-one-family-out cross-validation would increase computational costs prohibitively. We thus employ 10-fold CV, which entails less training data in each resample but displays less variance than leave-one-out cross-validation (Kohavi & Edu, 1993). Both effects can be noted in the slightly deteriorated performance we observe when validating the penalized linear model. In our data, we retrieve R²_10-fold CV = 0.170, compared to the R²_LOFO = 0.206 reported in Dubois et al. (2018). Increasing the training set size would increase performance for both the ensemble and the penalized linear model.

| Saliency on whitened timeseries Z demonstrates the weight of RSFC on model reliance
Having said that, a modest 8% increase in performance is noted when using our model. This suggests we successfully captured the effects of interest. Given the increased complexity of our model, we are possibly reaching the point of capturing most of the information available in the data. Future work could explore data sampled at other spatial granularities, culminating in the voxel timeseries themselves. Other timescales could also be explored: either fast-changing neural oscillations afforded by techniques such as MEG and EEG, or slow changes in activity patterns with longitudinal data.
We chose to ensemble RNNs to control their variance. Figure 2 demonstrates that we reach a plateau in performance long before summing 50 models in our ensemble; roughly 10 models achieve peak performance in our scenario.
We chose to assess importance on validation data. An equally valid choice would have been to study training data prior to learning; Dubois et al. (2018) apply this approach to study RSNs. For a discussion, see Molnar (2019).
As expected, Figure 3 shows that the removal of no single region is sufficient to substantially disrupt model performance on validation data. This suggests that, as other works have shown, intelligence is distributed across different regions in the brain (Dubois et al., 2018).
This result is compatible with the initial premises of current theories, such as P-FIT and NNT.
When looking into RSNs, as shown in Figure 4, removing any one of the visual, somato-motor, cingulo-opercular (CON), default mode (DMN), or auditory networks significantly lowers performance on the validation set. These networks have been reported before as important for intelligence. However, some of them are also among the largest in the atlas of Ito et al. (2017). This led us to investigate a direct effect of the amount of information being removed. We found no significant difference between this decrease in performance and the one obtained with resampled random networks of the same size. This means we have no evidence that the ablation of RSNs reveals specific reliance on these networks for prediction; rather, their size explains the effect.
As exposed in Section 2, by design the saliencies sum to zero and their variance is such that, locally, it cancels their covariance with the data. This is important for their validity with regard to the full model. It also precludes us from exploring simple measures such as the average saliency.
Results point to the fact that saliency is heterogeneous across the cortex. Additionally, we did not detect bilateral patterns. This could be due to redundancies in information between homotopic regions. Ventral visual areas display low saliency overall. Wernicke's and Broca's areas are highlighted with high saliency temporal variance.
The role of DMN deactivation in cognition has been studied elsewhere. See Anticevic et al. (2012) for a review.
To define a measure of regional reliance, we employ the temporal variance of saliencies. Saliencies can be interpreted as the change in data that leads to an increase in the output. Thus, regions with high temporal variance of saliencies are also the ones being most altered.
Ablation of the "networks" shown in Figure 5 significantly degraded performance compared with equally sized random networks. Several regions, however, lie on the frontier between major networks. We might speculate that the individual extent of RSNs could contribute to the prediction of "g." The selections of networks in Figure 5 include prominent P-FIT areas. Notably, prefrontal, parietal, and temporal associative areas are included. However, bilateral visual associative regions are missing. We also do not verify any prominence of the left hemisphere over the right one, as theorized in the P-FIT. This can be an artifact of the number of regions selected or of interindividual variability in functional localization. Of particular interest is the confounding effect of information shared through connectivity. This effect is further discussed below.

[Figure 2: Effect of ensemble size on cross-validated average R². Ribbon shows the SE. Ensemble members were added consecutively in the same order of training.]

[Figure 3: Effect of regional ablation on cross-validated average R². Ribbon shows the standard error. The black line represents the performance measured when using all data. Colors represent each resting-state network assignment defined in Ito et al. (2017).]

[Figure 4: Effect of network ablation on cross-validated average R². Error bars represent the SE. The black line represents the performance measured when using all data. Significance was assessed with the paired T-test, comparing performance in the baseline with the ablation of each network across folds. The Benjamini-Hochberg procedure was employed to correct for multiple comparisons.]

[Figure 5: Regions with top saliency temporal variance in training data across folds. Occurrence goes from 0 to 10. This selection is used to ablate regions in validation data. The number of regions encompassed by the "networks" is shown in white next to each row.]
Comparing the highly modularized functional connectivity between RSNs with the average functional connectivity saliency shown in Figure 7, it is clear that the latter does not share the same modules as the former. This implies no simple alignment with intra- or internetwork connectivity. Instead, we observe that a small number of regions exhibit marked saliency to connectivity with specific networks. These are easily spotted as blue or orange lines in Figure 7, representing, respectively, average increases and decreases of functional connectivity with increasing "g." In the case of increasing connectivity with a whole network, this would indicate that a region is more integrated with that network with increasing "g." Likewise, with decreasing functional connectivity, said region would be more segregated from that specific network. The CON exhibits marked loss of intranetwork connectivity with increasing "g." Some visual, DMN, and FPN regions show uniformly decreasing connectivity to it as well.
Comparing the top row with the bottom row in Figure 7 highlights this effect: in the latter, CON intranetwork connections become weaker in the direction of increasing "g." In the case of the negative correlations shown in Figure 7c,d, which occur mostly between RSNs, bilateral CON, DMN, and FPN connectivities are highlighted. The role of networks such as the CON, DMN, and FPN in intelligence has been studied previously (Dubois et al., 2018; Hearne, Mattingley, & Cocchi, 2016). These three networks, in particular, are coupled: the CON is associated with switching levels of activity between the DMN and FPN (Sridharan, Levitin, & Menon, 2008).
After disambiguating the role of spontaneous activity, that is, temporal innovations, in the saliencies, other regions emerge in Figure 8. In particular, the inferior parietal lobules, including the angular gyri and the supramarginal gyri, have high saliency in this regard.
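The region selection underlying Figures 5 and 8 ranks regions by the temporal variance of saliencies, that is, of gradients of the output with respect to the input at each time point. A minimal numpy sketch of that ranking, assuming a toy differentiable predictor with an analytic gradient in place of the RNN ensemble (the weights, region indices, and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
T, R = 300, 8
X = rng.standard_normal((T, R))  # (time, regions) signals

# Hypothetical differentiable predictor of "g": a pointwise
# nonlinearity with per-region readout weights. Only three
# regions actually drive the output.
a = np.zeros(R)
a[[1, 4, 6]] = [1.0, -0.5, 2.0]

def predict(X):
    return np.sum(np.tanh(X) * a)

def saliency(X):
    """Analytic gradient of the output w.r.t. every input sample."""
    return a * (1.0 - np.tanh(X) ** 2)

sal_var = saliency(X).var(axis=0)    # temporal variance per region
top = np.argsort(sal_var)[::-1][:3]  # candidate region selection
# Regions with zero readout weight have exactly zero saliency
# everywhere, so the selection recovers the driving regions.
```

In the paper's setting the gradient is obtained by backpropagation through the trained RNNs rather than analytically, and the selection is repeated across folds to assess concordance.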
These two regions are involved in language streams and several higher-order cognitive functions. They are directly connected to the frontal lobe, including the dorsolateral prefrontal cortices, by the superior longitudinal fasciculus. The P-FIT places high importance on this area of the cortex, especially the left angular gyrus (Jung & Haier, 2007). Again, the left primary auditory cortex has high saliency, the only primary sensory region to do so, and it is also accommodated within the P-FIT.
On the other hand, areas such as the insulas, parahippocampal and entorhinal cortices, and the subgenual area are less pronounced in Figure 8. These areas are absent from the P-FIT (Jung & Haier, 2007), further corroborating our results.

FIGURE 6 Comparison of degradation in performance by ablation of networks. Paired t-test statistic comparing ablated results with original performance across folds. The gray ribbon represents a randomization-based distribution using random networks with the same number of nodes as the corresponding RSNs. Downward-pointing triangles represent the test statistics obtained when ablating networks defined in Figure 5. Upward-pointing triangles represent the test statistics obtained when ablating RSNs. Significant effects when compared with the randomization-based distribution are shown in green; nonsignificant results are shown in red. RSNs are ordered largest to smallest, from left to right.
We can hypothesize that regions that were highlighted in Figure 5 but not in Figure 8 attain high saliency due to their connectivity. Conversely, under this hypothesis, regions highlighted in Figure 8 but not in Figure 5 would owe their saliency to instantaneous innovations rather than to connectivity.
The smallest "networks," containing two or four regions, do not display significance in Figure 6. Low concordance was obtained across folds for these, as can be seen in Figure 5. This adds to the point that intelligence cannot be attributed to single regions. It is very likely that the choice of "networks" that degrade performance more than random, equally sized ones is not unique. While other "networks" could possibly attain the same effect, the point is that RSNs are not, in fact, specific to intelligence.
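The comparison against random, equally sized "networks" underlying Figure 6 can be framed as a randomization test: ablate the candidate set, then compare the resulting performance drop against drops obtained from many random region sets of the same size. The snippet below is a hedged toy version; the synthetic data, the stand-in "model," and the zeroing-based ablation are illustrative assumptions, not the paper's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(2)
n_subj, n_reg, k = 200, 50, 5
informative = np.arange(k)              # hypothetical candidate "network"
g = rng.standard_normal(n_subj)

# Synthetic regional features: the informative regions carry the
# target signal plus noise; all other regions are pure noise.
X = rng.standard_normal((n_subj, n_reg))
X[:, informative] += g[:, None]

def r2(y, yhat):
    return 1 - np.sum((y - yhat) ** 2) / np.sum((y - np.mean(y)) ** 2)

def ablated_r2(ablate):
    Xa = X.copy()
    Xa[:, ablate] = 0.0                       # ablation: zero the regions
    yhat = Xa[:, informative].mean(axis=1)    # stand-in for the trained model
    return r2(g, yhat)

baseline = ablated_r2([])
drop_net = baseline - ablated_r2(informative)   # ablate the candidate set
# Null distribution: drops for random, equally sized region sets.
drops_rand = np.array([
    baseline - ablated_r2(rng.choice(n_reg, k, replace=False))
    for _ in range(200)
])
p = np.mean(drops_rand >= drop_net)             # randomization p-value
```

A small `p` indicates that the candidate set degrades performance more reliably than random sets of the same size, which is the criterion applied to the "networks" of Figure 5.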
Our model has enough parameters to approximately interpolate the training data. As can be seen in the results, however, this does not lead to overfitting. This might be due to the double-descent phenomenon (Belkin, Hsu, & Mitra, 2018). According to this theory, when model capacity surpasses the interpolation threshold, the solution space of the problem is enlarged, leading to flatter solutions in parameter space. We can hypothesize that even larger, more expressive networks will not degrade performance, pointing toward possible improvements with increasing computational resources. Abrol et al. (2021) demonstrate systematically that deep models benefit from representation learning, extracting more informative features from data than hand-crafted ones. Previous benchmarks comparing deep models with traditional machine learning could thus be biased against deep models. This motivates applying deep neural networks to lower-level data, that is, timeseries instead of hand-crafted features such as RSFC. Since RNNs are very flexible nonlinear models, we cannot rule out the possibility that the RNNs could have learned a proxy measure of RSFC directly from data. RSFC is largely driven by a small number of high-amplitude, small time-scale coactivity events (Esfahlani et al., 2020), thus also reflecting dynamically rich information.

FIGURE 7 Average derivative of functional connectivity in the direction of increasing "g," standardized by the SD across samples.
Obtaining maximal performance was not our main objective.
Instead, we aimed at providing a differentiable model relating RS-fMRI activity to "g" and at extracting relevant neuroscientific insights from it. The study of brain dynamics in relation to human intelligence warrants the use of models that exploit that information. In particular, the leading linear modeling approaches are based on univariate filtering.
Univariate filtering is nondifferentiable, precluding us from relying on gradients for the study of model reliance.
We demonstrated how one can decouple instantaneous innovations from functional connectivity contributions using ZCA. It is possible to embed this knowledge into networks. Future works could then fuse models working on the functional connectivity domain and the time-activity domain simultaneously. Such models could better differentiate the roles of dynamics in interindividual variations in cognition.
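As a concrete sketch of the decoupling step, ZCA whitening maps the regional signals to multivariate uncorrelated innovations while remaining, among all whitening transforms, closest to the original signals in the least-squares sense. A minimal numpy version (variable names and the `eps` regularizer are our own choices):

```python
import numpy as np

def zca_whiten(X, eps=1e-8):
    """ZCA-whiten a (time, regions) matrix so that the output
    ("innovations") has identity covariance while staying as close
    as possible to the original signals."""
    Xc = X - X.mean(axis=0)                  # center each region
    cov = np.cov(Xc, rowvar=False)           # regions x regions covariance
    vals, vecs = np.linalg.eigh(cov)         # symmetric eigendecomposition
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA transform
    return Xc @ W

rng = np.random.default_rng(0)
# Correlated signals: white noise passed through a random mixing matrix.
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))
Z = zca_whiten(X)
# The covariance of Z is (numerically) the identity, so any saliency
# structure computed on Z reflects instantaneous innovations rather
# than shared functional connectivity.
```

Computing the temporal variance of saliencies on `Z` instead of `X` yields maps analogous to Figure 8, with the contribution of functional connectivity removed.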
We successfully reproduced several findings from the literature pertaining to brain biology in the context of general intelligence. We built a model that predicts "g" from time-distributed BOLD fMRI activity. This model attains slightly increased performance in this task, without filtering. Studying its reliance on different parts of the input information allowed us to retrieve neuroscientific insights. We also presented a method based on propagating saliencies from the data to derived measures, such as functional connectivity. Using it, we disambiguated, for example, the contributions of instantaneous innovations to model reliance. Combining an ablation-based approach with a saliency-based one allowed us to identify a set of regions that degrade performance significantly more than equally sized RSNs.

ACKNOWLEDGMENTS
This study was financed by the Coordenação de Aperfeiçoamento de

FIGURE 8 Average temporal variance of saliency calculated over whitened data, that is, multivariate uncorrelated innovations.