Longitudinal white matter changes associated with cognitive training

Abstract Improvements in behavior are known to be accompanied by both structural and functional changes in the brain. However, whether those changes lead to more general improvements, beyond the behavior being trained, remains a contentious issue. We investigated whether training on one of two cognitive tasks would lead to either near transfer (that is, improvements on a quantifiably similar task) or far transfer (that is, improvements on a quantifiably different task), and furthermore, if such changes did occur, what the underlying neural mechanisms might be. Healthy adults (n = 16, 15 females) trained on either a verbal inhibitory control task or a visuospatial working memory task for 4 weeks, over the course of which they received five diffusion tensor imaging scans. Two additional tasks served as measures of near and far transfer. Behaviorally, participants improved on the task that they trained on, but did not improve on cognitively similar tests (near transfer), nor cognitively dissimilar tests (far transfer). Extensive changes to white matter microstructure were observed, with verbal inhibitory control training leading to changes in a left‐lateralized network of frontotemporal and occipitofrontal tracts, and visuospatial working memory training leading to changes in right‐lateralized frontoparietal tracts. Very little overlap was observed in changes between the two training groups. On the basis of these results, we suggest that near and far transfer were not observed because the changes in white matter tracts associated with training on each task are almost entirely nonoverlapping with, and therefore afford no advantages for, the untrained tasks.

(UK) taxi drivers, who have achieved a high level of expertise in spatially navigating the city, have greater hippocampal gray matter volume than controls (Woollett & Maguire, 2011a), while studying for the Law School Admission Test leads to white matter changes within the frontoparietal network (Mackey, Whitaker, & Bunge, 2012). Structural and functional changes after learning a cognitive task in the laboratory have also widely been observed (Lampit, Hallock, Suo, Naismith, & Valenzuela, 2015;Thompson, Waskom, & Gabrieli, 2016).
One interesting observation arising from this literature is that training on a cognitive task typically results in improvements on that task, but not on others, even when they are quite closely related.
As an example, in a well-known series of studies involving taxi drivers (Maguire, Woollett, & Spiers, 2006; Woollett & Maguire, 2011b), the drivers were better at navigating through London, although they were generally no better at other cognitive tasks. In fact, they performed worse than controls in acquiring new visuospatial information (Maguire et al., 2006). This speaks to one of the most contentious issues in the cognitive training literature; that is, whether training on cognitive tasks can lead to generalized improvements in cognition.
While some authors have suggested that training on one cognitive task leads to improvements on unrelated tasks (Au et al., 2015;Brem et al., 2018;De Lillo, Brunsdon, Bradford, Gasking, & Ferguson, 2021;Flegal, Ragland, & Ranganath, 2019;Jaeggi, Buschkuehl, Jonides, & Perrig, 2008;Kattner, 2021;Li et al., 2021;Olfers & Band, 2018;Studer-Luethi & Meier, 2020), a considerable number of studies have failed to show such effects (Owen et al., 2010;Stojanoski et al., 2020;Stojanoski, Lyons, Pearce, & Owen, 2018). Indeed, while the holy grail of the cognitive training literature is so-called "far transfer" (i.e., where training on one task improves performance on a completely unrelated second task), many studies have failed to even demonstrate "near transfer" (i.e., when training on one task improves performance on a similar, cognitively related second task) (Simons et al., 2016). Whether cognitive training "works" or not will likely remain a contentious issue. Regardless, what is lacking from the literature is any clear neuroscientific explanation for how cognitive training would lead to far transfer, assuming it occurs at all.
If we first take the position that cognitive training, in general, does not lead to any transfer, we can generate a number of hypotheses for why that may be the case. Unlike many aspects of motor control, cognitive processes rarely have a one-to-one mapping between the behavioral tasks used to test them and structures in the brain. For example, there is no "digit span" area of the brain comparable to the brain regions that are known to initiate a simple motor function such as finger tapping (Cramer, Finklestein, Schaechter, Bush, & Rosen, 1999). Cognitive processes generally recruit extensive networks of brain regions, none of which are uniquely and singularly devoted to performing a specific cognitive task (Crittenden & Duncan, 2014;Crittenden, Mitchell, & Duncan, 2016;Duncan, 2010;Duncan & Owen, 2000). Even though training on a cognitive task evokes changes at the network level (Bassett et al., 2011;Finc et al., 2020), it is possible that transfer does not occur between tasks because the networks that drive those tasks are not similar enough for changes in one to improve the other. To illustrate this point, consider the relationship between spatial span and digit span. Although on the surface they appear to be similar tasks, they differ in both obvious and less obvious ways. Clearly, they differ in terms of modality of the memoranda; however, it is also known that the strategies that are adopted to solve these tasks are quite different (Bor, Cumming, Scott, & Owen, 2004;Bor, Duncan, Wiseman, & Owen, 2003). While the most common strategy that improves performance on digit span is chunking, where several numbers are grouped together into single units to be remembered, spatial span relies more heavily on pattern recognition (for discussion, see Owen, 1997). These differences may outweigh any surface similarities that these two tasks may have (e.g., both tasks rely on storing and repeating ordered information).
Let us now take the alternate position that cognitive training on one task does lead to benefits on other tasks, and consider what the neuroscientific mechanism for that might be. It is possible that the two tasks (the trained and the untrained) recruit networks that are similar enough that strengthening one is sufficient to lead to performance enhancements on the other. This leads to two testable predictions: first, if training on a task leads to transfer, those two tasks should recruit largely overlapping brain networks. Second, for quantitatively similar cognitive tasks, a significant amount of network overlap and a correspondingly large amount of transfer should be observed. In contrast, cognitively dissimilar tasks should involve less network overlap, and thus, less transfer.
An important step in testing these predictions is not to choose tasks based on an intuitive sense that they are similar or different (e.g., see Digit Span and Spatial Span example above). Fortunately, it is possible to select tasks based on quantifiable measures of similarity between them, thereby operationalizing how the transfer is defined and measured. For example, based on a factor analysis computed by Hampshire, Highfield, Parkin, and Owen (2012) that grouped 12 cognitive tasks into three functionally and anatomically distinct neural components, the assigned factor and the corresponding factor loadings of each cognitive task can be used to guide the choice of training and transfer tasks. Using that method, we selected a test of inhibitory control as a training task for the current study, and a test of grammatical reasoning to assess near transfer because, among a group of 44,000 participants, they loaded heavily on the same factor (Hampshire et al., 2012). Similarly, a test of spatial working memory was selected as a second training task (in a separate group of individuals), and a test of spatial span was used as a second test of near transfer because, in the same group of 44,000 participants, they loaded heavily on the same factor as one another. Importantly, these two sets of tests loaded on entirely different factors and were functionally and anatomically dissociable, confirming that there is very little overlap between the demands of each. Therefore, the two near transfer tasks (Grammatical Reasoning and Spatial Span) also serve as ideal measures of far transfer for the spatial working memory task and inhibitory control task, respectively.
Based on previous work (Caeyenberghs, Metzler-Baddeley, Foley, & Jones, 2016;Lampit et al., 2015;Thompson et al., 2016), we hypothesized that training on a cognitive task will lead to behavioral improvements on that task that are associated with specific structural changes in the brain's white matter microstructure. Second, based on our own previous work (Owen et al., 2010;Stojanoski et al., 2018;Stojanoski et al., 2020) we hypothesized that training on a cognitive task will not lead to improvements on a second cognitive task, regardless of whether it is quantifiably similar (near-transfer) or quantifiably different (far-transfer). Finally, we hypothesized that this lack of transfer effect is underpinned by the different structural changes that are associated with learning the two training tasks. Specifically, we expected that improvements in spatial working memory with training would lead to structural changes that are nonoverlapping with improvements observed with inhibitory control training.

| Training procedure
Participants were randomly assigned to train on either "double trouble," a modified Stroop test of inhibitory control described below, or "self-ordered search," a visuospatial working memory task, also described below. They completed three to five at-home training sessions per week and five in-person scans, and a schematic of the training protocol is provided in Figure 1. Both resting-state and task-based fMRI data were also collected, but were not analyzed for the current study. During the first scanning session, the task instructions were explained to participants and any questions were answered to ensure full understanding. Additionally, text and visual instructions were presented at the beginning of each training session as a reminder.

| At-home training
All participants completed a minimum of three and a maximum of five training sessions per week over the course of 4 weeks, with at least 24 hr between each session, but no more than 72 hr without training.
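The scheduling constraints above can be expressed as a small compliance check. This is only an illustrative sketch, not the procedure used in the study: the function name, the idea of passing a list of session timestamps, and the definition of a training week as a 7-day window counted from a `start` time are all assumptions made for the example.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(hours=24)   # at least 24 hr between sessions
MAX_GAP = timedelta(hours=72)   # no more than 72 hr without training
MIN_PER_WEEK, MAX_PER_WEEK = 3, 5


def check_schedule(sessions, start):
    """Return a list of protocol violations (empty if the schedule complies).

    sessions: list of datetime objects, one per at-home session (hypothetical input)
    start:    datetime marking the start of training week 1 (assumed definition)
    """
    problems = []
    ordered = sorted(sessions)

    # Gap rule: consecutive sessions must be 24-72 hr apart.
    for prev, curr in zip(ordered, ordered[1:]):
        gap = curr - prev
        if gap < MIN_GAP:
            problems.append(f"gap too short before {curr:%Y-%m-%d %H:%M}")
        elif gap > MAX_GAP:
            problems.append(f"gap too long before {curr:%Y-%m-%d %H:%M}")

    # Weekly count rule: 3-5 sessions in each of the 4 training weeks.
    for week in range(4):
        lo = start + timedelta(weeks=week)
        hi = lo + timedelta(weeks=1)
        n = sum(lo <= s < hi for s in ordered)
        if not MIN_PER_WEEK <= n <= MAX_PER_WEEK:
            problems.append(f"week {week + 1}: {n} sessions")
    return problems
```

A Monday/Wednesday/Friday schedule satisfies both rules, because the Friday-to-Monday gap is exactly 72 hr.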
Each training session took approximately 30 min and was completed at home using the Cambridge Brain Sciences online platform at cambridgebrainsciences.com (Hampshire et al., 2012). Participants were asked to train for a set amount of time, as opposed to a set number of trials, because the trial structure of the tasks differs. That is, double trouble is a timed task with many short trials per test, whereas self-ordered search is untimed and trials generally take much longer to complete. Thus, the most comparable measure to ensure similar amounts of exposure between groups was to use the amount of time spent training. Participants received an automated daily email reminding them to complete their at-home training, and compliance was confirmed in person at each scanning session.

| In-scanner training
Participants completed five imaging sessions throughout their cognitive training to track changes in their white matter microstructure.
The first imaging session was completed prior to the start of their cognitive training. The subsequent four imaging sessions occurred with the at-home training sessions between each. During all imaging sessions participants underwent a structural MRI, DTI, and resting-state fMRI scan before completing their training task for approximately 30 min. During the first and last scanning sessions, participants also completed two tasks used to measure cognitive transfer, described below. To eliminate practice effects, participants were allowed to practice the transfer tasks twice each while in the scanner, and then given a third attempt, which was taken as their measurement of performance on that test. The two practice trials ensured that they were familiar with the task and with using the scanner-compatible ball mouse.
FIGURE 1 Study procedure. Five scanning sessions were completed over the course of 4 weeks, with 1 week in between each session. Participants trained on the task at home between scans.

| Tasks
Four computerized cognitive tasks from an online cognitive testing battery (cambridgebrainsciences.com) were used in this study. The two training tasks, Double Trouble and Self-Ordered Search, were chosen as they tax different cognitive domains, specifically inhibitory control and verbal abilities, and visuospatial processing/working memory, respectively (see Hampshire et al., 2012). Using tasks that tap into functionally and anatomically dissociable cognitive domains allowed us to determine whether functional and structural changes following cognitive training are specific to the training task, are more domain-general changes, or are some combination of both.
Two additional tasks were used as measures of cognitive transfer.
The first task was the "Grammatical Reasoning" task from the Cambridge Brain Sciences battery, which was used to assess near transfer for the Double Trouble training group and far transfer for the Self-Ordered Search group. In a factor analysis of 44,000 participants, Grammatical Reasoning loaded heavily on the same factor as Double Trouble (factor loadings = 0.66 and 0.51, respectively) and, through fMRI, was shown to recruit a similar functional network in the brain (Hampshire et al., 2012). In contrast, the Self-Ordered Search task did not load heavily on that factor (factor loading = 0.16) and recruited a different functional network. The second task, "Spatial Span," is another test of spatial working memory and was used to assess near transfer for the Self-Ordered Search group and far transfer for the Double Trouble group. In the same factor analysis, Spatial Span loaded heavily on the same factor as Self-Ordered Search (factor loadings = 0.69 and 0.62, respectively), and the Double Trouble task did not load heavily on that factor (factor loading = 0.22). These tasks were completed in-scanner prior to completing the training task during the first scanning session, and after completing the training task during the last scanning session.

| Double Trouble
The Double Trouble task is a modified version of the classical Stroop test (Stroop, 1935) and measures inhibitory control, verbal ability, and attention. On each trial, a probe word "red" or "blue" is displayed at the top of the screen in either red or blue font. The participant must select the word at the bottom of the screen that describes the font color of the probe word, inhibiting any response based on the probe word ("red" or "blue") itself. The two response choices are the words "red" or "blue" and are also displayed in either red or blue font, which "doubles" the inhibitory load because participants have to inhibit any response based on font color. The word and the color of the font may be congruent or incongruent for both the probe and the answer choices. The participant is given 90 s to complete as many trials as possible. Their score increases by one point each time they make a correct response and decreases by one point each time they make an incorrect response. Participants assigned to this condition completed approximately 20 90-s runs within each 30-min training period.
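The response rule and scoring described above can be sketched in a few lines. This is a minimal illustration, not the Cambridge Brain Sciences implementation: the `ColoredWord` type, the function names, and the tuple layout of a trial are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColoredWord:
    text: str   # the word shown, "red" or "blue"
    font: str   # the font color it is drawn in, "red" or "blue"


def correct_choice(probe, left, right):
    """Return the response option whose *text* names the probe's *font* color."""
    for option in (left, right):
        if option.text == probe.font:
            return option
    raise ValueError("no option names the probe's font color")


def score_run(trials):
    """Score one run: +1 per correct response, -1 per incorrect response.

    trials: iterable of (probe, left, right, chosen) tuples (assumed layout)
    """
    score = 0
    for probe, left, right, chosen in trials:
        score += 1 if chosen == correct_choice(probe, left, right) else -1
    return score
```

For a probe reading "red" in blue font, the correct option is the one whose text is "blue", whatever font color that option is drawn in. Both the probe's word and the options' font colors must be ignored.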

| Self-Ordered Search
The Self-Ordered Search task assesses visuospatial working memory (Owen, Downes, Sahakian, Polkey, & Robbins, 1990). A set of squares in random positions within an invisible five-by-five grid is displayed on the screen. Participants click on squares, which "open" to reveal whether there is a "token" inside. After a token is found, it is hidden within another square, and the participant must locate it again. Within any trial, a square will never be used to hide a token more than once and the number of tokens that must be found in each trial equals the number of squares in that trial. Participants must avoid squares in which they have already discovered a token and squares they have already searched while looking for the current token. If they re-click a previously discovered or searched square the trial ends and the next trial begins with one less square in the grid. If they find all tokens without making an error, a new trial begins with one extra square in the grid. The round begins with four squares and ends after three errors have been made. Their final score is equal to the maximum level they achieved. As this task is not timed, the number of runs completed during a 30-min period depended on participant performance.
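The adaptive difficulty rule above (one more square after an error-free trial, one fewer after an error, stop at three errors) can be replayed from a list of trial outcomes. The sketch below is illustrative only. It assumes that a level counts as "achieved" once a trial at that level is completed without error, and that the grid never drops below one square; neither detail is stated in the text.

```python
def self_ordered_search_score(outcomes, start_squares=4, max_errors=3):
    """Replay trial outcomes and return the maximum level achieved.

    outcomes: iterable of booleans, True if every token in the trial was
              found without re-clicking a discovered or searched square.
    """
    squares = start_squares
    best = 0      # assumption: a level is "achieved" once completed without error
    errors = 0
    for success in outcomes:
        if success:
            best = max(best, squares)
            squares += 1                    # next trial has one extra square
        else:
            errors += 1
            if errors == max_errors:        # round ends after three errors
                break
            squares = max(1, squares - 1)   # assumed floor of one square
    return best
```

For example, the outcome sequence success, success, error, success, error, error yields a score of 5: the participant completes the four- and five-square trials, fails twice at six squares and once at five.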

| Grammatical Reasoning
Grammatical Reasoning is based on Alan Baddeley's 3-min grammatical reasoning test (Baddeley, 1968) and assesses verbal reasoning. On each trial, a written statement regarding two shapes is displayed on the screen, and the participant must indicate whether it correctly describes the shapes pictured below. The participant has 90 s to complete as many trials as possible. A correct response increases the total score by one point, and an incorrect response decreases the score by one point.

| Spatial Span
Spatial Span is based on the Corsi Block Tapping Task, a tool for measuring spatial short-term memory capacity. Sixteen purple boxes are displayed in a grid. A sequence of randomly selected boxes turns green one at a time (900 ms per green square). Participants must then repeat the sequence by clicking boxes in the same order. Difficulty is varied dynamically: a correct response increases the length of the next sequence by one square, and an incorrect response decreases the sequence length. The test finishes after three errors. The score is the length of the longest sequence successfully remembered.

| Behavioral data analysis
Behavioral data were analyzed using R for statistical computing (R Core Team, 2020). Because participants completed multiple rounds of their assigned task during each scanning/training session, the single maximum score was used as that training day's value. Then, because people trained a different number of times per week, and to accommodate natural minor fluctuations in performance from day to day, we took the maximum score for each participant for training weeks two, three, and four as their overall measure of performance for those weeks, which was then used for data analysis. To assess participants' trends of learning across their training sessions, scatterplots were created with each week's highest score plotted over time for each participant. Curve estimation was used to fit linear and logarithmic models to the data to determine the nature of learning trends. Paired-samples t-tests were performed to determine whether logarithmic or linear models fit the learning trend data better for each of the two groups.
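The comparison of linear and logarithmic learning curves can be illustrated with ordinary least squares, since a logarithmic model is a straight line in ln(week). This is a sketch under stated assumptions, not the analysis code (the study used R): the function names and the example data are hypothetical, and it requires that scores are not all identical.

```python
import math


def r_squared(x, y):
    """R^2 of the ordinary least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    # For a one-predictor model with intercept, R^2 equals the squared correlation.
    return (sxy ** 2) / (sxx * syy)


def compare_fits(weeks, scores):
    """Return (R^2 linear, R^2 logarithmic) for one participant's weekly maxima."""
    linear = r_squared(weeks, scores)
    # A logarithmic model y = a + b*ln(x) is a straight line in ln(x).
    logarithmic = r_squared([math.log(w) for w in weeks], scores)
    return linear, logarithmic
```

Applied per participant, this gives one pair of R² values each, which can then be compared across participants within a group.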
To quantify overall task improvement due to cognitive training, and to determine whether there was a significant difference in the amount of learning between the Double Trouble and Self-Ordered Search groups, we conducted a 2 × 2 linear mixed model on scores from Week 1 and Week 5. Scores were first transformed to z-scores using population means and standard deviations derived from Wild, Nichols, Battista, Stojanoski, and Owen (2018) to allow for comparison between tests, and the model was built with group (Double Trouble/Self-Ordered Search) and time (Week 1/Week 5) as binary regressors and participants as a random effect.
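One way to write this model out is given below. It is a sketch rather than the fitted specification: it assumes a random intercept for each participant and includes a group-by-time interaction term. The symbols are introduced here for illustration, with $x_{ij}$ the raw score of participant $i$ at time $j$ and $\mu_{\mathrm{pop}}$, $\sigma_{\mathrm{pop}}$ the population mean and standard deviation for the relevant test.

```latex
z_{ij} = \frac{x_{ij} - \mu_{\mathrm{pop}}}{\sigma_{\mathrm{pop}}}, \qquad
z_{ij} = \beta_0 + \beta_1\,\mathrm{group}_i + \beta_2\,\mathrm{time}_j
       + \beta_3\,(\mathrm{group}_i \times \mathrm{time}_j) + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here $\beta_3$ captures whether the change from Week 1 to Week 5 differs between training groups, and $u_i$ absorbs stable between-participant differences.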
To assess near transfer, we conducted a linear mixed model on near transfer task scores (that is, Spatial Span scores for the Self-Ordered Search training group, and Grammatical Reasoning scores for the Double Trouble training group). Scores were again first transformed to z-scores using population means and standard deviations (Wild et al., 2018), and the model was constructed with group (Self-Ordered Search/Double Trouble) and time (pre-training/post-training) as regressors and participants as a random effect. To assess far transfer, a similar model was constructed but with far transfer task scores (that is, Grammatical Reasoning scores for the Self-Ordered Search training group, and Spatial Span scores for the Double Trouble training group).

| Neuroimaging data acquisition
Imaging data were acquired using a 3T Siemens Prisma scanner (Erlangen, Germany) and 32-channel head coil at the Centre for Functional and Metabolic Mapping (Robarts Research Institute, Western University, London, Canada). Whole-brain T1-weighted structural images (repetition time (TR) = 2,300 ms, echo time (TE) = 2.98 ms, field of view (FOV) = 256 mm, 256 × 256 matrix, slice thickness = 1 mm, 176 slices) were first obtained. Diffusion-weighted images were acquired in the transverse plane using a single-shot sequence (84 slices with 2 mm slice thickness, voxel size = 2 × 2 mm in-plane, field of view = 210 mm, 137 diffusion directions with b = 2000 s/mm², TR = 4 s, TE = 59.20 ms; GRAPPA acceleration factor = 2).
FIGURE 2 Amount of improvement across the at-home training sessions for (a) Self-Ordered Search and (b) Double Trouble
FIGURE 3 Task improvement from beginning of cognitive training to the end, for the Self-Ordered Search (SOS) and Double Trouble (DT) groups. The first and third quartiles are marked by the lower and upper edges of the boxes, respectively. Lower and upper whiskers extend to the smallest and largest value, respectively, within 1.5 times the interquartile range. Outlying values beyond these ranges are plotted individually

| Anatomical data preprocessing
A total of five T1-weighted (T1w) images were present in the input BIDS dataset. All of them were corrected for intensity non-uniformity (INU) with N4BiasFieldCorrection (Tustison et al., 2010), distributed with ANTs 2.2.0 (Avants, Epstein, Grossman, & Gee, 2008) (RRID: SCR_004757). The T1w-reference was then skull-stripped with a Nipype implementation of the antsBrainExtraction.sh workflow (from ANTs), using OASIS30ANTs as target template. Brain tissue segmentation of cerebrospinal fluid (CSF), white-matter (WM) and gray-matter (GM) was performed on the brain-extracted T1w using fast (FSL 5.0.9, RRID:SCR_002823) (Zhang, Brady, & Smith, 2001). A T1w-reference map was computed after registration of 5 T1w images (after INU-correction) using mri_robust_template (FreeSurfer 6.0.1) (Reuter, Rosas, & Fischl, 2010). Volume-based spatial normalization to two standard spaces (MNI152NLin2009cAsym, MNI152NLin6Asym) was performed through nonlinear registration with antsRegistration (ANTs 2.2.0), using brain-extracted versions of both T1w reference and the T1w template. The following templates were selected for spatial normalization: ICBM 152 Nonlinear Asymmetrical template version 2009c (Fonov, Evans, McKinstry, Almli, & Collins, 2009)  processing workflow. For more details of the pipeline, see the section corresponding to workflows in fMRIPrep's documentation.
FIGURE 4 Changes in FA across all scanning sessions relative to baseline. (a) In red, we show the areas in which the changes from baseline are significantly larger in the Double Trouble training group than in the Self-Ordered Search training group. (b) In blue, we show the areas in which the changes from baseline are significantly larger in the Self-Ordered Search training group than in the Double Trouble training group. Clusters have been thickened for visualization using tbss_fill, and results are overlaid on the FMRIB58_FA template and the mean skeletonized FA data of the current sample

| Diffusion-weighted images preprocessing
Diffusion-weighted images were first checked for quality using DTIPrep (Oguz et al., 2014), an automated toolkit. Images were first converted to NRRD file format and checked for header information including correct image dimensions, spacing, and orientation. DTIPrep then ensures correct diffusion gradient orientations and b-values.
Rician noise removal was performed, followed by artifact detection and removal. Images were then co-registered to an iterative average over all baseline images, followed by eddy-current correction and motion correction, including gradient direction adjustments. A second round of motion detection was then performed to ensure that registration was successful. Mean translational motion was 1.35 mm (SD = 0.33, range = 0.73-2.28), and mean rotational motion was 0.008 (SD = 0.002, range = 0.005-0.017).

| DTI analysis
Voxelwise statistical analysis of the fractional anisotropy (FA) data was carried out using TBSS (Tract-Based Spatial Statistics) (Smith et al., 2006), which is part of FSL (Smith et al., 2004). First, FA images were created by fitting a tensor model to the raw diffusion data using FDT, and then brain-extracted using BET (Smith, 2002). All subjects' FA data were then aligned into the FMRIB58_FA standard space using the nonlinear registration tool FNIRT (Andersson, Jenkinson, & Smith, 2007a;Andersson, Jenkinson, & Smith, 2007b), which uses a b-spline representation of the registration warp field (Rueckert et al., 1999). Next, the mean FA image was created and thinned to create a mean FA skeleton which represents the centers of all tracts common to the group. Each subject's aligned FA data were then projected onto this skeleton and the resulting data fed into voxelwise cross-group statistics.
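For readers unfamiliar with the measure, FA at a voxel is computed from the three eigenvalues of the fitted diffusion tensor. The sketch below uses the standard textbook definition and is not FSL's code; the function name and the example eigenvalues are illustrative assumptions.

```python
import math


def fractional_anisotropy(l1, l2, l3):
    """Fractional anisotropy from the three eigenvalues of a diffusion tensor.

    FA = sqrt(3/2) * sqrt(sum((l_i - mean)^2)) / sqrt(sum(l_i^2))
    """
    mean = (l1 + l2 + l3) / 3.0
    num = (l1 - mean) ** 2 + (l2 - mean) ** 2 + (l3 - mean) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    if den == 0:
        return 0.0  # FA is undefined with no diffusion; returning 0 is a choice made here
    return math.sqrt(1.5 * num / den)
```

FA is 0 when diffusion is equal in all directions and approaches 1 when diffusion occurs along a single axis, which is why it is used as an index of white matter microstructure.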
Group comparisons were run using Permutation Analysis of Lin- Because the contrasts only showed whether the difference between days was bigger in one group or the other, areas of significant change were used as a mask on difference FA maps to determine the direction of the change.
To assess the amount of overlap in changes to white matter microstructure between the two tasks, a conjunction analysis was conducted using Scan 5-Baseline contrasts, thresholded at p = .01 (corrected), for each group. A final set of analyses was run to investigate whether the FA in areas that showed group differences was predicted by percent change from the beginning to end of training.
First, Scan 5-Baseline difference FA maps were calculated. We then conducted correlations using randomise (Winkler et al., 2014), with 10,000 permutations, between scores and FA values, with search space restricted to areas that had shown a significant difference between groups from Baseline to Scan 5. Results of the correlations were thresholded at p = .01.
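The logic of a permutation-based correlation can be shown for a single voxel. This is a simplified sketch, not the voxelwise procedure run on the skeleton: it applies no correction across voxels, and the function names, the seed, and any data passed in are assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_p(x, y, n_perm=10_000, seed=0):
    """Two-sided permutation p-value for the correlation between x and y.

    x: behavioral change scores, y: FA change at one voxel (hypothetical inputs)
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)   # break the pairing between score and FA change
        if abs(pearson_r(x, shuffled)) >= observed:
            hits += 1
    # Count the observed labelling as one permutation so p is never exactly 0.
    return (hits + 1) / (n_perm + 1)
```

Shuffling one variable builds the null distribution directly from the data, which avoids distributional assumptions that are hard to justify with small samples.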

| Cognitive training and associated learning
To assess learning trends among participants, we plotted scores achieved during cognitive training over time (Figure 2). Curve estimation was applied to each participant's learning trend to determine whether the data fit better to a linear or logarithmic model. The "better fit" model was defined as the model whose curve estimation regression analysis returned a greater coefficient of determination (R²) value. An ANOVA comparing the two fits within the Double Trouble group confirmed that the R² associated with logarithmic models (R² = .79) was significantly higher than linear models (R² = .62; F(1,38) = 31.77, p < .001), indicating that logarithmic models fit the participants' learning trend data better. For the Self-Ordered Search group, there was no significant difference between the R² associated with linear models (R² = .327) and that of the logarithmic models (R² = .333; F(1,38) = 0.33, p = .569), indicating that the logarithmic model did not fit the data better than the linear model.
FIGURE 5 Areas showing group differences in the change in FA from Baseline to Scan 5. In red, we show the areas in which the changes from Scan 1 to Scan 5 are significantly larger in the Double Trouble training group than in the Self-Ordered Search training group. In blue, we show the areas in which the changes from Baseline to Scan 5 are significantly larger in the Self-Ordered Search training group than in the Double Trouble training group. Changes uniquely associated with Double Trouble were largely within the left inferior occipitofrontal and longitudinal fasciculi, while changes associated with Self-Ordered Search were largely within the right superior longitudinal fasciculus. Clusters have been thickened for visualization using tbss_fill, and results are overlaid on the FMRIB58_FA template and the mean skeletonized FA data of the current sample
We next wanted to confirm that both groups showed learning effects, and to determine whether the groups differed in the amount of learning that occurred over the cognitive training period. The results are shown in Figure 3. A linear mixed model showed a main effect of group (F(1,14) = 7.54, p = .016), a main effect of time (F(1,14) = 132.08, p < .001) and a significant group x time interaction (F(1,14) = 17.28, p < .001). Post hoc contrasts confirmed that there was a significant improvement from Week 1 to Week 5 in both the Self-Ordered Search group (t(14) = 5.19, p < .001) and the Double Trouble group (t(14) = 11.07, p < .001), and the effects of training were more pronounced for the latter group than the former.

| Transfer effects
We next examined whether training on Double Trouble and Self-

| DTI analysis
Between-group differences comparing each follow-up scan to Baseline are shown in Figure 4. Additionally, within-group differences comparing each follow-up scan to Baseline are shown in Figures S1 and S2. Results of the statistical analyses between Scan 5 and Baseline and Scan 2 and Baseline are reported below.

| Day 5-Baseline contrasts
To assess which regions of the white matter skeleton showed changes in FA across the entire training protocol for each group, between-groups t-contrasts were conducted on the difference between Scan 5 and Baseline. Because there were a large number of significant clusters (i.e., > 300), we report here the 35 most significant tracts for each contrast, and full results are presented in the supplemental materials (Tables S1 and S2 In contrast, the Self-Ordered Search training group showed significantly greater changes in FA than the Double Trouble training group in a right-lateralized group of regions in the dorsolateral prefrontal and parietal areas of the brain (Table 2). Specifically, the frontal section of the right superior longitudinal fasciculus showed the largest changes.
There were also significant decreases in FA in a posterior section of the left superior longitudinal fasciculus within the parietal lobe, as well as the right corticospinal tract underlying the supplementary motor cortex.

| Day 2-Baseline contrasts
Because the Double Trouble training group showed a large improvement in behavioral scores within the first week, we also ran between-group contrasts on the Scan 2-Baseline differences in FA (Table 3, Figure 6). The Double Trouble training group again showed left-lateralized changes, primarily in the inferior occipitofrontal fasciculus, uncinate, and forceps minor. There were also more extensive changes in the left ILF than those in Scan 5. The left forceps major also showed extensive changes in FA, extending toward, but not crossing, the corpus callosum.
The Self-Ordered Search training group again showed a large cluster of changes in FA in the right dorsolateral prefrontal area of the superior longitudinal fasciculus and the posterior temporal SLF, as well as the white matter underlying the right parietal lobe (Table 4). Additionally, there was a significant cluster in the right body of the corpus callosum; however, it did not cross the midline.

| Conjunction analysis
To assess the degree of overlap in the changes to white matter microstructure over the course of training between the two groups, we performed a conjunction analysis. Results are shown in Figure 7 and Table 5. Very few tracts showed significant overlap between Double Trouble and Self-Ordered Search groups, including the corticospinal tract and tracts underlying the primary auditory cortex. Additionally, sections of the anterior thalamic radiation and inferior occipitofrontal fasciculus overlapped between groups, as did small regions within the superior parietal lobe and the forceps major.
FIGURE 6 Areas showing group differences in the change in FA from Baseline to Scan 2. In red, we show the areas in which the changes from Baseline to Scan 2 are significantly larger in the Double Trouble training group than in the Self-Ordered Search training group. In blue, we show the areas in which the changes from Baseline to Scan 2 are significantly larger in the Self-Ordered Search training group than in the Double Trouble training group. Clusters have been thickened for visualization using tbss_fill, and results are overlaid on the FMRIB58_FA template and the mean skeletonized FA data of the current sample.
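As a rough illustration of the conjunction analysis in this section, the following Python sketch marks a voxel as shared only when it is significant in both training groups. It assumes that each group's result is available as an array of corrected p-values on a common skeleton; the function name, array names, threshold default, and toy values are all hypothetical, and the sketch is not the PALM-based pipeline used in the study.

```python
import numpy as np

def conjunction_mask(p_group_a, p_group_b, alpha=0.01):
    """Return a boolean mask of voxels significant in BOTH maps.

    Thresholding the voxelwise maximum p-value is equivalent to
    intersecting the two individually thresholded maps.
    """
    p_group_a = np.asarray(p_group_a, dtype=float)
    p_group_b = np.asarray(p_group_b, dtype=float)
    if p_group_a.shape != p_group_b.shape:
        raise ValueError("p-value maps must share the same shape")
    return np.maximum(p_group_a, p_group_b) < alpha

# Toy example: hypothetical corrected p-values at five skeleton voxels.
p_double_trouble = np.array([0.001, 0.200, 0.004, 0.500, 0.009])
p_self_ordered = np.array([0.300, 0.002, 0.008, 0.600, 0.020])

overlap = conjunction_mask(p_double_trouble, p_self_ordered)
n_overlap = int(overlap.sum())  # number of voxels shared by both groups
```

Because a voxel must survive thresholding in both maps, the shared set can only be as large as the smaller of the two individual maps, which is why extensive but spatially distinct changes can yield almost no conjunction.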

| Correlations between FA and behavioral score
To assess whether FA changes between Baseline and Scan 5 were predicted by behavioral difference scores, we conducted Pearson correlations for each group. It is important to note that due to the small sample size, these results should be interpreted with caution. As can be seen in Figure 8 and the left genu and splenium of the corpus callosum, in addition to motor pathways including white matter underlying the primary motor area. Improvement on Self-Ordered Search also correlated with changes in FA in several tracts, including the bilateral superior longitudinal fasciculus. Additional correlations were seen in midbrain and frontal pathways, including the anterior thalamic radiation, thalamus, and the inferior longitudinal fasciculus, as well as in motor pathways such as the posterior limb of the internal capsule.
One voxel in the retrolenticular part of the internal capsule showed overlap between the two groups. Additionally, the two groups showed close but nonoverlapping correlations in the pontine crossing tract and the forceps minor.
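For readers unfamiliar with the statistic, the sketch below shows how a Pearson correlation between FA change and behavioral change could be computed for one cluster within one group. The values are invented for illustration, the function and variable names are hypothetical, and the study's voxelwise analysis with its correction for multiple comparisons is not reproduced here.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D arrays of equal length")
    xc = x - x.mean()  # center each variable
    yc = y - y.mean()
    denom = np.sqrt((xc @ xc) * (yc @ yc))
    if denom == 0:
        raise ValueError("correlation is undefined for constant input")
    return float((xc @ yc) / denom)

# Hypothetical data: Scan 5 minus Baseline FA in one cluster, and the
# matching behavioral difference score, for six made-up participants.
fa_change = np.array([-0.012, -0.008, -0.015, -0.003, -0.010, -0.006])
behav_change = np.array([14.0, 9.0, 17.0, 4.0, 11.0, 8.0])

r = pearson_r(fa_change, behav_change)
```

With so few observations per group, a single participant can shift r substantially, which is the reason for the caution urged above.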

| DISCUSSION
In this study, we set out to investigate whether training on one of two cognitive tasks would lead to either near transfer (that is, improvements on a quantifiably similar task) or far transfer (i.e., improvements on a quantifiably different task), and furthermore, if such changes exist, what the underlying neural mechanisms might be. Behaviorally, participants who trained on a spatial working memory task improved on that task over time, but did not improve on a cognitively similar test of spatial span, or a cognitively dissimilar test of grammatical reasoning. Likewise, participants who trained on a test of inhibitory control improved on that task, but did not improve on a related test of grammatical reasoning, or a cognitively dissimilar test of spatial span.
As such, these results add to the body of work demonstrating that cognitive training does not "work" in the sense that improvements with training in young healthy participants on cognitive tasks do not appear to generalize to other cognitive domains (Owen et al., 2010; Stojanoski et al., 2018; Stojanoski et al., 2020). What, then, might be a mechanistic explanation for why such training affords no generalized cognitive advantages?
To address this question, we examined changes to white matter microstructure (by measuring FA) over the course of five scanning sessions spread over the training period of 4 weeks. As participants trained and behaviorally improved on the primary tasks, significant changes in FA were observed in both participant groups (those who trained on Self-Ordered Search and those who trained on Double Trouble) over the course of the training period. Specifically, relative to training on Double Trouble, training on Self-Ordered Search revealed changes in integrity in the superior longitudinal fasciculus and other white matter tracts underlying frontal and parietal areas of the brain, particularly in the right hemisphere. This task has been shown to be highly sensitive to neurosurgical excisions of the frontal lobe (Owen et al., 1990) and specifically activates the mid-dorsolateral prefrontal cortex and the posterior parietal cortex in healthy participants (Owen, Evans, & Petrides, 1996). Moreover, the right mid-dorsolateral prefrontal cortex has been shown to be involved in implementing task-specific strategies that improve performance through the adoption of a repetitive searching pattern of behavior (Owen et al., 1990; Owen et al., 1996).
It is perhaps then not surprising that repetitive training on this task leads to white matter changes in a network of tracts that connect and support the functioning of these two regions.
FIGURE 7 Brain regions that showed significant changes from Baseline to Scan 5 for both Double Trouble and Self-Ordered Search. Significant regions included the primary auditory area, the anterior thalamic radiation, the corticospinal tract, and the forceps major, as well as one region each within the inferior occipitofrontal fasciculus and the superior parietal lobe. Clusters have been thickened for visualization using tbss_fill, and results are overlaid on the FMRIB58_FA template and the mean skeletonized FA data of the current sample.
Patients with damage to the orbitofrontal cortex also exhibit failures of inhibitory control, including OCD (Abe et al., 2015; Maia, Cooney, & Peterson, 2008), obsessive gambling behavior (Cavedini, Riboldi, Keller, D'Annucci, & Bellodi, 2002), alcoholism (Medina et al., 2008; Volkow et al., 1993), and sexual disinhibition (Gorman & Cummings, 1992; Miller, Cummings, McIntyre, Ebers, & Grode, 1986). There is also a substantial literature in nonhuman primates detailing the relationship between orbitofrontal lesions and failures of inhibitory control (McEnaney & Butter, 1969; Oikonomidis et al., 2017; Wallis, Dias, Robbins, & Roberts, 2001). Again, the fact that we observed white matter changes in tracts that support connectivity to this region while participants were training and improving on a test of inhibition is therefore perhaps not surprising. The left lateralization of these changes may be related to the verbal component of the task. Indeed, in a factor analysis using a battery of tasks including those employed in the current study, Hampshire et al. (2012) found that the "verbal" component, on which Double Trouble loaded most heavily, predicted activity within a left-lateralized network of regions including the left inferior frontal gyrus, as well as bilateral temporal regions.
Remarkably, there was almost no overlap between the white matter changes that were observed in the tracts that support improvements on Self-Ordered Search and those that support improvements on Double Trouble. In fact, a formal conjunction analysis revealed no regions with more than a single voxel in common to both training regimens, and even then, those changes were primarily in auditory, thalamic, and visual regions. Moreover, when we examined those regions in which white matter changes correlated with performance improvements, there was again very little overlap between the two training regimens.
Of course, in the description above of white matter changes that were associated with training on each of the two tasks, we focused mainly on regions that are known to be functionally involved in those tasks. In both cases, many other tracts showed changes (in some cases, over 200 areas, see Tables S1-S4). Nevertheless, the important point is that there was virtually no overlap between these two sets of regions as indexed by the conjunction analysis, which revealed almost no common areas of change.
Therefore, on the basis of these findings, we propose that training on one cognitive task for 4 weeks does not lead to improvements on a cognitively dissimilar task because the underlying white matter tracts that support communication between regions involved in those tasks are almost completely nonoverlapping. Put simply, improvements in a test of spatial working memory with training are underpinned by changes in a task-specific network of brain regions that are not involved in supporting other tests like those of inhibitory control, and vice versa. The fact that we did not see any improvements in the tests of "near transfer" (i.e., Spatial Span for the Self- "transfer." The terms "near" and "far" transfer are often used to refer to improvements in closely related and unrelated cognitive tasks, respectively, yet how "related" one task actually is to another is rarely quantified. Tasks are often selected based on their inferred cognitive properties, rather than on an empirical measure of similarity, and without a consistent definition of transfer, and quantifiable measures of similarity between tasks, it is very difficult to make comparisons across studies, and assess the reliability of any observed training-related benefits. Of course, an argument could be made that with a longer period of training, or with an increased number of participants, we may have found some evidence of near, or even far, transfer in this study. However, given the marked differences between the white matter changes associated with training on these two cognitively different tasks over 4 weeks, it seems very unlikely that increasing the length of training or the number of participants would fundamentally alter that emerging pattern.
One potential limitation of the study is the relatively small sample size. Longitudinal neuroimaging comes with a number of challenges, and in the present study we experienced a scanner failure in the middle of the 4-week training period for a number of participants, whom we had to exclude from analysis. Despite this loss of data, we analyzed five scans per participant, for a total of 80 data sets. Additionally, because we report results using TFCE, a conservative threshold of p = .01, and Bonferroni correction, we believe that the findings are robust and unlikely to be affected by the small number of participants.
A final point of note is that not all the changes in FA that correlated with training were in the same direction (see Table 6). FA quantifies how strongly directional white matter tract structure is, based on the degree of diffusion of water molecules (Smith, Kindlmann, & Jbabdi, 2014). While higher FA was originally thought to represent "better" white matter integrity, numerous studies have found decreases with experience or training (Nichols & Joanisse, 2016).
These changes indicate reorganization of white matter; given that we mainly observed decreases in FA, one interpretation is that these results reflect new connections being formed, leading to increases in crossing fibers within these areas, decreasing the overall directionality of the main tract. Although it is difficult to interpret the direction of the correlations in the present study, we can nevertheless confirm that these regions showed change in white matter microstructure with training.
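To make the crossing-fiber interpretation concrete, the sketch below computes FA from the three eigenvalues of a diffusion tensor using the standard definition, FA = sqrt(3/2) * sqrt(sum((λi − mean λ)²) / sum(λi²)). The eigenvalues are hypothetical and chosen only to contrast a single dominant diffusion direction with a more planar profile of the sort that crossing fibers might produce; they are not taken from the study's data.

```python
import numpy as np

def fractional_anisotropy(eigenvalues):
    """FA of a diffusion tensor from its three eigenvalues.

    Ranges from 0 (isotropic diffusion) to 1 (diffusion along one axis).
    """
    lam = np.asarray(eigenvalues, dtype=float)
    if lam.shape != (3,):
        raise ValueError("expected exactly three eigenvalues")
    denom = np.sum(lam ** 2)
    if denom == 0:
        return 0.0  # no diffusion at all: treat as isotropic
    num = np.sum((lam - lam.mean()) ** 2)
    return float(np.sqrt(1.5 * num / denom))

# Hypothetical eigenvalues (in units of 1e-3 mm^2/s).
single_direction = (1.7, 0.3, 0.3)  # one dominant diffusion direction
planar_profile = (1.0, 1.0, 0.3)    # diffusion spread across two directions

fa_single = fractional_anisotropy(single_direction)
fa_planar = fractional_anisotropy(planar_profile)
# fa_planar is lower than fa_single although total diffusivity is the same.
```

In this toy case the sum of the eigenvalues is identical for both tensors, so the drop in FA reflects only the redistribution of diffusion across directions, which is the pattern the crossing-fiber account would predict.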
In conclusion, in this study we showed that improvements through training on a cognitive task for 4 weeks do not transfer to other cognitive tasks, even those that are quantifiably similar. We suggest that this lack of near or far transfer occurs because changes in white matter tracts associated with training on each task are almost entirely nonoverlapping, and therefore afford no advantages for untrained tasks.

ACKNOWLEDGMENTS
We are grateful to Anderson M. Winkler for help with the DTI analysis using PALM. This work was supported by a Canada Excellence Research Chair (CERC) program grant (#215063) to Adrian Mark Owen. Adrian Mark Owen is a Fellow of the CIFAR Brain, Mind, and Consciousness program.

CONFLICT OF INTEREST
As the creator of the Cambridge Brain Sciences platform, Adrian Mark Owen owns shares in Cambridge Brain Sciences Inc., which markets the tests for commercial purposes. In line with the existing free licensing agreement between Cambridge Brain Sciences Inc. and the University of Western Ontario, neither person nor organization received any financial remuneration for the use of these tests in this research study.

DATA AVAILABILITY STATEMENT
Due to ethics restrictions, data used in this study cannot be made publicly available, but can be made available to collaborative investigators through a data sharing agreement.