Mapping differential responses to cognitive training using machine learning

Abstract We used two simple unsupervised machine learning techniques to identify differential trajectories of change in children who undergo intensive working memory (WM) training. We used self‐organizing maps (SOMs)—a type of simple artificial neural network—to represent multivariate cognitive training data, and then tested whether the way tasks are represented changed as a result of training. The patterns of change we observed in the SOM weight matrices implied that the processes drawn upon to perform WM tasks changed following training. This was then combined with K‐means clustering to identify distinct groups of children who respond to the training in different ways. First, the K‐means clustering was applied to an independent large sample (N = 616, M age = 9.16 years, range = 5.16–17.91 years) to identify subgroups. We then allocated children who had been through cognitive training (N = 179, M age = 9.00 years, range = 7.08–11.50 years) to these same four subgroups, both before and after their training. In doing so, we were able to map their improvement trajectories. Scores on a separate measure of fluid intelligence were predictive of a child's improvement trajectory. This paper provides an alternative approach to analysing cognitive training data, one that goes beyond considering changes in individual tasks. This proof‐of‐principle demonstrates a potentially powerful way of distinguishing task‐specific from domain‐general changes following training and of establishing different profiles of response to training.


| INTRODUCTION
Working memory (WM), the ability to hold and manipulate information in the mind for brief periods of time, is predictive of healthy cognition across the lifespan and closely linked to academic attainment, employability and well-being (Diamond, 2012). Consequently, the prospect of enhancing WM and closely associated cognitive skills such as attention, processing speed and reasoning via cognitive training has received considerable interest from researchers and commercial enterprises (Diamond, 2012; Green & Bavelier, 2008; Hertzog, Kramer, Wilson, & Lindenberger, 2008). The assumption is that enhancing this general-purpose system will produce wide benefits for other aspects of cognition and learning.
Cognitive training studies typically use a range of assessment tasks to test the effect of training. These are delivered before and after extended practice on a different set of training tasks. Studies designed to test whether the training is effective typically compare these training effects against an active control condition and correct for multiple comparisons across assessments (Simons et al., 2016).
The role of individual differences in the size of training effects is receiving increasing attention from researchers. The longest-standing example of this is the aptitude by treatment interaction (e.g., Cronbach, 1957; Ferguson, 1956; Snow, 1989), or in other words, how an individual's current cognitive ability interacts with their training outcome. Two popular accounts have emerged, namely the compensation account and the magnification account (Lövdén, Brehmer, Li, & Lindenberger, 2012). The compensation account suggests that those with higher baseline scores have less to gain, being closer to ceiling prior to training. This assumes that there is a plateau in overall performance, with some subjects being closer to it before they start training. Conversely, the magnification account suggests that those with higher baseline scores will show greater improvements, because they have more cognitive resources available to maximize the potential benefit of the training, for example by developing strategies (see Karbach, Könen, & Spengler, 2017, for a recent overview). These extreme accounts are likely oversimplifications (Smoleń, Jastrzebski, Estrada, & Chuderski, 2018). Nonetheless, understanding prior factors that predict transfer effects may help explain the many inconsistencies concerning the effectiveness of cognitive training; it could also help tailor training towards those most responsive. Thus far, studies examining individual differences in training are relatively rare, but they are steadily growing in number. The majority have explored the impact of known pre-training individual differences, such as age (Borella et al., 2014; Schmiedek, Lövdén, & Lindenberger, 2010), baseline cognitive performance (Bürki, Ludwig, Chicherio, & Ribaupierre, 2014; Guye, Simoni, & Bastian, 2017; Zinke et al., 2014) and cognition-related beliefs (e.g., malleability of intelligence; Jaeggi, Buschkuehl, Shah, & Jonides, 2014).
They provide evidence that some pre-training individual differences may explain variability in training effects.
The majority of these studies have used univariate analytical techniques (e.g., Jaeggi, Buschkuehl, Jonides, & Perrig, 2008; Zinke et al., 2014). That is, they take single tasks and test whether performance on them changes significantly following training, and whether this is moderated by a known individual difference factor.
A principal challenge to this approach is task impurity—the extent to which any given task measures an intended construct—because this makes it difficult to identify what mechanism is being trained (Burgess, 2004; Hasson, Chen, & Honey, 2015; Meyer et al., 2001; Miyake et al., 2000). For example, both N-back and complex span tasks purportedly measure "WM capacity," but training effects on these tasks do not consistently transfer to one another (Harrison et al., 2013; Li et al., 2008). Similarly, both letter span and word span tasks purportedly measure "verbal short-term memory," but training effects on letter span do not always transfer to word span (Ericcson, Chase, & Faloon, 1980). In short, the labels assigned to tasks do not always correspond well to the underlying processes taxed by the assessment, or those enhanced via practice. Comparing individual tasks before and after training does not overcome this challenge, because changes on individual measures could stem from changes in multiple different underlying processes (Protzko, 2017). As a result, a number of researchers are now beginning to explore the potential value of multivariate approaches to considering changes that occur following cognitive training.
One such approach is structural equation modelling (SEM), in which cognitive abilities are represented by latent constructs (Karbach et al., 2017; Schmiedek et al., 2010). Schmiedek and colleagues conducted a large training study, in which they used Latent Score Change Modelling (a form of SEM) and found transfer effects to be detectable at a latent level. As they note, it is possible to observe significant changes at the latent level despite non-significant changes at a task-specific level and vice versa. This is presumably because latent constructs may change substantially, but their contribution to any single task in the battery could be relatively small.
Conversely, we might observe highly specific practice effects particular to a given paradigm or stimulus set (e.g., letters or digits) that do not stem from changes to any broader underlying latent construct.
This can also be a powerful tool for looking at individual differences because it accounts for measurement error in observed variables and thus provides a good way of establishing stable individual differences (Hamaker, Kuiper, & Grasman, 2015). This has enabled some researchers to investigate individual differences by including separate predictors for the estimated change variable in their models (e.g., Bürki et al., 2014;Karbach et al., 2017;Lövdén et al., 2012).
Although promising, this method is not without its drawbacks.
Research Highlights

• We used a multivariate approach to understand cognitive training mechanisms: unsupervised machine learning.
• Following training, task relationships change, implying that the cognitive processes drawn upon to perform these tasks have changed.
• The learning algorithm also learnt that there were differential improvement trajectories among children, and an independent measure of fluid intelligence is predictive of these trajectories.

Confirmatory factor analysis approaches require researchers to make subjective choices (albeit based on theory) about the structure of underlying components from the many possible configurations at differing levels of granularity. Furthermore, establishing training effects is particularly challenging because the nature of the underlying constructs, their interrelationships or their task loadings may have changed substantially as a function of training. Investigators are faced with a dilemma: they could fit the same model both before and after training, allowing for a meaningful comparison of model parameters but ignoring the fact that this model may no longer be the most appropriate. Alternatively, they could fit the best model separately before and after training, which would allow for the best representation of the underlying components, but render direct comparisons less meaningful.
Machine learning provides an alternative to modelling task relationships. Unsupervised learning algorithms hold the same advantage as other data-driven methods such as principal component analysis (PCA) and exploratory factor analysis (EFA), in that they allow researchers to explore task relationships without requiring subjective judgements to be made about their nature a priori.
Machine learning algorithms also lend themselves well to non-linearities in multidimensional data, allowing them to capture more nuanced task relationships than commonly used linear methods (general linear regression, factor analysis, PCA, etc.). Some algorithms cluster participants in a competitive manner, rather than clustering tasks at the whole-group level (as would be the case for PCA or EFA). These may be particularly useful when we suspect there could be large individual differences, which in the context of training would result in differing profiles of change following an intervention. Iterative clustering techniques can provide a data-driven way of subgrouping participants and thereby reveal different profiles of performance. This has the potential to enable researchers to explore individual differences in training in a different way: rather than testing whether gains in training are predicted by known factors (e.g., age, baseline ability), it might allow researchers to identify individual differences in the profile of the training response itself. Despite these potential benefits, we know of no attempts to use machine learning to understand transfer effects following cognitive training. This paper aims to explore the utility of combining two relatively simple machine learning techniques, namely self-organizing maps (SOMs) and K-means clustering, to explore task relationships and how these might be altered by training in two large datasets.
First proposed by Kohonen (1990), SOMs belong to a family of artificial neural networks and provide a way of organizing multidimensional data into a lower dimensional space, represented as a topographical distribution. An unsupervised learning algorithm projects the original data from a multidimensional input space onto a two-dimensional grid of nodes called a map. Each node corresponds to a node-weight vector with the same dimensionality as the number of input variables, thereby producing an inter-variable representational space, wherein the geometric distance between nodes corresponds to the degree of similarity in the input data associated with them (Kohonen, 2014). This enables key inter-variable relationships existing in multidimensional space to be identified and accentuated. Moreover, this allows the researcher to explore the overlap in representational space between tasks and how, or whether, this changes as a result of the training. Once established, SOMs can be used to generate quantitative predictions about training effects in unseen data, something currently underutilized in cognitive training research.
Subsequently, a K-means clustering algorithm can be used to identify relatively homogenous subgroups (i.e., "clusters") within the multidimensional node-weight vector space produced by the SOM algorithm. This allows for the exploration of individual differences in task relationships and makes use of information that would otherwise be lost. Identifying data-driven subgroups with distinct cognitive profiles could be a valuable way of understanding different trajectories in cognitive change. We wanted to establish a method for doing this.

| METHODS
This section contains a brief description of the SOM algorithm and its generic implementation, followed by a stepwise account of the analyses performed on two datasets containing the same set of tasks.

| SOM algorithm
SOMs were trained using the neural network toolbox in MATLAB (MATLAB and Statistics Toolbox Release). SOMs consist of a predefined number of nodes laid out on a two-dimensional grid plane. Each node corresponds to a weight vector with the same dimensionality as the input data. We initialized the node-weight vectors using linear combinations of the first two principal components of the input data.
SOMs were then trained using a batch implementation (see Figure 1 for a graphical overview), in which each node i is associated with a model m_i and a "buffer memory." One cycle of the batch algorithm can be broken down into the following: each input vector x(t), in this case a single child's performance profile across the four assessment tasks, is mapped onto the node with which it shares the least Euclidean distance at time t. This node is known as its Best Matching Unit. Each buffer sums the values of all input vectors x(t) in the neighbourhood set belonging to node i and divides this by the total number of these input vectors to derive a mean value. All m_i are then updated concurrently according to these values. In this way, neighbouring nodes become more similar to one another. This cycle is repeated, clearing all the buffers on each cycle and distributing new copies of the input vectors into them. The neighbourhood size (ND) decreases as a function of t over n steps in an "ordering" phase, from the initial neighbourhood size (INS) down to 1 (Equation 1):

ND(t) = INS - (INS - 1) * t/n    (1)

In the "fine-tuning" phase the neighbourhood size is fixed at <1, meaning that the node weights are updated according only to the input vectors for which they are the Best Matching Unit. This node adjustment process is the mechanism by which the SOM learns about the input data.
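As a rough illustration, the batch cycle described above can be sketched in Python with NumPy. This is a simplified stand-in for the MATLAB toolbox implementation used in the paper: the function name, the random (rather than PCA-based) initialization, the linear neighbourhood decay and the fixed sub-1 fine-tuning neighbourhood are our assumptions, not the paper's exact settings.

```python
import numpy as np

def train_batch_som(X, grid=(8, 8), ordering_steps=10, tuning_steps=2,
                    init_ns=2, rng=None):
    """Minimal batch SOM. X: (n_samples, n_features) child-by-task matrix."""
    rng = np.random.default_rng(rng)
    rows, cols = grid
    n_nodes = rows * cols
    # Node coordinates on the 2-D grid, used for neighbourhood distances.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    grid_dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    # Random initialization from the data (the paper uses a PCA-based linear init).
    W = X[rng.integers(0, len(X), n_nodes)].astype(float)
    for t in range(ordering_steps + tuning_steps):
        if t < ordering_steps:
            # Ordering phase: neighbourhood shrinks linearly from init_ns to 1.
            ns = init_ns - (init_ns - 1) * t / max(ordering_steps - 1, 1)
        else:
            # Fine-tuning phase: neighbourhood < 1, so only the BMU's own inputs count.
            ns = 0.5
        # Best Matching Unit: node with smallest Euclidean distance to each input.
        d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
        bmu = d.argmin(axis=1)
        # Batch ("buffer memory") update: each node becomes the mean of all inputs
        # whose BMU lies within its neighbourhood.
        for i in range(n_nodes):
            mask = grid_dist[i, bmu] <= ns
            if mask.any():
                W[i] = X[mask].mean(axis=0)
    return W
```

Because every node is updated from the averaged contents of its neighbourhood buffer, neighbouring nodes are pulled towards similar values, which is what produces the topographic ordering.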

| Cognitive assessments
Four span tasks from the Automated Working Memory Assessment battery (AWMA; Alloway, 2007) were used in the current analysis. In Forward Digit Recall, participants hear a sequence of numbers and are required to repeat them back out loud in the same order in which they were presented; in Backward Digit Recall, participants hear a sequence of numbers and are required to repeat them back out loud in the reverse of the presentation order.
These tasks purport to measure verbal short-term memory and WM, respectively. In Dot Matrix, participants see a sequence of dots in a 3 × 3 matrix and are required to recall the order and position of the dots by pointing to a blank 3 × 3 response matrix; in Mr X, participants are presented with sequences of two cartoon characters placed next to one another, both holding a ball in one of their two outstretched arms, with the character on the right rotated to varying degrees on each presentation. For each pair of Mr X figures, participants are required to make a same-different judgement as to whether the two figures are holding the ball in the same hand, whilst retaining the spatial information about where the ball held by the right-hand Mr X resides. They are then required to recall the previously retained spatial locations in the correct order by pointing to one of eight locations represented by dots in a circle. These tasks purport to measure visuospatial short-term memory and WM, respectively. All tasks, along with their instructions, are computerized, and practice trials were completed on each to help ensure comprehension.

| Participants
We used three relatively large datasets in this analysis. All datasets consisted of age-standardized data (i.e., M = 100, SD = 15) from the four AWMA tasks. In the following sections, we describe the datasets we used, and summary scores are described in Table 1.

| Centre for Attention, Learning and Memory
The first dataset comprised data collected from 526 participants at the Centre for Attention, Learning and Memory (CALM), based at the MRC Cognition and Brain Sciences Unit. Children attending CALM undergo a wide battery of cognitive and behavioural assessments, which includes the four tasks described above.

| Attention and Cognition in Education
This sample was collected for a study investigating the neural, cognitive and environmental markers of risk and resilience in children.
Ninety typically developing children who attend mainstream schools in the UK (M = 9.42 years, range = 6.91-12.58 years, SD = 1.49 years; 45 girls) and their families were invited to the MRC Cognition and Brain Sciences Unit in Cambridge for a comprehensive cognitive assessment, which included the four tasks described above.
In later analyses, we combined data from the two above-mentioned studies for better statistical power and larger individual variability in task profiles, which is desirable for a "baseline" dataset.

| Combined training studies
This dataset comprised pre-training and post-training data collected from children who completed intensive WM training (N = 179, M age = 9.00 years, range = 7.08–11.50 years).

| Training SOMs
The SOM learning algorithm and model require the selection of several parameters, including the number of map nodes, the initial neighbourhood size, the ordering phase length and the fine-tuning phase length. These hold important theoretical, computational and statistical implications. However, there is no single principled procedure for choosing them; Kohonen (2014) provides suggestions based on experience. A detailed discussion of this topic is beyond the scope of this paper; for a more detailed explanation of our selection process and an overview of the results, see the Supporting Information. In short, we selected parameters with the aim that the SOM model would represent the training sample well, whilst still maintaining generalizability to the wider population. We trained separate SOMs on the combined CALM/ACE dataset and on the pre- and post-training datasets.

Cross-validation
The first step after fitting a model is to test its validity. We applied a cross-validation procedure to test the null hypothesis that the SOM does not estimate unseen data above chance levels. Specifically, this involved randomly removing 20% of the CALM/ACE data (i.e., approximately 120 participants), then using the remaining 80% to fit a SOM, which was used to predict the reserved data. The prediction was made with a technique called K-Nearest Neighbours (Altman, 1992), in which the value of the to-be-predicted variable is decided by the values of the three closest SOM nodes in terms of Euclidean distance with respect to the vector containing the other, unseen variables. For example, if Forward Digit is the target variable, a subject's scores on the other three tasks will be fed to the algorithm to find the three nearest SOM nodes. Then the values of the three nodes on Forward Digit are pooled and weighted based on distance (the closest node has the highest weight) to calculate the participant's predicted score. The mean absolute difference between the predicted scores and true scores of the unseen sample was used as the measure of prediction error.
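The distance-weighted K-Nearest Neighbours prediction step just described might look like the following sketch (numpy-based; the function name and the small smoothing constant that guards against division by zero are our assumptions):

```python
import numpy as np

def predict_task(W, x_known, known_idx, target_idx, k=3):
    """Predict one held-out task score from the others using SOM node weights.

    W: (n_nodes, 4) node-weight matrix; x_known: scores on the known tasks;
    known_idx / target_idx: column indices of the known and target tasks.
    """
    # Euclidean distance from the partial profile to each node's known elements.
    d = np.linalg.norm(W[:, known_idx] - x_known, axis=1)
    nearest = np.argsort(d)[:k]
    # Inverse-distance weighting: the closest node gets the highest weight.
    w = 1.0 / (d[nearest] + 1e-9)
    w /= w.sum()
    return float(np.dot(w, W[nearest, target_idx]))
```

The prediction error for a held-out sample would then be the mean absolute difference between these predicted scores and the true scores, as in the text above.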
To better evaluate average model performance, we repeated the cross-validation process 1,000 times to derive distributions of the mean prediction errors. The chance-level distributions were generated by randomly shuffling the order of the predicted scores, then subtracting the true scores to obtain a null mean absolute difference. For each of the 1,000 cross-validation iterations, we repeated the shuffling 100 times, creating a null distribution containing 100,000 prediction-error values. Finally, the mean prediction errors for all variables were compared to the corresponding null distributions to compute p-values, calculated as the proportion of the null distribution at or below the mean prediction error.
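A minimal version of this shuffling procedure for a single cross-validation fold could look like the following (the function name is ours; here we take the p-value to be the proportion of shuffled-null errors at or below the observed error, so that small values indicate better-than-chance prediction):

```python
import numpy as np

def permutation_p(pred, true, n_shuffles=100, rng=0):
    """One-sided permutation p-value for a mean-absolute-error prediction.

    The null distribution is built by shuffling the predicted scores so that
    predictions and true scores are paired at random.
    """
    rng = np.random.default_rng(rng)
    pred, true = np.asarray(pred, float), np.asarray(true, float)
    observed = np.abs(pred - true).mean()
    null = np.array([np.abs(rng.permutation(pred) - true).mean()
                     for _ in range(n_shuffles)])
    # Proportion of null errors as small as the observed error:
    # small p means the model predicts better than chance.
    return float((null <= observed).mean())
```

In the full analysis this would be repeated across folds and shuffles to build the larger null distribution described above.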

Assessing generalizability across samples
We were also interested in whether the representativeness of the SOM extended to other samples. To test this, we used a SOM trained on the entire CALM/ACE dataset to predict task scores in the pre- and post-training datasets respectively. The CALM/ACE sample is much larger in size and includes a wide range of ability levels. This means that a model based on these data is more likely to generalize well to other datasets. We generated the chance-level distributions for the pre- and post-training samples as in the previous step, by shuffling the order of predicted scores 100,000 times. Again, true prediction errors were compared to these distributions to derive p-values.
An alternative way to address this question is to compare prediction errors directly between samples, contrasting the error distributions obtained for unseen CALM/ACE data with those for the pre- and post-training data.

| Does training alter the relationships between tasks?
Here we ask this question in two ways. Each model node is an instance of a multivariate task relationship that exists in the data used to train the SOM. If these SOM maps have less predictive power when used to estimate new data points, this means that different multivariate task relationships exist in that dataset which are not well accounted for by the model. This is the first way of testing whether the training has changed task relationships.
Secondly, we addressed this question by comparing the SOMs trained on the pre- and post-training data. To assess task similarities as represented by a SOM, elements of the SOM node-weight vectors can be extracted individually (e.g., the first element of all node vectors) to form a "component plane." Each plane corresponds to a representation of one task. The pairwise correlation coefficients between component planes can then be derived and serve as multivariate activity patterns, which is useful for quantitative analysis. If two tasks tap into similar cognitive processes, their activity patterns ought to overlap (e.g., in Figure 2, Forward Digit and Backward Digit, which both involve auditory information, share more topological similarity). By extracting the correlations between the same pair of tasks before and after training, we could then make a direct comparison of how their relationship had changed as a result of the training.
To compute the relationship, component planes associated with each pair of tasks were compared using Pearson's correlation coefficient. The similarity values were then assembled into a 4 × 4 matrix.
Once the similarity matrices for pre- and post-training were computed, we compared the same pairs of tasks between pre- and post-training to identify any significant differences in correlation coefficients. We chose to bootstrap the node-weight elements associated with the two tasks and computed the correlation coefficients before subtracting one from another (post-training pairwise correlation minus pre-training pairwise correlation). By repeating this procedure 10,000 times, we obtained a distribution of the difference between correlations. If zero falls within the bottom or top 5% of the distribution, we reject the null hypothesis that the two correlation coefficients do not differ, with a false-positive rate of α = 0.05. We conducted this analysis for all pairs of tasks.
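Under these assumptions, the bootstrap comparison of component-plane correlations might be sketched as follows (numpy; function names are ours, and we resample pre- and post-training nodes independently):

```python
import numpy as np

def bootstrap_corr_diff(plane_a_pre, plane_b_pre, plane_a_post, plane_b_post,
                        n_boot=10_000, rng=0):
    """Bootstrap the post-minus-pre difference in component-plane correlations.

    Each 'plane' is the vector of node-weight elements for one task. Nodes are
    resampled with replacement on every iteration.
    """
    rng = np.random.default_rng(rng)
    n = len(plane_a_pre)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx_pre = rng.integers(0, n, n)
        idx_post = rng.integers(0, n, n)
        r_pre = np.corrcoef(plane_a_pre[idx_pre], plane_b_pre[idx_pre])[0, 1]
        r_post = np.corrcoef(plane_a_post[idx_post], plane_b_post[idx_post])[0, 1]
        diffs[b] = r_post - r_pre
    return diffs

def significant(diffs, alpha=0.05):
    """Flag a change when zero falls outside the central bootstrap quantiles."""
    lo, hi = np.quantile(diffs, [alpha, 1 - alpha])
    return not (lo <= 0 <= hi)
```

A pair of tasks whose correlation flips or shifts substantially after training would produce a difference distribution that excludes zero.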

| Are there subgroups with different profiles of change following training?
K-means clustering provides a data-driven method for identifying k relatively homogenous subgroups within the SOM node-weight vector space by minimizing the distance between data points and their cluster centroids. We first identified subgroups within the SOM fit to the CALM/ACE data by applying K-means clustering to the node weights. Once nodes were grouped based on similarity, participants were allocated to the cluster to which their Best Matching Unit belongs. This provided us with clusters of children based on the nodes they were assigned to in the original mapping. This process was repeated 1,000 times, with the map retrained and the K-means clustering recalculated on every iteration, to check that the clusters were robust. Participants in the training datasets were also allocated to these identified clusters in the same manner (i.e., based on closest Euclidean distance) at both pre- and post-training, separately. Profiles of subgroups were characterized by calculating their respective means and standard errors on each of the tasks, and these were compared between groups to identify the ways in which they differ. In the case of the cognitive training datasets, we also contrasted children who changed subgroup following the training.
We did this by calculating gain scores (post-minus pre-training) on each task as a way of testing how different gain scores are associated with changes in subgroup membership.
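A bare-bones version of the clustering and allocation steps described above could look like the following sketch (plain K-means with random initialization rather than the toolbox implementation; function names are ours):

```python
import numpy as np

def kmeans(W, k=4, n_iter=100, rng=0):
    """Plain K-means on the SOM node-weight vectors W (n_nodes, n_tasks)."""
    rng = np.random.default_rng(rng)
    centroids = W[rng.choice(len(W), k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assign each node to its nearest centroid, then recompute centroids.
        labels = np.linalg.norm(W[:, None] - centroids[None], axis=2).argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = W[labels == j].mean(axis=0)
    return labels, centroids

def allocate_participants(X, W, node_labels):
    """Each participant joins the cluster of their Best Matching Unit."""
    bmu = np.linalg.norm(X[:, None] - W[None], axis=2).argmin(axis=1)
    return node_labels[bmu]
```

Running `allocate_participants` on the pre- and post-training score matrices separately yields each child's subgroup at both time points, from which subgroup-change trajectories can be read off directly.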
Finally, we tested whether these clusters were predicted by another measure that was not included in the SOM training or clustering, namely matrix reasoning scores from the Wechsler Abbreviated Scale of Intelligence (WASI), which were available for 158 participants in the training sample. Matrix reasoning is considered a measure of general fluid intelligence (Gf), which refers to the ability to reason and solve novel problems. Gf is a critical factor for success in a wide variety of cognitive tasks and for the capacity to learn in general (Gray & Thompson, 2004). We explored whether performance on the WASI matrix reasoning task assessed prior to training was predictive of change of subgroup membership.

| Summary
The above pipeline describes our stepwise analyses: (a) SOMs were used to model task relationships, and we cross-validated the models against unseen data; (b) SOMs trained before and after training were compared to test whether task relationships changed; and (c) K-means clustering was applied to the node weights to identify subgroups with distinct profiles of change, and we tested whether a separate measure of fluid intelligence predicted these trajectories.

| RESULTS
A 64-node (8 × 8) SOM with an initial neighbourhood size of 2 was trained over 10 ordering-phase steps and two fine-tuning-phase steps using the CALM/ACE data (quantization error = 9.72); quantization error is defined as the mean absolute distance between the input vectors (i.e., training data) and their corresponding Best Matching Units. The rationale behind the selection of these parameters, alongside different solutions with different parameters, is included in the Supporting Information. Figure 2 shows how the SOM represents the four tasks as well as the number of participants allocated to each node.
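As a point of reference for this metric, a minimal numpy sketch of the quantization error (mean Euclidean distance between each input vector and its Best Matching Unit; the function name is ours):

```python
import numpy as np

def quantization_error(X, W):
    """Mean distance between each input vector and its Best Matching Unit.

    X: (n_samples, n_tasks) input data; W: (n_nodes, n_tasks) node weights.
    """
    # Distance from every input to every node; the BMU is the row-wise minimum.
    d = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return float(d.min(axis=1).mean())
```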

| Does the SOM model represent the samples well?
We first cross-validated the model performance of SOM trained on the CALM/ACE data using permutation testing. The SOMs proved capable of predicting unseen CALM/ACE data significantly better than chance for all four task variables (Table 2).
Next, a SOM trained on the entire CALM/ACE sample was used to test how well it represents the pre- and post-training datasets, again using the same method. The model predicted unseen data from the other samples better than chance on all tasks.
We also directly compared the CALM/ACE prediction errors with those for the pre- and post-training data (Table 3).

| Does training alter the relationships between tasks?
New SOMs trained on the pre- and post-training data respectively were compared to examine changes in task relationships as a function of training. Pairwise correlation coefficients were computed from the SOMs' component planes representing tasks and assembled into similarity matrices. Figure 3a,b depict the pre- and post-training matrices. We conducted pairwise comparisons before and after training (see Figure S10 and the section titled "Comparison with the control group" in the Supporting Information for the same analysis in the control data).

| Are there subgroups with different profiles of change following training?
We applied K-means clustering to the node-weight vector space. The algorithm identified a subgroup of participants who achieved a high level of performance on all tasks, a subgroup whose scores were at the lower end of the distribution, and two subgroups who were in the middle. One of these middle subgroups tended to have average performance on all tasks, whereas the other tended to have average or slightly above-average performance on the visuospatial tasks but below-average performance on the verbal tasks.

TABLE 3 Direct comparisons between SOM prediction errors for unseen CALM/ACE and pre-training samples (CALM/ACE vs. pre-training), between CALM/ACE and post-training samples (CALM/ACE vs. post-training), and between the pre-training and post-training samples (pre vs. post-training).

FIGURE 3 Pairwise task relationships derived from SOM weights before and after training, and the difference over time. Larger values indicate more similarity between the two tasks. (a) Task relationships for the pre-training sample. (b) Task relationships for the post-training sample. (c) Difference in similarity between pre- and post-training (post minus pre). The Backward Digit-Mr X pair showed significant change after training (p < 0.01), as did the Forward Digit-Backward Digit pair (p < 0.05). Abbreviation: SOM, self-organizing map.
Participants in the pre- and post-training samples were also allocated to these four clusters. Gain scores differed between the resulting trajectory groups (all p < 0.001), and post-hoc tests identified where these differences lay (see Table 4 for multiple pairwise comparison results). Overall, movers to Cluster 1 showed the largest improvements across all measures compared to the other groups. Movers to Cluster 2 were characterized by moderate gains globally, but benefited less on Dot Matrix and Mr X relative to children who moved to Cluster 1. The third group, children who moved to Cluster 3, showed gains on Dot Matrix and Mr X of comparable magnitude to movers to Cluster 1, but significantly smaller gains on the Forward and Backward Digit tasks than movers to Cluster 1 or 2.
Unsurprisingly, children who stayed in the lowest performance group (Cluster 4) had overall limited benefit from the training.
To investigate whether performance on a measure of fluid intelligence (WASI matrix reasoning) could predict individual differences in patterns of improvement, we compared the WASI scores of the four groups of interest assessed prior to training (see Figure 4b).

| DISCUSSION
Our understanding of cognitive training has hitherto focused on exploring its impact on single tasks (though with some notable exceptions, e.g., Karbach et al., 2017; Schmiedek et al., 2010) and treating all participants as a single homogenous group (e.g., Borella et al., 2014; Bürki et al., 2014; Zinke et al., 2014). In the present study, we used machine learning to show that WM training alters the relationships between tasks, implying that the cognitive processes recruited for performing those tasks can change following training. Furthermore, we identified subgroups with differential responses to training, which were predicted by fluid intelligence scores.

| SOMs accurately represent task relationships
A SOM was fit to a large dataset of children who were assessed on standardized measures of verbal and visuospatial short-term and WM. Using leave-N-out cross-validation, we showed that SOMs fitted on these data predicted performance on unseen data for all tasks. These predictions generalized to the cognitive training samples; importantly, however, model fit and prediction accuracy were significantly reduced following training for the Dot Matrix task, the implications of which are discussed subsequently.

| Task relationships change following training
Multiple studies have shown that performance on individual tasks improves following training (for reviews, see Hertzog et al., 2008; Melby-Lervag et al., 2016; von Bastian & Oberauer, 2014). But this does not provide any insight into whether or how underlying constructs are being changed, or whether different cognitive processes are recruited following the intervention. One way of investigating this is to test whether relationships between tasks change as a function of training. Following training, we identified a large decrease in prediction accuracy for the Dot Matrix task mirroring substantial improvements in task performance. Lower prediction accuracy following training also suggests that the relationships between Dot Matrix and the other tasks may have been altered. In other words, new task relationships (i.e., multivariate data points) exist in the post-training data that were not learnt or represented in a large sample of children who did not complete cognitive training. In this case, the training program contains many exercises similar to the Dot Matrix task (i.e., visuospatial serial recall, Klingberg et al., 2005), and thus subjects may show a more task-specific effect rather than a domain-general improvement.
This is in line with demonstrations of maximal transfer to the assessment tasks most similar to those trained (Gathercole et al., 2019), with the highest levels of transfer for tasks sharing the greatest number of task features (Soveri, Antfolk, Karlsson, Salo, & Laine, 2017). If the bulk of the improvements had been domain-general, then we would expect similar-sized improvements on other tasks measuring visuospatial WM (i.e., Mr X), but these improvements were relatively small. This contrast is amplified further when the size of the improvements is considered relative to those in the control group, which are indicative of practice effects.
These changing task relationships underscore the fact that the cognitive processes we recruit for individual tasks are not static but can change as a function of experience.
Most task relationships remained stable across training, but the correlations between the Mr X-Backward Digit and Forward Digit-Backward Digit pairs changed significantly. The Mr X-Backward Digit correlation decreased substantially following training, whereas there was a moderate increase for the Forward Digit-Backward Digit pair. Again, this shows that relationships between tasks, as represented by the SOM, are subject to change following training. One possibility is that as participants practise the Backward Digit task, a version of which exists in the training battery, they gradually start to recruit cognitive processes or strategies that they previously used for the Forward Digit task, such as chunking. The end result is that the SOM represents these two tasks more similarly following training. By contrast, Backward Digit is now represented more distinctly from the other complex span task in the assessment battery, Mr X. In short, even though both Backward Digit and Mr X are described as WM tasks, and both improve overall following training, the changing way in which they are represented by the SOM indicates that different cognitive processes or strategies are recruited for them following training. Importantly, this would not be captured by a conventional approach to testing for transfer.
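As a generic illustration of how a change in a task-pair correlation can be tested, the sketch below uses Fisher's r-to-z transform for two correlations from samples of the size reported here. The exact test used in the paper is not specified in this section, and for repeated measures on the same children a dependent-correlations test (e.g., Steiger's) would be more appropriate; the r values are hypothetical.

```python
from math import atanh, sqrt, erfc

def fisher_z_compare(r1, n1, r2, n2):
    """Two-sided test of the difference between two independent correlations
    using Fisher's r-to-z transform. Returns the z statistic and p value."""
    z1, z2 = atanh(r1), atanh(r2)
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    p = erfc(abs(z) / sqrt(2.0))  # two-sided tail of the standard normal
    return z, p

# e.g., a hypothetical pre-training r = .60 vs post-training r = .30, N = 179
z, p = fisher_z_compare(0.60, 179, 0.30, 179)
```

A drop of that size in a sample of 179 would be highly unlikely under the null hypothesis of a stable correlation.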

| Subgroups with different training profiles
There is increasing interest in individual differences in cognitive training effects. The approach typically taken is to explore the impact of known factors, such as age (Borella et al., 2014; Schmiedek et al., 2010), baseline ability (Bürki et al., 2014; Zinke et al., 2014) or cognition-related beliefs (e.g., the malleability of intelligence; Jaeggi et al.). Here, by allocating trained children to the same K-means subgroups before and after training, we identified distinct groups of children who responded to the training in different ways. This suggests there are differential improvement trajectories among children, which would be lost in conventional group-level comparisons. These improvement profiles were meaningfully associated with fluid intelligence: those who made the largest improvements across all measures (movers to the highest performing group) had significantly higher fluid reasoning skills than those who stayed in one of the low-performing groups. General intelligence is thought of as the ability to reason and solve novel problems (Duncan & Owen, 2000), or as an index of flexible cognitive resources believed to play a critical role in decomposing unfamiliar tasks into their component parts (Duncan, Chylinski, Mitchell, & Bhandari, 2017). This may indicate that the ability to abstract and generalize newly learned routines to unpractised tasks is one of the deciding factors in transfer effects (Gathercole et al., 2019).
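The cluster-then-allocate step can be sketched as follows: fit K-means centroids on a normative sample, then assign each trained child to the nearest of those same centroids at pre- and post-test, so that movement between clusters defines an improvement trajectory. This is a toy reconstruction on synthetic two-task data, not the study's pipeline; the blob locations, noise level, and sample sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def init_centroids(X, k):
    """Greedy farthest-point initialisation (stable for separated clusters)."""
    C = [X[0]]
    for _ in range(k - 1):
        d2 = np.min(((X[:, None] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(X[d2.argmax()])
    return np.array(C)

def kmeans_fit(X, k=4, n_iter=100):
    """Plain Lloyd's algorithm; returns k centroids."""
    C = init_centroids(X, k)
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        newC = np.array([X[labels == j].mean(0) if np.any(labels == j) else C[j]
                         for j in range(k)])
        if np.allclose(newC, C):
            break
        C = newC
    return C

def assign(C, X):
    """Allocate new observations to the nearest EXISTING centroid."""
    return np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)

# Normative sample: four performance levels on two "tasks"
levels = np.repeat(np.arange(4.0), 150)[:, None]           # 600 children
norm = levels + 0.15 * rng.normal(size=(600, 2))

C = kmeans_fit(norm)                       # subgroups from the independent sample
order = np.argsort(C.sum(1))               # rank centroids low -> high performance
rank = np.empty(4, int)
rank[order] = np.arange(4)

# Trained children: allocate to the SAME centroids pre and post training
pre = 1.0 + 0.15 * rng.normal(size=(50, 2))
post = 2.0 + 0.15 * rng.normal(size=(50, 2))
moved_up = rank[assign(C, post)] > rank[assign(C, pre)]
```

Keeping the centroids fixed is the design choice that makes pre/post labels comparable: trajectories reflect children moving through a stable normative space rather than the clusters themselves shifting.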
The positive association between fluid intelligence and improvement profile is reminiscent of previous studies showing age-related and ability-related magnification effects in the context of cognitive training (e.g., Bürki et al., 2014). Magnification effects are more typically observed in the context of strategy-based training than process-based training (e.g., Karbach & Verhaeghen, 2014; Karbach et al., 2017), possibly indicating that the training intervention in this study facilitated strategy acquisition. Indeed, it has been shown that training-related improvements in WM may be mediated by the implicit development of task-specific strategies, such as grouping sequential information for recall (Dunning & Holmes, 2014; Minear et al., 2016). Gathercole et al. (2019) argue that these kinds of effects are evidence that training-related gains rely on the construction and refinement of new cognitive routines and strategies.
Individuals with higher levels of cognitive performance at baseline may have more capacity to acquire and perform strategies that enhance the training effect (Lövdén et al., 2012). Our findings support this account. An interesting line of enquiry would be to investigate whether children with relatively low intelligence scores could benefit from explicit instructions to aid strategy generation while training.

| General discussion
We show that task relationships change following training (according to two separate measures), thereby indicating that the underlying mechanisms tapped by training might be task-specific rather than domain-general, and subject to change over time. We have also demonstrated that task performance trajectories are subject to individual differences under this paradigm. This highlights the need to reconsider the interpretation of training-related gains. Children could improve significantly on a particular task via learning specific strategies whilst having moderate or no gains on other tasks claimed to measure the same construct (Moreau, Kirk, & Waldie, 2016).
To address this, previous studies investigating training-induced improvement at the level of abilities have used latent factor analysis, which, for the sake of model comparability and interpretability, necessarily constrains how the observed variables load onto the latent factors before and after training (Bürki et al., 2014; Karbach et al., 2017; Lövdén et al., 2012; Schmiedek et al., 2010).
However, this assumption is challenged by the current findings, which imply that training not only enhances performance but also alters task structure. In the Supporting Information, we show that this is indeed the case for the current dataset by fitting linear models to the data. The difference in the best-fitting model before and after training could be due to the enhancement of task-specific processes, to an increase in individual variance across tasks, or to both. Either way, it suggests that the best latent variable model before and after training may not be the same. Fitting different models pre- and post-training would limit the meaningfulness of comparisons across time points (Dimitrov, 2006). Conversely, imposing parameter invariance when the data suggest otherwise could lead to substantial estimation bias that is not reliably flagged by fit statistics (Clark, Nuttall, & Bowles, 2018). In such cases, the SOM approach taken here is a potentially more flexible alternative that relies on fewer assumptions while still allowing meaningful comparisons over time.
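The logic of the linear-model check can be sketched on synthetic data: fit a regression predicting one task from the others on pre-training data, then evaluate its out-of-sample fit on pre- vs post-training scores. If training makes the target task more task-specific (a weaker loading on the shared factor, as assumed in this simulation), the pre-training model's fit collapses on the post-training data. This is an illustration of the reasoning only, not the Supporting Information analysis.

```python
import numpy as np

rng = np.random.default_rng(2)

def r_squared(y, yhat):
    """Out-of-sample R^2 (can go negative when the model misfits badly)."""
    return 1.0 - ((y - yhat) ** 2).sum() / ((y - y.mean()) ** 2).sum()

n = 400
g = rng.normal(size=(n, 1))                                # shared factor
others = 0.8 * g + 0.4 * rng.normal(size=(n, 3))           # predictor tasks
target_pre = 0.8 * g[:, 0] + 0.4 * rng.normal(size=n)      # loads on the factor
target_post = 0.3 * g[:, 0] + 0.9 * rng.normal(size=n)     # more task-specific

X = np.column_stack([np.ones(n), others])                  # intercept + tasks
fit, test = slice(0, 300), slice(300, n)

# Weights estimated from pre-training data only
beta, *_ = np.linalg.lstsq(X[fit], target_pre[fit], rcond=None)
r2_pre = r_squared(target_pre[test], X[test] @ beta)
r2_post = r_squared(target_post[test], X[test] @ beta)     # same weights applied
```

The drop from `r2_pre` to `r2_post` is the signature of a changed inter-task structure: the same predictive weights no longer describe the post-training relationships.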
Importantly, our findings may be specific to the set of training and assessment tasks we had available. Moreover, our dataset was a composite of many individual studies, with independent recruitment criteria. Nonetheless, our primary aim was to demonstrate a proof of principle, with potential benefits for those exploring multivariate profiles of change. The next step is for this to be tested in well-powered training studies with a broader set of assessments.

| CONCLUSION
SOM models provide an effective alternative for representing and predicting the multivariate data typically found in training studies. Applying SOMs to the current training data revealed nuanced task relationships that are subject to change following WM training, suggesting that the underlying cognitive mechanisms of improvement may be at least partially task-specific rather than domain-general. K-means clustering revealed distinct subgroups with distinguishable improvement trajectories, and these trajectories were related to pre-training fluid intelligence.

ACKNOWLEDGEMENTS
MZ was supported by the CSC Cambridge International

CONFLICT OF INTEREST
The researchers declare that there is no conflict of interest involved in this work.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request (Dr. Duncan E. Astle: duncan.astle@mrc-cbu.cam.ac.uk).