Accounting for symptom heterogeneity can improve neuroimaging models of antidepressant response after electroconvulsive therapy

Abstract Depression symptom heterogeneity limits the identifiability of treatment‐response biomarkers. Whether improvement along dimensions of depressive symptoms relates to separable neural networks remains poorly understood. We build on work describing three latent symptom dimensions within the 17‐item Hamilton Depression Rating Scale (HDRS) and use data‐driven methods to relate multivariate patterns of patient clinical, demographic, and brain structural changes over electroconvulsive therapy (ECT) to dimensional changes in depressive symptoms. We included 110 ECT patients from Global ECT‐MRI Research Collaboration (GEMRIC) sites who underwent structural MRI and HDRS assessments before and after treatment. Cross validated random forest regression models predicted change along symptom dimensions. HDRS symptoms clustered into dimensions of somatic disturbances (SoD), core mood and anhedonia (CMA), and insomnia. The coefficient of determination between predicted and actual changes were 22%, 39%, and 39% (all p < .01) for SoD, CMA, and insomnia, respectively. CMA and insomnia change were predicted more accurately than HDRS‐6 and HDRS‐17 changes (p < .05). Pretreatment symptoms, body‐mass index, and age were important predictors. Important imaging predictors included the right transverse temporal gyrus and left frontal pole for the SoD dimension; right transverse temporal gyrus and right rostral middle frontal gyrus for the CMA dimension; and right superior parietal lobule and left accumbens for the insomnia dimension. Our findings support that recovery along depressive symptom dimensions is predicted more accurately than HDRS total scores and are related to unique and overlapping patterns of clinical and demographic data and volumetric changes in brain regions related to depression and near ECT electrodes.

and right superior parietal lobule and left accumbens for the insomnia dimension. Our findings support that recovery along depressive symptom dimensions is predicted more accurately than HDRS total scores and are related to unique and overlapping patterns of clinical and demographic data and volumetric changes in brain regions related to depression and near ECT electrodes.
K E Y W O R D S electroconvulsive therapy, machine learning, major depressive disorder, structural neuroimaging, symptom heterogeneity

| INTRODUCTION
Depression is a leading cause of disability worldwide (James et al., 2018). Despite its prevalence, effective treatment remains challenging. Antidepressant and behavioral interventions serve as first-line treatments but roughly one third of patients remain unresponsive to these interventions (Trivedi et al., 2006). Heterogeneity of depressive symptoms likely contributes to treatment failures as a variety of mechanisms and etiologies may be associated with particular dimensions of depressive symptoms.
Limited alignment between DSM-based categorizations of psychiatric disorders and underlying neurobiological processes related to these disturbances has motivated an effort to characterize these dysfunctions along symptom dimensions. The National Institutes of Mental Health's Research Domain Criteria (RDoCs) has attempted to formalize this research. In support of this approach, a high degree of symptomatic heterogeneity is often observed across individuals sharing the same diagnosis. A DSM-V depression diagnosis, for example, requires presentation of at least five out of nine symptoms in addition to one of two core symptoms; yielding 227 potential symptom constellations for a single diagnosis (van Loo, de Jonge, Romeijn, Kessler, & Schoevers, 2012). Moreover, there may be nuanced differences in neural systems involved across this potentially varied spectrum of symptom manifestations. Thus, a clearer mapping between neural systems and symptom constellations would inform more targeted antidepressant neurostimulation techniques such as transcranial magnetic stimulation, high-definition transcranial direct current stimulation, or more focal electroconvulsive therapy (ECT) strategies.
ECT is a rapidly acting and effective treatment for major depressive disorder (MDD) boasting response rates between 50 and 80% (Haq, Sitzmann, Goldman, Maixner, & Mickey, 2015;Tokutsu et al., 2013). Previous studies have investigated neuroimaging-based biomarkers for various clinical outcomes following ECT (Jiang et al., 2018;Leaver et al., 2018;Schmitgen et al., 2020). Among these, cortex, and gyrification of the right middle frontal gyrus, and treatment-related reductions in depressive symptoms; pretreatment symptom severity was also a key predictor of symptom change (Schmitgen et al., 2020). Previous work from our group reported on volumetric increases in the accumbens, pallidum, and caudate among ECT-responsive patients (Wade et al., 2016). Using resting state functional connectivity measures, Leaver et al. reported that pretreatment connectivity of networks encompassing the dorsolateral prefrontal cortex, subgenual anterior cingulate, and motor cortex were predictive of ECT response (Leaver et al., 2018).
Many earlier studies tracked the course of depressive symptoms using only aggregate scores from multi-item scales (Ousdal et al., 2020) such as the Hamilton Depression Rating Scale (HDRS), thereby ignoring potentially separable symptom dimensions. Validity of the HDRS total score and related scales as measures of depression severity have been established, but aggregated scales may fail to separate distinct symptom dimensions (Michael Bagby, Ryder, Deborah Schuller, & Marshall, 2004). Moreover, use of a total score as a primary outcome ignores potentially subtle differences between symptom profiles that might relate to clinical outcomes.
Several strategies could be used to model symptom heterogeneity in the context of biomarker-identification studies including relating neuroimaging measures to changes in: (1) individual items of a scale; (2) scale sub-scores related to depression subtypes (i.e., atypical, melancholic, psychotic depression); or (3) weighted combinations of scale items. The first approach is limited as it ignores between-item correlation and overlooks more parsimonious models. The second approach overcomes this, but, a priori depression subtypes based on clinical observations may not align with clustering tendencies of observed data. The third approach is commonly implemented using either exploratory factor analysis (EFA) or principal components analysis (PCA) and addresses potential limitations of the first two strategies by identifying compact, lower-dimensional representations of outcomes in a data-driven manner.
Previous studies used PCA decompositions of the HDRS; however, results have yielded between two to eight factors with inconsistent loadings (Michael Bagby et al., 2004). Others have applied Rasch logistic analysis to extract a subset of six HDRS items (i.e., the HDRS-6) that provide a consistent unidimensional measure (Bech et al., 1981). Factor analysis and item response theory have also been explored (Faries et al., 2000;Williams, 2001). While these approaches improve psychometric properties of the HDRS by enforcing a more valid unidimensional internal structure, additional information pertinent to other aspects of depression is then discarded by item exclusion.
More recently, we used EFA to identify latent symptom dimensions in a large, multisite patient cohort undergoing electroconvulsive therapy (ECT) for depression . We identified three symptom dimensions from pretreatment HDRS-17 items: somatic symptoms, core mood and anhedonia, and insomnia. Here, we build on this work and aim to relate patient demographic, clinical, and patterns of regional ECT-induced volumetric brain changes to changes in latent symptom dimensions using data-driven methods. We hypothesized that demographic and clinical measures would be broadly related to symptom changes, and as has been previously reported, that regional volumetric changes of the hippocampus, motor cortex, striatum, and cingulate would be commonly related to changes along separate symptom dimensions. However, at the same time we hypothesized that there would be unique volumetric changes related to individual symptom dimensions based on the assumption that these dimensions recruit different functional networks. Finally, we hypothesized that prediction of symptom changes would be more accurate along latent symptom dimensions rather than the HDRS-17 total score. Characterizing these unique and shared predictors and mechanisms of treatment response will inform development of targeted Treatment protocols were naturalistic as stimulus parameters were not manipulated for research purposes. Local physicians determined each patient's need for ECT. As reported , several treatment procedures differed across sites. Site 1 used a MECTA Spectrum 5000Q device (Oswego, Oregon); all other sites used Thymatron IV devices (Somatics Inc.). All sites used barbiturate (Methohexitol or Thiopental) induction. Modes of ECT administration were mixed within and across the sites and included right unilateral d'Elia, bitemporal, and left anterior right temporal. Sites 1 and 2 used ultrabrief pulse widths (<0.5 ms) while sites 3 and 4 used brief pulse widths (0.5 ms). ECT pulse amplitude was 800 and 900 milliamps for sites 1 and 2-4, respectively. Only Site 1 tapered patients off antidepressant medications before ECT. Treatment resistance was defined as failure to respond to two or more medication trials at Sites 1 and 2; one or more unsuccessful psychotherapy trials for Site 4; Site 3 had no formalized criteria.
HDRS-17 assessments and MR imaging occurred immediately before ECT index series and within 2 weeks of completing the index series for all sites. ECT was administered 3 times weekly for Sites 1, 2, and 3 and 2-3 times weekly for Site 4. Treatment duration was determined based on clinical response and was suspended either upon achieving maximal clinical response defined by either the HDRS or Quick Inventory of Depressive Symptoms or after no appreciable benefit was observed.
All participants provided written informed consent as approved by their local ethical committees or Institutional Review Boards (IRBs), and centralized analysis of pooled data was approved by the Regional Ethic Committee South-East in Norway (2018/769).

| Image processing
Image processing details have previously been outlined (Oltedal et al., 2018). Images were acquired on Siemens 3T (Erlangen, Germany) system for Sites 1, 2 and 3 and a Philips 3T (Gyroscan Intera 3T, Philips Medical Systems, Best, NL) at Site 4. Each site provided pretreatment and post-treatment 3T T1-weighted MRI images, with a minimum resolution of 1.3 mm in any direction. Sites uploaded MRIs to a common server for a unified preprocessing approach which included correction for scanner-specific gradient nonlinearity, registration to a common atlas space, and resampling to a 1 mm isotropic grid. Images were segmented using FreeSurfer version 5.3 and Quarc (Holland & Dale, 2011) was used to estimate regional cortical and subcortical volumes. Table S1 lists the included regions of interest (ROIs).

| Symptom dimensions
We previously applied EFA to the pretreatment HDRS-17 items of this cohort . Three latent dimensions were identified: somatic disturbances (SoD), core mood and anhedonia (CMA), and insomnia. Symptom dimension change was computed by subtracting its baseline from the post-treatment score.
trained and tested using 10-repeated 10-fold cross validation in which roughly 90% of the samples were randomly assigned to a training set and the remaining 10% were assigned to a test set. RFR models were fit with 1,000 regression trees and were trained to minimize the root mean squared error (RMSE) of the predicted versus actual symptom change. Prior to model training, two transformations were applied separately to patient data in the training and testing sets to prevent biasing model performance (information leakage). First, to isolate the contributions of regional brain volumes independently apart from any correlated clinical or demographic features, regional volumetric data in the training set was residualized with respect to patient age, sex, BMI, primary and secondary electrode placements, and pretreatment symptom dimension severity. Regression model parameters identified for the training data were used to residualize test set imaging data.
Similarly, because multisite imaging data is commonly confounded by scanner-specific idiosyncrasies, we used an adapted ComBat harmonization algorithm (Fortin et al., 2017;Radua et al., 2020) to harmonize (i.e., mitigate unwanted influences of acquisition site) training and testing imaging data, separately. Harmonization parameters acquired from the training set were applied to the test cohort (see Figure S1).
Model performance was assessed using the sum-of-squares R 2 value (i.e., the fraction of explained variance), where y i is the actual symptom change for the i-th subject; b y i is the predicted symptom change for the i-th subject; and y is the mean symptom change. The normalized RMSE (NRMSE) value was also reported to facilitate comparisons across scales with different ranges, NRMSE = RMSE y max Ày min where y max and y min are the maximum and minimum outcome values, respectively, in the test set; each metric was evaluated across test set predictions (Poldrack, Huckins, & Varoquaux, 2020). The significance of each model's fit was assessed using permutation tests with 100 shuffles. Multiple comparisons made across latent symptom dimensions and benchmarks were jointly adjusted using the false discovery rate method with a significance level of .05. Performance differences for models across symptom dimensions were done using one-way ANOVAs to test the difference in R 2 scores resulting from cross validation. The importance of each feature was computed using permutation-based calculations of the percent increase in mean squared error (PIMSE). This formulation is detailed in Supplementary Methods.

| Benchmarks
We compared the performance of models using latent symptom dimensions as outcomes to models using the HDRS-17 and HDRS-6 total scores.

| Post-hoc sensitivity analyses
Several sensitivity analyses were conducted to probe model robustness and specificity. First, to directly understand the contribution of neuroimaging features, we evaluated the performance of models that included only regional brain volume predictors, rather than the joint set of volumetric, clinical, and demographic predictors. In this approach, we again regressed the effects of demographic and clinical variables out of the imaging predictors and harmonized them using ComBat.
We secondly explored whether models trained on a specific dimension would be uniquely predictive of the symptom dimension it was trained on or whether it is equally predictive of other dimensions.
This would inform potentially unique mechanisms underlying reduction of specific symptom clusters.
Thirdly, we evaluated whether model performance improved when Site 1 participants, where patient symptoms were significantly less improved, were excluded.
We next excluded patients with psychotic features as their symptoms were significantly more reduced than those without psychotic features and were predominantly from Site 2 (see Section 3). However, because psychotic features are known to be indicative of better response to ECT (van Diermen et al., 2018), we kept these patients in the primary analysis.
Lastly, we performed leave-one-site out cross validation to evaluate model generalizability to data from sites unseen by the model. Here, however, an additional step was necessary to permit data harmonization. Because these models were each trained on three of the four sites, harmonization parameters for the held-out site were not available from the training data set. To correct this, an intermediary random forest model was trained to predict site labels of the training observations. This model was then used to predict pseudo-site labels for the held-out site. Data from the hold-out site was then harmonized using these pseudo-site labels as an approximation.

| Code availability
Code for this project is available at https://github.com/bscwade/ gemric_latent_symptom_dimension_study. HDRS-17 scores from all sites were reduced more than observed in the healthy control cohort (all p < .05, one-tailed). The degree of symptom reduction did not differ between those with and without a diagnosis of bipolar disorder (t = 1.3, df = 9.7, p > .05). Symptoms were significantly more reduced among patients with psychotic features compared to those without (t = À4.4, df = 20.1, p < .001). Detailed patient characteristics are provided in Table 1.

| Pretreatment symptom dimensions
As detailed in Wade et al. (2020), the SoD dimension included somatic gastrointestinal (G.I.) symptoms, hypochondriasis, feelings of guilt, genital symptoms, general somatic symptoms, somatic anxiety, psychic anxiety, and psychomotor agitation HDRS items. The CMA dimension was composed of work and interests, weight loss, psychomotor retardation, and depressed mood. The insomnia dimension included early, middle, and late insomnia items (see Figure S2). Notably, suicide and insight items did not adequately load onto a single factor and thus were omitted. The correlated change in symptom dimensions over treatment ranged between .23 and .91 (see Figure S3). The degree of dimension changes also varied by site ( Figure S4). All symptom dimensions were significantly reduced across patients relative to controls (all p < .01, one-tailed).

| Models including clinical and demographic predictors
The fractional variance explained, R 2 , for the SoD dimension was 22% with an NRMSE of 0.19 (p < .01). Model performances are tabulated in Table 2 Performance for the insomnia dimension was R 2 of 39% with an NRMSE of 0.15 (p < .01). Top predictors for insomnia were baseline symptom severity (PIMSE = 2.6%) and BMI (1.0%). Important imaging features included the right superior parietal lobule, left accumbens, and right pars triangularis volume changes (all PIMSE <1%). Higher pretreatment insomnia symptoms, BMI, age, and volumetric reductions of the right superior parietal lobule, left accumbens, and pars triangularis associated with more reduced insomnia symptoms.

| Models excluding clinical and demographic predictors
Model performance was substantially lower when clinical and demographic features were not used though all were predicted significantly above chance levels. The R 2 score for the SoD dimension was 2.5% while the NRMSE was 0.21 (both p < .05). Important predictors for these models are reported in Supplementary Results, illustrated in Figure 1b, and Figure S5 illustrates PDP plots for these models. The CMA dimension R 2 score was reduced to 5.6% with an NRMSE of 0.20 (both p < .05). The insomnia symptom R 2 score lowered to 1.4% with an NRMSE of 0.19 (both p < .05). Here, CMA dimension change was predicted significantly more accurately than SoD and insomnia symptoms (all p < .01).

| Benchmark predictions
When clinical and demographic predictors were included, R 2 scores for the prediction of HDRS-6 change was 24% with an NRMSE of 0.21 (p < .01). The R 2 score for the HDRS-17 change was 25% with an  In models that excluded clinical and demographic predictors and instead used only neuroimaging predictors, the HDRS-6 R 2 score was 4.9% with an NRMSE of 0.24 while the HDRS-17 R 2 was 10% with an NRMSE of 0.23 (all p < .05). With clinical and demographic measures excluded, the HDRS-17 change was predicted significantly more accurately than the CMA, SoD, insomnia, and HDRS-6 symptom sets (all p < .0001).

| Cross factor predictions and model specificity
Model performance was lower, on average, when models were trained and tested on separate symptom dimensions. However, models trained on CMA symptoms and used to predict SoD symptoms, or vice versa, did not significantly differ in distributions of R 2 scores from models trained and tested within these respective dimensions.

| Models excluding Site 1
Models excluding patients from Site 1 and including clinical and demographic predictors performed higher, on average with R 2 scores of 35%, 38%, and 47% for SoD, CMA, and insomnia dimensions, respectively.
R 2 scores were 22% and 28% for the HDRS-6 and HDRS-17 benchmarks, respectively. When these models excluded clinical and demographic predictors, however, all R 2 scores were below zero.

| Leave-one-site-out cross validation
When clinical and demographic predictors were included, all R 2 scores were negative under leave-one-site-out cross validation (LOOCV) except for the insomnia dimension (R 2 = 30%). Exclusion of clinical and demographic predictors produced negative R 2 scores for all symptom dimensions.

| DISCUSSION
Our study extends recent work using exploratory factor analysis to identify three homogenous latent symptom dimensions of treatmentresistant MDD. We used data-driven methods to identify clinical, demographic, and distributed patterns of regional volumetric brain changes associated with ECT-related recovery along these latent symptom dimensions of depression. Models predicting recovery along core mood and anhedonia and insomnia dimensions were predicted significantly more accurately than the more widely used HDRS-6 and HDRS-17 total scores. Further, treatment-related recovery along these symptom dimensions was associated with unique and overlapping sets of clinical, demographic, and volumetric brain regions.
Brain regions informative of symptom changes were generally wellaligned with existing literature. widely been linked with better ECT responsivity (Kranaster et al., 2018;O'Connor et al., 2001). However, the etiology of depression in elderly patients more frequently involves cardiovascular pathology, cognitive impairment, or chronic medical illness (Alexopoulos, 2005), which in addition to normal aging effects of brain tissue loss, could impact response to ECT. In our sample, patient age differed significantly by site that may have limited our model generalizability to new sites. Earlier work has also reported that elevated BMI is associated with reduced white matter integrity (Repple et al., 2018), gray matter reductions of frontal regions, and a more chronic course of depression (Opel et al., 2015). Interestingly, while elevated BMI predicted poorer outcomes for the CMA symptom dimension, it predicted better treatment response for the SoD and insomnia symptom clusters. This effect is likely driven by the inclusion of the HDRS weight loss item in the CMA factor. Here, we observed that, relative to patients who exhibited no change in the weight loss item, those who reduced their weight loss symptoms had a significantly lower pretreatment BMI (p < .05).
Although clinical and demographic measures were the most informative predictors of outcomes, models using only changes in regional brain volumes as predictors were highly significant and explained between 1 and 10% of outcome variability across symptom dimensions and remain valuable. Regional volumetric predictors of change  (Jiang et al., 2018), and rostral middle frontal gyrus (Mulders et al., 2020), while other studies have identified the pretreatment middle frontal gyrus structure (Jiang et al., 2018) as a predictor of ECT response. Several temporal regions have been reported to change over ECT index (Ota et al., 2015;van Eijndhoven et al., 2016). Imaging predictors of SoD symptom reduction included volumetric changes in the left frontal pole, right transverse temporal gyrus, and the left pallidum. The transverse temporal gyrus was also a predictor of CMA change. As SoD and CMA symptom changes were highly correlated and the temporal gyrus is proximal to ECT electrodes, this is perhaps unsurprising but may also suggest a common neurobiological basis for these symptoms. Volumetric increases of the right frontal pole was likewise inversely related to antidepressant response in a related cohort using the Montgomery-Asberg Depression Rating Scale and HDRS total scores as outcomes (Mulders et al., 2020). Similarly, an earlier study of our own in a partially overlapping cohort identified that volumetric increases in the right pallidum were associated with poorer ECT response (Wade et al., 2016); here, we report the same trend for the contralateral pallidal volume.
Volumetric changes of the right superior parietal lobule, left accumbens, and right pars triangularis were predictive of change in the insomnia dimension. Although their specific relationships to symptoms of insomnia are difficult to interpret, ECT-related changes in the accumbens have previously been related to antidepressant response (Wade et al., 2016). The superior parietal lobule is a component of the default-mode network (DMN) which may indirectly modulate emotional processes underlying depressive symptoms (Zhang, Peng, Sweeney, Jia, & Gong, 2018); this structure also neighbors the supramarginal gyrus which has been shown to relate to ECT response (Mulders et al., 2020).
The imaging measures informative of treatment-related symptom changes are biologically plausible. Despite differences in analysis methods and samples, findings also appear collectively aligned with previous reports of treatment-predictive and treatment-responsive biomarkers of ECT response, which point to the involvement of somatomotor cortex (Leaver et al., 2018), RMFG (Mulders et al., 2020), and striatum and basal ganglia (Wade et al., 2016). Notably, the current findings expand on previous work by shedding new light on how these broad neural systems along with patient clinical/ demographic characteristics relate to more specific symptom constellations.
Importantly, because heterogeneity was a factor not only at the level of symptoms but also across sites, we decided to conduct leave-

| Limitations
Naturalistic treatment protocols across this multisite study are a potential limitation; however, this design also builds the natural diversity of ECT treatment protocols and outcomes seen worldwide into our models. Variation in treatment parameters was naturally confounded with acquisition site. We did, however, include both site and electrode placement as predictors though electrode placement was not an important predictor. Relatedly, patient inclusion/exclusion criteria varied by site though all had in common that patients were required to have failed at least one prior treatment or that patients immediately needed ECT. A minority of patients were included with diagnoses of bipolar disorder and depression with psychotic features.
Symptom changes did not significantly differ among those with bipolar disorder but were significantly more reduced among those with psychotic features as is commonly observed (van Diermen et al., 2018). Site 1 patients were also least responsive with an average HDRS-17 reduction of 3.8 points. These patients were tapered off of antidepressant medication prior to ECT index, though concurrent medications do not reliably effect the efficacy of ECT (Fink, 1994;Haskett & Loo, 2010). Exclusion of patients from Site 1 largely improved model performance and generalizability but this gain is counterbalanced against the benefit of training these models using the range of ECT outcomes seen around the world. Lastly, we emphasize that prospective prediction of clinical outcomes was not attempted; this would require exclusive use of pretreatment predictors.

| Conclusions and future directions
Taken together, our findings provide new evidence that use of homogenized latent symptom dimensions of multi-item scales can improve the detection of imaging, demographic, and clinical biomarkers related to the trajectories of specific symptom constellations.
While clinical and demographic measures accounted for more outcome variability, neuroimaging measures of regions often implicated in the pathology of depression and ECT-related treatment response were significantly predictive and accounted for between 1 and 10% of outcome variability when used alone. As neurostimulation methods become more refined and capable of targeting more specific neural systems, it is plausible that findings such as these will inform the targeting of neural systems underlying more specific symptom dimensions. Future work will explore whether prospective prediction of change in these symptom dimensions will similarly be predicted more accurately than more heterogenous total scores.