Individualized treatment response prediction of dialectical behavior therapy for borderline personality disorder using multimodal magnetic resonance imaging

Abstract
Introduction: Individualized treatment prediction is crucial for the development and selection of personalized psychiatric interventions. Here, we use random forest classification based on pretreatment clinical and demographical (CD), functional, and structural magnetic resonance imaging (MRI) data from patients with borderline personality disorder (BPD) to predict individual treatment response.
Methods: Before dialectical behavior therapy (DBT), 31 female patients underwent functional (three different emotion regulation tasks) and structural MRI. DBT response was predicted using CD and MRI data in previously identified anatomical regions, which have been reported to be multimodally affected in BPD.
Results: Amygdala and parahippocampus activation during a cognitive reappraisal task (in contrasts displaying neural activation for emotional challenge and for regulation), along with severity measures of BPD psychopathology and gray matter volume of the amygdala, provided the best predictive power, with neuronal hyperreactivity in nonresponders. All models, except one model using CD data solely, achieved significantly better accuracy (>70.25%) than a simple all‐respond model, with sensitivity and specificity of >0.7 and >0.7, as well as positive and negative likelihood ratios of >2.74 and <0.36 each. Surprisingly, a model combining all data modalities only reached rank five of seven. Among the functional tasks, only the activation elicited by a cognitive reappraisal paradigm yielded sufficient predictive power to enter the final models.
Conclusion: This proof of principle study shows that a random forest classification approach can achieve good predictions of psychotherapy outcome and identify the most valid predictors among numerous variables.


| INTRODUCTION
Although dialectical behavior therapy (DBT) is currently the best-established psychosocial treatment for borderline personality disorder (BPD; Cristea et al., 2017; Kliem, Kroger, & Kosfelder, 2010; Stoffers et al., 2012), it is unclear who will respond best to this therapy. Since psychotherapy is quite expensive, a prognostic tool to predict which patients will benefit from DBT is desirable. However, so far only a few investigations have sought a better understanding of individual predictors of therapy response in BPD.
In anxiety disorders, it has been shown that functional and structural magnetic resonance imaging (MRI) data, in addition to variables such as clinical characteristics and demographics (CD) or electroencephalography (EEG), can provide useful information for models predicting treatment response (Ball, Stein, Ramsawh, Campbell-Sills, & Paulus, 2014; Lueken et al., 2016). Most studies used only one or two of these modalities to predict treatment response, and none (to the knowledge of the authors) has so far combined functional and structural MRI (fMRI and sMRI, respectively) with CD data to predict BPD therapy response.
Such a multimodal and integrative approach to predict treatment outcome is of major interest in a disorder with a psychotherapy response rate not significantly exceeding 50% of patients (Stoffers et al., 2012). It is also in line with the leading idea of the Research Domain Criteria (RDoC) Initiative, which is to ground treatment development and outcome prediction on dimensions of equally weighted neurobiological measures and behavioral functions (Cuthbert & Kozak, 2013).
Affect dysregulation is a central characteristic of BPD psychopathology (Linehan, 1993) and a main target of intervention across various psychotherapeutic programs. Meanwhile, neuronal correlates of affect dysregulation in BPD have been identified, with a meta-analysis suggesting multimodal (functional and structural) dysfunctions in frontolimbic brain areas, such as left amygdala, right parahippocampus, and left hippocampus, as well as left and right hemisphere inferior and superior frontal, temporal, parietal, and motor-associated regions and the cerebellar vermis (Schulze, Schmahl, & Niedtfeld, 2016). Therefore, we specifically examined the role of functional and structural alterations of these regions of interest (ROIs) in predicting treatment response of patients with BPD receiving a psychosocial treatment, dialectical behavior therapy (DBT), which specifically focuses on improving affect regulation capacity in patients with BPD (Kliem et al., 2010; Stoffers et al., 2012).
Among the various techniques proposed for predicting treatment response via neurobiological markers (Lueken et al., 2016), the random forest method (Breiman, 2001) stands out for its robustness and excellent suitability for predictive data analysis (Qi, Bar-Joseph, & Klein-Seetharaman, 2006). The overall procedure can be summarized in three main steps: decision tree and forest building, cross-validation, and selection of the most relevant variables to build a final random forest model (further details can be found in Breiman, 2001; Bureau et al., 2005; Genuer, Poggi, & Tuleau-Malot, 2010; Strobl, Malley, & Tutz, 2009). With regard to psychotherapeutic outcome predictions, random forest models have already successfully been used with fMRI and sMRI data separately (Wade et al., 2015), but (to the knowledge of the authors) not with multimodal fMRI and sMRI data sets in addition to CD data.
Here, we used random forests based on CD, fMRI, and sMRI data to specifically predict DBT outcomes in patients with BPD. The fMRI data were based on three different affect regulation tasks, which had been acquired in addition to sMRI data from 31 patients with BPD before DBT (Niedtfeld et al., 2017; Schmitt, Winter, Niedtfeld, Herpertz, & Schmahl, 2016; Winter et al., 2017). In summary, these longitudinal studies showed that successful DBT is behaviorally represented in a more efficient emotion regulation during reappraisal of negative pictures, a normalization of the processing of painful stimuli, and lower emotional susceptibility during distraction via alterations of the respective functional networks (including bilateral parahippocampus, amygdala, anterior cingulate cortex, and orbitofrontal cortex, as well as right dorsolateral prefrontal cortex and cerebellum).
We expected that the multimodally affected regions (see Table 3, "multimodally affected brain regions in patients with BPD," in the erratum to Schulze et al., 2016) would provide useful information for building accurate predictive models. Therefore, we examined the utility of random forest analysis to specifically predict treatment response to DBT in patients with BPD via multimodal data sets in addition to CD data.

| Participants
The sample used in this study comprised 31 female patients drawn from previous studies (Niedtfeld et al., 2017; Schmitt et al., 2016; Winter et al., 2017). All patients met DSM-IV criteria for BPD diagnosis (including affective instability and self-injurious behavior), received treatment in one of two residential DBT programs at the Center for Psychosocial Medicine and the Central Institute of Mental Health, both located at Heidelberg University, and provided fMRI data from all three tasks as well as sMRI data.
As in the three earlier reports using the patient pool (Niedtfeld et al., 2017; Schmitt et al., 2016; Winter et al., 2017), exclusion criteria were left-handedness, traumatic brain injury, lifetime diagnoses of schizophrenia or bipolar I disorder, mental or developmental disorders, substance dependence during the last year, consumption of illegal drugs in the last two months, current severe depressive episode, and benzodiazepine use. Furthermore, patients who had significant DBT skills training experience and/or did not meet criteria for MRI safety and eligibility were excluded.
Since the same patient pool as in the three earlier reports (Niedtfeld et al., 2017; Schmitt et al., 2016; Winter et al., 2017) was used, the clinical and demographical measures were also adopted from them: Trained clinical psychologists assessed BPD diagnoses using the International Personality Disorder Examination (IPDE; Loranger, 1999) and Axis I disorders using the Structured Clinical Interview for DSM-IV (SCID-I; Wittchen, Wunderlich, Gruschwitz, & Zaudig, 1997). Accompanying the MRI measurements, BPD symptom severity was assessed using the Zanarini Rating Scale for BPD (ZAN-BPD; Zanarini et al., 2003) and the Borderline Symptom List (BSL; Bohus et al., 2007). Emotion regulation difficulties were assessed by the Difficulties in Emotion Regulation Scale (DERS; Gratz & Roemer, 2004) and dissociative symptoms by a self-report questionnaire (FDS; German version of the Dissociative Experiences Scale DES; Spitzer et al., 1998). The State-Trait Anxiety Inventory (STAI-state, STAI-trait; Spielberger, Gorsuch, & Lushene, 1970) was used to probe anxiety of the patients, depressiveness was measured via the Beck Depression Inventory (BDI; Beck, Ward, Mendelson, Mock, & Erbaugh, 1961), and participants were instructed to memorize digits to estimate working-memory capabilities (digit span; Tewes, 1991). Additionally, patients were asked whether they had used skills within the last three days and whether they considered this skill use successful. Identification of DBT responders was performed as in the three earlier reports (reliable change index [Jacobson & Truax, 1991] based on the symptom reduction in the ZAN-BPD total score, cut-off ≥ 1.96; Niedtfeld et al., 2017; Schmitt et al., 2016; Winter et al., 2017), resulting in the identification of 16 patients showing a significant improvement after therapy (Table 1).
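The responder criterion can be illustrated with a minimal sketch of the reliable change index in the form given by Jacobson and Truax (1991). The pretreatment standard deviation, the reliability coefficient, and the example scores are hypothetical placeholders, not values from this study, and the sign convention (symptom reduction yields a positive index) is an assumption made for readability.

```python
import math


def reliable_change_index(pre, post, sd_pre, reliability):
    """Reliable change index, signed so that symptom reduction is positive.

    sd_pre and reliability are properties of the scale; the values passed
    in the example below are hypothetical, not taken from the study.
    """
    se_measurement = sd_pre * math.sqrt(1.0 - reliability)
    s_diff = math.sqrt(2.0 * se_measurement ** 2)
    return (pre - post) / s_diff


def is_responder(pre, post, sd_pre, reliability, cutoff=1.96):
    """Apply the cut-off of 1.96 named in the text."""
    return reliable_change_index(pre, post, sd_pre, reliability) >= cutoff


# Hypothetical pre/post symptom totals for one patient.
example_rci = reliable_change_index(pre=18.0, post=8.0, sd_pre=5.0, reliability=0.8)
```

With these made-up inputs, a 10-point reduction exceeds the cut-off, whereas a 3-point reduction on the same scale would not.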

| Dialectical behavior therapy
Patients participated in a well-established and evaluated 12-week standard residential DBT-based treatment (Bohus et al., 2004). The program comprised weekly skills training groups (emotion regulation skills, mindfulness, self-esteem, and social competence) and individual treatment twice a week. Therapists were experienced Ph.D., M.D., and M.Sc. level clinicians (psychologists and physicians) and were supervised regularly. The treatment program has already been

| Cognitive reappraisal task
The cognitive reappraisal task (reap) is described in detail in Schmitt et al. (2016). In brief, participants were instructed for 2 s to either look at or decrease their emotions via cognitive reappraisal during the presentation of the following image. The following negative or neutral images were selected from the International Affective Picture System (Lang, Bradley, & Cuthbert, 1997) or the Emotional Picture Set (Wessa et al., 2010) and presented for 6 s. Negative images were low in valence and high in arousal. Neutral images had intermediate valence and low arousal. After the picture presentation, the letter "O" was presented occasionally (for 2 s) and participants had to respond to that by pressing a button as fast as possible. After each trial, a fixation cross was presented for 3-8 s (mean 5.5 s). In total, the experiment comprised 72 trials (18 per condition).
Table 1 note: Values presented as mean and standard deviation (SD). Statistic refers to comparison between responders and nonresponders as t-value and degrees of freedom (df). Medication at pretreatment was missing six data points (three from responders), which were predicted via decision tree. The scores of the Beck Depression Inventory (BDI) at post-treatment were missing one data point (responder), which was predicted via decision tree.

| Sensory distraction task
The sensory distraction task (pain) is described in detail in Niedtfeld et al. (2017). In brief, participants were instructed to look at the following image for 2 s. The following negative or neutral images were selected from the International Affective Picture System or the Emotional Picture Set and presented for 6 s. Similar to the cognitive reappraisal task, negative images were low in valence and high in arousal and neutral images had intermediate valence and low arousal. During 50% of the presented images, a painful heat stimulus was delivered to the participants (individually adapted temperature).
After the picture presentation, the letter "O" was presented occasionally (for 2 s) and participants had to respond to that by pressing a button as fast as possible. After each trial, a fixation cross was presented for 3-8 s (mean 5.5 s). In total, the experiment comprised 72 trials (18 per condition).

| Cognitive distraction task
The cognitive distraction task (distr) is described in detail in Winter et al. (2017). In brief, participants were instructed to either look at or memorize five consonants, presented for 2 s, followed by a negative or neutral image (presented for 6 s) selected from the International Affective Picture System or the Emotional Picture Set. As in the other tasks, negative images were low in valence and high in arousal and neutral images had intermediate valence and low arousal. In the memorize condition, participants had to press a respective button as fast as possible to indicate whether a presented character (presented for 2 s) was included in the initially presented string after presentation of the image. In the view condition, the letter "O" was presented after the image (presented for 2 s) and participants had to respond to that by pressing a button as fast as possible. After each single-letter presentation, a fixation cross was presented for 3-8 s (mean 5.5 s). In total, the experiment comprised 72 trials (18 per condition).

| MRI data acquisition
Whole-brain fMRI data was acquired on a 3 Tesla Siemens TRIO-MRI

| Statistical analysis
To predict therapy response to DBT using CD, fMRI, and sMRI data, we used a random forest classification approach, following the statistical procedure presented in Ball et al. (2014). To do so, the randomForest package was used in R statistics (version 3.5.1, http://cran.r-project.org). Our main objective was to predict responder status using CD, functional activations, and GMV extracted from anatomical ROIs (functional and anatomical ROIs were selected from the meta-analysis by Schulze et al. (2016), as described under "Image processing") during the three experiments (reap, pain, and distr) in two differential contrasts (emotional challenge and regulation), and the anatomical images at the start of treatment.
In contrast to Ball et al. (2014), we used 10,000 classification trees in the random forest procedure and performed a 10-repeated 10-fold cross-validation as described at https://machinelearningmastery.com/k-fold-cross-validation/. The procedure was carried out as follows: First, the order of the subjects was shuffled. In the second step, the data were split into k groups (here: k = 10, while groups one to nine comprised three subjects and group 10 comprised four subjects). In step three, the following algorithm was performed for each unique group: (1) take the group as test group; (2) the remaining groups build up the training group; (3) fit the random forest model (each tree using a bootstrapped subsample of the model training group and a randomly selected subset of the seven sets of independent variables (CD, fMRI, sMRI, and all combinations of these sets)) and evaluate it on the test group; (4) keep the evaluation scores and discard the model. In the fourth step, the evaluation scores were summarized. The whole procedure was repeated 10 times for each set of the independent variables, resulting in a different split of the sample during each repetition, and evaluation scores were summarized over the 10 repetitions (see Figure 1). The calculated median permutation importance scores (Genuer et al., 2010) were then used to select variables for inclusion in the final models.
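The splitting scheme described above can be sketched as follows. This is only an illustration of the shuffle-and-split logic under stated assumptions: the function names and seeds are hypothetical, the model fitting itself (done with the randomForest package in R in this study) is omitted, and the fold sizes follow the description of nine groups of three subjects and one group of four.

```python
import random


def make_folds(subject_ids, k=10, seed=None):
    """Shuffle the subjects and cut them into k groups.

    The first k - 1 groups have len(subject_ids) // k members; the last
    group takes the remainder (31 subjects -> nine groups of 3, one of 4).
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    base = len(shuffled) // k
    folds = [shuffled[i * base:(i + 1) * base] for i in range(k - 1)]
    folds.append(shuffled[(k - 1) * base:])
    return folds


def repeated_cv_splits(subject_ids, k=10, repeats=10, seed=0):
    """Yield (repetition, fold index, training ids, test ids)."""
    for rep in range(repeats):
        # A different seed per repetition gives a different split each time.
        folds = make_folds(subject_ids, k=k, seed=seed + rep)
        for i, test_ids in enumerate(folds):
            train_ids = [s for j, fold in enumerate(folds) if j != i for s in fold]
            # A model would be fitted on train_ids and scored on test_ids here.
            yield rep, i, train_ids, test_ids


splits = list(repeated_cv_splits(range(31)))
```

Each subject appears in exactly one test group per repetition, so every subject is classified once per repetition and model.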
Variables survived the selection process if the median permutation importance was greater than the absolute value of the most negative value and a t test showed a significant difference from zero. After the variables for the final models were identified, the described 10-repeated 10-fold cross-validation procedure was applied to the final models. Afterward, the corresponding mean values were used to evaluate classification accuracy, sensitivity, specificity, and positive and negative likelihood ratios (each subject was classified as responder or nonresponder 70 times (= seven models × 10 repetitions)).
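A minimal sketch of this selection rule is given below. It assumes that "the most negative value" refers to the most negative median importance across all candidate variables, which the text does not state explicitly; the variable names, scores, and the critical t value (2.776, two-sided 5% level for 4 degrees of freedom) are toy placeholders rather than the study's settings.

```python
import math
import statistics


def t_statistic(scores):
    """One-sample t statistic of the scores against zero."""
    n = len(scores)
    return statistics.mean(scores) / (statistics.stdev(scores) / math.sqrt(n))


def select_variables(importances, t_crit):
    """Keep variables whose median importance exceeds |most negative median|
    and whose importance scores differ from zero (|t| > t_crit)."""
    medians = {name: statistics.median(s) for name, s in importances.items()}
    most_negative = min(medians.values())
    threshold = abs(most_negative) if most_negative < 0 else 0.0
    selected = [
        name for name, scores in importances.items()
        if medians[name] > threshold and abs(t_statistic(scores)) > t_crit
    ]
    # Order by predictive value, highest median importance first.
    return sorted(selected, key=lambda name: medians[name], reverse=True)


# Toy permutation importance scores from five cross-validation runs.
toy_scores = {
    "var_a": [0.040, 0.050, 0.045, 0.055, 0.060],
    "var_b": [0.004, -0.002, 0.003, 0.001, 0.002],
    "var_c": [-0.010, -0.012, -0.008, -0.011, -0.009],
}
selected = select_variables(toy_scores, t_crit=2.776)
```

In the toy data, only the first variable clears the threshold set by the most negative median, so it alone would enter a final model.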
The response rate for DBT was 51.61% (95% CI: 32.98%-70.25%), and therefore, a simple all-respond model, that is, classifying all patients as responders would achieve 51.61% accuracy. Consequently, only models performing better than 70.25% perform statistically better than the all-respond model and therefore provide a significant gain in prediction of DBT treatment outcome.
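The reported benchmark can be checked with a short calculation. The sketch assumes that the interval is a t-based confidence interval on the 0/1 responder indicator (sample standard deviation, 30 degrees of freedom); the text does not name the method, but this choice reproduces the reported bounds. The critical value is hard-coded and the function name is hypothetical.

```python
import math

n_patients = 31
n_responders = 16
T_CRIT_DF30 = 2.042272  # 97.5th percentile of Student's t with 30 df

rate = n_responders / n_patients
# Sample standard deviation (n - 1 denominator) of a 0/1 indicator.
sample_sd = math.sqrt(rate * (1.0 - rate) * n_patients / (n_patients - 1))
half_width = T_CRIT_DF30 * sample_sd / math.sqrt(n_patients)
ci_lower, ci_upper = rate - half_width, rate + half_width


def beats_all_respond(accuracy):
    """True if a model's accuracy lies above the upper CI bound."""
    return accuracy > ci_upper
```

Under this assumption, an accuracy of 73.33% lies above the upper bound, whereas 68.00% does not.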
To aid in the interpretation of the selected ROIs, correlations were performed between the ROIs (fMRI and sMRI) and the CD measurements in the final models. Table 1 summarizes CD measures and statistics of responder-nonresponder comparisons. Effects of fMRI tasks, behavioral data, and relationships between brain activation and symptom severity have already been reported (Niedtfeld et al., 2017; Schmitt et al., 2016; Winter et al., 2017). Here, we report the initial model (all variables) and the final model (subset of variables contributing most to classification accuracy) for each set of predictor variables. All permutation importance scores chosen for the final models significantly differed from zero (p < .001).

| CD model
The original CD model comprised the following predictors: age in years, education in years, medication at start of therapy (yes/no), digit span, BDI total score, BSL total score, FDS total score, DERS total score, STAI-state total score, STAI-trait total score, and ZAN-BPD total score. Four predictors entered the final model (ordered by predictive value): ZAN-BPD total score, BSL total score, education in years, and BDI total score. Accuracy of the final model was 68.00% (Table 2) and therefore not significantly better than a simple all-respond model.

| Combined CD and sMRI model
The ZAN-BPD total score, BSL total score, left amygdala (GMV), and BDI total score. Accuracy of the final model was 73.33% (Table 2) and therefore significantly better than a simple all-respond model.

| Combined CD, fMRI, and sMRI model
The Accuracy of the final model was 73.50% (Table 2) and therefore significantly better than a simple all-respond model.

| Combined CD and fMRI model
The and BSL total score. Accuracy of the final model was 74.75% (Table 2) and therefore significantly better than a simple all-respond model.

| fMRI model
The original fMRI model comprised mean activation for the emotional challenge and regulation contrasts in the three tasks (reap, pain, and distr) before treatment in the multimodally affected ROIs selected according to Schulze et al. (2016)

| sMRI model
The original sMRI model comprised GMV within the multimodally affected ROIs selected according to Schulze et al. (2016). Only left amygdala (GMV) entered the final model. Accuracy of the final model was 75.92% (Table 2) and therefore significantly better than a simple all-respond model.

| Combined fMRI and sMRI model
The  (Table 2) and therefore significantly better than a simple all-respond model.
Notably, the final model comprised the same predictors as the fMRI model, but in a different order. Therefore, the slight improvement of the mean accuracy might be a statistical variation due to the bootstrapped subsampling and random feature selection for each tree in the random forest procedure. Table 2 shows the features of the seven final models, and Figure 2 shows the increase in predictive information of each model which

| DISCUSSION
In this study, we used random forest models based on clinical/demographical data, functional, and structural neuroimaging data to predict DBT treatment response in patients with BPD. To our knowledge, this is the first study combining CD, fMRI, and sMRI data in random forest models for treatment outcome predictions. Our results support the conclusion of Ball et al. (2014), that fMRI data can be used to generate predictions with reasonable test characteristics. Furthermore, we show proof of principle that the combination of data from multiple modalities yields the potential to improve predictions.
The likelihood ratios indicate that relative to the odds of treatment response, predicted responders based on the combined fMRI and sMRI model (best model in terms of accuracy and negative likelihood ratio) are 3.54 times more likely to respond to DBT and predicted nonresponders are 3.45 times less likely to respond to DBT treatment. Based on the sMRI model (best model in terms of positive likelihood ratio), predicted responders are 3.71 times more likely to respond to DBT and predicted nonresponders are 2.94 times less likely to respond to DBT treatment.
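The relationship between sensitivity, specificity, and the likelihood ratios discussed here follows the standard definitions and can be sketched as below. The input values are hypothetical round numbers chosen for readability, not entries from Table 2.

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios from sensitivity and specificity."""
    lr_pos = sensitivity / (1.0 - specificity)
    lr_neg = (1.0 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability via odds multiplied by a likelihood ratio."""
    pre_odds = pre_test_probability / (1.0 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1.0 + post_odds)


# Hypothetical test characteristics.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.75, specificity=0.75)
```

A "times less likely" statement for predicted nonresponders corresponds to the reciprocal of the negative likelihood ratio; in this toy case, 1 / lr_neg equals 3.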
All models except the CD model generated predictions that were significantly better than chance with accuracies of up to 76.08% (combined fMRI and sMRI model). The combined fMRI and sMRI model contained the same variables as the fMRI model (i.e., no sMRI data), but in a different order and should therefore be interpreted as a variation of the fMRI model. The models including CD data performed less well than the models without CD data (surpris-  psychiatric patients that derive from neurobiological data are more likely to predict therapy response than purely clinical data (Bzdok & Meyer-Lindenberg, 2018;Cuthbert & Insel, 2013). According to our results, there is no clear winning model: The sMRI model wins by its simplicity (only GMV of the left amygdala as predictor) and feasibility in clinical routine (only one T1 MRI sequence to run and no time-consuming task-based data to collect and analyze), while the combined fMRI and sMRI model comprises the potential to utilize the advances of both modalities (some subjects were classified correctly in all tests by using fMRI-based data but not sMRI-based data and vice versa).
Brain activation during watching negative versus neutral images (emotional challenge) and regulating versus watching negative images (regulation) during the reappraisal task provided the best predictive power, compared with the other tasks. In particular, the mean activation within the left amygdala during the reappraisal task yielded superior predictive power, compared with the variables derived from the other tasks. Consistent with our expectation, the left amygdala therefore played a crucial role in predicting treatment response and showed functional differences between responders and nonresponders. From a prognostic point of view, hyperreactivity of the left amygdala during the emotional challenge condition in the reappraisal task points toward a lower chance to respond to DBT. This might imply that patients with lower emotional reactivity are more likely to respond to DBT.
Furthermore, higher cognitive functions such as regulating emotions might be trained during DBT. This is reflected in our finding that DBT responders show initially greater activation of the left amygdala during the regulation condition in the reappraisal task, reflecting lower regulation success. In addition, neuronal activity of the right parahippocampus, as a crucial part of the emotion regulation network (Frank et al., 2014), was found to be a significant predictor of treatment outcome and showed similar features of activation as the left amygdala.
Notably, there was no longer a significant difference in BSL and ZAN-BPD total scores between responders and nonresponders post-treatment. This might be explained by the fact that categorization into responders and nonresponders via the well-established reliable change index (Jacobson & Truax, 1991) led to higher BPD symptom severity in the responder group (Schmitt et al., 2016).
The pretreatment to post-treatment differences, however, were greater in the responder group, which is likely to be based on a known phenomenon (Gratz, Dixon-Gordon, & Tull, 2014; Schmitt et al., 2016).

| LIMITATIONS
The main limitation of the present study is the small sample size.
Therefore, the findings presented here should be interpreted as proof of principle and should not be read as suggestions for therapy selection until validated by studies using a larger sample size and models based on a wider scope of modalities. However, random fitting Breiman, 2001;Strobl et al., 2009). To further minimize the effects of small sample size and sample selection and increase stability of our results, we used a high number of classification trees and applied a 10-repeated 10-fold cross-validation for variable selection and testing the final models. Still, the sample is too small to conduct model comparisons using traditional methods.
Consequently, conclusions about model comparisons are limited and replication using a larger sample is highly desirable.
Also, further modalities and variables could conceivably enhance classification performance, such as additional questionnaires and ROIs, resting state fMRI, diffusion tensor imaging, PET, (electro-) physiological, genomic, and baseline cortisol, or cortisol reactivity data.
Our study specifically tested treatment outcome prediction of DBT for BPD via random forest models. Nevertheless, there are various alternative psychotherapy options for BPD, which are not covered by our test for utility of random forest models in treatment outcome prediction for BPD per se, as presented here. Therefore, further studies should consider testing alternative treatments or combinations of treatments as well, to take a further step toward finding the best individualized treatment.

| CONCLUSIONS
Here, we showed proof of principle that random forest models built with CD and multimodal MRI data can provide predictions of therapy response with reasonable test characteristics, outperforming models with CD data only. Our results suggest that fMRI and sMRI have a significant role in predicting treatment outcomes for DBT in patients with BPD. Future studies should examine whether our findings remain valid in a larger sample, set an additional focus on the interplay between hard-wired subcortical structures and higher cognitive cortical functions, and continue to apply the promising approach of random forest classification, or machine learning in general, to alternative treatments in order to develop predictive models with verified clinical relevance. Also, classifications might be further enhanced by including additional sets of variables such as resting state fMRI, diffusion tensor imaging, PET, (electro-) physiological, genomic, and baseline cortisol, or cortisol reactivity data.

ACKNOWLEDGEMENTS
The study was supported by a grant from the German Research Heidelberg. The funding sources had no involvement in study design; collection, analysis, and interpretation of data; writing of the report; decision to submit the article for publication.

CONFLICTS OF INTEREST
None declared.
FIGURE 5 Mean GMV in responders (shaded) and nonresponders (white) in regions selected for the final models. Error bars show standard error of the mean. l.Amy.GMV: gray matter volume of the left amygdala

FIGURE 4 Mean fMRI activation in responders (shaded) and nonresponders (white) in regions selected for the final models comprising fMRI data for emotional challenge (negative watch vs. neutral watch) and regulation (negative regulate vs. negative watch). Error bars show standard error of the mean. l.Amy.reap.ec: left amygdala in the reappraisal task for emotional challenge, l.Amy.reap.reg: left amygdala in the reappraisal task for regulation, r.Par.Hip.reap.reg: right parahippocampus in the reappraisal task for regulation

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.