How well do clinical and demographic characteristics predict Patient Health Questionnaire‐9 scores among patients with treatment‐resistant major depressive disorder in a real‐world setting?

Abstract
Objectives: To create and validate a model to predict depression symptom severity among patients with treatment‐resistant depression (TRD) using commonly recorded variables within medical claims databases.
Methods: Adults with TRD (here defined as > 2 antidepressant treatments in an episode, suggestive of nonresponse) and ≥ 1 Patient Health Questionnaire (PHQ)‐9 record on or after the index TRD date were identified (2013–2018) in Decision Resources Group's Real World Data Repository, which links an electronic health record database to a medical claims database. A total of 116 clinical/demographic variables were utilized as predictors of the study outcome of depression symptom severity, which was measured by PHQ‐9 total score category (score: 0–9 = none to mild, 10–14 = moderate, 15–27 = moderately severe to severe). A random forest approach was applied to develop and validate the predictive model.
Results: Among 5,356 PHQ‐9 scores in the study population, the mean (standard deviation) PHQ‐9 score was 10.1 (7.2). The model yielded an accuracy of 62.7%. For each predicted depression symptom severity category, the mean observed scores (8.0, 12.2, and 16.2) fell within the appropriate range.
Conclusions: While there is room for improvement in its accuracy, the use of a machine learning tool that predicts depression symptom severity of patients with TRD can potentially have wide population‐level applications. Healthcare systems and payers can build upon this groundwork and use the variables identified and the predictive modeling approach to create an algorithm specific to their population.


| INTRODUCTION
Major depressive disorder (MDD) is a prevalent chronic mood disorder that affects more than 300 million people globally (World Health Organization, 2017). In the United States, approximately 7.1% (17.3 million) of all adults had at least one major depressive episode in 2017 (National Institute of Mental Health, 2019). The goal of MDD treatment is to achieve complete remission (i.e., full return to baseline functioning with minimal to no residual symptoms; Ballenger, 1999; Trivedi & Daly, 2008; Work Group on Major Depressive Disorder et al., 2010). Pharmacologic treatment with oral antidepressants (ADs) is recommended for patients presenting with mild to moderate symptom severity (Moller et al., 2012; Work Group on Major Depressive Disorder et al., 2010); however, findings from the landmark Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study on the effectiveness of treatment strategies for depression showed that only approximately one-third (36.8%) of adults with MDD achieved full remission with their first step of AD treatment, and subsequent lines of treatment resulted in substantial decreases in remission rates (approximately 31% and 14% with second and third steps of AD treatment, respectively; Rush et al., 2006).

Treatment-resistant depression (TRD) is commonly defined as present when a patient with MDD does not reach response or remission after two or more different AD treatments of adequate dose and duration in the current depressive episode (Gaynes et al., 2018). Importantly, over the course of their illness, patients with TRD may experience a wide range of depression symptom severities that span from minimal/no symptoms (i.e., remission) to severe symptoms (American Psychiatric Association [APA], 2013; Kroenke & Spitzer, 2002; Kroenke et al., 2001). Assessment of depression symptom severity may in turn facilitate assessment of critical outcomes for healthcare systems.
Depression symptoms can be assessed using clinician-administered instruments and/or patient-rated instruments, such as the Patient Health Questionnaire (PHQ)-9. The PHQ-9 is a self-reported instrument developed to capture the frequency of nine depression-related symptoms during the previous two weeks (Kroenke & Spitzer, 2002). Recent guidance by the US Food and Drug Administration has aimed to enhance incorporation of the patient perspective in medical product development and regulatory decision-making [...]

Unfortunately, neither clinician-administered nor patient-rated instruments are generally available in standard medical claims databases, nor do these databases include consistent or validated documentation of depression symptom severity; thus, it is difficult to assess the influence of depression symptom severity on a variety of outcomes, including treatment course and response, course of the disease, and other health outcomes. However, some instruments, including the PHQ-9, are sometimes administered by healthcare providers to patients and recorded in an electronic health record (EHR), providing an opportunity to connect a patient's PHQ-9 score to their health data from a medical claims database if the two databases can be linked. An effective model that accurately predicts depression symptom severity from commonly recorded variables within medical claims databases could significantly improve understanding of the impacts of the severity of depression symptoms, including its impact on treatment choices made by physicians.
The aim of the current study was to create and validate a model to predict depression symptom severity among patients with TRD using PHQ-9 scores available within an EHR database linked to a medical claims database.

[...] TRD Episode Criteria). Additionally, patients were required to have at least one PHQ-9 measurement in the EHR database on or following the index TRD date. Patients with a diagnosis of specific psychiatric disorders (i.e., autism, bipolar disorder, schizophrenia, and other nonmood psychotic disorders) and/or neurologic disorders (i.e., dementia, intellectual disability, traumatic brain injury, Parkinson's disease) during the study period were also excluded.

| MDD and TRD episode criteria
As MDD is a chronic, cyclical disorder consisting of distinct periods of episodes and remission, the following criteria were applied (Figure S1) with the aims of isolating specific episodes of MDD within each patient's longitudinal journey and identifying the incidence of treatment resistance within an episode of MDD.

| MDD episode
An MDD episode was defined as a time period that included one or more diagnosis codes or treatments for MDD following the first diagnosis code for MDD. Treatments for MDD included oral ADs of adequate dose and duration (≥42 days' supply of each AD at the [...] preceded by at least 180 days of a clean period (i.e., without a diagnosis or treatment for MDD), and ended on the date of the last MDD diagnosis code or the end of the days' supply of an adequate AD medication, whichever came later, followed by at least 180 days of a clean period. The clean period was defined as an absence of MDD diagnosis codes or treatments used for MDD, as a means to determine that the patient was in remission of their MDD during this period. Additionally, this allowed for patients to have more than one MDD episode during the study period.
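The clean-period rule above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the study's algorithm: each event is a hypothetical `(start, end)` pair (a diagnosis code has `start == end`; an AD supply ends when its days' supply runs out), the gap is measured from the end of the running episode to the start of the next event, and the additional requirement that an episode be followed by an observable 180-day clean period is not checked. The function name `split_into_episodes` is invented for this example.

```python
from datetime import date

CLEAN_PERIOD_DAYS = 180  # minimum gap, from the text, that separates two MDD episodes


def split_into_episodes(events):
    """Group (start, end) date pairs into episodes separated by >= 180 clean days.

    Assumption: a diagnosis code is passed as (d, d); an AD fill is passed as
    (fill_date, fill_date + days_supply). Returns a list of (episode_start, episode_end).
    """
    episodes = []
    for start, end in sorted(events):
        if episodes and (start - episodes[-1][1]).days < CLEAN_PERIOD_DAYS:
            # Gap shorter than the clean period: extend the current episode.
            prev_start, prev_end = episodes[-1]
            episodes[-1] = (prev_start, max(prev_end, end))
        else:
            # No prior episode, or a full clean period elapsed: start a new episode.
            episodes.append((start, end))
    return episodes


if __name__ == "__main__":
    # Made-up dates for illustration only.
    demo = [
        (date(2014, 1, 1), date(2014, 1, 1)),   # diagnosis code
        (date(2014, 2, 1), date(2014, 3, 15)),  # AD supply
        (date(2015, 1, 1), date(2015, 1, 1)),   # diagnosis code after a long gap
    ]
    for episode in split_into_episodes(demo):
        print(episode)
```

Because episodes are built per patient, a patient can contribute more than one episode, which matches the text's note that multiple MDD episodes were allowed during the study period.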

| TRD episode
Lines of treatment were evaluated during each MDD episode. The start date of the third line of AD treatment was defined as the index TRD date, based on the assumption that the two previous lines of oral AD treatments of adequate dose and duration had failed to produce a response or remission (see Table S1 for a list of ADs used in determining line of treatment). An AD regimen was considered as failed when the initial AD regimen was augmented with another AD or switched to a new regimen completely. All ADs of adequate dose and duration filled within 30 days of the initial AD claim were considered part of the same regimen.

| Variables included in the predictive model of PHQ-9 scores
A total of 116 clinical and demographic variables typically available in medical claims databases were utilized as predictors associated with depression symptom severity. Variables were identified from (1) a literature search and review of publications, including studies related to causation or association of MDD or depression symptom severity (Amos et al., 2017; APA, 2013; Briggs et al., 2018; Carter et al., 2012; Chin et al., 2016; Gaynes, 2009; Gross et al., 2015; Hinz et al., 2016; Iosifescu et al., 2003; Katzelnick et al., 2011; Mulvahill et al., 2017; Raval et al., 2010; Rossom et al., 2016; Shittu et al., 2014; Wada et al., 2015; Waxmonsky et al., 2012), and (2) discussions with clinicians with expertise in treating patients with TRD. The potential predictors included demographic characteristics, treatment-specific variables (e.g., site of care, nonpharmacologic treatment, number of prior MDD treatments, specific medications taken for MDD treatment), psychiatric comorbidities, medical comorbidities, measures of healthcare resource utilization, and others (see Table S2 for a full list of variables).

| Study outcome
The study outcome of depression symptom severity was measured by PHQ-9 total score category. The PHQ-9 total score ranges from 0 to 27 and is typically grouped into six distinct categories ranging from none to severe (Kroenke et al., 2001). For the purpose of this study, the six categories were collapsed into three clinically meaningful categories for the predictive model: none to mild (PHQ-9 scores, 0-9), moderate (10-14), and moderately severe to severe (15-27). All PHQ-9 scores recorded on or after the TRD index date in the EHR database were considered for inclusion in the study; each score was treated as a unique outcome, as certain variables may have changed over time and differed between different TRD episodes (e.g., weight, number of previous MDD treatments, comorbidities).
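The three-category outcome described above amounts to a simple threshold mapping. The following is a minimal sketch; the function name `phq9_category` and the choice to raise an error on out-of-range scores are assumptions made for this example rather than details from the study.

```python
def phq9_category(score: int) -> str:
    """Map a PHQ-9 total score (0-27) to the study's three collapsed categories."""
    if not 0 <= score <= 27:
        raise ValueError(f"PHQ-9 total score must be between 0 and 27, got {score}")
    if score <= 9:
        return "none to mild"
    if score <= 14:
        return "moderate"
    return "moderately severe to severe"


if __name__ == "__main__":
    # Boundary scores for each category.
    for s in (0, 9, 10, 14, 15, 27):
        print(s, "->", phq9_category(s))
```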

| Statistical modeling methodology
A machine learning tool was used because it can test a large number of predictors and identify patterns in the large and heterogeneous dataset used in this study to predict depression symptom severity.
A random forest approach was applied to leverage its high prediction accuracy with large numbers of predictors, which stems from the feature selection embedded in the model generation process. The data were randomly divided into training (70%) and validation (30%) datasets, and the random forest classifier, a machine learning technique that enables a large number of weak or weakly correlated classifiers to form a strong classifier (Pal, 2017), was run using the training dataset. After the classifier was trained, the resulting model was applied to the validation dataset in order to provide an unbiased estimate of the model fit. A random forest is a meta-estimator that fits multiple decision tree classifiers on various subsamples of the dataset and uses averaging to improve the predictive accuracy and limit overfitting. The classifier evolved from, and consists of, many decision trees. Each uncorrelated decision tree selects a classification of the outcome, and the final choice is based on the aggregated "votes" for each class across the decision trees; the most common classification from the individual trees becomes the result. The input of each tree is data sampled with replacement from the original dataset (in this case, the DRG Real World Data Repository). In addition, a subset of features is randomly selected from the candidate features at each node to grow the tree. Random forests tend to have high prediction accuracy and can handle large numbers of features due to the embedded feature selection in the model generation process (Pal, 2017).
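The mechanics described above (bootstrap sampling, random feature subsets, and majority voting) can be illustrated with a toy example. The sketch below is not the study's classifier: it uses one-split trees ("stumps") instead of full decision trees, two made-up classes instead of three severity categories, and synthetic data from a hypothetical `make_toy_rows` helper. Because each stump has a single node, drawing the feature subset once per tree is equivalent to drawing it at each node.

```python
import random
from collections import Counter


def make_toy_rows(n=40):
    """Synthetic (features, label) rows; NOT study data.

    Features 0 and 1 both increase with i and separate the classes; feature 2 is noise.
    """
    return [((i, 2 * i + 1, i % 3), "low" if i < n // 2 else "high") for i in range(n)]


def fit_stump(rows, feature_ids):
    """Fit a one-split tree: pick the (feature, threshold) with the fewest training errors."""
    best = None
    for f in feature_ids:
        for threshold in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= threshold]
            right = [y for x, y in rows if x[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = sum(y != left_label for y in left) + sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    _, f, threshold, left_label, right_label = best
    return lambda x: left_label if x[f] <= threshold else right_label


def fit_forest(rows, n_trees=25, n_features=2, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    feature_pool = list(range(len(rows[0][0])))
    trees = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sampling with replacement
        trees.append(fit_stump(sample, rng.sample(feature_pool, n_features)))
    return trees


def predict(trees, x):
    """Majority vote: the most common class among the individual trees wins."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]


def accuracy(trees, rows):
    return sum(predict(trees, x) == y for x, y in rows) / len(rows)


if __name__ == "__main__":
    rows = make_toy_rows()
    random.Random(42).shuffle(rows)
    cut = len(rows) * 7 // 10  # 70% training, 30% validation
    train, validation = rows[:cut], rows[cut:]
    forest = fit_forest(train)
    print(f"validation accuracy: {accuracy(forest, validation):.2f}")
```

Scoring the forest only on rows held out from training mirrors the role of the 30% validation dataset in the text.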
The random forest approach also identifies the rank of importance of predictors by applying a score called the variable importance in projection (VIP; Breiman, 2001), which can be used to identify the most important or influential predictors (the score ranges from zero to one, with a higher score indicating greater importance or influence). While the predictors are ranked, no information on the directionality of the relationship with the outcomes is given by this methodology. Therefore, this study ascertained the direction of effect for selected important variables by calculating the mean value of each by the observed depression symptom severity category. This was done on the entire dataset in a descriptive manner.
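The descriptive direction-of-effect check described above reduces to a grouped mean. The following is a minimal sketch; the record layout, the field names `category` and `days_from_index`, and all values are invented for illustration and are not study data.

```python
from collections import defaultdict
from statistics import mean


def mean_by_category(records, predictor):
    """Return the mean of one predictor within each observed severity category."""
    groups = defaultdict(list)
    for record in records:
        groups[record["category"]].append(record[predictor])
    return {category: mean(values) for category, values in groups.items()}


if __name__ == "__main__":
    # Hypothetical records; the numbers are made up for this example.
    records = [
        {"category": "none to mild", "days_from_index": 700},
        {"category": "none to mild", "days_from_index": 500},
        {"category": "moderate", "days_from_index": 450},
        {"category": "moderately severe to severe", "days_from_index": 300},
        {"category": "moderately severe to severe", "days_from_index": 400},
    ]
    for category, value in mean_by_category(records, "days_from_index").items():
        print(category, value)
```

Comparing the group means across categories indicates the direction of an association, but as a descriptive summary it does not adjust for the other predictors.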
Upon completion of the PHQ-9 classifier, the predicted scores were tested for accuracy against observed PHQ-9 scores in two ways. First, the overall and individual concordance between the predicted and observed depression symptom severity categories was calculated. Second, in order to verify the use of the three PHQ-9 depression symptom severity categories (i.e., none to mild, moderate, moderately severe to severe), the mean and median of the observed PHQ-9 scores within each of the three categories were computed to confirm that the mean and median scores fell within the range for the predicted depression symptom severity category. For example, the mean observed score of a patient predicted to be in the none to mild category should fall within the range (score 0-9) of that category.
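Both accuracy checks described above can be sketched as follows. This is an illustration under assumptions: the function names are invented, "individual concordance" is taken to mean the share of each observed category that was predicted correctly, and the demo scores and predictions are made up.

```python
from collections import defaultdict
from statistics import mean, median


def category(score):
    """Collapse a PHQ-9 total score into the study's three categories."""
    if score <= 9:
        return "none to mild"
    if score <= 14:
        return "moderate"
    return "moderately severe to severe"


def concordance(observed_scores, predicted_categories):
    """Return overall concordance and concordance within each observed category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for score, predicted in zip(observed_scores, predicted_categories):
        observed = category(score)
        totals[observed] += 1
        hits[observed] += observed == predicted
    overall = sum(hits.values()) / sum(totals.values())
    return overall, {c: hits[c] / totals[c] for c in totals}


def observed_summary_by_prediction(observed_scores, predicted_categories):
    """Return (mean, median) of the observed scores within each predicted category."""
    groups = defaultdict(list)
    for score, predicted in zip(observed_scores, predicted_categories):
        groups[predicted].append(score)
    return {c: (mean(v), median(v)) for c, v in groups.items()}


if __name__ == "__main__":
    # Made-up observed scores and model predictions for illustration.
    scores = [3, 8, 12, 16, 20, 11]
    predicted = [
        "none to mild", "none to mild", "none to mild",
        "moderately severe to severe", "moderate", "moderate",
    ]
    print(concordance(scores, predicted))
    print(observed_summary_by_prediction(scores, predicted))
```

In the second check, a predicted category passes when the mean and median of its observed scores fall inside that category's score range.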

| Sample cohort characteristics
In total, 2,077 patients with TRD and 5,356 associated PHQ-9 measurements were included in the study (Table 1). A total of 116 predictors were included in the model (full list in Table S2) and select variables are reported in Tables 2-4. The mean age of patients at the time of PHQ-9 measurement was 51.2 years, 76.9% were female, 52.9% were from the Midwest, and 62.5% had commercial health insurance (Table 2). Anxiety was the most common (41.7%) psychiatric comorbidity within the 180 days prior to the PHQ-9 measurement, and hypertension was the most common (29.5%) medical comorbidity. Overall, the majority (58.9%) of samples were from patients with one to five psychiatric and/or medical comorbidities.
Analysis of healthcare resource utilization in the 90 days preceding PHQ-9 measurement showed that 20.0% of samples were from patients who had a record for psychotherapy, 9.5% from patients who had an all-cause inpatient visit, and 12.7% from patients who had an all-cause emergency room visit (Table 3). The majority (74.4%) of samples were from patients who had used an AD in the 90 days preceding PHQ-9 measurement, and a greater proportion (79.1%) of samples were from patients who had used any of a select group of mental health-related prescriptions (see Table 3 for list).
The mean (standard deviation) PHQ-9 score among all samples in the cohort was 10.1 (7.2), indicating moderate depression symptom severity (Table 4). By distribution, it was observed that approximately half (50.1%) of the scores fell in the none to mild category.

| Outcomes of the machine learning predictive model
After training the random forest classifier with the 116 predictors and applying it to the validation dataset, the model yielded predicted PHQ-9 depression symptom severity categories that corresponded to the correct observed PHQ-9 categories for 62.7% of patients (Figure 1a). The highest level of concordance between the predicted and observed depression symptom severity categories was found in the none to mild category; 87.9% of those who had observed scores within the none to mild category were accurately predicted to be in the none to mild category. This varied across the other two categories, with the next-best prediction occurring in the moderately severe to severe category, where 51.2% were accurately predicted. The lowest prediction accuracy occurred in the moderate category, with 20.9% accurately predicted. Furthermore, the mean and median observed PHQ-9 scores fell within the appropriate range of each predicted depression symptom severity category (mean observed PHQ-9 score for the predicted none to mild category, 8.0; moderate category, 12.2; moderately severe to severe category, 16.2; Figure 1b).

| Important predictors
Out of the 116 predictors included in the random forest classification model, 70 had a VIP score of at least 0.6. Six predictors had a VIP score of at least 0.8 and thus were considered to be the most important predictors in this study. In order of importance, these six predictors were days from index TRD date to PHQ-9 measurement [...]

In order to assess the direction of each effect, the mean values of these six predictors were examined by observed depression symptom severity in the 5,356 PHQ-9 scores in the sample cohort (Figure 2). In general, greater depression symptom severity was associated with a shorter gap from the index TRD date or the last MDD diagnosis to the PHQ-9 measurement, as well as a higher mean number of serotonin-norepinephrine reuptake inhibitor (SNRI) medications in the last 90 days for the moderately severe to severe and moderate categories compared to the none to mild category (1.1 versus 1.1 versus 0.9, respectively).

| DISCUSSION
In a sample of 5,356 PHQ-9 scores corresponding to 2,077 patients with TRD, this study used a machine learning predictive model to show that commonly recorded variables within a medical claims database can predict which of three depression symptom severity categories a score falls into, with an overall accuracy of 62.7%. While the accuracy of the model varied between the three categories, the observed mean score of patients in each predicted depression category was still within the threshold range of that category. While these results are encouraging, there remains a considerable likelihood of false positives, which is a concern with this model. It is also possible that, with the large number of possible predictors used, we overfit our training dataset and that this contributed to the results seen with the validation dataset. We hope that these concerns can be alleviated with the help of more advanced machine learning techniques that can improve the accuracy of the model while relying on fewer predictors.

Of the 116 clinical and demographic variables available in medical claims databases, six were found to be the most important predictors in this study. Overall, the findings suggest that the model may be useful to identify important variables researchers should consider when evaluating risk and outcomes across a population with TRD, such as the time from the outcome of interest to the last MDD diagnosis code or index date of TRD.
To our knowledge, this is the first study that attempts to predict depression symptom severity on the PHQ-9 instrument by using clinical and demographic characteristics among adults with TRD. However, machine learning techniques have been used to predict depression symptom severity in other contexts. One such study validated a previously generated model by using data prospectively collected from individuals with lifetime MDD in two US National Comorbidity Surveys (Kessler et al., 2016; van Loo et al., 2014). Information gathered from the fully structured interview in the first survey was used to predict, among other outcomes, depression symptom severity in the second survey. Severity was based on patient hospitalization for depression, current disability due to depression, and history of suicide attempt. Interestingly, prediction using the machine learning model was better than when using a traditional [...]

[...] et al., 2005, 2018). MDD was more prevalent among women and was associated with other psychiatric disorders, especially generalized anxiety disorder.

Table 4 (rows recovered from the extracted text)
PHQ-9 score, mean (SD): 10.1 (7.2)
PHQ-9 score category, n (%)
  None to mild (0-9): 2,686 (50.1)
  Moderate (10-14): 1,120 (20.9)
  Moderately severe to severe (15-27): 1,550 (28.9)
Days from index TRD, mean (SD): 571.4 (443.3)
Year of PHQ-9 measurement: [...]
Abbreviations: PHQ-9, Patient Health Questionnaire-9; SD, standard deviation; TRD, treatment-resistant depression.
a Percentages may not add up to 100.0% due to rounding.
Implementation of a predictive model that estimates clinically relevant rating scale scores or categories (e.g., PHQ-9 score category) can have multiple applications, such as (1) to allow population health decision-makers with access to claims data that lack PHQ-9 scores to estimate depression severity among the TRD population they manage and to develop appropriate policies to aid this population, or (2) [...]

Other predictive models in depression have been developed to identify predictors of remission and response to therapy. In one study, predictors of remission were identified based on placebo-treated patients with MDD in double-blind randomized clinical trials (Nelson et al., 2012). Four predictors were identified: less severe depression symptoms, younger age, less anxiety, and shorter current MDD episode duration. Interestingly, anxiety was not identified as a predictor of depression symptom severity in the current study, notwithstanding the high proportion of the study population (41.7%) with comorbid anxiety disorder. In another study, predictors of response and remission among inpatients with depression were identified (Riedel et al., 2011). Common predictors for both outcomes were fewer previous hospitalizations and episode duration less than 24 months. Of note, the presence of suicidality was found to be a predictor of response. While this seems counterintuitive, the authors speculated that suicidality served [...]

In conclusion, while acknowledging the substantial room for improvement in accuracy, the use of a machine learning tool that predicts depression symptom severity of patients with TRD with the help of commonly available variables in a medical claims database can potentially have wide population-level applications. Healthcare systems and payers can build upon this groundwork and use the variables identified and the predictive modeling approach to create an algorithm specific to their population, ultimately leading to better care and improved health outcomes for this vulnerable population.

ACKNOWLEDGMENTS
This research was supported by Janssen Scientific Affairs, LLC.
Writing and editorial support was provided by Jessica Kim, PharmD, and Courtney St. Amour, PhD, of MedErgy, and was funded by Janssen Scientific Affairs, LLC.

ETHICAL STATEMENT
All patient data contained within the DRG database are de-identified and in compliance with the Health Insurance Portability and Accountability Act. Thus, no Institutional Review Board approval was required for this study.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from Decision Resources Group (DRG), but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. More information about accessing the DRG Real World Data Repository can be found at: https://decisionresourcesgroup.com/solutions/realworld-data/.