Subphenotyping depression using machine learning and electronic health records

Abstract
Objective: To identify depression subphenotypes from Electronic Health Records (EHRs) using machine learning methods, and analyze their characteristics with respect to patient demographics, comorbidities, and medications.
Materials and Methods: Using EHRs from the INSIGHT Clinical Research Network (CRN) database, multiple machine learning (ML) algorithms were applied to analyze 11 275 patients with depression to discern depression subphenotypes with distinct characteristics.
Results: Using the computational approaches, we derived three depression subphenotypes: Phenotype_A (n = 2791; 31.35%) included patients who were the oldest (mean (SD) age, 72.55 (14.93) years), had the most comorbidities, and took the most medications. The most common comorbidities in this cluster of patients were hyperlipidemia, hypertension, and diabetes. Phenotype_B (mean (SD) age, 68.44 (19.09) years) was the largest cluster (n = 4687; 52.65%), and included patients suffering from moderate loss of body function. Asthma, fibromyalgia, and Chronic Pain and Fatigue (CPF) were common comorbidities in this subphenotype. Phenotype_C (n = 1452; 16.31%) included patients who were younger (mean (SD) age, 63.47 (18.81) years), had the fewest comorbidities, and took fewer medications. Anxiety and tobacco use were common comorbidities in this subphenotype.
Conclusion: Computationally deriving depression subtypes can provide meaningful insights and improve understanding of depression as a heterogeneous disorder. Further investigation is needed to assess the utility of these derived phenotypes to inform clinical trial design and interpretation in routine patient care.


1 | INTRODUCTION
Clinical depression (depressive disorder) is one of the most common psychiatric disorders, affecting about 14% of individuals worldwide. 1 The economic cost of depression is staggering, and depression is expected to be the second largest contributor to disease burden by 2020. 2 Clinical depression is a complex condition, and patients usually present with a complex etiology involving multiple risk factors such as recent stressful events. 3,4 In addition, clinical depression is usually associated with an elevated risk of other diseases, such as cardiac disease, and of mortality, including suicide. 5 Furthermore, depression is highly recurrent in general populations. 6 Therefore, the discovery of depression subphenotypes has the potential to improve the understanding of the underlying disease heterogeneity, which could benefit patients through earlier recognition and more targeted interventions and therapies. However, due to the complex etiology of depression, it is challenging to define depression subphenotypes based on clinical knowledge and empirical evidence.
Recently, the wider availability of Electronic Health Records (EHRs) has created a continuously growing repository of clinical data, which provides new opportunities for large-scale, low-cost population-based studies. 7 Multiple data-driven approaches for identifying disease phenotypes with EHRs have been explored. 8,9 From a data-driven perspective, discovering phenotypes using EHRs can be seen as a "data clustering" problem. [9][10][11] The disease manifestations of patients in the same cluster (ie, subphenotype) tend to be more similar. Comprehensive and longitudinal data captured in EHRs, such as patient demographics, diagnoses, medications, laboratory measurements, and procedures, provide an opportunity to construct an appropriate representation of patients. Integrating these rich data with existing clustering methods such as hierarchical agglomerative clustering makes it possible to obtain clusters of patients, wherein each cluster corresponds to a unique subphenotype. Statistical tests such as the Chi-square test 12 can then be performed on each cluster to find variables that discriminate between clusters and to interpret the computationally derived subphenotypes. The overall objective of this study is to define subphenotypes of depressive disorders and investigate their clinical heterogeneity using machine learning methods and EHR data recorded prior to patients' first case of depression. The ultimate goal is to assist clinicians and further improve the ability to anticipate disease onset, for example, to alert clinicians of the need for diagnostic work-up for frequently co-occurring disorders in those who fit the phenotype profile (Internists treating people

In this study, we also identified a "control" population (1:1 ratio) matched on age, gender, and comorbidity using propensity risk scoring. 14 To select the best control subject (non-depressed patient) for each case subject (depressed patient), we used Nearest Neighbor Matching and matched covariates using the propensity score distance measure. 15 The "control" group is used for model training to obtain the best classifier, which is then used to choose the important variables for clustering.
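The matching step described above can be illustrated with a minimal sketch. It assumes a logistic regression propensity model and greedy 1:1 matching without replacement; the text does not specify either choice, and the function name `match_controls` and the synthetic covariates are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def match_controls(X, is_case):
    """Greedy 1:1 nearest-neighbor matching on the propensity score.

    X: (n_patients, n_covariates) matching covariates.
    is_case: boolean array, True for depressed patients.
    Returns {case_row_index: matched_control_row_index}.
    """
    X = np.asarray(X, dtype=float)
    is_case = np.asarray(is_case, dtype=bool)

    # Propensity score: estimated probability of being a case given covariates.
    # (Assumption: logistic regression; the source does not name the model.)
    model = LogisticRegression(max_iter=1000)
    model.fit(X, is_case)
    score = model.predict_proba(X)[:, 1]

    available = list(np.flatnonzero(~is_case))
    matches = {}
    for i in np.flatnonzero(is_case):
        if not available:
            break  # more cases than controls: remaining cases stay unmatched
        # Closest remaining control by absolute propensity-score distance.
        dists = np.abs(score[available] - score[i])
        j = int(np.argmin(dists))
        matches[int(i)] = int(available.pop(j))  # without replacement
    return matches


# Synthetic illustration: age, gender, and one comorbidity flag.
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([
    rng.normal(65, 15, n),     # age
    rng.integers(0, 2, n),     # gender (0/1)
    rng.integers(0, 2, n),     # comorbidity flag (0/1)
])
is_case = np.arange(n) < 100   # first 100 rows are cases
matches = match_controls(X, is_case)
```

Greedy matching depends on the order in which cases are processed; optimal matching or a caliper on the score distance are common alternatives when match quality matters.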
The basic summary statistics of our dataset are shown in Table 1.
For this cohort, all demographic information (age, gender, race, and ethnicity) was extracted. Multiple comorbidities were also extracted based on the CMS Chronic Conditions Warehouse (CCW).
Medication data was mapped to the Anatomical Therapeutic Chemical (ATC) Classification System, 16 which classifies the active ingredients of drugs by taking into account their therapeutic, pharmacological and chemical properties. In the ATC system, drugs are classified into groups at five different levels. In this study, the fourth level was used to map medication information, which is usually more appropriate to identify pharmacological subgroups. 17 All demographic, comorbidity and medication information was used to train multiple machine learning classifiers. More than 500 features were used for training. We encoded medications and comorbidities as ever/never (1/0).
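As a minimal sketch of this encoding, the snippet below truncates full ATC codes to their fourth level (the first five characters of the seven-character code) and builds ever/never flags per patient. The helper names and the three toy patients are hypothetical; only the ATC code structure is taken from the classification itself.

```python
def atc_level4(code):
    """Truncate a full ATC code (e.g. 'N06AB04') to its 4th-level group ('N06AB')."""
    return code.strip().upper()[:5]


def encode_ever_never(records, vocabulary=None):
    """Encode each patient's features as ever (1) / never (0).

    records: dict mapping patient id -> iterable of feature names observed
             before the index time t.
    Returns (vocabulary, flags) where flags[pid] is a list of 0/1 values
    aligned with vocabulary.
    """
    if vocabulary is None:
        vocabulary = sorted({name for feats in records.values() for name in feats})
    flags = {}
    for pid, feats in records.items():
        seen = set(feats)
        flags[pid] = [1 if name in seen else 0 for name in vocabulary]
    return vocabulary, flags


# Toy prescriptions as full ATC codes (hypothetical patients).
raw = {
    "p1": ["N06AB04", "C10AA05", "N06AB06"],
    "p2": ["C10AA05"],
    "p3": [],
}
level4 = {pid: [atc_level4(c) for c in codes] for pid, codes in raw.items()}
features, flags = encode_ever_never(level4)
```

Two different drugs in the same fourth-level group (here `N06AB04` and `N06AB06`) collapse into a single binary feature, which is what keeps the feature space manageable.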

| Classification and clustering
In order to choose multiple variables that are useful for discovering the subphenotypes, the "current classification" experimental setting 18 was applied in this study. In particular, let t be the time of "first diagnosis" for depression either during an outpatient or inpatient encounter. In this setting, we considered all the data prior to time t and extracted patient demographics, comorbidities, and medications for training multiple machine learning models to classify depression. For each patient in the control group, the "time t" is the time of the last record of the patient in our dataset, which means we extracted all data for patients in the

Figure caption: The heatmap obtained from Clustergram based on the selected variables. The x and y axes represent the patients' unique IDs. The similarity among individual patients was computed using the Jaccard Index. The green rectangles represent the three depression subphenotypes. The smaller the distance between patients, the darker the color and the greater the degree of similarity among them. The clusters can be approximately outlined on the clustermap by observing the distribution of colors along the diagonal of the distance matrix.

The area under the receiver operating characteristic curve (AUC) was used to evaluate model performance. Features from the best-performing model were ranked by their variable importance measure and subsequently used as inputs for the hierarchical agglomerative clustering algorithm to identify subphenotypes.
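The pipeline in this section (train a classifier, evaluate it by AUC, keep the informative features, then cluster the depressed patients on those features) can be sketched as follows. This is an illustration on synthetic binary data, not the study's implementation: the gradient boosting classifier, the importance-greater-than-zero cutoff, SciPy's hierarchical clustering in place of Scikit-learn's, average linkage, and the fixed choice of three clusters are all assumptions made for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients, n_features = 400, 30

# Synthetic ever/never (0/1) features; only the first five drive the label.
X = rng.integers(0, 2, size=(n_patients, n_features))
logit = X[:, :5].sum(axis=1) - 2.5
y = (rng.random(n_patients) < 1 / (1 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Step 1: classify cases vs controls and evaluate with AUC on held-out data.
clf = GradientBoostingClassifier(random_state=0)
clf.fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Step 2: keep features the model actually used (importance > 0).
selected = np.flatnonzero(clf.feature_importances_ > 0)

# Step 3: cluster the cases only, using Jaccard distance on binary features.
cases = X[y == 1][:, selected].astype(bool)
dist = pdist(cases, metric="jaccard")      # condensed pairwise distances
Z = linkage(dist, method="average")        # assumption: average linkage
labels = fcluster(Z, t=3, criterion="maxclust")  # at most 3 clusters

print(f"AUC = {auc:.3f}, selected features = {selected.size}")
print("cluster sizes:", np.bincount(labels)[1:])
```

Jaccard distance ignores features that both patients lack, which suits sparse ever/never data where shared absences carry little information. In practice the number of clusters would be chosen with a validity index rather than fixed in advance.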
We used the hierarchical agglomerative clustering algorithm from the Scikit-learn software library. 22 The only continuous variable (age) was excluded during this process, and similarity between the clusters was computed using the Jaccard Index. Clustergram 24

3 | RESULTS

3.1 | The performance of classification and obtaining the depression subphenotypes
As shown in Table 2, GBDT achieved the highest performance for the current classification task in terms of AUC. By extracting feature importance scores from the GBDT model, we obtained multiple variables, including demographics, comorbidities and medications, with feature importance scores greater than zero. These variables were subsequently used as inputs for the clustering algorithm. Using the Jaccard Index and hierarchical clustering, we obtained three depression subphenotypes (Figure 2). The optimal number of clusters was determined using the McClain index. 25

3.2 | Association of comorbidities with the depression subphenotypes
In addition, to further investigate the characteristics of the three subphenotypes, we performed multiple statistical analyses on our results.

County, Sweden demonstrated that hypertension was probably underdiagnosed and ignored in individuals with psychiatric disorders. 27 Multiple studies have also suggested that the risk of developing depression was increased in individuals with diabetes 28 and that there was a significant association between depression and diabetes. 29 The connections between depression and hyperlipidemia have also been shown, 30 and a few studies have suggested that preexisting hyperlipidemia could be an independent predictor of new-onset depression. 31 In our study, Phenotype_C (n = 1452; 16.31%) was the youngest (mean (SD) age, 63.47 (18.81) years) and included the fewest patients, who had fewer comorbidities and prescription medications. Furthermore, the comorbidities of anxiety and tobacco use were common in this subphenotype. Patients in this subphenotype also showed mild loss of body function. Strong associations exist between depression and anxiety, and previous studies have suggested that more than 50% of patients with an anxiety disorder had depression. 32 An association between tobacco use and depression has also been shown by multiple previous studies. [33][34][35]

[45][46][47] In addition, off-label use of antidepressants is common in treating sleep problems, eating disorders, smoking cessation, and managing chronic pain even when depression is not involved. 48 By restricting the study cohort to depressed patients treated via pharmacotherapy, we might be missing patients whose prescription data is not captured in the INSIGHT CRN. It is possible that many of these patients received an antidepressant from a private provider outside the INSIGHT CRN network or received alternative therapies such as psychotherapy or cognitive behavioral therapy (CBT) to treat their depressive symptoms.

Unfortunately, our dataset is unable to capture these treatment modalities. It is also possible that patients initiated alternative treatments like psychotherapy and CBT during the 0 to 180 day time window but later transitioned into treatment via pharmacotherapy (eg, antidepressant). With careful consideration given to limitations including a dramatically smaller cohort, we selected a highly sensitive case definition that minimizes the inclusion of false positives and ensures a highly chronic dual diagnosis sample. Second, we only considered patient demographics, diagnoses, and prescription medication data extracted from the EHR for deriving the subphenotypes. Prior work by others 49 and our team 50 has demonstrated that for mood disorders, processing of unstructured clinical text via natural language processing is critical to detect symptoms, diagnosis and treatment. Third, we did not consider temporal information (eg, age of disease onset) for our classification and clustering tasks. Temporal data may correspond to a patient's current therapy, their overall health status, or any other discrete state, and the transition time information represents the duration of each of those states. In future work, we plan to leverage recent research in temporal pattern mining for clustering analysis. 51,52 Finally, with an emphasis on algorithm interpretation, portability and generalizability, we investigated traditional machine learning algorithms in this study. As we have done in other studies, 9,53,54 future work will explore advanced deep learning methods for depression subphenotyping.

| CONCLUSION
Using routinely collected longitudinal EHRs and ML algorithms, we computationally derived depression subphenotypes that can potentially guide improved diagnosis and treatment of clinical depression.
The derived subphenotypes had statistically significant differences with respect to patient demographics, comorbidities, and treatment, suggesting that depression is a heterogeneous disorder with multiple phenotypes.