A systematic review of validated methods for identifying depression using administrative data
L. Townsend, Rutgers University School of Social Work, 536 George Street, Room 103, New Brunswick, NJ 08901. E-mail: email@example.com
Administrative and claims data (hereafter administrative data) represent an important resource for safety surveillance and research on the use and effectiveness of medical products. They offer opportunities to conduct safety surveillance or study drug use, other treatments, and selected outcomes for large and diverse patient populations across a broad range of usual care settings. However, because these datasets are not designed for either surveillance or research purposes, their use is subject to significant challenges and limitations. One key challenge involves assessing the validity of algorithms for identifying various outcomes and coexisting conditions. This was recognized by the Food and Drug Administration as an important need for its Mini-Sentinel pilot program, which is currently focused on conducting safety surveillance to refine safety signals that emerge for its regulated medical products. An important step is to assess the validity of algorithms for identifying health outcomes of interest (HOI) in administrative data.
Depression represents an important HOI. It is a leading source of disability, accounting for approximately one third of healthy life years lost to disability for people aged 15 years and older. Given the significant disease burden caused by depression, large-scale efforts to identify the disorder and monitor its outcomes are of paramount importance to public health. In the present report, we review studies that have sought to validate algorithms for identifying cases of depression using administrative data. This report summarizes the process and findings of the depression algorithm review. The full report is available on the Mini-Sentinel Web site at http://mini-sentinel.org/foundational_activities/related_projects/default.aspx.
Data sources were limited to administrative datasets from the USA or Canada. The general search strategy was developed based on prior work by the Observational Medical Outcomes Partnership (OMOP) and its contractors and modified slightly for these reports. The modified OMOP search strategy was combined with PubMed terms representing the HOI. Medical subject heading (MeSH) terms were used as HOI search terms. Details of the methods for these systematic reviews can be found in the accompanying article by Carnahan and Moores. Briefly, the base PubMed search was combined with the following terms to represent depression: depression, major depressive disorder, dysthymic disorder, or seasonal affective disorder. The workgroup also searched the database of the Iowa Drug Information Service (IDIS) using a similar search strategy to identify other relevant articles that were not found in the PubMed search. To identify depression validation studies that were unpublished, in prepublication, or that were not identified by the search strategies, Mini-Sentinel investigators were requested to provide information on any published or unpublished administrative data studies that validated an algorithm for depression. Results were aggregated into two sets of files, one containing the abstracts for review and the other for documenting abstract review results. The PubMed search was conducted on 14 May 2010 and the IDIS searches on 11 June 2010.
Each abstract was reviewed independently by the first and second authors (L.T. and J.W.) to determine whether the full-text article should be reviewed. The following abstract exclusion criteria were applied: (i) The abstract did not mention depression or dysthymia; (ii) the study did not use an administrative database (eligible sources included insurance claims databases and other secondary databases that identify health outcomes using billing codes); and (iii) the data source was not from the USA or Canada. Exclusion criteria were documented sequentially (i.e., if one exclusion criterion was met, then the other criteria were not documented). If the reviewers disagreed on whether the full text should be reviewed, then it was selected for review. Interrater agreement on whether to include or exclude an abstract was calculated using Cohen's kappa.
Full-text articles were reviewed independently by the first and second authors (L.T. and J.W.), with the goal of identifying validation studies. The full-text review included examination of articles’ reference sections as an additional means of capturing relevant citations. Citations from the references were selected for full-text review if they were cited as a source for a depression algorithm or were otherwise deemed likely to be relevant. Full-text studies were excluded from the final evidence table if they met one or more of the following criteria: (i) The article contained a poorly described or difficult-to-operationalize depression algorithm defined by the absence of Diagnostic and Statistical Manual of Mental Disorders (DSM) depression diagnosis codes (296.2, 296.3, 300.4, or 311) or the International Classification of Diseases (ICD) diagnosis codes for depression (296.2, 296.3, 300.4, 311, 298.0, or 309.1), and (ii) the article provided no validation measure of depression or did not report validity statistics. Full-text review exclusion criteria were applied sequentially. If there was disagreement on whether a study should be included, the two reviewers (L.T. and J.W.) attempted to reach consensus on inclusion by discussion. If they could not agree, an additional investigator (M.O.) was consulted to make the final decision.
All studies that survived the exclusion screen were included in the final evidence table. A single investigator abstracted each study for the table. A second investigator confirmed the accuracy of the abstracted data. A clinician or topic expert was consulted to review the results of the evidence table and to evaluate how the findings compared with the findings of diagnostic methods used in clinical practice. This assessment helped to determine whether the algorithms excluded any depression diagnosis codes commonly used in clinical practice and the appropriateness of the validation measures in relation to clinical diagnostic criteria.
The total number of unique citations from the combined searches was 1731. A second PubMed search incorporating additional database names identified 30 citations.
Of the 1761 abstracts reviewed, 286 were selected for full-text review; 219 were excluded because they did not study depression, 994 were excluded because they were not administrative database studies, and 262 were excluded because the data source was not from the USA or Canada. Cohen's kappa for reviewer agreement regarding abstract inclusion or exclusion was 0.77.
Of the 286 full-text articles reviewed, 10 were included in the final evidence tables; 142 were excluded because the identification algorithm was poorly defined (e.g., did not use the specified DSM or ICD diagnosis codes for depression), and 134 were excluded because they included no validation of depression or validity statistics. Reviewers identified 36 new citations from full-text article references. Of these, one was included in the final report; 9 did not study depression, 12 were not database studies, 5 were excluded because the depression algorithm was poorly defined, and 9 were excluded because they included no validation of depression or validity statistics. The final evidence tables collectively include 11 articles (see Tables 1 and 2). Cohen's kappa for reviewer agreement regarding inclusion or exclusion of full-text articles was 0.95.
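The screening flow described above can be checked with simple arithmetic. The following is a minimal sketch that uses only the counts reported in the text; the variable names and dictionary labels are illustrative and are not taken from the original report.

```python
# Counts as reported in the text; names and labels are illustrative.
initial_citations = 1731
second_search_citations = 30
abstracts_reviewed = initial_citations + second_search_citations

abstract_exclusions = {
    "did not study depression": 219,
    "not an administrative database study": 994,
    "data source not from the USA or Canada": 262,
}
full_text_reviewed = abstracts_reviewed - sum(abstract_exclusions.values())

full_text_exclusions = {
    "algorithm poorly defined": 142,
    "no validation or validity statistics": 134,
}
included_from_search = full_text_reviewed - sum(full_text_exclusions.values())

reference_citations = 36
reference_exclusions = {
    "did not study depression": 9,
    "not a database study": 12,
    "algorithm poorly defined": 5,
    "no validation or validity statistics": 9,
}
included_from_references = reference_citations - sum(reference_exclusions.values())

total_included = included_from_search + included_from_references

print(abstracts_reviewed, full_text_reviewed, included_from_search,
      included_from_references, total_included)
```

Each stage reconciles with the next: the abstracts not excluded equal the number of full-text articles reviewed, and the two inclusion routes sum to the 11 articles in the evidence tables.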
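Table 1. Depression Algorithm Definitions/Validation Characteristics for Studies Using Standardized Scales or Structured Diagnostic Interviews as a Validation Standard
|Study||Population and period||Validation standard||Algorithm||Validation measure||PPV||Sensitivity|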
|Kahn et al., 2008||Adult (≥18 years) Medicaid behavioral health managed care organization enrollees (n=249) with a diagnosed mental disorder and diabetes, 63.1% female, mean age 52.2 years, 2006||PHQ-9 assessed depression by mail survey||Depression ICD-9 code: 296.2,296.3, 298.0, 300.4, 309.0, 309.1, 309.28, or 311* in the encounter data||Operating characteristics of depression diagnosis code in encounter data (screen) in relation to PHQ-9 score ≥ 10||PPV=66.4% (71 of 107)||Sens=51.1% (71 of 139)|
|Katon et al., 2006||Adolescent (11–17 years) primary care outpatients (n=769), with history of asthma treatment, excluded patients treated for bipolar disorder or schizophrenia; depressed subsample 64% female, 2004–2005||C-DISC assessed DSM-IV major depression or dysthymia; C-DISC assessed DSM-IV panic, generalized anxiety, social phobia, agoraphobia, or separation anxiety disorder||≥1 ICD-9: 296.2,296.3, 298.0, 300.4, 309.0, 309.1, 309.28, or 311* depressive disorder diagnosis in utilization record during 12 months before C-DISC||Proportion of patients with utilization claim for depression who met C-DISC criteria for a depressive disorder (PPV1); Proportion of patients with utilization claim for depression who met C-DISC depressive or anxiety disorder criteria (PPV2)||PPV1=31.5% (29 of 92) PPV2= 39.4% (41 of 104)||Sens1=48.3% (29 of 60) Sens2=36.6% (41 of 112)|
|Katon et al., 2004||Adult HMO patients (n=4385) with treatment of diabetes mellitus and major depression as assessed by PHQ-9, 60% female, mean age 59 years, excluded patients without diabetes, with cognitive impairment, too ill to participate, or language/hearing problem.||PHQ-9 positive screen for current major depression.||In 12 months before assessment: 1. ICD-9 code for ≥1 depression: 296.2,296.3, 298.0, 300.4, 309.0, 309.1, 309.28, or 311* AND 2. Antidepressant (AD) prescription||Proportion of patients with PHQ-9 major depression disorder who were detected by ICD-9 code (S1) or AD prescription (S2).||NA||Sens (S1) = 36.3% (190 of 524) Sens (S2) = 42.9% (225 of 524)|
|McCusker et al., 2008||Emergently admitted medical inpatients ages ≥65 years (n=185) from 2 university-affiliated acute care hospitals in Montreal, over sampled for depression, excluding patients with cognitive impairment.||DIS assessed major depressive disorder of > 6 months or < 6 months duration.||During 12 months after index inpatient admission, 3 Algorithms: 1. Outpatient claim for physicians services for ICD-9: 311, 300.4. 2. Antidepressant prescription. 3. Psychiatrist visit.||Proportion with DIS assessed major depression > 6 months (Sens1) and < 6months (Sens2) duration with outpatient claims for depression during 12 months before or after index inpatient admission.||PPV1=56.3% (9 of 16) PPV2=54.5% PPV3=62.5% (10 of 16)||Sens1= 15.8% (9 of 57) Sens2= 52.6% (30 of 57) Sens3=17.5% (10 of 57)|
|Solberg et al., 2003||Adult (≥18 years) outpatients (n=274) from 9 staff model primary care clinics in a metropolitan area with depression code, no antidepressant prescription 6 months, no diagnosis bipolar, schizophrenia, or alcoholism past year, 1998–1999||1. Depressive symptoms (CES-D ≥6) ( PPV1)||ICD-9 311 (only code available for depression) code, no other 311 codes in previous 6 months, no AP fills in previous 6 months.||Proportion of patients meeting administrative code definition of depression who met each of the four outcomes (CES-D score, self-reported current depression, told by health care professional at visit had depression, and chart audit with depression diagnosis or treatment at index visit.||PPV1=71.5% (196 of 274)||NA|
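Table 2. Depression Algorithm Definitions/Validation Characteristics for Studies Using Medical Records/Self-Report as a Validation Standard
|Study||Population and period||Validation standard||Algorithm||Validation measure||PPV||Sensitivity|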
|Frayne et al., 2010||National sample (n=133,068) of Veterans who were treated for diabetes mellitus and responded to a health survey, 98.1% male, 76.6% white, mean age 66.3 years, 1998-1999.||Patient self-report of depression: “Has a doctor ever told you that you have depression?”||Diagnosis of ICD-9: 296.2x, 296.3x, 311.||Proportion of patients meeting algorithm criteria told that they have depression (PPV) and proportion not meeting algorithm criteria not told that they have depression (NPV).||PPV=84%||NA|
|Kramer et al., 2003||Veterans (n=109) from 3 VA medical centers with ≥1 outpatient claim for a depressive disorder (296.2-296.26, 296.3-296.36, or 311) after 180 days without such a claim or an AD fill, 95.2% male, mean age 55.5 years, 1999-2001.||Presence of depression diagnosis in medical record during 180 day period after the new depressive disorder claim||≥1 outpatient claims for depression (ICD-9: 296.2, 296.3, or 311) in any service setting after 180 days without such a claim or an antidepressant fill.||Estimate of PPV for new onset depression by determining proportion of cases with depression diagnosis in medical record during 180 days prior to new index claim for depression.||PPV= 48.6% (53 of 109) (New onset depression)||NA|
|Rawson et al., 1997||Randomly selected inpatients with a first listed discharge diagnosis of depression (311) in administrative files, Saskatchewan, 1986||A diagnosis of depression (311) or a depression-related disorder (ICD-9: 296.1, 296.4, 296.6, 300.4, 309, 311) in the medical record discharge note.||First listed inpatient discharge diagnosis of depression (311) in administrative files.||Proportion with first listed discharge diagnosis of depression with ICD-9 311 (PPV1) or depression-related diagnosis (ICD-9: 296.1, 296.4, 296.6, 300.4, 309, 311) in medical record discharge note.||PPV1= 58.3% (91 of 156) PPV2= 93.6% (146 of 156)||NA|
|Smith et al., 2009||Adult (19–64 years) work disabled Medicaid beneficiaries with depression claims who responded to an employment and disability survey, 2003 or 2005||Self-rated depressed mood item.||≥1 Medicaid claims for ICD-9: 296.2, 296.3, or 311 during 12 months prior to survey||Proportion of beneficiaries with depression claims who report depressed mood (PPV1). Also the corresponding proportion among subgroup with adequately treated depression (ATHF score of 3 or 4) (PPV2).||PPV1= 87.9% (175 of 199) PPV2= 91.1% (153 of 168)||NA|
|Solberg et al., 2006||Adults (>19 years) from a large private health plan (N=135,842). 5 random samples (N=20) meeting different algorithms for depression, 2000.||Medical record diagnosis of depression.||D1. Prevalent depression: ≥2 outpatient or ≥1 inpatient ICD-9 296.2, 296.3, 300.4, 311 in 12 months D2. Antidepressant (AD) treatment (A and B and C): A. 6-months no AD prior to new AD fill B. ≥1 depression code 3 months before or after AD. C. ≥1 more depression code or AD fill in 1.5 years before or after new AD||Proportion of selected plan members with D1 claims-based algorithm found to have depression diagnosis in medical record (PPV1) and proportion with D2 algorithm found in medical record to have started a new antidepressant treatment episode for depression (PPV2)||PPV1= 98.8% (79 of 80) PPV2= 65.0% (13 of 20) and 90.0% (18 of 20)||NA|
|Solberg et al., 2003||Adult (≥18 years) outpatients (n=274) from 9 staff model primary care clinics in a metropolitan area with depression code, no antidepressant prescription 6 months, no diagnosis bipolar, schizophrenia, or alcoholism past year, 1998-1999||2. Self-reported current depression (PPV2) 3. Reported told at index visit has depression (PPV3) 4. Chart audit depression diagnosis or treatment (PPV4)||ICD-9 311 (only code available for depression) code, no other 311 codes in previous 6 months, no AP fills in previous 6 months.||Proportion of patients meeting administrative code definition of depression who met each of the four outcomes (CES-D score, self-reported current depression, told by health care professional at visit had depression, and chart audit with depression diagnosis or treatment at index visit.||PPV2=71.5% (196 of 274) PPV3=54.6% (149 of 274) PPV4=94.9% (260 of 274)||NA|
|Spettell et al., 2003||Primary care physician panel members (≥12 years) from a large MCO (n=892,786) selected for meeting algorithm 1 or 2 and members matched by age, gender, and number of comorbid conditions not meeting algorithms, 1997.||Physician diagnosis of depression in medical record during the 12 month study period.||Algorithm 1. In 12 months, ≥2: A. First listed ICD-9 296.2, 296.3, 300.4, 311 B. AD fill Algorithm 2. In 12 months, A. above and ≥1 of A or B. Bipolar disorder, depressive psychosis or lithium fills excluded. A or B above during 12 months before study period also excluded.||Sensitivity (Sens), Specificity (Spec), positive predictive value (PPV), and negative predictive value (NPV) determined for a sample of 465 patients for algorithm 1 and 2 with physician diagnosis of depression in medical record as the criterion standard.||PPV1= 49.1% (115 of 234) PPV2 = 60.6% (63 of 103)||Sens1= 95.0% (115 of 121) Sens2= 52.1% (63 of 121)|
Depression algorithms and validation statistics
All 11 publications listed in the evidence tables included algorithms with various combinations of four different ICD-9 diagnostic codes to define depression: depression NOS (311), dysthymic disorder (300.4), and major depressive disorder, single episode (296.2) or recurrent (296.3) (Tables 1 and 2). Two studies further permitted adjustment disorder with brief depressive reaction (309.0), adjustment reaction with prolonged depressive reaction (309.1), adjustment reaction with mixed emotional features (309.28), and depressive type psychosis (298.0).[5, 6] For some algorithms in one study, the following codes were also included: alcohol-induced mental disorder (291.89); other specified drug-induced mental disorders (292.89); bipolar affective disorder, depressed (296.5); and bipolar affective disorder, mixed (296.6). Algorithms in three studies also required a claim for a filled antidepressant prescription.[8-10]
Two categories of studies are presented. Table 1 summarizes studies that used standardized scales or structured diagnostic interviews as validation standards. Table 2 presents studies that employed medical record or self-report information as validation standards. Each table includes studies that were population based, allowing for calculation of both positive predictive values (PPVs) and sensitivities and cohort-based studies that permitted calculation of PPV only. Algorithms for individual studies varied with respect to selected codes, treatment setting (outpatient or inpatient), position of code listing (principal vs. secondary), billing health care professional, number of required codes, and timing of the codes. The heterogeneity of the algorithms across the studies constrains strict comparisons. For example, the study that defined depression on the basis of at least two outpatient or one inpatient codes for depression (296.2, 296.3, 300.4, and 311) in a 12-month period likely included a narrower patient population than did the study that required only a single depression claim (296.2, 296.3, and 311).
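To make the contrast between a stricter and a looser claims-based definition concrete, the following is a minimal sketch of the two definitions mentioned in the example above. The `Claim` record, the field names, and the prefix matching of ICD-9 codes are assumptions made for illustration; they are not the data structures or matching rules used by the reviewed studies, and the sketch counts claims rather than distinct visit dates.

```python
from dataclasses import dataclass
from datetime import date

# Code sets taken from the two definitions contrasted in the text.
BROAD_CODES = ("296.2", "296.3", "300.4", "311")
NARROW_CODES = ("296.2", "296.3", "311")


@dataclass(frozen=True)
class Claim:
    """Hypothetical claim record for one patient (illustrative only)."""
    service_date: date
    setting: str  # "outpatient" or "inpatient"
    icd9: str     # e.g. "296.20"


def depression_claims(claims, codes, start, end):
    """Claims within [start, end] whose ICD-9 code begins with a listed code."""
    return [
        c for c in claims
        if start <= c.service_date <= end and c.icd9.startswith(codes)
    ]


def meets_two_outpatient_or_one_inpatient(claims, start, end):
    """Stricter definition: >=2 outpatient or >=1 inpatient depression codes."""
    hits = depression_claims(claims, BROAD_CODES, start, end)
    outpatient = sum(1 for c in hits if c.setting == "outpatient")
    inpatient = sum(1 for c in hits if c.setting == "inpatient")
    return outpatient >= 2 or inpatient >= 1


def meets_single_claim(claims, start, end):
    """Looser definition: any single depression claim."""
    return len(depression_claims(claims, NARROW_CODES, start, end)) >= 1


# A hypothetical patient with one outpatient claim for 311 in the window.
patient = [Claim(date(2000, 3, 1), "outpatient", "311")]
window = (date(2000, 1, 1), date(2000, 12, 31))
print(meets_single_claim(patient, *window))                      # True
print(meets_two_outpatient_or_one_inpatient(patient, *window))   # False
```

The hypothetical patient satisfies the single-claim definition but not the stricter one, which is the sense in which the stricter definition is likely to select a narrower population.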
Validation criteria and method
The studies used three main comparators for validating depression algorithms: structured diagnostic interviews for depressive disorders,[5, 12] self-report items or questionnaires,[6, 7, 11, 13, 14] and depression diagnoses contained in the medical record.[8-10, 13, 15] Some of the self-report validations were based on a single item for depression or depressed mood[11, 13] or whether patients had ever been told by a doctor that they had depression.[7, 13] One study used a 20-item self-report depression scale (Center for Epidemiologic Studies Depression Scale [CES-D]), and two studies[6, 14] used the PHQ-9, which is a brief, 9-item validated screen for major depressive disorder. One of the studies permitted direct comparison of four different validation criteria: a CES-D score of ≥6, self-report depression item, being told by a physician of depression diagnosis, and a medical record diagnosis of depression. This study appears in both Tables 1 and 2 for ease of comparison.
Four of the 11 studies provided sufficient information to derive a chance-corrected measure of agreement (kappa) between the algorithm and a criterion standard. Landis and Koch suggested the following kappa agreement standards: 0.0–0.20 (slight), 0.21–0.40 (fair), 0.41–0.60 (moderate), and 0.61–0.80 (substantial). In the three studies that used an independent assessment of depression as the criterion standard, agreement was either in the slight[12, 14] or fair[5, 12] range. The only study that permitted calculation of kappa and used a physician diagnosis of depression in the medical record as the criterion standard achieved a moderate level of chance-corrected agreement. However, use of medical record or chart information is not an optimal method for validating depression diagnoses in administrative data given the variability with which diagnostic information is represented in chart notation.[19-21]
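For reference, the standard definition of Cohen's kappa for a 2×2 comparison of an algorithm against a criterion standard is given below. The cell labels are an assumed notation, not one used in the reviewed studies: a = algorithm positive and standard positive, b = algorithm positive and standard negative, c = algorithm negative and standard positive, d = algorithm negative and standard negative, with n = a + b + c + d.

```latex
\kappa = \frac{p_o - p_e}{1 - p_e},
\qquad
p_o = \frac{a + d}{n},
\qquad
p_e = \frac{(a + b)(a + c) + (c + d)(b + d)}{n^{2}}
```

Here p_o is the observed proportion of agreement and p_e is the agreement expected by chance from the marginal totals.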
The other seven studies did not provide sufficient information to calculate kappa values. In six of these studies,[7, 9-11, 13, 15] the analysis was limited to patients who met criteria for the depression algorithm. These studies only permit assessment of the extent to which patients who meet algorithm criteria also meet the criterion standard (PPV) but do not permit determination of the proportion of patients with the criterion standard who meet algorithm criteria (sensitivity). In four of these six studies,[9, 10, 13, 15] the criterion standard was a medical record diagnosis of depression. PPVs ranged from 48.6% to 98.8%. Because PPVs vary as a function of prevalence, the relevance of these PPVs is undermined by the absence of prevalence data. Furthermore, findings based on medical record diagnoses of depression must be interpreted with caution given that chart information may not reference diagnoses used for billing purposes.[19-21]
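The dependence of PPV on prevalence follows from Bayes' rule. The sketch below illustrates it using, purely as example inputs, the sensitivity (95.0%) and specificity (65.4%) reported in this review for one algorithm; the prevalence values in the loop are hypothetical and are not drawn from any of the reviewed studies.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from sensitivity, specificity, and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Illustrative inputs: operating characteristics reported for one algorithm.
SENS, SPEC = 0.95, 0.654

# Hypothetical prevalence values chosen only to show the trend.
for prevalence in (0.05, 0.10, 0.25, 0.50):
    print(f"prevalence={prevalence:.2f}  PPV={ppv(SENS, SPEC, prevalence):.3f}")
```

With sensitivity and specificity held fixed, PPV rises as prevalence rises, which is why a PPV reported without the prevalence in the validation sample is hard to carry over to another population.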
None of the algorithms achieved a high level of agreement with depression as measured by independent assessment. The highest agreement with clinically diagnosed depression was achieved by an algorithm that required, over a 12-month period, at least two events, each of which could be a first-listed ICD-9 code for 296.2 (major depressive disorder, single episode), 296.3 (major depressive disorder, recurrent episode), 300.4 (dysthymic disorder), or 311 (depression not elsewhere classified) or a filled prescription for an antidepressant medication. In a large population of primary care patients, the chance-corrected agreement of this algorithm was moderate (kappa = 0.464).
Factors influencing algorithm performance
Operating characteristics varied as a function of the validation algorithm. These relationships were most readily apparent in studies that tested different algorithms against the same criterion standard within the same patient population. In a study of veterans treated for diabetes mellitus, for example, broadening the algorithm from unipolar depression codes to include bipolar disorder and substance-related mood disorder codes markedly increased the percentage of patients captured by the algorithm from 4.5% to 16.5% but reduced the PPV (0.90 to 0.82). In this study, the criterion standard was patient report of being told by a doctor that the patient had depression.
In one primary care study in which physician medical record diagnosis of depression was the criterion standard, narrowing an algorithm markedly altered the operating characteristics. In the first algorithm, patients were required to have at least two events, each of which could be an outpatient encounter with a primary diagnosis of depression or a pharmacy claim for an antidepressant medication. Thus, patients with antidepressant claims but no outpatient depression diagnosis met algorithm requirements. The second algorithm required one outpatient claim with a primary depression diagnosis and another event, which could be either a second outpatient depression diagnosis or an antidepressant claim. Perhaps not surprisingly, the less stringent first algorithm had a lower specificity (65.4% vs. 88.4%) but a much higher sensitivity (95.0% vs. 52.1%) than the more stringent second algorithm. The kappa was slightly higher for the first (0.464) than for the second (0.425) algorithm.
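The operating characteristics quoted above can be rebuilt from the counts given for this study in Table 2. The sketch below assumes that all 465 sampled patients contribute to a single 2×2 table for each algorithm and that the same 121 patients have a medical record diagnosis in both comparisons; the function and variable names are illustrative.

```python
def two_by_two(true_pos, algorithm_pos, standard_pos, n):
    """Rebuild 2x2 cells from the PPV and sensitivity counts and sample size."""
    false_pos = algorithm_pos - true_pos
    false_neg = standard_pos - true_pos
    true_neg = n - true_pos - false_pos - false_neg
    return true_pos, false_pos, false_neg, true_neg


def operating_characteristics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV, and Cohen's kappa for a 2x2 table."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (observed - expected) / (1 - expected),
    }


# Counts from Table 2: PPV1 = 115 of 234, Sens1 = 115 of 121;
# PPV2 = 63 of 103, Sens2 = 63 of 121; validation sample of 465.
algorithm_1 = operating_characteristics(*two_by_two(115, 234, 121, 465))
algorithm_2 = operating_characteristics(*two_by_two(63, 103, 121, 465))

for name, result in (("Algorithm 1", algorithm_1), ("Algorithm 2", algorithm_2)):
    print(name, {key: round(value, 3) for key, value in result.items()})
```

Under these assumptions the sketch gives specificities of about 65.4% and 88.4% and kappa values of about 0.464 and 0.425, in line with the figures reported in the text.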
The position of the listed depression code may affect algorithm performance. Only one study included algorithms that required the depression code to appear in the principal position. The other studies did not specify position of depression codes. Whether the validity of depression codes varies with code position has not been subjected to systematic study.
The studies varied with respect to patient population, ranging from unselected primary care HMO beneficiaries to populations selected for chronic illness. Clinical differences were evident among the study populations: three studies were limited to patients treated for diabetes mellitus,[6, 7, 14] one focused exclusively on patients with a history of asthma treatment, and one involved only disabled persons. Some evidence suggests that co-occurrence of general medical disorders may compromise the management and perhaps the recognition of depression either because comorbid medical disorders compete for clinical attention or because physicians may attribute signs and symptoms of depression to other medical disorders. For these reasons, the validity of claims-based measures of depression may vary by the general medical status of the patient population.
Level of care
The studies included in this report examined depression outcomes based on billing codes for inpatient treatment,[7, 9, 15] outpatient visits,[7-10, 12, 13] or emergency encounters.[7, 10, 13] None of the studies separately assessed or compared the validation of depression outcomes from different treatment settings, although one study compared validity measures from primary care visit codes with mental health service codes.
One study was limited to adolescents, one included children and adults, and one included only patients at least 65 years of age. The remaining eight studies involved non-elderly and elderly adults. The average patient age in these studies was between 50 and 60 years. None of the studies included validation information stratified by patient age. Because PPV depends upon the prevalence of the underlying condition, it is likely that PPV will be lower in children and adolescents than in adults because of the lower treated prevalence of depression in children and adolescents.
Except for two studies of veterans,[7, 10] all of the studies involved a predominance of female patients. None of the studies, however, provided validation data stratified by patient sex. After controlling for measures of severity and impairment, patient sex has not been found to be related to the rate of treatment of major depression.
The patient populations included a range of payment sources for medical services. Although some of the studies were limited to patients with Medicaid coverage,[11, 14] others included only patients with private insurance.[5, 6, 9] Two studies were based on patients receiving care that was financed and provided by the Veterans Health Administration.[7, 10] The effects of differences in payment source and associated billing systems on validity of depression codes remain unknown. One study of veterans care indicated that supplementing Veterans Health Administration data with Medicare data enhanced the rate of depression detection (Table 2).
Period of data collection
The 11 reviewed studies were published between 1997 and 2010. The earliest data were based on care delivered in 1986, and the most recent data were derived from 2006, although two studies did not specify the dates of service delivery.[6, 12] Increases in rates of antidepressant prescribing over time are likely to influence the performance of algorithms that include antidepressant prescriptions as a criterion, making it difficult to compare studies conducted in different periods. Although time effects were not examined specifically in this study, the influence of improvements in depression detection on algorithm performance over time remains an important question for further research.
Excluded populations and diagnoses
Five studies focused exclusively on populations with specific conditions including diabetes,[6, 7, 14] asthma, and disability. Of the other six studies, two were limited to patients who had received inpatient treatment,[12, 15] and one was limited to veterans and focused on new onsets of depression. Only one of the studies was based on a national population. The other studies were derived from a Canadian province, two acute care hospitals, three Veterans Health Administration medical centers, large managed care organizations,[8, 14] a private health plan or health maintenance organization,[5, 6, 9] or a group of nine primary care clinics. The composition of the study samples included herein is an important variable in evaluating the utility of algorithms to detect depression in administrative data. Population-based studies that are likely to include both individuals with and without depression allow assessment of algorithm sensitivity in identifying depression and also provide a measure of precision (PPV). Cohort-based studies, especially those that oversample for individuals who are likely to have depression, only allow for estimation of precision and do not provide information regarding algorithm sensitivity. It is possible that cohort-based studies designed to measure PPV may utilize stricter algorithms, leading to a loss of sensitivity. The magnitude of this reduction in sensitivity cannot be determined from cohort-based data. In addition, the interpretive value of PPV statistics is dependent on the degree to which the prevalence of depression in the sample matches the epidemiological prevalence of depression.
When their performance was measured in terms of agreement with independent assessment of depression, most algorithms produced only slight or fair levels of agreement with the criterion standard. Performance improved when a physician diagnosis of depression in the medical record was used as the criterion standard. However, physician diagnoses and measures based on administrative data do not capture individuals whose depression has not come to medical attention. Furthermore, using medical records can be problematic for ascertaining whether administrative data can accurately identify individuals with depression given potential variability in the types of information recorded in patient charts.[19-21] For depression cases identified algorithmically, PPV was somewhat higher, ranging from 48.6% to 98.8%. However, the overall performance of algorithms to detect depression in administrative data has shortcomings, suggesting that in view of current limitations in the clinical recognition of depression in primary care, it is unlikely that an algorithm derived from administrative data will be developed that has generally acceptable validity for identifying depression as an outcome. For this reason, it is suggested that when possible, research should focus on clinical practices that systematically screen for depression while recognizing that routine screening for depression in adults and youth, although consistent with recommendations from the US Preventive Services Task Force,[27, 28] may not be conducted in all service settings.
The goal of identifying depression from administrative data is constrained by incomplete clinical detection. In the community, just over one half of adults with major depression receive treatment for their symptoms during the course of 1 year.[25, 29] In addition, primary care physicians recognize as depressed only about half of patients with depression who present for medical treatment. The detection rate may be even lower among patients with medical morbidity (30%) and veterans (40%). Deficiencies in the clinical diagnosis of depression, especially varying identification rates in different clinical subgroups, impose a ceiling on the performance of efforts to detect depression using administrative data. One countervailing consideration is that clinically detected or treated depression may be more severe than undetected or untreated depression.[25, 32] Screening initiatives and other efforts to improve the detection and management of depression in primary care practice may have an incidental salutary effect on the validity of electronic-health-record-based detection of depression.
Beyond problems with clinical detection, psychosocial considerations may further limit the accuracy of administrative data for identifying depression. Although attributions regarding the causes of depression are changing, concern about protecting patient confidentiality may lead some physicians to substitute non-mental disorder diagnoses on claims and patient encounter forms. Underreporting of depression may occur in a conscious effort by the clinician to reduce social stigma that might otherwise have adverse occupational or legal consequences for the patient.[37, 38]
Constraints on clinical diagnosis may help to explain the range of sensitivities in the reported studies. When self-report patient assessments that capture clinically undetected depression serve as the criterion standard, sensitivities are low, ranging from 12.5% to 51.1%. When medical record diagnoses are treated as the criterion standard, the sensitivity of algorithms based on administrative data reaches as high as 95.0%. These differences are also reflected in the kappa values that measure agreement between the algorithm and a criterion standard.
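The performance measures discussed here (sensitivity, PPV, and kappa) all derive from the same 2 × 2 cross-classification of algorithm result against criterion standard. The following is a minimal sketch of those calculations, assuming Cohen's kappa as the chance-corrected agreement statistic. The class name `TwoByTwo` and the counts are hypothetical and are not taken from any reviewed study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    """Counts from cross-classifying an algorithm against a criterion standard."""
    tp: int  # algorithm positive, criterion positive
    fp: int  # algorithm positive, criterion negative
    fn: int  # algorithm negative, criterion positive
    tn: int  # algorithm negative, criterion negative

    def sensitivity(self) -> float:
        # Share of criterion-positive patients that the algorithm flags.
        return self.tp / (self.tp + self.fn)

    def ppv(self) -> float:
        # Share of algorithm-flagged patients who are criterion positive.
        return self.tp / (self.tp + self.fp)

    def kappa(self) -> float:
        n = self.tp + self.fp + self.fn + self.tn
        observed = (self.tp + self.tn) / n
        # Agreement expected by chance, computed from the marginal totals.
        expected = (
            (self.tp + self.fp) * (self.tp + self.fn)
            + (self.fn + self.tn) * (self.fp + self.tn)
        ) / (n * n)
        return (observed - expected) / (1 - expected)


# Hypothetical counts chosen only to illustrate the arithmetic.
table = TwoByTwo(tp=40, fp=20, fn=60, tn=380)
print(f"sensitivity={table.sensitivity():.3f} "
      f"ppv={table.ppv():.3f} kappa={table.kappa():.3f}")
```

Because kappa subtracts the agreement expected by chance, an algorithm can agree with the criterion standard on most patients (here, largely through true negatives) and still show only moderate chance-corrected agreement.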
Studies with independent assessments of depression[5-7, 12, 13] rather than medical record diagnoses[8-10, 15] provide more credible evidence of algorithm validity. In this regard, the most rigorous studies involve structured diagnostic assessments,[5, 12] followed by those that use a depression screening instrument as the criterion standard. Unfortunately, the low chance-corrected agreement observed in these studies precludes recommending their algorithms for case identification.
One strategy for broadening algorithms and capturing more patients with depression is to include pharmacy records indicating an antidepressant prescription fill.[6, 8-10] Although this strategy tends to increase sensitivity by capturing more patients with depression, it comes at the expense of false positive cases that diminish PPV.[6, 8, 9] These trade-offs arise from the large proportion of antidepressant prescriptions that are for psychiatric and general medical conditions other than depression. In one national study, only 27% of individuals treated with antidepressants reported receiving them for depression.
Positive predictive value depends upon the prevalence of the underlying condition. Many of the studies were performed in highly enriched samples with base prevalence rates of depression that greatly exceed those of primary care populations. Without careful attention to the prevalence of depression in the base population, the PPVs may appear deceptively high. Although this may be obvious in samples that are limited only to patients with depression,[8, 13] it also distorts estimates in samples that are enriched by oversampling of patients with depression. In one study, for example, the base prevalence of depression in a sample selected for treatment of a mental disorder (28.5%) yielded a PPV of 66.4%. In another study that involved matching patients without depression to patients with depression, the base prevalence of depression in the resulting study sample was 26.0% and the PPV was 49.1% to 60.6% depending upon the algorithm. Substantially lower PPVs would be expected in general populations of primary care patients. Major depression occurs in only 5%–10% of primary care patients and approximately 10%–14% of medical inpatients.
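The dependence of PPV on prevalence can be made concrete with Bayes' rule. The sketch below is illustrative only. It takes the reported prevalence (28.5%) and PPV (66.4%) and pairs them with a sensitivity of 51.1%, on the assumption that this figure, reported in this review for an algorithm with the same PPV, comes from the same study. It also assumes that sensitivity and specificity stay constant when the algorithm is moved to a lower-prevalence population, which may not hold in practice. The function names are hypothetical.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


def implied_specificity(sensitivity: float, ppv_value: float,
                        prevalence: float) -> float:
    """Specificity implied by a reported sensitivity, PPV, and prevalence."""
    true_pos = sensitivity * prevalence
    # Rearranging PPV = TP / (TP + FP) gives FP = TP * (1 - PPV) / PPV.
    false_pos = true_pos * (1.0 - ppv_value) / ppv_value
    return 1.0 - false_pos / (1.0 - prevalence)


# Assumption: the 51.1% sensitivity belongs to the same study as the
# 28.5% prevalence and 66.4% PPV quoted in the text.
spec = implied_specificity(sensitivity=0.511, ppv_value=0.664, prevalence=0.285)

# Re-evaluate PPV at the enriched prevalence and at primary care prevalences.
for prevalence in (0.285, 0.10, 0.05):
    print(f"prevalence={prevalence:.3f} ppv={ppv(0.511, spec, prevalence):.3f}")
```

Under these assumptions, the same algorithm would be expected to yield a PPV of roughly 36% at a prevalence of 10% and roughly 21% at a prevalence of 5%, well below the 66.4% observed in the enriched sample.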
No algorithm achieved a high level of concordance with independent assessments of depression. Most routine electronic billing data do not currently include sufficient information to identify depression in a reasonably reliable manner. Electronic health records do provide sufficient information to identify adult primary care patients who have been diagnosed with depression with a moderate level of agreement. The algorithm with the strongest psychometric properties required at least two first-listed depression codes (296.2, 296.3, 300.4, and 311) over the course of 12 months as well as an antidepressant prescription claim (kappa: 0.464). However, this algorithm utilized medical record diagnosis of depression as a validation standard and therefore cannot be recommended for identifying clinically diagnosed depression on the basis of administrative data.
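The case definition described above (at least two first-listed depression codes within 12 months plus an antidepressant prescription claim) can be sketched as a simple rule over claims records. This is an illustration under stated assumptions, not the implementation used in the reviewed study. The `Claim` record and its field names are hypothetical, the 12-month period is treated as a fixed 365-day window from a supplied start date, and the listed codes are matched by prefix so that fifth-digit subcodes such as 296.22 count.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Iterable, Optional

# ICD-9-CM codes named in the text; prefix matching is an assumption.
DEPRESSION_CODES = ("296.2", "296.3", "300.4", "311")


@dataclass
class Claim:
    """Hypothetical claims record; field names are illustrative only."""
    service_date: date
    first_listed_dx: Optional[str] = None  # first-listed diagnosis code, if any
    antidepressant_fill: bool = False      # True for an antidepressant pharmacy claim


def meets_case_definition(claims: Iterable[Claim], start: date,
                          days: int = 365) -> bool:
    """Two or more first-listed depression codes plus an antidepressant claim."""
    end = start + timedelta(days=days)
    in_window = [c for c in claims if start <= c.service_date < end]
    dx_count = sum(
        1
        for c in in_window
        if c.first_listed_dx is not None
        and c.first_listed_dx.startswith(DEPRESSION_CODES)
    )
    has_fill = any(c.antidepressant_fill for c in in_window)
    return dx_count >= 2 and has_fill


# Hypothetical patient history.
claims = [
    Claim(date(2009, 1, 15), first_listed_dx="296.22"),
    Claim(date(2009, 4, 2), first_listed_dx="311"),
    Claim(date(2009, 4, 5), antidepressant_fill=True),
]
print(meets_case_definition(claims, start=date(2009, 1, 1)))
```

Requiring both repeated first-listed diagnoses and a prescription claim makes the rule restrictive: dropping either condition would flag more patients, which would likely raise sensitivity at the cost of additional false positives.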
Given that depression is an established side-effect of several medications,[40, 41] a well-validated algorithm to identify depression in electronic health data would be valuable for postmarketing evaluation of drug safety. To improve detection of depression in electronic health records, recommendations for future research efforts include the following:
- The evaluation of the most promising algorithm for identifying clinically diagnosed depression should be replicated in a fee-for-service or other primary care setting, using standardized scales or structured diagnostic interviews as the validation standard.
- Future research assessing promising algorithms should be conducted in general medical or specialty mental health settings that routinely screen for depression. It should be noted that even within specialty mental health settings, correlations are modest between clinical diagnoses and structured diagnostic interviews.
- Improvement in the clinical recognition of depression through implementation of routine depression screening is an important variable that can affect algorithm performance. Future research should specifically examine algorithm performance over time as depression screening becomes more thoroughly implemented in youth and primary care populations.
- Because of increasing concern over depression-related adverse events in youth[43, 44] and the paucity of information that is currently available for this age group, priority should be given to developing algorithms to identify depression in children and adolescents.
Incomplete recognition of depression in routine clinical practice constrains how well electronic health information can identify depression. Unless substantial progress is made in the clinical detection of depression, algorithms based on administrative depression codes are unlikely to achieve acceptable sensitivity in identifying depression as measured by independent assessment. Depression estimates are partly a reflection of the base rates of depression in any given sample; therefore, depression rates may be elevated in samples with a high prevalence of mental disorders (e.g., Medicaid samples) and underestimated in samples with a comparatively low prevalence of mental disorders (e.g., general primary care samples). At the same time, administrative depression codes have been demonstrated to have reasonable concordance with medical record diagnoses of depression. In several contexts, most adults who receive administrative codes for depression have notations of depression in their medical records. Much less is known, however, about the sensitivity of administrative depression codes for identifying depression in the medical record. Moreover, given the variability of information contained in medical records,[19-21] these sources are not considered an optimal means of validating diagnoses in administrative data. Because antidepressants are prescribed for a wide variety of psychiatric disorders and some general medical conditions, inclusion of prescription claims for antidepressant medications may or may not improve the PPV of algorithms to identify depression. The current value of administrative depression codes appears to be limited to identification of selected cases with a reasonable probability of having clinically recognized depression.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
KEY POINTS
- None of the algorithms evaluated achieved a high level of agreement with depression as measured by independent assessment.
- Detection of depression in administrative records is influenced by the base prevalence of depression in the population of interest and is constrained by incomplete clinical detection of the disorder.
- The algorithms with the strongest psychometric properties employed the following: encounter depression codes (296.2, 296.3, 298.0, 300.4, 309.0, 309.1, 309.28, or 311) validated against the PHQ-9 (sensitivity = 51.1%, PPV = 66.4%) or an antidepressant claim within 12 months after an emergent medical hospital admission validated against a structured diagnostic interview (sensitivity = 52.6%, PPV = 54.5%).
- Use of antidepressant claims to identify individuals with depression is problematic given the widespread use of antidepressants for purposes unrelated to depressive disorders.
- The strongest algorithm performance was found when medical record notation of depression in patient charts was used as the criterion standard. However, use of medical record notation information is problematic due to the variability with which depression is referenced in these documents.
ACKNOWLEDGEMENTS
This work was supported by the Food and Drug Administration through the Department of Health and Human Services Contract Number HHSF223200910006I. The views expressed in this document do not necessarily reflect the official policies of the Department of Health and Human Services, nor does mention of trade names, commercial practices, or organizations imply endorsement by the US government.