Screening for major depression in cancer outpatients
The diagnostic accuracy of the 9-item patient health questionnaire
Article first published online: 24 AUG 2010
Copyright © 2010 American Cancer Society
Volume 117, Issue 1, pages 218–227, 1 January 2011
How to Cite
Thekkumpurath, P., Walker, J., Butcher, I., Hodges, L., Kleiboer, A., O'Connor, M., Wall, L., Murray, G., Kroenke, K. and Sharpe, M. (2011), Screening for major depression in cancer outpatients. Cancer, 117: 218–227. doi: 10.1002/cncr.25514
- Issue published online: 16 DEC 2010
- Manuscript Accepted: 2 JUN 2010
- Manuscript Revised: 29 APR 2010
- Manuscript Received: 13 JAN 2010
- Patient Health Questionnaire;
- major depression;
- sensitivity and specificity;
- diagnostic accuracy
Systematic screening for depression has been recommended for patients who have medical conditions like cancer. The 9-item Patient Health Questionnaire (PHQ-9) is becoming widely used, but its diagnostic accuracy has not yet been tested in a cancer patient population. In this article, the authors report on the performance of the PHQ-9 as a screening instrument for major depressive disorder (MDD) in patients with cancer.
Data obtained from a depression screening service for patients who were attending clinics of a Regional Cancer Centre in Edinburgh, United Kingdom were used. Patients had completed both the PHQ-9 and a 2-stage procedure to identify cases of MDD. Performance of the PHQ-9 in identifying cases of MDD was determined using receiver operating characteristic (ROC) analysis.
Data were available on 4264 patients. When scored as a continuous measure, the PHQ-9 performed well with an area under the ROC curve of 0.94 (95% confidence interval [CI], 0.93-0.95). A cutoff score of ≥8 provided a sensitivity of 93% (95% CI, 89%-95%), a specificity of 81% (95% CI, 80%-82%), a positive predictive value (PPV) of 25%, and a negative predictive value (NPV) of 99% and could be considered optimum in a screening context. The PHQ-9 did not perform as well when it was scored using an algorithm, which yielded a sensitivity of 56% (95% CI, 55%-57%), a specificity of 96% (95% CI, 95%-97%), a PPV of 52%, and an NPV of 97%.
The PHQ-9 scored as a continuous measure with a cutoff score of ≥8 performed well in identifying MDD in cancer patients and should be considered as a screening instrument in this population. Cancer 2011. © 2010 American Cancer Society.
Depression substantially reduces the subjective health of patients with medical conditions.1 It also may lead to poor compliance with medical treatment and a poorer outcome.2 Although major depressive disorder (MDD) is common among individuals who have cancer, with a reported prevalence of 6% to 24%,3, 4 it often goes undetected.5, 6 Consequently, systematic screening for MDD has been recommended frequently7, 8; however, currently, there is no consensus regarding which measures should be used for screening.
The 9-item Patient Health Questionnaire (PHQ-9) is a brief, self-rated depression screening questionnaire that was developed for use in primary care. It can be scored either as a continuous measure or using an algorithm.9 The brevity and face validity of the PHQ-9 have made it a popular choice as a screening instrument.10 Two systematic reviews have summarized the diagnostic accuracy of the PHQ-9 in identifying MDD in both primary care patients and hospital patients.10, 11 Those reviews reported a sensitivity of 77% to 80% and a specificity of 92% to 94%, which are comparable to the proposed benchmarks against which case-finding instruments commonly are judged.12 Table 1 lists all studies of which we are aware that reported on the diagnostic accuracy of the PHQ-9. We included only those studies in which the comparator was a gold-standard reference criterion of a structured or semistructured interview and the sample size was ≥100 patients.
|Reference (Ref. No.): Country||Patient Population and Setting (No.)||PHQ-9 Scoring Method||Diagnosis: Reference Standard||Findings|
|Spitzer 1999 (ref. 9): United States||Primary care (585)||Algorithm||MDD: DSM SCID||Sensitivity, 73%; specificity, 98%|
|Kroenke 2001 (ref. 23): United States||Primary care (580)||Continuous measure with cutoff score||MDD: DSM SCID||AUC, 0.95; sensitivity, 88%; specificity, 88%; cutoff score, ≥10|
|Diez-Quevedo 2001 (ref. 31): Spain||Medical and surgical inpatients (1003)||Algorithm||MDD: DSM SCID||Sensitivity, 84%; specificity, 92%|
|Becker 2002 (ref. 32): Saudi Arabia||Primary care (173)||Algorithm||MDD: DSM SCID||Sensitivity, 62%; specificity, 95%|
|Mazzotti 2003 (ref. 33): Italy||Dermatology inpatients (170)||Algorithm||MDD: DSM SCID||Sensitivity, 46%; specificity, 92%|
|Lowe 2004 (ref. 24): Germany||Medical outpatient and family practice clinics (501)||Continuous measure with cutoff score [algorithm]||MDD: DSM SCID||AUC, 0.95; sensitivity, 95%; specificity, 84%; cutoff score, ≥12 [sensitivity, 83%; specificity, 90%]|
|Henkel 2004 (ref. 34): Germany||Primary care clinics (431)||Algorithm||MDD: DSM CIDI||AUC, 0.91; sensitivity, 79%; specificity, 86%|
|Grafe 2004 (ref. 35): Germany||Medical and psychosomatic outpatients (528)||Algorithm||MDD: DSM SCID||Sensitivity, 95%; specificity, 86%|
|McManus 2005 (ref. 36): United States||Cardiology outpatients (1024)||Continuous measure with cutoff score||MDD: DIS||AUC, 0.87; sensitivity, 52%; specificity, 90%; cutoff score, ≥10|
|Williams 2005 (ref. 37): United States||Stroke outpatients (316)||Continuous measure with cutoff score||MDD: DSM SCID||AUC, 0.96; sensitivity, 91%; specificity, 89%; cutoff score, ≥10|
|Fann 2005 (ref. 38): United States||Head injury (135)||Algorithm||MDD: DSM SCID||Sensitivity, 93%; specificity, 63%|
|Picardi 2005 (ref. 39): Italy||Dermatology inpatients (141)||Algorithm||MDD: DSM SCID||Sensitivity, 55%; specificity, 91%|
|Stafford 2007 (ref. 40): Australia||Cardiology patients at home (193)||Continuous measure with cutoff score||MDD: DSM MINI||AUC, 0.8; sensitivity, 54%; specificity, 91%; cutoff score, ≥10|
Despite its strong psychometric properties and widespread use in primary care, to our knowledge, there are no published empiric evaluations of the diagnostic accuracy (sensitivity and specificity) of the PHQ-9 in cancer patients. A recent major review of screening questionnaires for psychological distress in cancer patients commented specifically on the lack of information on this scale's psychometric properties when used in cancer patients.13 Therefore, the objective of the current study was to determine the diagnostic accuracy of the PHQ-9 as a screening instrument for identifying cases of MDD in a large sample of cancer outpatients using anonymized data obtained from a large depression screening service.
MATERIALS AND METHODS
We analyzed cross-sectional questionnaire and interview data that had been collected by a service that screened for depression in patients attending a Cancer Center. Because the data had been collected as part of routine clinical care, individual patient consent had not been obtained. Approval for the aggregate and anonymized data to be analyzed and reported was obtained from the local research ethics committee.
Patients and Procedure
The depression screening service was part of a comprehensive symptom monitoring service that was provided to patients who attended the cancer clinics of The Edinburgh Cancer Center and was designed to identify patients with MDD.6 All patients attending colorectal cancer, breast cancer, gynecologic cancer, genitourinary cancer, sarcoma, and melanoma clinics had been invited to be screened except those who were too ill or who had significant communication or cognitive difficulties. The findings of the screening were provided to the patients' clinicians. The data used in this report were collected by the service between June 2003 and December 2005.
Reference criterion for MDD: The depression screening service adopted a commonly used,14 2-stage procedure for the diagnosis of MDD.
Patients completed the Hospital Anxiety and Depression Scale (HADS) on a touch-screen computer. The HADS was developed to screen for depression and anxiety in medical patients.15 A cutoff ≥15 on the total HADS score has been validated previously in cancer patients as sensitive for the detection of probable MDD.16, 17
Patients who scored above the specified cutoff on the HADS were interviewed using the MDD section of the Structured Clinical Interview for DSM (SCID). This semistructured interview was developed to accompany the Diagnostic and Statistical Manual of Mental Disorders, third edition (DSM-III18) and has been updated for the DSM-IV.19 A diagnosis of MDD requires the presence of at least 5 items, 1 of which should be a core symptom (“depressed mood most of the day, nearly every day” or “markedly diminished interest or pleasure in all or almost all activities most of the day, nearly every day”). Diagnostic assessments counted all symptoms (including somatic symptoms, such as sleep disturbance and weight loss) toward the diagnosis of MDD, and no attempt was made to judge their cause (the so-called inclusive approach to diagnosis).20 These interviews usually took place within several days of the clinical attendance and were conducted over the telephone. It has been demonstrated that the results from telephone administration of the SCID have good agreement with the results from the face-to-face interview and that telephone administration of the SCID is acceptable to patients.21, 22 The interviewers were specially trained psychology graduates and nurses who received weekly supervision (including the regular review of audio recordings of interviews) from a psychiatrist. The diagnosis of MDD was made without reference to the PHQ-9 score, which was used to measure the severity of depression.
The 9-Item Patient Health Questionnaire
The PHQ-9 had been administered by the depression screening service in addition to the 2-stage system described above for a limited period to test its performance as a possible screening tool. This scale was developed originally as part of a larger questionnaire designed to improve the detection of common mental disorders in primary care.9, 23 It has 9 items: The first 2 items address anhedonia and depressed mood—the cardinal symptoms of major depression. These are followed by 7 additional items that address changes in sleep, energy, appetite, guilt and worthlessness, concentration, feeling slowed down or restlessness, and suicidal thoughts. For each item, the patient is asked to rate how much over the past 2 weeks they have been bothered by the symptom. Scoring is on a Likert-type scale from 0 to 3 (0 indicates not at all; 1, several days; 2, more than half the days; 3, nearly every day).
The PHQ-9 can be scored in 2 ways (Fig. 1). In Method A, the PHQ-9 is scored as a continuous measure of depression severity with scores ranging from 0 to 27 to which cutoff scores for probable MDD can be applied. In Method B, it is scored using a diagnostic algorithm based on DSM-IV criteria for MDD.23
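Method A can be illustrated with a minimal Python sketch. The function names and the example ratings are hypothetical and are not taken from the screening service's software; the default cutoff of 8 is simply the value examined in this article.

```python
from typing import Sequence


def phq9_total(items: Sequence[int]) -> int:
    """Sum the nine PHQ-9 item ratings (each 0-3) into a 0-27 severity score."""
    if len(items) != 9:
        raise ValueError("PHQ-9 requires exactly 9 item ratings")
    if any(r not in (0, 1, 2, 3) for r in items):
        raise ValueError("each rating must be 0, 1, 2, or 3")
    return sum(items)


def screens_positive(items: Sequence[int], cutoff: int = 8) -> bool:
    """Method A: flag probable MDD when the total score reaches the cutoff."""
    return phq9_total(items) >= cutoff


# Hypothetical ratings for one patient (illustration only, not study data).
example = [2, 1, 1, 1, 1, 0, 1, 1, 0]
print(phq9_total(example), screens_positive(example))
```

Raising `cutoff` to 10 would classify the same hypothetical patient as screen negative, which is the sensitivity and specificity tradeoff the analysis explores.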
Because the depression screening service had been used by patients at every visit they made to the clinic, some patients had completed multiple screening episodes during the study period. For the purposes of the current analysis, we used only the first screening event on which both the PHQ-9 and the 2-stage diagnostic process were completed. To examine the performance of the PHQ-9 as a screening instrument for MDD, we compared the PHQ-9 score against the “gold-standard” diagnosis of MDD that was made by the 2-stage diagnostic process described above.
First, we used the PHQ-9 as a continuous measure with cutoff scores to identify cases of MDD. Cutoff scores are specific sum scores on questionnaires that help to distinguish between cases and noncases. Cutoff scores on the PHQ-9 total score were compared against the presence or absence of MDD. The questionnaire's ability to discriminate between cases and noncases was then examined using receiver operating characteristic (ROC) analysis. The area under the ROC curve was used to indicate the performance of the PHQ-9. On such a curve, a value of 0.5 represents discrimination no better than chance, and a value of 1.0 represents perfect discrimination between cases and noncases.
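The area under the ROC curve can be read as the probability that a randomly chosen case scores higher than a randomly chosen noncase, with ties counted as one half. The following Python sketch computes it that way. The function name `roc_auc` and the scores are hypothetical; the article does not state which software was used, so this should not be read as the study's own procedure.

```python
from typing import Sequence


def roc_auc(case_scores: Sequence[float], noncase_scores: Sequence[float]) -> float:
    """AUC as P(case score > noncase score), counting ties as one half."""
    if not case_scores or not noncase_scores:
        raise ValueError("need at least one case and one noncase")
    wins = 0.0
    for c in case_scores:
        for n in noncase_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(case_scores) * len(noncase_scores))


# Hypothetical PHQ-9 totals (illustration only, not study data).
cases = [12, 9, 15, 8]
noncases = [3, 8, 1, 5, 10]
auc = roc_auc(cases, noncases)
print(f"AUC = {auc:.3f}")
```

The pairwise loop is quadratic in the sample size, which is acceptable for a small illustration; a rank-based computation would be preferable for a sample as large as the one analyzed here.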
The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were calculated for various PHQ-9 cutoff scores. The choice of optimal cutoff score is always a tradeoff between sensitivity and specificity. A lower cutoff score makes the questionnaire very sensitive and inclusive, whereas a higher cutoff score will make it more specific at the cost of missing some cases. For clinical use, maximum sensitivity with a specificity ≥75% has been suggested as desirable.24 The Youden index (sensitivity + specificity − 1) is another way to summarize diagnostic test accuracy; the cutoff score with the highest Youden index is the numerically optimum cutoff if sensitivity and specificity are weighted equally.
The item structure of PHQ-9 allows the use of an algorithm to screen for major depression.23 By using this algorithm, the criteria for a case are met if patients report at least 5 of the 9 items for “more than half the days” or more in the last 2 weeks and 1 of the items reported is depressed mood or anhedonia. Patients were categorized with depression or not by applying this algorithm. This dichotomous scoring was then compared with the presence or absence of MDD from the reference criterion to generate sensitivity, specificity, PPV, and NPV.
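The algorithm rule as stated above can be sketched in Python as follows. This sketch implements only the rule described in this paragraph; any item-specific exceptions in the published scoring instructions are not modelled, and the names used are hypothetical.

```python
from typing import Sequence

CORE_ITEMS = (0, 1)   # the first 2 items: anhedonia and depressed mood
MORE_THAN_HALF = 2    # rating 2 = "more than half the days"


def algorithm_positive(items: Sequence[int]) -> bool:
    """Method B: at least 5 items rated >= 2, including at least 1 core item."""
    if len(items) != 9:
        raise ValueError("PHQ-9 requires exactly 9 item ratings")
    endorsed = [r >= MORE_THAN_HALF for r in items]
    has_core = any(endorsed[i] for i in CORE_ITEMS)
    return sum(endorsed) >= 5 and has_core


# Hypothetical response patterns (illustration only, not study data).
with_core = [2, 2, 2, 2, 2, 0, 0, 0, 0]
without_core = [1, 1, 3, 3, 3, 3, 3, 0, 0]
print(algorithm_positive(with_core), algorithm_positive(without_core))
```

The second pattern has a total score of 17 and would screen positive under any of the cutoffs examined with Method A, yet the algorithm rejects it because neither core item reaches "more than half the days".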
Because the depression screening service had only interviewed those patients who had scored above the predefined cutoff score on the HADS, it is possible that some patients who in fact were cases of MDD had scored below that cutoff and consequently were missed. We addressed this limitation of the reference criterion using a sensitivity analysis. That analysis was informed by a previous study of a separate sample of 361 cancer outpatients attending the same clinics in which it was reported that, when using a cutoff score ≥15 on the HADS, 282 of 286 patients were correctly ruled out as not having MDD, with an NPV of 99%.17 Therefore, we assumed that 1% of the patients who had scored below the HADS cutoff and, hence, were not interviewed during the second stage of the depression screening actually would have had MDD, and we tested the effect this error rate would have on our estimate of PHQ-9 performance. This was done by redistributing the extra 1% of cases proportionally for each cutoff point on the PHQ-9. Then, sensitivity and specificity were recalculated.
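One possible reading of this redistribution, for a single PHQ-9 cutoff, is sketched below in Python. It assumes that 1% of the patients who were not interviewed are reclassified as cases on each side of the PHQ-9 cutoff; the article does not give the exact procedure, so the function, its parameters, and all counts are hypothetical.

```python
from typing import Tuple


def adjust_for_missed_cases(
    tp: float, fn: float, fp: float, tn: float,
    uninterviewed_at_or_above: float,
    uninterviewed_below: float,
    miss_rate: float = 0.01,
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) after moving assumed missed cases.

    tp, fn, fp, tn: observed 2x2 counts at one PHQ-9 cutoff.
    uninterviewed_*: patients below the HADS cutoff (never interviewed, so
    counted as noncases), split by whether their PHQ-9 total met the cutoff.
    """
    extra_tp = miss_rate * uninterviewed_at_or_above  # previously counted as FP
    extra_fn = miss_rate * uninterviewed_below        # previously counted as TN
    tp, fp = tp + extra_tp, fp - extra_tp
    fn, tn = fn + extra_fn, tn - extra_fn
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts for one cutoff (illustration only, not study data).
before = adjust_for_missed_cases(90, 10, 200, 700, 100, 600, miss_rate=0.0)
after = adjust_for_missed_cases(90, 10, 200, 700, 100, 600)
print(before, after)
```

Because most patients who were not interviewed are likely to have low PHQ-9 totals, most of the assumed missed cases fall below the cutoff, so under this reading sensitivity falls while specificity barely moves.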
Between June 2003 and December 2005, 4264 patients had completed both the PHQ-9 and the 2-stage diagnostic process for MDD. The data collected on these patients formed the analysis sample. The derivation of the sample is shown in Figure 2. The mean patient age was 61 years (range, 18-101 years), and 63% of patients were women. Because the data were collected as part of routine screening, we did not have detailed information on cancer diagnosis, disease extent, or treatment. However, a sample of 2867 patients who were screened 1 year earlier by the same service25 had the following clinical characteristics: The primary cancer was breast in 32% of patients, bowel in 16% of patients, prostate in 12% of patients, ovary in 11% of patients, other gynecologic in 11% of patients, testicular in 8% of patients, and miscellaneous in 10% of patients. Sixty-six percent of the sample was disease free at the time of screening, 30% was receiving either radiotherapy or chemotherapy, and 16% was on hormone treatment.
Prevalence of MDD
In total, 270 of 4264 patients had MDD identified by the depression screening service, producing an estimated prevalence of 6.3%.
Performance of the 9-Item Patient Health Questionnaire
Method A (scored as a continuous measure with cutoff scores to define cases)
When the total PHQ-9 score was used to identify cases of MDD, the area under the curve was 0.94 (95% confidence interval [CI], 0.93-0.95) on ROC analysis (Fig. 3). This is evidence of good discriminating power between cases and noncases of MDD.
Sensitivity, specificity, NPV, PPV, and the Youden index for a range of cutoff scores from 5 to 10 are shown in Table 2. Cutoff scores of 8, 9, and 10 produced acceptable levels of sensitivity and specificity. According to the Youden index, a score ≥8 was the optimum cutoff, providing sensitivity of 93% (95% CI, 89%-95%), specificity of 81% (95% CI 80%-82%), a PPV of 25%, and an NPV of 99%.
|Cutoff Score||Sensitivity (95% CI), %||Specificity (95% CI), %||Youden Index, %||PPV, %||NPV, %|
|≥5||100 (98-100)||62 (61-64)||62||15||100|
|≥6||98 (95-99)||70 (68-71)||67||18||100|
|≥7||97 (94-99)||76 (75-77)||73||21||100|
|≥8||93 (89-95)||81 (80-82)||74||25||99|
|≥9||88 (83-91)||85 (84-86)||73||28||99|
|≥10||82 (77-86)||88 (87-89)||71||32||99|
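The selection of the optimum cutoff by the Youden index can be reproduced from the rounded sensitivity and specificity values in Table 2. The Python sketch below is illustrative only, and its variable names are hypothetical.

```python
# (sensitivity %, specificity %) by PHQ-9 cutoff, copied from Table 2.
TABLE_2 = {
    5: (100, 62),
    6: (98, 70),
    7: (97, 76),
    8: (93, 81),
    9: (88, 85),
    10: (82, 88),
}


def youden_index(sens_pct: int, spec_pct: int) -> int:
    """Youden index in percentage points: sensitivity + specificity - 100."""
    return sens_pct + spec_pct - 100


youden = {cutoff: youden_index(*rates) for cutoff, rates in TABLE_2.items()}
best_cutoff = max(youden, key=youden.get)
print(youden, best_cutoff)
```

Because the inputs are rounded, the recomputed index differs by 1 percentage point from the published column at cutoffs ≥6 and ≥10; the maximum still falls at ≥8.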
When a cutoff score ≥8 was applied, the PHQ-9 identified 93% of the patients with MDD. The PPV, which is the percentage of individuals who have the true condition (in this instance, MDD) among those who are screened positive, is a good indicator of clinical utility of the scale in a given population. It is influenced by the prevalence of the condition and will be low when the target condition is of low prevalence. At a cutoff score ≥8, the PPV was 25%, indicating that 1 in 4 patients who screened positive actually had MDD. The NPV is the percentage of individuals who do not have MDD among those who screen negative. At the ≥8 cutoff score, the NPV was 99.4%, indicating that almost all patients who scored below this cutoff on the PHQ-9 did not have MDD. Hence, only 0.6% of patients who scored <8 on the PHQ-9 were misclassified (ie, they screened negative but actually had MDD).
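The dependence of the predictive values on prevalence follows from Bayes' rule. The Python sketch below is illustrative, with hypothetical function and variable names. It uses the rounded sensitivity and specificity at the ≥8 cutoff together with the observed prevalence of 270 of 4264 patients; the 2% prevalence in the second call is an arbitrary comparison value, not a study result.

```python
from typing import Tuple


def predictive_values(
    sensitivity: float, specificity: float, prevalence: float
) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)


prevalence = 270 / 4264  # observed prevalence of MDD in the sample
ppv, npv = predictive_values(0.93, 0.81, prevalence)

# Same test characteristics at a lower, arbitrary prevalence for comparison.
ppv_low, _ = predictive_values(0.93, 0.81, 0.02)
print(f"PPV {ppv:.1%}, NPV {npv:.1%}, PPV at 2% prevalence {ppv_low:.1%}")
```

With the rounded inputs, the computed values agree with the reported PPV of 25% and NPV of 99.4%, and the PPV falls further when the same test is applied at the lower prevalence.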
The sensitivity analysis, which tested the effect of an assumed 1% misclassification of MDD diagnoses on the operating characteristics across PHQ-9 cutoff scores from 5 to 10, is shown in Table 3. The observed effect was only a modest drop in sensitivity. For example, at a cutoff score ≥8, the sensitivity dropped from 93% to 83%, whereas the specificity remained 81%. The PPV did not change, and the NPV dropped by only 1 percentage point.
|Cutoff Score||Sensitivity (95% CI), %||Specificity (95% CI), %||Youden Index, %||PPV, %||NPV, %|
|≥5||91 (88-94)||62 (61-64)||54||16||99|
|≥6||89 (85-92)||69 (68-71)||59||18||99|
|≥7||88 (84-91)||76 (74-77)||64||22||99|
|≥8||83 (79-87)||81 (80-82)||64||25||98|
|≥9||79 (74-83)||85 (84-86)||64||29||98|
|≥10||73 (68-78)||88 (87-89)||62||33||98|
Method B (scored by algorithm to define cases)
When the screening algorithm was applied to all patients who completed the PHQ-9, 291 of 4264 patients (6.8%) had positive screening results. Of the 270 patients who had a diagnosis of MDD, the PHQ-9 algorithm identified 151 as such and misclassified 119 as not having MDD. The performance of the PHQ-9 as a screening instrument when used this way is shown in Table 4. The sensitivity was poor, and only 56% of individuals with MDD were identified. Given the poor performance, a sensitivity analysis was not done on this scoring method.
|Sensitivity (95% CI), %||Specificity (95% CI), %||Youden Index, %||PPV, %||NPV, %|
|56 (55-57)||96 (95-97)||52||52||97|
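The point estimates in Table 4 can be rebuilt from the counts reported in the text for Method B (4264 patients, 270 with MDD, 291 screening positive, 151 correctly identified). The Python sketch below derives the remaining cells of the 2 × 2 table by subtraction; the variable names are hypothetical.

```python
# Counts reported in the text for Method B (algorithm scoring).
n_total = 4264       # patients with both PHQ-9 and reference diagnosis
n_mdd = 270          # patients with MDD on the reference criterion
n_screen_pos = 291   # patients positive on the PHQ-9 algorithm
tp = 151             # MDD cases identified by the algorithm

# Remaining cells of the 2 x 2 table, derived by subtraction.
fn = n_mdd - tp                  # cases missed by the algorithm
fp = n_screen_pos - tp           # screen positives without MDD
tn = n_total - n_mdd - fp        # noncases correctly screened negative

sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)
npv = tn / (tn + fn)
youden = sensitivity + specificity - 1

print(f"sens {sensitivity:.0%}, spec {specificity:.0%}, "
      f"PPV {ppv:.0%}, NPV {npv:.0%}, Youden {youden:.0%}")
```

The derived false-negative count of 119 matches the number of misclassified cases stated in the text, and the rounded percentages agree with the point estimates in Table 4.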
Depression often is missed by clinicians working in nonpsychiatric medical settings (primary care or medical patients), and systematic screening for MDD has been proposed as a solution.7 A recent review of screening instruments revealed that several questionnaires performed similarly in identifying patients with MDD, with a sensitivity of approximately 80% and a specificity >75%.13 Therefore, the choice of a scale can be based on ease of use, clinician and patient preference, and other clinical considerations. In this regard, the PHQ-9 has strong appeal, because it has clear face validity and brevity, and it uses symptoms taken from the DSM-IV diagnostic criteria for major depression. However, to the best of our knowledge, its diagnostic accuracy when administered as a screening tool for MDD in a cancer population has not been evaluated. We observed that, when it was used as a continuous measure, the PHQ-9 performed well against a 2-stage, gold-standard diagnosis. The operating characteristics were comparable to or better than those of other commonly used instruments.17 When we examined individual cutoff scores, we observed that a cutoff score ≥8 offered a high sensitivity of 93% while maintaining an adequate specificity of 81%. Therefore, this arguably may be the best cutoff score for use as an initial screening questionnaire in this population. However, the PPV of only 25% (which is similar to that of other depression case-finding measures12) indicates that only 1 in 4 of the patients who screen positive at this first stage of screening subsequently will meet the criteria for MDD when they are interviewed. By using a cutoff score ≥10, a threshold that is recommended for screening general medical and primary care patients, the number of negative subsequent interviews (false-positives) would be reduced but at the cost of missing more patients with MDD (more false-negatives).
When we scored the PHQ-9 using the recommended algorithm, we observed that the diagnostic accuracy was not as good. The low sensitivity of 56% obtained from this method effectively rules out the usefulness of this approach to scoring in the cancer population. Examining the cases that were misclassified using this method, we discovered that the majority (60%) of false-negative results (ie, those who screened negative on the PHQ-9 algorithm but were positive for MDD at the interview) did not report either low mood or anhedonia on the questionnaire to a degree that would count toward the algorithm (ie, more than half the days). Although some previous studies have reported identical performance of the 2 methods of scoring,10, 23 a Dutch study of chronically ill, elderly patients with diabetes or chronic obstructive pulmonary disease also reported poor performance of the PHQ-9 when it was scored using the algorithm.26 One possible explanation for this is that medical patients associate the cardinal symptoms of depression (low mood and loss of interest) more strongly with an (unwanted) diagnosis of depression and consequently underreport these on a questionnaire while admitting to them only when they are probed at an interview. Therefore, we advocate caution when using only the 2 items (low mood and anhedonia) as an initial screener in a medically ill or elderly population.
The psychometric standards for depression screening questionnaires have not been defined. However, sensitivity rates of 85% and specificity rates of 70% to 75% usually are considered acceptable benchmarks.10, 12 In the current study, the PHQ-9 met these standards when it was used as a continuous measure at a cutoff score ≥8.
A summary of previously published studies that have examined the diagnostic accuracy of PHQ-9 against a gold-standard reference criterion in at least 100 patients is shown in Table 1. A meta-analysis of the diagnostic performance of the PHQ-9 (either as an algorithm or with a cutoff score) in primary care, general medical outpatient clinics, and specialist medical outpatient clinics10 derived a pooled sensitivity of 80% (95% CI, 71%-87%) and a specificity of 92% (95% CI, 88%-95%) for identifying MDD. Most of those studies used a cutoff score ≥10. The reported optimum cutoff score varied from ≥9 to ≥12. A meta-analysis of primary care studies of the PHQ-9 reported a pooled sensitivity of 77% (95% CI, 71%-84%) and specificity of 94% (95% CI, 90%-97%).11 Relative to these reviews, our results suggest better sensitivity and acceptable specificity in a cancer population.
Practical Use of the 9-Item Patient Health Questionnaire
How can screening with the PHQ-9 be implemented? The Edinburgh depression screening service uses a semiautomated touch-screen, computer-delivered system that provides real-time delivery of the scores to clinicians. According to previous reports, this method of delivering the PHQ-9 is feasible in cancer patients and has an average completion time of only 2 minutes.27 It also is being evaluated as a method of monitoring changes in depression severity over time.28 One important practical point is that approximately 33% of cancer patients who responded positively to Item 9 on the PHQ-9 reported suicidal thoughts at a subsequent interview (unpublished data). Consequently, any service using it for screening must be prepared to follow these patients with a risk assessment. It also may be argued that Item 9 may be too complex (it asks about thoughts of being better off dead and of hurting yourself as 1 item), could be better worded (thoughts of hurting oneself), and may be off-putting to some patients. One suggestion has been to simply omit this item from the scale.29 Finally, it is important to consider the provision of treatment or referral for the cases identified when implementing screening for depression.8, 30
The current study had several limitations: First, the derivation of the reference criteria for MDD used a 2-stage process and may have missed a small number of cases of MDD. We were able to address the potential effect of this limitation in a sensitivity analysis, which did not indicate any substantial effect on the performance estimates we obtained for the PHQ-9. Second, although the data sample was large and representative of the patients who attended the clinics that were screened, it did not include patients with all types of cancer. Third, we had demographic information but not detailed clinical information on the sample. Fourth, it is possible that the short time lag (usually several days) between patient completion of the PHQ-9 in the clinic and the telephone SCID interview may have reduced its measured performance if the patients' depression status had changed over this period.
The current results indicate that, when choosing an instrument with which to screen cancer patients for MDD, the PHQ-9 is a strong candidate. However, no scale can completely replace a clinical interview. Therefore, we recommend interviewing all patients who are identified by screening as possibly depressed before a definite diagnosis of MDD is made and treatment is commenced.
CONFLICT OF INTEREST DISCLOSURES
Supported by Cancer Research UK.
- 7. National Institute for Health and Clinical Excellence. Management of depression in primary and secondary care. Bethesda, Md: National Institute for Health and Clinical Excellence; 2007. http://www.nice.org.uk/nicemedia/pdf/CG23NICEguidelineamended.pdf. 1-4-2007. Accessed January 9, 2009.
- 8. U.S. Preventive Services Task Force. Screening for depression in adults: U.S. Preventive Services Task Force recommendation statement. Ann Intern Med. 2009; 151: 784-792.
- 18. The Structured Clinical Interview for DSM-III-R (SCID). I: history, rationale, and description. Arch Gen Psychiatry. 1992; 49: 624-629.
- 19. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders. 4th ed. Washington, DC: American Psychiatric Association; 1994.
- 28. The Indiana Cancer Pain and Depression (INCPAD) trial. Design of a telecare management intervention for cancer-related symptoms and baseline characteristics of study participants. Gen Hosp Psychiatry. 2009; 31: 240-253.
- 33. The Patient Health Questionnaire (PHQ) for the screening of psychiatric disorders: a validation study versus the Structured Clinical Interview for DSM-IV Axis I (SCID-I). Ital J Psychopathol. 2003; 9: 235-242.
- 35. Screening for psychiatric disorders with the Patient Health Questionnaire (PHQ). Results from the German validation study. Diagnostica. 2004; 50: 171-181.