Depression prevalence based on the Edinburgh Postnatal Depression Scale compared to Structured Clinical Interview for DSM Disorders classification: Systematic review and individual participant data meta‐analysis

Abstract

Objectives: Estimates of depression prevalence in pregnancy and postpartum are based on the Edinburgh Postnatal Depression Scale (EPDS) more than on any other method. We aimed to determine if any EPDS cutoff can accurately and consistently estimate depression prevalence in individual studies.

Methods: We analyzed datasets that compared EPDS scores to Structured Clinical Interview for DSM (SCID) major depression status. Random‐effects meta‐analysis was used to compare prevalence based on EPDS cutoffs versus the SCID.

Results: A total of 7315 participants (1017 SCID major depression) from 29 primary studies were included. For EPDS cutoffs used to estimate prevalence in recent studies (≥9 to ≥14), pooled prevalence estimates ranged from 27.8% (95% CI: 22.0%–34.5%) for EPDS ≥9 to 9.0% (95% CI: 6.8%–11.9%) for EPDS ≥14; pooled SCID major depression prevalence was 9.0% (95% CI: 6.5%–12.3%). EPDS ≥14 provided the pooled prevalence closest to SCID‐based prevalence but differed from SCID prevalence in individual studies by a mean absolute difference of 5.1% (95% prediction interval: −13.7%, 12.3%).

Conclusion: EPDS ≥14 approximated SCID‐based prevalence overall, but considerable heterogeneity across individual studies is a barrier to using it for prevalence estimation.


| INTRODUCTION
Accurate estimates of depression prevalence are necessary to understand disease burden and allocate healthcare resources.
Validated diagnostic interviews, such as the Composite International Diagnostic Interview (Wittchen, 1994) and the Structured Clinical Interview for the DSM (SCID) (First & Gibbon, 2004), are designed to classify major depression and estimate depression prevalence in a manner consistent with diagnostic criteria. However, administering validated diagnostic interviews to samples that are large enough to estimate prevalence is resource-intensive. Thus, many researchers instead administer self-report depression symptom questionnaires, or screening tools, and report the percentage above a cutoff threshold as the prevalence of depression (Levis et al., 2019b; Thombs, Kwakkenbos, Levis, & Benedetti, 2018). Some items included in self-report questionnaires address symptoms similar to those evaluated in validated diagnostic interviews, but most questionnaires do not evaluate all relevant symptoms, and most include other items that are not part of diagnostic criteria. Furthermore, unlike validated diagnostic interviews, self-report questionnaires do not include historical information necessary for differential diagnosis, investigate non-psychiatric medical conditions that can cause symptoms similar to those of depression, assess functional impairment related to symptoms, or verify that symptoms are not an expected reaction to losses or stressors.
Depression screening tools are designed to cast a wide net and identify individuals who may have depression. Individuals who screen positive on depression screening tools must be further evaluated by a trained health care professional to confirm whether diagnostic criteria are met. Based on sensitivity and specificity estimates for common depression screening tools and cutoff thresholds, if screening tools are used to estimate prevalence rather than to identify individuals who may have depression, most would be expected to overestimate prevalence compared to actual diagnoses (Levis et al., 2019b; Thombs et al., 2018).
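The expected overestimation follows directly from a tool's operating characteristics: the proportion screening positive is the sensitivity applied to true cases plus the false-positive rate applied to non-cases. A minimal sketch in Python (the sensitivity, specificity, and prevalence values below are hypothetical, chosen only to be in the range typical of depression screening tools at standard cutoffs):

```python
def apparent_prevalence(sens: float, spec: float, true_prev: float) -> float:
    """Expected proportion screening positive, given test accuracy and true prevalence."""
    # True positives among cases + false positives among non-cases
    return sens * true_prev + (1 - spec) * (1 - true_prev)

# Hypothetical values: sensitivity 0.85, specificity 0.84, true prevalence 10%.
# The screen-positive proportion is roughly 23% -- more than double the true 10%.
print(round(apparent_prevalence(0.85, 0.84, 0.10), 3))  # 0.229
```

The example illustrates why a screening cutoff tuned for high sensitivity systematically inflates prevalence: false positives from the large non-depressed group outnumber the missed cases.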
A recent study that examined 69 meta-analyses of depression prevalence found that 44% of pooled prevalence estimates in meta-analysis abstracts were based solely on screening or rating tools and 46% on a combination of screening tools and other methods (e.g., unstructured interviews, medical charts); only 10% were based solely on diagnostic interviews (Levis et al., 2019b). Among 2094 primary studies included in the meta-analyses, 77% used screening or rating tools, whereas only 13% exclusively used validated diagnostic interviews. Meta-analyses based solely on screening or rating tools reported an average depression prevalence of 31%, compared to 17% in meta-analyses based solely on diagnostic interviews. The degree to which screening questionnaires overestimate the true prevalence depends on the specific depression screening tool and cutoff threshold used (Levis et al., 2019b; Thombs et al., 2018).
To date, we are aware of only one study that has directly compared prevalence based on a specific screening tool and cutoff threshold to prevalence based on a validated diagnostic interview for major depression (Levis et al., 2020). That study, an individual participant data meta-analysis (IPDMA), included 9242 participants from 44 primary studies who were administered both the Patient Health Questionnaire-9 (PHQ-9) and the SCID diagnostic interview. It found that prevalence based on the standard PHQ-9 cutoff of ≥10 was 25% (95% confidence interval [CI]: 21%–30%), compared to 12% (95% CI: 10%–15%) based on the SCID. The study also reported that no PHQ-9 cutoff consistently matched prevalence based on the SCID in individual studies.
The 10-item Edinburgh Postnatal Depression Scale (EPDS) (Cox, Holden, & Sagovsky, 1987) is the most commonly used depression screening tool for women in pregnancy or postpartum (Hewitt et al., 2009; Howard et al., 2014). It was designed to assess symptoms continuously, to provide information for discussion with patients, and to identify women who may benefit from formal mental health assessment (Cox et al., 1987). We reviewed 53 recently published studies that stated in their title or abstract that they assessed prevalence of "depression", "depressive disorders", "major depression", or "major depressive disorder". We excluded any that stated that they reported the prevalence of "depressive symptoms" or similar terms. We found that only 6 (11%) used a validated diagnostic interview designed for this purpose. There were 26 (49%) studies that used the EPDS and 21 that used other methods, mostly other questionnaires. Studies that reported prevalence based on the EPDS used cutoff thresholds from ≥9 to ≥14, with the majority using cutoffs of ≥10 and ≥13 (see Supplementary material, Methods S1 and Table S1). The extent to which prevalence estimates based on different EPDS cutoffs may differ from prevalence based on validated diagnostic interviews, however, is unknown.
The aim of the present study was to use an IPDMA approach to (1) determine the degree to which EPDS cutoffs that are commonly used to report depression prevalence may deviate from prevalence based on a validated semi-structured diagnostic interview, the SCID; and (2) use a prevalence matching approach (Kelly, Dunstan, Lloyd, & Fone, 2008; Thombs et al., 2018) to determine whether any EPDS cutoff threshold matches SCID major depression prevalence closely enough, and with sufficiently low heterogeneity, to be used for estimating major depression prevalence in individual studies.

| METHODS
We used a subset of data accrued for an IPDMA on EPDS diagnostic accuracy. The IPDMA was registered in PROSPERO (CRD42015024785), and a protocol was published (Thombs et al., 2015). The present study was not included in the protocol for the main EPDS IPDMA, but a separate protocol was published on the Open Science Framework prior to initiating the study (https://osf.io/7gy6p/).

| Identification of eligible studies
In the main IPDMA, datasets from articles in any language were eligible for inclusion if (1) they included EPDS scores for women who were pregnant or in the postpartum period, defined as within 12 months of birth; (2) they included diagnostic classifications for current Major Depressive Episode or Major Depressive Disorder based on DSM (American Psychiatric Association, 1987, 1994, 2000) or International Classification of Diseases (World Health Organization, 1992) criteria, using a validated semi-structured or fully structured interview; (3) the EPDS and diagnostic interview were administered within 2 weeks of each other, since diagnostic criteria for major depression are for symptoms in the last 2 weeks; (4) participants were ≥18 years and not recruited from youth or school-based settings; and (5) participants were not recruited from psychiatric settings or because they were identified as having symptoms of depression, since screening is done to identify unrecognized cases. Datasets where not all participants were eligible were included if primary data allowed selection of eligible participants.
For the present study, in our main analyses, we included only primary studies that based major depression diagnoses on the SCID (First & Gibbon, 2004). The SCID is a semi-structured diagnostic interview intended to be conducted by an experienced diagnostician; it requires clinical judgment and allows rephrasing questions and probing to follow up responses. We included only studies that administered the SCID because semi-structured interviews replicate diagnostic standards more closely than other types of interviews, and the SCID is by far the most commonly used semi-structured diagnostic interview in depression research (Levis et al., 2019a; Wu et al., 2020). In recent analyses using three large IPDMA databases (Levis et al., 2019a; Wu et al., 2020), compared to semi-structured interviews, fully structured interviews, which are designed for administration by lay interviewers, identified more patients with low-level depressive symptoms as depressed but fewer patients with high-level symptoms. These results are consistent with the idea that semi-structured interviews most closely replicate clinical interviews done by trained professionals. Fully structured interviews are less resource-intensive options because they are completely scripted and allow for minimal or no judgment, since they are designed to be administered by research staff without diagnostic skills. They may, however, misclassify major depression in substantial numbers of patients. In the EPDS IPDMA database, the SCID was the most common semi-structured interview. In a sensitivity analysis, we included two additional studies from the database that used semi-structured interviews other than the SCID. Two investigators independently reviewed titles and abstracts for eligibility. If either deemed a study potentially eligible, full-text review was done by two investigators, independently, with disagreements resolved by consensus, consulting a third investigator when necessary.
Translators were consulted for languages other than those for which team members were fluent.

| Data contribution and synthesis
Authors of eligible datasets were invited to contribute de-identified primary data, including EPDS scores and major depression classification status. We emailed corresponding authors of eligible primary studies at least three times, as necessary, with at least 2 weeks between each email. If we did not receive a response, we emailed co-authors and attempted to contact corresponding authors by phone.
Prior to integrating individual datasets into our synthesized dataset, we compared published participant characteristics and diagnostic accuracy results with results from raw datasets and resolved any discrepancies in consultation with the original investigators. The number of participants and the number of cases from a primary study in the IPDMA dataset differed from the originally published primary study reports for some studies. There are several reasons for this. First, in some primary studies, some, but not all, participants met the inclusion criteria for the main IPDMA. For instance, we required administration of the EPDS index test and reference standard to be within a 2-week period and only included participants aged 18 or older recruited from non-psychiatric settings.
We only included data from participants in primary studies who met these criteria. Second, the reference standard diagnostic category for the main IPDMA differed from that used in some published reports of primary studies. Some primary studies reported accuracy results for depression diagnoses broader than major depression, such as "major + minor depression" or "any depressive disorder". We restricted our depression variable to major depression classification. Third, as part of our data verification process, we compared published participant characteristics and diagnostic accuracy results with results obtained using the raw datasets. When primary data that we received from investigators and original publications were discrepant, we identified and corrected errors in consultation with the original primary study investigators.
When primary datasets included statistical weights to reflect sampling procedures, we used the weights provided. For studies where sampling procedures merited weighting, but the original study did not weight, we constructed weights using inverse selection probabilities. This occurred, for instance, when all participants with positive screens and a random subset of participants with negative screens were administered a diagnostic interview.

| Statistical analyses
First, for each primary study, we estimated three values: (1) the percentage of participants classified as having major depression based on the SCID, (2) the percentage of participants who scored at or above each possible EPDS cutoff, and (3) the difference between the two. Then, across all studies, we pooled prevalence for each EPDS cutoff, prevalence for the SCID, and the difference in prevalence from each study.
Second, we identified the EPDS cutoff with the smallest pooled difference. Then, for each included study, in addition to the already estimated difference in prevalence based on that cutoff versus SCID major depression, we also estimated the ratio of prevalence based on the cutoff to that of the SCID. We plotted study-level differences by sample size and determined the mean and median absolute difference and the range of differences across all studies. To illustrate the range of difference values that would be expected if a new study were to compare prevalence based on the prevalence match scoring approach to prevalence based on SCID major depression, we estimated a 95% prediction interval for the difference.
All meta-analyses incorporated sampling weights and were conducted in R (version 3.6.0; RStudio version 1.1.453) using the lme4 package (Bates, Mächler, Bolker, & Walker, 2015). To estimate pooled prevalence values, generalized linear mixed-effects models with a logit link function were fit using the glmer function. To estimate pooled difference values, linear mixed-effects models were fit using the lmer function. To account for correlation between subjects within the same primary study, random intercepts were fit for each primary study. To quantify heterogeneity, for each analysis we calculated τ², the estimate of between-study variance, and I², the proportion of total variability due to between-study heterogeneity.
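To illustrate the quantities being estimated (pooled effect, τ², I², and a prediction interval for a new study), here is a sketch using the simpler DerSimonian-Laird method in Python. This is not the paper's approach (the authors fit mixed-effects models with lme4 in R), and all study-level data below are hypothetical; the sketch only shows how between-study variance widens the interval expected for the next study:

```python
import math

def dersimonian_laird(effects, variances):
    """Random-effects pooling via the DerSimonian-Laird estimator.
    Returns the pooled estimate, its SE, tau^2 (between-study variance),
    and I^2 (share of total variability that is between studies)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    df = len(effects) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c)
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]                   # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1 / sum(w_star))
    return pooled, se, tau2, i2

# Hypothetical study-level prevalence differences (EPDS minus SCID) and variances
diffs = [0.02, -0.05, 0.10, 0.01, -0.03]
vars_ = [0.001, 0.002, 0.001, 0.003, 0.002]
pooled, se, tau2, i2 = dersimonian_laird(diffs, vars_)

# Approximate 95% prediction interval for the difference in a new study:
# tau^2 is added to the pooled variance (normal approximation; a t-based
# interval, as used for the reported results, would be slightly wider).
lo = pooled - 1.96 * math.sqrt(tau2 + se ** 2)
hi = pooled + 1.96 * math.sqrt(tau2 + se ** 2)
```

The key point mirrored in the paper's results: even when the pooled difference is near zero, a large τ² makes the prediction interval wide, so an individual study's EPDS-based prevalence can deviate substantially from its SCID-based prevalence.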
We conducted two sets of post-hoc analyses. First, we repeated the prevalence match analysis excluding studies with SCID-based prevalence >20% and >15%, separately, in order to assess results without studies that reported very high prevalence and to ensure that results were consistent when only studies with more typical prevalence were included. For each subset of studies, we (1) identified the EPDS cutoff with the smallest pooled difference and (2) estimated the 95% prediction interval for the difference. Second, we investigated whether differences in prevalence between the EPDS prevalence match scoring approach and the SCID were associated with study and participant characteristics, in order to attempt to explain the heterogeneity we found. To do this, we fit additional linear mixed-effects models for the pooled prevalence difference, including age, pregnant versus postpartum status, country human development index ("very high", "high", or "low-medium") (United Nations Development Programme, 2020), and study sample size as fixed-effect covariates.

| RESULTS

| Search results and inclusion of primary study datasets
For the main IPDMA, of the 3417 unique titles and abstracts identified from the search, 3097 were excluded after title and abstract review and 212 after full-text review. The 108 remaining articles comprised data from 73 unique samples, of which 49 (67%) contributed individual participant data. One additional study, which was subsequently published, was provided by the authors of an included study, for a total of 50 datasets. For our main analyses, we excluded 21 studies that classified major depression using a diagnostic interview other than the SCID, such that the sample for those analyses included 7315 participants (1017 major depression cases; prevalence 14%) from 29 primary studies (see Figure 1). Table 1 shows characteristics of each included study.
In sensitivity analyses, we included data from two additional studies that used a semi-structured diagnostic interview other than the SCID (N participants: 255; N major depression cases: 38; prevalence 15%). This resulted in inclusion of data from 31 primary studies (N participants: 7570; N major depression cases: 1055; prevalence 14%). See Table 1.

| DISCUSSION
The developers of the EPDS intended it to be a questionnaire that could detect symptoms of depression that are commonly experienced by women in the postpartum period but that would not be picked up by other scales (Cox et al., 1987). It was not intended to classify cases or estimate prevalence. Most studies that report the prevalence of depression in pregnancy or postpartum, however, base their estimates on the proportion of women in the study who score above a cutoff threshold on a depression screening questionnaire, most commonly the EPDS. The EPDS cutoffs used to estimate prevalence in recently published studies ranged from ≥9 to ≥14, with ≥10 and ≥13 being the cutoffs most commonly used for this purpose. We found that, compared to SCID major depression prevalence, commonly used EPDS cutoffs overestimated prevalence. A cutoff of ≥10 on the EPDS generated a prevalence of 22%, and a cutoff of ≥13 generated a prevalence of 11%, compared to 9% with the SCID.
Overall, the pooled prevalence based on an EPDS cutoff of ≥14 (9%) was closest to the pooled SCID major depression prevalence (9%). However, differences between prevalence based on the EPDS and the SCID varied substantially across individual studies. The difference between EPDS- and SCID-based prevalence ranged from −17% to 18%, and the estimated 95% prediction interval indicated that in the next study using both tools, the difference in prevalence could fall anywhere between −14% and 12%. Thus, although overall prevalence with EPDS ≥14 is similar to that of the SCID, if used to estimate prevalence in individual studies, it could considerably under- or overestimate the true major depression prevalence in any given study. Differences between EPDS- and SCID-based estimates were not associated with sample size. We found that age was statistically significantly associated with the difference between EPDS ≥14 and SCID-based prevalence, but a 1-year difference in age was associated with only a 0.2% difference in prevalence; given the general similarity in ages of pregnant and postpartum women, this would not explain the large differences we found.
The results from this study are similar to findings from Levis et al. (2020), which compared prevalence based on the PHQ-9 screening tool and the SCID. The most commonly used cutoff of PHQ-9 ≥10 overestimated the SCID prevalence by approximately 12%. Prevalence matching for the PHQ-9 revealed that PHQ-9 ≥14 provided a pooled prevalence estimate closest to SCID major depression prevalence. However, as in the present study, the difference in prevalence between PHQ-9 ≥14 and the SCID varied considerably across individual studies.
It is common to report the proportion scoring at or above the cutoff threshold as the prevalence of "depressive symptoms" or "clinically significant depressive symptoms" rather than suggesting that prevalence of depression has been reported. However, this does not resolve the problem. Diagnostic thresholds are designed to identify individuals with a condition or with a level of impairment that warrants attention, but there is no evidence that impairment from symptoms of depression becomes meaningful at or above screening thresholds, which have been set for the purpose of screening, not for delineating impairment. Furthermore, while people with symptom scores above these thresholds have greater symptom impairment on average than those below the threshold, that would be the case for whatever threshold is set. Reporting percentages of women