Screening for Perinatal Depression with the Patient Health Questionnaire Depression Scale (PHQ-9): A Systematic Review and Meta-analysis

Larry Wang

a Indiana University School of Medicine, Indianapolis, IN, United States.

Kurt Kroenke

a Indiana University School of Medicine, Indianapolis, IN, United States.

b Regenstrief Institute, Inc., Indianapolis, IN, United States.

Timothy E. Stump

c Department of Biostatistics, Indiana University Fairbanks School of Public Health and School of Medicine, Indianapolis, IN, USA

Patrick O. Monahan

c Department of Biostatistics, Indiana University Fairbanks School of Public Health and School of Medicine, Indianapolis, IN, USA

Larry Wang contributed to the conceptualization, data curation, funding acquisition, investigation, methodology, project administration, writing of the original draft, and critical review and revision of the manuscript.

Kurt Kroenke, MD, contributed to the conceptualization, data curation, investigation, methodology, project administration, supervision, writing of the original draft, and critical review and revision of the manuscript.

Timothy Stump, MS, contributed to the formal analysis, methodology, validation, visualization, and critical review and revision of the manuscript.

Patrick Monahan, PhD, contributed to the formal analysis, supervision, methodology, validation, visualization, and critical review and revision of the manuscript.

* Corresponding author: Kurt Kroenke, MD, Regenstrief Institute, Rm 221, 1101 W. 10th Street, Indianapolis, IN 46202. Ph 317-274-9046. FAX 317-274-9304. kkroenke@regenstrief.org

The publisher's final edited version of this article is available at Gen Hosp Psychiatry

Abstract

Objectives:

Perinatal depression (PND) is a prevalent and disabling problem both during pregnancy and the postpartum period. The legacy screening measure has been the Edinburgh Postnatal Depression Scale (EPDS). This systematic review examines the validity of the PHQ-9 as a screener for PND.

Methods:

The following databases were searched from January 2001 (when the PHQ-9 was first published) through June 2020: MEDLINE, Embase, and PsycINFO. Studies that compared the PHQ-9 to a criterion standard psychiatric interview were used to determine the operating characteristics of sensitivity, specificity and area under the curve (AUC). Studies comparing the PHQ-9 to the EPDS and other depression scales were used to evaluate convergent validity.

Results:

A total of 35 articles were eligible for criterion (n=10) or convergent (n=25) validity. Meta-analysis of the 7 criterion validity studies using the standard PHQ-9 cut point ≥ 10 showed a pooled sensitivity, specificity and AUC of 0.84, 0.81 and 0.89, respectively. Operating characteristics of the PHQ-9 and EPDS were nearly identical in head-to-head comparison studies. The median correlation between the PHQ-9 and EPDS was 0.59, and categorical agreement was moderate.

Conclusions:

The PHQ-9 appears to be a viable option for perinatal depression screening with operating characteristics similar to the legacy EPDS.

Keywords: depression, screening, pregnancy, postpartum, perinatal, PHQ-9

1. Introduction

1.1. Rationale

Perinatal depression (i.e., depression in women during pregnancy or in the postnatal period up to 12 months postpartum) occurs in 10–20% of women [1]. Untreated depression is associated with adverse fetal and newborn outcomes in addition to long-term effects on the mother, child, and family [2]. Numerous guidelines advocate universal perinatal depression screening [2–8]. The legacy screener most commonly recommended, and the one with the greatest amount of evidence, is the Edinburgh Postnatal Depression Scale (EPDS). A “legacy” measure is one that is well validated, widely used, and considered by many experts to be the standard against which competing measures should be compared [9–11]. The EPDS qualifies as a legacy measure for several reasons. First, it has the largest number of validation studies of any perinatal depression screening measure, as summarized in several systematic reviews [12–16]. Second, it is the screening measure most commonly recommended in perinatal depression guidelines [3, 4]. Third, the 10-item EPDS is brief and has been translated into more than 50 languages [3].

However, many experts also consider the Patient Health Questionnaire depression scale (PHQ-9) as an alternative to the EPDS for perinatal depression screening [2, 5–8]. The PHQ-9 is the most widely used depression measure globally [17] and has been validated across a wide range of age groups, medical conditions, and clinical settings. Since depression is often a chronic or recurrent condition, using a single measure that can assess depression both during and outside the perinatal period may be advantageous in monitoring scores over the lifespan of a woman. The widespread incorporation of the PHQ-9 into healthcare systems, electronic records, and depression screening guidelines makes it a highly familiar metric to clinicians in both primary care and multiple specialty settings [18].

1.2. Objectives

In contrast to the EPDS, the PHQ-9 has not been the subject of a comprehensive assessment of the published literature regarding its validity in screening for perinatal depression. Therefore, we conducted a systematic review and meta-analysis with three objectives:

1) To determine the criterion validity of the PHQ-9 in perinatal depression screening when compared to a criterion standard psychiatric interview;

2) To examine the convergent validity of the PHQ-9 when compared to other validated depression measures in a perinatal population;

3) To compare the performance of the PHQ-9 and EPDS when used in the same studies.

2. Methods

2.1. Identification of studies

The following databases were searched from January 2001 through June 2020: MEDLINE, Embase, and PsycINFO via the PubMed, Embase, and EBSCO search engines, respectively. The search began in 2001 because that was the year the first paper on the PHQ-9 was published. The search was formatted as a PICO question with a perinatal population, the PHQ-9 as the intervention, a criterion standard psychiatric interview as the comparator, and depression as the outcome. The search string was: (puerperal disorders OR puerperal OR postpartum OR post-partum OR pregnan* OR post natal OR postnatal OR perinatal depression OR postpartum depression) AND (PHQ OR patient health questionnaire OR patient health questionnaire 9 OR PHQ-9 OR PHQ 9) AND depress*. The asterisk indicates that the search term was truncated to include variations. For example, “depress*” includes “depression” as well as “depressed” and “depressive”.
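The structure of the search can be sketched in a few lines of Python. This is only an illustration: the grouping of the three term blocks with parentheses is an assumption about how they were combined, the helper names are hypothetical, and real database interfaces apply their own syntax and field tags.

```python
# Term lists transcribed from the search description above.
# Assumption: terms are OR-ed within a block and the blocks are AND-ed.
POPULATION = [
    "puerperal disorders", "puerperal", "postpartum", "post-partum",
    "pregnan*", "post natal", "postnatal", "perinatal depression",
    "postpartum depression",
]
INSTRUMENT = [
    "PHQ", "patient health questionnaire",
    "patient health questionnaire 9", "PHQ-9", "PHQ 9",
]
OUTCOME = ["depress*"]


def or_block(terms):
    """Join the terms of one concept with OR and wrap them in parentheses."""
    return "(" + " OR ".join(terms) + ")"


def build_query():
    """Combine the three concept blocks with AND."""
    return " AND ".join(or_block(block) for block in (POPULATION, INSTRUMENT, OUTCOME))


def truncation_matches(term, word):
    """Return True if `word` matches `term`, where a trailing * is a wildcard."""
    term, word = term.lower(), word.lower()
    if term.endswith("*"):
        return word.startswith(term[:-1])
    return word == term


if __name__ == "__main__":
    print(build_query())
    for candidate in ("depression", "depressed", "depressive", "suppress"):
        print(candidate, truncation_matches("depress*", candidate))
```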

2.2. Study selection

Results of the literature search were imported to EndNote X9, where duplicates were removed. Articles were initially screened by one author for relevance to the PHQ-9, predominantly by reading the abstract. If the abstract contained insufficient information to include or exclude an article, the full-text paper was analyzed. Conference abstracts were not included. The 89 resulting full-text articles were independently assessed by two authors to reduce selection bias. Included studies were further sorted into studies that investigated criterion validity and studies that investigated convergent validity. The references of the articles were checked for relevant studies. Inclusion criteria required that the study: 1) utilized the PHQ-9 to assess perinatal depression; 2) assessed either criterion validity of the PHQ-9 using a structured psychiatric interview or its convergent validity by comparing to another validated depression measure; and 3) was published in English or had an English translation available.

2.3. Quality assessment

The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool uses 11 questions to assess the risk of bias in four domains: patient selection (3 questions), index test (2 questions), reference standard (2 questions), and flow and timing (4 questions) [19]. The tool has three choices for each question: yes, no, and unclear. An answer of “yes” indicates fulfillment of the QUADAS-2 criterion. An answer of “no” indicates a risk of bias in that domain. An answer of “unclear” indicates that the study only partially fulfills the criterion or that there is insufficient information to draw a conclusion. To be conservative, we treated an answer of “unclear” as a “no”. If a domain contains all “yes” answers, its risk of bias is considered to be low. Developers of the QUADAS-2 consider one or more “no” answers for a domain to indicate a high risk of bias for that domain. Because this seems particularly conservative, we adapted the scoring so that one “no” answer was considered to represent a medium risk of bias, whereas two or more “no” answers were considered to represent a high risk. Two authors appraised each of the articles independently and blind to the other’s results. Consensus was achieved by discussing any ratings that differed between the two reviewers.
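The adapted scoring rule described above can be written as a short function. The sketch below is a minimal illustration, not the authors' actual procedure: the function names and the example answers are hypothetical, and only the rule itself (unclear counted as no; zero, one, or two or more non-yes answers giving low, medium, or high risk) comes from the text.

```python
from typing import Dict, List

VALID_ANSWERS = {"yes", "no", "unclear"}


def domain_risk(answers: List[str]) -> str:
    """Rate one domain using the adapted rule: 'unclear' counts as 'no'."""
    normalized = [a.strip().lower() for a in answers]
    unknown = set(normalized) - VALID_ANSWERS
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")
    not_yes = sum(1 for a in normalized if a != "yes")
    if not_yes == 0:
        return "low"
    if not_yes == 1:
        return "medium"
    return "high"


def study_risk(domains: Dict[str, List[str]]) -> Dict[str, str]:
    """Rate every domain of one study."""
    return {name: domain_risk(answers) for name, answers in domains.items()}


# Hypothetical answers for one study (3, 2, 2 and 4 questions per domain).
example_study = {
    "patient selection": ["yes", "yes", "yes"],
    "index test": ["yes", "yes"],
    "reference standard": ["yes", "unclear"],
    "flow and timing": ["yes", "no", "unclear", "yes"],
}

if __name__ == "__main__":
    for domain, risk in study_risk(example_study).items():
        print(f"{domain}: {risk}")
```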

Two of the 11 questions required operationalization for this review. Under the index test bias domain, one question is whether the “index test results were interpreted without knowledge of the results of the reference standard.” While blinded scoring of the PHQ-9 was not explicitly noted in any of the studies, this is irrelevant because the PHQ-9 is a self-administered scale whose scoring cannot be influenced by knowledge of the reference results. Under the flow and timing bias domain, one question is whether “there was an appropriate interval between the index test and the reference standard.” An interval of two weeks or less was deemed appropriate because this is the standard duration the DSM uses for defining clinically relevant depressive symptoms.

2.4. Data extraction and analysis

Data extracted for each study included sample size, mean age, country and clinical setting in which the study was conducted, perinatal population (pregnancy, postpartum, or both), and proportion of participants with a PHQ-9 score ≥ 10. For Objective 1, diagnostic operating characteristics for criterion validity studies included sensitivity (percent of patients with major depression who have a depression screener score at or above the defined cutpoint), specificity (percent of patients without major depression who have a depression screener score below the defined cutpoint), and area under the curve (AUC) as determined by receiver operating characteristic (ROC) curve analyses. The AUC is the area under the ROC curve, which plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) across a series of cutpoints. AUCs ≥ 0.80 and ≥ 0.90 represent good and excellent diagnostic test accuracy, respectively.
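To make these definitions concrete, the following sketch computes sensitivity and specificity at a cutpoint and approximates the AUC with the trapezoidal rule over a set of cutpoints. The scores and diagnoses are invented toy data, and the function names are hypothetical; the included studies reported these quantities using their own methods.

```python
def operating_point(scores, depressed, cutpoint):
    """Return (sensitivity, specificity) when a score >= cutpoint is a positive screen."""
    pairs = list(zip(scores, depressed))
    tp = sum(1 for s, d in pairs if d and s >= cutpoint)
    fn = sum(1 for s, d in pairs if d and s < cutpoint)
    tn = sum(1 for s, d in pairs if not d and s < cutpoint)
    fp = sum(1 for s, d in pairs if not d and s >= cutpoint)
    return tp / (tp + fn), tn / (tn + fp)


def auc_trapezoid(scores, depressed, cutpoints):
    """Area under the ROC curve (true positive rate vs false positive rate)."""
    points = {(0.0, 0.0), (1.0, 1.0)}
    for c in cutpoints:
        sens, spec = operating_point(scores, depressed, c)
        points.add((1.0 - spec, sens))
    ordered = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]))


# Toy data (hypothetical): PHQ-9 totals and interview diagnoses for 8 women.
toy_scores = [3, 5, 8, 12, 6, 11, 14, 20]
toy_depressed = [False, False, False, False, True, True, True, True]

if __name__ == "__main__":
    print(operating_point(toy_scores, toy_depressed, cutpoint=10))
    print(auc_trapezoid(toy_scores, toy_depressed, cutpoints=range(0, 28)))
```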

For studies that used the conventional PHQ-9 threshold score ≥ 10, two-by-two tables were generated from the presented data. From these tables, we calculated the sensitivity, specificity and confidence intervals (CIs) for each study included. The data were presented in the form of forest plots. For all summary-level estimates, we used a bivariate generalized linear mixed model to simultaneously estimate pooled measures of sensitivity and specificity while accounting for the potential correlation between sensitivity and specificity [20]. Summary receiver-operating characteristic curves were obtained along with 95% confidence regions for the bivariate estimate of AUC. The glmer function [21] of the lme4 package [22] in R (R Foundation for Statistical Computing) [23] was used to estimate the bivariate models.
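For readers unfamiliar with the bivariate approach, one common way to write such a model is shown below. This is a sketch of a standard formulation and not necessarily the exact parameterization used in [20] or in the authors' R code; the symbols are introduced here for illustration. For study i, n_{1i} and n_{0i} are the numbers of women with and without major depression, and TP_i and TN_i are the true positives and true negatives at the PHQ-9 ≥ 10 cutpoint.

```latex
\begin{aligned}
TP_i &\sim \mathrm{Binomial}(n_{1i},\, Se_i), \qquad
TN_i \sim \mathrm{Binomial}(n_{0i},\, Sp_i), \\
\operatorname{logit}(Se_i) &= \mu_{Se} + u_i, \qquad
\operatorname{logit}(Sp_i) = \mu_{Sp} + v_i, \\
\begin{pmatrix} u_i \\ v_i \end{pmatrix}
&\sim N\!\left(
\begin{pmatrix} 0 \\ 0 \end{pmatrix},
\begin{pmatrix}
\sigma_{Se}^{2} & \rho\,\sigma_{Se}\sigma_{Sp} \\
\rho\,\sigma_{Se}\sigma_{Sp} & \sigma_{Sp}^{2}
\end{pmatrix}
\right).
\end{aligned}
```

Under this formulation the pooled sensitivity and specificity are the inverse-logit transforms of \mu_{Se} and \mu_{Sp}, and \rho captures the between-study correlation between the two.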

For Objective 2, studies that compared the PHQ-9 to another validated depression measure were summarized to examine convergent validity. Objective 3 focused on the subset of criterion or convergent validity studies that administered the PHQ-9 and the legacy EPDS depression scale to the same patient sample; for this objective, psychometric characteristics of the two scales were compared, including diagnostic operating characteristics, correlation, categorical agreement, and other metrics reported for both scales.

3. Results

As summarized in the PRISMA diagram (Figure 1), 2229 articles were identified through the literature search, of which 1675 abstracts were screened once duplicates were removed. Of 89 full-text articles reviewed after abstract screening, 35 articles were eligible, including 10 for criterion validity [24–33] and 25 for convergent validity [34–58].

Figure 1. PRISMA flow diagram of study selection

3.1. Study characteristics

Table 1 summarizes the 35 studies included in this review. A total of 19,760 women were included in these studies, with a median sample size of 293 (range, 56 to 3342). There were 15 studies that included pregnant women only, 13 that focused only on the postpartum period, and 7 that included both pregnant and postpartum women. Thirteen studies were conducted in the United States, 9 in Africa, 4 in Europe, 4 in Asia, 4 in Latin America, and 1 in Australia. Across studies, the median of the reported mean ages was 28.2 years (range, 18 to 34.6). Seven of the studies were conducted in an antenatal clinic; 5 in a hospital; 3 each in an obstetrics clinic, a psychiatry setting, the community, a clinical trial, or by a phone survey; 2 each in a postpartum clinic, a family medicine clinic, or a cohort study; and 1 each in pediatrics and an HIV clinic. Of the 19 studies that reported the proportion of the sample having clinically relevant depressive symptoms as noted by the conventional PHQ-9 cutpoint ≥ 10, the median proportion was 18% (range, 5% to 100%).

Table 1.

Studies Examining Criterion or Convergent Validity of PHQ-9 in Perinatal Depression Screening

| Author Year | N | Pregnant | Postpartum | Age mean | Country | Setting * | PHQ ≥10 % | EPDS | Validity Results † |
|---|---|---|---|---|---|---|---|---|---|
| Barthel 2015 | 1024 | X | | 28.7 | Ghana & Côte d'Ivoire | OB Hospital | 30.6 | | Correlation of PHQ-9 with WHO-DAS disability = 0.41. Confirmatory factorial validity |
| Beck 2012 | 80 | | X | 24.7 | USA | PP Clinic | 13.8 | | Correlation of PHQ-9 and PDSS = 0.65. Moderate concordance (no, mild, moderate to severe depression): weighted kappa = 0.40 |
| Brodey 2016 | 879 | X | X | 27.6 | USA | OB Clinics | | X | See Table S1 |
| Buttner 2013 | 478 | | X | 29.6 | USA | Phone survey | 100 | | Of 478 who had PHQ-9 ≥ 10 and a SCID, PPV = 29.1% |
| Davis 2013 | 1392 | | X | 28.5 | USA | Phone survey | 54.1 | | Of 1011 who had PHQ-9 ≥ 10 and a SCID, PPV of PHQ-9 ≥ 10 = 54%. AUC = 0.826 |
| Di Venanzio 2017 | 225 | X | X | 33.9 | Italy | Psych and OB | 31.1 | | Of 70 who had PHQ-9 ≥ 5 and a psychiatric interview, PPV of PHQ-9 ≥ 5 = 56% |
| Flynn 2011 | 185 | X | X | 28.2 | USA | Psychiatry | | X | See Table S1 |
| Gallis 2018 | 1731 | X | | 26.7 | Pakistan | Community | 33 | | See Table S1 |
| Gawlik 2013 | 273 | X | | 32.8 | Germany | OB Clinic | 9 | X | Of 5 patients with minor or major depression by SCID, 4 and 2 exceeded EPDS ≥ 12 and PHQ-9 ≥ 10 cutoffs. Of 266 without depression, 234 and 246 were below cutoffs |
| Gelaye 2017 | 3342 | X | | 28.2 | Peru | Cohort study | | X | Correlation of PHQ-9 and EPDS = 0.51 |
| Gjerdingen 2009 | 506 | | X | 29.1 | USA | Pediatrics | | | See Table S1 |
| Green 2018 | 193 | X | | 30.6 | Kenya | Community | | X | See Table S1 |
| Hanusa 2008 | 135 | | X | 29.5 | USA | Phone survey | 17 | X | Correlation of PHQ-9 & EPDS = 0.75. By DIS, AUC of PHQ-9 & EPDS = 0.80 & 0.88 |
| Harrington 2018 | 299 | X | | 26 * | Malawi | HIV | | X | Good concordance (no, mild, moderate, severe depression) of PHQ-9 and EPDS: weighted kappa = 0.53 |
| Joshi 2020 | 100 | X | | 23.5 | India | AN Clinic | 15 | X | Excellent concordance between PHQ-9 and EPDS. Kappa = 0.76 |
| Kadir 2009 | 293 | | X | 31.5 | Malaysia | Hospital | | X | Correlation of PHQ-9 & EPDS = 0.36. Depression prevalence using EPDS ≥ 12 and PHQ-9 ≥ 5 was 22.5% and 34.8% |
| Kulathilaka 2016 | 255 | X | | 29.6 | Sri Lanka | Hospital | 14.1 | | MDD prevalence by structured interview and PHQ-9 ≥ 10 similar (13.7% vs. 14.1%) |
| Lara 2015 | 210 | X | X | 29.5 | Mexico | AN Clinic | | | PHQ-9 and SCID completed at 3 time points (3rd trimester, 6 wk. and 6 mo. postpartum). Depression by SCID was 9.0%, 13.8%, and 13.3%. Depressive symptoms (PHQ-9 ≥ 10) prevalence was 16.6%, 17.1% and 20.0% |
| Loughnan 2019 | 120 | | X | 32.6 | Australia | Clinical trial | | X | Responsiveness: PHQ-9 and EPDS showed similar between-group differences (effect sizes = 0.99 and 0.90) |
| Maliszewska 2017 | 548 | | X | 30.2 | Poland | Hospital | 13.3 | X | At 4 weeks, 48 (11.7%) patients had EPDS ≥ 13, 61 (14.9%) had PHQ-9 ≥ 10, and 30 (7.3%) exceeded both thresholds. Correlation between the two scales was 0.70 at 4 weeks and 0.60 at 3 mo. (mean = 0.65) |
| Meltzer-Brody 2014 | 91 | X | X | 28.1 | USA | Psychiatry Inpatient | | X | Responsiveness: PHQ-9 and EPDS showed large effect size changes in depression over time (1.32 and 1.85, respectively) |
| Miller 2012 | 541 | X | X | | USA | Family med | 9 | | PPV = 45% (13/29) by clinical interview in those with PHQ-9 ≥ 10 |
| Mochache 2018 | 255 | X | | 20–29 | Kenya | AN Clinic | | X | In 153 patients with an EPDS ≥ 10, PPV of PHQ-9 ≥ 5 = 71.9% |
| Nieminen 2016 | 56 | | X | 34.6 | Sweden | Clinical trial | | | Responsiveness: Beck (BDI) and PHQ-9 showed similar effect size changes (0.29 and 0.20) in depressive symptoms over time |
| Orta 2015 | 1321 | X | | 33.3 | USA | Cohort study | 13.7 | | PHQ-9 ≥ 10 = 13.7%, DASS ≥ 10 = 14.2% and DASS ≥ 14 = 5.9% |
| Osok 2018 | 176 | X | | 18 | Kenya | AN Clinic | 54.5 | X | At least moderate depression was present in 55% by PHQ-9 ≥ 10 and 58% by EPDS ≥ 13 |
| Sanchez 2013 | 959 | | X | 28.3 | Peru | Hospital | 7.4 | | At least moderate depression was present in 7.4% by PHQ-9 ≥ 10 and 7.6% by DASS ≥ 14 |
| Sefogah 2020 | 350 | | X | 20–34 | Ghana | PP Clinic | | X | Of those with PHQ-9 ≥ 5 (n=350), 32.6% had EPDS ≥ 10 |
| Sidebottom 2012 | 745 | X | | 23 | USA | AN Clinics | 18 | | See Table S1 |
| Smith 2010 | 218 | X | | 28.9 | USA | OB Clinics | 5 | | See Table S1 |
| van Heyningen 2018 | 376 | X | X | 26.8 | S. Africa | AN Clinic | | X | See Table S1 |
| Weobong 2009 | 160 | | X | 27.1 | Ghana | Clinical Trial | | X | See Table S1 |
| Woldetensay 2018 | 246 | X | | 24.3 | Ethiopia | Community | 18 | | See Table S1 |
| Yawn 2009 | 481 | | X | 25–29 | USA | Family med | 19 | X | EPDS and PHQ-9 concordant in 399 women (83%), including 326 normal on both scales and 73 elevated on both |
| Zhong 2014 | 1517 | X | | 28 | Peru | AN Clinic | 29 | X | Correlation of PHQ-9 & EPDS = 0.52. At score ≥ 10 on PHQ-9 & EPDS, 29% and 28% exceeded cutpoint. Agreement between 2 scales at this cutpoint was 74%. Weighted kappa using severity categories = 0.35 |

Abbreviations: DASS = Depression Anxiety Stress Scale. DIS = Diagnostic Interview Schedule. EPDS = Edinburgh Postnatal Depression Scale. HRSD = Hamilton Rating Scale for Depression. NHW = non-Hispanic white. PDSS = Postpartum Depression Screening Scale. PPV = positive predictive value. SCID = Structured Clinical Interview for DSM.

* AN = antenatal. PP = postpartum. OB = obstetrics

† Criterion validity studies (n =10) are summarized in Table S1. The other 25 studies examine convergent validity.

3.2. Criterion validity

Table S1 summarizes the 10 studies that examined criterion validity using a structured psychiatric interview. A total of 5,235 women were included in these 10 studies, with a median sample size of 311 (range, 160 to 1731). The most common criterion standard interviews were the SCID (n = 5) and MINI (n = 2). In the 9 studies reporting the prevalence of major depression, the median was 11.3% (range, 3.6% to 72.4%).

Figure 2 shows the forest plots for the 7 studies that used the conventional PHQ-9 threshold score ≥ 10. The pooled sensitivity was 0.84 (95% CI, 0.75 to 0.90) and the pooled specificity was 0.81 (95% CI, 0.74 to 0.86). Of note, the 95% CIs were reasonably narrow suggesting relatively precise point estimates. Figure S1 shows a pooled AUC of 0.89 with a reasonably precise elliptical confidence region. In 3 studies that also examined the PHQ-2, a cutpoint of 2 or 3 on the latter had relatively comparable operating characteristics to the PHQ-9.
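Because several studies in Table 1 report positive predictive values (PPVs), it may help to show how predictive values follow from sensitivity, specificity, and prevalence. The sketch below applies Bayes' rule to the pooled estimates above together with the median prevalence of major depression reported in Section 3.2. Combining these particular numbers is an illustrative assumption, not an analysis performed in this review, and the function name is hypothetical.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a screener applied at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


if __name__ == "__main__":
    # Pooled sensitivity 0.84 and specificity 0.81 (PHQ-9 >= 10);
    # prevalence 0.113 is the median across the 9 studies reporting it.
    ppv, npv = predictive_values(0.84, 0.81, 0.113)
    print(f"PPV = {ppv:.2f}, NPV = {npv:.2f}")
    # The same screener yields a higher PPV when prevalence is higher.
    print(predictive_values(0.84, 0.81, 0.30))
```

The dependence of PPV on prevalence is one reason the PPVs in Table 1 differ so widely across samples.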

Figure 2. Forest plots: Sensitivity (A) and specificity (B) of the 7 diagnostic accuracy studies that used the standard PHQ-9 cutpoint of a score ≥ 10. For sensitivity, total is the number of patients with major depression by the criterion standard interview and events is the number of these patients with a PHQ-9 score ≥ 10 (i.e., true positives). For specificity, total is the number of patients without major depression by the criterion standard interview and events is the number of these patients with a PHQ-9 score < 10 (i.e., true negatives).

3.3. Quality assessment

Table S2 summarizes the QUADAS ratings for the 10 studies that examined criterion validity. A low risk of bias in all 4 domains was achieved by 1 study, in 3 domains by 4 studies, in 2 domains by 2 studies, and in only 1 domain by 3 studies. The study by Sidebottom et al. had the least bias and the second largest sample size (n = 745, or 19% of the total patients in the 7 studies using a PHQ-9 ≥ 10 cutpoint) [27]. Moreover, operating characteristics in the Sidebottom et al. study were higher than the calculated median. The highest risk of bias occurred in the reference standard domain, with only 3 of the 10 criterion validity studies having a low risk of bias. The risk arose because six articles were unclear about whether blinding occurred when the reference standard was administered, and one article [28] explicitly stated a lack of blinding. For the 110 QUADAS questions rated (11 questions across 10 articles), the two raters agreed 85% of the time, for a kappa of 0.47. Disagreements were discussed in order to achieve a consensus on final QUADAS ratings.
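Simple agreement and kappa answer different questions, which is why 85% agreement can coexist with a kappa of 0.47. The sketch below computes both for two raters. The ratings are invented toy data (the actual item-level ratings are not reproduced here), chosen to show how kappa falls when one answer dominates and chance agreement is therefore high.

```python
from collections import Counter


def agreement_and_kappa(rater_a, rater_b):
    """Return (observed agreement, Cohen's unweighted kappa) for two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must rate the same non-empty set of items")
    n = len(rater_a)
    observed = sum(1 for a, b in zip(rater_a, rater_b) if a == b) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected by chance from each rater's marginal frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in set(counts_a) | set(counts_b)) / (n * n)
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Toy ratings (hypothetical): both raters answer "yes" to 8 of 10 questions.
toy_a = ["yes"] * 8 + ["no"] * 2
toy_b = ["yes"] * 7 + ["no", "yes", "no"]

if __name__ == "__main__":
    observed, kappa = agreement_and_kappa(toy_a, toy_b)
    print(f"agreement = {observed:.2f}, kappa = {kappa:.3f}")
```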

3.4. Psychometric comparison of PHQ-9 and EPDS

Of 19 studies administering both the PHQ-9 and EPDS to participants, 15 provided one or more psychometric comparisons between the two scales (Table 2). The median sensitivities of the PHQ-9 and EPDS were remarkably similar (.81 vs .82 in 5 studies), as were the median specificities (.75 vs .73 in 5 studies) and the median AUCs (.86 vs .88 in 5 studies). In 6 studies, the median correlation between the two scales was 0.59. In 4 studies, there was moderate categorical agreement as assessed by simple agreement or kappa (agreement beyond chance). Responsiveness and the rates of moderate depression were examined in two studies each and were similar using the PHQ-9 and EPDS. Finally, test-retest reliability was compared in only 1 study and was somewhat higher for the PHQ-9 (0.75 vs. 0.51).

Table 2.

Comparison of PHQ-9 and EPDS Psychometrics

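| Psychometric | Total # Studies | PHQ-9 Median (Mean) | EPDS Median (Mean) | Individual Study | PHQ-9 | EPDS |
|---|---|---|---|---|---|---|
| Sensitivity | 5 | .81 (.81) | .82 (.81) | | | |
| | | | | Flynn | .82 | .87 |
| | | | | Van Heyningen | .79 | .86 |
| | | | | Weobong | .94 | .78 |
| | | | | Green | .70 | .70 |
| | | | | Brodey | .81 | .82 |
| Specificity | 5 | .75 (.76) | .73 (.74) | | | |
| | | | | Flynn | .69 | .62 |
| | | | | Van Heyningen | .82 | .81 |
| | | | | Weobong | .75 | .73 |
| | | | | Green | .74 | .72 |
| | | | | Brodey | .79 | .81 |
| AUC | 5 | .86 (.85) | .88 (.86) | | | |
| | | | | Flynn | .86 | .89 |
| | | | | Van Heyningen | .89 | .91 |
| | | | | Weobong | .90 | .84 |
| | | | | Green | .79 | .80 |
| | | | | Hanusa | .80 | .88 |
| Correlation | 6 | .59 (.59) | | | | |
| | | | | Flynn | .75 | |
| | | | | Hanusa | .75 | |
| | | | | Maliszewska | .65 | |
| | | | | Zhong | .52 | |
| | | | | Gelaye | .51 | |
| | | | | Kadir | .36 | |
| Categorical agreement | 4 | * | | | | |
| Simple agreement | | | | Yawn | .83 | |
| Kappa, unweighted | | | | Joshi | .76 | |
| Kappa, weighted | | | | Harrington | .53 | |
| Kappa, weighted | | | | Zhong | .35 | |
| Responsiveness (ES) † | 2 | * | | | | |
| | | | | Loughnan | .99 | .90 |
| | | | | Meltzer | 1.32 | 1.85 |
| Moderate dep., % | 2 | * | | Osok | .55 | .58 |
| | | | | Malis | .15 | .12 |
| Test-retest reliability | 1 | * | | Weobong | .75 | .51 |

Correlation and categorical agreement describe the relationship between the two scales, so each study contributes a single value, shown in the PHQ-9 column.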

* Mean (median) not calculated for psychometrics examined in only 1–2 studies or for concordance which was measured in different ways among studies

† Change over time measured in effect size (ES) for which 0.2, 0.5, and 0.8 represent small, moderate and large changes.
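The summary rows of Table 2 can be re-derived from the individual study values. The sketch below does so with Python's standard library; the values are transcribed from Table 2, while the dictionary layout and function name are only illustrative.

```python
from statistics import mean, median

# Individual study values transcribed from Table 2.
TABLE_2 = {
    "sensitivity": {
        "PHQ-9": [0.82, 0.79, 0.94, 0.70, 0.81],
        "EPDS": [0.87, 0.86, 0.78, 0.70, 0.82],
    },
    "specificity": {
        "PHQ-9": [0.69, 0.82, 0.75, 0.74, 0.79],
        "EPDS": [0.62, 0.81, 0.73, 0.72, 0.81],
    },
    "AUC": {
        "PHQ-9": [0.86, 0.89, 0.90, 0.79, 0.80],
        "EPDS": [0.89, 0.91, 0.84, 0.80, 0.88],
    },
    # One correlation per study between the two scales.
    "correlation": {
        "PHQ-9 vs EPDS": [0.75, 0.75, 0.65, 0.52, 0.51, 0.36],
    },
}


def summarize(values):
    """Return (median, mean) of the study-level values, unrounded."""
    return median(values), mean(values)


if __name__ == "__main__":
    for metric, scales in TABLE_2.items():
        for scale, values in scales.items():
            med, avg = summarize(values)
            print(f"{metric:12s} {scale:14s} median={med:.3f} mean={avg:.3f}")
```

With six correlations, the median is the average of the two middle values (.52 and .65), which the table reports as .59 after rounding.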

4. Discussion

Our systematic review has several major findings. First, the PHQ-9 has good diagnostic operating characteristics as a screener for perinatal depression; its pooled sensitivity, specificity, and AUC, all > 0.80, are similar to those of other well-validated depression measures used across a range of clinical conditions [59–63]. Second, the PHQ-9 appears to perform comparably to the EPDS, which heretofore has been the legacy perinatal depression scale. Third, the wide diversity of study samples, countries and clinical settings in which the PHQ-9 has been examined for perinatal depression assessment (Table 1) increases the potential generalizability of our study results to the various settings in which depression screening may be warranted. The widespread use of the PHQ-9 coupled with recommendations for universal perinatal depression screening enhances the importance of our findings.

Notably, the pooled sensitivity and specificity of 0.84 and 0.81 for a PHQ-9 score ≥ 10 from our meta-analysis are very similar to the 0.85 and 0.84 reported for a standard EPDS cutpoint score ≥ 10 in a recent meta-analysis [16]. Paralleling these findings, the studies in our systematic review that compared the PHQ-9 and EPDS in the same patient sample found the two measures to have a comparable median sensitivity, specificity, and AUC (Table 2). Both the correlations and the categorical agreement between the PHQ-9 and EPDS were moderate. However, even agreement between criterion standard interviews can vary substantially [64, 65]. Thus, the comparable results for the PHQ-9 and EPDS in Table 2 suggest that either scale may be a reasonable option for perinatal depression screening. The EPDS and PHQ-9 are similarly brief (10 and 9 items, respectively) and both are freely available. This is in contrast to some depression measures which are 20 items or longer and, in some cases, proprietary.

Certainly, the EPDS has had a larger number of validation studies, warranting further criterion validity studies of the PHQ-9 in larger samples as well as additional head-to-head comparisons with the EPDS. That being said, it is also important to comment on three differences between the EPDS and PHQ-9. First, the EPDS does not include the four somatic symptoms of major depressive disorder (fatigue, sleep disturbance, change in weight or appetite, and psychomotor agitation/retardation). This is sometimes cited as an advantage because such symptoms may be common in pregnancy and not specific for depression. Paradoxically, however, validation studies typically compare the EPDS to criterion standard interviews for major depression that include these four somatic symptoms. Importantly, debates about how to handle somatic symptoms in patients with medical or other conditions have tended to favor an inclusive approach because: a) including somatic symptoms is more sensitive and reliable [66]; and b) these four somatic symptoms demonstrate robust improvement with treatment, and this improvement does not differ significantly between patients with and without medical co-morbidity [67]. Second, the EPDS includes two anxiety symptoms (anxious/worried and scared/panicky) which, although common in perinatal depression, are not part of the diagnostic criteria for depressive disorders. The GAD-7 and the GAD-2 are more specific for anxiety and have been validated for perinatal anxiety screening [68, 69]. Third, the EPDS is a perinatal-specific depression measure whereas the PHQ-9 has the advantage of assessing depression across the lifespan of women [53, 70–72]. Premenstrual and menopausal mood disorders, as well as other chronic depressive disorders, are common in women; moreover, depression is often a recurrent condition. Thus, using a common measure throughout a woman’s life may be preferable to using one measure during the perinatal period and a different measure for assessing depression at other times. Translated into more than 100 languages, the PHQ-9 is in the public domain and is now the most commonly used depression measure in both clinical practice and research [18, 73].

A two-step approach to screening is sometimes used with administration of the PHQ-2 as a first-step screener and, if positive (using a cutpoint of either 2 or 3 on this 6-point scale), completion of the full PHQ-9 [74, 75]. Notably, the three studies in our review that reported on the PHQ-2 showed operating characteristics as a screener that compared favorably with the longer PHQ-9 [24, 25, 30]. Further, a yes-no version of the PHQ-2 (instead of the scored version) with or without a follow-up PHQ-9 was found to have comparable operating characteristics to the EPDS and greater cost-effectiveness [76, 77].
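The two-step logic can be sketched as follows. The PHQ-2 comprises the first two PHQ-9 items, each scored 0 to 3. The function name, return labels, and default cutpoints below are illustrative assumptions; the text above notes that either 2 or 3 is used as the PHQ-2 cutpoint.

```python
from typing import Optional, Sequence


def _check_items(items: Sequence[int], expected_len: int) -> None:
    """Validate that each item is scored 0-3 and the count is as expected."""
    if len(items) != expected_len or any(i not in (0, 1, 2, 3) for i in items):
        raise ValueError(f"expected {expected_len} items scored 0-3")


def two_step_screen(
    first_two: Sequence[int],
    remaining_seven: Optional[Sequence[int]] = None,
    phq2_cutpoint: int = 3,   # assumption: 2 is the other commonly used value
    phq9_cutpoint: int = 10,  # conventional PHQ-9 threshold
) -> str:
    """Return 'negative', 'administer PHQ-9', or 'positive'."""
    _check_items(first_two, 2)
    phq2 = sum(first_two)
    if phq2 < phq2_cutpoint:
        return "negative"
    if remaining_seven is None:
        return "administer PHQ-9"
    _check_items(remaining_seven, 7)
    phq9 = phq2 + sum(remaining_seven)
    return "positive" if phq9 >= phq9_cutpoint else "negative"


if __name__ == "__main__":
    print(two_step_screen([0, 1]))
    print(two_step_screen([2, 1]))
    print(two_step_screen([2, 1], [1, 1, 1, 1, 1, 1, 1]))
```

A lower PHQ-2 cutpoint sends more women on to the full PHQ-9, trading additional questionnaire burden for fewer missed cases at the first step.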

Screening for suicidality is an essential step in evaluating depression. Item 9 of the PHQ-9 assesses suicidal ideation over the last 14 days while item 10 of the EPDS assesses suicidal ideation in the past 7 days. A study of 1,517 pregnant women in Peru found that the two suicidal ideation items had a high concordance rate (84.2%) and moderate agreement beyond chance (kappa = 0.42) [78]. Based on the PHQ-9 and the EPDS, 15.8% and 8.8% of participants screened positive for suicidal ideation, respectively. Responsiveness to treatment (also called sensitivity to change relative to an anchor or criterion, with or without treatment) is another valuable attribute of a depression measure in clinical settings. The PHQ-9 has proven sensitive to change in prior studies [72, 79–81] and its responsiveness was similar to that of the EPDS in our review [47, 49].

Anxiety is as prevalent as depression in the perinatal population and the two conditions co-occur 40% of the time. Thus, guidelines frequently recommend joint screening for anxiety and depression [2, 3, 82–85]. The EPDS has two anxiety items and has AUCs of 0.62 to 0.73 in screening for perinatal anxiety [85]. However, the GAD-7 is more commonly recommended if anxiety screening is desired and in one study proved superior to the EPDS [86]. The PHQ-4 combines the PHQ-2 depression scale and GAD-2 anxiety scale and has proven an effective screener in an international study of 1,148 pregnant women [87].

Prior systematic reviews [4, 15, 88, 89] may have differed in their conclusions regarding depression screeners due to the particular studies included (or criteria for excluding studies), wide variation among studies/samples for findings regarding the same scale, and the smaller number of studies doing head-to-head comparisons of 2 or more scales against a structured psychiatric interview. Also, a recent meta-analysis found that depression prevalence varied widely among perinatal depression studies [90], a finding consistent with the studies included in our systematic review.

One limitation of our findings is that the risk of bias varied among the criterion validity studies, with only 5 of the 10 studies having a low risk of bias in at least 3 of the 4 QUADAS domains. Also, only 7 of the 10 studies used the standard PHQ-9 ≥ 10 cutpoint. However, the remarkable similarity that we found between the operating characteristics of the PHQ-9 and EPDS is reassuring. Whereas the EPDS has been widely studied for perinatal depression screening, our review is the first to systematically compare its performance with the PHQ-9 in studies administering both scales.

Examples of the increased use of the PHQ-9 during pregnancy and the postpartum period include serving as the criterion standard or primary outcome in clinical trials [91–93] and large cohort studies [94, 95] as well as perinatal screening in large healthcare systems [96, 97]. Interestingly, one study found the PHQ-9 and EPDS comparable in a non-pregnant population which also included men [98]. Also, both scales have been used to screen for antenatal and postnatal paternal depression [99–102]. Results from ongoing studies comparing the PHQ-9 and EPDS may further inform scale selection [103, 104]. Treatment of perinatal depression has proven effective in multiple trials [105], further highlighting the importance of screening with efficient and evidence-based measures.