Reproducibility of assessment of severity of pelvic endometriosis using transvaginal ultrasound

Correspondence to: Dr D. Jurkovic, Department of Obstetrics and Gynaecology, University College Hospital, 235 Euston Road, London, NW1 2BU, UK (e-mail: davor.jurkovic@uclh.nhs.uk)

ABSTRACT

Objective

To examine the reproducibility of assessment of severity of pelvic endometriosis by transvaginal sonography (TVS).

Methods

This was a prospective observational study conducted from August 2006 to July 2009 in two academic departments of obstetrics and gynecology. Women with clinically suspected or proven pelvic endometriosis were invited to join the study. All patients included underwent TVS performed by two observers and a laparoscopic assessment of pelvic endometriosis. The ultrasound observers were blinded to each other's results. The reproducibility of TVS was examined by evaluating interobserver agreement for the American Society of Reproductive Medicine (ASRM) score using Bland–Altman analysis, and for the disease stage and the diagnosis of deeply infiltrating endometriosis (DIE) by calculating kappa coefficients. Agreement between the findings on TVS for each observer and those on laparoscopy was also evaluated.

Results

Thirty-four patients were recruited to the study, and TVS was performed by two ultrasound observers. Of these patients, one did not undergo laparoscopy and was therefore excluded from the final analysis. No endometriosis was found in 12 (36.4%) patients. One patient (3%) had minimal disease, one (3%) had mild disease, five (15.2%) had moderate disease and 14 (42.4%) had severe disease. Interobserver agreement was very good for disease classification on TVS (Cohen's kappa, 0.931). Agreement between TVS and laparoscopy findings was also very good (Cohen's kappa, 0.955 and 0.966 for the two examiners). For ASRM score on TVS, the interobserver 95% limits of agreement were –16.6 to 12.7, with a mean difference of –1.9 (95% CI, –4.35 to 0.71).

Conclusion

TVS is a reproducible method for assessment of the severity of pelvic endometriosis and shows good agreement with findings on laparoscopy.

INTRODUCTION

The development of non-invasive techniques to establish the presence and severity of pelvic endometriosis is valuable in a number of ways: to guide patient choice regarding medical or surgical treatment; to plan fertility or medical treatment if surgery is not chosen; to enable referral to the most appropriate center and surgeon if surgery is chosen; to enable the surgeon to counsel the patient about the likely extent of surgery and potential risks; and to allow the surgeon to prepare sufficiently for surgery, including the involvement of other specialties as indicated.

There is no clear consensus regarding definition of the severity of endometriosis[1, 2] and the most commonly used classification, the American Society of Reproductive Medicine (ASRM) classification[3], has both advantages and disadvantages. The advantages of this classification are that it is widely used in clinical practice and it provides a formalized systematic approach to documenting the impact of the disease on the patient's fertility. However, many authors recognize that the features of deeply infiltrating endometriosis (DIE) are often the most symptomatic[4, 5] and difficult to treat[6]. These features are poorly represented in the ASRM classification[1] and therefore need to be documented separately.

It has recently been shown that targeted transvaginal sonography (TVS) is an accurate test to establish the severity of pelvic endometriosis by assessing both the ASRM stage and the extent of DIE[2]. However, the interobserver reproducibility of these results has not been formally tested. In this study we examined the interobserver reproducibility of preoperative TVS to assess the severity of pelvic endometriosis and we also compared the findings on TVS with those on laparoscopy. The aim of the study was to establish whether TVS is a reproducible technique for preoperative assessment of the severity of pelvic endometriosis.

METHODS

This was a prospective, observational, multicenter study, which was conducted at King's College Hospital and at University College Hospital in London. These are major teaching hospitals and the latter included a specialist tertiary endometriosis center. Women with clinically suspected or proven pelvic endometriosis were invited to join the study. The inclusion criteria were: premenopausal status with a clinical suspicion of endometriosis awaiting diagnostic laparoscopy or diagnosis of pelvic endometriosis on diagnostic laparoscopy in a woman awaiting operative treatment; age of 16 years or over; ability to provide informed consent. Women who could not undergo a transvaginal ultrasound scan and those who became pregnant whilst awaiting surgery were excluded from the study.

The study was ethically approved and an information leaflet was given to all eligible women before assessment. Informed consent was obtained from all patients who agreed to take part in the study.

All women were assessed separately by both attending clinicians who obtained a detailed history that was recorded in a dedicated clinical database (ViewPoint; GE Healthcare, Fairfield, CT, USA). Women were asked specifically about symptoms associated with endometriosis such as dysmenorrhea, chronic pelvic pain, dyspareunia, subfertility, dyschezia and cyclic rectal bleeding.

Transvaginal ultrasound examination was performed by two ultrasound observers (A and B) who were both gynecologists with a high level of expertise in gynecological ultrasonography. All patients were examined by both ultrasound observers and the observers were blinded to each other's findings. The ultrasound observers were also blinded to clinical and previous surgical findings. All patients were operated on by one of two different laparoscopic surgeons with a high level of expertise in laparoscopic surgery. When moderate or severe disease or DIE was present a complete surgical exploration of the pelvis was performed, involving dissection of the pouch of Douglas (when obliterated) and resection of any DIE, especially rectovaginal or bowel. The operating surgeons were blinded to the detailed transvaginal ultrasound findings.

The process of TVS assessment of pelvic endometriosis severity has been described previously[2]. The only refinement in the present study is that rectovaginal endometriosis was defined as disease affecting the posterior pelvic compartment with evidence of endometriotic nodules which are located between the rectum and posterior fornix of the vagina and/or posterior aspect of the cervix. All of the features of pelvic endometriosis on TVS were documented and scored using the ASRM classification[3]. The score was used to grade the disease as absent (0), minimal (1–5), mild (6–15), moderate (16–40) or severe (> 40). DIE is given a maximum score of 6 using the ASRM classification and we therefore also recorded the presence of these lesions separately. All findings were recorded in a database file using a Microsoft Excel for Windows spreadsheet to facilitate data entry and retrieval. The severity of endometriosis as assessed by TVS was compared with laparoscopic findings using the same ASRM classification.
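The grading thresholds above can be written as a short function. This is an illustrative sketch only and not part of the study's own tooling (findings were recorded in a spreadsheet); the function name `asrm_stage` is hypothetical.

```python
def asrm_stage(score: int) -> str:
    """Map a total ASRM score to the disease grade used in this study.

    Thresholds follow the text: absent (0), minimal (1-5), mild (6-15),
    moderate (16-40), severe (> 40).
    """
    if score < 0:
        raise ValueError("ASRM score cannot be negative")
    if score == 0:
        return "absent"
    if score <= 5:
        return "minimal"
    if score <= 15:
        return "mild"
    if score <= 40:
        return "moderate"
    return "severe"


if __name__ == "__main__":
    for example_score in (0, 4, 15, 16, 41):
        print(example_score, asrm_stage(example_score))
```

Because DIE contributes at most 6 points to the ASRM score, a grade derived this way does not capture the extent of DIE, which is why those lesions were also recorded separately.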

Statistical analysis

All statistical analyses were carried out using MedCalc version 9.2.0.2 (MedCalc Software, Mariakerke, Belgium). Interobserver agreement for the ASRM score on TVS and agreement between the findings of each observer using TVS and those on laparoscopy were evaluated using Bland–Altman plots, in which differences in score were plotted against the mean of the two scores, allowing assessment of the relationship between any differences and the magnitude of the scores. Systematic bias between the two observers was evaluated by calculating the 95% CI of the mean difference (mean difference ± 2 standard errors (SE)); if 0 lay within this interval no bias was assumed to exist[7, 8]. Correlation coefficients for ASRM scores between the observers on TVS and between TVS and laparoscopy were also calculated.
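The Bland–Altman quantities described above can be sketched in a few lines of Python. This is not the MedCalc procedure used in the study. The limits of agreement are assumed here to follow the conventional definition (mean difference ± 1.96 SD of the differences), which the text does not state explicitly, and the example scores are hypothetical rather than study data.

```python
import math
import statistics


def bland_altman(scores_a, scores_b):
    """Summarise agreement between two sets of paired scores."""
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    if len(scores_a) < 2:
        raise ValueError("at least two paired scores are required")

    # Each plotted point is (mean of the two scores, difference between them).
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    means = [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    se_mean = sd_diff / math.sqrt(len(diffs))

    # Assumption: conventional 95% limits of agreement (mean +/- 1.96 SD).
    limits = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    # As in the text: 95% CI of the mean difference taken as mean +/- 2 SE.
    ci_low, ci_high = mean_diff - 2 * se_mean, mean_diff + 2 * se_mean

    return {
        "mean_diff": mean_diff,
        "limits_of_agreement": limits,
        "ci_mean_diff": (ci_low, ci_high),
        "no_bias_assumed": ci_low <= 0 <= ci_high,
        "points": list(zip(means, diffs)),
    }


if __name__ == "__main__":
    # Hypothetical ASRM scores from two observers (not study data).
    observer_a = [0, 4, 12, 30, 56, 88]
    observer_b = [0, 6, 10, 34, 52, 92]
    print(bland_altman(observer_a, observer_b))
```

Plotting the returned `points` gives the Bland–Altman plot itself, from which any trend in the differences with increasing score can be inspected visually.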

Cohen's quadratic weighted kappa coefficient was calculated to determine agreement between the two observers on TVS and the findings on laparoscopy in classifying both the stage of the disease and the presence of individual features of the disease. Kappa values of 0.81–1.0 were taken to indicate very good agreement, 0.61–0.80 good agreement, 0.41–0.60 moderate agreement, 0.21–0.40 fair agreement and ≤ 0.20 poor agreement[9]. Two SE estimates are given for each quadratic weighted kappa coefficient: SE (Kw′ = 0) is appropriate for testing the hypothesis that the underlying value of weighted kappa is zero and SE (Kw′ ≠ 0) is appropriate for testing the hypothesis that the underlying value of weighted kappa is equal to a prespecified value other than zero. Measures of diagnostic accuracy for individual features of endometriosis were calculated for each of the observers on TVS with respect to the findings on laparoscopy.
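For readers unfamiliar with the statistic, a minimal sketch of quadratic weighted kappa and the interpretation bands is given below. It is not the MedCalc implementation and does not compute the two SE estimates. Categories are assumed to be coded as integers from 0 (absent) to 4 (severe), the function names are hypothetical, and values falling between the stated bands (for example 0.805) are assigned to the lower band as an assumption.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, n_categories):
    """Quadratic weighted kappa for two raters.

    Ratings are integers in the range 0 .. n_categories - 1.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k, n = n_categories, len(ratings_a)

    # Observed cross-classification of the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[a][b] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            # Quadratic disagreement weight: 0 on the diagonal, 1 at the extremes.
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        raise ValueError("kappa is undefined when both raters use one category only")
    return 1 - weighted_observed / weighted_expected


def interpret_kappa(kappa):
    """Agreement bands as defined in the text."""
    if kappa > 0.80:
        return "very good"
    if kappa > 0.60:
        return "good"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    return "poor"


if __name__ == "__main__":
    # Hypothetical stage codes for five patients (not study data).
    stage_a = [0, 1, 2, 3, 4]
    stage_b = [0, 1, 2, 3, 3]
    kappa = quadratic_weighted_kappa(stage_a, stage_b, n_categories=5)
    print(round(kappa, 3), interpret_kappa(kappa))
```

The quadratic weights penalise a disagreement of several stages (for example absent versus severe) far more than a disagreement between adjacent stages, which suits an ordered classification such as the ASRM stage.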

RESULTS

In the 3-year period from August 2006 to July 2009, 34 patients were recruited to the study and all received a TVS examination performed by two ultrasound observers. Of these patients, one cancelled her laparoscopy and was therefore excluded from the final analysis. The mean age of the patients at recruitment was 33.8 (range, 18–51) years. The principal presenting symptoms were: dysmenorrhea (29/33; 87.9%), dyspareunia (20/33; 60.6%), chronic pelvic pain (19/33; 57.6%), infertility (11/33; 33.3%), dyschezia (11/33; 33.3%) and cyclic rectal bleeding (1/33; 3%). Six (18.2%) patients had a single symptom, four (12.1%) had two symptoms, 11 (33.3%) had three symptoms and 12 (36.4%) had four or more symptoms. At laparoscopy, no endometriosis was found in 12 (36.4%) of these patients, one (3%) had minimal disease, one (3%) had mild disease, five (15.2%) had moderate disease and 14 (42.4%) had severe disease. The mean interval between scan and laparoscopy was 35.2 (95% CI, 32.0–38.6; SD, 21.1) days.

There was a very good overall level of agreement between the ultrasound observers in identifying absent, minimal, mild, moderate and severe disease (quadratic weighted kappa, 0.931; SE (Kw′ = 0), 0.172; SE (Kw′ ≠ 0), 0.034). Agreement between the TVS and laparoscopy findings for stage of the disease was also very good (kappa, 0.955; SE (Kw′ = 0), 0.174; SE (Kw′ ≠ 0), 0.021 for Observer A; and kappa, 0.966; SE (Kw′ = 0), 0.174; SE (Kw′ ≠ 0), 0.019 for Observer B). The correlation coefficient for the ASRM scores was 0.987 (95% CI, 0.973–0.993) for Observers A and B on TVS, 0.963 (95% CI, 0.926–0.982) for Observer A and laparoscopy and 0.966 (95% CI, 0.932–0.983) for Observer B and laparoscopy. On Bland–Altman analysis, the interobserver 95% limits of agreement for the ASRM score on TVS were –16.6 to 12.7 with a mean difference of –1.9 (95% CI, –4.35 to 0.71) (Figure 1a). As the confidence interval for the mean difference contains zero, no bias was assumed to exist. The 95% limits of agreement for Observers A and B in comparison to the ASRM scores on laparoscopy were –26.1 to 20.9 and –23.0 to 21.8, respectively, with non-significant mean bias in each case (Figures 1b and c). Visual inspection of the Bland–Altman plots also revealed that the magnitude of differences for each comparison did not change with increasing score magnitude.

Figure 1. Bland–Altman plots for agreement in American Society of Reproductive Medicine (ASRM) scores between Observers A and B on transvaginal sonography (TVS) (a), between Observer A on TVS and score at laparoscopy (b) and between Observer B on TVS and score at laparoscopy (c).

Table 1 shows the interobserver agreement between Observers A and B on TVS, and between each of the observers and laparoscopy, for assessing the individual features of severe endometriosis. Table 2 shows the sensitivity, specificity, positive and negative predictive values, likelihood ratios and area under the receiver–operating characteristics curve for Observers A and B for assessing the individual features of severe endometriosis with respect to the findings on laparoscopy. There was one case of bladder DIE, which was correctly identified by both observers with no false positives or negatives (100% accuracy). There was also one case of pelvic side wall DIE, which was correctly identified by Observer B but not by Observer A. Because there was only one case each of bladder DIE and pelvic side wall DIE, these features were not included in Table 2, as statistical analysis was not possible.
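The point estimates reported in Table 2 follow from a 2 × 2 table of TVS against laparoscopy, as the sketch below illustrates. The counts used in the example (18 true positives, 1 false negative, 1 false positive, 13 true negatives) are not given in the paper; they are back-calculated from the reported prevalence of pouch of Douglas obliteration (19/33) and Observer A's sensitivity and specificity, and should be read as an inference. The function name is hypothetical, confidence intervals are not computed, and the area under the curve is taken as (sensitivity + specificity) / 2, which holds for a binary test with a single operating point.

```python
def diagnostic_accuracy(tp, fp, fn, tn):
    """Point estimates of diagnostic accuracy from a 2 x 2 table.

    Ratios with a zero denominator are returned as None.
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else None

    sensitivity = ratio(tp, tp + fn)
    specificity = ratio(tn, tn + fp)
    if sensitivity is None or specificity is None:
        raise ValueError("need at least one diseased and one non-diseased case")

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        # LR+ is undefined when there are no false positives (specificity 100%).
        "lr_positive": ratio(sensitivity, 1 - specificity) if fp else None,
        # LR- is undefined when there are no true negatives (specificity 0%).
        "lr_negative": ratio(1 - sensitivity, specificity) if tn else None,
        # Assumption: single operating point, so AUC = (sens + spec) / 2.
        "auc": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Counts inferred from the reported percentages (an assumption, see lead-in).
    pod_observer_a = diagnostic_accuracy(tp=18, fp=1, fn=1, tn=13)
    for name, value in pod_observer_a.items():
        print(name, None if value is None else round(value, 3))
```

When specificity is 100% there are no false positives, so the positive likelihood ratio cannot be calculated; this is consistent with the absence of an LR+ value in the corresponding rows of Table 2.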

Table 1. Interobserver agreement between Observers A and B on transvaginal sonography (TVS), and between each of the observers and laparoscopy, in assessing the individual features of severe endometriosis

| Feature of DIE | Prevalence (n (%)) | Observers A and B on TVS | Observer A on TVS and laparoscopy | Observer B on TVS and laparoscopy |
|---|---|---|---|---|
| Bladder DIE | 1/33 (3.0) | 1 (0.171, 0.000) | 1 (0.171, 0.000) | 1 (0.171, 0.000) |
| POD obliteration (partial or complete) | 19/33 (57.6) | 0.947 (0.171, 0.031) | 0.963 (0.174, 0.026) | 0.982 (0.174, 0.018) |
| Ovarian adhesions overall | 34/66 (51.5) | 0.927 (0.123, 0.039) | 0.751 (0.123, 0.073) | 0.837 (0.122, 0.056) |
| Bowel DIE | 12/33 (36.4) | 0.555 (0.164, 0.155) | 0.644 (0.160, 0.137) | 0.934 (0.171, 0.654) |
| RV DIE | 11/33 (33.3) | 0.530 (0.165, 0.180) | 0.530 (0.151, 0.154) | 0.783 (0.167, 0.177) |
| USL DIE | 9/66 (13.6) | 0.645 (0.115, 0.187) | 0.463 (0.104, 0.176) | 0.327 (0.120, 0.171) |
| PSW DIE | 1/66 (1.5) | 0.548 (0.122, 0.232) | −0.023 (0.106, 0.018) | 0.385 (0.097, 0.274) |

Agreement values are Cohen's kappa (SE (Kw′ = 0), SE (Kw′ ≠ 0)). DIE, deeply infiltrating endometriosis; POD, pouch of Douglas; PSW, pelvic side wall; RV, rectovaginal; SE, standard error; USL, uterosacral ligament.

Table 2. Sensitivity, specificity, positive and negative predictive values (PPV and NPV), positive and negative likelihood ratios (LR+ and LR–) and area under the receiver–operating characteristics curve (AUC) for Observers A and B in assessing specific features of severe endometriosis on transvaginal sonography, with respect to findings on laparoscopy

| Feature of DIE | Observer | Sensitivity (% (95% CI)) | Specificity (% (95% CI)) | PPV (% (95% CI)) | NPV (% (95% CI)) | LR+ (95% CI) | LR– (95% CI) | AUC (95% CI) |
|---|---|---|---|---|---|---|---|---|
| POD adhesions (partial or complete obliteration) | A | 94.7 (73.9–99.1) | 92.9 (66.1–98.8) | 94.7 (73.9–99.1) | 92.9 (66.1–98.8) | 13.3 (2.0–87.9) | 0.057 (0.008–0.384) | 0.938 (0.795–0.990) |
| | B | 94.7 (73.9–99.1) | 100 (76.7–100) | 100 (82.4–100) | 93.3 (70.2–98.8) | | 0.053 (0.008–0.355) | 0.974 (0.848–0.994) |
| Ovarian adhesions | A | 73.5 (55.6–87.1) | 75 (56.6–88.5) | 75.8 (59.0–87.2) | 72.7 (55.8–84.9) | 2.94 (1.56–5.54) | 0.35 (0.195–0.64) | 0.743 (0.620–0.842) |
| | B | 82.3 (65.5–93.2) | 100 (89.0–100) | 100 (87.9–100) | 84.2 (69.6–92.5) | | 0.18 (0.085–0.365) | 0.912 (0.816–0.967) |
| Bowel DIE | A | 58.3 (27.8–84.7) | 100 (83.7–100) | 100 (64.6–100) | 80.0 (62.1–91.5) | | 0.42 (0.213–0.814) | 0.792 (0.616–0.912) |
| | B | 91.7 (61.5–98.6) | 100 (83.7–100) | 100 (74.1–100) | 95.5 (78.2–99.2) | | 0.083 (0.013–0.544) | 0.958 (0.825–0.995) |
| RV DIE | A | 45.5 (16.9–76.5) | 100 (84.4–100) | 100 (56.5–100) | 78.6 (60.5–89.8) | | 0.55 (0.318–0.935) | 0.727 (0.545–0.867) |
| | B | 72.7 (39.1–93.7) | 100 (84.4–100) | 100 (67.6–100) | 88.0 (70.0–95.8) | | 0.27 (0.104–0.716) | 0.864 (0.699–0.957) |
| USL DIE | A | 33.3 (7.9–69.9) | 100 (93.7–100) | 100 (43.8–100) | 90.5 (80.7–95.6) | | 0.67 (0.42–1.058) | 0.667 (0.540–0.778) |
| | B | 33.3 (7.9–69.9) | 94.7 (85.4–98.8) | 50.0 (18.8–81.2) | 90.0 (80.0–95.3) | 6.33 (1.50–26.7) | 0.70 (0.442–1.121) | 0.640 (0.513–0.755) |

DIE, deeply infiltrating endometriosis; POD, pouch of Douglas; RV, rectovaginal; USL, uterosacral ligament.

DISCUSSION

Assessment of the reproducibility of a diagnostic test is an essential part of evaluating its accuracy. This is the first study to prospectively examine interobserver variability of TVS in the assessment of pelvic endometriosis. Overall, TVS performed well in assessing the stage of disease, with very good levels of agreement between two observers, and between each of the observers and the stage found at laparoscopy. There were also very high correlation coefficients for the ASRM score between the two observers, and between these observers' scores and the scores given to the findings at laparoscopy.

When detection rates for individual features of endometriosis were compared between the observers, there were very good levels of agreement regarding diagnosis of endometriosis of the bladder, ovarian adhesions and pouch of Douglas obliteration. The high level of agreement for the diagnosis of bladder endometriosis is concordant with previous studies, which showed a high level of accuracy in the TVS diagnosis of bladder endometriosis[6, 10]. There was also a good level of agreement for the diagnosis of ovarian adhesions. The preoperative diagnosis of ovarian adhesions has not been extensively investigated; however, Okaro et al.[11] found a high level of accuracy with a kappa of 0.80. Prior to this, in a study by Guerriero et al.[12] in 1997, a moderate level of accuracy was found (kappa, 0.50). The preoperative diagnosis of partial or complete obliteration of the pouch of Douglas has not previously been reported directly. Hudelist et al.[13] reported a high degree of accuracy for the diagnosis of pouch of Douglas endometriosis but did not report obliteration separately.

In the present study, there were moderate levels of agreement between Observers A and B regarding rectovaginal and bowel endometriosis. There were high levels of agreement between Observer B and laparoscopy for both these features. However, Observer A did not perform as well for these features and this is likely to be the cause of a reduced level of agreement between Observers A and B. The accuracy reported by Observer B for bowel lesions is in keeping with the high level of accuracy reported in previous studies for the diagnosis of endometriosis in these areas[10, 14-17].

There were fair to poor levels of agreement for endometriosis affecting the uterosacral ligaments and pelvic side walls. This could be explained by the low prevalence of endometriosis in these areas in the study population. In our study the accuracy of the diagnosis of uterosacral ligament DIE was lower than previously reported[6, 10]. The diagnosis of endometriosis in these locations is not critical for management, as surgical excision can usually be achieved without involvement of other surgical specialists. However, all operators should strive to achieve maximum accuracy in detection of these lesions as their presence in symptomatic patients is an indication for surgery.

In our study, we were not able to assess the intraobserver reproducibility of TVS as all patients were examined in real time and it was not felt to be appropriate to subject patients to additional repeated examinations solely for the purpose of this study. Further research into the intraobserver variability of TVS in the diagnosis of endometriosis would be valuable.

Although there are no studies assessing the reproducibility of TVS in the diagnosis of endometriosis, previous studies have examined the interobserver variability of laparoscopy for the diagnosis of endometriosis. Interestingly, although laparoscopy is currently the gold standard for diagnosis, previous studies have shown significant intra- and interobserver variability. Hornstein et al.[18] asked five specialists to view video recordings of laparoscopies in patients with endometriosis and to score them according to the ASRM classification. Each video was viewed twice in order to assess intraobserver variability. The greatest variability occurred in endometriosis of the ovary and cul-de-sac obliteration, with less variability observed for peritoneal endometriosis and for ovarian and tubal adhesions. In 2005 Buchweitz et al.[19] asked 108 gynecological surgeons to view three videos of laparoscopies. The patients had Stage I disease, Stage II disease or no disease. The surgeons were asked to indicate the endometriotic lesions on a prepared surgical sketch and to classify the site according to the ASRM classification. They found a correct classification of endometriotic disease Stages I and II in only 22% and 13% of the cases, respectively, and concluded that the visual assessment of a video of laparoscopy in cases of minimal and mild endometriosis is subject to considerable interobserver variability. In 2007 Weijenborg et al.[20] asked two observers to review 90 videos of laparoscopies and found a high level of agreement in the stage of the disease but fair to moderate levels of agreement regarding the presence or absence of adhesions in the intra- and interobserver setting, respectively.

We have presented data regarding agreement between Observers A and B and laparoscopy findings for stage of disease, ASRM score and individual features of endometriosis in order to explain some of the variation in interobserver agreement. The data show high levels of agreement for Observers A and B with respect to laparoscopy findings for the stage of disease and the ASRM score. However, when there was poorer diagnostic accuracy for a specific feature of disease, the interobserver agreement worsened.

A weakness of this study is the small number of cases, especially in the early stages of disease. Further larger studies would be helpful to validate our findings.

In conclusion, our results show that TVS is a suitable and reproducible method for the initial assessment of patients with suspected endometriosis. Although laparoscopy remains the gold standard for diagnosis of endometriosis, our results confirm that TVS may be used to appropriately triage patients with severe disease for referral to local specialists or tertiary endometriosis centers.
