Interobserver variability

Comparison between liquid-based and conventional preparations in gynecologic cytology

Abstract

BACKGROUND

Studies have shown that the ThinPrep Papanicolaou test (TP) increases the detection of epithelial cell abnormalities compared with the conventional preparation. Little is known about the interobserver variability of reporting gynecologic cytology results using the TP preparation and its comparison with results obtained using the conventional method.

METHODS

To compare the interobserver variability between the TP method and the conventional method for reporting the diagnoses of gynecologic cytology, 20 pairs of conventional and TP slides (total, 40 slides) that were prepared from split samples were evaluated blindly by 19 cytotechnologists from three different laboratories. Each reviewer was asked to categorize each slide into the following five categories: within normal limits, benign cellular changes, atypical squamous cells of undetermined significance, low-grade squamous intraepithelial lesion (LSIL), and high-grade squamous intraepithelial lesion (HSIL). For both conventional and TP preparations, interobserver variability was analyzed using Spearman rank correlation coefficients. The mean correlation coefficients (weak, 0.0–0.4; fair, 0.4–0.7; and strong, 0.7–1.0) between the TP method and the conventional method were then compared.

RESULTS

The overall interobserver agreement as well as interobserver agreement within each laboratory was good for both TP and conventional preparations. Based on the set of conventional cervical smears, only one slide that was diagnosed as HSIL had unanimous agreement; whereas, based on the set of TP slides, three slides, including two diagnosed as HSIL and one diagnosed as LSIL, had a unanimous diagnosis. The difference in the interobserver agreement between TP and conventional methods, based on comparing their mean ± standard deviation correlation coefficients (TP method, 0.84 ± 0.081; conventional method, 0.82 ± 0.105; P < 0.001), was statistically significant.

CONCLUSIONS

Interobserver agreement in reporting gynecologic cytology using the TP method is good, particularly for squamous intraepithelial lesions, and appears to be superior to the conventional method. Cancer (Cancer Cytopathol) 2002;96:000–000. © 2002 American Cancer Society.

One of the objectives of the Bethesda System was to provide a uniform terminology for reporting cytologic diagnoses and to develop specific criteria for each diagnostic category.1 However, many studies have demonstrated that interobserver agreement is poor, particularly in instances of equivocal results, such as the diagnostic category atypical squamous cells of undetermined significance (ASCUS).2–12 This is reflected by the fact that the rate of ASCUS and the likelihood of an underlying squamous intraepithelial lesion (SIL) after a diagnosis of ASCUS vary with different laboratories.

Several technologies have been introduced as alternative preparation methods in diagnostic cytology. One of these innovative developments is the ThinPrep Papanicolaou (Pap) test (TP; Cytyc Corporation, Boxborough, MA), a liquid-based preparation method. The TP was approved by the Food and Drug Administration in 1996 as an alternative for the conventional method of preparing cervical slides. The clinical trial using a split-sample/matched-pair protocol demonstrated that the TP resulted in a significant increase (up to 65%) in the detection of cervical carcinoma precursors compared with the conventional method.13 There also was an improvement in specimen adequacy and a decrease in unsatisfactory samples. Subsequent studies using a direct-to-vial protocol also demonstrated significant improvements in diagnostic yield and specimen adequacy as well as a decrease in the number of equivocal specimens, thus reinforcing earlier data.14, 15

Little is known about the interobserver reproducibility of reporting gynecologic cytologic diagnoses using TP. A recent study reported that interobserver agreement using TP was good for diagnoses of within normal limits (WNL) or SIL but poor for ASCUS.16 To the best of our knowledge, no study has compared interobserver variability between TP and the conventional method of collecting and preparing cervical smears. Therefore, the purpose of this study was to assess interobserver variability in classifying squamous lesions between the TP method and the conventional method among individual observers and laboratories with different levels of experience in TP gynecologic cytology.

MATERIALS AND METHODS

Twenty TP slides and their matched conventional smears were retrieved from the files of the Department of Pathology at the University of Alabama at Birmingham. These split-sample/matched pairs were part of a pilot study conducted when TP was first introduced into our laboratory. Specimens for both conventional and TP slides were obtained during the same gynecologic examination from 20 premenopausal women. The specimens were collected using an Ayre spatula and cytobrush. Conventional smears were made by smearing cellular material obtained with the collection device onto the slide, followed by immediate fixation with a spray fixative. The remaining cellular material still attached to the spatula and brush after preparing the conventional smears was rinsed into a vial of PreservCyt Solution (Cytyc Corporation). This material was processed with the ThinPrep 2000 automated slide processor (Cytyc Corporation). All slides were reviewed, and their original cytologic diagnoses were confirmed.

The slides were masked and were labeled randomly. Nineteen observers from the laboratories of three academic institutions then reviewed these 20 pairs of cervical smears blindly. At the time of the study, TP had been in use in Laboratories A and B for 4 years. TP accounted for almost 100% and 20% of all gynecologic specimens in Laboratories A and B, respectively. TP was not implemented in Laboratory C during the study period. Reviewers were asked to categorize each slide into the following five categories: WNL, benign cellular changes (BCC), ASCUS, low-grade SIL (LSIL), and high-grade SIL (HSIL), based on the criteria set forth by the Bethesda System. Information regarding years of experience and any formal or informal TP training attended by the reviewers also was obtained. Formal training in the evaluation of TP cervical smears referred to the training program provided by Cytyc Corporation or by organizations or individuals designated by Cytyc Corporation. Informal training referred to attendance at any TP symposium or workshop.

All necessary underlying statistical assumptions were checked prior to the application of a given statistical test. A one-way analysis of variance (ANOVA) was used to determine whether there was any difference in the number of years of experience for the observers among the three laboratories. Interobserver agreement was analyzed by comparing all pair-wise combinations of reviewers both within laboratories and between laboratories. Because of the ordinal nature of the diagnostic categories, each category was assigned a numeric score. For example, those with a diagnosis of WNL were assigned a score of 1, whereas those diagnosed with HSIL were assigned a score of 5. Spearman rank correlation coefficients were used to determine the direction and extent of interobserver and intraobserver agreement. Coefficients ranged from −1.0 to 1.0. A coefficient of −1.0 would indicate a perfect inverse correlation, a coefficient of 1.0 would indicate perfect positive agreement, and a coefficient of 0 would indicate no association. A coefficient ≥ 0.75 was interpreted as strong positive agreement, a coefficient in the range of 0.4–0.75 was interpreted as fair positive agreement, and a coefficient ≤ 0.4 was interpreted as weak positive agreement. The paired t test was used to compare mean correlation coefficients. All significance tests were two-sided at the α = 0.05 level. Analyses were performed using SAS software (version 8.0; SAS Institute Inc., Cary, NC).
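The pairwise Spearman analysis described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the SAS code used in the study. The text states only that WNL was scored 1 and HSIL 5; the intermediate scores (BCC = 2, ASCUS = 3, LSIL = 4) are an assumption based on the order in which the categories are listed, and all function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean

# WNL = 1 and HSIL = 5 are stated in the text; the intermediate
# scores are ASSUMED from the order of the categories.
SCORES = {"WNL": 1, "BCC": 2, "ASCUS": 3, "LSIL": 4, "HSIL": 5}


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the tie-averaged ranks.

    Undefined (division by zero) if either observer gives every slide
    the same score.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(var_x * var_y)


def mean_pairwise_rho(ratings_by_observer):
    """Mean Spearman coefficient over all unordered pairs of observers."""
    scored = [[SCORES[d] for d in obs] for obs in ratings_by_observer]
    return mean(spearman(a, b) for a, b in combinations(scored, 2))


def interpret(rho):
    """Bands quoted in this section: >= 0.75 strong, 0.4-0.75 fair, <= 0.4 weak."""
    if rho >= 0.75:
        return "strong"
    if rho > 0.4:
        return "fair"
    return "weak"
```

With 19 observers, `combinations` yields 171 unordered observer pairs, each contributing one coefficient computed over the 20 slides of a given preparation method.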

RESULTS

All 19 observers were cytotechnologists. Seven observers were from Laboratory A, five observers were from Laboratory B, and seven observers were from Laboratory C. Years of experience ranged from 2 years to 25 years (mean, 11 years; median, 9 years). The mean years of experience for observers from Laboratories A, B, and C were 12.9 years, 11.2 years, and 8.4 years, respectively. There was no statistically significant difference in the number of years of experience among observers from different laboratories (ANOVA; P = 0.545). Thirteen observers received formal training in the evaluation of TP, 2 observers received informal training, and 4 observers had no training at all.

Based on the original diagnoses, four pairs of slides were classified as WNL, one pair of slides was classified as BCC, five pairs of slides were classified as ASCUS, five pairs of slides were classified as LSIL, and five pairs of slides were classified as HSIL. Both TP slides and the corresponding conventional smears were diagnosed similarly. All slides with a cytologic diagnosis of either LSIL or HSIL were confirmed by subsequent cervical biopsy. One patient with a diagnosis of ASCUS had an LSIL, and two patients had benign processes on follow-up histology. Another patient with a diagnosis of ASCUS had persistent ASCUS diagnoses on repeat Pap smears. Three patients with a cytologic diagnosis of WNL or BCC had negative follow-up results. One patient with a diagnosis of ASCUS and two patients with a diagnosis of WNL were lost to follow-up.

Table 1 represents a symmetric agreement matrix based on the conventional method. Because there were 20 conventional slides and 19 observers, a total of 380 diagnoses (20 slides × 19 observers) were made. Each of these diagnoses was then compared with diagnoses that were made by the other 18 observers on the corresponding slide, resulting in a total of 6840 possible pairs (20 slides × 19 observers × 18 comparisons). For example, in Table 1, 129 diagnoses were categorized as HSIL based on the conventional smears. Each of these 129 diagnoses was compared with the diagnoses made by the other 18 observers on the corresponding slide, resulting in a total of 2322 comparisons (129 diagnoses × 18 observers). In 1766 instances, the other observers made a diagnosis of HSIL (i.e., concordant diagnoses); in 349 instances, the other observers made a diagnosis of LSIL; in 141 instances, the other observers made a diagnosis of ASCUS; and, in 66 instances, the other observers made a diagnosis of WNL/BCC. Table 2 represents a symmetric agreement matrix based on the TP method. Based on the set of conventional smears, interobserver agreement was very good for a diagnosis of HSIL, followed by WNL/BCC, LSIL, and finally ASCUS. This trend also was observed with TP slides (Table 2). In addition, the number of concordant pairs was higher with TP (4568 concordant pairs) compared with the conventional method (4160 concordant pairs). Table 3 summarizes the frequency distribution of the number of different diagnostic categories assigned to each smear by all observers. Unanimous agreement corresponds to a slide that was assigned to a single diagnostic category by all observers. A unanimous diagnosis of HSIL was reached in one conventional preparation, whereas a unanimous diagnosis was reached in three TP preparations, including two that were diagnosed as HSILs and one that was diagnosed as LSIL.
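The counting scheme described above can be sketched as follows. The function name and the toy ratings are hypothetical; the figures in `TABLE_1` are transcribed from Table 1 and are used only to check the arithmetic quoted in the text (row totals equal to 18 times the number of diagnoses, 6840 pairs overall, 4160 concordant pairs).

```python
from collections import Counter
from itertools import permutations


def agreement_counts(ratings_by_slide):
    """Count ordered (own diagnosis, other observer's diagnosis) pairs.

    Each diagnosis is compared with those of every other observer on the
    same slide, so the resulting matrix is symmetric by construction.
    """
    counts = Counter()
    for slide in ratings_by_slide:
        # permutations(..., 2) yields every ordered pair of distinct observers.
        for own, other in permutations(slide, 2):
            counts[(own, other)] += 1
    return counts


# Hypothetical toy data: 2 slides read by 3 observers (2 x 3 x 2 = 12 pairs).
toy = [
    ["HSIL", "HSIL", "LSIL"],
    ["WNL/BCC", "WNL/BCC", "WNL/BCC"],
]
toy_counts = agreement_counts(toy)

# Values transcribed from Table 1 (conventional method):
# category -> (no. of diagnoses, [WNL/BCC, ASCUS, LSIL, HSIL] paired counts).
CATEGORIES = ["WNL/BCC", "ASCUS", "LSIL", "HSIL"]
TABLE_1 = {
    "WNL/BCC": (112, [1342, 525, 83, 66]),
    "ASCUS": (73, [525, 472, 176, 141]),
    "LSIL": (66, [83, 176, 580, 349]),
    "HSIL": (129, [66, 141, 349, 1766]),
}

# Each diagnosis is paired with the 18 other observers on the same slide.
rows_consistent = all(n * 18 == sum(row) for n, row in TABLE_1.values())
total_pairs = sum(sum(row) for _, row in TABLE_1.values())
# Concordant pairs lie on the diagonal of the matrix.
concordant = sum(TABLE_1[c][1][i] for i, c in enumerate(CATEGORIES))
```

Because every unordered pair of observers is counted twice (once from each observer's point of view), the off-diagonal cells mirror each other, which is why the matrix is symmetric.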

Table 1. Comparison of Variation in Diagnoses among 19 Reviewers Using the Conventional Method

| Diagnosis | No. of diagnoses | WNL/BCC | ASCUS | LSIL | HSIL | Total no. of paired comparisons |
| --- | --- | --- | --- | --- | --- | --- |
| WNL/BCC | 112 | 1342 | 525 | 83 | 66 | 2016 |
| ASCUS | 73 | 525 | 472 | 176 | 141 | 1314 |
| LSIL | 66 | 83 | 176 | 580 | 349 | 1188 |
| HSIL | 129 | 66 | 141 | 349 | 1766 | 2322 |
| Total | 380 | 2016 | 1314 | 1188 | 2322 | 6840 |

WNL/BCC: within normal limits/benign cellular changes; ASCUS: atypical squamous cells of undetermined significance; LSIL: low-grade squamous intraepithelial lesions; HSIL: high-grade squamous intraepithelial lesions.

Table 2. Comparison of Variation in Diagnoses among 19 Reviewers Using the ThinPrep Method

| Diagnosis | No. of diagnoses | WNL/BCC | ASCUS | LSIL | HSIL | Total no. of paired comparisons |
| --- | --- | --- | --- | --- | --- | --- |
| WNL/BCC | 120 | 1580 | 541 | 87 | 32 | 2160 |
| ASCUS | 68 | 541 | 446 | 160 | 77 | 1224 |
| LSIL | 62 | 87 | 160 | 590 | 279 | 1116 |
| HSIL | 130 | 32 | 77 | 279 | 1952 | 2340 |
| Total | 380 | 2160 | 1224 | 1116 | 2340 | 6840 |

WNL/BCC: within normal limits/benign cellular changes; ASCUS: atypical squamous cells of undetermined significance; LSIL: low-grade squamous intraepithelial lesions; HSIL: high-grade squamous intraepithelial lesions.

Table 3. Frequency Distribution of the Number of Diagnostic Categories Assigned to Each Smear

Columns give the number of diagnostic categories that were assigned to each slide by 19 observers.

| Preparation method | 1 (Unanimous) | 2 | 3 | 4 | 5 |
| --- | --- | --- | --- | --- | --- |
| ThinPrep (no. of slides) | 3 (a) | 7 | 3 | 6 | 1 |
| Conventional (no. of slides) | 1 (b) | 3 | 11 | 3 | 2 |

(a) One low-grade squamous intraepithelial lesion (SIL) and two high-grade SILs.
(b) One high-grade SIL.

Table 4 summarizes the interobserver reproducibility, expressed in terms of the mean correlation coefficient, among all observers and within individual laboratories. The overall interobserver agreement as well as agreement within individual laboratories was good for both the conventional method and the TP method. The overall mean correlation coefficient with TP was greater than with the conventional method; the difference was statistically significant (P < 0.001). This also was true for individual laboratories (P < 0.001).
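The comparison of mean correlation coefficients by paired t test can be illustrated with a minimal sketch. It assumes that each pair of observers contributes one coefficient per preparation method, so that the coefficients are paired by observer pair (171 pairs for 19 observers); the article does not spell out this pairing. The input values below are hypothetical, not study data, and the P value would still have to be read from a t distribution with the returned degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first, second):
    """Return (t statistic, degrees of freedom) for paired samples.

    t = mean(d) / (sd(d) / sqrt(n)), where d are the within-pair differences
    (second minus first) and sd is the sample standard deviation.
    """
    if len(first) != len(second):
        raise ValueError("paired samples must have the same length")
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# HYPOTHETICAL coefficients for four observer pairs under each method.
conventional_rho = [0.80, 0.75, 0.90, 0.85]
thinprep_rho = [0.84, 0.80, 0.91, 0.88]

t_stat, df = paired_t(conventional_rho, thinprep_rho)
```

A paired test is the natural choice here because the same observer pairs are evaluated under both methods, so between-pair differences in baseline agreement cancel out of the comparison.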

Table 4. Interobserver Agreement among Different Laboratories

| Observer characteristic | No. of observers | Conventional method: mean correlation coefficient (standard deviation) | ThinPrep method: mean correlation coefficient (standard deviation) | P value |
| --- | --- | --- | --- | --- |
| Overall | 19 | 0.819 (0.105) | 0.842 (0.081) | < 0.001 |
| Laboratory A | 7 | 0.824 (0.100) | 0.866 (0.063) | < 0.001 |
| Laboratory B | 5 | 0.778 (0.106) | 0.866 (0.063) | < 0.001 |
| Laboratory C | 7 | 0.798 (0.104) | 0.852 (0.073) | < 0.001 |
| Training: none | 4 | 0.832 (0.080) | 0.879 (0.062) | < 0.001 |
| Training: informal | 2 | 0.712 (0.098) | 0.794 (0.077) | < 0.001 |
| Training: formal | 13 | 0.796 (0.107) | 0.838 (0.082) | < 0.001 |
| Experience: ≤ 9 yrs | 10 | 0.825 (0.098) | 0.859 (0.070) | < 0.001 |
| Experience: > 9 yrs | 9 | 0.768 (0.105) | 0.827 (0.087) | < 0.001 |

With the TP method, the mean correlation coefficient among observers who had no training in the evaluation of TP, among observers who had informal training, and among observers who had formal training was 0.88, 0.79, and 0.84, respectively. Thus, observers who had undergone any formal or informal training in evaluating TP slides did not perform better than observers who had received no training at all. The mean correlation coefficient among observers who had ≤ 9 years of experience was 0.825 and 0.859 for conventional and TP methods, respectively. The mean correlation coefficient among observers who had > 9 years of experience was 0.768 and 0.827 for conventional and TP methods, respectively. It appeared that the less experienced observers agreed more among themselves than the more experienced observers.

Intraobserver agreement was found in 30–75% of smears, with a mean of 58% and a median of 60%. The mean correlation coefficient was 0.81 (P < 0.001). Table 5 lists the frequency with which the observers agreed with themselves for different diagnostic categories. Intraobserver agreement was highest with smears that were classified as HSIL, followed by LSIL, WNL, and ASCUS.
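As a rough illustration, the percentage agreement for a single observer could be computed as below. This sketch assumes that intraobserver agreement means the same observer assigning the same category to both the conventional and the TP slide of a matched pair; the article does not state the definition explicitly. The function name and the ratings are hypothetical.

```python
def intraobserver_agreement(conventional, thinprep):
    """Fraction of matched pairs that one observer placed in the same category.

    ASSUMPTION: agreement is an exact category match between the two
    preparations of the same split sample.
    """
    if len(conventional) != len(thinprep):
        raise ValueError("each conventional slide needs a matched TP slide")
    same = sum(c == t for c, t in zip(conventional, thinprep))
    return same / len(conventional)


# HYPOTHETICAL ratings by one observer on five matched pairs.
conv = ["WNL", "ASCUS", "LSIL", "HSIL", "HSIL"]
tp = ["WNL", "LSIL", "LSIL", "HSIL", "ASCUS"]

rate = intraobserver_agreement(conv, tp)  # 3 of the 5 pairs match
```

Under this assumption, the reported range of 30–75% would correspond to between 6 and 15 matching pairs out of the 20 reviewed by each observer.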

Table 5. Frequency Distribution of Intraobserver Agreement According to Diagnostic Categories

| Diagnostic category | WNL (a) | ASCUS | LSIL | HSIL |
| --- | --- | --- | --- | --- |
| Intraobserver agreement (%) | 44 | 37 | 59 | 87 |

WNL: within normal limits; ASCUS: atypical squamous cells of undetermined significance; LSIL: low-grade squamous intraepithelial lesion; HSIL: high-grade squamous intraepithelial lesion.
(a) Included one patient with benign cellular changes.

DISCUSSION

Strong interobserver agreement in gynecologic cytology contributes to the accuracy and consistency of the detection of epithelial cell abnormalities. It is influenced by biologic variability, observer variability, or both. Biologic variability refers to the variation of the cellular material obtained and present in the cytologic specimens. This can be attributed to the techniques used to obtain the cellular materials, the quality of the preparations, and the degree of epithelial abnormality between specimens taken from the same patient at different time intervals. Observer variability is the variation in the interpretation by the observers, including intraobserver variability and interobserver variability.

The lack of interobserver reproducibility in gynecologic cytology has always been an issue. In a study by Klinkhamer et al., 100 cervical smears ranging from WNL to invasive carcinoma were screened twice by 19 observers during a 6-month interval. Those authors reported considerable interobserver variability.17 In another study by Maguire, photomicrographs from 20 cervical smears were sent to 20 reviewers.18 In only two smears was there agreement among all reviewers. The greatest agreement was for benign/reactive processes, whereas there was little uniformity in the recognition of intraepithelial lesions and carcinoma.

In 1988, the Bethesda System was developed. One of the objectives was to provide a simpler and more uniform classification scheme for reporting gynecologic cytology, thereby promoting interobserver agreement. However, interobserver variability remained problematic. In a study by Sherman et al., two observers rendered the same cytologic diagnoses in only 43% of 257 selected cervical smears.19 In about 50% of the discordant smears, the differences were greater than one diagnostic category. In another study, Young et al. distributed 20 cervical smears to 5 experts in cytopathology: there was unanimous agreement in seven smears (35%),11 and, in six smears (30%), the range of disagreement was more than one diagnostic category. Such interobserver variation was more pronounced in equivocal smears, namely, ASCUS cervical smears. For example, in one study, 3 observers classified 80 smears, including 74 smears originally diagnosed as ASCUS, into 5 categories (WNL, BCC, ASCUS, LSIL, or HSIL),20 and complete agreement was seen in only 11% of smears. Interobserver agreement was still poor when the number of diagnostic categories was reduced to three: WNL, ASCUS, and SIL. In another study, a panel of 5 expert cytopathologists reviewed 100 atypical cervical smears.21 Complete consensus was achieved in < 30% of the smears, and none of the smears was classified unanimously as ASCUS. In a more recent study using the same set of smears from the previous study, the authors reported that a review of the Bethesda System atlas did not improve interobserver reproducibility.12

In the current study, the overall interobserver agreement in diagnosing gynecologic cytology was good for conventional cervical smears. Our results were better than those reported previously in the literature. There are two possible explanations. First, in the current study, all reviewers were cytotechnologists, whereas the majority of the reviewers in other studies were pathologists. In a previous study, we reported that cytotechnologists as a group had better interobserver agreement than pathologists as a group.22 Second, among the set of smears that was distributed to our reviewers, 5 smears (25%) were WNL or BCC, and 10 smears (50%) were SIL (5 LSILs and 5 HSILs). Other investigators have shown that interobserver agreement for the diagnosis of WNL/BCC and SIL was better than interobserver agreement for the diagnosis of ASCUS.16 Our results also support the finding that smears with diagnoses of HSIL and WNL/BCC demonstrate excellent interobserver reproducibility. Conversely, smears with a diagnosis of ASCUS had poor interobserver agreement.

ThinPrep, a liquid-based cytology preparation, has recently emerged as an alternative preparation method for gynecologic cytology.23 Early studies using split samples showed overall increased detection of epithelial abnormalities.13, 24, 25 This was supported further by subsequent studies using direct-to-vial samples.14, 15 In a study of more than 8500 TP cervical slides, the authors reported that the percentage of patients who had a cytologic diagnosis of SIL and subsequent benign biopsies was reduced by 32%,14 and the percentage of patients with histologically confirmed LSIL and HSIL increased by 16% and 9%, respectively.14 In addition, the percentage of borderline smears that were diagnosed as ASCUS/atypical glandular cells of undetermined significance was lowered by 27%. Similarly, Carpenter and Davey also reported an increase in the detection rate of SIL from 8% to 11% and a decrease in the detection rate of ASCUS from 13% to 7%.15 The enhanced detection of epithelial cell abnormalities by TP is due in part to improved sampling and a lack of obscuring blood, inflammation, and mucus.26

There are only a few studies addressing interobserver variability in diagnosing gynecologic cytology using TP. In a recent study, 3 pathologists and 1 senior cytotechnologist reviewed 144 TP slides previously diagnosed as WNL, ASCUS, and SIL.16 Those authors reported that the interobserver reproducibility for slides with a diagnosis of WNL or SIL was good; however, the reproducibility for a diagnosis of ASCUS was poor. In the current study, the interobserver agreement among 19 cytotechnologists from 3 different laboratories was good, with a mean correlation coefficient of 0.842. Our result appeared to be better than those of previous studies, most likely for the same reasons mentioned above for conventional methods. We also demonstrated that the overall interobserver reproducibility improved with the TP method compared with the conventional method. Based on TP, three slides (two HSILs and one LSIL) had a unanimous diagnosis, whereas only one slide (HSIL) had a unanimous diagnosis based on conventional preparation (Table 3). From a different perspective, the interobserver agreement expressed in terms of the mean correlation coefficient was better with the TP method than with the conventional method (Table 4). The difference in the mean correlation coefficient between the two methods was statistically significant. To the best of our knowledge, this is the first study to compare the interobserver variability between TP and conventional methods. Twenty pairs of matched TP and conventional cervical smears prepared from split samples were used for evaluation, and there were 19 observers with different levels of experience in evaluating TP slides. This resulted in 6840 pair-wise comparisons for each method. With such a large number of comparisons, the results are unlikely to be spurious.

The improved cytologic preservation and presentation of TP over the conventional method are likely explanations for the observed differences in interobserver agreement. For example, the epithelial cells in the conventional cervical smears often were distributed unevenly over the slide and aggregated in overlapping groups, whereas the TP cervical slides displayed cells that were distributed more evenly within a 22-mm circle and with minimal overlap.26, 27 Better cell preservation and visualization due to lack of air-drying artifacts and obscuring inflammation and blood also contributed to the improved quality of TP.14, 15, 27 A recent study reported an increase in the rate of completely satisfactory specimens from 80% to 90% and a decrease in the rate of satisfactory but limited by… findings by 46% with the use of TP.15

Improved interobserver agreement with TP was noted in all three laboratories. It was not surprising in Laboratories A and B because of their experience with TP. Conversely, the improved interobserver agreement with TP was unexpected for Laboratory C, because its cytotechnologists did not evaluate TP or other liquid-based preparations routinely. This may be attributed to the improved quality of TP. Reviewer attendance at any formal or informal TP training had no significant influence on interobserver variability.

Some authors have shown that interobserver reproducibility correlates with the experience of the observers.17 In the current study, experience did not appear to be an important factor. The number of years of experience of our reviewers ranged from 2 years to 25 years. In fact, reviewers who had less experience (≤ 9 years) appeared to agree better among themselves than reviewers who had more experience (> 9 years). One possible explanation is that those cytotechnologists with ≤ 9 years of experience had been exposed only to the Bethesda System throughout their entire cytology careers, whereas the more experienced cytotechnologists received training before the era of the Bethesda System and were exposed to other classification systems, such as the Papanicolaou system. Joste et al. showed that relatively inexperienced cytopathologists (fellows in training) were able to achieve levels of consistency similar to those achieved by more experienced practicing pathologists.4

According to the manufacturer's recommendation, the evaluation of cytologic materials processed by TP should be performed by personnel who have undergone TP training by the company or by organizations or individuals designated by the company.28 In the current study, reviewers who had undergone formal TP training did not agree among themselves more than reviewers who had received no training. It would be premature to draw any definite conclusion, because there were only 4 reviewers who had received no TP training, and all were from the same laboratory, whereas the 13 reviewers who had formal training were from 3 different laboratories. Future studies involving larger numbers of reviewers without formal training in the evaluation of TP slides may be warranted to address the influence of TP training on interobserver variability. There also are other reasons that TP training is necessary, but they are not addressed in our study. These reasons include the requirement by the Food and Drug Administration and laboratory personnel's acquaintance with the adequacy requirement and diagnostic criteria for liquid-based preparations.

In summary, interobserver agreement in diagnosing gynecologic cytology using either the TP method or the conventional method was good, particularly with slides that were within normal limits or that demonstrated HSIL. We demonstrated that the TP method was superior to the conventional method in terms of interobserver agreement, regardless of the experience of the observers in evaluating TP gynecologic cytology.

Acknowledgements

The authors thank all of the cytotechnologists who participated in this study: Maria A. Angeles, Silvia Babore, Edward Bartosiewicz, Helena Brown, Jun Chen, Jon Gidley, Kathy Connolly, Sandra Gallaspy, Rimma Gorokhovsky, Hans Graef, Anne Marie Gulka, Sharon E. McMahon, Mara Jo Miller, Alla Morim, Dorota Rudomina, and Guadalupe Warren. They also thank Ms. Shirley Lobenthal for her comments and suggestions.
