Glandular cell atypia on Papanicolaou smears
Interobserver variability in the diagnosis and prediction of the cell of origin
The 2001 Bethesda System recommended qualification of atypical glandular cells (AGC) to indicate the site of origin and separated endocervical adenocarcinoma in situ (AIS) from “AGC favor neoplastic” as a specific diagnostic category. To the authors' knowledge, the literature evaluating the reproducibility of Papanicolaou (Pap) smear diagnosis of glandular cell abnormalities with emphasis on the cell of origin is limited. The aim of the current study was to investigate whether a variety of benign to neoplastic glandular lesions can be reliably classified on Pap smear with regard to diagnosis and cell of origin.
Twenty-three conventional Pap smears (CPS) with glandular cellular changes varying from benign to adenocarcinoma (ACA) were reviewed by six observers. They were asked to categorize each smear according to cell of origin (endocervical vs. endometrial) and diagnosis (benign, AGC, or ACA). Kappa statistics were used to evaluate interobserver agreement and correlation of interobserver agreement with experience.
There was no consensus among observers for both the origin of the cells and the diagnosis. Interobserver agreement for site was poor (kappa < 0.4) especially in the AGC category. Unanimous agreement for site was reached for 7 of 23 smears (30%). Two of five endocervical AIS were classified as endometrial and another two were classified as benign by four observers. Interobserver agreement was poor in all diagnostic categories (kappa < 0.4) and showed slight correlation with level of experience. Unanimous agreement for diagnosis was reached for only 2 smears (9%). Three of 11 (27%) smears demonstrating preneoplastic/neoplastic processes were diagnosed as benign by 3 observers. Three (25%) benign CPS were diagnosed as ACA by 2 observers. Accurate prediction of the final histologic diagnosis by observers varied from 30% to 87% and did not correlate closely with experience.
Cytologic diagnosis of glandular lesions by CPS was problematic and suffered from significant interobserver subjectivity. Cancer (Cancer Cytopathol) 2003;99;323–30. © 2003 American Cancer Society.
Diagnosis of glandular cell abnormalities using the conventional Papanicolaou (Pap) smear (CPS) is challenging for most practicing pathologists, including those specializing in cytopathology.1–11 The difficulty in correctly identifying glandular lesions using CPS lies not only in differentiating clinically significant lesions from benign reactive conditions but also in accurately classifying the cell of origin as squamous, endocervical, or endometrial. Although several studies published to date have addressed interobserver variability in the cytologic diagnosis of glandular lesions, to our knowledge the reproducibility of the cytologic prediction of the cell of origin has not been adequately studied.3, 4, 7 Both issues are equally important because the new consensus management guidelines recently published by the American Society for Colposcopy and Cervical Pathology (ASCCP) are based on the 2001 Bethesda system.12
The 2001 Bethesda system significantly revised the classification of atypical glandular cells of undetermined significance (renamed atypical glandular cells [AGC]).13 It is now recommended that AGC be specified as to the site of origin (endocervical vs. endometrial), with the assumption that this is achieved without difficulty in the majority of cases.13 In addition, endocervical adenocarcinoma in situ (AIS) now stands alone as a specific and separate diagnostic category, with a similar premise that it can be diagnosed reliably by CPS.
Numerous studies have shown that cytomorphologic alterations involving the glandular cells using CPS represent a diagnostic challenge, even with the knowledge of a biopsy follow-up.2–11, 14, 15 The AGC diagnostic category in particular remains a poorly understood concept within the Bethesda system. The reproducibility of an AGC diagnosis is poor and to our knowledge there is no consensus among experts regarding specific diagnostic cytologic criteria.3, 4, 7 Moreover, the percentage of women reported to have a significant lesion on follow-up has varied widely, from 9% to 54%.1, 2, 14–17 For women who do have a biopsy-confirmed lesion, high-grade squamous lesions comprise the majority rather than true glandular abnormalities, emphasizing the difficulty in establishing a correct cytologic classification of the cell of origin.
Several investigators addressed interobserver agreement in the diagnosis of squamous18, 19 and glandular cell abnormalities.3, 4, 7 Two previous studies have shown that AGC as a diagnostic category has poor reproducibility.3, 4 In both of these studies, reviewers rendered their diagnoses on smears originally diagnosed as AGC and correlated the reclassification results with a follow-up biopsy to determine the level of interobserver agreement and diagnostic accuracy in predicting the correct histologic diagnosis (final outcome). In the current study, we sought to evaluate a variety of glandular cell changes with initial CPS diagnoses of mostly AGC, but also encompassing a spectrum from benign reactive to adenocarcinoma. The aim was to determine whether the site of origin (endocervical vs. endometrial) and the diagnosis could be predicted reliably by CPS as recommended by the 2001 Bethesda system. We specifically selected a group of observers with varying degrees of expertise in the interpretation of cervicovaginal cytology rather than experts in the field to more accurately represent a heterogeneous population of practicing pathologists and cytotechnologists.
MATERIALS AND METHODS
Twenty-three CPS from 23 women were selected by 2 referee cytopathologists to be reviewed by 6 independent observers from 2 institutions (New York University [New York, NY] and University of Alabama [Birmingham, AL]). Twenty-two CPS demonstrated glandular cell changes varying from benign to adenocarcinoma. The selection of the smears was based on the review of the corresponding surgical biopsy and/or excision slides to ensure the accurate representation of the underlying pathology on the CPS. All endocervical in situ and invasive adenocarcinomas were mucinous (endocervical and intestinal) in type. All endometrial adenocarcinomas were endometrioid in type. One smear from Patient 2 was selected as a classic example of a menstrual smear pattern. This patient only had a CPS follow-up. Table 1 shows the original CPS, the review CPS (by two referee pathologists), and the histologic diagnoses. Of the 23 CPS, there were 3 benign endocervical processes, 1 extrauterine (ovarian) adenocarcinoma, 8 benign endometrial processes, 8 endocervical lesions, and 3 endometrial preneoplastic and neoplastic lesions. The 2 referee pathologists believed that 12 smears represented classic cytologic examples of the representative entity depicted on histology and that the remaining 11 smears could potentially lead to varying opinions among observers. This selection process was used to measure interobserver variability in the current study. For example, the CPS for Patient 2 was included, even though it was a classic example of a menstrual smear pattern, because the reactive appearance and the abundance of clusters of endometrial cells could lead to an erroneous diagnosis of AGC.
Table 1. Original Pap Smear, Review Pap Smear, and Final Histologic Diagnoses
|Patient||Original CPS diagnosis||Review CPS diagnosis||Histologic diagnosis|
|1||AGC, favor reactive||EC, reactive with tubal metaplasia^a||Tubal metaplasia|
|2||Benign endometrial cells||EM, menstrual pattern^a||Menstrual smear; no biopsy follow-up|
|3||AGC, favor reactive endometrial||EM, benign reactive||Polyp with sloughing|
|4||AGC, not qualified||EC with reactive change and LSIL^a||Condyloma with CIN 1 with endocervical gland involvement|
|5||AGC, favor reactive||EC, ACA||AIS and ACA|
|6||AGC, not qualified||EM, benign^a||Polyp|
|7||AGC, favor neoplastic||EC, AIS, and HSIL||Polyp with AIS and CIS|
|8||AGC, endometrial, not qualified||EM, benign^a||Secretory, sloughing with surface tubal/eosinophilic metaplasia|
|9||AGC, not qualified||EC, benign reactive^a||Polyp with surface reactive atypia|
|10||Ovarian ACA||Ovarian ACA||Serous papillary ACA|
|11||AGC, endocervical type and HSIL||EC, AIS, and HSIL||AIS and CIN 3|
|12||WNL||EC, AIS (positive on rescreening)||AIS|
|13||AGC, not qualified||EM, benign||Sloughing with TM|
|14||AGC, favor endometrial||EM, benign||Polyp|
|15||AGC, endometrial||EM, benign reactive^a||Polyp with necrosis and sloughing|
|16||Benign endometrial cells||EM, benign||Simple hyperplasia and tubal metaplasia|
|17||AGC, not qualified||EM, benign^a||Proliferative with tubal metaplasia|
|18||AGC, endometrial type, favor neoplastic||EM, ACA^a||ACA|
|19||AGC, not qualified||EM, ACA||ACA|
|20||AGC, favor neoplastic||EC, benign reactive^a||Tubal metaplasia, immature squamous metaplasia|
|21||HSIL||EC, AIS, and HSIL||AIS and CIS|
|23||Squamous and ACA||EC, AIS, and HSIL||AIS and CIS|
The observers were asked to categorize each smear according to the cell of origin (endocervical vs. endometrial) and the diagnosis (benign, AGC, or adenocarcinoma). We did not use “AGC, favor neoplasia” and “endocervical AIS” as separate categories. “AGC, favor neoplasia” was combined with AGC, and the observers were instructed to place endocervical AIS within the adenocarcinoma category. The diagnostic categories were kept as simple as possible so that the study measured interobserver variability for the diagnostic abnormalities that would have the most clinical impact.
The observers were told that the CPS specimens contained a spectrum of glandular cell changes varying from benign to malignant processes and that some might also have concurrent squamous abnormalities. The age of the patient was the only clinical information provided. The observers did not communicate with one another and used their own cytologic criteria to reach a diagnosis. The observers comprised a diverse group of practitioners: 4 cytopathologists with 1–8 years of sign-out experience and 2 cytotechnologists with 20 years and 23 years of experience, respectively.
Kappa statistics were used to test the null hypothesis that there was no agreement among multiple observers. The kappa values were computed using the MAGREE macro (SAS Inc., Cary, NC) and were calculated for site of origin, diagnostic categories, and correlation of interobserver agreement with level of experience. Kappa values < 0.4 reflected weak or poor agreement, values of 0.4–0.7 reflected good agreement, and values > 0.7 reflected excellent agreement. P values for the kappa statistics were used to assess whether a kappa coefficient was significantly different from 0. P values ≤ 0.05 were considered statistically significant. For example, a kappa coefficient of 0.37 with a P value < 0.05 would indicate weak positive agreement that was statistically significant (i.e., statistically different from no agreement at all).
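As an illustration of how a multirater kappa of this kind can be computed, the following is a minimal sketch of Fleiss' kappa. It is not the MAGREE macro itself, and it assumes that a Fleiss-type statistic is what was reported; the function names and the example labels are hypothetical. The interpretation bands are the ones stated above.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per subject.

    ratings: list of per-smear lists, one label per observer.
    categories: list of all possible labels.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    # counts[i][j] = number of observers who put smear i in category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    # Observed agreement for each smear, then averaged over smears
    p_i = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall share of each category
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_j)
    if p_e == 1.0:
        # Every observer used one single category; kappa is undefined
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1 - p_e)


def interpret(kappa):
    """Agreement bands as defined in the text (< 0.4, 0.4-0.7, > 0.7)."""
    if kappa < 0.4:
        return "poor"
    if kappa <= 0.7:
        return "good"
    return "excellent"


# Hypothetical example: six observers, two smears, full agreement on each
perfect = [["benign"] * 6, ["ACA"] * 6]
print(fleiss_kappa(perfect, ["benign", "AGC", "ACA"]))
```

The significance test (the P value against a kappa of 0) is omitted from this sketch.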
The diagnostic accuracy for each individual was calculated using stringent and nonstringent criteria. Stringent calculation was based on an “exact match” between a cytologic and a reference diagnosis. Using this method, AGC was considered either an overcall (for patients with benign processes) or an undercall (for patients with AIS and carcinoma). Nonstringent calculation was based on a “close match” between the cytologic and the reference diagnosis. Using this method, AGC was considered correct for patients with preneoplasia and neoplasia (i.e., squamous intraepithelial lesions [SIL], AIS, or carcinoma). Also, AGC was considered correct for Patients 3, 8, 9, 14, and 15; their smears demonstrated benign processes but with significant reactive/inflammatory atypia.
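The two scoring rules can be expressed as a short sketch. It makes simplifying assumptions that are not in the text: reference diagnoses are collapsed into "benign" and "neoplastic", and a call of "ACA" is treated as the exact match for any neoplastic reference (the observers were told to place AIS under adenocarcinoma; how SIL references were scored under the stringent rule is not stated). The function and variable names are hypothetical.

```python
# Patients whose benign smears showed significant reactive/inflammatory
# atypia; AGC counts as correct for them under the nonstringent rule.
REACTIVE_ATYPIA_PATIENTS = {3, 8, 9, 14, 15}


def is_correct(patient, call, reference, stringent=True):
    """Score one cytologic call against the reference diagnosis.

    call: "benign", "AGC", or "ACA".
    reference: "benign" or "neoplastic" (assumed simplification).
    """
    exact = (call == "benign" and reference == "benign") or (
        call == "ACA" and reference == "neoplastic"
    )
    if stringent or exact:
        # Stringent rule: only an exact match counts, so AGC is always
        # an overcall (benign) or an undercall (neoplastic).
        return exact
    if call == "AGC":
        # Nonstringent rule: AGC is a "close match" for neoplasia and
        # for the listed benign smears with significant atypia.
        return reference == "neoplastic" or patient in REACTIVE_ATYPIA_PATIENTS
    return False


def accuracy(scored_calls, stringent=True):
    """scored_calls: list of (patient, call, reference) tuples."""
    hits = sum(is_correct(p, c, r, stringent) for p, c, r in scored_calls)
    return hits / len(scored_calls)
```

For example, an AGC call on the smear from Patient 3 is an overcall under the stringent rule but counts as correct under the nonstringent rule.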
RESULTS
There was no consensus among the observers regarding either the origin of the cells or the diagnosis. The kappa values for site were < 0.4, indicating poor interobserver agreement (Table 2). Unanimous (6-way) agreement for site was reached for 7 of 23 specimens (30%). Four specimens were determined to be endocervical (one benign and three neoplastic processes) and three were determined to be endometrial (all benign) in origin (Table 3). The prediction of site of origin was better for benign processes than for AGC and adenocarcinoma.
Table 2. Kappa Values for Interobserver Variability by Site
|Overall kappa value||0.048||0.056||0.854||0.221||0.051||< 0.001|
Table 3. Level of Agreement among Six Observers in Interpretation of Pap Smears with Histologically Proven Benign Glandular Cells
|One||1 case||3 cases||3 cases|
|Two-way agreement||2 cases||4 cases||—|
|Three-way agreement||—||1 case||—|
|Four-way agreement||5 cases||2 cases||—|
|Five-way agreement||2 cases||—||—|
|Six-way agreement||2 cases||—||—|
In the diagnostic category, regardless of the initial (reference) diagnoses, interobserver agreement was poor in all groups (i.e., benign, AGC, and adenocarcinoma). Table 4 summarizes the kappa values when interobserver agreement was calculated for each diagnostic category separately, when AGC and ACA were grouped as a single entity, and for all categories combined. Although combining AGC with carcinoma as one abnormal category somewhat improved interobserver agreement, the kappa value remained poor (< 0.4). Unanimous agreement for diagnosis was reached for only 2 specimens (9%). Both of these were benign endometrial cells (Table 5). Three of the 11 preneoplastic/neoplastic specimens (27%) (Patients 7, 21, and 22) were reclassified as benign by 3 observers. Each of these observers, all with < 4 years of experience, misclassified 1 specimen. Three benign CPSs (25%) (i.e., 1 benign sloughing endometrium [Patient 8], 1 endometrial polyp with sloughing [Patient 15], and 1 endocervical polyp with reactive surface epithelial atypia [Patient 9]) were reclassified as adenocarcinoma by 2 observers. One observer misclassified one specimen and the other misclassified two. Both individuals had many years of sign-out experience. Of five endocervical AIS specimens, two were classified as endometrial in origin by two observers and two were misdiagnosed as benign by another two observers.
Table 4. Kappa Values for Interobserver Variability by Diagnosis
|P value||< 0.001||0.484||0.005||0.001||0.001|
Table 5. Level of Agreement among Six Observers in Interpretation of Pap Smears with Histologically Proven Preneoplastic/Neoplastic Glandular Cells
|One||3 cases||1 case||—|
|Two-way agreement||—||2 cases||6 cases|
|Three-way agreement||—||5 cases||3 cases|
|Four-way agreement||—||4 cases||1 case|
For a definitive diagnosis of carcinoma, agreement was no better than 4-way, and this was reached in only 1 of 11 cases (9%). Agreement among 3 and 2 observers was reached for 3 (27%) and 6 (55%) specimens, respectively. Interobserver agreement was slightly better for benign cases than for carcinoma cases. Of 12 benign specimens, unanimous agreement was reached for 2 cases (17%), 5-way agreement for 2 cases (17%), and 4-way agreement for 5 cases (42%).
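The level-of-agreement tallies can be reproduced with a short sketch: for each smear, count how many of the six observers chose a given category, then tabulate how many smears fall at each count. The calls shown are hypothetical and are not the study data; only the percentages in the final loop use figures quoted in the text.

```python
from collections import Counter


def agreement_levels(calls_per_smear, category):
    """Map k -> number of smears for which exactly k observers chose category."""
    return Counter(calls.count(category) for calls in calls_per_smear)


def percent(numerator, denominator):
    """Whole-number percentage, as reported in the text."""
    return round(100 * numerator / denominator)


# Hypothetical calls by six observers on three smears
toy_calls = [
    ["ACA", "ACA", "ACA", "ACA", "AGC", "benign"],
    ["ACA", "ACA", "AGC", "AGC", "AGC", "benign"],
    ["ACA", "ACA", "AGC", "AGC", "AGC", "AGC"],
]
levels = agreement_levels(toy_calls, "ACA")
print(levels[4], levels[2])  # one smear with 4-way, two with 2-way agreement

# Percentages quoted in the text: cases at a given level / cases in the group
for cases, total in [(1, 11), (3, 11), (6, 11), (2, 12), (5, 12)]:
    print(f"{cases}/{total} = {percent(cases, total)}%")
```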
Table 6 shows the diagnoses, final histologic diagnoses (true outcome), and years of experience of each observer. Correct prediction of the final outcome ranged from 30% to 87% when stringent criteria were applied and from 74% to 100% when nonstringent criteria were applied (Table 6). Statistical analysis of diagnostic accuracy by level of experience was not performed because of insufficient data across strata. Table 7 shows the correlation between interobserver agreement and years of experience. Interobserver agreement was found to be weakly correlated with level of experience. The overall performance of observers with > 5 years of experience was better, especially in the benign category (kappa = 0.498). However, some of the clinically significant errors were also made by senior observers.
Table 6. Correlation of Observer Experience with Prediction of Correct Histologic Diagnosis on Pap Smear Evaluation
|1||7||5||0||1||10||0||7/23 (30)||18/23 (78)||4|
|2||9||2||1||0||1||10||20/23 (87)||20/23 (87)||20|
|3||10||2||0||0||5||6||16/23 (70)||23/23 (100)||8|
|4||5||7||0||1||6||4||9/23 (39)||17/23 (74)||4|
|5||6||4||2||0||2||9||15/23 (65)||18/23 (78)||23|
|6||11||1||0||3||8||0||11/23 (48)||20/23 (87)||1|
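The accuracy percentages in Table 6 can be rechecked with a few lines of arithmetic. The sketch below assumes a particular reading of the flattened table, since the column headers are not reproduced here: the first column is the observer, the two fraction columns are the stringent and nonstringent accuracy, and the last column is years of experience. The variable names are hypothetical.

```python
N_SMEARS = 23

# observer: (correct under stringent rule, correct under nonstringent rule,
#            years of experience), as read from the rows of Table 6
TABLE_6 = {
    1: (7, 18, 4),
    2: (20, 20, 20),
    3: (16, 23, 8),
    4: (9, 17, 4),
    5: (15, 18, 23),
    6: (11, 20, 1),
}


def pct(correct):
    """Whole-number percentage of the 23 smears."""
    return round(100 * correct / N_SMEARS)


stringent = {obs: pct(row[0]) for obs, row in TABLE_6.items()}
nonstringent = {obs: pct(row[1]) for obs, row in TABLE_6.items()}

print("stringent range:", min(stringent.values()), "to", max(stringent.values()))
print("nonstringent range:", min(nonstringent.values()), "to", max(nonstringent.values()))
```

On this reading, the stringent column spans 30% to 87% and the nonstringent column as printed in the table spans 74% to 100%.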
Table 7. Kappa Values for Interobserver Variability by Observer Years of Experience
|Overall kappa value||0.007||0.104||0.472||0.292||0.086||< 0.001|
DISCUSSION
The data from the current study show that there was significant interobserver subjectivity in the prediction of the site of origin and the diagnosis of glandular lesions using CPS. Interobserver agreement in the AGC category was particularly poor. Although attempts to make specific benign and malignant diagnoses rather than using the category AGC led to better kappa coefficient values, this increased specificity came at the cost of clinically significant errors. Our findings are in accordance with those of Lee et al.3 and Raab et al.4 Twenty-seven percent of preneoplastic and neoplastic glandular lesions were designated as benign and 25% of benign lesions were designated as adenocarcinoma by at least 1 observer. Moreover, different specimens were misclassified by different observers, and no single case was a source of diagnostic difficulty for all observers. Overall performance of the more experienced observers was better. However, clinically significant errors were made by both seasoned and junior observers. Prediction of cell origin was better for benign lesions than for preneoplastic and neoplastic lesions. Similarly, Costa et al.20 reported that correct identification of anatomic site was achieved by cytology in only 18% of 39 cervical carcinomas and 54% of 28 endometrial carcinomas.
Lee et al.3 and Raab et al.4 reported poor interobserver agreement among five and four reviewers, respectively, in reclassification of AGC CPSs. In both studies, reviewers, who were expert cytopathologists, agreed with the original AGC diagnosis for only 3.8% and 15% of specimens, and none retained the original AGC diagnosis in 20.3% and 12% of specimens, respectively. In addition, 14–25% of AGC specimens that were found to have high-grade SIL (HSIL) or glandular neoplasia on follow-up were reclassified as benign or AGC favor reactive. Compared with the original pathologists, the expert reviewers missed more clinically significant lesions.7 We cannot compare our results specifically for the prediction of the cell of origin with the results of these two studies. Raab et al.4 excluded “AGC, favor endometrial origin” from their study. Although Lee et al.3 asked reviewers to specify the site of origin in their diagnoses, the calculation of interobserver agreement was not directed to site and this information could not be tabulated from their article.
The objective of the current study was not to evaluate the validity of cytologic criteria used by different observers. However, this has important implications because the application of different criteria deemed important by different observers undoubtedly contributes to poor interobserver agreement in the diagnosis of glandular lesions. Raab et al.4, 7 effectively addressed this issue in two separate studies. In the first study, cytopathologists were asked to reclassify CPSs previously diagnosed as AGC, to predict the surgical pathology follow-up, and in doing so, to list the specific cytologic criteria that they used.4 In no case did more than two cytopathologists select the same cytologic feature to distinguish a benign lesion from a clinically significant lesion. In the second study, 8 observers with different levels of experience were asked to place 88 CPS previously diagnosed as AGC into 1 of 5 diagnostic categories (i.e., benign, probably benign, possibly clinically significant, probably clinically significant, and definitely clinically significant). This was performed after a didactic session that involved review of 10 Pap smears with histologic follow-up emphasizing the presence of decreased cytoplasm, atypical single cells, and irregular nuclear contours as defining features for a neoplastic lesion.7 Although experienced observers performed better than less experienced observers, all observers classified some clinically significant lesions as benign in spite of the teaching session.
It has been suggested that benign reactive lesions such as tubal metaplasia and endocervical adenocarcinoma can be distinguished from AGC based on select cytologic criteria.5, 6, 21–23 Others have found that these criteria are not reliable in discriminating benign from preneoplastic and neoplastic lesions.7, 8 Irregular nuclear membranes, atypical single cells, and decreased cytoplasm had a high sensitivity for detecting neoplastic lesions, but the specificity of these criteria (i.e., 28%) was poor.8 It has been hypothesized that one of the reasons why glandular lesions are difficult to characterize cytologically is a secondary effect of newer sampling techniques such as the endocervical brush.9, 10 Benign reactive glandular tissue may appear neoplastic and, conversely, neoplastic tissue may appear benign. In addition, in our experience, thick cell clusters and tissue fragments obtained by vigorous brushing, especially if they originate from the lower uterine segment, cause confusion as to whether the cells are metaplastic squamous, endocervical, or endometrial in origin. The difficulty in classifying abnormal cells according to the cell of origin and diagnosis was reported by Lee et al.11 They reviewed negative CPS from 34 women with tissue-proven endocervical AIS. Upon rescreening, 55% of these negative smears were found to be abnormal. However, more than half of these abnormal smears were difficult to diagnose specifically as AIS even with knowledge of the biopsy result. In some smears, the abnormal cells were very small and in crowded clusters, resembling normal endometrial cells or cells from the lower uterine segment. In other smears, they could be mistaken for reactive endocervical cells. In the current study, which included five endocervical AIS specimens, two of the five smears were classified as endometrial in origin and another two were diagnosed as benign by four observers.
In the current study, the percentages of low-grade SIL (LSIL), HSIL, AIS, and adenocarcinoma do not represent the spectrum of lesions encountered in routine practice. This is a shortcoming that may have introduced bias. Because the current study was designed to measure interobserver reproducibility in the diagnosis of glandular abnormalities, we included more glandular lesions than SILs. Various investigators reported that 9–54% of women with AGC have biopsy-confirmed SIL, 0–8% have AIS, and approximately 1% each have endocervical and endometrial adenocarcinoma on follow-up.12, 15, 17 HSIL is more common than LSIL. In some studies, the HSIL-to-LSIL ratio was 1.3:4.15, 17 In our study, 48% of smears were negative, 4% had LSIL, 4% had AIS, 22% had AIS with HSIL, and 8% had endocervical and endometrial adenocarcinomas. We included a variety of glandular changes to emphasize the difficulty in distinguishing these entities even when the observers are specifically informed regarding what they will be evaluating.
Several management and legal issues require special emphasis based on these findings. The 2001 Bethesda system recommends reporting the site of origin for atypical glandular cells with the assumption that this is easily achievable in the majority of cases.13 Although the 2001 ASCCP-sponsored Bethesda consensus conference reached new guidelines for management of women with AGC and AIS based on the 2001 Bethesda system, current practices vary greatly, from repeating the Pap test, to immediate colposcopy with endocervical and endometrial curettage, to conization even in the absence of a detectable abnormality on initial colposcopic examination.12, 24–27 In such an environment, we believe classifying AGC specifically as to the site of origin may give a false sense of security to the clinician because if only one type of specimen is obtained at follow-up, the real lesion may escape detection. For example, an endocervical polyp with reactive atypia after a CPS diagnosis of “AGC, endocervical type” in a patient with an underlying endometrial pathology may prevent the clinician from performing further investigative procedures. Because reproducibility of origin of cells is poor based on CPS evaluation, especially in preneoplastic and neoplastic lesions, the best approach may be to obtain cervical, endocervical, and endometrial samples to ensure adequate representation of the uterine squamous and glandular lining. A second important issue is the overinterpretation of glandular lesions using CPSs. Close correlation of cytologic and histologic samples is of the utmost importance for optimal patient care. As was seen in the current study, significant reactive/reparative atypia in the endocervical epithelium, as well as metaplastic and secretory changes accompanied by sloughing in the endometrium, may give rise to an AGC diagnosis and even a false-positive diagnosis of carcinoma on CPS.
Approximately 50% of the women with endocervical adenocarcinoma have negative CPS, which highlights the difficulty of detecting glandular neoplasia using cytologic screening.11, 27, 28 This is attributed mostly to sampling. However, interpretative error also plays a role.11, 29 In malpractice cases, the review diagnosis rendered by an “expert” cytopathologist is regarded as the truth and the initial diagnosis that led to the litigation is considered an error. In view of the poor interobserver agreement among experienced cytopathologists in evaluation of glandular lesions, a panel approach, as previously recommended by others, offers a better means of resolving a diagnostic controversy.4, 30, 31 Even then, it still remains to be determined how many observers is an adequate number to form a panel and how many of the experts in a panel should be in agreement to arrive at a final diagnosis.
We conclude that CPS is an imperfect test for evaluation of glandular cell abnormalities because of significant observer subjectivity. Overinterpretation and underinterpretation of glandular cell abnormalities may lead to significant diagnostic errors and do not necessarily reflect the individual's level of experience.