Inter-observer variation in histopathological diagnosis and grading of vulvar intraepithelial neoplasia: results of a European collaborative study
Other participants in the study are listed on page 599.
Correspondence: Dr M. Preti, Department of Obstetrics and Gynaecology, University of Turin, Via Don Grioli 6–10137 Turin, Italy.
Objective To evaluate the inter-observer variability of the diagnosis and grading of vulvar intraepithelial neoplasia.
Design Prospective study.
Sample Histological sections of 66 vulvar biopsies.
Methods Six consultant pathologists working at different European institutions independently reviewed 66 vulvar biopsies. The following variables were investigated: specimen adequacy, gross categorisation into benign or neoplastic changes, presence of an atypical cytological pattern, presence of a neoplastic architectural pattern, grade of vulvar intraepithelial neoplasia, and presence of histopathological findings associated with human papillomavirus infection.
Main outcome measures The degree of inter-observer variation for each histopathological parameter was assessed by Kappa (κ) statistics. The frequency and the degree of disagreement were calculated by a symmetrical agreement matrix showing the number of paired classifications.
Results Good agreement (overall weighted κ = 0.65, unweighted κ = 0.46) was observed for grading vulvar intraepithelial neoplasia. Findings associated with human papillomavirus infection and specimen adequacy were the variables with the least inter-observer agreement (overall weighted κ 0.26 and 0.22, respectively). Exact agreement between two pathologists on the grade of vulvar intraepithelial neoplasia was observed in 63.6% of paired readings; the rate of paired agreement reached 73.9% when vulvar intraepithelial neoplasia 2 and 3 were considered as a single class. Conversely, only 5.0% of vulvar intraepithelial neoplasia 1 diagnoses were concordant in paired analysis.
Conclusions Current terminology offers a reproducible tool in the hands of expert pathologists. While there is good agreement on the diagnosis of ‘high grade’ vulvar intraepithelial neoplasia (vulvar intraepithelial neoplasia 2 and 3), the diagnostic category of vulvar intraepithelial neoplasia 1 is not reproducible.
Pre-cancerous lesions of the vulva were classified in the past under numerous descriptive terms, resulting in a complex and confusing terminology. Benign lesions, such as lichen sclerosus and squamous cell hyperplasia, were misunderstood as having neoplastic potential and were classified and treated as such. In the late 1980s the International Society for the Study of Vulvovaginal Diseases (ISSVD) introduced a new classification separating neoplastic from non-neoplastic vulvar disease and grouping under the term ‘vulvar intraepithelial neoplasia’ (VIN) all lesions with a known oncogenic potential1. At present the VIN histological classification is widely accepted and used to classify and manage pre-cancerous vulvar lesions. The main histopathological features of VIN are disordered maturation and nuclear abnormalities (e.g. loss of polarity, pleomorphism, coarsening of nuclear chromatin, irregularities of the nuclear membrane and mitotic figures, including atypical forms) at various levels of the epithelium. The epithelial cells are typically crowded; acanthosis, parakeratosis, hyperkeratosis and features of human papillomavirus (HPV) infection may be present. In VIN 1 the dysplasia is confined to the lowest third of the epithelium. In VIN 2 the dysplasia involves the lower two thirds of the epithelium, and in VIN 3 (carcinoma in situ) the changes extend into the upper third or involve the full thickness of the squamous epithelium2.
In recent years numerous studies have shown poor inter-observer and intra-observer reproducibility for the cytological or histopathological diagnosis of the three grades of cervical intraepithelial neoplasia, and a new, more reproducible classification dividing pre-cancerous cervical lesions into low and high grade lesions has been advocated3–6. Obviously, high agreement of pathological diagnoses is of utmost importance to assure a similar standard of care for women in different geographical areas. A procedure often used by pathologists to test diagnostic agreement is to circulate a set of slides, assessing concordance among observers by means of Kappa statistics7.
To date only one study has addressed the issue of the reproducibility of histopathological classification of VIN, but not its grading8. The aim of the present report is to quantify the inter-observer agreement on the histological diagnoses and grading of VIN among six European pathologists with the use of Kappa statistics.
All cases diagnosed as VIN from January 1990 to December 1996 were retrieved from the computer database of the Department of Pathology of St. Anna Obstetrical and Gynaecological Hospital in Turin, Italy. A total of 51 cases fulfilling these criteria were found. All the corresponding specimens, fixed in formalin and embedded in paraffin, were re-cut at five microns and the sections stained with haematoxylin-eosin to avoid differing degrees of staining due to the ageing of the slides. Cases with insufficient material for adequate sectioning were excluded. Therefore, 41 slides originally classified as VIN were obtained (seven VIN 1, 16 VIN 2, 18 VIN 3; no cases of differentiated VIN were included). Slides of histologically diagnosed condylomatous lesions (n = 11) and vulvar dermatoses (n = 12), as well as three cases of superficially invasive vulvar squamous cell carcinoma9, were obtained from the corresponding blocks and added to the series. One slide (vulvar dermatosis) was damaged during circulation and was not considered for the analysis, so that a final number of 66 slides formed the study group.
No more than one slide with a single section was available for each specimen, and the lesions were not delineated with a marking pen nor was any part of the slide masked. The slides were randomly numbered and circulated among the participants. No information about age, clinical aspect of the lesions or original diagnosis was provided to the participants. The pathologists did not meet before the slide review to discuss diagnostic criteria, and the results of interpretation were not examined during the course of the study.
Pathologists were asked to classify every biopsy for the presence or absence of the parameters listed in Table 1 within three weeks of receiving the slides. Diagnoses were based on published criteria2,9,10. No specific instructions or guidelines were agreed upon in advance in relation to the adequacy of the specimens. A full spectrum of diagnoses ranging from normal to invasive carcinoma was available for each case. No attempt to reach a final consensus among the participants was made. Data were entered into a computer database (Microsoft Access) and analysed using the statistical package SAS (SAS Software, Version 6–11, SAS Institute Inc, Cary, North Carolina, 1995).
Table 1. Evaluation form: histopathological parameters considered for examination of slides and statistical analysis. VIN = vulvar intraepithelial neoplasia; HPV = human papillomavirus.
| Histopathological parameter | Code | Category |
| --- | --- | --- |
| Atypical cytological pattern | P | Present |
| Neoplastic architectural pattern | P | Present |
| VIN grade | 0 | No VIN |
| | 1 | VIN 1 |
| | 2 | VIN 2 |
| | 3 | VIN 3 |
| | CA | Invasive carcinoma |
| HPV histopathologic associated findings | P | Present |
The main aim of the study was to estimate agreement, and a sample size of 70 was required to estimate κ with a standard error (SE) of 0.08. A second aspect of the study design was to ensure that the sample size was large enough to have 95% power to detect a value of the κ statistic equal to 0.3 at the 5% significance level11.
The degree of agreement between each pair of pathologists was assessed by calculation of the Kappa statistic with respect to each histopathological variable7,12. Kappa (κ) is a statistical measure of agreement not requiring any assumption about the ‘correct’ diagnosis, expressed as a coefficient ranging between −1.0 and +1.0:

κ = (Po − Pe) / (1 − Pe)

where Po is the proportion of observed agreement and Pe is the overall proportion of chance-expected agreement. Perfect agreement corresponds to a coefficient κ = +1.0; a value of κ = 0.0 indicates chance agreement only. As a rough guideline, Landis and Koch13 indicated that values of 0.4–0.6 suggest moderate agreement, 0.6–0.8 substantial agreement, and values in excess of 0.8 reflect excellent correspondence.
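As an illustration of the computation, the following minimal Python sketch derives κ from two raters' classifications; the gradings shown are hypothetical examples, not data from the study.

```python
from collections import Counter

def cohen_kappa(rater1, rater2):
    """Unweighted Cohen's kappa: (Po - Pe) / (1 - Pe)."""
    n = len(rater1)
    # Po: proportion of slides on which the two raters agree exactly.
    p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Pe: chance-expected agreement from each rater's marginal frequencies.
    m1, m2 = Counter(rater1), Counter(rater2)
    p_e = sum(m1[c] * m2[c] for c in m1) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical gradings of ten slides by two pathologists
# (0 = no VIN, 1-3 = VIN grade, 4 = invasive carcinoma).
a = [0, 0, 3, 3, 2, 1, 0, 3, 3, 4]
b = [0, 1, 3, 3, 3, 1, 0, 3, 2, 4]
print(round(cohen_kappa(a, b), 3))  # Po = 0.7, Pe = 0.26 -> about 0.595
```

Note that a `Counter` returns zero for absent categories, so categories used by only one rater contribute nothing to Pe.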
The Kappa coefficient measures the agreement between each pair of pathologists, but it is also necessary to compare and combine the different estimates of Kappa. The overall κ summarises in a single coefficient the κ values relative to the different pairs of pathologists, giving a measure of overall agreement. This is a weighted average of the individual κ values for each pair of pathologists, where the weights are the inverses of the variances of the individual κ values. The standard errors for the overall Kappa were obtained using the bootstrap method based upon 1000 samples14. This involves repeated sampling from the observed data to generate a sampling distribution for the overall Kappa, from which the standard error is evaluated. Furthermore, the hypothesis that the underlying values of κ corresponding to each pair of pathologists are equal was tested. This is a χ2 test requiring independent samples, which we do not have, so a simulation test was used to take into account the dependence among the individual κ values14.
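The two ideas above — an inverse-variance weighted overall κ and a bootstrap standard error — can be sketched as follows in Python; the numbers are illustrative, not the study's estimates.

```python
import random

def overall_kappa(kappas, ses):
    """Combine pairwise kappa estimates into a single overall coefficient,
    weighting each pair by the inverse of its variance (1/SE^2) so that
    more precise estimates contribute more."""
    weights = [1.0 / se**2 for se in ses]
    return sum(w * k for w, k in zip(weights, kappas)) / sum(weights)

def bootstrap_se(data, statistic, n_boot=1000, seed=1):
    """Bootstrap SE: resample the observed units (slides) with replacement
    n_boot times, recompute the statistic on each replicate, and report
    the standard deviation of the replicates."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(n_boot):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        reps.append(statistic(sample))
    mean = sum(reps) / n_boot
    return (sum((r - mean) ** 2 for r in reps) / (n_boot - 1)) ** 0.5

# With equal SEs the weighted average reduces to the simple mean.
print(round(overall_kappa([0.60, 0.70], [0.08, 0.08]), 3))
```

The inverse-variance weighting is what lets a single coefficient summarise fifteen pairwise comparisons without letting noisy pairs dominate.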
In its original form, the Kappa statistic measures exact agreement between two raters for nominal variables; when used with ordinal data it is not influenced by the magnitude of disagreement and attributes equal importance to all disagreements. To correct this deficiency, weighted κ statistics have been devised and recommended15. Unweighted κ values depend on the number of categories, and weighted κ values are strongly influenced by the disagreement weights applied. For VIN grade a weighted κ was evaluated in which the relative seriousness of each possible disagreement was quantified7,16. This measure is based on the concept that in any ordered scale composed of categories representing increasing severity of abnormalities, some possible disagreements are more serious than others. If two observations differ by more than one category, their disagreement should be given more weight than if they differ by only one category5,17. In our case the distance between categories, where adjacent categories have unit distance, gives the level of dissimilarity.
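A linearly weighted κ of the kind described above can be sketched as follows (Python; the gradings are hypothetical examples on a five-point ordinal scale):

```python
from collections import Counter

def weighted_kappa(rater1, rater2, n_categories):
    """Cohen's kappa with linear weights on an ordered scale: a disagreement
    of two categories counts twice as heavily as a one-category disagreement.
    Categories are coded as integers 0 .. n_categories - 1."""
    n = len(rater1)
    max_d = n_categories - 1
    w = lambda i, j: 1 - abs(i - j) / max_d  # agreement weight per cell
    # Observed weighted agreement.
    p_o = sum(w(a, b) for a, b in zip(rater1, rater2)) / n
    # Chance-expected weighted agreement from the marginal distributions.
    m1, m2 = Counter(rater1), Counter(rater2)
    p_e = sum(m1[i] * m2[j] * w(i, j)
              for i in range(n_categories)
              for j in range(n_categories)) / n**2
    return (p_o - p_e) / (1 - p_e)

# Hypothetical VIN gradings of eight slides
# (0 = no VIN, 1-3 = VIN grade, 4 = invasive carcinoma).
a = [0, 1, 3, 3, 2, 0, 4, 3]
b = [0, 2, 3, 2, 3, 0, 4, 3]
print(round(weighted_kappa(a, b, 5), 3))
```

On the same data the unweighted κ would be lower, since every one-category disagreement above is penalised in full rather than in proportion to its distance.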
The distribution of the collected data is shown in Table 2. Some missing classifications were found, but missing values account for less than 5% for each parameter. Most diagnoses belonged to either the ‘no VIN’ or the ‘VIN 3’ category, and an atypical cytological pattern and a neoplastic architectural pattern were identified in the majority of cases.
Table 2. Summary of slide classification by each pathologist according to each histopathological parameter*. Values are given as n. VIN = vulvar intraepithelial neoplasia; HPV = human papillomavirus.
| Histopathological parameter | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| Specimen adequacy | | | | | | |
| Gross categorisation | | | | | | |
| Atypical cytological pattern | | | | | | |
| Neoplastic architectural pattern | | | | | | |
| VIN grade: | | | | | | |
| No VIN | 18 | 26 | 18 | 25 | 24 | 21 |
| VIN 1 | 6 | 1 | 6 | 4 | 4 | 0 |
| VIN 2 | 9 | 6 | 4 | 16 | 3 | 3 |
| VIN 3 | 25 | 28 | 32 | 18 | 30 | 39 |
| Invasive carcinoma | 4 | 5 | 6 | 3 | 5 | 3 |
| HPV associated findings | | | | | | |
Table 3 shows the overall κ for the categories examined. The poor agreement for specimen adequacy improved when sub-optimal and inadequate specimens were considered as one category, reaching an overall κ of 0.34 (SD 0.05) (range 0.17–0.56). Poor agreement resulted for associated findings of human papillomavirus, with an overall κ of 0.26 and three comparisons with Kappa < 0.15. Unweighted κ values were calculated when more than two categories were present: specimen adequacy 0.22 (range 0.09–0.45) and VIN grade 0.46 (range 0.30–0.65). Agreement for all the other variables was high, and each comparison was significantly different from zero (P < 0.01). Good homogeneity among the pairs of observers was found for most variables, but not for neoplastic architectural pattern (values ranging from 0.39 to 0.84).
Table 3. Overall κ and range for the examined categories. VIN = vulvar intraepithelial neoplasia; HPV = human papillomavirus.

| Category | Overall κ | Range |
| --- | --- | --- |
| Gross categorisation | 0.675 | 0.516–0.840 |
| Atypical cytological pattern | 0.616 | 0.533–0.690 |
| Neoplastic architectural pattern | 0.610 | 0.390–0.838 |
Unweighted κ values are shown in Table 4, reflecting the direct comparison between each pair of pathologists' grading of the slides, irrespective of how far apart the grading is.
Table 4. Inter-observer agreement for VIN grade: unweighted κ. Values are given as κ (asymptotic standard error). VIN = vulvar intraepithelial neoplasia.
| | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- |
| A | 0.457 (0.082) | 0.463 (0.081) | 0.322 (0.081) | 0.503 (0.080) | 0.583 (0.076) |
| B | − | 0.478 (0.076) | 0.505 (0.076) | 0.559 (0.077) | 0.615 (0.078) |
| C | − | − | 0.304 (0.076) | 0.443 (0.079) | 0.421 (0.078) |
| D | − | − | − | 0.499 (0.079) | 0.368 (0.068) |
Weighted κ values for VIN grade between each pair of observers are shown in Table 5. These values ranged from 0.52 to 0.75, with only two values < 0.6 and three values > 0.7. An overall value of 0.65 (0.015) was achieved, and there was no evidence that the individual κ values among the six raters were grossly different from each other. The κ and weighted κ statistics were also evaluated in a different classification aggregation, considering only four categories, with VIN 2 and VIN 3 as one category. This aggregation was done for clinico-pathological reasons. The agreement improved slightly, as one would expect when adjacent categories are combined, and a weighted overall value of 0.68 was achieved. Exploratory analyses were conducted to identify particular subgroups of slides, or particular categories, showing better (or worse) agreement, but none was found.
Table 5. Inter-observer agreement for VIN grade: weighted κ. Values are given as κ (asymptotic standard error). VIN = vulvar intraepithelial neoplasia.
| | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- |
| A | 0.616 (0.078) | 0.627 (0.072) | 0.561 (0.074) | 0.646 (0.076) | 0.617 (0.078) |
| B | − | 0.664 (0.071) | 0.664 (0.058) | 0.725 (0.063) | 0.745 (0.067) |
| C | − | − | 0.522 (0.076) | 0.632 (0.077) | 0.641 (0.077) |
| D | − | − | − | 0.636 (0.067) | 0.604 (0.067) |
Table 6 shows the frequency and the degree of agreement for VIN grade by a symmetrical agreement matrix showing the number of paired classifications. A total of 614 of 966 (63.6%) paired classifications lie on the main diagonal, representing exact agreement between pathologists. Some observations lie far from the diagonal, representing major disagreement between two pathologists: for example, six observations were classified as invasive carcinoma by one pathologist and ‘no VIN’ by the other, and 46 were classified as VIN 3 and ‘no VIN’. Data from the paired classification analysis indicate that 56.7% (311/548) of all VIN 3 diagnoses are concordant and that, if VIN 2 and VIN 3 diagnoses are collated into one diagnostic class, the rate of paired agreement reaches 73.9% (452/612). Only 5.0% (5/100) of VIN 1 diagnoses are concordant, indicating a very low degree of agreement for this diagnosis.
Table 6. Number of paired classifications for VIN grade calculated by a symmetrical agreement matrix. Values are given as n. VIN = vulvar intraepithelial neoplasia.
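Paired-agreement rates of this kind can be read off a symmetrical agreement matrix as sketched below (Python; the matrix shown is a small hypothetical example, not the data of Table 6):

```python
def paired_agreement(matrix, merge=None):
    """Proportion of concordant paired classifications in a symmetrical
    agreement matrix, where entry (i, j) counts pairs graded i by one
    pathologist and j by the other. `merge` optionally maps categories
    onto a coarser scale (e.g. collating VIN 2 and VIN 3) before counting."""
    k = len(matrix)
    merge = merge if merge is not None else list(range(k))
    agree = total = 0
    for i in range(k):
        for j in range(k):
            total += matrix[i][j]
            if merge[i] == merge[j]:
                agree += matrix[i][j]
    return agree / total

# Hypothetical 3x3 matrix over (no VIN, VIN 2, VIN 3).
m = [[40, 5, 3],
     [5, 10, 8],
     [3, 8, 30]]
print(round(paired_agreement(m), 3))                   # exact agreement
print(round(paired_agreement(m, merge=[0, 1, 1]), 3))  # VIN 2+3 collated
```

Exact agreement is the diagonal sum over the matrix total; collapsing VIN 2 and VIN 3 moves their mutual off-diagonal cells onto the "agreement" side, which is why the collated rate is always at least as high.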
Treatment of vulvar intraepithelial neoplasia is a challenge for the gynaecologist. No standard treatment modality has been proposed, and therapy relies on a combination of multiple factors, such as VIN grade, age of the woman, site and size of the lesions, and symptomatology. In this context histopathological diagnosis is one of the keystones for decisions affecting the clinical management of patients, and a clear and reliable pathology report is needed18,19.
The use of standardised terminology and nomenclature such as the ISSVD classification of VIN represented a good starting point to obtain uniform terminology and standardised written criteria and to avoid ambiguous and imprecise terms. Adoption of the VIN terminology recalls the analogy with cervical intraepithelial neoplasia, and the VIN 1 to 3 histological grading system carries the implicit inference of a biological continuum which may end with the development of an invasive carcinoma. This seems not to be the case. In fact, more VIN 3 lesions are diagnosed than milder cases, which is not expected if the lesion developed through a continuum of worsening pathological changes; in addition, progression of low to high grade VIN occurs only occasionally20. An explanation may be our inability to detect early changes of VIN, which are often asymptomatic and can resemble non-neoplastic disorders of the vulva or condylomatous lesions21,22. However, even in the largest series of all grades of VIN published so far, the distribution of the different VIN grades (148 VIN 1, 53 VIN 2 and 169 VIN 3) adds further evidence against a biologic continuum linking the three grades of vulvar intraepithelial neoplasia23. It appears that the current terminology is not supported by a biologic background.
A pathologic classification should be relevant, easily understood, highly reproducible and clinically useful24. The present VIN terminology also lacks full reproducibility. The results of the present study show that the overall degree of agreement is good (overall weighted κ value = 0.65), but not excellent. As these measures of agreement are based upon samples with an uneven spread (most observations were ‘no VIN’ or ‘VIN 3’), we would not expect such good agreement if the distribution of samples over the five categories were more even. Interestingly, on closer examination of the data, the degree of agreement on the VIN 3 diagnosis, in terms of the number of slides classified as VIN 3 by all observers, is high (Table 6): data from the paired classification analysis indicate that 56.7% of all VIN 3 diagnoses are concordant and that, if VIN 2 and VIN 3 diagnoses are collated into one diagnostic class, the rate of paired agreement reaches 73.9%. These observations indicate that VIN 3 is a well defined and reproducible category; in addition, an optimal diagnostic category is obtained by collating diagnoses of VIN 2 and VIN 3 into a single pathologic entity. Clinical data are also consistent with this type of categorisation. This type of diagnostic grouping has recently been introduced in the cervical cytology reporting system, in which two diagnostic categories, low and high grade intraepithelial lesions, are at present classified25. However, while our data indicate good diagnostic agreement for ‘high grade’ VIN, they do not support a diagnostic category of ‘low grade’ VIN (only 5.0% of VIN 1 diagnoses are concordant). The poor agreement lies in the difficulty of discriminating VIN 1 from the neighbouring classes, in particular the benign category (57.0% of the paired observations are classified as ‘no VIN’).
As regards discrimination between condylomatous lesions and VIN 1, it must be appreciated that some pathologists tend to classify flat condyloma acuminatum as VIN 1, so it is not surprising that there were difficulties in separating condylomas from VIN 1 lesions.
Disagreement at the lower end of the spectrum of VIN grading is in keeping with the results published for intraepithelial neoplasia of other sites, such as the uterine cervix3,4,6, the prostate26,27 and the anus28. Data from the present investigation on the reproducibility of the VIN 1 diagnostic category are even worse. Clinical data also are not consistent with the identification of a pathologic process biologically classified as ‘low grade’ VIN.
In this study, low Kappa values emerged from the analysis of two other parameters: specimen adequacy and associated findings of human papillomavirus. The impact of sub-optimal slides can be viewed in two ways: in most laboratories diagnoses are reported after examining sections at different levels, whereas in the present study diagnoses were reached through the examination of only one section. In fact, when the initial examination suggests morphologic heterogeneity, additional levels are routinely cut into the blocks. Diagnostic discrepancy might also stem from histotechnical factors. The most critical step in processing the specimen is proper specimen orientation for perpendicular sectioning through the skin surface; lack of full thickness epithelium does not allow separation between different intraepithelial levels of atypia. Finally, lack of clinical context (clinical history and findings on gross examination) could also have contributed to variability: pertinent clinical information provided by the physician further enhances the pathologic interpretation and reduces the possibility of misinterpretation. We think that a statement on specimen adequacy can contribute to better agreement among observers.
There was poor agreement concerning the presence of infection with human papillomavirus (overall κ value = 0.26), but the relative importance of the different pathological parameters which can concur in the diagnosis of changes associated with human papillomavirus was not specified. Similar results were reported by Robertson et al.4 for the identification of changes caused by human papillomavirus in cervical intraepithelial neoplasia. Attempts to discriminate between reactive cellular anomalies, nuclear changes associated with basal cell hyperplasia, and features of infection with human papillomavirus associated with nuclear enlargement, irregularity and hyperchromasia may lead to diagnostic confusion with VIN, thus exaggerating the severity of the premalignant appearance. Good reproducibility would have been surprising in the absence of a clearly defined threshold. In the present series we had no gold standard for the typing of human papillomavirus, and the issue of its presence may be an irrelevant discrimination in that the majority of VIN lesions are papillomavirus-related, and no slides of VIN originally classified as differentiated were included.
However, we purposely avoided pre-study agreement on a defined set of criteria in order to recreate as closely as possible the conditions that pathologists encounter in daily practice, knowing well that part of the variability is due to the lack of prior agreement. We believe that a clear definition of human papillomavirus-related lesions in the presence of a disease process affecting the vulva could contribute to more reproducible diagnostic classes.
In conclusion, the current terminology for VIN retains an overall good reproducibility. There is good agreement for diagnosis of ‘high grade’ VIN (VIN 2 and 3), but the diagnostic category of VIN 1 is not reproducible. VIN 1 diagnosis also lacks biological and clinical background. Criteria for the definition of specimen adequacy and classification of changes associated with human papillomavirus are needed to improve the reproducibility of the VIN pathologic classification.
Participants in the study
Six European consultant pathologists participated in the study. They were: Professor C. Bergeron (Paris, France), Professor H. Fox (Manchester, UK), Dr B. Ghiringhello (Turin, Italy), Professor R. Kurzl (Munich, Germany), Professor J. Prat (Barcelona, Spain) and Professor G. Taddei (Florence, Italy). No participating pathologists reviewed cases of VIN together in prior studies.
The authors would like to thank Mrs M. T. Ritroso (Department of Pathology St. Anna Hospital, Turin, Italy) for her technical assistance in cutting blocks and preparing slides.