Since its introduction more than 40 years ago, the Gleason grading system of classifying and stratifying structural patterns of prostatic tumor cells has become well established as an important tool both in the early detection of prostate cancer and in the prognosis of clinical outcomes following radical prostatectomy [1,2]. Because of its proven prognostic value, Gleason grade is currently used by urologists worldwide to inform clinical management for patients with prostate cancer.
Many studies have investigated the reproducibility of the Gleason grading system across pathologists when used to analyse prostate biopsies [3,4] and its correlation with final prostatectomy specimens [5–7]. Over the years, the Gleason system has been modified to improve its prognostic relevance in the context of a constantly evolving range of prostate cancer detection methods, including the introduction of PSA testing, transrectal ultrasound (TRUS)-guided prostate needle biopsy, and improved immunohistochemistry aiding atypical cell identification [2,8,9].
In 2005, the International Society of Urological Pathology (ISUP) held a consensus conference of experts and updated the Gleason grading system in an effort to improve reproducibility with a universal method of grading and to optimize the prediction of clinical outcomes for patients with prostate cancer. Specifically, the panel recommended that, for needle biopsies, tumours should be graded by the primary and highest-grade pattern, a change from the previous protocol of primary and secondary pattern. For full gland prostatectomy specimens the panel agreed that the grade should be based on the primary and secondary patterns with a comment about the tertiary pattern. A recent study described the important predictive value of the tertiary Gleason component in predicting biochemical recurrence following prostatectomy.
Studies have investigated the effects of the 2005 recommendation on interobserver agreement in prostatectomy Gleason score (pGS) and the impact of the modified grading of needle core biopsy specimens on improving prediction of biochemical (PSA) progression-free outcome. To our knowledge, no studies have investigated the impact of the modified grading system on inter-laboratory agreement of biopsy Gleason score (bGS) and the effect of re-evaluation on accuracy in predicting the true underlying histopathology.
The bGS is an important tool in categorizing disease severity, ultimately influencing clinical management. In order to remain effective, the system must be reproducible across pathologists and accurately agree with the grade of the final prostatectomy specimen. We aimed to investigate the effects of the 2005 ISUP recommendation on inter-observer correlation by comparing agreement between community pathological laboratories and our institution’s internal re-evaluation of positive TRUS core prostate biopsies. We also evaluated the agreement of bGS with pGS and the impact of re-grading on the accurate prediction of the true underlying tumour architecture.
MATERIALS AND METHODS
A retrospective analysis was performed on men who underwent robotic-assisted radical prostatectomy (RARP) by two surgeons from March 2005 to July 2009 as primary treatment for prostate cancer. All study data were collected and analysed according to an approved institutional review board protocol. No patients received neo-adjuvant treatment, including androgen deprivation or radiation therapy. All patients included in the study underwent an initial TRUS prostate needle core biopsy (range 2–6 cores) outside our institution, which was evaluated by the respective outside laboratory. Although standard of care recommends ≥10 cores for accurate sampling, patients with <10 cores were included, given that only positive cores were analysed. For men to be included in the study, carcinoma of the prostate needed to be detected in at least one core. Positive cores were then re-evaluated for % cancer involvement and Gleason grade by a single GU pathologist from a team consisting of seven dedicated GU pathologists at a single academic institution prior to surgery.
Outside institutions from which our department received consultation cases were categorized as academic hospitals, community hospitals, surgical centres, or private laboratories, as follows. Hospitals were classified as academic if they included an institutional review board as well as a research centre or institute; all other hospitals were classified as community hospitals. Surgical centres were defined as non-hospital sites at which practitioners perform surgical procedures. Diagnostic facilities without patient-care areas were considered private laboratories.
Radical prostatectomy specimens were prepared according to the protocol described in our institution’s Surgical Pathology Laboratory Manual. Prostates were serially sectioned from apex to base, with at least four transverse cuts, in order to separate at least five levels. Four samples (right anterior, left anterior, right posterior and left posterior) from alternate levels were submitted for histological analysis. The proximal and distal urethral margins of resection, right and left seminal vesicles, and all lymph nodes were also submitted for microscopy. Haematoxylin- and eosin-stained slides were prepared from paraffin-embedded tissue samples and examined by an attending GU surgical pathologist. Slides from radical prostatectomy specimens were evaluated for Gleason score. Slides from needle core biopsy specimens were evaluated for lengthwise percentage of cores involved and Gleason score. All needle core biopsy and radical prostatectomy specimens were graded at low-power magnification (4× to 10×) according to the Gleason grading system [1,2,8,14] and the International Society of Urological Pathology Consensus Conference guidelines. Figure 1 shows characteristic Gleason patterns 3–5 as graded by our pathologists.
Figure 1. Images a–f depict Gleason patterns 3–5. Gleason score was assigned according to the original descriptions of Gleason et al. [1,2,8,14], as modified by Epstein et al. Images a, b: Gleason pattern 3. Glands vary in size and shape (although they are usually smaller than Gleason pattern 1 or 2 glands). Individual glands are circumscribed and constitute distinct units. Images c, d: Gleason pattern 4. Glands are fused, poorly defined with only occasional lumen formation, or cribriform. Images e, f: Gleason pattern 5. Solid sheets, cords of tumour cells, or single tumour cells with virtually no glandular differentiation. All microscopic photographs were taken at 10× magnification.
Statistical analyses were conducted to assess agreement of the positive cores examined and the accuracy of predicting final surgical pathology. Percentage carcinoma involvement and bGS were compared between outside laboratories and our institution for each biopsy. In comparing pre-operative biopsy evaluation with final surgical pathology, the bGS assigned to each patient was defined as the greatest combined Gleason score present in all positive cores. Kappa (κ) statistics for agreement and linear regression analysis were used for comparisons of categorical variables (i.e. Gleason score) [15,16], and the coefficient of concordance was used for comparisons of continuous variables (i.e. % core involvement). Accuracy of agreement was determined based on the κ coefficient, which adjusts for agreement expected by chance, with 0 denoting no agreement, <0.4 denoting marginal agreement, 0.4–0.75 denoting good agreement and >0.75 denoting excellent agreement. Weighted kappa values (wκ) were calculated in order to include the relative difference between categories in the quantification of agreement. In this way, a close disagreement (i.e. bGS within ± 1) is given a greater weight than a more serious disagreement. Stata 10 software was used to perform all the statistical analyses.
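The agreement statistics described above can be illustrated with a minimal Python sketch. It is not the Stata code used in the study. The function names, the linear weighting scheme (1 − |i − j|/(k − 1)), the handling of negative κ as "none", and the example scores are all assumptions made for illustration; the text does not state which weighting scheme was applied.

```python
def patient_bgs(core_scores):
    """Patient-level bGS: the greatest Gleason sum among the positive cores."""
    return max(core_scores)


def cohen_kappa(rater_a, rater_b, categories, weighted=False):
    """Cohen's kappa for two raters; weighted=True applies linear weights.

    Linear weights are an assumption here (the source does not name a scheme).
    Undefined (division by zero) if expected agreement is 1 or, for the
    weighted variant, if there is only one category.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    n = len(rater_a)
    index = {c: i for i, c in enumerate(categories)}

    # Cross-tabulate the two sets of ratings.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        counts[index[a]][index[b]] += 1
    row = [sum(counts[i]) for i in range(k)]
    col = [sum(counts[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        if weighted:
            return 1 - abs(i - j) / (k - 1)  # near misses earn partial credit
        return 1.0 if i == j else 0.0

    cells = [(i, j) for i in range(k) for j in range(k)]
    p_obs = sum(weight(i, j) * counts[i][j] for i, j in cells) / n
    p_exp = sum(weight(i, j) * row[i] * col[j] for i, j in cells) / (n * n)
    return (p_obs - p_exp) / (1 - p_exp)


def agreement_band(kappa):
    """Bands as stated in the text; treating kappa <= 0 as 'none' is an assumption."""
    if kappa <= 0:
        return "none"
    if kappa < 0.4:
        return "marginal"
    if kappa <= 0.75:
        return "good"
    return "excellent"


# Hypothetical Gleason sums for eight cores (not study data).
outside = [6, 7, 7, 8, 6, 7, 9, 7]
internal = [6, 7, 8, 8, 7, 7, 9, 6]
scores = [6, 7, 8, 9]
print(cohen_kappa(outside, internal, scores))
print(cohen_kappa(outside, internal, scores, weighted=True))
print(patient_bgs(internal))
```

With only two categories the linear weights reduce to exact-match weights, so the weighted and unweighted values coincide; the two statistics diverge only when categories are ordered and near misses are possible, as with Gleason sums.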
A total of 130 patients were identified, of whom 100 (77%) had complete data available on initial outside needle core biopsy pathological read, internal re-read of biopsy, and final pathological read of prostatectomy specimen. A total of 331 positive core needle biopsies were included (mean 3.3 per patient). Table 1 summarizes the study cohort demographic and pathological characteristics. No significant differences were seen across cohorts in age, race, total number of positive cores, pGS, or duration of time between biopsy and RARP. On average, surgical centres took more initial needle core biopsies than the other outside laboratories, with a mean of 12.3 biopsies per patient (P < 0.001).
Table 1. Patient characteristics
| ||Academic hospitals||Community hospitals||Surgical centres||Private laboratories||Total||P-value|
|No. of patients|| 7||14||15||64||100|| |
|Age at TRUS biopsy|| || || || || || |
| Mean (sd)||56.57 (8.50)||60.36 (7.46)||60.60 (6.99)||58.89 (7.25)||59.19 (7.29)||0.7591*|
| Min-Max||40–65||46–76||51–72||44–71||40–76|| |
|Race no. (%)|| || || || || || |
| Black||0 (0.00)||1 (7.14)||0 (0.00)||13 (20.31)||14 (14.00)||0.171**|
| White||7 (100.00)||10 (71.43)||13 (86.67)||37 (57.81)||67 (67.00)|| |
| Other||0 (0.00)||3 (21.43)||2 (13.33)||14 (21.88)||19 (19.00)|| |
|Total no. cores|| || || || || || |
| Mean (sd)||8.57 (3.21)||9.07 (3.41)||12.27 (1.67)||9.36 (3.72)||9.7 (3.54)||<0.001*|
| Min-Max||4–12||4–15||8–16||1–14||1–16|| |
|Total no. positive cores|| || || || || || |
| Mean (sd)||4.71 (3.50)||3.50 (1.83)||4.53 (3.04)||2.88 (2.01)||3.34 (2.36)||0.4354*|
| Min-Max||1–10||1–7||1–10||1–9||1–10|| |
|Biopsy Gleason score|| || || || || || |
| 3 + 3||2 (28.57)||2 (14.29)||4 (26.67)||17 (26.56)||25 (25.00)||0.882**|
| 3 + 4||4 (57.14)||7 (50.00)||6 (40.00)||26 (40.63)||43 (43.00)|| |
| 4 + 3||1 (14.29)||5 (35.71)||2 (13.33)||12 (18.75)||20 (20.00)|| |
| 4 + 4||0 (0.00)||0 (0.00)||2 (13.33)||7 (10.94)||9 (9.00)|| |
| 4 + 5||0 (0.00)||0 (0.00)||1 (6.67)||2 (3.13)||3 (3.00)|| |
|Time from Bx to Sx (days)|| || || || || || |
| Mean (sd)||77.29 (36.02)||81.36 (26.25)||76.60 (30.49)||95.88 (67.22)||89.65 (56.98)||0.9526*|
| Min-Max||44–119||45–132||13–134||20–375||13–375|| |
Overall concordance for % core involvement of carcinoma between all outside laboratories (initial read) and our internal team of pathologists (re-read) was 0.925 (95% CI: 0.907, 0.943) with a Pearson coefficient of 0.932 (P < 0.001). No significant differences in agreement on cancer size were seen across cohorts. Of the 331 positive biopsies, perfect agreement in bGS was seen in 222 (67.1%).
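To show how a concordance coefficient differs from the Pearson coefficient reported alongside it, here is a minimal Python sketch. It assumes that the "coefficient of concordance" refers to Lin's concordance correlation coefficient, which the text does not confirm. The function names and the example percentages are hypothetical.

```python
import math


def _moments(x, y):
    """Means, population variances and covariance of two paired samples."""
    if len(x) != len(y) or not x:
        raise ValueError("samples must be non-empty and of equal length")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    return mx, my, sxx, syy, sxy


def pearson_r(x, y):
    """Pearson correlation: strength of linear association only."""
    _, _, sxx, syy, sxy = _moments(x, y)
    return sxy / math.sqrt(sxx * syy)


def lins_ccc(x, y):
    """Lin's concordance correlation: penalises shifts away from the y = x line."""
    mx, my, sxx, syy, sxy = _moments(x, y)
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical % core involvement from two reads (not study data).
outside_pct = [10, 20, 30]
internal_pct = [20, 30, 40]
print(pearson_r(outside_pct, internal_pct))
print(lins_ccc(outside_pct, internal_pct))
```

In this toy example the second read is consistently 10 points higher than the first, so the Pearson coefficient stays at its maximum while the concordance coefficient drops. Reporting both helps separate association from actual agreement.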
Table 2 summarizes the breakdown of agreement, listing % of perfect agreement, κ and wκ with corresponding P-values across cohorts on the pathological grade of needle core biopsies from outside laboratories and our internal re-read. The overall bGS agreement (κ) between all outside laboratories and our internal pathologists was 0.55 (P < 0.001 for all calculations of κ). Overall perfect agreement between outside laboratories and our pathologists for primary Gleason grade was 92.0% (κ= 0.77), and perfect agreement for secondary Gleason grade was 71.9% (κ= 0.44).
Table 2. Percentage agreement, κ score and weighted κ score, stratified by cohorts for outside biopsy Gleason score (bGS) (initial read) and internal bGS (re-read)
|Outside bGS vs internal bGS (re-read)||No. of cores||Agreement (%)||Kappa -κ- (s.e)||P-value||Weighted kappa -wκ- (s.e)||P-value|
|Academic hospitals||33||72.73||0.5607 (0.12)||<0.001||0.4289 (0.11)||<0.001|
|Community hospitals||49||67.35||0.5063 (0.08)||<0.001||0.6046 (0.09)||<0.001|
|Surgical centres||69||66.67||0.5868 (0.06)||<0.001||0.7124 (0.07)||<0.001|
|Private laboratories||180||66.11||0.5305 (0.04)||<0.001||0.6219 (0.04)||<0.001|
|Overall||331||67.07||0.5512 (0.03)||<0.001||0.6424 (0.03)||<0.001|
Perfect agreement of the highest bGS (n= 100) with pGS was seen in 49/100 cases (49%) for outside laboratories and in 55/100 cases (55%) for our dedicated GU pathologists. No effect was seen on accuracy with increased time to surgery, although the data suggest a statistically significant relationship between increased number of core needle biopsies and improved accuracy at predicting pGS for both outside laboratories (P= 0.04) and our internal pathologists (P= 0.002).
Accuracy in predicting pGS using the highest bGS was κ= 0.30 for all outside laboratories and κ= 0.39 for the internal re-read (P < 0.001). Individual outside laboratories had agreement with pGS ranging from κ= 0.05 to κ= 0.49 (Table 3). When comparing all the highest bGS (n= 100) between all outside laboratories and the internal re-read, perfect agreement was seen in 79% of cases with κ= 0.69 (P < 0.001).
Table 3. Percentage agreement, κ score and weighted κ score, stratified by cohorts for maximum outside biopsy Gleason score (bGS) (initial read) and internal bGS (re-read) vs prostatectomy Gleason score (pGS)
| ||No. of cores||Agreement (%)||Kappa -κ- (s.e)||P-value||Weighted kappa -wκ- (s.e)||P-value|
|Internal bGS vs pGS||100||55.00||0.3875 (0.06)||<0.001||0.4963 (0.06)||<0.001|
|Outside bGS vs pGS||100||49.00||0.3042 (0.05)||<0.001||0.4293 (0.06)||<0.001|
| Academic Hospitals||7||42.86||0.0968 (0.24)||0.3419||0.2222 (0.21)||0.1450|
| Community Hospitals||14||28.57||0.0476 (0.13)||0.3560||0.1515 (0.13)||0.1219|
| Surgical Centers||15||60.00||0.4886 (0.13)||<0.001||0.6097 (0.15)||<0.001|
| Private Laboratories||64||51.56||0.3331 (0.07)||<0.001||0.4494 (0.08)||<0.001|
|Outside bGS vs Internal bGS (re-read)||100||79.00||0.6875 (0.06)||<0.001||0.7593 (0.07)||<0.001|
| Academic Hospitals||7||85.71||0.7200 (0.29)||0.007||0.5333 (0.28)||0.027|
| Community Hospitals||14||71.43||0.5520 (0.18)||0.001||0.6627 (0.18)||<0.001|
| Surgical Centers||15||80.00||0.7321 (0.14)||<0.001||0.7842 (0.16)||<0.001|
| Private Laboratories||64||79.69||0.6949 (0.08)||<0.001||0.7766 (0.08)||<0.001|
Re-evaluation of needle biopsies resulted in 35 Gleason sum downgrades (10.6%), 76 upgrades (23%), and no change in the remaining 220 cores (66.5%). No significant differences were seen in the propensity for changing initial bGS across the laboratory subgroups (P= 0.46). When re-evaluation of a needle core biopsy resulted in a change in the original bGS (either upgrade or downgrade), the agreement of the revised score with pGS was κ= 0.29, an improvement from the agreement of the initial (outside) bGS with pGS of κ=−0.04. When no change was made to the bGS, agreement with pGS was κ= 0.40 (P < 0.001).
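The reclassification percentages above follow directly from the reported core counts. The following minimal Python sketch shows the tally and the arithmetic. The function names are hypothetical; only the three counts (35, 76, 220) come from the text.

```python
from collections import Counter


def reclassification(outside, internal):
    """Count cores whose Gleason sum rose, fell or stayed the same on re-read."""
    counts = Counter()
    for o, i in zip(outside, internal):
        if i > o:
            counts["upgrade"] += 1
        elif i < o:
            counts["downgrade"] += 1
        else:
            counts["unchanged"] += 1
    return counts


def shares(counts):
    """Percentage of all cores in each group, rounded to one decimal place."""
    total = sum(counts.values())
    return {group: round(100 * n / total, 1) for group, n in counts.items()}


# Counts as reported in the text (331 cores in total).
reported = {"downgrade": 35, "upgrade": 76, "unchanged": 220}
print(sum(reported.values()))  # 331
print(shares(reported))        # {'downgrade': 10.6, 'upgrade': 23.0, 'unchanged': 66.5}
```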
The Gleason grading system is internationally recognized as the most accurate tool for classifying severity and determining prognosis for patients with prostate cancer [19,20]. Given the importance of the Gleason grading system in informing clinical decision-making for patients with prostate cancer, it is critical that this classification system be both reproducible across different pathologists and predictive of the true underlying tumour architecture within the gland.
Prior to the 2005 ISUP revision to the Gleason grading system, numerous studies had investigated the reproducibility of the classification tool across pathologists, with agreement values (κ) ranging from 0.41 to 0.64 for core needle biopsies [21–24]. Exact agreement between bGS and pGS was found to range between 35 and 45%. Past studies have demonstrated an improvement in reliability of grading for expert GU pathologists [22,23], higher-grade carcinomas, and extended needle biopsy schemes of mean 12.4 cores. Many factors can influence the reliability and interpretation of inter-observer agreement studies for bGS and pGS. Specifically, differences regarding definitions of agreement, statistical analyses used, types of specimens examined, number of samples included, number of participating pathologists, and expertise of pathologists and laboratories involved all contribute to difficulties in making comparisons across studies. Other factors contributing to disagreements in Gleason grading across pathologists include the degree of heterogeneity within a given tumour and the existence of architecturally borderline cases resulting in the need for subjective determinations.
The main objectives for the 2005 revision to the grading system were to improve inter-rater concordance, eliminate ambiguity, and enhance the prognostic value of needle core biopsy scores. Since the implementation of the revised grading system, studies have demonstrated improved concordance between bGS and pGS, with agreement of 72%, up from 58% for the conventional system. Although this agreement rate is higher than our observed agreement of bGS and pGS of 55%, only two dedicated uropathologists participated in the former study, which probably contributed to the high observed agreement. By contrast, our study included a team of seven dedicated uropathologists conducting all evaluations of biopsy and surgical specimens, with an unknown total number of outside pathologists participating in the study. Although this model introduces opportunity for increased variability, it is more consistent with the real-world conditions in which these evaluations occur.
A study at Johns Hopkins in 1997 compared the correlation of bGS and pGS between academic and non-academic institutions, citing perfect agreement scores of 58% and 34%, respectively. A repeat of the study design in 2008 on patients who underwent radical prostatectomy from 2002 to 2003 demonstrated improved agreement scores of 75.9% and 69.7% for Hopkins and outside pathologists, respectively, for accurately predicting pGS from bGS. In addition, agreement scores of 81.8% were seen in bGS by Hopkins and outside pathologists. The authors attributed the improved agreement scores to large decreases in the undergrading of needle biopsies, as evidenced by near elimination of Gleason sum scores of 2 to 4. These agreement scores are similar to our observed value of 79% for perfect bGS agreement between outside institutions and our academic centre, but these studies were conducted prior to the 2005 ISUP revisions to Gleason grading.
According to the Landis and Koch criteria for interpreting degree of agreement, our study demonstrates ‘good’ reproducibility bordering on ‘excellent’ between all outside laboratories and our internal re-reads of bGS with κ= 0.5512 (wκ= 0.6424). These values of κ are similar to agreement scores from past studies conducted using the classic Gleason scoring system, indicating that the revised system has not impacted inter-rater reliability. Differences across outside laboratories were observed, with greatest agreement of biopsy grading seen with surgical centres and academic hospitals (κ= 0.59 and 0.56, respectively) and lowest agreement seen among community hospitals (κ= 0.51). Such differences may be explained by differential expertise of pathologists, given that community hospitals are less likely to have dedicated GU pathologists on staff reading all prostate biopsies [22,23].
To better understand the accuracy of biopsy evaluations in predicting true underlying morphology, we compared the highest bGS from a given patient with the pGS. As expected, internal re-evaluation of prostate biopsy correlated better with final surgical pathology (κ= 0.39) than did the original read from all outside laboratories (κ= 0.30). It must be noted that the same team of pathologists read both the internal re-reads and the final surgical pathology, possibly contributing to increased accuracy as a result of improved intra-rater reliability. Of the outside laboratories, surgical centres demonstrated the highest accuracy at predicting pGS using biopsy (κ= 0.49), probably driven by the greater number of biopsy specimens per patient (12.3) and the resulting improved likelihood of detecting true pathology.
Re-evaluation of external biopsies resulted in revisions of the Gleason score in 34% of specimens, with upgrades occurring twice as often as downgrades. When a revision to the original bGS was made on re-evaluation, the agreement with pGS was improved from κ=−0.04 to 0.29. When internal pathological re-read resulted in no change to the original bGS, agreement with pGS was ‘good’ (κ= 0.40). Such findings suggest a potential benefit to re-evaluation of biopsy scores, especially in borderline cases in which the clinical plan is ambiguous. Given that studies have shown a prognostic difference between seemingly small differences in Gleason score (i.e. 3 + 4 vs 4 + 3) [29,30], even small variations in agreement can influence the clinical management of patients.
An important critique of our study design is its retrospective nature. Conditions were not established to systematically control for variables that could account for variability, including the expertise and experience of participating pathologists and the total number of outside pathologists included in the analysis. In addition, only positive needle core biopsies were included in the study for pathological re-evaluation, and no follow-up data on patients’ outcomes were included in the analysis. It would have been valuable to re-evaluate previously read negative cores in our analysis. Unfortunately, these data were not available. Although such decisions impact the accuracy and variability of the results obtained, they appropriately reflect the clinical environment in which biopsy re-evaluation currently occurs, both at our institution and at many other institutions in the United States.