Comparison of different anti‐Ki67 antibody clones and hot‐spot sizes for assessing proliferative index and grading in pancreatic neuroendocrine tumours using manual and image analysis

Ki67 proliferative index (PI) is essential for grading gastroenteric and pancreatic neuroendocrine tumours (GEP NETs). Analytical and preanalytical variables can affect Ki67 PI. In contrast to counting methodology, until now little attention has focused on the question of clone equivalence and the effect of hot‐spot size on Ki67 PI in GEP NETs. Using manual counting and image analysis, this study compared the Ki67 PI achieved using MM1, K2 and 30‐9 to MIB1, a clone which has been validated for, and is referenced in, guidelines relating to assessment of Ki67 PI in GEP NETs.


Introduction
Grading of gastroenteric and pancreatic (GEP) neuroendocrine tumours (NETs), based on assessment of mitotic count and Ki67 proliferative index (PI), is essential for subclassifying NETs, determining their prognosis and informing therapeutic decisions. [1][2][3][4][5][6][7][8][9][10] The Ki67 protein is associated with cellular proliferation and its expression is limited to active phases of the cell cycle; namely, G1, S, G2 and M phases. 11,12 Ki67 protein can be detected using immunohistochemistry and the Ki67 PI is established by expressing, as a percentage, the ratio of positively labelled tumour nuclei to the total number of tumour nuclei in a selected area. Guidelines suggest that the Ki67 index for NETs should be assessed within areas of maximal immunolabelling, referred to as proliferative 'hot-spots', consisting of between 500 and 2000 tumour cells. 2,[4][5][6][7]13,14 These suggested hot-spot sizes were originally proposed on the basis of common sense rather than evidence-based data. 15
Analytical and pre-analytical issues exist with regard to Ki67 assessment. Multiple studies have sought to establish the best method for evaluating Ki67 PI and highlighted the need for higher quality standardised counting methods. As a result, visual approximation or 'eyeballing', which is prone to considerable interobserver variation, is discouraged in favour of more reproducible methods such as manual counts using printed digitally captured images or on a computer screen. [16][17][18][19][20][21] Image analysis techniques, which overcome some of the intra- and interobserver variabilities in humans, have been shown to provide equivalent accuracy for Ki67 PI assessment compared to manual counting methods. Historically, the costs and technical requirements involved in establishing a digital infrastructure have been considered prohibitive to the adoption of digital pathology in routine practice. 16,19,22 However, as implementation costs fall, digital deployments are becoming increasingly attractive to clinical pathology laboratories. 23
Several pre-analytical variables may affect the assessment of Ki67 for which standardised practices do not exist. Ki67 immunohistochemistry produces variable staining intensity. In the past, different criteria have been applied in regard to the level of staining required to determine a positively labelled nucleus, resulting in inconsistencies in tumour grading. 24 While some studies have only counted strong or dark brown nuclear staining as positive, 17 others have accepted any degree of nuclear staining as positive. 18 Such variation in practice is important in view of the low threshold (<3% versus ≥3%) that separates grades 1 and 2 NETs, which can be influenced by very small differences in the number of cells determined to be positive. It is worth noting there is now a growing consensus among GEP NET experts that all nuclei that display positive staining (homogeneous or speckled) should be counted as positive in the setting of a high quality anti-Ki67 immunostain. 25 This is logical, as the expression of Ki67 varies during the different phases of the cell cycle, beginning in the nucleolus before being redistributed to the rest of the cell. Differences in interlaboratory staining protocols can also result in different Ki67 PI results, even when the same antibody clone is used, ultimately leading to tumour grade shifts. 20 Other pre-analytical factors, such as fixation, section thickness and the size of the proliferative hot-spot used, may also influence Ki67 PI. 26,27
There are numerous commercially available antibody clones which detect the Ki67 protein. Data available from an external quality assurance scheme in the United Kingdom (UK-NEQAS) revealed that more than 10 different clones were in use in registered clinical pathology laboratories in addition to a variety of staining protocols. 28 The MIB1 clone, which is specifically referenced in some international NET guidelines, and has been validated for NET grading, 1,2,4,13,29 was the most commonly used clone (167 of 306 laboratories) in this scheme. The use of clones other than MIB1 may result from an assumption that all clones are equivalent, a lack of awareness of the specific reference to MIB1 in guidelines, or from being unaware that multiple clones exist at all. However, studies in breast cancer cases and in tonsillar tissue have indicated that different Ki67 antibody clones do not produce equivalent proliferative indices, and this may affect reproducibility and capacity to predict prognosis in breast cancer. [30][31][32] Therefore, clone equivalence should not be assumed in GEP NET grading.
The primary aim of this study was to compare the absolute Ki67 PI and indicative tumour grade obtained using four different commercially available Ki67 antibodies under typical laboratory conditions applied to a series of well-differentiated (grades 1 and 2) pancreatic neuroendocrine tumours (PanNETs). In view of the increasing availability of whole slide imaging and image analysis software, we also sought to validate QuPath, an open source image analysis platform, for assessing Ki67 PI in PanNETs. Using digital image analysis it was also possible to provide a quantitative evaluation of the effect of hot-spot size (500, 1000 or 2000 cells) on Ki67 PI and tumour grade.
identified from the pathology archives of the Belfast Health and Social Care Trust, Northern Ireland. The most representative tumour blocks were identified and accessed through the Northern Ireland Biobank 33 (NIB study number 18-0276); 5 × 4 µm sequential sections were cut from the selected formalin-fixed, paraffin-embedded (FFPE) blocks. Each selected case had a haematoxylin and eosin (H&E) section cut to review morphology and identify and annotate the tumour area. The remaining four sections were stained with four separate anti-Ki67 antibodies (Table 1).
Each immunohistochemically (IHC)-stained section was reviewed on a Leica DM 4000B microscope at ×10 objective to identify and annotate the area of greatest Ki67 positivity, i.e. the proliferative hot-spot. A single high-power field, representing the maximal hot-spot within this area, was then identified at ×40 objective, photographed and subsequently printed on white paper using a colour printer. Ki67 positively and negatively stained tumour nuclei were separately marked on each printed image using two different colours, respectively, and the total tumour cell count and positively labelled tumour nuclei recorded. For the purpose of this study any Ki67 staining was considered positive, as outlined by Gentil Perret et al. 29 Figure 1 shows specific examples of the expression of the four Ki67 clones in a single hot-spot within a single case in the cohort. Hot-spot identification, annotation and manual counting were performed by a pancreatic pathologist with expertise in neuroendocrine tumours (P.K.). The Ki67 PI was calculated as a percentage by dividing the total number of Ki67-positive tumour nuclei by the total number of tumour nuclei in the hot-spot and multiplying by 100. The tumour grade suggested by the calculated Ki67 PI was then determined in line with the grading thresholds proposed in World Health Organisation (WHO) guidelines (<3% = grade 1; ≥3-20% = grade 2; >20% = grade 3). 5,7 For this study the MIB1 clone was chosen as the gold standard antibody, as it is recognised as a validated anti-Ki67 clone in guidelines for grade assessment in NETs. 1,2,4,13,29

Statistical analysis
The mean cell count and mean Ki67 PI were compared between MIB1 and other anti-Ki67 clones using a paired-samples t-test. The proportion of samples where a non-MIB1 clone was higher than MIB1 was calculated, and the sign test was used to test if this was different from 50%. A Bland-Altman plot of the mean against the difference was generated for MIB1 and each anti-Ki67 clone. Pearson's correlation coefficient was used to formally test for an association between the average and the difference. Limits of agreement were calculated for MIB1 and, where possible, each anti-Ki67 clone. Limits of agreement give the range within which the difference between MIB1 and each non-MIB1 clone falls for 95% of samples. In addition, the proportion of values within 1% of MIB1 Ki67 PI with 95% confidence intervals (CIs) was determined. The grade concordance on Ki67 category (<3% or ≥3%) was determined by calculating the proportion agreement and kappa (chance corrected agreement) with 95% CIs. Statistical analysis was conducted in STATA version 14/GraphPad Prism version 5.03.
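The two agreement measures described above can be sketched in a few lines. This is an illustrative Python sketch only, not the STATA or GraphPad Prism procedures used in the study; the function names and the example values are hypothetical, and it assumes the conventional Bland-Altman limits (mean difference ± 1.96 standard deviations) and Cohen's kappa for a binary category (Ki67 PI <3% versus ≥3%).

```python
from statistics import mean, stdev


def limits_of_agreement(reference, comparator):
    """Return (bias, lower, upper) for paired Ki67 PI values.

    bias is the mean of (comparator - reference); the limits are
    bias +/- 1.96 * SD of the paired differences (Bland-Altman).
    """
    diffs = [c - r for r, c in zip(reference, comparator)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


def cohen_kappa_binary(ref_flags, cmp_flags):
    """Chance-corrected agreement for two binary ratings (e.g. PI >= 3%)."""
    n = len(ref_flags)
    observed = sum(a == b for a, b in zip(ref_flags, cmp_flags)) / n
    p_ref = sum(ref_flags) / n
    p_cmp = sum(cmp_flags) / n
    # Agreement expected by chance from the marginal proportions.
    expected = p_ref * p_cmp + (1 - p_ref) * (1 - p_cmp)
    if expected == 1:  # both raters put every case in the same category
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical paired PI values (%) for a reference and a comparator clone.
mib1_pi = [1.0, 2.0, 2.8, 4.0, 10.0]
other_pi = [1.5, 2.5, 3.4, 5.0, 12.0]

bias, lower, upper = limits_of_agreement(mib1_pi, other_pi)
kappa = cohen_kappa_binary([pi >= 3 for pi in mib1_pi],
                           [pi >= 3 for pi in other_pi])
```

Limits of agreement are only meaningful when the difference does not depend on the average, which is why the text tests that association with Pearson's correlation before reporting them.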

Digital image analysis
Digital image analysis of IHC-stained Ki67 whole-slide images was performed using the open source image analysis programme QuPath version 0.1. 34 The same hot-spot regions used for the manual Ki67 assessments were used for image analysis. Scanning of the IHC-stained slides was performed using an Aperio AT2 scanner at ×40 magnification. Scans were imported into QuPath and cell detection was performed within the pre-annotated areas. Areas of necrosis, tissue folds and artefacts were excluded from analysis, as previously described. 31,[34][35][36][37] Each hot-spot was classified into tumour and stromal compartments using a detection classifier based on training regions. Following quality control, the tumour compartment was analysed for Ki67 staining using the single feature classifier, modified for a 0.15 minimum mean nuclear diaminobenzidine (DAB) optical density, to mitigate background staining. Therefore, all tumour cells with a mean nuclear DAB optical density greater than or equal to 0.15 were classed as positive. Digital image analysis readily allows for the accurate counting of tumour cells within a selected area on a digital slide. Taking advantage of this, three objects were drawn within the annotated hot-spot containing 500, 1000 and 2000 tumour cells, ensuring capture of the most proliferative areas. Positive cell detection data for each hot-spot size among all four Ki67 clones in the 42 PanNET cases were extracted from QuPath for analysis.
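The classification rule and the resulting index can be illustrated with a minimal Python sketch. It does not call QuPath or reproduce its classifier; it assumes only that a list of per-cell mean nuclear DAB optical densities for the tumour compartment has already been exported, and the function names and example values are hypothetical. The 0.15 threshold and the grading cut-offs are those stated in the text.

```python
DAB_THRESHOLD = 0.15  # minimum mean nuclear DAB optical density (from the text)


def ki67_pi(mean_nuclear_dab, threshold=DAB_THRESHOLD):
    """Percentage of tumour nuclei with mean DAB optical density >= threshold."""
    if not mean_nuclear_dab:
        raise ValueError("no tumour cells in hot-spot")
    positive = sum(od >= threshold for od in mean_nuclear_dab)
    return 100.0 * positive / len(mean_nuclear_dab)


def who_grade(pi_percent):
    """Grade suggested by Ki67 PI alone: <3% = 1, >=3-20% = 2, >20% = 3."""
    if pi_percent < 3:
        return 1
    if pi_percent <= 20:
        return 2
    return 3


# Hypothetical 500-cell hot-spot: 12 clearly positive nuclei, 488 negative.
hot_spot = [0.40] * 12 + [0.05] * 488
pi = ki67_pi(hot_spot)
grade = who_grade(pi)
```

Because the rule is a single cut-off on optical density, any background staining that lifts negative nuclei above 0.15 is counted as positive, which is one reason the results need pathologist review.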

Manually assessed Ki67 proliferative index
Total cell counts for each antibody clone
Compared with MIB1, the mean total cell counts were similar for MM1 (P = 0.74) and K2 (P = 0.185), but lower for 30-9 (P = 0.026); see Table 2. For one case, the total cell count was below 500 cells for all four clones. Although some guidelines suggest that hot-spots used for Ki67 assessments
Table 3 shows the range, mean and median for Ki67 PI for each antibody clone throughout the cohort. Compared with MIB1, the mean Ki67 PI was similar for MM1 (P = 0.56), but significantly higher for K2 (P < 0.0001) and 30-9 (P < 0.0001
Table 4 shows grade concordance between MIB1 and MM1 (a), K2 (b) and 30-9 (c). Generally, in discordant cases, non-MIB1 anti-Ki67 clones showed a strong tendency to indicate grade 2 status compared to grade 1 by MIB1 (11 of 13 discordant cases).

Digital image analysis assessment of Ki67 index
The original cohort of 42 cases, each stained with four different anti-Ki67 antibody clones, generated a total cohort of 168 (42 × 4) comparisons between manually assessed Ki67 PI and that obtained by digital image analysis. An example of QuPath masks on two cases is depicted in Figure 3. Throughout the cohort, image analysis resulted in higher Ki67 PIs in 93% of cases (156 of 168; sign test P < 0.001) when 500 cell hot-spots were used, in 83% of cases when 1000 cell hot-spots were used (139 of 168; sign test P < 0.001) and in 75% of cases when 2000 cell hot-spots were used (126 of 168; sign test P < 0.001). Of the 500 cell hot-spot digital Ki67 PIs, 42% were within 1% of the manually determined value (73 of 168, CI = 35%, 51%); 61% of the 1000 cell hot-spot digital Ki67 PIs were within 1% of the manual Ki67 PI (102 of 168, CI = 53%, 68%) compared to 89% for the 2000 cell hot-spot Ki67 PIs (149 of 168, CI = 83%, 93%). The tendency of image analysis to overestimate Ki67 PI compared to manually assessed Ki67 PI was also reflected when the mean Ki67 PI for image analysis was compared to the mean manually determined Ki67 PI (P < 0.0001, all hot-spot sizes) (Table 5). In addition, the mean Ki67 PI determined by image analysis decreased as the size of the tumour hot-spot increased. When assessed by image analysis, MM1 had a lower mean Ki67 PI than MIB1 for all three hot-spot sizes.
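The sign test results quoted above can be checked with an exact binomial calculation. The following Python sketch is an illustration under stated assumptions rather than the authors' STATA procedure: it treats every one of the 168 comparisons as either "higher" or "not higher" (the handling of ties is not described in the text), tests against a 50% null and uses a hypothetical function name.

```python
from math import comb


def sign_test_p(n_higher, n_total):
    """Two-sided exact sign test P value against a 50% null proportion."""
    # Use the larger of the two counts so the upper tail is always summed.
    k = max(n_higher, n_total - n_higher)
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    # Doubling the tail can exceed 1 when the split is close to even.
    return min(1.0, 2 * tail)


# Counts reported in the text for 500, 1000 and 2000 cell hot-spots.
for higher in (156, 139, 126):
    assert sign_test_p(higher, 168) < 0.001
```

Even the least extreme split (126 of 168) lies far from the 84 expected under the null, so all three hot-spot sizes give P < 0.001 under these assumptions.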

Grade concordance between manually assessed Ki67 and digital image analysis
The proportion of concordant cases and kappa between manual and image analysis grades for the three hot-spot sizes is shown in Table 6. In the discordant cases, image analysis was more likely than manual assessment to result in a grade 2 designation, accounting for 96, 88 and 72% of discordances when 500, 1000 and 2000 tumour cell hot-spots were used, respectively. Grade concordance breakdown by clone is shown in Supporting information, Table S1, with the highest overall concordance between manual and image analysis observed for MM1.

Discussion
Assessment of Ki67 PI is essential for grading, classifying and treating GEP NETs. 38 The MIB1 anti-Ki67 clone has been validated for grading GEP NETs and its use is specifically referred to in NET guidelines. 1,2,4,13,29,39 However, alternative anti-Ki67 clones are routinely in use in clinical pathology laboratories, yet studies in breast cancer and tonsillar tissue have shown that different anti-Ki67 clones do not produce equivalent Ki67 PIs. 30,32 Furthermore, the recommended size of the proliferative hot-spot to    disparity as the Ki67 PI increased. One issue noted was that the total cell count for the 30-9 clone was, on average, 50 cells fewer than that used for the other clones. It could be argued that the smaller total cell count contributed to a concentration effect on Ki67 PI for 30-9 compared to MIB1. However, when the same clone-to-clone comparisons were duplicated using image analysis, which ensured that equivalent total cell counts were applied for all assessments, the same trend was noted. This suggests that the differences between manually assessed MIB1 and 30-9 PIs were independent of different mean tumour cell counts.
The grading threshold that exists between grades 1 and 2 is low (3%) and can be influenced by relatively small differences in the number of positively labelled cells, as described in a study of the impact of different interlaboratory staining practices using the MIB1 clone. 20 Our data show that, although grade concordance or agreement between MIB1 and non-MIB1 clones is moderate to substantial, the grade was influenced by the Ki67 clone selected, as non-MIB1 clones produced more grade 2/1 than grade 1/2 discordances compared to MIB1 (11 versus 2). That the differences in absolute Ki67 values did not manifest with more grade discordances is almost certainly a reflection of the wide range of Ki67 PI values captured in the grade 2 category (3-20%). None of the cases used in our study had a Ki67 PI at the grade 2/3 (20%) boundary. However, it would be anticipated that the use of certain clones, such as 30-9 and K2, will lead to the designation of more PanNETs as grade 3. Prior to the most recent WHO NET grading system, such tumours would have been classified as grade 3 neuroendocrine carcinomas (NECs).
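How few cells separate grade 1 from grade 2 can be shown with a short calculation. This Python sketch is purely illustrative; the function name is hypothetical, and it assumes only the ≥3% grade 2 threshold and the 500, 1000 and 2000 cell hot-spot sizes discussed in this study.

```python
def min_positive_for_grade2(total_cells, threshold_percent=3):
    """Smallest number of positive nuclei giving a PI >= threshold_percent.

    Integer arithmetic (ceiling division) avoids floating-point rounding.
    """
    return (total_cells * threshold_percent + 99) // 100


for size in (500, 1000, 2000):
    needed = min_positive_for_grade2(size)
    # One nucleus fewer keeps the case in grade 1.
    below = 100 * (needed - 1) / size
    print(f"{size} cells: {needed} positive nuclei reach grade 2; "
          f"{needed - 1} give a PI of {below:.2f}%")
```

In a 500 cell hot-spot, for example, 14 positive nuclei give a PI of 2.8% (grade 1) whereas 15 give 3.0% (grade 2), so a single additional nucleus scored as positive is enough to change the grade.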
Absolute Ki67 PI is important to clinicians. Although current guidelines now widely adopt the 3% threshold for grade 1 versus 2, some studies have suggested that a 5% cut-off may be more clinically relevant in PanNETs. [40][41][42][43] Ki67 PI, rather than grade, is increasingly used to inform treatment decisions in NETs. For example, a Ki67 PI cut-off of 10% has been used to select patients suitable for liver transplantation and is comparable with the inclusion criteria of the CLARINET study, and of some oncologists when selecting systemic therapy. 40,44,45 ENETS guidelines for PanNETs refer to 'low' and 'high' grade 2 tumours, implying that treatments may be adapted according to grade 2 subdivisions. 46 The response of GEP NETs to chemotherapy has been reported to improve with increasing Ki67 PI. 47 With regard to higher-grade NENs, although NECs are no longer formally graded according to the current WHO classification scheme a Ki67 PI threshold of 55% appears to continue to have prognostic relevance. 5,7,[48][49][50] Based on our data, it is likely that throughout large patient cohorts the overall relationship between outcomes, such as response to treatment and Ki67 PI, would be observed regardless of the clone selected because, in general, an increase in Ki67 PI was mirrored across all clones. However, for the individual patient the situation is different, and the choice of antibody clone could have a significant impact on treatment.
In the past, use of image analysis for assessing Ki67 PI has been compared to manual counting methods and shown to be comparable, but cost has been seen as prohibitive for routine use. 17,18,22 The technological improvements of the past decade have seen increased adoption of digital pathology in clinical practice, creating more potential for the use of image analysis and artificial intelligence. Previous work in our laboratory in breast cancer samples demonstrated high concordance between image analysis using QuPath and manual scoring of Ki67 (MM1 clone) and also suggested a dependence of Ki67 PI on the specific antibody used. 31 Our study of 42 cases and four Ki67 antibody clones facilitated the creation of 168 manual versus digital comparisons, one of the largest validation cohorts reported to date in PanNETs. We found that mean Ki67 PI generated using image analysis was higher than the mean of manually assessed Ki67 PI and demonstrated that tumour hot-spot size has an inverse effect on Ki67 PI, with a decrease in mean Ki67 as the hot-spot size increases. This inverse relationship between hot-spot size and Ki67 PI has also been described by Volynskaya et al. 22 The highest correlations between image analysis and manual Ki67 were seen using a 2000 tumour cell hot-spot for image analysis Ki67 assessments. The average hot-spot size used for manual assessment was closer to 1000 cells, so the better correlation at 2000 cells is probably due to a dilution effect counteracting the apparent overestimation of Ki67 PI by image analysis. Despite the differences in absolute Ki67 PI, grading concordance between manual and digital assessments was substantial for all hot-spot sizes. The overestimation of Ki67 PI by image analysis manifested in more grade 2 versus 1 discordant cases between digital versus manual, stressing the importance of quality control of image analysis results by pathologists.
Using image analysis, we also replicated the manual MIB1 versus non-MIB1 clone comparison study. As described above, similar to the observations in the manual study, the K2 and 30-9 clones showed overestimation of Ki67 PI compared to MIB1. Given the unbiased nature of image analysis, this supports the original observation that K2 and 30-9 are more sensitive under the conditions used.
In a review of selected cases, to ascertain the reasons for discrepancy between digital and manual assessments, it was noted that misclassification of non-tumour cells as tumour cells was quite rare. During initial quality control of the image analysis software it was noted that there was a tendency to ascribe positivity to tumour cells that were interpreted as negative on review of the digital image by a pathologist. The main reason for this phenomenon was often the presence of pre-analytical staining artefacts, such as increased section thickness, nuclear crowding or excess background staining. A loss of contrast was also noted when digital images were printed for manual counting, compared to the high-resolution digitally scanned images viewed on screen. This led to occasional under-recognition of faintly positive staining in print images, compounding the differences in individual comparisons. Ultimately, this limited discrepancy review highlights the need for high technical quality of tissue sectioning and immunostaining, particularly when image analysis is being utilised. Pathologist supervision of image analysis software is therefore paramount if deployed in clinical practice.
This study did not attempt to determine if any single clone was superior in terms of clinical outcomes, as has been done in breast cancer, 30 but we suggest that such studies are warranted. Furthermore, we used a different antibody detection platform for 30-9 compared to the other clones, which introduces another variable known to have potential to influence the determination of PI when a single clone is used. 20 The MIB1 clone was the only clone in the study that was not ready-to-use (RTU). Unlike Blank et al., 20 we did not alter the dilution, retrieval and incubation schedules for MIB1 to assess the impact of these factors on Ki67 staining. Similarly, we did not assess the impact of altering the retrieval and incubation schedules for the non-MIB1 clones. The work by Blank et al. suggests that it may be possible to achieve greater Ki67 PI agreement between the various clones by altering their respective staining protocols; however, this was not considered feasible for this particular study. These points could be seen as criticisms and limitations of our study design, although it is important to stress that the study was conducted in such a way as to replicate typical and variable interlaboratory practices in clinical pathology departments. Therefore, all antibodies were optimised independently, and the most appropriate detection platforms were used rather than a single staining platform for all antibodies. The time taken to perform manual Ki67 and image analysis PI assessments was not formally recorded in this study. Manual counts have been reported to take 6 min on average, although individual cases may take 55 min. 17,22 In contrast, image analysis software can perform this task on selected hot-spots in seconds. 22

Conclusion
In a series of 42 PanNETs, we established that MIB1 and non-MIB1 anti-Ki67 antibody clones do not always produce equivalent Ki67 PI results, which may impact upon patient care. MM1 correlated most closely with MIB1 in both manual and image analysis comparisons, suggesting that it may be the best MIB1 alternative in clinical practice. Nevertheless, validation of individual non-MIB1 clones may be required for grading of NETs, and further studies should assess if any particular clones are better at predicting the clinical outcomes of NETs in large patient cohorts. We also show that the size of the tumour cell hot-spot used to assess Ki67 has an inverse relationship with the PI, stressing the need for greater standardisation in guidelines and clinical practice. Our results also question the effectiveness of quality assurance schemes which do not incorporate a quantitative assessment of Ki67 antibody clone performance. Our findings support the continuing validation and use of image analysis as a rapid method for assessing Ki67 PI in clinical practice, but stress the importance of high technical quality with regard to immunostaining and the need for overall pathologist quality assurance of results generated through automated platforms.

Supporting Information
Additional Supporting Information may be found in the online version of this article:
Figure S1. Line graphs demonstrating the agreement between the manual Ki67 PI (blue line) versus digital Ki67 PI for 500 cells (yellow line), 1000 cells (green line) and 2000 cells (red line) for the individual clones in the 42 PanNET cases. Clinical cut-off (black line), MIB1 (i), MM1 (ii), K2 (iii) and 30-9 (iv).
Table S1. Percentage of cases showing grade concordance between individual manually assessed Ki67 clones and image analysis using 500, 1000 and 2000 total tumour cell hot-spots.