The human‐in‐the‐loop: an evaluation of pathologists’ interaction with artificial intelligence in clinical practice

One of the major drivers of the adoption of digital pathology in clinical practice is the possibility of introducing digital image analysis (DIA) to assist with diagnostic tasks. This offers potential increases in accuracy, reproducibility, and efficiency. Whereas stand‐alone DIA has great potential benefit for research, little is known about the effect of DIA assistance in clinical use. The aim of this study was to investigate the clinical use characteristics of a DIA application for Ki67 proliferation assessment. Specifically, the human‐in‐the‐loop interplay between DIA and pathologists was studied.


Introduction
As more pathology departments implement digital review, artificial intelligence (AI) systems for digital image analysis (DIA) will also become more commonplace. [1][2][3][4] It is important to recognise accurate quantification, facilitated through AI for prognostic and predictive scoring, as a diagnostic companion to pathologists. A popular candidate for AI use is Ki67 scoring in breast cancer (BC), for which studies have shown that DIA is equal to or better than manual scoring by pathologists. [5][6][7][8][9] There is, however, relatively little evidence to show how these algorithms perform in a clinical setting. It is known that DIA can fail because of, for instance, poor slide quality, [8,10] which could be a recurring challenge in the clinic. Moreover, AI systems are known to regress in performance when applied to data from sources that are not represented in the training data, because of so-called domain shift. [11] Recent studies have investigated human-in-the-loop (HITL) AI systems, in which pathologists interact with DIA systems, and have shown that an HITL approach can yield a performance improvement. [12][13][14][15] In order to reach the full potential of the HITL approach, a deeper understanding of how these systems should be designed is needed. [13,16]
In this study, we evaluated an existing HITL AI system in clinical use. The pathology department in Linköping, Sweden was an early adopter of digital pathology, [17] and has, since 2015, used DIA applied to Ki67 for immunohistochemistry (IHC)-based intrinsic subtyping of luminal HER2-negative BCs; there have been >1600 such cases to date. The role of Ki67 in BC as a prognostic and predictive factor is debated, but it has been proven to correlate with a poor prognosis in primary BC, and is included in the St Gallen surrogate IHC classification for intrinsic subtypes. [18][19][20][21][22] Ki67 is included in the Swedish guidelines for BC pathology, with a floating cut-off that removes the local bias component caused by differences in staining intensity and interpretation consensus. [23] To explore the effect that an AI system, with or without HITL interaction, has on diagnostic accuracy, we performed a retrospective study based on Ki67 DIA applied to hotspot areas on whole slide images of invasive BC by pathologists at the laboratory at Linköping University Hospital.

Materials and methods
Ethical approval was given by the Regional Ethical Review Board in Linköping (Approval no. 2017/276-31).

Samples and image data
A total of 200 analysed areas, containing 200 tumour cells each, in Ki67-stained slides from different BC cases, produced by the Department of Clinical Pathology, Linköping, Sweden, were extracted from two time periods, starting in May 2015 and February 2017. One hundred and eighty-four of these cases were primary invasive BCs, and the remaining 16 cases were local metastases or recurrences, distant metastases, or only suspected BC. Both the patient data and data on the reviewing pathologists were anonymised at the time of the extraction. The extracted areas corresponded to two different versions (denoted V1 and V2; same user interface) of the underlying nuclear detection algorithm, V1 from May 2015 and V2 from February 2017. From each time period, 100 consecutive BC cases were selected by the use of search tools in the pathology picture archiving and communication system (PACS) (Sectra AB, Linköping, Sweden) based on the combination of the local SNOMED code 'breast cancer' and existing 200-cell Ki67 annotations. Relevant clinical metadata were also extracted.
The Ki67 index in the annotations was checked for consistency with the value in the signed pathology report. With the local cut-off level of 41% positive cells in the hotspot, the Ki67 index was low in 140 specimens (71%). With a cut-off of 20%, the Ki67 index was low in 64 specimens (33%). Twenty-two cases were reported as HER2-positive (11%). Nottingham grading had been performed on 147 of the cases; 19 were grade 1, 78 were grade 2, and 50 were grade 3.

IHC
No additional staining procedure was performed for the study. All BC specimens had been processed with the established in-house clinical procedures, including the NordiQC external quality programme. Fixation times were semicontrolled within the range of 72 h. The routine protocol consisted of cutting one 4-µm-thick section from each formalin-fixed paraffin-embedded tumour specimen, and then mounting the section on a precoated slide for IHC analysis with anti-Ki67 rabbit monoclonal antibody (MIB1 30-9; Ventana/Roche, Tucson, AZ, USA) by use of the Intellipath autostainer (Biocare Medical, Concord, CA, USA). Diaminobenzidine was used as a chromogen substrate for all specimens. All sections were counterstained with haematoxylin.

Digitalisation and DIA
No additional scanning was performed for the study. All slides were originally scanned in routine clinical practice with either Aperio Scanscope AT Turbo scanners (Leica Biosystems, Buffalo Grove, IL, USA) or the Hamamatsu NanoZoomer XR (Hamamatsu Photonics, Hamamatsu City, Japan), with application of the ×20 scanning resolution mode (~0.5 µm/pixel, JPEG Q70-80) in a single focus plane automatically identified by the scanner. Quality control was performed manually.
The workstations for the original diagnostic work were equipped with an 8-megapixel colour-calibrated screen (Radiforce RX850; Eizo Corporation, Hakusan, Japan). All pathologists responsible for the cases included in this study had, at the time, >6 months of experience with the digital pathology system. At the time of the original diagnostic review, the pathologists did not know that the cases would be retrospectively investigated.
The nuclear detection algorithm used in the system was based on machine learning, with pixel annotations in V1 [24] and a larger number of point annotations in V2. [25] The cell detection F1 scores were 0.79 for V1 and 0.83 for V2, as determined from an image dataset described by Molin et al. [26] The computer-aided tool was fully integrated in the viewer.
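For readers unfamiliar with the metric, the F1 score quoted above is the harmonic mean of precision and recall over detected nuclei. The following is a minimal sketch that assumes detections have already been matched to annotated nuclei; the function name and the example counts are hypothetical and are not taken from the study.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall for nuclear detections."""
    if true_positives == 0:
        return 0.0  # no correct detections: define F1 as zero
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for one annotated area (illustration only).
example_f1 = f1_score(true_positives=160, false_positives=30, false_negatives=40)
```

How detections are matched to annotations (for example, by a distance threshold) is not specified here and would affect the counts.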
The Ki67 tool uses an HITL workflow in four steps (Figure 1).
The ground truth was established independently by a panel of three experienced breast pathologists (A.B., S.G., and R.W.), who manually marked every positive and negative cell in each area by using a web-based tool custom-built for this study. The image was interpreted without access to the full whole slide image or the corresponding haematoxylin and eosin-stained slide. An initial session established a joint view on cell categorisation by use of a separate training set. The median Ki67 indexes from the three panellists were used as the ground truth; one pathologist had previously been involved in the original HITL scoring. Fleiss' κ and standard deviation were calculated to enable the quality of the ground truth to be judged.
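The ground truth construction described above can be sketched as follows: the per-area median of the three panellists' indexes, and Fleiss' kappa on the panellists' binary Ki67 status. This is an illustrative sketch only. The 41% cut-off is the local value given in the text, treating an index at or above the cut-off as 'high' is an assumption, and the function names and example numbers are hypothetical.

```python
from statistics import median


def ground_truth_index(panel_indexes):
    """Median of the panellists' Ki67 indexes for one area."""
    return median(panel_indexes)


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa; ratings holds one list of labels per area, one label per rater."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    # Number of raters choosing each category, per area.
    counts = [[row.count(c) for c in categories] for row in ratings]
    # Observed agreement per area, then averaged.
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall category proportions.
    p_j = [sum(row[j] for row in counts) / (n_subjects * n_raters)
           for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)  # undefined if all ratings fall in one category


# Hypothetical indexes from three panellists for four areas.
panel = [[10, 12, 15], [50, 55, 60], [38, 42, 45], [20, 25, 22]]
truth = [ground_truth_index(area) for area in panel]
statuses = [["high" if x >= 41 else "low" for x in area] for area in panel]
kappa = fleiss_kappa(statuses, ["low", "high"])
```

In this toy example the panellists disagree on the status of only one area, which sits close to the cut-off.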

Scoring methods
The study evaluated the difference in accuracy between three different scoring methods: eyeballing, automatic scoring, and HITL scoring.

Eyeballing
Eyeballing was performed by three pathologists (A.B., S.G., and R.W.) before the ground truth elicitation. Eyeballing was performed by estimating the percentage, without decimals (e.g. 17%), of Ki67-positive tumour cells by visual assessment of the digital image, avoiding actual counting. A single digit was preferred over brackets because the use of brackets would introduce additional variability and therefore create an unfair advantage for the other scoring methods. A non-enforced timing guideline of 20 s for each area estimate was used.

Automatic scoring
The automatic scoring value is the output provided by the nuclear detection algorithm, retrieved from the pathology PACS, corresponding to the situation that would exist if the pathologist directly accepted the Ki67 index without modification after the processing step in Figure 1.

HITL scoring
The HITL scoring corresponds to the Ki67 index from the clinical review, retrieved from the pathology PACS, corresponding to the output of step 4 in Figure 1. The pathologist could either correct the algorithmic result or leave it unchanged.

Statistics
The statistical accuracy was calculated at three levels of detail: Ki67 status, Ki67 index, and individual cells. The Ki67 status refers to the binary assessment of being below or above the predetermined cut-off value (41%), as compared with ground truth. For the Ki67 index level, the accuracy is measured as the percentage point difference from the ground truth index value. The cell level compares the positivity for each individual cell with ground truth.
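The three levels of detail can be illustrated with a short sketch. It assumes that an index at or above the cut-off counts as 'high' and that cells from a scoring method have already been paired one-to-one with ground-truth cells; neither detail is specified in the text, and all names and numbers below are hypothetical.

```python
CUT_OFF = 41.0  # local cut-off (% positive cells) quoted in the text


def ki67_index(positive_cells, total_cells):
    """Ki67 index as the percentage of positive tumour cells."""
    return 100.0 * positive_cells / total_cells


def ki67_status(index, cut_off=CUT_OFF):
    """Binary status; 'high' at or above the cut-off is an assumption."""
    return "high" if index >= cut_off else "low"


def status_agreement(scored, truth):
    """Fraction of areas whose status matches the ground truth status."""
    matches = sum(ki67_status(s) == ki67_status(t) for s, t in zip(scored, truth))
    return matches / len(truth)


def index_errors(scored, truth):
    """Signed percentage point difference from the ground truth index."""
    return [s - t for s, t in zip(scored, truth)]


def cell_level_accuracy(scored_cells, truth_cells):
    """Fraction of paired cells with the same positivity as the ground truth."""
    matches = sum(s == t for s, t in zip(scored_cells, truth_cells))
    return matches / len(truth_cells)


# Hypothetical indexes for four areas (illustration only).
scored = [43.0, 44.0, 60.0, 12.0]
truth = [40.0, 42.0, 50.0, 15.0]
agreement = status_agreement(scored, truth)
errors = index_errors(scored, truth)
cells = cell_level_accuracy([True, False, True, True], [True, False, False, True])
```

The first area shows how a 3 percentage point index error can flip the status when the ground truth lies just below the cut-off.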
The statistical analysis was performed with pandas (0.25.2), SciPy (1.3.1), and statsmodels (0.10.1). Statistical testing for the overall accuracy comparison was performed on the variability error on the Ki67 index in the V2 period by the use of Levene's test with Holm correction. Estimates of other variables are also reported, but no hypothesis testing was performed. The statistical testing plan was derived a priori from a smaller pilot dataset extracted in a similar manner without any cases overlapping with the cases included in the final study.
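To make the testing procedure concrete, the sketch below computes the Levene statistic for equality of variances and applies the Holm step-down adjustment to a set of p-values, using only the Python standard library. It is not the authors' code. Centring on the median is an assumption (the text does not state the centre used), the error values are hypothetical, and in practice `scipy.stats.levene` would also supply the p-value that this sketch omits.

```python
from statistics import mean, median


def levene_statistic(*groups):
    """Levene W statistic with median centring (an assumption)."""
    # Absolute deviations of each observation from its group median.
    deviations = [[abs(x - median(g)) for x in g] for g in groups]
    group_means = [mean(z) for z in deviations]
    n_total = sum(len(z) for z in deviations)
    grand_mean = sum(sum(z) for z in deviations) / n_total
    k = len(groups)
    between = sum(len(z) * (m - grand_mean) ** 2
                  for z, m in zip(deviations, group_means))
    within = sum((v - m) ** 2
                 for z, m in zip(deviations, group_means) for v in z)
    return (n_total - k) / (k - 1) * between / within


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted


# Hypothetical index errors for two scoring methods (illustration only).
w = levene_statistic([-1, 0, 1], [-10, 0, 10])
adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A larger W indicates a larger difference in spread between the methods; the Holm adjustment then controls the family-wise error rate across the pairwise comparisons.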

Results
For the ground truth assessments, Fleiss' κ between raters showed the Ki67 status-level variability to be 0.87. The estimated standard error of the ground truth Ki67 index was 2.9.
In the study, eight pathologists in total used the system, six in each period and four in both. The duration of tool use was measured from the start of processing of an area until the area had been marked as verified (steps 2-4 in Figure 1). The median durations were 113 s [interquartile range (IQR) 79-176] in V1 and 73 s (IQR 47-122) in V2.

Accuracy of scoring methods
Ki67 status agreement, Ki67 index error and cell-level error are shown in Table 1. Automatic scoring improved the accuracy with both algorithm versions. The Ki67 index error for eyeballing (14.9) was significantly larger (P < 0.05) than those for automatic scoring (7.2) and HITL scoring (6.9), on comparison of the results in the V2 period. In general, HITL corrections did not result in an overall improvement in the Ki67 index as compared with automatic scoring. The overall correlation patterns for the different scoring methods are shown in Figure 2.
The HITL and automatic methods had lower variability, as shown in Figure 3, whereas they showed a slight statistical bias in that the observations clustered on the lower half, and the mean error deviated from zero as opposed to eyeballing. To further investigate possible clinical impacts of Ki67 status, discordances were mapped as status error. The performance of the algorithm dropped in the clinical environment; this was a domain shift, as reflected by an F1 score for V2 of 0.68 as compared with the formerly reported 0.83. [25]

HITL
The effect of the pathologists' corrections, with regard to improving or worsening the automatic result in terms of Ki67 index error, is shown in Figure 4. A large number of cases had an error difference close to zero, showing that the corrections had little effect. In V1, corrections worsened the Ki67 index by, on average, −3.1 percentage points, and in V2 the corrections caused an improvement of, on average, 0.9 percentage points. This small effect does not mean that few corrections were made. Areas with an index error difference below 5 had, on average, 33 corrections. There was a slight difference in the number of corrections close to the cut-off; within 10 percentage [...] Figure 5A. For many cases in which HITL caused deterioration, however, there was no obvious explanation other than the challenges of the diagnostic task as such. The visual inspection also revealed issues when the DIA system had failed and had been corrected by HITL. Root causes included overlapping positive nuclei, as shown in Figure 5B-E, misidentification of tumour and stroma morphology, staining quality, or a combination of different factors.
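One way to quantify the effect of corrections described in this section is the change in absolute Ki67 index error between the automatic and the HITL result for each area. The sketch below assumes this definition, with positive values meaning that the corrections moved the index closer to the ground truth; the exact formula used in the study is not given in the text, and the numbers are hypothetical.

```python
from statistics import mean


def correction_effect(automatic, hitl, truth):
    """Reduction in absolute index error achieved by the corrections.

    Positive: HITL is closer to the ground truth than the automatic score.
    Negative: the corrections worsened the result.
    """
    return abs(automatic - truth) - abs(hitl - truth)


# Hypothetical (automatic, HITL, ground truth) indexes for three areas.
areas = [(30.0, 34.0, 35.0), (50.0, 50.0, 48.0), (20.0, 12.0, 19.0)]
effects = [correction_effect(a, h, t) for a, h, t in areas]
average_effect = mean(effects)
```

In the toy data one area improves, one is left unchanged and one deteriorates, so the average effect is slightly negative even though corrections were made.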

Discussion
This is one of very few studies that have evaluated how an HITL AI system is used in clinical practice.
Our results, at all three levels analysed, are in support of the safe use of DIA. Agreement for Ki67 status was good, both in automatic and in assisted modes (automatic κ 0.84 and HITL κ 0.76). This is on a par with the Ki67 agreement previously found in the Swedish setting (κ 0.77). [27] With regard to the Ki67 index, which was the second level of evaluation, HITL and automatic DIA were significantly more accurate, with lower deviation of the standard error. With respect to potential error sources in DIA use, our results are in line with those of Kwon et al. [8] The following causes of discordance between visual assessment and DIA regarding the Ki67 index in BC were identified: (i) tumour heterogeneity; (ii) visual assessment error; (iii) misidentification of tumour cells; (iv) poor immunostaining or slide quality; and (v) estimation of non-tumour cells. Additionally, we propose that the fourth category should also include digitisation quality, and that a sixth category should be introduced: user handling error.
A main question is whether the HITL intervention of the pathologists is justified in comparison with automatic AI in the analysis of Ki67 in routine BC diagnostics. Our findings are not as categorical as those of recently published studies of AI assistance within pathology, which have indicated either improvement or failure. [13,28,29] Our results reveal individual cases in which poor DIA results can be remedied by pathologist HITL intervention, but human adjustments can also worsen the results.
Justifying pathologist interaction is also dependent on the time spent on performing the HITL task. In our study, the median time spent was 40 s lower in V2 than in V1 (73 s versus 113 s). The improvement can partly be attributed to the higher accuracy of the nuclear detection algorithm, resulting in fewer corrections, but other causes cannot be excluded. The diagnostic review in pathology is a complex mixture of quantitative and qualitative measures, making it challenging to evaluate results in terms of added value in the diagnostic process. One weakness of this study is that we did not investigate the pathologists' experience regarding the usage and perceived benefit of the tool. One clear indication is, however, that the tool is frequently and consistently used among breast pathologists in everyday practice in Linköping. Even though our study shows that, for these 200 cases, the pathologist could have been removed from the loop and time could have been saved without substantially worsening the results, it is uncertain whether this result is generalisable. The HITL approach provides an important safety mechanism for detecting and correcting algorithmic errors that may occur. Retaining human oversight of analysis outputs is often a recommendation for quality control and safety in digital image analysis. [9,30] Key issues for successful implementation of AI assistance in clinical routine will be to determine where the limited time of pathologists is best used and to identify the value added by AI in the diagnostic process, focusing on the resulting benefit for patient care in terms of accurate quantification (as facilitated through AI for predictive scoring).

Figure 5. A, [...] It looks as though the intention must have been to remove these detections rather than modifying the positivity. B, The digital image analysis (DIA) system has failed to detect overlapping positive cells, and the user has correctly added the missing detections (arrow). C, The DIA system has poor performance when separating tumour from non-tumour cells. Here, the pathologist has removed excessive detections (arrow). D, Poor staining quality causes the DIA system to mark nuclei as positive even though the staining in this case is unspecific. The pathologist has correctly changed many nuclei from positive to negative (arrow). E, An out-of-focus artefact causes the algorithm to detect nuclei even though there are no nuclei at all. These have been removed by the pathologist (arrow).

Summary
We have shown that DIA applied in real-world clinical routine, both in automated and in HITL settings, contributes to more accurate scoring of Ki67 in patients with BC. The main finding is that the primary value of HITL correction is to detect major weaknesses of the DIA algorithm rather than fine-tuning by analysing individual cells.

Financial support
This work is supported by an ALF grant from Region Östergötland.