Digital pathology for reporting histopathology samples, including cancer screening samples – definitive evidence from a multisite study

To conduct a definitive multicentre comparison of digital pathology (DP) with light microscopy (LM) for reporting histopathology slides including breast and bowel cancer screening samples.


Introduction
Histopathology is the light microscopic (LM) examination of tissue sections and is an integral component of many patient pathways. [2,3] In this context, the most efficient use of a limited cellular pathology workforce is vital to maintain standards of care and patient safety. [4] Capturing histopathology slides at high resolution and stitching the digital images together enables pathology slides to be recreated on computer workstations. The process of using digital whole slide images (WSI) as a means of examining pathology slides has been termed 'digital pathology' (DP), and its use has increased rapidly during the past decade, aided by high-throughput automated slide scanners that require minimal input from laboratory technicians and fit seamlessly into the laboratory workflow. [5,7,8] DP thereby provides almost limitless flexibility in the management of this workload: a factor exploited by many laboratories in response to the COVID-19 pandemic. [9] DP also enables the pixel data contained in the images to be exploited to develop aids to improve diagnosis. [10,11,15-18] Novel technologies require definitive evidence of comparable accuracy with the existing standard. [20-22] A recent meta-analysis demonstrated high concordance rates between the digital and glass readings in these studies. [23] However, the majority (92%) of those studies were performed at a single institution without enrichment for challenging cases or samples from cancer screening programmes, leading to a lack of data supporting the use of DP in this setting and preventing wider adoption. Additionally, to date, few studies have evaluated the accuracy of DP for medical renal biopsies with immunofluorescence slides, a speciality comprising highly complex and low-volume samples where DP may prove to have important benefits in providing improved access to specialist expertise. [24,25]
Examining histopathology slides depends upon interpretation of histological features in light of the clinical setting, and is subject to both inter- and intraobserver variation. The studies comparing DP to LM published to date lack rigorous assessment of both inter- and intraobserver variation, making an assessment of equivalence between the two platforms difficult.
In this study, [26,27] we performed a multisite comparison across the breast, gastrointestinal (GI), skin and renal specialities with consultant pathologists experienced in reporting these samples, comprising routine biopsies, cancer screening samples and resections, as well as cases known to contain challenging lesions. The primary outcomes were intra- and interobserver agreement for pathologists' diagnoses using DP as opposed to LM.

Study design
The study design was developed incorporating principles published by the Royal College of Pathologists. [28,29] A blinded crossover design compared pathologists' reports using LM and DP (Figure 1). The Health Research Authority (National Health Service, London, UK) approved the study protocol and any subsequent amendments. The study protocol was published in the International Traditional Medicine Clinical Trial Registry. [27] The steering committee, including an independent chair, the chief investigator and patient representatives, provided study oversight.

Ethics
The study protocol and any subsequent amendments were reviewed and approved by the Health Research Authority (HRA) and Research Ethics Committee (REC); ISRCTN number 14513591, IRAS number 258799, 2018. Samples recruited from Oxford (renal) had generic consent for research. Consent was not sought for the remaining cases.

Sample selection
The sample pathway is summarised in Figure 2. Prospective consecutive histopathology samples, enrolled between July 2019 and July 2021, were recruited across the four subspeciality areas, including breast and bowel cancer screening biopsies. These were enriched with 20% of cases considered either difficult or moderately difficult to report (see Supporting information, Table S1). [23] Renal biopsy samples, all deemed difficult due to the nature of these biopsies, comprised a consecutive series of native and transplant biopsies prospectively recruited from one centre (Oxford). All other speciality group cases were recruited equally from the departments of the study pathologists.
The glass slides were retrieved along with the corresponding reports. The original report was the reference diagnosis (RD). All slides were included for biopsies. For some large (>10 blocks) breast and GI resection samples, submitting pathologists selected representative slides sufficient to provide the report. All the available stains, including haematoxylin and eosin (H&E), special, immunocytochemistry and immunofluorescence stains, were included in the study, with the exception of GI, where only H&E stains were included. Pen marks were cleaned from the slides and overhanging or badly marked coverslips were replaced; otherwise no additional preparation of slides was performed prior to scanning. Specifically, no attempts were made to correct for imperfections in section quality.
Cases were excluded if:
• there were missing or damaged slides;
• the case contained oversized slides;
• a prior biopsy review was required for interpretation.
The skin, GI and breast slides were scanned with the Philips IntelliSite Pathology Solution (Philips, Eindhoven, the Netherlands) using a single Philips Ultra-Fast Scanner (UFS 1.8, IVD-CE), with automated focal point selection and tissue detection. Cases were viewed using the Philips Image Management System (IMS version 3.3.1; Philips). Once digitised at equivalent ×40 magnification (0.25 μm per pixel), the WSI were stored locally at UHCW Coventry (network connection: 1 GB/s bandwidth) in two HP DL380 servers with a net 24 TB storage capacity. WSI were checked by laboratory technicians at low power to detect obvious errors in focusing or tissue detection, and rescanned if required. All participating sites were

Reporting of samples
Pathologists reported each study sample twice: once using DP and once using LM. The order was randomised, and there was a minimum 6-week gap between viewings. Clinical and macroscopic details were accessed on the study database. LM was conducted using the microscopes used for routine diagnostic work, and DP using the workstations provided. Where possible, reporting proformas were used. Reporting followed the UK NHS Bowel and Breast Cancer Screening programme and RCPath minimum data set requirements.
The annotation and measurement tools available on the DP systems were permitted, but annotations were hidden from fellow pathologists. Pathologists recorded their diagnostic confidence for each report on a seven-point Likert scale, from least to most confident. [28]

Report comparison, arbitration and consensus process
The reports were compared by study reviewers blinded to modality, participating site and pathologist. Any variations between reports were forwarded for arbitration. Two pathologists, not involved in reporting the cases, decided whether the differences identified would more probably have resulted in differences in management (clinically significant) or not (clinically insignificant). In uncertain cases, this decision was referred to a consulting clinician.
All cases were analysed as a whole rather than in parts.A case with a clinically significant discordance in a single part was labelled as discordant.

Consensus ground truth
Where there were one or more clinically significant differences, the WSI (glass slides were available on request) and all the reports (study and reference reports) were reviewed by the study pathologists reporting the case, and a consensus ground truth (GT) was agreed.

Outcomes
The primary endpoints of the study were intraobserver intermodality clinical management concordance (CMC: identical diagnoses plus clinically insignificant differences), comparing pairs of LM and DP reports by the same pathologist, and interpathologist CMC among the four DP and four LM diagnoses, respectively, and with the GT.
The secondary outcome measures included repetition of these comparisons in terms of complete concordance (CC), and pathologists' diagnostic confidence, rated separately for their LM and DP diagnoses.

Sample size
Percentage CMC for routine and difficult-to-diagnose cases was assumed to be 98.8% [13] and 55% (based on the range of 40-70% found in the literature), respectively, and 75% for moderately difficult cases (the midpoint between routine and difficult). [23] Taking account of enrichment with difficult and moderately difficult cases, the baseline intramodality variability of the whole study sample was defined as 90%.
The study sample size was determined so that it was sufficient to analyse each speciality separately. Based on the precision of intraobserver intermodality percentage CMC estimates, target recruitment was 2000 cases: 600 cases for each of the breast, skin and GI specialities and 200 cases for renal.
Four comparisons arising from four pathologists diagnosing 600 cases within each of the breast, skin and GI specialities resulted in 2400 LM:DP comparisons per speciality. With an overall ICC estimated at 0.8, the design effect is 1 + ICC × (comparisons per case − 1) = 3.4. Consequently, 2400 LM:DP comparisons correspond to 705 independent comparisons. This allows a margin of error of 2.2%, so precision is high when analysing breast, skin and GI specimens separately. Due to the smaller sample size, the margin of error for renal is 3.1%.
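The design-effect arithmetic above can be reproduced with a short calculation. This is a sketch using standard cluster-sampling formulas; the 1.96 normal quantile and the use of the 90% baseline CMC as the proportion are reasonable assumptions drawn from the text, not taken from the study protocol itself.

```python
import math

icc = 0.8                    # intraclass correlation assumed in the text
comparisons_per_case = 4     # four pathologists report each case
total_comparisons = 2400     # 600 cases x 4 pathologists (breast, skin or GI)

# Design effect for clustered observations: 1 + ICC * (m - 1)
deff = 1 + icc * (comparisons_per_case - 1)
n_eff = total_comparisons / deff          # effective independent comparisons

# Margin of error for a proportion at the 90% baseline CMC
p = 0.90
moe = 1.96 * math.sqrt(p * (1 - p) / n_eff)

print(round(deff, 1), int(n_eff), f"{100 * moe:.1f}%")  # → 3.4 705 2.2%
```

This recovers the figures quoted above: a design effect of 3.4, roughly 705 independent comparisons and a margin of error of about 2.2%.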

Statistical analysis
Random-effects (RE) logistic regression models, with crossed RE terms for case and pathologist, were used to estimate both the primary endpoint of intraobserver intermodality percentage CMC (between a pathologist's LM and DP pair of reports) and the secondary endpoints of CMC between a pathologist's LM reports and the GT and between a pathologist's DP reports and the GT. The 'gamm4' package in the R statistical program was used. [30,31] Additionally, using these models, the ICC estimating interobserver agreement, first within LM and then within DP, was computed on the latent scale as ICC = σ²case / (σ²case + σ²path + π²/3), where σ²path and σ²case are the RE variance estimates for pathologist and case, respectively, and π²/3 is the residual variance of the logistic distribution; 500 bootstrap samples were used to compute ICC 95% confidence intervals (CIs). CC data were analysed using the same approach.
LM and DP diagnosis confidence data were compared using an RE generalised Poisson model with crossed RE terms for case and pathologist, fitted using the 'glmmTMB' package in R. [32] Subgroup analyses were defined by speciality, screening/non-screening status and difficulty level.
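The latent-scale ICC described above can be sketched as follows. The models themselves were fitted in R with 'gamm4'; this is only an illustration of the ICC formula, and the variance components shown are hypothetical values chosen for the example, not estimates from the study.

```python
import math

def latent_scale_icc(var_case: float, var_path: float) -> float:
    """ICC on the latent scale for a crossed random-effects logistic model.

    The residual variance of the standard logistic distribution is pi^2/3;
    the ICC is the share of total latent variance attributable to cases,
    so a large case variance relative to pathologist variance corresponds
    to high interobserver agreement.
    """
    residual = math.pi ** 2 / 3
    return var_case / (var_case + var_path + residual)

# Hypothetical variance components, for illustration only
print(round(latent_scale_icc(var_case=15.0, var_path=0.4), 3))  # → 0.803
```

With a case variance that dominates the pathologist variance, the ICC exceeds the 0.8 threshold the study uses to describe interobserver agreement as very high.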

Characteristics of cases
A total of 2024 cases (62.8% female, 37.2% male), comprising 608 breast, 607 GI, 609 skin and 200 renal samples (Table 1 and CONSORT diagram, Figure 3), were recruited, producing 7750 slides. In total, 766 slides required rescanning, the majority for out-of-focus regions or missing fatty tissue fragments. The four pathologists' reports on LM and DP resulted in 16 192 case readings and 8096 comparisons in each of three possible combinations: LM versus DP, LM versus GT and DP versus GT, totalling 24 288 comparison combinations, excluding the RD.
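These totals follow directly from the study design; the counts reported above can be reproduced with a quick arithmetic check:

```python
cases = 608 + 607 + 609 + 200       # breast + GI + skin + renal = 2024
pathologists = 4
modalities = 2                       # LM and DP

readings = cases * pathologists * modalities   # each case read 8 times
lm_dp_pairs = cases * pathologists             # one LM:DP pair per pathologist per case
comparison_types = 3                           # LM vs DP, LM vs GT, DP vs GT
total = lm_dp_pairs * comparison_types

print(cases, readings, lm_dp_pairs, total)  # → 2024 16192 8096 24288
```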
The report comparison data are summarised in Table 2. An RE logistic regression model of the 8096 LM versus DP comparisons showed, over all 2024 cases, that CMC between LM and DP was 99.95% (95% CI = 99.90-99.97; Table 3), the primary endpoint of the study. The respective LM-GT and DP-GT percentage CMCs are very close, so that one modality does not outperform the other in diagnostic accuracy (Table 3). Both modalities also have similar interobserver agreement which, except for moderately difficult, difficult and breast screening cases, is very high, with intraclass correlation (ICC) above 0.8 (Table 3).

Secondary outcomes
Summary comparisons and RE logistic regression model results for CC, i.e. any difference regardless of clinical relevance, are shown in Supporting information, Tables S2 and S3. All LM-DP percentage CC values (intraobserver agreement) are above 88%. Overall, and in subgroup analyses, the respective LM-GT and DP-GT percentage CC values are close, so that one modality does not outperform the other. Agreement between modalities appeared similar over the longitudinal course of the study, as shown by agreement levels in the various batches of cases (see Supporting information, Tables S4-S6).
Pathologists reported the highest confidence level in 88% of the diagnoses (Table 4). Within a modality, GI pathologists were the most confident in their diagnoses, closely followed by skin pathologists, while renal pathologists were noticeably less confident than pathologists in the other specialities. Skin pathologists had approximately the same level of confidence in LM and DP diagnoses, while for the remaining specialities, and overall, confidence in DP diagnoses was slightly lower than in LM, with the generalised model showing that, overall, the lower confidence in DP diagnoses was of borderline significance (rate ratio = 0.92, 95% CI = 0.85-1.00, P = 0.053; Table 5). Lower confidence with DP diagnoses was significant for routine cases (rate ratio = 0.86, 95% CI = 0.76-0.98, P = 0.024).
Clinically important differences were grouped into common themes (Table 6). The renal differences, to be examined in a separate paper, are not discussed here. In all three specialities, interpathologist differences appear similar in the LM versus GT and DP versus GT comparisons, and higher than the intraobserver intermodality differences (LM versus DP).
In breast, slightly higher numbers of differences between B5a and B5b (microinvasion) were seen on DP (10) in comparison to LM (four). In three of the 10 DP differences, the pathologist gave the same diagnosis on LM as on DP. Of the seven remaining cases, four were reported as showing no invasion where the GT concluded that invasion was present, and three were reported as showing invasion where the GT concluded that no invasion was present.
A slightly higher intraobserver intermodality than interpathologist difference was seen for B2 versus B3 (with atypia): 31 LM versus DP differences, compared with either LM (20) or DP (19) versus GT. The 31 intraobserver differences were almost equally divided between LM (15) and DP (16), in equal agreement with the GT.
GI showed 31 instances where a discrepancy between high- and low-grade dysplasia was recorded. Of these, 21 LM and 28 DP diagnoses differed from the GT: 14 LM and 19 DP diagnoses recorded low-grade dysplasia where the GT was high-grade, as opposed to seven LM and nine DP recording high-grade dysplasia where the GT was low-grade.

Discussion
This study measured the assessment and reporting of 2024 cases by consultant pathologists working at six sites in the United Kingdom, and demonstrated extremely high levels of agreement (99.95%) between DP and LM readings. The level of agreement between the two platforms is identical to that of either platform with the consensus GT. These figures are similar to those seen in other studies (Table 7), some of which used different DP systems, indicating that the results are likely to translate to laboratories using other equipment. Randomisation of the platform used for the first view, and a washout period of 6 weeks, longer than those used in similar studies, [13,20,22] were used to reduce recall bias. Recall bias does not affect interpathologist concordance, and this is the first study, to our knowledge, to measure interobserver agreement on the same cases, demonstrating that interobserver performance is identical for DP and LM, as measured by agreement with the consensus GT. The study shows near-identical results between the DP and LM platforms across all the speciality groups, as well as for cancer screening cases in the breast and GI groups.
Histopathology is an interpretive discipline, and occasional discordance between reports issued on the same case is to be expected, even when re-reported by the same pathologist with an identical clinical context. [34,35] Clinically significant differences were observed in these cases and are reflected in the lower levels of agreement seen. Table 6 lists the most common themes giving rise to differences in the breast, GI and skin groups. It is noticeable that the incidence of these differences is similar in reports issued with the DP and LM platforms.
Previous studies have highlighted areas where DP may present difficulties. These include recognition of bacteria, identification of amyloid and calcification, and a tendency to 'over-call' dysplasia or atypia. [13,23,36,37] Examining these and other areas revealed no apparent patterns across the DP and LM modalities. For example, failure to recognise Helicobacter pylori in gastric biopsies was seen six times on LM and seven times on DP; gastric amyloidosis was missed by two pathologists on both LM and DP reports. There were only single instances of Giardia duodenalis and cytomegalovirus, respectively, being missed, both on DP. There were no errors recorded in breast due to failure to detect calcification.
Where slight differences between LM and DP were seen, for example in breast B5a (in-situ carcinoma) versus B5mi (microinvasive carcinoma) and in grading dysplasia in GI adenomatous polyps, these were in areas where differences between reports are common, and further examination showed no consistent trend with either modality. Regarding dysplasia grading, the second most common difference seen in the GI series, this difference occurred in 21 and 28 LM and DP reports, respectively. However, DP, in common with LM, showed greater numbers of differences of low-grade dysplasia against a GT of high-grade dysplasia (i.e. undercalling the dysplasia grade) than the reverse, which is the opposite of what would be seen if DP were indeed leading to overcalling of dysplasia grade. Therefore, we can find no evidence that the platform used has any bearing on these differences.
It is important to note that challenging cases are recognised as such by pathologists at the time of reporting, and this is reflected in lower confidence levels and varying terminologies in the reports; arbitrators can also have different opinions of what constitutes a clinically important difference, based on variation in local practice. Pathologists in practice are aware of these challenges and routinely refer such cases for peer review by colleagues. Pathologists know when they have seen a region of interest with sufficient confidence to be able to make a diagnosis. The recognition of (and of the absence of) bacteria and similar subcellular objects may indeed be better on LM. It is possible that this could account for the trend towards greater confidence in LM than DP seen in this study, although further work is needed to fully understand the reason for increased confidence scores with LM. Irrespective of this finding, the advantages DP offers can still be fully exploited while retaining the superiority that LM may have for some tasks: a timely reminder, if it were needed, for laboratories to ensure that support exists for pathologists working geographically separate from the slides, as the slides may need to be examined by LM before the case is reported. Either transport of the slides to the pathologist when needed, or review by a colleague with access to the slides, would suffice. This is the first study, to our knowledge, to demonstrate that DP is equivalent to LM in cancer screening cases and renal biopsies. The flexibility that DP allows in the distribution of the workload is pivotal in both these areas, where capacity demand and access to highly specialised services are currently important constraints on service delivery.
In breast cancer screening, the comparison of LM versus DP for CMC was 96.27%, which is very high but slightly below the reference value of 98.3%. However, comparison with the GT for these samples shows slightly better agreement with DP (99.89%) than with LM (97.57%), indicating, together with the lower ICC scores, that these variances are more likely to be due to differences in the interpretation of challenging biopsies than to the modality. The reporting of cancer screening cases for other sites, such as uterine cervix and lung, is based upon similar principles (i.e. assessment of features of atypia and invasive carcinoma on H&E sections), so there is every reason to believe that the results presented here will translate to these sites. Renal biopsy cases require both fine optical resolution and access to immunofluorescence studies. The data for these samples are being published in greater detail in a separate paper but, overall, this study demonstrates that DP is equivalent to LM for these samples. This should help healthcare providers to embrace the opportunities DP offers to redesign and strengthen the service, and provide confidence that DP should be equally successful in other speciality areas with similar requirements, such as haematopathology and neuropathology.
This study is one of the largest and most detailed studies comparing DP and LM yet conducted. In common with previous studies, our results show excellent correlation between LM and DP, including in cancer screening cases.

Pathologists
Sixteen pathologists, all National Health Service (NHS) consultants with 3-35 years' experience, worked in the speciality areas of their normal practice. All completed training on the study DP image management system. Eleven pathologists not using DP for routine practice completed DP training following the Royal College of Pathologists' best practice recommendations. [28]

Figure 1. Study overview. Cases were recruited from participating sites in the four speciality groups, anonymised and enrolled into the study. In each group, each case was examined twice by each pathologist, using light microscopy (LM) and digital pathology (DP), respectively. The sequence of whether LM or DP was performed first was randomised, and there was a six-week gap between readings. On completion of the eight reads, all clinically significant differences were reviewed in consensus meetings, held by the reporting pathologists, to agree the ground truth diagnosis.

A S Azam et al.
Figure 3. CONSORT diagram of cases entered into the study.

© 2024 The Authors. Histopathology published by John Wiley & Sons Ltd. Available from https://onlinelibrary.wiley.com/doi/10.1111/his.15129; OA articles are governed by the applicable Creative Commons License.

Table 2. Summary of the reports' comparison data.

Table 3. Summary of the clinical management concordance (CMC) analysis using random-effects (RE) logistic regression models. †n is the number of cases; each case is reported by four pathologists, so the number of comparisons in the analysis is 4n. ‡Primary objective: intraobserver intermodality clinical management concordance.

Table 6. Errors recorded in two or more instances in the breast, gastrointestinal (GI) and skin specialities.

Table 7. Comparison of this study with other multisite validation studies previously published in the literature.