Performance of artificial intelligence for detection of subtle and advanced colorectal neoplasia

There is uncertainty regarding the efficacy of artificial intelligence (AI) software in detecting subtle and advanced neoplasia, particularly flat lesions and sessile serrated lesions (SSLs), due to the low prevalence of these lesions in testing datasets and prospective trials. This has been highlighted as a top research priority for the field.


INTRODUCTION
ARTIFICIAL INTELLIGENCE (AI) based systems for polyp detection have been shown to increase adenoma detection rate (ADR) in randomized controlled trials. To date, these gains have been limited to non-advanced adenomas.1 There remains significant uncertainty regarding the efficacy of AI software to detect advanced neoplasia, particularly flat lesions, due to the low prevalence of these subtle abnormalities in both pre-clinical testing datasets and prospective trials.2,3 A similar issue exists for sessile serrated lesions (SSLs). This issue is particularly important since there is debate about whether increased detection of non-advanced adenomas alone translates to reductions in interval colorectal cancers (CRCs). Improving the performance of AI to detect more challenging and advanced lesions was ranked as the second highest priority in a recent international research priority setting exercise for AI in colonoscopy.4 In particular, a recommendation was made to create enriched datasets with subtle lesions, especially in scenarios where perceptual errors can occur. This was further emphasized by a recent literature review.3 Although current research efforts predominantly focus on prospective evaluation of computer-aided detection (CADe) software in clinical trials, there remains an important role for retrospective pre-clinical studies using video datasets. These allow evaluation and improvement of the standalone technical performance of the AI software, and comparison of performance against multiple endoscopists who view the same videos. Current datasets are often limited by selection bias, largely containing lesions that are readily identified during routine clinical practice.5 In this study, we aimed to develop video datasets enriched with flat lesions, SSLs and advanced colorectal polyps to evaluate AI technical performance, including a perceptually challenging video database to also allow comparison of AI performance against endoscopists.

Training and initial test set
TO DEVELOP THE deep-learning based algorithm, a training dataset was created consisting of a combination of still colonoscopy images and videos (Dataset A and Dataset B).[7][8][9] A video database was also created at our institution between August 2018 and March 2019, consisting of complete colonoscopy withdrawals from 50 patients (cecum to rectum), using Olympus (Tokyo, Japan) EVIS LUCERA CV290(SL) processors and colonoscopes, recorded at 25 frames per second. Patients with advanced CRC or inflammatory bowel disease were excluded. Procedures were performed by two expert national bowel cancer screening accredited colonoscopists (ADR >45%). All polyps were confirmed by histopathology. Polyp size, morphology and location were also recorded. Full-length videos (white light only) were divided into shorter polyp-positive and polyp-negative sequences. Magnification or near-focus frames were excluded. Polyp-positive frames were annotated based on the methods described in Appendix S1. The 50 procedures containing 210 polyps were randomly split on a per-procedure basis to create training (Dataset A), tuning and initial test datasets (Dataset C) consisting of 33, two and 15 procedures, respectively. The datasets are described in further detail in Table 1. The tuning dataset was used to optimize the model hyperparameters.
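As a minimal sketch (not the authors' code), the per-procedure split described above can be illustrated as follows. Procedure identifiers and the fixed seed are hypothetical; the key point is that whole procedures, rather than individual frames, are shuffled and partitioned, so that no patient's frames appear in more than one dataset.

```python
import random

def split_procedures(procedure_ids, n_train=33, n_tune=2, n_test=15, seed=0):
    """Randomly partition procedure IDs into train/tune/test sets.

    Splitting at the procedure (patient) level prevents frames from the
    same colonoscopy leaking between training and test data.
    The 33/2/15 sizes match those described in the text.
    """
    ids = list(procedure_ids)
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    tune = ids[n_train:n_train + n_tune]
    test = ids[n_train + n_tune:n_train + n_tune + n_test]
    return train, tune, test
```

Because the partition is disjoint by construction, any frame-level dataset derived downstream inherits the guarantee of no patient overlap.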

Prospective independent test datasets
Once the algorithm had been developed and initially evaluated in the first step above, we prospectively recorded a further 45 patient colonoscopy withdrawals at our institution between April 2019 and November 2019, using the same methods described previously. Twenty of these procedures contained 88 polyps and 25 were polyp-negative. Based on the methods in Appendix S1, these generated 8950 polyp-positive (Dataset D) and 542,484 polyp-negative frames (Dataset E), which are described in further detail in Table 1.

'Subtle' and perceptually challenging dataset
To specifically evaluate the algorithm on perceptually challenging lesions, we prospectively collected colonoscopy polyp encounter videos during routine clinical care, in which two expert endoscopists identified subtle visual cues of polyps in 'near miss' scenarios. Short white light video sequences were generated. Initial early sequences of the polyp encounter, including the subtle visual cues of the polyp, were created. The median length of these videos was 9.5 s (interquartile range [IQR] 8.0-10.0). In these situations, the polyp was not immediately identified, i.e. the operator continued to withdraw a few folds before noticing the subtle visual cue, or the lesion was in the periphery or distance of the visual field before being recognized. For the same polyp encounters, we also created paired late short sequences, in which the same polyp had been brought close into view and optimally positioned, i.e. centered just prior to polypectomy. The median length of these videos was 4.0 s (IQR 3.0-6.0). All of these polyps were confirmed by histopathology. Frames were annotated according to the methods in Appendix S1. A total of 39 polyps were included from 30 patients, resulting in 7683 polyp-positive frames (Dataset F). We named this the University College London (UCL) subtle polyp dataset.

External validation dataset
The ETIS-LARIB open database consists of 196 high-definition polyp-positive frames, from 44 different polyps involving 31 sequences, captured using Pentax 90i series EPKi 7000 processors (Dataset G).9,10 All the datasets are summarized in Table 1; two were used for training (Datasets A and B) and five for testing (Datasets C, D, E, F and G). The test datasets were independent of all training processes, with no patient overlap.

Algorithm development
A fully convolutional network with a ResNet-101 backbone architecture was used. The model was trained with PyTorch on an NVIDIA GeForce RTX 2080 Ti GPU. Further algorithm development details are included in Appendix S1.

Evaluating the algorithm
The bounding box annotations were used as the ground truth for polyp presence or absence, with all polyps included in the study confirmed by histopathology. Performance metrics for evaluating the algorithm included per-frame sensitivity, per-frame specificity and per-frame positive predictive value. A true positive occurred when the algorithm bounding box overlapped with the ground truth bounding box. Per-polyp sensitivity was defined as the number of polyps correctly detected by the model in at least one frame divided by the total number of polyps present in the test dataset. For the UCL-subtle dataset, time to detection was also calculated. Further detailed definitions are in Appendix S1. We also utilized an existing published false positive CADe clinical classification system to categorize false positives.11 For the purposes of this analysis, a total of 80 false positives were randomly extracted based on duration of appearance.
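The per-frame logic described above can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: the (x1, y1, x2, y2) box format and the "any positive intersection" overlap criterion are assumptions based on the text stating that a true positive requires the algorithm box to overlap the ground truth box.

```python
def boxes_overlap(a, b):
    """Return True if two (x1, y1, x2, y2) boxes intersect."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def per_frame_metrics(predictions, ground_truths):
    """Compute per-frame sensitivity, specificity and PPV.

    predictions/ground_truths: parallel lists, one entry per frame,
    each a (possibly empty) list of bounding boxes. A polyp-positive
    frame counts as a true positive if any predicted box overlaps any
    ground-truth box; a polyp-negative frame with no prediction is a
    true negative.
    """
    tp = fp = fn = tn = 0
    for preds, gts in zip(predictions, ground_truths):
        if gts:
            if any(boxes_overlap(p, g) for p in preds for g in gts):
                tp += 1
            else:
                fn += 1
        else:
            if preds:
                fp += 1
            else:
                tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    ppv = tp / (tp + fp) if tp + fp else 0.0
    return sensitivity, specificity, ppv
```

Per-polyp sensitivity would then follow by grouping frames per lesion and counting a polyp as detected if at least one of its frames is a true positive.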

Endoscopist evaluation
To compare performance with the algorithm, and to evaluate the perceptual difficulty of the UCL-subtle polyp dataset, eight endoscopists from our institution reviewed the same 34 video clips containing 39 polyps. The endoscopists had never seen the lesions before. Only the challenging early video polyp sequences were included. Further methodological details are included in Appendix S1.
Two groups of endoscopists participated. The first consisted of four independent colonoscopists who had performed >1000 colonoscopies and were accredited according to the Joint Advisory Group on gastrointestinal endoscopy (JAG) national standards in the UK. The second group included four JAG non-independent (trainee) endoscopists who had performed <500 colonoscopies.

Statistical analysis
Parametric continuous variables are expressed as means with standard deviation and non-parametric variables as medians with IQR. Clopper-Pearson exact 95% confidence intervals (CIs) were calculated. The chi-squared test, or Fisher's exact test where appropriate, was used to compare differences in categorical variables. The Mann-Whitney U test was used to compare differences in polyp detection reaction times between endoscopists and the convolutional neural network (CNN). P < 0.05 was considered statistically significant. All statistical analyses were performed using GraphPad Prism (version 8; San Diego, CA, USA).
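As a hedged illustration of the interval calculation (the authors used GraphPad Prism, not this code), the Clopper-Pearson exact CI for a binomial proportion such as per-frame sensitivity can be computed from beta distribution quantiles; `clopper_pearson` is a hypothetical helper name.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) 100*(1-alpha)% CI for k successes in n trials.

    The bounds are quantiles of beta distributions; the edge cases
    k = 0 and k = n pin the corresponding bound to 0 or 1.
    """
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper
```

For example, 85 true positive frames out of 100 polyp-positive frames gives a point estimate of 0.85 with an exact 95% CI of roughly 0.76 to 0.91.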

Ethics
The study was approved by the Cambridge central research medical ethics committee (REC Reference No. 18/EE/0148).
The inference time for the algorithm was 53.5 ms for the testing dataset (using an NVIDIA GeForce RTX 2080 Ti GPU), meeting the requirements for real-time detection.
Figure 1 and Video S1 demonstrate detections of subtle polyps by the algorithm.
Performance in the test datasets (Datasets C, D, E and G) and the subtle dataset (Dataset F) is summarized in Tables 2 and 3.
Further subgroup analyses based on polyp size, morphology and histopathology are included in Tables S2 and S3.
When considering only polyps that were detected by both the CNN and at least one endoscopist, after correcting for baseline endoscopist reaction times, the median detection times for JAG independent endoscopists and the CNN were

False positive analysis
Of the 80 randomly selected false positives that were reviewed, 86% (n = 69) were caused by artefacts from the bowel wall and 14% (n = 11) by artefacts from bowel content. The subcategories accounting for the highest proportion of false positives were folds (43.8%, n = 35), followed by normal mucosa (16.3%, n = 13) and the ileocecal valve (15%, n = 12). The results are summarized in Table S1.

DISCUSSION
IN THIS STUDY we developed and validated an AI polyp detection system across multiple datasets, effectively creating high-risk populations enriched with flat lesions, including a high proportion of SSLs, and advanced colorectal polyps, including laterally spreading tumors (LSTs). Our model was evaluated exclusively on video frames, demonstrating high per-lesion sensitivities. Moreover, in our unique perceptually challenging video dataset, our CADe detected significantly more subtle polyps than both JAG independent and trainee endoscopists.
To our knowledge, our study reports the largest video validation of CNN performance on SSLs to date.[13][14][15] Hassan et al.16 reported results using one of the largest video test datasets, containing 338 polyps; however, absolute SSL numbers were not described, and performance reporting was grouped with adenomas on a per-lesion basis. Zhou et al.17 specifically addressed SSL validation on a dataset of 42 SSLs, with per-frame and per-polyp sensitivities of 84.1% and 100%, respectively. However, in the dataset described by Zhou et al., 47% of SSLs were located in the rectum and sigmoid, and overall 69% were diminutive, which are less likely to contribute to interval CRC. Similarly, video performance evaluation on flat neoplasia, particularly advanced lesions including LSTs, is very limited in published studies.3 Misawa et al. recently published an open-access video dataset, with a reported CNN flat per-lesion sensitivity of 98.3% and a per-frame sensitivity of 86.7%. This dataset contained 100 lesions, including 34 flat lesions, one LST and four SSLs.13 Yamada et al. produced a flat morphology enriched video dataset of 56 lesions, of which 44 were slightly elevated or depressed, reporting overall per-frame and per-lesion sensitivities of 74% and 100%, respectively.15 On the basis of per-frame sensitivities, our model performed better than that of Yamada et al. and comparably to that of Misawa et al. on all our datasets, excluding the subtle dataset. There is considerable variability in the definition of per-polyp sensitivity across studies; future consensus definitions could improve benchmarking.
Existing published retrospective CADe studies have rarely compared AI video performance with multiple endoscopists. Wang et al.18 performed a post-hoc analysis using 159 short video clips of missed polyps from a double-blind CADe RCT. Three experienced endoscopists retrospectively reviewed the video clips, achieving an overall per-polyp sensitivity of 17%. Almost half of the missed polyps were hyperplastic, 91% of the adenomas were diminutive and 99% were sessile. Furthermore, only five SSLs and one advanced adenoma (LST) were included. Livovsky et al.19 evaluated CADe performance on a large testing video set containing 1393 procedures, including a subgroup of 'subtle polyps' missed by endoscopists, although these were defined by re-analysis of false positives, without corresponding data on polyp size, morphology or histopathology. Our perceptually challenging dataset was enriched with lesions that are critical for CRC prevention. Our study is also the first to introduce the concept of separating analyses into early and late polyp encounter sequences. We demonstrate significantly lower sensitivities for early sequences, which emphasizes the importance of focusing video dataset design on the most challenging component of sequences. We also validated perceptual difficulty by performing multi-reader studies on the polyp encounters, demonstrating superior performance of our CNN against both JAG accredited independent endoscopists and trainees. The relatively low sensitivity of endoscopists in this study suggests that recognition errors for subtle advanced neoplasia could be an important factor in interval CRCs. In addition, our results suggest that a learning curve may exist.
When considering false positives, the overall per-frame specificity was approximately 80%, which is slightly lower than in other video studies.20 However, unlike many other studies, we did not exclude low-quality images from our non-polyp frames, and these are a common source of false positives. Because per-frame metrics alone may not reflect the clinical relevance of false positives, we classified a random selection of false positives using a published scheme.11 The distribution of causes was similar to that published for another CADe system: mostly artefacts from the bowel wall, with a smaller proportion due to bowel content. Similarly, the two main subcategories were folds and normal mucosa. The previously published classification study suggested that most of these are readily discarded by endoscopists. False positives may nevertheless lead to poor adoption of CADe systems. Further research is warranted to identify methods to address this, such as the use of recurrent neural networks, whilst ensuring that sensitivity is maintained for the detection of subtle, advanced colorectal neoplasia.
Limitations of this study include its retrospective design, with results possibly subject to selection bias, although this was minimized by using video data that included low-quality image frames from perceptually challenging lesions. Bowel preparation scores were not recorded, so their effect on CADe performance was not evaluated. We also did not evaluate AI-endoscopist interaction: the clinical impact of CADe systems will depend on the ability of operating endoscopists to recognize whether the AI output represents a true lesion, or whether it might be incorrectly discarded as a false positive. Furthermore, although we did perform an external validation using a still image dataset, it is very difficult to obtain video datasets enriched with subtle advanced lesions. Moreover, although we created a novel dataset, the absolute number of advanced subtle lesions was relatively small. Given the low prevalence of such lesions, large multi-center research collaborations will be required to overcome this limitation.
In conclusion, we evaluated the technical performance of a CADe algorithm to detect flat neoplasia, SSLs and advanced polyps, demonstrating high sensitivity in a video dataset. Using a novel perceptually challenging dataset enriched with advanced lesions, the algorithm detected significantly more polyps than endoscopists. Prospective clinical trials should assess the ability of CADe systems to detect subtle advanced neoplasia in populations at higher risk for CRC. Ultimately, however, population-based trials targeting 'average-risk' individuals are required to establish the value of AI in CRC prevention.21

Figure 1
Figure 1 Examples of subtle polyp detections by the algorithm. The top row contains raw images and the bottom row contains corresponding images with the algorithm output (blue bounding box) and a black outline highlighting the polyp area. From left to right, the first image contains two LST subtypes (LST-G-H and LST-NG-F) in the cecum, the second image contains an LST-NG-F subtype in the transverse colon, the third contains a sessile serrated lesion (SSL) in the transverse colon and the final image contains two SSLs in the transverse colon. LST-G-H, laterally spreading tumor, granular homogeneous type; LST-NG-F, laterally spreading tumor, non-granular flat-elevated type.

Table 1
Description of all the datasets used to train and test the artificial intelligence algorithm

Table 2
Algorithm performance in test datasets C, D, E and G