A new artificial intelligence system successfully detects and localises early neoplasia in Barrett's esophagus by using convolutional neural networks

Abstract
Background and aims: Seattle protocol biopsies for Barrett's esophagus (BE) surveillance are labour intensive with low compliance. Dysplasia detection rates vary, leading to missed lesions. This can potentially be offset with computer aided detection. We have developed convolutional neural networks (CNNs) to identify areas of dysplasia and where to target biopsy.
Methods: 119 videos were collected in high-definition white light and optical chromoendoscopy with i-scan (Pentax Hoya, Japan) imaging in patients with dysplastic and non-dysplastic BE (NDBE). We trained an indirectly supervised CNN to classify images as dysplastic/non-dysplastic using whole-video annotations to minimise selection bias and maximise accuracy. The CNN was trained using 148,936 video frames (31 dysplastic patients, 31 NDBE, two normal esophagus), validated on 25,161 images from 11 patient videos and tested on 264 i-scan 1 images from 28 dysplastic and 16 NDBE patients, which included expert delineations. To localise targeted biopsies/delineations, a second, directly supervised CNN was generated based on expert delineations of 94 dysplastic images from 30 patients. This was tested on 86 i-scan 1 images from 28 dysplastic patients.
Findings: The indirectly supervised CNN achieved a per-image sensitivity in the test set of 91%, specificity of 79% and area under the receiver operator curve of 93% for the detection of dysplasia. Per-lesion sensitivity was 100%. Mean assessment speed was 48 frames per second (fps). 97% of targeted biopsy predictions matched expert and histological assessment, at 56 fps. The artificial intelligence system performed better than six endoscopists.
Interpretation: Our CNNs classify and localise dysplastic Barrett's esophagus, potentially supporting endoscopists during surveillance.


INTRODUCTION
Barrett's esophagus (BE) is associated with an increased risk of progression to oesophageal adenocarcinoma (EAC), advancing from non-dysplastic BE (NDBE) to low grade dysplasia (LGD), high grade dysplasia (HGD) and then EAC. The overall 5-year survival rate of EAC is less than 20% 1 but early neoplasia in BE can be treated endoscopically with eradication rates of more than 90%. 2 The current standard of care for BE surveillance is to take biopsies every 2 cm as part of the Seattle protocol. This is time consuming and may suffer from sampling error 3 and poor compliance. 4 Despite advances in endoscopic imaging, BE dysplasia is still missed. 5 A meta-analysis showed that amongst adults with NDBE at index endoscopy and prolonged follow up, 25% of EACs are diagnosed within a year of the index procedure; these were classified as missed diagnoses. 6 There is growing interest in the use of computer aided detection (CAD) of early lesions in the gastrointestinal tract. To date, this has focused on the detection of colonic polyps, [7][8][9] but CAD is likely to become increasingly important for BE neoplasia detection, and several studies in recent years have investigated this. 10,11 Artificial intelligence (AI) technology could lead to fewer random biopsies, lower histopathology costs and shorter procedures, while earlier endoscopic therapy could offset the costs of oesophagectomy or radiotherapy. This is increasingly important in view of the COVID-19 pandemic induced reduction in routine endoscopic provision. 12 There may still be an issue with the detection of invisible dysplasia. This will be answered by randomised controlled trials in which Seattle protocol biopsies are compared against AI predictions.
With advances in endoscopic optical technology, classification protocols have been developed based on vascular architecture and mucosal pit pattern to improve dysplasia detection. [13][14][15] It might therefore be possible to train a CAD system to use these features.
I-scan (Pentax Hoya, Japan) is a virtual chromoendoscopy technique that uses post-processing technology to provide contrast and surface enhancement. There are three modes: i-scan 1 (surface enhancement), i-scan 2 (contrast enhancement) and i-scan 3 (tone enhancement). I-scan 1 has become the default equivalent of high-definition white light (WL) with the Pentax system. It is the best imaging mode for lesion detection (Supplementary Figure 1). 16 Previous publications from our group have shown that optical enhancement with i-scan is superior to WL in BE, which has paved the way for i-scan to become the standard of care in most units using this platform. 15,17 Studies have investigated the development of neural networks for dysplasia detection in the esophagus, with some promising results using Olympus and Fuji imaging. 10,18 However, to the best of our knowledge, no studies have developed a neural network using Pentax i-scan imaging. The studies to date have trained neural networks on limited numbers of high quality still images. These lack generalisability, particularly as most of the testing and training data sets originated from the same centre or country. 19,20 These factors may limit real time implementation. 21
The primary aims of this multi-centre study were, first, to develop a neural network to detect dysplasia within BE by classifying an image as dysplastic or non-dysplastic and, second, to identify a point of interest for targeted biopsies. Secondary aims were to achieve these goals fast enough to allow real-time dysplasia detection, to compare the performance of the system on i-scan 1 versus WL images, and to compare its performance against endoscopists.

Patient recruitment
Patients attending for BE assessment at four expert European centres were recruited. All cases were collected prospectively, including 86 that had been collected in a previous BE imaging study. 15

Endoscopic procedures and video collection
All videos and still images were collected by six expert endoscopists (RJH, MRB, VS, RB, JM, KR). We defined these as endoscopists with more than 5 years' experience of BE endotherapy who perform endotherapy procedures weekly in BE expert centres, as defined by the European Society of Gastrointestinal Endoscopy guidelines. 23 Videos were prospectively collected using the Pentax endoscopy system (OPTIVISTA plus, EPK-i7000, EG-2990i, EG29-i10). Videos were also collected in WL for most patients.

Tissue acquisition and histology
BE with no suspicion for dysplasia was biopsied as per the Seattle protocol. Areas suspicious for dysplasia either were target biopsied or resected by endoscopic mucosal resection (EMR). Histology was reviewed by expert BE histopathologists with more than 10 years' experience. Cases of suspected dysplasia were reviewed by two different BE histopathologists in each expert centre.

F I G U R E 1
Breakdown of the data set in the classification/segmentation models and the potential importance of each model output in the computer aided detection (CAD) system. CAD, computer aided detection. *In one patient, the video segment of esophagus was split into two segments: dysplastic and NDBE. The former was used for training and the latter for testing.

Creation of a gold standard on video segments
A computer vision annotation tool (Odin Vision, London, UK) was used to annotate sequences of video frames. Annotation confirmed that dysplasia was present within individual frames without defining its position. The gold standard was determined from the histology of EMR specimens or biopsies, and the annotated segments matched these areas on the videos. In NDBE patients, all frames from the esophagus, including squamous mucosa, were included. The dysplastic and non-dysplastic frames were used to train and validate a classification convolutional neural network (CNN; Supplementary Figure 2).

Creation of a gold standard on still images
High and moderate quality BE images were delineated for the presence of dysplasia by expert endoscopists. We test the performance of the targeted biopsy predictions generated by the CNN on two levels: against all areas annotated by experts, and then against the smaller area of overlap between experts.

Model 1: Classification convolutional neural network for dysplasia detection within BE
We trained a CNN with a ResNet101 architecture to classify images as dysplastic or non-dysplastic using randomly selected frames from annotated videos. For each pixel, the CNN predicted a number between 0 (no dysplasia) and 1 (dysplasia present). Further algorithm development details can be found in the supplementary section.
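As a minimal sketch of how the per-pixel scores in [0, 1] can be collapsed into a per-image dysplastic/non-dysplastic call: the max-pooling aggregation and the 0.5 threshold used here are illustrative assumptions, as the pooling rule is not specified in the text.

```python
import numpy as np

def image_level_call(heatmap: np.ndarray, threshold: float = 0.5):
    """Collapse a per-pixel dysplasia map (values in [0, 1]) into one
    per-image score and a binary dysplastic/non-dysplastic call.
    Max-pooling over pixels is an illustrative assumption."""
    score = float(heatmap.max())
    return score, score >= threshold

# A hypothetical 4x4 heatmap with one strongly dysplastic region.
heatmap = np.zeros((4, 4))
heatmap[1, 2] = 0.93
score, dysplastic = image_level_call(heatmap)
```

With this rule, a single confidently dysplastic region is enough to flag the whole frame, which mirrors how a reviewer would treat any visible focus of dysplasia.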

Comparison of the performance of the convolutional neural network versus endoscopists
Sixty-one i-scan 1 images from the testing set were randomly selected (28 dysplastic, 33 non-dysplastic). Six non-expert endoscopists independently classified each image as dysplastic or non-dysplastic.

Statistical analysis
Descriptive statistics consisted of the mean (± standard deviation). We measured sensitivity and specificity at a per-image and per-patient level. The area under the receiver operator curve (AUC) was calculated. Further details are in the supplementary section.
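The three headline metrics can be sketched without any statistics library; the AUC here uses the rank (Mann-Whitney) formulation, and the example labels and scores are hypothetical.

```python
def sens_spec(y_true, y_pred):
    """Sensitivity = TP/(TP+FN); specificity = TN/(TN+FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability that a random dysplastic image outscores
    a random non-dysplastic one (ties count one half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]                      # 1 = dysplastic
scores = [0.9, 0.4, 0.6, 0.2]              # hypothetical CNN outputs
y_pred = [int(s >= 0.5) for s in scores]
sens, spec = sens_spec(y_true, y_pred)
roc_auc = auc(y_true, scores)
```

Because the AUC is threshold-free, it lets the classifier be compared across operating points, which is why it is reported alongside sensitivity and specificity at the chosen threshold.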

Per image classification
In the test set the neural network detected dysplasia on i-scan 1 images with an AUC of 93%, sensitivity of 91% and specificity of 79%.
The AUC was 10% greater than on WL (Table 1; Figure 2).

Localisation of points of interest for targeted biopsies
Four different scenarios for localisation of points of interest were generated (Table 2). When allowing the system to generate any number of targeted biopsy predictions, the sensitivity for localising dysplasia was 97% within the union of expert delineations (Table 2).
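The matching logic behind these scenarios can be sketched as a point-in-mask test: a predicted biopsy point is checked against the union of expert delineations, or against their intersection (the area every expert marked). The masks and point coordinates below are hypothetical.

```python
import numpy as np

def biopsy_hit(point, expert_masks, require_all=False):
    """True if the predicted biopsy point (row, col) falls within the
    expert delineations: the union of masks by default, or their
    intersection (area marked by every expert) when require_all=True."""
    hits = [bool(mask[point]) for mask in expert_masks]
    return all(hits) if require_all else any(hits)

# Two hypothetical 5x5 expert masks that only partially overlap.
expert_a = np.zeros((5, 5), dtype=bool); expert_a[1:4, 1:4] = True
expert_b = np.zeros((5, 5), dtype=bool); expert_b[2:5, 2:5] = True

in_union = biopsy_hit((1, 1), [expert_a, expert_b])
in_overlap = biopsy_hit((1, 1), [expert_a, expert_b], require_all=True)
```

Testing against the union is the more permissive criterion; the intersection test corresponds to the smaller area on which the experts agree.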

Localisation with delineation
The delineation outline prediction for an area of dysplasia generated by the segmentation model overlapped with at least one expert delineation in 98% of images, with a minimum of one pixel of overlap (Figure 4b). This suggests the model points in the right direction. The segmentation delineation had a 50% average Dice score against expert 1 (including false negative predictions in this result).
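The Dice score and the one-pixel-overlap criterion used above can be computed directly from binary masks; the predicted and expert masks below are hypothetical and sized for illustration.

```python
import numpy as np

def dice(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice similarity coefficient: 2|A∩B| / (|A| + |B|)."""
    inter = np.logical_and(pred, truth).sum()
    total = pred.sum() + truth.sum()
    return 1.0 if total == 0 else 2.0 * float(inter) / float(total)

# Hypothetical 4x4 predicted and expert masks, each covering 4 pixels,
# sharing 2 pixels of overlap.
pred = np.zeros((4, 4), dtype=bool);  pred[0:2, 0:2] = True
truth = np.zeros((4, 4), dtype=bool); truth[1:3, 0:2] = True

overlap_pixels = int(np.logical_and(pred, truth).sum())
dice_score = dice(pred, truth)
has_min_overlap = overlap_pixels >= 1
```

A 50% Dice score, as reported above, corresponds to this kind of half-overlap between prediction and expert outline.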

Speed of detection of BE dysplasia
The mean assessment speed was 48 fps for the classification model and 56 fps for the targeted biopsy predictions.

Comparison of the performance of the convolutional neural network versus endoscopists in the detection of dysplasia on i-scan 1 images
On a subset of testing set images, six non-expert endoscopists detected dysplasia with a mean sensitivity of 79% and specificity of 49%. The CNN classified dysplasia with a sensitivity of 96% and specificity of 88% on the same images (Table 3).

DISCUSSION
We demonstrate for the first time a CAD system which can accurately detect early neoplasia in BE using Pentax endoscopes. It has a per-image sensitivity and specificity of 91% and 79%, respectively, on i-scan 1 images. 19 This, however, needs a much larger data set and further studies to assess its true value in terms of localisation performance.
We developed a second segmentation algorithm which allows a high accuracy for localising targeted biopsies with high detection speeds.
This creates a two-stage CAD algorithm which can be translated into clinical practice (Figure 4). We use two models because of annotated data availability.

T A B L E 2 Targeted biopsy predictions generated by the CNN. Note: Different scenarios were generated assessing where the targeted biopsies fall within the gold standard expert delineations, which matched areas of histologically confirmed dysplasia in videos. Based on expert consensus, scenario 2 was considered the most clinically relevant and user friendly.

To use a delineation model to both detect and delineate, more data is needed. In terms of real-time analysis, delineation models are slower than classification models. Our current set-up allows faster, real-time classification than a single delineation model would, with slower (but still fast) delineation on demand.
We chose to use an indirect learning approach to generate heat map outputs from the classifier to assess whether they would help endoscopists identify areas of interest in BE without the need for expert delineations to train, to provide insight into the classifier's predictions, and to compare against the segmentation output from the directly supervised learning approach to assess the potential implications.
Previous studies trained and tested networks on Olympus and Fujinon endoscopy systems for the detection of BE. 10,11,18 Ours is the first study to develop a CNN using the Pentax system. I-scan 1 is often the default imaging modality on these systems as it provides the natural colour tone of WL with the added advantage of surface enhancement. Therefore, the training set was predominantly composed of i-scan 1 rather than WL images. Our results hint towards improved performance of the classifier model on i-scan 1 versus WL.
However, of note the data set for i-scan 1 was larger than that of WL and matched studies are needed to directly compare the performance of the two.
Seattle protocol biopsies at a surveillance endoscopy sample only a small surface area of a BE segment. 6,25 The per-patient sensitivity was 90% in both analyses. In our current model, using minority and majority voting approaches in which at least 2/6 and 4/6 correctly classified images, respectively, are necessary, we achieve per-patient sensitivities of 100% and 89.3%, respectively, on i-scan 1 images.
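The 2/6 minority and 4/6 majority voting rules above amount to a simple threshold on per-image calls; the example image calls below are hypothetical.

```python
def per_patient_call(image_calls, min_votes):
    """Flag a patient as dysplastic when at least `min_votes` of their
    images are classified as dysplastic (1 = dysplastic, 0 = not)."""
    return sum(image_calls) >= min_votes

calls = [1, 0, 1, 0, 0, 0]                  # six images, hypothetical
minority_call = per_patient_call(calls, 2)  # 2/6 minority rule
majority_call = per_patient_call(calls, 4)  # 4/6 majority rule
```

Lowering the vote threshold trades specificity for sensitivity at the patient level, which is why the 2/6 rule reaches 100% per-patient sensitivity while the 4/6 rule does not.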
The CAD system in our study performed better than all six non-expert endoscopists, who had a range of 3-11 years of endoscopic experience. This is the cohort for whom a BE CAD system would be most beneficial. A smaller subset of images was randomly selected for this experiment to make it more likely that the endoscopists would complete the task. The AI system performed better on this subset, and the endoscopists' performance also remains weaker when compared with the AI system's performance on the whole data set.
There are limitations to this current study. We developed a model trained using videos from a single endoscopic system. Ideally, to allow for more generalisation, we would incorporate data from other systems. The current CNN is tested on higher quality still images. However, our training strategy and data selection would now allow us to develop networks which could potentially work better on whole-video data in real time. Another limitation is that the threshold for neoplasia detection was based on the test set. However, the performance can be compared independently of the threshold selection by looking at the AUC scores. It is important to test the threshold selection on an independent, held-out data set, which was not available at the time. Another limitation is that the testing set in 'model 2' (segmentation) comprised 86 images. Ideally, we would have selected a larger test set; however, in this model all images had to be delineated by experts, and given the limited availability of expert time a smaller data set was selected to complete the task. This could be rectified in future by recruiting a broader range of experts. The data set for i-scan 1 was larger than that of WL.
This should be kept in mind when looking at the improved performance on i-scan 1 versus WL in the classification of dysplasia.
In future, we will perform matched studies comparing the two light modalities, with both cohorts matched in terms of number of images, histology and Paris classification of lesions. Another potential limitation is that the segmentation model was trained on the delineations of one expert. In future, we will aim to train the model using the intersection of expert delineations to further improve the localisation ability of the CAD system.
We demonstrate a CAD system which is able to detect BE dysplasia with high accuracy at a per-image and per-patient level. It localises areas of interest for targeted biopsies with high sensitivity. Our next step is to make this work in real time with whole-video predictions.