Artificial intelligence and its impact on quality improvement in upper and lower gastrointestinal endoscopy

Artificial intelligence (AI) and its application in medicine have attracted great interest. Within gastrointestinal (GI) endoscopy, colonoscopy and polyp detection is the most investigated field, with upper GI endoscopy following closely. Since endoscopy is performed by humans, it is inherently an imperfect procedure. Computer-aided diagnosis may improve its quality by helping to prevent missed lesions and by supporting optical diagnosis of those that are detected. AI systems have evolved considerably over the last decades, resulting in optimized diagnostic performance with lower variability, matching or even outperforming expert endoscopists. Given these outstanding diagnostic features, AI holds great potential for future quality improvement in endoscopy. In this narrative review, we highlight the potential benefit of AI for improving overall quality in daily endoscopy and describe the most recent developments in characterization and diagnosis as well as the current conditions for regulatory approval.


INTRODUCTION
IN THE LAST decades, artificial intelligence (AI) has received growing attention in the biomedical sciences. Several medical AI applications are currently being investigated and clinically tested with the ultimate goal of improving diagnostic quality in clinical practice. [1][2][3] Image-based disciplines such as gastrointestinal (GI) endoscopy lend themselves naturally to AI applications. Within GI endoscopy, AI is expected to overcome certain clinical hurdles that are difficult to solve with the currently available imaging techniques, thereby increasing quality. AI appears applicable to many different tasks within GI endoscopy, especially the detection of, and differentiation between, neoplastic and non-neoplastic lesions in both the upper and lower GI tract. 4,5 For instance, optical diagnosis of (early) dysplasia in Barrett's esophagus (BE) remains a struggle, mainly for non-expert endoscopists. To overcome this hurdle, several endoscopic enhancement techniques have been developed and validated, with significant improvement when used by experts; in non-expert hands, however, many BE lesions are missed. 6 In colonoscopy, a widely used quality metric is the adenoma detection rate (ADR). The risk of interval colorectal cancer (CRC) and of cancer mortality is inversely correlated with the ADR. 7 Because several factors influence an endoscopist's ADR, AI may serve as an additional tool to improve it. 8 In this scattered quality landscape, AI is expected to support all endoscopists and improve overall daily performance and quality.
In this physician-engineer co-authored narrative review, we summarize the literature on AI in both upper and lower GI endoscopy and discuss how AI can improve the quality of daily endoscopy.

THE UPPER GI TRACT (OVERVIEW, TABLE 1)
Barrett's esophagus

Barrett's esophagus and esophageal adenocarcinoma are more prominent than ever. The Seattle protocol and the introduction of enhanced imaging techniques have facilitated diagnosis, but sensitivity remains low, sampling error persists, and expert hands are needed. [9][10][11] The American Society for Gastrointestinal Endoscopy endorsed the use of these advanced imaging techniques to shift from random towards targeted biopsies only under specific circumstances; performance thresholds for optical diagnosis were set at a per-patient sensitivity of ≥90%, a negative predictive value (NPV) of ≥98% and a specificity of ≥80% for detection of high-grade dysplasia (HGD) or esophageal adenocarcinoma (EAC). 12 These high demands, combined with the long learning curve, make this feasible only for very experienced endoscopists. It is well known that 50% of endoscopists in community centers do not adhere to the Seattle protocol and that early lesions are difficult to identify. 6,13 Indeed, in up to 75% of cases expert endoscopists can identify visible lesions that were not found before referral to an expert center. 6 Computer-aided detection (CADe) has the potential to support all endoscopists, regardless of experience, in the detection of early neoplasia. Van der Sommen et al. 4 reported a CADe for detection of early neoplasia in BE that achieved reasonable performance in comparison with four expert endoscopists. The ARGOS project used a CADe with hand-crafted features based on color and texture, showing a sensitivity of 95%. 14 Hashimoto et al. 15 trained an AI system (for the different types of AI, see Appendix S1) on 916 images of histologically proven BE with high-grade dysplasia or T1-stage EAC and 916 non-dysplastic BE images, with good performance on the binary outcome (dysplastic vs non-dysplastic) and a per-image accuracy of 95.4%. The binary classifier ran at 72 frames per second, allowing real-time application.
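The performance thresholds above are simple checks on a classifier's confusion matrix. A minimal sketch in Python (the patient counts are hypothetical, for illustration only):

```python
def metrics(tp, fp, tn, fn):
    """Per-patient performance measures from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # fraction of dysplastic patients correctly flagged
    specificity = tn / (tn + fp)  # fraction of non-dysplastic patients correctly cleared
    npv = tn / (tn + fn)          # confidence in a negative optical diagnosis
    return sensitivity, specificity, npv

def meets_asge_thresholds(tp, fp, tn, fn):
    """Thresholds for optical diagnosis of HGD/EAC in BE:
    sensitivity >= 90%, NPV >= 98%, specificity >= 80%."""
    sens, spec, npv = metrics(tp, fp, tn, fn)
    return sens >= 0.90 and npv >= 0.98 and spec >= 0.80

# Hypothetical cohort: 100 dysplastic and 400 non-dysplastic patients
print(meets_asge_thresholds(tp=95, fp=60, tn=340, fn=5))
```

Note how the ≥98% NPV demand is the hardest to meet: even a few false negatives among a large negative pool can sink it, which is why these thresholds favor very experienced endoscopists.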
Other bench studies of similar CADe systems, in a variety of study designs, report comparable performances, as shown by De Souza et al. 16 Recently, de Groof et al. 17 externally validated a multistep-trained CADe system (a pretraining stage followed by a training phase and an internal validation phase) on two datasets of 80 images each, showing a classification accuracy of 89% (dysplastic vs non-dysplastic). Additionally, their system outperformed each of the 53 participating endoscopists and indicated the optimal site for targeted biopsy in 97% and 92% of cases in external validation datasets 1 and 2, respectively.
A major limitation of the aforementioned studies is the offline, often retrospective, use of the CADe. The research has therefore pivoted increasingly towards prospective real-time or live CADe studies over the last 2 years. In a pilot trial by de Groof et al., 18 a CADe system for detection of early neoplasia in BE was used during endoscopy in 10 patients with non-dysplastic BE and 10 patients with confirmed dysplastic BE. During endoscopy, three white light (WL) images were taken at every 2-cm level of the BE segment and analyzed by the CADe system in real time, resulting in a per-level accuracy of 90%. In 75% of all levels, the system produced three concordant predictions. Ebigbo et al. 19 validated their CADe system on 62 images (36 images of early EAC and 26 normal BE images from 14 patients), concurrently assessed by an expert endoscopist and with histopathological confirmation; the system showed a sensitivity of 83.7% and a specificity of 100%. In parallel, the same group tested a CADe system for detection of EAC, showing a sensitivity and specificity of 97% and 88%, respectively, for WL images and 94% and 80% for narrow band imaging (NBI) images. 20 These performances significantly exceeded those of 11 of the 13 endoscopists for WL, NBI or both.
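Aggregating the three per-image predictions at each 2-cm level into one per-level call can be sketched as a majority vote; this is an illustrative rule, not necessarily the exact aggregation used in the trial:

```python
from collections import Counter

def per_level_call(predictions):
    """Combine the three per-image CADe predictions for one 2-cm BE level.

    Returns the majority label and whether all images agreed (concordance,
    reported in 75% of levels in the pilot trial). Illustrative logic only.
    """
    counts = Counter(predictions)
    label, votes = counts.most_common(1)[0]
    concordant = votes == len(predictions)  # True only if all images agree
    return label, concordant

label, concordant = per_level_call(["dysplastic", "dysplastic", "non-dysplastic"])
print(label, concordant)  # dysplastic False
```

A majority rule tolerates one discordant frame per level, which is one plausible way a per-level accuracy (90%) can exceed what any single frame would deliver.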

Esophageal squamous cell carcinoma
Pharyngeal and esophageal squamous cell carcinomas are frequent and often diagnosed at an advanced stage. Since early esophageal squamous cell carcinoma (ESCC) is often subtle or impossible to visualize with WL endoscopy (WLE), Lugol's chromoendoscopy is the gold-standard screening method in high-risk patients. Endoscopic enhancement techniques such as NBI facilitate the detection of ESCC without the use of iodine staining 21,22 ; however, in a recent randomized controlled trial the specificity was rather low (42.1%), and sensitivity was only 53% in inexperienced hands. 23,24 As in BE, detection thus appears to depend heavily on the endoscopist's experience, a caveat that could potentially be resolved by CAD.
Several benchmark studies of CADe systems have been published in the last decade. Recently, CADe applications for pharyngeal cancer have been emerging with promising results: two Japanese research groups retrospectively showed good sensitivities (both >85%) for CNN-based CADe detection of pharyngeal cancer on NBI images. 25

Guo et al. 29 developed a computer-aided diagnostic tool (CADx) trained on 6473 NBI images, including precancerous, early ESCC and normal images as well as video data. A heat map was generated for every input image in which the CADx system identified a suspicious lesion. The per-image sensitivity for all 1480 malignant images from 59 patients (dataset A) was 98.04%, and the per-image specificity for all 5191 non-cancerous images from 2004 patients (dataset B) was 95.03%. With a processing speed of 25 frames per second and a latency of less than 100 milliseconds (ms), the per-frame sensitivity (dataset C, 27 non-magnifying and 20 magnifying videoclips) and specificity (dataset D, 30 non-magnifying and three magnifying videoclips) were 91.5% and 99.9%, respectively. Zhao et al. 30 developed a CADe system for classification of intrapapillary capillary loops for detection and classification of ESCC on magnifying NBI images, with mean diagnostic accuracies of 89.2% and 93.0% at the lesion and pixel levels, respectively, significantly better than the endoscopists.
The quality of ESCC treatment can also be improved by better prediction of submucosal invasion, a decisive parameter for patient selection. Two Japanese research groups recently published data on CADx systems developed and trained to differentiate the invasion depth of ESCC. Nakagawa et al. 31 showed a sensitivity and accuracy of 90.1% and 91.0%, respectively, for differentiating pathologic mucosal and submucosal microinvasive cancers from deeply invasive cancers, results comparable with those of 16 experienced endoscopists.

Gastric cancer
Cardia and non-cardia gastric cancer combined are the fifth most frequently diagnosed cancer. 33 Endoscopic diagnosis is difficult for two major reasons. First, early gastric cancer mostly presents as a subtle depression or elevation with faint redness, which hinders endoscopic recognition. Second, predicting the depth of invasion into the gastric wall is hard but important for treatment selection. Optical diagnosis using image enhancement techniques such as NBI, flexible spectral imaging color enhancement (FICE) or blue light imaging (BLI) has been shown to be useful but again demands substantial expertise. CADx might help to overcome these problems.
Miyaki et al. 34 developed a CADx system for differentiation between cancerous and non-cancerous gastric lesions on magnifying FICE images, yielding a detection accuracy of 86% and a sensitivity of 85%. In 2015, the same group applied the same CADx to magnifying BLI images for quantitative validation, generating a significantly greater average output for cancerous lesions (0.846 ± 0.220) than for reddened lesions or surrounding tissue. 35 Further advances were made by Kanesaka et al., 36 who developed a CADe tool not only for detection but also for delineation of the border between cancerous and non-cancerous gastric lesions, with a sensitivity of 97% and a specificity of 95%. Hirasawa et al. 37 were the first to report detection of early gastric cancer using a deep learning (DL) approach.

A CNN-based system for prediction of Helicobacter pylori (HP) infection achieved a per-image specificity and accuracy of 90.1% and 84.5%, respectively. If multiple gastric images were used per patient, the diagnostic accuracy on a per-patient basis was very high, with a sensitivity, specificity, and accuracy of 91.6%, 98.6%, and 93.8%, respectively, comparable to the sensitivity and specificity of histological testing. 48

Artificial intelligence for patient management of gastric ulcer bleeding

Wong et al. 53 used machine learning to derive a predictive score identifying patients at high risk of recurrent idiopathic HP-negative gastric ulcer bleeding. The machine learning model predicted the 1-year risk of recurrent gastric ulcer bleeding with an accuracy of 84.3%.

THE LOWER GI TRACT (OVERVIEW, TABLE 1)
WITH THE INTRODUCTION of local and national screening programs, CRC mortality declined by 32% among individuals aged >50 years between 2000 and 2014. 54,55 However, tandem colonoscopy studies show colonoscopy to be an imperfect test, with miss rates of up to 27% for diminutive polyps. 56,57 ADR is a widely accepted and validated quality metric for screening colonoscopy, defined as the percentage of colonoscopies performed by an endoscopist in which at least one adenoma is detected. ADR is inversely correlated with the risk of interval CRC: every 1% increase in ADR is associated with a 3% decrease in the risk of CRC and a 5% decrease in the risk of fatal CRC. 7 Many applications, both mechanical and optical, have been developed to increase ADR, all with shared shortcomings such as the need for specialized hardware and/or a long learning curve. [58][59][60][61] This explains the keen interest in AI as an adjunct for detection and diagnosis in colonoscopy. [62][63][64]

Polyp detection

In the last two decades a tremendous evolution in CADe for colorectal polyp detection has taken place, with a shift from hand-crafted algorithms to DL CNNs with or without transfer learning. [65][66][67][68][69] Not only have technical aspects such as processing time and data handling improved; diagnostic performance has also advanced tremendously over the past decade. However, all of these systems were tested only on the bench.
Real-time use of a CNN was first introduced in 2018 by Urban et al., 5 who pre-trained a CNN with images from ImageNet, subsequently trained and tested it on multiple datasets of colonoscopy images, and additionally tested the algorithm on a set of 11 quick-withdrawal videos, achieving an accuracy of 96%. They also assessed the efficacy of their algorithm by comparing assessments of nine videos with vs without CADe: with CADe, nine more polyps were identified (45 vs 36).
The most recent step in the AI evolution is CADe application in vivo during live colonoscopy. Klare et al. 70 first prospectively studied CADe for polyp detection during live colonoscopy performed by a trained endoscopist, while a second observer monitored the CADe output. The system analyzed with an average delay of only 50 ms and achieved a polyp detection rate (PDR) of 51% and an ADR of 29%, comparable to the endoscopist's PDR of 56% and ADR of 31%. No additional detections were made by the system, which questions the real-world applicability and additional clinical value of such a setup. The first randomized controlled trial (RCT) comparing a CADe system with standard colonoscopy for polyp detection was conducted by Wang et al., 71 in which 1058 patients were randomized 1:1 to standard colonoscopy or colonoscopy with CADe. The CADe system significantly increased the ADR (29.1% vs 20.3%, P < 0.001) and the mean number of adenomas per patient (0.53 vs 0.31, P < 0.001). The increase in ADR was driven by a higher number of diminutive adenomas; there was no statistical difference for adenomas larger than 10 mm. The number of hyperplastic polyps detected was also significantly increased (43.6% vs 34.9%, P < 0.001), without a difference in advanced adenomas or sessile serrated polyps. The CADe system gave only 39 false-positive results (i.e. a continuous trace of the CADe system deemed by the endoscopist not to be a polyp) in the CADe group, on average 0.075 false positives per colonoscopy. Similar results were recently published by Wang et al., 72 who conducted a sham-controlled RCT, showing an increase in ADR (34% vs 28%) and only 0.1 consistent false positives per colonoscopy. The first commercially available CADe system for polyp detection (GI Genius, Medtronic, Table 2) was recently studied in a retrospective validation trial by Hassan et al., 73 showing excellent performance with a per-lesion sensitivity of 99.7%.
The reaction time of the CADe system was faster than that of expert endoscopists in 82% of cases. Moreover, Repici et al. 74 evaluated the same CADe system in a prospective RCT of 685 patients, showing an ADR of 54.8% with CADe vs 40.4% in the control group (relative risk 1.30, 95% CI 1.14-1.45) and a higher mean number of adenomas detected in the CADe group (1.07 ± 1.54 vs 0.71 ± 1.20). Adenomas of ≤5 mm and of 6-9 mm were detected in a significantly higher proportion in the CADe group.
These studies and performances highlight the full potential of AI in colorectal polyp detection as well as its possible impact on the future quality of colorectal endoscopy.
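The inverse ADR-CRC association cited earlier gives a rough sense of what such ADR gains could mean for patients. A small illustrative extrapolation (the multiplicative compounding and the example ADR values are our assumptions, not the original study's model):

```python
def projected_risk_reduction(adr_baseline, adr_new,
                             crc_decrease_per_point=0.03,
                             fatal_decrease_per_point=0.05):
    """Project relative interval-CRC risk from an ADR improvement,
    using the reported association: each 1-point ADR increase is tied
    to ~3% lower CRC risk and ~5% lower fatal-CRC risk.
    Compounding the effect multiplicatively per point is a simplifying
    assumption for illustration."""
    points = adr_new - adr_baseline
    crc_rr = (1 - crc_decrease_per_point) ** points
    fatal_rr = (1 - fatal_decrease_per_point) ** points
    return crc_rr, fatal_rr

# e.g. an ADR rising from 20% to 29%, roughly the gain seen with CADe
crc_rr, fatal_rr = projected_risk_reduction(20, 29)
print(f"relative CRC risk: {crc_rr:.2f}, relative fatal-CRC risk: {fatal_rr:.2f}")
```

Under these assumptions a 9-point ADR gain would correspond to roughly a quarter lower interval-CRC risk, which is why even modest detection improvements are considered clinically meaningful.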

Polyp characterization and diagnosis
Many research teams have developed CADx systems for differentiation between adenomas and hyperplastic polyps based on vascular and surface patterns, as in the NICE classification. Kominami et al. 75 prospectively tested a CADx system in a pilot trial, showing accuracies of >92.7% in predicting a surveillance interval based on CADx diagnosis of diminutive polyps. In 2012, Takemura et al., and more recently (2019) Byrne et al., each retrospectively tested a DL CADx system for magnified-NBI polyp characterization; both met the PIVI threshold for optical diagnosis (NPV >90% for diagnosis of diminutive polyps). 76,77 CADx on white light endoscopy has been less extensively investigated. 62 Preliminary results have been published by Komeda et al., 78 whose DL model provided an accuracy of 75.1%. Sanchez-Montes et al. 79 developed a hand-crafted CADx based on contrast, tubularity and branching of the polyp's surface, resulting in a sensitivity of 95% for diminutive rectosigmoid adenomas. A new CADx system, EndoBRAIN, has recently shown the potential to improve low neoplasm identification rates without interobserver variability, with a significant increase in diagnostic performance compared with expert endoscopists in both stained endocytoscopic and NBI images (Table 2). 80 Polyp size measurement is important for effective diagnosis, treatment and determination of surveillance intervals, yet polyp size is often overestimated. Suykens et al. 81 reported an AI system that objectively infers polyp size from a reference tool (i.e. biopsy forceps) in the endoscopic image. They used two separate DL algorithms, for (i) delineation of the polyp and (ii) detection of two landmarks on the forceps; the resulting pipeline detected both polyp and forceps in 71% of the 35 test images.
The trimmed average difference between ground truth and predicted size was +0.52 mm [standard deviation (SD) 1.78 mm] for the algorithm and +1.40 mm (SD 1.82 mm) for the endoscopist, a 63% decrease in overestimation bias (P < 0.1).
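The geometric idea behind such reference-tool sizing can be sketched in a few lines. This is a deliberately simplified 2D version with hypothetical pixel values; the published system uses two DL algorithms and is more involved:

```python
import math

def estimate_polyp_size_mm(forceps_landmarks_px, forceps_width_mm, polyp_diameter_px):
    """Estimate polyp size from a known-size reference tool in the frame.

    forceps_landmarks_px: two (x, y) landmark points detected on the forceps
    forceps_width_mm:     true physical distance between those landmarks
    polyp_diameter_px:    largest diameter of the delineated polyp mask, in pixels

    Simplifying assumption: forceps and polyp lie at the same depth, so one
    mm-per-pixel scale applies to both (illustrative only).
    """
    (x1, y1), (x2, y2) = forceps_landmarks_px
    forceps_px = math.hypot(x2 - x1, y2 - y1)   # landmark distance in pixels
    mm_per_px = forceps_width_mm / forceps_px   # scale inferred from the tool
    return polyp_diameter_px * mm_per_px

# Hypothetical frame: forceps landmarks 7 mm apart span 140 px; polyp spans 90 px
size = estimate_polyp_size_mm([(100, 200), (240, 200)], 7.0, 90)
print(round(size, 1))  # 4.5
```

Because the scale comes from an object of known size rather than the endoscopist's eye, this kind of calibration is what removes the systematic overestimation reported above.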

Automated assessment of quality in endoscopy
Although many of the abovementioned AI features seem promising for future application in daily endoscopy, the performance of even the best AI system will still depend largely on the technical skills of the endoscopist. Indeed, if a polyp is never presented to the AI system because of low-quality colonoscopy technique, it will not be detected. It is therefore conceivable that AI systems will need to incorporate quality-control mechanisms (withdrawal time, gastric inspection time, 3D reconstruction, automated photo documentation for reporting). First steps in this field were made by Wu et al., 97 who showed a significant reduction in blind spots during esophagogastroduodenoscopy (EGD) with the WISENSE system. Subsequently, this research group conducted a trial with ENDOANGEL, evaluating differences in blind-spot rate between sedated and unsedated patients and between standard and unsedated ultrathin endoscopy, finding the lowest blind-spot rate during conventional EGD in sedated patients. 98
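The bookkeeping behind a blind-spot rate is straightforward set arithmetic over the stations a system recognizes as adequately visualized. A minimal sketch; the station names below are illustrative stand-ins, not the actual site protocol used in the trials:

```python
def blind_spot_rate(required_stations, documented_stations):
    """Fraction of required anatomical stations with no adequate image
    captured during the procedure (illustrative logic only)."""
    required = set(required_stations)
    missed = required - set(documented_stations)
    return len(missed) / len(required)

# Hypothetical simplified station list for an EGD
STATIONS = ["antrum", "angulus", "lower body", "middle body",
            "upper body", "fundus", "cardia", "duodenal bulb"]

rate = blind_spot_rate(STATIONS, ["antrum", "angulus", "lower body",
                                  "middle body", "cardia", "duodenal bulb"])
print(f"{rate:.0%}")  # 25%
```

In a deployed system the "documented" set would be filled in live by a frame classifier recognizing each station, turning mucosal coverage into a measurable, reportable quality metric.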

DISCUSSION
IN THIS REVIEW, we have highlighted the potential benefit of AI for improving overall quality in daily endoscopy. AI is without doubt an attractive option for standardizing endoscopy, which is inherently imperfect due to human error. Even though the abovementioned studies report promising diagnostic performance, supporting evidence that actually proves a beneficial effect of CADx on quality in upper or lower GI endoscopy is currently very sparse. We need more robust data from future trials with (i) an optimal study design with a prospective and comparative (CADx vs no-CADx) set-up, in either a randomized controlled or single-arm setting, (ii) real-time (in vivo) use, (iii) robust endpoints such as ADR or miss rates rather than technical AUCs, and (iv) a multicenter non-expert setting to enable reproduction of the results. Additionally, questions about what to report, how to unify metrics and which datasets to use should be addressed by international societies to enable uniform reporting and a basis for comparison between systems. At this point it is nevertheless reasonable to accept, at least for polyp detection in the colon, that the current technical data suggest AI could serve as an adjunct to both detection and diagnosis of GI lesions. However, basic endoscopic skills (i.e. insertion, withdrawal, inspection, mucosal exposure) will remain, and will probably become even more, important for acceptable AI performance. Perhaps even more important than the detection "skills" of AI tools, therefore, is accompanying software that actually monitors the endoscopist's technical performance and mucosal visualization. Recently Wu et al. developed AI software to improve the quality of mucosal exposure in the stomach, as did Oh et al. for colonoscopy, and this is most likely the way forward. 97,99
The use of advanced imaging techniques for characterization, which often provide the ground truth for AI systems, also requires thorough training to capture a stable endoscopic image. Thus, combining the good (benchmark) diagnostic performance of current AI systems for upper and lower GI endoscopy with well-trained endoscopists may improve the quality of daily endoscopy. In this respect, AI itself can be a valid tool for future education and training of less experienced endoscopists and trainees, giving feedback on quality during the procedure. Finally, AI tools may also serve to generate an automated endoscopy report with relevant photo documentation, standardized terminology and inclusion of all performance measures, as defined for instance by the ESGE. 100
In conclusion, owing to recent breakthroughs in AI, interest in CAD is growing as a novel approach to improve the quality of upper and lower GI endoscopy. The advantage of CAD in GI endoscopy can be anticipated given the high performance of state-of-the-art DL technology. With larger and better-designed prospective trials, this novel technology will be implemented in clinical practice in the near future.

CONFLICT OF INTEREST
P B HAS RECEIVED financial support for research from