Deep transfer learning from ordinary to capsule esophagogastroduodenoscopy for image quality controlling

Quality control of capsule endoscopic images can be completed with the assistance of artificial intelligence, but the labeling process is time-consuming. Domain adaptation is a robust tool for cross-domain learning toward a consistent target. The current study aims to investigate the feasibility and effectiveness of domain adaptation from ordinary endoscopic images to capsule endoscopic images for quality control. A dynamic adversarial adaptation network (DAAN) was trained to identify low-quality images using ordinary endoscopic images with corresponding labels (source domain, with supervision) and capsule endoscopic images without labels (target domain, without supervision), so that image quality control could be transferred from ordinary to capsule endoscopy. In total, 62,850 images from capsule endoscopy and 17,434 images from ordinary endoscopy were included in developing the deep learning models. In internal cross-validation, DAAN achieved an average area under the receiver operating characteristic curve (AUROC) of 0.8638 (95% confidence interval [CI] 0.6753–1.0000) in filtering low-quality capsule endoscopic images, compared with CNN and ViT models (B/16 and L/32) that were trained only on labeled ordinary endoscopic images. A further 18,636 images from 355 patients who received capsule endoscopy were prospectively collected. The AUROC of DAAN reached 0.9471 (95% CI 0.9428–0.9511), which surpassed CNN (0.8570, 95% CI 0.8529–0.8608) and ViT (L/32: 0.8183, 95% CI 0.8143–0.8220; B/16: 0.7779, 95% CI 0.7960–0.8036). Domain adaptation can accomplish the image quality control task on capsule endoscopic images under the supervision of ordinary endoscopic images alone; because the ordinary-endoscopy dataset is much smaller, the annotation workload can be substantially alleviated.


INTRODUCTION
Esophagogastroduodenoscopy is the most common examination in the department of gastroenterology,1 and it can be used to diagnose many digestive tract disorders.2,3 It produces a large quantity of images during the examination; capsule endoscopes4 in particular, with their automatic capturing mode, may produce even more pictures than ordinary endoscopy. In the routine clinical work of gastroenterologists, large numbers of endoscopic images must be analyzed and corresponding reports provided. Furthermore, unqualified images (e.g., blurred or overexposed images) are unavoidably captured, especially in capsule endoscopy.
Image quality control,5 an important module (function) in automatic examination report systems, is a time-consuming fundamental task for gastroenterologists. The large number of endoscopic images can be used to build artificial intelligence tools that reduce physicians' repetitive mechanical labor. In recent years, deep learning (DL),6 represented by convolutional neural networks (CNNs),7,8 has performed promisingly in medical imaging.9,10 The contents of ordinary and capsule endoscopic images are consistent, and unqualified images occur in both. Therefore, we aim to train a deep learning model with the least information (annotation) to identify unqualified capsule endoscopic images.
In the present study, we applied domain adaptation (DA),11,12 a transfer learning method13 that uses inputs from different domains and supervised labels from one domain (ordinary endoscopy) to complete the same task in another domain (capsule endoscopy). DA maps the features of the two domains to a unified space and then narrows the difference between them to reach a common target. DA is fed the inputs and supervised labels of the source domain (ordinary endoscopic images) together with the inputs of the target domain (capsule endoscopic images). At the same time, we compared the performance of DA with a CNN (convolutional neural network) and ViT (vision transformer)14 under the same hyperparameter settings, but the CNN and ViT (supervised learning) were fed only the inputs and supervised labels of the source domain (ordinary endoscopic images). A graphic workflow of the experimental design is shown in Figure 1.
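The notion of "narrowing the difference between the two domains" can be made concrete with a simple discrepancy measure. The sketch below (our illustration, not part of the study's pipeline) computes a linear maximum mean discrepancy between source-domain and target-domain feature vectors; a domain-adaptation model learns features that drive such a discrepancy toward zero while still solving the source-domain task.

```python
def linear_mmd(source_feats, target_feats):
    """Squared distance between the mean feature vectors of two domains.

    A toy stand-in for the distribution-alignment objectives used in
    domain adaptation: zero when the two domains' mean features coincide.
    """
    dim = len(source_feats[0])
    mean_s = [sum(f[d] for f in source_feats) / len(source_feats) for d in range(dim)]
    mean_t = [sum(f[d] for f in target_feats) / len(target_feats) for d in range(dim)]
    return sum((a - b) ** 2 for a, b in zip(mean_s, mean_t))

# Toy 2-D features: both domains have mean [1.0, 2.0], so the discrepancy is zero.
src = [[0.0, 1.0], [2.0, 3.0]]
tgt = [[1.0, 2.0], [1.0, 2.0]]
print(linear_mmd(src, tgt))  # -> 0.0
```

DAAN goes beyond this mean-matching idea by training domain discriminators adversarially, but the underlying goal of aligning the two feature distributions is the same.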

Study design and ethics approval
This multicenter study was carried out at Friendship Hospital and Minhang Hospital in China. The Medical Ethics Committees of Friendship Hospital (2022-P1-equipment-018-01) and Minhang Hospital (2023-009-01X) approved the study protocol. Written informed consent was collected from all patients in both the retrospective and prospective datasets.

Dataset collection and preprocessing
The ordinary endoscopic images in the training and internal cross-validation dataset were retrospectively collected between February 2021 and February 2022. Five gastroenterologists (YQ Zhang, Y Ding, Z Qin, XH Zhang, and L Feng) obtained images using a commercially available endoscope (Olympus GIF-H290; Olympus Medical Systems, Co., Ltd., Tokyo, Japan) and the standard imaging protocol of Minhang Hospital. The capsule endoscopic images in the training and internal cross-validation dataset were retrospectively collected at Friendship Hospital between March 2022 and October 2022. A gastroenterologist (P Li) obtained images using a commercially available capsule endoscope (Pillbot C10000SI, Jiangsu CITRON Bio Technology Co. Ltd., Jiaxing, China) and the standard imaging protocol of Friendship Hospital. The capsule endoscopic images in the prospective validation dataset were prospectively collected from November 2022 to February 2023 at Friendship Hospital. The black border of all images was removed to eliminate the effect of unrelated content.

FIGURE 1 The framework for image quality controlling and the experiment settings.
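Border removal of the kind described above can be done by discarding rows and columns whose pixels are all near-black. The snippet below is a minimal sketch under our own assumptions (grayscale pixel grid, a fixed intensity threshold); the study does not specify its exact cropping procedure.

```python
def crop_black_border(img, threshold=10):
    """Crop away rows/columns whose pixels are all darker than `threshold`.

    `img` is a 2-D list of grayscale values (0-255). Returns the tight
    bounding box around the non-black content, or [] if the image is
    entirely black. Illustrative only.
    """
    rows = [i for i, row in enumerate(img) if max(row) > threshold]
    cols = [j for j in range(len(img[0])) if max(row[j] for row in img) > threshold]
    if not rows or not cols:
        return []
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]

# A 4x4 frame with a one-pixel black border around a 2x2 bright center.
frame = [
    [0,   0,   0,   0],
    [0, 200, 180,   0],
    [0, 190, 210,   0],
    [0,   0,   0,   0],
]
print(crop_black_border(frame))  # -> [[200, 180], [190, 210]]
```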

Image labeling
All endoscopic images were stored in JPEG or bitmap format in the imaging databases. First, the quality of all images was evaluated. Narrow-band images (this research focused only on white-light images), blurred images, overexposed images, and images covered by food or other matter were labeled as unqualified (for both ordinary and capsule endoscopic images); the remaining images were labeled as qualified. Four gastroenterologists (YQ Zhang, Y Ding, Z Qin, and XH Zhang) with a minimum of 5 years of experience in endoscope management labeled all images. Because image quality is subjective, each image was labeled independently (back-to-back) by two junior gastroenterologists; any image on which their labels differed was arbitrated by a senior gastroenterologist with 15 years of experience in gastroenterology (L Feng or P Li). In this way the labels for all images were determined.
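The adjudication protocol described above reduces to a simple rule. The sketch below is our illustration of that rule; the function name and label strings are hypothetical, not taken from the study's software.

```python
def adjudicate(label_a, label_b, senior_label):
    """Resolve a final label from two independent (back-to-back) junior
    annotations: agreement is accepted as-is, disagreement is settled by
    the senior reviewer's label.
    """
    return label_a if label_a == label_b else senior_label

# The juniors disagree, so the senior reviewer's verdict decides.
print(adjudicate("qualified", "unqualified", "unqualified"))  # -> unqualified
```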

Development of deep learning models
The patients in the retrospective dataset were randomly split into training and internal cross-validation datasets under four-fold cross-validation15 for developing and validating the models, respectively. The cross-validation split was subject-independent; that is, the images of any one patient were never divided between the training and validation datasets.16 We compared six types of DA algorithms, and the dynamic adversarial adaptation network (DAAN),17 whose performance was the best, was adopted to complete the image quality control task using a ResNet101 backbone (the architecture of DAAN is shown in Supplementary Figure 1). The performance of a ResNet-101 CNN18 and two types of ViT (B/16 and L/32) was also compared, but the training of the CNN and ViT did not involve the capsule endoscopic images; that is, their inputs contained only ordinary endoscopic images (Figure 1). DAAN adopts three loss functions to guarantee the transfer from the source domain to the target domain: (1) a classification loss, the same as in conventional classification problems; (2) a global domain discrimination loss, which judges the domain of each input image; and (3) a sub-domain discrimination loss, which compares the domain of each input image within each class. DAAN uses only the input images and supervised labels of the source domain and the input images of the target domain to achieve its goals in the target domain.
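The interplay of the three losses can be sketched numerically. The following is a simplified illustration of DAAN's dynamic weighting idea, under our own assumptions: the function names are hypothetical, the A-distance is approximated from discriminator error rates, and in real training the adversarial sign of the domain terms is handled by a gradient-reversal layer; the published implementation differs in detail.

```python
def a_distance(err):
    """Proxy A-distance from a domain discriminator's error rate:
    a perfect discriminator (err=0) gives 2, a random one (err=0.5) gives 0."""
    return 2.0 * (1.0 - 2.0 * err)

def dynamic_factor(global_err, local_errs):
    """Dynamic adversarial factor: weighs global (marginal) alignment
    against per-class (conditional) alignment according to how separable
    the two domains remain at each level."""
    d_g = a_distance(global_err)
    d_l = sum(a_distance(e) for e in local_errs) / len(local_errs)
    return d_l / (d_g + d_l)

def daan_objective(cls_loss, global_loss, local_loss, omega, lam=1.0):
    """Combine the three losses described in the text into one objective;
    the domain terms are made adversarial via gradient reversal in practice."""
    return cls_loss + lam * ((1.0 - omega) * global_loss + omega * local_loss)

# Equal separability at both levels -> the two alignment terms share weight.
print(dynamic_factor(global_err=0.25, local_errs=[0.25]))  # -> 0.5
```

When the per-class discriminators separate the domains more easily than the global one, the factor grows toward 1 and conditional alignment dominates the objective.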
The input of the model is an endoscopic image. The output of the model is a binary decision on whether the quality of the input image meets the requirements for later analysis. Because the training dataset is imbalanced, we adopted a class-weight policy during training. Given the inputs of the softmax (last) layer corresponding to the original training dataset, $\{(x_1, y_1), (x_2, y_2), \ldots\}$, $x_i \in \mathbb{R}^{n}$, $y_i \in \{0, 1\}$, the loss is shown in Equation (1):

$$\mathrm{loss} = -\frac{1}{m}\sum_{i=1}^{m} \mathrm{class\_weight}_{y_i}\, \log\frac{e^{w_{y_i}^{\top} x_i}}{\sum_{j=1}^{k} e^{w_j^{\top} x_i}} + \lambda \lVert w \rVert^{2} \qquad (1)$$

where m, w, n, and k denote the number of samples in a mini-batch, the network weights to be trained, the number of neurons in the layer before the softmax layer, and the number of classes, respectively; class_weight_{y_i} is the weight of sample i with label y_i, and the penalization term λ‖w‖² helps avoid overfitting. In this manner, the majority and minority classes are given small and large class weights, respectively, to offset the imbalance. If the class-weight policy is not adopted, the class weight equals a vector in which all entries are 1.19 The optimal model from internal cross-validation was then tested on the prospective validation dataset. We studied the effect of data augmentation20,21 on classification performance, and the results showed that augmentation could not improve performance significantly.22 Moreover, we wished to compare the effectiveness of DAAN, CNN, and ViT without interference; therefore, we did not adopt data augmentation. All models were developed with PyTorch 1.8 on a server with four NVIDIA RTX A4000 GPUs (graphics processing units). All images were resized to 512 × 512 before being fed into the models for training or testing. The optimization algorithm was stochastic gradient descent (SGD)23 with a default learning rate of 0.01 and a batch size of 32, and class weighting was used to offset the imbalanced distribution of the two classes. Based on repeated experiments,24 different numbers of epochs were applied to train the models without underfitting. All hyperparameters were the same (epochs: 5, class weight: [1, 5]) for DAAN, CNN, and ViT.
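The class-weighting scheme of Equation (1) can be reproduced in a few lines. The sketch below is our plain-Python re-implementation (without the regularization term) for illustration; the study itself used PyTorch's built-in facilities.

```python
import math

def weighted_softmax_loss(logits, labels, class_weight=(1.0, 5.0)):
    """Class-weighted softmax cross-entropy over a mini-batch, mirroring
    the weighting in Equation (1). `class_weight` defaults to the paper's
    [1, 5] setting; the lambda*||w||^2 penalty is omitted here.
    """
    total = 0.0
    for z, y in zip(logits, labels):
        denom = sum(math.exp(v) for v in z)
        # Negative log-probability of the true class, scaled by its weight.
        total += -class_weight[y] * math.log(math.exp(z[y]) / denom)
    return total / len(logits)

# A confident mistake on the minority class (label 1, weight 5) is
# penalized five times harder than the mirror-image mistake on class 0.
print(weighted_softmax_loss([[2.0, 0.0]], [1]))
print(weighted_softmax_loss([[0.0, 2.0]], [0]))
```

This five-fold penalty on minority-class errors is what discourages the model from simply predicting the majority (qualified) class everywhere.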

Statistical analysis
All statistical analyses were performed using Python 3.7.3 (Wilmington, DE) and MATLAB R2016a (https://www.mathworks.com/). We used accuracy, sensitivity, specificity, the receiver operating characteristic (ROC) curve, and the precision-recall (PR) curve25 to assess the performance of the DL models. The area under the ROC curve (AUROC) with a 95% confidence interval (CI) was calculated. In the internal cross-validation dataset, point estimation was used to obtain the confidence interval,26 whereas in the prospective validation dataset, Wilson confidence intervals (95% CI) were used.

Characteristics of datasets
In total, 98,920 images (81,486 capsule and 17,434 ordinary endoscopic images) from 1233 patients were used to develop and prospectively validate DAAN, CNN, and ViT. The baseline demographic information for the ordinary and capsule endoscopic images collected from Minhang Hospital and Friendship Hospital is shown in Table 1. Sample ordinary and capsule endoscopic images are shown in Figure 2.

Performance of deep learning in internal cross validation
The classic CNN architecture (ResNet101) and ViT (B/16 and L/32) were also used to train models. Their performance in internal cross-validation is shown in Figure 3, which illustrates that ResNet101 was more robust than ViT and DA, although the differences were small: the difference in average AUROC between ResNet101 and DA was 0.056, between B/16 and DA 0.0501, and between L/32 and DA 0.0328. More metrics are shown in Table 2: the accuracy of ResNet101 was the highest, and L/32 ViT was more effective than B/16 ViT. The gap between sensitivity and specificity (mean values: DA 0.0556; CNN 0.3861; L/32 0.2027; B/16 0.2728) showed that DA handled the imbalanced classification problem better than the other three methods. In addition, the accuracy of DA was lower than that of CNN and ViT.

Performance of deep learning in prospective validation
The performance of DA, ResNet101, and ViT is shown in Figure 4, which shows that DA was the most robust method on the prospective validation dataset: the difference in AUROC between ResNet101 and DA was 0.0901, between B/16 and DA 0.1472, and between L/32 and DA 0.1288. Meanwhile, L/32 performed no better than B/16 in prospective validation.

TABLE 1 The characteristics of all datasets in the current study.


FIGURE 2 The sample images in the datasets.

FIGURE 3 The ROC and PR curves of DAAN, ResNet101, and ViT in internal cross-validation.

FIGURE 4 The ROC and PR curves of DAAN, ResNet101, and ViT in prospective validation.

More metrics are shown in Table 3, which shows that the accuracy, sensitivity, and specificity of DA were better than those of CNN, B/16, and L/32. Both the mean sensitivity and mean specificity exceeded 0.7 for DA, whereas CNN reached only 0.57 in both. The specificity of L/32 was 0.6122, and B/16 performed better than L/32. Although the gap between sensitivity and specificity was larger for DA than for B/16 and L/32 (mean values: DA 0.2383; CNN 0.3079; L/32 0.2060; B/16 0.0971), DA achieved the best overall performance in image quality control.

DISCUSSION
The content of ordinary and capsule endoscopic images is consistent, but repeated annotation places a heavy burden on medical staff and physicians. Domain adaptation can transfer a specific task from a domain with annotation to another domain without annotation. Compared with the DL models trained only with the supervised information of ordinary endoscopic images, the DAAN model completed quality control for capsule endoscopic images with better performance. Moreover, the quantity of ordinary endoscopic images is smaller than that of capsule endoscopic images, so the annotation burden can be largely alleviated. The development and cross-validation datasets were collected earlier and the prospective validation dataset later, and the intestinal environment differs among patients receiving capsule endoscopy (e.g., food residue and secretions). In addition, the cross-validation was subject-independent. For these reasons, the result of DA in cross-validation was slightly worse than those of CNN and ViT. However, the generalization ability of domain adaptation proved more robust: DA can fully exploit the similarity between the source domain (ordinary endoscopy) and the target domain (capsule endoscopy) to achieve better generalization, yielding better performance on the prospective validation dataset.
Sharib Ali et al.27 proposed a framework to assess the quality of endoscopic images by detecting and segmenting artifacts (including blur, specularity, saturation [overexposure], bubbles, contrast [underexposure], and miscellaneous artifacts [e.g., chromatic aberration, debris]). They compared four object detection algorithms (YOLOv3, RetinaNet, Faster-RCNN, and YOLOv3-spp), of which YOLOv3-spp achieved the best overall performance across all artifact types, and four semantic segmentation algorithms (DeepLabv3+, ResNet-UNet, PSPNet, and FCN8), of which DeepLabv3+ performed best overall. Finally, they designed a quality score that correlated at more than 0.6 with three experts' evaluations of each frame. Qi He et al.28 collected 3704 endoscopic images from 211 patients to train ResNet-50, Inception-v3, VGG-11-bn, VGG-16-bn, and DenseNet-121 models to distinguish qualified from unqualified images; VGG-11-bn achieved the highest accuracies of 67.6% and 80.9% for unqualified and qualified images, respectively. Lianlian Wu et al.29 used 12,220 in vitro, 25,222 in vivo, and 16,760 unqualified endoscopic images to train a deep convolutional neural network to discern whether the scope is outside the body during esophagogastroduodenoscopy; the confusion matrix showed an accuracy of over 70% for each class.
The present research has some limitations. We included only one model each of ordinary endoscope and capsule endoscope, and differences between brands were not carefully considered. Moreover, we did not categorize all types of unqualified conditions, and the distribution of images may not match the real clinical scenario.
Our image quality control module can be integrated into the endoscopist's workflow, whether in real time or in post-hoc analysis of videos and images. Because a large number of images are captured during the examination, sifting out unqualified images saves time. Furthermore, data from a new source (healthcare center) can be used to train a new model for that center, using the model from the current research as a pretrained model.

CONCLUSIONS
Domain adaptation (a transfer learning method) can accurately sift out unqualified images using the supervised information of ordinary endoscopic images and smoothly complete the same task on capsule endoscopic images. This alleviates not only the daily workload of physicians but also the annotation burden in medical artificial intelligence.

AUTHOR CONTRIBUTIONS
TABLE 2 The performance of DA, CNN, and ViT in internal cross-validation (mean ± standard deviation [95% confidence interval]).
TABLE 3 The performance of DA, CNN, and ViT in prospective validation.