Detection of laryngeal carcinoma during endoscopy using artificial intelligence

The objective of this study was to assess the performance and application of a self‐developed deep learning (DL) algorithm for the real‐time localization and classification of both vocal cord carcinoma and benign vocal cord lesions.


| INTRODUCTION
Detection of laryngeal lesions has been of scientific interest for well over a century, with the first reports dating back to the nineteenth century. 1,2 Since the introduction of fiberoptic transnasal flexible laryngoscopes in the late 1960s and digital transnasal flexible endoscopes at the beginning of the 21st century, image quality, and thus diagnostic accuracy, has increased tremendously. [3][4][5] Since then, digital transnasal flexible endoscopes have shown their potential and value in the current state of health care in laryngology and head and neck oncology. 6,7 Digital laryngoscopy has been shown to be superior to fiberoptic laryngoscopy in detecting laryngeal lesions. 8 Furthermore, since laryngeal images can be stored and coupled to the patient's electronic file, these data can be used, among other purposes, for patient education, for evaluating a lesion over time, for discussing pathology in multidisciplinary board meetings, and for training medical students, interns, residents, and medical specialists, and, more recently, also computer algorithms based on artificial intelligence (AI).
The overall term for these algorithms is AI, which comprises software that can reason, react, adapt, and learn. Machine learning (ML) is a sub-field of AI focused on building algorithms and statistical models that "learn" to perform a set of tasks by leveraging information found in data. 9 The advantage of ML algorithms over traditional AI algorithms is their ability to be trained on tasks without needing explicit logic or instructions. An ML algorithm is trained on a dataset that includes a set of examples along with their corresponding labels. In deep learning (DL), a sub-division of ML, an artificial neural network (ANN) is an adaptive algorithm inspired by the functioning processes and neural pathways of the human brain. 10 Despite this definition, the core principles of modern ANNs are rooted more in mathematics, optimization, and statistics than in biology. ANNs comprise a large number of "neurons" or "nodes" arranged in interconnected layers. An input is fed into the neural network and propagates through it, activating some nodes and not others. Through this and other mathematical mechanisms, DL algorithms are capable of modeling complex, nonlinear relationships and can therefore perform tasks of high complexity. 11 An important class of DL algorithm is the Convolutional Neural Network (CNN), which is well suited to image classification due to its ability to extract a hierarchy of important features (such as color, edges, curves, and corners) from the input images to detect patterns. Extensions of the CNN algorithm allow for the creation of object detection (OD) models, 12 neural networks that can localize and classify one or more objects within an image.
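As a minimal illustration of the feature-extraction idea described above (a didactic sketch, not the study's network), the following pure-Python snippet convolves a tiny image with a vertical-edge kernel, producing a feature map that responds only where the intensity changes:

```python
# Illustrative sketch: one convolution pass, the basic building block a
# CNN layer uses to extract a feature map (here, a vertical-edge detector).
def conv2d(image, kernel):
    """Valid-mode 2D convolution (cross-correlation) in pure Python."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh) for dj in range(kw)
            )
    return out

# A 4x4 toy "image" with a vertical intensity edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# A simple vertical-edge kernel: responds where intensity changes left to right.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
# The response is largest at the edge location and zero on flat regions.
```

A real CNN learns many such kernels from data and stacks dozens of these layers, but each layer performs exactly this kind of local weighted sum.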
Such a system could be used when investigating laryngeal pathology during laryngoscopy, acting as a visual aid that increases the clinician's diagnostic certainty, especially when lower-resolution endoscope systems are used or a less experienced clinician performs the laryngoscopy.
A standardized computer-based method for vocal cord lesion detection was first proposed and investigated almost 20 years ago. [13][14][15] Only in recent years has there been a steady rise in publications investigating the application of AI to laryngoscopic images. Although we did not perform a systematic review of the literature, a thorough literature search revealed over 15 publications in the last 6 years. [16][17][18][19][20][21][22][23][24][25][26][27][28][29][30][31][32] A rough summary of these articles shows that all but one research group investigated still images (i.e., photos) of laryngeal lesions in a post-processing setting. Some studied white light images and others used chromoendoscopy [i.e., narrow band imaging (NBI)] images. There was great variety in dataset size: some studies used 33 unique images (i.e., each image from a different patient), while others had almost 25 000 images from over 9000 patients available. 19,25 A technique frequently used in studies with a smaller dataset, called data augmentation, was to copy and slightly adjust laryngoscopic images from a single patient to enlarge the dataset. [19][20][21][23][24][25]27,29,31,32 Several open-source algorithms were investigated, with overall high accuracy in detecting laryngeal lesions and sensitivities varying from 73% to 100%. Some studies also compared the predictions of the computer algorithm to those of the medical specialist (in training) and found comparable or even better results in favor of the algorithm. [23][24][25]28 To the best of our knowledge, there is only one recent study that investigated real-time detection of laryngeal carcinoma during live laryngoscopy. 32 In this article, we investigate the feasibility of developing a DL system for the real-time detection of benign and malignant vocal cord lesions in moving laryngoscopic images.
The goal was to create an algorithm that assists the medical specialist (in training) during a live laryngoscopy in their clinical decision making when encountering a patient with vocal cord leukoplakia. In the current study, we report on our developed and validated DL system for detecting benign and malignant laryngeal lesions on videos and photos from video laryngoscopies. The algorithm's inference speed and performance on video inference were also assessed to ensure its usefulness in a clinical, real-time (i.e., live) environment.

| MATERIALS AND METHODS
This single-center retrospective study was conducted in accordance with the guidelines established in the Declaration of Helsinki, and was approved by the local medical ethical committee of our institution (file number 2019-5808). For the development of the DL system, we collaborated with WSK Medical, a company specialized in creating DL algorithms for medical use.

| Dataset
To properly train a DL algorithm, a large dataset had to be collected and labeled. In this study, the dataset was acquired from two different sources. The first set of image stills was extracted from laryngoscopy videos taken by the Department of Otorhinolaryngology and Head and Neck Surgery of the Radboud university medical center (Radboudumc) in Nijmegen, The Netherlands. In 2012, digital transnasal flexible laryngoscopes were introduced in our department, and since then hundreds of white light videos of unique patients with vocal cord lesions have been captured and stored in our digital database. The majority of these images (videos) were from several digital transnasal flexible endoscopes (VNL-1570STK, VNL9-CP, VNL-11-J10, and VNL15-J10 laryngoscopes, PENTAX Nederland B.V., Dodewaard, The Netherlands). The minority of these images (photos) were from patients treated with microlaryngoscopic surgery between 2009 and 2012. Each image from every unique patient was evaluated for the availability of histopathology; patients without available histopathology were excluded from the dataset. Furthermore, at least one (still) image of the vocal cord lesion had to be available. Image quality (e.g., saliva or blurred vision) was not an exclusion criterion, since our goal was to create an algorithm able to function properly during a live endoscopy. During an average endoscopy in the outpatient clinic, vision quality can sometimes be (temporarily) decreased, so the algorithm should be able to function regardless of this inconvenience. Tumor/lesion size was not a feature collected during data extraction. Between 2009 and 2021, a total of 1448 images from videos of 649 patients were extracted. These included 447 unique patients with vocal cord carcinoma, 37 patients with vocal cord papilloma, and 65 patients with vocal cord cysts or polyps. Furthermore, 100 patients with normal vocal cords were added to the database.
For each video, several images were extracted to increase the variety of the collected data (varying viewing angles of the lesion, differences in lighting, and difference in image quality). The number of images extracted from each video differed, based on image quality.
The second set of images was extracted from the "Laryngoscope8" repository (available at https://github.com/greenyin/Laryngoscope8). 29 This is a dataset of high-resolution laryngoscopic photos taken at several otorhinolaryngology departments in hospitals across China. These photos were labeled based on their histopathology by Yin et al., although the histopathology could not be checked by the authors. Thus, the labels and images could only be visually checked by the first author to ensure no egregious errors were made during labeling. Furthermore, the dataset only comes with image-level classification, and had to be manually labeled with localization information to be compatible with our database. Nevertheless, the sheer number of images in the dataset provided a good basis for training the algorithm, since our database alone would have resulted in lower accuracy. The "Laryngoscope8" dataset comprises 3059 images taken from 1950 unique individual patients. Each image was labeled with one of seven probable pathological labels (i.e., Reinke's edema, glottic carcinoma, granuloma, leukoplakia, cysts, nodules, and polyps) or a "normal" label designating patients with healthy vocal cords. The two datasets were merged into one dataset, henceforth named the "Zeno AI" dataset.
To prepare the "Zeno AI" dataset to be machine-readable, data preprocessing steps were performed. The pathological label "Reinke's edema" was removed from the dataset due to the limited number of available stills. Images of width and height less than 448 pixels were also removed due to their low resolution. This resulted in a total of 4488 images in the "Zeno AI" database. For each image within the dataset, a ground-truth bounding box around the laryngeal lesion was created, along with a corresponding classification. These bounding boxes were stored in the You Only Look Once (YOLO) format. 33 In Figure 1, the number of images per class in the Zeno AI dataset is visualized. The three classes considered for classification are:
• Carcinoma: encompasses severe dysplasia, carcinoma in situ, and squamous cell carcinoma of the vocal cord. This subclassification was only possible for our own database, since histopathological data were available.
• Anomaly: encompasses the benign vocal cord lesions.
• Normal: designates healthy vocal cords.
To ensure no data leakage, which can cause overconfidence in the model's ability, images taken from the same video, and thus the same patient, were not shared across the splits.
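The YOLO annotation format mentioned above stores each ground-truth box as normalized center coordinates and dimensions. The following sketch shows the conversion from a pixel-space bounding box; the box values used are hypothetical, not taken from the dataset:

```python
def to_yolo(box, img_w, img_h):
    """Convert a pixel-space box (x_min, y_min, x_max, y_max) to the
    YOLO format: (x_center, y_center, width, height), each normalized
    to [0, 1] by the image dimensions."""
    x_min, y_min, x_max, y_max = box
    return (
        (x_min + x_max) / 2 / img_w,
        (y_min + y_max) / 2 / img_h,
        (x_max - x_min) / img_w,
        (y_max - y_min) / img_h,
    )

# A hypothetical lesion annotation on a 448x448 still; in the YOLO label
# file this tuple would be preceded by an integer class index.
print(to_yolo((100, 150, 300, 350), 448, 448))
```

Normalizing by image size keeps annotations valid when images of different resolutions, such as the two merged sources here, are combined in one dataset.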

| Model developing and training
The You Only Look Once (YOLO) algorithm is a DL model capable of performing object detection by outputting one or more bounding box coordinates and their corresponding classifications given an input image. 33 This is visualized in Figure 2, derived from Azam et al. 32 It achieves this through an initial "backbone" CNN (in this case the Darknet algorithm) that extracts a hierarchy of feature maps at different granularities. These feature maps are then fed into a series of layers that mix and combine image features before sending them forward to the "prediction head". The post-processing algorithm Non-Max Suppression (NMS) is used to remove low-confidence predictions and predictions with large overlap. It is important to note that in DL there is a trade-off between model accuracy and model simplicity. In OD, simpler networks that can achieve real-time inference usually sacrifice a significant amount of accuracy to do so; conversely, the best models in terms of accuracy are cumbersome and highly complex, making them unusable in a real-time environment, even with various optimization methods. YOLOv5 is one of the first frameworks that reduces this trade-off by introducing small, efficient algorithms that compete with the accuracy of previous state-of-the-art models. The YOLOv5 algorithm was implemented and trained using the Ultralytics open-source GitHub repository, built on the PyTorch DL framework. 34 Three sizes of the neural network were considered for the task: YOLOs (7.2 million parameters), YOLOm (21.2 million parameters), and YOLOl (46.5 million parameters). The more parameters a network has, the more complex the tasks it can perform, with the downside of being slower and thus less suitable for a real-time detection mode. The pre-trained weights of YOLO on the COCO dataset were used as an initial checkpoint, which reduced training time and improved accuracy. 35 The algorithm was trained using an Nvidia RTX 3060 card with 12GB VRAM.
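The NMS post-processing step described above can be sketched as follows. This is an illustrative pure-Python version of the standard greedy procedure, not the Ultralytics implementation, and the boxes and scores are hypothetical:

```python
# Sketch of Non-Max Suppression: discard low-confidence predictions, then
# greedily keep the highest-scoring box and drop remaining boxes that
# overlap a kept box too strongly.
def iou(a, b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, scores, conf_thresh=0.3, iou_thresh=0.5):
    """Return the indices of boxes kept after confidence filtering and NMS."""
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thresh),
        key=lambda i: scores[i], reverse=True,
    )
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return keep

# Two near-duplicate detections of one lesion plus a low-confidence box:
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (200, 200, 240, 240)]
scores = [0.9, 0.8, 0.2]
print(nms(boxes, scores))  # only the 0.9 box survives
```

Here the 0.8 box is suppressed because it heavily overlaps the 0.9 box, and the 0.2 box falls below the confidence threshold, leaving a single prediction per lesion.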
The hyper-parameters used were the following:
To evaluate the performance of the algorithm, standard OD metrics were selected. The definition of a True Positive (TP) in OD depends on the Intersection-over-Union (IoU), a measure of how much overlap there is between a ground-truth bounding box and a predicted bounding box (i.e., the area of their intersection divided by the area of their union). 36 The definition of IoU is displayed in Figure 3. Visual inspection of the images showed that the bounding boxes for the lesions (particularly for vocal cord carcinoma) in the dataset are hard to define due to the lack of hard borders (i.e., irregularity) of the lesions. From a visual perspective, carcinomas (particularly those not in an advanced stage) can appear diffuse, can come in different sizes, and the area of concern can be hard to pinpoint. In this study, completely missing a lesion is deemed a worse outcome than a "loose" localization of the lesion; if the algorithm mislabels a lesion (i.e., labeling a carcinoma an anomaly or vice versa), this is deemed more acceptable than a false negative. An example of an image from the Zeno AI dataset is shown in Figure 4.
Using the definition of IoU, we can construct the following definitions of True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN):
• TP: the IoU is ≥0.5 and the predicted classification is the same as the ground-truth classification.
• FP: the IoU is <0.5, or the IoU is ≥0.5 but the predicted classification does not match the ground-truth classification.
• TN: while True Negatives are an important aspect of validation, the algorithm is evaluated at the object level (not at the image level). The model is constantly, and correctly, not detecting objects within an image, making true negatives impossible to quantify. Accuracy metrics that rely on a TN count, such as specificity and AUC, are therefore not used as validation metrics in this article.
• FN: no predicted bounding box is given for a ground-truth box. This metric is of particular importance to this article, since a high FN percentage means that lesions will be missed by the algorithm.
The value of the IoU threshold can be changed to be more or less strict about how well the model must localize vocal cord lesions. Other metrics specific to OD were used in measuring the accuracy of the model, most notably precision, defined as TP/(TP + FP), and recall (i.e., sensitivity), defined as TP/(TP + FN). When outputting predictions, the algorithm also outputs a confidence score between 0 and 1 for each prediction, representing the likelihood that the prediction is correct.
To avoid too large a number of false positives, a confidence threshold must also be chosen to ensure predictions of very low confidence are ignored. In this case, the confidence threshold of 0.3 was chosen, as it provided the best balance between high precision and high recall.
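The scoring rules above can be put together in a short sketch. This is an illustrative implementation of the stated definitions (IoU ≥0.5, matching class label, confidence threshold 0.3), with hypothetical boxes and labels:

```python
def iou(a, b):
    """Intersection-over-Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def evaluate(preds, truths, conf_thresh=0.3, iou_thresh=0.5):
    """preds: list of (box, label, confidence); truths: list of (box, label).
    Returns (TP, FP, FN); TNs are not quantifiable at the object level."""
    preds = [p for p in preds if p[2] >= conf_thresh]
    tp = fp = 0
    matched = set()
    for box, label, _ in preds:
        hit = next(
            (i for i, (tbox, tlabel) in enumerate(truths)
             if i not in matched and iou(box, tbox) >= iou_thresh
             and label == tlabel),
            None,
        )
        if hit is None:
            fp += 1          # poor localization or wrong class
        else:
            matched.add(hit)
            tp += 1
    fn = len(truths) - len(matched)  # ground-truth lesions left unmatched
    return tp, fp, fn

truths = [((10, 10, 50, 50), "carcinoma")]
preds = [
    ((12, 12, 52, 52), "carcinoma", 0.85),    # good localization, right class
    ((200, 200, 240, 240), "anomaly", 0.60),  # no matching ground truth
    ((11, 11, 49, 49), "anomaly", 0.10),      # below confidence threshold
]
tp, fp, fn = evaluate(preds, truths)
precision = tp / (tp + fp)  # 0.5
recall = tp / (tp + fn)     # 1.0
```

Raising the confidence threshold trades recall for precision, which is why the 0.3 operating point was chosen as the balance between the two.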

| RESULTS
In Table 1, the performance of the models obtained from training is displayed on the validation set for each YOLO model. From the results displayed in Table 1, the ensemble method that combined the YOLOs and YOLOm algorithms provided higher accuracy on most metrics; most notably, it had the lowest percentage of false negatives, making it less prone to missing carcinomas or anomalies within an image. The algorithm correctly localizes and classifies 71% of vocal cord carcinomas and 82% of benign vocal cord lesions. Table 2 summarizes the performance of the models on the test set. Again, the ensemble method provides good results, with a correct localization and classification rate of 78% for carcinoma detection and 70% for benign lesion detection. Figures 5-7 show examples of YOLO inference on images taken from the test set. The left column represents the ground truth (GT) image; the right column represents the predictions of the algorithm, with the confidence score displayed for each bounding box. The predictions were created with a confidence threshold of 0.30. A compilation video of the algorithm can be viewed on the YouTube channel of the senior author (GB).
To test the inference speed of the model, a set of test videos taken from the Radboudumc database were kept aside to qualitatively assess both the accuracy and the efficiency of the algorithm. In Table 3, the inference speed of each algorithm is summarized.
All models, including the ensemble method, vastly surpass the general frames per second (fps) rate of most modern laryngoscopes (25-30 fps), meaning that they are well suited for real-time detection. However, these numbers were only achievable on computers with modern hardware, namely a high-end GDDR6 GPU. Further tests were carried out on lower-end hardware to test the inference speed in a more restrictive environment. On a laptop with a 10th generation i7 CPU, 16GB RAM, and a GDDR5 Nvidia GTX 1050 GPU, only the YOLOs algorithm was suitable for real-time inference. These measurements do not include the overhead of other operations (e.g., retrieving the video output of the laryngoscope or displaying the predictions); this, presumably, will result in a slight decrease in fps rate. Furthermore, the algorithm has only been validated on an image set: to ensure its performance on video, a separate validation strategy would need to be conducted on a set of unseen videos.
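Inference speed of the kind reported above can be benchmarked with a simple timing loop. The sketch below uses a trivial stand-in for the model call; in practice `infer` would be the detector running on a frame:

```python
import time

def measure_fps(infer, frames):
    """Time `infer` over a batch of frames and return frames per second."""
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed

# Placeholder "frames" and "model": the model here just sums pixel values.
fake_frames = [list(range(1000)) for _ in range(100)]
fps = measure_fps(sum, fake_frames)
# Real-time use requires the measured fps to stay above the 25-30 fps
# that the endoscope delivers, including any display/capture overhead.
```

Timing whole frames end-to-end, rather than only the forward pass, is what makes the measurement relevant to the live setting described here.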

| DISCUSSION
In this study, we have demonstrated the first results of our self-developed DL system for the detection of laryngeal lesions during real-time endoscopy. Our goal was to create an algorithm that assists the otorhinolaryngologist or head and neck surgeon (in training) in clinical decision making when encountering a patient with vocal cord leukoplakia. Usually, the clinician has two choices: to follow a conservative or an invasive diagnostic trajectory. In the former, the patient's case is deferred to follow-up. The clinician estimates that the vocal cord lesion is suitable for a delay and will probably submit the patient to another endoscopy in the following weeks or months. This policy has the advantage that the patient does not have to undergo an invasive diagnostic procedure. The disadvantage is that if during follow-up the leukoplakia is (or becomes) malignant, there is a delay in time to diagnosis and time to treatment, with potentially significant consequences. 37 In the latter, the patient has to undergo an invasive diagnostic procedure. In most otorhinolaryngology clinics in the world, this means taking a biopsy under general anesthesia. This procedure has considerable consequences for the patient (e.g., general anesthesia, adverse events, or day admission to the ward with a subsequent absence from work), the clinician (e.g., increasing waiting times for the operating room and ward), and the health care system (e.g., increasing costs due to surgery, general anesthesia, and admission to the ward). Although an alternative, less invasive diagnostic method is available, namely flexible endoscopic biopsy, we find that this procedure is only slowly being adopted by laryngologists and head and neck surgeons. [38][39][40] Since these decisions are usually made within the several minutes that an endoscopy lasts, our goal was to create a real-time algorithm that functions live during endoscopy.
Our initial results show that our DL algorithm is capable of correctly localizing and classifying vocal cord carcinoma on still images in 71% and 78% of the cases in the validation and test set, respectively. For benign vocal cord lesions, this was 82% and 70% in the validation and test set, respectively. As one will notice, there is a difference in TP and sensitivity outcomes between the validation and test set for both entities. The difference could be explained by the normal variation in the quality of the laryngoscopies performed over the years. While annotating our dataset, we experienced the variety in image quality (e.g., light intensity, presence of saliva) and proper visualization (e.g., moving larynx, differences in distance from the tip of the endoscope to the vocal cord lesion). Furthermore, new digital transnasal flexible endoscopes emerged over the years with improvements of the chip and thus image resolution. The images used in the validation and test datasets were randomly chosen and equally distributed, so we believe this is not a contributing factor to the difference in accuracy. When using the strongest performing model (i.e., the ensemble YOLOs/YOLOm), the fps rate of 63 was still well above the minimum required 25-30 fps for digital transnasal flexible endoscopy.
To the best of our knowledge, only one recent article was the first to publish on real-time detection of vocal cord carcinoma during live laryngoscopy. 32 These authors created a similar DL system, also based on YOLO software, using white light and NBI laryngoscopic videos of 219 patients with vocal cord carcinoma. They reported a sensitivity of 62% for detecting vocal cord carcinoma at an fps rate between 25 and 30. Although their dataset is limited in size and covers only vocal cord carcinoma, they are the pioneers in the field of real-time detection of vocal cord carcinoma. The first results of our DL algorithm are in accordance with those of Azam et al., proving feasible in real-time use with an even higher sensitivity for detecting vocal cord carcinoma. Furthermore, we also included patients with benign vocal cord lesions, to increase the usability of the DL algorithm in the outpatient clinic in the foreseeable future.
A limitation of our study was the relatively small dataset. Although our department was among the first in The Netherlands to implement digital transnasal flexible laryngoscopes, and our head and neck oncology division has one of the highest numbers of new patient intakes per year, until now our database only consists of 447 unique patients with early-stage vocal cord carcinoma and far fewer with benign vocal cord lesions. By extracting several images per patient, we increased our dataset to a total of 1448 images. These numbers are in vast contrast with some exceptional databases from recent studies. 25,26,[28][29][30] We therefore used the dataset collected by Yin et al. to increase the amount of data. 29 This resulted in a significant increase in the size of our dataset, to a total of 4488 images, which improved the results of our DL algorithm. Limitations of using the "Laryngoscope8" dataset are the lack of histopathological control and the absence of laryngoscopic videos. While the authors state that the stills were labeled by professional otorhinolaryngologists, there is no assurance that this was done with the histopathological outcomes of the patients. To counter this, the dataset was manually checked by the first and second authors of this article. The image-level labels determined the separation of the dataset into Carcinoma, Anomaly, and Normal. The localization of the lesion was a trivial addition that simply required a visual check of each image. We believe that these steps compensate for the lack of access to the histopathological outcomes. Lastly, although our collaboration with WSK Medical was essential, since it resulted in the creation of our DL algorithm, it could be interpreted as a conflict of interest. While we acknowledge their business motivation in the development of this algorithm, we strived to work as independently as deemed possible.
Currently, our DL algorithm will be further developed and investigated to evaluate its functioning in daily practice. More lesion features will have to be explored; for example, the effect of lesion/tumor size on the algorithm's ability to predict will have to be monitored, as well as whether the algorithm performs poorly on subclassifications. Although we did not objectively investigate the influence of tumor size, we did notice that tumor size had a relative influence on the accuracy of our algorithm. More importantly, when an endoscopy was of adequate quality (i.e., in close proximity to the vocal cords with sufficient time of visualization), our algorithm was also able to properly localize and classify smaller vocal cord lesions. This will be investigated, among other aspects, in our following prospective study on the feasibility and accuracy of the algorithm in the outpatient clinic. Furthermore, to increase the accuracy of our algorithm, more data are needed. Since we experienced that the availability of high-quality videos of benign and malignant vocal cord lesions is limited relative to the quantities needed for proper DL algorithm training, we have initiated a collaboration with other (academic) hospitals in The Netherlands. Our goal is to create a nationwide digital database to which colleagues can add laryngoscopic videos of benign and malignant vocal cord lesions. Digital platforms for this purpose are available, such as the Digital Research Environment (DRE), where digital data can be safely stored and departments still manage their own data by signing a data transfer agreement. 41 Parallel to increasing the accuracy of the algorithm, implementation in the outpatient clinic will be pursued. At the time of writing, the DL algorithm is already installed in our outpatient clinic, operating independently, parallel to our existing endoscopy infrastructure, in a stand-alone setting.
The next study will focus on a prospective evaluation of feasibility and accuracy when using the DL algorithm, comparing it to the clinician's opinion and the obtained histopathology. In the foreseeable future, expansion to other laryngopharyngeal subsites or the proximal esophagus is among the possibilities.
In the foreseeable future, the model can be expanded upon by increasing the granularity of the predictions, namely by being able to differentiate between different types of benign lesions. Re-training the algorithm on chromoendoscopic images will also allow it to be used in all laryngoscopic settings. However, as mentioned in the previous paragraph, this will require an even more comprehensive dataset.
In conclusion, we have demonstrated that the first version of our DL algorithm, based on YOLO software, is able to correctly localize and classify benign and malignant vocal cord lesions on still images with sensitivities of 70%-82% and 71%-78%, respectively. Furthermore, the strongest DL algorithm, with a detection speed of 63 fps, can function in a real-time detection mode, making it suitable for use during live laryngoscopy.