Effectiveness of convolutional neural networks in the interpretation of pulmonary cytologic images in endobronchial ultrasound procedures

Abstract
Background: Rapid on-site cytologic evaluation (ROSE) helps to improve the diagnostic accuracy of endobronchial ultrasound (EBUS) procedures. However, cytologists are seldom available to perform ROSE in many institutions. Recent studies have investigated the application of deep learning in cytologic image analysis. The present study therefore analyzed lung cytologic images obtained by EBUS procedures and employed deep-learning methods to distinguish between benign and malignant cells and to semantically segment malignant cells.
Methods: Ninety-seven patients who underwent 104 EBUS procedures were enrolled. Four hundred and ninety-nine lung cytologic images were obtained via ROSE, including 425 malignant and 74 benign images; most of the malignant images were of lung adenocarcinoma (64.3%). All the images were used to train a residual network model with 101 layers (ResNet101), with suitable hyperparameters selected to classify benign and malignant lung cytologic images. An HRNet model was also employed to mark the area of malignant cells. Automatic patch-cropping was adopted to facilitate dataset preparation.
Results: Malignant cells were successfully classified by ResNet101 with 98.8% classification accuracy, 98.8% sensitivity, and 98.8% specificity in patch-based classification; 95.5% classification accuracy in image-based classification; and 92.9% classification accuracy in patient-based classification. The malignant cell area was successfully marked by HRNet with a mean intersection over union of 89.2%. The automatic cropping method enabled the system to complete diagnosis within 1 s.
Conclusions: This is the first study to combine deep-learning classification of lung cytologic images with semantic segmentation. The model was optimized for high accuracy, and the automatic cropping facilitates its clinical application. The success of both lung cytologic image classification and semantic segmentation on our dataset is a promising result for future clinical application.


| INTRODUCTION
Endobronchial ultrasound (EBUS) is a relatively new and minimally invasive procedure for diagnosing peripheral pulmonary lesions (PPLs) or mediastinal/hilar lesions. [1][2][3] Prior research has confirmed its low procedure-related complication rates. 4 As a result, EBUS is widely applied in diagnosing thoracic lesions in many clinical institutions. 5,6 However, the diagnostic yields of EBUS alone are insufficient. 7,8 To improve the procedure's diagnostic accuracy, attempts have been made to combine EBUS with other methods, such as fluoroscopy, virtual bronchoscopic navigation, and electromagnetic navigation. [9][10][11][12][13] However, the equipment employed in these procedures is not widely available because of the high cost and other limitations. As such, new approaches should be identified for clinical practice.
Rapid on-site cytologic evaluation (ROSE) provides immediate feedback, which ensures correct and sufficient sample collection. 14 Although ROSE can improve the diagnostic accuracy of EBUS procedures, [15][16][17] it requires extra time from the cytologist and is largely considered economically inefficient. 14 As such, few cytologists are willing to perform ROSE during a bronchoscopy procedure. Despite attempts to train pulmonologists to interpret cytologic smears on-site while EBUS procedures are being performed, 8 the EBUS procedure time appears to be prolonged because of the interruption of having to wait for the ROSE results. An adequate and effective means to present on-site cytologic material during EBUS procedures is thus required if the procedure time is to be shortened.
With computer vision, machines can recognize and analyze images and videos, effectively allowing them to view the world as humans do. Advancements in computer vision with deep learning have led to considerable developments, particularly regarding convolutional neural networks (CNNs). Previous studies have applied CNNs in cytologic image analysis. Sanyal et al. identified papillary carcinoma on thyroid fine-needle aspiration cytology smears, 18 and Savala et al. distinguished follicular adenoma from follicular carcinoma on fine-needle aspiration of the thyroid. 19 Zejmo et al., Steiner et al., and Bejnordi et al. classified breast cancer cytologic specimens. [20][21][22] Pouliakis et al. analyzed the role of artificial neural networks in cytopathology. 23 Teramoto et al. classified lung cytologic images. 24,25 However, limited data have been reported on the application of CNNs to the interpretation of lung cytologic specimens obtained via EBUS procedures. Thus, the present study evaluated the accuracy of CNNs in distinguishing between malignant and benign pulmonary cytologic specimens obtained by EBUS procedures.

| EBUS procedures and on-site cytologic image collection
All EBUS procedures were performed by a pulmonologist with more than 10 years of experience in bronchoscopic clinical practice. Before the procedures, computed tomography images were reviewed for planning. For the diagnosis of PPLs, we performed EBUS-guided transbronchial biopsy (EBUS-TBB). If the target lesions were in the mediastinal or hilar area, EBUS-guided transbronchial needle aspiration (EBUS-TBNA) was considered.
During EBUS-TBB, we used a flexible bronchoscope (BF-Q290 or BF-1T290; Olympus Co.) combined with a 20-MHz radial-EBUS probe (UM-S20-17S or UM-S20-20R; Olympus Co.). The radial-EBUS probe was inserted through the working channel of the scope into the suspected target bronchus based on the computed tomography images. After confirming the location of the lesion, specimens were collected via biopsy forceps (NBF01-11018120; MICRO-TECH Co. Ltd.) or a guide sheath kit (K201/K203; Olympus Co.).
Convex-EBUS (BF-UC260FW; Olympus Co.) was used for the EBUS-TBNA procedures. We identified the mediastinal and hilar lesions via slow withdrawal and rotation of the ultrasound transducer. TBNA biopsy with a 22-gauge needle (NA-201SX-4022; Olympus Co.) was then performed to obtain histological cores.
KEYWORDS: benign and malignant classification, convolutional neural network, deep learning, endobronchial ultrasound, lung cytologic image, semantic segmentation
During the procedure, material from the EBUS-guided samples was imprinted on a clear glass slide without a mounted coverslip for ROSE. Imprint smears were stained using a rapid method (Hemacolor; Merck KGaA) and evaluated on-site via microscopy (BX43; Olympus Co.) by our pulmonologist, who is also trained in cytology and has more than 6 years of experience in clinical cytologic practice. When malignant cells were suspected during ROSE, at least three more TBB samples or one more TBNA sample were obtained at the same position. If no suspicious cells were detected, we moved to another site and repeated the biopsy and ROSE. The EBUS procedure was terminated if no suspicious cells were found after two or three ROSE evaluations or if the patient could no longer tolerate the procedure. All tissue samples obtained by EBUS procedures were fixed in 10% formalin, embedded in paraffin, and stained with hematoxylin and eosin for subsequent pathological analysis.
During the ROSE study, we also recorded images of suspected malignant cells at 100×, 200×, or 400× magnification with a microscope digital camera system (DP22; Olympus Co.). Random cytologic images were also taken of samples with no malignant cells. Diagnosis of on-site cytologic images was confirmed by the formal cytopathologic results. Based on the cytopathologic results, final In the present study, 499 cytologic images were obtained from the participants, with 335 images via EBUS-TBB and 164 images via EBUS-TBNA. Four hundred and twenty-five of them were classified as malignant and 74 as benign. Of the 425 malignant images, 321 were lung adenocarcinoma, 41 were lung squamous cell carcinoma, 33 were small cell carcinoma, 12 were other non-small cell lung cancer, 10 were hepatocellular carcinoma, 6 were breast cancer, and the remaining 2 were pancreatic cancer. The nonmalignant images comprised pulmonary cryptococcosis in 4 cases, granulomatous inflammation in 2 cases, benign inflammatory cells (predominantly alveolar macrophages, polymorphonuclear leukocytes, or lymphocytes) in 55 cases, and ciliated columnar cells only in 13 cases (Table 1).

| Data preprocessing
The original images (1920 × 1440 pixels) were cropped into small patches (224 × 224 pixels). To balance the number of benign and malignant data, additional benign patches were generated from the benign images and from nonmalignant cell areas in the malignant images. First, 15 benign patches were randomly cropped from each of the benign images (Figure 1A). Second, we labeled the malignant cell areas in the malignant images at the pixel level with LabelMe. 26 From each malignant image, 10 malignant patches were randomly cropped from areas overlapping malignant cells, and 5 benign patches were randomly cropped from areas clear of malignant cells (Figure 1B). A total of 7486 small patches were generated after automatic cropping, including 3286 benign patches and 4200 malignant patches. Finally, all the patches were divided into a training set (70% of participants), a validation set (15% of participants), and a test set (15% of participants).
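The cropping rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `sample_patches`, the rejection-sampling strategy, the `max_tries` limit, and the rule that any overlap with a labeled pixel makes a patch malignant are all hypothetical choices. The demonstration uses a small synthetic mask rather than a real 1920 × 1440 annotation.

```python
import random

def sample_patches(mask, patch=224, n_malignant=10, n_benign=5,
                   seed=0, max_tries=10000):
    """Pick top-left corners of patches from one pixel-labeled image.

    mask is a 2D list of 0/1 values (1 = pixel labeled malignant).
    Assumption: a patch is malignant if it overlaps any labeled pixel
    and benign only if it is completely clear of labeled pixels.
    """
    rng = random.Random(seed)
    height, width = len(mask), len(mask[0])
    malignant, benign = [], []
    for _ in range(max_tries):
        if len(malignant) >= n_malignant and len(benign) >= n_benign:
            break
        x = rng.randint(0, width - patch)
        y = rng.randint(0, height - patch)
        overlaps = any(any(row[x:x + patch]) for row in mask[y:y + patch])
        if overlaps and len(malignant) < n_malignant:
            malignant.append((x, y))
        elif not overlaps and len(benign) < n_benign:
            benign.append((x, y))
    return malignant, benign

# Synthetic 64 x 64 mask with one labeled square, cropped with 8 x 8 patches.
demo_mask = [[1 if 20 <= r < 40 and 20 <= c < 40 else 0 for c in range(64)]
             for r in range(64)]
mal, ben = sample_patches(demo_mask, patch=8)
```

Rejection sampling keeps the sketch short; a production pipeline would more likely sample directly from the labeled and unlabeled regions.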

| Benign and malignant cell classification
The volume of training data was increased through data augmentation to avoid overfitting. Data augmentation included vertical flips, horizontal flips, random rotation of up to 20 degrees in either direction, Gaussian blurring with a probability of 0.2 (SD: 0-3), contrast adjustment with a probability of 0.2 (gamma: 0.5-2), and hue and saturation adjustment with a probability of 0.2 (−10 to 10 degrees).
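As an illustration of how these settings combine, the following sketch draws one set of augmentation parameters per training patch. It only samples parameters and does not transform images. The function name, the 0.5 flip probabilities, the uniform distributions, and the reading of the rotation as a range of −20 to 20 degrees are assumptions; the saturation change is omitted because its magnitude is not stated.

```python
import random

def sample_augmentation(rng):
    """Draw one hypothetical set of augmentation parameters for a patch."""
    params = {
        "vertical_flip": rng.random() < 0.5,    # flip probability is assumed
        "horizontal_flip": rng.random() < 0.5,  # flip probability is assumed
        "rotation_deg": rng.uniform(-20.0, 20.0),
        "blur_sigma": None,
        "gamma": None,
        "hue_shift_deg": None,
    }
    if rng.random() < 0.2:  # Gaussian blur, SD between 0 and 3
        params["blur_sigma"] = rng.uniform(0.0, 3.0)
    if rng.random() < 0.2:  # contrast adjustment, gamma between 0.5 and 2
        params["gamma"] = rng.uniform(0.5, 2.0)
    if rng.random() < 0.2:  # hue shift between -10 and 10 degrees
        params["hue_shift_deg"] = rng.uniform(-10.0, 10.0)
    return params

rng = random.Random(0)
samples = [sample_augmentation(rng) for _ in range(2000)]
```

Each optional transform is drawn independently, so a single patch may receive none, one, or several of them.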
Several deep-learning classification models were constructed to evaluate their ability to classify benign and malignant patches. VGG16 27 was built for comparison with the research by Teramoto et al. 24,25 ResNet50, ResNet101, and ResNet152 28 were selected for their strong ability to avoid accuracy saturation in deep networks. If the residual connection was the optimal path for backward propagation, the network would drive the weights of the convolutional block toward zero, allowing it to continue learning through the other residual blocks. ResNeXt50 and ResNeXt101 29 were able to detect features at different scales by combining inception modules and residual connections. ResNeSt50, ResNeSt101, and ResNeSt200 30 combined the ResNeXt model with an attention mechanism to enhance feature extraction. All the models were fine-tuned for 100 epochs after pretraining on the ImageNet dataset of natural color (RGB) images. The hyperparameters used in the benign and malignant patch classification were an initial learning rate of 0.0001, a batch size of 32, and an image input size of 224. The optimizer was stochastic gradient descent, the loss function was binary cross-entropy, and cosine learning rate decay was employed.
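For reference, cosine learning rate decay with the stated initial rate and epoch count can be sketched as below. The per-epoch update, the absence of a warm-up phase, and the final rate of zero are assumptions, since the text gives only the initial learning rate and the number of epochs.

```python
import math

def cosine_lr(epoch, total_epochs=100, initial_lr=1e-4, min_lr=0.0):
    """Learning rate at a given epoch under plain cosine decay."""
    progress = epoch / total_epochs
    return min_lr + 0.5 * (initial_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One value per epoch boundary, from epoch 0 to epoch 100.
schedule = [cosine_lr(epoch) for epoch in range(101)]
```

Under these assumptions the rate starts at 0.0001, passes through half that value at epoch 50, and reaches zero at epoch 100.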
During testing, accuracy for patch-based and image-based classification was calculated separately. Image-based classification accuracy was calculated using a sliding window algorithm, with the 224 × 224 window sliding in steps of 112 pixels from the upper left corner of the image to the right, then moving downward row by row. Because the benign patches included background patches, which were easy to classify, the model tended to overfit to them and focus less on distinguishing benign from malignant cells, resulting in a high false-positive rate and low specificity. Moreover, classifying malignant patches was difficult and required distinguishing features such as the ratio of the nucleus to the cell, which can also lead to a high false-positive rate and low specificity. To solve these problems, we set the Softmax output threshold to 0.99. An image was classified as malignant only if at least one patch had a Softmax output higher than 0.99; otherwise, it was classified as benign.
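The sliding window and thresholding steps can be sketched as follows. The function names are hypothetical, and the sketch assumes that only windows lying fully inside the image are used; the text does not say how the right and bottom margins are handled. `patch_probs` stands for the per-patch malignant Softmax outputs that a trained classifier would supply.

```python
def window_origins(width=1920, height=1440, patch=224, stride=112):
    """Top-left corners of the sliding windows, left to right, then row by row."""
    xs = range(0, width - patch + 1, stride)
    ys = range(0, height - patch + 1, stride)
    return [(x, y) for y in ys for x in xs]

def classify_image(patch_probs, threshold=0.99):
    """Malignant only if at least one patch exceeds the Softmax threshold."""
    return "malignant" if any(p > threshold for p in patch_probs) else "benign"

origins = window_origins()
```

With these assumptions a 1920 × 1440 image yields 16 window positions per row and 11 rows, or 176 patches per image.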
For patient-based classification, each patient had 2-10 images. A majority vote algorithm was thus employed to classify the cases as benign or malignant. The algorithm classified cases with more benign images than malignant images as benign, and vice versa for those with more malignant images. To avoid potentially missing malignant cases, those with the same number of benign and malignant images were also considered malignant.
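The voting rule above, including the tie-breaking toward malignant, amounts to the following minimal sketch. The function name and the string labels are illustrative choices.

```python
def classify_patient(image_labels):
    """Majority vote over one patient's image-level labels; ties count as malignant."""
    malignant = sum(1 for label in image_labels if label == "malignant")
    benign = len(image_labels) - malignant
    return "malignant" if malignant >= benign else "benign"

tie_case = classify_patient(["benign", "malignant"])
benign_case = classify_patient(["benign", "benign", "malignant"])
```

Resolving ties as malignant trades specificity for sensitivity, which matches the stated aim of not missing malignant cases.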

| Malignant cell segmentation
The CNN for semantic segmentation of malignant cells was trained with malignant images only. The 425 malignant images were divided into a training set (70% of images), a validation set (15% of images), and a test set (15% of images). Data augmentation was performed to increase the volume of data, including random rotation by 90 degrees, random horizontal flips, random vertical flips, hue saturation adjustment, brightness and contrast adjustment, and random cropping of images with a size of 1024 × 1024.
During training, different semantic segmentation models were constructed. FCN 31 is the basic model with an encoder/decoder structure and was used as our baseline. U-Net, 32 which is well suited to medical image segmentation, adds skip connections between the encoder and decoder to achieve good performance with few parameters. PSPNet 33 applies a pyramid pooling module after the encoder to extract useful information from the encoder output. DeepLabv3 34 is a powerful semantic segmentation model that employs dilated convolution kernels to preserve high-resolution information. DeepLabv3+ 35 simplifies the decoder of DeepLabv3 to reduce computational complexity while still preserving high-resolution information. FPN, 36 which performs well on object detection tasks, stacks feature maps of different sizes in the decoder to obtain multiscale features; this decoder design can also perform well on semantic segmentation tasks. HRNet 37 leveraged 256 × 256 high-resolution image operations throughout the entire network and added low-resolution image information (128 × 128, 64 × 64, 32 × 32) at each stage to provide features of larger cells, as shown in Figure 2. This enabled the model to segment and distinguish malignant cells globally and locally. At the end of the network, feature maps of different sizes were stacked to obtain different levels of cell information. All the selected models were fine-tuned for 300 epochs after pretraining on the ImageNet dataset. The hyperparameters employed for semantic segmentation included an initial learning rate of 0.001, a batch size of 4, an image input size of 1024, and a weight decay of 0.0001. The optimizer was stochastic gradient descent, and the loss function was 0.5 times the binary cross-entropy plus the dice loss. 38 Cosine learning rate decay was employed.
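The combined loss can be illustrated on a flattened list of per-pixel probabilities. This sketch reads the description as 0.5 × binary cross-entropy + dice loss; the alternative reading, 0.5 × (binary cross-entropy + dice loss), is equally compatible with the wording. The smoothing constant `eps` and the soft (probability-based) dice formulation are assumptions.

```python
import math

def bce_dice_loss(probs, targets, bce_weight=0.5, eps=1e-7):
    """Weighted binary cross-entropy plus soft dice loss over flattened pixels."""
    # Clip probabilities so the logarithms stay finite.
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1.0 - p)
               for p, t in zip(clipped, targets)) / len(targets)
    intersection = sum(p * t for p, t in zip(probs, targets))
    dice = (2.0 * intersection + eps) / (sum(probs) + sum(targets) + eps)
    return bce_weight * bce + (1.0 - dice)

perfect = bce_dice_loss([1.0, 0.0, 1.0, 0.0], [1, 0, 1, 0])
uncertain = bce_dice_loss([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

The dice term responds to the overlap of the whole predicted region with the target, which complements the per-pixel cross-entropy when malignant areas occupy a small part of the image.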

| Statistical analysis
Interpretations of the ResNet101-based deep-learning models were compared to the final diagnosis from the lung cytologic images. Sensitivity, specificity, positive predictive value, negative predictive value, and diagnostic accuracy rate were calculated according to standard definitions. Mean intersection over union (mIoU) was obtained for each semantic segmentation model during the malignant cell semantic segmentation process. The result for mIoU was calculated as the intersection area divided by the union area for a predicted area and target area. The formulas were as follows:
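In conventional notation, with TP, FP, TN, and FN denoting the numbers of true-positive, false-positive, true-negative, and false-negative results, and with P and T denoting the predicted and target pixel areas, the standard definitions can be written as below. The symbols are chosen here for illustration, and the averaging index K is an assumption: the text does not state whether the mean is taken over classes or over test images.

```latex
\begin{align}
\text{Sensitivity} &= \frac{TP}{TP + FN}, &
\text{Specificity} &= \frac{TN}{TN + FP}, \\
\text{PPV} &= \frac{TP}{TP + FP}, &
\text{NPV} &= \frac{TN}{TN + FN}, \\
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN}, &
\text{IoU} &= \frac{|P \cap T|}{|P \cup T|}, \\
\text{mIoU} &= \frac{1}{K} \sum_{k=1}^{K} \text{IoU}_k .
\end{align}
```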

| RESULTS
Among the models tested in this study, the ResNet101 model achieved excellent accuracy, sensitivity, and specificity in patch-based classification, image-based classification, and patient-based classification of benign and malignant cells in lung cytologic images obtained via EBUS procedures. Our semantic segmentation tests also achieved a very high mIoU using the HRNet model.

| DISCUSSION
A few studies have applied deep-learning models in lung cytologic image classification and segmentation. In two studies, Teramoto et al. employed CNNs to classify benign and malignant cells from lung cytologic images, achieving 89.3% sensitivity and 83.3% specificity. 24,25 In our study, the ResNet101 model with patch-based classification achieved 98.8% testing accuracy with 98.8% sensitivity and 98.8% specificity. The loss/epoch curve is shown in Figure 3. We also found that ResNet101 exhibited the highest accuracy, sensitivity, and specificity among the CNN models compared (Table 3). Comparing the results of ResNet50, ResNet101, and ResNet152 in Table 3 suggests that the residual connections in ResNet101 provide the network with an appropriate depth and size to learn distinguishable features from cells without overfitting. Comparing the results of ResNet and ResNeSt in Table 3 suggests that the models with an attention mechanism have too many parameters, which often leads to overfitting. Thus, among the models we tested, ResNet101 was optimal for learning the features that distinguish benign from malignant lung cytologic patches. Furthermore, data augmentation increased the patch-based classification accuracy of ResNet101 from 92.2% to 98.8%, and transfer learning from the ImageNet dataset increased it from 86.5% to 98.8%.
In image-based classification, images were used as the unit for distinguishing between benign and malignant cases, with each image cropped into patches through sliding windows for patch-based classification. Since the patch-based classification models tended to have low specificity and a high false-positive rate, as shown in Table 3, we set the threshold of the Softmax output to 0.99. If the Softmax output from the patch-based classification was higher than 0.99, the image was classified as malignant. The image-based classification accuracy was 95.5%, with 98.2% sensitivity and 77.8% specificity. In addition, image-based classification was conducted on a dataset with cytologic images obtained via EBUS-TBB only (Table S1) and a dataset with cytologic images obtained via EBUS-TBNA only (Table S2). The diagnostic accuracy rate reached 96.0%, 93.8%, and 95.5% on the EBUS-TBB dataset, the EBUS-TBNA dataset, and the combined EBUS-TBB and EBUS-TBNA dataset, respectively. EBUS-TBB and EBUS-TBNA target lesions in different locations and may therefore yield different cytologic pictures; nevertheless, the diagnostic accuracy in the two groups was very similar. These results demonstrate that our method can classify cytologic images obtained via both EBUS-TBB and EBUS-TBNA. Table 4 shows there were two false positives and one false negative. The two false positives occurred because the nucleus was enlarged in these reactive benign cells (reactive bronchial cells and alveolar macrophages), which mimicked the appearance of malignant cells (Figure 4). The false-negative image may have been caused by blurred cell boundaries, making it difficult for our CNN models to identify the target cells.
The error in the image-based classification of each patient accounted for only a few images, and the majority vote algorithm corrected these in the patient-based classification.
TABLE 2 Patch-based benign and malignant classification results using ResNet101 (ResNet101 prediction versus final cytologic image results, positive and negative, with totals)
The accuracy of the patient-based classification was 92.9%, with 100% sensitivity and 66.7% specificity. The relatively low specificity may be due to the small sample size: fourteen patients were included in the test group, and one was mistakenly categorized as a malignant case (Table 5). Although misdiagnosis might delay treatment planning in cancer patients, achieving 100% diagnostic accuracy in cytologic interpretation is difficult, even for experienced cytologists. Clinically, repeat sampling would be performed when lung malignancy is highly suspected based on computed tomography image findings or serological tumor marker elevation. We also found that, of the two images from this patient, the cellular morphology in one was very similar to that of malignant cells. We believe that obtaining more images during EBUS procedures to increase the amount of training data might minimize or eliminate this problem.
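The patient-level figures can be checked with a short worked calculation. The counts below are inferred from the text (14 test patients, one benign patient misclassified as malignant, 100% sensitivity, 66.7% specificity) rather than read from Table 5, so they should be treated as an assumption; the function name is illustrative.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }

# Inferred counts: 11 malignant patients all detected, 3 benign patients with 1 false positive.
patient_level = diagnostic_metrics(tp=11, fp=1, tn=2, fn=0)
```

With only three benign patients in the test group, a single false positive lowers the specificity from 100% to 66.7%, which illustrates why the small sample size weighs so heavily on this figure.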
The images classified as malignant by image-based classification were sent to the semantic segmentation model to mark the malignant cell areas. This was performed using different models with adjusted hyperparameters.
The best semantic segmentation result was achieved using HRNet (mIoU: 89.2%), as shown in Table 6. HRNet comprised four subnetworks. Each subnetwork was operated at different resolutions with information repeatedly exchanged with other subnetworks via multiscale fusion. We leveraged high-resolution image operations throughout the entire network to focus on the features of small cells, and we added low-resolution image information through multiscale fusion for the features of large cells. Hence, the model possessed sufficient information to segment and distinguish malignant cells, both globally and locally. Adding an object-contextual representation (OCR) module 39 to HRNet did not improve the accuracy, since the dataset in this study was a single-class semantic segmentation task, which does not fully leverage the advantages of the OCR module. Figure 5 shows a comparison of the test images, test targets, and results of the semantic segmentation by HRNet (mIoU: 89.2%).
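To make the reported metric concrete, the sketch below computes intersection over union for a pair of binary masks and averages it over the malignant and background classes. Averaging over these two classes is an assumption about how mIoU was defined here, and the tiny masks are synthetic examples, not study data.

```python
def iou(pred, target, label=1):
    """Intersection over union of one class between two 2D label masks."""
    inter = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == label, t == label
            inter += 1 if (in_pred and in_target) else 0
            union += 1 if (in_pred or in_target) else 0
    # If the class is absent from both masks, treat the match as perfect.
    return inter / union if union else 1.0

def mean_iou(pred, target, labels=(0, 1)):
    """Mean IoU over the listed classes (assumed: background and malignant)."""
    return sum(iou(pred, target, label) for label in labels) / len(labels)

demo_pred = [[1, 1, 0], [0, 1, 0]]
demo_target = [[1, 0, 0], [0, 1, 1]]
demo_iou = iou(demo_pred, demo_target)
demo_miou = mean_iou(demo_pred, demo_target)
```

In this toy case two pixels are labeled malignant in both masks and four in at least one, so the malignant-class IoU is 0.5.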
In our study, the image sample comprised considerably more malignant cases than benign cases. We leveraged a patch-cropping method during data preprocessing to solve the data imbalance problem of image-level data and then calculated the image-based classification results from the patch-based classification results with a sliding window algorithm. As an alternative, the benign images were also directly upsampled to address the data imbalance; this is defined as "Image-level + upsampling" in this study (Table 7). To compare the effect of the different data preprocessing methods, we used ResNet101 with an initial learning rate of 0.0001, a batch size of 32, stochastic gradient descent as the optimizer, and binary cross-entropy as the loss function; cosine learning rate decay was also used. The only difference was that the input image size was 512 × 512 for the direct image classification, whereas the patch size was 224 × 224 for the patch-based classification. Experimental results showed that the accuracy of patch-based classification with a sliding window was higher than that of the other two methods; thus, this approach could effectively address the data imbalance. This overcomes the problem of directly classifying lung cytologic images with a data imbalance, which would result in the model predicting all the images as malignant. Even when the benign images were upsampled five times to balance the data, the variation in the benign data was still too low for the model to successfully learn the cell characteristics. The weights of the last layer of the ResNet101 were also visualized for the patch-based classification to confirm  Figure 6 indicates the area the model focuses on while learning; the blue area receives less focus. We found that the model learned the specific characteristics of malignant cells and ignored the background and benign areas, thus confirming that it focuses on the correct area of the cell.
We routinely perform ROSE with Hemacolor stain in our institution because the color is very similar to Diff-Quik stain and the procedure time is shorter. Most reports on the efficacy of CNNs use Papanicolaou stain for cytologic preparation. 24,25 In previous clinical studies, different staining methods have been associated with sensitivities ranging from 72.8% to 96.9%. [40][41][42] In the present study, ResNet101 exhibited excellent performance in differentiating between benign and malignant cells. This is the first study to use deep-learning methods to interpret the cytologic specimens via Hemacolor stain, confirming that different staining methods can be used by deep-learning models in interpreting cytologic specimens.
In future work, more data and more pulmonologists should be involved to overcome the limitations of our study. First, the volume of data was relatively small for training the deep-learning model. Second, most of the malignant data were from cases of lung adenocarcinoma, so the amount of data for other cancer types should be increased. Third, only one pulmonologist (L.C.K.), who focuses on interventional pulmonology, has completed cytologic training. For this reason, ROSE could only be performed during bronchoscopy procedures, and we limited our research to EBUS procedures. To overcome these limitations, future studies should build on the present investigation with a larger and more diverse study population.
FIGURE 5 Semantic segmentation results visualization. Test images are in the first row; test targets (ground truth) are in the second row. White pixels denote areas predicted as malignant and black pixels denote areas predicted as benign or background. Semantic segmentation results are in the third row
TABLE 7 Classification method comparison of different data preprocessing methods

| CONCLUSION
In conclusion, classification procedures followed by semantic segmentation yield high accuracy for lung cytologic analysis. ResNet101 achieved 98.8% accuracy for patch-based classification after hyperparameter adjustment. Image-based classification accuracy was 95.5% with the sliding window algorithm, and patient-based classification accuracy was 92.9%. After benign and malignant classification of lung cytologic images, semantic segmentation was employed to classify each pixel in the malignant images to mark malignant cell areas; for this, HRNet achieved an mIoU of 89.2%. This is the first study to combine lung cytologic image deep-learning classification with semantic segmentation. It is also the first research and deep-learning analysis of a dataset comprising Hemacolor-stained lung cytologic images. We believe that the deep-learning model employed in this study can be applied clinically in the interpretation of lung cytologic images in the future.