Artificial intelligence-driven digital whole slide image analysis for intelligent recognition of different development stages of tongue tumors via a new deep learning framework

Accurate clinical diagnosis of the stage of tumor development is essential for formulating a treatment plan. However, the stage of tongue tumor development among malignant, benign, and leukoplakia is easily misdiagnosed, resulting in differing treatment approaches, putting patients at risk, and preventing them from receiving appropriate care. This study aimed to establish an automatic recognition system for tongue tumors at different stages of development using artificial intelligence methods together with pathological tissue section images. By improving the Swin Transformer, a recent deep learning framework, the tissue slice image is used to distinguish lesion from non-lesion areas with a patch-based method, and the output is then reconstructed by a self-assembly method in which the lesion areas are marked with a heat map. Subsequently, an automatic recognition system with a user-friendly interface was designed for the stage of tongue tumor development (malignant, benign, and leukoplakia). The proposed model achieves a high recognition accuracy (98.45%). The prediction accuracy of the system in each category is higher than that of a specialist doctor with 13 years of experience. In summary, the Swin Transformer framework was improved in this study to accurately and automatically identify the various stages of tongue tumor development.


INTRODUCTION
Oral cancer is one of the eight most common types of cancer worldwide, and the 5-year survival rate of patients after onset is less than 60%.1,2 The proportion of tongue cancer is the highest among oral cancers, and its incidence is increasing and trending younger globally.3 Additionally, tongue cancer is distinguished by a high degree of malignancy, a high rate of local recurrence, and a high rate of neck metastasis.4 Therefore, radical surgery is necessary; otherwise, patients' lives are threatened. Currently, the modern comprehensive treatment of tongue cancer is dominated by surgery.5 After surgery, it is a critical step to accurately discriminate the extent of tumor development (benign, malignant, and leukoplakia, a state between benign and malignant), depending on the tumor microenvironment.6 In the course of routine diagnosis, a digital whole slide image (DWSI) of a hematoxylin and eosin (H&E) stained tumor tissue slice is manually magnified by a trained histopathologist on a computer.7,8 Nevertheless, the scarcity of pathologists and the high demands on clinical experience exacerbate the conflict between clinical need and actual capacity. In addition, intra- and inter-observer differences introduce additional bias and risk to histopathological analysis in the diagnostic process.12,13 Besides, misjudgments may also arise from the subtle differences in characteristics between malignant and benign tissue. Therefore, developing an intelligent-aided diagnosis system is of great necessity.
In recent years, convolutional neural networks (CNNs) in AI have broken performance benchmarks for image classification tasks, opening new opportunities in medical image classification.14 However, CNN models have certain drawbacks, one of which is that they struggle to capture global information.15 In contrast, the Vision Transformer (ViT) model, recently introduced to the computer vision field, can extract richer global information.16 It has demonstrated strong performance in image classification by directly applying standard Transformer encoders17 from natural language processing (NLP) to non-overlapping image patches. While ViT is designed for image classification tasks, it is not well suited for downstream tasks requiring dense prediction. The Swin Transformer (Swin-T),18 however, incorporates priors such as hierarchy, locality, and translation invariance into the Transformer network design, solving such problems effectively.
H&E-stained DWSIs are often very large (about 10^6 × 10^6 pixels), and operating on and recognizing them directly would significantly increase the demand on computer memory, so a patch-based approach was adopted. The intelligent prediction of tongue cancer development was then carried out by adjusting the Swin-T framework, and the Top1 accuracy reached 98.45%. In addition, the model was encapsulated into an automatic detection system that passed clinical and human-machine tests, proving its high stability and reliability.

Dataset acquisition
Figure 1 shows the workflow of this study. The first step was to section and prepare the postoperative tongue tumor pathological tissue. After obtaining a complete H&E-stained DWSI of the tongue tumor, the large image was cut into several small images of a predetermined size (500 × 500 pixels), and the image samples included both lesion and non-lesion areas. During identification, images of non-lesion areas were not processed further, while images of lesion areas underwent class discrimination. Because cutting the image into small pieces might lead to misjudgment, we designed the rule that a sample was assigned to a disease category only if more than 80% of its lesion patches were predicted to be the same disease. Finally, the small images were spliced back into the original shape along the original path. For ease of model interpretation, we also present visualizations of our model.
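The tiling and majority-vote logic described above can be sketched as follows. This is a minimal illustration, not the authors' published code: the function names, the NumPy array representation, and the bookkeeping of tile origins are our assumptions.

```python
import numpy as np

TILE = 500  # tile edge length in pixels, as used in the study

def tile_image(wsi: np.ndarray, tile: int = TILE):
    """Cut a whole-slide image array into non-overlapping tiles,
    keeping each tile's (row, col) origin so the prediction map
    can later be spliced back into the original shape."""
    h, w = wsi.shape[:2]
    tiles = []
    for y in range(0, h - tile + 1, tile):
        for x in range(0, w - tile + 1, tile):
            tiles.append(((y, x), wsi[y:y + tile, x:x + tile]))
    return tiles

def sample_label(lesion_tile_preds, threshold: float = 0.8):
    """Assign a sample-level disease label only if more than
    `threshold` of the lesion tiles agree on the same class;
    otherwise return None (no confident sample-level call)."""
    if not lesion_tile_preds:
        return None
    counts = {}
    for p in lesion_tile_preds:
        counts[p] = counts.get(p, 0) + 1
    label, n = max(counts.items(), key=lambda kv: kv[1])
    return label if n / len(lesion_tile_preds) > threshold else None
```

Non-lesion tiles would simply be excluded from `lesion_tile_preds` before the vote, matching the rule that only lesion areas undergo class discrimination.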
From January 2016 to June 2022, 389 patients diagnosed in the Department of Stomatology of the Second Affiliated Hospital of Guangzhou Medical University provided the data for this study. The dataset included squamous cell carcinoma in situ (SCCiS), well-differentiated squamous cell carcinoma (WDSCC), moderately differentiated squamous cell carcinoma (MDSCC), low-differentiated squamous cell carcinoma (LDSCC), leukoplakia, maxillofacial hemangioma (MH), oral squamous papilloma (OSP), and lymphatic follicular cyst (LFC); the sample proportion of each category is shown in Figure 2A.
The pathological tissue sections were scanned with a 3DHISTECH pathological slide scanner (Pannoramic 250 Flash), and the scanned images were processed and exported using SlideViewer (3DHISTECH) software. The samples were divided into three categories (malignant, benign, and leukoplakia) by professional doctors with more than 18 years of working experience, and the lesion areas were outlined on the slice images. Benign samples included MH, OSP, and LFC (Figure 2B); leukoplakia included leukoplakia and/or associated inflammatory hyperplasia (Figure 2C); and malignant samples included SCCiS, WDSCC, MDSCC, and LDSCC (Figure 2D). The oral sites considered in the analysis included the tongue dorsum, the left tongue, and the right tongue, among others.
Because a DWSI is very large (about 10^6 × 10^6 pixels) and the lesion area is only a small fraction of it (Figure 2B-D), we adopted a stratified sampling method to randomly collect images of 500 × 500 pixels from each sample in the three categories. All the slides were collected at two magnifications: ×20 (0.50 μm/pixel) (Figures 2E and S1) and ×40 (0.25 μm/pixel) (Figures 2E and S2). Moreover, to ensure the quality of the training dataset, we removed images in which more than 2/3 of the area was blank. The final dataset was distributed as follows: 2755 malignant, 1473 leukoplakia, and 1901 benign images at ×20 magnification, and 2639 malignant, 1451 leukoplakia, and 1516 benign images at ×40 magnification. The entire dataset was then randomly split into training and validation sets at a ratio of 8:2. In addition, 1023 images of non-lesion areas (Figure S3) were collected at ×20 magnification.
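The blank-area filter above can be sketched as follows. The 2/3 area cutoff comes from the text; the intensity threshold that defines a "blank" (near-white background) pixel is an assumed value, since the paper does not specify one.

```python
import numpy as np

def is_mostly_blank(tile: np.ndarray, blank_threshold: int = 230,
                    max_blank_fraction: float = 2 / 3) -> bool:
    """Return True for tiles whose blank (near-white) area exceeds the
    cutoff; the study discarded tiles with more than 2/3 blank area.
    `blank_threshold` is an assumption: pixels brighter than this
    grayscale value count as background on an H&E slide."""
    gray = tile.mean(axis=-1) if tile.ndim == 3 else tile
    blank_fraction = (gray > blank_threshold).mean()
    return blank_fraction > max_blank_fraction
```

Applied during dataset construction, tiles for which `is_mostly_blank` returns True would simply be dropped before the 8:2 train/validation split.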

Model improvement and training
To extract features from tongue tumor H&E-stained images for automatic detection of tongue tumor lesions, this work adopted the Swin Transformer,18 a recent framework for computer vision image classification tasks. The Swin Transformer model is typically organized into four stages, each made up of a number of Swin Transformer blocks. Each block consists primarily of two multi-head self-attention modules, W-MSA and SW-MSA, with regular and shifted windowing configurations, respectively, followed by a two-layer multi-layer perceptron (MLP) with GELU nonlinearity. To prevent the feature map from shrinking too quickly and losing features, we added a stage module to the Swin Transformer model and marked the improved model as Swin-T_5S (Figure 3). The downsampling multiplier of the model starts at four and doubles at each stage up to 64. The number of channels doubles from 48 up to 1536, and we adopted multi-head self-attention with (3, 6, 12, 24, 48) heads and a window_size of 7.
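A hypothetical configuration for Swin-T_5S can be read off the description above. The parameter names mimic common Swin Transformer implementations (e.g., timm-style constructors) and the per-stage block depths are not given in the text, so everything below is a sketch rather than the authors' actual setup.

```python
# Hypothetical Swin-T_5S configuration sketch; parameter names follow
# common Swin Transformer implementations and are assumptions -- the
# authors' code is not published.
swin_t_5s = dict(
    patch_size=4,                  # initial 4x downsampling
    num_stages=5,                  # the four standard stages plus the added one
    num_heads=(3, 6, 12, 24, 48),  # multi-head self-attention heads per stage
    window_size=7,
    num_classes=3,                 # malignant, benign, leukoplakia
)

# Cumulative downsampling starts at 4x and doubles at each stage up to
# 64x, as stated in the text.
downsampling = [swin_t_5s["patch_size"] * 2 ** i
                for i in range(swin_t_5s["num_stages"])]
print(downsampling)  # [4, 8, 16, 32, 64]
```

The fifth stage is what keeps the feature map from shrinking too quickly: the window attention keeps operating on 7 × 7 windows even as spatial resolution drops.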
In general, when using the gradient descent algorithm to optimize the objective function, we want the learning rate to decay gradually as the loss approaches its global minimum, so that the model can get as close to the global optimum as possible. We therefore selected the cosine annealing schedule, which decreases the learning rate along a half-cosine curve, and applied the cosine annealing with warm restarts method22 (Equation 1) to adjust the learning rate. After T_i epochs are executed, a warm restart begins; here, the subscript i refers to the number of restarts. A "restart" in this sense does not restart training from scratch: the learning rate is increased again, while x_t, the solution of the loss function found by gradient descent in the preceding epochs, is carried over. After tuning, we found it appropriate to set T_max to 5 and the minimum learning rate to 0. The study used the deep learning framework PyTorch,23 with two graphics processing units (RTX 3090 Ti, NVIDIA) for computation.
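The schedule can be written directly from the cosine annealing with warm restarts formula (Equation 1). The base learning rate below is a placeholder, not a value from the paper; in PyTorch the same schedule is available as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`.

```python
import math

def cosine_annealing_lr(eta_max: float, eta_min: float,
                        t_cur: int, t_i: int) -> float:
    """Cosine annealing with warm restarts (Loshchilov & Hutter):
    the learning rate decays from eta_max to eta_min over t_i epochs
    along a half-cosine, then jumps back up. t_cur is the number of
    epochs elapsed since the last restart."""
    return eta_min + 0.5 * (eta_max - eta_min) * (
        1 + math.cos(math.pi * t_cur / t_i))

# With T_max = 5 and a minimum learning rate of 0, as set in the text
# (eta_max = 0.01 is a placeholder base rate), the per-epoch schedule
# over two cycles looks like:
schedule = [cosine_annealing_lr(0.01, 0.0, t % 5, 5) for t in range(10)]
```

Each restart lets the optimizer climb out of sharp local minima while retaining the weights already found, which is the "warm" part of the restart.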

Model evaluation
Model evaluation generally focuses on the prediction error of a machine-learning model: the model should not only fit the training data well during learning, but also predict well on new data (generalization capability). Therefore, the generalization capability of the model was assessed using its performance on the test set. In this study, the main evaluation metrics were Accuracy (Equation 2), the Confusion Matrix, Precision (Equation 3), Recall (Equation 4), F1-Score (Equation 5), and Specificity (Equation 6).
Here, a true positive (TP) is a case that is actually positive and is predicted as positive; a true negative (TN) is a case that is actually negative and is predicted as negative; a false positive (FP) is a case that is predicted as positive but is actually negative; and a false negative (FN) is a case that is predicted as negative but is actually positive.
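These definitions translate directly into code. A minimal per-class (one-vs-rest) implementation of Equations 2-6 might look like:

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Per-class evaluation metrics used in the study (Equations 2-6),
    computed from one-vs-rest confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)           # also called sensitivity
    f1 = 2 * precision * recall / (precision + recall)
    specificity = tn / (tn + fp)
    return dict(accuracy=accuracy, precision=precision,
                recall=recall, f1=f1, specificity=specificity)
```

For the three-class problem here, each class (malignant, benign, leukoplakia) would be scored against the other two combined, and the confusion matrix summarizes all pairwise errors at once.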

Evaluation of model performance
Alongside the Swin-T model, we also tested some classical CNN models, such as VGG16,24 ResNet50,25 DenseNet121,26 MobileNetV3,27 InceptionV4,28 and InceptionResNetV2.28 For Top1 accuracy (Figure 4A), Swin-T_5S (98.45%) was the highest, followed by Swin-T (96.93%), InceptionResNetV2 (95.64%), InceptionV4 (93.84%), DenseNet121 (93.02%), VGG19 (90.59%), ResNet101 (89.76%), and MobileNetV3 (87.35%). These accuracies were measured on the ×20 magnification test images of the lesions (malignant, benign, and leukoplakia). The Swin-T_5S model was therefore applied in the testing and design of the automatic identification system. For the identification of lesions versus non-lesions, the validation accuracy was more than 99.9%. Judging from the confusion matrix and ROC curves of the test set (Figure S4), lesions and non-lesions could be completely separated, likely because the features distinguishing the two are more obvious. This also indicated high validity for the Swin-T_5S model, laying the foundation for the subsequent three-class discrimination (malignant, benign, and leukoplakia) within the lesions.
Furthermore, after the lesion and non-lesion samples were fully identified, we subdivided the lesion areas into malignant, benign, and leukoplakia. The accuracy on images at ×20 magnification (98.45%) was higher than on those at ×40 (93.13%). In addition, the Precision, Recall, and F1-Score values for the ×20 images were better than those for the ×40 images (Table 1). This indicated that the ×20 magnification images were more representative of the stratification of tongue tumors, so images at this magnification were used as the sample input for the hierarchical prediction model of tongue tumors.

Model visualization
The classification decisions of neural networks are not directly interpretable by humans because the networks are developed in a data-driven fashion from training sets. Nonetheless, some interpretive approaches have been developed in recent years to gain insight into these algorithms' classification decisions.29 In this work, to determine which regions of the input image were important for the network's classification decision, we analyzed the Swin-T_5S model with the Grad-CAM method30 (Figure 5A-C), highlighting all suspicious regions in the slides, especially tiny lesions. The results indicated that the model had learned to focus on the relevant single-cell plaques and to ignore background characteristics. For the Swin-T_5S model, prediction maps of entire sample slides could therefore be derived by assembling the tile probabilities produced by the patch-level classifier (Figure 5D-F). This further reveals the state of the tongue tumor microenvironment: the darker the red color, the more important the area is for the predicted category. For example, in the malignant type, darker red indicates greater malignancy.
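The core of the Grad-CAM computation can be sketched as follows. Capturing activations and gradients from a particular Swin-T_5S layer (e.g., via PyTorch forward/backward hooks) is omitted here, and the NumPy formulation is our simplification of the method, not the authors' implementation.

```python
import numpy as np

def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Core Grad-CAM step: channel weights are the spatially averaged
    gradients of the class score; the map is the ReLU of the weighted
    sum of activation channels, normalized to [0, 1].
    `activations` and `gradients` are (C, H, W) arrays captured from
    the layer of interest."""
    weights = gradients.mean(axis=(1, 2))             # alpha_k, shape (C,)
    cam = np.tensordot(weights, activations, axes=1)  # weighted sum over channels
    cam = np.maximum(cam, 0.0)                        # ReLU: keep positive evidence
    if cam.max() > 0:
        cam /= cam.max()                              # normalize for display
    return cam
```

Upsampling the resulting map to the patch size and tinting it red produces the per-patch heat maps shown in Figure 5A-C; splicing the patch maps back along their origins yields the whole-slide maps in Figure 5D-F.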
To gain further insight into the consistency of the model's classifications, the t-distributed stochastic neighbor embedding (t-SNE) algorithm31 was applied to visualize the dimensionality reduction of the 1536-dimensional (penultimate layer) and 3-dimensional (final layer) prediction spaces of the MLP in the last stage. The algorithm transforms probabilities based on a Gaussian distribution in the high-dimensional space into probabilities based on a t-distribution in the embedding space (two-dimensional space). This enables t-SNE to capture not only local but also global structure, giving a clearer view of the separation between categories. The results show some discrete points in the 1536-dimensional space (Figure 5G) that are brought under control in the dimensionality reduction of the 3-dimensional space (Figure 5H), indicating that the Swin-T_5S model clearly separates the three categories in both spaces (1536-D and 3-D) and makes misclassified samples easy to identify.

Intellectual detection system
To facilitate clinical auxiliary diagnosis, we designed an automatic identification system based on the Swin-T_5S model (Figure 6A). When the DWSI of a tongue tumor is obtained, it is passed into the system and predicted by the patch-based method. In the end, a heat map of the entire sample is derived by assembling the patch-level classifier's outputs patch by patch. A DWSI at ×20 magnification is used as the input sample of the system.
The output of the model prediction is a 3-dimensional vector, and the category corresponding to the maximum predicted probability is taken as the final prediction result. To evaluate the volatility of the predicted probability values, all the training samples were tested (Figure 6D). We discovered that when the maximum predicted probability was less than 0.7 and the second-highest value was more than 0.3, the error rate was above 50%. Therefore, given the limited number of categories in this study, we set up an additional category (marked as Others) to indicate that a sample may not belong to any of the three categories considered here. The categorization rule for Others was based on the number of patch images in one sample whose maximum probability was <0.7 and whose second-highest probability was >0.3. In this way, the system prompts for manual intervention and functions as a warning to avoid misdiagnosis caused by system misjudgment. Furthermore, the system also provides other functions, such as magnification, manual marking, patient clinical records, and report printing.
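The per-patch fallback rule can be sketched as follows. The ordering of classes in the output vector is an assumption made for illustration; only the 0.7/0.3 thresholds come from the text.

```python
def triage_prediction(probs, max_cut: float = 0.7,
                      second_cut: float = 0.3) -> str:
    """Apply the study's fallback rule to one patch: if the top
    predicted probability is below 0.7 and the runner-up is above 0.3,
    route the patch to 'Others' for manual review instead of trusting
    the model; otherwise return the argmax class.
    The class ordering below is assumed, not specified in the paper."""
    ranked = sorted(probs, reverse=True)
    if ranked[0] < max_cut and ranked[1] > second_cut:
        return "Others"
    classes = ("malignant", "benign", "leukoplakia")
    return classes[max(range(len(probs)), key=probs.__getitem__)]
```

At the sample level, the system counts how many patches fall into Others and uses that count to decide whether to flag the whole slide for manual intervention.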
In addition, clinical tests of this system were performed. The resulting confusion matrix (Figure 6B) and ROC curves (Figure 6C) show that the system was relatively stable. Moreover, we carried out a human-machine comparison to check the system's performance further. Three professional doctors with 13, 11, and 2 years of work experience, respectively, participated in the test. During the test, each doctor made diagnoses as they would under routine clinical conditions and was not informed that the purpose was a human-machine comparison. The test results, shown in Figure 6E, indicated that the prediction accuracy of the system in each category was higher than that of the specialist doctor with 13 years of experience, while the accuracy of the doctor with 2 years of experience was lower still. In particular, the accuracy for the leukoplakia category (the extent of tumor development between benign and malignant) was only about 60% (Figure S5). This indicates that identifying such types requires more extensive clinical experience, which further underscores the necessity of this system as an auxiliary diagnostic tool.

DISCUSSION
In this study, the Swin-T model was adapted to create an AI system that automatically forecasts the stage of tongue tumor development using a patch-based method, which effortlessly gets around computational memory limitations. Based on H&E-stained DWSIs, the system intelligently identifies three types of tongue tumors (malignant, benign, and leukoplakia), with a Top1 accuracy of 98.45% on ×20 magnification images and Precision, Recall, and F1-Score all greater than 0.978 (Table 1). The system remained stable under testing, and its performance in the human-machine comparison was also excellent. This study also has some shortcomings, such as the lack of data from other tumor samples, which makes it hard to verify whether the model is suitable for the corresponding analysis of other tumors; further research is required for validation. However, if suitable sample data are available, the model can be adjusted using the enhancement approaches in this work, such as the module and depth design of the model.
Pathological examination is the fundamental basis for tumor diagnosis in modern medicine, especially when combined with DWSI, which is essential for the analysis and categorization of benign and malignant tumors. By comparison, clinical physical examination and magnetic resonance imaging can only provide auxiliary references for the final diagnosis. For tongue tumors, the treatment strategy differs significantly according to the stage of tumor development. In general, malignant tumors are treated with a combination of therapeutic modalities, such as extended resection with neck lymph node dissection, radiotherapy, and chemotherapy. However, these methods have certain side effects. Chemotherapy causes bone marrow suppression, leaving patients with low white blood cell counts, low platelet counts, and anemia. Radiotherapy can lead to major systemic side effects such as radiation encephalopathy; in extreme circumstances, hemorrhage of the gastrointestinal tract and other internal organs might occur, posing a life-threatening risk. Meanwhile, extended local excision combined with neck lymph node dissection not only affects the tongue's ability to speak and feed but also significantly restricts the movement of the neck. These methods can therefore seriously impair patients' quality of life and psychological well-being, in addition to causing serious physiological damage. Lesions at the border between benign and malignant require only extended local excision, whereas benign tumors do not require extended local excision at all. For these reasons, it is critical for clinicians to accurately differentiate the stage of tongue tumor development when creating treatment regimens to increase patient survival. However, manual observation of pathological slide images relies on professional knowledge and work experience and often carries a risk of misjudgment, especially in cases where the characteristics of malignant and benign tumors differ only slightly, thus delaying treatment. Therefore, it is crucial to develop an AI-aided diagnostic system for the stage of tongue tumor development.
For focus localization of the tongue tumor, our model demonstrated good coverage in most situations with the CAM method. The regions that CAM highlighted matched the annotated features and correlated strongly with the predictions, further revealing the state of the tongue tumor microenvironment. This is particularly valuable because the delineation of the lesion area is often subjective, even among experts.

CONCLUSION
Overall, in this study an automated AI model based on the Swin-T framework was designed for multi-class classification of tongue tumor development stages from H&E-stained DWSIs, accompanied by visual analyses based on the Grad-CAM and t-SNE methods. The Top1 accuracy was 98.45% for images at ×20 magnification. A user-friendly system was created that enables AI-powered automatic categorization and can provide auxiliary diagnosis for tongue tumors.

FIGURE 1 Workflow of the current study.
FIGURE 2 (A) The sample proportion of each category. (B-D) Digital whole-slide images of hematoxylin and eosin stained tongue tumor tissue samples for benign, leukoplakia, and malignant, respectively. Blue contours mark the lesion area. (E) ×20 and ×40 magnification image styles of benign, leukoplakia, and malignant, in the green boxes, respectively.

FIGURE 3 Schematic diagram of the workflow of the model Swin-T_5S.

FIGURE 4 (A) Models' Top1 accuracy. (B,C) The receiver operating characteristic (ROC) curves of Swin-T_5S for the validation sets at ×20 and ×40 magnification, respectively. (D,E) The confusion matrices of Swin-T_5S for the validation sets at ×20 and ×40 magnification, respectively. (F) One-way ANOVA results for the random test.

FIGURE 5 (A-C) The patch heat maps of the model Swin-T_5S with the Grad-CAM method. (D-F) The self-assembly heat maps of the model Swin-T_5S with the Grad-CAM method. (G,H) Visualization of the dimensionality reduction of the 1536-dimensional (penultimate layer) and 3-dimensional (final layer) prediction spaces of the multi-layer perceptron (MLP) in the last stage on the testing set, respectively; the data labels are the true labels.

FIGURE 6 (A) Schematic diagram of the terminal page of the identification system. (B,C) The ROC curves and confusion matrix of the identification system in clinical testing, respectively. (D) Distribution of predicted probability values for the identification system; magenta and green circles are the maximum and second-highest prediction probabilities, respectively. (E) The identification accuracy of specialist doctors with 2, 11, and 13 years of experience and of the Swin-T_5S model for malignant, benign, and leukoplakia, respectively.
TABLE 1 Test results on the Swin-T_5S model validation sets for the ×20 and ×40 images.