Prognostic risk factor of major salivary gland carcinomas and survival prediction model based on random survival forests

Abstract Salivary gland malignancies are rare and are often accompanied by poor prognoses. Identifying patients with risk factors and intervening in time to avoid disease progression is therefore important. This study provides a prediction model to screen such patients and to help construct a cost‐effective follow‐up strategy. We enrolled 249 patients diagnosed with salivary gland tumors and analyzed prognostic risk factors using univariable and multivariable Cox proportional hazards regression models. The patients' data were split into training and validation sets in a 7:3 ratio; the random survival forest (RSF) model was built on the training set and evaluated on the validation set. The maximally selected rank statistics method was used to determine the cut point with the most significant relation to survival. Univariable Cox regression suggested that age, smoking, alcohol consumption, no treatment, neural invasion, capsular invasion, skin invasion, tumors larger than 4 cm, advanced T and N stage, distant metastasis, and non‐mucoepidermoid carcinoma were risk factors for poor prognosis, and multivariable analysis suggested that female sex, older age, smoking, no treatment, and non‐mucoepidermoid carcinoma were risk factors. The time‐dependent ROC curves showed that the AUCs of the RSF prediction model for 1‐, 2‐, and 3‐year survival were 0.696, 0.779, and 0.765, respectively, in the validation set. Log‐rank tests suggested that a cut point of 7.42 on the RSF risk score was the most effective in separating patients with significantly different prognoses. The prediction model based on the RSF could effectively screen patients with poor prognoses.


| INTRODUCTION
Carcinomas of the salivary glands are heterogeneous and rare, accounting for less than 1% of malignant head and neck neoplasms. [1][2][3][4] The histopathological classification of malignant salivary gland tumors is complex, 5 and it is difficult for surgeons and pathologists to determine their origin.
Carcinomas of the major salivary glands (C-MSG) occur in three sets of paired glands: the parotid, the submandibular, and the sublingual glands. The parotid is the most common site of onset, while tumors arising in the sublingual gland are usually malignant. 6 Because of their rarity and histopathologic variety, judging the prognosis is challenging. There is also an urgent need for proper and effective treatments that control the tumor and reduce facial nerve injury. 7 Furthermore, knowledge of the risk factors is important for designing reasonable follow-up strategies to prevent tumor recurrence. 8 Clinical prediction models are used extensively in current studies, and proper selection of the model is the cornerstone. 9,10 The Cox model is the most widely used model for cancer survival 11 and identifies risk factors effectively with a relatively straightforward procedure. However, as electronic patient records bring much more information, a model that is easy to automate is needed. The random survival forest (RSF) is a good example. It does not have to satisfy strict assumptions like those of the Cox model, which widens its use, and it performs better in some respects. [12][13][14][15] The random forest, on which the RSF is built, was first introduced by Breiman in 2001, 16 and the dedicated R package "randomForestSRC" 17 was developed to make the model easier to apply and to improve visualization of the results. 18 This retrospective analysis was undertaken to identify prognostic factors for the overall survival (OS) of patients with salivary gland tumors using the Cox model. For better prediction of the prognosis, the RSF model was used to screen patients at high risk.

| Patient selection
The clinical data of 249 patients with histologically confirmed C-MSG treated at Sun Yat-sen University Cancer Center between January 2000 and December 2013 were included in the study.

| Demographic and clinical variables
Information on demographic and clinicopathological features, including regional invasion, recurrence, treatment modality, and other vital information, was extracted from the hospital information system (HIS) database. Tobacco use and alcohol consumption were recorded as well. A body mass index (BMI) of less than 25 was classified as normal, and all other values as overweight. Tumor size and the presence of cervical lymph node metastases were evaluated through imaging examinations, including CT, MRI, and ultrasound. The presence of distant metastasis was assessed by X-ray or CT, bone scintigraphy, ultrasound, or PET-CT. All cases were staged according to the WHO Classification for Tumors (2005) and the AJCC TNM Staging System (8th edition) for salivary gland tumors. 5,19

| Follow-up
Information on the 249 patients with C-MSG was collected by letter, telephone, and outpatient follow-up visits. Follow-up continued until December 2018. OS was calculated from the date of definitive diagnosis to the date of death.

| Statistical analysis
Continuous variables were summarized as mean ± standard deviation. Categorical variables were presented as frequency (percentage). Univariable and multivariable Cox proportional hazards regression analyses were performed to determine the risk factors. The hazard ratio (HR) was presented with its 95% confidence interval (CI) as HR (lower limit, upper limit), and an HR less than 1.0 was regarded as indicating a protective factor. p < 0.05 was considered statistically significant. The patients' data were split into training and validation sets in a 7:3 ratio, and the survival prediction models were developed using RSF in the training set. Time-dependent receiver operating characteristic (ROC) curve analysis 20,21 was performed to evaluate the accuracy of the prediction model. The maximally selected rank statistics method was used to determine the cut point with the most significant relation to survival. The Kaplan-Meier method was used to estimate the survival distributions, and the log-rank test was used to compare OS. Figure 1 shows the flow diagram of our study. Data processing was carried out using R version 4.1.3.
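The 7:3 partition described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used R), and the seed, the function name `split_train_validation`, the use of simple unstratified random sampling, and the rounding rule are all assumptions that the text does not specify.

```python
import random


def split_train_validation(patient_ids, train_fraction=0.7, seed=2022):
    """Shuffle patient identifiers and split them into training and validation sets."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # arbitrary fixed seed so the split is reproducible
    rng.shuffle(ids)
    # Rounding down is an assumption; the paper only states a 7:3 ratio.
    n_train = int(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical identifiers 1..249 standing in for the 249 enrolled patients.
train_ids, validation_ids = split_train_validation(range(1, 250))
print(len(train_ids), len(validation_ids))
```

Under these assumptions the 249 patients divide into 174 training and 75 validation cases; the actual group sizes in the study may differ slightly depending on how the split was drawn.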

| Patient characteristics
The characteristics of the 249 patients who satisfied the criteria above in this study are summarized in Table 1.
In the whole cohort, 81.9% of the tumors arose in the parotid glands, 16.9% in the submandibular glands, and 1.2% in the sublingual glands. The proportion of male patients was slightly higher than that of female patients, while the proportions of patients with normal and overweight BMI were the same.

| Risk factors of prognosis
The univariable Cox proportional hazards regression suggested that smoking and alcohol consumption were associated with poor prognosis. Younger age and lower tumor burden, including no neural invasion, no capsular invasion, no skin invasion, tumor size of no more than 4 cm, and lower TNM stage, were protective factors. Treatment regimens that included surgery showed a significantly better prognosis, with an HR of 0.18 (0.08, 0.39) for surgery only and 0.14 (0.06, 0.32) for surgery plus radiotherapy or chemotherapy, compared with patients receiving no treatment. The HR for patients who received only radiotherapy, chemotherapy, or chemoradiotherapy was 0.95 (0.42, 2.16); because this interval includes 1, no significant benefit over no treatment was shown. The finding that the resection margin did not influence the prognosis was somewhat counterintuitive. Patients with mucoepidermoid carcinoma had a better prognosis than those with other subtypes.
Multivariable analysis showed that male sex was a protective factor, with an HR of 0.47 (0.23, 0.96) compared with female sex. In addition, smoking and non-mucoepidermoid pathological subtypes were risk factors. Patients benefited more from treatment regimens that included surgery, consistent with the findings of the univariable analysis.
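To make the reading of the reported intervals concrete, the following Python sketch back-calculates the log hazard ratio, its standard error, and an approximate two-sided p-value from an HR confidence interval. It assumes the intervals are Wald-type (symmetric on the log scale), which the text does not state; the function name `wald_summary` is hypothetical, and because the published limits are rounded to two decimals the results are only approximate.

```python
import math


def wald_summary(ci_lower, ci_upper, z_crit=1.959964):
    """Recover HR, SE of log(HR), z, and two-sided p from a 95% CI.

    Assumes a Wald interval: exp(beta - z_crit * se), exp(beta + z_crit * se).
    """
    log_lower, log_upper = math.log(ci_lower), math.log(ci_upper)
    beta = (log_lower + log_upper) / 2           # midpoint on the log scale
    se = (log_upper - log_lower) / (2 * z_crit)  # half-width divided by z_crit
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))         # two-sided normal p-value
    return math.exp(beta), se, z, p


# Male vs. female sex in the multivariable model: reported HR 0.47 (0.23, 0.96).
hr_sex, se_sex, z_sex, p_sex = wald_summary(0.23, 0.96)

# Non-surgical treatment vs. no treatment (univariable): reported HR 0.95 (0.42, 2.16).
hr_rt, se_rt, z_rt, p_rt = wald_summary(0.42, 2.16)

print(round(hr_sex, 2), p_sex < 0.05)
print(round(hr_rt, 2), p_rt < 0.05)
```

The recovered point estimates agree with the reported HRs to two decimals, and the implied p-values fall on the expected sides of 0.05, which is consistent with (but does not prove) the Wald assumption.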

| Prediction model based on random survival forest
For the RSF prediction model, Figure 2 shows how the prediction error rate changes as the number of trees increases, together with the importance of the variables. The five variables with the greatest influence on the model were M stage, age, treatment discipline, N stage, and pathology type. This was in agreement with the Cox model analysis. The time-dependent ROC curves showed that the AUCs for 1-, 2-, and 3-year survival were 0.895, 0.917, and 0.9, respectively, in the training data (Figure 3), and 0.696, 0.779, and 0.765 in the validation data (Figure 4), indicating reasonable accuracy in predicting OS, especially in the second and third years. We also built a prediction model based on the Cox model for comparison, but its performance was not as good as that of the RSF (Figures S1 and S2).
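The idea behind an AUC at a fixed time horizon can be sketched as follows. This Python example is a deliberately simplified cumulative/dynamic AUC: patients who died by the horizon are cases, patients still under follow-up after the horizon are controls, and patients censored before the horizon are simply dropped. The time-dependent ROC methods cited in the text handle censoring more carefully, so this is not the estimator used in the study. The function name and the toy data are hypothetical.

```python
def cumulative_dynamic_auc(times, events, risk_scores, horizon):
    """Simplified AUC at `horizon`; subjects censored before it are excluded."""
    cases = [r for t, e, r in zip(times, events, risk_scores)
             if t <= horizon and e == 1]
    controls = [r for t, r in zip(times, risk_scores) if t > horizon]
    if not cases or not controls:
        raise ValueError("need at least one case and one control at this horizon")
    # Count case/control pairs in which the case has the higher risk score;
    # ties count as half.
    concordant = sum(
        1.0 if c > k else 0.5 if c == k else 0.0
        for c in cases
        for k in controls
    )
    return concordant / (len(cases) * len(controls))


# Toy data (not from the study): follow-up in years, event 1 = death, 0 = censored.
toy_times = [0.5, 1.5, 2.5, 3.5, 4.0, 0.8]
toy_events = [1, 1, 0, 1, 0, 0]
toy_risk = [9.0, 6.0, 7.0, 3.0, 2.0, 5.0]

auc_2y = cumulative_dynamic_auc(toy_times, toy_events, toy_risk, horizon=2.0)
print(round(auc_2y, 3))
```

In the toy data, two patients have died by year 2 and three are still followed; five of the six case/control pairs are ordered correctly by the risk score, so the value is 5/6.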
A risk score indicating patient prognosis was calculated from the RSF model, with a higher risk score corresponding to a worse prognosis. In the training data, a cut-off risk score of 7.42 was selected using the maximally selected rank statistics from the "maxstat" R package, an outcome-oriented method that provides the cut point with the most significant relation to survival (Figure 5). We then used this cut-off point to divide the patients in the validation data into a high-risk group and a low-risk group. Kaplan-Meier curves were plotted for both groups, and the log-rank test showed that OS differed significantly (Figure 6), indicating that the model-calculated risk score could effectively identify patients with poor prognoses.
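The cut-point search can be illustrated with a simplified analogue in Python: every candidate threshold on the risk score is tried, a two-group log-rank chi-square statistic is computed for each, and the threshold with the largest statistic is kept. This is a sketch of the general idea only; the "maxstat" package uses standardized rank statistics and adjusts the p-value for the multiple cut points examined, neither of which is reproduced here. The function names, the minimum group size, and the toy data are assumptions.

```python
def logrank_chi2(times, events, high_risk):
    """Two-group log-rank chi-square statistic (1 degree of freedom)."""
    observed = expected = variance = 0.0
    for t in sorted({t for t, e in zip(times, events) if e}):
        n = sum(1 for x in times if x >= t)                      # at risk overall
        n1 = sum(1 for x, g in zip(times, high_risk) if x >= t and g)
        d = sum(1 for x, e in zip(times, events) if x == t and e)
        d1 = sum(1 for x, e, g in zip(times, events, high_risk)
                 if x == t and e and g)
        observed += d1
        expected += d * n1 / n
        if n > 1:  # hypergeometric variance of d1
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return (observed - expected) ** 2 / variance if variance > 0 else 0.0


def best_cut_point(scores, times, events, min_group=2):
    """Return the (cut, statistic) pair that maximizes the log-rank statistic."""
    best = (None, -1.0)
    for cut in sorted(set(scores))[:-1]:
        high_risk = [s > cut for s in scores]
        n_high = sum(high_risk)
        if n_high < min_group or len(scores) - n_high < min_group:
            continue  # skip cuts that leave a group too small
        stat = logrank_chi2(times, events, high_risk)
        if stat > best[1]:
            best = (cut, stat)
    return best


# Toy training data (not from the study): higher scores die earlier.
toy_scores = [1, 2, 3, 4, 8, 9, 10, 11]
toy_times = [10, 9, 8, 7, 3, 2, 1, 4]
toy_events = [1, 1, 1, 1, 1, 1, 1, 1]

cut, stat = best_cut_point(toy_scores, toy_times, toy_events)
print(cut, round(stat, 2))
```

Once a cut point has been chosen on the training data, it is held fixed and applied unchanged to the validation data, which mirrors how the cut-off of 7.42 is used in the text.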

| DISCUSSION
Improving local control of the tumor and the OS rate is the most crucial issue for doctors to consider. This study was performed to identify the factors associated with outcomes and to develop a model to screen patients at high risk. Age has been reported as a significant factor for OS in C-MSG. [22][23][24] We likewise found that younger age is a protective factor, as is the case for most head and neck cancers, especially papillary thyroid tumors. Gender did not affect the prognosis in previous studies, [23][24][25] which conflicts with our finding that male sex was a protective factor in the multivariable analysis. Possible explanations are bias due to the small size of the study, or the larger number of variables we controlled for in the investigation, since the univariable analysis also showed no significant result.

Tumors greater than 4 cm have been reported to be related to a worse prognosis, 6,24,26 which was consistent with our finding. Tumor size is usually an indicator of tumor load and is associated with the difficulty of complete resection. We also found that nerve invasion, capsular invasion, and skin invasion significantly influenced the prognosis in the univariable analysis. These results highlight the crucial role of surgery in the treatment discipline. Both the univariable and multivariable analyses indicated that patients whose treatment included surgery had a better prognosis, suggesting that surgery is still the mainstay in the treatment of C-MSG. On the other hand, a counterintuitive finding was that the resection margin did not show a significant difference in the analysis. This discrepancy could be attributed to adjuvant radiotherapy. Chris reported that postoperative radiotherapy (PORT) significantly improved 10-year local control in patients with incomplete resection (82% vs. 44%), 27 and the UK national multidisciplinary guidelines also recommend that adjuvant radiotherapy be considered in cases with an incomplete or close resection margin. 28 A large-scale study based on the National Cancer Database also reported that a positive resection margin indicated a poor prognosis and that PORT could improve the outcome significantly. 29 However, the baseline characteristics in the above studies were unbalanced, so prospective evidence is still needed. Altogether, surgery plus adjuvant radiation and chemotherapy may be a better strategy for patients with advanced tumors, including those with capsular invasion, size over 4 cm, and incomplete or close resection.

FIGURE 1 Flow diagram of the present study.

FIGURE 2 Results of the random survival forest.

FIGURE 5 Cut-off risk points calculated using the maximally selected rank statistics from the "maxstat" R package (an outcome-oriented method providing a value of a cut point that corresponds to the most significant relationship with survival).

FIGURE 6 Log-rank test of survival analysis between high- and low-risk groups in the validation data set.
The type of pathology also played an important role in our analysis. Patients with mucoepidermoid carcinoma had a better prognosis than the others, not only in the univariable and multivariable Cox analyses but also in the RSF model. This is consistent with a previous study, which suggests that low-grade tumors behave indolently and can be cured by surgery alone, whereas high-grade tumors may be locoregionally aggressive and often require neck dissection and adjuvant therapy. 30 The TNM staging system showed excellent capacity for evaluating the prognosis in the univariable analysis, and distant metastasis and lymph node status were also of great importance in the RSF model, broadly supporting similar findings by other researchers. 31 However, no significant result was found for TNM stage in the multivariable analysis. This may be related to the use of surgery plus adjuvant radiotherapy, which greatly improved the prognosis of patients with advanced tumors.
The RSF model performed well in the validation data set according to the time-dependent ROC analysis and was better than the Cox model, especially in predicting second- and third-year survival. Although the RSF is something of a "black box", it does not have to satisfy as many restrictions as the Cox model and is thus more widely applicable in clinical practice. The model calculates a score representing each patient's risk of death over time, so that clinicians can identify patients at high risk during a specific period and design a cost-effective, individualized follow-up strategy. We believe this work can guide clinical practice in detecting recurrence at an early stage and intervening in time. In addition, a more precise follow-up could help save limited medical resources, especially in a developing country with a large population like China.
Furthermore, to ease the use of the model, we calculated the cut point that divides the patients into groups with the most significant difference in survival. This helps to determine whether a more aggressive therapy should be given. For example, a positive resection margin appeared to have only a slight influence on survival in our analysis. Is adjuvant radiotherapy necessary for such patients when they are in poor condition? The RSF model may help to make that decision.
On the other hand, chemotherapy is not routinely recommended in the treatment of salivary gland cancer, and only a few patients in our study received chemotherapy alone or adjuvant chemotherapy following surgery. Various regimens were chosen, mainly on the basis of particular pathological types (most were tumors metastatic to the salivary gland). We believe the small number of such patients is unrepresentative and plays little part in the analysis, so we combined them into a single group to simplify the model.
The small sample size and single-center design restrict the generalizability of the study. Data from other centers will be necessary to improve the model, and collecting them is the next direction of our work. Local recurrence may be of greater concern to clinicians, and we will try to track this outcome in these patients. The RSF may not be appropriate for every clinical scenario, and no perfect model exists. Nevertheless, it works well here and provides a promising method for screening patients with poor prognoses.

| CONCLUSIONS
M stage, age, treatment discipline, N stage, and the pathology type significantly influenced the survival of the salivary gland tumor patients, and the prediction model based on the RSF could effectively screen patients at high risk of death.