A novel deep learning prognostic system improves survival predictions for stage III non‐small cell lung cancer

Abstract Background Accurate prognostic prediction plays a crucial role in the clinical setting. However, the TNM staging system fails to provide satisfactory individual survival prediction for stage III non‐small cell lung cancer (NSCLC). The performance of the deep learning network for survival prediction in stage III NSCLC has not been explored. Objectives This study aimed to develop a deep learning‐based prognostic system that could achieve better predictive performance than the existing staging system for stage III NSCLC. Methods In this study, a deep survival learning model (DSLM) for stage III NSCLC was developed based on the Surveillance, Epidemiology, and End Results (SEER) database and was independently tested with another external cohort from our institute. DSLM was compared with the Cox proportional hazard (CPH) and random survival forest (RSF) models. A new prognostic system for stage III NSCLC was also proposed based on the established deep learning model. Results The study included 16,613 patients with stage III NSCLC from the SEER database. DSLM showed the best performance in survival prediction, with a C‐index of 0.725 in the validation set, followed by RSF (0.688) and CPH (0.683). DSLM also showed C‐indices of 0.719 and 0.665 in the internal and real‐world external testing datasets, respectively. In addition, the new prognostic system based on DSLM (AUROC = 0.744) showed better performance than the TNM staging system (AUROC = 0.561). Conclusion In this study, a new, integrated deep learning‐based prognostic model was developed and evaluated for stage III NSCLC. This novel approach may be valuable in improving patient stratification and potentially provide meaningful prognostic information that contributes to personalized therapy.


| INTRODUCTION
Lung cancer is the second most commonly diagnosed cancer and the leading cause of cancer-related death, with 80%-85% patients diagnosed with non-small cell lung cancer (NSCLC). 1,2 Approximately 25%-40% of NSCLC cases are diagnosed at stage III disease, which demonstrates a more invasive clinical course and poorer prognosis than early-stage disease, with two-thirds of patients dying within 5 years. 3,4 Precise survival prediction plays a crucial role in improving risk stratification and patient survival. However, it is challenging to provide accurate prognostic information based on the currently used TNM staging system (8th edition of the tumor, node, and metastasis classification). 5 Stage III NSCLC is characterized by heterogeneous features but is only canonically classified into IIIA, IIIB, and IIIC according to the TNM staging system or simply classified into resectable and unresectable diseases in clinical practice. In addition, the wide disparities in overall survival within the same stage suggest that other factors are important, including age, smoking, histology, differentiation, and treatment choices. [6][7][8] Thus, a new comprehensive stratification system based on multiple prognostic factors is necessary for stage III NSCLC.
Various methods have been proposed to build a prognostic model by integrating multiple factors. Traditionally, most prognostic models have been based on regression models. For example, the Cox proportional hazard model (CPH) has been widely employed for such prognostic tasks, among which the nomogram is used to quantify risk and derive the risk probability of a specific event by combining significant clinical characteristics. [9][10][11] Yet, these models can only detect the linear relationships between features and outcomes through fitting linear models. Consequently, a machine learning method with less restrictive model assumptions, namely, the random survival forest (RSF), has become attractive as a predictive tool. 12,13 However, accurate prediction via RSF heavily relies on the selection and parametrization of features based on prior biological knowledge. Moreover, RSF is considered to have a high risk of overfitting when analyzing large datasets and is unable to extrapolate the data. 14 Deep learning-based methods can handle large, complex, and heterogeneous data, thus allowing the solution of problems that cannot be properly addressed by traditional approaches. [15][16][17] These methods have been applied successfully in biomedicine, including medical image segmentation and reading, 18,19 pathological diagnoses, 20,21 and cancer genomics. 22,23 In addition, the utility of these frameworks for survival prediction also has been examined. For example, Lee et al. 24 demonstrated that DeepHIT, a deep neural network that directly learns the time-to-event distributions without making assumptions regarding the underlying stochastic process, could serve as a useful tool for survival prediction through a set of experiments conducted on a cohort of 5883 patients with adult cystic fibrosis. In another study, Kim et al. 25 developed a deep learning network model for oral squamous cell carcinoma based on DeepSurv and achieved high C-indices of 0.810 and 0.781 for the training and testing sets, respectively. However, no large-sample research has focused on prognostication in patients with stage III NSCLC.
In this study, a deep survival learning model (DSLM) for stage III NSCLC survival prediction was developed using data from the Surveillance, Epidemiology, and End Results (SEER) dataset, and showed better performance compared to CPH and RSF. Moreover, an external test was performed on an independent cohort of 172 patients from Shandong Cancer Hospital and Institute. Lastly, we proposed a new prognostic system based on the DSLM, which achieved significant improvement in stage III NSCLC prognostication.

| Data source and study population
We selected patients from the SEER database which collects cancer incidence and survival data from populationbased cancer registries covering approximately 48% of the United States. A query of the SEER database was performed for primary lung and bronchus cancer using the "International Classification of Disease for Oncology, Third Edition (ICD-O-3)." This study enrolled patients with pathologically confirmed primary stage III NSCLC diagnosed between 2012 and 2015. Cases of multiple tumors, unknown tumor size, and incomplete follow-up information were excluded. A real-world dataset of newly diagnosed stage III NSCLC, managed at the Shandong Cancer Hospital and Institute between January 2012 and December 2016, was also retrospectively collected following approval by the institutional review board of the Shandong Cancer Hospital and Institute.

| Clinical information and data processing
Among eligible cases, patient demographics, baseline information, tumor characteristics, first course of treatment, and follow-up information were collected. The TNM stages of all eligible patients were redefined according to the 8th edition of the TNM staging to reflect current guideline recommendations. We removed strongly correlated variables using correlation matrix analyses. Overall survival, defined as the time between diagnosis and loss to follow-up or death from any cause (loss to follow-up and death were assigned values of 0 and 1, respectively), was used as the response variable for the survival analyses.

| Deep survival neural network
The DSLM in this study was developed based on the neural multitask logistic regression model (N-MTLR). 26 This model was used to transform patient follow-up time into a series of time points with the time point corresponding to the event (death) was assigned a value of 1. First, DSLM was applied to the stage III NSCLC dataset to make the prediction. Thereafter, patient samples were classified into three subgroups with different prognostic statuses, namely high-risk, middle-risk, and low-risk, according to the values of the risk scores.

| Baseline survival models for comparison
The baseline models including CPH and RSF were compared to the deep learning model for survival prediction in the provided dataset. CPH is a semiparametric model for survival analysis which estimates the risk function of the event occurring for patients based on observed features using a linear function. RSF is a nonlinear machine learning model used to analyze right-censored survival data by generating ensemble estimates for the cumulative hazard function based on a nonparametric tree-ensemble method. 12 In this study, the RSF model was constructed with a tree count of 200, depth of two variables, and minimum node size of 5. Furthermore, all potential risk factors were ranked by their importance via RSF (Table S1).

| Evaluation of models and statistical analyses
Model performance was evaluated using the concordance index (C-index). The integrated Brier score (IBS), another measure for assessing predictive performance, was calculated to assess the overall error values across multiple time points. Generally, a useful model has an IBS below 0.25, with lower values indicating better predictive performance. Receiver operating characteristics and the area under the curve (AUROC) were used to compare predictive accuracy. Log-rank tests were used for the comparison of different curves. A p value <0.05 was considered statistically significant. The training, validation, and testing procedures for the deep learning models were conducted with Pytorch. The Python (R Foundation for Statistical Computing) scikit-learn and pandas packages were used for data analysis and the Python matplotlib was used for plotting graphs. The STATA statistical software package (StataCorp, 2015, Stata: Release 14. Statistical Software) was used for other conventional statistical analyses.

| Patient characteristics
According to the inclusion criteria, 16,613 patients with primary stage III NSCLC between 2012 and 2015 in the SEER database were enrolled in the study. Fifteen variables were selected for further analyses, including year of diagnosis, age, sex, race, T stage, N stage, primary site, laterality, histology, differentiation, number of positive lymph nodes, visceral pleural invasion, surgery to primary site, chemotherapy and radiotherapy. The correlation matrix was plotted for 15 selected variables ( Figure 1)

| Training and validation of the DSLM
The whole SEER dataset was split into training (n = 10,795, 65%), validation (n = 2907, 17.5%), and testing (n = 2907, 17.5%) sets, with five duplicates were dropped. The final deep learning network comprised four layers with 70 neurons in each layer. The ReLU activation was used for each layer. During training, we used the Adam optimizer to improve robustness. Grid searching was used to select optimal hyperparameters. Moreover, L2 regularization and dropout were used to prevent overfitting. Finally, DSLM achieved the best performance with a learning rate of 0.001, L2 regularization of 1e-2, L2 smoothing of 1e-2, and dropout rate of 0.2. The values of the loss function decreased from 44,000 to 32,988 after 2000 iterations ( Figure S1).
DSLM showed the best performance in survival prediction with a C-index of 0.725 and IBS of 0.14 in the validation set (Figure 2A), followed by RSF (C-index = 0.688, IBS = 0.16, Figure S2A) and CPH (C-index = 0.688, IBS = 0.17, Figure S2B). Additionally, the calibration curves of comparison between actual and predicted for DSLM yielded a median absolute error of 3.729 and a mean absolute error of 5.172 in the validation set ( Figure 2B). To validate this model and address potential overfitting, DSLM was tested with an independent testing set. A C-index of 0.719 and IBS of 0.15 ( Figure 2C) were still achieved. Additionally, the calibration curves of DSLM showed better accuracy (the mean absolute error of 5.228) in the testing set ( Figure 2D) than RSF (the mean absolute error of 7.092, Figure S3A) and CPH (the mean absolute error of 8.087, Figure S3B). Overall, these metrics confirmed the high performance and discriminatory ability of the deep learning-based prognostic model.

| Evaluation of the DSLM in an external testing set
To validate the general applicability of this deep learning-based prognostic model, DSLM was tested

| New prognostic system based on DSLM
Given a set of the selected features for each patient, the prognostic model was used to calculate a risk score for patients from the testing set, with a higher risk score indicating a worse prognosis. Thereafter, patients were divided into three subgroups with different prognostic statuses, namely high-risk, middle-risk, and low-risk ( Figure 3A). Figure 4 shows the survival curves of the test cohort for the new prognostic system (4A) and the TNM staging system (4B), where the patients showed significantly different survivals in the predicted high-, middle-, and low-risk groups (log rank test, p < 0.05). Additionally, AUROC curves were generated to assess the model performances. The TNM staging system yielded an AUROC of 0.561 ( Figure 4D), while the performance of the new prognostic system based on DSLM was significantly improved to 0.744 ( Figure 4C), compared with 0.721 for RSF ( Figure S4A) and 0.658 for CPH ( Figure S4B). In addition, DSLM could represent follow-up time as a series of time points, thus providing accurate survival prediction. For example, survival curves with significantly different slopes for three patients randomly selected from the testing set reflected the difference in death risks between these patients ( Figure 3B).

| DISCUSSION
To our knowledge, this study is the first to propose a deep learning-based prognostic model for survival prediction in patients with stage III NSCLC. Our study was driven by the desire to develop a useful method to make accurate F I G U R E 4 Prognostic performance of DSLM and a conventional staging system. (A) Kaplan-Meier curves for patients stratified with the new prognostic system showed a significant survival difference (log rank test, p < 0.05); mortality for patients in middle-and high-risk groups increased 2.4-and 6.0-fold, respectively, relative to patients in the low-risk group. (B) Survival curves for patients stratified by the TNM staging system also achieved significance (p < 0.05), with mortality for patients with stage IIIB and IIIC disease increasing 1.3-and 1.7fold, respectively, relative to stage IIIA disease. (C) ROC curve for the new prognostic system. (D) ROC curve for the conventional staging system. DSLM, deep survival learning model; ROC, receiver operating characteristics; TNM, tumor node metastasis survival prediction and personalized treatment recommendations for patients with stage III NSCLC, which is inadequate with the currently used TNM staging system as it considers only limited variables. Yet, deep learning approaches provide a unique opportunity to cope with this challenge. The remarkable features for stage III NSCLC are of great heterogeneity, including patients with T1-T4 and N0-N3, with diverse treatment strategies, pathological type, Karnofsky Performance Score, and so on. Thus, the predictive model integrated information from the distinct sources may reflect the multiple characteristics of patients and improve the predictive capability compared to TNM stage alone. 11,27 CPH is the most commonly used model for prognosis prediction of cancer patients. However, linear correlation is the essential precondition for CPH, and multidimensional parameters were commonly presented as nonlinear correlation in real world study. 28,29 Thus, deep learning, which can reveal the complex nonlinear association of parameters, provide a unique opportunity for the accurate prognosis prediction of patients with stage III NSCLC with diverse information.
As reported, Ryu SM, et al. 30 has developed and validated a deep survival neural network for survival prediction of patients with spinal and pelvic chondrosarcoma (n = 1088), using clinical features extracted from the SEER database. The results indicated the significant increase of the sensitivity for survival prediction using deep learning model when compared to the conventional logistic prediction model. In another study, Song Y et al. 31 used four different algorithms on cases from the SEER database (n = 3944) to train the prognosis prediction models for pancreatic neuroendocrine tumor (PNETs), among which the deep learning model showed the best predictive accuracy than logistic regression (LR), support vector machines (SVM), and random forest (RF). And the deep learning model was also found to perform better than the American Joint Committee on Cancer (AJCC) staging system (AUC = 0.87 vs. 0.76) in further comparison. It should be noted that external validation in real-world data, which is especially important for the applicable and robustness for models, was missing in these studies.
Here, we focused on a highly heterogeneous group of patients with stage III NSCLC and trying to develop the prognosis prediction model using deep learning method. A large cohort of consecutively diagnosed stage III NSCLC cases was selected for model training and validation (n = 16,613) to provide more reliable results. In order to improve the unstable performance of deep learning, which is the restriction for the clinical application of these models, we further validated the robustness and general applicability of the model in an independent realworld external dataset with the C-index of 0.665, which suggested that it may be suitable for patients from different regions. Finally, our study found the performance of the new prognostic system was significantly superior to the TNM staging system (AUROC = 0.744 vs. 0.561), indicated that deep learning approaches are particularly suitable for processing survival information of highly heterogeneous and complex patient cohorts, while the traditional TNM staging system performs relatively poorly in such prognostic task of this patient cohorts. More importantly, our study could further provide the individualized mortality risk and survival time predictions rather than simple risk stratification, which is convenient and efficient for clinical practice.
Compared with other prognostic models, the deep learning-based prognostic model has several advantages. First and notably, deep learning can model nonlinear risk functions and is advantageous when examining real-life factors with complex nonlinear relationships. Second, deep learning models are able to not only directly learn feature representations from raw clinical data without making assumptions, but also fit missing data using nonlinear risk functions. Missing data pose a challenge when trying to draw clinical inferences using traditional statistical models, and may lead to serious selection bias. Third, deep learning methods are considered less vulnerable to overfitting when a large feature set is used because they are able to learn feature representation; however, the inclusion of multiple variables in conventional linear regression models may result in overfitting, which implies that a deep learning method would be more appropriate in biomedical research.
The work presented in this study lays the foundation for further research. Apart from the 15 selected factors used for stage III NSCLC survival prediction, DSLM can be expanded to integrate increasingly complex data, such as radiomics and genomics data, and predict other clinical outcomes, such as drug response and progression-free survival (PFS). For example, immune checkpoint inhibitors (ICIs) have revolutionized cancer treatment, showing significant efficacy in NSCLC patients. The strength of the deep-learning model may be beneficial in improving further patient selection and providing meaningful prognostic information that may be utilized in making personalized cancer treatment plans. Furthermore, although it was developed using stage III NSCLC, this deep learningbased method can easily be generalized to other cancer types.
A limitation of this study is that the deep survival neural network has not been prospectively tested in clinical settings; however, the internal and external validation suggest the model's stability and reproducibility. Additionally, despite the black-box nature of the endto-end deep learning network, this study was successful and demonstrated that a deep learning-based model significantly outperformed existing prognostication methods in stage III NSCLC. Inevitably, interpreting a deep learning network is challenging in clinical practice; thus, in future studies, providing clinically meaningful interpretations for deep learning models would be valuable.

| CONCLUSIONS
This study demonstrated that a deep learning model may be useful for survival prediction in patients with stage III NSCLC as the model significantly outperformed existing prognostication models. This evidence suggests the utilization of this novel approach in patient stratification and personalized treatment decision-making.