Improving the overall survival prognosis prediction accuracy: A 9‐gene signature in CRC patients

Abstract Colorectal cancer (CRC) is a malignant tumor and morbidity rates are among the highest in the world. The variation in CRC patients' prognosis prompts an urgent need for new molecular biomarkers to improve the accuracy for predicting the CRC patients' prognosis or as a complement to the traditional TNM staging for clinical practice. CRC patients' gene expression data of HTSeq‐FPKM and matching clinical information were downloaded from The Cancer Genome Atlas (TCGA) datasets. Patients were randomly divided into a training dataset and a test dataset. By univariate and multivariate Cox regression survival analyses and Lasso regression analysis, a prediction model which divided each patient into high‐or low‐risk group was constructed. The differences in survival time between the two groups were compared by the Kaplan–Meier method and the log‐rank test. The weighted gene co‐expression network analysis (WGCNA) was used to explore the relationship between all the survival‐related genes. The survival outcomes of patients whose overall survival (OS) time were significantly lower in the high‐risk group than that in the low‐risk group both in the training and test datasets. Areas under the ROC curves which termed AUC values of our 9‐gene signature achieved 0.823 in the training dataset and 0.806 in the test dataset. A nomogram was constructed for clinical practice when we combined the 9‐gene signature with TNM stage and age to evaluate the survival time of patients with CRC, and the C‐index increased from 0.739 to 0.794. In conclusion, we identified nine novel biomarkers that not only are independent prognostic indexes for CRC patients but also can serve as a good supplement to traditional clinicopathological factors to more accurately evaluate the survival of CRC patients.


| INTRODUCTION
Colorectal cancer (CRC) tops the list of high-risk cancers that endangered public life over the past several decades, especially in the western developed countries. 1,2 It is estimated that as of January 1, 2019, more than 1.5 million men and women have been diagnosed with CRC in the United States and 145,600 new cases will be diagnosed in 2019. 3 Therefore, a novelty intervention or blocking-up strategy of this disease appears to be particularly important against the CRC for patients to live a longer or a high-quality life. The survival time of patients with CRC varies greatly in different stages. For example, the 5-year survival rate of patients with stage I and II patients is 91 and 82%, respectively, while those of stage IV CRC are approximately 12% [3]. Over the past several decades, the mortality rates of CRC patients have reduced moderately due to early diagnosis and treatment. [3][4][5] However, there is no effective quantitative prognostic index for CRC patients. In current clinical practice, clinicians judge the prognosis of patients with CRC mainly based on the tumornode-metastasis (TNM) staging and clinical characteristics of the disease. 6 Vague judgment criteria not only aggravate the psychological burden of patients but also increase the confusion of clinicians to a certain extent. Therefore, there is an urgent need for an accurate and highly feasible prediction tool in which clinicians can serve patients more confidently.
To our knowledge, several researchers have proposed a variety of strategies that can better predict the survival of patients with CRC. For example, Bingrong et al identified an effective 6-lncRNA signature with good performance that divides CRC patients into high-and low-risk groups that have significantly different survival times. 7 However, they have not further proven its clinical application and the highest area under the curve (AUC) value of their signature was 0.733, indicating that the stability and reliability of the model are at an average level. Zheng Zhou et al identified a 5-autophagy-related gene signature that can divide CRC patients into low-and high-risk groups and developed a nomograph to guide individualized treatment. 8 Unfortunately, the AUC value only reached 0.58 when they tried to demonstrate the performance of their formula in other datasets, indicating that the reliability of the model is not optimal. We observed that the same score in the nomograph chart corresponds to the same 1-year and 5-year survival rates, which are unreasonable. Therefore, the purpose of our research emerged and we aimed to construct a more stable and effective signature that can effectively predict the OS of CRC patients to give patients a more accurate answer in clinical practice.
In the present study, by a large CRC cohort from The TCGA database, we developed a credible expression-based 9-gene signature that divides patients into high-and low-risk groups with significant differences in OS. In the validation process, the signature in our study also reflected high reliability in the test dataset. Combined with clinicopathological factors such as the TNM stage, which has been used in the clinic, the signature can provide a more accurate prediction for individual patients during clinical work.

| Data sources and research design
We downloaded the HTSeq-FPKM gene expression data and corresponding clinical information of all postoperative CRC patients prior to October 2019 from the official website of the TCGA database (https://www.cancer.gov). A total of 530 samples was obtained in the aggregate dataset, including 42 normal tissue samples and 488 CRC tissue samples. After the exclusion of patients with incomplete survival information, a total of 467 patients with complete follow-up data was included. A GEO dataset GSE10 3479 downloaded from the official website (https://www.ncbi.nlm.nih.gov/geo/) including 156 postoperative drug-resistant stage II and stage III CRC samples was used during the further validating process (Table 1). In this study, initially, genes with an average expression level lower than 0.1 in all samples were excluded and batch survival analysis by which 1464 genes discovered to be statistically significant (p value < 0.05) was conducted to screen the genes associated with patients' OS using the "survival" package in R (version 3.6.0). Univariate Cox regression analysis was used to further identify genes that were screened by previous batch survival analysis associated with patients' OS and we set the filter criteria as HR>1.5 or HR<0.8 and p value < 0.01. Finally, 91 eligible genes were obtained. Then, we used the "caret" package in R to randomly divide 467 samples into two groups, in which 235 samples were allocated to the training group and 232 samples were allocated to the test group.
In the training group, 91 genes were rescreened using univariate Cox regression analysis and 62 genes with p value < 0.05 were obtained. Then, we continued to perform Lasso regression analysis of these 62 genes in the training group to omit the synergistic genes and obtained 17 genes. Finally, by using multicox regression analysis on these 17 genes in the training group, a 9 genes prognosis prediction model was established to predict the OS rate of CRC patients. The stability and effectiveness of the model were verified in the test group and validation dataset. Details in this research are shown in Figure 1

| Construction of prognostic signature
A multivariate Cox proportional hazards regression model was established and used to evaluate the pertinence between the expression value of the 9-gene signature and CRC patients' OS.
The risk score of each patient was constructed by weighting the expression value of prognostic genes according to their multivariate Cox regression coefficients. The formula is shown below: where n is the number of prognostic genes, Exp i is the expression value of gene i, and C i is the regression coefficient of gene i in the multivariate Cox regression analysis. The risk scores of each patient were distributed into two groups, high and low-risk groups, by setting the median value as the demarcation point.
The "survival" R package was used to distinguish the survival differences between the high-and low-risk groups. The timedependent receiver operating characteristic (ROC) analysis based on the "SurvivalROC" R package was used to assess the predictive stability and effectiveness of the 9-gene signature in each group. To verify whether this 9-gene signature can serve as an independent prognostic index compared with the existing traditional clinicopathological indicators for evaluating prognosis, we performed a multivariate regression analysis that takes into account our 9-gene signature, TNM stage, age, sex, and so on. Finally, a nomogram and related calibration curves were established based on CRC patients for further clinical application.

| WGCNA
To explore the relationship between the nine genes in our signature and all the survival-related genes mentioned above, the weighted gene co-expression network analysis (WGCNA) without clinical traits was analyzed. By selecting 5 as the soft threshold, a weighted gene co-expression network was constructed by the R "WGCNA" package with approximate scale-free properties. A highly synergistic matrix of genes was determined by the relevance among the expression values of all those genes. The network module was produced by the topological overlap measurement (TOM) 9 and using the dynamic hybrid cutting method (a bottom-up algorithm) to identify the co-expression gene modules. 10 Eventually, the modules with related genes were merged. The correlation between genes and modules was measured by calculating gene significance (GS) and module significance (MS). The target genes in the related modules were visualized by Cytoscape 3.6.1 software.

| Statistical analysis
The survival-related genes and survival differences between the high-and low-risk groups in the training dataset and test dataset were obtained by the "survival" R package through the Kaplan-Meier method and compared by the log-rank test. Lasso regression analysis was performed by the "glmnet" R package to eliminate synergistic genes. The random grouping was completed by the "caret" R package. Time-dependent ROC curve was used to describe the sensitivity and specificity of survival prediction based on the risk score generated by the "survivalROC" R package. Multivariate Cox regression analysis was used to show the independent predictability of the signature and generate a nomogram. The calibration curve was used to evaluate whether patients' actual survival was consistent with that predicted by the nomogram. The nomogram and calibration curve were generated by the "rms" R package. The "pec" R package was used to compare the concordance index between our nomogram and nowadays TNM stage added with age through C-index function. All statistical tests were two-sided and p < 0.05 was regarded as statistically significant. The statistical analysis was performed in R software (version 3.6.0).

| Gene selection and prognostic signature construction
In our study, 1464 genes correlated with OS in patients diagnosed with CRC were obtained through the Kaplan-Meier method. Univariate Cox proportional hazards regression analysis was performed on these 1464 genes and 224 of them were found to be statistically significant with a p value lower The workflow of the identification of CRC OS-related 9-gene signature than 0.05. Setting the filtered standard as HR>1.5 or HR<0.8 and p value < 0.01, 91 genes were selected for further research. A total of 467 CRC patients was divided randomly into two groups: a training group (n = 235) and a test group (n = 232). Univariate Cox regression analysis was used again to verify the effect of the 91 genes on the OS of CRC patients in the training group and 62 genes achieved a statistical significance. To improve the interpretability and prediction accuracy of the regression model and to solve the problem of collinearity of variables, we performed Lasso regression analysis of the 62 genes in the training dataset and the results showed that 17 genes were screened out as the basis for our further construction of a model to predict the OS of CRC patients ( Figure 2). Finally, multivariate Cox regression analysis in the training group was carried out on those 17 genes, and a signature that predicted the OS of patients with CRC consisting of 9 genes was established. The overall information of these nine genes is shown in Table 2.

| 9-gene prognostic signature verification and comparison
Based on multivariate Cox analysis and obtained regression coefficients, a prognostic signature was established, by which patients' risk score was calculated. Patients in both the training and test datasets were divided into high-risk and low-risk groups by setting the median risk score as a demarcation point. The risk score distribution, gene expression, and CRC patients' survival status of the two risk groups in the training and test datasets are shown in Figure 3. The results in both the training and test datasets showed that the high-risk groups lives obviously shorter than the low-risk groups reflected in the OS time. The time-dependent ROC curve showed that the 9-gene signature reached AUC values of 0.823 in the training dataset and 0.806 in the test dataset (Figure 4), indicating a substantially effective performance of the OS prediction.
Of the nine genes, seven were related to a high risk of CRC (KLC3, LINC00634, NHLRC4, C11orf45, ZKSCAN2, KIF7, and ODF3L2; HR > 1), and two seemed to be protective genes (KIF9 and SALL1; HR < 1). We examined and compared the differential expression of the nine prognostic genes between CRC tumor tissues and normal tissues and between the highrisk and low-risk groups. Except for KIF7, the remaining genes were differentially expressed between tumor and normal tissues and were statistically significant. Patients with high-risk scores are more likely to express risk-related genes, while patients with low-risk scores tended to express protective genes. The expression varieties of these genes are shown in Figure 5.
To explore the effectiveness of the predicting signature constructed by those nine genes, after performing the Cox proportional hazards regression analyzes, GEO dataset GSE10 3479 showed a significant difference in the OS time between high and low-risk groups ( Figure 6B). Further exploring work was performed to ensure whether our 9-gene signature was valuable for more research, we compared the signature with a 5-autophagy-related gene model constructed by Zhou mentioned above on distinguishing OS time. Unfortunately, Zhou's signature was unable to distinguish CRC patient's OS time in dataset GSE10 3479 ( Figure 6A). The stability of our 9-gene  Figure 6C).

9-gene signature are independent of other clinicopathological factors
To assess whether the 9-gene signature can be regarded as an independent prognostic factor for CRC patients' OS predicting, multivariate Cox regression analysis was executed in a stepwise manner on the whole cohort. Covariables included risk score (high vs. low) and clinicopathological factors such as age, sex, and TNM stage. The results showed that stage (HR = 2.442, 95% CI = 1.845-3.231, P = 4.17E-10) and risk score (high vs. low) (HR = 4.393, 95% CI = 2.314-8.339, P = 6.01E-06) could be used as independent prognostic factors (Table 3).
Furthermore, we also confirmed the expression level of these nine genes in CRC patients with different TNM stages and found that there were no significant relationships between them (Figure 7). These results proved that our 9-gene signature' predictive ability was not related to other conventional clinicopathological factors for predicting OS in CRC patients.

9-gene signature to construct a nomogram for clinical practice
To apply our findings in clinical practice, we combined the 9-gene signature with traditional clinicopathological factors such as TNM stage and age to more accurately predict the OS of patients with CRC based on multivariate Cox regression analysis on the entire cohort. To confirm that the accuracy of predicting the OS of CRC patients was improved after adding our 9-gene signature compared with only those two clinicopathological factors, we calculated the C-index before and after the addition of our 9gene signature. The comparison results elucidated by the C-index increased from 0.739 (95% CI = 0.676-0.802) to 0.794 (95% CI = 0.737-0.851), indicating an obvious increase in C-index after adding our signature ( Figure 8A). Calibration curves showed the degree of consistency between the predicted 3-and 5-year survival and the actual survival in all patients ( Figure 8B-C). Finally, a nomogram was developed to score each CRC patient based on the factors in the graph ( Figure 8D).

| Co-expression relationship between all survival-related genes and the nine genes in our signature based on WGCNA
To explore the possible co-expression relationship between the genes in our signature and other survival-related genes, we performed WGCNA without clinical traits. With a scale-free network and topological overlaps, by selecting 5 as the soft threshold and merging similar modules, we generated a hierarchical clustering tree, and 19 corresponding gene modules were identified ( Figure 9A,B), in which the branches of the tree on behalf of different gene modules. The non-co-expressed genes were disposed of in the "gray" module, which was not further analyzed. The relationships of the 19 modules were analyzed and are shown in Figure 9C. The target gene and its co-expressed genes were identified from each module and visualized as networks in Cytoscape ( Figure S1-6). Although the weight values between the genes are relatively low, to some extent, they may still provide clues for fundamental experimental exploration in the future.

| DISCUSSION
CRC is a disease with high intercellular heterogeneity, 11 which makes traditional clinical predictors such as age, sex, and tumor stage insufficient to accurately predict the survival time of CRC patients. 12 With the development of genomewide sequencing technology, the changes in the genome in CRC are gradually being elucidated by scientists, which also facilitate discovering a series of prognostic and predictive signatures, [13][14][15] making it possible to make targeted personalized treatment decisions. In this study, by transcriptome expression data from CRC patients, a robust 9-gene prognostic signature was developed and verified based on the batch screening of genes related to OS in CRC patients. Of the nine genes (KLC3, LINC00634, NHLRC4, C11orf45, ZKSCAN2, KIF9, KIF7, SALL1, and ODF3L2) in our signature, researchers have proven that SALL1 can act as a tumor suppressor by recruiting NuRD to induce tumor cell senescence in breast cancer during which it is controlled by the MAPK and mTOR signaling pathways. 16 KLC3 is the only known kinesin light chain (KLC) expressed in postmeiotic male germ cells and researchers have found that KLC3 could play a microtubule-independent role during the formation of sperm tails. 17 However, to our knowledge, the role of KLC3 in cancer has not been studied. L. Gomez et al proved that C11orf45 lies within the first intron of KCNJ5 which is associated with Tourette syndrome and attention-deficit/hyperactivity disorder. 18 Unfortunately, the function of C11orf45 in the development of tumors has not been studied. Gli2, the major transcriptional activator of Hh signaling, whose localization and activity can be promoted by KIF7, was reported to function in skin development and tumor suppression. 19 It has been mentioned that KIF9 may be related to the occurrence of breast cancer. 20 For the other genes in our signature, after searching, we found that they have yet to be studied by scientists. In our research, we found a distinction in the expression levels of these nine genes except for KIF7 between tumor and normal tissues ( Figure 5A), indicating that there is great value in exploration.
The 9-gene signature can classify CRC patients into highand low-risk groups with a significant difference in survival time. It is worth noting that when we tried to prove the reliability of this 9-gene signature, the results of ROC analysis proved that the AUC values in both the training dataset and test dataset were greater than 0.80 and in the validation dataset it was 0.651, indicating that the nine-gene combination can be regarded as a reliable and efficient prognostic indicator for CRC patients. Further comparison in the validation dataset showed that the 9-gene signature was better than Zhou's 5-gene signature and had an equivalent efficiency as the TNM stage in predicting CRC patients' 5 years' OS time verified by AUC values. It should be mentioned that the validation dataset is a collection of drug-resistant stage II and stage III CRC patients, which suggest that our signature may also benefits drug-resistant CRC patients with a specific treatment based on the risk score calculated on those nine genes. What is more, the multivariate Cox regression analysis showed that it was independent of other clinical factors, such as age, sex, and stage. TNM stage is the current gold standard for evaluating the prognosis of CRC patients. 3,21 It is commendable that when we combined the 9-gene signature with TNM stage and age to evaluate the survival time of patients with CRC, the C-index increased from 0.739 to 0.794, and the comparison results in Figure 8A show a significantly increase in C-index after adding the signature, indicating that the 9-gene signature can be utilized not only as an independent prognostic factor but also as a good supplement to traditional clinicopathological factors to more accurately evaluate the survival of CRC patients.
In conclusion, the 9-gene signature screened out based on batch survival analysis can be used not only as an independent F I G U R E 8 (A) The comparison of C-index after and before adding the risk grade based on the 9-gene signature to TNM stage, age, and gender. (B-C) A demonstration of the satisfaction between the predicted and actually 3-and 5-year survival in all patients through calibration curves. (D) A nomogram constructed by multi-cox regression analysis on risk, TNM stage, age, and gender to apply the 9-gene signature in clinical practice prognostic index for CRC patients but also as a favorable supplement to traditional clinicopathological factors for clinicians to more accurately and effectively evaluate the survival of CRC patients.