Identification of a novel 15‐gene expression signature predicting overall survival of human colorectal cancer

So far, no multigene expression signature is available for CRC in China, where the tumor incidence is rapidly increasing. [...] The RT-PCR-based 7-gene Oncotype DX colon panel is utilized as a clinical tool for the prediction of recurrence risk for stage II and III CRCs, but there is still an urgent need to develop better and more robust gene signatures in this area. In this study, we developed a robust and clinically applicable 15-gene prognostic signature for colorectal cancer (CRC) patients using our multistep bioinformatics analysis strategy.

[...] GSE17536, and the log-rank test was used to determine statistical significance. GSE28722 and GSE39582 were used as independent validation datasets.

Gene Expression-Based Signature and Prognostic Risk Score Development
To develop a multigene prognostic signature, we first generated the training sets by performing 100 random selections of 373 patients from TCGA-COAD. The patients remaining after each selection were used as test sets. A forward-conditional Cox regression with the 78 genes significantly associated with OS in GSE17536 was carried out using SPSS (IBM, version 24) to further select genes independently associated with OS in each training set. The frequency with which each gene was selected by the forward-conditional Cox regression across the 100 training sets was calculated. Genes were ranked by this frequency, and we used a concordance statistic to determine the optimal number of genes in the prognostic signature [1].
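The resampling and frequency-ranking step can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the study: `select_genes` is a hypothetical callable standing in for the forward-conditional Cox regression, and the function name, seed, and return values are assumptions made for the example.

```python
import random
from collections import Counter

def selection_frequency(patient_ids, candidate_genes, select_genes,
                        n_train=373, n_splits=100, seed=0):
    """Count how often each gene is selected across random training splits.

    select_genes(train_ids, candidate_genes) is a HYPOTHETICAL stand-in for
    the forward-conditional Cox regression; it must return the genes kept
    for that training set.
    """
    rng = random.Random(seed)
    counts = Counter()
    splits = []
    for _ in range(n_splits):
        # Draw one training set; the remaining patients form the test set.
        train = set(rng.sample(patient_ids, n_train))
        test = [p for p in patient_ids if p not in train]
        splits.append((sorted(train), test))
        # Count each gene at most once per training set.
        counts.update(set(select_genes(sorted(train), candidate_genes)))
    # Rank genes from most to least frequently selected (ties by name).
    ranked = sorted(candidate_genes, key=lambda g: (-counts[g], g))
    return ranked, counts, splits
```

Any model-fitting routine that returns the retained genes for a training set could be passed as `select_genes`; the choice of how many top-ranked genes to keep (here made with a concordance statistic) is a separate step not shown.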
We then repeated Cox regression on all 100 training sets using the 15 genes of the signature as covariates and the forced entry (enter) method to acquire the coefficient for each gene. For each gene, the 100 coefficient values were averaged to estimate the true coefficient of that gene. The prognostic score of each patient was calculated based on the expression levels of the 15 genes and the average Cox regression coefficients (see formula below). All patients in the training sets were divided into three groups based on prognostic score.
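The score formula itself is not reproduced in this section. Assuming the standard linear predictor of a Cox model, and writing $x_{ij}$ for the expression level of gene $j$ in patient $i$ and $\beta_j^{(k)}$ for the coefficient of gene $j$ in training set $k$ (notation introduced here for illustration), the averaged coefficient and the prognostic score $S_i$ would take the form:

```latex
\bar{\beta}_j = \frac{1}{100} \sum_{k=1}^{100} \beta_j^{(k)},
\qquad
S_i = \sum_{j=1}^{15} \bar{\beta}_j \, x_{ij}
```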
The tertiles of prognostic score in each training set were identified as cut points and averaged across the 100 training sets to obtain the optimal value of the cut points. Based on these cut points, each patient was assigned to one of three groups: "good", "intermediate" and "poor" prognostic outcome. Kaplan-Meier analysis and the log-rank test were conducted on the test sets to verify the consistency of the predictive power of the gene signature, as described in our previous studies [2][3][4][5].
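The cut-point procedure can be illustrated with a short sketch. The function names are hypothetical, the quantile interpolation is Python's default rather than necessarily the one used in the study, and the code assumes that a higher score corresponds to a worse prognosis.

```python
from statistics import mean, quantiles

def tertile_cut_points(scores):
    """Return the (lower, upper) tertile boundaries of one training set."""
    lower, upper = quantiles(scores, n=3)
    return lower, upper

def averaged_cut_points(score_sets):
    """Average the lower and upper tertile boundaries across training sets."""
    cuts = [tertile_cut_points(s) for s in score_sets]
    return mean(c[0] for c in cuts), mean(c[1] for c in cuts)

def assign_group(score, lower, upper):
    """Assign a prognostic group (ASSUMPTION: higher score = higher risk)."""
    if score <= lower:
        return "good"
    if score <= upper:
        return "intermediate"
    return "poor"
```

With two toy training sets, `averaged_cut_points([[1, 2, 3, 4, 5], [3, 4, 5, 6, 7]])` gives the averaged boundaries that `assign_group` then applies to every patient.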
To determine the biological functions enriched in the 78-gene set associated with OS, we performed Gene Ontology enrichment analysis using the ClueGO plug-in in Cytoscape (version 3.7.1). ClueGO was run using default parameters and p<0.05 as a cut-off for significance [6].

Validation of the 15-Gene Score System Using Two Public Datasets
A total of 125 samples from the GSE28722 dataset and 562 samples from GSE39582 were used as two independent validation sets. We analyzed the mRNA expression levels and OS status for the 15-gene signature using the same Cox regression method described above, which gave a new set of coefficients for the 15-gene signature. The prognostic score of each sample in the two validation sets was calculated, and the samples were split into tertiles: "good", "intermediate" and "poor" prognostic outcome.
The OS status of the three groups was analyzed by Kaplan-Meier analysis, and differences in survival among the groups were tested using the log-rank test.
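For readers unfamiliar with the estimator, the Kaplan-Meier survival curve of one prognostic group can be computed as in the sketch below. This is a plain-Python illustration with hypothetical names, not the software used in the study; `events` is assumed to be 1 for an observed death and 0 for a censored observation.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct death time.

    times: follow-up time per patient.
    events: 1 if death was observed, 0 if the patient was censored.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect all patients whose follow-up ends at time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve
```

Calling the function once per group ("good", "intermediate", "poor") yields the three curves that are then compared.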

The 15-Gene Score is Independent of Clinicopathological Features of CRC
To confirm that the prognostic function of the 15-gene signature is independent, we analyzed the available clinicopathological factors together with the 15-gene score using multivariate Cox regression analysis and compared their HR values.
We also analyzed the distribution of the prognostic groups based on the 15-gene signature in the four CRC molecular subtypes. Kaplan-Meier analysis of OS within the four subtypes was performed, and the survival curves were compared using the log-rank test. HRs with 95% CIs were calculated, and the percent distribution of patients across the three prognostic groups was recorded for each subtype.
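The log-rank comparison can be sketched for the simplest case of two groups, for example "poor" versus "good". This is an illustrative simplification with hypothetical names: the study compares three groups, which requires the multi-group form of the test, and the code below only covers the pairwise version with one degree of freedom.

```python
import math

def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value).

    events are 1 for an observed death and 0 for censoring.
    """
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]
    event_times = sorted({t for t, e, _ in data if e})
    obs_a = exp_a = var = 0.0
    for t in event_times:
        # Patients still at risk just before time t, per group.
        n_a = sum(1 for u, _, g in data if u >= t and g == 0)
        n_b = sum(1 for u, _, g in data if u >= t and g == 1)
        # Deaths occurring exactly at time t, per group.
        d_a = sum(1 for u, e, g in data if u == t and e and g == 0)
        d_b = sum(1 for u, e, g in data if u == t and e and g == 1)
        n, d = n_a + n_b, d_a + d_b
        obs_a += d_a
        exp_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the deaths in group A.
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if var == 0:
        return 0.0, 1.0
    chi2 = (obs_a - exp_a) ** 2 / var
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

The statistic compares the observed number of deaths in one group with the number expected if both groups shared the same hazard, accumulated over all death times.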

Comparison with the Oncotype DX Colon Cancer 7-Gene Signature
Using the same methodology described above and the patient cohorts from GSE17536 (177 patients) and GSE28722 (125 patients), we compared the prognostic ability of our 15-gene signature with that of the previously developed Oncotype DX 7-gene signature [7][8][9]. Briefly, we performed a multivariate Cox regression analysis of the 7 genes on 100 training sets for GSE17536 (118 patients) and GSE28722 (83 patients) separately. Coefficients for each of the 7 genes were averaged, and prognostic scores for all patients were then calculated as described above. Based on the scores, we divided the patients in the 100 test sets into tertiles (good, intermediate, and poor), and the cut-point scores were recorded and averaged. All patients in each cohort were then separated into three groups based on these cut points, and Kaplan-Meier analysis of OS was used to compare the prognostic performance of the 7-gene and 15-gene signatures. HR values were calculated for each test set for the "poor" group in comparison to the "good" group.
The expression of the 15 prognostic genes and 5 reference genes (ACTB, RPLP0, GUSB, TFRC, GAPDH) in the FFPE samples was measured using an mRNA hybridization assay described previously [5]. It should be noted that the clinical validation study used a different mRNA detection platform from the public datasets, which consisted of CRC transcriptome data obtained by microarray or RNA-seq.

Validation of the 15-Gene Expression Signature Using an Independent Hospital Cohort
We performed Cox regression on 100 resampled training sets, each consisting of 135 patients, and the coefficient values for each gene were averaged and used as weights for calculating the prognostic score. All samples in each training set were sorted and divided into three equal groups based on their prognostic scores. The tertiles of prognostic score in each training set were identified as cut points. We then averaged the upper and lower tertiles separately to obtain the optimal value of the cut points.
Patients were divided into three groups based on their prognostic scores: "good", "intermediate" and "poor". Kaplan-Meier analysis with the log-rank test and HRs with 95% CIs were computed on the test sets, each comprising the remaining 68 patients, to verify the consistency of the prognostic power of the gene signature. Multiple clinical and pathological factors, including age at diagnosis, gender, TNM stage, WHO classification, and primary tumor site, were used together with the 15-gene-based prognostic groups as parameters for univariate and multivariate Cox regression analyses.

Statistical Analysis
Statistical methods used in this work are described in different sections above.