A network‐based predictive gene expression signature for recurrence risks in stage II colorectal cancer

Abstract The current criteria for defining the recurrence risks of stage II colorectal cancer (CRC) are not robust; therefore, we aimed to explore novel gene signatures to predict recurrence risks and to reveal the underlying mechanisms of stage II CRC. First, the gene expression profiles of 124 patients with stage II CRC from The Cancer Genome Atlas (TCGA) database were obtained to screen differentially expressed genes (DEGs). A total of 202 DEGs, including 128 upregulated and 74 downregulated, were identified in the recurrence group (n = 24) compared to the nonrecurrence group (n = 100). Furthermore, the top 5 DEGs (ZNF561, WFS1, SLC2A1, MFI2, and PTGR1) were identified by random forest variable hunting, and four (ZNF561, WFS1, SLC2A1, and PTGR1) were selected to create a four‐gene recurrent model (GRM), with an area under the curve (AUC) of 0.882 according to the receiver operating characteristic curve, and the robust diagnostic effectiveness of the GRM was further validated with another gene expression profiling dataset (GSE12032), with an AUC of 0.943. The diagnostic effectiveness of the GRM regarding recurrence was associated with poor disease‐free survival in all stages of CRC. In addition, gene ontology functional annotation and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analyses revealed 18 enriched functions and 6 enriched pathways. Four genes, ABCG2, CACNA1F, CYP19A1, and TF, were identified as hub genes by the protein‐protein interaction network, which further validated that these genes were correlated with a poor pathologic stage and overall survival in all stages of CRC. In conclusion, the GRM can effectively classify stage II CRC into groups of high and low risks of recurrence, thereby making up for the prognostic value of the traditional clinicopathological risk factors defined by the National Comprehensive Cancer Network guidelines. The hub genes may be useful therapeutic targets for recurrence. Thus, the GRM and hub genes could offer clinical value in directing individualized and precision therapeutic regimens for stage II CRC patients.


| INTRODUCTION
Colorectal cancer (CRC) is the third most frequent malignant tumor and the fourth leading cause of cancer-related death worldwide. 1 Survival and treatment in CRC are primarily dependent on the tumor stage at diagnosis. Radical resection of the tumor lesion is the foundation treatment for stage II CRC. However, postsurgery, 25%-30% of stage II CRC patients develop recurrence within 5 years, 2 contributing to mortality. To improve the survival of these patients, the American Society of Clinical Oncology, the National Comprehensive Cancer Network (NCCN) and the European Society for Medical Oncology recommend adjuvant chemotherapy for high-risk patients with stage II CRC, and the risks are primarily defined by clinicopathological features, such as tumor size, the number of lymph nodes investigated, poorly differentiated histology, tumor perforation (T4), bowel obstruction and perforation, positive resection margins, and lymphatic and venous invasion. 3 Unfortunately, using these abovementioned criteria, we found that a portion of low-risk patients also experienced recurrence after the operation. 4 On the other hand, is there a portion of patients who were overtreated according to the abovementioned criteria? This concern highlights the lack of available biomarkers that can help detect the genuine highrisk factors of recurrence for stage II CRC, which can improve the treatment accuracy of these patients.
In recent years, rapid technological breakthroughs of genome-wide sequencing have provided researchers with large expression datasets. With the popularization of big data analysis and the progress of bioinformatics technology, a series of biomarkers was identified for predicting the recurrence, metastasis, chemosensitivity, and prognosis of CRC. Moreover, the complex recurrence and metastatic processes of polygenic network collaboration were also analyzed by bioinformatics tools. [5][6][7] Practice suggested that the prediction value of a single biomarker was usually unfavorable, and multiple biomarkers jointly confirmed to be efficient and reliable. 8,9 In this study, we performed bioinformatical analyses based on high-throughput RNA sequencing of CRC from the The Cancer Genome Atlas (TCGA) (http://cance rgenom-e. nih.gov/) to gain a panoramic view of expression patterns between recurrence and nonrecurrence patients with stage II CRC. Furthermore, differentially expressed genes (DEGs) were used to establish a model to predict the recurrence risk (gene recurrent model [GRM]) by random forest sequencing, and excellent diagnostic effectiveness was shown by receiver operating characteristic (ROC) curve analysis. Then, another gene expression profiling dataset was extracted from the GEO (GSE12032) to further validate the robust diagnostic effectiveness of the GRM. In addition, gene ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), protein-protein interaction (PPI) network analyses, and hub gene selection were adopted to jointly analyze the underlying mechanism of recurrence. Finally, the results we obtained might be meaningful in guiding clinical practice and understanding the recurrence mechanisms of stage II CRC.

| Data collection
The RNA-Seq dataset of CRC, which includes the whole human transcriptome sequencing dataset and corresponding survival profiles, was downloaded from the TCGA database. All the data in the dataset used were pathological stage II (T 3-4 N 0 M 0 ) without postoperative adjuvant therapy and were followed up for at least 2 years. According to recurrence, the samples were divided into a recurrence group and a nonrecurrence group.

| Data preprocessing and the identification of DEGs
edgeR is an R package used for the analysis of DEGs 10 and was used in our study to screen the DEGs between the recurrence and nonrecurrence groups. The DEGs were identified with the following criteria: fold change |(FC)| ≥2 and P value <.01.

| Random forest sequencing
Random forest is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. The mean decrease in gini (MDG), which is involved in the random forest algorithm, is used to rank the important indexes with DEGs. The MDG provides ways to quantify which index contributes most to classification accuracy. A higher MDG indicates that the degree of impurity arising from the category could be reduced farthest by one variable and thus suggests an important associated index. We divided our data into training (66%, n = 82) and testing (34%, n = 42) datasets by using the randomForest package of R software (http://www.rproje ct.org). 11 We used the training dataset to develop the random forest model and then tested the model's performance with the testing dataset. The specific random forest model parameters were as follows: max features: auto, n estimators: 5000, min sample leaf: 1, and number of variables tried at each split: 2.

| ROC analysis of the select recurrencerelated genes
Receiver operating characteristic analyses are commonly used to evaluate the performance of disease diagnosis. In our study, the four genes selected from the top 5 genes ranked by the MDG were used as biomarkers for detecting recurrence and for constructing a recurrence risk model. The area under the curve (AUC) was used to demonstrate the accuracy of an individual gene and joint genes for predicting recurrence.

| External exploration of the diagnostic effectiveness of the GRM
To further validate the diagnostic effectiveness of the GRM, another gene expression profiling dataset was extracted from the GEO (GSE12032) for analysis. GEO2R, a web tool that was applied to screen the DEGs by comparing two groups of samples in a GEO series, was applied to identify the DEGs between the recurrence and nonrecurrence groups of stage II CRC patients following the criteria fold change |(FC)| ≥1.5 and P value <.05. The patients with missing values were excluded. Furthermore, the identified DEGs were ranked by the MDG in the random forest algorithm, and the top 10 genes were selected. In addition, the diagnostic effectiveness of the GRM was validated by ROC analyses, and the AUC was used to demonstrate the accuracy of the GRM for predicting recurrence.

| Further exploration of the clinical values of the GRM
As the four genes in the GRM achieved robust predictive values of recurrence in stage II CRC and the risk of recurrence always affected disease-free survival (DFS), we further validated the four genes by investigating their expression and relevance to DFS in all stages of CRC between the recurrence and nonrecurrence groups. Kaplan-Meier survival curves were constructed using the expressions of the four genes from the TCGA transcriptional profiles as a threshold and compared by log-rank analysis. All analyses were performed with GraphPad Prism 5.0 and SPSS version 19.0 (IBM), and P < .05 was considered statistically significant.

| Functional annotation and pathway enrichment analyses
The Database for Annotation, Visualization, and Integrated Discovery (DAVID) v.6.8 (https ://david.ncifc rf.gov) 12 was used to perform GO 13 functional annotation and KEGG 14 pathway enrichment analyses. The human genome was selected as the background list parameter, and a P value <.05 was set as statistically significant.

| PPI network analysis and hub gene selection
The Search Tool for the Retrieval of Interacting Genes (STRING, www.string-db.org/) database was used to construct a PPI network. An interaction score >0.4 was considered statistically significant. Furthermore, the result of the PPI network was imported into the Cytoscape plugin to create network visualizations and subjected to centrality values analysis with CentiScaPe 2.2.

| Further exploration of the clinical values of the four hub genes
As the four hub genes that promote recurrence always highly correlated with pathologic stage and affected overall survival (OS), we further validated the four hub genes by investigating their expression in different pathologic stages and relevance to OS in all stages of CRC. Kaplan-Meier survival curves were constructed using the expressions of the four hub genes from the TCGA transcriptional profiles as a threshold and compared by log-rank analysis. All analyses were performed with GraphPad Prism 5.0 and SPSS version 19.0, and P < .05 was considered statistically significant.

| Identification of the DEGs
Information on the 124 patients who met our research criteria was obtained from the TCGA database. After a 2-year follow-up, 24 patients experienced recurrence, and 100 patients did not experience recurrence. Moreover, 202 DEGs, including 128 upregulated and 74 downregulated, were identified in the recurrence group compared to the nonrecurrence group ( Figure 1). F I G U R E 1 Volcano plot of the DEGs between recurrence and nonrecurrence groups of stage II CRC. The x-axis indicates the log2 fold change in gene expression, which was defined as the ratio of normalized value of gene expression detected in stage II CRC between recurrence and nonrecurrence groups. The y-axis indicates the adjusted P values plotted in −log10. Red dots highlight genes upregulated in recurrence group (fold change >2, P value <.01). Green dots highlight genes downregulated in nonrecurrence group (fold change >2, P value <.01). CRC, colorectal cancer; DEGs, differentially expressed genes

| Acquisition of the classified DEGs by the random forest classifier
The randomForest package in R, which performs well in predicting whether variables are noise or not and evaluates the importance of the variable, was used. The 202 DEGs were ranked by the MDG with the random forest method, and the top 5 DEGs (ZNF561, WFS1, SLC2A1, MFI2, and PTGR1) were selected for construction of the ROC curve (Table 1).

| Construction of a four-GRM and further validation
The ROC curve defined an optimal threshold to predict the recurrence risk of stage II CRC, and the AUC values of the ROC for ZNF561, WFS1, SLC2A1, MFI2, and PTGR1 were 0.7404, 0.7283, 0.6238, 0.6738, and 0.7454, respectively ( Figure 2A). To elevate the prediction efficiency, we explored the combination of two, three, four and five DEGs.  Figure 2D). The AUC value of the ROC curve with the combination of five DEGs for ZNF561+WFS1+SLC2A1+PTGR1+MFI2 was 0.652 ( Figure  2D). Among them, the four-gene signature (ZNF561, WFS1, SLC2A1, and PTGR1), with an AUC of 0.882, exhibited the best performance for predicting recurrence and showed remarkable sensitivity and specificity when the cutoff value was 0.593 ( Figure 2D).
The transcriptome profiling data of 92 patients with stage II CRC in the GSE12032 dataset, which includes 30 patients with recurrence and 62 patients without recurrence, were collected. After excluding the patients with missing values, we obtained 60 patients, including 16 patients with recurrence and 44 patients without recurrence. Furthermore, 113 DEGs (63 upregulated DEGs and 50 downregulated DEGs) were identified in the recurrence group compared to the nonrecurrence group ( Figure 2E). In addition, all the genes in the GRM were included in the top 10 DEGs that were ranked by the MDG with the random forest classifier ( Table 2). These genes exhibited robust performance for predicting recurrence, with an AUC of 0.943, and showed remarkable sensitivity and specificity when the cutoff value was 0.619 ( Figure 2F).

| The clinical values of the four genes in the GRM for all stages of CRC
The expression levels of the four genes in the GRM between the recurrence and nonrecurrence groups were significantly different, and the P values for ZNF561, WFS1, SLC2A1, and PTGR1 were .004, <.0001, .0002, and .0002, respectively ( Figure 3A). The survival analysis indicated that low ZNF561, PTGR1 expression and high WFS1, SLC2A1 expression were associated with poor DFS, with P values of <.01, <.05,<.05, and <.05, respectively ( Figure 3B).

| Functional annotation and pathway enrichment analyses
Based on the DAVID software, a total of 18 GO functions were enriched. Concerning the Molecular Function terms, the 202 DEGs were mostly enriched in oxygen binding, iron ion binding, heme binding, monooxygenase activity, oxidoreductase activity, steroid hydroxylase activity, G protein-coupled receptor (GPCR) activity, oxygen transporter activity, structure of the cytoskeleton, and aromatase activity. Concerning the Biological Process terms, the DEGs were significantly enriched in the steroid metabolic process, oxygen transport, the GPCR signaling pathway, the sensory perception of smell, the drug metabolic process, and epidermis development. Concerning the Cellular Component terms, the DEGs were significantly enriched in hemoglobin complex and organelle membrane ( Figure 4A; Table 3). In KEGG pathway enrichment analysis, the 202 DEGs were significantly enriched in six signaling pathways: steroid hormone biosynthesis, retinol metabolism, xenobiotic metabolism-CYP450, chemical carcinogenesis, drug metabolism-other enzyme, and drug metabolism-CYP450 ( Figure 4B; Table 4).

| PPI network construction and hub gene identification
Based on the STRING database, a PPI network was constructed ( Figure 5A). The network contained 63 nodes and 100 edges that were subjected to hub gene analysis with CentiScaPe 2.2. Four hub genes evaluated by the degree (≥3.14) and betweenness (≥223) were identified: ABCG2, CACNA1F, CYP19A1, and TF ( Figure 5B; Table 5).

| The clinical value of the four hub genes in all stages of CRC
Using the TCGA transcriptional profiles in all stages of CRC, the expression levels of the four hub genes with different TNM stages were significantly different, and the high expression of ABCG2, CACNA1F, CYP19A1, and TF was associated with a poor TNM stage ( Figure 6A). For the T stage (T 1+2 vs T 3+4 ), the P values of ABCG2, CACNA1F, CYP19A1, and TF were <.05, <.05, .006, and <.05, respectively. For the N stage (N 0 vs N 1+2 ), the P values of ABCG2, CACNA1F, CYP19A1, and TF were <.01, <.05, .003, and <.01, respectively. For the M stage (M 0 vs M 1 ), the P values of ABCG2, CACNA1F, CYP19A1, and TF were <.01, <.01, <.01, and <.05, respectively. The high expression of ABCG2, CACNA1F, CYP19A1, and TF was also associated with poor OS ( Figure 6B), with P values of <.05, <.05, <.01, and <.05, respectively. depend largely on clinicopathologic factors, leading to limitations in their prognostic abilities for this highly heterogeneous tumor, and the improvement in stage II CRC patient survival following adjuvant chemotherapy is less than 5% at 5 years. 15 This means that the administration of adjuvant chemotherapy to all patients in stage II is approximately 75% unnecessary and harmful. This narrow therapeutic index underscores the importance of identifying more appropriate biomarkers to detect the genuine high-risk factors of recurrence for stage II CRC.
Single biomarkers, such as CEA, CA199, miR-21, miR-181c, MTA3, S100A2, and ezrin, 5,[16][17][18]20,21 for the prognosis of recurrence in stage II CRC patients have been reported before. However, given that CRC tissues show complicated molecular and cellular heterogeneity, a single biomarker failed to reflect the genomic heterogeneity of the tumor; therefore, their prediction efficiency was unpowerful. Although multigene expression signatures have been reported, the number of genes that need to be tested is 13, 31, or 120, which is uneconomical, and the signatures demonstrated poor feasibility and poor specificity. [22][23][24] One previously reported 8-miRNA recurrence classifier is superior to currently used clinicopathological features, as well as NCCN criteria. Another prognostic mutation panel comprising five prognostic genes (APAF1, DIAPH2, NTNG1, USP7, and VAV2) 25 also showed superior prognostic accuracy to that of the American Joint Commission on Cancer classification (concordance index: 0.70 vs 0.54, respectively). However, the authors showed only the HR; therefore, its specific diagnostic performance is unclear. Miyake et al constructed a discriminator gene set that included 30 genes based on the expression data of 92 stage II CRC patients; however, despite its reported diagnostic effectiveness, its prediction accuracy was only 77.4%. 26 Under these circumstances, we explored the prediction efficiency of expression datasets from the TCGA database of stage II CRC patients who did not undergo postoperative adjuvant therapy and were followed up for at least 2 years. Using random forest variable hunting, we identified the top 5 DEGs. As the results revealed, different combinations of the top 5 DEGs showed different AUC values. The combination of the four-gene signature (ZNF561, WFS1, SLC2A1, and PTGR1), with an AUC of 0.882, exhibited the best performance for predicting recurrence and showed remarkable sensitivity and specificity when the cutoff value was 0.593. Therefore, these four genes were selected to create the GRM and achieved excellent predictive values of recurrence, with an AUC of 0.882 in stage II CRC. To further test and verify the diagnostic effectiveness of the GRM, we extracted another gene expression profiling dataset from the GEO (GSE12032) for further validation. To our excitement, all the genes in the GRM were included in the top 10 DEGs ranked by the MDG with the random forest classifier (Table 2) from the GEO database and exhibited better performance for predicting recurrence, with an AUC of 0.943, than the TCGA database. Furthermore, we and Miyake et al used the same gene expression profiling dataset. Our GRM contains only four genes but exhibited robust performance for predicting recurrence, with an AUC of 0.943, while the prediction accuracy of their discriminator gene set, which contains 30 genes, was only 77.4%. However, the other genes selected before by different teams for the prediction of recurrence in stage II CRC were obtained using different methods; therefore, their diagnostic performance cannot be directly compared with ours. However, the main methods we used to select the DEGs to establish the GRM were advanced, such as the random forest method, which takes advantage of two powerful machine-learning techniques: bagging and random feature selection. As one of the most important advantages, accuracy mainly benefits from the complementation of the training set and the test set and is relatively robust to outliers and noise, which contribute to its superior performance over many other methods, and it is commonly used in genomic data analyses as an effective prediction tool. 27,28 The robust diagnostic effectiveness showed by the GRM in two independent databases highly emphasized its stability. Stability mainly depends on the high efficiency, robustness and reliability of the analytical methods we used, such as the random forest method and ROC curve analysis. Furthermore, the robust diagnostic effectiveness also encouraged us to further apply it in clinical practice, and the personalized prediction of recurrence by the GRM will help avoid undertreatment or overtreatment.
Further exploration of the clinical values of the four genes in the GRM validated their effectiveness in the diagnosis of recurrence in all stages of CRC. The low expression of ZNF561 and PTGR1 and the high expression of WFS1 and SLC2A1 were associated with poor DFS. All of the above findings strongly prove the high recurrence diagnostic efficiency of our GRM.
To explore the mechanism of recurrence in stage II CRC, we used an integrated bioinformatics analysis. In GO analysis, 18 functions were enriched and support evidence for the important roles of oxygen, epidermis development, hemoglobin, and GPCR.
Hypoxia (low oxygen concentration) and ischemia (low hemoglobin concentration), which are caused by insufficient vascularization, are hallmarks of solid tumors (including CRC). 29,30 Hypoxia acts as an off switch for the expression of several genes, such as vascular endothelial growth factor (VEGF) and epidermal growth factor receptor (EGFR), 31,32 and could facilitate tumor development by epithelial-to-mesenchymal transition 33,34 ; moreover, intratumoral hypoxia is associated with therapy resistance, metastasis, and a poor clinical outcome in CRC. 34 Under conditions of hypoxia, the transcription factor hypoxia-inducible factor (HIF) 35 is activated immediately and strongly adapts to the environment by regulating transcriptional programs in erythropoiesis, angiogenesis and metabolism. 36 Then, it secretes a large number of angiogenesis-related molecules, such as VEGF, to actualize angiogenesis, cell proliferation and protection against ischemic injury. 37 Heme has important functions in transportation, catalysis, and electron and signaling transfer and serves as a prosthetic group in heme binding. Heme binding is widely observed in tumor lesions and cancer cells, and heme-binding ability is necessary for DNA damage resistance 38 and represents an important biomarker of the proliferative status of cancers. 39 Hemoglobin, which is the most abundant heme-binding protein associated with oxygen transportation, 40 indicates hypoxia, and the heme-binding ability and hemoglobin yield exhibit correlative dependence and interplay.
G protein-coupled receptors comprise the largest superfamily of receptors involved in transmembrane-initiated transduction pathways and can crosstalk (or transactivate) with EGFR, insulin/insulin-like growth factor 1 receptors and other cell surface receptors 41 to mediate the proliferation, 42 angiogenesis, 43 and metastasis 44 of CRC.
Among the mechanisms involved in the above biological processes, oxygen binding, 45 iron ion binding, 45 heme binding, oxygen transporter activity, oxygen transport, the hemoglobin complex, epidermis development, and the GPCR signaling pathway, 46 which were highlighted in the results of the GO functional annotation, are crucial in tumor development; therefore, targeting HIF, VEGF, iron ion binding, and GPCR may be good options for preventing the recurrence of stage II CRC.
In KEGG analysis, the main enrichment pathways were xenobiotic metabolism (including drug and retinol)-CYP450/ other enzymes, chemical carcinogenesis and steroid hormone biosynthesis.
Xenobiotic metabolism involves the metabolism of potentially harmful compounds that can enter the body together with food, environmental components, drugs or food additives. Xenobiotic metabolism enzymes include cytochrome P450 (CYP), the glutathione S-transferase (GST) family, the uridine 50-diphospho-glucuronosyltransferase (UDP-glucuronosyltransferase-UGT) superfamily, alcohol-metabolizing enzymes, sulfotransferases, etc. Under normal circumstances, xenobiotic metabolism enzymes play the role of detoxification. However, they can also convert certain chemicals into highly toxic metabolites to trigger chemical carcinogenesis, which is known as ''bioactivation''. 47 Different alleles of enzymes involved in xenobiotic metabolism contribute to CRC susceptibility 48 ; for example, CYP450 can increase the metabolic activity of procarcinogens, which include polycyclic aromatic hydrocarbons and heterocyclic amines, resulting in the production of potential carcinogens and eventually the development of CRC. 49 Another metabolic enzyme is GST, which plays a major role in detoxification and steroid hormone biosynthesis. 50 Steroid hormone and its receptor have protective functions against the progression of CRC 51 ; therefore, the expression of steroid hormone in colon cancer tissues is lower than that in normal tissues, 52 and the downregulation of steroid hormone expression in colorectal tissues is a cancer signal.
Retinol inhibits CRC cell proliferation, 53 even inhibiting invasion through a retinoic acid receptor-independent mechanism. 54 In vivo, an impairment in hepatic and intestinal cytosolic retinol oxidation and retinoic acid formation by alcohol abuse can increase the risk of developing CRC, and retinol metabolism plays a very important role in this process. 55 Based on the above mechanisms of the enriched pathways, we hypothesized that the recurrence of stage II CRC is likely due to "the breach of duty" of the xenobiotic metabolism enzymes, leading to the decompensation of retinal metabolism and the inhibition of steroid biosynthesis, and ultimately the development of recurrence. The metabolism of other potentially harmful compounds may also be involved in regulating xenobiotic metabolism enzymes while avoiding the invasion of xenobiotics and supplementing retinal and steroid hormones, which may be good options for preventing the recurrence of stage II CRC.
To further highlight the hub genes that play the most important roles in recurrence and to explore their interactions, a PPI network was constructed, and four hub genes were selected. ABCG2, which is a membrane-associated protein, can relieve oxidative stress and the inflammatory response by inhibiting the NF-κB signaling pathway, and the high expression of ABCG2 in CRC tissues may represent feedback of the overoxidative reaction, which is associated with a poor prognosis. 56,57 Most research has reported that inhibiting ABCG2, as a marker of chemoresistance in CRC, could enhance the efficacy of Hypericin-mediated photodynamic therapy. CACNA1F is a voltage-gated calcium channel that is mainly expressed in the human retina, but it has also been reported to be widely distributed outside the retina, including in the immune system. 58 It is well documented that CACNA1F plays roles in cell proliferation, migration, and apoptosis, 59 but it is rarely reported in CRC; therefore, further research on CACNA1F is needed. CYP19A1 is a member of the CYP450 superfamily that, as a monooxygenase, catalyzes drug metabolism and the synthesis of cholesterol, steroids and other lipids. Many researchers suggest that polymorphisms in CYP19A1 are related to CRC risk and may be influenced by estrogen through an inflammation-related mechanism. 60,61 The main function of the TF protein is to transport iron from the intestine, reticuloendothelial system, and liver parenchymal cells to all proliferating cells in the body. TF is an irontransporting protein that can transfer the iron absorbed by the intestinal mucosa to the bone marrow for hemoglobin formation in normoblasts. 62 A large number of studies indicate that TF is also a growth factor of all proliferated and cultured cells. 63 Moreover, TF is synthesized for its own specific proliferation and differentiation in tumor tissue. Based on the biological properties of TF, TF in the feces is used as a blood marker for CRC screening, with a sensitivity and specificity of 92% and 72.0%, respectively. 64,65 However, our current study has the following limitations: (a) Because we were unable to obtain complete clinical data, we were unable to implement a comparison of predictive effectiveness between our GRM and high-risk factors defined by the NCCN guidelines. (b) Our study is based on a mRNA evaluation from the TCGA and GEO databases; therefore, it is not as persuasive as the level of protein expression, and we did not use clinical samples to further verify its authenticity. (c) It is a retrospective study; therefore, the evidence level is imperfect. With regard to potential limitations, our GRM and hub genes relying on the advantages of excellent predictive values and reasonable statements should be validated in future, prospective, multicenter clinical trials. In addition, biomarkers that show promising predictive value for the survival benefit of adjuvant chemotherapy in stage II CRC patients should be discovered at the same time.
In summary, our findings showed that the GRM can effectively classify stage II CRC patients into groups with high and low risks of recurrence, thereby making up for the prognostic value of the traditional clinicopathological risk factors defined by the NCCN guidelines. Moreover, various pathways and hub genes involved in the recurrence progression of stage II CRC were revealed, which may be useful therapeutic targets. Thus, the GRM and hub genes could offer clinical value in directing individualized and precision therapeutic regimens for stage II CRC patients.

| CONCLUSION
In conclusion, the GRM we established using stage II CRC data from the TCGA by random forest variable hunting showed robust diagnostic effectiveness and was further validated with GEO data, supporting its robust ability in the personalized prediction of recurrence. In addition, GO, KEGG, PPI network analyses, and hub gene selection revealed the underlying mechanism of recurrence to a certain extent. Thus, the GRM and hub genes could offer clinical value in directing individualized and precise therapeutic regimens for stage II CRC patients.