Identification of hub genes and construction of transcriptional regulatory network for the progression of colon adenocarcinoma hub genes and TF regulatory network of colon adenocarcinoma

Abstract The aim of this study was to identify key genes related to the progression of colon adenocarcinoma (COAD), and to investigate the regulatory network of hub genes and transcription factors (TFs). Dataset GSE20916 including 44 normal colon, 55 adenoma, and 36 adenocarcinoma tissue samples was used to construct co‐expression networks via weighted gene co‐expression network. Gene Ontology annotation and the Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis for the objective module were performed using the online Database for Annotation, Visualization and Integrated Discovery. Hub genes were identified by taking the intersection of differentially expressed genes between dataset GSE20916 and GSE39582 and validated using The Cancer Genome Atlas (TCGA) database. The correlations between microRNA (miRNA) and hub genes were analyzed using the online website StarBase. Cytoscape was used to establish a regulatory network of TF‐miRNA‐target gene. We found that the orange module was a key module related to the tumor progression in COAD. In datasets GSE20916 and GSE39582, a total of eight genes (BGN, SULF1, COL1A1, FAP, THBS2, CTHRC1, COL5A2, and COL1A2) were selected, which were closely related with patients’ survivals in TCGA database and dataset GSE20916. COAD patients with higher expressions of each hub gene had a worse prognosis than those with lower expressions. A regulatory network of TF‐miRNA‐target gene with 144 TFs, 26 miRNAs, and 7 hub genes was established, including model KLF11‐miR149‐BGN, TCEAL6‐miR29B2‐COL1A1, and TCEAL6‐miR29B2‐COL1A2. In conclusion, during the progression of COAD, eight core genes (BGN, SULF1, COL1A1, FAP, THBS2, CTHRC1, COL5A2, and COL1A2) play vital roles. Regulatory networks of TF‐miRNA‐target gene can help to understand the disease progression and optimize treatment strategy.


| INTRODUCTION
Colon cancer is one of the most common gastrointestinal malignancies worldwide (Arnold et al., 2017). Every year, the incidence of colorectal cancer increases by 4.2% globally (Chen et al., 2016;Maley et al., 2017).
Currently, most patients with colorectal cancer cannot be diagnosed at early stages. Thus, early diagnosis and intervention are important to improve the prognosis of colorectal cancer. Colon adenocarcinoma (COAD) is the most common type of colorectal cancer (Mutch, 2007).
However, the progression of COAD involving complex gene-mechanisms has not yet been clearly explored. Now many bioinformatics methods have been developed to identify potential prognostic biomarkers and therapeutic targets (Wan, Tang, Han, & Wang, 2018). For example, weighted gene co-expression network analysis (WGCNA) can be used for finding clusters (or modules) of highly correlated genes (Langfelder & Horvath, 2008), for summarizing such clusters using the module eigengene or an intramodular hub gene, for relating modules to one another and to external sample traits, and for calculating module membership (MM) measures (Wan et al., 2018).
WGCNA has been used to identify gene modules related to clinical outcomes in many cancers including colon cancer for evaluating recurrence (Ding, Li, Wang, Chi, & Liu, 2019). However, there is no study to systematically identify the regulatory network of transcription factor (TF)-microRNA (miRNA)-target gene related to the progression of COAD using WGCNA. miRNAs are small noncoding regulatory RNAs, which can bind to the target messenger RNAs (mRNAs) and repress the target mRNA expression. After regulated by TFs, miRNAs greatly affect the progression of COAD.
In this study, a co-expression network of differentially expressed genes (DEGs) in COAD was constructed and the most significant modules in the network were used to reveal the tumorigenic genes. Then hub genes were identified by taking the intersection of DEGs between datasets GSE20916 and GSE39582 and validated using The Cancer Genome Atlas (TCGA) database. Finally, a regulatory network of TF-miRNA-target gene was established for the progression of COAD.

| Construction of WGCNA and identifying preserved modules
The GSE20916 matrix that contained gene expression (including normal, adenoma, and COAD samples) was ranked by the standard deviation (SD), and top 5,000 genes (Table S1) were selected for the next WGCNA analysis. Then the ideal soft-thresholding power was chosen and genes cluster into modules was based on the Topological Overlap Matrix. The correlation between module eigengene and clinical characteristics was investigated by Pearson's correlation tests. p < .05 and |r| > .7 were considered as the selection criteria of candidate modules. Module preservation was calculated by the module preservation function to detect the stability and repeatability of modules in GSE39582 (Table S2).
The module with maximum preservation statistics Zsummary and minimum preservation statistics medianRank was selected as the key module.

| Function enrichment analysis
To investigate the potential biological function and pathways in the key module, the Online Database for Annotation, Visualization and Integrated Discovery (DAVID; https://david-d.ncifcrf.gov/) was used to analyze and visualize Gene Ontology (GO) terms and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways (Table S3).

| Screening hub genes
MM shows the correlation between genes and modules, and the top 15 genes with the highest correlation with the key module were selected as Group 1 of candidate genes. DEGs between normal colon tissue and adenocarcinoma in the GSE39582 dataset were determined by limma package based on the standard of adjusted p < .01 and | log FC | > 2. Then the disease free survival (DFS) of DEGs was calculated by survival package and genes with p < .05 were defined as Group 2 of candidate genes (Table S4). Hub genes were identified by the intersection of the two candidate gene groups.
F I G U R E 1 (a) The cluster was based on the expression data of GSE20916, which contained 44 normal tissue samples, 55 adenoma tissue samples, and 36 adenocarcinoma tissue samples. The top 5,000 genes with the highest SD values were used for the analysis by WGCNA. The color intensity was proportional to disease status (normal, adenoma, and adenocarcinoma), sex (male and female) and age (years old). (b) Analysis of the scale-free fit index for various soft-thresholding power (β). Analysis of the mean connectivity for various soft-thresholding power. In all, 10 was the most fit power value. SD, standard deviation 2.5 | Validating for hub genes Gene Expression Profiling Interactive Analysis (GEPIA; http:// gepia.cancer-pku.cn/) was used to analyze the RNA sequencing data from TCGA. TCGA data of COAD were used to validate the expression of identified hub genes. Human Protein Atlas (http:// www.proteinatlas.org) was used to validate the candidate hub genes by immunohistochemistry.

| Construction of TF regulatory network
StarBase (http://starbase.sysu.edu.cn/index.php) was used to predict miRNAs that bind to hub genes based on the standard that CLIP Data > = 3 and expression was present in at least one tumor sample.
Then miRNAs with most intersections in 7 databases were selected ( Fig. S3A-G). The plugin iRegulon of Cytoscape is used to predict TF regulation networks.

| Construction of WGCNA and identifying preserved modules
The sample dendrogram was plotted using the WGCNA package Expression of 5,000 genes with the largest difference of SD in dataset GSE39582 was used to calculate the preservation of modules. Figure 4a,b shows the medianRank statistics and Zsummary statistics of each module. Among the brown, green, and orange modules with preservation, the orange module with the maximum statistics Zsummary and minimum statistics medianRank was defined as the key module.

| Enrichment analysis of the orange module
The online database of DAVID was used to explore the enrichment analysis of GO and KEGG in key orange module. Figure 3b shows the enrichment results of GO and KEGG in the orange module. The KEGG of orange module was mainly in the extracellular matrix receptor (ECM)-receptor interaction, focal adhesion, cytokine-receptor interaction, hematopoietic cell lineage, and transforming growth factor (TGF)-β signaling pathway, while the biological process based on GO of orange module mainly included response to wounding, inflammatory response, cell adhesion, biological adhesion, and vasculature F I G U R E 4 Module preservation was evaluated by medianRank and Zsummary statistics which correlated to connectivity and density of networks. The module with lower medianRank tends to exhibit stronger observed preservation than the module with a higher medianRank if both of them are preserved. Compared with the GSE20916 dataset, orange module with highest Zsummary (a)and lowest medianRank statistics (b) than brown and green modules F I G U R E 5 (a) Venn plot of candidate hub genes commonly owned in GS20916 and GSE39582. (b) The transcriptional differences of hub gene levels between colon carcinoma tissues and the para-cancer tissues in TCGA. TCGA, The Cancer Genome Atlas (*p < .001) development. Figure 3b show the contents of cellular component and molecular function based on the GO of orange module.

| Identification and validation of hub genes
A total of eight genes (BGN, SULF1, COL1A1, FAP, THBS2, CTHRC1, COL5A2, and COL1A2) shared by both Groups 1 and 2 of candidate genes were identified as hub genes (Figure 5a). We used TCGA data of COAD to validate the hub gene expression with the online tool of GEPIA. All of the hub genes are expressed differently in normal and cancer tissues of COAD by the criterion of |logFC| > 1 and p < .01 (Figure 5b). All of the hub genes significantly related to disease-free survival of COAD ( Figure   6) and GSE39582 ( Figure S2). Compared to normal tissues, the protein expressions of these hub genes in tumor tissues were significantly higher based on the Human Protein Atlas database (Figure 7).

| DISCUSSION
High-throughput biological technologies, such as the systems biology algorithm of WGCNA, allow researchers to evaluate the relationships between genes and clinical features through a scale-free network (Langfelder & Horvath, 2008;Wan et al., 2018). WGCNA has been widely used in disease, physiology, drug, genome annotation, and cancer progression (Chen et al., 2018). WGCNA has been used to identify gene modules and hub genes associated with TNM stages in gastric cancer, hepatocellular carcinoma (Zhu, Sun, Zhou, He, & Qian, 2018), and metastatic colorectal cancer (Wang et al., 2014). In this study, datasets GSE20916 and GSE39582 were utilized to construct co-expression networks by WGCNA to identify the key orange module related to the progression of COAD.
Herein, we found eight hub genes for the progression of COAD, including BGN, SULF1, COL1A1, FAP, THBS2, CTHRC1, COL5A2, and  (Xing, Gu, & Ma, 2015). Furthermore, BGN has been reported to promote the chemotherapy resistance via activating nuclear factor kappa-light-chain-enhancer of F I G U R E 6 The disease-free survival analysis of hub gene levels in colon carcinoma patients in TCGA. TCGA, The Cancer Genome Atlas activated B cells signal transduction in colon cancer (Liu, Xu, Xu, Cui, & Xing, 2018). The SULF1 gene encoded the heparin-degrading endosulfatase to inhibit the activation of cell growth factor. The expression of SULF1 is stable in normal tissues but downregulated in cancer cells.
Moreover, cancer cell proliferation and migration could be inhibited by re-expression of SULF1 (Lai, Sandhu, Shire, & Roberts, 2008). The activation of the TGFβ/SMAD transcriptional pathway underlays a novel tumor-promoting role of SULF1 in hepatocellular carcinoma (Dhanasekaran et al., 2015). COL1A1 encodes the pro-α 1 chains of type I collagen.
THBS2 is a thrombospondin family protein and negatively regulates angiogenesis in cancer. Increasing the THBS2 expression in cancer inhibited tumor growth (Sun et al., 2014). THBS2 expression in colorectal cancer was associated with reduced angiogenesis and distant metastasis (Wang et al., 2016). Most recently, miR-135b was reported to promote invasion and metastasis through THBS2 suppression (Nezu et al., 2016).
Higher THBS2 expression in colorectal cancer tissue was found compared with normal tissue (Wang et al., 2016). Furthermore, RBP4 and THBS2 have been reported to serve as biomarkers for the diagnosis of colorectal cancer (Fei et al., 2017).
In addition, CTHRC1 is an extracellular matrix glycoprotein and regulates the Wnt-PCP pathway in developmental morphogenesis (Ma et al., 2014). Recently, CTHRC1 was found to be highly expressed in many human cancers, promoting invasion and metastasis (Tan et al., 2013).
CTHRC1 has been reported to promote proliferation and invasiveness of human colorectal cancer cells by activating Wnt/PCP signaling (Yang et al., 2015). COL5A2, also known as collagen type V alpha 2 chain, involves in the progression of colorectal cancer, breast cancer, and osteosarcoma. COL5A2 was reported to be correlated with poor clinical outcomes of bladder cancer patients, suggesting that it could serve as a F I G U R E 7 The translational differences of hub gene levels between colon carcinoma tissues and the paracancer tissues in the Human Protein Atlas database WEI ET AL.
In our study, Cytoscape and StarBase were used to construct a regulation network of TF-miRNA-mRNA. Although miRNAs like miR-155-5p, miR-199a-5p and miR-206 were clearly associated with the development and progression of colon cancer, miRNAs associated with colon cancer are still not thoroughly explored (Parasramka et al., 2012;Qu et al., 2015;Sakurai et al., 2011). We established the first regulatory network of TF-miRNA-target gene for the progression of COAD, involving 144 TFs, 26 miRNAs, and 7 hub genes, such as model KLF11-miR149-BGN, TCEAL6-miR29B2-COL1A1, and TCEAL6-miR29B2-COL1A2.
In summary, we found a total of eight hub genes (BGN, SULF1,