Identification of novel biomarkers and small molecule drugs in human colorectal cancer by microarray and bioinformatics analysis

Abstract Background Colorectal cancer (CRC) is one of the most common malignant tumors. In the present study, the expression profile of human multistage colorectal mucosa tissues, including healthy, adenoma, and adenocarcinoma samples was downloaded to identify critical genes and potential drugs in CRC. Methods Expression profiles, GSE33113 and GSE44076, were integrated using bioinformatics methods. Differentially expressed genes (DEGs) were analyzed by R language. Functional enrichment analyses of the DEGs were performed using the Database for Annotation, visualization, and integrated discovery (DAVID) database. Then, the search tool for the retrieval of interacting genes (STRING) database and Cytoscape were used to construct a protein–protein interaction (PPI) network and identify hub genes. Subsequently, survival analysis was performed among the key genes using Gene Expression Profiling Interactive Analysis (GEPIA). Connectivity Map (CMap) was used to query potential drugs for CRC. Results A total of 428 upregulated genes and 751 downregulated genes in CRC were identified. The functional changes of these DEGs were mainly associated with cell cycle, oocyte meiosis, DNA replication, p53 signaling pathway, and progesterone‐mediated oocyte maturation. A PPI network was identified by STRING with 482 nodes and 2,368 edges. Survival analysis revealed that high mRNA expression of AURKA, CCNB1, CCNF, and EXO1 was significantly associated with longer overall survival. Moreover, CMap predicted a panel of small molecules as possible adjuvant drugs to treat CRC. Conclusion Our study found key dysregulated genes involved in CRC and potential drugs to combat it, which may provide novel insights and potential biomarkers for prognosis, as well as providing new CRC treatments.


| INTRODUCTION
Colorectal cancer (CRC) is one of the most common cancers with high morbidity worldwide. Despite advances in screening detection and new treatment strategies, CRC remains a leading cause of cancer-associated mortality, due to the lack of effective diagnostic methods at the early stage and reduce sensitivity to chemotherapy (Weiser, 2018). Therefore, it is crucial to understand the precise molecular mechanisms involved in the carcinogenesis and thus develop promising prognostic biomarkers and potential therapeutic targets.
Colorectal cancer is a complex bioprocess following the adenomacarcinoma multistage sequence (Testa, Pelosi, & Castelli, 2018). Thus, molecular dysregulations during the process of carcinogenesis, particularly during the precancerous stage, should be considered as the most important risk factors contributing to the progression of CRC. Several studies have used gene expression profiling to identify key genes between CRC samples and normal samples. Huang, Yang, and Huang (2018) identified hundreds of CRC associated differentially expressed genes (DEGs) based on the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA) database. An, Zhao, Yu, and Yang (2019) found 35 genes significantly associated with patient survival based on the transcription profile of multi-stage carcinogenesis and bioinformatics analysis. Hou et al. (2018) found a collection of DEGs and DNA methylation aberrations in CRC. Hundreds of DEGs were detected. However, DEGs are different in different studies with only some of them consistently detected. Therefore, the discovery of novel effective therapeutic targets against CRC is urgently required.
In this study, we selected the following microarray datasets GSE33113 (de Sousa et al., 2011;Kemper et al.., 2012) and GSE44076 (Closa et al., 2014;Cordero et al., 2014;Sanz-Pamplona et al., 2014;Solé et al., 2014) from the GEO database to identify DEGs. Kyoto encyclopedia of genes and genomes (KEGG) and gene ontology (GO) pathway analysis were used to investigate DEGs. Then we identified hub genes from the common DEGs by constructing protein-protein interaction (PPI) network. Furthermore, the survival analysis was used on the gene expression profiling interactive analysis (GEPIA) website. Candidate small molecules were identified for their potential use in the treatment of CRC.

| Data resources
To investigate the differential gene expression between CRC and normal samples, GSE33113 and GSE44076 microarray datasets were downloaded from the GEO website (http://www. ncbi.nlm.nih.gov/geo/). These RNA profiles were provided on platform GPL570 (Affymetrix Human Genome U113 Plus 2.0 Array) (GSE33113) and GPL13667([HG-U219] Affymetrix Human Genome U219 Array) (GSE44076). A total of 188 CRC samples and 154 normal samples were obtained in our study, including 40 tumor samples and 6 normal samples in GSE33113 profile, 98 tumor samples, and 148 normal samples in GSE44076 profile.

| Identification of DEGs
The original CIMFast Event Language files were downloaded and classified as CRC and normal groups. The raw data were standardized and transformed into expression values using the affy package of Bioconductor (Bioconductor, http:// www.bioco nduct or.org/). The significance analysis of the empirical bayes methods within limma package was applied to identify DEGs between CRC samples and normal tissue sample (Ritchie et al., 2015). Adj.p value <0.01 and |logFC| > 1 were set as the cutoff criteria to select the significant DEGs.

| KEGG and GO enrichment analyses of DEGS
To explore potential biological process (BP), molecular functions (MF), and cellular components (CC) related to the overlap DEGs, the Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://david.ncifc rf.gov) (version 6.7) was introduced to perform functional annotation and pathway enrichment analysis, including GO and KEGG pathway analysis (Ashburner et al., 2000;Dennis et al., 2003;"The Gene Ontology (GO) project," 2006;Huang, Sherman, & Lempicki, 2009;Kanehisa & Goto, 2000). GO is a major bioinformatics tool to annotate genes and analyze BP of these genes. KEGG is a database resource for understanding high-level functions and biological systems from large-scale molecular datasets generated by high-throughput experimental technologies. p value <0.05 was considered statistically significant.

| Protein-protein interaction (PPI) network construction and module analysis
The Search Tool for the Retrieval of Interacting Genes database (STRING, https ://string-db.org/) is an online tool designed to analyze the PPI information (Damian et al., 2015). Analyzing the functional interactions between proteins may provide insights into the mechanisms of generation or development of diseases. In the present study, all the previously identified DEGs were submitted to STRING database for exploring their potential interactions. The interactions with a combined score >0.4 were considered significant and extracted for constructing the PPI networks through the Cytoscape software that is an open source bioinformatics | 3 of 15 CHEN Et al.
software platform for visualizing molecular interaction networks (Bandettini et al., 2012). Subsequently, Molecular Complex Detection (MCODE) was used to screen significant modules from the PPI network with degree cutoff = 2, node score cutoff = 0.2, k-core = 2, and max depth = 100 (Smoot, Ono, Ruscheinski, Wang, Ono, Ruscheinski, Wang, & Ideker, 2011). The functional and pathway enrichment analysis of for the significant modules was also performed. The networks gene oncology tool (BiNGO) plugin of Cytoscape was used to perform and visualize the BP analysis of the hub genes (Maere, Heymans, & Kuiper, 2005).

| Analysis and validation of hub genes
A network of module genes and their co-expression genes was established by cBioPortal online platform (http://www. cbiop ortal.org). To confirm the reliability of hub genes from our detection, we analyzed their prognostic and expression in CRC using Gene Expression Profiling Interactive Analysis (GEPIA), an interactive web application tool for gene expression analysis, containing 8,587 normal samples and 9,736 tumors samples from the Genotype-Tissue Expression databases and TCGA databases (Cerami et al., 2012;Gao et al., 2013;Tang et al., 2017). Then the survival curve and box plot were performed to visualize the relationships. In addition, the protein expression of the hub genes between CRC and normal tissues was determined by the human protein atlas (HPA, www.prote inatl as.org) database, an online tool for analyzing protein level from clinical samples.

| Identification of small molecules
The Connectivity Map (CMap, http://www.broad insti tute. org/cmap/) was used to predict potential small active molecular that may induce or reverse the current biological status encoded by a particular gene expression signature (Lamb et al., 2006). We contrasted the DEGs with those participating in small active molecular interference in the CMap database to find potential small molecular related to these DEGs. First, the overlaps DEGs were divided into upregulated and downregulated groups. Then, these probe sets were used to query the CMap database. Finally, the enrichment score representing similarity was calculated, ranging from −1 to 1. A positive connectivity value (closer to +1) indicated the small molecules could induce the state of CRC cells, whereas a negative connectivity value (closer to −1) indicated greater similarity among the genes and the small molecules could reverse the above cancer cell status.

| Identification of DEGs in CRC
Analyzed with the Limma package, a total of 1,186 overlap DEGs expressed in CRC samples were extracted from the GSE33113 and GSE44076 datasets. The volcano plot of DEGs of CRC in each dataset was presented in Figure 1a. The Venn diagrams showed the 1,186 overlap DEGs among the three datasets ( Figure 1bA) including 428 significantly upregulated genes ( Figure 1bB) and 751 downregulated genes ( Figure 1bC).

| PPI network construction and module analysis
Based on the STRING database, a PPI network of DEGs with 482 nodes and 2,368 interactions was established using the Cytoscape software ( Figure 3). The most significant modules were extracted from this PPI network by MCODE (Figure 4a). The results of signaling pathway enrichment analysis suggested that the module genes were mainly enriched in cell cycle, DNA replication, oocyte meiosis, p53 signaling pathway, and progesterone-mediated oocyte maturation ( Table 2). The BP analysis showed that the module genes were significantly related to DNA replication, DNA strand elongation, and DNA-dependent DNA replication (Figure 4b)

| Analysis and validation of hub genes
In order to validate the correlation between hub genes expression and the clinical characteristics of CRC, HPA database, and the cBioPortal for Cancer Genomics database were used for further analysis. The mining of GEPIA database also demonstrated that DEGs exhibited significant differences in expression between CRC and normal tissues. This result further confirmed that the expression level of these hub genes was closely correlated with the onset of CRC. A total of 270 CRC patients were available from GEPIA database for the overall survival analysis and divided into high expression and low expression groups. It was found that all the four hub genes were upregulated (Figure 5a). Together, the expression level of AURKA, CCNB1, CCNF, and EXO1 could represent the important prognostic biomarkers for predicting the survival of CRC patients (Figure 5b). AURKA, CCNB1, CCNF, and EXO1 were selected as hub genes for further analysis. The full names and function roles for these hub genes were shown in Table 3. The immunohistochemical staining results from HPA database indicated significantly higher positivity for AURKA, CCNB1, CCNF, and EXO1(Not found in HPA database) in cancer tissues than in adjacent normal tissues ( Figure 6a). A network of the module genes and their co-expression genes was constructed using cBioPortal online platform (Figure 6b).

| Identification of related active small molecules
To screen and identify candidate small molecules for potential therapeutic drugs in CRC, we uploaded upregulated and downregulated DEGs groups into the CMap database for Gene Set Enrichment Analysis and then  matched them with small molecule treatment. This procedure aimed to find some small molecules that could reverse the changes of gene expression in CRC. Table 4 and quinostatin showed higher negative correlation and the potential to treat CRC.

| DISCUSSION
Although many studies on CRC development are available, more efforts are needed to identify driver genes and candidate drugs. This study integrated two gene profile datasets and utilized bioinformatics methods to analyze these datasets, and identified 1,179 commonly changed DEGs (428 upregulated and 751 downregulated). Pathway enrichment analysis indicated that that the module genes were mainly enriched in cell cycle, DNA replication, oocyte meiosis, p53 signaling pathway, and progesterone-mediated oocyte maturation. The PPI network was constructed including 482 nodes and 2,368 edges. AURKA, CCNB1, CCNF, and EXO1 were clearly related to the prognosis of patients. In addition, small molecules that can provide new insights in CRC therapeutic studies were identified. Many researchers have found that four key genes were involved in cell cycle, participating in tumorigenesis and tumor proliferation. AURKA has been studied in a wide range of human malignancies and is associated with poor prognosis in several malignancies, including cervical squamous cell carcinoma, hepatocellular carcinoma, and nonsmall cell lung carcinoma (Ma et al., 2017;Wang et al., 2018;Zheng et al., 2018). Besides, Yang et al., (2017) reported that AURKA as a transactivating co-factor in the induction of the c-Myc oncoprotein in breast cancer stem cells (BCSCs). In CRC, AURKA is critical for chromosome 20q amplification-associated malignant transformation in colorectal adenomas (Chuang et al., 2016). CCNB1 acts as a central protein of cell cycle. Owing to its function, it is found that CCNB1 is generally abnormal in tumors. CCNB1 was associated with pathologic grade and metastasis of tumors in cases of human breast and ovarian cancer (Fei et al., 2018). Several literatures found that the expression of CCNB1 was significantly associated with pathologic grade, metastasis, and prognosis of tumors (Ding et al., 2018;Zuryn, Krajewski, Klimaszewska-Wisniewska, Grzanka, & Grzanka, 2019). CCNF, capable of forming Skp1-Cul1-F-box protein ubiquitin ligase complex, is implicated in controlling centrosome duplication and preventing genome instability. Gagat, Krajewski, Grzanka, and T A B L E 3 The full name, functional roles, p-value, and LogFC of hub genes gene symbol Summary GSE33113 GSE44076

AURKA
The protein encoded by this gene is a cell cycle-regulated kinase that appears to be involved in microtubule formation and/or stabilization at the spindle pole during chromosome segregation. The encoded protein is found at the centrosome in interphase cells and at the spindle poles in mitosis. This gene may play a role in tumor development and progression. A processed pseudogene of this gene has been found on chromosome 1, and an unprocessed pseudogene has been found on chromosome 10. Multiple transcript variants encoding the same protein have been found for this gene.   Grzanka (2018) revealed that high expression of CCNF in melanoma patients was associated with worse overall survival. Additionally, with gene network reconstruction, CCNF was regarded as one of the main drivers in cell cycle network in gastric cancer . But Fu et al. (2013) found that CCNF was downregulated in HCC, which was an independent poor prognostic marker for overall survival. However, the function of CCNF in CRC is not clear. The contribution of EXO1 in the safeguarding stability of the genome during DNA replicative and postreplicative processes is well-established. EXO1 activity contributes to several DNA repair processes. EXO1 has been associated with different types of cancers owning to its mutations, including colon, breast, ovarian, lung, pancreatic, and gastric tract cancer (Bau et al., 2009;Hansen et al., 2015;Jin et al., 2008;Sun, Zheng, & Shen, 2002). Nevertheless, the overexpression of EXO1 has also been reported in several other cancers associated with poor prognosis, which in part is related to increased DNA repair activity (Axelsen, Lotem, Sachs, & Domany, 2007;Dai et al., 2018;de Sousa et al., 2017;Muthuswami et al., 2013). In this study, AURKA, CCNB1, CCNF, and EXO1 were significantly upregulated in CRC compared with normal samples, and in CRC patients, the survival rate was positively correlated with the high expression of these genes.
Several small molecules with potential therapeutic efficacy against CRC were identified. The most significant small molecules in our result have been reported to display anticancer activity. DL-thiorphan is served as the specific neutral endopeptidase (NEP) inhibitor, which is widely used to differentiate NEP enzyme activity. NEP enzyme is a membrane-bound metallopeptidase that plays key roles in wound repair (Muangman, Tamura, & Gibran, 2005). DL-thiorphan may be the target and candidate agent for Type 2 Diabetes Treatment (Wang, Zhao, Shang, & Xia, 2014). Besides, thiorphan binding to CD10 might interfere with the processing of neuropeptide hemoregulatory factors and thus influence the progenitor cell proliferation in acute leukemia (Feng et al., 2011). However, there are insufficient evidences indicating DL-thiorphan can be directly used in anticancer. Moreover, repaglininde is a new class of oral antidiabetic agents, which stimulates insulin release within a few minutes by inhibiting ATP-sensitive potassium channels of the beta-cell membrane via binding to a receptor distinct from that of sulphonylureas. A previous study revealed that repaglininde may have direct antitumor effects and have been shown to suppress various types of cancer cells in cell culture and in animal models (Stanovic et al., 2000). On the other hand, El Sharkawi et al reported that repaglininde may have cytotoxic effects against hepatic, breast, and cervical carcinoma cells (El Sharkawi, Shemy, & Khaled, 2014). Thus, we might suppose that these identified drugs could play certain roles to combat CRC. However, further studies were still required to clarify the role of these candidate small molecules in the pathogenesis of CRC.

| CONCLUSION
Using bioinformatics analysis, 1,179 DEGs were identified, which were significantly enriched in several pathways, mainly associated with cell cycle, oocyte meiosis, DNA replication, p53 signaling pathway, and progesterone-mediated oocyte maturation. We also identified key genes including AURKA, CCNB1, CCNF, and EXO1 that might play important roles in CRC and that might represent novel biomarkers in CRC prognosis and therapy. Additionally, a group of small molecules was identified that might be exploited as adjuvant drugs for improved therapeutics for CRC. However, further investigations are required to validate the predicted drugs.

ACKNOWLEDGMENT
This work was supported by grants from the National Natural Science Foundation of China (21804072) and NanTong Technology Projects (MS12017007-3).

CONFLICT OF INTEREST
The authors report no conflict of interest in this work.