Analysis of potential genes and pathways associated with the colorectal normal mucosa–adenoma–carcinoma sequence

Abstract This study aimed to identify differentially expressed genes (DEGs) related to the colorectal normal mucosa–adenoma–carcinoma sequence using bioinformatics analysis. Raw data files were downloaded from Gene Expression Omnibus (GEO) and underwent quality assessment and preprocessing. DEGs were analyzed by the limma package in R software (R version 3.3.2). Gene Ontology (GO) enrichment analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis were performed with the DAVID online tool. In a comparison of colorectal adenoma (n = 20) and colorectal cancer (CRC) stage I (n = 31), II (n = 38), III (n = 45), and IV (n = 62) with normal colorectal mucosa (n = 19), we identified 336 common DEGs. Among them, seven DEGs were associated with patient prognosis. Five (HEPACAM2, ITLN1, LGALS2, MUC12, and NXPE1) of the seven genes presented a sequentially descending trend in expression with tumor progression. In contrast, TIMP1 showed a sequentially ascending trend. GCG was constantly downregulated compared with the gene expression level in normal mucosa. The significantly enriched GO terms included extracellular region, extracellular space, protein binding, and carbohydrate binding. The KEGG categories included HIF‐1 signaling pathway, insulin secretion, and glucagon signaling pathway. We discovered seven DEGs in the normal colorectal mucosa–adenoma–carcinoma sequence that was associated with CRC patient prognosis. Monitoring changes in these gene expression levels may be a strategy to assess disease progression, evaluate treatment efficacy, and predict prognosis.


Introduction
Colorectal cancer (CRC) is the third leading cause of cancer and cancer-related death in patients with cancer worldwide, accounting for more than 134,000 estimated new cases and 49,000 estimated deaths in 2016 [1]. The five-year survival rate of CRC is approximately 65% in high-income countries but is <50% in low-income countries [2][3][4]. From 2000 to 2014, the mortality of CRC decreased by 18% in individuals aged ≥50 years due to the extensive use of traditional screening methods, including flexible ORIGINAL RESEARCH Analysis of potential genes and pathways associated with the colorectal normal mucosa-adenoma-carcinoma sequence colonoscopy, barium enema X-ray, and fecal blood testing [5]. However, these tests have non-negligible shortcomings, including bleeding, perforation, and acute diverticulitis, as well as a variable sensitivity ranging from 37% to 80% [6,7]. Therefore, sensitive and specific biomarkers are urgently needed to improve the rate of early diagnosis, to help manage individual therapy and to predict the prognosis of patients in different stages of the disease.
Most CRC cases develop slowly through the normal mucosa-adenoma-carcinoma sequence over several years [8]. During this multistep process of colorectal tumorigenesis, many factors play important roles, including old age, smoking, alcohol, a high-fat diet, and lack of physical exercise [9]. In recent decades, multiple genes and signaling pathways have been shown to participate in the initiation and development of CRC. Kinzler and Fearon et al. reported that APC inactivation was an early event in more than 70% of colorectal adenomas and carcinomas [10,11]. KRAS and TP53 mutations participated in the adenoma-carcinoma sequence [11]. Liu et al. [12] found that low miR-126 and high CXCR4 protein expression were associated with poor prognosis in colorectal cancer. Tsukamoto et al. [13] reported that overexpression of osteoprotegerin in human colorectal cancer might be a predictive biomarker of CRC recurrence and a potential target for individual treatment of this disease. Dynamic changes of genes in different stages have important roles in the occurrence and development of CRC, as well as the treatment and prognosis of this disease [14][15][16][17]. These differentially expressed genes (DEGs) may show changes that correspond to their functions in the different stages of CRC, which lead to different survival outcomes [18]. Stage at diagnosis is an important prognostic factor for patients with CRC. Siegel et al. [2] found that the fiveyear survival rate of patients diagnosed with CRC ranges from 90.1% in stage I to 11.7% in stage IV. Thus, it is important to identify DEGs during the normal mucosaadenoma-carcinoma sequence, which will help elucidate the molecular mechanisms involved in the occurrence and development of CRC, provide potential biomarkers for diagnosis at the early stage, and suggest potential targets for individual therapy.
Bioinformatics is a newly emerging scientific field that combines biology, mathematics, and information technology, making it possible to analyze large and increasingly complex molecular datasets. Microarray assays can acquire expression information on thousands of genes simultaneously and explore the genomic alterations associated with the progress of colorectal initiation and development [19]. Extensive genetic information is available online due to the development of public cancer databases, such as The Cancer Genome Atlas (TCGA), Oncomine, Gene Expression Omnibus (GEO), and others, which are repositories for microarray data retrieval and deposit. Online datasets can help enlarge the sample size and increase the statistical power. For example, Fu et al. [20] identified 72 miRNA-mRNA pairs along with 22 dysregulated miRNAs and their 58 target mRNAs that were involved in CRC tumorigenesis by a combination of expression profiling and bioinformatics analysis. Robles et al. [21] found that the CRC that developed in patients with IBD had different genetic characteristics from sporadic CRC with whole-exome sequencing analysis, providing possible genetic biomarkers for diagnosis and treatment of patients with IBD and CRC.
In our study, we aimed to identify DEGs related to the colorectal normal mucosa-adenoma-carcinoma sequence. Original data were downloaded from GEO and analyzed with R software (R version 3.3.2). Gene expression levels in colorectal adenoma and CRC stage I, II, III, and IV were compared with those in normal colorectal mucosa. We eventually identified seven potential DEGs related to CRC patient prognosis and explored their function by performing Gene Ontology (GO) analysis and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis.

Screening microarray data
The GEO database was systematically searched without language, race, region, and time restrictions (up to 7 January 2017). The advanced search strategy insured the comprehensiveness of the search results (see Table S1 for search strategy details). These inclusion criteria were as follows: (1) total RNA was extracted from frozen colorectal tissue sections; (2) datasets had the original gene expression data files. RNA extracted from frozen tissues shows little degradation. This was the basis for the qualified microarray. In addition, the original gene expression data files could realistically indicate the microarray quality. Evaluating chip quality and rejecting unqualified chip data insured the accuracy of the subsequent analysis. The exclusion criteria were as follows: (1) the genome-wide gene expression profile was not generated by Affymetrix Human Genome U133 Plus 2.0 Array; (2) No disease staging information was present; (3) Frozen tissue sections came from patients who might have received antitumor treatments previously. At present, there is no readymade dataset containing normal colorectal mucosa, adenoma, and carcinoma with all four stages in one dataset. However, there were several datasets containing either colorectal adenoma or carcinoma at different stages. The integration of these different datasets combined the normal colorectal mucosa, adenoma, and carcinoma at all four stages together, Genes and Pathways in Colorectal Cancer Progress Z. Wu et al.
making it possible to analyze genetic changes in the progression of colorectal cancer. Different datasets had heterogeneity, but the heterogeneity was much smaller if the datasets were generated from one platform. The Affymetrix Human Genome U133 Plus 2.0 Array was selected because this platform generated the most available datasets for further analysis. At the same time, it was necessary to exclude the influence of the therapeutic factors on the gene expression level. Thus, patients who previously received antitumor treatments were excluded.

Evaluation of microarray quality
The selected gene expression data were downloaded from the GEO database. These raw data files were from the Affymetrix platform, which could be analyzed by the affy package [22] in R software (R version 3.3.2). Before data preprocessing, all the microarrays were evaluated for quality by quality control (QC), relative logarithmic expression (RLE), normalized unscaled standard errors (NUSE), and RNA degradation curve [23,24]. QC is an overall assessment of the microarray quality, which primarily consists of the present percentage, background noise, scale factor, GAPDH 3'/5' ratio, and actin 3'/5' ratio. RLE and NUSE can both evaluate the consistency of the data, while NUSE is more sensitive. The RNA degradation curve plays an important role in evaluating the degradation of the microarray. Through quality assessment, poor quality data were removed.

Processing of microarray data
Data preprocessing was performed with a standard robust multiarray average (RMA) method, including background correction, normalization, and logarithmic conversion [25]. The raw data were converted to probe-level data after the RMA algorithm and were then transformed to gene symbols in R software [26]. The gene expression levels were the mean of the probes when multiple probes corresponded to one gene symbol. The batch effect could be due to different experimental times, methods, experimenters, datasets, platforms, and many other unpredictable factors, which might affect the accuracy of the data analysis. However, the datasets generated from one platform have a much smaller batch effect. In addition, the batch effect was evaluated by the expression level of housekeeping genes in each dataset to judge whether batch effects have a significant impact on our conclusions. The DEGs were identified by the limma package in R software [27]. Only genes with |log 2 FC| >1 (FC: fold change) and an adjusted P-value <0.05 were considered DEGs. Then, all DEGs underwent prognostic analysis with the survival information from TCGA. TCGA database had no separate colorectal adenoma data, and thus, this information was only used to validate the DEGs in colorectal cancer with four stages with RNA-seq data.
Functional and pathway enrichment analysis GO annotates and classifies genes based on three categories, including biology process, molecular function, and cellular component [28,29]. KEGG pathway interprets pathway maps of molecular interactions, reactions, and relation networks [30]. In our analysis, the DAVID online tool was used to perform GO enrichment and KEGG pathway analysis of the identified DEGs and many other related background DEGs with threshold P-values <0.05.

Basic characteristics of the microarray data
A total of 592 search results were obtained by our search strategy (See Table S1 for search strategy details). Thirtyeight datasets met the two inclusion criteria. Their RNAs were all extracted from frozen colorectal tissue sections. In addition, these 38 datasets had the original gene expression data files. Based on the exclusion criteria, 19 datasets were excluded because the Affymetrix Human Genome U133 Plus 2.0 Array was not used. One dataset was excluded because it had no staging information. Thirteen datasets with patients who might have received antitumor treatments shortly before were also excluded. Thus, only five datasets (GSE4183, GSE14333, GSE39582, GSE8671, and GSE10714) were finally eligible for analysis ( Fig. 1, Table S2).
These five datasets had a total of 971 raw data files, which were from the Affymetrix platform. According to the staging information, data were divided into six groups: normal mucosa, adenoma, and CRC stage I, II, III, and IV. Quality assessment was performed for all these raw data files by QC, RLE, NUSE, and the RNA degradation curve (Fig. S1). At the same time, the consistency of the data volume of each group and the microarray quality were taken into account, and 215 data profiles were finally included in the analysis (Table 1). In addition, the batch effect was evaluated with the expression level of GAPDH across the different datasets, and heterogeneity was not significant (Fig. S2).

Identification of DEGs and prognosis analysis
The gene expression levels in the colorectal adenoma and CRC stage I, II, III, and IV were compared to those in normal colorectal mucosa. The five comparison groups were normal mucosa-adenoma, normal mucosa-CRC stage I, normal mucosa-CRC stage II, normal mucosa-CRC stage III, and normal mucosa-CRC stage IV. With a threshold of |log 2 FC| >1 and an adjusted P-value <0.05, 645 DEGs were identified between the normal mucosa and the adenoma. In addition, there were 1059, 1183, 1195, and 1100 DEGs corresponding to normal mucosa-CRC stage I, normal mucosa-CRC stage II, normal mucosa-CRC stage III, and normal mucosa-CRC stage IV, respectively. Then, 336 common DEGs were extracted from these five comparison groups ( Fig. 2A). Among these 336 common DEGs, 87 genes were identified with an ascending or descending trend through the normal mucosa-adenoma-carcinoma sequence (Table S3). Using the survival information of TCGA, we eventually selected six DEGs related to patient prognosis. They were HEPACAM2, ITLN1, LGALS2, MUC12, NXPE1, and TIMP1 ( Fig. 3A-F, Table 2). Five of the six genes presented a descending trend in expression with tumor progression, while one gene presented an ascending trend.
We also obtained another six common DEGs from these five comparison groups with a threshold of |log 2 FC| >4 and an adjusted P-value <0.05. They were AQP8, CLCA4, CLDN8, GCG, GUCA2A, and MS4A12 (Fig. 2B, Table 3). These genes were all downregulated compared with the gene expression levels in normal mucosa. Among them, only GCG was considered related to patient prognosis (Fig. 3G, Table 3).
A total of seven DEGs from the GEO database associated with prognosis were obtained by bioinformatics analysis. They were HEPACAM2, ITLN1, LGALS2, MUC12, NXPE1, TIMP1, and GCG. The expression patterns of these seven DEGs were also confirmed by 672 RNA-seq data from TCGA. Their expression levels all presented the same trend as that in the GEO database except MUC12 (Table 4). However, MUC12 was also downregulated in colorectal cancer tissues compared with normal colorectal mucosa. The expression of these seven DEGs in colorectal adenomas could not be verified because TCGA database had no separate adenoma data.

GO term enrichment analysis
Using a threshold of |log 2 FC| >1 and an adjusted Pvalue <0.05, we identified 645, 1059, 1183, 1195, and 1100 DEGs in the comparisons of normal mucosa-colorectal adenoma, normal mucosa-CRC stage I, normal mucosa-CRC stage II, normal mucosa-CRC stage III,  and normal mucosa-CRC stage IV, respectively. Combining these genes together, we obtained a total of 1805 background DEGs in a union including the seven identified DEGs.
These 1805 DEGs were uploaded to the DAVID online tool to perform GO analysis and KEGG pathway analysis to explore the possible biological functions and signaling pathways of the DEGS. The results of the seven identified DEGs were extracted separately. In the biology process GO category, the functional enrichment of the seven DEGs was scattered so that no common biology process was found among these seven DEGs. In the cellular component GO category, TIMP1, GCG, and NXPE1 were related to extracellular region, and TIMP1 and GCG were related to extracellular space. In the molecular function GO category, HEPACAM2, LGALS2, TIMP1, and GCG were associated with protein binding, and ITLN1 and LGALS2 were associated with carbohydrate binding. Additionally, these DEGs had some other specific classifications (Table 5).

KEGG pathway analysis
The KEGG pathway enrichment analysis indicated that TIMP1 might participate in the HIF-1 signaling pathway, and GCG might play a role in the insulin secretion and glucagon signaling pathway (Table 6).

Discussion
Sequential changes of gene expression in different stages play essential roles in the colorectal normal mucosaadenoma-carcinoma sequence [31]. Many studies have confirmed that DEGs participated in the progress of CRC. Heijink et al. [32] reported that caspase-8 and cellular flice-like inhibitory protein (cFLIP) expression induced colorectal carcinogenesis independently in sporadic and hereditary nonpolyposis colorectal cancer (HNPCC)associated adenomas and carcinomas. Galamb et al. [24] showed that downregulated amnionless homolog (AMN) and prostaglandin-D2 receptor (PTGDR) and upregulated osteopontin and osteonectin were potential biomarkers of colorectal carcinogenesis and disease progression. However, most studies only focused on the colorectal carcinogenesis process, from normal colorectal mucosa to CRC [33,34]. Many other studies have examined the macro classification, such as from normal colorectal mucosa to adenoma and then to CRC with all stages mixed [35,36]. Comparatively speaking, our study had a more detailed grouping because there were five comparison groups in total. The gene expression levels in the colorectal adenoma and CRC stage I, II, III, and IV were compared with those of the normal colorectal mucosa. This has not been performed previously in the current literature. By analyzing public gene data from GEO and TCGA in R software   (R version 3.3.2), we eventually identified seven DEGs potentially related to CRC patient prognosis and explored their function by performing GO analysis and KEGG pathway analysis. These findings may help elucidate the molecular mechanisms involved in the initiation and development of CRC, provide potential biomarkers of early diagnosis, help manage potential targets in individual therapy, and predict the prognosis of patients at different stages of the disease.
In our study, we downloaded GSE4183, GSE14333, GSE39582, GSE8671, and GSE10714 from the public database GEO, which is an international public repository for microarray data retrieval and deposit. These data were submitted by the investigators, and there was a lack of quality review and evaluation [37]. Thus, evaluating the quality of microarray assays is important. QC is an overall assessment of the microarray quality, which primarily consists of present percentage, background noise, scale factor, GAPDH 3'/5' ratio, and actin 3'/5' ratio [24]. RLE and NUSE can both evaluate the consistency of the data, while NUSE is more sensitive. The RNA degradation curve plays an important role in evaluating the degradation of the microarray [24]. For RLE, the box chart center of each data profile in the high-quality dataset should be close to the position of the ordinate 0. For NUSE, it should be close to 1. If the slope of the RNA degradation curve is close to 0, it indicates that the degradation of the microarray is serious, and these data should be removed. The raw data must go through these quality evaluations, and only qualified data can be entered into the next data processing step to insure the reliability of the subsequent analysis [38].
We obtained seven DEGs of interest, which were HEPACAM2, ITLN1, LGALS2, MUC12, NXPE1, TIMP1, and GCG, and all were associated with patient prognosis. HEPACAM2, ITLN1, LGALS2, MUC12, and NXPE1 presented a sequentially descending trend in expression with tumor progression. In contrast, TIMP1 presented a sequentially ascending trend. Furthermore, GCG showed constant downregulation compared with the gene expression level in normal mucosa. Among these seven DEGs, TIMP1 and GCG have been studied extensively. TIMP1 is a member of the tissue inhibitors of metalloproteinase (TIMP) family that regulates matrix metalloproteinases (MMPs) and disintegrin metalloproteinases [39]. Recent studies reported that the dysregulated activity of TIMP1 was associated with cancer progression [40]. Increased expression of TIMP1 was shown to predict worse prognosis of laryngeal carcinoma [41] and melanoma [42]. Many studies have reported the TIMP1 was upregulated in both early and advanced CRC [43,44], and it possibly acted as a prognostic biomarker involved in liver metastasis of CRC [45,46]. In our study, we found that TIMP1 presented a sequentially ascending trend through the normal colorectal mucosa-adenoma-carcinoma sequence, and the upregulation of TIMP1 indicated a poor survival prognosis, consistent with previous studies. GCG is involved in the regulation of incretin synthesis, secretion, inactivation, and RET signaling. Diseases related to GCG are diabetes [47] and other metabolic diseases [48]. Drucker [49] reported that the protein encoded by GCG was a preproprotein, which could be cleaved into four mature peptides and regulated cell proliferation, differentiation, and apoptosis. However, few studies have focused on the role of GCG in CRC progression. We discovered that GCG expression was downregulated in both adenomas and carcinomas, which was also confirmed by Spisak et al. [50].
There are few studies about HEPACAM2, ITLN1, LGALS2, MUC12, and NXPE1, even rare in CRC. These  five genes all presented a sequentially descending trend in expression through the normal colorectal mucosaadenoma-carcinoma sequence. HEPACAM2 is a member of the immunoglobulin family of adhesion genes. The clinical importance of HEPACAM2 in CRC remains unclear. ITLN1 encodes intelectin-1, which functions as a receptor for both bacterial arabinogalactans and lactoferrin. Li et al. [51] noted that intelectin-1 suppressed tumor progression and was associated with improved survival in gastric cancer. However, there is no research exploring the function of ITLN1 in CRC.
LGALS2 encodes galectin-2, which participates in non-small-cell lung cancer [52], coronary heart disease [53], and ischemic stroke [54] instead of CRC. MUC12 is a member of mucin family. Matsuyama et al. [55] reported that MUC12 mRNA expression was an independent marker of prognosis in stage II and stage III colorectal cancer. For NXPE1, we found two studies, which were bioinformatics studies, but they did not determine the function of this gene. Although there are few studies on the functions of these genes, we used GO analysis and KEGG pathway analysis to predict the possible function of the genes and the possible signaling pathways, providing a direction for subsequent functional research.
The expression pattern of these seven DEGs was confirmed by 672 RNA-seq data on TCGA. Their expression levels presented the same trend as that in the GEO database, except for MUC12. However, MUC12 was also downregulated in colorectal cancer tissues compared with normal colorectal mucosa, and its expression levels in our own patient validation cases were consistent with the predictions of the GEO database, indicating that MUC12 is a promising marker. However, one-third of cases in TCGA were extracted again for further prognosis validation, and all seven DEGs showed a close relationship with patient prognosis (Table S4).
We also confirmed the expression levels and the relationship with prognosis of these seven DEGs using patients in our hospital for medical treatment. Gene sequencing was performed on the surgical samples of 28 patients with CRC who had not received antitumor treatment. Consistent with the bioinformatics predictions, MUC12 and NXPE1 presented a sequentially descending trend in expression with tumor progression, and TIMP1 presented a sequentially ascending trend (Table 5). Among these genes, only NXPE1 was considered related to patient prognosis of the three-year survival rate (Fig. S3, Table 7). There was a much larger mismatch between the validation results of our own patient cases and TCGA RNA-seq data, probably because of the scarcity of patients compared with the large number of validation cases in TCGA. Nevertheless, our findings suggest that DEGs play an important role in the development of CRC.
There are several limitations in our study. First, the amount of data we obtained from the GEO database and our validation were still not sufficient. However, these were all the data we could obtain from the GEO while still insuring data quality. In the future, more qualified patients in our hospital for medical treatment should be followed up. Second, the data in the GEO database are based on different experimental studies, and there was a lack of uniform standards, which would add heterogeneity to our findings. Thus, we attempted to minimize heterogeneity and to insure the rigor of the data by only including GPL570 platform data, performing quality assessment of the microarray and preprocessing the data. Although the batch effect was evaluated across the different datasets, and the heterogeneity was not significant, it still exists.
In conclusion, we discovered seven DEGs through the normal colorectal mucosa-adenoma-carcinoma sequence associated with CRC patient prognosis. Six genes that present sequential expression level changes with different stages might reflect the degree of tumor progression. One downregulated gene might play a key role in the early stages of neoplasia. Monitoring changes in these gene expression levels will allow us to assess disease progression, evaluate treatment efficacy, and predict prognosis. In addition, our study provides a set of useful targets for further functional research exploring the molecular mechanisms and uncovering new therapeutic targets.

Ethical Approval
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. For this type of study, formal consent is not required.

Supporting Information
Additional supporting information may be found in the online version of this article: Figure S1. Quality assessment. Figure S2. Batch effect was evaluated with the expression level of GAPDH across the different datasets, and heterogeneity was not significant (P > 0.05). Figure S3. Survival curve of A. HEPACAM2, B. ITLN1, C.
LGALS2, D. MUC12, E. NXPE1, F. TIMP1 and G. GCG from patients in our hospital for medical treatment. Table S1. Search strategies. Table S2. Summary of thirty-eight datasets. Table S3. Eighty-seven genes present sequentially expression level changes through normal colorectal mucosaadenoma-carcinoma sequence. Table S4. Further prognosis validation with one-third cases on TCGA.