Identification of genes associated with cancer progression and prognosis in lung adenocarcinoma: Analyses based on microarray from Oncomine and The Cancer Genome Atlas databases

Abstract
Background: Lung adenocarcinoma (LUAD) accounts for approximately 40% of all lung cancer patients. There is an urgent need to understand the mechanisms of cancer progression in LUAD and to identify useful biomarkers to predict prognosis.
Methods: In this study, the Oncomine database was used to identify genes that potentially contribute to cancer progression. Bioinformatics analyses, including pathway enrichment and text mining, were used to explain the potential roles of the identified genes in LUAD. The Cancer Genome Atlas database was used to analyze the association of gene expression with survival.
Results: Eighty genes were significantly dysregulated in LUAD according to four microarray studies covering 356 cases of LUAD and 164 cases of normal lung tissue. Twenty genes were consistently and stably dysregulated by more than twofold. Ten of the 20 genes were associated with overall survival or disease-free survival in a cohort of 516 LUAD patients, and 19 genes were associated with tumor stage, gender, age, lymph node, or smoking. Low expression of AGER and high expression of CCNB1 were specifically associated with poor survival.
Conclusion: Our findings suggest that AGER and CCNB1 might be potential diagnostic and prognostic biomarkers and targets for LUAD.


| INTRODUCTION
Lung cancer (LC), especially non-small-cell lung cancer (NSCLC), is the leading cause of cancer death worldwide and is associated with significant morbidity and poor prognosis (Mehta, Dobersch, Romero-Olmedo, & Barreto, 2015; Siegel, Naishadham, & Jemal, 2013). Lung adenocarcinoma (LUAD), a subtype of NSCLC, accounts for approximately 40% of all lung cancer patients and is one of the most genetically characterized human epithelial malignancies (Bender, 2014). In spite of recent improvements in clinical therapy, the 5-year survival rate of NSCLC patients remains lower than 20% (Allemani et al., 2015), owing to the low rate of diagnosis at an early stage and to frequent cancer recurrence and metastasis. There is an urgent need to identify novel diagnostic and prognostic markers to improve the survival of LC patients.
Bioinformatics analyses, including the use of microarray expression datasets (Stuart, Segal, Koller, & Kim, 2003), protein/gene-protein/gene interaction networks (Ivanov et al., 2018), and gene annotation (Phuong & Nhung, 2013), are powerful tools for studying cancer progression and for identifying serum biomarkers (Hormigo et al., 2006; Huddleston, Wong, Welch, Berkowitz, & Mok, 2005) as well as potential therapeutic targets (Armstrong et al., 2003; Ye et al., 2003). The large amounts of data generated by these tools are collected in public archives such as The Cancer Genome Atlas (TCGA) (DeSantis, Ma, Bryan, & Jemal, 2014), Oncomine (Rhodes et al., 2004), and Gene Expression Omnibus (GEO) (Barrett et al., 2013). An increasing number of studies have used these public databases to screen for and identify novel biomarkers for diagnosis and prognosis. For instance, by retrieving data from Oncomine and TCGA, Yin et al. (2016) identified a group of genes related to cancer progression and prognosis in hepatocellular carcinoma. Liu et al. (2015) identified six genes that may be potential therapeutic targets and biomarkers for diagnosis and prognosis in ovarian cancer, based on data retrieved from Oncomine, GEO, and TCGA. Thus, bioinformatics analysis is a feasible and valuable method to mine data and predict gene function.
In this study, using mRNA expression profiles retrieved from the Oncomine online database, we identified 80 dysregulated genes in LUAD and annotated several biological processes closely associated with the progression and development of LUAD. For the 20 stably and consistently dysregulated genes in LUAD, we retrieved mRNA expression data and clinical information from the TCGA LUAD project to identify genes associated with prognosis in LUAD.

| Data source
Microarray data were selected from the Oncomine database (http://www.oncomine.org/resource/login.html). Initially, 12 datasets were found when we used the following filters: (a) analysis type: differential analysis-cancer versus normal analysis; (b) cancer type: lung cancer-non-small-cell lung cancer; and (c) dataset filters: data type-mRNA. In order to retrieve the stably and consistently dysregulated genes in LUAD, we subsequently selected four studies from the 12 datasets according to the following criteria: (a) lung adenocarcinoma versus normal; (b) sample number greater than 50; and (c) microarray platform Human Genome U133 or U133 Plus 2.0. Finally, genes that were significantly dysregulated in LUAD tissues were identified based on four microarray studies: Hou Lung (45 LUADs vs. 65 lung tissues) (Hou et al., 2010), Landi Lung (58 LUADs vs. 49 lung tissues) (Landi et al., 2008), Okayama Lung (226 LUADs vs. 20 lung tissues) (Okayama et al., 2012), and Su Lung (27 LUADs vs. 30 lung tissues) (Su et al., 2007). In total, the four studies include 356 cases of LUAD and 164 cases of normal lung tissue (Supporting Information Table S1). The rank for a gene is the median rank for that gene across the analyses.
mRNA expression and clinical information, including age, gender, smoking status, overall survival time (OS), disease-free survival time (DFS), TNM stage, metastasis, and lymph node metastasis, of 522 LUAD patients in a TCGA cohort were retrieved from the TCGA database (https://cancergenome.nih.gov/), but only the 516 samples with matched gene expression and clinical data were used to analyze the clinical importance of the genes identified in this study.

Clinical practice points
1. An increasing number of studies have used public databases, such as Oncomine and The Cancer Genome Atlas (TCGA), to screen for and identify novel biomarkers for diagnosis and prognosis.
2. In the present study, using mRNA expression profiles retrieved from the Oncomine online database, we identified 80 dysregulated genes in LUAD and annotated several biological processes closely associated with the progression and development of LUAD. For the 20 stably and consistently dysregulated genes in LUAD, we retrieved mRNA expression data and clinical information from the TCGA LUAD project to identify genes associated with prognosis in LUAD.
3. The findings indicate that AGER and CCNB1 might be useful biomarkers for diagnosis and prognosis and could be potential therapeutic targets for LUAD treatment in clinical practice.
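The median-rank rule stated in this section can be illustrated with a minimal sketch. The study labels, gene names, and rank values below are hypothetical placeholders, not data from the four Oncomine studies, and restricting the calculation to genes ranked in every study is an assumption made here for simplicity.

```python
from statistics import median


def median_rank(ranks_by_study):
    """Return {gene: median rank} for genes that are ranked in every study."""
    studies = list(ranks_by_study.values())
    # Keep only genes present in all studies (assumption of this sketch).
    shared = set(studies[0]).intersection(*studies[1:])
    return {gene: median(s[gene] for s in studies) for gene in sorted(shared)}


# Hypothetical ranks (1 = most significant) in four hypothetical studies.
example_ranks = {
    "study_A": {"GENE1": 1, "GENE2": 5, "GENE3": 40},
    "study_B": {"GENE1": 3, "GENE2": 2, "GENE3": 35},
    "study_C": {"GENE1": 2, "GENE2": 9, "GENE3": 50},
    "study_D": {"GENE1": 8, "GENE2": 4, "GENE3": 45},
}

combined = median_rank(example_ranks)
for gene, rank in sorted(combined.items(), key=lambda item: item[1]):
    print(gene, rank)
```

With an even number of studies, the median is the mean of the two middle ranks, so a gene ranked 1, 2, 3, and 8 receives a combined rank of 2.5.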

| Bioinformatics analyses
Gene Ontology (GO) terms supply the annotation of genes and describe the functions of genes or their proteins in three categories: cellular component (CC), biological process (BP), and molecular function (MF) (Gene Ontology Consortium, 2015; Harris et al., 2004). The Kyoto Encyclopedia of Genes and Genomes (KEGG, http://www.genome.jp/) is a widely used database that supplies the molecular functions of genes and proteins (Kanehisa, Sato, Kawashima, Furumichi, & Tanabe, 2016). The Database for Annotation, Visualization, and Integrated Discovery (DAVID, Jiao et al., 2012) combines comprehensive biological knowledge with a series of analytic tools for extracting biological themes from lists of genes or proteins. GO enrichment analysis and KEGG pathway enrichment analysis of target genes were performed using the DAVID online tool. A p-value <0.05 was chosen as the cutoff criterion for both GO functional enrichment analysis and KEGG pathway enrichment analysis. Function prediction based on text mining was performed using the Coremine Medical online database (http://www.coremine.com/medical/).
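The idea behind term enrichment with a p < 0.05 cutoff can be sketched as a one-sided hypergeometric (over-representation) test. This is a simplified illustration and not DAVID's own computation, which to our understanding reports a more conservative modified Fisher exact p-value; the function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_background, n_term, n_list, n_overlap):
    """P(X >= n_overlap) when n_list genes are drawn without replacement
    from n_background genes, of which n_term carry the annotation."""
    upper = min(n_term, n_list)
    total = comb(n_background, n_list)
    tail = sum(
        comb(n_term, k) * comb(n_background - n_term, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical counts: 20,000 background genes, 200 annotated to a term,
# an 80-gene input list, 6 of which carry the annotation.
p_value = hypergeom_enrichment_p(20000, 200, 80, 6)
print("enriched at p < 0.05:", p_value < 0.05)
```

A term would be reported as enriched when the input list contains more annotated genes than expected by chance at the chosen cutoff; multiple-testing correction is deliberately left out of this sketch.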

| Data analysis
Gene expression values were categorized as high or low using the median value as the cutoff for the analysis of clinical characteristics in the TCGA cohort. The association of gene expression frequency with age, gender, smoking status, TNM stage, and lymph node number was analyzed by Pearson's chi-square test. The Kaplan-Meier method and log-rank test were used for survival analyses. Univariate and multivariate Cox proportional hazards regression models were used to calculate hazard ratios (HR) for overall survival (OS) and disease-free survival (DFS) according to gene expression status (high or low). Statistical analyses were performed at a two-tailed α level of 0.05, using SPSS software version 21.0.
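The dichotomization and chi-square steps described above can be sketched as follows. This is an illustration only, not the SPSS procedure used in the study: assigning values equal to the median to the low group and omitting the continuity correction are both assumptions, and the counts are hypothetical.

```python
from math import erfc, sqrt
from statistics import median


def split_by_median(values):
    """Label each value 'high' if above the median, otherwise 'low'.
    Ties with the median go to 'low' (an assumption of this sketch)."""
    cut = median(values)
    return ["high" if v > cut else "low" for v in values]


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table
    [[a, b], [c, d]]; returns (statistic, two-sided p-value, 1 df)."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


labels = split_by_median([1.2, 3.4, 0.7, 5.1])

# Hypothetical counts: rows = high/low expression, columns = two clinical groups.
stat, p = chi_square_2x2(30, 20, 20, 30)
print(labels, round(stat, 3), round(p, 4))
```

Each gene would be tested against each clinical variable in this way, with an association called significant at the two-tailed α level of 0.05.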

| Functional enrichment analyses
Gene Ontology functional enrichment analysis of these 80 genes using the DAVID online tool indicated that a total of 33 terms were significantly enriched (Supporting Information Table S3). Extracellular matrix organization and response to drug were the top two biological processes, covering 12 genes; extracellular region was the top cellular component, covering 14 genes; and protein binding was the top molecular function, covering 45 genes (Supporting Information Table S3). KEGG pathway enrichment analysis showed that the 80 genes were enriched only in the focal adhesion pathway (Supporting Information Table S3).

| Potential roles of the genes in LUAD progression
The potential roles of the 20 genes in LUAD were predicted on the basis of Coremine Medical text mining. As shown in Figure 2, the associations of the genes with diagnosis, prognosis, drug resistance, recurrence, metastasis, and invasiveness of LUAD were comprehensively analyzed. The results indicated that the 20 genes were all associated with at least one factor contributing to cancer progression, and many of the genes, for example, MLF1IP, SRPK1, CCNB1, COL11A1, ADH1B, SPTBN1, and EDNRB, were closely associated with all of the factors included in this analysis. Nineteen genes were associated with diagnosis, with the exception of MLF1IP. Eighteen genes were associated with metastasis, the exceptions being ADAMTSL3 and FAM189A2. With the exception of TMEM106B and FAM189A2, the other 18 genes were associated with prognosis. Most of the genes were associated with several factors. For instance, AGER was associated with invasiveness, metastasis, diagnosis, and prognosis, and CCNB1 was associated with invasiveness, metastasis, diagnosis, prognosis, drug resistance, and recurrence (Figure 2).

| Analysis of clinical magnitude
The clinical magnitude of the 20 stably and consistently dysregulated genes in LUAD was assessed on the basis of TCGA clinical data. A total of 522 LUAD patient samples were retrieved from a TCGA cohort, but only the 516 samples with mRNA expression values were available to analyze the association of gene expression with clinical characteristics. Gene expression level was categorized as high or low based on the median value, following a previous study (Yin et al., 2016). The association of gene expression with tumor stage, lymph node, metastasis, age, gender, and smoking pack-years was analyzed. Eight genes were significantly associated with stage (p < 0.05; Table 1): high expression of ADH1B, AGER, CLIC5, FAM107A, and GPM6A was associated with early stage, while high expression of CCNB1, CENPU, and GOLM1 was associated with late stage of LUAD. In particular, CCNB1 and CENPU were highly expressed in stage III (63.1%, p = 0.001) and stage IV (60.0%, p = 0.040), respectively. Seven genes had a relationship with lymph node; the frequency of high expression of CCNB1, CENPU, COL11A1, and TMEM106B was significantly higher in patients with more than one lymph node than in those without one (Table 1). Four genes (ADH1B, FAM107A, SLIT3, and TNXB) were highly expressed in female patients, while four other genes (CCNB1, CENPU, HMGB3, and SRPK1) were highly expressed in male patients. In addition, three genes (ADH1B, CLIC5, and GPM6A) were expressed at high levels in LUAD patients aged ≥65 years, while two genes (CCNB1 and CENPU) were expressed at high levels in patients aged <65 years. Finally, six genes were closely related to smoking status. Two genes (CENPU and TMEM106B) showed high expression in patients with ≥40 pack-years of smoking, and the other four genes (ADAMTSL3, EDNRB, FAM107A, and TGFBR3) showed high expression in patients with <40 pack-years (Table 1).

| Survival analysis of 20 genes
Ten of the 20 genes had a relationship with OS and/or DFS (Supporting Information Table S4). Seven genes were ...
To elucidate whether these genes were risk factors for predicting patient survival, we initially performed univariate analysis for the above ten genes. As shown in Table 2, high expression of CCNB1, CENPU, GOLM1, and TMEM106B was a hazard factor for both OS and DFS of LUAD (all HR >1.36, p < 0.05); in contrast, high expression of AGER, CLIC5, and FAM189A2 was associated with better OS and DFS in LUAD patients (all HR <0.73, p < 0.05; Table 2).
Multivariate proportional hazards models assessing the association of OS and DFS with these ten genes were subsequently fitted, adjusting for age, gender, and smoking ... Table 2). We then analyzed the association of the expression of these seven genes with OS and/or DFS in LUAD patients at early stages (stage I + II) and advanced stages (stage III + IV), adjusting for age, gender, and smoking pack-years. As shown in Supporting Information Table S5, AGER (HR: 0.598; 95% CI: 0.380-0.841) and CENPU (HR: 1.807; 95% CI: 1.108-2.949) expression were associated with OS in patients at early stage, while SLIT3 (HR: 0.439; 95% CI: 0.222-0.869) expression was associated with OS in patients at advanced stage. For DFS, the expression of five genes (AGER, CCNB1, CENPU, COLIL5, and FAM189A2) was associated with DFS in patients at early stage, but none of these genes was associated with DFS at advanced stage.
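The OS and DFS comparisons in this section rest on Kaplan-Meier survival estimates for the high- and low-expression groups. A minimal sketch of the product-limit estimator is given below; the follow-up times and event indicators are hypothetical, and the log-rank test and Cox models used in the study are not reproduced here.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct event time.
    times: follow-up durations; events: 1 = event observed, 0 = censored."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after time t.
        at_risk -= removed
    return curve


# Hypothetical follow-up (months) for five patients; 0 marks censoring.
km_curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
for t, s in km_curve:
    print(t, round(s, 3))
```

In an analysis of this kind, one curve would be computed per expression group (high versus low, split at the median) and the two curves compared with the log-rank test.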

| DISCUSSION
With the rapid development of information technology, the ability to collect genomic and clinical information can be used to study disease progression and improve medical treatment (Jiang, Barmada, & Visweswaran, 2010; Schena, Shalon, Davis, & Brown, 1995). One growing source of such information is microarray datasets, which have been widely used to measure the expression levels of a large number of genes simultaneously.
Oncomine is a cancer microarray database and online data-mining platform aimed at promoting discovery from genome-wide expression analyses (Rhodes et al., 2004). To date, Oncomine contains 715 gene expression datasets and 86,733 samples, including 74 lung cancer microarray datasets (https://www.oncomine.org/resource/login.html). In total, ten datasets contain mRNA expression data for LUAD tissue as well as normal lung tissue. Of these, four datasets based on the microarray platform Human Genome U133 or U133 Plus 2.0 were selected to retrieve mRNA expression information and identify the dysregulated genes in LUAD. As a result, 80 genes significantly dysregulated in LUAD were identified from microarray data covering 356 cases of LUAD and 164 normal lung tissues. Twenty genes were further identified as consistently dysregulated by at least twofold in all four microarray studies. The TCGA research network has conducted a large number of cancer studies and released the data to the public, including thousands of microarray datasets from lung cancer samples. TCGA has been successfully used to study the association of genes with drug therapy and survival (Shah et al., 2018), endogenous RNA analysis (Ning et al., 2018), and gene-gene interactions (Wu, Huang, & Ma, 2018) in lung cancer. Therefore, in this study, clinical data and mRNA expression data of LUAD patients were retrieved from the TCGA database to explore the association of gene expression with survival. Cancer is considered to be a disease involving dysregulated cell growth, a process in which cells divide uncontrollably. The causes of cancer progression are complex and diverse. Signaling pathways, comprising series of interactions among multiple molecules within cells, are important biological mechanisms in cell growth and proliferation.
Discovering how these pathways and the molecules within them are associated with cancer has been one of the most essential problems for cancer researchers in the past decades. KEGG is a database resource for understanding high-level functions and utilities of biological systems, such as the cell, the organism, and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies. According to KEGG pathway enrichment analysis using DAVID, the significantly enriched pathway in the present study was focal adhesion, which is closely associated with tumor progression and metastasis. Together, these results indicate that the genes identified in this study might play crucial roles in LUAD progression, probably functioning as a group.
Biomarkers not only have prognostic implications but are also helpful for measuring treatment responses, for surveillance of tumor recurrence, and for guiding clinical decisions (Wong, Xu, Chen, Lee, & Luk, 2013). Thus, prognostic biomarkers for LUAD patients are crucial, and research into predictive biomarkers is ongoing. Coremine Medical text mining suggested that AGER was associated with invasiveness, metastasis, diagnosis, and prognosis, and that CCNB1 was associated with invasiveness, metastasis, diagnosis, prognosis, drug resistance, and recurrence.
In this study, a group of genes associated with DFS and OS was identified in 516 LUAD patients. Among these genes, low expression of AGER, CLIC5, and FAM189A2 and high expression of CCNB1, CENPU, GOLM1, and TMEM106B were associated with poor OS and DFS. High expression of COL11A1 and low expression of FAM107A and SLIT3 were associated with poor OS, but not with DFS. Furthermore, AGER was identified as an independent prognostic risk factor for OS and DFS, while CCNB1 was independently associated with DFS in LUAD patients.
Advanced glycosylation end-product-specific receptor (AGER), also named receptor for advanced glycation end products (RAGE) (Ibrahim, Armour, Phipps, & Sukkar, 2013), is well known as a promoter of inflammation (Nasser et al., 2015). Notably, pulmonary AGER has been shown to be required for allergen-induced accumulation of innate lymphoid cells in the lung (Oczypok et al., 2015). AGER is one of a limited number of pathogen recognition receptors whose expression is downregulated in lung cancer (Rho, Roehrl, & Wang, 2009; Wang, Li, Yu, et al., 2015). In contrast, AGER has been widely reported to be highly expressed in various other types of cancer, including ovarian cancer (Rahimi et al., 2017), breast cancer (Nankali et al., 2016), gastric cancer (Wang, Li, Ye, et al., 2015), and endometrial cancer (Zheng et al., 2016). In the current study, we found that AGER was significantly and consistently downregulated by at least 8.266-fold in LUAD according to four independent microarray datasets. Based on the clinical importance analysis of 516 LUAD patients in a TCGA cohort, low expression of AGER was associated with poor DFS and OS in LUAD patients and was an independent prognostic risk factor for OS. Further study of AGER is needed to better understand its association with LUAD. CCNB1, an important member of the cyclin family, is a key initiator of mitosis and a rigorous quality control point for it. It plays a pivotal role in regulating cyclin-dependent kinase 1 (CDK1) and forms a complex with it, which phosphorylates its substrates to promote the transition of the cell cycle from G2 phase to mitosis (Krek & Nigg, 1991; Morgan, 1995).
Increasing evidence demonstrates that CCNB1 is involved in checkpoint control, whose dysfunction is an early event in tumorigenesis, and that its deregulated expression is observed in a number of different human cancers, including breast cancer, cervical cancer, lung cancer, esophageal squamous cell carcinoma, and melanoma (Kedinger et al., 2013; Kreis et al., 2010; Niméus-Malmström et al., 2010; Nozoe et al., 2002; Yoshida, Tanaka, Mogi, Shitara, & Kuwano, 2004). In parallel, evidence has shown that inhibition of CCNB1 expression renders breast cancer cells more sensitive to the chemotherapy drug taxol (Androic et al., 2008), and that CCNB1 is a biomarker for the prognosis of ER+ breast cancer and for monitoring the efficacy of hormone therapy (Ding, Li, Zou, Zou, & Wang, 2014). In addition, CCNB1 is an independent predictor of HBV-related hepatocellular carcinoma recurrence (Weng et al., 2012). In the present study, high expression of CCNB1 was associated with poor survival and was an independent factor for poor DFS in LUAD patients, especially in patients at an early stage.

| CONCLUSION
In summary, by means of data retrieved from four independent microarray studies, clinical importance analyses in a cohort of 516 patients, and bioinformatics analyses including biological process annotation and text mining, we have identified a group of genes that are significantly dysregulated in LUAD and might be associated with cancer progression, development, and, in particular, prognosis. AGER and CCNB1 might be useful biomarkers for diagnosis and prognosis and could be potential therapeutic targets for LUAD treatment.