Multivariate gene expression‐based survival predictor model in esophageal adenocarcinoma

Background Despite the recent development of molecular‐targeted treatment and immunotherapy, survival of patients with esophageal adenocarcinoma (EAC) with poor prognosis is still poor due to lack of an effective biomarker. In this study, we aimed to explore the ceRNA and construct a multivariate gene expression predictor model using data from The Cancer Genome Atlas (TCGA) to predict the prognosis of EAC patients. Methods We conducted differential expression analysis using mRNA, miRNA and lncRNA transciptome data from EAC and normal patients as well as corresponding clinical information from TCGA database, and gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis of those unique differentially expressed mRNAs using the Integrate Discovery Database (DAVID) database. We then constructed the lncRNA‐miRNA‐mRNA competing endogenous RNA (ceRNA) network of EAC and used Cox proportional hazard analysis to generate a multivariate gene expression predictor model. We finally performed survival analysis to determine the effect of differentially expressed mRNA on patients' overall survival and discover the hub gene. Results We identified a total of 488 lncRNAs, 33 miRNAs, and 1207 mRNAs with differentially expressed profiles. Cox proportional hazard analysis and survival analysis using the ceRNA network revealed four genes (IL‐11, PDGFD, NPTX1, ITPR1) as potential biomarkers of EAC prognosis in our predictor model, and IL‐11 was identified as an independent prognostic factor. Conclusions In conclusion, we identified differences in the ceRNA regulatory networks and constructed a four–gene expression‐based survival predictor model, which could be referential for future clinical research.


Introduction
Cancer statistics in 2016 suggested that case mortality of esophageal cancer ranked sixth globally among all cancers, leading to an estimated 415 thousand deaths 1 and a 20% five-year survival rate. 2 Despite recent developments in molecular-targeted treatment and immunotherapy on the basis of surgery, chemotherapy and radiotherapy, these treatments have not yet shown a significant effect. 3 Esophageal cancer includes adenocarcinoma and squamous cell carcinoma. The incidence of esophageal adenocarcinoma (EAC) has rapidly increased over the past decades and is resistant to current treatment with a poor prognosis. 4 Identifying the biomarkers for its occurrence, development and prognosis is essential for understanding EAC and improving decisions in clinical practice.
Previous studies have suggested that the occurrence and development of cancers are regulated by both coding and noncoding RNAs with interaction between both. 5,6 LncRNA and miRNA are the two most-studied noncoding RNAs. 7 miRNAs are single-stranded RNAs containing about 22 nucleotides that regulate protein transition through base-pairing with mRNAs and inhibit translation of mRNAs. 8,9 LncRNAs are non-coding RNAs containing more than 200 nucleotides that contribute to the regulation of epigenetics, cell cycle and cell differentiation. 10 Both have been verified with recent application in the diagnosis and prognosis of various cancers. [11][12][13] There have been previous efforts to identify biomarkers for the prognosis of EAC. Some have proposed a predictor model of EAC using miRNA, 14,15 while others have conducted prognostic risk factor analysis based on the interaction between miRNA and mRNA. 16 However, to the best of our knowledge, there has not been any analysis using the lncRNA-miRNA-mRNA competing endogenous RNA (ceRNA) network for EAC, which provides a more reliable analysis.
In this study, we obtained transcriptome data of mRNA, miRNA and lncRNA from EAC patients and patients with normal esophageal mucosa as well as corresponding clinical information from TCGA. We analyzed the lncRNAs, miRNAs and mRNAs with different expressions and constructed a competing endogenous RNA (ceRNA) network. We then constructed a multivariate gene expression predictor model based on patient survival data and identified the independent prognostic factors. The results of the study contribute to the understanding of the underlying mechanism of EAC using ceRNA network and multigene predictor model, identify potential therapeutic and prognostic target genes and provide new directions for future research.

Patient datasets and data preprocessing
Sequencing data of the three types of RNAs in esophageal adenocarcinoma and normal esophageal tissues and their corresponding clinical data was obtained from The Cancer Genome Atlas (TCGA) database. Integration of this RNA data and extraction of the lncRNA expression profiles was done using the R bioconductor package TCGA Biolinks. 17 Genes were annotated using the Ensembl online database (http://www.ensembl.org). LncRNA, miRNA, and mRNA expression profiles of the carcinoma and the normal were obtained. Our study was in compliance with the publication guidelines provided by TCGA, and the data obtained from TCGA did not require approval from the ethics committee.

Differential expression analysis
The R Bioconductor package edgeR was employed to identify the differentially expressed lncRNAs (DE-lncRNA), miRNAs (DE-miRNA), and DEGs. 18 Filtering criteria for the differential expression of the three RNAs in the normal and carcinoma groups were j fold change (logFCj) > 2 and adjust P-value < 0.01 in DEGs and DE-lncRNAs and DE-miRNAs. The corresponding heat map and clustering were created using the gplots package in R.

Prediction of potential transcription factors, and target genes of DE-miRNAs
The transcription factors of screened DE-miRNAs were predicted using FunRich software. 18 The screened upregulated and downregulated DE-miRNAs were typed into this software. The top 10 predicted transcription factors are reported below.

ceRNA network construction
LncRNA could regulate mRNA expression by acting as an miRNA sponge and contributing to ceRNA network. First, we decoded the miRNA sequences by using the starBase v2.0 database (http://starbase.sysu.edu.cn) 19 and successfully paired DE-miRNAs 3p or 5p transcript information. The miRcode database (http://www.mircode.org) and DIANA-LncBase v2 were employed to construct lncRNA-miRNA interaction pairs.miRDB (http://www.mirdb.org/), miRTarBase (http://mirtarbase.mbc.nctu.edu.tw/), and TargetScan (http://www.targetscan.org/) were used to predict target genes of the DE-miRNAs and establish miRNA-mRNA interaction pairs. The genes in all three databases are considered to be the target genes of the DE-miRNAs. Using the Venny online website to compare the target genes, only the overlapping genes and their interaction pairs were further analyzed. Then, on account of the lncRNAmiRNA pairs and miRNA-mRNA pairs, lncRNA-miRNA-mRNA ceRNA network was constructed using Cytoscape v3.6.1 software. 20 Functional enrichment analysis GO, including biological processes (BP), molecular function (MF), cellular component (CC), is a commonly used approach for defining genes and its RNA or protein product to identify unique biological properties of highthroughput transcriptome or genome data. 21 We used DAVID 6.8 (https://david.abcc.ncifcrf.gov/), an online bioinformatic tool designed to identify a large number of genes or proteins function 22 to visualize the DEGs enrichment of GO (P < 0.05) and KEGG pathway analysis.
Thoracic Cancer 11 (2020) 2896-2908 Construction and validation of the mRNA risk score A total of 76 patients were divided into two categories in a random manner: training dataset = 38, test dataset = 38. A training dataset was then analyzed to build an mRNA model that was later confirmed in the test and entire datasets. We used least absolute shrinkage and selection operator (LASSO) which is a generalized linear regression algorithm with simultaneous variable selection and regularization. 23 An mRNA prognostic risk scoring model for survival prediction was constructed as follows: Risk score = X n i = 1 βi*gene i β indicates the coefficient of the mRNA, and gene represents mRNA expression value. The subjects were divided into low-and high-risk groups according to the median score of the training dataset. Kaplan Meier (KM) and log-rank methods were used to compare the survival rate between the groups. The time-dependent receiver operating characteristic (ROC) curve was plotted by using the R "timeROC" package to evaluate specificity and sensitivity of the mRNA expressionbased prognostic signature. After that, the signature was confirmed in the test and entire datasets. ROC and KM curves were employed to verify the accuracy and feasibility of the mRNA model. All ROC and KM curves were plotted with R (version 3.5.3), and P < 0.05 represented statistical significance.

Statistical analysis
Single variable Cox proportional risk (CPH) model was used to study the relationship between gene mRNA expression and survival. Based on KM analysis, gene mRNA expression and survival were entered in a multivariate

2898
Thoracic Cancer 11 (2020) 2896-2908 CPH model. The results of CPH models are shown by HRs, and tests of significance coupled with 95% CI tests were carried out simultaneously. The classification of continuous variables which include risk score and RNAs expression value is specified in advance. P < 0.05 was considered statistically significant.

Construction of ceRNA network in EAC
Based on the lncRNA-miRNA and miRNA-mRNA pairs, we constructed and visualized the ceRNA network graph using Cytoscape v3.6.1 (Fig 3). We identified 43 common RNAs (29 lncRNAs, four miRNAs, and 10 mRNAs) in the ceRNA network. The node connections in the network represented the interactions between RNAs, and the RNAs with more important biological functions were those with stronger connectivity in the graph. We then included 10 mRNAs (NPTX1, CDH2, ITR1, PDGFD, SLC22A6, MEST, IL11, CHRDL1, PTPRT, and HOXC8) for further analysis.

Cox proportional hazard analysis
We also calculated a risk score for each subject based on the status of four genes using a Cox regression model (Table 4): coxph(formula = Surv(futime, fustat) NPTX1 + ITPR1 + PDGFD + IL11, data = rt). We further stratified EAC patients into high-and low-risk groups and found that four-year survival rate was 25% and 35% for highand low-risk patients, respectively (Tables 5 and 6). We then ranked the risk scores and visualized the survival status of each patients using a dotplot. Mortality rate of patients was lower for patients in the low-risk group in comparison to the high-risk group (Fig 5a,b). Survival time of high-risk patients was shorter than low-risk patients according to the Kaplan-Meier curve (Fig 5c). Using a time-dependent ROC curve, we described the predictive value of the proposed four gene expression predictor model, and the area under curve (AUC) values were 0.735, 0.759, 0.656, 0.662 at one, two, three and four years of the four-gene signature (Fig 5d). The four-gene profile heatmap is shown in Fig 5e. Kaplan-Meier analysis of the four-gene expression predictor model Last but not least, we assessed the prognostic value of the four-gene expression model using R and found that increased expression of IL11 was associated with worse Figure 3 ceRNA networks of EAC. Red represents upregulation, and blue represents downregulation. lncRNAs, miRNAs, and mRNAs in the networks are represented as diamonds, round rectangles, and circles, respectively.

Discussion
Morbidity and mortality of EAC have not significantly improved despite progress in its pathogenesis and clinical treatment, due to the lack of reliable biomarkers and specific genes to guide individualized treatment. 24 Research on molecular markers of EAC is urgently needed to customize effective individualization of treatment and improve survival. Sequencing technology and bioinformatics allow the screening of the entire DNA mutation profile and contribute to the understanding of EAC origin and characteristics. However, screening biomarkers using changes in mRNAs only are insufficient as miRNAs, lncRNAs and the interaction between these three all play important roles in the growth and differentiation of cells and occurrence of cancers. 25,26 ceRNA hypothesis provides an in-depth understanding of the mechanism of carcinogenesis and new evidence for cancer diagnosis and treatment, chemotherapy efficacy prediction and cancer risk prediction. 6,27,28 Data mining using TCGA database is an effective approach to identify changes in RNA expression that are relevant to clinical outcomes and to screen new therapeutic targets. While ceRNAs have been studied more in the past decades, few studies have evaluated the prognostic value of ceRNA for EAC patients using TCGA database. Here, in the ceRNA network, we identified 29 lncRNAs, four miRNAs and 10 mRNAs that were significantly different between EAC and normal patients using TCGA, and only two of the 10 mRNAs have been previously reported (CDH2 29 and PDGFD 30 ). Based on the survival data up to 2017, we constructed a four-gene expression predictor model (IL-11, NPTX1, ITPR1, PDGFD) using multivariate Cox regression analysis and also conducted univariate analyses individually for the four genes. Our results showed that IL-11 could potentially be used as an independent prognostic factor for EAC. Among the interleukin family, secretion of IL-33 in esophageal epithelial cells has been reported to prompt the occurrence of gastroesophageal reflux diseases and lead to Barrett esophagus; 31 IL-4, IL-13, 32 IL-1β 33 and IL-6 34 contributed to increased secretion of esophageal mucosa in patients with Barrett's esophagus, and IL-11 contributed to esophageal squamous cell carcinoma progression and its aggressiveness. 35 Nonetheless, to date, no study has reported the role of IL-11 in EAC, and our study fills this gap and confirms the potentially important role of it in EAC prognosis. IL-11 is a member of the IL-6 family of cytokines, 36 and has been recognized for its role in the disease pathogenesis of mucosal homeostasis including gastrointestinal cancers. 37 IL-11 could increase the tumorigenic abilities of cells including the survival of cells or origin, proliferation of cancer cells, and survival of metastatic cells of distant Figure 4 The most significant KEGG pathway of the 10 mRNAs. GF: growth factor; IP3R: inositol 1,4,5-trisphosphate receptor; RTK: receptor tyrosine kinases. organs. 38,39 Recent studies have also suggested its role in osteosarcoma deterioration, 40 postoperative recurrence of liver cancer 41,42 and migration and survival of gastric cancer cells. 43 Based on the current knowledge of the biological properties of IL-11 and its role in cancer, IL-11 signaling inhibition might be a new therapeutic approach for cancer treatment. 44 In pancreatic ductal adenocarcinoma, 45 lung adenocarcinoma, 46,47 gastric adenocarcinoma, 48 and colorectal adenocarcinoma, IL-11-STAT3 signaling induced invasion and enhanced development of adenocarcinoma. In addition, IL-6/11 is one of the highly specific biomarkers with great accuracy for the diagnosis of lung adenocarcinoma in bronchoalveolar lavage fluid specimens. 49 The molecular mechanism of IL-11 in esophageal adenocarcinoma is the focus of our future research.  NPTX1 is a member of the pentraxin family and a major risk factor for nervous system disorders, 50 and its downregulation expression has been reported in many cancers. [51][52][53][54] Its function in EAC is still unclear and some studies have suggested it plays a role in regulating cell migration 55 and tumor metastasis. 56 ITPR1 is considered as the most prominent gene in regulating cancer cell resistance to NK-mediated lysis 57 and has been shown to be involved in the regulation of intracellular calcium signaling and the regulation of autophagy. 58 in vivo studies have previously indicated that ITPR1 targeting in cancer cells in combination with NK depletion contributed to tumor growth, indicating its role in the regulation of in vivo susceptibility of renal carcinoma cells to NK activities. 57 These studies suggest a putative role of ITPR1 in cancer progression and immune resistance. PDFD-D plays an important

2904
Thoracic Cancer 11 (2020) 2896-2908 role in cell proliferation, apoptosis, transformation, invasion, metastasis, angiogenesis and other biological processes, 59 and its downregulation expression has been reported to inhibit the NF-κB pathway for cell proliferation and invasion, and induce apoptosis in esophageal squamous cell carcinoma. 30 It contributes to ibrutinib resistance of diffuse large B-cell lymphoma through EGFR activation 60 and also the prognosis of gastric adenocarcinoma patients. 61 There have been many efforts to identify biomarkers for EAC prognosis. Lan et al. reported a six miRNA signature of esophageal adenocarcinoma in 2019. 14 Skinner et al. reported a validated miRNA profile to predict response to therapy in esophageal adenocarcinoma. 15 However, they focused on miRNA while in our study we focused on mRNA, and the results of part of the six miRNA are consistent with those involved in our study. Our results remain in general agreement with the previous conclusions. Also, Dong et al. reported five genes that have a connection with overall or relapse-free survival. 16 However, the credibility of the result which shows the identification of DEGs in mRNA expression profiling data sets GSE1420, GSE26886, and GSE92396 was not high as the three data sets come from different platforms (GSE1420 from GPL96, GSE26886 from GPL570, GSE92396 from GPL6244).
In addition, EAC is extremely rare in China, and we made every attempt to contact many hospitals and research institutes during the study period. Although a small number of survival data are available, we have still been unable to find enough pathological specimens of this kind for further measurement (eg, PCR, next-generation sequencing, and sanger sequencing). We therefore divided the patients into two categories in a random manner: training dataset and test dataset, thereby hoping to minimize bias.
In conclusion, we constructed a ceRNA network and a four-gene expression predictor model using data from TCGA, a large-scale sequence database, that could be used to perform comprehensive multidimensional analysis. However, future studies are needed to validate our findings considering the complex interaction of the network.