Four transcription profile–based models identify novel prognostic signatures in oesophageal cancer

Abstract Oesophageal cancer (ESCA) is a clinically challenging disease with poor prognosis and health‐related quality of life. Here, we investigated the transcriptome of ESCA to identify high risk‐related signatures. A total of 159 ESCA patients from The Cancer Genome Atlas (TCGA) were analysed in three phases. In the discovery phase, differentially expressed transcripts were filtered; in the training phase, two adjusted Cox regressions and two machine learning models were used to construct and estimate signatures; and in the validation phase, prognostic signatures were validated in the testing dataset and the independent external cohort. We constructed two signatures from three types of RNA markers by Akaike information criterion (AIC) and least absolute shrinkage and selection operator (LASSO) Cox regressions, respectively, and all candidate markers were further estimated by Random Forest (RFS) and Support Vector Machine (SVM) algorithms. Both signatures performed well in the independent external oesophageal squamous cell carcinoma (ESCC) cohort and performed better than common clinicopathological indicators in the TCGA dataset. Machine learning algorithms predicted prognosis with high specificities and measured the importance of markers to verify the risk weightings. Furthermore, cell function and immunohistochemical (IHC) staining assays identified the common risk marker FABP3 as a novel oncogene in ESCA.


KEYWORDS
machine learning, oesophageal cancer, prognostic signature, transcription profile

| INTRODUCTION
Oesophageal cancer (ESCA) is the eighth most common cancer and the sixth leading cause of cancer death worldwide. 1 In total, 17 290 newly diagnosed cases and 15 850 oesophageal cancer deaths were estimated in 2018. 2 ESCA has two main histological types, squamous cell carcinoma (SCC) and adenocarcinoma, of which SCC accounts for most ESCA cases in China. 3 The poor prognosis of ESCA is partly due to the lack of effective early diagnosis and post-surgical surveillance. 4 At present, the tumour-node-metastasis (TNM) staging system is the only well-recognized stratification system for treatment decisions. 5 However, TNM staging fails to predict the clinical outcome in a large number of patients. 6 Patients in the same stage category receive similar treatments, but their clinical outcomes vary greatly. 7 Therefore, there is a pressing need to identify reliable prognostic factors for ESCA patients.
Second-generation sequencing of ESCA transcriptomes has revealed large numbers of molecular markers. 8,9 Several studies have shown that messenger RNAs (mRNAs), microRNAs (miRNAs) and long non-coding RNAs (lncRNAs) could serve as predictive signatures of survival and treatment results with good performance. 8,10,11 Li et al found that a three-lncRNA signature is associated with the survival status of ESCC patients. 8  To date, Cox and penalized Cox regressions have been widely used for stable feature selection. 15,16 Smyth et al identified a seven-gene signature that improves prognostic risk stratification in chemotherapy-treated gastroesophageal cancer patients by using a standard Cox regression model. 17 In addition, machine learning, a branch of artificial intelligence, has been successfully applied in screening biomarkers correlated with cancer diagnosis, prognosis and treatment. 18 In another study, using least absolute shrinkage and selection operator (LASSO) Cox and Support Vector Machine (SVM) algorithms for selection, Qiu et al 19 constructed a three-CpG signature for predicting recurrence in patients with early-stage hepatocellular carcinoma. Thus, combining semi-parametric and machine learning algorithms has the potential to enhance the accuracy of the present prognostic indicators.
Here, we constructed multi-marker signatures by exploring the mRNA, lncRNA and miRNA profiles of ESCA with two regularized semi-parametric algorithms. Machine learning algorithms were then used to estimate the importance of the included markers, which validated the risk weightings of the Cox regressions. Additionally, loss-of-function and immunohistochemical (IHC) staining assays identified the oncogenic function of the novel ESCA marker FABP3.

| Data collection and processing
We obtained clinical and sequencing data for 159 ESCA patients, including the RNA-sequencing and miRNA-sequencing datasets, from the TCGA (The Cancer Genome Atlas) data portal (April 2016; https://portal.gdc.cancer.gov/). Based on the annotation of the Ensembl GRCh37 genome, we identified 10 617 long non-coding RNAs and 18 687 protein-coding genes. Differentially expressed lncRNAs, mRNAs and miRNAs between ESCA and adjacent normal tissues were screened with the edgeR package in R (version 3.5.1); |log2 FoldChange| ≥ 2 and FDR < 0.01 were considered significant. After normalization within edgeR, the expression profiles were used for downstream processing.
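The significance filter described above can be sketched as follows. This is a minimal Python illustration, not the edgeR workflow itself: the `DEResult` record, the function name and the example values are hypothetical, and only the two thresholds (|log2 fold change| ≥ 2 and FDR < 0.01) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class DEResult:
    """One row of a differential-expression table (hypothetical layout)."""
    transcript: str
    log2_fold_change: float
    fdr: float


def is_significant(row, lfc_cutoff=2.0, fdr_cutoff=0.01):
    """Apply the thresholds stated in the text: |log2FC| >= 2 and FDR < 0.01."""
    return abs(row.log2_fold_change) >= lfc_cutoff and row.fdr < fdr_cutoff


# Made-up values for illustration only; these are not results from the study.
example_rows = [
    DEResult("transcript_A", 3.1, 0.0004),  # passes both thresholds
    DEResult("transcript_B", -2.0, 0.009),  # passes: the boundary fold change is included
    DEResult("transcript_C", 1.2, 0.0001),  # fails the fold-change threshold
    DEResult("transcript_D", -4.5, 0.03),   # fails the FDR threshold
]

significant = [row.transcript for row in example_rows if is_significant(row)]
print(significant)
```

Both up-regulated and down-regulated transcripts pass the filter, because the fold-change threshold is applied to the absolute value.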

| Study design
In this study, we included three phases to identify and validate risk- Elimination (SVM-RFE) were applied to rank marker importance. 20,21 Additionally, the risk difference between high-risk and low-risk patients was estimated with Kaplan-Meier survival analyses. The area under the curve (AUC) of the receiver operating characteristic (ROC) curve was calculated to estimate the performance of each model.
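As a rough illustration of the AUC used to compare models, the sketch below computes it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. It assumes a binary outcome (for example, death by a fixed time point) and ignores censoring, so it is simpler than a time-dependent ROC analysis; the function name and the toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy risk scores and outcomes (1 = event), invented for illustration.
toy_scores = [0.9, 0.8, 0.4, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
print(round(toy_auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two outcome groups.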

| Cell culture, RNA extraction and qRT-PCR analysis
Two oesophageal squamous cell carcinoma (ESCC) cell lines, KYSE150 and Eca109, were obtained from the American Type Culture Collection (ATCC). Cells were cultured in RPMI 1640 medium (KeyGene) supplemented with 10% FBS, 100 U/mL penicillin and 100 mg/mL streptomycin. All cell lines were grown in humidified air at 37°C with 5% CO2. Cell cultures were occasionally tested for mycoplasma (last tested in June 2018). The cells used in experiments were within 10 passages from thawing. RNA extraction and qRT-PCR were performed as described previously. 22 We used β-actin and U6 as internal controls and

| Patients and tissue samples
We gained access to primary oesophageal cancer tissues through JiangSu Cancer Hospital Biobank. All tumours were confirmed by experienced pathologists. Written informed consent was obtained from all patients. Collection of human tissue samples was conducted in accordance with the International Ethical Guidelines for Biomedical Research Involving Human Subjects. This study was approved by the Ethics Committee of the JiangSu Cancer Hospital and was performed in accordance with the provisions of the Ethics Committee of Nanjing Medical University. This study was approved by the Nanjing Medical University.

| Cell proliferation, migration and apoptosis assays
Cell proliferation was examined using the EdU assay (RiboBio) and the Real Time xCELLigence Analysis (RTCA) system, following the research protocol provided by the manufacturers (Roche Applied Science and ACEA Biosciences). Cell migration was assessed using RTCA and 24-well transwells (8 μm pore size, Millipore). Cell invasion was examined using 24-well transwells coated with 1ml/mL Matrigel (8 μm pore size, BD Science). Apoptosis assays were conducted with FACSCanto II and Annexin V R-PE 20 tests (BD Science).

| Selection of candidate prognostic markers
The study flow chart is shown in Figure 1A, and we included three

| Construction of prognostic signatures
We conducted the Kaplan-Meier survival analyses to identify the association between the expression of DEGs and overall survival (OS). As a result, the significant DEGs were screened by AIC and LASSO Cox models to construct prognostic signatures (Figure 2A).
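For readers unfamiliar with the Kaplan-Meier estimator used in these survival analyses, a minimal sketch follows. It is not the authors' code; the function name and the toy follow-up data are hypothetical, and `events` uses 1 for death and 0 for censoring.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all patients whose follow-up ends at the same time t.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1.0 - deaths / n_at_risk
            curve.append((t, survival))
        # Deaths and censored patients both leave the risk set after t.
        n_at_risk -= removed
    return curve


# Toy follow-up data (months), invented for illustration.
toy_times = [5, 8, 8, 12, 15]
toy_events = [1, 1, 0, 1, 0]
toy_curve = kaplan_meier(toy_times, toy_events)
print(toy_curve)
```

Censored patients contribute to the number at risk until their last follow-up but never lower the curve themselves, which is why the estimate only steps down at observed deaths.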
In the AIC Cox model, the following formula was derived to calculate the risk score for each patient: Risk score = (0.404 × expression of AC010776.2) + (0.040 × expression of miR-615) − (0.364 × expression of BHLHA15) − (0.365 × expression of CLCNKB) + (0.633 × expression of FABP3). X-tile plots were used to generate an optimal cut-off score (cut-off = 1.98) that divides patients into high- and low-risk score subgroups in the training dataset (Figure S1A).
Patients with a high risk score generally had worse OS than those with a low risk score (P < .0001) (Figure 2B), and the performance of the AIC Cox prognostic signature was validated in the testing dataset with the same signature and cut-off value (Figure 2C).
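The AIC Cox risk score above can be written as a short function. The coefficients and the cut-off of 1.98 are taken from the text; the function names, the example expression values and the choice to treat scores strictly above the cut-off as high risk are assumptions made for illustration, and the expression values must be on the same normalized scale as the training data.

```python
# Coefficients reported for the AIC Cox signature.
AIC_COX_COEFFICIENTS = {
    "AC010776.2": 0.404,
    "miR-615": 0.040,
    "BHLHA15": -0.364,
    "CLCNKB": -0.365,
    "FABP3": 0.633,
}
AIC_CUTOFF = 1.98  # X-tile cut-off reported for the training dataset


def risk_score(expression, coefficients=AIC_COX_COEFFICIENTS):
    """Weighted sum of marker expression values (linear predictor of the Cox model)."""
    return sum(coef * expression[marker] for marker, coef in coefficients.items())


def risk_group(score, cutoff=AIC_CUTOFF):
    """Assumption: scores strictly above the cut-off are labelled high risk."""
    return "high" if score > cutoff else "low"


# Hypothetical normalized expression values for one patient (not study data).
example_patient = {
    "AC010776.2": 2.0,
    "miR-615": 5.0,
    "BHLHA15": 1.0,
    "CLCNKB": 1.0,
    "FABP3": 3.0,
}
example_score = risk_score(example_patient)
print(round(example_score, 3), risk_group(example_score))
```

Markers with positive coefficients (AC010776.2, miR-615, FABP3) raise the score as their expression increases, whereas BHLHA15 and CLCNKB lower it.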
We constructed a second risk score formula with the LASSO Cox model to verify the robustness of Cox regression in this dataset. Ten-fold cross-validation via penalized maximum likelihood was used to compute the regularization parameter lambda (Figure S1B). Eight (AC010776.2, AC119424.1, GK-IT1, miR-4664, miR-615, BHLHA15, CLCNKB and FABP3) of the 23 candidate markers were selected to construct a prognostic signature with optimal weighting coefficients (lambda: 0.084; Figure 2D). Compared with the AIC Cox model, three additional markers (AC119424.1, GK-IT1 and miR-4664) were added to the LASSO risk score formula, with an optimal cut-off value of 0.44 (Figure S1C). In the training dataset, patients with a high risk score had worse OS than those with a low risk score (P < 0.0001) (Figure 2E), and the performance of the LASSO Cox prognostic signature was also validated in the testing dataset (Figure 2F).
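To make the penalized likelihood concrete, the sketch below evaluates an L1-penalized Cox objective for a given coefficient vector. It is an illustration under stated assumptions rather than the fitting procedure used in the study: it assumes no tied event times, scales the negative partial log-likelihood by the sample size (a convention chosen here, not stated in the text), and uses invented toy data. Only the value lambda = 0.084 comes from the text; cross-validation to choose lambda and the optimization itself are not shown.

```python
import math


def neg_log_partial_likelihood(beta, X, times, events):
    """Negative Cox partial log-likelihood (assumes no tied event times)."""
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]
    total = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 1:
            # Risk set: everyone still under observation at time t_i.
            risk_set = [math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i]
            total += eta[i] - math.log(sum(risk_set))
    return -total


def lasso_cox_objective(beta, X, times, events, lam):
    """Scaled negative partial log-likelihood plus an L1 penalty on beta."""
    n = len(times)
    penalty = lam * sum(abs(b) for b in beta)
    return neg_log_partial_likelihood(beta, X, times, events) / n + penalty


# Toy data, invented for illustration: 3 patients, 1 marker, all events observed.
X_toy = [[1.0], [2.0], [3.0]]
times_toy = [1.0, 2.0, 3.0]
events_toy = [1, 1, 1]

null_nll = neg_log_partial_likelihood([0.0], X_toy, times_toy, events_toy)
penalised = lasso_cox_objective([0.5], X_toy, times_toy, events_toy, lam=0.084)
unpenalised = lasso_cox_objective([0.5], X_toy, times_toy, events_toy, lam=0.0)
print(null_nll, penalised - unpenalised)
```

Because the penalty grows with the absolute size of each coefficient, minimizing this objective drives weakly informative coefficients to exactly zero, which is how the LASSO model retains a subset of the candidate markers.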

| Predictive performance of two prognostic signatures
To further validate the performance of these two prognostic signatures, univariate and multivariate Cox proportional hazard models were performed on risk scores and clinicopathological features (Figure 2G). Both the AIC and LASSO prognostic signatures were identified as independent prognostic factors (Figure 2H-I).
Accumulative effects of these two prognostic signatures were also  (RFE) showed that GK-IT1 ranks the top place ( Figure 3E).

| Predictive performance in the internal ESCC dataset and the Jiangsu Cancer Hospital ESCC cohort
Oesophageal squamous cell carcinoma accounts for more than 90% of ESCA patients in China, 8,23 so we compared the AUCs among analysis showed that ESCC patients with high risk score had worse OS than those with low risk (P < 0.0001) in the LASSO model (Figure 4B). Furthermore, we assessed the performance of the two machine learning algorithms in the ESCC datasets. The RF model achieved better specificity (98.1%, Figure 4C), and the SVM model showed better sensitivity (79.0%, Figure 4D). The results suggested that the machine learning models are stable predictors alongside the LASSO model (specificity 98.0%, sensitivity 86.5%) in ESCC patients.
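The sensitivity and specificity figures quoted above follow from a confusion matrix of predicted against observed outcomes. The sketch below shows the calculation, assuming that the positive class (1) denotes the poor-outcome group; the function name and the toy labels are hypothetical and do not reproduce the study's predictions.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)  # proportion of true positives detected
    specificity = tn / (tn + fp)  # proportion of true negatives correctly excluded
    return sensitivity, specificity


# Toy labels, invented for illustration.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
toy_sens, toy_spec = sensitivity_specificity(toy_true, toy_pred)
print(toy_sens, round(toy_spec, 3))
```

A model can trade one measure against the other by shifting its decision threshold, which is why the RF and SVM models can each lead on a different measure.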
To further validate the performance of the two prognostic signatures, we measured the expression of the prognostic markers in 24 ESCC patients from the Jiangsu Cancer Hospital cohort (Figure S3A). The prognostic signatures showed that ESCC patients with poor prognosis had significantly higher risk scores than surviving patients (P < .0001) (Figure 4E).

| FABP3 promotes malignant progression of oesophageal squamous cell carcinoma cells
In summary, after combining markers in the AIC Cox, LASSO Cox, RFS-FS and SVM-RFE models, two common markers were identified (risk marker: FABP3; protective marker: CLCNKB; Figure 4F). In addition, the risk or importance of markers was compared in each model, and FABP3 ranked near the top in all models (Figure 4G). To characterize the novel marker FABP3, we designed two siRNAs to investigate its biological function, and the expression of FABP3 was significantly downregulated by siRNA-2 in both ESCC cell lines (Figure 5A).
Next, we found that knockdown of FABP3 greatly suppressed the proliferation of KYSE150 and Eca109 cells in RTCA proliferation and migration assays (Figure 5B). Additionally, the transwell and Matrigel assays showed that silencing FABP3 significantly impaired the migration and invasion capabilities of KYSE150 and Eca109 cells (Figure 5C). EdU assays validated the results of the RTCA proliferation assay (Figure 5D), and the knockdown of ( Figure 5E). Collectively, our results suggested that FABP3 could promote the proliferation and migration abilities of ESCC cell lines.
FABP3 expression was then detected by IHC using the TMA of 39 ESCC cases (Table S1). Overexpression of FABP3 in ESCC was validated by IHC scores in TMA ( Figure 5F and Figure S3D). In addition, Kaplan-Meier survival analysis showed that patients with higher  Figure 5G).

| DISCUSSION
Oesophageal cancer is a clinically challenging disease with a considerable decline in health-related quality of life and a poor prognosis. 24  We noted that FABP3 is the only risk marker identified by all four algorithms: AIC Cox regression, LASSO Cox regression, RFS-FS and SVM-RFE. The fatty acid-binding protein (FABP) family is involved in the fatty acid signalling pathway, which is one of the most important pathways in cancer development. 31 Tang et al showed that high expression of FABP3 is correlated with poor prognosis in non-small-cell lung cancer. 32 Our results showed that FABP3 promotes the proliferation and migration of ESCC cell lines.
The other common marker, the protective marker CLCNKB, is predominantly expressed in the kidney and has been shown to be down-regulated in clear cell renal cell carcinoma. 33,34 In the Cox regression models, the risk weighting of CLCNKB was smaller than that of FABP3, and CLCNKB ranked lower than FABP3 in both the RFS-FS and SVM-RFE models. In addition, the protective function of CLCNKB still needs to be validated with functional assays. Other included markers, such as miR-615 35,36 and BHLHA15, 37 have been reported to be associated with the risk of gastric and some other cancers. Thus, all the new ESCA markers discovered by the Cox regressions, RFS-FS and SVM-RFE are worthy of further study.
Our study had several limitations. First, the biological mechanisms of other prognostic markers, such as AC010776.2, GK-IT1 and CLCNKB, remain unknown. Second, the external independent validation dataset would ideally have had a larger sample size. Importantly, prospective studies are required to further validate our findings.
In summary, we combined four algorithms, including two types of adjusted Cox regressions and two machine learning algorithms, to investigate RNA-Seq, miRNA-Seq and adjuvant clinical data of ESCA. Our results demonstrated that the constructed signatures are potential prognostic tools for predicting mortality risk in ESCA and ESCC, and that FABP3 is a novel biomarker and newly identified oncogene in ESCC.

CONFLICT OF INTEREST
The authors have declared that no competing interest exists. All authors read and approved the final manuscript.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are openly available at https://portal.gdc.cancer.gov/.