A transcriptomic study for identifying cardia‐ and non–cardia‐specific gastric cancer prognostic factors using genetic algorithm‐based methods

Abstract
Gastric cancer (GC) is a heterogeneous tumour with numerous differences in epidemiologic and clinicopathologic features between cardia cancer and non‐cardia cancer. However, few studies have constructed site‐specific GC prognostic models. In this study, we identified site‐specific GC transcriptomic prognostic biomarkers using a genetic algorithm (GA)‐based support vector machine (GA‐SVM) and a GA‐based Cox regression method (GA‐Cox) in the Cancer Genome Atlas (TCGA) database. The area under the time‐dependent receiver operating characteristic (ROC) curve (AUC) for 5‐year survival and the concordance index (C‐index) were used to evaluate the predictive ability of the Cox regression models. We identified 10 and 13 prognostic biomarkers for cardia cancer and non‐cardia cancer, respectively. Compared to traditional models, the addition of these site‐specific biomarkers notably improved model performance (cardia: AUC_traditional vs AUC_combined = 0.720 vs 0.899, P = 8.75E‐08; non‐cardia: AUC_traditional vs AUC_combined = 0.798 vs 0.994, P = 7.11E‐16). The combined nomograms exhibited superior performance in cardia and non‐cardia GC survival prediction (C‐index = 0.816 for cardia; C‐index = 0.812 for non‐cardia). We also constructed a user‐friendly GC site‐specific molecular system (GC‐SMS, https://njmu-zhanglab.shinyapps.io/gc_sms/), which is freely available to users. In conclusion, we developed site‐specific GC prognostic models for predicting cardia cancer and non‐cardia cancer survival, providing more support for the individualized therapy of GC patients.


| INTRODUCTION
Gastric cancer (GC) is the fifth most common cancer and the third leading cause of cancer death worldwide, with an estimated 1.03 million new cases and 0.78 million deaths in 2018. 1 Based on pathogenic site, GC can be classified into cardia cancer and non-cardia cancer. To date, a growing number of studies have demonstrated that GC is a heterogeneous tumour with numerous differences in epidemiologic and clinicopathologic features between cardia cancer and non-cardia cancer. 2,3 It is well known that GC patients have a poor prognosis, with a 5-year overall survival rate <40%. 4 Meanwhile, several studies have found that the survival rate of cardia cancer patients was significantly lower than that of non-cardia cancer patients, indicating that prognosis differs between cardia cancer and non-cardia cancer. 5,6 In addition, growing evidence has revealed that, besides clinical factors (eg age and clinical stage), genetic factors (eg genetic variants and gene expression levels) may play important roles in GC survival prediction. 7,8 Therefore, it is necessary to identify potential site-specific biomarkers that can be used to predict cardia and non-cardia GC prognosis individually.
Recently, with the development of high-throughput biotechnology, performing feature selection in high-dimensional data with a relatively small sample size has become a great challenge. The genetic algorithm (GA), a search algorithm based on natural selection, crossover and mutation, has been reported to be a very efficient method for feature selection. 9 Several studies have demonstrated that GA-based feature selection methods can significantly improve the predictive accuracy of disease risk prediction models. 10,11 In this study, to identify potential cardia- and non-cardia-specific GC prognostic biomarkers, we performed a comprehensive analysis using a GA-based support vector machine (GA-SVM) and a GA-based Cox regression method (GA-Cox) in the Cancer Genome Atlas (TCGA) stomach adenocarcinoma (STAD) transcriptomic data.
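To make the selection, crossover and mutation steps concrete, the following is a minimal Python sketch of GA-based feature selection. It is an illustration only and not the authors' implementation, which was carried out in R: the function name `genetic_feature_selection`, all parameter defaults, and the `toy_fitness` function are assumptions. In GA-SVM or GA-Cox the fitness would instead be a model-based score (for example, classification accuracy or a survival-model criterion) computed on the selected features.

```python
import random


def genetic_feature_selection(n_features, fitness, pop_size=30, generations=40,
                              crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Return the fittest boolean feature mask found by a simple GA.

    Requires n_features >= 2. All defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    # Each individual is a mask: True means the feature is selected.
    population = [[rng.random() < 0.5 for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def pick_parent():
        # Selection: binary tournament, the fitter of two random individuals wins.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        next_population = [best[:]]  # elitism: carry the current best forward
        while len(next_population) < pop_size:
            p1, p2 = pick_parent(), pick_parent()
            if rng.random() < crossover_rate:
                # Crossover: splice the two parents at a random cut point.
                cut = rng.randrange(1, n_features)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [(not g) if rng.random() < mutation_rate else g for g in child]
            next_population.append(child)
        population = next_population
        best = max(population, key=fitness)
    return best


# Hypothetical toy fitness: reward three "informative" features, penalize the rest.
INFORMATIVE = {1, 4, 7}


def toy_fitness(mask):
    hits = sum(1 for i, on in enumerate(mask) if on and i in INFORMATIVE)
    extras = sum(1 for i, on in enumerate(mask) if on and i not in INFORMATIVE)
    return hits - 0.2 * extras
```

Calling `genetic_feature_selection(10, toy_fitness)` returns a mask of 10 booleans. Because the best individual is always carried forward, the fitness of the returned mask never falls below that of the best individual in the initial population.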

| Site-specific biomarkers identification
The TCGA STAD data were first normalized by log2(x + 1) transformation. We separately used unpaired Student's t tests to perform differential expression analysis in cardia and non-cardia cancers, and extracted site-specific biomarkers based on the following criteria: (a) call rate (percent of biomarkers with expression value >0) >70%; (b)

| GA-SVM
To obtain the transcriptomic biomarkers with the highest discriminatory power in distinguishing cardia and non-cardia tumour tissues, we performed GA-SVM analysis in the gene, miRNA and lncRNA data sets, respectively (Figure S1). To evaluate the clinical utility of the site-specific risk scores calculated from transcriptomic biomarkers in predicting GC survival probability, we used the median risk score to divide cardia and non-cardia GC patients into high- and low-risk groups. The Kaplan-Meier survival curve and log-rank test were then applied to compare the survival probability between the two groups.
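The median split and the log-rank comparison described above can be sketched in plain Python as follows. This is a hedged illustration rather than the authors' R code: the function names are hypothetical, and the risk score is assumed here to be a coefficient-weighted sum of expression values, a common convention that the visible text does not spell out.

```python
import math
import statistics


def risk_score(expression, coefficients):
    # Assumed form: sum of coefficient * expression value over the biomarkers.
    return sum(c * x for c, x in zip(coefficients, expression))


def split_by_median(scores):
    # True marks the high-risk group (score strictly above the median).
    cutoff = statistics.median(scores)
    return [s > cutoff for s in scores]


def logrank_test(times, events, high):
    """Two-group log-rank test; returns (chi-square, P value) with 1 df.

    times  : follow-up time per patient
    events : 1/True if death was observed, 0/False if censored
    high   : True if the patient is in the high-risk group
    """
    observed = expected = variance = 0.0
    event_times = sorted({t for t, e in zip(times, events) if e})
    for t in event_times:
        n = sum(1 for ti in times if ti >= t)                      # at risk, both groups
        n1 = sum(1 for ti, g in zip(times, high) if ti >= t and g)  # at risk, high group
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)
        d1 = sum(1 for ti, e, g in zip(times, events, high)
                 if ti == t and e and g)
        observed += d1
        expected += d * n1 / n
        if n > 1:
            # Hypergeometric variance of the number of events in the high group.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = (observed - expected) ** 2 / variance if variance > 0 else 0.0
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

In use, `split_by_median` would be applied to the per-patient risk scores within one site (cardia or non-cardia), and the resulting group labels passed to `logrank_test` together with survival times and event indicators.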

| Site-specific clinical prognostic models construction
We used Cox regression models to perform univariate and multivariate analyses for identifying clinical prognostic factors. After univariate analysis, Cox stepwise regression analysis was used to further screen independent clinical characteristics, with a significance level of P < 0.05 for entering and P > 0.10 for removing variables. The remaining clinical prognostic factors were used to construct traditional cardia and non-cardia GC prognostic models.

FIGURE 1 Summary of this study design

| Site-specific combined prognostic models construction
We further used site-specific clinical factors and risk scores to construct combined cardia and non-cardia GC prognostic models. The predictive power of each prognostic model was measured using the time-dependent ROC curve for 5-year survival with the R package survivalROC. In addition, threefold cross-validation was used to guard against potential over-fitting. The difference between ROC curves was evaluated using the Wilcoxon rank-sum test with the R package survcomp.

| Site-specific GC nomograms construction
To predict the overall survival probability of cardia and non-cardia GC patients, the regression coefficients of the multivariable Cox regression model were used to generate a nomogram using the R package rms, from which each patient obtains a total point score. The concordance index (C-index) was then estimated to evaluate the agreement between the actual and predicted survival probability; a larger C-index (range 0 to 1) indicates better model performance. The C-index was also adjusted by the bootstrap method with 1000 resamples. In addition, calibration plots and decision curves were used to evaluate the calibration and clinical utility of the nomograms.
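As an illustration of what the C-index measures, the sketch below computes Harrell's concordance index in plain Python. It is a simplified, hypothetical helper and not the rms routine used in the study; in particular, it does not reproduce the bootstrap adjustment with 1000 resamples, and pairs with tied survival times are simply skipped.

```python
def concordance_index(times, events, risk):
    """Harrell's C-index for right-censored survival data.

    A pair (i, j) is comparable when patient i had an observed event and a
    shorter follow-up time than patient j. The pair is concordant when the
    patient who died earlier also has the higher predicted risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier event of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 0.5 corresponds to predictions no better than chance, and a value of 1 to a perfect ranking of patients by risk.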

| Construction of a user-friendly webserver
We used R package Shiny (R Core Team, Vienna, Austria) to construct a GC site-specific molecular system (GC-SMS), which is freely available and user-friendly. This system included site-specific GC molecular databases and survival prediction models. The molecular databases provided the results of differential expression analysis and survival analysis for each biomarker at different GC sites. The survival prediction models could provide the predicted 5-year survival probability for cardia and non-cardia GC patients.
Total points and corresponding survival probability were calculated by R package nomogramEx (R Core Team, Vienna, Austria).
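The relationship between nomogram points and predicted survival can be sketched as follows. This is a minimal illustration under stated assumptions, not the rms or nomogramEx implementation: it assumes linear predictor terms, scales points so that the predictor with the widest effect range spans 0 to 100 (one common nomogram convention), and uses the standard Cox relation S(t | x) = S0(t)^exp(lp). The function names, predictor names, coefficients, ranges and baseline survival used with it are all hypothetical.

```python
import math


def nomogram_points(values, betas, ranges):
    """Convert predictor values into nomogram points (0 to 100 per axis).

    values : dict of predictor name -> patient value
    betas  : dict of predictor name -> Cox regression coefficient
    ranges : dict of predictor name -> (min, max) shown on the nomogram axis
    """
    # Effect range of each predictor on the linear-predictor scale.
    spans = {k: abs(betas[k]) * (ranges[k][1] - ranges[k][0]) for k in betas}
    max_span = max(spans.values())
    points = {}
    for name, beta in betas.items():
        lo, hi = ranges[name]
        lowest = min(beta * lo, beta * hi)  # axis end with the lowest risk
        points[name] = 100.0 * (beta * values[name] - lowest) / max_span
    return points


def survival_probability(linear_predictor, baseline_survival):
    # Cox model: S(t | x) = S0(t) ** exp(lp), with S0(t) the baseline survival at t.
    return baseline_survival ** math.exp(linear_predictor)
```

The total score is the sum of the per-predictor points; a nomogram then maps that total to a survival probability through the same linear predictor.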
All analyses were performed using R 3.4.1 software (R Core Team, Vienna, Austria). A two-sided P < 0.05 was considered statistically significant.

| Basic characteristics of study subjects
The detailed clinical characteristics of 87 cardia and 264 non-cardia cancer patients are summarized in Table S1. There was a significant

| Identification of prognostic biomarkers
We first used a series of filtering criteria to obtain cardia (including were screened for further survival analysis (Figure S2).
Moreover, we applied GA-Cox analysis to identify key transcriptomic prognostic factors for predicting cardia and non-cardia GC survival using 39 cardia- and 113 non-cardia-specific biomarkers (Figure 1). For cardia cancer patients, a total of 10 prognostic biomarkers including 7 genes (Table S2) and 3 lncRNAs, with an AUC of 0.913 (Figure S3), were finally identified (Table 1). For non-cardia cancer patients, we identified 13 prognostic biomarkers including 10 genes (Table S2), 2 miRNAs and 1 lncRNA (AUC = 0.918, Figure S3; Table 2). Furthermore, we divided the patients into high- and low-risk groups using the median of the risk score constructed from these biomarkers (Figure 2A,B). Broadly, compared to the low-risk group, the high-risk group had a poorer prognosis among both cardia and non-cardia GC patients (log-rank P < 0.001, Figure 2C,D).

| Construction of site-specific traditional and combined prognostic models
We initially performed univariate analysis to evaluate the association of each clinical factor with cardia and non-cardia GC survival (Table S1) (Table S3). Finally, neoplasm status (HR = 2.78, P = 0.009) was retained in the cardia model (AUC = 0.720, Table S4).
We further introduced the risk scores of 10 cardia and 13 non-cardia cancer prognostic biomarkers to construct combined site-specific GC prognostic models, respectively. We found that, with the addition of biomarkers, the combined cardia (AUC = 0.899) and non-cardia (AUC = 0.994) cancer prognostic models showed stronger predictive power (P cardia = 8.75E-08; P noncardia = 7.11E-16, Table S4, Figure 2E,F) compared to traditional prognostic models.

| Construction of site-specific nomograms
Furthermore, we constructed two nomograms to show the potential clinical application of the two combined models in cardia (Figure 3A) and non-cardia (Figure 3B) GC patients' prognosis prediction. Figure S4).

| Development of GC-SMS
An online version of the user-friendly GC site-specific web server can be accessed at https://njmu-zhanglab.shinyapps.io/gc_sms/ (Figure S5).
Users can perform differential expression analysis and survival analysis for each biomarker at different GC sites simply by clicking the corresponding module (Figure S5A). For example, the user can select a database (gene, miRNA or lncRNA) and a site (overall, cardia or non-cardia), and input a molecular biomarker (eg ASB5 for gene, AL627309.1 for lncRNA or miR-100 for miRNA) to search the results of differential expression analysis and survival analysis. In addition, online implementations of the cardia and non-cardia GC nomogram prognostic models are also available (Figure S5B); the predicted 5-year survival probability can be easily calculated by inputting clinical characteristics and expression values of site-specific biomarkers.

FIGURE 2 Risk score analysis for 10 and 13 site-specific biomarkers in cardia and non-cardia gastric cancer (GC) patients, respectively. A, B, Distribution of site-specific risk score and survival status in cardia and non-cardia GC patients. The black lines represent risk scores. The grey dashed lines represent the median value of risk scores. C, D, Kaplan-Meier survival curves for overall survival outcomes in cardia and non-cardia GC patients. E, F, The time-dependent ROC curves regarding 5-year survival for cardia and non-cardia GC prognostic models

| DISCUSSION
GC is a heterogeneous tumour with great differences in epidemiologic and clinicopathologic features between cardia cancer and non-cardia cancer. 12 For instance, Helicobacter pylori infection was demonstrated to be a risk factor for non-cardia cancer, but not for cardia cancer. 13 The survival rate of cardia cancer patients was significantly lower than that of non-cardia cancer patients. 5 However, few studies have constructed site-specific GC prognostic models to predict the survival probability of cardia and non-cardia GC patients. In this study, we applied GA-SVM and GA-Cox methods to identify 10 cardia- and 13 non-cardia-specific GC prognostic factors, which may be useful for cardia cancer and non-cardia cancer survival prediction.
With the development of high-throughput sequencing technology, finding accurate biomarkers in high-dimensional omics data is challenging. The GA, which comprises natural selection, crossover and mutation, is a heuristic algorithm used to search for an optimal solution to a complex problem (such as a non-linear condition). 14-16 Several researchers have applied GA-based machine learning methods to solve a variety of complex problems in high-dimensional omics data. 11,17 Thus, this study proposed two approaches that combine SVM and Cox models with a GA to explore an optimal subset of site-specific GC prognostic biomarkers. As a result, we identified 10 cardia- and 13 non-cardia-specific GC prognostic factors with good discriminatory ability, reflecting the superior performance of the GA-based algorithms.
In the present study, the cardia cancer prognostic model was constructed using 7 genes and 3 lncRNAs, and the non-cardia cancer survival model was constructed using 10 genes, 1 lncRNA and 2 miRNAs. Most of these genes have been demonstrated to be involved in complex biological processes. For example, APAF1 is a key apoptosis factor, which is closely related to several cancer-inducing genes and tumour suppressor genes (eg p53). 18  and was deemed as a potential prognostic biomarker. 24 AMDHD1 has been reported to be overexpressed in adrenal adenoma compared with adrenal carcinoma and is involved in the histidine metabolism pathway. 25 NETO2 was reported to be overexpressed in several cancers, including renal cancer, lung cancer and colon cancer. 26 Hu et al also found that high expression of NETO2 could be considered a potential biomarker of both advanced tumour progression and poor prognosis in colorectal cancer patients. 27 In addition to genes, miRNAs are a class of non-coding RNA molecules that play a vital role in cell differentiation, proliferation and survival by altering the expression of multiple genes. 28 lncRNAs are a class of long non-coding RNA transcripts with a vital role in carcinogenesis and cancer progression. 29,30 Therefore, we also introduced multiple miRNAs and lncRNAs to the prognostic models for found that the addition of transcriptomic risk score could improve traditional prognostic models' predictive accuracy. The discovery of transcriptomic signatures as prognostic biomarkers for cardia cancer and non-cardia cancer has the potential to be applied in GC risk stratification and personalized therapy.
There were several strengths in this study as follows: (a) this is a comprehensive study to identify the cardia- and non-cardia-specific GC prognostic biomarkers using TCGA STAD transcriptomic data;

In conclusion, based on GA-SVM and GA-Cox methods, we identified 23 (cardia: 10 and non-cardia: 13) site-specific GC prognostic biomarkers and developed two nomogram prognostic models for predicting cardia cancer and non-cardia cancer survival, providing more support for the individualized therapy of cardia and non-cardia GC patients.

ACKNOWLEDGEMENTS
We thank the Cancer Genome Atlas (TCGA) for sharing the transcriptomic database.

CONFLICTS OF INTEREST
The authors have declared no conflicts of interest.

DATA AVAILABILITY STATEMENT
All data will be made available upon request.