Identification of seven‐gene marker to predict the survival of patients with lung adenocarcinoma using integrated multi‐omics data analysis

Abstract Background The mechanism of cancer occurrence and development could be understood with multi‐omics data analysis. Discovering genetic markers is highly necessary for predicting clinical outcome of lung adenocarcinoma (LUAD). Methods Clinical follow‐up information, copy number variation (CNV) data, single nucleotide polymorphism (SNP), and RNA‐Seq were acquired from The Cancer Genome Atlas (TCGA). To obtain robust biomarkers, prognostic‐related genes, genes with SNP variation, and copy number differential genes in the training set were selected and further subjected to feature selection using random forests. Finally, a gene‐based prediction model for LUAD was validated in validation datasets. Results The study filtered 2071 prognostic‐related genes and 230 genomic variants, 1878 copy deletions, and 438 significant mutations. 218 candidate genes were screened through integrating genomic variation genes and prognosis‐related genes. 7 characteristic genes (RHOV, CSMD3, FBN2, MAGEL2, SMIM4, BCKDHB, and GANC) were identified by random forest feature selection, and many genes were found to be tumor progression‐related. A 7‐gene signature constructed by Cox regression analysis was an independent prognostic factor for LUAD patients, and at the same time a risk factor in the test set, external validation set, and training set. Noticeably, the 5‐year AUC of survival in the validation set and training set was all ˃ 0.67. Similar results were obtained from multi‐omics validation datasets. Conclusions The study builds a novel 7‐gene signature as a prognostic marker for the survival prediction of patients with LUAD. The current findings provided a set of new prognostic and diagnostic biomarkers and therapeutic targets.

metastasis (TNM) phase is currently the most widely used system to assess the clinical outcome of cases with LUAD. 6 However, the prognosis of LUAD patients with the same pathologic stage varies greatly. [7][8][9][10] Currently, an effective system for predicting the outcome of LUAD is needed to help clinicians evaluate treatment outcomes and promote personalized therapy.
A number of studies were conducted for finding biomarkers predictive of long-term LUAD survival. Biological markers are mainly categorized into several classes, firstly, single molecule as a sepa-  14 Noticeably, these above gene signatures have been independently verified in external datasets, but have not been clinically applied. Thus, it is crucial to analyze the biological function more comprehensively to identify genes associated with LUAD prognosis.
To effectively build a reliable LUAD prognostic processassociated gene signature, this study developed a scientific pipeline for identifying LUAD-associated gene markers. Single nucleotide mutations, gene expression profiles, and copy number variation data for LUAD patients were acquired from GEO and TCGA databases. A 7-gene signature was established by integrating transcriptomics and genomics data to screen prognostic markers. The ability of the signature to predict LUAD survival was validated in external validation and internal test sets. The results validated that the 7-gene signal participated in vital biology courses and pathways in LUAD. Similarly, GSEA analysis also showed consistent outcomes. The current data indicated that the 7-gene signature could efficiently estimate cancer prognostic of LUAD patients, providing a better understanding of the molecular mechanism of LUAD prognosis.

| Data collection and processing
A total of 516 samples containing SNP 6.0 chip copy number vari-

| Univariate Cox proportional hazards regression
To identify genes closely related to OS from the train dataset, we performed univariate Cox proportional hazard regression study as previously described in Jin-Cheng et al. 16 p < 0.05 was the threshold.

| Data analysis on copy number variation
GISTIC detects both focal and broad (probably overlapping) reappearing events. We used GISTIC 2.0 17 software to determine genes with significant amplification or deletion. The parameter thresholds for fragments with amplification or deletion lengths were ˃ 0.1 and p < 0.05.

| Analysis of gene mutation
Significantly mutated genes in the MAF file of TCGA mutation data were screened by Mutsig 2.0 software. The threshold was set to p < 0.05.

| Development of a prognostic immune gene signature
Genes significantly related to patient OS, amplification, deletions, and mutations were selected, and those showing prognostic significance were obtained applying randomized survival forest algorithm. 18 According to Jin et al, 19 random survival forest in R package was used for filtering genes, iterations number of Monte Carlo was 100, and the number of previous progressions was 5.
Characteristic genes were defined as genes exhibiting relative significance higher than 0.4. Further Cox regression study based on multiple variables was performed, and the following risk scoring mode was built: In which N denotes quantity of prognosis-related genes, Exp k refers to the expression level of prognosis-related genes, and e HR k represents the assessed gene regression coefficient in the multivariate Cox regression study.

| Identification of gene sets in correlation with survival of LUAD cases
For samples of the TCGA train set, the study performed univariate regression to examine the relation between gene expression and patients' overall survival (OS). Among the 2071 prognostic genes identified, 1161 genes were of HR<1, while 910 genes were of HR>1.
The top 20 genes were listed in Table 2.

| Genomic variation identification with gene set
Genes showing significant deletion or amplification were identified by GISTIC 2.0 using copy number variation data from TCGA.
There were 20 significantly amplified fragments in the genome of 230 genes ( Figure 2A)  Moreover, there were 21 significant deletion fragments ( Figure 2B) involving 1878 genes, and among these genes, CD1 showed significant absence in the 1p13.2 segment (q value = 2.63E-05), APC had significant deletion in the5q13 segment (q value = 0.017957), and CDKN2B also showed a great deletion in the 9p21.3 segment Mutsig2 was used to screen significantly mutated genes using TCGA mutational annotation data; here, a total of 438 genes with significant mutation frequencies were detected.

| Functional analysis on genomic variant genes
For investigating the functions of genes with genomic variations, 2261 deleted or amplified genes and significant mutant genes screened based on copy number variation were integrated. GO biological process and KEGG functional up-regulation study was conducted on the 2261 genes. KEGG enrichment analysis revealed that natural killer cellmediated cytotoxicity, t-cell receptor signaling pathway, MAPK signaling pathway, chemokine signaling pathway, Foxo signaling pathway, other KEGG biological pathways, and non-small-cell lung cancer were related to the cancer development ( Figure 3A). In the category of biological process, the pathways were mainly enriched in metabolic process, cell communication, cell differentiation, developmental process, and other GO terms ( Figure 3B). Interestingly, these above terms were closely correlated with cancer progress. Our data indicated that the genes of these genomic variants were linked to tumor development.

| A 7-gene signature for LUAD survival was built
Genomic variant genes and prognosis-related genes were integrated, and a sum of 218 genes in the intersection of the three groups was selected as a candidate gene. Based on the relation of the quantity of classification trees and error rate, random forest was used in feature selection ( Figure 4A). Here, genes with a significance greater than 0.4 were recruited to build a gene signature, and here, 7 genes were acquired ( Table 3). The 7 genes were ranked for their out-of-bag value ( Figure 4B). The 7-gene signature was built with Cox regression analysis based on multiple variables as follow: Patients' prognosis in the two risk groups was significantly different ( Figure 4C). Our result showed that the 3-year AUC of the model was 0.71 in the training set ( Figure 4D). Analysis on expression correlation of the 7 genes and risk score showed that high expression and high-risk correlation of RHOV, CSMD3, FBN2, and MAGEL2 were risk factors, whereas low-risk correlation of SMIM4, BCKDHB, GANC, and high expression was protective factors ( Figure 4E).

| The robustness of the 7-gene signature was verified
To verify the robustness of the 7-gene signature, the sample risk score in the test set was calculated. According to the threshold of the training set, we divided the samples into two groups, and noticeable prognostic diversifications between the 2 groups were observed ( Figure 5A). Analysis of ROC showed a 5-year AUC of 0.68 ( Figure 5B). The risk score and the expression of the 7 genes were TA B L E 3 7-genes was significantly associated with the overall survival in the training-set patients  Figure 5C). Thus, the model was validated as an effective prognostic classifier in the TCGA dataset.
GEO platform was an external dataset for verifying the classification of the gene prediction system in different platforms. The signature was applied to determine sample risk score. The cutoff of the training set was the threshold for grouping samples into low-and highrisk groups. We found that patients' prognosis was evidently better in the low-risk group than the high-risk group ( Figure 6A). Moreover, ROC analysis showed a 3-year AUC of 0.66, which was similar to the training set ( Figure 6B), and the relationship between the risk score and the expressions of the 7 genes was also consistent in the training set ( Figure 6C). To conclude, our 7-gene signature model showed a prognostic significance in both external and internal datasets.

| The 7-gene signature model was clinically independent
To identify whether the 7-gene signature system was independent in clinical applications, the 95% confident interval (

| Validation of 7-gene signature model in CNV and mutation datasets
We added additional validation of CNV and mutation datasets.
Specifically, we derived the lung adenocarcinoma samples from ICGC mutation dataset LUSC-KR (https://dcc.icgc.org/proje cts/ LUSC-KR), to verify that these gene mutations are in the relationship with the prognosis. Multivariate regression analysis was used to calculate the mutation characteristics score of 7 genes in each patient. ROC analysis showed that the 5-year AUC was 0.79, and the 1-year AUC was 0.6 ( Figure 7A), suggesting that the mutation characteristics of these 7 genes could effectively evaluate the prognosis of patients. The R software package maxstat was used to classify patients, and the prognosis of patients with high scores was significantly worse than that of patients with low scores ( Figure 7B).
We also obtained from the GEO database of lung adenocarcinoma of CNV data queue GSE36363 (https://www.ncbi.nlm.nih. gov/geo/query/ acc.cgi?acc=GSE36363), extracted copy number of seven genes. The copy number characteristic scores of seven genes were calculated for each patient using the same method. ROC analysis showed that the 5-year AUC was 0.68 and 1-year AUC was 0.77 ( Figure 7C), which suggested that the prognosis of patients could be effectively evaluated based on the copy number characteristics of these 7 genes. Patients were classified by maxstat, R software package. Patients with high scores had significantly worse outcomes than those with low scores ( Figure 7D).

| Enriched pathway differences in the two risk groups analyzed by GSEA
GSEA in the TCGA training set was used to analyze significantly enriched pathways in the two groups (low-risk and high-risk groups), and we obtained a total of 30 significantly enriched pathways, including pathways (eg, P53 signaling pathway, DNA replication, focal adhesion, cell cycle), which were tightly linked to the progress and metastasis of LUAD. The above pathways were greatly enriched to the samples of the high-risk group (Figure 8).  32 Mutation of CSMD3 in non-small-cell lung cancer leads to increased proliferation of airway epithelial cells, 33 and abnormal methylation of FBN2 is a biomarker for lung cancer. 34,35 Our research provides a better understanding for subsequent study on the clinical significance and biological role of the 7 genes. MAGEL2, SMIM4, BCKDHB, and GANC have not been previously found to be related to tumors, and they were the first confirmed as novel prognostic markers for LUDA in this study.

| DISCUSS ION
Furthermore, the current GSEA analysis also showed that pathways enriched by the 7 genes were closely associated with biological processes and pathways in LUDA progress. According to the current findings, the 7-gene signature could be clinically applied, providing possible targets for diagnosis of LUDA patients.

ACK N OWLED G M ENT
None.

CO N FLI C T O F I NTE R E S T
The authors declare no conflicts of interest. The analyzed data generated during the study are available from the corresponding author on reasonable request.