Systematic analyses of a novel lncRNA‐associated signature as the prognostic biomarker for Hepatocellular Carcinoma

Abstract Accumulating evidence implies that long noncoding RNAs (lncRNAs) play a crucial role in predicting survival for hepatocellular carcinoma (HCC) patients. This study aims to capture the current research hotspots of HCC, based on the analysis of publications related to HCC research from 2013 to 2017, and to identify a novel lncRNA signature for HCC prognosis through data mining in The Cancer Genome Atlas (TCGA). "Prognosis" and "biomarker" were located at the core of the HCC research hotspots. Moreover, long noncoding RNA was the top research frontier in HCC research. The associations between survival outcome and the expression of lncRNAs were evaluated by univariate and multivariate Cox proportional hazards regression analyses. Four lncRNAs (LINC00261, TRELM3P, GBP1P1, and CDKN2B-AS1) were identified as significantly correlated with overall survival (OS). These four lncRNAs were combined into a single prognostic signature. Low risk scores were significantly associated with longer overall survival (HR = 1.802, 95% CI [1.224-2.652], P = .003). Further analysis suggested that the prognostic value of this four-lncRNA signature was independent of clinical features. Enrichment analysis of the genes related to the prognostic lncRNAs was performed to identify the related pathways. Our study indicates, based on bioinformatics analysis, that this novel lncRNA expression signature may be a useful prognostic biomarker for HCC patients.


| INTRODUCTION
Hepatocellular carcinoma (HCC) ranks sixth among the most commonly occurring solid cancers worldwide and second among the leading causes of cancer death. 1 Hepatitis B or hepatitis C virus infection, alcohol drinking, and excessive smoking are the primary causes of HCC. 2,3 Despite emerging evidence on the molecular mechanisms of HCC and improved therapies for HCC, the average survival time is still short. According to recent research, over 60% of HCC patients in Japan are initially detected at an early stage, with an approximately 40% five-year survival rate and an average survival time of 50 months. 4 In the past decade, progress in the genome-wide analysis of the mammalian transcriptome has revealed a novel class of transcripts, long noncoding RNAs (lncRNAs), which are broadly transcribed in the genome. 5 LncRNAs are narrowly defined as transcripts of >200 nucleotides in length that lack significant open reading frames (ORFs). 6 In the nucleus, lncRNAs primarily modulate gene transcription and mRNA splicing, while in the cytoplasm they are involved in RNA activation and the stability of miRNA. 7 Further evidence suggests that the aberrant expression of lncRNAs has a clinical influence on the diagnosis and prognosis of HCC. [8][9][10] To date, lncRNA-associated biomarkers for the diagnosis of HCC have been reported in many studies. Nevertheless, few attempts have been made to report lncRNA signatures as prognostic biomarkers for HCC patients.
This study aims to capture the current research hotspots of HCC, based on the analysis of publications related to HCC research from 2013 to 2017, and to identify a novel lncRNA signature for HCC prognosis through data mining in The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov). By constructing a comprehensive lncRNA expression analysis, we identified a new candidate indicator for overall survival (OS) prediction in HCC patients.

| Source of the literature data and search strategy
Literature was searched in the Science Citation Index-Expanded (SCI-E) of Web of Science (WOS) of Clarivate Analytics on June 30, 2017. Because the data were collected from a public database and did not involve any interaction with human or animal subjects, ethical approval was not applicable.
All searches were conducted on the same day, June 30, 2017, to avoid bias from the daily updating of the database. The following terms were used in the search: Title = ("liver cancer*") OR Title = ("liver neoplasm*") OR Title = ("Hepatocellular Cancer*") OR Title = ("Hepatocellular carcinoma*") OR Title = ("hepatic cancer*") OR Title = ("hepatic neoplasm*") OR Title = ("cancer of the liver") OR Title = ("cancer of liver") AND Language = English. Only research articles and review articles were included.

| Literature data collection and analysis method
The data were independently collected from all eligible publications by two authors (Jing Sui and Yan Miao). The records were downloaded from WOS in txt format and imported into VOSviewer 1.6.5 (Leiden University, Leiden, Netherlands) and CiteSpace V (Drexel University, Philadelphia, PA, USA). The data were analyzed objectively. VOSviewer was used to carry out the cluster analysis of the literature and the hotspot analysis of keywords.

| TCGA database and patient information
Data for 377 HCC patients were downloaded from the TCGA database (up to January 28, 2016). Patients were excluded if (1) histologic diagnosis ruled out HCC or (2) they had another malignancy besides HCC. Overall, 317 HCC patients with corresponding clinical features such as race, age, gender, tumor stage, radiation therapy, and residual tumor were included in this study. The endpoint in this study was OS. Of these 317 HCC patients, 154 had tumor stage I, 78 had tumor stage II, 80 had tumor stage III, and 5 had tumor stage IV. As the data were retrieved from a public database (TCGA), further ethical approval was not applicable in this study. Data processing procedures also met the TCGA policy on data use and human subject protection (http://cancergenome.nih.gov/publications/publicationguidelines).

| RNA sequence data procession and lncRNA profile mining
The HCC RNA level 3 expression data were downloaded from the TCGA database. All the lncRNA sequencing raw reads were postprocessed and normalized using the TCGA RNASeqv2 system. 11 In this study, lncRNAs with a description in NCBI (https://www.ncbi.nlm.nih.gov/gene/) and Ensembl (http://www.ensembl.org/index.html) were selected for further study. To identify differentially expressed lncRNAs, patients were divided into the four HCC tumor stages (I, II, III, and IV), and each stage was compared with adjacent nontumor liver tissues. The intersection of the resulting lncRNA sets was selected for further analysis (Figure 1).

| Construction of the lncRNA-based prognostic signature and Statistical analysis
The expression profile of each lncRNA was log2-transformed for further statistical analysis. Differentially expressed lncRNAs with an expression value of 0 in more than 10% of all samples were eliminated. Univariate Cox proportional hazards regression was used to evaluate the association between the differentially expressed lncRNAs and the OS of HCC patients (P-value <.05). Then, multivariate Cox proportional hazards regression was used to identify the prognostic value of these lncRNAs as independent biomarkers.
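The zero-value filter and log2 transformation described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `filter_and_log2`, the rows-as-lncRNAs layout, and in particular the pseudocount of 1 (needed so that zeros can be log-transformed) are assumptions that the text does not specify.

```python
import numpy as np

def filter_and_log2(expr, max_zero_fraction=0.10, pseudocount=1.0):
    """Drop lncRNAs that are 0 in more than `max_zero_fraction` of samples,
    then log2-transform the rest.

    expr: 2-D array, rows = lncRNAs, columns = samples (normalized values).
    pseudocount: ASSUMPTION, added before the log so that zeros are defined.
    Returns (log2 matrix of kept rows, boolean mask of kept rows).
    """
    expr = np.asarray(expr, dtype=float)
    zero_fraction = (expr == 0).mean(axis=1)   # share of zero samples per lncRNA
    keep = zero_fraction <= max_zero_fraction  # "more than 10%" is removed
    return np.log2(expr[keep] + pseudocount), keep

# Toy matrix: 3 hypothetical lncRNAs x 10 samples.
toy = np.array([
    [0, 0, 3, 3, 3, 3, 3, 3, 3, 3],  # 20% zeros: removed
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 1],  # exactly 10% zeros: kept
    [7, 7, 7, 7, 7, 7, 7, 7, 7, 7],  # no zeros: kept
])
logged, kept = filter_and_log2(toy)
```

A row with exactly 10% zeros is retained here because the text removes only rows with more than 10% zeros.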
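Meanwhile, the prognostic lncRNA signature (the risk score model) was constructed based on a combination of the expression profiles of each prognostic lncRNA, weighted by their estimated regression coefficients in the multivariate Cox proportional hazards regression analysis, as follows:

$$\text{risk score} = \mathrm{exp}_{\mathrm{lncRNA}_1} \times \beta_{\mathrm{lncRNA}_1} + \mathrm{exp}_{\mathrm{lncRNA}_2} \times \beta_{\mathrm{lncRNA}_2} + \dots + \mathrm{exp}_{\mathrm{lncRNA}_n} \times \beta_{\mathrm{lncRNA}_n}$$

where $\mathrm{exp}_{\mathrm{lncRNA}_i}$ is the expression level of the $i$-th prognostic lncRNA and $\beta_{\mathrm{lncRNA}_i}$ is its estimated regression coefficient.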
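As a hedged illustration of this weighted sum, the sketch below computes a risk score per patient and splits patients at the median, which is the cutoff reported in the Results. The coefficient values and expression numbers are made up for the example, and the function names are hypothetical.

```python
import numpy as np

def risk_scores(expr, betas):
    """Risk score per patient: sum over lncRNAs of expression * coefficient.

    expr:  2-D array, rows = patients, columns = prognostic lncRNAs
           (log2-transformed expression).
    betas: 1-D array of multivariate Cox coefficients, one per lncRNA.
    """
    return np.asarray(expr, dtype=float) @ np.asarray(betas, dtype=float)

def split_by_median(scores):
    """Label each patient 'high' or 'low' relative to the median score.

    Scores equal to the median go to the 'low' group (an assumption about
    tie handling; the text only states that the median was the cutoff).
    """
    scores = np.asarray(scores, dtype=float)
    cutoff = np.median(scores)
    return np.where(scores > cutoff, "high", "low"), cutoff

# Hypothetical values: 3 patients, 2 lncRNAs, made-up coefficients.
toy_expr = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
toy_betas = [0.5, -1.0]
scores = risk_scores(toy_expr, toy_betas)
groups, cutoff = split_by_median(scores)
```

With an odd number of patients, sending ties to the low group makes it one patient larger than the high group.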
Kaplan-Meier survival curves were used to present the difference in OS between the high-risk score group and the low-risk score group. Statistical significance was examined by the log-rank test. Univariate and multivariate Cox proportional hazards regressions for OS were conducted for individual clinical features together with the lncRNA signature. The hazard ratio (HR) and 95% confidence intervals (CI) were calculated in this study. The prognostic performance at five years was assessed using time-dependent receiver operating characteristic (ROC) curves. 12
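For readers unfamiliar with the Kaplan-Meier estimator, a minimal pure-Python sketch is given below. It shows only the product-limit survival estimate (not the log-rank test), and it is an illustration under stated assumptions rather than the software used in the study; `kaplan_meier` is a hypothetical helper name.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) estimate of the survival function.

    times:  follow-up time per patient (e.g. days).
    events: 1 if the patient died at that time, 0 if censored.
    Returns a list of (time, survival probability) at each death time.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths > 0:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve

# Four hypothetical patients; the second one is censored at day 2.
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Censored patients contribute to the risk set until their last follow-up, which is why the step at day 3 uses two patients at risk rather than three.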

| Functional enrichment analysis
To investigate the biological features of the four lncRNAs in the signature, we identified the genes that were highly correlated with the expression of these four lncRNAs (Pearson |R| > 0.5) in the TCGA database. Pathways and biological processes were predicted using functional enrichment analysis of Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) in the Database for Annotation, Visualization, and Integrated Discovery (DAVID) (https://david.ncifcrf.gov/) Bioinformatics Resources 6.8. 13,14 P-value <.05 and FDR <0.05 were considered significant. Subsequently, the protein-protein interaction (PPI) network was constructed with the coexpressed genes via STRING (https://string-db.org/). 15,16

| RESULTS

| Cluster analysis and hotspot analysis on HCC research
A total of 1792 papers met the search criteria. These papers were analyzed by VOSviewer and divided into three clusters: "Patients Related Study," "Expression Related Study," and "Cell Related Study." The cluster analysis demonstrated that the dominant fields of HCC research comprise three research directions (Figure 2A).
Keywords used in the 1792 papers were extracted and analyzed by VOSviewer. As shown in Figure 2B, VOSviewer applied colors to keywords. The color of an item was determined by its frequency of occurrence; by default, colors range from blue (low frequency) to green (median frequency) to red (high frequency). Keywords with high frequency were captured and considered the hotspots in this field. From the literature analysis, we found hot keywords including hepatocellular carcinoma, prognosis, and biomarker. Thus, we concluded that the current research hotspot of HCC is the identification of prognostic biomarkers for HCC.
Furthermore, CiteSpace V was used to capture the keywords with the strongest citation bursts, which were identified as research frontiers over time. The top research frontier of HCC research was "long noncoding RNA" (Figure 3): this keyword appeared and grew rapidly. Considering this, our team set the final research objective, which was to discover a lncRNA-related prognostic biomarker for HCC. Based on this objective, we proceeded to the next step of lncRNA-related data mining. Here, we chose The Cancer Genome Atlas (TCGA) as the data source for both clinical information and bioinformatic data.

| Patient characteristics
There were 317 HCC patients included in this study, with data downloaded from the TCGA dataset. Based on the American Joint Committee on Cancer (AJCC) TNM stage, the HCC patients were divided into four groups: stage I, stage II, stage III, and stage IV. The age of all HCC patients was 58.019 ± 13.509 years. The OS time was 813.108 ± 747.979 days, and 106 of 317 (33.438%) HCC patients died.

FIGURE 2 Cluster analysis and hotspot analysis on hepatocellular carcinoma research. A, The literature was divided into three clusters: "Patients Related Study," "Expression Related Study," and "Cell Related Study." The cluster analysis demonstrated that the dominant fields of hepatocellular carcinoma research comprise three research directions. B, Keywords with high frequency were captured and considered the hotspots in this field.

| Identification of differentially expressed lncRNAs
We performed differential expression analysis by comparing the expression of 1081 lncRNAs in HCC and adjacent nontumor liver tissues. Fold change >2 or <0.5, P-value <.05, and FDR <0.05 were used as thresholds to identify significantly differentially expressed lncRNAs. Three hundred and seventeen differentially expressed lncRNAs were selected for further analysis, including 181 lncRNAs in stage I, 222 lncRNAs in stage II, 234 lncRNAs in stage III, and 165 lncRNAs in stage IV. We combined these four groups of 317 differentially expressed lncRNAs, and 90 lncRNAs were stably identified across the groups (Table S1).
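The thresholding step and the stage-wise intersection described in the Methods can be sketched as below. The per-lncRNA statistics are assumed to be precomputed by whatever differential expression tool was used (the text does not name one), and the lncRNA identifiers and function names are hypothetical.

```python
def differentially_expressed(stats, fc_hi=2.0, fc_lo=0.5, p_max=0.05, fdr_max=0.05):
    """Select lncRNAs passing the fold-change, P-value and FDR thresholds.

    stats: dict mapping lncRNA name -> (fold_change, p_value, fdr),
           computed for one tumor stage versus adjacent nontumor tissue.
    """
    return {
        name
        for name, (fc, p, fdr) in stats.items()
        if (fc > fc_hi or fc < fc_lo) and p < p_max and fdr < fdr_max
    }

def shared_across_stages(per_stage_stats):
    """Intersection of the differentially expressed sets of all stages."""
    selected = [differentially_expressed(s) for s in per_stage_stats.values()]
    return set.intersection(*selected) if selected else set()

# Made-up statistics for two stages and three hypothetical lncRNAs.
toy = {
    "stage I": {"lnc_A": (3.1, 0.001, 0.01), "lnc_B": (0.4, 0.02, 0.04), "lnc_C": (1.2, 0.001, 0.01)},
    "stage II": {"lnc_A": (2.5, 0.003, 0.02), "lnc_B": (0.3, 0.04, 0.20), "lnc_C": (4.0, 0.001, 0.01)},
}
shared = shared_across_stages(toy)
```

In the toy data only `lnc_A` passes in both stages: `lnc_B` fails the FDR threshold in stage II and `lnc_C` fails the fold-change threshold in stage I.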

| Prognostic signature construction
Based on these 165 differentially expressed lncRNAs and the clinical features of 317 HCC patients from the TCGA database, 18 lncRNAs significantly associated with OS (P < .05) were identified by the univariate Cox regression model (Table 1). Afterward, multivariate Cox proportional hazards regression analysis was used to evaluate the relationship of these 18 lncRNAs with OS, and only four lncRNAs exhibited a significant prognostic value for HCC: LINC00261, TRELM3P, GBP1P1, and CDKN2B-AS1 (Table 2 and Figure 6).
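To make the Cox regression step more concrete, the sketch below fits a Cox proportional hazards model by Newton-Raphson on the partial likelihood. It is a teaching-sized stand-in, not the software used in the study: it uses the Breslow treatment of tied times, has no safeguards against non-convergence, and the function name `fit_cox` and the toy data are assumptions.

```python
import numpy as np

def fit_cox(X, times, events, n_iter=25, tol=1e-9):
    """Fit Cox coefficients by Newton-Raphson (Breslow handling of ties).

    X:      covariates, shape (patients,) or (patients, covariates).
    times:  follow-up time per patient.
    events: 1 for death, 0 for censoring.
    Returns (beta, se); exp(beta) gives the hazard ratios.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        weights = np.exp(X @ beta)
        grad = np.zeros_like(beta)
        info = np.zeros((beta.size, beta.size))
        for i in np.flatnonzero(events):
            risk = times >= times[i]            # patients still at risk
            w, Xr = weights[risk], X[risk]
            s0 = w.sum()
            mean = (w[:, None] * Xr).sum(axis=0) / s0
            second = (Xr.T * w) @ Xr / s0
            grad += X[i] - mean                 # score contribution
            info += second - np.outer(mean, mean)
        step = np.linalg.solve(info, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    # Standard errors from the information matrix of the last iteration.
    se = np.sqrt(np.diag(np.linalg.inv(info)))
    return beta, se

# Toy data: one binary covariate, six hypothetical patients, all deaths observed.
beta, se = fit_cox([1, 1, 1, 0, 0, 0], [1, 2, 5, 3, 4, 6], [1, 1, 1, 1, 1, 1])
hazard_ratio = np.exp(beta)
```

In the toy data the patients with covariate value 1 tend to die earlier, so the fitted coefficient is positive and the hazard ratio exceeds 1. A univariate screen, as described above, would call this once per lncRNA; the multivariate model passes all retained lncRNAs as columns of `X`.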
Figure caption (fragment): … showing the relationship between these four lncRNAs and overall survival. The patients were divided into over- and underexpression groups by the mean lncRNA level; B, ROC curves of the four lncRNAs to distinguish HCC tissue from adjacent normal tissues.

The risk score for predicting prognostic value was constructed from these four lncRNAs, with each expression level weighted by its multivariate Cox regression coefficient. Based on the risk score model, HCC patients were classified, using the median risk score as the cutoff value, into a low-risk score group (n = 159) and a high-risk score group (n = 158) (Figure 7). K-M curves confirmed that the survival time of patients in the low-risk score group was 929.698 ± 773.779 days, significantly longer than that of the high-risk score group (695.032 ± 703.854 days, P = .002, Figure 8A). Furthermore, the risk score could largely predict the 5-year survival of HCC patients, as the area under the ROC curve (AUC) was 0.709 (Figure 8B). The expression patterns of these four differentially expressed lncRNAs in HCC and adjacent normal tissues, and in the low-risk score and high-risk score groups, are shown in Figure 9.
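As a rough guide to what an AUC value means, the sketch below computes a plain (not time-dependent) AUC as the probability that a randomly chosen patient with the event has a higher risk score than one without. This is a deliberate simplification: the study used time-dependent ROC curves, which additionally account for censoring, and the function name and numbers here are hypothetical.

```python
def auc(scores, labels):
    """Rank-based AUC: share of (positive, negative) pairs ranked correctly.

    scores: risk score per patient.
    labels: 1 if the patient had the event (e.g. died within five years), else 0.
    Tied scores count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Four hypothetical patients: three of the four positive/negative pairs are ordered correctly.
value = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking.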

| Correlation between lncRNA signature and clinical characteristics
We examined the association of the four-lncRNA signature (risk score) with clinical features in HCC patients using univariate and multivariate Cox proportional hazards regression; the features significant in the univariate analysis (P < .05) are listed in Table 3. Meanwhile, the multivariate Cox proportional hazards regression showed that Neoplasm cancer (P = .002) and risk score (P < .001) could each serve as an independent prognostic indicator of HCC (Table 3).
In this study, the K-M curves of these clinical features are shown in Figure 10A. Moreover, the risk score conferred a prognostic value for predicting patients' tumor stage (AUC = 0.603, P = .002) and Neoplasm cancer status (AUC = 0.586, P = .001) (Figure 10B).

| Functional assessment of the four-lncRNA signature
There were 626 genes identified in the TCGA database as coexpressed with these four lncRNAs (LINC00261, TRELM3P, GBP1P1, and CDKN2B-AS1) (|R| > 0.5), including 424 genes with LINC00261, 36 genes with TRELM3P, 132 genes with GBP1P1, and 31 genes with CDKN2B-AS1 (Table S2). Functional enrichment analysis revealed 628 enriched GO terms and 131 pathways (P-value <.05 and an enrichment score of >1.5; Table S3). The top GO biological processes of the coexpressed genes were small molecule metabolic process (GO: 0044281) and cellular nitrogen compound metabolic process (GO: 0034641) (Table 4 and Figure 11A). In the pathway analysis, the coexpressed genes were mainly enriched in "Metabolic pathways" and "Valine, leucine and isoleucine degradation" (Table 4 and Figure 11B). In the constructed protein-protein interaction (PPI) network, there were 470 genes, which were regarded as hub genes (Figure 12).
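The coexpression criterion (|R| > 0.5) can be illustrated with a short NumPy sketch that correlates one lncRNA against every gene across the same samples. The function name, the genes-as-rows layout, and the toy values are assumptions for illustration only.

```python
import numpy as np

def coexpressed_genes(lnc_expr, gene_expr, gene_names, threshold=0.5):
    """Genes whose Pearson correlation with one lncRNA exceeds |R| > threshold.

    lnc_expr:   1-D array, expression of the lncRNA across samples.
    gene_expr:  2-D array, rows = genes, columns = the same samples.
    gene_names: one name per row of gene_expr.
    Returns a dict mapping gene name -> Pearson R.
    """
    lnc = np.asarray(lnc_expr, dtype=float)
    genes = np.asarray(gene_expr, dtype=float)
    lnc_c = lnc - lnc.mean()
    genes_c = genes - genes.mean(axis=1, keepdims=True)
    denom = np.sqrt((lnc_c ** 2).sum()) * np.sqrt((genes_c ** 2).sum(axis=1))
    r = genes_c @ lnc_c / denom
    return {name: float(ri) for name, ri in zip(gene_names, r) if abs(ri) > threshold}

# Hypothetical expression values across four samples.
hits = coexpressed_genes(
    [1, 2, 3, 4],
    [[2, 4, 6, 8],     # perfectly positively correlated
     [4, 3, 2, 1],     # perfectly negatively correlated
     [1, -1, -1, 1]],  # uncorrelated
    ["gene_up", "gene_down", "gene_flat"],
)
```

Both positively and negatively correlated genes are kept, because the threshold applies to the absolute value of R.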

| DISCUSSION
Hepatocellular carcinoma (HCC) is one of the deadliest malignancies, with high global mortality. Most HCC patients are diagnosed in the advanced stages of tumor progression (stage III and stage IV). 17 However, HCC patients in the same stage might exhibit different prognostic outcomes, owing to differences in various biomarkers, which are still being discovered. 18 Novel biomarkers for early diagnosis, therapeutic process monitoring, and prognostic evaluation might increase the survival rate for HCC. Accumulating evidence suggests that lncRNAs might play a major role in tumorigenesis, metastasis, development, and the prognosis of HCC. [19][20][21][22] Large-scale genome analyses have revealed the molecular characteristics associated with HCC OS. [23][24][25] However, most studies focused on miRNA, mRNA, gene, and protein expression. 26 … growing, the functional role of lncRNAs in tumorigenesis and development also represents a significant untapped resource for HCC prognosis.
In the present study, to identify lncRNAs significantly related to the OS of HCC, data on HCC patients from the TCGA database were analyzed in groups by TNM stage, together with clinical features. After univariate and multivariate Cox proportional hazards regression, a total of four OS-related lncRNAs were identified as having significant prognostic value for HCC survival. The signature (risk score) was then built by combining these four lncRNAs, and this four-lncRNA signature was found to independently predict OS in HCC patients. The advantage of this study is the combination of clinical features and TCGA data to assess the survival of HCC patients through a lncRNA-related risk score.
Wang et al. 31 also identified a four-lncRNA signature (RP11-322E11.5, RP11-150O12.3, AC093609.1, CTC-297N7.9) that might be an independent prognostic biomarker for the prediction of HCC patient survival. However, compared with that previous study, we used more stringent screening criteria. Firstly, we used a different classification of the clinical information extracted from the TCGA datasets. Secondly, we screened out the lncRNAs that were not described in NCBI and Ensembl; the remaining lncRNAs were considered to have potential clinical significance for further validation. The present study shows that the expression of four novel lncRNAs could also become a novel independent prognostic signature for HCC patients. Accumulating evidence has shown that a series of lncRNAs could act as tumor suppressors or oncogenes in HCC. However, the roles of most lncRNAs in HCC remain largely unknown. Hu et al. 32 found that overexpressed SVUGP2 could suppress cell proliferation and the invasion ability of HCC cell lines in vitro, and tumor growth in vivo. SchLAH was found to be downregulated in HCC and significantly correlated with shorter overall survival of HCC patients. 33 Moreover, HOTAIR and HOTTIP were also upregulated in HCC, indicating a poorer prognosis and reduced overall survival. [34][35][36] Among the four lncRNAs in the risk score, decreased LINC00261 was identified as associated with poor prognosis and metastasis in gastric cancer (GC). 37 Moreover, LINC00261 was found to be related to cell growth, migration, cell proliferation, and cell apoptosis in endometriosis and choriocarcinoma. 38,39 Furthermore, multivariate analyses revealed that expression of CDKN2B-AS1 could be an independent predictor for OS (P = .036) in GC. 40 The other two lncRNAs (TRELM3P and GBP1P1) have not been reported to date. Moreover, we identified the genes that were strongly related to the expression of these four lncRNAs in the HCC dataset from the TCGA database.
The relevant genes were mainly enriched in metabolic pathways, "Valine, leucine and isoleucine degradation," cellular nitrogen compound metabolic process, and small molecule metabolic process. However, no study has yet investigated the biological and clinical function of these four lncRNAs in HCC, and much research remains to be done.
The findings of the present study may have substantial clinical significance. However, its limitations should be taken into consideration. Firstly, only 1801 human lncRNAs were identified, from which those with a description in NCBI and Ensembl were selected for further study. The prognosis-related lncRNAs identified here might not represent all the lncRNAs potentially related to HCC OS. Secondly, the mean follow-up time in the model was 813.108 days. Thus, further study with a longer follow-up time is warranted. Thirdly, the role of these four lncRNAs in HCC is still unknown; in vivo and in vitro experiments should be conducted in further studies.
In conclusion, by synthetically analyzing the HCC lncRNA expression profiles in the TCGA database, we identified a four-lncRNA signature, which could act as an indicator of HCC patient outcome and could be a potential independent biomarker for prognosis prediction of HCC. Future functional investigations are required to explore the mechanisms underlying the roles of these lncRNAs in HCC.