Genomewide identification of a novel six‐LncRNA signature to improve prognosis prediction in resectable hepatocellular carcinoma

Abstract The current prognostic long noncoding RNA (lncRNA) signatures for hepatocellular carcinoma (HCC) are still controversial and need to be optimized by systematic bioinformatics analyses with suitable methods and appropriate patients. Therefore, we performed the study to establish a credible lncRNA signature for HCC outcome prediction and explore the related mechanisms. Based on the lncRNA profile and the clinical data of carefully selected HCC patients (n = 164) in TCGA, six of 12727 lncRNAs, MIR22HG, CTC‐297N7.9, CTD‐2139B15.2, RP11‐589N15.2, RP11‐343N15.5, and RP11‐479G22.8 were identified as the independent predictors of patients’ overall survival in HCC by sequential univariate Cox and 1000 times Cox LASSO regression with 10‐fold CV, and multivariate Cox analysis with 1000 times bootstrapping. In the Kaplan‐Meier analysis with patients trichotomized by the six‐lncRNA signature, high‐risk patients showed significantly shorter survival than mid‐ and low‐risk patients (log‐rank test P < 0.0001). According to the ROCs, the six‐lncRNA signature showed superior predictive capacity than the two existing four‐lncRNA combinations and the traditional prognostic clinicopathological parameter TNM stage. Furthermore, low MIR22HG and CTC‐297N7.9, but high CTD‐2139B15.2, RP11‐589N15.2, RP11‐343N15.5, and RP11‐479G22.8, were, respectively, demonstrated to be related with the malignant phenotypes of HCC. Functionally, the six lncRNAs were disclosed to involve in the regulation of multiple cell cycle and stress response‐related pathways via mediating transcription regulation and chromatin modification. In conclusion, our study identified a novel six‐lncRNA signature for resectable HCC prognosis prediction and indicated the underlying mechanisms of HCC progression and the potential functions of the six lncRNAs awaiting further elucidation.


Abstract
The current prognostic long noncoding RNA (lncRNA) signatures for hepatocellular carcinoma (HCC) are still controversial and need to be optimized by systematic bioinformatics analyses with suitable methods and appropriate patients. Therefore, we performed the study to establish a credible lncRNA signature for HCC outcome prediction and explore the related mechanisms. Based on the lncRNA profile and the clinical data of carefully selected HCC patients (n = 164) in TCGA, six of 12727 lncRNAs, MIR22HG, CTC-297N7.9, CTD-2139B15.2, RP11-589N15.2, RP11-343N15.5, and RP11-479G22.8 were identified as the independent predictors of patients' overall survival in HCC by sequential univariate Cox and 1000 times Cox LASSO regression with 10-fold CV, and multivariate Cox analysis with 1000 times bootstrapping. In the Kaplan-Meier analysis with patients trichotomized by the six-lncRNA signature, high-risk patients showed significantly shorter survival than mid-and low-risk patients (log-rank test P < 0.0001). According to the ROCs, the six-lncRNA signature showed superior predictive capacity than the two existing four-lncRNA combinations and the traditional prognostic clinicopathological parameter TNM stage. Furthermore, low MIR22HG and CTC-297N7.9, but high CTD-2139B15.2, RP11-589N15.2, RP11-343N15.5, and RP11-479G22.8, were, respectively, demonstrated to be related with the malignant phenotypes of HCC.
Functionally, the six lncRNAs were disclosed to involve in the regulation of multiple cell cycle and stress response-related pathways via mediating transcription regulation and chromatin modification. In conclusion, our study identified a novel six-lncRNA signature for resectable HCC prognosis prediction and indicated the underlying mechanisms of HCC progression and the potential functions of the six lncRNAs awaiting further elucidation.

K E Y W O R D S
biomarker, hepatocellular carcinoma, LncRNA, prognosis 1 | INTRODUCTION Hepatocellular carcinoma (HCC), accounting for 70% to 90% of primary liver cancer, is the fifth most common malignancy and the second leading cause of cancer death worldwide. [1][2][3] Although great progress has been made on the diagnosis, treatment, and prognosis dictation of HCC, the clinical outcome is still unsatisfied. 3,4 The declaration of prognostic biomarkers might be helpful for the optimization of treatment and thus the improvement of patients' prognosis. 2 Nonetheless, HCC, unlike for other tumor types, is a special disease with two competing death causes including cirrhosis and cancer, the combination of which decides the variations of HCC biological progression and makes the prognosis prediction highly complex. 2,[5][6][7][8] The identification of prognostic biomarkers should be integrated with the analysis of cancer stage, and the general status of both host patients and underlying liver. 2,5,7 Besides tumor extents such as tumor size, grade, and stage, it is believed that liver fibrosis and function level, patients' performance status, and postsurgical residual status are essential, and AFP and hepatitis activities are additional indicators for HCC prognosis. 2,5,[7][8][9] The development of high-throughput sequencing techniques and bioinformatics methods has disclosed the potential prognostic value of genomic biomarkers including long noncoding RNAs (lncRNAs). 10 LncRNAs are a class of noncoding RNAs with lengths longer than 200 nucleotides, show no capacity of protein coding, but might regulate gene expression at genetic and epigenetic levels. 11 So far, variety of lncRNAs or lncRNA groups have been uncovered might dictate HCC prognosis by multiple studies, 10,[12][13][14][15][16][17] most of which are about the functional roles and working mechanisms of single lncRNAs, [14][15][16] and totally two (by Wang et al in 2017 and Ma et al in 2018) are found to describe the prognostic significance of multiple-lncRNA signature in HCC. 13,17 However, we should be wary of the conclusions of the previous studies for the following reasons.
First, most previous studies screened out survival related lncRNAs based on the differential profile between cancer and noncancerous samples. [12][13][14][15][16] The method could miss considerable survival information in The Cancer Genome Atlas (TCGA) datasets including some important prognostic genes, 18 and lncRNAs as well. Second, most researchers did not perform preanalysis case selection, 13,16,17 whereas using an uncurated TCGA dataset without attention to sample characteristics can lead to false associations and undermine the application of the conclusions. 19 The patients with the diagnosis of non-HCC liver cancer (cholangiocarcinoma), pathological metastasis, and postsurgical residual carcinoma, the history of neoadjuvant treatment, and too short survival time are considered to have distinct biological procedures and progression mechanisms, and should be excluded from the prognostic analysis. 8,19,20 Third, most studies performed the analysis by mixing resectable and unresectable cancers. 13 2,3,[5][6][7][8]21 thus should be analyzed distinctively for the development of prognostic models. Fourth, most published prognostic lncRNAs and the two established four-lncRNA signatures were identified without evaluating the relationship with other known promising HCC prognostic factors, such as the fibrosis and function levels of liver, and the performance status and serum parameters of patients, [13][14][15][16][17] all of which are considered vital features of HCC. 2 Fifth, the candidate prognosis-related lncRNAs in most previous studies were screened out by uni/multivariate Cox analysis, [12][13][14][15][16][17] the accuracy of which is considered inferior to penalized Cox regression for small-sample and high-dimensional data. 22 In the current study based on appropriate selection of patients and comprehensive analysis of genetic profile and clinicopathological parameters (CPPs), we develop a novel six-lncRNA signature that could well predict patients' prognosis in resectable HCC by combining integrated bioinformatics tools with multiple gene-profiling datasets. According to the receiver operating characteristics (ROCs), the six-lncRNA signature shows better prediction accuracy than the previously discovered lncRNA groups and the traditional prognostic tumor-node-metastasis (TNM) stage. Subsequent gene set enrichment analysis (GSEA), gene-set-lncRNA network construction, and Gene Ontology-molecular function (GO-MF) assays for lncRNA-related mRNAs disclose that these lncRNAs are involved in the vital processes associated with HCC progression. The results may enrich our knowledge on the progression mechanisms of HCC and also present a six-lncRNA signature as the potential prognosis biomarker for resectable HCC.

| Data source and processing
The expression values of lncRNA based on the Reads Per Kilobases per Million mapped reads (RPKM) were downloaded from the Atlas of Noncoding RNAs in Cancer (TANRIC) in TCGA Liver Hepatocellular Carcinoma F I G U R E 2 The seed lncRNAs were extracted by 1000 times Cox LASSO regression. A, Highly consistency was demonstrated in the lncRNAs among the 11 extracted lncRNA sets. The left ordinate indicates the seed lncRNA set and the number of seed lncRNAs found by every single iteration of LASSO. The right ordinate is the frequency of the seed lncRNA set disclosed through the 1000 times Cox LASSO regression. The horizontal ordinate is the lncRNA name. The yellow block represents the occurrence of the particular lncRNA in the specific lncRNA set; B, Totally 14 seed lncRNAs with >600 occurrences in the most common lncRNA set were filtered out for further analysis. The blue column indicates the frequency of each lncRNA occurs in the most common lncRNA set 16  (LIHC) database. The mRNA expression profile, as well as the phenotypes and prognostic data, was downloaded from the University of California Santa Cruz (UCSC) xena website (https://xena.ucsc.edu/). All of the included cases have primary HCC, definite report of survival status, and lncRNAs' expression values, whereas those with the diagnosis of cholangiocarcinoma, pathological metastasis, and postsurgical residual carcinoma (R1 and R2), the survival duration not longer than 30 days, and the history of neoadjuvant treatment were excluded for further analysis.
LncRNAs were selected based on the following criteria according to the expression values and the calculated median and standard deviation (SD). First, lncRNAs with nonzero values in more than 66.7% of the cases were included. Second, the median and SD of the lncRNA should be larger than 1. Third, the lncRNAs were ranked by SDs and those with the SD larger than the median of SDs were included.

| Cox survival analysis and least absolute shrinkage and selection operator (LASSO) regression with 10-fold cross-validation
The prognostic value of each lncRNA was firstly calculated in the univariate Cox analysis using R/survival package, and the lncRNAs with P < 0.01 were selected as seed lncRNAs for Cox LASSO regression with 10-fold cross-validation (CV).
LASSO is the penalized regression that uses an L1 penalty to shrink regression coefficients toward zero, thereby eliminate a number of variables based on the principle that the larger the penalty the fewer predictors selected. Thus, the seed lncRNAs with nonzero coefficients were considered as potential prognostic predictors. By 1000 iterations of Cox LASSO regression with 10fold CV using the R package glmnet (with the default parameter "standardize = T"), 23 the seed lncRNAs were shrunk into multiple-lncRNA sets. The lncRNA sets including lncRNAs with nonzero coefficients were potential prognostic models, whereas those with no lncRNA showing nonzero coefficients were designated as "not available (NA)" for prognostic prediction. The potential prognostic lncRNAs in the most common prognostic ln-cRNA set and with at last 600 occurrences were applied into further analysis.
To identify the prognostic value of the lncRNAs, multivariate Cox regression with 1000 times bootstrapping was further performed using R/survival package based on each "significant" lncRNA disclosed in the above steps, and the AJCC TNM stage was used as the adjustment factor. A ln-cRNA with P < 0.05 was defined as significant. The corresponding hazard ratio (HR), 95% confidence interval (CI), and P value were collected.
Furthermore, we calculated the resulting area under the curve (AUC) every year based on the time-dependent ROC curves using previously reported method 24,25 and plotted the AUC (t) curves to compare the prediction accuracy of our candidate lncRNA prediction model with the other two published multiple-lncRNA combinations and the traditional CPPs. The traditional CPPs with missing values above 25% were excluded, while the CPPs with enough available data (age, sex, grade, and TNM stage) were firstly evaluated for the prognostic significance using K-M curve and log-rank test, and the statistically significant one (TNM stage) was applied for further study.

F I G U R E 3
The prognostic significance and superiority of the novel six-lncRNA signature were, respectively, illustrated by the Kaplan-Meier

| Correlation analysis between each lncRNA and CPPs
The associations of each lncRNA with the CPPs were further analyzed in patients with available data. For the categorical variables, such as sex, Eastern Cooperative Oncology Group Performance Status Score (ECOG PS), risk factor, Child-Pugh grade, ISHAK score, tumor grade, T classification, and TNM stage, Spearman correlation analysis, Wilcoxon sum rank test or Kruskal-Wallis test was used as appropriate. For the continuous variables including age, body mass index (BMI), tumor weight, and the serum level of alpha-fetoprotein (AFP), albumin (ALB), creatinine, platelet (PLT), and prothrombin time (PT), Pearson correlation analysis was used. Furthermore, Benjamini-Hochberg procedure was used to control the false discovery rate (FDR). P < 0.05 and FDR <0.3 was defined as statistically significant. Because of the dimension inconsistency in source data of ALB and creatinine, we used adjusted values for the further analysis, and the adjusted values were calculated as follows: ALB average (or creatinine average ) = (upper limit of normal value + lower limit of normal value)/2, ALB adjust = (ALB measure −ALB average )/ ALB average , creatinine adjust = (creatinine measure −creatinine average )/ creatinine average .

| GSEA and gene-set-lncRNA network construction with each of the lncRNAs
Based on the mRNA profile of the 20 530 genes downloaded from UCSC database, GSEA was performed by the JAVA program (https://www.broadinstitute.org/gsea) to identify the lncRNA-related gene sets using MSigDB H: hallmark gene sets as functional gene sets and the expression level of each lncRNA as phenotype. After performing 1000 permutations, the first 20 gene sets with FDR q < 0.25 and P < 0.05 were considered to be significantly enriched. The gene-set-lncRNA network and the corresponding heat map were then constructed with R/igraph and R/heat map package.

GO-MF analysis
To predict the molecular function of each candidate lncRNA, lncRNA-related mRNAs were firstly filtered out by Pearson correlation analysis with TCGA dataset (P -and FDR q-value < 0.001) and then applied into further GO-MF analysis using the Bioconductor "clusterProfiler" package with the statistical significance standard of P-and FDR q-value < 0.05.

| Basic characteristics of patients
The analysis procedure of the current study is shown in Figure 1. The basic characteristics of the patients are listed in Table S1. In TCGA dataset, there were totally 164 HCC patients with the available data (Table S2)

| Identification of six key lncRNAs for HCC patients' survival
Totally, 12 727 lncRNAs were analyzed for the prognostic significance in univariate Cox survival analysis, and 30 lncRNAs with P < 0.01 were filtered out and applied to 1000 times Cox LASSO regression with 10-fold CV. As shown in Figure 2A, totally 11 lncRNA groups were disclosed, and the highly consistency among the lncRNA

| Confirmation for the prognostic role of the six-lncRNA model for HCC
According to the K-M analysis ( Figure 3A), there is a significant difference in patients' survival among high-, mid-, and low-risk groups divided by the six-lncRNA signature (logrank test P < 0.0001), and patients in the high-risk group had significantly shorter survival (median 21.3 months) than those in the mid-(median 38.3 months) and low-risk groups (median 81.9 months).
In the time-dependent ROC curve analysis ( Figure 3B), the AUCs in the first, third, and fifth year are 0.830, 0.739, and 0.852, respectively, and the prediction capability of our six-lncRNA signature is superior to the two published four-ln-cRNA groups. Moreover, the inclusion of six-lncRNA linear predictor to the prognostic model using TNM stage obviously improved the prediction ability for survival, as demonstrated by the increase of the resulting AUC values.
The significant trends in the associations of each lncRNA expression with clinicopathological characteristics of the cohort are shown in Table 3 and Figures 4 and 5. On the one hand, we found two prognosis benefit lncRNAs. MIR22HG was demonstrated to be negatively correlated with T classification (P = 0.0381) and Child-Pugh score (P = 0.0135) of HCC ( Figure 4A,B), and CTC-297N7.9 was negatively associated with ECOG PS of patients (P = 0.0444), the grade (P = 0.0059), T classification (P = 0.0216), and TNM stage (P = 0.0087) of tumors, and the serum levels of ALB (P = 0.0121) ( Figure 4E-I). On the other hand, four lncRNAs majorly indicated malignant phenotypes in HCC. CTD-2139B15.2 was positively related with ECOG PS (P = 0.0023) and grade (P = 0.0004) (Figure 4C and

| DISCUSSION
By applying multiple biostatistics methods, such as univariate Cox and 1000 times Cox LASSO regression with 10-fold CV and multivariate Cox analysis with 1000 times bootstrapping, on the overall lncRNA data of appropriately selected cases in TCGA, six lncRNAs, MIR22HG, CTC-297N7.9, CTD-2139B15.2, RP11-589N15.2, RP11-343N15.5, and RP11-479G22.8, were currently filtered out and identified as the independent prognosis predictors in HCCs. The prognostic significance, the prediction superiority, and the clinicopathological roles of the six-lncRNA signature were, respectively, confirmed by K-M analysis, time-dependent ROC curves, LncRNAs are considered to have greater potentiality than other hallmarks of cancers as the biomarkers of diagnosis and prognosis because of following unique advantages. (a) The expressions of lncRNAs show great divergence in different tissues, diseases and the disease progression stage, thus are more representative of disease characteristics. 26,27 (b) LncRNAs are noncoding RNAs and directly involve in various biological processes, thus the levels and functions are more closely associated with the development characteristics of diseases including cancers. [28][29][30][31] Therefore, more and more studies are performed to clarify the clinical significance of lncRNAs in cancers, including HCCs.
Although more and more lncRNAs have been identified involved in various diseases including cancers, the functions of most lncRNAs are still not well understood, and a large number of lncRNAs are awaiting further characterizations. Accordingly, it is popular to predict the functions of lncRNAs by GSEA, GO-MF, and lncRNA-mRNA coexpression analysis. 32,33 With these popular methods, the six lncRNAs were currently unraveled to potentially involve in multiple ontogenetic mechanisms, such as cell cycle, DNA repair, cell homeostasis, and stress response, via variety of functions including "ligase activity," "magnesium ion binding," cofactor/ chromatin/NAD binding, oxidoreductase/DNA-dependent ATPase activity, tubulin/nucleosome/microtubule binding, "microtubule motor activity," and "C-acyltransferase activity," which were regarded fundamental for transcription regulation and chromatin modification, and important for HCC development and progression. 2,30,31 On the one hand, MIR22HG and CTC-297N7.9 are found to predict better prognosis of HCCs among the six ln-cRNAs, and these results could be well supported by their negative relationships with the advanced CPPs revealed in the current study and are consistent with the previous studies of MIR22HG in HCCs and lung adenocarcinomas and CTC-297N7.9 in HCCs. 13,34,35 Functionally, we demonstrate the involvement of MIR22HG in cell cycle and DNA repair pathways, and thus present further bioinformatics evidence for the recent observations that MIR22HG prohibits the proliferation of liver cancer cells and inhibits cell cycle-related genes via the regulation of YBX1, Met, and P21 in lung cancer. 35  Previous reports showed that CTC-297N7.9 prohibited cancer development via the transmembrane protein, and our current study discloses that CTC-297N7.9 negatively regulates the pathways of E2F_TARGETS, G2M_CHECKPOINT, and MITOTIC_SPINDLE via involving in cofactor/chromatin/ NAD binding and oxidoreductase/DNA-dependent ATPase activity.
On the other hand, the other four lncRNAs, including CTD-2139B15.2, RP11-589N15.2, RP11-343N15.5, and RP11-479G22.8, are currently demonstrated to be poorer prognosis indicators for HCCs, and the reports on the prognostic roles of CTD-2139B15.2 in papillary thyroid cancer 37 and RP11-479G22.8 in lung adenocarcinoma 38 tally with the findings, whereas there is no previous report on the clinicopathological relevance of RP11-589N15.2 and RP11-343N15.5 in cancers. Moreover, the harmful roles of the four lncRNAs in HCCs are presently further illustrated by their positive relationships with the progressive CPPs. So far, the functional engagement of the four lncRNAs in cancers is unknown. For the first time, we uncovered the involvement of CTD-2139B15.2 in cell homeostasis and stress response through "ligase activity" and "magnesium ion binding" and RP11-343N15.5 in cell cycle progression via tubulin/nucleosome/microtubule binding and "microtubule motor activity," and disclosed the roles of RP11-589N15.2 in cell homeostasis and stress response and the "oxidoreductase activity," "cofactor binding," and "C-acyltransferase activity" of RP11-479G22.8 in HCCs. As for the molecular functions of RP11-589N15.2 and the biological processes that involved by RP11-479G22.8, there are no statistical findings in the present study based on the current dataset, and more biostatistics analysis and molecular studies are needed.
Comparing with other studies for the prognostic roles of lncRNAs in HCCs, our study exhibits several superiorities. In the initial step, it is more sensible to identify prognostic molecules from global lncRNAs, rather than starting from differential expression ones. On the study cohort, homogeneity of patients is greatly improved by excluding the specific cases might have distinct disease procedures, such as those diagnosed as cholangiocarcinoma, received neoadjuvant therapy, survived too short durations, and had distant metastasis or residual tumors. On the study method, the accuracy of the bioinformatics analysis is increased by integrating 1000 times LASSO regression and bootstrapping, and the clinical reasonability of the biomarker study is validated by the functional analysis combining with the CPPs of cancers and the physical status of patients. As for the prediction efficacy, our six-lncRNA signature was helpful to improve the prediction accuracy of traditional prognostic TNM stage and showed superiority than the two revealed four-lncRNA groups 13,15 based on the increase of the resulting AUC values in the analysis of the ROC curves. Additionally, the lncRNA SNHG20 and SERHL disclosed by Ma et al in the recently published four-lncRNA-signature study 17 were demonstrated prognostic insignificant (P = 0.1072 and 0.1304, respectively, in univariate Cox analysis) in the current cohort of patients carefully selected based on the clinicopathological characteristics.
Several limitations should be considered. First, we did not perform in vitro and in vivo experimental studies to confirm the prognostic role of the six-lncRNA signature in HCCs, which was deduced from online datasets using bioinformatics methods. Second, there is no other available dataset so far that could be used to present more external validations for the results of the current study. Further validations are awaited.
In summary, we uncover a six-lncRNA signature as the independent prognosis biomarker in HCCs by the comprehensive bioinformatics analysis combining the genetic profile and CPPs data in carefully selected cohort, confirm its' superior capability for resectable HCC prognosis prediction based on the ROC curves comparing with the other multiple-lncRNA combinations and the traditional CPPs, and reveal the regulation of cell cycle and stress response-related pathways via transcription and chromatin modifications as the potential functional roles of the six lncRNAs in resectable HCC progression. The results not only disclose the novel candidate biomarker for HCC outcome prediction, but also indicate the interesting topics for future studies on the underlying mechanisms of HCC progression and the potential functions of the six lncRNAs.