Development and validation of robust metabolism‐related gene signature in the prognostic prediction of hepatocellular carcinoma

Abstract Hepatocellular carcinoma (HCC) is one of the most common malignant tumours worldwide. Given metabolic reprogramming in tumours was a crucial hallmark, several studies have demonstrated its value in the diagnostics and surveillance of malignant tumours. The present study aimed to identify a cluster of metabolism‐related genes to construct a prediction model for the prognosis of HCC. Multiple cohorts of HCC cases (466 cases) from public datasets were included in the present analysis. (GEO cohort) After identifying a list of metabolism‐related genes associated with prognosis, a risk score based on metabolism‐related genes was formulated via the LASSO‐Cox and LASSO‐pcvl algorithms. According to the risk score, patients were stratified into low‐ and high‐risk groups, and further analysis and validation were accordingly conducted. The results revealed that high‐risk patients had a significantly worse 5‐year overall survival (OS) than low‐risk patients in the GEO cohort. (30.0% vs. 57.8%; hazard ratio [HR], 0.411; 95% confidence interval [95% CI], 0.302–0.651; p < 0.001) This observation was confirmed in the external TCGA‐LIHC cohort. (34.5% vs. 54.4%; HR 0.452; 95% CI, 0.299–0.681; p < 0.001) To promote the predictive ability of the model, risk score, age, gender and tumour stage were integrated into a nomogram. According to the results of receiver operating characteristic curves and decision curves analysis, the nomogram score possessed a superior predictive ability than conventional factors, which indicate that the risk score combined with clinicopathological features was able to achieve a robust prediction for OS and improve the individualized clinical decision making of HCC patients. In conclusion, the metabolic genes related to OS were identified and developed a metabolism‐based predictive model for HCC. Through a series of bioinformatics and statistical analyses, the predictive ability of the model was approved.


| INTRODUC TI ON
The incidence and mortality rates of primary liver cancer remain the sixth most common and third leading cause of cancer deaths worldwide in 2020, respectively, among which hepatocellular carcinoma (HCC) comprised 75%-85% of the reported pathologic type of primary liver cancer. 1 Since HCC generally progresses asymptomatically, most patients are diagnosed with intermediate or even advanced stages of which the radical therapies are impossible to be performed, and their long-term prognosis are poor. 2 According to the American Joint Committee on Cancer (AJCC) tumour-nodemetastasis (TNM) staging system, liver transplantation, hepatectomy, interventional therapy and systemic therapy have achieved a variety of therapeutic effects for patients diagnosed with corresponding stages and prolonged survival periods. 3 However, several patients with a similar tumour stage finally obtain dramatically different clinical outcomes, indicating that the TNM staging system is still need to be improved. Those high-risk HCC patients with potentially poor outcomes must be monitored and recommended to timely and effective treatments to prolong prognosis and improve qualities of life. 4 Therefore, there is an urgent need for effective prediction biomarkers to accurately assess the prognosis of HCC patients.
Additionally, alpha-fetoprotein (AFP), des-gamma-carboxy prothrombin (DCP) and carbohydrate antigen (CA) 19-9 demonstrated promising values in clinical prediction for HCC patients prognosis who underwent resection or transcatheter arterial chemoembolization. 5,6 Presently, with the developments in gene chips and high-throughput sequencing, gene biomarkers including microRNA, circular RNA and mRNA were proven their increasing importance in terms of the application for prognosis of HCC. 7 Moreover, recently, there were many pieces of studies developing prognostic stratifications that were able to classify HCC patients into different risk groups based on the multigene expressions. [8][9][10] Unfortunately, these studies were limited by the sample size, mainly based on the cancer genome atlas (TCGA) dataset and failed to perform an external validation to estimate the possibility for optimism and overfitting in model performance. 11 On the other hand, cellular metabolic reprogramming is a critical hallmark during carcinogenic progression and is dramatically associated with the tumorigenesis. 12,13 As early as 1920s, Warburg et al. observed that an increment of glucose was consumed by tumour tissues compared to normal tissues. 14 In addition, excessive activated anaerobic glycolysis, as well as insufficient aerobic respiration, were regarded as the specific features of tumour cells, 15 which were also observed in HCC tissues. 16 Apart from the metabolism of glucose, lipid, amino acid, nucleotide, and other metabolite metabolism also demonstrated varying degrees of alternations in the procedure of HCC. 17,18 With the increasing applications of bioinformatics analysis in the diagnostic and prognostic predictions of carcinoma, several researchers have associated the metabolome with the genome, which allowed broad and precise metabolite profiles to be depicted. 19,20 Accordingly, it is necessary to take metabolomics, genomics and transcriptomics into comprehensive consideration, which would help to better understand the biological behaviour of malignant tumours.
Previous studies have initially proven the utility for metabolismrelated genes to generate a predictive model for colorectal cancer, bladder cancer and breast cancer and achieved promisingly predictive abilities. [21][22][23] Whereas, there is little research using the transcriptomics profiles of metabolism-related genes to evaluate the outcomes in predicting HCC patients. Therefore, the present study aims to identify a group of the metabolism-related genetic cluster to construct a predictive model for HCC patients.

| Collection of data
The gene expression profiling of HCC was downloaded from the Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/ geo/) database. The detailed inclusion criteria for candidate datasets were following: Human gene expression profile; hepatocellular carcionoma specimen; total count of sample volume ≥60; availability of follow-up information (overall survival, [OS]) and related clinical data.
Finally, a total of four datasets (GSE14520, GSE54236, GSE10141 and GSE116174) were included in the present study (GEO cohort

| Data processing
The mRNA microarray datasets from the GEO database were normalized before being downloaded. Probe identifications of gene value. All gene expression values were processed by log2, moreover the average value was calculated as the final expression value if multiple probes corresponded to the same gene symbol. Given there was a need for a combination of multiple datasets, the Empirical Bayes method by 'sva' package was applied to diminish the batch effects among datasets after merging. 24 Additionally, the same method was applied to correct the batch effect due to different omics platforms between GEO cohort and TCGA-LIHC cohort before external validation. ( Figure S1).

| Extraction of metabolism-related genes
In the current study, the whole metabolism-related genes were derived from Kyoto Encyclopaedia of Genes and Genomes (KEGG) metabolism-related genesets of 'c2.cp.kegg.v7.0.symbols.gmt' (http://softw are.broad insti tute.org/gsea/downl oads.jsp/). After intersecting the whole geneset of samples with the metabolic gene sets, 482 metabolism-related genes were identified in the transcriptome data. The expression levels of these genes was extracted from each sample to undergo further analysis.

| Construction and validation of the metabolism-related signature
In the GEO cohort, univariate Cox regression analysis was performed via 'survival' package to screen out the metabolic genes that were correlated with OS (p < 0.005). To address the impact of overfitting, the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm was performed to select potential genes to construct metabolic gene signature by using 'glmnet' and 'biospear' packages. 25,26 The values of penalty parameter λ were determined by 100-fold cross-validations. In order to maximally reduce the false discover rate (FDR), the methods of cross-validated log-likelihood (LASSO-Cox) and penalty cross-validated likelihood (LASSO-pcvl) were applied, and the final biomarkers were determined by the common parts of both methods. Finally, the common biomarkers filtered out by LASSO-Cox and LASSO-pcvl algorithms were used to develop a formula comprising the gene expression levels (Expr) weighted by the corresponding coefficients: According to the optimal cut-off value of the GEO cohort calculated via 'survminer' package, GEO cohort was stratified into high-and lowrisk groups, respectively. Subsequently, the multivariate Cox regression analyses were applied to determine whether the risk score was an independent prognostic factor for HCC. Additionally, calibration curves, receiver operating characteristic (ROC) analysis and decision curve analysis (DCA) were employed to evaluate the accuracy and clinical utility of the model for OS via the 'ROCR' and 'rms' packages. 27

| Bioinformatic analysis
Based on Hallmarks geneset of 'h.all.v7.0.symbols.gmt', Gene set enrichment analysis (GSEA) was applied to identify the significantly enriched pathways between the high-and low-risk groups via the 'DESeq2' package. Metascape tool (http://metas cape. org/) was carried out to explore the functional annotations of the metabolism-related genes selected by the LASSO-Cox and LASSO-pcvl algorithms. 28 Nevertheless, the associations between the risk score and infiltrated immune cells were investigated using CIBERSORTx and gene-centric single sample GSEA (ssGSEA) algorithm via the 'IOBR' and 'GSVA' packages, respectively. The CIBERSORTx algorithm was able to identify 22 types of human immune cell phenotypes according to the gene expression data, while, the ssGSEA algorithm was applied to further explore the phenotypes of the immune cell subsets. 29,30

| Statistical analysis
All statistical analyses were performed via using R software version 4.2.2 (http://www.r-proje ct.org). Categorical variables were compared by chi-square test, and continuous variables were evaluated with the Wilcoxon and Kruskal-Wallis tests. OS was defined as the length of time from the date of diagnosis to death from any cause.
The survival curves between the two groups were estimated using the Kaplan-Meier method (K-M curves), and significant differences were validated by using the log-rank test. All p values were based on two-sided statistical tests, and p < 0.050 was considered statistically significant unless specified.

| Patient characteristics and the construction of the risk score
According to the Figure 1, 324 cases of normal tissue adjacent to cancer, three cases that lacked survival information were excluded initially from the enrolled four datasets. Finally, a total of 466 cases HCC tissue were included in the followed analysis, and the characteristics of included patients were demonstrated in Table 1. In the GEO cohort, the univariate Cox regression analysis was conducted to screen out metabolic genes related to OS. As a result, 38 metabolic genes (p < 0.005) were filtered out to achieve further analysis. Moreover, strong correlations among these genes were observed in the GEO cohort. (Figure 2A Subsequently, the risk scores for each case were calculated based on the formula mentioned above, and the optimal cut-off value of 0.880 was determined, which was able to generate the largest survival difference between the high-and low-risk groups. ( Figure S2) In the GEO cohort, the distribution of risk scores and the survival status of patients were displayed. ( Figure 2D).

| Validation and evaluation of the metabolic gene signature
According to the result from Figure 3A Table 3).
After that, the predictive abilities of the developed risk score for

| Correlation between the risk score and clinical features
To further explore the prognostic values of the developed signature in different populations, the entire cohort was sorted into several sub-group populations based on collected clinical features to explore K-M curves in terms of OS between the high-and low-risk groups.
F I G U R E 1 Flow diagram of construction and validation for the metabolism-related gene signature model. LASSO, least absolute shrinkage and selection operator; GSEA, gene set enrichment analysis.
As results, sub-group analyses revealed, male or female, above or below 60 years, patients from the Asia and tumour stage I/II or III/ IV, the high-risk population were significantly associated with poor prognosis. (all p < 0.050; Figure 4A-F,H) Interestingly, HCC patients who were race of white were unable to stratify between the highand low-risk population according to current signature. (p > 0.050; Figure 4G) Moreover, cases based on the entire cohort were used to explore the correlations between the risk score and clinicopathological characteristics of HCC patients. (Figure 5A-F) The result demonstrated that higher risk scores significantly corresponded to advanced tumour stage, younger age and higher AFP level. (all p < 0.050; Figure 5C,D,F) Whereas, no differences with respect to the risk score among hepatitis B cirus (HBV) status (p = 0.580; Figure 5A), different genders (p = 0.110; Figure 5B) and different ethnicities. (p = 0.120; Figure 5E).

| Correlation between the risk score and tumour microenvironment
Subsequently, the associations between the risk score and   Additionally, when further investigating the sub-population of in- were statistically more enriched in the samples derived from the high-risk group. (all p < 0.050; Figure 5H).

| Validations among pan-cancer datasets
To evaluate the specificity of the metabolism-related risk score        Figure 7C) Finally, DCA was conducted to compare the clinical net benefits among the nomogram, age, conventional staging system and metabolism-related risk score. As result, the nomogram possessed improved net benefits across a wider scale of threshold probabilities for predicting 2-, 3-and 5-year OS than those of age, conventional staging system and metabolism-related risk score. (Figure 7D).

| Exploration of biological function
To explore the biological pathways underlay the signatures, first, the GSEA based on the KEGG was applied to determine

| DISCUSS ION
Given the high heterogeneity of HCC, patients with similar tumour stages according to the TNM staging system often result in dramatically different survival outcomes, which indicated that the TNM staging system had reached its limitation in the prediction of the prognosis of HCC patients. Even though much effort was paid to explorations of optimal prognostic models utilizing molecular signatures, few biomarkers, including AFP, DCP and CA19-9, were developed with trustworthy predictive abilities. 31  PLCG2 was an unexpected biomarker found in the present study and its function was unclear until now. Given that PLCG1 recently has been proven to be a pro-tumorigenesis gene in HCC by Seo 47 Therefore, the above tumour-infiltrating immune cell patterns might support the explanation of the outcomes that the low-risk group had a better prognosis compared to the highrisk group in HCC patients.
Additionally, the biological processes with respect to the lowand high-risk groups were investigated via KEGG and GSEA analysis.
The results indicated that a total of 5 pathways were significantly enriched in the high-risk group, of which the 'tyrosine metabolism' pathway was regarded as critical cyclins and cyclin-dependent kinases in the procession of tumorigenesis. 48 Accordingly, molecularly targeted drugs, including sorafenib and lenvatinib, have been developed to block this pathway and were utilized as first-line therapy for advanced HCC. 49,50 Moreover, the 'IL-17 signaling pathway' was enriched in the low-risk group mostly and it revealed that the inflammation played a vital role in the tumorigenesis in liver tissue from cirrhosis to HCC. 51  Whereas, the current study suffered from several limitations.
First, the enrolled research datasets were derived from the public database, which caused it difficult to collect complete clinical information for each patient. Regardless, since this study involved multiple series of sequence datasets, the batch effect was inevitable despite applying the LASSO-Cox and LASSO-pcvl algorithms to eliminate it. Therefore, the results should be interpreted with caution. Last but not least, given the retrospectively designed, the potential bias correlated with unbalanced clinicopathological profiles was unable to be ignored. Prospective, large-scale cohort were urgently required to validate the prognostic value of the present developed metabolism-related gene signature.

| CON CLUS ION
In the present study, for the first time, a list of metabolic genes related to OS was identified and developed a metabolism-related risk score for HCC patients. Undergoing a series of bioinformatics and statistical analyses, the predictive ability of the signature was approved. It was believed that the current signature would induce the discovery of a novel landscape for the therapeutic strategy of the HCC in the future.

FU N D I N G I N FO R M ATI O N
This work was funded by National Natural Science Foundation of China (82103566).

CO N FLI C T O F I NTER E S T S TATEM ENT
The authors have no conflict of interest to declare.

DATA AVA I L A B I L I T Y S TAT E M E N T
The datasets generated for this study can be found in the GEO database (GSE14520, GSE54236, GSE10141 and GSE116174; https:// www.ncbi.nlm.nih.gov/geo/), and UCSC Xena website (https://gdc. xenah ubs.net).