Transcriptome analysis identifies metallothionein as biomarkers to predict recurrence in hepatocellular cacinoma

Abstract Background Liver cancer is the fifth most common cancer, and hepatocellular carcinoma (HCC) is the major liver tumor type seen in adults. HCC is usually caused by chronic liver disease such as hepatitis B virus or hepatitis C virus infection. One of the promising treatments for HCC is liver transplantation, in which a diseased liver is replaced with a healthy liver from another person. However, recurrence of HCC after surgery is a significant problem. Therefore, it is important to discover reliable cellular biomarkers that can predict recurrence in HCC. Methods We analyzed previously published HCC RNA‐Seq data that includes 21 paired tumor and normal samples, in which nine tumors were recurrent after orthotopic liver transplantation and 12 were nonrecurrent tumors with their paired normal samples. We used both the reference genome and de novo transcriptome assembly based analyses to identify differentially expressed genes (DEG) and used RandomForest to discover biomarkers. Results We obtained 398 DEG using the Reference approach and 412 DEG using de novo assembly approach. Among these DEG, 258 genes were identified by both approaches. We further identified 30 biomarkers that could predict the recurrence. We used another independent HCC study that includes 50 patients normal and tumor samples. By using these 30 biomarkers, the prediction accuracy was 100% for normal condition and 98% for tumor condition. A group of Metallothionein was specifically discovered as biomarkers in both reference and de novo assembly approaches. Conclusion We identified a group of Metallothionein genes as biomarkers to predict recurrence. The metallothionein genes were all down‐regulated in tumor samples, suggesting that low metallothionein expression may be a promoter of tumor growth. In addition, using de novo assembly identified some unique biomarkers, further confirmed the necessity of conducting a de novo assembly in human cancer study.


| INTRODUCTION
Liver cancer is the fifth most common cancer, and the third leading cause of cancer-related death worldwide. Hepatocellular carcinoma (HCC) is the major liver tumor type seen in adults (Bosch, Ribes, Díaz, & Cléries, 2004;Shibata & Aburatani, 2014;Thomas et al., 2010). HCC is usually caused by chronic liver disease such as hepatitis B virus (HBV) or hepatitis C virus (HCV) infection, which accounts for 75%-80% of the cases (Arzumanyan, Reis, & Feitelson, 2013;Bosch et al., 2004). Abuse of alcohol and exposure to aflatoxin are also risk factors for HCC.
The treatment for HCC includes liver resection, liver transplantation, chemotherapy, and radiation. Liver transplantation is one of the promising treatments, in which a diseased liver is replaced with a healthy liver from another person. However, recurrence of HCC after surgery is a significant problem. It has been reported that the recurrence after liver transplantation ranges from 6% to 40% (Cheng et al., 2011;Marsh et al., 1997;Shimoda et al., 2004). Although, many studies have attempted to identify biomarkers, in order to predict recurrence in patients with HCC, the early detection of recurrence still remains challenge. The current biomarkers for HCC are mostly serum markers, which show low sensitivity (Tsuchiya et al., 2015). Three recent studies identified some novel markers. One was using DNA markers from urine, but the study was based on only 10 individual patients and did not show wide applicability (Hann et al., 2017); a meta-analysis showed a possibility of using circRNA as biomarkers, but they did not talk about the recurrence problem in HCC (M. Wang et al., 2018); marker for the prediction of sorafenib response has been proposed, but its relevance to the recurrence problem is unclear (Kim et al., 2018). Therefore, it is important to discover reliable cellular biomarkers that can predict recurrence in HCC.
Next-generation sequencing, which identifies genomic alterations and somatic mutations at the nucleotide base level, is providing insights into the etiology of cancer and corresponding diagnostics (Chin et al., 2012;Davey et al., 2011;Meyerson, Gabriel, & Getz, 2010;Schuster, 2008). Scientists have started to sequence patients' DNA or mRNA to obtain their genome or transcriptome, but gene expression is usually measured based on the annotation of the human reference genome. Recently, it has been suggested that de novo assembly is valuable even when a reference genome is available (S. Wang & Gribskov, 2017). Although the human reference genome is available, in the case of tumors, where mutation and chromosomal rearrangement may have altered gene/transcript structure, incorporation of de novo assembly is even more important.
In this project, we analyzed previously published HCC RNA-Seq data (Xue et al., 2015) that includes 21 paired tumor and normal samples, in which nine tumors were recurrent after orthotopic liver transplantation and 12 were nonrecurrent tumors with their paired normal samples. In this study, we compared the results of reference and de novo transcriptome assembly based analyses, in order to identify biomarkers that predict recurrence of tumors in HCC.

| Data description
The RNA-seq data were directly downloaded from the NCBI sequence read archive (SRP040998). There were nine recurrent tumor with paired adjacent normal samples, and 12 nonrecurrent tumor with paired adjacent normal samples. In total, there were 9 × 2 + 12 × 2 = 42 samples. Details of library construction and patient information are described in Xue et al. (2015).

| Quality control of raw data
Adapter sequences and low-quality portions of reads were removed using Trimmomatic (version 0.32) (Bolger, Lohse, & Usadel, 2014). Adapters and low-quality read regions with average quality below 13 (phred score) over a four base window were removed. Low-quality sequences at the 5′ and 3′ end, with quality score <10 were also removed. Only reads with a trimmed length over 30 bases were used in further analysis. The number of paired-end reads in each sample is shown in Table S1.

| Transcriptome assembly of cleaned data
We pooled all left cleaned reads, right cleaned reads, and unpaired cleaned reads from all 42 samples together for de novo transcriptome assembly. Cleaned reads were assembled using Trinity (Grabherr et al., 2011) (version 2.0.6) with the parameters recommended by the authors.

| Differential expression gene analysis
The DeSeq2 package (Love, Huber, & Anders, 2014) was used to determine differential expression. The integrated statistical model is: where i is group (recurrent or nonrecurrent), j is condition (normal or tumor), k is individual (patient). In this model, we integrated the sample type (tumor or normal) and recurrence type (yes or no), which identified genes that were both differentially expressed in these conditions. Only genes with observed counts >100 (summed over all conditions) were analyzed.

| Blast search
We compared the de novo assembly to the human reference genome (GRCH38) using BlastN with default settings (Blast version 2.2.29+, National Center for Biotechnology Information, National Library of Medicine, National Institues of Health, Bethesda, MD, USA). We filtered hits by two criteria: identity score ≥95%; and aligned length ≥100 bases.

| Biomarker identification and confirmation
The RandomForest package (Liaw & Wiener, 2002) was used to identify biomarkers from recurrent and nonrecurrent patients' gene expression levels. Another independent data set was downloaded from the NCBI sequence read archive (SRP068976) for use as confirmation data, to predict the patient outcome using the biomarkers identified in the RandomForest analysis. The confirmation data included 50 patients paired normal and tumor RNA-Seq data. Details of library construction and patient information are described in Liu et al. (2016).

| De novo transcriptome assembly
We pooled all patient reads together to assemble the transcriptome using the Trinity program (k-mer = 25). In total, there were 1,036,270 predicted transcripts in the assembly with minimum length 224 bp and maximum length 31,736 bases. Trinity labels predicted transcripts systematically with designated form such as TR i |c j _g k _i l (e.g., TR 101 |c 0 _g 2 _i 2 ), where i, j, k, and l are integers indicate the transcripts, component, group, and isoform, respectively. We determined that sequences with the same component (e.g., c 0 , c 1 ) but different groups such as TR 1 |c 0 _g 1 and TR 1 |c 0 _g 2 , usually match the same gene. Therefore, we used transcripts number for example, TR 1 as the gene unit. The number of genes when pooled at this level in de novo transcriptome assembly was 797,713, substantially more than the number of genes annotated in the human reference genome. This is possibly due to two effects. (a) many of the predicted transcripts are similar or duplicated; (b) many of them are expressed at low levels which leads to incomplete transcripts.

| Differentially expressed genes on recurrent/nonrecurrent tumor analysis
We analyzed nine recurrent tumors (after orthotopic liver transplantation) and 12 nonrecurrent tumors, each with a paired normal sample. We did analysis using two approaches, (a) all samples were analyzed with respect to the human Reference genome; (b) all samples were analyzed with respect to de novo assembly (Trinity). Thus, we obtained two gene expression profiles (one using the reference genome, the other using the assembly), and two differentially expressed genes (DEG) lists (see Materials and Methods2 for details).
First, we used the gene expression level to do a principal component analysis. Gene counts were log 10 transformed. All normal samples clustered closely, while the tumor samples were distributed widely in both the Reference and assembly cases (Figure 1a,b). However, data from one patient (both normal and tumor) showed very large reciprocal deviations from the expected clusters (Figure 1a,b), suggesting that the tumor and normal samples may be mislabeled. If this is the case, it would cause large errors in estimates of variance, suggesting this sample should be omitted. Without this patient, the paired normal and tumor samples, were better separated (Figure 1c,d). In order to gain more confidence in identifying DEG, we excluded this patient from all analysis.
DEG were identified using the DeSeq2 package with the integrated model where takes group (recurrent or nonrecurrent) and condition (normal or tumor) into account. The significance level was defined as a false discovery rate < 0.0001, and log 2 foldchange (log 2 FC) larger than ±3 (i.e., eightfold change). In total, we obtained 398 DEG using the Reference approach (Table S2) and 412 DEG using de novo assembly approach (Table S3).
We further compared the DEG between these two methods and found that 258 DEG were identified by both approaches.
After identifying the DEG, we first checked DEG expression. In order to provide a more straightforward and detailed perspective on gene expression, up-and down-regulated genes were displayed as a heatmap. We chose the top 100 DEG (for a better viewing) exhibiting the largest fold change and used hierarchical clustering with Euclidean distance and complete linkage method. Gene counts were log 10 transformed and normalized as Z-score. From the heatmap, there were clearly two clusters ( Figure 2); all 20 normal samples were in one cluster and the 20 tumor samples in another cluster. The tumor and normal samples separated very well in both the Reference and de novo assembly (Figure 2), which gives the confidence that these DEG express differentially between tumor and normal samples.

| Comparison with known cancer genes
In order to validate the involvement of the identified genes in cancer etiology, we compared the DEG to two cancer gene databases: tumor suppressor genes (Zhao, Kim, Mitra, Zhao, & Zhao, 2016;Zhao, Sun, & Zhao, 2013) and oncogenes (Y. Liu, Sun, & Zhao, 2017). We combined the two databases to produce a list of 1616 cancer oncogenes and tumor suppressor genes. We refer this combined database as known cancer genes and it serves as an internal positive control because we expect some of these known cancer genes to be identified in the analysis. We matched our DEG to this list and found 41 known cancer genes in Reference approach and 39 known cancer genes in de novo assembly approach. Twentytwo known cancer genes were found using both methods ( Table 1). Some of these cancer genes have previously identified in liver cancer. For example, the overexpression of insulin like growth factor 2 was found to be associated with HCC (Sedlaczek et al., 2003). This gives high confidence that the DEG represent genes involving in regulating pathways in HCC.

| Biomarker identification and confirmation
We used RandomForest (Liaw & Wiener, 2002), a decision tree-based classification method, to identify biomarkers. A decision tree uses a tree-like graph, which each branch represents a "test" on an attribute (e.g., whether a gene turned on or not, or if the expression level >20), and each leaf node represents the outcome of the test, usually it is a class label (e.g., "Yes" or "No," "tumor" or "nontumor"). RandomForest builds a forest of decision trees to make classifications and rank the importance of attributes (e.g., genes).
In our analysis, we splitted the 40 patients data into two data sets. One set was the training data to train the decision F I G U R E 2 Heatmaps of top 100 DEG with largest fold change. The hierarchical clusters were based on Z-score. (a) Heatmap of top 100 DEG identified using Reference approach (b) Heatmap of 100 DEG identified using de novo assembly (Trinity). DEG, differentially expressed gene tree; the other used as validation data. We used 80% of original data as training data, after training, we used the trees to predict the patient's condition (normal, recurrent tumor, or nonrecurrent tumor) in the validation data set (20% of original data) and compared the prediction with the patient's real condition. If the accuracy was over 80%, we kept the trees and listed the top 30 important genes in the trees according to the importance plot (data not shown). The top 30 genes for reference and de novo assembly approaches are listed in Table 2.
Then we used these top 30 genes (Table 2) to predict the patient condition in the confirmation data set, which had 50 HCC patients' tumor and normal samples. With these 30 biomarkers, the accuracy for predicting the normal and tumor condition was 100% and 98%, respectively, suggesting these genes might be used for potential biomarkers to predict HCC.
Interestingly, we identified a group of Metallothionein genes as biomarkers (down-regulated in tumor samples), including metallothionein 1E,1F,1G,1H,1J,1M,1X (Table 3). Metallothioneins (MT), are a groups of cysteine-rich, low molecular weight proteins that bind to heavy metals. Their major function is protection against DNA damage, oxidative stress, and apoptosis, and they play an important role in transcription factor regulation. Therefore, defects in MT expression may lead to malignant transformation of cells and ultimately cancer. It has previously been reported that metallothionein is associated with tumors (Arriaga, Bravo, Mordoh, & Bianchini, 2017;Cherian, Jayasurya, & Bay, 2003;Han et al., 2013;Zheng et al., 2017). Here, MT were all down-regulated in tumor samples, suggesting that low MT expression may be a promoter of tumor growth.

| DISCUSSION
In this research, we used both the Reference and de novo assembly approaches to identify genes that could be used as biomarkers to predict recurrence in HCC. We analyzed one RNA-Seq dataset that with the recurrent tumors after orthotopic liver transplantation (and their paired normal samples) and the nonrecurrent tumor after orthotopic liver transplantation (and their paired normal samples). We did both de novo transcriptome assembly and reference-based analysis because through our previous research, we discovered that de novo assembly is valuable even when a reference genome available (S. Wang & Gribskov, 2017). And we indeed identified some unique and interesting biomarkers that were not showed in reference method. For example, CLEC4M, a protein encodes a transmembrane receptor and expressed in the endothelial cells of the lymph nodes and liver, together with CD209, mediate transinfection of liver cells by HCV (Cormier et al., 2004). Another example is PLIN2, belonging to the perilipin family, members of intracellular lipid storage droplets. This protein is found in hepatocytes in alcoholic liver cirrhosis, suggesting that it may serve as a marker of lipid accumulation in liver diseases (Graffmann, Ring, Kawala, Wruck, & Ncube, 2016). And CD5L, a key regulator of lipid synthesis, was also identified as a possible marker in liver disease (Gangadharan, Antrobus, Dwek, & Zitzmann, 2007). Therefore, these results further confirmed the necessity of conducting a de novo assembly. In addition to some unique biomarkers identified in de novo method, it was also very interesting that we identified some long noncoding RNA in reference method. Since long noncoding RNA plays important roles in regulating gene expression, these long-noncoding RNA may be promising biomarkers in HCC diagnostics. However, because we used BLAST program to match the assembled Trinity transcripts to known cDNA gene file, long noncoding RNA was not in the cDNA gene file, therefore, no long noncoding RNA identified in de novo assembly. Furthermore, lower expression of metallthionein protein in HCC tumor has been found before (Cherian et al., 2003), but through our analysis, we systematically pointed out that these genes may use as biomarkers in HCC. However, we recommend more analyses and molecular experiments are needed to confirm the utility of these biomarkers.
In terms of de novo transcriptome assembly programs, it was suggested that SOAPdenovo-trans and Trinity were the best in case of Arabidopsis study (S. Wang & Gribskov, 2017). In this study, we used Trinity, but we found that Trinity produced many redundant or duplicated transcripts when compared with human reference gene annotation. Therefore it may be advantageous using more transcriptome assembly programs in de novo assembly. And for the bioinformatics analysis, DEG were usually discovered by comparing two conditions at one time. But in our analysis, we compared four conditions simultaneously, taking into account the group (recurrent or nonrecurrent) and condition (normal or tumor) information into the integrated statistical model, therefore improves the accuracy of identifying the significant DEG.

ACKNOWLEDGMENTS
This work was funded by the National Natural Science Foundation of China (31741044, 31800781) and China Postdoctoral Science Foundation (2018M631198).

CONFLICT OF INTEREST
The authors declare no conflict of interest.