Transcriptome and DNA methylation analysis reveals molecular mechanisms underlying intrahepatic cholangiocarcinoma progression

Abstract Intrahepatic cholangiocarcinoma (iCCA) is an aggressive malignancy with increasing incidence. It has been suggested that DNA methylation drives cancer development. However, the molecular mechanisms underlying iCCA progression and the roles of DNA methylation still remain elusive. In this study, weighted correlation networks were constructed to identify gene modules and hub genes associated with the tumour stage. We identified 12 gene modules, two of which were significantly positively or negatively related to the tumour stage， respectively. Key hub genes SLC2A1, CDH3 and EFHD2 showed increased expression across the tumour stage and were correlated with poor survival, whereas decrease of FAM171A1, ONECUT1 and PHYHIPL was correlated with better survival. Pathway analysis revealed hedgehog pathway was activated in CDH3 up‐regulated tumours, and chromosome separation was elevated in tumours expressing high EFHD2. JAK‐STAT pathway was overrepresented in ONECUT1 down‐regulated tumours, whereas Rho GTPases‐formins signalling was activated in PHYHIPL down‐regulated tumours. Finally, significant negative associations between expression of EFHD2, PHYHIPL and promoter DNA methylation were detected, and alterations of DNA methylation were correlated with tumour survival. In summary, we identified key genes and pathways that may participate in progression of iCCA and proposed putative roles of DNA methylation in iCCA.


| INTRODUC TI ON
Cholangiocarcinoma (CCA) constitutes a heterogeneous group of malignancies arising from the biliary tract, characterized by late diagnosis and rapid progression. 1,2 According to the anatomic location, CCA is categorized into distal CCA (dCCA), perihilar CCA (pCCA) and intrahepatic CCA (iCCA), which accounts for 25% of CCA. 3,4 iCCA is the second most common primary liver cancer after hepatocellular carcinoma, comprising approximately 15% of all primary hepatic malignancies. 1 In general, the age-standardized incidence for iCCA worldwide has been steadily increasing over the past few decades. 1 iCCA is a highly aggressive malignancy whose 1-and 5-year overall survival rates are approximately 30% and 18%, respectively. 4 Complete surgical resection serves as the best treatment for longterm survival for iCCA patients. 4 For the patients to whom surgical resection is unamenable, the benefit of conventional chemotherapy and targeted therapy is limited. 5 As a consequence, the mortality rates of iCCA are universally increasing. 6 Despite recent applications of high-throughput methods in the study of iCCA, the molecular pathogenesis and progression of the tumour still remain elusive. 7 Therefore, it is urgent to advance our understanding of the molecular mechanisms driving progression of iCCA to improve patient welfare and outcome.
Weighted correlated network analysis, also known as weighted gene co-expression analysis (WGCNA), functions as a biological meaningful data reduction approach. 8 By relating modules to clinical traits, one can discover important gene modules with respect to clinical values. Intramodular genes with highest connectivity are considered critical hub genes, as they play central roles within the module.
This method provides great opportunities to reveal important genes which drive the development of cancers.
DNA methylation is an epigenetic mechanism related to gene expression. DNA methylation of CpG islands (CGIs), which are DNA sequences enriched in CpG dinucleotides located at the promoter region, is associated with stable gene repression. 9 In cancer, many tumour suppressor genes (TSGs) are silenced by DNA methylation. 10  Most transcriptome and DNA methylome analyses were performed on mixed CCA samples, which may not clarify the specific molecular profiles and underlying mechanisms of iCCA.
That is because the molecular landscapes, particularly somatic mutations, largely differ between iCCA and other subtypes of CCA (dCCA and pCCA). 15 In the current study, we constructed weighted correlated networks to identify key hub genes relevant to progression of iCCA. As a result, we identified CDH3, EFHD2 as tumour-promoting genes, and ONECUT1, PHYHIPL as tumour suppressing genes. We indicated their prognostic values in clinical practice. In addition, underlying signalling pathways, including hedgehog pathway, O-glycan biosynthesis, JAK-STAT pathway and Rho GTPases activating formins, were uncovered. Lastly, associations of key hub genes with promoter DNA hypermethylation were revealed.

| Data collection
The cholangiocarcinoma microarray data set was obtained from

| Data pre-processing
Different workflows were implemented to process two gene expression data sets. The limma R package 16 released from the Bioconductor software project (https://bioco nduct or.org/) was used to process the raw data from GSE89749 as suggested. Specifically, background correction using negative controls, quantile normalization, log2 transformation and filtering out probes with detection P-values of larger than 0.05 was performed consecutively. Probes were aggregated into genes by median for genes with multiple probes to obtain the gene-level expression values. As the data set consisted of two batches, the combat function of the sva R package was used to correct for batch effect after processing data from two batches separately. Samples other than iCCA were removed, resulting in a sample of 83 tumours. For weighted correlation network construction, 43 samples from tumour stages I and IV were selected. Underexpressed genes were filtered out by variance and only the top 1/5 variable genes (5502 in total) were kept. Sample hierarchical clustering tree, which implemented the average linkage method, was constructed to identify outlier and finally 1 sample was found and removed. For the GSE10 7943 data set, gene filtering by variance was used to select the top 1/5 variable genes as previously mentioned. Subsequently, normalization was implemented with variance-stabilizing transformation, using the DESeq2 R package. 17 Finally, outlier detection by hierarchical clustering and outlier filtering was carried out, resulting in an expression matrix containing 11 555 genes and 30 samples.

| Weighted correlation network construction
Weighted correlation network was constructed with the WGCNA R package to identify significant modules and hub genes related to iCCA progression as previously described. 18 Firstly, the adjacency matrix a i,j which measured the connection strength between two genes was calculated as follows: where x i and x j were vectors of gene expression values for genes i and j, respectively; β, also called power, was the soft threshold parameter, and Pearson correlation coefficient was used. The optimal β that resulted in approximate scale-free topology as measured by the scale-free topology fitting index R 2 was selected. In the present study, β = 16 was selected since this was the local optimum where the corresponding R 2 reached 0.9. To better define gene co-expression modules, topological overlap measure (TOM) was calculated on the basis of adjacency matrix as follows: where a denoted adjacency matrix as previously calculated and k denoted connectivity, that is, row sum of adjacencies. 19 Subsequently, TOM-based dissimilarity measure DistTOM ij was obtained as 1-TOM ij and used for average linkage hierarchical clustering with the cutreeDynamic algorithm, where minimal module size was set at 30. Close modules as measured by correlation of module eigengenes were merged with a dissimilarity threshold of 0.3.

| Identification of clinically significant modules and hub genes
Three key statistics were calculated for the identification of important modules and genes. First, module eigengene (ME) was defined as the first principal component of each module and served as an optimal summary of the gene expression profiles within a given module. The Pearson correlation coefficients between MEs and tumour stage were calculated to identify tumour progressionassociated modules. Second, a quantitative measure of module membership (MM), also known as eigengene-based connectivity (kME), was defined as the correlation between the ME of a given module and the expression profile of a gene. Finally, gene significance (GS) was calculated as the correlation between the gene expression and the clinical traits. Presumably, genes with higher GS and kME play more important roles in the given module and tumour progression. Therefore, genes with the absolute value of GS > 0.4 and kME > 0.4 in the modules were referred to as intramodular hub genes.

| Gene ontology (GO) enrichment analysis
Gene ontology enrichment analysis was carried out in select modules using the EnrichGO function in the clusterProfiler R package as suggested. 20 Three GO categories, including biological process (BP), cellular component (CC) and molecular function (MF), were involved in the analyses. The Benjamini-Hochberg (BH) method was implemented to adjust for multiple comparisons and the resulting adjusted P-values of < 0.05 represented statistically significant. For each category, the top 8 GO terms were retrieved.

| Network visualization
The gene co-expression networks of key modules were visualized with Cyotoscape 3.7.2. Network topological statistics were obtained using the Network Analyzer tool, and genes with the top 80 degree were selected for visualization. Hub genes were labelled in red, whereas other genes were labelled in module colours.

| Gene set enrichment analysis (GSEA)
Gene set enrichment analysis was performed to investigate the

| Identification of key DNA methylation alterations
The HumanMethylation450 BeadChip probes associated with hub genes were obtained according to the beadchip annotation. Gene expression profiles and beta values of DNA methylation were scaled, after which a permutation approach to linear regression was performed to study the relationship between gene expression and DNA methylation of the loci associated with the specified genes using the lmPerm R package with default parameters. The BH method was employed to correct for multiple comparisons and an adjusted P-value of < 0.05 represented statistically significant. Probes with an absolute value of coefficient of > 0.3 and an adjusted P-value of < 0.05 were retrieved.

| Survival and other statistical analysis
Survival analysis of GSE89803 and GSE10 7943 data sets was per- genes across the stages were carried out in a similar approach, that is, the gene expression profile was scaled and the k-sample permutation test was performed afterwards. A P-value of < 0.05 was considered as statistically significant. All the plots in this study except the coexpression network plots were generated by R 3.6.2.

| Construction of weighted correlation network in iCCA by WGCNA
Mircroarray gene expression profiles of normal and malignant bile duct samples coded as GSE89803 were obtained from GEO database. 21 83 iCCA samples were included for the study, among which 43 tumours staged I and IV were used for constructing gene coexpression network. 5502 genes with the largest variance (top 1/5) across samples were kept. One outlying sample was detected with sample hierarchical clustering and then removed ( Figure 1A). To determine the optimal power β that ensured scale-free topology of the resulting network, the relationship between either scale independence or mean connectivity and power was analysed, respectively, and β = 16 was chosen ( Figure 1B,C). Scale-free topology of the network was supported by the goodness of fitting a linear model to log-transformed connectivity k and frequency of k (p(k)), as was quantified by the fitting R 2 index ( Figure 1D). We then constructed weighted correlation network by applying average linkage hierarchical clustering algorithm to dissimilarity measure of TOM and identified 12 gene modules after combining closely related modules ( Figure 1E). The allocation of genes to each module is described in Table S1.

| Relating module eigengenes to tumour stage identified iCCA progression-associated modules
To identify clinically significant gene modules, especially those  Figure 2B). The turquoise and green modules were chosen for further investigation.
To understand the biological functions involved in turquoise and green modules, we examined enriched GO categories of them ( Figure 2C-F). The genes of turquoise module were mainly enriched in glycosylation-related processes, epidermis development ( Figure 2C), plasma membrane and cell-cell junction ( Figure 2D), whereas the genes of green module were mainly enriched in chromosome assembly-related processes, protein-DNA interaction, innate immune response ( Figure 2E), nucleosome and protein-DNA complex ( Figure 2F).

| SLC2A1, CDH3 and EFHD2 of turquoise module were positively related to progression of iCCA
Focusing the analysis on gene modules and their highly connected intramodular hub genes serves as a biologically meaningful data reduction scheme. 8 Additionally, these hub genes represent a small proportion of nodes with maximal information 18 ; hence, we focused on hub genes for the network analysis. In the study, intramodular hub genes were referred to as those with absolute value of GS with respect to tumour stage of > 0.4 and kME of > 0.4. Within the turquoise module, the majority of genes were positively correlated with the tumour stage, and genes with higher GS had larger module membership ( Figure 3A). Such significant correlations were also detected in the blue, brown, green-yellow, magenta, purple, red, salmon, tan and yellow modules, but the magnitude of correlation was much smaller (Figure. S1). 45 hub genes were identified in the turquoise module (Table S2). Expression of the hub genes, which was represented as PC1, increased across the tumour stage ( Figure 3B). We

| Identification of key biological processes and pathways of hub genes underlying iCCA progression
To investigate potential mechanism of the hub genes, we performed GSEA to examine enriched biological processes and signalling pathways for CDH3 and EFHD2 of the turquoise module. We found that hedgehog signalling pathway and collagen biosynthesis and modifying enzymes were overrepresented in CDH3 high expression samples ( Figure 5A,B). Indeed, hedgehog pathway was suggested to be participated in cancer development and invasiveness. 22 For EFHD2, chromosome separation and O-glycan biosynthesis were found to be enriched in EFHD2 high expression group ( Figure 5C,D). It is implicated that O-glycan biosynthesis plays a role in cancer development and invasion. 23 Taken together, these findings indicate that CDH3 and EFHD2 can activate these pathways to promote tumour progression.
We applied similar strategies to ONECUT1 and PHYHIPL of the green module. Blood vessel endothelial cell migration and JAK-STAT signalling pathway were overrepresented in ONECUT1 down-regulated samples ( Figure 6A,B), whereas cell cycle and Rho GTPases activating formins were enriched in PHYHIPL downregulated samples ( Figure 6C,D). Given that activation of these pathways were involved in tumour development, [24][25][26] these observations indicate that down-regulation of ONECUT1 and PHYHIPL can induce activation of these key pathways to promote iCCA progression.
Detailed statistics of GSEA can be found in Table S3.

| DNA methylation was involved in the regulation of key hub genes
We speculated dysregulation of the hub genes may be associated with aberrant DNA methylation in promoter regions. Thus, we examined the relationships between DNA methylation and mRNA expression and found that hypermethylation of promoter loci cg15026696 and cg06972969 were associated with decreased expression of EFHD2 ( Figure 5E) and PHYHIPL ( Figure 6E), respectively. To further investigate the clinical significance of these loci, we conducted survival analysis of the DNA methylation loci and found that the hypomethylation of cg15026696 ( Figure 5F) showed significant lower survival rates, whereas hypermethylation of cg06972969 ( Figure 6F) was correlated with significant lower survival rates.

| D ISCUSS I ON
Intrahepatic cholangiocarcinoma has become a global burden to public health because of its increasing incidence and poor prognosis. 1,4 Currently, there is little treatment option with proven benefit for the tumour at advanced stage. 5 Thus, it is urgently needed to explore the molecular mechanisms of iCCA progression and identify targets for therapeutics. In the present study, we systematically  30 EFHD2, whose function in malignancies remained unexplored, was reported to stimulate tumour invasion and metastasis. 31 These data indicate that these hub genes may participate in invasion and development of iCCA and highlight putative novel role of EFHD2 in iCCA, but functional experiment was needed to further elucidate their biological significance.
In contrast, the green module was inversely correlated with iCCA stage. It was enriched in chromatin and nucleosome organization.
Dysfunction of the hub genes of the module may stimulate iCCA invasion, as was indicated by the finding that overall expression of the hub genes decreased across the tumour stage. Among them, FAM171A1, ONCUT1 and PHYHIPL were identified as key hub genes.
FAM171A1, also known as astroprincin (APTN), was reported to overexpress in brain astrocytes and involve in regulation of cytoskeletal dynamics, hence the cell shape and growth of cancer cells. 32 Intriguingly, expression of FAM171A1 increased in triple-negative breast tumour (TNBC) but decreased in non-TNBC compared to normal tissue, 33 whereas its role and significance in iCCA or CCA remains unclear. ONCUT1 is a transcription factor that is enriched in the liver and stimulates liver-expressed genes. Previous study suggested that ONCUT1 was down-regulated in HCC and acted as tumour suppressing gene in cell line experiment. 34 Likewise, ONCUT1 expression was lost in pancreatic cancer cells, and its up-regulation resulted in a reduction of invasiveness. 35  and stomach tumours, and its presence is related to metastasis and poor progonosis. 40 It was reported that high levels of O-glycan Tn and sTn antigen promoted growth and metastasis of breast tumour in vivo. 41 In addition, T antigen regulated adhesion of tumour cells to the endothelium by galectin-3, driving metastasis. 40 We postulate that EFHD2 acts as tumour-promoting gene probably through inducing chromosome separation errors or O-glycan modification of key adhesion proteins and proteases that are associated with metastasis.
Collectively, we indicate hedgehog signalling pathway and collagen fibre organization are critical pathways for CDH3, whereas aberrant chromosome separation and O-glycan biosynthesis serve as mechanisms for EFHD2, regarding driving iCCA progression.
F I G U R E 5 Molecular mechanism associated with CDH3 and EFHD2 of turquoise module underlying progression of intrahepatic cholangiocarcinoma. (A and B) Gene set enrichment analysis (GSEA) revealed biological processes and pathways associated with CDH3 underlying progression of intrahepatic cholangiocarcinoma (iCCA). iCCA samples with high (top 2/5) or low (bottom 2/5) expression of the hub gene were assigned to two groups (high or low expression groups) and used for GSEA. Enrichment score (ES) on the y-axis measures the degree to which a given gene set is overrepresented in either of the two groups. Gene sets with a nominal P-value of < 0.05 were chose. (C and D) Gene set enrichment analysis (GSEA) revealed biological processes and pathways associated with EFHD2 underlying progression of iCCA. E, Negative correlation between DNA methylation probe cg15026696 and expression of EFHD2. F, Kaplan-Meier plots of GSE89749 showed overall survival rates for the high and low DNA methylation groups of cg15026696. For the correlation of gene expression and DNA methylation, the BH method was performed to adjust for multiple comparisons | 6385 PENG Et al.
Moreover, our findings revealed key pathways mediated by suppression of ONECUT1 and PHYHIPL for the tumour progression.
Blood vessel endothelial cell migration is involved in angiogenesis, which is required for tumour growth. 42 Activation of JAK-STAT signalling exerts effect on tumour survival, proliferation and invasion and has been recognized as drug targets in many cancers, including blood, breast, prostate and brain cancers. For instance, STAT3 activation acts as a crucial regulator of gliomagenesis by inducing angiogenesis, immunosuppression and tumour invasion. 25 We speculate that suppression of ONECUT1 promotes growth and invasion of iCCA by regulating angiogenesis or JAK-STAT pathway, but the de- Our data indicate inhibition of PHYHIPL suppresses iCCA invasion through influencing cell cycle activity or Rho GTPase-formins pathway. This is the first report that establishes the relationships of these pathways with ONECUT1 and PHYHIPL in iCCA. Apart from the limitations discussed, this study did not apply multivariate survival model to account for known effects, mainly due to insufficient samples in the data set. With the application of whole genome DNA methylation profiling techniques, it will become feasible in the future. In summary, our report characterized molecular signature related to iCCA progression identified novel genes that may participate in promoting or inhibiting invasion and development of the tumour. Putative biological processes, signalling pathways of these genes and DNA methylation relations were revealed. These findings provide essential insights in understanding the biology of iCCA development.

E TH I C A L A PPROVA L A N D CO N S E NT TO PA RTI CI PATE
Not applicable.

ACK N OWLED G EM ENTS
We would like to thank other members in the group for discussion and suggestions. We also thank National Library of Medicine for providing access to the GEO series data.

CO N FLI C T O F I NTE R E S T
The authors declare that they have no competing interests. Project administration (lead); Supervision (lead); Visualization (equal); Writing-original draft (equal); Writing-review & editing F I G U R E 6 Molecular mechanism associated with ONECUT1 and PHYHIPL of green module underlying progression of intrahepatic cholangiocarcinoma. A and B, Gene set enrichment analysis (GSEA) revealed biological processes and pathways associated with ONECUT1 underlying progression of intrahepatic cholangiocarcinoma (iCCA). iCCA samples with high (top 2/5) or low (bottom 2/5) expression of the hub gene were assigned to two groups (high or low expression groups) and used for GSEA. Enrichment score (ES) on the y-axis measures the degree to which a given gene set is overrepresented in either of the two groups. Gene sets with a nominal P-value of < 0.05 were chose. C and D, Gene set enrichment analysis (GSEA) revealed biological processes and pathways associated with PHYHIPL underlying progression of iCCA. E, Negative correlation between DNA methylation probe cg06972969 and expression of PHYHIPL. F, Kaplan-Meier plots of GSE89749 showed overall survival rates for the high and low DNA methylation groups of cg06972969. For the correlation of gene expression and DNA methylation, the BH method was performed to adjust for multiple comparisons (lead). Xinyi Sheng: Data curation (supporting); Formal analysis (supporting); Methodology (supporting); Visualization (supporting); Writing-review & editing (supporting). Hongqiang Gao: Data curation (supporting); Visualization (supporting); Writing-review & editing (supporting).

DATA AVA I L A B I L I T Y S TAT E M E N T
All data required for evaluating the conclusions are present in the paper and supplementary materials. The data sets used the study can be accessed from GEO database with accession code GSE89749, GSE89803 and GSE10 7943.