Identification of key lncRNAs in the tumorigenesis of intraductal pancreatic mucinous neoplasm by coexpression network analysis

Abstract Intraductal papillary mucinous neoplasm (IPMN) is an intraepithelial precancerous lesion of pancreatic ductal adenocarcinoma (PDAC) that progresses from adenoma to carcinoma, and long noncoding RNAs (lncRNA) might be involved in the tumorigenesis. In this study, we obtained the expression profiles of more than 4000 lncRNAs by probe reannotation of a microarray dataset. As a correlation network‐based systems biology method, weighted gene coexpression network analysis (WGCNA) was used to find clusters of highly correlated lncRNAs in the tumorigenesis of IPMN, which covered four stepwise stages from normal main pancreatic duct to invasive IPMN. In the most relevant module (R2 = −0.75 and P = 5E‐05), three hub lncRNAs were identified (HAND2‐AS1, CTD‐2033D15.2, and lncRNA‐TFG). HAND2‐AS1 and CTD‐2033D15.2 were negatively correlated with the tumorigenesis (P in one‐way ANOVA test = 1.45E‐07 and 1.39E‐0.5), while lncRNA‐TFG were positively correlated with the tumorigenesis (P = 3.99E‐08). The validation set reached consistent results (P = 2.66E‐03 in HAND2‐AS1, 1.47E‐04 in CTD‐2033D15.2 and 6.23E‐08 in lncRNA‐TFG). In functional enrichment analysis, the target genes of microRNAs targeting also these lncRNAs were overlapped in multiple biological processes, pathways and malignant diseases including pancreatic cancer. In survival analysis, patients with higher expression of HAND2‐AS1‐targeted and CTD‐2033D15.2‐targeted microRNAs showed a significantly poorer prognosis in PDAC, while high expression of lncRNA‐TFG‐targeted microRNAs demonstrated an obviously better prognosis (log‐rank P < .05). In conclusion, by coexpression network analysis of the lncRNA profiles, three key lncRNAs were identified in association with the tumorigenesis of IPMN, and those lncRNAs might act as early diagnostic biomarkers or therapeutic targets in pancreatic cancer.


| INTRODUCTION
Pancreatic cancer is one of the common cancers across the world, characterized by a poor prognosis. 1 The dominant subtype of pancreatic ductal adenocarcinoma (PDAC) accounts for more than 90%. 2 As an intraepithelial precancerous lesion of pancreatic cancer, intraductal papillary mucinous neoplasm (IPMN) could progress from adenoma to PDAC. Several factors have been identified in relation with the tumorigenesis of IPMN, like blood type, main duct dilatation, and high-grade dysplasia, but the mechanism is still unclear. [3][4][5] As a diverse class of transcribed RNA molecules consisting of more than 200 nucleotides, long noncoding RNAs (lncRNAs) played an important role in the regulation of protein-coding gene expression, although they did not encode proteins. 6,7 Recent studies also reported a regulatory role of lncRNAs in the proliferation, invasion, and metastasis of pancreatic cancer. For example, lncRNA TUSC7, LINC00052, and lncRNA CASC2 modulated miR-371a-5p, miR-330-3p, and miR-21, respectively, to suppress pancreatic cancer cell lines. [8][9][10] High expression of lncRNA MALAT1 was an independent predictor for overall survival in PDAC. 11 However, few studies investigated the role of lncRNAs in IPMN. In the study of Permuth et al, the signature of eight circulating lncRNAs were more accurate than clinical features in the differential diagnosis of malignant and benign IPMN. 12 High-throughput microarray technology develops rapidly in recent years, and several studies adopted the gene expression profiles to find genes associated with the tumorigenesis and prognosis of pancreatic cancer. Nevertheless, most studies focused on the differentially expressed protein-coding genes in pancreatic cancer, regardless of the vast majority of lncRNAs and the intermediate stages, like IPMN. As a correlation network-based systems biology method, weighted gene coexpression network analysis (WGCNA) was used to find clusters of highly related genes and clusters to clinical features. 13 In this study, as the ln-cRNA expression profiles were unavailable, we obtained the lncRNA expression data through an accurate reannotation of the microarray probe. 13 Then, the WGCNA method was used to find network hub lncRNAs related with the tumorigenesis of IPMN.

| Data collection
We obtained the raw data of gene expression profiles (CEL files) and clinical data of dataset GSE19650 and GSE26647 from the database of Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo/). 14,15 Dataset GSE19650 was conducted on the platform of GPL570 (Affymetrix Human Genome U133 Plus 2.0 Array), and used as a training set to construct the coexpression network and identify hub genes in this study. It had 22 tissue samples which were sufficient for the subsequent WGCNA analysis, and covered four sequential stages, including normal main pancreatic duct (n = 7), intraductal papillary mucinous adenoma (IPMA) (or lowgrade dysplasia in IPMN) (n = 6), intraductal papillary mucinous carcinoma (IPMC) (or high-grade dysplasia in IPMN) (n = 6), and invasive IPMN (n = 3). All patients undergone initial surgical resection, and received no prior therapy. The tumors were classified according to the combined criteria. 14,16 Another dataset GSE26647 was conducted on the platform of GPL5175 (Affymetrix Human Exon 1.0 ST), and used as a validation set to verify the results. It had 28 tissue samples, and covered four sequential stages of IPMN, including low-grade (n = 10), moderate-grade (n = 5), high-grade (n = 6), and invasive IPMN (n = 7).

| Data preprocessing
The R software package of "affy" was used to preprocess the raw expression data according to the RMA method: (a) background correction; (b) log2 transformation; (c) quantile normalization; (d) median polish summarization. In quality assessment, sample outliers were removed using the network method for describing sample relationships in genomic datasets, with the threshold of standardized sample connectivity (Z.K)>−2 and standardized sample clustering coefficient (Z.C) <2. 17 Finally, we found no outliers in both the datasets ( Figure 1A-D).

| Probe reannotation
As for the GPL570 microarray, the fasta file of probe sequences was obtained from the annotation file (http:// www.affym etrix.com). In the GENCODE database, we downloaded the fasta file of human genome (GRCh38) and the gtf file of annotation file (GRCh38.p12) (https ://www.genco degen es.org). Probe-matched lncRNA sequences were identified by the spliced alignment program for reads alignment of HISAT. 18 In this study, we defined the lncRNA transcripts as those with the gene types of "lincRNA", "bidirectional_promoter_lncRNA", "macro_ lncRNA", "antisense", "processed_transcript", "TEC", "3prime_overlapping_ncRNA", "sense_intronic", "non_ coding", and "sense_overlapping". We included the transcripts according to the following criteria: (a) identified by ≥4 probes; (b) perfect match; (c) specific match. 13 If multiple probes mapped to the same lncRNA, we assigned the mean value to the lncRNA. As for the GPL5175 microarray, the probes were reannotated using the same method and criteria.

| Coexpression network construction
The Pearson's distances of each paired lncRNAs were calculated by the construction of a weighted adjacency matrix, in which we chose a power of 9 to ensure a scale-free network (scale free R 2 = 0.81) (Figure 2A-D). 19 In the network, the hubs were defined as the highest-degree nodes, which usually played certain roles. Then, we transformed the adjacency matrix into the topological overlap matrix (TOM). In TOM, the network connectivity of a lncRNA could be measured by calculating the sum of its adjacency with all other lncRNAs for network generation. 20 Then, average linkage hierarchical clustering were used to divide the lncRNAs with similar expression patterns in the same modules (a minimum size of at least 30). 21 The analysis was conducted by the R software package of "WGCNA", and the R script used in this study was available on the tutorial website (https ://horva th.genet ics.ucla.edu/ html/Coexp ressi onNet work/Rpack ages/WGCNA/ Tutor ials).

| Significant module identification
The module in relation to the IPMN tumorigenesis was identified by two methods. 13 The significant module was F I G U R E 3 Identification of modules associated with the tumorigenesis of IPMN. A, Dendrogram of 4021 lncRNAs clustered based on a dissimilarity measure (1-TOM). B, Distribution of average gene significance in the modules associated with the tumorigenesis. C, Heatmap of the correlation between module eigengenes and the tumorigenesis defined as the module with the module with the maximal absolute module significance (MS) which was the average lncRNA significance in the module. Second, we also calculated the correlation between the first principal component in the principal component analysis for each module and the clinical features to find the most correlative module.

| Identification of hub lncRNAs
As the highly interconnected nodes in a module, hub genes in the coexpression network have been functionally reported. 22 In this study, we calculated the intramodular connectivity of each lncRNA and high module membership (MM) by the Pearson's correlation. Then, we chose the lncRNAs with both high intramodular connectivity and high MM as the hub genes (namely cor.Standard >0.8 and cor.Weighted >0.8). 13

| Functional enrichment analyses
The cDNA sequences of hub lncRNAs were obtained from the database of Ensembl (http://asia.ensem bl.org). The mi-croRNA targets of these hub lncRNAs were predicted by blasting the sequence with the microRNAs (http://mirdb. org/miRDB/ custom.html). The gene targets of the micro-RNAs were obtained from the experimental validation database of miRTarBase (http://mirta rbase.mbc.nctu.edu.tw/ php/index.php), and two computational prediction databases of miRDB (http://mirdb.org/) and TargetScan (http://www. targe tscan.org/). To study the potential function of these mi-croRNAs, we performed a gene ontology (GO) analysis of related biological processes, pathways and diseases on their predictive gene targets using the online tool of ToppFun (https ://toppg ene.cchmc.org/enric hment.jsp). A false discovery rate (FDR) of less than 0.05 was selected as the cut-off.

| Survival analyses
As the survival data of novel lncRNAs were unavailable, the prognostic role of these lncRNAs in PDAC was indirectly evaluated by investigating the association between their target microRNAs and the prognosis. Kaplan-Meier plotter (http://kmplot.com) was an online tool based on the RNA-sequencing data and clinical data of The Cancer Genome Atlas (TCGA) dataset, and was used to study the relationship between the microRNA expression and PDAC prognosis.

| Network construction and significant module identification
In the training set, probe annotation identified a total of 4951 probes which were mapped to a total of 4066 lncRNAs. According to the similarity of expression pattern, these lncRNAs were grouped into nine modules (black (lncRNA number: 362), blue (2957), cyan (66), grey (195), grey60 (41), magenta (124), red (155), salmon (71), tan (95)) ( Figure 3A). Additionally, the grey module was the module of lncRNAs not assigned to any module, which was regarded as a special module and not considered. In the magenta module, there shared the highest MS, and the ME also had a higher correlation than other modules in the tumorigenesis (R 2 = −0.75 and P = 5E-05) ( Figure 3B,C). Two methods reached the consistent result, and thus the magenta module was chose as the relevant module in the tumorigenesis of IPMN.

| Functional enrichment analyses
Fifty-four microRNAs were identified as the targets of HAND2-AS1, while 15 microRNAs were in CTD-2033D15.2 and 56 microRNAs in lncRNA-TFG. Multiple biological processes, pathways and diseases were, respectively, enriched in the targets genes of microRNAs targeting also these F I G U R E 4 Boxplot of the hub lncRNAs across four stepwise stages in the tumorigenesis of IPMN. IPMA, intraductal papillary mucinous adenoma; IPMC, intraductal papillary mucinous carcinoma; IPMN, intraductal papillary mucinous neoplasm; low, low-grade IPMN; moderate, moderate-grade IPMN; high, high-grade IPMN; invasive, invasive IPMN lncRNAs ( Figure 5). In the overlap analysis, the target genes have the common biological processes of cell differentiation and development especially for neuron, the pathways of axon guidance, membrane trafficking, vesicle-mediated transport and signaling by NGF, the diseases of mental disorders, and malignant diseases including pancreatic cancer. The results indicated a potential involvement of these lncRNAs in the tumorigenesis of PDAC, including the intermediate stage of IPMN.

| DISCUSSION
Initially, lncRNAs were thought to be nonfunctional in transcription, but recent studies experimentally validated that multiple lncRNAs could sponge microRNAs to up-regulate downstream message RNAs. 23 Recently, several studies have reported the regulatory role of lncRNAs in the etiology of pancreatic cancer. Dysregulated expression of multiple lncRNAs has been documented in pancreatic cancer, which might be a major cause of tumorigenesis and progression. 24 Furthermore, epigenetic events were proposed to have an important role in the tumorigenesis of pancreatic cancer, while lncRNAs has already been found as important epigenetic regulators in multiple biological processes, such as cell proliferation, differentiation and apoptosis and subsequent tumorigenesis. 25 Several lncRNAs have been experimentally validated in the tumorigenesis and progression of pancreatic cancer. For example, certain lncRNAs could interact with the WDR5/MLL complex to up-regulate the expression of multiple 5' HOXA genes, which played an regulatory role in the progression and chemoresistance of pancreatic cancer. 26 The study of Giulietti et al 27 also used the method of coexpression network analysis to identify novel lncRNA biomarkers for PDAC. Finally, 11 lncRNAs were found, namely A2M-AS1, DLEU2, MIR155HG, SLC25A25-AS1, ITGB2-AS1, TSPOAP1-AS1, LINC01133, LINC00675, PSMB8-AS1, LINC01857, and LOC642852. However, Giulietti et al extracted the lncRNAs by the microarray annotation file, which was different from the method of probe reannotation. Fu et al 28 study adopted the same method as us and identified three lncRNAs (AFAP1-AS1, UCA1, and ENSG00000218510) involved in the PDAC progression. However, few studies performed a systematic analysis to find key lncRNAs in relation to the tumorigenesis of IPMN.
In this study, we combined the method of probe reannotation with coexpression network analysis, and identified F I G U R E 8 Expression levels of the lncRNA-TFG-targeted microRNAs and the prognosis of pancreatic ductal adenocarcinoma (PDAC) three lncRNAs (HAND2-AS1, CTD-2033D15.2, and lncRNA-TFG) in association with the tumorigenesis of IPMN. In functional enrichment analysis, the target genes of microRNAs targeting also these lncRNAs were enriched in multiple biological processes, pathways, and malignant diseases including pancreatic carcinoma. In survival analysis, the target microRNAs of the lncRNAs were also correlated with the prognosis of PDAC. These results suggested a potential role of these lncRNAs in the pancreatic tumorigenesis. In mechanism, these lncRNAs might act though sponging the target microRNAs which were involved in the pathogenesis of pancreatic malignancies. Among these lncRNAs, HAND2-AS1 has been experimentally validated as a suppressor in multiple cancers, like esophagus squamous cell cancer, nonsmall cell lung cancer, colorectal cancer, osteosarcoma, and endometrioid endometrial cancer. [29][30][31][32][33] As far as we know, this was the first study that combined the method of probe reannotation with coexpression network analysis to find key lncRNAs in relation to the tumorigenesis of IPMN. However, the limitations should be also acknowledged. First, the method of probe reannotation could help obtain reliable lncRNA data, but not cover all lncRNAs. Second, cellular and molecular experiments were needed to validate our results.
In conclusion, by coexpression network analysis of the lncRNA profiles, three key lncRNAs were identified in association with the tumorigenesis of IPMN, and those lncRNAs might act as early diagnostic biomarkers or therapeutic targets in pancreatic cancer.