Survival analysis and functional annotation of long non‐coding RNAs in lung adenocarcinoma

Abstract Long non‐coding RNAs (lncRNAs) are a subclass of non‐protein coding transcripts that are involved in several regulatory processes and are considered as potential biomarkers for almost all cancer types. This study aims to investigate the prognostic value of lncRNAs for lung adenocarcinoma (LUAD), the most prevalent subtype of lung cancer. To this end, the processed data of The Cancer Genome Atlas LUAD were retrieved from GEPIA and circlncRNAnet databases, matched with each other and integrated with the analysis results of a non‐small cell lung cancer plasma RNA‐Seq study. Then, the data were filtered in order to separate the differentially expressed lncRNAs that have a prognostic value for LUAD. Finally, the selected lncRNAs were functionally annotated using a bioinformatic and systems biology approach. Accordingly, we identified 19 lncRNAs as the novel LUAD prognostic lncRNAs. Also, based on our results, all 19 lncRNAs might be involved in lung cancer‐related biological processes. Overall, we suggested several novel biomarkers and drug targets which could help early diagnosis, prognosis and treatment of LUAD patients.


| INTRODUC TI ON
Lung cancer is the number one cause of cancer-related death among both men and women worldwide. 1 Non-small cell lung cancer (NSCLC) accounts for approximately 85% of lung cancer cases. 2 NSCLC is histologically divided into three subtypes of which lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC) account for ~50% and ~40% of the cases respectively. 3 Unfortunately, most of the NSCLC patients are diagnosed at advanced stages and have a very poor prognosis, which consequently results in a low overall survival (OS) rate (15%). 4,5 In this regard, LUAD is one of the most aggressive and deadliest types of cancer with less than 5 years of OS. 6 A variety of factors such as cigarette smoking, exposure to second-hand smoke, air pollution, cooking fumes, asbestos and radon put individuals at the risk of LUAD. Besides, immunologic dysfunction, genetic susceptibility as well as some diseases including asthma and tuberculosis infections would enhance the risk of LUAD. 7 Additionally, it is reported that the carcinogenesis of LUAD varies between men and women as well as between smokers and never smokers. 8 Long non-coding RNAs (lncRNAs) refer to a class of non-protein coding RNAs that are more than 200 nucleotides and are differentially accumulated in the nucleus and cytoplasm. 9 LncRNAs play various regulatory roles in the cell including regulation of development, stem cell pluripotency, cell growth and apoptosis and are frequently dysregulated in different cancers. 9,10 HOTAIR, as an example, is a | 5601 SALAVATY eT AL.
well-known oncogenic lncRNA that is up-regulated in several cancers. 11 The lncRNA linc00665 has recently been represented as an oncogenic factor in LUAD. 12 However, there are hundreds of ln-cRNAs that their exact roles in different cancers are yet to be discovered and/or experimentally validated. RAB6C-AS1, for instance, is a poorly known lncRNA that is presented as a potential candidate biomarker for prostate and brain cancers but its implications in the carcinogenesis of these cancers are not still neither computationally nor experimentally examined and validated. 13 LncRNAs are also involved in the tumourigenesis and progression of lung cancer through aberrant regulation of gene expression at the transcriptomic, epigenomic and genomic levels. 14 Additionally, epigenetic and RNA deregulations are considered as a potential hallmark of LUAD. 15 Altogether, discovering the functional roles of lncRNAs in LUAD would greatly enhance our knowledge of the aetiology of LUAD and lead to the advent of novel promising biomarkers and drug targets for this deadly disease.
As lncRNAs play essential roles in the progression of different cancers, they have the potential to be used as diagnostic and prognostic biomarkers. 16 Also, the presence of several circulating transcripts has been reported in the plasma and serum of cancer patients which could be used for diagnostic purposes. 17 Moreover, different circulatory non-coding RNAs (ncRNAs) including lncRNAs are being constantly represented as biomarkers for cancer diagnosis, prognosis and monitoring of treatment response. 5,18 Thus, lncRNAs are potential factors for the prediction of OS and disease-free survival (DFS) periods of cancer patients. Today, lncRNAs are being regarded as potential diagnostic factors and therapeutic targets for NSCLC. 19 The lncRNA LINC00578, as an example, is represented as a promising biomarker and therapeutic target for LUAD. 20 In another study, TG et al introduced three lncRNAs including HCP5, SNHG12 and LINC00472 as potential biomarkers for LUAD management. 21 Also, several lncRNAs such as LHFPL3-AS2, LINC01105, LINC00092, LINC00908 and FAM83A-AS1 have been reported as prognostic factors for LUAD. 22 Furthermore, the diagnostic value of circulating lncRNAs as plasma signatures for the early detection of lung cancer has been confirmed. 23 A growing number of computational models are being constantly developed for the identification of lncRNA-disease associations and characterization of functional roles of lncRNAs in diseases including lung cancer. KATZLDA, as an example, is a robust computational model for the prediction of lncRNA-disease associations. 24 In another study, Chen et al proposed a kind of top-down model. They assumed that similar diseases tend to be associated with functionally similar lncRNAs and accordingly, developed a computational model named LRLSLDA. 25 Generally speaking, identification of lncRNAdisease associations is achieved based on two different approaches; using known lncRNA-disease associations, as in machine learningbased and network-based models, or using models based on the known disease-related genes/miRNAs. Functional similarity calculation method, which is based on the assumption that functionally similar lncRNAs are associated with similar diseases, is commonly applied in both of the aforementioned approaches but usually in combination with other methods. 26 Various information resources are used for the calculation of lncRNA functional similarity which could be summarized into four categories: lncRNA expression similarity, GO term-based lncRNA functional similarity, miRNA/mRNA-lncRNA interaction-based functional similarity and lncRNA-disease association-based functional similarity. 27 In the current study, a GO term-based lncRNA functional similarity method was used to functionally interrogate the lncRNA-LUAD associations. In the context of the prediction of functional roles of lncRNAs in diseases, several computational models have thus far been proposed that could be classified into four major categories, including gene coexpressionbased models, lncRNA-miRNA/mRNA/protein interaction-based models, sequence alignment-based models and integrative featuresbased models which incorporates sequence-derived and experimental features of lncRNAs. 27 In this study, we applied a coexpression-based model for the prediction of functional roles of lncRNAs in LUAD. It is frequently reported that lncRNAs are differentially expressed (DE) in cancer tissues. 28 Also, according to the guilt by association principle, if a gene shows an expression correlation with the expression profiles of a set of genes involved in a specific function, that gene is possibly involved in the same function. 29 Therefore, identification of the coexpressed genes (CEGs) of DE-lncRNAs can help functional annotation of lncRNAs in cancer. Moreover, CEGs can have common regulatory sequences and might be interacting partners of the same complex and/or involve in the same pathway. 30 Actually, dysregulated ln-cRNAs interact with other macromolecules and consequently drive various cancer manifestations. 31 Hence, identification of the CEGs of DE-lncRNAs in cancer assists in the characterization of oncogenic or tumour suppressive functions of lncRNAs and recognition of pathways they are involved in. This is a common methodology that can be used for the functional annotation of poorly known genes in the context of different diseases including cancer. 32 There are several notable variances in the gene expression profile and molecular features between LUSC and LUAD and consequently different therapeutic strategies and regiments are administrated to these two major subtypes of NSCLC. 33 Thus, NSCLC studies should precisely target either LUAD or LUSC so as to get more specific results. In this study, we systematically analysed the prognostic value of lncRNAs for LUAD and annotated their functional roles using a bioinformatic and systems biology approach. A schematic outline of the implemented methodology is given in Figure 1.

| Data preparation
All of The Cancer Genome Atlas (TCGA) LUAD DE-lncRNAs (1 < |Log2FC|; adjusted P-value (adjp) < 0.05) were retrieved from the circlncRNAnet (Table S1). 34 The circlncRNAnet is an integrated webbased resource for mapping functional networks of long or circular forms of ncRNAs. TCGA LUAD is one of the projects conducted by TCGA Research Network and comprises 483 LUAD tumour samples and 59 normal lung samples. Also, all TCGA LUAD DEGs were obtained from GEPIA (Table S2). 35 GEPIA is a web server specialized for analysing the RNA-seq data of 9736 tumours and 8587 normal samples from the TCGA and the GTEx projects. In the context of the differential expression analysis of genes in LUAD, GEPIA has added 288 normal lung samples from GTEx projects to the normal samples of TCGA LUAD so as to make a higher balance between the number of normal and cancer samples. Then, common DE-lncRNAs between circlncRNAnet and GEPIA with the same expression dysregulation (either over-or underexpression) in both databases were selected.
Moreover, a total of six plasma RNA-seq data samples, including three normal and three NSCLC plasma samples, were retrieved from the PRJNA286036 study at the European Nucleotide Archive and analysed using an Australian Galaxy server (GVL QLD, GVL 4.0.1; https ://galaxy-qld.genome.edu.au/galaxy).

| Data filtration
All of the prepared data were filtered so as to make our downstream analyses more specific. First, the list of lncRNAs selected from the intersection of data retrieved from circlncRNAnet and GEPIA data-

| Survival analysis
Prognostic value of those RefSeq lncRNAs that remained in the last step of the data filtration process was investigated using the GEPIA web server. To this end, the prognostic value of all of the    45 was used to perform a multivariate Cox regression analysis on a LUAD microarray study (GSE31210). In this step, a microarray dataset rather than an RNA-Seq one was used to obtain more reliable results.

| Coexpression analysis
The coexpression analysis was done for each lncRNA independently. Different resources of TCGA LUAD processed data were integrated in order to identify high-confident CEGs. To this pur-   HLA should be noted that because SNHG6 was not significantly differ-

| Clinicopathological and demographic analysis
The differential expression of LAProLncRs among different LUAD stages was analysed using the GEPIA web server. Also, the impact of F I G U R E 5 The lncRNA-GO-BP association network. The yellow nodes represent LAProLncRs. The intensity of pink node colour indicates the betweenness centrality score; stronger pink colour indicates higher betweenness centrality score. The node height is proportional to node degree; Bigger node indicates higher node degree. The edge colour is representative of the source database of predicted GO term and the reference tissue type. Please refer to Table 2 for finding the description of GO IDs. The network was reconstructed using the Cytoscape software

| Statistical and topological analysis
All of the statistical analyses, except Cox regression analysis, were done using R statistical software (R Development Core Team (2014), freely available at http://www.r-proje ct.org). The multivariate Cox regression analysis was done by the Kaplan-Meier plotter web server. The Pearson correlation coefficient (R)>0.3 was considered as the significant threshold throughout the study. Also, the P-value < 0.05 was considered statistically significant in all of the analyses. In the context of graph topology, two network metrics including betweenness centrality (a measure of node centrality based on shortest paths) and degree (the number of edges incident to each node) were coincidently employed to determine the hub nodes, whenever possible.
Interestingly, all 109 lncRNAs had lower abundance in NSCLC plasma samples compared with normal samples. Filtration of these 109 lncR-NAs through circlncRNAnet and cBioPortal databases indicated that four of the 109 circulating lncRNAs were significantly amplified/ overexpressed in TCGA LUAD samples (Data not shown). Altogether, 168 lncRNAs were selected for downstream analyses (Table S6).

| Presentation of 19 lncRNAs as candidate LUAD biomarkers
Among all lncRNAs in our gene library, only 62 lncRNAs came out as RefSeq lncRNAs after filtration through HGNC RefSeq lncRNAs (Table S7). Subsequently, survival analyses using the GEPIA web server demonstrated that 19 of the 62 RefSeq lncRNAs had significant prognostic values (Logrank test P-value < 0.05) for LUAD (Table 1). Remarkably, one of these lncRNAs, namely SNHG6, was of the lncRNAs with differential abundance between plasma samples of NSCLC patients and healthy controls. Also, Kaplan-Meier plots illustrated that the association of these 19 lncRNAs with OS/DFS of patients is in accordance with the dysregulation of these lncRNAs in TCGA LUAD cancer samples (Figure 2 and Figure S1). Actually, while down-regulated lncRNAs had higher expression levels in patients with higher percentages of OS/DFS, up-regulated lncRNAs had lower expression levels in those patients. Furthermore, the expression analysis of these lncRNAs using the GEPIA web server demonstrated obvious differences in the expression of these 19 lncRNAs between normal and cancer samples (Figure 3).

| LAProLncRs are coexpressed with several other genes
The coexpression analysis of lncRNAs indicated that they were sig-

| LAProLncRs are involved in several regulatory biological processes
The gene set enrichment analysis of the DECEGs of LAProLncRs indicated that these lncRNAs might be involved in several regulatory biological processes including cancer-related ones ( Figure 5; Table 2; Table   S8). As depicted in the lncRNA-GO-BP network, some lncRNAs and their associated GO terms were grouped in modules and might work in  (Table 3 and Figure S2).

| The expression of some of the LAProLncRs is associated with clinicopathological and demographic features
The analyses demonstrated that the expression of nine of the 19 LAProLncRs was significantly associated (P-value < 0.05) with different stages of LUAD ( Figure 6). Also, the violin plots in Figure 6 indi-  (Table 4).

| D ISCUSS I ON AND CON CLUS I ON
The aim of this investigation was to systematically examine the di- Circulating lncRNAs have recently emerged as novel cancer biomarkers for diagnostic and prognostic purposes. For instance, the circulating lncRNAs NEAT1, ANRIL and SPRY4-IT1 have been suggested as new diagnostic biomarkers for NSCLC. 5 It has been reported that the dysregulation of lncRNAs in plasma is in accordance with their dysregulation in the source tumour tissue. 16  18 LAProLncRs could be used as diagnostic and/or prognostic biomarkers in clinical practice as well. Notably, the association of six of the 19 LAProLncRs with lung cancer has been previously confirmed by microarray studies (Table 5)  TA B L E 4 (Continued) insights into the functional roles of LAProLncRs in LUAD tumourigenesis. In addition to hub GO-BP terms, the shared terms between LUAD and normal lung tissues (blue edges in Figure 5) should also be considered with a higher priority in future studies. Also, the pathway enrichment analysis demonstrated that the LAProLncRs in Module B might synergistically function in the neuroactive ligand-receptor interaction pathway. The neuroactive ligand-receptor interaction is a methylation-enriched pathway 78 which its association with LUSC, 79 osteosarcoma, 80 breast cancer, 81 colon cancer, 82 pancreatic cancer 83 and hepatocellular carcinoma 78 has been previously reported.
However, further studies are required to decode the precise func-  (Table 4).
Collectively, we conducted the most comprehensive sys- NSCLC plasma samples that we had access to was too low and consequently our results regarding the differential abundance of lncRNAs in the blood might not be robust enough. Additionally, further in silico, in vitro and in vivo assays are required to evaluate the potential of LAProLncRs as biomarkers and/or drug targets for LUAD patients.

ACK N OWLED G EM ENTS
The results shown in this study are in part based upon data generated by the TCGA Research Network: http://cance rgeno me.nih.
gov/. The authors thank Ms. Mozhdeh Shahmoradi for her scientific consultations and Ms. Marzie Samimifar for the proofreading of the English language of the manuscript.

CO N FLI C T O F I NTE R E S T
The authors have no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
All datasets generated/analysed for this study are included in the manuscript and the supplementary files.