ce‐Subpathway: Identification of ceRNA‐mediated subpathways via joint power of ceRNAs and pathway topologies

Abstract Competing endogenous RNAs (ceRNAs) represent a novel mechanism of gene regulation that may mediate key subpathway regions and contribute to the altered activities of pathways. However, the classical methods used to identify pathways fail to specifically consider ceRNAs within the pathways and key regions impacted by them. We proposed a powerful strategy named ce‐Subpathway for the identification of ceRNA‐mediated functional subpathways. It provided an effective level of pathway analysis via integrating ceRNAs, differentially expressed (DE) genes and their key regions within the given pathways. We respectively analysed one pulmonary arterial hypertension (PAH) and one myocardial infarction (MI) data sets and demonstrated that ce‐Subpathway could identify many subpathways whose corresponding entire pathways were ignored by those non‐ceRNA‐mediated pathway identification methods. And these pathways have been well reported to be associated with PAH/MI‐related cardiovascular diseases. Further evidence showed reliability of ceRNA interactions and robustness/reproducibility of the ce‐Subpathway strategy by several data sets of different cancers, including breast cancer, oesophageal cancer and colon cancer. Survival analysis was finally applied to illustrate the clinical application value of the ceRNA‐mediated functional subpathways using another data sets of pancreatic cancer. Comprehensive analyses have shown the power of a joint ceRNAs/DE genes and subpathway strategy based on their topologies.

termed as ceRNAs. 2 Many important transcripts can be reconsidered and functionalized, partly through the identification of competing endogenous mechanism, presenting a framework for the prediction and validation of ceRNAs. 3 Moreover, large-scale analyses have shown that ceRNAs play the crucial roles in complex biological processes of many diseases. [4][5][6] For example, the pseudogene PTENP1 competes with the important tumour suppressor gene PTEN for interaction with miR-499-5p, thus regulating PTEN protein levels. 7 PTEN is known to be frequently disrupted in multiple cancers and governs multiple biological processes, including survival, proliferation and energy metabolism. 8 As the study of ceRNAs progress, ceRNA interactions are found to work in the pathways. For example, Sumazin et al have constructed glioblastoma-related ceRNA network and suggested that ceRNAs may function in different regulatory pathways. 9 However, systematic analysis of competing endogenous mechanism within the pathways is still poorly understood. It is reasonable to expect that some ceRNAs may locate in key regions of the disease-related pathways or multiple ceRNAs may mediate the same dysregulated regions of the important pathways. Therefore, it is crucial to study ceRNA interactions between the disease-related genes within the functional pathways and further identify key ceRNA-mediated subpathway regions, which would provide some clues to the major pathogenesis of human diseases.
Pathway analysis is an effective tool for identifying the pathways or subpathways that are significantly impacted when a biological system is perturbed by stimulation. 10 However, the current computational approaches applied to pathway identification fail to specifically consider ceRNA interactions within the pathways and key regions impacted by them. For example, hypergeometric test and gene set enrichment analysis (GSEA) approaches have been widely used in pathway analysis. 11,12 However, they focus only on differentially expressed (DE) genes; ignore those mRNAs that have competing endogenous relationships between each other and the important topologies within the pathways. Another two classical methods, signalling pathway impact analysis (SPIA) and Subpathway-GM have used pathway topologies for the identification of pathways or subpathways. 13,14 These methods are excellent in identifying key pathways or subpathway regions, but they still do not consider the effect of competing endogenous mechanism on the pathway or subpathway identification. Recently, although some studies have slightly considered the concept of ceRNAs in pathway analysis, they are short of systematic design, normalized algorithm or available software. For instance, Sun et al have simply imported the interrelated genes within the constructed ceRNA network to some given pathways for annotation, and demonstrated that ceRNAs could potentially modulate multiple signalling pathways. 15 Obviously, ceRNAs could be the essential components of pathways, the alterations of which may contribute to the altered activities of functional pathways.
In this study, we proposed a novel approach referred to as ce-Subpathway for the identification of ceRNA-mediated subpathways.
It integrated information from ceRNAs/DE genes, and their key regions within the given pathways to identify significant subpathways associated with the study condition (eg various human diseases). Specifically, ceRNA interactions and disease-related DE genes were obtained based on gene expression profiles and mapped into the reconstructed pathway graphs. An effective subpathway identification strategy was then applied to locate ceRNA-mediated functional subpathways from the given pathways. Statistical significance of these subpathways was further evaluated by hypergeometric test, which integrated the impact of the number of both ceRNAs and disease-related DE genes. The R-based tool of the ce-Subpathway strategy has been freely available in GitHub (https://github.c om/chunquanlipathway/ce-Subpathway).

| Gene expression data sets
The study devoted to identifying ceRNA-mediated functional subpathways associated with various human diseases. The corresponding gene expression profiles were obtained from the NCBI Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) database and TCGA research group (http://tcga-data.nci.nih.gov/), respectively. The following data sets were from GEO: one PAH (GSE33463), one MI (GSE66360), one breast cancer (GSE7562),two oesophageal cancer (GSE20347, GSE74742), two colon cancer (GSE8671, GSE4183) and two pancreatic cancer (GSE32676, GSE57495) data sets. For all the gene expression profiles, the probes were mapped to gene symbol and those probes that mapped to the same gene symbol were merged by averaging their expression values. As for the profiles with raw expression values, gene expression values were log 2 transformed.
The following data sets were from TCGA: one breast cancer data set with 849 samples and one pancreatic cancer data set with 177 samples. The processed level 3 RNA-seq data were downloaded for further analysis.
2.2 | Obtain mRNA-related ceRNA interactions and disease-related DE genes 423 975 miRNA-mRNA interactions containing 386 miRNAs and 13 802 mRNAs were downloaded from starBase V2.0. 16 Based on these miRNA-mRNA interactions, we obtained all the mRNA-mRNA pairs with sharing the number of common miRNAs ≥3 and estimated their statistical significance by a hypergeometric test. Meanwhile, we required the two mRNAs in each mRNA-mRNA pair appearing in the same pathway. Those mRNA-mRNA pairs with hypergeometric test false discovery rate (FDR) adjusted P < 0.05 and the two mRNAs of which within the same pathway were retained. According to the theory that the expression of ceRNA interactions was positively correlated, we further computed Pearson correlation coefficient (Pcc) of the above mRNA-mRNA pairs based on gene expression profiles.
Then, all the mRNA-mRNA pairs with Pcc R >0 and P < 0.05 were identified as ceRNA interactions. We obtained these ceRNA interactions, their corresponding P-values of Pcc and all the nonredundant ceRNAs. Here, P-values of Pcc were used to measure the strength of ceRNA interactions.
Differentially expressed genes were the key factor in identifying functional subpathways associated with human diseases. Thus, various diseases-related DE genes were identified between disease and normal samples based on gene expression profiles, using the significance analysis of microarrays method with a strict cut-off of FDR adjusted P < 0.01. 17  which provides abundant pathway structure information and is widely used in pathway analysis. 13,18 These KGML files were converted to list variables in R by applying an R package iSubpath-wayMiner. 19 Specifically, the map nodes were removed from the corresponding KEGG pathway map, and the resulting graphs mainly contained gene products. Two genes were connected by an edge if a common compound was existed in their corresponding reaction in a metabolic pathway, or if they had a relationship such as interaction, binding or modification in a non-metabolic pathway. Thus, we obtained the reconstructed pathway graphs with the topology structure of each pathway retained.
The disease-related ceRNAs and DE genes, which we have obtained from the previous step, were then mapped to all the reconstructed pathway graphs. Those ceRNAs or DE genes that could be mapped onto the corresponding nodes of gene products within the reconstructed pathway graphs were defined as key nodes. These key nodes would contribute to identifying important subpathways in the next step.

| Locate ceRNA-mediated subpathways according to key nodes
We located ceRNA-mediated functional subpathways by integrating the power of a joint ceRNAs/DE genes and subpathway strategy based on their pathway topologies. Three steps were as follows.
First, we developed a computed score named ce-score, which showed the importance of the DE levels of disease-related DE genes, the strength of ceRNA interactions and pathway topologies.
The formulas were defined as: where P DE i and P DE j are the FDR adjusted P-values of key node i and j, respectively; P DE is the minimum value between P DE i and P DE j ; P cor is the P-value of Pcc between key node i and j; z is the value of the inverse normal cumulative distribution function (θ −1 ) that is converted from P; d is the length of the shortest path between key node i and j in any one pathway, calculated by breadth-first search algorithm; ce-score is the computed score.
We computed the ce-scores between any two key nodes in every given pathway graph. The key nodes could be ceRNAs, DE genes or both of them. The smaller FDR adjusted P-value represented more significant DE level of disease-related DE genes. The smaller P-value of Pcc represented stronger ceRNA interactions.
Here, if the relationship between two key nodes was not ceRNA interaction, we set its P-value of Pcc to 1; if one key node was not DE gene, we set its FDR adjusted P-value to 1. Thus, when the nodes had more significant DE level, stronger ceRNA interaction or shorter path length, the ce-score would be greater. And this would help in identifying more important ceRNA-mediated functional subpathways. In contrast, when the nodes were not ceRNAs/DE genes or more distance from each other, the ce-score would be smaller.
Second, we estimated an appropriate threshold ω to screen the greater ce-scores. Here, we used the empirical probability distribution function. Actually, we ranked all the ce-scores between any two key nodes within all the reconstructed pathway graphs, and regarded a value which was greater than 75% of the ce-scores as threshold ω. If the ce-score between any two key nodes was greater than ω, these two key nodes and other nodes at their shortest path (many other nodes that may or may not be ceRNAs or DE genes) were added to the same node set. This process was recurrent for all key nodes. All the node sets formed the basis for locating subpathways. Flexibility could be introduced through varying threshold ω by users. A larger ω indicated that only key nodes with stronger ceRNA interaction, more significant DE level or shorter path length could be added to the same node set, the identified subpathways would thus form a smaller scale. As the threshold ω decreased, the number of some other nodes except for key nodes within subpathways would increase, and then the scale of subpathways would become larger.
Third, the idea of lenient distance similarity was used to locate ceRNA-mediated subpathways. 14,20 According to each node set above, we extracted the corresponding subgraph from pathway graph and defined those subgraphs with the number of nodes ≥5 as ceRNA-mediated functional subpathways because subgraphs with small scales were only scatter node sets and could not usually form biologically meaningful subpathways. subpathway; and (d) the number of background genes annotated to each subpathway. All human genes in KEGG were considered as background genes. The hypergeometric P-value of statistical significance was defined as: where m is the number of genes in the whole genome; n ce (n DE ) is the total number of ceRNAs (DE genes) submitted for analysis, of which r ce (r DE ) is involved in the same subpathway containing t genes.
When many subpathways were considered, a high false-positive discovery rate may be likely to occur. Therefore, we further calculated FDR adjusted P-values using the Benjamini-Hochberg FDR method.

| Survival analysis
To evaluate prognosis performance of the significant ceRNAmediated functional subpathways, a risk score model was constructed. The risk score for each patient was calculated according to linear combination of the gene expression values weighted by the regression coefficient from the univariate Cox regression analysis, which was defined as: where β i is the Cox regression coefficient of gene i from an independent training set; Exp(i) is the expression value of gene i in a corresponding patient; k is the number of testing genes.
The median risk score was used as the cut-off to classify the training set into high-risk group and low-risk group. A Kaplan-Meier survival analysis was then performed for the two classified groups of patients, and statistic significance was assessed using the log-rank test with a cut-off value of P < 0.05. The Kaplan-Meier survival curve was utilized to validate the predicted ability of the k-gene signature model.

| RESULTS
The ce-Subpathway strategy has been proposed to identify ceRNAmediated functional subpathways via a global view of the systemlevel integration of ceRNAs, disease-related DE genes and pathway topologies ( Figure 1). In this study, we firstly compared the ce-Subpathway strategy with four other non-ceRNA-mediated pathway/ subpathway identification methods at the system level, including hypergeometric test, GSEA, SPIA and Subpathway-GM. These methods were commonly used for pathway analysis. [11][12][13][14]21 Then, reliability of ceRNA interactions and robustness/reproducibility of the ce-Subpathway strategy were validated, respectively. Furthermore, survival analysis was applied to illustrate prognostic value of the ceRNA-mediated functional subpathways.

| PAH-related ceRNA-mediated subpathway identification
The PAH data set (GSE33463, Table 1) was chosen to illustrate the effectiveness of the ce-Subpathway method. With FDR adjusted P < 0.05, we identified 31 significant ceRNA-mediated functional subpathways from all the reconstructed pathway graphs, which corresponded to 26 entire pathways (Tables 2, S1 and S2). Lots of these pathways were ignored by hypergeometric test, GSEA, SPIA or Subpathway-GM ( Figure 2A, Tables 2 and S2). Hypergeometric test found 25 significant pathways with FDR adjusted P < 0.05, but 20 (76.92%) significant pathways identified by ce-Subpathway could not be detected by hypergeometric test (Figure 2A, Table 2). GSEA identified only four significant pathways with FDR adjusted P < 0.05, which were not the same as any significant pathways identified by ce-Subpathway ( Figure 2A, Table 2). The power of SPIA also seemed to be limited, 22 (84.62%) significant pathways identified by ce-Subpathway were ignored by SPIA ( Figure 2A, Table 2). The subpathway identification strategy, Subpathway-GM, identified lots of subpathways. However, up to 30 significant subpathways corresponding to 25 (96.15%) entire pathways identified by ce-Subpathway were undetected by Subpathway-GM ( Figure 2A, Table 2). Surprisingly, up to 24 significant subpathways identified by ce-Subpathway, which corresponded to 20 entire pathways, were simultaneously ignored by the four non-ceRNA-mediated methods ( Figure 2A, Table 2). In addition, we found that up to 25 pathways identified by ce-Subpathway have been well reported to be associated with PAH-related cardiovascular disease (Tables 2 and S2). Only one pathway could not be validated effectively with appropriate references. However, for example, in 25 significant pathways identified by hypergeometric test, up to three pathways were not reported to be associated with PAH by curated literatures (Table S2).
We focused on a significant ceRNA-mediated subpathway (path:04110_5), which belonged to "cell cycle" (Figure 3). It was reported to play key roles in the smooth muscle cell proliferation and vascular remodelling. 22 26 In addition, the expression of RBL2 has a high correlation with the growth of vascular endothelial cells. 27 Strong ceRNA interaction was also found between RBL2 and E2F3.
E2F3 has been reported as an important transcription factor that controls proliferation of vascular smooth muscle cells. 28 The above results suggest that stronger ceRNA interactions, shorter path length and greater ce-scores are the necessary factors in identifying more important ceRNA-mediated functional subpathways.
Another ceRNA-mediated functional subpathway (path:04310_ 20) belonged to "Wnt signalling pathway" (Figure 4), which has been verified as a crucial regulatory mechanism in PAH. The ce-Subpathway method yielded a FDR adjusted P-value of 4.32E-06. Key subpathway identified was located in the canonical pathway region of Wnt signalling pathway. β-catenin, one of the key nodes in this subpathway, was located in the central position of the canonical pathway region. Activation of β-catenin would disturb the growth of normal pulmonary arterial smooth muscle cells, but promotes the malignant proliferation. 29 In the level of miRNAs that mediated EP300 and JUN, CREBBP and TCF7L2, FBXW11 and EP300 ceRNA interactions, we found miR-30 family was extracted, simultaneously.
It has been demonstrated that the miR-30 family has been a crucial regulator that exerts functions in human pulmonary endothelial cells. 30 Previous studies have also shown that all genes associated with these ceRNA interactions are implicated in the pathological processes of PAH. For example, EP300 functions as histone acetyltransferase to regulate transcription of genes via chromatin remodelling, overexpression of which could alter the expression levels of ECM proteins and VEGF in endothelial cells. 31 In a word, both the common miRNAs and ceRNAs of the key ceRNA-mediated subpathways have been suggested to be closely associated with PAH.
The most significant of these additional ceRNA-mediated subpathways (path:04722_6) belonged to the "neurotrophin signalling pathway," which has been demonstrated to be involved in the regulatory processes of injury of MI. 32  04630_2 and path:04630_6), but MI-related subpathway was located in Jak-STAT signalling pathway (path:04630_5). More information could be seen in Tables 2 and 3

| Cancer-related ceRNA-mediated subpathway identification
Our challenge not only lies in obtaining biologically meaningful subpathways, but also in interpreting the reliability of ceRNA interactions used in the study. Based on the breast cancer data set from The Cancer Genome Atlas (TCGA; Table 1), we identified 16 significant ceRNA-mediated subpathways by ce-Subpathway.
They corresponded to 12 entire pathways, which were well associated with breast cancer (Table 4). Such as phosphatidylinositol signalling system, which is critical to normal and malignant cellular processes, including proliferation, apoptosis and metabolism. 40 Studies have shown that mutations in genes that constitute the phosphatidylinositol 3-kinase (PI3K)-related pathway occur in >70% of breast cancers. 41 Clinical and experimental evidence suggests that PI3K signalling activation promotes resistance to some of the current breast cancer therapies. 42 By extracting topology structure, the famous ceRNA PTEN was found in the significant subpathway region (path:04070_8) of phosphatidylinositol signalling system ( Figure 6A). We then took PTEN-related ceRNAs for reliability vali-  Figure 6B).  F I G U R E 4 The Wnt signalling pathway uniquely identified by ce-Subpathway. The upper figure is the Wnt signalling pathway in KEGG. Red node labels and borders near asterisk symbol belong to the subpathway (path:04310_20) identified by ce-Subpathway. Key subpathway region is shown in red triangle. The middle table lists all the ceRNA interactions of this subpathway with the corresponding ce-scores. The bottom region shows three pairs of ceRNA interactions in this subpathway, which are formed by competing the common miRNAs data removal tests using PAH-related or MI-related DE genes, respectively. To each set of DE genes, we removed the number of DE genes from 5% to 30% at 5% intervals with the remaining DE genes as new input, and repeated the ce-Subpathway method 20 times for each removal. In the PAH data set, the number of overlapped significant subpathways fell slowly compared with the original data, and the ratio of overlapped subpathways to original significant subpathways remained 76.11% even after removal of up to 30% DE genes (Figure 6D).The similar results were also obtained in the MI data set (Figure 6D). These results indicate that the ce-Subpathway method is robust to data removal.

| Robustness and reproducibility analysis of ce-
To test the reproducibility of the results across different data sets, the ce-Subpathway method was applied to another two independent colon cancer data sets from GEO (GSE8671, GSE4183).
About 11 and 10 significant ceRNA-mediated subpathways were identified, corresponding to 10 and 9 entire pathways, respectively.
Eight pathways were found highly reproducible across these results ( Figure 6E), all of which have been reported to be associated with the occurrence and development of colon cancer. For example, focal adhesion kinase signalling that is overexpressed in metastatic colon cancer plays a key role in angiogenesis, cell proliferation and survival, motility and invasion. 43 The insulin-like growth factor-1 receptor tyrosine kinase signalling through the MAPK and phosphatidylinositol 3-kinase pathways plays a part in transformation and colon tumourigenesis. 44 The actin-cytoskeleton pathway is the backbone of cells that allows migration and mobility of cells within the body, which has already been implicated in the development and pathogenesis of invasive metastatic colon cancer. 45 These results show the power of reproducibility of the ce-Subpathway method.    To further illustrate the overall discriminatory power of the ceRNA-mediated subpathways, every gene of the six common subpathways was identified as a single-gene signature for survival analysis. According to the independent training set from GEO (GSE57495), almost all single-gene signatures could not distinguish the high-risk and low-risk groups (Table S3). Only one gene (ATP2A3) in calcium signalling pathway got a significant P-value of 0.006, but it was still not more significant than the result of its corresponding subpathway (path:04020_1), which got a significant Pvalue of 0.0004. The similar results that almost all single-gene signatures could not distinguish the high-risk and low-risk groups were also observed based on the independent training set from TCGA (Table S3). Taken together, it has become obvious that the ceRNAmediated subpathways identified by the ce-Subpathway method would be an effective predictor of survival outcome in cancer patients and greatly improve prognostic capabilities. species and various studying conditions (disease/non-disease). In this study, the ce-Subpathway method was firstly applied to human PAH and MI data sets, respectively. The results showed that most of the pathways identified by ce-Subpathway were well reported to be highly associated with the corresponding diseases.

|
And fewer pathways were not reported to be associated with PAH by curated literatures using ce-Subpathway than the other methods. Compared with the entire pathway identification methods, ce-Subpathway could not only locate key subpathways associated with the given condition but also identify key regions representative of entire pathways. These key subpathways contained fewer genes than entire pathways, allowing researchers to use alternative low-throughput technologies to confirm the local subpathway regions related to specific condition. Compared with the non-ceRNA-mediated pathway identification methods, ce-Subpathway was able to identify some additional ceRNA-mediated functional subpathways. These subpathway regions contained stronger ceRNA interactions, which played important roles in evaluating the influence of competing endogenous mechanism on key subpathway identification.
This study focused on identifying subpathways by setting the threshold ω between key nodes. The threshold ω could indirectly influence the size of the located subpathway. As ω decreased, the size of the subpathway would increase because lenient distance similarity tended to merge more nodes into the same subpathway.
Key nodes within entire pathways would also tend more to be in the same two independent training sets. Therefore, the overall discriminatory powers of the ceRNA-mediated subpathways were further illustrated.
In the recent years, some studies have connected ceRNAs with pathways, but they usually annotated the genes associated with ceR-NAs into some given pathways and got some simple results of pathway annotation. These studies did not consider the joint power of ceRNAs/DE genes and pathway topology. They also lacked systematic algorithm or available software for subpathway identification. The ce-Subpathway method seemed to be the first implementation of ceRNA interactions within the pathways for the identification of ceRNA-mediated functional subpathways effectively.
Still, the ce-Subpathway method has some limitations. The benchmarks used throughout to compare the performances of different pathway identification methods may be biased. However, we know that the gold standards for expected pathways associated with diseases are not clearly defined at present. We suggest that multiple integrative methods for pathway identification might need to be used in further research.

CONFLI CT OF INTERESTS
The authors declare that they have no competing financial interests.