SNMDA: A novel method for predicting microRNA‐disease associations based on sparse neighbourhood

Abstract miRNAs are a class of small noncoding RNAs that are associated with a variety of complex biological processes. A growing number of studies have shown that miRNAs are closely related to many human diseases. The prediction of associations between miRNAs and diseases has thus become a hot topic. Although traditional experimental methods are reliable, they can identify only a limited number of associations because they are time‐consuming and expensive. Consequently, great efforts have been made to predict reliable disease‐related miRNAs with computational methods. In this study, we present a novel approach to predict potential microRNA‐disease associations based on sparse neighbourhood. Specifically, our method takes advantage of the sparsity of the miRNA‐disease association network and integrates the sparse information into the current similarity matrices for both miRNAs and diseases. To demonstrate the utility of our method, we evaluated it with global LOOCV, local LOOCV and five‐fold cross‐validation, obtaining AUCs of 0.936, 0.882 and 0.934, respectively. Three types of case studies on five common diseases further confirm the performance of our method in predicting unknown miRNA‐disease associations. Overall, the results show that SNMDA can effectively predict potential associations between miRNAs and diseases.


| INTRODUCTION
miRNAs are a class of small noncoding RNAs that are associated with a variety of complex biological processes. 1 A growing number of studies have shown that miRNAs play a vital role in various biological processes that are essential for human life, including cell proliferation, differentiation, ageing and apoptosis. [2][3][4][5] Meanwhile, evidence has demonstrated that miRNAs are related to a number of common neoplasms, such as breast neoplasms, 6 lung neoplasms 7 and prostate neoplasms. 8 Therefore, research on the prediction of potential associations between miRNAs and diseases provides new opportunities to study the molecular mechanisms of diseases. During the past few years, a large number of associations have been confirmed by traditional experiments. 9,10 Although reliable, experimental methods are generally time-consuming and expensive. As a result, effective computational methods are urgently needed to uncover potential associations between miRNAs and diseases.
associations by integrating the disease phenotype similarity network, miRNA functional similarity network and known phenome-microRNAome network to build a heterogeneous network. Based on weighted K most similar neighbours, Xuan et al 13 presented HDMP to predict the associations between miRNAs and diseases. The miRNA functional similarity was calculated from disease terms and the disease phenotype similarity. Considering the fact that the accuracy of local network similarity measures is lower than that of global network similarity measures, [14][15][16][17] Chen et al 18  into the construction of similarity matrices. Specifically, WBSMDA calculated a within-score and a between-score, and combined them to obtain a final score for miRNA-disease association prediction. Using the same data, they further proposed another method named HGIMDA. 23 They first constructed a heterogeneous graph, and then implemented an iterative process on the graph to discover the relationships between miRNAs and diseases. HGIMDA was shown to be fast and effective compared to the aforementioned methods.
Recently, many path-based methods were proposed to predict miRNA-disease associations. Based on the lengths of different walks, the KATZ model was originally used in social networks. Zou et al 24  They constructed a heterogeneous network and applied a depth-first search algorithm to uncover the potential associations between miRNAs and diseases. By taking the network topological structure into account, PBMDA achieved remarkable performance. Nevertheless, the search for paths of a certain length can be extremely time-consuming in large networks. Recently, Chen et al 27 proposed a method, GIMDA, to predict miRNA-disease associations, in which the relatedness score of a miRNA to a disease was calculated by measuring the graphlet interactions between two miRNAs or two diseases. They 28 also proposed another model called NDAMDA, based on network distance, to predict miRNA-disease associations. In that study, they considered not only the direct distance between two miRNAs (diseases), but also the average distance to other miRNAs (diseases).
Several machine learning-based models were also developed to predict potential miRNA-disease associations. Jiang et al 29  which integrated different omics data and applied regularized least squares to discover the relationship between miRNAs and diseases.
Recently, Chen et al 32 proposed DRMDA, which uses a stacked autoencoder, a greedy layer-wise unsupervised pretraining algorithm and an SVM to identify miRNAs related to diseases. They also proposed two other machine learning-based methods, MKRMDA and EGBMMDA. 33,34 Specifically, MKRMDA could automatically optimize the combination of multiple kernels for diseases and miRNAs based on Kronecker regularized least squares. In EGBMMDA, an extreme gradient boosting machine was applied to identify miRNA-disease associations. EGBMMDA achieved a high prediction accuracy in the framework of cross-validation.
Although existing computational methods have made outstanding contributions in this field, there is still room for further improvement.
In this study, we present a reliable method based on Sparse Neighborhood to predict the MiRNA-Disease Associations (SNMDA).
SNMDA mainly consists of three steps. First, we use sparse reconstruction to obtain the reconstructed similarity matrices for both miRNAs and diseases by considering the neighbourhood information. Second, we integrate the similarity information to construct a similarity network for miRNAs and for diseases, respectively. Third, we predict the potential miRNA-disease associations through label propagation.

| Known miRNA-disease associations
The Human microRNA Disease Database (HMDD) 38 is a database containing experimentally verified miRNA-disease associations. The known associations used in this paper were downloaded from HMDD v2.0 (http://www.cuilab.cn/hmdd), the latest version at the time of this study. After filtering, 5430 associations between 383 diseases and 495 miRNAs were obtained. For convenience, we define an adjacency matrix A to describe the known miRNA-disease associations. For a given disease i and a miRNA j, A(i, j) = 1 if i is known to be related to j, and A(i, j) = 0 otherwise.
Our goal is to predict which of the unobserved pairs, that is, the entries with A(i, j) = 0, are true miRNA-disease associations.
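The construction of the adjacency matrix A can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the three (disease, miRNA) pairs are hypothetical stand-ins for filtered HMDD records, and the variable names are assumptions.

```python
import numpy as np

# Hypothetical toy records standing in for filtered HMDD v2.0 entries.
known_pairs = [
    ("breast neoplasms", "hsa-mir-21"),
    ("breast neoplasms", "hsa-mir-155"),
    ("lung neoplasms", "hsa-mir-21"),
]

# Fix an ordering of diseases (rows) and miRNAs (columns).
diseases = sorted({d for d, _ in known_pairs})
mirnas = sorted({m for _, m in known_pairs})
d_index = {d: i for i, d in enumerate(diseases)}
m_index = {m: j for j, m in enumerate(mirnas)}

# A[i, j] = 1 if disease i is known to be related to miRNA j, else 0.
A = np.zeros((len(diseases), len(mirnas)), dtype=int)
for d, m in known_pairs:
    A[d_index[d], m_index[m]] = 1
```

With the real data, A would have 383 rows and 495 columns and contain 5430 ones.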

| miRNA functional similarity
According to a previous study, 39 the miRNA functional similarity was calculated based on the assumption that functionally similar miRNAs tend to be associated with similar diseases. Building on those results, we directly downloaded the miRNA functional similarity from http://www.cuilab.cn/files/images/cuilab/misim.zip. A matrix MFS was built to represent the similarity of miRNAs, where MFS(i, j) represents the similarity score between miRNA i and miRNA j. The larger MFS(i, j) is, the more functionally similar the two miRNAs are.

| Disease semantic similarity
The MeSH database provides a strict classification system for diseases. Each disease can be described as a directed acyclic graph (DAG). 39 A DAG is made up of nodes and links. For a given disease d, DAG = (d, T(d), E(d)), where T(d) represents the ancestor nodes of d together with d itself, and E(d) is the set of links of d. For a disease t in T(d), its contribution to disease d can be calculated as follows 25 : We define the contribution of d to itself as 1 while others are 0.5. Therefore, we can use the following formula to calculate the semantic value of d.
Then, we calculate the semantic similarity between disease a and disease b through the following formula.
represents the contribution of disease t to disease b. From Equation (3), the semantic similarity of a and b depends on the number of diseases they have in common. The larger this number is, the greater the similarity will be. By calculating the semantic similarity of each disease pair according to Equation (3), we obtain a matrix DSS, where DSS(i, j) represents the semantic similarity between disease i and disease j.

FIGURE 1. An overall workflow of SNMDA to predict miRNA-disease associations
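The semantic similarity computation described in this section can be illustrated on a toy DAG. Equations (1) to (3) are not reproduced in the text above, so the sketch below assumes a widely used formulation: the contribution of a disease to itself is 1, the contribution of an ancestor is the largest value obtained by multiplying a descendant's contribution by a decay factor of 0.5, the semantic value is the sum of contributions over T(d), and the similarity is the summed contributions of shared ancestors divided by the sum of the two semantic values. The DAG, the names `parents`, `DELTA`, `contributions`, `semantic_value` and `semantic_similarity` are all hypothetical.

```python
# Toy DAG (hypothetical): each disease maps to its direct parents.
parents = {
    "a": ["p"],
    "b": ["p"],
    "p": ["root"],
    "root": [],
}
DELTA = 0.5  # assumed decay factor per DAG level


def contributions(d):
    """Return {t: contribution of t to d} for every t in T(d)."""
    contrib = {d: 1.0}  # a disease contributes 1 to itself
    frontier = [d]
    while frontier:
        node = frontier.pop()
        for parent in parents[node]:
            value = DELTA * contrib[node]
            # Keep the largest contribution over all paths to this ancestor.
            if value > contrib.get(parent, 0.0):
                contrib[parent] = value
                frontier.append(parent)
    return contrib


def semantic_value(d):
    """Sum of the contributions of all diseases in T(d)."""
    return sum(contributions(d).values())


def semantic_similarity(a, b):
    """Shared-ancestor contributions normalized by both semantic values."""
    ca, cb = contributions(a), contributions(b)
    shared = set(ca) & set(cb)
    numerator = sum(ca[t] + cb[t] for t in shared)
    return numerator / (semantic_value(a) + semantic_value(b))
```

Under these assumptions, two sibling diseases that share a parent and a root have a similarity below 1, while any disease has similarity 1 with itself.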

| SNMDA
As described in the first section, SNMDA can be divided into three steps, of which the first is the key to our approach. Specifically, we reconstruct the miRNA similarity (RMS) and the disease similarity (RDS) by taking the sparse neighbourhood into account. An overall workflow is illustrated in Figure 1. The details of SNMDA are as follows.

| Feature representation
In general, the sparse neighbourhood representation is constructed based on feature vectors. Therefore, the miRNAs and diseases are required to be in the form of feature vectors. Here, "interaction profile" 36,37 is adopted to describe the feature vectors according to the known miRNA-disease associations. An example is given in Figure 2.
As shown in Figure  respectively. If miRNA M1 is known to be related to disease D1, the corresponding entry in the adjacency matrix of the interaction network is 1; otherwise it is 0. Each column represents the feature vector of one miRNA, and each row represents the feature vector of one disease. For example, the feature vector of M1 can be represented as (1, 1, 0, 1, 0) and that of D1 as (1, 0, 1, 0, 0).
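The interaction profiles can be read directly off the adjacency matrix, as the following sketch shows. Only the first row and the first column match the feature vectors of D1 and M1 quoted above; all remaining entries are hypothetical filler, since the full example matrix is not given in the text.

```python
import numpy as np

# Rows are diseases D1..D5, columns are miRNAs M1..M5.
# Row 0 and column 0 match the example vectors for D1 and M1;
# the other entries are made up for illustration.
A = np.array([
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 1],
])

disease_profiles = A    # row i is the feature vector of disease i
mirna_profiles = A.T    # row j is the feature vector of miRNA j (column j of A)

print(mirna_profiles[0])    # feature vector of M1
print(disease_profiles[0])  # feature vector of D1
```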

| Reconstruction of similarity for miRNAs and diseases
Generally, the functional similarity of miRNAs and the semantic similarity of diseases are used directly to predict the relationships between miRNAs and diseases. However, these similarities are still far from complete, since many of the associations have not yet been uncovered. To address this limitation, we first carried out a degree distribution analysis for the constructed miRNA-disease association network (Figure 3). Power-law distributions were observed for both known disease-associated miRNAs and known miRNA-associated diseases. In other words, most diseases and most miRNAs have only very few known associations, which makes the heterogeneous network constructed from the known miRNA-disease associations very sparse. Therefore, how to take advantage of this sparsity to further improve the prediction accuracy of the associations between miRNAs and diseases is a meaningful task.

FIGURE 2. An example of feature representation

FIGURE 3. Degree distribution analysis for the constructed miRNA-disease association network
In this section, we present a novel method to integrate the sparse information into the existing similarity information by reconstructing the miRNA similarity and disease similarity with sparse representation. Sparse representation has received extensive attention in pattern recognition and machine learning. [40][41][42] Before calculating the reconstructed similarity, we first briefly introduce the definition of sparse neighbourhood and sparse reconstruction.

Sparse neighbourhood
The sparse neighbourhood of the sample x i (i = 1, 2,…, n) is defined as follows. First, set a parameter ε (ε > 0) as the threshold. Then, compare the reconstructed coefficients with parameter ε. If recon-

Sparse reconstruction
Suppose we have an underdetermined linear equation

$$x = D\alpha \tag{4}$$

where sample x is an n-dimensional vector to be reconstructed, and D represents an overcomplete dictionary. α is a coefficient vector whose entries represent the correlation scores of sample x with other samples. Our goal is to calculate the vector α. We can solve this problem by optimizing the following formula:

$$\min_{\alpha} \|\alpha\|_0 \quad \text{s.t.} \quad x = D\alpha \tag{5}$$

where the L0-norm is the number of nonzero elements in the vector. However, finding the sparsest solution under the L0-norm is known to be an NP-hard problem. As the L1-norm is the closest convex form to the L0-norm, 43 a common approach is to replace the L0-norm with the L1-norm. 43 Consequently, the problem becomes the following optimization problem:

$$\min_{\alpha} \|\alpha\|_1 \quad \text{s.t.} \quad x = D\alpha \tag{6}$$

where the L1-norm is the sum of the absolute values of the elements of a vector. We can obtain a relevance score matrix by solving Equation (6). Considering that many scores in the matrix are very small, we select only the sparse neighbourhood of each sample and use it to reconstruct that sample by Equation (6).
Eventually, we obtain a new matrix that is reconstructed from the sparse neighbourhood of each sample.
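To make the L1-based reconstruction concrete, the sketch below computes a sparse coefficient vector for one sample and then thresholds it with ε to obtain a sparse neighbourhood. It is an illustration under stated assumptions rather than the solver used in this study: it minimizes the penalized form 0.5·||x − Dα||² + λ·||α||₁ with the iterative shrinkage-thresholding algorithm (ISTA) instead of the constrained L1 problem, and the names `sparse_code`, `sparse_neighbourhood`, `lam` and `eps`, as well as the toy dictionary, are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Element-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def sparse_code(x, D, lam=0.05, n_iter=500):
    """ISTA for min_a 0.5 * ||x - D a||_2^2 + lam * ||a||_1 (assumed relaxation)."""
    step = 1.0 / (np.linalg.norm(D, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ a - x)              # gradient of the least-squares term
        a = soft_threshold(a - step * grad, step * lam)
    return a


def sparse_neighbourhood(alpha, eps=1e-3):
    """Indices of samples whose reconstruction coefficient exceeds eps in magnitude."""
    return np.flatnonzero(np.abs(alpha) > eps)


# Toy dictionary: each column is the feature vector of one "other" sample.
D = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0],
])
x = D[:, 1].copy()  # the sample to reconstruct coincides with column 1

alpha = sparse_code(x, D)
neighbours = sparse_neighbourhood(alpha)
```

In this toy case the coefficient vector concentrates on the single column that matches x, so the sparse neighbourhood contains only that sample. In the setting of this paper, x would be the interaction profile of one miRNA (or disease) and the columns of D the profiles of the remaining miRNAs (or diseases).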
We have introduced sparse neighbourhood and sparse reconstruction. Next, we will introduce how to compute the reconstructed miRNA similarity and disease similarity. In the previous section, miRNAs and diseases have been represented as feature vectors and were regarded as points which were projected into a feature space, respectively. In our method, we assume that points are linearly arranged in the feature space, and every point can be reconstructed by other points.

| Integration of similarity information
After RDS and RMS were obtained, we integrated them into existing similarity matrices. The final miRNA similarity matrix (FMS) and final disease similarity matrix (FDS) were then used to predict the According to Equation (7), we constructed a miRNA similarity network where nodes are miRNAs and edges represent their similarity.
A disease similarity network was constructed in the same way according to Equation (8).

| Label propagation
Label propagation was applied to both the miRNA and the disease network to obtain the prediction results. The labels were initialized with the known miRNA-disease associations and were updated through label propagation. First, labels were propagated in the miRNA similarity network. In the process of label propagation, each point retains the information from its neighbours and receives its initial label information. Parameter α (0 < α < 1) controls the rate of retaining the information from the neighbours, while 1 − α represents the probability of receiving the initial label information. Therefore, the iteration equation can be written as follows: According to previous studies, 44,45 Equation (9) is guaranteed to converge if FMS is properly normalized by Equation (10): where D is a diagonal matrix whose (i, i)-th element equals the sum of the i-th row of FMS. FDS was normalized in the same way.
Therefore, Equation (9) was rewritten in the following form: Equation (11) was then used to update the label information for each miRNA until convergence. F M (t + 1) represents the label matrix in the (t + 1)-th iteration. Y is the initial label matrix and F(0) = Y. For the disease similarity network, the iteration equation can be written as follows: Taken together, the final correlation score matrix is calculated by: Here, β was set to 0.5.
The complete process of SNMDA is outlined in Algorithm 1.
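The label propagation step can be sketched as follows. Several details are assumptions, because the corresponding equations are not reproduced above: the similarity matrix is normalized symmetrically as D^{-1/2} S D^{-1/2}, the update is F(t+1) = α·S·F(t) + (1 − α)·Y, the value α = 0.5 is a placeholder, and the two propagated matrices are combined as β·F_M^T + (1 − β)·F_D with A oriented as disease × miRNA. Only β = 0.5 comes from the text. The function names and the toy matrices are hypothetical.

```python
import numpy as np


def normalize(S):
    """Symmetric normalization D^{-1/2} S D^{-1/2} (assumed form of the normalization)."""
    d = S.sum(axis=1)
    inv_sqrt = np.zeros_like(d, dtype=float)
    inv_sqrt[d > 0] = 1.0 / np.sqrt(d[d > 0])
    return S * inv_sqrt[:, None] * inv_sqrt[None, :]


def propagate(S, Y, alpha=0.5, tol=1e-8, max_iter=1000):
    """Iterate F <- alpha * S_norm @ F + (1 - alpha) * Y until the update is below tol."""
    S_norm = normalize(S)
    Y = Y.astype(float)
    F = Y.copy()                      # F(0) = Y
    for _ in range(max_iter):
        F_next = alpha * S_norm @ F + (1 - alpha) * Y
        if np.abs(F_next - F).max() < tol:
            return F_next
        F = F_next
    return F


# Toy inputs (hypothetical): 2 diseases, 3 miRNAs.
A = np.array([[1, 0, 1],
              [0, 1, 0]])             # disease x miRNA known associations
FMS = np.array([[1.0, 0.6, 0.2],
                [0.6, 1.0, 0.4],
                [0.2, 0.4, 1.0]])     # final miRNA similarity
FDS = np.array([[1.0, 0.3],
                [0.3, 1.0]])          # final disease similarity

beta = 0.5
F_M = propagate(FMS, A.T)             # miRNA x disease scores
F_D = propagate(FDS, A)               # disease x miRNA scores
scores = beta * F_M.T + (1 - beta) * F_D
```

Because the normalized similarity matrix has spectral radius at most 1 and α < 1, the iteration converges to (1 − α)(I − α·S)^{-1}·Y, which offers a direct way to check an implementation on small inputs.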

| Evaluation
We applied leave-one-out cross-validation (LOOCV) and 5-fold cross-validation to test the prediction ability of our method. LOOCV can be conducted in two different ways: global and local LOOCV. In the framework of global LOOCV, each known miRNA-disease association was left out in turn as the test sample and the other known associations were regarded as training samples. 46 After each prediction, the test sample was ranked against all the unconfirmed miRNA-disease associations. If the rank of the test sample was higher than a given threshold, it was marked as a successful prediction. In comparison, in the framework of local LOOCV, each known miRNA associated with a given disease was left out in turn as the test sample and the test sample was ranked only against the unconfirmed associations of this specific disease. 26 Both frameworks were repeated 5430 times, once for each known association. In addition, 5-fold cross-validation was also implemented to evaluate our method.
In 5-fold cross-validation, all the known miRNA-disease associations were divided into five disjoint subsets. Each subset was taken as test
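The difference between global and local LOOCV can be sketched in a few lines. This is an illustration only: `toy_score` is a hypothetical stand-in scorer based on co-occurrence counts and is not SNMDA, the toy matrix is made up, and the per-sample statistic (the fraction of unconfirmed pairs that the held-out association outranks, with ties counted as one half) is one simple way to summarize ranks; the ROC and AUC procedure used in this study may differ in detail.

```python
import numpy as np


def toy_score(A_train):
    """Hypothetical stand-in scorer (NOT SNMDA): two-step co-occurrence counts."""
    return A_train @ A_train.T @ A_train


def loocv(A, score_fn, local=False):
    """For each known association, hide it, rescore, and rank it against unconfirmed pairs."""
    unknown = (A == 0)
    fractions = []
    for i, j in zip(*np.nonzero(A)):
        A_train = A.copy()
        A_train[i, j] = 0                      # leave the test association out
        scores = score_fn(A_train)
        s = scores[i, j]
        if local:
            negatives = scores[i][unknown[i]]  # unconfirmed miRNAs of disease i only
        else:
            negatives = scores[unknown]        # all unconfirmed pairs
        wins = (negatives < s).sum() + 0.5 * (negatives == s).sum()
        fractions.append(wins / negatives.size)
    return np.array(fractions)


# Toy disease x miRNA matrix (hypothetical); every row has unconfirmed entries.
A = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1]])

global_fractions = loocv(A, toy_score, local=False)
local_fractions = loocv(A, toy_score, local=True)
```

The mean of each array gives an AUC-style summary for the corresponding framework. With the data of this study, each loop would run 5430 times.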

| Case study
In this section, we carried out three types of case studies to further validate the effectiveness of SNMDA. For the first type of case study, we applied SNMDA to predict novel miRNA-disease associations for three selected diseases, namely breast neoplasms, lung neoplasms and prostate neoplasms, based on the known associations from HMDD v2.0. The prediction results for each disease were verified with two databases, PhenomiR 47 and dbDEMC, 48 both of which provide differentially expressed miRNAs for certain diseases.
Breast neoplasms is a common disease in women and also a serious threat to the health of women. 49,50 In the early years, the mor- found that many miRNAs are associated with breast neoplasms by clinical experiments, such as mir-155 and mir-21. 6 We used our method to predict the candidate miRNAs for breast neoplasms, and we listed the top 50 predicted candidate miRNAs ( Table 1). As a result, 49 of the top 50 candidate miRNAs were successfully verified. For example, hsa-mir-150 (1st in Table 1) and hsa-mir-130a are closely related to breast neoplasms. Hsa-mir-150 can promote human breast neoplasms growth 51 while hsa-mir-130a (3rd in Table 1) could suppress breast neoplasms cell migration and invasion. 52 The only unconfirmed miRNA was hsa-mir-507. As a matter of fact, Jia et al 53 have reported that hsa-mir-507 inhibits the migration and invasion of human breast neoplasms cells. Our prediction results provided new evidence for its role in the pathogenesis of breast neoplasms.
Lung neoplasms are malignant tumours that pose the greatest threat to human health and life. They are also among the fastest growing neoplasms in terms of incidence and mortality. 7 Increasing evidence suggests that miRNAs can not only be used to classify lung cancer, but also have the potential to serve as biomarkers for early diagnosis and clinical treatment. As shown in the table ( demonstrated that hsa-mir-106a and hsa-mir-106b (2nd in Table 2) can affect the growth and metastasis of lung neoplasms. The result further indicates that our method can effectively predict miRNA-disease associations in lung neoplasms.
Prostate neoplasms are also among the common diseases in men. It is estimated that one in six men in the United States will be diagnosed with prostate cancer in their lifetime, with the likelihood increasing with age. 54 Studies have shown that miR-182-5p can be used as a marker for the diagnosis of prostate neoplasms and miR-20 plays a vital role in the occurrence of prostate cancer. 55 Therefore, identifying miRNAs related to prostate neoplasms is of great importance. As shown in Table 3 The second type of case study aims to demonstrate the ability of our method to predict new associations for diseases without any known related miRNAs. To this end, we selected another disease, pancreatic neoplasms, for the following analysis. Pancreatic neoplasms are highly malignant tumours of the digestive tract that are difficult to diagnose and treat. 58 Many studies have found that the differential expression of miRNAs in pancreatic neoplasms is closely related to the occurrence of tumours.
Here, we first removed all known entries associated with pancreatic neoplasms in HMDD v2.0, so that all 495 miRNAs were considered candidate miRNAs. Then, we used our model to prioritize these candidate miRNAs and obtained their rankings with respect to pancreatic neoplasms. We found that all of the top 50 predicted miRNAs were confirmed by dbDEMC, PhenomiR or HMDD v2.0 (Table 4). The result indicates that our method can be applied to predict potential associations for diseases without any known related miRNAs.
Finally, we conducted the last type of case study by taking the associations from an older version of HMDD as input and testing whether SNMDA could uncover the associations newly added in the latest version of HMDD. The older version of HMDD contained 1395 associations between 271 miRNAs and 137 diseases. 30 Colorectal neoplasms were selected as the investigated disease. As a result, 49 of the top 50 predicted miRNAs have been confirmed to be related to colorectal neoplasms by HMDD v2.0, PhenomiR and dbDEMC (Table 5). The only unconfirmed miRNA was hsa-mir-205. However, studies have indicated that hsa-mir-205 is also associated with colorectal neoplasms. 59 Taken together, all case studies show that SNMDA can effectively and reliably uncover potential associations between miRNAs and diseases.

| DISCUSSION
As it is unrealistic to make large-scale predictions with traditional experimental methods, identifying the associations between miRNAs and diseases with computational models is still a hot topic, and it remains a challenging task to discover such associations accurately and efficiently. Therefore, we proposed a novel method based on sparse neighbourhood to predict miRNA-disease associations. First, sparse reconstruction was used to obtain a reconstructed miRNA similarity and a reconstructed disease similarity. Second, similarity information was integrated to construct a similarity network for miRNAs and for diseases, respectively. Finally, label propagation was applied to the miRNA similarity network as well as the disease similarity network to obtain the relevance scores for each miRNA-disease association. The prediction ability of our method was further verified by three types of case studies on five common diseases.
The success of our model can be attributed mainly to two reasons. First, the reconstructed miRNA and disease similarities based on the sparse neighbourhood compensate for the incompleteness of current data sets, which stems from the inherent sparsity of the constructed miRNA-disease network. Second, the label propagation process on both the miRNA and the disease network ensures that the known labels are propagated according to the reconstructed similarities. Nonetheless, the performance of SNMDA is sensitive to the quality of the miRNA similarity network and the disease similarity network. More data sources should be integrated into our model to further improve its prediction accuracy.
Moreover, we simply combined the reconstructed similarity matrices with the precalculated similarity matrices by adding them together with equal weights, which may be suboptimal for the overall similarity. In conclusion, we believe that SNMDA could serve as a powerful tool for the prediction of miRNA-disease associations.