Computational drug repositioning using similarity constrained weight regularization matrix factorization: A case of COVID‐19

Abstract Amid the COVID‐19 crisis, we made a sizeable effort to collect a large number of experimentally validated drug–virus association entries from the literature by text mining and built a human drug–virus association database. To the best of our knowledge, it is the largest publicly available drug–virus database so far. Next, we develop a novel weight regularization matrix factorization approach, termed WRMF, for in silico drug repurposing by integrating three networks: the known drug–virus association network, the drug–drug chemical structure similarity network, and the virus–virus genomic sequence similarity network. Specifically, WRMF adds a weight to each training sample to reduce the influence of negative samples (i.e. drug–virus pairs with no known association). A comparison on the curated drug–virus database shows that WRMF performs better than several state‐of‐the‐art methods. In addition, we selected two other public datasets (i.e. Cdataset and HMDD V2.0) to assess WRMF's performance. The case study also demonstrated the accuracy and reliability of WRMF in inferring potential drugs for the novel virus. In summary, we offer a useful tool, comprising a novel drug–virus association database and a powerful method, WRMF, to repurpose potential drugs for new viruses.

is known as a new enveloped RNA betacoronavirus2 named SARS-CoV-2. 6,7 On February 11, 2020, the World Health Organization named the new coronavirus-infected pneumonia "COVID-19". As of July 30, 2020, there were more than 16.85 million COVID-19 infections worldwide and more than 660,000 deaths. However, scientists have not yet found a specific drug that is effective against all variants of SARS-CoV-2. In addition, research teams in several countries are developing vaccines for the prevention and treatment of COVID-19, but the incidence of infection is still rising. Therefore, there is an urgent need to find novel treatment plans for COVID-19. 8,9 The development of a new drug for a disease is a long and expensive process. Drug repositioning is therefore an effective drug discovery strategy, which can greatly reduce time and cost compared with de novo drug discovery. [10][11][12] Drug repositioning has been successfully applied in diseases like cancers. 13,14 However, how to prioritize potential drugs for specific diseases is still a bottleneck for drug repositioning. Research teams in various countries are constantly striving to find existing drugs to treat COVID-19. For example, Draghici S et al. analysed the changes in gene expression, pathways and putative mechanisms induced by SARS-CoV-2 and found that methylprednisolone (MP) could improve outcomes in severe cases of COVID-19. 15 But few drugs have proven effective for COVID-19 so far. 16 Therefore, there is an urgent need for novel computational methods to repurpose drugs for COVID-19.
Computational drug repositioning methods provide new testable hypotheses for repositioning old drugs; they can predict potential drug-target interactions to direct experimental verification and improve drug discovery efficiency. In recent years, many computational association prediction methods have been developed. For example, Iorio et al. proposed a transcriptional network-based approach, which applied network theory and utilized similarity in gene expression profiles following drug treatment for drug repositioning. 17 Sirota et al. developed a systematic computational approach based on compendia of public gene expression data to predict novel therapeutic indications. 18 Peyvandipour et al. proposed a systems biology approach by considering the different roles of genes and their dependencies at the system level. 19 Saberian et al. designed a novel machine learning-based drug repositioning algorithm based on the theory that the distances between a disease and its associated FDA-approved drugs are smaller than those of other disease-drug pairs. 20 Martínez et al. presented a new network-based methodology (called DrugNet) by constructing a heterogeneous network including drugs, proteins, and diseases. 21 Yang et al. proposed a bounded nuclear norm regularization (BNNR) method to complete the drug-disease matrix for the prediction of drug-disease associations. 22 Luo et al. proposed a novel network-based method, called MBiRW, which uses comprehensive similarity measures and the Bi-Random walk (BiRW) algorithm. 23 Zeng et al. integrated ten networks (i.e. one drug-disease, one drug-side-effect, one drug-target, and seven drug-drug networks) and proposed a deep learning-based method (named deepDR), consisting of a multi-modal deep auto-encoder and a collective variational auto-encoder. 24 Li et al. developed a new neural inductive matrix completion method with a graph convolutional network (termed NIMCGCN). 25 NIMCGCN was first utilized to predict miRNA-disease associations and was proven to have great potential in drug repositioning.
The above-mentioned computational prediction approaches are mainly classified as network-based approaches and machine learning-based approaches. As the most typical machine learning-based method, matrix factorization represents drugs and diseases in a shared latent space and reconstructs the drug-disease associations using their latent vectors. Recently, several variants of matrix factorization have also been widely and successfully used in bioinformatics research, 26 such as predicting drug-drug interactions, 27,28 predicting drug side effects, 29 predicting drug-target interactions, 30 identifying drug-disease associations, 31,32 anticancer drug response prediction in cell lines, 33 potential miRNA-disease association prediction, [34][35][36] and imputing the dropout entries of a given single-cell RNA-sequencing expression matrix. 37 However, drug repositioning against a human coronavirus such as SARS-CoV-2 with limited information remains challenging and meaningful.
In this study, we developed a new weight regularization matrix factorization method (WRMF) for drug repositioning against COVID-19 based on similarity constraints, which mainly includes the following four steps: (i) collect experimentally verified drug-virus associations from the literature, (ii) calculate the chemical structure similarity of drugs and the genome sequence similarity of viruses, (iii) build heterogeneous networks based on known drug-virus associations, drug-drug similarity and virus-virus similarity, and (iv) use the similarity constrained weight regularization matrix factorization method to predict drugs most likely to be effective on the virus. Via comprehensive evaluation on 5-fold cross-validation (CV), local leave-one-out-cross-validation (LOOCV), and two additional independent datasets, we found that WRMF achieved higher performance in comparison with several state-of-the-art methods. To fully prove the reliability of WRMF, we further conducted a case study about MERS. The experimental results showed that six of the top ten WRMF-predicted anti-MERS drugs had been confirmed. We expect that WRMF-predicted anti-COVID-19 drug candidates might have a therapeutic effect. In summary, WRMF provides a powerful model to predict new drug-virus associations for accelerating drug repurposing.

| MATERIALS AND METHODS
We give the main idea of WRMF in Figure 1, which mainly includes the following four steps: (i) collect data by searching the literature to construct a data set; (ii) calculate the similarity between viruses and the similarity between drugs; (iii) build a heterogeneous network based on existing data; and (iv) use the similarity constrained weight regularization matrix factorization method on the heterogeneous network to obtain potential viral therapeutics.

| Human drug-virus association network
Since the databases of drug-virus associations are urgently de- Meanwhile, we made a sizeable effort to collect a large number of experimentally validated drug-virus association entries from the literature by text mining and built an experimentally supported human drug-virus association dataset consisting of 34 human infectious viruses, 218 therapeutic drugs, and 451 known human drug-virus associations (i.e. the drug is observed to have a known therapeutic effect on the virus). Compared with DVA, the viruses we collected are mainly human-infecting coronaviruses and RNA viruses. In addition, we included 218 antiviral and broad-spectrum drugs, which is nearly 100 more drugs than DVA. As far as we know, our dataset is the largest in the sense that it contains the largest number of drugs and drug-virus associations.
We define the adjacency matrix of the drug-virus association network as Y, that is, if the drug d(i) is observed to have a therapeutic effect on the virus v(j), the entry Y(i, j) is equal to 1; otherwise, it is 0. The two variables n_d and n_v represent the number of drugs and viruses, respectively. In this study, we integrate the drug-virus association network, the drug-drug similarity network, and the virus-virus similarity network into a heterogeneous network. For the drug-drug similarity network, we measure the similarity of drug pairs by calculating the chemical structure similarity. For the virus-virus similarity network, we evaluate the similarity of virus pairs by calculating the genome sequence similarity. Therefore, the adjacency matrix of the drug-virus heterogeneous network can be defined as:

$$A = \begin{bmatrix} S_d & Y \\ Y^T & S_v \end{bmatrix} \quad (1)$$

The sub-matrix Y represents the collected drug-virus association network, Y^T is the transpose of Y, and S_d and S_v represent the adjacency matrices of the drug-drug similarity network and the virus-virus similarity network, respectively.
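As an illustration of how the three networks can be combined, the following minimal sketch stacks them into one block matrix. The block layout (similarity matrices on the diagonal, Y and its transpose off the diagonal), the function name, and the toy values are assumptions made for this example, not data from the study.

```python
import numpy as np

def build_heterogeneous_adjacency(Y, S_d, S_v):
    """Stack [[S_d, Y], [Y^T, S_v]] into one (n_d + n_v) x (n_d + n_v) matrix."""
    n_d, n_v = Y.shape
    if S_d.shape != (n_d, n_d) or S_v.shape != (n_v, n_v):
        raise ValueError("similarity matrices do not match the shape of Y")
    return np.block([[S_d, Y], [Y.T, S_v]])

# Toy example: 3 drugs and 2 viruses (all values are made up).
Y = np.array([[1, 0], [0, 1], [1, 1]], dtype=float)
S_d = np.array([[1.0, 0.4, 0.2],
                [0.4, 1.0, 0.6],
                [0.2, 0.6, 1.0]])
S_v = np.array([[1.0, 0.3],
                [0.3, 1.0]])

A = build_heterogeneous_adjacency(Y, S_d, S_v)
print(A.shape)  # (5, 5)
```

Because both similarity matrices are symmetric, the resulting matrix is symmetric as well, which is what one would expect of an undirected heterogeneous network.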

| Chemical structure similarity of drugs
There are many algorithms for calculating drug similarity, among which the classic ones are usually based on molecular similarity. 39 In this article, we use the Tanimoto coefficient to express the similarity between drugs. The chemical structure information (SMILES format) was downloaded from the DrugBank database, and the MACCS fingerprint of each drug was calculated using Open Babel v2.3.1. If the MACCS fragment bit strings of two drug molecules have a and b bits set, respectively, and c bits are set in the fingerprints of both drugs, then the Tanimoto coefficient (T) of the drug pair is defined as:

$$T = \frac{c}{a + b - c} \quad (2)$$

T is widely used in drug development and repositioning, and its value ranges from zero (no common bits) to one (all bits are the same).
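The Tanimoto calculation can be illustrated with a short sketch. Here each fingerprint is represented as a Python set of on-bit indices; this representation, the function name, and the example bit sets are assumptions for illustration (real MACCS fingerprints would come from a cheminformatics toolkit).

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    a = len(fp_a)          # bits set in the first fingerprint
    b = len(fp_b)          # bits set in the second fingerprint
    c = len(fp_a & fp_b)   # bits set in both fingerprints
    if a + b - c == 0:
        return 0.0         # convention chosen here for two empty fingerprints
    return c / (a + b - c)

# Made-up fingerprints: 3 shared bits, 5 and 6 bits set in total.
drug_x = {1, 5, 9, 42, 100}
drug_y = {1, 5, 10, 42, 101, 150}
print(tanimoto(drug_x, drug_y))  # 3 / (5 + 6 - 3) = 0.375
```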

| Genome sequencing similarity of viruses
With the development of gene sequencing technology, our understanding of a virus often starts with its sequence. MAFFT is a multiple sequence alignment program for Unix-like operating systems. 40 It offers a range of multiple alignment methods, such as L-INS-i (accurate; recommended for <200 sequences) and FFT-NS-2 (fast; recommended for >2000 sequences). In this study, we use MAFFT to calculate the sequence similarity between viruses, which serves as the virus similarity measure.
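The text does not state how a similarity value is derived from the alignment. One common choice, shown here purely as an assumption, is the fraction of identical positions between two aligned sequences (percent identity). The function name and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Fraction of aligned columns with identical residues.

    Columns where both sequences have a gap are ignored; a gap opposite
    a residue counts as a mismatch. This is one possible definition,
    not necessarily the one used in the study.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(aligned_a.upper(), aligned_b.upper()):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0

# Two made-up aligned fragments: 5 identical columns out of 6.
print(percent_identity("ATG-CA", "ATGTCA"))
```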

| WRMF
The problem of drug repositioning against human coronaviruses such as SARS-CoV-2 can be modelled as a recommendation system that recommends novel indications by filling in the unknown entries of the drug-virus association matrix, which is known as matrix completion.
Matrix completion algorithms have been widely and successfully used in bioinformatics research, such as uncovering lncRNA-disease associations, 41 predicting miRNA-disease associations, [42][43][44] identifying drug-disease associations, [45][46][47] and selecting anti-viral drugs for COVID-19. 48 In our study, there are 451 confirmed human drug-virus associations in the database we collected, which indicates that the known drug-virus association matrix is sparse. Based on the premise that similar drugs tend to treat similar viruses, the hidden factors that control the drug-virus associations are highly correlated, which results in a highly correlated data matrix; thus, the number of underlying independent factors is much smaller than the number of drugs or viruses. In other words, the drug-virus matrix to be completed is low-rank. In fact, many studies have applied matrix completion to similar bioinformatics problems by constructing low-rank matrix approximations consistent with the known association matrix. 22,41,47 Generally, when the matrix is of low rank, the matrix factorization minimization problem can be expressed as:

$$\min_{W,H} \; \lVert Y - W^T H \rVert_F^2 + \lambda_w \lVert W \rVert_F^2 + \lambda_h \lVert H \rVert_F^2 \quad (3)$$

where ‖⋅‖_F represents the Frobenius norm, and λ_w and λ_h are regularization parameters.
In the drug-virus association database, there are many unobserved entries, and we do not know the negative samples (i.e. drug-virus pairs that are truly unassociated). We define a problem with only positive feedback as a one-class problem because only positive samples are available. 49,50 For one-class problems, we propose a novel weight regularization matrix factorization approach, which adds a weight R to each training sample to reduce the influence of unknown samples. R represents the confidence of the drug's preference for the virus. In addition, traditional matrix factorization does not take into account the similarity between viruses and the similarity between drugs. To solve the aforementioned problems, we propose a weight regularized matrix factorization model (WRMF), formalized as follows: Among them, the hyperparameter controls the contribution of positive samples to model training. λ_w, λ_h, λ_1, and λ_2 are the regularization parameters.
Since the WRMF model is a fitting of matrix Y, directly using SGD optimization will face problems of overfitting and training efficiency. 22 Therefore, we use the gradient descent algorithm to learn the model parameters. The specific optimization steps are as follows: According to formula (5) and formula (6), W and H are updated iteratively until the objective function reaches a local minimum. Finally, the predicted drug-virus association matrix is Y* = W^T H. The i-th column of Y* contains the association scores between virus v_i and all drugs. The larger the score, the more likely the drug is associated with the virus.
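To make the idea of weighted, similarity-constrained factorization concrete, here is a minimal sketch trained with plain gradient descent. It is not the authors' formulation: the objective, the weight definition R = 1 + alpha * Y, the use of graph Laplacians for the similarity constraints, and all names and default values are assumptions made for illustration. The assumed objective is 1/2 Σ R_ij (Y_ij − (WᵀH)_ij)² + λ_w/2 ‖W‖² + λ_h/2 ‖H‖² + λ_1/2 tr(W L_d Wᵀ) + λ_2/2 tr(H L_v Hᵀ).

```python
import numpy as np

def graph_laplacian(S):
    """Unnormalized Laplacian D - S of a (symmetrized) similarity matrix."""
    S = (S + S.T) / 2.0
    return np.diag(S.sum(axis=1)) - S

def wrmf_sketch(Y, S_d, S_v, k=2, alpha=5.0, lam_w=0.3, lam_h=0.3,
                lam_1=0.1, lam_2=0.1, lr=0.01, n_iter=2000, seed=0):
    """Weighted regularized MF by full-batch gradient descent (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_d, n_v = Y.shape
    W = 0.1 * rng.standard_normal((k, n_d))   # drug latent factors
    H = 0.1 * rng.standard_normal((k, n_v))   # virus latent factors
    R = 1.0 + alpha * Y                       # assumed weights: positives count more
    L_d, L_v = graph_laplacian(S_d), graph_laplacian(S_v)
    for _ in range(n_iter):
        E = R * (W.T @ H - Y)                 # weighted residual, n_d x n_v
        grad_W = H @ E.T + lam_w * W + lam_1 * W @ L_d
        grad_H = W @ E + lam_h * H + lam_2 * H @ L_v
        W -= lr * grad_W
        H -= lr * grad_H
    return W, H

# Toy data: 4 drugs, 3 viruses (values are made up).
Y = np.array([[1, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)
S_d = np.array([[1.0, 0.9, 0.1, 0.1],
                [0.9, 1.0, 0.1, 0.1],
                [0.1, 0.1, 1.0, 0.9],
                [0.1, 0.1, 0.9, 1.0]])
S_v = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])

W, H = wrmf_sketch(Y, S_d, S_v)
Y_hat = W.T @ H                   # predicted association scores
print(np.round(Y_hat, 2))
```

Column j of `Y_hat` holds the scores of all drugs for virus j, so sorting that column in descending order gives a candidate ranking for that virus.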

| Performance evaluation of WRMF
To evaluate the performance of the algorithm, we used 5-fold CV and local LOOCV. In the 5-fold CV experiment, all known drug-virus associations are randomly divided into five equal and disjoint parts. Each part is then held out in turn as the test set, and the remaining four parts are used as the training set to train the model. The process is repeated five times until all samples have been predicted once. In the local LOOCV experiment, for each virus v_i, we remove all the known associations of the virus v_i and build the prediction model using the remaining data. We then calculate the relevance score of each node in
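The two masking schemes described in this section can be sketched as follows. The function names and the toy matrix are hypothetical, and the sketch only covers how associations are hidden, not how the hidden entries are scored afterwards.

```python
import numpy as np

def five_fold_masks(Y, n_folds=5, seed=0):
    """Yield (Y_train, held_out) pairs; each fold hides a disjoint part of the known 1s."""
    rng = np.random.default_rng(seed)
    pos = np.argwhere(Y == 1)          # (row, col) index pairs of known associations
    rng.shuffle(pos)                   # shuffle the pairs in place along axis 0
    for fold in np.array_split(pos, n_folds):
        Y_train = Y.copy()
        Y_train[fold[:, 0], fold[:, 1]] = 0   # hide this fold's associations
        yield Y_train, fold

def local_loocv_mask(Y, virus_j):
    """Hide every known association of one virus, mimicking a brand-new virus."""
    Y_train = Y.copy()
    Y_train[:, virus_j] = 0
    return Y_train

# Toy matrix: 6 drugs, 4 viruses, 10 known associations (made up).
Y = np.zeros((6, 4))
Y[[0, 0, 1, 2, 2, 3, 4, 4, 5, 5], [0, 1, 1, 2, 3, 0, 1, 3, 2, 3]] = 1

folds = list(five_fold_masks(Y))
print([len(held_out) for _, held_out in folds])  # [2, 2, 2, 2, 2]
```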

| Comparison with the state-of-the-art methods
To evaluate the performance of our proposed WRMF, we compared WRMF with five state-of-the-art association prediction methods. Based on the drug-virus association dataset we constructed, we performed cross-validation on the training dataset to tune the parameters, which were increased from 0.1 to 1 with a step of 0.1, and the values with the best AUC were selected. WRMF achieves the best performance when λ_w = λ_h = 0.3 and λ_1 = λ_2 = 0.1 (see Figure S1).
To ensure a fair comparison, the parameters in the compared approaches are set to the best values obtained by using grid search (see Figure S2). Specifically, like WRMF, we chose the optimal   Figure 2B). Generally, the PR curve shows similar changes to the ROC curve at different thresholds, and the closer the AUPR is to 1, the better the prediction performance. As shown in Figure 2B

| Performance of WRMF on our constructed drug-virus dataset in local LOOCV
Cross-validation probably leads to over-optimistic results because SARS-CoV-2 is a completely new virus, for which no drug associations were known. We therefore performed local LOOCV to further evaluate the performance of WRMF. As can be seen in Figure 3A, the AUC of WRMF is the highest of all methods.
In terms of AUPR (see Figure 3B), we find that WRMF achieves the second-best performance (AUPR is 0.1776) in our constructed drug-virus dataset. The possible reason is that WRMF only uses drug chemical structure and virus genome sequence to calculate drug and virus similarity, while MBiRW considers the influence of known association information on the similarity measures and utilizes comprehensive similarity measures. In summary, WRMF has a good performance in predicting the potential therapeutic drugs of a new virus.
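For readers who want to reproduce the two metrics, the sketch below computes AUC as the fraction of correctly ordered (positive, negative) pairs and approximates AUPR by average precision. Both function names and the toy labels and scores are assumptions for illustration; the study may have used a different implementation.

```python
import numpy as np

def auc_score(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of precision@k taken at the rank of each positive (approximates AUPR)."""
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return precision_at_k[hits].mean()

# Made-up example: positives ranked 1st and 3rd out of 5.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
print(auc_score(labels, scores))          # 5 of 6 pairs ordered correctly
print(average_precision(labels, scores))  # (1/1 + 2/3) / 2
```

With few positives among many unlabelled pairs, AUPR is usually far lower than AUC, which is consistent with the small AUPR values reported for local LOOCV.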

| Performance of WRMF on two different types of datasets
In addition to the drug-virus association dataset collected in our study, we selected more challenging scenarios to assess the generalization ability of WRMF. We compared WRMF with three other matrix factorization and completion methods (i.e. BNNR, CMF and IMC) on two different public datasets, which are the drug-disease association dataset (Cdataset) 23  As shown in Figure 5B, WRMF achieves an AUPR value of 0.4007, outperforming BNNR (0.3720), CMF (0.3807), and IMC (0.2507). Additionally, WRMF identified 656 associations in the top 1000 rankings, while BNNR, IMC, and CMF only predicted 634, 648, and 547 associations, respectively (see Figure 5C). Figure 5 indicates that WRMF performs better than the other comparison methods.
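The top-ranked count reported above can be computed as in the following sketch, which counts how many true associations fall within the k highest-scoring pairs. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def hits_in_top_k(scores, is_true, k):
    """Number of true associations among the k highest-scoring pairs."""
    scores = np.asarray(scores, dtype=float).ravel()
    is_true = np.asarray(is_true, dtype=bool).ravel()
    top = np.argsort(-scores, kind="stable")[:k]
    return int(is_true[top].sum())

# Made-up 2 x 2 score matrix and ground truth.
scores = np.array([[0.9, 0.1],
                   [0.4, 0.8]])
truth = np.array([[1, 0],
                  [1, 0]])
print(hits_in_top_k(scores, truth, k=2))  # 1
print(hits_in_top_k(scores, truth, k=3))  # 2
```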
The results on two different types of datasets prove that WRMF is generally a good model in association prediction.

| Case study: WRMF identified the potential drugs for COVID-19
COVID-19 is a brand-new (i.e. there is no known interaction between COVID-19 and any drug) and zoonotic disease. To further validate the prediction performance of WRMF, we conducted a case study to predict novel anti-COVID-19 drugs from a computational perspective. Specifically, we used the other known drug-virus associations as the input of WRMF and then ranked the predicted scores of the potential anti-COVID-19 drugs. Following previous studies, 45,55 we adopted ClinicalTrials.gov and the Comparative Toxicogenomics Database (CTD) 56 as references to validate whether the predicted drugs for COVID-19 are efficacious. Table 1 shows that eight out of ten drugs (80% success rate) are validated by reliable sources, clinical trials, and previous literature. For example, ribavirin (ranked second) was predicted by WRMF to have an interaction with COVID-19. Such a prediction is supported by ClinicalTrials.gov and CTD. Nitazoxanide (ranked fourth), a broad-spectrum anti-infective drug, can inhibit In addition, chloroquine (ranked third), camostat (ranked fifth), favipiravir (ranked sixth), and remdesivir (ranked eighth) predicted by WRMF have been confirmed by both CTD and clinical trials as promising treatments for COVID-19. In summary, eight out of ten WRMF-predicted anti-COVID-19 drugs were verified by evidence from ClinicalTrials.gov and CTD. This indicates that WRMF offers a useful tool to prioritize potential repurposed drugs for COVID-19.
Second, molecular docking provides valuable information on how ligands bind to specific active sites of macromolecules. It is an economical and modern approach in drug discovery, in which computed ligand-protein interactions can be assessed before synthesis.
Hexachlorophene (ranked first) and N4-Hydroxycytidine (ranked seventh) were subjected to blind docking in both online and offline modes. The Autodock 4.2 package (http://autodock.scripps.edu) was used for offline docking. The X-ray crystal structure of the protein was retrieved from the RCSB protein database (www.rcsb.org). The macromolecule used has PDB ID 6LZG, which is the novel coronavirus spike receptor binding domain complexed with its receptor ACE2. All proteins and ligands were prepared using MGL Tools 1.5.6 and Autodock Tool (ADT). ADT is used to calculate the binding free energy and inhibition constant of the optimal docking complex of the aforementioned proteins. The negative value of the binding free energy further indicates the stability of the complex (Table 2).
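The relation between a binding free energy and an inhibition constant can be illustrated with the standard thermodynamic formula K_i = exp(ΔG / (R·T)). The sketch below assumes ΔG in kcal/mol, T = 298.15 K, and R = 1.98719 cal/(mol·K); the example ΔG value is made up and is not taken from Table 2.

```python
import math

GAS_CONSTANT_CAL = 1.98719   # cal / (mol * K)
TEMPERATURE_K = 298.15       # assumed temperature

def inhibition_constant(delta_g_kcal_per_mol):
    """Estimated inhibition constant in mol/L from a binding free energy in kcal/mol."""
    delta_g_cal = delta_g_kcal_per_mol * 1000.0
    return math.exp(delta_g_cal / (GAS_CONSTANT_CAL * TEMPERATURE_K))

# Hypothetical binding free energy of -7.0 kcal/mol.
ki = inhibition_constant(-7.0)
print(f"{ki * 1e6:.2f} uM")  # roughly 7.4 uM
```

A more negative binding free energy gives a smaller estimated K_i, i.e. a tighter predicted binder.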
Additionally, Figure 6 reveals that the two unproven drugs predicted by WRMF interact with multiple residues on its receptor ACE2 and once again shows that the drugs discovered by WRMF may have an inhibitory effect on COVID-19.

FIGURE 6 The predicted ligand-protein binding mode between the two unconfirmed potential anti-COVID-19 drugs and the receptor ACE2 (angiotensin-converting enzyme 2) using molecular docking

drugs based on our constructed drug-virus dataset (see Figure 7 and Table S1). As shown in Figure 7, some verified drug-virus associations are shown as light blue lines, while potential relationships are shown as magenta lines.
Finally, as the number of people infected with SARS-CoV-2 continues to increase, the drug-virus database is also growing. In order to assess the extensibility and practicality of WRMF, we also applied WRMF to the DVA dataset 38 25 and several matrix factorization-based algorithms: GRMF, 58 GRMC, 59 and WGRMF. 58 The top-10 anti-COVID-19 drugs predicted by these algorithms are listed in Table 3.
We validated the top-10 candidate drugs of these algorithms using ClinicalTrials.gov. The bold font in Table 3 indicates that the predicted candidate drug has been validated by ClinicalTrials.gov. As can be seen, our proposed method WRMF obtains seven ClinicalTrials.gov-validated drugs, more than GRMF, GRMC, and WGRMF. These promising results indicate the practicality of WRMF in predicting potential drugs for COVID-19.

| DISCUSSION
In this study, we proposed a novel in silico drug repositioning approach for uncovering the potential associations between viruses and drugs, termed WRMF. Apart from the known drug-virus association network obtained via literature mining, we integrated one drug-drug chemical structure similarity network and one virus-virus genome sequence similarity network to construct a heterogeneous network, which provides a comprehensive view for screening anti-COVID-19 drug candidates. We have validated the prediction ability of WRMF in terms of five-fold CV, local LOOCV, validation on two additional datasets, and a case study. The results show that our method achieves state-of-the-art performance for repurposing anti-COVID-19 drugs. In future studies, since WRMF is a scalable approach, collecting and incorporating more relevant association data from more databases and literature might improve its power.
We acknowledge several potential limitations of the current study. Although we made a sizeable effort to collect experimentally reported drug-virus associations from the published literature, data quality is not assured and the drug-virus association data may be incomplete. We provided the top 20 WRMF-predicted anti-COVID-19 drugs. State-of-the-art pharmaco-epidemiologic analysis of patient data (e.g. health insurance claims data) and in vitro or in vivo mechanistic studies of the WRMF-predicted anti-COVID-19 candidates are required in the future.
In summary, our findings suggest that in silico drug repurposing could benefit from constraints on drug and viral similarity, matrix