An in silico Approach for Integrating Phenotypic and Target‐based Approaches in Drug Discovery

Abstract Phenotypic and target‐based approaches are useful methods in drug discovery. The phenotypic approach is an experimental approach for evaluating the phenotypic response. The target‐based approach is a rational approach for screening drug candidates targeting a biomolecule that causes diseases. These approaches are widely used for drug discovery. However, two serious problems of target deconvolution and polypharmacology are encountered in these conventional experimental approaches. To overcome these two problems, we developed a new in silico method using a probabilistic framework. This method integrates both the phenotypic and target‐based approaches to estimate a relevant network from compound to phenotype. Our method can computationally execute target deconvolution considering polypharmacology and can provide keys for understanding the pathway and mechanism from compound to phenotype, thereby promoting drug discovery.


Introduction
Phenotypic and target-based approaches, developed in chemical biology, are useful methods for drug discovery [1] ( Figure 1). They are also utilized as an Adverse Outcome Pathway Framework in the field of toxicology. [2] The phenotypic approach is an experimental approach involving techniques such as cell-based assay and in vivo assay for evaluating the phenotypic response of cells or tissues by chemical compounds. These assays are used in drug discovery to search for active compounds that induce phenotypic responses that improve the disease state of cells or tissues. [3] On the other hand, the target-based approach is a rational approach for screening drug candidates targeting a biomolecule that causes diseases. [1b,4] In recent years, with the advancement of high throughput experimental technologies, the target-based approach has become used widely for drug discovery. [1a,b, 5] However, in drug discovery, we have two serious problems, target deconvolution and polypharmacology, that cannot be solved by the phenotypic and target-based approaches [1b,3b, 6] (Figure 1). Target deconvolution is used to identify a target biomolecule responsible for a phenotypic response. In the phenotypic approach, even if we successfully identify an active compound that affects a target phenotype, it is not possible to reveal the target biomolecule on which the hit compound directly acts, and the relation between the target biomolecule and the phenotype. [3b,6a] In addition, because the target-based approach is a method for searching for compounds that act directly on a known target biomolecule, it cannot be applied if the target biomolecule is unknown. Thus, target deconvolution is a bottleneck in phenotypic and targetbased approaches to drug discovery. Several experimental methods have been developed in the chemical biology and chemical genetics research fields to address this problem. [7] In silico approaches have been also developed such as protein-ligand docking, [8] virtual screening of small ligands [9] and inverted virtual screening methods. [10] These methods have a limitation in that they require available target protein structures.
The second critical issue in drug discovery is polypharmacology, which means that "many drugs do not act on only a single target biomolecule, but acts on multiple target biomolecules and affects phenotypes such as efficacy and toxicity". [6] To consider polypharmacology, it is necessary to achieve a target deconvolution for a plurality of target biomolecules that may affect drug efficacy and toxicity. However, it is extremely difficult to perform target deconvolution experiments for multiple biomolecules from the viewpoint of labor and cost. Furthermore, even if target deconvolution for multiple biomolecules can be achieved, it is very hard to design and synthesize a chemical compound that interacts with multiple target molecules by experiment.
To overcome these two problems i. e., target deconvolution and polypharmacology, we developed a new in silico method using the probabilistic framework. This method is based on a machine learning technique that integrates data from compound-target protein interactions obtained from the target-based approach and data from compoundphenotype associations obtained from the phenotypic approach. It estimates a relevant network from compound to phenotype via target proteins. Specifically, the method consists of the following two steps. In the first step, we infer a plurality of protein candidates a compound can target by using a prediction model that is trained on the data of the compound-target protein interactions. In the second step, we select the target proteins related to a phenotype from protein candidates predicted in the first step model using a lasso model constructed by learning the data of compound-phenotype associations. Therefore, we can deduce a plurality of target proteins that can be interacted with by a specific compound and can be related to the phenotypic response. This is the first method for computationally executing target deconvolution considering polypharmacology, which has been unsolved by previous experimental approaches. This method could provide keys for understanding the pathway and molecular mechanism from compound to phenotype, thus promoting drug discovery 2 Materials and Methods

Dataset
We extracted 789,708 compounds that interact with 1,103 target proteins from six families, including G-proteincoupled receptor (GPCR), kinase, ion channel, transporter, a nuclear receptor, and protease from the ChEMBL database. [11] We used compound-protein pairs with a binding affinity of less than 30 μM (at Ki, EC50, and IC50) as active interaction pairs. [12] We defined compound-protein pairs below the set potency threshold (30uM) as noninteraction pairs. We then prepared a data set classified for each protein family and used these data sets as gold standard data of compound-protein interactions (CPIs) in cross-validation (CV) experiments to evaluate the performance of the CPI prediction (Table 1).
We extracted phenotypes that had both more than 100 active compounds as well as inactive compounds from the PubChem database. [13] Consequently, we obtained 34,959,972 compound-phenotype associations (CPAs), including 900,688 compounds and 548 phenotypes. These phenotypes were classified via various assay types (e. g., In vivo, In vitro, Biochemical, Cell-based, and Toxicity). We used the CPAs as a gold standard data in CV experiments to evaluate the performance of the CPAs prediction (Table 2  and Supplementary Table S1).

An in silico Method Using Probabilistic Framework
To formulate the integrating phenotypic and target-based approaches, we generated a probabilistic interpretation for the model. Figure 2 shows a Bayesian network representation for integrating the phenotypic and target-based approaches. We consider three kinds of probabilities to  First, P t ð jdÞ is a conditional probability of a binary vector, t that represents the activities of targets related to a drug with a feature vector, d. This probability is defined by a probabilistic model. Though probabilistic models can be separated into generative models and discriminative models, this type of probability can be modelled using a discriminative model such as logistic regression.
represents a weight parameter matrix, and b k is a bias-parameter vector. Since t k is a binary, its expectation t k can be computed as follows: Next, P pjt ð Þ is a conditional probability of a binary vector, p that represents the activities of a phenotype given a binary vector for targets, t. This probability can also be defined by a discriminative model.
Finally, P pjd ð Þ is the probability of a binary vector, p that represents phenotypes related to a drug given a feature vector, d of the drug.
Let us consider the training parameters related to P t ð jdÞ and P pjt ð Þ from the given datasets: d; t ð Þ 2 D dÀ t and d; p ð Þ 2 D dÀ p . P t ð jdÞ can be trained using D dÀ t by the conventional manner for discriminative models like logistic regression. In contrast, to train P pjt ð Þ from D dÀ p , the following equation is considered: This probability requires the consideration of all combinations related to the targets, which requires exponential time to compute. An effective approach to such models is a mean-field approximation that is realized by replacing the effects from variables with an expectation of them. By using this approximation, the equation above can be rewritten as where P pj � t ð Þ is defined by replacing t in a model of P pjt ð Þ with its expectation � t, computed by a given drug feature vector, d. Since P pjd ð Þ is given as an empirical distribution of data, P pj � t ð Þ can be trained using the KL-divergence between the LHS and RHS probabilities in this formula. The minimization of KL-divergence can be regarded as an optimization of the cross-entropy error as follows: where H 0 is the entropy related to P pjd ð Þ, which does not depend on parameters, and CE represents cross-entropy error. The model, P pj � t ð Þ can be trained using the same supervised training technique by using input, � t and supervised label, p.

Prediction of Compound-protein Interactions and Compound-phenotype Associations
To predict all possible compounds interacting with each protein in the six families, we proposed a method based on linear logistic regression, which is one of the machine learning methods that refers to the CGBVS method. [12,14] Compounds were represented by 894-dimensional descriptors, called DRAGON descriptor (Version. 6.0-2014-Talete srl, Milano, Italy), and proteins were represented by 1,080dimensional descriptors, called PROFEAT descriptor. [15] To predict compounds that are associated with a phenotype, we proposed a prediction method based on linear logistic regression. Compounds are represented by a 1,103-dimensional binary vector, whose elements respectively use 1 or 0 to encode the interaction or un-interaction of each protein.
To predict the possibility of CPIs or CPAs, we used a linear logistic regression with L1-or L2-regression as classifier, and adopted the LIBLINEAR suite of programs [16] (http://www.csie.ntu.edu.tw/~cjlin/liblinear). To select the penalty parameter C, we examined various values (0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1,000, 10,000). The value that yields the highest area under the receiver operating characteristic curve score in the 5-fold CV experiment is selected.

Cross-validation of the Prediction Models
To evaluate the performance of classifiers of the CPIs or CPAs, we performed 5-fold CV experiments using the gold standard datasets. First, we divided the gold standard dataset into five subsets. Second, we used a subset as an evaluating set and used the remaining four subsets as a learning set and constructed a prediction model using the learning set. Finally, we applied the prediction model to all pairs of the evaluating set and calculated the prediction scores.

www.molinf.com
We evaluated the performance of the prediction accuracy using a receiver operating characteristic (ROC) curve. The ROC curve is a plot of the true positive rate against the false positive rate. To evaluate the performance of the proposed methods, we calculated the area under the ROC curve (AUC), which yields a score of 1 for perfect prediction and 0.5 for random prediction.

Selection of Target Proteins by the Predictive Model
To select statistically significant target proteins for a phenotype, we performed feature selection from the prediction model using the following procedure (Supplementary Figure S1): (i) construction of the prediction model using all the 1,103 target proteins; (ii) calculation of the AUC score through a 5-fold CV experiment using the gold standard dataset for the CPAs; (iii) selection of proteins that had a higher weight than the threshold (initially 0); (iv) construction of the prediction model using selected target proteins that had a higher weight than the threshold in the previous procedure; (v) calculation of the AUC scores using the prediction model constructed in the previous procedure; (vi) evaluating whether the AUC was less than 95 % of the original AUC score. We raised the threshold by 0.02 for each cycle from 0 and evaluated prediction accuracy using the selected proteins by the 5-fold CV and calculated the AUC score. The selected proteins were defined as statistically significant proteins if the AUC score was less than 95 % of the AUC score of all proteins. The selected target proteins were important to their phenotype at that time. Significant proteins for each phenotype were selected using these analysis procedures.

Enrichment Analyses of the Selected Proteins
The gene ontology biological process term, enrichment analysis, was performed using statistically significant proteins. [17] We performed a statistical overrepresentation test of the selected proteins using the PANTHER website [18] (http://www.pantherdb.org/).

Overview of an in silico Approach Integrating Phenotypic and Target-based Approaches
We proposed a new in silico method using the probabilistic framework that aims at target deconvolution considering polypharmacology in drug discovery. Figure 3 shows the Figure 3. Workflow of the proposed method. Our proposed method comprises of two steps. In the first step, the method predicts compound-target protein interactions. In the second step, the method selects target proteins related to a phenotype.

www.molinf.com
workflow of the proposed method; it comprises two steps, corresponding to the target-based and phenotypic approaches (Figure 1). In the first step (corresponding to the target-based approach), the procedure for the prediction of CPIs is divided into the following four steps: (i) acquisition of known CPIs from the ChEMBL database [11] (Table 1); (ii) calculation of the compound and protein feature vectors (descriptors); (iii) construction of CPIs prediction model using known interaction information to infer unknown data due to lack of CPIs; (iv) prediction of unknown CPIs using the constructed prediction model. In the second step (corresponding to the phenotypic approach), the procedure for the selection of the target proteins related to a phenotype is divided into the following four steps: (v) acquisition of known CPAs from the PubChem database [13] ( Table 2); (vi) using the CPIs as feature vectors (descriptors); (vii) construction of the prediction model to predict CPAs using the descriptors; (viii) selection of target proteins related to a phenotype.

Performance Evaluation of the Compound-protein Prediction Models
Though CPI information is registered in databases, many CPIs still remain unknown. In the first step of the proposed method, we constructed a CPI prediction model to predict unknown interaction information (Figure 3). To evaluate the performance of the CPI prediction model using L1-and L2regularized logistic regression algorithms, we employed the gold standard data from the ChEMBL database for the CPIs. We performed 5-fold CV experiments. Figure 4 shows the ROC curve ( Figure 4A) and the area under the ROC curve (AUC) score ( Figure 4B) in the 5-fold CV experiments for the gold standard data for CPIs in each protein family using L1-and L2-regularized logistic regression algorithms. In all protein families except the GPCR family, the AUC score for the L1-regularized logistic regression algorithm was higher than that of the L2regularized logistic regression algorithm. The GPCR family and the transporter family showed high values of AUC scores (0.9299 and 0.9406, respectively). The nuclear receptor family had the lowest AUC score (0.8251) among the six families. Our proposed methods exhibited high prediction accuracy because the AUC scores for all protein families were higher than 0.8. These results suggest that the constructed prediction models had high accuracy for CPIs prediction.

Performance Evaluation of the Compound-phenotype Prediction Models
In the second step of the proposed method, we constructed a CPA predictive model to predict unknown association information. We used the interactions of the target protein as a feature vector of a compound and predicted potential CPIs using the constructed prediction model. To evaluate the performance of the proposed method comparing a random method, we performed 5-fold CV using the gold standard data for CPAs from the PubChem database. Figure 5 shows the results of the evaluation for each phenotype using the 5-fold CV experiment. Figure 5A shows ROC curves of the top 10 AUC phenotypes while Figure 5B shows the AUC score distribution of the CPA prediction models. Details of all predictive results are shown in Supplementary Table S2. In the random method, the prediction results were shown to be random for nearly all the phenotypes (AUC score: 0.5). The AUC score distribution of the proposed method was shifted to a high value compared to the random method. This result suggests that integrating the phenotypic and target-based approaches improves the performance.

Full Paper
www.molinf.com

Extraction of the Statistically Significant Proteins by a Predictive Model
We applied L1-and L2-regularized logistic regression model to infer target proteins from the compound-phenotype network. In the previous section, we constructed the CPA prediction models by L1-and L2-regularized logistic regression models. In each method, we inferred target protein features with positive weights in the predictive model. Figure 6A shows a histogram of the number of extracted proteins (Supplementary Table S3) while Figure 6B shows a histogram of AUC scores for L1-and L2-regularization (Supplementary Table S4). It was found that in most phenotypes, L1-regularized logistic regression inferred a smaller number of features compared to L2-regularized logistic regression ( Figure 6A). This result suggests that L1regularization was more effective in reducing the number of proteins with positive weight. Next, histograms of the AUC scores for both the L1-and L2-regularized logistic regression model exhibited a similar tendency. The average prediction accuracy of L1-regularized regression is 0.6349 while that of L2-regularized regression is 0.6361. These results indicate that L1-regression predicted a smaller number of extracted proteins while maintaining analytical accuracy, compared to L2-regression.
Furthermore, we performed additional analysis to evaluate the biological interpretation of the estimated target proteins of the phenotype. Specifically, the threshold