Bruton’s tyrosine kinase has emerged as a potential target for the treatment of B-cell malignancies and autoimmune diseases, and the discovery of Bruton’s tyrosine kinase inhibitors has therefore attracted much attention recently. In this investigation, we introduce a hybrid virtual screening protocol that combines support vector machine model-based virtual screening (SB-VS), pharmacophore model-based virtual screening (PB-VS) and docking-based virtual screening (DB-VS) to retrieve new Bruton’s tyrosine kinase inhibitors from commercially available chemical databases. The performance of the hybrid virtual screening approach was evaluated against a test set. The results showed that the hybrid approach significantly shortened the overall screening time and considerably increased the hit rate and enrichment factor compared with the individual methods (SB-VS, PB-VS and DB-VS) or their pairwise combinations. The hybrid virtual screening approach was then applied to screen several chemical databases, including the Specs (202 408 compounds) and Enamine (980 000 compounds) databases. Thirty-nine compounds were selected from the final hits and have been advanced to experimental studies.
Bruton’s tyrosine kinase (BTK) is a member of the SRC-related TEC family of protein tyrosine kinases (PTK) (1,2). BTK is expressed in all hematopoietic cells except T cells, natural killer cells and plasma cells and is an important participant in many cellular signalling pathways (3–6). For example, BTK is involved in the activation of the stress-activated protein kinases JNK/SAPK1/2 and thereby regulates c-Jun and other transcription factors that are important in cytokine gene activation (7). BTK has also been demonstrated to play crucial roles in the activation of the B-cell antigen receptor (BCR) signalling pathway, whose activation contributes to the initiation and maintenance of B-cell malignancies and autoimmune diseases (8,9). Recent studies have clearly shown that functional inhibition of BTK induced objective clinical responses in dogs with spontaneous B-cell non-Hodgkin lymphoma (10). All of these findings indicate that BTK is a potential target for the treatment of B-cell malignancies, as well as various autoimmune diseases (11,12). Consequently, the discovery of BTK inhibitors has recently attracted much attention.
Currently, many academic institutions and pharmaceutical companies are involved in the discovery of Btk inhibitors, and a number of Btk inhibitors have been reported (13–15). However, no Btk inhibitor has yet been approved for clinical use, and only one, PCI-32765, has recently entered clinical trials in patients with lymphoma (10,15). Therefore, it is still important to identify potent and specific Btk inhibitors as lead candidates for drug development.
In silico virtual screening (VS), an economical and rapid approach to lead discovery, has been widely applied and is becoming a major source of lead compounds in drug discovery (16,17). Several VS methods are now well established, typically including molecular docking-based (18) and pharmacophore-based VS (19). Recently, the support vector machine (SVM), a machine learning method, has also been introduced into virtual screening (20). However, each of these methods is far from perfect when used individually. For example, the docking-based VS (DB-VS) method suffers from inaccurate scoring functions, which often lead to a low hit rate and a low enrichment factor (21). The pharmacophore-based VS (PB-VS) approach, as indicated previously (22), usually has a higher false-positive rate, which stems from insufficient consideration of receptor information, such as the steric restriction imposed by the receptor. Unlike DB-VS and PB-VS, the SVM-based VS (SB-VS) method identifies active compounds using an SVM classifier derived from a set of known active (positive) and inactive (negative) compounds (23). The biggest advantage of SB-VS is its screening speed, which makes it possible to screen a vast chemical library rapidly. However, SB-VS still suffers from a low hit rate and a high false-positive rate, which might originate from its neglect of information about the macromolecular target and the pharmacophore features of small molecules. Each of these VS methods thus has its own advantages and disadvantages, and none may perform optimally in terms of speed and effectiveness when used alone. An alternative is to combine them. Because the inherent limitations of each screening technique are not easily resolved, combining the techniques in a hybrid protocol can help to compensate for these limitations and capitalize on their respective strengths.
In this investigation, we introduce a hybrid protocol of SB-VS, PB-VS and DB-VS to retrieve novel Btk inhibitors from commercially available chemical databases. The models used for SB-VS and PB-VS are first established and validated, and the docking parameters and scoring functions for DB-VS are optimized in advance. The overall performance of the hybrid VS approach is then assessed. Finally, the hybrid VS approach is applied to screen several large chemical libraries, including Specs (202 408 compounds) and Enamine (980 000 compounds). Top hit compounds are selected and advanced to experimental evaluation.
Materials and Methods
A total of 432 compounds, including 370 positives (Btk inhibitors with a half-maximal inhibitory concentration (IC50) <10 μM) and 62 negatives (Btk non-inhibitors or compounds with IC50 >10 μM), were collected from different literature resources (13–15,24–31). All the positive compounds were first grouped into categories based on their scaffolds. We then randomly chose compounds from each scaffold category to form three data sets (see Tables S1–S3 in Supporting Information): 282 positives for a training set, 60 for a test set and 28 for an independent validation set.
Because a sufficient number of negative compounds in the training set is very important for the quality of the SVM model, we adopted a method suggested by Chen et al. (32) to generate putative negatives for the training set. To generate the putative negatives, 18.83 million compounds from the PubChem database were clustered into 8913 compound families by K-means clustering based on their molecular descriptors, which were calculated with Discovery Studio (DS) 2.55 software (Accelrys, San Diego, CA, USA). Then, 16 713 putative negatives were randomly selected from families that do not contain any of the known Btk inhibitors. The final training set contains 282 positives and 16 713 negatives. A test set comprising 60 positives and 62 negatives, called TS1, was adopted for the validation of the SVM model. Another test set, called TS2, is composed of all 370 Btk inhibitors and 2392 decoys (33) from the ZINC library (for the selection of decoys, see Supporting Information) and was used for the validation of the pharmacophore model. The 28 positive compounds, together with 8662 decoys (33) from MDDR (for the selection of decoys, see Supporting Information), formed the third test set, called TS3, which was used for the overall validation of the hybrid VS approach.
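The selection of putative negatives described above can be illustrated with a minimal Python sketch. It assumes that the family (cluster) assignment of each compound is already available; the function name `sample_putative_negatives`, the compound identifiers and the toy data are hypothetical and are not taken from the original workflow.

```python
import random


def sample_putative_negatives(family_of, known_inhibitors, n_samples, seed=0):
    """Randomly pick putative negatives from families with no known inhibitor.

    family_of: dict mapping compound id -> family (cluster) label
    known_inhibitors: set of compound ids known to be active
    """
    # Any family that contains at least one known inhibitor is excluded.
    active_families = {family_of[c] for c in known_inhibitors if c in family_of}
    candidates = sorted(
        c for c, fam in family_of.items()
        if fam not in active_families and c not in known_inhibitors
    )
    rng = random.Random(seed)  # fixed seed so the toy example is repeatable
    return rng.sample(candidates, min(n_samples, len(candidates)))


# Toy data (hypothetical): six compounds in three families; "c1" is an inhibitor.
family_of = {"c1": 0, "c2": 0, "c3": 1, "c4": 1, "c5": 2, "c6": 2}
negatives = sample_putative_negatives(family_of, {"c1"}, n_samples=3)
print(negatives)
```

Compound "c2" can never be selected here because it shares a family with the known inhibitor "c1", which mirrors the idea that compounds structurally related to known actives should not be labelled as negatives.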
A modified SVM modelling method recently proposed by our group (34) was used. This method, termed GA-CG-SVM, combines the SVM with a genetic algorithm (GA) for feature selection and the conjugate gradient (CG) method for parameter optimization. A detailed description of the GA-CG-SVM method can be found in Supporting Information.
The descriptors used in this study were calculated with the DS 2.55 program package. The initial 288 descriptors were preprocessed to reduce redundancy. The following descriptors were removed: (i) descriptors with too many zero values, (ii) descriptors with very small standard deviation values (<0.5%) and (iii) descriptors that are highly correlated with others (correlation coefficients >95%). A total of 138 molecular descriptors remained after the preprocessing. These descriptor values were then scaled to a range of −1 to 1. For the GA, the termination criterion is either that the generation number reaches 200 or that the fitness value does not improve during the last 10 generations. The crossover rate was set to 0.5 and the mutation rate to 0.1. The starting values of the parameters (C, γ) were set to (1890, 1.6).
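The descriptor preprocessing steps listed above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the zero-value cutoff (`max_zero_frac`) is not given in the text and is a placeholder, the standard deviation cutoff of 0.005 is one possible reading of "<0.5%", and the function and descriptor names are hypothetical.

```python
from statistics import pstdev


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def preprocess_descriptors(columns, max_zero_frac=0.9, min_std=0.005, max_corr=0.95):
    """columns: dict mapping descriptor name -> list of values (one per compound)."""
    kept = {}
    for name, col in columns.items():
        # (i) drop descriptors that are zero for most compounds (assumed cutoff)
        if sum(1 for v in col if v == 0) / len(col) > max_zero_frac:
            continue
        # (ii) drop descriptors with a very small standard deviation
        if pstdev(col) < min_std:
            continue
        # (iii) drop descriptors highly correlated with one already kept
        if any(abs(pearson(col, other)) > max_corr for other in kept.values()):
            continue
        kept[name] = col
    # Scale every remaining descriptor linearly to the range [-1, 1].
    scaled = {}
    for name, col in kept.items():
        lo, hi = min(col), max(col)
        scaled[name] = [2 * (v - lo) / (hi - lo) - 1 for v in col]
    return scaled


# Toy descriptor table (hypothetical values for four compounds).
columns = {
    "mw": [300.0, 350.0, 400.0, 450.0],
    "mw_copy": [600.0, 700.0, 800.0, 900.0],  # perfectly correlated with "mw"
    "rare": [0, 0, 0, 0],                     # all zeros
    "flat": [1.0, 1.0, 1.0, 1.0],             # no variance
    "logp": [2.0, 1.0, 3.5, 1.5],
}
scaled = preprocess_descriptors(columns)
print(sorted(scaled))
```

In this toy table only "mw" and "logp" survive the three filters. Which member of a correlated pair is kept depends on the iteration order, a design choice that the text does not specify.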
We assessed the performance of the SVM classification model by the quantity of true positives (TP, true inhibitors), true negatives (TN, true non-inhibitors), false positives (FP, false inhibitors) and false negatives (FN, false non-inhibitors). Sensitivity SE = TP/(TP + FN) and specificity SP = TN/(TN + FP) are the prediction accuracy for the inhibitor and non-inhibitor, respectively. The overall accuracy (Q) is calculated by the equation: Q = (TP + TN)/(TP + TN + FP + FN).
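As a worked example of these three measures, the short Python sketch below evaluates them for the fivefold cross-validation counts reported in this paper (234 of 282 inhibitors and 16 697 of 16 713 non-inhibitors correctly predicted). The function name `classification_metrics` is introduced here for illustration only.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return sensitivity (SE), specificity (SP) and overall accuracy (Q)."""
    se = tp / (tp + fn)                   # accuracy for inhibitors
    sp = tn / (tn + fp)                   # accuracy for non-inhibitors
    q = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
    return se, sp, q


# Fivefold cross-validation counts: 282 positives and 16 713 negatives.
se, sp, q = classification_metrics(tp=234, tn=16697, fp=16, fn=48)
print(f"SE = {se:.2%}, SP = {sp:.2%}, Q = {q:.2%}")
# prints: SE = 82.98%, SP = 99.90%, Q = 99.62%
```

Because the negatives vastly outnumber the positives in the training set, Q is dominated by SP, so SE and SP are more informative than Q alone.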
The HipHop and HypoGen algorithms implemented in DS were employed for the pharmacophore modelling. The common pharmacophore features necessary for potent Btk inhibitors were identified by HipHop, and quantitative pharmacophore models were created by HypoGen module (35).
On the basis of the principles of structural diversity and wide coverage of the activity range, 20 compounds (Figure 1) were carefully chosen to form a training set. The IC50 values of the inhibitors in the training set span a range of six orders of magnitude (IC50 values range from 0.0019 to 110 μM).
Molecular docking study
All the molecular docking studies were carried out with GOLD (Genetic Optimization of Ligand Docking) 4.0 (36), and the CHARMm force field was used. The crystal structure (PDB entry 3GEN) of the kinase domain of Btk bound to compound 2 was taken as the receptor structure. The binding site was defined as a sphere containing the residues that lie within 10 Å of the co-crystallized ligand, which covers the ATP-binding region and the hinge region of the active site.
Results and Discussion
Establishment and validation of the SVM classification model of Btk inhibitors and non-inhibitors
A training set containing 282 known Btk inhibitors (positives, see Supporting Information) and 16 713 putative non-inhibitors (negatives) was constructed in advance. Initially, 288 molecular descriptors for each compound, which cover various molecular properties, including geometrical, topological and electronic properties, were calculated. These descriptors were first preprocessed to eliminate those obvious ‘bad’ descriptors. After the preprocessing, a total of 138 molecular descriptors remained. These descriptors were further optimized using the genetic algorithm-conjugate gradient (GA-CG) method. 83 descriptors, listed in Table 1, were finally chosen for building the SVM model.
Table 1. Molecular descriptors selected by the GA-CG algorithm for the generation of SVM classification model of Bruton’s tyrosine kinase inhibitors and non-inhibitors
The generated SVM model was first validated by fivefold cross-validation, the results of which are presented in Table 2. As Table 2 shows, 234 of the 282 known Btk inhibitors were correctly predicted, indicating a prediction accuracy of 82.98% for the positives. Of the 16 713 non-inhibitors, 16 697 were also correctly predicted, showing a prediction accuracy of 99.90% for the negatives. The overall prediction accuracy is 99.62%. These results demonstrate that the generated SVM model discriminates well between Btk inhibitors and non-inhibitors for the training set compounds.
Table 2. Results of the SVM model validation by fivefold cross-validation
The purpose of SVM modelling is not only to classify the training set compounds correctly into Btk inhibitors and non-inhibitors but also to classify external compounds outside the training set accurately. Thus, an independent test set, TS1, comprising 60 positives and 62 negatives, was used to assess the predictive ability of the SVM model. The prediction results are also presented in Table 2. Of the 60 positive compounds, 56 (TP, Table 2) were correctly predicted, giving a prediction accuracy of 93.33% for the positives (SE, Table 2). Of the 62 negatives, 49 (TN, Table 2) were correctly predicted, giving a prediction accuracy of 79.03% for the negatives (SP, Table 2). Of all 122 compounds (positive and negative), 105 were correctly predicted and 17 were wrongly predicted (see Table 2). The overall prediction accuracy (Q) is therefore 86.07% (105/122), which is comparable with that for the training set. These results demonstrate that the established SVM model not only classifies the training set compounds correctly into Btk inhibitors and non-inhibitors but also has considerable predictive ability for external compounds outside the training set, implying that the SVM model can be used as a screening tool for retrieving Btk inhibitors.
Pharmacophore model generation and model validation
Before the development of the quantitative pharmacophore model, a qualitative pharmacophore model was built using HipHop to identify the most important pharmacophore features of the Btk inhibitors. The HipHop model, generated from the six most active compounds in the training set (1–6), contains four pharmacophore features: hydrogen bond acceptor, hydrogen bond donor, general hydrophobic and ring aromatic features. These four types of chemical features were therefore specified as the initial pharmacophore features in the quantitative pharmacophore modelling.
The quantitative pharmacophore models were generated from the 20 compounds in the training set (Figure 1). The top 10 hypotheses obtained in the HypoGen run and their statistical parameters are shown in Table 3. A test set, called TS2, containing 370 Btk inhibitors and 2392 decoys (for the selection method of the decoys, see Supporting Information) was then used to examine the ability of these pharmacophore models to predict Btk inhibitors. The results are also shown in Table 3. As Table 3 shows, of the 10 models, Hypo4, which contains one hydrogen bond acceptor, one hydrogen bond donor and two general hydrophobic features (shown in Figure 2A), is the best in terms of its ability to predict Btk inhibitors. The mappings of compounds 1, 5 and 15, which represent high, medium and low bioactivity, are shown in Figure 2B–D, respectively. Compound 1 mapped all four features of Hypo4, compound 5 mapped three features and compound 15 mapped only two. These results further support the validity of pharmacophore model Hypo4. Thus, Hypo4 was used in the subsequent PB-VS.
Table 3. Top 10 pharmacophore hypotheses together with their statistical parameters generated by HypoGen
a Cost diff. = (null cost − total cost), where null cost = 189.548, fixed cost = 88.4002, configuration cost = 16.6395. All cost values are in bits.
b A, D, H and R represent hydrogen bond acceptor, hydrogen bond donor, hydrophobic feature and ring aromatic, respectively.
Docking parameter optimization and scoring function selection
Molecular docking was carried out using GOLD 4.0. The crystal structure of the Btk kinase domain complexed with compound 2 (Figure 1; PDB: 3GEN) was used as the reference receptor for two reasons. One is that it has the highest resolution (1.60 Å). The other is that the bound ligand is one of the most active compounds (IC50 = 0.0082 μM). To ensure a high probability of obtaining correct docking poses of ligands and better estimates of the ligand-binding affinity in DB-VS, the docking parameters were optimized in advance. We first extracted the bound ligand from the crystal structure and then docked it back into the receptor. In the docking process, we adjusted the docking parameters until the docked structures (both the poses and the positions of heavy atoms) were as close as possible to the original crystallized structure in the binding site of Btk. The optimal docking parameters finally selected mainly include the following: the ‘Number of dockings’ was set to 10 without using the early termination option; ‘Detect Cavity’ was turned on; the optimized positions of polar protein hydrogen atoms were saved; the genetic algorithm (GA) parameter was set to ‘7–8 times speedup’; and the top 10 scoring poses were saved for each compound. Using these parameters, we obtained a very small RMSD (root mean square deviation) value (RMSD = 0.6786 Å) between the docked ligand and the bound ligand in the crystal structure.
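The RMSD used to judge the redocking result can be computed as in the following minimal Python sketch. It assumes that both poses list the same heavy atoms in the same order and share the receptor coordinate frame, so no superposition is applied; symmetry-equivalent atoms are not handled. The coordinates are hypothetical toy values, not data from the 3GEN structure.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between two poses with matched atom order."""
    if len(coords_a) != len(coords_b):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy poses in angstroms: the "docked" pose is shifted by 0.5 along x.
crystal_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(x + 0.5, y, z) for x, y, z in crystal_pose]
value = rmsd(crystal_pose, docked_pose)
print(round(value, 4))
```

A uniform 0.5 Å shift of every atom gives an RMSD of 0.5 Å, which provides a quick sanity check of the function.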
Several scoring functions are available, including GOLDScore, ChemScore and a modified ChemScore optimized for kinase-related docking (KCS) (37). To select a scoring function, we chose a set of known Btk inhibitors whose IC50 values span three orders of magnitude. These inhibitors were docked into the active site of Btk with the optimized docking parameters. We then calculated the correlation coefficient between the experimentally measured IC50 values and the scoring function values. GOLDScore gave the best correlation coefficient and was therefore adopted in the subsequent DB-VS study.
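The comparison of scoring functions by correlation can be sketched as follows. This is an illustration only: the IC50 values and scores are invented, the scoring functions are given neutral placeholder names, and converting IC50 to pIC50 before computing the Pearson coefficient is an assumption of this sketch rather than a step stated in the text.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical IC50 values in micromolar, spanning about three orders of magnitude.
ic50_um = [0.002, 0.05, 0.4, 1.8]
# pIC50 = -log10(IC50 in mol/L); larger values mean more potent compounds.
pic50 = [-math.log10(v * 1e-6) for v in ic50_um]

# Hypothetical docking scores (higher = better) from two placeholder functions.
scores = {
    "function_A": [71.0, 64.0, 58.0, 50.0],
    "function_B": [30.0, 34.0, 29.0, 31.0],
}
r_values = {name: pearson_r(pic50, s) for name, s in scores.items()}
best = max(r_values, key=r_values.get)
print(best, {k: round(v, 2) for k, v in r_values.items()})
```

With these toy numbers, "function_A" tracks potency much more closely than "function_B" and would be the one carried forward.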
Evaluation of the performances of the hybrid VS method in virtual screening
Before applying the hybrid VS approach, we assessed its performance in virtual screening. We first constructed a large independent test set, called TS3, for this assessment. TS3 consists of 28 known Btk inhibitors (these compounds were not used in any of the preceding modelling or model validation) and 8662 decoys obtained from the MDDR (MDL Drug Data Report) library (for the selection method of the decoys, see Supporting Information). The performance of VS was evaluated by the yield (percentage of known inhibitors that are predicted as positives), the hit rate (percentage of known inhibitors among the predicted compounds) and the enrichment factor (ratio of the hit rate to the percentage of known inhibitors in TS3), which shows the magnitude of hit rate improvement over random selection.
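These three measures can be made concrete with a short Python sketch. The example uses the SB-VS counts reported in this paper for TS3 (202 predicted positives, of which 26 are known inhibitors, in a set of 28 inhibitors plus 8662 decoys); the function name `vs_metrics` is introduced here for illustration.

```python
def vs_metrics(n_predicted, n_hits, n_known, n_total):
    """Return yield, hit rate and enrichment factor of a screening run.

    n_predicted: compounds predicted as positives by the method
    n_hits: known inhibitors among the predicted positives
    n_known: known inhibitors in the screened set
    n_total: size of the screened set
    """
    yield_frac = n_hits / n_known
    hit_rate = n_hits / n_predicted
    enrichment = hit_rate / (n_known / n_total)  # relative to random selection
    return yield_frac, hit_rate, enrichment


# TS3 contains 28 known inhibitors and 8662 decoys (8690 compounds in total).
y, hr, ef = vs_metrics(n_predicted=202, n_hits=26, n_known=28, n_total=28 + 8662)
print(f"yield = {y:.2%}, hit rate = {hr:.2%}, EF = {ef:.2f}")
# prints: yield = 92.86%, hit rate = 12.87%, EF = 39.95
```

An enrichment factor of about 40 means that a compound picked from the SB-VS output is roughly 40 times more likely to be a known inhibitor than one picked at random from TS3.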
Firstly, SB-VS, PB-VS and DB-VS were used individually to screen TS3; the results are given in Table 4. For SB-VS, the number of predicted positives is 202, and that of the total hits is 26, with a yield of 92.86% (26 known inhibitors out of 28). The hit rate and enrichment factor are 12.87% and 39.95, respectively. The time used for the screening of TS3 by SB-VS is about 10 seconds (0.003 h) on a desktop PC equipped with an Intel E5420 (2500 MHz) processor if the molecular descriptors have been prepared (the time cost for the calculation of the molecular descriptors of the TS3 compounds is about 0.7 h on the same computer). For PB-VS, the number of predicted positives is 1595, and that of the total hits is 24, with a yield of 85.71% (24 known inhibitors out of 28, see Table 4). The hit rate and enrichment factor are 1.50% and 4.67, respectively. The time cost for the screening of TS3 by PB-VS is about 28 h. For DB-VS, the number of predicted positives is 4842, of which 21 are known inhibitors. The yield, hit rate and enrichment factor are 75%, 0.43% and 1.35, respectively. The time cost is about 231 h.
Table 4. Evaluation results of the performances of various VS methods by screening a chemical database (TS3) that contains 28 known Bruton’s tyrosine kinase inhibitors and 8662 decoys from MDDR library
Secondly, the three VS methods were combined into three pairs, SB-VS/PB-VS, SB-VS/DB-VS and PB-VS/DB-VS, with the faster method applied first. For SB-VS/PB-VS, the number of total hits is 22 (26/22), with a yield of 78.57% (22 known inhibitors out of 28, see Table 4). The hit rate and enrichment factor are 30.56% and 94.83, respectively. The overall time used is 5 h. For SB-VS/DB-VS, the total hits finally obtained comprise 21 (26/21) compounds. The yield is 75.00% (see Table 4). The hit rate and enrichment factor are 0.43% and 1.35, respectively. The overall time used is 23 h. For PB-VS/DB-VS, the number of total hits is 19 (24/19). The yield, hit rate and enrichment factor are 67.86%, 2.20% and 6.82, respectively. The overall time used is 55 h.
Thirdly, the three VS methods were used together (SB-VS/PB-VS/DB-VS) to screen TS3, with the faster SB-VS applied first, followed by the slower PB-VS and DB-VS. The final number of positive compounds that passed through the three filters of the hybrid VS is 43 (202/72/43), and that of the total hits is 20 (26/22/20), with a yield of 71.43% (20 known inhibitors out of 28, see Table 4). The hit rate and enrichment factor are 46.51% and 144.35, respectively, which are significantly higher than the corresponding values of the individual VS methods as well as the pairwise combinations of SB-VS, PB-VS and DB-VS. The total time used is just 6 h.
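The cascade logic of the hybrid protocol, in which each filter only sees the compounds that survived the previous, faster filter, can be sketched in Python as follows. The compound records, field names and thresholds are hypothetical stand-ins for the SVM prediction, the pharmacophore fit and the docking score; they are not the criteria used in this study.

```python
def run_cascade(compounds, filters):
    """Apply filters in the given order; return survivors and funnel counts."""
    survivors = list(compounds)
    funnel = [len(survivors)]
    for _name, keep in filters:
        # Slower filters run only on what the faster ones let through.
        survivors = [c for c in survivors if keep(c)]
        funnel.append(len(survivors))
    return survivors, funnel


# Toy library with precomputed (hypothetical) results for each method.
compounds = [
    {"id": "a", "svm_active": True, "fit": 3.2, "dock": 62.0},
    {"id": "b", "svm_active": True, "fit": 1.0, "dock": 70.0},
    {"id": "c", "svm_active": False, "fit": 3.5, "dock": 65.0},
    {"id": "d", "svm_active": True, "fit": 2.8, "dock": 41.0},
]

# Ordered from fastest to slowest method; thresholds are placeholders.
filters = [
    ("SB-VS", lambda c: c["svm_active"]),
    ("PB-VS", lambda c: c["fit"] >= 2.5),
    ("DB-VS", lambda c: c["dock"] >= 50.0),
]

final_hits, funnel = run_cascade(compounds, filters)
print(funnel, [c["id"] for c in final_hits])
```

The funnel list plays the same role as the 202/72/43 counts quoted above: because the expensive docking step runs only on the compounds that remain after the two cheaper filters, the total time is far lower than for docking the whole set.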
Virtual screening by using the hybrid VS approach for retrieving novel potent Btk inhibitors
The hybrid VS approach was used to screen several commercially available chemical databases, including Specs (202 408 compounds) and Enamine (980 000 compounds), to retrieve new potent Btk inhibitors. The workflow of this screening process using the hybrid VS protocol is shown schematically in Figure 3. A total of 13 482 compounds passed through the first filter, SB-VS. These compounds then underwent the second filtering step by PB-VS, after which 5246 compounds remained. Finally, the 5246 compounds were subjected to docking, and 1921 passed through the DB-VS filter. The 200 top-ranked compounds were taken as our final hits.
The final hit compounds were further inspected visually to check whether important interactions with the active site of Btk kinase were retained, for example, interactions with catalytically important residues in the hinge region and other residues in the active site, including Thr474, Met477, Leu408 and Val416 (27,28). Finally, we selected a total of 39 compounds (see Table S4 in Supporting Information) from the 200 hits for purchase. These 39 compounds have been advanced to experimental evaluation, the results of which will be reported in the near future.
In this investigation, a hybrid protocol of virtual screening methods including SB-VS, PB-VS and DB-VS has been introduced for the discovery of BTK inhibitors. We first established an SVM classification model of BTK inhibitors and non-inhibitors for the SB-VS method and a pharmacophore model of BTK inhibitors for the PB-VS method. The docking parameters were also optimized in advance. The performance of the resulting hybrid VS approach, SB-VS/PB-VS/DB-VS, was then evaluated against a test set. The results showed that the hybrid VS approach was considerably superior to the individual SB-VS, PB-VS and DB-VS methods, as well as to their pairwise combinations, in terms of hit rate, enrichment factor and screening speed. Finally, the hybrid VS approach was applied to screen several chemical databases, including the Specs (202 408 compounds) and Enamine (980 000 compounds) databases. Thirty-nine compounds were selected from the final hits and have been advanced to subsequent experimental studies.
This work was supported by the National S&T Major Project (2012ZX09102-101-002), and partly by the National Natural Science Foundation of China (81172987) and the 863 Hi-Tech Program (2012AA020301, 2012AA020308).