CCR5 is a key co-receptor for HIV-1 entry into host cells and has become an attractive target for antiretroviral drug design. To date, six types of CCR5 antagonists have been synthesized and evaluated. To search for more potent bioactive compounds, a non-linear support vector machine was used to construct structure-activity relationship models for 103 oximino-piperidino-piperidine CCR5 antagonists. Comparative molecular field analysis and comparative molecular similarity indices analysis models were then constructed after aligning the compounds on their common substructure. Twenty-one structurally diverse compounds, which were not used to build the support vector machine, comparative molecular field analysis, or comparative molecular similarity indices analysis models, were used to validate them. The results show that these models possess good predictive ability. The results obtained from the support vector machine and the 3D-quantitative structure activity relationship models are compatible with each other. However, the 3D-quantitative structure activity relationship models are significantly better than the support vector machine model and a previously reported pharmacophore model. These models can help to make quantitative predictions of bio-activity before the in vitro and in vivo stages.
HIV-1 entry into host cells is the key step of viral infection, and inhibition of this step is therefore an attractive target for antiretroviral intervention and drug design (1). The infection process of HIV-1 proceeds as follows: first, the HIV gp120 envelope protein binds the CD4 receptor (2); the envelope protein then adjusts its conformation and finally docks to the chemokine receptor CCR5 or CXCR4 (3). The CCR5 receptor is therefore one of the important targets for anti-AIDS drug design. The known CCR5 antagonists are classified into six main categories (4): anilide derivatives (5), oximino-piperidino-piperidine derivatives (6), chiral piperazine-based derivatives (7), tropane-based derivatives (8), spirodiketopiperazine-based derivatives (9), and acyclic and cyclic scaffold-based derivatives (10). Because oximino-piperidino-piperidine antagonists are safe and generally well tolerated (4), they have been widely investigated. However, because no crystal structure of a CCR5 receptor complex has been released, only ligand-based molecular modeling methods can be used to investigate CCR5 receptor antagonists.
Support vector machine (SVM) is an algorithm developed for regression and classification (11). Thanks to its remarkable generalization performance, SVM has attracted wide attention and has been applied extensively to quantitative structure activity relationship (QSAR) and quantitative structure property relationship (QSPR) studies for drug design (12–16). In this work, SVM was used to construct a regression model, which was then compared with comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) models. Together, these models could help us to better understand how the substituents influence the bio-activity of CCR5 receptor antagonists and to make quantitative predictions of their bio-activities before the in vitro and in vivo stages.
Computational Methods and Materials
The structures and activities (concentrations in nM, determined from RANTES binding to CCR5) of the CCR5 receptor antagonists were taken from the literature (17) and are listed in Table 1. The structures marked with ‘*’ constitute the test set; the others make up the training set.
Table 1. Structures and bio-activities
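The models described later are fitted to pKi values rather than to raw concentrations. As a minimal sketch, assuming the tabulated activities are binding constants in nM and that pKi is the usual negative decadic logarithm of the constant in mol/L, the conversion could look like this. The function name and the example values are hypothetical and are not taken from Table 1.

```python
import math


def pki_from_nanomolar(ki_nm: float) -> float:
    """Convert a binding constant in nM to pKi = -log10(Ki in mol/L)."""
    if ki_nm <= 0:
        raise ValueError("Ki must be positive")
    # 1 nM = 1e-9 mol/L, so -log10(Ki * 1e-9) = 9 - log10(Ki)
    return 9.0 - math.log10(ki_nm)


# Hypothetical concentrations, for illustration only.
for ki in (0.5, 10.0, 250.0):
    print(f"Ki = {ki:7.1f} nM -> pKi = {pki_from_nanomolar(ki):.2f}")
```

Working on the logarithmic scale keeps highly active and weakly active compounds on a comparable numeric range for regression.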
Support vector machine is a promising classification and regression method developed by Vapnik et al. (11). A detailed description of SVM theory can be found in several related books (18–21). Support vector machine was originally developed for classification problems and was later extended to non-linear regression problems by the introduction of the ε-insensitive loss function.
The support vector machine approach minimizes the structural risk rather than the empirical risk; that is, it aims to preserve good generalization ability rather than to optimize the agreement with a given (limited) training set. It therefore constitutes a trade-off between the complexity of the model and its capability to reproduce experimental observations.
The regression performance of SVM depends on the setting of several parameters: C, ε, the kernel type, and the corresponding kernel parameters. Parameter C is a regularization constant that determines the trade-off between the model complexity and the degree to which deviations larger than ε are tolerated in the optimization formulation. The kernel function and its parameters are another important factor because they define the distribution of the training samples in the high-dimensional feature space.
All SVM models in the present study were implemented using the program LIBSVM developed by C. W. Hsu and C. J. Lin (22). The radial basis function (RBF) was used as the kernel function in this work. For the RBF kernel, the most important parameter is the width (γ) of the RBF. All programs for the SVM calculations were written as MATLAB M-files.
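The two ingredients named above, the ε-insensitive loss and the RBF kernel, are compact enough to write out. The following is a minimal sketch in plain Python, not the LIBSVM or MATLAB code used in this work; it assumes the RBF form exp(-γ‖x − z‖²), and the function names are illustrative.

```python
import math
from typing import Sequence


def rbf_kernel(x: Sequence[float], z: Sequence[float], gamma: float) -> float:
    """RBF kernel exp(-gamma * ||x - z||^2); a larger gamma gives a narrower kernel."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)


def eps_insensitive_loss(y_true: float, y_pred: float, epsilon: float) -> float:
    """Deviations up to epsilon cost nothing; larger ones are penalized linearly."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# A prediction within epsilon of the observed value incurs no loss.
print(eps_insensitive_loss(7.00, 7.03, epsilon=0.05))
# Identical descriptor vectors have kernel value 1.
print(rbf_kernel([0.2, 1.5], [0.2, 1.5], gamma=1.0))
```

In the regression formulation, C scales the summed ε-insensitive losses against the model-complexity term, which is the trade-off described above.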
Molecular modeling and alignment
Three-dimensional structure modeling was performed using the SYBYL 6.9 program package (Tripos Associates Inc., St Louis, MO). Energy minimization was performed with the Tripos force field (23), using a distance-dependent dielectric and the Powell conjugate gradient algorithm with a convergence criterion of 0.01 kcal/(mol Å). Partial atomic charges were calculated with the Gasteiger–Hückel method. Fifty conformers of each compound in the training and test sets were generated using the multisearch module in SYBYL. The minimum-energy conformers were selected to build the 3D-QSAR models.
Comparative molecular field analysis results can be extremely sensitive to a set of factors such as the alignment rule, the overall orientation of the aligned compounds, the lattice shifting step size, and the probe atom type (24). The prediction accuracy of CoMFA models and the reliability of the contour plots depend strongly on the structural alignment of the molecules. Because the structures in this investigation share a common substructure, the compounds were aligned on it (Figure 1). Molecule M77, which has the largest pKi value, was selected as the alignment template. The alignment was performed with the routine SYBYL ‘database align’ function.
Comparative molecular field analysis is used to build statistical and graphical models of activity from molecular structure and to make accurate predictions of the activity of designed compounds (25). This method has led to several drugs currently on the market (26). The basic steps of a CoMFA analysis are as follows. After the molecules were consistently aligned within a lattice, a probe sp3 carbon atom with a +1 net charge was used to calculate the steric and electrostatic interactions between the probe and the rest of the molecule. Steric and electrostatic fields were generated by the standard CoMFA method in SYBYL with the default energy cutoff of 30 kcal/mol. Electrostatic interactions were modeled by a Coulomb potential and van der Waals interactions by a Lennard-Jones potential. The regression analysis was carried out using the partial least-squares (PLS) (27) method. The final model was developed with the optimum number of components, that is, the number yielding the highest cross-validated q2. The total set of antagonists was initially divided into two groups in an approximate ratio of 4:1 (82 compounds in the training set and 21 in the test set). The training and test sets were selected such that low-, moderate-, and high-activity compounds are present in roughly equal proportions in both sets.
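One simple way to obtain a roughly 4:1 split in which both sets span the low, moderate, and high activity ranges is to rank the compounds by activity and send every fifth one to the test set. The sketch below illustrates that idea only; it is not necessarily the selection procedure used here, and the compound labels and pKi values are placeholders rather than data from Table 1.

```python
from typing import Dict, List, Tuple


def activity_stratified_split(
    activities: Dict[str, float], every: int = 5, offset: int = 2
) -> Tuple[List[str], List[str]]:
    """Rank compounds by activity and assign every `every`-th one to the test set."""
    ranked = sorted(activities, key=activities.get)
    test_ids = [name for i, name in enumerate(ranked) if i % every == offset]
    held_out = set(test_ids)
    train_ids = [name for name in ranked if name not in held_out]
    return train_ids, test_ids


# Placeholder pKi values for 103 hypothetical compounds (not the published data).
activities = {f"M{i}": 6.0 + 3.0 * ((i * 37) % 103) / 102 for i in range(1, 104)}
train_ids, test_ids = activity_stratified_split(activities)
print(len(train_ids), len(test_ids))
```

Because the test compounds are taken at regular intervals along the ranked list, their activity range closely tracks that of the training set.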
The standard CoMFA approach describes only the potential energetic contributions to the binding constants; entropic influences are neglected or insufficiently covered. To include entropic contributions, the CoMSIA method was proposed by Klebe et al. (28,29). Five physicochemical properties, related to steric, electrostatic, hydrophobic, hydrogen bond donor, and hydrogen bond acceptor fields, were evaluated with the probe atom. A Gaussian-type distance dependence was employed to describe how the contribution of each atom is attenuated with its distance from a lattice point. This Gaussian-type distance dependence leads to much smoother sampling of the fields around the molecules than in CoMFA. The default value of 0.3 was used as the attenuation factor.
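The difference in smoothness can be seen by evaluating both distance dependences at a few probe-atom separations. The sketch below is illustrative only: the 6-12 form with a truncation at 30 kcal/mol stands in for the CoMFA steric term, the Gaussian exp(-αr²) with α = 0.3 stands in for the CoMSIA distance dependence, and the radius and well-depth values are assumed, not Tripos force field parameters.

```python
import math


def lennard_jones(r: float, r_min: float = 3.4, well_depth: float = 0.1,
                  cutoff: float = 30.0) -> float:
    """6-12 potential in kcal/mol, truncated at `cutoff` as in a CoMFA steric field.

    r_min and well_depth are illustrative values, not force field parameters.
    """
    if r <= 0.0:
        return cutoff
    ratio = r_min / r
    energy = well_depth * (ratio ** 12 - 2.0 * ratio ** 6)
    return min(energy, cutoff)


def gaussian_term(r: float, alpha: float = 0.3) -> float:
    """Gaussian-type distance dependence exp(-alpha * r^2) with attenuation factor alpha."""
    return math.exp(-alpha * r * r)


for r in (0.5, 1.0, 2.0, 3.4, 5.0):
    print(f"r = {r:3.1f} A  LJ = {lennard_jones(r):8.3f}  Gaussian = {gaussian_term(r):.3f}")
```

The 6-12 term rises steeply near the atom and has to be capped, whereas the Gaussian term stays finite and varies gradually at every distance, so no arbitrary cutoff is needed.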
Twenty-seven Charged Partial Surface Area descriptors (30) and 56 VolSurf descriptors (31) were calculated with the default methods implemented in SYBYL. Only eight descriptors were retained after a number of tests with the training set compounds. These descriptors correlate strongly with the bio-activity of the CCR5 antagonists, while their mutual dependence is not significant. They are molecular polarizability and dispersion force (W2 and W4), three capacity factors (Cw3, Cw4, and Cw5), the ratio between the charge of the most negative atom and the total negative charge (RNCG), the highest occupied molecular orbital (HOMO), and the lowest unoccupied molecular orbital (LUMO).
To derive the 3D-QSAR models, the standard CoMFA and CoMSIA descriptors were used as independent variables and pKi as the dependent variable in PLS regression analyses implemented in the SYBYL package. Comparative molecular field analysis descriptors were calculated using an sp3 carbon probe atom with a charge of +1.0 to generate steric (Lennard-Jones potential) field energies and electrostatic (Coulombic potential) fields with a distance-dependent dielectric at each lattice point. Comparative molecular similarity indices analysis descriptors were generated using an sp3 carbon probe atom with a +1.0 charge (Gaussian potential). The predictive ability of the models was evaluated by leave-one-out (LOO) cross-validation. The cross-validated coefficient, q2, was calculated using eqns 1 and 2 (32):

$$q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_i (y_i - y_m)^2} \tag{1}$$

$$\mathrm{PRESS} = \sum_i (y_{\mathrm{pred},i} - y_i)^2 \tag{2}$$

where y_i is the observed activity of compound i in the training set, y_m is the mean observed value, corresponding to the values for each cross-validation group, and y_pred,i is the predicted activity for y_i.
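The leave-one-out procedure can be sketched in a few lines, assuming the usual definition q2 = 1 − PRESS/SS, where PRESS sums the squared differences between each observed activity and its prediction from a model fitted without that compound, and SS sums the squared deviations of the observed activities from their mean. A one-descriptor least-squares line stands in for the PLS model here; the function names and data are hypothetical.

```python
from typing import Callable, List


def loo_q2(xs: List[float], ys: List[float],
           fit_predict: Callable[[List[float], List[float], float], float]) -> float:
    """Leave-one-out q2 = 1 - PRESS / sum((y_i - y_mean)^2)."""
    y_mean = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(ys)):
        # Refit on all compounds except i, then predict compound i.
        y_pred = fit_predict(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i])
        press += (ys[i] - y_pred) ** 2
    ss_total = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - press / ss_total


def line_fit_predict(train_x: List[float], train_y: List[float], x_new: float) -> float:
    """Stand-in model: ordinary least-squares line through the training points."""
    n = len(train_x)
    mean_x, mean_y = sum(train_x) / n, sum(train_y) / n
    sxx = sum((x - mean_x) ** 2 for x in train_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
    return mean_y + (sxy / sxx) * (x_new - mean_x)


# Hypothetical descriptor values and activities.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [6.1, 6.9, 8.2, 8.8, 10.1, 10.9]
print(round(loo_q2(xs, ys, line_fit_predict), 3))
```

Unlike the fitted r2, q2 can become negative when the left-out predictions are worse than simply predicting the mean activity.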
Results and Discussion
LOO cross-validation was used to select suitable descriptors and to build the SVM model. The regression performance of SVM depends on the combination of several factors: the kernel function type and its corresponding parameters, the capacity parameter C, and ε of the ε-insensitive loss function. To obtain the best generalization ability, these factors were optimized as follows. Four kernel functions are available in the LIBSVM package: linear, polynomial, RBF, and sigmoid. For regression tasks, the RBF kernel is often used because of its effectiveness and speed in the training process (33), and it was also applied in our SVM models. For the RBF kernel, three parameters, ε, γ, and C, had to be chosen. The parameter selection process and the effect of each parameter on generalization performance are shown in Figures 2–4. First, only γ was varied, from 0.10 to 1.60, and the mean squared error (MSE) from LOO cross-validation on the training set was recorded for each value. The curve of MSE against γ is shown in Figure 2. The optimal value of γ was found to be 1.0. By the same procedure, the optimal value of ε was found to be 0.05 and the cost factor C to be 1000. The experimental activities (EA), selected descriptors, and predicted activities (PA) are listed in the supplementary files (Tables S1 and S2).
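The one-parameter-at-a-time search can be sketched as a sweep that records the LOO error for each candidate γ and keeps the minimum. To stay self-contained, the sketch below does not train an SVM: a Gaussian kernel-weighted average is swapped in as a stand-in regressor that shares the γ parameter, and the descriptor vectors and activities are synthetic. Only the sweep logic is meant to carry over.

```python
import math
from typing import Dict, List, Tuple


def rbf(x: List[float], z: List[float], gamma: float) -> float:
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def kernel_average(train_x: List[List[float]], train_y: List[float],
                   x_new: List[float], gamma: float) -> float:
    """Stand-in regressor (not SVR): RBF-weighted mean of the training activities."""
    weights = [rbf(x, x_new, gamma) for x in train_x]
    total = sum(weights)
    if total == 0.0:
        return sum(train_y) / len(train_y)
    return sum(w * y for w, y in zip(weights, train_y)) / total


def loo_mse(xs: List[List[float]], ys: List[float], gamma: float) -> float:
    """Mean squared error of leave-one-out predictions for a given gamma."""
    errors = []
    for i in range(len(ys)):
        pred = kernel_average(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], xs[i], gamma)
        errors.append((ys[i] - pred) ** 2)
    return sum(errors) / len(errors)


def sweep_gamma(xs: List[List[float]], ys: List[float],
                gammas: List[float]) -> Tuple[float, Dict[float, float]]:
    scores = {g: loo_mse(xs, ys, g) for g in gammas}
    return min(scores, key=scores.get), scores


# Synthetic two-descriptor data, for illustration only.
xs = [[0.25 * i, 0.1 * (i % 4)] for i in range(24)]
ys = [6.0 + math.sin(x[0]) + 0.5 * x[1] for x in xs]
gammas = [round(0.1 * k, 2) for k in range(1, 17)]  # 0.10 to 1.60
best_gamma, scores = sweep_gamma(xs, ys, gammas)
print(best_gamma, round(scores[best_gamma], 4))
```

The same loop can be repeated for ε and C with the other parameters held fixed, although a sequential search of this kind does not guarantee the jointly optimal combination.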
Examination of SVM model
For the training set, the correlation coefficient r2 between EA and PA is 0.904, with a standard error of estimate (SEE) of 0.210. For the test set, r2 is 0.742 with an SEE of 0.312 (M3 and M38 are outliers that cannot be predicted by this SVM model). The correlations between EA and PA for the training and test sets are shown in Figure 5. This model generally reflects the relationship between EA and PA.
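The two statistics quoted above can be computed from paired EA and PA values as sketched below. The definitions are assumptions, since they are not spelled out here: r2 is taken as the squared Pearson correlation, and SEE as the square root of the residual sum of squares divided by n − p − 1, where p is a caller-supplied number of fitted parameters. The activity values are hypothetical.

```python
import math
from typing import Sequence


def pearson_r2(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Squared Pearson correlation between observed and predicted activities."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


def standard_error(observed: Sequence[float], predicted: Sequence[float],
                   n_params: int = 1) -> float:
    """Assumed SEE definition: sqrt(residual sum of squares / (n - n_params - 1))."""
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(ss_res / (len(observed) - n_params - 1))


# Hypothetical experimental (EA) and predicted (PA) activities.
ea = [6.0, 7.0, 8.0, 9.0]
pa = [6.1, 6.9, 8.2, 8.8]
print(round(pearson_r2(ea, pa), 3), round(standard_error(ea, pa), 3))
```

A high r2 with a large SEE is possible when predictions are well correlated but systematically offset, so the two values are best read together.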
Two methods, CoMFA and CoMSIA, were used to construct 3D-QSAR models for the CCR5 receptor antagonists. The alignment of the 103 compounds of the training and test sets is shown in Figure 6. The statistical parameters of the models are given in Table 2. The PA values and the residuals between EA and PA are gathered in Tables 3 and 4.
Table 2. Partial least-squares (PLS) statistical parameters of the comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) models
Abbreviations: S, steric field; E, electrostatic field; H, hydrophobic field; D, hydrogen bond donor; A, hydrogen bond acceptor; SEE, standard error.
Table 3. The experimental and predicted activities (PA) of training set and the previous work
CoMFA, comparative molecular field analysis; CoMSIA, comparative molecular similarity indices analysis.
Selection of CoMSIA fields
There are five fields in the CoMSIA model, and the way they are combined is the most important factor influencing its performance. To obtain the optimal result, we systematically varied the combination of fields and chose the one that gave the best cross-validated and non-cross-validated coefficients, the smallest SEE, and the largest F value. Figure 7 illustrates these parameters for the combinations of the five fields. The model combining the steric, electrostatic, and hydrophobic fields, which has the highest cross-validated q2, r2, 1/SEE, and F, was chosen as the best CoMSIA model, and the contour plots are analyzed using this model.
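Enumerating the field combinations is straightforward: five fields give 31 non-empty subsets, each of which is fitted and scored. The sketch below lists the subsets and picks a winner from a table of statistics. The ranking rule (q2 first, then r2, then smaller SEE, then larger F) is an assumption made for illustration, and the statistics shown are placeholders, not values from Table 2 or Figure 7.

```python
from itertools import combinations
from typing import Dict, List, Sequence

FIELDS = ("S", "E", "H", "D", "A")  # steric, electrostatic, hydrophobic, donor, acceptor


def field_combinations(fields: Sequence[str] = FIELDS) -> List[str]:
    """All non-empty subsets of the fields, written as labels such as 'SEH'."""
    return ["".join(subset)
            for size in range(1, len(fields) + 1)
            for subset in combinations(fields, size)]


def pick_best(stats: Dict[str, Dict[str, float]]) -> str:
    """Assumed ranking: highest q2, then highest r2, then smallest SEE, then largest F."""
    return max(stats, key=lambda name: (stats[name]["q2"], stats[name]["r2"],
                                        -stats[name]["see"], stats[name]["f"]))


# Placeholder statistics for three of the 31 combinations.
example_stats = {
    "SE": {"q2": 0.41, "r2": 0.90, "see": 0.25, "f": 110.0},
    "SEH": {"q2": 0.52, "r2": 0.95, "see": 0.16, "f": 270.0},
    "SEHDA": {"q2": 0.47, "r2": 0.93, "see": 0.20, "f": 180.0},
}
print(len(field_combinations()), pick_best(example_stats))
```

In practice each label would be mapped to a PLS run with the corresponding fields switched on, and the resulting statistics collected into a table like `example_stats`.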
Evaluation of CoMFA and CoMSIA models
We will now examine the correlation models between EA and PA of CCR5 antagonist.
For the CoMFA model, the cross-validated q2 value of the training set is 0.731 with six principal components. The non-cross-validated r2 value is 0.971, with an SEE of 0.129. Twenty-one structurally diverse compounds, which were not included in the CoMFA and CoMSIA models, were selected as a validation set. The corresponding correlation coefficient r2 between EA and PA for the test set is 0.927, with an SEE of 0.222. The correlations between EA and PA for the training and test sets are shown in Figure 8. Except for M38, the compounds of the test set are predicted well by the CoMFA model. We cannot give an exact explanation for this large deviation; it may be caused by experimental errors or insufficient correction factors. For the CoMSIA(SEH) model, the cross-validated q2 value of the training set is 0.520 with six principal components. The non-cross-validated r2 value is 0.956, with an SEE of 0.158. The corresponding correlation coefficient r2 between EA and PA for the test set is 0.967, with an SEE of 0.155. The correlations between EA and PA for the training and test sets are also shown in Figure 8. The residuals for all test compounds are less than 0.40. This suggests that the CoMSIA(SEH) model has good predictive ability.
Analysis of CoMFA model
The PLS statistical parameters of the CoMFA model are summarized in Table 2. The steric and electrostatic field contributions are 0.573 and 0.427, respectively.
Figure 9 illustrates the contour plots of the CoMFA model with the structure of M77. The meanings of the different colored areas are given in the figure caption. Red-colored regions near position 4 of the pyridine or phenyl ring of substituent R3 suggest that negatively charged groups are favorable to bio-activity. This could explain why compound M77, with a pyridine, is more active than compound M74, with a phenyl substituent. For compounds M93, M94, and M95, the activity order is consistent with that of their electronegativity: M93 (–CF3) > M94 (–NH2) > M95 (–NHCOCH3). Blue-colored regions near the substituents at positions 2 and 6 of the phenyl ring of substituent R3 suggest that positively charged groups are favorable to activity. This could explain the bio-activity sequences M64 (2,6-diCH3) > M65 (2-Cl-6-NH2), M18 (2,6-diCH3) > M27 (2-Cl-6-NH2), and M58 (2,6-diCH3) > M59 (2-Cl-6-NH2). Green-colored, blue-colored, and yellow-colored regions near the substituent of C=N–O for substituent R1 suggest that positively charged groups of suitable bulk are favorable to activity. This could explain why M88 (–CH2–CF3), with a negatively charged group, is less active than M89 (–CH2–C3H7), M90 (–CH2–C3H5), and M91 (–CH2–C2H6), with positively charged groups. The activity of M23, with a positively charged group, is higher than that of M24, with a negatively charged group, which is also consistent with the CoMFA model. At the same time, the substituent –CH2–C2H6 of M91 is less bulky than –CH2–C3H7 of M89 and –CH2–C3H5 of M90, consistent with the activity of M91 being higher than those of M88 and M89. The blue-colored region near the substituent at position R1 of the trans conformers M18 and M20 indicates that the trans conformers (M18 and M20) are more active than the corresponding cis conformers (M17 and M19, respectively).
Green-colored, red-colored, and yellow-colored regions near the substituent at position 4 of the phenyl ring of R1 suggest that negatively charged groups of suitable bulk are favorable to activity. This could explain the activity sequences M4 (–CF3) > M3 (–I) > M1 (–Br) > M2 (–Cl) > M6 (–H) > M7 (–OCH3) and M43 (–CF3) > M44 (–OCF3) > M45 (–SO2CH3). It is also consistent with the very low activity of M11 (–C6H5), which carries a positive, bulky group. The yellow-colored region near substituents R2 and R4 suggests that small groups are favorable to activity, which is also consistent with the experimental observations.
Analysis of CoMSIA model
The PLS statistics of the CoMSIA model are also summarized in Table 2. The steric, electrostatic, and hydrophobic field contributions are 0.188, 0.352, and 0.460, respectively.
Figure 10 shows the contour plots of the CoMSIA model with the structure of M77. The meanings of the different colored areas are listed in the figure caption. Red-colored regions near position 4 of the pyridine or phenyl ring of substituent R3 suggest that negatively charged groups are favorable to activity, in agreement with the CoMFA result. Blue-colored and purple-colored regions near the substituents at positions 2 and 6 of the phenyl ring of substituent R3 suggest that positively charged and hydrophobic groups are favorable to activity. This could explain why M65 (2-Cl-6-NH2), M27 (2-Cl-6-NH2), and M59 (2-Cl-6-NH2), with hydrophilic groups, are less active than M64 (2,6-diCH3), M18 (2,6-diCH3), and M58 (2,6-diCH3), with hydrophobic groups. White-colored, purple-colored, and green-colored regions near the substituent of C=N–O for substituent R1 suggest that hydrophobic groups of suitable bulk are favorable to activity. The logP values of the substituents of M89 (–CH2–C3H7), M90 (–CH2–C3H5), and M91 (–C3H7) are 2.11, 1.54, and 1.63, respectively. M91, with a suitable logP, is the most active of these compounds, which is consistent with the CoMSIA model. Purple-colored regions near the substituent at position 4 of the phenyl ring of R1 suggest that hydrophobic groups are favorable to activity. A comparison of the CoMFA and CoMSIA contour plots shows that the CoMSIA model can reflect the influence of the hydrophobic field, which compensates for a shortcoming of the CoMFA model.
SVM versus 3D-QSAR models
The descriptors used to construct the SVM model are W2, W4, Cw3, Cw4, Cw5, RNCG, HOMO, and LUMO. W2 and W4 describe the molecular polarizability and dispersion force and can be assigned to the steric field of CoMFA and CoMSIA. RNCG is a parameter related to the negative charge; it represents the electrostatic interaction between the residues of the receptor and the compound. Cw3, Cw4, and Cw5 represent the hydrophilic capacity and may thus reflect the hydrophobic property of a molecule. This is consistent with the CoMSIA model, in which the contribution of the hydrophobic field is 0.460. The HOMO and LUMO are localized on the aromatic moiety of the compounds. Their energies describe the ability of the aromatic ring to make π–π interactions with aromatic residues of the receptor (34). It is therefore reasonable to relate the HOMO and LUMO descriptors to activity, in agreement with the result that a phenyl substituent at position R3 is favorable to activity. Our SVM and 3D-QSAR models thus not only explain but also complement each other. However, the correlation coefficient r2 between EA and PA of the SVM model for the test set is 0.742, while the corresponding r2 values of the CoMFA and CoMSIA models for the same data set are 0.927 and 0.967, respectively. This suggests that the 3D-QSAR models are significantly better than the SVM model.
Comparison with previously reported work
Debnath (17) studied these CCR5 antagonists (see Tables 3 and 4) with a pharmacophore model. Twenty-five training set compounds were used to construct 10 pharmacophore models. Each of these models contains at least one hydrophobic feature, indicating that the activity is significantly correlated with the hydrophobicity of the molecules. The contribution of the hydrophobic field in our CoMSIA model is 0.460, which shows that the hydrophobic field is the main factor influencing activity among the three fields used. Our CoMSIA model is thus consistent with the previous work. The activities estimated with the pharmacophore model are also gathered in Tables 3 and 4. The correlation between EA and PA is shown in Figure 11A for all activity data [except M61 and M103, whose residuals between PA and EA in Debnath’s result (17) are too large]. The correlation coefficient r2 between EA and PA is 0.330, whereas the corresponding r2 of our CoMSIA model, including the training and test sets, is 0.958 (see Figure 11B). Although this comparison is somewhat biased because of the different training and test sets, we can still conclude that our model has better predictive ability.
The non-linear SVM method can be used to construct an activity model for CCR5 antagonists. An SVM model with eight descriptors (W2, W4, Cw3, Cw4, Cw5, RNCG, HOMO, and LUMO) describes this correlation well. The correlation coefficient r2 between EA and PA of the SVM model for the training set is 0.904. CoMFA and CoMSIA models were then constructed after alignment on the common substructure. The 3D-QSAR investigations suggest that the following features could be favorable to activity: a negatively charged group at position 4 of substituent R3; positively charged and hydrophobic groups at positions 2 and 6 of substituent R3; hydrophobic, positively charged groups of suitable bulk at the substituent of C=N–O for R1; and hydrophobic, negatively charged groups of suitable bulk at position 4 of the phenyl ring of R1. These results can help to make quantitative predictions of activity. Comparing the SVM and 3D-QSAR models, we found that they not only explain but also complement each other. However, the 3D-QSAR models are significantly better than the SVM model. For the same set of compounds, our 3D-QSAR and SVM models give better results than the pharmacophore model.
SYBYL, [Computer Program], Version 6.9, St Louis, MO: Tripos Associates Inc.
This work is supported by the National Natural Science Foundation of China (Grants No. 30770502 and No. 20773085), in part by grants from Ministry of Science and Technology China and by National 863 High-Tech Program (2007DFA31040).