Identification of Novel HIV-1 Integrase Inhibitors Using Shape-Based Screening, QSAR, and Docking Approach


  • Pawan Gupta,

    1. Centre for Pharmacoinformatics, National Institute of Pharmaceutical Education and Research (NIPER), Sector-67, S.A.S. Nagar, 160062 Punjab, India
  • Prabha Garg,

    Corresponding author
    1. Centre for Pharmacoinformatics, National Institute of Pharmaceutical Education and Research (NIPER), Sector-67, S.A.S. Nagar, 160062 Punjab, India
  • Nilanjan Roy

    1. Centre for Pharmacoinformatics, National Institute of Pharmaceutical Education and Research (NIPER), Sector-67, S.A.S. Nagar, 160062 Punjab, India
    2. Department of Biotechnology, National Institute of Pharmaceutical Education and Research (NIPER), Sector-67, S.A.S. Nagar, 160062 Punjab, India

  • This article was published online on 13/2/12. An error was subsequently identified. This notice is included in the online and print versions to indicate that both have been corrected 14/2/12.

Corresponding author: Prabha Garg


The objective of this study is to identify novel HIV-1 integrase (IN) inhibitors. Shape-based screening and QSAR were used to identify candidate inhibitors of HIV-1 IN, and docking studies provided in silico validation. A 2D QSAR model of benzodithiazine derivatives was built using the genetic function approximation (GFA) method, with good internal (cross-validated $r^2$ = 0.852) and external (predictive $r^2$ = 0.650) predictivity. The best docking pose of a highly active benzodithiazine derivative was used as a template for shape-based screening of the ZINC database. Toxicity was predicted with the Deductive Estimation of Risk from Existing Knowledge (DEREK) program to retain only non-toxic molecules. Inhibitory activities of the screened non-toxic molecules were predicted using the derived QSAR models. Molecules predicted to be active and non-toxic were then docked into the active site of HIV-1 IN using the AutoDock and DOCK programs. Some molecules docked in the same manner as the highly active benzodithiazine derivative and showed the same docking interactions in both programs. Finally, four benzodithiazine derivatives were identified as novel HIV-1 integrase inhibitors on the basis of QSAR predictions and docking interactions. ADME properties of these molecules were also computed using Discovery Studio.

Integration of the viral double-stranded DNA into the host chromosome is a mandatory step in the retroviral life cycle. The HIV-1 integrase (IN) enzyme removes a dinucleotide next to a conserved cytosine–adenine sequence from each 3′-end of the viral DNA. IN then catalyzes joining of the processed viral 3′-ends to the 5′-ends of strand breaks in the host DNA. HIV-1 IN has no counterpart in the host cell and is essential for effective viral replication, so inhibitors of this enzyme are of paramount importance for the treatment of HIV infection (1–3). Intense research is being carried out on HIV-1 IN, but only one FDA-approved drug, raltegravir, is available on the market; it is administered in combination with other antiretroviral agents (4–6). The current situation therefore warrants more HIV-1 IN inhibitors with good potency.

Several molecular modeling approaches have been employed in the development of potent HIV-1 IN inhibitors, e.g., QSAR, pharmacophore mapping, and docking studies. Several 3D QSAR studies were performed to gain insight into the structural requirements of HIV-1 IN inhibitors, which can be useful for improving HIV-1 inhibitory activity (7–12). Similarly, 2D QSAR studies on different series of molecules found that electro-topological state indices (13) and spatial, shape, and thermodynamic properties are important for explaining the inhibitory activity of molecules (14,15). A recently published 2D QSAR analysis of 1,3,4-oxadiazole-substituted naphthyridine derivatives showed that the valence connectivity index of order 1, the lowest unoccupied molecular orbital, and dielectric energy significantly affect inhibition of HIV-1 IN activity (16). 2D QSAR modeling of carboxylic acid derivatives revealed that polarizability and mass were the most influential atomic properties for HIV-1 IN inhibitory activity (17). 3D pharmacophore models for HIV-1 IN inhibitors were generated for the design of novel potent inhibitors (18–21). Docking studies of various molecules were performed to explore binding modes using different HIV-1 IN protein structures (PDB IDs 1QS4, 1BL3, 1BIS, 1WKN) (7,22–27).

Shape-based screening (SBS) is now very popular in computer-aided drug design. The basic assumption of SBS is that molecules similar in shape and electrostatic potential to known active molecules may bind to the active site in a similar manner and therefore have a good probability of producing similar activity (28–30). This hypothesis supports the application of shape matching to the identification of new inhibitors with good potency. Perez-Nueno et al. (31) reported that ligand-based shape-matching searches yielded higher enrichments than receptor-based docking, especially for CXCR4.

The present authors have already published 3D QSAR and docking studies of benzodithiazine molecules as HIV-1 IN inhibitors (7). In continuation of those efforts, the objective of this work is to build a 2D QSAR model for the benzodithiazine series and subsequently to perform SBS using the best docked conformation of benzodithiazine 32, in order to identify novel molecules that exhibit good in silico activity in QSAR predictions and a favorable binding mode against HIV-1 IN.

Methods and Materials

Shape-based screening

The ROCS program was used for SBS of the ZINC database. ROCS performs rapid similarity analysis of molecules using a shape-based method. It overlays conformers of a candidate molecule onto a query molecule in one or more conformations, using Gaussian functions to assess the volume overlap of the two molecules. The principal components of inertia of the two molecules (query and candidate) are aligned at their centers of mass, and the overlay is then refined with a solid-body optimization algorithm to maximize volume overlap. According to reported data, a single low-energy conformation can be used as a template for SBS if the bioactive conformation of a bound ligand is not available from a protein structure. Alternatively, a calculated docking pose can be used as a template. No systematic data are available that provide a clear statement or quantitative estimate of the impact of the template conformation(s) on virtual screening performance. ROCS uses both shape and a color force field for optimization of the overlay (29).

The basic scoring function implemented in ROCS is the ShapeTanimoto score, a quantitative measure of the shape overlap of two molecules. Because it focuses on optimum shape overlap alone, chemical functionality is not matched. To obtain an overall alignment (shape as well as chemical functionality), ROCS also implements a color score. Here, we used the Combo score (shape and color) to rank the screened molecules. In this mode, ROCS optimizes the molecular overlay to maximize both the shape overlap and the color overlap, the latter obtained by aligning groups with the same properties as defined in the color force field file. The overlay is then scored using the sum of the color and ShapeTanimoto scores (the so-called Combo score). The ROCS color force field is composed of SMARTS patterns that characterize chemical functions. Six types of chemical functionality are available (hydrogen bond donors, hydrogen bond acceptors, hydrophobes, anions, cations, and rings), together with a basic pKa model assuming pH 7. In this way, ROCS is able to maximize both molecular shape overlay and chemical functionality overlap (29).
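The relationship between Gaussian volume overlap, the shape Tanimoto, and the Combo score can be illustrated with a minimal sketch. This is not the ROCS implementation: it scores one fixed alignment (no overlay optimization), uses only pairwise (first-order) atom overlaps, and the Gaussian parameters `alpha` and `p` are illustrative assumptions. The color score is taken as a given number rather than computed.

```python
import numpy as np

def gaussian_overlap(coords_a, coords_b, alpha=0.8, p=2.7):
    """First-order Gaussian volume overlap between two sets of atom centres.

    Each atom is modelled as a spherical Gaussian p * exp(-alpha * r^2).
    The overlap integral of two such Gaussians separated by distance d is
    p^2 * (pi / (2 * alpha))^(3/2) * exp(-alpha * d^2 / 2).
    alpha and p are illustrative values (assumptions), not ROCS parameters.
    """
    a = np.asarray(coords_a, dtype=float)
    b = np.asarray(coords_b, dtype=float)
    # Squared distances between every atom of A and every atom of B.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    prefactor = p * p * (np.pi / (2.0 * alpha)) ** 1.5
    return float((prefactor * np.exp(-0.5 * alpha * d2)).sum())

def shape_tanimoto(coords_a, coords_b):
    """Tanimoto coefficient of the overlap volumes for a fixed alignment."""
    o_ab = gaussian_overlap(coords_a, coords_b)
    o_aa = gaussian_overlap(coords_a, coords_a)
    o_bb = gaussian_overlap(coords_b, coords_b)
    return o_ab / (o_aa + o_bb - o_ab)

def combo_score(shape_score, color_score):
    """Combo score as described in the text: shape score plus color score."""
    return shape_score + color_score

if __name__ == "__main__":
    query = [[0.0, 0.0, 0.0], [1.5, 0.0, 0.0], [1.5, 1.5, 0.0]]
    candidate = [[0.2, 0.0, 0.0], [1.6, 0.1, 0.0], [1.4, 1.6, 0.1]]
    print("shape Tanimoto:", shape_tanimoto(query, candidate))
```

A molecule compared with itself gives a shape Tanimoto of 1, and two molecules with no spatial overlap give a value near 0, which is why the score is convenient for ranking hits against a single template.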

In this work, the best docking pose of a highly active benzodithiazine derivative (benzodithiazine 32) was used as the template. The Combo score was used to rank the screened molecules, and the best 300 hits were retained as the output of SBS.

QSAR modeling

3D QSAR modeling

3D QSAR analyses (CoMFA and CoMSIA) were performed using SYBYL 7.1 (Tripos Inc., St Louis, MO, USA). All details of the computational methodology are given in the previous publication (7).

2D QSAR modeling

Data set.  The data set of benzodithiazine derivatives was divided into the same training and test sets as in the 3D QSAR study, such that the training set covers the full activity range and the structural features of the data set (7).

Descriptors extraction.  Different types of physicochemical descriptors were generated for each molecule using Dragon 5.3b. Before generating the 2D QSAR models, a dimensionality reduction exercise was carried out to determine which descriptors are biologically relevant. A correlation matrix was constructed in which the correlation of each descriptor with the other descriptors and with the biological activity was determined. Descriptors that had low correlation with biological activity and high correlation with other descriptors were discarded from the matrix. This procedure was repeated for every descriptor in the matrix.

In addition, variance inflation factor (VIF) analysis was performed to check the intercorrelation of descriptors. The VIF is calculated as 1/(1 − $r^2$), where $r^2$ is the squared multiple correlation coefficient of one descriptor regressed on the remaining molecular descriptors. If the VIF is larger than 10, the information carried by a descriptor can be hidden by its correlation with other descriptors (32).
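The VIF calculation described above can be sketched in a few lines. The example below is illustrative only: it uses ordinary least squares from NumPy rather than the software used in the study, and the function name `variance_inflation_factors` and the synthetic descriptor matrix are assumptions made for the demonstration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return VIF_j = 1 / (1 - r_j^2) for every descriptor column of X.

    r_j^2 is obtained by regressing column j (with an intercept) on all
    remaining columns using ordinary least squares.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        centered = y - y.mean()
        r2 = 1.0 - (resid @ resid) / (centered @ centered)
        vifs[j] = 1.0 / (1.0 - r2)
    return vifs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a, b = rng.normal(size=(2, 200))
    c = a + b + 0.05 * rng.normal(size=200)   # nearly a linear combination
    d = rng.normal(size=200)                  # independent descriptor
    X = np.column_stack([a, b, c, d])
    for name, vif in zip("abcd", variance_inflation_factors(X)):
        flag = "intercorrelated" if vif > 10 else "ok"
        print(f"{name}: VIF = {vif:.2f} ({flag})")
```

In this synthetic case the descriptors that take part in the near-linear relationship exceed the threshold of 10 quoted in the text, while the independent descriptor stays close to 1.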

Genetic function approximation (GFA).  After dimensionality reduction of the calculated descriptors using the correlation coefficient, the remaining 211 descriptors were used for 2D QSAR model generation with the GFA method in the Cerius2 program. GFA is a genetics-based method for variable selection that combines Holland’s genetic algorithm (GA) with Friedman’s multivariate adaptive regression splines (MARS) (33,34). The method starts by generating 100 equations (default setting) at random from the selected descriptors. From these 100 equations, pairs of ‘parent’ equations are selected at random, and crossover of the parent equations generates new model equations. The predictive ability of each new model equation is evaluated by Friedman’s lack of fit (LOF) score (eqn 1), and new models replace previous models on the basis of this score. After repeated crossover (default 5000), the 100 best models containing the selected number of descriptors are obtained. The best of these 100 model equations was selected on the basis of statistical parameters such as the regression coefficient ($r^2$), adjusted regression coefficient ($r^2_{adj}$), cross-validated $r^2$ ($q^2$), and F-test values.


$$\mathrm{LOF} = \frac{\mathrm{LSE}}{\left(1 - \dfrac{c + d \times p}{m}\right)^{2}} \qquad (1)$$

where LSE is the least-squares error, c is the number of basis functions in the model, d is the smoothing parameter, p is the number of descriptors, and m is the number of observations in the training set (33).
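As a worked check of the LOF score, the sketch below evaluates the standard form of Friedman's lack of fit, LOF = LSE / (1 − (c + d·p)/m)², against the LSE and LOF values listed in Table 1. Two assumptions are made here and are not stated in the text: the smoothing parameter is d = 1, and the number of basis functions c equals the number of descriptors (one linear term per descriptor). Under those assumptions the reported values are reproduced to within about 0.001, which is consistent with LSE being rounded to three decimals.

```python
def lack_of_fit(lse, c, p, m, d=1.0):
    """Friedman's lack-of-fit score.

    lse: least-squares error, c: number of basis functions,
    p: number of descriptors, m: number of training observations,
    d: smoothing parameter (d = 1 is an assumption made for this sketch).
    """
    denom = 1.0 - (c + d * p) / m
    if denom <= 0.0:
        raise ValueError("model is too large for the number of observations")
    return lse / denom ** 2

# (LSE, reported LOF) for models with 1 to 7 descriptors, taken from Table 1.
TABLE_1 = {
    1: (0.072, 0.082),
    2: (0.054, 0.071),
    3: (0.041, 0.062),
    4: (0.025, 0.045),
    5: (0.012, 0.026),
    6: (0.009, 0.025),
    7: (0.009, 0.029),
}

if __name__ == "__main__":
    m = 31  # training set size
    for k, (lse, reported) in TABLE_1.items():
        # Assumption: one basis function per descriptor, so c = p = k.
        computed = lack_of_fit(lse, c=k, p=k, m=m)
        print(f"{k} descriptors: computed {computed:.4f}, reported {reported:.3f}")
```

The penalty term grows with model size, so the seven-descriptor model scores worse than the six-descriptor model even though both have the same rounded LSE.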

The number of descriptors necessary and sufficient for the QSAR equation was determined first. Taking a brute-force approach, we increased the number of descriptors in the GFA method one at a time and evaluated the effect of each added descriptor on the statistical quality of the model. The cross-validated $r^2$ ($q^2$) was selected as the limiting factor for the number of descriptors to be used in the final model (35). As shown in Figure 1, the $q^2$ value of the best model increases until the number of descriptors in the equation reaches 6 and then declines as the number of descriptors increases further. The number of descriptors was therefore restricted to 6 in the final model. The statistical parameters of the models with increasing numbers of descriptors are shown in Table 1. Additionally, a rule of thumb (36) was used to find the optimum number of descriptors for the data set of 31 molecules (training set) used in the present study. According to this rule, the number of molecules should be at least three times the number of descriptors under consideration, so that enough molecules are included in the analysis to reach statistical significance despite inevitable errors in measurement.

Figure 1. Cross-validated $r^2$ ($q^2$) as a function of number of descriptors.

Table 1.   Statistical evaluation of 2D QSAR models with varying number of descriptors

| Descriptor | Equation | Friedman’s lack of fit | $r^2$ | $r^2_{adj}$ | F-test | LSE | r | $r^2_{BS}$ | $q^2$ |
|---|---|---|---|---|---|---|---|---|---|
| 1 | pIC50 = 4.37 + 0.814 × Mor17m | 0.082 | 0.405 | 0.384 | 19.731 | 0.072 | 0.636 | 0.405 ± 0.013 | 0.305 |
| 2 | pIC50 = −20.314 + 0.477 × Mor17m + 0.989 × LP1 | 0.071 | 0.555 | 0.523 | 17.479 | 0.054 | 0.745 | 0.556 ± 0.014 | 0.451 |
| 3 | pIC50 = −23.343 + 0.334 × Mor17m + 11.669 × LP1 − 0.365 × IC1 | 0.062 | 0.665 | 0.628 | 17.854 | 0.041 | 0.815 | 0.666 ± 0.010 | 0.540 |
| 4 | pIC50 = −11.552 + 0.450 × Mor17m + 7.219 × LP1 − 0.563 × IC1 + 0.211 × C-034 | 0.045 | 0.795 | 0.764 | 25.220 | 0.025 | 0.892 | 0.876 ± 0.003 | 0.699 |
| 5 | pIC50 = −5.485 + 0.383 × Mor17m + 5.156 × LP1 − 0.659 × IC1 + 0.245 × C-034 − 0.589 × GATS7m | 0.026 | 0.903 | 0.883 | 46.414 | 0.012 | 0.950 | 0.904 ± 0.002 | 0.827 |
| **6** | **pIC50 = −2.7382 + 0.411768 × Mor17m + 4.03737 × LP1 − 0.621692 × IC1 + 0.260938 × C-034 − 0.725133 × GATS7m + 1.20922 × MATS4m (PRESS 0.555)** | **0.025** | **0.922** | **0.902** | **46.961** | **0.009** | **0.960** | **0.922 ± 0.001** | **0.852** |
| 7 | pIC50 = −4.317 + 0.407 × Mor17m + 4.643 × LP1 − 0.573 × IC1 + 0.239 × C-034 − 0.782 × GATS7m + 1.092 × MATS4m − 0.080 × GVWAI-80 | 0.029 | 0.928 | 0.906 | 42.167 | 0.009 | 0.963 | 0.929 ± 0.001 | 0.847 |

Items in bold indicate this model was selected for analysis.

Estimation of the predictive ability of a QSAR model.  The parameter $q^2$ is used as a criterion of both the robustness and the predictive ability of the model. Many authors consider a high $q^2$ (for instance, $q^2$ > 0.5) to be an indicator, or even the ultimate proof, that the model is highly predictive (37). In addition to a high $q^2$, a reliable model should also be characterized by a high $r^2$ (≥0.9) (eqn 2), which establishes the goodness of fit for the studied data set.


$$r^2 = 1 - \frac{\sum (Y - Y_{Pred})^2}{\sum (Y - \bar{Y})^2} \qquad (2)$$

where $Y$ and $Y_{Pred}$ are the observed and predicted activity values, respectively, of the training set and $\bar{Y}$ is the mean activity value of the training set.

In addition to these statistical parameters, one more parameter, $r^2_{pred}$, was used to evaluate the predictive ability of the model for the test set molecules. It has been suggested that the activities and structures of the test set molecules must be covered by the range of activities and structures of the training set molecules. This requirement is necessary for obtaining reliable predictions when comparing the observed and predicted activities of these molecules (37). If external molecules (test set and newly screened molecules) have structural features similar to those found in the training data, the developed model should predict them well (37,38). Similarity should therefore be considered when splitting the data set during QSAR development. The $r^2_{pred}$ was calculated using eqn 3.


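$$r^2_{pred} = 1 - \frac{\sum (Y_{test} - Y_{pred(test)})^2}{\sum (Y_{test} - \bar{Y}_{training})^2} \qquad (3)$$

where $Y_{test}$ and $Y_{pred(test)}$ are the observed and predicted activity values, respectively, of the test set molecules and $\bar{Y}_{training}$ is the mean activity value of the training set.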

To further check the predictive ability of the developed model, a randomization test was conducted to evaluate the statistical significance of the relationship between anti-HIV-1 IN activity and the chemical structure descriptors. The test was performed by repeatedly permuting the activity values of the data set, using the permuted values to generate QSAR models, and comparing the resulting scores with the score of the original QSAR model generated from the non-randomized activity values. If the original QSAR model is statistically significant, its score should be significantly better than those obtained from permuted data (39). The randomization test was performed at the 90%, 95%, 98%, and 99% confidence levels with 9, 19, 49, and 99 trials, respectively. The higher the confidence level, the more randomization trials are run.
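The link between the number of trials and the confidence level follows from simple counting: if the original model scores better than all N permuted models, it ranks first among N + 1, giving a confidence of 1 − 1/(N + 1), i.e. 90%, 95%, 98%, and 99% for 9, 19, 49, and 99 trials. The sketch below illustrates the procedure under a simplifying assumption: each permuted data set is refitted by ordinary least squares on a fixed descriptor set, whereas the study regenerated the models with GFA. The synthetic data and function names are hypothetical.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

def randomization_test(X, y, n_trials=19, seed=0):
    """Permute the activities n_trials times and compare r^2 values.

    Returns the original r^2, the array of r^2 values from permuted data,
    and the confidence level 1 - (1 + n_better) / (n_trials + 1), where
    n_better counts permuted models scoring at least as well as the original.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r2_original = r_squared(X, y)
    r2_random = np.array(
        [r_squared(X, rng.permutation(y)) for _ in range(n_trials)]
    )
    n_better = int((r2_random >= r2_original).sum())
    confidence = 1.0 - (1 + n_better) / (n_trials + 1)
    return r2_original, r2_random, confidence

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(31, 6))                      # 31 molecules, 6 descriptors
    true_coef = np.array([1.0, -2.0, 0.5, 1.5, -1.0, 2.0])
    y = X @ true_coef + 0.1 * rng.normal(size=31)     # synthetic activities
    for trials in (9, 19, 49, 99):
        r2, r2_rand, conf = randomization_test(X, y, n_trials=trials)
        print(f"{trials} trials: original r2 = {r2:.3f}, "
              f"best random r2 = {r2_rand.max():.3f}, confidence = {conf:.2f}")
```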

QSAR prediction of screened molecules

Predictivity is highly dependent on the test set and requires two points of methodology to be clarified: the method(s) used to extract the test set from the training set, and the method used to determine the applicability domain (AD). The first was addressed by calculating $r^2_{pred}$ for the test set, while the second relates to a quantitative measure of distance to the domain (40–42). The AD of the developed 2D and 3D QSAR models was checked for new molecules, because only predictions for chemicals falling within the AD can be considered reliable. In 3D QSAR, predictions are most reliable for molecules whose descriptor values are similar to those of the training set from which the model was built. SYBYL assesses the degree of similarity by comparing the value of each descriptor for a molecule against the range of values found in the training set. The number of out-of-range descriptors for a molecule is reported, along with the total contribution of these extrapolated points to the prediction. The ‘sum of extrapolated term (SUM)’ is the total contribution made to the prediction for a molecule by its out-of-range descriptors. In addition, the standard error of prediction (SEP) was calculated during 3D QSAR model generation. If SUM is larger than the SEP of the cross-validated model, the extrapolation is probably too far outside the model to give a reliable prediction. A reliable prediction for a new molecule therefore requires SUM < SEP (as described in the SYBYL manual). SUM was calculated for each test and screened molecule.

Similarly, the domain of applicability must be defined for the 2D QSAR model before predicting screened molecules. In 2D QSAR, the domain of applicability was defined using the k-NN (k-Nearest Neighbor) method (40). In this method, the Euclidean distance (ED) from each training set molecule to its nearest neighbor (k = 1) was calculated using the model descriptors. A distance matrix of the whole training set was built, and the minimum ED was determined for each molecule. The range of these minimum ED values was taken as the descriptor domain of the training data. If the minimum ED of a predicted molecule from the training set falls within this range, the prediction can be considered reliable; otherwise it is extrapolated outside the domain. Another approach, the extent of extrapolation (43), was also used to define the applicability domain (AD). It is based on the leverage h(x) of each molecule (42), which measures the influence of the molecule on the model. The leverage of a molecule is defined as:


$$h(x) = x^{T}(X^{T}X)^{-1}x$$

where x represents the test molecule in centered descriptor space, X is the training data matrix whose N rows represent the training molecules in the centered descriptor space, and T denotes the transpose.

‘Centered’ means that the grand mean of the training data is taken as the origin of the descriptor space. Leverage values can be calculated for training, test, and newly screened molecules. For training set molecules, leverage values identify those molecules that may have influenced the model parameters to a marked extent. For test and screened molecules, they indicate the AD of the model. The warning leverage (h*) is 3*k/N, while high leverage (h#) is defined as >2*k/N, where N is the number of molecules in the training set and k is the number of descriptors in the model. When the leverage of a molecule is lower than h*, the prediction is considered to be interpolated within the domain of the training set chemicals and hence reliable. Conversely, a test set molecule with leverage greater than h* is structurally distant from the training chemicals, so its prediction is extrapolated outside the AD of the model and is unreliable. Such predictions carry increased uncertainty and must be used with great care (42,44).
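The leverage calculation and the two thresholds can be sketched as follows. This is an illustration with NumPy on synthetic descriptors, not the procedure used in the study; the function names are hypothetical and a pseudo-inverse is used as a precaution against a singular descriptor matrix. With k = 6 descriptors and N = 31 training molecules, the thresholds evaluate to 3k/N ≈ 0.581 and 2k/N ≈ 0.387, matching the limits quoted in the caption of Table 4.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h(x) = x^T (X^T X)^-1 x in centered descriptor space.

    Both the training matrix and the query molecules are centered on the
    grand mean of the training data.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.atleast_2d(np.asarray(X_query, dtype=float))
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    xq = X_query - mean
    xtx_inv = np.linalg.pinv(Xc.T @ Xc)
    # Quadratic form x^T A x evaluated row by row.
    return np.einsum("ij,jk,ik->i", xq, xtx_inv, xq)

def warning_leverage(k, n):
    """Warning leverage h* = 3k/N."""
    return 3.0 * k / n

def high_leverage(k, n):
    """High-leverage limit h# = 2k/N."""
    return 2.0 * k / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(31, 6))      # synthetic training descriptors
    X_new = rng.normal(size=(5, 6)) * 2.0   # synthetic screened molecules
    h_star = warning_leverage(6, 31)
    for i, h in enumerate(leverages(X_train, X_new)):
        status = "inside AD" if h < h_star else "outside AD"
        print(f"molecule {i}: h = {h:.3f} ({status}, h* = {h_star:.3f})")
```

A useful sanity check is that the leverages of the training molecules sum to k, the number of descriptors, when the centered descriptor matrix has full column rank.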

The activities of all screened non-toxic molecules (toxicity predictions are given below) were predicted with the 2D and 3D QSAR models, and each molecule was checked against the domain of applicability at the same time. The 22 molecules predicted to be active were docked into the active site of HIV-1 IN. ADME properties of the molecules with good binding interactions were also checked.
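The k-NN applicability-domain check used for the 2D QSAR model can be sketched as below. The interpretation is an assumption made for illustration: a query molecule is treated as inside the domain when its distance to the nearest training molecule does not exceed the largest nearest-neighbor distance observed within the training set. The toy descriptor values and function names are hypothetical.

```python
import numpy as np

def nearest_neighbor_distances(X_train):
    """Euclidean distance from each training molecule to its nearest neighbor (k = 1)."""
    X = np.asarray(X_train, dtype=float)
    d = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))
    np.fill_diagonal(d, np.inf)   # ignore the zero distance of a molecule to itself
    return d.min(axis=1)

def in_domain(X_train, X_query):
    """Flag query molecules whose nearest training neighbor is close enough.

    Assumption: the upper end of the training nearest-neighbor distance
    range is used as the cut-off.
    """
    X = np.asarray(X_train, dtype=float)
    Xq = np.atleast_2d(np.asarray(X_query, dtype=float))
    cutoff = nearest_neighbor_distances(X).max()
    dq = np.sqrt(((Xq[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)).min(axis=1)
    return dq <= cutoff

if __name__ == "__main__":
    train = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    queries = [[0.5, 0.5], [5.0, 5.0]]
    print(in_domain(train, queries))
```

In practice the descriptors would usually be scaled before computing distances, since Euclidean distance is dominated by descriptors with large numerical ranges.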

Toxicity and ADME studies

Screened molecules were first filtered for non-toxicity using Deductive Estimation of Risk from Existing Knowledge (DEREK). This is a rule-based expert system that incorporates a SAR approach. It works by matching structural entities in a query structure with predetermined knowledge-based structural alerts (termed toxicophores) that are associated with different toxicity end-points. DEREK predicts a number of toxicity end-points, including carcinogenicity, genotoxicity (mutagenicity and chromosome damage), hepatotoxicity, teratogenicity, ocular toxicity, thyroid toxicity, reproductive toxicity, respiratory sensitization, and skin sensitization, for a number of species (E. coli, dog, guinea-pig, mouse, mammal, rabbit, and rat) and also provides an indication of the likelihood of each predicted adverse effect. The activities of the 69 screened non-toxic molecules were predicted with the developed 2D and 3D QSAR models. ADME properties [AlogP98, absorption (95% and 99% level), polar surface area, blood brain barrier, and solubility] were calculated with the ADME descriptors tool in Discovery Studio 2.5.

Docking studies of screened molecules

Docking studies of benzodithiazine 32 were performed using the AutoDock and DOCK programs (the AutoDock results were published previously) (7). Two programs were used to check whether both give the same binding pose; if they do, that pose can be regarded as the best possible binding pose of benzodithiazine 32. The programs use different search algorithms and different scoring functions to position the appropriate ligand conformation in the active site: AutoDock uses the Lamarckian genetic algorithm (LGA) (45), while DOCK uses an incremental construction algorithm (matching algorithm) (46,47).

In each program, grid files were generated before the docking run. The AutoGrid program was used to precalculate the energy of a specific ‘probe’ atom at regular points over a 3D space around the receptor. These energies are saved in ‘grid maps’. There is one grid map for each atom type in the ligand, plus an electrostatics map and a desolvation map. The grid parameter file specifies the 3D search space by setting the number of points in each dimension, the center of the grid, and the spacing between points. It also specifies the types of probe atoms to use, the filename of the receptor, and the names of the output grid maps. These grid files were then used for scoring and ranking ligand conformations in the active site. In AutoDock docking, all parameters were kept as in the previous publication (7).

In DOCK, the grid files have somewhat different characteristics. Prior to grid generation, spheres must be generated from the molecular surface of the protein (using SPHGEN (48)), and the spheres belonging to the ligand-binding active site are then selected. The grid program creates the grid files necessary for rapid score evaluation in DOCK. Most evaluations are carried out on (scoring) grids to minimize the overall computational time. At each grid point, the enzyme contributions to the score are stored. The grid program also computes a bump grid, which identifies whether a ligand atom is in severe steric overlap with a receptor atom. The ligand–enzyme binding energy is taken to be approximately the sum of the van der Waals attractive, van der Waals dispersive, and Coulombic electrostatic energies. Approximations are made to the usual molecular mechanics attractive and dispersive terms for use on a grid. To generate the energy score, the ligand atom terms are combined with the receptor terms from the nearest grid point, or with receptor terms from a ‘virtual’ grid point with interpolated receptor values. The score is the sum of these combined terms over all ligand atoms. The energy score is thus determined by both the ligand atom types and the ligand atom positions on the energy grids.

Here, the DOCK run was performed twice (named dock-1 and dock-2) for the screened molecules. In dock-1, active site spheres were selected on the basis of the co-crystallized ligand (5-CITEP) of PDB ID 1QS4, while in dock-2, spheres were selected around the best docked conformation of benzodithiazine 32. For each ligand, 30 poses were generated, ranked, and scored in both programs. This repeated docking was performed to check whether all docking runs give the same binding pose. When a molecule converged to the same binding pose as benzodithiazine 32 in all docking runs, the predicted pose was considered the best possible pose of that molecule. These binding poses were selected on the basis of interaction-based accuracy classification (49), as also used in a previous publication (50).
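The idea of scoring on a precomputed grid, either from the nearest grid point or from an interpolated ‘virtual’ grid point, can be illustrated with a simplified sketch. It is not the DOCK or AutoDock implementation: a single scalar receptor field stands in for the separate per-atom-type and electrostatic grids, and the field, grid dimensions, and function names are hypothetical.

```python
import numpy as np

def nearest_point_value(grid, origin, spacing, point):
    """Value of the receptor field at the grid point nearest to `point`."""
    g = (np.asarray(point, dtype=float) - np.asarray(origin, dtype=float)) / spacing
    idx = np.clip(np.rint(g).astype(int), 0, np.array(grid.shape) - 1)
    return float(grid[idx[0], idx[1], idx[2]])

def trilinear_value(grid, origin, spacing, point):
    """Trilinearly interpolated value of the receptor field at `point`."""
    g = (np.asarray(point, dtype=float) - np.asarray(origin, dtype=float)) / spacing
    i0 = np.clip(np.floor(g).astype(int), 0, np.array(grid.shape) - 2)
    f = g - i0   # fractional position inside the grid cell
    value = 0.0
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                weight = ((f[0] if dx else 1.0 - f[0])
                          * (f[1] if dy else 1.0 - f[1])
                          * (f[2] if dz else 1.0 - f[2]))
                value += weight * grid[i0[0] + dx, i0[1] + dy, i0[2] + dz]
    return float(value)

def grid_score(grid, origin, spacing, ligand_coords, interpolate=True):
    """Sum the grid contribution over all ligand atoms."""
    lookup = trilinear_value if interpolate else nearest_point_value
    return sum(lookup(grid, origin, spacing, atom) for atom in ligand_coords)

if __name__ == "__main__":
    spacing = 0.5
    origin = np.zeros(3)
    coords = np.indices((5, 5, 5)) * spacing
    # Hypothetical receptor field: a linear function of position.
    field = coords[0] + 2.0 * coords[1] + 3.0 * coords[2]
    ligand = [[0.3, 0.7, 1.1], [1.0, 1.0, 1.0]]
    print("interpolated:", grid_score(field, origin, spacing, ligand))
    print("nearest point:", grid_score(field, origin, spacing, ligand, interpolate=False))
```

Interpolation costs eight lookups per atom instead of one but removes the discontinuities that nearest-point lookup introduces when an atom crosses a cell boundary.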

Results and Discussion

Docking studies

Docking studies of benzodithiazine 32 were performed using the AutoDock and DOCK programs. The two programs gave the same binding pose, in which benzodithiazine 32 formed hydrogen bonds with Cys65, His67, and Asn155 (Table 2 and Figure 2). In this pose, the sulfonamide moiety, thiol group, and pyrimidine ring nitrogen formed a coordination ring around Mg2+, chelating it more strongly than the carboxyl moieties of Asp64 and Asp116 and thereby inhibiting HIV-1 IN activity (7). In our previous study (7), we also used the FlexX program, which gave the same binding pose as the DOCK program. This docking pose was used as the reference pose for analysis of the docking results of the screened molecules.

Table 2.   Docking interactions of benzodithiazine 32 using AutoDock and DOCK program

| Interactions | Score (kcal/mol) | Interactions | Score (kcal/mol) |
|---|---|---|---|
| Cys65, His67, Asn155, Lys156, Lys159, Mg | | Cys65, His67, Asn155, Lys156, Mg | |

Figure 2. Docking pose of benzodithiazine 32 in the active site of HIV-1 IN (1QS4); AutoDock conformation as magenta sticks and DOCK conformation as green sticks; the magenta ball represents the magnesium ion.

Shape-based screening

The ZINC database contained 5.3 million drug-like molecules at the time of screening (it now contains >10 million). The best docking pose of benzodithiazine 32 was used as the template for SBS (Figure 3) with the ROCS program. The Combo score was used to rank the screened molecules, and the best 300 hits were retained as the output of SBS. The screened molecules were checked carefully to remove lower-ranked duplicates (the same molecule scored more than once in different conformations). Interestingly, most of the screened molecules share the scaffold of benzodithiazine 32 but differ in substitution (Figure 4). These molecules were subjected to toxicity studies before the QSAR studies.

Figure 3. (A) Query molecule (benzodithiazine 32 docked conformation), (B) overlapped ZINC molecules after shape-based screening.

Figure 4. Structures of benzodithiazine 32 and best docked molecules after screening from ZINC database.

Toxicity studies

The selected screened molecules were subjected to toxicity and ADME studies. For the toxicity studies in DEREK, different animal species models were selected, each giving toxicological end-points for particular toxicities (approximately 10 different toxicity end-points were checked). A molecule that triggered any of these end-points was considered toxic; otherwise it was considered non-toxic. These studies yielded 69 molecules that were non-toxic in every model. Of these 69 non-toxic molecules, only 22 exhibited good QSAR predictions (as described below). These 22 molecules were checked for ADME properties using Discovery Studio. All of the molecules were interpolated within the prediction space of the ADME descriptors, except ZINC10093792, ZINC11684978, ZINC10547407, ZINC02651882, and ZINC10093812 (see Figure S1 in supporting information). ZINC07558742 was found on the boundary of the prediction space of the ADME descriptors.

QSAR modeling

3D QSAR modeling

In previous work, CoMFA and CoMSIA studies were performed on the same series. For alignment of all molecules, the best docked conformation of a highly active molecule of the series (benzodithiazine 32) was used as the template, because no X-ray crystal structure of the protein in complex with a molecule of this series was found in the PDB. The data set of 41 molecules was divided randomly into a test set of 10 and a training set of 31 molecules. Molecule 14 appeared as an outlier during 3D QSAR model generation because it bound in the active site differently from the other molecules of the series [described in detail in reference (7)]. The best CoMFA model was then obtained with cross-validated $q^2$ = 0.728, non-cross-validated $r^2$ = 0.934, and predictive $r^2_{pred}$ = 0.708. The best CoMSIA model was obtained with cross-validated $q^2$ = 0.794, non-cross-validated $r^2$ = 0.928, and predictive $r^2_{pred}$ = 0.59. The models were statistically robust, with good internal and external validation. The steric (CoMFA) and hydrophobic (CoMSIA) fields were found to be more important than the other fields for binding of these molecules in the active site (7). These models can also be used to screen new molecules by predicting their HIV-1 IN inhibitory activity.

2D QSAR modeling

Molecule 14 was again found to be an outlier, as in the 3D QSAR study (described below). After removing this outlier, the final 2D QSAR model (eqn 4) was built using six descriptors.


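pIC50 = −2.7382 + 0.411768 × Mor17m + 4.03737 × LP1 − 0.621692 × IC1 + 0.260938 × C-034 − 0.725133 × GATS7m + 1.20922 × MATS4m   (4)

n = 31; LOF = 0.025; $r^2$ = 0.922; $r^2_{adj}$ = 0.902; F-test = 46.961; LSE = 0.009; r = 0.960; $r^2_{BS}$ = 0.922 ± 0.001; $q^2$ = 0.852; $r^2_{pred}$ = 0.650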

Here, n is the number of molecules in the training set; LOF is Friedman’s lack of fit score (eqn 1), which is used to assess the goodness of each progeny equation; $r^2$ is the regression coefficient; $r^2_{adj}$ is the adjusted regression coefficient; the F-test is a variance-related statistic that compares two models differing by one or more variables to see whether the more complex model is more reliable than the less complex one, and the model is considered good if the F-test value is above a threshold; LSE is the least-squares error; r is the correlation coefficient; the bootstrap $r^2$ ($r^2_{BS}$) is the average squared correlation coefficient calculated during the validation procedure, and a high bootstrap $r^2$ with a low standard deviation indicates the robustness of the model; $q^2$ is the squared correlation coefficient generated during cross-validation; and $r^2_{pred}$ measures the predictive power of the model, i.e., how well it predicts the activity of molecules not included in the training set.
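Two of the reported statistics can be recomputed from $r^2$, n, and the number of descriptors p using the usual textbook definitions, which are assumed here to be the ones applied by the modeling software: $r^2_{adj}$ = 1 − (1 − $r^2$)(n − 1)/(n − p − 1) and F = ($r^2$/p) / ((1 − $r^2$)/(n − p − 1)). With $r^2$ = 0.922, n = 31, and p = 6, the sketch below gives an adjusted $r^2$ of about 0.902 and an F value of about 47, close to the reported 46.961; the small difference is consistent with $r^2$ being rounded to three decimals.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r^2 for a model with p descriptors fitted to n molecules."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

def f_statistic(r2, n, p):
    """Overall F statistic of the regression (p and n - p - 1 degrees of freedom)."""
    return (r2 / p) / ((1.0 - r2) / (n - p - 1))

if __name__ == "__main__":
    r2, n, p = 0.922, 31, 6   # values reported for the final six-descriptor model
    print(f"adjusted r2 = {adjusted_r2(r2, n, p):.4f}")
    print(f"F = {f_statistic(r2, n, p):.2f}")
```

The same two functions reproduce the one-descriptor row of Table 1 ($r^2$ = 0.405 gives an adjusted value near 0.384 and F near 19.7), which supports the assumption about the definitions.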

The developed QSAR model showed that the properties GATS7m, MATS4m, Mor17m, IC1, LP1, and C-034 are important for defining the HIV-1 IN inhibitory activity of benzodithiazine derivatives (see Table 3 for descriptor definitions). These properties explained 92.2% ($r^2$) and 90.2% ($r^2_{adj}$) of the total variance in the HIV-1 IN inhibitory activity data. During model development, molecule 14 appeared as an outlier in both the training and test sets. When this molecule was kept in the test set, the model gave $q^2$ = 0.852 and $r^2_{pred}$ = 0.340. When it was removed from the test set, $r^2_{pred}$ increased markedly to 0.650. The reason for molecule 14 being an outlier was described in reference 7: it binds in the active site of HIV-1 IN differently from most molecules of the series [the results of docking studies of molecule 14 are already published (7)].
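Because the final model is a linear equation, applying it to a new molecule is a weighted sum of its six descriptor values. The sketch below uses the coefficients of the six-descriptor model listed in Table 1; the descriptor values in the example are hypothetical placeholders, not Dragon output for any real molecule.

```python
# Coefficients of the selected six-descriptor model (Table 1, model 6).
INTERCEPT = -2.7382
COEFFICIENTS = {
    "Mor17m": 0.411768,
    "LP1": 4.03737,
    "IC1": -0.621692,
    "C-034": 0.260938,
    "GATS7m": -0.725133,
    "MATS4m": 1.20922,
}

def predict_pic50(descriptors):
    """Predict pIC50 from a mapping of descriptor name to value."""
    missing = set(COEFFICIENTS) - set(descriptors)
    if missing:
        raise KeyError(f"missing descriptors: {sorted(missing)}")
    return INTERCEPT + sum(
        coef * descriptors[name] for name, coef in COEFFICIENTS.items()
    )

if __name__ == "__main__":
    # Hypothetical descriptor values, for illustration only.
    example = {"Mor17m": 0.5, "LP1": 2.0, "IC1": 1.8,
               "C-034": 1.0, "GATS7m": 0.9, "MATS4m": 0.1}
    print(f"predicted pIC50 = {predict_pic50(example):.3f}")
```

A prediction obtained this way is only meaningful for molecules inside the applicability domain of the model, as discussed in the methods.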

Table 3.   Definition of descriptors used in final 2D QSAR model developed by genetic function approximation method
Descriptor class | Descriptor name | Description | Type(a)
2D autocorrelation | MATS4m | Moran autocorrelation, lag 4, weighted by atomic masses | 2D
2D autocorrelation | GATS7m | Geary autocorrelation, lag 7, weighted by atomic masses | 2D
3D-MoRSE | Mor17m | 3D-MoRSE signal 17, weighted by atomic masses | 3D
Atom-centered fragments | C-034 | R--CR..X (-- represents an aromatic bond; .. represents aromatic single bonds) | 2D
Eigenvalue-based indices | LP1 | Lovasz–Pelikan index | 2D
Information indices | IC1 | Neighborhood information content | 2D

(a) Dimension; all descriptors are 2D and topological in nature, except Mor17m (3D).

To further validate the robustness and predictability of the model, the leave-one-out cross-validated r2 and the fivefold cross-validated r2 were evaluated with the PLS algorithm using the same data set. The leave-one-out and fivefold cross-validated r2 values for the generated model were 0.8521 and 0.8516, respectively, and the r2 and predictive r2 values were 0.9215 and 0.6387, respectively. These internal and external predictions indicate that the model is robust and satisfactory for predicting the activity of the test set. The actual and predicted inhibitory activities (pIC50) of the training and test set molecules (from the GFA model) are shown in Table 4. Scatter plots of actual versus predicted pIC50 values for the training and test sets are given in Figure 5A and B, respectively.
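
To make the leave-one-out statistic concrete, the following minimal sketch computes r2 and the leave-one-out cross-validated r2 for a small regression model. It deliberately swaps in ordinary least squares (solved through the normal equations) for the PLS algorithm used in the study, and the descriptor values and activities are hypothetical toy numbers, not data from Table 4.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def fit(X, y):
    """Ordinary least squares with an intercept (normal equations)."""
    Z = [[1.0] + list(row) for row in X]
    k = len(Z[0])
    xtx = [[sum(z[a] * z[b] for z in Z) for b in range(k)] for a in range(k)]
    xty = [sum(z[a] * yi for z, yi in zip(Z, y)) for a in range(k)]
    return solve(xtx, xty)


def predict(coef, row):
    return coef[0] + sum(c * v for c, v in zip(coef[1:], row))


def r2_and_loo_q2(X, y):
    """Return (fitted r2, leave-one-out cross-validated r2)."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    coef = fit(X, y)
    ss_res = sum((yi - predict(coef, row)) ** 2 for row, yi in zip(X, y))
    press = 0.0  # predictive residual sum of squares
    for i in range(len(y)):
        coef_i = fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:])  # refit without molecule i
        press += (y[i] - predict(coef_i, X[i])) ** 2
    return 1.0 - ss_res / ss_tot, 1.0 - press / ss_tot


# Hypothetical two-descriptor data set (illustration only).
X_demo = [[0.2, 1.1], [0.5, 0.9], [0.9, 1.4], [1.3, 0.7],
          [1.8, 1.6], [2.1, 1.0], [2.6, 1.9], [3.0, 1.2]]
y_demo = [4.1, 4.6, 5.0, 5.9, 6.1, 6.9, 7.2, 8.0]
print(r2_and_loo_q2(X_demo, y_demo))
```

Because each left-out molecule is predicted by a model that never saw it, the cross-validated value cannot exceed the fitted r2; a small gap between the two, as reported here (0.9215 versus 0.8521), is the usual sign of a stable model.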

Table 4.   Actual and predicted pIC50 values by 2D QSAR model for training and test sets (leverage limit h* = 0.581; h# = 0.387)
Molecule no. | Actual pIC50 | Cerius2 Pred. pIC50 | h(x) | Molecule no. | Actual pIC50 | Cerius2 Pred. pIC50 | h(x)

(a) Test set molecules.

(b) Outlier molecule deleted. [Correction made here after initial online publication: Columns 4 and 5 were interchanged - 14 February 2012]

Figure 5.

 Scatter plot of actual versus predicted pIC50 values for the (A) training and (B) test sets.

One further approach was used to establish the contribution of each descriptor to the developed model (eqn 4). Each descriptor in turn was removed from the final equation, and the r2 value for the whole data set was checked for a drop. This procedure was repeated for all six descriptors of the final equation, and the r2 value after each removal was recorded. Table S1 in the supporting information shows the results of this analysis. Removing IC1 or C-034 produced the largest drop in r2 (0.691 and 0.736, respectively), which means that these two descriptors contribute most to defining the HIV-1 IN inhibitory activity in the QSAR model equation.

Similarly, the statistical significance of the 2D QSAR model was assessed using a randomization test, which probes the relationship between the anti-HIV-1 IN activity and the chemical structure descriptors. The randomization tests were performed at 90% (9 trials), 95% (19 trials), 98% (49 trials), and 99% (99 trials) confidence levels by repeatedly permuting the dependent variable set. The r value of the original model was much higher than that of any trial using permuted data, showing that the developed model is statistically significant and robust. The results of the randomization test at the various confidence levels are shown in Table 5.

Table 5.   Results of randomization test
Confidence level | 90% | 95% | 98% | 99%
Total trials | 9 | 19 | 49 | 99
r from non-random | 0.960 | 0.960 | 0.960 | 0.960
Random r’s > non-random | 0 | 0 | 0 | 0
Random r’s < non-random | 9 | 19 | 49 | 99
Mean value of r from random trials | 0.370 | 0.455 | 0.427 | 0.438
Standard deviation of random trials | 0.083 | 0.125 | 0.121 | 0.120
Standard deviations from non-random r to mean | 7.13 | 4.03 | 4.41 | 4.36
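
The logic of the randomization test summarized in Table 5 can be illustrated with a short sketch. It is a simplification under stated assumptions: a single hypothetical descriptor stands in for the six-descriptor model, so the absolute Pearson r plays the role of the model r, and the data, seed, and function names are invented for illustration.

```python
import math
import random
import statistics


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def randomization_test(x, y, n_trials, seed=0):
    """Permute the dependent variable n_trials times and compare |r| values."""
    rng = random.Random(seed)
    r_true = abs(pearson_r(x, y))
    r_random = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the structure-activity pairing
        r_random.append(abs(pearson_r(x, y_perm)))
    mean_r = statistics.fmean(r_random)
    sd_r = statistics.stdev(r_random)
    return {
        "r_true": r_true,
        "n_random_higher": sum(r > r_true for r in r_random),
        "mean_random_r": mean_r,
        "sd_random_r": sd_r,
        # Last row of Table 5: distance of the true r from the random mean, in SD units.
        "sd_units_from_mean": (r_true - mean_r) / sd_r,
    }


# Hypothetical single-descriptor data for 31 molecules (illustration only).
x = [0.1 * i for i in range(31)]
y = [5.0 + 0.8 * xi + 0.05 * ((i * 7) % 5 - 2) for i, xi in enumerate(x)]
for trials in (9, 19, 49, 99):
    result = randomization_test(x, y, trials)
    print(trials, result["n_random_higher"], round(result["sd_units_from_mean"], 2))
```

A model is taken as significant at a given confidence level when none of the permuted trials reaches the r of the unpermuted data, which is the pattern reported in Table 5.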

The intercorrelation of the descriptors was also calculated, because the descriptors in the final equation should not be intercorrelated with each other. High intercorrelation among descriptors can result in a highly unstable model that is not statistically significant. The correlation matrix for the descriptors used is shown in Table 6. The variance inflation factor (VIF) was calculated to check the intercorrelation of the descriptors. In this model, the VIF values of the descriptors are 1.133 (Mor17m), 1.000 (LP1), 1.000 (IC1), 1.167 (C-034), 1.033 (GATS7m), and 1.156 (MATS4m). From the VIF analysis, it is clear that the descriptors used in the final model have very low intercorrelation.
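
As an illustration of how such VIF values arise, the sketch below computes them as the diagonal elements of the inverse of the descriptor correlation matrix, which is equivalent to 1/(1 - R2) for each descriptor regressed on the others. The three descriptor columns are hypothetical placeholders, not values from the data set, and the helper names are illustrative.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient between two descriptor columns."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(M)
    A = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for r in range(n):
            if r != col:
                f = A[r][col]
                A[r] = [v - f * w for v, w in zip(A[r], A[col])]
    return [row[n:] for row in A]


def vif(columns):
    """VIF of each descriptor: diagonal of the inverse correlation matrix."""
    k = len(columns)
    corr = [[pearson(columns[i], columns[j]) for j in range(k)] for i in range(k)]
    inv = invert(corr)
    return [inv[j][j] for j in range(k)]


# Hypothetical descriptor columns (one list per descriptor).
d1 = [0.12, -0.05, 0.30, 0.08, -0.21, 0.17, 0.02, -0.11]
d2 = [1.9, 2.3, 2.1, 2.6, 1.8, 2.4, 2.0, 2.5]
d3 = [2.11, 2.05, 2.20, 2.15, 2.02, 2.09, 2.18, 2.07]
print([round(v, 3) for v in vif([d1, d2, d3])])
```

A VIF of 1 means a descriptor is uncorrelated with all the others; the values reported for the final model (1.000 to 1.167) sit very close to that ideal.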

Table 6.   Correlation matrix of the descriptors used in final 2D QSAR equation

In 2D QSAR modeling, the same training and test sets were used as in the 3D QSAR model, but the similarity of the training and test sets was assessed again during the 2D QSAR analysis. The k-NN (k-nearest neighbor) method was used to define the domain of applicability. In this method, the ED from each training set molecule to its nearest neighbor (k = 1) was calculated using the model descriptors. The resulting range of nearest-neighbor EDs for the training data (0.124–1.01) was used as the domain of applicability of the 2D model. To check the predictability of the QSAR model for the test set molecules, the EDs from each test set molecule to its nearest training set neighbor were calculated in the same way (0.102–0.378). The maximum nearest-neighbor ED of the test set molecules (0.378) is less than the maximum nearest-neighbor ED of the training set molecules (1.01), so the test set falls within the AD. The test molecules therefore occupy the same domain as that generated by the training data, and the model is statistically reliable for predicting them. In addition, the h(x) value of each test set molecule was within the limits (warning leverage h* = 0.581 and high leverage limit h# = 0.387), which again supports the reliability of the test set predictions from the 2D QSAR model (Table 4).
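
The nearest-neighbor check described above can be sketched as follows, assuming that ED denotes the Euclidean distance in descriptor space and that a query molecule is accepted when its nearest training neighbor is no farther away than the largest nearest-neighbor distance seen within the training set. The descriptor vectors and function names are hypothetical.

```python
import math


def nearest_distance(point, others):
    """Euclidean distance from a point to its nearest neighbor (k = 1)."""
    return min(math.dist(point, other) for other in others)


def ad_range(train):
    """(min, max) nearest-neighbor distance among the training molecules."""
    nn = [nearest_distance(p, train[:i] + train[i + 1:])
          for i, p in enumerate(train)]
    return min(nn), max(nn)


def in_domain(query, train):
    """True if the query lies no farther from the training set than the AD limit."""
    _, upper = ad_range(train)
    return nearest_distance(query, train) <= upper


# Hypothetical descriptor vectors (two descriptors per molecule).
train_set = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
print(ad_range(train_set))
print(in_domain([5.0, 0.0], train_set))   # close to a training molecule
print(in_domain([10.0, 0.0], train_set))  # far from every training molecule
```

With the reported values, the same comparison reduces to checking that 0.378, the largest test set distance, does not exceed 1.01, the upper end of the training range.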

The generated 2D and 3D (CoMFA and CoMSIA) QSAR models were used to predict the activity of the screened molecules, which have similar shape and chemical features. Twenty-two molecules were predicted to be active by both the 2D and 3D QSAR models (see Table 7). The SUM was checked for each screened molecule to determine the total contribution of out-of-range descriptors to the final predictions. These molecules have very few out-of-range descriptors relative to the total number of modeled descriptors, and those descriptors contributed very little to the final predictions. All the molecules were also found to have SUM < SEP (Table 7). It can therefore be concluded from these analyses that all screened molecules occupy the same AD as the training data; hence, the predictions are reliable. Interestingly, all these molecules have a scaffold similar to that found in the training data (see Figure 4). This structural similarity may account for the good predictability of these molecules, as it places them in the same domain as the training data. The generated contour maps helped in understanding the predictions for these molecules, which have some modifications in their side chains.

Table 7.   QSAR predictions of screened molecules against HIV-1 IN using CoMFA, CoMSIA, and 2D models (pIC50); h(x) values; Leverage limit h* = 0.581, h# = 0.387

The domain of applicability of the 2D QSAR model was also calculated for these 22 molecules to check the ability of the model to predict new molecules. The ED was calculated against the training data and compared. The nearest-neighbor ED of the screened molecules ranges from 0.069 to 1.06 and falls within the ED range of the 2D model. Similarly, the extent of extrapolation was assessed using h(x). Some of the molecules have a high leverage value (>h*) (Table 7). The high h(x) value of these molecules indicates that they were extrapolated outside the AD of the training set; hence, their predictions carry increased uncertainty and are unreliable. The remaining molecules lie within the domain of the training set of the 2D QSAR model, which indicates that their predictions are reliable. Given their similarity to the training and test set molecules, these molecules showed good predicted activity within the AD. These molecules were used for docking studies.
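
The leverage limits quoted in the tables can be reproduced with simple arithmetic. The sketch below assumes that h* and h# are defined as 3p/n and 2p/n, with p = 6 descriptors and n = 31 training molecules; these definitions are not stated in the text and are adopted here only because they reproduce the reported values. The h(x) values in the example are hypothetical.

```python
def leverage_limits(n_train, n_descriptors):
    """Return (h_star, h_hash) as 3p/n and 2p/n (assumed definitions)."""
    h_star = 3.0 * n_descriptors / n_train
    h_hash = 2.0 * n_descriptors / n_train
    return h_star, h_hash


def flag_high_leverage(h_values, h_star):
    """Names of molecules whose leverage exceeds the h* limit."""
    return [name for name, h in h_values.items() if h > h_star]


h_star, h_hash = leverage_limits(n_train=31, n_descriptors=6)
print(round(h_star, 3), round(h_hash, 3))

# Hypothetical h(x) values; the real ones are listed in Table 7.
example_h = {"mol_A": 0.21, "mol_B": 0.74, "mol_C": 0.40}
print(flag_high_leverage(example_h, h_star))
```

Molecules flagged in this way are predicted by extrapolation, so their predicted activities should be treated with more caution than those of molecules below the limit.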

The QSAR predictions and the structural similarity to both data sets indicate that these molecules may also have HIV-1 IN inhibitory activity, which needs to be evaluated experimentally. The predicted activities of the 22 screened molecules from the 2D QSAR, CoMFA, and CoMSIA (SH) models are given in Table 7.

Docking studies of screened molecules

Docking studies were performed to explore the binding mode of the 22 well-predicted molecules using AutoDock and dock. As stated previously, docking into the active site was performed three times: AutoDock, dock1, and dock2. Table 8 and Table S2 in the supporting information show the results of the docking studies with the AutoDock and dock programs. The molecules ZINC08990646, ZINC10093792, ZINC11503051, ZINC07795477, and ZINC13134325 did not dock into the active site of HIV-1 IN in the same way as benzodithiazine 32 in any of the docking runs (see Table S2). One molecule, ZINC11071662, docked well, like benzodithiazine 32, only in the AutoDock run (see Table S2). The rest of the screened molecules occupied the space near the active site, as benzodithiazine 32 does. The molecules ZINC02651882, ZINC07079919, ZINC07558748, ZINC07286662, ZINC10547407, and ZINC11684978 docked in a similar way to benzodithiazine 32, with good binding interactions, in two docking runs: AutoDock and dock1 (Table S2 in the supporting information). The molecules ZINC07795474, ZINC07978428, ZINC10274906, ZINC11520483, and ZINC12481683 were posed differently in the active site only in the AutoDock program, because the position of their ring D is inverted (180°) (see Figure 4); these molecules therefore did not show consistent docking results.

The molecules ZINC07558742, ZINC07795482, ZINC11153210, ZINC12485110, and ZINC10093812 docked in a similar way in all three docking runs: AutoDock, dock1, and dock2 (Table 8 and Figures 6–10). These molecules have a structure similar to that of benzodithiazine 32 (Figure 4) and followed the same interactions. All of them consistently exhibited an H-bonding interaction with Asn155, which has already been shown to be important for binding (7). They also chelated the metal ion through co-ordination with their highly electronegative atoms, which is important for HIV-1 IN inhibition. The cation-π interactions with active site residues (Mg2+, Lys156, and Lys159) are also important for the inhibition of HIV-1 IN (Tables 8 and S2 in supporting information) (7).

Because their binding interactions in the active site are similar to those of benzodithiazine 32, these molecules may exhibit HIV-1 IN inhibitory activity, which needs to be tested in vitro. However, ZINC10093812 did not exhibit good ADME properties, as it was extrapolated outside the prediction space of the ADME descriptors. Moreover, its leverage value was higher than the limit; hence, its prediction was not reliable. It was therefore removed from the study. Finally, the four molecules ZINC07558742, ZINC07795482, ZINC11153210, and ZINC12485110 were selected as novel HIV-1 IN inhibitors.
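
Part of this selection logic can be written as a simple consensus filter. The sketch below keeps a molecule only if Asn155 appears among its H-bond partners in all three docking runs and its leverage was not flagged as high. The residue lists are transcribed from Table 8 in column order; one list for ZINC12485110 appears there without an H-bonding label and is assumed to be one. The function and variable names are illustrative, and the actual selection also considered Mg chelation, pose similarity, and ADME properties.

```python
# H-bond partners per docking run, in the column order of Table 8.
H_BONDS = {
    "ZINC07558742": [{"Asn155"}, {"Glu92", "Asn155"}, {"Asp116", "Asn155"}],
    "ZINC07795482": [{"Asn155"}, {"Asn155"}, {"Asn155"}],
    "ZINC11153210": [{"Asn155"}, {"Cys65", "Asn155"}, {"Cys65", "Asn155"}],
    "ZINC12485110": [
        {"Cys65", "His67", "Asn155", "Lys159"},
        {"Cys65", "Thr66", "Asn155", "Lys159"},  # unlabeled in Table 8; assumed H-bonding
        {"Cys65", "Asn155", "Lys159"},
    ],
    "ZINC10093812": [{"Cys65", "Glu92", "Asn155"}, {"Asn155"}, {"Asn155"}],
}

# Molecule marked with a high leverage value in Table 8 (footnote c).
HIGH_LEVERAGE = {"ZINC10093812"}


def consensus_hits(h_bonds, key_residue, excluded):
    """Molecules that H-bond to key_residue in every run and are not excluded."""
    return [
        name
        for name, runs in h_bonds.items()
        if all(key_residue in run for run in runs) and name not in excluded
    ]


selected = consensus_hits(H_BONDS, "Asn155", HIGH_LEVERAGE)
print(selected)
# ['ZINC07558742', 'ZINC07795482', 'ZINC11153210', 'ZINC12485110']
```

Requiring agreement across independent docking runs is a conservative choice: it discards poses that only one program or one parameter set supports.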

Table 8.   Docking interactions of screened best docked molecules using AutoDock and dock programs
Molecule | Interactions(a) | Score(b) (kcal/mol) | Interactions(a) | Score(b) (kcal/mol) | Interactions(a) | Score(b) (kcal/mol)
ZINC07558742 | H-bonding: Asn155; Mg-interaction: Mg-C=O 1.69, Mg-N 3.42, Mg-NH 3.22; Cation-π: Lys156 | −2.36 | H-bonding: Glu92, Asn155; Mg-C=O 1.93, Mg-N 3.59, Mg-NH 3.81, Mg-S 4.87 | −40.53 | H-bonding: Asp116, Asn155; Mg-C=O 1.96, Mg-N 3.18, Mg-NH 2.49; Mg, Lys156, Lys159 |
ZINC07795482 | H-bonding: Asn155; Mg-O 1.61, Mg-N 3.36, Mg-S 4.21 | −2.06 | H-bonding: Asn155; Mg-O 1.94, Mg-N 2.35, Mg-S 3.37; Cation-π: Mg | −45.17 | H-bonding: Asn155; Mg-O 1.93, Mg-N 2.54, Mg-S 2.82 |
ZINC11153210 | H-bonding: Asn155; Mg-C=O 1.64, Mg-N 2.86, Mg-NH 3.76, Mg-S 2.88 | −2.07 | H-bonding: Cys65, Asn155; Mg-C=O 2.03, Mg-N 2.45, Mg-NH 2.53, Mg-S 3.58 & 3.80 | −39.73 | H-bonding: Cys65, Asn155; Mg-C=O 2.00, Mg-N 2.43, Mg-NH 2.57, Mg-S 3.73 & 3.60; Cation-π: Mg |
ZINC12485110 | H-bonding: Cys65, His67, Asn155, Lys159; Mg-C=O 2.47, Mg-N 2.68, Mg-NH 2.00, Mg-S 3.96 |  | Cys65, Thr66, Asn155, Lys159; Mg-C=O 2.00, Mg-N 2.31, Mg-NH 2.78, Mg-S 3.54 | −44.30 | H-bonding: Cys65, Asn155, Lys159; Mg-C=O 1.99, Mg-N 2.36, Mg-NH 2.61, Mg-S 3.67 & 3.58 |
ZINC10093812(c) | H-bonding: Cys65, Glu92, Asn155; Mg-C=O 1.77, Mg-S 4.14, Mg-N 2.83 | −2.91 | H-bonding: Asn155; Mg-C=O 1.98, Mg-C=O 1.97, Mg-N 3.31; Lys156, Lys159 | −48.75 | H-bonding: Asn155; Mg-C=O 2.02, Mg-C=O 2.09, Mg-N 3.31; Mg, Lys156, Lys159 |

(a) Mg interactions with nearby highly electronegative atoms, distances in Å; cation-π interactions of the ligand with active site residues.

(b) Score in kcal/mol.

(c) High leverage value.

Figure 6.

 Docking pose of ZINC07558742 in the active site of HIV-1 IN (1QS4). The AutoDock conformation is shown as magenta sticks and the dock conformation as green sticks; the magenta ball represents the magnesium ion.

Figure 7.

 Docking pose of ZINC07795482 in the active site of HIV-1 IN (1QS4). The AutoDock conformation is shown as magenta sticks and the dock conformation as green sticks; the magenta ball represents the magnesium ion.

Figure 8.

 Docking pose of ZINC11153210 in the active site of HIV-1 IN (1QS4). The AutoDock conformation is shown as magenta sticks and the dock conformation as green sticks; the magenta ball represents the magnesium ion.

Figure 9.

 Docking pose of ZINC12485110 in the active site of HIV-1 IN (1QS4). The AutoDock conformation is shown as magenta sticks and the dock conformation as green sticks; the magenta ball represents the magnesium ion.

Figure 10.

 Docking pose of ZINC10093812 in the active site of HIV-1 IN (1QS4). The AutoDock conformation is shown as magenta sticks and the dock conformation as green sticks; the magenta ball represents the magnesium ion.


The aims of the present work were to develop a 2D QSAR model for the benzodithiazine series and to use SBS to identify novel molecules with good in silico predictions (QSAR, ADMET, and docking). First, the GFA method was used for variable selection and model building for the benzodithiazine derivatives. A successful model was built with the descriptors GATS7m, MATS4m, Mor17m, IC1, LP1, and C-034, having r2 = 0.922; F-test = 46.961; cross-validated r2 = 0.852; and predictive r2 = 0.650. The cross-validation method, the Y-randomization technique, and external validation indicated that the model is statistically significant and has good internal and external predictability. The domain of applicability (k-NN and extent of extrapolation) was used to assess the reliability of the predictions of the developed model for the test set molecules. The test set molecules occupied the same domain as the modeled molecules (training set), which means that they are structurally similar to the modeled molecules. The 2D and 3D descriptors relate to the topology and 3D arrangement of atoms in the molecules and can be used to design new inhibitors with good potency. This QSAR model can be used for predicting the HIV-1 IN inhibitory activity of benzodithiazine derivatives and similar molecules.

SBS was also performed to identify novel molecules using the best docked conformation of benzodithiazine 32. After the toxicity study, the activities of the non-toxic molecules were predicted with the QSAR models. These molecules were structurally similar to the modeled (training) molecules and hence were predicted well by the QSAR models. The reliability of the predictions for the screened molecules, which were not part of the QSAR training set, was also assessed by the domain of applicability. Most of the molecules fell within the same domain as the training set; hence, their predictions were reliable. The docking results for the screened molecules ZINC07558742, ZINC07795482, ZINC11153210, and ZINC12485110 were consistent in terms of position in the active site (near the metal ion) and binding mode in all docking runs. These studies show that SBS using a known active made it possible to identify new hits with good in silico activities and binding poses, but in vitro assays are still needed to verify their activity experimentally.


  • a

    ROCS. ( 2.4.1, OpenEye Scientific Software Inc., Santa Fe, NM.

  • b

    Todeschini R., Consonni V., Mauri A., Pavan M. (2005) Milano Chemometrics and QSAR Research Group, Milano, Italy. Dragon 5.3; available at


Pawan acknowledges National Institute of Pharmaceutical Education and Research (NIPER), S.A.S. Nagar, Punjab, India for providing fellowship.