Quantitative Structure‐activity Relationship (QSAR) Models for Docking Score Correction

Abstract In order to improve docking score correction, we developed several structure‐based quantitative structure activity relationship (QSAR) models by protein‐drug docking simulations and applied these models to public affinity data. The prediction models used descriptor‐based regression, and the compound descriptor was a set of docking scores against multiple (∼600) proteins including nontargets. The binding free energy that corresponded to the docking score was approximated by a weighted average of docking scores for multiple proteins, and we tried linear, weighted linear and polynomial regression models considering the compound similarities. In addition, we tried a combination of these regression models for individual data sets such as IC50, Ki, and %inhibition values. The cross‐validation results showed that the weighted linear model was more accurate than the simple linear regression model. Thus, the QSAR approaches based on the affinity data of public databases should improve docking scores.

(Talete srl, Milano,I taly). Moreover,t here have been many regression models.I no ur previous studies,w eu sed ap rotein-compound affinity matrix as the set of descriptors and successfully predictedC YP inhibitors and substrates. [7,12] As an extension of our previous work, [6,7] in the present study we developed and examined some principal component regression-type prediction methods based on the machine-learning score modification (MSM)m ethod and the docking score index( DSI:p rotein-compound affinity matrix) using public repository data. In the MSM method, the dockingscore against atarget protein could be corrected by al inear combination of docking scores against multiple nontarget proteins. The DSI of ac ompound is as et of docking scores of the compound against multiple target and nontarget proteins. Here, the nontarget proteins are used as "probes" to check the three-dimensional (3D) shape and distribution of the atomic chargeo ft he compound. We applied our methods to kinases of the ChEMBL database, because kinases form am ajor protein family that is key to understanding and controllingc ellular signal transduction, and because theirs tructures have beens tudied thoroughly. [25][26][27] 2Methods 2

.1 Prediction Models
The present method predicts the binding energy (affinity) of given protein-compound pairs based on the proteincompound dockings cores obtained by ad ocking program; this method is ad escriptor-based machine-learningo rr egression method. [6,7,28,29] The present method requires al earning set of 3D structureso fc ompounds, the binding energy data betweent hose compounds, and target proteins. Let s i b , R b a ,a nd b be the docking score of the i-th compound, that of the b-th protein, and parameters, respectively.T he set of {b} can include the target protein (the a-th protein). We proposed as core modification method as follows. [6] DG a i ¼ Here, DG i a is the bindingf ree energy between the i-th compound and the a-th protein. Proteinst hat are similar to the target protein could bind the ligandso ft he target protein. Docking scores correspond to the binding free energy, and the ensemble average should improve the accuracy. Thus eq. 1s hould work, and indeed, this approximation was successfully applied to severalt argets. [6,7,28,29] The score modification methodi ncreased the area under the curve (AUC) values of the database enrichmentc urves by 50 %. Namely,t he AUC values of 60-70 %w ere improved to 80-90 %i nt hese previous reports. In addition,t here is one advantage to the other conventional QSAR models with ordinary molecular descriptors. One of the most serious prob-lems of QSAR models is the limited range of applicable domains, since QSAR models cannot work for unexpected input data. [30] If the dockings core is precisely proportional to the binding free energy without computational error, DG i a = s i a .Thus, eq. 1can workwithoutany experimental affinity data and the problemo fi dentifying an applicable domain is avoided.
In eq. 1, the number of parameters is equal to the numbero fp roteins. The number of parameters can be reduced by principal componentr egression( PCR). Docking scores should be af orm of some kinds of similarity scores (see APPENDIX A). Thus, the docking scoresc ould be used as the descriptors for compounds. If we allow docking scores as descriptors, the docking score for at arget protein is not neededine q. 1. In the present model, the protein-compoundb inding energy DG i a is approximated by the PCR method based on the protein-compound docking scores s i b .A ss hown in Figure 1, we tried six PCR models (Models 1-6)a sf ollows. In each model, the optimal principal componenta nalysis (PCA) axis was selected to maximize the qv alue by the leave-one-out (LOO)c ross-validation test. The selection of the PCA axis corresponds to the factor rotation. [29] (Model 1) Linear PCR model Here, c j a , b, p,and d b j are the parameter,o ffset parameter, principal component vector, and loading vector,r espectively.T he hi represents an average. The PCA of the proteincompound dockings corem atrix s gives the loading vector d and the principal component vector (axis) p.T he parameters c and b are determined by am ultilinear regression (MLR). N p is the total number of docked proteins and N (N < N p )i sd etermined to maximize the q-value obtained by the LOO cross-validations. The parameters are determined based on the learning set and then are usedf or prediction.
The protein-compoundd ocking scores s i b were obtained by the program Sievgene, [31] which is ap rotein-ligand flexible docking program for in silico drug screening. Sievgene is ap art of the myPresto system, whichi savailable online (http://presto.protein.osaka-u.ac.jp/myPresto4/) and is free for academic use.
(Model2 )P olynomial PCR model Here, c2 j a represents the parameterf or the second-order term. We tried only the second-order and did not try the higher-order polynomials. The other terms and parameters are defined exactly as in Model1.
S is the distance betweent he i-th and j-th compounds. Let S max (i) = max{S;f or all j}. The data of compoundst hat satisfy S < x ·S max are replicated M times. In the present study, x = 0.1, 0.2, 0.3, and0 .5 were examined and M was set to 1, 2, 4, and 8.

(Model 4) Classified PCR model
In the classifiedP CR model, the experimental data are classifiedi nto IC 50 , K i ,a nd %inhibition data, and the simple PCR method is applied to each classifiedd ata set.
(Model5 )C ombined PCR model This model is al inear combination of the regression models made by the classified PCR model.
Here, Dg, C m and c 0 are the binding free energy obtained from the m-th data set and the fitting parameters, respectively.T he Dg is given by eq. 1, and the coefficient C m is determined to minimize the root-mean-square differenceb etween the coefficients of {c} of Dg.I nt he presents tudy, SPECIAL ISSUE the experimental data were classifiedi nto IC 50 , K i ,a nd %inhibition data (m = K i , IC 50 and %inhibition). We basedt his classification on the individual source of the experimental data.

(Model 6) Replica and partial-replica PCR models
Cortes-Ciriano et al. suggested that the use of multiple replica data sets permutated by randomn oise could improve the QSAR accuracy. [32] In the replicaP CR method, the experimental data and dockings coresa re replicated by the permutation of 5% noise. Also, Steinmetz et al. suggested that the QSAR result could be improved by considering the importance of data due to reliability. [33] In the partial-replica PCR method, experimental DG values < À10 kcal/mol are replicatedb yp ermutation of 2% noise, sincet he strong affinities of lead-level compounds are observed by multiple experiments in many casesa nd should be more reliable than the weak affinities of hit-level compounds.

Generation of the Docking Score Index by Proteincompound Docking
The protein-compoundd ocking scores s i b in all models were calculated by the protein-compound docking program Sievgene. [31] Sievgeneg enerates multiple possible conformers for each compound and keepst he structureso f target proteins more or less rigid, witht he exception that soft interaction forces can changet he structures slightly. This dockingp rogram reconstructed about 50 %o ft he receptor-compound complexesi nP DB (132 in total) with an accuracyo fl ess than 2 root mean square deviation (RMSD), [31] which is mostly equivalent to the predictionsb y other docking programs. In the presents tudy,t he Sievgene program generated up to 100 conformers for each compound, and 200 200 200 grid potentials were adapted for all proteins. The pocket regionsw ere suggested by the coordinates of the original ligands in the receptor-compound complexs tructures. The details of the dockings core are summarized in Appendix B( Supporting Information). It takes 3s econds to dock one compounda gainst one protein on as ingle core of the Xeon 5570 CPU (2.98 GHz).

Data-conversion Method
The protein-compoundb inding energy DG is calculated from the K d value as follows: where k B and T are the Boltzmann constant and temperature.
The experimental K d and DG values are difficult to obtain and quite rare in public databases. On the other hand, the %inhibition, K i and IC 50 values are relativelye asy to obtain and abundant in public databasess uch as PubChema nd ChEMBL.I nt he present study, we assumed that K d = K i , since the binding affinities of the naturall igandsh ave been reported to be much weaker than those of the reported artificial ligands in manyp roteins. For the %inhibition and IC 50 data, the conventional approaches are adopted as follows. The %inhibitionv alue is convertedt ot he K i value. Let E, S, P and I be the enzyme, substrate,p roduct and inhibitor,r espectively.T he inhibition reaction is described as follows. Here,"K"r epresents the reactionr ate.
When the enzyme reaction is the rate-determining step, we have The value of K s is then derived from the density of the E, S and ES complexesa sf ollows.
Here, the bracket []represents the density of molecules. The reaction speed v is described as follows. v Here, V max is the maximum enzyme reaction speed. Let r be the residual activity; K i is then givenb y The parameters in eq. 13 must satisfy because K i > 0. Here, the %inhibitionv alue is (1Àr)*100.
The IC 50 value is converted to the K i value by the Cheng-Prusoffe quation as follows. [20,34] Here, S and K s are the substrate and the affinity between the enzyme and the substrate.  [27,28,[35][36][37] 3Data Preparation

Probe Protein Sets with and without Kinase Structures
To generate {s}i nM odels 1-6 (the DSI or affinity fingerprint), we performed ap roteinÀcompound docking simulation based on the soluble protein structures registeredi n the ProteinD ata Bank (PDB). The probep rotein set consisted of 600 arbitrarily selected protein structures. These structuresw ere all protein-ligand complex structures. Some of them were kinases. For protein sets, the complexes containing ac ovalent bond between the protein and ligand were removed,a nd all missing hydrogen atomsw ere added to form the all-atom models of the proteins. All water molecules and cofactors were removed from the protein structures. The atomic chargeso ft he proteins were the same as those in AMBER parm99. [38] The docking pocket of each protein was indicated by the coordinates of the originall igand.

Validation Test Set:T arget Proteinsa nd Compounds
The testedc ompounds and their assay information (compound structures, affinities againstkinases) were downloaded from KinaseSARfari on the ChEMBLw ebsite (https:// www.ebi.ac.uk/chembl/). [24] The biochemical assay data, namely, K i , IC 50 ,% residual activity and/or %inhibitionv alues of humank inase protein-inhibitor systems,w ere also extracted from the bioactivity table in KinaseSARfari. Assay data with inadequate energy unitso ru nclear energy values were excluded. The assay data for large compounds( mass weight > 500 Da) were also excluded, since Sievgene is designed for the docking of small compoundsw ith mass weights < 500 Da.
As target proteins, 97 kinases with more than 50 assay data points were selected. The 3D structures of the compounds were energy-optimized by cosgene [39] with the general AMBER force field (GAFF), [40] and the atomic charges were calculated by the MOPAC AM1 modelu sing the Hgene program of the myPresto suite.F inally,3 8,946 assay data points of 97 kinases and 18,491 compoundsw ere derived. Most of the assay data were IC 50 data, with the second-most common type being %inhibition data, followed by K i and K d values. These data were converted to the DG value by using Eqs. And r values greater than 0.95 and less than 0.05 were ignored, since the erroro fr should bea round 5% and an r value > 1g ave unreasonable DG values. The parameter set was determined basedo ns everal reports of assays included in the ChEMBLd atabase.

DG Values Obtained from K i , IC 50 and %Inhibition Values
Appendix C( Supporting Information) provides al ist of the target kinases used in the present study and the number of ligands for each. The kinase names were the domain names from the KinaseSARfari database in ChEMBL.
The averagea nd standardd eviation values of the experimental DG values of the 97 targetp roteins are summarized in Ta ble 1, and the data were classified followingt he data type. For some kinases, multiple experimental assay data (K i , IC 50 and %inhibition) were available for the same ligands of some proteins. The differences between each classified data type (the DG valuec alculated from K i , IC 50 and %inhibition data) and the othert ypes of data are summarized in Ta ble 1. The difference corresponded to the erroro ft he experimentald ata converted in the presents tudyb yE qs. 13 and 15.
We simulated the expected correlation coefficientb etween the experimental and calculated DG values based on the statistics summarized in Ta ble 1. We generated as et of numbers thatm imicst he experimental DG values whose average and standardd eviation were À9.6 kcal/mol and 2.5 kcal/mol, respectively.T hen ar andom numberw as added as experimental error to each simulated experimental DG value. Also, we generated as et of numbers that mimicst he calculated DG value by adding ar andom number as the computational error.T he calculatedc orrelation coefficients are summarized in Ta ble 2. In addition,w e performedaset of virtuals creenings based on these simulated data. The compounds with experimental DG < À11 kcal/mol were selected as the active (hit) compounds, and the others were treated as decoys. Then the compounds were sorted in the order of the calculated DG and the receiver operating characteristic (ROC) curves were calculated. The area-under-the-curve (AUC) values of the ROC curves of thes imulated virtuals creenings are also summarized in Ta ble 2. Ah igher AUC value corresponds to better SPECIAL ISSUE [a] The standard deviation of the whole observed data (kcal/mol).
[b] The minimum DGv alue of the data set (kcal/mol).
[c] The maximum DGv alue of the data set (kcal/mol).
[d] The root mean square deviation (RMSD) of the multiply-observed data for the same protein-ligand pairs (kcal/mol) prediction, and the AUC value is always more than zero and less than 100 %. For the random screening, AUC = 50 %. These correlation coefficients (0.6-0.7) suggested poor QSARs;o nt he other hand, the AUC values( AUC = 80-90 %) showed good virtual screening results.

Cross-validation Tests of QSAR Models
Each of the compounds that gave assay data for one or more of the 97 target proteins was docked to all proteins of ap rotein set to generate the protein-compound docking score matrix s.T hen we adopted Models 1-6 and the LOO cross-validationt est to calculate the correlation coefficients Ra nd Q. Figure 2s howst he schematic description of the LOO cross-validationp rocedure for models 1-6. In all models, we performed the simple linear regressions and obtained an Rv alue for each principal component axis. The axes were sorted according to their Rv alues. Then, we performed the multiple linear/polynomialr egressions adopting the top m-axes and calculated the Rv alues. The number of axes that gave the highest Rw as adopted to constructt he regression model for calculatingt he DG of the query compound, and the Qv alue was calculated. Figure 3s howst he results of the LOO cross-validation test of threes elected kinases. The correlation coefficients (R) betweent he experimentala nd predicted DG values are summarized in Ta bles 1a nd 3, respectively.A lso, the root mean square error (RMSE) valuesb etween the experimental and predicted DG values are summarized in Ta ble 3.
The machine-learning DSI and MSM methods were score modification methodsi nw hich the new docking scores were given by the linear combinationso ft he protein-com-pound docking scores.T he previous works reported the AUC values of database enrichmentc urves obtainedb yt he machine-learning DSI/MTS methods and the AUC values were about 98 %( in the original works, the AUC values were referred to as qv alues). When the number of active compounds is much smaller than the number of inactive compounds, the AUC of the database enrichment curve is close to the AUC of the ROC curve, and the datas ets of the previous works satisfy the condition (number of actives : number of in-actives = 1:1000). Based on Ta ble 2, an AUC of 98 %c orresponded to an Ro f0 .8-0.9. The Rv alues were close to the Rv alues obtainedb yt he weighted PCR models in Ta ble 3. We must note that the data sets used in the previous studiesc onsisted of higha ffinity compounds (e.g.,c ommercial drugs) as activec ompoundsa nd decoy compounds. In the presents tudy, all compounds were nearly active, and discriminating betweens trong and weak active compoundsi sm ore difficult than distinguishing highly active and inactive compoundsa si nt he previous reports. Thus, the present validation tests were much more strict thant hose used in the previous studies.I na ddition, the previous methods only realized the active/in-active binary decision in virtuals creening. On the otherh and,t he present methods could evaluate the bindinge nergyv alue, which is essential in drug design. The simple PCR model (Model 1) worked well, since the correlation coefficients betweent he experimentala nd calculated DG valuesw ere closet ot hose obtained by the mathematical simulation data. Also,t he results obtained by SPECIAL ISSUE  the simple PCR model were closet ot he averagedc orrelation coefficient and RMSE values obtained by the docking study (R = 0.7 and 2-3 kcal/mol). Considering that the present model did not require the target protein structure, even the simple PCR model should be useful for rough affinity estimation in 21 %o fc ases in which Q > 0.7 out of the 97 target proteins.
For the polynomialP CR model, the polynomial was restricted to the second order, since the high-order polynomials require manyp arameters in the regressione quation. This trial did not improve the correlationt ot he experimental data. This suggestst hat the linear model was sufficient in the present study. Figure 3s hows the results obtained by the weighted PCR model. In the weighted PCR method, the molecule in the teaching data thatw as similar to the input molecule was copied multiple times. The similarity was calculated by the docking score in eq. 6, and x = 0.1, 0.2, 0.3 and 0.5 were examined. The weighted PCR method with x = 0.1 and single or double copies of similar compoundss howedt he best correlation to the experimental data. This methodr equired the same data as the simple PCR model. In addition to our method, however,i ti se xpected that many othera pproaches could be taken to generate the replica data. The weighted PCR model should be usefulf or rough affinity estimationa mong these models, and Q > 0.7 in 37 %o fc ases in the presents tudy.T he simple average of Qo ver the 97 proteinsi nT able 3w as about 0.64, and the correlation coefficient of the DG value of the whole data was 0.76 (see Figure 3).
The experimental data were classified into IC 50 , K i and %inhibition data sets,a nd then the classifiedP CR method was applied to each. The correlations were improvedc ompared to the other modelsw ith unclassified data. Wem ust note that the Qv alues obtainedb yt he classified PCR modelc annot be simply compared to those obtained by the other models,s ince there were fewer classified experimentald ata than unclassified data.
The combinedP CR method did not improvet he correlation between the experimental and calculated DG values compared to the simple PCR model. In particular,i ns ome cases,t he combined PCR model generated unrealistic DG valuesw ith extreme computational error( > 10 6 kcal/mol). In 9c ases (9 %), the RMSE values obtainedb yt he combined PCR model were < 1kcal/mol, while onlyt hree RMSE values obtainedb yt he weighted PCR model were < 1kcal/ SPECIAL ISSUE Table 3. Average correlation coefficient (R/Q) between the experimental data and the calculated data obtained by the six regression models over all 79 proteins.  mol (3 %). This result suggested that ad eep learning method that considers the study from which the data was sourced should work bettert han the methode mployed in the present study,a nd that the data manipulation should be performed carefully while considering the applicable domain of the model. The replicaP CR and partial-replica PCR models did not work in the present study.T he greater the number of replicas, the lower the correlation coefficients became. Since this study was as ingle trialw ith simple random noise, the results do not contradict those of the previous works. [32,33] The duplication of data should be treatedm ore carefully than it is in the simple duplication.

5C onclusions
In ordert oa chieve dockings core correction, we developed several QSARm odels basedo nc ombinationso fm ultiple docking scores by protein-drug docking simulations and applied heterogeneous public data to these models. The predictionm odels employedadescriptor-basedP CR, and the compound descriptor was as et of docking scores against many nontarget proteins( DSI, protein-compound affinity matrix or affinity fingerprint).W et ried six variations of the PCR models:s imple, polynomial, weighted,c lassified, replica PCR and combined PCR models.
Event he simple PCR model worked in some cases,b ut when the assay data were classified into IC 50 , K i and %inhibition data, the classified PCR modelw orked better than the simple PCR model. The linearc ombination of the QSAR models (combined PCR model) did not improve the results compared to the simple PCR. Although the weighted PCR model was simple, it achieved the same results as the more complex combinedP CR model. In general, the weighted PCR model should be easier and more accurate than the other models. In casesi nw hich the original sourceso ft he assay data are easily accessible,t he classified/combined PCR models could be used for further analysis with careful data treatment. Althoughi tw as difficult to compare the present results to those obtained by previous studies, including studies using the DSI method with al inearc ombination of dockings cores, the comparison of the Ra nd AUC values suggested that the predictiono fa ctive compounds by the present models should be comparable to that achieved in the previous study.H owever,c onsidering the difference of the datasets, the method introduced in the present study should be superior to the previous method. In addition,t he present method affords DGp rediction.
In the present study,t he docking scores against target proteins were omitted.T he QSAR models could be improved by the addition of the docking scores that were descriptorso ft he models.T hus, the presentm odels should be effective for the correction of docking scores.

Supporting Information
The compound structures in SDF format and experimental assay data were supplied as described in the supporting information.