Assessment of fragment docking and scoring with the endothiapepsin model system

Fragment‐based screening has become indispensable in drug discovery. Yet, the weak binding affinities of these small molecules still represent a challenge for the reliable detection of fragment hits. The extent of this issue was illustrated in the literature for the aspartic protease endothiapepsin: When seven biochemical and biophysical in vitro screening methods were applied to screen a library of 361 fragments, very poor overlap was observed between the hit fragments identified by the individual approaches, resulting in high levels of false positive and/or false negative results depending on the mutually compared methods. Here, the reported in vitro findings are juxtaposed with the results from in silico docking and scoring approaches. The docking programs GOLD and Glide were considered with the scoring functions ASP, ChemScore, ChemPLP, GoldScore, DSXCSD, and GlideScore. First, the ranking power and scoring power were assessed for the named scoring functions. Second, the capability of reproducing the crystallized fragment binding modes was tested in a structure‐based redocking approach. The redocking success notably depended on the ligand efficiency of the considered fragments. Third, a blinded virtual screening approach was employed to evaluate whether in silico screening can compete with in vitro methods in the enrichment of fragment databases.


| INTRODUCTION
Molecular docking and scoring have become indispensable methods in the context of computer-aided drug design. [1] Aiming to find the best-performing docking programs and scoring functions, many publications have focused on evaluations with highly diverse test sets of protein-ligand complexes. [2] However, it was emphasized that the performance of an individual docking program or scoring function is also highly dependent on the type of protein target, the type of docked ligands, the quality of the considered crystal structures, and the exact docking protocol. [3] Therefore, we believe that more detailed and case-specific analyses are needed in the evaluation of docking programs and scoring functions.
The family of pepsin-like aspartic proteases plays a key role in several human diseases, such as hypertension, malaria, Alzheimer's disease, and fungal infections. [4] Endothiapepsin embodies a well-studied model system for this protease family. Its first crystal structure was published as early as 1984, [5] and the structural knowledge supported the development of drugs such as inhibitors of renin [6] and β-secretase. [7] Studies on endothiapepsin further helped to elucidate the catalytic mechanism of aspartic proteases. [8]
Fragment-based screening approaches have gained high popularity in drug design projects. [9] They aim to identify the smallest molecules that are capable of binding to a protein target, and thereby to search the chemical space with high efficiency. [10] While the energetic contribution of the individual fragment-target interactions can be remarkable, the absolute binding affinity is usually weak for these small molecules. [11] Thus, the detection of the corresponding binding events can be a challenge for in vitro [12] and in silico methods alike. In the context of molecular docking, the low number of directed interactions is often associated with poor absolute scores. This makes the identification of a correct binding mode and the comparison between different fragments difficult. [13]
A remarkably extensive study on fragment binding was published for endothiapepsin using a library of 361 fragment-sized small molecules. [14] Seven well-established biochemical and biophysical screening methods were applied to the entire library: a biochemical cleavage assay, a reporter-displacement assay (RDA), saturation-transfer difference NMR (STD-NMR), electrospray-ionization mass spectrometry (ESI-MS), a thermal shift assay (TSA), microscale thermophoresis (MST), and X-ray crystallography. Additionally, isothermal titration calorimetry (ITC) was used to determine Kd values and ligand efficiencies (LE). Comparing the results of the performed assays, the authors observed a very poor overlap of the hit compounds identified with the individual methods. This demonstrated impressively to what extent experimental fragment screening can be faced with false positive and false negative results. Beyond that, the study provides a comprehensive data set of in vitro fragment screening with an aspartic protease target, for which a total of 71 well-resolved fragment-bound crystal structures could also be obtained (1.05-1.75 Å resolution). [14c]
Here, we complement and juxtapose these data of in vitro screening methods with results of in silico docking and scoring studies. We focused on the docking programs GOLD [15] and Glide, [16] which were among the top three most frequently used programs for virtual screening in the years 2015-2020. [1] We considered the scoring functions ASP, [17] ChemScore [18] (CS), GoldScore [15,19] (GS), and ChemPLP [20] as implemented in GOLD, GlideScore as implemented in Glide, and the stand-alone scoring function DSX CSD. [21]
Our study addresses three central questions: (1) The evaluation of ranking power and scoring power: Are the considered scoring functions capable of ranking or scoring the fragments according to their reported binding affinity data if the crystallographically observed binding modes are assessed? (2) The structure-focused evaluation of redocking approaches: Is the docking program capable of reproducing the correct binding mode of the crystallized fragments, and are the considered scoring functions able to identify this binding mode among all generated poses? Which options might optimize the redocking workflow? (3) The evaluation of a blinded virtual screening approach with the entire fragment database: Which extent of enrichment can be achieved in a virtual screening approach? How successful is the virtual screening workflow in comparison to the reported in vitro screening results? Altogether, the focus on this single model protease and an extensively studied fragment database allows us to assess the performance of the presented docking programs and scoring functions within the applicability domain of fragment screening with aspartic proteases.

| Data selection
Schiebel et al. [14c] reported crystal structures of 71 fragments in complex with endothiapepsin after the crystallographic screening of the aforementioned 361-fragment library. As some fragments showed binding at multiple sites in the respective crystal structure, 87 binding modes were observed in total.
We checked the individual entries of the fragment library with respect to their suitability for our study and decided to remove three fragments from the library: (a) a fragment that caused false positive assay results after a cascade of reactions in solution, as revealed by the authors in a follow-up study; [22] (b) a fragment containing boron, because the handling and parametrization of boron compounds are not yet sufficiently implemented in most docking software packages; [23] and (c) a fragment that showed covalent binding, because our study was supposed to focus on noncovalent docking only. Moreover, Schiebel et al. reported an additional fragment-sized molecule in one of the crystal structures that was not included in the original library.
We added this molecule to the data set. These considerations resulted in a total of 359 fragments with 68 crystal structures and 84 binding modes included in our study. A detailed overview of the considered data is provided in Supporting Information S1: Figure S1 and Tables S1 and S3.

| Ranking power and scoring power
The parameter ranking power describes the capability of a scoring function to rank known ligands of a specific target according to their binding affinity if the true binding modes are assessed. [2a,2e] Beyond this challenge of correct ranking, the parameter scoring power examines the (preferably linear) correlation of calculated scores and binding affinity.
Traditionally, dissociation constants (Kd values) are measured to quantify binding affinity. Kd values, however, tend to underestimate the value of small or fragment-sized ligands that are of high interest in the early phase of drug design. Instead, the ligand efficiency (LE) metric relates the free energy of ligand binding (ΔG°) to the number of nonhydrogen atoms (N) at standard temperature (Equation 1). [12,24]

$$\mathrm{LE} = -\frac{\Delta G^{\circ}}{N} = -\frac{RT \ln K_\mathrm{d}}{N} \qquad (1)$$

In the context of scoring, the use of absolute scores similarly tends to underestimate fragment-sized ligands. Instead, the use of per-atom scores (PAS), that is, the absolute score in relation to the number of nonhydrogen atoms (N), represents a valid alternative (compare Equation 2).

$$\mathrm{PAS} = \frac{\mathrm{score}}{N} \qquad (2)$$
As ligand efficiency values of fragment-sized molecules naturally lie in a very small range, the presented analysis of ranking power and scoring power needs to be understood exclusively in the context of fragment screening.
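The two per-atom quantities discussed above can be illustrated with a minimal sketch. It assumes the standard relation ΔG° = RT ln Kd at 298.15 K; the function names, the example Kd of 1 mM, the heavy-atom count of 12, and the score of 48.0 are hypothetical values chosen for illustration and are not taken from the study.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1
T_STANDARD_K = 298.15            # assumed standard temperature in K


def ligand_efficiency(kd_molar: float, n_heavy_atoms: int) -> float:
    """Return LE = -dG/N in kJ mol^-1 atom^-1, assuming dG = RT ln(Kd)."""
    if kd_molar <= 0 or n_heavy_atoms <= 0:
        raise ValueError("Kd and the heavy-atom count must be positive")
    delta_g = R_KJ_PER_MOL_K * T_STANDARD_K * math.log(kd_molar)
    return -delta_g / n_heavy_atoms


def per_atom_score(score: float, n_heavy_atoms: int) -> float:
    """Return the per-atom score (PAS): absolute score divided by N."""
    if n_heavy_atoms <= 0:
        raise ValueError("the heavy-atom count must be positive")
    return score / n_heavy_atoms


if __name__ == "__main__":
    # Hypothetical fragment: Kd = 1 mM, 12 nonhydrogen atoms, score 48.0
    print(f"LE  = {ligand_efficiency(1e-3, 12):.2f} kJ/mol per atom")
    print(f"PAS = {per_atom_score(48.0, 12):.2f} score units per atom")
```

Dividing by N in both functions puts fragments of different size on a comparable scale, which is the motivation given in the text for preferring LE and PAS over Kd and absolute scores.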
Schiebel et al. reported LE data for 50 of the 68 crystallographic hit fragments that were included in our study. [14] For these 50 fragments, 65 binding modes were entirely or partially resolved and could thus be considered. Each of these crystallographically obtained binding poses was scored with the scoring functions ASP, [17] ChemScore, [18] GoldScore, [15,19] ChemPLP, [20] DSX using CSD potentials (DSX CSD), [21] and GlideScore (SP version). [25] Except for the scoring with DSX CSD, local minimization was allowed with the respective scoring function. Special attention was paid to the role of water molecules involved in the fragment binding mode. Water molecules often play a crucial role in protein-ligand interactions. Hence, each crystal structure was inspected for the presence of resolved water molecules that form hydrogen bonds to both ligand and protein. The scoring was then performed twice: with and without consideration of potentially relevant water molecules.
For the subsequent statistical analysis, Spearman's rank correlation and the Pearson correlation were applied to examine the relation between the absolute values of the calculated PASs and the experimentally determined LE values. We decided to forego the analysis of the dependency between Kd and total scores as not all binding modes were fully resolved. [2a,26] All results are presented in Table 1. A list with all obtained scores is provided in Supporting Information S1: Tables S4 and S5, and an overview of the derived scatterplots in Supporting Information S1: Figure S2.
A significant correlation (at a significance level of α = 0.05) was achieved with all scoring functions at mostly moderate [26] correlation strength. In most cases, the determined scoring power was higher than the ranking power. In virtual screening approaches, a reasonable scoring power can be considered of higher importance in the distinction between binders and nonbinders than the exact ligand ranking. The strongest correlation was found with GoldScore and GlideScore. The dominance of GoldScore was quite surprising as this force field-based scoring function was optimized to predict binding modes rather than binding affinities. [15] Interestingly, the consideration of interacting water molecules could not achieve a general improvement in scoring or ranking. Only DSX CSD and GlideScore might profit slightly from the additional information. The meaningful inclusion of water molecules might require further improvements in traditional scoring functions.
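As an illustration of how ranking power and scoring power can be quantified, the following sketch computes Spearman's rank correlation (ranking power) and the Pearson correlation (scoring power) in plain Python. The PAS and LE values are hypothetical, the helper names are not from the study, and the significance test used by the authors is not reproduced here.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """Ranks starting at 1; tied values receive their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Hypothetical absolute per-atom scores and ligand efficiencies
    pas = [2.1, 2.5, 3.0, 3.2, 4.0]
    le = [1.0, 1.1, 1.3, 1.2, 1.6]
    print(f"ranking power (Spearman): {spearman(pas, le):.2f}")
    print(f"scoring power (Pearson):  {pearson(pas, le):.2f}")
```

Spearman's coefficient only depends on the order of the values, whereas Pearson's coefficient rewards a linear relation, which matches the distinction between ranking power and scoring power drawn in the text.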

| Structure-focused evaluation of redocking approaches
The reliable prediction of ligand binding modes is a core competence of docking and scoring techniques. It depends on a sufficient sampling of possible binding poses and their subsequent ranking by a scoring or rescoring function. A success threshold of 2 Å root mean square deviation (RMSD) from the experimental binding mode is commonly applied in the redocking analyses of druglike compounds, while a reduced threshold of 1.5 Å was proposed for fragment-sized ligands [3] and is employed in this study. Unsuccessful redocking can be caused by either (a) sampling failure, that is, no docking pose with an RMSD ≤ 1.5 Å is generated, or (b) scoring failure, that is, the scoring function and evaluation protocol are not able to identify the correct binding mode among the entirety of generated docking poses.
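The distinction between sampling failure and scoring failure can be made concrete with a short sketch. It assumes that pose and reference coordinates list the same nonhydrogen atoms in the same order within the same protein frame, ignores ligand symmetry, and treats a higher score as better; these conventions, the function names, and the example values are assumptions for illustration and are not specified in the text.

```python
import math
from typing import List, Sequence, Tuple

Coords = Sequence[Tuple[float, float, float]]
RMSD_CUTOFF = 1.5  # Å, threshold for fragment-sized ligands used in the text


def rmsd(pose: Coords, reference: Coords) -> float:
    """Heavy-atom RMSD without superposition or symmetry correction."""
    if len(pose) != len(reference) or not pose:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        (a - b) ** 2 for p, r in zip(pose, reference) for a, b in zip(p, r)
    )
    return math.sqrt(squared / len(pose))


def classify_redocking(
    poses: List[Tuple[float, float]], cutoff: float = RMSD_CUTOFF
) -> str:
    """Classify one redocking run from (score, rmsd) pairs of its poses.

    Assumes that a higher score indicates a better pose.
    """
    if not any(pose_rmsd <= cutoff for _, pose_rmsd in poses):
        return "sampling failure"  # no pose close to the crystal structure
    _, best_rmsd = max(poses, key=lambda pose: pose[0])
    return "success" if best_rmsd <= cutoff else "scoring failure"


if __name__ == "__main__":
    # Hypothetical docking run: the correct pose exists but is ranked second
    run = [(52.3, 2.4), (50.1, 0.9), (44.7, 3.1)]
    print(classify_redocking(run))
```

In this hypothetical run a pose within the cutoff was generated but did not receive the best score, so the run counts as a scoring failure rather than a sampling failure.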
In this study, sampling performance and scoring performance were evaluated for different setups: In the easiest scenario, the fragments were docked back into the protein conformation of the protein-ligand complex with a binding site definition derived from the crystallized binding mode. This setup will be referred to as native redocking. Evidently, this native redocking setup requires structural knowledge that is often not available in an inhibitor screening. Alternatively, a nonnative protein conformation can be taken from a structure with a different ligand in complex or an apostructure of the protein. The currently best-resolved apostructure of endothiapepsin was selected from the Protein Data Bank (PDB: 4Y5L, 0.99 Å resolution) for a corresponding setup that will be referred to as nonnative redocking. Further, most real-life docking attempts must work with general binding site definitions when a fragment's true binding site is unknown. In a third approach, which is referred to as nonnative unbiased redocking, the docking was therefore performed with three subpockets of the large endothiapepsin cleft (compare Figure 1a). An overview of the resulting setups in the structure-focused evaluation of redocking approaches is presented in Figure 1b.
As listed in Supporting Information S1: Table S1, fragment binding modes were excluded from the redocking approaches if (a) they bound outside of the cleft to the endothiapepsin surface, (b) less than 75% of the nonhydrogen atoms were resolved, or (c) they showed codependent binding, which was assumed when two binding modes were less than 4 Å apart. A total of 62 crystallographically observed fragment binding modes was thus included in the structure-focused redocking analyses.

| Sampling performance
First, the sampling performance of the tested redocking setups was analyzed. In this study, sampling performance is defined as the number or percentage of fragment binding modes that could be reproduced at a cutoff of 1.5 Å with respect to the crystallographically observed binding mode, irrespective of the rank among the docking poses of that fragment. Native, nonnative, and nonnative unbiased redocking were each performed with and without consideration of hand-selected water molecules. Further, the "generate diverse solutions" option in GOLD and the "expanded sampling" option in Glide were tested for their ability to improve fragment sampling.
Table 2 provides an overview of how many fragment binding modes were successfully sampled with the individual setups.
As expected, the sampling performance declined with the use of an apoprotein structure in the nonnative redocking and even further with the consideration of unbiased binding sites. But in every case, information about water molecules could improve the reproduction of crystallographically observed binding modes. Further notable enhancement could be achieved with the "generate diverse solutions" option in GOLD, whereas activation of the "expanded sampling" option in Glide mostly led to deteriorated results. The overall highest sampling performance (75.8%) was achieved with Glide and the standard precision (SP) mode. Details about fragment-specific sampling success are provided in Supporting Information S1: Table S6.
The generally observed improvement through the use of water molecules can be illustrated by the example of ligand 211 in the native redocking setup. The crystallographically observed binding mode (PDB: 4YCK) is depicted in Figure 2a and shows a water-induced network of hydrogen bonds between the ligand's protonated amine group and the catalytic aspartates. If the water is not considered during the docking process, ligand poses tend to show direct interactions between the amine group and the catalytic residues Asp35 and Asp219. This results in a minimum RMSD of 2.57 Å for the GOLD setup with GoldScore and 2.44 Å for the Glide setup with the HTVS mode. Thus, the sampling of ligand 211 was not considered successful in these docking runs. The respective poses are shown as examples in Figure 2b.
However, both docking attempts were able to reproduce the crystallographically observed hydrogen bond network when the selected water molecules were included in the docking setups. A top RMSD value of 0.56 Å was obtained for GOLD with GS and a top RMSD value of 0.37 Å for Glide with HTVS mode (compare Figure 2c). A docking mode similar to the poses generated without water was sterically impossible. Hence, the sampling of ligand 211 was considered successful when the docking was carried out in the presence of interacting water molecules.
Overview of the redocking setups (compare Figure 1b). Native redocking (1) uses the known protein conformation and the known binding site; nonnative redocking (2) uses the known binding site only. As protein input, either the known protein conformation from the complex crystal structure or apostructure 4Y5L was applied. As a binding site, either the known cavity of the reference binding mode or a broad three-pocket approach comprising the entire endothiapepsin cleft was applied.

In addition to the individual analysis of the considered redocking setups, we wanted to assess whether the use of different scoring functions and docking modes can enhance the overall sampling. Figure 3 gives detailed insight into the overlap of successfully sampled fragments that was observed with (a) the different scoring functions in GOLD, (b) the different docking precision modes in Glide, and (c) between all considered GOLD and Glide setups. The analysis refers to the native redocking approach with the use of selected water molecules, with the "generate diverse solutions" option enabled for the GOLD dockings and the "expanded sampling" option disabled for the Glide docking, as these settings showed the best results (Table 2).
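The overlap analysis of successfully sampled binding modes can be sketched with plain set operations. The setup names and binding mode identifiers below are hypothetical placeholders and do not correspond to results of the study.

```python
from typing import Dict, Set, Tuple


def sampling_overlap(
    success: Dict[str, Set[str]],
) -> Tuple[Set[str], Set[str], Dict[str, Set[str]]]:
    """Summarize which binding modes were sampled by which docking setup.

    Returns the modes sampled by any setup, the modes sampled by every
    setup, and for each setup the modes that only this setup sampled.
    """
    all_sets = list(success.values())
    union: Set[str] = set().union(*all_sets)
    common: Set[str] = set.intersection(*all_sets) if all_sets else set()
    unique = {
        name: modes
        - set().union(*(other for key, other in success.items() if key != name))
        for name, modes in success.items()
    }
    return union, common, unique


if __name__ == "__main__":
    # Hypothetical setups with the binding modes they sampled within 1.5 Å
    sampled = {
        "setup_A": {"mode_1", "mode_2"},
        "setup_B": {"mode_1"},
        "setup_C": {"mode_1", "mode_3"},
    }
    union, common, unique = sampling_overlap(sampled)
    print(f"sampled by any setup:   {sorted(union)}")
    print(f"sampled by every setup: {sorted(common)}")
    for name, modes in unique.items():
        print(f"only sampled by {name}: {sorted(modes)}")
```

A setup with an empty set of unique modes, such as the hypothetical `setup_B`, does not add to the combined sampling performance even though it samples some binding modes successfully.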
The redocking with GOLD did indeed profit from the use of different scoring functions. Only ChemPLP did not add to the overall sampling performance in the presented case. Instead, the redocking with Glide was, at least in terms of sampling, by far the most

TABLE 2 Amount and percentage of successfully sampled fragment binding modes as achieved in the redocking approaches native, nonnative, and nonnative unbiased with respect to the applied settings.
Note: GOLD dockings were performed with the implemented scoring functions ASP, CS, ChemPLP, and GS. Additionally, the inclusion of selected water molecules and the "generate diverse solutions" option was tested. Glide dockings were performed with the settings HTVS, SP, and XP. Additionally, the inclusion of selected water molecules and the "expanded sampling" option was tested. Sixty-two fragment binding modes were assessed in total. For every scoring function or docking mode, the top results of the native redocking setup are highlighted in bold font.

| Performance of scoring and evaluation strategies
Second, the performance of scoring and evaluation strategies was analyzed with the aim of identifying correct fragment binding modes among the entirety of considered binding hypotheses. In this study, scoring performance is defined as the number or percentage of fragment binding modes for which a correct pose (with ≤1.5 Å RMSD) was identified with the considered evaluation strategy. The three evaluation strategies (i) best-scored pose, (ii) five best-scored poses, and (iii) best cluster (the cluster definition is given in the Experimental section) were assessed.
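Evaluation strategies (i) and (ii) can be sketched as a top-k check over the ranked poses of each binding mode. The sketch assumes that a higher score is better and uses hypothetical pose data; strategy (iii) is omitted because the cluster definition is given in the Experimental section, which is not part of this text.

```python
from typing import Dict, List, Tuple

Pose = Tuple[float, float]  # (score, RMSD to the crystal structure in Å)


def correct_in_top_k(poses: List[Pose], k: int, cutoff: float = 1.5) -> bool:
    """True if any of the k best-scored poses lies within the RMSD cutoff."""
    ranked = sorted(poses, key=lambda pose: pose[0], reverse=True)
    return any(pose_rmsd <= cutoff for _, pose_rmsd in ranked[:k])


def scoring_performance(
    results: Dict[str, List[Pose]], k: int, cutoff: float = 1.5
) -> Tuple[int, float]:
    """Number and percentage of binding modes identified within the top k."""
    if not results:
        raise ValueError("no binding modes to evaluate")
    hits = sum(correct_in_top_k(poses, k, cutoff) for poses in results.values())
    return hits, 100.0 * hits / len(results)


if __name__ == "__main__":
    # Hypothetical docking poses for two fragment binding modes
    docked = {
        "mode_A": [(50.0, 2.0), (45.0, 0.8), (40.0, 3.5)],
        "mode_B": [(30.0, 1.0), (20.0, 3.0)],
    }
    for k, label in ((1, "best-scored pose"), (5, "five best-scored poses")):
        hits, percent = scoring_performance(docked, k)
        print(f"{label}: {hits} of {len(docked)} ({percent:.1f}%)")
```

Widening the evaluation from the single best-scored pose to the five best-scored poses can only keep or increase the number of identified binding modes, since every top-1 success is also a top-5 success.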
The first test sets comprise all docking poses that were generated with GOLD in the native redocking with and without selected water molecules. The "generate diverse solutions" option was enabled, and the poses were (re)scored and evaluated with the scoring functions ASP, ChemScore, ChemPLP, GoldScore, and DSX CSD. The second test sets were generated with Glide SP mode in the native redocking with and without selected water molecules. The "expanded sampling" option was disabled. Docking runs with the Glide HTVS or XP mode will not be discussed further, as sampling was far inferior to the runs with SP mode. By default, scoring was performed with the scoring function GlideScore, which was optimized for the comparison of different ligands. According to Friesner et al., [16] a modified version of GlideScore, called Emodel, is better suited for the pure prediction of binding poses and is therefore additionally calculated by Glide. Thus, both scoring functions were included in the presented evaluation of successful pose prediction. Again, DSX CSD was considered as an external scoring function.

FIGURE 4 Impact of ligand efficiency on successful sampling in the native redocking approach with inclusion of water molecules. Again, the setup with enabled "generate diverse solutions" was considered for GOLD with scoring functions ASP, ChemScore, ChemPLP, and GoldScore, whereas the setup with disabled "expanded sampling" was considered for Glide with docking modes high throughput virtual screening (HTVS), standard precision (SP), and extra precision (XP). The x-axis represents the number of docking protocols that were able to sample the depicted number of fragment binding modes successfully. All 62 considered fragment binding modes are represented. Applied ligand efficiency (LE) groups and color codes are in accordance with Schiebel et al. [14c] LE is indicated in kJ mol−1 atom−1.
Table 3 presents the results for these test sets. The consideration of water molecules, which could already enhance the sampling performance as discussed in the previous section, also raised the numbers of ultimately correctly identified binding modes. Yet, in the best cases of the native redocking with GoldScore and Emodel, less than one-third of the considered fragment binding modes were correctly identified as the best-scored pose. The exclusive consideration of the best-scored pose might be too narrow in the evaluation of fragment binding modes. Enhanced scoring performance was achievable when the best cluster or the five best-scored poses were taken into account instead. The use of an apoprotein structure in the nonnative redocking and unbiased binding site definitions further diminished the scoring performance.
By analogy with Figure 4, the relation between a fragment's ligand efficiency and the successful identification of the crystallized binding mode was assessed for evaluation method (iii), the best cluster. The analysis focused on native redocking with consideration of water molecules. Detailed fragment-specific information on the top RMSD values that were observed in the best cluster is provided in Supporting Information S1: Table S7. The results are depicted in Figure 5a. Nineteen of the 62 considered fragment binding modes (30.6%) could not be predicted by any of the described docking setups. Still, this was only the case for one of the fragment binding modes in the highest LE category. Moreover, 14 fragment binding modes (22.6%) were correctly predicted by the best cluster of 5, 6, or 7 scoring functions.

TABLE 3 Amount and percentage of correctly sampled and identified fragment binding modes as achieved in the redocking approaches native, nonnative, and nonnative unbiased.

| A blinded virtual screening approach
A real-life fragment screening attempt often consists of a prescreening with a fast biochemical or biophysical assay and subsequent crystallographic screenings. These crystallographic experiments can be expensive and time-consuming. However, reliable structural information about the fragment binding modes is essential for further rational fragment growing, fragment linking, and fragment merging attempts on the road to a potent lead structure.
The present study aims to assess whether virtual fragment docking could be a competitive or even more efficient alternative to biochemical and biophysical prescreening options. Therefore, a blinded virtual screening approach was conducted with the entire database containing 359 fragments. The endothiapepsin cleft was divided into three subpockets as presented in the nonnative unbiased redocking. Possible stereoisomers, protomers, and tautomers were prepared for every fragment. The obtained fragment structures were docked in two setups: A first attempt was carried out with GOLD and all implemented scoring functions. Every docking pose was (re)scored with the scoring functions ASP, ChemScore, ChemPLP, GoldScore, and DSX CSD. A second attempt was carried out with Glide. In this case, every docking pose was (re)scored with GlideScore and DSX CSD.
Once more, it shall be pointed out that the distinction between binders and nonbinders is not trivial in the context of fragment-sized molecules. This issue was nicely illustrated by the poor overlap of the results from diverse in vitro screening methods as presented by Schiebel et al. [14b,14c] Among the 291 fragments that were not found in the crystallographic screening, 105 were inactive in all prescreening assays, 110 were found as hits with one method only, and 76 were positive in two to five prescreening assays. None of the crystallographic nonbinders was active with all six prescreening methods. Thus, crystallographic screening can also miss fragments with a certain biochemical or biophysical activity. However, most real-world fragment screening approaches represent the starting point in a rational drug design project. Thus, structural information about the fragment binding mode is indispensable. That is why the presented study deliberately focuses on the achievable enrichment of fragments with crystallographically confirmed target binding.

FIGURE 5 (a) Impact of ligand efficiency on successful pose prediction in the native redocking approach with inclusion of water molecules and evaluation method (iii) considering the best cluster. GOLD poses were evaluated with the scoring functions ASP, ChemScore, ChemPLP, GoldScore, and DSX CSD for this analysis. Glide standard precision (SP) poses were evaluated with GlideScore and Emodel. The x-axis represents the number of scoring functions that were able to sample the depicted number of fragment binding modes successfully. All 62 considered fragment binding modes are represented. Applied ligand efficiency (LE) groups and color codes are in accordance with Schiebel et al. [14c] LE is indicated in kJ mol−1 atom−1. (b) Overall impact of the LE on successful sampling and scoring with consideration of the native redocking approach with water molecules and evaluation method (iii) considering the best cluster. "Success" indicates that fragment binding modes were correctly sampled and found in the top-scored cluster for any scoring presented in Table 3. "Scoring failure" indicates that fragment binding modes were correctly sampled but not found in the top-scored cluster for any scoring. "Sampling failure" indicates that fragment binding modes were not even sampled with any of the presented setups. All 62 considered fragment binding modes are represented. Applied LE groups are in accordance with Schiebel et al. [14c] LE is indicated in kJ mol−1 atom−1.
The obtained enrichments of crystallographic binders were assessed in a score-focused and a PAS-focused evaluation.

[Figure: ROC curves for the enrichment of crystallographic binders; only the end of the caption is preserved: "... Glide docking. (e) Evaluation of the different experimental assays according to the assay data reported by Schiebel et al. [14b] As not every fragment could be assessed in every assay, the ROC curves were completed with a dashed line representing random probabilities."]
comparable to random selection only, in particular when used with a higher fragment concentration that, apparently, favors false positives.
Overall, one may conclude that molecular docking can indeed be a valid alternative for prescreening purposes in the given scenario.The success in enrichment, however, depends strongly on the type of evaluation.While GlideScore performed best with the score-focused evaluation, the remaining scoring functions ASP, ChemScore, ChemPLP, GoldScore, and DSX CSD should be used with a PAS-focused evaluation.
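The enrichment of crystallographic binders can be quantified, for example, with an enrichment factor for the best-scored fraction of the database and with the area under the ROC curve. The sketch below assumes that a higher score is better and uses hypothetical scores and binder labels; for a PAS-focused evaluation, the scores would first be divided by the number of nonhydrogen atoms. The function names and the choice of these two metrics are illustrative and are not taken from the study.

```python
from typing import Sequence


def enrichment_factor(
    scores: Sequence[float], is_binder: Sequence[bool], fraction: float = 0.10
) -> float:
    """Hit rate in the best-scored fraction divided by the overall hit rate."""
    n = len(scores)
    total_hits = sum(is_binder)
    if n == 0 or n != len(is_binder) or total_hits == 0:
        raise ValueError("need matching, non-empty input with at least one binder")
    n_top = max(1, round(n * fraction))
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    hits_top = sum(is_binder[i] for i in order[:n_top])
    return (hits_top / n_top) / (total_hits / n)


def roc_auc(scores: Sequence[float], is_binder: Sequence[bool]) -> float:
    """ROC AUC as the probability that a binder outscores a nonbinder."""
    binders = [s for s, b in zip(scores, is_binder) if b]
    nonbinders = [s for s, b in zip(scores, is_binder) if not b]
    if not binders or not nonbinders:
        raise ValueError("need at least one binder and one nonbinder")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in binders
        for q in nonbinders
    )
    return wins / (len(binders) * len(nonbinders))


if __name__ == "__main__":
    # Hypothetical database of ten fragments, two of them binders
    scores = [9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0]
    binders = [True, False, True, False, False, False, False, False, False, False]
    print(f"EF (top 10%): {enrichment_factor(scores, binders):.2f}")
    print(f"ROC AUC:      {roc_auc(scores, binders):.4f}")
```

An enrichment factor of 1 or a ROC AUC of 0.5 corresponds to random selection, which is the baseline against which both the virtual and the experimental prescreening methods are compared.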

FIGURE 7 The barplots present the percentage of docking poses with root mean square deviation (RMSD) ≤ 1.5 Å to a crystallized binding mode for every fragment considered in the structure-focused evaluation of redocking approaches, averaged over all considered stereoisomers and protomers and grouped by the ligand efficiency (LE) as reported by Schiebel et al. [14] LE is indicated in kJ mol−1 atom−1.

If the best 10% of the fragment database is considered in a score-focused evaluation, GlideScore outperforms all experimental screening methods. In a PAS-focused evaluation, rescoring with DSX CSD also provides high enrichment that is only surpassed by experimental screening with RDA. Additionally, a strong advantage of virtual prescreening is the testability of all fragments. Most biochemical or biophysical assays face different limitations like fragment solubility or purity that restrict the number of effectively prescreened compounds.
Although good enrichment could be obtained in the blinded virtual screening approach, the correct prediction of crystallographically observed binding modes was limited. Figure 7 allows a detailed analysis of the sampling and scoring success of all fragments considered already in the structure-focused evaluation of redocking approaches, but now averaged over all considered stereoisomers and protomers (as used in the blinded virtual screening approach) and grouped by the fragments' ligand efficiency values. Again, fragments with high ligand efficiency were better sampled than fragments with low LE values. If only the best-scored fragment poses were considered, the identification of the correct binding mode at a threshold of 1.5 Å RMSD failed for many fragments. Yet, GlideScore remarkably allowed correct pose prediction for four of nine fragments in the group with the highest ligand efficiency.
These observations raised the question of how virtual prescreening could achieve the presented strong enrichment despite rather poor pose prediction. Therefore, further analyses of endothiapepsin as a protein target and the fragment database were conducted. First, the electrostatic surface potential of endothiapepsin was calculated, as visualized in Figure 8. The entire oblong binding site region is characterized by a negative electrostatic potential. Thus, it appears plausible that positively charged ligands might preferably bind to this target.
A fragment property analysis was performed to assess whether any significant differences could be found between the group of crystallographic binders and the group of nonbinders. For this analysis, the term binders refers to the crystallized fragments as considered in the structure-focused evaluation of redocking approaches (this applies to 57 fragments, cf. Supporting Information S1: Figure S1). Nonbinders were defined as fragments that showed negative results with all prescreening methods and the crystallographic screening (this applies to 105 fragments [14b,14c]).
The three properties of charge, number of aromatic atoms, and clogP differed at a significance level of α = 0.01. The violin plots in Figure 9 illustrate the respective property distribution. As expected after the calculation of the electrostatic surface potential of endothiapepsin, most binders were positively charged, whereas most nonbinders were uncharged. Moreover, a higher number of aromatic atoms was found in the group of binders. More explicitly, only 6 of the 57 binders (11%) did not contain an aromatic ring, while this was the case for 29 of the 105 nonbinders (28%). Lastly, the calculated logP was significantly lower for the binders than for the nonbinders, indicating better aqueous solubility, which can be an important factor in the successful crystallization of ligands. Taken together, binding fragments tend to be positively charged, contain at least one aromatic ring, and show good solubility. Apparently, these properties also favor better scores in the interaction with endothiapepsin, in particular on the per-atom scale (as observed with all scoring functions), but also in terms of the absolute score values in the case of GlideScore, leading to an enrichment despite the limited quality of pose prediction.
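The reported proportions of fragments without an aromatic ring can be checked with a few lines of arithmetic. The sketch additionally applies a two-sided Fisher's exact test to the resulting 2x2 table. This test is an assumption chosen for illustration: the text does not state which statistical test the authors used, and their reported significance refers to the number of aromatic atoms rather than to the presence of a ring.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as likely as the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = row1 + row2

    def probability(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    observed = probability(a)
    low, high = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        probability(x)
        for x in range(low, high + 1)
        if probability(x) <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Counts from the text: fragments without an aromatic ring per group
    binders_without, binders_total = 6, 57
    nonbinders_without, nonbinders_total = 29, 105

    print(f"binders without ring:    {100 * binders_without / binders_total:.0f}%")
    print(f"nonbinders without ring: {100 * nonbinders_without / nonbinders_total:.0f}%")

    p = fisher_exact_two_sided(
        binders_without,
        binders_total - binders_without,
        nonbinders_without,
        nonbinders_total - nonbinders_without,
    )
    print(f"Fisher's exact test (illustrative), p = {p:.3f}")
```

The percentages reproduce the rounded values given in the text (11% and 28%); the p-value only illustrates how such a difference in proportions could be tested and should not be read as a result of the study.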

| CONCLUSION
The presented study juxtaposes the findings of an extensive in vitro screening of a 361-fragment library against the aspartic protease endothiapepsin by Schiebel et al. [14b,14c] with the results of in silico approaches, including the docking programs GOLD and Glide as well as the scoring functions ASP, ChemScore, ChemPLP, GoldScore, DSX using CSD potentials (DSX CSD ), and GlideScore.
FIGURE 8 Electrostatic surface potential map of endothiapepsin apostructure 4Y5L as calculated with the Amber22 PBSA tool.
First, the ranking power and scoring power of the individual scoring functions were evaluated based on the PAS and the experimentally determined ligand efficiencies. At a significance level of α = 0.05, all scoring functions showed significant correlations, while in most cases, the scoring power (represented by the Pearson correlation coefficient) was superior to the observed ranking power (represented by Spearman's correlation coefficient). The strongest correlations could be achieved with GoldScore and GlideScore.
Interestingly, the consideration of interacting water molecules could not generally improve the correlation strength, while DSX CSD and GlideScore did profit slightly from the additional information.
Second, a structure-focused evaluation of redocking approaches was presented in detail to analyze sampling success as well as the success of scoring and evaluation strategies. As expected, the overall sampling and scoring could profit notably from the use of native protein conformations and narrow binding site definitions. More interestingly, the docking programs GOLD and Glide both offer docking options to enhance sampling (referred to as generate diverse solutions or expanded sampling, respectively), which can be of special interest in the docking of fragments to fully cover the energetic landscape of possible binding modes. While the generate diverse solutions option in GOLD indeed led to notably improved sampling, the expanded sampling option in Glide rather reduced the number of correctly sampled fragment binding modes. Another interesting finding was the enhancement achieved through the consideration of interacting water molecules. While this information hardly improved the ranking or scoring power (as discussed above), it allowed notable improvements in the sampling performance as well as the scoring and evaluation strategies in the structure-focused evaluation of redocking approaches. Fragment docking can thus be expected to profit from a thorough analysis of conserved water molecules in known structures of the target in its apo form or in complex with other ligands.
Most strikingly, high ligand efficiency was found to be associated with a fragment binding mode's chance to be successfully sampled and identified by a larger number of docking protocols.The overlap of highly scored fragment poses according to different scoring functions might thus be used as a potential indicator for the reliability of the obtained fragment binding mode.

| Preparation of structural data
Protein: The crystal structures were obtained from the Protein Data Bank. [27] A comprehensive list of all considered structures and fragments is provided in Supporting Information S1: Table S1. When superposed onto the Cα atoms of apostructure 4Y5L, the structures show a high degree of similarity (cf. Supporting Information S1: Figure S3). The individual complexes were prepared with the MOE [28] structure preparation and protonate 3D tool (pH 4.6, salt concentration 0.1 M, 290 K). Water molecules were selected if they showed hydrogen bonds with both ligand and protein (cf. Supporting Information S1: Table S1).
Ligand: For the determination of ranking power and scoring power, the considered fragment conformations were extracted from the protonated protein-ligand complexes. For the structure-focused evaluation of redocking studies, unresolved fragment atoms were added manually if necessary, and energy minimization was carried out with the MMFF94x force field to an RMS gradient of 0.001 kcal/mol/Å.
For the blinded virtual screening approach, a database of all 359 fragments was prepared in MOE (SMILES are provided in Supporting Information S1: Table S3). Starting from this database, 474 stereoisomers were created with RDKit, [29] and 633 respective protomers were predicted with MOE at pH = 4.6 and checked through visual inspection. Subsequent energy minimization was again carried out with MMFF94x to an RMS gradient of 0.001 kcal/mol/Å.

| Scoring of experimental binding modes
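The scoring of fragments in their crystallographically determined binding modes was conducted in the absence and presence of the selected water molecules observed in the crystal structures. Rescoring with the scoring functions ASP, ChemScore, ChemPLP, and GoldScore (all implemented in GOLD) was performed with local optimization (GOLD version 2021.3.0). The stand-alone scoring function DSX was applied with CSD potentials and default settings. Rescoring with GlideScore (as implemented in Glide) was also performed with local optimization. The respective score was divided by the number of resolved non-hydrogen atoms to obtain a PAS.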
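The per-atom score (PAS) and its correlation with ligand efficiency (LE) can be sketched in a few lines. The example below is a minimal illustration and not the workflow of the study, which used R for the correlation analysis; all function names and the four data points are hypothetical. The absolute score is taken before division because some scoring functions report better poses with more negative values.

```python
from math import sqrt


def per_atom_score(score, n_heavy_atoms):
    """Absolute score divided by the number of resolved non-hydrogen atoms."""
    return abs(score) / n_heavy_atoms


def pearson(x, y):
    """Pearson correlation coefficient (used here as a measure of scoring power)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation (ranking power): Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical example data: four fragments with scores, heavy-atom counts, and LE
scores = [30.0, 45.0, 24.0, 52.0]
heavy_atoms = [10, 15, 12, 13]
ligand_efficiency = [1.5, 1.4, 1.0, 1.9]

pas = [per_atom_score(s, n) for s, n in zip(scores, heavy_atoms)]
r_pearson = pearson(pas, ligand_efficiency)
r_spearman = spearman(pas, ligand_efficiency)
print(pas, round(r_pearson, 3), round(r_spearman, 3))
```

Dividing by the heavy-atom count puts the score on the same per-atom footing as LE, so that larger fragments are not favored simply because of their size.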

| Evaluation of ranking power and scoring power
The calculated absolute PAS values were tested for correlation with the LE data as obtained from Schiebel et al. [14c] with and without the consideration of the selected water molecules. Pearson and Spearman correlation coefficients were calculated with R. [30]

| Docking and scoring in the structure-focused evaluation of redocking approaches
In the structure-focused evaluation of redocking approaches, 62 crystallographically determined fragment binding modes were studied in detail (compare Supporting Information S1: Table S1). In the native redocking setup, the protein conformation was used as in the crystal structure of the respective protein-ligand complex. In the nonnative and nonnative unbiased redocking setups, the protein conformation of the endothiapepsin apostructure 4Y5L was applied. RMSD values were calculated with the program fconv [31] and the option "rmsd2", which allows the comparison with incompletely resolved reference structures and respects symmetry to avoid artificially high RMSD results.
Docking calculations with GOLD were performed with GOLD version 2021.3.0 [15] and all four implemented scoring functions: ASP, ChemScore (CS), ChemPLP, and GoldScore (GS). The number of genetic operations (maxops) was set to 120000. The formation of intramolecular hydrogen bonds, the matching of ring templates, and the flipping of free ring corners as well as amide bonds, pyramidal nitrogen atoms, and carboxylic acids were allowed. For the native and nonnative redocking setup, the binding site was defined as a cavity within 5 Å of the reference ligand. In this case, 50 poses were generated per scoring function and considered fragment binding mode. For the nonnative unbiased redocking setup, the entire endothiapepsin cleft was divided into three subpockets that were defined by lists of residues (compare Supporting Information S1: Table S2). Fifty docking poses were generated per subpocket, scoring function, and fragment. In docking runs with the activated generate diverse solutions option, the associated cluster size was set to 5 and the RMSD cutoff to 1.5 Å. Dockings were performed in the absence and presence of preselected water molecules in a static position.
Every docking pose was rescored by the remaining scoring functions implemented in GOLD plus DSX with CSD potentials (DSX CSD ). A detailed overview of the docking and rescoring procedure with GOLD is provided in Supporting Information S1: Figure S4.
Docking calculations with Glide (version 2023-2) [16] were prepared with Maestro's Receptor Grid Generation tool and performed with the docking modes HTVS, SP, and extra precision (XP). Force field OPLS_2005 [32] was applied by default. Grid generation was performed in the absence and presence of the preselected water molecules in a static position. The number of poses used in post-docking minimizations was set to 500. For the native and nonnative redocking setup, the binding site grid was derived from the respectively crystallized binding mode with an inner box size of 7 Å. A maximum number of 50 poses was generated. For the nonnative unbiased redocking setup, the three subpockets were defined by the box parameters presented in Supporting Information S1: Table S2. A maximum of 50 docking poses were generated per subpocket and fragment. Dockings were performed with enabled and disabled expanded sampling options. The assigned GlideScore and Emodel scores were considered in the study.
Clustering of obtained docking poses was performed with a score-headed hierarchical clustering algorithm to group all considered docking poses in the study of a certain fragment binding mode in a specific test set. For this purpose, the poses are first ranked by the score of the assessed scoring function. The best-ranked pose is assigned to the first cluster. All poses that show an RMSD smaller than or equal to the clustering threshold of 2 Å to this first reference pose are added to the same cluster. Among the remaining unassigned poses, the best-ranked pose is then ascribed to the second cluster together with all further unassigned poses with an RMSD ≤ 2 Å to this second reference pose. This procedure continues until every pose is assigned to a cluster. Fconv [31] was used to calculate RMSD values.
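The score-headed clustering procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation used in the study: the function names are hypothetical, poses are given as lists of (x, y, z) tuples with identical atom order, and the RMSD is a plain coordinate RMSD without the symmetry handling provided by fconv. The `higher_is_better` flag is needed because GOLD fitness values improve with increasing score, whereas GlideScore improves with decreasing score.

```python
from math import sqrt


def rmsd(pose_a, pose_b):
    """Plain RMSD between two poses with identical atom order (no symmetry correction)."""
    squared = sum(
        (a - b) ** 2
        for atom_a, atom_b in zip(pose_a, pose_b)
        for a, b in zip(atom_a, atom_b)
    )
    return sqrt(squared / len(pose_a))


def score_headed_clustering(poses, scores, threshold=2.0, higher_is_better=True):
    """Group poses around the best-scored unassigned pose until none are left.

    Returns a list of clusters; each cluster is a list of pose indices with
    the cluster head (best-scored member) in first position.
    """
    # Rank all poses by score, best first
    unassigned = sorted(
        range(len(poses)), key=lambda i: scores[i], reverse=higher_is_better
    )
    clusters = []
    while unassigned:
        head = unassigned[0]
        # The head has RMSD 0 to itself and is therefore always included
        members = [i for i in unassigned if rmsd(poses[head], poses[i]) <= threshold]
        clusters.append(members)
        unassigned = [i for i in unassigned if i not in members]
    return clusters


# Hypothetical toy example: five single-atom "poses" along the x axis
toy_poses = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(5.0, 0.0, 0.0)],
             [(6.5, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
toy_scores = [50.0, 40.0, 60.0, 30.0, 45.0]
toy_clusters = score_headed_clustering(toy_poses, toy_scores)
print(toy_clusters)
```

In this toy example, the pose at x = 3 Å lies exactly 2 Å from the best-scored pose at x = 5 Å and is therefore assigned to the first cluster, although it is equally close to the pose at x = 1 Å. This reflects the greedy character of the procedure: cluster membership is decided by the ranking order, not by the nearest cluster head.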

FIGURE 1 (a) Endothiapepsin apostructure (PDB: 4Y5L) with crystallographically observed binding poses of all considered fragments after protein superposition. The fragments cluster in three partially overlapping subpockets of the large cleft. The catalytic residues Asp35 and Asp219 are located at the junction of subpockets 1 and 2. (b) Outline of the setups in the structure-focused evaluation of redocking approaches.

FIGURE 2 (a) Crystallographically observed binding mode of Ligand 211 (PDB: 4YCK). A network of hydrogen bonds with the catalytic residues is induced by water. Distances are given in Ångström. (b) Docking poses with the smallest root mean square deviation (RMSD) value, when no water is considered, predicted by GoldScore (GS) (light orange) and Glide high throughput virtual screening (HTVS) (light green). Hydrogen bonds are directly formed between the ligand's protonated amine group and the catalytic aspartates. (c) Docking poses with the smallest RMSD value for the GS (dark orange) and HTVS (dark green) docking considering selected waters. The binding mode is highly similar to the crystallographically observed pose.

successful with the SP mode. Further, the docking modes high throughput virtual screening (HTVS) and extra precision (XP) could each add only one entry to the list of successfully sampled fragment binding modes. In total, the use of different scoring functions in GOLD allowed the sampling of six more fragments (9.7%) compared to the sampling with all docking modes in Glide. Furthermore, the relation between a fragment's ligand efficiency and the sampling success of its binding mode was evaluated. The results are depicted in Figure 4. Indeed, fragments with high LE values were more often correctly sampled than fragments with low or undetermined LE values. The five fragments whose sampling failed in all docking protocols all showed LE values ≤ 1.2 kJ mol⁻¹ atom⁻¹.

FIGURE 3 The Venn diagrams show the overlap of successfully sampled fragments between the analyzed native redocking protocols. (a) Focus on the dockings with GOLD and the implemented scoring functions ASP, ChemScore, ChemPLP, and GoldScore with the outstanding setup considering selected water molecules and the enabled generate diverse solutions function. (b) Focus on the dockings with Glide and the implemented modes high throughput virtual screening (HTVS), standard precision (SP), and extra precision (XP) with the outstanding setup considering selected water molecules and disabled expanded sampling function. (c) Meta-analysis of sampling success with any scoring function in GOLD or any mode in Glide with the aforementioned settings. 57 of the 62 considered fragment binding modes could be reproduced at a cutoff of 1.5 Å.

allows an overall comparison of how many fragment binding modes were successfully sampled and/or identified in the best-scored cluster in any of the native redocking setups with the inclusion of selected water molecules. The plot includes the data of 200 GOLD docking poses scored by ASP, ChemScore, ChemPLP, GoldScore, and DSX CSD, as well as the data of up to 50 Glide SP docking poses scored by GlideScore and Emodel for each fragment binding mode. It can once more be observed that the success of sampling and scoring does indeed depend on the fragments' ligand efficiency.
Schiebel et al. [14b] provided extensive experimental results for the entire 359 fragment database and six different prescreening methods, as mentioned in the Introduction: a biochemical cleavage assay, RDA, STD-NMR, ESI-MS, TSA, and MST. In addition to basic STD-NMR experiments, competitive displacement experiments (referred to as CTL-NMR) with the inhibitor ritonavir were conducted to identify fragment binding to the active site. The MST experiments were performed with two different concentrations of the tested fragment: 0.5 and 2.5 mM.

Figure 6a-d shows the associated ROC curves.

Figure 6e illustrates considerable variability in the success rates of the tested assay methods. The best enrichment was reached with RDA, which allowed a 2.94-fold enrichment if the best 10% of the fragment database is considered. Instead, MST screening performed

⁻¹ atom⁻¹. Additional point markers indicate which scoring functions successfully assigned the best score to such a pose with RMSD ≤ 1.5 Å to a crystallized binding mode for at least one of the considered stereoisomers/protomers. (a) Screening setup with GOLD. (b) Screening setup with Glide.

FIGURE 9 Comparing violin plots for the selected properties charge, number of aromatic atoms, and clogP. Binders (B) are juxtaposed with nonbinders (NB). Individual values are shown as jittered gray dots; the median is shown as a black dot. Details of the statistical evaluation are provided in Supporting Information S1: Table S8.

Table 4 supplements this visual overview with the number of crystallographic hit fragments obtained if the best 10%/20% of the fragment database is considered. In the score-focused evaluation, by far the best enrichment was obtained with GlideScore, which reached an enrichment factor of 3.09 if the best 10% of the database is considered. In contrast, ChemScore and DSX CSD performed worse than a random fragment selection when the absolute scores were

TABLE 4 Number of crystallographic hit fragments found if the best 10%/20% of the fragment database (db) is considered after selection with the presented prescreening method.
Note: Additionally, in parentheses, the enrichment is shown in relation to the number of crystallographic hit fragments that would have been found at chance probability (6.8 fragments if 10%/13.6 fragments if 20% of the database is screened).
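The enrichment values given in parentheses follow from a simple ratio, sketched below. The total of 68 crystallographic hits is inferred from the chance expectations stated in the note (6.8 at 10% and 13.6 at 20%) and is an assumption of this example; likewise, the hit counts of 21 and 20 are back-calculated values that would reproduce the reported enrichments of 3.09 (GlideScore) and 2.94 (RDA) and are not taken from the table itself. The function name is hypothetical.

```python
def enrichment_factor(hits_in_selection, total_hits, fraction_screened):
    """Ratio of hits found in the selected subset to hits expected by chance."""
    expected_by_chance = total_hits * fraction_screened
    return hits_in_selection / expected_by_chance


# Assumption: 68 crystallographic hits in total, inferred from 6.8 hits
# expected by chance when 10% of the database is screened
TOTAL_HITS = 68

# Hypothetical hit counts in the best 10% of the database
ef_21_hits = enrichment_factor(21, TOTAL_HITS, 0.10)
ef_20_hits = enrichment_factor(20, TOTAL_HITS, 0.10)
print(round(ef_21_hits, 2), round(ef_20_hits, 2))
```

An enrichment factor below 1 corresponds to a selection that performs worse than random picking, which is how the comparison with chance probability in the note can be read.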