Automated benchmarking of combined protein structure and ligand conformation prediction

The prediction of protein‐ligand complexes (PLC), using both experimental and predicted structures, is an active and important area of research, underscored by the inclusion of the Protein‐Ligand Interaction category in the latest round of the Critical Assessment of Protein Structure Prediction experiment (CASP15). The prediction task in CASP15 consisted of predicting both the three‐dimensional structure of the receptor protein and the position and conformation of the ligand. This paper addresses the challenges of devising automated benchmarking techniques for PLC prediction and proposes solutions. The reliability of experimentally solved PLC as ground-truth reference structures is assessed using various validation criteria. Similarity of PLC to previously released complexes is employed to judge PLC diversity and the difficulty of a PLC as a prediction target. We show that the commonly used PDBBind time‐split test-set is inappropriate for comprehensive PLC evaluation, with state‐of‐the‐art tools showing conflicting results on a more representative, high-quality dataset constructed for benchmarking purposes. We also show that redocking on crystal structures is a much simpler task than docking into predicted protein models, as demonstrated by two scoring metrics created specifically for PLC prediction. Finally, we introduce a fully automated pipeline that predicts PLC and evaluates the accuracy of the protein structure, ligand pose, and protein–ligand interactions.

The results of the CASP15 protein–ligand interaction (CASP15-PLI) experiment are presented elsewhere in this issue,9 as well as the technical details and challenges encountered during the establishment of the new category as part of CASP.10 These challenges include: (1) PLC with incomplete ligands or suboptimal quality to be used as ground truth ligand poses, (2) the need for extensive manual verification of data input and prediction output, and (3) the lack of suitable scoring metrics that consider both protein structure and ligand pose prediction accuracy, which necessitated the development of novel scores.
By integrating the insights and developments from the CASP15-PLI experiment, automated systems for the continuous benchmarking of combined PLC prediction can be established. The Continuous Automated Model EvaluatiOn (CAMEO, https://beta.cameo3d.org/)11,13-15 is one such system. Since 2012, its 3D structure prediction category has been assessing the accuracy of single-chain predictions. Additional assessment categories have been implemented over time to serve the structural bioinformatics community, in particular around the assessment of quality estimates. Recently, efforts were made towards the assessment of protein–protein complexes (quaternary structures) and protein–ligand pose prediction.11 While CAMEO allows for continuous validation of newly developed methods, it is dependent on the distribution of PLC released in the PDB in a given period. Thus, CAMEO evaluation in a given time period may not be representative of the entire PLC space, and method developers may not have immediate access to problem cases or specific sets of PLC where their algorithm under- or overperforms. This suggests a second, complementary angle to automated benchmarking: the creation of a diverse dataset of PLC with representative complexes from across protein–ligand space, which would allow both global comparative scoring and pinpointing cases that method developers need to address to improve their overall performance. In this paper, we discuss challenges and insights associated with the development of these two complementary approaches for PLC benchmarking.
Previous research has shown that the quality of experimentally resolved structures can vary significantly.16 Errors in experimental structures can not only incorrectly bias the training of prediction methods; comparing prediction results to lower-quality structures can also skew the perception of their performance. This is an especially important consideration when assessing deep learning (DL)-based tools, which have been trained to reproduce results seen in experimentally resolved structures. Additionally, many crystal structures with ligands contain missing atoms or missing residues in the binding site, complicating their use as ground truth. Efforts have been made to establish criteria for assessing the quality of such structures, like the Iridium criteria.17 In this article, we explore various criteria and their feasibility for filtering and annotating target structures.
Even in the era of DL, the difficulty of a PLC prediction still depends, to some degree, on the availability of previously experimentally resolved structures of similar PLC. This was exemplified in this year's CASP-PLI results,9 where template-based docking methods outperformed others due to the availability of previously solved, highly similar PLC for many of the targets. In addition, redundancy between the training and test sets can mask overfitting, a common problem with DL methods. Thus, incorporating the novelty of a PLC into automated benchmarking setups is crucial for a fair and comprehensive evaluation.
The fields of protein structure prediction and ligand docking both have a number of scoring metrics for the assessment of prediction results. In both cases, however, the scores usually rely on pairwise residue- or atom-level comparisons of equivalent atoms, of the protein for the former and of the ligand for the latter. The joint task of PLC prediction has a number of other considerations that make these scores incomplete.21,22 Moreover, some ligands have highly flexible regions that mainly interact with the solvent, where evaluating the conformation of the flexible part may not be as meaningful as for the parts of the ligand forming crucial interactions with protein residues. Thus, it is necessary to develop and employ evaluation metrics that extend beyond rigid protein and rigid ligand pose assessments. Automated PLC prediction assessment therefore relies on tools and methodologies for three main tasks: deciding if an experimentally solved PLC is of high enough quality to use as ground truth, assessing how different a PLC is from previously determined PLC, and evaluating predicted PLC from the protein, ligand, and interaction perspectives. In this article, we show how these three questions can be answered in the context of both continuous and representative automated benchmarking.

2.1 | Is the ground truth good enough?
With any benchmarking experiment, ensuring the correctness of what we use as a reference is of utmost importance. To assess the distribution of high-quality crystal structures of PLC in the PDB,23 we extracted protein, ligand, and binding pocket information from PDB validation reports for 114 973 PLC entries in the PDB solved by x-ray crystallography (see Section 3 for the criteria and definitions used). We analyzed 236 538 small molecule pockets across 75 065 PLC PDB entries and 32 273 unique small-molecule ligands, and 798 651 ion pockets across 84 215 PLC and 138 unique ions. In total, this corresponds to over a million pockets.
The authors of the Iridium dataset defined a highly stringent set of criteria regarding the quality of crystal structures, with emphasis on suitability for pose prediction, virtual screening, and binding affinity estimation.17 These include criteria on the protein (resolution ≤3.5 Å, R < 0.4, Rfree < 0.45, absolute difference between R and Rfree ≤0.05) as well as ligand and pocket criteria (full density with RSR ≤0.1 and RSCC ≥0.9, full atom occupancies, and no alternative configurations for ligand atoms or protein atoms within 6 Å of the ligand). We applied the Iridium criteria to the binding pockets within our set of PLC. Only 0.3% (721) of small molecule pockets across 504 PLC and 0.98% (315) of unique small molecule ligands passed, as did 0.66% (5248) of ion pockets across 3379 PLC and 35.51% (49) of unique ions. In total, 0.58% of all pockets are acceptable according to the Iridium criteria, across 3.21% (3686) of PLC and 1.12% (364) of unique ligands. These criteria are thus too stringent for both of the applications we explore: for continuous evaluation methods such as CAMEO, which runs on a weekly basis, the majority, if not all, PLC would be discarded.
Similarly, restricting to such a small fraction of the PDB is incompatible with creating a diverse and representative dataset of PLC for comprehensive training and evaluation. We suggest alternative "relaxed" criteria with ligand RSCC > 0.8 and >90% of protein residues within 6 Å of the ligand having RSCC > 0.8, with the remaining criteria the same as Iridium. The threshold of 0.8 for RSCC is in accordance with the widely accepted rule of thumb that 0.8 < RSCC < 0.95 is generally acceptable, RSCC > 0.95 indicates a very good fit, and RSCC < 0.8 indicates that the experimental data may not accord with the ligand placement.24 Such a set of relaxed criteria could be used as a post-filter step in the CAMEO setting; for a representative dataset, the stringent Iridium criteria could be used to create the starting set, with more PLC being added based on their diversity and the relaxed criteria.
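To make the filter concrete, the following is a minimal sketch of such a relaxed-criteria check. The field names of the `pocket` record are hypothetical stand-ins for values parsed from the PDB validation reports, not an established schema:

```python
def passes_relaxed_criteria(pocket: dict) -> bool:
    """Relaxed pocket-quality filter: Iridium protein-level thresholds,
    but RSCC > 0.8 for the ligand and for >90% of the protein residues
    within 6 A of the ligand (instead of full density with RSR/RSCC)."""
    protein_ok = (
        pocket["resolution"] <= 3.5
        and pocket["r_work"] < 0.4
        and pocket["r_free"] < 0.45
        and abs(pocket["r_work"] - pocket["r_free"]) <= 0.05
    )
    ligand_ok = (
        pocket["ligand_rscc"] > 0.8
        and pocket["ligand_occupancy"] == 1.0   # full occupancy
        and not pocket["ligand_has_altloc"]     # no alternative configurations
    )
    # Per-residue RSCC values for pocket residues (within 6 A of the ligand).
    rsccs = pocket["pocket_residue_rscc"]
    pocket_ok = sum(r > 0.8 for r in rsccs) / len(rsccs) > 0.9
    return protein_ok and ligand_ok and pocket_ok
```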
Figure 1 shows the distribution of validation data values across all binding pockets, as well as the selected relaxed thresholds, for four criteria: resolution (Figure 1A), absolute difference between R and Rfree (Figure 1B), RSCC (Figure 1C), and percentage of protein residues within 6 Å of the ligand with RSCC > 0.8 (Figure 1D). The most stringent criterion is by far the absolute difference between R and Rfree, which alone removes almost 15% of the pockets.
When applying our suggested criteria, we retained 44% of pockets. We also applied these criteria to the PDBBind time-split test-set,25 commonly used by recent DL-based docking methods,25-27 which contains 255 small molecule or ion-binding PLC. Nearly half (105) do not pass even these relaxed criteria, which may compromise the validity of recent benchmarking efforts using this set.
Thus, the community can make use of the criteria explored in this section, along with the publicly and programmatically available PDB validation reports, to automate the selection of high-quality ground truth in their prediction and evaluation efforts. Indeed, similar efforts to annotate PLC quality are ongoing in the ELIXIR 3D-BioInfo community.28 The results of that initiative could be incorporated into this assessment once they are available.

2.2 | Is a PLC target interesting to assess?
In the context of large-scale structural databases such as the PDB, it is possible to encounter several very similar PLC, or complexes with the same protein and ligand pose that have been crystallized in different experimental conditions or resolved by different experimental methods. The CASP15-PLI assessment9 highlighted the superiority of template-based methods for modeling PLC accurately, as the set of target PLC in CASP15 in general had numerous templates already available in the PDB. This does not necessarily translate to the entire PLC space, and indeed the generalizability of methods to PLC with novel folds, ligands, and binding modes is of great interest.
FIGURE 1 Distributions across PLC pockets of (A) experimental resolution, (B) difference between R and Rfree, (C) ligand RSCC, and (D) the percentage of protein atoms within 6 Å of the ligand which have an RSCC > 0.8. Pockets are divided into categories depending on the number of rotatable bonds of the ligands they contain. In each panel, the black line shows the suggested threshold, and the percentage of pockets passing this criterion is displayed. PLC, protein-ligand complexes.
To that end, we investigated the growth of PLC in the PDB over the years in terms of protein sequence and small molecule ligand name, using the 236 538 small molecule pockets across 75 065 PLC and 32 273 unique small-molecule ligands described in Section 2.1.
The choice of sequence and ligand name arises from the constraints of CAMEO, where the exact protein conformation and the pose of the ligand within the protein complex are unknown.
We assigned identifiers to each PLC PDB entry, consisting of the sequence cluster identifiers (at different identity thresholds) of each entity and the chemical component codes of the ligands present in the PLC. The distribution of sequence cluster and ligand combinations seen per year is shown in Figure 2, along with the fraction of PLC that pass the relaxed quality criteria from Section 2.1. For example, the four different bars for the 70%-90% cluster in the year 2022 represent, in order: (1) all PLC released in 2022 where every entity in the PLC has 70%-90% identity to every entity in a matching PLC from a previous year but the ligands are not all the same, (2) the same as (1) but only the PLC passing the relaxed quality criteria from Section 2.1, (3) all PLC released in 2022 where every entity has 70%-90% identity to every entity in a matching PLC from a previous year and the ligands are all the same, and (4) the same as (3) but only the PLC passing the relaxed quality criteria from Section 2.1.
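As an illustration, this identifier-based comparison can be sketched as follows. The attribute names on the `plc` records are hypothetical, and the cluster IDs are assumed to come from the MMseqs2 clustering described in Section 3.2:

```python
def plc_identifier(cluster_ids, ligand_codes=None):
    """Order-independent PLC identifier built from the sequence cluster IDs
    of the polymer entities and, optionally, the ligand component codes."""
    parts = sorted(map(str, cluster_ids))
    if ligand_codes is not None:
        parts += sorted(set(ligand_codes))
    return "|".join(parts)

def novelty_category(plc, previous_plcs):
    """Classify a PLC against PLC released in previous years, at the
    identity threshold used to compute the cluster IDs."""
    prev_protein = {plc_identifier(p.cluster_ids) for p in previous_plcs}
    prev_full = {plc_identifier(p.cluster_ids, p.ligand_codes) for p in previous_plcs}
    if plc_identifier(plc.cluster_ids, plc.ligand_codes) in prev_full:
        return "same proteins, same ligands"       # fully redundant
    if plc_identifier(plc.cluster_ids) in prev_protein:
        return "same proteins, different ligands"  # may still be interesting
    return "no matching PLC at this threshold"
```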
We see that, from the protein perspective, 78.85% of PLC (and 71.83% of high-quality PLC) released in 2022 have at least 30% sequence identity to a matching PLC from previous years (across all entities). However, most of these (79.14%) still have different combinations of ligands, indicating that they may still be interesting to assess for PLC prediction. Every year, some PLC that are redundant from both perspectives are also released, in the range of 10%-20% of structures per year, of which more than half are highly redundant (90%-100% sequence identity and the same ligands).
We provide the entire dataset of over a million small molecule and ion pockets, with information about the structure quality, whether they pass our validation criteria, and the sequence cluster identifiers of the components at different identity thresholds, on Zenodo (https://doi.org/10.5281/zenodo.8348280). This dataset can be used by method developers to make more rational train-test splits, ensure high-quality data for training models, and evaluate their methods with more representative data.
The PDBBind time-split test-set also suffers from a high degree of redundancy, with 62% of the test-set proteins having >90% sequence identity to other test-set proteins and 59% having >90% identity to proteins in the corresponding PDBBind time-split training-set. This indicates that this set cannot accurately represent protein-ligand space, even if all the ligands were chemically dissimilar, which is not the case. In addition, 108 protein-ligand pairs have peptide and oligosaccharide ligands, which are not ideal as most docking tools are not calibrated for these types of ligands.29

FIGURE 2 Protein-ligand complexes (PLC) released per year (in brown and orange) and those passing the relaxed quality criteria (in green and blue), divided according to sequence identity to PLC seen in previous years. The left two bars of each year (in brown and green) are PLC with ligand combinations which differ from previous PLC, and the right two bars (in orange and blue) are PLC containing the same set of ligands as a matching PLC at that sequence identity.
To better explore the effect of crystal structure quality and protein and ligand diversity on PLC prediction results, we extracted a high-quality representative benchmark dataset (hereafter referred to as the HQR dataset) of 371 small-molecule PLC released after 2019 which (1) have <30% sequence identity to the PDBBind time-split training-set, (2) pass the relaxed validation criteria from Section 2.1, and (3) have unique sequence cluster identifiers at 30% sequence identity. From the small molecule side, 182 of the 265 unique ligands in this set are distinct from those in the PDBBind time-split training-set. Despite having a similar size to the PDBBind time-split test-set, and a similar composition of proteins (259 monomeric and 112 dimeric proteins with a single ligand), this dataset covers more protein-ligand space, has less redundancy with respect to the training set of common DL tools, and has higher-quality experimental structures to use as reference. The HQR dataset is available on Zenodo (https://doi.org/10.5281/zenodo.8348280).

In this section, we assessed the novelty and diversity of PLC in the PDB. We showed that the PDBBind time-split test-set is inappropriate for evaluation due to its high degree of redundancy with the training set of recent DL-based methods, and we provide an alternative high-quality representative dataset to fill this gap. In addition, we provide a dataset of over a million PLC pockets in the PDB, annotated by structure quality and sequence redundancy. Given that the relevance of a PLC depends on a variety of factors and the task at hand, we hope that these datasets will be useful for the community in representative PLC benchmarking and in implementing relevant filters for PLC prediction and evaluation.

2.3 | Can we automatically score predicted protein-ligand complexes?
The joint prediction of protein-ligand complexes is generally divided into three steps: (1) predicting the protein structure, (2) predicting the location of the binding pocket within the structure, and (3) docking the ligand within the structure. Some docking tools do not explicitly require step 2, and instead use the entire protein as the search space to dock the ligand, termed "Blind docking." Many previous efforts on docking assessment have focused entirely on steps 2 and 3, with the structure being an experimentally solved structure already in the correct ligand-binding conformation, termed "Redocking." The CASP15-PLI experiment, however, involved the prediction of all three steps, and thus called for the development of novel scoring metrics to assess this joint prediction. These scores were BiSyRMSD (referred to as RMSD) and lDDT-PLI, with the former emphasizing the correct positioning of the ligand within the predicted binding pocket and the latter emphasizing the correctness of predicted interactions between protein and ligand atoms. These scores are further described in the CASP15-PLI assessment paper.9

To establish automated assessment of such PLC prediction, we developed an automated benchmarking workflow consisting of two components: (1) preprocessing, input preparation, set-up, and running of five PLC prediction tools (AutoDock Vina,29,30 SMINA,31 GNINA,32 DiffDock,27 and TankBind26) with different input parameters, and (2) assessment of the PLC prediction results using different scoring metrics. The workflow allows for using predicted models as input and has a pocket detection step to differentiate between blind docking and pocket-based docking. The workflow is implemented in Nextflow33 to enable efficient parallelization and distributed execution, making it well suited for handling large datasets and computationally intensive tasks. Each process is encapsulated in a module, with dependency management controlled using Conda34 or Singularity.35 The resources for each step in the workflow are defined individually, ensuring that only the required resources are reserved and that failed processes are automatically restarted with increased resources. Upon completion, all the predicted binding poses are collected and a summary of scores is created, along with a report on resource usage across the evaluated tools.
We ran this workflow using the PDBBind time-split test-set of 363 protein-ligand pockets, and our similarly sized HQR set created as described in Section 2.2. As the two most recent DL tools in our set, TankBind and DiffDock, are trained on the PDBBind time-split training-set, these are fair sets to use for their evaluation at the current time, also allowing for the comparison of benchmark sets created using different principles. However, it is important to emphasize that the aim of this experiment is to demonstrate the feasibility of an automated benchmarking workflow, not a comprehensive evaluation of docking tools, due to the small size of both datasets and of the set of tools being benchmarked, the issues in the PDBBind time-split test-set already discussed in the previous sections, and the use of lower-quality, highly redundant structures in the training of the DL tools, which may mask their true capability of learning relevant protein-ligand interaction patterns.
For both datasets, we also evaluated PLC prediction results on AlphaFold36 predicted structures of the monomeric proteins: 256 for the PDBBind time-split test set and 229 for the HQR test set; 77% (197/256) and 68% (156/229) of the AlphaFold models are within 2 Å RMSD of the crystal structure, respectively.

To demonstrate the workflow in different input settings, we use P2Rank37 to detect pockets in each protein and model in both test sets and report results in two scenarios: Blind docking, the worst-case scenario for docking tools where no indication is provided about the possible location of the ligand, and Best pocket docking, representing the best-case scenario where the correct binding pocket is known and used to define the docking search space. For the evaluation of Best pocket docking, the P2Rank pocket with the smallest distance from the true binding site center was considered the best pocket. P2Rank was able to predict the center of the correct binding pocket within 8 Å of the true binding site center, defined as the mean coordinate of the ligand in the pocket, for 89% (324/363) and 85% (290/340) of the receptors in the two test sets. For the AlphaFold modeled receptors, the percentages were 81% (206/256) and 85% (190/229), where the ground truth pocket is defined by structural superposition of the model with the reference structure. In addition, in over 70% of cases the top-ranked pocket from P2Rank is also the best pocket, with this percentage increasing to over 90% when considering the top three predicted pockets, irrespective of whether the structure is experimentally solved or predicted by AlphaFold. These results indicate that the pocket detection step of PLC prediction can be accomplished even on predicted models with acceptable accuracy.
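A minimal sketch of this best-pocket selection, assuming the pocket centers have already been parsed from P2Rank's output (the 8 Å cutoff and the ligand-center definition follow the text above):

```python
import numpy as np

def best_pocket(pocket_centers, ligand_coords, cutoff=8.0):
    """Return the index of the P2Rank pocket whose center is closest to the
    true binding site center (mean ligand coordinate), or None if no pocket
    center lies within the cutoff."""
    centers = np.asarray(pocket_centers)              # shape (p, 3)
    site_center = np.asarray(ligand_coords).mean(axis=0)
    dists = np.linalg.norm(centers - site_center, axis=1)
    i = int(dists.argmin())
    return i if dists[i] <= cutoff else None
```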
Tables 1 and 2 display the outcomes for PLC prediction using crystal structures (redocking) and AlphaFold modeled receptors, respectively; the full results for the experimentally solved and AlphaFold modeled receptors are available on Zenodo (https://doi.org/10.5281/zenodo.8348280). The highest ranked pose (top-1) and the best scored pose out of the top-5 ranked poses (where the ranking is an output of each tool) are assessed for Blind docking, where the entire protein is employed to define the search box. Furthermore, for all tools except DiffDock, where this option is not present, the same assessment is carried out using the best pocket to define the search box. Figure 3 depicts the distributions of these scores for the top-1 and best out of top-5 poses for experimental and modeled receptors in both docking scenarios.

Overall from Table 1, as expected, the results for Best pocket docking are better than for Blind docking, as the search space is restricted. However, while the Vina-based tools have similar performance in Blind docking irrespective of the test set used, the performance of the DL-based tools drops massively in both modes when moving from the PDBBind time-split test-set to the HQR set. This indicates overfitting on the training set and a lack of generalizability across PLC space, further emphasizing the need for redundancy filtering in both training and evaluation. DiffDock only offers a Blind docking mode, as also reported by the authors along with the suggestion to use DiffDock as a ligand-specific pocket detector.27 While it performs well compared to the other tools in this mode, it is easily outperformed by simple pocket detection with P2Rank, which improves the success rate of the pocket-based tools by a factor of 2-3. In Best pocket docking, the Vina-based tools actually achieve higher prediction performance on the HQR set than on the PDBBind time-split test-set. We hypothesize that this is due to the presence of lower-quality structures in the PDBBind test-set, which unfairly penalizes predictions based on a flawed ground truth. This is further supported by the fact that the median RMSD for Blind docking on this set is worse for GNINA than for TankBind, indicating more "severe" failures which drive up the RMSD, an unbounded metric. In contrast, lDDT-PLI is bounded, with all contacts beyond the thresholds used assigned a score of 0, so it is less affected by very bad predictions. In addition, lDDT-PLI does not penalize parts of the ligand which float in areas not in contact with the protein.
All the tools show a significant performance decrease when using AlphaFold models as input (Table 2, Figure 3). This is especially striking for Best pocket docking, where despite the AlphaFold models having quite good lDDT of pocket residues (lDDT-LP; mean 0.94 ± 0.08), docking performance is poor. Exact positioning of the side chains and conformations of all residues seems to be crucial for obtaining the right ligand pose with physics-based docking tools, as seen in Figure 4, where the backbone RMSD of the AlphaFold model is 3.56 Å and it is clear that a rearrangement has pushed a helix into the binding pocket, preventing the correct ligand pose from being found. This trend is not as striking for the DL tool DiffDock, as its training relies less on side-chain atoms, although its performance is still lower than on crystal structures.
This benchmarking effort demonstrates that both structure quality and redundancy have a major impact on the assessment of docking tools. However, it is still clear that AlphaFold models are not necessarily suitable for ligand docking.20,39,40 Ultimately, structure prediction methods which are aware of the presence of ligands in a complex may be better at learning to predict the conformation appropriate for binding a specific ligand. We anticipate that such approaches will start to appear in future CASP-PLI and CAMEO experiments, and we encourage the use of these PLC-specific scoring metrics by the community.

2.4 | Caveats and future steps
The concepts in previous sections can be incorporated into the CAMEO benchmarking workflow, future representative PLC prediction assessment initiatives, and in method development.However, despite our best efforts, this study also has several limitations that could not be tackled here.
First, it is not always possible or desirable to filter for high-quality benchmarking targets. For instance, in the case of CAMEO, validation information is not available by the time targets are pre-released, so it is impossible to filter out low-quality targets before submission, and a near-50% filter rate after submission, as predicted by the criteria defined in this study (Section 2.1), would mean discarding many valuable predictions. In addition, there may be protein or ligand classes for which only low-quality structures are available. We are considering several alternatives, such as defining even more relaxed criteria that would exclude only exceptionally poor structures, or taking quality into account as a weighting factor for scoring. Ideally, atom-level weighting would be used, especially for larger ligands that can display variable levels of quality within the residue itself. Unfortunately, the PDB does not make atom-level quality information available in the validation reports at the time of writing, and the only information that would be available is the occupancy numbers, which are part of the structural data. It should also be noted that the validation criteria we describe in this paper are only available for x-ray structures. Validation reports for electron microscopy structures now include per-residue Q-scores,41 but how to incorporate them into the criteria remains an open question that will need to be tackled in future projects.

Second, we assessed the novelty and diversity of targets using only sequence clustering. Sequence identity is frequently used as a proxy for difficulty and novelty, but it is not perfect, and PLC with different binding modes are possible even in cases of high sequence identity. Similarly, there is no guarantee that complexes with sequence identities lower than 30%, as we defined in this paper, differ significantly in structure. A solution to this problem would be to consider structural similarity itself to measure novelty. This would work well for a representative benchmark set, but would require significant effort to distinguish spurious clusters from evolutionarily related ones and to distinguish binding pocket similarity from global structure similarity, which we will investigate in a future publication. In the case of CAMEO, however, structural information is not available by the time benchmarking targets are selected and submitted, and sequence clustering is the only option available to us. Similar considerations extend to the 3D similarity of the ligand pose; thus the "difficulty" of a PLC remains ill-defined.
Next, we encountered several hurdles when implementing and running the pipeline. Failures are automatically identified, reported, and isolated, allowing the workflow to proceed with the remaining predictions. However, we must ensure that every tool is given a fair chance, and we tried hard to minimize the number of predictions that would fail. In terms of reporting scores, three options are available: (1) assign a low score (high or infinite RMSD and an lDDT-PLI of 0) to missing predictions; (2) report results on the common subset, that is, the targets for which predictions are available for all methods; or (3) report average scores ignoring the missing predictions. Approach 1 is difficult with unbounded scores like RMSD, where an arbitrary RMSD for missing predictions would have a large impact on the averages. Approach 2 might hide the underperformance of some tools on certain classes of targets, for instance if a tool systematically produces no result rather than poor predictions; in addition, when comparing a large number of tools, it might dramatically reduce the size of the dataset. In this analysis, because there were only very few, non-systematic failures, we decided to follow approach 3 and simply report the number of successful predictions alongside the scores.
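For illustration, approach 3 reduces to averaging over the successful predictions only, while keeping the success count visible; this is a sketch, with `None` marking a failed prediction:

```python
def summarize_scores(scores_per_tool):
    """scores_per_tool: dict mapping tool name -> list of scores, with None
    for targets where the tool failed. Returns mean over successes and n."""
    summary = {}
    for tool, values in scores_per_tool.items():
        ok = [v for v in values if v is not None]
        summary[tool] = {"mean": sum(ok) / len(ok), "n": len(ok)}
    return summary

# Example: lDDT-PLI values for three targets, with one DiffDock failure.
print(summarize_scores({"GNINA": [0.6, 0.3, 0.5], "DiffDock": [0.7, None, 0.4]}))
```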
Perhaps the most challenging step was the ligand preparation. All the tools we benchmarked take as input not a SMILES string but an SDF file containing a hydrogenated ligand annotated with the correct charges. This conversion turned out to be more challenging than expected, and currently relies on custom charge annotation and hydrogenation functions which only work on certain moieties. In several instances, RDKit was unable to generate a sanitized molecule. We are still unsure whether the problem lies with the PDB reporting SMILES strings of invalid molecules or with RDKit being overly pedantic, but in the end 14 ligands could not be converted. Iron in particular is present in a huge number of PLC but cannot be handled automatically by RDKit. In addition, a number of docking tools are unable to handle many atom types. Similarly, proteins with modified residues are not automatically handled by protein structure prediction methods, necessitating the use of canonical sequences for AlphaFold prediction. Most benchmarking efforts, including the PDBBind time-split test set used here, exclude such problematic cases from consideration. In the HQR dataset of 371 PLC, we encountered 31 cases with unsupported ligand atom types and 18 cases of modified protein residues, indicating that such cases form a significant fraction of PLC and are encountered quite often in CAMEO. In 8 cases, AlphaFold models could not be produced due to HHblits errors in the MSA creation step. We believe that a benchmark set should contain such challenging data in order to encourage the community to improve tool support. A report of all the failures encountered while running the benchmarking pipeline is available on Zenodo (https://doi.org/10.5281/zenodo.8348280), to help identify common failure modes that still need to be addressed.

FIGURE 4 An example of GNINA docking results on the Hsp90 receptor in complex with ligand 9J0 (PDB ID: 5ZR3). The crystal structure of the receptor is shown in purple with the ground truth ligand in green. The AlphaFold model is shown in gray. The GNINA docked conformation using the AlphaFold model as input is shown in white, and the docked conformation using the crystal structure as input in orange.
One thing to note is that we only benchmarked the protein structure prediction aspect of PLC prediction using AlphaFold with default parameters as a baseline, due to its overall better performance at protein structure prediction, as seen in CASP14.36 As we also see in our results, this did not translate well to PLC prediction. Despite AlphaFold having been shown to prefer predicting the holo form,21 homology modeling was recently shown to be better at the PLC prediction task,42 though this is limited to proteins with close templates. We speculate that flexible docking, or novel methods that perform joint prediction of protein-ligand complexes from sequence and SMILES, would be better suited to this task.
Finally, while typical docking benchmarking efforts, including those in this study, only contain PLC with a single ligand, many PLC in the PDB contain more than one ligand, typically a mixture of metal ions and small organic or sometimes inorganic molecules; such PLC are part of the CAMEO and CASP15-PLI datasets. Most of the tools benchmarked here can only dock a single ligand at a time and cannot take advantage of information about the other ligands present in the complex, which can result in clashes and generally suboptimal results. In this study, the scoring focused only on the accuracy of the individual ligand docking predictions. Future work will also need to take molecular validity and physico-chemical consistency into account.

3 | METHODS

3.1 | PLC validation criteria

PLC were obtained from the PDB, release 2023-03-15. The PDB Chemical Component Dictionary43 was downloaded on March 17, 2023. X-ray validation information was extracted from the XML files provided by the PDB for entries where Electron Density Server information is available44-46 and which contain at least one protein chain (polymer entity) and at least one non-polymer entity (small molecule ligand or ion). Ligands present in the BioLiP artifact list were excluded.47 This list contains 463 frequent crystallization artifacts such as solvents and buffers. It may also filter out a few biologically relevant ligands, but this is rare and we considered the trade-off acceptable for this study. Additional information, including the entry ID to polymer entity ID mapping, release date, and polymer composition for each entry, as well as the canonical one-letter code sequence for each entity in the dataset, was retrieved with the GraphQL-based API of the RCSB PDB Web Services48 on March 28, 2023. Thirty-seven entries marked as obsolete in the API results were discarded.
PDB entries for which the "polymer composition" was one of "DNA," "RNA," "DNA/RNA," "NA-hybrid," "other type pair," "NA/oligosaccharide," or "other type composition," as well as any remaining entries containing DNA or RNA polymers, were ignored. Binding pockets were defined as the set of amino acid residues in the reference structure with at least one heavy atom within a 6 Å radius of any heavy ligand atom.
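A sketch of this pocket definition using Biopython's NeighborSearch; the selection logic is simplified (e.g., no handling of multiple models or altlocs), and the ligand is identified by its chemical component code:

```python
from Bio.PDB import PDBParser, NeighborSearch

def binding_pocket(pdb_file, ligand_resname, radius=6.0):
    """Residues with at least one heavy atom within `radius` A of any
    heavy atom of the ligand with component code `ligand_resname`."""
    structure = PDBParser(QUIET=True).get_structure("plc", pdb_file)
    heavy = [a for a in structure.get_atoms() if a.element != "H"]
    ligand_atoms = [a for a in heavy if a.get_parent().get_resname() == ligand_resname]
    other_atoms = [a for a in heavy if a not in ligand_atoms]
    search = NeighborSearch(other_atoms)
    pocket = set()
    for latom in ligand_atoms:
        for atom in search.search(latom.coord, radius):
            res = atom.get_parent()
            if res.id[0] == " ":  # keep standard (non-hetero) residues only
                pocket.add((res.get_parent().id, res.id[1]))
    return sorted(pocket)
```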
The filtering thresholds for the Iridium criteria were extracted from the original manuscript.17 The suggestion to filter PLC where atoms from crystal packing are within 6 Å of any ligand atom was not used, as this information could not easily be extracted from the PDB validation report.

3.2 | PLC clustering and novelty assessment
For PLC clustering, the set of PLC described in Section 3.1 was used.
PLC were grouped together based on the cluster identifiers of all their unique polymer entities and the chemical component three-letter codes of the ligands (i.e., identical ligands) they contained. Polymer entity cluster identifiers were obtained by performing sequence-based clustering of all polymer entities in the dataset with the cluster module of the MMseqs2 software (version 13.45111).49 Six different sequence-based clusterings were obtained by clustering with minimum sequence identity thresholds of 100%, 95%, 90%, 70%, 50%, and 30%, respectively. For the sequence alignment, a coverage threshold of 90% of both the query and target sequences was used (-c 0.9, --cov-mode 0). The sensitivity of the prefiltering was set to 8.0 (-s 8.0). Clustering was performed with the connected component algorithm (--cluster-mode 1), with the --cluster-reassign option to reassign cluster members to other clusters if they no longer fulfill the clustering criteria after each iteration. Each PLC entry in the dataset was subsequently given an identifying string consisting of the cluster IDs of its entities and, optionally, the three-letter codes of the unique ligands present in the PLC.
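The corresponding invocation can be reconstructed as below (shown via Python's subprocess for consistency with the other sketches); `entities.fasta` and the output names are placeholders, and `easy-cluster` is used here as the convenience wrapper around the cluster module:

```python
import subprocess

# One clustering run per minimum sequence identity threshold.
for min_seq_id in (1.0, 0.95, 0.9, 0.7, 0.5, 0.3):
    subprocess.run(
        [
            "mmseqs", "easy-cluster", "entities.fasta",
            f"clusters_{int(min_seq_id * 100)}", "tmp",
            "--min-seq-id", str(min_seq_id),
            "-c", "0.9", "--cov-mode", "0",   # 90% bidirectional coverage
            "-s", "8.0",                       # prefilter sensitivity
            "--cluster-mode", "1",             # connected component clustering
            "--cluster-reassign",
        ],
        check=True,
    )
```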
The redundancy of a given PLC with respect to a different set of PLC, at a given minimum sequence identity threshold, was assessed by comparing its PLC identifier to the set of all PLC identifiers of the other set.

3.3 | Molecule preparation
Each ligand was prepared starting from its SMILES string. Ligands were first standardized by neutralizing the charges, then readjusted for pH 7 using protonation rules, and explicit hydrogen atoms were added. The 3D conformation was generated using the ETKDG method from RDKit50 and stored in SDF format. The receptor protein was hydrogenated using reduce (version 4.13) with the -FLIP option.51 For the docking tools in the AutoDock family, the Python package Meeko (v0.4.0) was used to generate the PDBQT input files.52
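A minimal sketch of the RDKit part of this preparation; the charge neutralization and pH-based protonation rules are tool-specific and represented here only by a comment:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def prepare_ligand(smiles: str, out_sdf: str) -> None:
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError("RDKit could not parse/sanitize the SMILES")
    # (Charge neutralization and pH 7 protonation rules would go here.)
    mol = Chem.AddHs(mol)              # add explicit hydrogens
    params = AllChem.ETKDGv3()         # ETKDG conformer generation
    params.randomSeed = 42             # for reproducibility
    if AllChem.EmbedMolecule(mol, params) != 0:
        raise RuntimeError("3D embedding failed")
    with Chem.SDWriter(out_sdf) as writer:
        writer.write(mol)
```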

3.4 | PLC prediction tools
The predictions were run with the default parameters given by the tools unless stated otherwise below. (1) AutoDock Vina29,30 was run within a Conda environment containing the required Python bindings; Meeko v0.4.0 was used to transform the PDBQT output file into an SDF file, to be used by the evaluation tools. (2) SMINA v2020.12.10 (based on AutoDock Vina 1.1.2)31 was run using a Singularity image downloaded from https://hub.docker.com/r/zengxinzhy/smina (tag: 1.0) with exhaustiveness set to 64. (3) GNINA v1.0.332 was run using a Singularity image downloaded from https://hub.docker.com/r/gnina/gnina/tags (tag: 1.0.3) with exhaustiveness set to 64. (4) TANKBind26 input preparation and inference were run according to the code provided at https://github.com/luwei0917/TankBind, using a Singularity image for the dependencies downloaded from https://hub.docker.com/r/qizhipei/tankbind_py38. (5) DiffDock27 inference was run using --samples_per_complex 40 --batch_size 10 --actual_steps 18 --no_final_step_noise within a Conda environment built according to the setup guide (master:2c7d438, built Mar 13, 2023).

Each tool except DiffDock allows for the definition of a pocket center and grid size, within which the search space for ligand conformations is restricted. To assess predictions for different pockets, P2Rank37 (v2.4) was used to predict and rank multiple binding pockets, with default parameters for experimental structures and the -c alphafold option for AlphaFold predicted models. The box in which AutoDock Vina, GNINA, and SMINA search for binding poses was constructed around each predicted P2Rank pocket center, with the diameter of the search box set to the diameter of the ligand conformer generated by RDKit plus an additional 10 Å on all six sides. Thus, for each tool, (p + 1) × n predicted ligand poses were obtained as output, where p is the number of pockets predicted by P2Rank and n is the number of poses returned by the tool.
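The search-box construction described above can be sketched as follows; the exact box format expected by each tool differs, so this returns just the box center and edge lengths:

```python
from itertools import combinations
import numpy as np

def search_box(pocket_center, ligand_coords, padding=10.0):
    """Box centered on a P2Rank pocket center, with edge length equal to
    the diameter of the RDKit ligand conformer plus `padding` A on each
    of the six sides."""
    coords = np.asarray(ligand_coords)
    diameter = max(np.linalg.norm(a - b) for a, b in combinations(coords, 2))
    edge = diameter + 2 * padding
    return np.asarray(pocket_center), (edge, edge, edge)
```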

3.5 | Scoring
BiSyRMSD (shortened to RMSD throughout this article) and lDDT-PLI scores were calculated with OpenStructure version 2.6.053 with default parameters. The methods are identical to those described in the CASP15-PLI assessment paper.9 The lDDT-LP (lDDT-Ligand Pocket) is the lDDT score of the ligand binding site residues, with an lDDT atom distance inclusion radius of 10 Å. Every ligand was scored separately, and a summary CSV file containing scores for each ligand pose, pocket, and blind docking run is generated.
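In the workflow, the scoring step amounts to calling OpenStructure; a hedged sketch is shown below. The `compare-ligand-structures` action exists in OpenStructure 2.x, but the exact flag names used here are assumptions and should be checked against the version 2.6.0 documentation:

```python
import json
import subprocess

# Hypothetical invocation: score a model with ligands against the reference.
subprocess.run(
    [
        "ost", "compare-ligand-structures",
        "--model", "model_with_ligands.pdb",
        "--reference", "reference.cif",
        "--lddt-pli", "--rmsd",            # request both metrics (assumed flags)
        "--output", "scores.json",
    ],
    check=True,
)
with open("scores.json") as fh:
    scores = json.load(fh)  # per-ligand lDDT-PLI and (BiSy)RMSD values
```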

3.6 | Datasets

We used two datasets to demonstrate the automated benchmarking workflow: the 363 PLC in the PDBBind time-split test-set25 that were not used as training data by TANKBind and DiffDock, and our own representative dataset (the HQR set, Section 2.2) created by filtering PLC such that (1) no entity in the PLC has >30% sequence identity to any entity in the PDBBind time-split training-set, (2) the PLC passes the relaxed validation criteria, (3) each entity in the PLC has over 30 residues, (4) the PLC does not contain a polynucleotide, and (5) the PLC has only one relevant small molecule ligand.

4 | CONCLUSION
With the combined prediction of protein-ligand complexes forming the next frontier in computational structural biology, we need approaches for independent, comprehensive, and blind assessment of prediction methods, to better assess the advantages and shortcomings of classical and novel approaches. Two complementary approaches can be employed for this purpose: weekly continuous evaluation of structures released in the PDB, and the creation of a representative, diverse dataset for benchmarking.
In this study, we examined three challenges essential for establishing such systems in an automated and unsupervised manner: determining whether an experimentally solved PLC can be used as ground truth, assessing the interest or difficulty of a PLC as a prediction target, and automating the scoring of predicted PLC. In the process, we defined quality criteria for PLC pockets, assessed novelty in the PDB over the years, and developed an automated workflow for PLC prediction and assessment using newly developed scoring metrics. Comparing PLC prediction results on datasets with varying redundancy and quality revealed major differences in the predictive performance of docking tools, and showed that AlphaFold, as a structure prediction method, does not seem suitable on its own for producing receptor structures for docking methods.
TABLE 1 Redocking small molecules in experimentally solved protein structures from the PDBBind time-split test-set (first panel) and the HQR dataset (second panel), with Blind docking and Best pocket docking.
FIGURE 3 Distribution of the scores shown in Tables 1 and 2 for (A) top-1 pose RMSD, (B) top-1 pose lDDT-PLI, (C) top-5 pose RMSD, and (D) top-5 pose lDDT-PLI. The lines and the black dots in the bars represent the median and the mean, respectively.



TABLE 2 Prediction of small molecule binding to AlphaFold predicted structures for monomeric proteins in the PDBBind time-split test-set (first panel) and the HQR dataset (second panel).
Note: The number of PLC for which the pipeline successfully completed (n), the success rate (SR), defined as the percentage of predictions with RMSD < 2 Å, the median RMSD, the mean lDDT-PLI, and the standard deviation of lDDT-PLI are shown. DiffDock does not use a pocket definition. TANKBind gives only one prediction per search box. Abbreviations: HQR, high-quality representative; PLC, protein-ligand complexes.