Decrypting protein surfaces by combining evolution, geometry, and molecular docking

Abstract The growing body of experimental and computational data describing how proteins interact with each other has emphasized the multiplicity of protein interactions and the complexity underlying protein surface usage and deformability. In this work, we propose new concepts and methods toward deciphering such complexity. We introduce the notion of interacting region to account for the multiple usage of a protein's surface residues by several partners and for the variability of protein interfaces coming from molecular flexibility. We predict interacting patches by crossing evolutionary, physicochemical and geometrical properties of the protein surface with information coming from complete cross‐docking (CC‐D) simulations. We show that our predictions match well interacting regions and that the different sources of information are complementary. We further propose an indicator of whether a protein has a few or many partners. Our prediction strategies are implemented in the dynJET2 algorithm and assessed on a new dataset of 262 protein on which we performed CC‐D. The code and the data are available at: http://www.lcqb.upmc.fr/dynJET2/.


| INTRODUCTION
Proteins are main actors in biological processes and a detailed description of their interactions is expected to provide direct information on these processes and on the way to interfere with them. 1 Our knowledge of protein-protein interaction (PPI) networks 2 is largely incomplete, since the experimental assessment of all possible interactions of a protein is very challenging. 3,4 To overcome this limitation, recent efforts have been invested in the integration of direct and indirect experimental evidence and of computational predictions to better describe PPIs at the genome scale [5][6][7][8][9][10][11]40]. These efforts have revealed the complexity and multiplicity of PPIs. A protein may interact with several partners at the same time-each partner binding to a different site at its surface, or its surface may present a shared binding region that will be used by different partners at different moments of its lifetime. It is estimated that as much as 75% of the surface could potentially be used for PPIs. 12 In this context, there is a need for the development of tools able to decrypt protein surfaces at the residue level. A comprehensive description of protein surfaces and a precise identification of the residues involved in interactions are mandatory to identify cellular partners at large scale 11 and design drugs modulating PPIs. 13 Moreover, characterizing protein surfaces' properties may inform us on the number of partners a protein may have, and thus on the role of that protein in the cell.
Evolutionary, physicochemical, and geometrical properties have been shown to be relevant to PPIs, [14][15][16][17][18][19][20][21][22][23][24][25] and, based on them, in the past 20 years, a number of tools have been developed to predict interacting surfaces 24,[26][27][28][29][30][31][32][33] (see 25,34 for surveys). Although some of these tools achieve very high accuracy against subsets of known experimental binding sites, their predictions are generally much smaller than the expected interacting surface size. 12 Moreover, many tools do not propose sites but rather evaluate the probability of a residue to be involved in interactions. An orthogonal approach consists in exploiting molecular docking calculations. Docking methods were originally designed to predict the structure of a complex starting from the known structures of its components. Candidate conformations are evaluated based on properties reflecting the strength of the association, for example, shape complementarity, electrostatics, desolvation, and conformational entropy. By deriving statistics from the generated conformational ensemble, one can estimate the propensity of each protein surface residue to be found at a docked interface and use these propensities to identify binding sites. 35 This has been realized in single docking studies, [36][37][38][39][40] where two proteins known to interact are docked to each other, in arbitrary docking studies, 41 where proteins from a benchmark set are docked to arbitrarily chosen proteins, and in complete cross-docking (CC-D) studies, 11,[42][43][44][45] where all vs all docking is realized on a given dataset.
In the present study, we combine these different types of information to decipher the complexity of protein surfaces and give clues about the many interactions a protein may have (Figure 1). Given a protein, we predict patches at its surface based on some intrinsic properties of that surface and on properties inferred from the behavior of the protein with respect to others in docking calculations. We F I G U R E 1 Schematic representation of our workflow. We consider four residue-based properties (left panel), namely evolutionary conservation, amino acid propensities to be found at an interface, local geometry, and propensities to be found in docked interfaces. We predict interacting patches at the surface of proteins by using four different strategies: SC cons , SC notLig , and SC geom combines the first three properties, while SC dock relies exclusively on the fourth property. We compare the predicted patches with a set of experimentally determined functional interacting regions. We analyze and cluster the predicted patches' seeds, from which they were grown, to precisely localize interacting regions and infer the number of partners used by each region [Color figure can be viewed at wileyonlinelibrary.com] assess our predictions against a new set of experimentally known functional interfaces, detected at the surface of 262 proteins and of their close homologs. We demonstrate that considering only one single complex for a given protein leads to underestimate the proportion of its surface involved in functional interactions and to the incorrect assessment of protein interface prediction algorithms. To cope with this issue, we introduce the new concept of interacting region (IR) as a protein surface region used by one or several partners. IRs are defined by merging overlapping interacting sites (IS) extracted from different protein complex structures. We show that our predictions better match IRs compared to ISs and capture the interface variability induced by molecular flexibility. Our approach includes sequencebased analysis, which allows the detection of signals even when the interface is "hidden." Interestingly, we highlight a few cases where docking enables unveiling interfaces that could not be detected otherwise. We further exploit the way in which our predicted patches are grown, starting from a seed that is progressively extended. Specifically, we demonstrate that predicted patches' seeds can be used to localize IRs with high precision and to determine whether a protein has a few or many partners.
We provide sets of experimentally known interaction sites and regions and CC-D results for our dataset of 262 proteins, along with a computational tool, called dynJET 2 , for predicting interacting patches based either on protein sequence and structure analysis or on any pre-computed residue based property. All data and implemented code are available at: http://www.lcqb.upmc.fr/dynJET2/.  (Table S1). This indicates a large variation of protein size inside the dataset (21 residues for the smallest protein vs 789 residues for the largest one). Based on the information recovered from the PDB, the proteins were manually classified, following and extending the classification proposed in Reference 46. We defined 11 functional classes: 16 bound antibodies (AB), 25 complex subunits (C), 60 enzymes (E), 10 enzyme regulators (ER), 9 G proteins (G), 6 antigens from the immune system (I), 23 receptors (R), 24 structural proteins (S), 16 substrates/inhibitors (SI), 7 transcription factors (TF) and 66 proteins with other function (O).

| Protein interfaces: PPI-262 and PPI-262 ext
We defined two datasets of experimental protein-protein interfaces, namely PPI-262 and PPI-262 ext ( Figure S1). Both datasets comprise only interfaces buried within "biological units" or "biological assemblies," as annotated by the authors of the PDB structure or by PISA software. 47 This ensures that the interfaces we consider carry a biological meaning. PPI-262 comprises 329 ISs (see definition below, in subsection Surfaces and interfaces) extracted from the PDB files associated to P-262 and PPI-262 ext comprises 370 IRs (see definition below, in subsection Surfaces and interfaces) defined from PDB files of close homologs of the proteins from P-262.
To construct PPI-262 ext (see Figure S1), we first searched for homologs of the 262 proteins from P-262 in the PDB. We downloaded the pre-computed set of PDB structures clustered at 90% sequence identity from ftp://resources.rcsb.org/sequence/clusters/.
This set was determined using BLASTClust with the arguments -p T -b T -S 90. We then filtered out structures with a resolution poorer than 5 Å resolution. 23 642 functional ISs were detected on these structures and were then mapped onto the query proteins from P-262 by performing global pairwise sequence alignment (using the blosum62 matrix, with the Biopython package 48 ). ISs were then merged into IRs.

| Complete cross-docking
Given an ensemble of proteins, CC-D consists in docking each protein against all others in the dataset, including itself. CC-D was performed on P-262 using the MAXDo (Molecular Association via Cross Docking) rigid-body coarse-grained docking program. 42 Statistics were computed from the generated conformations (docking poses) to determine the propensity of each residue from each protein to be found in a docked interface. We define the interface propensity (IP) of residue i, belonging to protein P, as 11,42 : where N pos,P is the total number of docking poses considered for protein P and N int,P (i) is the number of docking poses where residue i lies in the interface. In order to limit the number of docked interfaces to reconstruct, which is the most time-consuming part of the analysis, we considered only the lowest energy docking poses (less than 2.7 kcal from the best-scored pose, as described in Reference 11). This led to 50 000 docking poses for each protein pair. Thus, for a given protein P, we considered about 50 000 × 262 = 13 100 000 docking poses.
Below, we shall use IP P (i) values to formally define a normalized interaction propensity score, called NIP, that dynJET 2 uses in order to predict interface sites.

| Residue-based properties
Four measures, T JET , PC, CV, and NIP, are used to evaluate single residues in a protein and to define scores for the prediction of protein interfaces.
T JET reflects the evolutionary conservation level of a residue, and is computed from phylogenetic trees constructed by using sequences, homologous to a query sequence and sampled by a Gibbs-like approach. 23 The Gibbs-like approach extracts N representative subsets of N sequences 23  Then, tree trace levels are averaged over the N trees to get statistically significant values, which we denote relative trace significances, or T JET , and which are calculated as follows 23 : where l t j is the tree trace level of residue r j in tree t, L t is the maximum level of t and M j is the number of trees where a nonzero tree trace level was computed for r j . T JET values vary in the interval [0,1] and represent averages, over all trees of residues, of evolutionary conservation levels.
PC indicates the physicochemical propensity specific to amino acids located at a protein interface. The original values, taken from, 52 range from 0 to 2.21 and are scaled here between 0 and 1 for the calculation of residue scores.
CV is the circular variance, a measure of the density of protein around a residue. Formally, the circular variance of a fixed point in 3D space is computed from the vectorial distribution of a set of neighboring points around it. 53 Specifically, the CV value of an atom i is expressed as: where n i is the number of atoms distant by less than r c Å from atom i and r ij ! is the coordinate vector from atom i to its neighbor j. If atom i is buried within the protein, then the resultant of the vectors toward its neighbors will be small and its CV value will be close to one. On the contrary, if the atom is located in a protruding region, the vectors toward its neighbors in the protein will share the same direction, their resultant will be high, and hence the CV value will be close to zero.
We compute the CV value of a residue a k as the average of the atomic CVs, over all atoms of a k : CV a k ð Þ= 1 where N ak is the number of atoms in a k . Compared to solvent accessibility, CV changes more smoothly from the surface to the interior of the protein, 54 and is thus less sensitive to small conformational changes. CV values are scaled between 0 (most protruding residues) and 1 (least protruding residues) for the calculation of residue scores.
NIP is the normalized form of the Interface Propensity score IP, defined in Equation (1), that reflects the propensity of a residue to be found at the interface. In order to compare IP scores among proteins, we normalize it, as done in Reference 11: a positive NIP value indicates that the residue i is favored to occur at potential binding sites, and a negative NIP value indicates that it is disfavoured. NIP is defined as: where <IP P ( j)> j 2 P and max(IP P [j]) j2P are the average IP and the maximum IP, respectively, computed over all the residues j in P. The NIP value represents how often a residue is docked on the retained conformations (ie, those conformations that have less than 2.7 kcal/mol energy difference from the best one, as explained above).
These four residue-based properties were previously shown to be useful for the prediction of protein interfaces. 11

| Surfaces and interfaces
A residue is considered to be at the surface of the protein if it displays at least 5% of relative accessible surface area (rasa), as computed by Naccess. 55 Experimental and predicted interfaces are exclusively comprised of surface residues.
Experimental interfaces are detected on known PDB complex structures, considering only biological assemblies. We define two types of interfaces, namely interacting sites (ISs) and interacting regions  Figure 2A). For instance, let us consider a ternary complex comprised of proteins P 1 , P 2, and P 3 . If both P 2 and P 3 bind to P 1 but are not in contact with each other, then we will define two single-partner ISs at the surface of P 1 (Figure 2A, middle panel). By contrast, if P 2 and P 3 are in contact with each other (less than 5 Å away), then we will define one multi-partner IS at the surface of A (Figure 2A, right panel).
IRs are defined by merging several ISs. Two ISs, namely IS 1 and IS 2 , of reasonable sizes (more than five residues), will be merged into an IR if their maximum overlap with respect to their respective sizes (max [overlap(IS 1 , IS 2 ), overlap(IS 2 , IS 1 )]) is greater than an arbitrarily chosen threshold of 60%. In practice, this threshold gave us the most realistic IRs given the experimental information. This merging criterion is relaxed when dealing with very small ISs. Namely, we request that an IS comprising at most five residues shares at least one residue in common with another IS to be merged with it. The merging procedure is iterated over all ISs for a given protein. To construct PPI-262 and PPI-262 ext , we retained only ISs and IRs of reasonable sizes, that is, comprising more than five residues.
Predicted interfaces are identified by the dynJET 2 software (www. lcqb.upmc.fr/dynJET2/). Given the sequence and the structure of a query protein, dynJET 2 predicts the location of potential protein binding sites on the protein surface. 24 dynJET 2 implements a clustering algorithm and scoring strategies specifically aimed at detecting the different layers of a protein interface, namely the support, the core and the rim. 57 These three layers are defined for known experimental interfaces by comparing their solvent accessibilities in the presence and absence of the partner. 24 Support residues are buried with and without the partner, core residues become buried upon binding to the partner and rim residues are exposed in the presence and absence of the partner. A threshold of 25% relative solvent accessibility is used to determine whether a residue is buried or not. To approximate these layers, our algorithm 24 first identifies a small cluster of highly scored residues, called the seed. Seeds closer than 5 Å are merged. Then, the detected seeds are progressively extended, and the resulting residue clusters are merged if they are in contact (< 5 Å away). Importantly, the way residues are picked up based on their scores (defined below) differs between seed and extension, such that the detected signal is very strong in the seed and progressively fades away as the extension is grown. Finally, an outer layer is added to form what we call a predicted patch. We used the iterative mode of dynJET 2 (i-dynJET 2 ) and considered a residue to be predicted as interacting if it was detected at least twice over 10 iterations (as done in Reference 24).
Four scoring schemes or strategies are implemented in dynJET 2 (compared to three in JET 224 ): F I G U R E 2 Examples and schema illustrating the notions of interacting site and interacting region. A, Schematic representation of single-and multiple-partner interacting sites. Three proteins are considered, namely P 1 , P 2 , and P 3  This scoring scheme is new in dynJET 2 .
To evaluate the performance of our predictions, we mainly relied on the F1-score, which is the harmonic mean of precision and recall. Predicted patches were compared to ISs from PPI-262 and IRs from PPI-262 ext .
The union of all predicted interface residues was also compared to the union of all experimental interface residues, for each protein.

| Seeds clustering
Seeds generated by dynJET 2 's different scoring schemes were collected. The SC cons seeds were discarded because almost half of their residues were shared with the SC notLig seeds ( Figure S3a), they were bigger than the other seeds ( Figure S3b) and we observed that they often extended over several other seeds. We considered that these characteristics would make SC cons seeds perform badly in locating different IRs. The atoms belonging to the seeds collected from SC notLig , SC geom , and SC dock were then classified by applying hierarchical clustering using the average linkage method. A threshold distance of 23 Å was used to define the clusters. This value yielded the best match between the number of clustered seeds and the number of IRs.

| Conformational variability of IRs
For each IR, the RMSD of its backbone atoms (or, if not possible, its C-α atoms) was computed between the query structure from P-262 and each of the homologous structures on which the IR was detected.
For each homologous structure, only the subset of residues detected on this structure were considered to compute the RMSD. RMSD values were then averaged over the homologous structures (including the query structure if the IR was also detected on it). This gives us a single RMSD value for each IR.   Figure 2B illustrate the complexity of the experimental interaction surfaces in our datasets. Binding sites may be disjoint, overlapping or included in others ( Figure 2B, on the left), and they may be defined by the interaction with another copy of the same protein, other proteins or peptides ( Figure 2B, on the right). The two examples show five ISs (three on the left and two on the right), which were merged into three distinguished IRs (two on the left and one on the right, contoured by thick forest green lines). In all those cases, the IRs result from the merging of ISs that represent binary interactions with different partners. In addition, IRs may also be defined from several ISs representing a single interaction, but whose binding mode slightly differs from one PDB structure to another (see below).

| Prediction of interacting patches
We predicted interacting patches at the surface of the proteins from P- and a Coulomb potential for electrostatics. 62 The four SCs are implemented in dynJET 2 , an upgraded version of the JET 2 method. 24

| Estimation of the protein surface involved in functional interactions
We used both experimental interfaces and predicted patches to esti-  Figure 3E). These three types of interfaces are highly variable, with standard deviations in the [24][25][26][27][28]% range.
Finally, patches predicted based on local geometry (SC geom ) are the smallest ( Figure 3I), representing 16% of the protein surface.

| Assessment of the predictions and contribution of each SC
The identification of a protein's set of interacting residues is important to understand the determinants of molecular association. For each protein, we compared the union of all predicted patches with the union of all ISs (respectively IRs) from PPI-262 (resp. PPI-262 ext ).
To do so, we relied on the F1-score, which reflects the balance between precision (or positive predictive value) and recall To better characterize the contribution of docking-based information, we compared the predictive performance of SC cons , SC notLig , SC geom , either considered individually or altogether, with that of SC dock ( Figure 5B). We observed that the vast majority of IRs is better detected by the former than the latter (points below the diagonal, 68% on top and 72% at the bottom). Hence, evolutionary conservation, physicochemical properties and local geometry are generally able to better capture protein interface signals than the coarse-grained empirical energy function used in the docking experiment. Nevertheless, there are a number of cases where docking-based data provide valuable information to improve predictions by unveiling interfaces that could not be detected otherwise. An example is given by the anticoagulation Factor X ( Figure 5C), where one of its three IRs (in white) is very well detected by SC dock (in red, F1-score = 0.74) but completely missed by the other SCs.

| Predictions capture interface variability coming from molecular flexibility
Accurately accounting for molecular flexibility remains a challenge for protein interface and interaction prediction. We looked at how our predictions is equal to or greater than that computed on the corresponding ISs (compare black/white and colored surfaces on Figure 6). These results reveal that there exists a non-negligible variability inherent to protein interfaces and that dynJET 2 predictions is generally able to capture it.
We also assessed the robustness of our predictions with respect to conformational changes. For each IR from PPI-262 ext , we calculated the conformational deviation of its backbone atoms between the query structure from P-262, on which our predictions were computed, and the structures of its homologs (see Materials and Methods). Almost all (95%) IRs display average conformational deviations lower than a 4 Å ( Figure S4). The extent of the deviation is not correlated to the quality of the predictions (Pearson correlation coefficient of 0.05 between RMSD and F1-score, Figure S4). This indicates that dynJET 2 predictions are robust to small to medium conformational changes.

| Predicted patches' seeds describe the multiplicity of interactions
Almost all (94%) IRs from PPI-262 ext were detected, at least partially, by considering predictions issued by all four SCs ( Figure 4G).
Some predicted patches display a good or very good match with a single IR. For example, the interface between profilin and human VASP To test whether this type of reasoning could be generalized, we systematically investigated how predicted patches were distributed over experimental IRs. For this, we explicitly considered the patches' seeds, which are the first groups of residues being detected by dynJET 2 clustering algorithm. We collected all seeds generated by SC notLig , SC geom , and SC dock and clustered them based on 3D proximity (note that SC cons seeds were not considered for this analysis, see Materials and Methods). The total number of resulting clustered seeds is 562, which corresponds to 2.14 seeds per protein on average. By comparison, the average number of IRs is 1.4. About one quarter of the seeds are completely inside an IR (100% precision) and almost than half of the seeds detect an IR with very high (≥80%) precision ( Figure 7A). In the examples of Profilin and factor X, the number of seeds is equal to the number of IRs and each seed points to a different IR ( Figure 5A,C).
We also investigated whether seeds could be used to infer the number of partners a protein has ( Figure 7B). For this, we looked at the properties of the seeds lying completely or almost completely (PPV F I G U R E 5 Examples and comparison of predictions. A, Profilin (light gray cartoon) displayed with the patches predicted by SC cons (in beige) and SC geom (in cyan), the patches' clustered seeds, two experimental IRs from PPI-262 ext (in gray tones) and the corresponding partners (colored cartoons); (B) Scatterplot of F1-scores computed for the best-matching patch or combination of patches, among SC cons , SC notLig , SC geom (x-axis), and from SC dock (y-axis) against experimental IRs from PPI-262 ext . In cases where a combination of several patches is retained, the patches either come from a single SC (on top) or from several SC (at the bottom, x-axis). (C) Heavy chain of the anticoagulation factor X (light gray cartoon) displayed with the patches predicted by SC cons (beige) and SC dock (red), the patches' clustered seeds, the three experimental IRs from This means that IRs displaying a multiplicity of signals relevant to protein interactions tend to attract more partners. Hence, the accumulation of seeds with different properties in a protein region can be used as an indicator that this region will likely interact with many partners.

| Comparison with other state-of-the art interface predictors
We compared dynJET 2 predictions to those of Multi-VORFFIP 59 Figure 4A, compare the second, third and fourth boxes). However, the performance of Multi-VORFFIP and SPPIDER in detecting individual IRs is lower than that of dynJET 2 ( Figure 4B,C,D, compare darkblue and green boxplots). The average F I G U R E 6 Examples of predictions whose precision is higher on the IR compared to the IS. The query protein structure from P-262 is displayed as a gray cartoon. The experimental and predicted interfaces are displayed as opaque surfaces: on top, the IS is colored in white and the additional residues belonging to the IR are in black; at the bottom, the SC cons , SC notLig , and SC dock patches predicted for 1avo_A, 1jjo_A, 2vp7_A are in wheat, purple and red, respectively, and the best combination of patches predicted for 1ibc_A is in yellow.  2 ). In particular, dynJET 2 is more sensitive than the two other tools ( Figure 4D).

| DISCUSSION
Protein surfaces are used in multiple ways in cellular partners association. A comprehensive and accurate description of protein surfaces should account for multiple partners, molecular flexibility (from slight rearrangements to conformational changes), disorder and posttranslational modifications. In this work, we have analyzed a pool of proteins with different functions to address the two first aspects.
In line with a previous study, 12 we found that the protein surface involved in functional interactions is probably much bigger than antici- This procedure permitted shedding light on the variability of binding modes and on the multiplicity of protein surface usage. We observed that, within an IR, some residues are specific to the interaction with one partner while others are shared between different partners, and possibly between another copy of the same protein and other proteins or peptides. The proportion of shared residues can be high, indicating that IRs can serve as "binding platforms" for very different partners. In the evaluation of dynJET 2 predictions, we could appreciate that a large amount of predicted patches better matched IRs, compared to ISs. This result is expected from a good protein interface prediction algorithm, as the notion of IR seems more biologically pertinent than that of IS in many cases, especially when the IR synthesizes the variability inherent to structure ensembles of the same complex.
Our predictions were generated based on three sequence-and structure-based properties of monomeric protein surfaces and also on residue propensities inferred from docking calculations. The latter reflect how energetically favorable the interatomic interactions established between a protein and many other proteins are. The energy was evaluated by a combination of Lennard-Jones and Coulomb potentials. We have systematically assessed the contribution of this physical description of protein interactions to the prediction of interfaces. We have shown that in most cases, its predictive power is lower than that of the sequence and structure-based descriptors.
Nevertheless, it brings complementary information, which helps to improve the accuracy of the predictions and, in some cases, it even unveils binding sites that could not be detected otherwise.
We have highlighted several cases where almost the entire protein surface is involved in functional interactions. This finding challenges the role of specificity in the evaluation of protein interface prediction methods and rather put the emphasis on precision. We have demonstrated the usefulness of the prediction patches' three-layer structure by showing that the patches' seeds enabled precisely locating and discriminating IRs at the protein surface. We have further shown that the seeds could help determine whether a protein has a few or many partners. Future work will aim at getting a more accurate estimation of the number of partners. Moreover, with the help of future PPI data, it seems achievable to associate functions to the partners binding on different surface areas, described by different seeds on a region.
Although dynJET 2 predictions match reasonably well experimentally identified interacting regions, the match is not perfect. Given the degree of complexity we have highlighted in the usage of a protein surface, it seems legitimate to ask whether perfect match with experimental interfaces is an attainable goal for protein interface predictions algorithms that, like dynJET 2 , do not use any knowledge about the query protein's partners. An accurate estimation of the maximum level of agreement one could expect would be most valuable. Besides, even without perfect match, dynJET 2 predictions can be fully exploited to guide experiments. For example, the above-mentioned ability of patches' seeds to precisely locate IRs has implications for the control and modulation of existing protein-protein interactions. Mutating seed residues should impact the binding of the associated partners.
Another way to go would be to design interactors that bind to the predicted patches. Indeed, dynJET 2 algorithm provides a mean to delineate regions at the protein surface with sizes similar to those of experimental interacting sites or regions and complying with a few properties known to be relevant to protein-protein association. Thus, in addition to detecting regions being actually used to interact, it also reveals the potentiality of other regions to interact.

CONFLICT OF INTEREST
The authors have no conflict of interest to declare.