Simplified geometric representations of protein structures identify complementary interaction interfaces

Abstract Protein‐protein interactions are critical to protein function, but three‐dimensional (3D) arrangements of interacting proteins have proven hard to predict, even given the identities and 3D structures of the interacting partners. Specifically, identifying the relevant pairwise interaction surfaces remains difficult, often relying on shape complementarity with molecular docking while accounting for molecular motions to optimize rigid 3D translations and rotations. However, such approaches can be computationally expensive, and faster, less accurate approximations may prove useful for large‐scale prediction and assembly of 3D structures of multi‐protein complexes. We asked if a reduced representation of protein geometry retains enough information about molecular properties to predict pairwise protein interaction interfaces that are tolerant of limited structural rearrangements. Here, we describe a reduced representation of 3D protein accessible surfaces on which molecular properties such as charge, hydrophobicity, and evolutionary rate can be easily mapped, implemented in the MorphProt package. Pairs of surfaces are compared to rapidly assess partner‐specific potential surface complementarity. On two available benchmarks of 185 overall known protein complexes, we observe predictions comparable to other structure‐based tools at correctly identifying protein interaction surfaces. Furthermore, we examined the effect of molecular motion through normal mode simulation on a benchmark receptor‐ligand pair and observed no marked loss of predictive accuracy for distortions of up to 6 Å Cα‐RMSD. Thus, a shape reduction of protein surfaces retains considerable information about surface complementarity, offers enhanced speed of comparison relative to more complex geometric representations, and exhibits tolerance to conformational changes.

has been from direct experimental determination using techniques such as X-ray crystallography and electron microscopy, 1,2 but these methods remain costly and laborious. Other, more indirect experimental techniques, including mutagenesis, 3,4 mass spectrometry, 5 and cross-linking analysis, 6 can also illuminate the specific residues that participate in these interaction interfaces. These techniques give partial information about the three-dimensional (3D) assembly of complexes, and new integrative computational modeling strategies are increasingly able to consider such data as distance restraints to infer 3D structures. [7][8][9][10] To complement experimentally led approaches, there has also been a strong push to develop better computational approaches for predicting protein interaction interfaces directly from protein amino acid sequences and 3D structures.
Importantly, the prediction of protein-protein interaction interfaces is of substantially lower computational complexity than the problem of predicting or folding a 3D protein structure based on its linear amino acid sequence, as interface predictions (eg, by molecular docking) are limited to 6 of rotational and translation freedom and a sampling of accompanying intramolecular motions that might occur upon binding. 11 Ideally, successful interface predictors would go beyond predicting pairwise interactions and be useful to assemble large molecular machines from individual subunits.  13 there tend to be a larger number of polar residues along the interface. 14 Additionally, many of the forces driving these interactions derive from weak electrostatic charge. 15 Furthermore, all of these noncovalent interactions would benefit from a calculation of the binding affinity. 16 Thus, computational approaches face a significant challenge in predicting contact interfaces that may vary significantly based on the relevant class of protein-protein interaction for any particular interface.
Computational approaches for determining how proteins interact include predictions of interaction interfaces or docking of protein structures, where the former informs the latter. It has been shown that knowledge of an interaction interface can greatly improve the prediction of the conformation of the proteins that are interacting. 17 Interface predictors may be divided into two groups: intrinsic-and template-based approaches. 18 Intrinsic-based approaches focus on features within the protein sequence or the protein structure.
Template-based approaches search through databases of protein complexes with known structures and use these interfaces to make predictions. 19 However, the latter approach requires prior structural information for the protein(s) of interest. Intrinsic-based approaches take either sequence information or structural information as the input of the predictor. Enhancing intrinsic-based approaches may be challenging, as a review of previous literature found that the addition of more features does not improve predictions. 18 Sequence-based predictors utilize protein sequence information to either feed different amino acid properties into a machine learning classifier or sequence alignment tools. Sequence alignment methods assume that proteins of similar sequences have similar binding partners and therefore binding sites. 19 Many machine learning techniques focus on features of neighboring residues, where the size of the window of residues ranges from 9 to 21 amino acids. 19 However, proximity in sequence does not necessarily reflect proximity in structure, highlighting the benefits of incorporating structural information into the interface predictions. Some techniques have taken an intermediate approach where the proteins are represented by a network where individual nodes represent residues and residue properties, while edges represent structural information providing some spatial resolution. 20,21 Structure-based predictors utilize structural information from either experimental data or homology modeling as a constraint in formulating their prediction. Previous studies showed that the quality of the prediction is dependent on the quality of the structure and that homology models produce less accurate predictions. 19 One structural approach involves dividing a protein surface into patches and using these patches to predict interaction sites. Patches are defined as either the n closest residues where n depends on the size of the protein or a set size for all proteins. 22,23 For these methods, patch size is predetermined and uniform, causing problems for predicting interfaces of proteins with multiple binding partners or if the defined surface patch does not accurately reflect the size of the true interface. 22 Many predictors ignore the binding partner; however, utilizing the binding partner has been shown to improve predictions. 18 Partner-specific interface predictors, which account for all participating proteins in the interaction are less common but have the benefit of considering complementarity between specific proteins.
Partner-specific predictors use structures or sequences of two proteins that are assumed to interact in predicting the interaction interface for each protein. 18 A partner-specific approach allows the user to consider complementarity, which plays a central role in molecular recognition. Proteins that promiscuously bind to multiple partners present a unique challenge for predicting interfaces. These multiple binding partners may all bind at the same site, or they may bind at multiple sites on the protein surface. 24 While recent studies highlight the ability of current predictors to separate non-binding from binding residues on individual proteins, these predictors fail to distinguish partner-specific interaction sites resulting in cross-prediction between sites. 19 Currently, many partner-specific approaches exist for predicting interactions. A majority of these methods use the primary sequence and homology searches to make predictions. PAIRpred utilizes a support vector machine classifier for predicting partner-specific interaction interfaces. 25 While this approach employs multiple features, the features included in the classifier are all based on solvent accessible surface area, which cannot account for proteins that undergo a dramatic conformational change during binding. Another partner-specific tool is PPIPP.

PPIPP uses a neural network trained on interacting pairs and has been
shown to outperform partner-unaware models. 26 Similarly, HomPPI uses sequence-homology based approaches to identify conserved regions between the partners. 27 Both approaches only use sequence information and do not incorporate spatial data. Many recent approaches have attempted to use multiple sequence alignments to predict residues that coevolve between proteins through direct coupling analysis, mutual information, or a combination of the two and show improved prediction capabilities. 8,28,29 One important challenge that remains for partner-specific, structure-based predictors is accounting for conformational changes that occur upon binding. The performance of these methods decreases with increasing conformational rearrangements and dynamics of the protein pairs upon binding. 26 For this reason, we were interested in developing a reduced representation of protein structural data that does not explicitly consider shape complementarity and can make quick predictions that may be used in assembling larger protein complexes. Here, we developed and evaluated a protein shape reduction method (MorphProt) that predicts partner-specific interaction interfaces by mapping properties of protein surfaces to a reduced representation and rapidly tests for complementary surface patches within these reduced geometric representations. MorphProt shows comparable predictive power to a number of more computationally intensive approaches and tolerance to structural rearrangements in the interaction partners.

| Benchmark set of protein-protein interactions
To evaluate the quality of the interaction interface predictions from MorphProt, we used a benchmark set of known protein complexes.
The benchmark data set for this method was version 5.0 of the widely used protein-protein interaction docking benchmarks. 30 This benchmark set provides a large library of 230 Protein Data Bank 31 (PDB) files for non-redundant complexes with varying rigidity, as well as enzyme-containing and antibody-antigen complexes. From this set, we extracted 172 complexes (Supporting Information). Those complexes that were not included either had incomplete structures, creating an error in the PQR calculation or had more than two subunit chains (excluding antibody complexes).
In addition to the protein docking benchmark 5.0, we used the protein docking gold standard, the Critical Assessment of PRedicted Interactions (CAPRI) score set. 32 CAPRI provides an expanded benchmark data set for evaluating scoring functions, which includes 13 published CAPRI targets. All predictions were made on unbound structures and were validated against the bound structures.

| Calculated properties of surfaces
The properties that were used in these analyses were charge, hydrophobicity, and evolutionary rate. The atomic charge was calculated using PDB2PQR. 33 PDB2PQR begins by rebuilding missing nonhydrogen atoms using standard amino acid topologies in conjunction with the existing atomic coordinates to determine new positions for the missing atoms. Next, hydrogen atoms are added and positioned to optimize the global hydrogen-bonding network. Finally, PDB2PQR assigns atomic charges and radii based on the AMBER force field.
Here, The PDB2PQR program was run using the Opal server. Finally, the evolutionary rates were obtained from the ConSurf Database. 35 This database contains information regarding precalculated evolutionary conservation scores. The evolutionary rates stored in the database are calculated using the Rate4Site algorithm. 36 This method evaluates evolutionary rates using a maximum likelihood estimate assuming a stochastic process. Based on this, amino acid replacement probabilities were computed for each branch of the phylogenetic tree. The tree is then used to cluster closely related sequences and find a consensus sequence for each cluster. The consensus sequences are then compared, and each position may be described as variable or conserved. The frequencies are renormalized to represent a number between 1 and 9. Finally, each of the properties described was stored in the surface of the protein structure as part of the appropriate atomic coordinate.

| Protein shape reduction
To reduce the dimensionality of the intricacies of protein shape, we performed a shape reduction of the 3D atomic structure into a simplified representation. To perform these calculations, we have created a Python library, MorphProt. The input for these calculations is a PDB file (either an atomic structure or homology model), a PQR file, and a conservation file produced by Consurf 35 (when considering evolutionary rate). MorphProt began by extracting the molecular surface using Michel Sanner's MSMS program, 37 which uses a 1.4 Å diameter sphere to detect the solvent accessible surface area. Next, it calculated a residue depth for all of the amino acids in the protein sequence using the molecular surface. The residue depth was calculated using Biopython 38 and was evaluated as the average depth of all atoms in a residue from the calculated surface. In MorphProt, amino acids were said to be contributing to the surface of the protein if their residue depth (calculated as the average depth across all atoms in the residue) was less than 5 Å from the calculated accessible surface. We include this additional 5 Å from the accessible surface to account for any subsurface binding properties that may be missed in the accessibility calculation. MorphProt then extracted the 3D coordinates for all of the atoms that satisfy these surface constraints.
After the atomic coordinates of the surface are extracted, MorphProt took the maximum and minimum for each x, y, and z coordinate as 6 biased start centroids, k = 6. MorphProt uses SKLearn 39 to perform a K-means clustering. It projected each of the clusters onto a 2D surface proportional to the size of the cluster.
Next, it binned each 2D projection into 5 Å by 5 Å boxes, forming a grid. Note that MorphProt allows for a customizable bin size.

| Protein interaction interface prediction
We computed a 2D cross-correlation, a common pattern recognition and image processing tool, to predict areas of the protein surface with maximum interaction between properties. The cross-correlation was calculated using MorphProt. Because each protein is reduced to a total of six matrices, we calculated a total of 36 2D cross-correlations for each pairwise interaction between the six faces of each protein for a given initialization. In addition, we sampled all 10 rotations to account, in an approximate fashion, for different orientations or positions of the initial protein structures. For antibody complexes, we took the top charge score due to the variable binding region that would not be captured with evolutionary rate or the hydrophobic interaction between the heavy and light chains.
After identifying the top score in the cross-correlation matrix, we determine the position of the two matrices that produced the score.
We use the position of the high score within the cross-correlation matrix to identify the alignment of the two matrices and extract any overlapping regions between the two. Once the areas of the two matrices that are interacting are identified, we map this back to the protein structures themselves.

| Evaluation of predicted protein interaction interfaces
To evaluate our predictions, we calculated a confusion matrix to classify predicted interface residues as true positives, false positives, false negatives, and true negatives based on the predicted and actual classes. We defined a residue to be on the interaction interface if any atom from the residue is within 5 Å of an atom from the protein it is in complex with. We then evaluated our confusion matrix where the precision, recall, accuracy, and F 1 score are defined accordingly: Additionally, we have integrated an extreme value calculation to validate the "uniqueness" of the atomic properties. This demonstrates that placement along the interface is not a random distribution of points but a clustering of some property contributing to an interface.
To calculate this, we randomly shuffled the properties associated with each atom and recalculated scores. We repeated this shuffle and scoring 1000 times to generate a distribution. If the score was an extreme value in the distribution, then the score is statistically significant and represented a clustering of a property at that location.

| Simulation of structural distortion by normal mode analysis
To simulate structural distortion in the crystal structures from the test set we used elNémo 44 , a normal mode analysis. elNémo predicts the possible movements of a macromolecule through lowfrequency normal modes. The l and r unbound subunits of PDBID: 1FQJ, 1NZ8, 1US7, 2GTP, and 3CPH from the protein-protein interaction docking benchmark were used. All default parameters were kept except for DQMIN and DQMAX, which were adjusted to 100 and 300, respectively, to allow more extreme distortion.

Selected normal modes and PDBs can be found in the Supporting
Information. Modes were selected based on large Cα-RMSD from the wild-type structure.

| RESULTS
We wished to test if a highly simplified geometric representation of a 3D protein surface embedded with properties was sufficient to F I G U R E 1 MorphProt pipeline for partner-specific interaction interface prediction. A, The MorphProt interaction interface pipeline begins with atomic coordinates (PDB). Relevant surface properties are stored in these coordinates. The protein surface is converted into six matrices. A cross-correlation is calculated between matrices of each protein to find the area of maximum interaction (max score). This is used to generate the position of the matrices to give the maximum, which are then mapped back to the surface of the protein. B, The atomic structure to property matrix is described in more detail. The surface of the structure is extracted, and a k-means of the atomic coordinates is used to segment the surface into six patches. The patches are then projected into two dimensions. Each patch is then binned, and an average surface property is calculated predict protein-protein interactions, while being tolerant of possible molecular motions relevant to the interaction. We wanted to consider protein surface properties and how opposing surfaces complement each other when forming an interface, largely independent of protein shape. For this reason, we began with a reduction of the irregular shape of a protein by considering atoms near the surface of the protein, thus excluding the atoms that play a role in stabilizing the protein core and presumably make less of a contribution to protein-protein interactions.

| Simplified representation and interaction interface prediction
Our simplified representation is as follows: The solvent accessible surface of the protein is computed and reduced into a simplified geometric representation proportional to the size of the protein (Figure 1).
The reduction retains an approximate representation of interface proportions. Recently, the idea of reducing proteins to simplified shapes has gained attention in structural searches. 45 Our shape reduction uses a K-means clustering algorithm to separate protein surface accessible amino acids into six distinct clusters, followed by a projection of the coordinates into two-dimensions (2D) ( Figure 1B These reduced protein surfaces are images, making them suitable for several image processing techniques. To build a partnerspecific predictor that considers surface property-complementarity, we performed cross-correlation of images from two partner proteins to find an area of maximum similarity between the two images by computing a dot product at each position after rotation and translation ( Figure 1A). Cross-correlations have already proven to be invaluable in various image processing techniques, including identifying single particles from electron microscopy data. 46 Here, this approach was used to identify an area of maximum interaction by searching and calculating a complementarity score between properties in the matrix. Because our protein surfaces were reduced into six matrices,

| Detecting interaction interfaces with a known nature of interaction
To address the concern of any distortion by the shape reduction, we demonstrated that interaction interfaces are still detectable with a proof-of-concept protein pair, the alpha-chymotrypsin-eglin c complex (PDB:1ACB) (Figure 2A-C). We extracted the surface of each pro-  Figure S1).  (Table S1). These complexes are also categorized by the difficulty to predict the interaction interface based on I-RMSD (RMSD of the interface). Using our validation scheme, we have analyzed the ability of MorphProt to predict the interaction interface of the protein pairs for each difficulty group ( Figure 3A,B). The average MorphProt was still able to predict some of the interaction interface ( Figure 3B).

| Evaluating MorphProt on a benchmark protein interaction set
We also classified these predictions according to the CAPRI criterion of high, medium, acceptable, and incorrect ranking. 48 Using this criterion, we found that 26 of the complexes ranked high, 35 ranked medium, 74 ranked acceptable, and 37 ranked incorrect. Our method is based on selecting the top, consistent interface as the prediction, but we were interested in exploring whether selecting the second interface for those predictions where recall <0.1 would improve our statistics ( Figure S2). This approach led to better summary statistics with the new average F 1 score of 0. 22

| Interaction interface prediction despite structural distortion
Finally, we wanted to test the performance of our interface predictor against the uncertainty that may arise from structural models produced by homology modeling or lower resolution structure building methods. In both experimental and computational structural biology, there can occasionally be uncertainty regarding the exact position of the side chains and backbone of the model. By distorting one of our test proteins that produced a strong evolutionary rate interface prediction, we showed that our predictions remain robust even considering a structure that is distorted by up to 6 Å Cα-RMSD. The crystal structures of the unbound Gnai and RGS9 (PDB: 1FQJ) were distorted using normal mode analysis ( Figure 4A,B). We used elNémo 44    The ligand and receptor are depicted in gold and blue, respectively. The interface is predicted using evolutionary rate. B, The predicted interface is colored red on the bound structure. C, The receptor and ligand were distorted using elNémo normal mode analysis. The receptor was distorted up to 6 Å (Cα-RMSD) while ligand was held constant. The zoom-in depicts the native structure (gray) superimposed onto the 6 Å distorted structure to show the change in position of residues on the interface (top). Precision, recall, F 1 score, and accuracy were plotted against Cα-RMSD showing that for the distorted receptor there was no linear correlation between Cα-RMSD (R 2 acc and R 2 F1 < .02) (bottom). D, The same was done for the ligand while the receptor was held constant. Here, precision, recall, F 1 score, and accuracy showed a slight linear correlation with the Cα-RMSD (R 2 acc = . 44

| DISCUSSION
Here, we have demonstrated that by using shape reduction to normalize the highly variable 3D protein structure to a simplified geometric representation, we are able to store layers of information on a 2D representation of a protein surface while preserving atomic neighborhoods. The resulting matrix of values contains the location of surface properties and their proximity to other values and is a direct representation of the spatial coordinates of the 3D atomic structure. We showed that converting the surface properties to an image allows us to identify areas of maximum interaction of surface properties between two proteins via a partner-specific approach. In addition, MorphProt has the ability to construct large macromolecular assemblies through detecting multiple partner-specific interfaces.
While primary sequences provide information regarding amino acid identity and adjacent residues, it can be difficult to precisely determine from sequence alone which residues reside on the surface of a protein and their relation to each other in its 3D structure.
Structure-based approaches allow us to extract and investigate surface properties, providing a useful first step for interface prediction, as the spatial position of residues is essential for macromolecular recognition. 23 Many machine learning interaction interface predictors exist and use structure, but the only information stored in feature vectors is statistical information for the surface patches and not the spatial arrangement of the residues. 23 In addition to the lack of information regarding residue neighborhoods, many of the structure-based approaches are not equipped to handle dramatic conformational changes upon binding. 49 We have addressed these limitations of previous methods through our greatly from these methods in its approach. The reduced representation removes emphasis from the shape and shifts it to surface properties and their location relative to each other. Because of this, no information is required aside from the protein structure to generate a prediction. The other programs require some sort of a priori information, either a template that resembles the protein being searched or a training set where a similar protein is represented. This limits these methods from making predictions for proteins that are not well represented experimentally.
In most structure-based interaction interface predictors, an interface is identified based on features of a given area on one of the protein surfaces, ignoring properties of a partner when determining how they best fit together. A partner-specific predictor uses information regarding both proteins of interest. It has been shown that prediction methods that do not employ a partner-specific approach have lower reliability in predicting transient binding sites, 50,51 whereas a partnerspecific approach can identify locations that are highly conserved for transient protein-protein interactions. 27 A significant advantage of using a partner-specific predictor is its ability to find specific surface areas that form interactions with different partners. One significant challenge of many partner-specific predictors is their use of unbound protein structures to search for interaction interfaces. 18 In many biological processes, proteins undergo a dramatic conformational change upon binding, which complicates predicting an interface based on unbound structures. We have demonstrated that using a reduced surface representation of a protein in combination with stored information of highly predictive properties, we can make partner-specific interface predictions for unbound proteins, including those that undergo at least moderate structural rearrangements, an important feature for building multi-protein assemblies.
Furthermore, we showed that despite introduced structural distortion, we are still able to predict interfaces based on complementarity. This is increasingly important for predicting interaction interfaces with the widespread use of homology models and lower-resolution structures. Here, greater weight is put on the neighborhoods of properties on the surface rather than their exact location. The ability to predict the interface for homology models is significant for assembling macromolecular complexes where little is known regarding the structure of the individual subunits. Theoretically, one could produce models for the subunits and then arrange them according to their interaction interfaces to predict the structure for large assemblies.
Such analyses would also benefit from protein docking following the interface prediction to improve positioning.
While discrepancies between interface prediction and protein docking occur often, the techniques are effective when used in conjunction with one another. Protein-protein docking is a partnerspecific technique that is highly dependent on shape complementarity and energetics. 23 One of the limitations of protein-protein docking is the large sample size that must be tested and then scored by an energy function to produce a prediction of the arrangement of two proteins. The number of arrangements would be drastically reduced by using an interface prediction as a preliminary step before docking.
Previous studies showed that using a partner-specific, homologybased interface prediction prior to docking significantly improves the scoring of the docked proteins. 52 Notably, the HADDOCK server allows for the incorporation of a predicted interaction interface, however, this interface is computed from a single protein rather than a partner-specific interface. 53 Incorporating our interface prediction into a protein-protein docking pipeline would increase computational efficiency because it is independent of shape complementarity and energetics.
Another significant application of partner-specific interaction interface predictors is the screening of small molecule inhibitors or drugs. These often interact via transient interactions, 23 making predicting transient interactions imperative. Small molecules that interact with protein-protein interfaces and alter these interactions have demonstrated to be effective drugs and the prediction of these interfaces could be useful in finding potential targets. 54 This poses a challenge because many protein interfaces have been described as large, featureless surfaces that lack obvious binding pockets. 55 Because our method reduces the protein surface to essentially the same, we would likely be able to make more accurate predictions using physicochemical properties stored on the surface of the protein.
Furthermore, predictions and scores for small molecule inhibitors or drugs could be optimized by understanding the area of interaction produced by our method.

| CONCLUSIONS
To address the inherent variability of protein shape, conformational changes, and structural approximations while reducing computation time, we were interested in determining if a simplified geometry retains enough spatial information to predict interaction interfaces based on complementary properties. Specifically, our aim was to develop a pipeline that was robust to molecular motions while gaining computational power to assemble larger multimeric protein complexes. Using MorphProt, we performed a shape reduction of the accessible surface of a protein into a reduced surface representation. This reduced representation allows for easy storage of intrinsic properties of the protein such as hydrophobicity, charge, and evolutionary rate to be embedded within each surface image. The result is a quantitative description of these properties across a protein surface enabling image processing techniques to identify complementarity between the properties of two interacting protein surfaces. We show this method can be useful when one of the above properties is the driving force of the interaction.
MorphProt was able to predict interaction interfaces for the unbound CAPRI targets and the protein-protein interaction benchmark complexes with comparable results to a number of other predictors. Additionally, MorphProt was able to predict interfaces for a large 16-subunit oligomer, proteins with multiple binding sites, and crystal structures that have been distorted by up to 6 Å Cα-RMSD to mimic models built from lower resolution density maps or imperfect homology models. Our algorithm could be integrated into platforms that aim to assemble complicated protein architectures.