Here, we present a diverse, structurally nonredundant data set of two-chain protein–protein interfaces derived from the PDB. Using a sequence order-independent structural comparison algorithm and hierarchical clustering, we obtain 3799 interface clusters. These yield 103 clusters with at least five nonhomologous members. We divide the clusters into three types. In Type I clusters, the global structures of the chains from which the interfaces are derived are also similar. This cluster type is expected because, in general, related proteins associate in similar ways. In Type II, the interfaces are similar; however, remarkably, the overall structures and functions of the chains are different. The functional spectrum is broad, from enzymes/inhibitors to immunoglobulins and toxins. The fact that structurally different monomers associate in similar ways suggests “good” binding architectures. This observation extends a paradigm in protein science: It has been well known that proteins with similar structures may have different functions. Here, we show that it extends to interfaces. In Type III clusters, only one side of the interface is similar across the cluster. This structurally nonredundant data set provides rich data for studies of protein–protein interactions and recognition, cellular networks, and drug design. In particular, it may be useful in addressing the difficult question of which ways are favorable for proteins to interact. (The data set is available at http://protein3d.ncifcrf.gov/~keskino/ and http://home.ku.edu.tr/~okeskin/INTERFACE/INTERFACES.html.)
Most, if not all, biological processes are regulated through association and dissociation of protein molecules. These processes include, but are not restricted to, hormone–receptor binding, protease inhibition, antigen–antibody recognition, signal transduction, enzyme–substrate binding, vesicle transport, RNA splicing, and gene activation. In a pioneering study almost 30 years ago, Chothia and Janin (1975) addressed the profound problem of protein–protein recognition. Jones and Thornton (1996) have reviewed the properties of different types of protein–protein complexes. Figuring out the principles of protein–protein interactions is critically important for understanding the relationship between biological function and intermolecular complex formation (Katchalski-Katzir et al. 1992; Jones and Thornton 1996; Kleanthous 2000; Kuhlmann et al. 2000; Ma et al. 2001; Nooren and Thornton 2003). Understanding these principles is essential for predicting the conformations of multimolecular assemblies, for predicting cellular pathways, and for drug design. In addition, they should be useful in predicting docked complexes. Furthermore, because binding and folding are similar processes with similar underlying mechanisms, studies of intermolecular binding are expected to aid in folding.
From the computational standpoint, there are a number of ways to study protein–protein interactions. Among these, one may focus on the details of the recognition process in one or few interacting proteins (Tramontano and Macchiato 1994; Wallis et al. 1998; Kuhlmann et al. 2000; Todd et al. 2002; Arkin et al. 2003), or carry out a broader analysis of different two-chain complexes (Tsai et al. 1996, 1998a,b; Tsai and Nussinov 1997; Bogan and Thorn 1998; Keskin et al. 1998; Xu et al. 1998; LoConte et al. 1999; Ma et al. 2001; Valdar and Thornton 2001a,b; Chakrabarti and Janin 2002; Fariselli et al. 2002). Both approaches have advantages and disadvantages. In principle, focusing on given complexes enables following the binding process, and dissecting the contributions of particular interactions. On the other hand, analysis of a data set of protein–protein interfaces allows assessment of the interactions in a statistically meaningful way. It allows using the properties of these for binding site prediction (Fariselli et al. 2002). It further allows studies of functionally distinct interfaces to identify residues critical for function and stability (Bogan and Thorn 1998; Hu et al. 2000; DeLano 2002) and facilitates analysis of the interactions in two- versus three-state complexes (Tsai and Nussinov 1997; Tsai et al. 1998b). Yet, despite the clear advantages of a data set of nonredundant protein–protein interfaces, from the technical standpoint, its creation presents difficulties. Interfaces consist of interacting residues that belong to two different chains, along with residues in their spatial vicinity. Thus, interfaces consist of pieces of each of the chains, and some isolated residues. To generate a nonredundant data set, it is essential to carry out structural comparisons of the interfaces independent of their amino acid sequence order, because the residue order may vary (Tsai et al. 1996).
Using the computer vision-based Geometric Hashing structural comparison technique (Nussinov and Wolfson 1991; Tsai et al. 1996), we compare protein–protein interfaces derived from the PDB to obtain hierarchically organized interface clusters. Next, we use MultiProt (Shatsky et al. 2002, 2003) to simultaneously align large numbers of structures. MultiProt also disregards the order of the residues on the chains, allowing us to obtain the common patterns within the clusters. These two methods are able to exhaustively handle all interfaces in the PDB to create such a data set. The current work is a considerable extension of our previous study (Tsai et al. 1996). In that earlier work, we started with 1629 two-chain interfaces, from which 351 distinct families were generated. These structurally similar interface families provided a rich data set, allowing examinations of protein interfaces from different perspectives. However, recently there has been an extremely large increase in the number of known three-dimensional protein structures. In this study, we have made use of all protein assemblies including oligomeric proteins, viral capsids, muscle fibers, enzyme/inhibitor, and antibody/antigen complexes available in the PDB (Berman et al. 2000). The large increase in the PDB has enabled us to filter the clustered interfaces further and remove similar entries to a stricter extent than previously, making conservation studies easier to analyze and interpret. The newly generated clustered interface data set is an order of magnitude larger (from 351 clusters in 1996 to 3799 now). It makes it possible to address a broad range of questions, such as whether the increase in the number of known protein structures gives rise to new families of interfaces or whether new members are added to the already known ones. This may yield clues to the completeness of both protein folds and protein interface architectures.
Further, protein–protein recognition relates to the physical and chemical properties of the interfaces (Chothia and Janin 1975; Tsai and Nussinov 1997; LoConte et al. 1999; Hu et al. 2000; Ma et al. 2001). Thus, interfaces can be characterized in terms of their geometrical properties such as size, shape, and complementarity and chemical properties, such as hydrophobicity, salt bridges, hydrogen bonds, disulfide bonds, and packing, the presence/absence of water molecules at certain sites, the total or the nonpolar buried surface areas, residue composition, and family conservation. Together, these properties play a role in determining the chemical and physical nature, and thus biological function, of protein complexes. The diverse data set makes it possible to investigate binding across and within families.
Our interface clusters contain similar interface architectures formed by two chains. In most cases, these similar interfaces are derived from globally similar protein chains. These are called Type I clusters. However, among our clusters, there are some with similar interfaces yet dissimilar global protein folds. These proteins have different functions. These are called Type II clusters. They are good candidates for detailed structural/functional studies. Because the overall structures of the proteins are different, it is likely that, although the interfaces in their complexed states have similar structures, the distributions of their substates, and the redistribution of these substates that results from a change in binding state, are different (Kumar and Nussinov 2001; Ma et al. 2002). On the other hand, they may bind similar drugs and interfere with complex formation.
Furthermore, the fact that different proteins bind in similar ways to yield similar interface architectures suggests that these Type II interfaces represent favorable structural scaffolds. They lend stability to the protein–protein interactions (Cunningham and Wells 1991; Wells and deVos 1996; DeLano et al. 2000) and afford functional flexibility. This similar-structure, different-function situation is reminiscent of protein structures. The recurrence of folds in single chains has led to the proposition of the paradigm of the limited number of folding motifs, regardless of the diversity of protein functions (Chothia 1992). Evolution has repeatedly utilized favorable, stable folds, adapting them to a broad range of regulatory, enzymatic, and packaging/structural roles. Here we show that different folds combinatorially assemble to yield similar motifs in the interfaces. The preference of different folds to associate in similar ways illustrates that this paradigm is universal, whether for single chains in folding or for protein–protein association in binding. Below, we enumerate examples of interfaces that are found in the same structural cluster yet derive from proteins with different global structures and different functions. In the third category, Type III clusters, only one side of the interface is similar across the cluster. This interface category illustrates that a given protein binding site may bind different geometries of the complementary protein.
The general similarity in architectures between interfaces and protein cores illustrates that binding and folding are similar processes (Tsai and Nussinov 1997; Tsai et al. 1998b). Combined, this diverse hierarchical data set, reflecting almost 22,000 two-chain interfaces in the (July 2002) PDB will be invaluable: Cluster members may provide hints to presumed protein specificity; comparisons across different clusters may yield clues to principles governing protein recognition and stability (Lichtarge et al. 1996; Kuntz et al. 1999; Hu et al. 2000; Brooijmans et al. 2002; Fernández and Scheraga 2003; Ma et al. 2003). The clustered data set may be a rich source for various types of analyses of protein interfaces. The old (1996) data set was used to identify some chemical and physical properties of the interfaces: It was used to extract computational hot spots in protein–protein interfaces, which were observed to be largely polar and to correlate well with alanine scanning mutagenesis (Hu et al. 2000; Ma et al. 2003). In another study, the data set was useful for deriving residue–residue empirical interaction parameters in the core regions of proteins and their comparison with the protein interfaces (Keskin et al. 1998). It was used to study the strength of the hydrophobic effect at the interfaces compared to protein cores, and to study the types of architectures in the interfaces versus in single chains (Tsai and Nussinov 1997; Tsai et al. 1997, 1998a,b). It was used to compare the number of hydrogen bonds in the single chains versus the interfaces and to study the evolution of protein dimerization (Xu et al. 1998). The enlarged data set is currently being used to predict interacting pairs of proteins. As such, it may assist in providing some clues for networks of protein interactions. 
It will be used to extract the structurally and sequentially conserved residues across the interfaces, that is, coupled mutations among families and to derive profiles of interface families. These are expected to be particularly useful in prediction of protein function, because they should be more robust than single interfaces. We are further using it for studies of interface hot spot organization. The data set should be useful in inferring cellular networks and in the design of small molecules to block protein–protein binding. Furthermore, our clusters allow investigation of proteins where the global folds are similar while their interfaces are not found in the same cluster. These may have different functions. A broad study of this question is now in progress.
The data set at the different clustering cycles
Table 1 lists the threshold parameters applied in successive clustering cycles to calculate the similarities between interfaces. The first column gives the iteration cycle. There are six successive clustering cycles (A through F). The second column gives the number of interface clusters at the beginning and end of the iteration. For example, during iteration A, there were initially 21,686 interfaces (in the first cycle, this number is equal to the number of two-chain interfaces in the PDB). Using the similarities of the structures and sequences (with the parameters listed in columns 3–5), the number decreased to 16,446. The connectivity score takes into account the residue connectivity in the polypeptide chains. The score favors a match with consecutive residues. At the end of the sixth (final) cycle, we obtained 3799 distinct clusters. After this cycle, members of each cluster had a connectivity score of at least 0.5. There was no amino acid similarity constraint, and the maximal size difference between interfaces was 50 residues.
A comparison of the new and old data sets of interfaces shows a substantial increase in the number of clusters, from 351 to 3799. The data set and the clustering results are available at http://protein3d.ncifcrf.gov/~keskino/ and http://home.ku.edu.tr/~okeskin/INTERFACE/INTERFACES.html. It is of interest to examine whether this increase is the outcome of the increased number of PDB entries or of new architectures. Figure 3 shows the ratio of increase in the PDB entries, the SCOP families (the 1996 and 2002 versions, respectively; Murzin et al. 1995), and the interface clusters (the old data set [Tsai et al. 1996] and this work). We observed that the number of entries in the PDB increased sixfold and the number of SCOP families increased threefold, whereas the number of interface clusters increased 10-fold. Thus, it appears that the increase in the PDB over the last seven years has allowed a more diversified data set for interfaces. This may also be the outcome of the rapid growth in the determination of high molecular weight proteins that are likely to include more than one chain.
Generation of a nonhomologous data set of interfaces: Sequence alignment, excluding chains with high sequence similarity
To obtain a nonredundant set of interfaces, sequences within each family were compared using CLUSTALW (Higgins et al. 1994) and the BLOSUM90 substitution matrix (Henikoff and Henikoff 1992). To eliminate redundancy, a threshold similarity of 50% was imposed: When two sequences in a cluster share a sequence similarity of more than 50%, one of them is deleted from the cluster. This yields a data set of interfaces with structurally similar but sequentially dissimilar members. Further, to constitute a valid cluster of interfaces, the cluster should have at least five members (10 chains). These filtering procedures reduce the number of clusters from 3799 to 103.
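The filtering step described above can be sketched as follows. This is a minimal illustration, not the procedure actually run for the data set: the greedy keep-first order, the function names, and the toy `toy_identity` score are assumptions introduced here, whereas the paper derives similarities from CLUSTALW alignments with the BLOSUM90 matrix. Only the 50% threshold and the five-member minimum come from the text.

```python
from typing import Callable, Dict, List


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for an alignment-based similarity score (0..1)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def filter_cluster(members: List[str],
                   similarity: Callable[[str, str], float],
                   max_similarity: float = 0.50) -> List[str]:
    """Greedily keep a member only if it is not more than 50% similar
    to any member that has already been kept."""
    kept: List[str] = []
    for m in members:
        if all(similarity(m, k) <= max_similarity for k in kept):
            kept.append(m)
    return kept


def nonredundant_clusters(clusters: Dict[str, List[str]],
                          similarity: Callable[[str, str], float],
                          max_similarity: float = 0.50,
                          min_members: int = 5) -> Dict[str, List[str]]:
    """Apply the redundancy filter, then drop clusters with fewer than
    min_members remaining members."""
    result: Dict[str, List[str]] = {}
    for name, members in clusters.items():
        kept = filter_cluster(members, similarity, max_similarity)
        if len(kept) >= min_members:
            result[name] = kept
    return result
```

Because the filter is greedy, which of two similar sequences survives depends on input order; the text does not say how that choice was made.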
The 3799 original clusters listed by their representatives are given in Appendix A (and are available at our Web site at http://protein3d.ncifcrf.gov/~keskino/). The numbers in parentheses are the number of all members included in the corresponding clusters. In all cases, both chains of the interface of each cluster member superimpose on those of the cluster representative within the similarity criteria provided in Table 1. Appendix B lists the nonredundant interface clusters. These clusters have at least five members, and at most 50% sequence identity among their members. This separate listing is given for the convenience of users who wish to carry out statistical analysis of the data set. We have further carried out multiple structure comparisons of all cluster members listed in Appendix B, using MultiProt (Shatsky et al. 2002, 2003). Appendices A and B are provided as Supplemental Material. Clusters for which MultiProt detected a consensus core encompassing all members from both chains were labeled as Type I if the members have similar functions and as Type II if they have different functions. Clusters for which MultiProt found a consensus for only one of the chains were termed Type III. Fifty-four of the clusters are Type I and II interfaces; the rest are Type III aligned interfaces.
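The labeling rule in the preceding paragraph amounts to a small decision procedure. The sketch below is illustrative only; the function name and its boolean inputs are hypothetical, and in practice the inputs would come from the MultiProt alignment and from functional annotation.

```python
from typing import Optional


def classify_cluster(consensus_chain_1: bool,
                     consensus_chain_2: bool,
                     similar_function: bool) -> Optional[str]:
    """Label a cluster from (hypothetical) alignment and annotation flags.

    consensus_chain_1 / consensus_chain_2: whether a consensus core covering
    all members was found for that side of the interface.
    similar_function: whether the member proteins share a function.
    """
    if consensus_chain_1 and consensus_chain_2:
        return "Type I" if similar_function else "Type II"
    if consensus_chain_1 or consensus_chain_2:
        return "Type III"
    return None  # no consensus on either side: outside the three types
```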
Multiple structural alignment of the interfaces with MultiProt
MultiProt detects recurring motifs in an ensemble of proteins by simultaneously aligning multiple protein structures (Shatsky et al. 2002, 2003). The algorithm considers all protein structures at the same time, rather than initiating from a pairwise-imposed molecular seed. This eliminates the bias in the superposition and finds the largest common substructure of Cα atoms that appears in the structural set. Furthermore, MultiProt efficiently finds high-scoring partial multiple alignments for all possible numbers of molecules in the input. That is, it does not require that all input molecules participate in the alignment. Because it is sequence order-independent, it can be applied effectively to protein surfaces and protein–protein interfaces. We have used MultiProt to align the interfaces of our clusters to find consensus motifs of the members' interfaces. To qualify as a consensus motif, at least 10 residues have to match with an RMSD of at most 3.5 Å. Because each member can have noncontiguous residues in the interface, MultiProt is an extremely useful tool for our purpose.
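To make the consensus-motif criterion concrete, the following sketch checks the two thresholds given above (at least 10 matched residues, RMSD of at most 3.5 Å) for a pair of matched Cα coordinate lists. It is not MultiProt: it assumes the residue correspondence and the superposition have already been computed, and the function names are introduced here for illustration.

```python
import math
from typing import List, Tuple

Coord = Tuple[float, float, float]


def rmsd(coords_a: List[Coord], coords_b: List[Coord]) -> float:
    """Root-mean-square deviation between matched, already superimposed
    C-alpha coordinates (in angstroms)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def is_consensus_motif(coords_a: List[Coord],
                       coords_b: List[Coord],
                       min_residues: int = 10,
                       max_rmsd: float = 3.5) -> bool:
    """Apply the two thresholds from the text to one matched pair."""
    if len(coords_a) != len(coords_b) or len(coords_a) < min_residues:
        return False
    return rmsd(coords_a, coords_b) <= max_rmsd
```

For a multiple alignment, one plausible use is to apply the check between each member and a reference structure; how MultiProt aggregates deviations over many structures is not specified here.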
Interface family types
Type I: Similar interfaces, similar global protein folds
In most cases, if the interfaces are similar, the overall protein folds are also similar. Such similar interface, similar fold clusters contain a single family. The list of the interface clusters and the members of these clusters are given in Appendix B (nonitalic entries). The interface clusters include homodimeric enzyme complexes (transferases, oxidoreductases, etc.), enzyme/inhibitor complexes, antibody/antigen complexes, as well as toxins. They have different polar/nonpolar compositions and different accessible surface areas. Some examples are given in Figure 4. In the figure three members of the 1cydAD cluster are presented. This cluster is formed by reductases, oxidoreductases (PDB codes: 1cyd, 1e3s, 1hdc, 1i01), and a pterin reductase (PDB code: 1e92).
Type II: Similar interfaces, dissimilar global protein folds
Some clusters belong to a particularly interesting category: In these cases the interfaces are structurally similar; however, the global protein folds are different. These are listed in Table 2 and Appendix B (italic entries). The proteins in these similar-interface, dissimilar-fold clusters fall into different families (see the SCOP classification, also provided in Table 2, first column). Nevertheless, because their interfaces are structurally similar, they are members of the same clusters. These families have different functions. Thus, interface structural similarity does not ensure global protein structural similarity. Furthermore, it has previously been shown that globally similar protein structures may have different functions (Martin et al. 1998; Orengo et al. 1999; Moult and Melamud 2000; Thornton et al. 2000; Nagano et al. 2002). Cases such as those listed here illustrate that this paradigm can be taken further: Similar interfaces do not imply similar functions.
Figure 5 illustrates some examples from Table 2. Part A shows all members of the cluster. Each case in the figure presents the ribbon diagrams of the proteins that belong to different SCOP families in the same interface cluster, clearly showing that the global structures are different. Part B displays ribbon diagrams of two of the proteins with their common interfaces highlighted with yellow. Note that there are three clusters in Table 2 where the representative of the cluster does not appear in the list of family members. These cases are cellulose-binding domain family III, MHC antigen-recognition domain, and nucleotide and nucleoside kinases. In these cases, while the representative aligned with each cluster member, it did not align well with all members simultaneously, suggesting some slight deviations in the multiple structural superposition.
Type III: One side similar interfaces, dissimilar global protein folds
Our data set also contains clusters where one chain of the interface is conserved while the second varies. Figure 6 presents an example of such a cluster. Although this figure specifically shows an antibody interacting with four partners (three of them peptides), Type III interfaces are not restricted to antibody/antigen or protein/peptide complexes. This type includes protein complexes with a diverse range of biological functions. For example, in one of the Type III clusters we have a homodimer antioncogene protein (interface ID: 1a1uAC), a homodimer of leucine zipper complex (interface ID: 1a93AB), a homodimeric complex of mannose binding protein, lectin (interface ID: 1afa12), a homodimer of transcription regulation protein (interface ID: 1ajyAB), a tetramer of cytokine, ciliary neurotrophic factor protein (1cnt14), and a homodimeric replication termination protein (1f4kAB). One-chain conserved clusters are very interesting: They can be used to address fundamental questions such as whether nonspecific binding is largely hydrophobic with flatter surfaces, which functions are involved, or whether in one chain-dominant interfaces the second chain is smaller. The data set may bear on longstanding problems relating to binding specificity and selectivity and to specificity with respect to conserved interactions and function. It may also be useful for prediction of residues contributing dominantly to stability.
Propensities of residues in the interfaces
The relative frequencies of different types of amino acids in the interfaces of protein–protein complexes can be used to derive the propensities of the residues. The overall propensities of the 20 amino acids are calculated for the contacting residues (not including the “nearby” residues) in the interfaces from the data set containing all interface clusters. We compare the frequency patterns at the binding sites versus those in the overall structures. The propensity (Pi) of a residue (i = Ala, Val, Gly, …) to occur at the interface is calculated as the fraction of residues of type i that are found in the interface relative to the fraction of all residues that are found in the interface:
$$P_i = \frac{n_i / N_i}{n / N}$$
where ni is the number of residues of type i at the interface, Ni is the number of residues of type i in the chains, n is the total number of residues in the interface, and N is the total number of residues in the whole chains.
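As a worked illustration of these symbol definitions, the sketch below computes propensities from residue lists, assuming the ratio form Pi = (ni/Ni)/(n/N) that the definitions and the later logarithmic expression imply. The function name and the toy composition in the usage note are hypothetical and are not taken from the data set.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def interface_propensities(interface_residues: Iterable[str],
                           chain_residues: Iterable[str]) -> Dict[str, float]:
    """Return P_i = (n_i / N_i) / (n / N) for every residue type in the chains.

    interface_residues: one-letter codes of contacting interface residues.
    chain_residues: one-letter codes of all residues in the whole chains.
    """
    interface_list = list(interface_residues)
    chain_list = list(chain_residues)
    if not interface_list or not chain_list:
        raise ValueError("both residue lists must be non-empty")
    n_i = Counter(interface_list)   # counts per type at the interface
    big_n_i = Counter(chain_list)   # counts per type in the whole chains
    n, big_n = len(interface_list), len(chain_list)
    return {aa: (n_i[aa] / big_n_i[aa]) / (n / big_n) for aa in big_n_i}


def log_propensity(p: float) -> float:
    """Natural log of a propensity; positive values mean enrichment."""
    return math.log(p)
```

With a toy chain of 8 Ala, 8 Gly, and 4 Leu and an interface of 2 Ala, 1 Gly, and 2 Leu, the propensities are 1.0, 0.5, and 2.0, so only Leu has a positive logarithmic propensity.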
Figure 7 displays the correlation of our residue propensities with those of Jones and Thornton (1997). The axes represent the natural logarithms of the propensities. The positive value in the logarithmic propensity indicates that a residue is more likely to occur in an interface. A high correlation coefficient (0.91) is obtained over the 20 amino acids. The residue propensities of Jones and Thornton (1997) were calculated from a data set of 63 protein complexes by taking the fraction of accessible surface area that the amino acid has contributed to the interface compared with the fraction of accessible surface area that the amino acid has contributed to the whole surface (i.e., all exposed residues). Thus, their propensities are calculated by the propensity of the accessible surface areas of the residues. Our propensities are calculated by the frequency of occurrence of the residues compared with the rest of the chain. To have a more appropriate comparison, we have multiplied each residue by its average accessible surface area (Miller et al. 1987) and normalized the results by the surface propensities of the amino acids (Table 2; Ma et al. 2003) according to the formula: ln(P)i = ln[(ni/Ni)/(n/N)/(ns/Ns)/(n/N) * ASA], where ASA stands for the average accessible surface areas of the residues in an extended Gly-X-Gly triplet (Miller et al. 1987). The high correlation we observe suggests that their data set, despite its smaller size, still presents a good coverage with similar properties.
Table 3 lists the propensities of the different amino acids in Type I, Type II, and Type III interfaces. We have further computed the propensities in each interface type when dividing the residues into classes of hydrophobic (A,P,L,I,M,V), charged (D,E,R,K), polar (N,Q,S,T), and aromatic residues (W,Y,F,H). The last four rows of Table 3 give the overall contribution of these residue classes. The percentages are given as the second figure in the last four rows. Clearly, interfaces are dominated by hydrophobic residues in all three cases. Aromatic residues make the next largest contribution. However, it is interesting that the hydrophobic effect is smaller in the Type III interfaces. Instead, the propensities of the charged residues increase. This may reflect the fact that in Type III the nonconserved side of the interface is smaller. Smaller interfaces have already been shown to display a reduced hydrophobic effect (Tsai et al. 1997). In these smaller, more exposed interfaces, electrostatics appears to play a more important role. Overall, charged residues are less frequent in the interfaces. This also suggests that electrostatic interactions are probably not the major source of the stability of the interfaces.
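The class grouping above can be illustrated with a short sketch that reports the percentage of interface residues falling into each class. The class memberships are those listed in the text; the function name is hypothetical, and the exact way the class contributions in Table 3 were computed is not stated here, so this is only one plausible reading. Gly and Cys belong to none of the four classes and are simply left uncounted.

```python
from typing import Dict, Iterable

# Residue classes as listed in the text (one-letter codes).
RESIDUE_CLASSES: Dict[str, str] = {
    "hydrophobic": "APLIMV",
    "charged": "DERK",
    "polar": "NQST",
    "aromatic": "WYFH",
}


def class_percentages(interface_residues: Iterable[str]) -> Dict[str, float]:
    """Percentage of interface residues in each class.

    Residues outside the four classes (e.g., G and C) count toward the
    total but toward no class, so the percentages need not sum to 100.
    """
    residues = list(interface_residues)
    if not residues:
        raise ValueError("interface residue list must be non-empty")
    total = len(residues)
    return {
        name: 100.0 * sum(1 for r in residues if r in members) / total
        for name, members in RESIDUE_CLASSES.items()
    }
```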
Here we provide a structurally unique data set of two-chain interfaces derived from the PDB. The interfaces are clustered based on their spatial structural similarities, regardless of the connectivity of their residues on the protein chains. The data set includes 3799 clusters, compared to 351 in 1996. This substantially more diverse data set reflects both the growth in the number of structures as well as the larger number of higher molecular weight proteins currently in the PDB. The comparison of the old and new data sets indicates that the number of newly found interface clusters has increased much more rapidly compared to the number of the available new PDB structures. This may suggest that the number of unique interfaces has still not reached its upper limit.
We divide the clusters into three types: Type I clusters consist of similar interfaces whose parent chains are also similar. In Type II clusters, the interfaces are similar; however, the overall structures of the parent proteins from which the interfaces derive are different. In all Type II cases that we have studied, the clustered proteins belong to different SCOP families, with different functions. The Type III category comprises clusters of interfaces where only one side of the interface is similar and the other side differs. Type III clusters illustrate that a binding site can interact with more than one chain, with different geometries, sizes, and composition. One of the paradigms in protein science states that similar global structures may have different functions. Our observations suggest an extension of this paradigm: Similar interface architectures may have different functions. As in protein structures, evolution has reused “good,” favorable interface structural scaffolds and adapted them to diverse functions. The functions extend from enzymes/inhibitors to toxins and immunoglobulins. We did not observe homodimers in Type II clusters. This is probably due to the smaller sizes of the monomers and the extensive interfaces in the two-state homodimers that cover large portions of the chains. As expected, we find that multifunctional interface clusters consisting of helices largely derive from proteins whose functions relate to muscle and to membranes.
The observation that globally different protein structures associate in similar ways (i.e., Type II) to yield similar motifs is interesting. Clearly, there is a very large number of ways that monomers can combinatorially assemble. Remarkably, among these there are preferred interface architectures, and these are similar to those observed in monomers (Tsai et al. 1998b). This observation both underscores the view that the number of favorable motifs is limited in nature and highlights the analogy between binding and folding. It is further reminiscent of the combinatorial assembly of protein building blocks in folding (Tsai and Nussinov 1997).
We hope that this diverse, structurally nonredundant data set will be useful in a broad range of studies, such as deriving profiles of binding sites, elucidation of the determinants of protein–protein interactions, and identification of residues contributing to the stabilization of the protein associations and those playing a role in a specific protein function. The data set should allow extensive comparisons between binding and folding and derivation of motifs across interfaces. This data set should further be useful in construction of protein networks, and allow studies of structurally conserved residue hot spots. We expect it to be useful in studies of evolutionary conservation, recognition, binding, and function.
Electronic supplemental material
Supplemental material includes two appendices: (1) a list of 3799 cluster representatives; (2) a list of all nonredundant two-chain interface clusters.
We thank Drs. Buyong Ma, K. Gunasekaran, S. Kumar, D. Zanuy, H.-H.(G.) Tsai, and members of the Nussinov-Wolfson group, in particular Maxim Shatsky, for help with MultiProt, and Inbal Halperin and Shira Mintz for many useful comments and suggestions. We thank Dr. Jacob V. Maizel for discussions and encouragement. We thank Dr. A. Gursoy and S. Aytuna for their helpful discussions. The research of R.N. and H.W. in Israel has been supported in part by the Center of Excellence in Geometric Computing and its Applications, funded by the Israel Science Foundation (administered by the Israel Academy of Sciences). This project has been funded in whole or in part with Federal funds from the National Cancer Institute, NIH, under contract number NO1-CO-12400.
The publisher or recipient acknowledges right of the U.S. Government to retain a nonexclusive, royalty-free license in and to any copyright covering the article.