A new, structurally nonredundant, diverse data set of protein–protein interfaces and its implications


  • Ozlem Keskin,

    Corresponding author
    1. Koc University, Center for Computational Biology and Bioinformatics and College of Engineering, Rumelifeneri Yolu, Sariyer, Istanbul 34450, Turkey
    2. Basic Research Program, Science Applications International Corporation (SAIC)-Frederick, Inc., Laboratory of Experimental and Computational Biology, National Cancer Institute (NCI)-Frederick, Frederick, Maryland 21702, USA
    • NCI-Frederick, Building 469, Room 151, Frederick, MD 21702, USA; fax: (301) 846-5598.
  • Chung-Jung Tsai,

    1. Basic Research Program, Science Applications International Corporation (SAIC)-Frederick, Inc., Laboratory of Experimental and Computational Biology, National Cancer Institute (NCI)-Frederick, Frederick, Maryland 21702, USA
  • Haim Wolfson,

    1. School of Computer Science, Tel Aviv University, Tel Aviv 69978, Israel
  • Ruth Nussinov

    Corresponding author
    1. Basic Research Program, Science Applications International Corporation (SAIC)-Frederick, Inc., Laboratory of Experimental and Computational Biology, National Cancer Institute (NCI)-Frederick, Frederick, Maryland 21702, USA
    2. Sackler Institute of Molecular Medicine, Department of Human Genetics and Molecular Medicine, Sackler School of Medicine, Tel Aviv University, Tel Aviv 69978, Israel
    • NCI-Frederick, Building 469, Room 151, Frederick, MD 21702, USA; fax: (301) 846-5598.


Here, we present a diverse, structurally nonredundant data set of two-chain protein–protein interfaces derived from the PDB. Using a sequence order-independent structural comparison algorithm and hierarchical clustering, 3799 interface clusters are obtained. These yield 103 clusters with at least five nonhomologous members. We divide the clusters into three types. In Type I clusters, the global structures of the chains from which the interfaces are derived are also similar. This cluster type is expected because, in general, related proteins associate in similar ways. In Type II, the interfaces are similar; however, remarkably, the overall structures and functions of the chains are different. The functional spectrum is broad, from enzymes/inhibitors to immunoglobulins and toxins. The fact that structurally different monomers associate in similar ways suggests “good” binding architectures. This observation extends a paradigm in protein science: It has been well known that proteins with similar structures may have different functions. Here, we show that it extends to interfaces. In Type III clusters, only one side of the interface is similar across the cluster. This structurally nonredundant data set provides rich data for studies of protein–protein interactions and recognition, cellular networks, and drug design. In particular, it may be useful in addressing the difficult question of which ways of interacting are favorable for proteins. (The data set is available at http://protein3d.ncifcrf.gov/∼keskino/ and http://home.ku.edu.tr/∼okeskin/INTERFACE/INTERFACES.html.)

Most, if not all, biological processes are regulated through association and dissociation of protein molecules. These processes include, but are not restricted to, hormone–receptor binding, protease inhibition, antigen–antibody recognition, signal transduction, enzyme–substrate binding, vesicle transport, RNA splicing, and gene activation. In a pioneering study almost 30 years ago, Chothia and Janin (1975) addressed the profound problem of protein–protein recognition. Jones and Thornton (1996) have reviewed the properties of different types of protein–protein complexes. Figuring out the principles of protein–protein interactions is critically important for understanding the relationship between biological function and intermolecular complex formation (Katchalski-Katzir et al. 1992; Jones and Thornton 1996; Kleanthous 2000; Kuhlmann et al. 2000; Ma et al. 2001; Nooren and Thornton 2003). Understanding these principles is essential for predicting the conformations of multimolecular assemblies, for predicting cellular pathways, and for drug design. In addition, they should be useful in predicting docked complexes. Furthermore, because binding and folding are similar processes with similar underlying mechanisms, studies of intermolecular binding are expected to aid in understanding folding.

From the computational standpoint, there are a number of ways to study protein–protein interactions. Among these, one may focus on the details of the recognition process in one or few interacting proteins (Tramontano and Macchiato 1994; Wallis et al. 1998; Kuhlmann et al. 2000; Todd et al. 2002; Arkin et al. 2003), or carry out a broader analysis of different two-chain complexes (Tsai et al. 1996, 1998a,b; Tsai and Nussinov 1997; Bogan and Thorn 1998; Keskin et al. 1998; Xu et al. 1998; LoConte et al. 1999; Ma et al. 2001; Valdar and Thornton 2001a,b; Chakrabarti and Janin 2002; Fariselli et al. 2002). Both approaches have advantages and disadvantages. In principle, focusing on given complexes enables following the binding process, and dissecting the contributions of particular interactions. On the other hand, analysis of a data set of protein–protein interfaces allows assessment of the interactions in a statistically meaningful way. It allows using the properties of these for binding site prediction (Fariselli et al. 2002). It further allows studies of functionally distinct interfaces to identify residues critical for function and stability (Bogan and Thorn 1998; Hu et al. 2000; DeLano 2002) and facilitates analysis of the interactions in two- versus three-state complexes (Tsai and Nussinov 1997; Tsai et al. 1998b). Yet, despite the clear advantages of a data set of nonredundant protein–protein interfaces, from the technical standpoint, its creation presents difficulties. Interfaces consist of interacting residues that belong to two different chains, along with residues in their spatial vicinity. Thus, interfaces consist of pieces of each of the chains, and some isolated residues. To generate a nonredundant data set, it is essential to carry out structural comparisons of the interfaces independent of their amino acid sequence order, because the residue order may vary (Tsai et al. 1996).

Using the computer vision-based Geometric Hashing structural comparison technique (Nussinov and Wolfson 1991; Tsai et al. 1996), we compare protein–protein interfaces derived from the PDB to obtain hierarchically organized interface clusters. Next, we use MultiProt (Shatsky et al. 2002, 2003) to multiply align large numbers of structures simultaneously. MultiProt also disregards the order of the residues on the chains, allowing us to obtain the common patterns within the clusters. These two methods are able to exhaustively handle all interfaces in the PDB to create such a data set. The current work is a considerable extension of our previous study (Tsai et al. 1996). In that earlier work, we started with 1629 two-chain interfaces, from which 351 distinct families were generated. These structurally similar interface families provided a rich data set, allowing examinations of protein interfaces from different perspectives. However, recently there has been an extremely large increase in the number of known three-dimensional protein structures. In this study, we have made use of all protein assemblies including oligomeric proteins, viral capsids, muscle fibers, enzyme/inhibitor, and antibody/antigen complexes available in the PDB (Berman et al. 2000). The large increase in the PDB has enabled us to filter the clustered interfaces further and to remove similar entries more strictly than previously, making conservation studies easier to analyze and interpret. The newly generated clustered interface data set is an order of magnitude larger (from 351 clusters in 1996 to 3799 now), which makes it possible to address a broad range of questions, such as whether the increase in the number of known protein structures gives rise to new families of interfaces or whether new members are added to the already known ones. This may yield clues to the completeness of both protein folds and protein interface architectures.
Further, protein–protein recognition relates to the physical and chemical properties of the interfaces (Chothia and Janin 1975; Tsai and Nussinov 1997; LoConte et al. 1999; Hu et al. 2000; Ma et al. 2001). Thus, interfaces can be characterized in terms of their geometrical properties such as size, shape, and complementarity and chemical properties, such as hydrophobicity, salt bridges, hydrogen bonds, disulfide bonds, and packing, the presence/absence of water molecules at certain sites, the total or the nonpolar buried surface areas, residue composition, and family conservation. Together, these properties play a role in determining the chemical and physical nature, and thus biological function, of protein complexes. The diverse data set makes it possible to investigate binding across and within families.

Our interface clusters contain similar interface architectures formed by two chains. In most cases, these similar interfaces are derived from globally similar protein chains. These are called Type I interfaces. However, among our clusters, there are some with similar interfaces yet dissimilar global protein folds. These proteins have different functions. These interfaces are called Type II clusters. These clusters are good candidates for detailed structural/functional studies. Because the overall structures of the proteins are different, it is likely that although the interfaces in their complexed states have similar structures, the distributions and redistribution of their substates are different, the outcome of the change in their binding states (Kumar and Nussinov 2001; Ma et al. 2002). On the other hand, they may bind similar drugs and interfere with complex formation.

Furthermore, the fact that different proteins bind in similar ways to yield similar interface architectures suggests that these Type II interfaces represent favorable structural scaffolds. They lend stability to the protein–protein interactions (Cunningham and Wells 1991; Wells and deVos 1996; DeLano et al. 2000) and afford functional flexibility. This similar-structure, different-function situation is reminiscent of protein structures. The recurrence of folds in single chains has led to the proposition of the paradigm of the limited number of folding motifs, regardless of the diversity of protein functions (Chothia 1992). Evolution has repeatedly utilized favorable, stable folds, adapting them to a broad range of regulatory, enzymatic, and packaging/structural roles. Here we show that different folds combinatorially assemble to yield similar motifs in the interfaces. The preference of different folds to associate in similar ways illustrates that this paradigm is universal, whether for single chains in folding or for protein–protein association in binding. Below, we enumerate examples of interfaces that are found in the same structural cluster yet have different global protein structures and different functions. In the third, Type III cluster category, only one side of the interfaces is similar across the cluster. This interface category illustrates that a given protein binding site may bind different geometries of the complementary protein.

The general similarity in architectures between interfaces and protein cores illustrates that binding and folding are similar processes (Tsai and Nussinov 1997; Tsai et al. 1998b). Combined, this diverse hierarchical data set, reflecting almost 22,000 two-chain interfaces in the (July 2002) PDB will be invaluable: Cluster members may provide hints to presumed protein specificity; comparisons across different clusters may yield clues to principles governing protein recognition and stability (Lichtarge et al. 1996; Kuntz et al. 1999; Hu et al. 2000; Brooijmans et al. 2002; Fernández and Scheraga 2003; Ma et al. 2003). The clustered data set may be a rich source for various types of analyses of protein interfaces. The old (1996) data set was used to identify some chemical and physical properties of the interfaces: It was used to extract computational hot spots in protein–protein interfaces, which were observed to be largely polar and to correlate well with alanine scanning mutagenesis (Hu et al. 2000; Ma et al. 2003). In another study, the data set was useful for deriving residue–residue empirical interaction parameters in the core regions of proteins and their comparison with the protein interfaces (Keskin et al. 1998). It was used to study the strength of the hydrophobic effect at the interfaces compared to protein cores, and to study the types of architectures in the interfaces versus in single chains (Tsai and Nussinov 1997; Tsai et al. 1997, 1998a,b). It was used to compare the number of hydrogen bonds in the single chains versus the interfaces and to study the evolution of protein dimerization (Xu et al. 1998). The enlarged data set is currently being used to predict interacting pairs of proteins. As such, it may assist in providing some clues for networks of protein interactions. 
It will be used to extract the structurally and sequentially conserved residues across the interfaces, that is, coupled mutations among families and to derive profiles of interface families. These are expected to be particularly useful in prediction of protein function, because they should be more robust than single interfaces. We are further using it for studies of interface hot spot organization. The data set should be useful in inferring cellular networks and in the design of small molecules to block protein–protein binding. Furthermore, our clusters allow investigation of proteins where the global folds are similar while their interfaces are not found in the same cluster. These may have different functions. A broad study of this question is now in progress.


Construction of the new nonredundant data set of protein–protein interfaces

Definition of the interface and its application to the Protein Data Bank

Here, we define an interface to be the region between two polypeptide chains that are not covalently linked. The residues that interact with each other across the binding region compose the interface between the two chains. The selection of a residue in each chain is based on how close this residue is to the residues in the accompanying chain. Two residues (one from each chain), which are in direct contact, are called interacting residues. Residues in the vicinity of interacting residues are nearby (neighboring) residues. The latter provide the structural scaffold of the interfaces.

There are several schemes to define residues in two-chain interfaces as interacting and nearby. For example, two residues may be defined as interacting across the interface if the distance between their Cα atoms, one from each chain, is less than 9 Å or, alternatively, if the distance between any two atoms of two residues from different chains is less than the sum of their corresponding van der Waals radii plus 0.5 Å (Tsai et al. 1996). Here we have adopted the second scheme. A residue is defined to be a “nearby” residue if the distance between its Cα and a Cα atom of an interacting residue is under 6 Å. Nearby residues are important for the clustering of the interfaces. They provide information about the architecture of the interfaces and make it possible to align one interface structure against another. Figure 1 illustrates an example of interfaces among three chains of a protein complex (a transferase; PDB code 1gwc). Here, three interfaces could be formed between chain pairs A–B, B–C, and A–C. As seen from the figure, only the first two interfaces have been formed. There is no interface between chains A and C, because these two chains are not close enough to each other to form an interface. Figure 1 shows these two interfaces in detail. Red and green mark the interacting and the neighboring (nearby) residues between chains A and B; magenta and cyan mark the interacting and the neighboring (nearby) residues between chains B and C. The side chains of the interacting residues are fully displayed. To guide the eye, the three chains are colored separately: A in yellow, B in gold, and C in dark green.
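The two definitions adopted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the residue dictionaries, the helper names, and the van der Waals radii (Bondi-style values) are hypothetical and are not taken from the original work, and nearby residues are assumed to be searched for within the same chain as the interacting residue.

```python
import math

# Assumed van der Waals radii in angstroms (Bondi-style values);
# the radii used in the original study are not stated in the text.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def residue(res_id, xyz):
    """Build a toy one-atom residue whose only atom doubles as its C-alpha."""
    return {"id": res_id, "ca": xyz, "atoms": [("C", xyz)]}


def interacting_residues(chain_a, chain_b, tolerance=0.5):
    """Return the residue ids of each chain that are in direct contact.

    Two residues interact if any atom pair (one atom from each chain) is
    closer than the sum of the two vdW radii plus `tolerance` angstroms.
    """
    hits_a, hits_b = set(), set()
    for res_a in chain_a:
        for res_b in chain_b:
            in_contact = any(
                math.dist(xyz_a, xyz_b)
                < VDW_RADII[el_a] + VDW_RADII[el_b] + tolerance
                for el_a, xyz_a in res_a["atoms"]
                for el_b, xyz_b in res_b["atoms"]
            )
            if in_contact:
                hits_a.add(res_a["id"])
                hits_b.add(res_b["id"])
    return hits_a, hits_b


def nearby_residues(chain, interacting_ids, ca_cutoff=6.0):
    """Return non-interacting residues whose C-alpha lies within
    `ca_cutoff` angstroms of the C-alpha of an interacting residue."""
    anchors = [r["ca"] for r in chain if r["id"] in interacting_ids]
    return {
        r["id"]
        for r in chain
        if r["id"] not in interacting_ids
        and any(math.dist(r["ca"], ca) < ca_cutoff for ca in anchors)
    }


# Toy coordinates, invented for illustration only.
chain_a = [residue("A1", (0.0, 0.0, 0.0)),
           residue("A2", (5.0, 0.0, 0.0)),
           residue("A3", (12.0, 0.0, 0.0))]
chain_b = [residue("B1", (0.0, 3.5, 0.0))]

contacts_a, contacts_b = interacting_residues(chain_a, chain_b)
nearby_a = nearby_residues(chain_a, contacts_a)
```

In the toy data, A1 and B1 are 3.5 Å apart, which is below the 3.9 Å carbon–carbon cutoff, so they are interacting; A2 is 5 Å from A1 and therefore counts as nearby, while A3 is neither.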

We have applied these criteria to all multichain PDB entries in the database. On July 18, 2002, there were 18,687 entries in the PDB that included 35,112 single chains. PDB entries that contain more than two chains were used to get two-chain combinations. Interfaces between any two chains were extracted if each of the two chains had at least 10 residues. These included all two-chain interfaces from dimers, trimers, and higher complexes of protein–protein and protein–peptide complexes. As a result, 21,686 two-chain interfaces were obtained. Following the nomenclature of Tsai et al. (1996), we have named the interfaces as follows: If the PDB code of a protein is 1gwc and it has two chains A and B, the interface is named 1gwcAB (see Fig. 1), indicating that there is an interface between chains A and B of protein 1gwc.
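The enumeration of two-chain combinations and the naming convention can be illustrated with a short sketch. The function names and the chain lengths below are hypothetical placeholders; only the 10-residue minimum and the naming pattern (PDB code followed by the two chain identifiers) come from the text.

```python
from itertools import combinations


def interface_name(pdb_code, chain_1, chain_2):
    """Name an interface as PDB code plus the two chain identifiers."""
    return f"{pdb_code}{chain_1}{chain_2}"


def candidate_interfaces(pdb_code, chain_lengths, min_residues=10):
    """List every two-chain combination whose chains are long enough.

    `chain_lengths` maps chain id -> residue count. A candidate becomes an
    actual interface only if the two chains are also in contact, which is
    checked separately.
    """
    eligible = sorted(c for c, n in chain_lengths.items() if n >= min_residues)
    return [interface_name(pdb_code, a, b) for a, b in combinations(eligible, 2)]


# Chain lengths are placeholder values, not the real lengths in 1gwc.
candidates = candidate_interfaces("1gwc", {"A": 200, "B": 200, "C": 200})
```

For a three-chain entry this yields three candidates (AB, AC, BC). In the 1gwc example only AB and BC survive the contact test, since chains A and C do not touch.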

Structural comparisons

Constructing a data set of nonredundant interfaces is not straightforward. The main difficulty is that interfaces consist of two separate chains with discontinuous pieces of the polypeptides. Although we seek similar spatial arrangements of the polypeptide pieces between the interfaces, their sequence order may differ. Furthermore, some of the pieces may consist of isolated amino acids. Consequently, any algorithm that is sequence- and directionality-dependent is not applicable to the interface comparison problem. On the other hand, computer vision-based structural alignment techniques view protein structures as collections of points in 3D space. Therefore, they are ideally suited to comparisons between protein surfaces and interfaces (Nussinov and Wolfson 1991; Tsai et al. 1996). The algorithms used in this study compare all available protein interfaces, allowing the clustering of the interfaces into families with distinct structural features.

The first step is the comparison of interfaces by the Geometric Hashing algorithm. Details of the algorithm were given in Tsai et al. (1996) and in Nussinov and Wolfson (1991). The algorithm uses the Cα coordinates and no connectivity among these Cα points is taken into account in the matching. Figure 2a shows a protein in 3D space representation with its Cα atoms denoted as points. This is the same protein as in Figure 1 (the same view). The goal of the algorithm is to find the most similar sets of points common to both protein interfaces. The algorithm has three consecutive steps: hash table construction, voting, and extension processes.

  • 1. Hash table construction is used to find the local similarity between two sets of points. The coordinates of every three consecutive Cα atoms ($C_{i-1}$, $C_i$, $C_{i+1}$) along the protein chain define an orthogonal reference frame centered on $C_i$, as follows: $R_x = C_{i-1} - C_i$; $V_t = C_{i+1} - C_i$; $R_z = R_x \times V_t$; $R_y = R_x \times R_z$; where $R_x$, $R_y$, $R_z$ are the x, y, z axes of the reference frame and $\times$ denotes the cross-product. Each point within a cutoff distance of 15 Å around the i-th point is projected onto this orthogonal reference frame. Thus, for the i-th element in the table, both the identity of the Cα atom and the neighboring projected coordinates are kept. This is the preprocessing step.
  • 2. Voting is carried out to compare the two structures. If a local similarity (a large “enough” number of votes for a given reference frame) is detected between the two proteins, the transformation is computed and the matching Cα atom pairs from the two proteins are superimposed. The similarity between the proteins is computed in terms of the root mean squared deviation (RMSD) between them.
  • 3. The extension step is used to find the best global alignment starting with the best local alignment obtained in the previous step. This is an iterative process. The interfaces are superimposed, and a new list of matching pairs is reassigned, with the distance between every matched pair below a threshold (here 2.5 Å). If the distance criterion cannot find a unique solution, the best global alignment is found using the similarity score. This score favors solutions with better connectivity. For a complete description of the method see Tsai et al. (1996), Nussinov and Wolfson (1991), Bachar et al. (1993), and Fischer et al. (1994).
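The reference-frame construction of the preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the axes are normalized to unit length (an assumption, since the text does not say whether they are), and the sketch stops at the projection without building the hash table or voting.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    norm = math.sqrt(dot(u, u))  # zero for collinear triples, which are not handled
    return tuple(a / norm for a in u)


def reference_frame(c_prev, c_i, c_next):
    """Axes built from three consecutive C-alpha positions, centered on c_i."""
    rx = sub(c_prev, c_i)
    vt = sub(c_next, c_i)
    rz = cross(rx, vt)   # perpendicular to the plane of the three atoms
    ry = cross(rx, rz)   # perpendicular to both rx and rz
    return unit(rx), unit(ry), unit(rz)


def project_neighbors(ca_coords, i, cutoff=15.0):
    """Express every C-alpha within `cutoff` of atom i in the frame of atom i."""
    origin = ca_coords[i]
    ex, ey, ez = reference_frame(ca_coords[i - 1], origin, ca_coords[i + 1])
    projected = {}
    for j, point in enumerate(ca_coords):
        if j == i:
            continue
        d = sub(point, origin)
        if math.sqrt(dot(d, d)) < cutoff:
            projected[j] = (dot(d, ex), dot(d, ey), dot(d, ez))
    return projected


# Toy coordinates (angstroms); the last point lies outside the 15 A cutoff.
ca_coords = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 20.0)]
frame = reference_frame(ca_coords[0], ca_coords[1], ca_coords[2])
projected = project_neighbors(ca_coords, 1)
```

Because the projected coordinates depend only on the local geometry, two interfaces that contain the same spatial arrangement of Cα atoms produce matching entries regardless of where those atoms sit in the sequence.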

In this study, the measure of the similarity between two protein–protein interfaces is based on the extent of the geometrical superposition between their corresponding Cα atoms, the percent residue identity in the match, and the difference in sizes between the interfaces. The superposition between two interfaces computed by the Geometric Hashing algorithm yields a list of matched Cα atom pairs. The percent residue identity is the count of identical residues in the match divided by the total number of matched pairs. The RMSD is not considered in measuring the similarity between two interfaces. Instead, we compute a “connectivity score” to express the quality of a geometrical superposition. If the residue connectivity information is excluded, the similarity score is equal to the number of matched pairs. The data set contains both biological (functional) and crystal packing interfaces, because unfortunately, to date, there is no clear way to distinguish between them. Nevertheless, because crystal interfaces are often small, we exclude an interface if it has fewer than 10 residues that are in contact in a given chain.
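Two of the quantities above are simple enough to write down directly. The sketch below assumes that a match is available as a list of residue-name pairs and that interface size is a residue count; both representations, and the function names, are illustrative choices rather than details given in the text. The connectivity score is not reproduced here because its formula is not stated in this section.

```python
def percent_identity(matched_pairs):
    """Percent of matched C-alpha pairs whose residue types are identical."""
    if not matched_pairs:
        return 0.0
    identical = sum(1 for res_1, res_2 in matched_pairs if res_1 == res_2)
    return 100.0 * identical / len(matched_pairs)


def size_difference(n_residues_1, n_residues_2):
    """Absolute difference in residue count between two interfaces
    (assumed definition of the size difference)."""
    return abs(n_residues_1 - n_residues_2)


# Invented match list: two of the four matched pairs are identical.
match = [("ALA", "ALA"), ("GLY", "SER"), ("LEU", "LEU"), ("VAL", "ILE")]
identity = percent_identity(match)
```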

The clustering algorithm

Clustering is a multivariate problem with two criteria. First, members in each cluster should be similar to each other, and second, members of one cluster should be different from members of all others (Gordon 1981). The frequently adopted clustering approach for classifying a set of structures consists of two steps. First, the similarity between any two structures is calculated, and second, a set of clusters is generated by clustering the two most similar structures at a time and selecting one of them to represent the cluster. This procedure is iterated, until the extent of similarity between the unclustered structures and the cluster representative is below the specified threshold. Here, we have adopted a heuristic iterative clustering procedure. At each iteration cycle, the similarity definition is gradually relaxed. This yields a hierarchy of grouping of clusters with different similarity thresholds. In the first phase of an iteration, the first entry in the initial list of interfaces forms a new cluster. The next interface in the list is compared to the first. If the similarity between them is above a predefined threshold, the second is added to the cluster of the first, or else it forms a new cluster. Next, the third interface is compared with the clusters already formed. This procedure is repeated, until all structures are assigned to clusters. At the end of this procedure, the similarity between each member of the individual cluster and its corresponding putative representative should be above the threshold prescribed for the current clustering cycle. In this phase, pairwise structural comparisons of structures are carried out sparsely, greatly reducing the computational costs. In the second phase, exhaustive pairwise comparisons are performed within each cluster. These extensive comparisons fulfill two functions. First, the structure that is most similar to all other structures in its cluster is selected as the representative for the next iteration. 
Second, if a structure is found dissimilar to other structures, it is removed from the cluster. Such a structure forms a new, one-member cluster for the next iteration. A schematic representation of the algorithm is given in Figure 2B. This clustering procedure is the same as that used previously (Tsai et al. 1996).
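One iteration cycle of the two-phase procedure described above can be sketched as follows. This is a simplified reading under explicit assumptions: `similarity` is any symmetric scoring function supplied by the caller, the representative is chosen by highest mean similarity to the other members, and a member counts as dissimilar when it falls below the threshold against the new representative. The original procedure may differ in these details.

```python
def cluster_once(items, similarity, threshold):
    """Run one clustering cycle and return a list of clusters.

    Each cluster is a list whose first element is its representative.
    """
    # Phase 1: sparse comparisons, each item against cluster representatives only.
    clusters = []
    for item in items:
        for cluster in clusters:
            if similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])

    # Phase 2: exhaustive pairwise comparisons within each cluster.
    refined = []
    for cluster in clusters:
        n = len(cluster)
        if n == 1:
            refined.append(cluster)
            continue

        def mean_similarity(i):
            total = sum(similarity(cluster[i], cluster[j]) for j in range(n) if j != i)
            return total / (n - 1)

        rep_index = max(range(n), key=mean_similarity)
        rep = cluster[rep_index]
        others = [cluster[j] for j in range(n) if j != rep_index]
        kept = [rep] + [m for m in others if similarity(m, rep) >= threshold]
        outliers = [m for m in others if similarity(m, rep) < threshold]
        refined.append(kept)
        refined.extend([m] for m in outliers)  # new one-member clusters
    return refined


# Toy data: numbers stand in for interfaces, with an invented similarity.
def toy_similarity(a, b):
    return 1.0 / (1.0 + abs(a - b))


result = cluster_once([1.0, 1.1, 5.0, 5.2, 1.3], toy_similarity, threshold=0.7)
```

In the toy run, 1.1 replaces 1.0 as the representative of the first cluster because it is, on average, closest to the other members. A hierarchy would follow from feeding the representatives into the next cycle with a relaxed threshold.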


The data set at the different clustering cycles

Table 1 lists the threshold parameters applied in successive clustering cycles to calculate the similarities between interfaces. The first column gives the iteration cycle. There are six successive clustering cycles (A through F). The second column gives the number of interface clusters at the beginning and end of the iteration. For example, during iteration A, there were initially 21,686 interfaces (in the first cycle, this number is equal to the number of two-chain interfaces in the PDB). Using the similarities of the structures and sequences (with the parameters listed in columns 3–5) the number decreased to 16,446. The connectivity score takes into account the residue connectivity in the polypeptide chains. The score favors a match with consecutive residues. At the end of the sixth (final) cycle, we obtained 3799 distinct clusters. After this cycle, members of each cluster had at least 0.5 connectivity score. There was no amino acid similarity constraint and the maximal size difference between interfaces was 50 residues.

A comparison of the new and old data sets of interfaces shows a substantial increase, from 351 to 3799. The data set and the clustering results are available at http://protein3d.ncifcrf.gov/∼keskino/ and http://home.ku.edu.tr/∼okeskin/INTERFACE/INTERFACES.html. It is of interest to examine whether this increase is the outcome of the increased number of PDB entries or of new architectures. Figure 3 shows the ratio of increase in the PDB entries, the SCOP families (the 1996 and 2002 versions, respectively; Murzin et al. 1995) and in the interface clusters (the old data set [Tsai et al. 1996] and this work). We observed that the number of entries in the PDB increased sixfold and the number of SCOP families increased threefold, whereas the increase of interface clusters is 10-fold. Thus, it appears that the increase in the PDB over the last seven years has allowed a more diversified data set for interfaces. This may also be the outcome of the rapid growth in the determination of high molecular weight proteins that are likely to include more than one chain.

Generation of a nonhomologous data set of interfaces: Sequence alignment, excluding chains with high sequence similarity

To have a nonredundant set of interfaces, sequences within each family were compared using CLUSTALW (Higgins et al. 1994) and the BLOSUM90 substitution matrix (Henikoff and Henikoff 1992). To eliminate redundancy, a threshold similarity of 50% was imposed. Thus, whenever two sequences in a cluster share a sequence similarity of more than 50%, one of them is deleted from the cluster. This yields a data set of interfaces with structurally similar but sequentially dissimilar members. Further, to constitute a valid cluster of interfaces, the cluster should have at least five members (10 chains). These filtering procedures reduce the number of clusters from 3799 to 103.
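The filtering logic can be sketched as a greedy pass over each cluster. The sketch makes several assumptions that the text leaves open: which member of a similar pair is deleted (here, the later one in list order), how a single similarity value is obtained for interfaces that each contain two chains, and the callable `similarity_pct`, which is a hypothetical stand-in for a score derived from a CLUSTALW alignment.

```python
def filter_homologs(members, similarity_pct, max_similarity=50.0):
    """Keep a member only if it is at most `max_similarity` percent
    similar to every member already kept (greedy, order-dependent)."""
    kept = []
    for member in members:
        if all(similarity_pct(member, k) <= max_similarity for k in kept):
            kept.append(member)
    return kept


def valid_clusters(clusters, similarity_pct, min_members=5):
    """Apply the homolog filter and drop clusters left with too few members."""
    result = {}
    for representative, members in clusters.items():
        kept = filter_homologs(members, similarity_pct)
        if len(kept) >= min_members:
            result[representative] = kept
    return result


# Invented scores: m1 and m2 are 80% similar, every other pair 20%.
toy_scores = {frozenset(("m1", "m2")): 80.0}


def toy_similarity(a, b):
    return toy_scores.get(frozenset((a, b)), 20.0)


toy_clusters = {
    "c1": ["m1", "m2", "m3", "m4", "m5", "m6"],
    "c2": ["n1", "n2", "n3", "n4"],
}
nonredundant = valid_clusters(toy_clusters, toy_similarity)
```

Here cluster c1 loses m2 but keeps five members and survives, whereas c2 has only four members and is dropped.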

The 3799 original clusters listed by their representatives are given in Appendix A (and are available at our Web site at http://protein3d.ncifcrf.gov/∼keskino/). The numbers in parentheses are the number of all members included in the corresponding clusters. In all cases, both chains of the interface of each cluster member superimpose on those of the cluster representative within the similarity criteria provided in Table 1. Appendix B lists the nonredundant interface clusters. These clusters have at least five members, and at most 50% sequence identity among their members. This separate listing is given for the convenience of users who wish to carry out statistical analysis of the data set. We have further carried out multiple structure comparisons of all cluster members listed in Appendix B, using MultiProt (Shatsky et al. 2002, 2003). Appendices A and B are provided as Supplemental Material. Clusters for which MultiProt detected a consensus core encompassing all members from both chains and with similar function were labeled as Type I, those with different functions were labeled as Type II. On the other hand, the clusters where MultiProt found a consensus for only one of the chains, were termed Type III. Fifty-four of the clusters are Type I and II interfaces; the rest are Type III aligned interfaces.
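The labeling rule in the paragraph above reduces to a small decision function. The sketch assumes that the MultiProt outcome is summarized as two booleans (whether a consensus core covers all members on each chain) and that functional similarity has been judged separately; the function name and these inputs are illustrative.

```python
def classify_cluster(consensus_chain_1, consensus_chain_2, similar_function):
    """Return the cluster type implied by the consensus and function flags."""
    if consensus_chain_1 and consensus_chain_2:
        # Consensus on both sides of the interface.
        return "Type I" if similar_function else "Type II"
    if consensus_chain_1 or consensus_chain_2:
        # Consensus on one side only.
        return "Type III"
    return None  # no consensus on either chain; not covered by the text


labels = [
    classify_cluster(True, True, True),
    classify_cluster(True, True, False),
    classify_cluster(True, False, False),
]
```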

Multiple structural alignment of the interfaces with MultiProt

MultiProt detects recurring motifs in an ensemble of proteins by simultaneously aligning multiple protein structures (Shatsky et al. 2002, 2003). The algorithm considers all protein structures at the same time, rather than initiating from a pairwise-imposed molecular seed. This eliminates the bias in the superposition and finds the largest common substructure of Cα atoms that appears in the structural set. Furthermore, MultiProt efficiently finds high-scoring partial multiple alignments for all possible numbers of molecules in the input. That is, it does not require that all input molecules participate in the alignment. Because it is sequence order-independent, it can be applied to protein surfaces and protein–protein interfaces effectively. We have used MultiProt to align the interfaces of our clusters to find consensus motifs of the members' interfaces. To qualify as a consensus motif, at least 10 residues have to match with an RMSD of at most 3.5 Å. Because each member can have noncontiguous residues in the interface, MultiProt is an extremely useful tool for our purpose.
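The acceptance criterion for a consensus motif can be illustrated with a pairwise check. This is a deliberate simplification: MultiProt works on many structures at once, whereas the sketch below takes two lists of matched Cα coordinates that are assumed to be already superimposed. The function names are hypothetical; only the thresholds (10 residues, 3.5 Å) come from the text.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean squared deviation between matched, superimposed points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


def is_consensus_motif(coords_a, coords_b, min_residues=10, max_rmsd=3.5):
    """Check the size and RMSD thresholds for a matched set of C-alpha atoms."""
    return len(coords_a) >= min_residues and rmsd(coords_a, coords_b) <= max_rmsd


# Invented coordinates: ten points, and the same points shifted by 1 A along y.
motif_a = [(3.8 * i, 0.0, 0.0) for i in range(10)]
motif_b = [(3.8 * i, 1.0, 0.0) for i in range(10)]
accepted = is_consensus_motif(motif_a, motif_b)
```

The shifted copy has an RMSD of 1 Å over 10 matched residues and is accepted; the same match truncated to nine residues would be rejected on size alone.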

Interface family types

Type I: Similar interfaces, similar global protein folds

In most cases, if the interfaces are similar, the overall protein folds are also similar. Such similar interface, similar fold clusters contain a single family. The list of the interface clusters and the members of these clusters are given in Appendix B (nonitalic entries). The interface clusters include homodimeric enzyme complexes (transferases, oxidoreductases, etc.), enzyme/inhibitor complexes, antibody/antigen complexes, as well as toxins. They have different polar/nonpolar compositions and different accessible surface areas. Some examples are given in Figure 4. In the figure three members of the 1cydAD cluster are presented. This cluster is formed by reductases, oxidoreductases (PDB codes: 1cyd, 1e3s, 1hdc, 1i01), and a pterin reductase (PDB code: 1e92).

Type II: Similar interfaces, dissimilar global protein folds

Some clusters belong to a particularly interesting category: In these cases the interfaces are structurally similar; however, the global protein folds are different. These are listed in Table 2 and Appendix B (italic entries). These similar-interface, dissimilar-fold proteins fall into different families (see the SCOP classification, also provided in Table 2, first column). Nevertheless, because they have structurally similar interfaces, they are members of the same clusters. These families have different functions. Thus, interface structural similarity does not ensure global protein structural similarity. Furthermore, it has previously been shown that globally similar protein structures may have different functions (Martin et al. 1998; Orengo et al. 1999; Moult and Melamud 2000; Thornton et al. 2000; Nagano et al. 2002). Cases such as those listed here illustrate that this paradigm can be taken further: Similar interfaces do not imply similar functions.

Figure 5 illustrates some examples from Table 2. Part A shows all members of the cluster. Each case in the figure presents the ribbon diagrams of the proteins that belong to different SCOP families in the same interface cluster, clearly showing that the global structures are different. Part B displays ribbon diagrams of two of the proteins with their common interfaces highlighted with yellow. Note that there are three clusters in Table 2 where the representative of the cluster does not appear in the list of family members. These cases are cellulose-binding domain family III, MHC antigen-recognition domain, and nucleotide and nucleoside kinases. In these cases, while the representative aligned with each cluster member, it did not align well with all members simultaneously, suggesting some slight deviations in the multiple structural superposition.

Type III: One side similar interfaces, dissimilar global protein folds

Our data set also contains clusters where one chain of the interface is conserved while the second varies. Figure 6 presents an example of such a cluster. Although this figure specifically shows an antibody interacting with four partners (three of them peptides), Type III interfaces are not restricted to antibody/antigen or protein/peptide complexes. This type includes protein complexes with a diverse range of biological functions. For example, one of the Type III clusters contains a homodimeric antioncogene protein (interface ID: 1a1uAC), a homodimeric leucine zipper complex (interface ID: 1a93AB), a homodimeric complex of mannose-binding protein, a lectin (interface ID: 1afa12), a homodimeric transcription regulation protein (interface ID: 1ajyAB), a tetramer of the cytokine ciliary neurotrophic factor (1cnt14), and a homodimeric replication termination protein (1f4kAB). One-chain conserved clusters are very interesting: They can be used to address fundamental questions such as whether nonspecific binding is largely hydrophobic with flatter surfaces, which functions are involved, or whether in one chain-dominant interfaces the second chain is smaller. The data set may bear on longstanding problems relating to binding specificity and selectivity and to specificity with respect to conserved interactions and function. It may also be useful for prediction of residues contributing dominantly to stability.
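The three cluster types described above can be summarized as a small decision rule. The following is only an illustrative sketch: the function name, its arguments, and the use of SCOP family labels as the test for "similar global fold" are assumptions made for this example, not the procedure used to build the data set.

```python
def classify_cluster(both_sides_similar, scop_families):
    """Assign a cluster type from two hypothetical inputs.

    both_sides_similar: True if both chains of the interface are
        structurally similar across the cluster, False if only one is.
    scop_families: SCOP family label of each cluster member
        (assumed here to stand in for global fold similarity).
    """
    if not scop_families:
        raise ValueError("a cluster needs at least one member")
    if not both_sides_similar:
        return "Type III"  # only one side of the interface is conserved
    if len(set(scop_families)) == 1:
        return "Type I"    # similar interfaces, single family
    return "Type II"       # similar interfaces, different families


# Family labels taken from Table 2 (cohesin and fluorescent protein members).
example = classify_cluster(
    True,
    ["Cellulose-binding domain family III", "Fluorescent proteins"],
)
```

With the family labels of the cohesin and fluorescent-protein members from Table 2, this rule returns "Type II".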

Propensities of residues in the interfaces

The relative frequencies of different types of amino acids in the interfaces of protein–protein complexes can be used to derive the propensities of the residues. The overall propensities of the 20 amino acids are calculated for the contacting residues (not including the “nearby”) in the interfaces from the data set containing all interface clusters. We compare the frequency patterns at the binding sites versus those in the overall structures. The propensity (Pi) of a residue (i = Ala,Val, Gly, …) to occur at the interface is calculated as the fraction of the count of residue i in the interface compared with its fraction in the whole chain as

$$P_i = \frac{n_i / N_i}{n / N} \tag{1}$$

where ni is the number of residues of type i at the interface, Ni is the number of residues of type i in the chains, n is the total number of residues in the interface, and N is the total number of residues in the whole chains.
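As a minimal sketch of this calculation, the snippet below counts residues in a toy chain and a toy interface and forms the ratio (n_i/N_i)/(n/N) for each residue type. The function name and the one-letter sequences are hypothetical and chosen only for illustration; they are not taken from the data set.

```python
from collections import Counter


def interface_propensities(chain_residues, interface_residues):
    """Return {residue type: P_i} with P_i = (n_i / N_i) / (n / N)."""
    chain_counts = Counter(chain_residues)          # N_i for each residue type
    interface_counts = Counter(interface_residues)  # n_i for each residue type
    total_chain = len(chain_residues)               # N
    total_interface = len(interface_residues)       # n
    overall_fraction = total_interface / total_chain
    return {
        res: (interface_counts[res] / chain_counts[res]) / overall_fraction
        for res in chain_counts
    }


# Toy example (hypothetical): a 10-residue chain with 4 interface residues.
chain = list("LLLLKKKKAA")
interface = list("LLLK")
p = interface_propensities(chain, interface)
```

Here Leu has n_i = 3, N_i = 4, n = 4, and N = 10, so its propensity is (3/4)/(4/10) = 1.875, while Lys has (1/4)/(4/10) = 0.625. A residue type absent from the interface gets a propensity of 0, so its logarithm is undefined and would need separate handling.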

Figure 7 displays the correlation of our residue propensities with those of Jones and Thornton (1997). The axes represent the natural logarithms of the propensities. A positive logarithmic propensity indicates that a residue is more likely to occur in an interface. A high correlation coefficient (0.91) is obtained over the 20 amino acids. The residue propensities of Jones and Thornton (1997) were calculated from a data set of 63 protein complexes by taking the fraction of accessible surface area that the amino acid has contributed to the interface compared with the fraction of accessible surface area that the amino acid has contributed to the whole surface (i.e., all exposed residues). Thus, their propensities are based on accessible surface areas, whereas ours are based on the frequency of occurrence of the residues in the interface compared with the rest of the chain. To have a more appropriate comparison, we have multiplied the propensity of each residue by its average accessible surface area (Miller et al. 1987) and normalized the results by the surface propensities of the amino acids (Table 2; Ma et al. 2003) according to the formula: ln(P)i = ln[(ni/Ni)/(n/N)/(ns/Ns)/(n/N) * ASA], where ASA stands for the average accessible surface areas of the residues in an extended Gly-X-Gly triplet (Miller et al. 1987). The high correlation we observe suggests that their data set, despite its smaller size, still provides good coverage with similar properties.
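The comparison in Figure 7 amounts to a correlation between two sets of logarithmic propensities. The sketch below shows one way such a coefficient could be computed, assuming a Pearson correlation (the text does not name the type of coefficient). The numerical values are placeholders, not the published propensities, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def log_propensity_correlation(ours, theirs):
    """Correlate ln(P_i) over the residue types present in both sets."""
    shared = sorted(set(ours) & set(theirs))
    xs = [math.log(ours[res]) for res in shared]
    ys = [math.log(theirs[res]) for res in shared]
    return pearson(xs, ys)


# Placeholder propensities for four residue types (illustrative only).
ours = {"L": 1.5, "K": 0.5, "W": 2.0, "D": 0.6}
theirs = {"L": 1.4, "K": 0.6, "W": 1.8, "D": 0.5}
r = log_propensity_correlation(ours, theirs)
```

All propensities must be strictly positive for the logarithm to be defined.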

Table 3 lists the propensities of the different amino acids in Type I, Type II, and Type III interfaces. We have further computed the propensities in each interface type when dividing the residues into classes of hydrophobic (A,P,L,I,M,V), charged (D,E,R,K), polar (N,Q,S,T), and aromatic residues (W,Y,F,H). The last four rows of Table 3 give the overall contribution of these residue classes. The percentages are given as the second figure in the last four rows. Clearly, interfaces are dominated by hydrophobic residues in all three cases. Aromatic residues make the next-largest contribution. However, it is interesting that the hydrophobic effect is smaller in the Type III interfaces. Instead, the propensities of the charged residues increase. This may reflect the fact that in Type III the nonconserved side of the interface is smaller. Smaller interfaces have already been shown to display a reduced hydrophobic effect (Tsai et al. 1997). In these smaller, more exposed interfaces electrostatics appears to play a more important role. Overall, charged residues are less frequent in the interfaces. This also suggests that electrostatic interactions are probably not the major source of the stability of the interfaces.
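The class totals in the last four rows of Table 3 can be illustrated with a short sketch. The class definitions follow the text above. The percentage is assumed here to be each class's share of the sum over the four classes, which is one plausible reading of the table note. The function name and input values are hypothetical placeholders.

```python
# Residue classes as defined in the text (Gly and Cys belong to no class).
RESIDUE_CLASSES = {
    "hydrophobic": "APLIMV",
    "charged": "DERK",
    "polar": "NQST",
    "aromatic": "WYFH",
}


def class_contributions(propensities):
    """Return {class: (summed propensity, percentage of the four-class total)}."""
    sums = {
        name: sum(propensities.get(res, 0.0) for res in members)
        for name, members in RESIDUE_CLASSES.items()
    }
    total = sum(sums.values())
    return {name: (s, 100.0 * s / total) for name, s in sums.items()}


# Placeholder input: every residue type has propensity 1.0 (illustrative only).
uniform = {res: 1.0 for res in "ACDEFGHIKLMNPQRSTVWY"}
contributions = class_contributions(uniform)
```

With uniform propensities the hydrophobic class sums to 6.0 and each of the other classes to 4.0, simply because the hydrophobic class has six members. Class sizes therefore need to be kept in mind when comparing summed propensities.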


Here we provide a structurally unique data set of two-chain interfaces derived from the PDB. The interfaces are clustered based on their spatial structural similarities, regardless of the connectivity of their residues on the protein chains. The data set includes 3799 clusters, compared to 351 in 1996. This substantially more diverse data set reflects both the growth in the number of structures as well as the larger number of higher molecular weight proteins currently in the PDB. The comparison of the old and new data sets indicates that the number of newly found interface clusters has increased much more rapidly compared to the number of the available new PDB structures. This may suggest that the number of unique interfaces has still not reached its upper limit.

We divide the clusters into three types: Type I clusters consist of similar interfaces whose parent chains are also similar. In Type II clusters, the interfaces are similar; however, the overall structures of the parent proteins from which the interfaces derive are different. In all Type II cases that we have studied, the clustered proteins belong to different SCOP families, with different functions. The Type III category introduces clusters of interfaces where only one side of the interface is similar but the other side differs. Type III clusters illustrate that a binding site can interact with more than one chain, with different geometries, sizes, and composition. One of the paradigms in protein science states that similar global structures may have different functions. Our observations suggest an extension of this paradigm: Similar interface architectures may have different functions. As in protein structures, evolution has reused “good” favorable interface structural scaffolds and adapted them to diverse functions. The functions extend from enzymes/inhibitors to toxins and immunoglobulins. We did not observe homodimers in Type II clusters. This is probably due to the smaller sizes of the monomers and the extensive interfaces in the two-state homodimers that cover large portions of the chains. As expected, we find that multifunctional interface clusters consisting of helices largely derive from proteins whose functions relate to muscle and to membranes.

The observation that globally different protein structures associate in similar ways (i.e., Type II) to yield similar motifs is interesting. Clearly, there is a very large number of ways that monomers can combinatorially assemble. Remarkably, among these there are preferred interface architectures, and these are similar to those observed in monomers (Tsai et al. 1998b). This observation both underscores the view that the number of favorable motifs is limited in nature and highlights the analogy between binding and folding. It is further reminiscent of the combinatorial assembly of protein building blocks in folding (Tsai and Nussinov 1997).

We hope that this diverse, structurally nonredundant data set will be useful in a broad range of studies, such as deriving profiles of binding sites, elucidation of the determinants of protein–protein interactions, and identification of residues contributing to the stabilization of the protein associations and those playing a role in a specific protein function. The data set should allow extensive comparisons between binding and folding and derivation of motifs across interfaces. This data set should further be useful in construction of protein networks, and allow studies of structurally conserved residue hot spots. We expect it to be useful in studies of evolutionary conservation, recognition, binding, and function.

Electronic supplemental material

Supplemental material includes two appendices: (1) a list of 3,799 cluster representatives; (2) a list of all nonredundant two-chain interface clusters.

Table 1. The parameters used during the clustering of the interfaces
Cycle | Number of interfaces | Relative connectivity score | Minimal % amino acid identity | Maximal amino acid size difference between interfaces
A | 21,686 → 16,446 | 0.9 | 90 | 0
B | 16,446 → 9637 | 0.9 | 80 | 3
C | 9637 → 6647 | 0.8 | 50 | 10
D | 6647 → 5332 | 0.7 | 25 | 20
E | 5332 → 4429 | 0.6 | 10 | 40
F | 4429 → 3799 | 0.5 | 0 | 50
Table 2. Similar interfaces with dissimilar folds
SCOP family | Representative | Proteins | Common residues in interfaces | Interface fold type
[1] DNA polymerase processivity factor | 1ah8AB | [1] GP45 sliding clamp (1b77AB); [1] Proliferating cell nuclear antigen (PCNA) (1axcAC) | 18 | α + β
[2] Microbial ribonucleases | | [2] Barnase/Binase (1a2pBC) | |
[1] Chromo domain-like chromatin | 1afrBD | [1] Heterochromatin protein 1, HP1 (1dz1AB, 1e0bAB) | 22 | α
[2] Aldolase | | [2] Transaldolase (1f05AB) | |
[3] Tryptophan synthase β subunit-like PLP-dependent enzymes | | [3] 1-aminocyclopropane-1-carboxylate deaminase (1f2dBD) | |
[1] Cellulose-binding domain family III | 1aohAb | [1] Cohesin domain (1aohAB, 1g1kAB) | 21 | β
[2] Fluorescent proteins | | [2] Green fluorescent protein (1b9cAB); [2] Red fluorescent protein (1g7kAB) | |
[1] Snake venom toxins & | 1e7kAB | [1] Cardiotoxin V4II (1cdtAB) | 19 | β
[2] Cysteine proteinases | | [2] (Pro)cathepsin X (1ef7AB) | |
[3] P-loop containing nucleotide triphosphate hydrolases | | [3] Initiation factor 4a (1fuuAB) | |
[1] MHC antigen-recognition domain | 1hyrAC | [1] Bungarotoxin (1kbaAB); [1] MHC I homolog (1hyrAC, 1kcgac) | 20 |
[2] Tyrosine-dependent oxidoreductases | | [2] Negative transcriptional regulator NmrA (1k6jAB) | |
 | | [3] Class I MHC-related molecule (1kcgAC) | |
[1] Virus ectodomain | 1qbzBC | [1] Core structure of Ebo gp2 (1eboAB) | 54 | α (bundle)
[2] Tropomyosin | | [2] Tropomyosin (1ic2CD) | |
 | | [1] Envelope polyprotein GP160 (1if3AB); [1] Retrovirus gp41 protease-resistant core (1qbzAC) | |
[1] Fibrinogen C-terminal domain-like | 1gk4AB | [1] Fibrinogen C-terminal domains (1fzaAB) | 73 | α (bundle)
[2] Vimentin coil | | [2] Vimentin coil (1gk4AB) | |
[3] Neuronal synaptic fusion complex | | [3] Neuronal synaptic fusion complex (1gl2BC, 1kilAB) | |
[4] Tropomyosin | | [4] Tropomyosin (1ic2AB) | |
[5] Synaptic snare complex | | [5] Synaptic vesicle protein vamp2 and presynaptic plasma membrane proteins snap-25 and syntaxin 1a (2bu0BC) | |
[1] Immunoglobulin | 1irxAB | [1] T-cell antigen receptor (1fo0HB, 1g6rBH) | 13 | α
[2] Ferritin | | [2] (Apo)ferritin (1iesBF) | |
[3] Nucleotidylyl transferase | | [3] C-terminal domain of class I lysyl-tRNA synthetase (1irxAB) | |
[1] Tetraspanin | 1g8qAB | [1] CD81 extracellular domain (1g8qAB) | 20 | α
[2] Signaling proteins | | [2] Dopamine D2 receptor modeled on bacteriorhodopsin (1i15cd) | |
[3] Light-harvesting complex subunits | | [3] Light-harvesting complex subunits (1ijdac, 1lghgj) | |
[1] α helical bundle | 1gc7AB | [1] α helical bundle (1cosAC) | 31 | α
[2] Neuronal synaptic fusion complex | | [2] Neuronal synaptic fusion complex (1gl2AC, 1kilAC, 1kilBD) | |
[3] Virus ectodomain 2siv | | [3] Retrovirus gp41 protease-resistant core (2sivAB) | |
[1] ROP protein | 1kd8AB | [1] ROP protein (1f4mAB) | 43 | α
[2] Neuronal synaptic fusion complex | | [2] Neuronal synaptic fusion complex (1hvvBC) | |
[3] Leucine zipper-domain | | [3] GCN4 (1kd8AB) | |
[4] Tropomyosin | | [4] Tropomyosin (1kqlAB) | |
[5] Cell envelope component | | [5] Murein lipoprotein (1mlpAB) | |
[1] Virus ectodomain | 2sivAC | [1] Retrovirus gp41 protease-resistant core (1aikNC) | 29 | α
[2] Cytochrome c | | [2] Mitochondrial cytochrome c (1kyoBR) | |
[3] Neuronal synaptic fusion complex | | [3] Neuronal synaptic fusion complex (1sfcBD, 1sfcBJ) | |
[1] Bcr-Abl oncoprotein oligomerization domain homotetramer | 1l7cAC | [1] Bcr-Abl oncoprotein oligomerization domain (1k1fDF) | 17 | α
[2] Membrane protein | | [2] Pentameric transmembrane domain of phospholamban (1k9nAB) | |
[3] α-catenin/vinculin | | [3] α-catenin (1l7cAC) | |
[4] Nucleotide and nucleoside kinases | | [4] Thymidylate kinase (3tmkDG) | |

The first column is the SCOP classification. The numbers in square brackets identify the different SCOP families within each cluster. The second column lists the representatives of the interface clusters. The third column provides the individual members in the corresponding cluster. The interface names are represented by their PDB codes and chain identifiers. The number preceding each protein indicates which SCOP family (in column 1) it belongs to. The fourth column is the result of MultiProt (Shatsky et al. 2002, 2003) alignments: the number of common residues aligned structurally for the members in the clusters. The fifth column gives the interface fold type.
Table 3. Residue propensities of amino acids
Residue type | Type I | Type II | Type III | All

The first column is the type of the amino acid. The second, third, and fourth columns are the propensities for Type I, II, and III clusters, respectively. The last column gives the overall propensities summed over all types of interfaces. The last four rows are the sum of the propensities for hydrophobic, polar, charged, and aromatic residues, respectively. The first number gives the cumulative effect of all the residues in the four classes; the second number gives the percentage of each class.

Figure 1.

Definition of protein–protein interfaces: The ribbon diagram of Glutathione S-Transferase is displayed. The three chains (A, B, and C) of transferase are colored yellow, gold, and dark green, respectively. Two interfaces form between chain pairs. Chains A and C do not form an interface. In the first interface between chains A and B, the interacting residues are colored red, and the nearby ones in green. In the second one (interface between chains B and C), the interacting residues are displayed in magenta and the nearby in cyan. Only the side chains of the interacting residues are shown.

Figure 2.

(A) The input of the alignment program for interfaces: the representation of the Glutathione S-Transferase with its Cα atoms as points in three-dimensional space. The coloring scheme is as in Figure 1. The pairwise structural alignment of interfaces is performed considering only the points belonging to the contacting and nearby residues. (B) The schematic representation of the alignment algorithm. We start with all structures available in the PDB and extract the interfaces formed between pairs of chains. These interfaces are next compared to each other with an iterative procedure to assign them into different structural clusters. The algorithm reads the interfaces as sets of points (as shown in A) and constructs the hash tables to define all local motifs in interfaces. Interfaces are compared iteratively and clustered.

Figure 3.

Histogram indicating the increase in the number of protein structures available in the PDB (between 1997 and 2003), the increase in the number of protein–protein interface clusters (comparison between the previous work [Tsai et al. 1996] and the results of this work), the increase in the number of protein families (comparison between the 1997 and 2003 SCOP databases). Note that our previous interface data set was extracted in 1996, so the closest version of the SCOP (1997) was compared in the analysis.

Figure 4.

Some examples of similar interfaces, similar monomer structures, and functions (called Type I in this work). In the figure three members of the 1cydAD cluster are presented. The two complexes displayed at the top panel are oxidoreductases (PDB codes: 1cyd, 1e3s), and the bottom complex is a pterin reductase (PDB code: 1e92). Three of the structures are available as tetramers in the PDB. For clarity, we have displayed the chains that form the common interface among them (1cydAD, 1e3sAC, and 1e92AC). In all complexes one chain is colored pink, and the other is cyan. One side of the common interfaces is colored yellow, and the complementary side of the interfaces is colored purple. There are 111 interface residues in common. The RMSD between the 1cydAD and 1e3sAC interfaces is 3.11 Å, and the RMSD between the 1e92AC and 1e3sAC interfaces is 1.26 Å.

Figure 5.

Some examples of similar interfaces, dissimilar monomer structures, and functions (called Type II in this work). (A) In the figure, ribbon diagrams of four members in the 1g1kAB cluster are illustrated. These are the structures of the single cohesin domain from the scaffolding protein CipA of the Clostridium thermocellum cellulosome (1aoh), green fluorescent protein mutant F99S, M153T, and V163A (1b9c), cohesin module from the cellulosome of Clostridium cellulolyticum (1g1k), and DsRed, a red fluorescent protein from Discosoma sp. red (1g7k). The letters correspond to the monomers in these complexes. (B) Ribbon diagrams of two interfaces (1aohAB and 1g7kAB) derived from two functionally different proteins. The yellow region marks the common interface with 48 common interface residues. The RMSD between these two interfaces is 2.31 Å, considering only α-carbon atoms.

Figure 6.

The ribbon diagrams of Type III interfaces. The members in this cluster are represented by the 1fj1DE interface. In all figures, the cyan structures represent the protein that binds to different proteins or peptides (pink structures). The yellow colored region in each case is the similar interface architecture within the 1fj1DE cluster. (A) An example of a complex between a human monoclonal BO2C11 FAB heavy chain and human factor VIII (1iqdBC interface). (B) An illustration of an antibody/peptide complex (1bogBC interface). (C) Another immunoglobulin/viral peptide complex, formed between the FAB fragment and human rhinovirus capsid protein VP2 (1a3rHP interface). (D) An example of the interface formed between the heavy chain (IGG2A Kappa antibody CB41) and an antigen bound peptide (1cfsBC interface). The RMSD values between the interfaces are 0.67 Å, 1.29 Å, and 3.01 Å over 26 residues, respectively.

Figure 7.

Propensities of residues in the interfaces. This figure illustrates the correlation of our residue propensities with those of Jones and Thornton (1997). The axes represent the natural logarithms of the propensities. A high correlation coefficient (0.91) is obtained over the 20 amino acids.


We thank Drs. Buyong Ma, K. Gunasekaran, S. Kumar, D. Zanuy, H.-H.(G.) Tsai, and members of the Nussinov-Wolfson group, in particular Maxim Shatsky, for help with MultiProt, and Inbal Halperin and Shira Mintz for many useful comments and suggestions. We thank Dr. Jacob V. Maizel for discussions and encouragement. We thank Dr. A. Gursoy and S. Aytuna for their helpful discussions. The research of R.N. and H.W. in Israel has been supported in part by the Center of Excellence in Geometric Computing and its Applications, funded by the Israel Science Foundation (administered by the Israel Academy of Sciences). This project has been funded in whole or in part with Federal funds from the National Cancer Institute, NIH, under contract number NO1-CO-12400.

The publisher or recipient acknowledges right of the U.S. Government to retain a nonexclusive, royalty-free license in and to any copyright covering the article.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.