Protein interfaces are thought to be distinguishable from the rest of the protein surface by their greater degree of residue conservation. We test the validity of this approach on an expanded set of 64 protein–protein interfaces using conservation scores derived from two multiple sequence alignment types, one of close homologs/orthologs and one of diverse homologs/paralogs. Overall, we find that the interface is slightly more conserved than the rest of the protein surface when using either alignment type, with alignments of diverse homologs showing marginally better discrimination. However, using a novel surface-patch definition, we find that the interface is rarely significantly more conserved than other surface patches when using either alignment type. When an interface is among the most conserved surface patches, it tends to be part of an enzyme active site. The most conserved surface patch overlaps with 39% (± 28%) and 36% (± 28%) of the actual interface for diverse and close homologs, respectively. Contrary to results obtained from smaller data sets, this work indicates that residue conservation is rarely sufficient for complete and accurate prediction of protein interfaces. Finally, we find that obligate interfaces differ from transient interfaces in that the former have significantly fewer alignment gaps at the interface than the rest of the protein surface, as well as having buried interface residues that are more conserved than partially buried interface residues.
As structural genomics projects proceed, they are likely to yield structures of proteins that are functionally uncharacterized. Identification of active sites in enzymes and protein–protein binding sites in nonenzymatic proteins will be particularly important for elucidating function and designing inhibitors. Inhibiting protein–protein interactions with small molecules has proved particularly difficult due to their large size and lack of cavities (Toogood 2002; Gadek and Nicholas 2003). However, targeting the most critical residues may lead to improved inhibition of these interactions. Many of the residues that are critical for binding are likely to be evolutionarily conserved. Therefore, their potential impact in predicting protein–protein binding sites is an important question. Whereas there is general agreement that active/ligand binding sites are conserved across many different protein families (Grishin and Phillips 1994; Ouzounis et al. 1998; Bartlett et al. 2002), the importance of conservation is less clear for protein–protein interfaces (Grishin and Phillips 1994; Valdar and Thornton 2001b). Grishin and Phillips (1994) concluded that interface residues were only slightly more conserved than the rest of the protein sequence after examining five enzyme families. Valdar and Thornton (2001b) concluded that interface residues, particularly those completely buried in the interface, were more conserved than other surface-exposed residues after analyzing six homodimers. The distinguishing features of the latter study included the use of a similarity score rather than an identity score, the application of more robust statistical tests, and the comparison of interface residues relative to other contiguous surface patches. Although these studies are very valuable, the data sets used are small, and the results may not apply to all complexes, particularly those with heterodimeric or transient interfaces.
Nonetheless, several groups have successfully used conservation scores to predict protein–protein binding sites. Two independent groups (Elcock and McCammon 2001; Valdar and Thornton 2001a) conclude that conservation in combination with other factors can accurately discriminate genuine homodimers from crystal contacts. The majority of methods that predict protein–protein binding sites also use conservation scores (other approaches are discussed later). Those that map the conservation score to the three-dimensional structure are likely to be the most informative and include Evolutionary Trace (ET; Lichtarge et al. 1996), Consurf (Armon et al. 2001), Rate4Site (Pupko et al. 2002), and the method of Landgraf and Eisenberg (Landgraf et al. 2001). In cases in which a three-dimensional structure is unavailable, residues that are conserved for the entire family or a subfamily within the alignment are predicted to be functional (Casari et al. 1995; Livingstone and Barton 1996; Caffrey et al. 2000; Hannenhalli and Russell 2000). However, assessing the accuracy of these methods has been difficult and usually limited to a few experimentally characterized protein families. Furthermore, we are only aware of a few published experiments that confirm previously computed predictions (Stenmark et al. 1994; Bauer et al. 1999; Sowa et al. 2001).
The physical and chemical properties of protein–protein interactions have been studied on a large number of complexes by numerous groups (Chothia and Janin 1975; Argos 1988; Janin et al. 1988; Janin and Chothia 1990; Korn and Burnett 1991; Clackson and Wells 1995; Jones and Thornton 1996, 1997; Lijnzaad et al. 1996; Tsai et al. 1996, 1997a,b; Tsai and Nussinov 1997; Xu et al. 1997; Bogan and Thorn 1998; Larsen et al. 1998; Xu and Regnier 1998; Lo Conte et al. 1999; Jones et al. 2000; Sheinerman et al. 2000; Glaser et al. 2001; Chakrabarti and Janin 2002; Sheinerman and Honig 2002). In general, interfaces tend to be planar with an area that is often proportional to the total protein size (Jones and Thornton 1996). The residue composition usually differs for those complexes that are transient versus those that are obligate. This is probably due to the former relying more on salt bridges and hydrogen bonds, whereas the latter rely more on hydrophobic attractions (Jones and Thornton 1997; Lo Conte et al. 1999). There are also many examples of both geometric and electrostatic complementarity between the binding interfaces (Lawrence and Colman 1993; McCoy et al. 1997; Xu et al. 1997; Lo Conte et al. 1999; Sheinerman et al. 2000). Although the interface can be quite large, it was shown in some systems that only a small fraction of the residues contribute to the majority of the binding energy (Clackson and Wells 1995). Furthermore, these so-called hotspots of binding energy tend to have preferred residue types that often have a high degree of burial at the interface (Bogan and Thorn 1998). Interestingly, there is evidence (for 11 families) to suggest that there is a relationship between the enrichment of a residue type in a hotspot and the propensity of the corresponding residue to be conserved (Hu et al. 2000).
In this study, we examine the difference in conservation between the protein interface and the rest of the protein surface for a set of 64 protein–protein interfaces. As residue conservation depends on the choice of sequences aligned, we construct two multiple-sequence alignments (MSAs) for each protein using two different strategies. The first approach attempts to include closely related sequences, whereas the second includes a more diverse set of sequences. These MSAs are generally expected to contain orthologs and paralogs, respectively, and there are arguments for choosing either MSA type. Orthologs are expected to be almost identical in function, whereas a set of paralogs is expected to have undergone some evolutionary changes so that they can perform slightly different functions. However, nonfunctional residues are often conserved over short periods of evolutionary time, which is a source of noise that will be more prominent in orthologs. When the two approaches are examined and compared with each other, we find that the difference in conservation between the interface and the rest of the surface is marginally (but not significantly) better in MSAs of diverse homologs than in MSAs of close homologs. Furthermore, we find that obligate and transient interfaces have different physico-chemical properties that influence their evolutionary rates.
Nonredundant data set
The data set consists of 42 chains that form homodimers, 12 chains that form heterodimers, and 10 chains that form transient complexes as described in Table 1. As mentioned above, the MSAs of close homologs and diverse homologs are expected to contain orthologs and paralogs, respectively. A number of criteria were also used to remove distantly related or poorly aligned sequences (see Materials and Methods). Consequently, it is usually the case that only one of the chains is considered in the analysis, as the alignment for its binding partner was not satisfactory. The interface sizes ranged from 415 to 3568 Å² for heterodimers, 550 to 4718 Å² for homodimers, and 423 to 2361 Å² for transient complexes. This suggests that transient interfaces are generally smaller than obligate interfaces, although it could be due to difficulties in crystallizing larger transient interfaces. Although a residue was classed as an interface residue if it had a ΔASA of 1% or more, the majority of interface residues (86%) lose more than 5% ASA upon complex formation. For each data set, none of the chains share significant sequence identity with the other chains (see Materials and Methods).
Comparison of interface residues with exposed noninterface residues
Figure 1 shows the difference in residue conservation between the interface and the rest of the exposed surface for both alignment types. Table 2 shows the statistics that are associated with Figure 1. The majority of proteins (40/64) are more conserved at the interface than the rest of the surface in both of the alignment types (top, right quadrant). There are six proteins for which only the MSAs of close homologs have an interface that is more conserved than the rest of the surface (top, left quadrant: 1k9oIE_I, 1lehAB_A, 1masAB_A, 1rvv12_1, 2pcdMP_M, 1gotAB_B). In four of the proteins, only the MSAs of diverse homologs have an interface that is more conserved than the rest of the surface (bottom, right quadrant: 1g3nAC_A, 8atcAB_A, 1poy12_1, 1daaAB_A). In the remaining 14 proteins, the interface is less conserved than the rest of the exposed surface for both MSA types (bottom, left quadrant). These 14 proteins can be further divided into 11 homodimers: 1bncAB_A (acetyl-CoA carboxylase), 1ecpBD_B (purine nucleoside phosphorylase), 1gp1AB_A (glutathione peroxidase), 1hyhAB_A (L-lactate dehydrogenase), 1idsAC_A (superoxide dismutase), 1nhkLR_L (nucleoside diphosphate kinase), 1qorAB_A (quinone oxidoreductase), 1rahBD_B (aspartate carbamoyltransferase), 1scuBE_B (succinyl-CoA synthetase β subunit), 1xikAB_A (ribonucleoside-diphosphate reductase β subunit), and 2eipAB_A (inorganic pyrophosphatase); one heterodimer, 1tcoBC_B (calmodulin-dependent phosphatase β subunit); and two transient complexes, 1g3nAB_A (CDK6) and 1rrpAB_A (Ran GTPase). It is not entirely clear why these interfaces are not more conserved than the rest of the exposed surface, but it might be due to the presence of a second interface that is not being considered. For example, 1g3nAB_A (CDK6) forms another interface with cyclin D (1g3nAC_A; bottom, right quadrant).
Combining the two interfaces of CDK6 and comparing them with the rest of the exposed surface improves the ratio of interface conservation to surface conservation. Similarly, the β subunit of the heterotrimeric G protein (top, left quadrant) forms an interface with the γ subunit as well as the α subunit (1gotAB_B). Both Ran GTPase (1rrpAB_A; bottom, left quadrant) and calcineurin A (1tcoAB_B; bottom, left quadrant) are also known to interact with several different proteins (Griffith et al. 1995; Moroianu 1999).
An interesting example is the tetrameric succinyl-CoA synthetase (Fig. 2). The homodimeric interaction between the 41-kD subunits is not very conserved (1scuBE_B; bottom, left quadrant of Fig. 1; Fig. 2B), but the same 41-kD subunit (chains B and E) forms a heterodimeric interface with a 29-kD chain that is highly conserved (1scuDE_E; top, right quadrant of Fig. 1; Fig. 2B). The heterodimeric interface overlaps with the catalytic site and illustrates that two different interfaces on the same chain can evolve at very different rates.
Table 2 shows that the interface is more conserved than the rest of the exposed surface for both MSAs of diverse homologs (44/64, P = 0.00032) and MSAs of close homologs (46/64, P = 0.0000193). However, when compared directly, diverse homologs more often had a better ratio (interface conservation to exposed surface conservation) than close homologs (35 to 29), although this was not statistically significant (P = 0.19).
In some MSAs of diverse homologs, the interface is considerably more conserved (e.g., a ratio ⩾ 1.3) than the rest of the solvent-exposed surface (Fig. 1; 1apmIE_E, 1ughIE_E, 1ubsAB_A, 1scuDE_E, 1sftAB_A, and 1pkyAC_A). These are cAMP-dependent kinase, uracil DNA glycosylase, tryptophan synthase α subunit, adenylate kinase, succinyl-CoA synthetase, alanine racemase, and pyruvate kinase, respectively. With the exception of 1ubsAB_A, their interfaces overlap with their active sites, explaining the relatively high conservation. In 1ubsAB_A, the highly conserved interface serves as a conduit through which the substrate can be passed from one active site to another.
Collectively, these results indicate that the alignment type, the presence of multiple faces, and the presence of a catalytic site at the interface can influence the conservation of the interface relative to the rest of the surface.
Comparison of interface residues with other surface patches
Although conservation differs between the interface and the rest of the exposed surface for a statistically significant fraction of interfaces, a thorough prediction program will have to consider and rank a large number of candidate surface patches. To explore this, we generated a number of surface patches (one for almost every exposed residue), and used the Z test to examine whether the average conservation of the interface is significantly different from the conservation of all other patches on that protein (Fig. 3). With the exception of one protein (1k9oIE_E), all patches had the same number of residues as the interface. In 1k9oIE_E, 40% of the surface patches had fewer residues (minimum of 25 residues) than the actual interface (31 residues). The results of this test are summarized in Table 3, in which it can be seen that the majority of interfaces are not significantly more conserved than other surface patches (significance being defined as Z > 1.64, corresponding to the 95th percentile of the normal distribution). The MSAs of diverse homologs have slightly more significantly conserved interfaces (9/64) than MSAs of close homologs (6/64). However, the overall differences between the two alignment types are not significant.
There are only four interfaces that are significantly conserved for both alignment types: 1sftAB_A, 3mdeAB_A, 1k9oIE_E, and 1ubsAB_A. Five protein interfaces are significantly more conserved than their respective surface patches for MSAs of diverse homologs only: 1apmIE_E, 1fuqAB_A, 1tcoAB_B, 1scuDE_E, and 2pcdBN_N. Two protein interfaces are significantly more conserved than their respective surface patches for MSAs of close homologs only: 2cstAB_A and 1tcrAB_A. Assuming the correct choice of MSA type, this suggests that 11 of the 64 interfaces would have been predicted correctly. With the exception of four interfaces (1ubsAB_A, 1tcoAB_B, 2cstAB_A, and 1tcrAB_A), the remaining seven interfaces overlap with their active sites. The least conserved interfaces have already been described in Figure 1. As described in the previous section, the interface of 1ubsAB_A functions as a conduit between two active sites.
Although most interfaces are not significantly more conserved than other patches, it is possible that the most conserved patch shares some overlap with the interface. In Figure 4, we consider the most conserved surface patch in each protein and measure its overlap with the actual interface. The degree of overlap between the most conserved surface patch and the actual interface is 39% (± 28%) and 36% (± 28%) for MSAs of diverse and close homologs, respectively. The most conserved surface patch overlaps with at least 50% of the interface in only 17 of the 64 interfaces for both alignment types (top, right quadrant). However, in the majority of proteins (39/64), the most conserved surface patch has <50% overlap with the actual interface for both alignment types (bottom, left quadrant). These results suggest that protein interfaces can rarely be predicted accurately when using conservation analysis alone, regardless of the alignment type used. Again, the interface tends to be more conserved when it forms an active site.
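The overlap measurement described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the function names, the dictionary layout (patches keyed by seed residue, conservation keyed by residue identifier), and the toy values are all assumptions made for the example.

```python
def most_conserved_patch(patches, conservation):
    """Return the seed of the patch with the highest mean conservation score.

    patches: dict mapping a seed residue id to the list of residue ids in its patch
    conservation: dict mapping a residue id to its conservation score
    (both layouts are hypothetical, chosen for illustration)
    """
    def mean_score(seed):
        residues = patches[seed]
        return sum(conservation[r] for r in residues) / len(residues)

    return max(patches, key=mean_score)


def interface_overlap(patch, interface):
    """Fraction of the interface residues that also belong to the patch."""
    interface = set(interface)
    return len(interface & set(patch)) / len(interface)


# Toy data: two candidate patches and a four-residue interface.
conservation = {1: 0.90, 2: 0.80, 3: 0.20, 4: 0.10, 5: 0.95}
patches = {1: [1, 2, 5], 3: [3, 4, 2]}
interface = [2, 5, 3, 4]

best_seed = most_conserved_patch(patches, conservation)
overlap = interface_overlap(patches[best_seed], interface)
```

Here the overlap is expressed relative to the size of the interface, which matches the 50% criterion used in the text.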
Comparison of central interface residues with exposed noninterface residues
It had been shown previously for six homodimers that residues that become completely buried upon complex formation also tend to be very conserved (Valdar and Thornton 2001b). Such residues are termed “central residues,” but this does not mean they are necessarily in the center of the interface. Instead, a central residue is defined as one that has an accessible surface area of <7% when bound (B-ASA < 7%); B-ASA should be distinguished from ASA and ΔASA (see Materials and Methods). A peripheral residue is defined as one that is only partially buried upon complex formation (B-ASA ⩾ 7%). The majority of residues (85% of peripheral residues, 94% of central residues) lose at least 5% ASA after binding. Figure 5 compares the average conservation of the central interface residues against the average conservation for the rest of the exposed surface. With the exception of 1ytfAD_A and 1tcoBC_B, the remaining 62 interfaces have a central interface. The majority (46/62) of central interfaces are more conserved than the rest of the surface for both alignment types (top, right quadrant). The difference in conservation between the central interface and the rest of the exposed surface is significant (in both alignment types) for obligate interfaces, but not transient interfaces (Table 4). Similarly, the difference in conservation between the central interface and the peripheral interface is significant for obligate interfaces but not transient interfaces (data not shown). As discussed below, this suggests that obligate binding is primarily driven by hydrophobic interactions.
The frequency of conserved residues at different degrees of burial at the interface
Given that central residues tend to be more conserved than peripheral residues in obligate interfaces, we decided to compare the residue preferences of conserved residues at the center and periphery. An interface residue was considered conserved when the information score was ⩾0.85 in MSAs of diverse homologs. Sequence logos were generated with ALPRO (Schneider and Stephens 1990).
For heterodimers, there are both similarities and subtle preferential differences between central residues (Fig. 6A) and peripheral residues (Fig. 6B). Leucine is the most prominent conserved residue at the central interface, but is also fairly prominent at the peripheral interface, where its B-ASA ranges from 8.2% to 33.7%. There is some evidence that residues at the protein–protein interface are less flexible than the rest of the protein surface (Cole and Warwicker 2002), and this need might be met by leucine with its limited conformational diversity (Pickett and Sternberg 1993). The aromatic residues phenylalanine and tyrosine are more prominent in the central interface than the peripheral interface. In contrast, the peripheral interface prefers conserved arginine and glycine residues. This would suggest that pi-interactions of the conserved central aromatic residues are a primary driving force for heterodimerization. The preference for conserved arginines at the peripheral interface is probably due to their ability to form hydrophobic interactions, while still requiring interactions with water or polar molecules. We speculate that the role of glycine is probably more structural, given that it is important in helix caps (Fetrow et al. 1997) and loops (Crasto and Feng 2001). The other surprise at the central interface is the preference for aspartic acid. It is not clear to us why this is preferred over glutamic acid, but it might also be due to its high propensity to be in loops (Crasto and Feng 2001).
In homodimers, the central residues (Fig. 6C) are predictably more hydrophobic than the peripheral residues (Fig. 6D). However, their preference for aromatic residues is not as strong as it is in heterodimers. The highest ranked central residues are leucine and arginine, whereas the highest ranked peripheral residues are glycine and proline. With the exception of proline, the possible roles of these residues have already been mentioned. Similar to glycine, we believe that the role of proline is probably structural, given that it is a secondary structure breaker and important in loops (Crasto and Feng 2001) and helix caps (Fetrow et al. 1997).
With the exception of aspartic acid and arginine, the majority of central residues are hydrophobic. These results suggest that hydrophobic forces primarily drive packing of obligate interfaces.
The frequency of gapped alignment positions at the protein–protein interface
It is generally thought that gaps in an alignment most often correspond to loops in the protein structure. It is also well known that loops are primarily exposed and often part of an active site or protein–protein interface. Many of the residues described above are commonly found in loops (Crasto and Feng 2001). Therefore, it could be argued that a prediction method should find a way to reward a candidate surface patch that contains a loop that is believed to be part of the interface. However, many scoring schemes either ignore alignment positions with gaps or introduce a gap penalty, the argument being that a residue position is unlikely to be important if it can be deleted. In this work, our conservation score uses a gap penalty, and we were interested to know how many interface residues had one or more gaps in their alignment position compared with the number found in the rest of the exposed surface. In Figure 7, obligate interfaces (homodimers and heterodimers) tend to have fewer gaps at their interface than on the rest of their protein surface. This observation is not as striking when using alignments of close homologs (Table 5). In contrast, the number of interface gaps does not significantly differ from the number of surface gaps for transient interfaces.
This result is probably not surprising if one views an obligate interface as a protein core; cores are known to contain fewer gaps than the surface when multiply aligned.
We have shown that the protein interface is usually more conserved than the rest of the exposed surface. However, a more realistic surface-patch analysis showed that the interface conservation was not sufficiently different from other surface patches to allow prediction of the interface by conservation alone. The most conserved surface patch on a protein was rarely found to share >50% residue overlap with the real interface. The results are considerably less optimistic than those of two previous studies that focused exclusively on protein–protein interfaces (Grishin and Phillips 1994; Valdar and Thornton 2001b), probably because our data set is significantly larger. To our knowledge, this is the first time that conservation of transient and heterodimeric interfaces has been studied. Although the number of heterodimeric and transient complexes is larger than in previous studies of homodimers, the results should still be considered preliminary. Overall, the results suggest that one will have a small chance (17/64) of correctly predicting 50% of the interface residues when the three-dimensional structure is known and either multiple alignment type is used. The success rate is likely to improve when the two interfaces form a catalytic site and will be poorer when the protein has multiple faces. The conservation of catalytic/small-ligand binding sites is well documented, and the ET method is expected to predict them accurately (Yao et al. 2003). Although there was not a significant difference between the two MSA types, we prefer the MSA of diverse homologs. They appear to be marginally better for discriminating the interface from the rest of the surface, and the number of gaps at obligate interfaces is less than the number of gaps at the rest of the surface.
Occasionally, the protein belonged to a large family in which each subgroup might be expected to differ from other subgroups at the interface. Although our information score assigns a relatively high score to these subgroup specific/tree-determinant sites, the MSAs of diverse homologs will not contain many sequences for a subgroup, whereas the MSAs of close homologs will contain many sequences for just one subgroup (see Materials and Methods). Some of the less-conserved interfaces are likely to be detected by methods that account for the phylogenetic relationships (Lichtarge et al. 1996; Armon et al. 2001; Pupko et al. 2002). Unfortunately, defining the correct subset of sequences is not trivial, particularly if the procedure is to be automated (del Sol Mesa et al. 2003). One strategy might be to define subgroups on the basis of gene duplication events, although this also has caveats. Combining parameters such as tree-determinant information with surface-patch conservation should lead to improved prediction of interfaces. Other parameters that might be combined include residue propensities (Ofran and Rost 2003), physical properties (Jones and Thornton 1997), and evolutionary models of variable residues believed to be functionally important (Hughes and Nei 1988; Pazos et al. 1997; Shirai et al. 2002). Efforts along these lines are underway.
Materials and methods
The nontransient homodimer and heterodimer data sets were derived from a previous data set used by Glaser et al. (2001). A complex was defined as a homodimer if the two chains shared >95% sequence identity. The list of transient complexes was derived from a larger internal data set of transient complexes. Only those structures solved at a resolution of 2.5 Å or better were considered. The data set was reduced significantly after removing redundant sequences and partial structures that were not an appropriate size for patch analysis (see below). The multiple sequence alignments described below are available from http://oscar.gen.tcd.ie/~dcaffrey.
Diverse homolog selection
The objective was to have an MSA containing a diverse set of sequences that would include several paralogs whenever possible. As this is a semiautomated approach, the exact phylogeny of the sequences is unknown for each protein family. Each chain from a complex was searched against the nonredundant protein database using BLASTP with an E-value cutoff of 0.001 (Altschul et al. 1997). Sequences from each search were clustered together when they shared >60% identity, using BLASTCLUST, which is part of the BLAST package (Altschul et al. 1997). The longest sequence from each cluster was taken and aligned to the structural template using CLUSTALW (Thompson et al. 1994). This prevented oversampling from a particular subgroup of sequences found in each protein family. To ensure that the alignments were of an adequate quality, a number of criteria were used. Sequences that had five or more gaps at positions that were otherwise populated with residues in other sequences (75% of the alignment) were removed. This process was iterated three times. To ensure that a significant portion of the protein was crystallized, we only considered alignments in which the structural template made up 85% or more of the significant sites in the alignment. A significant site was defined as a position in the alignment where >70% of sequences had a residue present. Alignments with continuous stretches of significant sites (20 or more) that were not present in the structural template were removed, as were alignments that had 10 or fewer sequences aligned to the structural template. The structural template had to contain at least 120 residues that were aligned to residues in the other sequences. The remaining structures were compared against each other for sequence redundancy using BLASTCLUST with a cutoff of 30% identity. Finally, the alignment quality was confirmed by manual inspection with PFAAT (Johnson et al. 2003).
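Two of the quality criteria above, the definition of a significant site and the template-coverage requirement, can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions, not the pipeline used in the study: the alignment is assumed to be a list of equal-length strings with "-" as the gap character, the function names are hypothetical, and the other filters (gap counts per sequence, stretch length, sequence number, template length) are omitted.

```python
def significant_sites(alignment, min_occupancy=0.70):
    """Indices of columns in which more than min_occupancy of sequences have a residue."""
    n_seqs = len(alignment)
    length = len(alignment[0])
    return [
        col for col in range(length)
        if sum(seq[col] != "-" for seq in alignment) / n_seqs > min_occupancy
    ]


def template_coverage(alignment, template_index=0, min_occupancy=0.70):
    """Fraction of significant sites at which the structural template has a residue."""
    sites = significant_sites(alignment, min_occupancy)
    if not sites:
        return 0.0
    template = alignment[template_index]
    return sum(template[col] != "-" for col in sites) / len(sites)


def template_is_adequate(alignment, template_index=0, min_coverage=0.85):
    """True when the template covers at least min_coverage of the significant sites."""
    return template_coverage(alignment, template_index) >= min_coverage


# Toy alignment of four sequences; the first is treated as the structural template.
toy_alignment = ["ACD-F", "ACDEF", "A-DEF", "ACD--"]
sites = significant_sites(toy_alignment)
coverage_first = template_coverage(toy_alignment, template_index=0)
coverage_last = template_coverage(toy_alignment, template_index=3)
```

In the toy alignment the fourth column is occupied in only half of the sequences, so it is not a significant site and a gap there does not count against the template.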
Close homolog selection
The objective was to have an MSA containing a set of sequences that were closely related and would typically be orthologs. Again, the semiautomated approach does not guarantee that all sequences are bona fide orthologs. Depending on the taxonomy assignment of the proteins in Table 1, the proteins were grouped as belonging to eubacteria, metazoa, or euglenozoa (Wheeler et al. 2000). For eubacteria, each of the sequences was searched against the following genomes: Bacillus anthracis (Ames), Borrelia burgdorferi, Chlamydophila pneumoniae (CWL029), Escherichia coli (K12), Haemophilus influenzae, Helicobacter pylori (J99), Listeria monocytogenes, Mycoplasma penetrans, Neisseria meningitidis (MC58), Pseudomonas aeruginosa, Salmonella typhimurium (LT2), Shigella flexneri (2a), Staphylococcus aureus (MW2), Vibrio cholerae, and Xanthomonas citri. The top hit from each genome was selected if it had an E value of 1e−10 or better. Sequences belonging to the metazoa group were similarly searched against species databases that were derived from the NCBI nr protein database (Homo sapiens, Mus musculus, Caenorhabditis elegans, Drosophila melanogaster, Rattus norvegicus, Anopheles gambiae, Bos taurus, Gallus gallus, Xenopus laevis, Danio rerio, Ovis aries, Sus scrofa, and Takifugu rubripes). Sequences belonging to the euglenozoa group were searched against the entire NCBI nr database, and the top hits were hand selected from appropriate species.
The Shannon entropy for a multiple alignment position can be calculated as follows:

$$S = -\sum_{x=1}^{20} P(x)\,\log_{20} P(x) \qquad (1)$$
in which P(x) is the relative frequency of each amino acid x in the alignment position. The base of 20 ensures that all values are bounded between zero and one (assuming that we ignore entities such as “X”, “Z”, “B”, and “−”). However, the Shannon entropy does not account for the physicochemical similarities that are found between the different amino acids. Therefore, we calculate the Von Neumann entropy for each alignment column. The Von Neumann entropy takes a similar form to equation 1 (Lifshitz and Pitaevskii 1980; Petz 2001):

$$S = -\mathrm{Tr}\left(\varpi\,\log_{20}\varpi\right) \qquad (2)$$
in which ϖ is a density matrix with trace = 1. Apart from normalization by the trace, the density matrix is given by the product of the relative frequencies of the amino acids in each alignment position [P(x)] and an appropriate similarity matrix M (e.g., BLOSUM), that is,

$$\varpi \propto \mathrm{diag}\left[P(x)\right]\cdot M \qquad (3)$$
The calculation of equation 2 is facilitated by first calculating the eigenvalues λi of ϖ. In this case, equation 2 is given by the simpler and more computationally efficient equation

$$S = -\sum_{i} \lambda_i\,\log_{20}\lambda_i \qquad (4)$$
In the special case in which the similarity matrix is the identity matrix, equations 2 and 4 become identical to the Shannon entropy in equation 1. After trial and error, we found that the BLOSUM 50 target frequencies (blosum50.qij) (Henikoff and Henikoff 1992) gave results that we considered most desirable, but other matrices give appropriate results. To incorporate sequence weights, the frequency for each amino acid is computed as follows:

$$P(aa_i) = \frac{1}{n}\sum_{j\,:\,aa_{ij} = aa_i} w_j \qquad (5)$$
in which aa_i is one of the 20 amino acids in the alignment position, w_j is the sequence weight for sequence j to which amino acid (aa_ij) belongs, n is the number of sequences in the alignment, and the sequence weights sum to n. The sequence weights are computed using the method of Henikoff and Henikoff (1994), but could be derived by other means. A gap penalty is enforced using an approach similar to that used by CLUSTALX (Thompson et al. 1997). To do this, the VNE score is first transformed to its information score (IS) by subtracting it from the maximum entropy (i.e., IS = 1 − VNE). The gap penalty is the number of residues in the column, divided by the number of sequences. The information score is then multiplied by the gap penalty. An information score derived from VNE will range from 0 to 1, where a score of 1 is assigned to a 100% identical alignment column. In practice, a score will only be below 0.3 when gaps are present, as the 20 residues are not considered to be completely orthogonal. For residue propensities, we assigned an alignment position as being highly conserved when the information score was ⩾ 0.85.
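The scoring scheme described above can be sketched in Python with NumPy. This is an illustrative reading of the text, not the authors' implementation. The function names are hypothetical, the similarity matrix is assumed to be symmetric, and the toy matrices used here stand in for the BLOSUM 50 target frequencies, which are not reproduced. Because the density matrix is normalized by its trace, the overall scaling of the weighted frequencies does not affect the entropy, so the frequencies are simply normalized to sum to one.

```python
import math
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def weighted_frequencies(column, weights):
    """Weighted relative frequency of each of the 20 amino acids in a column."""
    freqs = np.zeros(len(AMINO_ACIDS))
    for residue, weight in zip(column, weights):
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:  # gaps and ambiguity codes (X, Z, B) are ignored
            freqs[idx] += weight
    total = freqs.sum()
    return freqs / total if total > 0 else freqs


def von_neumann_entropy(freqs, similarity):
    """Base-20 entropy of the trace-normalized density matrix diag(P) . M.

    Assumes a symmetric similarity matrix, so that diag(P)^(1/2) M diag(P)^(1/2)
    is symmetric and shares its eigenvalues with diag(P) . M.
    """
    root = np.sqrt(freqs)
    density = root[:, None] * similarity * root[None, :]
    density = density / np.trace(density)
    eigenvalues = np.linalg.eigvalsh(density)
    eigenvalues = eigenvalues[eigenvalues > 1e-12]  # treat 0 * log(0) as 0
    return float(-np.sum(eigenvalues * np.log(eigenvalues)) / math.log(20))


def information_score(column, weights, similarity):
    """IS = (1 - VNE) multiplied by the fraction of sequences with a residue."""
    n_residues = sum(1 for residue in column if residue in AMINO_ACIDS)
    if n_residues == 0:
        return 0.0
    vne = von_neumann_entropy(weighted_frequencies(column, weights), similarity)
    gap_penalty = n_residues / len(column)
    return (1.0 - vne) * gap_penalty


identity = np.eye(20)
# Toy similarity matrix (an assumption): identity plus some A/L similarity.
toy_similarity = np.eye(20)
a, l = AMINO_ACIDS.index("A"), AMINO_ACIDS.index("L")
toy_similarity[a, l] = toy_similarity[l, a] = 0.5

equal_weights = [1.0, 1.0, 1.0, 1.0]
score_identical = information_score("LLLL", equal_weights, identity)
score_gapped = information_score("LLL-", equal_weights, identity)
score_mixed_identity = information_score("AALL", equal_weights, identity)
score_mixed_similar = information_score("AALL", equal_weights, toy_similarity)
```

With the identity matrix the entropy reduces to the Shannon entropy, so a column split evenly between two residues has an entropy of log 2 / log 20. Treating the two residues as partly similar lowers the entropy and raises the information score.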
Defining interface residues
Interface residues were defined as those that lost >1% relative solvent accessibility upon complex formation (ΔASA > 1%). Solvent accessibilities were calculated using the algorithm of Lee and Richards with a probe size of 1.4 Å (Lee and Richards 1971). All complexes with a total interface <1500 Å² were manually inspected. This involved careful reading of the literature and the PDB files to ensure that all files contained genuine biological interfaces. Water molecules were not considered. Interface residues were further classified as peripheral or central on the basis of their solvent accessibility when bound (B-ASA). A peripheral residue has a B-ASA ⩾ 7%, whereas a central residue has a B-ASA <7%. To clarify, the relationship between all of these terms is as follows: ΔASA = separated-monomer ASA − B-ASA. Sequence logos for central and peripheral residues were generated for each category using ALPRO (Schneider and Stephens 1990).
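The residue classes defined above can be expressed as a small Python function. This is a minimal sketch for illustration: the function name and argument names are hypothetical, and both inputs are assumed to be relative solvent accessibilities in percent, for the separated monomer and for the bound complex.

```python
def classify_residue(monomer_asa, bound_asa, delta_cutoff=1.0, central_cutoff=7.0):
    """Classify a residue from its relative ASA (percent) before and after binding.

    A residue is part of the interface when it loses more than delta_cutoff
    percent relative accessibility on complex formation. Interface residues are
    central when their bound accessibility (B-ASA) is below central_cutoff,
    and peripheral otherwise.
    """
    lost_asa = monomer_asa - bound_asa
    if lost_asa <= delta_cutoff:
        return "noninterface"
    return "central" if bound_asa < central_cutoff else "peripheral"


labels = [
    classify_residue(40.0, 3.0),   # buried on binding
    classify_residue(40.0, 20.0),  # partially buried on binding
    classify_residue(40.0, 39.5),  # essentially unchanged
    classify_residue(10.0, 7.0),   # B-ASA exactly at the cutoff
]
```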
We wanted to compare the interface patch with other random surface patches to see whether the former was more conserved. A surface patch was defined by taking each solvent-exposed residue and its surrounding neighbors on the unbound protein. Thus, a protein with 100 solvent-exposed residues would have 100 surface patches. To ensure that we did not measure through the protein, the following procedure was followed. A side-chain centroid was calculated for every solvent-exposed residue on the unbound protein (a whole residue centroid for glycine) and was used to calculate distances between all exposed residues. The patch was grown from the single starting (seed) residue to include all neighboring residues that were within 7 Å of it. This process was iterated using the newly acquired residues, until the total number of residues in the patch was equal to the total number of residues in the interface. When the number of neighboring residues exceeds the number of remaining places in the patch, the residues closest to the seed residue are selected first. The patch will not always expand to an adequate size, and patches with <70% of the number of residues in the actual interface are excluded from the analysis. The average residue conservation was calculated for each surface patch and the interface patch.
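The patch-growing procedure can be sketched as follows. This is one possible reading of the description, written for illustration only. The function names and the input layout (a dictionary mapping each exposed residue to its centroid coordinates) are assumptions, "within 7 Å" is taken to include exactly 7 Å, and ties in distance to the seed are broken by residue identifier.

```python
import math


def grow_patch(centroids, seed, target_size, radius=7.0):
    """Grow a surface patch from a seed residue by iterative neighbor expansion.

    centroids: dict mapping residue id -> (x, y, z) centroid of an exposed residue
    Each round adds residues within `radius` of the residues acquired in the
    previous round; if too many qualify, those closest to the seed are kept.
    """
    patch = [seed]
    frontier = [seed]
    while len(patch) < target_size and frontier:
        in_patch = set(patch)
        candidates = {
            res for res, xyz in centroids.items()
            if res not in in_patch
            and any(math.dist(xyz, centroids[f]) <= radius for f in frontier)
        }
        remaining = target_size - len(patch)
        ranked = sorted(
            candidates,
            key=lambda res: (math.dist(centroids[res], centroids[seed]), res),
        )
        frontier = ranked[:remaining]
        patch.extend(frontier)
    return patch


def surface_patches(centroids, interface_size, min_fraction=0.70, radius=7.0):
    """One patch per exposed residue, dropping patches that stay too small."""
    patches = {}
    for seed in centroids:
        patch = grow_patch(centroids, seed, interface_size, radius)
        if len(patch) >= min_fraction * interface_size:
            patches[seed] = patch
    return patches


# Toy surface: six residues spaced 5 Å apart on a line, plus one isolated residue.
toy_centroids = {i: (5.0 * i, 0.0, 0.0) for i in range(6)}
toy_centroids[99] = (100.0, 0.0, 0.0)

patch_from_zero = grow_patch(toy_centroids, seed=0, target_size=3)
toy_patches = surface_patches(toy_centroids, interface_size=3)
```

Because each round only expands from residues that are already in the patch, growth follows the surface neighbor by neighbor rather than jumping across the protein. The isolated residue cannot reach the required size and is excluded.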
The Wilcoxon signed-rank test was used for all statistical comparisons. This test was chosen because it makes minimal assumptions about the underlying distribution, but is still able to take the magnitudes of the observed differences into account. Similar results were obtained when using the binomial test and the t-test. The Z-test was used to compare the conservation of the interface relative to the conservation of all other patches on the same protein.
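The patch-based Z-test can be illustrated with a short Python sketch using only the standard library. It is a simplified illustration rather than the exact procedure used here: the function names are hypothetical, the sample standard deviation of the patch scores is an assumption (the text does not say which estimator was used), and the 1.64 cutoff is the one-sided 95th percentile quoted in the Results.

```python
from statistics import mean, stdev


def interface_z_score(interface_score, patch_scores):
    """Z score of the interface's mean conservation against all surface patches."""
    sigma = stdev(patch_scores)  # sample standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("patch scores have zero variance")
    return (interface_score - mean(patch_scores)) / sigma


def is_significantly_conserved(z_score, cutoff=1.64):
    """True when the interface exceeds the one-sided 95th percentile cutoff."""
    return z_score > cutoff


# Toy values: mean conservation of three surface patches and two interfaces.
toy_patch_scores = [0.4, 0.5, 0.6]
z_high = interface_z_score(0.7, toy_patch_scores)
z_low = interface_z_score(0.55, toy_patch_scores)
```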
We thank A. Cheng, Q. Cao, Y-H. Ding, P. Henstock, and S. Xi for helpful discussions. J.M. is supported by DoE/Krell Institute CSGF. We thank Research Collaboratory for Structural Bioinformatics (RCSB) and the numerous crystallographers for making their structures available in electronic format. Finally, we thank the anonymous reviewers for their suggestions.