• protein domains;
  • prediction of functional residues;
  • evolutionary conservation


  1. Top of page
  2. Abstract
  3. Results
  4. Discussion
  5. Materials and methods
  6. Acknowledgements
  7. References

We present a method for prediction of functional sites in a set of aligned protein sequences. The method selects sites which are both well conserved and clustered together in space, as inferred from the 3D structures of proteins included in the alignment. We tested the method using 86 alignments from the NCBI CDD database, where the sites of experimentally determined ligand and/or macromolecular interactions are annotated. In agreement with earlier investigations, we found that functional site predictions are most successful when overall background sequence conservation is low, such that sites under evolutionary constraint become apparent. In addition, we found that averaging of conservation values across spatially clustered sites improves predictions under certain conditions: that is, when overall conservation is relatively high and when the site in question involves a large macromolecular binding interface. Under these conditions it is better to look for clusters of conserved sites than to look for particular conserved sites.

Despite recent growth of the protein sequence and structure databases, there remains only a small fraction of proteins whose functions have been experimentally characterized. It is sometimes possible to infer the function of uncharacterized proteins by comparison to the sequences or structures of functionally annotated homologs. Common descent does not necessarily imply functional similarity, however (Hegyi and Gerstein 1999; Devos and Valencia 2000; Todd et al. 2001) and functional annotation transferred from one homologous protein to another can result in incorrect functional assignment. To verify functional assignments one must examine the common features conserved among homologs and attempt to identify functionally important sites.

Several investigators have considered the problem of functional site prediction using multiple sequence alignments (Casari et al. 1995; Andrade et al. 1997; Hannenhalli and Russell 2000; Li et al. 2003). Casari et al. (1995), for example, applied principal component analysis to a vector representation of protein sequences in a multidimensional “sequence space,” to derive subfamily-specific residues involved in protein function. Andrade et al. (1997) proposed a rigorous clustering algorithm based on a self-organizing map as a means to identify protein subfamilies and retrieve characteristic sequence patterns. As functional similarity can be inferred from clades in phylogenetic trees, some methods of functional site prediction use phylogenetic analysis to identify residues associated with functional divergence (Lichtarge et al. 1996; Sjolander 1998; Aloy et al. 2001; Madabushi et al. 2002; del Sol Mesa et al. 2003). The evolutionary trace (ET) method, for example, delineates invariant residues responsible for subgroup specificity by partitioning the dendrogram into an increasing number of subgroups of similar sequences with subsequent analysis of their three-dimensional (3D) structures (Lichtarge et al. 1996; Aloy et al. 2001; Madabushi et al. 2002).

Despite the efforts in this field, the accuracy of functional site predictions remains low, suggesting that it may be worthwhile to consider other aspects beyond sequence conservation. Use of structure information is one possibility, because knowledge of the protein structure is necessary for predicting many aspects of protein function (Teichmann et al. 2001). Given that functionally important surface regions often contain residues with specific characteristics, some methods attempt to identify functional sites on the basis of physicochemical properties of individual residues, their electrostatic contribution, and their location in the 3D structure (Jones and Thornton 1997; Tsai et al. 1997; Elcock 2001; Bartlett et al. 2002). Landgraf and colleagues (2001), for example, offered an automated method for functional site prediction by identifying 3D clusters of conserved residues using residue-specific (regional) and global similarity scores.

Here we present a method which is based on the assumption that the structural location of functional sites is conserved between homologous proteins and that functionally important residues tend to cluster together in space, forming three-dimensional residue clusters or surface patches. In the method considered here, each residue is assigned a score which depends on its own conservation in homologs and the conservation of residues in its spatial neighborhood, as judged from the analysis of known structures within a given protein family. We hypothesize that high-scoring sites are more likely to be involved in specific binding or catalysis, and that one may identify functionally important residues even in the absence of structural data on protein–ligand or macromolecular complexes.

We tested the method on a benchmark of 86 protein domain families, including families with a wide variety of functions and sequence diversity. To assess the accuracy of functional site predictions, we applied a rigorous receiver operating characteristic (ROC) test (see Materials and Methods). This gave us a means to compare different scoring schemes directly, by calculating the actual number of correctly predicted functional sites at a given level of false assignments. We show that including information about conserved structural features in some cases helps to make more accurate predictions, especially for DNA/RNA binding macromolecular interfaces. When sequence diversity is low, spatial averaging also helps to detect functional sites against the high background of sequence conservation.


Results

Functional site predictions based on sequence conservation and sequence conservation with spatial averaging

Functionally relevant residues in proteins are often conserved among all or a majority of members of a protein family. Accordingly, these residues can be identified from the analysis of positional conservation in multiple sequence alignments using different sequence conservation measures. Here, we employed information content and maximum likelihood estimates of the expected number of substitutions per position (substitution rate), as calculated by the PAML package (Yang 1997). We found that substitution rates performed better in terms of detecting functional sites than information content; the recognition rate at 5% false positives (R0.05) for the whole test set was 0.32 and 0.25 using PAML substitution rate and information content, respectively. This difference is especially pronounced for highly divergent domain families and could be due to the fact that the substitution rate calculated by PAML takes into account the phylogenetic history of the protein family.

To determine whether clustering of conserved residues in space and consideration of their solvent accessibility help to identify functional sites, we compared scoring functions based on sequence conservation alone and on sequence conservation with spatial averaging (see Materials and Methods). Figure 1 shows the ROC30 statistic for the contact-based scoring function with an optimized distance cutoff (the cutoff yielding the best performance for each domain family) and with a fixed distance cutoff (less than 6 Å), plotted against ROC30 values obtained with the sequence-based scoring function. As can be seen from the figure, the contact-based scoring function with the optimized distance cutoff detects more functional sites than the sequence-based scoring function for 73% of domain families. Because the optimal distance cutoff is difficult to determine a priori for each domain family, in our work we used the 6 Å distance cutoff, which has been shown to yield the best performance.

Functional site predictions for different functional categories

Analyzing different functional categories, we found that conserved contacts and solvent accessibility are particularly useful for predicting DNA/RNA-binding and protein–protein binding interfaces. The difference in recognition accuracies can be represented by ROC plots (Fig. 2A,B), which show the fraction of false positives for any given recognition rate. For example, at 5% false positives the structure-based scoring function detects about 20% of DNA/RNA-binding and 14% of protein–protein binding sites, whereas the sequence-based scoring function yields a recognition rate of 9%–10%. An improvement in the ROC30 statistic upon including structural information is also observed for DNA/RNA binding and protein–protein binding sites, as can be seen from Table 1. It was shown earlier that the level of conservation of DNA-binding and protein–protein binding sites and, as a consequence, the accuracy of their detection depend on the conservation of the entire protein sequence (Luscombe and Thornton 2002; Nooren and Thornton 2003). Given that the average sequence identity in our test families is about 30%, DNA/RNA-binding and protein–protein binding sites are also predicted with limited accuracy.

We found that the success rate in detection of catalytic sites is higher than for other types of functional sites, about 47% true positives recognized at the 5% false positive rate (Fig. 2C). The increased prediction accuracy for catalytic sites can be explained by the fact that catalytic sites apparently are under stronger selection pressure (not counting those cases where different functional groups could mediate the same catalytic mechanisms in homologous enzymes [Todd et al. 2002]), such that even families with a high degree of sequence diversity exhibit strong conservation of catalytic sites. As can be seen from Figure 2C, structure information does not seem to assist the prediction of catalytic sites. Examination of Table 1 shows that residue solvent exposure is also not a very important factor in predicting catalytic sites, which agrees with the previous observation that despite their polarity, catalytic residues have lower solvent exposure compared to other residues (Bartlett et al. 2002).

It should be noted that there is great variety among different catalytic domains. They can vary in terms of the type of enzymatic activity, the sizes of protein clefts, and interacting ligands. These factors apparently make it difficult to predict active sites using a structure-based scoring function with a fixed distance cutoff. As a consequence, the sequence-based scoring function alone gives more reliable predictions for sufficiently diverse domain families, where conserved active sites become more apparent. On the other hand, DNA/RNA binding and protein–protein binding sites very often are nonspecific and form contiguous patches on the surface of the protein. These factors apparently allow the contact-solvent-accessibility scoring function to improve detection of functional sites.

Statistical significance of functional site predictions

To compare the results obtained by our method to the outcome of random assignments, we performed a binomial test for each domain family. The number of trials in the binomial test was equal to the overall number of functional residues in a given domain alignment, and the probability of success was calculated as the number of functional residues in the alignment divided by the overall number of residues in the alignment. Using the contact-solvent-accessibility scoring function, we found that predictions of functional sites for 57% of domain families are significant with P-values <0.05 (P-value here denotes the probability of finding an equal or higher number of correctly predicted functional sites purely from the binomial distribution). Values for domains with annotated catalytic, DNA/RNA-binding, and protein–protein binding sites were 76%, 35%, and 20%, respectively. Sequence conservation scoring yielded significant predictions of catalytic sites for 65% of domains, DNA/RNA-binding sites for 24% of domains, and protein–protein interfaces for 20% of domains (50% overall). In all cases a site was predicted to be functional if it belonged to the top 5% of the most conserved sites in the domain alignment.
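The binomial test described above can be illustrated with a short Python sketch. It follows the stated definitions (number of trials equal to the number of annotated functional residues, success probability equal to the fraction of functional residues in the alignment). The function names and the example numbers are hypothetical and are not taken from the test set.

```python
from math import comb


def binomial_tail(k, n_trials, p_success):
    """Return P(X >= k) for X ~ Binomial(n_trials, p_success)."""
    return sum(
        comb(n_trials, x) * p_success**x * (1.0 - p_success) ** (n_trials - x)
        for x in range(k, n_trials + 1)
    )


def prediction_p_value(n_correct, n_functional, alignment_length):
    """P-value for finding at least n_correct functional sites by chance.

    Trials = number of annotated functional residues; success probability =
    functional residues / all residues in the alignment (as in the text).
    """
    p_success = n_functional / alignment_length
    return binomial_tail(n_correct, n_functional, p_success)


# Hypothetical family: 100 aligned positions, 10 annotated functional sites,
# 3 of which fall among the predicted sites.
p_value = prediction_p_value(n_correct=3, n_functional=10, alignment_length=100)
print(f"P = {p_value:.4f}")
```

A family would count as significantly predicted in this sketch when the returned value is below 0.05.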

These results are comparable to those of the 3D cluster analysis employed by Landgraf et al. (2001). Those investigators identified 36% of all interface residues at a threshold of less than 1% expected from reshuffled alignments and 67% at the less stringent threshold of 10%. An automated method based on the ET approach found the correct locations of catalytic residue clusters for 62 out of 80 enzymes (78% of clusters compared to 76% of catalytic domains with significant predictions found by our method) for multiple alignments with less than 30% identity (Aloy et al. 2001). Aloy et al. defined the predicted site/cluster to be correct if the overlap between the volume of predicted cluster and the volume of annotated functional site was more than 50%. Their method was considered to find a right prediction for a given protein if at least one of the predicted functional clusters was correct.

Conserved structural features help to predict functional residues for domain alignments with low sequence diversity

Our test set can be considered rather heterogeneous in terms of the sequence diversity of domain families (Table 2). For domain families with low sequence diversity, sequence and structure similarity is extensive and the degree of residue conservation is high for all positions in alignments. Sequence profiles based on low-diversity alignments perform relatively poorly in a database search (Panchenko and Bryant 2002), and we similarly found that functional residue identification is problematic in these cases. As shown in Figure 3, for low-diversity domain alignments (where the number of different amino acid types per column, Nobs is less than 5 and average sequence identity is about 45%), the average recognition rate (R0.05) is less than 0.2, whereas for more diverse alignments (Nobs is greater than 15 and average sequence identity is about 20%), the average recognition rate is twice as high. In agreement with these results, Aloy et al. (2001) reported that for multiple alignments with sequence identity of more than 30%, their method of functional site prediction has very limited applications.
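As an illustration of the diversity measure used here, the following Python sketch computes the average number of different amino acid types per alignment column (Nobs). The treatment of gap characters and the function name are assumptions made for the example; the curated CDD core alignments may not need gap handling at all.

```python
def mean_types_per_column(alignment, gap_chars="-."):
    """Average number of distinct residue types per column (Nobs).

    `alignment` is a list of equal-length aligned sequences. Gap characters
    are ignored (an assumption of this sketch).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    counts = []
    for col in range(length):
        residues = {seq[col].upper() for seq in alignment if seq[col] not in gap_chars}
        counts.append(len(residues))
    return sum(counts) / length


# Toy alignment with four columns containing 1, 2, 1, and 2 residue types.
print(mean_types_per_column(["ACDE", "ACDF", "AGDF"]))  # 1.5
```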

We found that spatial averaging nonetheless improves functional site recognition for low-diversity alignments. As can be seen from Figure 4A, the site recognition rate increases for low-diversity families upon including the structure-based term in the scoring function. The improvement in accuracy exceeds 20% for this range of diversity, mostly affecting domain families with catalytic and DNA/RNA-binding sites. Moreover, including the solvent accessibility term in the scoring function improves the prediction accuracy for families with medium sequence diversity (Nobs between 5 and 15), as shown in Figure 4B. Diverse domain families with highly conserved functional sites, on average, show a decline in recognition rate when a structure-based scoring function is used. For example, the recognition rate for a very diverse family of metal-dependent phosphohydrolases (HDc; average percent identity 18%) drops from 100% with the original sequence-based scoring to 50% with contact-based scoring. This family has a particularly conserved HD-motif, which suggests that the conservation signal is high enough to be detected by sequence-based scoring alone. Structure-based scoring in this case can flatten the overall signal by averaging the conservation measure over neighboring residues.


Discussion

In an attempt to identify functionally important sites, we present a method which quantifies the conservation of protein sites in terms of preserving amino acid types and local structural environments. First, the scoring function that accounted for the local environment and/or surface exposure of protein sites was found to perform better than sequence-based scoring alone in many cases, serving mainly as a filter to eliminate conserved but nonfunctional positions. The largest improvement was observed for predicting DNA/RNA binding sites. This observation is in agreement with previous studies, which similarly demonstrated that accounting for 3D clusters of conserved residues reduced the number of false positives identified (Landgraf et al. 2001).

Second, it was shown that the sequence divergence of domain alignments is a prerequisite for the successful functional prediction, and structurally conserved features help to discriminate functional and nonfunctional sites for families with low sequence diversity. Accordingly, to increase blind prediction accuracy we can formulate several rules based on these observations. The first: To predict functional residues for low-diversity families, whenever possible diversify them with more distantly related family representatives and, if not possible, use a structure-based scoring function. The second rule can be applied if the general function of the domain family is known: Whenever possible use contact-based and solvent accessibility-based scoring for predicting DNA/RNA binding and protein–protein binding sites; for catalytic sites use a contact-based scoring function for low-diversity families and the original sequence-based scoring function for all others. If a blind prediction of functional residues is being attempted, the simple strategy would be to apply these rules for initial family screening and then define functional residues as those having conservation scores among the top 7%, 6%, and 5% of conservation scores for catalytic, DNA/RNA binding, and protein–protein binding sites, respectively. These conservation score cutoffs correspond approximately to the error rate of 5% false positives.
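The rules above can be written down as a small decision helper. The Python sketch below is only one reading of them: the category labels, the function names, the Nobs < 5 definition of a low-diversity family (borrowed from the Figure 3 discussion), and the fallback for families of unknown function are assumptions of the example. It also assumes that a higher score means stronger conservation.

```python
# Score cutoffs quoted in the text (top fraction of conservation scores).
TOP_FRACTION = {"catalytic": 0.07, "dna_rna": 0.06, "protein_protein": 0.05}


def choose_scoring(category, nobs, low_diversity_cutoff=5.0):
    """Pick a scoring scheme from the family function and its diversity.

    `category` may be "catalytic", "dna_rna", "protein_protein", or anything
    else for unknown function. The Nobs cutoff is an assumption.
    """
    low_diversity = nobs < low_diversity_cutoff
    if category in ("dna_rna", "protein_protein"):
        return "contact+solvent"
    if category == "catalytic":
        return "contact" if low_diversity else "sequence"
    # Unknown function: only the first rule applies (assumed fallback).
    return "contact" if low_diversity else "sequence"


def predict_top(scores, category):
    """Return sorted indices of positions within the top score fraction."""
    fraction = TOP_FRACTION[category]
    n_top = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_top])


print(choose_scoring("catalytic", nobs=4.0))   # contact
print(choose_scoring("catalytic", nobs=12.0))  # sequence
```

If substitution rates are used as the conservation measure, they would first have to be converted so that conserved sites receive the higher score.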

As we showed, spatial averaging does not always help the function prediction, and prediction accuracy still remains quite low. Madabushi et al. (2002) demonstrated that the number of clusters (or size of the largest cluster) of functional residues determined by the ET method was larger than the number of clusters predicted by random simulations for 98% of their test cases (at the significance level of 5%). It should be noted that this result does not imply that the ET method is able to correctly identify active sites for 98% of test proteins at the 5% significance level. Similarly to Landgraf et al. (2001), we showed that the accuracy of functional site prediction, in fact, was far from reaching 100%. Applying ROC analysis we found that 47% of active sites, 20% of DNA/RNA binding sites, and 14% of protein–protein interfaces can be predicted at a 5% false positive rate. We note that the limited accuracy of functional prediction can be caused by the differences in functional specificity among homologous family members as well as by the functional plasticity of protein molecules. Even proteins sharing the same evolutionary origin and functional activity may show variability in the physicochemical properties of functional residues and their location in a 3D structure (Todd et al. 2001, 2002; Lichtarge and Sowa 2002).

Materials and methods


A benchmark for evaluating the methods of functional sites prediction

We selected 86 domain alignments from the curated Conserved Domain Database (CDD), a current version of which is available at (Marchler-Bauer et al. 2002). Multiple alignments in the CDD have been manually curated to reconcile sequence alignments with protein 3D structures and structure-structure alignments. Based on the crystal structures and experimental data from the literature, conserved functional sites have been annotated for each CDD domain by inspection of protein–ligand, protein–DNA/RNA, and protein–protein complexes for all structure representatives. Functionally important sites were defined as those residues making contacts with a ligand or a macromolecule. CDD alignments represent alignments of conserved core structures formed by presumably homologous sites, and positions outside the conserved cores are removed from the alignment, resulting in alignment lengths between 38 and 576 residues.

The selected test set covered a broad range of different functional categories including 37 domains with annotated catalytic sites, 17 domains with annotated DNA/RNA binding sites, 20 domains with annotated protein–protein binding sites, and domains from other functional groups (domains containing disulfide bonds and domains with less than two annotated functional sites were excluded). Names of CDD families used in the test set together with their sequence diversity, length, the number and the type of functional sites are listed in Table 2. By definition, CDD alignments have at least one structural family representative, whereas in our test set the number of structures per family ranged from 1 to 15, with three structures per family on average.

Calculation of sequence conservation

We used two different measures to estimate the level of conservation at each position in CDD alignments. The first measure, information content, was based on counting the number of different amino acid types per aligned column and inferring the relationships between amino acid types with the pseudocount method (Altschul et al. 1997), where pseudocount frequencies were calculated using the PAM70 amino acid substitution matrix. The second measure of evolutionary conservation of different sites, the substitution rate per site, was calculated using the PAML3.12 package (Yang 1997) with its implementation of the Jones, Taylor, and Thornton amino acid substitution model (Jones et al. 1992), where the variable substitution rates across sites were described with the γ-model. Phylogenetic trees required for this analysis were constructed by the neighbor-joining method (Saitou and Nei 1987) with the PHYLIP package (Felsenstein 1989).
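For orientation, a simplified version of the first measure can be sketched in Python. This is not the calculation used here: it omits the PAM70-based pseudocounts and uses a uniform background over 20 amino acid types, both of which are simplifying assumptions, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def column_information(column, gap_chars="-."):
    """Shannon information content (bits) of one alignment column.

    Computed as log2(20) minus the column entropy, with a uniform background
    and no pseudocounts (simplifications relative to the text).
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    entropy = -sum(
        (n / total) * log2(n / total) for n in Counter(residues).values()
    )
    return log2(20) - entropy


def alignment_information(alignment):
    """Information content for every column of a list of aligned sequences."""
    return [column_information(col) for col in zip(*alignment)]


# An invariant column scores log2(20) bits; a 50/50 column scores one bit less.
print(alignment_information(["AC", "AC", "AD", "AD"]))
```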

Scoring the clusters of conserved residues

For each position in the alignment, two regional conservation scores were calculated. The first one represented the average over conservation scores for residues located within a given distance from each position “i” of the alignment, namely,

  $$C_i^{\mathrm{cont}} = \frac{1}{n}\sum_{j=1}^{N} \Delta_{ij}\, C_j \qquad (1)$$

where Δij is equal to 1 if residues i and j are in contact, and 0 otherwise. Cj is the residue conservation score of residue j, N is the total number of positions in the alignment, and n is the number of residues in contact with residue “i.” Contacts were defined between the virtual Cβ atoms (points 2.4 Å away from Cα atom) of residues separated along the chain by at least five peptide bonds and having the distance less than a given distance cutoff (4, 5, 6, 7, 8, and 9 Å). It should be noted that contacts were calculated for all structural representatives of domain alignments, and only conserved contacts were used in the evaluation of Ccont. The contact between positions i and j was defined as conserved if aligned residues in these positions formed the contact in all structural representatives. For those residues which did not make any contacts, the original residue conservation value was assigned. Inter-residue contacts conserved between all structural representatives were shown to increase prediction accuracy for 60% of domain families (for families with more than one structure) compared to the scoring function based on one representative structure (data not shown).
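The contact-based score can be sketched in Python as follows. The sketch assumes that each structure is supplied as a list of virtual Cβ coordinates indexed by alignment position, and that sequence separation can be measured in alignment positions; the function names and this input format are assumptions of the example. Residues without conserved contacts keep their original conservation value, as stated above.

```python
from math import dist


def contact_set(coords, cutoff=6.0, min_separation=5):
    """Pairs (i, j) closer than `cutoff` Å and at least 5 positions apart."""
    contacts = set()
    n = len(coords)
    for i in range(n):
        for j in range(i + min_separation, n):
            if dist(coords[i], coords[j]) < cutoff:
                contacts.add((i, j))
    return contacts


def conserved_contacts(structures, cutoff=6.0):
    """Contacts present in every structural representative."""
    return set.intersection(*(contact_set(c, cutoff) for c in structures))


def contact_scores(conservation, contacts):
    """Average the conservation of contacting residues for each position."""
    neighbors = {i: [] for i in range(len(conservation))}
    for i, j in contacts:
        neighbors[i].append(j)
        neighbors[j].append(i)
    scores = []
    for i, own_score in enumerate(conservation):
        partners = neighbors[i]
        if partners:
            scores.append(sum(conservation[j] for j in partners) / len(partners))
        else:
            scores.append(own_score)  # no contacts: keep the original value
    return scores


# Two toy structures of seven residues on a line, 1 Å apart; in the second
# structure the last residue is moved away, so only contact (0, 5) is conserved.
line = [(float(i), 0.0, 0.0) for i in range(7)]
moved = line[:6] + [(100.0, 0.0, 0.0)]
shared = conserved_contacts([line, moved])
print(contact_scores([1, 2, 3, 4, 5, 6, 7], shared))
```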

The second regional conservation score gave emphasis to solvent accessible residues, because these residues are very often involved in the formation of functionally important interfaces:

  $$C_i^{\mathrm{cont\text{-}solv}} = \frac{\Delta_i^{\mathrm{solv}}}{n}\sum_{j=1}^{N} \Delta_{ij}\, C_j \qquad (2)$$

where Δsolv is equal to 1, if solvent accessibility of position “i” is greater than 0.05, and 0 otherwise. Reversing equation 2 and considering only buried residues in contact did not improve the prediction accuracy (data not shown). The cutoff threshold of 0.05 was derived from an analysis of homologous protein structures forming a conserved hydrophobic interior (Miller et al. 1987). Solvent-accessible area was calculated by the DSSP algorithm (Kabsch and Sander 1983), where solvent accessibility of residue “X” was defined as the ratio of its solvent-accessible area in protein structure to that for extended tripeptide Gly-X-Gly. The solvent accessibility of position “i” in a multiple alignment was calculated by averaging solvent accessibility values in a given position for all structural representatives.
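The solvent accessibility indicator can be illustrated with a short Python sketch. It assumes that per-residue accessible areas and the Gly-X-Gly reference areas are supplied by the caller (for example from DSSP output and a published reference table); no reference values are included here, and the function names are hypothetical.

```python
def relative_accessibility(area, reference_area):
    """Ratio of a residue's accessible area to its Gly-X-Gly reference area."""
    return area / reference_area


def position_accessibility(per_structure_values):
    """Average relative accessibility at each position over all structures.

    `per_structure_values` holds one list of relative accessibilities per
    structural representative, indexed by alignment position.
    """
    n_structures = len(per_structure_values)
    return [sum(column) / n_structures for column in zip(*per_structure_values)]


def exposed_mask(mean_accessibility, threshold=0.05):
    """Indicator that is 1 where mean accessibility exceeds the threshold."""
    return [1 if value > threshold else 0 for value in mean_accessibility]


# Two toy structures, three aligned positions.
mean_acc = position_accessibility([[0.0, 0.10, 0.04], [0.02, 0.30, 0.08]])
print(exposed_mask(mean_acc))  # [0, 1, 1]
```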

Evaluation of prediction accuracy

To evaluate the accuracy of functional site predictions, we calculated the number of correctly predicted functional sites (true positives) and the number of incorrectly predicted functional sites (false positives) found at different thresholds of conservation score. True positives were identified as those functionally important sites which had scores higher than a given score threshold. False positives, in turn, were identified as sites with scores higher than a given threshold, but unrelated to the functional activity of a given domain family. To measure the performance of retrieval methods, the truncated receiver operating characteristic (ROC) has been widely used (Gribskov and Robinson 1996; Schaffer et al. 2001). The ROCn statistic was calculated as the sum of the numbers of true positives found at the 1, 2, 3, …, n false positive levels ($t_i$), divided by n times the overall number of true positives (T): $\mathrm{ROC}_n = \frac{1}{nT}\sum_{i=1}^{n} t_i$. Here, the total number of true positives (T) was calculated as the total number of annotated functionally important sites in a given domain family, whereas the total number of false positives was equal to the difference between the total number of sites in the alignment and the number of functional sites annotated for a family. Knowing the number of true positives detected and the overall number of true positives, it is possible to calculate the fraction of true positives detected and, correspondingly, the fraction of false positives detected, and plot them in the order of decreasing score threshold (see Fig. 2). The false positive cutoff n was set to 30, which corresponds approximately to the first quarter of false positives detected. In those cases where the prediction performance was compared for different families with different numbers of false positives, the recognition rate at 5% false positives (R0.05) was used.
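Both statistics can be sketched in Python. The sketch assumes that a higher score means a stronger prediction, breaks ties by input order, and, when a family has fewer than n nonfunctional sites, normalizes by the number of false positives actually available; these choices and the function names are assumptions of the example rather than details given in the text.

```python
def _ranked_labels(scores, is_functional):
    """Labels ordered by decreasing score (ties keep input order)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [is_functional[i] for i in order]


def roc_n(scores, is_functional, n=30):
    """Truncated ROC: sum of true positives above each of the first n
    false positives, divided by n times the total number of true positives."""
    total_true = sum(1 for flag in is_functional if flag)
    if total_true == 0:
        raise ValueError("no functional sites annotated")
    true_seen = false_seen = area = 0
    for functional in _ranked_labels(scores, is_functional):
        if functional:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen
            if false_seen == n:
                break
    if false_seen == 0:
        return 1.0  # every site is functional
    # Assumption: use the available false positives when fewer than n exist.
    return area / (false_seen * total_true)


def recognition_rate(scores, is_functional, max_false_fraction=0.05):
    """Fraction of functional sites found while admitting at most the given
    fraction of nonfunctional sites (R0.05 by default)."""
    total_true = sum(1 for flag in is_functional if flag)
    total_false = len(is_functional) - total_true
    allowed = int(max_false_fraction * total_false)
    true_seen = false_seen = 0
    for functional in _ranked_labels(scores, is_functional):
        if functional:
            true_seen += 1
        else:
            false_seen += 1
            if false_seen > allowed:
                break
    return true_seen / total_true


print(roc_n([0.9, 0.8, 0.7, 0.6, 0.5, 0.4], [1, 0, 1, 0, 0, 1], n=2))  # 0.5
```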

Table 1. Average ROC30 values calculated with different scoring functions for different functional categories of test domains: catalytic, DNA/RNA-binding and protein-protein binding domains
Scoring function | Catalytic sites | DNA/RNA binding sites | Protein-protein interfaces | All
Subst. rates | 0.49 | 0.32 | 0.18 | 0.37
Subst. rates + contacts | 0.48 | 0.38 | 0.20 | 0.38
Subst. rates + contacts + solv. acc. | 0.42 | 0.44 | 0.22 | 0.37

Table 2. Names of 86 CDD families used together with the pdb codes of their first structures, average sequence identities of family alignments (average number of different amino acid types per column, Nobs), alignment lengths, and the overall numbers of functional sites
Name | Pdb code | %identity/Nobs | Length | Domain description | Number of functional sites (a)
35EXOc | 2kzm | 21/12 | 101 | 3′–5′ exonuclease | 5 (5C)
53EXOc | 1xo1 | 35/8 | 213 | 5′–3′ exonuclease | 8 (8C)
ACTIN | 1dga | 34/9 | 305 | Actin | 24 (18P)
ADF | 1cof | 24/10 | 115 | Actin depolymerisation factor/cofilin-like domains | 10 (10P)
alkPPc | 1elz | 40/6 | 325 | Alkaline phosphatase homologs | 13 (13C)
Aminopeptidase | 1b65 | 30/6 | 59 | L-Aminopeptidase domain | 2 (2C)
AP2 | 1gcc | 47/5 | 59 | DNA-binding domain found in transcription regulators in plants | 11 (11D)
AP2Ec | 1qtw | 28/8 | 211 | AP endonuclease family 2 | 13 (9C)
Arfaptin | 1i4d | 25/6 | 193 | Arfaptin domain | 11 (11P)
BPI | 1bpl | 16/11 | 131 | BPI/LBP/CETP domain | 13
C1 | 1faq | 28/11 | 43 | Protein kinase C conserved region 1 | 8 (8C)
C2 | 1dqv | 26/14 | 63 | Protein kinase C conserved region 2 | 4 (4C)
CASc | 1cp3 | 37/8 | 203 | Caspase, interleukin-1 β converting enzyme homologs | 16 (16C)
CBM9 | 1i82 | 39/6 | 145 | Family 9 carbohydrate-binding module | 18
CH | 1aoa | 22/13 | 75 | Calponin homology domain | 36 (36P)
ChtBD3 | 1aiw | 30/9 | 38 | Chitin/cellulose binding domain | 2
cNMP (CAP_ED) | 1rgs | 19/14 | 91 | Cyclic nucleotide-monophosphate binding domain | 4
CPT | 1qhx | 38/3 | 170 | Chloramphenicol phosphotransferase | 21 (15C)
DED | 1a1z | 22/7 | 61 | Death effector domain | 9 (9P)
DEXDc | 1d9x | 25/15 | 96 | DEAD-like helicases superfamily | 9
DSPc | 1vhr | 28/8 | 118 | Dual specificity phosphatases | 6 (6C)
DSRM | 1di2 | 26/13 | 56 | Double-stranded RNA binding motif | 12 (12D)
ENDO3c | 1muy | 22/9 | 125 | Endonuclease III | 19 (8C)
eu-GS | 2hgs | 38/5 | 442 | Eukaryotic glutathione synthetase | 29 (7C)
fer2 | 1b9r | 26/15 | 60 | 2Fe-2S iron-sulfur cluster binding domain | 10
FGF | 1qqk | 32/8 | 113 | Acidic and basic fibroblast growth factor family | 22 (22P)
FH | 1e17 | 57/7 | 52 | Forkhead, winged helix | 5 (5D)
FlpREC | 1flo | 34/4 | 338 | Flp recombinase domain | 7 (7C)
FYVE | 1vfy | 35/9 | 55 | FYVE, zinc-binding domain | 13
G-α | 1azt | 39/10 | 304 | G protein α-subunit | 61 (52P)
GlcAT-I | 1fgg | 44/6 | 213 | β, 3-glucuronyltransferase I domain | 12 (12C)
Glm_e | 1ccw | 51/4 | 368 | Coenzyme B12-dependent enzyme glutamate mutase | 14 (14C)
GuKc | 1gky | 27/10 | 130 | Guanylate kinase homologs | 15 (10C,4P)
GYF | 1gyf | 26/7 | 56 | GYF-domain | 16 (16P)
H15 | 1hst | 33/11 | 77 | linker histone 1 and histone 5 domains | 15 (15D)
H2A | 1aoi | 65/4 | 114 | Histone 2A | 7 (7D)
HDc | 1f0j | 18/16 | 91 | Metal-dependent phosphohydrolases with conserved ‘HD’ motif | 4 (4C)
HECTc | 1c4z | 29/8 | 312 | HECT domain | 29 (14C,15P)
HELICc | 1d2m | 17/16 | 130 | Helicase superfamily C-terminal domain | 16 (13D)
HPT | 1qsp | 21/10 | 86 | Histidine Phosphotransfer domain | 5
HTH_ARSR | 1smt | 23/13 | 71 | Arsenical Resistance Operon Repressor | 26 (24D)
HTH_XRE | 1lmb | 22/15 | 51 | Helix-turn-helix XRE-family like proteins | 7 (7D)
KISc | 3kar | 43/10 | 245 | Kinesin motor, catalytic domain, ATPase | 8
LIGANc | 1dgs | 44/7 | 284 | NAD+ dependent DNA ligase adenylation domain | 10 (1C)
LMWPc | 1dlp | 34/15 | 112 | Low-molecular-weight phosphatase family | 6 (6C)
MADS | 1mnm | 43/4 | 85 | MCM1, Agamous, Deficiens, and serum response factor domain | 6 (6D)
MBD | 1qk9 | 31/6 | 61 | Methyl-CpG binding domain | 8 (8D)
Mog1 | 1eq6 | 37/4 | 165 | Homolog to Ran-Binding Protein Mog1p | 22 (22P)
MYSc | 2mys | 41/11 | 576 | Myosin, large ATPases | 16
PAX | 1pdn | 68/3 | 128 | Paired Box domain | 34 (34D)
PDZ | 3pdz | 24/15 | 62 | PDZ domain | 12 (12P)
PI3Kc | 1e8x | 26/10 | 272 | Phosphoinositide 3-kinase, catalytic domain | 35 (27C)
PIPKc | 1bo1 | 36/6 | 264 | Phosphatidylinositol phosphate kinases | 45 (37C)
PLCc | 1gym | 28/8 | 189 | Phospholipase C, catalytic domain | 11 (11C)
PNPsynthase | 1ho4 | 44/6 | 230 | Pyridoxine 5′-Phosphate synthase domain | 18 (18C)
POLXc | 2bpf | 40/6 | 294 | DNA polymerase X family | 13 (3C,10D)
PP2Ac | 1aui | 37/7 | 235 | Protein phosphatase 2A homologs, catalytic domain | 16 (13C)
PP2Cc | 1a6q | 26/13 | 178 | Serine/threonine phosphatases, family 2C, catalytic domain | 9 (9C)
PRCH | 1prc | 50/6 | 224 | Photosynthetic reaction center complex, subunit H | 6
PROF | 1dlj | 32/8 | 108 | Profilin | 17 (11P)
PTB | 2nmb | 17/11 | 113 | Phosphotyrosine-binding domain, phosphotyrosine-interaction domain | 10
PTPc | 2shp | 37/13 | 195 | Protein tyrosine phosphatase | 6 (6C)
PTS_IIA_fru | 1a6j | 31/7 | 118 | PTS system, fructose/mannitol specific IIA subunit | 2 (2C)
PTS_IIA_lac | 1e2a | 38/5 | 99 | PTS system, lactose/cellobiose specific IIA subunit | 7 (7C)
PTS_IIA_man | 1pdo | 27/9 | 100 | PTS system, mannose/sorbose specific IIA subunit | 7 (7C)
PTS_IIB_glc | 1iba | 32/7 | 81 | PTS system, glucose/sucrose specific IIB subunit | 7 (7C)
PTS_IIB_lac | 1h9c | 36/4 | 98 | PTS system, lactose/cellobiose specific IIB subunit | 7 (7C)
RA | 1rax | 20/11 | 66 | RasGTP binding domain from guanine nucleotide exchange factors | 13 (13P)
RhoGAP | 1am4 | 26/13 | 138 | GTPase-activator protein for Rho-like GTPases | 5 (5P)
RPA | 1ewi | 19/9 | 48 | Human Replication Protein A | 7 (7D)
S4 | 1dm9 | 23/13 | 51 | S4/Hsp/tRNA synthetase RNA-binding domain | 5 (5D)
SAM | 1b0x | 21/13 | 57 | Sterile alpha motif | 5 (4P)
SEC14 | 1aua | 18/13 | 129 | Sec14p-like lipid-binding domain | 16
Sec7 | 1pbv | 26/10 | 178 | Sec7 domain | 22 (22C)
SERPIN | 1ova | 34/12 | 280 | Serine proteinase inhibitor | 14 (14P)
SH2 | 2shp | 29/16 | 70 | Src homology 2 domains | 8
SNc | 2sns | 30/9 | 91 | Staphylococcal nuclease homolog | 7 (7C)
TBOX | 1xbr | 43/7 | 169 | T-box DNA binding domain | 25 (25D)
TNF | 1a8m | 23/9 | 103 | Tumor necrosis factor | 7 (7P)
Topo6_Spo | 1d3y | 32/8 | 250 | DNA topoisomerase VI subunit A | 4 (4C)
ToxGAP | 1he1 | 41/4 | 116 | GTPase-activating protein domain | 15
UBCc | 2ucz | 29/13 | 129 | Ubiquitin-conjugating enzyme B2 and UBC homologs | 6 (5P,1C)
VWA | 1dzi | 19/16 | 119 | von Willebrand factor type A domain | 5 (5P)
XPG | 1a76 | 32/8 | 254 | Xeroderma pigmentosum G N- and I-regions | 38 (8C,32D)
ZnF_GATA | 2gat | 45/6 | 51 | Zinc finger DNA binding domain | 19 (17D)
ZnMc | 1smp | 31/12 | 91 | Zinc-dependent metalloprotease | 7 (7C)
(a) Number of active sites, DNA/RNA binding and protein-protein binding sites are denoted by letters C, D, and P, respectively, and shown in parentheses.

Figure 1. The ROC30 statistic for each domain family obtained with the contact-based scoring function (equation 1) and optimized distance interval cutoff is plotted vs. ROC30 values calculated with the original sequence-based scoring function (triangles). The ROC30 statistic for each domain family obtained with the contact-based scoring function (equation 1) and the distance cutoff less than 6 Å is also shown.


Figure 2. The fraction of correctly identified DNA/RNA binding sites (A), protein–protein binding sites (B), and catalytic sites (C) is plotted against the fraction of incorrectly identified functional sites for different scoring functions: the original sequence-based scoring function (solid line) and contact-solvent-accessibility-based scoring function (equation 2; dashed line). The contact-based scoring function (equation 1) is used in case of catalytic site prediction. The contacts are defined between residues separated by a distance of 6 Å.


Figure 3. The site recognition rate (R0.05) obtained with the sequence-based scoring function is plotted for different sequence diversity ranges. Domain family diversity is calculated as the average number of different amino acid types per column in the CDD alignment. Results are shown as a boxplot (Chambers 1998), where the central line in each box shows the median recognition rate within a given bin of diversity, the upper and lower boundaries of the box show the upper and lower quartiles, and the vertical lines extend to a value 1.5 times the interquartile range. Outlier values beyond these ranges are shown as individual points.


Figure 4. Improvement in the site recognition rate upon including the structural term in the scoring function is plotted vs. the sequence diversity of domain families. The difference in recognition rate is calculated as the average recognition rate (R0.05) obtained with the contact-based scoring function (A) or contact-solvent-accessibility scoring function (B) minus the average recognition rate for the sequence-based scoring function.



Acknowledgements

We thank John Spouge, Ben Shoemaker, and Michael Galperin for helpful suggestions, and the NIH Intramural Research Program for support.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

