One of the goals of structural genomics is to obtain a structural representative of almost every fold in nature. A recent estimate suggests that 70%–80% of soluble protein domains identified in the first 1000 genome sequences should be covered by about 25,000 structures—a reasonably achievable goal. As no current estimates exist for the number of membrane protein families, however, it is not possible to know whether family coverage is a realistic goal for membrane proteins. Here we find that virtually all polytopic helical membrane protein families are present in the already known sequences so we can make an estimate of the total number of families. We find that only ∼700 polytopic membrane protein families account for 80% of structured residues and ∼1700 cover 90% of structured residues. While apparently a finite and reachable goal, we estimate that it will likely take more than three decades to obtain the structures needed for 90% residue coverage, if current trends continue.
An important goal of structural biology and structural bioinformatics is to assign every part of every protein sequence to a function and to a unique fold for the independently structured regions (Holm 1998; Brenner and Levitt 2000; Heger and Holm 2000; Mittl and Grutter 2001; Chothia et al. 2003). While it is not possible to solve the structures of all folded proteins, the crude structure of most protein domains could be obtained through fold recognition and homology modeling methods if a representative structure is available for all sequences (Zhang and Kim 2003). Thus, one of the goals of structural genomics is to expand our library of folds with known structure (Vitkup et al. 2001). It is therefore of interest to know how many structures will be required to have a representative fold for most proteins.
To assign a sequence to a particular fold, there must be some way to recognize the compatibility of the sequence for the known structure. If the sequences of two proteins are similar, the structures are also very likely to be similar. Thus, an upper estimate of the number of structures required to span most protein sequences is the number of protein families that can be identified based on sequence similarity alone. Estimates for the number of soluble protein families have ranged widely (Orengo et al. 1994; Wolf et al. 2000; Kunin et al. 2003), but recent results suggest that there will be many more than 250,000 families (Yan and Moult 2005). Indeed, there are currently >250,000 families with at least two members in the ProDom family database (Servant et al. 2002). Different operational definitions and strategies for defining families account for some differences in estimates (Kunin et al. 2005), but the variation is also attributable to the observation that the number of soluble protein families is constantly increasing with the number of new sequences (Kunin et al. 2003; Yan and Moult 2005). Nevertheless, because of the uneven distribution of proteins in families, Yan and Moult predict that, for the first 1000 genomes ∼80% of all structured soluble protein domains should be represented by only 25,000 families (Yan and Moult 2005). This is a daunting, but reasonably achievable, goal for structural genomics.
Not all sequence families adopt different folds, however, so protein fold space is much smaller than the number of families (Murzin et al. 1995; Holm and Sander 1997; Orengo et al. 1997; Hou et al. 2003). There are only a few predominant protein folds in soluble proteins and many rare folds (Govindarajan et al. 1999). Similarly membrane proteins also appear to have an uneven distribution of folds (Ubarretxena-Belandia and Engelman 2001). To the extent that a structural relationship can be recognized in the absence of easily identifiable sequence similarity, the number of structures needed to span the sequences can be reduced. Thus, the minimum number of structures needed is the number of folds that exist. Several estimates of the number of soluble protein folds have been presented that utilize different models of how folds distribute over sequence families. The estimated number of folds ranges widely from 700 to ∼8000 based on differences in assumptions and data sets (Orengo et al. 1994; Zhang and DeLisi 1998; Govindarajan et al. 1999; Wolf et al. 2000; Coulson and Moult 2002; X. Liu et al. 2004). The fact that there are already 898 soluble protein folds, according to the current release of the SCOP database (v1.69) (Murzin et al. 1995), indicates that the number will greatly exceed 1000.
Membrane proteins present special technical challenges, so the pace of membrane protein structure determination is much slower. While membrane proteins account for over a quarter of all proteins, they represent roughly 0.3% of the >34,000 structures currently deposited in the Protein Data Bank (PDB). Nevertheless, there may be considerably fewer membrane protein families and folds so that fewer structures may be needed to assign most of the sequence space to structures. In a study of 26 proteomes, Liu and colleagues (Y. Liu et al. 2002, 2004) found 214 polytopic membrane protein families and 1922 soluble protein families containing at least four members, and concluded that there are about 10 times more soluble protein families than membrane protein families. Without information on how this ratio scales with an increase in family number, however, these results do not yield an estimate of the total number of families.
To parse membrane protein sequence space into families, we developed a membrane protein specific family building algorithm, which differs in several ways from other algorithms that are optimized primarily for soluble proteins (Sonnhammer and Kahn 1994; Linial et al. 1997; Park and Teichmann 1998; Gouzy et al. 1999; Enright and Ouzounis 2000; Heger and Holm 2001; Krause et al. 2005; Yan and Moult 2005). First, for sequence similarity detection, sequences are normally masked to eliminate low complexity regions. Masking often eliminates membrane spanning segments, however, so we adjusted masking to maintain them. Second, we utilized predictions of TM helices to help define domain boundaries. Finally, we made use of the recent finding that there are fewer multi-domain membrane proteins, which allowed us to be less restrictive in combining overlapping families than would be possible with soluble proteins (Y. Liu et al. 2004). Analysis of the families built by our algorithm suggests that there are indeed many fewer membrane protein families and folds than are found in soluble proteins.
Results and Discussion
Family size distribution in membrane proteins
We applied our family building algorithm, described in the methods section, to a starting data set of 86,033 membrane protein sequences derived from 95 genomes and the Swiss-Prot database. We found 4075 total polytopic families with two or more members. There were 8273 singletons (no hits better than an e-value of 10^{-5}), and 13,644 sequences were lost when monotopic membrane protein families were removed. The polytopic families include 64,116 sequences, which account for 82.5% of the 77,760 membrane protein sequences that remain after excluding singletons.
The distribution of sequences is strongly weighted toward a few large families. Figure 1 shows the average size distribution of the families in blocks of 20, going from the largest to the smallest families. The average size of the 20 largest families for membrane proteins was 1275 members, and the average size of the next 20 largest dropped rapidly to 336. Thus, there are only a few very large families and many very small families in membrane proteins, similar to the observation for soluble proteins (Govindarajan et al. 1999; Chothia et al. 2003).
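The block averaging used for this kind of size distribution can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `block_averages` and the toy family sizes are hypothetical, and the block size is reduced to 4 only to keep the example short.

```python
def block_averages(family_sizes, block=20):
    """Mean family size per consecutive block, largest families first."""
    ordered = sorted(family_sizes, reverse=True)
    averages = []
    for start in range(0, len(ordered), block):
        chunk = ordered[start:start + block]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical family sizes (not data from this study).
toy_sizes = [900, 400, 120, 60, 30, 12, 6, 4, 3, 2, 2, 2]
toy_averages = block_averages(toy_sizes, block=4)
```

A steep drop between the first and second entries of the returned list is the signature of a distribution dominated by a few large families.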
The largest families are listed in Table 1. Among the largest families are various signaling proteins, largely of eukaryotic origin, including GPCRs and potassium channels (Chou and Elrod 2002; Harte and Ouzounis 2002; Sansom et al. 2002). A number of different transporter families are also prominent, including the major facilitator family of secondary transporters and the ABC transporter family (Saier et al. 1999; Driessen et al. 2000; Schmitt and Tampe 2002). Proteins involved in energy production (cytochrome b and NADH ubiquinone oxidoreductases) are also found in the largest families (Berry et al. 2000; Friedrich and Bottcher 2004). The family of protein kinases represents an error in our automated method as they are single-pass proteins with large soluble domains. Most single-pass proteins are eliminated from our families, but the protein kinases are expressed with signal sequences, which were identified as a second TM domain in our algorithm.
Table 1. The largest membrane protein families
Validation of family building algorithm: Families versus domains
With an ideal family building algorithm, family and folding domain would be equivalent units. In other words, all families would contain a single folding domain and different families would represent different folds. In reality, this is probably impossible to achieve because there is insufficient information in the sequence alone. For example, some domains are constructed from noncontiguous regions of the primary sequence and will tend to be separated in family building. Alternatively some structurally distinguishable domains are never found alone and are contiguous in sequence, so there is no indication in the sequence database that the domains are separable. Nevertheless, we can examine how close we have come in distinguishing domains by seeing how well our family building algorithm divides known folds into different families.
Since our algorithm is optimized for membrane proteins, we would ideally test it with known membrane protein folds. In particular, our family merging protocol was specifically designed with membrane proteins in mind. Membrane proteins more rarely contain multiple domains, so we were able to use a more liberal joining of overlapping families than would be allowable with soluble proteins. As there are so few membrane protein structures, however, we decided to perform a less ideal test, using all protein classes.
We first constructed a sequence database containing known structural domains, starting with SCOP (v1.65), which organizes 20,619 structures from the PDB into folding domains. We obtained 9643 sequences for the SCOP structural domains from the Astral database (Chandonia et al. 2004), after removing sequences with >95% sequence identity with other members. From the original 9643 sequences, we then removed sequences with structural domains constructed from noncontiguous regions of sequence, as well as all sequences <75 residues, and were left with 7784 sequences. Many of these domains are elements of larger proteins, so we next retrieved the full length sequences available in the Swiss-Prot database and were able to retrieve 4931 sequences containing one or more of the original SCOP domains. To these sequences we added soluble protein sequences randomly picked from our genomic sequences to create a soluble protein data set similar in size to our membrane protein data set.
We applied our family building algorithm to a final set of 90,823 nonredundant sequences and obtained 11,339 families with at least two members. After discarding families that had no SCOP domains, 1704 families remained. In the 1704 families, 1315 contained only one SCOP domain and 389 contained more than one SCOP domain. Thus, under a quarter of the families included multiple SCOP domains. Most of these do not represent errors in the algorithm, however, but rather limitations in the starting data. In particular, if no isolated domains are present in the sequence database, there is no way for the family building algorithm to separate them. Indeed, for all but 112 of the families with more than one fold, the domains are contiguous in the sequence. As pointed out by Liu and colleagues (Y. Liu et al. 2004), joining domains will be much less problematic for membrane proteins, which are less often composed of multiple domains. Thus, these results likely represent the worst case scenario. Even so, the vast majority (>75%) of the families correspond to a single SCOP domain.
Validation of family building algorithm: Family boundaries
We sought sequence families representing domains that span the membrane-embedded segments of membrane proteins and do not include flanking soluble domains. We define ideal TM domain boundaries as the N-terminal residue that enters and the C-terminal residue that exits the membrane hydrocarbon core. To assess how well our families achieve this ideal, we examined how well the sequence family boundaries correspond to the membrane boundaries for a number of known membrane protein structures. The membrane boundary was defined from the structure by finding the most hydrophobic 30 Å slab in the membrane plane (Chamberlain and Bowie 2004).
We compared the family regions with the SCOP domains of 36 PDB structures consisting of 46 polytopic transmembrane chains, from the 40% nonredundant helical membrane protein domain entries in SCOP v1.69 (Murzin et al. 1995; Chandonia et al. 2004), provided by the Astral Web site. Table 2 compares the structure-defined membrane boundaries with our family boundaries. The average deviation from the ideal membrane boundaries at either end is 22 ± 28 residues, which is about 8% of the average membrane region size of 258 residues. Some examples of how the family boundaries map on the structures are shown in Figure 2. It appears that our families largely capture the membrane-embedded domains and are reasonably effective at removing large soluble regions. We did observe one clear failure of our automated algorithm, however. As noted above, the monotopic protein kinases should have been deleted but were included because the algorithm treats the signal sequence as a TM domain, and current methods do not reliably distinguish signal peptides from TM regions (Remm and Sonnhammer 2000).
Table 2. Comparison of membrane boundaries defined by structures against family boundaries
Sequence space coverage
Given the rapid fall off in family size, it makes little sense to discuss the total number of families since not all families are equally important. For example, the largest ten families will account for a much larger fraction of sequence space than the smallest ten families. Moreover, we are interested in folded protein domains, and not all regions of sequence are structured. We assume that unstructured regions are less likely to be conserved and fall into protein families. Consequently we operationally define structured sequence space as those sequence regions that are part of families with two or more members (Vitkup et al. 2001) and ignore the ∼10% of the sequence space made up of singletons. We believe this provides a more realistic view of structural coverage than the fraction of all residues (Fischer and Eisenberg 1999; Vitkup et al. 2001; Chothia et al. 2003). After removing singletons, 4075 families remain with at least two members.
As shown in Figure 3, the largest 100 families cover more than 50% of residues in families, while the largest 500 families cover almost 80% of the total structured sequence space. In other words, for any given residue in a membrane protein family derived from our current sequence database, there is an 80% chance that it will fall into one of about 500 families.
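The coverage calculation behind statements of this kind can be sketched as a cumulative sum over families ranked by size. The function name `families_needed` and the residue counts below are hypothetical; the sketch only assumes that each family contributes a known number of residues to structured sequence space.

```python
def families_needed(residues_per_family, target):
    """Smallest number of largest families whose residues reach `target`.

    `target` is a fraction between 0 and 1 of all residues found in families.
    """
    ordered = sorted(residues_per_family, reverse=True)
    total = sum(ordered)
    running = 0
    for count, residues in enumerate(ordered, start=1):
        running += residues
        if running >= target * total:
            return count
    return len(ordered)


# Hypothetical residue counts per family (not data from this study).
toy_residues = [5000, 2500, 1200, 600, 300, 200, 100, 60, 30, 10]
needed_80 = families_needed(toy_residues, 0.80)
needed_90 = families_needed(toy_residues, 0.90)
```

Ranking by size before accumulating is what makes the answer a lower bound on the number of families required for a given coverage.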
Membrane protein family space is largely saturated by currently known sequences
As we find more membrane protein sequences, will the number of families increase dramatically or have we already seen most of the extant families? To address this question, we examined how the number of families increases with increasing database size. We generated databases of 20,000, 40,000, 60,000, and 86,033 sequences by selecting at random from the full 86,033 sequence database of membrane proteins. We then applied our family building algorithm to each of these data sets and counted the number of families needed to account for 80% or 90% of all the family residues.
Using 86,033 membrane protein sequences we find 4075 polytopic families that include 64,116 sequences, as described above. These families contain 83.9% of the total residues in the 64,116 sequences, indicating that a vast majority of the sequence space of membrane proteins is represented by the families. Similarly, we found 3107 families that include 43,470 sequences from the 60,000 membrane protein sequences. There were 6976 singletons (no hits for these sequences with an e-value better than 10^{-5}). The remaining 9554 sequences, which account for the difference between the number of sequences in polytopic families and the size of the sequence database, were from families that contained fewer than two TM helices and were thus removed. The families cover 83.5% of the total residues in the sequences. The 40,000 membrane protein sequence data set led to 2401 families that include 28,456 sequences. There were 5703 singletons (no hits for these sequences with an e-value better than 10^{-5}), and the remaining 5841 sequences were from families that contained fewer than two TM helices and were removed. The families cover 83.5% of the total residues in the sequences. The smallest data set used to build families was 20,000 sequences, and we obtained 1401 polytopic families from 13,467 sequences. There were 4131 singletons, and the remaining 2402 sequences came from monotopic families that were removed. These families cover 82.9% of the total residues in the sequences.
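The sequence counts reported for the four data sets can be checked for internal consistency with a short script. The counts are taken from the text; the dictionary layout and the labels `polytopic`, `singletons`, and `monotopic` are our own shorthand for the three categories.

```python
# Sequence counts per data set, as reported in the text.
subsamples = {
    20_000: {"polytopic": 13_467, "singletons": 4_131, "monotopic": 2_402},
    40_000: {"polytopic": 28_456, "singletons": 5_703, "monotopic": 5_841},
    60_000: {"polytopic": 43_470, "singletons": 6_976, "monotopic": 9_554},
    86_033: {"polytopic": 64_116, "singletons": 8_273, "monotopic": 13_644},
}


def unaccounted(total, parts):
    """Sequences in the data set that fall into none of the three categories."""
    return total - sum(parts.values())


# Zero for every data set means the three categories partition the data set.
residuals = {total: unaccounted(total, parts) for total, parts in subsamples.items()}

# Fraction of each data set that ends up in polytopic families.
polytopic_fraction = {
    total: parts["polytopic"] / total for total, parts in subsamples.items()
}
```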
As shown in Figure 4, the number of membrane protein families required for 80% and 90% residue coverage increases as the number of sequences in families grows from 13,467 to 28,456, but essentially stops growing after that. These results suggest that, ignoring the rare families, we have already observed virtually all the membrane protein families that we can expect to see. The asymptote of a hyperbolic fit suggests that only 670 families will cover 80% of structured sequence space and 1720 families will cover 90% of structured sequence space for all extant polytopic membrane proteins.
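One way to extract an asymptote from a saturating curve is sketched below. The text does not give the functional form of the hyperbola, so the sketch assumes y = a·x/(b + x), where x is the number of sequences, y is the number of families required, and a is the asymptote. It also assumes a double-reciprocal linearization (regressing 1/y on 1/x) rather than a nonlinear fit. The function name and the data points are hypothetical and generated from the assumed model.

```python
def fit_hyperbola(xs, ys):
    """Fit y = a*x/(b + x) by linear regression of 1/y on 1/x; return (a, b).

    Since 1/y = 1/a + (b/a)*(1/x), the intercept gives 1/a and the slope b/a.
    """
    u = [1.0 / x for x in xs]
    v = [1.0 / y for y in ys]
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    covariance = sum((ui - mean_u) * (vi - mean_v) for ui, vi in zip(u, v))
    variance = sum((ui - mean_u) ** 2 for ui in u)
    slope = covariance / variance
    intercept = mean_v - slope * mean_u
    a = 1.0 / intercept
    b = slope * a
    return a, b


# Synthetic points generated from a = 1000, b = 5000 (not data from this study).
toy_x = [10_000, 20_000, 40_000, 80_000]
toy_y = [1000 * x / (5000 + x) for x in toy_x]
a_fit, b_fit = fit_hyperbola(toy_x, toy_y)
```

The double-reciprocal form weights small data sets heavily, so with noisy data a direct nonlinear least-squares fit would usually be preferable.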
How many membrane protein folds?
To address the number of membrane protein folds, given the number of families, it is necessary to know how folds are distributed across families. Here we define fold as something equivalent to the SCOP database definition, i.e., a distinct, compact unit of protein structure that is recognizable by a skilled human. According to the SCOP hierarchy, a family is a subset of a fold, and hence the number of families will define an upper bound on fold diversity. In soluble proteins, it is well known that some folds are very common and others are rare. The distribution of the folds over families has been modeled in various ways (Orengo et al. 1994; Zhang and DeLisi 1998; Govindarajan et al. 1999; Wolf et al. 2000; Coulson and Moult 2002; X. Liu et al. 2004). At this point, however, there are too few structures to address this issue for membrane proteins. Consequently, to obtain a first order approximation of the number of membrane protein folds, we make the assumption that the distribution of folds over families is similar to the distribution for soluble proteins. If this is the case, we can use the stretched exponential model based on the continuous distribution proposed by Govindarajan and colleagues (Govindarajan et al. 1999), which provides a reasonable description of the number of folds as a function of family number. By this measure, we estimate that only 550 folds cover 90% and 300 folds cover 80% of membrane protein structured sequence space. Considering the physical constraints imposed by the membrane bilayer environment (Bowie 1997a,b; Y. Liu et al. 2004), we expect this to be an overestimate.
Is there evidence for soluble protein family saturation?
As discussed above, as the database size increases, we find that the number of membrane protein families needed to cover the vast majority of the database residues becomes saturated. Do soluble proteins exhibit a similar saturation with the current families? As our family building algorithm is optimized for membrane proteins, a direct comparison with soluble proteins is not valid. We, therefore, examined how the number of families required to cover structured sequence space changed in the PfamB families as the database size increased (Bateman et al. 2004). Like our family building algorithm, PfamB utilizes an automated family building algorithm, but it is optimized for soluble proteins. By examining PfamB from its inception (v0.2) in 1996 to the present (v18.0) (http://www.sanger.ac.uk/Software/Pfam/ftp.shtml), we found that the total number of PfamB families is growing at a nearly linear rate with increasing number of sequences (data not shown). As discussed previously, however, some families are much larger than others. We therefore determined how the number of families needed to cover 80% and 90% of all the residues in the families changes with increasing database size. We eliminated membrane protein families based on the criterion that at least half the members contained at least a single TM region. As with our membrane protein families, we also required that families contain at least two members. PfamB v17.0 had 129,746 families built from 336,353 sequences. The number of soluble families with at least two members is 96,905, derived from 268,144 sequences, and there are 7531 orphan soluble protein families. PfamB v7.0 had 78,233 total families built from 179,966 sequences. In PfamB v7.0, there are 2038 orphan families, and 60,638 families were soluble with at least two members, built from 143,017 sequences. PfamB v6.0 had 40,681 families built from 96,571 protein sequences. Of these, there are 1293 soluble orphan families and 31,168 soluble families with at least two members, built from 76,231 sequences.
Figure 5 shows the number of sequences in each of the three versions of PfamB versus the number of soluble protein families required to cover 80% or 90% of the family sequence space. Unlike membrane proteins, residue coverage does not appear to saturate when the sequences increase in number. This would suggest that there are many more soluble protein families left to be found. Even for 80% coverage, 50,000 families are currently needed, a result that is hard to reconcile with other, much lower published estimates of the total number of soluble families. By our analysis of the PfamB database, the number of important soluble protein families is at least 50,000 and, unless saturation suddenly occurs, it will likely grow even more. In their similar analysis of soluble protein families, Yan and Moult (2005) also found little evidence for saturation in the currently known soluble protein families.
Time frame for obtaining structural representatives
How long will it take to obtain a representative structure for most membrane proteins? The answer will depend on how fast structures will be determined and whether structure determination is targeted to maximize sequence coverage.
To estimate the future pace of helical membrane protein structure determination, we followed the analysis of Stephen White (2004), who noted that the number of membrane protein structures determined has been increasing exponentially. He projected the future pace of structure determination by fitting the historical data to an exponential of the form Y = e^{AX}, where Y is the number of structures in a given year, X is the number of years since the first structure determination, and A is a fitted parameter. Because our families do not consider β-barrel membrane proteins, we have performed a similar analysis, counting only the helical membrane proteins, using the Membrane Proteins of Known 3D Structure Database (http://blanco.biomol.uci.edu/Membrane_Proteins_xtal.html). The number of new, unique structures determined each year since the first structure in 1985 is shown in Figure 6A. Fitting to the exponential equation yields the curve shown in Figure 6A and the parameter A = 0.123.
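A one-parameter exponential of this form can be fitted and projected forward as sketched below. The text does not state how the fit was performed, so the sketch assumes least squares on the log scale (ln Y = A·X, a regression through the origin) and skips years with no structures. The function names are hypothetical, and the yearly counts are synthetic values generated from the reported A = 0.123 rather than the historical data.

```python
import math


def fit_rate(years_since_first, counts):
    """Least-squares estimate of A in Y = exp(A*X), fitted on the log scale."""
    pairs = [
        (x, math.log(y))
        for x, y in zip(years_since_first, counts)
        if x > 0 and y > 0  # log is undefined for years with zero structures
    ]
    return sum(x * log_y for x, log_y in pairs) / sum(x * x for x, _ in pairs)


def years_until(total_needed, rate, start_x):
    """Smallest X >= start_x at which structures summed from start_x reach the target."""
    solved = 0.0
    x = start_x
    while True:
        solved += math.exp(rate * x)
        if solved >= total_needed:
            return x
        x += 1


# Synthetic yearly counts generated from A = 0.123 (not the historical data).
toy_years = list(range(1, 21))
toy_counts = [math.exp(0.123 * x) for x in toy_years]
fitted_rate = fit_rate(toy_years, toy_counts)
```

`years_until` assumes a positive rate, so that the yearly output keeps growing and the loop terminates.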
The most efficient route to covering membrane protein sequence space with the fewest structures would be to determine structures in order of family size. Figure 6B shows how structural coverage of our current sequence database would advance if we were to determine structures strictly in order of family size, assuming the pace of structure determination continues as modeled above. As a starting point, we use the structures that were known by the beginning of the year 2005. Indeed, we already have representative structures of many of the largest families, including GPCRs, ABC transporters, potassium channels, cytochrome b, cytochrome c oxidase, etc., and therefore are already near 30% sequence coverage. By the ideal model, we would expect to achieve 80% coverage by 2020 and 90% sequence coverage by 2027.
The ideal model of careful structure targeting is unlikely to be practiced, however. We therefore modeled the rate of progress if proteins were selected at random for structure determination. Such a process would still be biased toward the larger families as the more members there are, the more likely a member will be selected for structure determination. As shown in Figure 6B, by a random model, we would not expect 80% residue coverage of our current sequence database until 2034, and 90% residue coverage would not occur until 2042. A more realistic view of structural coverage lies perhaps somewhere in the middle of the ideal and random estimates.
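The difference between the ideal and random targeting models can be illustrated with a small simulation. This is a sketch under stated assumptions, not the procedure used for Figure 6B. It assumes that each structure in the random model is an independent draw of one protein, so that a family is chosen with probability proportional to its size, and that a repeated draw from an already solved family adds no coverage. The function names, the seed, and the family sizes are hypothetical.

```python
import random


def ideal_coverage(family_sizes):
    """Cumulative fraction of sequences covered when families are solved largest first."""
    total = sum(family_sizes)
    covered = 0
    curve = []
    for size in sorted(family_sizes, reverse=True):
        covered += size
        curve.append(covered / total)
    return curve


def random_coverage(family_sizes, n_structures, seed=0):
    """Cumulative coverage when each structure comes from a randomly drawn protein."""
    rng = random.Random(seed)
    total = sum(family_sizes)
    # Drawing a protein at random is equivalent to drawing a family weighted by size.
    draws = rng.choices(range(len(family_sizes)), weights=family_sizes, k=n_structures)
    solved = set()
    covered = 0
    curve = []
    for family in draws:
        if family not in solved:
            solved.add(family)
            covered += family_sizes[family]
        curve.append(covered / total)
    return curve


# Hypothetical family sizes (not data from this study).
toy_families = [500, 200, 100, 50, 50, 25, 25, 20, 20, 10]
ideal_curve = ideal_coverage(toy_families)
random_curve = random_coverage(toy_families, n_structures=len(toy_families))
```

For the same number of structures, the random curve can never exceed the ideal curve, because k draws solve at most k distinct families and the ideal model always holds the k largest.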
Our results indicate that most of the dominant membrane protein families are already present in our current sequence database. Thus, the boundaries of the membrane protein universe are visible. No doubt the overall number of families will continue to increase, but we expect the additional families to be minor contributors to the total structured sequence space.
The limited universe of membrane protein families allows us to provide an estimate for the number of structures needed to cover most of the structured membrane protein sequence space. For 90% coverage of structured sequence space we estimate a limit of ∼1700 families will be required, which we estimate corresponds to an upper bound of ∼550 folds. Similarly, for 80% coverage, only ∼700 families will be required for which we estimate an upper bound of ∼300 folds. To obtain representative structures for most sequences in the universe, the best case scenario is the total number of folds (∼300–550). For this number of structures to provide useful models, however, it will be necessary to develop methods that can reliably detect whether a sequence adopts a known fold. A more realistic scenario with current technologies would require a structure for each distinct family (∼700–1700).
The universe of membrane protein families appears to be much more limited than found for soluble protein families. We find little evidence for family saturation using the Pfam database, and a recent study (Yan and Moult 2005) examining 67 genomes and over 400,000 total sequences showed only slight indications of saturation at 80% domain coverage. Their extrapolated results suggest that ∼25,000 families will be needed for 80% coverage of 1000 genomes and that the number will continue to increase after that. This is ∼50-fold greater than the total family diversity that we find for membrane proteins. Thus, the structural genomics goal of a structure for every sequence seems to be a more definable goal for membrane proteins.
If current trends continue, we estimate that it will take roughly three decades to obtain structural representatives for 90% of the current structured membrane protein sequence space. Even if a representative structure is available for all families, however, the structures will not be sufficiently similar to all members of a family to permit accurate sequence alignment and homology modeling (Vitkup et al. 2001; Yan and Moult 2005). As a result, many more structures will likely be needed for good sequence coverage, and three decades is a severe underestimate. Thus, efforts to greatly improve the pace of membrane protein structure determination are much needed. Nevertheless, the results presented here strongly suggest that there is a finite universe of membrane protein folds so that the goal of at least one structure of a distant relative for the vast majority of membrane protein sequences is both recognizable and achievable.
Materials and methods
The sequence database
A total of 95 genome sequences, comprising 13 eukarya, 16 archaea, and 66 eubacteria, were downloaded from the EMBL and EBI web sites (www.ebi.ac.uk). These were further supplemented by Swiss-Prot 41.15 entries not already in the genomes (O'Donovan et al. 2002). Protein sequences smaller than 100 residues or greater than 5000 residues were rejected.
Proteins were classified as membrane or soluble using maximum hydrophobicity as described by Boyd and colleagues (Boyd et al. 1998). In this method, the most hydrophobic segment of the protein is identified and, if the maximum hydrophobicity is greater than a particular cutoff, the protein is deemed a membrane protein. We used the recommended JTT2 hydropathy scale, a segment length of 19 residues, and a hydrophobicity of >1.528, which results in 95% confidence for membrane protein prediction. If the maximum hydrophobicity was <1.47, the protein was classified as soluble. Protein sequences with a maximum hydrophobic segment between 1.47 and 1.528 were considered ambiguous and not considered further. The final data set contained 86,033 membrane protein sequences and 336,198 soluble protein sequences.
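The three-way classification can be sketched as a sliding-window scan. The sketch assumes that the hydrophobicity of a segment is the mean scale value over the 19-residue window. The JTT2 scale values are not reproduced here: the scale is passed in as a dictionary, and the two-letter and one-letter scales in the example are placeholders for illustration only. The function names are hypothetical.

```python
def max_window_hydrophobicity(sequence, scale, window=19):
    """Highest mean hydropathy over any contiguous window of `window` residues."""
    values = [scale[residue] for residue in sequence]
    if len(values) < window:
        raise ValueError("sequence is shorter than the window")
    running = sum(values[:window])
    best = running
    for i in range(window, len(values)):
        # Slide the window by one residue: add the new value, drop the oldest.
        running += values[i] - values[i - window]
        best = max(best, running)
    return best / window


def classify(sequence, scale, membrane_cutoff=1.528, soluble_cutoff=1.47, window=19):
    """Label a sequence 'membrane', 'soluble', or 'ambiguous' using the cutoffs in the text."""
    score = max_window_hydrophobicity(sequence, scale, window)
    if score > membrane_cutoff:
        return "membrane"
    if score < soluble_cutoff:
        return "soluble"
    return "ambiguous"


# Placeholder scales for illustration only; these are NOT the JTT2 values.
toy_scale = {"L": 3.0, "K": -3.0}
membrane_like = "K" * 30 + "L" * 19 + "K" * 30
soluble_like = "K" * 50
borderline_scale = {"A": 1.5}
```

With a real hydropathy scale, only the `scale` argument changes; the cutoffs default to the values given in the text.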
Transmembrane helix prediction
Transmembrane regions within membrane proteins were determined using the Eisenberg hydropathy scale and a fixed window length of 21 amino acid residues (Eisenberg et al. 1984). The MPTOPO data set of membrane proteins with determined structures was used for validation (Jayasinghe et al. 2001). Helical transmembrane protein prediction was validated against 139 α-helical TM proteins and 28 non-α-helical TM proteins. The Eisenberg scale correctly predicted 121 (87%) α-helical TM proteins and mistakenly identified 1 (3.6%) non-α-helical TM protein.
Our transmembrane helix predictions were validated using 101 α-helical membrane protein sequences with solved structure and containing 443 transmembrane segments. We considered a prediction correct if there was at least 50% overlap between the predicted and actual transmembrane region. By this criterion, we found a false-positive error rate of 13.3% and a false-negative error of 3.2% in transmembrane helix prediction.
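The overlap criterion can be sketched as follows. The text does not say which segment length the 50% refers to, so the sketch assumes that a predicted segment matches an observed one when it covers at least half of the observed segment. Segments are inclusive (start, end) residue positions. The function names and coordinates are hypothetical, and the function returns the unmatched segments rather than rates, since the denominators used for the reported error rates are not stated.

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def score_predictions(predicted, actual, min_fraction=0.5):
    """Return (false_positives, false_negatives) under the overlap criterion."""

    def matches(pred, obs):
        # Assumption: overlap is measured relative to the observed segment length.
        return overlap_length(pred, obs) >= min_fraction * (obs[1] - obs[0] + 1)

    false_positives = [p for p in predicted if not any(matches(p, a) for a in actual)]
    false_negatives = [a for a in actual if not any(matches(p, a) for p in predicted)]
    return false_positives, false_negatives


# Hypothetical segments for one protein (not data from this study).
observed_segments = [(10, 30), (50, 70)]
predicted_segments = [(12, 32), (100, 120)]
extra, missed = score_predictions(predicted_segments, observed_segments)
```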
Sequence comparisons were done using BLAST 2.0 (Altschul et al. 1997). We refer to a detectable sequence relationship as a hit. To prevent spurious apparent hits caused by low complexity sequences, the low complexity regions were masked using SEG (Wootton and Federhen 1996). The standard parameters typically used for SEG often mask transmembrane segments, however, which are inherently low complexity. Indeed, using the default SEG parameters of 12 2.2 2.5, 22% of residues in the transmembrane regions were masked. We therefore adjusted the SEG parameters to 10 1.0 1.5, which minimized transmembrane segment masking, while still masking obviously low-complexity regions.
Family building algorithm
We used an iterative clustering algorithm to define families (Linial et al. 1997; Park and Teichmann 1998; Gouzy et al. 1999). The overall scheme for building families is shown in Figure 7. Family members are evolutionarily related but this relation is not always evident by sequence homology due to limits in the sensitivity of sequence comparison programs. Consequently, two related members are linked directly if there is detectable sequence similarity, but in the absence of sequence similarity they can be linked indirectly by an intermediate sequence with detectable similarity to both. This method of indirect linkage can lead to the domain problem where two unrelated proteins may be related to a third protein, which has multiple domains. We deal with this by ensuring that all family member regions overlap one another.
We wanted each family to represent a single, independent folding domain. Reasoning that smaller sequences were more likely to contain a single domain, we organized sequences according to increasing size, with smallest sequences placed first. Subsequently, we organized BLAST hits hierarchically according to statistical significance: (1) e-value < 10⁻¹⁰, (2) e-value < 10⁻⁹, (3) e-value < 10⁻⁷, (4) e-value < 10⁻⁵. This ensured that we initiated families with the smallest, albeit most significant, alignments. The statistical significance of BLAST hits is found to be exaggerated at higher e-values, and an e-value of 10⁻⁵ is found to correspond to an error rate of 1 in 500 (Brenner et al. 1998). We found clustering errors at e-values worse than 10⁻⁵ and these hits were not considered further in our analysis. In addition, we omitted hits which did not contain at least one transmembrane helix. We considered a hit region to contain a transmembrane helix if at least half of a predicted transmembrane region was part of the hit region.
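A minimal sketch of this ordering step follows, assuming hits are available as `(query_id, subject_id, evalue)` tuples and sequence lengths as a dictionary. Both data layouts and all names are hypothetical. The transmembrane-helix filter on hit regions is omitted for brevity.

```python
# Significance tiers, most significant first, as listed in the text.
TIER_CUTOFFS = (1e-10, 1e-9, 1e-7, 1e-5)


def significance_tier(evalue):
    """Return the tier number (1 = most significant), or None if the
    e-value is not better than the weakest cutoff."""
    for tier, cutoff in enumerate(TIER_CUTOFFS, start=1):
        if evalue < cutoff:
            return tier
    return None


def order_hits(hits, seq_lengths):
    """Order hits by tier, then by increasing query sequence length.

    hits: list of (query_id, subject_id, evalue) tuples (assumed layout).
    seq_lengths: dict mapping sequence id to length.
    Hits outside all tiers are dropped.
    """
    keyed = []
    for hit in hits:
        tier = significance_tier(hit[2])
        if tier is not None:
            keyed.append((tier, seq_lengths[hit[0]], hit))
    keyed.sort(key=lambda item: (item[0], item[1]))
    return [hit for _, _, hit in keyed]
```

Sorting on the pair (tier, length) means the first hit in the list comes from the smallest sequence within the most significant group, which is where the first family would be seeded.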
The first family was initiated with a hit from the smallest sequence within the most significant hit group. We then searched down the hit list, in the order described, for additional significant hits between present family members and new sequences. New sequences were added to the family subject to the criteria discussed below. After all the hits were examined, this process was repeated to see if hits that had not been incorporated into families in the first round could now be added. Then the second family was initiated with the next smallest, most significant hit remaining and so forth.
Family members were added by the following procedure. There were two levels of clustering involved. The first incorporated only those hits with an e-value better than 10⁻⁹. In this hit group the sequence was incorporated directly if it passed overlap and length tests. To prevent divergence in aligned regions and domain sizes that can occur between the first family member and the subsequently added members, we required that all members had similarity to each other with regard to size and region of the protein segment. This was implemented using two additional checks before adding a new sequence. (1) We required at least a 70% overlap in sequence of the aligned region with all other previously clustered members of the family. Percent overlap is defined as the length of the overlap of the region of the hit under consideration with the region already in the family, divided by the average size of the two regions. (2) We required that the ratio of the hit length under consideration and the family domain size was between one-half and two, where the domain size of the family was defined as the size of the first member of the family. The families in this high significance hit group were clustered until convergence or for a maximum of eight iterations.
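The two checks can be written compactly. The sketch below is illustrative only and makes three assumptions: regions are inclusive `(start, end)` ranges already expressed in a common coordinate frame (for example, mapped through the pairwise alignments), the first entry of `member_regions` is the founding member, and the one-half-to-two bounds are inclusive. Function names are hypothetical.

```python
def percent_overlap(region_a, region_b):
    """Overlap length divided by the average length of the two regions."""
    shared = max(0, min(region_a[1], region_b[1]) - max(region_a[0], region_b[0]) + 1)
    len_a = region_a[1] - region_a[0] + 1
    len_b = region_b[1] - region_b[0] + 1
    return shared / ((len_a + len_b) / 2)


def passes_family_tests(hit_region, member_regions, min_overlap=0.70):
    """Apply the length-ratio and overlap checks to a candidate hit region.

    member_regions[0] is taken as the first family member, whose length
    defines the family domain size.
    """
    first = member_regions[0]
    domain_size = first[1] - first[0] + 1
    hit_len = hit_region[1] - hit_region[0] + 1
    # Check (2): hit length within a factor of two of the domain size.
    if not 0.5 <= hit_len / domain_size <= 2.0:
        return False
    # Check (1): at least 70% overlap with every existing member.
    return all(percent_overlap(hit_region, m) >= min_overlap for m in member_regions)
```

Requiring the overlap test against every existing member, rather than only the member that produced the hit, is what keeps a multidomain intermediate from pulling unrelated regions into the same family.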
The second level of clustering considered hits in the lower significance group, with e-value between 10⁻⁹ and 10⁻⁵. Hits were added to each family built from the high significance hit group by clustering for a maximum of two iterations. In this group there is an increased danger of adding a false positive. Thus, the sequence was considered for inclusion subject to the following test: The sequence must show a hit with at least two different members that have already been incorporated into the family. Sequences were defined as different if they showed <80% sequence identity.
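The two-member support test can be sketched as below. The `identity` callable, which returns the fractional sequence identity between two family members, is a hypothetical helper. How identity was computed is not described in this section.

```python
from itertools import combinations


def has_two_distinct_supporters(hit_members, identity, max_identity=0.80):
    """True if the candidate hits at least two family members that are
    'different', i.e. share less than max_identity sequence identity.

    hit_members: ids of family members the candidate sequence has hits to.
    identity: callable (id_a, id_b) -> fraction in [0, 1] (assumed helper).
    """
    return any(
        identity(a, b) < max_identity
        for a, b in combinations(hit_members, 2)
    )
```

For example, a candidate that hits only two near-identical members (say 95% identity) would be rejected, since those two hits amount to a single piece of evidence.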
Post-processing of families
We organized the families according to their size, and removed those families that were monotopic. A family was judged to be monotopic if at least 25% of its members consisted of only a single transmembrane helix. Next, we processed the families to merge common domains that had been clustered separately. One family can be separated into two because of our stringent criterion for including a member in a family during the first phase of family building. Two families should also be merged when fragmentation or domain cutting by BLAST results in one domain being split across more than one family. One cause of this is incomplete sequence fragments in the sequence database, while another may be inherent weaknesses in BLAST. Such fragmentation of families was observed frequently in the case of GPCR families. To correct for this family splitting, we merged a smaller family with a larger family if at least half of its members were shared with the larger family, and if these shared members overlapped in sequence region by at least 50%.
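The two post-processing rules can be illustrated with a short sketch. Families are represented here as dictionaries mapping a sequence id to an inclusive `(start, end)` region, which is an assumed layout. The text does not say what the 50% region overlap is measured against, so this example measures it against the shorter of the two regions, which is also an assumption.

```python
def is_monotopic(helix_counts, min_fraction=0.25):
    """True if at least min_fraction of members have exactly one TM helix.

    helix_counts: predicted TM helix count for each family member.
    """
    if not helix_counts:
        return False
    single = sum(1 for n in helix_counts if n == 1)
    return single >= min_fraction * len(helix_counts)


def region_overlap_fraction(a, b):
    """Overlap length relative to the shorter region (assumed denominator)."""
    shared = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    shorter = min(a[1] - a[0] + 1, b[1] - b[0] + 1)
    return shared / shorter


def should_merge(small, large, min_shared=0.5, min_overlap=0.5):
    """Decide whether the smaller family should be merged into the larger.

    small, large: dict mapping sequence id -> (start, end) family region.
    """
    shared_ids = set(small) & set(large)
    if not small or len(shared_ids) < min_shared * len(small):
        return False
    return all(
        region_overlap_fraction(small[s], large[s]) >= min_overlap
        for s in shared_ids
    )
```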
After merging families, sequence end-points for a family member were determined from all the merged hits for that member, based on the boundaries of the membrane domain. We compared hit regions for a given member in the merged family with the corresponding transmembrane helix regions predicted for that member. The initial and final hit regions that most closely incorporated the entire membrane domain were chosen as the family region. A region that was too large was always given preference over a region that was smaller than the actual membrane boundary.
We thank Salem Faham, Sehat Nauli, and Marisa Baron for helpful discussions and comments on the manuscript, and Blai Bonet for valuable suggestions on developing the algorithm. This work was supported by National Institutes of Health Grant GM3919 and a National Science Foundation Integrative Graduate Education and Research Traineeship predoctoral award (to A.O.). J.U.B. is a Leukemia and Lymphoma Society Scholar.