Test set of proteins with experimentally verified topology
A nonredundant test set of multispanning membrane proteins with experimentally determined topology was extracted from the database compiled by Möller et al. (2000), from the MPtopo database (Jayasinghe et al. 2001), and from the recent literature. From the Möller database, only multispanning proteins of ‘trust levels’ A–C were included, that is, proteins for which reliable experimental topology information is available. From the MPtopo database, only multispanning proteins from the 3D_helix and 1D_helix subsets were used, that is, proteins where either the three-dimensional structures have been determined, or where the approximate positions of the transmembrane helices (TMHs) have been identified by other experimental techniques (gene fusions, proteolytic cleavages, etc.). If a sequence occurred both in the MPtopo and the Möller databases, only the entry from the Möller database was included. All proteins annotated to contain a cleavable signal peptide were removed.
The resulting test set was split into a prokaryotic subset and a eukaryotic subset. Both test sets were then homology-reduced using an implementation of the Hobohm algorithm (Hobohm et al. 1992) with a pairwise global sequence similarity threshold of 30%. ClustalW (Thompson et al. 1994) was used for the pairwise, global sequence alignments. The numbers of sequences in the final prokaryotic and eukaryotic test sets were 73 and 23, respectively (see Supplementary Information).
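The homology-reduction step can be illustrated with a minimal sketch. It assumes a greedy, first-come selection in the spirit of the Hobohm approach; the text does not say which Hobohm variant or implementation was used. The names `homology_reduce` and `toy_identity` are hypothetical, and `toy_identity` is only a stand-in for a real pairwise global alignment score such as one derived from ClustalW. Whether the 30% cutoff is inclusive is also an assumption.

```python
from typing import Callable, Dict, List


def homology_reduce(
    sequences: Dict[str, str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.30,
) -> List[str]:
    """Greedy reduction: keep a sequence only if its similarity to every
    already-kept sequence is below the threshold (assumed exclusive)."""
    kept: List[str] = []
    for name, seq in sequences.items():
        if all(similarity(seq, sequences[k]) < threshold for k in kept):
            kept.append(name)
    return kept


def toy_identity(a: str, b: str) -> float:
    """Hypothetical stand-in for a global alignment score: fraction of
    identical residues at matching positions, without any alignment."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


if __name__ == "__main__":
    toy_set = {"a": "AAAAAAAAAA", "b": "AAAAAAAAAC", "c": "CCCCCCCCCC"}
    print(homology_reduce(toy_set, toy_identity))
```

The result of a greedy pass depends on the order in which sequences are visited, so a real pipeline would need to fix that order deliberately.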
Five topology prediction methods were used in their single-sequence mode (i.e., no information from homologous proteins was included): TMHMM2.0 (Sonnhammer et al. 1998; Krogh et al. 2001), HMMTOP2.0 (Tusnady and Simon 1998; Tusnady and Simon 2001), MEMSAT1.8 (Jones et al. 1994), PHD2.1 (Rost et al. 1996), and TOPPRED1.0 (von Heijne 1992; Claros and von Heijne 1994). All methods produce a prediction of both the number and location of the TMHs, and the in/out location of the N-terminus relative to the membrane. All user-adjustable parameters were kept at their default values, with the exception of TOPPRED predictions for eukaryotic proteins, where the organism parameter was set to ‘eukaryote.’ The output from the different topology prediction programs was converted into a standard format for further analysis.
Partial consensus topology prediction algorithm
The partial consensus topology prediction method is based on our previous observation that the reliability of a predicted topology can be estimated from the number of prediction methods that agree on the global topology (i.e., that give the same number of predicted TMHs and the same predicted orientation for the N-terminus). Specifically, that study (Nilsson et al. 2000) indicated that very high reliability can be assigned to topologies where five different prediction methods give the same prediction.
Here, we tested the assumption that this relationship holds also for cases where all five prediction methods agree on the topology of only a part of the protein. These cases are referred to as partial consensus topologies (PCTs).
The PCT algorithm is described in Figure 2A. In the first step, if all methods agree on the topology at a certain position in the sequence, a consensus topology prediction is assigned to this position (inside loop, outside loop, in-to-out helix, and out-to-in helix states are designated i, o, m, and w, respectively). If the methods do not all agree at a certain position, no consensus is assigned (designated ‘.’). To aid in the construction of the final partial consensus prediction, we define two additional symbols that represent positions for which the predicted topology states are incompatible with each other. Thus, when loop states with opposite locations (i and o) are predicted at the same position, we define this as a loop clash (X). In the same manner, a TMH clash (#) is defined for positions where two TMHs with opposite directions (m and w) are predicted.
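The per-position consensus step can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: each method's prediction is assumed to be a string over the states i, o, m, and w, one character per residue; a clash is flagged whenever both opposing states appear among the predictions at a position; and where a loop clash and a TMH clash coincide, the sketch arbitrarily reports the loop clash, since the text does not specify a precedence. The function name `consensus_states` is hypothetical.

```python
from typing import Sequence


def consensus_states(predictions: Sequence[str]) -> str:
    """Combine per-residue state strings (alphabet i, o, m, w) into a
    consensus string that may also contain 'X', '#', and '.'."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must cover the same sequence length")
    consensus = []
    for column in zip(*predictions):
        states = set(column)
        if len(states) == 1:
            consensus.append(column[0])   # all methods agree
        elif {"i", "o"} <= states:
            consensus.append("X")         # loop clash (assumed precedence)
        elif {"m", "w"} <= states:
            consensus.append("#")         # TMH clash
        else:
            consensus.append(".")         # disagreement without a clash
    return "".join(consensus)


if __name__ == "__main__":
    # Toy strings chosen to exercise every branch, not realistic topologies.
    toy = ["iimmmooi", "iimmwoom", "iommmoii"]
    print(consensus_states(toy))
```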
After this initial step, a filtering procedure is used to remove “spurious” TMH clashes caused by slight misalignments of predicted TMHs (Fig. 2B). In this procedure, a TMH clash which is flanked by consensus TMHs with opposite directions is replaced by a loop state. Such clashes occur frequently for proteins containing closely spaced TMHs.
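A minimal sketch of this filtering step is given below. Two points are assumptions rather than statements from the text: the flanking consensus TMHs are taken to be immediately adjacent to the run of clash positions, and the loop state written in place of the clash is inferred from the helix directions (an in-to-out helix followed by an out-to-in helix implies an outside loop, and the reverse order implies an inside loop). The function name `filter_spurious_clashes` is hypothetical.

```python
import re


def filter_spurious_clashes(consensus: str) -> str:
    """Replace runs of '#' flanked by consensus TMHs of opposite direction
    with the loop state implied by those directions."""
    # m...#...w : helix exits to the outside, next helix re-enters -> 'o'
    consensus = re.sub(
        r"(?<=m)#+(?=w)", lambda run: "o" * len(run.group()), consensus
    )
    # w...#...m : helix ends inside, next helix leaves again -> 'i'
    consensus = re.sub(
        r"(?<=w)#+(?=m)", lambda run: "i" * len(run.group()), consensus
    )
    return consensus


if __name__ == "__main__":
    print(filter_spurious_clashes("iimmmm##wwwwii"))
```

Clashes flanked by helices of the same direction, or separated from a helix by a non-consensus position, are left untouched in this sketch.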
The final step is the construction of the partial consensus topology (Fig. 2A). Starting from the N-terminus of the protein, the N-terminal end of the first PCT is defined by the first TMH (m or w states) of at least n residues in the consensus topology (where n is an adjustable parameter; the default value used here is n = 5). The PCT is then extended towards the C-terminus until either a consensus TMH of fewer than n residues is encountered, or a loop clash or TMH clash occurs. In either case, the end of the PCT is defined by the most C-terminally located m or w state in the consensus. The process is then repeated until the C-terminal end of the protein is reached. A protein may thus contain more than one PCT.
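The construction step can be sketched as a single scan over runs of identical consensus states. The sketch makes several assumptions that the text leaves open: loop states and non-consensus positions ('.') are passed over without ending a PCT, a PCT may consist of a single consensus TMH, and each PCT is reported as a half-open (start, end) index pair that begins at its first qualifying TMH and ends at its last one. The function name `build_pcts` is hypothetical.

```python
from itertools import groupby
from typing import List, Optional, Tuple


def build_pcts(consensus: str, n: int = 5) -> List[Tuple[int, int]]:
    """Return half-open (start, end) spans of partial consensus topologies."""
    pcts: List[Tuple[int, int]] = []
    start: Optional[int] = None   # start of the PCT currently being extended
    last_tmh_end = 0              # end of the last qualifying TMH in that PCT
    pos = 0
    for state, run in groupby(consensus):
        length = len(list(run))
        run_start, run_end = pos, pos + length
        pos = run_end
        is_tmh = state in "mw"
        if is_tmh and length >= n:
            if start is None:
                start = run_start          # first qualifying TMH opens a PCT
            last_tmh_end = run_end
        elif is_tmh or state in "X#":
            # A short consensus TMH or a clash closes the current PCT.
            if start is not None:
                pcts.append((start, last_tmh_end))
                start = None
    if start is not None:                  # close a PCT still open at the end
        pcts.append((start, last_tmh_end))
    return pcts


if __name__ == "__main__":
    toy = ("ii" + "m" * 6 + "ooo" + "w" * 6 + "iXi"
           + "m" * 5 + "oo" + "ww" + "ii" + "m" * 7 + "o")
    print(build_pcts(toy, n=5))
```

In the toy string, the loop clash splits the first two helices from the rest, and the two-residue helix ends the second PCT when n = 5 but is absorbed into it when n = 1.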
The significance of the n–value is illustrated in Figure 2C, where the resulting PCT prediction differs depending on whether the consensus TMH is longer or shorter than the value of n.
To be included in a PCT, a consensus TMH must be at least n residues long. The larger the n–value, the smaller the risk that an incorrectly predicted consensus TMH is included in the PCT. However, a high n–value also decreases the average length of a PCT. To determine the optimal n–value, the evaluation step above was performed for different length thresholds. Figure 2D shows the fraction of correctly predicted PCTs and the average fraction of sequence length covered by a PCT for different values of n. For the prokaryotic proteins, both the fraction of sequence length covered and the fraction of correctly predicted PCTs are relatively constant for n = 1–12. For n > 12 residues, the fraction of sequence length covered drops significantly, whereas there is only a minor increase in the fraction of correct PCTs. The trend is basically the same for the eukaryotic proteins, though we consider these results less reliable because of the small test set. In summary, the results do not vary appreciably for n–values < 10, and the default value n = 5 has been used for all results reported here.