Keywords:

  • Protein structure prediction;
  • prediction accuracy;
  • hydropathy plots

Abstract

  1. Top of page
  2. Abstract
  3. Protein selection criteria and database characteristics
  4. Accuracies of prediction algorithms
  5. Database accessibility and availability
  6. Acknowledgements
  7. References

The reliability of the transmembrane (TM) sequence assignments for membrane proteins (MPs) in standard sequence databases is uncertain because the vast majority are based on hydropathy plots. A database of MPs with dependable assignments is necessary for developing new computational tools for the prediction of MP structure. We have therefore created MPtopo, a database of MPs whose topologies have been verified experimentally by means of crystallography, gene fusion, and other methods. Tests using MPtopo strongly validated four existing MP topology-prediction algorithms. MPtopo is freely available over the internet and can be queried by means of an SQL-based search engine.

The number of protein sequences in the Protein Information Resource (PIR; Barker et al. 2000) and SWISS-PROT (Bairoch and Boeckmann 1991) databases has exploded as a result of genome sequencing efforts. The PIR database presently contains over 142,000 nonredundant entries, while SWISS-PROT contains over 80,000. A simple search of these databases returns a large number of entries classified as membrane proteins (MPs): 12,000 in the PIR and 9000 in SWISS-PROT. These MP entries provide assignments for transmembrane (TM) segments, but their reliability is uncertain. In a recent survey of SWISS-PROT, Senes et al. (2000) determined that almost 94% of the TM segments were annotated as potential, possible, or probable, indicating that these segments were identified through the use of prediction algorithms, primarily hydropathy plots. Therefore, the majority of TM segment assignments within these public databases must be used with caution. Several collections of MPs have been compiled directly from SWISS-PROT, using either SWISS-PROT annotations or criteria that are sometimes ambiguous (Hofmann and Stoffel 1993; Jones et al. 1994; Cserzö et al. 1997; Gromiha 1999). To avoid propagation of errors that may have been present in the original predictions underlying the database annotations, a curated database of membrane proteins is needed that contains only proteins for which direct experimental evidence of TM segment assignments exists. We have, therefore, created MPtopo, a modest but growing database of MP TM sequences whose topologies have been verified experimentally by means of crystallography, gene fusion, and other methods. The purpose of this note is to introduce the database and to report the results of using it to evaluate and compare four existing MP topology-prediction algorithms (Claros and von Heijne 1994; Milpetz et al. 1995; Rost et al. 1995; Tusnády and Simon 1998).

The assembly of a dependable database of MP topology from literature reports was less straightforward than expected. Even in the case of membrane proteins whose three-dimensional (3D) structures have been determined, TM segment assignments are often not identified in the original publications and often are not readily determinable from the Protein Data Bank coordinate files. Beyond the MPs of known structure, we sought papers in the MP literature that contained keywords, such as gene fusion, suggestive of direct experimental studies of topology. Reported MP topologies were included in the database only after careful evaluation of published experimental results. For example, in the case of gene fusion data (Boyd 1994), the density of fusions had to be sufficient to inspire confidence that the topology had been explored thoroughly, as in the case of lac permease (Calamia and Manoil 1990). MPtopo has now grown to 90 proteins or subunits contributing 534 TM segments, a size we believe sufficient for evaluating existing prediction algorithms and creating new ones.

Protein selection criteria and database characteristics

The TM segment assignments of membrane proteins or subunits of known 3D structure, labeled alphabetically in the data records, were obtained from the published reports or by examination of the PDB coordinate files. In a few cases, secondary structure determinations obtained using the Kabsch and Sander (1983) DSSP program were used to establish segment assignments.

Some surface-bound, monotopic membrane proteins without TM segments, such as prostaglandin synthase (Picot et al. 1994), were also included in MPtopo to aid the development of algorithms for distinguishing monotopic from TM proteins. These proteins are identified by an asterisk following the protein name in the data record, appropriate comments in the remarks field, and an asterisk on the TM segment alphabetic assignment, indicating surface-lying helices. The recently reported water and glycerol channel proteins (Fu et al. 2000; Murata et al. 2000) are also marked with asterisks because they have TM segments composed of two end-to-end helices that are distant in the sequence. In addition to comments in the remarks field, each partial helix is recorded as a TM segment with an asterisk on the alphabetic identifier. To identify the segment pairs constituting the full TM segment, the first partial segment in the sequence is identified, for example, as C*, and the second one as *C.

In the absence of 3D structures, TM sequence assignments were obtained from published reports of topology that included experimental confirmation using techniques such as gene fusion (Manoil and Beckwith 1986), Asn-linked glycosylation (Pan et al. 1999), or amino acid deletions (Wolin and Kaback 1999). In some cases, such as rhodopsin (whose 3D structure was recently reported [Palczewski et al. 2000]), an overwhelming amount of data of all sorts from a large number of publications provided strong, coherent evidence for TM segment assignments. Even when noncrystallographic experimental data are used to validate topology, however, most specific TM sequence assignments in published reports originate from hydropathy plots. In most cases, authors provided topology diagrams that assigned the TM segments believed to be located within the membrane bilayer. Such assignments were used for specifying TM segments in MPtopo. In a few cases, topologies of proteins with long interhelix connecting loops were specified without specific assignment of the membrane-buried segments. Under those circumstances, we identified likely membrane-buried segments by seeking long runs of hydrophobic residues bounded by charged residues. Our assignments included only the uncharged residues lying between the bounding charged residues.
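
The last step described above (seeking residue runs bounded by charged residues) can be illustrated with a minimal Python sketch. It is a simplification, not the procedure actually used to build MPtopo: it treats any maximal run of uncharged residues as a candidate, without scoring hydrophobicity. The function name, the set of residues treated as charged (Asp, Glu, Lys, Arg), and the `min_length` cut-off are all assumptions made for the example.

```python
import re

# Assumption: Asp, Glu, Lys, and Arg count as charged; His is ignored.
CHARGED = "DEKR"


def uncharged_runs(sequence, min_length=15):
    """Return (start, end, segment) for runs of uncharged residues.

    start and end are 1-based, inclusive positions. Only runs flanked by a
    charged residue on both sides are kept. min_length is a hypothetical
    cut-off, not a value taken from the text.
    """
    seq = sequence.upper()
    pattern = re.compile(rf"[^{CHARGED}]+")
    runs = []
    for match in pattern.finditer(seq):
        # A maximal uncharged run is bounded by charged residues unless it
        # touches either end of the sequence.
        bounded = match.start() > 0 and match.end() < len(seq)
        if bounded and len(match.group()) >= min_length:
            runs.append((match.start() + 1, match.end(), match.group()))
    return runs
```

Because each returned segment excludes the flanking charged residues, it contains only the intervening uncharged residues, in the spirit of the assignments described above.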

Hydropathy plots vary in important details even among closely related proteins (White and Jacobs 1990); seemingly subtle differences in sequences can have big effects on decision thresholds (Edelman and White 1989; Edelman 1993). Because of our interest in prediction tools based strictly on physicochemical criteria (White and Wimley 1999), we did not reject any protein because of high homology or sequence identity with a protein already in MPtopo.

The data fields and structure of each MPtopo entry are summarized in Figure 1B. We have divided the entries into three subsets: 3D_helix, 1D_helix, and 3D_other. The first two contain helix-bundle proteins segregated according to the existence or absence, respectively, of 3D structures. 3D_other includes β-barrel and monotopic MPs whose structures have been determined crystallographically. The general characteristics of the database are summarized in Table 1. The lengths of TM segments show a wide distribution. Within 3D_helix, the average TM helix length is 28 residues, ranging from 17 to 43 residues. These values are quite similar to those observed by Bowie (1997) for 45 TM helices from three helix-bundle MP structures. The length distribution for 1D_helix is slightly broader, 9 to 46 residues, with an average length of 22 residues. This shorter average undoubtedly reflects the influence of hydropathy plots performed with window lengths of 19 or 21 residues.
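
To make the role of the window length concrete, the following is a minimal Python sketch of a sliding-window hydropathy profile. The text does not name a hydropathy scale, so the Kyte-Doolittle scale is used here purely as an assumption; the function name and the default window of 19 residues are likewise illustrative.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the text does not name one).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the mean hydropathy of every full window along the sequence.

    Element i of the result covers residues i .. i + window - 1 (0-based).
    Sequences shorter than the window give an empty list.
    """
    values = [KD[residue] for residue in sequence.upper()]
    if len(values) < window:
        return []
    total = sum(values[:window])
    profile = [total / window]
    # Slide the window one residue at a time, updating the running sum.
    for i in range(window, len(values)):
        total += values[i] - values[i - window]
        profile.append(total / window)
    return profile
```

Segments read off such a profile tend to have lengths close to the window size, which is one way to picture why assignments derived from 19- or 21-residue windows would cluster near 22 residues.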

Accuracies of prediction algorithms

We used the 3D_helix and 1D_helix subsets of MPtopo to determine the TM-segment prediction accuracy of four algorithms designed for predicting TM helices: HMM (Tusnády and Simon 1998), TopPred II (von Heijne 1992), TMAP (Persson and Argos 1994; Milpetz et al. 1995), and PHDhtm (Rost et al. 1995, 1996). Prediction accuracy Q was computed using the per-segment method of Tusnády and Simon (1998). The results are summarized in Table 2. All four algorithms yield impressive per-segment prediction accuracies, the highest reaching 97% for the 3D_helix set. Interestingly, the prediction accuracies for the 1D_helix set are systematically lower than for 3D_helix. As shown in Table 2, the reduced accuracies are mainly caused by false-positive TM segment predictions. The causes of this result are uncertain. Two simple possibilities include algorithmic bias toward MPs of known structure and imperfections in the experimental methods for validating topology. A third possibility is the existence of exceptionally hydrophobic extramembrane domains in the 1D_helix set. We tested this possibility using automated hydropathy analysis of a collection of ∼1000 soluble proteins. One or two potential TM segments were found for ∼10% of the proteins.
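
The per-segment accuracy can be sketched in Python as follows. The equation itself is not reproduced in this text, so the form used here, the geometric mean of Ncorrect/Nknown and Ncorrect/Npredicted expressed as a percentage, is an assumption. The function names are illustrative.

```python
from math import sqrt


def per_segment_accuracy(n_correct, n_predicted, n_known):
    """Assumed form: Q = 100 * sqrt((Ncorrect/Nknown) * (Ncorrect/Npredicted))."""
    if n_predicted == 0 or n_known == 0:
        return 0.0
    return 100.0 * sqrt((n_correct / n_known) * (n_correct / n_predicted))


def false_positives(n_correct, n_predicted):
    """Predicted helices that do not match any known helix."""
    return n_predicted - n_correct


def missed_helices(n_correct, n_known):
    """Known helices that were not predicted correctly."""
    return n_known - n_correct
```

With the PHDhtm counts for 3D_helix in Table 2, `per_segment_accuracy(146, 152, 150)` evaluates to about 96.7, in line with the tabulated 97%. For HMM on 1D_helix, `false_positives(240, 264)` gives 24 overpredicted helices against only 2 missed ones, which illustrates the false-positive effect noted above.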

Database accessibility and availability

MPtopo is available at http://blanco.biomol.uci.edu/mptopo. It can be downloaded as a composite text file or searched using a Java applet (MPtopo Querier) connected to an SQL-based server (Fig. 1A). Search results are returned as complete database entries, displayed in a separate results window (Fig. 1B). We would be pleased to receive suggestions for other membrane proteins to include in MPtopo.
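
As a rough illustration of an SQL-based search that combines fields, the following Python sketch uses the standard-library `sqlite3` module with an in-memory table. Everything about it is hypothetical: the table name `entries`, the column names, and the placeholder rows are invented for the example and do not describe the actual MPtopo server or its records.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory table with a hypothetical subset of entry fields."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries ("
        "name TEXT, subset TEXT, n_tm INTEGER, pir_id TEXT, pdb_id TEXT)"
    )
    # Placeholder rows for illustration only; these are not MPtopo records.
    rows = [
        ("example protein A", "3D_helix", 7, None, "XXXX"),
        ("example protein B", "1D_helix", 12, "PIR0001", None),
        ("example protein C", "3D_other", 16, None, "YYYY"),
    ]
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, subset=None, n_tm=None, name_like=None):
    """Combine whichever criteria are given with AND, using bound parameters."""
    clauses, params = [], []
    if subset is not None:
        clauses.append("subset = ?")
        params.append(subset)
    if n_tm is not None:
        clauses.append("n_tm = ?")
        params.append(n_tm)
    if name_like is not None:
        clauses.append("name LIKE ?")
        params.append(f"%{name_like}%")
    sql = "SELECT name, subset, n_tm FROM entries"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return conn.execute(sql + " ORDER BY name", params).fetchall()
```

Calling `search(build_demo_db(), subset="1D_helix", n_tm=12)` restricts the search to one subset and one TM segment count, mirroring the kind of combined query described above.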

Table 1. General characteristics of the MPtopo database

                                    MPtopo subset
                                    3D_helix    1D_helix    3D_other
No. of proteins (a)                 41          38          11
No. of total residues               8960        15018       4171
Average sequence length (b)         218         395         379
No. of residues in TM segments      4186        5426        1671
No. of total TM segments            150         242         142
Average TM segment length (b)       28 ± 5      22 ± 4      12 ± 3
TM segment length range (b)         17-43       9-46        4-20

  a Includes protein subunits.
  b Given as the number of residues.

Table 2. Prediction accuracy of various algorithms using MPtopo

                                     No. of transmembrane helices (a)
MPtopo subset     Algorithm          Npredicted    Ncorrect    Q (%) (b)
3D_helix (Nknown = 150)
                  PHDhtm (c)         152           146         97
                  HMM (d)            154           145         95
                  TopPred II (e)     162           148         95
                  TMAP (f)           139           136         96
1D_helix (Nknown = 242)
                  PHDhtm             250           228         93
                  HMM                264           240         95
                  TopPred II         259           224         89
                  TMAP               241           221         92

  a Nknown, Npredicted, and Ncorrect are, respectively, the number of experimentally known helices, the total number of predicted helices, and the number predicted correctly. Ncorrect is defined as predicted helices that exhibited at least a 50% overlap with known transmembrane helices.
  b Prediction accuracy Q was determined as described in Tusnády and Simon (1998). [Equation image]
  c From the PredictProtein automatic prediction server (Rost et al. 1996) using the default settings.
  d Hidden Markov Model (HMM; Tusnády and Simon 1998) used with single sequence information from MPtopo.
  e TopPred II (von Heijne 1992) used with default settings: window size top = 11, window size bottom = 21, upper cut-off = 1.0, lower cut-off = 0.6.
  f TMAP (Persson and Argos 1994; Milpetz et al. 1995) was used with single sequence information from MPtopo.
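
The 50% overlap criterion used to define Ncorrect in Table 2 can be sketched in Python as follows. The table footnote does not say which helix the 50% refers to or how multiple matches are handled, so two assumptions are made here: the overlap is measured as a fraction of the known helix length, and each known helix can be matched by at most one predicted helix. Segments are given as hypothetical (start, end) tuples with inclusive residue numbers.

```python
def overlap_length(a, b):
    """Number of residues shared by two inclusive (start, end) segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def count_correct(known, predicted, min_fraction=0.5):
    """Count predicted helices that overlap a known helix sufficiently.

    Assumptions: the fraction is relative to the known helix length, and a
    known helix is removed from consideration once it has been matched.
    """
    unmatched = list(known)
    n_correct = 0
    for pred in predicted:
        for helix in unmatched:
            helix_length = helix[1] - helix[0] + 1
            if overlap_length(pred, helix) >= min_fraction * helix_length:
                unmatched.remove(helix)
                n_correct += 1
                break
    return n_correct
```

The resulting count, together with the number of known and predicted helices, supplies the three quantities on which the per-segment accuracy Q is based.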

Fig. 1. Web tools for using MPtopo. (A) MPtopo Querier, a Java applet designed to search the MPtopo database using an SQL-based server. With Querier, MPtopo may be searched by protein name, authors, number of transmembrane (TM) segments, Protein Information Resource (PIR) identifier, Protein Data Bank (PDB) identifier, or any combination of these fields. The search can be performed on the whole MPtopo database or limited to one of the subsets. MPtopo Querier is available for use over the World Wide Web from our Web site at http://blanco.biomol.uci.edu/mptopo. (B) Search results from Querier are displayed within the results window. Each returned result is displayed as the complete database entry. Each entry contains 15 fields including the complete protein sequence, the number of TM segments, TM segment start and end positions, PIR and PDB identifiers (when available), a complete reference citation, and the topology (Nterm = in or out). Selected entries may be sent to MPEx, a hydropathy plot TM segment prediction tool developed in our laboratory (S. Jayasinghe, K. Hristova, and S.H. White, in prep.). The complete database is also available for anonymous ftp download as a plain text file at blanco.biomol.uci.edu/mptopo.


Acknowledgements

We are pleased to acknowledge Michael Myers for his assistance in maintaining the MPtopo database and in editing this manuscript. This work is supported by the National Institute of General Medical Sciences (GM-46823).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

References
