Identifying cysteines and histidines in transition-metal-binding sites using support vector machines and neural networks

Authors

  • Andrea Passerini,

    Corresponding author
    1. Università degli Studi di Firenze, Dipartimento di Sistemi e Informatica Via di Santa Marta 3, 50139 Firenze, Italy
  • Marco Punta,

    1. CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York City, New York 10032
    2. Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York City, New York 10032
  • Alessio Ceroni,

    1. Università degli Studi di Firenze, Dipartimento di Sistemi e Informatica Via di Santa Marta 3, 50139 Firenze, Italy
  • Burkhard Rost,

    1. CUBIC, Department of Biochemistry and Molecular Biophysics, Columbia University, New York City, New York 10032
    2. Columbia University Center for Computational Biology and Bioinformatics (C2B2), New York City, New York 10032
  • Paolo Frasconi

    1. Università degli Studi di Firenze, Dipartimento di Sistemi e Informatica Via di Santa Marta 3, 50139 Firenze, Italy

  • The software is available from the corresponding author upon demand.

Abstract

Accurate prediction of metal-binding sites in proteins from sequence alone can significantly help in the prediction of protein structure and function, in genome annotation, and in the experimental determination of protein structure. Here, we introduce a method for identifying histidines and cysteines that participate in binding of several transition metals and iron complexes. The method predicts histidines as being in one of two states (free or metal bound) and cysteines in one of three states (free, metal bound, or in disulfide bridges). The method uses only sequence information, exploiting position-specific evolutionary profiles as well as more global descriptors such as protein length and amino acid composition. Our solution is based on a two-stage machine-learning approach. The first stage consists of a support vector machine trained to locally classify the binding state of single histidines and cysteines. The second stage consists of a bidirectional recurrent neural network trained to refine local predictions by taking into account dependencies among residues within the same protein. A simple finite state automaton is employed as a postprocessing step in the second stage in order to enforce an even number of disulfide-bonded cysteines. We predict histidines and cysteines in transition-metal-binding sites at 73% precision and 61% recall. We observe significant differences in performance depending on the ligand (histidine or cysteine) and on the metal bound. We also predict cysteines participating in disulfide bridges at 86% precision and 87% recall. Results are compared to those that would be obtained by using expert information as represented by PROSITE motifs and, for disulfide bonds, to state-of-the-art methods. Proteins 2006. © 2006 Wiley-Liss, Inc.
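The parity constraint mentioned in the abstract can be illustrated with a small sketch. The abstract does not describe how the automaton is built or how it is coupled to the network outputs, so everything below is an assumption made for illustration: a two-state automaton that tracks whether the number of disulfide-bonded cysteines seen so far is even or odd, combined with a Viterbi-style search over hypothetical per-cysteine state probabilities. The function name `enforce_even_disulfides`, the state labels, and the input format are not taken from the paper.

```python
import math

# Hypothetical labels for the three cysteine states named in the abstract.
STATES = ("free", "metal", "disulfide")


def enforce_even_disulfides(probs):
    """Return the most probable labeling with an even number of disulfides.

    probs: list of dicts, one per cysteine, mapping state -> probability
           (assumed to come from some upstream predictor).
    The automaton has two states: parity 0 (even count so far) and
    parity 1 (odd count so far). Only parity 0 is accepting.
    """
    neg_inf = float("-inf")
    # best[parity] = (log-score, labels) of the best prefix ending in parity
    best = {0: (0.0, []), 1: (neg_inf, [])}
    for residue in probs:
        new = {0: (neg_inf, []), 1: (neg_inf, [])}
        for parity, (score, labels) in best.items():
            if score == neg_inf:
                continue  # this parity is unreachable so far
            for state in STATES:
                p = residue.get(state, 0.0)
                if p <= 0.0:
                    continue
                # A disulfide label flips the parity; other labels keep it.
                nxt = parity ^ int(state == "disulfide")
                cand = score + math.log(p)
                if cand > new[nxt][0]:
                    new[nxt] = (cand, labels + [state])
        best = new
    if best[0][0] == neg_inf:
        raise ValueError("no labeling with an even number of disulfides")
    return best[0][1]


# Made-up probabilities for three cysteines in one chain.
example = [
    {"free": 0.05, "metal": 0.05, "disulfide": 0.90},
    {"free": 0.10, "metal": 0.10, "disulfide": 0.80},
    {"free": 0.40, "metal": 0.05, "disulfide": 0.55},
]
print(enforce_even_disulfides(example))
```

In this made-up example, labeling each cysteine independently would call all three disulfide bonded, which is an odd count. The constrained search instead relabels the least confident cysteine as free, since that is the cheapest way to restore an even count.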
