Local descriptors of protein structure: A systematic analysis of the sequence-structure relationship in proteins using short- and long-range interactions

Authors

  • Torgeir R. Hvidsten,

    1. Linnaeus Centre for Bioinformatics, Uppsala University, Uppsala, Sweden
    2. Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
    • Torgeir R. Hvidsten and Andriy Kryshtafovych contributed equally to this work.

  • Andriy Kryshtafovych,

    1. Genome Center, UC Davis, Davis, California

  • Krzysztof Fidelis

    Corresponding author
    1. Genome Center, UC Davis, Davis, California
    • UC Davis Genome Center, 451 East Health Sciences Drive, Davis, CA 95616-8816

Abstract

Local protein structure representations that incorporate long-range contacts between residues are often considered in protein structure comparison but have found relatively little use in structure prediction, where assembly from single backbone fragments dominates. Here, we introduce the concept of local descriptors of protein structure to characterize local neighborhoods of amino acids, including short- and long-range interactions. We build a library of recurring local descriptors and show that this library is general enough to allow assembly of unseen protein structures. The library could on average re-assemble 83% of 119 unseen structures, and showed little or no performance decrease between homologous targets and targets with folds not represented among the domains used to build it. We then systematically evaluate the descriptor library to establish the level of the sequence signal in sets of protein fragments of similar geometrical conformation. In particular, we test whether that signal is strong enough to facilitate correct assignment and alignment of these local geometries to new sequences. We use the signal to assign descriptors to a test set of 479 sequences with less than 40% sequence identity to any domain used to build the library, and show that on average more than 50% of the backbone fragments constituting descriptors can be correctly aligned. We also use the assigned descriptors to infer SCOP folds, and show that correct predictions can be made in many of the 151 cases where PSI-BLAST was unable to detect significant sequence similarity to proteins in the library. Although the combinatorial problem of simultaneously aligning several fragments to a sequence is a major bottleneck compared with single-fragment methods, the advantage of the current approach is that correct alignments imply correct long-range distance constraints. The lack of these constraints is most likely the major reason why structure prediction methods fail to consistently produce adequate models when good templates are unavailable or undetectable. Thus, we believe that the current study offers new and valuable insight into the prediction of sequence-structure relationships in proteins. Proteins 2009. © 2008 Wiley-Liss, Inc.
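
The idea of a local descriptor, a central residue together with the backbone fragments that surround it in space, can be illustrated with a minimal Python sketch. It is not the authors' procedure: the function name `local_descriptor`, the 6.5 Å cutoff, the `flank` extension, and the use of one representative atom per residue are all assumptions made only for illustration, since the abstract does not state these parameters.

```python
import math

def local_descriptor(coords, center, cutoff=6.5, flank=2):
    """Return merged (start, end) index ranges of backbone fragments
    that lie spatially close to residue `center`.

    coords : list of (x, y, z) tuples, one representative atom per residue
    cutoff : distance threshold in angstroms (illustrative value, an assumption)
    flank  : residues added on each side of a close residue (an assumption)
    """
    n = len(coords)
    # Residues whose representative atom lies within `cutoff` of the center.
    close = [i for i in range(n)
             if math.dist(coords[i], coords[center]) <= cutoff]
    # Extend every close residue by `flank` neighbours along the chain.
    ranges = sorted((max(0, i - flank), min(n - 1, i + flank)) for i in close)
    # Merge overlapping or adjacent ranges into contiguous fragments.
    fragments = []
    for start, end in ranges:
        if fragments and start <= fragments[-1][1] + 1:
            fragments[-1] = (fragments[-1][0], max(fragments[-1][1], end))
        else:
            fragments.append((start, end))
    return fragments

# Toy hairpin: residues 0-5 run along x at y = 0, residues 6-11 run back at y = 5.
hairpin = [(i * 3.8, 0.0, 0.0) for i in range(6)] + \
          [((11 - i) * 3.8, 5.0, 0.0) for i in range(6, 12)]

print(local_descriptor(hairpin, center=1, flank=1))  # [(0, 3), (8, 11)]
```

In the toy hairpin, the descriptor centered on residue 1 contains two fragments: the one holding the central residue (short-range) and one from the opposite strand (long-range). Aligning both fragments to a sequence at once is what would supply the long-range distance constraints discussed in the abstract.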
