Structure-guided approach for detecting large domain inserts in protein sequences as illustrated using the haloacid dehalogenase superfamily

Authors

  • Chetanya Pandya,

    1. Bioinformatics Graduate Program, Boston University, Boston, Massachusetts
  • Debra Dunaway-Mariano,

    1. Department of Chemistry and Chemical Biology, University of New Mexico, Albuquerque, New Mexico
  • Yu Xia,

    Corresponding author
    1. Department of Bioengineering, Faculty of Engineering, McGill University, Montreal, Quebec, Canada
    • Correspondence to: Yu Xia, Department of Bioengineering, Faculty of Engineering, McGill University, Montreal, Quebec, Canada H3A 0C3. E-mail: brandon.xia@mcgill.ca or Karen N. Allen, Department of Chemistry, Boston University, 590 Commonwealth Avenue, Boston, MA 02215. E-mail: drkallen@bu.edu

  • Karen N. Allen

    Corresponding author
    1. Bioinformatics Graduate Program, Boston University, Boston, Massachusetts
    2. Department of Chemistry, Boston University, Boston, Massachusetts

ABSTRACT

In multi-domain proteins, the domains typically run end-to-end, that is, one domain follows the C-terminus of another domain. However, approximately 10% of multi-domain proteins are formed by insertion of one domain sequence into that of another domain. Detecting such insertions within protein sequences is a fundamental challenge in structural biology. The haloacid dehalogenase superfamily (HADSF) serves as a challenging model system wherein a variable cap domain (∼5–200 residues in length) accessorizes the ubiquitous Rossmann-fold core domain, with variations in insertion site and topology corresponding to different classes of cap types. Herein, we describe a comprehensive computational strategy, CapPredictor, for determining large, variable domain insertions in protein sequences. Using a novel sequence-alignment algorithm in conjunction with a structure-guided sequence profile from 154 core-domain-only structures, more than 40,000 HADSF member sequences were assigned cap types. The resulting data set afforded insight into HADSF evolution. Notably, a similar distribution of cap-type classes across different phyla was observed, indicating that all cap types existed in the last universal common ancestor. In addition, comparative analyses of the predicted cap-type and functional assignments showed that different cap types carry out similar chemistries. Thus, while cap domains play a role in substrate recognition and chemical reactivity, cap-type does not strictly define functional class. Through this example, we have shown that CapPredictor is an effective new tool for the study of form and function in protein families where domain insertion occurs. Proteins 2014; 82:1896–1906. © 2014 Wiley Periodicals, Inc.
