BAliBASE 3.0: Latest developments of the multiple sequence alignment benchmark

Authors

  • Julie D. Thompson,

    Corresponding author
    1. Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP), Illkirch Cedex, France
    • Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP), BP 10142, 67404 Illkirch Cedex, France
  • Patrice Koehl,

    1. Genome Center and Department of Computer Science, University of California, Davis, Davis, California
  • Raymond Ripp,

    1. Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP), Illkirch Cedex, France
  • Olivier Poch

    1. Département de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire (CNRS/INSERM/ULP), Illkirch Cedex, France

Abstract

Multiple sequence alignment is one of the cornerstones of modern molecular biology. It is used to identify conserved motifs, to determine protein domains, to predict 2D/3D structure by homology, and to support evolutionary studies. Recently, high-throughput technologies such as genome sequencing and structural proteomics have led to an explosion in the amount of sequence and structure information available. In response, several new multiple alignment methods have been developed that improve both the efficiency and the quality of protein alignments. Consequently, the benchmarks used to evaluate and compare these methods must also evolve. We present here the latest release of the most widely used multiple alignment benchmark, BAliBASE, which provides high-quality, manually refined reference alignments based on 3D structural superpositions. Version 3.0 of BAliBASE includes new, more challenging test cases, representing the real problems encountered when aligning large sets of complex sequences. Using a novel, semiautomatic update protocol, the number of protein families in the benchmark has been increased and representative test cases are now available that cover most of the protein fold space. The total number of proteins in BAliBASE has also been significantly increased from 1444 to 6255 sequences. In addition, full-length sequences are now provided for all test cases, which represent difficult cases for both global and local alignment programs. Finally, the BAliBASE Web site (http://www-bio3d-igbmc.u-strasbg.fr/balibase) has been completely redesigned to provide a more user-friendly, interactive interface for the visualization of the BAliBASE reference alignments and the associated annotations. Proteins 2005. © 2005 Wiley-Liss, Inc.
