Reference-free transcriptome assembly in non-model animals from next-generation sequencing data

Authors

  • V. CAHAIS,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
  • P. GAYRAL,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
  • G. TSAGKOGEORGA,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
    2. School of Biological and Chemical Sciences, Queen Mary University of London, Mile End Road, London, E1 4NS, UK
  • J. MELO-FERREIRA,

    1. CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4485-661 Vairão, Portugal
  • M. BALLENGHIEN,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
  • L. WEINERT,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
    2. Medical Research Council (MRC), Centre for Outbreak Analysis and Modelling, Imperial College Faculty of Medicine, London, UK
  • Y. CHIARI,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
    2. CIBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Campus Agrário de Vairão, 4485-661 Vairão, Portugal
  • K. BELKHIR,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
  • V. RANWEZ,

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France
  • N. GALTIER

    1. CNRS UMR 5554, Institut des Sciences de l’Evolution de Montpellier, Université Montpellier 2, Place E. Bataillon, 34095 Montpellier, France

Nicolas Galtier, Fax: +33 467 14 36 10;
E-mail: nicolas.galtier@univ-montp2.fr

Abstract

Next-generation sequencing (NGS) technologies offer the opportunity for population genomic studies of non-model organisms sampled in the wild. The transcriptome is a convenient and popular target for such purposes. However, designing genetic markers from NGS transcriptome data requires assembling gene-coding sequences from short reads. This is a complex task owing to gene duplications, genetic polymorphism, alternative splicing and transcription noise. Typical assembly programs return thousands of predicted contigs whose connection to the species' true gene content is unclear, and from which SNPs are difficult to define. Here, the transcriptomes of five diverse non-model animal species (hare, turtle, ant, oyster and tunicate) were assembled from newly generated 454 and Illumina sequence reads. In two species for which a reference genome is available, a new procedure was introduced to annotate each predicted contig as either a full-length cDNA, fragment, chimera, allele, paralogue, genomic sequence or other, based on the number of, and overlap between, BLAST hits to the appropriate reference. Analyses showed that (i) the highest-quality assemblies are obtained when 454 and Illumina data are combined, (ii) typical de novo assemblies include a majority of irrelevant cDNA predictions and (iii) assemblies can be appropriately cleaned by filtering contigs based on length and coverage. We conclude that robust, reference-free assembly of thousands of genes from transcriptomic NGS data is possible, opening promising perspectives for transcriptome-based population genomics in animals. A Galaxy pipeline implementing our best-performing assembly strategy is provided.
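
As an illustration of point (iii), filtering contigs on length and coverage can be sketched in a few lines of Python. This is a minimal sketch and not the authors' pipeline: the `Contig` record, the `filter_contigs` function and the threshold values (500 bp, 5x mean coverage) are hypothetical placeholders chosen for the example, since the abstract does not state the cut-offs used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Contig:
    """Hypothetical summary record for one assembled contig."""
    name: str
    length: int           # contig length in base pairs
    mean_coverage: float  # average number of reads covering each position


def filter_contigs(contigs: Iterable[Contig],
                   min_length: int = 500,
                   min_coverage: float = 5.0) -> List[Contig]:
    """Keep contigs that meet both thresholds.

    The default thresholds are illustrative assumptions, not values
    taken from the article.
    """
    return [c for c in contigs
            if c.length >= min_length and c.mean_coverage >= min_coverage]


example = [
    Contig("contig_1", 1800, 42.0),
    Contig("contig_2", 220, 60.0),   # rejected: too short
    Contig("contig_3", 950, 1.5),    # rejected: too weakly covered
    Contig("contig_4", 500, 5.0),    # kept: exactly at both thresholds
]
kept = filter_contigs(example)
print([c.name for c in kept])  # prints ['contig_1', 'contig_4']
```

In practice the thresholds would be tuned per data set, for instance by checking how many retained contigs match a reference when one is available.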
