Using Illumina next generation sequencing technologies to sequence multigene families in de novo species

Authors

  • Graham M. Hughes,

    1. UCD School of Biological and Environmental Science, University College Dublin, Dublin 4, Ireland
    2. UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
  • Li Gang,

    1. Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, USA
  • William J. Murphy,

    1. Department of Veterinary Integrative Biosciences, Texas A&M University, College Station, TX, USA
  • Desmond G. Higgins,

    1. UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
  • Emma C. Teeling

    Corresponding author
    1. UCD Conway Institute of Biomolecular and Biomedical Research, University College Dublin, Dublin 4, Ireland
    2. UCD School of Biological and Environmental Science, University College Dublin, Dublin 4, Ireland

Correspondence: Emma C. Teeling, Fax: +1-353-1-716-1153; E-mail: emma.teeling@ucd.ie

Abstract

The advent of Next Generation Sequencing Technology (NGST) has revolutionized molecular biology research, allowing for rapid gene/genome sequencing from a multitude of diverse species. As high-throughput sequencing becomes more accessible, more efficient workflows must be developed to deal with the volume of data produced and better assemble the genomes of de novo lineages. We combine traditional laboratory methods with Illumina NGST to amplify and sequence the largest mammalian multigene family, the Olfactory Receptor gene family, for species with and without a reference genome. We develop novel assembly methods to annotate and filter these data, which can be utilized for any gene family or any species. We find no significant difference between our data and available genomic data in the ratio of genes within their respective gene families. Using simulated data, we explore the limitations of short-read sequence data and our assembly in recovering this gene family. We highlight the benefits and shortcomings of these methods. Compared with traditional polymerase chain reaction, cloning and Sanger sequencing methodologies, our pipeline increases yield and sequencing efficiency without reducing the number of unique genes amplified. Because no cloning step is required, data generation time is shortened. The novel downstream methodologies and workflows described provide a tool that many fields of biology can use to access and analyze the vast quantities of data generated. By combining laboratory and in silico methods, we provide a means of extracting genomic information for multigene families without complete genome sequencing.
