Abstract
Restriction site-associated DNA sequencing (RAD-Seq) is an economical and efficient method for SNP discovery and genotyping. As with other sequencing-by-synthesis methods, RAD-Seq produces stochastic count data and requires sensitive analysis to develop or genotype markers accurately. We show that there are several sources of bias specific to RAD-Seq that are not explicitly addressed by current genotyping tools, namely restriction fragment bias, restriction site heterozygosity and PCR GC content bias. We explore the performance of existing analysis tools given these biases and discuss approaches to limiting or handling biases in RAD-Seq data. While these biases need to be taken seriously, we believe RAD loci affected by them can be excluded or processed with relative ease in most cases and that most RAD loci will be accurately genotyped by existing tools.
Introduction
The use of high-throughput sequencing-by-synthesis technologies for ecology and conservation depends on accurate inference of biological signal from technical noise. Individual genotypes and population allele frequencies must be inferred from raw sequence data, preferably at low cost and with low sequencing and analytical effort. While it is now possible to generate sequence data from entire genomes at relatively low cost, the sequencing-by-synthesis process introduces noise from a number of novel sources and reveals existing sources of noise that were previously undetected by less sensitive technology, making the path from raw sequence reads to biological information far from straightforward. In recent years, many sources of noise in high-throughput DNA and RNA sequencing data have been identified and either mitigated during library preparation or corrected during analysis (Aird et al. 2011; Quince et al. 2011; Meacham et al. 2011; Quail et al. 2008). However, methods appropriate for one sequencing method are not necessarily appropriate for others, and new library preparation methods may produce novel sources of noise.
Restriction site-associated DNA sequencing (RAD-Seq; Miller et al. 2007; Baird et al. 2008; Davey & Blaxter 2011) is a method for SNP discovery and genotyping using sequencing-by-synthesis. It is one of a number of reduced representation methods that sample a shared set of sites across the genome in many individuals or pools, making population-scale sequencing possible at a fraction of the cost of whole genome sequencing (Davey et al. 2011). RAD-Seq is suitable for fine-scale linkage mapping (Amores et al. 2011; Chutimanitsakun et al. 2011; Baxter et al. 2011), phylogenetics and phylogeography (Rubin et al. 2012; Nadeau et al. 2012; Emerson et al. 2010), genome scaffolding (Catchen et al. 2011; Heliconius Genome Consortium 2012) and population genetics (Andersen et al. 2012; Hohenlohe et al. 2012). RAD-Seq has also been used to generate large SNP data sets for many species, most recently in salmon (Houston et al. 2012), cutthroat and rainbow trout (Amish et al. 2012), artichoke (Scaglione et al. 2012), guppy (Willing et al. 2011) and eggplant (Barchi et al. 2011).
The RAD-Seq method has been well documented elsewhere (Baird et al. 2008; Etter et al. 2011a). Briefly, genomic DNA from multiple samples of interest is digested with a chosen restriction enzyme, and adapters that contain sample-specific barcodes and end with an overhang matching the restriction enzyme's cut site are ligated to the digested restriction fragments. Adapter-ligated restriction fragments are sheared to a size suitable for Illumina sequencing (typically 300–700 bp), and sheared fragments containing restriction site overhangs are amplified using polymerase chain reaction (PCR) and sequenced, typically using Illumina sequencing-by-synthesis.
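The digestion step can be illustrated in silico. The following is a minimal sketch, not part of any published RAD-Seq pipeline: it locates every occurrence of a recognition site in a sequence and reports the two reads expected to flank each site, one on each strand. The function names are hypothetical, the SbfI recognition site (CCTGCAGG) in the usage example is chosen only for illustration, and each tag is assumed to begin at the first base of the recognition site, whereas real reads begin partway through the site, at the overhang left by the enzyme.

```python
def find_sites(genome, site):
    """Return the 0-based start position of every occurrence of `site`."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        # Step forward by one base so overlapping sites are also found.
        start = genome.find(site, start + 1)
    return positions


def revcomp(seq):
    """Reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def rad_tags(genome, site, read_len):
    """Return (position, downstream_tag, upstream_tag) for each site.

    Simplifying assumption: both tags start at the recognition site itself
    and extend `read_len` bases away from it, one per strand.
    """
    tags = []
    for pos in find_sites(genome, site):
        end = pos + len(site)
        downstream = genome[pos:pos + read_len]
        upstream = revcomp(genome[max(0, end - read_len):end])
        tags.append((pos, downstream, upstream))
    return tags


if __name__ == "__main__":
    toy_genome = "AAAA" + "CCTGCAGG" + "TTTT"
    for pos, down, up in rad_tags(toy_genome, "CCTGCAGG", 10):
        print(pos, down, up)
```

Each restriction site therefore yields two candidate RAD tags, which is why the number of tags expected from a genome is roughly twice the number of cut sites.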
RAD-Seq reads can be aligned to reference genomes and genotyped using standard tools designed for whole genome sequencing data (Nielsen et al. 2011), including aligners such as BWA (Li & Durbin 2009) and Stampy (Lunter & Goodson 2011), and genotypers such as those built into the Genome Analysis Tool Kit (GATK; DePristo et al. 2011) and SAMtools (Li 2011). RAD-Seq can also be used de novo, generating large marker sets where no reference genome is available. Several tools have been developed to produce RAD marker sets de novo, including Stacks (Catchen et al. 2011), RaPiD (Willing et al. 2011) and RADtools (Baxter et al. 2011).
RAD-Seq projects typically produce thousands to tens of thousands of markers, several orders of magnitude greater than is possible with traditional technologies such as microsatellites or AFLPs, at a fraction of the labour cost. However, separating high-quality markers from sequencing noise is challenging. Manual validation of such large marker sets is impractical, and the accuracy of automatic analysis tools is not yet clear. Unfortunately, because RAD-Seq has considerable benefits to researchers working with non-model species, the vast majority of publicly available RAD-Seq data are derived from populations with no reference genome or sequence variation information, making it difficult to validate RAD-Seq marker sets in any depth.
Typically, RAD-Seq analysis proceeds by applying quality thresholds or likelihood ratio tests at multiple levels (for example, raw sequence, mapping and genotyping), and by testing for expected biological patterns. For example, for linkage mapping, markers can be removed if they are missing in multiple individuals or have segregation distortion (Amores et al. 2011; Miller et al. 2011); for population studies, repetitive regions and duplicates can be screened by filtering by read coverage or by testing patterns of heterozygosity (Hohenlohe et al. 2011). Marker sets can be validated using laboratory methods such as PCR (e.g. Scaglione et al. 2012) or SNP chips, but as comprehensive validation of tens of thousands of markers remains expensive and labour-intensive, it would be valuable to improve bioinformatic filtering to reduce the cost of laboratory validation.
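The linkage-mapping filters described above can be sketched as follows. This is an illustrative example only, assuming a cross in which two genotype classes, labelled 'A' and 'B' here, are expected in a 1:1 ratio. The function names, the 20% missing-data threshold and the use of a chi-square test with one degree of freedom (critical value 3.841 at the 5% level) are assumptions made for the sketch, not settings taken from the studies cited.

```python
def chi_square_1to1(n_a, n_b):
    """Chi-square statistic for observed counts against a 1:1 expectation."""
    total = n_a + n_b
    if total == 0:
        return 0.0
    expected = total / 2
    return ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected


def keep_marker(genotypes, max_missing=0.2, chi_crit=3.841):
    """Decide whether a marker passes both filters.

    `genotypes` holds one call per individual: 'A', 'B' or None (missing).
    The marker is rejected if too many individuals are missing or if the
    genotype counts deviate significantly from 1:1 (segregation distortion).
    """
    n = len(genotypes)
    if n == 0:
        return False
    missing = sum(g is None for g in genotypes)
    if missing / n > max_missing:
        return False
    n_a = genotypes.count("A")
    n_b = genotypes.count("B")
    return chi_square_1to1(n_a, n_b) <= chi_crit


if __name__ == "__main__":
    balanced = ["A", "B"] * 10
    distorted = ["A"] * 18 + ["B"] * 2
    print(keep_marker(balanced), keep_marker(distorted))
```

Fixed thresholds like these are simple to apply, but they embody exactly the kind of assumption about the data that may discard useful markers or retain inaccurate ones if the underlying noise is not well understood.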
Illumina sequencing of short (<1 kb) fragments involves sequencing one (read 1, single end) or both (reads 1 and 2, paired end) ends of each fragment, typically producing reads 100 bp long. RAD markers can be developed using only single-end Illumina sequencing, identifying single nucleotide polymorphisms (SNPs) and insertions or deletions (indels) in read 1 sequences. However, paired-end sequencing can also be used for RAD-Seq, a technique that has several novel implications compared with paired-end sequencing of genomic DNA and that makes RAD-Seq particularly attractive for de novo studies. First, read 2 sequences up- or downstream of a particular restriction site can be assembled into 300- to 600-bp RAD contigs (Etter et al. 2011b; Willing et al. 2011), which can be used to investigate synteny and gene content in otherwise unsequenced genomes (Baxter et al. 2011). Second, read 2 sequences have been used to attempt to remove PCR duplicates from RAD-Seq data (Baxter et al. 2011), with the aim of reducing GC bias known to be introduced by PCR (Benjamini & Speed 2012; Aird et al. 2011; Quail et al. 2008).
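One plausible way to use read 2 for duplicate removal is sketched below. Because fragments are sheared at random, independent fragments from the same restriction site usually have different read 2 sequences, so pairs whose read 1 and read 2 are both identical are treated here as PCR copies of a single fragment. This is an assumption-laden illustration rather than the procedure used in the studies cited: the function name is hypothetical, matching is by exact sequence identity, and sequencing errors and base qualities are ignored.

```python
def remove_duplicate_pairs(pairs):
    """Keep the first occurrence of each (read1, read2) sequence pair.

    `pairs` is an iterable of (read1, read2) string tuples. Pairs that match
    an earlier pair exactly in both reads are assumed to be PCR duplicates.
    """
    seen = set()
    unique = []
    for read1, read2 in pairs:
        key = (read1, read2)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


if __name__ == "__main__":
    pairs = [
        ("TGCAGGACGT", "TTTTACGTAA"),
        ("TGCAGGACGT", "TTTTACGTAA"),  # same read 1 and read 2: duplicate
        ("TGCAGGACGT", "GGGGACGTAA"),  # same read 1, different shear point
    ]
    print(len(remove_duplicate_pairs(pairs)))
```

Exact matching is deliberately conservative: a duplicate carrying a single sequencing error in either read would be retained as if it were an independent fragment.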
While RAD-Seq can be used for reference-based approaches, several methods with simpler library preparation protocols have been developed (e.g. Andolfatto et al. 2011; Elshire et al. 2011; Wang et al. 2012), which may be preferable for the study of laboratory crosses or where a high-quality reference sequence is available (Davey et al. 2011), as this allows imputation of genotypes in the face of missing data. In theory, RAD-Seq develops markers more robustly than these related methods and so is more suitable for de novo analyses of wild populations, where little is known about the source populations and so imputation of missing data is very difficult. However, no empirical study of technical variation in RAD-Seq data has been published to date. While RAD-Seq is in principle unbiased with respect to many population genetics statistics and so may avoid known issues of ascertainment bias in marker sets (Helyar et al. 2011), in practice there has been no detailed analysis of noise in RAD-Seq data, and it may be that commonly used quality thresholds and post hoc tests are unsuitable. This may mean researchers are discarding potentially useful markers, retaining inaccurate markers or incorrectly genotyping real markers.
We therefore set out to investigate the characteristics of RAD-Seq data, to validate existing analysis techniques and propose improvements where appropriate. In the process, we identified several sources of sequencing variation unique to RAD-Seq, above and beyond those found in other types of sequencing-by-synthesis data. These sources of variation have implications for genotyping of RAD markers. We also investigated methods for RAD contig assembly. Multiple assemblers have been used to generate RAD contigs, including Velvet (Catchen et al. 2011; Etter et al. 2011b), VelvetOptimiser (Baxter et al. 2011; Houston et al. 2012) and LOCASopt (Willing et al. 2011). However, to date, there has been no comparison of the performance of existing assemblers on heterozygous RAD paired-end data where reference sequences are available. We hope this work will bring clarity to the process of generating RAD-Seq data and enable thorough analysis of both simple and complex studies.