The analysis of food webs and their dynamics facilitates understanding of the mechanistic processes behind community ecology and ecosystem functions. Having accurate techniques for determining dietary ranges and components is critical for this endeavour. While visual analyses and early molecular approaches are highly labour intensive and often lack resolution, recent DNA-based approaches potentially provide more accurate methods for dietary studies. A suite of approaches have been used based on the identification of consumed species by characterization of DNA present in gut or faecal samples. In one approach, a standardized DNA region (DNA barcode) is PCR amplified, amplicons are sequenced and then compared to a reference database for identification. Initially, this involved sequencing clones from PCR products, and studies were limited in scale because of the costs and effort required. The recent development of next generation sequencing (NGS) has made this approach much more powerful, by allowing the direct characterization of dozens of samples with several thousand sequences per PCR product, and has the potential to reveal many consumed species simultaneously (DNA metabarcoding). Continual improvement of NGS technologies, on-going decreases in costs and current massive expansion of reference databases make this approach promising. Here we review the power and pitfalls of NGS diet methods. We present the critical factors to take into account when choosing or designing a suitable barcode. Then, we consider both technical and analytical aspects of NGS diet studies. Finally, we discuss the validation of data accuracy including the viability of producing quantitative data.
The diversity of life on Earth and the abundance of each species are in large part a product of the evolution of strategies for survival. A major part in this struggle for existence is played by predator–prey, host–parasite and herbivore–plant interactions, between which multidimensional webs of interactions have arisen. The complexity of these food webs can be considerable even at the simplest level of connectance, while analysis of fully quantitative webs involving generalist species has so far been beyond our reach. Perturbation in one part of a food web has ramifications throughout the web and changes outcomes in ways that become more difficult to predict as complexity increases (Pimm 2002). However, such predictions are essential if we are to analyse, and determine the mechanistic processes behind, major areas of community ecology and hence better understand ecosystem function (e.g. Estes et al. 2011). Food web dynamics are important in areas of ecology as diverse as conservation biology, agroecology, penetration of ecosystems by alien species, the effects of biodiversity on ecosystem function, and energy flows within and between ecosystems.
Understanding food webs ultimately requires reconstruction and modelling of the overall population interactions of the species involved. An important part of this process is analysis of precisely what has been eaten in the field, information that is difficult to obtain for generalist predators and herbivores. Thus, we need accurate techniques for determining the identities and proportions of different food items in a species' diet. The application of DNA-based techniques has improved our ability to determine diets, and the range of DNA-based approaches used has been extensively considered elsewhere (Symondson 2002; Sheppard & Harwood 2005; Sunderland et al. 2005; Gariépy et al. 2007; King et al. 2008a). Our aim here is not to cover the diversity of methods in detail but to review recent developments based on next generation sequencing (NGS) technologies that have provided us with more precise information on the dietary ranges of predators and herbivores. We also present the critical steps necessary for setting up an NGS diet assessment, examine how far NGS can deliver the gains in speed, accuracy and quantification that it makes theoretically possible, and discuss its limitations.
From behavioural observations to NGS diet assessment
Initially, diets were determined mainly by direct observation of feeding, or by microscopic examination of gut contents or faeces. While these approaches have provided much useful information in some cases, direct observation precludes working on small invertebrates, most nocturnal species, anything beneath the soil, under water, hidden or elusive. Microscopic examination also has major limitations. It is labour intensive and depends heavily upon the skills of the person identifying species from masticated, semi-digested fragments of plants and animals (Holechek et al. 1982; Moreby 1988; Ingerson-Mahar 2002; Sunderland et al. 2005). Moreover, it precludes studying the diet of fluid feeders (most invertebrates) and identifying foods that leave no hard remains or simply lack diagnostic taxonomic features.
To overcome these difficulties, ecologists have innovatively adapted a variety of technologies. A suite of techniques based upon prey-specific antibodies have been used to analyse gut contents with some success (reviewed by Boreham & Ohiagu 1978; Symondson 2002). In most cases, these tests were developed as markers for specific target prey species and were best utilized for screening a range of predators consuming a small number of target species.
However, to understand food web dynamics, the whole dietary breadth needs to be measured. Attempts to measure consumption of the whole array of prey in the guts of predators were initially made using a protein electrophoretic approach (e.g. Walrant & Loreau 1995). This worked reasonably well for stenophagous predators (Murray et al. 1989; Solomon et al. 1996) but not for highly generalist predators for which an individual may have several prey species in the gut simultaneously resulting in an uninterpretable superimposed protein banding pattern. Similar problems are encountered when interpreting plant alkane fingerprints to determine herbivore diets (Dove & Mayes 1996). Plant material in the diet has also been analysed using near infrared reflectance spectroscopy (NIRS) (Kaneko & Lawler 2006; Rothman et al. 2009). Calibration of the technique for different plant material and consumers is critical and obtaining constituent identities for complex mixtures of plants difficult, but this approach is best suited for estimating nutritional components such as total nitrogen or starch (Foley et al. 1998).
Studies using stable isotopes can tell us a lot about energy flow through (e.g. Ponsard & Arditi 2000) and between ecosystems (e.g. Gratton et al. 2008), or simply dietary shifts (e.g. Ostrom et al. 1997). Such work provides information on long-term diets, rather than simply components within a recent meal (reviewed by Sunderland et al. 2005). However, isotopic enrichment is subject to many variables (Vanderklift & Ponsard 2003; Lecomte et al. 2011), and the isotopic signatures of different food items are often not great enough, or consistent enough, for clear resolution of trophic links (Traugott et al. 2007). This approach has the strong advantage of being integrative, but it can be replaced or complemented by DNA-based methods when a precise identification of the taxa consumed in the last meals is needed.
An early DNA approach to look at dietary diversity was DNA profiling through amplification of the gut contents using general or group-specific primers, followed by temperature or denaturing gradient gel electrophoresis (TGGE or DGGE) (Deagle et al. 2005; Harper et al. 2006). Problems with cryptic bands and haplotypic variation (Lessa & Applebaum 1993) make interpretation difficult. However, where diversity in the diet per se is the object, rather than component identity, this approach has merit, as it has in the study of microbial community diversity (e.g. Muyzer et al. 1993; Felske et al. 1998). DGGE/TGGE has also been used as a separation tool, followed by sequencing (Martin et al. 2006; Tollit et al. 2009), but the technique could show significant amplification bias, and more robust approaches have been developed.
Recent DNA-based approaches
The last 12 years have witnessed the development of two main strands in the field of DNA-based dietary analysis, both of which use PCR to amplify DNA present in dietary samples to obtain enough material for subsequent analyses. One of these is a targeted approach using specific PCR primers designed to examine predation on one or a few specific prey or prey groups (see Box 1; reviewed by King et al. 2008a). It was first used by Asahida et al. (1997) to examine predation by sand shrimps (Crangon affinis) on stone flounder (Kareius bicoloratus). This approach has proved to be useful for studying predator–prey systems as divergent as whale sharks feeding on larval crabs (Meekan et al. 2009) down to Collembola feeding on nematodes (Read et al. 2006). Efforts have also been made to combine many taxon-specific fluorescence-labelled primers to look at the range of prey consumed through multiplex PCR followed by fragment analysis (Harper et al. 2005; King et al. 2010a,b). However, even with multiplexing only the limited range of prey for which the primers were designed can be detected.
Box 1. Place of next generation sequencing analyses among DNA-based methods for dietary studies
The potential for DNA to be used as a dietary biomarker was first highlighted 20 years ago (Höss et al. 1992) and since then a range of DNA-based approaches have been applied to identify food remains in stomach contents and faeces. These studies can be categorized based on: (i) type of sample analysed, (ii) whether diagnostic PCR or DNA-barcoding (i.e. sequencing) is used for identification and (iii) whether quantitative data is sought from individual samples [Fig. 1; following Darling & Blum (2007)]. High-throughput sequencing has substantially enhanced diet studies using the DNA barcoding approach (Fig. 1; categories 5 and 6) by removing the need to clone PCR products, and by drastically reducing the unit cost of sequencing. This has transformed dietary DNA barcoding from being an approach limited in scale, due to costs and technical difficulty, into an approach that provides a wealth of data from each sample and is cost-effective for large diet studies. One of the consequences of having large data sets with counts of sequences from food items is the potentially misleading ability to infer a quantitative picture of diet (see main text for further discussion).
The second strand has been based upon the amplification of DNA using general or group-specific primers, followed by the cloning and sequencing of amplicons to identify individual taxa (see Fig. 1 and Box 1). This dietary barcoding approach has been used to analyse the diets of extinct mammals from coprolites (Poinar et al. 1998, 2001; Hofreiter et al. 2000, 2003), the stomach contents of an ancient human (Rollo et al. 2002), and in several contemporary ecological studies that analysed predation and herbivory. They include diet analysis of passerine birds (Sutherland 2000), whales (Jarman et al. 2004), marine amphipods and bivalves (Blankenship & Yayanos 2005), penguins (Deagle et al. 2007), dolphins (Dunshea 2009), bats (Zeale et al. 2011), primates (Bradley et al. 2007) and krill (Passmore et al. 2006). The cloning step can be skipped in studies where DNA can be extracted from particulate remains in faeces and directly sequenced (Clare et al. 2009, 2011). The cloning approach can detect any dietary components by using suitably general primers, without any need to predict target prey species. However, this is a labour-intensive approach, requiring sequencing of many clones. Thus, it is not well suited to the mass screening that is generally required to obtain a comprehensive analysis of animal diets.
Because of the constraints linked to the simultaneous characterization of broad ranges of taxa from environmental samples, the DNA regions defined as standard by the Consortium for the Barcoding of Life (CBoL, http://www.barcodeoflife.org/) are not always the best suited, leading to the choice of other DNA barcodes (sensu lato, Valentini et al. 2009b) and primers to amplify them for reliable NGS diet analyses. This choice should be determined by the question being addressed and knowledge of the ecology and biology of the study species (see Fig. 2). The range of likely consumed species that can have their DNA amplified by the barcode primers (i.e. the taxonomic coverage) is the primary consideration, so studies of herbivorous animals, for example, require barcoding primers that amplify DNA from a wide range of plants (e.g. Valentini et al. 2009a). The taxonomic resolution (i.e. resolution capacity) of the barcode region also needs to be considered as the saturation of the marker should be avoided, and some barcode regions will only identify taxa above the species level. Looking for highly conserved primers to increase the taxonomic coverage may favour less variable DNA regions with, concomitantly, a low-resolution capacity, which may not be sufficient for answering all dietary questions. For example, primer sets with binding sites that are invariant among eukaryotes (e.g. LSU rDNA) will often amplify less variable regions than those found in mitochondria or chloroplasts, although some have good enough resolution for answering many questions (e.g. Sonnenberg et al. 2007). Conversely, designing short broad coverage barcodes in variable regions such as COI used for animal barcoding may be difficult, because potential priming sites would not be conserved enough over a broad range of taxa. This is illustrated by the Uni-minibar barcode targeting COI (Meusnier et al. 2008) evaluated for vertebrates. About 60% of the taxa potentially amplified differ at each priming site by 4–8 mismatches, probably leading to a failure or a low rate of amplification (Ficetola et al. 2010).
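The mismatch criterion behind such in silico evaluations can be illustrated with a minimal Python sketch. It is not the ecoPCR program: the function names and the example sequences are hypothetical, degenerate (IUPAC) bases are not handled, and the default tolerance (up to three mismatches per primer, none in the last three bases of the 3′ end) simply follows the settings reported in the footnote to Table 1.

```python
def count_mismatches(primer: str, site: str) -> int:
    """Number of positions at which a primer and a candidate binding site differ."""
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return sum(1 for p, s in zip(primer.upper(), site.upper()) if p != s)


def primer_matches(primer: str, site: str, max_mismatches: int = 3,
                   protected_3prime: int = 3) -> bool:
    """Accept a site if it has at most `max_mismatches` mismatches and the
    last `protected_3prime` bases (the primer 3' end) match exactly."""
    primer, site = primer.upper(), site.upper()
    if count_mismatches(primer, site) > max_mismatches:
        return False
    return primer[-protected_3prime:] == site[-protected_3prime:]


def find_binding_sites(primer: str, template: str, **kwargs) -> list:
    """Slide the primer along the template and return the start positions
    of every window accepted by `primer_matches`."""
    k = len(primer)
    return [i for i in range(len(template) - k + 1)
            if primer_matches(primer, template[i:i + k], **kwargs)]


# Hypothetical primer and template: the site at position 4 carries one
# internal mismatch and an intact 3' end, so it is accepted.
primer = "ACGTACGTAC"
template = "TTTT" + "ACGTTCGTAC" + "GGGG"
sites = find_binding_sites(primer, template)
```

Under this tolerance, a priming site carrying 4–8 mismatches would be rejected, which is consistent with the expectation of failed or weak amplification for such taxa.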
It may be useful to employ a DNA barcode with a broad taxonomic coverage but low resolution, combined with other DNA barcodes that resolve some of the higher taxonomic units to species level, a strategy sometimes called hierarchical barcoding (Moszczynska et al. 2009). For example, in a study of seal diet, Deagle et al. (2009) added to the information provided by a low-resolution primer set (which identified bilaterian animals to family), by using group-specific primer sets that resolved key groups (i.e. bony fish and squid) to the species level. This provided a general overview of seal diet while allowing species level identification of the most important prey.
The amplification efficiency of a DNA barcoding primer set is also important in ensuring that most dietary samples analysed will produce sequences. Even when primers work on target DNA templates in isolation, amplification can be highly skewed within a multitemplate PCR. For example, in a study on penguin diet, Deagle et al. (2007) found that krill was the prevalent prey item (based on stomach contents and species-specific PCR tests carried out on faecal DNA); however, amplicons generated from faeces using primers that were relatively conserved among prey species recovered only fish and squid sequences. In this case, a minor primer mismatch in the krill sequence led to extensive under-representation of krill as squid and fish were preferentially amplified. Sequence variation in primer-binding sites such as this can strongly bias the pool of sequences generated in a PCR. In some situations, this variation in the primer-binding site may be used to help reduce predator DNA amplification (Deagle et al. 2007), but this may lead to a bias in representation of food species proportions.
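The way a small difference in amplification efficiency skews a multitemplate PCR can be illustrated with a toy exponential model, in which the copy number of each template is multiplied by (1 + efficiency) at every cycle. The model, the efficiency values and the starting copy numbers below are assumptions chosen for illustration, not measurements from the penguin study; real reactions also reach a plateau and are subject to stochastic drift.

```python
def amplicon_proportions(initial: dict, efficiency: dict, cycles: int) -> dict:
    """Expected share of each template after `cycles` of exponential PCR.

    initial    -- taxon -> starting number of template copies
    efficiency -- taxon -> per-cycle efficiency (0 = no amplification, 1 = doubling)
    """
    final = {taxon: copies * (1.0 + efficiency[taxon]) ** cycles
             for taxon, copies in initial.items()}
    total = sum(final.values())
    return {taxon: copies / total for taxon, copies in final.items()}


# Hypothetical sample: krill dominates the template (90%) but a primer
# mismatch is assumed to lower its per-cycle efficiency.
initial = {"krill": 900, "fish": 50, "squid": 50}
efficiency = {"krill": 0.60, "fish": 0.95, "squid": 0.95}
shares = amplicon_proportions(initial, efficiency, cycles=35)

for taxon in sorted(shares):
    print(f"{taxon}: {shares[taxon]:.3f}")
```

In this hypothetical case, krill falls from 90% of the starting template to about 1% of the final amplicons, whereas equal efficiencies would leave the starting proportions unchanged.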
DNA degradation in dietary samples limits the length of fragments that can be successfully amplified by PCR (Symondson 2002; Deagle et al. 2006; Troedsson et al. 2009). For this reason, the length of barcoding regions used for dietary analysis is generally in the 100–250 bp range, which inevitably reduces taxonomic resolution. It is possible in some situations to use barcodes as long as 585 bp (Juen & Traugott 2005), 650 bp (the COI marker defined by Folmer et al. 1994; Blankenship & Yayanos 2005) or even more when DNA is extracted from intact prey remains (Clare et al. 2009, 2011). However, in many cases, such a long marker will reduce the success of the amplification so that sequences will not be recovered from many samples. Minimal variation in the barcode size among food species is also required to prevent preferential long fragment dropout during PCR (Pompanon et al. 2005) and biases because of differences in DNA copy number for long vs. short fragments in degraded samples (Deagle et al. 2006). Rayé et al. (2011) suggest that the short trnL fragment size of Helianthemum nummularium relative to other plant species could lead to its preferential amplification and thus explain its high proportion in their data set. DNA degradation also means that markers that occur in multiple copies per cell, such as mitochondrial and chloroplastic DNA regions or regions of nuclear ribosomal RNA genes, are usually used. While this improves the sensitivity of the PCR assays for detecting food species DNA, it does mean that there may be more variation in DNA:biomass ratios among species as copy numbers of these regions may vary extensively between food species.
The quality and coverage of databases of DNA barcodes available for NGS diet studies is often a major factor in choosing a barcoding primer set (Fig. 2). Where there is not a good collection of potential food DNA available, making a customized database is possible (e.g. Deagle et al. 2009; Rayé et al. 2011), but this may not be common for predation. The most extensive database is undoubtedly that for the IBOL (http://www.barcodeoflife.org/) but for animals, this has largely been generated for a mitochondrial region that is too large to amplify reliably from diet samples in most cases and has a variable primer-binding site among species. Primers that target a shorter portion of this region have been developed for identifying samples with degraded DNA (Meusnier et al. 2008) but may have a restricted range of application. However, the region has been used in some dietary work targeting particular groups of prey (e.g. Zeale et al. 2011). The choice of plant barcoding regions used by CBoL, matK and rbcL, is unfortunately also different from those regions that have been most successfully used in DNA diet work, such as the trnL (UAA) intron (Valentini et al. 2009a). Despite all of the necessary compromises, a range of useful DNA metabarcoding primer sets is available for NGS diet assessment (Table 1).
Table 1. Main DNA barcoding primer sets for next generation sequencing (NGS) diet studies. The table references the main primer sets already used in DNA-based diet studies that are suitable for NGS diet assessment. The table is not exhaustive, and priority was given to primer sets with the broadest taxonomic coverage according to their ecological interest. However, more group-specific primers can be useful, such as those developed by Corse et al. (2010) for studying the diet of freshwater organisms
Table 1 columns include: Amplicon length†‡ (bp); Size of public database (no. of accessions and species in EMBL)†§
*Taxonomic range is given according to cited publications and gives an indication of the range of taxa for which primers have been designed. However, within this range, some groups of species may not be amplified and/or the range of taxa amplified may be broader.
†Parameters have been estimated by analysing the results of an in silico PCR as described in Ficetola et al. (2010). In silico PCR was performed on the release 107 of the EMBL public database (May 2011) and retained 10–1000 bp amplicons from accessions containing the priming sites with up to three mismatches per primer (except on the last three bases of the primer 3′ end). In silico PCRs were restricted to Viridiplantae (taxid 33090) or Eumetazoa (taxid 6072) without Homo sapiens (taxid 9606) for plant and animal barcode primers, respectively. Primers with a broader taxonomic coverage (i.e. amplifying LSU rDNA) were tested on both groups.
‡Amplicon length is given without primers (range given for 99.9% of the sequences). Amplicons longer than 150–200 bp may not be adequate for degraded templates.
§The size of reference data sets is estimated by the number of accessions and species measured using EcoPCR and ecoTaxSpecificity, respectively (Ficetola et al. 2010). It gives an idea of the availability of reference sequence in public databases but does not prejudge either their quality or the relevance of building a customized database.
¶Taxonomic resolution is estimated as the percentage of taxa unambiguously identified for a given taxonomic level, using ecoTaxSpecificity (Ficetola et al. 2010). The estimation obtained from the whole EMBL data set is indicative and reflects the level of identification of the sequences from the public database. Actually, it strongly varies according to the reference data set and thus to the potential food species. The estimations of taxonomic resolution are not strictly comparable because they are not calculated on the same data set. Moreover, these values are under-estimated, because of errors in the database, such as wrong taxonomic assignation. Thus, we strongly recommend testing the performance of the primer pairs in the context of the study using dedicated bioinformatic tools (e.g. Ficetola et al. 2010; see text).
A problem encountered when designing primers with a broad taxonomic coverage is the possible amplification of undesirable species. For example, broad coverage primers that amplify all eukaryotes have been used in DNA diet assessment (Martin et al. 2006; Riemann et al. 2010), but this strategy will produce amplicons from the predator and from gut parasites and symbionts. The amplification of DNA from the predator can be particularly problematic because it is often more abundant than the prey templates (e.g. Deagle et al. 2006) and can prevent their detection. Several solutions have been devised to deal with this difficulty. One set of approaches is based on the targeted removal of undesired dominant DNA templates. Digestion of predator DNA by endonuclease activity prior to (Dunshea 2009) and following (Blankenship & Yayanos 2005) PCR with broad coverage primers has had limited success. A promising alternative, termed ‘suicide polymerase endonuclease restriction’ (SuPER) (Green & Minz 2005), involves the design of a target-specific PCR primer and identification of a restriction site within the amplicon. During a joint PCR and endonuclease digest, only double-stranded target DNA is cut, leaving single stranded rare DNA templates intact. A requirement of this method, however, is finding enzymes capable of working at a high enough temperature to maintain PCR primer stringency. Another strategy is to block the amplification of predator DNA (reviewed by Vestheim et al. 2011). This can be accomplished using an artificially synthesized DNA analogue, such as peptide nucleic acid (PNA, Ørum et al. 1993) or locked nucleic acid (LNA, Dominguez & Kolodney 2005) designed to attach somewhere along the target predator DNA fragment and prevent DNA polymerase from extending the broad coverage PCR primers along the entire length of the fragment. The use of PNA in such ‘PCR clamping’ (Ørum et al. 1993) has been applied successfully to dietary analysis (Chow et al. 2011).
A cost-effective alternative, based on the same principle, is the use of an oligonucleotide modified at the 3′ end to prevent polymerase extension, with the 3′ hydroxyl group replaced by either a phosphate group (Carlson et al. 2003), a reversed (3′ to 5′) nucleotide (Corless et al. 2006), or a spacer-C3-CPG (Vestheim & Jarman 2008). The spacer-C3-CPG modification has been successfully applied in dietary studies (Vestheim & Jarman 2008; Deagle et al. 2009). Vestheim & Jarman (2008) found that a 10-fold excess of spacer-C3-CPG-blocking oligonucleotide compared with PCR primers was sufficient to reduce dominant predator amplicons from 100% to as low as 2.2% and even further at higher concentrations. The approach has been found to be efficient when designed to overlap with the primer-binding site, thus competitively preventing PCR primers from annealing [von Wintzingerode et al. 1997; Peano et al. 2005; Vestheim & Jarman 2008; Shehzad et al. 2012]. As it may be challenging to find target-specific differences directly flanking the PCR primer sites, a long dual priming oligonucleotide (DPO) (Chun et al. 2007) may be used to effectively extend the region flanking the primer site in which a suitable blocking oligonucleotide site can be found (Vestheim & Jarman 2008). A DPO contains two separate binding regions with distinct annealing properties connected by a polydeoxyinosine linker and, when modified with a C3 spacer, can be as effective as a short blocking oligonucleotide (Chun et al. 2007; Vestheim & Jarman 2008). Thus, when blocking of non-target amplification is required, the barcode region chosen is ideally sufficiently variable in a region near the conserved PCR primer-binding sites for designing a blocking oligonucleotide that targets predator but not prey DNA.
Therefore, while an ideal DNA barcode primer set should consistently amplify a short and unique DNA fragment already recorded in a well-curated database with equal efficiency from all food species, all primer sets developed so far compromise on some aspects of this situation. Finding the best compromise, which has long been empirical, may now go through comparative evaluation of several barcodes in silico (Fig. 2). New bioinformatic tools have been developed for estimating the resolution capacity of DNA barcodes (proportion of taxa unambiguously identified) and the taxonomic coverage of their associated primers (proportion of species amplified in a given range of taxa) from a set of reference sequences such as complete mitochondrial genomes for vertebrates (Ficetola et al. 2010) or internal transcribed spacers (ITS) for fungi (Bellemain et al. 2010). Based on this approach, designing the best-suited barcodes and associated primers for a given range of food species is now possible (Riaz et al. 2011).
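The two quantities defined above can be made concrete with a short Python sketch. It follows only the verbal definitions given in the text (proportion of species amplified, and proportion of amplified taxa unambiguously identified); it is not the implementation used by the cited bioinformatic tools, and the function names, taxon labels and sequences are hypothetical.

```python
from collections import defaultdict


def taxonomic_coverage(amplified: set, target_species: set) -> float:
    """Proportion of the target species whose DNA is amplified by the primers."""
    if not target_species:
        raise ValueError("target_species must not be empty")
    return len(amplified & target_species) / len(target_species)


def resolution_capacity(barcodes: dict) -> float:
    """Proportion of amplified taxa that are unambiguously identified.

    barcodes -- taxon -> set of barcode sequences recorded for that taxon.
    A taxon is unambiguous when none of its sequences occurs in another taxon.
    """
    if not barcodes:
        raise ValueError("barcodes must not be empty")
    owners = defaultdict(set)
    for taxon, sequences in barcodes.items():
        for seq in sequences:
            owners[seq].add(taxon)
    unambiguous = [taxon for taxon, sequences in barcodes.items()
                   if sequences and all(len(owners[s]) == 1 for s in sequences)]
    return len(unambiguous) / len(barcodes)


# Hypothetical reference set: five candidate food species, four of which
# are amplified; two of the amplified species share the same barcode.
target = {"sp_A", "sp_B", "sp_C", "sp_D", "sp_E"}
barcodes = {
    "sp_A": {"ACGT"},
    "sp_B": {"ACGA"},
    "sp_C": {"ACGG"},
    "sp_D": {"ACGG"},
}
coverage = taxonomic_coverage(set(barcodes), target)    # 4 of 5 species
resolution = resolution_capacity(barcodes)              # 2 of 4 taxa
```

Comparing these two values across candidate primer sets, on a reference set restricted to the likely food species, is one simple way to frame the compromise between coverage and resolution discussed here.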
King et al. (2008a) provide a detailed review of technical issues related to molecular diagnostics, from sampling in the field through to DNA extraction, primer design, PCR and laboratory procedures. Although this review concentrated upon predation studies, very similar issues are applicable to measurement of herbivory. Our intention here, however, is to focus on experimental design, sampling, preparation and pooling of samples (see Fig. 2), tagging of primers, and quantitative issues particularly relevant to NGS.
Sampling design in NGS diet studies must take several factors into consideration if the expectation/hope is that quantitative data on consumption of each dietary component (sequence counts for each taxon) could be built into food webs, providing population-level information directly, rather than through the sums of individual trophic links that are represented in conventional webs (see ‘Quantitative aspects’ below). Owing to sampling effects, the more individual consumers included in each pooled sample, the better the proportions of different sequences generated will reflect the diet of the population as a whole. Similarly, at the single sample level, Deagle et al. (2005) showed that DNA subsampled from different parts of the same sea lion faecal mass generated different dietary profiles, reflecting successive meals. Blending of the whole faecal mass ensures that the full dietary range pertaining to that mass is properly represented within subsamples taken for DNA extraction (Deagle et al. 2005; Matejusova et al. 2008). However, additional handling during blending might increase the contamination risk. Alternatively, subsamples can be taken from different parts of the faecal mass, pooled, homogenized and resampled (Deagle et al. 2009; Kowalczyk et al. 2011). Conversely, where predators produce many small faecal pellets, several faeces may be blended together and resampled (e.g. Deagle et al. 2010). A potential disadvantage of this approach is that individual diet samples may vary in food DNA content (because of age, size, etc.), resulting in some contributing disproportionately to the pooled DNA template. Taking multiple subsamples from each blend, whether derived from a single or several faeces, increases the chances of detecting all prey. In the same way, we recommend performing multiple amplifications per subsample (Taberlet et al. 2012) to cope with possible PCR drift.
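The benefit of taking several subsamples or running several amplifications can be approximated with a simple probability argument. The sketch below assumes that each replicate detects a given food item independently and with the same probability, which is an idealization; the function name and the example probability are hypothetical.

```python
def detection_probability(p_single: float, replicates: int) -> float:
    """Probability that a food item is detected in at least one replicate,
    assuming independent replicates with equal detection probability."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie between 0 and 1")
    if replicates < 0:
        raise ValueError("replicates must not be negative")
    return 1.0 - (1.0 - p_single) ** replicates


# Hypothetical rare prey detected in half of single amplifications.
for n in (1, 2, 3, 5):
    print(n, round(detection_probability(0.5, n), 3))
```

Under these assumptions, an item detected in half of single amplifications would be detected in at least one of three replicates with probability 0.875.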
Once a suitable barcoding primer set has been chosen and samples collected, the production of sequence data involves PCR amplification and carrying out NGS. Even when the diversity of DNA in a dietary sample corresponds to dozens of species, individual samples can be characterized by a few hundred to a few thousand sequences. This means that, even with reduced-throughput NGS platforms, amplicons from several hundred samples can be pooled and sequenced together in a single run. Ideally, uniquely tagged primers are used in the PCR to amplify DNA from each sample. These tags then act as identifiers to recover data from each sample post-sequencing using a bioinformatics approach (see the study by Binladen et al. 2007; Meyer et al. 2007; Coissac 2012). Tags can be used to sort sequences by individual consumer (Pegard et al. 2009; Soininen et al. 2009; Valentini et al. 2009a; Kowalczyk et al. 2011) or to separate ‘treatments’, thus providing an average view of diet for each uniquely tagged group. For example, tags may be used to create sequence libraries for a consumer species feeding in different habitats or locations (Deagle et al. 2009; Brown 2011), or at different times of year, to compare adults with juveniles or to investigate differences between the diets of each sex separately. Once PCR is completed, the construction of amplicon pools is in itself an important step and usually involves normalization of DNA concentrations of the individual amplicons before mixing (see the study by Harris et al. 2010 for possible methods).
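The recovery of per-sample data from a pooled run can be sketched as follows. This is a deliberately simplified illustration rather than a published pipeline: it assumes a fixed-length tag at the 5′ end of each read and exact tag matching, and the tags, sample names and reads are hypothetical. Tag sets designed to tolerate or detect sequencing errors would need a more elaborate matching step.

```python
from collections import defaultdict


def demultiplex(reads: list, tag_to_sample: dict, tag_length: int) -> dict:
    """Sort reads by the tag at their 5' end and strip the tag.

    Reads whose leading `tag_length` bases match no known tag are kept
    under the key None instead of being assigned to a sample.
    """
    by_sample = defaultdict(list)
    for read in reads:
        tag = read[:tag_length].upper()
        insert = read[tag_length:]
        by_sample[tag_to_sample.get(tag)].append(insert)
    return dict(by_sample)


# Hypothetical 4-base tags and reads from a pooled run.
tags = {"ACGT": "consumer_1", "TGCA": "consumer_2"}
reads = ["ACGTGGGATTC", "TGCAGGGATAC", "ACGTGGGATTC", "NNNNGGGATTC"]
sorted_reads = demultiplex(reads, tags, tag_length=4)
```

The same function serves whether a tag denotes an individual consumer or a pooled 'treatment'; only the mapping from tags to labels changes.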
Exploratory NGS followed by targeted analysis
Many ecological studies require data on the diets of individuals rather than populations, and statistical comparisons are only possible where there is sufficient replication. When large numbers of individuals are analysed separately, data may be broken down to examine the effects of multiple variables upon predation or herbivory (e.g. consumer age, sex, the time of year, temperature, spatial factors, etc.). Currently it is feasible, but not considered practical (or economic) to use separate tags for potentially hundreds of individual predators. However, a two-stage process is possible (Brown 2011). Many of the prey consumed by a generalist predator, for example, may be too rarely consumed to be of any significance in terms of energetics, food web dynamics or predator behaviour. Thus, identifying the main food sources might be adequate for modelling the primary dynamics within a food web, and NGS is good at identifying the major species in the diet, those that are primarily responsible for sustaining populations of consumers. However, identifying rare prey species may also be meaningful, for example when studying the impact of predation on threatened species. Species-specific primers can subsequently be designed that target these critical species, followed by mass screening of individuals, using multiplexing and fragment analysis to make the task more efficient (Harper et al. 2005; King et al. 2010b) or real-time PCR where quantification is advantageous and/or sensitivity improved (Zhang et al. 2007; Lundgren et al. 2009; Schmidt et al. 2009; Weber & Lundgren 2009). However, as the price of NGS is decreasing, and as the number of sequence reads per experiment is continuously increasing, such a two-step approach might not be justified in the near future.
The first DNA-based diet studies applying NGS used the pyrosequencing technology of Roche/454 Life Sciences (Margulies et al. 2005). This was the first commercially available platform and has an advantage over current Illumina/Solexa and AB SOLiD/Agencourt technologies because of its longer read length (Hudson 2008). However, technologies with shorter read lengths may be adequate for studying the shortest barcodes. For example, the Illumina HiSeq paired-end read length (i.e. 100 bp) is compatible with the trnL approach (Valentini et al. 2009a). Another advantage of the 454 technology is, ironically, its lower sequencing capacity. Indeed, many of the newest sequencing platforms are aimed at genome resequencing or de novo genome assembly, and their capacity is higher than required for most directed amplicon sequencing projects, leading to an unnecessary increase in both the time needed for analysis and the data storage capacity required. However, their sequencing power (e.g. the 50 Mb produced in a HiSeq lane) remains compatible with analysing multiplexes of barcodes for hundreds or even thousands of samples. Thus, the choice of NGS technology may depend on several parameters such as the barcode length, the number of barcodes used and the sample size (see Fig. 2). A more complete description of the features and biases of the different NGS platforms can be found elsewhere (Glenn 2011).
A number of new instruments that cater to ‘small-scale’ amplicon sequencing projects are starting to be produced. Roche’s GS Junior System uses pyrosequencing technology to produce about 70 000 amplicon sequences of about 450 bp in length (http://my454.com) for a consumable cost of 1300 €. Life Technologies’ Ion Torrent Personal Genome Machine will initially generate 100 000 reads per run with a length of 100 bp at a consumable cost of 500 € per run (Perkel 2011). The Illumina MiSeq sequencing system (http://www.illumina.com/systems/miseq.ilmn) can produce more than 3 400 000 paired-end reads of 150 bp for a consumable cost of 600 €. Currently, these types of systems are the best adapted for NGS diet assessment because they combine sufficient capacity to analyse large sample sizes with lower per-run costs and reduced computing requirements. However, given the frequent upgrades and evolution of the technologies, these specifications will quickly become obsolete. Therefore, even laboratories planning to routinely conduct NGS diet analyses may consider outsourcing high-throughput sequencing to companies that can follow the latest technological developments and propose the best adapted service currently available.
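As a rough illustration of the capacity argument, the number of tagged samples that fit in one run can be estimated by dividing the usable reads per run by the reads wanted per sample. The run sizes are those quoted above; the target of 1000 reads per sample and the 50% usable fraction are assumptions chosen for illustration only.

```python
def samples_per_run(reads_per_run, reads_per_sample, usable_fraction=0.5):
    """Estimate how many tagged samples one sequencing run can hold.

    usable_fraction is an assumed share of reads that survive filtering.
    """
    usable_reads = int(reads_per_run * usable_fraction)
    return usable_reads // reads_per_sample


# Reads per run as quoted in the text for each instrument.
platforms = {"GS Junior": 70_000, "Ion Torrent PGM": 100_000, "MiSeq": 3_400_000}
for name, reads in platforms.items():
    print(name, samples_per_run(reads, reads_per_sample=1000))
```

Changing reads_per_sample or usable_fraction shows how quickly the estimate moves, which is one reason such figures should be treated as planning aids and not as guarantees.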
As for other environmental sequencing experiments, the benefit of using NGS resides in the ability to analyse multiplexed PCR products obtained using tagged primers with a broad taxonomic coverage. The output consists of DNA sequences (i.e. reads) and corresponding quality values (for each nucleotide of each sequence read). Primary analysis consists of (i) discarding sequences with errors, (ii) sorting the remaining sequences according to their tag and (iii) clustering them and/or assigning them to a taxon. As for all studies generating DNA sequences, there is a need for public access to the data produced in NGS environmental studies. One solution is to deposit unique sequences in DRYAD (http://datadryad.org/).
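Step (i) of the primary analysis can be sketched as a filter on the per-nucleotide quality values. The two thresholds below (a minimum mean quality and a minimum per-base quality) are assumed values for illustration, not recommendations from the studies cited here.

```python
def passes_quality(qualities, min_mean=25, min_base=10):
    """Return True if a read's per-base quality values meet both thresholds."""
    if not qualities:
        return False
    mean_q = sum(qualities) / len(qualities)
    return mean_q >= min_mean and min(qualities) >= min_base


def filter_reads(reads):
    """Keep (sequence, qualities) pairs that pass the quality check."""
    return [(seq, q) for seq, q in reads if passes_quality(q)]
```

Reads retained by filter_reads would then be sorted by tag and clustered or assigned to a taxon, as in steps (ii) and (iii).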
Dealing with errors
Erroneous reads may affect the analysis in two ways. First, an error occurring on the tag may lead to assignment of a sequence to the wrong sample group. We can overcome this by using an appropriate minimum number of differences among tags to maintain the ability to correctly assign a sequence to a sample even when errors occur on the tag sequence (Coissac 2012). Second, errors affecting the barcode sequence between primers may lead to taxon misidentification. These errors may occur during the sequencing process. For example, 454 technology is known for its low reliability during homopolymer extension (Huse et al. 2007). Such read errors may be identified through low-quality values. However, analysing PCR products from a unique sequence template shows that most errors result from degradation of template DNA or nucleotide misincorporation during PCR, that is, before the sequencing step, and are not associated with low-quality values. This is why previous studies have been conservative, discarding rare sequences, as well as more frequent sequences that were close to a very common sequence (Valentini et al. 2009a). Indeed, the strategies set up for removing errors are based on the assumption that the correct sequence is the most common one, that is, that the error is rare and occurs late during the replication process. However, nucleotide misincorporation may sometimes occur during an early step in the PCR, leading to erroneous sequences predominating. A conservative approach would therefore be to discard a frequent sequence that occurs in a single sample when the aim is to describe the diet at a population or group level, or to replicate PCRs for each sample (i.e. the multitube approach) when the aim is to describe differences among individuals precisely. Another common problem leading to erroneous sequences is the occurrence of PCR-generated chimeras (Qiu et al. 2001), which is promoted when amplifying degraded DNA (Pääbo et al. 1990).
Such chimeric sequences can be managed during the taxon assignation step by assigning the sequence to a higher taxonomic rank that includes both taxa composing the chimera, or by leaving it unassigned if the taxa are too distant from anything in the database (Soininen et al. 2009).
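The idea of keeping a minimum number of differences among tags can be sketched with a Hamming distance check. The tag set below is hypothetical, and the sketch considers substitutions only; insertions and deletions, such as homopolymer errors, would need an edit distance instead.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length tags differ."""
    if len(a) != len(b):
        raise ValueError("tags must have the same length")
    return sum(x != y for x, y in zip(a, b))


def min_tag_distance(tags):
    """Smallest pairwise Hamming distance within a tag set."""
    return min(hamming(a, b) for a, b in combinations(tags, 2))


def assign_tag(observed, tags, max_errors=1):
    """Return the single tag within max_errors substitutions, else None."""
    hits = [t for t in tags if hamming(observed, t) <= max_errors]
    return hits[0] if len(hits) == 1 else None


# Hypothetical tag set, for illustration only.
EXAMPLE_TAGS = ["AACCGG", "TTGGCC", "ACGTAC"]
```

With a minimum pairwise distance of d, up to (d - 1) // 2 substitutions per tag can be tolerated without ambiguity, so max_errors should not exceed that value for the tag set in use.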
To date, dealing with errors has mainly involved discarding sequences using more or less arbitrarily chosen thresholds. Several studies have discarded sequences occurring <4 times (Valentini et al. 2009a; Rayé et al. 2011), but there is no general rule, and one may use a threshold relative to the number of sequences per sample. A drawback of removing less frequent sequences is that rare food items could be missed. Protocols used to date may reliably estimate the diet at a population/group level, but we do not yet know how to interpret between-individual differences for taxa represented by low-frequency sequences. However, new methods are being developed that remove the noise from sequence data sets by examining chimeras and by building clusters of sequences that relate erroneous sequences to the correct one (e.g. Quince et al. 2011), and we can predict that more explicit error models and the development of new related methods will decrease the impact of read errors on diet assessment.
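The thresholding described above can be sketched as follows. The default absolute threshold mirrors the practice of discarding sequences occurring fewer than four times; the optional relative threshold, and the function and parameter names, are illustrative assumptions.

```python
from collections import Counter


def filter_rare(sequences, min_count=4, min_fraction=0.0):
    """Keep unique sequences that meet both abundance thresholds in one sample.

    min_count is an absolute read count; min_fraction is relative to the
    total number of reads in the sample.
    """
    counts = Counter(sequences)
    total = sum(counts.values())
    return {
        seq: n
        for seq, n in counts.items()
        if n >= min_count and n / total >= min_fraction
    }
```

Raising either threshold removes more noise but also increases the risk of missing rare food items, which is the trade-off discussed in the text.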
Another critical step in the analysis is the assignment of sequences to a taxon. This is possible through comparison of each sequence to a reference database (e.g. ecoTag, Pegard et al. 2009; Taxonerator, Jones et al. 2011). The reference database may be a subset of public databases and/or a set of sequences specifically produced for the study. Using such a restricted database allows faster processing, avoids inappropriate comparisons and makes it possible to clean the reference data set of errors. An efficient way of selecting the sequences corresponding to the barcode of interest from a public database (e.g. GenBank, EMBL) is to perform an in silico PCR using the barcode primers (Pegard et al. 2009; Soininen et al. 2009). In silico PCR programs such as ecoPCR may allow imperfect matches between each barcode primer and its binding site to mimic in vitro PCR (Ficetola et al. 2010). It is well known that public databases contain sequencing errors and a few wrong taxonomic assignations (Harris 2003). As a consequence, it is important to remove errors when using public data, because the quality (as well as the comprehensiveness) of the reference database determines the reliability and accuracy of taxon identification. When possible, the best solution is to build a customized reference database gathering all potential food items and not only those deduced from a priori knowledge of the diet. Examples of this include the study by Soininen et al. (2009), who studied subarctic voles using a database of 842 widespread and/or ecologically important arctic plant species, and the study by Rayé et al. (2011), who based their study of the diet of chamois from the Bauges Massif on a reference database of 475 plant species collected in this massif. Using a customized database has two major advantages. First, it allows the diet to be analysed with reference to what is actually available and ecologically meaningful. Second, it allows a more accurate taxonomic assignation.
Indeed, a species sharing the same barcode sequence with related species living in different ecosystems or biogeographical regions may be identified at a higher taxonomic level (i.e. genus, family) when using a database built at the worldwide level. However, public databases may be useful for identifying sequences not assigned when using the customized database (Soininen et al. 2009).
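The effect of database scope on taxonomic resolution can be sketched by assigning a barcode to the lowest taxon shared by all reference entries that carry it. The lineages and barcode strings below are placeholders, and exact sequence matching is a simplification; this is not how ecoTag or any other named program is implemented.

```python
def lowest_common_taxon(lineages):
    """Return the longest shared prefix of several root-to-species lineages."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


def assign(sequence, reference):
    """Assign a barcode using a mapping from sequence to matching lineages."""
    lineages = reference.get(sequence)
    if not lineages:
        return None  # sequence absent from this reference database
    return lowest_common_taxon(lineages)


# Placeholder databases: the local one holds a single species for this barcode,
# the worldwide one holds two congeneric species sharing the same barcode.
LOCAL_DB = {"ACGTAA": [("FamilyA", "GenusA", "GenusA sp1")]}
WORLD_DB = {
    "ACGTAA": [
        ("FamilyA", "GenusA", "GenusA sp1"),
        ("FamilyA", "GenusA", "GenusA sp2"),
    ]
}
```

With these placeholder data, the same barcode is resolved to species against LOCAL_DB but only to genus against WORLD_DB, which mirrors the loss of resolution described for worldwide databases.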
Although unrecognized sequences are often simply a result of incomplete barcoding databases, they can also be a source of new discoveries. For example, species may be discovered in the diets of predators or herbivores that are not known to exist in that area, or which may be new to science. This also applies to cloning approaches, but the far greater exploratory power intrinsic to massive parallel sequencing will greatly accelerate the discovery process. Molecular phylogenetics frequently reveals extensive cryptic speciation in sympatry among the relatively few organisms that have been analysed in any detail (e.g. Williams et al. 2006; Finston et al. 2007; Heethoff et al. 2007; King et al. 2008b). Predation on cryptic species or lineages can be studied using specific primers (King et al. 2010b). NGS will provide archives of trophic links that can be mined for proof of trophic interactions in future, when new species designations have been defined and (where possible) sequences matched to morphologies.
Public databases are also useful when building a comprehensive local database is out of reach, such as in tropical forests (Gonzalez et al. 2009). This may lead to a low-resolution taxon assignation (e.g. at the family level) that may be sufficient for estimating the diversity of the diet or the overlap between trophic niches of different species. An anonymous survey (i.e. without taxonomic information) may also be informative. The diversity of a diet can be assessed, for example, using Ecological Clades, which are monophyletic groups containing species sharing precise ecological features (e.g. Corse et al. 2010). Molecular operational taxonomic units (MOTUs), which represent groups defined on a common barcode sequence, are also currently used for assessing ecological diversity when taxon assignation is not possible (see, for example, the studies by Floyd et al. 2002; Blaxter et al. 2005; Bohmann et al. 2011), and dedicated programs exist for building MOTUs (e.g. jMOTU, Jones et al. 2011). The clustering approach that is used to build MOTUs is of critical importance and will have to take into account PCR and sequencing errors.
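One simple way to build MOTUs while absorbing likely PCR and sequencing errors is greedy clustering around the most abundant sequences. The sketch below is an assumption-laden illustration (crude mismatch counting without alignment, a hypothetical threshold of one mismatch) and does not reproduce jMOTU or any other published program.

```python
from collections import Counter


def mismatches(a, b):
    """Count differing positions; any length difference adds to the count."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def cluster_motus(sequences, max_mismatches=1):
    """Greedily merge each sequence into the first more abundant seed nearby."""
    counts = Counter(sequences)
    motus = {}  # seed sequence -> total read count of its cluster
    for seq, n in counts.most_common():
        for seed in motus:
            if mismatches(seq, seed) <= max_mismatches:
                motus[seed] += n
                break
        else:
            # No existing seed is close enough: start a new MOTU.
            motus[seq] = n
    return motus
```

Because seeds are visited in order of decreasing abundance, rare variants close to a common sequence are folded into it, which follows the assumption stated earlier that the correct sequence is usually the most common one.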
Validation of data accuracy
While it is clear that NGS has the potential to be enormously useful in dietary studies (Valentini et al. 2009b), unforeseen biases can influence the recovery of sequences, as in all DNA-based studies (Pompanon et al. 2005). Therefore, it is important to follow the general advice on best practice for DNA-based diet studies (King et al. 2008a) and, whenever possible, to validate steps of the process. We will outline a few points that are particularly relevant for NGS studies.
Validity of sequence data
From sample collection through to DNA extraction and PCR, it is necessary to be extremely careful to avoid cross-contamination. The low amount of target DNA in many dietary samples, combined with the extreme sensitivity of PCR and the ability to recover thousands of sequences per sample, means that even minor contamination will be represented in the final data set. This is clearly illustrated by the recovery of contaminating human DNA sequences (0.5–2.4% of sequences) in amplicon sequencing studies using vertebrate barcoding markers (Binladen et al. 2007; Deagle et al. 2009). In these studies, the human sequences could simply be discounted, but if the contaminant had been target DNA, it would have been difficult (or impossible) to detect and could have produced misleading results. Amplicons from previous experiments are the most potent source of target DNA contamination, necessitating physical separation of laboratory space and equipment used in pre- and post-PCR amplification steps (King et al. 2008b). It is also important to carry out DNA extractions from tissue, and primer testing experiments using target DNA, in isolation from dietary sample processing (see Taberlet et al. 1999; Cooper & Poinar 2000 for a general discussion of laboratory contamination issues). Post-PCR contamination (e.g. between lanes in NGS systems) can be detected more easily than pre-PCR contamination because reads can be assigned to the right sample by looking at the primer and tag sequences. As additional amplicon sequencing studies using NGS are published, it may become apparent that a low level of contamination is inevitable and that all sequences below a certain threshold might best be excluded from the final analysis. To minimize representation of contaminants, Valentini et al. (2009a) recommended no more than 35 cycles during PCRs. Regardless, minimizing the potential for contamination of dietary samples by exogenous DNA is crucial for obtaining accurate diet data.
Contamination originates in many forms and some are ecological. Species unintentionally consumed can be detected in the case of secondary predation where a predator eats another predator that contains prey in its guts (Sheppard et al. 2005). Similarly, vertebrate grazers may unintentionally consume large numbers of invertebrates, and marine species may accidentally consume planktonic components. According to the goal of the study, this information can be considered as contamination (e.g. when studying food preference) or not (when looking at the actual intake).
Beyond contamination, there are several other technical issues that could influence the diversity of sequences in the final data set. Most critical is the ability of the barcoding primers to reliably amplify the intended targets, and this should be taken into account when designing barcode primers, as discussed earlier. However, different sequencing platforms may also preferentially sequence certain amplicons (Dohm et al. 2008) or fail to completely sequence particular amplicons (Deagle et al. 2009), and this may also lead to some targets being unrepresented in the final data set.
The intent of many dietary studies is not simply to analyse diversity, but to obtain quantitative data on the relative amounts of different foods consumed by a species. For this to be possible using NGS, the proportional biomass of foods eaten needs to be reflected in the proportions of the recovered DNA sequences. It makes intuitive sense that if a large biomass of a particular food item was consumed, this would be mirrored in the amount of DNA in the dietary sample and ultimately in the recovered DNA sequences. However, results from many different fields of research have shown that obtaining quantitative data in amplicon sequencing studies is fraught with problems (e.g. von Wintzingerode et al. 1997; Polz & Cavanaugh 1998; Acinas et al. 2005; Porazinska et al. 2009; Amend et al. 2010). The most basic requirement is that the quantitative signature must be retained during all technical steps (see Fig. 3). A clear potential for bias exists during PCR when target DNA is exponentially amplified; even a 2% difference in amplification efficiency between two initially equal targets can lead to a 30% divergence in DNA copy number over 35 cycles. Biases may also appear during DNA extraction (e.g. Martin-Laurent et al. 2001), DNA pooling (e.g. Harris et al. 2010), sequencing (e.g. Porazinska et al. 2010) and during bioinformatic sorting (Amend et al. 2010). Beyond these technical issues, a number of biological features may obscure the quantitative signal (see Fig. 3). These include variation in tissue cell density (and therefore DNA per gram of tissue eaten), inter- and intraspecific variation in gene copy number (particularly relevant to markers for mtDNA, chloroplast DNA and nuclear ribosomal gene clusters) (Prokopowich et al. 2003), differential survival of DNA during digestion (e.g. Deagle & Tollit 2007), and differences in the state of digestion (particularly in stomach contents e.g. Troedsson et al. 2009). 
Finally, it is unclear what effect averaging data from many samples will have on the quantitative estimates. Stochastic variation will be minimized, but systematic biases will likely persist. Pooling a standardized amount of PCR product before NGS will result in each sample having equal weight in the final dietary data set, and this will affect the quantitative data (e.g. a dissected stomach containing a single small prey item would count for as much as a stomach full of many items).
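The PCR bias mentioned above can be made concrete with a small calculation. The sketch assumes an idealized model in which each cycle multiplies copy number by (1 + efficiency), with no plateau phase; the efficiencies of 100% and 98% are one possible reading of a 2% difference, chosen for illustration.

```python
def copy_ratio(eff_a, eff_b, cycles):
    """Ratio of copies of target A to target B after PCR from equal starts.

    Each cycle is assumed to multiply copy number by (1 + efficiency),
    where efficiency lies between 0 and 1.
    """
    return ((1 + eff_a) / (1 + eff_b)) ** cycles


# Two initially equal targets amplified for 35 cycles.
ratio = copy_ratio(1.00, 0.98, 35)
print(f"copy number ratio after 35 cycles: {ratio:.2f}")
```

Under these assumptions, the two targets diverge by roughly 40% after 35 cycles, which is of the same order as the figure quoted in the text; the exact value depends on the baseline efficiency and on how the 2% difference is defined.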
Given this substantial list of obstacles, how can we know whether NGS data accurately reflect diets, either quantitatively or even qualitatively? One possibility is to use alternative diet analysis methods in parallel to help validate the DNA-based data. Soininen et al. (2009) used micro-histological identification of plant fragments in vole stomachs to complement their DNA-based approach. This provided some assurance that the NGS results were reasonable, but the comparison was hindered by the high degree of uncertainty in the micro-histological identification (Soininen et al. 2009). This is likely to be a common problem because DNA-based methods are most usefully applied in situations where other methods of diet analysis are problematic. Deagle et al. (2009) also compared their prey sequence data from fur seal faeces with traditional prey hard-part identification and found good agreement between data sets, increasing confidence in both methods of diet analysis for fur seals. The sequence depth afforded by NGS also allowed Deagle et al. (2009) to sequence multiple mtDNA barcoding markers targeting the same prey species to cross-validate the precision of their data. This is a useful approach to help uncover major marker-specific, or primer-specific, biases (see also Murray et al. 2011). Validation studies based on captive feeding trials with animals fed a known diet will probably provide the best test of the methodology in the long run. Many studies have examined DNA-based approaches for studying diet with captive animals (e.g. Hoogendoorn & Heimpel 2001; Deagle et al. 2005; Foltan et al. 2005; Weber & Lundgren 2009) and a few have specifically focused on quantitative aspects (Deagle & Tollit 2007; Bowles et al. 2011). One study has investigated the accuracy of quantitative diet data from high-throughput amplicon sequencing (Deagle et al. 2010).
In this study on captive penguins, the authors found differences in digestibility of the fish prey and suggested that estimates of diet composition were possible but should be given wide confidence intervals. With these limitations in mind, where possible, it would be prudent to design NGS dietary studies that are comparative and not as dependent on absolute quantification (e.g. measuring spatial or temporal variation in diet).
Our analysis of the potential and pitfalls of the NGS approach to dietary analyses, detailed in this review, was designed to help researchers choose which technique(s) to use before embarking upon a new study. Not all approaches provide the same information, and other methods might be used instead of, or together with, NGS methods. DNA-based methods are especially efficient for determining the taxa eaten, but when an estimation of nutritional components is required (e.g. total nitrogen for herbivores), the use of NIRS may be required (e.g. Kaneko & Lawler 2006; Rothman et al. 2009), although the calibration step is critical. Also, the use of stable isotopes gives integrative information that DNA-based approaches cannot provide. Isotopic data reflect the diet over longer periods of time and give insight into how ingested material is used. A limitation of this approach is the necessity of obtaining prior knowledge of the isotopic signals for different prey types (e.g. Moore & Semmens 2008). A combination of DNA-based and stable isotope analyses has already proven to be effective (Hardy et al. 2010). NIRS (Foley et al. 1998) and micro-histological studies may also be of interest when information on the part of the organism consumed is required. Other DNA-based techniques, such as real-time PCR, may also be preferred to NGS when quantitative data are expected, and may also be used for validating the semi-quantitative aspect of the NGS approach in a given situation (see ‘Quantitative aspects’ section). Classical sequencing approaches, or prey-specific primers, can also complement an NGS approach that provides insufficient resolution for a given group, by discriminating among potentially eaten species within this group [e.g. distinction between wild mouflon and domestic sheep eaten by snow leopards, W. Shehzad, T. M. MacCarthy, F. Pompanon, L. Purevjav, E. Coissac, T. Riaz, P. Taberlet unpublished].
If information on predation or herbivory on a limited number of specific target species is required, then a multiplex-PCR approach is probably quicker, simpler and less expensive (Harper et al. 2005; King et al. 2010a,b; Traugott et al. 2012). However, NGS diet assessment takes advantage of the previous progress in DNA-based methods for diet analysis (King et al. 2008a) and will benefit from the continued development of NGS technologies for characterizing biodiversity in environmental samples (Valentini et al. 2009b). The major advantage of NGS-based techniques is that they are ideal for providing precise taxonomic identification of food items within highly diverse diets, whatever the type of sample and especially for non-invasive and degraded ones, with little effort relative to the large sample sizes that can be analysed. NGS diet assessment has already been applied to study herbivores and carnivores from highly diverse taxa such as insects, spiders, birds, molluscs, mammals and reptiles (Deagle et al. 2009, 2010; Valentini et al. 2009a; Bohmann et al. 2011; Brown 2011; Murray et al. 2011) and can be adapted to analyse the diet of most organisms. Valentini et al. (2009a) showed that NGS diet assessment is especially well suited to studying extremely diverse herbivore diets. The lack of resolution of some barcodes in particular cases can be overcome by the simultaneous use of several barcodes (Deagle et al. 2009; Valentini et al. 2009a; Rayé et al. 2011).
Given that a species can be identified in a complex substrate using broad coverage primers when its DNA represents a low proportion of the target DNA (Pegard et al. 2009), even uncommon food species can be documented. In cases where samples are dominated by non-target DNA that would be co-amplified with food species (e.g. DNA from predator or parasites), the use of blocking oligonucleotides provides a practical route to preferentially amplify the DNA of interest (Vestheim & Jarman 2008). NGS-based methods may also uncover a higher dietary diversity than traditional methods by detecting DNA from prey that leave no hard-parts in the faeces. For example, occasional predation of rays and sharks by the Australian fur seal revealed in a DNA-based study was previously missed in traditional faecal analysis because of the digestion of their cartilaginous skeletons (Deagle et al. 2009). Similarly, Brown et al. (in press) were able to use NGS to analyse predation by lizards on different earthworm species, none of which can be distinguished by microscopic examination of faeces. However, the power of NGS diet assessment is at the same time limited by problems such as amplification of contaminants and PCR errors. Thus, if the sequencing of the PCR products is outsourced to a company or to a sequencing centre, prior template preparation should be carried out in laboratory space dedicated to minimizing contamination and the data analysis should take into account potential for sequence errors. Even with such constraints, all published studies recognize that methods based on NGS reduce the effort necessary for carrying out diet assessment. This is true in comparison with hard-parts analysis of faeces (e.g. Deagle et al. 2009; Soininen et al. 2009) and also with other DNA-based techniques. Thus, the time and money saved can potentially be reallocated advantageously to developing a more sophisticated and extensive sampling strategy and increasing the accuracy of the study. 
While semi-quantitative interpretation of dietary barcoding data is probably the best that can be hoped for, given the difficulties with alternate approaches of diet determination, DNA-based methods may be one of the most accurate approaches available for increasing our understanding of trophic relationships in a diverse range of food webs.
We would like to thank Alice Valentini and Ludovic Gielly for fruitful discussions.
F.P. studies adaptation through population genomics approaches and develops molecular markers for describing biodiversity. B.D. works on the application of DNA-based identification methods in dietary studies and also on stickleback evolutionary genomics. W.O.C.S. leads a research group developing molecular techniques for studying predator–prey interactions. D.B. studies the diet of British reptiles using DNA-based methods. S.J.’s research applies DNA-based techniques and bioinformatics to better understand key aspects of marine animal populations. P.T. is interested in conservation genetics and develops molecular methods for studying non-invasive samples, including the description of biodiversity with environmental DNA.