To manage the forces that affect the levels and distribution of biodiversity, we require the ability to measure biodiversity comprehensively, reliably, repeatedly, and over large scales. Efforts in this direction by ecologists and environmental biologists have so far been impeded by standard survey methodologies that consume large amounts of time, money and taxonomic expertise. As a result, we cannot yet address biodiversity loss as a normal management problem that can be dealt with wherever and whenever it arises. Instead, most biodiversity research remains in the realm of basic science, and even then, scientists are typically forced to rely on proxies (Favreau et al. 2006; Lewandowski, Noss, & Parsons 2010). One long-standing proxy has been to designate a subset of taxa as indicators, some popular ones being butterflies, dung beetles, birds and parasitoid wasps (e.g. Gardner et al. 2008; Anderson et al. 2010).
Proxies might be efficient, but it is a truism in management that we only get what we measure. Schoolteachers evaluated on exam scores have an incentive to ‘teach to the test’, and biological proxies are subject to the same narrowing of perspective. As one example, 19 species of farmland birds have been designated as a biodiversity indicator on UK farmlands (JNCC 2011), the aim being to use birds to indicate overall farmland biodiversity. However, an understandable response has been to ‘teach to the test’ via supplemental winter feeding of farmland birds (Siriwardena et al. 2007; see also Newton 2011). Thus, in addition to proxies, we should tackle the lack of biodiversity information directly.
Here, we describe a way to measure arthropod biodiversity rapidly, reliably, cheaply, comprehensively, over large spatial scales, and in ways that can be audited by third parties, which is a requirement for dispute resolution. The first element is DNA barcoding, in which short gene sequences are used to identify species. The most commonly used barcode for animals is a 658 bp section of the mitochondrial cytochrome c oxidase subunit I gene (mtDNA COI) (Hebert et al. 2003). Other barcode genes have been proposed for plants, protists, and meiofauna (Hollingsworth et al. 2009; Creer et al. 2010; Medinger et al. 2010; Yao et al. 2010). Because sequencing is fast and cheap, the barcode approach potentially provides large amounts of species-level inventory data, making it possible to track and measure biodiversity over space and time (e.g. Janzen et al. 2005; Waugh 2007; Borisenko et al. 2008). However, generating barcodes with Sanger sequencing is inefficient if we want to assign taxonomies to hundreds of thousands of samples, a requirement if we want to measure biodiversity repeatedly and over large spatial scales.
Our protocol therefore includes large-scale trapping for sample acquisition, high-throughput sequencing, and bioinformatic analysis. In short, mass-collected specimens are homogenised (‘souped’), and the genomic DNA is extracted, mass-PCR-amplified for the barcode gene of interest and sequenced on machines that can separate out individual DNA molecules. Bioinformatic tools then process the resulting huge number of sequences down to a data set of manageable size and high-enough quality that is practical for subsequent analysis.
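The data-reduction step can be illustrated with a minimal sketch. It is not the pipeline described in this paper: the function name `dereplicate`, the `min_len` cutoff, and the rule of discarding reads that contain ambiguous bases are all assumptions chosen for illustration. The sketch shows only the general idea of filtering raw reads and collapsing identical sequences into unique sequences with abundance counts.

```python
from collections import Counter

VALID_BASES = set("ACGT")


def dereplicate(reads, min_len=100):
    """Filter reads, then collapse identical reads into (sequence, count) pairs.

    Hypothetical illustration: reads shorter than `min_len` or containing
    anything other than A, C, G, T are dropped before counting.
    """
    kept = []
    for read in reads:
        seq = read.upper()
        if len(seq) >= min_len and set(seq) <= VALID_BASES:
            kept.append(seq)
    # most_common() returns pairs sorted from most to least abundant
    return Counter(kept).most_common()


if __name__ == "__main__":
    example_reads = ["ACGT" * 30, "acgt" * 30, "ACGN" * 30, "AC", "TTGA" * 30]
    for seq, count in dereplicate(example_reads):
        print(count, seq[:12] + "...")
```

In practice, this step would be followed by denoising and clustering, which dedicated software handles far more carefully than exact-match collapsing can.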
Altogether, we call this technique metabarcoding to distinguish it from the broader term metagenetics, which encompasses microbial communities, and from metagenomics, which, in addition, refers to the reconstruction of whole genomes. Finally, environmental barcoding or eDNA is probably best used to refer to the amplification and sequencing of free DNA from soil or water. We note, however, that the terminology is in flux.
Metabarcoding is transforming ecology (Creer 2010; Creer et al. 2010; Bik et al. 2012), especially the study of cryptic biodiversity. Recently, Fonseca et al. (2010) compared marine meiofauna (metazoans between 45 and 500 μm long) across beaches in the UK; Porazinska et al. (2010) compared nematode diversity in different rainforest microhabitats in Costa Rica; and Nolte et al. (2010) compared protist diversity across seasons in a lake in Austria. Nolte et al. further showed that for one genus, Spumella, species from the clade that is typically found in cold habitats are more abundant in cold months, whereas species belonging to warm-climate clades are more abundant in the summer. In these systems, previous studies had been impeded by the difficulty of measuring very high levels of diversity of very small taxa, and metabarcoding technology has unlocked this diversity, in the same way that microbiome biology has been unlocked by next-generation sequencing (Committee on Metagenomics 2007).
However, precisely because meiofauna and protists were so difficult to study before metabarcoding, independent validation of results has so far been forced to depend on small data sets based on morphospecies (Medinger et al. 2010), on laboratory tests (Porazinska et al. 2009a,b), or on BLASTing reads against GenBank (Fonseca et al. 2010). These checks have been crucial, but by their nature, they do not fully validate metabarcoding as a method for making general measures of biodiversity.
The field requires further validation because metabarcoding promises important management advantages in addition to increased efficiency. Traditional biodiversity data rely on expertise that is difficult to standardise across multiple individuals, and errors (or even fraud) in direct observational data, such as bird lists, cannot subsequently be corrected or audited. In contrast, metabarcoding requires only that staff be able to carry out protocols using standard collection (e.g. pitfall, Malaise, Winkler and light traps) and laboratory techniques, and the raw sequence data remain available for future analyses. It is also possible to partition aliquots of the original collections, or the extracted DNA, for auditing. Another advantage is that metabarcoding can hitchhike on advances in software and laboratory practices that are being developed for bacterial metagenetics (Kosakovsky Pond et al. 2009; Caporaso et al. 2010b).
To further the process of turning metabarcoding into a standard management method, we apply the technique to the Arthropoda, especially the Insecta within it, for which it is easier to validate results against independent sampling, as well as against other biodiversity proxies, such as vegetation (e.g. Gaspar, Gaston, & Borges 2010). Arthropods are also a deserving focal group for direct study: they form a major component of terrestrial biodiversity, provide important ecosystem services such as pollination, decomposition and pest control, can themselves be pests and disease vectors, and are potentially indicative of plant diversity, because many arthropods are herbivores. Finally, with arthropods, it is easier to use the COI barcode gene, which holds some advantages over 18S and other rRNA genes (Emerson et al. 2011). COI is single copy; is present in all taxa of interest, with the exception of a few protozoa; can be amplified across a wide range of taxa with a small set of primers (Folmer et al. 1994), especially with degenerate primer pairs (Rose, Henikoff, & Henikoff 2003; Boyce, Chilana, & Rose 2009); and has a faster substitution rate than nuclear rRNA genes, which increases taxonomic resolution. Mitochondrial 12S and 16S genes also satisfy these criteria, but COI has two additional advantages. First, there exists a fast-growing taxonomic reference database (http://www.boldsystems.org, accessed 10 September 2011) with over 1·3 million specimen-vouchered records so far (Ratnasingham & Hebert 2007). Second, the mutational properties of COI offer the opportunity to eliminate most pyrosequencing error (Emerson et al. 2011; Ranwez et al. 2011), a phenomenon that, if uncorrected, results in overestimates of diversity (Quince et al. 2009; Reeder & Knight 2010).
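One property that makes such error screening possible is that COI is protein-coding, so an insertion or deletion shifts the reading frame and tends to expose a premature stop codon. The sketch below illustrates that idea only; it is an assumption about one usable property and not a description of the cited methods. The function name `passes_frame_check` and the assumption that each read starts at a known codon position (`frame`) are hypothetical. The stop codons are those of the invertebrate mitochondrial genetic code (TAA and TAG; TGA encodes tryptophan in this code).

```python
# Stop codons under the invertebrate mitochondrial genetic code (NCBI table 5).
STOP_CODONS = {"TAA", "TAG"}


def passes_frame_check(seq, frame=0):
    """Return True if `seq`, read in the given frame, has no stop codon.

    Hypothetical illustration: assumes the read starts at a known codon
    position, e.g. because it is anchored by a PCR primer.
    """
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return False
    return True


if __name__ == "__main__":
    intact = "ATGGGCTTAACT"    # ATG GGC TTA ACT: no stop codon
    deletion = "ATGGCTTAACT"   # one G lost: ATG GCT TAA exposes a stop
    print(passes_frame_check(intact))
    print(passes_frame_check(deletion))
```

A check of this kind flags only a subset of errors, because a frameshift does not always produce a stop codon within a short read, and substitutions usually go undetected.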
In this light, a step forward was provided by Hajibabaei et al. (2011), who pyrosequenced the mini-barcode gene (the first 130 bp of COI) in test pools of Trichoptera and Ephemeroptera and BLASTed against reference sequences to recover 17 of 23 input species. They also showed that larval collections, which cannot be identified using morphology, could be identified using metabarcoding and that the collections matched known adult species assemblages from the same locations.
Following Fonseca et al.’s (2010) pioneering work with meiofaunal samples, the next step is to go beyond the recovery of species lists and to devise an efficient and adaptable pipeline that can turn huge lists of COI sequences into usable and high-quality taxonomic and ecological information. In particular, we show that even when some taxonomic information is lost, which is currently unavoidable, it is still possible to recover precise estimates of alpha diversity and beta diversity.
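For readers unfamiliar with the two terms, the following minimal sketch computes one simple alpha-diversity measure (OTU richness within a sample) and one simple beta-diversity measure (Jaccard dissimilarity between two samples) from per-sample OTU read counts. The measures, the function names and the example counts are illustrative assumptions; they are not the metrics or the data used in our analyses.

```python
def present_otus(sample):
    """Set of OTU names with at least one read in the sample."""
    return {otu for otu, n_reads in sample.items() if n_reads > 0}


def richness(sample):
    """Alpha diversity as the number of OTUs detected in one sample."""
    return len(present_otus(sample))


def jaccard_dissimilarity(sample_a, sample_b):
    """Beta diversity as 1 - (shared OTUs / OTUs found in either sample)."""
    a, b = present_otus(sample_a), present_otus(sample_b)
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    # Hypothetical read counts per OTU for two trap samples
    trap_1 = {"OTU1": 5, "OTU2": 0, "OTU3": 2}
    trap_2 = {"OTU3": 1, "OTU4": 7}
    print(richness(trap_1), richness(trap_2))
    print(jaccard_dissimilarity(trap_1, trap_2))
```

Both measures depend only on which OTUs are detected, so they are unaffected when an OTU cannot be assigned a species name.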
We provide the research community with model laboratory protocols and bioinformatic scripts that can be adapted to incorporate new technologies and software as they arise. We also provide the original sequence data for software developers to use as test data sets. Our main contributions are: (1) new degenerate PCR primers to minimise allelic dropout of terrestrial arthropods (mostly but not only insects), (2) validation of several new software packages for denoising, de novo operational taxonomic unit (OTU) picking, and taxonomic assignment (Table S1) within the QIIME pipeline (Caporaso et al. 2010b), which has active developer and user communities, (3) detailed scripts, methods, and data sets for users to learn with, (4) experimental demonstration that beta diversity can be recovered and (5) experimental demonstration that rarefaction of phylogenetic diversity (PD) can recover alpha diversity (Nipperess 2011a,b).
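The idea of rarefaction can be sketched as follows. For simplicity, this example rarefies OTU richness rather than phylogenetic diversity, so it is a swapped-in simplification and not the PD rarefaction of Nipperess. The function name `rarefied_richness`, its parameters and the example counts are hypothetical. Reads are repeatedly subsampled without replacement to a common depth, and the number of OTUs observed is averaged over the subsamples.

```python
import random


def rarefied_richness(counts, depth, n_iter=100, seed=0):
    """Mean number of OTUs seen when `depth` reads are drawn at random.

    Hypothetical illustration: `counts` maps OTU name -> read count, and
    reads are drawn without replacement `n_iter` times.
    """
    # Expand the count table into one entry per read
    pool = [otu for otu, n_reads in counts.items() for _ in range(n_reads)]
    if depth > len(pool):
        raise ValueError("depth exceeds the number of reads in the sample")
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    total = 0
    for _ in range(n_iter):
        total += len(set(rng.sample(pool, depth)))
    return total / n_iter


if __name__ == "__main__":
    sample = {"OTU_A": 50, "OTU_B": 30, "OTU_C": 1}
    for depth in (5, 20, 81):
        print(depth, rarefied_richness(sample, depth))
```

Rarefying all samples to the same depth makes their diversity estimates comparable when sequencing effort differs among samples.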