DNA barcoding demystified
Only 10% of the earth's biota has been described despite 250 years of taxonomic research (Wilson 2000). This is in large part a reflection of the extent and complexity of biological diversity, but it is also true that traditional taxonomic techniques are laborious and highly specialised, and taxonomic expertise is very thinly spread across the myriad groups of life (Scotland et al. 2003). As a result, obtaining species identifications, even for a modest sample of commonly encountered insects in a biologically diverse country, such as Australia, is at best time-consuming and costly and all too often impossible. This is a result of the difficulty that morphological taxonomic techniques encounter in dealing with phenotypic differences between immature and mature stages of a species, and among individuals of variable species, the lack of phenotypic differences among cryptic species, shifts in geographical distribution as a result of invasions or range expansions and the overwhelming diversity of life. Our reliance on morphological approaches, thus, places serious limitations on the diagnosis of biological diversity at a time when we are facing a biodiversity crisis of unprecedented magnitude. It has long been clear that new approaches to biodiversity assessment, description and diagnosis are needed, and indeed, the 1990s saw the emergence of many novel taxonomic initiatives, leading E.O. Wilson to conclude that ‘the 19th century culture of taxonomy has begun to be replaced’ (Wilson 2003). The emerging discipline of DNA barcoding (Floyd et al. 2002; Hebert et al. 2003) represents another advance and part of the solution to these problems. This paper provides a brief overview of DNA barcoding, and considers what barcoding offers researchers and industry alike.
WHAT ARE DNA BARCODES AND ARE THEY NOVEL?
DNA barcodes are short DNA sequences (≥500 base pairs) that can be used to identify species. The ‘barcode’ analogy refers, of course, to the universal product codes (UPC) affixed to merchandise in retail stores. The analogy suggests that DNA barcodes similarly provide a universal system of readily scannable, unique tags for products (species), speeding and automating transactions (identifications) and facilitating novel applications, such as automated stock management (biodiversity assessment) and so on. Unfortunately, the barcode analogy also has been used by the microbial diagnostics community to refer to the banding pattern of rep-polymerase chain reaction (PCR) products (a technique similar in principle to Random Amplification of Polymorphic DNA (RAPD)-PCR) visualised on a gel, which bears some resemblance to UPCs (Sutton & Cundell 2004). Love it or hate it, the term ‘DNA barcoding’ has caught the public imagination, and is likely here to stay.
DNA sequence data from the mitochondrial genome have long been used to resolve difficult taxonomic questions and diagnose species in cases where morphology alone has proved inadequate (e.g. Brown et al. 1999; Mitchell & Samways 2005), so is DNA barcoding just molecular systematics and molecular diagnostics repackaged or is it a truly novel enterprise? The novelty in DNA barcoding is its combination of four main factors: standardisation on a particular gene region, the large scale of operation, compulsory vouchering of specimens and active curation of the resulting databases.
First, standardisation on a particular gene region for all species in a particular branch of life leads to universality of the system. For animals the standard is the upstream half of the mitochondrial gene cytochrome c oxidase subunit I (COI or cox1). Without standardisation, molecular systematic data sets lack interoperability, building ‘Towers of Babel’ (Caterino et al. 2000). The same is true of molecular diagnostics – the literature contains a plethora of highly specialised methods for identifying particular organisms, and this leads to some circularity in that one has to have a well-developed idea of an organism's identity (e.g. which genus) to even know which diagnostic test to apply. Furthermore, non-sequence-based diagnostic tests are incapable of uniquely identifying novel isolates, and must be redeveloped if novel isolates are discovered. Standardisation allows for the generalisation of the protocols and skills needed to implement diagnostics so that identifications can be performed by technicians with generic molecular biology skills.
Second, the use of high-throughput techniques adapted from genomics allows small molecular biology laboratories to process hundreds of samples per day (thousands if they use robotics systems, and have sufficient DNA sequencing capacity). When compared with the six genitalia slide preparations per day, i.e. the recommended daily limit for lepidopterists (Robinson 1976), the appeal of molecular techniques is clear.
Third, compulsory deposition of voucher specimens in museums will facilitate future use of barcoded material in morphological taxonomic research, aiding development of the integrated morphological/molecular approach advocated by Page et al. (2005) and others.
Fourth, the enforcement of data standards for ‘reference barcodes’ (e.g. specimen collection records, photographs, numbers of specimens that must be processed, minimum quality scores for sequence trace files) combined with the review and curation of DNA barcode records by the Consortium for the Barcode of Life (CBOL) lends a degree of rigour and transparency on a par with modern taxonomy, and certainly superior to that found in existing public domain primary databases, such as GenBank (Nilsson et al. 2006).
DNA barcoding consists of an evaluation phase and an application phase. In the evaluation phase, DNA sequence data are collected from identified specimens, and analysed to assess their congruence with existing taxonomy. If there is complete congruence, the DNA barcodes accurately reflect accepted taxonomy, i.e. they are perfect markers of species boundaries, and become ‘reference barcodes’ for that species. If there is incongruence, then either DNA barcodes do not recover species boundaries accurately, or the current concepts of species boundaries themselves are incorrect. Either way, further investigation is needed, preferably integrating both morphological and molecular data (Dayrat 2005). In the application phase, DNA barcode sequences are collected from unidentified specimens, and compared with reference sequences in a barcode database to provide a species identification. In the evaluation phase, discrepancies usually are only flagged (e.g. Hebert et al. 2004) for future taxonomic resolution. For example, Handfield and Handfield (2006) described a new species of moth first identified as new to science by DNA barcodes, but taxonomic issues may be resolved simultaneously if the authors' expertise permits (e.g. Hulcr et al. 2007). In cases where the number of potential new species identified by barcodes is too large to allow immediate formal taxonomic description, or the specimens examined (e.g. insect larvae, nematodes, etc.) cannot be distinguished by morphological means, barcodes can be used to designate molecular operational taxonomic units or MOTUs (Blaxter et al. 2005; Ahrens et al. 2007). While MOTUs are best thought of as a temporary solution, they could be used indefinitely to provide unique, searchable species identifiers, pending a formal taxonomic revision. This could facilitate consistent identification of undescribed species encountered during ecological research, circumventing the taxonomic impediment.
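The application phase described above can be illustrated with a minimal Python sketch. It assumes that the query and reference barcodes are already aligned to the same length, uses the uncorrected proportion of differing sites (p-distance) as the measure of divergence, and applies a purely hypothetical cut-off of 3%; the function names, the threshold and the toy 20-base sequences are illustrative assumptions and are not taken from any published barcoding protocol or database.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Sites where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def identify(query: str, references: dict, threshold: float = 0.03):
    """Return (species, distance) for the nearest reference barcode.

    If the nearest reference is further away than `threshold` (a
    hypothetical cut-off), the species is reported as None so that the
    specimen can be flagged for taxonomic follow-up.
    """
    best_species, best_dist = None, float("inf")
    for species, seq in references.items():
        d = p_distance(query, seq)
        if d < best_dist:
            best_species, best_dist = species, d
    if best_dist <= threshold:
        return best_species, best_dist
    return None, best_dist


# Toy reference library (invented sequences, far shorter than real barcodes).
REFERENCES = {
    "Species A": "ACGTACGTACGTACGTACGT",
    "Species B": "ACGTTCGAACGTTCGAACGT",
}

if __name__ == "__main__":
    print(identify("ACGTACGTACGTACGTACGT", REFERENCES))
    print(identify("TTTTTTTTTTTTTTTTTTTT", REFERENCES))
```

A query that matches no reference within the cut-off is returned without a name, which corresponds to the situation in which a specimen would be flagged or assigned to a MOTU rather than to a described species.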
DOES DNA BARCODING WORK?
The grand vision presented by Hebert et al. (2003) in their ground-breaking barcoding paper generated tremendous publicity but also a predictable backlash. The following 2 years produced few studies with comprehensive sampling of particular genera, making it difficult to assess the ability of DNA barcodes to diagnose closely related species (Moritz & Cicero 2004). However, in light of the more than 200 empirical studies published since then, few authors, if any, would disagree today with the statement that ‘large-scale and standardised sequencing, when integrated with existing taxonomic practice, can contribute significantly to the challenges of identifying individuals and increasing the rate of discovering biological diversity’ (Moritz & Cicero 2004).
Recent reviews (Vogler & Monaghan 2007; Waugh 2007) suggest that the fundamental goal of DNA barcodes, to provide rapid, accurate and consistent species identifications, even for undescribed species (Blaxter et al. 2005), has been successful for the vast majority (>95%) of species examined. Although simulation studies suggest that DNA barcodes may fail to discover new species (Hickerson et al. 2006), especially recently diverged species, empirical studies have nevertheless uncovered many deeply divergent mtDNA lineages within species. In some cases, these clearly represent cryptic species, e.g. in Lepidoptera (Hebert et al. 2004; van Velzen et al. 2007), Diptera (Smith et al. 2006) and Crustacea (Witt et al. 2006) among other taxa; whereas, in other cases, there are ecological and geographical correlates of mtDNA diversity, but the authors have decided to err on the side of caution and not make taxonomic changes (Dittrich et al. 2006).
DNA barcoding analysis as outlined by Ratnasingham and Hebert (2007) has been criticised for, among other reasons, its dependence on distance-based methods (Rubinoff 2006). However, barcoding analytical methods are undergoing rapid development (e.g. Nielsen & Matz 2006; Abdo & Golding 2007; Rosenberg 2007) to bring greater statistical rigour to the field.
DNA barcoding will fail if species have diverged too recently to accumulate fixed differences in their COI sequences, in cases of inter-specific hybridisation, incomplete lineage sorting and where paralogous copies of mitochondrial genes have been inserted into the nuclear genome (‘numts’) (Funk & Omland 2003). Concerns also have been raised over the potentially confounding effects of Wolbachia infection in many invertebrates (Whitworth et al. 2007). Thus, while most published barcoding studies have shown high success rates with less than 5% of species that cannot be separated by barcodes (Waugh 2007), a few studies have shown substantially higher failure rates.
For example, Meyer and Paulay (2005) found that DNA barcodes failed to identify 4–17% of cowrie species. The most extreme percentage of failures was reported by Meier et al. (2006) who found that up to 30% of Dipteran species on GenBank could not be distinguished by their COI sequences. While this figure is surprisingly high, it is likely affected by two factors. First, molecular systematists tend only to study taxonomic problems which are not easily resolved using morphological characters. The existing COI data sets on GenBank, therefore, likely represent a biased sample of worst-case scenario examples for DNA barcoding. Second, GenBank is notorious for its large number of misidentifications. For example, Nilsson et al. (2006) found that up to 20% of fungal sequences on GenBank may be incorrectly identified to species level. An example from Ramsuran (2004) illustrates the problem for insects: a supposed boll weevil (Anthonomus grandis) COI sequence in GenBank (accession number AY266628, Scataglini et al. 2006) shows up to 30% sequence divergence from other boll weevil COI sequences, yet it is a 100% match to two Dichroplus elongatus grasshopper sequences (AY014345 and AF260551). Clearly, the boll weevil sequence is erroneous, yet GenBank will not correct such errors unless the original author updates them. This illustrates the need for rigorous curation of DNA barcode databases, a task to be undertaken by the CBOL.
Given the immense diversity of life and the variability of evolutionary rates among genes and among taxa, it is extremely improbable that a simple single-gene system ever could be developed that would prove truly universal: the search for a simple, single-tiered molecular diagnostic test which is truly universal can indeed be likened to the search for a Holy Grail (Rubinoff et al. 2006). It is, therefore, all the more remarkable that COI-based barcodes have worked as well as they have, particularly given the simple analytical methods that have been employed to date. Although COI-based barcodes clearly will not provide species-level resolution in all taxa, this in no way detracts from their potential to form the backbone of a diagnostics system for all animal life because even where barcodes ‘fail’, they will still identify to some more inclusive group, such as species-group or genus. In such cases, a second gene or other characters could be assessed subsequently to refine the identification. Such a multi-tiered barcoding system is precisely what is envisaged for identification of plants (Newmaster et al. 2006). Even in cases where COI-based DNA barcodes distinguish species easily, there is still the problem of hybridisation. For example, Nelson et al. (2007) successfully identified nine species of forensically important Chrysomya blowflies using DNA barcodes but a single hybrid specimen in their data set was misidentified (because mtDNA is inherited directly from the mother without undergoing recombination). However, nuclear DNA sequence data correctly resolved the identity of the hybrid specimen. It would, therefore, be prudent to include an appropriate nuclear gene in DNA barcode data sets for diagnostic applications where lives are at stake, such as criminal forensics or medical parasitology.
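The idea that a ‘failed’ barcode still identifies a specimen to a more inclusive group can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is a list of (binomial name, distance) hits from some prior comparison step, the 3% cut-off is hypothetical, the genus is taken to be the first word of the binomial, and the placeholder names are invented. It is not a description of how any existing barcode database resolves ambiguous matches.

```python
def tiered_identification(hits, threshold=0.03):
    """Report the finest taxonomic level supported by the barcode hits.

    `hits` is an iterable of (binomial_name, distance) pairs. The
    threshold is a hypothetical cut-off, not a published standard.
    Returns a (level, name) tuple.
    """
    close = sorted({name for name, dist in hits if dist <= threshold})
    if not close:
        return ("unidentified", None)
    if len(close) == 1:
        # Only one species lies within the cut-off: species-level result.
        return ("species", close[0])
    genera = {name.split()[0] for name in close}
    if len(genera) == 1:
        # Several congeners match equally well: report the genus and
        # leave species-level resolution to a second marker.
        return ("genus", genera.pop())
    return ("ambiguous", None)


if __name__ == "__main__":
    # Placeholder names and invented distances, for illustration only.
    hits = [("Aus bus", 0.00), ("Aus cus", 0.01), ("Xus yus", 0.09)]
    print(tiered_identification(hits))
```

A genus-level result would be the point at which a second gene or other characters are examined, in the spirit of the multi-tiered system mentioned above.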
WHAT CAN ONE DO WITH DNA BARCODES?
Following thorough evaluation of the accuracy of DNA barcodes in identifying taxa of interest, the data may be applied in biosecurity, ecology, conservation, phylogeography and numerous other fields. Although some potential applications, including species discovery and phylogenetic analysis, remain controversial, like most strongly polarised debates the truth most likely lies somewhere in the middle.
It has long been recognised that robust molecular phylogenetic analysis requires data from multiple unlinked genes (Miyamoto & Cracraft 1991) and that mitochondrial genes evolve too rapidly to provide resolution of deeper (e.g. Mesozoic) divergences (Friedlander et al. 1992). Thus, the sole use of a short fragment of mtDNA for phylogenetic analysis is certainly not recommended. However, the addition of mtDNA data to nuclear gene data sets should help resolve the tips of trees where more slowly evolving nuclear genes are less informative (Caterino et al. 2000); thus, DNA barcode data will likely prove of tremendous value for phylogenetics.
The identification of insect larvae (the stage that most often damages crops) is often problematic, but DNA barcodes can be used to match larvae with identified adults (Miller et al. 2005; Caterino & Tishechkin 2006; Rojo et al. 2006; Ahrens et al. 2007). This can have immediate practical application in taxonomy, resource management and biosecurity.
DNA barcodes have been developed for a number of pests of biosecurity importance (Ball & Armstrong 2006; Roe et al. 2006; Scheffer et al. 2006). Once a comprehensive barcode database has been developed for taxa of interest, it is also possible to design secondary assays to facilitate more rapid identification (Summerbell et al. 2005), although DNA sequencing technology is developing so rapidly that such secondary analysis methods could become obsolete in the near future (Easley et al. 2006). Rapid detection of exotic species at countries' borders and around commercial ports could facilitate suppression or even eradication of invasive species, saving many millions of dollars annually.
Conservation biologists typically need to make inferences about the value and distribution of biological diversity that remains unknown to science, and usually rely on surrogate indicators; however, performance evaluations of most existing surrogates have been discouraging. Faith (1992) proposed the use of a phylogenetic diversity (PD) measure, which side-steps simplistic species-counting, and uses phylogeny to boost predictions of general biodiversity patterns. Faith and Baker (2006) demonstrated how DNA barcoding programs can support the use of PD for biodiversity conservation planning, side-stepping contentious DNA barcode-based ‘species’ designations while relying only on phylogenetic signal towards the tips of the phylogeny. DNA barcoding also has been applied to biodiversity assessment using other strategies (e.g. Smith et al. 2005; Caesar et al. 2006).
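To make the PD measure concrete, the following Python sketch sums the branch lengths that connect a chosen set of taxa to the root of a tree. It assumes the rooted form of the calculation (the path to the root is included), represents the tree as a simple mapping from each node to its parent and branch length, and uses an invented toy tree; the data structure and function name are illustrative choices rather than part of any published PD software.

```python
def phylogenetic_diversity(tree, taxa):
    """Sum of branch lengths on the paths joining `taxa` to the root.

    `tree` maps each non-root node to a (parent, branch_length) pair.
    Each branch is counted once, however many taxa share it.
    """
    counted = set()
    total = 0.0
    for taxon in taxa:
        if taxon not in tree:
            raise KeyError(f"unknown taxon: {taxon}")
        node = taxon
        # Walk towards the root, stopping at the root itself or at the
        # first branch already counted (its ancestors were counted too).
        while node in tree and node not in counted:
            parent, length = tree[node]
            counted.add(node)
            total += length
            node = parent
    return total


# Invented toy tree: ((A:2, B:3)n1:1, C:4)root
TOY_TREE = {
    "n1": ("root", 1.0),
    "A": ("n1", 2.0),
    "B": ("n1", 3.0),
    "C": ("root", 4.0),
}

if __name__ == "__main__":
    print(phylogenetic_diversity(TOY_TREE, ["A", "B"]))  # shared branch n1 counted once
    print(phylogenetic_diversity(TOY_TREE, ["A", "C"]))
```

In the toy tree, the pair A and C scores higher than the pair A and B because C sits on a long, unshared branch, which is the sense in which PD rewards phylogenetic distinctness rather than a simple count of species.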
DNA barcoding also has been suggested as a quality control measure to ensure that model organisms used in laboratory experiments are indeed the species which they are supposed to be (Bely & Weisblat 2006), to help map the complex interactions among organisms such as tracing food webs (McCann 2007) and even to identify the species of snake responsible for envenomation of people (Pook & McEwing 2005). Many other potential applications exist.
DNA barcoding is not a replacement for morphology-based taxonomy (Ebach & Holdrege 2005), but neither is it just another of the many tools available to taxonomists. Rather, it could represent a fundamental shift in our ability to automate and standardise species identification. DNA barcoding has the potential to streamline taxonomy by releasing taxonomists from routine identification duties to concentrate on research and by rapidly answering many alpha-taxonomic research questions while simultaneously identifying the more complex problems (species with overlapping DNA barcodes or species containing deep divergences in their mtDNA) where taxonomists and molecular systematists can spend their precious time more productively. DNA barcoding currently stands where genomics stood more than a decade ago. While many argue over whether it can be done and some even argue over whether it should be done (Scotland et al. 2003), it is being done – albeit on a far smaller scale than is needed. Provided it is underpinned by sound taxonomy and continued refinement of analytical methods and content delivery, DNA barcoding has the potential to deliver a genomics-era solution to the taxonomic impediment, contributing to E.O. Wilson's vision for a universal encyclopaedia of life (Wilson 2003).
The author thanks Dan Faith for helpful comments. Funding was provided by a New South Wales BioFirst Award to AM from the New South Wales Office for Science and Medical Research.