Chloroplast DNA sequence data are a versatile tool for plant identification or barcoding and for establishing genetic relationships among plant species. Different chloroplast loci have been used at close and distant evolutionary distances in plants, and no single locus has been identified that can distinguish between all plant species. Advances in DNA sequencing technology are providing new cost-effective options for genome comparisons on a much larger scale. Universal PCR amplification of chloroplast sequences and isolation of pure chloroplast fractions, however, are non-trivial. We now propose the analysis of chloroplast genome sequences from massively parallel sequencing (MPS) of total DNA as a simple and cost-effective option for plant barcoding, and analysis of plant relationships to guide gene discovery for biotechnology. We present chloroplast genome sequences of five grass species derived from MPS of total DNA. These data accurately established the phylogenetic relationships between the species, correcting an apparent error in the published rice sequence. The chloroplast genome may be the elusive single-locus DNA barcode for plants.
Analysis of plant DNA for identification of plant species and genotypes has generally replaced earlier techniques based upon other biochemical markers. DNA analysis techniques have evolved, becoming increasingly discriminating and amenable to routine low-cost applications (Henry, 2008). However, a universal protocol that could be applied to any unknown sample has not been achieved. It has been difficult to define discriminating loci in the nuclear genome that could be analysed in all plants with a standard protocol.
Chloroplasts contain both highly conserved genes fundamental to plant life and more variable regions which are informative over broad time scales. When PCR is used to retrieve discrete chloroplast regions for phylogenetic analysis and barcoding, the target sequences need to be carefully selected. Sequences with relatively low mutation rates are required for higher level phylogenetic comparisons, while higher mutation rates are needed to discriminate among closely related species (Heinze, 2007).
Although the mitochondrial gene CO1 has been widely adopted as an efficient DNA barcode for animal species (Hebert et al., 2003), no such single locus has been identified for plants (Rubinoff et al., 2006), and there has been considerable debate surrounding the selection of the most suitable loci (Chase and Fay, 2009; Kress and Erickson, 2008). Recently, a two-locus land plant barcode consisting of portions of the chloroplast genes rbcL and matK has been proposed (CBOL, 2009). The plant barcode was limited to two loci to reduce the cost and time involved in bidirectional Sanger sequencing. Primer universality and species discrimination, however, were suboptimal especially for non-angiosperm land plants. The potential difficulties associated with targeting particular chloroplast regions for barcoding are circumvented by sequencing the chloroplast genome.
Conventional approaches to chloroplast genome sequencing commonly involve purification or PCR amplification of the chloroplast genome prior to sequencing. More recently, massively parallel sequencing (MPS) has been used to capture sequence data from many individual multiplexed chloroplast PCR amplicons (Cronn et al., 2008; Parks et al., 2009). These approaches are, however, relatively time consuming. Non-purified (total) DNA extractions include chloroplast DNA which is sequenced during MPS runs and is treated as contaminating sequence for many applications. However, these sequences have utility for DNA plant barcoding and the exploration of plant relationships. Plant identification has many uses in enforcing intellectual property rights and in quality control in plant and food production and processing. Phylogenetically guided identification of close relatives has the potential to aid gene discovery for crop improvement, and access to chloroplast genome sequences will be of benefit for targeted chloroplast transformation (Daniell et al., 2005).
As the primary staple food for over half of the global population, rice is the world’s most important crop and is a model plant species for genetic studies. The wild relatives of cultivated rice provide a broad gene pool for improved food security as the human population expands in an uncertain climatic future. This gene pool will be accessed more efficiently by comparison of wild rice genome sequence data with the complete rice genome sequence (Goff et al., 2002; IRGSP, 2005).
The Illumina GA (http://www.illumina.com) generates short sequence reads of up to 100 base pairs (bp) which can be converted to contiguous DNA sequence data by reference-guided assembly using an existing genome sequence as a scaffold. This approach was used to retrieve chloroplast sequence from a single 36-bp paired-end run of cultivated rice Oryza sativa japonica (cultivar Nipponbare) and wild relatives Oryza meridionalis, Oryza australiensis and Potamophila parviflora from the tribe Oryzeae, and Microlaena stipoides from the Ehrharteae.
The aim of this study was to determine the extent to which chloroplast genome sequences can be recovered by MPS from total DNA. Re-sequenced cultivated rice was included as a control to test the accuracy of chloroplast sequence data recovered by reference-guided assembly using a previously published rice chloroplast genome of the same cultivar (Nipponbare) as a scaffold. Wild relatives of increasing evolutionary distance from rice were selected to test the utility of this approach in constructing chloroplast genomes with a more distant reference scaffold.
The total aligned data set of five grass species and the rice reference was 134 551 bp in length. One of the two inverted repeat regions was excluded from subsequent phylogenetic analysis. The modified alignment was 113 749 bp in length. There were 109 665 constant positions. Of the 4084 variable positions, 935 were parsimony informative in-group substitutions. Phylogenetic analysis of the 113 749-bp alignment resulted in the construction of trees which conform to the accepted phylogeny (Guo and Song, 2005; Kellogg, 2009). Single optimal phylogenetic trees obtained by maximum likelihood (-lnL = 186669.13), maximum parsimony (MP) (5178 steps; consistency index CI = 1.00) and Bayesian analysis shared the same topology, with maximum bootstrap (100%) and posterior probability (1.00) for all nodes (Figure 1).
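The site counts reported above (109 665 constant plus 4084 variable positions giving 113 749 in total) can be illustrated with a minimal Python sketch. It assumes the textbook definition of a parsimony-informative site (at least two states, each present in at least two sequences) and treats '-' and 'N' as missing data; the function name and the toy alignment are hypothetical, and this is not the PAUP* procedure used in the study, which additionally restricted the count to in-group substitutions.

```python
from collections import Counter

def classify_sites(alignment):
    """Count constant, variable and parsimony-informative alignment columns.

    alignment: list of equal-length strings. '-' and 'N' are treated as
    missing data (an assumption of this sketch) and ignored in each column.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all sequences must have the same length")
    constant = variable = informative = 0
    for column in zip(*alignment):
        counts = Counter(base.upper() for base in column)
        for missing in ("-", "N"):
            counts.pop(missing, None)
        if len(counts) <= 1:
            # One observed state (or only missing data): counted as constant.
            constant += 1
            continue
        variable += 1
        # Informative: at least two states, each seen in two or more sequences.
        if sum(1 for n in counts.values() if n >= 2) >= 2:
            informative += 1
    return {"length": length, "constant": constant,
            "variable": variable, "informative": informative}

# Toy alignment (hypothetical data, four sequences of five columns).
toy = ["ACGTA", "ACGTT", "ATGTT", "ATGAA"]
print(classify_sites(toy))
# {'length': 5, 'constant': 2, 'variable': 3, 'informative': 2}
```

In the toy alignment, column 4 is variable but not informative because the alternative state occurs in only one sequence.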
Reference-guided assembly of rice MPS re-sequence data against a complete chloroplast genome sequence of the same cultivar (Tang et al., 2004) generated a consensus sequence of identical length (134 551 bp) to the reference (Table 1). The re-sequenced sample differed from the reference by one base. A guanine ‘G’ was present at position 8128 of the reference, while an adenine ‘A’, confirmed by Sanger sequencing, was detected at this position in the re-sequenced sample (32 times coverage; 100% were A). Interestingly, an ‘A’ was also found in this position in the other four grass species included in this study and in a previously published Oryza nivara chloroplast genome sequence (Genbank accession AP006728.1). This suggests an error in the reference sequence and demonstrates the potential accuracy of this technique.
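The single-base discrepancy described above rests on a majority vote over the reads covering one position. A minimal Python sketch of such a vote follows; the function name and the `min_depth` parameter are hypothetical, and the pileup string simply restates the reported observation (32 reads, all 'A') rather than reproducing the actual read data or the CLC implementation.

```python
from collections import Counter

def call_base(pileup, min_depth=1):
    """Majority-vote base call for one reference position.

    pileup: string of read bases covering the position.
    Returns (base, depth, fraction supporting the called base).
    Ties are resolved arbitrarily in this sketch.
    """
    counts = Counter(b.upper() for b in pileup if b.upper() in "ACGT")
    depth = sum(counts.values())
    if depth < min_depth:
        return "N", depth, 0.0
    base, n = counts.most_common(1)[0]
    return base, depth, n / depth

# Position 8128: the reference carries 'G'; 32 reads all show 'A'.
reference_base = "G"
base, depth, support = call_base("A" * 32)
print(base, depth, support, base != reference_base)
# A 32 1.0 True
```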
Table 1. Summary statistics for reference-guided assembly of short-read massively parallel sequencing data
The number of assembly gaps increased with evolutionary distance from the reference (Table 1), ranging from zero in the control (O. sativa japonica), to one in O. meridionalis (an 8-bp gap in the psaJ-rpl33 intergenic spacer), 19 in O. australiensis (179 bp in total), 61 in P. parviflora (479 bp in total) and 263 gaps in M. stipoides (8230 bp in total). The majority of assembly gaps (>91%) were in non-coding regions as illustrated in an mVISTA alignment plot (Figure 2). In O. australiensis, the only intragenic gap was 6 bp in length in rbcL. In P. parviflora, six intragenic gaps in total were found in four genes: matK, rpoC2, rpl33 and ndhH. In M. stipoides, a total of 32 intragenic gaps were located in 13 genes, with the majority (61%) in accD and rpoC2. Assembly gaps within coding regions did not disrupt reading frames, with the exception of the accD region in M. stipoides. A start codon was also lacking, suggesting that the chloroplast accD gene is non-functional in this species, a finding that has been observed in other grasses (Diekmann et al., 2009).
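Gap counts and total gap lengths of the kind summarised above can be tallied directly from a consensus sequence. The Python sketch below assumes that unassembled positions are written as 'N' or '-' in the exported consensus, which is an assumption about the file format rather than something stated in the study; the function name and the toy sequence are hypothetical.

```python
import re

def find_gaps(consensus, gap_chars="N-"):
    """Return (start, end, length) for each run of gap characters.

    Coordinates are 1-based and inclusive. gap_chars lists the symbols
    assumed to mark unassembled positions.
    """
    pattern = re.compile("[" + re.escape(gap_chars) + "]+")
    return [(m.start() + 1, m.end(), m.end() - m.start())
            for m in pattern.finditer(consensus.upper())]

# Toy consensus (hypothetical): one 8-bp gap and one 2-bp gap.
toy = "ACGTNNNNNNNNACGT--AC"
gaps = find_gaps(toy)
print(gaps)                          # [(5, 12, 8), (17, 18, 2)]
print(len(gaps), sum(g[2] for g in gaps))  # 2 10
```

Intersecting these intervals with gene coordinates from an annotation would then separate intragenic from intergenic gaps.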
Of 91 SNPs identified between the closely related species O. meridionalis and O. sativa japonica, 73 were located in the large single copy region and 16 were in the small single copy region. Only two were located in the inverted repeats (IR) (one in each IR at positions 90 581 and 124 575). The majority of SNPs (73%) were in non-coding regions, and within coding regions only synonymous SNPs were found. Multiple SNPs were identified in the genes matK (2), rpoC2 (3), rbcL (3) and ndhD (2).
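The regional breakdown above (73 + 16 + 2 = 91 SNPs) amounts to listing substitutions between two aligned sequences and assigning each to a genome region. A minimal Python sketch is given below; the function names, the toy sequences and the region coordinates are all hypothetical and do not correspond to the real boundaries of the rice chloroplast genome.

```python
def find_snps(ref, qry):
    """List substitutions between two aligned sequences.

    Returns (position, ref_base, qry_base) tuples with 1-based positions.
    Columns containing gaps or ambiguity codes are skipped.
    """
    if len(ref) != len(qry):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(ref.upper(), qry.upper()), start=1)
            if a in valid and b in valid and a != b]

def tally_by_region(positions, regions):
    """Count positions falling in each named (start, end) region, inclusive."""
    tally = {name: 0 for name in regions}
    for pos in positions:
        for name, (start, end) in regions.items():
            if start <= pos <= end:
                tally[name] += 1
                break
    return tally

# Toy data: region boundaries are illustrative only.
snps = find_snps("ACGTACGT", "ACGAAC-T")
print(snps)  # [(4, 'T', 'A')]
print(tally_by_region([p for p, _, _ in snps], {"LSC": (1, 5), "IR": (6, 8)}))
# {'LSC': 1, 'IR': 0}
```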
The chloroplast genome sequences of five grass species attained by reference-guided assembly of short-read sequences to a reference sequence from cultivated rice allowed construction of trees which conform to the accepted phylogeny. Evolutionary distance from the reference ranged from intravarietal to intertribal, with divergence times from recent to around 40 million years ago (Kellogg, 2009). The number of assembly gaps increased with increasing genetic distance from the reference. Gaps were located primarily in non-coding regions (Figure 2) suggesting that indels or highly divergent sequence prevented assembly of MPS reads in these regions, rather than incomplete sequence data, as median coverage exceeded 100 times.
Indels, located primarily in non-coding regions, are the most abundant chloroplast mutations and are often associated with repetitive elements such as simple sequence repeats, cpSSRs (Ebert and Peakall, 2009). Recovery of indels and repetitive elements is problematic in the assembly of contiguous sequence from short reads, and is of particular concern for population level studies (Harismendy et al., 2009; Kidd et al., 2010). The impact of missing data has been explored in the context of phylogenomics and it has been found that trees based on the sequence of many genes, supermatrices, are surprisingly robust to high levels of missing data (Delsuc et al., 2005). In one case, the integrity of the tree was maintained with 90% missing data (Philippe et al., 2004).
Our data demonstrate that MPS platforms have the capacity to sequence the chloroplast genome at over 100 times coverage in a single lane without purification (Table 1). Despite representing a small fraction of total DNA sequence, 0.04% in rice, the concentration of chloroplast sequence reads is high relative to nuclear sequence in total DNA preparations. Chloroplast genome sequences of over 100 species of land plants have been deposited in the public domain (Cui et al., 2006), and these are available as scaffolds for reference assembly of MPS data. Initial assembly of highly conserved chloroplast regions (such as the IR) for unidentified samples and species that have not been previously sequenced will allow selection of the most appropriate (closely related) published chloroplast genome for reference assembly.
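The relationship between read number and depth of coverage can be made concrete with a small Python sketch. It computes mean coverage under the simplifying assumption that mapped reads are spread evenly over the genome, whereas the study reports median coverage; the function names are hypothetical, and the only figures taken from the text are the 36-bp read length, the 134 551-bp rice chloroplast genome and the 100-fold target.

```python
import math

def mean_coverage(mapped_reads, read_length, genome_length):
    """Mean depth, assuming reads are distributed evenly over the genome."""
    return mapped_reads * read_length / genome_length

def reads_for_coverage(target, read_length, genome_length):
    """Smallest number of mapped reads that reaches the target mean depth."""
    return math.ceil(target * genome_length / read_length)

# Mapped 36-bp reads needed for 100-fold mean coverage of 134 551 bp.
needed = reads_for_coverage(100, 36, 134551)
print(needed)  # 373753
print(mean_coverage(needed, 36, 134551) >= 100)  # True
```

On this estimate, a few hundred thousand chloroplast-derived reads suffice, which is a small share of the reads produced by a single lane.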
Chloroplast sequence data have the capacity to accurately identify different species, a characteristic which is exploited in DNA barcoding (Kress et al., 2009; Lahaye et al., 2008). Recently, markedly increased phylogenetic resolution among 32 gymnosperm species in comparison with that using a 2-locus barcode was demonstrated using nearly complete chloroplast genome sequences derived from MPS of chloroplast amplicons (Parks et al., 2009). The increased resolution was attributed to increased matrix length and numbers of informative sites, leading the authors to conclude that plastome sequences have potential as plant DNA barcodes. An advantage of the approach presented in this study is that universal primers and PCR amplification were not required. Techniques deployed to enrich for chloroplast sequences, such as PCR and purification of DNA preparations with cesium chloride gradients, may be unnecessary in future to capture plastome sequence data.
The cost per lane of Illumina GA sequencing is now less than US$1500 and in this study gave more than 100-fold coverage of the chloroplast genome (Table 1). Recent improvements in MPS have led to an increase in the length and number of sequence reads. For example, the Illumina HiSeq 2000 promises a full order of magnitude increase in depth of genome sequence over the Illumina GA. When used in conjunction with multiplexed sequencing approaches, MPS of chloroplast genomes is rapidly becoming a simple and cost-effective alternative to bidirectional Sanger sequencing of PCR-amplified partial chloroplast genes.
Current limitations to the use of chloroplast genome sequences derived through MPS of total DNA for plant barcoding include recovery of indels that may be necessary to discriminate between recently diverged species and unavailability of a reasonably close reference for some highly divergent taxa or less-studied groups. These problems are likely to diminish as ‘short-read’ sequences increase in length and more chloroplast genome sequences become available. Another issue for consideration is the applicability of this approach for species with large genomes. In this study, coverage of the chloroplast genome was not correlated with genome size (Table 1). Assembly of the chloroplast genome for highly polyploid sugarcane (Saccharum) species, with genome sizes exceeding 10 Gbp, has been tested using MPS of total DNA (unpublished data). Although direct comparison was not possible as longer read lengths (75 bp) were generated, median coverage of the chloroplast genome from a single lane of Illumina MPS data exceeded 100 times.
The simplicity and efficiency of obtaining chloroplast genomes from total DNA is advantageous for the barcoding of plants. It obviates reliance on universal primers and decreases the risk of misidentification owing to the amplification of pseudogenes (Huang et al., 2005; Song et al., 2008). Furthermore, it provides access to many variable sites to improve resolution even among closely related species. Chloroplasts are haploid and non-recombining, so they act as a single locus. The chloroplast genome therefore has the potential to become the elusive universal single-locus plant barcode for plant species identification. This approach provides a general strategy that may have wide applications for plant barcoding, species delimitation and to guide gene discovery by defining related species for plant biotechnology. This protocol is probably a good first option for plant identification in any forensic application and should have widespread application in the routine enforcement of the intellectual property rights of plant breeders.
DNA was extracted from leaf tissue using a Qiagen DNeasy kit. Approximately 3 μg of total DNA was sheared, polished and prepared following the manufacturer’s instructions (Illumina sample preparation protocol for paired-end sequencing) with the following modifications. Briefly, DNA was sheared using the adaptive focused acoustics method on a Covaris S2 device with the following settings: duty cycle 10%; intensity 5; cycles per burst 200 for 180 s at 6 °C.
Ligation products were purified by agarose gel electrophoresis (2% agarose, 120 V for 120 min). A narrow size range of predominantly 200-bp fragments was excised from the gel, and the products isolated with a QIAquick Gel Extraction kit without heating. PCR products were further purified with a QIAquick PCR Purification Kit and quantified using a DNA 1000 chip on an Agilent BioAnalyzer 2100. Approximately 4 pmol per individual and 3 pmol of the PhiX control lane were sequenced for 36 × 2 cycles on an Illumina Genome Analyser (GAII) following the manufacturer’s instructions. Base calling was performed with Illumina software Pipeline 1.4 (Illumina, San Diego, CA, US).
Paired-end sequence reads were trimmed of low-quality data with a quality score limit of 0.01 and adaptor sequence in CLC Genomics Workbench 3.6.5 (http://www.clcbio.com) and reads of less than 15 base pairs (bp) in length were discarded. Trimmed short-read sequences were assembled using reference-guided assembly (read mapping) against a published rice chloroplast genome sequence (Genbank accession AY522330.1). Assembly was undertaken with CLC Genomics Workbench with the following short-read parameters: ungapped alignment; limit = 8; mismatch cost = 2. Match mode was random to allow for assembly of both inverted repeat regions and repetitive elements, and the conflict resolution mode was vote majority.
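The mapping parameters above can be illustrated with a simplified Python sketch of ungapped read placement. It assumes that the 'limit' is the maximum total cost a read may accumulate and that each mismatch adds the mismatch cost, so that a limit of 8 with a cost of 2 tolerates up to four mismatches; this reading of the settings is an assumption, and the sketch is not the CLC Genomics Workbench algorithm. Unlike the random match mode used in the study, it keeps the first lowest-cost position. The function name and toy sequences are hypothetical.

```python
MIN_READ_LENGTH = 15  # reads shorter than this were discarded after trimming

def best_ungapped_hit(read, reference, mismatch_cost=2, limit=8):
    """Brute-force ungapped placement of one read on a reference.

    Returns (0-based start, cost) for the first lowest-cost placement whose
    cost does not exceed `limit`, or None if the read is too short or no
    placement qualifies.
    """
    if len(read) < MIN_READ_LENGTH:
        return None
    best = None
    for start in range(len(reference) - len(read) + 1):
        cost = 0
        for r, g in zip(read, reference[start:start + len(read)]):
            if r != g:
                cost += mismatch_cost
                if cost > limit:
                    break  # this placement can no longer qualify
        if cost <= limit and (best is None or cost < best[1]):
            best = (start, cost)
    return best

# Toy reference and a 16-bp read carrying one mismatch (hypothetical data).
reference = "TTGACCGATAGCTAGGCTAACGGTACCA"
read = "CCGAGAGCTAGGCTAA"
print(best_ungapped_hit(read, reference))  # (4, 2)
```

A production mapper would index the reference rather than scan every position, but the cost accounting is the point of the example.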
Consensus sequences were exported to Geneious 4.7 (http://www.genious.com) and aligned using Mauve (Darling et al., 2004). The best fitting nucleotide substitution models were selected using Modeltest and MrModeltest (Posada and Crandall, 1998). Aligned data were analysed under MP and maximum likelihood (ML) criteria using the TVM+G model (G = 0.1158) in PAUP* (http://www.paup.csit.fsu.edu). Gaps were treated as missing data. Heuristic searches were conducted with 20 random addition replicates and TBR branch swapping. M. stipoides was the outgroup in rooted trees, with 1000 bootstrap replicates to evaluate nodal support. Bayesian phylogenetic analysis was conducted using MrBayes 3.1 (Ronquist and Huelsenbeck, 2004) using the GTR+I model (I = 0.73). Two independent runs of 1 × 10⁶ Monte Carlo Markov Chains (MCMC) were performed following burn-in of 1 × 10⁵ MCMC, each starting with a different random tree. Nodal support for Bayesian consensus trees was evaluated by posterior probability distribution.
Consensus sequences were annotated using DOGMA, Dual Organellar Genome Annotator (Wyman et al., 2004) and manually adjusted as needed before submission to Genbank. Aligned sequences and annotations for O. sativa japonica were used to construct sequence conservation plots in the program mVISTA (Frazer et al., 2004).
The authors acknowledge the assistance of Sally Norton from the Australian Tropical Crops and Forages Collection for supply of seed samples. In addition, the authors acknowledge the technical assistance provided by Laura Homer, Shabana Kasem, Asuka Kawamata and Linda Hammond from the Centre for Plant Conservation Genetics, Southern Cross University.