Finding a (pine) needle in a haystack: chloroplast genome sequence divergence in rare and widespread pines


  • Rich Cronn (US Forest Service, Pacific Northwest Research Station), Aaron Liston (Oregon State University) and PhD student Matt Parks are developing methods to facilitate the application of genomic data to population genetic and phylogenetic questions in non-model organisms. John Syring (Linfield College) studies the systematics, population genetics, and evolution of pines. Justen Whittall (Santa Clara University) studies plant evolution, applying genome-scale approaches to understanding the process of adaptation and speciation. Jason Buenrostro and Cindy Dick are members of the Whittall lab, and are responsible for validating Illumina SNPs.

Richard Cronn, Fax: (541) 750-7329; E-mail:


Critical to conservation efforts and other investigations at low taxonomic levels, DNA sequence data offer important insights into the distinctiveness, biogeographic partitioning and evolutionary histories of species. The resolving power of DNA sequences is often limited by insufficient variability at the intraspecific level. This is particularly true of studies involving plant organelles, as the conservative mutation rate of chloroplasts and mitochondria makes it difficult to detect polymorphisms necessary to track genealogical relationships among individuals, populations and closely related taxa, through space and time. Massively parallel sequencing (MPS) makes it possible to acquire entire organelle genome sequences to identify cryptic variation that would be difficult to detect otherwise. We are using MPS to evaluate intraspecific chloroplast-level divergence across biogeographic boundaries in narrowly endemic and widespread species of Pinus. We focus on one of the world’s rarest pines – Torrey pine (Pinus torreyana) – due to its conservation interest and because it provides a marked contrast to more widespread pine species. Detailed analysis of nearly 90% (∼105 000 bp each) of these chloroplast genomes shows that mainland and island populations of Torrey pine differ at five sites in their plastome, with the differences fixed between populations. This is an exceptionally low level of divergence (1 polymorphism/∼21 kb), yet it is comparable to intraspecific divergence present in widespread pine species and species complexes. Population-level organelle genome sequencing offers new vistas into the timing and magnitude of divergence within species, and is certain to provide greater insight into pollen dispersal, migration patterns and evolutionary dynamics in plants.


Next-generation (Next-Gen) sequencing is revolutionizing all facets of molecular ecology (Hudson 2007; Rokas & Abbot 2009; this issue), as rapid access to orders of magnitude more data at substantially reduced costs promises a wealth of new insights. The ability to sequence nearly complete organellar genomes is an important milestone in this revolution. In addition to the important population and evolutionary insights provided by these independent genomic partitions, the compact size, conserved genic content and structural organization, and low (to absent) intraindividual variability of organelle genomes make them an experimentally tractable system for testing and refining modern sequencing strategies (Moore et al. 2006; Meyer et al. 2007; Cronn et al. 2008; Parks et al. 2009), and for developing and testing new bioinformatics tools (Bryant et al. 2009).

In plants, the chloroplast genome has been an invaluable resource for investigating inter- and intraspecific evolutionary histories (Birky 1978, 2001; Chase et al. 1993; McCauley 1995; Newton et al. 1999; Provan et al. 2001; Petit et al. 2003). The predominantly uniparental inheritance of chloroplasts (for exceptions, see Birky 2001; Mogensen 1996) is analytically attractive since a single, independent genealogical history can be readily obtained for hypothesis testing and comparison with the nuclear genome. In plants showing maternal chloroplast inheritance, the magnitude and pattern of differentiation reveal the relative importance of seed vs. pollen dispersal and matrilineal evolutionary history (Ennos 1994; Hu & Ennos 1997; Petit et al. 2005). In a subset of land plants (conifers and a few flowering plant lineages), the chloroplast is paternally inherited and thus tracks the evolutionary history of pollen dispersal independent of the nuclear genome, and frequently independent of the mitochondrial genome (Neale & Sederoff 1989). This allows genetic variation to be partitioned into parental contributions (pollen vs. seed), and allows each genome to serve as an independent partition in tests for genetic differentiation of geographically isolated or disjunct populations (Hu & Ennos 1999; Mitton et al. 2000).

In most plants, the usefulness of chloroplast-derived information is often offset by its conservative mutation rate. For example, the estimated per-base mutation rate for the chloroplast genome in pines is on the order of 0.2–0.4 × 10⁻⁹ synonymous substitutions per site per year (Willyard et al. 2007; Gernandt et al. 2008). This is ∼1/100 the value for animal mitochondria (Moritz et al. 1987), so it requires proportionately more chloroplast DNA sequence to yield resolutions comparable to those estimated from animal mitochondrial genomes for similarly aged divergence events. An impact of this limitation is that chloroplast-based inferences often focus on the fastest evolving fraction of the chloroplast genome, primarily microsatellites or repeated motifs (Provan et al. 1999; Ebert & Peakall 2009). These markers show high mutation rates and can provide excellent haplotypic discrimination (Afzal-Rafii & Dodd 2007; Höhn et al. 2009; Moreno-Letelier & Piñero 2009). Conversely, chloroplast microsatellites are constrained in length, which increases the probability of molecular homoplasy (Estoup et al. 2002; Jakobsson et al. 2006) and makes them poorly suited for investigating genealogical, mutational, and coalescent histories (Brumfield et al. 2003). Collectively, these types of studies highlight the need for evaluating all genetic variation contained within the chloroplast genome.

The current generation of genome sequencers possesses an overwhelming excess of capacity for accessing sequences from entire organellar genomes. Land plant organellar genomes range in size from ∼70–220 kb for the chloroplast, to over 700 kb in mitochondria (survey of NCBI GenBank; Release 172.0). When combined with multiplex or barcoding methods (e.g. Meyer et al. 2007; Craig et al. 2008; Cronn et al. 2008; Erlich et al. 2009), modern sequencers could potentially sequence hundreds of organelle genomes in a single analysis. Although the sequencing of genomes is increasingly easy, Next-Gen sequencers are not without limitations. For example, some platforms have been characterized as showing higher positional error rates than Sanger sequencing, particularly in regions of low complexity (e.g. single nucleotide repeats, short perfect repeats; Bentley et al. 2008). These repeats can be abundant in organellar genomes, so they might be ‘hotspots’ for methodological errors. Similarly, biases in genome-wide base composition have been reported to result in biases in sequencing error (Dohm et al. 2008; Dolan & Denver 2008). Plant organelle genomes are generally A/T-rich, with chloroplasts showing the greatest skew in base composition compared to mitochondria (∼62% A/T vs. 58% respectively; NCBI GenBank; Release 172.0). These kinds of errors are not problematic for many genomics applications, but they are certain to inflate estimates of nucleotide diversity when surveying populations for rare polymorphisms.

In this report, we show how whole chloroplast genomes can be rapidly sequenced and screened to identify intraspecific variation, with examples from the conifer genus Pinus. Results from Next-Gen sequencing are directly compared to Sanger sequencing in order to evaluate the relationship between sequencing depth, the discovery of putative SNPs, and the false-positive and false-negative discovery rate. The primary focus of this study, Torrey pine (Pinus torreyana), is one of the rarest temperate trees in the world (Critchfield & Little 1966) and a species of conservation concern. Torrey pine is restricted to two populations in California, USA, separated by 280 km of Pacific Ocean (Fig. 1a). The mainland grove located north of San Diego, CA (P. torreyana ssp. torreyana), comprises ∼3400 trees, while another ∼2000 trees occur on Santa Rosa Island, CA (P. torreyana ssp. insularis). The populations have been suggested to be evolutionarily distinct based on subtle morphological differences (cone features, growth rates in common garden) and have been described as subspecies (Haller 1986). Torrey pine is exceptional among pine species due to its unusually low levels of allozyme variation (Ledig & Conkle 1983), and attempts at distinguishing island from mainland populations have been stymied by a lack of genetic variation, especially in cpDNA (Waters & Schaal 1991). Despite three separate cpDNA studies and a combined total of 17 cpSSR loci (Provan et al. 1999), 150 cp restriction sites (Waters & Schaal 1991) and 3.5 kbp of cpDNA sequence (Gernandt et al. 2009), intraspecific variation has not been detected in this species. This, in turn, severely constrains our ability to understand the evolutionary history of this species.

Figure 1.

 Geographic distributions of the species examined and locations sampled (triangles). (a) Pinus torreyana is restricted to Santa Rosa Island (ssp. insularis) and the mainland (ssp. torreyana) near San Diego, California. (b) Pinus albicaulis (light shading) and Pinus lambertiana (dark shading); dotted line shows the division between north and south germplasm (see text for description). (c) Pinus flexilis and Pinus ayacahuite; Pinus strobiformis (stippled) is displayed to highlight the continuous distribution of this species complex (see text for description). (d) Pinus monticola; dotted line shows the division between north and south germplasm (see text for description).

Using Next-Gen sequencing, we can sequence and analyse whole chloroplast genomes from species of conservation concern such as Torrey pine, and begin to provide answers to important questions that bear upon their management: (i) can genetic variation be detected in the chloroplast genome of Torrey pine?; (ii) do extant populations represent an undifferentiated segregating metapopulation, or are they evolutionarily distinct in their chloroplast genomes?; (iii) is it possible to date the approximate divergence of chloroplast types detected?; and (iv) is Torrey pine unique among pines in its magnitude and partitioning of chloroplast divergence? To address these questions, we compare the results from Torrey pine to five estimates of intraspecific divergence that use partial or complete pine chloroplast genomes. Two of these compare divergent haplotypes within Sugar pine (P. lambertiana) and within Western White pine (P. monticola) that were sampled from previously identified, genetically divergent populations (Liston et al. 2007; Steinhoff et al. 1983; J. Syring, unpublished). The remaining comparisons are effectively intraspecific, as they focus on chloroplast genomes from taxa that have been considered conspecific (P. cembra–P. sibirica; Meusel et al. 1965; Shaw 1914), are part of a species complex (P. flexilis–P. ayacahuite; Moreno-Letelier & Piñero 2009; Syring et al. 2007), or are related through introgressive/chloroplast capture events (P. lambertiana–P. albicaulis; Liston et al. 2007). In total, these data offer an unprecedented view into the magnitude of intraspecific, cryptic chloroplast genome variation. They also highlight possible discrepancies between estimates of diversity/divergence from different classes of markers (microsatellites, single genes and whole genomes) that need to be reconciled in future comparisons.

Materials and methods

Haplotype sampling

Intraspecific samples were taken across previously identified biogeographic barriers and/or chosen to represent known haplotype variants for Pinus torreyana, P. lambertiana and P. monticola (Table 1, Fig. 1). Pinus torreyana plastomes were sequenced in one island (ssp. insularis) and one mainland (ssp. torreyana) individual grown at the Santa Barbara Botanical Garden. For haplotype screening, an additional 81 individuals were collected from both segments of the population within Torrey Pines State Natural Reserve, San Diego, CA, and 86 individuals were collected from the Santa Rosa Island population. Pinus lambertiana samples from two individuals represent previously identified and highly divergent haplotypes (Liston et al. 2007; Fig. 1b). Ten samples from P. monticola were chosen to evenly represent northern and southern populations of this species that have been previously determined to be phylogeographically distinct through isozyme studies (Steinhoff et al. 1983) and preliminary analyses of four low-copy nuclear loci (J. Syring, unpublished) (Fig. 1d).

Table 1.   Species, sample locations and properties of microread assemblies used to construct full or partial chloroplast genomes
Species | Collection ID | Locality information | Latitude, longitude | Voucher | Microreads | RGA contigs (average length, bp) | Average depth* (median) | Assembly length† (bp) | GenBank accession no.
P. albicaulis | ALBI05 | USA: Montana, Stillwater Co. | 45.44°N, 110.01°W | OSC 213500 | 869 509 | 56 (2079.0) | 100.9 (56) | 107 159 | FJ899566
P. ayacahuite | AYAC01 | Mexico: Tialmanatco, Mexico | 19.17°N, 98.80°W | OSC 213762 | 1 173 420 | 54 (2038.9) | 133.7 (97) | 104 983 | FJ899570
P. cembra | CEMB03 | Austria: Styria, Resorts Predlitz-Turrach and Reichenau, Turracher Hohe | 47.98°N, 13.89°E | OSC 213511 | 1 166 707 | 110 (905.8) | 175.4 (44) | 86 921 | FJ899574
P. flexilis | FLEX13 | USA: Arizona, Graham Co. | 37.38°N, 106.58°W | OSC 21372 | 1 545 509 | 39 (2979.5) | 186.6 (137) | 110 415 | FJ899576
P. lambertiana N | LAMB08 | USA: California, Siskiyou Co. | 41.85°N, 122.31°W | OSC 213878 | 2 870 153 | 19 (6122.7) | 300.2 (102) | 114 386 | EU998743
P. lambertiana S | LAMB01 | USA: California, Placer Co. | 39.05°N, 120.38°W | OSC 213878 | 1 180 289 | 55 (2011.1) | 172.2 (114) | 105 202 | FJ899577
P. monticola | MONT06 | Canada: mainland British Columbia | 51.50°N, 119.55°W | Unvouchered | 324 564 | 64 (587.8) | 145.1 (52) | 36 133 | GQ478176
P. monticola | MONT07 | Canada: Vancouver Island, British Columbia | 48.72°N, 123.97°W | Unvouchered | 306 676 | 55 (660.9) | 140.7 (90) | 34 231 | GQ478177
P. monticola | MONT08 | USA: California, Siskiyou Co. | 41.75°N, 123.13°W | Unvouchered | 512 154 | 35 (1111.3) | 231.0 (116) | 37 696 | GQ478178
P. monticola | MONT12 | USA: California, Kern Co. | 36.05°N, 118.35°W | Unvouchered | 444 888 | 36 (1085.1) | 212.3 (159) | 38 578 | GQ478179
P. monticola | MONT14 | USA: California, Tulare Co. | 36.20°N, 118.48°W | Unvouchered | 448 833 | 41 (946.2) | 219.4 (150) | 38 202 | GQ478180
P. monticola | MONT26 | USA: Oregon, Douglas Co. | 43.42°N, 122.73°W | Unvouchered | 233 684 | 41 (940.3) | 109.0 (82) | 36 952 | GQ478181
P. monticola | MONT30 | USA: Oregon, Douglas Co. | 43.13°N, 121.97°W | Unvouchered | 358 422 | 37 (1044.5) | 187.5 (121) | 37 981 | GQ478182
P. monticola | MONT36 | USA: California, Siskiyou Co. | 41.25°N, 122.87°W | Unvouchered | 787 410 | 65 (586.2) | 130.0 (50) | 36 368 | GQ478183
P. monticola | MONT38 | USA: California, El Dorado Co. | 38.92°N, 119.94°W | Unvouchered | 193 449 | 52 (739.0) | 92.9 (53) | 36 733 | GQ478184
P. monticola | MONT49 | USA: Montana, Flathead Co. | 48.34°N, 113.93°W | Unvouchered | 188 214 | 50 (764.2) | 91.8 (59) | 35 893 | GQ478185
P. sibirica | SIBI03 | Russia: Kemorovo District | 55.40°N, 86.10°E | OSC 213880 | 947 216 | 108 (995.5) | 84.0 (62) | 97 547 | FJ899558
P. torreyana ssp. insularis | SBBG 65–187 | Santa Barbara Botanical Garden, CA, USA (grown from seed collected by Bob Haller from Santa Rosa Island) | 34.27°N, 119.42°W | Whittall. 2008.245 | 1 157 851 | 60 (1920.5) | 158.1 (89) | 107 977 (109 041) | FJ899564
P. torreyana ssp. torreyana | SBBG s.n. (‘pre-1937’) | Santa Barbara Botanical Garden, CA, USA (grown from seed collected by Bob Haller from La Jolla, CA) | 34.27°N, 119.42°W | Whittall. 2008.244 | 1 114 111 | 67 (1762.4) | 104.4 (77) | 104 432 (105 892) | FJ899563

*Average depth and median values reported only for those sites with ≥5× depth.
†Values in parentheses for P. torreyana include additional Sanger sequence.

Interspecific comparisons were also made, although these taxa are arguably conspecific (P. cembra and P. sibirica) or represent members of a species grade (P. flexilis and P. ayacahuite). Prior studies of chloroplast DNA show that divergence among these pairs of species is equivalent to conspecific comparisons in pines (Gernandt et al. 2005; Eckert & Hall 2006; Liston et al. 2007) and other gymnosperms (Little & Stevenson 2007). For example, P. cembra and P. sibirica show little morphological differentiation and have been considered conspecific (Shaw 1914; Meusel et al. 1965). Analyses of chloroplast microsatellites (Gugerli et al. 2001), chloroplast sequences (Liston et al. 2007), and nuclear gene sequences (Syring et al. 2007) reveal identical haplotypes in these species. Our samples (one per species) were collected from sites ∼4800 km distant. Pinus flexilis and P. ayacahuite represent geographic extremes of a species complex whose members differ primarily in cone dimensions and seed wing development. This species complex spans 35° of latitude, from Mexico (P. ayacahuite), across the southwestern USA (P. strobiformis) and northward into Canada (P. flexilis) (Fig. 1c). Our samples of P. flexilis and P. ayacahuite (one each) were collected at sites ∼2200 km distant. Finally, P. albicaulis (Fig. 1b) and northern populations of P. lambertiana are genetically and morphologically distinct, but they share nearly identical chloroplast haplotypes, possibly as a consequence of introgressive hybridization (Liston et al. 2007). Distribution maps of species (generated in ArcMap v9.3; ESRI) used digitized range maps of individual species (Critchfield & Little 1966;

Microread sequencing and genome assembly

DNA was extracted from fresh needles or seed megagametophyte tissue using the FastDNA Kit (Q-BIO Gene) or the DNeasy Plant Mini Kit (QIAGEN). For all samples but P. monticola, chloroplast genomes were amplified in 35 separate PCR reactions as previously reported (Cronn et al. 2008). In P. monticola, one-third of the chloroplast genome was amplified in 12 PCR reactions with primers 1F through 12R (Cronn et al. 2008). For each species, the PCR reactions were quantified, pooled into equimolar mixtures and converted into barcoded Illumina sequencing libraries (Cronn et al. 2008). Individual libraries were pooled into multiplex sequencing libraries ranging from 4× (for full chloroplast genomes) to 16× (partial P. monticola chloroplast genomes).

Cluster generation of adapter-barcoded libraries used 5 pmol, and produced 870 000–2 870 000 microreads per sample for complete genomes, and 188 000–787 000 microreads for partial genomes (P. monticola). After the removal of barcodes, microreads (33–37 bp) from all accessions except P. monticola were assembled with the de novo assemblers velvet v. 0.6 (Zerbino & Birney 2008) and edena v. 2.1.1 (Hernandez et al. 2008), using minimum depth filters of 5×, minimum contig lengths of 100 bp and hash lengths of 25 bp. Generally, assembled contigs ranged from several hundred to several thousand bp in length; between 100 and 300 contigs were produced per complete genome, and between 35 and 65 contigs were produced per partial genome (Table 1).

Genome assembly from de novo contigs followed a two-step process. De novo contigs were aligned to a reference chloroplast using codoncode v. 2.0.6 (Codoncode Corp.). The following reference sequences were used: P. ponderosa (GenBank FJ899555) for P. torreyana accessions; and P. koraiensis (GenBank AY228468) for P. albicaulis, P. ayacahuite, P. cembra, P. flexilis, P. lambertiana and P. sibirica accessions. Orphan contigs that failed to align to references were checked for chloroplast homology using BLASTN; where sequence coverage was lacking or where contig alignment failed due to indels, orphan contigs were manually inserted into the alignment. De novo assemblies from these two programs (velvet, edena) were nearly identical, but a slight increase in aligned de novo assembly length was gained through the use of both assemblers. A consensus sequence of aligned velvet and edena de novo contigs was made using BioEdit (Hall 1999). The terminal 30 bp of contig ends were also edited to match the reference sequence completely, as these regions often contained assembly error due to reduced sequencing depth at contig ends. The consensus sequence of aligned contigs was merged with the reference to form a ‘chimeric pseudoreference’, composed primarily of de novo sequence (typically >90%), and including a small proportion (<10%) of reference sequence where de novo sequence was missing. Original microreads from each accession were then re-mapped onto the pseudoreference using the reference-guided assembler RGA (Shen and Mockler), with a minimum depth of 2×, a maximum allowable error/mismatch of 0.033 and a 70% majority minimum for SNP acceptance. Pinus monticola sequences were assembled against an unpublished P. monticola chloroplast genome sequence (R. Cronn, unpublished) using RGA with these same parameters.

Genomes were aligned using mafft v. 6.240 (Katoh et al. 2005) with a gap opening penalty of 2–2.5 and a gap extension penalty of 0. Aligned sequences were annotated using DOGMA (Wyman et al. 2004) and the Chloroplast Genome Database. Initial quality checks of exon translations (to identify errors and frameshift/nonsense mutations) and spatial patterning of SNPs showed some regions with unexpectedly high divergence, and these were inferred as misassemblies arising from one or more of the following sources: (i) rare misassembly error from RGA; (ii) errors arising due to low sequencing depth near primers; and (iii) amplification of paralogous pseudogenes. In these rare instances, preference was given to de novo sequence assemblies. If the problematic region was not represented in de novo assemblies, or if unexpectedly high divergence was found across an entire region (exon or amplicon), the region was coded as missing. We observed that highly divergent regions were commonly associated with nucleotides flanking primer locations (±100 bp of the primer), and this appears to be related to low sequencing depth near primers; these regions were changed to N’s. Finally, due to the overlapping nature of our primers (Cronn et al. 2008), there was no way to unequivocally determine the sequence of primer regions, so primer sequences were changed to N’s. The net impact of these corrections is that true ‘hotspots’ of divergence are accepted in our study only if they are supported by both de novo and reference-guided assembly.
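The primer-masking step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function name `mask_primer_regions`, the 0-based half-open coordinate convention and the example coordinates are all assumptions made here for clarity.

```python
def mask_primer_regions(seq, primer_sites, flank=100):
    """Replace each primer site, plus `flank` bp on either side, with N.

    seq          -- assembled sequence as a string
    primer_sites -- iterable of (start, end) pairs, 0-based, end-exclusive
                    (a coordinate convention assumed for this sketch)
    flank        -- number of flanking bases to mask on each side
    """
    masked = list(seq)
    for start, end in primer_sites:
        lo = max(0, start - flank)          # clamp at the sequence start
        hi = min(len(seq), end + flank)     # clamp at the sequence end
        masked[lo:hi] = "N" * (hi - lo)     # same-length replacement
    return "".join(masked)
```

For a hypothetical 20-bp primer at positions 150–170 of a 300-bp sequence, the call `mask_primer_regions(seq, [(150, 170)])` masks positions 50–270 and leaves the rest untouched. Overlapping primer windows are handled naturally, because masking a base twice has no further effect.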

As noted below, chloroplast variation from P. torreyana was also evaluated by direct Sanger sequencing. Alignment of these sequences to Illumina-based assemblies identified 1064 bp (ssp. insularis) and 1460 bp (ssp. torreyana) of gaps that could be eliminated by merging these data. For the purpose of identifying false-positives and false-negatives, Sanger sequences were compared to assemblies derived only from Illumina microreads. Our final sequences submitted to GenBank, however, include the Sanger additions.

Pairwise comparisons of pine plastomes

In order to assess the distinctiveness of the Torrey pine plastome results, we compared P. torreyana to nearly complete chloroplast genome divergence in four other cases [P. lambertiana northern (N) vs. southern (S) haplotypes, P. lambertiana N vs. P. albicaulis, P. ayacahuite vs. P. flexilis, P. cembra vs. P. sibirica], and from 10 partial plastomes of P. monticola (∼39 kb). For these comparisons, all variable sites in initial assemblies were filtered for a minimum 25× coverage depth and 85% majority base call based on results of P. torreyana SNP validation (rationale for this minimum depth is provided below).
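The depth and majority filter described above can be illustrated with a short Python sketch. It assumes each site is represented as a string of the bases observed in the reads covering it; the function names `call_site` and `is_candidate_snp` are hypothetical and do not correspond to any option of the assembler used in the study.

```python
from collections import Counter


def call_site(bases, min_depth=25, min_majority=0.85):
    """Return the majority base at a site, or None if the site fails a filter.

    bases        -- string of read bases observed at one alignment position
    min_depth    -- minimum number of reads required (25x in the text)
    min_majority -- minimum fraction of reads agreeing with the call (85%)
    """
    depth = len(bases)
    if depth < min_depth:
        return None
    base, count = Counter(bases).most_common(1)[0]
    if count / depth < min_majority:
        return None
    return base


def is_candidate_snp(bases_a, bases_b, **filters):
    """True only if both samples pass the filters and their calls differ."""
    call_a = call_site(bases_a, **filters)
    call_b = call_site(bases_b, **filters)
    return call_a is not None and call_b is not None and call_a != call_b
```

Thresholds of this kind trade some false negatives for fewer false positives: a site covered by only 10 reads, for example, is rejected regardless of how consistent those reads are.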

Uncorrected pairwise distances between haplotypes were calculated for the entirety of the aligned sequences, and partitioned into synonymous plus silent sites (dS) vs. non-synonymous sites (dN). All distance estimates were calculated using mega4 (Tamura et al. 2007), with P-distances for comparisons of overall nucleotides, Jukes–Cantor estimates of dS and dN, and pairwise deletion of unshared sites. Estimates of error were determined using 500 bootstrap replicates. AMOVA was conducted using genalex v. 6 (Peakall & Smouse 2006) to examine hierarchical structure of genetic variation in P. monticola between two regions. Input data were pairwise distance matrices, and significance was assessed using 1000 permutations.
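For readers unfamiliar with these distance measures, the following Python sketch shows an uncorrected P-distance with pairwise deletion and the standard Jukes–Cantor correction, d = −(3/4) ln(1 − 4p/3). It is a simplified illustration rather than a reimplementation of the software used here: it works on whole aligned sequences and does not partition sites into dS and dN, and the function names are assumptions of this sketch.

```python
import math

VALID_BASES = set("ACGT")


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap, N or other ambiguity code
    are skipped (pairwise deletion).
    """
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in VALID_BASES and b in VALID_BASES:
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed proportion p."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor model")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)
```

At the very low divergences reported for pine plastomes, the corrected and uncorrected values are nearly identical, because the correction only becomes appreciable when multiple substitutions at the same site are likely.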

Sanger sequencing of variable sites in P. torreyana

For the two P. torreyana samples, variable sites were scrutinized based on the minimum number of microreads supporting the base call and the minimum base-call consistency to directly identify true SNPs and to estimate the rate of false-positive SNPs and false-negative SNPs. From this analysis, regions flanking 32 putative SNPs and 2 indels were examined by Sanger sequencing. Primers were developed to maximize the number of variable sites covered while limiting the amplification products to ∼1 kb each (primer sequences available from authors by request). PCR reactions were done in 20 μL reaction volumes containing: MgCl2 (2.5 mm), Taq PCR Buffer B (1×; Fisher Scientific), dNTPs (0.25 mm each), forward and reverse primers (1 μm each), Taq polymerase (2 units) and 50 ng of genomic DNA. Thermal cycling conditions were: 30 s denature at 92 °C, followed by 35 cycles of 8 s denature at 92 °C, 30 s annealing at 55–57 °C and 90 s extension at 72 °C. A final 10-min extension at 72 °C was followed by a 4 °C hold. PCR products were visualized on agarose gels and directly sequenced on an ABI 3730 (Applied Biosystems).

SNP genotyping in P. torreyana

For the SNPs confirmed with Sanger sequencing, we genotyped 167 trees (81 from mainland; 86 from island). All five variable sites overlapped with restriction enzyme recognition sites, yet in order to confidently determine genotypes, we developed a complementary dCAPs assay using a primer that introduced a restriction site into the allele that was not cut by the native restriction site (Neff et al. 1998). Using these genotyping primers, we amplified fragments from 121 to 203 bp following the aforementioned PCR protocol. Five to 10 μL of PCR product were digested with 5 units of restriction enzyme for 5 h and assayed on agarose gels. Cut vs. uncut fragments for each SNP differed by 21–32 bp.

Divergence dating

Although there is substantial error associated with coalescent approaches to estimating divergence times (Graur & Martin 2004; Morrison 2008), these analyses can be informative when comparing recently diverged taxa with similar mutation rates and generation times. We estimated approximate divergence dates for intraspecific haplotype pairs and average divergence of 10 haplotypes for P. monticola. For calibration, we used chloroplast-specific mutation rates estimated for Pinus (Willyard et al. 2007; Gernandt et al. 2008). These prior studies reported a range of mutation rates based on slightly different fossil calibrations. For simplicity, we used the most recent estimates for divergence of hard and soft pines (72–87 Ma; Gernandt et al. 2008; Willyard et al. 2007), and calibrated mutation rates at the midpoint of this estimate (79.5 Ma) with a 4-Myr standard deviation. Assuming a lognormal distribution (Morrison 2008), 95% confidence intervals include the estimated divergence dates of both recent studies (72.1 Ma, 87.9 Ma).

Under these assumptions, we calculate the mean silent divergence rate to be 0.24 × 10⁻¹⁰ silent substitutions per site per year (95% CI = 0.890–5.371 × 10⁻¹⁰). To include error in this estimate, we assumed that error in divergence dates is lognormally distributed (Morrison 2008). Under assumptions of the neutral model, the absolute per-year mutation rate (μ) for a haploid organelle is represented as:




where Tdiv is the time since species divergence (measured as absolute years), d is pairwise divergence between haplotypes, μ is the mutation rate and Ne is the ancestral effective population size (Kimura 1983). Divergence dates were estimated by Monte-Carlo simulation, using lognormally distributed mutation rates (0.24 × 10⁻¹⁰; 95% CI = 0.890–5.371 × 10⁻¹⁰), normally distributed silent (dS) genetic distances and errors, and values of Ne that span a reasonable range from 100 to 5000. Only results from Ne = 1000 are presented, as varying Ne over this range had minimal impact on estimated dates. Divergence dates are reported as means, and 95% confidence intervals are approximated from the 2.5th and 97.5th percentiles of 10 000 simulations.
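A Monte-Carlo procedure of this general kind can be sketched as follows. Several points are assumptions of the sketch and are not taken from the text: the relationship Tdiv = d/(2μ) − Ne is one common neutral-model form for a haploid genome and may differ from the equation used in the study; the lognormal rate distribution is parameterized directly from the reported 95% confidence bounds; and the distance and standard error in the usage note are hypothetical placeholders.

```python
import math
import random
import statistics


def simulate_divergence_dates(d_mean, d_se, rate_lo, rate_hi,
                              ne=1000, n_sims=10_000, seed=1):
    """Return (mean, 2.5th percentile, 97.5th percentile) of divergence dates.

    d_mean, d_se     -- silent-site distance and its standard error
    rate_lo, rate_hi -- 95% bounds on the per-site, per-year mutation rate
    ne               -- ancestral effective population size
    Assumed model (not from the source): T = d / (2 * rate) - ne.
    """
    rng = random.Random(seed)
    # Lognormal parameters recovered from the 95% bounds on the rate.
    log_mu = (math.log(rate_lo) + math.log(rate_hi)) / 2.0
    log_sigma = (math.log(rate_hi) - math.log(rate_lo)) / (2.0 * 1.96)
    dates = []
    for _ in range(n_sims):
        rate = rng.lognormvariate(log_mu, log_sigma)
        d = max(0.0, rng.gauss(d_mean, d_se))   # distances cannot be negative
        dates.append(max(0.0, d / (2.0 * rate) - ne))
    dates.sort()
    lower = dates[int(0.025 * n_sims)]          # approximate percentiles
    upper = dates[int(0.975 * n_sims) - 1]
    return statistics.mean(dates), lower, upper
```

A call such as `simulate_divergence_dates(1e-4, 2e-5, 0.890e-10, 5.371e-10)` uses the rate bounds quoted above with a hypothetical distance of 1 × 10⁻⁴ ± 2 × 10⁻⁵. Because Ne enters only as an additive term of a few thousand, it has little effect when d/(2μ) is on the order of hundreds of thousands of years, in line with the observation that varying Ne had minimal impact.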


Microread sequencing and genome assembly

When barcoded samples from these experiments were parsed, we retrieved an average of 1 336 085 microreads for each full genome and 379 829 microreads for partial Pinus monticola genomes (Table 1). By aligning de novo contigs onto reference genomes, we determined that de novo assemblies consistently were interrupted at priming sites (35 for whole genomes; 12 for partial genomes) and low complexity single nucleotide repeats; this phenomenon is evident in depth plots for genomes (Fig. 2), and is discussed in greater detail in Cronn et al. (2008). In addition, alignment of de novo contigs revealed no detectable structural rearrangements. RGA analysis resulted in an average of 63.1 contigs per full genome, with an average length of 2313 bp per contig. Assemblies for the partial P. monticola genome were proportionately less abundant (mean = 47.6 contigs) and shorter (mean = 846.5 bp). Full genome sequences produced by the pseudoreference-guided assembly process include an average of 104 336 bp (88.9%) for full genomes, and 36 887 bp (93.4%) for partial genomes.

Figure 2.

 Assembly, sequence depth and variable sites for aligned P. torreyana chloroplast genomes. Black circles are confirmed SNPs, grey circles are confirmed false-positives and indels, open circles are unconfirmed indels. Numerals above circles indicate multiple polymorphisms. In the sequence alignment, exons are blue, introns are light blue and intergenic regions are pink. The matK and 16S rRNA regions were not obtained in either sample; the rbcL region was amplified in ssp. insularis, a putative non-plastid pseudogene was amplified in ssp. torreyana. In the sequence density plots, blue lines (ssp. torreyana, mainland) and red lines (ssp. insularis, island) indicate sequencing depth at each position.

Confirming SNPs and false-positives in P. torreyana

Initial pairwise comparisons of chloroplast genomes revealed a surprisingly large number of polymorphic sites, a finding seemingly inconsistent with expectations of a conservative chloroplast divergence rate. For example, analysis of 104 432 bp from paired samples of Torrey pine revealed 32 putative SNPs (Table 2; Fig. 2) that spanned a range of sequencing depth (Fig. 2). A plot of the majority base frequency vs. the sequencing depth for variable positions (Fig. 3) showed that sequencing depth was generally low among these sites (geometric mean = 18.9). At most putative SNP sites, only two of the possible four nucleotides were found (Table 2) and the minority nucleotide represents the ancestral state (P. ponderosa). We attribute this bias to either sequencing errors in the 3-bp barcode resulting in the incorrect assignment of microreads (Cronn et al. 2008) or to potential sample cross-over between adjacent lanes of the Illumina flow cell during the cluster generation process.

Table 2.   Read densities for all variable sites detected in Pinus torreyana and Sanger sequencing validation
Position* | Ancestral | ssp. torreyana (mainland) | ssp. insularis (island)
  1. The positions of all variable sites are shown, with the five validated SNP positions indicated in bold italic type; the remaining positions are false-positives. Positional base calls are shaded proportionally to read depth; majority base calls for a position are also indicated in bold. The ancestral nucleotide state is represented by the sequence of Pinus ponderosa.

  2. *Position in alignment of P. torreyana and P. ponderosa assemblies.

  3. †Site 50203 was polymorphic in the original sequence assemblies, but this is not supported in the read density analysis (nor Sanger sequencing).

Figure 3.

 A comparison of minimum read density and minimum base-call consistency was used to predict SNPs and false-positive SNPs from the variable sites identified in comparing two Pinus torreyana plastomes. Black circles are confirmed SNPs, and the identity of each region is noted; grey circles indicate confirmed false-positives.

To determine whether these sites were false-positives arising from low sequencing depth, we resequenced these 32 sites from both Torrey pine samples using standard Sanger sequencing. The resulting 23.1 kb of sequence (11 628 bp for mainland; 11 528 bp for island) validated 5 SNPs. These were located in the trnV–trnH spacer (127× depth minimum, 98% consistency), trnS–psbB spacer (127× min, 98%), ycf1 coding region (a replacement substitution; 86× min, 99%), rps4–ycf12 spacer (61× min, 97%) and 23S rRNA (10× min, 83%) (Fig. 3). With the exception of 23S rRNA, these positions showed the highest combined depths (all >60×) and consistency (>95%). All remaining variable positions that were confirmed as false-positives showed generally low average sequencing depth and read consistency (with means of 21× and 76% respectively). The 23.1 kb of Sanger sequence used to validate SNPs is also useful in estimating the false-negative rate for regions that are readily accessible for sampling by short read and Sanger sequencing. This additional Sanger sequence differed from Illumina base calls at seven positions that were fixed in both subspecies. At present, we do not know the source for this systematic bias, but it is important to recognize that these differences are rare (seven sites out of 11 528 bp), consistent and do not result in novel SNPs. After confirmation of SNPs through Sanger sequencing, 99.995% of the P. torreyana genome was found to be identical between the two subspecies. Based on the results of this detailed screening, we used similar filtering criteria (depth ≥25×; consistency ≥85%) for all subsequent analyses where Sanger validation was unavailable.
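The filtering rule described above (depth ≥25× and consistency ≥85% in both samples) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names and the representation of a site as a string of base calls are assumptions made for the example.

```python
from collections import Counter

def call_site(bases, min_depth=25, min_consistency=0.85):
    """Summarise one alignment column in one sample.

    bases: iterable of base calls from the reads covering the site
    (hypothetical input format). Returns
    (majority_base, depth, consistency, passes_filter).
    """
    counts = Counter(bases)
    depth = sum(counts.values())
    if depth == 0:
        return None, 0, 0.0, False
    majority_base, n_majority = counts.most_common(1)[0]
    consistency = n_majority / depth
    passes = depth >= min_depth and consistency >= min_consistency
    return majority_base, depth, consistency, passes

def is_filtered_snp(reads_a, reads_b, **thresholds):
    """A site counts as a SNP only if both samples pass the
    depth/consistency filter and their majority bases differ."""
    a = call_site(reads_a, **thresholds)
    b = call_site(reads_b, **thresholds)
    return a[3] and b[3] and a[0] != b[0]

# Well-supported site: 100x at 98% vs. 127x at 100%
high = is_filtered_snp("A" * 98 + "G" * 2, "G" * 127)
# Poorly supported site: 12x at ~83% in one sample
low = is_filtered_snp("A" * 10 + "G" * 2, "G" * 30)
print(high, low)
```

Requiring both samples to pass is equivalent to thresholding the minimum depth and minimum consistency across the pair, as in Fig. 3. Note that the validated 23S rRNA SNP (10×, 83%) would be rejected by these thresholds, so the filter trades some sensitivity for fewer false-positives.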

Relative and absolute chloroplast genome divergence in pines

Using the filtering criteria identified above, ‘intraspecific’ pairwise differences between chloroplast genomes of widespread pine taxa ranged from zero differences across 75 195 bp in P. cembra vs. P. sibirica, to a high of 382 differences across 88 768 bp within P. lambertiana (Table 3; Fig. 4). In general, variable sites were unevenly dispersed across the genome, with no mutational ‘hot-spots’ apparent across all comparisons (Fig. 4). Replacement substitutions were found in all comparisons except P. cembra vs. P. sibirica and the partial genomes of P. monticola. As expected for this conservative genome, silent substitutions outnumbered replacement substitutions 3.8:1 across all positions. Comparison of P. torreyana to other pairwise calculations shows that the average pairwise distance for P. torreyana (0.000047) is higher than that of the identical sequences from P. cembra and P. sibirica, approximately equal to the average pairwise divergence within P. monticola (0.000050), and substantially lower than the divergence for the P. ayacahuite–P. flexilis comparison (0.000165; Table 3).
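For the two-genome comparisons, the all-sites distances in Table 3 agree with a simple proportion of differing sites (filtered SNPs divided by the average number of bp compared). The sketch below reproduces that arithmetic; treating the distance as an uncorrected proportion is an assumption, and the function name is illustrative.

```python
def pairwise_distance(differences, sites_compared):
    """Uncorrected proportion of differing sites (p-distance)."""
    if sites_compared <= 0:
        raise ValueError("sites_compared must be positive")
    return differences / sites_compared

# (filtered SNPs, average bp compared), taken from Table 3
comparisons = {
    "P. torreyana": (5, 105308),
    "P. lambertiana N vs. S": (382, 88768),
    "P. lambertiana N vs. P. albicaulis": (12, 106058),
    "P. ayacahuite vs. P. flexilis": (17, 102920),
    "P. cembra vs. P. sibirica": (0, 75195),
}

for name, (snps, bp) in comparisons.items():
    print(f"{name}: {pairwise_distance(snps, bp):.6f}")
```

The P. monticola value is an average over 10 haplotypes and therefore cannot be recovered from the SNP count alone.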

Table 3.   Divergence statistics for complete and partial chloroplast genomes in Pinus

| Chloroplast genome comparison | P. torreyana | P. monticola N–P. monticola S | P. lambertiana N–P. lambertiana S | P. lambertiana N–P. albicaulis | P. ayacahuite–P. flexilis | P. cembra–P. sibirica |
| --- | --- | --- | --- | --- | --- | --- |
| Alignment length (bp) | 120 362 | 39 150 | 114 000 | 117 504 | 117 546 | 117 228 |
| Filtered SNPs | 5 | 7 | 382 | 12 | 17 | 0 |
| Pairwise distance, all sites (SE) | 0.000047 (0.000015) | 0.000050 (0.000017) | 0.004303 (0.000148) | 0.000113 (0.000031) | 0.000165 (0.000041) | 0.0 (0.0) |
| Average bp compared | 105 308 | 35 535 | 88 768 | 106 058 | 102 920 | 75 195 |
| Pairwise distance, replacement sites (SE) | 0.000029 (0.000026) | 0.000000 (0.000000) | 0.002528 (0.000247) | 0.00008 (0.000052) | 0.000079 (0.000037) | 0.0 (0.0) |
| Average bp compared | 34 547 | 12 727 | 32 432 | 37 636 | 37 779 | 29 277 |
| Pairwise distance, silent sites (SE) | 0.000057 (0.000022) | 0.000077 (0.000025) | 0.005344 (0.000264) | 0.000132 (0.00003) | 0.000215 (0.000061) | 0.0 (0.0) |
| Average bp compared | 70 603 | 22 808 | 56 327 | 68 263 | 65 209 | 45 949 |
| Estimated Tdiv (LCL, UCL)* | 160 (41.5, 433) | 214 (61.3, 547) | 14 881 (5353, 33 172) | 369 (120, 884) | 598 (182, 1448) | <60.6 (<3.4, 182) |

  1. Comparisons reflect intraspecific divergence in P. torreyana, P. monticola and P. lambertiana, and divergences that reflect near-conspecific comparisons (P. ayacahuite–P. flexilis; P. cembra–P. sibirica). The large values in the P. lambertiana N–S comparison result from introgression with a P. albicaulis-like chloroplast genome donor; for this reason, comparisons within P. lambertiana and between P. lambertiana N and P. albicaulis are shown. Standard errors (SEs) were determined using 500 bootstrap replicates.

  2. *Tdiv is reported in thousands of years, with lower confidence (LCL) and upper confidence levels (UCL) noted. To calculate Tdiv for P. cembra–P. sibirica, we assumed an upper bound of one synonymous substitution for these genomes.

Figure 4.

 Location of chloroplast genome SNPs in pairwise and population comparisons. Outer track shows the location of protein coding (blue), tRNA (red) and rRNA (orange) genes in the Pinus chloroplast genome; scale is in kbp. Inner tracks show the location of filtered SNPs for each comparison: (a) Pinus lambertiana N vs. S; (b) Pinus ayacahuite vs. Pinus flexilis; (c) Pinus albicaulis vs. Pinus lambertiana N; (d) Pinus monticola populations (partial genomes, positions 1–39 000); (e) Pinus torreyana island vs. mainland.

Based on previously calibrated silent substitution rates for pine chloroplast genes, we estimated the divergence times between paired haplotypes in four comparisons and the average haplotype divergence date for 10 haplotypes within P. monticola (Table 3). Mainland and island Torrey pine plastomes diverged c. 160 000 years ago. In the absence of detectable divergence between P. cembra and P. sibirica, we estimated a maximum divergence date for these individuals by assuming that they differed by at most one substitution across the range of sampled silent sites (45 949 bp); this places the mean estimated divergence date at <60 000 years ago. The remaining estimates ranged from c. 145 000 to 598 000 years ago, placing the divergence of these haplotype pairs in the mid- to upper-Pleistocene. At the far extreme, the divergent haplotypes residing within P. lambertiana date to a far more ancient divergence of c. 14.8 Ma.
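The point estimates in Table 3 are consistent with the usual relationship between pairwise silent-site distance d and divergence time, T = d / (2μ), where μ is the silent substitution rate per site per year. The sketch below is illustrative only: the rate of 1.8 × 10⁻¹⁰ is an assumption, chosen because it approximately reproduces the tabulated point estimates, and it is not a value stated in this section. Confidence limits are not modelled.

```python
# Assumed silent substitution rate (substitutions/site/year).
# Hypothetical value that approximately reproduces Table 3.
ASSUMED_SILENT_RATE = 1.8e-10

def divergence_time_years(silent_distance, rate=ASSUMED_SILENT_RATE):
    """Divergence time from pairwise silent-site distance.

    The factor of 2 accounts for substitutions accumulating
    independently along both lineages since their common ancestor.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return silent_distance / (2.0 * rate)

# Silent-site distances from Table 3
silent_distances = {
    "P. torreyana": 0.000057,
    "P. monticola N vs. S": 0.000077,
    "P. lambertiana N vs. S": 0.005344,
    "P. lambertiana N vs. P. albicaulis": 0.000132,
    "P. ayacahuite vs. P. flexilis": 0.000215,
    # Upper bound: at most one substitution in 45 949 silent sites
    "P. cembra vs. P. sibirica (upper bound)": 1 / 45949,
}

for name, d in silent_distances.items():
    print(f"{name}: {divergence_time_years(d) / 1000:.0f} thousand years")
```

Under this assumed rate, the values fall within about 2% of the Tdiv point estimates reported in Table 3.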

Spatial differentiation in pine plastomes

In this study, we are able to provide estimates of genome-wide geographic differentiation for two of the examined species, P. torreyana and P. monticola. Restriction enzyme genotyping of 167 mainland and island Torrey pine trees demonstrated that the 5 validated SNPs present in our two exemplars represented fixed differences between these populations. Based on these results, we predict that the mainland and island populations are distinct and fully differentiated in their plastomes. In contrast, our sample of chloroplast genomes from 10 P. monticola individuals resulted in 9 distinct haplotypes. Based on prior studies of nuclear genetic variation in this species (Steinhoff et al. 1983), we explicitly divided our sample into ‘northern’ and ‘southern’ geographic groups (Fig. 1d), and examined chloroplast variation using AMOVA. This analysis shows that haplotype variation does not follow the pattern of nuclear differentiation, as φPT (the partitioning of variance among groups, relative to total variance) was not significant for these chloroplast genomes (0%, P = 0.861).
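To make the φPT statistic concrete, the sketch below implements the standard one-level AMOVA variance decomposition on toy haplotypes. It is an illustration under stated assumptions, not the authors' analysis: the haplotype strings are invented, negative among-group variance is clamped to zero, and the permutation test used to obtain a P value is omitted.

```python
def hamming(a, b):
    """Number of differing sites; used as the squared distance between haplotypes."""
    return sum(x != y for x, y in zip(a, b))

def phi_pt(haplotypes, groups):
    """One-level AMOVA: among-group variance / total variance."""
    n = len(haplotypes)
    labels = sorted(set(groups))
    g = len(labels)
    if g < 2 or n <= g:
        raise ValueError("need at least 2 groups and more samples than groups")
    d = [[hamming(a, b) for b in haplotypes] for a in haplotypes]

    # Total and within-group sums of squares from pairwise distances
    ss_total = sum(d[i][j] for i in range(n) for j in range(i + 1, n)) / n
    ss_within, sizes = 0.0, []
    for lab in labels:
        idx = [i for i, x in enumerate(groups) if x == lab]
        sizes.append(len(idx))
        ss_within += sum(d[i][j] for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within

    ms_among = ss_among / (g - 1)
    ms_within = ss_within / (n - g)
    n0 = (n - sum(s * s for s in sizes) / n) / (g - 1)  # effective group size

    var_among = max(0.0, (ms_among - ms_within) / n0)  # clamp negatives to zero
    total = var_among + ms_within
    return var_among / total if total > 0 else 0.0

# Toy data: fixed differences between groups, none within
fixed = phi_pt(["AAAAA"] * 3 + ["GGGGG"] * 3, ["mainland"] * 3 + ["island"] * 3)
# Toy data: the same haplotypes occur in both groups
mixed = phi_pt(["AAAAA", "GGGGG", "AAAAA", "GGGGG"], ["N", "N", "S", "S"])
print(fixed, mixed)
```

The first toy case mirrors the Torrey pine pattern (all variation among groups), and the second mirrors a pattern with no geographic partitioning.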


Recent dramatic improvements in DNA sequencing make it possible for simple genomes to be completely sequenced and compared in population and evolutionary genomics studies. In this analysis, we sequenced multiple barcoded chloroplast genomes simultaneously (four to six complete genomes, 16 partial genomes per lane), and have compared pairwise divergences of genomes reflective of intraspecific comparisons (Pinus lambertiana, P. monticola, P. torreyana) or effectively intraspecific comparisons (P. ayacahuite–P. flexilis, P. cembra–P. sibirica, P. albicaulis–P. lambertiana). These intraspecific comparisons are based on 1.3 million aligned bases, and they add substantially to our understanding of the magnitude of intraspecific chloroplast genome variation in conifer trees.

One of the striking results to emerge from our analysis of full chloroplast genomes is that genome-wide sequence variation is very low within pine species. In all instances except one (P. lambertiana; discussed next), two selected chloroplast genomes from pine species showed fewer than 18 differences across the span of their full genome. This value is substantially lower than a comparison of two samples representing unique varieties of Oryza sativa, which showed 72 SNPs (Tang et al. 2004). As intraspecific chloroplast genome sequencing is in its infancy, we do not know if low divergence is an outcome specific to our sampling, a general condition for conifers (perhaps attributable to low absolute mutation rate, combined with a recent population expansion) or common throughout land plant chloroplast genomes. For the species we examined, it is clear that accurate estimates of nucleotide divergence and genealogical relationships will require full – not partial – genomes for robust resolution, and even this may be insufficient.

A second important finding from this analysis is that mutational variability across the genome is sufficiently heterogeneous that divergence estimates from a small number of loci could be misleading. For example, based on complete genomes, we find that P. torreyana shows the lowest pairwise divergence among the comparisons examined. In contrast, if we had chosen a 15 000 bp contiguous region spanning nucleotide positions 40 000–55 000 for our analysis (e.g. Fig. 4), we would have reached a different conclusion, namely, that the two samples of P. torreyana have greater pairwise divergence than samples from P. ayacahuite–P. flexilis and P. albicaulis–P. lambertiana N. The uneven distribution of variation across closely related chloroplast genomes argues strongly for a plastome-scale approach to intraspecific evolutionary studies, an approach now feasible with Next-Gen sequencing.

Organismal insights from pairwise chloroplast genome divergences

A key motivation for this analysis was to determine whether mainland and island populations of P. torreyana showed detectable chloroplast genome divergence, and to frame that divergence in the context of more widespread species and species complexes. As noted, traditional molecular approaches to distinguish the remaining populations of P. torreyana (a species distributed across two locations with a total range of <30 km²) have been largely inconclusive due to the absence of molecular variation in this species (Ledig & Conkle 1983; Provan et al. 1999). One study of 59 allozymes identified two variable loci in a survey of 157 trees representing the island and mainland populations (Ledig & Conkle 1983). These polymorphisms represented fixed differences between the island and mainland populations, a finding consistent with the complete partitioning of plastome variation reported herein. The unusual partitioning of plastome variation in P. torreyana is consistent with subspecific recognition of these two disjunct populations (Haller 1986).

In the absence of other comparisons, it would have been reasonable to conclude that the low divergence observed within P. torreyana was related to its restricted range or its low census and (presumably) low effective population size. From these initial intraspecific comparisons, however, we have learned that chloroplast genome divergence within many pine species and species complexes is low, even for geographically widespread species (Table 3). For example, P. monticola is known to consist of geographically differentiated populations (Fig. 1d) based on data from 12 isozyme loci (Steinhoff et al. 1983) and nuclear sequence data (J. Syring, unpublished). This species has a range of ∼370 000 km² (Fig. 1d), spanning 17° of latitude and 13° of longitude, and occurs in ecologically disparate regions (e.g. northern Rocky Mountains of British Columbia, serpentine barrens of the Klamath-Siskiyous, the southern Sierra Nevada of California) from sea level to 3350 m in elevation (Mirov 1967). Despite the larger range and census size of P. monticola (perhaps 2–3 orders of magnitude larger than those of P. torreyana), pairwise chloroplast genome divergence values for these two species are nearly equal (0.000047 for P. torreyana, 0.000050 for P. monticola; Table 3). Perhaps more sobering, P. cembra and P. sibirica have a combined range that is greater than 5 million km², with our samples separated by 4800 km. Sequencing of 75 kbp turned up no detectable differences between these two haplotypes, providing us with a clear lower bound for expected pairwise divergence. The low intraspecific divergence uncovered in P. torreyana appears not to be solely attributable to its rarity, as this feature appears to be the norm for Pinus (Table 3).

Based on our sample, pairwise divergence of P. ayacahuite–P. flexilis (0.000165; Table 3) sets a realistic expectation for the upper bound of intraspecific comparisons in Pinus. This species complex is distributed from southern Alberta, Canada south to Honduras, with our samples collected from sites 2200 km apart (Fig. 1c). Analysis revealed a total of 17 SNPs across a comparison of 103 kbp, or ∼1 SNP per 6 kbp. Even at this upper bound of intraspecific divergence, this comparison highlights the daunting challenge of locating SNPs for use in population genetic analysis, and reinforces the importance of massively parallel sequencing efforts. Figure 4 indicates that there is not a single gene, intron or spacer region found in our analyses that would serve as a ‘marker locus’ for future studies in Pinus, as SNPs are spaced irregularly across the chloroplast genome.

Although pairwise genome divergences for our chosen species pairs are comparable, the partitioning of genetic variation is uniquely structured by species. Genotyping in P. torreyana indicates that the 5 validated SNPs are fixed across populations, yielding estimates of complete differentiation (φPT = 1.0) for these populations. In contrast, our sampling of haplotype diversity in 10 accessions of P. monticola appears to show no geographic partitioning, with a calculated φPT of zero. Geographic subdivision of P. lambertiana into northern and southern chloroplast haplogroups was recently documented by Liston et al. (2007). This research found two major haplotypes that shared 10 fixed differences across a narrow geographic zone 150 km in width (demarcated in Fig. 1b), relative to the 1600 km latitudinal range of the species. Based on Liston et al.’s (2007) data (2300 bp of sequence from matK and the trnG intron), the preponderance of the variation was found between geographic groups (φPT = 0.98; P = 0.003). Therefore, we have documented cases of narrowly endemic pines with high plastid differentiation (P. torreyana), widespread pines with high plastid differentiation (P. lambertiana; Liston et al. 2007) and widespread pines with essentially no plastid differentiation (P. monticola). These three examples demonstrate the impact that each unique history has had on these species and genomes.

Genome-scale data continue to show the uniqueness of P. lambertiana. The pairwise divergence between the northern and southern populations is 26-fold greater than the next highest comparison (P. ayacahuite–P. flexilis). Prior phylogenetic analyses confidently placed the northern haplotype in a clade that includes P. albicaulis (whitebark pine) and other East Asian white pines, and the southern haplotype in a clade with North American white pines (Liston et al. 2007; Parks et al. 2009). Liston et al. (2007) interpreted this phylogeographic pattern as a case of chloroplast introgression from P. albicaulis into the northern population of P. lambertiana. In this case, the high pairwise divergence value is more indicative of an interspecific than an intraspecific comparison, and suggests that caution is warranted if large haplotypic divergences are uncovered in Pinus. Our estimate for the time of this introgression event was c. 370 000 years bp (Table 3). Pairwise divergence between the northern P. lambertiana haplotype and P. albicaulis is 0.000113, a value within the range of our other intraspecific comparisons.

To summarize, low plastome variation in Pinus species appears to be commonplace. Even in P. monticola, where we uncovered 9 unique haplotypes in 10 individuals, inter-population level diversity averaged 1 SNP per 20 kbp for each pairwise comparison. Where deviations from the expectation of low plastome diversity occur, as in the case of P. lambertiana, further investigation into the cause is warranted. Although there appear to be narrow limits on plastome diversity, the hierarchical structure of that genetic diversity should be anticipated to vary according to the unique history of each species. In this context, there is nothing unusual about the haplotypic diversity of P. torreyana. On the one hand, the fixed differences found between the mainland and Santa Rosa Island populations support the uniqueness of these populations, and suggest that both populations should be a part of any long-term conservation plan. On the other hand, the low intraspecific plastome diversity is a trait that is shared with much more common and geographically widespread species.

What is next in ‘Next-Generation’ organelle sequencing?

A significant question remaining to be addressed in intraspecific organellar genome sequencing is the congruence between estimates of diversity and differentiation from nucleotides and microsatellites. As noted, chloroplast microsatellites have been successfully used to address many population and landscape level questions (Provan et al. 2001; Petit et al. 2005; Ebert & Peakall 2009). This is particularly true for conifers, where microsatellite-based estimates of haplotype variation can be striking, and as many as 235 haplotypes have been recorded from 311 individuals (Afzal-Rafii & Dodd 2007). This extreme variability seems unusual in light of the apparent quiescence of the remainder of the genome, but these differences could be expected given the magnitude of difference in positional mutation rates of nucleotides (0.890–5.371 × 10⁻¹⁰) and microsatellites (3.2–7.9 × 10⁻⁵; Provan et al. 1999). The extreme variability in microsatellites, combined with length constraints, has led many to suspect that genealogical estimates may be obscured through mutational ‘homoplasy’ (Estoup et al. 2002). The methods we used in our analysis are poorly suited to directly comparing sequence and microsatellite variation, because long single nucleotide repeats are difficult to assemble with short microreads (Cronn et al. 2008). With the development of paired-end sequencing and longer sequence reads, direct comparison of sequence- and microsatellite-based population genetic and genealogical estimates should be a high priority to evaluate the consistency of these methods.


We thank Bob Haller and Dieter Wilken for providing Torrey pine needles from the Santa Barbara Botanic Garden. Brian Knaus, Darren Smith, Stephanie Kim, Tim Butler, Mariah Parker-DeFeniks, Ugi Daalkhaijav, Sarah Sundholm, Angela Rodriguez, David Gernandt, Rongkun Shen, Todd Mockler, Mark Dasenko, Scott Givan, Chris Sullivan, Ismael Grachico, Jon Laurent and John Reeves provided critical assistance. Support provided by the Santa Barbara Botanic Garden and NSF IPY No. 0733078 to JBW, the OSU College of Science Venture Fund to AL, NSF ATOL No. 0629508 to AL and RC, and USFS PNW Research Station.

Conflicts of interest

The authors have no conflict of interest to declare and note that the sponsors of the issue had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.