Author for correspondence: Gancho T. Slavov Tel: +44 1970 823094 Email: firstname.lastname@example.org
•Plant population genomics informs evolutionary biology, breeding, conservation and bioenergy feedstock development. For example, the detection of reliable phenotype–genotype associations and molecular signatures of selection requires a detailed knowledge about genome-wide patterns of allele frequency variation, linkage disequilibrium and recombination.
•We resequenced 16 genomes of the model tree Populus trichocarpa and genotyped 120 trees from 10 subpopulations using 29 213 single-nucleotide polymorphisms.
•Significant geographic differentiation was present at multiple spatial scales, and range-wide latitudinal allele frequency gradients were strikingly common across the genome. The decay of linkage disequilibrium with physical distance was slower than expected from previous studies in Populus, with r² dropping below 0.2 within 3–6 kb. Consistent with this, estimates of recent effective population size from linkage disequilibrium (Ne ≈ 4000–6000) were remarkably low relative to the large census sizes of P. trichocarpa stands. Fine-scale rates of recombination varied widely across the genome, but were largely predictable on the basis of DNA sequence and methylation features.
•Our results suggest that genetic drift has played a significant role in the recent evolutionary history of P. trichocarpa. Most importantly, the extensive linkage disequilibrium detected suggests that genome-wide association studies and genomic selection in undomesticated populations may be more feasible in Populus than previously assumed.
Forest tree species are assumed to have large effective population sizes because of their high levels of genetic diversity, weak interpopulation differentiation for neutral loci and rapid decay of linkage disequilibrium (LD) with physical distance (Hamrick et al., 1992; Neale & Ingvarsson, 2008; Neale & Kremer, 2011). However, because most previous population genetic studies were based on small numbers of statistically independent loci (González-Martínez et al., 2006b; Savolainen & Pyhäjärvi, 2007), relatively little is known about the genome-wide patterns of allele frequency variation, LD and recombination in these ecologically and economically important organisms. Detailed information on these patterns is indispensable for understanding the evolutionary history of tree populations and for designing, analyzing and interpreting data from association studies and selection scans. Thus, in addition to providing unprecedented opportunities for marker-assisted breeding through genomic selection (i.e. the prediction of genetic value from high-density single-nucleotide polymorphism (SNP) genotype data; Meuwissen & Goddard, 2010), genome-wide surveys of DNA polymorphism can yield novel biological insights.
Because of their role as foundation species in a number of ecosystems (Whitham et al., 2006), wide geographic distribution, rapid growth and potential as a bioenergy crop (Rubin, 2008), species of Populus have become models for tree genomics (Jansson et al., 2010) and have well-developed molecular resources, including a whole-genome sequence (Tuskan et al., 2006) integrated with genetic and physical maps (Kelleher et al., 2007). Several association studies in P. trichocarpa, P. deltoides, P. nigra and P. tremula are targeting traits related to biomass productivity, cell wall characteristics and climatic adaptation (Stanton et al., 2010).
Black cottonwood (Populus trichocarpa), which is one of the fastest growing species of the genus, inhabits riparian areas in western North America from Baja California to Alaska (DeBell, 1990). To characterize genome-wide patterns of allele frequency variation, LD and recombination in P. trichocarpa, we resequenced the genomes of 16 trees (Fig. 1a, Supporting Information Table S1) selected as a ‘range-wide’ sample (i.e. spanning a large proportion of the species’ native range). Results based on genome resequencing data were corroborated by extensive sequence and SNP data generated using traditional technologies (e.g. genotypic data for 29 213 SNPs in 120 trees from a similar geographic area, Fig. 2). In addition, we used information on DNA sequence features and data from whole-genome methylated DNA immunoprecipitation (MeDIP) resequencing to identify the strongest correlates of fine-scale recombination rates estimated using our genome resequencing data.
Materials and Methods
Plant materials and DNA extraction
We assembled a clonally replicated population of 1100 black cottonwood (Populus trichocarpa Torr. & Gray) genotypes that were established in multiple field trials. The vast majority of these genotypes (n = 1052) were sampled in the core of the range of P. trichocarpa, west of the Cascade Mountains in northern Oregon, Washington and southern British Columbia. This set of genotypes is expected to be appropriate for association mapping because P. trichocarpa grows best in this area (DeBell, 1990) and is characterized by high levels of molecular and phenotypic variation, with relatively weak interpopulation differentiation for neutral markers (Weber & Stettler, 1981; Weber et al., 1985). An additional set of 48 genotypes (22 from northern California, 15 from the Mid-Willamette Valley in central Oregon and 11 from central British Columbia) was included to allow a preliminary assessment of levels of molecular and phenotypic variation on a broader geographic scale. To generate genome resequencing data, we selected 16 P. trichocarpa trees spanning the entire range of our experimental population and a large proportion of the range of P. trichocarpa (Fig. 1a). Five to ten micrograms of high-molecular-weight DNA was extracted from 3–5 g of roots or leaves from plants grown in the glasshouse or in hydroponic systems using a protocol available at http://my.jgi.doe.gov/general/protocols/Populus_nuclear_DNA_extraction.doc. To generate corroborating Infinium SNP data (described below), we selected 12 trees from each of 10 subpopulations, which were chosen to roughly match the geographic distribution of the 16 resequenced trees (Figs 1a, 2a), and included 10 of these trees to allow SNP genotype comparisons between the two datasets.
Library construction and genome resequencing
DNA was randomly sheared into small fragments (200–300 bp) using a Covaris E210 ultrasonicator (Covaris, Inc., Woburn, MA, USA), according to the manufacturer’s recommendations. The overhangs created by fragmentation were converted into blunt ends using T4 DNA polymerase and DNA polymerase I Klenow fragment. Illumina adaptors (Illumina, San Diego, CA, USA) were then ligated to the DNA fragment using DNA ligase. Finally, a polymerase chain reaction (PCR) was performed using DNA Phusion Polymerase to selectively enrich those DNA fragments that have adaptor molecules on both ends and to amplify the amount of DNA in the library. Short-read sequence data were generated for each individual tree (i.e. without pooling) using an Illumina Genome Analyzer.
Short-read sequence alignment
Sequence reads were aligned to assembly v2 of the reference genome of Nisqually-1 (http://www.phytozome.net/poplar.php; Tuskan et al., 2006) using the MAQ software package (Li et al., 2008) with default options, except for increasing the number of allowable mismatches in the first 24 bp to three, and the resulting SNPs were filtered using the default options of SNPFilter from the ‘maq.pl’ script. The rationale of this step was to potentially align as many reads as possible, but also to capture mapping quality and other alignment statistics that can be used for downstream filtering (see SNP filtering below). The ‘map’ alignment file from MAQ was converted to Binary Alignment/Map (BAM) format using the SAMtools package (Li et al., 2009), and subsequent filtering was based on the BAM representation of the alignments (Table S1).
To guide the selection of SNP filtering criteria, we generated Sanger sequence data by PCR amplification of fragments spanning 10 candidate genes (Table S2) in 47 P. trichocarpa trees, including 15 of the 16 trees in the genome resequencing data. Fragments were then direct-sequenced by Beckman Coulter (Beckman Coulter, Danvers, MA, USA) on an ABI 3730XL sequencer (Life Technologies Corporation, Carlsbad, CA, USA). Sequences were assembled using the Phred/Phrap (Gordon et al., 1998) programs and polymorphisms were scored using PolyPhred (Nickerson et al., 1997).
Biallelic SNPs that passed the initial sequence alignment filtering (see Short-read sequence alignment above) were further screened using nine additional filtering criteria (Table 1) based on sequence alignment statistics (Li et al., 2008), the observed number of minor alleles, the availability of data for all trees and conformity of genotype frequencies to Hardy–Weinberg expectations. We searched a large parameter space (n > 10⁶ scenarios) and evaluated each set of filtering criteria by comparing the set of SNPs obtained after filtering the genome resequencing data with a set obtained from Sanger resequencing data for the same trees (Table S2). Based on the tradeoff between the estimated rates of false positives and false negatives with respect to SNP detection, we selected two sets of filtering criteria (Table 1), both of which included only SNPs that were genotyped in all 16 resequenced trees and had minor allele frequencies (MAFs) of at least 0.094 (i.e. 3/32). The Hardy–Weinberg equilibrium (HWE) set included relatively relaxed alignment criteria, but required genotype frequencies to be roughly concordant with Hardy–Weinberg proportions (i.e. |FIS| ≤ 0.4) and at least one homozygous genotype for the minor allele to be observed. FIS was calculated as 1 − Ho/He, where Ho and He were the observed and expected heterozygosity based on genotypes from the core of the range of P. trichocarpa (i.e. excluding trees from the strongly differentiated Tahoe and Willamette subpopulations; see Results). We did not use a more stringent threshold value of FIS in order to avoid the elimination of high-quality SNPs with genotype frequencies deviating from HWE because of sampling variance or because they reflected population substructure within the core of the range of P. trichocarpa (see the Results section). In contrast, the Quality Score (QS) set relied on stricter alignment criteria, but did not use any information on genotype frequencies.
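The genotype-based part of the HWE filter can be illustrated with a minimal sketch. It assumes genotypes coded as 0/1/2 copies of one allele (None for missing data) and an uncorrected expected heterozygosity of 2pq; the function names and the coding scheme are hypothetical and are not taken from the original pipeline, and the alignment-based criteria of Table 1 are not modelled.

```python
def fis(genotypes):
    """F_IS = 1 - Ho/He for one biallelic SNP (genotypes coded 0/1/2)."""
    n = len(genotypes)
    p = sum(genotypes) / (2 * n)           # frequency of the counted allele
    he = 2 * p * (1 - p)                   # assumption: uncorrected 2pq
    if he == 0:
        return None                        # monomorphic: F_IS undefined
    ho = sum(1 for g in genotypes if g == 1) / n
    return 1 - ho / he


def passes_hwe_set(genotypes, is_core, min_maf=3 / 32,
                   max_abs_fis=0.4, min_minor_hom=1):
    """Apply the genotype-based HWE criteria described in the text.

    genotypes: one 0/1/2 code (or None) per resequenced tree.
    is_core:   one boolean per tree; F_IS uses core-of-range trees only.
    """
    if any(g is None for g in genotypes):  # data required for all trees
        return False
    n = len(genotypes)
    alt = sum(genotypes)
    minor_count = min(alt, 2 * n - alt)
    if minor_count / (2 * n) < min_maf:    # MAF of at least 3/32
        return False
    minor_hom_code = 2 if alt <= n else 0  # homozygote for the minor allele
    if sum(1 for g in genotypes if g == minor_hom_code) < min_minor_hom:
        return False
    core = [g for g, c in zip(genotypes, is_core) if c]
    f = fis(core)
    return f is not None and abs(f) <= max_abs_fis
```

For example, a SNP with nine reference homozygotes, five heterozygotes and two alternative homozygotes among 16 core trees has MAF = 9/32 and FIS ≈ 0.23, and so passes under these assumptions.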
Table 1. Filtering criteria for genome resequencing data from 16 Populus trichocarpa trees
d Min hit rate, minimum approximate copy number of the sequence flanking a single-nucleotide polymorphism (SNP) (Li et al., 2008).
e Max hit rate, maximum approximate copy number of the sequence flanking an SNP (Li et al., 2008).
f Missing, maximum number of trees for which missing data are allowed.
g Minor alleles, minimum number of copies of the minor allele.
h Min homozygotes, minimum number of homozygous genotypes for the minor allele.
i Max |FIS|, maximum deviation of observed genotype frequencies from Hardy–Weinberg expectations.
j False positives, percentage of filtered SNPs that were not found in Sanger sequencing data for the same trees (Table S2).
k False negatives, percentage of SNPs that were found in Sanger sequencing data for the same trees, but were not detected in the Illumina data or did not pass one or more of the filtering criteria.
l MAF, average minor allele frequency. For comparison, the average MAF for common SNPs (MAF ≥ 0.10) genotyped through Sanger resequencing (Table S2) was 0.24 and that for the Infinium SNP data was 0.27.
m No. of loci, number of SNPs that passed all filtering criteria.
Column headers recovered: Min hit rate (d); Max hit rate (e); False positives (%) (j); False negatives (%) (k); No. of loci (m). Single surviving table value: 1 453 752.
Infinium SNP data
To verify that the patterns observed from analyses of the genome resequencing data were not a sampling artifact caused by the relatively small number of resequenced trees, we genotyped 120 trees for 29 213 SNPs using the methods described in Supporting Information Methods S1. Briefly, we used information from previous studies to identify 3704 candidate genes that are being targeted in ongoing association studies of traits related to cell wall characteristics and climatic adaptation. After combining our genome resequencing data with transcriptome resequencing data from developing xylem of 20 P. trichocarpa trees (Geraldes et al., 2011), we identified 169 626 potential target SNPs within or near these candidate genes. We then used information on pairwise tagging, design scores and assay types for Infinium iSelect HD Custom Genotyping (Illumina), SNP annotation and spacing to select 38 000 target SNPs. Infinium assays were successfully developed for 34 131 SNPs, 29 213 of which were successfully genotyped.
Population structure and allele frequency gradients
We performed principal component analyses (PCAs) at multiple scales using v. 3.0 of the EIGENSOFT package (Patterson et al., 2006) after removing one SNP from each pair of loci located within 10 kb of one another and linked at r² ≥ 0.8. This was done in order to avoid artifacts caused by large blocks of tightly linked markers (Patterson et al., 2006; Nelson et al., 2008). The statistical significance of the relationship between geographic and genetic (FST; Wright, 1965; Weir & Cockerham, 1984) distances among the 10 subpopulations represented in the Infinium SNP data was assessed through Mantel tests (10 000 permutations) using GenAlEx (Peakall & Smouse, 2006). We quantified allele frequency gradients in the genome resequencing data by calculating Pearson’s correlation coefficients between the source latitude and the number of copies of an arbitrarily chosen allele in each tree. To assess the robustness of these measures, we used the Infinium SNP data to calculate Pearson’s correlation coefficients between latitudes and allele frequencies in the 10 subpopulations.
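The per-SNP gradient measure described above amounts to a Pearson correlation between latitude and allele copy number. The following is a minimal sketch of that calculation; the function names are hypothetical. Because the counted allele is chosen arbitrarily, only the magnitude of the coefficient is informative about the strength of a gradient.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient; NaN if either variable is constant."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def latitudinal_gradient(latitudes, allele_copies):
    """|r| between source latitude and copies (0/1/2) of an arbitrary allele."""
    return abs(pearson_r(latitudes, allele_copies))
```

The same function could be applied to subpopulation latitudes and allele frequencies for the Infinium data, replacing the per-tree copy numbers with per-subpopulation frequencies.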
LD and recombination
For genome resequencing data, we estimated haplotype frequencies using version 2.1 of the PHASE program (Stephens et al., 2001; Li & Stephens, 2003; Stephens & Scheet, 2005) and calculated r² (Slatkin, 2008) for all pairs of SNPs in windows consisting of a fixed number of SNPs. To verify that estimates of r² were not biased by inaccurate haplotype frequency estimation, values of r² obtained as described above were compared with those calculated using genotypic correlations (Marchini et al., 2006). For data filtered using the HWE filtering criteria (Table 1), windows consisted of 50 consecutive SNPs (median size, 35.5 kb), with adjacent windows overlapping by 20 SNPs. For data filtered using the QS filtering criteria (Table 1), we employed 200-SNP windows (median size, 43.6 kb) overlapping by 20 SNPs. To quantify LD for pairs of unlinked loci, r² was estimated through genotypic correlations for 10⁷ randomly selected pairs of SNPs located on different chromosomes. For comparisons of LD between different datasets (e.g. core of range vs range-wide), mean values of r² in each distance class were corrected for small sample size by subtracting 1/n, where n is the number of chromosomes used to estimate r² (Tenesa et al., 2007). The same approach was used for the Infinium SNP data, with windows corresponding to candidate genes and the 2-kb flanking regions up- and downstream from them.
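As an illustration of the genotype-based r² and the small-sample correction, the sketch below computes r² as the squared Pearson correlation between 0/1/2 genotype codes and subtracts 1/n from the mean of each distance class. The function names, the input layout and the fixed-width distance classes are assumptions made for the example; the haplotype-based estimates obtained with PHASE are not reproduced here.

```python
def genotypic_r2(g1, g2):
    """Squared Pearson correlation between 0/1/2 genotype codes at two SNPs."""
    n = len(g1)
    if n != len(g2) or n < 2:
        raise ValueError("need two genotype vectors of equal length >= 2")
    m1 = sum(g1) / n
    m2 = sum(g2) / n
    s12 = sum((a - m1) * (b - m2) for a, b in zip(g1, g2))
    s11 = sum((a - m1) ** 2 for a in g1)
    s22 = sum((b - m2) ** 2 for b in g2)
    if s11 == 0 or s22 == 0:
        return None                      # undefined for a monomorphic SNP
    return s12 * s12 / (s11 * s22)


def mean_r2_by_distance(pairs, bin_width_bp, n_chromosomes):
    """Mean r2 per distance class, corrected by subtracting 1/n.

    pairs: iterable of (distance_bp, r2) tuples for SNP pairs.
    Returns {class_start_bp: corrected_mean_r2}.
    """
    sums = {}
    for dist, r2 in pairs:
        b = dist // bin_width_bp
        total, count = sums.get(b, (0.0, 0))
        sums[b] = (total + r2, count + 1)
    return {b * bin_width_bp: total / count - 1 / n_chromosomes
            for b, (total, count) in sorted(sums.items())}
```

With 16 diploid trees, n = 32 chromosomes, so the correction lowers each class mean by about 0.031.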
We obtained estimates of the scaled recombination rate (ρ = 4Nec) for each 50- or 200-SNP window using the default options of PHASE (i.e. using the –MR0 model). Based on the haplotype frequencies estimated as part of this analysis, we also calculated the nucleotide diversity (π; Nei & Miller, 1990) for each window. We then calculated 100-kb values of ρ and π as weighted averages of the values for the 50- or 200-SNP windows overlapping with each 100-kb window (i.e. using the lengths of overlap as weights). Based on information on correlates of recombination in other organisms (Myers et al., 2005; Drouaud et al., 2006; Kim et al., 2007; Liu et al., 2009), we assessed 17 potential explanatory variables reflecting: the position along chromosomes relative to putative centromeres and telomeres; nucleotide diversity; DNA sequence features (e.g. GC content, occurrence of CpG-rich regions, repeat elements and genes); and epigenetic patterns revealed using whole-genome MeDIP resequencing (Methods S2). Estimates of effective population size (Ne) were obtained by relating values of r2 or estimates of ρ to direct estimates of local recombination rates (c) obtained from genetic linkage maps (Methods S3).
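The aggregation of per-window estimates into 100-kb values can be sketched as an overlap-weighted mean. The example assumes half-open coordinates in bp and a simple list of (start, end, value) tuples; both the layout and the function names are hypothetical.

```python
def overlap_bp(a_start, a_end, b_start, b_end):
    """Length of overlap between two half-open intervals [start, end)."""
    return max(0, min(a_end, b_end) - max(a_start, b_start))


def weighted_window_mean(snp_windows, bin_start, bin_end):
    """Overlap-weighted mean of per-window values (e.g. rho or pi).

    snp_windows: list of (start_bp, end_bp, value) for the 50- or 200-SNP
    windows; the weight of each window is its overlap with the bin.
    Returns None when no window overlaps the bin.
    """
    num = 0.0
    den = 0.0
    for start, end, value in snp_windows:
        w = overlap_bp(start, end, bin_start, bin_end)
        num += w * value
        den += w
    return num / den if den > 0 else None
```

A SNP window that straddles a bin boundary therefore contributes to both neighbouring 100-kb bins, in proportion to the length falling in each.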
Recombination hotspots were identified using the –MR1 option of PHASE, with all other parameters set to their default values. In order to identify robust recombination hotspots, we ran PHASE on the HWE-filtered SNP data three times, setting the seed of the pseudorandom number generator to a different number for each run, and combined results with a run based on SNPs filtered using the QS criteria (Table 1). Subsequent analyses were performed only for putative hotspots that consistently (i.e. in results from all runs) had: a median intensity (λ) of at least 20, a posterior probability that λ > 10 of at least 0.9 (Crawford et al., 2004) and border discrepancies of no more than 5 kb across runs. We also eliminated hotspots detected on scaffolds that were not mapped to chromosomes and in windows that were longer than 100 kb or had < 50 SNPs (i.e. at the ends of chromosomes). All of these criteria were satisfied for 606 hotspots, whose median Bayes factor (Crawford et al., 2004) was 293 (range, 27–1468). For each of these hotspots, we attempted to identify a matching ‘coldspot’ (Myers et al., 2005) in a window that was located within 500 kb of the window in which the respective hotspot was detected, had λ < 10 and a posterior probability that λ > 10 of at most 0.5 across all runs of PHASE and was no longer than 100 kb and contained no fewer than 50 SNPs. We identified 589 coldspot windows that met these criteria and randomly selected the start and end coordinates of coldspots within these windows, matching the lengths of coldspots to their respective hotspots. To test whether hotspot locations relative to genes followed a nonrandom pattern, we repeated the random selection of coldspot borders 10 000 times and calculated empirical one-sided P values by comparing the observed distribution of distances between hotspots and genes to the one of the simulated coldspots. 
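The coldspot randomization and the empirical one-sided P value can be sketched as follows. The summary statistic (for example, the mean distance to the nearest gene), the use of the (k + 1)/(N + 1) estimator and the function names are assumptions made for illustration; the text does not specify these details.

```python
import random


def random_coldspot(window_start, window_end, length, rng):
    """Place a coldspot of the given length uniformly within a window."""
    if length > window_end - window_start:
        raise ValueError("coldspot longer than its window")
    start = rng.randint(window_start, window_end - length)  # inclusive bounds
    return start, start + length


def empirical_p_lower(observed, simulated):
    """One-sided P value for 'observed is smaller than expected by chance'.

    Uses (k + 1) / (N + 1), a common conservative convention that never
    returns exactly zero; this choice is an assumption of the sketch.
    """
    k = sum(1 for s in simulated if s <= observed)
    return (k + 1) / (len(simulated) + 1)
```

Under this convention, an observed statistic smaller than all 10 000 simulated values gives P = 1/10 001.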
We compared DNA sequence and methylation characteristics between hotspots and coldspots through paired t-tests (Wackerly et al., 2002) using logarithmic transformations when diagnostic box plots suggested widely unequal variances or severe departures from normality of the data. We tested for over- or under-representation of different types of repeat elements by comparing the proportional occurrence of each type of repeat element relative to all repeat elements identified in hotspots to that in coldspots using contingency table χ2 tests (Wackerly et al., 2002).
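For a single repeat type, the contingency table comparison reduces to a 2 × 2 table (hotspots vs coldspots; focal repeat type vs all other repeat elements). The sketch below computes the Pearson χ² statistic without continuity correction, which is an assumption, and obtains the P value for one degree of freedom from the complementary error function. The function names are hypothetical.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 d.f."""
    return math.erfc(math.sqrt(x / 2))


def chi2_2x2(a, b, c, d):
    """Pearson chi-square test for the 2 x 2 table [[a, b], [c, d]].

    a, b: counts of the focal repeat type and of all other repeats in
    hotspots; c, d: the same counts in coldspots.
    Returns (statistic, p_value); no continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("table has an empty row or column")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    return stat, chi2_sf_df1(stat)
```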
Results
SNP discovery and genotyping
Resequencing was based on 6.4 billion Illumina Genome Analyzer reads with an average length of 59 bp (379 Gb of sequence). Approximately 84% of these reads mapped to the reference genome of Nisqually-1 (Tuskan et al., 2006), covering 95% of its length to an average depth of 39× (Table S1). The HWE set of SNP filtering criteria produced no false positives (Table 1) and resulted in genotype calls that matched those based on the Sanger data 96% of the time and those from the Infinium SNP data 97% of the time. Furthermore, 98% of the heterozygous SNP calls for Nisqually-1 based on genome resequencing data filtered using the HWE criteria were supported by Sanger reads used to generate the reference genome sequence (Tuskan et al., 2006). Because of the high reliability of SNP detection and genotype calls obtained using the HWE criteria, we used them to filter the genome resequencing data for all subsequent analyses. However, the HWE criteria also resulted in a high rate of false negatives (79%). In contrast, the QS set of filtering criteria had a high rate of false positives (17%), but also a substantially lower rate of false negatives (54%), resulting in genome coverage which was approximately three times more dense (Table 1). This set was used to assess the robustness of our results to varying SNP density. In addition, we employed the Infinium SNP data (120 trees genotyped for 29 213 SNPs) to verify all patterns detected using genome resequencing data. Average MAFs for common SNPs (MAF ≥ c. 0.10) were 0.27, 0.22, 0.24 and 0.27 for genome resequencing data filtered using the HWE and QS criteria, the Sanger resequencing data and the Infinium SNP data, respectively.
Population structure and allele frequency gradients
PCAs of SNP genotypes based on genome resequencing and the Infinium array revealed clear patterns of geographic structure at multiple spatial scales (Figs 1, 2). The primary axes of variation (PC1) were highly significant (P ≤ 0.0005), explained between 2.9% (Fig. 2c) and 18.2% (Fig. 1b) of the total variance and were associated with the latitudinal origins of the resequenced trees at all of these spatial scales. For example, correlations between PC1 scores and source latitudes were r = 0.87, 0.88 and 0.71, respectively, for analyses presented in Fig. 2(b–d) (P < 10⁻⁵). Similar patterns were detected using model-based clustering of trees based on a subset of the Infinium SNP data (Supporting Information Fig. S1). Consistent with the spatial pattern of genetic differentiation revealed by PCA, genetic and geographic distances among subpopulations were linearly associated (Fig. S2, Table S3).
Quantitative measures of latitudinal allele frequency gradients based on genome resequencing and Infinium SNP data were strongly correlated (r = 0.76, P < 10⁻¹⁵), verifying that allele frequency patterns detected using the genome resequencing data were robust. Analyses of c. 450 000 SNPs from the genome resequencing data revealed that latitudinal allele frequency gradients were strikingly common at the range-wide scale (c. 23%, Fig. 3a) and less common, but still substantially more frequent than expected by chance (c. 9%, Fig. 3b), within the core of the range (i.e. excluding trees from the strongly differentiated Tahoe and Willamette subpopulations). The relative frequencies of latitudinal gradients across different categories of SNPs also differed between these two spatial scales. At the range-wide scale, clinal allele frequency variation was more common among intergenic than among genic SNPs (Fig. 3a). In contrast, latitudinal allele frequency gradients within the core of the range of P. trichocarpa were more common among genic than among intergenic SNPs (Fig. 3b). Interestingly, the frequency of latitudinal gradients at both spatial scales was similar in SNPs located in untranslated regions, introns and coding sequences.
The observed heterozygosities of the two resequenced trees from the Tahoe subpopulation (Ho = 0.226 and 0.263 based on the HWE filtering criteria) were substantially lower than those of the trees sampled from the core of the range (average Ho = 0.333, SD = 0.024). The same pattern was detected using the Infinium SNP data (Ho = 0.219 for the Tahoe subpopulation vs average Ho = 0.314, SD = 0.011, across subpopulations in the core of the range). After excluding the Tahoe subpopulation, however, the observed heterozygosity and nucleotide diversity were not significantly correlated with latitude (P > 0.40).
LD and recombination
Values of r² obtained based on the estimated haplotype frequencies were concordant with those calculated using genotypic correlations (Fig. S3), and were therefore used for all subsequent analyses, except for the quantification of LD among physically unlinked loci (see next paragraph). The genome-wide average r² dropped below 0.2 within 3–6 kb, depending on the dataset and filtering criteria used (Figs 4, 5). Filtering of the genome resequencing data using the stringent HWE set of criteria (Table 1) resulted in estimates that were very similar to those obtained based on the Sanger resequencing data, and slightly higher than, but consistent with, those obtained based on the Infinium SNP data (Fig. 5). In contrast, estimates based on the more relaxed QS set of criteria appeared to be downwardly biased (Fig. 5).
Population structure had a two-fold effect on LD. First, r² among physically linked loci decayed at a slightly faster rate (i.e. at c. 500-bp shorter distances) when all trees were analyzed as one homogeneous population than when strongly differentiated trees were excluded (Fig. S4). Consistent with this, the estimate of effective population size from LD was 24% higher using the range-wide sample (Ne = 5592) than that using only the data from trees sampled in the core of the range (Ne = 4506, P < 10⁻¹² from a paired t-test across chromosomes) (Fig. S4). Furthermore, rates of decay of r² with physical distance and LD-based estimates of Ne varied dramatically among subpopulations, with Ne estimates ranging from 193 (Tahoe) to 2278 (Willamette) (Fig. 6). Second, in contrast with its effect on local LD, population subdivision caused elevated levels of genome-wide LD among physically unlinked loci (Fig. S5).
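To make the link between LD and Ne concrete, the sketch below shows two simple estimators. The first follows directly from the definition ρ = 4Nec. The second assumes the expectation E[r²] ≈ 1/(1 + 4Nec) (Sved, 1971), applied to r² values already corrected for sample size; the exact model used in Methods S3 is not reproduced here, so this should be read as an illustration under that assumption. The function names are hypothetical.

```python
def ne_from_rho(rho, c):
    """Ne from the scaled recombination rate: rho = 4 * Ne * c.

    rho and c must refer to the same physical interval (e.g. per bp).
    """
    return rho / (4 * c)


def ne_from_r2(r2_corrected, c):
    """Ne from r2 at recombination distance c (in Morgans).

    Assumes E[r2] = 1 / (1 + 4 * Ne * c); r2 should already have the
    1/n sample-size term subtracted.
    """
    return (1 / r2_corrected - 1) / (4 * c)


def percent_higher(a, b):
    """Percentage by which a exceeds b."""
    return 100 * (a - b) / b
```

For instance, percent_higher(5592, 4506) evaluates to about 24.1, matching the 24% difference between the range-wide and core-of-range estimates reported above.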
Fine-scale rates of recombination estimated indirectly from the resequencing data varied by orders of magnitude across the genome (Figs 7, S6) and were correlated with recombination rates estimated directly from genetic maps (Spearman’s ρ = 0.50, P < 10⁻¹⁰) (Fig. S7). As expected, several chromosomal, DNA sequence and epigenetic features were moderately correlated with recombination rates, as well as with one another (Fig. 7, Table S4). The overall patterns were consistent across analyses based on 100-kb and 1-Mb windows, but correlations tended to be stronger for 1-Mb windows (Table S5). Multicollinearity made the identification of a single most appropriate set of predictors challenging, but a forward selection procedure with a Bonferroni adjustment to control for the large number of potential explanatory variables resulted in models with R² = 0.65 and 0.77 for 100-kb and 1-Mb windows, respectively (Tables S6, S7). Models including only the three strongest predictors (the number of MeDIP reads and the occurrence of long terminal repeat (LTR) Gypsy retrotransposons and A-, T- or AT-rich low-complexity sequences; Tables S4, S5) explained between one-half and three-quarters of the genome-wide variation in log-transformed recombination rates (R² = 0.53 and 0.76 for 100-kb and 1-Mb windows, respectively).
We identified over 6000 putative recombination hotspots with an average intensity (i.e. fold change relative to background rates of recombination) of 261 (range, 20–5918). Approximately 10% of these hotspots (n = 606) passed stringent screening, including detection using both HWE and QS filtering criteria (Table 1). The distribution and sequence characteristics of these hotspots were compared with those of control regions (i.e. ‘coldspots’; Myers et al., 2005), which were chosen as close as possible to each hotspot (i.e. the median distance between hotspots and coldspots was c. 74 kb). Recombination hotspots tended to be located closer to genes than expected by chance, preferentially outside of genes and preferentially upstream of genes (P < 10⁻⁴ for all three tests; Table 2). As expected on the basis of the correlates of recombination identified (Tables S4–S7), DNA in hotspots tended to be less methylated (P < 10⁻¹⁵), contained fewer LTR Gypsy retrotransposons (P < 10⁻¹⁵), contained more AT-rich low-complexity sequences (P < 10⁻¹⁵) and had a lower GC content (P < 10⁻¹⁵), but a larger number of bases in CpG-rich regions (P < 10⁻⁶), compared with DNA in coldspots (Table 2).
Table 2. Contrasts between 589 recombination hotspots and their matched coldspots (P is the two-sided P value, unless indicated otherwise)
a Average across 10 000 simulated coldspot locations.
b One-sided P values.
c Proportional occurrence of each type of repeat element relative to all repeat elements identified in hotspots or coldspots.
Location relative to genes
Number overlapping with genes
Number upstream of genes
Number downstream of genes
Average distance upstream of genes (bp)
Average distance downstream of genes (bp)
DNA characteristics and methylation
Average GC content (%)
Average number of CpG bases
5.1 × 10⁻⁷
Average number of MeDIP reads per kb of sequence
DNA repeat compositionc
Proportion of LTR Gypsy retrotransposons (%)
Proportion of AT-rich low-complexity sequences (%)
Proportion of LINEs (%)
Discussion
The high reliability of SNP detection and genotyping based on our genome resequencing data was illustrated through comparisons with Sanger data (e.g. Table 1, comparisons with Nisqually-1 reference genome sequence), consistency with transcriptome resequencing data used for SNP discovery (Geraldes et al., 2011; Methods S1) and transferability to the Infinium SNP genotyping assay (Methods S1). This suggests that the prospects of high-density genotyping in Populus are excellent, especially considering the rapidly decreasing cost and increasing throughput of sequencing technologies (Metzker, 2010). However, our results also highlight the importance of filtering genome resequencing data appropriately (Table 1). Naïve filtering based entirely on sequence depth is likely to result in high rates of false positives and may influence downstream data analyses and interpretation (Table 1). However, the stringent filtering criteria required to achieve low rates of false positives limited our inference to common SNPs (MAF ≥ c. 0.10), prevented us from calculating reliable absolute estimates of nucleotide diversity (i.e. because of the consistently high rates of false negatives across all filtering scenarios considered) and possibly resulted in biased genome sampling. Further technological and bioinformatic advances will probably help to mitigate these limitations. For example, the availability of resequencing data from a larger number of individuals will allow the inclusion of lower frequency SNPs (i.e. even after requiring the detection of multiple minor alleles) in population genetic analyses, association studies and genomic selection.
Despite the great potential for long-distance seed and pollen dispersal in P. trichocarpa (Slavov et al., 2009; DiFazio et al., 2012), population genetic structure appears to be present from the range-wide (Figs 1, 2) to the local stand (Slavov et al., 2010) scale. The genetic differentiation of trees sampled from the Tahoe subpopulation (pairwise FST > 0.151; Table S3) was stronger than expected based on previous population genetic studies in Populus (Slavov & Zhelev, 2010), but comparable with that observed in a range-wide sample of P. balsamifera (Keller et al., 2010), a close relative of P. trichocarpa. Even the subtler genetic differentiation observed in the core of the range of P. trichocarpa (pairwise FST = 0.013–0.048; Table S3) needs to be statistically accounted for in association studies (Price et al., 2006, 2010).
The observed continuous patterns of differentiation and allele frequency variation (Figs 1, 2) match those expected under population genetic models of isolation by distance (IBD) (Rousset, 1997). For example, our results are consistent with both hypotheses proposed by Soltis et al. (1997) to explain the phylogeographic patterns detected in plants growing in the Pacific Northwest of North America. However, the lack of a clear trend of decreasing diversity with latitude makes a scenario of recolonization from multiple glacial refugia (i.e. ‘north–south recolonization hypothesis’) more plausible than a scenario of recolonization from a single group of refugia located in the mountains of northern California and southern Oregon (i.e. ‘leading edge hypothesis’). Regardless of the specifics of the underlying demographic scenario, the remarkable abundance and ubiquity of latitudinal SNP gradients at the range-wide scale (i.e. about one of four SNPs with MAF ≥ c. 0.10), and their higher frequency among intergenic than among genic SNPs, suggest that allele frequency gradients caused by local adaptation may be completely confounded with neutral differentiation under IBD (Vasemägi, 2006; Novembre & Di Rienzo, 2009). The reversed pattern of higher frequency of latitudinal gradients among genic than intergenic SNPs in the core of the range indicates that spatial patterns of SNP variation could potentially be instrumental for the detection of molecular signatures of selection. However, performing this robustly will probably require the use of appropriately defined study populations and neutral differentiation models consistent with IBD, as well as conservatively combining multiple sources of evidence (e.g. allele frequency differentiation, local levels of LD and nucleotide diversity vs divergence, and allele frequency spectra for ancestral vs derived alleles and haplotypes) (Grossman et al., 2010; Hernandez et al., 2011).
Second, estimates of LD decay were affected by the criteria used to filter the genome resequencing data (Fig. 5). Estimates based on the stringent HWE criteria were very similar to those based on Sanger resequencing data (Fig. 5a) and roughly comparable to those based on Infinium SNP data (Fig. 5b). The slight difference in the latter case probably resulted from the much larger sample size used to generate the Infinium SNP data (i.e. 2n = 192 chromosomes sampled in the core of the range, making the small-sample bias in this dataset negligible) and/or the exclusion of nearly 50 000 strongly linked (i.e. r2 ≥ 0.8) SNPs through the pairwise tagging analyses performed to select target loci for Infinium genotyping (Methods S1). In contrast, analyses of genome resequencing data filtered using the more relaxed QS criteria appeared to systematically underestimate the mean r2 values by distance class (Fig. 5). This is probably caused primarily by the high frequency of false-positive SNPs (which would generally be expected to behave as unlinked loci) in data filtered using these criteria (Table 1). However, even the decay of r2 based on data filtered using the QS criteria was an order of magnitude slower than previously estimated for Populus. Taken together, the comparisons among datasets generated using different SNP genotyping assays indicate that the extensive genome-wide LD detected was probably not caused by technological artifacts.
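The two effects invoked here (false-positive SNPs behaving as unlinked loci, and the small-sample bias in r2) can be illustrated with a minimal Python sketch. It is not the pipeline used in this study. The function names are hypothetical, haplotypes are assumed to be phased and coded 0/1, and the dilution model rests on the simplifying assumption that any SNP pair involving a false-positive call has an expected r2 of about 1/n, where n is the number of chromosomes sampled.

```python
def r_squared(hap_a, hap_b):
    """Squared allelic correlation (r2) between two biallelic loci.

    hap_a, hap_b: equal-length sequences of 0/1 alleles, one entry per
    sampled chromosome (phased haplotypes are assumed).
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("haplotype vectors must be non-empty and of equal length")
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(x * y for x, y in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d * d / denom


def diluted_mean_r2(true_mean_r2, false_snp_fraction, n_chrom):
    """Rough expected mean r2 when a fraction of called SNPs are false positives.

    Assumption (illustrative only): a pair containing at least one false
    positive behaves as unlinked, with expected r2 of about 1 / n_chrom.
    """
    pair_fraction = 1 - (1 - false_snp_fraction) ** 2
    return (1 - pair_fraction) * true_mean_r2 + pair_fraction / n_chrom
```

Under these assumptions, any non-zero fraction of false-positive SNPs pulls the mean r2 of a distance class towards 1/n, which is the direction of the bias described for the data filtered using the QS criteria.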
Third, at 100-kb and 1-Mb window scales, rates of recombination were weakly to moderately correlated with gene density (Tables S4, S5), and recombination hotspots were located closer to genes than expected by chance (Table 2). These trends suggest that LD within or near genes may be weaker than the genome average. Consistent with this, estimates of r2 based exclusively on SNPs located within or near genes (Infinium SNP data) were lower than those based on the genome resequencing data (Fig. 4), but the difference was relatively small.
Finally, LD estimates based on our genome resequencing data (i.e. with one to three SNPs per kilobase; Table 1) may not be directly comparable to those from previous studies in P. trichocarpa and other forest trees, which were typically based on up to several kilobases of sequence for a relatively small number of candidate genes (Krutovsky & Neale, 2005; González-Martínez et al., 2006a; Heuertz et al., 2006; Ingvarsson, 2008; Wegrzyn et al., 2010). We hypothesize that data from these studies may not have been sufficiently extensive to develop accurate expectations on genome-wide patterns of LD. Consistent with this, analyses of resequencing data for 372 randomly sampled gene fragments in P. trichocarpa’s close relative P. balsamifera resulted in dramatically variable LD estimates that were, on average, higher than previously reported for other Populus species (Olson et al., 2010). More importantly, estimates of LD based exclusively on very closely spaced SNPs may not be appropriate for the prediction of longer range LD (Kim et al., 2007). This is presumably because gene conversion may have a substantial effect on short-range but not on long-range LD (Andolfatto & Nordborg, 1998), and because LD estimates at different distances reflect Ne over different time scales (i.e. longer range LD is expected to reflect more recent Ne; Tenesa et al., 2007). The simple regression model that is commonly used to summarize the decay of r2 with physical distance (Hill & Weir, 1988) does not account for these two factors, and predictions of long-range LD based on fitting this model to short-range LD data should be treated with caution.
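For reference, the form in which the Hill & Weir (1988) expectation is commonly written can be sketched in Python as follows. This is a minimal illustration, not the fitting procedure used here. C denotes the population-scaled recombination parameter (4Nec) between two loci and n the number of chromosomes sampled. The function names and the per-base-pair value of 4Nec used in the usage note are hypothetical and were not estimated from the data in this study.

```python
def expected_r2(c_scaled, n_chrom):
    """Expected r2 at drift-recombination equilibrium (Hill & Weir, 1988).

    c_scaled: population-scaled recombination parameter, C = 4*Ne*c
    n_chrom:  number of chromosomes sampled
    """
    c = float(c_scaled)
    base = (10.0 + c) / ((2.0 + c) * (11.0 + c))
    # Finite-sample adjustment; it tends to 1 as n_chrom grows.
    adjust = 1.0 + ((3.0 + c) * (12.0 + 12.0 * c + c * c)) / (
        n_chrom * (2.0 + c) * (11.0 + c)
    )
    return base * adjust


def distance_below(threshold, rho_per_bp, n_chrom, max_bp=1_000_000, step=10):
    """First distance (bp, on a grid of `step`) at which expected r2 < threshold.

    rho_per_bp is 4*Ne*c per base pair and is assumed constant along the
    sequence, which is the simplification questioned in the text.
    Returns None if the threshold is not reached within max_bp.
    """
    for d in range(0, max_bp + 1, step):
        if expected_r2(rho_per_bp * d, n_chrom) < threshold:
            return d
    return None
```

For example, `distance_below(0.2, 0.001, 32)` returns the distance at which the fitted curve would cross r2 = 0.2 for an illustrative 4Nec of 0.001 per bp and 32 chromosomes. Because the curve depends on a single constant parameter, it cannot represent gene conversion or changes in Ne over time, which is why extrapolation from short-range data is risky.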
As a reflection of the surprisingly slow decay of r2 with physical distance between loci, estimates of Ne from LD (i.e. c. 4000–6000, depending on the genetic map and estimation method used) are remarkably low, considering the vast census sizes of the P. trichocarpa populations sampled. These Ne estimates are also over 20 times lower than those obtained for European aspen (P. tremula) based on putatively neutral nucleotide diversity (Ingvarsson, 2008), and three to four times lower than those that would be obtained for P. trichocarpa based on similar assumptions and estimates of neutral nucleotide diversity from previous studies (Gilchrist et al., 2006; Tuskan et al., 2006). The dramatic contrast between Ne estimates for P. tremula and P. trichocarpa may be a reflection of the substantial differences in life history between aspens and cottonwoods (e.g. greater propensity for asexual reproduction in aspens, possibly resulting in a ‘Meselson effect’; Balloux et al., 2003; Slavov & Zhelev, 2010). Furthermore, because LD-based estimates of Ne presumably reflect more recent population history than those from nucleotide diversity (Tenesa et al., 2007), and because P. trichocarpa populations are likely to have experienced severe bottlenecks and/or founder effects over the last few hundred generations (Soltis et al., 1997), our low estimates of Ne from LD are not implausible. Similar to the geographic patterns of allele frequency variation and heterozygosity, Ne estimates by subpopulation (Fig. 6) are consistent with a scenario of recolonization from multiple glacial refugia. These estimates also indicate that the Willamette and Columbia subpopulations may be located in a ‘melting pot’ (Petit et al., 2003) of haplotype diversity from multiple refugia.
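The link between slow decay of r2 and low Ne can be made concrete with a minimal Python sketch based on the widely used approximation E[r2] ≈ 1/(α + 4Nec) + 1/n, where c is the recombination distance in Morgans, n is the number of chromosomes sampled, and α is 1 without mutation or 2 with mutation. This is not necessarily the exact estimator used in this study. The function names are hypothetical, and the 1/(2c) rule for the age of the estimate is a commonly used approximation that is included here as an assumption.

```python
def expected_r2(ne, c_morgans, n_chrom, alpha=1.0):
    """Approximate expected r2 for loci c_morgans apart, given Ne.

    The 1 / n_chrom term is the contribution of finite sample size.
    """
    return 1.0 / (alpha + 4.0 * ne * c_morgans) + 1.0 / n_chrom


def ne_from_r2(r2, c_morgans, n_chrom, alpha=1.0):
    """Invert the approximation above to obtain Ne from an observed mean r2."""
    adjusted = r2 - 1.0 / n_chrom  # remove the sample-size contribution
    if adjusted <= 0:
        raise ValueError("r2 does not exceed the sample-size expectation 1/n")
    return (1.0 / adjusted - alpha) / (4.0 * c_morgans)


def generations_ago(c_morgans):
    """Approximate age (generations) of the Ne reflected by LD at distance c.

    Assumes the commonly used 1 / (2c) approximation.
    """
    return 1.0 / (2.0 * c_morgans)
```

Under this approximation, a higher r2 at a given recombination distance translates directly into a lower Ne, and larger distances correspond to more recent generations, in line with the reasoning attributed above to Tenesa et al. (2007).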
In addition to its dramatic variation among subpopulations, LD also varied substantially across the genome. The epigenetic and DNA sequence variables identified as best predictors of historical rates of recombination (4Nec) were similar to those in other plants (Drouaud et al., 2006; Kim et al., 2007; Gore et al., 2009; Liu et al., 2009; Branca et al., 2011). However, the correlations estimated tended to be stronger and the combined explanatory power of the variables considered higher than in these studies. Furthermore, the clear pattern of nonrandom occurrence of recombination hotspots relative to genes (Table 2) was consistent with that detected based on detailed studies in humans (McVean et al., 2004; Myers et al., 2005). The ability to detect these patterns attests to the precision of our estimates of fine-scale recombination rates, and suggests that this information will probably be useful in both fundamental studies of DNA- and chromosome-level evolution and in applied molecular breeding.
In summary, our analyses of extensive genome resequencing and SNP genotyping data for a broad collection of P. trichocarpa trees led to two main novel findings. First, significant population genetic structure appears to be present at multiple spatial scales, and latitudinal allele frequency gradients are surprisingly common across the genome (Figs 1–3). Second, genome-wide LD extends over much larger physical distances than expected on the basis of previous smaller scale studies (Figs 4–6). These results have several implications. First, geographic patterns of population structure, allele frequency gradients and estimates of Ne from LD consistently suggest that genetic drift has played a significant role in the recent evolutionary history of P. trichocarpa, with post-glacial recolonization from multiple refugia appearing more plausible than recolonization from a single southern refugium. Second, the striking genome-wide ubiquity of high-LD regions and SNPs with latitudinal allele frequency gradients indicates that reliable selection scans in P. trichocarpa will require the development of ad hoc statistical tests integrating multiple molecular signatures. Finally, the commonly made assumptions that population structure is weak and LD decays within 1–2 kb in forest trees (Neale & Ingvarsson, 2008; Neale & Kremer, 2011) may not hold universally across species of Populus. This will need to be reflected in the analyses of data from ongoing and in the design of future association studies. For example, statistically significant marker–phenotype associations will need to be interpreted in the specific context of local LD (i.e. the actual causative polymorphism(s) may, in some cases, be kilobases away from the markers used to detect the association). Most significantly, the extensive LD we detected makes genome-wide association studies and genomic selection in P. trichocarpa much more feasible than previously assumed (Neale & Ingvarsson, 2008; Neale & Kremer, 2011), especially given the increasing cost-effectiveness of SNP genotyping.
Funding was provided by the BioEnergy Science Center, a US Department of Energy (DOE) Bioenergy Research Center (Office of Biological and Environmental Research in the DOE Office of Science) and by the Province of British Columbia through Genome British Columbia Applied Genomics Innovation Program project 103BIO. The work conducted by the DOE Joint Genome Institute was supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231. Reinhard Stettler, Jon Johnson, Brian Stanton, Richard Shuren, Nancy Engle, Xiaohan Yang and Stan Wullschleger provided assistance with the collection of plant materials. We thank Reinhard Stettler, Glenn Howe, the New Phytologist Editor and three anonymous reviewers for their comments on earlier versions of the manuscript.