- Top of page
- Materials and Methods
- Supporting Information
Forest tree species are assumed to have large effective population sizes because of their high levels of genetic diversity, weak interpopulation differentiation for neutral loci and rapid decay of linkage disequilibrium (LD) with physical distance (Hamrick et al., 1992; Neale & Ingvarsson, 2008; Neale & Kremer, 2011). However, because most previous population genetic studies were based on small numbers of statistically independent loci (González-Martínez et al., 2006b; Savolainen & Pyhäjärvi, 2007), relatively little is known about the genome-wide patterns of allele frequency variation, LD and recombination in these ecologically and economically important organisms. Detailed information on these patterns is indispensable for understanding the evolutionary history of tree populations and for designing, analyzing and interpreting data from association studies and selection scans. Thus, in addition to providing unprecedented opportunities for marker-assisted breeding through genomic selection (i.e. the prediction of genetic value from high-density single-nucleotide polymorphism (SNP) genotype data; Meuwissen & Goddard, 2010), genome-wide surveys of DNA polymorphism can yield novel biological insights.
Because of their role as foundation species in a number of ecosystems (Whitham et al., 2006), wide geographic distribution, rapid growth and potential as a bioenergy crop (Rubin, 2008), species of Populus have become models for tree genomics (Jansson et al., 2010) and have well-developed molecular resources, including a whole-genome sequence (Tuskan et al., 2006) integrated with genetic and physical maps (Kelleher et al., 2007). Several association studies in P. trichocarpa, P. deltoides, P. nigra and P. tremula are targeting traits related to biomass productivity, cell wall characteristics and climatic adaptation (Stanton et al., 2010).
Black cottonwood (Populus trichocarpa), which is one of the fastest growing species of the genus, inhabits riparian areas in western North America from Baja California to Alaska (DeBell, 1990). To characterize genome-wide patterns of allele frequency variation, LD and recombination in P. trichocarpa, we resequenced the genomes of 16 trees (Fig. 1a, Supporting Information Table S1) selected as a ‘range-wide’ sample (i.e. spanning a large proportion of the species’ native range). Results based on genome resequencing data were corroborated by extensive sequence and SNP data generated using traditional technologies (e.g. genotypic data for 29 213 SNPs in 120 trees from a similar geographic area, Fig. 2). In addition, we used information on DNA sequence features and data from whole-genome methylated DNA immunoprecipitation (MeDIP) resequencing to identify the strongest correlates of fine-scale recombination rates estimated using our genome resequencing data.
Figure 1. Population structure in Populus trichocarpa based on genome resequencing data. (a) Sampling locations of the 16 P. trichocarpa trees that were resequenced, relative to the species range (Little, 1971). (b, c) Major axes of variation based on principal component analysis (PCA) of 235 259 single-nucleotide polymorphisms (SNPs) on a range-wide scale (b) and after excluding the clearly differentiated Tahoe and Willamette trees (c). The cluster which includes trees from multiple locations (dotted frame in b) was re-analyzed at a finer spatial scale (c).
Figure 2. Population structure in Populus trichocarpa based on Infinium single-nucleotide polymorphism (SNP) data. (a) Sampling locations of 120 P. trichocarpa trees from 10 subpopulations chosen to span the geographic distribution of the 16 resequenced trees. (b–d) Major axes of variation based on principal component analysis (PCA) of 22 280 SNPs on a range-wide scale (b), after excluding trees from the clearly differentiated Tahoe, Willamette and Columbia subpopulations (c), and within the most homogeneous set of subpopulations in western Washington (d). Clusters which include trees from multiple subpopulations (dotted frames in b and c) were re-analyzed at finer spatial scales after excluding clearly differentiated trees.
The high reliability of SNP detection and genotyping based on our genome resequencing data was illustrated through comparisons with Sanger data (e.g. Table 1, comparisons with the Nisqually-1 reference genome sequence), consistency with transcriptome resequencing data used for SNP discovery (Geraldes et al., 2011; Methods S1) and transferability to the Infinium SNP genotyping assay (Methods S1). This suggests that the prospects for high-density genotyping in Populus are excellent, especially considering the rapidly decreasing cost and increasing throughput of sequencing technologies (Metzker, 2010). However, our results also highlight the importance of filtering genome resequencing data appropriately (Table 1). Naïve filtering based entirely on sequence depth is likely to result in high rates of false positives and may influence downstream data analyses and interpretation (Table 1). Conversely, the stringent filtering criteria required to achieve low rates of false positives limited our inference to common SNPs (MAF ≥ c. 0.10), prevented us from calculating reliable absolute estimates of nucleotide diversity (i.e. because of the consistently high rates of false negatives across all filtering scenarios considered) and possibly resulted in biased genome sampling. Further technological and bioinformatic advances will probably help to mitigate these limitations. For example, the availability of resequencing data from a larger number of individuals will allow the inclusion of lower frequency SNPs (i.e. even after requiring the detection of multiple minor alleles) in population genetic analyses, association studies and genomic selection.
Despite the great potential for long-distance seed and pollen dispersal in P. trichocarpa (Slavov et al., 2009; DiFazio et al., 2012), population genetic structure appears to be present from the range-wide (Figs 1, 2) to the local stand (Slavov et al., 2010) scale. The genetic differentiation of trees sampled from the Tahoe subpopulation (pairwise FST > 0.151; Table S3) was stronger than expected based on previous population genetic studies in Populus (Slavov & Zhelev, 2010), but comparable with that observed in a range-wide sample of P. balsamifera (Keller et al., 2010), a close relative of P. trichocarpa. Even the subtler genetic differentiation observed in the core of the range of P. trichocarpa (pairwise FST = 0.013–0.048; Table S3) needs to be statistically accounted for in association studies (Price et al., 2006, 2010).
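To make the scale of the pairwise FST values quoted above concrete, the following minimal sketch computes Wright's FST = (HT − HS)/HT for two equally weighted subpopulations from per-locus allele frequencies. It is only an illustration: the estimator actually used for Table S3 is not described in this section, and the function name `fst_two_subpops` and the example frequencies are hypothetical.

```python
def fst_two_subpops(freqs_a, freqs_b):
    """Wright's F_ST = (H_T - H_S) / H_T for two equally weighted subpopulations.

    freqs_a, freqs_b: frequencies (0..1) of the same allele at the same loci
    in subpopulations A and B. Heterozygosities are summed over loci before
    the ratio is taken (a "ratio of sums" across loci).
    """
    if len(freqs_a) != len(freqs_b):
        raise ValueError("both subpopulations need frequencies for the same loci")
    h_s_sum = 0.0  # mean within-subpopulation expected heterozygosity
    h_t_sum = 0.0  # expected heterozygosity of the pooled population
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2.0
        h_s_sum += (2.0 * p_a * (1.0 - p_a) + 2.0 * p_b * (1.0 - p_b)) / 2.0
        h_t_sum += 2.0 * p_bar * (1.0 - p_bar)
    if h_t_sum == 0.0:
        return 0.0  # no variation at all, so no differentiation to measure
    return (h_t_sum - h_s_sum) / h_t_sum


if __name__ == "__main__":
    # Hypothetical single locus with frequencies 0.4 and 0.6 in the two groups.
    print(round(fst_two_subpops([0.4], [0.6]), 3))
```

Under this definition, a frequency difference of 0.4 vs 0.6 at a single locus already gives FST = 0.04, which falls within the range reported for the core of the species range.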
The observed continuous patterns of differentiation and allele frequency variation (Figs 1, 2) match those expected under population genetic models of isolation by distance (IBD) (Rousset, 1997). For example, our results are consistent with both hypotheses proposed by Soltis et al. (1997) to explain the phylogeographic patterns detected in plants growing in the Pacific Northwest of North America. However, the lack of a clear trend of decreasing diversity with latitude makes a scenario of recolonization from multiple glacial refugia (i.e. ‘north–south recolonization hypothesis’) more plausible than a scenario of recolonization from a single group of refugia located in the mountains of northern California and southern Oregon (i.e. ‘leading edge hypothesis’). Regardless of the specifics of the underlying demographic scenario, the remarkable abundance and ubiquity of latitudinal SNP gradients at the range-wide scale (i.e. about one of four SNPs with MAF ≥ c. 0.10), and their higher frequency among intergenic than among genic SNPs, suggest that allele frequency gradients caused by local adaptation may be completely confounded with neutral differentiation under IBD (Vasemägi, 2006; Novembre & Di Rienzo, 2009). The reversed pattern of higher frequency of latitudinal gradients among genic than intergenic SNPs in the core of the range indicates that spatial patterns of SNP variation could potentially be instrumental for the detection of molecular signatures of selection. However, performing this robustly will probably require the use of appropriately defined study populations and neutral differentiation models consistent with IBD, as well as conservatively combining multiple sources of evidence (e.g. allele frequency differentiation, local levels of LD and nucleotide diversity vs divergence, and allele frequency spectra for ancestral vs derived alleles and haplotypes) (Grossman et al., 2010; Hernandez et al., 2011).
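As a simple illustration of what a latitudinal allele frequency gradient means for a single SNP, the sketch below regresses subpopulation allele frequencies on latitude by ordinary least squares and reports the slope and Pearson correlation. This is an assumption made for illustration only; the test used to classify SNPs as showing gradients is not specified in this section, and the function name and input values are hypothetical.

```python
import math


def latitudinal_gradient(latitudes, allele_freqs):
    """Least-squares slope and Pearson r of allele frequency on latitude.

    latitudes: one latitude (degrees) per subpopulation.
    allele_freqs: frequency of one allele of a SNP in each subpopulation.
    Returns (slope, r); both are 0.0 if the allele frequency does not vary.
    """
    n = len(latitudes)
    if n != len(allele_freqs) or n < 2:
        raise ValueError("need matching inputs for at least two subpopulations")
    mean_x = sum(latitudes) / n
    mean_y = sum(allele_freqs) / n
    sxx = sum((x - mean_x) ** 2 for x in latitudes)
    syy = sum((y - mean_y) ** 2 for y in allele_freqs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(latitudes, allele_freqs))
    if sxx == 0.0:
        raise ValueError("latitudes must not all be identical")
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy) if syy > 0.0 else 0.0
    return slope, r


if __name__ == "__main__":
    # Hypothetical SNP whose allele frequency rises steadily with latitude.
    print(latitudinal_gradient([40.0, 45.0, 50.0], [0.2, 0.3, 0.4]))
```

A per-SNP statistic of this kind cannot by itself separate selection from neutral isolation by distance, which is the confounding described above.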
First, estimates of r2 are upwardly biased in small samples (Terwilliger & Hiekkalinna, 2006) and very sensitive to the distribution of allele frequencies across loci (VanLiere & Rosenberg, 2008). However, both the number of trees used to generate the genome resequencing data (n = 16 trees or 2n = 32 chromosomes) and the MAF thresholds used (≥ c. 0.10) were very similar or identical to those employed in previous studies in Populus and other forest trees (Krutovsky & Neale, 2005; González-Martínez et al., 2006a; Heuertz et al., 2006; Ingvarsson, 2008; Wegrzyn et al., 2010), as well as those used in the highly selfing plants employed for reference in Fig. 4 (Kim et al., 2007; Branca et al., 2011). Thus, neither of these factors seems to be a likely explanation for the unexpectedly slow decay of LD in P. trichocarpa.
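The small-sample bias mentioned above can be illustrated with a short simulation. The sketch below computes r2 from phased 0/1 haplotypes and averages it over pairs of statistically independent loci, for which the true r2 is zero, so that any positive mean is sampling bias. It assumes known haplotype phase and a hypothetical allele frequency of 0.3 at both loci; the function names are illustrative and do not correspond to the software used in the study.

```python
import random


def r_squared(hap_a, hap_b):
    """r^2 between two biallelic loci from phased haplotypes coded 0/1.

    Returns None if either locus is monomorphic in the sample.
    """
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a & b for a, b in zip(hap_a, hap_b)) / n
    denom = p_a * (1.0 - p_a) * p_b * (1.0 - p_b)
    if denom == 0.0:
        return None
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium, D
    return d * d / denom


def mean_r2_unlinked(n_chromosomes, freq=0.3, replicates=2000, seed=1):
    """Mean sample r^2 for pairs of independent loci (true r^2 = 0)."""
    rng = random.Random(seed)
    values = []
    for _ in range(replicates):
        hap_a = [int(rng.random() < freq) for _ in range(n_chromosomes)]
        hap_b = [int(rng.random() < freq) for _ in range(n_chromosomes)]
        r2 = r_squared(hap_a, hap_b)
        if r2 is not None:
            values.append(r2)
    return sum(values) / len(values)


if __name__ == "__main__":
    for n in (32, 192):  # chromosome numbers of the two datasets in the text
        print(n, mean_r2_unlinked(n))
```

With 2n = 32 chromosomes the mean is expected to be roughly 1/n, i.e. about 0.03, whereas with 2n = 192 it is close to 0.005, in line with the statement that the bias is negligible in the larger Infinium dataset.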
Second, estimates of LD decay were affected by the criteria used to filter the genome resequencing data (Fig. 5). Estimates based on the stringent HWE criteria were very similar to those based on Sanger resequencing data (Fig. 5a) and roughly comparable to those based on Infinium SNP data (Fig. 5b). The slight difference in the latter case probably resulted from the much larger sample size used to generate the Infinium SNP data (i.e. 2n = 192 chromosomes sampled in the core of the range, making the small sample bias in this dataset negligible) and/or the exclusion of nearly 50 000 strongly linked (i.e. r2 ≥ 0.8) SNPs through the pairwise tagging analyses performed to select target loci for Infinium genotyping (Methods S1). In contrast, analyses of genome resequencing data filtered using the more relaxed QS criteria appeared to systematically result in underestimates of the mean r2 values by distance class (Fig. 5). This is probably caused primarily by the high frequency of false-positive SNPs (i.e. which would generally be expected to behave as unlinked) in data filtered using these criteria (Table 1). However, even the decay of r2 based on data filtered using the QS criteria was an order of magnitude slower than previously estimated for Populus. Taken together, the comparisons among datasets generated using different SNP genotyping assays indicate that the extensive genome-wide LD detected was probably not caused by technological artifacts.
Third, at 100-kb and 1-Mb window scales, rates of recombination were weakly to moderately correlated with gene density (Tables S4, S5), and recombination hotspots were located closer to genes than expected by chance (Table 2). These trends suggest that LD within or near genes may be weaker than the genome average. Consistent with this, estimates of r2 based exclusively on SNPs located within or near genes (Infinium SNP data) were lower than those based on the genome resequencing data (Fig. 4), but the difference was relatively small.
Finally, LD estimates based on our genome resequencing data (i.e. with one to three SNPs per kilobase; Table 1) may not be directly comparable to those from previous studies in P. trichocarpa and other forest trees, which were typically based on up to several kilobases of sequence for a relatively small number of candidate genes (Krutovsky & Neale, 2005; González-Martínez et al., 2006a; Heuertz et al., 2006; Ingvarsson, 2008; Wegrzyn et al., 2010). We hypothesize that data from these studies may not have been sufficiently extensive to develop accurate expectations on genome-wide patterns of LD. Consistent with this, analyses of resequencing data for 372 randomly sampled gene fragments in P. trichocarpa’s close relative P. balsamifera resulted in dramatically variable LD estimates that were, on average, higher than previously reported for other Populus species (Olson et al., 2010). More importantly, estimates of LD based exclusively on very closely spaced SNPs may not be appropriate for the prediction of longer range LD (Kim et al., 2007). This is presumably because gene conversion may have a substantial effect on short-range but not on long-range LD (Andolfatto & Nordborg, 1998), and because LD estimates at different distances reflect Ne over different time scales (i.e. longer range LD is expected to reflect more recent Ne; Tenesa et al., 2007). The simple regression model that is commonly used to summarize the decay of r2 with physical distance (Hill & Weir, 1988) does not account for these two factors, and predictions of long-range LD based on fitting this model to short-range LD data should be treated with caution.
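For reference, the expectation from Hill & Weir (1988) is commonly written as E[r2] = [(10 + ρ)/((2 + ρ)(11 + ρ))] × [1 + (3 + ρ)(12 + 12ρ + ρ2)/(n(2 + ρ)(11 + ρ))], where ρ = 4Nec and n is the number of chromosomes sampled. The sketch below evaluates this expression over physical distance, assuming a constant recombination rate per base pair. The function names and all parameter values (Ne = 5000, 1e-8 recombination events per bp per generation) are hypothetical and are not estimates from this study.

```python
def hill_weir_expected_r2(rho, n_chromosomes):
    """Expected r^2 under drift-recombination equilibrium for rho = 4*Ne*c,
    with the finite-sample correction for n_chromosomes sampled chromosomes."""
    first = (10.0 + rho) / ((2.0 + rho) * (11.0 + rho))
    second = 1.0 + ((3.0 + rho) * (12.0 + 12.0 * rho + rho ** 2)) / (
        n_chromosomes * (2.0 + rho) * (11.0 + rho)
    )
    return first * second


def expected_r2_at_distance(distance_bp, ne, c_per_bp, n_chromosomes):
    """Expected r^2 at a physical distance, assuming a uniform recombination
    rate c_per_bp (per bp per generation); this uniformity is an assumption."""
    rho = 4.0 * ne * c_per_bp * distance_bp
    return hill_weir_expected_r2(rho, n_chromosomes)


if __name__ == "__main__":
    for d in (100, 1000, 10000, 100000):
        print(d, round(expected_r2_at_distance(d, 5000, 1e-8, 32), 4))
```

Because the curve has a single free parameter (ρ per unit distance), fitting it to SNP pairs a few hundred base pairs apart fixes its entire shape, which is why extrapolation to longer distances is fragile if gene conversion or changes in Ne over time affect short- and long-range LD differently.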
As a reflection of the surprisingly slow decay of r2 with physical distance between loci, estimates of Ne from LD (i.e. c. 4000–6000, depending on the genetic map and estimation method used) are remarkably low, considering the vast census sizes of the P. trichocarpa populations sampled. These Ne estimates are also over 20 times lower than those obtained for European aspen (P. tremula) based on putatively neutral nucleotide diversity (Ingvarsson, 2008), and three to four times lower than those that would be obtained for P. trichocarpa based on similar assumptions and estimates of neutral nucleotide diversity from previous studies (Gilchrist et al., 2006; Tuskan et al., 2006). The dramatic contrast between Ne estimates for P. tremula and P. trichocarpa may be a reflection of the substantial differences in life history between aspens and cottonwoods (e.g. greater propensity for asexual reproduction in aspens, possibly resulting in a ‘Meselson effect’; Balloux et al., 2003; Slavov & Zhelev, 2010). Furthermore, because LD-based estimates of Ne presumably reflect more recent population history than those from nucleotide diversity (Tenesa et al., 2007), and because P. trichocarpa populations are likely to have experienced severe bottlenecks and/or founder effects over the last few hundred generations (Soltis et al., 1997), our low estimates of Ne from LD are not implausible. Similar to the geographic patterns of allele frequency variation and heterozygosity, Ne estimates by subpopulation (Fig. 6) are consistent with a scenario of recolonization from multiple glacial refugia. These estimates also indicate that the Willamette and Columbia subpopulations may be located in a ‘melting pot’ (Petit et al., 2003) of haplotype diversity from multiple refugia.
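The link between slow LD decay and low Ne can be sketched with the widely used approximation E[r2] ≈ 1/(1 + 4Nec) + 1/n, where c is the genetic distance between loci in Morgans and n is the number of chromosomes sampled. The code below inverts this relationship; it is an assumption made for illustration and not necessarily the estimator described in Methods S3. The function names and input values are hypothetical.

```python
def expected_r2(ne, c_morgans, n_chromosomes):
    """Approximate expected r^2 for loci c_morgans apart, including the
    1/n term that accounts for finite sample size."""
    return 1.0 / (1.0 + 4.0 * ne * c_morgans) + 1.0 / n_chromosomes


def estimate_ne_from_r2(mean_r2, c_morgans, n_chromosomes):
    """Invert the approximation above to obtain Ne from a mean r^2 value
    observed for SNP pairs separated by c_morgans."""
    adjusted = mean_r2 - 1.0 / n_chromosomes  # remove the sampling contribution
    if not 0.0 < adjusted <= 1.0:
        raise ValueError("adjusted r^2 must lie in (0, 1] to give a valid Ne")
    return (1.0 / adjusted - 1.0) / (4.0 * c_morgans)


if __name__ == "__main__":
    # Hypothetical values: Ne = 5000, loci 0.01 cM apart, 32 chromosomes.
    r2 = expected_r2(5000, 1e-4, 32)
    print(r2, estimate_ne_from_r2(r2, 1e-4, 32))
```

Under this approximation, a higher mean r2 at a given genetic distance translates directly into a lower Ne, and pairs of loci at larger genetic distances are informative about more recent generations, as noted above.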
In addition to its dramatic variation among subpopulations, LD also varied substantially across the genome. The epigenetic and DNA sequence variables identified as best predictors of historical rates of recombination (4Nec) were similar to those in other plants (Drouaud et al., 2006; Kim et al., 2007; Gore et al., 2009; Liu et al., 2009; Branca et al., 2011). However, the correlations estimated tended to be stronger and the combined explanatory power of the variables considered higher than in these studies. Furthermore, the clear pattern of nonrandom occurrence of recombination hotspots relative to genes (Table 2) was consistent with that detected based on detailed studies in humans (McVean et al., 2004; Myers et al., 2005). The ability to detect these patterns attests to the precision of our estimates of fine-scale recombination rates, and suggests that this information will probably be useful in both fundamental studies of DNA- and chromosome-level evolution and in applied molecular breeding.
In summary, our analyses of extensive genome resequencing and SNP genotyping data for a broad collection of P. trichocarpa trees led to two main novel findings. First, significant population genetic structure appears to be present at multiple spatial scales, and latitudinal allele frequency gradients are surprisingly common across the genome (Figs 1–3). Second, genome-wide LD extends over much larger physical distances than expected on the basis of previous smaller scale studies (Figs 4–6). These results have several implications. First, geographic patterns of population structure, allele frequency gradients and estimates of Ne from LD consistently suggest that genetic drift has played a significant role in the recent evolutionary history of P. trichocarpa, with post-glacial recolonization from multiple refugia appearing more plausible than recolonization from a single southern refugium. Second, the striking genome-wide ubiquity of high-LD regions and SNPs with latitudinal allele frequency gradients indicates that reliable selection scans in P. trichocarpa will require the development of ad hoc statistical tests integrating multiple molecular signatures. Finally, the commonly made assumptions that population structure is weak and LD decays within 1–2 kb in forest trees (Neale & Ingvarsson, 2008; Neale & Kremer, 2011) may not hold universally across species of Populus. This will need to be reflected in the analyses of data from ongoing and in the design of future association studies. For example, statistically significant marker–phenotype associations will need to be interpreted in the specific context of local LD (i.e. the actual causative polymorphism(s) may, in some cases, be kilobases away from the markers used to detect the association). Most significantly, the extensive LD we detected makes genome-wide association studies and genomic selection in P. trichocarpa much more feasible than previously assumed (Neale & Ingvarsson, 2008; Neale & Kremer, 2011), especially given the increasing cost-effectiveness of SNP genotyping.
Fig. S1 Population structure in Populus trichocarpa based on 1000 randomly chosen Infinium SNP loci (one per candidate gene) and model-based clustering using v. 2.2 of the STRUCTURE program.
Fig. S2 Linear associations between genetic (FST, based on Infinium SNP data, 22 280 loci) and geographic distances among subpopulations of Populus trichocarpa.
Fig. S3 Correspondence of r2 values calculated based on estimated haplotype frequencies (i.e. using PHASE) to those based on genotypic correlations.
Fig. S4 Effect of population structure on estimates of linkage disequilibrium (LD, r2) based on genome resequencing and Infinium SNP data (a) and LD-based estimates of effective population size (Ne) from genome resequencing data (b).
Fig. S5 Linkage disequilibrium for physically unlinked loci.
Fig. S6 Correlates of recombination in P. trichocarpa.
Fig. S7 Recombination rates in 1-Mb windows estimated from a dense SNP linkage map and from genome resequencing data.
Table S1 Summary of genome resequencing data for 16 Populus trichocarpa trees
Table S2 Summary statistics for Sanger resequencing data for 10 candidate genes in 47 P. trichocarpa trees
Table S3 Genetic (below diagonal, FST, based on Infinium SNP data, 22 280 loci) and geographic (above diagonal, km) distances among 10 subpopulations of Populus trichocarpa
Table S4 100-kb window correlation matrix of recombination correlates in Populus trichocarpa
Table S5 1-Mb window correlation matrix of recombination correlates in Populus trichocarpa
Table S6 Multiple linear regression model for log10(4Nec) analyzed in 100-kb windows
Table S7 Multiple linear regression model for log10(4Nec) analyzed in 1-Mb windows
Methods S1 Infinium SNP data.
Methods S2 Correlates of recombination.
Methods S3 Estimates of effective population size from linkage disequilibrium.
Please note: Wiley-Blackwell are not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing material) should be directed to the New Phytologist Central Office.