Association genetics of the loblolly pine (Pinus taeda, Pinaceae) metabolome


Author for correspondence:
David B. Neale
Tel: +1 530 754 8431


  • The metabolome of a plant comprises all small molecule metabolites, which are produced during cellular processes. The genetic basis for metabolites in nonmodel plants is unknown, despite frequently observed correlations between metabolite concentrations and stress responses. A quantitative genetic analysis of metabolites in a nonmodel plant species is thus warranted.
  • Here, we use standard association genetic methods to correlate 3563 single nucleotide polymorphisms (SNPs) to concentrations of 292 metabolites measured in a single loblolly pine (Pinus taeda) association population.
  • A total of 28 single locus associations were detected, representing 24 and 20 unique SNPs and metabolites, respectively. Multilocus Bayesian mixed linear models identified 2998 additional associations for a total of 1617 unique SNPs associated to 255 metabolites. These SNPs explained sizeable fractions of metabolite heritabilities when considered jointly (56.6% on average) and had lower minor allele frequencies and magnitudes of population structure as compared with random SNPs.
  • Modest sets of SNPs (n = 1–23) explained sizeable portions of genetic effects for many metabolites, thus highlighting the importance of multi-SNP models to association mapping, and exhibited patterns of polymorphism consistent with being linked to targets of natural selection. The implications for association mapping in forest trees are discussed.


A central goal of genetics is to dissect phenotypes into their genetic components, especially those underlying ecologically relevant phenotypes (Stinchcombe & Hoekstra, 2008). Knowledge about the genes underlying a diversity of phenotypes has accumulated in tandem with developments in forward genetic methodologies and technologies such as high-throughput genotyping and phenotyping platforms. Much of this work in nonmodel species has focused on forest trees, where genetic associations have been discovered for a range of phenotypic traits (Neale, 2007; Neale & Ingvarsson, 2008), many of which are clearly adaptive (Neale & Kremer, 2011). Missing from most of these studies, however, are investigations of how genotypes relate to cellular phenotypes, such as gene expression and metabolite concentrations. Yet, it is polymorphisms that affect cellular phenotypes that often give the strongest signals in genome-wide association studies in humans (Nicolae et al., 2010), and novel links between genotypes and phenotypes have been elucidated through analysis of this type of trait for plant species (Kirst et al., 2004, 2005; Wentzell et al., 2007; West et al., 2007; Potokina et al., 2008; Chan et al., 2010a,b; Drost et al., 2010).

A number of studies for plants have dissected transcript abundance (i.e. gene expression) into its genetic components (Joosen et al., 2009). The focus of these genetical genomic studies (Jansen & Nap, 2001), as exemplified by studies of forest tree transcriptomes, is a description of gene expression variation among individuals (Holliday et al., 2008; Palle et al., 2011), analysis of a set of candidate genes and the cis-acting regulatory polymorphisms affecting gene expression (Thumma et al., 2009; Beaulieu et al., 2011), and the quantitative genetic analysis of overall gene expression patterns (Kirst et al., 2004, 2005). These analyses have discovered a substantial number of expression quantitative trait loci (eQTLs), highlighted the importance of cis- and trans-regulation for transcript abundance, and shown the complex genetic architecture underlying gene expression. This complexity has been observed across a wide range of organisms and has been attributed to dominance effects, changes in the strength of purifying selection across gene networks, and epistatic interactions among genes comprising metabolic pathways (Kacser & Burns, 1973, 1981; Hartl et al., 1985; Keightley, 1989; Whitlock et al., 1995; Bost et al., 1999; Kondrashov & Koonin, 2004; Rowe et al., 2008; Ramsay et al., 2009). Links between eQTLs and metabolite QTLs have also established interactions among cellular phenotypes, suggesting that natural variation for metabolites may create feedback loops to the transcriptome (Wentzell et al., 2007). Cellular phenotypes beyond gene expression, therefore, should be investigated during scans attempting to dissect adaptive traits into their genetic components.

The metabolome of a tree represents the entire set of small molecule metabolites, which are produced through cellular processes. In model plants, quantitative genetic analysis of a range of metabolites has established that they also have a complex genetic architecture, as well as identifying polymorphisms underlying variation in metabolite concentrations (Schauer et al., 2008; Kliebenstein, 2009; Chan et al., 2010a,b). By contrast, previous work on metabolites for forest trees has primarily focused on profiling (Fiehn, 2002; but see Robinson et al., 2007; Külheim et al., 2011), where metabolites are studied in an experimental design suited to testing specific hypotheses regarding their function. These studies have established correlations between metabolite concentrations and adaptive, whole-plant phenotypes such as drought-stress responses (e.g. Schwanz & Polle, 2001), seed dormancy (reviewed by Finkelstein et al., 2008) and disease resistance (reviewed by Witzell & Martin, 2008). The genetic basis of variation in metabolite concentrations, however, remains largely unknown for forest trees, despite the clear genetic basis for other plants (Kliebenstein et al., 2001; Keurentjes et al., 2006; Rowe et al., 2008; Kliebenstein, 2009; Chan et al., 2010a,b). Thus, an association analysis using a large and representative sample for the gene space and metabolome for a forest tree is warranted.

Life history characteristics of forest trees make them amenable to complex trait dissection using forward genetic approaches (Neale & Savolainen, 2004). Genetic associations with cold-hardiness (Ingvarsson et al., 2008; Eckert et al., 2009a; Holliday et al., 2010), wood properties (Thumma et al., 2005, 2009; González-Martínez et al., 2007; Dillon et al., 2010; Beaulieu et al., 2011), lignin content (Wegrzyn et al., 2010), drought-stress responses (González-Martínez et al., 2008; Cumbie et al., 2010), disease resistance (Quesada et al., 2010) and secondary metabolites (Külheim et al., 2011) have been discovered for a variety of forest tree species. In general, the effect sizes are small, effects are additive, and associated markers span coding and noncoding portions of genes. These results are largely consistent with quantitative genetic theory describing the genetic architecture of complex traits (Hill et al., 2008).

Much of the association mapping for forest trees has focused on loblolly pine (Pinus taeda). This species is the most important commercial forest tree species growing in the southern United States, and has extensive genetic resources developed that range from deep expressed sequence tag (EST) libraries, from which thousands of single nucleotide polymorphisms (SNPs) have been discovered through resequencing, to multiple association populations. Here, we take an association mapping approach to dissect the loblolly pine metabolome into its genetic components. We establish that concentrations for many metabolites are heritable, can be associated significantly with SNPs, and that multi-SNP models can explain large portions (i.e. > 50%) of these heritabilities. We show, moreover, that SNPs associated with at least one metabolite display nonrandom attributes with respect to the entire dataset and discuss the relevance of this pattern to association mapping for nonmodel species.

Materials and Methods

Focal species

Loblolly pine (Pinus taeda L.) is distributed throughout the southeastern United States, with a range extending from Texas to Delaware (Fig. 1). Its range is divided primarily by the Mississippi River Valley (cf. Soltis et al., 2006) and is characterized by broad environmental and climatic gradients. Estimates of population structure (n = 3059 SNPs and n = 23 nuclear microsatellites) are consistent with these patterns (Eckert et al., 2010b), identifying three to eight genetic clusters depending on the method employed, low overall differentiation among populations (FST = 0.02–0.04) and correlations of genetic structure to temperature and precipitation gradients (Eckert et al., 2010a).

Figure 1.

The distribution and genetic structure of loblolly pine (Pinus taeda). Circles indicate counties in which maternal trees were sampled (n = 1–22 trees per county) with colors denoting membership in five genetic clusters as inferred using hierarchical clustering on the top four genetic principal components (PCs) derived from a principal components analysis (PCA) on the full set of 3563 single nucleotide polymorphisms (SNPs).

The association population

Rooted cuttings from 445 unrelated trees sampled across the natural range of loblolly pine (Fig. 1) were established in a randomized complete block design in a common garden located at North Carolina State University (NCSU) during the spring of 2006. These trees are largely first-generation selections and are georeferenced by county of origin (n = 155 counties). Climate data (monthly minimum and maximum temperatures and precipitation) for maternal tree environments were obtained from the PRISM Group, as described by Eckert et al. (2010a), with values calculated per county by averaging across all 800 × 800 m tiles nested within counties. Further information about this population is available in Cumbie et al. (2010).

Metabolite phenotyping

Approximately 2 g of woody tissue was collected from the last flush of the second-year growth from each tree and was immediately flash-frozen in liquid nitrogen. Two ramets were sampled for each clone. Samples were taken between 10:00 h and 13:00 h over 3 d in March 2008. Wood samples were clipped using hand pruners and stored in a 50 ml screw-cap tube containing liquid nitrogen, until tissue pulverization was carried out using a GenoGrinder 2000 beadmill (OPS Diagnostics, LLC, Lebanon, NJ, USA). After pulverization at −20°C, each ramet was divided into two technical replicates, thus providing four replicates (two biological, two technical) per clone.

Gas chromatography coupled with time-of-flight mass spectrometry (GC-TOF-MS) was used to detect metabolites in extracts derived from pulverized xylem tissue. Extractable metabolites were obtained using a standard plant extraction protocol consisting of a chloroform : methanol : water (2 : 5 : 2 volume ratios) extraction at −20°C followed by an acetonitrile : water (1 : 1 volume ratio) clean-up step. Analysis was performed using an Agilent 6890 N gas chromatograph (Palo Alto, CA, USA) interfaced to a time-of-flight Pegasus III mass spectrometer (Leco, St Joseph, MI, USA). Automated injections were performed with a programmable robotic Gerstel MPS2 multipurpose sampler (Mülheim an der Ruhr, Germany). Mass spectra were acquired at 20 scans s−1 with a mass range of 50–500 m/z. We also used internal standards (fatty acid methyl esters) that covered the full retention time range to correct for potential injector discrimination between high-boiling and low-boiling compounds. Data acquisition and analysis were conducted at the UC Davis Metabolomics Core Facility.

Resulting GC-TOF-MS data were processed following the methods outlined by Fiehn et al. (2008). In brief, initial GC-TOF-MS peak detection and mass spectrum deconvolution were performed with ChromaTOF software version 2.25 (Leco). A reference chromatogram was defined that had the maximum of detected peaks over a signal : noise threshold of 20. This was used subsequently for automated peak identification based on mass spectral comparison to standard and in-house customized mass spectral libraries. Mass spectra were searched against custom spectrum libraries (e.g. the Fiehn library of 713 unique metabolites) and identified based on retention index and spectrum similarity match. All known artifactual peaks caused by column bleeding or phthalates and polysiloxanes derived from N-methyl-N-trifluoroacetamide hydrolysis were manually identified and removed. Resulting data for each sample were normalized using the total summed soluble metabolite concentration for known metabolites and then logarithmically transformed (base = 10; cf. Fiehn et al., 2008). The units are thus metabolite intensity counts per spectrum normalized to the total metabolic contents (e.g. cps/mTIC). For each metabolite, transformed values > 6 SD from the mean across clones were set to missing data.
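The normalization and outlier-masking steps described above can be sketched as follows. This is a minimal illustration in Python, not the pipeline of Fiehn et al. (2008); the function names and the dictionary layout are hypothetical, and the exact scaling constant used for the cps/mTIC units is not reproduced here.

```python
import math
import statistics

def normalize_and_log(intensities, known):
    """Scale one sample by the summed intensity of its known metabolites, then take log10.

    intensities: dict mapping metabolite name -> raw intensity (hypothetical layout).
    known: names of the identified metabolites used for the normalizing total.
    Non-positive intensities are dropped because their logarithm is undefined.
    """
    total_known = sum(intensities[name] for name in known)
    return {name: math.log10(value / total_known)
            for name, value in intensities.items() if value > 0}

def mask_outliers(values, n_sd=6.0):
    """Set values lying more than n_sd standard deviations from the mean to None."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    sd = statistics.stdev(observed)
    return [None if v is not None and abs(v - mean) > n_sd * sd else v
            for v in values]
```

Because an extreme value inflates both the mean and the SD, a 6 SD rule can only flag observations when the number of clones is reasonably large (roughly 40 or more), which is the case for this dataset.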

Analysis of variance was used to assess clonal effects (H2) for each metabolite, which were estimated as the fraction of the expected mean squares accounted for by clonal identifiers. This estimate of H2, however, will likely be biased because of inflated estimates for genetic variance as a result of genotype × environment (G × E) effects. Metabolites with F-statistics ≤ 1.0 for clone as a fixed effect were discarded from further analysis (Supporting Information, Fig. S1). A mixed linear model was used subsequently to adjust clonal least-square means:

$$y_{ijkl} = c_i + d_j + c_i d_j + r_k(c_i) + e_{ijkl} \qquad \text{(Eqn 1)}$$

where $y_{ijkl}$ is metabolite concentration for the ith clone on the jth date for the kth ramet and the lth technical replicate, $c_i$ is the fixed effect of the ith clone (i = 1–297), $d_j$ is the random effect of the jth evaluation date (j = 1–49, approx. NID(0, $\sigma^2_d$)), $c_i d_j$ is the random interaction between clone and evaluation date (approx. NID(0, $\sigma^2_{cd}$)), $r_k(c_i)$ is the random effect of ramet nested within clone (k = 1–2, approx. NID(0, $\sigma^2_{r(c)}$)), and $e_{ijkl}$ is the residual error (l = 1–2, approx. NID(0, $\sigma^2_e$)). No effort was made to incorporate effects resulting from collection date for the sampled tissue, as collection dates were randomized across metabolite evaluation dates. This model was chosen because of unequal allocation of clones, ramets and technical replicates across metabolite evaluation dates.

SNP genotyping

Total genomic DNA was isolated from needle tissue collected from each of the 445 trees at the USDA National Forest Genetics Laboratory (NFGEL, Placerville, CA, USA) using DNeasy® Plant Kits (Qiagen) following the manufacturer’s protocol. SNPs derived from the Allele Discovery of Economic Pine Traits project (ADEPT2) were used. Further information regarding SNP discovery and annotation is available elsewhere (Methods S1–S3; Eckert et al., 2010b). Genotyping of SNPs utilizing the Infinium platform was carried out at the University of California Davis Genome Center. Arrays were imaged on a Bead Array reader (Illumina, San Diego, CA, USA) and genotype calling was performed using BeadStudio v. (Illumina). Quality of genotype calls was assessed with two Illumina-specific quality measures – the GenCall50 (GC50) and call rate (CR) scores. We used thresholds of 55% for the CR and 0.15 for the GC50 scores for inclusion of SNPs into the initial dataset (cf. Eckert et al., 2009b). Analysis was conducted at the UC Davis DNA Technologies Core Facility.

Common descriptive statistics (expected (Hexp) and observed (Hobs) heterozygosities, Wright’s inbreeding coefficient (FIS) and hierarchical fixation indices (FST)) were estimated for each SNP using the Genetics and Hierfstat (Goudet, 2005) packages in R version 2.11.1 (R Development Core Team, 2010) before association analysis. Estimation of hierarchical F-statistics was performed using counties as populations and the genetic clusters identified previously for loblolly pine as regions (cf. Eckert et al., 2010b). Confidence intervals were determined for hierarchical F-statistics by bootstrapping variance components across SNPs (n = 10 000 replicates). SNPs with |FIS| ≥ 0.25 or minor allele frequency (MAF) ≤ 0.02 were removed from further analysis. These cutoffs were chosen arbitrarily, but with the goals of further reducing deviations from Hardy–Weinberg equilibrium and ensuring that a minimum of eight to 10 copies of the minor allele were present for the single locus association analyses.
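The per-SNP filters described above can be illustrated with a short Python sketch. It assumes genotypes coded as 0, 1 or 2 copies of one allele (None for missing calls) and uses the simple estimators Hexp = 2pq and FIS = 1 − Hobs/Hexp; the R packages named in the text may apply small-sample corrections, so values could differ slightly. Function names are hypothetical.

```python
def snp_summary(genotypes):
    """Return (MAF, H_obs, H_exp, F_IS) for one biallelic SNP.

    genotypes: list of 0, 1 or 2 (copies of one allele), with None for missing calls.
    """
    calls = [g for g in genotypes if g is not None]
    n = len(calls)
    p = sum(calls) / (2 * n)                  # frequency of the counted allele
    maf = min(p, 1 - p)
    h_obs = sum(1 for g in calls if g == 1) / n
    h_exp = 2 * p * (1 - p)                   # Hardy-Weinberg expectation
    f_is = 1 - h_obs / h_exp if h_exp > 0 else float("nan")
    return maf, h_obs, h_exp, f_is

def passes_filters(genotypes, max_abs_fis=0.25, min_maf=0.02):
    """Keep a SNP only if MAF > min_maf and |F_IS| < max_abs_fis."""
    maf, _, _, f_is = snp_summary(genotypes)
    return maf > min_maf and abs(f_is) < max_abs_fis
```

A monomorphic SNP yields an undefined FIS and is rejected by the MAF threshold in any case.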

Association analysis

Single SNP models  A standard single locus association mapping approach was used to identify SNPs associated to metabolites (Price et al., 2006). The approach chosen incorporates corrections for the confounding effects of shared ancestry on genotype–phenotype correlations by first using principal components analysis (PCA) to describe the underlying genetic structure across SNPs (Patterson et al., 2006; Eckert et al., 2010b). Significance of each genetic principal component (PC) was assessed using the Tracy–Widom (TW) distribution and a significance threshold of P = 0.01. Two vectors of ancestry-corrected residuals were subsequently obtained by multiple linear regression on phenotypic and genotypic traits, using the k significant genetic PCs as independent variables. Association between SNP genotypes and metabolites was then described by the squared correlation (r2) between the two vectors of residuals. For each SNP, scored in n individuals, the test statistic was calculated as (n − k − 1) r2, which is approximately χ2-distributed with 1 df (Price et al., 2006). Multiple testing was accounted for by use of a false-discovery rate (FDR) correction (Storey & Tibshirani, 2003).
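The ancestry-corrected single locus test can be sketched in a few lines of Python. This is an illustration of the (n − k − 1) r2 statistic under simplifying assumptions, not the software used for the analysis: data are assumed complete for the SNP and metabolite being tested, the SNP is assumed to be polymorphic after correction, and all function names are hypothetical. The P-value uses the identity that the upper tail of a χ2 distribution with 1 df equals erfc(√(x/2)).

```python
import math

def residualize(y, covariates):
    """Residuals of y after least-squares regression on an intercept plus covariates.

    Gram-Schmidt orthogonalization is used so that no linear algebra library is needed.
    """
    n = len(y)
    basis = []
    for column in [[1.0] * n] + [list(c) for c in covariates]:
        for b in basis:
            coef = sum(ci * bi for ci, bi in zip(column, b)) / sum(bi * bi for bi in b)
            column = [ci - coef * bi for ci, bi in zip(column, b)]
        if sum(ci * ci for ci in column) > 1e-12:   # skip collinear columns
            basis.append(column)
    resid = [float(v) for v in y]
    for b in basis:
        coef = sum(ri * bi for ri, bi in zip(resid, b)) / sum(bi * bi for bi in b)
        resid = [ri - coef * bi for ri, bi in zip(resid, b)]
    return resid

def association_test(genotype, phenotype, pcs):
    """Return (r2, statistic, p_value) for one SNP-metabolite pair.

    pcs: list of the k significant genetic principal components (one list per PC).
    """
    n, k = len(genotype), len(pcs)
    g = residualize(genotype, pcs)
    y = residualize(phenotype, pcs)
    cross = sum(a * b for a, b in zip(g, y))
    r2 = cross * cross / (sum(a * a for a in g) * sum(b * b for b in y))
    stat = (n - k - 1) * r2
    p_value = math.erfc(math.sqrt(stat / 2.0))      # chi-square upper tail, 1 df
    return r2, stat, p_value
```

Because both residual vectors have mean zero after the intercept is removed, the squared correlation reduces to the squared cross-product divided by the product of the sums of squares.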

Multi-SNP models  We also constructed multilocus models for each metabolite using Bayesian linear mixed models incorporating effects of population structure (Yu et al., 2006; Quesada et al., 2010). Metabolomic phenotypes were centered and standardized before analysis. For each metabolite, multilocus models were subsequently constructed from the list of SNPs with significant effects (P ≤ 0.05) for the single locus tests described previously (Quesada et al., 2010).

Model parameters, including 95% credible intervals for SNP effects, were estimated using Markov chain Monte Carlo (MCMC) with 50 000 steps after an initial burn-in of 10 000 steps. Three replicated runs of the MCMC Gibbs sampler were used to verify convergence to similar posterior distributions for each parameter of each model. All linear mixed-model analyses were conducted with the BAMD package in R. For each metabolite, the amount of phenotypic variance explained by the SNPs with 95% credible intervals excluding zero was estimated using the adjusted R2 from a linear model of ancestry-corrected SNP genotypes as explanatory variables and the ancestry-corrected metabolite concentration as the response variable. Additive and dominance effects were estimated for each genetic association as described previously (Eckert et al., 2009a; Methods S4).
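For reference, the adjusted R2 mentioned above takes its standard form, where n is the number of clones with complete data and p is the number of SNPs retained in the model:

```latex
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

As a purely illustrative calculation (the R2 value is hypothetical), a model with p = 11 SNPs fitted to n = 297 clones and an unadjusted R2 of 0.30 would give an adjusted R2 of 1 − 0.70 × 296/285 ≈ 0.273. The penalty grows as more SNPs are retained relative to the number of clones.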


Results

Metabolite phenotyping

A total of 305 metabolites were detected from GC-TOF-MS of pulverized xylem tissue (Table S5). Of these, 86 (28.19%) were known metabolites, including free amino acids, free fatty acids, sugars and a number of organic acids. Removal of metabolites with ≥ 30% missing data and those with F-statistics ≤ 1.0 for clone as a fixed effect in an ANOVA model resulted in a reduced dataset of 292 metabolites assayed for 384 clones (Fig. S1). Intersection of these data with the SNP genotyping data (see the section ‘SNP genotyping’) resulted in the further removal of 87 clones. The final phenotypic dataset for association mapping consisted of 292 metabolites, of which 82 were known, assayed across 297 clones, with amounts of missing data varying from 0 to 26.26% across metabolites (mean, 6.63%; SD, 6.08%).

Clones accounted for a moderate portion of the variance across biological and technical replicates (Tables S1, S2, Fig. S2), c. 32% on average (min, 5.1%; max, 68.0%). There were no significant differences for H2 between known and unknown metabolites (Mann–Whitney U-test: P = 0.619). Pairwise correlations among clonal means across metabolites were small (Fig. S3), with a mean (± 1 SD) Spearman’s ρ of 0.11 (± 0.15). The highest correlations (ρ > 0.60) were between unknown metabolites and between unknown and known metabolites. As expected from the low pairwise correlations, multivariate analysis using PCA, with pairwise deletion of missing data for computation of the correlation matrix, was largely uninformative as the top 10 PCs explained only 38.6% of the overall variance (results not shown).

Clonal means for metabolites were correlated moderately to the source environment of the maternal tree, with Spearman’s ρ values spanning a range of −0.50 to 0.45. Only a small fraction of these correlations (c. 15%) were significantly different from zero, however, using permutation tests (n = 100 000 permutations) and a significance threshold of P = 0.0001.

SNP genotyping

A total of 3938 SNPs genotyped across 409 clones passed the initial quality thresholds based on Illumina quality metrics (Table S6). Further filtering based on the minor allele frequency and Wright’s inbreeding coefficient reduced this to 3613 SNPs, while intersection with the metabolite data (see the section ‘Metabolite phenotyping’) further reduced this to a final dataset of 3563 SNPs genotyped for 297 clones (Fig. S1). Amounts of missing data across SNPs varied from 0 to 22.89%, with an average of 1.41% (SD = 2.70%).

Most of the 3563 SNPs were consistent with Hardy–Weinberg equilibrium expectations (87.20% with P > 0.05) and had moderately high MAFs (60.88% with MAF > 0.10). These SNPs represent 3073 unique EST contigs, with 2596 EST contigs covered by one SNP, 465 EST contigs covered by two SNPs, 11 EST contigs covered by three SNPs and one EST contig covered by four SNPs (Table S7). Of the 3073 EST contigs, 1399, representing 1643 SNPs, are located on a high-density linkage map for loblolly pine (cf., accession: TG091). Linkage disequilibrium (LD), as measured with the r2 and |D′| statistics, among markers was low and was consistent with the coarse coverage of the loblolly pine genome (Table S3). Only 144 pairs of SNPs, out of the 6 345 703 total SNP pairs, had an r2 > 0.90. Eighty of those 144 pairs had both SNPs mapped and the SNPs for these 80 pairs were all within 1.6 cM (mean = 0.18 cM) of each other on the linkage map.
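The number of SNP pairs and a simple pairwise LD measure can be illustrated as follows. The sketch uses the squared Pearson correlation of genotype counts (0, 1 or 2) as a stand-in for r2; this genotypic measure is an assumption made for illustration and is not necessarily the estimator used for Table S3. Function names are hypothetical.

```python
import math

def n_pairs(n_snps):
    """Number of unordered SNP pairs."""
    return math.comb(n_snps, 2)

def genotypic_r2(g1, g2):
    """Squared correlation of genotype counts, using individuals scored at both SNPs."""
    pairs = [(a, b) for a, b in zip(g1, g2) if a is not None and b is not None]
    n = len(pairs)
    m1 = sum(a for a, _ in pairs) / n
    m2 = sum(b for _, b in pairs) / n
    cov = sum((a - m1) * (b - m2) for a, b in pairs)
    v1 = sum((a - m1) ** 2 for a, _ in pairs)
    v2 = sum((b - m2) ** 2 for _, b in pairs)
    return cov * cov / (v1 * v2)
```

For 3563 SNPs, `n_pairs(3563)` reproduces the 6 345 703 pairs reported in the text.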

Of the 3073 EST contigs containing SNPs used for association mapping, 1920 (62.47%) had similarity to gene models in Arabidopsis using a liberal threshold of 1E–05 for e-value associated with the BLASTx results (see Methods S2). Of those 1920 EST contigs, 1487 had Gene Ontology (GO) terms assigned for molecular function (GO:0003824; Fig. S4). Site annotations were deduced for 1611 of the 3563 SNPs that were used for association mapping. Of these 1611 SNPs with site annotations, 380 were nonsynonymous (23.6%), 549 were synonymous (34.1%), and 682 were noncoding (42.3%). Type of SNP had no effect on the MAF (Kruskal–Wallis rank sum test: P = 0.8016).

Association analysis

Population structure  Population structure within our association population was minimal. The multilocus FST estimate using counties as populations was 2.22% (95% bootstrap confidence interval: 1.75–2.71%), with estimates for single SNPs ranging from 0 to 8.68%. Decomposition of this estimate into effects of region (FRT = 1.85%) and counties nested within regions (FCR = 0.37%) revealed that regions accounted for 83.53% of the overall allele frequency differences across loci. These results were supported by a PCA. The top four genetic PCs were significant (P ≤ 0.01) as determined using the TW distribution. These four PCs explained 4.22% of the variance (PC1, 2.10%; PC2, 0.85%; PC3, 0.64%; PC4, 0.63%). Hierarchical classification of trees using these four PCs, which defined five clusters, largely followed that presented elsewhere using larger sample sizes (Fig. 1; Eckert et al., 2010b), with three prominent geographic clusters defined by the first two PCs accounting for most of the population structure (Fig. S5).

Metabolites were largely uncorrelated with the top four genetic PCs (multiple regression R2 values < 0.20), and the concentration for most metabolites (n = 270 of the 292) did not differ significantly among groups made from those PCs. There were 22 metabolites, however, with significant differences among groups (Kruskal–Wallis rank sum tests: P ≤ 0.05) of which five survived multiple test corrections (FDR Q ≤ 0.10). These five metabolites were phosphoric acid, cellobiose, phytol and two unknowns (216832, 199239).

Single SNP models  A total of 1 040 396 correlations were estimated between the 3563 SNPs and 292 metabolites passing our quality thresholds (Fig. S1). Of these correlations, 62 651 (6.1%) were significant, as determined using the (n – k − 1) r2-statistic of Price et al. (2006), at a nominal threshold of P = 0.05. The number of significant tests varied by metabolite (min = 156, max = 357, mean = 215), with 260 of the 292 metabolites (89.0%) having an excess of significant associations over the number expected under the null hypothesis (Fig. S6). This pattern was also apparent in quantile–quantile (Q–Q) plots, where the observed quantiles of the (n − k − 1) r2-statistic were compared against the theoretical quantiles of the χ2-distribution for each metabolite (Fig. 2a).
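The excess of nominally significant tests can be checked with simple arithmetic. The sketch below, in Python, uses only the counts reported in the text; the variable names are hypothetical.

```python
n_snps = 3563
n_metabolites = 292
alpha = 0.05
observed_significant = 62651

n_tests = n_snps * n_metabolites                      # total SNP-metabolite tests
expected_per_metabolite = alpha * n_snps              # expected under the null, per metabolite
expected_total = alpha * n_tests                      # expected under the null, overall
observed_fraction = observed_significant / n_tests    # roughly 6% of all tests
observed_mean_per_metabolite = observed_significant / n_metabolites
```

Under the null hypothesis about 178 SNPs per metabolite would be expected to reach P ≤ 0.05 by chance, compared with the observed mean of about 215.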

Figure 2.

A total of 61 significant genetic associations (FDR Q ≤ 0.10) were identified for the 1 040 396 single locus tests performed for 292 metabolites and 3563 single nucleotide polymorphisms (SNPs) in loblolly pine (Pinus taeda). (a) Quantile–quantile plot of the observed quantiles for the (n − k − 1) r2-statistic and the expected quantiles from a χ2-distribution with 1 df for all 292 metabolites (gray lines). The solid black line gives the expected relationship. (b) The joint distribution of the minor allele frequency (MAF) and FST values for all 3563 SNPs used for association mapping. The eight SNPs associated with at least one known metabolite are plotted as black points and unassociated SNPs as gray points. (c) The distribution of P-values (−log10-transformed) across the genome for single locus association tests performed for phytol. Alternating black and gray points designate linkage groups. The SNP significantly associated to phytol is circled and labeled.

Corrections for multiple testing reduced the 62 651 significant tests to 28 significant associations with an FDR Q ≤ 0.10 (Table 1). Three of these 28 associations also survived a Bonferroni correction with a nominal significance threshold of P = 0.05. The majority of these associations were with unknown metabolites (71.4%), although this was similar to the relative number of unknown metabolites in the targeted set (71.9%). These 28 associations represent 24 unique SNPs and 20 unique metabolites. Of these 24 unique SNPs, three were nonsynonymous, one was synonymous, 14 were noncoding and six were not annotated. Effect sizes were small (Fig. 2b), with the r2 between ancestry-corrected phenotype and genotype vectors ranging from 0.08 to 0.18 (mean, 0.09; SD, 0.02). In terms of H2, however, effect sizes ranged from moderate to large (mean, 26.4%; SD, 11.4%). Effect sizes for SNPs associated significantly to at least one metabolite did not differ between known and unknown metabolites (Mann–Whitney U-test, P = 0.676).
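The two multiple-testing corrections can be illustrated with a short Python sketch. Note that the FDR adjustment shown here is the Benjamini–Hochberg step-up procedure, used as a simpler stand-in for the Storey & Tibshirani (2003) q-value applied in the analysis; the q-value additionally estimates the proportion of true null hypotheses and is therefore no more conservative. Function names are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value threshold that controls the family-wise error rate at alpha."""
    return alpha / n_tests

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With 1 040 396 tests and a nominal threshold of P = 0.05, the Bonferroni threshold is approximately 4.8 × 10⁻⁸.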

Table 1.   Significant associations of single nucleotide polymorphism (SNP) genotypes to metabolites in loblolly pine (Pinus taeda) as determined using single SNP models
| Metabolite | SNP accession^a | Putative protein product^b | n | MAF | r2 | P | Q | r2/H2^c |
|---|---|---|---|---|---|---|---|---|
| Arabitol | SNP_143466-Pita^NC | Protein kinase-related protein | 237 | 0.14 | 0.10 | 2.9E–08 | 7.5E–03 | 0.23 |
| Arachidic acid | SNP_138728-Pita^NS | Plastid-lipid associated protein | 216 | 0.03 | 0.08 | 1.4E–06 | 5.2E–02 | 0.57 |
| Phytol | SNP_143186-Pita^NC | Myo-inositol : hydrogen symporter | 263 | 0.49 | 0.08 | 1.2E–06 | 5.1E–02 | 0.12 |
| Raffinose | SNP_143607-Pita^NC* | MYB domain protein 103 | 214 | 0.04 | 0.11 | 7.2E–09 | 3.3E–03 | 0.24 |
| 199203 | SNP_140792-Pita^NC | Unknown protein | 248 | 0.03 | 0.08 | 4.6E–07 | 3.0E–02 | 0.26 |
| | SNP_138959-Pita^NS | Nodulin MtN21 family protein | 254 | 0.09 | 0.09 | 2.5E–07 | 2.4E–02 | 0.20 |
| 217842 | SNP_142219-Pita^NC | Ubiquitin thiolesterase | 277 | 0.04 | 0.08 | 5.5E–07 | 3.0E–02 | 0.20 |
| | SNP_142220-Pita^NC | Ubiquitin thiolesterase | 277 | 0.04 | 0.08 | 5.2E–07 | 3.0E–02 | 0.20 |
| 220009 | SNP_138748-Pita^NC | Trypsin inhibitor protein | 284 | 0.05 | 0.09 | 2.3E–07 | 2.4E–02 | 0.39 |
| 227345 | SNP_136730-Pita^SY | RNA binding protein (APUM5) | 223 | 0.12 | 0.08 | 5.1E–07 | 3.0E–02 | 0.17 |
| | SNP_138424-Pita^NC | Transducin family protein/WD-40 repeat family protein | 223 | 0.20 | 0.08 | 1.0E–06 | 5.0E–02 | 0.17 |
| | SNP_137041-Pita^NS | Oxygen-binding protein (CYP71B24) | 223 | 0.03 | 0.09 | 2.5E–07 | 2.4E–02 | 0.19 |
| 299833 | SNP_142143-Pita^NC | IND transcription factor | 254 | 0.09 | 0.09 | 1.0E–07 | 1.7E–02 | 0.20 |
| 301484 | SNP_138039-Pita^NC | Pentatricopeptide repeat-containing protein | 224 | 0.20 | 0.09 | 1.6E–07 | 2.4E–02 | 0.21 |
| 318115 | SNP_142064-Pita^NC* | Unknown protein | 275 | 0.03 | 0.12 | 9.6E–09 | 3.3E–03 | 0.56 |
| 335365 | SNP_138245-Pita^NC | Avirulence-responsive family protein | 292 | 0.03 | 0.08 | 1.4E–06 | 5.2E–02 | 0.36 |
| | SNP_138246-Pita^NC | Avirulence-responsive family protein | 292 | 0.02 | 0.09 | 2.0E–07 | 2.4E–02 | 0.41 |

MAF, minor allele frequency.

^a Superscripts after the SNP identifiers refer to the following: NC, noncoding; NS, nonsynonymous; SY, synonymous; UN, not annotated. Asterisks denote those associations surviving a conservative Bonferroni correction at a nominal threshold of P = 0.05.

^b SNPs labeled with NA were located in expressed sequence tag (EST) contigs that did not have significant similarity using Blastx to either the RefSeq proteins in Arabidopsis or published gene models for Picea. Information about the EST contigs containing the associated SNPs is available in Table S7.

^c The proportion of H2, as estimated using the mean squares accounted for by clonal identifiers in an ANOVA framework (see the Materials and Methods section), accounted for by the correlation between ancestry-corrected genotype and phenotype vectors (r2).

^d More information about unknown metabolites is available in Table S2.

Approximately 33.3% (eight of 24) of the SNPs with significant associations to at least one metabolite were mapped genetically, and for mapped SNPs in general, there was little correspondence between degrees of genetic differentiation and effect size or statistical significance. For example, the association between a SNP in a myo-inositol : hydrogen symporter (SNP_143186-Pita) and phytol was localized to a region on linkage group 11 with average to below-average levels of population structure, with the closest marker with an above-average FST located > 10 cM away. The same patterns were also apparent for SNPs that were not mapped, and when considered together there was no correlation between the (n − k − 1) r2-statistic for any metabolite and FST (Spearman’s ρ: −0.10 to 0.15).

Several attributes for SNPs associated to at least one metabolite were not random samples from empirical distributions across the 3563 SNPs used for association mapping (Fig. 3; Methods S5). The mean MAF and multilocus FST values were both lower than the mean across all SNPs. These were also both in the extreme lower tails of the bootstrap distribution (n = 100 000 replicates) for the mean MAF and multilocus FST values estimated from bootstrap subsamples of size 24 SNPs (P < 0.0001). These SNPs were also inconsistent with random draws from the entire set with respect to being synonymous (P = 0.0432), noncoding (P = 0.00121) and unknown (P = 0.039), with too many of the associated SNPs being noncoding and too few being unknown or synonymous. There was no effect for nonsynonymous (P = 0.6130) SNPs. The skew for the MAF was unlikely to be an artifact of population structure corrections, as there were no significant correlations between MAF and the (n − k − 1) r2-statistic (Spearman’s ρ: −0.5 to 0.10), and there was no difference as extreme between the mean MAF associated to at least one metabolite and the mean MAF for 500 randomizations of genotypes with respect to phenotypes (Methods S5; Fig. S7).
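The bootstrap comparison for the mean MAF can be sketched as follows. This Python illustration resamples subsamples of a given size with replacement from the full set of per-SNP values and reports the fraction of subsample means at or below the observed mean; it is a simplified reading of the procedure in Methods S5, and the function name, default replicate count and seed are hypothetical.

```python
import random

def bootstrap_lower_tail(all_values, observed_mean, subsample_size,
                         n_reps=10000, seed=1):
    """Fraction of bootstrap subsample means that are <= the observed mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        # Draw SNP values with replacement from the full set.
        sample = rng.choices(all_values, k=subsample_size)
        if sum(sample) / subsample_size <= observed_mean:
            hits += 1
    return hits / n_reps
```

In the setting described above, `all_values` would hold the MAFs of all 3563 SNPs, `subsample_size` would be 24 and `observed_mean` would be the mean MAF of the 24 associated SNPs. The multilocus FST comparison would require recomputing variance components for each subsample rather than averaging per-SNP values.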

Figure 3.

Single nucleotide polymorphisms (SNPs) associated to at least one metabolite exhibited different minor allele frequencies (MAFs) and magnitudes of genetic differentiation (FST) among loblolly pine (Pinus taeda) populations relative to the entire set of SNPs. (a) Bootstrap distribution (n = 100 000 replicates) for the mean MAF from subsamples of size 24 SNPs sampled with replacement from the entire 3563 SNP set. Dashed line, the observed mean across all SNPs; solid line, the mean MAF for SNPs associated to at least one metabolite. (b) Bootstrap distribution (n = 100 000 replicates) for the multilocus FST calculated from subsamples of size 24 SNPs sampled with replacement from the entire 3563 SNP set. The multilocus FST across all SNPs is labeled with a dashed line, while that for SNPs associated to at least one metabolite is labeled with a solid line. Multilocus FST was calculated by summing the appropriate variance components across loci.

Multi-SNP models  Application of Bayesian linear mixed models, where each metabolite was evaluated against multi-SNP models, revealed a multitude of new genetic associations (Table S8). In total, 3026 associations were detected across metabolites representing 255 unique metabolites and 1617 unique SNPs. Increasing the a priori cutoff from P < 0.05 to P < 0.10 from the single locus tests for inclusion in the multi-SNP models (see the Materials and Methods section) identified only a single additional SNP that was associated to an unknown metabolite, with an effect size that was only marginally significant (95% Bayesian interval: 0.001–0.025). Of the 1617 unique SNPs, 180 were nonsynonymous, 249 were synonymous, 300 were noncoding and 888 were not annotated. Effect sizes for SNPs identified with these models, when analyzed using a single locus test, were approx. threefold lower than SNPs significant in the single locus tests (average r2 = 0.025). As opposed to the single locus results, these SNPs had a mean MAF close to that across all 3563 SNPs, although it was in the lower tail of the bootstrap distribution for the mean MAF for subsamples of 1617 SNPs from the full dataset (Methods S5; Fig. S8). The multilocus FST was also lower than that across all SNPs and was in the lower tail of the bootstrap distribution for subsamples of 1617 SNPs (Fig. S8). Lastly, SNPs associated to at least one metabolite were consistent with random draws from the entire 3563 SNP dataset with respect to being nonsynonymous (P = 0.6763), synonymous (P = 0.5098) and noncoding (P = 0.1886).

All of the SNPs identified as significant in the single locus tests (FDR Q ≤ 0.10) were inferred to have significant effects in multilocus models. The number of SNPs retained in these models varied from zero to 23, with a mean of 11 SNPs per model. A total of 718 of these SNPs were correlated to more than one metabolite (range across all SNPs: 1–14 metabolites). These models explained < 2% to c. 40% of the total phenotypic variance after corrections for population structure per metabolite (Table S1, Figs 4, S9), as well as sizeable portions of the clonal effects (Fig. 4b; mean, 56.6%; SD 27.7%). Patterns were the same for unknown metabolites (Table S2; Fig. S9). In general, most effects were nonadditive, with 437, 842, and 1115 associations consistent with additive, dominance, or over- and underdominance effects, respectively. No differences were observed between unknown and known metabolites for the magnitudes of additive and dominance effect sizes (Fig. S10). Additive and dominance effect sizes were positively correlated with one another and also negatively correlated to the logarithm of MAF (Fig. S11). Effect sizes did not differ strongly among categories of site for additive (Kruskal–Wallis rank sum test: P = 0.0899) or dominance effects (Kruskal–Wallis rank sum test: P = 0.3543).

Figure 4.

Multilocus single nucleotide polymorphism (SNP) models explain a large percentage of the phenotypic variance for many metabolites in loblolly pine (Pinus taeda). (a) Illustrated is the adjusted R2 for marker effects from a linear model with population structure covariates and ancestry-corrected phenotypes as the dependent variable. The adjusted R2 ($\bar{R}^2$) was calculated as: $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - k - 1}$, where k is the number of independent predictors, n is the sample size, and R2 is the coefficient of determination for the set of SNPs in the linear model. The gray line and points denote how many SNPs are in the linear model, which for each metabolite was the set of SNPs identified using the Bayesian mixed linear model in BAMD. The same patterns were also seen for the unknown metabolites. (b) The fraction of H2 (see the Materials and Methods section) accounted for by the adjusted R2 for each of the known metabolites.
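The two quantities plotted in Fig. 4 can be computed as in the sketch below, which uses the standard adjusted R2 with k predictors and sample size n. The function names and all numerical inputs (R2, n, k and H2) are hypothetical values chosen for illustration and are not taken from the study.

```python
def adjusted_r2(r2, n, k):
    """Standard adjusted R2 for a linear model with k predictors and n samples."""
    if n - k - 1 <= 0:
        raise ValueError("need n > k + 1 to compute the adjusted R2")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def fraction_of_h2(r2_adjusted, h2):
    """Fraction of broad-sense heritability accounted for by the SNP set."""
    if h2 <= 0:
        raise ValueError("H2 must be positive")
    return r2_adjusted / h2


# Hypothetical metabolite: 11 SNPs in the model, 300 trees, R2 = 0.30, H2 = 0.50.
r2_adj = adjusted_r2(r2=0.30, n=300, k=11)
h2_fraction = fraction_of_h2(r2_adj, h2=0.50)
print(f"adjusted R2 = {r2_adj:.3f}; fraction of H2 = {h2_fraction:.3f}")
```

Because the penalty grows with k, a model with more SNPs needs a larger raw R2 to reach the same adjusted value, which is why the number of SNPs per model is shown alongside the adjusted R2.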


Summary of genetic associations

This is the first large-scale genetic dissection of metabolite phenotypes in a nonmodel plant species. The number of detected associations reported here is large relative to previous association studies for forest trees (Neale & Kremer, 2011), but is similar to that for the Arabidopsis metabolome (e.g. Chan et al., 2010a,b). Interestingly, 93 SNPs were associated to five or more metabolites, which suggests the existence of hotspots for metabolite regulation (Chan et al., 2010a,b), but evidence for epistasis among these markers was lacking (e.g. LD among unlinked SNPs; cf. Rowe et al., 2008). Effect sizes were small to moderate and were consistent with the L-shaped distribution of effect sizes expected for metabolic QTLs (Bost et al., 1999), as SNPs from single marker models tended to have much larger effect sizes than those detected only in multi-SNP models. Most associations exhibited nonadditive effects (cf. Kacser & Burns, 1981; Külheim et al., 2011) and involved both noncoding and coding polymorphisms.

Estimates of H2 were consistent with those reported for cellular phenotypes for model plants (Keurentjes et al., 2006), but were larger than those based on 10 elite Douglas-fir (Pseudotsuga menziesii) families replicated across two environments (Robinson et al., 2007). Our estimates of H2 were likely confounded with sampling across multiple time points (Chan et al., 2010a) and microscale G × E effects in the common garden (Robinson et al., 2007; Cumbie et al., 2010). Even the SNPs with the largest effect sizes (Table 1) explained only small to moderate portions of H2 across metabolites (e.g. typically < 50%). When taken together, however, multi-SNP models explained sizeable portions of H2 (e.g. on average > 50%, cf. Yang et al., 2010), thus highlighting the importance of multi-SNP models to complex trait dissection (cf. Holliday et al., 2010). Although our estimates of effect sizes were biased upward (Xu, 2003), large portions of H2 can still be accounted for when we consider a several-fold overestimation of marker effects resulting from small sample sizes (e.g. as in Ingvarsson et al., 2008).

Loblolly pine is one of the most extensively studied plants with respect to the dissection of quantitative traits using association genetic methodologies. To date, 34 genetic associations have been published for wood properties (n = 6, González-Martínez et al., 2007), carbon isotope discrimination (n = 4, González-Martínez et al., 2008; n = 7, Cumbie et al., 2010), height (n = 1, Cumbie et al., 2010), foliar nitrogen content (n = 6, Cumbie et al., 2010), quantitative resistance to pitch canker (n = 10, Quesada et al., 2010), and climate (n = 54, Eckert et al., 2010a,b). The SNPs associated to at least one metabolite and those reported previously were often the same: 10 of the 14 SNPs from Cumbie et al. (2010), five of the 10 SNPs from Quesada et al. (2010), and 15 of the 54 SNPs from Eckert et al. (2010a,b) were associated to at least one metabolite. These shared associations were often consistent given functional information from model plant species and are not likely artifacts resulting from homology problems as the sampled trees and targeted SNPs used for association mapping were largely the same for these studies. As an example, one of the 17 SNPs associated to trehalose (SNP_138156-Pita) was also associated to the first climate PC (primarily temperature variables) from Eckert et al. (2010a). This SNP is located in a gene encoding a nodulin MtN21 family protein whose putative ortholog in Arabidopsis is up-regulated by application of exogenous trehalose (Bae et al., 2005). Accumulation of trehalose increases tolerance of low temperatures for many plants (Fernandez et al., 2010), as evidenced by the negative correlation of trehalose and minimum monthly temperature.

Functional interpretations of genetic associations

Many of the correlations between metabolite concentrations and maternal tree environments were consistent with profiling experiments and suggest an adaptive genetic basis for variation in metabolite concentrations. Rigorous interpretation of these correlations, however, is suspect, as only a single developmental time point in a single common garden was investigated and many of the correlations were not statistically significant, although many were functionally consistent. For example, sugars and sugar alcohols in plants accumulate under osmotic stress (e.g. Tarczynski et al., 1993; Taji et al., 2002; Zuther et al., 2004). Here, mannitol, a sugar alcohol, was most correlated with summer precipitation and was associated with six SNPs from the BAMD analysis, five of which are located in candidate genes for drought-stress response (Table S4). These results may thus provide a bridge between metabolite profiling and the dissection of complex adaptive traits.

Many of the genetic associations were also consistent with biochemical links between gene products and metabolites for model plant species. Some of the best examples were for metabolites involved in lignin biosynthesis. Coniferin, as an example, is a glucoside of coniferyl alcohol acting as an intermediate metabolite during cell wall lignification (Tsuji et al., 2005; Espiñeira et al., 2011). Coniferin was associated significantly with 13 SNPs, many of which reside in genes related to lignin biosynthesis or cell wall deposition. The most striking examples of these were a synonymous SNP located at a locus encoding an R2R3-MYB transcription factor, whose putative ortholog in Arabidopsis affects lignin deposition (Newman et al., 2004), and a synonymous SNP located at a locus encoding an α-xylosidase, whose putative ortholog in Arabidopsis affects the structure and accessibility of xyloglucan (Günl & Pauly, 2010).

The biochemical links between gene products and metabolites were also convoluted for many of the genetic associations, yet in some cases could have reflected the complex interactions of multiple metabolic pathways. As an example, the functional link between a noncoding SNP located in a myo-inositol symporter and phytol is not directly apparent. Phytol is an acyclic diterpene alcohol and one of the first products derived from hydrolysis of chlorophyll. The putative ortholog of this locus in Arabidopsis is a tonoplast embedded myo-inositol symporter that regulates efflux of inositol-derived compounds out of the vacuolar lumen (Schneider et al., 2008). Downstream catabolites of phytol are localized to the vacuole. A direct link between inositol and these catabolites is lacking (Hörtensteiner, 2006). Phytol, however, is salvaged as input into other biosynthetic pathways, notably biosynthesis of tocopherols and phylloquinones (Ischebeck et al., 2006). Inositol-derived compounds also affect these pathways (Furuya et al., 1987).

Population and quantitative genetics of associations

Genetic associations often involve SNPs located in distinctive portions of genomes (Orozco et al., 2009; Chan et al., 2010a,b; Castro & Feldman, 2011). It will not be possible to perform similar analyses for loblolly pine until genome-wide resources are available. The SNPs involved in our associations, however, did exhibit several distinctive attributes.

The set of associated SNPs had lower multilocus FST than randomly chosen sets of SNPs. This pattern is likely a by-product of population structure corrections (Yu et al., 2006), which highlights the intrinsic problem of such corrections to identify causal SNPs with high FST. Overcorrection for structure will lead to high false-negative rates (Mezmouk et al., 2011), while under-correction will lead to high false-positive rates (Marchini et al., 2004). We may be suffering from both types of errors, as population structure was only accounted for by using the top four genetic PCs. With respect to false negatives, single locus tests without any correction for population structure identified 64 additional SNPs that were largely correlated with overall population structure. This is the signal expected as a result of confounding by population structure (Mezmouk et al., 2011). With respect to false positives, associated SNPs were less differentiated than randomly chosen SNPs (Fig. 3b). We cannot be sure, however, that residual population structure is not affecting the results. Sampling design of source materials established within common gardens, and its effect on results from association analyses, thus merits further attention (cf. Jansson & Ingvarsson, 2010).

Associated SNPs had lower MAFs on average than randomly selected SNPs. This could reflect a high frequency of false-positive associations, as statistical attributes of frequentist tests for association are correlated to sample size and MAF (Wakefield, 2009). Here, the effect of enrichment for low MAF in the set of associated SNPs was seen even for the associations identified using the Bayesian approach, which tends to be less sensitive to assumptions regarding sample size and MAF (Stephens & Balding, 2009), and there was no significant correlation between MAF and the (n − k − 1) r2-statistic across metabolites (Spearman’s ρ: −0.05–0.08). Our associations, therefore, are not likely dominated by false positives.
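The rank correlation used for this check can be sketched as below. This is an illustrative implementation of Spearman's ρ with average ranks for ties, not the code used in the study; the function names and the example MAF and test-statistic values are invented for the example.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the full run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def pearson_r(x, y):
    """Pearson correlation of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson_r(average_ranks(x), average_ranks(y))


# Invented MAFs and test statistics for six SNPs tested against one metabolite.
example_mafs = [0.05, 0.12, 0.20, 0.33, 0.41, 0.48]
example_stats = [3.1, 0.8, 2.4, 1.9, 3.6, 1.2]
rho = spearman_rho(example_mafs, example_stats)
print(f"Spearman's rho = {rho:.3f}")
```

Applied per metabolite, a set of ρ values close to zero would indicate that the test statistic does not track MAF, which is the pattern reported in the text.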

The enrichment for SNPs with low MAF is also expected for synthetic associations, where linkage between rare causal and more common variants assayed through genotyping (i.e. our ‘rare’ SNPs) drives the signal of association (Dickson et al., 2010; but see Wray et al., 2011). It is reasonable to expect that causal variants for variation in metabolite concentrations are rare and that larger effect sizes are correlated with low MAF as a result of patterns of purifying selection across metabolite networks (Fig. S11; Kacser & Burns, 1981; Keightley, 1989; Eyre-Walker, 2010). The question of whether the observed associations reflect neutral SNPs linked to an undetected rare causal variant (i.e. synthetic associations) or are the actual targets of selection themselves remains open. For the former, the FST among populations for associated SNPs is expected to be larger relative to neutral SNPs (Charlesworth et al., 1997; but see Pamilo et al., 1999). This was not observed (Fig. 3b). We are unaware of any study addressing this expectation, however, for SNPs identified using association genetic methodologies with population structure corrections employed. For the latter, lower FST for associated SNPs is expected, but the degree of selection would have to be similar across nonsynonymous, synonymous and noncoding sites (cf. Casillas et al., 2007), as associations were detected for all three types of SNPs. Whether enrichment for SNPs with low MAF in our associations is the result of intrinsic properties of the loci underlying metabolic phenotypes or an artifact of coarse genomic resolution awaits the further development of genomic resources for loblolly pine (i.e. an annotated genome sequence; cf. Goldstein, 2011). This is because scans for associations only identify targets of interest so that the SNPs reported here require further statistical and functional validation. The rarity of associated and/or causal SNPs, moreover, has several biological explanations (cf. Huff et al., 2011), which implies that while biological signal may drive many of the observed associations, functional interpretations are difficult without experimentation.

Conclusions and future directions

Forward genetic dissection of complex traits for forest trees is feasible, at least in the sense of identifying genomic regions of interest that may underlie adaptive phenotypes, and has provided insights into some of the genes underlying a multitude of quantitative traits (Neale & Kremer, 2011). Here, we provide the first large-scale association study for cellular phenotypes in a conifer. We establish that metabolites, even those not in standard spectrum libraries, are heritable and that for many metabolites a large fraction (i.e. > 50%) of this heritability is explained by relatively few SNPs (n = 1–23). This assessment, however, is based solely on a single common garden, and previous work has shown the importance of site relative to genetics in determining metabolite profiles for forest trees (e.g. Robinson et al., 2007). Future work for forest tree metabolomics, therefore, should focus on replicated sampling across different environments, as well as the targeting of specific pathways suitable for testing a priori hypotheses about the genetic architecture of ecologically relevant metabolites (e.g. Külheim et al., 2011). Identification of causal variants underlying this architecture, however, may require deeper resequencing of genomic regions tagged by associations to more common variants genotyped with existing resources (e.g. Coventry et al., 2010). This implies that the search for molecular variants underlying ecologically relevant traits in forest trees and other nonmodel species will benefit from the further development and careful application of next-generation sequencing technologies.


The authors would like to thank Charles Nicolet and Vanessa Rashbrook for the SNP genotyping, Katie Tsang for laboratory support, Gabriel Rosa for computational support, John Liechty and Ben Figueroa for bioinformatics support, and three anonymous reviewers whose comments greatly improved this manuscript. This work was supported by a grant from the National Science Foundation (IOS-PGRP-0501763).