A microarray-based genotyping and genetic mapping approach for highly heterozygous outcrossing species enables localization of a large fraction of the unassembled Populus trichocarpa genome sequence

Authors

  • Derek R. Drost,

    1. Graduate Program in Plant Molecular and Cellular Biology, University of Florida, Gainesville, FL 32611, USA
    2. School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
  • Evandro Novaes,

    1. School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
  • Carolina Boaventura-Novaes,

    1. School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
  • Catherine I. Benedict,

    1. School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
  • Ryan S. Brown,

    1. School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
  • Tongming Yin,

    1. Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
    2. The Key Laboratory of Forest Genetics and Gene Engineering, Nanjing Forestry University, Nanjing 210037, China
  • Gerald A. Tuskan,

    1. Environmental Sciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
    2. Department of Plant Sciences, University of Tennessee, Knoxville, TN 37996, USA
  • Matias Kirst

    Corresponding author
    1. Graduate Program in Plant Molecular and Cellular Biology, University of Florida, Gainesville, FL 32611, USA
    2. School of Forest Resources and Conservation, University of Florida, Gainesville, FL 32611, USA
    3. Genetics Institute, University of Florida, Gainesville, FL 32611, USA
      *(fax +1 352 846 1277; e-mail mkirst@ufl.edu).

Summary

Microarrays have demonstrated significant power for genome-wide analyses of gene expression, and recently have also revolutionized the genetic analysis of segregating populations by genotyping thousands of loci in a single assay. Although microarray-based genotyping approaches have been successfully applied in yeast and several inbred plant species, their power has not been proven in an outcrossing species with extensive genetic diversity. Here we have developed methods for high-throughput microarray-based genotyping in such species using a pseudo-backcross progeny of 154 individuals of Populus trichocarpa and P. deltoides analyzed with long-oligonucleotide in situ-synthesized microarray probes. Our analysis resulted in high-confidence genotypes for 719 single-feature polymorphism (SFP) and 1014 gene expression marker (GEM) candidates. Using these genotypes and an established microsatellite (SSR) framework map, we produced a high-density genetic map comprising over 600 SFPs, GEMs and SSRs. The abundance of gene-based markers allowed us to localize over 35 million base pairs of previously unplaced whole-genome shotgun (WGS) scaffold sequence to putative locations in the genome of P. trichocarpa. A high proportion of sampled scaffolds could be verified for their placement with independently mapped SSRs, demonstrating the previously un-utilized power that high-density genotyping can provide in the context of map-based WGS sequence reassembly. Our results provide a substantial contribution to the continued improvement of the Populus genome assembly, while demonstrating the feasibility of microarray-based genotyping in a highly heterozygous population. The strategies presented are applicable to genetic mapping efforts in all plant species with similarly high levels of genetic diversity.

Introduction

Microarrays revolutionized the study of gene expression, and have recently been applied for high-throughput genotyping of sequence- and expression-level polymorphisms. Single-feature polymorphisms (SFPs) detected by differential hybridization of genomic DNA to whole-genome microarrays were first reported in yeast (Winzeler et al., 1998; Brem et al., 2002) and Arabidopsis (Borevitz et al., 2003; Singer et al., 2006), and later in rice (Kumar et al., 2007). Subsequent reports showed that hybridization of RNA could also identify SFPs in haploid yeast (Ronald et al., 2005) and several inbred plants (Cui et al., 2005; Rostoks et al., 2005; West et al., 2006, 2007; Luo et al., 2007; Coram et al., 2008), while concurrently generating estimates of gene expression from segregants (Ronald et al., 2005; West et al., 2007). Utilizing RNA to characterize SFPs also creates the opportunity to identify gene expression markers (GEMs) – genes that are differentially expressed between parents of mapping populations and show Mendelian segregation of expression values within progeny (West et al., 2006). Generating genotypic and gene expression data in a common assay establishes a framework for powerful forward-genetic approaches, including genetical genomics studies (Jansen and Nap, 2001). However, while microarray-based mapping has been successfully applied to haploid or homozygote lines, the approach has yet to be demonstrated in outcrossing plant species with high genetic diversity, in which up to four alleles can segregate for each locus in a full-sibling pedigree.

RNA-based SFP genotyping requires robust separation of the microarray signal variance associated with differential hybridization kinetics between alleles from the variance due to differences in mRNA abundance (Ronald et al., 2005). Previous studies in species with limited genetic diversity have relied on short (≤25-mer) oligonucleotide probes to detect genetic variants, because a unique single-nucleotide polymorphism (SNP) can result in differential hybridization and detection of SFP (Kirst et al., 2006). Short oligonucleotide-based microarrays typically utilize multiple probes per gene (a probe set) to estimate gene expression. Thus, SFP-containing probes can be detected by comparing individual probe signals with the signal measured across the probe set. Probes for which the signal deviates significantly from the probe set mean in a subset of the segregating population suggest the presence of a segregating SFP, while the remainder of the probe set provides an estimate of gene expression (Ronald et al., 2005; West et al., 2006; Luo et al., 2007). However, in outcrossing species with extensive genetic diversity, abundant SNP variation and heterozygosity can result in significant bias for estimates of gene expression, as SFPs may be present within many probes in a given probe set (Kirst et al., 2006). Such biases render platforms that utilize short probes less reliable for concurrent analysis of gene expression and genetic polymorphisms in these experimental settings. Utilizing long oligonucleotide probes may improve estimates of gene expression in these cases. However, approaches to select optimal long oligonucleotide probes for gene expression analysis in highly diverse species or across multiple related species and their hybrids are lacking. Similarly, the ability of longer probes to detect a useful quantity of segregating polymorphisms for genetic mapping has yet to be demonstrated.

Our first objective in this study was to develop an approach to select optimal long oligonucleotide probes for gene expression analysis and microarray-based genotyping in a highly heterozygous population. We utilized an inter-specific pseudo-backcross of P. trichocarpa × P. deltoides and a long-oligonucleotide (>50-mer) microarray platform to develop a two-step method to discover candidate SFP in parent lines, then genotype sequence- and expression-based polymorphic features in the progeny. We show that genotypic data generated by this method can contribute to the development of an accurate high-density, gene-based genetic map. Additionally, the value of these markers is demonstrated by the positioning of almost half of the previously unassembled whole-genome shotgun (WGS) sequence scaffolds within the complex and highly heterozygous genome of P. trichocarpa. The results we describe provide an indication of both the challenges and opportunities presented when undertaking a microarray-based genetic mapping study in a genetically diverse plant species. We believe that the techniques we present provide a strong framework for future microarray-based genotyping in crops, forest tree species, and other complex plant genomes. Similarly, our approach for optimal probe selection for gene expression analysis within or between highly diverse species may prove useful for other agricultural and forest tree species with similar levels of genetic diversity.

Results

SSR framework map of genotype 52-225

We constructed a single-tree framework microsatellite map (Figure S1) for the maternal P. trichocarpa × P. deltoides hybrid parent (genotype 52-225) of family 52-124 based on 167 SSR markers, using a pseudo-testcross strategy (Grattapaglia and Sederoff, 1994; Ma et al., 2008). The framework map represented the 19 consensus linkage groups (LGs) of poplar (Cervera et al., 2001), although an unresolved gap remained in LG VI due to a lack of informative markers in this region. Markers shared with the genetic map of genotype 52-225 produced for a different population (family 13, for which the genotype also serves as the maternal parent; Yin et al., 2004) were largely collinear. Framework SSR loci represented a subset of the sequence-tagged sites used to assemble the P. trichocarpa WGS contigs and scaffolds into chromosomes (Tuskan et al., 2006). Based on this information, we anchored and oriented the framework map relative to the genome assembly. The framework map spanned 2970 cM, with mean marker intervals of 17.8 cM, and served as the basis for subsequent grouping of SFP and GEM markers into linkage groups.

Identification of probes for genotyping family 52-124

A microarray analysis was initially performed in each parent line to (i) identify candidate SFP probes segregating in the pedigree and (ii) identify a single optimal probe for gene expression analysis in the progeny. To develop a microarray platform that could be used for concurrent genotyping and transcript profiling of the progeny of family 52-124, we began by testing six or seven probes per gene in the two parents. The custom platform comprised 384 287 60-mer probes representing 55 793 annotated gene models (probe sets) from the sequenced genome of P. trichocarpa. This gene set included 45 555 predicted gene models reported previously, plus 10 238 ESTs and less supported gene models with transcriptional evidence (Tuskan et al., 2006).

For the probe selection study, RNA extracted from the root, leaf and secondary xylem of each parent of family 52-124 was converted to double-stranded cDNA, labeled, and hybridized to the microarrays. After normalization, the data were assessed by analysis of variance (anova), with genotype, tissue, tissue-by-genotype interaction, probe and genotype-by-probe interaction effects. Genotype effect accounts for overall differences in signal in a probe set between the two parental genotypes, and primarily reflects a difference in gene expression level between them [Figure 1(c,d)]. The tissue effect accounts for differences in expression detected by a probe set between different tissues, regardless of the genotype being profiled. The probe effect detects the specific properties of a probe that distinguish it from others in a probe set, independent of parent genotype [Figure 1(a,d)]. Finally, the genotype-by-probe interaction accounts for specific properties of a probe that distinguish it from the rest of the probe set, depending on the genotype being analyzed. Dependence on genotype suggests that these probes contain SFP between the parental genotypes that may segregate in the progeny [Figure 1(b,d)].
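
As a compact restatement of the effects listed above, the fixed part of the model for the normalized signal of one probe set can be written as a linear model. This is a sketch only: the symbols are introduced here for illustration, and any random effects or additional terms in the mixed model are not specified in this section and are therefore omitted.

```latex
y_{gtpr} = \mu + G_g + T_t + (GT)_{gt} + P_p + (GP)_{gp} + \varepsilon_{gtpr}
```

Here y is the normalized signal for genotype g, tissue t, probe p and replicate r; G, T and P are the genotype, tissue and probe effects; (GT) and (GP) are the tissue-by-genotype and genotype-by-probe interactions; and ε is the residual. Under this notation, a large (GP) term for a given probe is what flags it as a candidate SFP.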

Figure 1.

 Examples of significant fixed effects detected by analysis of variance of microarray data from the parents of family 52-124.
Normalized, zero-centered signal measured in seven probes for each parent (black lines, P. deltoides D124; gray lines, P. deltoides × P. trichocarpa 52-225) in two biological replicates.
(a) Significant probe effect (gene ID grail3.0028018001) reflected by wide variation in measured signal intensity around the probe set mean (probes 2, 6 and 7). Significant probe effects may arise because of gene mis-annotation, significant variation in sequence between the probe and all transcribed alleles in the cross, or unfavorable probe chemical properties.
(b) Significant genotype by probe effect (gene ID gw1.XVIII.2378.1) revealed by the difference in signal intensity across a probe set within one genotype (probes 1 and 2 for genotype 52-225; probes 6 and 7 for genotype D124).
(c) A significant genotype effect (gene ID gw1.XII.1836.1) represents a property of the probe set as a whole, and is reflected by relatively constant signal variance between genotypes for each probe across the probe set. Strong and/or highly heritable genotype effects correspond to potential GEMs.
(d) Significant genotype (probes 2–6), genotype by probe (probe 7) and probe effects (probe 1) within a single gene (gene ID eugene3.02350016).

To identify candidate probes for SFP genotyping, two separate analyses were performed. In the first, a t-test was used to contrast least-square mean estimates of the interaction between the two parental genotypes at each probe across all tissues. A probe within a probe set may be biased towards one or the other parent due to differential hybridization (i.e. an SFP), and therefore is a candidate to be tested for segregation in the progeny. Furthermore, only probes for which the difference in least-square means between the parental lines exceeded an arbitrary fourfold threshold were selected. We identified 2875 probes meeting these criteria (false-discovery rate < 0.1; P < 0.0085). When more than one probe from a probe set was identified, we selected the most significantly interacting probe. In total, candidate SFP probes were identified for 912 genes. Among these, 770 exhibited hybridization bias favoring the 52-225 hybrid parent, while 142 demonstrated stronger hybridization in the D124 P. deltoides parent. These results are expected because the microarray probes were designed based on the genome sequence of P. trichocarpa (Tuskan et al., 2006), one of the species contributing to the hybrid parent. Therefore, we hypothesized that the majority of candidate SFPs may be explained by species-level polymorphism between P. trichocarpa and P. deltoides alleles. Based on this hypothesis, and the inter-specific pseudo-backcross pedigree structure, comprising one P. trichocarpa and three P. deltoides alleles, we expected that most SFP and GEM alleles showing simple Mendelian inheritance should segregate at a ratio of 1:1.

To identify additional candidate SFP probes for genotyping and mapping in the progeny, we re-analyzed the parental expression data derived from secondary xylem in a separate anova. Similar to the previous analysis, we contrasted each parent’s interaction with individual probes within a probe set, and selected those that were significant (FDR < 0.1, P < 0.0051) with at least a threefold difference in least-square means estimates. The separate analysis focusing on xylem tissue was conceived after previous work showed this tissue to be among the most transcriptionally diverse in Populus (Tuskan et al., 2006). From this dataset, we initially identified 13 191 additional candidate SFP probes, including 8986 with hybridization bias favoring the hybrid parent and 4205 with hybridization bias favoring the P. deltoides parent. By again selecting only the most significantly interacting probe in each probe set, we identified an additional 11 172 genes harboring candidate SFPs. In total, our two analyses identified single specific probes from 12 084 genes containing candidate SFPs, which were subsequently carried forward for analysis of the progeny.
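
The filtering logic used in both analyses above (a significance threshold on the genotype-by-probe contrast, a minimum fold difference, and retention of the single most significant probe per probe set) can be sketched as follows. This is an illustrative sketch rather than the code used in the study: the `ProbeTest` record, the function names, the toy values and the choice of the Benjamini-Hochberg procedure for FDR control are all assumptions, because the FDR method is not stated in this section.

```python
from collections import namedtuple

# Hypothetical record: one genotype-by-probe contrast per probe.
# log2_diff is the difference in least-square means on a log2 scale (assumed).
ProbeTest = namedtuple("ProbeTest", "gene probe p_value log2_diff")


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_candidate_sfp(tests, fdr=0.1, min_fold=4.0):
    """Keep the most significant probe per gene that passes both filters."""
    q_values = benjamini_hochberg([t.p_value for t in tests])
    best = {}
    for t, q in zip(tests, q_values):
        fold = 2 ** abs(t.log2_diff)
        if q < fdr and fold >= min_fold:
            if t.gene not in best or t.p_value < best[t.gene].p_value:
                best[t.gene] = t
    return best


# Toy input (values invented for illustration only).
tests = [
    ProbeTest("geneA", 1, 0.0001, 2.5),   # significant, ~5.7-fold
    ProbeTest("geneA", 2, 0.001, 3.0),    # significant, 8-fold, larger P
    ProbeTest("geneB", 1, 0.0005, 1.0),   # significant but only 2-fold
    ProbeTest("geneC", 1, 0.6, 3.0),      # 8-fold but not significant
]
candidates = select_candidate_sfp(tests)
print({gene: t.probe for gene, t in candidates.items()})
```

Calling `select_candidate_sfp(tests, min_fold=3.0)` would correspond to the threefold threshold used for the xylem-only analysis.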

Identification of probes for transcript profiling of family 52-124

A second objective of the microarray analysis of parental genotypes was to identify a single optimal probe for expression analysis of the 55 793 gene models in the 52-124 progeny. To identify probes that were unbiased for gene expression analysis in both parental species backgrounds, we assumed that the probe set mean best represents the true expression value in each parent. Therefore, in contrast to the previous analysis, the goal was to select the probe that performs most consistently within the probe set in both parents [Figure 1(a)].

To select the optimal probe for gene expression analysis, an iterative selection process was implemented. First, for each gene, probes were ranked based on the deviation of the least-square mean estimate of each probe effect, relative to the probe set mean. Lack of significant deviation from the probe set mean suggests that inherent properties of the probe do not contribute bias to the signal detected at that probe. Next, the highest ranking probe was analyzed for its sequence alignment uniqueness scores assigned during probe design. Only probes with no more than one unique match to the Populus genome sequence were further considered. Finally, probes were evaluated for significant genotype-by-probe interaction (FDR < 0.1). In cases where the probe was not unique or showed a significant genotype-by-probe interaction, the next highest ranked probe was evaluated (i.e. next step of the iteration). After seven iterative rounds of selection, all probes had been considered by these criteria, and probes to measure gene expression were selected for 46 001 genes.

Selection for the remaining 9792 genes was based on a rank variable provided by NimbleGen (http://www.nimblegen.com). The rank variable concurrently accounts for probe chemical properties and probe uniqueness characteristics. The highest ranked probe for each gene exhibiting a non-significant probe effect and genotype-by-probe interaction effect was selected. For 149 genes, all probes in the probe set exhibited a significant probe effect or genotype-by-probe interaction. Single probes were chosen for these genes solely on the basis of the NimbleGen rank variable.
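
The iterative selection described above can be sketched as a ranked scan over the probes of one gene. This is a minimal illustration under stated assumptions: the `Probe` fields, the function name and the toy values are hypothetical, and "highest ranking" is taken to mean the smallest deviation from the probe set mean.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    probe_id: str
    deviation: float        # |LS-mean probe effect - probe set mean| (assumed metric)
    genome_matches: int     # number of matches to the genome sequence
    gxp_significant: bool   # significant genotype-by-probe interaction?


def select_expression_probe(probes):
    """Return the best-ranked probe that is unique and unbiased, else None."""
    for probe in sorted(probes, key=lambda p: p.deviation):
        if probe.genome_matches <= 1 and not probe.gxp_significant:
            return probe
    # No probe passed; the text describes a fallback to a vendor rank variable.
    return None


# Toy probe set (values invented for illustration only).
probes = [
    Probe("p1", 0.05, 3, False),  # best rank, but not unique
    Probe("p2", 0.10, 1, True),   # unique, but genotype-biased
    Probe("p3", 0.20, 1, False),  # first probe passing both checks
    Probe("p4", 0.40, 1, False),
]
chosen = select_expression_probe(probes)
print(chosen.probe_id)
```

A return value of `None` corresponds to genes for which a different criterion, such as the vendor-supplied rank, has to be used.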

Genotyping SFP and GEM probes in the progeny of family 52-124

To evaluate the candidate SFP probes identified in the parent genotypes, we assayed RNA abundance in xylem tissue from 154 progeny of family 52-124. A modified microarray was designed, comprising the single selected expression probe per gene for each of the 55 793 gene models and the 12 084 candidate SFP probes. Loci were genotyped using a k-means clustering allele-calling procedure (see Experimental procedures). Normalized data for each of the 67 877 experimental probes were grouped into two separate clusters, and frequency of cluster membership was tested for 1:1 segregation (χ² test, P > 0.05). A total of 12 680 features followed the expected Mendelian segregation pattern, including 9782 probes selected for gene expression analysis (17.5%) and 2898 of the candidate SFP probes (24.0%). Gene expression probes that segregate in the mapping population may be utilized as GEMs, and were therefore considered in further analyses.
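
The allele-calling step described above can be illustrated with a one-dimensional two-cluster k-means followed by a goodness-of-fit test against 1:1 segregation. This is a minimal sketch, not the procedure from the Experimental procedures: the function names, the initialization at the minimum and maximum signal, and the toy signal values are assumptions. The P value uses the identity that the upper tail of a χ² distribution with one degree of freedom equals erfc(sqrt(χ²/2)).

```python
import math


def two_means_1d(values, n_iter=100):
    """Split 1-D signal values into two clusters (labels 0 = low, 1 = high)."""
    lo, hi = min(values), max(values)  # assumed initial cluster centers
    labels = [0] * len(values)
    for _ in range(n_iter):
        labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        low = [v for v, lab in zip(values, labels) if lab == 0]
        high = [v for v, lab in zip(values, labels) if lab == 1]
        new_lo = sum(low) / len(low) if low else lo
        new_hi = sum(high) / len(high) if high else hi
        if (new_lo, new_hi) == (lo, hi):
            break
        lo, hi = new_lo, new_hi
    return labels


def chi_square_1to1(n_a, n_b):
    """Chi-square statistic and P value (1 d.f.) for a 1:1 expectation."""
    expected = (n_a + n_b) / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy normalized signal for one probe across eight progeny (invented values).
signal = [-1.1, -0.9, -1.0, -1.2, 0.9, 1.1, 1.0, 1.2]
labels = two_means_1d(signal)
chi2, p_value = chi_square_1to1(labels.count(0), labels.count(1))
print(labels, chi2, p_value)
```

A probe would be retained as segregating 1:1 when `p_value > 0.05`; for example, a 100:54 split among 154 progeny would be rejected by this test.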

Next, signal separation between allelic classes was evaluated using a modified normal deviate (see Experimental procedures), and probes resulting in >10% ambiguous allele assignments were removed. Reliable genotypes in >90% of the progeny were obtained for 1733 probes, including 1014 GEMs and 719 SFPs (1.8 and 6.0% of the total, respectively). The 1733 segregating features correspond to 1610 independent gene models – segregating probes corresponding to both GEM and SFP were identified for 123 gene models.

Genetic mapping of genotype 52-225

The 1733 candidate SFP and GEM probes were utilized to generate a genetic map of genotype 52-225. Marker grouping, ordering and mapping were performed as described previously (West et al., 2006) with slight modifications (see Experimental procedures). To correct for genotypic errors and ambiguities in the resulting linkage groups, markers were re-genotyped after localization of recombination breakpoints using structural change analysis (Singer et al., 2006). In addition to the 167 framework SSRs, we unambiguously localized 324 SFP and 117 GEM loci in the map of 52-225 (Table 1, Figure 2 and Table S1). For most linkage groups, and the genome as a whole, the mean marker intervals were <5 cM. The total genome length was 2798.5 cM, in good agreement with recently published genetic maps for inter-specific crosses of Populus (Yin et al., 2004). The overall rate of marker placement error was low: for genes known to be physically located on specific chromosomes in the P. trichocarpa WGS sequence assembly, ten were not placed in their predicted linkage group – an error rate of 3.52% (10/284). Of the misplaced markers, seven corresponded to SFPs and three to GEMs. These ten markers were subsequently excluded from the map.

Table 1. Summary of F tests for fixed effects in the mixed anova performed on parent tree microarray data

F test result | Genotype | Tissue | Tissue by genotype | Probe | Genotype by probe
Significant | 7909 | 34 326 | 18 470 | 51 821 | 3355
Non-significant | 47 884 | 21 467 | 37 323 | 3972 | 52 438

Significance was judged at FDR < 0.025. Details of the significance of F statistics for these fixed effects on a per gene basis (at FDR < 0.025) are given in Table S5.

Figure 2.

 Microarray and SSR-based genetic map of P. trichocarpa × P. deltoides 52-225.
Colors and font styles represent marker types and genomic sequence locations: black, framework SSR markers; green, SFP markers; blue, GEM markers; italicized, GEM/SFP markers contained within unplaced WGS scaffold sequences in version 1.1 of the P. trichocarpa genome; plain font, GEM/SFP markers with known linkage group-anchored genomic coordinates. Maps were generated using publicly available mapchart software version 2.1 (Voorrips, 2002). Map data are also given in Table S1.

Physical orientation of the 52-225 genetic map

We oriented and aligned the 52-225 genetic map to the chromosome-level WGS assembly of P. trichocarpa Nisqually-1 based on physical positions of genes interrogated by SFP and GEM probes (Tuskan et al., 2006) and our previously anchored SSR loci. The predicted genetic orientation and physical orientation were usually collinear; several small inversions were detected that may be the result of error in map ordering or may represent true differences in gene order between various P. trichocarpa clones or between P. trichocarpa and P. deltoides (data not shown). Slight variations in map order between Nisqually-1 and 52-225 have been reported elsewhere (Yin et al., 2008). On average, the predicted physical intervals between ordered markers contain 84.4 genes; however, the range is wide (1–624 genes). The mean physical distance spanned by marker intervals is 725 kb, and ranges from 146 bp to 5.31 million bp (Mbp).

Genetic mapping of the unassembled Populus genome

Approximately 7700 sequence scaffolds from the WGS assembly are not assigned to specific linkage groups in version 1.1 of the P. trichocarpa genome sequence. These scaffolds vary in size from <100 bp to >3.5 Mbp (mean approximately 16.8 kb), and represent 75 Mbp of unplaced sequence (Tuskan et al., 2006). Much of this sequence was postulated to be heterochromatic or derived from substantially divergent haplotypes in the sequenced clone (Tuskan et al., 2006; Kelleher et al., 2007). Our microarray-based mapping results provided an unprecedented opportunity to anchor a large amount of this unplaced sequence to potential genomic locations in P. trichocarpa based on the genes physically localized within these sequence scaffolds. Of our 1733 candidate GEM and SFP markers, 783 were contained in genes residing in 492 sequence scaffolds. We successfully mapped 167 of these 783 loci, thereby locating 116 sequence scaffolds to unique genetic positions in linkage groups (Table 2 and Table S2). Five remaining scaffolds showed linkage to other markers in the map, but could not be unambiguously placed within a single linkage group (data not shown). This error rate associated with scaffold mapping (4.13%; 5/121) is congruent with the mapping error rate observed for markers with known position in the linkage-group WGS assembly (see above). The 116 sequence scaffolds localized on the genetic map correspond to 35.7 Mbp of WGS sequence assembly, or nearly 50% of the unlinked sequence (Table 3). Among these mapped scaffolds, 34 (representing 23.3 Mbp) could be linked by two or more markers, enabling orientation of the sequence strands comprising the scaffolds (Table S2).
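
The anchoring logic described above (assign a scaffold to a linkage group from the mapped markers it contains, flag scaffolds whose markers disagree, and orient scaffolds carrying two or more markers) can be sketched as follows. This is a hypothetical illustration: the function name, the tuple layout, the scaffold names and all values are invented, and orientation is inferred here simply by comparing genetic position between the physically first and last markers.

```python
from collections import defaultdict


def anchor_scaffolds(markers):
    """markers: iterable of (scaffold, linkage_group, position_cM, position_bp)."""
    by_scaffold = defaultdict(list)
    for scaffold, lg, cm, bp in markers:
        by_scaffold[scaffold].append((lg, cm, bp))

    placements = {}
    for scaffold, hits in by_scaffold.items():
        groups = {lg for lg, _, _ in hits}
        if len(groups) != 1:
            placements[scaffold] = None  # markers disagree: ambiguous placement
            continue
        hits.sort(key=lambda h: h[2])  # order markers by physical position
        cm_first, cm_last = hits[0][1], hits[-1][1]
        if cm_last > cm_first:
            orientation = "+"
        elif cm_last < cm_first:
            orientation = "-"
        else:
            orientation = None  # single marker or identical map positions
        placements[scaffold] = {
            "lg": groups.pop(),
            "cM_range": (min(h[1] for h in hits), max(h[1] for h in hits)),
            "orientation": orientation,
        }
    return placements


# Toy markers (scaffold names and coordinates invented for illustration).
markers = [
    ("scaffold_A", "LG_I", 10.0, 5_000),
    ("scaffold_A", "LG_I", 14.5, 250_000),
    ("scaffold_B", "LG_IV", 96.8, 40_000),
    ("scaffold_B", "LG_IV", 90.3, 300_000),
    ("scaffold_C", "LG_VII", 65.0, 12_000),
    ("scaffold_D", "LG_II", 20.0, 1_000),
    ("scaffold_D", "LG_IX", 55.0, 90_000),
]
placements = anchor_scaffolds(markers)
for name, placement in placements.items():
    print(name, placement)
```

In this toy input, `scaffold_D` plays the role of a scaffold that shows linkage but cannot be placed within a single linkage group.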

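Table 2. Summary statistics for the P. trichocarpa × P. deltoides clone 52-225 microarray- and SSR-based linkage map

Linkage group | Framework SSR loci | Number of GEM loci mapped | Number of SFP loci mapped | Version 1.1 scaffolds mapped | Total mapped loci | Map length (cM) | Mean marker spacing (cM) | Mean number of recombinations (±SD)
LG_I | 19 | 6 | 25 | 16 | 68 | 415 | 6.10 | 4.09 ± 1.51
LG_II | 11 | 10 | 25 | 4 | 52 | 193.2 | 3.72 | 2.18 ± 1.05
LG_III | 11 | 3 | 8 | 6 | 30 | 150.5 | 5.02 | 1.83 ± 0.87
LG_IV | 7 | 2 | 10 | 7 | 29 | 144.5 | 4.98 | 1.72 ± 0.79
LG_V | 11 | 2 | 12 | 9 | 41 | 154.3 | 3.76 | 1.74 ± 0.84
LG_VI | 8 | 4 | 11 | 4 | 31 | 189.1 | 6.10 | 1.96 ± 0.83
LG_VII | 7 | 3 | 6 | 8 | 31 | 118.6 | 3.83 | 1.57 ± 0.84
LG_VIII | 14 | 5 | 13 | 2 | 34 | 147.7 | 4.34 | 1.66 ± 0.79
LG_IX | 8 | 4 | 7 | 10 | 30 | 111.7 | 3.72 | 1.5 ± 0.64
LG_X | 10 | 8 | 23 | 4 | 45 | 157.7 | 3.50 | 1.84 ± 0.93
LG_XI | 8 | 3 | 11 | 10 | 33 | 131.6 | 3.99 | 1.73 ± 1.03
LG_XII | 4 | 4 | 5 | 3 | 20 | 84.3 | 4.22 | 1.41 ± 0.58
LG_XIII | 7 | 4 | 10 | 3 | 25 | 115.9 | 4.64 | 1.58 ± 0.69
LG_XIV | 10 | 0 | 6 | 7 | 27 | 141.8 | 5.25 | 1.65 ± 0.81
LG_XV | 6 | 1 | 3 | 6 | 20 | 117.7 | 5.89 | 1.43 ± 0.60
LG_XVI | 9 | 3 | 7 | 0 | 19 | 95.7 | 5.04 | 1.38 ± 0.59
LG_XVII | 5 | 1 | 5 | 10 | 28 | 132.1 | 4.72 | 1.69 ± 0.81
LG_XVIII | 4 | 1 | 9 | 2 | 16 | 76.1 | 4.76 | 1.62 ± 0.93
LG_XIX | 8 | 3 | 11 | 5 | 29 | 121 | 4.17 | 1.58 ± 0.86
Unlinked scaffolds | | 50 | 117 | | | | |
Genome total | 167 | 117 | 324 | 116 | 608 | 2798.5 | 4.62 | 1.80 ± 0.59
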
Table 3. Summary of WGS scaffold sequences localized based on SFP and GEM markers, and resultant estimated coverage

Linkage group | Assembled size (kb)ᵃ | Estimated coverage (%)ᵃ | Scaffold sequence added (kb) | Estimated coverage revised (%)ᵃ
LG_I | 35 500 | 80 | 5146 | 91.6
LG_II | 24 500 | 91 | 98.5 | 91.4
LG_III | 19 100 | 79 | 1526 | 85.3
LG_IV | 16 600 | 95 | 1387 | 102.9
LG_V | 18 000 | 78 | 2834 | 90.3
LG_VI | 18 500 | 92 | 4295 | 113.4
LG_VII | 12 800 | 85 | 582.4 | 88.5
LG_VIII | 16 100 | 73 | 8.4 | 73
LG_IX | 12 500 | 85 | 136.2 | 86
LG_X | 21 100 | 100 | 137.8 | 100.6
LG_XI | 15 100 | 82 | 1386 | 89.5
LG_XII | 14 100 | 102 | 703.4 | 107.1
LG_XIII | 13 100 | 107 | 2908.6 | 130.7
LG_XIV | 14 700 | 85 | 3162.8 | 103.3
LG_XV | 10 600 | 79 | 1792.2 | 92.3
LG_XVI | 13 700 | 81 | 0 | 81
LG_XVII | 6000 | 56 | 5601 | 108.3
LG_XVIII | 13 500 | 77 | 983.4 | 82.6
LG_XIX | 12 000 | 65 | 2424.4 | 78.1
Mean | | 83.8 | 1848.1 | 94.5

ᵃ Original assembled size and estimated coverage as reported previously (Tuskan et al., 2006). The revised estimated coverage is based on these previously reported statistics, and may exceed 100% because of erroneous estimation of linkage group size due to the assumption of uniform genetic:physical distance ratio, or because of map-based linear reassembly of highly divergent haplotypes that should be collinear and distinct.
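
The revised coverage values in Table 3 appear to follow from the previously reported statistics by simple arithmetic: the estimated full size of a linkage group is the assembled size divided by the original coverage, and the revised coverage is the assembled plus added sequence divided by that estimate. The formula below is inferred from the table rather than quoted from the study, and the function name is hypothetical. It matches the tabulated values for the three rows shown.

```python
def revised_coverage(assembled_kb, coverage_pct, added_kb):
    """Revised coverage (%) after adding scaffold sequence to a linkage group."""
    estimated_total_kb = assembled_kb / (coverage_pct / 100.0)
    return 100.0 * (assembled_kb + added_kb) / estimated_total_kb


# (assembled size in kb, original coverage in %, scaffold sequence added in kb)
rows = {
    "LG_I": (35500, 80, 5146),     # Table 3 reports 91.6
    "LG_XVI": (13700, 81, 0),      # Table 3 reports 81
    "LG_XVII": (6000, 56, 5601),   # Table 3 reports 108.3
}
revised = {lg: round(revised_coverage(*values), 1) for lg, values in rows.items()}
print(revised)
```

Because the original coverage is only reported to the nearest percent, small discrepancies from the tabulated values are possible for other rows.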

Verification of map position for unassembled sequence scaffolds

To confirm that our assembly of genomic scaffolds using SFPs and GEMs was reliable, we verified the position of a subset of mapped scaffolds using SSRs. From the P. trichocarpa version 1.1 sequence scaffolds (Tuskan et al., 2006), we identified SSR loci within nine distinct scaffolds mapped using GEM and SFP markers, and designed PCR primers in their flanking sequences. After amplification and genotyping, we mapped these SSR loci on the basis of the original framework SSR map only, to eliminate any bias that may be introduced due to genotyping error in linkage group-anchored SFP and GEM alleles. For eight of the nine scaffolds, we successfully verified the putative map location of the scaffold sequence with respect to the framework SSRs (Table 4 and Figure S1). The relative genetic distances between scaffold-anchored markers in both the SFP/GEM-based map and the SSR framework map were also in agreement (Table 4).

Table 4. Verification of scaffold map location for nine sequence scaffolds using SSR markers and the framework SSR map

Joint Genome Institute version 1.1 sequence scaffold | Mapped SFP/GEM genes | SFP/GEM location | Anchored SSR flanking scaffold in GEM/SFP map | Verification SSR ID | Verification SSR location in framework map | Anchored SSR flanking scaffold in framework map
Scaffold_29 | eugene3.00290072 | LG_I, 85.6 cM | G833, G2837 | UFLA_29 | LG_I, 119.7 cM | G833, G3784
 | estExt_fgenesh4_pg.C_290162 | LG_I, 86.3 cM | | | |
Scaffold_130 | gw1.130.59.1 | LG_IV, 90.3 cM | G1809, O545 | UFLA_130 | LG_IV, 109.2 cM | G1809, O545
 | eugene3.01300051 | LG_IV, 96.8 cM | | | |
Scaffold_166 | eugene3.01660055 | LG_IV, 0.0 cM | O349 | UFLA_166 | LG_IV, 0.0 cM | O349
Scaffold_181 | eugene3.01810009 | LG_VII, 65.0 cM | G354, P2794 | UFLA_181 | LG_VII, 52.4 cM | G354, P2794
Scaffold_118 | fgenesh4_pg.C_scaffold_118000002 | LG_III, 79.7 cM | G1629, P2611 | UFLA_118 | LG_III, 97.4 cM | G1629, P2611
 | eugene3.01180022 | LG_III, 81.8 cM | | | |
Scaffold_170 | eugene3.01700010 | LG_XVII, 0.0 cM | G125 | UFLA_170 | LG_XVII, 0.0 cM | G125
 | eugene3.01700027 | LG_XVII, 0.0 cM | | | |
 | fgenesh4_pg.C_scaffold_170000022 | LG_XVII, 7.2 cM | | | |
 | eugene3.01700033 | LG_XVII, 13.7 cM | | | |
Scaffold_147 | estExt_Genewise1_v1.C_1470180 | LG_VI, 189.9 cM | O50 | UFLA_147 | LG_VI_b, 94.5 cM | O50
Scaffold_121 | gw1.121.10.1 | LG_XVIII, 76.1 cM | O534 | UFLA_121 | LG_VI_b, 7.1 cM | P2221, W12
 | | | | UFLA_121_b | LG_VI_b, 7.1 cM | P2221, W12
Scaffold_97 | gw1.97.119.1 | LG_I, 134.2 cM | G3784, G937 | gw97_2 | LG_I, 142.5 cM | G3784, G2837

The only scaffold (scaffold_121) for which we could not verify a map position using this technique was localized on the basis of a single GEM to LG_XVIII, whereas data from two SSRs consistently positioned it within LG_VI. We speculated that this result was attributable to a strong trans-acting regulator on LG_XVIII acting on the gene characterized as a GEM. As GEMs may be the result of either cis- or trans-acting variation, we were interested to determine whether scaffolds mapped based on single GEM loci were less reliable with regard to genetic positioning. We studied the remaining 13 scaffolds that were localized on the basis of single GEMs in our map, and identified informative SSRs in six of these 13 scaffolds. Using the framework SSR map, we successfully verified the predicted genetic placement of five of these six GEM-anchored scaffolds (Figure S1 and Table S3). The single unverified scaffold (scaffold_250) localized to the same linkage group, but a different SSR bin, than predicted by the GEM locus (Table S3).

Characterization of sequence-level allelic variation represented by mapped SFPs

SFPs detected by short (≤25-mer) oligonucleotide probes often correspond to one or a few SNPs or small indels (Kirst et al., 2006; Luo et al., 2007; Das et al., 2008). However, the effect of sequence mismatches on the signal detected from long oligonucleotide probes has only recently been described (Rennie et al., 2008). Thus, we characterized the allelic variations present in a sample of mapped SFP probes from the microarray platform. Using double-stranded cDNA produced from xylem for each of the parent trees, we amplified, cloned and sequenced regions corresponding to five mapped SFP loci, and assayed polymorphisms between the alleles. We identified sequence-level variation ranging from a single SNP in the 60-mer region to large indel polymorphisms affecting >10 bp [Figure 3(b–e)]. Of the five SFPs that we characterized, one showed no variation between alleles within the sequence interrogated by the genotyping probe, although sequence variation between the alleles and probe was observed [Figure 3(a)]. Therefore, this probe may correspond to an actual GEM that was mis-characterized as an SFP, as previously described (Luo et al., 2007). As we hypothesized, the SFPs we detected are primarily due to species-level sequence polymorphisms between P. trichocarpa and P. deltoides, although multiple haplotypes were identified at two of the five probes [Figure 3(b,e)].

Figure 3.

 Allelic variations characterized by sequencing genomic DNA regions corresponding to mapped SFP probes.
Among sequenced clones, haplotypes are shown as detected for P. trichocarpa × P. deltoides clone 52-225 and P. deltoides clone D124. Variations between alleles or between detected sequence and probe sequence are depicted in red.
(a) No variation was detected between parent trees for estExt_Genewise1_v1.C_LG_III2262.
(b) Extensive SNP and indel polymorphism between haplotypes in grail3.0016013002.
(c) A 12 bp deletion polymorphism in P. deltoides estExt_Genewise1_v1.C_LG_XVII1215.
(d) A single SNP distinguishes alleles of grail3.0005006601.
(e) Multiple SNPs detected for estExt_Genewise1_v1.C_LG_XVIII1445.

Discussion

Parallel genotyping and gene expression quantification using mRNA microarray hybridization data require accurate classification of differences in signal intensity arising from DNA sequence variants versus transcript level abundance (Ronald et al., 2005). To separate genetic polymorphism from differences in transcript abundance, candidate genotyping probes can subsequently be detected by identifying individual probes that deviate significantly from the probe set mean signal (which provides a balanced measure of expression), and that segregate in the progeny. Although first demonstrated in populations with simple genetic segregation patterns (i.e. haploid, recombinant inbred line and doubled-haploid) and species with limited genetic diversity, we have extended mRNA-based microarray genotyping to a highly heterozygous, outcrossing plant species for which low resolution at the genotype level has often hampered forward-genetic gene discovery methods.

Contrary to previous studies, which relied on microarray platforms comprising multiple (11–30) short probes (≤25-mer) per gene (Ronald et al., 2005; West et al., 2006; Luo et al., 2007), we adopted a long-oligonucleotide microarray platform for use in our study. Furthermore, our analysis relied on single optimal genotyping and gene expression probes selected by analyzing the parental individuals before characterizing the segregating population. A set of six or seven probes per gene was first screened in the parental genotypes, and an analysis of variance was applied to identify probes interrogating potential polymorphisms and optimal probes for measuring transcript levels (Cui et al., 2005; Rostoks et al., 2005). Next, the microarray platform was re-designed to comprise a single optimal gene expression probe for each transcriptional unit and 12 084 candidate SFP probes for analysis of 154 segregating progeny. From this analysis, we identified 1733 segregating features with reasonably low levels of ambiguous data (<10%). After applying a statistically based genotyping correction described previously (Singer et al., 2006), we successfully mapped 441 of these segregating features (25.4%). Our mapped features include probes that were pre-selected for gene expression analysis and those pre-selected for SFP genotyping, corresponding to 117 GEM and 324 SFP markers. The sample of sequenced SFP regions indicates that our data analysis approach robustly detected sequence variants from RNA-based microarray data.

Together with 167 framework SSR markers, our map represents one of the highest-resolution genetic maps derived from a single pedigree in the Populus genus. Markers from the framework SSR map represent an important tool to delineate true versus spurious linkage of GEM and SFP to linkage groups in the genome, analogous to the situation described when mapping largely homozygous barley RILs (Luo et al., 2007). Nonetheless, we have demonstrated that GEM and SFP mapping in highly heterozygous species is both beneficial and feasible, and may serve as a supplement to traditional DNA-based markers. Our study focused on an inter-specific cross, in which sequence and gene expression variation may be extraordinary. However, estimates of genetic variation and nucleotide diversity within individual species of the Populus genus (Ingvarsson, 2008) and other economically significant outcrossing plants (Ching et al., 2002; Tenaillon et al., 2002; Kolkman et al., 2007; Novaes et al., 2008) suggest that our analysis approach could also be adapted to identify genetically informative variants from diverse intra-specific accessions. However, it is expected that variables including probe length and statistical thresholds associated with allele calling may require optimization, and that the abundance of SFP and GEM detected may be lower.

Establishing a high-density, gene-based genetic map also provided an opportunity to position previously unlinked sequence scaffolds from the WGS sequence assembly of P. trichocarpa to putative genomic locations. The existing genome assembly comprises 410 Mbp of a total estimated genome size of 485 Mbp (Tuskan et al., 2006), but there is substantial variation in estimated chromosome sequence coverage, from 56% (chromosome XVII) and 65% (chromosome XIX) to estimated completion (chromosomes X, XII and XIII). Of the 492 unplaced scaffolds in which we identified a segregating GEM or SFP marker, we unambiguously positioned 116 on our genetic map (23.6%). Scaffold sequences mapped using our GEM and SFP markers represent over 35 Mbp of previously unanchored sequence from the WGS assembly of P. trichocarpa, including more than 23 Mbp localized by at least two independent markers in the same scaffold.

Of a sample of 15 putatively placed genomic scaffolds, the placement of 13 could be verified using independent SSR markers, lending a good degree of confidence to our map-based re-assembly of nearly 50% of the P. trichocarpa scaffold sequence. In addition, 18 scaffolds that we have mapped using SFPs or GEMs have been previously mapped using SSRs and amplified fragment length polymorphisms by other research groups (A. Rohde, Institute for Agriculture and Fisheries Research, personal communication). Our microarray-based markers verified the genetic position for 17 of these 18 scaffolds. Misplacement of sampled scaffolds based on microarray marker data was generally due to mapping based on single GEM loci. Because GEMs can result from segregating cis- or trans-acting regulatory variation, scaffolds mapped based only on GEMs should be verified for their position using SSRs where possible. Despite this fact, localization of a large proportion of the previously unplaced genome sequence is a high-impact result for the Populus genomics community, even given the small degree of potential error in placements.

Interestingly, the newly mapped scaffolds are predominantly located in chromosomes with low sequence coverage, where larger gaps exist in the current assembly. It is unclear why there is bias towards mapping scaffolds in chromosomes with poor assembly. There may be a higher probability of mapping unassembled scaffolds to them simply because of their higher expected abundance there. Alternatively, smaller unmapped scaffolds could be more prevalent in chromosomes that are populated by large numbers of hypervariable regions, as high levels of polymorphism are not favorable to long-range WGS assembly of a consensus haplotype (Kelleher et al., 2007). Such an observation was recently made in the sex-determining telomeric region of Populus chromosome XIX (Yin et al., 2008). Furthermore, chromosome XVII, which has the lowest estimated percentage of sequence fully assembled (56%, Tuskan et al., 2006), has the fourth highest rate of sequence polymorphism (unpublished data), and has the highest number of scaffolds mapped and total sequence added in our study (Tables 1 and 2). Although we can only speculate as to the basis for this phenomenon, our study provides a significant improvement to the WGS assembly of the P. trichocarpa sequence. Additional mapping studies using SFP and GEM markers that we have identified, and focusing on variation in the sequenced clone Nisqually-1, could shed light on the structural genomic nature of these scaffold sequences and their proper designation in the genome assembly as alternative haplotypes or bona fide unplaced WGS sequence segments. De novo sequencing and assembly of other P. trichocarpa and P. deltoides genotypes will also provide a better indication of whether specific regions exist that are hypervariable within and between species haplotypes, and their genome location.

Perhaps most importantly, our effort demonstrates the power that microarray-based mapping may bring to future map-based WGS reassemblies. We have shown that mapping based on physically positioned genes can rapidly localize and orient large amounts of WGS-derived sequence within the context of a physical assembly, even when the sequence is scattered amongst a number of smaller scaffolds whose assembly is not supported by traditional WGS computerized assembly techniques or anonymous sequence marker anchoring methods. Thus, further application of microarray-based mapping in genetically diverse species will not only increase resolution at the level of genotype for forward-genetic analyses, but may drastically improve the initial quality of draft WGS assemblies to the community as a whole. In addition, providing a putative location for an unplaced sequence can identify candidate genes affecting quantitative phenotypes that would otherwise go unconsidered if relying only upon the chromosome-level sequence assembly for characterization of a genomic interval.

Experimental procedures

Plant growth conditions and RNA isolation

A pseudo-backcross population (family 52-124) derived from the cross of a female P. trichocarpa × P. deltoides hybrid (genotype 52-225) and a male P. deltoides (genotype D124) was obtained from the Department of Forestry at the University of Minnesota at Duluth as hardwood cuttings. After rooting, bud break and shoot elongation, fresh softwood terminal cuttings were harvested and placed in rooting media pellets (Jiffy Forestry Products, http://www.jiffypot.com) for 2 weeks. Rooted cuttings were planted in 9 L pots, and grown for 6 weeks on ebb-and-flow benches in a greenhouse under long-day conditions (16 h light/8 h dark) with a standard nutrient regime (Hocking’s modified complete fertilizer, Cooke et al., 2003) supplemented with 25 mM nitrogen (NH4NO3). Plants were distributed in the greenhouse according to a partially balanced incomplete block design, with three biological replications per genotype. At harvest, the main plant organs (stems, roots, leaves and sylleptic branches) were collected separately. Stems were further dissected into secondary xylem tissue and phloem/bark/immature xylem. Samples of leaf, secondary xylem and root tissue from two biological replicates of each genotype were used for gene expression analysis. All tissue was flash-frozen in liquid nitrogen immediately after harvest, and stored at −80°C prior to lyophilization and subsequent RNA isolation (Chang et al., 1993). RNA samples were treated with RQ1 DNase (Promega, http://www.promega.com/) and purified using RNeasy Plant Mini Kit columns (Qiagen, http://www.qiagen.com/), and their integrity was evaluated using 1% w/v agarose gels.

Microsatellite (SSR) genotyping and framework map construction

Parent trees and 418 progeny of family 52-124 were genotyped for 167 framework SSR loci (http://www.ornl.gov/sci/ipgc/ssr_resource.htm, Smulders et al., 2001; Tuskan et al., 2004; van der Schoot et al., 2000). DNA was isolated from leaf samples using a Qiagen DNeasy Plant Mini Kit according to the manufacturer’s protocol. PCR reagents and concentrations were as described previously (Tuskan et al., 2004), except that SSR loci were amplified from 7.5 ng genomic DNA, and amplified fragments were labeled by incorporation of 8 μM fluorescein-12-dUTP (Roche Diagnostics, http://www.roche.com). Amplification conditions were 94°C initial denaturation for 5 min, nine cycles of touchdown comprising denaturation at 94°C for 15 s, annealing for 15 s at 59–50°C for one cycle each with 1°C increments, and extension at 72°C for 30 s, followed by 21 cycles of denaturation at 94°C for 15 s, annealing at 50°C for 15 s, and extension at 72°C for 30 s, with a final extension at 72°C for 3 min. Fragments were detected as described previously (Tuskan et al., 2004) except that an Applied Biosystems Prism 3730xl DNA analyzer (http://www.appliedbiosystems.com/) was used. Alleles were identified and genotyped using GeneMapper 4.0 (Applied Biosystems) and/or GeneMarker 1.5 (SoftGenetics LLC, http://www.softgenetics.com).

Single-tree framework maps were constructed using MapMaker version 3.0 (Lander et al., 1987) as described previously (Grattapaglia and Sederoff, 1994; Ma et al., 2008), and were anchored to the P. trichocarpa genome assembly version 1.1 through Blastn analysis (Altschul et al., 1990) of PCR primer sequences for each marker. Proper placement of markers was confirmed by comparison of sequence-predicted and experimentally determined P. trichocarpa SSR amplicon lengths.

SSRs used to confirm the map positions of sequence scaffolds were identified using MsatFinder version 2.0 (http://www.genomics.ceh.ac.uk/cgi-bin/msatfinder/msatfinder.cgi) based on scaffold sequences from version 1.1 of the P. trichocarpa genome sequence. Primers were designed within the MsatFinder interface (Table S4), and SSR loci were amplified from 96 family 52-124 progeny as described above. Thirteen of the 16 loci segregated highly heterozygous alleles between the P. trichocarpa and P. deltoides backgrounds, and were genotyped using 1% w/v agarose gel electrophoresis. The remaining three loci were scored using polyacrylamide gel electrophoresis as described previously (Bassam et al., 1991).

Microarray analysis of parental genotypes

RNA extracted from root, leaf and secondary xylem of the parents of family 52-124 was converted to double-stranded cDNA (SuperScript double strand cDNA synthesis kit, Invitrogen, http://www.invitrogen.com/) using oligo(dT) primers (Promega) as described by the manufacturer, except that synthesis of first and second strands was extended to 16 h. The resultant double-stranded cDNA was labeled using cy3-tagged random 9-mers and Klenow fragment for 2 h at 37°C, denatured at 95°C for 5 min, and hybridized to custom in situ synthesized oligonucleotide microarrays (produced by NimbleGen) at 42°C overnight (16–20 h).

Microarray probe design.  A total of 55 793 gene models derived from annotation of the P. trichocarpa genome sequence were represented in the microarray used in the analysis of the two parents of family 52-124. Oligonucleotide probes (60-mer) were designed based on NimbleGen standard procedures that optimize the uniqueness of the targeted genomic region and GC content, while minimizing self-complementarity and homopolymer runs. The highest-ranking six or seven probes (probe set) were selected to represent each gene model, with optimal probe spacing leading to uniformly distributed, non-overlapping coverage. Twenty negative control probes utilized in previous studies (Tuskan et al., 2006) were also included for background quantification.

Statistical analyses.  Raw signal data from all hybridizations were background-subtracted, log2-transformed, and quantile-normalized (Bolstad et al., 2003). The normalized signal detected for each probe was centered to zero and analyzed using a gene-by-gene mixed ANOVA model in SAS 9.1 (SAS Institute, http://www.sas.com), with genotype i (1 d.f.), tissue j (2 d.f.), genotype i by tissue j interaction (2 d.f.), probe k (5 or 6 d.f.) and genotype i by probe k interaction (5 or 6 d.f.) as fixed effects:

$$y_{ijkl} = \mu + G_i + T_j + (G \times T)_{ij} + P_k + (G \times P)_{ik} + \varepsilon_{ijkl}$$

F tests were performed for all fixed effects and least-square mean estimates were obtained. Correction for multiple tests was performed using a modified false-discovery rate (FDR) threshold (FDR < 0.025, Table 1 and Table S5) (Storey and Tibshirani, 2003). Normalized log2-transformed signal values from microarrays derived from differentiating xylem tissue samples were analyzed separately using a similar model that excluded tissue effects. Pairwise t-tests were implemented to contrast least-square means estimates of the interaction detected between the two parents for each probe in a probe set. Resulting P values were corrected for multiple testing as above (FDR < 0.1).
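The preprocessing steps named at the start of this subsection (background subtraction, log2 transformation and quantile normalization) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the single scalar `background` value and the flooring of background-subtracted signal at 1 are assumptions made only to keep the example self-contained.

```python
import math


def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of signals (one list per array).

    Each value is replaced by the mean, across arrays, of the values that
    hold the same rank. Ties are broken by position, a simplification.
    """
    n_probes = len(arrays[0])
    if any(len(a) != n_probes for a in arrays):
        raise ValueError("all arrays must contain the same number of probes")
    sorted_arrays = [sorted(a) for a in arrays]
    # Mean signal at each rank across all arrays.
    rank_means = [
        sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)
    ]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


def preprocess(raw_arrays, background):
    """Background-subtract, log2-transform, then quantile-normalize.

    Assumption: signal is floored at 1 after subtraction so log2 is defined.
    """
    logged = [
        [math.log2(max(v - background, 1.0)) for v in a] for a in raw_arrays
    ]
    return quantile_normalize(logged)
```

After normalization every array shares the same empirical distribution of signal values, which is what makes per-probe comparisons across hybridizations meaningful.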

Microarray analysis of family 52-124

Based on the probes selected from the parent tree data, a modified microarray was designed for analysis of the progeny of family 52-124. The modified microarray comprised 67 897 probes, including the pre-selected 55 793 gene expression probes and 12 084 SFP genotyping probes, plus 20 controls (Tuskan et al., 2006). Microarrays were synthesized using NimbleGen’s four-plex platform and utilized for analysis in the progeny. RNA isolated from one biological replicate of secondary xylem in 154 progeny genotypes was converted to double-stranded cDNA, labeled, and hybridized as described above.

All 67 877 experimental probes were evaluated for Mendelian segregation in the progeny, based on k-means clustering procedures modified from those described previously (Luo et al., 2007). Briefly, quantile-normalized, log2-transformed signal values detected for each probe in the progeny of family 52-124 were separated into two clusters using ‘Proc Fastclus’ in SAS 9.1. Cluster membership was tested for the expected 1:1 segregation using a chi-squared test. Probes for which cluster frequencies deviated significantly (χ², P < 0.05) from the expected segregation were discarded.
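As an illustration of the segregation screen described above, the following Python sketch splits the signal of one probe into two clusters and tests the cluster sizes against a 1:1 ratio. It is a stand-in under stated assumptions: a plain one-dimensional Lloyd's k-means replaces ‘Proc Fastclus’, whose exact settings are not given here, and the function names are hypothetical.

```python
import math


def two_means_1d(values, max_iter=100):
    """Assign each signal to a low (0) or high (1) cluster via Lloyd's algorithm."""
    lo, hi = min(values), max(values)
    if lo == hi:
        raise ValueError("signal is constant; two clusters cannot be formed")
    labels = None
    for _ in range(max_iter):
        new_labels = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        low = [v for v, l in zip(values, labels) if l == 0]
        high = [v for v, l in zip(values, labels) if l == 1]
        lo = sum(low) / len(low)
        hi = sum(high) / len(high)
    return labels


def chi_square_1to1(labels):
    """Chi-squared statistic and P value (1 d.f.) for a 1:1 cluster ratio."""
    n0, n1 = labels.count(0), labels.count(1)
    expected = (n0 + n1) / 2
    chi2 = ((n0 - expected) ** 2 + (n1 - expected) ** 2) / expected
    # For 1 d.f. the chi-squared survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

A probe would be retained for genotyping only if the returned P value is at least 0.05, that is, if the cluster sizes are compatible with 1:1 segregation.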

Subsequently, the probability that an individual assigned to one cluster is not a member of the other cluster was evaluated by calculating the P value (Pi) associated with the modified normal deviate:

$$z_i = \frac{|x_i - m_j|}{s_j}$$

where xi is the signal at a given probe for an individual assigned to cluster i, and mj and sj are the mean and standard deviation of signal at that probe for all individuals assigned to cluster j (Luo et al., 2007). We used zi > 1.96 (Pi < 0.05) as evidence that the two allelic classes were clearly distinguishable, and scored individuals below this threshold as missing data. Probes resulting in >10% missing data (more than 15 of the 154 individuals) were not considered for mapping.
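The genotype-calling rule described above can be sketched in Python as follows. The sketch assumes that cluster labels (0 or 1) are already available for each individual; the function name and the return format are hypothetical, while the thresholds (1.96 and 10%) are those stated in the text.

```python
import statistics


def call_genotypes(values, labels, z_threshold=1.96, max_missing=0.10):
    """Return (calls, keep_probe) for one probe.

    calls[i] is the cluster label of individual i, or None (missing data)
    when its signal is not clearly separated from the other cluster.
    keep_probe is False when more than max_missing of the calls are missing.
    """
    stats = {}
    for cluster in (0, 1):
        members = [v for v, l in zip(values, labels) if l == cluster]
        mean, sd = statistics.mean(members), statistics.stdev(members)
        if sd == 0:
            raise ValueError("cluster has zero variance; deviate is undefined")
        stats[cluster] = (mean, sd)

    calls = []
    for v, l in zip(values, labels):
        m_other, s_other = stats[1 - l]
        z = abs(v - m_other) / s_other  # deviate relative to the other cluster
        calls.append(l if z > z_threshold else None)

    missing_fraction = calls.count(None) / len(calls)
    return calls, missing_fraction <= max_missing
```

Scoring ambiguous individuals as missing rather than forcing a call trades marker completeness for a lower genotyping error rate, which matters because miscalls inflate apparent recombination.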

Grouping, ordering, and mapping of SSRs, GEMs and SFPs to linkage groups

Selected GEM and SFP markers, in conjunction with SSR markers utilized for the framework mapping, were grouped and ordered using MadMapper V248 linkage mapping software (http://cgpdb.ucdavis.edu/XLinkage/MadMapper/) essentially as described previously (West et al., 2006). However, because MadMapper scripts were developed for marker grouping and ordering in advanced-generation Arabidopsis recombinant inbred lines, the estimates of pairwise recombination frequency provided differ from those experimentally observed in a first-generation backcross pedigree structure (Haldane and Waddington, 1931). In addition, only microarray-based markers grouping together with at least one SSR from the established framework map were subsequently included. Probes not linked to the framework are likely to have an excess genotyping error and were subsequently discarded.
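The difference between recombination estimates in recombinant inbred lines and in a backcross can be made concrete with a short Python sketch. It assumes RILs produced by selfing, for which the classical relation is R = 2r/(1 + 2r), with R the recombinant fraction observed among RILs and r the per-meiosis recombination fraction. The function names and the coupling-phase genotype coding are assumptions for illustration, not part of MadMapper.

```python
def ril_to_meiotic_recombination(ril_fraction):
    """Invert R = 2r / (1 + 2r) (selfed RILs) to recover per-meiosis r."""
    if not 0 <= ril_fraction <= 0.5:
        raise ValueError("RIL recombinant fraction must lie in [0, 0.5]")
    return ril_fraction / (2 * (1 - ril_fraction))


def backcross_recombination(genotypes_a, genotypes_b):
    """Direct estimate of r between two markers in a backcross.

    Genotypes are 0/1 calls coded in the same (coupling) phase, with None
    for missing data. Individuals missing at either marker are skipped.
    """
    pairs = [
        (a, b)
        for a, b in zip(genotypes_a, genotypes_b)
        if a is not None and b is not None
    ]
    if not pairs:
        raise ValueError("no individuals genotyped at both markers")
    recombinants = sum(1 for a, b in pairs if a != b)
    return recombinants / len(pairs)
```

For example, a recombinant fraction of about 0.167 among selfed RILs corresponds to r = 0.1 per meiosis, whereas in a first-generation backcross the observed fraction of recombinant progeny estimates r directly.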

Markers were re-genotyped after localization of recombination breakpoints using a structural change analysis method within the Strucchange statistical module in R (Zeileis et al., 2002), using a strategy initially described by Singer et al. (2006). Structural change analysis detects large pattern shifts in a dataset based on a Bayesian information criterion statistical threshold, and can be used to detect overall change between phases of alleles that are characteristic of recombination breakpoints.
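The idea behind this breakpoint detection can be shown with a simplified Python sketch. It is not the Strucchange algorithm, which searches over multiple breakpoints. The sketch fits a constant mean to allele scores ordered along a linkage group, tests every admissible single split, and accepts the best split only if it lowers a Bayesian information criterion. The BIC form, the parameter count and the `min_segment` default are assumptions.

```python
import math


def _rss(xs):
    """Residual sum of squares around the segment mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs)


def _bic(rss, n, n_params):
    # Floor the RSS to keep the logarithm defined for perfectly fitted data.
    return n * math.log(max(rss, 1e-12) / n) + n_params * math.log(n)


def best_single_breakpoint(scores, min_segment=3):
    """Return index k so that scores[:k] and scores[k:] differ in mean.

    Returns None when no split improves the BIC over a single constant mean.
    scores: allele scores for one individual, in marker order.
    """
    n = len(scores)
    best_bic = _bic(_rss(scores), n, 1)  # one mean, no breakpoint
    best_k = None
    for k in range(min_segment, n - min_segment + 1):
        rss = _rss(scores[:k]) + _rss(scores[k:])
        bic = _bic(rss, n, 3)  # two means plus one breakpoint position
        if bic < best_bic:
            best_bic, best_k = bic, k
    return best_k
```

Because the penalty term grows with the number of parameters, isolated noisy scores do not create a breakpoint. Only a sustained shift across several adjacent markers, as expected at a true crossover, lowers the criterion.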

To contribute to the Strucchange analysis of breakpoint positioning, the P value (Ps) associated with the standard normal distribution for the cluster of assignment was determined:

$$z_s = \frac{|x_i - m_i|}{s_i}$$

The P values for each distribution were compared by calculating the ratio R, which has a range from zero to one, analogous to the procedure described previously (Singer et al., 2006):

[Equation image not recovered: definition of the ratio R in terms of Ps and Pi]

If the alleles are highly distinct (i.e. clearly form separate distributions), individuals from the population return values of R very close to zero or one, depending on their allele. However, markers with little allelic distinction accumulate individuals at intermediate levels of R. Utilizing a continuously distributed allele score such as R also provides a direct assessment of confidence associated with an assigned genotype on an individual-by-individual basis, and thereby contributes to more concretely defined breakpoints in the Strucchange analysis.
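A continuous allele score with the properties described above can be sketched in Python. The exact expression for R is not reproduced in this text, so the normalized ratio below is an assumed form chosen only because it is bounded by zero and one. It uses two-sided normal P values of the signal under each cluster distribution, and the function names are hypothetical.

```python
import math


def two_sided_p(x, mean, sd):
    """Two-sided P value of x under a normal distribution N(mean, sd)."""
    z = abs(x - mean) / sd
    return math.erfc(z / math.sqrt(2))


def allele_score(x, cluster_a, cluster_b):
    """Assumed score R = P_a / (P_a + P_b), bounded by 0 and 1.

    cluster_a and cluster_b are (mean, sd) tuples for the two allelic classes.
    R near 1 supports allele a, R near 0 supports allele b, and intermediate
    values flag an ambiguous individual.
    """
    p_a = two_sided_p(x, *cluster_a)
    p_b = two_sided_p(x, *cluster_b)
    if p_a + p_b == 0:
        return 0.5  # signal is far from both clusters: uninformative
    return p_a / (p_a + p_b)
```

With well-separated clusters, an individual whose signal sits at the mean of one cluster scores essentially 1 or 0, whereas a signal midway between overlapping clusters scores close to 0.5 and contributes little weight to breakpoint placement.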

To verify proper placement of recombination breakpoints, agreement between Strucchange genotypic results and raw SSR genotypes was determined. Additional breakpoints supported by the Strucchange minimum Bayesian information criterion statistic, but not present in the SSR data, were accepted if they included at least three microarray-based markers. Subsequently, genetic distances for the corrected genotypes were estimated using MapMaker version 3.0 (Lander et al., 1987).

Sequence-level characterization of SFP alleles

A subset of mapped SFPs was arbitrarily selected for sequence-level characterization in each parent of family 52-124. PCR primers were designed from the genome sequence surrounding five mapped SFPs (Table S6). Alleles were amplified from each parent tree using approximately 50 ng of xylem double-stranded cDNA, 200 μM dNTPs, and 2 μl 10× Advantage 2 PCR buffer and 0.4 μl Advantage 2 polymerase mix (both Clontech Laboratories Inc., http://www.clontech.com/) in a total volume of 20 μl. PCR was performed in a two-step procedure with identical amplification conditions for each step: 95°C initial denaturation for 5 min, 30 cycles of denaturation at 95°C for 30 s, annealing at 58.5°C for 30 s and extension at 72°C for 1 min 45 s, with a final extension of 72°C for 7 min. Secondary PCR was performed using identical reagent concentrations, except that a 1:25 dilution of the primary PCR was substituted as template. Amplicons from the secondary reaction were gel-purified in 1% w/v agarose, and cloned into pGEM-T vector (Promega) according to the manufacturer’s protocol. Eight to ten independent clones per construct were isolated using a QIAprep miniprep kit (Qiagen), and sequenced bi-directionally from the SP6 and T7 promoters using an ABI Prism 3730xl. Resulting sequences were aligned and analyzed in Sequencher version 4.6 (Gene Codes Corporation, http://www.genecodes.com) and ClustalW version 2.0 (Larkin et al., 2007).

Acknowledgements

The authors wish to thank Alexander Kozik (Department of Plant Science, University of California at Davis Genome Center) for excellent technical assistance in the implementation of MadMapper software. The authors are also grateful to Donald J. Lee (Department of Agronomy, University of Nebraska at Lincoln), A. Mark Settles (Department of Horticultural Sciences, University of Florida), Ronald R. Sederoff (Department of Forestry and Environmental Resources, North Carolina State University), and anonymous reviewers for constructive comments to improve the manuscript. This work was supported by the Department of Energy, Office of Science, Office of Biological and Environmental Research grant award number DE-FG02-05ER64114 (to M.K.), and the National Science Foundation, Genes and Genomes System Cluster in the Division of Molecular and Cellular Biosciences (to M.K.).

Ancillary