Keywords:

  • NimbleGen sequence capture;
  • genotyping;
  • SNP;
  • molecular marker;
  • reduced representation sequencing;
  • allele mining

Summary

Sequence capture technologies, pioneered in mammalian genomes, enable the resequencing of targeted genomic regions. Most capture protocols require blocking DNA, the production of which in large quantities can prove challenging. A blocker-free, two-stage capture protocol was developed using NimbleGen arrays. The first capture depletes the library of repetitive sequences, while the second enriches for target loci. This strategy was used to resequence non-repetitive portions of an approximately 2.2 Mb chromosomal interval and a set of 43 genes dispersed in the 2.3 Gb maize genome. This approach achieved approximately 1800–3000-fold enrichment and 80–98% coverage of targeted bases. More than 2500 SNPs were identified in target genes. Low rates of false-positive SNP predictions were obtained, even in the presence of captured paralogous sequences. Importantly, it was possible to recover novel sequences from non-reference alleles. The ability to design novel repeat-subtraction and target capture arrays makes this technology accessible in any species.


Introduction

Identifying genetic variation is a critical step in relating genotypes to phenotypes. Reference genome sequences exist for several plant species; however, it remains expensive and difficult to perform whole-genome sequencing of dozens of haplotypes per species, especially in crops such as maize (Martienssen et al., 2004; Bennetzen, 2005), wheat (Feuillet et al., 2008) and conifers (Morse et al., 2009), which have large genomes with complicated high-copy repeat interspersion. The ability to perform targeted resequencing of specific intervals of the low-copy fraction of these genomes has significant potential for a range of applications, such as discovering markers, identifying the basis of mutants, and cloning qualitative and quantitative trait loci (QTL), all of which can contribute to the genetic improvement of crops.

Microarray-based sequence capture (Albert et al., 2007; Hodges et al., 2007; Okou et al., 2007; D’Ascenzo et al., 2009) has been successfully applied to mammalian genomes for resequencing exons, large genomic loci and candidate gene sets. Sequence capture has also been performed using solution hybrid selection with long oligos (Porreca et al., 2007) or PCR products (Herman et al., 2009) as probes. The substantial enrichment of target sequences achieved via sequence capture makes it much less expensive than resequencing whole genomes or even the entire gene space. Even for marker discovery, where it is possible to obtain large numbers of ‘random’ SNPs via high-throughput sequencing of genomic fractions, as has been done for maize (Barbazuk et al., 2007), large numbers of SNPs are not typically discovered within a set of specific genes or a defined genomic region (e.g. a QTL interval).

Since their earliest implementation, hybridization-based complexity reduction technologies for targeted sequencing have required the use of blocking DNA in massive excess (Bashiardes et al., 2005). Most commonly, the blocking reagent of choice has been the most repetitive genomic portion, the Cot-1 fraction (Strachan and Read, 1999). Blocking is believed to suppress non-specific DNA binding that could lead to capture of ‘off-target fragments’. A second function of the blocker is to suppress the secondary capture of library molecules based upon their intrinsic repeat content. Secondary capture could occur when an array probe anneals to a complementary fragment from a sample and there is repeat content elsewhere on that captured genomic fragment, which could potentially anneal to and capture other repeat-containing library molecules. To prevent this secondary capture, the blocking DNA must match the specific repeats in the genome of interest for effective sequence capture. Over 85% of the maize B73 genome (2.3 Gb) consists of repetitive DNA (Bennetzen et al., 2001; Martienssen et al., 2004; Schnable et al., 2009). In most genomes of agricultural interest, genes represent a small percentage of the genome and are dispersed among large blocks of highly repeated retrotransposon-derived sequence. Consequently, it is necessary to develop species-specific sequence capture blocking reagents. The logistical burden of Cot-1 production drove us to develop a ‘blocker-free’ capture protocol.

Here we have used a two-stage sequential sequence capture strategy (repeat subtraction-mediated sequence capture, RSSC) to sequence an approximately 2.2 Mb chromosomal interval and a set of 43 genes dispersed across the maize genome. The first capture is designed to deplete highly repetitive elements from input plant genomic DNA using a repeat subtraction array. The second capture employs a target-specific array that enriches for the target in a reduced-complexity sample. We have further streamlined capture implementation by directly capturing from an approximately 700 bp insert 454 Life Sciences GS FLX-Titanium long-read sequencing library.

Results

Repeat subtraction-mediated sequence capture (RSSC)

The process of array-based RSSC is shown in Figure 1. RSSC consists of two phases: reducing the abundance of repetitive sequences within the capture library, and capturing target sequences from the resulting reduced-complexity library. The publicly available 454 Life Sciences GS FLX-Titanium (454 hereafter) library construction protocol was used to produce a single-stranded A/B-adapted sequencing library for either B73 or Mo17 inbreds with a mean insert size of approximately 700 bp. This library was then amplified via limited cycles of PCR using primers designed to the 454 A/B adapters, purified, and quality checked. Next, RSSC was performed on a maize repeat array constructed by tiling probes across the maize accessions in a cereal repeat database (see Experimental Procedures for design criteria).

Figure 1.  Workflow for the NimbleGen repeat subtraction-mediated sequence capture experiments for the maize genome. Target regions in the maize genome were selected and an array was designed to represent the unique portions of each target (red, green and blue segments in DNA). (1) A GS FLX-Titanium sequencing library was constructed for each sample; this process places the sequences necessary for 454 sequencing into the library before capture. The magnified section shows Titanium molecule ends (approximately 4 h). (2) The sequencing library was amplified via the 454 adapter sequences (approximately 2 h). (3) The sample is hybridized to a repeat subtraction array; design content focused on tiling the repeat content from the sample genome (24–72 h). The repeat-containing library molecules hybridize to the repeat array (black molecules and arrows), but the target fragments do not (red, green and blue). (4) The repeat content array is discarded. (5) The hybridization cocktail is recovered and placed onto the capture array. (6) The array is hybridized (72 h) and washed (1 h including elution time). (7) Captured target fragments are eluted from the array (red, green and blue). (8) Target fragments are amplified via 454 adapters (approximately 2 h). Because the 454 adapters are on the fragments, the samples are simply diluted and directly sequenced (9) using the 454 GS FLX-Titanium (24 h). A total of 8–16 samples can move from step 1 through to sequencing in as little as 2 weeks of total time.

In addition to the maize repeat array, two specific sequence capture arrays were designed and generated by Roche NimbleGen. The first capture array (Interval 377) targets an approximately 2.2 Mb genomic interval from chromosome 3 of the B73 inbred (Experimental Procedures). This array was designed based on the sequences of a series of 70 overlapping BACs. The Interval 377 capture array models situations in other crop genomes where a specific region of a sequenced genome is under investigation or where several sequenced BACs covering a region of interest are available from an otherwise unsequenced genome. Such situations may be expected when chromosome walking in a large genome such as wheat or pine. The second capture array (43-Gene array) targets 43 genes dispersed throughout the genome. The 43-Gene capture array models the situation whereby several genes in an otherwise unsequenced genome are under investigation.

For the Interval 377 capture array only, repeat sequences in the interval were masked prior to probe design (see Experimental Procedures and Figure S1). Table 1 provides summary statistics for the design of both capture arrays. The target region for each array consists of a non-redundant set of sequences that comprise the probe space. Figure S2 shows the distribution of designed probes across Interval 377. The mapping of probes onto the whole genome provided an estimate of the repetitiveness of each probe from the Interval 377 capture array (Table S1). As expected, all probes except one map to Interval 377. However, 9.9% of probes (4067/41 555) were mapped to 1–3 additional loci elsewhere in the genome due to the ancient allotetraploid nature of the maize genome (Paterson et al., 2004), the high frequency of transposon-mediated redistribution of genic fragments, and the existence of nearly identical paralogs (Schnable et al., 2009).

Table 1.   Summary statistics for maize capture array design

Array design statistic                                       Interval 377 array(a)   43-Gene array
Total length (bp)                                            2 224 325               303 557
Primary target space* after repeat masking (bp)(b)           666 488                 No masking
Length of target region (bp)(c)                              277 305                 280 749
Percentage of primary target space covered by probes(d)      42%                     92%
Length of target paralogous region (bp)(c)                   45 434                  Not determined
Number of non-transposable element protein-encoding genes    40(e)                   43

  (a) Using the B73_RefV1 sequence as the reference sequence (Experimental Procedures).
  (b) See Figure S1 for detailed method.
  (c) Target region consists of a non-redundant set of sequences used for probe synthesis.
  (d) Length of target region/length of primary target space.
  (e) Based on members of the ‘filtered gene set’ (Schnable et al., 2009) that overlapped with the target region.

Regional and dispersed sequence capture (from B73)

To characterize RSSC, the Interval 377 capture array was used to perform two independent captures of B73 genomic DNA. DNA fragments eluted from the capture arrays were sequenced using the 454 pyrosequencing technology, and the resulting filtered reads were mapped to the B73 reference genome (B73 RefGen_v1; Experimental Procedures). If the two captures from the B73 genotype are considered as a pool, more than 97% of the filter-passing captured 454 sequence reads can be mapped uniquely to the reference genome (Table S2). The two independent B73 captures show similar proportions of on-target reads and similar median and mean base-pair coverage statistics (Figure S2 and Table S2). Consequently, all subsequent analyses were performed after pooling reads from the two independent B73 captures to more fully encompass all sources of technical variation. Nearly 90% of the pooled B73 reads that can be mapped to B73 RefGen_v1 map to a single location, and are therefore non-repetitive sequences (Figure S3). Given that 85–90% of the maize genome is highly repetitive (Bennetzen et al., 2001; Martienssen et al., 2004; Schnable et al., 2009), this finding demonstrates the efficiency of the array-based repeat-subtraction procedure. Reads are considered ‘on-target’ if they overlap the target region. By comparing the percentage of on-target reads (31%) with the proportion of the genome contained in Interval 377, we calculate that RSSC for B73 achieved an approximately 2600-fold enrichment of target sequences (Table 2). This enrichment is illustrated by the dramatic difference in coverage achieved between targeted regions within Interval 377 and flanking untargeted regions (Figure 2a). Approximately 98% (271 498 bp/277 305 bp) of all bases in the target region were covered by at least one sequence read, and approximately 97% of all bases in the target region have greater than or equal to threefold coverage (Table 2). The threefold coverage value is significant because that is the minimum coverage previously established for SNP identification between inbred maize lines (Barbazuk et al., 2007).
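The fold-enrichment calculation described above (the percentage of on-target reads divided by the fraction of the genome occupied by the target region) can be written out as a short Python sketch. The function name and argument names are illustrative assumptions; the read counts, target length and 2.3 Gb genome size are the values reported in Tables 1 and 2.

```python
def fold_enrichment(on_target_reads, filtered_reads, target_bp, genome_bp=2.3e9):
    """Fraction of on-target reads divided by the fraction of the genome targeted.

    `fold_enrichment` is a hypothetical helper name; the formula follows the
    definition given for fold enrichment in Table 2.
    """
    on_target_fraction = on_target_reads / filtered_reads
    target_fraction = target_bp / genome_bp
    return on_target_fraction / target_fraction


# Interval 377 array: target region of 277 305 bp (Table 1).
b73_interval = fold_enrichment(83_429, 268_350, 277_305)   # pooled B73 captures
mo17_interval = fold_enrichment(29_226, 132_162, 277_305)  # Mo17 capture

# Rounded to the nearest hundred, as the enrichments are reported in the text.
print(round(b73_interval, -2), round(mo17_interval, -2))
```

Rounded to the nearest hundred, these inputs give approximately 2600-fold and 1800-fold enrichment, matching the values reported for the Interval 377 array.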

Table 2.   Summary statistics for maize capture data using two arrays and two genotypes

                                                              Interval 377 array                43-Gene array(a)
Statistic                                                     B73(b)           Mo17             B73              Mo17
Number of filtered reads(c)                                   268 350          132 162          16 135           30 367
Number of on-target reads(d) (percentage of on-target reads)  83 429 (31%)     29 226 (22%)     5612 (35%)       11 074 (36%)
Fold enrichment(e)                                            Approx. 2600     Approx. 1800     Approx. 2900     Approx. 3000
On-paralog reads(f) (percentage of on-paralog reads)          8939 (3.3%)      5157 (3.9%)      ND(g)            ND
Fold enrichment for paralogs(h)                               Approx. 1700     Approx. 2000     ND               ND
Coverage
  Percentage target bases covered by ≥1/≥3/≥10 capture reads  98/97/94         82/78/70         91/73/20         81/70/46
  Mean coverage of target bases                               106              38               6                12
  Mean coverage per 1000 on-target reads                      1.3              1.3              1.1              1.1

  (a) Calculations were based on combined data from all genes.
  (b) Two B73 regional captures were combined for calculation.
  (c) Reads remaining after removal of low-quality reads (Experimental Procedures).
  (d) Reads mapping to a region overlapping with the target region (Figure S3).
  (e) Percentage of on-target reads/(length of target region/size of B73 reference [2.3 Gb]).
  (f) The read mapped to a region overlapping the target paralog region (Figure 3).
  (g) Not determined.
  (h) Percentage of on-paralog reads/(length of target paralogous region/size of B73 reference genome [2.3 Gb]).

Figure 2.  RSSC of Interval 377. (a) Many sequence reads map to Interval 377 (indicated by the black bar), but few sequence reads map to adjacent, non-target regions. (b) Detailed view of Interval 377. The black bars in the top track indicate regions targeted by sequence capture probes. The next track (orange bars) provides CGH data for probes within this interval, taken from Springer et al. (2009). For each probe, the log2 of the ratio of Mo17/B73 hybridization signals (y axis) is provided. Negative values indicate higher hybridization values for B73 than Mo17. The blue, red and purple tracks provide normalized coverage (axis) for B73 sequence captures (pool of two captures) and a Mo17 sequence capture, and their difference (M – B), respectively. The arrows highlight two examples with negative log2(Mo17/B73) CGH values and normalized coverage difference (Mo17 – B73). The green track indicates locations of SNPs identified from sequence capture data. (c) Close-up view of sequence coverage in a small region of the capture interval indicated by the black bar below the green track in (b). Tracks are as described for (b).

Mapped reads are highly clustered near probe locations, suggesting highly efficient capture of probe sequences but reduced capture of sequences >500 bp from probes, consistent with a capture library consisting of fragments of approximately 700 bp. B73 reads that could not be mapped to Interval 377 (i.e. off-target reads) exhibit a seemingly random distribution across the genome, with two notable exceptions. In the first exception, 91% of an approximately 53 kb interval of chromosome 1 exhibits ≥98% identity to Interval 377. Consequently, 2032 probes perfectly match both Interval 377 and this interval of chromosome 1. The second exception involves an approximately 33 kb interval of chromosome 8 that exhibits less sequence identity to Interval 377, but even so 28 probes perfectly match both intervals. In both of these cases, the existence of perfectly matched probes resulted in paralog capture (Figure 3).

Figure 3.  Capture of paralogs from chromosome 8. A total of 28 probes from Interval 377 (shown in green) perfectly match an interval of chromosome 8. The y axis indicates the depth of coverage at each nucleotide by captured sequences that uniquely align to this interval of chromosome 8.

When the 43-Gene capture array was similarly used to capture B73 sequences via RSSC, an approximately 2900-fold enrichment was achieved (Table 2). Even though far fewer reads were generated for this capture (approximately 16 k versus approximately 268 k), 91% of the targeted bases were covered by at least one sequence read. The reduced read number results in a lower percentage of bases with greater than or equal to threefold coverage (73% versus 97%) and a lower mean coverage (6× versus 106×).
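The coverage statistics used in this section (percentage of target bases covered by at least 1, 3 or 10 reads, and mean coverage) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names, the half-open 0-based read coordinates and the toy reads are all assumptions made for illustration.

```python
def depth_profile(target_length, reads):
    """Per-base read depth across a target of `target_length` bp.

    `reads` holds (start, end) alignments in target coordinates, 0-based and
    half-open (an assumed convention). Reads are clipped to the target.
    """
    diff = [0] * (target_length + 1)
    for start, end in reads:
        s, e = max(start, 0), min(end, target_length)
        if s < e:
            diff[s] += 1   # depth rises where the read begins
            diff[e] -= 1   # and falls just past where it ends
    depth, running = [], 0
    for i in range(target_length):
        running += diff[i]
        depth.append(running)
    return depth


def coverage_summary(depth, thresholds=(1, 3, 10)):
    """Percentage of bases at or above each depth threshold, plus mean depth."""
    n = len(depth)
    pct = {t: 100.0 * sum(d >= t for d in depth) / n for t in thresholds}
    return pct, sum(depth) / n


# Toy example: four reads on a 10 bp target.
toy_reads = [(0, 6), (2, 8), (4, 10), (4, 6)]
depth = depth_profile(10, toy_reads)
pct, mean_depth = coverage_summary(depth)
print(depth, pct, mean_depth)
```

The toy example also shows why a smaller read set lowers the ≥3 and ≥10 percentages much faster than the ≥1 percentage: every base is covered at least once, yet only a fifth of the bases reach threefold depth.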

Capture of allelic sequences from Mo17

To determine the efficiency of capturing sequences from a non-reference inbred using an array based on the B73 reference genome, RSSC was performed using both B73 arrays with a capture library constructed from Mo17 genomic DNA. Applying the same stringent alignment criteria used for B73, we achieved approximately 1800-fold and approximately 3000-fold enrichment of Mo17 sequences that match the targets from the two arrays, respectively (Table 2). For the 43-Gene capture array similar enrichments were achieved for the B73 and Mo17 genotypes (approximately 2900-fold and approximately 3000-fold, respectively). In contrast, when using the Interval 377 capture array, less enrichment was obtained for Mo17 than was achieved for B73 (approximately 1800-fold versus approximately 2600-fold, respectively). We hypothesized that this could be a consequence of polymorphisms between B73 and Mo17 within Interval 377. Polymorphisms could reduce the capture of Mo17 sequences by probes designed based on B73 sequences, and/or cause difficulties in properly mapping Mo17 reads to the B73 reference sequence (Figure S4).

Maize sequence capture and comparative genomic hybridization

The hypothesis that polymorphisms are responsible for the reduced fold enrichment achieved from Mo17 is supported by data from independent maize comparative genomic hybridization (CGH) experiments performed with a 2.1 million oligonucleotide microarray (Springer et al., 2009) and the 43-Gene capture array (this study). Consistent with previous studies, our CGH data indicate that there is extensive sequence and structural variation between these two maize haplotypes. Our whole-genome array contains 2072 probes from within Interval 377. Regions of Interval 377 that had substantial B73 capture but low or no coverage by Mo17 capture reads typically exhibited negative log2 ratios of Mo17/B73 hybridization signals in our whole-genome CGH experiment (Figure 2b). This relationship was also observed in the 43-Gene capture experiments (Figure 4a). As expected, regions with equivalent coverage typically exhibited CGH log2 ratios close to zero. We hypothesized that the reason that the fold enrichments observed for the B73 and Mo17 captures from the 43-Gene capture array were similar is that the well-characterized genes on this array generally exhibit a higher degree of conservation between the two genotypes than do the predicted genes located in Interval 377. This hypothesis is supported by the finding that 15% of the CGH probes in Interval 377 (319/2072) exhibit greater than or equal to twofold variation in hybridization signals (reflecting significant structural variation between B73 and Mo17), whereas only approximately 6% of CGH probes designed for the 43-Gene array do so (980/16 406; Table S3).
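The comparison of CGH variability between the two probe sets can be sketched in Python as follows. The sketch assumes that ‘greater than or equal to twofold variation’ corresponds to an absolute log2(Mo17/B73) ratio of at least 1; that interpretation, the function name and the toy ratios are assumptions, whereas the probe counts are those quoted in the text.

```python
import math


def fraction_variable(log2_ratios, fold=2.0):
    """Fraction of probes whose |log2(Mo17/B73)| meets the fold cutoff.

    Assumes 'twofold variation' means a twofold difference in either
    direction, i.e. |log2 ratio| >= log2(fold).
    """
    cutoff = math.log2(fold)
    variable = sum(1 for r in log2_ratios if abs(r) >= cutoff)
    return variable / len(log2_ratios)


# Toy log2 ratios for eight hypothetical probes.
toy_ratios = [-2.3, -1.0, -0.4, 0.0, 0.1, 0.9, 1.5, -0.2]
toy_fraction = fraction_variable(toy_ratios)

# Proportions reported in the text, recomputed from the probe counts.
interval_377_fraction = 319 / 2072    # approximately 15%
gene_array_fraction = 980 / 16_406    # approximately 6%
print(toy_fraction, round(interval_377_fraction, 3), round(gene_array_fraction, 3))
```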

Figure 4.  Successful capture of genic regions using the 43-Gene array. (a) The zmet2 gene, including flanking regions, is one of 43 targeted for capture using this array. The position of an approximately 4.9 kb retrotransposon insertion in the Mo17 allele is indicated by a red triangle. High-density CGH data (orange data track) provide information on sequence variation between the B73 and Mo17 alleles. Extreme log2(Mo17/B73) values (y axis) indicate high levels of SNPs, InDel polymorphisms or presence/absence variants. Normalized coverage (axis) by B73 (blue) and Mo17 (red) sequence reads is shown. (b) Successful recovery of novel Mo17 allelic sequences. The VISTA identity plot (pink) was presented using the zmet2 Mo17 allele sequence as the reference sequence (bottom). Mo17 capture reads were mapped to both the B73 reference sequence and the known sequence of the Mo17 allele. Reads that could be aligned to both alleles are shown in green. Reads that could be aligned only to the Mo17 allele are shown in red. Reads that align to only Mo17 span the junction of the Mo17-specific insertion and overlie a highly polymorphic region. By de novo assembling Mo17 sequence reads into contigs (shown in orange) prior to alignment with the B73 reference allele, it was possible to recover Mo17 sequences that are highly divergent from the B73 reference allele.

For both Interval 377 and the 43-Gene capture arrays, we noted that the sequence capture provided coverage for a larger proportion of the target bases in B73 than in Mo17. We hypothesized that Mo17 regions without coverage may be caused by our inability to align captured Mo17 sequence reads with allelic B73 sequences due to high levels of DNA sequence polymorphism. To test this hypothesis, we aligned all Mo17 reads captured from the 43-Gene array to existing sequences of B73 and Mo17 alleles of four genes from this array. It was possible to align 2010 of the reads captured from Mo17 to the sequences of B73 alleles. Interestingly, 223 reads that could not be mapped to the B73 alleles of these genes could be mapped to the sequences of the Mo17 alleles of these genes. This finding demonstrates that some Mo17 sequences had been captured but had not been detected as being on target because they did not align to B73. Pre-assembly of the Mo17 reads into contigs, followed by mapping of the contigs onto the B73 reference, allowed identification of nearly 90% of these reads (197/223). These newly rescued Mo17 reads cover regions of the Mo17 haplotype that are poorly conserved relative to, or even absent from, B73. Figure 4(b) depicts this analysis for one of the four genes.

SNP prediction and validation

An important application of sequence capture is to develop SNP-based markers within targeted genomic regions by using captured sequences from non-reference genotypes. The ability to use RSSC-derived data to identify SNPs within targeted regions was tested by aligning the captured Mo17 reads from the two arrays that uniquely map to the target regions (‘on-target reads’; Table 2) to the B73 reference genome. Potential SNP sites were required to be covered by a minimum of three Mo17 reads. Because Mo17 is homozygous at each locus, Mo17 base calls at the polymorphic site were expected to be identical. Hence, only those polymorphic sites that were mono-allelic within all Mo17 reads were designated ‘high-confidence SNPs’. SNP sites that had more than one base call within the aligned reads of a single genotype were assumed to result from the inadvertent alignment of paralogous sequences; such SNPs were designated ‘lower-confidence SNPs’. The alignments of Mo17 reads to the B73 reference genome were used to predict 1357 and 1221 high-confidence SNPs from the Interval 377 and 43-Gene arrays, respectively (Experimental Procedures and Table 3). Rates of false-positive SNP predictions were estimated via comparison with known SNPs that had been detected by alignment of existing partial sequences of the Mo17 alleles for four of the genes present on the 43-Gene array. We predicted a total of 212 SNPs, including 151 high-confidence and 61 lower-confidence SNPs, within the corresponding regions of these four genes. All of the 151 high-confidence SNPs and 56 of the lower-confidence SNPs were confirmed via comparisons to our previously known control sequences. Based on this analysis, the rate of false-positive SNP prediction is extremely low (<3%).
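The SNP classification rules described above (a minimum of three reads per site, with mono-allelic sites treated as high confidence and multi-allelic sites as lower confidence) can be expressed as a minimal Python sketch. It is not the prediction pipeline used in the study, which applies additional filters (Experimental Procedures); the function name, the category labels and the toy sites are assumptions.

```python
def classify_site(ref_base, read_bases, min_reads=3):
    """Classify one aligned site from an inbred (homozygous) sample.

    ref_base   -- reference (B73) base at the site
    read_bases -- base calls from the aligned reads of the other genotype
    Labels are illustrative, not taken from the original pipeline.
    """
    if len(read_bases) < min_reads:
        return "insufficient_coverage"
    alleles = set(read_bases)
    if alleles == {ref_base}:
        return "no_snp"                 # every read agrees with the reference
    if len(alleles) == 1:
        return "high_confidence"        # mono-allelic and different from reference
    return "lower_confidence"           # mixed calls, possibly from paralogs


# Toy sites: (reference base, Mo17 read base calls).
toy_sites = [
    ("A", ["G", "G", "G"]),
    ("A", ["G", "G", "A"]),
    ("C", ["C", "C", "C", "C"]),
    ("T", ["C", "C"]),
]
labels = [classify_site(ref, reads) for ref, reads in toy_sites]
print(labels)
```

Treating mixed base calls as lower confidence relies on the sample being homozygous at each locus, as Mo17 is; the same rule would not be appropriate for a heterozygous individual.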

Table 3.   SNP prediction using reads captured from B73 and Mo17

Array          Input data(a)    Number of SNPs   Number of high-quality SNPs(b)   Number of genes(c) with high-quality SNPs
Interval 377   B73 (all)        8531             98                               2
               B73 (target)     23               5                                1
               Mo17 (all)       8044             1693                             35
               Mo17 (target)    1649             1357                             34
43-Gene set    B73 (all)        170              31                               11
               B73 (target)     144              30                               11
               Mo17 (all)       2249             1240                             40
               Mo17 (target)    1790             1221                             39

  (a) Two sets of B73 and Mo17-derived sequence reads were used for SNP prediction: all filtered reads (‘all’) and only on-target reads (‘target’).
  (b) High-quality SNPs are those that are mono-allelic for all aligning reads. In addition, SNPs identified within repetitive DNA regions of Interval 377 were removed (Experimental Procedures).
  (c) There are 40 and 43 genes represented on the Interval 377 and 43-Gene arrays, respectively (Table 1).

It was possible to identify ‘on-target reads’ for use in the analysis described above because we had access to the B73 reference genome sequence. This would not be possible in a species that lacks a reference genome sequence. To test whether SNPs could be successfully predicted without access to a reference genome sequence, we performed a second experiment in which we aligned all Mo17 reads captured from the two arrays to their respective B73 capture intervals. This experiment yielded 1693 and 1240 high-confidence SNPs from the two arrays (Table 3), representing 25% and 2% increases in the numbers of SNPs predicted, compared to using genome-directed ‘on-target reads’.

To test the hypothesis that the inclusion of non-target paralogous sequences in the SNP discovery pipeline is responsible for the increased numbers of high-confidence SNPs predicted in this second experiment, we performed a SNP discovery experiment using B73 sequences captured by the Interval 377 array to predict ‘SNPs’ relative to the B73 reference genome. We have previously shown that there is little residual heterozygosity in B73 (Emrich et al., 2007). Hence, the rate at which we identify ‘SNPs’ when using captured B73 reads is a measure of the number of putative SNPs that are false-positive due to sequencing errors or the inadvertent identification of ‘paramorphisms’ as SNPs. Paramorphisms are sequence variants between highly similar paralogs (Fu et al., 2004).

Alignment of only on-target B73 reads to Interval 377 of the B73 reference genome yielded five such high-confidence ‘SNPs’ (Table 3). Because few paralogous sequences are expected among the on-target reads, most of these false-positive SNPs are probably due to sequencing errors in either the captured reads or the reference genome. Examination of the alignments of B73 and Mo17 captured sequences in the regions of the five potential false-positive polymorphic sites indicated that two are the result of sequence errors in the reference genome and one is probably caused by capture of paralogous sequences, while the causes of the remaining two could not be determined because Mo17 reads were not available for these sites. The low rate of false-positives caused by sequencing errors reflects the high stringency of our SNP prediction pipeline. Alignment of all B73 reads to Interval 377 of the B73 reference genome yielded 98 high-confidence ‘SNPs’, which probably includes false positives due to both sequencing errors (ours and reference) and paramorphisms. Overall, the low rate of false-positive SNP calls caused by sequencing errors led us to conclude that <6% (98/1693) of the high-confidence SNPs generated in the absence of paralog removal represent false positives due to paramorphisms.
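The percentages quoted in this section follow directly from the reported SNP counts, and the arithmetic can be re-derived with a few lines of Python. The helper and variable names are illustrative assumptions; all counts are taken from the text and Table 3.

```python
def percent(numerator, denominator):
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


# Validation against known Mo17 alleles of four genes.
predicted = 151 + 61    # high-confidence + lower-confidence predictions
confirmed = 151 + 56    # predictions matching the control sequences
false_positive_pct = percent(predicted - confirmed, predicted)

# Extra high-confidence SNPs when all reads are used instead of on-target reads.
increase_interval_pct = percent(1693 - 1357, 1357)   # Interval 377 array
increase_gene_pct = percent(1240 - 1221, 1221)       # 43-Gene array

# B73 self-comparison as an upper bound on calls arising from paramorphisms.
paramorphism_bound_pct = percent(98, 1693)

print(f"{false_positive_pct:.1f} {increase_interval_pct:.1f} "
      f"{increase_gene_pct:.1f} {paramorphism_bound_pct:.1f}")
```

These evaluate to roughly 2.4%, 24.8%, 1.6% and 5.8%, consistent with the reported values of <3%, 25%, 2% and <6%.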

Discussion

Repeat subtraction-mediated sequence capture (RSSC)

Over the past two decades, several approaches to achieve a reduction in genomic complexity have been attempted, including EST sequencing, methyl filtration, and high-Cot DNA selection (Barbazuk et al., 2005). Each of these approaches has been successful in reducing genome complexity, but none delivers sequences of interest in a targeted fashion as is possible with hybridization-based sequence capture.

In initial experiments in which we used Cot-1 DNA as a blocker, we found that maize Cot-1 DNA improved the performance of sequence capture compared with human Cot-1 DNA (data not shown). Extending this idea suggests that adapting sequence capture technology to crop genomes would require the production of a species-specific blocking agent for each important crop. Published maize Cot-1 production protocols have only approximately 10% yield, meaning that scaling of production is prohibitive from the perspective of genomic DNA consumption (Zwick et al., 1997). Furthermore, in our hands, 16 of 20 independent attempts at using the previously published Cot-1-based protocol yielded fold enrichments that were at least an order of magnitude below those achieved in the current study (P.S.S., Y.F., N.M.S., W.B.B. and J.A.J., unpublished results). We therefore investigated the use of a two-stage microarray sequence capture method that might yield samples with consistently reduced complexity.

A repeat-subtraction microarray was designed to remove DNA fragments that contain highly repetitive sequences. A similar approach has been used to improve hybridization performance (Newkirk et al., 2005). To date, seven of nine maize RSSC sample attempts have been successful in providing >1000-fold enrichment in each sample. The two ‘failures’ have been traced to a hybridization reagent issue within one experiment (D.J.G. and J.A.J., unpublished results).

Approximately one-third of captured reads are ‘on target’. Although this is sufficient to make this technology very attractive for practical applications, it must be asked why two-thirds of the reads are ‘off target’. Interestingly, in the capture experiments for the human genome, probe sets in the range of 200–500 kb resulted in off-target rates that are in the same range as observed here (Albert et al., 2007). We therefore hypothesize that this off-target rate is probably a consequence of the small design space on our array. When only a small amount of library is hybridized to the array (so as to not overwhelm the repeat subtraction), there are only limited numbers of copies of the target in the sample. Increasing the design space might result in higher ‘on-target’ rates. Recent results from maize using a larger capture target design space support the correlation between design space and specificity that was first observed in the human genome (J.A.J., T.J.A. and D.J.G., unpublished results). Larger designs (approximately fivefold) exhibited an approximately twofold better on-target read rate in two independent tests (data not shown). Other potential approaches to increase the rate of on-target reads include making the numbers of various types of probes on the repeat subtraction array proportional to their copy number in the genome and reducing fragment sizes in the capture library (thereby reducing the potential for secondary capture).

Use of sequence capture to identify allelic variation

The two applications of sequence capture described here highlight the potential uses of this technology. Sequence capture of chromosomal regions, such as Interval 377, which contain a target gene, mutation or QTL can provide two important outcomes. First, the targeted resequencing identifies polymorphisms such as SNPs that can be converted into high-density genetic markers. Little sequencing was required to obtain a large number of SNPs: a 1/16 region of a 454 PicoTiterPlate run was well matched, in terms of coverage, to our approximately 300 kb capture interval within this homozygous genome. Approximately 1600 SNPs were identified that can be used to map the gene or causative QTL to high resolution. Second, the targeted resequencing concomitantly provides a set of potentially causative polymorphisms.

Another application of sequence capture is the isolation and characterization of novel alleles from a non-reference genome. Maize is a highly polymorphic species with many SNPs, InDel polymorphisms and presence/absence variants (Springer et al., 2009). We demonstrated that characterization of novel alleles is greatly facilitated by the combination of CGH and sequence capture. CGH data provide information on the relative conservation of the target genome and the reference genome. Regions of the genome that had lower hybridization to the target genome in CGH experiments often had much lower Mo17 coverage. By complementing sequence capture with CGH, it is possible to rapidly identify conserved and non-conserved regions and to focus novel allele characterization efforts on highly variable regions.

Although mapping Mo17 reads to the B73 reference sequence provided coverage of many regions, thereby allowing us to identify SNPs, a number of regions lacked Mo17 read coverage. A subset of the regions with missing Mo17 sequence probably reflects presence/absence variants (i.e. B73-derived sequences that are simply absent from the Mo17 genome). The remaining regions with missing Mo17 sequence reflect either reduced capture of Mo17 sequence or an inability to align the captured sequences to the B73 reference sequence, in both cases due to polymorphisms. Using comparisons to the known Mo17 sequences of several genomic regions, we found that the Mo17 sequences had been effectively captured, but that SNPs and InDel polymorphisms limited our ability to map these captured Mo17 sequence reads to the appropriate location on the B73 reference genome. It was possible to recover some of these sequence reads by first performing an assembly of all captured reads and then mapping the longer Mo17 assemblies to the B73 genome (Figure 4b). Importantly, assembly prior to alignment allowed us to recover novel allelic sequences from a non-reference haplotype that were not targeted by the capture array. Even though the number of sequence reads rescued in this manner is not large, such reads are valuable because they can be used to construct and extend the sequence of a captured haplotype and are therefore useful for identifying insertion/deletion polymorphisms, and may be useful for iterative capture-mediated chromosome walking.

RSSC and paralogs

Maize arose from an allotetraploidization event in the past 5–10 million years (Paterson et al., 2004), and has retained an extensive degree of gene duplication. Processes such as transposon capture of gene fragments (Schnable et al., 2009) have provided additional paralog complexity. Consequently, approximately 10% of our capture probes had more than one identical match in the maize genome, potentially making them eligible to capture paralogs. As expected, these probes were equally capable of capturing the target sequence and the paralogous sequence. Paralogous reads were recovered at a frequency consistent with their probe representation frequency [e.g. 10.7% (8939/83 429) for the B73 captures from the Interval 377 array; Table 2]. The degree to which paralog capture complicates SNP discovery depends on the structure of the genome being analyzed, but we were encouraged to discover that, even in maize, very few of the mono-allelic putative SNPs appear to be false positives.

Broader applicability of RSSC

We have reported a protocol implementation that allowed us to achieve 1800–3000-fold enrichment of both a defined chromosomal interval and a set of dispersed genes. This enrichment is comparable to that achieved for the human genome (Albert et al., 2007). For both captures, 80–98% of targeted bases were covered by captured sequences. The mean coverage of the target regions per 1000 on-target reads was similar for captures from the two different arrays (1.3 versus 1.1), highlighting the overall robustness of the approach. Therefore, the RSSC protocol provides a method to resequence targeted genomic regions of the maize genome, and is expected to exhibit similar levels of performance in other genomes. The ability to design reagents required for repeat subtraction in silico significantly reduces the technical hurdles involved in applying sequence capture across diverse species. Because highly repetitive elements can be discovered using only limited amounts of whole-genome shotgun sequencing data, it should be possible to design species-specific repeat-subtraction arrays with limited investment of resources in combination with next-generation sequencing technologies. Hence, it will be possible to apply RSSC not only to species with sequenced reference genomes, but also to those whose genomes have not yet been sequenced. Importantly, we have established that polymorphism analyses performed in the absence of a fully sequenced reference genome are not substantially cumbersome. We therefore foresee application of this technology for studies of population genetics, cloning of loci controlling quantitative variation, and allele mining in crops, model organisms, and, importantly, non-model species.
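
As a rough cross-check of the enrichment range, fold enrichment can be estimated as the observed fraction of on-target reads divided by the fraction of the genome that the target represents. The sketch below assumes this definition (the exact formula used for the reported values is not stated here) and uses figures quoted in this article: about one-third of reads on target, an approximately 300 kb capture interval, and a 2.3 Gb genome. The function name is illustrative only.

```python
def fold_enrichment(on_target_fraction, target_bp, genome_bp):
    """Observed on-target read fraction over the fraction expected by chance."""
    expected_fraction = target_bp / genome_bp
    return on_target_fraction / expected_fraction


# Values quoted in the text; the formula itself is an assumption.
estimate = fold_enrichment(1 / 3, 300_000, 2_300_000_000)
print(round(estimate))  # 2556
```

Under these assumptions the estimate falls inside the reported 1800–3000-fold range.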

Experimental Procedures

Repeat array design

A customized NimbleGen 3 x 720 K sequence capture microarray (081110_Zea_mays_repeats_cap) was synthesized three times per slide to contain maize repetitive elements in the MAGI Cereal Repeat Database (version 3.1; http://magi.plantgenomics.iastate.edu/repeatdb.html) and the Maize Repeat Database (version 4; http://maize.jcvi.org/repeat_db.shtml). The design may be ordered by request. There are 2.1 M total probes on the array, although only the center sub-array containing 720 K probes was utilized in this study. The median probe length is 74 bp.

Maize NimbleGen sequence capture array design

A large genomic region on a BAC fingerprint contig (FPC Ctg138, chromosome 3) was originally selected for targeting. Based on the physical map released prior to 29 May 2008, a total of 70 sequenced BACs are within this FPC contig, and their sequences were downloaded from GenBank on 29 May 2008. The physical map has been updated to the latest release (maize Golden Path AGP version 1, release 4a.53). Details regarding sequence annotation and gene prediction are shown in Figure S1. A total of approximately 1.5 Mb, comprising 44 unordered sequence fragments with 83 non-redundant predicted non-repetitive genes, were soft-masked for probe design. The uniqueness/repetitiveness of all the probes and physical locations of the probes were determined based on the collection of maize BAC sequences available in March 2008. The array design was constructed by tiling at approximately 5 bp spacing across the target regions. Probes with a mean 15-mer frequency in the genome greater than 100 were excluded, as were probes that had more than five close matches in the genome. A close match is a match to the genome that is at least 38 bp long, allowing up to five insertions/deletions/mismatches. When the probes are shorter than 50 bp, we use the length of the probe (12 bp, the seed length) as the minimum match size. A total of 41 555 probes were selected, and replicated at least 17 times on the array. To reconcile with the reference genome sequence, probes were remapped to B73 RefGen_v1 (Schnable et al., 2009). The final sequence interval was defined as extending from 1 kb upstream of the leftmost mapped probe (REGION0042FS000010140) to 1 kb downstream of the rightmost mapped probe (REGION0028FS000002032), i.e. 183 062 553–185 609 824 bp on chromosome 3. Two fragments (183 315 664–183 553 126 bp and 183 880 178–183 965 661 bp) were excluded from analyses because they were not present in the sequences used for probe design. This design was used to generate a customized NimbleGen 3 x 720 K sequence capture array. Only the center sub-array was utilized for this study; this array may be ordered from Roche NimbleGen by requesting 081028_Zea_mays_schnable_cap.
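
The size of the analyzed interval follows directly from the coordinates given above. The short calculation below is a sketch that assumes the quoted coordinates are 1-based and inclusive (the coordinate convention is not stated in the text); the variable and function names are illustrative.

```python
# Chromosome 3 coordinates as quoted in the text (assumed 1-based, inclusive).
interval = (183_062_553, 185_609_824)
excluded = [(183_315_664, 183_553_126), (183_880_178, 183_965_661)]


def span(start, end):
    """Length in bp of a 1-based inclusive interval."""
    return end - start + 1


total_bp = span(*interval)
excluded_bp = sum(span(start, end) for start, end in excluded)
analyzed_bp = total_bp - excluded_bp
print(total_bp, excluded_bp, analyzed_bp)  # 2547272 322947 2224325
```

After removing the two excluded fragments, about 2.2 Mb remain, which agrees with the interval size quoted in the Summary.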

The second NimbleGen 3 x 720 K sequence capture array design was constructed by tiling at approximately 15 bp spacing across 43 dispersed gene targets. Probes with a mean 13-mer frequency in the genome greater than 500 were excluded, as were probes that had more than seven close matches in the genome. A total of 16 406 probes were selected and replicated 44 times on the array. This array comprises approximately 350 kbp of genomic space, but has only 123 kbp represented within the probes. Again, only the center sub-array was utilized for this study. This design may be ordered by requesting 080328_maize_cap_springer_1. These probes are of various lengths and the median probe length for both designs is 76 bp.
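
The two exclusion rules used for both designs (mean k-mer frequency and number of close matches) can be illustrated with a minimal sketch. This is not the NimbleGen design software: the function names are hypothetical, the k-mer table is built naively from a toy string, and the close-match count is assumed to come from a separate alignment step. For the designs described above, the parameters would be k = 15, a mean frequency cut-off of 100 and five close matches for the Interval 377 array, and k = 13, 500 and seven for the 43-Gene array.

```python
from collections import Counter


def kmer_counts(genome, k):
    """Count every k-mer in a genome string (practical only at toy scale)."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def mean_kmer_frequency(probe, counts, k):
    """Mean genomic count of the k-mers along a probe (probe length >= k)."""
    kmers = [probe[i:i + k] for i in range(len(probe) - k + 1)]
    return sum(counts[kmer] for kmer in kmers) / len(kmers)


def keep_probe(probe, counts, k, max_mean_freq, close_matches, max_close_matches):
    """Apply both exclusion rules; close_matches is supplied by the caller."""
    if mean_kmer_frequency(probe, counts, k) > max_mean_freq:
        return False
    return close_matches <= max_close_matches


# Toy data with deliberately small k and thresholds, for illustration only.
genome = "ACGT" * 50 + "GGATCCTTAGCA"
counts = kmer_counts(genome, 4)
repeat_probe = "ACGTACGTACGT"
unique_probe = "GGATCCTTAGCA"
print(keep_probe(repeat_probe, counts, 4, 10, 0, 5))  # False
print(keep_probe(unique_probe, counts, 4, 10, 1, 5))  # True
```

A genome-scale implementation would need a dedicated k-mer counter rather than an in-memory `Counter`.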

Maize sequence capture and 454 sequencing

DNA was isolated from 14-day-old seedlings of two maize inbreds, B73 and Mo17, using a previously described protocol (Li et al., 2007). A 700 bp mean insert size 454 GS FLX-Titanium sequencing library (454 Life Sciences, http://www.454.com) was generated for each inbred and subjected to eight cycles of amplification using primers based upon the sequencing adapters. Amplicons were purified using a QIAquick/MinElute spin column (Qiagen, http://www.qiagen.com/). The DNA concentration was determined using a NanoDrop ND1000 (Thermo Scientific, http://www.thermo.com) and the molecular weight range was determined using an Agilent Bioanalyzer 2100 with a DNA7500 kit (Agilent Technologies, http://www.agilent.com). We progressively decreased the total amount of library used per hybridization across the study. The Interval 377 captures used 500 ng, and the 43-Gene captures utilized either 250 or 150 ng for the repeat-subtraction hybridization. The indicated mass of double-stranded sequencing library was hybridized to the maize repeat-subtraction array at low stringency (37°C) using the Mai Tai system (SciGene, http://www.scigene.com) with NimbleGen hybridization solution supplemented with Tween-20 at 0.1% v/v, together with a 100-fold molar excess of non-extendable primers complementary to the sequencing adapters. The rotation speed in the SciGene hybridization oven was set to 15. The hybridization cocktail was recovered by separating the two slides with the gasket array on the bottom (facing up) and the subtraction array on the top (facing down). The remaining hybridization cocktail, containing the library fragments of interest (still on the gasket slide), was then applied to a second array (the capture array) targeting the gene space of interest. The capture array was placed with the probe side facing down onto the hybridization cocktail on the gasket slide. The gasket slide remained in the Mai Tai rig during placement.
The capture array was then subjected to an additional 4 days of hybridization at 42.5°C with the rotator set to 15. The capture array was washed as previously described (Albert et al., 2007) and eluted using a sodium hydroxide method that is available from Roche NimbleGen Technical Support on request. The eluted molecules were amplified via the sequencing adapters (14 cycles), and the products were purified and quantified. The double-stranded eluted libraries were diluted for emulsion PCR (emPCR) as recommended by 454 Life Sciences, and sequenced using the 454 Life Sciences GS FLX-Titanium protocol according to the manufacturer’s instructions using a 4- or 16-region Titanium PicoTiterPlate. Prior to emPCR, the diluted double-stranded eluate libraries were heat-treated at 95°C for 2 min in a thermal cycler. This heating step was found to be essential to avoid amplification-associated artifacts in the emPCR. The raw 454 capture reads were deposited in the GenBank Short Read Archive (accession number SRA009261.9). Low-quality reads (parameters: maximum mean error = 0.01, maximum error at ends = 0.01) and short reads (<200 bp) were removed using Lucy (Chou and Holmes, 2001). The 200 bp cut-off was selected because few of the 454 reads of <200 bp could be mapped to the B73 reference genome (K.Y., Y.F. and P.S.S., unpublished results).
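
The effect of the length and mean-error thresholds can be illustrated with a simplified sketch. This is not Lucy's algorithm, which trims reads to a clean range and also applies the end-error criterion; the code below only shows what a 200 bp minimum length and a maximum mean error of 0.01 imply for a read with Phred-scaled base qualities. The function names and the whole-read treatment are assumptions made for illustration.

```python
def mean_error(phred_scores):
    """Mean per-base error probability, using P = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in phred_scores) / len(phred_scores)


def passes_filter(phred_scores, min_length=200, max_mean_error=0.01):
    """True if the read is long enough and its mean error is low enough."""
    if len(phred_scores) < min_length:
        return False
    return mean_error(phred_scores) <= max_mean_error


print(passes_filter([30] * 250))  # True: long enough, mean error 0.001
print(passes_filter([30] * 150))  # False: shorter than 200 bp
print(passes_filter([15] * 250))  # False: mean error of about 0.03
```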

Data analyses

To estimate on-target rates, all filtered B73 and Mo17 captured 454 reads were aligned to the B73 reference genome sequence, i.e. B73_RefGen_v1 (Schnable et al., 2009) (BLAST alignment criteria: 95% similarity and total unaligned regions of both 5′ and 3′ ends of 454 reads ≤15 bp). Sequence reads whose best match overlapped a target region were classified as on-target. The target paralog region is defined as a non-redundant set of sequences of those probes that can be mapped both inside and outside Interval 377. Sequence reads with a best match that overlaps the target paralog region are considered as ‘on-paralog’ reads. Whole-genome CGH data were retrieved from the NCBI GEO database (GSE16938) (Springer et al., 2009). Only CGH probes within targeted regions were used to calculate normalized coverage. GFF files were generated for data visualization using NimbleScan (version 2.4, NimbleGen). Shell and AWK scripts for the analysis pipeline are available upon request. Additional CGH data for the 43-Gene array are given in Table S3. Sequence alignments between B73 and Mo17 allelic sequences were performed using VISTA (LAGAN alignment program used with default settings) (Frazer et al., 2004). CAP3 (Huang and Madan, 1999) was used for assembling Mo17 reads from the 43-Gene array (parameters used: overlap percentage identity ≥95, overlap length ≥50 bp).
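
The read classification described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' Shell and AWK pipeline: the function names, the tuple layout for regions, and the reading of "95% similarity" as identity of at least 95% are assumptions. The alignment check sums the unaligned 5′ and 3′ tails of a read, and the classifier labels the best hit by interval overlap.

```python
def passes_alignment_criteria(identity_pct, read_len, aln_start, aln_end,
                              min_identity=95.0, max_unaligned=15):
    """Check identity and the total unaligned 5' + 3' tail length.

    aln_start and aln_end are 1-based inclusive positions on the read.
    """
    tails = (aln_start - 1) + (read_len - aln_end)
    return identity_pct >= min_identity and tails <= max_unaligned


def overlaps(chrom, start, end, regions):
    """True if [start, end] on chrom overlaps any (chrom, start, end) region."""
    return any(chrom == r_chrom and start <= r_end and r_start <= end
               for r_chrom, r_start, r_end in regions)


def classify(chrom, hit_start, hit_end, targets, paralogs):
    """Label the best hit of a read as on-target, on-paralog, or other."""
    if overlaps(chrom, hit_start, hit_end, targets):
        return "on-target"
    if overlaps(chrom, hit_start, hit_end, paralogs):
        return "on-paralog"
    return "other"


# Hypothetical regions and hits, for illustration only.
targets = [("chr3", 1000, 2000)]
paralogs = [("chr8", 500, 900)]
print(classify("chr3", 1900, 2400, targets, paralogs))  # on-target
print(classify("chr8", 850, 1300, targets, paralogs))   # on-paralog
print(classify("chr3", 2001, 2500, targets, paralogs))  # other
```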

Comparative genomic hybridization (CGH)

CGH was performed using the 43-Gene capture array in place of a CGH design within a standard Roche NimbleGen CGH workflow for NimbleGen human CGH with 385 K arrays. Two arrays were utilized in a B73 versus Mo17 dye swap. Labeling, hybridization, washing, scanning and analytical conditions were as previously reported (Springer et al., 2009).

SNP discovery

SNP discovery was performed either using all filtered 454 reads or the subset of on-target 454 reads defined above. The 454 reads were aligned to the reference sequences (either the chromosome 3 Interval 377 or the 43-Gene set) using MosaikAligner (Hillier et al., 2008) with the following parameters: -a (alignment algorithm), all; -p (CPUs used), 8; -mmp (maximum percentage of read length to be mismatched), 0.05; -minp (minimum percentage of the read length aligned), 0.95; -mmal (aligned read length rather than the original read length when counting errors); -m (alignment mode), unique; -hs (hash size), 15; -mhp (maximum number of positions to use), 100. These alignment parameters ensured that each 454 sequence read was uniquely aligned; sequences that failed to meet these criteria were discarded from the analysis. SNPs were identified within the alignments using the GigaBayes package (http://bioinformatics.bc.edu/marthlab). Arguments to GigaBayes were: --D (pairwise nucleotide diversity), 0.003; --ploidy (sample ploidy), haploid; --algorithm, recursive; --sample (sequence source), single; --anchor; --CAL (minimum overall allele coverage), 3; --QRL (minimum base quality value), 20. Potential SNP sites were required to be covered by a minimum of three Mo17 reads, and all Mo17 base calls at the polymorphic site were expected to be identical. SNP sites that had more than one allele within the aligned reads were assumed to result from alignment of paralog sequences. In addition, potential high-confidence SNP sites within the Interval 377 region were required to be from non-repetitive regions. The false SNP discovery rate was determined by identifying potential SNPs from B73 captured reads.
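
The site-level rules applied after SNP calling (at least three Mo17 reads, all carrying the same base call) can be expressed as a small filter. The sketch below is not GigaBayes and ignores base quality; it only mirrors the coverage and mono-allelic requirements described above, with a hypothetical function name and input layout.

```python
from collections import Counter


def call_mono_allelic_snp(ref_base, read_bases, min_reads=3):
    """Return the variant base if the site passes the filters, else None.

    read_bases holds the base calls of all reads aligned over one site.
    """
    if len(read_bases) < min_reads:
        return None  # too little coverage
    alleles = Counter(read_bases)
    if len(alleles) != 1:
        return None  # mixed calls, possibly from aligned paralogs
    (base,) = alleles
    return base if base != ref_base else None


print(call_mono_allelic_snp("A", ["G", "G", "G"]))       # G
print(call_mono_allelic_snp("A", ["G", "G"]))            # None
print(call_mono_allelic_snp("A", ["G", "G", "A", "G"]))  # None
```

Running the same filter on B73 reads aligned to the B73 reference gives a count of sites that should not be called, which is the idea behind the false discovery estimate described above.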

Acknowledgements

We thank James Birchler (Division of Biological Sciences, University of Missouri) for providing repeat clones for early development work; John Luckey, Jason Norton and Paul Marrione for support with Mai Tai hybridization optimization and platform development; Rudi Seibl, Rebecca Selzer and Courtney Erickson for research and development support; and the Maize Genome Sequencing Project (NSF DBI-0527192) for sharing genome sequences and annotation prior to publication. This project was supported in part by funding to P.S.S. from the Iowa State University Plant Sciences Institute, funding to N.M.S. from the University of Minnesota, and a grant from the National Science Foundation Plant Genome Program (DBI-0501758) and funding from the University of Florida to W.B.B. The Roche NimbleGen research and development group is privately funded.

References

Supporting Information

Figure S1. Workflow for preparing BAC sequences within FPC Ctg138 for design of probes for the Interval 377 array.

Figure S2. Evaluation of the coverage and reproducibility of sequence capture from Interval 377. (a) Two B73 captures (tracks from top to bottom: probe target regions in black; normalized coverage, i.e. the observed coverage of each base divided by the mean coverage of all targeted bases, in blue); (b) Correlation between the normalized coverage of two B73 captures; each spot represents a base within the target region (the non-redundant set of sequences used for probe synthesis); (c) Density plot for coverage uniformity across the target region.

Figure S3. Mapping of capture reads to Interval 377. (a) B73 reads; (b) Mo17 reads. BLAST alignment criteria: 95% similarity and the total unaligned regions of both 5′ and 3′ ends of 454 reads (‘tails’) ≤15 bp. Mapped reads are classified as ‘on-target’ (green), ‘on-paralog’ (yellow), and other (red).

Figure S4. Comparison of CGH data and sequence capture efficiency. The structure of the B73 allele of the zmet2 gene is illustrated. The position of an ~4.9 kb retrotransposon insertion in the Mo17 allele is indicated by a red triangle. High-density CGH data (orange data track) provide information on sequence variation between the B73 and Mo17 alleles. Extreme log2(Mo17/B73) values (y-axis) indicate high levels of SNPs, IDPs or PAVs between the two alleles. Normalized coverage (y-axis) by B73 (blue) and Mo17 (red) sequence reads is shown, as is the difference between Mo17 and B73 (purple).

Table S1. Re-mapping of probes designed for Interval 377 to B73 RefGen_v1.

Table S2. Summary statistics for two B73 captures of Interval 377.

Table S3. Gene array CGH data.

Filename                        Format  Size    Description
TPJ_4196_sm_FigureS1.tif        tif     1729K   Supporting info item
TPJ_4196_sm_FigureS2.tif        tif     209K    Supporting info item
TPJ_4196_sm_FigureS3.tif        tif     1730K   Supporting info item
TPJ_4196_sm_FigureS4.pdf        pdf     420K    Supporting info item
TPJ_4196_sm_Legends.doc         doc     31K     Supporting info item
TPJ_4196_sm_TableS3.xls         xls     1842K   Supporting info item
TPJ_4196_sm_TablesS1-S2.doc     doc     45K     Supporting info item
