
Targeted re-sequencing of the allohexaploid wheat exome

Authors


(Tel 44 117 331 6770; fax 44 117 925 7374; email mark.winfield@bristol.ac.uk)

Summary

Bread wheat, Triticum aestivum, is an allohexaploid composed of the three distinct ancestral genomes, A, B and D. The polyploid nature of the wheat genome together with its large size has limited our ability to generate the significant amount of sequence data required for whole genome studies. Even with the advent of next-generation sequencing technology, it is still relatively expensive to generate whole genome sequences for more than a few wheat genomes at any one time. To overcome this problem, we have developed a targeted-capture re-sequencing protocol based upon NimbleGen array technology to capture and characterize 56.5 Mb of genomic DNA with sequence similarity to over 100 000 transcripts from eight different UK allohexaploid wheat varieties. Using this procedure in conjunction with a carefully designed bioinformatic procedure, we have identified more than 500 000 putative single-nucleotide polymorphisms (SNPs). While 80% of these were variants between the homoeologous genomes, A, B and D, a significant number (20%) were putative varietal SNPs between the eight varieties studied. A small number of these latter polymorphisms were experimentally validated using KASPar technology and 94% proved to be genuine. The procedures described here to sequence a large proportion of the wheat genome, and the various SNPs identified should be of considerable use to the wider wheat community.

Introduction

Globally, wheat is one of the three most important crops for human and livestock feed (Shewry, 2009). However, unlike rice (a diploid) and maize (an ancient tetraploid), wheat is an allohexaploid (AABBDD) derived from the hybridization of the diploid Aegilops tauschii (DD genome) with the tetraploid Triticum turgidum (AABB genome) approximately 8000 years ago (Dubcovsky and Dvorak, 2007). Our ability to characterize the wheat genome has been revolutionized by the development of next-generation sequencing (NGS). However, the size and complexity of the genome means that whole genome sequencing (WGS) costs are still high compared with many model species with smaller genomes (see Biesecker et al., 2011). In maize, a crop with a large ancient tetraploid genome, the issues surrounding WGS have been overcome by utilizing targeted-capture re-sequencing using NimbleGen arrays (Fu et al., 2010). Targeted-capture, which is used to enrich for sequences of interest before carrying out NGS, can be carried out by hybridization of the target sequences to bait probes in solution (Sulonen et al., 2011) or on solid support (Asan et al., 2011). Saintenac et al. (2011) have recently reported the use of SureSelect, an in-solution targeted-capture technology, to examine the genome of tetraploid wheat. Saintenac’s work did not extend the study to hexaploid wheat, and, although they pointed out that, without proper safeguards, many of the sequence variations identified might not be verifiable and may prove to be false, they performed no validation of their putative single-nucleotide polymorphisms (SNPs).

In hexaploid wheat, there is the inherent difficulty of the three homoeologous genomes, and thus, there are likely to be at least three similar copies of each gene. In addition, some loci may have undergone gene duplications producing closely related paralogues. Therefore, intensive computational analyses are required to translate raw wheat NGS data into accurately mapped reads from which a comprehensive list of variants can be derived (Reumers et al., 2012). This is a challenging problem and, without safeguards, many of the detected variants might be subsequently re-classified as genotyping errors. As a result, it is usually necessary to validate the putative variants using Sanger sequencing or other targeted approaches, and this may prove as expensive as the initial NGS approaches. It is, therefore, important to include filters in any analysis to remove false-positive calls (a variant called when, in fact, one does not exist), while minimizing false negatives (no variant called, when, in fact, a variant is present). Quality filters can be applied to the raw read data to remove sequences based on particular parameters, such as low coverage of the target sequence or quality scores indicating the reliability of the base calling. In well-annotated genomes, such as those available for model organisms, it is also possible to apply filters that remove target genomic regions that contain repetitive DNA sequences (Reumers et al., 2012). This presents more of a challenge for complex unfinished genomes such as wheat. Targeting of exonic regions of the genome should minimize the presence of repeats. To extend targeted-capture re-sequencing to hexaploid wheat, and to include within these studies an examination of homoeologous and varietal SNPs, we have developed a pipeline for sequence capture, NGS and sequence characterization of wheat genomic DNA resulting in SNP identification and validation. Our pipeline incorporates a NimbleGen array designed to capture a significant proportion of the wheat exome, and a bioinformatics pipeline designed to identify both putative homoeologous and varietal SNPs. Furthermore, the work presented here also confirms the validity, against 23 wheat varieties, of a small subset (96) of the putative SNPs identified from the NimbleGen capture experiments. Here, we discuss the results of targeted sequencing of eight UK bread wheat varieties chosen for this study as representative of valuable British varieties.

Results

Coverage of the exome

The design of the NimbleGen array used in this study is described in the experimental procedures. The array contained 132 606 repeat-masked expressed sequence tags covering 56.5 Mb. The average and median length of the sequences, used to generate the capture probes, were 426 and 366 bp, respectively. Given that the wheat genome is approximately 17 Gb in size, of which 1%–2% is thought to be protein coding (Paux et al., 2006), the wheat exome represents about 170–340 Mb. If the wheat NimbleGen array contains only unique sequences, as it was designed to do, and given that the total length of the features on the array is 56.5 Mb, it is capable of capturing at least 50% of the genes in a diploid exome.
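The estimate above can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, and dividing the hexaploid exome by three assumes that the A, B and D genomes contribute equally, which is a simplification.

```python
# Back-of-envelope estimate of exome size and array coverage.
# Figures come from the text; the equal three-way split between the
# A, B and D genomes is a simplifying assumption.
GENOME_MB = 17_000                 # ~17 Gb hexaploid genome
CODING_FRACTION = (0.01, 0.02)     # 1%-2% thought to be protein coding
ARRAY_MB = 56.5                    # total length of array features

hexaploid_exome = [GENOME_MB * f for f in CODING_FRACTION]   # ~170-340 Mb
diploid_exome = [x / 3 for x in hexaploid_exome]             # ~57-113 Mb
array_share = [ARRAY_MB / x for x in diploid_exome]          # fraction captured

for exome, share in zip(diploid_exome, array_share):
    print(f"diploid exome {exome:.0f} Mb -> array covers {share:.0%}")
```

Under these assumptions the array spans roughly half of a diploid exome at the upper size estimate and nearly all of it at the lower estimate.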

Screening the NimbleGen array with genomic DNA derived from eight UK wheat varieties

Eight UK wheat varieties (Alchemy, Avalon, Cadenza, Hereward, Rialto, Robigus, Savannah and Xi19) were used for the study. Hybridization of Illumina NGS libraries derived from genomic DNA from these eight varieties to NimbleGen arrays, followed by Illumina NGS, gave a total of 127.2 million reads, of which 48.4 million reads (average, 38.1%; range, 22%–44.5%) could be aligned to a reference sequence generated from the sequences used to make the array: the variety Xi19 had the lowest number of mapped reads, and the varieties Alchemy and Savannah had the highest (Table 1). Of the 132 606 features on the array, 119 386 (90%) had a match to at least one read. The remaining 13 220 features (10%) had no matches. The mean coverage level of sequence reads matching NimbleGen array capture probes (derived from EST and cDNA sequences; see section ‘Generation of the NimbleGen Array’ in Experimental procedures) from across all eight varieties was 381 sequences per NimbleGen feature (Figure S1). Thus, the average coverage per variety was 47.6× (381/8).

Table 1. The number of raw sequences and mapped reads for the eight wheat varieties following hybridization to the wheat NimbleGen array
Variety     Raw reads (×million)    Mapped reads (×million)    Percentage mapped
Alchemy     44.4                    16.9                       38.0
Avalon      27.2                    10.9                       39.9
Cadenza     20.2                    4.9                        24.1
Hereward    41.0                    15.2                       37.0
Rialto      30.4                    11.8                       38.8
Robigus     31.4                    6.3                        20.0
Savannah    48.7                    16.8                       34.6
Xi19        9.8                     2.9                        29.9

SNP distribution within and between the eight varieties

Using a previously described custom PERL script (Allen et al., 2011), we found 59 762 contigs from the unigene set (i.e. those contigs used to create the array-bound capture probes) where one or more putative SNPs could be characterized once the Illumina data had been aligned to this unigene reference set. These 59 762 array-bound unigenes had a cumulative length of 34.1 Mb. The average and median lengths of these SNP-containing contigs were 571 and 469 bp, respectively. These figures are somewhat higher than the pre-filtered values (426 and 366 bp) reported above. The total number of putative SNPs called across the eight varieties was 511 439 (Table S1). The average number of SNPs per contig was 8.6 (511 439/59 762), which translates into a SNP occurring, on average, every 67 bp (34.1 Mb/511 439), or ∼15 SNPs per kb. This figure is very similar to that of 16.5 SNPs per kb reported by Barker and Edwards (2009). The number of SNPs per contig ranged from 1 (7928 contigs or 13.3% of the total) to 127 (present in only one contig; Figure 1a): 50% of all contigs had <5 SNPs, and 75% had <10 SNPs (Figure 1b). There was a weak correlation (R² = 0.3422) between contig length and the number of SNPs they contained (Figure 1c). The longest contig, contig49291, was 2938 bp long and had 96 putative SNPs. The shortest contig, contig07060, was 113 bp and had one putative SNP. The contig with the most SNPs (contig63711 with 127 SNPs) was 1164 bp long (a SNP every 9.2 bp), little more than a third of the length of the longest contig. By contrast, contig41302 was 1185 bp long and had only one SNP. A BLASTX of these five contigs shows them all to be hypothetical or predicted proteins (data not shown).
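The density figures quoted above follow directly from the totals. A minimal sketch of the arithmetic, with illustrative variable names of our own choosing:

```python
# Summary statistics for the SNP-containing contigs, using the totals
# quoted in the text (variable names are illustrative).
total_snps = 511_439
snp_contigs = 59_762
cumulative_bp = 34.1e6          # 34.1 Mb

snps_per_contig = total_snps / snp_contigs     # ~8.6
bp_per_snp = cumulative_bp / total_snps        # ~67 bp
snps_per_kb = 1000 / bp_per_snp                # ~15

print(f"{snps_per_contig:.1f} SNPs per contig, "
      f"one SNP every {bp_per_snp:.0f} bp, "
      f"{snps_per_kb:.0f} SNPs per kb")
```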

Figure 1.

 (a) Histogram of the number of single-nucleotide polymorphisms (SNPs) within each NimbleGen contig. (b) The number of SNPs per contig plotted against cumulative percentage of contig number (this shows that 95% of the contigs had <25 SNPs while the other 5% had more). (c) Scatterplot of SNP frequency versus contig length (bp).

Identification and characterization of homoeologous and varietal-specific SNPs

Of the total number of putative SNPs identified in our study (511 439), 411 494 (80.5%) were classified as homoeologous (that is, allelic variants exist between the three genomes within a particular variety – see Figure 2a), and 99 945 (19.5%) were classified as varietal SNPs (that is, allelic variants were identified that distinguished two or more varieties – see Figure 2b). Because of their inherent value as molecular markers, we experimentally validated a small number (96) of the putative varietal SNPs on 23 wheat varieties (Table S2) using KASPar-based genotyping technology (Allen et al., 2011). The validation rate for this subset was 93.8% (90 of the 96 putative SNPs tested; Table S2). This figure of 93.8% is considerably higher than the value of 60% reported by Allen et al. (2011). However, in the study by Allen et al. (2011), RNA was used as the source of the cDNA sequenced, a procedure that may lead to problems associated with either the non-expression of one or more alleles/homoeologues or the presence of introns. Additionally, we compared the SNPs identified in this study against those identified by Allen et al. (2011) to see how many were common to the two studies; 3.2% of the putative SNPs identified in this study were also in the Allen et al. (2011) data set (conversely, 16.3% of the SNPs reported by Allen et al. (2011) were found in our data set). A more in-depth analysis of a large number of the putative varietal SNPs is in progress.

Figure 2.

 Diagrammatic representation of the single-nucleotide polymorphism (SNP) nomenclature used in this paper. (a) A homoeologous SNP is a difference that exists between the three homoeologous genomes; here, both varieties 1 and 2 possess the same homoeologous SNP. Homoeologous SNPs make no reference to a second variety or individual; they are a characteristic of an individual. (b) A varietal SNP is shown where there is a homozygous TT in the D genome of variety 1 and a homozygous AA in the D genome of variety 2.

The homoeologous and varietal SNPs identified were further classified according to whether the base change was a transition or transversion event (Table 2). In general, transitions tend to be more conservative, because substituted nucleotides belong to the same chemical group (purine or pyrimidine), result in less physical disruption of the DNA helix and so are more likely to be overlooked by error proof-reading enzymes. Conversely, SNPs resulting in transversions involve nucleotide substitution across purine and pyrimidine chemical groups. The different chemical structures normally limit the occurrence of SNPs of transversion type (Yang and Yoder, 1999). SNPs that could potentially be three or four different bases (e.g. A ↔ C ↔ G ↔ T) were categorized as a multiple substitution event. The number of SNPs that caused a transition was 326 885 (63.9%), compared with 177 418 (34.7%) transversions and 7136 (1.4%) multiple substitution events at tri-homoeoallelic SNP locations (Table 2). The greater number of transitions compared with transversions is consistent with the findings reported for barley (Duran et al., 2009). This bias is commonly observed for genuine SNPs and is thought to result from the relatively high rate of mutation of methylated cytosines to thymine (Coulondre et al., 1978). The distribution of base calls (A, T, C, G) is similar for all eight varieties (Figure S2), all having elevated G + C content, and is consistent with that expected for wheat DNA (Wang et al., 2005).

Table 2. Transition/transversion rates for all putative single-nucleotide polymorphisms
Type                                     Number     Percentage
Transitions (A↔G, C↔T)                   326 885    63.9
Transversions (C↔G, A↔C, G↔T, A↔T)       177 418    34.7
Multiple substitution events             7136       1.4
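The classification rule behind Table 2 can be sketched in a few lines. This is an illustration written for this text, not the authors' PERL code; the function name `classify_snp` is hypothetical. The counts at the end are those reported in Table 2.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_snp(alleles):
    """Classify a SNP from the set of bases observed at the locus."""
    bases = {b.upper() for b in alleles}
    if not bases <= PURINES | PYRIMIDINES or len(bases) < 2:
        raise ValueError(f"need at least two distinct bases from ACGT: {alleles!r}")
    if len(bases) > 2:
        return "multiple"            # three or four bases at one locus
    if bases <= PURINES or bases <= PYRIMIDINES:
        return "transition"          # purine<->purine or pyrimidine<->pyrimidine
    return "transversion"            # purine<->pyrimidine

# Counts reported in Table 2, converted to percentages of all putative SNPs.
counts = {"transition": 326_885, "transversion": 177_418, "multiple": 7_136}
total = sum(counts.values())
percentages = {k: round(100 * v / total, 1) for k, v in counts.items()}
print(total, percentages)
```

The three categories sum to the 511 439 putative SNPs reported for the whole data set.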

SNP allele calls

The total number of allele calls (where a ‘call’ represents a single read across a locus where a putative SNP exists; these calls represent the depth of coverage once the data have been filtered as shown in Figure 6) for each variety is shown in Figure 3a. These values varied from 3 890 800 to 20 505 631 for Xi19 and Savannah, respectively (these figures reflect the number of mapped reads reported in Table 1). These calls are distributed across the 511 439 putative SNP loci, and the average number of calls per locus for each variety ranged from 7.6 to 40; adjusting for the fact that there were missing data for some loci, the average number of calls per locus ranged from 8.0 to 40.3.
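The per-locus averages can be checked as follows. This is a sketch with names of our own choosing; it takes the total of 511 439 putative SNP loci reported earlier and the missing-locus counts given in the section ‘Missing data and monomorphism’.

```python
TOTAL_LOCI = 511_439

def mean_calls_per_locus(total_calls, missing_loci=0):
    """Average calls per locus, optionally excluding loci with no data."""
    return total_calls / (TOTAL_LOCI - missing_loci)

# (total allele calls, loci with missing data) for the two extreme varieties
xi19 = (3_890_800, 27_353)
savannah = (20_505_631, 2_639)

for name, (calls, missing) in {"Xi19": xi19, "Savannah": savannah}.items():
    print(f"{name}: {mean_calls_per_locus(calls):.1f} unadjusted, "
          f"{mean_calls_per_locus(calls, missing):.1f} adjusted for missing loci")
```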

Figure 3.

 (a) Number of allele calls across putative single-nucleotide polymorphism loci for each variety (post-filter data); (b) the percentage of loci plotted against the number of calls; (c) the cumulative percentage of loci plotted against number of calls. From this graph, it is possible to see that almost 50% of the reads for Xi19 are at or lower than a 5× coverage, whereas for Alchemy, Hereward and Savannah <5% have this level of coverage. NB: The right-hand tail of the two graphs is not shown to focus in on the coverage of the vast majority of loci.

Missing data and monomorphism

The percentage of loci with missing data was calculated for each of the varieties, and is displayed in Figure 4a. The variety Xi19 had the greatest number of loci for which data were missing (27 353) followed by the varieties Cadenza (13 206) and Robigus (8767). The remaining five varieties studied, Alchemy (3114), Avalon (2760), Hereward (1456), Rialto (2773) and Savannah (2639), had considerably fewer such loci. There was a negative relationship (R2 = 0.81) between the varietal average number of calls per locus and the number of loci with missing data (Figure 4b).

Figure 4.

 (a) and (b) The percentage of single-nucleotide polymorphism (SNP) loci for each variety for which there was no read coverage. (c) and (d) The number of SNPs called as monomorphic (no variation between the three genomes) and the depth of cover of reads for the variety; (e) the relationship between the percentage of loci with missing data and the percentage of monomorphic loci for each of the eight varieties. The graphs (b) and (d) give some indication of the level of coverage needed to reduce null calls (c. 15×) and to avoid missing homoeologous SNPs (c. 15–20×). That is, slightly higher coverage is needed to see two alleles rather than one.

At a number of loci, while there were homoeologous SNPs present in some varieties studied, in other varieties there was no variability between the three homoeologous genomes studied: such loci were defined as monomorphic (i.e. invariant in that variety) (Saintenac et al., 2011). The number of monomorphic loci for each variety was calculated, and the result is presented in Figure 4c. The highest percentage of monomorphic loci was for Xi19, the variety with the smallest depth of coverage. There was a strong negative correlation (R² = 0.97) between the varietal average number of calls per locus and the total number of monomorphic loci (Figure 4d).

A strong positive relationship (R2 = 0.94) existed between the number of loci with missing data and the number of monomorphic loci for each of the varieties (Figure 4e).

Tri-homoeoallelics

Given that bread wheat is an allohexaploid, most wheat genes are expected to be present as three homoeologues (Hernandez et al., 2012; Sears, 1954). However, as our data show (Figure 5), it is unusual to find SNP loci for which there are three distinct alleles (Table S3). In a small number of cases, however, we found loci for which there were three alleles; these were defined as tri-homoeoallelics. In our data set, the number of homoeologous loci for which three different alleles were called in at least one variety was 6588. However, there was considerable overlap between the varieties for these loci; that is, if one variety had three distinct alleles at any particular homoeologous locus, then it was likely that other varieties did also. It was quite unusual to find a tri-homoeoallelic locus restricted to a single variety. The number of loci for which all eight varieties were tri-homoeoallelic was 1158. Making the assumption that these 1158 loci represent sequences from single genes for which there are no paralogues, one would expect each of the three alleles to account for approximately one-third of the total number of calls for that locus. To test this hypothesis, we summed the values for each of the three alleles called and then performed a chi-squared test to see how closely the observed values reflected expected values. For example, contig100689 had a tri-homoeoallelic (A, C, T) SNP called at position 232. Across the eight varieties, the total number of reads for the different alleles was 129, 134 and 124, respectively, giving a total of 387. The expected value for each of the three alleles would be 129 (387/3). Thus, the observed values were sufficiently close to the expected values that one could assume that this locus is a genuine tri-homoeoallelic SNP. Conversely, contig07314 had a tri-homoeoallelic SNP (A, C, G) at position 410. The read scores for the three alleles were 100, 941 and 44, respectively (total = 1085). The expected score is approximately 362 (1085/3) for each allele. Thus, it is unlikely that the observed scores result from the reads of a single, tri-homoeoallelic locus (Table S4). At a cut-off value of P = 0.05, only 173 of the 1158 tri-homoeoallelics for all varieties could be assumed to correspond to the expected values for an even three-way split with regard to number of calls (Table S4). If our simple assumptions are correct, then for the remaining 985 (1158 − 173) loci, whose ratios are statistically unlikely to reflect an underlying pattern of three distinct homoeologues, we must assume either that procedural biases are giving rise to the results or that paralogues are also being mapped to the reference sequence. We tested for the latter possibility by blasting the sequences for contigs that contained tri-homoeoallelics against the Chinese Spring 5× reference (Figure S3). We took all 1153 contigs that contained tri-homoeoallelics in all eight varieties (two contigs, contig88744 (2) and contig96951 (5), had more than one tri-homoeoallelic locus), ranked them according to their chi-squared value (Table S4) and partitioned them into three bins: the first bin contained the 400 contigs with SNPs that had allele frequencies closest to a three-way split, the second bin contained the 400 contigs with intermediate values for chi-squared and the third bin contained the remaining 353, which had chi-squared values most distinct from a three-way split. The average number of BLASTN hits returned (Figure S3) was used to see whether interference from paralogues was more likely in those contigs whose allele ratios departed most from a three-way split. For the first two bins, there was very little difference between the average number of BLAST hits, while the contigs in the third bin produced a higher number of hits.
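The goodness-of-fit test described above can be illustrated with the two contigs quoted in the text. The sketch below is our own and is not taken from the authors' pipeline; it assumes a plain Pearson chi-squared test against a 1:1:1 split. With three categories there are two degrees of freedom, for which the P value has the closed form exp(−χ²/2), so no statistics library is required.

```python
import math

def three_way_chi2(counts):
    """Chi-squared goodness-of-fit of three allele counts to a 1:1:1 split.

    Returns (chi2, p_value). With 2 degrees of freedom the chi-squared
    survival function is exactly exp(-chi2 / 2).
    """
    if len(counts) != 3:
        raise ValueError("expected exactly three allele counts")
    expected = sum(counts) / 3
    chi2 = sum((obs - expected) ** 2 / expected for obs in counts)
    return chi2, math.exp(-chi2 / 2)

examples = {
    "contig100689": (129, 134, 124),   # read totals quoted in the text
    "contig07314": (100, 941, 44),
}
for name, counts in examples.items():
    chi2, p = three_way_chi2(counts)
    verdict = "consistent with" if p >= 0.05 else "departs from"
    print(f"{name}: chi2 = {chi2:.2f}, P = {p:.3g}; {verdict} a three-way split")
```

Under these assumptions contig100689 is retained at the P = 0.05 cut-off and contig07314 is rejected, in line with the interpretation given in the text.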

Figure 5.

 The percentage of the various calls, bi-allelics, tri-homoeoallelics, monomorphic loci and loci with missing data for each of the eight varieties.

Discussion

Previous work has shown that relatively large numbers of wheat SNPs can be generated and mapped onto the wheat genome using NGS-based technology (Allen et al., 2011). However, that study focused on sequences obtained from the transcriptome and was, therefore, dependent on both the tissue used to make the cDNA and on the relative levels of expression of the three homoeologous gene copies. Transcripts from paralogues might also add to the complexity of the sequences obtained. Such an approach can lead to problems with the mis-identification of SNPs because of gene silencing. To overcome such problems, for this study, we used genomic DNA as the source of our NGS data. However, because of the size and complexity of the wheat genome, it would have been impractical, time-consuming and expensive to attempt whole genome re-sequencing of multiple varieties to depths required for successful SNP identification. Thus, we followed the example of Fu et al. (2010) and developed a NimbleGen array consisting of 132 606 features covering 56.5 Mb, which represent a significant proportion of the wheat exome.

Estimates of the size of the exome in wheat suggest that it is in the order of 1%–2% (Paux et al., 2006) of the whole genome, which is ∼17 Gb. Given this estimate, the size of the wheat exome would be in the order of 170–340 Mb, representing as many as 300 000 protein-coding genes (Devos et al., 2008). A single diploid exome would, therefore, be in the range 57–113 Mb. The NimbleGen array contains sequences covering 56.5 Mb and so represents at least 50% of exome space of a diploid genome. We are assuming, however, that, in many cases, the features on the NimbleGen array are able to capture all three homoeologues of a gene and in so doing that we are sampling the gene space of all three genomes. Saintenac et al. (2011) reported that they had sequenced 3.5 Mb of exonic sequence of tetraploid wheat (∼10 Gb), and their coverage was in the order of 3.5%–7.0% of the exome.

On average, 38.1% of the sequences generated aligned to the reference sequence, compared with the 60% reported by Saintenac et al. (2011), using SureSelect with tetraploid wheat, and the 66.6% reported by Asan et al. (2011), using NimbleGen arrays and maize. Non-alignment of a significant amount of NGS data to the reference is not uncommon and may be due to a variety of reasons such as genome size and complexity, poor quality sequence and inadequate washing of captured DNA. However, in this study, it is possible that the non-alignment of a significant amount of the NGS data is attributable to the presence in the genomic sequences of intron–exon boundaries; while such fragments would probably bind to the array features, they will not align to the reference sequence during sequence alignment because of the presence of non-target sequences. Hence, the use of genomic sequence, if available, as the reference sequence should reduce the amount of non-aligned sequences and would significantly increase the fold coverage of sequences to each of the targets.

One of the most important factors in generating reliable data from NGS of exon-capture arrays appears to be the depth of sequence coverage. Variations in coverage depth can be attributed to the efficiency of the hybridization step, subsequent sequencing and bioinformatics methods, and other factors such as G + C content, target sequence uniqueness, overlap with repeat regions and the presence of long homopolymers. Approximately 10% of the sequences on the array could not be aligned with the sequence data. The reason for this non-alignment is unclear and was unexpected. To examine whether these non-hybridising sequences had been correctly annotated as wheat sequences, we subjected the reference sequences on the NimbleGen array to a BLASTN search (default parameters) against the 5× Chinese Spring draft genome assembly (CerealsDB web site: http://www.cerealsdb.uk.net/) on the assumption that it was unlikely that contaminating non-wheat DNA sequences would be found in both independent sources. This search showed that only 0.22% of the NimbleGen array sequences could not be aligned to the 5× Chinese Spring genomic sequence, and hence, the lack of hybridization to 10% of the array features is probably not due to non-wheat sequences being present on the array. Again, it is possible that our inability to detect NGS-derived sequences for 10% of the array features may be due to a complex intron–exon structure inhibiting both the hybridization of the targets to the features and the ability of the NGS software analysis pathway to align them to the reference. However, while it is unclear why a significant number of the array features do not hybridize to any target sequences, it has been suggested that this may be related to G + C content and/or repetitive elements and that novel elution protocols could aid in the coverage of such regions, although post-elution amplification steps may still present an opportunity for G + C bias (Hoppman-Chaney et al., 2010).

Depth of coverage and reliability of SNP calling

Saintenac et al. (2011) reported a median depth of coverage of 10 reads per base and indicated that their data set contained many false-positive and false-negative results; they suggested that increased genome coverage would improve SNP discovery. Our results are entirely in agreement with this conclusion but indicate that a depth of coverage of 25–30× is required to give accurate SNP calling for nulls, and 40× to remove false-negative SNPs that fall into the homozygote category. It is clear that those varieties with inadequate depth of coverage (Table 1), such as Cadenza, Robigus and Xi19, contained the most missing data and monomorphic loci (Figures 4a and 5a). There was a strong negative correlation between the number of loci with missing data and the average number of reads covering the SNPs within a particular variety (Figure 4b). Similarly, the number of monomorphic loci within any particular variety showed a strong negative correlation with depth of read (Figure 4d). However, a greater depth of coverage was required to reliably resolve a locus as monomorphic: the number of such loci levelled out at c. 35- to 40-fold depth of coverage.

Figure 4a,b indicates that the loci for which there are missing data in the five cultivars with the highest read coverage may represent true presence–absence variations because, above a depth of coverage of 25×, the number of such loci does not decrease. It is possible that these are genuine intra- or inter-cultivar deletions or replacements caused by introgressions that are seen within elite inbred lines of crop species (Haun et al., 2011; Swanson-Wagner et al., 2010; Tokatlidis et al., 2004).

For the reliable identification of homoeologous SNPs, a depth of coverage of 40-fold is recommended to decrease the number of false-negative calls. In this case, the five varieties Alchemy, Avalon, Hereward, Rialto and Savannah appear to have reached a plateau, indicating that an adequate depth of coverage has been reached for these varieties (Figure 4b,d). Conversely, our results indicate that the varieties Cadenza, Robigus and Xi19 have a very high number of missing values, suggesting that depth of coverage was not sufficient. For these varieties, allele dropout might also be a problem, resulting in a greater number of loci for which SNPs would have been missed.

Tri-homoeoallelics

We were surprised by the number of tri-homoeoallelic SNPs within our data set. These highly variable loci could be due to the maintenance of different alleles across the three genomes, and/or to the presence of paralogues or gene copies that interfere with the scoring, and/or to bias in sequence capture. They could also be due to the presence of highly repetitive sequences within the exome that are part of a conserved domain found in large gene families.

Exon capture is thought to be exceptionally quantitative in the manner in which it samples; increased allele representation almost always represents additional genomic copies in human studies (Mercer et al., 2012). However, the wheat genome is much more complex than that of humans and exome capture may be less efficient. Although from our analyses we were unable to account for the fact that the majority of the tri-homoeoallelics studied did not correspond to a three-way split for the three alleles, there was some indication that the tri-homoeoallelics with extreme variation in allele ratios are predominantly genes encoding transposable elements or contain repetitive sequences or motifs; for example, contig76060 has been annotated as a retrotransposon protein and contig46167 contains an LRR repeat domain, both of which are associated with large gene families. However, taken as a whole, our results do appear to suggest that, at least for allohexaploid wheat, the quantitative hybridization of sequences to the array cannot be taken for granted, and hence, copy number determination, at least in polyploids, should be undertaken with caution and only following careful analysis of the sequence data.

It is apparent that Xi19 has fewer tri-homoeoallelics than other varieties, and clearly, this is related to the depth of coverage. If one requires a 10× depth of coverage to be 95% certain of assaying all three alleles at a tri-homoeoallelic locus, more than 70% of the putative SNPs in Xi19 fall below this level of coverage, whereas more than 90% fall above this figure for the varieties Alchemy, Hereward and Savannah (Figure 3c). It is clear that the false-negative rate for Xi19 will be high.
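The 10× figure quoted above is consistent with a simple sampling model. The sketch below is ours, not the authors' calculation: it assumes that every read is drawn independently and with equal probability (1/3) from each of the three homoeologues, and uses inclusion-exclusion to obtain the probability that all three alleles are observed at a given depth.

```python
def p_all_three_alleles(depth):
    """Probability that `depth` reads include each of three alleles at least once.

    Assumes independent reads, each sampling one of the three homoeologues
    with probability 1/3 (a simplifying assumption). By inclusion-exclusion:
    P = 1 - 3 * (2/3)**n + 3 * (1/3)**n for n >= 1.
    """
    if depth < 0:
        raise ValueError("depth must be non-negative")
    if depth < 3:
        return 0.0        # fewer than three reads cannot show three alleles
    return 1 - 3 * (2 / 3) ** depth + 3 * (1 / 3) ** depth

for depth in (3, 5, 10, 20):
    print(f"{depth:>2}x: {p_all_three_alleles(depth):.3f}")
```

Under this model a depth of 10× gives a probability of just under 0.95. Unequal capture of the three homoeologues, which the tri-homoeoallelic results suggest does occur, would raise the depth required.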

Data analysis

Because there are a number of steps involving both experimental and bioinformatic methodologies, it is important to minimize potential sources of error that could lead to false-positive or false-negative results. Sequencing errors can be especially problematic in SNP discovery, and certain types of error are more prevalent, depending upon the platform used. Typically, sequencing errors can be categorized as mismatches, where one base is substituted for another, and insertion/deletions (indels) that are discovered when a read contains a different number of bases from its reference.

The bioinformatic analysis of sequence data is an important step, and the custom pipeline used here included a number of post-processing filter steps (Figure 6). The first filter ensures that sequences are mapped uniquely to the reference (it need not be a paired match), and any read mapping to more than one reference location is discarded. Secondly, where multiple mapped reads have the same start position in the reference sequence, the first read sampled by the mapping software is used, and subsequent mapped reads discarded as potential PCR-based clones. Thirdly, the whole of each mapped read must not differ by more than 10% from the reference. Each base in a read is assigned a quality score indicating the confidence that the base has been correctly called. It is important that a filtering step is included that takes into account the quality scores. Therefore, the fourth filtering step ensures that any putative SNP resides at the centre of a three-base window of minimum PHRED 20 quality score.
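The four filters described above can be sketched as follows. This is a hypothetical Python illustration, not the authors' PERL implementation: the `MappedRead` record, its field names and the assumption of ungapped alignments are ours, introduced only to make the logic concrete.

```python
from dataclasses import dataclass

@dataclass
class MappedRead:          # hypothetical record, not the authors' data model
    contig: str
    start: int             # position of the first read base on the contig
    n_locations: int       # number of reference locations the read maps to
    mismatches: int        # differences between read and reference
    quals: list            # per-base PHRED quality scores

def filter_reads(reads, max_mismatch_frac=0.10):
    """Apply filters 1-3 in the order given in the text."""
    seen_starts = set()
    kept = []
    for read in reads:
        if read.n_locations != 1:
            continue                          # filter 1: unique mapping only
        key = (read.contig, read.start)
        if key in seen_starts:
            continue                          # filter 2: keep first read per start
        seen_starts.add(key)
        if read.mismatches > max_mismatch_frac * len(read.quals):
            continue                          # filter 3: at most 10% divergence
        kept.append(read)
    return kept

def snp_window_ok(read, ref_pos, min_phred=20):
    """Filter 4: the SNP base and both neighbours must reach min_phred.

    Assumes an ungapped alignment, so the read index is ref_pos - start.
    """
    i = ref_pos - read.start
    if i < 1 or i + 1 >= len(read.quals):
        return False                          # no complete three-base window
    return min(read.quals[i - 1:i + 2]) >= min_phred
```

Because the filters run sequentially, a duplicate is discarded even if the first read at that start position later fails the divergence filter; whether the original pipeline behaves the same way is not stated in the text.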

Figure 6.

 Bioinformatic workflow includes a mapping step followed by a number of filtering steps written in PERL.

The final filtering step ignores the bases that occur only once or have a frequency of <0.02 in the data. This step is important for base calling, because if one variety has the heterozygous base call A(8)G(16) while other varieties have the homozygous call G(30), then we might assume the first variety possesses one homoeologue containing the A allele and two homoeologues with the G allele. The other varieties could have either a G at all three homoeologous positions, or we may have failed to sample the A allele by chance. If no real SNP difference exists between varieties, we would expect a 1 : 2 ratio of A : G and would therefore expect 10 of the 30 reads to have had the A allele. By randomizing the entire set of alleles to varieties and repeating the SNP calling, we found that, using a minimum expected count of 10 for the null base, our false-positive rate was below 0.01, and thus, this cut-off was chosen for the analysis.
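The arithmetic of the worked example can be reproduced as follows. The sketch assumes simple binomial sampling of reads and uses hypothetical function names; it illustrates the reasoning only and is not the randomization procedure used to estimate the false-positive rate.

```python
def expected_null_count(minor_count: int, major_count: int, other_depth: int) -> float:
    """Expected reads carrying the minor base in a second variety of depth
    `other_depth`, if it shared the allele ratio seen in the first variety."""
    total = minor_count + major_count
    return other_depth * minor_count / total


def prob_allele_missed(minor_count: int, major_count: int, other_depth: int) -> float:
    """Chance of sampling no minor-base reads at all in the second variety,
    assuming independent reads (binomial sampling; an assumption)."""
    freq = minor_count / (minor_count + major_count)
    return (1.0 - freq) ** other_depth


if __name__ == "__main__":
    # First variety A(8)G(16); other varieties G(30).
    print(expected_null_count(8, 16, 30))
    print(prob_allele_missed(8, 16, 30))
```

For the A(8)G(16) versus G(30) case, the expected count of the null base is 10, which meets the minimum expected count adopted for the analysis.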

The difference in average contig length between pre-filtered (426 bp) and post-filtered data (571 bp) may be explained by the requirement for putative SNPs to be positioned at least 25 bp from either end of the contig; this requirement ensures that there is sufficient sequence on either side of the putative SNP for primers to be designed, thereby permitting validation. Thus, small contigs would have been preferentially removed from the data set.

Given the inherent uncertainties involved with NGS sequencing of exon-capture arrays, it is crucial to address the problems that we have highlighted. Ensuring that array capture probes are designed so that they capture homoeologous copies of expressed genes, while being sufficiently unique in their identity so as not to capture paralogues, would facilitate efficient clustering of sequences before they are arrayed. Of the 132 606 features on the array, a proportion (10%) had no matches. This could not be attributed to contamination with foreign DNA as only 292 features (0.22%) on the NimbleGen array did not align to the 5× Chinese Spring draft genome. It is more likely that a proportion of the sequences on the array were not clustered properly, and multiple copies exist of the same sequence on the array. This would introduce bias during the mapping step because these wheat sequences would be discarded from the analysis during filter step 1 of the pipeline (Figure 6) and may account for a proportion of the 10% of features that had no match.

The issue of sequence quality and the filtering of poor quality reads using bioinformatic methods is another necessary consideration. The array contained sequences covering 56.5 Mb, which, if truly unique, would represent >50% of the exome on a diploid genome. However, only 34.1 Mb of sequence was actually analysed. So, it is apparent that the data from our pipeline have been reduced in size by almost 40% because of the above filtering steps. We consider this to be a price worth paying to ensure that the sequence data are of the highest quality. Nevertheless, even in the post-filtered data set, there are false-positive and false-negative results. At this point, more than anything else, these are related to the depth of coverage, which is probably the most fundamental factor that affects the success of this method.

Ultimately, if the data display sufficient depth of coverage and stringent quality filtering is applied, then the resulting data set should provide a very efficient resource for SNP discovery.

We have used a novel wheat NimbleGen array to generate exome-based sequences from eight UK wheat varieties; the sequences generated from the captured genomic DNA have been examined using a robust pipeline to predict the occurrence of a large number of homoeologous and varietal SNPs. While this robust procedure removes a significant number of the sequences generated, validation of a small number of the varietal SNPs via KASPar-based genotyping resulted in a validation rate of over 90%, a figure that is significantly better than that of our previous study (Allen et al., 2011). The improvement in validation rate is attributable to the use of genomic DNA in this study, which is aligned to the NimbleGen capture array. The Allen et al. (2011) study made use of mRNA, so any primers designed from the cDNA sequence for KASPar validation would have lacked information on intronic sequence, which may have been present in the genomic DNA. In addition, the use of mRNA could have missed potential SNPs located on down-regulated genes, as these genes would have been present at a considerably lower coverage and would tend to be filtered out by our bioinformatics pipeline.

Our work has shown that to achieve such high rates of validation it is necessary to generate high-quality sequence data of at least 30-fold coverage and preferably much higher. Without such high rates of coverage, it is likely that many of the SNPs currently being predicted will be false and could lead to a considerable waste of effort in failed validation experiments. Both the homoeologous and the varietal SNPs identified in this work will be extremely valuable to the wheat community. The varietal SNPs will be of use in mapping and the manipulation of various traits such as disease resistance and yield, while the homoeologous SNPs will be useful to track the presence and expression of the homoeologous genes during various stages of development.

Experimental procedures

Generation of the NimbleGen array

To produce the wheat NimbleGen array, we generated 454 Titanium sequence data from normalized and non-normalized cDNA libraries prepared from RNA from Chinese Spring line 42. Homology searches were performed using BLAST (Altschul et al., 1990), and the sequences generated were queried against a ribosomal RNA sequence database (BLASTN 1e-20) and similar sequences removed, leaving 1 800 178 reads (308 Mb). These sequences were combined with the public EST collections available at the time (October 2009), which included the TAGI gene index composed of 216 452 sequences (151 Mb) and the NCBI UniGene set containing 40 870 sequences (36 Mb). Together, these were assembled with Newbler (version 2.6, Roche Diagnostics Ltd, Burgess Hill, UK). From the resulting assembly, contigs and singletons <100 bp were removed, leaving 171 126 contigs/singletons. The set of assembled sequences was queried against both the wheat chloroplast and mitochondrial genomes (BLASTN 1e-20), which removed 3404 sequences, and against the Triticeae Repeat Sequence (TREP; http://wheat.pw.usda.gov/ITMI/Repeats/) databases (BLASTN 1e-5), which removed a further 10 061 sequences. Finally, the remaining sequences were queried against a 2.5× Chinese Spring wheat genome database (BLASTN 1e-5; http://www.cerealsdb.uk.net) and sequences with >100 hits removed, leaving 132 605 sequences covering 56.5 Mb. Low-complexity regions were also masked with Dustmasker (Morgulis et al., 2006). The resulting sequence space was used by NimbleGen to design a set of probes tiling across the sequence space as described below. The resulting NimbleGen array contained 132 605 features with an average length of 426 bp and a median length of 366 bp (Table S1 and NimbleGen array reference 100819_Wheat_Hall_cap_HX1).

Wheat capture array design

Starting with 132 605 sequences screened to remove repetitive sequences, variable length probes (50- to 100-mers) were generated at a 5-bp step across the entire sequence space. Individual probes were repeat-masked by removing probes that had an average 15-mer frequency >50, using a 15-mer frequency table generated from a whole genome shotgun sequence set of the Chinese Spring wheat genome (http://www.cerealsdb.uk.net). The repeat-masked probe set was compared back to the target sequence using SSAHA (Ning et al., 2001), using a minimum match size of 30 and allowing up to 5 indels/gaps. Probe sequences with more than 10 matches in the target set were eliminated from consideration. The final probe set was generated by selecting probes at an average spacing of 30 bp (measured from 5′ oligo starting position to the next 5′ oligo starting position).
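The tiling and 15-mer repeat-masking steps can be sketched in Python as follows. This is an illustrative approximation and not NimbleGen's design software: a single fixed probe length is used in place of variable 50- to 100-mers, and the function names and the in-memory `kmer_counts` dictionary are assumptions.

```python
from typing import Dict, Iterable, Iterator, List


def tile_probes(target: str, probe_len: int = 75, step: int = 5) -> Iterator[str]:
    """Yield candidate probes at a regular step along a target sequence.
    A fixed probe length is used for brevity (assumption)."""
    for start in range(0, len(target) - probe_len + 1, step):
        yield target[start:start + probe_len]


def mean_kmer_frequency(probe: str, kmer_counts: Dict[str, int], k: int = 15) -> float:
    """Average genome-wide frequency of all k-mers contained in the probe."""
    kmers = [probe[i:i + k] for i in range(len(probe) - k + 1)]
    if not kmers:
        return 0.0
    # k-mers absent from the frequency table are counted as zero (assumption).
    return sum(kmer_counts.get(kmer, 0) for kmer in kmers) / len(kmers)


def repeat_mask(probes: Iterable[str], kmer_counts: Dict[str, int],
                max_mean: float = 50.0, k: int = 15) -> List[str]:
    """Discard probes whose average k-mer frequency exceeds `max_mean`."""
    return [p for p in probes if mean_kmer_frequency(p, kmer_counts, k) <= max_mean]
```

The subsequent SSAHA comparison and the selection of probes at an average 30-bp spacing are not covered by this sketch.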

Illumina paired-end library preparation

Genomic DNA from eight wheat varieties was prepared using a phenol chloroform extraction protocol (Allen et al., 2011). These varieties were chosen by breeders as representative of valuable British varieties and are a subset of the 23 varieties used in the SNP discovery study carried out by Allen et al. (2011). Genomic DNA was fragmented using a UCD-200 Bioruptor (Diagenode, Holliston, MA) to give an average fragment size of 250 bp. Fragmented samples were purified on a Qiagen PCR purification column, and an aliquot was run on the Agilent bioanalyser (Agilent Technologies Inc., Santa Clara, CA) (DNA 1000 assay) to check the size distribution.

Sequencing libraries were prepared using a NEBNext DNA sample preparation master mix 1 kit (New England Biolabs, Ipswich, MA) according to the manufacturer’s protocol with a modification to Illumina’s paired-end adapter sequence to add a unique 5-base tag on the 5′ ends.

Microarray hybridization, washing and elution of captured DNA

The pre-capture libraries were pooled equally for capture. Hybridization to a NimbleGen capture array (100819_Wheat_Hall_cap_HX1) was carried out on a NimbleGen Hybridization System 4 (Roche) for 72 h at 42 °C followed by microarray washing and elution of captured samples following the hybridization and washing procedures described in Haun et al. (2011).

Sequencing of captured genomic DNA

Post-capture-enriched sequencing libraries were subjected to 110 bp of paired-end sequencing on an Illumina Genome Analyser (GAIIx, San Diego, CA) using Illumina TruSeq v5 Cluster Generation (Illumina Inc., San Diego, CA) and sequencing reagents following the manufacturer’s preparation guides for paired-end runs (Part 15019435 RevB, October 2010 and Part 15013595 Rev C, February 2011, respectively).

SNP discovery

After pre-processing of reads, where adapter sequences were removed, the data were submitted to a custom pipeline (Figure 6). NGS sequences generated from the eight varieties were mapped to the NimbleGen array reference using BWA version 0.5.9-r16 (Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Cambridge, UK) (Li and Durbin, 2009) with a seed length of 32 bases, and the resulting SAM files were used for downstream analysis. Uniquely mapped reads were analysed using a series of custom PERL scripts. First, all data were filtered to require at least two alternative bases predicted at a reference position, each represented by two or more independent reads or 5% of all reads examined (whichever was the greater). Only bases that were located at the centre of a three-base window of PHRED quality ≥20 were included in the analysis. Sequences were discarded if they displayed more than 10% sequence variation from the reference over their length or if they mapped equally well to more than one locus, as the mapping in these situations could be regarded as uncertain. In cases where multiple reads started at the same position in the reference, all but one were ignored to guard against clonal reads being sampled more than once. Varietal SNPs were identified as differences between two or more varieties (as opposed to those between each variety and the reference sequence) where both sequences had sufficient coverage depth that there was a <5% chance of each varietal variant being present (but not detected) in the other variety (as would be the case for a homoeologous variant). SNPs meeting the above-mentioned quality criteria but where each variety contained two or more allelic variants were deemed to be putative homoeologous variants.
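The minimum read support criterion described above can be expressed compactly. The following Python sketch is an illustration only and is not taken from the PERL scripts used in the study; the function names are hypothetical, and "all reads examined" is taken to mean the total depth at the reference position, which is an assumption.

```python
from typing import Dict, List


def supported_bases(base_counts: Dict[str, int],
                    min_reads: int = 2, min_percent: int = 5) -> List[str]:
    """Return the bases supported by at least `min_reads` reads and by at
    least `min_percent` of the reads at the position. Requiring both limits
    is equivalent to requiring whichever of the two is the greater."""
    depth = sum(base_counts.values())
    return sorted(
        base for base, n in base_counts.items()
        # Integer arithmetic avoids floating-point rounding at the 5% boundary.
        if n >= min_reads and n * 100 >= min_percent * depth
    )


def is_candidate_site(base_counts: Dict[str, int]) -> bool:
    """A position is retained only if two or more bases are supported."""
    return len(supported_bases(base_counts)) >= 2
```

For example, a position with counts A(8)G(16) would be retained, whereas A(1)G(30) would not, because the A base is represented by a single read.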

KASPar SNP validation

The KASPar SNP assays were carried out as described in Allen et al. (2011). For each putative varietal SNP, two allele-specific forward primers and one common reverse primer were designed (KBioscience, Hoddesdon, UK). Genotyping reactions were performed in a Hydrocycler (KBioscience) in a final volume of 1 μL containing 1× KASP 1536 Reaction Mix (KBioscience), 0.07 μL assay mix (containing 12 μM each allele-specific forward primer and 30 μM reverse primer) and 10–20 ng genomic DNA. The following cycling conditions were used: 15 min at 94 °C; 10 touchdown cycles of 20 s at 94 °C, 60 s at 65–57 °C (dropping 0.8 °C per cycle); and 26–35 cycles of 20 s at 94 °C, 60 s at 57 °C. Fluorescence detection of the reactions was performed using an Omega Pherastar scanner (BMG LABTECH GmbH, Offenburg, Germany), and the data were analysed using the KlusterCaller 1.1 software (KBioscience).

All NGS data generated for this study will be available at http://www.cerealsdb.uk.net/CerealsDB/NGSdata/Homoeologous_and_varietal_SNPs.csv; in addition, the Illumina fastq files and associated metadata have been uploaded to the NCBI Sequence Read Archive (SRA) under the study accession SRP011067. Mapping statistics were calculated by converting the SAM output file into BAM format and running the SAMtools flagstat command.

Acknowledgements

We are grateful to the Biotechnology and Biological Sciences Research Council, UK, for providing the funding for this work (awards BB/G012865/1, BB/F007523/1). We are also grateful to the Lady Emily Smyth Agricultural Research Station for providing the funds for next-generation sequencing. The 5× Chinese Spring wheat genome sequence can be downloaded from http://www.cerealsdb.uk.net/CerealsDB/Documents/DOC_CerealsDB.php as can all the data associated with the homoeologous and varietal SNPs described in this report.
