
Single nucleotide polymorphism (SNP) discovery in the polyploid Brassica napus using Solexa transcriptome sequencing

Authors


* Correspondence (fax +44 1603450021; e-mail: ian.bancroft@bbsrc.ac.uk)

Summary

Oilseed rape (Brassica napus) was selected as an example of a polyploid crop, and the Solexa sequencing system was used to generate approximately 20 million expressed sequence tags (ESTs) from each of two cultivars: Tapidor and Ningyou 7. A methodology and computational tools were developed to exploit, as a reference sequence, a publicly available set of approximately 94 000 Brassica species unigenes. Sequences transcribed in the leaves of juvenile plants were aligned to approximately 26 Mb of the reference sequences. The aligned sequences enabled the detection of 23 330–41 593 putative single nucleotide polymorphisms (SNPs) between the cultivars, depending on the read depth stringency applied. The majority of the detected polymorphisms (87.5–91.2%) were of a type indicative of transcription from homoeologous genes from the two parental genomes within oilseed rape, and are termed here ‘hemi-SNPs’. The overall estimated polymorphism rate (~0.047%–0.084%) is consistent with that previously observed between the cultivars analysed. To demonstrate the heritability of SNPs and to assess their suitability for applications such as linkage map construction and association genetics, approximately nine million ESTs were generated, using the Solexa system, from each of four lines of a doubled haploid mapping population derived from a cross between Tapidor and Ningyou 7. Computational tools were developed to score the alleles present in these lines for each of the potential SNPs identified between their parents. For a specimen region of the genome analysed in detail, segregation of alleles largely, although not entirely, followed the pattern expected for genomic markers.

Introduction

Polyploidy is widespread in angiosperms and is thought to have been a predominant factor in their evolution and success (Leitch and Bennett, 1997; Wendel, 2000). Several important crops are relatively recently formed polyploids, including bread wheat, cotton, soybean and oilseed rape. The cultivated Brassica species, which include oilseed rape, constitute the group of crops most closely related to Arabidopsis thaliana, all being members of the Brassicaceae (Cruciferae) family (Warwick and Black, 1991). Three genetically diploid Brassica species, B. rapa (2n = 20), B. nigra (2n = 16) and B. oleracea (2n = 18), contain the typified A, B and C genomes, respectively, although these have palaeopolyploid structures. Each pairwise combination has since hybridized spontaneously to form the three allotetraploid species, B. napus (2n = 38, comprising the A and C genomes), B. juncea (2n = 36, comprising the A and B genomes) and B. carinata (2n = 34, comprising the B and C genomes), first deduced by cytogenetic analysis (UN, 1935). The Brassica and Arabidopsis lineages diverged only c. 20 Mya (Yang et al., 1999). This close phylogenetic relationship with A. thaliana, which was the first plant for which the complete genome sequence was determined (Arabidopsis Genome Initiative, 2000), has made Brassica species an attractive group in which to study genome evolution following polyploidy.

Comparative studies, conducted at the level of genetic linkage maps, have revealed extensive duplication within Brassica genomes (Lagercrantz and Lydiate, 1996), whereas a cytogenetic approach has revealed that a distinctive feature of the Brassiceae tribe is that they contain extensively triplicate genomes (Lysak et al., 2005). A study across the Brassicaceae has identified 24 conserved chromosomal blocks, relating them to a proposed ancestral karyotype (n = 8) (Schranz et al., 2006). Where analyses have been conducted on targeted regions of the genomes of B. oleracea, B. rapa and B. napus, using physical mapping or comparative sequencing approaches, the results have been consistent with a fundamentally triplicate structure for the diploid Brassica genomes (O’Neill and Bancroft, 2000; Rana et al., 2004; Park et al., 2005; Town et al., 2006; Yang et al., 2006). It is estimated that the lineages of the species B. rapa and B. oleracea diverged c. 3.7 Mya (Inaba and Nishio, 2002). Oilseed rape (B. napus) arose from the hybridization of an A genome progenitor and a C genome progenitor, probably during human cultivation, i.e. less than 10 000 years ago. Genetic mapping has confirmed that the progenitor A and C genomes are essentially intact in B. napus and have not been rearranged (Parkin et al., 1995). Thus, Brassica species provide an opportunity to study the evolution of genome structure, following polyploidy, over a wide range of timescales.

The genomes of Brassica species are relatively large for analysis by Sanger/capillary electrophoresis (CE) sequencing. For example, that of B. rapa is the smallest, at c. 500 Mb (Arumuganathan and Earle, 1991). Although a genome sequencing project is under way for this species, with both sequences and sequence annotations in the public domain (http://brassica.bbsrc.ac.uk/), it is unlikely that the ~1.2-Gb genome of B. napus will be sequenced in the foreseeable future.

Three commercial systems for massively parallel sequencing (typically referred to as ‘next generation sequencing’, NGS) are available: GSFLX (from Roche), Solexa (from Illumina) and SOLiD (from Applied Biosystems). High-quality read lengths from NGS are typically much shorter than the > 800 bases produced by CE sequencing, but all NGS platforms produce data at very much lower costs per nucleotide than CE sequencing. Initial studies utilizing the precursor to the Roche GSFLX instrument (the 454 GS20) have shown that combining data from this with CE platforms overcomes the inherent shortcomings of the respective technologies, permitting a more complete sequence representation of genomes (Goldberg et al., 2006). The Solexa and SOLiD systems overcome the principal shortcoming of the GSFLX, the sequencing of homopolymers, but have their own shortcomings (e.g. they are unable to resolve most repetitive sequences as a consequence of their very short read lengths) and biases (e.g. related to the GC content for the SOLiD platform). All three systems are well suited to re-sequencing applications, where a reference genome sequence is available, for example, as applied in C. elegans using Solexa (LaDeana et al., 2008). However, there are no complete reference sequences available for the large genomes of most crop species, and the computational difficulties associated with the de novo assembly of very large genomes mean that the presently available NGS systems are unlikely to provide them.

Single nucleotide polymorphisms (SNPs) are single base differences between DNA sequences of individuals or lines. They can be assayed and exploited as high-throughput molecular markers. If sufficiently dense linkage maps can be constructed, SNP markers have the potential for exploitation in association genetics approaches, whereby population-level surveys can take advantage of historical recombination to identify trait–marker relationships on the basis of linkage disequilibrium. This has become a favoured genetic approach in many organisms (Cardon and Bell, 2001; Flint-Garcia et al., 2003). There are collections of B. napus lines available, including ‘diversity fixed foundation sets’, which aim to capture, in a small number of homozygous lines, a large proportion of the genetic diversity available in the species (http://www.oregin.info/). These represent resources that can be exploited using association genetics. However, there are presently too few polymorphic markers available to undertake a genome-wide association approach in B. napus, largely because the polyploid nature of its genome interferes with both SNP discovery and high-throughput SNP marker assay technologies.

Studies have been initiated that aim to use NGS to identify SNPs in plants, with success being reported for maize, using 454 sequencing of the transcriptome of shoot meristems (Barbazuk et al., 2007). This study used genomic sequence assemblies composed of gene-enriched genomic survey sequences from line B73 (Fu et al., 2005; Whitelaw et al., 2003), and enabled the identification of more than 4900 SNPs within more than 2400 maize genes. This approach is not suitable for use in B. napus for two main reasons. First, there is no equivalent genomic sequence resource to use as a reference sequence for NGS alignment. Secondly, it is anticipated that the amphidiploid nature of the B. napus genome will usually result in homoeologous pairs of genes, originating from the progenitor A and C genomes, being co-expressed. These transcripts will differ in sequence, on average, by only approximately 3.5% (I. Bancroft, unpubl. data). This will generally result in polymorphisms detected in short NGS reads appearing as a mixture of bases at a given position, one corresponding to the A genome homoeologue and the other to the C genome homoeologue, thus making automated SNP detection challenging and confounding allelic with homoeologue variation.

To address the problem of SNP detection in polyploid crops by an NGS-based approach, we selected B. napus for our experiments. To maximize the redundancy of sequence coverage that could be achieved, we selected the Solexa system for the generation of sequence tags. We focused on the non-normalized transcriptome in leaves which, in a single dataset, provides a wide dynamic range of transcript abundance levels with which to test the methodology, from very highly expressed genes, such as those encoding proteins involved in photosynthesis, to those with barely detectable expression. For initial sequence alignment and candidate SNP identification from short read data, we utilized the publicly available maq software (Li et al., 2008). Based on these results, we developed a new downstream approach for SNP detection between genotypes in the absence of reference genomic sequence data, identified putative SNPs between two oilseed rape cultivars, and validated a proportion of these in lines selected from a conventionally characterized mapping population.

Results and discussion

Generation of expressed sequence tag (EST) data

Oilseed rape cultivars grown in the West exhibit a very low level of allelic diversity. Therefore, to maximize the number of SNPs available for detection, we selected for our study the cultivars Tapidor and Ningyou 7, which have been found to be the most divergent lines analysed using restriction fragment length polymorphism (RFLP) markers (Meng et al., 1996). Tapidor is a European winter-type cultivar (i.e. it has a strong vernalization requirement), whereas Ningyou 7 is a Chinese semi-winter-type cultivar (i.e. it has little vernalization requirement). They are available as doubled haploid (DH) lines, meaning that they are entirely homozygous. A DH mapping population has been developed from a cross between them (Qiu et al., 2006), which has been widely adopted as a reference mapping population by the Brassica research community. We therefore selected the DH lines of Tapidor and Ningyou 7 for the development of ESTs. Plants were grown to the six-leaf stage and RNA was extracted from the leaves. Polyadenylated RNA was purified and subjected to Solexa sequencing (four lanes on the instrument for each cultivar). In order to retain a wide range of abundance levels of mRNA species sampled from within the transcriptome, the RNA was not normalized.

After filtration for primer, short and low-quality sequences, 21 952 885 ESTs of 35 bp were obtained for Ningyou 7 and 19 002 103 ESTs of 35 bp were obtained for Tapidor. Therefore, the total number of bases available for analysis was ~0.77 Gb for Ningyou 7 and ~0.67 Gb for Tapidor. These results confirm the anticipated efficiency of the Solexa system for the generation of EST data from Brassica, with those obtained from each cultivar exceeding the total number of bases of CE-derived EST sequences presently in the public databases for all Brassica species (~0.49 Gb).

Alignment of ESTs to reference sequences

There is presently no reference genome sequence available for B. napus and no computational approaches available for SNP detection directly between two Solexa datasets. We therefore had to develop a novel approach. We chose to conduct our alignments and candidate SNP identification procedure using the maq software package (Li et al., 2008). This was originally developed for re-sequencing applications, assuming the provision of a high-quality reference genome sequence. In place of this conventional type of reference sequence, we used a set of 94 558 Brassica unigenes that had been assembled from approximately 810 000 public ESTs from several different Brassica species, originally for use in the design of a Brassica microarray (M. Trick et al., in preparation; http://brassica.bbsrc.ac.uk/array_info.html). These unigenes represent, in total, 64 044 420 bp of sequence, including 43 731 unresolved bases (‘N’). For the Ningyou 7 ESTs, 16 267 058 reads (74.1%) were aligned to 27 256 779 resolved bases of the reference sequences (over 71 239 unigenes), providing an average depth of coverage of 20.9-fold (16 267 058 reads × 35 bp read length/27 256 779 bp of aligned reference sequence). For the Tapidor ESTs, 13 953 924 reads (73.4%) were aligned to 25 871 023 resolved bases of the reference sequences (over 70 007 unigenes), providing an average depth of coverage of 18.9-fold (13 953 924 reads × 35 bp read length/25 871 023 bp of aligned reference sequence). These results show that the majority of Solexa-derived ESTs can be aligned with sequences already represented in the public databases, but a substantial minority (25.9% in Ningyou 7 and 26.6% in Tapidor) cannot. 
The lack of alignment for these sequences may be because the corresponding transcript is not represented in the public EST dataset (and hence the Brassica unigenes), either as a result of the greater depth of leaf transcriptome profiling in our experiment, or an inability to clone the sequences in Escherichia coli prior to CE sequencing. It could also be caused by the sequence variation between the corresponding Solexa ESTs and unigene sequences being too large to enable alignment, for example as a result of InDel variation involving several bases. The de novo assembly of these non-aligned Solexa reads should result in an enlargement of the unigene sequences available for Brassica species, but is beyond the scope of the present study.
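The alignment percentages and average depths of coverage quoted above follow directly from the read counts. A minimal sketch of that arithmetic is given here, using only figures stated in the text; the function and variable names are illustrative and are not part of maq or of the scripts used in this study.

```python
# Figures quoted in the text for the two parental cultivars.
READ_LENGTH_BP = 35

datasets = {
    "Ningyou 7": {
        "total_reads": 21_952_885,
        "aligned_reads": 16_267_058,
        "aligned_ref_bp": 27_256_779,
    },
    "Tapidor": {
        "total_reads": 19_002_103,
        "aligned_reads": 13_953_924,
        "aligned_ref_bp": 25_871_023,
    },
}


def alignment_summary(total_reads, aligned_reads, aligned_ref_bp,
                      read_length=READ_LENGTH_BP):
    """Return (percent of reads aligned, average fold coverage)."""
    percent_aligned = 100.0 * aligned_reads / total_reads
    # Average depth = aligned bases / reference bases covered by alignments.
    fold_coverage = aligned_reads * read_length / aligned_ref_bp
    return percent_aligned, fold_coverage


for name, counts in datasets.items():
    pct, depth = alignment_summary(**counts)
    print(f"{name}: {pct:.1f}% aligned, {depth:.1f}-fold average coverage")
```

The same function can be applied to any other dataset for which the read count, read length and length of aligned reference are known.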

A summary of the nucleotide compositions of the reference sequences and of both total and aligned Solexa reads is presented in Table 1. These results show that the GC contents of total and aligned Solexa reads (47.6% and 48.2%, respectively for Ningyou 7; 48.4% and 48.8%, respectively for Tapidor) are higher than that for the unigene dataset overall (44.1%), suggesting that there may be some differential representation between Solexa-derived EST sequences and EST sequences derived by CE following cloning.

Table 1.  Nucleotide composition of unigene reference sequences and Solexa expressed sequence tag (EST) sequences (total and aligned) from oilseed rape cultivars Ningyou 7 and Tapidor
Nucleotide  Reference (%)  Ningyou 7                           Tapidor
                           Total reads (%)  Aligned reads (%)  Total reads (%)  Aligned reads (%)
A           28.31          27.78            26.84              26.56            26.62
C           21.39          22.37            22.29              23.80            23.03
G           22.72          25.21            25.92              24.62            25.78
T           27.59          24.58            24.95              24.94            24.57
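The GC contents discussed in the text can be recovered from Table 1 by summing the G and C percentages in each column. The sketch below does this; the dictionary keys are illustrative labels chosen for this example.

```python
# Nucleotide composition (%) transcribed from Table 1.
composition = {
    "Reference":         {"A": 28.31, "C": 21.39, "G": 22.72, "T": 27.59},
    "Ningyou 7 total":   {"A": 27.78, "C": 22.37, "G": 25.21, "T": 24.58},
    "Ningyou 7 aligned": {"A": 26.84, "C": 22.29, "G": 25.92, "T": 24.95},
    "Tapidor total":     {"A": 26.56, "C": 23.80, "G": 24.62, "T": 24.94},
    "Tapidor aligned":   {"A": 26.62, "C": 23.03, "G": 25.78, "T": 24.57},
}


def gc_percent(column):
    """GC content (%) as the sum of the G and C percentages."""
    return column["G"] + column["C"]


for label, column in composition.items():
    print(f"{label}: GC = {gc_percent(column):.1f}%")
```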

Polymorphism detection

Following the alignment of Solexa sequence reads to the Brassica unigene reference sequences, allelic sequence polymorphisms between Tapidor and Ningyou 7 needed to be identified. Although simplified by the use of inbred or, as used here, DH lines (which are homozygous throughout the genome), this process is intrinsically more complex in an allotetraploid, such as B. napus, than in a diploid species because of the presence of homoeologous loci. This is illustrated schematically in Figure 1. Where the sequences of homoeologues differ, polymorphisms can be observed, termed here ‘inter-homoeologue polymorphisms’. These do not represent allelic variation per se and, on sequence alignment, generate the same ambiguity code in each genotype (e.g. Y in Figure 1, generated by the presence of both C and T bases). In contrast, we term the allelic polymorphisms observed in the presence of homoeologous sequences ‘hemi-SNPs’. On sequence alignment, these generate an ambiguity code for one cultivar only (e.g. in Figure 1, S indicating the presence of both C and G bases for Tapidor, but just C for Ningyou 7) or, occasionally, differing ambiguity codes for cultivars with one of their constituent bases in common. In genetic mapping experiments, the base that is in common between an allele in one cultivar and its homoeologues in both cultivars is uninformative; therefore, hemi-SNPs will be scored as dominant markers and, as is typical for such markers, scoring errors will primarily be false negatives for the informative allele. In our case, such errors arise when the base required to reveal the informative allele has, by chance, not been sampled by the available Solexa reads. Even in polyploids, locus-specific analysis is sometimes possible, for example when using locus-specific amplicons for the analysis of genomic DNA or when only one homoeologue contributes to the transcriptome.
In these cases, allelic variation can be identified by what we term ‘simple SNPs’ (Figure 1).
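The three polymorphism types defined above can be expressed as a simple rule on the consensus calls of the two cultivars. The following is a minimal sketch under stated assumptions, not the authors' script: the function name, the restriction to two-base ambiguity codes and the ‘unclassified’ fallback for call pairs not covered by the definitions are all choices made for this example.

```python
# IUB/IUPAC codes for single bases and two-base ambiguities.
IUPAC = {
    "A": {"A"}, "C": {"C"}, "G": {"G"}, "T": {"T"},
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}


def classify_site(call_1, call_2):
    """Classify a site from the consensus calls of two homozygous lines."""
    bases_1, bases_2 = IUPAC[call_1], IUPAC[call_2]
    if bases_1 == bases_2:
        # Same ambiguity code in both genotypes: not allelic variation.
        if len(bases_1) > 1:
            return "inter-homoeologue polymorphism"
        return "no polymorphism"
    if len(bases_1) == 1 and len(bases_2) == 1:
        return "simple SNP"
    if bases_1 & bases_2:
        # Ambiguity in one line only, or differing ambiguity codes
        # that share one constituent base.
        return "hemi-SNP"
    return "unclassified"  # assumption: case not covered by the definitions


print(classify_site("Y", "Y"))  # same ambiguity code in both cultivars
print(classify_site("S", "C"))  # the hemi-SNP example given for Figure 1
print(classify_site("C", "G"))  # two different unambiguous bases
```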

Figure 1.

Schematic representation of sequence polymorphism types in doubled haploid (DH) lines. The positions of sequence polymorphisms are highlighted in bold. The two types of allelic single nucleotide polymorphism (SNP) are indicated by the full-line boxes (inter-homoeologue polymorphisms are not allelic SNPs). International Union of Biochemistry (IUB) ambiguity codes: Y = C or T, S = C or G.

In order to identify allelic polymorphisms (inter-cultivar SNPs), we used a two-step process. In the first step, the SNP filter Perl script supplied with the maq distribution was used to identify, in the Brassica unigene sequences, the positions of robust candidate sequence polymorphisms relative to the aligned Solexa reads from each cultivar. Amongst the various parameters employed by the script is an adjustable requirement for minimum read depth to score a potential SNP. We used the default depth of three to assemble the individual SNP lists. In the second step, a Perl script that we had developed was used to compare these two lists of refined SNPs in order to identify: (i) calls of the same ‘SNP’ relative to the Brassica unigene sequences in both Tapidor and Ningyou 7; (ii) calls of SNPs between the Brassica unigene sequences and only one of Tapidor or Ningyou 7; and (iii) differing SNP calls for Tapidor and Ningyou 7 relative to the Brassica unigene sequences. The last two classes were combined and further processed in order to refine the calling of putative polymorphisms between the two cultivars. This involved an analysis of the base calls in the Solexa reads for each of Tapidor and Ningyou 7 at the position of a given putative SNP, with the aim of filtering out those instances in which there was insufficient read depth in the opposing cultivar for maq to have reliably called either the reference base or a polymorphism. We imposed an initial requirement for a minimum of four reads over a potentially polymorphic base in both the Tapidor and Ningyou 7 datasets for a putative SNP between the cultivars to be retained. In addition, the algorithm eliminated SNPs identified by maq as base ambiguities in one genotype if a mixture of high-quality base calls at that position in the opposing parent reconstituted that same ambiguity, irrespective of whether maq had called an SNP there.
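The second step of this process can be sketched as a comparison of two per-cultivar call lists followed by a read-depth filter. The code below is an illustration only: the original was a Perl script whose internals are not described here, all site names, calls and depths are hypothetical, and the additional filter on reconstituted ambiguity codes is omitted.

```python
def compare_snp_lists(calls_a, calls_b):
    """Partition sites into the three classes described in the text.

    calls_a and calls_b map (unigene, position) to the consensus call
    reported as differing from the reference in each cultivar.
    """
    same, one_only, differing = set(), set(), set()
    for site in calls_a.keys() | calls_b.keys():
        call_a, call_b = calls_a.get(site), calls_b.get(site)
        if call_a is None or call_b is None:
            one_only.add(site)    # class (ii): called in one cultivar only
        elif call_a == call_b:
            same.add(site)        # class (i): same call in both cultivars
        else:
            differing.add(site)   # class (iii): differing calls
    return same, one_only, differing


def retain_candidates(candidates, depth_a, depth_b, min_depth=4):
    """Keep candidate sites covered by at least min_depth reads in both cultivars."""
    return {
        site for site in candidates
        if depth_a.get(site, 0) >= min_depth and depth_b.get(site, 0) >= min_depth
    }


# Hypothetical example data (not taken from the study).
tapidor_calls = {("unigene_1", 100): "Y", ("unigene_1", 250): "S", ("unigene_2", 40): "G"}
ningyou7_calls = {("unigene_1", 100): "Y", ("unigene_2", 40): "A", ("unigene_3", 12): "R"}
tapidor_depth = {("unigene_1", 100): 9, ("unigene_1", 250): 6,
                 ("unigene_2", 40): 5, ("unigene_3", 12): 2}
ningyou7_depth = {("unigene_1", 100): 8, ("unigene_1", 250): 7,
                  ("unigene_2", 40): 4, ("unigene_3", 12): 10}

same, one_only, differing = compare_snp_lists(tapidor_calls, ningyou7_calls)
putative = retain_candidates(one_only | differing, tapidor_depth, ningyou7_depth)
print(sorted(putative))
```

In this toy example, the site on unigene_3 is discarded because only two Tapidor reads cover it, which mirrors the rationale for requiring adequate depth in the opposing cultivar.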

At a minimum read depth of four, our analysis identified 55 938 instances of the same ‘SNP’ calls relative to the reference sequences being identified at the same position for both cultivars. Very few of these (139) were assigning base calls at positions that were unresolved (i.e. ‘N’) in the unigene sequences. We interpret the vast majority that remains as being largely indicative of sites at which the sequences of transcripts from the various species used for the public CE-derived EST dataset differ from the sequence in oilseed rape. In addition, some are also likely to represent sequencing errors in the CE-derived data, errors in the assembly of the Brassica unigenes or allelic variation between the oilseed rape cultivars analysed by us and those used for the generation of the CE-derived ESTs.

We identified 41 593 putative allelic polymorphisms over 15 626 unigenes between oilseed rape cultivars Tapidor and Ningyou 7. The majority of these represent SNPs relative to the reference where the base read from one of the cultivars matches the base in the corresponding unigene, but there were 5378 instances where both Tapidor and Ningyou 7 differed from the unigene (and from each other). Many SNPs were denoted by maq as ambiguity codes representing multiple bases. For the original application for which maq was developed, these would be interpreted as corresponding to heterozygous loci in a diploid genome. In our application, which involved two completely homozygous lines, we interpret these as indicative of contributions to the transcriptome from closely related genes, most probably homoeologues from the A and C genome of B. napus. Indeed, the unigenes used as reference sequences were assembled employing parameters that aimed to co-assemble ESTs from such homoeologues. Of the putative polymorphisms identified, 36 424 (87.5%) involve ambiguity codes. Two such hemi-SNPs, which occur close by in the same unigene (JCVI_23323:727 and JCVI_23323:751), are shown in Figure 2. We interpret these as instances in which a putative SNP is present in one of two transcribed and closely related genes, most probably homoeologues in our application.

Figure 2.

Comparative displays from Maqview showing maq-aligned reads for parents Tapidor and Ningyou 7 over a sample unigene, JCVI_23323. Candidate SNPs (C/Y and Y/C) are identified at positions 727 and 751. In each panel, the top track shows the reference sequence; the next shows the maq-called consensus derived from the aligned individual reads depicted below. Phred-like base qualities (Li et al., 2008) are encoded graphically (capital, base quality > 25; lower case white, quality 21–25; lower case light grey, quality 11–20; lower case dark grey, quality, ≤ 10). Differences from the consensus base are coloured red.

Where allele scoring is dependent on the representation of transcripts from two genes, an important source of error in allele assignment is the requirement for both being adequately represented in the Solexa dataset. The rate of such errors is expected to diminish with increasing read depth, and so we evaluated the number of putative SNPs identified as we progressively increased the required threshold from a minimum read depth of four to a minimum read depth of eight. The results are summarized in Table 2. As expected, the number of candidate SNPs identified decreases with increasing read depth required. However, even at the most stringent requirement evaluated (minimum read depth of eight), we still identified 23 330 putative SNPs over 9265 unigenes. Of these, 21 259 (91.2%) are hemi-SNPs. We interpret this as probably representing a better estimate of the proportion of SNPs that involve transcripts from two closely related genes, as the increased read depth should reduce the likelihood of sampling errors, as described above.

Table 2.  Putative single nucleotide polymorphisms (SNPs) detected between the transcriptomes of oilseed rape cultivars Tapidor and Ningyou 7 at varying minimum depths of sequence coverage
Minimum read depth  Total inter-cultivar SNPs  Unigenes  Simple SNPs  Hemi-SNPs  Hemi-SNPs (%)
4                   41 593                     15 626    5169         36 424     87.5
5                   35 030                     13 395    3856         31 174     89.0
6                   30 104                     11 632    3016         27 088     90.0
7                   26 358                     10 357    2466         23 892     90.6
8                   23 330                     9265      2071         21 259     91.2

The number of unigenes identified to contain at least one putative SNP ranges from 9265 to 15 626 (using eightfold to fourfold minimum read depths, respectively). The calculated SNP densities range from 0.32 to 29.72 per kb of unigene sequence, with their frequency distributions shown in Figure 3. The distribution departs from an exponential decay in frequency with increasing polymorphism rate by the occurrence of a peak at a density of approximately 19 SNPs/kb of aligned reference sequence. This may correspond to transcripts from regions of the genome introgressed into Ningyou 7 from B. rapa. This is because we anticipate a much higher rate of sequence variation between alleles originating from B. napus and B. rapa than we do for alleles both originating from B. napus cultivars, and the breeding of Ningyou 7 included crosses with B. rapa (Qiu et al., 2006). An estimation of the overall sequence polymorphism rate is complicated by uncertainty over the actual complexity of the transcriptome being analysed. If we assume that the proportion of the ~26 Mb of reference sequences to which Solexa reads have been aligned from two distinct genes is indicated by the proportion of hemi-SNPs when requiring a minimum read depth of eight (91.2%), we can estimate the total effective size of the aligned reference sequence to be 1.912 × ~26 Mb = ~49.7 Mb. This enables us to estimate the detected polymorphism rate between transcribed sequences in Tapidor and Ningyou 7 as between ~0.047% based on a minimum read depth of eight (23 330 SNPs/49.7 Mb) and ~0.084% based on a minimum read depth of four (41 593 SNPs/49.7 Mb). These are consistent with previous estimates of allelic sequence diversity between these two cultivars of ~0.1% (I. Bancroft, unpubl. data).
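The estimate of the overall polymorphism rate can be reproduced in a few lines. This is a sketch of the arithmetic described above, using the rounded figures quoted in the text (~26 Mb of aligned reference and a hemi-SNP proportion of 91.2%); the names are illustrative.

```python
ALIGNED_REFERENCE_MB = 26.0   # approximate aligned reference, from the text
HEMI_SNP_FRACTION = 0.912     # proportion of hemi-SNPs at minimum read depth 8

# Reference assumed to represent two genes for the hemi-SNP proportion,
# giving an effective size of 1.912 x ~26 Mb.
effective_size_mb = (1 + HEMI_SNP_FRACTION) * ALIGNED_REFERENCE_MB


def polymorphism_rate_percent(n_snps, size_mb):
    """Detected SNPs per base of effective reference, as a percentage."""
    return 100.0 * n_snps / (size_mb * 1_000_000)


rate_depth_8 = polymorphism_rate_percent(23_330, effective_size_mb)
rate_depth_4 = polymorphism_rate_percent(41_593, effective_size_mb)

print(f"Effective aligned reference: ~{effective_size_mb:.1f} Mb")
print(f"Rate at minimum depth 8: ~{rate_depth_8:.3f}%")
print(f"Rate at minimum depth 4: ~{rate_depth_4:.3f}%")
```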

Figure 3.

Logarithmic frequency distribution histograms of detected single nucleotide polymorphism (SNP) densities at minimum read depth criteria of four and eight reads. SNPs identified by maq in the parental datasets and surviving the post-processing steps were recorded to calculate densities from reference sequence lengths.

Validation of candidate SNPs in genomic sequences

The validation of candidate SNPs by polymerase chain reaction (PCR) amplification of genomic DNA and sequencing is difficult and time-consuming in oilseed rape. This is because the low level of divergence between the A and C genomes, ~3.5%, usually results in the simultaneous amplification of both homoeologues, which interferes with subsequent sequencing. We have previously conducted an analysis of sequence variation between Tapidor and Ningyou 7 using 33 locus-specific amplicons, totalling 13 857 bp, as shown in Table S1 (see ‘Supporting Information’), as part of an SNP marker development project (Qiu et al., 2006; http://brassica.bbsrc.ac.uk/IMSORB/). Based on the results of single-pass sequencing, six of the amplicons contain SNPs, with a total of 19 SNPs being detected overall. Using blast alignment, we identified unigenes that overlap the genomic sequences and contain putative SNPs called from the Solexa data (with a minimum read depth of four). We then assessed whether these genomic sequences validated the putative SNPs called by our methods. Of the nine SNPs in the aligned regions, eight had been called in the Solexa data analysis as hemi-SNPs, and so are anticipated to correspond to pairs of homoeologous genomic sequences, only one of which will correspond to the locus identified by the polymorphism. Of these eight, four of the polymorphisms, JCVI_17800:283 (Tapidor = G/T, Ningyou 7 = G), JCVI_17800:316 (Tapidor = A/G, Ningyou 7 = A), JCVI_35335:514 (Tapidor = C/T, Ningyou 7 = C) and JCVI_35335:709 (Tapidor = A/T, Ningyou 7 = T), were confirmed as present. For the remaining four, JCVI_35335:559 (Tapidor = C/T, Ningyou 7 = G/T), JCVI_35335:607 (Tapidor = G, Ningyou 7 = A/G), JCVI_35335:718 (Tapidor = C, Ningyou 7 = C/T) and JCVI_35335:775 (Tapidor = T, Ningyou 7 = C/T), the genomic sequences contained the base expected to be in the homoeologue of the polymorphic locus and so are uninformative.
The ninth putative SNP, JCVI_38464:472 (Tapidor = C, Ningyou 7 = A), was contradicted by the genomic sequences: the corresponding base in both the Tapidor and Ningyou 7 genomic sequences was C. Examination of the aligned Solexa reads revealed that the Ningyou 7 data comprised only poor-quality base calls at that position. Overall, the validation rate in the small sample tested was good (four of the nine putative SNPs confirmed, four uninformative and one contradicted), but there is scope for future improvement of the methodology. The SNPs detected using genomic sequences but not called using the Solexa data could be observed by manual inspection of the aligned Solexa reads (where reads were present), but had been filtered out by our methods because of insufficient read depth. We anticipate that a greater depth of Solexa sequence data would have resulted in the identification of a larger proportion of the polymorphisms that exist between the cultivars, and would have enabled further filtering of erroneous calls.

Validation of SNPs by Solexa transcriptome sequencing of lines of a mapping population

Our objective is the development of a marker system, involving the genotyping of B. napus lines by Solexa re-sequencing, for use in association genetics and for the construction of ultra-high-density linkage maps required to deal with population structure in such studies. We therefore designed a validation experiment to test whether we could observe the parental alleles and their expected segregation patterns in a panel of lines from a DH mapping population. We selected four lines from the ‘TNDH’ population (Qiu et al., 2006). An updated version of the TNDH linkage map and the associated locus file are publicly available (http://www.jic.ac.uk/staff/ian-bancroft/research_page3.htm#linkage). The lines that were used for the validation (DH151, DH156, DH177 and DH186) were selected on the basis of containing relatively large numbers of recombination events. Solexa transcriptome data were generated as for the Tapidor and Ningyou 7 data, but with only one Solexa lane run per sample instead of four. The statistics of the sequence data generated are shown in Table 3.

Table 3.  Statistics for Solexa expressed sequence tag (EST) sequences generated from four lines of the TNDH mapping population
Statistic                           DH151        DH156        DH177        DH186
Number of ESTs                      7 588 795    9 984 396    9 230 110    9 755 434
Total number of bases in ESTs (bp)  265 607 825  349 453 860  323 053 850  341 440 190
GC content of ESTs (%)              46.17        46.28        46.37        46.20
ESTs aligned to reference           5 344 071    7 270 497    6 602 795    7 076 516
Length of reference aligned (bp)    24 781 884   26 039 950   25 463 857   26 452 382
Fold coverage of aligned reference  7.55         9.77         9.08         9.36

The EST data from the four TNDH lines were used for allele calling at the SNP loci identified between the Tapidor and Ningyou 7 parents over a range of minimum read depths as before, but with each minimum depth also required in all four DH lines for an allele to be scored. The results are summarized in Table 4, and an example of an alignment across the TNDH lines is illustrated in Figure 4. Because the Solexa datasets derived from the TNDH lines were considerably smaller than those from the parental lines, the number of loci that could be scored in this analysis was reduced, ranging from 7980 (at a minimum read depth of eight) to 18 130 (at a minimum read depth of four). In most instances, only parental alleles were identified in the TNDH lines. Some alleles detected, however, were non-parental. These were scored as instances in which at least one allele call over the four TNDH lines did not match that in either the Tapidor or the Ningyou 7 parent. The proportion of such instances was greatest with a requirement for only fourfold minimum read depth across all six lines (17.2%), but reduced to 12.6% with the increased stringency of eightfold minimum read depth. As this population has been studied extensively without the observation of non-parental alleles, we interpret the instances observed as being the result of erroneous alignments of Solexa sequences to some loci, or stochastic failure to observe reads of all bases needed to call the ambiguity code corresponding to the parental allele.
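The scoring rule described in this paragraph can be sketched as follows. This is an illustrative reconstruction rather than the authors' tool: the function names and example calls are hypothetical, whereas the counts used to recompute the non-parental proportions (18 130 and 15 008 loci at fourfold depth; 7980 and 6973 at eightfold depth) are taken from Table 4.

```python
def score_locus(tapidor_call, ningyou7_call, line_calls, line_depths, min_depth=4):
    """Score one parental SNP locus across the DH lines.

    line_calls maps line name -> consensus call; line_depths maps
    line name -> read depth at the locus.
    """
    if any(line_depths.get(line, 0) < min_depth for line in line_calls):
        return "not scored"
    parental = {tapidor_call, ningyou7_call}
    if all(call in parental for call in line_calls.values()):
        return "parental alleles only"
    return "non-parental allele"


def non_parental_percent(total_scored, parental_only):
    """Percentage of scored loci with a non-parental call in any line."""
    return 100.0 * (total_scored - parental_only) / total_scored


# Hypothetical hemi-SNP: Tapidor = Y (C/T), Ningyou 7 = C.
depths = {"DH151": 6, "DH156": 5, "DH177": 4, "DH186": 7}
expected = {"DH151": "Y", "DH156": "C", "DH177": "C", "DH186": "C"}
# If the C reads are missed by chance in one line, T is called instead of Y.
undersampled = {"DH151": "T", "DH156": "C", "DH177": "C", "DH186": "C"}

print(score_locus("Y", "C", expected, depths))
print(score_locus("Y", "C", undersampled, depths))
print(f"{non_parental_percent(18_130, 15_008):.1f}% at minimum depth 4")
print(f"{non_parental_percent(7_980, 6_973):.1f}% at minimum depth 8")
```

The second example illustrates how stochastic failure to sample one of the two bases of an ambiguity code yields an apparently non-parental call.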

Table 4. Single nucleotide polymorphisms (SNPs) scored in the transcriptomes of four lines of the TNDH mapping population at varying depths of minimum sequence coverage across all four lines and their parents

| Minimum read depth | Total inter-cultivar SNPs | Total SNPs scored* | Unigenes | Simple SNPs | Hemi-SNPs | SNPs scored with only parental alleles | SNPs scored with only parental alleles (%) |
|---|---|---|---|---|---|---|---|
| 4 | 41 593 | 18 130 | 7065 | 1456 | 16 674 | 15 008 | 82.8 |
| 5 | 35 030 | 16 826 | 6612 | 1275 | 15 551 | 14 038 | 83.4 |
| 6 | 30 104 | 11 404 | 4583 | 747 | 10 657 | 9 799 | 85.9 |
| 7 | 26 358 | 9 422 | 3829 | 593 | 8 829 | 8 185 | 86.8 |
| 8 | 23 330 | 7 980 | 3269 | 483 | 7 497 | 6 973 | 87.4 |

  • * SNPs identified between parents needed to be covered by the same minimum read depth over all four DH lines to be scored.

  • Excluding instances of non-parental alleles called in any of the four DH lines.

Figure 4. Maqview displays of alignments to the reference unigene JCVI_23323 across four doubled haploid lines from the TNDH population. These correspond to the SNP loci shown in Figure 1 for Tapidor and Ningyou 7.

As the TNDH lines had already been subjected to extensive genotyping using conventional markers, we were able to assess whether the genotyping results obtained using Solexa EST sequences were consistent. We analysed the composition of each of the four TNDH lines for Tapidor and Ningyou 7 alleles at the loci scored using a minimum read depth of four applied across all six lines. For this analysis, we focused on a region of the genome for which we had identified a set of overlapping sequenced BAC clones containing B. rapa genomic DNA (that had been sequenced as part of the Brassica rapa Genome Sequencing Project; http://brassica.bbsrc.ac.uk/), and which had been anchored to the TNDH linkage map via molecular markers derived from these BACs. They map to linkage group A1, and the genotyping for the TNDH lines used in this study indicates that Tapidor alleles should be present for this region of the genome in line DH151, and Ningyou 7 alleles in the remainder. The scoring of markers in the homoeologous linkage group (C1; which contains the homoeologues of genes on this part of linkage group A1) indicates that all four lines should contain Tapidor alleles in this region of the genome. We identified unigenes that are cognate to the set of B. rapa BAC clones (KBrB010F06, KBrB042G14, KBrB124L09, KBrH001M22, KBrH003E03, KBrH055H09, KBrH130E08, KBrS004O21 and KBrS003F17) by blast alignment. We then analysed the alleles called for each of the four TNDH lines at the loci contained within these cognate unigenes, scored with lowest stringency of coverage, i.e. those with a minimum fourfold read depth at each locus across all four lines plus their parent cultivars. Four putative SNPs were filtered from the analysis as a consequence of non-parental alleles being detected in one or more of the TNDH lines. The results are summarized in Table 5. Of the 47 loci analysed, 46 were hemi-SNPs. 
The alleles scored for 36 of the 47 loci were consistent across all four of the TNDH lines with that expected for the targeted region of linkage group A1, and four were consistent across all four of the TNDH lines with that expected for its homoeologous region on linkage group C1. Seven loci did not show one of the expected segregation patterns. However, the data for these are consistent with stochastic failure to score an ambiguity call as a result of insufficient depth of sequence coverage, i.e. false negatives for the informative allele (allele calls highlighted in bold in Table 5). An alternative explanation could, in principle, arise from trans-acting effects, such as the transcription of only one of a pair of homoeologues being dependent on allelic variation at an unlinked locus.

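Table 5. Allele calls for putative single nucleotide polymorphism (SNP) markers in unigenes cognate to sequenced BAC clones mapped to linkage group A1, with a minimum read depth of four at each locus, across all six lines

| Cognate unigene:SNP position | Tapidor | Ningyou 7 | DH151 | DH156 | DH177 | DH186 | Consistent mapping? |
|---|---|---|---|---|---|---|---|
| EV164953:267 | T | Y | T | Y | T | Y | No* |
| EX027153:788 | A | M | M | A | M | M | No* |
| JCVI_14922:199 | M | C | M | C | C | C | Yes |
| JCVI_14922:491 | C | Y | C | C | C | C | Yes (C1) |
| JCVI_14962:220 | C | Y | C | Y | Y | Y | Yes |
| JCVI_14962:658 | G | C | G | C | C | C | Yes |
| JCVI_1663:630 | W | A | W | A | A | A | Yes |
| JCVI_17140:269 | R | G | R | G | G | G | Yes |
| JCVI_17140:536 | S | G | S | S | S | S | Yes (C1) |
| JCVI_18812:1560 | Y | T | Y | T | T | T | Yes |
| JCVI_18812:1596 | A | M | A | M | M | M | Yes |
| JCVI_19188:2822 | C | M | M | M | M | M | No* |
| JCVI_1925:271 | G | R | G | R | R | R | Yes |
| JCVI_22126:284 | M | C | M | C | C | C | Yes |
| JCVI_22126:528 | G | S | G | S | S | S | Yes |
| JCVI_23323:727 | C | Y | C | Y | Y | Y | Yes |
| JCVI_23323:751 | Y | C | Y | Y | Y | Y | Yes (C1) |
| JCVI_23323:822 | G | R | G | R | R | R | Yes |
| JCVI_23323:925 | G | K | G | G | G | G | Yes (C1) |
| JCVI_25984:225 | R | G | G | R | G | R | No* |
| JCVI_40373:29 | K | T | K | T | T | T | Yes |
| JCVI_40373:97 | R | G | R | G | G | G | Yes |
| JCVI_40373:208 | T | K | K | K | K | K | No* |
| JCVI_40373:215 | C | Y | Y | Y | Y | Y | No* |
| JCVI_40373:799 | M | A | M | A | A | A | Yes |
| JCVI_41027:561 | A | W | A | W | W | W | Yes |
| JCVI_41027:645 | A | M | A | M | M | M | Yes |
| JCVI_4176:226 | G | S | G | S | S | S | Yes |
| JCVI_4176:667 | G | K | G | K | K | K | Yes |
| JCVI_4176:1016 | A | W | A | W | W | W | Yes |
| JCVI_4176:1319 | R | A | R | A | A | A | Yes |
| JCVI_4176:1461 | G | R | G | R | R | R | Yes |
| JCVI_4317:668 | C | M | C | M | M | M | Yes |
| JCVI_5044:408 | G | R | R | R | R | R | No* |
| JCVI_5044:606 | Y | C | Y | C | C | C | Yes |
| JCVI_5392:857 | R | G | R | G | G | G | Yes |
| JCVI_7288:618 | C | S | C | S | S | S | Yes |
| JCVI_7882:130 | C | Y | C | Y | Y | Y | Yes |
| JCVI_7882:166 | S | R | S | R | R | R | Yes |
| JCVI_7882:217 | Y | C | Y | C | C | C | Yes |
| JCVI_7882:251 | C | Y | C | Y | Y | Y | Yes |
| JCVI_7882:259 | K | S | K | S | S | S | Yes |
| JCVI_7882:295 | T | K | T | K | K | K | Yes |
| JCVI_7882:391 | G | R | G | R | R | R | Yes |
| JCVI_7882:418 | K | T | K | T | T | T | Yes |
| JCVI_7882:457 | T | K | T | K | K | K | Yes |
| JCVI_7882:522 | K | G | K | G | G | G | Yes |

  • International Union of Biochemistry ambiguity codes: M = A/C; K = G/T; Y = C/T; R = A/G; W = A/T; S = C/G.

  • * The putative erroneous allele calls are shown in bold. These are interpreted as potentially resulting from stochastic failure to observe, in the aligned Solexa reads, all of the bases required to call the correct allele.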
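
The 'Consistent mapping?' column of Table 5 can be derived mechanically from the allele calls and the expected genotypes given in the text: for linkage group A1, DH151 carries the Tapidor allele and the other three lines the Ningyou 7 allele; for the homoeologous region on C1, all four lines carry the Tapidor allele. The Python sketch below applies that rule to three rows of the table. It is an illustration under those stated expectations, and the function and variable names are hypothetical.

```python
DH_LINES = ("DH151", "DH156", "DH177", "DH186")

# Expected parental origin per DH line ("T" = Tapidor, "N" = Ningyou 7),
# taken from the conventional-marker genotypes described in the text.
EXPECTED = {
    "A1": {"DH151": "T", "DH156": "N", "DH177": "N", "DH186": "N"},
    "C1": {"DH151": "T", "DH156": "T", "DH177": "T", "DH186": "T"},
}


def parse_row(alleles):
    """Split six calls ordered Tapidor, Ningyou 7, DH151, DH156, DH177, DH186."""
    tapidor, ningyou7, *dh = alleles
    return tapidor, ningyou7, dict(zip(DH_LINES, dh))


def consistent_group(tapidor, ningyou7, dh_calls):
    """Return 'A1' or 'C1' if the DH calls match that pattern, else None."""
    parent_call = {"T": tapidor, "N": ningyou7}
    for group, origin in EXPECTED.items():
        if all(dh_calls[line] == parent_call[origin[line]] for line in DH_LINES):
            return group
    return None


rows = {
    "JCVI_14922:199": "MCMCCC",  # listed as Yes
    "JCVI_14922:491": "CYCCCC",  # listed as Yes (C1)
    "EV164953:267": "TYTYTY",    # listed as No*
}
for locus, alleles in rows.items():
    print(locus, consistent_group(*parse_row(alleles)))
```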

We conclude that Solexa sequencing of the transcriptome is an efficient method for the identification of sequence variation between oilseed rape lines. We have shown that the sequence variation is reproducibly identified in lines of a mapping population derived from the cultivars analysed, and that the automated scoring of alleles in such lines is consistent with the results of conventional markers. Solexa sequencing of the transcriptome is therefore an appropriate approach for SNP discovery and assay in oilseed rape and other polyploid species. It overcomes the key impediment to association genetic studies in polyploids by enabling the cost-effective analysis of sequence variation at tens of thousands of loci simultaneously.

Experimental procedures

Isolation of leaf RNA

Two parents of the TNDH population (Qiu et al., 2006) and four TNDH lines, TN151, TN156, TN177 and TN186, with high recombinant frequency were selected for RNA extraction and Solexa sequencing. The two parents were planted in June 2007, and the four DH lines were planted in June 2008. All of the materials were planted in the same environment. Leaves were harvested from seedlings at the six-leaf stage. For each sample, 0.1 g of leaf tissue was frozen in liquid nitrogen and used to isolate RNA. RNA extraction was performed using a Trizol kit (Invitrogen, Carlsbad, CA) according to the manufacturer's protocol. After the RNA samples had been isolated and dried, they were dissolved in diethylpyrocarbonate-treated H2O, and an ultraviolet (UV) spectrophotometer was used to determine the concentration of the solution. The quality of RNA was confirmed on 1.0% denaturing agarose gels.

Solexa sequencing

At least 20 µg of total RNA, at a concentration of ≥ 400 ng/µL, was sent to the Beijing Genomics Institute (BGI) for Solexa sequencing as a commercial service. First, poly A-containing mRNA molecules were purified from total RNA using poly-T oligo-attached magnetic beads. Following purification, the mRNA was fragmented into small pieces using divalent cations at elevated temperature. The cleaved RNA fragments were then copied into first-strand cDNA using reverse transcriptase and a high concentration of random hexamer primers. This was followed by second-strand cDNA synthesis using DNA Polymerase I and RNaseH. Finally, the short cDNA fragments were prepared for Solexa sequencing on an Illumina (San Diego, CA) Genome Analyser using the manufacturer's protocol and reagents in the Genomic DNA Sequencing Sample Prep Kit supplied with the system.

SNP calling using maq

maq version 0.6.8 (Li et al., 2008) was used for the initial alignment and SNP identification in the parental datasets, essentially following the protocols described in the online documentation (http://maq.sourceforge.net) and adopting the default parameter values. Briefly, the 95k unigene reference sequences were converted to binary FASTA format, and each Solexa read data subset (corresponding to one lane on the instrument, approximately five million raw reads) was transformed from Solexa FASTQ to Sanger FASTQ format. As recommended, each subset was separately mapped to the reference, and these maps were then merged to form parental maps and to assemble the consensus sequences. It was found that mapping was the most computer-intensive step, taking some 12–15 h of real time on a four-core machine with 16 GB RAM, with each core handling one lane of data. Presumably, this results from the complexity of the reference sequence. The maq cns2snp command was used to extract SNP sites employing Phred-like consensus qualities as an initial criterion. Next, the maq Perl script SNPfilter was used to post-process these SNPs. The algorithm used here accesses additional data to eliminate SNPs that are covered by too few or too many reads, are near to potential InDels, fall in possibly repetitive regions or have low-quality neighbouring bases. The default parameters used mean that the resulting SNPs called were covered by at least three reads.
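
The Solexa-to-Sanger FASTQ conversion mentioned above changes both the quality scale and the ASCII offset. A minimal Python sketch follows, assuming the early Solexa encoding (Solexa-scaled scores, ASCII offset 64) and the Sanger encoding (Phred scores, ASCII offset 33). The function names are illustrative, and nearest-integer rounding is an assumption, as the text does not describe how maq rounds.

```python
import math


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled quality score to the Phred scale.

    Solexa: Q = -10 log10(p / (1 - p)); Phred: Q = -10 log10(p).
    """
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def solexa_to_sanger_quality(qual):
    """Re-encode a Solexa FASTQ quality string (offset 64) as Sanger (offset 33)."""
    out = []
    for char in qual:
        q_solexa = ord(char) - 64
        q_phred = int(round(solexa_to_phred(q_solexa)))  # assumed rounding
        out.append(chr(q_phred + 33))
    return "".join(out)


# 'h' encodes Solexa quality 40 and '@' encodes Solexa quality 0.
print(solexa_to_sanger_quality("h@"))
```

The two scales agree closely at high quality and diverge only for low scores, which is why the conversion matters mainly for poor-quality bases.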

Polymorphism detection

As this is a somewhat novel use of maq, we developed a Perl script to compare the filtered SNP lists generated by the pipeline described above for the two parental datasets. For each candidate SNP, the algorithm accessed the read and base quality data over that position, in text pileup format, for the opposing genotype and made further assessments of robustness, given the complexity of our data. We applied an adjustable read depth criterion, such that SNPs were eliminated if there was insufficient read depth in the opposing genotype (compromising maq's ability to align or to call SNPs). In addition, SNPs identified by maq as base ambiguities with respect to the reference in one genotype were eliminated if the mixture of high-quality base calls at that position in the opposing parent overlapped, irrespective of whether maq had called an SNP there. We also experimented with base recalling at SNP positions to possibly override maq's original consensus call. The overall result was an aggressive filtering of maq's SNP calls. The script was also used to identify concerted ‘SNPs’ called by maq that indicated departures from (and informing corrections to) the reference sequence. The Maqview alignment indexer and viewer (J. Ruan, H. Li and M. Zhao, http://maq.sourceforge.net) was used for graphical visualization and inspection of candidate SNPs.
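
The two elimination criteria described above can be sketched in Python as follows. This is not the authors' Perl script. It assumes that 'overlapped' means the opposing parent shows, among its high-quality base calls, every base of the ambiguity code called in the first parent. The quality threshold of 20, the default depth of 4 and all names are hypothetical choices for the example.

```python
# Bases represented by each call; two-base codes follow the IUB scheme.
IUB = {"A": "A", "C": "C", "G": "G", "T": "T",
       "M": "AC", "K": "GT", "Y": "CT", "R": "AG", "W": "AT", "S": "CG"}


def keep_candidate(call, opposing_pileup, min_depth=4, min_qual=20):
    """Decide whether a candidate SNP survives filtering.

    call:            maq consensus call in one parent (base or ambiguity code)
    opposing_pileup: list of (base, phred_quality) tuples from the other parent
    """
    # Criterion 1: enough reads must cover the position in the opposing parent.
    if len(opposing_pileup) < min_depth:
        return False
    # Criterion 2: an ambiguity call is dropped if the opposing parent shows
    # the same mixture of bases among its high-quality calls (assumed reading).
    hq_bases = {base for base, qual in opposing_pileup if qual >= min_qual}
    called = set(IUB[call])
    if len(called) > 1 and called <= hq_bases:
        return False
    return True


print(keep_candidate("M", [("C", 30)] * 5))                # kept
print(keep_candidate("M", [("C", 30)] * 4 + [("A", 30)]))  # same mixture
print(keep_candidate("M", [("C", 30)] * 3))                # too shallow
```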

Allele scoring

An ancillary Perl script was developed to sequentially access the raw maq pileup data, guided by the refined SNP list, for each DH line used in our validation experiments. An adjustable minimum read depth criterion was applied universally across the parental and DH lines in order for an SNP position to be scored. This resulted in heavy attrition of the primary candidate SNP list, as the individual DH datasets contained fewer reads and hence poorer coverage than the parents. Alleles were then recalled at each SNP position in a consistent manner for each parental line and for each segregant, and the allele patterns were then recorded and counted. These Perl scripts are freely available from the authors, but should be considered as in development.
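
As a rough illustration of allele recalling and pattern counting, the Python sketch below calls an allele from the bases piled up at one position and tallies the resulting six-line patterns. The text does not state the rule used to recall alleles, so the rule here is purely an assumption: a base is included in the call if it makes up at least a hypothetical 20% of the reads. All names are illustrative.

```python
from collections import Counter

LINES = ("Tapidor", "Ningyou 7", "DH151", "DH156", "DH177", "DH186")

# Single bases and IUB two-base ambiguity codes, keyed by the bases observed.
CODE = {frozenset("A"): "A", frozenset("C"): "C",
        frozenset("G"): "G", frozenset("T"): "T",
        frozenset("AC"): "M", frozenset("GT"): "K", frozenset("CT"): "Y",
        frozenset("AG"): "R", frozenset("AT"): "W", frozenset("CG"): "S"}


def call_allele(bases, min_depth=4, min_fraction=0.2):
    """Call an allele from a string of bases observed at one position.

    Returns None if the depth is too low or more than two bases pass the
    (assumed) frequency threshold.
    """
    if len(bases) < min_depth:
        return None
    counts = Counter(bases)
    kept = frozenset(b for b, n in counts.items() if n / len(bases) >= min_fraction)
    return CODE.get(kept)


def allele_pattern(pileups, min_depth=4):
    """Join the calls for all six lines, or return None if any line is unscored."""
    calls = [call_allele(pileups[line], min_depth) for line in LINES]
    return None if None in calls else "".join(calls)


locus = {"Tapidor": "AACCA", "Ningyou 7": "CCCCC", "DH151": "ACACC",
         "DH156": "CCCC", "DH177": "CCCCC", "DH186": "CCCCCC"}
patterns = Counter([allele_pattern(locus)])
print(patterns)
```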

Acknowledgements

Financial support for this work was provided by the UK Biotechnology and Biological Sciences Research Council (BBSRC BB/E017363/1 and Competitive Strategic Grant to John Innes Centre) and the National Basic Research and Development Programme of China (2006CB101600).
