• Open Access

A tailed PCR procedure for cost-effective, two-order multiplex sequencing of candidate genes in polyploid plants

Authors


(fax +49 641 9937429; email rod.snowdon@agrar.uni-giessen.de)

Summary

Complex polyploid crop genomes can be recalcitrant towards conventional DNA sequencing approaches for allele mining in candidate genes for valuable traits. In the past, this has greatly complicated the transfer of knowledge on promising candidate genes from model plants to even closely related polyploid crops. Next-generation sequencing offers diverse solutions to overcome such difficulties. Here, we present a method for multiplexed 454 sequencing in gene-specific PCR amplicons that can simultaneously address multiple homologues of given target genes. We devised a simple two-step PCR procedure employing a set of barcoded M13/T7 universal fusion primers that enable a cost-effective and efficient amplification of large numbers of target gene amplicons. Sequencing-ready amplicons are generated that can be simultaneously sequenced in pools comprising multiple amplicons from multiple genotypes. High-depth sequencing allows resolution of the resulting sequence reads into contigs representing multiple homologous loci, with only insignificant off-target capture of paralogues or PCR artefacts. In a case study, the procedure was tested in the complex polyploid genome of Brassica napus for a set of nine genes identified in Arabidopsis as candidates for regulation of seed development and oil content. Up to six copies of these genes were expected in B. napus. SNP discovery was performed by pooled multiplex sequencing of 30 amplicons in 20 diverse B. napus accessions with interesting trait variation for oil content, providing a basis for comparative mapping to relevant quantitative trait loci and for subsequent marker-assisted breeding.

Introduction

Candidate genes discovered by detailed forward and reverse genetics approaches in model plants represent a valuable resource for potential improvement of agronomically important traits in many major crop species. DNA sequencing of candidate genes is often applied for allele mining in genetically diverse germplasm, where novel, trait-relevant allelic variation can be uncovered within the species of interest and subsequently implemented directly into breeding programmes. The review of Takeda and Matsuoka (2008) provides numerous examples of the use of candidate genes to elucidate improved crop plant performance.

Many major crop species have complex polyploid genomes, however, in which simple PCR-based DNA sequencing approaches to uncover allelic diversity are often impractical or unfeasible. A high degree of gene duplication and genome plasticity is encountered in such genomes (Adams and Wendel, 2005; Leitch and Leitch, 2008), necessitating expensive and labour-intensive cloning approaches to isolate the multiple homologous copies of any given candidate gene and develop locus-specific assays for surveying of allelic variation. Such an approach is not always successful, however, for example because of high sequence conservation among homologues or unpredictable structural genome rearrangements leading to widespread presence–absence variation (PAV) and copy number variation (CNV). Furthermore, the highly specific reaction conditions necessary to ensure that assays are truly locus-specific can lead to failure of PCR-based approaches in genetically diverse germplasm.

Next-generation sequencing today offers a variety of technological solutions to overcome these problems and help implement genomic knowledge in crop improvement (Edwards and Batley, 2010; Varshney et al., 2009). Whole-genome resequencing, the most powerful of these, is not yet cost-effective enough to be widely used in crop breeding; furthermore, it relies on reliable reference genome assemblies that are not yet publicly available for many complex polyploid genomes. On the other hand, high-depth next-generation sequencing of PCR amplicons is today common in human diagnostics for high-throughput detection even of rare allelic variants in specific target genes (e.g. Dahl et al., 2007; De Leeneer et al., 2011). For applications implementing short-read sequencing technologies, this involves the generation of barcoded libraries for fragmented amplicons of individual genotypes (Craig et al., 2008). Typically, such approaches target known or novel mutations within specific genes of interest using an established PCR assay that amplifies a unique copy of a specific target gene. After sequencing of pooled, barcoded libraries, the reads are aligned onto a reference sequence for SNP discovery. Polyploid crop genomes, on the other hand, often possess multiple homologous copies of any given gene or amplicon, along with the potential for additional related paralogues. It can thus be highly challenging to distinguish SNPs between homologous loci from SNPs among genotypes at the same locus, particularly if little surrounding sequence information is available with which to identify homologue-specific haplotypes (see Parkin et al., 2010). In such cases, it can be a great advantage to employ sequencing technologies which deliver longer reads, for example, by using high-depth sequences from full-length amplicons to distinguish haplotypes corresponding to homologous loci. Alignment of consensus sequences for contigs assembled by deep sequencing of the same targets from different genotypes can potentially allow a more efficient distinction between homologues and the discovery of locus-specific SNPs (Barbazuk et al., 2007; Parkin et al., 2010; Allen et al., 2011).

In contrast to studies in medicine, candidate gene analyses in crop plants with less well-characterized genomes tend to be considerably less targeted; large numbers of putative candidate genes, generally identified in model systems that can differ significantly from their crop relatives in the expression of yield and quality-related parameters, are often of equal interest for potential improvement of crop plant performance. Detailed investigation of multiple candidates by conventional means tends to exceed the resources and budget of a typical plant breeder, however, a situation which is further compounded by the complexities of polyploidy in many major crops (Udall and Wendel, 2006). Thus, there is a need in crop breeding for low-cost platforms that enable a broad discovery of allelic variation in diverse target sequences for more efficient and productive exploitation of genetic variation.

Oligonucleotide-mediated sequence capture technologies have been successfully applied for targeted resequencing of predetermined chromosomal regions in crop plants (e.g. Fu et al., 2010); however, bead-pool and array-based capture methods are generally not down-scalable for experiments which only aim to target a few thousand nucleotides, as is generally the case in candidate gene sequencing approaches. Sequencing of PCR amplicons is a viable alternative that allows the capture of smaller targets. Amplicons from a single individual can be pooled in a single sequencing reaction and their sequences later extracted based on the 5′ primer sequences. For example, Bundock et al. (2009) introduced 454 amplicon sequencing as an efficient method for targeted SNP discovery from smaller capture targets in polyploid plants. In their study, a large number of PCR amplicons were generated for each of two sugarcane genotypes, and amplicon pools for each genotype were sequenced separately, each in a single region of a 4-region gasket on a Roche 454-FLX Genome Sequencer (454 Life Sciences, Branford, CT). The use of only two genotypes limits information on allelic frequencies of discovered SNPs, however, and in a complex polyploid it can be difficult to distinguish truly locus-specific SNPs from inter-homoeologue SNPs in only two genotypes (Bancroft et al., 2011). Furthermore, this basic technique requires each sample to be sequenced in a separate sequencing region. In recent years, the use of multiplex identifier (MID) oligonucleotides, often referred to as ‘barcodes’, has become a popular method to pool next-generation sequencing libraries from multiple genotypes in a single sequencing reaction (e.g. Binladen et al., 2007; Roche 454 Technical Bulletin, 2009b; Lennon et al., 2010). Sample barcoding increases the number of samples that can be simultaneously sequenced at a sufficient sequencing depth, meaning that experiments on ultra-high-throughput sequencing machines can also be scaled for relatively small target regions. For example, Griffin et al. (2011) demonstrated the applicability of sample barcoding for 454 sequencing of five genes in Poa grasses, which have a complex polyploid genome with multiple gene copies.

In cases where the target sequence is a defined fragment that can be amplified by a simple PCR procedure, the addition of MID barcodes and sequencing adapters to the 5′ end of the target-specific PCR primer enables the generation of sequencing-ready ‘fusion’ PCR products (Roche 454 Technical Bulletin, 2009a). On the one hand, this potentially leads to considerable reductions in time and cost of library production, because it eliminates the need for adapter-ligation steps prior to sequencing. On the other hand, a unique set of MID-tagged fusion primers is required for every specific amplicon one wishes to include in the sequencing pool. Fusion primer oligonucleotides are very expensive, however, because of their length (55 bp or longer) and the need for HPLC purification to ensure their authenticity. This generally makes them worthwhile only in cases where the same set of MID-tagged primers (e.g. for a specific target gene) is repeatedly used in a large number of experiments to sequence the same amplicon in large numbers (hundreds or thousands) of samples. This strategy is, for example, used in human medicine to screen for genetic mutations in single genes involved in specific disorders (e.g. Dahl et al., 2007; Taudien et al., 2010; De Leeneer et al., 2011).

In contrast, crop geneticists tend to be interested in allelic variation for multiple homologues of many different candidate genes and may wish to investigate different sets of multiple candidate genes for many different traits of interest. Often, the number of gene copies is unknown or variable and no locus-specific assays can be developed because of the absence of reference DNA sequences. In such cases, the simplicity of the fusion PCR method is greatly outweighed by the very high cost of HPLC-purified, MID-barcoded fusion primer sets for each gene (and copy) of interest. For example, for multiplex screening of 30 gene-specific amplicons in ten genotypes (as we describe in this article), using a conventional fusion 454 PCR strategy would require 300 unique sets of long, HPLC-purified primers to cover all amplicon and barcode combinations. Synthesis and HPLC purification of a pair of 55 bp PCR primers currently costs well over US$120, a cost factor that is clearly prohibitive for everyday analyses of candidate genes in crop plants.
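The primer arithmetic behind this comparison can be sketched as follows. The counts (30 amplicons, ten barcodes) and the US$120 lower bound per long primer pair are taken from the text; the function names are illustrative, and no price is assumed for the short tailed primers because none is given here.

```python
# Primer-pair bookkeeping for the two strategies (counts from the text).
N_AMPLICONS = 30             # gene-specific amplicons in the case study
N_BARCODES = 10              # MID barcodes, i.e. genotypes per sequencing pool
MIN_COST_LONG_PAIR_USD = 120 # lower bound quoted for one HPLC-purified pair

def conventional_fusion_pairs(n_amplicons: int, n_barcodes: int) -> int:
    """Every amplicon needs its own long fusion primer pair for every barcode."""
    return n_amplicons * n_barcodes

def universal_fusion_pairs(n_amplicons: int, n_barcodes: int) -> tuple[int, int]:
    """Return (long barcoded universal pairs, short tailed target pairs)."""
    return n_barcodes, n_amplicons

conventional = conventional_fusion_pairs(N_AMPLICONS, N_BARCODES)
long_universal, short_tailed = universal_fusion_pairs(N_AMPLICONS, N_BARCODES)

print(f"conventional: {conventional} long primer pairs, "
      f"at least US${conventional * MIN_COST_LONG_PAIR_USD}")
print(f"universal:    {long_universal} long primer pairs "
      f"(at least US${long_universal * MIN_COST_LONG_PAIR_USD}) "
      f"plus {short_tailed} short tailed pairs")
```

The number of long, HPLC-purified primer pairs grows with the product of amplicons and barcodes in the conventional design, but only with the number of barcodes in the universal design.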

Here, we present an alternative method that employs a single set of universal M13/T7 fusion primers for multiplexed, long-read sequencing of PCR amplicons from multiple copies of multiple candidate genes in multiple genotypes simultaneously. Using a straightforward two-step PCR protocol (Figure 1), a set of universal primer combinations with ten unique MID barcodes allows generation of ready-to-sequence amplicons for multiplexed sequencing of up to ten pooled samples per sequencing region, from any group of target amplicons and from any species. Because all loci of a particular gene can be addressed and assayed simultaneously, the method is highly suitable for screening of candidate gene sequence variation in complex polyploid genomes, even in the absence of a completed reference genome sequence. We tested this system by using it for high-throughput SNP discovery from homoeologous candidate gene copies in the complex polyploid genome of oilseed rape (B. napus L., genome AACC, 2n = 38). A panel of 20 genetically diverse B. napus genotypes (Table 1) was used. The results revealed a high power of SNP detection, the ability to distinguish homologous loci and indications for a surprisingly high plasticity of the B. napus genome in terms of PAV and CNV.

Figure 1.

 Two-step generation of 454-FLX amplicons with ‘universal fusion primers.’ A first round of PCR specifically amplifies 400- to 500-bp-long amplicons from either specific loci or multiple homologues of target genes using locus-specific or multi-locus tailed primers, respectively. In the first PCR step, all forward primers are concatenated with a universal M13 tail at their 5′ end, while reverse primers are 5′-appended with a universal T7 tail. In the second step, all amplicons generated by step 1, regardless of the target, can be re-amplified at high efficiency using a single tailed forward primer with the M13 tail at its 3′ end, along with a reverse primer with the T7 tail at its 3′ end. Multiplex identifier (MID) barcodes and 454-FLX Titanium A and B sequencing adapters are appended to the 5′ ends of the forward and reverse universal primers, respectively. A set of ten unique universal primer combinations with different MID barcodes allows generation of ready-to-sequence amplicons for multiplexed sequencing of ten pooled samples per sequencing region, from any group of target amplicons and from any species.
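The primer layout described in the caption can be illustrated with a minimal sketch. All sequences below are assumptions used for illustration only: the adapter, MID, M13 and T7 strings are commonly cited versions and are not taken from this article, and the gene-specific sequence is invented. The authors' actual oligonucleotides should be checked in the Materials and Methods and in the vendor documentation.

```python
# Illustrative primer assembly for the two-step tailed PCR (all sequences assumed).
ADAPTER_A = "CGTATCGCCTCCCTCGCGCCA"   # assumed Titanium adapter A (without key)
ADAPTER_B = "CTATGCGCCTTGCCAGCCCGC"   # assumed Titanium adapter B (without key)
KEY = "TCAG"                          # calibration key named in the text
M13_TAIL = "GTAAAACGACGGCCAGT"        # assumed universal M13 forward tail
T7_TAIL = "TAATACGACTCACTATAGGG"      # assumed universal T7 tail
MIDS = {"MID1": "ACGAGTGCGT", "MID2": "ACGCTCGACA"}  # assumed barcode sequences

def tailed_primer(tail: str, target_specific: str) -> str:
    """Step 1 primer: universal tail at the 5' end, target sequence at the 3' end."""
    return tail + target_specific

def fusion_primer(adapter: str, mid: str, tail: str) -> str:
    """Step 2 primer: adapter + key + MID at the 5' end, universal tail at the 3' end."""
    return adapter + KEY + mid + tail

# Hypothetical gene-specific forward sequence, for demonstration only.
step1_fwd = tailed_primer(M13_TAIL, "ATGGCTTCTGATCGTACCAA")

# One universal forward/reverse pair per barcode serves every target amplicon.
universal_pairs = {
    name: (fusion_primer(ADAPTER_A, mid, M13_TAIL),
           fusion_primer(ADAPTER_B, mid, T7_TAIL))
    for name, mid in MIDS.items()
}

fusion_fwd, fusion_rev = universal_pairs["MID1"]
print(step1_fwd)
print(fusion_fwd)
print(fusion_rev)
```

Because the 3′ end of each step 2 primer is the universal tail, it anneals to any step 1 product regardless of the target, which is why the set of long primers does not grow with the number of amplicons.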

Table 1.   Origins of the 20 genetically diverse Brassica napus genotypes used for screening of sequence diversity on candidate genes for oil content
Genotype | Type | Pedigree | Origin
G39 | RS | B. rapa convar. capitata var. capitata × B. rapa ssp. oleifera | GAU
G50 | RS | B. oleracea convar. acephala var. gongyloides × B. rapa ssp. oleifera | GAU
H40 | RS | B. oleracea convar. botrytis var. italica × B. rapa ssp. pekinensis var. laxa | GAU
H44 | RS | B. oleracea convar. capitata var. sabauda × B. rapa ssp. pekinensis | GAU
H149 | RS | B. oleracea convar. acephala var. medullosa × B. rapa ssp. chinensis | GAU
H365 | RS | B. oleracea convar. capitata var. sabauda × B. rapa ssp. rapa | GAU
K29 | RS | B. oleracea convar. acephala var. sabellica × B. rapa ssp. oleifera | GAU
K332 | S | B. rapa ssp. oleifera × B. napus ssp. napobrassica × B. napus ssp. napobrassica | GAU
L122 | RS | B. oleracea convar. capitata var. sabauda × B. rapa ssp. oleifera | GAU
R76 | RS | B. oleracea convar. botrytis var. alboglabra × B. rapa ssp. oleifera | GAU
R99 | RS | B. rapa convar. capitata var. capitata × B. rapa ssp. pekinensis | GAU
RS1/2 | RS | B. rapa sp. × B. oleracea sp. | FUB
S17 | RS | (B. napus ssp. napus × B. oleracea convar. gemmifera) × B. rapa ssp. oleifera | FUB
MOY4 | RS | B. rapa Yellow Sarson × Brassica montana Pourret | GAU
87-50182 | Chin. | Unknown | WIAS
Linyou 5 | Chin. | Unknown | JAAS
Xiangyou 11 | Chin. | Unknown | HAAS
Altex | SOSR | Unknown | UOA
Barossa | SOSR | Haya//Zephyr/Bronowski/3/Chisaya//Zephyr/Bronowski | NSWDA
Shiralee | SOSR | Haya//Zephyr/Bronowski/5/Sv62.371/Zephyr//Norin20/3/Erglu/4/BJ168/Cresus-o-Precose | NSWDA

RS, resynthesised B. napus; S, swede; Chin., Chinese oilseed rape; SOSR, spring-type oilseed rape (canola); FUB, Free University of Berlin, Germany; GAU, Georg August University, Göttingen, Germany; HAAS, Hunan Academy of Agricultural Sciences, Changsha, Hunan Province, China; JAAS, Jiangsu Academy of Agricultural Sciences, Nanjing, Jiangsu Province; NSWDA, New South Wales Dept of Agriculture, Agricultural Research Institute, Wagga Wagga, Australia; UOA, University of Alberta, Edmonton, Canada; WIAS, Wanxian Institute of Agricultural Sciences, Sichuan Province, China.

Results and discussion

454 sequencing output

Parallel sequencing of 30 multilocus PCR amplicons from 20 B. napus genotypes in two pools on two quarter-gasket 454 sequencing regions (see Materials and Methods) yielded a total of 223 193 reads for the ten genotypes in pool 1, while 221 824 reads were obtained from pool 2. Details of the sequencing output per genotype are provided in Table 2. From the raw sequence reads, 95% in each pool had the TCAG calibration key and an MID barcode oligonucleotide at their 5′ end, while 88% of the total reads in each pool also had the M13 or T7 tail sequence. This demonstrates the high efficiency of the two-step universal fusion PCR procedure in generating sequencing-ready 454 amplicon sequencing libraries, without the need for expensive and time-consuming enzymatic ligation of sequencing adaptors and MID barcodes for large numbers of libraries. The almost identical numbers of raw and processed reads from each of the two genotype pools reflect an extremely high consistency of the procedure, which could make it suitable for large-scale amplicon sequencing of selected target genes in large populations of genetically diverse individuals. The 20 amplicon libraries yielded between 11 253 and 25 596 reads per genotype that showed the expected MID barcodes and M13/T7 universal primer adapters. This corresponds to an average of between 360 and 825 reads per amplicon per genotype after sequence cleanup, which is within the recommended read depth for detection of rare alleles and CNV in 454 amplicon sequencing studies (e.g. Taudien et al., 2010).

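Table 2.   Summary of data output per genotype for the 30 pooled, multilocus PCR amplicons on two quarter-gasket regions of a 454-FLX Titanium sequencing run
Pool | Genotype | MID | Total no. of reads | No. with TCAG key + MID barcode | No. with M13 or T7 tail sequence
Pool 1 | R99 | MID1 | 19 718 | 18 986 | 17 393
Pool 1 | H44 | MID2 | 13 218 | 12 298 | 11 363
Pool 1 | G50 | MID3 | 21 482 | 20 420 | 18 640
Pool 1 | R76 | MID4 | 28 632 | 27 244 | 25 177
Pool 1 | S17 | MID5 | 24 361 | 23 084 | 21 273
Pool 1 | G39 | MID6 | 17 733 | 16 838 | 16 060
Pool 1 | H149 | MID7 | 21 188 | 20 162 | 18 388
Pool 1 | K29 | MID8 | 17 835 | 16 926 | 15 759
Pool 1 | Barossa | MID9 | 29 892 | 28 258 | 26 128
Pool 1 | Shiralee | MID10 | 29 134 | 27 570 | 25 596
Pool 1 | Mean read no. |  | 22 319 | 21 179 | 19 578
Pool 1 | Total no. of reads |  | 223 193 | 211 786 | 195 777
Pool 1 | Proportion of raw reads |  |  | 94.89% | 87.72%
Pool 2 | RS1/2 | MID1 | 20 974 | 20 213 | 18 402
Pool 2 | K332 | MID2 | 13 110 | 12 201 | 11 253
Pool 2 | H40 | MID3 | 19 112 | 18 116 | 16 672
Pool 2 | L122 | MID4 | 28 789 | 27 311 | 25 099
Pool 2 | H365 | MID5 | 24 375 | 23 073 | 21 168
Pool 2 | Xiangyou 11 | MID6 | 18 830 | 17 849 | 16 993
Pool 2 | 87-50182 | MID7 | 23 364 | 22 212 | 20 454
Pool 2 | Linyou 5 | MID8 | 17 462 | 16 614 | 15 441
Pool 2 | Altex | MID9 | 26 486 | 25 055 | 23 255
Pool 2 | MOY4 | MID10 | 29 322 | 27 815 | 25 932
Pool 2 | Mean read no. |  | 22 182 | 21 046 | 19 467
Pool 2 | Total no. of reads |  | 221 824 | 210 459 | 194 669
Pool 2 | Proportion of raw reads |  |  | 94.88% | 87.76%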

Interestingly, however, there appeared to be a systematic and highly repeatable bias in the number of reads with the ten different barcodes between the two pools. Correlation coefficients of r = 0.96 were found between the read numbers for each barcode in pools 1 and 2 for both the raw read number and the reads containing the correct barcodes and adapters. For example, the barcodes MID2 and MID6 consistently gave the lowest read numbers in both pools, while barcodes MID4, MID9 and MID10 delivered more than twice as many reads in both pools as MID2. Because all samples were normalized and from different genotypes, this suggests a bias for specific barcode sequences in their suitability for 454-FLX emulsion PCR (454 Life Sciences). A similar observation was made by Lundin et al. (2010), who also noted high read numbers with MID4, MID9 and MID10 in pooled 454 sequencing libraries. The extremely poor performance of MID3 in that study could not be confirmed in our sequencing pools, however, whereas MID2 was much better represented than in our data. Although our procedure gave satisfactory read numbers for all samples, implementation of more sophisticated normalization procedures might further increase the multiplexing consistency. For example, Sandberg et al. (2011) presented a normalization method for 454 emulsion PCR libraries based on fluorescent tagging and high-speed, flow cytometric sorting. This procedure not only achieved highly consistent read numbers, even from less efficient MID barcodes, but also resulted in a substantial overall increase in sequence quality and mean read length because of removal of empty and mixed emulsion PCR beads.
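The between-pool barcode correlation can be recomputed from the per-barcode read counts in Table 2. The sketch below uses a plain Pearson correlation; that the reported coefficient is Pearson's r is an assumption, as is the choice of the M13/T7 tail column to represent the reads with correct barcodes and adapters.

```python
from math import sqrt

# Per-barcode read counts from Table 2, ordered MID1 to MID10.
POOL1_RAW = [19718, 13218, 21482, 28632, 24361, 17733, 21188, 17835, 29892, 29134]
POOL2_RAW = [20974, 13110, 19112, 28789, 24375, 18830, 23364, 17462, 26486, 29322]
POOL1_TAILED = [17393, 11363, 18640, 25177, 21273, 16060, 18388, 15759, 26128, 25596]
POOL2_TAILED = [18402, 11253, 16672, 25099, 21168, 16993, 20454, 15441, 23255, 25932]

def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson product-moment correlation of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(ss_x * ss_y)

r_raw = pearson_r(POOL1_RAW, POOL2_RAW)
r_tailed = pearson_r(POOL1_TAILED, POOL2_TAILED)
print(f"raw reads:    r = {r_raw:.2f}")
print(f"tailed reads: r = {r_tailed:.2f}")

# Ratio of the best- to the worst-performing barcode within each pool.
print(max(POOL1_RAW) / min(POOL1_RAW), max(POOL2_RAW) / min(POOL2_RAW))
```

Both coefficients should come out close to the reported value, and the max/min ratios show the more than twofold spread between the strongest barcodes and MID2.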

Discrimination of homologous loci and detection of putative PAV and CNV

Pairwise neighbour-joining of consensus sequences from assembled contigs was found to be an effective method to distinguish homologous loci of each target amplicon from multiple genotypes. Figure S1 gives an example showing the grouping of two loci from amplicon BnaLEC2.1.1, in which two clusters clearly discriminate the homologous loci and allow the distinction of between-locus SNPs and InDels from locus-specific polymorphisms. In cases where homologous A and C genome reference sequences from B. rapa, B. oleracea and/or B. napus were available for the gene of interest, the results of the neighbour-joining procedure generally corresponded to stringent alignments to the different reference homologues. This means that the procedure should also be useful for polyploid plants or target genes for which no reference sequences are available. Obviously, the locus calling procedure is simplified if only homozygous plants are used (e.g. doubled haploids or homozygous inbred lines), because it may be impossible to distinguish between duplicated loci and heterozygosity. In the present case, most of the lines were doubled-haploid resynthesized B. napus genotypes, while the remainder were advanced inbred lines where little residual heterozygosity was expected.

Individual genotypes, particularly among the resynthesized B. napus accessions, which are prone to homoeologous non-reciprocal translocations (Udall et al., 2005; Szadkowski et al., 2010), were often represented by more than one sequence for an amplicon that was otherwise present in only a single copy, or lacked the particular locus altogether. For example, the amplicon BnaLEC2 2-2 was generated with putative locus-specific primers, because divergence among the sequence homologues made it impossible to design a multilocus assay. In 11 of the 20 test genotypes, BnaLEC2 2-2 was indeed recovered as a single-locus amplicon, whereas no locus corresponding to this fragment was found in the resynthesized B. napus L122 (Table 3).

Table 3.   Example for putative copy number variation (CNV) and presence–absence variation (PAV) for the locus-specific amplicon BnaLEC2 2-2, which showed very high read quality. PAV and CNV are commonly caused in resynthesized Brassica napus genotypes by homoeologous non-reciprocal translocations that can replace homologous gene loci with copies from homoeologous chromosomes. Low-frequency contigs were found by BLAST analysis to represent off-target amplification products which were eliminated from the further analysis
Genotype | No. of reads | Read no. after quality filter | % high quality reads | No. of contigs | Contigs with freq. >0.04
G39 | 251 | 250 | 99.60 | 4 | 2
G50 | 270 | 270 | 100.00 | 3 | 1
H149 | 375 | 375 | 100.00 | 3 | 2
H365 | 393 | 393 | 100.00 | 4 | 2
H40 | 343 | 343 | 100.00 | 2 | 2
H44 | 194 | 194 | 100.00 | 1 | 1
K29 | 260 | 260 | 100.00 | 2 | 1
K332 | 219 | 219 | 100.00 | 4 | 2
L122 | 1 | 1 |  | 0 | 0
MOY4 | 453 | 453 | 100.00 | 4 | 2
R76 | 349 | 333 | 95.42 | 3 | 1
R99 | 268 | 268 | 100.00 | 3 | 1
RS1/2 | 291 | 291 | 100.00 | 2 | 2
S17 | 370 | 370 | 100.00 | 2 | 1
87-50182 | 340 | 340 | 100.00 | 5 | 1
Linyou 5 | 249 | 249 | 100.00 | 1 | 1
Xiangyou 11 | 268 | 268 | 100.00 | 2 | 1
Altex | 351 | 351 | 100.00 | 5 | 1
Barossa | 475 | 473 | 99.58 | 2 | 1
Shiralee | 465 | 465 | 100.00 | 5 | 2
Mean reads | 301 | 300 | 99.70 |  | 
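The last column of Table 3 counts contigs above a frequency threshold of 0.04. A minimal sketch of such a filter is given below, assuming that contig frequency means the share of a genotype's quality-filtered reads assembled into that contig; this definition and the per-contig read counts are hypothetical, chosen only to mimic a genotype with four contigs of which two pass.

```python
# Frequency filter for assembled contigs (definition of frequency is assumed).
FREQ_THRESHOLD = 0.04  # threshold named in the Table 3 header

def filter_contigs(read_counts: dict[str, int],
                   threshold: float = FREQ_THRESHOLD) -> dict[str, float]:
    """Keep contigs whose share of all assembled reads exceeds the threshold."""
    total = sum(read_counts.values())
    if total == 0:
        return {}
    return {name: n / total
            for name, n in read_counts.items()
            if n / total > threshold}

# Hypothetical read counts per contig for one genotype (250 filtered reads).
example_counts = {"contig1": 130, "contig2": 110, "contig3": 7, "contig4": 3}

kept = filter_contigs(example_counts)
for name, freq in kept.items():
    print(f"{name}: frequency {freq:.3f}")
print(f"{len(example_counts)} contigs assembled, {len(kept)} retained")
```

In the article the discarded low-frequency contigs were additionally checked by BLAST and found to be off-target products; the threshold alone does not establish that.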

In contrast, duplicated loci were discovered in the resynthesized lines G39, H149, H365, K332, RS1/2 and MOY4 along with the Australian canola cultivar Shiralee (Table 3). Amplicon absence can theoretically be caused by polymorphisms in the primer binding site for a particular genotype; however, the relatively non-stringent PCR conditions necessary with tailed primers should normally tolerate minor differences in primer binding sites. Knowing that PAV is expected to be widespread in B. napus, we therefore interpret absence of a given locus (e.g. BnaLEC2 2-2 in L122) as a putative PAV. Examples for putative locus absence were also found in contigs derived from multilocus amplicons. For example, two loci are expected for IKU2 in B. napus, and correspondingly, two contigs corresponding to BnaIKU2 sequences were observed in 18 of the 20 genotypes (Table 4). The observed contigs in the resynthesized lines G50 and H149 represented only one locus for each of the sequenced amplicons, however, suggesting that one locus in these two genotypes has either been deleted or replaced by an exact homologous copy. Similar observations were prevalent in almost all amplicons, suggesting that PAV and CNV are likely to be widespread among B. napus genes. This phenomenon can be expected to play an important role in the expression of both qualitative and quantitative traits, with implications also in terms of potential additive gene effects influencing complex traits such as heterosis.

Table 4.   Example for putative copy number variation (CNV) for the dual-locus amplicon BnaIKU 3. Sequence data from Brassica napus BAC clones predicted two IKU2 loci in B. napus; however, reads for this amplicon from the resynthesized B. napus lines G50 and H149 corresponded to only one of these loci
Genotype | No. of reads | Read no. after quality filter | % high quality reads | No. of contigs | Contigs with freq. >0.04
G39 | 522 | 521 | 99.81 | 4 | 2
G50 | 555 | 555 | 100.00 | 3 | 1
H149 | 547 | 547 | 100.00 | 3 | 1
H365 | 769 | 768 | 99.87 | 4 | 2
H40 | 554 | 552 | 99.64 | 3 | 2
H44 | 364 | 363 | 99.73 | 4 | 2
K29 | 574 | 574 | 100.00 | 4 | 2
K332 | 765 | 463 | 60.52 | 3 | 2
L122 | 796 | 792 | 99.50 | 5 | 2
MOY4 | 748 | 748 | 100.00 | 3 | 2
R76 | 595 | 595 | 100.00 | 3 | 2
R99 | 574 | 573 | 99.83 | 2 | 2
RS1/2 | 715 | 713 | 99.72 | 3 | 2
S17 | 527 | 525 | 99.62 | 4 | 2
87-50182 | 760 | 758 | 99.74 | 3 | 2
Linyou 5 | 530 | 530 | 100.00 | 3 | 2
Xiangyou 11 | 622 | 620 | 99.68 | 3 | 2
Altex | 741 | 740 | 99.87 | 3 | 2
Barossa | 711 | 709 | 99.72 | 3 | 2
Shiralee | 761 | 759 | 99.74 | 2 | 2
Mean reads | 637 | 620 | 97.85 |  | 

SNP detection

A total of 302 SNPs were detected in 52 loci of the 30 amplicons in the nine investigated candidate genes (Table 5). For all detected SNPs, the details of their position in the consensus, the major allele in the reference, the allelic variant, allele frequencies and the sequencing depth at which the minor allele was detected are given in Table S1. Assuming uniqueness of loci, the total length of sequence in which we called these SNPs was 24 665 bp, giving an overall frequency of 1.2 SNPs per 100 bp, or approximately 1 SNP every 82 bases. The highest number of SNPs was detected in BnaPKP-β1, while only a very low frequency of SNPs was found in BnaFIE. All SNPs were simple biallelic SNPs with the exception of three SNPs for which three alternative alleles were detected. Of the 52 putative amplicon loci, only nine contained no SNPs, two of them in the highly conserved amplicons from BnaFIE. This observation might reflect differences in the degree of conservation among homologues of these genes after polyploidization, for example, because of greater selection pressure on the function of specific genes.
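The overall SNP density follows directly from the two totals quoted above (302 SNPs in 24 665 bp of consensus sequence). The short calculation below is only a worked check of that arithmetic; the helper names are illustrative.

```python
# Overall SNP density from the totals reported in the text and in Table 5.
TOTAL_SNPS = 302
TOTAL_CONSENSUS_BP = 24665

def snps_per_100bp(n_snps: int, length_bp: int) -> float:
    """SNP density expressed per 100 bp of consensus sequence."""
    return 100.0 * n_snps / length_bp

def mean_snp_spacing(n_snps: int, length_bp: int) -> float:
    """Average number of bases per SNP."""
    return length_bp / n_snps

density = snps_per_100bp(TOTAL_SNPS, TOTAL_CONSENSUS_BP)
spacing = mean_snp_spacing(TOTAL_SNPS, TOTAL_CONSENSUS_BP)
print(f"{density:.2f} SNPs per 100 bp, one SNP every {spacing:.1f} bp on average")
```

The same two helpers can be applied to any single row of Table 5, for example 27 SNPs in 431 bp for the most polymorphic BnaIKU2 locus.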

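Table 5.   Details of SNPs detected between 20 diverse Brassica napus genotypes in homologues of the investigated candidate genes. The amplicon and locus names correspond to the respective amplicon primers, while multiple homologous loci for a given amplicon are subdivided first numerically and then alphabetically based on their relatedness in the neighbour-joining analysis
Gene | Amplicon | Locus | Consensus length (bp) | No. of SNPs | SNPs per 100 bp
BnaAP2 | 1 | 1 | 456 | 0 | 0.00
BnaAP2 | 2 | 2a | 462 | 6 | 1.30
BnaAP2 | 2 | 2b | 467 | 7 | 1.50
BnaFIE | 1 | 1 | 470 | 2 | 0.43
BnaFIE | 3 | 3 | 534 | 0 | 0.00
BnaFIE | 4 | 4a | 407 | 0 | 0.00
BnaFIE | 4 | 4b | 455 | 5 | 1.10
BnaFUSCA | 2-1 | 2-1 | 438 | 1 | 0.23
BnaFUSCA | 2-2 | 2-2 | 501 | 13 | 2.59
BnaFUSCA | 3 | 3-1a | 490 | 3 | 0.61
BnaFUSCA | 3 | 3-1b | 343 | 5 | 1.46
BnaFUSCA | 3 | 3-2a | 476 | 0 | 0.00
BnaFUSCA | 3 | 3-2b | 494 | 9 | 1.82
BnaIKU2 | 2 | 2 | 431 | 27 | 6.26
BnaIKU2 | 3 | 3-1 | 502 | 0 | 0.00
BnaIKU2 | 3 | 3-2a | 497 | 0 | 0.00
BnaIKU2 | 3 | 3-2b | 495 | 3 | 0.61
BnaIKU2 | 3 | 3-2c | 496 | 0 | 0.00
BnaIKU2 | 4 | 4 | 450 | 8 | 1.78
BnaLEC2 | 1 | 1a | 426 | 9 | 2.11
BnaLEC2 | 1 | 1b | 414 | 9 | 2.17
BnaLEC2 | 2-1 | 2-1a | 460 | 2 | 0.43
BnaLEC2 | 2-1 | 2-1b | 411 | 12 | 2.92
BnaLEC2 | 2-1 | 2-1c | 417 | 4 | 0.96
BnaLEC2 | 2-2 | 2-2 | 492 | 5 | 1.02
BnaLEC2 | 3-1 | 3-1 | 477 | 4 | 0.84
BnaLEC2 | 3-2 | 3-2 | 478 | 4 | 0.84
BnaLEC2 | 4 | 4a | 509 | 1 | 0.20
BnaLEC2 | 4 | 4b | 441 | 4 | 0.91
BnaPKL | 1 | 1a | 494 | 7 | 1.42
BnaPKL | 1 | 1b | 512 | 5 | 0.98
BnaPKL | 2 | 2a | 483 | 0 | 0.00
BnaPKL | 2 | 2b | 473 | 1 | 0.21
BnaPKP-β1 | 2 | 2a | 490 | 8 | 1.63
BnaPKP-β1 | 2 | 2b | 505 | 5 | 0.99
BnaPKP-β1 | 3-1 | 3-1a | 523 | 25 | 4.78
BnaPKP-β1 | 3-1 | 3-1b | 546 | 19 | 3.48
BnaPKP-β1 | 3-2 | 3-2 | 506 | 18 | 3.56
BnaPKP-α | 1 | 1-1 | 423 | 7 | 1.65
BnaPKP-α | 1 | 1-2a | 501 | 6 | 1.20
BnaPKP-α | 1 | 1-2b | 476 | 3 | 0.63
BnaPKP-α | 1 | 1-3a | 495 | 1 | 0.20
BnaPKP-α | 1 | 1-3b | 497 | 6 | 1.21
BnaPKP-α | 2-1 | 2-1a | 499 | 1 | 0.20
BnaPKP-α | 2-1 | 2-1b | 499 | 1 | 0.20
BnaPKP-α | 2-2 | 2-2 | 501 | 4 | 0.80
BnaPKP-α | 2-3 | 2-3 | 469 | 11 | 2.35
BnaPKP-α | 3-1 | 3-1a | 448 | 0 | 0.00
BnaPKP-α | 3-2 | 3-2a | 454 | 2 | 0.44
BnaPKP-α | 3-2 | 3-2b | 461 | 6 | 1.30
BnaPKP-α | 5 | 5 | 512 | 8 | 1.56
BnaSUC5 | 1 | 1 | 509 | 15 | 2.95
Total | 30 | 52 | 24 665 | 302 | 1.22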

On the other hand, almost all of the amplicons were designed to include mainly intron sequences (Figure 2), to maximize polymorphism among homologous loci. Hence, we did not consider it worthwhile to infer functional significance of the called SNPs, because the great majority of the data cover non-coding sequences. A transition to transversion ratio of 0.87 was observed. In coding sequences, a higher selection against transversions is expected (Wakeley, 1996) because they are more likely to induce non-synonymous amino acid exchanges (Zhang, 2000). In our experimental approach, however, the preferential amplification of intronic regions may have countered the normal bias against transversions in coding sequences. The relatively high proportion of transversions we observed might furthermore reflect the presence of multiple copies of the target genes, allowing more tolerance of sequence divergence in homologues during polyploidization. In support of this theory, very different SNP frequencies were sometimes observed between different homologues of the same gene. For example, the BnaFUSCA amplicon 3-2a was strongly conserved (0 SNPs), whereas its homologues represented by amplicons BnaFUSCA 3-1a, 3-1b and 3-2b showed considerable variation (3, 5 and 9 SNPs, respectively; Table 5). Such discrepancies may reflect the preferential conservation of specific homologues during polyploidization.
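A transition to transversion ratio of the kind reported above can be computed by classifying each biallelic SNP by the chemical class of its two alleles. The sketch below uses an invented list of allele pairs purely for demonstration; it does not reproduce the article's SNP set or its ratio of 0.87.

```python
# Transition/transversion classification for biallelic SNPs.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(allele_a: str, allele_b: str) -> bool:
    """A transition swaps purine for purine or pyrimidine for pyrimidine."""
    pair = {allele_a.upper(), allele_b.upper()}
    if len(pair) != 2 or not pair <= (PURINES | PYRIMIDINES):
        raise ValueError(f"not a biallelic SNP: {allele_a}/{allele_b}")
    return pair <= PURINES or pair <= PYRIMIDINES

def ts_tv_counts(snps: list[tuple[str, str]]) -> tuple[int, int]:
    """Return (number of transitions, number of transversions)."""
    transitions = sum(is_transition(a, b) for a, b in snps)
    return transitions, len(snps) - transitions

# Hypothetical allele pairs, for illustration only.
example_snps = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "T"), ("A", "T")]

ts, tv = ts_tv_counts(example_snps)
print(f"transitions = {ts}, transversions = {tv}, Ts/Tv = {ts / tv:.2f}")
```

Because each base has one possible transition partner and two possible transversion partners, a ratio below 1 such as the one reported means transversions outnumbered transitions in the called SNPs.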

Figure 2.

 Schematic representation of the primer design strategy for preferential assignment of primers to regions conserved in multiple homologous gene copies. (a) In cases where different A-genome and C-genome homologues exhibited well-conserved coding sequences and similar intron–exon structure, the primers were selected where possible in conserved sequences near intron–exon boundaries to amplify across polymorphic intron regions (here represented by amplicons 1 and 2 from the A- and C-genome homologues, respectively). Where introns were longer than the maximum possible amplicon length for 454 sequencing (500 bp), either the forward or reverse primer was set in a conserved intronic sequence. (b) In cases with paralogous gene copies showing poorly conserved coding sequences, inflated introns and/or variable intron–exon structure, it was sometimes necessary to design additional locus-specific primers.

SNP validation

For validation of SNPs in two selected genes, we used data that were generated by bidirectional Sanger sequencing of locus-specific amplicons from specific copies of two of the candidate genes. These sequences overlapped partially with the 454 amplicons, making it possible to compare SNP calls obtained using the two methods in the overlapping regions. The validation was performed in data from a set of 38 B. napus accessions, which included the 20 diverse genotypes in which the 454 SNP discovery was performed along with an assortment of less divergent, modern oilseed rape material.

Sanger sequence reads from the validation genotypes, corresponding to BnaFUSCA amplicon 3-2b and BnaIKU2 amplicon 2-2, were aligned to the 454 contig alignments. This allowed us to compare positions of corresponding SNPs in overlapping regions between the two sets of data. SNPs were regarded as validated if the least common allele was found once or more in the Sanger reads from the respective validation sets. The SNP validation results are summarized in Table S1. For BnaIKU2 2-2, the base calls for 20 out of 25 SNPs detected in the exotic lines could also be verified in the overlapping regions from the Sanger sequences for this locus. All five SNPs in the overlapping region of BnaFUSCA 3-2b could be confirmed by the Sanger reads. The overlapping regions between the 454 and Sanger sequence data, and the SNPs that could be validated in these regions, are highlighted in Table S5.

The successful validation of 100% of the detected SNPs in one case and 80% in the other, in the corresponding overlapping regions, underlines the ability of our 454 amplicon sequencing strategy to accurately distinguish individual loci and detect SNPs in diverse B. napus germplasm. Pyrosequencing often results in sequencing errors in regions surrounding homopolymeric tracts (Huse et al., 2007), which can lead to false SNP calls in such regions. In our case, high-depth 454 reads were used to construct consensus contigs prior to alignment and SNP calling, which may help to reduce sequencing errors like those associated with homopolymers.

Relevance of the detected SNPs for oilseed rape breeding

Oil content is a complex trait with considerable genetic variation in B. napus. Exotic materials including Asian genepools and resynthesized B. napus are a promising resource to mine for allelic variants influencing oil content; however, transfer of these variants into elite, adapted winter rapeseed breeding lines is most effective with the help of suitable genetic markers. The most suitable markers for high-throughput selection at multiple gene loci are single-nucleotide polymorphisms from DNA sequences of relevant genes. To address quantitative traits in a complex polyploid crop like oilseed rape, it is necessary to gain knowledge about sequence variation at all homologous loci of genes that are potentially involved in the trait of interest. The exotic B. napus genotypes investigated in this study were selected because they showed interesting phenotypes with potential positive effects on oil content, or because progenies from crosses with them had shown transgressive segregation for oil content. The SNP validation analysis using Sanger sequence data from different sets of germplasm confirmed that these exotic genotypes contain considerable genetic variation that is rare or possibly absent in modern oilseed rape cultivars. By generating segregating F2 populations from crosses of these 20 exotic genotypes to elite winter oilseed rape cultivars, it was possible to localize quantitative trait loci (QTL) for oil content in genetic maps from each cross (W. Ecke, University of Göttingen, unpublished results). The SNPs identified in the present study will now enable the development of assays to map positions of polymorphic loci from the respective candidate genes in these populations and compare them to positions of interesting QTL. Development of gene-specific mapping assays for candidate genes in B. napus and other complex polyploids can otherwise be a tedious and often fruitless task because of the high and variable copy number.

SNP discovery by universal fusion primer-mediated 454 amplicon sequencing

Next-generation sequencing of PCR amplicons has been described previously as a fast and efficient method to detect SNPs in multiple copies of target sequences in polyploid plants (e.g. Bundock et al., 2009; Griffin et al., 2011). In the present study, we confirmed the power of long-read 454 sequencing to deconvolute multiple homologous loci in a complex polyploid crop plant, B. napus. The structure of our 454 sequence data was similar to that generated by Parkin et al. (2010) through 454 sequencing of anchored 3′-ESTs from the amphidiploid B. napus genome. As in that study, stringent assembly parameters allowed us in most cases to unambiguously identify homologous loci. This was facilitated by generation and sequencing of defined amplicons with a length of 300–500 bp in all cases, meaning that haplotypes could generally be derived from a number of SNPs and InDels per amplicon to distinguish homologues. Multiplexing for pooled sequencing on quarter or half-gasket 454 runs normally necessitates a time-consuming and expensive ligation of barcoded adapters for every individual in a sequencing pool. The common alternative, namely the use of fusion primers to include the 454 sequencing adapter oligos at the 5′ ends of the target-specific PCR primers, is only economical for studies investigating very small numbers of candidate genes in large sample numbers because of prohibitive costs of long, HPLC-purified primers.
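The cost argument can be made concrete by counting the oligonucleotides that would need to be synthesized under each strategy. The sketch below is an illustration under simplifying assumptions that are not stated in the text: exactly one primer pair per amplicon, one barcoded fusion primer pair per MID, and barcodes on both the forward and reverse primers. The function names are hypothetical.

```python
def oligos_direct_fusion(n_amplicons, n_mids):
    """Long fusion primers needed when every target-specific primer pair
    must be synthesized once for every MID barcode."""
    return 2 * n_amplicons * n_mids


def oligos_two_step(n_amplicons, n_mids):
    """Oligos needed for a two-step design: one tailed pair per amplicon
    plus one universal barcoded fusion pair per MID."""
    return 2 * n_amplicons + 2 * n_mids


# Scale of the case study: 30 amplicons and ten MID barcodes.
print(oligos_direct_fusion(30, 10))
print(oligos_two_step(30, 10))
```

Under these assumptions the number of long barcoded oligos grows with the product of amplicons and barcodes in the direct design, but only with their sum in the two-step design.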

Conclusions

Discovery of phenotypically relevant sequence variation in candidate genes for important traits is difficult in B. napus and other polyploid crops because locus-specific assays are normally required for conventional DNA sequencing of homoeologous loci. On the other hand, high-throughput next-generation sequencing provides a powerful alternative to simultaneously screen for sequence variants in all homoeologous copies of multiple B. napus candidate genes.

Multiplexed amplicon sequencing represents a powerful, cost-effective and rapid technique for discovery of novel gene variants in genetically diverse breeding materials. The universal MID-barcoded fusion PCR method and primer panel we developed in this study can be applied in any species for any target gene. We are continuing to apply this method for sequence analysis and SNP discovery for other traits and target genes in oilseed rape. The system is also expected to be useful for SNP discovery in homologues of specific target genes in other complex crop genomes such as wheat or sugarcane. In comparison with powerful sequence capture techniques that enable resequencing of large target regions covering whole chromosome segments or hundreds of genes in a single experiment (e.g. Saintenac et al., 2011), our method is highly suited to experiments where different sets of, for example, 10–20 candidate genes are of interest for initial polymorphism discovery in diversity panels of 10–20 genotypes. For experiments on this scale, universal fusion PCR can be a simple, cost-effective and efficient alternative to standard adapter-ligation procedures for generation of 454-FLX sequencing libraries.

Methods and materials

Plant materials

A collection of 20 exotic B. napus genotypes was selected for the study based on interesting phenotypic variation for seed oil content. The material, described in detail in Girke (2002) and Girke et al. (2011), included old, genetically diverse cultivars from Australia, Canada and China along with swede types (B. napus ssp. napobrassica) and a number of resynthesized (RS) B. napus genotypes generated from interspecific crosses between genetically divergent combinations of Brassica rapa L. (A genome, 2n = 20) and Brassica oleracea L. or Brassica montana L. (both C genome, 2n = 18). The 20 genotypes are listed in Table 1 with information about their origin and genome composition. Genomic DNA was extracted from young leaves of each genotype using the method of Doyle and Doyle (1990).

Candidate gene selection and gene-specific primer design

Nine genes that were identified in Arabidopsis thaliana as candidates for regulation of seed size, metabolism and seed storage were selected for this project (Table S2). The full-length genomic sequences for B. napus homologues of six of these genes (BnaPKP-β1, BnaPKP-α, BnaFUSCA3, BnaLEC2, BnaIKU and BnaFIE), identified by Sanger sequencing in BAC clones containing the genes of interest, were kindly provided by Renate Schmidt, IPK Gatersleben (Germany). For the other three genes (BnaSUC5, BnaPKL1 and BnaAP2), where no B. napus reference sequences were available, primers were designed according to publicly available B. napus sequences matching the genes in the NCBI database (http://www.ncbi.nlm.nih.gov/). Primers were designed by aligning all available homologues for each gene to identify regions that are preferentially conserved in all copies (Figure 2). Where possible, nonspecific primers were used to simultaneously target the fragment of interest in all loci of the gene. For the genes BnaPKP-β1, BnaPKP-α, BnaFUSCA3 and BnaLEC2, it was necessary to design more than one set of primers per amplicon because of divergence among the sequences of the known homologous copies in the regions of interest.

Intron–exon structure was elucidated by aligning full-length coding DNA sequences of the target genes, from Arabidopsis or B. rapa, to the available B. napus genomic sequences. The program Spidey (http://www.ncbi.nlm.nih.gov/spidey/) was used to avoid primers on intron–exon boundaries and, wherever possible, to include introns in amplified regions for increased polymorphism. The resulting amplicon design thus aimed to preferentially cover intron-spanning regions but include short, conserved exonic sequences at the amplicon ends. The intention was to amplify all possible homologues of the target genes (conserved coding sequences for primer design) while simultaneously maximizing the potential for distinction of polymorphisms among homologous copies (less-conserved intron sequences in amplicon body). The program Primer 3 (http://primer3.sourceforge.net/) was used for primer design. In total, 30 gene-specific primers were designed to generate between 1 and 7 amplicons per gene from the nine target genes (Table S3).

To generate tailed PCR products that could be amplified in a second step with universal 454 fusion primers, we appended the 18-base universal M13 oligonucleotide 5′-TTTCCCAGTCACGACGTT-3′ to the 5′ end of every forward primer and the 20-base universal T7 oligonucleotide 5′-TAATACGACTCACTATAGGG-3′ to the 5′ end of every reverse primer. For 454-FLX amplicon sequencing, the recommended maximum length of the amplicons is 500 bp, above which the performance of the emulsion PCR and sequence reads degenerates. To maximize the amount of sequence obtained per amplicon from the 454 sequencing run, we designed primers for the first PCR step to amplify targets with a total length (including the M13 and T7 overhangs) of between 400 and 450 bp. HPLC-purified oligonucleotides were synthesized by Eurofins MWG Operon (Ebersberg, Germany).

Universal MID fusion primer design

A set of ten MID-barcoded universal fusion PCR primers were designed to generate 454 sequencing-ready fusion amplicons from any target sequence flanked by the M13 and T7 universal overhangs. The forward fusion primer consists of the 454-FLX Titanium A-adaptor sequence 5′-GCCTCCCTCGCGCCA-3′ followed by the 4-base calibration sequence 5′-TCAG-3′, the respective 10-base MID oligonucleotide and the 18-base M13 oligonucleotide. The reverse fusion primer consists of the 454-FLX Titanium B-adaptor sequence 5′-GCCTTGCCAGCCCGC-3′ followed by the 4-base calibration sequence, the 10-base MID oligonucleotide and the 20-base T7 oligonucleotide. HPLC-purified oligonucleotides were synthesized by Eurofins MWG Operon (Ebersberg, Germany). The full set of MID barcodes and barcoded fusion primer sequences are provided in Tables S4 and S5, respectively. A diagram explaining the primer design for the two-step PCR procedure is presented in Figure 1.
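The primer architecture described above can be sketched in Python by simple string concatenation. The adaptor, key and tail sequences are those given in the text; the MID and the gene-specific primer sequences used in the example are hypothetical placeholders (the actual barcodes are listed in Table S4), and the function names are illustrative only.

```python
# Sequences taken from the text (all written 5' to 3').
M13_TAIL = "TTTCCCAGTCACGACGTT"      # 18-base universal M13 tail
T7_TAIL = "TAATACGACTCACTATAGGG"     # 20-base universal T7 tail
ADAPTOR_A = "GCCTCCCTCGCGCCA"        # 454-FLX Titanium A-adaptor
ADAPTOR_B = "GCCTTGCCAGCCCGC"        # 454-FLX Titanium B-adaptor
KEY = "TCAG"                         # 4-base calibration sequence


def tailed_primers(gene_forward, gene_reverse):
    """First PCR step: prepend the M13/T7 tails to the gene-specific primers."""
    return M13_TAIL + gene_forward, T7_TAIL + gene_reverse


def fusion_primers(mid):
    """Second PCR step: adaptor + key + MID + universal tail."""
    if len(mid) != 10 or set(mid) - set("ACGT"):
        raise ValueError("MID must be a 10-base sequence of A, C, G and T")
    return ADAPTOR_A + KEY + mid + M13_TAIL, ADAPTOR_B + KEY + mid + T7_TAIL


# Hypothetical placeholder sequences, for illustration only.
example_mid = "ACGTACGTAC"
fwd_tailed, rev_tailed = tailed_primers("ATGGCTTCTCAGGTTCTCAA", "TTGACCATCGGAGAAGCTTC")
fwd_fusion, rev_fusion = fusion_primers(example_mid)
print(len(fwd_fusion), len(rev_fusion))
```

Because the fusion primers anneal only to the M13 and T7 tails, the same barcoded set can in principle be reused for any tailed first-step product.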

Generation of 454 sequencing-ready universal amplicons

A two-step PCR procedure was established that can be used to generate 454 sequencing-ready fusion PCR amplicons from any 400 to 500 bp target sequence. For each target-specific tailed primer pair, a gradient PCR was performed using a standard B. napus genomic DNA sample to establish the optimal annealing temperature for amplification of the desired target. For each tailed primer combination, the reaction contained 2 ng/μL of genomic DNA, 200 μM dNTP mix, 0.1 pmol/μL each of forward and reverse primer, 1.5 mM MgCl2 and 0.5 U of HotStar HiFidelity Taq Polymerase (Qiagen, Düsseldorf, Germany) in 1× polymerase reaction buffer. Sterile double-distilled water was added up to a final reaction volume of 20 μL. The mix was amplified with the following PCR conditions: initial denaturing at 96 °C for 15 min, followed by 35 cycles of denaturing at 96 °C for 1 min and annealing at between 55 °C and 65 °C (1 °C gradient) for 1 min. Amplicons were visualized after electrophoresis on 1% agarose gels to check the amplicon size, and the optimal annealing temperature for each primer combination was chosen as the temperature with the optimal yield of correctly sized product with the minimum quantity of residual primers and minimum off-target amplification. An example is shown in Figure 3a.
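As a worked example of the reaction set-up, the following Python sketch converts the stated final concentrations into absolute amounts per 20 μL reaction and lists the annealing temperatures of the gradient. It relies only on unit identities (1 μM equals 1 pmol/μL, 1 mM equals 1 nmol/μL); the variable and function names are hypothetical, and stock concentrations are deliberately not assumed.

```python
REACTION_VOLUME_UL = 20.0


def amount_per_reaction(final_conc_per_ul, volume_ul=REACTION_VOLUME_UL):
    """Absolute amount in one reaction, in the numerator unit of the
    concentration (e.g. ng for ng/uL, pmol for pmol/uL)."""
    return final_conc_per_ul * volume_ul


dna_ng = amount_per_reaction(2.0)        # 2 ng/uL genomic DNA
primer_pmol = amount_per_reaction(0.1)   # 0.1 pmol/uL of each primer
dntp_pmol = amount_per_reaction(200.0)   # 200 uM = 200 pmol/uL
mgcl2_nmol = amount_per_reaction(1.5)    # 1.5 mM = 1.5 nmol/uL

# Annealing temperatures tested in the gradient (55 to 65 degrees C, 1 degree steps).
gradient_temps_c = list(range(55, 66))

print(dna_ng, primer_pmol, dntp_pmol, mgcl2_nmol, len(gradient_temps_c))
```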

Figure 3.

 Example of two-step PCR generation of sequencing-ready PCR amplicons for 454 sequencing from a target sequence of interest using M13/T7 tailed fusion primers. (a) Determination of optimal annealing temperature for target-specific primers of a 360 bp target sequence (including tailed primers). The dashed box shows the temperature giving the optimum amplification of the expected fragment size for this primer pair along with minimum residual primer oligonucleotides (arrow). (b) Target-specific amplification of a 480 bp amplicon in 20 genetically diverse Brassica napus accessions using the M13/T7 tailed primers at the optimized annealing temperature. (c) Re-amplification of two tailed amplicons in the 20 B. napus accessions with universal 454 fusion primers for pre-amplified M13/T7 templates. The dashed box here shows a typical example of a PCR failure, in genotype Barossa, in which a shorter band corresponding to unincorporated tailed PCR primers is seen instead of the expected product size. All samples which failed to re-amplify in the second PCR were repeated until successful.

The 20 B. napus genotypes were divided into two groups, and each genotype in each group was assigned one of the ten MID barcodes. Amplification of each target-specific amplicon was performed in all genotypes using the target-specific primers with the M13 and T7 overhangs (Table S3). A 1-μL aliquot from each sample was run on an agarose gel to confirm clean amplification of the desired fragment size (e.g. Figure 3b).
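The barcode assignment can be sketched as follows. This is only an illustration of the two-order multiplexing logic (pool first, MID second); the genotype and MID labels are hypothetical placeholders, and the consecutive ordering of genotypes into pools is an assumption rather than the grouping actually used.

```python
def assign_mids(genotypes, mids):
    """Map each genotype to a (pool number, MID) pair so that every MID is
    used at most once per pool."""
    pool_size = len(mids)
    assignment = {}
    for index, genotype in enumerate(genotypes):
        pool = index // pool_size + 1     # pools are numbered from 1
        mid = mids[index % pool_size]     # MIDs are reused across pools
        assignment[genotype] = (pool, mid)
    return assignment


genotypes = [f"genotype_{i:02d}" for i in range(1, 21)]   # placeholder names
mids = [f"MID{i}" for i in range(1, 11)]                  # placeholder labels
assignment = assign_mids(genotypes, mids)
print(assignment["genotype_01"], assignment["genotype_11"])
```

A read can then be traced back to its genotype from the pool it was sequenced in together with its MID.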

Generation and sequencing of pooled 454 libraries

QIAquick 96 PCR purification kits (Qiagen) were used for a 96-well plate purification of the re-amplified target sequence to remove primers, nucleotides, enzymes, salts and other impurities. Purification was performed according to the instructions of the kit manufacturer, except that a centrifuge rather than a vacuum was used for target recovery. Each step requiring vacuum recovery in the manufacturer’s protocol was replaced by 2 min centrifugation in 96-well plates at 2683 g at room temperature. After purification, all samples were stored at −20 °C until transportation on dry ice to the next-generation DNA sequencing service provider SEQ-IT (Kaiserslautern) for quantification, normalization, pooling and sequencing of the amplicon libraries.

Amplicons were quantified with Quant-iT™ PicoGreen (Invitrogen, Carlsbad, CA) and all samples were normalized according to the quantity and size of the amplicon. Two amplicon pools were generated with normalized equimolar concentrations of all amplicons from ten different MID-barcoded genotypes in each pool. After emulsion PCR using standard conditions, each amplicon pool was sequenced on a quarter-gasket region of a Roche 454-FLX Titanium machine (454 Life Sciences).
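Normalization by quantity and size amounts to converting a mass concentration into a molar concentration before pooling. The sketch below shows one common way to do this and is not necessarily the procedure used by the service provider: it assumes the conventional average mass of 660 g/mol per base pair of double-stranded DNA, and the function names and the target amount per amplicon are hypothetical.

```python
AVG_BP_MASS_G_PER_MOL = 660.0  # conventional approximation, not from the source


def ng_per_ul_to_nM(conc_ng_per_ul, amplicon_length_bp):
    """Convert a dsDNA mass concentration (ng/uL) to a molar concentration (nM)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS_G_PER_MOL * amplicon_length_bp)


def pooling_volumes_ul(samples, fmol_per_amplicon):
    """Volume of each sample that contributes the same molar amount to the pool.

    samples maps a sample name to (concentration in ng/uL, amplicon length in bp).
    Note that 1 nM corresponds to 1 fmol/uL.
    """
    return {
        name: fmol_per_amplicon / ng_per_ul_to_nM(conc, length)
        for name, (conc, length) in samples.items()
    }


# Hypothetical measurements for two amplicons of different length.
samples = {"amplicon_a": (6.6, 500), "amplicon_b": (6.6, 400)}
print(pooling_volumes_ul(samples, fmol_per_amplicon=100.0))
```

At equal mass concentration the shorter amplicon contains more molecules, so a smaller volume of it is pooled.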

Data processing

Raw sequence reads from the 454-FLX sequencing run were deconvoluted into MID-specific .sff files using the command-line tool ‘sfffile’ in the Roche-454 off-instrument software package before trimming, sorting and quality filtering in the Cygwin environment (http://www.cygwin.com/). First, the MID barcode oligos, TCAG sequencing key, M13 and T7 tails were trimmed from the ends of the reads, leaving the amplified regions of the target sequences flanked by the gene-specific primers. Subsequently, the sequences were sorted by the gene-specific primers. The result was 30 groups of sequences from each genotype (for each MID in each pool), corresponding to the 30 amplicons generated per genotype. Low-quality sequence was removed from the 3′ ends of the amplicons, with a cut-off threshold of Q30 (see Brockman et al., 2008), and all reads with a length of <100 bp after trimming were deleted.
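The sorting and filtering steps can be sketched in Python as follows. This is an illustration and not the original shell-based workflow: it assumes reads from which the key, MID and universal tails have already been removed, exact matching of the gene-specific forward primer at the start of the read, and a simple interpretation of the Q30 cut-off in which trailing bases are removed until a base with quality of at least 30 is reached. All names and the toy data are hypothetical.

```python
def trim_low_quality_3prime(seq, quals, min_q=30):
    """Remove trailing bases whose quality score is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def sort_by_primer(reads, forward_primers):
    """Group reads by the gene-specific forward primer they start with.

    reads is a list of (sequence, qualities); forward_primers maps an
    amplicon name to its primer sequence. Unmatched reads are dropped.
    """
    groups = {name: [] for name in forward_primers}
    for seq, quals in reads:
        for name, primer in forward_primers.items():
            if seq.startswith(primer):
                groups[name].append((seq, quals))
                break
    return groups


def filter_reads(reads, min_q=30, min_length=100):
    """Quality-trim each read and keep those with at least min_length bases."""
    kept = []
    for seq, quals in reads:
        trimmed_seq, trimmed_quals = trim_low_quality_3prime(seq, quals, min_q)
        if len(trimmed_seq) >= min_length:
            kept.append((trimmed_seq, trimmed_quals))
    return kept


# Toy data: one read of 120 bases whose last 10 bases are of low quality.
toy_read = ("ACGT" * 30, [40] * 110 + [10] * 10)
groups = sort_by_primer([toy_read], {"amplicon_1": "ACGTACGT", "amplicon_2": "TTTT"})
print(len(filter_reads(groups["amplicon_1"])[0][0]))
```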

High-quality sequences from each genotype/amplicon combination were assembled into contigs using the sequence assembly software cap3 (http://seq.cs.iastate.edu/cap3.html) with the default settings. During the assembly, contigs comprising <4% of the total reads for the amplicon were removed, because these are likely to have derived from PCR errors or sequencing anomalies. BLAST analysis of some of these contigs confirmed that they were off-target products of no further interest for the analysis.
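The read-fraction filter can be expressed in a few lines of Python. In this sketch the total number of reads for an amplicon is taken to be the sum of reads over all of its contigs, which is an assumption (reads left unassembled are ignored); the function name and the example counts are hypothetical.

```python
def filter_contigs(contig_read_counts, min_fraction=0.04):
    """Keep contigs that contain at least min_fraction of the reads
    assembled for one genotype/amplicon combination."""
    total = sum(contig_read_counts.values())
    if total == 0:
        return {}
    return {
        name: n_reads
        for name, n_reads in contig_read_counts.items()
        if n_reads / total >= min_fraction
    }


# Hypothetical read counts for the contigs of one amplicon in one genotype.
counts = {"contig1": 600, "contig2": 380, "contig3": 20}
print(filter_contigs(counts))
```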

Further processing of the assembled contigs was performed with the software CLC Genomics Workbench 4.8 (CLCbio, Aarhus, Denmark). To deal with unknown numbers of homologous loci in the complex allopolyploid B. napus genome, we employed an identity-based grouping strategy to distinguish homologues. First, a neighbour-joining tree was generated using a 1000-bootstrap permutation to assign the cap3 consensus sequences from each contig into putative locus-specific groups, using a stringent threshold group distance of 0.005. Based on the result of this grouping, all consensus sequences derived from a minimum of 80 reads were aligned to reference sequences for all of the candidate genes, derived from either B. napus BACs, the B. rapa v1.0 genome (The Brassica rapa Genome Sequencing Project Consortium, 2011), B. oleracea chromosome assemblies (http://brassicadb.org/brad/) or, if none of the above were available, the mega-consensus sequences derived from each individual consensus group. Alignment was performed with CLC Genomics Workbench using a similarity threshold of 0.8, length fraction of 0.5, mismatch cost of 2, insertion cost of and deletion cost of 3. The alignment files were then used to detect SNPs among the query genotypes. A SNP was called only when it occurred between genotypes in the sequencing panel within one of the putative locus-specific consensus groups.
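The SNP-calling criterion, a difference between genotypes within a single locus-specific group, can be illustrated with a short Python sketch. It is not a re-implementation of the CLC Genomics Workbench analysis: it assumes that the consensus sequences of one locus group have already been aligned to equal length with '-' as the gap character, and it ignores gaps and ambiguous 'N' calls. The function name and the toy alignment are hypothetical.

```python
def call_snps(aligned_consensus):
    """Return (1-based alignment position, sorted alleles) for every column
    at which the genotypes of one locus group carry different bases.

    aligned_consensus maps a genotype name to its aligned consensus sequence.
    """
    sequences = list(aligned_consensus.values())
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all aligned sequences must have the same length")
    snps = []
    for i in range(length):
        alleles = {seq[i].upper() for seq in sequences} - {"-", "N"}
        if len(alleles) > 1:
            snps.append((i + 1, sorted(alleles)))
    return snps


# Toy alignment of three genotypes within one putative locus group.
group = {"genotype_a": "ACGTA", "genotype_b": "ACCTA", "genotype_c": "AC-TA"}
print(call_snps(group))
```

Differences between sequences from different locus groups are deliberately not reported by this approach, since they would reflect divergence between homologues rather than allelic variation.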

A preliminary validation of selected detected SNPs was performed by alignment and comparison with previously generated data from Sanger sequencing of selected locus-specific amplicons for some of the candidate genes. Sanger sequences for locus-specific amplicons of BnaIKU2 and BnaFUSCA3 from a panel of 38 B. napus genotypes, including the 20 lines used for the SNP discovery, were used to perform the validation. Sequence data for the validation were kindly provided by Daniela Zeltner and Wolfgang Ecke, University of Göttingen, Germany, and Renate Schmidt, IPK Gatersleben, Germany. The Sanger reads were aligned to the cap3 consensus sequences from each 454 amplicon sequence contig to identify overlapping regions within the corresponding amplicon loci. The SNP calling in the Sanger reads followed the procedure and parameters used for calling SNPs in the 454 contigs. Validation was performed by comparing the SNP position, nucleotide change and allelic variants.

Acknowledgements

This work was performed within the public–private research consortium GABI-OIL, with funding from the German Federal Ministry of Education and Research (BMBF) and support from the breeding companies Deutsche Saatveredelung AG, KWS Saat AG, Lantmännen SW Seed, NPZ-Lembke, Raps GbR and Syngenta Seeds. We thank Renate Schmidt (IPK Gatersleben), Wolfgang Ecke and Daniela Zeltner (University of Göttingen) for providing additional Sanger sequence data and PCR primers for a number of the analysed genes. RS acknowledges additional support from DFG grant SN14/12-1.
