Large-scale detection of rare variants via pooled multiplexed next-generation sequencing: towards next-generation Ecotilling

Authors

Errata

This article is corrected by:

  1. Errata: Correction Volume 69, Issue 3, 564, Article first published online: 25 January 2012

(fax +39 (0)432 603887; e-mail marroni@appliedgenomics.org).

Summary

Common variants, such as those identified by genome-wide association scans, explain only a small proportion of trait variation. Growing evidence suggests that rare functional variants, which are usually missed by genome-wide association scans, play an important role in determining the phenotype. We used pooled multiplexed next-generation sequencing and a customized analysis workflow to detect mutations in five candidate genes for lignin biosynthesis in 768 pooled Populus nigra accessions. We identified a total of 36 non-synonymous single nucleotide polymorphisms, one of which causes a premature stop codon. The most common variant was estimated to be present in 672 of the 1536 tested chromosomes, while the rarest was estimated to occur only once in 1536 chromosomes. Comparison with individual Sanger sequencing in a selected sub-sample confirmed that variants are identified with high sensitivity and specificity, and that the variant frequency was estimated accurately. This proposed method for identification of rare polymorphisms allows accurate detection of variation in many individuals, and is cost-effective compared to individual sequencing.

Introduction

Next-generation sequencing (NGS), which is also referred to as deep sequencing and massively parallel sequencing, allows researchers to obtain a large amount of genetic data in the form of short sequences at an unprecedented rate. This enables re-sequencing of individual genomes, or targeted re-sequencing of a number of selected regions in a large number of pooled individuals (Ingman and Gyllensten, 2009). NGS can be used to screen large populations for rare functional variants in target genes, and as a follow-up to genome-wide association studies to extensively sequence regions surrounding associated single nucleotide polymorphisms (SNPs) (Druley et al., 2009). Previous studies have indicated that NGS can reliably detect variants in pooled samples (Druley et al., 2009; Out et al., 2009). We designed an experiment that took advantage of pooled multiplexed NGS to screen 768 Populus nigra accessions for mutations in five genes predicted to be involved in lignin biosynthesis. Genes belonging to at least ten gene families are currently known to be involved in lignin biosynthesis in poplar, and most of them are functionally redundant (Hamberger et al., 2007). We selected five candidates, prioritizing genes for which (a) a known effect on lignin biosynthesis has been documented either in poplar or in other species through the respective orthologs (Vanholme et al., 2008), (b) functional redundancy is less pronounced (Shi et al., 2010), and (c) expression is higher in differentiating xylem compared to other tissues (Shi et al., 2010). The selected candidate genes were CAD4, HCT1, C3H3, CCR7 and 4CL3 (see Table S1 for a list of aliases and full names).

Lignin protects cell-wall polysaccharides from microbial degradation, and is one of the most important limiting factors in conversion of plant biomass to pulp or biofuels (Vanholme et al., 2010). Altering lignin structure or reducing lignin content can improve biofuel production (Jung and Ni, 1998; Li et al., 2008). A promising approach is to genetically engineer the lignin biosynthetic pathway in trees used for the production of biofuels (Vanholme et al., 2008). An alternative approach is identification of naturally occurring variants affecting lignin composition, e.g. via association studies. However, association mapping may not always be the best choice, as genome-wide association studies have so far succeeded in explaining only a small proportion of trait variation (Cirulli and Goldstein, 2010). A possible explanation is that a substantial proportion of variation is due to rare mutations that are not easily identified by association studies (Eichler et al., 2010). If this is the case, use of a positional cloning approach that is similar to QTL mapping but focused on extreme phenotypes may be a valid choice (Cirulli and Goldstein, 2010). QTL mapping and association studies require phenotyping of a relatively large population, and cannot be easily applied to the study of phenotypes that are difficult to measure. Reliable measurements of lignin composition are difficult and destructive, may be expensive and are time consuming (Hatfield and Fukushima, 2005).

An alternative approach for identification of variants affecting the lignin phenotype is screening a large number of samples in search of functional polymorphisms (Comai et al., 2004; Comai and Henikoff, 2006). We propose a method based on NGS to screen large populations to identify both common and rare functional variants in candidate genes. In the present study, we identified a total of 36 non-synonymous variants in candidate genes. One of the variants caused a premature stop codon and is being actively investigated. Using this approach, we were able to identify both common and rare variants, including a variant present in only one of the 1536 tested chromosomes.

Results

Re-sequencing

We performed a pooled multiplexed NGS experiment in 768 P. nigra accessions divided into 12 pools of 64 accessions each. The experiment consisted of two phases, as described below.

Phase 1 (CAD4).  The aim of phase 1 was to establish a pooled multiplexed sequencing procedure for SNP detection in CAD4 in 768 accessions. In addition, two pools (n = 128 accessions) were sequenced individually by the Sanger method to enable a sensitivity analysis and to identify the optimal SNP calling method. This subset is referred to as the ‘training set’.

In phase 1 of the experiment, approximately 0.9 Gb of sequence data were generated. Dividing this amount by the number of accessions (n = 768) and the length of the CAD4 consensus sequence (2240 bp), we obtained an experiment-wide mean individual coverage (MIC) of 486-fold. The complete consensus sequence of P. nigra CAD4 was obtained in our laboratory previously (Marroni et al., 2011). The consensus was used as a reference against which short reads were aligned. After removing (a) reads that could not be aligned to the reference (8%), (b) reads carrying errors in the index sequence (4%), and (c) over-represented reads at amplicon ends (16%), an experiment-wide MIC of 350-fold was obtained (Figure 1). The MIC varied across pools, ranging from 120- to 640-fold (Figure S1). However, coverage at each single position along the gene, measured as the mean individual per base coverage (MIBC, see Experimental Procedures for details), was strongly conserved between pools, suggesting that coverage is sequence-specific.
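The coverage arithmetic in this paragraph can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the three filters are assumed to remove non-overlapping fractions of the reads. Note that the rounded yield of 0.9 Gb gives a value somewhat above the reported 486-fold, presumably because the yield is only approximate.

```python
def mean_individual_coverage(total_bases, n_accessions, reference_length):
    """Sequenced bases per accession per reference base (fold coverage)."""
    return total_bases / (n_accessions * reference_length)


def coverage_after_filtering(mic, removed_fractions):
    """Scale coverage by the fraction of reads that survive all filters.

    Assumes the removed fractions do not overlap (hypothetical simplification).
    """
    return mic * (1.0 - sum(removed_fractions))


# Phase 1: roughly 0.9 Gb over 768 accessions and a 2240-bp reference.
raw_mic = mean_individual_coverage(0.9e9, 768, 2240)

# Reported raw MIC (486-fold) minus unaligned (8%), index-error (4%)
# and over-represented (16%) reads.
filtered_mic = coverage_after_filtering(486, [0.08, 0.04, 0.16])

print(round(raw_mic), round(filtered_mic))
```

The second value reproduces the 350-fold MIC reported after filtering.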

Figure 1.

 MIBC along the CAD4 sequence.
The positions of SNPs (vertical bars) and indels (black circles) are indicated relative to the coding sequence (CDS, gray boxes) and introns (horizontal black lines).

Phase 2 (HCT1, C3H3, CCR7, 4CL3).  In phase 2, PCR products of the coding regions of HCT1, C3H3, CCR7 and 4CL3 were obtained. PCR products of the four genes were pooled (see Experimental Procedures for details) and sequenced in 768 accessions. A subset of the accessions and amplicons was also sequenced individually using the Sanger technique, to estimate the general performance of the method developed in phase 1. This subset is referred to as the ‘test set’.

In phase 2, approximately 1.65 Gb of sequence data were generated, corresponding to an experiment-wide MIC of 181-fold (n = 768; total length of consensus sequences 11 500 bp). After removing (a) reads that could not be aligned to the reference (23%), (b) reads carrying errors in the index sequence (1%), and (c) over-represented reads at amplicon ends (5%), an experiment-wide MIC of 128-fold was obtained. We used Sanger sequencing on eight accessions to obtain a P. nigra consensus for the coding sequence of the genes; the length of each individual coding sequence is shown in Table 1. The consensus was used as a reference against which short reads were aligned. After removing reads that could not be aligned to the reference, reads that could not be assigned to an index and over-represented reads, the MIC ranged from 108-fold in CCR7 to 281-fold in HCT1. MIBC values along each gene are shown in Figure S2.

Table 1.   Summary statistics for SNPs identified in coding sequences of CAD4, HCT1, C3H3, CCR7 and 4CL3

Gene   Coding sequence (bp) (a)   SNPs (n)   Mis-sense (n)   Stop (n)
CAD4   1074                       19         8               0
HCT1   948                        13         5               1
C3H3   1527                       12         6               0
CCR7   1017                       15         9               0
4CL3   1623                       25         8               0

(a) Length of coding sequence (in base pairs).

Variant detection

Phase 1 (CAD4).  To identify the optimal minor allele frequency (MAF) threshold for SNP calling, we selected two pools (pool 5 and pool 9) as a training set. The training set was analyzed using both individual Sanger sequencing and pooled multiplexed NGS. Using individual Sanger sequencing, we identified a total of 42 SNPs: 19 in pool 5 and 23 in pool 9 (Table 2). Using pooled multiplexed NGS, we identified a total of 43 SNPs: 19 in pool 5 and 24 in pool 9 (Table 2). If individual Sanger genotyping is considered as the gold standard, 42 of the 43 SNPs identified by pooled multiplexed NGS were true positives and one was a false positive. No false negatives were identified. Receiver operating characteristic (ROC) analysis gave the best results (sensitivity 100%, specificity 99.9%) when we defined an SNP as any position at which more than 0.41% of the reads showed a base different from the consensus. Overall, pooled NGS was in agreement with individual Sanger sequencing, and the area under the ROC curve was 99.9%. The small difference between the two techniques may be due to the joint effects of pooling and use of different sequencing technologies. The correlation of allele frequencies determined by individual Sanger sequencing and pooled NGS was high (Table 2 and Figure 2, Pearson’s correlation coefficient = 0.96). Thus, the proportion of reads carrying an allele in a pooled sample is strongly correlated with the frequency of that allele in the source population. As an additional measure to assess the ability of Illumina pooled multiplexed analysis to identify SNPs and indels, we included in each pool three controls for which the complete CAD4 sequence was known from a previous study (Marroni et al., 2011). In total, these 36 controls contributed 109 SNPs and 17 indels to the 12 pools. We were able to identify 105 of the SNPs and 15 of the indels, corresponding to a sensitivity of 96% for SNP calling and 88% for indel identification. Polymorphisms present in the control accessions, although rare, may also be present in other accessions of each pool, which might lead to a slight overestimation of sensitivity.

Using the identified MAF threshold, we performed SNP detection on the whole sample and identified 48 SNPs (one every 47 bp), 19 of which were in the coding sequence (one every 56 bp); eight SNPs were non-synonymous (Table 3). We identified five deletions and one insertion, all in non-coding regions (Figure 1 and Table S2). All the indels had already been identified in a previous study (Marroni et al., 2011). No indel was detected in the coding sequence of CAD4.
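The calling rule described above (a position is an SNP when more than 0.41% of its reads differ from the consensus) can be sketched as below. The pileup layout, function name and toy read counts are assumptions made for illustration; the additional strand and read-segment checks of the full workflow are not included here.

```python
from collections import Counter

MAF_THRESHOLD = 0.0041  # 0.41% of reads, the cut-off chosen by ROC analysis


def call_snps(reference, pileup, threshold=MAF_THRESHOLD):
    """Return {position: (alt_base, non_reference_fraction)}.

    reference: dict position -> consensus base
    pileup:    dict position -> Counter of observed bases (hypothetical layout)
    """
    calls = {}
    for pos, counts in pileup.items():
        depth = sum(counts.values())
        if depth == 0:
            continue  # no coverage, nothing to call
        non_ref = depth - counts[reference[pos]]
        if non_ref / depth > threshold:
            # Report the most frequent non-reference base as the rare allele.
            alt = max((b for b in counts if b != reference[pos]), key=counts.get)
            calls[pos] = (alt, non_ref / depth)
    return calls


# Toy read counts, invented for this example.
reference = {948: "T", 1016: "T", 1500: "A"}
pileup = {
    948: Counter({"T": 9513, "A": 487}),   # 4.87% non-reference: called
    1016: Counter({"T": 9980, "G": 20}),   # 0.20% non-reference: below threshold
    1500: Counter({"A": 10000}),           # monomorphic
}
calls = call_snps(reference, pileup)
print(calls)
```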

Table 2.   Minor allele frequency (MAF) of CAD4 SNPs identified in pool 5 and pool 9 by pooled multiplexed NGS and individual Sanger sequencing

                        Pool 5                                  Pool 9
Position (a)   SNP   Pooled NGS (b)   Individual Sanger (c)   Pooled NGS (b)   Individual Sanger (c)
948            T/A   0.0487           0.0735                  0.0855           0.0758
969            C/G   0.1584           0.1471                  0.1764           0.1667
1012           C/T   0.0090           0.0147                  0.0167           0.0379
1016           T/G   ND               ND                      0.0079           0.0076
1075           G/A   0.0122           0.0221                  0.0478           0.0530
1098           C/T   0.2052           0.2206                  0.1856           0.1970
1100           T/C   0.0082           0.0147                  0.0098           0.0379
1159           A/G   0.1687           0.1439                  0.1809           0.1719
1165           T/G   0.0100           0.0152                  0.0054           0.0156
1168           T/A   0.0049           0.0076                  ND               ND
1219           G/T   0.0048           0.0077                  0.0370           0.0156
1260           A/G   ND               ND                      0.0132           ND
1305           A/G   0.0108           0.0076                  0.0193           0.0259
1329           A/G   ND               ND                      0.0094           0.0172
1402           T/C   0.0111           0.0149                  0.0066           0.0161
1437           A/G   0.0183           0.0147                  0.0065           0.0077
1515           A/G   0.2141           0.2279                  0.1457           0.1567
1518           G/A   0.0090           0.0075                  0.0227           0.0373
1753           G/A   0.0046           0.0074                  0.0359           0.0373
1758           A/G   0.0055           0.0074                  0.0049           0.0075
1827           T/C   0.0338           0.0368                  0.0183           0.0154
1886           T/A   0.2742           0.2426                  0.1894           0.1875
2013           C/T   ND               ND                      0.0154           0.0159
2101           C/T   ND               ND                      0.0079           0.0079
2140           A/G   ND               ND                      0.0105           0.0079

(a) Position of the SNP on the reference sequence.
(b) Minor allele frequency (MAF) using pooled multiplexed NGS.
(c) Minor allele frequency (MAF) using individual Sanger sequencing.
ND, no SNP detected.

Figure 2.

 Minor allele frequency of SNPs identified in the training set by pooled multiplexed NGS and individual Sanger sequencing.
Error bars were calculated assuming a binomial distribution, with successes corresponding to the number of reads carrying the minor allele and attempts corresponding to the total number of reads.
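As an illustration of binomial error bars of this kind, the sketch below computes the standard error of a read-based allele frequency. The caption does not state which interval was plotted, so the use of one standard error is an assumption, and the read counts are invented for the example.

```python
import math


def binomial_standard_error(minor_reads, total_reads):
    """Estimated allele frequency and its binomial standard error.

    Successes are reads carrying the minor allele; attempts are all reads.
    """
    p = minor_reads / total_reads
    return p, math.sqrt(p * (1.0 - p) / total_reads)


# Hypothetical position covered by 10 000 reads, 487 with the minor allele.
frequency, standard_error = binomial_standard_error(487, 10000)
print(f"{frequency:.4f} +/- {standard_error:.4f}")
```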

Table 3.   Non-synonymous SNPs of CAD4, HCT1, C3H3, CCR7 and 4CL3

Gene   Position (a)   SNP (b)   Number of pools (c)   MIBC (d)   Frequency (e)   Carriers (f)   Amino acid change
CAD4   11             T/A       2                     283        0.0011          2              L4H
CAD4   109            G/C       12                    326        0.0114          18             V37L
CAD4   310            C/A       12                    483        0.2015          310            N104H
CAD4   374            A/C       4                     592        0.0029          4              Y125S
CAD4   597            A/G       12                    657        0.1896          291            I199M
CAD4   835            G/A       8                     233        0.0129          20             A279T
CAD4   925            A/G       1                     159        0.0003          1              M309V
CAD4   956            C/T       4                     286        0.0046          7              A319V
HCT1   278            C/T       11                    280        0.0239          37             T93I
HCT1   508            C/T       1                     822        0.0012          2              R170C
HCT1   517            C/T       2                     493        0.0011          2              L173F
HCT1   632            T/C       1                     501        0.0004          1              V211A
HCT1   710            G/C       6                     263        0.0057          9              G237A
HCT1   729            C/A       12                    227        0.0298          46             C243*
C3H3   598            C/G       12                    54         0.4376          672            E200Q
C3H3   922            G/A       2                     52         0.0009          1              I308V
C3H3   1033           T/C       8                     75         0.0052          8              P345S
C3H3   1053           A/C       2                     115        0.0014          2              Q351H
C3H3   1198           C/T       8                     128        0.0226          35             P400S
CCR7   70             C/A       8                     132        0.1967          302            L24I
CCR7   112            A/G       12                    135        0.3034          466            T38A
CCR7   350            C/G       5                     57         0.0069          11             A117G
CCR7   506            C/T       1                     30         0.0303          47             A169V
CCR7   758            C/T       1                     27         0.0359          55             S253F
CCR7   887            A/G       2                     210        0.0007          1              K296R
CCR7   947            A/G       11                    289        0.0191          29             K316R
CCR7   1003           C/G       8                     170        0.2979          458            V335L
CCR7   1014           A/T       4                     121        0.0059          9              Q338H
4CL3   7              G/A       11                    127        0.0245          38             A3T
4CL3   61             T/G       4                     231        0.0039          6              Y21D
4CL3   95             T/C       1                     153        0.0004          1              V32A
4CL3   709            T/C       2                     149        0.0008          1              F237L
4CL3   897            C/G       12                    255        0.0887          136            D299E
4CL3   913            A/G       1                     181        0.0008          1              M305V
4CL3   1477           A/T       1                     109        0.0004          1              T493S
4CL3   1512           G/T       1                     101        0.0012          2              Q504H

(a) Relative to the coding sequence.
(b) Common allele/rare allele.
(c) Number of pools in which the SNP was identified.
(d) Mean individual per base coverage of the SNP.
(e) Minor allele frequency in the whole sample.
(f) Predicted number of chromosomes carrying the SNP.

Phase 2 (HCT1, C3H3, CCR7, 4CL3).  Using the SNP detection workflow established in phase 1, we analyzed the four additional genes included in phase 2 to identify variants affecting the amino acid composition of the corresponding gene products. We identified a total of 65 SNPs, 28 of which were non-synonymous (Tables 1 and 3). Five mis-sense SNPs were identified in HCT1, nine in CCR7, five in C3H3 and eight in 4CL3. In addition, an SNP causing a premature stop codon was identified in HCT1. No indel was detected in the coding sequence of any of the four genes.

To estimate the general performance of the method developed in phase 1, a subset of accessions and amplicons included in phase 2 (the ‘test set’) was selected for individual Sanger sequencing. The test set included all 12 pools for amplicon HCT1, pool 2 for amplicons CCR7a and CCR7c, and pool 12 for amplicon 4CL3b. In total, 94 SNPs were identified, 83 of which were identified by both approaches, six by NGS only and five by Sanger sequencing only, with a sensitivity of 93% (Table S3). The correlation of the MAF between pooled multiplexed NGS and individual Sanger sequencing was 0.98, and the correlation between the logarithms of the allele frequencies (sensitive to differences in small MAFs) was 0.96. Figure 3 shows the logarithms of MAFs for SNPs identified by individual Sanger sequencing as a function of the logarithm of MAFs for SNPs identified by pooled multiplexed NGS. Of the 83 SNPs identified by both approaches, 34 had a frequency lower than 5% and ten had a frequency lower than 1%. Each rare SNP is carried by one or a few accessions, and globally they appear in a small proportion of accessions (Figure S3).
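The two correlations reported here (on raw MAFs and on their logarithms) can be computed as in the sketch below. The function names are hypothetical, and the input pairs are six pool 5 values taken from Table 2 purely as illustration data; they belong to the training set, so the output is not expected to reproduce the test-set coefficients quoted above.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def log_pearson(x, y):
    """Correlation of log10 frequencies, which gives more weight to rare alleles."""
    return pearson([math.log10(v) for v in x], [math.log10(v) for v in y])


# Illustration data: pooled NGS vs individual Sanger MAFs (pool 5, Table 2).
ngs = [0.0487, 0.1584, 0.0090, 0.2052, 0.0049, 0.2742]
sanger = [0.0735, 0.1471, 0.0147, 0.2206, 0.0076, 0.2426]

r_raw = pearson(ngs, sanger)
r_log = log_pearson(ngs, sanger)
print(round(r_raw, 3), round(r_log, 3))
```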

Figure 3.

 Minor allele frequency of SNPs identified in the test set by pooled multiplexed NGS and individual Sanger sequencing.

Among the large proportion of variants with a frequency lower than 5%, we identified a variant causing a premature stop codon (C243*) in HCT1 (Table 3); the resulting gene product is substantially shorter than the expected 315 amino acids. We performed individual Sanger sequencing in all 768 accessions and identified 42 carriers of the C243* variant. One carrier was homozygous for the mutated allele.

Removing sequencing and alignment errors

False-positive SNPs can arise at the 5′ and 3′ ends of Illumina reads due to the higher error rate of the sequencing process or the presence of small insertions/deletions near the read ends. To adjust for this, we (a) analyzed the forward and reverse strands separately and required that each SNP was identified in both strands, and (b) partitioned each read into three segments of equal length, and required that each SNP was identified in each of the segments of the read (see Experimental Procedures for details). Analysis performed on the training set (phase 1) without these additional quality controls identified a total of 50 SNPs, eight of which were false positives using individual Sanger sequencing as a gold standard. The positive predictive value was therefore 84%. Using the above-mentioned quality controls, we identified 43 SNPs, one of which was false (positive predictive value 97.7%). No false-negative SNPs were identified in either analysis.
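The two quality controls and the resulting positive predictive values can be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, a 44-bp read is split by integer arithmetic into segments of 15, 15 and 14 bases, and a candidate is represented simply as the list of (strand, segment) combinations in which the variant base was observed.

```python
def read_segment(offset, read_length=44, n_segments=3):
    """Index (0-based) of the read segment containing a 0-based read offset."""
    return min(offset * n_segments // read_length, n_segments - 1)


def passes_read_filters(observations, n_segments=3):
    """Require support on both strands and in every segment of the read.

    observations: iterable of (strand, segment) pairs, strand being '+' or '-'.
    """
    observations = list(observations)
    strands = {strand for strand, _ in observations}
    segments = {segment for _, segment in observations}
    return strands == {"+", "-"} and segments == set(range(n_segments))


def positive_predictive_value(called, false_positives):
    """Fraction of called SNPs confirmed by the gold standard."""
    return (called - false_positives) / called


# A candidate seen only near read starts on the forward strand is rejected.
print(passes_read_filters([("+", 0), ("+", 0)]))
# Training set: 50 calls with 8 false positives before filtering, 43 with 1 after.
print(round(positive_predictive_value(50, 8), 3),
      round(positive_predictive_value(43, 1), 3))
```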

Effect of using a high-fidelity DNA polymerase on SNP detection

Given the sample size of 64 diploid individuals for each pool, a variant occurring only once in the 128 chromosomes would have a frequency of 0.78%. Our aim was to design a workflow able to detect such variants. False-positive SNPs arise as a consequence of wrong alignments, sequencing errors, or errors introduced into the sequencing templates during the PCR amplification reaction due to mis-incorporation of dNTPs by DNA polymerase. Taq polymerase is known to be particularly error-prone, with an estimated error rate between 1.1 errors per 10⁴ bp and 2 errors per 10⁵ bp according to the manufacturer (Applied Biosystems, http://www.appliedbiosystems.com), and this is known to create problems when sequencing cloned PCR products. To investigate the role of the DNA polymerase error rate, we amplified one candidate gene (HCT1) using a high-fidelity polymerase with an error rate of 1 per 10⁷ bp. We varied the discrimination threshold between noise and true SNPs from 0 to 1%, recorded the number of identified SNPs as a function of MAF threshold, and compared results obtained using the two enzymes. When the MAF threshold was close to zero, the proportion of bases carrying a putative SNP was between 0.9 and 1.0 (Figure 4). When the MAF threshold for SNP calling increased, the proportion of bases identified as polymorphic decreased. In Figure 4, the dots show the MAF threshold at which the proportion of bases identified as polymorphic is decreased to 50% of the maximum value. The corresponding frequency was 0.1% for HCT1 (the only gene amplified using a high-fidelity DNA polymerase), and more than 0.2% for all the genes for which amplification was performed using Amplitaq Gold (Applied Biosystems, http://www.appliedbiosystems.com). When the MAF threshold was increased to 0.4%, the performances of the two polymerases were comparable. Overall, our data suggest that use of an accurate polymerase decreases the process error rate and facilitates detection of extremely rare variants. However, variants with frequencies lower than 1% can still be accurately detected even without use of a high-fidelity polymerase.
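The threshold sweep behind this comparison can be sketched as below: for each candidate MAF threshold, count the proportion of bases that would be called polymorphic, then locate the threshold at which that proportion falls to half of its maximum. The function names are hypothetical and the per-base noise values are invented for illustration.

```python
def proportion_polymorphic(non_ref_fractions, threshold):
    """Fraction of bases whose non-reference read fraction exceeds the threshold."""
    called = sum(1 for f in non_ref_fractions if f > threshold)
    return called / len(non_ref_fractions)


def half_maximum_threshold(non_ref_fractions, thresholds):
    """Smallest tested threshold at which the called proportion is <= 50% of its maximum."""
    proportions = [proportion_polymorphic(non_ref_fractions, t) for t in thresholds]
    half = max(proportions) / 2.0
    for threshold, proportion in zip(thresholds, proportions):
        if proportion <= half:
            return threshold
    return None


# One copy among 64 diploid accessions (128 chromosomes).
singleton_frequency = 1 / (2 * 64)

# Sweep from 0 to 1% in steps of 0.01%.
thresholds = [i / 10000 for i in range(0, 101)]

# Hypothetical non-reference read fractions at ten bases.
noise = [0.00052, 0.00083, 0.00124, 0.00155, 0.00216,
         0.00307, 0.00021, 0.00098, 0.01500, 0.00113]

half_max = half_maximum_threshold(noise, thresholds)
print(f"singleton frequency {singleton_frequency:.2%}, half-maximum threshold {half_max:.2%}")
```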

Figure 4.

 Number of called SNPs per nucleotide as a function of the MAF threshold for the five genes studied in the experiment.
The dots represent the MAF threshold at which the proportion of bases identified as polymorphic is 50% of the maximum value. Continuous line: HCT1, sequenced using a high-fidelity polymerase. Thick dashed line: CAD4, sequenced using Taq polymerase. Thin dashed line: C3H3, sequenced using Taq polymerase. Dotted line: 4CL3, sequenced using Taq polymerase. Dashed/dotted line: CCR7, sequenced using Taq polymerase.

Effect of decreasing mean individual coverage on SNP detection

To investigate the effect of decreasing MIC on SNP detection, and to identify the optimal coverage for variant detection, we performed simulations on subsets of the data generated in phase 1. Figure 5 shows the number of SNPs as a function of the MIC (continuous line). The graph also shows the mean number of SNPs identified in each simulated subset that were also identified in the whole dataset (dashed line) which can be considered a conservative estimate of true-positive SNPs. When the MIC is above 150-fold, the results for the subsets are not distinguishable from those obtained with the whole dataset. Performance is worse for an MIC <100-fold, for which almost 50% of the identified SNPs were not identified in the whole dataset (i.e. likely to be false positives).
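A down-sampling simulation of this kind can be sketched as follows. How the subsets were actually drawn is not described in this section, so the scheme below (keeping each read independently with a fixed probability, then applying the 0.41% threshold) is an assumption, and the read counts are invented.

```python
import random


def subsample_counts(ref_reads, alt_reads, fraction, rng):
    """Keep each read independently with probability `fraction`."""
    kept_ref = sum(1 for _ in range(ref_reads) if rng.random() < fraction)
    kept_alt = sum(1 for _ in range(alt_reads) if rng.random() < fraction)
    return kept_ref, kept_alt


def snps_at_fraction(sites, fraction, threshold=0.0041, seed=0):
    """Positions called as SNPs after down-sampling every site.

    sites: dict position -> (reference_reads, alternative_reads)
    """
    rng = random.Random(seed)
    called = set()
    for pos, (ref_reads, alt_reads) in sites.items():
        ref, alt = subsample_counts(ref_reads, alt_reads, fraction, rng)
        depth = ref + alt
        if depth and alt / depth > threshold:
            called.add(pos)
    return called


# Hypothetical sites: a 5% variant, low-level noise, and a monomorphic base.
sites = {100: (9500, 500), 200: (9990, 10), 300: (10000, 0)}

full = snps_at_fraction(sites, 1.0)
for seed in range(3):
    subset = snps_at_fraction(sites, 0.1, seed=seed)
    # Calls absent from the full dataset are treated as likely false positives.
    print(seed, sorted(subset), sorted(subset - full))
```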

Figure 5.

 Number of SNPs identified as a function of MIC.
Results are plotted as the mean from ten independent simulations (continuous line). The mean number of SNPs identified in the simulation and in the whole data set (dashed lines) is used as an estimate of true positives.

Population genetics parameters

The accuracy of nucleotide diversity estimation by pooled multiplexed NGS was evaluated by comparing nucleotide diversity in the training and test sets. Estimates of nucleotide diversity based on pooled multiplexed NGS and individual Sanger sequencing were highly correlated (r = 0.99, Figure 6). Nucleotide diversity and statistical tests for neutrality (Tajima’s D test) were calculated for the whole sample. The results are shown in Table 4. No Tajima’s D test showed significant deviation from neutrality. However, most Tajima’s D tests gave negative values. The overall nucleotide diversity ranged from 0.65 × 10⁻³ to 1.86 × 10⁻³. The ratio of non-synonymous to synonymous nucleotide diversity ranged from 0.03 in HCT1 and 4CL3 to 0.74 in CCR7.
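Nucleotide diversity can be obtained directly from allele frequencies, which is why it is accessible from pooled data. The sketch below uses the standard per-site heterozygosity estimator for biallelic sites; the exact estimator used in the study is not given in this section, so this is an assumption, and the function name is hypothetical. The last lines recompute the non-synonymous to synonymous ratio for CCR7 from the values in Table 4.

```python
def nucleotide_diversity(minor_allele_freqs, n_chromosomes, sequence_length):
    """Per-site nucleotide diversity from biallelic SNP frequencies.

    pi = n / (n - 1) * sum(2 * p * (1 - p)) / L
    """
    correction = n_chromosomes / (n_chromosomes - 1)
    heterozygosity = sum(2.0 * p * (1.0 - p) for p in minor_allele_freqs)
    return correction * heterozygosity / sequence_length


# Toy example: three SNPs in a 1000-bp region sampled from 1536 chromosomes.
pi = nucleotide_diversity([0.20, 0.01, 0.005], 1536, 1000)
print(f"pi = {pi * 1e3:.2f} x 10^-3")

# Ratio of non-synonymous to synonymous diversity for CCR7 (values from Table 4).
ccr7_ratio = 1.72 / 2.32
print(round(ccr7_ratio, 2))
```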

Figure 6.

 Nucleotide diversity calculated using allele frequency estimated by pooled multiplexed NGS (light bars) and by individual Sanger sequencing (dark bars).

Table 4.   Estimates of nucleotide diversity (×10³) and Tajima’s D in five genes involved in lignin biosynthesis

       Total              Synonymous          Non-synonymous
Gene   π × 10³   D        πs × 10³   Ds       πns × 10³   Dns      πns/πs (a)
CAD4   1.39      −1.29    1.74       −1.36    0.36        −0.56    0.21
HCT1   1.40      −0.49    5.49       0.47     0.18        −1.38    0.03
C3H3   0.65      −0.70    1.31       −0.69    0.46        −0.45    0.35
CCR7   1.86      −0.01    2.32       −0.47    1.72        0.35     0.74
4CL3   1.60      −0.42    6.66       0.23     0.18        −1.40    0.03

(a) Non-synonymous to synonymous nucleotide diversity ratio.

Discussion

Next-generation sequencing of selected genomic regions represents a powerful approach to identify the complete spectrum of DNA sequence variants. Accurate detection and genotyping of SNPs is crucial for use of population sequencing to detect rare, as well as common, functional variants affecting any given trait. For this reason, a cheap and fast method for identification of polymorphisms is likely to be of great importance for determining the spectrum of variation and for identifying carriers of functional variants. We report on the use of pooled multiplexed next-generation sequencing of PCR products to identify SNPs in five candidate genes involved in lignin biosynthesis.

Our aim was to detect rare, as well as common, polymorphisms in a sample of 768 poplar accessions arranged in 12 pools of 64 accessions each. To do this, we developed a customized analysis workflow with strong discrimination ability between real SNPs and sequencing or alignment errors.

Given the small number of genes that we sequenced, PCR was the method of choice. However, errors due to DNA polymerase may affect the detection of rare variants. We performed a subset of amplifications using a high-fidelity polymerase, and showed that use of an accurate DNA polymerase increases the ability to discriminate between sequencing errors and true polymorphisms (Figure 4).

Each of the 12 pools comprised a relatively high number of poplar accessions (n = 64), and the number of individuals sequenced in a single reaction was very high (n = 768). Therefore, a high MIC was required to avoid the possibility that SNPs in under-represented accessions were missed. In phase 1, we obtained an MIC of 350-fold. Simulations showed that a lower MIC (approximately 150-fold) still produced reliable results (Figure 5).

Finally, false-positive SNPs can arise at the 5′ and 3′ ends of Illumina reads due to the higher error rate of the sequencing process or the presence of small insertions/deletions near the read ends. The workflow that we developed was effective in removing SNPs in the proximity of small insertions/deletions, and substantially reduced false-positive findings.

Applying our workflow to the whole dataset, we identified 37 non-synonymous SNPs in five genes involved in lignin biosynthesis (Table 3), one of which (C729A) caused a premature stop codon (C243*) in HCT1. We performed individual Sanger sequencing to identify carriers of the C243* mutation (Table S3). Carriers of the stop codon have been selected for extensive phenotypic evaluation, and will be used in a conventional breeding program to obtain offspring with improved lignin composition.

Among SNPs confirmed by individual Sanger sequencing, we were able to identify an SNP that occurred only once in 1536 chromosomes (expected frequency 0.065%). This gives an idea of the potential that pooled multiplexed NGS has to identify rare variants. The ability to confidently identify rare variants is crucial. Mutations affecting the phenotype are likely to be negatively selected. Their frequency (f) will depend on the mutation rate (μ) and the strength of negative selection (s). At equilibrium, the predicted frequency of a dominant mutation is μ/s, and that of a recessive mutation is √(μ/s). Assuming a gene mutation rate in the range 10⁻⁸ to 10⁻⁵, and a decrease in fitness of 1%, the expected frequency of a dominant allele in the population is <0.1% and that of a recessive allele is <5%. Not surprisingly, more than 40% of the SNPs identified in the present study had a frequency lower than 5% (Figure S3), and eight SNPs are predicted to appear in only one chromosome (Table 3). Knockout mutations, such as those obtained by gene silencing through transformation, are likely to be fully recessive and have a selection coefficient equal to 1. This translates to an equilibrium frequency range of 0.0001–0.0032 for the same range of mutation rate, and would require screening 475–15 000 individuals to have a 95% chance of identifying the desired mutation. These numbers, although large, are clearly achievable using the method described here, which can therefore be adopted to look for rare mutations that could be utilized in breeding programs.
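The equilibrium frequencies and screening sizes quoted above can be reproduced approximately with the sketch below. The formulas μ/s and √(μ/s) are taken from the text; the screening-size calculation is an assumption, namely that each diploid individual contributes two independently sampled chromosomes and that a single copy is always detected. Function names are hypothetical.

```python
import math


def equilibrium_frequency(mutation_rate, selection, dominant):
    """Mutation-selection balance: mu/s if dominant, sqrt(mu/s) if recessive."""
    ratio = mutation_rate / selection
    return ratio if dominant else math.sqrt(ratio)


def individuals_to_screen(allele_frequency, detection_probability=0.95):
    """Diploid individuals needed to sample at least one copy of the allele."""
    chromosomes = math.log(1.0 - detection_probability) / math.log(1.0 - allele_frequency)
    return math.ceil(chromosomes / 2.0)


# Upper bounds for a 1% fitness decrease at the highest mutation rate (1e-5).
print(equilibrium_frequency(1e-5, 0.01, dominant=True))    # about 0.001
print(equilibrium_frequency(1e-5, 0.01, dominant=False))   # about 0.032

# Fully recessive knockout (s = 1) over the range of mutation rates.
for mu in (1e-5, 1e-8):
    q = equilibrium_frequency(mu, 1.0, dominant=False)
    print(mu, round(q, 4), individuals_to_screen(q))
```

Under these assumptions the two screening sizes come out close to the 475 and 15 000 individuals given in the text.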

We detected a mutation that causes a premature stop codon (C243*) in HCT1, that was found at a rather high frequency (3%), and that was present in the homozygous state in one accession. The reason for the high frequency and for the viability of the homozygote could be either that the mutation does not cause gene knockout due to its location towards the C-terminus of the protein, which would therefore retain at least partial activity, or that the gene is partially or completely functionally redundant due to the presence of additional family members (Shi et al., 2010).

Our method accurately identified rare variants, with low false-positive and low false-negative rates, and allowed correct estimation of minor allele frequency (Figure 3). As a consequence, all population genetics parameters that can be obtained on the basis of the allele frequency, such as nucleotide diversity, are also accurately estimated (Figure 6). In addition, Tajima's D test can be used to test for departure from neutrality (Table 4).

The workflow that we developed can be used to identify common and rare SNPs in any organism. Given its ability to discriminate true positive from false positive SNPs, it is particularly suited for the screening of large populations in the search for rare variants that could be immediately used in breeding programs. In addition, studies aimed at genetic characterization of various populations of a given organism may take advantage of our workflow. Individuals belonging to each population are pooled, and population genetics parameters can be accurately estimated for each population. Finally, researchers investigating qualitative phenotypes may create pools based on the phenotypic category, accurately estimate the allele frequency of SNPs in the two phenotypic classes, and perform pooled association studies using statistical approaches that have already been developed (Sham et al., 2002).

In conclusion, we show that our workflow based on pooled multiplexed NGS is an efficient and accurate method to screen a large number of individuals for mutations, providing the basis for a next-generation Ecotilling method (Comai et al., 2004). The method has high sensitivity and specificity, and it accurately estimates allele frequencies and population genetic parameters in large samples.

Experimental Procedures

Plant material

The experimental sample consisted of 768 Populus nigra accessions originally collected in various European areas (Smulders et al., 2008; Rohde et al., 2010). A large proportion of trees originated from France (n = 631). The remaining trees originated from Italy (n = 118), Germany (n = 9), Spain (n = 9) and The Netherlands (n = 1). The latitude ranged from 40.24°N to 51.48°N (median latitude 46.24°N) and the longitude ranged from 16.39°E to 0.56°W (median longitude 3.19°E). The height above sea level ranged from 35 to 1699 m (median altitude 160 m). Approximately 1 g of leaf material was collected for DNA extraction from each tree.

Amplification and pooling

The DNA of 768 P. nigra accessions was extracted from dehydrated leaves using a DNeasy plant kit (Qiagen, http://www.qiagen.com/). Primer design was performed using the web interface Primer3plus (Untergasser et al., 2007). DNA amplifications were performed in a 15 μl volume, using the primer pairs listed in Table S4. The reactions contained a mean of 25 ng genomic DNA and the following reagents: 0.3 μM of each primer, 250 μM of each dNTP, 1.5 mM MgCl2, 2% DMSO, 1 unit Amplitaq Gold (Applied Biosystems, http://www.appliedbiosystems.com/) and 1× PCR buffer II (Applied Biosystems). The reactions were performed using the Geneamp 9700 PCR system (Applied Biosystems) under the following conditions: 95°C for 10 min, 40 cycles of 20 sec at 94°C, 30 sec at 60°C and 90 sec at 72°C, followed by a final extension of 10 min at 72°C.

Amplification of HCT1 was performed substituting 1 unit of Amplitaq Gold with 0.3 units of Phusion high-fidelity DNA polymerase (Finnzymes, http://www.finnzymes.com). PCR products were then pooled in two phases as detailed below.

Phase 1.  The whole CAD4 sequence was amplified, using three PCR primer pairs already tested using a different sample (Marroni et al., 2011). For each amplicon, the quantities of PCR products for 16 random accessions were estimated on an agarose gel, the mean concentration was taken as an estimate of each amplicon’s concentration, and the three amplicons for each individual were combined in equimolar amounts. Experimental samples were pooled in 12 separate groups, each comprising 64 accessions (total 768). A schematic representation of the pooling scheme is shown in Figure S4. PCR products obtained from three control accessions for which Sanger sequencing was already available (Marroni et al., 2011) were added to each pool.

Phase 2.  In phase 2, the coding sequences of the selected genes were amplified using one primer pair for HCT1, four for 4CL3 and three each for C3H3 and CCR7. For each amplicon, the quantities of the PCR products of 16 random accessions were estimated on an agarose gel; the mean concentration was taken as an estimate of each amplicon’s concentration, and the 11 amplicons for each individual were combined in equimolar amounts. Experimental samples were pooled in 12 separate groups, each comprising 64 accessions.
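Combining amplicons in equimolar amounts means that longer or more dilute products need proportionally larger volumes. The sketch below shows the arithmetic under stated assumptions: concentrations are in ng/μl, double-stranded DNA is taken to weigh about 650 g/mol per base pair (a commonly used approximation), and the amplicon names, lengths, concentrations and target amount are all invented for illustration.

```python
AVERAGE_BP_MASS = 650.0  # g/mol per base pair of dsDNA (approximation)


def picomoles_per_microlitre(ng_per_ul, length_bp):
    """Molar concentration of a double-stranded PCR product."""
    return ng_per_ul * 1000.0 / (length_bp * AVERAGE_BP_MASS)


def equimolar_volumes(amplicons, target_pmol):
    """Volume (in microlitres) of each amplicon that contains `target_pmol`.

    amplicons: dict name -> (length_bp, ng_per_ul)
    """
    return {
        name: target_pmol / picomoles_per_microlitre(conc, length)
        for name, (length, conc) in amplicons.items()
    }


# Hypothetical amplicons with equal mass concentration but different lengths.
amplicons = {"amplicon_A": (1000, 65.0), "amplicon_B": (500, 65.0)}
volumes = equimolar_volumes(amplicons, target_pmol=0.2)
print(volumes)
```

With equal mass concentrations, the amplicon that is twice as long needs twice the volume.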

Sequencing

To prepare NGS libraries, 5 μg of pooled PCR products were randomly fragmented by nebulizing at 60 psi for 4 min. Libraries were prepared using Illumina reagents, according to the manufacturer’s specifications (Illumina, http://www.illumina.com). End repair of fragmented DNA was performed using T4 DNA polymerase and Klenow polymerase with T4 polynucleotide kinase. An adenine base was added at the 3′ end using a 3′→5′ exonuclease-deficient Klenow fragment, and Illumina-indexed adapter oligonucleotides were ligated to the sticky ends created. We electrophoresed the ligation mixture on an agarose gel, and selected fragments of 200 bp.

DNA was enriched for fragments with Illumina adapters on either end using a 16-cycle PCR reaction performed using an Illumina multiplexing sample preparation oligonucleotide kit as described by the manufacturer. A Genome Analyzer flowcell was prepared on the supplied cluster station according to the manufacturer’s protocol. In both phase 1 and phase 2 of the experiment, two lanes of the flowcell were used to sequence the 768 accessions. Clusters of PCR colonies were then sequenced on an Illumina Genome Analyzer II platform using a single-read run of 44 bp, coupled with 7 bp Illumina index sequencing using the Illumina multiplex sequencing primer. Images from the instrument were processed using the manufacturer’s software to generate FASTQ sequence files.

Sanger sequencing was performed using a BigDye® Terminator version 3.1 cycle sequencing kit (Applied Biosystems) on an ABI3730 sequencer (Applied Biosystems) according to the manufacturer’s instructions.

Data analysis

In a previous study (Marroni et al., 2011), we obtained a CAD4 consensus sequence from 360 accessions (GenBank accessions HM440565–HM440924). This reference was 2240 bp long and comprised the complete CAD4 coding sequence (1074 bp). The P. nigra reference sequences for the coding portions of HCT1 (accession JF693234), C3H3 (accession JF693232), CCR7 (accession JF693233) and 4CL3 (accession JF693234) were obtained by Sanger sequencing of eight accessions, using the following software packages: Phred (Ewing and Green, 1998), Phrap and Consed (Gordon et al., 1998).

Illumina sequences were aligned against the P. nigra reference using the short-read aligner Novoalign (Novocraft Technologies, http://www.novocraft.com). Alignment was performed by setting the alignment scoring options t (the highest alignment score acceptable for the best alignment) and g (the gap opening penalty) in different ways for SNP and small indel detection. For SNP detection, the chosen pair of scoring options was t = 60 and g = 15, while for indel detection they were t = 110 and g = 20. Default settings for the remaining alignment scoring, quality control and read filtering options were retained. Coverage statistics were calculated based on the alignment performed for SNP detection. The mean individual per base coverage (MIBC) was calculated for each base as R/N, where R is the number of reads covering the position, and N is the number of accessions (768 for analysis of the whole dataset, and 64 for analysis at pool level). Thus, for each base, MIBC represents the mean number of reads covering the base in each accession. Mean individual coverage (MIC) is calculated as the mean of MIBC over the reference sequence. To evaluate experiment-wide MIC, we used the reference sequences of all the studied genes.
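The coverage metrics defined above can be illustrated with a minimal Python sketch. This is not the authors' code; the function names and the example read depths are hypothetical and chosen only to show the arithmetic.

```python
from statistics import mean

def mibc(read_depths, n_accessions):
    """Mean individual per-base coverage: R/N at each reference position."""
    if n_accessions <= 0:
        raise ValueError("n_accessions must be positive")
    return [depth / n_accessions for depth in read_depths]

def mic(read_depths, n_accessions):
    """Mean individual coverage: the mean of MIBC over the reference."""
    return mean(mibc(read_depths, n_accessions))

# Hypothetical read depths (R) at four reference positions in one pool
# of 64 accessions (N = 64).
depths = [6400, 3200, 1600, 4800]
per_base = mibc(depths, 64)
overall = mic(depths, 64)
```

For analysis of the whole dataset, the same functions would be called with N = 768 instead of 64.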

SNP and indel detection were performed using Varscan (Koboldt et al., 2009). SNP detection was performed separately on forward and reverse reads of each pool; a nucleotide was considered polymorphic if the alternative allele was present both in forward and reverse sequences in at least ten reads per strand. This allowed false-positive SNPs due to sequence-specific sequencing errors to be discarded. SNPs called by Varscan with a mean variant quality lower than 25 and SNPs identified in regions with an MIBC <50-fold (corresponding to a total coverage of 3350-fold) were not considered. The error distribution in Illumina reads is not uniform; the proportion of errors usually increases from the beginning to the end of the read, with the additional possibility of high error rates in the very first bases (Kircher et al., 2009). In addition, preliminary results of the present study indicated that false-positive SNPs can be identified at the 3′ and 5′ ends of a read when a small deletion/insertion is present near (<10 bp from) the read ends. To control for this, specific a posteriori quality control was applied. Each read was partitioned into three segments of equal length, and the condition was imposed that each SNP had to be independently identified (using the thresholds shown above) when considering each of the segments of the read separately.
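The filtering logic described above can be sketched as follows. This is an illustrative simplification, not the Varscan workflow itself: the data structure and function names are hypothetical, and the sketch assumes that the MIBC filter is evaluated once per position while the strand and quality thresholds are checked in each of the three read segments.

```python
from dataclasses import dataclass

@dataclass
class SegmentEvidence:
    """Support for the alternative allele within one third of the read."""
    fwd_alt_reads: int           # alternative-allele reads on the forward strand
    rev_alt_reads: int           # alternative-allele reads on the reverse strand
    mean_variant_quality: float  # mean base quality of the variant allele

def segment_passes(ev, min_reads_per_strand=10, min_quality=25):
    """Require support on both strands and adequate mean variant quality."""
    return (ev.fwd_alt_reads >= min_reads_per_strand
            and ev.rev_alt_reads >= min_reads_per_strand
            and ev.mean_variant_quality >= min_quality)

def keep_snp(segments, mibc, min_mibc=50, **thresholds):
    """Keep a SNP only if coverage is adequate and all three read
    segments independently support it."""
    if mibc < min_mibc:
        return False
    return len(segments) == 3 and all(
        segment_passes(s, **thresholds) for s in segments)

# Hypothetical candidate supported in every segment of the read.
candidate = [SegmentEvidence(12, 15, 30.0)] * 3
retained = keep_snp(candidate, mibc=80.0)
```

Passing min_reads_per_strand=5 and min_mibc=25 would reproduce the scaled thresholds used for the genes with lower coverage in phase 2.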

To assess the performance of pooled multiplexed NGS to identify SNPs, two amplicons (CAD4b and CAD4c, Table S4) of two of the 12 pools (pool 5 and pool 9) were sequenced using the Sanger method in phase 1, and SNPs were detected on the individual sequences obtained. Identification of SNPs in Sanger sequences of the training set was performed using the software package PolyPhred (Nickerson et al., 1997), as described previously (Marroni et al., 2011). The sensitivity and specificity of pooled multiplexed NGS were measured using receiver operating characteristic (ROC) curve analysis, with individual Sanger sequencing as the gold standard. Comparison was performed only for positions at which Sanger coverage was at least 50-fold.

Using ROC curves, the true-positive rate (sensitivity) is calculated as a function of the false-positive rate (1 – specificity) for various cut-off points; each point on the ROC represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve is an overall measure of test performance, with a value of 0.5 indicating random performance and a value of 1.0 denoting perfect performance. The proportion of true Illumina SNPs as a function of the proportion of false SNPs was calculated, using the variant frequency threshold as the test outcome and Sanger genotyping as the gold standard. The best variant frequency threshold was chosen to maximize the following value: (sensitivity+specificity)/2, in order to maximize true positives while minimizing false positives.
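As a hedged illustration of the threshold selection described above, the following Python sketch scans candidate variant frequency thresholds and returns the one that maximizes (sensitivity+specificity)/2. The function names and the example data are hypothetical; the gold standard is represented as a list of booleans indicating whether Sanger sequencing confirmed each candidate position.

```python
def sensitivity_specificity(frequencies, is_true_snp, threshold):
    """Call a SNP when its variant frequency is >= threshold, then
    compare the calls with the gold standard."""
    tp = fp = tn = fn = 0
    for freq, truth in zip(frequencies, is_true_snp):
        called = freq >= threshold
        if called and truth:
            tp += 1
        elif called and not truth:
            fp += 1
        elif not called and truth:
            fn += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec

def best_threshold(frequencies, is_true_snp):
    """Return the observed frequency that maximizes (sens + spec) / 2."""
    def score(threshold):
        sens, spec = sensitivity_specificity(frequencies, is_true_snp, threshold)
        return (sens + spec) / 2
    return max(sorted(set(frequencies)), key=score)

# Hypothetical variant frequencies and their Sanger confirmation status.
freqs = [0.001, 0.002, 0.004, 0.010, 0.050, 0.200]
truth = [False, False, False, True, True, True]
chosen = best_threshold(freqs, truth)
```

When several thresholds give the same score, this sketch returns the lowest of them, which is a design choice of the example rather than something stated in the text.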

The correlation between MAFs of SNPs identified by pooled NGS and those of SNPs identified by individual Sanger sequencing was calculated using the Pearson product-moment correlation coefficient. In each pool, the MAF for Sanger sequencing was calculated as the number of occurrences of a given alternative allele divided by the total number of observed alleles. The MAF for pooled multiplexed NGS was calculated as the number of Illumina reads carrying the alternative allele divided by the total number of Illumina reads.
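The two frequency estimates and their correlation can be sketched in Python as follows. The counts are hypothetical, and the sketch assumes a pool of 64 diploid accessions, so that 128 alleles are observed by Sanger sequencing at each position.

```python
from math import sqrt

def allele_frequency(alt_count, total_count):
    """Alternative-allele count divided by the total number of
    observations (alleles for Sanger, reads for pooled NGS)."""
    return alt_count / total_count

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical counts for three SNPs in one pool.
sanger_maf = [allele_frequency(c, 128) for c in (1, 13, 64)]
ngs_maf = [allele_frequency(alt, total)
           for alt, total in ((60, 8000), (800, 8000), (4100, 8000))]
r = pearson_r(sanger_maf, ngs_maf)
```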

Small indel detection was performed separately on forward and reverse reads of each pool, and a position was considered polymorphic for indels if an indel was identified on both the forward and reverse strands in at least two reads per strand. Indels with an MIC <25-fold per strand were not considered.

To investigate the effect of sequence coverage on SNP calling, and to identify the optimal coverage for variant detection, a total of 1000 simulations, sampling subsets from 1 to 99% of reads generated in phase 1, were run, and the identified SNPs (positives) were compared with those identified using whole-sequence data. The intersection of the two sets represents a conservative estimate of true positives. The positive predictive value was calculated as the ratio between true positives and the sum of true positives and false positives.
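A minimal sketch of one such simulation step is given below. The helper names are hypothetical, and the SNP-calling step itself is omitted: the sketch only shows how a read subset might be drawn and how the positive predictive value is obtained once SNP positions from the subset and from the full data are available.

```python
import random

def subsample_reads(reads, fraction, rng):
    """Draw a random subset containing the given fraction of the reads."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return rng.sample(reads, round(len(reads) * fraction))

def positive_predictive_value(subset_snps, full_data_snps):
    """TP / (TP + FP), where true positives are the SNPs called in the
    subset that are also called using the whole-sequence data."""
    subset_snps, full_data_snps = set(subset_snps), set(full_data_snps)
    if not subset_snps:
        return None  # undefined when no SNP is called in the subset
    true_positives = len(subset_snps & full_data_snps)
    return true_positives / len(subset_snps)

# Hypothetical read identifiers and SNP positions.
rng = random.Random(0)
subset = subsample_reads(list(range(1000)), 0.25, rng)
ppv = positive_predictive_value({101, 250, 512, 900}, {101, 250, 512, 777})
```

In a full simulation, this step would be repeated for each sampled fraction, with SNP calling applied to every subset before the comparison.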

Comparison between SNPs identified by pooled multiplexed NGS and individual Sanger sequencing was repeated in phase 2. The selected test set was composed as follows: all 12 pools for HCT1, pool 2 for CCR7b and CCR7c, and pool 12 for 4CL3b. The algorithm used for SNP detection in Sanger sequences was the same as in phase 1, but when identifying SNPs in short reads in HCT1, C3H3, CCR7 and 4CL3 (which showed approximately half of the coverage of CAD4), the number of reads required on each strand and the lower acceptable MIC of a region to be used for SNP calling were linearly scaled, and set to five and 25-fold, respectively.

To investigate the effect of frequency threshold on SNP calling, additional analyses were performed by varying the MAF threshold frequency from 0 to 1%, removing coverage thresholds, and requiring that only one occurrence of the SNP be found on the forward and reverse strand. When the MAF threshold was zero, the number of polymorphic positions corresponded to the number of bases in which an error was introduced on both strands during the whole sequencing process. The proportion of polymorphic positions was plotted as a function of the variant frequency used to define a polymorphism in each gene.
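The relaxed calling rule used in this analysis can be sketched as follows. The function name and the read counts are hypothetical; each position is represented by the number of alternative-allele reads on the forward strand, the number on the reverse strand, and the total number of reads.

```python
def proportion_polymorphic(positions, maf_threshold):
    """Fraction of positions whose alternative allele is seen at least
    once on each strand and reaches the given frequency threshold."""
    polymorphic = 0
    for fwd_alt, rev_alt, total in positions:
        seen_on_both_strands = fwd_alt >= 1 and rev_alt >= 1
        if seen_on_both_strands and (fwd_alt + rev_alt) / total >= maf_threshold:
            polymorphic += 1
    return polymorphic / len(positions)

# Hypothetical (forward alt reads, reverse alt reads, total reads) per position.
positions = [(1, 1, 10000), (0, 3, 10000), (40, 45, 10000), (500, 480, 10000)]
curve = [(t, proportion_polymorphic(positions, t)) for t in (0.0, 0.005, 0.01)]
```

With a threshold of zero, every position at which the alternative allele appears on both strands is counted, which matches the interpretation given in the text.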

Nucleotide diversity is the mean number of per site differences between two randomly chosen DNA sequences (Nei and Li, 1979); here, it was estimated as the sum of the unbiased heterozygosity of segregating sites (Tajima, 1989), averaged over all nucleotides. Heterozygosity and 95% confidence limits of heterozygosity were calculated by re-sampling with replacement all pairs of polymorphic loci with probability corresponding to the frequency of the two alleles, and multiplying by n/(n−1), where n is the sample size of the pool, to obtain an unbiased estimate of nucleotide diversity (Futschik and Schlötterer, 2010). For each polymorphic position, 200 re-sampling experiments were performed, each of size 10 000. The neutrality hypothesis was tested by calculating Tajima’s D (Tajima, 1989); we chose confidence limits for Tajima’s D based on the β distribution for n = 1000 DNA sequences, as calculated by Tajima (1989). Statistical analyses were performed using R (http://www.r-project.org).
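For illustration, the point estimate of nucleotide diversity can be sketched in Python using the closed-form heterozygosity of a biallelic site, 2p(1−p), with the n/(n−1) correction. This is a simplification: it replaces the re-sampling procedure described above with the analytic expectation, and it does not compute confidence limits. The function names, allele frequencies, sample size and sequence length are hypothetical.

```python
def unbiased_heterozygosity(alt_freq, n):
    """Unbiased heterozygosity of a biallelic site:
    n / (n - 1) * 2p(1 - p), where n is the sample size."""
    if n < 2:
        raise ValueError("n must be at least 2")
    return n / (n - 1) * 2 * alt_freq * (1 - alt_freq)

def nucleotide_diversity(alt_freqs, n, sequence_length):
    """Sum of per-site heterozygosity over segregating sites,
    averaged over all nucleotides of the sequence."""
    total = sum(unbiased_heterozygosity(p, n) for p in alt_freqs)
    return total / sequence_length

# Hypothetical example: two segregating sites in a 1000 bp region,
# with a sample size of n = 128.
pi = nucleotide_diversity([0.5, 0.25], n=128, sequence_length=1000)
```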

Acknowledgements

This research received funding from the European Community’s Seventh Framework Program (FP7/2007-2013) under grant agreement number 211917 (ENERGYPOPLAR). The authors thank Marc Villar, Véronique Jorge, Vanina Guerin and Catherine Bastien of the National Institute for Agricultural Research (INRA), Orleans, France, for providing samples.
