The experimental sample consisted of 768 Populus nigra accessions originally collected in various European areas (Smulders et al., 2008; Rohde et al., 2010). A large proportion of trees originated from France (n = 631). The remaining trees originated from Italy (n = 118), Germany (n = 9), Spain (n = 9) and the Netherlands (n = 1). The latitude ranged from 40.24°N to 51.48°N (median latitude 46.24°N) and the longitude ranged from 16.39°E to 0.56°W (median longitude 3.19°E). The height above sea level ranged from 35 to 1699 m (median altitude 160 m). Approximately 1 g of leaf material was collected for DNA extraction from each tree.
Amplification and pooling
The DNA of 768 P. nigra accessions was extracted from dehydrated leaves using a DNeasy plant kit (Qiagen, http://www.qiagen.com/). Primer design was performed using the web interface Primer3plus (Untergasser et al., 2007). DNA amplifications were performed in a 15 μl volume, using the primer pairs listed in Table S4. The reactions contained a mean of 25 ng genomic DNA and the following reagents: 0.3 μM of each primer, 250 μM of each dNTP, 1.5 mM MgCl2, 2% DMSO, 1 unit Amplitaq Gold (Applied Biosystems, http://www.appliedbiosystems.com/) and 1× PCR buffer II (Applied Biosystems). The reactions were performed using the Geneamp 9700 PCR system (Applied Biosystems) under the following conditions: 95°C for 10 min, 40 cycles of 20 sec at 94°C, 30 sec at 60°C and 90 sec at 72°C, followed by a final extension of 10 min at 72°C.
Amplification of HCT1 was performed by replacing 1 unit of Amplitaq Gold with 0.3 units of Phusion high-fidelity DNA polymerase (Finnzymes, http://www.finnzymes.com). PCR products were then pooled in two phases as detailed below.
Phase 1. The whole CAD4 sequence was amplified, using three PCR primer pairs already tested using a different sample (Marroni et al., 2011). For each amplicon, the quantities of PCR products for 16 random accessions were estimated on an agarose gel, the mean concentration was taken as an estimate of each amplicon’s concentration, and the three amplicons for each individual were combined in equimolar amounts. Experimental samples were pooled in 12 separate groups, each comprising 64 accessions (total 768). A schematic representation of the pooling scheme is shown in Figure S4. PCR products obtained from three control accessions for which Sanger sequencing was already available (Marroni et al., 2011) were added to each pool.
Phase 2. In phase 2, the coding sequences of the selected genes were amplified using one primer pair for HCT1, four for 4CL3 and three each for C3H3 and CCR7. For each amplicon, the quantities of the PCR products of 16 random accessions were estimated on an agarose gel; the mean concentration was taken as an estimate of each amplicon’s concentration, and the 11 amplicons for each individual were combined in equimolar amounts. Experimental samples were pooled in 12 separate groups, each comprising 64 accessions.
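The equimolar combination of amplicons used in both phases can be illustrated with a short calculation. The following is a minimal sketch, not the procedure actually followed in the laboratory: it assumes the common approximation of about 660 g/mol per base pair for double-stranded DNA, and the amplicon names, lengths, concentrations and target amount are hypothetical.

```python
# Approximate mass of double-stranded DNA: ~660 g/mol per base pair,
# i.e. 0.66 ng per pmol per bp (an assumption, not a value from the text).
NG_PER_PMOL_PER_BP = 0.66


def equimolar_volumes(amplicons, pmol_each):
    """Return the volume (in microlitres) of each amplicon needed to
    contribute the same molar amount to a pool.

    amplicons: dict mapping name -> (length in bp, concentration in ng/ul)
    pmol_each: target amount of each amplicon in pmol
    """
    volumes = {}
    for name, (length_bp, conc_ng_per_ul) in amplicons.items():
        mass_ng = pmol_each * length_bp * NG_PER_PMOL_PER_BP
        volumes[name] = mass_ng / conc_ng_per_ul
    return volumes


# Hypothetical amplicons with the same mean gel-estimated concentration.
amplicons = {"amplicon_1": (1000, 33.0), "amplicon_2": (500, 33.0)}
volumes = equimolar_volumes(amplicons, pmol_each=0.05)
```

At equal mass concentration, a longer amplicon requires a proportionally larger volume to contribute the same number of molecules.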
To prepare NGS libraries, 5 μg of pooled PCR products were randomly fragmented by nebulizing at 60 psi for 4 min. Libraries were prepared using Illumina reagents, according to the manufacturer’s specifications (Illumina, http://www.illumina.com). End repair of fragmented DNA was performed using T4 DNA polymerase and Klenow polymerase with T4 polynucleotide kinase. An adenine base was added at the 3′ end using a 3′→5′ exonuclease-deficient Klenow fragment, and Illumina-indexed adapter oligonucleotides were ligated to the sticky ends created. We electrophoresed the ligation mixture on an agarose gel, and selected fragments of 200 bp.
DNA was enriched for fragments with Illumina adapters on either end using a 16-cycle PCR reaction performed using an Illumina multiplexing sample preparation oligonucleotide kit as described by the manufacturer. A Genome Analyzer flowcell was prepared on the supplied cluster station according to the manufacturer’s protocol. In both phase 1 and phase 2 of the experiment, two lanes of the flowcell were used to sequence the 768 accessions. Clusters of PCR colonies were then sequenced on an Illumina Genome Analyzer II platform using a single read run of 44 bp, coupled with 7 bp Illumina index sequencing using the Illumina multiplex sequencing primer. Images from the instrument were processed using the manufacturer’s software to generate FASTQ sequence files.
Sanger sequencing was performed using a BigDye® Terminator version 3.1 cycle sequencing kit (Applied Biosystems) on an ABI3730 sequencer (Applied Biosystems) according to the manufacturer’s instructions.
In a previous study (Marroni et al., 2011), we obtained a CAD4 consensus sequence from 360 accessions (GenBank accessions HM440565–HM440924). This reference was 2240 bp long and comprised the complete CAD4 coding sequence (1074 bp). The P. nigra reference sequences for the coding portions of HCT1 (accession JF693234), C3H3 (accession JF693232), CCR7 (accession JF693233) and 4CL3 (accession JF693234) were obtained by Sanger sequencing of eight accessions, using the following software packages: Phred (Ewing and Green, 1998), Phrap and Consed (Gordon et al., 1998).
Illumina sequences were aligned against the P. nigra reference using the short-reads aligner Novoalign (Novocraft Technologies, http://www.novocraft.com). Alignment was performed by setting the alignment scoring options t (the highest alignment score acceptable for the best alignment) and g (the gap opening penalty) in different ways for SNP and small indel detection. For SNP detection, the chosen pair of scoring options was t = 60 and g = 15, while for indel detection they were t = 110 and g = 20. Default settings for the remaining alignment scoring, quality control and read filtering options were retained. Coverage statistics were calculated based on the alignment performed for SNP detection. The mean individual per base coverage (MIBC) was calculated for each base as R/N, where R is the number of reads covering the position, and N is the number of accessions (768 for analysis of the whole dataset, and 64 for analysis at pool level). Thus, for each base, MIBC represents the mean number of reads covering the base in each accession. Mean individual coverage (MIC) is calculated as the mean of MIBC over the reference sequence. To evaluate experiment-wide MIC, we used the reference sequences of all the studied genes.
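The two coverage statistics defined above can be written compactly. The following is a minimal sketch assuming that the per-base read depth has already been extracted from the alignment; the depth values are hypothetical.

```python
def mibc(read_depth, n_accessions):
    """Mean individual per-base coverage: R / N for each base, where R is
    the number of reads covering the base and N the number of accessions."""
    return [reads / n_accessions for reads in read_depth]


def mic(read_depth, n_accessions):
    """Mean individual coverage: the mean of MIBC over the reference."""
    per_base = mibc(read_depth, n_accessions)
    return sum(per_base) / len(per_base)


# Hypothetical read depths at four reference positions in one pool
# of 64 accessions.
depth = [3200, 6400, 1600, 4800]
per_base = mibc(depth, 64)
mean_coverage = mic(depth, 64)
```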
SNP and indel detection were performed using Varscan (Koboldt et al., 2009). SNP detection was performed separately on forward and reverse reads of each pool; a nucleotide was considered polymorphic if the alternative allele was present both in forward and reverse sequences in at least ten reads per strand. This allowed false-positive SNPs due to sequence-specific sequencing errors to be discarded. SNPs called by Varscan with a mean variant quality lower than 25 and SNPs identified in regions with an MIBC <50-fold (corresponding to a total coverage of 3350-fold) were not considered. The error distribution in Illumina reads is not uniform; the proportion of errors usually increases from the beginning to the end of the read, with the additional possibility of high error rates in the very first bases (Kircher et al., 2009). In addition, preliminary results of the present study indicated that false-positive SNPs can be identified at the 3′ and 5′ ends of a read when a small deletion/insertion is present near (<10 bp from) the read ends. To control for this, specific a posteriori quality control was applied. Each read was partitioned into three segments of equal length, and the condition was imposed that each SNP had to be independently identified (using the thresholds shown above) when considering each of the segments of the read separately.
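The filtering logic described above can be sketched as two checks. This is an illustration under stated assumptions, not the Varscan implementation: the function names are hypothetical, the 44 bp read is split into three near-equal segments (15, 15 and 14 bp), and only the per-strand read-count threshold is re-applied within each segment.

```python
def passes_pool_filters(fwd_alt, rev_alt, mean_quality, mibc,
                        min_reads=10, min_quality=25.0, min_mibc=50.0):
    """Apply the strand, quality and coverage thresholds given in the text."""
    return (fwd_alt >= min_reads and rev_alt >= min_reads
            and mean_quality >= min_quality and mibc >= min_mibc)


def segment_index(pos_in_read, read_length=44):
    """Map a 0-based position within a read to segment 0, 1 or 2."""
    return pos_in_read * 3 // read_length


def passes_segment_check(alt_observations, read_length=44, min_reads=10):
    """Require the alternative allele to reach the read-count threshold on
    both strands within each third of the read.

    alt_observations: iterable of (strand, position in read) pairs, one per
    read carrying the alternative allele; strand is "+" or "-".
    """
    counts = {(strand, seg): 0 for strand in "+-" for seg in range(3)}
    for strand, pos in alt_observations:
        counts[(strand, segment_index(pos, read_length))] += 1
    return all(count >= min_reads for count in counts.values())


# Hypothetical variant seen 10 times per strand in each third of the read.
observations = [(strand, pos)
                for strand in "+-" for pos in (0, 20, 40) for _ in range(10)]
supported = passes_segment_check(observations)
```

A variant supported only by bases near one end of the reads fails the second check even when its total read count is high, which is the behaviour the a posteriori control is meant to provide.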
To assess the performance of pooled multiplexed NGS to identify SNPs, two amplicons (CAD4b and CAD4c, Table S4) of two of the 12 pools (pool 5 and pool 9) were sequenced using the Sanger method in phase 1, and SNPs were detected on the individual sequences obtained. Identification of SNPs in Sanger sequences of the training set was performed using the software package PolyPhred (Nickerson et al., 1997), as described previously (Marroni et al., 2011). The sensitivity and specificity of pooled multiplexed NGS were measured using receiver operating characteristic (ROC) curve analysis, with individual Sanger sequencing as the gold standard. Comparison was performed only for positions at which Sanger coverage was at least 50-fold.
Using ROC curves, the true-positive rate (sensitivity) is calculated as a function of the false-positive rate (1 – specificity) for various cut-off points; each point on the ROC represents a sensitivity/specificity pair corresponding to a particular decision threshold. The area under the ROC curve is an overall measure of test performance, with a value of 0.5 indicating random performance and a value of 1.0 denoting perfect performance. The proportion of true Illumina SNPs as a function of the proportion of false SNPs was calculated, using the variant frequency threshold as the test outcome and Sanger genotyping as the gold standard. The best variant frequency threshold was chosen to maximize the following value: (sensitivity+specificity)/2, in order to maximize true positives while minimizing false positives.
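The threshold selection can be illustrated as follows. This is a minimal sketch in which each candidate variant is represented by its pooled variant frequency and its Sanger status; the function name and the example values are hypothetical, and a variant is called when its frequency is at or above the threshold.

```python
def best_frequency_threshold(candidates):
    """Return (score, threshold, sensitivity, specificity) for the variant
    frequency threshold maximizing (sensitivity + specificity) / 2.

    candidates: list of (variant frequency, confirmed by Sanger) pairs;
    the list must contain both confirmed and unconfirmed variants.
    """
    positives = sum(1 for _, truth in candidates if truth)
    negatives = len(candidates) - positives
    best = None
    for threshold in sorted({freq for freq, _ in candidates}):
        true_pos = sum(1 for f, t in candidates if t and f >= threshold)
        false_pos = sum(1 for f, t in candidates if not t and f >= threshold)
        sensitivity = true_pos / positives
        specificity = (negatives - false_pos) / negatives
        score = (sensitivity + specificity) / 2
        if best is None or score > best[0]:
            best = (score, threshold, sensitivity, specificity)
    return best


# Hypothetical candidates: (frequency in pooled reads, Sanger-confirmed).
candidates = [(0.001, False), (0.002, False), (0.003, False),
              (0.004, True), (0.010, True), (0.020, True)]
score, threshold, sensitivity, specificity = best_frequency_threshold(candidates)
```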
The correlation between MAFs of SNPs identified by pooled NGS and those SNPs identified by individual Sanger sequencing was calculated using the Pearson product-moment correlation coefficient. In each pool, the MAF for Sanger sequencing was calculated as the number of occurrences of a given alternative allele divided by the total number of observed alleles. The MAF for pooled multiplexed NGS was calculated as the number of Illumina reads carrying the alternative allele divided by the total number of Illumina reads.
Small indel detection was performed separately on forward and reverse reads of each pool, and a position was considered polymorphic for indels if an indel was identified on both the forward and reverse strands in at least two reads per strand. Indels with an MIC <25-fold per strand were not considered.
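The two frequency estimates and their correlation can be sketched as follows. The counts are hypothetical; the Sanger denominator of 128 alleles assumes 64 diploid accessions that were all successfully genotyped.

```python
import math


def allele_frequency(alt_count, total_count):
    """Alternative-allele frequency: alleles (Sanger) or reads (pooled NGS)
    carrying the alternative allele, divided by the total observed."""
    return alt_count / total_count


def pearson(x, y):
    """Pearson product-moment correlation coefficient of two sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical counts for three SNPs in one pool.
sanger = [allele_frequency(alt, 128) for alt in (2, 10, 32)]
ngs = [allele_frequency(alt, 6400) for alt in (90, 520, 1580)]
r = pearson(sanger, ngs)
```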
To investigate the effect of sequence coverage on SNP calling, and to identify the optimal coverage for variant detection, a total of 1000 simulations, sampling subsets from 1 to 99% of reads generated in phase 1, were run, and the identified SNPs (positives) were compared with those identified using whole-sequence data. The intersection of the two sets represents a conservative estimate of true positives. The positive predictive value was calculated as the ratio between true positives and the sum of true positives and false positives.
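The subsampling procedure can be outlined with a toy variant caller. This is a sketch of the idea only: reads are reduced to (position, carries alternative allele) pairs, and the calling rule and its thresholds are hypothetical stand-ins for the full pipeline.

```python
import random


def call_snps(reads, min_alt_reads=2, min_frequency=0.01):
    """Toy caller: a position is called when the alternative allele reaches
    both a read-count and a frequency threshold."""
    depth, alt = {}, {}
    for position, is_alt in reads:
        depth[position] = depth.get(position, 0) + 1
        alt[position] = alt.get(position, 0) + int(is_alt)
    return {p for p in depth
            if alt[p] >= min_alt_reads and alt[p] / depth[p] >= min_frequency}


def positive_predictive_value(subset_calls, reference_calls):
    """True positives / (true positives + false positives); None if no calls."""
    if not subset_calls:
        return None
    return len(subset_calls & reference_calls) / len(subset_calls)


def simulate(reads, fraction, rng):
    """Call SNPs on a random subset of reads and compare the calls with
    those obtained from the complete set of reads."""
    subset = rng.sample(reads, round(len(reads) * fraction))
    calls = call_snps(subset)
    return calls, positive_predictive_value(calls, call_snps(reads))


# Hypothetical reads: position 1 is polymorphic, position 2 is not.
reads = [(1, True)] * 5 + [(1, False)] * 95 + [(2, False)] * 100
rng = random.Random(0)  # fixed seed so that the sketch is reproducible
results = [simulate(reads, fraction, rng) for fraction in (0.25, 0.5, 0.75)]
```

Repeating the call over many random subsets at each sampling fraction gives the distribution of detected SNPs and of the positive predictive value as a function of coverage.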
Comparison between SNPs identified by pooled multiplexed NGS and individual Sanger sequencing was repeated in phase 2. The selected test set was composed as follows: all 12 pools for HCT1, pool 2 for CCR7b and CCR7c, and pool 12 for 4CL3b. The algorithm used for SNP detection in Sanger sequences was the same as in phase 1, but when identifying SNPs in short reads in HCT1, C3H3, CCR7 and 4CL3 (which showed approximately half of the coverage of CAD4), the number of reads required on each strand and the lower acceptable MIC of a region to be used for SNP calling were linearly scaled, and set to five and 25-fold, respectively.
To investigate the effect of frequency threshold on SNP calling, additional analyses were performed by varying the MAF threshold frequency from 0 to 1%, removing coverage thresholds, and requiring that only one occurrence of the SNP be found on the forward and reverse strand. When the MAF threshold was zero, the number of polymorphic positions corresponded to the number of bases in which an error was introduced on both strands during the whole sequencing process. The proportion of polymorphic positions was plotted as a function of the variant frequency used to define a polymorphism in each gene.
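The threshold scan can be sketched as follows, assuming that each reference position is summarized by its alternative-allele read counts on the two strands and its total read depth; the counts below are hypothetical.

```python
def proportion_polymorphic(sites, threshold):
    """Proportion of positions called polymorphic at a given variant
    frequency threshold, requiring at least one alternative-allele read
    on each strand and applying no coverage threshold.

    sites: list of (forward alt reads, reverse alt reads, total reads)
    """
    polymorphic = 0
    for fwd_alt, rev_alt, total_reads in sites:
        on_both_strands = fwd_alt >= 1 and rev_alt >= 1
        frequency = (fwd_alt + rev_alt) / total_reads
        if on_both_strands and frequency >= threshold:
            polymorphic += 1
    return polymorphic / len(sites)


# Hypothetical read counts at four positions.
sites = [(0, 0, 5000), (1, 1, 5000), (3, 0, 5000), (40, 35, 5000)]
curve = [(t, proportion_polymorphic(sites, t))
         for t in (0.0, 0.001, 0.005, 0.01)]
```

With a threshold of zero, every position carrying the alternative allele on both strands is counted, including those arising from sequencing errors.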
Nucleotide diversity is the mean number of per site differences between two randomly chosen DNA sequences (Nei and Li, 1979); here, it was estimated as the sum of the unbiased heterozygosity of segregating sites (Tajima, 1989), averaged over all nucleotides. Heterozygosity and 95% confidence limits of heterozygosity were calculated by re-sampling with replacement all pairs of polymorphic loci with probability corresponding to the frequency of the two alleles, and multiplying by n/(n−1), where n is the sample size of the pool, to obtain an unbiased estimate of nucleotide diversity (Futschik and Schlötterer, 2010). For each polymorphic position, 200 re-sampling experiments were performed, each of size 10 000. The neutrality hypothesis was tested by calculating Tajima’s D (Tajima, 1989); we chose confidence limits for Tajima’s D based on the β distribution for n = 1000 DNA sequences, as calculated by Tajima (1989). Statistical analyses were performed using R (http://www.r-project.org).
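The diversity estimate and its re-sampling interval can be sketched as follows. The analyses in this study were run in R; the Python sketch below is one possible reading of the re-sampling scheme, in which pairs of alleles are drawn at each polymorphic site according to the allele frequency. Function names, the percentile rule and the example values are assumptions.

```python
import random


def unbiased_heterozygosity(p, n):
    """Expected heterozygosity 2p(1 - p) at a biallelic site, multiplied
    by n / (n - 1), where n is the sample size of the pool."""
    return 2.0 * p * (1.0 - p) * n / (n - 1)


def nucleotide_diversity(alt_frequencies, n, sequence_length):
    """Sum of unbiased heterozygosity over segregating sites, averaged
    over all nucleotides of the sequence."""
    total = sum(unbiased_heterozygosity(p, n) for p in alt_frequencies)
    return total / sequence_length


def bootstrap_heterozygosity(p, n, replicates=200, pairs=10_000, rng=None):
    """Approximate 95% limits of heterozygosity at one site: in each
    replicate, draw pairs of alleles with probability p and count the
    pairs in which the two alleles differ."""
    rng = rng or random.Random()
    estimates = []
    for _ in range(replicates):
        differing = sum((rng.random() < p) != (rng.random() < p)
                        for _ in range(pairs))
        estimates.append(differing / pairs * n / (n - 1))
    estimates.sort()
    lower = estimates[int(0.025 * replicates)]
    upper = estimates[int(0.975 * replicates) - 1]
    return lower, upper


# Hypothetical example: two segregating sites in a 100 bp region, n = 64.
pi = nucleotide_diversity([0.5, 0.1], n=64, sequence_length=100)
```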