Comparing whole‐genome shotgun sequencing and DNA metabarcoding approaches for species identification and quantification of pollen species mixtures

Abstract Molecular identification of mixed‐species pollen samples has a range of applications in various fields of research. To date, such molecular identification has primarily been carried out via amplicon sequencing, but whole‐genome shotgun (WGS) sequencing of pollen DNA has potential advantages, including (1) more genetic information per sample and (2) the potential for better quantitative matching. In this study, we tested the performance of WGS sequencing methodology and publicly available reference sequences in identifying species and quantifying their relative abundance in pollen mock communities. Using mock communities previously analyzed with DNA metabarcoding, we sequenced approximately 200Mbp for each sample using Illumina HiSeq and MiSeq. Taxonomic identifications were based on the Kraken k‐mer identification method with reference libraries constructed from full‐genome and short read archive data from the NCBI database. We found WGS to be a reliable method for taxonomic identification of pollen with near 100% identification of species in mixtures but generating higher rates of false positives (reads not identified to the correct taxon at the required taxonomic level) relative to rbcL and ITS2 amplicon sequencing. For quantification of relative species abundance, WGS data provided a stronger correlation between pollen grain proportion and sequence read proportion, but diverged more from a 1:1 relationship, likely due to the higher rate of false positives. Currently, a limitation of WGS‐based pollen identification is the lack of representation of plant diversity in publicly available genome databases. As databases improve and costs drop, we expect that eventually genomics methods will become the methods of choice for species identification and quantification of mixed‐species pollen samples.


| INTRODUC TI ON
The identification of species in pollen mixtures has a variety of applications, including forensics, paleobotany, allergen monitoring, and pollination biology. Identification of pollen by microscopy is limited as there are few experts on palynology (Rahl, 2008), many taxa cannot be identified to the species level based on pollen morphology (Khansari et al., 2012;Salmaki et al., 2008), and the methods are timeconsuming. Recent studies have used DNA metabarcoding to overcome these disadvantages. DNA metabarcoding uses high-throughput sequencing methods to simultaneously sequence PCR-amplified DNA of one or two molecular markers from all species in a mixture (Bell et al., 2016;Cristescu, 2014). DNA metabarcoding approaches have several advantages over microscopic identification, including higher taxonomic resolution in many instances, greater availability of relevant expertise, and the ability to process many more grains per sample and with higher throughput (Bell et al., 2019). Pollen DNA metabarcoding has been used for monitoring honey quality (Hawkins et al., 2015), determining the foraging patterns of bees Richardson, Lin, Sponsler, et al., 2015) and other pollinating insects (Lucas et al., 2018), monitoring allergenic pollen (Brennan et al., 2019;Kraaijeveld et al., 2014), and examining historical flower visitation (Gous et al., 2019).
Despite the successful use of pollen DNA metabarcoding, and its advantages relative to microscopic identification, metabarcoding has limitations. Species-level identification may be impeded by a lack of divergence between related species in the DNA barcoding markers, while detection and quantification may be hampered by biases that favor certain species (Bell et al., 2019). Different species have different DNA isolation yields and vary in organellar or ribosomal genome copy number, which could lead to biases in DNA quantity going into PCRs (Kembel et al., 2012;Lamb et al., 2019;Pawluczyk et al., 2015). Primers for PCR may differ in their binding efficiencies to different species (Pompanon et al., 2012), or polymerases may be biased toward different nucleotide composition (Nichols et al., 2018), leading to PCR biases.
Whole-genome shotgun (WGS) sequencing is an approach that could improve both taxonomic resolution and quantification in the molecular identification of pollen. In terms of taxonomic resolution, WGS sequences many more loci than DNA metabarcoding, even with very low coverage, generating the potential for much finer taxonomic resolution. In terms of quantification, WGS approaches do not require PCR and do not target particular gene regions, eliminating amplification bias and potentially reducing copy number bias (Bista et al., 2018). WGS methods are increasingly used to analyze the species composition and functional profiling of microbiomes (Sharpton, 2014;Venter et al., 2004) and, more recently, eukaryotes, including soil invertebrate communities (Andújar et al., 2015), herbivore diets (Chua et al., 2021;Srivathsan et al., 2014), organisms in honey (Bovo et al., 2018), and ancient plant communities (Parducci et al., 2019).
The quantitative accuracy (e.g., Morgan et al., 2010) and species detection ability (Ranjan et al., 2016) of WGS has been investigated for prokaryote communities and more recently eukaryotic communities (Bista et al., 2018;Garrido-Sanz et al., 2020;Gómez-Rodríguez et al., 2015;Ji et al., 2019;Tang et al., 2014Tang et al., , 2015. However, few have examined the performance of WGS in identification and quantification of species in pollen mixtures, and most of these used only the small proportion of sequences from the plastid genome (Lang et al., 2019).
Furthermore, there are two key limitations of WGS-based pollen species identifications that have not been tested: Very few plants have had their whole genomes sequenced and publicly available genome sequence data may contain errors (Breitwieser et al., 2019).
In this study, we test the ability of WGS sequencing and current publicly available reference sequences to identify taxa and quantify their proportions in pollen mixtures. We specifically examined (1) the proportions of false-negative identifications, that is, taxa present in the sample that were not identified; (2) the proportions of false-positive identifications, that is, reads that were identified to taxa not present in the sample; and (3) quantitative matching, that is, the correspondence between the proportion of pollen grains in a sample and the proportion of sequence reads matching the respective plant species.
We also examined the effect of sample complexity (in terms of species richness, relatedness of taxa, and rarity of pollen grains in a sample) on false-negative and false-positive identifications. Finally, we compared WGS to DNA metabarcoding in terms of performance in identification and quantification. Considering the current limitations of reference genome availability, completeness, and quality, we expected to find poorer taxonomic identification relative to metabarcoding. However, given the multiple sources of quantitative bias in amplicon sequencing, we expected to find improved quantification with WGS.

| Overview
Our overall approach-described in more detail in the subsequent paragraphs-was based on shotgun sequencing carefully constructed mixtures of pollen (a "mock community"). We had previously studied the efficacy of DNA metabarcoding with the same pollen mixtures (Bell et al., 2019), and to enhance interpretability, we conducted whole-genome shotgun sequencing on the exact same DNA isolations that we had previously used for amplicon sequencing. We used the Kraken2 bioinformatics pipeline (Wood et al., 2019), which implements a k-mer approach to taxonomically classify sequencing reads relative to a reference database, and subsequently analyzed the classified reads to assess the performance of WGS in terms of (1) falsenegative reads; (2) false-positive reads; and (3) quantitative matching; both on its own and also relative to amplicon-based methods.

| Pollen mock communities
The mock communities of pollen we sequenced are described in full detail in Bell et al. (2019). We designed the samples to vary in (1) species richness (1-9 species); (2) relatedness of taxa within samples (from two species in the same genus, to species in widely disparate orders, and including all of the major angiosperm lineages); and (3) rarity of taxa within samples, ranging from approximately 5%-95% of pollen grains in a sample. The pollen mixtures were carefully quantified via microscopy, with several observers, to ensure that we were able to assess quantitative matching with high confidence. The nine species included in the mixtures cover a broad spectrum of the flowering plant phylogeny, including monocots and all subclasses of eudicots. We included 1-9 species in the mixtures as this represented the typical range of species richness in a pollen sample taken from an individual pollinator. Pollen mixtures were made from high-purity pollen purchased from pharmaceutical companies, to minimize contamination with DNA from other organisms. Details of the suppliers and pollen mixtures are shown in Tables 1 and 2. We extracted DNA from ~1,000,000 pollen grains for each sample; a quantity similar to what might be expected on the corbicula of a honeybee or a pooled sample from the bodies of multiple pollinating insects. The DNA isolation methods were described in full in Bell et al. (2019) and used the FastDNA™ Spin Kit for Soil (MP Biomedicals) with minor modifications as described in Bell et al. (2017). These DNA extractions were previously analyzed with DNA metabarcoding, based on rbcL and ITS2 (Bell et al., 2019). For rbcL, the primers rbcL2 (Palmieri et al., 2009) and rbcLa-R (Kress & Erickson, 2007) were used. For ITS2, the primers ITS2 S2F and ITS2 S3R (Chen et al., 2010) were used.
Illumina MiSeq libraries for DNA metabarcoding were prepared using Nextera XT dual-index barcodes (Illumina) and run in a single flow cell on a 600-cycle run of the MiSeq instrument at the Emory Integrated Genomics Core (EIGC). Taxonomic assignments were determined with the bioinformatics pipeline of Sickel et al. (2015), using the RDP classifier (Wang et al., 2007), and previously compiled and trained databases for ITS2 (Sickel et al., 2015) and rbcL (Bell et al., 2017), with the addition of relevant sequences that had become available more recently. For full details of the DNA metabarcoding analyses, see Bell et al. (2019).

| Sample preparation and sequencing
We conducted whole-genome shotgun sequencing of our samples in two groups, a first "pilot" group sequenced at the Emory Integrated Genomics Core (EIGC) on Illumina HiSeq and a second "full" group sequenced at the Georgia Genomics Facility (GGF) on Illumina MiSeq. While we recognize that ideally all sequencing would have been done in a single location with the same sequencing protocol and chemistry, logistical and financial constraints precluded this possibility. The first set of samples consisted of pollen mixtures 1 and 2 and single-species Carya illinoinensis (Wangenh.) K. Koch (Juglandaceae), sequenced on an Illumina HiSeq at EIGC, with 2 M paired reads of 100 bp run for each sample from two flow cells, with library preparation using the Nextera XT method. Following successful results with the sample subset, we prepared mixtures 3-11 for sequencing by GGF using the Nextera XT library preparation method (Illumina).
We also prepared single-species samples (from the set of species included in the mixtures), as well as negative controls. Again, to ensure comparability of the WGS data with our earlier amplicon-based study (Bell et al., 2019), we used the exact same isolated DNA. We were run across two replicate flow cells. The samples run by GGF were run on a single flow cell. We analyzed data from the two replicate flow cells separately. These were treated as replicates in statistical analyses, so we had two replicate datasets for each sample run by EIGC and a single dataset for each sample run by GGF.

| Bioinformatics
Taxonomic identification of WGS sequences was conducted using Kraken2 (Wood et al., 2019). This program compares short sequence substrings (k-mers) from the sample sequences to a genomic reference database, returning the lowest common ancestor of matches in the database. A Kraken2 (Wood et al., 2019)  RefSeq database (O'Leary et al., 2016). Additional plant species were included from GenBank (Clark et al., 2016) and the Sequence Read Archive (Leinonen et al., 2011) (SRA), to ensure that all species in pollen mixtures were represented (Table 3). Genome assemblies were downloaded from GenBank with ncbi-genome-download (Blin, n.d The tools used in this analysis are each available from Bioconda (Grüning et al., 2018). The commands and results are available at https://github.com/Brosi -Lab/Kraken.

| Overview and analysis commonalities
Our analysis focused on three outcomes: 1) false-negative identifications, that is, failure to identify a taxon that was present in a sample; 2) false-positive identifications, that is, incorrect identification of taxa not present in a sample; and 3) quantitative matching, that is, the relationship between the proportion of pollen grains in a sample and the proportion of sequence counts corresponding to that taxon in our output. For each of these outcomes, we assessed WGS on its own and separately compared the performance of WGS with our previous amplicon results (again from the exact same DNA extractions). For the qualitative outcomes (false negatives and positives), we assessed the effect of sample complexity on matching, including the constructed pollen mixtures described above, specifically constructed to vary in species richness, relatedness of plant taxa, and rarity.
We conducted all statistical analyses in the R language for statisti- The response variables for our first two outcomes were binomial (yes / no) in structure. For our first outcome of false negatives, we needed to record-for each species present in a pollen mixturewhether or not that species was identified in that sample. To do this, we set up a data file with each species truly present within each sample as its own row, which we subsequently scored as 0 / 1, with a zero for species that were present in the sample but not identified in sequencing reads above the contamination threshold, and a one for species that were identified in the sequencing reads above the threshold.

| Quantitative accuracy
To assess the quantitative accuracy of WGS sequencing for our constructed mixtures, we tested the correlation between the (known) proportion of pollen grains in a sample (Bell et al., 2019) and the proportion of DNA sequencing reads (i.e., the proportion of reads assigned to a taxon at the taxonomic level of interest relative to the total classified sequencing reads). We used a linear mixed-effects model, implemented with the "lmerTest" package in R. The response variable was the proportion of sequencing reads, and the explanatory variable was proportion of input pollen grains. In parallel with our qualitative analyses, we used mixture identity and species as crossed random-intercepts effects. This analysis was conducted separately for identifications at the level of species, genus, and family. were retained per sample after quality filtering and classification.

| Overview
Therefore, a contaminant threshold of 5,048 was used for retaining taxonomic identifications in samples.

| False negatives
A list of the species identified from at least 1% of sequencing reads in each sample is available in Appendix S1. In all mixtures except for the sample containing B. papyrifera alone, Kraken2 analysis of WGS identified all taxa in the mixture to the species level (Table 5; Figure 1). The sample containing B. papyrifera alone contained only 7 reads, which was below the contamination threshold. We were unable to conduct mixed-effects modeling of the effect of sample complexity on true positive detection, due to the almost 100% success rate.
Whether WGS using our full or partial genome reference database was more or less effective than amplicon sequencing using the more complete locus-specific barcode reference databases for the detection and identification of pollen taxa in a mixture depended on the taxonomic level of identification and the barcode used for DNA metabarcoding (Table 6). In comparison to either rbcL or ITS2 alone, the WGS method identified significantly more taxa correctly at both the species and the genus level. After combining rbcL and ITS2 identifications, DNA metabarcoding still performed significantly worse than WGS for the identification of taxa at the species levels, but there was no significant difference at the genus level.

| False positives
False positives occurred in all samples (Appendix S1). Species-level false-positive identifications usually occurred as less than 1% of the total reads in the sample. Excluding the Broussonetia papyrifera single-species sample, which had very few sequencing reads, only three false-positive species were identified from greater than 1% of the total reads. Helianthus annuus L. was identified in the Ambrosia artemisiifolia L. single-species sample at 17.08% of total reads, the  Figure 2).
Comparing WGS to amplicon sequencing, we found that the two approaches did not produce significantly different rates of falsepositive reads at the species level, irrespective of the marker used (rbcL, ITS2, or both markers combined) (Table 8). At the genus level, ITS2 amplicon sequencing generated significantly fewer false positives than WGS. When the two amplicon markers were combined, there was no significant difference in false-positive rate between amplicons and WGS.

| Quantitative matching
The proportion of WGS sequencing reads identified to a particular taxon was usually less than the proportion of pollen grains of that taxon in the mixture (Figure 3), as is expected given the presence of false-positive reads. We found that the proportion of DNA sequencing reads for each taxon increased with an increasing proportion of pollen grains within a mixture at the genus (R 2 = .62, p < .000001) and species (R 2 = .60, p = .0007) levels, but the slope of the regression was well below 1 (0.46 for both genus and species).
Quantitatively, WGS provided a better fit than amplicons in terms of the strength of correlation between pollen grain proportion and read proportion, regardless of the barcode or taxonomic TA B L E 4 Number of sequencing reads and Kraken2 k-mer fragments retained for analysis WGS sequencing data of pollen samples through processing steps of quality filtering and classification

| D ISCUSS I ON
We tested the ability of WGS to identify taxa and quantify their relative abundances in pollen mock communities, using ~600,000-700,000 sequencing reads per sample and Kraken2-based taxonomic identification with publicly available reference sequences and compared this to our previous DNA metabarcoding analyses.
We found our WGS method to be almost 100% successful in identifying known species in mixtures. Rates of false negatives (failure to detect and identify taxa that were present) and false positives (reads that were not identified as taxa that were present) in WGS data were not sensitive to mixture complexity. In comparison to DNA metabarcoding, WGS performed better in terms of false negatives and worse in terms of false positives. We found WGS to give improved quantification of the proportions of taxa when compared to metabarcoding, although the relationship between input pollen

| Species-level identification
Compared to DNA metabarcoding, our WGS sequencing method generated fewer false negatives. WGS generated more falsepositive reads, though this difference was not statistically significant for most comparisons. There have been few studies assessing the detection and identification ability of WGS relative to amplicon sequencing, and most of these are not directly comparable to our study due to differences in experimental design. Bista et al. (2018) found better qualitative detection with WGS compared to amplicon sequencing of COI for mixtures of invertebrate animals. However, their study used mitochondrial data only (<1% of sequencing reads), and their reference library was custom-made for their study using only the species of interest and assembling the full mitochondrion.
In contrast, our study uses nuclear, chloroplast, and mitochondrial genomes, with a reference database of all available RefSeq genomes.
Other studies using WGS methods for pollen identification have also found that the ability to identify all species in a mixture is close to 100% (Lang et al., 2019;Peel et al., 2019). It is not possible to compare our false-positive rate to either of these studies based on differences in the way identifications to taxonomic levels other than species or genus are treated (i.e., removed vs treated as false positives). It is likely that our false-positive rate is similar to these studies as there were few false-positive species-level identifications occurring at greater than 1% in any sample.
Differences in the false-positive rate between WGS and DNA metabarcoding could be explained by the relative quality and completeness of the reference databases. A recent study aiming to analyze plant diet from fecal samples encountered a high false-positive rate using whole-genome reference sequences due to missing species, which limited the use of WGS data to a few well-represented loci (Chua et al., 2021). Although all the species included in our mixtures were present in our reference database for WGS, many were downloaded from NCBI as unassembled SRA data and may have had lower coverage and more errors than RefSeq genome assemblies.
Genome assembly data may contain contaminants (e.g., fungi, bacteria, human; Breitwieser et al., 2019) that would not amplify with the kingdom-specific primers typically used for DNA metabarcoding, but would affect taxonomic matching for WGS methods. As more genome sequence data become available and reference genome quality improves, false-negative and false-positive reads in WGS are likely to diminish, as has been recorded for bacterial metagenomics (Nasko et al., 2018).

| Quantification of relative species abundances
Several studies have found the relationship between DNA metabarcoding sequence reads and pollen proportion to deviate substantially from a 1:1 relationship, especially with ITS2 (Bell et al., 2019;Richardson, Lin, Sponsler, et al., 2015;Smart et al., 2017).
Using WGS, we detected a stronger correlation between the proportion of pollen grains of a taxon in a mixture and the proportion of sequences assigned to that taxon, relative to DNA metabarcoding. At the same time, we also found that the WGS sequence reads departed more strongly from the "true" 1:1 input pollen grain to output sequence proportion relationship relative to amplicon-based methods. Comparison of WGS of chloroplast genomes with DNA metabarcoding for the identification of pollen mixtures has found improved quantification with WGS (Lang et al., 2019), and likewise for comparison of WGS of mitochondrial genomes with amplicon sequencing of animal mixtures (Bista et al., 2018), and this quantification can be further improved by correcting for mitochondrial TA B L E 6 Binomial mixed model to assess if method (WGS or amplicon sequencing) has a significant effect on the ability to identify true positives in a pollen mixture  There are at least three reasons why WGS may only be semiquantitative. First, pollen grains vary in their DNA quantity due to variation in genome size, which can differ by orders of magnitude among flowering plant species (Soltis et al., 2003). Kraken2 was not designed for quantification and does not correct for genome size bias, so some of the unexplained variance in our analyses may be due to this. This could be corrected with knowledge of genome sizes, though variation in ploidy within taxa (Kolář et al., 2017) could complicate such corrections. Second, the proportion of sequencing reads identified to a taxon was always less than the true proportion of pollen grains for that taxon in the sample due to false positives.
Improved coverage and quality of whole-genome reference databases is likely to reduce the false-positive rate and improve quantification in the future. Third, as with amplicon sequencing, our DNA extractions may have been affected by variation among species in the effectiveness of the extraction. This source of bias could be corrected with a database of relative DNA extraction efficiencies.

| Present feasibility of WGS and future research directions
Our results highlight the potential of WGS as a method for identi- These three disadvantages are likely to be minimized with future developments. Currently, the most important requirement to make WGS identification of pollen mixtures feasible is improved reference sequence databases. While the current number of publicly available high-quality genome assemblies represent only a small proportion of plant species diversity, it is likely that sequencing rates will increase with new initiatives such as the Earth BioGenome Project, which aims to have sequenced the genomes of most eukaryote species within a decade (Lewin et al., 2018). New methods using long sequencing reads and newly developed bioinformatics pipelines will increase the rate at which plant genomes can be assembled (Driguez et al., 2021). Database "cleaning" will be possible when there are more near-complete genomes from a wider range of species, and new bioinformatics methods for removing bacterial contaminants from eukaryotic genome assemblies show promise (Fierst & Murdock, TA B L E 8 Binomial mixed model to assess if method (WGS or amplicon sequencing) has a significant effect on the proportion of false-positive sequencing reads identified in a pollen mixture. The direction of the trend was always in favor of amplicon sequencing 2017). With future increases in availability and quality of reference genome sequences, WGS will become feasible for the identification and quantification of pollen in most applications.

| CON CLUS IONS
The limitations of DNA metabarcoding mean that alternatives need to be developed. We have demonstrated that WGS is a suitable method for identification and quantification of pollen grains in mixtures, although it may not currently be practical. The weaknesses of WGS are surmountable in the long-term, particularly as the number of publicly available full-genome sequences increases. Increased reference sequence availability will enable WGS to identify species (or taxa below the species level) that are not uniquely identifiable via DNA barcoding and allow for improved quantification of the proportions of species in a mixture. This higher level of precision would allow for finer geographic precision in forensic applications, improved understanding of pollination biology at the plant population level, and more accurate assessments of food origins and quality.

F I G U R E 3
Relationship between the proportion of pollen grains in a mixture belonging to a particular taxon and the proportion of wholegenome shotgun sequencing reads matching that taxon: (a) genus-level identifications; and (b) species-level identifications