Biodiversity Soup II: A bulk-sample metabarcoding pipeline emphasizing error reduction

Despite widespread recognition of its great promise to aid decision-making in environmental management, the applied use of metabarcoding requires improvements to reduce the multiple errors that arise during PCR amplification, sequencing, and library generation. We present a co-designed wet-lab and bioinformatic workflow for metabarcoding bulk samples that removes both false-positive (tag jumps, chimeras, erroneous sequences) and false-negative (‘drop-out’) errors. However, we find that it is not possible to recover relative-abundance information from amplicon data, due to persistent species-specific biases. To present and validate our workflow, we created eight mock arthropod soups, all containing the same 248 arthropod morphospecies but differing in absolute and relative DNA concentrations, and we ran them under five different PCR conditions. Our pipeline includes qPCR-optimized PCR annealing temperature and cycle number, twin-tagging, multiple independent PCR replicates per sample, and negative and positive controls. In the bioinformatic portion, we introduce Begum, which is a new version of DAMe (Zepeda Mendoza et al. 2016. BMC Res. Notes 9:255) that ignores heterogeneity spacers, allows primer mismatches when demultiplexing samples, and is more efficient. Like DAMe, Begum removes tag-jumped reads and removes sequence errors by keeping only sequences that appear in more than one PCR at above a minimum copy number. We report that OTU dropout frequency and taxonomic amplification bias are both reduced by using a PCR annealing temperature and cycle number on the low ends of the ranges currently used for the Leray-Fol-Degen-Rev primers. We also report that tag jumps and erroneous sequences can be nearly eliminated with Begum filtering, at the cost of only a small rise in drop-outs. We replicate published findings that uneven size distribution of input biomasses leads to greater drop-out frequency and that OTU size is a poor predictor of species input biomass. Finally, we find no evidence that different primer tags bias PCR amplification (‘tag bias’). To aid learning, reproducibility, and the design and testing of alternative metabarcoding pipelines, we provide our Illumina and input-species sequence datasets, scripts, a spreadsheet for designing primer tags, and a tutorial.

. Four classes of metabarcoding errors and their causes. Not included are software bugs, general laboratory and field errors like mislabeling, DNA degradation, and sampling biases or inadequate effort.
Incorrect classification of an OTU as a prey item when it was in fact consumed by another prey species in the same gut Hardy et al. 2017 Numts ( sequences that appear in more than one (or a low number of) PCR replicate(s) at above a 97 minimum copy number per PCR, albeit at a cost of also filtering out some true sequences. 98 Begum improves on DAMe by ignoring heterogeneity spacers in the amplicon, allowing 99 primer mismatches during demultiplexing, and by being more efficient 100 (http://github.com/shyamsg/Begum). (1) the frequency of false-negative OTUs ('drop-outs', i.e. unrecovered input species), 107 (2) the frequency of false-positive OTUs ('drop-ins'), 108 (3) the recovery of frequency information (does OTU size per species predict input 109 DNA amount per species?), and 110 (4) taxonomic bias (are some taxa more or less likely to be recovered?).

111
The highest efficiency would be achieved by recovering all and only the input species, in 112 their original frequencies. We show that metabarcoding efficiency is highest with a PCR 113 cycle number and annealing temperature at the low ends of the ranges currently used in 114 metabarcoding studies, that Begum filtering nearly eliminates false-positive OTUs (drop-ins), 115 at the cost of only a small absolute rise in drop-out frequency, that greater species evenness 116 and higher absolute concentrations reduce drop-outs (replicating Elbrecht, Peinert & Leese 117 2017), and that OTU sizes are not reliable estimators of species-biomass frequencies. We also 118 find at most only small effect sizes for 'tag bias,' which is the hypothesis that the sample-119 identifying nucleotide sequences attached to PCR primers might promote annealing to some 120 template-DNA sequences over others (e.g. Berry et al. 2011;O'Donnell et al. 2016), 121 exacerbating taxonomic bias in PCR. All these results have important implications for using 122 metabarcoding as a biomonitoring tool. 123 To test our pipeline, we created eight mock arthropod soups, all containing the same 248 arthropod taxa but differing in absolute and relative DNA concentrations, ran them under five different PCR conditions, and used Begum to filter out erroneous sequences (Fig. 1). We then quantified the efficiency of species recovery from bulk arthropod samples, as measured by four metrics: Fig. 1. Schematic of study. A. Twin-tagged primers with heterogeneity spacers (above) and final amplicon structure (below). B. Each mock soup (a 'sample') is PCR-amplified three times (#1, #2, #3), each with a different tag, following the Begum strategy. There are eight mock soups (mmmm/hhhl/Hhml/hlll X body/leg), and PCR replicates #1 from each of the eight mock soups are pooled into the first amplicon pool, PCR replicates #2 are pooled into the second amplicon pool, and so on. The setup in B. was itself replicated 8 times for the 8 PCR conditions (A-H), which generated (3 X 8 =) 24 pools in total. C. Key steps of the Begum bioinformatic pipeline. For clarity, primers and heterogeneity spacers not shown.  design 100 unique tags of 7 nucleotides in length in which no nucleotide is repeated more 154 than twice, all tag pairs differ by at least 3 nucleotides, no more than 3 G and C nucleotides 155 are present, and none ends in either G or TT (to avoid homopolymers of GGG or TTT when 156 concatenated to the Leray-Geller primers). We increased diversity by adding one or two 157 'heterogeneity spacer' nucleotides to the 5' end of the forward and reverse tagged primers 158 (De Barba et al. 2014;Fadrosh et al. 2014). The total amplicon length including spacers, 159 tags, primers, and markers was expected to be ~382 bp. The primer sequences are listed in 160 Supp. Info. S4_Table 1. 162 We ran test PCRs using the Leray-Geller primers with an annealing temperature (Ta) gradient 163 of 40 to 64°C. Based on gel-band strengths, we chose an 'optimal' Ta of 45.5°C (clear and 164 unique band on an electrophoresis gel) and a 'high' Ta value of 51.5 °C (faint band) to 165 compare their effects on species recovery. 166 We followed Murray, Coghlan and Bunce (2015) (see also Bohmann et al. 2018) and first ran 167 the eight mock soups through qPCR to detect PCR inhibition and establish the appropriate 168 template amount, to assess extraction-negative controls, and to estimate the minimum cycle 169 number needed to amplify the target fragment across samples. qPCR amplification curves 170 indicated that 10X dilution of sample extracts was optimal to avoid PCR inhibition, and we 171 observed that the end of the exponential phase was achieved at 25 cycles, which we define 172 here as the 'optimal' cycle number. To test the effect of PCR cycle number on species 173 recovery, we also tested a 'low' cycle number of 21 (i.e. stopping amplification during the 174 exponential phase), and a 'high' cycle number of 30 (i.e. amplifying into the plateau phase replicates.

187
Following the Begum strategy, for each of the PCR conditions, each mock soup was PCR-188 amplified three times, each time using a different tag sequence. The same tag sequence was 189 added to the forward and reverse primers, which we call 'twin-tagging' (e.g. F1-R1, F2-R2,  Begum is available at https://github.com/shyamsg/Begum (accessed 28 March 2020). First, 206 we used Begum's sort.py (-pm 2 -tm 1) to demultiplex sequences by primers and tags, add 207 the sample information to header lines, and strip the spacer, tag, and primer sequences.

208
Sort.py reports the number of sequences that have novel tag combinations, representing tag-209 jumping events (mean 3.87%). We then used Begum's filter.py to remove sequences < 300 bp 210 and to filter out erroneous sequences. We filtered at three levels of stringency: 1) no 211 filtering, 2) retaining sequences that appeared in minimum 2 of 3 PCR replicates with 212 minimum 4 copies per PCR, and 3) retaining sequences that appeared in minimum 3 of 3 213 PCR replicates with minimum 3 copies per PCR.  Tag-bias test 247 We took advantage of the technical replicates in PCR sets A&B, C&D, and G&H (Table 3). PCRs that used the same tags (n = 3) and different tags (n = 12).

255
The 18 libraries containing PCR sets A-F yielded 9,223,624 paired-end reads, and the 6 256 libraries of PCR sets G&H yielded 7,004,709 paired-end reads.

Effects of Begum filtering and PCR condition 258
The PCR conditions that resulted in the lowest drop-out frequencies were optimal Ta + 259 optimal or low cycle numbers (PCRs A, B, E,  (Table 3). Applying Begum filtering at either of 263 the two stringency levels reduced drop-in frequencies by ~20-40 times. Importantly, the 264 application of Begum filtering also caused drop-out frequencies to increase, but only by a 265 small absolute amount (from 3-4% to 6-9%) as long as the PCR conditions were also optimal 266 (A, B, E). In short, it is possible to combine wet-lab and bioinformatic protocols to 267 simultaneously reduce false-negative and false-positive errors.

Effects of input-DNA absolute and relative concentrations on OTU recovery 269
Altering either the absolute (body/leg) or relative (Hhml, hhhl, hlll, and mmmm) input DNA 270 concentrations created quantitative compositional differences in the OTU tables (Fig. 2).

271
Soup hlll, with the most uneven distribution of input DNA concentrations, recovered the 272 fewest OTUs (Fig. 2). The same effect can be seen by regressing the number of recovered  Table 3. Species-recovery success by Begum filtering stringency level and PCR condition, using the mmmm_body soup. Recovered species are OTUs that match one of the 248 reference species at ≥97% similarity. Drop-outs are false-negatives and are defined as reference species that fail to be matched by an OTU at ≥97% similarity. Drop-ins are false-positives and are defined as OTUs that fail to match any reference species at ≥97% similarity. There are three levels of Begum filtering, starting with no filtering in the top section (accepting reads that are present in ≥1 PCR and ≥1 copy). Begum filtering reduces Drop-in frequencies by ~20-40 times (dark to light orange cells), at the cost of a small absolute rise in Drop-out frequency (grey to blue cells), provided that PCR conditions are optimal (PCRs A, B, E). With non-optimal PCR conditions (PCRs C, D, F, G, H: High Ta or cycle number, Touchdown), there is a trade-off: filtering to reduce false positives also increases false negatives (blue cells get darker on the lower right). See Effects of Begum filtering and PCR condition for details. Table S_DROPS shows the number of OTUs remaining after each major step of our pipeline. The dark box indicates best-performing combinations of Begum filtering and PCR conditions. Optimal Ta is 45.5°C, and high is 51.5°C. Optimal cycle number is 25, low is 21, and high is 30.

Begum filtering stringency
Optimal Ta

PCR conditions
High Ta + Optimal cycle number Fig. 2. Non-metric multidimensional scaling (NMDS) ordination of the eight mock soups, which differ in absolute (Body/Leg) and relative (Hhml, hhhl, hlll, and mmmm) DNA concentrations of the input species (Table 2). Shown here is the output from the PCR A condition (Table 3): optimum annealing temperature Ta (45.5 C) and optimum cycle number (25). Point size is scaled to the number of recovered OTUs. Species recovery is reduced in samples characterized by more uneven species frequencies (e.g. hlll) and, to a much lesser extent, lower absolute DNA input (leg). Optimal PCR conditions (Ta + cycle number) produce larger OTUs than do non-optimal PCR 281 conditions (high Ta, high cycle number, and Touchdown), especially for Hymenoptera,

282
Araneae, and Hemiptera (Fig. 3). These taxa are therefore at higher risk of failing to be 283 detected by the Leray-Geller primers under sub-optimal PCR conditions.

284
Tag-bias test 285 We found no evidence for tag bias in PCR amplification. reduced when species DNA frequencies are uneven (Fig. 2, Fig. S1), and we found that OTU 304 sizes are a poor predictor of input genomic DNA, which confirms that OTU size is a poor

307
In this study, we tested our pipeline with eight mock soups that differed in their absolute and relative DNA concentrations of 248 arthropod taxa (  (Table 3), and we filtered the OTUs under different stringencies ( Fig. 1, Table 3). We define high efficiency in metabarcoding as recovering most of sample's compositional and quantitative information, which in turn means that both drop-out and drop-in frequencies are low, that OTU size predicts input DNA amount per species, and that any drop-outs are spread evenly over the taxonomic range of the target taxon (here, Arthropoda).
Green branches indicate that PCR A (right side) produced relatively larger OTUs in those taxa. Brown branches indicate that PCR A produced smaller OTUs. There are, on balance, more dark green branches than dark brown branches in the three heat trees that compare PCRs C, F, and G (suboptimal conditions) with A (optimal conditions), and the green branches are concentrated in the Araneae, Hymenoptera, and Lepidoptera, suggesting that these are the taxa at higher risk of failing to be detected by Leray-Geller primers under sub-optimal PCR conditions. Shown here are the mmmmbody soups, at Begum filtering stringency ≥2 PCRs, ≥4 copies per PCR.

A2 vs B3
Dimension 1 Dimension 2  (Table 3). To decide on a filtering stringency level, we therefore 324 recommend using complex positive controls and to take into account one's study's aims. If 325 the aim is to detect a particular taxon, like an invasive pest, it is better to set stringency low to

338
The Begum workflow co-designs the wet-lab and bioinformatic components (Fig. 1) to remove multiple sources of error (Table 1). Twin-tagging allows removal of tag jumps, which result in sequence-to-sample misassignments. Multiple, independent PCRs per sample allow removal of false-positive sequences caused by PCR and sequencing error and by low-level contamination, at the cost of only a small absolute rise in false-negative error (Table 3).
qPCR optimization reduces false negatives caused by PCR runaway, PCR inhibition, and annealing failure (Table 3, Fig. 4). Moderate dilution appears to be a better solution for inhibition than is increasing cycle number, since the latter increases drop-outs (Table 3).
qPCR also allows extraction blanks to be screened for contamination. Size sorting (Elbrecht, Peinert & Leese 2017) should also reduce false negatives, but our finding that leg-only soups have more missing OTUs (Fig. 2) suggests that large insects should be replaced by heads not legs.     Fig. 2. Non-metric multidimensional scaling (NMDS) ordination of the eight mock soups, which differ in absolute (Body/Leg) and relative (Hhml, hhhl, hlll, and mmmm) DNA concentrations of the input species (Table 2). Shown here is the output from the PCR A condition (Table 3): optimum annealing temperature Ta (45.5 C) and optimum cycle number (25). Point size is scaled to the number of recovered OTUs. Species recovery is reduced in samples characterized by more uneven species frequencies (e.g. hlll) and, to a much lesser extent, lower absolute DNA input (leg).     (Table 3). The mmmm soup was omitted because it has no variance. , and within each pair, the same eight tag sequences were used in PCR, and thus each pair should recover the same communities. Bottom. All pairwise Procrustes comparisons of PCRs C and D. The top row (box) displays the three same-tag pairwise correlations. The other rows display all the different-tag pairwise correlations. If there is tag bias during PCR, the top row should show a greater degree of similarity. Mean correlations are not significantly different between sametag and different-tag ordinations (Mean of same-tag Procrustes correlations: 0.989 ± 0.010 SD, n = 3. Mean of different-tag correlations: 0.980 ± 0.012 SD, n = 12. p<0.067). Circle size is scaled to number of OTUs, and colors indicate tissue source (leg or body). Mock soups with the same concentration evenness (e.g. Hhml) are linked by lines. , and within each pair, the same eight tag sequences were used in PCR, and thus each pair should recover the same communities. Pair G1 & H1 cannot be compared because H1 is missing one sample. Bottom. All pairwise Procrustes comparisons of PCRs G and H. The top row (box) displays the two pairwise comparisons between the technical replicates.

A2 vs B3
The other rows display all the different-tag pairwise correlations. If there is tag bias during PCR, the top row should show a greater degree of similarity. Mean correlations are not significantly different between same-tag and different-tag ordinations, but we note that only two same-tag comparisons could be made (Mean of same-tag Procrustes correlations: 0.910 ± 0.038 SD, n = 2. mean correlation of different-tag correlations: 0.943 ± 0.050 SD, n = 6. p=0.419). Circle size is scaled to number of OTUs, and colors indicate tissue source (leg or body). Mock soups with the same concentration evenness (e.g. Hhml) are linked by lines.