Modelling technical and biological biases in macroinvertebrate community assessment from bulk preservative using multiple metabarcoding markers

Abstract DNA metabarcoding from the ethanol used to store macroinvertebrate bulk samples is a convenient methodological option in molecular biodiversity assessment and biomonitoring of aquatic ecosystems, as it preserves specimens and reduces problems associated with sample sorting. However, this method may be affected by errors and biases, which need to be thoroughly quantified before it can be mainstreamed into biomonitoring programmes. Here, we used 80 unsorted macroinvertebrate samples collected in Portugal under a Water Framework Directive monitoring programme, to compare community diversity and taxonomic composition metrics estimated through morphotaxonomy versus metabarcoding from storage ethanol using three markers (COI‐M19BR2, 16S‐Inse01 and 18S‐Euka02) and a multimarker approach. A preliminary in silico analysis showed that the three markers were adequate for the target taxa, with detection failures related primarily to the lack of adequate barcodes in public databases. Metabarcoding of ethanol samples retrieved far less taxa per site (alpha diversity) than morphotaxonomy, albeit with smaller differences for COI‐M19BR2 and the multimarker approach, while estimates of taxa turnover (beta diversity) among sites were similar across methods. Using generalized linear mixed models, we found that after controlling for differences in read coverage across samples, the probability of detection of a taxon was positively related to its proportional abundance, and negatively so to the presence of heavily sclerotized exoskeleton (e.g., Coleoptera). Overall, using our experimental protocol with different template dilutions, the COI marker showed the best performance, but we recommend the use of a multimarker approach to detect a wider range of taxa in freshwater macroinvertebrate samples. Further methodological development and optimization efforts are needed to reduce biases associated with body armouring and rarity in some macroinvertebrate taxa.


Table of Contents: Supplementary Tables
Page 2

Supplementary Figures
Page 13 Supplemental Tables   Table S1 Composition of the DNA mock sample used as positive control during library preparation. For each family, we provide the number of taxa included, and the amount and percentage of DNA in the sample.    Table S4 Summary results of in silico and in vitro evaluations of marker adequacy for detecting benthic macroinvertebrate taxa potentially occurring in samples collected in rivers from Central Portugal . For each taxon, we provide: "barc" -number of barcode sequences that comprise the insert originated by the desired marker based only on the Genbank database; "bind" -number of those sequences that comprise the desired insert and the flanking regions for primer binding; "range" -minimum-maximum insert sizes recovered from sequences according to primer binding sites; "amp" -number of unique barcode sequences that were amplified by each primer set for a maximum number of mismatches of 5; "res" -percentage of barcodes that were identified up to family level.
In marker adequacy, values in "in silico" and "in vitro" columns refer to detection (1) or non-detection (0) at the respective analysis, and "refDB" to whether barcodes with the desired insert are available (1) or not (0) in both public (Genbank and BOLD) and in-house databases (IBI-CIBIO and aquaDNA-LECA) for taxonomic assignment.      Figure S1 PCR plate layout.

Figure S3
Distribution of (A) sequencing read counts and (B) unique sequences after dereplication (bars) overlaid by amplicon length ranges (shading) estimated in silico ( Figure  S2A) and used for filtering metabarcoding data (attribute filtering; Figure S2B), for each of the three DNA markers used in the study. Separate results are shown for each of three markers (COI-M19BR2, 16S-Inse01, 18S-Euka02) used in this study.

Figure S5
Rarefaction curves showing the accumulation of targeted macroinvertebrate families (Family richness) recovered by each of three individual markers (COI-M19BR2, 16S-Inse01, 18S-Euka02) in relation to read counts per sample (Sample size). Panels in the left represent rarefaction curves across the entire range of coverages obtained in our samples, while in the right panels we have zoomed in on read counts up to 10,000.

Figure S6
Pairwise differential heat trees showing significant differences in the taxonomic composition of macroinvertebrate communities retrieved using either morphotaxonomy or each of the three DNA markers used in this study. Heat trees were produced considering the "abundance" of taxa estimated with either morphotaxonomy (counts of individuals) or DNA markers (counts of reads). Only families detected by morphotaxonomy (sample-wise) were considered. Tree tips represent families and internal nodes the corresponding taxonomic path. Black nodes denote the Animalia root.

Figure S7
Detection probability curves (predicted values ± 95% confidence interval) showing variation in the probability of detecting through metabarcoding a macroinvertebrate taxon recorded in a sample by morphotaxonomy (i.e., true positives), as a function of per-sample sequencing depth. Curves were estimated using Generalized Linear Models with binomial distribution and logit link, with separate models fitted for each combination of macroinvertebrate taxa and each of three markers (COI-M19BR2, 16S-Inse01 or 18S-Euka02), or the multi-marker. Only taxa detected by morphotaxonomy (sample-wise) were considered.

Figure S8
Detection probabilities (± 95% confidence intervals) of macroinvertebrate taxa using metabarcoding for a standard sequencing count of 50,000 reads, per each combination of macroinvertebrate taxon and each of three markers (COI-M19BR2, 16S-Inse01, 18S-Euka02), or the multi-marker. Estimates were made using the GLM shown in Figure S7. Only taxa detected by morphotaxonomy (sample-wise) were considered. White dots indicate taxa that were not detected by the respective marker or in which models failed to converge; crossed dots indicate rare families (<5 sites).