Short‐ and long‐read metabarcoding of the eukaryotic rRNA operon: Evaluation of primers and comparison to shotgun metagenomics sequencing

High‐throughput sequencing‐based analysis of microbial diversity has evolved vastly over the last decade. Currently, the go‐to method for studying microbial eukaryotes is short‐read metabarcoding of variable regions of the 18S rRNA gene with <500 bp amplicons. However, there is a growing interest in applying long‐read sequencing of amplicons covering the rRNA operon for improving taxonomic resolution. For both methods, the choice of primers is crucial. It determines if community members are covered, if they can be identified at a satisfactory taxonomic level, and if the obtained community profile is representative. Here, we designed new primers targeting 18S and 28S rRNA based on 177,934 and 21,072 database sequences, respectively. The primers were evaluated in silico along with published primers on reference sequence databases and marine metagenomics data sets. We further evaluated a subset of the primers for short‐ and long‐read sequencing on environmental samples in vitro and compared the obtained community profile with primer‐unbiased metagenomic sequencing. Of the short‐read pairs, a new V6‐V8 pair and the V4_Balzano pair used with a simplified PCR protocol provided good results in silico and in vitro. Fewer differences were observed between the long‐read primer pairs. The long‐read amplicons and ITS1 alone provided higher taxonomic resolution than V4. Together, our results represent a reference and guide for selection of robust primers for research on and environmental monitoring of microbial eukaryotes.

Planktonic single-celled eukaryotes play central roles in aquatic ecosystems and in global biogeochemical cycles, through primary production, transfer of organic material to higher trophic levels and sequestration of carbon to the deep ocean (Barton et al., 2013;Carradec et al., 2018;Worden et al., 2015). Some taxa (e.g., Pseudonitzschia spp. and Alexandrium spp.) can, however, also produce toxins and cause harmful algal blooms, threatening marine life and human health (Karlson et al., 2021;Lewitus et al., 2012). The study of eukaryotic plankton is thus central for understanding aquatic ecosystems; for this undertaking metabarcoding offers a time and cost-efficient method for gaining insight into their diversity and biogeography.
The method, however, is not without its limitations and challenges. The 18S ribosomal RNA (rRNA) gene is a commonly used marker, due to its conserved regions separated by highly variable regions. In theory, this allows to both universally target eukaryotes and discriminate between closely related organisms. However, no primers identified to date equally target all eukaryotic organisms, and despite the abundance of studies that have attempted to develop "universal" eukaryotic primers for the 18S rRNA gene (Amaral-Zettler et al., 2009;Hadziavdic et al., 2014;Hugerth, Muller, et al., 2014), there is no consensus in the research community for one primer pair. Within the 18S rRNA gene, different variable regions have been targeted, for example, V1-V2 and V3 (Medinger et al., 2010;Mohrbeck et al., 2015), however, V4 (Balzano et al., 2015;Pagenkopp Lohan et al., 2016;Piredda et al., 2017;Stoeck et al., 2010) and V9 (Bradley et al., 2016;de Vargas et al., 2015;Stoeck et al., 2010) have been most commonly used for eukaryotic plankton.
A few recent studies have compared the suitability of the variable regions for metabarcoding of eukaryotes (Hadziavdic et al., 2014;Hugerth, Muller, et al., 2014) and eukaryotic plankton in specific (Tanabe et al., 2016). However, all limiting factors have not always been taken into consideration, for example a primer pair with broad taxonomic coverage designed by Hugerth, Muller, et al. (2014) generates amplicons too long for merging forward and reverse illumina reads. Shortread metabarcoding using Illumina sequencing poses restrictions on the length of the barcode region. As a result, the taxonomy of amplicon sequences can often not be resolved beyond the genus level (Hugerth, Muller, et al., 2014), since closely related species regularly have identical barcode regions. The advancement of third generation sequencing, or long-read sequencing, through platforms from Pacific Bioscience (PacBio) and Oxford Nanopore technologies, enables sequence recovery of >10 kb (Amarasinghe et al., 2020). Disadvantages include higher costs and typically higher error rates compared to Illumina (~15% compared with ~0.1%). However, these errors tend to be random and by sequencing the same amplicon several times using circular consensus sequencing (CCS) error rates are considerably lowered (Jamy et al., 2020;Larsen et al., 2014;Wenger et al., 2019;Westbrook et al., 2015).
Long-read sequencing allows sequencing of amplicons covering a large fraction of the rRNA operon. The inclusion of the fast-evolving internal transcribed spacers (ITS1 and ITS2) and the hypervariable D1D2 region of the 28S rRNA gene allows detailed (species-level) taxonomic identification when closely related taxa are present in the reference database, and higher-level taxonomic placement using the rRNA genes in other cases (Orr et al., 2018;Tedersoo & Anslan, 2019). The long sequences of 18S and 28S rRNA genes also allow reconstruction of well resolved phylogenetic trees which facilitates evolutionary analysis of uncultivated organisms (Jamy et al., 2020). However, to our knowledge, no extensive evaluation of PCR primers for rRNA operon sequencing has been published so far.
The aim of our study was to evaluate existing and newly designed primers for short-and long-read metabarcoding in silico and in vitro to guide the selection of robust primers for research and environmental monitoring of microbial eukaryotes.
2.1.2 | Generation of rRNA databases from marine and brackish metagenomics contigs SSU and LSU rRNA sequences in primary metagenomic contigs from the "protistan" filter size fraction (0.8-5.0 μm) from the Tara Oceans project (Tully et al., 2018), and contigs from samples (size

| Short-read metabarcoding library preparation and sequencing
The primers for short-read amplification on 18S rRNA (Table 1)

| Long-read metabarcoding library preparation
We evaluated three sets of primers targeting a ~4.5 kbp region of the eukaryotic rRNA operon (Table 2). Three reverse primers were paired with the same forward primer (V4_Balzano_F/V4F): 21R (Schwelm et al., 2016)

| Processing and analyses of sequencing data
Sequencing data generated in this study are available at the Community Ecology Package), and ggplot2 (Valero-Mora, 2010).
The 18S rRNA gene database pr2 version 4.12.0 (Guillou et al., 2013) was used as training set for taxonomic classification with as-signTaxonomy of DADA2.

| Short-read Illumina data
The median sequencing depth was 0.11 M reads per sample with 88.78% of reads of a quality score ≥30. The dada2 pipeline was used to infer biological sequences from amplicon reads (Callahan et al., 2016), resulting in 6173 amplicon sequence variants (ASVs) from six samples and nine primer pairs (V4_Hugert primers were excluded since forward and reverse reads did not overlap). ASV abundance was rarefied with the function rarefy from the vegan package ver- Around 90% of ASVs generated from the V4 and V6-8 region were classified to genus level, and ~70% to species level. The primers targeting V9 resulted in fewer ASVs and a steeper drop in taxonomic assignment from domain to genus level; ~70% of the V9 ASVs were classified at genus level compared to ~80%-90% of the V4 ASVs.
The relative resolution of the V9_EMP primers did not follow the pattern of the other two V9 primers; due to the very low number of ASVs generated for this primer pair the results are probably distorted. Similar patterns of taxonomic resolution were observed in vitro when applying the primer pairs on six water samples ( Figure S1).

| Coverage of primers on public and constructed environmental databases
The amplicons generated from the in silico PCR were used to eval- In some cases, primers designed for targeting 18S rRNA can also amplify prokaryotic 16S rRNA genes, but the evaluated primers were in general specific to eukaryotic 18S rRNA. The V9_EMP forward primer matched to bacterial and to a lesser extent archaeal sequences, but combined with the V9_EMP_R primer the pair was specific to eukaryotic sequences.

| In silico evaluation of primers targeting 28S rRNA
In addition to two previously published reverse 28S rRNA primers, five primers designed in this study were evaluated (Figure 3b). The coverage of all primers on the SILVA database and the simulated brackish sample was generally >90%. Lower coverage was obtained on the simulated marine sample, where the primers 2593R and 3143R provided the highest coverage of ~60%-70%. As for many of the SSU F I G U R E 2 Amplicon length distribution and taxonomic resolution of in silico PCR on public database sequences of ten 18S rRNA primer pairs. The sequences were simulated from a clustered PR2 database (version 4.12.0). (a) Amplicon length distribution of amplicons generated on the database. (b) Inferable taxonomic information from unique in silico amplicon sequences (ASVs). Taxonomy was assigned using the function assignTaxonomy from the dada2 R package. Note that V9_AmaralZettler and V9_Piredda share similar primer sequences and therefore plot on top of each other primers, the coverage of the Excavata supergroup was limited by all primers, while the other supergroups appeared evenly covered.

| In vitro evaluation of primers targeting 28S rRNA and long-read amplicons
Since a thorough in silico evaluation of primer pairs for long-range amplification is not feasible due to the low number of eukaryotic genomes for which complete rRNA operon data is available, analyses of amplicon length distribution and taxonomic resolution were performed on sequencing data generated in this study (Figure 4; and further described in the next section). The length distributions of the amplicons of the three primer pairs tested were much wider than the distributions from the 18S rRNA short-read amplicons, probably reflecting length variation in the ITS sequences, variable regions, and introns. Since the same 18S rRNA targeting forward primer (V4_Balzano_F) was used in the pairs, the shifts in length distribution between the pairs were due to the different reverse primers used ( Figure 1c). 3143R produced the longest amplicons with the highest peak at ~4700 bp length, 2742R with a main peak at ~4300 bp and the previously published 21R (which binds closely to 2742R) at ~4200 bp.
The taxonomic resolution of the long-range amplicons was evaluated for the V4 region and the 18S rRNA region of the amplicon ( Figure 4b; the different regions of the amplicons are illustrated in Figure 4c). The three primer pairs performed similarly for both regions; for the full 18S rRNA region, 85%-90% of the sequences were classified down to species level, while for the V4 region trimmed to the V4_Balzano primers, <80%.
To investigate the potential of different regions of the eukaryotic rRNA operon for metabarcoding, we evaluated the variability within the regions 18S, 28S, ITS1, ITS2 and V4 (Figure 4c,d).
Since we found that denoising with DADA2 artificially suppresses variation in highly variable ITS regions (Schoch et al., 2012) of the long-reads ( Figure S2), this analysis was conducted without denoising on unique CCS reads from the primer pair V4_Balzano_F -2742R. The sequences of the individual regions were clustered at 99% identity to compensate for sequencing errors that would artificially inflate the observed variation. The long regions of the 18S and 28S rRNA genes formed fewer sequence clusters than the F I G U R E 3 Coverage of primer pairs and individual primers across database sequences and taxonomic groups. Database sequences derive from published databases (PR2 for 18S rRNA, silva for 28S rRNA, silva for prokaryotic 16S and 23S rRNA sequences) and constructed databases of rRNA sequences extracted from shotgun metagenomics data of brackish (Baltic Sea: Alneberg et al., 2020) and marine water (TARA Oceans; Tully et al., 2018). Coverage of primers on eukaryotic supergroups was based on public database sequences (pr2/ SILVA) and corresponds to the size of the circle (legend). Matches to prokaryotic sequences are shown as grey circles. The primers are sorted by variable regions amplified within (a) 18S and (b) 28S rRNA genes, and by forward "_F", reverse "_R" and pair "_pair" (marked with black stroke) full amplicon, with 9.9% compared to 21.5% of sequences from the full amplicons serving as centroid sequences during clustering. The short and variable ITS1 and ITS2 are frequently targeted for species discrimination, for example, dinoflagellates (John et al., 2014), and fungal metabarcoding (Schoch et al., 2012) and 12.2 and 12.6% of sequences, respectively, formed centroids for clusters.
The corresponding number for the V4 region was 6%, indicating a lower variability than the ITS regions.

| Effect of sequencing method and primer pair on inferred community composition in natural samples
In order to evaluate the influence of sequencing method and primer choice on the obtained community composition, we compared  Figure S3. Typically, biological replicates processed with the same method grouped closely. Depending on the taxonomic level chosen for creating the NMDS plots, the clustering shifted but the general patterns remained ( Figure S3).
Within the short-read samples, those targetting the same variable region grouped closely. For the long-read samples there was a minor separation depending on if the V4 or the full 18S rRNA region was used for taxonomic classification. Generally, Illumina samples clustered closer to the shotgun MG samples than long-read samples, and of those in particular V9 primers, V4_new and V6-V8_new.
When comparing the community composition on supergroup level (Figure 5c), the influence of the sequencing methods and primer pairs on the observed composition became more evident.
The differences in community profile depending on the variable region targeted was also visible here, but more notable at lower taxonomic levels ( Figure S4). The composition obtained by shotgun MG can be assumed to reflect the true community composition most closely, since it is not affected by primer biases.
To directly compare the metabarcoding data to shotgun MG, we plotted the average Bray-Curtis dissimilarity across samples at each taxonomic level, where a value of 1 represents the highest   Figure S3. (c) Bar plots of one biological replicate per location on phylum level (supergroup in pr2 taxonomy), abundance of each group was normalised across samples, bar plots at all taxonomic levels and of both biological replicates in Figure S4 (replicate 1) and Figure S5 (replicate 2). (d) Average Bray-Curtis dissimilarity of primer pairs to shotgun-MG data across six samples on seven taxonomic levels dissimilarity (Figure 5d). It became apparent that the dissimilarity to the shotgun MG data increased at lower taxonomic levels. Of the short-read primer pairs, V4_new showed the lowest dissimilarity up to family level, while the V7_new primers showed the narrowest range of dissimilarity across taxonomic levels, except species level.
At genus and species level little difference could be observed between the pairs, with the exception of the V4_Bråte pair which most strongly diverged from the MG data (Figure 5c,d).
The V4_Bråte pair had a notable bias against the supergroup Opisthokonta ( Figure 5c); on closer observation the highly abundant copepod species Eurytemora affinis (Bothnian Bay) and Oithona sp.
(North Sea) were strongly underrepresented by V4_Bråte ( Figure   S4). Consistent with this, the V4_Bråte primers did not match to sequences of these genera in silico, while V4_Balzano and V4_new did.
The three V9 primer pairs created very similar community profiles ( Figure 5c) and had low dissimilarity to the shotgun MG profile at supergroup level; however, at lower taxonomic levels differences became evident (Figure 5d) connected to that many taxonomic groups were not identified ( Figure S4). The community profiles created by the three long-read primer pairs and PacBio sequencing differed based on the reverse primer only (Figure 5c), and generally overrepresented Alveolates and in specific Ciliophora ( Figure S4). Based on Bray-Curtis dissimilarity the 21R reverse primer was most successful in mimicking the shotgun MG data (Figure 5d), both when the 18S rRNA and the V4 region was used for taxonomic assignment. The results obtained from the V4 region and full 18S did not differ at the supergroup level but were surprisingly different on lower taxonomic levels ( Figure S4). Even though the V4 region from PacBio amplicons was obtained by trimming reads based on the V4_Balzano primers' positions, the community profile did not bear similarity with those created by the V4_Balzano primers, indicating amplification biases was part of the problem. The supergroup Hacrobia is not well covered by any of the PacBio reverse primers tested; the bias cannot be explained by the forward primer V4_Balzano_F since the V4_Balzano pair even overrepresented this taxonomic group, compared to the shotgun MG samples.

| DISCUSS ION
In this study, we performed in silico evaluations of primers for short-and long-read metabarcoding of eukaryotes and compared the community composition obtained in vitro from six marine samples with shotgun metagenomic (MG) sequencing. To our knowledge, this is the first comparison of the three sequencing methods for microbial community characterisation. The rationale for repeating the primer design effort by Hugerth, Muller, et al. (2014) was that the expanded sequence databases and clustering of sequences would enable design of broader targeting primers with less bias towards overrepresented taxa. We also included a larger set of previously published primers in the evaluations. We showed that our approach is suitable by confirming the primers performance both in silico and in vitro. The taxonomic resolution tests only measured what proportion of sequences was classified at the different levels and not the accuracy of the classifications. Although a subset of the classifications was probably wrong, we argue that this is still a valid approach for comparing the taxonomic information of the different primer pairs/regions, since the applied k-mer-based algorithm implemented in DADA2 (Callahan et al., 2016) will assign more informative sequences to more detailed taxonomic levels (  . However, since the V4_Hugerth pair generated too long amplicons for sequencing on Illumina Miseq (Hugerth, Muller, et al., 2014), and the V4_new pair amplified offtarget bacterial sequences, these pairs are less suitable for metabarcoding of eukaryotes. The V9 pairs did not perform as well as the other pairs, probably due to the short amplicons and more limited representation of the region in the pr2 database ( Figure 1a).
Our analyses indicated the V6-V8_new pair performs the best of the tested pairs, and that it most closely resembled the taxonomic profile from unbiased shotgun MG sequencing In vitro (Figure 5b,c).
Primers targeting the V6-V8 regions are not widely used, but a pair targeting these regions was previously shown to provide high coverage Wilkins et al., 2013). However, since it is desirable to be able to compare across studies, a more commonly used primer pair with slightly lower performance might be preferable. In that case, the V4_Balzano pair (Balzano et al., 2015) seems a reasonable choice for short-read 18S rRNA metabarcoding studies, since it had the best performance after V6-V7_new, is used in many studies (Berdjeb et al., 2018;Obiol et al., 2020;Questel et al., 2021) and additionally differs from the most commonly used pair V4_Stoeck  by only the last nucleotide of the reverse primer (Table 1). The simplified PCR protocol we established in this study (Table S2) with only one reaction instead of six, one annealing temperature, and fewer cycles, makes the primers more suitable for high-throughput library preparation. We compared the simplified PCR protocol to the original protocol (Table S2)  While the advantages of long-read metabarcoding of almost the whole rRNA operon are ample, this relatively new method has its limitations. Next to the high costs per base pair, technical limitations should be taken into consideration. The wide distribution of amplicon lengths could lead to shorter sequences being overrepresented (Tedersoo et al., 2018), and PCRs on such long DNA regions increase the formation of chimeric sequences. A current bottleneck for data analysis is the lack of an established pipeline for denoising of long-read sequences, although attempts have been made for the full rRNA operon (Heeger et al., 2018;Jamy et al., 2020) and for SSU only using DADA2 (Callahan et al., 2019). When we used DADA2 for sequence denoising on the full rRNA operon, the biological variation in the ITS regions appeared to be falsely suppressed, since the numbers of unique ITS sequences were lower than the number of unique 18S V4 sequences, as shown in Figure   S2, contrary to what would be expected and what was obtained without applying denoising. This is probably due to the combination of long amplicons and low sequencing depth, as previously noted (Furneaux et al., 2021), rendering too few reads for many of the true biological sequences to be recaptured by the algorithm.
Furthermore, there is currently no database or collection of full eukaryotic rRNA operon sequences available, like it is attempted for bacteria (Benítez-Páez & Sanz, 2017;Kinoshita et al., 2021); such an initiative would greatly benefit the ease of application.
As part of this study, we evaluated seven reverse primers for long-read metabarcoding, including two used for long-read metabarcoding in a previous study (Jamy et al., 2020), Opisthokonta and Alveolata ( Figure 5c). Since the performance of the primers did not significantly differ, and since 3143R (designed here) generated the longest amplicon with two additional variable regions (D9 and D10), 3143R might be the preferable choice for long-read rRNA operon metabarcoding studies.

CO N FLI C T O F I NTE R E S T S
The authors declare that there is no conflict of interest.