Long- and short- read metabarcoding technologies reveal similar spatiotemporal structures in fungal communities

Fungi form diverse communities and play essential roles in many terrestrial ecosystems, yet there are methodological challenges in taxonomic and phylogenetic placement of fungi from environmental sequences. To address such challenges, we investigated spatiotemporal structure of a fungal community using soil metabarcoding with four different sequencing strategies: short- amplicon sequencing of the ITS2 region (300– 400 bp) with Illumina MiSeq, Ion Torrent Ion S5 and PacBio RS II, all from the same PCR library, as well as long- amplicon sequencing of the full ITS and partial LSU regions (1200– 1600 bp) with PacBio RS II. Resulting community structure and diversity depended more on statistical method than sequencing technology. The use of long- amplicon sequencing enables construction of a phylogenetic tree from metabarcoding reads, which facilitates taxonomic identification of sequences. However, long reads present issues for denoising algorithms in diverse communities. We present a solution that splits the reads into shorter homologous regions prior to denoising, and then reconstructs the full denoised reads. In the choice between short and long amplicons, we suggest a hybrid approach using short amplicons for sampling breadth and depth, and long amplicons to characterize the local species pool for improved identification and phylogenetic analyses.

is facilitated by sampling of vegetative structures (Horton & Bruns, 2001). Unlike many saprotrophic fungi which grow easily in axenic culture, ECM fungi are usually difficult to culture, so DNA barcoding is increasingly used to investigate vegetative structures in the field.
The advent of high-throughput sequencing (HTS) has facilitated such studies by providing enough sequencing depth for metabarcoding of bulk environmental samples such as soils (Lindahl et al., 2013).
As additional techniques and methods are developed for HTS, there is an increasing array of choices for researchers investigating fungal communities. Fungal metabarcoding studies using short-read HTS technologies such as 454 Pyrosequencing, Illumina and Ion Torrent have usually targeted the rDNA internal transcribed spacer regions ITS1 or ITS2, which are the standard molecular barcode for fungi, providing sufficient resolution to distinguish fungal species in many groups, and which are usually short enough for HTS (Lindahl et al., 2013;Schoch et al., 2012). In some groups such as arbuscular mycorrhizal fungi, variable regions of the rDNA small subunit (SSU) are the barcode of choice (Öpik et al., 2010), and variable regions of the rDNA large subunit (LSU) have also been used for barcoding (e.g. House et al., 2016;Kurtzman & Robnett, 1998;Tedersoo et al., 2015). The resulting sequencing reads are clustered by sequence similarity to form operational taxonomic units (OTUs), which are then used as the units for further community analysis (Lindahl et al., 2013). If taxonomic identification is desired in order to put OTUs in a wider context and associate functional information, it has usually been performed by database searches using BLAST (Altschul et al., 1990;Lindahl et al., 2013) with public databases such as GenBank (Benson et al., 2013) and Unite (Nilsson et al., 2019). However, there is potential to improve this approach at several stages, including sequencing technology, amplicon choice, clustering and taxonomic assignment.
Different sequencing technologies have different capabilities in terms of sequencing depth and read length, as well as differing quality profiles and potential biases (Yang et al., 2013). The rapid development of new HTS technologies, as well as subsequent iterative improvements in sequencing chemistry and read capacity, means that the technologies used in metabarcoding studies, along with any associated biases, change frequently. As an example, the first study using HTS metabarcoding of soil fungi was published in 2009 (Buée et al., 2009) using 454 Pyrosequencing; production of 454 sequencers was subsequently discontinued in 2015, and sales of reagents stopped in 2016 (Hollmer, 2013). This brings into question the comparability of studies conducted only a few years apart. Existing studies that sequenced the same environmental samples using different HTS technologies (e.g. Claesson et al., 2010;Divoll et al., 2018;Kennedy et al., 2018;D. P. Smith & Peay, 2014;Speranskaya et al., 2018; have found that most differences in results seem to be attributable to differences in sequencing depth or different primer biases, rather than differences in the technologies themselves. Only a few of these studies have controlled for primer biases by using the same primer pairs in each technology (Claesson et al., 2010;Divoll et al., 2018;Speranskaya et al., 2018), and to our knowledge, none have sequenced the same PCR products using multiple HTS technologies. ITS1 and ITS2 often have suitable variation to distinguish species, although closely related species may share identical ITS sequences in certain groups such as various Pezizomycotina , but this variability means that they cannot be reliably aligned over the fungal kingdom (Lindahl et al., 2013;. Additionally, the wide range of length variation of these regions may introduce bias in recovery of different taxa (Ihrmark et al., 2012;Palmer et al., 2018;Tedersoo et al., 2015). Further bias is introduced by variation in the 5.8S region, which separates the two ITS regions, as well as in the 5′ end of LSU, which makes it difficult to design primers that are suitable for all fungi .
Distance-based clustering conflates intraspecies variation and sequencing error (Lindner & Banik, 2011;Nilsson et al., 2008), and results are data set-specific. In contrast, more recent denoising methods such as DADA2 (Callahan et al., 2017), Deblur (Amir et al, 2017) and UNOISE2 (Edgar, 2016b) utilize read quality information to control for sequencing error while preserving intraspecies variation. The resulting units are known as amplicon sequence variants (ASVs) or exact sequence variants (ESVs), as they should represent true amplicon sequences from the sample. Unlike cluster-based OTUs, ASVs can capture variation of as little as one base pair, although alpha and beta diversity estimates based on ASVs and OTUs at different clustering thresholds are highly correlated (Glassman & Martiny, 2018;Botnen et al., 2018). Amplicon sequence variants have been suggested to be less data set specific than cluster-based OTUs (Callahan et al., 2017). Support for PacBio has recently been added to DADA2 (Callahan et al., 2019), but its application requires greater sequencing depth for longer reads, especially in high diversity samples.
Because both OTU clustering and denoised ASVs may 'clump' different species into a single unit and 'split' a single species into multiple units (Ryberg, 2015), diversity measures based on counting species within a community or shared species between two communities may give different results depending on the clustering threshold. In contrast, phylogenetic community distance measures (Wong et al., 2016) are relatively insensitive to species/OTU delimitation, but require a phylogenetic tree. Phylogenetic placement algorithms have been developed to place short-amplicon reads onto a reference tree (Berger et al., 2011;Matsen et al., 2010;, but are not easy to apply to ITS sequences because they require that the query sequences be aligned to a reference alignment. Additionally, methods exist to place OTUs on a simplified tree based on taxonomic assignments  or to create hybrid trees using ITS and a more conserved marker such as SSU or LSU based on matching taxonomic annotations in reference databases (Fouquier et al., 2016), but these approaches are only applicable to sequences of known taxonomic affiliation.
Assignment of taxonomic identities to environmental sequences is dependent on both the reference database and the algorithm used.
Although the public INSDC databases (Karsch-Mizrachi et al., 2018) are often used for sequence identification, the open nature of submission to these databases results in a substantial fraction of incorrect taxonomic annotations (Bidartondo, 2008;Nilsson et al., 2006;Steinegger & Salzberg, 2020) as well as sequences of poor technical quality (Ashelford et al., 2006;Nilsson et al., 2012). Consequently, taxonomic assignments based on these databases may be incorrect or inconsistent (Nilsson et al., 2005). Several curated databases also exist which attempt to address these issues and which cover the whole fungal kingdom. The Unite database is an attempt to include all publicly available high-quality ITS sequences (800,000 as of release 8.0), originally limited to fungi but now expanded to include all eukaryotes (Nilsson et al., 2019), where efforts have been made to correct incorrect annotations and exclude low-quality sequences . The Ribosomal Data Project (RDP, Cole et al., 2014) hosts two additional manually curated fungal barcode sequence databases, which are specifically intended for use in taxonomic assignment of sequences: the Warcup ITS training set, containing 18,000 manually curated fungal ITS sequences (Deshpande et al., 2016), and the RDP fungal LSU training set, containing 8000 manually curated LSU sequences from fungi and 3000 from other eukaryotic groups (RDP-LSU, Liu et al., 2012). Although the quality of sequences and taxonomic annotations is undoubtedly higher in these more curated databases, they are inherently limited in taxonomic coverage and do not include the most recently published sequences.
Assigning taxonomy to unknown sequences using blast requires a priori choice of similarity thresholds for different taxonomic ranks.
Several algorithms specifically designed for taxonomic assignment have been published which instead use information about variability within different taxa in the reference database to assign unknown sequences, along with confidence estimates for these assignments, including the RDP Classifier (RDPC, Wang et al., 2007), SINTAX (Edgar, 2016a) and IDTAXA (Murali et al., 2018) among others. In addition, methods have been published which integrate predictions from multiple algorithms to increase the reliability of assignments (Gdanetz et al., 2017;Palmer et al., 2018;Somervuo et al., 2016).
However, all sequence similarity-based approaches are dependent on high taxonomic coverage in the reference database, making the placement of novel or undersampled groups problematic .
Recent long-read HTS technologies such as Pacific Biosciences (PacBio) Single Molecule Real Time (SMRT) sequencing enable sequencing longer amplicons, which include both the ITS regions and the flanking, more highly conserved SSU and/or LSU regions. This can improve taxonomic placement of sequences that lack close database matches and allow the alignment of metabarcoding reads for subsequent phylogenetic analysis .
Information from phylogenetic trees produced from long-amplicon metabarcoding has the potential to both improve taxonomic assignment and provide alternative measures of community alpha and beta diversity. PacBio sequencing has also been shown to recover longer variants of variable-length regions such as ITS1 and ITS2, which are excluded by other technologies (Castaño et al., 2020). However, long-read technologies are currently more expensive per read compared to short-read sequencing, and so their use entails a trade-off with sequencing depth and/or sample number (Kennedy et al., 2018).
Because of the variety of sequencing platforms and analytical pipelines, which have been used in metabarcoding studies, comparisons between studies may be difficult. Here, we investigated the effects of different sequencing strategies and postanalysis on biological conclusions using measurement of the spatiotemporal turnover rate of the fungal community in an ECM-dominated woodland in Benin by metabarcoding of bulk soil, sampled at narrow intervals, over two years. Turnover scale is the distance at which two communities can be considered to be independent samples of the local species pool (Legendre & Legendre, 2012). Knowledge of turnover scale is important when planning studies of local diversity and its environmental correlates. Turnover scale varies between different ecosystems and taxonomic groups, and can be measured by the range at which a Mantel correlogram indicates significant autocorrelation, or by fitting a function to an empirical distance-decay curve of community dissimilarity vs. distance (Legendre & Legendre, 2012).
We compare three different sequencing platforms (PacBio RS II, Illumina MiSeq, Ion Torrent Ion S5), long and short amplicons, three different taxonomic assignment algorithms (RDPC, SINTAX, IDTAXA) with three different reference databases (Unite, Warcup, RDP-LSU), and both nonphylogenetic and phylogenetic community distance measures. We also present new algorithms for dividing the LSU into domains, combining denoising results from multiple domains as a strategy to capture more ASVs from long amplicons in diverse communities, and incorporating phylogenetic information into taxonomic assignments. We hypothesize that (a) PacBio sequencing of the short amplicon gives less bias against longer ITS2 amplicons than Illumina and Ion Torrent, both qualitatively (recovering amplicons missed by the others) and quantitatively (greater fraction of reads in longer amplicons); (b) our long amplicons (ITS1-LR5) recover a more complete view of the fungal community than our short amplicons (gITS7-ITS4), due to reduced length and primer biases; (c) these differences lead to differing results for ecological metrics, specifically OTU/ASV richness and turnover distance; and (d) incorporating LSU in the long amplicon allows for better taxonomic assignment.

| Sampling
Sampling was conducted at two sites, near the villages of Angaradebou (Ang: 9.75456°N 2.14064°E) and Gando (Gan: samples. For each sample, coarse organic debris was removed from the soil surface and a sample of approximately 5 cm × 5 cm × 5 cm was extracted with an ethanol sterilized knife blade. Each sample was sealed in a plastic zipper bag and homogenized by shaking and manually breaking apart soil aggregations. A subsample of approximately 250 mg total of soil was collected from two locations in the homogenized soil sample and stored in a DNA preservation buffer before return to the laboratory for extraction (see Methods S1.1, as well as Figures S1-S3, for preservation and extraction methods).

| DNA amplification and sequencing
DNA extracts were sequenced using four distinct strategies, with two different amplicon lengths (long and short; Figure 1a (Ihrmark et al., 2012) as the forward primer and an equimolar mix of ITS4 (White et al., 1990) and ITS4a (Urbina et al., 2016) as the reverse primer. The long amplicon (approximately 1500 bp) targeted the full ITS region including the 5.8S rDNA and approximately 950 bp at the 5′ end of the LSU, including the first three variable regions (Figure 1a), using ITS1 (White et al., 1990) as the forward primer and LR5 (Vilgalys & Hester, 1990) as the reverse primer.
Each PCR run also included a blank sample and a positive control consisting of freshly extracted DNA from a commercially purchased fruitbody of Agaricus bisporus.
For the short amplicon, forward primers included samplespecific indexes and adapters for multiplexing (File S1). PCR amplification was performed in 20-µl reactions containing 200 µM dNTP mix, 250 µM indexed gITS7 primer, 150 µM ITS4 m, 2 mM MgCl 2 , 0.1 U Taq polymerase (Dream Taq, Thermo Fisher Scientific) and 3-7 ng purified DNA in Dream Taq buffer. The reaction conditions were 10 min at 95°, followed by 35 cycles of 60 s at 95°, 45 s at 56° and 50 s at 72°, and finally 3 min at 72°. Each reaction was conducted in three technical replicates to reduce the effect of PCR stochasticity.
For the long amplicon, both forward and reverse primers included indexes for combinatorial multiplexing (File S2). PCR was performed as for the short amplicons, but with 500 µM of each of the two primers. Reaction conditions were 10 min at 95°, 30 cycles of 45 s at 95°, 45 s at 59° and 90 s at 72°, and finally 10 min at 72°. Each reaction was performed in three technical replicates as for short amplicons.
After pooling of technical replicates, amplicons were purified using SPRI beads (Vesterinen et al., 2016) and quantified fluorometrically using Quant-iT™ PicoGreen™ dsDNA (Thermo Fisher Scientific) fluorescent indicator dye on a Infinite F200 plate spectrofluorometer (Tecan Trading AG) according to the manufacturer's F I G U R E 1 rDNA regions. (a) Partial map of rDNA showing the 5.8S rDNA, partial SSU and LSU rDNA, and internally transcribed spacer (ITS) regions. D1-3 represent the first three variable regions in LSU, while LSU1-4 represent the conserved regions. Primer sites used in this study are indicated in red (forward primers) and blue (reverse primers), and the resulting amplicons are shown with green braces. (b) Total number of DADA2 ASVs vs. fraction of demultiplexed reads successfully mapped to ASVs for different rDNA regions extracted from a set of long PacBio amplicon sequences using lsUx. Data from all samples were analysed as a single pool. long: entire long amplicon, including ITS1, 5.8S, ITS2 and partial LSU; 32S: partial 32S precursor to LSU, including 5.8S, ITS2 and partial LSU; LSU: section of LSU rDNA included in the long amplicon, from the 5′ end to the LR5 primer site; ITS: full ITS region, including ITS1, 5.8S and ITS2. Colour indicates median region length. Shorter and more conserved regions yielded a greater fraction of successfully mapped reads. At a given fraction of mapped reads, more variable regions yield a greater number of unique ASVs protocol. An aliquot of 100 ng of DNA from each sample (or the total PCR product if less than 100 ng) was pooled into two libraries each for long and short amplicons.

| Bioinformatics
Circular consensus sequence (CCS) basecalls for PacBio sequences were made using CCS version 3.4 (Pacific Biosciences, 2016, 13 July 2019) using the default settings. The resulting sequences, as well as the paired-end Illumina sequences, were demultiplexed and sequencing primers were removed using Cutadapt version 2.8 (Martin, 2011). Sequencing primers were similarly removed from the Ion Torrent sequences, but interference between the tagged gITS7 primers and the Ion XPress tags used in library prep made full demultiplexing of the Ion Torrent sequences impossible, resulting in two samples sharing each tag. These reads were thus either analysed as a pool, or comparisons were made to equivalently combined samples in the other data sets. For Ion Torrent and PacBio, reads were discarded if they did not have the appropriate primers on both ends. Reads were searched in both directions, and reads where the primers were found in the reverse direction were reverse complemented before further analysis. For Illumina sequences, read pairs were only retained when PCR primers were detected at the 5′ ends of both the forward and reverse read.
Primers were also searched for and removed on the 3′ ends of the reads, in case of readthrough with short amplicons. Read pairs where the primers were found in reverse orientation were kept in separate files, but were retained in their original orientation until after denoising.

| Denoising and clustering
All amplicons were denoised using DADA2 version 1.12.1 according to the ITS pipeline workflow (Callahan, 2020a;Callahan et al., 2016), with technology-specific modifications for Ion Torrent (Callahan, 2020b) and PacBio (Callahan et al., 2019). Although this was successful for the short amplicons on all technologies, only 38 ASVs were obtained for the long amplicons, representing 12% of the trimmed reads.
We conclude that this poor performance was due to a combination of long amplicon length and low sequencing depth relative to community diversity, which lead to most biological variants being represented only by a cluster of reads differing by a small number of unique sequencing errors (for calculations, see Methods S1.2.1). We therefore developed a new workflow to assemble ASVs from the long amplicons by splitting the reads into homologous domains, including the two ITS regions, 5.8S, the variable D1-3 regions of LSU (Michot et al., 1984) and the conserved LSU regions between the D regions, here referred to as LSU1-4 ( Figure 1a). We then independently denoised reads from each domain and concatenated the denoised domains for each read. Finally, denoised reads were clustered based on 100% ITS2 identity, and a full-length consensus ASV was calculated for each cluster. This method, implemented in the new R packages LSUx (splitting reads into homologous regions; https://github.com/ brend anf/LSUx) and TZARA (reassembling regions and generating full-length consensus ASVs; https://github.com/brend anf/tzara) and detailed in Methods S1.2 and Table S1, was used for all of the PacBio and Ion Torrent data sets. Because the LSUx plus TZARA method as currently implemented is not applicable to Illumina paired-end reads, the ASVs generated from the Illumina data set according to the standard DADA2 workflow were used. The ITS2 region was extracted from the ASVs using lsUx for comparison to the results from the other technologies. To account for intraspecies variation and the possibility of different denoising performance between the different sequencing strategies, the pooled ITS2-ASVs from all sequencing strategies were also clustered into operational taxonomic units (OTUs) at 97% similarity using VSEARCH version 2.9.1 (Rognes et al., 2016).

| Phylogenetic inference and taxonomic assignment
Full-length long-amplicon ASVs were aligned using DECIPHER (Wright, 2015) with up to 10 iterations of alternating progressive alignment and conserved RNA secondary structure calculation, followed by 10 refinement iterations. This alignment was truncated at a position after the D3 region corresponding to base 907 of the Saccharomyces cerevisiae S288C reference sequence for LSU, because several sequences had introns after this position, as also observed in several fungal species by Holst-Jensen et al. (1999). An ML tree was produced using RAxML version 8.2.12 (Stamatakis, 2014) using the GTR+GAMMA model and rapid bootstrapping with the MRE_IGN stopping criterion. Sequences confidently (i.e. by at least five of the primary taxonomic identification methods) assigned outside kingdom Fungi were used to root the tree, and sequences outside the clade defined by confidently identified Fungi were removed (see below and Methods S1.3.2).
Taxonomic annotations of the RDP-LSU training set version 11.5 (Cole et al., 2014;Liu et al., 2012) and Warcup ITS training set (Deshpande et al., 2016) were mapped to a uniform taxonomic classification system (see Methods S1.3.1). Primary taxonomic assignment was performed to genus level separately on the ITS region using Unite and Warcup and on the LSU region using RDP-LSU, respectively, as taxonomic references. For each region/reference combination, taxonomy was assigned using three popular algorithms: the RDPC (Wang et al., 2007) as implemented in DADA2; SINTAX (Edgar, 2016a) as implemented in VSEARCH version 2.9.1 (Rognes et al., 2016); and IDTAXA (Murali et al., 2018). A relatively lax confidence threshold of 50% was used for all three algorithms, in order to increase the amount of input for consensus algorithms. Each full-length ASV was thus given up to nine primary taxonomic assignments (three references × three algorithms). Amplicon sequence variants from the short-amplicon data sets for which no matching long-amplicon ASV could be reconstructed were taxonomically assigned using Unite and Warcup on the full length of the short amplicon.
For full-length long-amplicon ASVs, the primary taxonomic assignments were refined based on the ML phylogenetic tree generated above using the new algorithm PHYLOTAX. The PHYLOTAX algorithm resolves conflicts among one or more primary assignment methods using a supplied phylogenetic tree (see Figure S4 and Methods S1.3.3). It is available in the new R package PHYLOTAX at https://github.com/brend anf/phylotax. Amplicon sequence variants which were not present in the tree, either because they were not represented in the long-amplicon data set, or because full-length ASV reconstruction failed, were given refined taxonomic assignments using a strict consensus of the different primary assignments at each rank, resulting in a consensus assignment equivalent to the 'last common ancestor' of the primary assignments (Huson et al., 2007). This algorithm has been used to assign a consensus taxonomy based on a list of top BLAST hits (e.g. MEGAN and LCAClassifier, Huson et al., 2016;Lanzén et al., 2012) or k-mer similarity scores (mothur's k-nearest neighbour method, Schloss et al., 2009), but here is used to resolve conflicts between assignments from different algorithms and databases. Strict consensus assignments were also generated for all ASVs, as a comparison to the PHYLOTAX assignments, and are referred to as 'Consensus'.

| Effect of sequencing strategy on recovered community
We compared alpha diversity estimates by the different sequencing strategies by calculating ASV and OTU accumulation curves, as well as comparing richness estimates after rarefaction for each sample (Methods S1.4). We also compared the effect of sequencing technology and amplicon length on the recovered fungal community composition (i.e. after removal of nonfungi), as assessed by the Bray-Curtis dissimilarity, using PERMANOVA and heat tree visualizations (Methods S1.4).

| Spatiotemporal analysis
To estimate turnover scale, ecological community dissimilarity matrices were calculated using the ASV/OTU-based Bray-Curtis metric (Bray & Curtis, 1957, for both long and short amplicons) and the phylogenetically based weighted UniFrac metric (Lozupone & Knight, 2005;Lozupone et al., 2007, for only long amplicons) in Phyloseq version 1.26.0. Dissimilarities were based on relative read abundance within each sample. Samples were not rarefied to a standard sequencing depth within data sets, as both the Bray-Curtis dissimilarity and the UniFrac metric are robust to unequal sampling depths (McMurdie & Holmes, 2014). In addition, we did not standardize sequencing depth between data sets, because this would remove one of the potential benefits of Illumina and Ion Torrent relative to PacBio.
Each of the distance matrices was used to calculate a Mantel correlogram with a 1 m bin size for distances in the range of 0-12 m, that is half the maximum separation present in the data set. Separate correlograms were drawn for samples taken during the same year and samples separated in time by one year, in order to assess the degree to which the soil community changes over the course of one year.
Additionally, empirical spatiotemporal distance-decay curves were generated by plotting mean community dissimilarity as a function of spatial distance and time lag, and fit to an exponential model of the form given by Legendre and Legendre (2012) using the nls() function in R (Methods S1.5). Spatiotemporal analyses were performed on the full recovered fungal community after removal of nonfungal sequences and on the ECM community. Sequences were assigned as ECM based on taxonomic assignments using the FUNGuild database (as of 20 Feburary 2020; Nguyen et al., 2016) via the R package FUNGuildR (https://github.com/brend anf/FUNGu ildR). All taxa which included 'Ectomycorrhiza' in the guild assignment at any level of confidence were included.

| RE SULTS
Samples from Ang in 2015 yielded low quantities of DNA, poor PCR performance and ultimately very few sequencing reads, especially in the long-amplicon library, where only one sample produced more than 100 reads ( Figure S5). Consequently, Ang samples were excluded from spatial analysis, although they were retained for denoising, phylogenetic reconstruction, taxonomic assignment and all nonspatial analyses. Spatial analyses were based on the remaining 34 samples for Illumina and 30 samples each for the PacBio long and short amplicons.
The number of sequencing reads and ASVs at each stage in the bioinformatics pipeline differed between sequencing strategies (Table   S2). Sequencing with PacBio yielded more than twice as many raw reads for long amplicons as for short amplicons, with approximately 125 thousand and 50 thousand reads, respectively. Ion Torrent and Illumina yielded substantially more reads, with 20.7 million and 10.8 million, respectively. PacBio sequencing of the short-amplicon library yielded the highest fraction of high-quality reads (≤1 expected error), followed by Illumina, with Ion Torrent yielding the lowest quality ( Figure S6b). Although the per-base read quality of the long-amplicon PacBio sequences was similar to that of Illumina ( Figure S6a), this translated to a greater number of expected errors per read due to the amplicon length ( Figure S6b). Demultiplexing, primer trimming and quality filtering reduced the read totals by 64% for PacBio long amplicons, but only by 21% for PacBio short amplicons, resulting in a similar number of filtered reads for the two strategies. Losses in demultiplexing, trimming and quality filtering were intermediate for Ion Torrent and Illumina, with 41% and 28% loss, respectively. Extraction of only the ITS2 region before quality filtering ( Figure S6c) reduced the loss of long-amplicon PacBio reads to only 29%, comparable to Illumina. Application of TZARA resulted in 708 reconstructed long-amplicon ASVs, representing 97% of denoised ITS2 reads from the long-amplicon PacBio data set. Mapping identical ITS2 ASVs from the short-and long-amplicon data sets allowed 58%, 71% and 81% of denoised reads from the Ion Torrent, Illumina and PacBio short-amplicon data sets, respectively, to be assigned to a long-amplicon ASV (Table S2).
Almost all of the short-amplicon sequences from all three technologies were between 240 and 375 bp long ( Figure S7a Except for these sequences, all conserved regions of LSU, as well as 5.8S, displayed very little size variation, as expected, with standard deviations <2 bp. Around 12% of ITS2 sequences extracted from the long-amplicon data set were shorter than 140 bp, a much greater fraction than the 0.26%-0.44% from the short-amplicon data sets ( Figure S9). The taxonomic identity of these sequences is discussed below.
Agaricus bisporus, the positive control, was represented by a single ASV in the positive control samples for both long-and shortamplicon PacBio data sets, and in the Ion Torrent data set. Agaricus bisporus was represented by two ASVs in the Illumina data set, which differed at one base pair (99.5% similarity in ITS2). The abundance of the second ASV was 1.1% and 1.0% that of the primary A. bisporus ASV in the two Illumina positive controls. The consistency of this ratio across replicate positive controls suggests that it represents true intercopy variation within the specimen, rather than sequencing or PCR error. Despite higher total sequencing depth, this ASV was not identified from the Ion Torrent data set.
Agaricus bisporus sequences represented 0.01%, 0.09%, 0.09% and 0.09% of noncontrol reads, in the PacBio long, PacBio short, Illumina and Ion Torrent data sets, respectively, giving similar estimates for the rate of tag-switching for all technologies. These reads were excluded from community analyses.

| Reproducibility of sequence detection using different technologies
We compared the unique ASVs and OTUs shared between data sets from different sequencing strategies, and the number of reads represented by these ASVs and OTUs in each strategy. The majority of abundant ASVs and OTUs were detected by all sequencing strategies used (Tables S3 and S4). The short-amplicon ASVs shared between all sequencing technologies represented 95%, 76% and 66% of the reads for PacBio, Illumina and Ion Torrent, respectively (Figure 2a).
When differences at the intraspecies scale were removed by clustering the ASVs into 97% OTUs, the number of OTUs shared between all three technologies increased to 524, representing 100%, 93% and 89% of reads, respectively (Figure 2b). In particular, the majority of the 9418 unique Ion Torrent ASVs were found to be shared with other sequencing technologies upon OTU clustering. Amplicon sequence variants unique to the Ion Torrent data set made up 14% of reads in that data set, but only 1% belonged to a unique OTU after clustering. In contrast, 21% of reads in the long PacBio data set belonged to ASVs whose ITS2 region was unique to that data set ( Figure 2c), and the fraction only reduced to 20% after clustering the ITS2 regions into OTUs (Figure 2d). The taxonomic identity of these ASVs is discussed below.
Read counts for shared ASVs and OTUs were highly correlated between strategies, with a minimum R 2 value of .47 ( Figure S10).
Correlations between read counts for the three technologies using the short-amplicon library were increased by OTU clustering ( Torrent short-amplicon reads, respectively; Figure S10).

Amplicon sequence variant richness estimates after rarefaction
were strongly correlated between the three sequencing technologies applied to the short-amplicon library (R 2 = .91-.94; Figure 3). The slope of the relationship between PacBio and Illumina richness estimates was only slightly different from 1, indicating that these two technologies give highly comparable rarefied richness estimates, despite the approximately 200× difference in original sequencing depth. Ion Torrent resulted in rarefied richness estimates, which were 24%-31% greater than the other technologies, an effect which is also visible in ASV accumulation curves (Figure 4). Amplicon sequence variant richness estimates were somewhat less strongly correlated between the PacBio long-amplicon data set and the three short-amplicon data sets (R 2 = .65-.72; Figure 3). Total least squares regression indicated that the long-amplicon data set resulted in richness estimates which were intermediate between the short-amplicon results from Ion Torrent and the other two technologies. Despite the fact that experimentwide OTU richness was lower than ASV richness (Figure 2), OTU accumulation curves for each sample ( Figure S11) and rarefied OTU richness relationships between sequencing strategies ( Figure S12) were highly similar to those for ASVs.

| Taxonomic assignment
For all sequencing data sets and taxonomic assignment protocols, a higher proportion of reads were assigned than of ASVs, indicating that common ASVs were more likely to be taxonomically identified than rare ASVs ( Figure 5). A greater fraction of ITS reads and ASVs were assigned using the Unite database than the Warcup database across sequencing technologies, amplicons, algorithms and taxonomic ranks. At most taxonomic ranks, the RDPC algorithm assigned the greatest fraction of reads and ASVs, followed by SINTAX, and then IDTAXA.
Taxonomic composition of the sequenced soil fungal community at the class level is summarized in Figure 6 and as a heat tree (Foster et al., 2017) in Figure S13. The ML tree for fungal ASVs, along with taxonomic assignments, is shown in File S3. According to the PHYLOTAX assignments, fungi represented 88% of the ASVs and 81% of the reads in the long-amplicon library, compared to 92.4%-96.4% of the ASVs and 97.9%-98.3% of the reads in the short-amplicon library. Many of the ASVs that were unique to the long-amplicon library thus fall outside kingdom Fungi ( Figure S14).
In particular, a large fraction of ITS2 sequences with length less than 140 ( Figure S9) were identified as Alveolates ( Figure S15).
Measured fungal community composition at the class level varied significantly between long and short amplicons (PERMANOVA with 9999 permutations, p < .0001, R 2 = .048), but only marginally between sequencing technologies (p = .0346, R 2 = .002). The F I G U R E 2 Shared richness and abundance of ITS2-based ASVs (a, c) and 97% OTUs (b, d) between different sequencing technologies from the same short-amplicon library (a, b), and between long-and short-amplicon libraries (c, d).
In each region, the ASV/OTU richness is given above, while the relative abundance of reads represented by these ASVs/OTUs in each sequencing strategy is shown below in the order PacBio/Illumina/ Ion Torrent (a, b), or long/short (c, d). For short amplicons in c and d, ASV/ OTU counts reflect detection by any of the three technologies, and read counts represent the mean fraction of reads across the three technologies. Analyses performed on pooled ASVs/OTUs from all samples majority of variation was spatiotemporal (i.e. between samples; p < .0001, R 2 = .90), but once this variation was removed, the remaining effect consisted of a clear bias against Sordariomycetes in the long-amplicon data set (Figures 6, S14 and S16). Additionally, several lower-rank taxonomic groups showed increased detection in either the long or short data sets, such as Tulasnellaceae (Agaricomycetes) and Pyronemataceae (Pezizomycetes) in the long-amplicon data set, and Meyerozyma (Saccharomycetes) in the short-amplicon data sets ( Figures S14 and S17).
Fungi categorized as ECM made up 9.0% of ASVs and 39.2% of reads in the long-amplicon library, and 5.4%-13.2% of the ASVs and 36.4%-46.4% of the reads in the short-amplicon library ( Figure S18).
Although amplicon length had a significant effect on ECM community composition at the family level ( Figure S17), the explained variation was very low (PERMANOVA with 9999 permutations, p = .0040, R 2 = .002), and the majority of variation was again spatiotemporal (p < .0001, R 2 = .98). Variation between sequencing technologies was not significant (p = .76, R 2 = .0002).

| Spatial analysis
Results of spatial analysis based on the Bray-Curtis dissimilarity were qualitatively similar between the two amplicon libraries and between PacBio and Illumina sequencing, with significant autocorrelation at p < .05 for ranges of up to 2-3 m for the total recovered fungal community, and 1-2 m for the ECM fungal community ( Figure   S19). In both cases, the greatest correlation magnitudes were found with Illumina, followed by long-amplicon PacBio. The least spatial structure was detected with PacBio short-amplicon sequencing. In contrast to the Bray-Curtis distance, the weighted UniFrac distance showed very little spatial structure, with only the total recovered fungal community in the 1 m distance class showing a significant correlation at p < .05. No temporal correlation was found for the weighted UniFrac distance.
The best-fit spatial turnover ranges based on Bray-Curtis distance-decay curves calculated from different sequencing strategies range widely from 13 to 31 m for the total recovered fungal community and 12-42 m for the ECM fungal community (Figure 7,  Table S5). However, there was overlap of the 95% confidence intervals for all of the Bray-Curtis spatial ranges in both the total recovered fungal and ECM fungal communities, across amplicon libraries and sequencing technologies (Table S5), so no strong conclusion of variability between methods can be drawn. Although a distancedecay model was fit for the weighted UniFrac distance applied to the total recovered fungal community, the result was very poorly constrained, and a range of 0 m, indicating no spatial structure, was included in the 95% confidence interval (Table S5).

| Reconstruction of long amplicons from denoised subregions
Sequencing depth in the long-amplicon PacBio data set was not sufficient to successfully denoise using standard protocols, given the amplicon length and diversity of the samples. Amplicon sequence variant recovery for long amplicons using DADA2 was dramatically improved from 12% to 76% of reads by denoising homologous subregions independently using our new LSUx and TZARA packages.
Although newer sequencing platforms from PacBio (Sequel and Sequel II) feature increased sequencing depth and lower error rate compared to the RS II, long sequences inherently require much more sampling depth to identify ASVs. Thus, TZARA should increase recovery of rare ASVs from these platforms as well. It may also be adaptable to Oxford Nanopore sequencing, which has hitherto posed difficulties for application to complex community metabarcoding (Loit et al., 2019).

| Comparison of sequencing strategies
The three sequencing technologies gave similar results for the shortamplicon library, the major difference being in sequencing depth.
Although a greater fraction of PacBio raw reads were ultimately mapped to ASVs (75%) compared to Illumina (63%) or Ion Torrent (65%), the latter two technologies provided much greater sequencing depth for a similar cost, allowing a greater diversity of rare ASVs to be recovered, and were much closer to saturation of their respective species accumulation curves (Figure 4). Operational taxonomic unit read counts were strongly correlated between technologies (R 2 = .72-.82), and even between primer pairs (R 2 = .49-.62, Figure S10b). This lends some support for the technical repeatability of abundance-based beta diversity measures in metabarcoding, although bias at the amplification stage still presents issues (Bellemain et al., 2010;Castaño et al., 2020;Kanagawa, 2003;Polz & Cavanaugh, 1998).

F I G U R E 7
Distance-decay plot for community dissimilarities and spatiotemporal distance. Circles represent community data from short-(top two rows) and long-(bottom two rows) amplicon libraries, sequenced by Illumina MiSeq (top row) or PacBio RS II (bottom three rows). Community dissimilarities are calculated using the Bray-Curtis dissimilarity for all data sets (top three rows) and using the weighted UniFrac dissimilarity for the long-amplicon library, for which a phylogenetic tree could be constructed (bottom row). The left column represents the full fungal community, and the right column only sequences identified as ECM. The colour of each circle represents the time lag between samples being compared (0 or 1 year), and the size represents the number of comparisons for that spatial distance and time lag. Lines are the best-fit lines for an exponential decay to max model. The model was only fit for data sets where the Mantel test indicated a significant relationship between community dissimilarity and spatial (for the 0 year time lag) or spatiotemporal (for the 1 year time lag) distance DADA2 denoising may perform differently on different technologies (or perhaps sequencing runs) and indicated by the fact that clustering ASVs at 97% led to substantially higher correspondence between both the set of OTUs recovered from the same library by different technologies and the read counts for each OTU ( Figure   S10). ASV diversity appears to be artificially inflated in the Ion Torrent data set relative to the Illumina and PacBio data sets, which gave remarkably similar ASV richness after rarefaction, despite a difference of around 200× in unrarefied sequencing depth (Figures 2   and 3). This may be a result of the lower fraction of very high-quality reads in the Ion Torrent data set ( Figure S6). We used options for DADA2 intended to improve performance on technologies, like Ion Torrent, with higher rates of homopolymer indel errors (Callahan, 2020b), but our results suggest that this still does not result in performance comparable to that which DADA2 achieves on Illumina sequences, for which it was developed (Callahan et al., 2016).
Although the longer read length capabilities of PacBio allow recovery of longer ITS2 sequences than the other two technologies, as has recently been demonstrated in mock communities (Castaño et al., 2020), in our data set from a natural community PacBio did not recover any reads from the short-amplicon library which were longer than those recovered by Illumina and Ion Torrent. Notably, neither long-nor short-amplicon sequencing recovered any sequences identifiable to Cantharellus, an ECM genus which is commonly observed at the study sites as fruitbodies (personal observations by BF and NSY), but which is also known to have accelerated evolution in the rDNA (Moncalvo et al., 2006) and longer ITS regions than other fungi (Feibelman et al., 1994), making it an especially difficult target for metabarcoding. Contrary to expectations, Illumina showed a slightly higher fraction of longer ITS2 sequences than Ion Torrent, which in turn showed slightly longer sequences than PacBio ( Figures S7 and   S9).
The long-amplicon data set included 20% unique taxa, even after clustering at 97% ITS2 similarity, indicating that the differences in the communities recovered are not due to small sequencing errors, but rather that the different primers amplify different parts of the community. The ITS4 primer used in the short-amplicon data set has known mismatches to Tulasnellaceae and Alveolata, while gITS7 also has mismatches for Tulasnellaceae . ITS1 and LR5 match a much broader range of fungal and other eukaryote groups . The alternate LR5-F primer (Tedersoo et al., 2008) would select against the nontarget Alveolata, at the expense of also having mismatches for the Tulasnellaceae. We assert that, for studies targeting ECM fungi in particular, more complete detection of groups with high rDNA variability such as Tulasnella (and ideally other Cantharellales) is worth the read depth spent on nontarget groups.

| Taxonomic identification
Assignment of ecological function to environmental fungal sequences is dependent on accurate taxonomic identification, especially at the genus level or below (Nguyen et al., 2016). However, different combinations of algorithms and reference data sets vary in their performance at confidently assigning taxonomy to sequences. Although RDP-LSU and Unite performed comparably at taxonomic placement of long-amplicon sequences, the Warcup database placed notably fewer sequences at all taxonomic levels for all data sets ( Figure 5). This is probably due to two factors. First, the Warcup database does not include any nonfungi, so it cannot place any nonfungal sequences. Second, due to its low-density coverage of the fungal kingdom (18,000 sequences vs. 800,000 for Unite), it is likely that many ITS sequences, especially from uncultured tropical soil fungi, have no close match in the Warcup database, and so cannot be placed. RDP-LSU, which has even fewer sequences (8000 fungi plus 3000 other eukaryotes), is probably more successful due to higher sequence conservation in LSU. Heeger et al. (2019) also found that a more conserved region, in their case 5.8S, outperformed ITS at placing sequences without close database matches. Of the three algorithms tested, IDTAXA placed fewer sequences than RDPC or SINTAX with all databases, as expected given its more wellcalibrated and conservative confidence scores (Murali et al., 2018), but this was particularly dramatic when paired with the Warcup database, where IDTAXA placed <25% of ASVs even to phylum. Gdanetz et al. (2017) showed that a majority-rule consensus of three assignment algorithms can improve the fraction of sequences assigned as well as decrease the false assignment rate. Strict consensus rejects assignments whenever there is conflict between methods and should therefore provide more conservative taxonomic assignments than majority-rule consensus. AMPtk (Palmer et al., 2018) uses a strict consensus taxonomy between UTAX and SINTAX as an alternative when an initial BLAST search failed to give a hit with at least 97% sequence identity, but did not present results assessing the results of this approach. Here, we found that strict consensus also usually increases the number of assigned sequences relative to any single method, except at family-and genus-level identifications ( Figure 5). Inconsistent family-and genus-level assignments are particularly problematic because accurate assignment at these ranks is generally required for ecological guild assignment using FUNGuild.
For ASVs where a long-amplicon sequence is available, our novel PHYLOTAX algorithm uses relationships from a provided phylogenetic tree to resolve these disagreements. The effect was most pronounced for the PacBio long-amplicon data set, where 46% and 62% of reads were assigned to genus and family, respectively, by the strict consensus of methods, but PHYLOTAX increased this fraction to 73% and 81%. This led to a corresponding increase in the fraction of fungal reads assigned to a functional guild from 71% to 90% ( Figure S18). For short-amplicon sequencing strategies, the improvement was more modest, because PHYLOTAX could only be applied for ITS2 ASVs with a match to one of the long-amplicon ASVs (last row of Table S2). Deeper long-amplicon sequencing would improve the coverage of long amplicons, allowing a greater fraction of shortamplicon ASVs to also be placed phylogenetically.
Because our data set was generated from environmental samples whose true taxonomic affinity is unknown, we were not able to assess the accuracy of taxonomic assignments by any of the methods used here. Accuracy has been assessed using leave-one-out validation for the primary assignment algorithms (e.g. Edgar, 2018;Murali et al., 2018) and other consensus methods (e.g. Gdanetz et al., 2017;Somervuo et al., 2016), and similar work could be carried out in the future for PHYLOTAX.
Distance-decay plots (Figure 7, Table S5) gave substantially longer autocorrelation distances. There was little variation in the results between the Illumina and long-amplicon PacBio data sets for both the total recovered fungal community and the ECM community, with best-fit estimates ranging from 12-18 m. The 95% confidence interval was substantially wider than this variation, generally covering a range of 5-41 m. All of these values are smaller than the 65 m reported by Bahram et al., (2013), also based on distance-decay curves from a similar ECM woodland habitat in Benin, but based on Sanger sequencing of ECM root tips rather than HTS metabarcoding of bulk soil. We speculate that this discrepancy is due to an increased ability to detect spatially variable rare species using HTS.
For the short-amplicon data set, PacBio showed a spatial turnover range more than twice as long as showed by Illumina (Table S5) for both the total fungi and ECM communities, with wide confidence intervals. It is possible that the weaker fit for this data set, which also showed weaker autocorrelation in the Mantel correlogram, is due to low sequencing depth in the PacBio short-amplicon data set. The long-amplicon PacBio data set, with more than twice the read depth of the short-amplicon PacBio data set, gave spatial turnover distance results much closer to those from Illumina. This is consistent with our speculation that the longer spatial turnover range found by Bahram et al. (2013) is related to sequence sampling depth.
Year-to-year correlation was found for both the total recovered fungal and ECM communities in the long-amplicon data set ( Figure S19). The spatiotemporal distance-decay fit estimated the temporal turnover range as 3.3 years for the total recovered fungal community and 4.2 years for the ECM community, but with overlapping confidence intervals. This corresponds to a space-for-time substitution rate (i.e. ratio of a spatial distance to a time delay which results in equivalent community dissimilarity) of 5.4 and 3.3 month/ year for the total recovered fungal community and ECM community, respectively. In a recent study, Kivlin and Hawkes (2020) reported a space-for-time substitution rate of 81 month/year (reported as 6.8 day/1.5 m) in the soil fungal community of a nonseasonal tropical forest in Costa Rica. However, comparison is obscured by different spatial and temporal sampling scales between the two studies.
Year-to-year variation in ECM fungal communities, which we sampled, has been shown to be less than intra-annual variation , as sampled by Kivlin and Hawkes (2020). Neither data set from the short-amplicon library showed significant temporal autocorrelation.
Weighted UniFrac did not reliably detect spatial structure within this relatively ecologically homogeneous community. Although the Mantel test did show a small but significant positive autocorrelation in the fungal community at the smallest size category (1 m; Figure   S19), the distance-decay plot in Figure 7 does not show any clear relationship. The functional fit showed poor convergence, with a 95% confidence interval for spatial range of 0-5470 m, indicating little evidence of spatial structure. This is probably because the majority of community turnover in this system, especially among ECM fungi, is between closely related species or individuals of the same species, while the presence of major clades (e.g. ECM lineages sensu Tedersoo et al., 2010) are more spatially constant. This is also reflected in the generally smaller sample-to-sample dissimilarities measured by UniFrac (0.4-0.6) as compared to Bray-Curtis (0.8-1.0) in Figure 7. UniFrac analysis would be more suited at larger spatial scales and/or larger ecological gradients.

| CON CLUS ION
Contrary to our hypothesis, we found that Illumina and Ion Torrent sequencing of real environmental samples resulted in neither qualitative nor quantitative bias against longer ITS2 amplicons, relative to PacBio. Furthermore, although we did find an increased ability to detect certain fungal groups using the more universal ITS1-LR5 primer pair, the choice of amplicon and sequencing technology did not affect the results of the spatial analysis, provided sufficient sequencing depth. Alpha diversity estimates were strongly correlated between methods, but somewhat inflated for Ion Torrent relative to the other technologies. However, the addition of long-amplicon reads did allow the construction of a phylogenetic tree directly from the metabarcoding reads, which allowed refinement of taxonomic assignments using our new tool PHYLOTAX. DADA2 ASV yield was initially poor for long amplicons, but this was improved by developing a workflow for extraction of subregions, separate denoising and then reconstruction of full-length unique sequences. Together, these approaches provide a hybrid approach using long-read sequencing to acquire long-amplicon sequences for the local species pool in order to improve taxonomic assignments, and cost-effective shortread sequencing to provide high sampling depth and sample number.

AUTH O R CO NTR I B UTI O N S
Sampling was planned and carried out by BF, NSY and MR.
Bioinformatics and data analysis were performed by BF with input from MB, AR and MR. Scripts and R packages were written by BF.
The manuscript was drafted by BF and MR. All authors contributed to and approved the final version of the manuscript.

DATA AVA I L A B I L I T Y S TAT E M E N T
• Trimmed, demultiplexed sequencing reads are deposited at the European Nucleotide Archive (ENA) under Project Accession no.
PRJEB37385. Accession numbers are given in Files S1 and S2.
• Consensus ASV sequences from the PacBio and Illumina data sets are deposited at ENA under Project Accession no. PRJEN37385, Accession nos. HG995461-HG996435 (ASVs detected in the long amplicon data set) and FR984199-FR988025 (ASVs only detected in the short amplicon data set). Because of concern that the Ion Torrent data set may contain many ASVs with uncorrected sequencing errors, we elected not to submit ASVs, which were only detected in the Ion Torrent data set. However, these sequences, along with all other consensus ASVs, are archived in Dryad (Furneaux et al., 2021, https