The use of different 16S rRNA gene variable regions in biogeographical studies

Abstract 16S rRNA gene amplicon sequencing is routinely used in environmental surveys to identify microbial diversity and composition of the samples of interest. The dominant sequencing technology of the past decade (Illumina) is based on the sequencing of 16S rRNA hypervariable regions. Online sequence data repositories, which represent an invaluable resource for investigating microbial distributional patterns across spatial, environmental or temporal scales, contain amplicon datasets from diverse 16S rRNA gene variable regions. However, the utility of these sequence datasets is potentially reduced by the use of different 16S rRNA gene amplified regions. By comparing 10 Antarctic soil samples sequenced for five different 16S rRNA amplicons, we explore whether sequence data derived from diverse 16S rRNA variable regions can be validly used as a resource for biogeographical studies. Patterns of shared and unique taxa differed among samples as a result of variable taxonomic resolutions of the assessed 16S rRNA variable regions. However, our analyses also suggest that the use of multi‐primer datasets for biogeographical studies of the domain Bacteria is a valid approach to explore bacterial biogeographical patterns due to the preservation of bacterial taxonomic and diversity patterns across different variable region datasets. We deem composite datasets useful for biogeographical studies.


INTRODUCTION
The ubiquity of the 16S rRNA gene among prokaryotes, coupled with the presence of both conserved and variable nucleotide regions in its sequence, has led to its widespread use in environmental studies examining the structure and diversity of prokaryotic communities (Straub et al., 2020;Tringe & Hugenholtz, 2008).However, the read length of the most commonly used next-generation sequencing technology (i.e., Illumina) ranges from 100 to 300 bp, with typical paired-end sequencing covering only a fraction of the full 16S rRNA gene ($1500 bp; Abellan-Schneyder et al., 2021).Consequently, a shortcoming of this technology has been that only between one and three of the nine 16S rRNA variable gene regions (i.e., V1-V9) can be sequenced in a single Illumina sequencing run (Abellan-Schneyder et al., 2021;Goodwin et al., 2016).
Over the last 20 years of 16S rRNA gene based phylogenetics, multiple variable gene regions have been targeted by different primer sets for amplification of the intervening 16S rRNA gene regions and subsequent phylogenetic assignments (Abellan-Schneyder et al., 2021).While attempts to establish universal protocols for prokaryotic phylogenetic analysis of environmental samples, such as the Earth Microbiome project (Gilbert et al., 2018), have arguably led to a greater consensus on primer selection, the continued use of different variable regions as phylogenetic markers adds complexity to comparisons of different 16S rRNA gene amplicon datasets (Sperling et al., 2017;Tremblay et al., 2015;Yang et al., 2016).Thanks to long-read sequencing technologies, such as PacBio and more recently Oxford Nanopore, the 16S rRNA gene can be sequenced fully (Klemetsen et al., 2019;Matsuo et al., 2021;Numberger et al., 2019;Winand et al., 2019).However, the vast majority of published studies, and consequently the data available in online sequence repositories, report Illumina sequence data derived from primers designed for the amplification of partial 16S rRNA gene sequences (Gilbert et al., 2018;Pollock et al., 2018).
It is current practice for all sequences generated as part of microbial ecological studies to be uploaded to a public sequence repository (e.g., Leinonen et al., 2011;NCBI Resource Coordinators, 2015).Consequently, online repositories contain many publicly available 16S amplicon sequence datasets derived from a huge spectrum of prokaryotic communities (Gilbert et al., 2018;Jurburg et al., 2020).Some of those datasets are derived from unique samples acquired from the most remote and inaccessible regions on Earth (e.g., Dragone et al., 2021;Staebe et al., 2019).Such samples are arguably of great importance; for example, for biogeographical surveys aimed at resolving complex prokaryotic distributional patterns across large spatial, environmental or temporal scales (Dickey et al., 2021).However, the value and utility of these datasets may be reduced by lacking consistency in prokaryotic phylogenetic analysis protocols, including primer selection for variable region amplification (Abellan-Schneyder et al., 2021;Pollock et al., 2018;Tremblay et al., 2015;Yang et al., 2016).The use of different primers can lead to the differential resolution of different organisms (Fredriksson et al., 2013;Tremblay et al., 2015).Furthermore, this may also lead to loss of taxonomic resolution as, when working with datasets composed of 16S rRNA gene samples targeting different variable regions, it is not possible to work at amplicon sequence variant (ASV) level because different variable region sequences are represented by different sets of ASVs (Callahan et al., 2017) and are therefore not comparable.
The use of composite datasets may be particularly important for studies aiming to establish prokaryotic community patterns across vast and remote areas, where sample collection is challenging and expensive.
Here, we explore whether environmental phylogenetic sequence data stemming from diverse 16S rRNA gene variable regions can be used as a resource for comparative biogeographical studies.To test how these data can be viably and validly combined, we sequenced eDNA from 10 Antarctic soil samples using 5 primer sets (i.e., 27F-519R, 341F-805R, 515F-806R, 515F-926R and 926F-1392wR), obtaining amplicon sequence data representing five different 16S rRNA gene amplicons spanning seven 16S variable regions (i.e., V1-V3, V3-V4, V4, V4-V5 and V8-V9).S1 and Table S1).At each location, 500 g of surface soil (0-10 cm) was collected by combining five sub-samples from each plot into sterile Whirl-Pak bags (Nasco, Fort Atkinson, Wisconsin), as described in Czechowski et al. (2022), Czechowski, Clarke, et al. (2016) and Czechowski, White, et al. (2016), for the Prince Charles Mountains and Reinbolt Hills samples, and in Velasco-Castrill on et al. (2014) for the two coastal samples.Soil samples were kept at À20 C right after sampling and stored permanently at À80 C until further processing.

Sequence data processing and analyses
Illumina sequencing adapters were trimmed with Trimmomatic v 0.39 (Bolger et al., 2014), and default parameters.The five datasets consisting of the seven variable 16S regions (i.e., V1-V3, V3-V4, V4, V4-V5 and V8-V9) were then analysed separately in the R environment v 4.0.3(R Core Team, 2021) using dada2 v 1.16.0 package (Callahan et al., 2016).The resulting ASVs (Callahan et al., 2017) were taxonomically annotated with reference information of the SILVA database v 138 (Quast et al., 2012).Subsequently, the data of the five variable gene regions were combined, and ASVs assigned to Eukaryotes, mitochondria and chloroplasts removed.To overcome diverse read sample size, the dataset was then normalized using scaling with ranked subsampling (SRS) method with the R package SRS; read counts were scaled using the total read count of the smallest sample (n = 14,035; Beule & Karlovsky, 2020).Because the different primer pairs showed differential amplification of taxa from the domain Archaea (Table S3), ASVs assigned to domain Archaea were also removed, thereby retaining only ASVs associated with domain Bacteria.
All statistical analyses were performed on taxonomy datasets (i.e., at genus and phylum level) as it was inappropriate to work at ASV level due to the use of different variable region sequences represented by different sets of ASVs (Callahan et al., 2017).Comparisons of taxonomic datasets at lower (i.e., genus) and higher (i.e., phylum) taxonomic levels were conducted in order to explore which taxonomic level was more consistent between samples sequenced for different 16S rRNA gene variable regions.
Analysis of similarity (ANOSIM; Clarke, 1993) tests were performed using the function anosim() from the R library vegan v 2.5.7 (Oksanen et al., 2022).ANOSIM tests were calculated on the Bray-Curtis dissimilarity matrices obtained from the Hellinger-transformed genus and phylum taxonomic datasets, using 10,000 permutations (Legendre & Anderson, 1999;Legendre & Gallagher, 2001).Principal coordinates analysis (PCoA) was performed using the function pcoa() from the R library ape (Paradis & Schliep, 2019).Distance-based redundancy analysis (dbRDA) was performed using the function capscale().Before running capscale() the geochemical and bioclimatic variables were standardized with decostand() and checked for collinearity.The function ordiR2step() was used to select the environmental variables to use in the RDA model; these environmental variables were then checked for significance using the function anova.cca().All these functions are part of the R library vegan v 2.5.7 (Oksanen et al., 2022).Bray-Curtis dissimilarity and Jaccard dissimilarity matrices were calculated applying the function vegdist() (vegan v 2.5.7) on the community relative abundance and absence/presence datasets, respectively.Shannon index was calculated using the function diversity() (vegan v 2.5.7).

Bacterial community composition at phylum and genus levels
Of the 26 phyla represented in the normalized entire dataset (i.e., dataset including all the samples sequenced for the 5 variable regions), 15 phyla were present with a relative percentage higher than 1% in at least one sample (i.e., dominant phyla; Figure 1A,B).These 15 phyla were represented in all the variable region datasets.However, three of these phyla were differentially abundant at relative abundances higher than 5% across the different datasets; the phylum Actinobacteriota ranged between 18.7% in V4-V5 and 25.3% in V8-V9, Bacteroidota between 12.4% in dataset V4 and 18.4% in V4-V5 and Chloroflexi between 14.3% in V8-V9 and 22.4% in V3-V4 (Figure 1A).Differences in phylum relative abundances were also observed at single sample resolution (Figure 1B).Five phyla showed at least one significant pairwise difference between variable region datasets, meaning that 33% of the dominant phyla were differentially distributed across different variable region datasets (Figure 1C).The pairwise comparisons showing the highest number of significant differences (5 phyla; p < 0.05) were 'V3-V4 versus V4-V5' and 'V4 versus V4-V5'.The only pairwise comparison that did not show any significant difference between any of the dominant phyla was 'V3-V4 versus V4' (Figure 1C).Taxonomic similarity between V3-V4 and V4 amplified regions is not surprising, considering that the DNA sequence of the 16S rRNA gene V4 region is included in the V3-V4 region.Comparison of 'V4 versus V4-V5', even if equally overlapping, showed as many differences as the other comparisons.V4-V5 amplicons have been previously shown to provide dissimilar taxonomic profiles when compared with other amplicons (Abellan-Schneyder et al., 2021).
The total number of genera in the entire dataset was 627, ranging from 363 in the V1-V3 dataset to 434 genera in V4-V5 dataset (Table 1[A]).The number of dominant genera (i.e., genera represented by a relative abundance higher than 1% in at least one sample) in the entire dataset was 74 and ranged from 42 in V4 and V4-V5 to 60 in V1-V3, indicating that the dominant community at genus level differed widely across different variable region datasets (Table 1[B]).Of these 74 genera, 11 (accounting for the 15% of the dominant genera) showed significant differences (p < 0.05) in at least one of the pairwise comparisons between datasets.The pairwise comparison with the highest number of significantly diverse comparisons was observed between V3-V4 and V8-V9; and the lowest number was observed between V3-V4 and V4, and V4 and V4-V5 (Figure 2).Four of the genera showing significantly diverse distributions across primer datasets belonged to Actinobacteriota (Iamia, Marmoricola, Nakamurella and Nocardioides), one to Abditibacteriota (Abditibacterium), one to Verrocomicrobia (Candidatus Udaeobacter) and one to Planctomycetota (Tundrisphaera).All these phyla showed differential distribution in at least one pairwise comparison (Figure 1C).The only phylum that was differentially abundant, but was not associated to any differentially abundant genera in the dominant community, was Gemmatimonadota.Differences in community composition between the variable region datasets and samples taken at different locations were compared with assess whether using different variable regions has a significant impact on beta-diversity analyses.ANOSIM statistics performed on the phylum-level taxonomic dataset showed R values of 0.79 (p = 0.00009) and 0.19 (p = 0.00030) for the factors 'Sample' (i.e., ME1, ME2, ME3, MM1, MM2, LT1, LT2, RH1, C1 and C2) and 'Variable region' (i.e., V1-V3, V3-V4, V4, V4-V5, V8-V9), respectively.ANOSIM statistics performed on the genus-level taxonomic dataset showed a higherR for the factor 'Sample' (r = 0.89; p = 0.00009) and a lower R for factor 'Variable region' (r = 0.11; p = 0.01260) where p was also higher compared with the phylum dataset (Table 2).These results suggest that despite the significant differences in community composition between different variable regions of the same sample, samples extracted from distinct locations were still clearly separated regardless of the variable region used.We therefore conclude that the use of different variable regions had a relatively low impact on overall community compositions when comparing samples from different locations.It is worth noting that compared with the phylum F I G U R E 1 Dominant phyla (i.e., phyla present with a relative abundance higher than 1% in at least one sample) relative abundance distribution in the five variable region datasets (i.e., V1-V3, V3-V4, V4, V4-V5 and V8-V9) (A).Relative abundance of dominant phyla in all samples where only relative abundances >1% are represented by a dot (B).Tukey's test showing the pairwise comparisons between different datasets performed for each phylum (C).
dataset, the genus dataset showed substantially more taxonomic consistency between samples sequenced using different variable region primer sets.This suggests that when working with datasets composed of samples sequenced using different 16S rRNA gene variable regions, it is more reliable to work at the lower (e.g., genus) rather than higher (e.g., phylum) taxonomic levels.
The higher reliability of the genus dataset, compared with the phylum dataset, may be due to a variety of reasons.First, taxonomic datasets are obtained by summing all reads belonging to ASVs assigned to specific taxa.However, different primers induce amplification taxonomic biases; that is, differentially amplify different taxa (Fredriksson et al., 2013;Tremblay et al., 2015).Higher taxonomic levels (e.g., phylum) could therefore accumulate more biases than lower taxonomic levels (e.g., genus) because they group a higher number of ASVs.The differential abundance of a specific genus will be reflected at the phylum level, and this is shown by the fact that four phyla (out of the five phyla that showed a differential distribution across T A B L E 1 Number of genera (A), dominant genera (i.e., genera represented by a relative abundance higher than 1% in at least one sample; B), rare genera (i.e., genera represented by a relative abundance lower than 0.1% in all samples; C), Shannon index (D) and unique genera (E) in each sample and amplicon dataset.F I G U R E 2 Relative abundance of dominant genera (i.e., genera present with a relative abundance higher than 1% in at least one sample) in all samples where only relative abundances >1% are represented by a dot (A).Tukey's test showing the pairwise comparisons between different datasets performed for each genus (B).
could not be classified at the genus other taxonomic) level.This could bring to further uncertainties at the phylum level as different phyla are composed of different percentage of unknown organisms (Table S6  and S7).Finally, the number of genera in a dominant community is higher than the number of phyla, and therefore the number of differentially abundant genera (11 compared with 5 phyla) has less statistical weight (10% of the genera showed statistically significant (p < 0.05) pairwise differences, compared with 33% for phyla; Figures 1C and 2B).
We note that, conversely, Abellan-Schneyder et al. ( 2021) found phylum-level resolution to be more preserved between different 16S rRNA gene variable regions compared with genus-level resolution.Irrespective of these conflicting results, we propose that phylumlevel analyses are not suitable for performing highresolution analyses of bacterial communities, as they group widely different organisms with different metabolic capacities and potentially diverse environmental roles (Kersters et al., 2006;Tischler et al., 2019).We therefore recommend that, when working with composite datasets, all such analyses are performed at the lowest possible taxonomic level, such as species level or genus level.Our analyses were restricted to genus-level assignments (see Winand et al., 2019), since only 1% of ASVs could be validly assigned at species level.

Alpha diversity, dominant, rare and unique genera
The number of genera (i.e., richness), dominant genera, rare genera, unique genera and Shannon index metrics varied among different 16S rRNA gene variable region data, even on a single sample (Table 1).However, pairwise correlations between variable region datasets showed that richness and Shannon index were consistent across all the variable region datasets (p < 0.05; Figure 3A,D and Table S8).For the number of unique genera, only the comparisons 'V3-V4 versus V4-V5' and 'V4-V5 versus V8-V9' did not show significant correlations (Figure 3E and Table S8).Only one correlation for the number of dominant genera, and four correlations for the number of rare genera were statistically significant (Figure 3B,C and Table S8).
These results suggest that it is statistically valid to derive alpha diversity metrics (e.g., richness and Shannon diversity) and identify unique genera when comparing composite datasets.However, even if the pairwise correlations are statistically significant, the number of shared genera between the same sample, analysed by sequencing different variable regions, is low (Table 1).Our conclusion is that, whereas analyses of bacterial diversity trends across samples are reliable and valid, detailed descriptions of which taxa are present or absent from a specific sample are neither reliable nor recommended.

Biogeographic analyses
In biogeographical studies, the relationship between microbial communities and geographical distances or environmental variables are often based on similarity and dissimilarity matrices, such as Bray-Curtis dissimilarity matrix calculated on transformed relative abundance community datasets, or Jaccard dissimilarity matrix calculated on absence/presence datasets are commonly used (Schroeder & Jenkins, 2018).To test whether these matrices varied due to different variable region datasets, we performed pairwise correlation analyses between the different variable regions, and mixed datasets created by randomly choosing samples from all the variable region datasets (Mix 1, Mix 2 and Mix 3; Table S4).All these datasets had a positive significant correlation between each other higher than 0.90 (p < 0.05; Figure S2).This demonstrated that mixed datasets can be reliably used to explore similarities and dissimilarities in bacterial community composition and distribution, and to apply statistical analyses based on these parameters (e.g., cluster analyses, distance-decay).
PCoA and dbRDA (adjusted R 2 = 0.427 and p = 0.001) cluster analyses, both widely used in biogeographical studies, showed a clear grouping of the dataset by sample (Figure 4).Composite datasets can therefore be reliably visualized in 2D space where the same samples, even when sequenced for different variable regions, showed similar relationships to climatic (BIO4, temperature seasonality; and BIO10, mean temperature of warmest quarter) and geochemical (gravel, pH, sulphur concentration) variables (Figure 4B).Finally, correlations between Bray-Curtis dissimilarity matrix, obtained from the entire bacterial community dataset, and sample geographical distance and environmental variables were performed.Even if these analyses were performed on the entire dataset (i.e., composite of all samples independently of the sequenced 16S rRNA gene variable region), bacterial community distributions correlated significantly with sample geographical   Tremblay et al., 2015;Yang et al., 2016), we do not recommend any descriptive analyses of shared and unique taxa among different samples.Similarly, we do not recommend the use of composite datasets for analyses of specific taxa.These limitations do not constitute a problem when working on biogeographical studies where the focus is not on which taxa are shared between samples, but rather how many taxa are shared and how closely related the communities of two distinguished samples are (i.e., use of similarity and dissimilarity matrices).While we have identified composite 16S rRNA gene datasets as useful resources for biogeographical studies where the focus is on prokaryotic distribution trends across geographical distances and environmental gradients, we would emphasize that analyses of such datasets must be done with caution.For example, the amplified 16S rRNA gene variable region is not the only source of bias among different datasets; sample collection and DNA extraction methods, among other factors, can also play a role (Pollock et al., 2018;Teng et al., 2018).Ensuring that all the samples have been collected using consistent methods and that all samples have been extracted using similar protocols (e.g., beat-beating for soil samples) is therefore an important factor to consider.
Principal coordinates analysis (PCoA; A) and distance-based redundancy analysis (dbRDA; B) performed on the Hellinger transformed taxonomic dataset (genus level).dbRDA shows the effect of significant (p < 0.05) explanatory climatic and geochemical variables on bacterial community distribution.BIO4, temperature seasonality; BIO10, mean temperature of warmest quarter.