Improving Species Level‐taxonomic Assignment from 16S rRNA Sequencing Technologies

Analysis of the bacterial community from a 16S rRNA gene sequencing technologies requires comparing the reads to a reference database. The challenging task involved in annotation relies on the currently available tools and 16S rRNA databases: SILVA, Greengenes and RDP. A successful annotation depends on the quality of the database. For instance, Greengenes and RDP have not been updated since 2013 and 2016, respectively. In addition, the nature of 16S sequencing technologies (short reads) focuses mainly on the V3‐V4 hypervariable region sequencing and hinders the species assignment, in contrast to whole shotgun metagenome sequencing.


INTRODUCTION
The 16S ribosomal RNA (rRNA) gene sequencing is widely used for microbiota analysis with next-generation sequencing (NGS) technologies.The 16S rRNA gene has nine hypervariable regions (V1 to V9, Chakravorty et al., 2007), but only the V3 and V4 regions are commonly targeted for short-read sequencing and microbial community analysis.These regions have been found to balance sequencing diversity and technical challenges associated with their analysis (López-Aladid et al., 2023).However, the taxonomical annotation when using these regions is often restricted to the genus level because species resolution cannot be achieved (Gwak and Rho, 2020;Hiergeist et al., 2023).
Analysis of the bacterial community from a 16S rRNA amplicon data includes comparing the reads to a reference database.A successful annotation depends on the raw data and database reference quality.To process amplicon sequencing data from raw reads to taxa abundance tables, several bioinformatic pipelines have been developed, such as Quantitative Insights into Microbial Ecology 2 (QIIME2, Bolyen et al., 2019), DADA2 (Callahan et al., 2016), and mothur (Schloss et al., 2009).All these abovementioned pipelines involve mapping reads to taxonomical reference databases.Three common standard 16S rRNA databases are SILVA (Quast et al., 2013), Greengenes (McDonald et al., 2012), and RDP (Wang et al., 2007).In contrast to SILVA, Greengenes and RDP have not been updated since 2013 and 2016, respectively.These databases also differ significantly in their size, content, and how they are curated.
Moreover, different from whole shotgun metagenome sequencing, which analyzes the collective genetic material of all microorganisms in a particular sample, 16S rRNA sequencing presents low species taxa discrimination capacity because of this gene homology between some species (Gwak & Rho, 2020).Strategies are warranted to enhance the taxonomical classification of 16S RNA sequencing data and obtain more accurate insights into the microbial composition of samples.Therefore, here we used a modified custom in-house developed R code to improve the taxonomical classification of the amplicon sequence variants (ASV) obtained from complex communities in fecal samples.Following the default taxonomy assignment obtained from SILVA v.138.1 databases through DADA2 functions assignTaxonomy and addSpecies, the chimera-free ASVs were submitted towards a custom BLASTN database constructed from the SILVA v.138.1 databases, using a robust E-value of 1e -50 and identity threshold of 99.5%.
Each ASV classification method was checked for similarities and discrepancies through an R code automatic checkout to establish the best ASV taxonomic level of resolution.This gave additional confidence about the taxonomic lineages obtained.

Hardware
The present protocol can be run on Linux, MacOS, or Windows (with Subsystem for Linux) operating system, with sufficient available random-access memory (RAM) and disk space.The most resource-intensive software in terms of computing requirements are DADA2 and Kraken2 (NCBI RefSeq Targeted Loci Project database).As a general recommendation, it is advisable to ensure a minimum of 32 GB of RAM and at least approximately 140 GB of disk space.Therefore, depending on the number of samples to be analyzed, working on a multicore CPU system or a high-performance computing system is highly encouraged.

Flowchart
The flowchart (Fig. 1) shows the steps in the proposed protocol to offer a general idea of the processes described below.This workflow can be considered as a graphic summary of the process.In contrast to Basic Protocol 1, Basic Protocols 2 and 3 show a drop in the output raw ASV numbers possibly due to the stringent homology required to define a match for the blast algorithm but not for the E-value (four more raw ASV classified when increasing to an E-value of 1e -10 ) and the different taxonomical reference database used in Basic Protocol 3 (NCBI RefSeq Targeted Loci Project).Nevertheless, after the final merge of all the information (Basic Protocol 4), the curated ASV (without non-human Bars-Cortina et al.

of 19
Current Protocols gut, Mitochondria, and Chloroplast lineages) grows up and gives additional confidence about the taxonomic lineages obtained.

NOTE:
All protocols involving animals must be reviewed and approved by the appropriate Animal Care and Use Committee and must follow regulations for the care and use of laboratory animals.Appropriate informed consent is necessary for obtaining and use of human study material.

SAMPLE INFERENCE AND TAXONOMIC PROFILING THROUGH DADA2 ALGORITHM
This protocol presents the results obtained from DADA2 pipeline to use the ASV sequences, which is the input data for the other protocols described here.The DADA2 pipeline is extensively detailed in the DADA2 developer's webpage (https:// benjjneb.github.io/dada2/ tutorial.html);therefore, we only state the steps followed.Briefly, lowquality reads were filtered and trimmed out based on the observed quality profiles by using the filterAndTrim function, truncating forward and reverse reads below 290 and 230, respectively, and considering a value of 2 as the maximum expected error.In detail, the function arguments for filterAndTrim are: maxN=0, maxEE=c(2,2), truncQ=2, trimLeft=10, truncLen=c(290.230),rm.phix=T, compress=T, multithread = T (the last argument is indicating the use of multiple cores) Furthermore, the first 10 nucleotides of each read were removed.We combined identical sequencing reads into unique sequences, made a sample inference from the matrix of estimated learning errors, and merged paired reads.For the sample inference step (see https:// benjjneb.github.io/dada2/ tutorial.html)the argument of the pool was defined as True.Chimeras and contaminants are often rare but spread across samples, making them much more effectively-identified when the samples are pooled (pool =T).Chimeric sequences were removed by using the removeBimeraDenovo function and taxonomy was assigned utilizing the SILVA 16S rRNA database (v.138.1).

Necessary Resources Hardware
Linux, MacOS, or Windows (with Subsystem for Linux) operating system, with sufficient available random-access memory (RAM) and disk space (see Strategic Planning)

Software
The following software must be installed and available in the PATH environment variable to be executable as a binary system: R (v4.

of 19
Current Protocols  1) is obtained, the DNA sequence assigned for each ASV is mapped to a specific Bacteria or Archaea lineage.b.Open, customize the file 08.assign_taxonomy.R to your directory pathway and run: Rscript 08.assign_taxonomy.R.As shown in the R script file, you need the object seqtab.nochim_pooling.rds.Submit this object first to assignTaxonomy and second to addSpecies functions, pay attention to the R script of the reference databases that use each of both functions.
Output generated to use downstream: taxa_silva138.1_pooling.rdsFor this task, DADA2 implements a naive Bayesian classifier method.Different reference taxonomic databases for 16S rRNA exist, but the most up-to-date and maintained DADA2 reference database is SILVA, which is an ELIXIR Core Data Resource from the DSMZ-German Collection of Microorganisms and Cell Cultures (GmbH, available at https:// www.arb-silva.de).
3. Phylogenetic tree (optional step).To create a phylogenetic tree, if you consider opportune to use in your 16S rRNA downstream statistical analysis, customize to your directory pathway and run: Rscript 09.phylogenetic_tree.R To perform this step, apart from the dada2 R package, you also need the R packages of phangorn and DECIPHER.
This script will generate output, which we will use downstream: tree_dada2_16S.rds

CUSTOM BLASTN DATABASE CREATION AND ASV TAXONOMICAL ASSIGNMENT
This protocol includes two steps.The first one describes the basic bash and R script commands to follow for creating a custom BLASTN database that will be used to classify the ASV sequences obtained in Basic Protocol 1.The second step describes the procedure from the BLASTN output to obtain a specific lineage based on E-value and percentage of identical matches (pident) parameters from the BLAST tool and other user-defined parameters detailed on an R script.All this process has been automatized to be robust and reproducible over time.Nevertheless, the output obtained in this protocol needs manual checks (but only for cases where only one specific lineage could be defined for a particular ASV sequence).

Necessary Resources Hardware
Linux, MacOS or Windows (with Subsystem for Linux) operating system, with sufficient available random-access memory (RAM) and disk space (see Strategic Planning)

Files
All the input files are detailed in the present protocol and available on the Figshare link.

Sample files
See Basic Protocol 2 on the Figshare link.
Bars-Cortina et al.

of 19
Current Protocols 1. Download databases to create custom BLASTN database.
In a specific directory of your choice, download the following reference SILVA databases (some used in Basic Protocol 1).Nevertheless, we use another species multifasta file not used in the DADA2 standard pipeline in the addSpecies R function (see Basic Protocol 1).Furthermore, we also used the original SILVA 138 reference database instead of the DADA2-trained reference customized by DADA2 developers.
5. Create a new identifier for all the sequences in the multifasta file of silva_dada2_arb.fa(solve 4.1): cat silva_dada2_arb.fa| seqkit replace -p.+ -r "{nr}" --nr-width 7 > 3basesdades.fa The previous function with seqkit replaces all identifiers of fasta sequence to a correlative number, starting from 1 up to 1322260.We have 1322260 sequences.
6. Create a new directory where you will create the custom BLASTN database.
Let's name the new directory as the database: The file ASV_code_sequence.fasta is a multifasta that contains all the ASV sequences (4617 ASV sequences in our studied case).
You can do this through the following: bash splitfasta.shASV_code_sequence_linear.fasta, or directly run the commands in the terminal.As you can see, we use the multifasta file as an argument of the bash script.
Bars-Cortina et al.

of 19
Current Protocols Output generated: In the same directory, you will generate as many FASTA files as there are ASVs in the original multi-FASTA file.
14. Move all the independent fasta files to a new directory: 15.Run BLASTN for all ASV sequences using the custom database obtained in step 7.
In the terminal, run: bash blastn.sh This is a long process, so we also provide an alternative script to run parallel tasks on an SGE cluster: blastn-sge.sh Find the output in the path defined in the previous blastn.shfile.
Check that the number of blasted.txtfiles coincide with the number of input fasta files.In our case the number is 4617.For example, in the folder that contains the output file, run the next command: 19.Manual check and deduplication.
The output of blast for some ASV may not be unique, and more than one species share the same DNA sequence.Deduplication of these cases can only be performed manually because there are many possible situations, and it requires some knowledge on bacterial taxonomy (e.g., Latin knowledge and habitat-specific microbiota lineages).
Bars-Cortina et al.

of 19
Current Protocols From the blastresults_CP.txtcreate a copy where a manual check is done (let's name it, e.g., blastresults_CP_selected.txt).Consider that we do not have to check all the 3096 ASV manually included in blastre-sults_CP.txt;we only must check the ASV with more than one lineage classification (due to BLAST identical pident value).
For ease of use, open blastresults_CP_selected.txt in office software (such as LibreOffice Calc or Microsoft Excel).By using the conditional formattingbased tool on duplicated ASV numbers you can highlight the ASV to accurate the lineage manually (Figure 2).
After the use of the conditional formatting-based tool, remove the column deep.
Let us comment on the steps we follow for different scenarios that you can face to reproduce our results: Case 1: Two or more lineages with the same pident value and the same level of resolution (example: ASV1008).Then for that case we defined ASV1008 as Granulicatella adiacens/para-adiacens.When we have different species possibilities for the same genus, we separate with "/" and alphabetically arranged.
Case 2: When for the same ASV, we have more than seven different species for the same genus (e.g., ASV1032), we transformed to Genus + sp.For that case, we use Methylobacterium sp.
Case 3: Sometimes, it is impossible to achieve the species level because we have different genera and species, all of which are human gut inhabitant (e.g., ASV 1287ASV , 1359)).Then, we retain the lineage up to the most common taxonomical rank.In these two cases, up to family rank (Enterobacteriaceae).The most complicated cases to establish a lineage in our data were ASV1156 and ASV1168.Nevertheless, considering that DADA2 defined that lineage up to the Lelliotia genus, it helped us define those lineages.
Once all the duplicated ASV ID rows are checked, save that document (blastre-sults_CP_selected.txt).

ASV TAXONOMICAL ASSIGNMENT USING NCBI REFSEQ TARGETED LOCI PROJECT DATABASE
This protocol uses the public 16S database from NCBI to classify the ASV sequences obtained from the DADA2 pipeline.For this purpose, we use the k-mer taxonomical classification algorithm of Kraken2 and Bracken2, creating a custom database.

of 19
Current Protocols

Necessary Resources Hardware
Linux, MacOS, or Windows (with Subsystem for Linux) operating system, with sufficient available random-access memory (RAM) and disk space (see Strategic Planning)
Within the folder you desire to create the Kraken2 custom database, create a subdirectory.mkdir customdatabase 3. Create the Kraken2 custom database only with the two fna files downloaded above from the NCBI RefSeq Targeted Loci.
To do this, in the folder that contains both files and in the computer that has installed Kraken2 and Bracken2, execute the following bash script so that Kraken creates the taxonomy folder inside the customdatabase folder: This command requires an internet connection to allow kraken2 to download the taxonomy file.Once this first step is finished, execute the next commands to build the custom database with the downloaded fna files: for f in *.fna; do kraken2-build --add-to-library $f --db /PATH/customdatabase; done NOTE: Please verify that the directory where you execute the command 'for f in *.fna' contains only the two *.fna files indicated on step 1 and no additional files with the .fnaextension.
Once finished, a new library subdirectory will have been created inside custom database directory.
To complete the creation of the Kraken2 custom database, use the following command.You can also execute this command directly in the terminal:kraken2-build --build --threads 28 --db customdatabase Three files with extension *.k2d will be created (which are essential to work the Kraken2 algorithm downstream).
Then, let's do the bracken build for different k-mer lengths to build a database (we use three k-mer lengths: 150, 200 and 240 bp).We can do it directly in the terminal.The output will be saved in the customdatabase directory.

of 19
Current Protocols To execute Kraken2, use these 4617 FASTA files as input, each corresponding to one of the 4617 ASVs retrieved from the DADA2 pipeline (Basic Protocol 1):bash kraken_refseq.sh 6. Perform Bracken analysis for the 150 bp kmer.
bash bracken_refseq.sh We want to transform the Kraken + Bracken output format to a more friendly (metaphlan format) output.We use the following bash scripts, two Python scripts downloaded from the Krakentools webpage (https:// github.com/jenniferlu717/ KrakenTools).Indeed, both Python scripts are available in Figshare.
7. Transform each Bracken report to the Metaphlan format.
The phyton file kreport2mpa.py is used within the following bash script.bash krakentools_1_refseq.sh Merge all metaphlan format files into one for downstream analysis.The Python script combine_mpa.py is used within the following bash script.bash krakentools_2_refseq.sh A txt file is generated (combined_allranks_mpa_refseq.txt).
8. Perform last changes/details on the output generated in order to obtain the R file ASV_lineage_refseq.rds with some convenient reformatting of the lineage established with Kraken 2 and Bracken 2 for each ASV using NCBI RefSeq Targeted Loci as a taxonomical database.To achieve this:

DEFINITIVE SELECTION OF LINEAGE AMONG THE THREE METHODS
From the previous protocols, we obtained the lineage for each ASV on the DADA2 pipeline (Basic Protocol 1) from the custom BLASTN database (Basic Protocol 2) and 16S NCBI public databases (Basic Protocol 3).This protocol combines the outputs to consider all our lineage information in three ways: analyze them, compare them, and find common patterns.The aim of this protocol is to obtain a more reliable lineage for each ASV after the output comparison of the three methods.Moreover, this protocol presents an extra final step to ensure the update and homogenization of taxonomic lineage ranks according to NCBI taxonomy (ftp.ncbi.nlm.nih.gov/pub/taxonomy/) using myTAI and taxonomizr R packages.In addition, it shows how to construct the phyloseq object, which is the bridge between bioinformatics and statistical analysis in microbiome data.

Necessary Resources Hardware
Linux, MacOS, or Windows (with Subsystem for Linux) operating system, with sufficient available random-access memory (RAM) and disk space (see Strategic Planning)

Software
The following software must be installed and available in your PATH environment variable to be executable as a system binary: R (v4.1.2):(https:// cran.r-project.org/bin/ windows/ base/ old/ 4.

Sample files
See Figshare 1. Select the lineage among the three methods.
The purpose of this step is to automate the lineage selection towards the three methods described previously and update the lineage classification (which differs in some cases for some genera or species if it comes from the SILVA database of NCBI RefSeq Targeted Loci).To do this, the R package myTAI is used.
Run Rscript script_pre_phyloseq_16S_CP.R The output file results_16SCP_def.RData file is obtained.This file contains the lineage selected for each of the initial 4617 ASV.In detail, we finally got 4469 ASV with some lineage.This difference is due to not removing human gut, Mitochondria, and Chloroplast lineages.
2. Perform last lineage check and create the phyloseq object.
In this last step, we propose to do some extra lineage checks and use another R package available (apart from myTAI) to update the lineage according to the NCBI Taxonomy database, which is updated periodically.The R package used is taxonomizr.Please note some additional steps are needed to perform after installing it: creating a SQL database.
To achieve it, run Rscript taxonomizr.R To finish the protocol, we also consider it opportune to facilitate the R code used to create a phyloseq object before conducting the microbiome statistical analysis through this same R package or others the reader could use.
The final phyloseq object that creates our script is species-level defined (through tax_glom phyloseq R package function) and filtered (Navas-Molina et al., 2013).
To get these analyses run Rscript last_step.R.

GUIDELINES FOR UNDERSTANDING RESULTS
For all the script runs, the standard output and standard error are merged, and it is crucial to check for potential error messages.If error messages exist, the results are unreliable, and the error(s) must be resolved before running the next downstream step.
The results retrieved after running BLASTN must be treated through R code blast_assignment.R (as detailed in Basic Protocol 2) to retain all information.
There is no manipulation in the output txt files from BLASTN running, only inspection.
To remind the meaning of each column of the BLASTN results in the txt files, re-open Bars-Cortina et al.
The blastresults_CP.txtfile obtained from the Basic Protocol 2 must be checked manually.Use the recommendations stated in that protocol, and, in case of doubt, check if the species determined is an inhabitant of the environment of your study.
Protocol 3, the Kraken2 results are saved into two folders (outputs and reports).The txt files in outputs folders are not used for Bracken2 and downstream analysis.Python commands from KrakenTools are used to obtain the Kraken2 result in a Metaphlan format which is easier to manipulate in R, at least for us.

COMMENTARY Background Information
Due to the nature of 16S rRNA sequencing as NGS technology (based on short reads) and the identical/highly homologous between some species (Gwak & Rho, 2020;Hiergeist et al., 2023), most microbiome studies use the genus rank level as the most resolution taxonomic level.On some occasions, to overcome this limitation and try to classify up to species rank, the tool BLASTN was used (Bazinet et al., 2018;Boisseau et al., 2023) with different parameter configurations due to the absence of a standardized protocol to classify species in 16S rRNA experiments.The most common strategy is to use some taxonomical reference databases integrated into the microbiome bioinformatic pipelines, such as Qi-ime2 and DADA2.However, a small percentage of ASV is classified at the species level.
Therefore, looking for a strategy to increase the rate of species level classified without compromising the misidentifications of their prediction is interesting to obtain more information from the same analysis that allows finding out specific species of a particular genus implied in some phenotype studied, or at least, to narrow to some species from a genus if a non-unique specie could be established.
First, the standard bioinformatic DADA2 pipeline is described in Basic Protocol 1 and is considered the pattern (the mother method).Basic Protocol 2 illustrates the DADA2 ASV taxonomical classification based on a custom BLASTN database built from the SILVA reference database (also used in Basic Protocol 1) using a confident and robust E-value of 1e -50 .Finally, Basic Protocol 3 describes the taxonomical classification of DADA2-ASV using the 16S reference database from NCBI RefSeq Targeted Loci and Kraken2 and Bracken2 as classifier algorithms.Through an automatic process (only a few manual checks are needed), the definitive taxonomic lineage for each ASV has been decided from the reference method (DADA2) and the other two methods (Blast and NCBI RefSeq).
The primary limitation of the current workflow lies in the fact that the 16S databases used for taxonomical profiling are not specifically tailored to the human gut ecosystem.However, this limitation is not inherently problematic, as the protocol is adaptable for various disciplines within microbiota research, including soil microbiology, animal studies, and human research, among others.
Nonetheless, researchers should exercise caution in cases where BLASTN results in different taxonomic lineages with the same percentage identity value, as well as discrepancies among the three methods presented in the manuscript.In such situations, researchers should draw upon their prior experience in their respective research fields and seek relevant literature to confirm whether the uncertain species has been documented within the specific habitat under investigation.Apart from this 3-method approach, two methods have been studied and implemented to improve the 16Sr RNA species classification.One way attempted to classify the DADA2 taxonomically-ASV towards the Unified Human Gastrointestinal Genome (UHGG) v2.0 database (Almeida et al., 2021)   method was abandoned due to the misclassification of ASV DADA2 sequences to non-16S rRNA sequences.UHGG database contains complete genomes and is designed for shotgun metagenomics analysis.Another method assayed but with unsatisfactory results was to create a custom Kraken2 database with the last version of SILVA.By default, Kraken2 incorporates the SILVA database only up to the genus level.For this reason, through a Python code, we force the incorporation of SILVA v138.1 species multifasta in the Kraken2 algorithm.However, no substantial improvement in species classification was achieved compared to the three methods detailed in the present protocol.
The protocol was performed in a 16S rRNA gene sequencing dataset from 156 human gut samples.In the pattern (mother) method (the DADA2 protocol), only 252 out of the initial 4617 ASV could be classified as species, representing 5.5% (Fig. 4).In contrast, the proposed method based on the output of three methods (DADA2, custom BLASTN and NCBI RefSeq Targeted Loci) classified 1754 ASV as species from 4344 curated ASV in total, increasing the percentage of species classified to 40.4% or 38% if referred to the original 4617 ASV.Analyzing the source of the new species assignments by method, BLAST provided 1371 ASV classified up to species, while 1001 species were retrieved from NCBI RefSeq method.
Furthermore, as can be seen in Figure 4, for the most resolution taxonomical ranks (e.g., Genus and Species level) the proposed strategy presented in this manuscript (use of three protocols instead of only one gold standard such as DADA2) allowed to obtain more lineages classified to those levels.In relation to the higher ranks, DADA2 (the pattern method) classified more ASV in comparison to the other two methods (BLAST and NCBI Ref-Seq) but the corresponding final output values are not so far from DADA2.Finally, DADA2 presented the lowest number of ASV that could not be classified either to Bacteria and/or Archaea (103).Nevertheless, the final output (considering the three methods) presented 267 ASV not classified to Archaea/Bacteria or misclassified to the non-human gut, Mitochondria, and Chloroplast lineages, which is markedly lower than the percentage presented by BLAST and NCBI methodology (1515 and 808 ASV, respectively).
Therefore, and in conclusion, the present protocol increased the practical eight times ASV classifications at the species level by incorporating the information retrieved from two additional methods (BLAST and NCBI RefSeq) apart from the pattern method of DADA2.Then, this methodology could be of interest to research groups that use the 16S rRNA gene sequencing technology in their metagenomic studies.

Software's version
Different versions of DADA2, BLAST, Kraken2, and Bracken2 could change the output of the results despite using the same parameters detailed in the present protocol.Therefore, it is crucial to annotate the version of each software used.
Bars-Cortina et al.

of 19
Current Protocols Check that all the headers of the sequences that you want to include in the custom BLAST database are unique.You can use the check_no_duplicates.py to detect it.
Can't find taxonomy/subdirectory in the database directory, exiting Check the bash script running the Kraken2 that the pathway to database is not included in whole detail.
Check that you have included the entire Kraken2 database pathway (from the root).
Error: Bad Request (HTTP 400) Internet failure communication with NCBI (using myTAI package) Restart the process or subset the process into smaller tasks.
Retrieval of taxonomical lineages from species that have more than one possibility (e.g., Escherichia coli/Shigella sonnei, myTAI only considers Shigella sonnei) Some bug of myTAI R package, unknown origin.
When using the myTAI R package (as stated on the R script) check if the row names (ancient lineage names contain some symbol such as "/" or "-") and compare to the rank level that is updating if it works well.Check the R script to view an example.

Reference databases and lineage nomenclature
The use of other reference databases will completely change the ASV assignment.Furthermore, due to the frequent updates of Bacteria and Archaea in NCBI RefSeq, it is essential to save the download date to reproduce the results.In our current protocol, we have utilized R packages, specifically myTAI and taxonomizr, to update the ancient lineage names from the SILVA v138.1 database to align with the latest NCBI taxonomy database.(https:// ncbiinsights.ncbi.nlm.nih.gov/2022/ 11/ 14/ prokaryotic-phylum-name-changes/ ).This update ensures that our taxonomy information incorporates the most recent revisions in nomenclature.

Troubleshooting
Table 2 summarizes the common errors documented during the processes detailed in the protocols.

Time Considerations
Running all of the protocols on the 16S rRNA amplicon data from 156 human fecal samples data showcased herein will take approximately 30 hr 36 min to complete: 24 hr 36 min for Basic Protocol 1, 1 hr 30 min-2 hr for Basic Protocol 2, 3 hr for Basic Protocol 3, and 40-60 min for Basic Protocol 4. The bulk of the time for Basic Protocol 1 (12 hr) is spent in the phylogenetic tree building followed by the denoising step (10 hr), while most of the time for Basic Protocol 2 is spent on the manual check of blast assignment (1 hr-1hr 30 min).
For Basic Protocol 3, the running time could be lower in function to the internet connection to The National Center for Biotechnology (NCBI) server that the R package myTAI uses.An analogous scenario occurs for Basic Protocol 4 and the steps required for taxonomizr R package installation account for the main runtime in the last protocol.
The exact time required for each protocol will vary based on the specifications of the computer used to execute the commands.For steps where multiple threads are supported, increasing the number of CPU threads will generally reduce the run-time.

Figure 1
Figure 1 Flowchart of the workflow.
fasta./fasta_filesBefore, remember to remove the ASV_code_sequence_linear_fasta in the fasta_files directory.

Figure 2
Figure 2 Screenshot of the blast_results_CP.txtopened in an Excel spreadsheet.

Figure 3
Figure 3 Snapshot of the homepage of the NCBI RefSeq Targeted Loci Project.

Figure 4
Figure4Cumulative barplot showing the percentage classified per taxonomical rank and percentage of unclassified for each of the three methods (DADA2, BLAST, and NCBI RefSeq) and the final output obtained.Due to the nature of the taxonomic lineage (the more specific rank level is embedded for the previous rank levels), the graph shows the resulting ratio to 1 calculated from the percentage of each category (rank level) for each method.

Table 1
Data Structure of the rds Object seqtab.nochim_pooling.rds

Bars-Cortina et al. 5 of 19
Through the abovementioned R script, an automatic lineage selection per ASV is proposed based on having a pident value >99.5% and taking into account if the lineage defined for BLAST has the maximal resolution (species level).

Bars-Cortina et al. 15 of 19
using the Kraken2 and Bracken2 as a classifier algorithm (as Basic Protocol 3).Nevertheless, this

Table 2
Solutions to Potential Errors