A set of nematode rRNA cistron databases and a primer assessment tool to enable more flexible and comprehensive metabarcoding

The ITS‐2‐rRNA has been particularly useful for nematode metabarcoding but does not resolve all phylogenetic relationships, and reference sequences are not available for many nematode species. This is a particular issue when metabarcoding complex communities such as wildlife parasites or terrestrial and aquatic free‐living nematode communities. We have used markerDB to produce four databases of distinct regions of the rRNA cistron: the 18S rRNA gene, the 28S rRNA gene, the ITS‐1 intergenic spacer and the region spanning ITS‐1_5.8S_ITS‐2. These databases comprise 2645, 254, 13,461 and 10,107 unique full‐length sequences representing 1391, 204, 1837 and 1322 nematode species, respectively. The comparative analysis illustrates the complementary value but also reveals a better representation of Clade III, IV and V than Clade I and Clade II nematodes in each case. Although the ITS‐1 database includes the largest number of unique full‐length sequences, the 18S rRNA database provides the widest taxonomic coverage. We also developed PrimerTC, a tool to assess primer sequence conservation across any reference sequence database, and have applied it to evaluate a large number of previously published rRNA cistron primers. We identified sets of primers that currently provide the broadest taxonomic coverage for each rRNA marker across the nematode phylum. These new resources will facilitate more comprehensive metabarcoding of nematode communities using either short‐read or long‐read sequencing platforms. Further, PrimerTC is available as a simple WebApp to guide or assess PCR primer design for any genetic marker and/or taxonomic group beyond the nematode phylum.

The availability of high-quality and comprehensive reference sequence databases represents one of the major challenges of DNA metabarcoding techniques (Alberdi et al., 2019; Alsos et al., 2018; Cristescu & Hebert, 2018; Liu-wei et al., 2023; van der Loos & Nijland, 2021). Incomplete or inaccurate reference sequences may result in incorrect taxonomic assignments, particularly when investigating rare taxa or under-represented species (De et al., 2018; Dell'Anno et al., 2015). Publicly accessible databases like the Barcode of Life Data Systems (BOLD) (Ratnasingham & Hebert, 2007), the SILVA database (https://www.arb-silva.de), or the NCBI GenBank (https://www.ncbi.nlm.nih.gov/genbank/) serve as valuable resources for DNA metabarcoding due to their extensive data spanning diverse taxa. Despite their value, these databases have their imperfections. The BOLD database, which contains Cytochrome C Oxidase subunit I (COI) sequences, includes a substantial amount of "private" data and restricts its free usage, while the NCBI GenBank is deemed less reliable and susceptible to erroneous sequences and annotations (Bidartondo, 2008). Regardless of these limitations, having well-curated reference databases remains essential, and could significantly increase the number of species detectable through metabarcoding.
Additionally, transitioning to longer DNA marker regions or combining multiple DNA markers may lead to improved metabarcoding resolution. This is already occurring in the microbial field, where the shift towards targeting the full-length 16S has improved metabarcoding resolution, elevating it from the genus level to the species level and even the strain level (Johnson et al., 2019). This shift has also started in plant-parasitic nematodes, with a long-read metabarcoding workflow targeting the SSU rRNA region (van Himbeeck et al., 2024). Moreover, multiple studies using a multimarker approach, such as COI and 18S (Topstad et al., 2021; Zhang et al., 2018), COI and 16S (Alberdi et al., 2018), or 18S and 28S (Alberdi et al., 2018; Liu & Zhang, 2021; Topstad et al., 2021; Zhang et al., 2018), have resulted in more reliable estimations of species richness. This reinforces the necessity for enhanced databases, enabling more informed choices of DNA markers to elevate the resolution of DNA metabarcoding.
The successful implementation of multi-marker approaches relies on meticulous primer design. Accurate databases facilitate proper sequencing primer design, consequently reducing the occurrence of off-target amplification (Freeland, 2017). Furthermore, knowing whether the primer sequence contains potential mismatches in the 3′ end region is crucial since these can significantly impact the primer's efficiency (Boyle et al., 2009). Additionally, issues with databases hinder the effective utilization of primer design software, ultimately limiting the value of metabarcoding results (Coissac et al., 2012).
Nematodes are important human, animal and plant pathogens, and are also ecologically important free-living organisms. However, their exploration at the community level is still limited compared to many other taxonomic groups (Charlier et al., 2014; Zajac & Garza, 2020). ITS-2-rRNA "nemabiome" metabarcoding using short-read Illumina sequencing is increasingly used to study gastrointestinal nematode communities in domestic livestock and has also been extended to other host species (Avramenko et al., 2015; Beaumelle et al., 2021, 2022; Poissant et al., 2021). However, while this is a valuable approach for livestock parasitology research, the current methodologies have several limitations, particularly when applied to poorly defined and complex communities such as wildlife parasites or free-living nematodes in the environment. Firstly, the ITS-2 rRNA marker sequence diversity may be insufficient to differentiate between some closely related species or too extensive to establish deeper phylogenetic relationships. Also, the primers most commonly used for ITS-2-rRNA nemabiome metabarcoding of gastrointestinal nematodes, NC1 and NC2, largely target clade V and have limited conservation across the rest of the nematode phylum.
In this study, we have sought to address some of these current limitations in two ways. Firstly, we developed a modular approach using different taxonomic rDNA markers, including the 18S and 28S rDNA coding regions, the ITS-1 rDNA region and the combined ITS-1/5.8S/ITS-2 regions, to create four new full-length nematode sequence databases. Secondly, we developed a bioinformatic tool to help design and evaluate custom primers for different metabarcoding applications. These new resources should enable more structured and flexible approaches to nematode metabarcoding of both parasitic and free-living nematode communities.

| NCBI GenBank searches
NCBI GenBank searches were conducted to retrieve nematode rRNA cistron marker (Figure 1) sequences using the advanced search parameters to restrict searches to the Nematoda and a set of text search parameters as well as minimum and maximum sequence lengths (https://www.ncbi.nlm.nih.gov/nuccore/advanced). Several text searches were conducted for a number of different rRNA cistron markers, 18S, 28S, ITS-1 and ITS-1_5.8S_ITS-2, to assess the combination of search term criteria that would yield the highest number of full-length nematode sequences (data not shown). The final search terms used and the total number of sequences recovered by manual searching with each of these search terms are shown in Table 1.
First, the markerDB pipeline retrieves potential nematode sequences from NCBI GenBank that have annotations matching our text search criteria and retains only sequences with complete taxonomic information in the metadata (Table 1). Then, an INFERNAL (inference of RNA alignment) cmscan search against a covariance model database is used to identify the appropriate rRNA coding region in the retrieved sequences and discard any lacking these regions. In the case of the 18S and 28S databases, eukaryotic SSU (small subunit, 18S) or LSU (large subunit, 28S) genes are searched for, respectively. In the case of the ITS-1 database, the eukaryotic SSU (small subunit, 18S) and 5.8S genes are searched for, and for the ITS-1_5.8S_ITS-2 database, the eukaryotic SSU (small subunit, 18S) and LSU (large subunit, 28S) genes are searched for. We modified the INFERNAL cmscan parameters used with the covariance models to identify the upstream and/or downstream sequences and thereby direct the trimming of sequences to retain only the rRNA cistron region of interest. Finally, we modified the minimum and maximum length parameters for each database based on expectations from the literature (Table 1). Sequences that were too long or too short, based on these parameters, were discarded from our final database, as were any redundant sequences. The number of sequences retained at each filtering step of the pipeline is shown in Figure 2a. The databases are available at the following link: www.nemabiome.ca.
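The final length-filtering and redundancy-removal step can be illustrated with a short Python sketch. This is not the markerDB code: the function name, the toy records and the length limits are hypothetical, and "redundant" is assumed here to mean an exact duplicate sequence, which may be narrower or broader than the definition used by the pipeline.

```python
def filter_sequences(records, min_len, max_len):
    """Keep sequences within [min_len, max_len] and drop exact duplicates.

    records: iterable of (sequence_id, sequence) tuples.
    The first occurrence of each distinct sequence is retained.
    """
    seen = set()
    kept = []
    for seq_id, seq in records:
        seq = seq.upper()
        # Discard sequences that are too short or too long.
        if not (min_len <= len(seq) <= max_len):
            continue
        # Discard redundant (here: identical) sequences.
        if seq in seen:
            continue
        seen.add(seq)
        kept.append((seq_id, seq))
    return kept


# Toy input; real marker length limits would come from Table 1.
records = [
    ("seq1", "ACGTACGTAC"),
    ("seq2", "acgtacgtac"),            # duplicate of seq1 (case-insensitive)
    ("seq3", "ACG"),                   # too short
    ("seq4", "ACGTACGTACGTACGTACGT"),  # too long
    ("seq5", "ACGTTCGTAC"),
]
kept = filter_sequences(records, min_len=8, max_len=12)
```

In this toy run only seq1 and seq5 are retained; in practice the limits would be set per marker, as described above.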

| Evaluating the phylogenetic conservation of primers targeting different regions of the nematode rRNA cistron
We created a bioinformatic pipeline, PrimerTC, using R programming (Supplementary Document D1) to determine the percentage identity of any specified primer against its target site in the marker sequences in any chosen reference database (Figure 3). Cutadapt (Martin, 2011) (https://cutadapt.readthedocs.io/en/stable/) and the pairwiseAlignment function (Malde, 2008) from the Biostrings package (version 2.40.2) (https://www.rdocumentation.org/packages/Biostrings/versions/2.40.2/topics/pairwiseAlignment), run with the global alignment parameter, were used to identify the target site with the highest similarity between the specified primer sequence and each reference sequence in the chosen database, and the percentage identity was reported. In addition, a custom R script was used to report whether the 3′ terminal nucleotide of the primer matched the target reference sequence. The mean percentage identity between the primer sequence and its reference sequence targets was then calculated for each nematode genus and displayed as a heatmap using custom R scripts. Eight different sequence identity levels were classified in the heatmap: below 70% identity and then from 70% to 100% identity in 5% increments. The PrimerTC tool is available as a web app at the following link: www.nemabiome.ca. The web app automates the PrimerTC R program by enabling users to input their primer sequence, reference database and preferred phylogenetic tree, thereby generating a comprehensive output automatically.
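As a rough illustration of the percentage identity and 3′ terminal nucleotide checks described above, the following Python sketch scans a reference sequence for the best ungapped match to a primer. It is a simplification and not the PrimerTC implementation, which is written in R and relies on Cutadapt and a global pairwise alignment. The function names, the ungapped sliding window and the handling of IUPAC degenerate bases are assumptions made for this example, and the primer is assumed to be supplied in the same orientation as the reference (a reverse primer would first need to be reverse-complemented).

```python
# IUPAC nucleotide codes: each primer base maps to the reference bases it can pair with.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def base_matches(primer_base, ref_base):
    """True if the (possibly degenerate) primer base is compatible with ref_base."""
    return ref_base in IUPAC.get(primer_base, "")


def best_target_site(primer, reference):
    """Return (percent_identity, start, three_prime_match) for the best ungapped site.

    Ties are resolved in favour of the left-most window.
    """
    primer, reference = primer.upper(), reference.upper()
    n = len(primer)
    if n == 0 or len(reference) < n:
        raise ValueError("reference must be at least as long as a non-empty primer")
    best = None
    for start in range(len(reference) - n + 1):
        window = reference[start:start + n]
        matches = sum(base_matches(p, r) for p, r in zip(primer, window))
        if best is None or matches > best[0]:
            # Record whether the 3' terminal primer base matches at this site.
            best = (matches, start, base_matches(primer[-1], window[-1]))
    matches, start, three_prime_match = best
    return 100.0 * matches / n, start, three_prime_match


# Toy sequences, not taken from any database.
perfect = best_target_site("ACGTAC", "TTTACGTACGGG")
mismatch_3p = best_target_site("ACGTAG", "TTTACGTACGGG")
```

Per-genus means could then be obtained by averaging the returned percentage identity over all reference sequences of a genus. An ungapped scan will underestimate identity where the true target site contains an insertion or deletion, which is one reason an alignment-based approach, as used by PrimerTC, is preferable in practice.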
FIGURE 1 Schematic representation of the nematode rRNA cistron and the relative sequence variation rates between taxa. The rRNA cistron is present as a multicopy tandem array in a nematode genome, with each copy comprising three highly conserved coding regions: 18S (Small Subunit [SSU]), 5.8S and 28S (Large Subunit [LSU]). These coding regions are separated by the less conserved internal transcribed spacer regions ITS1 and ITS2. These different regions of the rRNA cistron can be used to resolve different levels of taxonomy based on their level of conservation across different taxa. Adapted from Dorris et al., 1999.

| Creating the 18S maximum likelihood phylogenetic tree
We generated a multiple sequence alignment of the 18S reference database sequences with MAFFT (Multiple Alignment using Fast Fourier Transform) (Katoh et al., 2002), using the bioconda MAFFT package (https://mafft.cbrc.jp/alignment/software/) with 1000 cycles of iterative refinement. From this multiple sequence alignment, we used the msa R package (Bodenhofer et al., 2015) (version 1.24.0) (https://github.com/UBod/msa) and the consensusString method to generate consensus sequences for each of the 514 genera represented. These consensus sequences were then combined into a single FASTA file, which we used to create a MAFFT sequence alignment, using the same criteria as the previous alignment. Finally, the RAxMLHPC tool (Stamatakis, 2014) (https://cme.h-its.org/exelixis/resource/download/NewManual.pdf) was used to generate a Randomized Axelerated Maximum Likelihood (RAxML) 18S phylogenetic tree, with the GTRCAT approximation, a random number seed of one for the parsimony inferences and 100 bootstrap replicates.
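The genus-level consensus step can be sketched in Python as a simple column-wise majority rule over the aligned sequences of one genus. This is only an illustration of the idea: it does not reproduce the behaviour of the consensusString method (which can, for example, emit ambiguity codes), and the function name and toy alignment are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_seqs):
    """Column-wise majority-rule consensus of equal-length aligned sequences.

    Gap characters ('-') are treated like any other symbol. When two symbols
    are tied in a column, the one seen first in that column is used.
    """
    if not aligned_seqs:
        raise ValueError("at least one aligned sequence is required")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("all aligned sequences must have the same length")
    consensus = []
    for column in zip(*(s.upper() for s in aligned_seqs)):
        symbol, _count = Counter(column).most_common(1)[0]
        consensus.append(symbol)
    return "".join(consensus)


# Toy alignment for a single hypothetical genus.
genus_alignment = ["ACG-T", "ACGAT", "ATG-T"]
genus_consensus = majority_consensus(genus_alignment)
```

Applying such a function to each genus in turn and writing the results to one FASTA file would give the input for the second alignment described above.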

| Integration of primer identity heatmaps with phylogenetic trees
We used the ggtree package (Yu et al., 2017) (version 1.4.11) to import our phylogenetic tree and the gheatmap function (https://www.rdocumentation.org/packages/ggtree/versions/1.4.11/topics/gheatmap) to combine each phylogenetic coverage heatmap with the corresponding 18S phylogenetic tree to produce the final outputs (e.g., Figure 4). Distinct 18S trees were generated for each diagram by selectively including only the genera found in the respective databases used for the PrimerTC analysis (18S, ITS-1_5.8S_ITS-2 and 28S).
Venn diagrams were created using the VennDiagram R package (https://cran.r-project.org/web/packages/VennDiagram/VennDiagram.pdf) (version 1.7.3) to compare the number of species shared between and unique to each of the five rRNA cistron databases (Figure 5).
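The shared and unique species counts that such diagrams summarize reduce to set operations on the species lists of each database. The following Python sketch shows the calculation on placeholder data; the function name is hypothetical and the species labels and memberships are invented for illustration and do not reflect the content of the actual databases.

```python
def shared_and_unique(databases):
    """Return (species present in every database, species unique to each database).

    databases: dict mapping a database name to a set of species names.
    """
    shared = set.intersection(*databases.values())
    unique = {}
    for name, species in databases.items():
        # Union of all species found in the other databases.
        others = set().union(*(s for n, s in databases.items() if n != name))
        unique[name] = species - others
    return shared, unique


# Placeholder memberships only.
toy_databases = {
    "18S": {"species_A", "species_B", "species_C"},
    "28S": {"species_A"},
    "ITS-1": {"species_A", "species_B", "species_D"},
}
shared, unique = shared_and_unique(toy_databases)
```

Here species_A is shared by all three toy databases, while species_C and species_D are unique to the 18S and ITS-1 sets, respectively.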

| The construction and characterization of four new nematode rRNA cistron marker databases
Using the text search and sequence length criteria determined through our NCBI GenBank manual search, we applied the markerDB pipeline to create four rRNA cistron marker nematode databases. A summary of the sequence content of each of the four new nematode rRNA cistron databases (18S, 28S, ITS-1 and ITS-1_5.8S_ITS-2), as well as the previously described ITS-2 database, is shown in Table 1 and Figure 2b,c. ITS-1 and ITS-1_5.8S_ITS-2 were the largest of the four new databases, containing 13,461 and 10,107 unique full-length sequences, respectively.
TABLE 1 MarkerDB search criteria and results summary for the new nematode rRNA cistron databases.
The number of species that were either shared or unique to each of the four new databases, as well as the previously described ITS-2 database, is shown in Figure 5a. Only a small number of nematodes were present in all 5 marker databases (Blaxter et al., 1998) and, even when excluding the smallest database (28S), only 406 were shared between the remaining four. Although many species were represented in more than one database, a large number were only represented once. The 18S rRNA database contained the greatest number of species not shared with any other database, with 683 unique species (Figure 5a). The number of species that were shared or unique to each database for each of the major nematode phylogenetic clades (I-V and Chromadorida) is shown in Figure 5b-g.

[…] Bursaphelenchus) and V (e.g., Haemonchus) being more broadly represented across the different databases.
We also examined how the different genera were phylogenetically distributed across each phylogenetic clade for each database (Figure 6 and Additional File 1). There was a wide phylogenetic distribution of genera across each clade for the 18S database, as well as a wide distribution that was similar for the ITS-1, ITS-1/5.8S/ITS-2 and ITS-2 databases (Figure 6b-f). The 28S database was poorly represented across clades I-IV but had a broad representation of genera across clade V.

| Development of the PrimerTC pipeline to assess the phylogenetic conservation of primers targeting different regions of the nematode rRNA cistron
The PrimerTC pipeline was constructed to assess the sequence conservation between any chosen primer and its target site in the rRNA cistron, and the phylogenetic distribution of that conservation across the nematode phylum, using the new rRNA databases (Figure 3). This pipeline was then used to evaluate previously published primer sequences that have been used for nematode rRNA barcoding/metabarcoding or other PCR applications. NC1, NC2, NC5 and NC13 are among the most commonly used primers for nematode barcoding/metabarcoding, particularly for animal parasitic nematode communities, and their analysis is presented to illustrate pipeline application (Figure 4). NC1 and NC2 (Gasser et al., 1993) target the 3′ end of the 5.8S and the 5′ end of the 28S rRNA cistron region, respectively, and are often used as a primer pair for nematode PCR and barcoding/metabarcoding experiments (Andersen et al., 2013; Mangkit et al., 2014; Queiroz et al., 2020). NC1 has a high degree of identity with its target site across most, although not all, of clade V, and even better phylogenetic coverage of clade IV, but has relatively low levels of sequence identity for the other nematode clades (Figure 4).
NC2 is highly conserved across most of Clade V and a small part of Clade IV but again has low identity to nematode target sequences in the other clades. Consequently, PCR amplification using NC1 and NC2 will likely be largely restricted to Clade V and some Clade IV nematodes. The NC13 primer (Newton et al., 1998), like NC1, also targets the 3′ end of the 5.8S rRNA coding region but has a high degree of identity with its target site across all clades except for clade I. The NC5 primer (Newton et al., 1998), targeting the 3′ end of 18S rRNA, has poor target sequence identity across all 5 clades, with high identity being restricted to just the Trichostrongylide nematode group within clade V (Figure 4). Additionally, the 3′-nucleotide of the NC5 primer mismatched with its target sequence in many Trichostrongylide nematode species (Figure 4).
Finally, we undertook a literature search to identify primers previously used for nematode rRNA barcoding (Supplementary Figures 1-3). These primers were then analysed with PrimerTC to identify those that provided the broadest phylogenetic conservation across the nematode phylum (see Additional Files 1 and 2). We then selected the forward and reverse primers that had the highest level of identity across the broadest phylogenetic range within the Nematoda phylum (Figure 7, Supplementary Tables S1 and S2). We identified one or more forward and reverse primers with high levels of sequence identity across almost the whole nematode phylum that could be used to PCR amplify the full-length ITS-1, ITS-2, ITS-1_5.8S_ITS-2 or 28S markers (Figure 7b-d). However, even the most conserved forward primer for the 18S marker has variable identity across multiple genera, particularly in Clade V.

| DISCUSSION
DNA metabarcoding is becoming an increasingly important technique in the study of both free-living and parasitic metazoan organisms and has huge potential, not only for research but also for many practical applications from environmental monitoring to disease surveillance (Mechai et al., 2021;Takasaki et al., 2021;Zou et al., 2020).
While there has been significant progress in recent years, there are still many technical challenges to overcome (Compson et al., 2020).
FIGURE 6 Representation and phylogenetic distribution of different nematode genera in each nematode rRNA cistron database. The complete set of full-length 18S sequences within the 18S database was used to create a multiple alignment. Subsequently, consensus sequences were produced for every genus, and these were then employed to construct a multiple alignment. This genus-level alignment was used as the basis for constructing an 18S rRNA maximum likelihood phylogenetic tree. From this parent maximum likelihood phylogenetic tree, separate trees were then constructed by only retaining the genus present in each of the following clades:

DNA marker choice can have critical impacts on metabarcoding studies. Two of the most important considerations are the quality and completeness of reference sequence databases and the choice of PCR amplification primers. Organisms, or even whole groups of organisms, may be excluded or misidentified as a result of incomplete databases, mis-annotations, or poor amplification due to low conservation of primer sites. The power of DNA metabarcoding largely depends on the accuracy of the reference sequences present in the databases used for species identification (Keck et al., 2023).
One good illustration is a study using COI metabarcoding of zooplankton communities, where approximately 15% of the data remained unidentifiable, even when attempting to classify at the phylum level, primarily due to the absence of reference sequences in their database (Ershova et al., 2023). Improving the completeness of DNA reference databases is essential not only for the reliable identification of anticipated species but also to enable the assignment of unanticipated species found in complex communities (Hestetun et al., 2020).
We have sought to address some of these issues for the Nematoda, which are an important, and relatively neglected, group of organisms. Parasitic nematodes are extremely important pathogens of plants, domestic and wild animals, and humans, whereas free-living nematodes are among the most abundant metazoans on the planet, playing critical roles in ecosystem health and nutrient recycling. In this paper, we have presented some new resources that we have developed to support more comprehensive and reliable metabarcoding. Specifically, we have produced four new nematode databases of unique full-length rRNA marker sequences (18S, 28S, ITS-1 and ITS-1_5.8S_ITS-2) and a tool to assess the phylogenetic conservation of PCR amplification primers (PrimerTC).

| New rRNA cistron databases for nematodes
There is no single DNA marker that is ideal for all applications. Marker choice depends on the nature of the research question, the level of taxonomic resolution required and the availability of the relevant high-quality reference sequence databases. Another important factor is the sequencing platform being used. Short-read sequencing, namely Illumina, is by far the most commonly used for metabarcoding studies, but this limits marker choice to short fragments (200-500 bp), which can in turn limit phylogenetic resolution. Also, the high sequencing capacity of the different Illumina short-read platforms means there is a lack of flexibility when smaller sample sets need to be assessed in a cost-effective manner. New long-read sequencing platforms, such as Oxford Nanopore, provide some solutions to these problems, both supporting longer DNA markers and providing more flexible sequencing capacity. Consequently, there is a need to use different taxonomic markers, both to provide the appropriate level of taxonomic resolution and to support both short- and long-read sequencing. The rRNA cistron, comprising coding and intergenic spacer sequences, provides a set of taxonomic markers that are very commonly used, including in nematodes, for a variety of reasons (Dorris et al., 1999). Different parts of the rRNA cistron evolve at different rates, which can be used to identify and determine phylogenetic relationships between both closely related and distantly related species (Figure 1). The idea is for this set of databases to be used in a modular fashion in metabarcoding studies to enable more informed decision-making regarding marker selection and maximize the number of nematodes that can be identified. The complementary use of different markers, such as 18S and 28S markers, to maximize the nematode species recovery rate when using high-throughput sequencing data has been suggested before (Porazinska et al., 2009). Based on the analysis of the set of rRNA cistron databases we have now developed, we believe that the combined use of the 18S and ITS-1 markers would currently provide the maximum coverage, with 642 and 154 unique nematode species, respectively.
Although the public databases contain an increasing number of nematode rRNA sequences, there are a number of challenges when using them to support metabarcoding studies (Keck et al., 2023).
A major issue faced by the International Nucleotide Sequence Database Collaboration (INSDC) databases is the presence of duplicate and redundant sequences, which can introduce inconsistency and accuracy issues with sequence assignments (Chen et al., 2017; Rosikiewicz et al., 2013). The presence of partial sequences is another well-known issue that can also lead to similar problems (Holovachov et al., 2015). Indeed, using full-length reference sequences in addition to full-length metabarcoding data has been shown to improve species assignment for amplicon sequencing data analysis. For example, the use of full-sequence 16S metabarcoding has been shown to enhance bacterial classification accuracy, particularly when looking at the gut microbiota, when compared to V3-V4 short-read sequencing (Hsieh et al., 2022; Jeong et al., 2021). Consequently, there is a clear need for curated, unique, full-length nematode non-redundant databases to support metabarcoding data analysis. We have used markerDB to produce such databases for the 18S, 28S, ITS-1 and ITS-1_5.8S_ITS-2 regions of the nematode rRNA cistron to add to our previously published ITS-2 database (Workentine et al., 2019). Each of these databases has differing content in terms of unique sequence numbers and species representation. Additionally, the reliability of these new databases depends on the quality, accuracy and careful assessment of sequences submitted to public databases. The addition of inaccurate sequences to these public databases could bias species-level discrimination.
FIGURE 7 Visualization of primers with high phylogenetic conservation with the different rRNA regions across the nematode phylum. A review of the existing literature was used to identify primers used in nematode rRNA barcoding. Using PrimerTC and specific databases (18S, ITS-1_5.8S_ITS-2 or 28S), we visualized the most effective primer combination for each rRNA marker, maximizing taxonomic coverage. In each diagram, the central phylogenetic tree represents a maximum likelihood tree of 18S rRNA (a), utilizing full-length 18S sequences from the 18S database. Unique 18S trees were created for each diagram by exclusively incorporating genera present in the databases used for PrimerTC analysis: (b, c) ITS-1_5.8S_ITS-2 and (d) 28S. A black dot on the tree indicates a discrepancy between the last base pair of the primer sequence and the reference sequences of the depicted genera in the database. The inner circle displays a heatmap, revealing the percentage identity between the tested primer and each genus in the chosen database, assessed with the PrimerTC tool. The outer circle signifies the clade affiliation of nematode genera.
The 18S rRNA coding region has been a commonly used marker for both molecular and phylogenetic work in nematodes for several decades due to its ability to resolve phylogenetic relationships at the family and order level (Blaxter et al., 1998; De Ley, 2006; De Ley & Blaxter, 2004; Donn et al., 2011; Sapkota & Nicolaisen, 2015). On the other hand, the 28S rRNA coding region exhibits slightly greater variability in evolutionary rates and features more distinct regions of divergence or expansion segments; hence, the focus has been more on the 18S rRNA marker (Hillis David et al., 1991). The preferential use of the 18S over the 28S marker in nematode studies was reflected in our databases, with the 28S database comprising only 254 sequences compared to the 2645 sequences identified in the 18S rRNA database. This was also highlighted in the unique species coverage of each database, with only 32 nematode species found in the 28S database that were not in the 18S database compared to 690 unique species found in the 18S database and not the 28S database. There was a large reduction in the number of sequences remaining in the 18S and 28S databases (Figure 2b) after the length-based filtering process, which can be primarily attributed to the high number of ITS-1 or ITS-2 sequences containing 18S and 28S partial sequences retained from incomplete trimming when deposited in the NCBI GenBank database.
Another nematode 18S rRNA database, 18S-NemaBase, was recently published (Gattoni et al., 2023). Although 18S-NemaBase is an excellent resource, its method of quality assurance is labour-intensive and difficult to keep updated. We wanted to produce a database that can be updated regularly while ensuring the sequences are correctly annotated as 18S. Consequently, we utilized the markerDB pipeline to create our 18S database. The process is completely automated, making it much easier to keep up to date. The smaller number of sequences compared to 18S-NemaBase (2645 vs. 5232) is likely due to the different sources of sequence databases used. 18S-NemaBase used SILVA's 18S databases, which were sourced from EMBL-EBI/ENA, whereas the markerDB pipeline searches sequences from the NCBI Nucleotide database.
ITS-2 rRNA is commonly used as a barcoding marker, particularly for Clade V parasitic nematodes, and so we had previously developed an ITS-2 database to support nematode metabarcoding studies (Workentine et al., 2019). We have now produced an ITS-1 rRNA database to enable the use of this marker, which provides similar, and perhaps greater, taxonomic resolution than ITS-2, in nematode metabarcoding. There are 1524 nematode species in common between the two databases, but the ITS-1 database also contains 120 species that are not present in the ITS-2 database, and conversely, 197 nematode species are present in the ITS-2 but not in the ITS-1 database (Figure 2a). Using these two markers and databases in concert could broaden the phylogenetic coverage of metabarcoding as required by the research question. We have also produced an ITS-1_5.8S_ITS-2 rRNA database. While this contains a smaller number of total sequences than either the ITS-1 or ITS-2 database, we envisage this database being valuable to support longer read metabarcoding, for example, using Oxford Nanopore sequencing, and providing the greatest level of resolution for closely related species of all the rRNA cistron databases.
One important observation from our database analysis is the underrepresentation of Clade I and Clade II nematode species within all the rRNA databases overall, with the 18S database providing the greatest representation. Furthermore, Clade I and II, primarily dominated by the Enoplida and Dorylaimia orders, respectively, are predominantly free-living nematodes (Bik et al., 2010) that have been demonstrated to be inadequately represented in phylogenetic investigations (Blaxter et al., 2014; Smythe et al., 2019). This emphasizes the potential biases that may occur against Clade I and II nematodes in current metabarcoding studies and the need to improve the database representation of these groups.
Sequence database content is biased towards certain groups of nematodes based on the interests of researchers. For instance, the genus Trichostrongylus has a large representation (268 ITS-2 sequences) due to its importance as a livestock parasite, whereas the genus Trichodorus is less well represented (35 ITS-2 sequences) due to its lesser perceived importance. Therefore, there is a need to increase the number and diversity of reference sequences in databases to encompass all nematode groups. To this effect, approaches such as genome skimming could be considered to provide full coverage of the rDNA cistron unit.

| PrimerTC: A tool to aid the rational design and assessment of PCR primer sequences for metabarcoding studies
One of the most important factors in a DNA metabarcoding study is PCR primer choice. It is critical that the level of conservation of primer sites across the desired taxonomic group be assessed to allow meaningful interpretation of results and the limitations of a study to be understood. Although a large number of primers targeting different rRNA markers have been used in various published barcoding/metabarcoding and other PCR applications, there has been little systematic assessment of the taxonomic coverage they provide. Often, researchers simply use primers used by others without a systematic assessment of their suitability for the intended purpose. This is partly due to a lack of available tools to allow a quick and easy overview of primer site conservation across taxonomic groups of interest. Consequently, we created PrimerTC, a tool that employs a global pairwise alignment method to assess the percentage similarity between a primer sequence and each sequence in a reference database, as well as to determine whether the 3′ nucleotide of the primer matches the reference sequence. A key feature of the tool is that it generates a visualization that integrates primer and target sequence identity with an 18S rRNA-based phylogenetic tree to provide an overview of 'taxonomic coverage'.
We used PrimerTC to evaluate a number of primers commonly used for ITS-2 nematode barcoding or metabarcoding studies: NC1, NC2, NC5 and NC13 (Gasser et al., 1993; Newton et al., 1998). The NC1 and NC2 primers are commonly used to study and monitor Clade V nematode populations through ITS-2 metabarcoding (Avramenko et al., 2015; Beaumelle et al., 2022; Davey et al., 2021; Poissant et al., 2021; Sargison et al., 2022). When these primers are employed together, they offer good taxonomic coverage for a significant portion of Clade V and some restricted segments of Clades III and IV (Figure 4). Although the NC13 forward primer alone provides nearly comprehensive taxonomic coverage for the nematode phylum, when it is used with the NC2 reverse primer for PCR, amplicons are likely to be restricted largely to Clade V and a relatively limited number of Clade III and IV nematodes. Our analysis also revealed that the 3′ nucleotide of NC5 (Newton et al., 1998), designed as a forward primer to amplify ITS-1 from trichostrongylid nematodes, mismatched with its target sequence in many trichostrongylid nematode species, raising questions about its efficiency for this purpose.
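The point that a broadly conserved forward primer is still limited by its reverse partner can be expressed as a set intersection: a genus is predicted to amplify only if both primers are acceptable for it. The following is a hedged sketch rather than part of PrimerTC. The per-genus tuples `(percent_identity, three_prime_match)`, the function names, the use of a 3′ match as a pass criterion and the placeholder genera are all assumptions. The 70% cut-off is borrowed from the histograms in Figure 4.

```python
def passes(hit, min_identity=70.0):
    """A primer is treated as acceptable for a genus if its identity exceeds
    the threshold and its 3' nucleotide matches (an assumed criterion)."""
    identity, three_prime_ok = hit
    return identity > min_identity and three_prime_ok


def pair_coverage(forward_hits, reverse_hits, min_identity=70.0):
    """Return the genera for which both primers of a pair are acceptable."""
    shared = forward_hits.keys() & reverse_hits.keys()
    return sorted(
        genus for genus in shared
        if passes(forward_hits[genus], min_identity)
        and passes(reverse_hits[genus], min_identity)
    )


# Made-up values for placeholder genera, for illustration only.
forward = {"genus_A": (100.0, True), "genus_B": (95.0, True), "genus_C": (90.0, True)}
reverse = {"genus_A": (100.0, True), "genus_B": (60.0, True), "genus_C": (85.0, False)}
print(pair_coverage(forward, reverse))
```

In this toy case the forward primer is acceptable for all three genera, but the pair is predicted to amplify only one of them.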
We conducted a literature review to identify primers previously used for nematode rRNA barcoding and assessed them with PrimerTC in conjunction with the relevant 18S, 28S or ITS-1_5.8S_ITS-2 databases. The goal was to identify the best available primer combination to provide the widest taxonomic coverage for each rRNA marker. For ITS-2 metabarcoding, NC13 (forward primer) with D2AR (reverse primer) (Carta & Li, 2018) is predicted to provide almost universal coverage across the phylum Nematoda, in contrast to the more limited coverage of the more commonly used NC1 and NC2 combination (Figures 3 and 6c). For nematode ITS-1 metabarcoding, almost universal taxonomic coverage is predicted for a combination of either the Euk1391F (Bastida et al., 2020) or rRNA2 (Powers et al., 1997) forward primer with the 58Ar2 reverse primer (Martin & Rygiewicz, 2005) (Figure 7b). For ITS-1_5.8S_ITS-2 long-read metabarcoding, the rRNA2 or Euk1391F forward primers used with the D2AR reverse primer (reference) are predicted to provide broad phylogenetic coverage across the phylum. In contrast, for metabarcoding using the 18S rRNA coding region marker, our analysis revealed that there are still major limitations to the available primers. While the reverse primers 18S 1573R (Mullin et al., 2005) and 18S-CL-R5 (Carta & Li, 2018) give broad phylum-wide coverage, the best available forward primer, Nem_18S_F (Floyd et al., 2005), is much less comprehensive and has a low identity for many Clade V nematodes in particular.
Finally, while we have used PrimerTC to assess the taxonomic conservation of primers targeting different regions of the nematode rRNA cistron, it is worth noting that it can be used to assess the phylogenetic conservation of any primer sequence against any genetic marker and/or taxonomic group for which there are appropriate sequence databases. Conversely, it could also be used to assess the likelihood of off-target PCR amplification for chosen groups of organisms, such as bacteria or fungi.

CONFLICT OF INTEREST STATEMENT
The authors declare no competing interests.

DATA AVAILABILITY STATEMENT
The databases are available as an interactive web app and can be downloaded at www.nemabiome.ca. The open-source software used to create the databases, markerDB, is freely available at https://github.com/ucvm/markerDB. The PrimerTC tool is available as an interactive web app at www.nemabiome.ca. There is no restriction on the usage of the databases or PrimerTC by non-academics.

FIGURE 2 Schematic representation of the markerDB pipeline and a summary of each rRNA cistron database output. (a) Flow chart of the key steps in the markerDB pipeline (Workentine et al., 2019, https://github.com/ucvm/markerDB). (b) Summary of the number of nematode sequences retrieved/retained at each step of the markerDB pipeline for each of the four new databases. (c) The final number of unique nematode species represented by the sequences within each of the four new rRNA cistron databases and the previously described ITS-2 database.

unique sequences representing 1837 species (396 genera) and 1322 species (308 genera), respectively. Even though the 18S database only contained 2645 unique sequences, more nematode species and genera were represented, with 1391 species (514 genera). The 28S database was the smallest, with only 254 unique nematode sequences, representing 204 nematode species (117 genera). It is noteworthy that although the previously described ITS-2 database had a larger total number of species represented (1721) than any of the four new databases, it had a smaller number of genera (382) than the new 18S database (514), with a large representation of Clades III, IV and V but a lack of Clade I and II representation.

FIGURE 4 Visualization of the phylogenetic conservation of rRNA primers (NC1, NC2, NC5 and NC13) across the nematode phylum. The histograms illustrate the count of genera in each database (18S, ITS-1_5.8S_ITS-2 and 28S) for which the primer's percentage identity exceeds 70%. The total number of genera in each database is depicted by the red bar. The central phylogenetic tree in each diagram is a maximum likelihood tree of 18S rRNA, constructed using the complete 18S sequences from the 18S database. Distinct 18S trees were generated for each diagram by selectively including only the genera found in the respective databases used for the PrimerTC analysis (18S, ITS-1_5.8S_ITS-2 and 28S). A black dot on the phylogenetic tree signifies a mismatch between the last base of the primer sequence and the reference sequences associated with the depicted genera in the database. The inner circle is a heatmap of the percentage identity, as calculated by PrimerTC, between the tested primer and each genus found in the chosen database. The outer circle denotes the clade to which each nematode genus belongs.

The number of nematode species unique to or shared amongst the five nematode rRNA cistron databases. The numbers in the Venn diagrams indicate the number of nematode species unique to or shared between each database for (a) all clades, (b) Clade I, (c) Clade II, (d) Clade III, (e) Clade IV, (f) Clade V and (g) Chromadorida.

(a) Clade I, (b) Clade II, (c) Clade III, (d) Clade IV, (e) Clade V and (f) Chromadorida. The heatmaps to the right of each tree indicate the presence or absence of each genus in each database. The phylogenetic trees with full annotation are available in Newick format in Additional file 3.

5 | CONCLUSION
In this study, we produced reference sequence databases of the different components of the nematode rRNA cistron that are commonly used as phylogenetic and molecular barcoding markers: 18S, ITS-1, ITS-1_5.8S_ITS-2 and 28S. Each database comprises a set of unique full-length reference sequences to maximize the accuracy of species assignments in both short- and long-read nemabiome metabarcoding experiments. The use of these databases, individually or in combination, will enable more comprehensive coverage of species across the nematode phylum. We have also developed a new bioinformatic tool called PrimerTC, which uses these databases to assess and visualize the phylogenetic conservation of primers across the nematode phylum. This tool can be used to assess existing primers or to facilitate the design of new primers, and can potentially be used not just for nematode metabarcoding study design but for any genetic marker or taxonomic group.

AUTHOR CONTRIBUTIONS
EC co-conceptualized the study, undertook bioinformatic pipeline development, data analysis and visualization, and drafted the manuscript. EC and RC co-wrote the code for the PrimerTC tool, and RC helped modify the markerDB code. EC, RC and NT all contributed to creating the PrimerTC web app. JSG co-conceptualized and planned the study, provided ongoing supervision and helped write the manuscript. All authors read and approved the final manuscript.

FUNDING INFORMATION
This work was funded by Results Driven Agriculture Research (RDAR) Grants 2019F022R and 2022F059R (JG), Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant 2015-03976 (JG) and National Institutes of Health (NIH) grant R01AI153088 (JG).