A novel assembly pipeline and functional annotations for targeted sequencing: A case study on the globally threatened Margaritiferidae (Bivalvia: Unionida)

The proliferation of genomic sequencing approaches has significantly impacted the field of phylogenetics. Target capture approaches provide a cost‐effective, fast and easily applied strategy for phylogenetic inference of non‐model organisms. However, several existing target capture processing pipelines are incapable of incorporating whole genome sequencing (WGS). Here, we develop a new pipeline for capture and de novo assembly of the targeted regions using whole genome re‐sequencing reads. This new pipeline captured targeted loci accurately, and given its unbiased nature, can be used with any target capture probe set. Moreover, due to its low computational demand, this new pipeline may be ideal for users with limited resources and when high‐coverage sequencing outputs are required. We demonstrate the utility of our approach by incorporating WGS data into the first comprehensive phylogenomic reconstruction of the freshwater mussel family Margaritiferidae. We also provide a catalogue of well‐curated functional annotations of these previously uncharacterized freshwater mussel‐specific target regions, representing a complementary tool for scrutinizing phylogenetic inferences while expanding future applications of the probe set.


| INTRODUC TI ON
Two decades separate the announcements of the first human genome sequence (Lander et al., 2001;Venter et al., 2001) and the recently assembled gapless human genome (Nurk et al., 2022).
During this time, genome biology has been revolutionized by a fundamental shift in the process and scale on which DNA and RNA are studied (Goodwin et al., 2016;Sedlazeck et al., 2018;Stephan et al., 2022). The initially prohibitive cost of whole genome-scale sequencing dropped numerous orders of magnitude and at an astonishing pace, which has revolutionized biodiversity research (Goodwin et al., 2016;Sedlazeck et al., 2018;Stephan et al., 2022).
New genomic approaches have transformed the field of phylogenetics and helped to usher in the era of 'phylogenomics'. Emerging phylogenomic approaches offer several cost-effective strategies to simultaneously sequence hundreds or thousands of loci, which was one of the major constraints of Sanger sequencing (Kapli et al., 2020;Lemmon & Lemmon, 2013;Smith & Hahn, 2021).
Although whole-genome sequencing (WGS) presents an ideal option for phylogenomic studies of non-model organisms, it is still a rarely used resource (Stephan et al., 2022). However, the costs of WGS continue to decrease, as evidenced by the last two decades, which will expand its application when coupled with increasing reference genome assemblies of non-model organisms (Gomesdos-Santos et al., 2020;Hotaling et al., 2021). Nevertheless, when compared with genome subsampling strategies (e.g. genotype-bysequencing, target capture), WGS data processing is more computationally demanding, especially for large and complex genomes (Goodwin et al., 2016;Sedlazeck et al., 2018;Stephan et al., 2022).
Freshwater mussels (Bivalvia: Unionida) represent an ecologically and taxonomically diverse group of bivalves composed of six families and nearly 1000 species . These organisms play fundamental roles in freshwater ecosystems, such as water filtration, nutrient cycling and sediment bioturbation and oxygenation Vaughn et al., 2015). Unfortunately, the group is also among the most threatened worldwide, showing the second-highest percentage of threatened species (43%) and the highest percentage of wild extinctions (5.9%; Díaz et al., 2019;Lopes-Lima et al., 2021).
The Unioverse probes were designed from a set of exonic regions from eight unionid transcriptomes and the reference genome of one marine bivalve, Bathymodiolus platifrons Hashimoto & Okutani 1994 (Bivalvia, Mytillidae;Pfeiffer et al., 2019). Since these AHE probe sets are regarded as tools for phylogenetic inferences, their functional relevancy is often ignored (Lemmon et al., 2012;Lemmon & Lemmon, 2013;Pfeiffer et al., 2019). Accurate and unbiased phylogenetic reconstruction is a multi-dependent task which requires a set of decisions regarding evolutionary modelling, orthology assessment and matrix reconstruction to properly balance information from loci with dissimilar underlying evolutionary histories (Bernt, Braband, et al., 2013;Buddenhagen et al., 2016;Chen et al., 2007;Edwards et al., 2016;Hosner et al., 2016;Lemmon & Lemmon, 2013;Zhang et al., 2018). Those decisions, as well as the assessment of the results, will therefore be facilitated by proper structural and functional characterization of loci, such annotations will provide a framework to guide data processing, test robustness and identify biases. Moreover, these annotations will widen the applications for the data, such as targeted functional analyses, using the probes as a reference.
Here, we develop an assembly pipeline for target capture phylogenomics that can incorporate WGS data. The pipeline is designed to be applied to all types of target capture sequencing data and can easily be used with any probe set. To demonstrate the utility of this approach, we reconstructed the evolutionary history of the family Margaritiferidae. We incorporated the recently published WGS dataset of the freshwater pearl mussel Margaritifera margaritifera (Linnaeus, 1758;Gomes-dos-Santos et al., 2021) and increased the available AHE taxon sampling for Margaritiferidae, from 2 to 11 species, including representatives of all four margartiferid genera.
Within the tested dataset, the new pipeline provided the best balance between loci capture, number of duplicated sequences per locus and sequence length. To compare and complement this primarily nuclear dataset, nine additional whole mitogenomes were produced. We also curated a catalogue of functional annotations for the Unioverse targets based on a highly comprehensive multi-evidence database search (i.e. INTPRO, pfam, GO, KEGG and BUSCO), representing a complementary tool for scrutinizing phylogenetic inferences while expanding future application of the probe set.

| Sampling and DNA extraction
Tissue samples from 22 margaritiferid specimens and eight outgroup taxa (from families Unionidae, Hyriidae, Iridinidae and Mycetopodidae) were collected (Table 1). A small piece of foot tissue was subsampled from the animals following Naimo et al. (1998) and placed in 96% ethanol and the specimen returned to its habitat.
Genomic DNA was extracted for all samples using a conventional high-salt protocol (Sambrook et al., 1989) or the Qiagen DNeasy Blood and Tissue Kit (QIAGEN GmbH).

| Sample selection and sequencing
A total of 14 DNA extractions, from all known species of the genera Margaritifera, Pseudunio, Cumberlandia, and two species of the genus Gibbosula were selected for the AHE sequencing, as well as six outgroup taxa (Table 1). Genomic DNA was used for capturing phylogenetically informative nuclear protein-coding loci using the Unioverse probe set developed by Pfeiffer et al. (2019). Capture, library preparation and Illumina sequencing were performed at RAPiD Genomics (Table 1). For all species aside from M. hembeli and M. marrianae, genomic DNA was sheared to ~400 bp fragments, endrepaired and adenine residues were ligated to the 3′-end to generate A overhangs. Subsequently, barcoded adapters were ligated to the library and amplified by PCR. SureSelect Target Enrichment System for Illumina paired-end Multiplexed Sequencing Library protocol was used for solution-based target enrichment of pooled libraries. Probes were synthesized as Custom SureSelect probes from AgilentTechnologies. Finally, 150-bp long paired-end reads were generated in an Illumina HiSeq 3000.
A total of 11 DNA extractions, encompassing all genera, including five species from the genus Margaritifera, one species from Pseudunio, one Cumberlandia and two from Gibbosula were selected for mitogenome sequencing (Table 1). Briefly, genomic DNA was sheared to ∼500 bp using an M220 Covaris Ultrasonicator (Covaris).   because of an issue with sequencing chemistry, the second read resulted in only 57 bp reads, but this did not affect genome assembly (see below).

| Targeted sequence assembly
The bioinformatic workflow, including assembly pipelines, postassembly processing and phylogenetic reconstruction, is schematized in Figure 1.

| Anchored hybrid enrichment and RNA-seq sequencing outputs
The novel and previously generated AHE sequenced samples (Pfeiffer et al., 2019) and the RNA-seq samples (Bertucci et al., 2017; Table 1) were processed according to the pipeline developed by Breinholt et al. (2018) (Figure 1). Briefly, reads with less than 30 nt were filtered and reads with Phred score <20 were trimmed using Trim Galore! V 0.4.4 (Krueger et al., 2023).
Locus assemblies were produced with the iterative bait assembly script IBA.py (Breinholt et al., 2018) using Unioverse reference probes sequences as baits (Pfeiffer et al., 2019). Briefly, IBA filters reads with high similarity to the reference taxa probe using USEARCH v 10.0.240 (Edgar, 2010), producing a de novo assembly of isoforms from the selected reads using the transcriptome assembler Bridger v2014-12-01 (Chang et al., 2015). Afterward, MAFFT v 7.453 (Katoh & Standley, 2013) was used to add the de novo assemblies to the Unioverse reference alignment using the parameters '-addlong' and 'adjustdirectionaccurately'. Exonic and flanking regions were split using the script extract_probe_region.
py (Breinholt et al., 2018). Gene orthology was checked by single hit to the same regions of the B. platifrons genome using the ortholog_filter.py (Breinholt et al., 2018). Individual alignments for each locus were retrieved with script split.py (Breinholt et al., 2018) and subsequently aligned with MAFFT v 7.453 (Katoh & Standley, 2013). At each locus, a single consensus sequence (of each isoform) was produced using FASconCAT-G v 1.04 (Kück & Longo, 2014). Finally, the script remove_duplicates.py (Breinholt et al., 2018) was used to discard loci with more than one sequence per taxon.

| WGS outputs
Since the IBA assembly has only been developed for assembling loci derived either from AHE sequencing outputs (IBA.py) or RNAseq sequencing outputs (IBA_trans.py) (Breinholt et al., 2018), here we develop a new pipeline for assembling loci using WGS outputs (Illumina short read; Figure 1). This pipeline is based on an alternative iterative bait assembly, that is SRAssembler (McCarthy et al., 2019), that relies on two distinct whole-genome assemblers for the assembly stage, that is ABySS (Jackman et al., 2017) and SOAPdenovo2 (Luo et al., 2012). The freshwater pearl mussel M. margaritifera WGS reads were retrieved from NCBI (SRR13091477, SRR13091478, SRR13091479). Although SRAssembler allows to directly use raw sequencing outputs, it rapidly escalates the storage requirements, especially if a large amount of initial sequencing reads is provided. Therefore, a prefiltering was performed before running SRAssembler (Figure 1).
Using the AHE assembly datasets (Margaritiferidae and outgroup) generated as described above, a set of reference sequences composed of both probe and flanking regions was produced. This set of sequences was used as a reference for read filtering using BBmap tool from BBtools (Bushnell, 2016) specifying the parameters 'outm' and 'pairedonly'. The output was converted in paired fastq files using the tool reformat.sh, from BBtools and after quality processed with Trim Galore! (Krueger et al., 2023). Since the output reads were still considerably large, they were subsampled using the script extractfq.py from MitoZ (Meng et al., 2019), specifying the parameter '-size_required 5' (see the REAME file tutorial in Appendix S1 for a detailed description).
The trimmed and subsampled reads were subsequently used for the individual loci assembly in SRAssembler, using ABySS as the assembler and specifying each Unioverse probe reference loci individually at each run (Appendix S1).
All the individual assemblies were concatenated, and the names of the sequences transformed to automatically assign their corresponding reference loci code (Appendix S1) ( The average mean coverage distribution for all the target assemblies generated herein (i.e. WGS, RNA-seq and AHE based) was estimated using BBmap from BBtools (Bushnell, 2016), by mapping the sequencing reads to each assembly, specifying the parameters 'covstats'. All the tasks were executed using two distinct computing machines independently (i.e. either a 32 CPUs/64Gb-Intel(R) Xeon(R) CPU E5-2650L 0 @ 1.80GHz or a 32 CPUs/64Gb-AMD Opteron(tm) Processor 6262 HE).

| Mitochondrial genomes data processing
The mitogenomes were assembled using NovoPlasty v.4 (Dierckxsens et al., 2016) using a COI gene sequence retrieved from NCBI as reference bait for each sample. Annotations were obtained using MITOS2 web server (Bernt, Donath, et al., 2013)

| AHE datasets
To access the correct positioning of the assemblies produced by the new assembly strategy, an initial dataset that included the M. margaritifera AHE probe regions assemblies produced from RNA-seq was constructed ( Figure S1). However, RNA-seq-based assemblies only encompass exonic regions, whereas many non-exonic regions are represented in the AHE dataset by sequenced regions that flank the genomic regions targeted by the AHE probes. Consequently, the RNA-seq sample was excluded from the remaining alignments, which were all composed of both the exonic and hypervariable flanking regions (Figure 1).
Individual loci were aligned to reference sequences using MAFFT v 7.453 (Katoh & Standley, 2013) and only the loci with over 70% of gene occupancy were kept. The python scripts alignment_DE_trim.
py and flank_dropper.py (Breinholt et al., 2018) were used to trim and filter sequences and then split using extract_probe_region.py ( Figure 1). This resulted in three alignment files, one corresponding to the probe region (hereafter referred to as probe) and two corresponding to flanking regions (hereafter referred to as head for the 5′ flanking regions and tail for the 3′ flanking region), that were visually inspected using AliView v 1.26 (Larsson, 2014). The final datasets were realigned using MAFFT v 7.453 (Katoh & Standley, 2013) and were concatenated using both the probe and flanking regions ('head+probe+tail'), except for the dataset including RNA-seq data, which was solely composed of the probe region ('probe'; Figure 1).
Finally, trimAl v 1.2rev59 (Capella-Gutiérrez et al., 2009) was used to remove positions with gaps in 50% or more of the sequences.

| Mitogenomes alignments
The newly sequenced female Type (F-type) mitogenomes and the 17 F-type herein available on NCBI were used (Table 1). The nucleotide sequences of the protein-coding genes (PCGs; except atp8) and the two ribosomal RNA (rRNA) genes were aligned using GUIDANCE2 (Penn et al., 2010) with the MAFFT v 7 multiple sequence alignment algorithms, specifying the following parameters: score algorithm: GUIDANCE2; bootstrap replicates 100; sequence cut-off score: 0.0 (no sequence removal); column cut-off score: below 0.8; site masking score: below 0.6 (for codon and amino acids alignments) and 0.8 (for the rRNA alignments). The resulting alignments were concatenated to include both 12 PCG and the two rRNA ('PCG + rRNA').

| Phylogenomic analyses
All phylogenomic inferences for each AHE distinct dataset were performed using a maximum likelihood implemented with IQ-TREE v 1.6.12 (Nguyen et al., 2015; Figure 1). Best-fit substitution models were inferred with ModelFinder (Kalyaanamoorthy et al., 2017), implemented in IQ-TREE using a 10% relaxed clustering (option -rcluster 10; Lanfear et al., 2014). IQ-TREE analyses were conducted with 10 independent runs of an initial tree search and 10,000 ultrafast bootstrap replicates. For AHE and AHE + mtDNA combined datasets, substitution models were estimated in ModelFinder with no partition. For the mtDNA dataset, substitution models were obtained specifying each PCG codon position as a single partition and the two rRNA genes as two independent partitions.

| AHE probe region functional annotation
All the individual reference probe sequences from B. platifrons (i.e.

| Phylogenetic analyses
The general characteristics of the alignments used for the various phylogenetic inferences are reported in Table 3. All the phylogenetic tree files can be found in the supplementary information (Appendices S2 and S3).

| AHE datasets
The two AHE-based phylogenetic inferences reveal the same topology, recovering the glochidia-bearing mussels (i.e. (Hyriidae (Margaritiferidae+Unionidae)) as sister to the lasidia-bearing mussels (Mulleriidae+Iridinidae; Figure 2a, Figure S1)). Margaritiferidae is two subfamily-level groups, one containing taxa belonging to the genus  Figure S1). In the phylogeny inferred using the RNAassembly and SRAssembly samples, both M. margaritifera samples were grouped with maximum support ( Figure S1). All the above-described splits show maximum support for both phylogenies (Figure 2a, Figure S1).

| Mitogenome datasets
The mtDNA F-type-derived phylogeny (with the outgroup taxa A. elongata and M. dubia) is identical to the AHE phylogenies, except for genus-level relationships (Figure 2b). All the margaritiferid genera are still recovered as monophyletic with Gibbosula as sister to the remaining genera, but the sister group of Cumberlandia is Pseudunio + Margaritifera (vs. only Pseudunio in the AHE topology; Figure 2b).
The BUSCO search revealed that around 77% of the probe regions match at least one of the three searched OrthoDB databases (Table S6, Figure 5), with more than half of the loci being assigned to Mollusca, 15% assigned to metazoan and 6% assigned to eukaryote ( Figure 5).

| Assembly pipelines and probe capturing results
Here, we developed a new assembly strategy (referred to here as SRAssembly) for capturing and de novo assembly of the targeted regions using whole genome re-sequencing reads. Given no F I G U R E 2 Maximum likelihood phylogenetic trees of family Margaritiferidae based on: (a) concatenated alignments of 514 Unioverse loci (head+probe+tail; n = 1541); (b) concatenated alignments of the mitochondrial DNA from 12 PCG and 2 rRNA; *Above the nodes refer to bootstrap with maximum support.  taxon-specific parameters are required, our method can be replicated to any other reduced representation sequencing dataset based on the reference sequences for region capturing. We demonstrated the utility of SRAssembly using the Unioverse probe set. Of the assemblies here performed, SRAssembly provided the best balance between the number of assembled sequences and the number of captured loci while generating large sequences (Table 2). Although many AHE samples show a higher number of initially assembled sequences, most were the result of duplicated assemblies (Table 2), and SRAssembly assembled the highest proportion of individual captured loci (81%; Table 2). Moreover, both SRAssembly and RNA assembly have considerably higher sequencing coverage than the AHE outputs, which was not accompanied by an increased proportion of individual captured loci (Table 2).
While keeping the number of duplicated sequences low from the beginning and maintaining high loci capturing, SRAssembly reduces errors introduced by post-assembly duplication removal.
Furthermore, SRAssembly maintains a higher overall mean sequence length (1333.97 bp), which has obvious advantages over RNA-seq methods (177.8 bp) that only capture the probe region (Breinholt et al., 2018;Lemmon et al., 2012;Pfeiffer et al., 2019Pfeiffer et al., , 2021; Table 2). Given these statistics, WGS outputs are more promising than RNA-seq to augment target capture datasets.
As of March 2023, high-quality reference genomes are still a scant resource for many non-model species. However, the increasing accessibility to fast and affordable sequencing approaches will likely generate new whole genome assemblies followed by re-sequencing studies (Gomes-dos-Santos et al., 2020;Hotaling et al., 2021). Therefore, the availability of WGS data will increase sharply, making this pipeline a timely tool for future studies.
Moreover, the availability of reference genomes will provide the opportunity to make more robust reference gene sets for capture, which can easily be incorporated into our pipeline given that targeted capture is not a requirement for sequencing. Additionally, target sequencing may no longer represent an affordable option compared to WGS due to the constant and accentuated decline of sequencing costs. Nevertheless, the availability of already carefully selected target regions represents a valuable set of markers for phylogenomics.
New pipelines that integrate WGS for target capturing, such as the one presented herein or others recently made available (e.g. Allen et al., 2017;Knyshov et al., 2021;Ribeiro et al., 2021;Zhang et al., 2019), represent fundamental tools for future studies. The emergence of these pipelines follows a common theme, motivated by the recognition that, in the near future, WGS will likely represent one of the most accessible sources of molecular data for ecological and evolutionary studies, even for non-model organisms (Gomes-dos-Santos et al., 2020;Hotaling et al., 2021).
Many of these pipelines share similarities, especially the selected whole genome assembly software at the assembly stage (Allen et al., 2017;Knyshov et al., 2021;Ribeiro et al., 2021;Zhang et al., 2019). However, each pipeline assumes several methodological, bioinformatical and conceptual aspects that differentiate them and serve the user differently. The most fundamental differences distinguishing recently described approaches involve the way that the assembly is addressed. These approaches either produce assemblies from the WGS total read pool after extracting the target regions by homology or sequence alignment (such as F I G U R E 3 Frequency distribution of Unioverse probe regions per Bathymodiolus platifrons transcripts, to which the probes were aligned to. Top right corner is a schematic representation of how probe regions are distributed within a gene. ware allows automated multi k-mer assemblies, even when using Abyss, which does not natively support a multi k-mer (Jackman et al., 2017). SRAssembler also provides an optimization step to account for low read coverage ('-a' flag) and has been efficiency tested for coverage depths ranging from 10× to 40× (McCarthy et al., 2019). Moreover, it has 'cleaning rounds' that try to match assembled contigs against the original reference query, purging contigs without at least a partial match (McCarthy et al., 2019).
Finally, SRAssembler has a relatively low demand for computational resources compared to other approaches, which we demonstrate applies even for a high coverage WGS dataset.
Overall, the pipeline here presented represents an alternative, but a complementary approach, to other available pipelines. By condensing most of the necessary stages in a single command (i.e. multi k-mer assembly, low-to-high coverage balancing expectation and removal of redundancy), presents itself as a direct, simple and low-demanding approach, ideal for users with limited resources and when high coverage sequencing outputs are required.

| Phylogenetic analyses
Early margaritiferid systematics was based on highly variable or homoplastic morphological characters that often produced conflicting classifications (Lopes-Lima et al., 2018). Molecular phylogenetic studies on the group, which have primarily been based on F I G U R E 4 Percentages distributions of the Gene Ontology (GO) Terms main categories count for: (a) All Unioverse probe regions; (b) The Unioverse probe regions used for the final AHE Margaritiferidae samples alignments. within Margaritiferidae are similar, which was maintained after whole data processing pipeline to generate matrix alignments (Table 2).
This reinforces the efficiency of the Unioverse AHE probe dataset in isolating the target regions across Margaritiferidae (Pfeiffer et al., 2019). Both AHE and mtDNA phylogenies retrieve the genuslevel groups recently described by Lopes- Lima et al. (2018), corresponding to the four genera and placing the Gibbosula clade sister to all the remaining genera (Figures 1a,b). However, the position of Cumberlandia concerning Margaritifera and Pseudunio differs among the phylogenies (Figures 1a,b). The results of the AHE phylogeny agree with the published works based on combined mitochondrial and nuclear markers Huff et al., 2004;Lopes-Lima et al., 2018). On the other hand, the mitogenome phylogeny is congruent with other mitochondrial makers-based studies (Araujo et al., 2009;Gomes-dos-Santos et al., 2019;Inoue et al., 2014).
These expected differences do not contradict the inferred relationships within genera, which have high support among both phylogenies (Figures 1a,b). Mitochondrial genomes (or single mtDNA genes) have many intrinsic features that make them attractive for phylogenetic inferences in Metazoa (Bernt, Braband, et al., 2013;Ghiselli et al., 2021) but do not always reflect the evolutionary history of the species (Ghiselli et al., 2021;Hurst & Jiggins, 2005;Kern et al., 2020).
The disagreeing results between nuclear markers and the F-type mtDNA here reported suggest a divergent evolutionary history and given the notably low Margaritiferidae mitochondrial evolutionary rates (Bolotov et al., 2016), it is unlikely a result of nucleotide sub-

| Unioverse probe region annotation
Phylogenetic inferences are highly sensitive to data selection and studies relying on reduced sequencing approaches commonly treating loci as independent regions (e.g. Hipp et al., 2014;Pfeiffer et al., 2019Pfeiffer et al., , 2021Rosenfeld et al., 2016;Smith et al., 2020;Zhang et al., 2020). However, this may be unrealistic. Here we show that less than half (n = 304; ~37%) of the regions targeted by the target capture set Unioverse originate from a single gene, with a considerable number of targets belonging to different regions of the same gene (n = 507; ~63%) (Tables S1 and S2, Figure 3). Specifically, a total of 507 probe regions belong to different regions of the same genes (from 2 to 19 loci in the same gene), which may introduce bias when using individual locus methods. Although there is a high level of redundancy in gene targets, the entire coding sequence (CDS) for these genes may be assembled with a priori locus information provided in this study. This approach would allow for more conservative and less biased phylogenetic reconstruction by building alignments based on entire CDS rather than restricting the analyses to small, linked exons. We recommend future studies utilize this approach and capturing the CDS of targeted genes could be increased using larger inserts in the sequencing library preparation (Lemmon et al., 2012).
We were able to provide functional annotations of the 811 exons targeted by Unioverse (Tables S3-S6), which will allow future Ontology categorization and the BUSCO annotation. Due to the superimposed underlying assumptions, it is expected that the target capture methods will capture a significant number of genes within BUSCO databases (Lemmon et al., 2012;Lemmon & Lemmon, 2013;Manni et al., 2021). In total, 77% of the Unioverse AHE probe regions could be matched to three curated lists of BUSCO genes for Eukaryota, Metazoa and Mollusca, with more than half of the hits being assigned to Mollusca ( Figure 5). This further demonstrates the utility of the Unioverse probe set for phylogenetic inferences (Pfeiffer et al., 2019) and highlights the possibility of the probe set being integrated with extensive BUSCO databases to test macroevolutionary hypotheses. This demonstrates the importance of annotating target capture probe sets, which can expand their utility outside targeted focal groups (Fleming & Arakawa, 2021;Johnston et al., 2019;Li et al., 2021;Pfeiffer et al., 2019;Shen et al., 2016;Taylor et al., 2020;Zhao et al., 2020).

| CON CLUS ION
In this study, we provide a new assembly strategy for target capturing that allows us to easily combine AHE datasets with the increasingly emerging WGS outputs. To explore the results of the new pipeline, we provide a phylogenetic study with a comprehensive sampling of the most threatened Unionida family, Margaritiferidae.
Furthermore, we provide a complementary phylogenetic analysis using whole mitogenome assemblies. Moreover, we provide a thorough structural and functional annotation of the Unioverse probes regions, that will allow future studies to adjust loci sampling to their desire.

ACK N O WLE D G E M ENTS
The authors wish to thank the Editor and the two anonymous re-

CO N FLI C T O F I NTER E S T S TATEM ENT
All authors declare no conflict of interest.

DATA AVA I L A B I LIT Y S TATE M E NT A N D B E N E FIT-S H A R I N G S TAT E M E N T
The raw reads for each AHE sample were deposited at the NCBI Sequence Read Archive (see Table 1 for accessions), under BioProject PRJNA895396. Whole mitogenome assemblies were deposited in GenBank (see Table 1 for accessions). The structural and functional annotation tables of the AHE probe regions are provided in the Supplementary Data (Tables S1-S6). All software with respective versions and parameters used to assemble the AHE loci here presented are listed in the methods section. Software programs with no parameters associated were used with the default settings.