euka: Robust tetrapodic and arthropodic taxa detection from modern and ancient environmental DNA using pangenomic reference graphs

Ancient environmental DNA (aeDNA) is a crucial source of information for past environmental reconstruction. However, the computational analysis of aeDNA combines the inherent challenges of ancient DNA (aDNA) with the typical difficulties of eDNA samples, such as taxonomic identification and abundance estimation of identified taxonomic groups. Current methods for aeDNA fall into those that only perform mapping followed by taxonomic identification and those that also estimate abundance. The former leave abundance estimates to users, while methods in the latter category are not designed for large metagenomic datasets and are often imprecise and challenging to use. Here, we introduce euka, a tool designed for rapid and accurate characterisation of aeDNA samples. We use a taxonomy‐based pangenome graph of reference genomes for robustly assigning DNA sequences and a maximum‐likelihood framework for abundance estimation. At present, our database is restricted to mitochondrial genomes of tetrapods and arthropods but can be expanded in future versions. We find euka to outperform current taxonomic profiling tools and their abundance estimates. Crucially, we show that regardless of the filtering threshold set by existing methods, euka demonstrates higher accuracy. Furthermore, our approach is robust to the sparse data characteristic of aeDNA, detecting a taxon with an average of 50 aligning reads. We also show that euka is consistent with competing tools on empirical samples. euka's features are fine‐tuned to deal with the challenges of aeDNA, making it a simple‐to‐use, all‐in‐one tool. It is available on GitHub: https://github.com/grenaud/vgan. euka enables researchers to quickly assess and characterise their samples, allowing it to be used as a routine screening tool for aeDNA.


| INTRODUCTION
For the past 40 years, investigation of ancient DNA (aDNA) has enabled researchers to study extinct species and past environments. However, aDNA's peculiarities make analysis challenging.
First, DNA degrades in the environment after leaving the host cell, as DNA repair mechanisms stop working. This leads to DNA fragmentation, with ancient DNA often particularly short (<100 bp) (Hofreiter et al., 2001; Pääbo, 1989). Second, DNA fragments accrue post-mortem chemical damage, such as cytosine deamination, which leads to the observed C→T damage patterns (and the complementary G→A patterns) (Briggs et al., 2007; Prüfer et al., 2010). Such damage and short fragment lengths pose issues in aligning sequences to reference genomes and in taxonomically profiling aDNA samples (Jónsson et al., 2013; Poullet & Orlando, 2020). Finally, contamination from in situ microbes or present-day sources often constitutes most of the extracted DNA. Contaminant DNA can spuriously align to the reference genome due to its phylogenetic proximity (Peyrégne & Peter, 2020); such reads tend to accumulate in specific genomic locations rather than distributing evenly across the reference genome. While aDNA research has mostly concentrated on DNA isolated from fossils, more recently, ancient and modern environmental DNA (eDNA) has emerged as a way to track, monitor and reconstruct organismal assemblages. Yet, the greatest computational challenge for both ancient and modern eDNA remains the correct taxonomic assignment of relevant DNA reads and the estimation of their corresponding abundance (Beng & Corlett, 2020; Ruppert et al., 2019).
More accurate taxonomic identification and abundance estimation for aeDNA data are urgently needed. Some current methods, including MALT (Herbig et al., 2017) and Bowtie2 combined with ngsLCA (Langmead & Salzberg, 2012; Wang et al., 2022), part of the HOLI pipeline (Pedersen et al., 2016), focus on mapping and taxonomic classification but leave post-processing and abundance quantification to the user. Both align against a reference database and run the naive lowest common ancestor (LCA) algorithm to classify every alignment (Bender et al., 2005).
HAYSTAC (Dimopoulos et al., 2022) performs both mapping and aeDNA abundance estimation, presenting an alternative to LCA inference, and includes post-processing steps to assess sample authenticity. It excels at species identification within smaller, incomplete databases. However, its thorough analysis is time-consuming with larger databases and datasets, lessening its suitability for rapid screening. Other abundance estimation methods include Bracken (Lu et al., 2017), used with the output of any Kraken variant such as KrakenUniq (Breitwieser et al., 2018), though not typically for aeDNA. The HOPS pipeline (Hübler et al., 2019), which also works from Kraken output, focuses on bacterial detection and abundance estimation. All existing methods necessitate building a reference set of expected taxa, and too narrow a database may cause some taxa to be overlooked.
We introduce euka, an easy-to-use command-line tool written in C++ for robust identification of tetrapodic and arthropodic mitochondrial DNA in aeDNA samples. euka aligns aeDNA sequences to a taxon-based pangenomic reference graph, filters them through an ancient-damage-aware maximum-likelihood framework and estimates relative abundance using Markov chain Monte Carlo (MCMC). Irrespective of the filtering method, euka excels at taxa identification (at its taxonomic resolution) using the same curated database for both modern and ancient simulated metagenomes. We demonstrate euka's accurate abundance estimation on a simulated aeDNA sample and its robustness in detecting low-abundance taxa with about 50 aligning reads.
We found that euka performs consistently on empirical datasets and can process a billion reads in approximately 6 h (20 cores), making it suitable for aeDNA sample screening. euka is a subcommand of the vgan suite, an open-source software released under GPLv3. It accepts FASTQ as input and provides a summary of all taxonomic groups in the sample, including their respective relative abundance estimates, damage profiles, fragment length distributions and estimated coverage across the pangenome graph.

| General workflow
In this section, we introduce euka's general workflow (Figure 1), which comprises three steps: (1) mapping to our reference structure, (2) filtering spurious alignments and (3) abundance estimation of detected taxa.
First, we describe our curated database and how it is used to construct the taxa-based pangenome reference structure. Then, we explain the quality-control filters, including our ancient-damage-aware maximum-likelihood framework for quantifying the probability of correct alignments and the test for minimal coverage across our pangenome graph.
Finally, we describe our approach for estimating the relative abundance of detected taxonomic groups using an MCMC sampling algorithm.

| Database content
euka's first workflow step involves mapping metagenomic sequence data to our pangenomic graph reference structure (see the explanation in Supplementary Section 1), which is provided via FTP as described in euka's installation guide.1 Our reference structure is built from a curated database comprising 5847 mitochondrial genomes from 336 defined taxonomic groups within the phylum Arthropoda and subphylum Tetrapoda of the kingdom Animalia (see Table S2 and Supplementary Section 2 for a detailed description of the construction and curation and an explanation of the taxonomic resolution). The mitogenomes are clustered into taxa according to their sequence identity. Each taxon is converted into a single circular graph, as illustrated in Figure 1 using bears as an example. The pangenome graph condenses identical genomic regions and represents genetic variation through branching nodes (Garrison et al., 2018). These components combined form euka's pangenomic reference graph, enabling robust mapping of aDNA by capturing each taxon's genetic variation (Martiniano et al., 2020). Additionally, the graph structure allows us to maintain informative mapping quality scores for conserved genomic regions within a taxon, which directly influences euka's filtering. For further details on how the pangenomic graph is built from the mitochondrial database and a time estimate of the build, please refer to Supplementary Section 3.

| Mapping and post-processing
1 https://github.com/grenaud/vgan

FIGURE 1 euka's main workflow: Sediments contain extracellular DNA from the decomposition of biological remains. Sequencing picks up the ancient DNA fragments along with every other source of DNA in sedimentary samples. The resulting data contain a mix of unknown DNA sequences (FASTQ), used as input for euka. euka maps the unknown samples against the taxa-based mitochondrial pangenome graph to determine which taxa are present in a metagenomic sample. To refine its findings, euka uses a maximum-likelihood framework, which accounts for the effects of damage typically seen in aDNA, to filter spurious alignments. Additionally, euka filters taxa based on a minimal breadth of coverage across the pangenome. The abundance of the detected taxa is estimated, and the output is visualised to provide a detailed summary of the sample.

euka loads its pangenomic reference graph at runtime, together with the user-provided FASTQ file for mapping. We assume the FASTQ input has been pre-processed (low-complexity filtering and PCR duplicate removal) before running euka. euka calls the subcommand giraffe of the vg toolkit (Hickey et al., 2020; Sirén et al., 2021) to map the input FASTQ against the mitogenomes embedded in our graph (see Supplementary Section 4 for mapping parameters). In the first post-processing phase, we determine whether an alignment is genuine, that is, whether it stems from the correct taxon and the correct genomic region. To achieve this, we compute the ratio of two models. The first
posits that the alignment is correct, and the second postulates that the alignment is spurious. Using simulations, we determined that this ratio effectively minimises the number of incorrect alignments (Figure S1).
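The two competing models can be illustrated with a toy per-base log-likelihood ratio. This is a minimal sketch, not euka's actual framework (which is specified in its Supplementary Section 5): the deamination rate `damage_ct` and the error model below are illustrative assumptions.

```python
from math import log

def log_likelihood_ratio(read, ref, quals, damage_ct=0.2):
    """Toy per-read log-likelihood ratio contrasting a 'genuine alignment'
    model (mismatches arise from sequencing error or C->T deamination)
    against a 'spurious alignment' model (bases match a random reference).
    Illustrative only; damage_ct and the base-error model are assumptions."""
    llr = 0.0
    for b, r, q in zip(read, ref, quals):
        err = 10 ** (-q / 10)            # PHRED-scaled base error probability
        if r == 'C' and b == 'T':        # potential deamination event
            p_correct = damage_ct + (1 - damage_ct) * err / 3
        elif b == r:
            p_correct = 1 - err
        else:
            p_correct = err / 3
        p_spurious = 0.25                # random match to one of four bases
        llr += log(p_correct) - log(p_spurious)
    return llr
```

A positive ratio favours keeping the alignment; under this scheme a C→T mismatch is penalised far less than other mismatches, which is what makes the filter damage-aware.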
Additionally, we discard alignments with a mapping quality below 30 on the PHRED scale, as calculated by vg giraffe (Figure S2). A detailed explanation of our framework is provided in Supplementary Section 5. In the second post-processing phase, we restrict the analysis to taxa whose reference mitogenomes are sufficiently and evenly covered by the sequenced DNA fragments. We thereby aim to avoid detecting taxa where multiple sequences align to only one genomic region, especially regions of low complexity, otherwise referred to as read pile-ups. To mitigate the impact of such pile-ups, we developed a coverage binning system to estimate the minimal breadth of coverage of each taxon's connected component. A taxon is only considered detected if it has a minimum of one read mapped to each bin of the taxon subgraph, a minimum of six bins with an entropy score of 1.17 and, in total, a minimum of 10 mapped reads. For more details, see Supplementary Section 6.
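The detection rule above can be sketched as follows. The thresholds come from the text, but the `shannon_entropy` and `taxon_detected` helpers are illustrative reimplementations, not euka's code; the entropy threshold is assumed to be a lower bound.

```python
from math import log2

def shannon_entropy(base_counts):
    """Shannon entropy (bits) of the base composition in one coverage bin."""
    total = sum(base_counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in base_counts.values() if c > 0]
    return -sum(p * log2(p) for p in probs)

def taxon_detected(bin_reads, bin_entropies,
                   min_reads_per_bin=1, min_entropy=1.17,
                   min_entropy_bins=6, min_total_reads=10):
    """Illustrative version of the detection rule: every bin covered,
    at least six bins above the entropy threshold, >= 10 reads overall."""
    if any(n < min_reads_per_bin for n in bin_reads):
        return False
    if sum(h >= min_entropy for h in bin_entropies) < min_entropy_bins:
        return False
    return sum(bin_reads) >= min_total_reads
```

For example, a taxon with reads in every bin and 14 reads in total passes, while a single empty bin (a gap in the breadth of coverage) causes rejection regardless of the total read count.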

| Abundance estimation
We have now completed the alignment post-processing; every remaining taxon and associated alignment is believed to be of high confidence. The next step is the abundance estimation of the detected taxa. To do this, we first define an initial abundance vector for the sample as the proportion of post-processed alignments in each detected taxon compared to the total number of high-confidence alignments. We use an MCMC sampling algorithm to obtain a posterior estimate of our abundance vector and confidence intervals (95% and 85%). We describe our abundance estimation in more detail in Supplementary Section 7.
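As a rough illustration of this step, the sketch below initialises the abundance vector from alignment proportions and runs a simple Metropolis-Hastings sampler under a multinomial likelihood. This is not euka's actual sampler (described in its Supplementary Section 7); the proposal scheme, likelihood and step size are all assumptions made for the example.

```python
import random
from math import log, exp

def mcmc_abundance(counts, iters=20000, burn_in=5000, seed=1):
    """Illustrative Metropolis-Hastings sampler: start from read-count
    proportions, perturb pairs of abundances, accept by multinomial
    log-likelihood ratio. counts[i] = high-confidence alignments to taxon i."""
    rng = random.Random(seed)
    k, n = len(counts), sum(counts)
    theta = [c / n for c in counts]          # initial abundance vector

    def loglik(t):
        return sum(c * log(max(p, 1e-12)) for c, p in zip(counts, t))

    samples, ll = [], loglik(theta)
    for it in range(iters):
        i, j = rng.sample(range(k), 2)
        eps = rng.uniform(-0.01, 0.01)
        prop = theta[:]
        prop[i] += eps
        prop[j] -= eps                        # pairwise move keeps sum(theta) = 1
        if min(prop[i], prop[j]) > 0:         # stay on the simplex
            pll = loglik(prop)
            if pll >= ll or rng.random() < exp(pll - ll):
                theta, ll = prop, pll
        if it >= burn_in:
            samples.append(theta[:])
    return samples

def credible_interval(samples, taxon, level=0.95):
    """Equal-tailed credible interval for one taxon from posterior samples."""
    vals = sorted(s[taxon] for s in samples)
    lo = int((1 - level) / 2 * len(vals))
    return vals[lo], vals[len(vals) - 1 - lo]
```

The posterior samples concentrate around the read-proportion estimate, and the quantiles of the sampled abundances give the reported interval for each taxon.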

| Ancient DNA authentication
For every taxon that passes our filtering steps, we check alignments for ancient damage signatures. If an ancient damage event is identified (observed as C→T substitutions, or G→A on the reverse strand, caused by deamination of cytosine to uracil), we record these occurrences and output the probability of observing ancient damage at a specific position as a damage profile for that taxon. In addition, we output a tab-separated file that details the coverage and fragment length distribution. We provide R scripts to plot damage rates at each end of the DNA sequence, the fragment size distribution and the mitogenome coverage on a per-bin basis.
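A minimal tally of such damage signatures might look like the sketch below, counting C→T substitution frequency per position from the 5' read end (euka additionally tracks G→A on the reverse strand). The `(read, reference)` pair format is an assumed input representation for the example.

```python
def damage_profile(alignments, window=5):
    """Illustrative C->T substitution rate per position from the 5' end.
    `alignments` is assumed to be (read, reference) pairs of equal length."""
    ct = [0] * window       # observed C->T events per position
    c_total = [0] * window  # reference cytosines per position
    for read, ref in alignments:
        for pos in range(min(window, len(read))):
            if ref[pos] == 'C':
                c_total[pos] += 1
                if read[pos] == 'T':
                    ct[pos] += 1
    return [ct[i] / c_total[i] if c_total[i] else 0.0 for i in range(window)]
```

An elevated rate at the terminal positions, decaying towards the read interior, is the classic aDNA deamination signal the profile plots visualise.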

| Test data preparation
To evaluate taxa detection and individual taxonomic read assignments across all tools, we created three in silico aeDNA sample environments. These simulations closely replicate the abundance of different kingdoms in empirical environments from previous studies, thus realistically simulating the challenges of environmental samples, such as high false-positive mapping from non-eukaryotic species to eukaryotic genomes. For more details, see Supplementary Section 8.
We used three empirical samples to demonstrate euka's consistency with existing methods. We chose two pre-processed samples (filtered for low-complexity reads, with PCR duplicates removed): a 15 kya Mexican cave sediment sample (Ardelean et al., 2020; Pedersen et al., 2021) and a 2-million-year-old aeDNA Greenlandic sample (Kjaer et al., 2022). Additionally, we used a 30 kya permafrost sample from Yukon, Canada (Murchie, Monteath, et al., 2021), for which we processed the raw empirical data, performed adapter trimming and merging with leeHom (Renaud et al., 2014) using ancient-DNA parameters, and filtered low-complexity reads using sga preprocessing.2

| Software limitations
The database of euka is limited to tetrapodic and arthropodic taxa to conserve disk space and optimise runtime. Yet, as pangenome graph compression technologies improve (Deorowicz et al., 2023), we aim to expand our database to encompass all available eukaryotes. euka is also limited by its exclusive processing of mitochondrial DNA due to the large size of nuclear genomes. Although tools like Minigraph (Li et al., 2020) provide efficient solutions for handling larger genomes, vg giraffe remains the only haplotype-aware mapper suitable for euka's framework.

| RESULTS
First, we demonstrate euka's performance compared to existing tools using three simulated aeDNA environments, where the present taxa and exact abundances are known. We benchmark against Bowtie2+ngsLCA and MALT by mapping and analysing the different taxa without explicit abundance quantification. We subsequently compare euka's performance against HAYSTAC and KrakenUniq+Bracken, which specifically quantify abundance. We provide a feature comparison in Table S1. Finally, we present euka's results compared to all five tools on two empirical samples. All experiments used euka's curated database as a reference to evaluate the computational method per se. Owing to our taxonomic limitations, we only considered alignments to taxa defined in Table S2. For exact commands, see Supplementary Section 11.

| Mapping accuracy
In our first benchmark, we evaluated the accuracy of read assignments for five tools, namely euka, MALT, Bowtie2+ngsLCA, HAYSTAC and KrakenUniq+Bracken, using simulated environments.
euka only considers alignments with high confidence as defined in Section 2.4. For the remaining tools, we set the filters as recommended by the authors in the original publications (Supplementary Section 11). In our simulations, we know the ground-truth assignments and can therefore compute the accuracy of assignments to taxa. Accuracy is computed as the ratio of correctly assigned reads to the total number of alignments to these taxa. Figure 2a indicates that our approach produced the highest mean accuracy across all environments and damage rates (for the accuracy measure from 1 to 0, see Figure S3); Figure 2b shows the simulated damage rates. In comparison, Bowtie2+ngsLCA's strict filters, followed by HAYSTAC's Bayesian approach and KrakenUniq+Bracken, resulted in lower accuracy.
MALT, with its heuristic alignment approach, was the least accurate.
By using a pangenome graph that incorporates genetic variation between species of a taxon, we better accommodate mutations due to ancient damage. Furthermore, euka's filter minimises spurious alignments. This combination makes it possible to maintain accuracy with accumulating ancient damage, as shown in Figure 2a.
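The accuracy measure defined in this benchmark reduces to a simple ratio; a minimal sketch, where the `assignments` mapping of read IDs to (true taxon, assigned taxon) pairs is a hypothetical data structure used only for illustration:

```python
def assignment_accuracy(assignments):
    """Accuracy as defined in the text: correctly assigned reads over all
    alignments to the evaluated taxa. `assignments` maps read IDs to
    (true_taxon, assigned_taxon) pairs (hypothetical format)."""
    if not assignments:
        return 0.0
    correct = sum(true == assigned for true, assigned in assignments.values())
    return correct / len(assignments)
```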

| Correctness of abundance
We next evaluated the estimated taxa abundances for each simulated environment. Both Bowtie2+ngsLCA and MALT are predicated on the LCA algorithm, output every taxon with at least one alignment and do not provide abundance estimates. Therefore, to compare euka with Bowtie2+ngsLCA and MALT, we used as a criterion for detecting a taxon a threshold ranging from a single alignment to the maximum number of alignments to a given taxon. We assessed how many taxa are correctly detected (true positives) versus wrongly detected (false positives) (Figure 3). Especially for the cave environment, euka significantly outperforms both competing tools. For the permafrost environment, euka is more accurate than existing methods; however, this environment holds the challenge of detecting the ant taxon (Formicidae). We cannot detect the Formicidae with standard parameters, as their mitogenomes contain too many low-entropy regions (Barbosa et al., 2020; Gotzek et al., 2010).
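The threshold sweep described above can be sketched as follows; the helper and the taxon names in the example are illustrative, not part of euka.

```python
def tp_fp_at_thresholds(read_counts, truth):
    """For every read-count cutoff, call a taxon 'detected' if at least
    that many reads were assigned to it, then score against the truth set.
    `read_counts` maps taxon -> assigned reads; `truth` is the set of taxa
    actually present in the simulation."""
    max_t = max(read_counts.values())
    points = []
    for t in range(1, max_t + 1):
        detected = {tax for tax, n in read_counts.items() if n >= t}
        tp = len(detected & truth)
        fp = len(detected - truth)
        points.append((t, tp, fp))
    return points
```

Sweeping the cutoff traces out the ROC-style curve used in Figure 3: low thresholds detect all true taxa at the cost of false positives, and high thresholds trade true positives away.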
We then compared the abundance estimation precision of HAYSTAC, KrakenUniq+Bracken and euka using the Bray-Curtis dissimilarity (Bray & Curtis, 1957; Ricotta & Podani, 2017). Figure 4 displays the mean dissimilarity of euka, HAYSTAC and KrakenUniq+Bracken relative to the ground truth, showing that euka consistently outperforms both other tools. Importantly, HAYSTAC overestimates abundance by assigning a default value to all present taxa, which is part of its fine-tuning to identify closely related species in smaller datasets but leads to bias in larger reference sets. KrakenUniq's k-mer-based matching leads to increased spurious matches to various taxa, which propagates into an unspecific abundance vector calculated with Bracken. In contrast, euka's taxon filtering method ensures more accurate abundance estimation. We excluded the ant family (Formicidae) from the permafrost sample abundance calculation due to their low-entropy mitogenomes and the high likelihood of spurious alignments.
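The Bray-Curtis dissimilarity used for this comparison has a simple closed form; a minimal implementation:

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors:
    0 = identical composition, 1 = no shared abundance."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0
```

Applied to an estimated abundance vector and the known simulation input, a value near 0 indicates that a tool recovered the simulated composition almost exactly.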

| Robustness of abundance
To evaluate euka's robustness to low coverage depth, we generated a FASTQ file containing three taxa (Bovidae, Myotis, Ursidae) in low, medium and high abundance, respectively. We downsampled the reads to 75%, 50%, 25%, 10% and 1% of the original for all three simulated aDNA damage ranges (Figure 2b). Our results (Figure 5) demonstrate that euka is highly accurate and can estimate abundances down to 10% of the original input for high- and medium-abundance taxa. (We compare the computational performance of the three quantification tools, namely euka, HAYSTAC and KrakenUniq+Bracken, in Supplementary Results Section 10.1.) Even

| Empirical data
We also benchmarked on empirical samples. Since the ground truth is unknown, we evaluated consistency while manually verifying the results against the original publications. We used two empirical aeDNA samples to compare against existing methods. Additionally, we use one aeDNA sample to demonstrate euka's detailed output summary in Supplementary Section 10.2.

| Analysis of a 15kya Mexican cave sample
FIGURE 4 Comparison between euka, HAYSTAC and KrakenUniq+Bracken based on the average Bray-Curtis dissimilarity of their resulting abundance vectors. The Bray-Curtis dissimilarity, ranging between 0 and 1, measures the compositional dissimilarity between two environments, with 0 indicating perfect similarity. To conduct this analysis, we calculated the dissimilarity between the abundance vectors produced by euka, HAYSTAC and KrakenUniq+Bracken and the original input vector of the simulated metagenomic sample. The figure highlights the effectiveness of euka's filters and its resilience to ancient damage.

First, we used the UE1210 Mexican cave layer sample from Pedersen et al. (2021) (ENA: PRJEB42692). The sample was sequenced using Illumina HiSeq and NovaSeq shotgun sequencing, and the presence of the American black bear was reported (Pedersen et al., 2021). In the previous study describing the cave environment (Ardelean et al., 2020), the tribe Marmotini was also detected. To set a threshold on the number of aligned reads required to mark a taxon as detected for Bowtie2+ngsLCA and MALT, we used the threshold obtained from our ROC curves that has the highest true-positive value (all

known taxa are detected) with the lowest false-positive value, as described in Section 3.1.2. As for HAYSTAC, we used every taxon with a coverage evenness value of less than 10, as recommended in Dimopoulos et al. (2022). Since HAYSTAC classifies at the species level, we aggregated reads from the same taxon as defined in euka's database (Table S2). We considered a taxon detected for HAYSTAC if 30 reads were mapped. For KrakenUniq+Bracken, we disregarded any taxa with fewer than 50 mapped reads. For all tools, we did not consider taxonomic classes higher than superfamily.

FIGURE 3 Performance of euka relative to Bowtie2+ngsLCA and MALT in estimating taxa abundance. As Bowtie2+ngsLCA and MALT do not offer abundance estimates for the taxa they identify, we applied thresholds ranging from 1 to the highest number of reads assigned to a single taxon and calculated the true-positive and false-positive detected taxa for each threshold, allowing us to measure accuracy. euka uniquely provides a list of identified taxa with corresponding abundance estimates, represented as a single dot on the graph.

FIGURE 5 euka, HAYSTAC and KrakenUniq+Bracken abundance estimation for a downsampled ancient sample of three taxa. To show euka's robustness, we downsampled the original sample to the point where euka would be unable to detect the different taxa. The initial abundance of each of the three taxa is represented with a dotted line. On average, euka detects taxa with approximately 50 alignments. KrakenUniq+Bracken does not report confidence intervals and is therefore represented as a dot.
FIGURE 6 Taxa identification (the proportion of each taxon is displayed on the y-axis) for two empirical datasets ('UE1210' and 'Bear Creek') with all five methods. Only taxa in HAYSTAC with coverage evenness <10 and, for KrakenUniq+Bracken, a general read threshold of 50 were considered. We used the best threshold estimated for the mimicked simulated sample from Figure 3 for Bowtie2+ngsLCA and MALT, excluding every taxonomic rank above superfamily for the 'UE1210' sample. For the 'Bear Creek' sample, we used 10% of the best ROC threshold due to our high-coverage simulation.
in low-abundance scenarios, euka remains accurate down to 75% of the original input level, equivalent to approximately 50 alignments (0.2X coverage). KrakenUniq+Bracken can confidently estimate the abundances down to 10% downsampling. Contrastingly, HAYSTAC's abundance estimation fails to recover the ground truth, owing to the issue shown in the environment simulation experiment. We repeated the experiment for medium damage and no damage, finding no discernible differences from Figure 5 (see Figures S4 and S5, respectively).

FIGURE 2 (a) Comparison of sensitivity on simulated reads. True-positive mappings to all known taxa were assessed within three simulated ancient metagenomic samples at three levels of simulated damage. Each tool was provided with the same database for mapping and compared at euka's defined taxonomic levels. (b) Damage profiles for the three levels of simulated damage within the metagenomic samples. The top panel shows no simulated damage; the middle panel shows the profile for our mid-damage simulations, and the bottom panel for our high-damage simulations.

Figure 6 summarises our empirical results. The two panels show the detected taxa proportions for the two empirical samples.

Figure 6's first panel for sample 'UE1210' illustrates euka's ability to detect both the highly abundant bears and the low-abundance taxon Xerinae (the tribe Marmotini is contained within the subfamily Xerinae