A combination of machine‐learning and eDNA reveals the genetic signature of environmental change at the landscape levels

The current advances of environmental DNA (eDNA) bring profound changes to ecological monitoring and provide unique insights on the biological diversity of ecosystems. The very nature of eDNA data is challenging yet also revolutionizing how biological monitoring information is analysed. In particular, new metrics and approaches should take full advantage of the extent and detail of molecular data produced by genetic methods. In this perspective, machine learning algorithms are particularly promising as they can capture complex relationships between the multiple environmental pressures and the diversity of biological communities. We investigated the potential of a new generation of biomonitoring tools that implement machine‐learning techniques to fully exploit eDNA datasets. We trained a machine learning model to discriminate between reference and impacted communities of freshwater macroinvertebrates and assessed its performances using a large eDNA dataset collected at 64 standard federal monitoring sites across Switzerland. We show that a model trained on eDNA is significantly better than a naive model and performs similarly to a model trained on traditional data. Our proof‐of‐concept shows that such a combination of eDNA and machine learning approaches has the potential to complement or even replace traditional environmental monitoring, and could be scaled along temporal or spatial dimensions.

which is costly, time consuming, and technically demanding (Growns et al., 1997;Stribling et al., 2008). These limits inevitably restrict the number of sites and taxa monitored. In addition, the identification of individuals based exclusively on complex or concealed morphological characters can strongly limit the taxonomic resolution of biomonitoring and our ability to infer the ecological status of the environment.
The current development of DNA-based identification methods is bringing profound changes to biological monitoring (Baird & Hajibabaei, 2012;Deiner et al., 2017;Keck et al., 2017;Pawlowski et al., 2018). The recent development of DNA metabarcoding, a molecular method to identify multiple taxa simultaneously from environmental samples with standard genetic markers, has enabled the production of reliable and cost-effective community inventories from environmental DNA (eDNA) (Creer et al., 2016;Deiner et al., 2017;Pawlowski et al., 2020;Taberlet et al., 2012). Metabarcoding holds promise for next-generation biomonitoring as it provides unique insights on biological diversity (Baird & Hajibabaei, 2012). One exciting application of DNA metabarcoding is the possibility to track a wide diversity of living organisms while automating a large part of the process and improving the spatial coverage and frequency of sampling . In a series of recent studies, this increase in spatial and temporal coverage has been demonstrated Carraro et al., 2020;Truelove et al., 2022). Overall, this novel possibility to increase spatial, temporal and taxonomic resolution is a real opportunity to improve the performance of ecological monitoring (Deiner et al., 2017;Leese et al., 2018).
Despite manifest advantages, the implementation of metabarcoding raises numerous issues as it profoundly changes the nature, volume, and dimensionality of data produced by biomonitoring (Keck et al., 2017). Early attempts to use metabarcoding data for ecological assessment naturally focused on converting molecular data into taxonomic information (species list), so that the experience gained and the vast set of tools and indices developed over the years in traditional biomonitoring could be directly mobilized (Aylagas et al., 2016;Elbrecht et al., 2017;Vasselon et al., 2017;Visco et al., 2015).
Overall, this approach was motivated by the attempt to reproduce existing indices through novel and non-invasive means. However, some inherent limits remain, because metabarcoding data are not fully compatible with the taxonomic classifications in use, and the attempt to reproduce existing indices by novel means retained constraints of the old methods. As a major limitation and side-effect of this "reproduce" yet not "extend" strategy, a large part of the genetic data produced remains unassignable, due to the use of biased and non-specific primers (Casey et al., 2021;Clarke et al., 2017;Ficetola et al., 2021) and limited reference databases Li et al., 2022;Weigand et al., 2019). As a result, estimates of community composition obtained by traditional methods and DNA metabarcoding are not always congruent . The very nature of high-throughput molecular data (i.e. its large volume, high dimensionality, and sparsity) is challenging scientists to think differently about the way biological monitoring information is analysed, which means developing new metrics and approaches that take full advantage of the large amount of information produced by genetic methods. This has prompted the development of a new generation of data-driven tools based on the so-called de novo strategy .
The de novo approach aims to directly associate biological community profiles obtained by DNA metabarcoding with known ecological state or disturbance gradients (Apothéloz-Perret-Gentil et al., 2017;Tapolczai et al., 2019; however, see extensive review by Cordier et al., 2021). Thereby, machine learning algorithms are particularly promising as they can capture complex non-linear and non-monotonic relationships between the multiple environmental pressures and the biological diversity uncovered by metabarcoding (Cordier, Lanzén, et al., 2019). As such, they also allow going beyond reproducing past approaches and make full(er) use of novel types of data. Thus, machine learning can be used to train models that can estimate ecological scores from eDNA data as if obtained using traditional sampling. Such approaches have demonstrated their potential to emulate various traditional biological indices both in marine (Cordier et al., 2017Frühe et al., 2021) and freshwater ecosystems Brantschen et al., 2021). However, machine learning models also "learn" to reproduce errors and biases from traditional indices, which is a clear limitation hindering the transition from an old to a new world of biomonitoring. It should be remembered that one of the promises of eDNA metabarcoding was to characterize the diversity of biological communities at unprecedented resolution and thus allow for more accurate and targeted biomonitoring than morpho-taxonomic data (Baird & Hajibabaei, 2012). Thus, using the high dimensionality of high-throughput genetic data to reproduce old regulatory systems constitute a technical bottleneck and de facto limits the potential added value of environmental DNA for ecological assessment.
One way to solve this problem is to develop new data driven tools that fully exploit the power of the machine learning framework to directly connect the patterns observed in eDNA data to the state and quality of the environment. Many studies have shown statistical relationships between eDNA and environmental quality gradients (e.g. Bik et al., 2012;Chariton et al., 2015;Cordier, Frontalini, et al., 2019;Laroche et al., 2018;Salis et al., 2017;Simonin et al., 2019;Stoeck et al., 2018;Thompson et al., 2016) but few have tested the predictive ability of eDNA based models in the context of ecological monitoring. Here, we use a dataset of 64 river-sites sampled in Switzerland to test whether the signal present in eDNA data can be used to discriminate between reference and impacted sites.
All sites sampled are part of an extensive federal monitoring program on aquatic state and change (BAFU, 2013). We aimed to detect a structural signature of anthropogenic pressure on the landscape scale (agriculture and urbanization) in the structure of freshwater macroinvertebrate communities. We approach it using a super- After collecting eDNA samples, macroinvertebrate communities were sampled using a standardized kick-net approach. A total of eight microhabitats were sampled at each site. Non-target organisms (e.g. fish, crayfish and amphibians), sediment and coarse organic material were immediately removed and the remaining sample was preserved with 95% ETOH on site. Individuals were sorted and identified based on morphological characters in the laboratory. Individuals were classified mostly at the family level, except for Porifera, Bryozoa and Cnidaria, which were only classified at the phylum level.

| Site classification
In order to test the ability of the different models to discriminate between communities impacted and not impacted by human activities, we classified the 64 sites into two categories (reference and impacted). For each site, we extracted the upstream watershed area limited to a 10 km radius from the sampling point. We then used the Swiss GEOSTAT 2013/18 land use classification (Office fédéral de la statistique, 2019) to estimate the proportion of different classes of land use in each extracted area. For each site, we computed the total proportion of natural land use (i.e. all the surfaces not classified as agriculture or urban areas, mostly forests and natural grasslands).
Finally, sites were classified as reference if their proportion of natural land use was higher than 33% and as impacted if their proportion of natural land use was lower than this threshold. The threshold was chosen as the median of the distribution, hence guaranteeing balanced classes (32 sites in each class), which is important for supervised machine learning. A visual inspection also verifies that this value separates the two main modes of the distribution ( Figure S1) and correctly delimits and separates the anthropized Swiss plateau from the more preserved regions of the Jura in the north and the Alps in the south.

| Molecular analyses
DNA was extracted from environmental samples in a specialized laboratory following clean lab standard procedures (Deiner et al., 2015) and by using the Qiagen PowerWater Sterivex Extraction Kit. Extracted DNA was eluted in a 100 μL elution buffer and stored at −20°C.
The first PCR started with initial denaturation at 95°C for 10 min, followed by 10 cycles of denaturation at 95°C for 15 s, annealing at 60°C for 30 s (reduced by 1 degree every cycle) and extension at 72°C for 30 s. Then, 30 additional cycles were performed with an annealing at 50°C. The final extension was performed at 72°C for 5 min and the plates were cooled to 10°C. The success of the amplification was tested using the AM320 method on the QiAxcel Screening The second thermal-cycling regime involved initial activation at 95°C for 10 min, followed by thermal cycling with denaturation at 95°C for 30 s, annealing at 55°C for 30 s and extension at 72°C for 30 s. After 8 cycles of final extension at 72°C for 5 min, the products were cooled to 10°C and stored in the refrigerator at 4°C for downstream use.
The PCR products were then cleaned using the Thermo MG Magjet

| Bioinformatics
The demultiplexed forward and reverse reads from MiSeq were first checked using FastQC (v.0.11.9) and then processed using the R package DADA2 (v.1.16.0) to identify amplicon sequence variants (ASVs). The primer sequences (fwhF2/EPTDr2n) were removed from the reads using cutadapt (v. 2.10). The forward and reverse reads were then trimmed to 120 nucleotides to remove low quality bases. Reads with ambiguities or a maximum expected error (maxEE) greater than 2 were removed. ASVs were inferred based on the error rate model estimated by DADA2 and the paired reads were merged into one sequence with a minimum overlap of 12 bases. Chimeric sequences were removed using the de novo bimera detection algorithm available in DADA2. Rare ASVs with less than 10 occurrences across the dataset were removed, as well as ASVs shorter than 137 bp and longer than 147 bp. Finally, ASVs which were found in relative proportion > 0.1% in one of the negative controls were also discarded from all the samples. For comparison and interpretation, ASV sequences were also assigned taxonomically with the IDTAXA algorithm (Murali et al., 2018) implemented in the DECIPHER package (Wright, 2016) using a confidence threshold of 60% and a custom reference database for macroinvertebrates . To help with the statistical modelling, the complexity of the dataset was reduced by clustering ASVs sequences into OTUs using a farthest-neighbour algorithm with 97% of identity and the OTU table was rarefied to 10,000 sequences per sample prior to further analyses.

| Supervised machine learning for impact prediction
We applied a supervised machine learning procedure to test the ability of the morpho-taxonomic matrix and the eDNA matrix to predict the land use class of the sites. We trained these two binary classifiers using the random forests algorithm implemented in the ranger package (Wright & Ziegler, 2017) with Gini impurity as a measure of importance and 1000 trees per ensemble. The other parameters (number of features to try in individual tree, maximum depth, and split rule) were set as default. Additionally, we created a null model, a non-informative model that always predicts the most represented class in the training set for classification of the testing set and that can be used as a baseline for performance comparisons. To assess the performances of our three models we performed a 10-fold cross validation procedure with five repetitions. This procedure consists in splitting the data set randomly into 10 folds. The model is then fitted using nine folds and the performance is measured using the remaining testing fold. The procedure is repeated 10 times until each fold is used for testing.
The entire process is repeated five times to limit the sampling effects in the initial splitting. Performances were measured with the accuracy metric, which measures the proportion of sites whose class is predicted correctly. The fitting and cross validation of the models was performed using the tidymodels package (Kuhn & Wickham, 2020).

| RE SULTS
The sequencing platform generated a total of 10,349,275 reads with The overall composition of the communities (gamma diversity) as estimated by the eDNA and traditional methods is shown in Figure 1.
Some major arthropod groups like Diptera, Coleoptera, Plecoptera and Ephemeroptera were found by both methods, although in F I G U R E 1 Treemaps showing the composition of the morpho-taxonomical and eDNA matrices across all sites. Cell size is proportional to the relative abundance of the different features (morphotaxa and OTUs) used to train the models. In total 86 morphotaxonomic groups and 1692 OTUs were distinguished, respectively. different proportions. However, Amphipoda and Trichoptera were clearly detected in lower proportions by metabarcoding (grouped in the 'Others' category of Figure 1) compared to traditional sampling. Besides macroinvertebrates, metabarcoding also generated a significant number and proportion of OTUs affiliated to the order Calanoida (copepods) and to the family Vannellidae (amoeboid protists).
We found that the model based on eDNA (OTUs features) had the best performance in predicting site categories with an average accuracy of 69.1% (Figure 2). This model showed similar performance with impacted sites (69.3% of correct predictions) and with reference sites (68.1% of correct predictions). The model based on morpho-taxonomic groups obtained through traditional kicknet sampling performed similarly with 59.5% of accuracy. In contrast to eDNA, this model was better with reference sites (61.3% of correct predictions) than with impacted sites (57.5% of correct predictions).
Both eDNA and morpho-taxonomy based models outperformed the null model (33% of accuracy).
When looking at the model performances at individual sites ( Figure 3), relatively many sites are well-predicted by both matrices. Twenty-two sites have prediction success of 100% for both approaches, while a few sites have a low to very low predictability (six sites have a prediction success of 0% for both approaches). Finally, some sites are better predicted by the eDNA matrix and some others are better predicted by the morpho-taxonomic matrix. There is no indication that these differences can be explained by geography or any other obvious reason (Figure 2). Some model features were clearly more influential than others for predictions. Figure 4

| DISCUSS ION
We demonstrated the use of machine learning algorithms to analyse DNA metabarcoding data for the assessment of the ecological status of the environment. We showed that it is possible to train models able to identify genetic patterns indicative of anthropogenic activities, such as urbanization and agriculture, directly from the bulk of DNA reads. While many studies have shown a response of communities to environmental changes through the study of eDNA (e.g. Bik et al., 2012;Chariton et al., 2015;Cordier, Frontalini, et al., 2019;Laroche et al., 2018;Salis et al., 2017;Simonin et al., 2019;Stoeck et al., 2018;Thompson et al., 2016), few have focused on the assessment of the predictive power of environmental DNA to estimate environmental changes. Our results indicate that the DNA-based model performs at least as well as the model based on the traditional morpho-taxonomic approach, despite being based on a very F I G U R E 2 Barplot showing the average accuracy (±SD) of each model in predicting site status, as estimated by the cross-validation procedure.

F I G U R E 3
Map of Switzerland showing the location of the 32 sites used in this study. The shape of the sites indicates in which category they were classified based on land use. The colour of the sites varies along two axes depending on the success rate of classification by the morphotaxonomic model (grey to red) and by the DNA model (grey to blue). The grey line delineates the river Rhine catchment and the main water bodies within this catchment are shown in light blue. different data matrix. This is a significant finding because it indicates that the information contained in the DNA matrix is sufficient to distinguish impacted sites from reference sites, and that the morphotaxonomic approach may not necessarily be superior for assessing the ecological status of the environment.
These results add to the growing body of evidence that eDNA and traditional methods give complementary pictures of biodiversity ), yet may be congruent with respect to their use for ecological indicators or derived descriptors of ecological state. While morphologically based taxonomic identification has been used routinely for ecological monitoring of environments for decades, our results confirm the potential of eDNA as a valuable tool for biomonitoring. One advantage of eDNA for model training is the finer resolution of its features produced, hence increasing the number of potentially useful predictors which can discriminate between ecological conditions. This is shown in Figure 1, where the number of OTUs in each order is much higher than the number of families traditionally used. For example, the order Diptera, which represent the largest proportion of both individuals and eDNA reads, is poorly resolved by the morpho-taxonomy with only five major groups (Chironomidae, Simuliidae, Limoniidae/Pediciidae, Psychodidae, and Empididae), while DNA metabarcoding splits this group in 759 OTUs. It is important to highlight that out of the 10 most important predictors for the DNA model, 8 were affiliated within the order Diptera. This is consistent with the very high proportion and diversity of Diptera present in the eDNA dataset but also indicates that Diptera could have a significant potential for biomonitoring. This potential may have hitherto been largely masked and rendered unusable by the low taxonomic resolution commonly used in traditional monitoring programs but is corroborated by several studies that highlight the bioindicator potential of dipteran taxa, especially chironomids, when studied at finer taxonomic levels (Nicacio & Juen, 2015;Rosenberg, 1992;Saether, 1979). In fact, our results suggest that a metabarcoding approach targeting only Diptera through a highly specific and effective set of primers (see e. g. Gibson et al., 2011) could produce data to train a potentially useful and powerful tool for ecological assessment.
Another important point is that environmental DNA can be transported by water flow for some distance before being degraded (Deiner & Altermatt, 2014;Pont et al., 2018). This gives eDNA the potential to represent the environment at a larger scale than locally sampled individuals (Carraro et al., 2020) and to incorporate the effects of anthropogenic pressures on the biota upstream of the sampling point (Burian et al., 2021;Carraro et al., 2020;Deiner et al., 2016). Insofar as we are testing the ability of different types of data (metabarcoding and traditional) to discriminate between sites that were pre-classified on the basis of land use up to 10 km upstream of the sampling point, eDNA may have an advantage here because of this passive transport effect and spatial integration, as has also been reflected in a land-water fingerprint of eDNA and structural diversity of terrestrial ecosystems (Zhang et al., 2023).
Our results confirm the potential of machine learning to process and value high-throughput generated DNA data. As a taxonomy free approach, the machine learning framework used in this study does not rely on taxonomic assignment of DNA sequences Cordier et al., 2017Cordier et al., , 2021. Thus, 100% of the DNA data can be exploited, irrespective of any possible assignment F I G U R E 4 Barplots showing the variable importance of the 10 most important features used for prediction by each model (eDNA and morpho-taxonomy, respectively). For the eDNA model, OTUs are referred after their numeric ID and their estimated taxonomic identification at the lowest possible rank.
to specific taxa, including the 584 OTUs (113,669 reads), which could not be classified at the order level, and thus would not have been used in any approach constrained by the use of the traditional set of taxa (Figure 1). While some of these unclassified data may correspond to errors and noise, a significant fraction corresponds to real biological features conveying a statistical signal useful for improving model performance. For example, OTU 142_141 is unclassified but plays an important role for the DNA model performance (Figure 4).
Several reasons can explain the failure of the classification of these sequences, such as incompleteness or conflicts in the reference database  but these data remain nonetheless valuable and machine learning algorithms can be used to incorporate them.
We here highlight the value of using complete metabarcoding data for the classification and assessment of ecological state, and advocate approaches that are not perpetuating previous limitations inherent to a past method yet obsolete to novel approaches. Indeed, we see the biggest limitation and possible challenge in the transition to DNA-based assessments exactly in the perpetuation of past constraints, limiting the full exploitation of the potential of eDNA metabarcoding. The goals should not be to replicate, but rather to complement and replace past approaches. While previous works have used machine learning to mimic traditional taxonomic indices, our study shows that it is possible to train algorithms directly from environmental data and without using traditional indices that are known to suffer from certain biases, technical issues and limitations (Brucet et al., 2013;Poikane et al., 2014;Reyjol et al., 2014). This finding has significant implications for future studies and developments in biological monitoring, as it opens up the possibility of developing more accurate and reliable tools for biomonitoring using machine learning algorithms that can learn directly from environmental data.
Here, we have demonstrated the concept using a simple binary classification between reference and impacted sites. Importantly, the approach can easily be extended to predict multiple quality classes (e.g. 'high', 'good', 'moderate', 'poor', and 'bad') as typically found in traditional methods being used for ecological assessment (Hering et al., 2006). It can also be used along continuous gradients of pressures and with quantitative data by doing regression instead of classification, an approach that is comparable to the development of transfer functions for environmental stressors (Bennion et al., 1996;Taukulis & John, 2009). Self-supervised models that can distinguish the status of the sample based on the eDNA signal alone could also be an interesting direction to explore. Finally, there is the opportunity to develop multiple models in parallel to provide finetuned assessment and predictions for multiple types of different pressures. This is an exciting prospect because of the multiplicity and ever-increasing number of environmental pressures that affect ecosystems (Folt et al., 1999;Piggott et al., 2015). The generalization of machine-learning approaches on large data from environmental DNA monitoring networks could help meet the challenge of the increasing complexity of human pressures and cocktail effects (Chariton et al., 2016;Dafforn et al., 2016). However, training these models will require large amounts of data in order to ensure their accuracy and transferability. In particular, it is important to ensure that the accuracy of the estimates is transferable from year to year and from region to region with external validation based on a-priori ecological knowledge and the expertise of practitioners. This calls for efforts by stakeholders and managers to pursue the development of such eDNA based monitoring networks.

AUTH O R CO NTR I B UTI O N S
FK and FA designed the study. JB did the laboratory processing of eDNA samples. FK did the bioinformatics and performed the analyses. FK wrote the manuscript with significant inputs from FA and JB.

ACK N O WLE D G E M ENTS
We thank Fedor Čiampor Jr., Letizia Lamperti and an anonymous reviewer for their helpful comments. We thank the Swiss Federal

CO N FLI C T O F I NTE R E S T S TATE M E NT
The authors declare no conflict of interest.

DATA AVA I L A B I L I T Y S TAT E M E N T
The sequencing data are available from https://doi.org/10.5061/ dryad.jsxks n0cq. The code to reproduce the analyses is available at: https://github.com/fkeck/ NAWA.