A powerful long metabarcoding method for the determination of complex diets from faecal analysis of the European pond turtle (Emys orbicularis, L. 1758)

Abstract High‐throughput sequencing has become an accurate method for the identification of species present in soil, water, faeces, gut or stomach contents. However, information at the species level is limited due to the choice of short barcodes and based on the idea that DNA is too degraded to allow longer sequences to be amplified. We have therefore developed a long DNA metabarcoding method based on the sequencing of short reads followed by de novo assembly, which can precisely identify the taxonomic groups of organisms associated with complex diets, such as omnivorous individuals. The procedure includes 11 different primer pairs targeting the COI gene, the large subunit of the ribulose‐1,5‐bisphosphate carboxylase gene, the maturase K gene, the 28S rRNA and the trnL‐trnF chloroplastic region. We validated this approach using 32 faeces samples from an omnivorous reptile, the European pond turtle (Emys orbicularis, L. 1758). This metabarcoding approach was assessed using controlled experiments including mock communities and faecal samples from captive feeding trials. The method allowed us to accurately identify prey DNA present in the diet of the European pond turtles to the species level in most of the cases (82.4%), based on the amplicon lengths of multiple markers (168–1,379 bp, average 546 bp), and produced by de novo assembly. The proposed approach can be adapted to analyse various diets, in numerous conservation and ecological applications. It is consequently appropriate for detecting fine dietary variations among individuals, populations and species as well as for the identification of rare food items.


| INTRODUC TI ON
Molecular technologies, such as high-throughput amplicon sequencing (HTS), have become a method of choice to accurately and rapidly characterize complex, multispecies, ecological communities. This approach has the potential to greatly improve the accuracy of diet analysis from faecal samples or stomach contents (Alberdi et al., 2018).
HTS has been used to assess the diet composition of a wide taxonomic range of animals. Animals whose diets have been successfully investigated have included mammals (Buglione et al., 2018;De Barba et al., 2014;Esnaola et al., 2018;Robeson et al., 2017), birds (Crisol-Martinìez et al., 2016;Han & Oh, 2018), reptiles (Caut et al., 2019;Kartzinel & Pringle, 2015;Koizumi et al., 2017), fish (Barbato et al., 2019;Harms-Tuohy et al., 2016;Riccioni et al., 2018) and arthropods (Kamenova et al., 2018;Kennedy et al., 2020;Krehenwinkel et al., 2016). For most species, faecal samples, in contrast to stomach contents, can easily be obtained, with a minimal level of interaction and harm inflicted on the studied animal. Using faecal samples is therefore a noninvasive and attractive approach to study dietary patterns (Pompanon et al., 2012;Valentini et al., 2009). This is especially true for endangered species or species whose feeding patterns are difficult to observe in the wild, such as aquatic or nocturnal species (Baamrane et al., 2012;Hibert et al., 2013). Following sampling, laboratory procedures involve total DNA extraction from faecal samples, PCR amplification with either a universal or a specific set of primers corresponding to one or more barcode loci, preparation of DNA libraries and DNA sequencing ultimately terminated by data processing via bioinformatics pipelines (Laudadio et al., 2019).
Previous studies have revealed the difficulty associated with determining operational taxonomic units (OTUs) at the species level, such as has been reported in diet (Ursus arctos: De Barba et al., 2014; Lepus corsicanus: Buglione et al., 2018) or environmental DNA (eDNA) studies (Lacoursière-Roussel et al., 2018;Lim et al., 2016;Ruppert et al., 2019). Furthermore, obtaining sufficient amplicon length to determine sample identities to the species level can be challenging (De Barba et al., 2014;Deagle et al., 2010). This difficulty is primarily due to the fact that prey DNA in faecal samples is degraded (Deagle et al., 2006). Consequently, primers were selected to target only short and variable prey DNA fragments present in the diet. Long metabarcoding has facilitated the production of long sequencing reads (Godwin et al., 2016). The increased marker length allows a higher taxonomic resolution, with long markers increasing the ability to distinguish closely related species (Singer et al., 2016).
Regarding vertebrates and invertebrates, the mitochondrial cytochrome oxidase subunit 1 (COI) is the most frequently used barcode locus, with the most diversified and complete reference database (Jusino et al., 2018). The careful selection and design of amplification primers are essential, as well as the evaluation of primers from close and distant species, using as many DNA sources as possible, in order to characterize primer specificity and/or universality.
Together with primer selection, the parameters of bioinformatic pipelines have an important impact on the identification of OTUs. Indeed, after sequencing, to achieve in-depth analysis of DNA present in the studied sample, raw sequence data alone are not sufficient (van der Walt et al., 2017). Furthermore, unassembled raw metagenomic sequence data are fragmented, contain errors and/or are affected by unequal sequencing depths (Nagarajan & Pop, 2013), hindering the accuracy when sorting DNA sequences (Nurk et al., 2017). Thus, to accurately analyse metagenomes, larger contiguous segments named contigs can be assembled from raw sequence data (Anantharaman et al., 2016). For this reason, multiple metagenome bioinformatic pipelines have been developed to assemble raw sequences by simply merging paired-end reads or by de novo assembly (van der Walt et al., 2017).
Using a metagenome assembler producing high-quantity long contigs (>1,000 bp) will allow for more accurate determination of organisms to the species level using DNA sequences present in the sample (van der Walt et al., 2017). Additionally, control and validation methods are needed for parametrizing bioinformatics pipelines. This can be achieved by creating and sequencing mock communities, which are references for DNA databases and are used as positive controls for HTS (Jusino et al., 2018). Moreover, a captive feeding trial can be conducted to evaluate the applicability of the method and to test whether prey DNA can be reliably detected in faecal samples (Deagle et al., 2005;Nakahara et al., 2015).
Here, we describe a study of the complex diet of the European pond turtle, using, for the first time in a dietary study, a new long DNA metabarcoding approach. The procedure includes 11 different primers pairs that target a region of the COI gene, the large subunit of the ribulose-1,5-bisphosphate carboxylase gene (rbcL), the maturase K gene (matK), the 28S rRNA and the trnL-trnF chloroplastic region (the proposed combination of primers covers plants, invertebrates and vertebrates). This will result in the amplification of fragments between 350 and 1,400 bp in length. After sequencing, raw metagenome sequence data are first analysed with the open bioinformatics pipeline, metaspades version 3.9.0 (Nurk et al., 2017; http://cab.spbu.ru/softw are/spades), which, to our knowledge, is being used for the first time in diet studies using metagenomic approaches, and considered nowadays as the most recommended metagenomic data assembler for high-complexity metagenomes (Forouzan et al., 2018). This new DNA metabarcoding approach is based on the use of multiple primers in order to maximize coverage of species groups and can accurately be used to identify taxa to the species level for plants, vertebrates and invertebrates present in faeces collected in the field. Finally, the accuracy of the method was also validated by using faeces obtained from captive European pond turtles fed using known diets.

| General approach for omnivorous diet analysis
We have developed a general method of long DNA metabarcoding for analysis of omnivorous diets (Figure 1), through a diet study of the omnivorous European pond turtle (Emys orbicularis, L. 1758).

| Study species
The European pond turtle is found in wetlands of Europe and North Africa and has been classified as "near threatened" (NT) according to the UICN Red List. In Switzerland, the species is listed as "critically endangered" (CR) on the Swiss Red List (Monney & Meyer, 2005). Previous studies investigating the feeding behaviour of the species using direct observations and microscopic examination have suggested that the animals have an omnivorous diet (Çicek & Ayaz, 2011;Ottonello et al., 2005Ottonello et al., , 2016Ottonello et al., , 2018. However, these techniques have several limitations, including the loss of information due to difficulties identifying prey and plant matter present in faeces. To our knowledge, no metabarcoding study exploring the European pond turtle diet have been conducted, and metabarcoding studies of reptile diets are scarce Kartzinel & Pringle, 2015;Koizumi et al., 2017).

| Collection of faecal samples and DNA extraction
European pond turtles were captured in April 2017 using conical fishing basket traps placed perpendicular to the banks (Cadi, 2003)

| Feeding trial
Additionally, six captive European pond turtles were placed in individual containers in the laboratory, and food was withheld for 10 days to empty the digestive systems of the turtles (Devaux et al., 1996). Then, turtles were given diets consisting of a predetermined set of fishes and invertebrates (Table 1). The night after feeding, turtles were placed in dry containers (similar to those used in the field experiment), and multiple faecal samples were collected from each individual and subsequently homogenized. This procedure yielded six faecal samples with known diets, which were separately analysed as defined above.

| Mock community
A reference DNA database (mock community; MC) was set up using DNAs extracted from known components of the putative diet of the European pond turtle (Appendix S1: S1), which is composed of plants, macro-invertebrates and fishes, according to the literature (Çicek & Ayaz, 2011;Ottonello et al., 2005Ottonello et al., , 2016Ottonello et al., , 2018

| Primer selection and PCR amplification
Previously published primers used for the amplification of the large subunit of the ribulose-1,5-bisphosphate carboxylase gene (rbcL), the maturase K gene (matK), the 28S rRNA gene, the trnL-trnF gene region in plants and a portion of the mitochondrial-encoded cytochrome oxidase subunit I (COI or COX1) gene in animals were evaluated against DNA samples isolated from the mock community (Table 2). Several other primers pairs were tested (data not shown) but not retained, either because they were unable to amplify or lacked amplification specificity. Moreover, as live microorganisms such as bacteria and fungi are present in high numbers in the digestive systems of hosts, primer sets were also evaluated against DNA collection. Primer pairs which amplified bacterial or fungal genes were then discarded. Primer pairs tested but not used were psbA3f (Sang et al., 1997) and trnHf-05 (Tate & Simpson, 2003), matK 390-F and matK 1326-R (Cuénod et al., 2002), JK11 and JK14 (Aceto et al., 1999), COIF2 and COIR2 (Martinsen et al., 2008), C1-J-2182 (Simon et al., 1994) and TL2-N-3020 (Dobler & Müller, 2000), BF2 and BR2, BF2 and BR1, BF1 and BR2 (Elbrecht & Leese, 2017), ModRepCOI-F and COI-R, VertCOI_7194-F and ModRepCOI-R (Reeves et al., 2018), and Chmf4/Chmr4 (Che et al., 2012). This careful screening finally yielded seven primer pairs targeting three different types of plant genes and four primer pairs specific to COI sequences in vertebrates and invertebrates (Table 2). All PCRs were carried out using a 25-μl reaction volume consisting of 5 µl Furthermore, we designed host-specific blocking primers, but the use of the blocking primer was not compatible with our metabarcoding approach (see explanations in Appendix S1: S2).

| First upstream method validation
After metabarcode amplification, sequencing and bioinformatics were first validated upstream using in duplicate a synthetic mock community (sMC). In total, four distinct amplified genes from 14 organisms, which included plants, insects, fungi and bacteria (Table 3), were amplified using PCR, controlled using gel electrophoresis and their respective PCR products were pooled and purified with a Wizard Genomic Purification Kit (Promega). After measuring the DNA concentration with the Q ubit 3.0 Fluorometer, the two pooled amplicons were diluted to 2 ng/μl and then fragmented using a protocol developed to create fragments with a mean size of 290 bp. This protocol utilizes the Covaris S2 focusedultrasonicator and can be applied to shear amplicons of a variety of lengths, ranging from 350 to 1,400 bp ( Figure S3). The quality of sheared products was subsequently evaluated using a Tapestation 2,200 (Agilent).     (Andrews, 2010). An additional quality trimming of raw Illumina reads with trimmomatic 0.32 (Bogler et al., 2014) was evaluated on 10 samples, with stringent settings for base quality filtering, and was found not to be conclusive based on post de novo assembly results. Moreover, metaspades, the used de novo assembly software, comes with an "error correction read" process prior to contig assembly-i.e., "BayesHammer error correction tool," which uses Bayesian subclustering to correct sequencing reads (Nikolenko et al., 2013). Following trimming and demultiplexing, cleaned sequencing reads were downloaded from the Illumina Basespace account. De novo assembly of sequencing data was separately carried out for each sample using the genome assembly software spades 3.11 (Nurk et al., 2017), with the metagenome assembly option ("metaSPAdes") which includes the "error correction read" process prior to contig assembly. The parameters used in the software are described in Figure S4. Contigs smaller than 150 bp were removed, which represented between 0.17% and 10.34% of all contigs across all samples with an average of 2.86%. The resulting contigs files were analysed with the Basic Local Alignment

| Bioinformatic analysis
Search Tool (blast) using blast+ (Camacho et al., 2009), searching the complete NCBI nucleotide (nt) database (command lines using blast + are described in Figure S3). Only the sequences of eukaryotes were conserved. Following the blast search characterized by a strong e-value cut off (E-value 1e−20) (Truelove et al., 2019), the five most significant matches (max_target_seq5) to the reference database for each of the query sequences were recorded. If only a single taxon was present in the top five and above 97.6% identity (see below for the level applied), the query was assigned directly to this taxon. If more than one reference taxon was present in the top five and above 97.6% identity, the query was assigned to the lowest taxonomic level that was shared by all taxa. In these specific cases (i.e., multiple taxa shared for a query sequence), the species identity was if possible confirmed without any ambiguity, thanks to the knowledge of biologists or botanists specialized in these studied sites. Finally, query sequences for which the best blast hit had less than 97.6% identity to any sequence were simply not considered. This threshold was determined following analysis of the sequences from both the mock communities and the captive feeding trials. The complete taxonomy of each species identity assignment per contig was completed using its respective TaxID and  Figure S4).

| Upstream method validation
After sequencing, 14 amplicons (550-1400 bp in length) of the two MCs were entirely de novo assembled and all micro-organism identities were successfully retrieved using NCBI blast. This confirmed that the proposed method could be performed on several different samples with DNA sequences (barcodes/amplicons) of different sizes, up to 1,400 bp and more.

| Mock community
Two different mock communities with different amplification procedures (MC1 and MC2) were used to determine whether the differences in sequencing data were observed when DNAs were pooled before amplification (MC1) versus. amplified individually (MC2) (

| Captive feeding trial
Results of the analysis of faecal samples obtained by captive feeding trials demonstrated that every prey given to the European pond turtle was amplified and correctly assigned down to the species level. This produced a minimum identity threshold of 97.8% for the exact determination of prey species from DNA extracted from faeces and allowed for the allocation of identity to the species level. The average length of the reference alignment was 422 bp (Table 1).

| Blocking primers
Without the use of any host-specific blocking primers (that would have prevented host COI gene amplification), Emys orbicularis was formerly identified in only 12.5% of the samples, namely four out of 32. Across these four samples, we found that between 0.74% and 3.35% of raw sequencing paired-end reads per sample aligned to the Emys COI gene contig, with an average of 1.78%. Considering the 32 samples totally, the average drops to 0.22% (Appendix S1: S5).

| Qualitative analysis: Diet
Metabarcoding analyses showed that all samples contained plant Appendix S1: S6). No contamination was detected in the positive and negative controls.

| Quantitative analysis: Read abundance
Finally, we determined the read abundance of all prey and plant species identified per sample, including the two MCs and feeding trial samples, in order to conclude whether we could use this information as a quantification indicator of metabarcoding diet (Appendix S1: S6 Note: Contig sequences (recovered amplicons) were obtained by de novo assembly with metaspades (Nurk et al., 2017) and queried using the Basic Local Alignment Search Tool (blast) and blast+ (Camacho et al., 2009), against the complete NCBI nucleotide (nt) database.

| D ISCUSS I ON
It has typically been assumed that prey DNA fragments in faecal samples were short and degraded as a result of the digestion by the host. This causes a major difficulty for the taxonomic identification of prey taxa using methods based on the analysis of DNA sequences within faecal samples (Deagle et al., 2006). For these reasons, previous protocols have used barcodes that target very short amplicons (<100 bp) (De Barba et al., 2014). However, the rate of DNA degradation in faeces may vary according to the identity of the ingested species. Indeed, in the present study, we found that plant parts (including seeds), bones and insect parts (such as legs and elytra) are only partially digested in the faecal samples. These observations suggest that prey DNA may not always be highly degraded and therefore enable the amplification of longer amplicons, thereby facilitating their identification down to the species level.
Furthermore, the feeding trial experiment also demonstrated that the proposed long metabarcoding method can detect the DNA sequences of vertebrates and macro-invertebrates fed to captive European pond turtles. In this case, prey could not be identified using direct observation or microscopy, because they were entirely digested. The method has been shown to be appropriate for the analysis of the diet of wild European pond turtle, and probably other species as well. DNA in faecal samples was not overly degraded, produced reliable results and allowed for the recovery of long amplicon sequences after de novo assembly. However, to accurately elucidate the real diet of the European pond turtle, the feeding trial should have contained plants (this information was unknown before this study). Use of mock communities, combined with samples obtained through captive feeding trials, proved to be essential to produce positive controls and validation data for parametrizing (threshold set up) bioinformatics pipelines. Finally, the various tests performed on samples collected throughout feeding trials and MCs enabled us to set a relatively high detection threshold at almost 98% identity.
Without using any host-specific blocking primers, we demonstrated that on average only 0.22% of raw sequencing paired-end reads per sample aligned to Emys contigs. This is relatively weak and questions the usefulness of blocking primers related to the host DNA in metabarcoding diet studies from faeces. Indeed, it seemed, at least in the specific case of Emys orbicularis, that cell loss following cell renewal throughout the intestinal lumen of the host generates only a little or no DNA of amplifiable quality, as the COI gene of the host DNA was only identified in four out of 32 samples.
Regarding the qualitative analysis of faecal samples, the recovered amplicon lengths following de novo assembly (through contigs) varied between 168 and 1,379 bp (average of 546 bp); taxonomic resolution to the species level was reached for most sequences (82.4%).
In previous diet analyses, taxonomic identification to the species level did not reach such high levels. For example, species were as- Obtaining longer amplicons provides a well-known advantage for our long metabarcoding approach because it enhances the precision of taxonomic identification (Heeger et al., 2018;Jamy et al., 2020; F I G U R E 2 Mean length (bp; with SD) of the de novo assembly amplicons (contigs) assembled following long metabarcoding analysis of the 32 faecal samples of European pond turtles (Emys orbicularis). Total mean was calculated from a total of 233 contigs (each contig per sample is attributed to a unique OTU), whereas the other means were calculated per phylum Mean length per contigs after assembly per respective phylum or total number of contigs Liu et al., 2020;Piper et al., 2019;Porter & Golding, 2011).
Furthermore, the approach developed here not only allowed us to amplify both barcodes of short to medium size, in cases of degraded DNA, but also and especially of long size (>500-650 bp and more), which is believed to increase the taxonomic affiliation at the species level. Indeed, shorter DNA fragments (e.g., 100 bp or less) are more likely to be sequenced, while longer DNA fragments provide a better taxonomic identification and resolution (Liu et al., 2020). Analyses made on degraded DNA samples demonstrated that few very informative barcodes such as COI, of shorter size between 135 bp (Hajibabaei et al., 2006) and 250 bp (Meusnier et al., 2008), can reliably identify animal species, if they target an appropriate placement within the larger barcode region (Elbrecht et al., 2019). Thus, longer amplicons generally increase the level of taxonomic assignment, and especially elevate the veracity and power of the results (absence of false positives; Piper et al., 2019). This method of long metabarcoding by short-read sequencing (Illumina Platform) and de novo assembly has other advantages compared to the long-read Pacific Bioscience sequencing platform used by Heeger et al. (2018) and Jamy et al. (2020). The method is more affordable and provides a higher level of sequencing depth. Moreover, our approach combining the use of different specific and universal primer pairs targeting both the same and unique gene (or gene portion) and several different genes as well, coupled with different targeted amplicon sizes (both long and short amplicons in case of highly degraded DNA), allowed us to amplify a large spectrum of species richness (da Silva et al., 2019) and to confirm some of the identified plant OTUs with redundancy of identification using many genes (matK, rbcL and 28S genes). Usually, species-specific primers are used to amplify DNAs within a particular diet from faecal samples or gut contents (Leal et al., 2014;Pumarino et al., 2011;Wallinger et al., 2012). However, this approach is only useful if prior information regarding the diet of the studied animal is available and if the range of the diet is not too large (Moorhouse-Gann et al., 2018). Thus, the use of multiple specific and universal primers in this study is an optimal method to determine complex diets, such as the diets of omnivorous animals such as the European pond turtle.
Nevertheless, the set of multiple primers that we developed cannot be considered as the optimal one for any diet study. In future research, depending on the diet or eDNA sample studied, additional primer pairs could be needed, to reach a higher level of discrimina- have not yet been added to this database (Kennedy et al., 2020). To identify the aforementioned plants to a higher taxonomic level, we recommend the elaboration of a local DNA sequence database, containing the plant species representing the known botanical diversity of the studied areas.
Unfortunately, one of the limitations of metabarcoding approaches is related to the status of the prey; indeed, it is not possible to determine whether adults, juveniles and/or eggs were consumed or if individuals were dead or alive. Moreover, plant fragments present in our samples could also be derived from prey. Indeed, some prey species of the European pond turtle are known to consume plants as a typical part of their own diets. The identification of certain plants may also be due to pollen contamination of the turtle's food. However, DNA present in faecal samples is usually degraded (Deagle et al., 2006), meaning that food items eaten by both prey and, subsequently, the predator, were degraded twice, making it unlikely that DNA fragments from this source could be detected.
Furthermore, plant matter and seeds were present in large quantities in faecal samples, which confirmed the importance of combining direct observation with metabarcoding to validate the sequencing results.
Regarding the quantitative analysis, the amount of sequencing reads is a debatable indicator in the quantification of metabarcoding diets (Deagle et al., 2019). Indeed, the correlation between composition of the sample and sequence reads varies from none to strong. It remains to be shown whether biomass may be linked to read abundance as previously shown in copepods (Clarke et al., 2017;Hirai et al., 2015), in nematode communities (Schenk et al., 2019) and in below-ground plants (Matesanz et al., 2019), while others failed to assess this link such as in zooplankton assemblages (Harvey et al., 2017 Similarly, the efficiency of the digestion could have an impact on the detected DNA. In the present study, that is a diet analysis based on the extraction of DNAs from faeces, and verified by visual analysis (see also Ottonello et al., 2018), it was established that the DNA of some prey are excreted with higher integrity than others (e.g., intact seeds, bones, elytra of some beetles). Consequently, a prey can represent 95% of the food intake but its DNA, after extraction, would only represent 1% of the total faecal DNA, compared to a less digestible vegetable food (e.g., intact seeds) representing 5% of the food intake, which would represent 99% of the faecal DNA sample after extraction.
Finally, an additional bias preventing any relative quantitative evaluation when different barcodes are used is the number of copies of a barcoded gene within a targeted organism genome, and whether it is of nuclear, mitochondrial or chloroplast origin. When studying an omnivorous diet by metabarcoding analysis on stool samples, and with our current knowledge, it would be inappropriate to estimate a putative quantification of the prey ingested. Only qualitative analysis is reliable in this particular case.
The large species richness identified using the long metabarcoding approach proposed here is congruent with other molecular studies, which yielded high resolutions and an even greater richness regarding prey consumption, compared to histological analyses of the same samples (Soininen et al., 2015;Ando et al., 2013). To our knowledge, this is the first time that the metaspades software (Nurk et al., 2017) has been used for a long metabarcoding analysis, especially within a dietary study. We showed that this metagenome assembler is able to retrieve amplicons with a high confidence level and consequently provides an accurate taxonomic assignment.
Finally, this new long metabarcoding method with a short-read sequencing approach, combining the use multiple primers pairs (da Silva et al., 2019) and de novo assembly, could be used as a universal, standardized method for studying complex diets, as well as in other complex eDNA analyses. Its high level of precision allows for improvements in studies of biodiversity assessments and trophic interactions, which would enhance our understanding of community ecology and ecosystem functioning.

ACK N OWLED G EM ENTS
We thank the Office cantonal de l'agriculture et de la nature (Gottlieb Dandliker) of the canton of Geneva for permission to catch European pond turtles in the natural reserve of Moulin de Vert, Jussy and Laconnex (Geneva, Switzerland). We are also grateful to the anonymous referees and the editor for their constructive comments.

AUTH O R CO NTR I B UTI O N S
The first (C.D.) and second (J.C.) author contributed equally to this