Advancing the integration of multi‐marker metabarcoding data in dietary analysis of trophic generalists

Abstract The application of DNA metabarcoding to dietary analysis of trophic generalists requires using multiple markers in order to overcome problems of primer specificity and bias. However, limited attention has been given to the integration of information from multiple markers, particularly when they partly overlap in the taxa amplified, and vary in taxonomic resolution and biases. Here, we test the use of a mix of universal and specific markers, provide criteria to integrate multi‐marker metabarcoding data and a python script to implement such criteria and produce a single list of taxa ingested per sample. We then compare the results of dietary analysis based on morphological methods, single markers, and the proposed combination of multiple markers. The study was based on the analysis of 115 faeces from a small passerine, the Black Wheatears (Oenanthe leucura). Morphological analysis detected far fewer plant taxa (12) than either a universal 18S marker (57) or the plant trnL marker (124). This may partly reflect the detection of secondary ingestion by molecular methods. Morphological identification also detected far fewer taxa (23) than when using 18S (91) or the arthropod markers IN16STK (244) and ZBJ (231), though each method missed or underestimated some prey items. Integration of multi‐marker data provided far more detailed dietary information than any single marker and estimated higher frequencies of occurrence of all taxa. Overall, our results show the value of integrating data from multiple, taxonomically overlapping markers in an example dietary data set.

One problem that has attracted much attention is the selection of molecular markers, because primer specificity and biases can greatly affect the results of dietary studies (Alberdi et al., 2019;Taberlet et al., 2018). In general, studies build on previous knowledge of the diet of one or more species of interest, or of ecologically similar species, to select a primer that amplifies DNA from the main food items expected to be consumed (Alberdi et al., 2019;Coghlan et al., 2013).
For instance, the studies of Soininen et al. (2009) and Valentini et al. (2009) used primers amplifying a fragment of the chloroplast trnL intron to analyse the diet of a number of herbivore species. Likewise, many studies on insectivore diets often used the ZBJ primer amplifying a fragment of the COI mitochondrial gene (Razgour et al., 2011;Zeale, Butlin, Barker, Lees, & Jones, 2011). This single marker approach has been widely used in many studies (Gordon et al., 2019;McClenaghan, Nol, & Kerr, 2019;Moran, Prosser, & Moran, 2019), but it may produce significant biases due to differential primer affinity for different taxa. For instance, although ZBJ is often used as a  Trevelline, Latta, Marshall, Nuttle, & Porter, 2016;Trevelline et al., 2018), it may have strong positive or negative bias depending on the taxa (Clarke, Soubrier, Weyrich, & Cooper, 2014;Piñol, Mir, Gomez-Polo, & Agustí, 2015). The challenge is even worse in the case of omnivorous diets, because the variety of taxonomic clades consumed cannot be analysed using a single marker (De Barba et al., 2014;Taberlet et al., 2018). Therefore, it is increasingly recognised that molecular dietary studies should be based on a mix of markers that adequately amplify the full complement of prey ingested, which requires integration of data from several markers for each sample (Alberdi et al., 2019;Alberdi, Aizpurua, Gilbert, & Bohmann, 2017;Deagle, Kirkwood, & Jarman, 2009;Taberlet et al., 2018).
In multi-marker dietary studies, the most common approach is to divide the expected diet in various components (e.g., vascular plants, cephalopods, arthropods and vertebrates), and then use a primer designed to target each component (Coghlan et al., 2013;Groom, White, Mitchell, Roberts, & Mawson, 2017;Robeson et al., 2018;Sullins et al., 2018). The integration of this type of multi-marker data is relatively straightforward, as information from each dietary component is retrieved from a single marker, and so a list of taxa detected in each sample can be inferred simply by adding taxa lists across markers. However, in some cases it may be necessary to use a mix of primers overlapping in the range of taxa amplified, making data integration more difficult. For instance, in dietary analysis of trophic generalists it may be useful to combine a universal marker with more specific markers, to account for the consumption of unexpected taxa that are not adequately detected by any of the specific primers used (De Barba et al., 2014;Deagle et al., 2009;Taberlet et al., 2018). Also, in dietary analysis involving highly diverse prey groups such as arthropods it may be necessary to avoid biases by combining primers that vary in affinity for different orders or even families, but that may overlap considerably in the range of taxa amplified (Aizpurua et al., 2018;De Barba et al., 2014;Kaunisto et al., 2017). Integration of such data cannot be made simply by adding the taxa lists retrieved across markers, because the same individual prey may be detected at different taxonomic levels by different markers, due to differences in taxonomic resolution or in the availability of reference databases (Elbrecht et al., 2016). To combine such data, it is necessary to identify duplications across markers, and to retain in each case the most taxonomically resolved taxa. Although these approaches based on taxonomically overlapping markers may advance dietary studies of trophic generalist species by maximising the diversity of species detected, they remain underutilised, there are no well-established criteria for integrating data across markers, integrating data from multiple, taxonomically overlapping markers in an example dietary data set.

K E Y W O R D S
bird, diet, metabarcoding, morphological identification, overlapping markers, secondary predation and there is no simple computation procedure to implement such criteria.
Here, we test the use of multiple overlapping markers, the criteria for integrating data from them in dietary analysis of a trophic generalist bird, and provide a python script to implement our data integration scheme. Prey remains retrieved from Black Wheatear (Oenanthe leucura) faeces were identified morphologically and using DNA metabarcoding with four molecular markers. Molecular data was integrated by the means of a python script to provide a single list of taxa detected per sample, controlling for duplications by collapsing less resolved taxa detected by one marker (e.g., order and family level) with higher resolved taxa detected using a different marker (e.g., genus and species). We then evaluated differences between morphological, single marker and multi-marker approaches in the estimates of dietary descriptors, in terms of (a) taxonomic resolution, (b) diet diversity, (c) the identity of taxa recorded, and (d) the composition of diet considering the taxa recorded and their representation in the samples.

| Study species and sample collection
The Black Wheatear is a small (~35 g) black and white passerine that occurs in cliffs and rocky slopes of arid areas in western North Africa and Iberia. Although the species is not globally threatened, European populations are steadily declining, and the species is now regionally vulnerable (BirdLife International, 2017). Black Wheatears have a very diverse diet, feeding on freshly fruits, insects, arachnids, centipedes and sometimes even lizards (Hodar, 1995;Prodon, 1985;Richardson, 1965). The wheatear is a good study system to test our methodology due to its large feeding spectrum, including both plants and animals, and thus allowing us to test many food items simultaneously, and serving as model for other generalist terrestrial vertebrates. We collected 115 faecal samples from 143 Black Wheatears captured with spring-traps baited with Mealworms (Tenebrio molitor) throughout their known distribution in the Douro Valley in Portugal ( Figure S1), during spring and summer of 2014-2016. All birds were ringed to allow for individual recognition, which indicated that only four samples resulted from re-trapped individuals, two from birds collected 3 months apart, and two from birds collected in different years. Faecal samples were collected from clean cotton bags (soaked in 10% bleach for 1 hr and then washed between each use) or directly from stones used to disguise the bottom of the spring-traps (McInnes et al., 2017;Oehm, Juen, Nagiller, Neuhauser, & Traugott, 2011). Samples were stored in 98% ethanol and refrigerated at 4ºC until processed in the laboratory.

| Molecular analysis
DNA was extracted from each faecal sample using the Stool DNA Isolation Kit (Norgen Biotek Corporation) following the manufacturer's protocol. Samples were extracted in batches of 23 plus a negative control in which no faecal material was added. After DNA extraction the remaining faecal fragments used for DNA extraction were preserved for morphological identification. This was possible because morphological identification was based on hard faecal fragments such as chitinous body parts of invertebrates, vertebrate bones, and plant seeds and epidermis, which were not destroyed by the extraction method, as assessed through visual comparison of extracted and nonextracted samples.
Four different marker sets were used to analyse the diet. A universal eukaryote 18S marker (Jarman et al., 2013); two arthropod markers: a modified version of IN16STK (Kartzinel & Pringle, 2015) for the 16S region in which some degenerate bases were added to increase the affinity of the primers (IN16STK-1F_ mod: 5′-TRAACTCARATCAYGTAA-3′, IN16STK-1R_mod: 5′-TTAGGGATAACAGCRTWA-3′) and ZBJ  for COI region; and finally the gh plant specific marker for the trnL intron (Taberlet et al., 2007). All primers were modified to contain Illumina adaptors at the 5′ end of the sequence (forward primers: Each marker was amplified in an independent PCR reaction, without any multiplexing. A very low PCR amplification temperature was used for all markers in order to reduce as much as possible the level of primer bias, this way allowing primers to anneal with less matching templates. This has been tested for some COI markers with positive results (Clarke et al., 2014). We also did not do any PCR replicates because recent studies have shown that, for faecal samples, variation in prey species composition among PCR replicates is much smaller than variation among samples . Amplification success was checked by visually inspecting 2 μl of each PCR product on a 2% gel stained agarose (GelRed Biotium). PCR products were subjected to a second round of PCR with P5 and P7 indexes, after an initial

| Bioinformatic analysis
Bioinformatic processing of sequencing reads was done using OBITools (Boyer et al., 2016), with a separate analysis for each molecular marker. First, paired-end reads were aligned using the command 'illuminapairedend' and discarded if overlapping quality was <40 (Taberlet et al., 2018). Second, reads were assigned to samples and primer sequences were removed using 'ngsfilter', allowing a total of four mismatches to the expected primer sequence. Finally, reads were collapsed into exact sequence variants (ESVs) and sin-

| Morphological identification
Plant and animal remains from faecal samples were analysed under a dissecting microscope, except plant epitheliums that were seen under a compound microscope, after DNA extraction. Plant remains like seeds and epidermis were identified by comparison with plants collected at capture sites. Animal parts were identified to the order or family level whenever possible, using specialized bibliography (Barrientos, 2004). In each sample, we also identified the total number of animal morphospecies of each order, thereby producing an approximation to the total number of taxa per sample. We did not, however, compare morphospecies across all samples in order to estimate the total diversity, because they were rarely comparable due to differences in the fragments recorded in each sample. This should not affect the analysis as all comparisons with molecular data were done on a sample by sample basis.

| Data analysis
Statistical analysis was conducted to detect significant variation in To evaluate whether there were differences between methods in diet diversity estimates, for both plant and animal components, we compared among methods (a) the numbers of taxa per sample, and (b) the number of orders per sample, using generalized linear mixed models (GLMM) with a Poisson distribution and a log link, specifying the sample as random effect. GLMMs were performed using the packages lme4 (Bates, Mächler, Bolker, & Walker, 2015) and lmertest (Kuznetsova, Brockhoff, & Christensen, 2017). We then used multi-comparisons with Bonferroni corrections to identify in which pairs the observed differences occurred, using the package multcomp (Hothorn, Bretz, & Westfall, 2008

| Plant component
Morphological examination detected plants in 73 out of 115 faecal samples, yielding 12 taxa from five orders, of which five taxa were identified to genus or species levels ( Figure 1). The most frequent taxon was Solanum nigrum, order Solanales ( Figure 2). Metabarcoding detected plants in more faecal samples and yielded more taxa than morphology using either 18S (100 samples; 57 taxa from 16 orders; 2,479 ± 220 reads/sample) or trnL (110 samples; 124 taxa from 27 orders; 7,462 ± 387 reads/sample) (Figure 2). Besides detecting almost twice as many taxa, the taxonomic resolution was much higher for trnL (54% of taxa identified to genus or species) than 18S (19% to genus or species; Figure 1). The taxa recorded most frequently using either 18S or trnL was an unidentified plant of the family Vitaceae, most probably Vitis vinifera (Figure 2). There was significant variation among methods in the number of orders (χ 2 = 200.77, df = 2, p < .001) and taxa (χ 2 = 289.58, df = 2, p < .001) detected, with much lower values for morphology than metabarcoding with either 18S or trnL (Table 1; Table S1). There were also significant differences in plant composition between methods (Wald value = 11.21, p < .001),

| Animal component
Morphological examination detected animal prey in 112 samples, yielding 23 taxa from eight orders, all of which were identified at best to family level (Figures 1 and 2). The most frequent order was Hymenoptera (81%), mainly due to the family Formicidae (70%; Figure 2). Metabarcoding using 18S detected animals in 94 samples (3,008 ± 299 reads/sample), yielding 91 taxa from 21 orders, of which 10% were assigned to a genus or a species (Figures 1 and   2). The most frequent order was also Hymenoptera (45%). The two arthropod specific markers, IN16STK and ZBJ detected animals in 113 (6,765 ± 342 reads/sample) and 108 (2,829 ± 202 reads/sample) samples, yielding 244 and 231 taxa from 21 and 18 orders, respectively. From the taxa identified, 31% and 42%, IN16STK and ZBJ respectively, were identified to genus or species (Figure 1). The most frequent order detected by IN16STK was Hymenoptera (77%), mainly due to Formicidae (71%) as in the morphological analysis, while ZBJ detected most frequently Lepidoptera (59%) and only detected Hymenoptera in 15% of the samples, failing to detect the family Formicidae (Figure 2). The mean number of taxa per sample varied significantly among the morphological identification and the markers (χ 2 = 148.78, df = 3, p < .001), with all differing significantly from each other, except morphology from ZBJ (Table 1; Table S1). Likewise, there was significant variation in the mean number of orders per sample across methods (χ 2 = 54.78, df = 3, p < .001), with all differing significantly from each other, except morphology from 18S, and IN16STK from ZBJ (Table 1; Table S1). Finally, there were significant differences in animal composition among morphological and molecular methods (Wald value = 21.29, p < .001), with univariate tests indicating that 10 orders, particularly Hymenoptera, Lepidoptera and Orthoptera, significantly contributed to such differences (Figure 1; Table S3).

| Multi-marker approach
When integrating information from the four molecular markers used, the initial 2,064 occurrences (828 plant and 1,236 animal) were reduced to 1,492 (591 plant and 901 animal), indicating that only approximately one quarter of the information provided by the individual markers was redundant. The multi-marker approach detected a total of 27 plant and 28 animal orders, in 112 and 115 samples, respectively. The most detected plant order was Vitales, and the most detected animal order was Hymenoptera ( Figure 2). As expected, individual markers differed significantly from each other and from the multi-marker approach in terms of taxa detected for both plants (χ 2 = 142.43, df = 2, p < .001) and animals (χ 2 = 444.93, df = 3, p < .001). The multi-marker approach provided more occurrences with high taxonomic resolution, i.e., genus or species, (Figure 1) and also detected a higher number of taxa and orders per sample than any individual marker, except for trnL that contributed to most of the plant taxa present in the multi-marker approach (Table 1; Table   S1). The overlap of the multi-marker data for the plant component was very high in relation to trnL (0.959) and very low in relation TA B L E 1 Average ± standard error of the number of orders and taxa detected per sample and Saxifragales (Table S2). For animals, these differences were caused by 10 orders, mainly Diptera, Lepidoptera and Hymenoptera (Table S3).

| D ISCUSS I ON
Our study highlights the challenges involved in the description of the diet of trophic generalist animals, showing that results greatly vary depending on the method used. As expected, there were major differences in estimates of diet diversity, prey taxonomic identity, and composition between morphological and molecular methods, but there were also large variations in the results produced using different molecular markers. In particular, we found that widely used markers consistently underrepresented or missed some heavily consumed taxa, including taxa that were easily detected using the morphological analysis. The multi-marker approach appeared to largely overcome the problems of underestimate biodiversity that single marker dietary or nonmolecular analysis produce, though it shares problems such as the detection of secondary ingestion. Overall, we suggest that using a mix of universal eukaryote and more taxonspecific markers can advance the description of trophic generalist diets and underline the importance of adequately integrating data to overcome problems associated with different taxonomic resolution across markers.

| Biases and pitfalls in morphological and molecular dietary data
Most plant material recovered visually from Black Wheatear faeces were seeds of berry-producing plants, mainly S. nigrum and, to a much lesser extent, Periplaneta americana. These results suggest that wheatears regularly consumed berries in our study area, more so than suggested by previous studies (Hodar, 1995;Prodon, 1985;Richardson, 1965). Surprisingly, metabarcoding showed an even greater consumption of plants, with 18S and particularly trnL detecting a very large diversity of taxa, most of which produce dry seeds rather than berries. Reasons for this are unknown, but it may be a consequence of several nonexclusive factors. One possibility is that metabarcoding detects direct consumption of items for periods longer than the defaecation time (Deagle, Chiaradia, McInnes, & Jarman, 2010;Oehm et al., 2011), especially if short amplicons are used (Kamenova et al., 2018). This can explain for instance, why trnL detected the berry-producing Pistacia terebinthus in 11 samples, while the seeds of this plant were detected in a single faecal sample.
Another hypothesis is that the method is detecting plants that left no hard parts, and thus could not be detected visually. Lack of seeds can occur when wheatears only eat the flesh of berries, which might explain the high prevalence of V. vinifera detected through metabarcoding but not visually. However, this is questionable because grapes at the time of sampling were unripe and thus unlikely to be eaten by the birds. The typical insectivore morphology and behaviour of Black Wheatears (Richardson, 1965) also question the hypothesis of direct consumption to explain the detection of DNA from species with small and dry seeds such as Asterales, Lamiales and Poales, or with large acorns such as oaks Quercus spp. It is also highly implausible that wheatears are feeding on other parts of these plants such as buds, flowers, or pollen. A more likely explanation may thus be indirect consumption through the stomach contents of animal prey, which may be recovered by molecular markers amplifying small DNA fragments (<200 bp), such as the 18S and trnL markers used in our study (Kamenova et al., 2018;Sheppard et al., 2005). Detection of secondary consumption is well documented even through traditional methods (Johnson, Ross, & Smith, 1997), but it is usually considered as having little importance (Barrett et al., 2007). In metabarcoding diet studies the effect of secondary consumption is not often explored in detail, and depending on the studied species it is considered to have low impact (Gerwing et al., 2016) or considerable influence on the range of the species detected in the diet (Bowser, Diamond, & Addison, 2013 The animal prey detected visually in Black Wheatear faeces was in line with previous studies (Hodar, 1995;Prodon, 1985;Richardson, 1965), showing a prevalence for Hymenoptera, mainly Formicidae, and Coleoptera, Hemiptera, Araneae and Diptera. As expected, these groups were largely recovered through molecular analysis, though metabarcoding yielded a much larger diversity of prey and higher taxonomic resolution, particularly in the case of the COI marker ZBJ (Hope et al., 2014;Krüger, Clare, Greif, et al., 2014;Krüger, Clare, Symondson, Clare, Symondson, Keišs, & Pētersons, 2014;Razgour et al., 2011). Furthermore, some taxa were far more often detected through metabarcoding than by visual examination, including orders that seemed to be important in the diet such as Lepidoptera and Orthoptera. This may be a consequence of the ingestion of soft-bodied animals leaving few or no hard parts (Nielsen et al., 2017;Sutherland, Newton, & Green, 2004), as it was probably the case of caterpillars (Lepidoptera Trombidiformes, and Sarcoptiformes, detected through metabarcoding but not through visual examination, were probably not directly preyed by wheatears. These may have been ingested indirectly through the stomach contents of arthropod predators (Sheppard et al., 2005), or as parasites occurring in the body of arthropod prey or the birds themselves (Di Prisco et al., 2016;Gerwing et al., 2016;Martinho, Tenreiro, Ferreira, Faísca, & da Silva, 2017 Some animal preys were easily detected visually but not by some molecular markers. Formicidae, in particular, were often detected in faeces, while they were missed altogether by ZBJ. This was probably a consequence of the well-known positive bias of ZBJ towards Diptera and Lepidoptera, at the expenses of other arthropod orders (Clarke et al., 2014). Although the failure to detect Formicidae was solved when using 18S and IN16STK, these tended to provide a lower taxonomic resolution of prey items, particularly in the case of Lepidoptera for which there was a very comprehensive reference database of COI barcodes. Therefore, only the combination of the three markers provided a detailed description of the animal component of Black Wheatear's diet.

| Implications to describing diets with multimarker approaches
Overall, the combination of visual and molecular approaches used in this study highlighted two important sources of potential errors in the analysis of trophic generalist diets and provided some clues on how to address these problems. First, our study suggests that morphological examination and/or previous ecological information may be important in order to detect unexpected biases and pitfalls of molecular methods, providing a basis to interpret and eventually correct results. This is highlighted by the detection of a range of animal and plant taxa that probably resulted from secondary ingestion or contamination, which may be a widespread problem in molecular analysis of trophic generalists, particularly when using small amplicons such as gh for plant trnL (Groom et al., 2017;Liu et al., 2018;Sullins et al., 2018) or generalist molecular markers (Bowser et al., 2013). This problem might be important, for instance, in conservation studies aiming to assess key trophic resources for a given species (Groom et al., 2017;Liu et al., 2018), in behavioural ecology research (Aizpurua et al., 2018;Quéméré et al., 2013), and even when reconstructing trophic networks from molecular data (Evans, Kitson, Lunt, Straw, & Pocock, 2016). To address this problem, visual analysis of a subset of samples would be desirable (Haarsma, Siepel, & Gravendeel, 2016), providing information on the range of taxa that are eaten, which could then be compared against the results of metabarcoding. As this may often be impractical, researchers should at least check their metabarcoding results against the literature on conventional dietary studies of the target or closely related species (Gerwing et al., 2016), as well as ancillary information on morphology, behaviour and ecology, which may provide a basis to assess the plausibility of direct ingestion of unexpected taxa detected in samples. Another potential way to identify secondary consumption could be to look at the proportion of reads of each taxon and try to understand if it always occurs at a low proportion or not (Deagle et al., 2019). By filtering all taxa with <1% of the total number of reads of the corresponding PCR, one could expect that secondary consumption would disappear. However, in our study we observed that this is not always the case, with high number of reads obtained for some taxa that probably resulted from secondary ingestion.
Nevertheless, detection resulting from secondary ingestion may not always be a problem, e.g., if the study aim is to know the entire intake of a given species, irrespective of whether it was ingested directly or indirectly .
Second, our study confirmed the value of using multiple markers, but suggests that previous studies based on a mix of nonoverlapping specific markers each targeting a particular dietary component (e.g., Coghlan et al., 2013;Groom et al., 2017;Robeson et al., 2018;Sullins et al., 2018) may not be sufficient to overcome marker biases and thus provide a reliable diet composition. This is because markers considered universal for a given taxonomic clade may still have considerable variations in affinity across taxa within that clade, and thus may not amplify some important items in the diet (Aizpurua et al., 2018;Alberdi et al., 2019Alberdi et al., , 2017Bowser et al., 2013;Clarke et al., 2014;Kaunisto et al., 2017;Piñol et al., 2015). The problem was clearly illustrated by the high level of bias detected for ZBJ, which is sometimes regarded as universal for arthropods and is still the only marker used in many studies (Gordon et al., 2019;McClenaghan et al., 2019;Moran et al., 2019). In our study ZBJ completely missed Formicidae and other Hymenoptera, which was a key component of the diet identified through other methods, and probably overestimated the dietary importance of Lepidoptera and Diptera. The 16S marker used appeared less biased and thus may provide an alternative to ZBJ (Clarke et al., 2014;Deagle, Jarman, Coissac, Pompanon, & Taberlet, 2014), but it still underestimated some important dietary components, which may be partly due to the less comprehensive reference databases available when compared to COI (Elbrecht et al., 2016).
To overcome the problems of marker bias and taxonomic resolution, a mix of taxonomic data from different markers needs to be integrated, by eliminating the duplicates resulting from the same individual prey being detected at different taxonomic resolutions.
This should be a relatively easy task using the criteria and the python script provided in our study. Notwithstanding, newer and better molecular markers have been developed and are now available, and these may reduce the need for a multi-marker approach, e.g., UniPlant for plants (Moorhouse-Gann et al., 2018) and fwh for insects (Vamos, Elbrecht, & Leese, 2017). Unfortunately, there are no perfect markers, thus multiple primer sets should most of the time detect more taxa, mainly in highly diverse groups such as invertebrates (Corse et al., 2019). Even though untested in this study, our script should also prove useful in any broadscale biodiversity assessment, using either eDNA or bulk samples, allowing the integration of taxa detected using any combination of molecular markers, as well as of taxa detected through other methods like morphological identification.

ACK N OWLED G EM ENTS
We would like to thank to Joana Pinto and Joana Veríssimo for assistance in the laboratory, and to Professor William Symondson and three anonymous reviewers for the valuable comments that greatly improved the manuscript. The study was funded by Fundação para