Relative species abundance estimation in artificial mixtures of insects using mito‐metagenomics and a correction factor for the mitochondrial DNA copy number

Mito‐metagenomics (MMG) is becoming an alternative to amplicon metabarcoding for the assessment of biodiversity in complex biological samples using high‐throughput sequencing. Whereas MMG overcomes the biases introduced by the PCR step in the generation of amplicons, it is not yet a technique free of shortcomings. First, as the reads are obtained from shotgun sequencing, a very low proportion of reads map into the mitogenomes, so a high sequencing effort is needed. Second, as the number of mitogenomes per cell can vary among species, the relative species abundance (RSA) in a mixture could be wrongly estimated. Here, we challenge the MMG method to estimate the RSA using artificial libraries of 17 insect species whose complete genomes are available on public repositories. With fresh specimens of these species, we created single‐species libraries to calibrate the bioinformatic pipeline and mixed‐species libraries to estimate the RSA. Our results showed that the MMG approach confidently recovers the species list of the mixtures, even when they contain congeneric species. The method was also able to estimate the abundance of a species across different samples (within‐species estimation) but failed to estimate the RSA within a single sample (across‐species estimation) unless a correction factor accounting for the variable number of mitogenomes per cell was used. To estimate this correction factor, we used the proportion of reads mapping into mitogenomes in the single‐species libraries and the lengths of the whole genomes and mitogenomes.

side, the tiny size of the mitogenome compared to the nuclear genome produces a high number of non-informative reads (Tang et al., 2014), so a great sequencing depth is needed.
It is generally assumed that MMG quantifies satisfactorily the relative species abundance (hereafter, RSA) of complex mixtures (Gómez-Rodríguez et al., 2015;Zhou et al., 2013). However, as far as we know, there are only five studies that tested the MMG method (plus one using chloroplast metagenomics in plants) using shotgun samples of known composition (Bista et al., 2018;Gómez-Rodríguez et al., 2015;Gueuning et al., 2019;Ji et al., 2020;Lang et al., 2019;Tang et al., 2015). In general, the relationship between the expected and estimated RSA is statistically significant, but with high variability in the goodness of fit.
How the RSA of complex mixtures is presented in the literature requires some clarification. First, the RSA can be expressed as a proportion of the species biomass (e.g., Gueuning et al., 2019) or individual counts (e.g., Lang et al., 2019); alternatively, the RSA can refer to the proportion of the DNA amount of each species in the mixture.
Whilst the former approach is more meaningful for most ecological studies, we adopt the latter approach here because it allows the independent evaluation of different sources of bias on the RSA estimation. Second, some studies provide the relative abundance of one species in different samples (e.g., Bista et al., 2018), whereas others report the abundance of several species in a single sample (e.g., Saitoh et al., 2016). Ji et al. (2020) named within-species estimation the former (is species i more abundant in sample s than in sample r?) and across-species the latter (is species i more abundant than species j in sample s?). This distinction is important because there are species-specific characteristics that influence the across-species estimation but not the within-species estimation. The most important of these characteristics is the variable number of mitogenomes per nuclear genome (mitochondrial DNA copy number). Thus, a species i with twice the number of mitogenomes per nuclear genome than another species j will produce twice many mitochondrial reads as well; without a proper correcting factor, species i would, apparently, be twice more abundant in the mixture than species j. This fact is known (Bista et al., 2018;Piñol et al., 2015;Tang et al., 2014), but there are not reliable solutions to the problem because little is known about the causes of the variation of the mitochondrial copy number acrossspecies (but see Liu et al., 2018, that reported a higher mitochondrial copy number in organs with a high metabolic rate and in species living at low altitude than in their counterparts at high altitude in the Tibetan Plateau).
The size of the nuclear genome of the species also affects the across-species RSA estimation. Being all other things equal, a species r with a nuclear genome half as big as that of another species s will produce twice many mitochondrial reads because the mitochondrial DNA is diluted in a smaller amount of nuclear DNA. Therefore, without a proper correcting factor, species r would, apparently, be twice more abundant in the mixture than species s. The effect of genome size on RSA estimation is also known (Crampton-Platt et al., 2016;Krehenwinkel et al., 2017;Tang et al., 2014), but it is difficult to consider it because measuring the genome size is not an easy task (there is a database of genomes sizes with 1344 insect species on it; Gregory (2020), accessed on 25 March 2020). Both the variation across-species of mitochondrial copy number and genome size affect MMG, but also any amplicon metabarcoding method that targets genomic regions with a variable copy number, such as COI in animals (Hebert et al., 2003), ITS in fungi (Schoch et al., 2012), or rbcL + matK in plants (CBoL Plant Working Group, 2009).
Here we explore the quantitative capabilities of MMG for the estimation of RSA of heterogeneous mixtures of insects. For this purpose, we prepared single-species and artificial mixed-species libraries with several species of insects whose entire genome has already been sequenced. We included four species of Drosophila to assess the ability of the method to set apart closely related species. The single-species libraries allowed the calculation of a reliable mitochondrial DNA copy number (N M ) for each species that was further used as a correction factor for the across-species estimation of RSA of the mixed-species libraries. In particular, we addressed the following questions: (1) Is the MMG method able to identify species in complex mixtures, even when they are of the same genus? As an approach to real samples, we investigated the robustness of the method in the absence of the mitogenome of the focal species. (2) Can MMG estimate the RSA of complex samples? Is it necessary the use of the N M correction factor for the across-species estimation of RSA? (3) Finally, can the number of sequenced reads be reduced and still recover all species in a complex sample of insects? 2 | MATERIAL S AND ME THODS 2.1 | Selection of species, preparation of the DNA libraries, sequencing, and quality control We selected 17 species of insects whose complete genome is already sequenced and available on the RefSeq repository (Table 1) (e.g., DNA-to-biomass ratio) are ignored (Matesanz et al., 2019;Tang et al., 2015). were also used in a previous study (Garrido-Sanz et al., 2020). In the parent study, the obtained reads were mapped against a reference database of whole-genomes instead of a reference database of mitogenomes as we do here.
We pre-processed the raw reads through a quality control step using fastqc v0.11.7 (Andrews, 2015) and trimmomatic v0.36 (Bolger et al., 2014) to trim the reads to a 150 bp length and to remove those shorter than 140 bp. Only paired reads were kept.

| Reference genomes
We downloaded all mitogenomes of insects available at the NCBI RefSeq database on 1 August 2019, plus the complete genomes of the 17 species selected for the study from the same database.
Ten species had the complete genomes but not the mitogenomes on RefSeq, and we downloaded their mitogenomes from GenBank (accessed on 2 August 2019). We used high-quality mitogenomes of as many species as possible to test the ability of the method to find the selected species among the many others in the reference database. Species with several mitogenomes were deduplicated, and we obtained the mitogenomes of 1794 species of insects, comprising 1174 genera, 331 families, and 27 orders (hereafter, Mito1794; Table S2).

| Mapping of reads into references
In MMG studies, a small proportion of shotgun reads map into the mitogenome (e.g., Tang et al., 2014), hence most reads are not useful and slow down the mapping process. Thus, it seems reasonable to eliminate the reads that are not mitochondrial before the mapping step Zhou et al., 2013). For this purpose, we created a reference database with one mitogenome per family (hereafter, Mito331) where the representative species per family was chosen randomly. We then mapped the raw reads against the Mito331 reference data set using a permissive criterion and kept the putative mitochondrial reads (hereafter, candidate mito-reads).
The mapping was done using BWA 0.7.15-r1140 (Li, 2013) with mem algorithm and an alignment score of zero (-T0). samtools 1.10 (Li et al., 2009) was subsequently used to filter the paired-end reads with no mapping reads (view -F2316 -b) and recovered the mito-reads in fastq format (bam2fq). As low-complexity regions are prone to misclassify the reads (Lu & Salzberg, 2018;Pearman et al., 2019), we prepared a new set of filtered mitogenome references by removing low-complexity regions from the Mito1794 reference (hereafter, FilteredMito1794).
Low-complexity regions were identified using dustmasker (-level 45) (Morgulis et al., 2006) and replaced with Ns using an in-house python script.
To avoid confusion, we recapitulate below the name and meaning of the three different databases of mitochondrial genomes that we used for the mapping of reads: • Mito1794: the original mitogenomes of 1794 species.
• FilteredMito1794: as Mito1794, but with the low complexity regions removed.
• Mito331: a subset of Mito1794 with only one mitogenome per family; this reference was only used to obtain the candidate mitoreads from the total of reads of each sample.
As we did not know to which extent the filtering of raw reads and mitogenomes was useful, we conducted four different kinds of mapping of reads to reference mitogenomes in the single-species libraries: • Raw reads against Mito1794 reference database In all cases, the mapping was conducted using BWA with mem algorithm and default options. Because the mapping of a sample was done against each mitogenome individually, we obtained the same number of SAM files as references used. We subsequently used SAMtools to remove reads that did not map to any reference (view -F2308).

| Assignment of mapped reads to species
In general, a read mapped to several reference mitogenomes (e.g., homologous sequences in several species), so an algorithm was needed to assign reads to species. For this purpose, we used the γ-δ algorithm described in Garrido-Sanz et al. (2020). Briefly, what the γ-δ algorithm does is to quantify the similarity between the query read and the reference (i.e., the mapping ratio) and then decide whether a read is informative or non-informative. It is informative when it is very similar (mapping ratio above γ) to species i and not very similar (mapping ratio below δ) to the rest of species; in this case, the read is assigned to species i. It can be non-informative for two reasons, either because the read is not similar enough to any species (mapping ratio below γ for all species) or because it is too similar (mapping ratio above δ) to two or more species; in this case, the read is discarded. In all cases γ > δ.
The γ-δ algorithm has never been applied before to MMG data, hence the appropriate values of γ and δ are unknown, so the single-species libraries were used to find the best combination of the parameters γ and δ. To find the best values of γ and δ we relied on the criterion that the number of recovered species had to be one in the single-species libraries.
The reads in single-species libraries were divided into a training set with 75% of the reads randomly selected, and a test set with the remaining 25% of the reads. The training set was used for the calibration of the procedure; the test set was used to assess the goodness of fit of the model and to calculate the summary statistics.
A situation that can arise in real samples, as opposed to the artificial samples used here, is that the mitogenomes of some species in the sample are not in the reference database. We explored this situation by running again the complete pipeline with all the singlespecies libraries using the best set of input data and parameters.
Ideally, no read should be assigned to any species because the mitogenome of the only species in the library is not in the database.
However, the reads might eventually be wrongly assigned to other species in the database and, thus, generate false positives. The outcome of this experiment should reveal the robustness of the γ-δ algorithm in the assignment of reads to species.

| Quantification of the RSA in mixedspecies libraries and the need for a species-specific correction factor
In the literature, the relative abundance of one species is sometimes compared among different samples and on other occasions, the relative abundance of several species is compared within one sample (within-species and across-species RSA, respectively, following Ji et al., 2020). Here, we present the comparison of actual versus estimated RSA in the mixed-species libraries using both approaches. As we observed in the Introduction, from a conceptual point of view, the quantitative estimation of within-species RSA in MMG is easier than the across-species RSA, because in the latter the mitochondrial DNA copy number can vary widely between species.
With the single-species libraries we estimated the mitochondrial DNA copy number (N M ) of each species in the following way: 1. Let x i be the ratio of the genomic mitochondrial information divided by the total (haploid) genomic information for species i. The mitochondrial information is the mitogenome length (M i ) times N Mi ; the total genomic information is the sum of the nuclear genome length (G i ) and the mitochondrial information.
2. The re-arrangement of Equation (1)  We obtained R Mi by mapping the reads of species i to its mitogenome when this mitogenome was the only one used as the reference in the mapping. Regarding R Gi , we assumed that all reads of the singlespecies library of species i belong to species i.
In the comparison of actual versus estimated RSA using the mixed-species libraries, we multiplied the actual relative abundance of species i (Table S1)

| Rarefaction of the input samples
We only multiplexed six mixed-species libraries in a single Illumina MiSeq run (Table S1)

| Statistical analyses and hardware
All statistical analyses were performed with r 3.4.2 (R Core created using the r packages ggplot2 (Wickham, 2016) and ggpubr (Kassambara, 2018).
We run the complete pipeline on a server with two Intel Xeon E5-2620 v3 processors with six cores each and hyperthreading technology, so a maximum of 24 threads were available.

| Species identification
The 21 single-species libraries (Table  1) generated 2,371,091 ± 1,210,091 (mean ± SD) paired-end reads. A proportion of 0.012 ± 0.027 reads were eliminated in the trimming step, remaining a proportion of 0.987 ± 0.027 reads available for further analysis.
The results that follow correspond to the application of the γ-δ algorithm for the assignation of reads to species on the training set (i.e., 75% of the sequenced data). We also eliminated from the following results the reads assigned to species that could legitimately be attributed to physical contamination in the laboratory or the sequencing. These contaminants were species sequenced in different libraries of the same Illumina run and the fly Ceratitis capitata that contaminated the library of Bactrocera oleae (see the discussion for the reason behind this contamination).
Because the MMG method must recover only one species in single-species libraries, we fixed the values of γ = 0.99 and δ = 0.96 in the γ-δ algorithm as this combination was the only one to provide the expected result (Table S3). All the other tested combina-  (Tables S4-S7).
Filtering out the repetitive regions of the mitogenomes (i.e., FilteredMito1794) had a dramatic effect on the number of identified species. With the FilteredMito1794 database, we only detected the focal species in all the libraries, whereas with the Mito1794 database there appeared several false positives in many libraries (2.5 ± 1.2 species per library using all raw reads or 1.7 ± 0.9 species using only candidate mito-reads; Table 2a). The masked regions mostly belonged to non-coding regions of the mitogenome, including the control region (Table S8). The use of only candidate mito-reads instead of all reads produced a loss of c. 6% of informative reads (Table 2b) but reduced ~18 times the execution time (Table 2c). In summary, the elimination of the repetitive regions from the genomes removed all the false positives and the mining of candidate mito-reads reduced 18-fold the execution time of the pipeline with a moderate loss of informative reads. Therefore, in the subsequent steps, we used both the FilteredMito1794 database and only the candidate mito-reads ( Figure 1).
We evaluated the goodness of fit of the model with the test set (i.e., the remaining 25% of reads not used in the previous calibration) using the best set of input data and parameters (i.e., the FilteredMito1794 database, the candidate mito-reads and the parameters γ = 0.99 and δ = 0.96). The number of identified species per library was 1 in all cases (Table S9) and the proportion of informative reads was 0.0046 ± 0.0056.
The absence of the mitogenome of the focal species in the reference database did not produce many false positives in the singlespecies libraries (Table 3). When there were no congeneric species of the focal species in the reference database (six out of 17 species), no read was assigned to any species; when there were congeneric species in the database, in six cases no reads were assigned to any species and in four cases some reads were assigned to another species of the same genus (Bactrocera, Drosophila, Plutella, and Solenopsis); only in one species (Drosophila melanogaster) appeared some reads belonging to species of a different genus (Exorista sorbillans, Diptera:Tachinidae).
The six mixed-species libraries (Table S1)   In the mixed-species libraries, we recovered all species included in the libraries except Papilio machaon in library no. 2 (Table S10).
As in the single-species libraries, in the mixed-species libraries, we found reads of Ceratitis capitata and discarded them as laboratory contamination. In addition, in libraries no. 2 and no. 3, one single read was attributed to Bactrocera biguttula (Table S10), a species not handled in the laboratory; therefore, an analytical detection limit of ε = 0.0001 would be useful for the elimination of all false positives.

| Estimation of the relative species abundance in mixed-species libraries
The within-species RSA was well estimated for all species (r ≥ .97 and p < .05 for all species; Figure 2), but the across-species RSA estimation was very poor (r ≤ .67 and p > .05 for all samples; Figure 3a). Thus, it seems clear the need for a species-specific correction factor that considers a variable ratio of mitochondrial to nuclear DNA (Table 4). When we modified the actual RSA with the

| Computer use
The total consumed time by running the entire pipeline with the mixed-species libraries ranged between 66 and 82 min (Table S11).
Most of the processing time was devoted to the mapping of the reads to the references (95%) and only 3% of the time was used by the γ-δ algorithm (Table S11).

| DISCUSS ION
Mito-metagenomics (MMG) proved to be able to set apart and quantify the relative DNA abundance of insect species in artificial mixtures, even when the species were congeneric. The estimation was as good as the one obtained with whole genomes instead of mitogenomes by Garrido-Sanz et al. (2020), using the same DNA libraries and bioinformatic methods as here. However, to be able to quantify the RSA in a sample with several species (across-species RSA) it was necessary to correct the raw reads by the variable amount of mitochondrial to nuclear DNA (mitochondrial DNA copy number) among species.

| Species identification
The MMG approach used here recovered only the focal species from all the single-species libraries, with no false positives when low-complexity regions were filtered out from the mitogenomes (except for genuine contaminants, see below). Without this filtering step, some reads were attributed to non-focal species; these reads were sequences with biased composition, probably from repetitive regions (e.g., microsatellites) that mostly matched non-coding F I G U R E 1 MMG pipeline applied in this study. In brackets, the tools used in each step regions on the reference mitogenomes (Table S8; Faber & Stepien, 1998;Wolff et al., 2012). Some popular tools do implicitly or explicitly filter out low complexity regions from the reference genomes: Kraken masks low complexity regions when adding references to the database (Wood & Salzberg, 2014); and BLAST filters both query sequences and references (Altschul et al., 1990(Altschul et al., , 1997Camacho et al., 2009).
We also found contaminant species in all the sequenced libraries (Tables S9-S10). The origin of these reads of contaminants can be tag jumping during the sequencing reaction (Schnell et al., 2015) or actual contamination in the laboratory. The first reason is probably the cause of finding reads in single-species libraries belonging to other species sequenced in the same run but different libraries.

The second reason is behind the presence of reads attributed to
Ceratitis capitata in libraries where Bactrocera oleae was also present, because the two dipterans, which are agricultural pests, were

| Quantification of the RSA and the need for a species-specific correction factor
With the mixed-species libraries, we estimated the within-species RSA with high statistical confidence (Figure 2). Similar good results have been reported in previous studies that used mock samples (Bista et al., 2018;Ji et al., 2020). On the contrary, the across-species RSA estimation within a sample was not statistically significant in any sample (Figure 3a). These results contrast with the study of TA B L E 3 List of species detected on the single-species libraries when the mitogenome of the focal species is in the reference database (column A) and when it is not (column B). For each detected species we indicate its name and the number of assigned reads (in brackets). The number of congeneric species of the focal species included in the database is provided in column C. Libraries are divided into four groups: Group 1, species without congeneric species in the database and without false positive species; Group 2, species with congeneric species in the database and without false positive species; Group 3, species with congeneric species in the database but with false positive of the same genus; and group 4, species with congeneric species in the database but with false positive of a different genus   Figure 3c). Other studies reporting RSA estimation across-species also used some correction factor before comparing the expected and observed number of reads, but none included the genome size, as we did here. For instance, Gómez-Rodríguez et al. (2015) and Tang et al. (2015) considered the mitogenome size of the species and Tang et al. (2015) and Lang et al. (2019) the number of reads from the genome. Tang et al. (2015) is the only study that provides both the goodness of fit with and without the use of the correction factor, and the effect is very different from the one reported here, as the result was almost the same in both cases.
One possible explanation is that Tang et al. (2015) dealt only with wild bees (a group of a few Hymenoptera families), so the interspecific differences might be low compared to our study which included species from four insect orders. Nevertheless, the effect of the mitochondrial DNA copy number on the estimation of RSA deserves more research effort if DNA-based techniques are to provide good quantitative results, both using mito-metagenomics and amplicon metabarcoding targeting variable copy number regions.
The method used here to estimate the correction factor for the mitochondrial DNA copy number (i.e., preparation and sequencing of a single-species library to a depth of c. one million reads) has a cost that is not negligible. Ideally, there should be a method to independently estimate the mitochondrial DNA copy number of each species that did not involve sequencing. Such methods do exist Actual relative concentration Estimated relative concentration because there is an interest in medicine to measure the mitochondrial DNA copy number for its relationship with several diseases.
In medicine, the mitochondrial DNA copy number is usually estimated using quantitative PCR (qPCR; Thyagarajan et al., 2012); this method requires two primer pairs, one for a mitochondrial marker and one for a single-copy nuclear marker. These primers are known for humans, but it would be costly to generate them for every species in an environmental sample, especially for those species whose genome has not yet been sequenced (but see Liu et al., 2018). The qPCR itself is cheap, but the preparative work for each species would be long.
It is important to emphasize that the RSA used here is based on In metabarcoding applications, some authors use empirical correction factors based on mixtures of known relative biomass of several species rather than in mixtures of DNA (e.g., Matesanz et al., 2019;Thomas et al., 2016). In these cases, the correction factor solves for the mitochondrial copy number among species and also for the DNA-to-biomass ratio and the differential amplification efficiency caused by PCR. This method is undoubtedly practical but does not differentiate the relative importance of each source of bias.

| Mito-metagenomics in real samples
The present study is based on artificial mixtures of a low number of species whose mitogenomes are already assembled. Thus, it is fair to question the value of our proposal in real samples with many more species, with a limited amount of DNA, where the prior species composition is unknown, or when the reference mitogenomes are obtained in the same experiment and are only partially assembled.

| More complex mixtures
Real samples can contain hundreds of species, and that might affect the ability of the method to detect the less abundant ones. With the sequencing depth achieved here (~3.4 million raw reads per sample; Table S1) we were able to detect three (out of four) species with an expected RSA below 1‰ (Table S10). The subsequent rarefaction experiment showed that with fewer reads more species become undetected ( Figure 4). In consequence, it seems that at least 3.4·10 6 reads are needed to detect most species with an RSA above 1‰.
Having hundreds of species in the mixture would not hamper the quantitative ability of the method, as most species would be above a 1‰ abundance. However, ultra-rich samples with thousands of species would require a higher sequencing depth to ensure the detection of most species.

| Limited amount of DNA available
The Illumina TruSeq kit used here to prepare the libraries requires 1 μg of DNA and this might be a problem with small specimens or in DNA-poor samples. However, today there are alternative methods that provide good results with just 1 ng of DNA, like the Illumina Nextera DNA Flex kit (Sato et al., 2019), albeit potential biases should be tested in future experiments for these kits.

| Absence of mitogenomes in the reference database
Our results showed that the proposed methodology was robust in the absence of the mitogenomes of species in the reference database. Of course, the species without their mitogenome in the reference database will never be found, but their reads will not generate many false positives, even for species with close relatives in the reference database (Table 3). The presence of species without their reference genome in the mixture is likely to occur frequently in real samples. The unassigned reads (or also when the prior composition of the sample is unknown) can be further explored by mapping them against other databases, like COI barcodes from BOLD; thus, the identity of more species will be revealed, albeit not their relative abundance.

| Incomplete genomes
In several MMG studies, the reference mitogenomes are assembled from the same mixtures in which the RSA is intended to be quantified F I G U R E 3 Scatter plot of the estimated versus the actual RSA in each mixed-species library (i.e., across-species RSA). At the top, it is indicated the way we conducted the actual RSA. (a) Original expected data. (b) Corrected expected data after applying the N Mi correction factor. (c) Corrected expected data after applying the N Mi correction factor. Rows from top to bottom correspond to mixed-species libraries from 1 to 6. Each plot shows the Pearson correlation coefficient (r) and the corresponding p-value (Crampton-Platt et al., 2016;Zhou et al., 2013). In these cases, the mitogenomes are assembled de novo and, normally, they are incomplete. We do not see any impediment in using mitogenomes assembled in this way if all of them have a similar length and quality.
However, we would advise against the simultaneous use of mitogenomes with disparate length or quality for quantification purposes, because that would bias the RSA towards the species with better mitogenomes (Tang et al., 2015). On the contrary, the use of all available partial mitogenomes would be fine for identification purposes. • G i . The length of the whole genome is more variable across species than the length of the mitogenomes: 90% of the 115 whole genomes of insects available at RefSeq (Table S13A) have a length between 0.14 and 0.98 Gbp. However, if the insects are split by orders the variability of G i is smaller for most insect orders (Table   S13B). Consequently, we would advise using the mean value Ḡ for each group of taxa (e.g., insect orders).

| Concluding remarks
The approach presented here to identify insect species and to estimate their relative abundance in complex mixtures using MMG worked well with artificial samples of known composition for a select group of species whose mitogenomes are sequenced to an advanced degree. The key for the accurate estimation of the across-species RSA was a correction factor for the mitochondrial copy number of each species. We are aware that the proposed methodology is not immediately applicable to most real samples, so its real value should be tested on more of such samples.

ACK N OWLED G EM ENTS
We are grateful to several entomologists who provided the speci-