Present‐Day DNA Contamination in Ancient DNA Datasets

Present‐day contamination can lead to false conclusions in ancient DNA studies. A number of methods are available to estimate contamination, which use a variety of signals and are appropriate for different types of data. Here an overview of currently available methods highlighting their strengths and weaknesses is provided, and a classification based on the signals used to estimate contamination is proposed. This overview aims at enabling researchers to choose the most appropriate methods for their dataset. Based on this classification, potential avenues for the further development of methods are discussed.


DOI: 10.1002/bies.202000081
tissues, [22] and DNA from the environment that can seep into the specimen. Exogenous DNA can also be introduced by handling, lab equipment, and reagents. [23][24][25] In most cases, researchers identify endogenous DNA sequences by aligning them to a closely related reference genome, thereby largely excluding sequences from distantly related organisms. [18,26,27,28] However, sequences from contaminating DNA that are similar to the reference genome can pass this filtering step. This is particularly problematic for the study of ancient human material, since human DNA is abundant in research environments and contamination by human DNA can lead to false signals of admixture or to underestimation of the divergence to present-day humans. [2,29,30] Several precautions can guard against contamination. [17,31,32] Protective clothing during excavation [33,34] and lab work minimizes the introduction of contaminating DNA, and the irradiation of reagents, lab equipment, and clean room facilities is often used to degrade DNA from other potential sources. [35,36] Despite these efforts, a low level of contamination is unavoidable, and contamination is therefore closely monitored by using negative controls during DNA extraction and library preparation [37] and by the inclusion of unique combinations of DNA barcodes in each ancient DNA library. [38] However, these methods can only reveal contamination that is introduced during DNA extraction and library preparation or through cross-contamination between experiments. A measurement of contamination from all sources in the final sequencing dataset, which includes contamination introduced before DNA lab work, remains crucial for downstream analyses.
In this review, we discuss methods that estimate the proportion of contamination in ancient DNA data. We focus on the case of ancient human samples with present-day human DNA contamination, since this is a particularly challenging problem (Box 1). However, many methods can also be applied to other organisms. We first describe the features of sequence data that can be used to quantify contamination. We then discuss specific approaches in more detail, starting with methods for haploid loci, the mitochondria, and Y-chromosome, and then proceeding with methods for estimating contamination in diploid and recombining nuclear DNA.

BOX 1. Ancient DNA at the extremes
Poorly preserved ancient DNA samples can pose a particular challenge for analysis. Often this poor preservation is due to old age, although the oldest sequenced DNA to date, a permafrost sample from an ancient horse, is comparatively well preserved. [95] The poor DNA preservation in these samples means that the total concentration of ancient DNA is low and that contamination can constitute a large proportion of aligning sequences. Fortunately, old samples exhibit a high rate of C-to-T substitutions due to ancient DNA damage. By using the presence of damage-associated substitutions to enrich for sequences that stem from ancient molecules, Meyer et al. [96] reconstructed the mitochondrial genome from the highly contaminated sequences of a Neanderthal ancestor found at Sima de los Huesos in Spain and dated to over 400 000 years ago. Although this approach only considered C-to-T substitutions at the ends of sequences, Skoglund et al. [44] developed a scoring system that considers all substitutions throughout the sequences. By filtering sequences based on these scores, they were able to reconstruct the mitochondrial genome from a Neanderthal found in Okladnikov Cave, Russia, with 10% present-day human DNA contamination. Filtering sequences based on C-to-T substitutions can help to reduce contamination. However, if contamination accumulated some C-to-T substitutions, then these procedures may fail. In the previous two examples, mitochondrial genomes were reconstructed from multiple-fold coverage of sequences. Due to the high coverage, the correct endogenous mitochondrial genome will be reconstructed as long as contaminating sequences constitute at every site the minority among sequences with C-to-T substitutions. Unfortunately, such an approach is not possible for nuclear genomes, where the generated sequence coverage is typically far below onefold for highly degraded samples and informative sites are not always available. 
To deal with this issue and analyze nuclear sequences from Sima de los Huesos hominins, Meyer et al. [54] counted sequences in support of an assignment to the Neanderthal, Denisovan or modern human lineages. The inference of a closer relationship to Neanderthals was robust to high levels of modern human contamination, since contamination should not exhibit a high proportion of Neanderthal-specific variants. As an alternative to this approach, contamination levels can also be quantified and included in the calculation of statistics of interest. [42,43] By considering contamination, the relationship of late Neanderthals to early Neanderthals, with up to 65% contamination, was resolved. [42] Also, admixture rates can be estimated in the presence of contamination. [43] These results show that even highly contaminated samples can yield insights when modeling or estimating the effect of contamination as part of the analyses.
between the contaminant and endogenous DNA, deviations from the expected ploidy, and time-dependent characteristics of ancient DNA such as damage-induced substitutions (Figure 1). We briefly describe these signals in this section.

Differences in the DNA Sequence
Sites that differ between the genome of interest and likely contaminants can be identified when their genome sequences are known in advance. For instance, the mitochondrial genomes of Neanderthals differ at some sites from those of all present-day humans [39] and contamination can be estimated by measuring the proportion of sequences that show the present-day human allele at these sites. [40] Other measures of sequence differences, such as sequence divergence, can also be used to estimate contamination if it is possible to predict what their value would be in the absence of contamination. [41,42] All approaches in this class require some a priori knowledge of the relationship between contaminating and ancient individuals. Also, these approaches gain power with increasing divergence between the contaminating and ancient genome sequences. Yet, once sequence differences are known, a few sequences overlapping these positions can be sufficient to estimate contamination.
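The diagnostic-site estimate described above can be sketched in a few lines. This is a minimal illustration rather than any published implementation: the function name and its inputs are hypothetical, sequences are treated as independent draws, and sequencing errors and damage-induced substitutions are ignored. The Wilson score interval is our own choice for expressing uncertainty when only a few sequences overlap diagnostic sites.

```python
import math


def diagnostic_site_contamination(contaminant_allele_reads, total_reads, z=1.96):
    """Return (estimate, lower, upper) for the contaminant fraction.

    contaminant_allele_reads: sequences showing the contaminant allele at
        diagnostic sites (hypothetical input, counted beforehand).
    total_reads: all sequences overlapping diagnostic sites.
    z: normal quantile for the interval (1.96 gives roughly 95%).
    Assumption: no sequencing error or damage at the diagnostic sites.
    """
    if total_reads <= 0:
        raise ValueError("need at least one sequence overlapping a diagnostic site")
    if not 0 <= contaminant_allele_reads <= total_reads:
        raise ValueError("read counts are inconsistent")
    p = contaminant_allele_reads / total_reads
    # Wilson score interval: behaves better than the normal approximation
    # when counts are small or the proportion is close to 0.
    denom = 1 + z**2 / total_reads
    centre = (p + z**2 / (2 * total_reads)) / denom
    half = z * math.sqrt(p * (1 - p) / total_reads
                         + z**2 / (4 * total_reads**2)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)
```

For example, 5 contaminant-allele sequences among 100 overlapping diagnostic sites give a point estimate of 5%, with an interval that reflects how little data is involved.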

Deviation from the Expected Ploidy
Contamination can cause a sample to show unusual patterns of ploidy. For instance, heterozygous sites on the X or Y-chromosomes in males, or Y-chromosome sequences in females, are signs of contamination. This signal is not limited to the sex chromosomes; a higher proportion of sequences supporting one allele at a heterozygous site on the autosomes can, for instance, indicate contamination from an individual carrying this allele. [2,43] In contrast to the previous class, ploidy-based methods often require multiple-fold coverage. However, they have the advantage that prior knowledge of the relationship between the contaminant and ancient individual is not required.

Ancient DNA Degradation Patterns
The degradation of DNA leaves characteristic patterns that can be used to distinguish ancient DNA sequences from those from present-day DNA contamination. [44,45] The most common damage in ancient DNA originates from cytosine-deamination [46] and occurs more often at the ends of DNA molecules, [47] likely because of single-stranded overhangs that degrade faster than the mostly double-stranded interior. [48,49] Cytosine-deamination turns cytosines (C) into uracils that are then misread as thymines (T) by the DNA polymerases used during DNA library preparation. This leads to erroneous C-to-T substitutions in the sequence data (and additional G-to-A substitutions, depending on the specifics of the library preparation protocol). The prevalence of deamination-induced C-to-T substitutions increases with the age of the sample, [50] although other factors, such as climate, also have a substantial effect on the rate of cytosine deamination. [51] The frequency of C-to-T substitutions can be used to classify sequences as likely ancient, [44,52] and to quantify contamination from undamaged present-day DNA. [53,54] Although not diagnostic, other features such as the length of ancient DNA sequences can also be used. Methods that are based solely on these ancient DNA degradation patterns require comparatively few sequences and no prior knowledge of genetic relationships.

Figure 1. Classification of the signals used to estimate contamination. Each box illustrates one of the three signals. Box A shows the genealogical relationship of present-day and ancient individuals, including two derived variants that are informative for either group. The presence of a red variant indicates a contaminant sequence, whereas a blue variant indicates endogenous sequences. Box B shows the expected ploidy for the autosomes (A), the X- and Y-chromosomes (X, Y), and the mitochondrial genome (MT) for females (red) and males (blue). Deviations from these expectations indicate contamination from the opposite sex. The illustration below shows sequences aligned to a reference genome. These sequences carry two different alleles represented by dots. However, one allele (white) is rare compared to the other allele (blue). This observation is not compatible with the 50:50 ratio expected for a heterozygous site; therefore the discordant allele may originate from contamination. Box C illustrates ancient DNA damage. Left: ancient DNA fragments often contain uracils caused by ancient DNA damage whereas uracils are typically absent from present-day DNA fragments. Right: When no repair enzymes are used, uracils will be misread as thymines and their presence will result in high rates of C-to-T substitutions that occur primarily at the ends of sequences. Note that this signal depends on the library preparation protocol, and that high rates of G-to-A toward the 3ʹ-end, instead of C-to-T exchanges, are also a possible signal. Neither pattern is expected for present-day DNA sequences.

Methods to Estimate Contamination
Many methods have been developed over the years to estimate contamination in ancient samples. Often, these methods rely on more than one of the signals described above (Figure 2). Thus, it is more helpful to consider the type of data to be analyzed when choosing a method. Here, we discuss methods grouped by the following three categories: mitochondrial DNA, sex chromosomes, and autosomes. Methods in each category differ in their requirements regarding sequence coverage, ancient DNA damage patterns, and a priori knowledge of sequence divergence.

Mitochondrial DNA
Ancient DNA studies often proceed by first sequencing mitochondrial genomes (or other nonrecombining haploid sequences such as chloroplasts [55] ). The high copy number of mitochondria per cell, their small genome, and the wider availability of enrichment methods for mitochondrial DNA [56] make it easier to sequence the mitochondrial genome to high coverage than the nuclear genome, so that mitochondrial sequences often provide first insights into the level of contamination in a sample.

Methods Based on Differences in the DNA Sequence
Mitochondrial variation is often well characterized. For instance, full mitochondrial genomes have been reconstructed for the extinct human groups of Neanderthals and Denisovans [40][41][42][57][58][59][60][61][62] in addition to thousands of present-day and ancient modern humans. [63] These data can be used to identify positions where contaminating and endogenous mitochondrial genomes differ. At these diagnostic sites, the proportion of sequences carrying the contaminating allele represents an estimate of contamination. [2,40,64] However, diagnostic positions are not always known in advance. In this case, and when multi-fold coverage is available, one can reconstruct the endogenous sequence by consensus calling and then identify positions where the consensus sequence shows a variant that is either absent or present at a low frequency in a reference panel of potential contaminating genomes. [40,65] Sometimes it is not possible to identify diagnostic positions if, for instance, the endogenous genome falls within the variation of contaminating genome(s). An alternative strategy exploits the fact that the mitochondrial genome is nonrecombining, and models the data as a mixture of sequences from different mitochondrial genomes. This approach is implemented in contamMix, which considers both the reconstructed consensus mitochondrial sequence of the ancient individual and a reference panel of potentially contaminating mitochondrial genomes. [66] The method assumes that each sequence originates from one of these genomes and identifies them based on the number of matching bases, but allows for additional differences from sequencing errors. The proportion of sequences assigned to genomes other than the consensus genome corresponds to the contamination estimate.
www.advancedsciencenews.com www.bioessays-journal.com

Methods Based on Deviations from the Expected Ploidy
The reconstruction of the endogenous consensus genome can be challenging if the data are highly contaminated or exhibit high error rates because of DNA damage. In this context, another approach, Schmutzi, takes advantage of the haploidy of the mitochondrial genome and requires multiple-fold sequence coverage to jointly reconstruct the endogenous and contaminant consensus sequences. [53] It uses a detailed model of errors, including substitutions from cytosine-deamination, and distinguishes endogenous from contaminant sequences by considering damage patterns and fragment lengths. It then gives a contamination estimate based on diagnostic sites, which are identified by comparing both consensus sequences to a reference panel of likely contaminants.
Schmutzi and contamMix can only be applied if a reference panel is available, but another approach that is independent of such a panel exists (Mafessoni's model in [10] ). This approach requires high sequence coverage and estimates the proportion of distinct mitochondrial genomes from differences among mapped sequences. Although the method is not aimed at quantifying contamination, one could apply it for this purpose, if contaminating and endogenous genomes are sufficiently different and sequencing error rates are low.
Contamination estimates based on mitochondrial sequences are often used as a proxy for nuclear contamination. However, it has been noted that contamination estimates obtained from the mitochondria and the nuclear genome can differ. [30,67] This is because mitochondrial DNA may degrade at different rates than nuclear DNA [21,68] and, more importantly, because the ratio of nuclear to mitochondrial sequences differs between cell types. This can, for example, lead to an underestimate of contamination if the source of contamination is a cell type with a low ratio of mitochondrial to nuclear genomes. [30,67,69] While none of the methods takes nuclear mitochondrial insertions (NuMts) or heteroplasmies into account, these factors are unlikely to introduce large errors in contamination estimates.

Sex Chromosomes
Although sex chromosomes are less accessible than mitochondrial genomes, they are recognized as a useful tool to study sex-biased migration and admixture among populations. [70] Contamination estimates that rely on the haploid state of the sex chromosomes in males or the absence of a Y-chromosome in females have been specifically developed for sex chromosome data.

Methods Based on Differences in the DNA Sequence
Many methods to estimate mitochondrial DNA contamination are also applicable to the nonrecombining part of the Y-chromosome. In particular, diagnostic sites can be identified from the large datasets of Y-chromosome diversity in humans (e.g., [71,72] ). While useful for studying Y-chromosomes, the estimates are insensitive to contamination from females and cannot be used as a measure of autosomal contamination.

Methods Based on Deviations from the Expected Ploidy
The X-chromosome is also present in a haploid state in males. In contrast to the mitochondrial genome and the Y-chromosome, however, the X-chromosome recombines, and methods based on haplogroups cannot be applied. As an alternative, Rasmussen et al. [73,74] and Moreno-Mayar et al. [75] used known variants on the X-chromosome to detect sequences that disagree with the majority call. These alternative alleles are unexpected in males that carry only one copy of the X-chromosome and can be used, together with an estimate of sequencing error from neighboring sites, to estimate contamination. Note that this X-chromosome contamination estimate gives an upper limit on the rate of contamination in the autosomes, since contamination originating from females has twice the impact on the X-chromosome of males compared to their autosomes.
Another approach, also based on the expected ploidy, is to compare the sequence coverage between the X-chromosome and the autosomes. [41] Female contamination in a male individual will increase the X-to-autosome ratio, while male contamination in a female individual will decrease it. The deviation of this ratio from the expected value of 0.5 and 1 for males and females, respectively, can thus be used to estimate contamination by the opposite sex. The advantage of this method is that it can be applied to both sexes and is unaffected by sequencing errors, as it does not rely on genetic variants.
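The coverage-ratio idea can be written down explicitly under simplifying assumptions. In the sketch below (the function name and arguments are ours, not from any published tool), contamination c is defined as the fraction of autosomal coverage that stems from an individual of the opposite sex, and coverage is assumed to be unbiased across chromosomes. A male sample with female contamination then has an expected X-to-autosome ratio of 0.5(1 - c) + c, and a female sample with male contamination has 1(1 - c) + 0.5c; solving for c gives the two branches.

```python
def contamination_from_x_ratio(x_coverage, autosome_coverage, sample_sex):
    """Estimate opposite-sex contamination from the X-to-autosome coverage ratio.

    Assumptions (illustrative only): unbiased coverage across chromosomes,
    and all contamination comes from the opposite sex.
    sample_sex: "male" or "female", the sex of the ancient individual.
    """
    if autosome_coverage <= 0:
        raise ValueError("autosomal coverage must be positive")
    ratio = x_coverage / autosome_coverage
    if sample_sex == "male":
        # Expected ratio: 0.5 * (1 - c) + 1.0 * c  =>  c = 2 * ratio - 1
        estimate = 2.0 * ratio - 1.0
    elif sample_sex == "female":
        # Expected ratio: 1.0 * (1 - c) + 0.5 * c  =>  c = 2 * (1 - ratio)
        estimate = 2.0 * (1.0 - ratio)
    else:
        raise ValueError("sample_sex must be 'male' or 'female'")
    # Sampling noise can push the raw value outside [0, 1].
    return min(1.0, max(0.0, estimate))
```

A ratio of 0.55 in a male sample would, under these assumptions, correspond to about 10% female contamination. Same-sex contamination leaves the ratio unchanged and is therefore invisible to this estimate.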
Similarly, it is possible to estimate male contamination in a female sample by dividing the number of Y-chromosome sequences by the expected number of sequences that would map to the Y-chromosome if it were a male. [30] Assuming that the alignment efficiency is uniform among chromosomes, this expected number of sequences is simply half the observed number of sequences that map to the autosomes multiplied by the fraction of the genome that is the Y-chromosome.
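As a worked illustration of this calculation, the sketch below follows the description in the text literally. The function and parameter names are hypothetical, and `y_fraction` (the fraction of the genome that is the Y-chromosome) must be supplied by the user for the reference genome and mappability filters in use; no particular value is implied here.

```python
def male_contamination_in_female(y_reads, autosomal_reads, y_fraction):
    """Estimate male contamination in a female sample from Y-chromosome reads.

    y_reads: sequences mapping to the Y-chromosome.
    autosomal_reads: sequences mapping to the autosomes.
    y_fraction: fraction of the genome that is the Y-chromosome
        (user-supplied assumption; depends on reference and filters).
    Assumes uniform alignment efficiency among chromosomes.
    """
    if autosomal_reads <= 0 or y_fraction <= 0:
        raise ValueError("autosomal_reads and y_fraction must be positive")
    # A male carries one Y per two autosome copies, hence the factor 0.5.
    expected_y_if_male = 0.5 * autosomal_reads * y_fraction
    return y_reads / expected_y_if_male
```

With the purely illustrative inputs of 50 Y-chromosome sequences, 1 000 000 autosomal sequences, and `y_fraction = 0.01`, the expected count for a male would be 5000 and the estimate would be 1%.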

Autosomal DNA
Estimating contamination from autosomal DNA is challenging because autosomes are diploid and recombine, and sequence coverage is often low. However, autosomal data are indispensable for the study of population history and selection, and accurate estimates of contamination are crucial to ensure correct results.

Methods Based on Differences in the DNA Sequence
As for mitochondrial and sex-chromosome sequences, contamination rates in autosomal data can rely on diagnostic sites. A prerequisite for the existence of such sites is a sufficiently large divergence between the likely source of contamination and the studied genome. This approach has, for instance, been used to estimate modern human DNA contamination in Neanderthal data. [54,76] In this setting, at least thousands of positions exist where most modern human genomes carry a derived variant that is absent from sequenced Neanderthal genomes. Note that the approach is conservative in that it may lead to an overestimate of contamination rates, as newly sequenced Neanderthal genomes can carry modern human alleles because of hitherto unknown variation instead of contamination. The comparatively small fraction of Neanderthal ancestry in present-day humans would only lead to a minor underestimate. [2,76] A natural extension of using diagnostic positions is to rely on expectations for a statistic that describes the relationship between contaminating and endogenous genomes. If these expectations differ between the contaminating and endogenous genomes, the level of contamination can be gauged by modeling the observed value of the summary statistic as a linear combination of these two expectations. Statistics that have been used for this purpose are estimates of sequence divergence and the sharing of derived alleles. [41,42] However, further statistics, such as admixture proportions, may also be suitable. Depending on the summary statistic used, this approach can use more of the data than diagnostic positions. However, similar to diagnostic positions, the approach relies on assumptions about the relationship of the contaminant to the ancient individual that can influence later analyses. Yet, as both methods only require that a few hundred sequences overlap informative sites, they are particularly useful for low-coverage data.
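The linear-combination argument can be made concrete as follows. If S_endo and S_cont denote the expected values of a summary statistic for the endogenous and contaminating genomes, the observed value is modeled as S_obs = (1 - c) S_endo + c S_cont, so that c = (S_obs - S_endo) / (S_cont - S_endo). The function below is a minimal sketch of this inversion with hypothetical names; obtaining the two expectations is the difficult part and is not addressed here.

```python
def contamination_from_statistic(observed, expected_endogenous, expected_contaminant):
    """Solve observed = (1 - c) * endogenous + c * contaminant for c.

    The two expectations are assumed to be known in advance, for example
    from uncontaminated reference genomes (an assumption of this sketch).
    """
    spread = expected_contaminant - expected_endogenous
    if spread == 0:
        raise ValueError("the statistic does not differ between the two sources")
    c = (observed - expected_endogenous) / spread
    # Clamp to the valid range; noise can yield values slightly outside it.
    return min(1.0, max(0.0, c))
```

For instance, an observed value of 0.13 with expectations of 0.10 (endogenous) and 0.40 (contaminant) corresponds to roughly 10% contamination. The estimate becomes unstable when the two expectations are close, which mirrors the dependence on divergence noted above.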
Nakatsuka et al. [77] recently presented another approach that relies on linkage between pairs of sites. As the contaminant and endogenous genomes often carry different haplotypes, this approach tests for a reduction of linkage between sites compared to the expectations derived from a panel of reference genomes. As a reduction in linkage may also be due to divergence to this reference panel, the method uses deaminated sequences to correct the linkage that is expected without contamination. This approach is applicable to ancient genomes with little divergence to the contaminant(s), which makes it valuable for the study of nuclear sequences from modern humans.

Methods Based on Deviations from the Expected Ploidy
In the previous section, the methods required a priori knowledge about the endogenous genome to infer contamination rates. Often, it is easier to make assumptions about the contaminant rather than the endogenous sequences. Assuming that some divergence exists between contaminating genomes and the endogenous genome, Philip L. F. Johnson [2,78] uses sites where likely sources of contamination are all or nearly all derived. Although the endogenous genome can carry any allele at these positions, we expect contamination to contribute sequences with derived alleles. Thus, the method infers contamination rates as an excess of sequences with derived alleles compared to the expectation of 0% at homozygous ancestral positions or 50% at heterozygous positions in the endogenous genome.
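The expectation underlying this approach can be sketched as a simple moment estimate. This is not the likelihood model of the cited work: the sketch assumes that the endogenous genotype class of the sites is already known, that the contaminant is fixed for the derived allele, and that sequencing errors are negligible. Under these assumptions, the derived-allele fraction f at sites with endogenous expectation e (0 for homozygous ancestral, 0.5 for heterozygous) is f = (1 - c) e + c, which gives c = (f - e) / (1 - e).

```python
def contamination_from_derived_excess(derived_reads, total_reads, endogenous_genotype):
    """Estimate contamination from an excess of derived alleles.

    endogenous_genotype: "hom_ancestral" or "het", the (assumed known)
        genotype class of the pooled sites in the endogenous genome.
    Assumes the contaminant always contributes the derived allele and
    that sequencing errors can be ignored.
    """
    if total_reads <= 0:
        raise ValueError("need at least one sequence at informative sites")
    expected = {"hom_ancestral": 0.0, "het": 0.5}
    if endogenous_genotype not in expected:
        raise ValueError("endogenous_genotype must be 'hom_ancestral' or 'het'")
    e = expected[endogenous_genotype]
    f = derived_reads / total_reads
    # f = (1 - c) * e + c * 1  =>  c = (f - e) / (1 - e)
    c = (f - e) / (1.0 - e)
    return min(1.0, max(0.0, c))
```

At heterozygous sites the signal is halved, so that 53% derived alleles correspond to about 6% contamination, which illustrates why these methods need multiple-fold coverage.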
An extension of this approach, implemented in the software DICE, [43] increases the set of informative sites to also include those that are derived at a lower frequency in the contaminating source population. For this, Racimo et al. [43] model the relationship of the ancient sample to a set of known background populations to infer the probability of each genotype in the endogenous genome. Contamination then corresponds to the excess of sequences with either ancestral or derived alleles compared to expectations derived from the most likely genotype (i.e., the absence of derived and ancestral alleles at homozygous ancestral and homozygous derived sites, respectively, or an equal proportion of ancestral and derived alleles at heterozygous sites).
In contrast to the methods described in the previous section, both methods require multiple-fold coverage at informative sites since the inference of contamination relies on deviations from the expected ploidy. Requiring higher coverage also ensures that contamination can be estimated when contaminating and endogenous genomes show little divergence. However, we note that a recently introduced method, admixfrog, which is similar to DICE in how it models contamination, can yield contamination estimates with low sequence coverage (0.1× for an archaic human genome; [79] ). This is achieved by taking advantage of sequence differences among multiple panel populations and assuming that the ancestries of the ancient genome derive from some of these source populations.
Another approach based on deviations from the expected ploidy is to take advantage of regions in the genome that are homozygous because of inbreeding. Contamination introduces alternative alleles randomly along the genome, including in these homozygous regions where only one allele is expected at any given position. An implementation based on this idea used homozygous regions in the genome of a Neanderthal woman whose parents were related at the level of half-siblings to jointly estimate error rates and the proportion of contamination. [80]

Methods Based on Patterns of Ancient DNA Damage
Substitutions associated with DNA damage have long been used as a signal to determine whether ancient sequences are preserved in a sample and to distinguish these sequences from contaminating sequences. [44,54,81,82,83,84] Contamination can be estimated from deamination patterns under the assumption that such patterns are absent in contaminating sequences. This is achieved by contrasting the true rate of damage-associated substitutions for the endogenous sequences to the observed rate of such substitutions for all sequences. To estimate the frequency of damage at the terminal positions of endogenous sequences, Meyer et al. [54] conditioned on the presence of a damage-associated substitution on one end of a sequence to enrich for genuine ancient sequences. The opposite ends of these sequences are then used to infer the frequency of substitutions in this ancient fraction, assuming that damage at both ends is independent. Because several biases can influence the frequency of substitutions at the ends of sequences (e.g., alignment bias against sequences with many substitutions), this method has not been used to quantify contamination. However, Meyer et al. could identify samples with substantial contamination using this approach.
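To illustrate the logic of contrasting the two rates, consider the following sketch. It is a thought experiment rather than a recommended estimator, since the biases mentioned above are ignored. It assumes that contaminating sequences carry no damage and that damage at the two ends of a molecule is independent. If d_all is the fraction of all sequences with a 5ʹ C-to-T substitution and d_cond is the same fraction among sequences that also show damage at the 3ʹ end (taken as the endogenous rate), then d_all = (1 - c) d_cond and c = 1 - d_all / d_cond. All names are hypothetical.

```python
def contamination_from_conditional_damage(n_5p_damaged_all, n_all,
                                          n_5p_damaged_given_3p, n_3p_damaged):
    """Per-sequence contamination from conditional damage frequencies.

    n_5p_damaged_all / n_all: 5' C-to-T frequency among all sequences.
    n_5p_damaged_given_3p / n_3p_damaged: 5' C-to-T frequency among
        sequences with damage at the 3' end (proxy for endogenous rate).
    Assumes undamaged contaminants and independent damage at both ends.
    """
    if n_all <= 0 or n_3p_damaged <= 0:
        raise ValueError("sequence counts must be positive")
    rate_all = n_5p_damaged_all / n_all
    rate_endogenous = n_5p_damaged_given_3p / n_3p_damaged
    if rate_endogenous == 0:
        raise ValueError("no damage observed in the conditioned sequences")
    # rate_all = (1 - c) * rate_endogenous  =>  c = 1 - rate_all / rate_endogenous
    c = 1.0 - rate_all / rate_endogenous
    return min(1.0, max(0.0, c))
```

If 20% of all sequences but 40% of the conditioned sequences show a 5ʹ substitution, the sketch returns 50%. The result is a proportion of sequences, not of bases.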
As a prior for the mitochondrial contamination estimates of Schmutzi, Renaud et al. [53] implemented a method, contDeam, that solely uses patterns of ancient DNA damage. Contamination is estimated as the mixture proportion between two models of substitutions, one with and another without ancient DNA damage. Compared to the previous approach, this method is not limited to the terminal bases of sequences, but instead infers site-specific deamination frequencies from sequences that exhibit a C-to-T substitution at one end.

BOX 2. Contamination per sequence or contamination per base?
Some contamination estimates give the proportion of bases that originate from contamination, while others give the proportion of contaminating sequences. These estimates can differ when the lengths of contaminating and endogenous sequences differ. For instance, if contaminating sequences are on average twice as long as endogenous sequences, then a given informative site is twice as likely to be covered by a contaminating sequence compared to an endogenous sequence. A method based on informative sites would thus give an estimate per base, but would in this example overestimate the proportion of contaminating sequences. Although methods based on informative sites naturally yield estimates per base, methods relying on ancient DNA damage produce estimates per sequence. However, methods based on ploidy or coverage can be formulated as either per base or per sequence. Downstream analyses often benefit from contamination estimates per base. To convert estimates per sequence to estimates per base, it is possible to either weight sequences proportionally to their length or subsample sequences so that longer sequences are represented more often. For instance, restricting the estimation of contamination to the subset of sequences overlapping specific sites will automatically correct for sequence length.
By taking into account the dependence between C-to-T substitutions along ancient DNA sequences, a more recent method models the observed frequency of damage-associated substitutions in single- and double-stranded parts of the original DNA fragments. [85] Each sequence originates from either contamination, which does not contain damage, or an ancient molecule that contains damage according to the explicit model of the structure of ancient DNA fragments. Like the previous method, the approach estimates contamination as the mixture proportion of sequences fitting to one of these two models. Note that both methods provide estimates that correspond to the proportion of contaminant sequences, while the estimates for most other methods correspond to the proportion of contaminant bases (Box 2).
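The relationship between the two kinds of estimate can be illustrated with a length-weighted conversion. The sketch below assumes that the mean lengths of contaminating and endogenous sequences are known, which is rarely the case in practice; the function and parameter names are hypothetical.

```python
def per_sequence_to_per_base(c_seq, mean_len_contaminant, mean_len_endogenous):
    """Convert a per-sequence contamination estimate to a per-base estimate.

    c_seq: proportion of sequences that are contaminants.
    mean_len_contaminant, mean_len_endogenous: mean sequence lengths of
        the two fractions (assumed known for this illustration).
    """
    if not 0.0 <= c_seq <= 1.0:
        raise ValueError("c_seq must lie between 0 and 1")
    if mean_len_contaminant <= 0 or mean_len_endogenous <= 0:
        raise ValueError("mean lengths must be positive")
    # Weight each fraction by the number of bases it contributes.
    contaminant_bases = c_seq * mean_len_contaminant
    endogenous_bases = (1.0 - c_seq) * mean_len_endogenous
    return contaminant_bases / (contaminant_bases + endogenous_bases)
```

With 10% contaminating sequences that are on average twice as long as endogenous ones (80 versus 40 bases), about 18% of the bases stem from contamination.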
Contamination estimates based on DNA damage have the advantage that they are independent of sequence differences between endogenous and contaminating genomes, and that estimates can be obtained even for very low-coverage samples (10 000 sequences can be sufficient to estimate contamination). However, current methods assume that the contaminant is devoid of deamination, which several studies have shown is not true in all cases. [25,41,42] Other factors, such as heterogeneity in preservation within the sample or biases in extraction or library preparation, may limit the validity of the deamination models. [86] With further knowledge about the structure of ancient DNA fragments and the reduction of bias from protocols, [87] the use of these methods may increase.

Perspectives
Although many methods to estimate contamination are now available, these methods are not applicable in all circumstances. In addition, more accurate estimates of contamination may help to infer more details about evolutionary and population history. Here we ask: What future developments may we expect?

Methods Based on Differences in the DNA Sequence
Knowledge about the relationships among human groups has increased substantially in recent years. [12] This knowledge includes the timing of population movements that resulted in large-scale admixtures. Some of these admixture signals may represent a useful source of information to estimate contamination. For instance, contamination from a present-day admixed population could be quantified in populations that pre-date these admixture events by measuring the admixture proportion.
Admixture between populations also results in large uninterrupted segments of different ancestries within an individual's genome. Contamination from other individuals will often carry a different ancestry at the same locations, so that contaminating sequences yield a reduction in linkage within ancestry segments. This information can in principle be used to quantify contamination.

Methods Based on Deviations from the Expected Ploidy
Sex chromosomes are often used to estimate contamination levels in ancient samples, since even for shallow sequence data a substantial difference in coverage is expected for these regions of the genome. Large-scale insertion/deletion differences or segmental duplications can similarly yield such expected differences in coverage. We are hopeful that a better knowledge of the frequency and location of these polymorphisms will lead to the development of new methods for quantifying contamination.

Methods Based on Characteristics of Ancient DNA
Future method development could include additional features of ancient DNA, such as the propensity of ancient DNA sequences to align to positions that are adjacent to purines. [47] This feature may be explained by a process called depurination, the loss of purine bases that can lead to a break of the DNA backbone. If present-day contamination does not exhibit this pattern, which is generated over a long time, then future contamination tests could include this signal.

BOX 3. Contamination and metagenomics
This review focused on the analysis of a single species' genome from an ancient sample. However, the study of the species composition in ancient metagenomic samples has gained attention in recent years, for instance, to reconstruct the diet of ancient populations [97] or to study the evolution of pathogens. [98] Often, these samples yield very few sequences from the genomes of interest. This means that even low levels of contamination pose a challenge. We note that contamination in the context of metagenomics datasets can also stem from misassignments of sequences from closely related endogenous species, an issue that goes beyond the scope of this review. The presence of ancient DNA damage-associated C-to-T substitutions can authenticate ancient sequences. Weiss et al. [82] devised a method to detect substitution patterns typical of ancient DNA when only a few hundred sequences are available. Even though this method does not exclude the presence of contamination, it can provide a positive indication that at least some sequences are ancient. When authenticating sequences assigned to a single species, one can also use the distribution of edit distances (number of substitutions to the reference genome per sequence) to further increase confidence in the species assignment. [99,100] This is because truly related sequences will tend to carry a lower number of mismatches to the reference compared to spurious alignments. Competitive mapping is another way to identify and exclude contamination. [101,102] After mapping sequences to the genomes of the organism of interest and its closest known relatives, only sequences that map to the genome of interest are retained, thereby increasing the confidence that the sequences originated from this genome. More recently, this idea was also applied to faunal datasets to exclude human DNA contamination. [103]
Finally, the availability of high-throughput profiling tools [104][105][106] and large databases of microbial sequences [107][108][109] makes it possible to characterize the microbial content of metagenomics datasets. This information can be used to authenticate the origin of the sequences. [110] For instance, oral and soil microbiomes differ sufficiently to rule out substantial environmental contamination when studying dental calculus. [111] For more in-depth reviews about authenticating ancient microbial sequences, see for instance refs. [112] and [99].
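The competitive mapping and edit-distance ideas described in Box 3 can be illustrated with a minimal Python sketch. It is not taken from any of the cited tools. The input structure (a per-read dictionary of edit distances to each candidate genome), the function names, and the rule that a read must be strictly closer to the genome of interest than to any relative are all assumptions made for illustration; real pipelines operate on alignment files and may use different retention criteria.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical input format: for each read, the edit distance of its best
# alignment to each candidate genome, or None if it did not align there.
Alignments = Dict[str, Dict[str, Optional[int]]]


def competitive_filter(alignments: Alignments, target: str) -> List[str]:
    """Return IDs of reads that align best, and unambiguously, to `target`."""
    kept = []
    for read_id, hits in alignments.items():
        target_dist = hits.get(target)
        if target_dist is None:
            continue  # read does not align to the genome of interest
        other_dists = [
            dist for genome, dist in hits.items()
            if genome != target and dist is not None
        ]
        # Assumed rule: keep the read only if it is strictly closer to the
        # target than to every other genome it aligns to.
        if all(target_dist < dist for dist in other_dists):
            kept.append(read_id)
    return kept


def edit_distance_histogram(
    alignments: Alignments, target: str, read_ids: List[str]
) -> Counter:
    """Count how many retained reads carry 0, 1, 2, ... mismatches to `target`."""
    return Counter(alignments[read_id][target] for read_id in read_ids)


if __name__ == "__main__":
    # Toy data with made-up genome names and distances.
    example = {
        "read1": {"species_of_interest": 0, "close_relative": 2},
        "read2": {"species_of_interest": 3, "close_relative": 1},
        "read3": {"species_of_interest": 1, "close_relative": None},
        "read4": {"species_of_interest": None, "close_relative": 0},
        "read5": {"species_of_interest": 1, "close_relative": 1},
    }
    retained = competitive_filter(example, "species_of_interest")
    print(retained)
    print(edit_distance_histogram(example, "species_of_interest", retained))
```

In this toy example, reads that align equally well to a relative are discarded along with those that align better elsewhere. A histogram of retained reads that is concentrated at low edit distances would be in line with the expectation stated in Box 3 for truly related sequences.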
DNA molecules invariably break into shorter fragments over time, [20] and the length of ancient DNA sequences has been used as a signal to indicate the presence of ancient DNA. [88] However, some studies have found a poor correlation between age and fragment length across different archaeological sites. [50,51] This poor correlation may be explained by different preservation conditions, but length distributions are also influenced by extraction and library preparation methods. [89] These issues make it difficult to distinguish comparatively recent contaminating molecules, which are in principle subject to similar processes, from those that are truly old.
Several other features of ancient DNA have been discussed and could perhaps be used to quantify contamination. Regulatory signals in DNA are partially preserved in the substitution patterns and breakage of ancient DNA molecules. For instance, a nucleosome map has been reconstructed from the periodicity of fragment length. [90] Such periodicity could serve as a diagnostic feature of the tissue of origin, for example bone for an ancient bone sample versus skin for contamination from handling. Similarly, one could leverage signals of methylation that also differ between tissues. It was indeed possible to reconstruct partial methylation maps from C-to-T substitutions in CpG context from ancient DNA libraries that have been treated with repair enzymes or amplified with a proofreading polymerase. [90][91][92]
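As a rough illustration of how periodicity in a fragment length distribution might be detected, the following Python sketch removes the smooth overall shape of a length histogram with a moving average and then looks for the lag with the highest autocorrelation in the residual. This is only one possible approach and is not the method of the cited study. The function names, the window size, the lag range, and the simulated data (including its 10 bp ripple) are assumptions chosen for the example.

```python
import math
from typing import List


def detrend(counts: List[float], window: int = 11) -> List[float]:
    """Subtract a centred moving average to remove the smooth overall shape."""
    half = window // 2
    residual = []
    for i in range(len(counts)):
        lo, hi = max(0, i - half), min(len(counts), i + half + 1)
        residual.append(counts[i] - sum(counts[lo:hi]) / (hi - lo))
    return residual


def autocorrelation(values: List[float], lag: int) -> float:
    """Autocorrelation of `values` at `lag` (requires lag < len(values))."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        return 0.0
    covariance = sum(
        (values[i] - mean) * (values[i + lag] - mean)
        for i in range(len(values) - lag)
    )
    return covariance / variance


def strongest_period(counts: List[float], min_lag: int = 5, max_lag: int = 20) -> int:
    """Return the lag (in bp) with the highest autocorrelation after detrending."""
    residual = detrend(counts)
    return max(range(min_lag, max_lag + 1),
               key=lambda lag: autocorrelation(residual, lag))


def simulated_length_counts(period: int = 10, min_len: int = 30,
                            max_len: int = 149) -> List[float]:
    """Synthetic histogram: an exponential decay with a periodic ripple (assumed)."""
    counts = []
    for length in range(min_len, max_len + 1):
        smooth = 1000.0 * math.exp(-(length - min_len) / 40.0)
        ripple = 1.0 + 0.2 * math.cos(2.0 * math.pi * length / period)
        counts.append(smooth * ripple)
    return counts


if __name__ == "__main__":
    counts = simulated_length_counts(period=10)
    print("Strongest period (bp):", strongest_period(counts))
```

Whether such a periodicity score differs enough between tissues, or between endogenous and contaminating molecules, to be useful as a contamination signal remains an open question that this sketch does not address.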

Conclusion
Ancient DNA is a rapidly growing field that has yielded unique insights into the past. However, from the start, the development of the field was hindered by the issue of contamination, which remains a major concern. The recently expanding field of palaeoproteomics is facing similar challenges. [93] Here, we have surveyed methods to quantify contamination and introduced a classification scheme of the signals they currently use.
We focused on methods applied to human samples, for which contamination is particularly problematic. However, we note that contamination is also a pervasive issue for the analysis of other organisms [24] (Box 3). For instance, it may be difficult to reliably study ancient samples from agricultural products, as these are often common in our environment. [82] Similarly, residual microbial sequences on lab-ware can influence metagenomic studies. [94] Our classification puts the different approaches into perspective and may help to identify further avenues of method development. Methods that provide accurate estimates of contamination in low-coverage data are particularly needed. Such methods will help researchers to make better decisions during data production, so that resources are not wasted and rare samples are preserved.
Contamination tests ensure the validity of insights drawn from ancient DNA studies. We hope that the continued development of methods to estimate contamination will increase the confidence in ancient DNA results and help to further push the limits of this field.