Quantifying and reducing cross‐contamination in single‐ and multiplex hybridization capture of ancient DNA

The use of hybridization capture has enabled a massive upscaling in sample sizes for ancient DNA studies, allowing the analysis of hundreds of skeletal remains or sediments in single studies. Nevertheless, demands in throughput continue to grow, and hybridization capture has become a limiting step in sample preparation due to the large consumption of reagents, consumables and time. Here, we explored the possibility of improving the economics of sample preparation via multiplex capture, that is, the hybridization capture of pools of double‐indexed ancient DNA libraries. We demonstrate that this strategy is feasible, at least for small genomic targets such as mitochondrial DNA, if the annealing temperature is increased and PCR cycles are limited in post‐capture amplification to avoid index swapping by jumping PCR, which manifests as cross‐contamination in resulting sequence data. We also show that the reamplification of double‐indexed libraries to PCR plateau before or after hybridization capture can sporadically lead to small, but detectable cross‐contamination even if libraries are amplified in separate reactions. We provide protocols for both manual capture and automated capture in 384‐well format that are compatible with single‐ and multiplex capture and effectively suppress cross‐contamination and artefact formation. Last, we provide a simple computational method for quantifying cross‐contamination due to index swapping in double‐indexed libraries, which we recommend using for routine quality checks in studies that are sensitive to cross‐contamination.


| INTRODUC TI ON
Advancements in DNA sequencing techniques have led to tremendous improvements in the analysis of degraded DNA. This is most apparent in the progress that has been made in ancient DNA studies, where complete or partial genomes have been recovered from thousands of ancient human individuals (Racimo et al., 2020). There are two methods that are used for the generation of these data. The first is shotgun sequencing (or the direct sequencing of DNA libraries) where a random subset of DNA molecules that were retrieved from a sample is sequenced. As molecules from the taxa of interest (e.g., humans) often constitute only a small proportion of the total molecules in a sample, this method is not economically feasible for many samples. The second strategy is hybridization-based capture of library molecules using synthetic DNA or RNA probes of the sequences of interest (Burbano et al., 2010;Carpenter et al., 2013;Cruz-Davalos et al., 2017;Fu et al., 2013;Suchan et al., 2021), which enriches libraries for molecules with similarity to predefined genomic targets, thereby greatly lowering sequencing costs.
As hybridization capture has become the dominant strategy for data acquisition in population-scale ancient DNA studies, sample sizes continue to increase (Mathieson et al., 2015;Narasimhan et al., 2019;Vernot et al., 2021;Wang et al., 2021;Zavala et al., 2021).
However, the consumption of time, consumables and reagents associated with all current capture protocols are beginning to turn hybridization capture into a serious bottleneck in sample preparation. One possible solution to this problem lies in the development of automated protocols for hybridization capture on liquid handling systems (Slon et al., 2017;Suchan et al., 2021). Automation has been shown to be particularly important for applications that necessitate high-throughput sample screening, such as the recovery of ancient human and faunal DNA from sediments. For example, fully automated sample preparation (DNA extraction, library preparation, and hybridization capture) was recently used to screen more than 700 sediment samples from Denisova Cave for the preservation of hominin and other mammalian mitochondrial (mt) DNA to reconstruct the occupational history of the site ).
Yet the development of higher throughput protocols, especially for hybridization capture, remains critical for further increasing the resolution of ancient DNA studies using sediments and other material. One possible strategy for achieving this goal is the capture of multiple libraries in a single reaction -that is, multiplex capture.
Multiplex capture has been successfully used for exome capture of modern samples in order to reduce costs for consumables, reagents and hands-on time while increasing throughput (Filer et al., 2021;Neiman et al., 2012;Ramos et al., 2012;Rohland & Reich, 2012;Shearer et al., 2012;Wesolowska et al., 2011), but has not been explored in the context of ancient DNA research.
One main concern regarding the use of multiplex capture with ancient DNA is the elevated risk of cross-contamination between samples. Specifically, it has been proposed that the parallel amplification of libraries tagged with sample-specific indices may lead to index swapping via jumping PCR during post-capture amplification (MacConaill et al., 2018;van der Valk et al., 2020) (see Figure 1 for a proposed mechanism). Index swapping is also known to occur on the sequencer itself when libraries from different samples are sequenced together: on Illumina platforms, it has mainly been linked to signal spreading between neighbouring/mixed clusters on the flow cell and/or to the incorporation of residual oligonucleotides remaining from library preparation during cluster generation (especially on Illumina's newer HiSeqX, HiSeq4000 and NovaSeq platforms) (MacConaill et al., 2018;van der Valk et al., 2020). Double-indexing, that is, using index sequences in both library adapters (or inline barcodes immediately adjacent to the insert sequence (Rohland & Reich, 2012)), has been found to be an efficient means to decrease misassignments of index reads (Kircher et al., 2012;MacConaill et al., 2018) and has become a widely used strategy for sequencing ancient DNA. Yet, it remains unclear whether double-indexing is sufficient in preventing cross-contamination between libraries if ancient DNA libraries are combined prior to hybridization capture and sequencing. This question is critical, as cross-contamination has been shown to be a substantial threat to data integrity in ancient DNA studies (Prufer & Meyer, 2015;van der Valk et al., 2020). In particular, for samples where the DNA of interest represents a miniscule fraction of the total DNA -for example, human DNA in sediments (Slon et al., 2017) or pathogens from ancient skeletal remains (Bos et al., 2015) -even a small amount of cross-contamination could mimic a true signal.
In this study, we present laboratory protocols and analysis tools . Approximately 50 mg of material from each sample was transferred in 2 ml tubes. DNA extraction was performed following the protocol described in (Rohland et al., 2018) with purification on a Bravo NGS Workstation B (Agilent Technologies) using binding buffer "D". The resulting DNA extract was then converted into a single-stranded double-indexed library on the Bravo NGS Workstation B following the protocol described in Gansauge et al. (2020). DNA extraction, library preparation and indexing PCR setup were performed in a dedicated clean room at the Max Planck Institute for Evolutionary Anthropology (Leipzig, Germany).

| Hybridization capture
Two successive rounds of hybridization capture with human mtDNA probes (Slon et al., 2017) were performed on the Bravo NGS Workstation B, either in-solution, as described in Fu et al. (2013), or on beads, as described in Slon et al. (2017). One microgram library was used as input for the first round of capture in the singleplex experiments. For multiplex capture, library pools were used containing 500 ng of each library. After each round of capture, the yield of library molecules was determined by qPCR using 1 µl of the capture eluate (from a total volume of 25 µl) as template (with a 1:10 dilution for the second capture round) and following the protocol outlined in Gansauge et al. (2020). Ten µl of the capture eluate was then amplified using Herculase II Fusion DNA polymerase (Agilent Technologies) in 20 µl reactions containing 1x Herculase II reaction buffer, 250 nM each dNTP, 1 µM each primer (primer pair IS5/IS6 [Meyer & Kircher, 2010] or primer pair IS105/ IS109 [see Figure S9 for primer sequences]), and 0.2 µl enzyme.  (Fu et al., 2013) and the purified DNA eluted in 22 µl EBT buffer (10 mM Tris-HCl, 0.05% Tween-20 pH 8.0) for the first capture round and 15 µl EBT for the second round. The second capture round was performed using 10 µl purified library as input. Post-capture quantification of library yield, library amplification, and purification were performed exactly as after the first capture round. Pools F I G U R E 1 Schematic of how index swapping and chimeric molecules can occur during amplification of DNA libraries and how these processes manifest in sequencing data. PCR amplification of an initial library molecule (a) can create both complete products (b) and incomplete extension products (c). Incomplete extension products can anneal to complete extension productions, and during subsequent rounds of PCR amplification, the polymerase (shown in yellow) will then extend the former product in a process known as jumping PCR, creating a new molecule that is a combination of the annealed complete and incomplete molecules (d,e). When only an incorrect index is copied over, the result is "index swapping" (d). When part of the DNA insert is copied a "chimeric" molecule is formed (e) containing 5 µl of each library (or library pool) were then created for sequencing on the Illumina MiSeq platform in 2 × 76 bp pairedend configuration with two 8-bp index reads. The exact parameters used for capture and library amplification in each experiment are provided in Supplementary Data File 3. One capture experiment was performed manually, following the detailed protocol provided in Supporting Information S3.

| Data processing
Bustard (Illumina) was used for base calling and the resulting sequencing data was processed using the pipeline described in Slon et al. (2017). In short, leeHom (Renaud et al., 2014) was used to merge overlapping paired-end reads into full-length molecule sequences, which were then mapped to the revised Cambridge reference sequence. Next, bam-rmdup (https://github.com/mpiev a/bioha zardtools) was used to remove sequences shorter than 35 basepairs, along with PCR duplicates. The remaining sequences were assigned to different mammalian families using BLAST (Altschul et al., 1990) and MEGAN's lowest common ancestor algorithm (Huson et al., 2007) as described in Slon et al. (2017).

| Identification of index-swapping and crosscontamination
Although index swapping has been extensively studied (Griffiths et al., 2018;MacConaill et al., 2018;Owens et al., 2018;Vodak et al., 2018), no broadly applicable method has been proposed, to our knowledge, that aims to directly quantify the level of crosscontamination it causes on a per-library level. Here we suggest a simple and straightforward computational method for detecting and quantifying events of cross-contamination among libraries carrying sample-specific unique double-indices that is based solely on an analysis of the observed counts of index pairs in a pool of libraries that are sequenced together. The method makes use of the notion that if each index is used only once, two successive index swapping events are required to fully replace one index combination by another. Cross-contamination thus proceeds through a transient state where only one of the indices has been replaced by that of another library (see Figure 1 for an illustration). When the counts for all possible combinations of known index sequences are plotted so that expected index combinations fall on the diagonal, these transient states occupy the corners of a rectangle defined by the two libraries between which index swapping has occurred (see Figure 1 for an illustration). We therefore refer to these transient states as "corner libraries". It is important to note that corner libraries can also arise through processes that are unrelated to jumping PCR, such as errors in sequencing, misincorporation of nucleotides during PCR or cross-contamination of indexing primers. However, these processes are likely to lead to an exchange of only one of the indices. This is in contrast to index swapping, which due to the stochasticity of jumping PCR is expected to result in the symmetrical formation of two corner combinations for each pair of libraries that are amplified in the same reaction.
Our method recognizes possible cases of cross-contamination by identifying all pairs of corner combinations in the data that are compatible with index swapping between one of the expected index pairs and a contaminating library with a different pair of indices (see Supporting Information S1 for further details on the implementation). We then leverage the expectation of symmetry in any true cross-contamination and assume that the number of molecules occupying the transient state is closest to the smaller of the two corner library counts. The ratio between the number of sequences carrying this corner combination and the number of sequences carrying the expected index combination is then used as a proxy for the frequency at which one of the indices in the sample library is replaced by an index originating from the putative contaminant library. As this process has to occur twice in order for both indices to be exchanged and a misassignment to occur, we estimate the probability of this event by squaring the frequency at which one index is exchanged. The probability of cross-contamination can then be multiplied by the number of sequences carrying the expected index pair to estimate the number of sequences that may have originated from a specific contaminant library, but were assigned to the sample library due to index swapping. In addition to estimating individual cross-contamination events, the sum of all events affecting a specific sample library is computed by adding all events that have been determined to contribute at least 0.5 sequences. It should also be noted that the search space of possible cross-contamination events is not confined to libraries carrying indices that were used in the current library pool. Instead, every observed combination of indices that were not used in the sample library pool but are "known" (i.e., are physically available in the laboratory and/or were used in previous experiments) is evaluated as a possible source of crosscontamination. This strategy in principle enables the detection of cross-contamination across experiments and library pools, although limited to libraries with uniquely sample-specific indices (i.e., not between libraries that share one or both of their indices).

| Exploring the feasibility of multiplex capture for ancient DNA libraries
To determine the feasibility of multiplex capture for the enrichment of genomic targets in ancient DNA libraries, we created two pools of double-indexed libraries prepared from ancient sedimentary DNA. All libraries had previously been found to contain small traces of human mtDNA, six of them with damage patterns indicative of ancient DNA . Each pool contained eight libraries that had been combined in equal amounts, and was subjected to hybridization capture using human mtDNA probes. For comparison, we enriched the same libraries individually for human mtDNA (henceforth referred to as "singleplex" capture). Singleand multiplex captures were performed using a method that utilizes single-stranded biotinylated DNA probes targeting the human mtDNA genome (Fu et al., 2013;Slon et al., 2017), in an implementation for 384-well automated liquid handling .
Two consecutive rounds of capture were performed and the captured molecules were amplified to PCR plateau (30 cycles Table S1 for an example). Applying this method to the multiplex experiment,

| Determining the impact of PCR cycle number on artefact formation
To reduce index swapping and the formation of molecules with chimeric inserts in hybridization capture, we explored possible options for decreasing artefacts formation in the library amplification steps.
As jumping PCR is known to occur most frequently when amplification reactions reach plateau (Paabo et al., 1990),

| Library amplification by two-step PCR
We next investigated whether PCR conditions can be further optimized to limit index swapping in pre-and post-capture amplification of indexed libraries. As jumping PCR is initiated by crosshybridization of incomplete extension products, we hypothesized that it may be less likely to occur under more stringent annealing conditions. We therefore extended the length of the primers that were previously used for library amplification (IS5 and IS6: Meyer & Kircher, 2010), which resulted in an increase in the annealing temperature from 60 to 68°C (primers IS105 and IS109; see Figure S9), enabling a two-step thermal profile in PCR. We tested these PCR conditions by amplifying the same pool of 92 libraries that had been previously used to evaluate the possibility of PCR-induced index F I G U R E 4 Proportion of shared alignment start and end coordinates in human mtDNA sequences obtained after the first and second round of single-and multiplex capture. The black dots represent the expected proportion of shared start and end coordinates in the absence of chimeric molecules, determined from downsampled shotgun data (in black, regression line calculated using the loess method). Data from the capture experiments are coloured by library. Values for the sharing of start and end coordinates are plotted independently swapping (Section 3.1), again omitting capture and amplifying the libraries to PCR plateau. In contrast to the earlier experiment, where unexpected index pairs contributed to 33.6% of the data and crosscontamination was estimated to 10.4% on average per library, MiSeq data generated from the amplification product of the new primer pair and two-step cycling produced only 8.85% unexpected index pairs and cross-contamination reduced to less than 0.25% per library (Supplementary Data File 3). Increasing the annealing temperature in PCR is thus another potential strategy for reducing index swapping during library amplification pre-and post-capture, even if amplification proceeds to PCR plateau.

| An optimized protocol for automated hybridization capture
We have shown that an optimal protocol for single-or multiplex hybridization capture of double-indexed libraries should utilize both two-step PCR cycling and a reduction in the number of PCR cycles in post-capture amplification to avoid cross-contamination and the formation of chimeric molecules. However, the adjustment of PCR cycles poses a challenge to the automation of hybridization capture in plate format on liquid handling systems, which is essential for laboratories aiming at high-throughput sample preparation. One solution to this problem could be to limit the cycle number for all reactions on a plate to the maximum number of cycles that can be performed before the richest library reaches PCR plateau (as performed in section 3.2). However, this strategy is problematic in that it produces unequal concentrations of amplification product (as shown in  Multiplex capture may also be used as a screening tool when working with large sample sizes to determine whether ancient DNA from the taxa of interest is present, possibly followed by a subsequent individual enrichment of libraries for larger genomic targets.

| DISCUSS ION
Contamination is a serious threat when working with small quantities of DNA and measures have been implemented to minimize its impact in ancient DNA research. Common strategies for contamination prevention include the use of clean rooms, the separation of laboratory space for pre-and post-amplification work, and the inclusion of negative controls to detect sporadic contamination with exogenous DNA or between samples (Cooper & Poinar, 2000). However, it has been assumed that contamination is unlikely to occur after library preparation, especially when libraries are tagged with sample-specific indices in both adapter sequences.
Contrary to this expectation, we detected small levels of sporadic cross-contamination even in sequence data from libraries that were enriched for mtDNA in singleplex captures, indicating that an exchange of fine droplets or aerosols occurred between libraries from different samples before they were amplified to PCR plateau pre-or post-capture. These results highlight that caution is required when reamplifying DNA libraries, even if they are double-indexed. Primers that allow for the highest possible annealing temperature should be used for this purpose to minimize index swapping, such as the IS105/ IS109 pair provided here, and PCR plateau avoided whenever possible. These measures also prevent the formation of chimeric insert sequences, which can occur if sequences from small genomic targets are enriched by successive rounds of hybridization as shown here.
We also provide a simple computational method that can be used to examine sequence data to determine if cross-contamination by index swapping has occurred and to quantify its impact. As the method requires only the number of sequences assigned to different index pairs it can be applied across various platforms and be used to test for cross-contamination in various capture protocols and kits. In order to maintain the quality and transparency of results, we encourage labs to include this test in their routine quality control pipeline and publish the results. In summary, our work shows that it remains critical to monitor and suppress the formation of artefacts in ancient DNA sample preparation and sequencing as technological and methodological advances continue to push the limits for DNA recovery.

ACK N OWLED G EM ENTS
This project was funded by the Max Planck Society. We thank B.
Schellbach and A. Weihmann for their help with sequencing, J. Kelso and J. Visagie for helpful discussions and assistance with raw data processing.

CO N FLI C T O F I NTE R E S T
The authors declare no conflicts of interest.

O PE N R E S E A RCH BA D G E S
This article has earned an Open Data Badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://doi.org/10.5061/ dryad.2280g b5t7.

DATA AVA I L A B I L I T Y S TAT E M E N T
The files containing sequence counts used for calculating cross-