Using in silico predicted ancestral genomes to improve the efficiency of paleogenome reconstruction

Abstract Paleogenomics is the nascent discipline concerned with sequencing and analysis of genome‐scale information from historic, ancient, and even extinct samples. While once inconceivable due to the challenges of DNA damage, contamination, and the technical limitations of PCR‐based Sanger sequencing, following the dawn of the second‐generation sequencing revolution, it has rapidly become a reality. However, a significant challenge facing ancient DNA studies on extinct species is the lack of closely related reference genomes against which to map the sequencing reads from ancient samples. Although bioinformatic efforts to improve the assemblies have focused mainly in mapping algorithms, in this article we explore the potential of an alternative approach, namely using reconstructed ancestral genome as reference for mapping DNA sequences of ancient samples. Specifically, we present a preliminary proof of concept for a general framework and demonstrate how under certain evolutionary divergence thresholds, considerable mapping improvements can be easily obtained.


| INTRODUC TI ON
The number and quality of sequenced genomes is rapidly increasing as DNA sequencing technologies become increasingly faster and cheaper per base sequence (Mardis, 2011). This has been revolutionary across the fields of biology, enabling researchers to not only introduce genomic approaches to model organisms, but across the tree of life (Ellegren, 2014). In parallel, the sequencing of genomes from ancient DNA (aDNA) within historic, ancient, or otherwise degraded samples has also started to become of growing interest, and these so-called paleogenomic studies have helped us better understand the history of ancient species and populations (Brunson & Reich, 2019;MacHugh et al., 2017;Nielsen et al., 2017;Przelomska et al., 2020). Although DNA from such samples has played an important role in understanding species relationships, it is not without its limitations, since the molecules recovered are usually far from optimal for genome sequencing due to the challenges of postmortem damage (Hofreiter, Jaenicke, Serre, von Haeseler, & Pääbo, 2001;Lindahl, 1993). For instance, ancient samples may often have a low endogenous DNA content due to infiltration of DNA "contaminants" derived principally from microbial sources or have undergone postmortem degradation (Poinar et al., 2006).
Although the exact contributions of individual processes to the damage will vary with the environment surrounding specimens (Hofreiter, Jaenicke, et al., 2001;), key processes include cross-linking of DNA to other molecules in the surrounding matrix (Pääbo, 1989), fragmentation through the introduction of abasic sites or nicks in the phosphate backbone, and generation of miscoding lesions due principally to hydrolytic deamination (Lindahl, 1993). These DNA damage processes have considerable relevance for the potential of generating genomic sequences from ancient materials. At the most basic level, due to the degradation of DNA as a function of temperature and time (Lindahl, 1993;Smith et al., 2001), there is an ultimate time window beyond which no usable DNA can be recovered. While initially argued by many to lie in the range of tens to few-100 thousand years ago (KYA) for temperate preserved material (Wayne et al., 1999), recent groundbreaking studies have pushed this window back to over 700 KYA in permafrozen bones . Of further relevance, is the effect that this damage (and contaminating exogenous DNA) plays on the quality of genomes that can be sequenced from ancient samples. Endogenous DNA sequences recovered are predominantly <100 base pairs (bp) in length (Poinar et al., 2006); thus, the de novo assembly of extinct eukaryotic genomes-techniques that largely require long DNA fragments in order to better resolve sequence variation and regions of low complexity-is currently not possible (Kircher, 2012), even though some assemblies have been attempted on ancient microbial genomes and the Tasmanian tiger (the latter without much success (Feigin et al., 2018)). Thus, ancient genome reconstructions (whether of extinct or extant species) have been almost exclusively based around mapping sequence data to references. While powerful when a reference is closely related (e.g., ancient versus modern horse; Orlando et al., 2013), sometimes no closely related species exist, and even when they do, as evolutionary distance increases between an ancient sample and modern reference, so does the challenge of mapping the data and thus the percentage of endogenous DNA that can be mapped (Shapiro & Hofreiter, 2014). This effect can easily be exemplified using sequence data from modern genome projects, specifically through mapping raw sequence reads of one taxa against the genome of a related species. This was clearly shown in a seminal paper (Prüfer et al., 2010) Zhang et al., 2014) can be mapped to the genome of the ca 80 Mya. divergent ostrich using standard mapping software such as BWA (Li & Durbin, 2009), with mapping reads predictably located within the most conserved exons (Jose Samaniego Castruita, unpublished observation). The implications of such studies are threefold. Firstly, with an estimated divergence of 50-80 Mya. between these species and one of their extinct, high profile relative clades, the moas of New Zealand , reconstructing a near complete genome in this way is going to be extremely challenging, if not impossible. Secondly, mapping-based estimates of the endogenous DNA content of sequencing datasets generated through shotgun sequencing of extinct species are likely underestimates (a 1% mapping success rate of a moa sample, at estimated 5% overall mapping success, suggests an overall endogenous DNA content of 20%). Thirdly, as mappability directly correlates with the degree of conservation between the two genomes, any evolutionary relevant changes outside of the most conserved regions will not be recovered (Prüfer et al., 2010;Richmond et al., 2016).
Clearly, given the interest in the reconstruction of complete genomes of now extinct taxa, this is not an optimal situation for us to be in. If we are to accept the premise that we will be unable to improve the quality of the DNA recovered from many such samples, the question is raised as to whether there is any other possibility for improving the quality of genome recovered. In this regard, one potential might be through exploring computational approaches to deal with the particular challenge of the evolutionary divergence.
Using bioinformatic tools to improve the quality of data generation in paleogenomics is not a new approach, although to date most research has addressed this challenge by accounting for the miscoding lesions derived by postmortem damage that plague the ancient DNA molecules. Derived principally from the deamination of cytosines to uracil and its chemical analogues (Gilbert et al., 2003(Gilbert et al., , 2007Hansen et al., 2001;Hofreiter, Jaenicke, et al., 2001;Lindahl, 1993;Pääbo, 1989), represented as C->T and G->A mutations in the resulting sequence, and in particular concentrated at the 5′ and 3′ extremes of the templates sequenced (Briggs et al., 2007;Brotherton et al., 2007), these mismatches significantly reduce mapping efficacy (Schubert et al., 2012). Thus, several informatic-based approaches exist, including rescaling base quality at sequence extremes to account for the damage probability (Jónsson et al., 2013), informed modification of alignment parameters (e.g., disable seed, or allowing for more mismatches; Orlando et al., 2015;Schubert et al., 2012), or adoption of aligners customized for aDNA, that do not assume uniform mismatches along the sequence read and enable some degree of sequence divergence between the reference and the target species. A number of probabilistic aligners based on position-specific scoring matrices (PSSM) have been developed in this regard, such as MIA (Briggs et al., 2009), ANFO (Briggs et al., 2007), BWA-PSSM (Kerpedjiev et al., 2014), and generally show good performance for short and/or low quality data. These can therefore improve the fraction of alignable hits, although some show running times only compatible with alignments against relatively small reference genomes (e.g., mitochondrial genomes). Furthermore, some of these approaches (e.g., allowing more mismatches, or PSSM mis-specifications) can have unintended detrimental effects, leading to increased mapping errors. For this reason, accurate mapping against distantly related genomes is very challenging and, as a result, new mapping methods are required to improve the amount of genome recovered.

| Use of ancestrally reconstructed genomes as reference
The growing availability of reference genomes from extant organisms indicates we are entering an era in which new mapping approaches should be considered for paleogenomics. While the above discussed approaches that aim to minimize the effects of miscoding lesions are useful, ultimately the evolutionary distance between extinct-extant species remains the principal challenge (Richmond et al., 2016). In this regard, one theoretically attractive approach is reducing in silico this evolutionary distance, through ancestral genome reconstruction, a technique whose use has been explored at both the sequence and genomic architecture level. Embedded in particular in maximum parsimony (Fitch, 1971), maximum likelihood and Bayesian frameworks (Huelsenbeck & Bollback, 2001;Pagel, 1999;Pupko et al., 2000;Yang et al., 1995), ancestral sequence reconstruction (ASR) attempts to infer the state (amino acid or DNA) of the ancestral sequence through the use of modern sequences within a phylogenetic context (Randall et al., 2016). These have, for example, been used to reconstruct genomic traits such as the visual pigments of F I G U R E 1 Passeriformes phylogeny used for sequence reconstruction. Test species are depicted in bold, and reconstructed ancestral nodes 1, 2, and 3 as colored circles (red, green, and blue, respectively) dinosaurs (Chang et al., 2002), the ancestral steroid hormone receptor (Thornton et al., 2003), alcohol dehydrogenases from yeast (Thomson et al., 2005), primate mitochondrial DNA (Krishnan et al., 2004), and over 500MB of the ancestral archosaur genome (Green et al., 2014). At the genomic architecture level, recent developments provide algorithms that handle repeats as well as diverse types of genome rearrangements and chromosome structures (Chauve & Tannier, 2008;Jones et al., 2012;Ma et al., 2006), and such methods have been used to reconstruct gene content (Cohen et al., 2010) and gene order, with the latter being used for reconstructing ancestral genomes organization of taxa ranging from bacteria (Fremez et al., 2007) to animals (Alekseyev & Pevzner, 2009;Green et al., 2014). In this study, we explore the potential use of ASR to reconstruct ancestral genomes and test their potential use as reference genomes, so as to improve the mapping of ancient samples.

| Exploratory analysis
Taking advantage of data generated recently by the Bird 10k Genome Consortium, kindly provided by Guojie Zhang and Shaohong Feng prior to its formal release (Feng, et al., in press), we selected 31 Passeriformes species whose genomes have been de novo sequenced and assembled ( Figure 1 and Table 1), and aligned them using the zebra finch (Taeniopygia guttata) high quality genome as anchor (Warren et al., 2010). Since ASR is highly dependent on divergence times, we selected two test species with different divergence times to their respective sister species: Sturnus vulgaris (diverged from its closest sister species, Rhabdornis inornatus, ~20 Myra) and Pteruthius melanotis (diverged from its closest sister species, Sylvietta virens, ~40 Myra) (Kumar et al., 2017). For each one of these cases, we removed the test species from the alignment and reconstructed the ancestral sequences.
To evaluate the effects of using an ASR as reference, we assessed the mappability of simulated reads to several genomes: (i) test species reference genome, (ii) closest sister species reference genome, (iii) alignable/conserved fraction of the closest sister species reference genome, (iv) alignable/conserved fraction of reconstructed ancestral sequences (from 3 ancestral nodes), and (v) hybrid reference genomes between (ii) and (iv).

| Whole-genome alignment
Whole-genome alignments were created with the LASTZ + MULTIZ (Blanchette, 2004) pipeline across the 31 bird species chosen (Table 1), using the zebra finch genome as the anchor genome.

| Ancestral sequence reconstruction
Due to computational constraints, we excluded some of the outgroup species in our alignment since they would be unlikely to affect the reconstruction. That meant that, for our ancestral sequence reconstruction, only the first 15 species in Table 1 were used. To reconstruct the ancestral sequences from our species, we used RAxML v8.2.11 (Stamatakis, 2014). Briefly, and since we used the TimeTree (www.timet ree.org) tree topology for the MULTIZ multiple-alignment phase, we decided to use the same topology as a fixed parameter, and only estimate its branch lengths ("-f e -m GTRGAMMA -t Passeriformes_species.B10k.nwk"); for computational reasons, and assuming it to be representative of the entire genome, only the largest chromosome (chr1) was used on this step. Then, using the tree topology from TimeTree with our estimated branch lengths, we again used RAxML to estimate the ancestor posterior probability distribution at each internal node, and choose the allele with the highest probability ("-m GTRGAMMAX --HKY85"). All missing positions, as well as those with more than 90% missing data in the multiple sequence alignment, were removed from the ancestral reconstructed sequence.

| Hybrid reference sequence
Since only a fraction of the genome is alignable between all 31 species, we tried to further improve mappability by creating a hybrid genome between each of the ancestrally reconstructed genomes and the, respectively, closest sister species of each test species. For that, we replaced all alignable parts of the sister species reference genome, with each of the ancestral reconstructed sequences.
These reads were then mapped using BWA-MEM v0.7.15-r1140 (Li & Durbin, 2009) with default parameters plus option "-M," against the sequences from each method. The reconstruction efficiency was then assessed from the percentage of total mapped reads, percent-

| RE SULTS AND D ISCUSS I ON
Our results as summarized in Table 2 show that, as expected, the sister species is not always a good reference genome. For short evolutionary distances (20 Mya.), it serves as a good proxy with only a slightly lower mapping efficiency than the original genome, as inferred from the percentage of mapped reads (95% vs. 99%), percentage of covered genome (78% vs. 83%), and average depth (1.74X vs.
If the genetic distance between our paleospecies and the closest sister species is big, then we hypothesized that the use of an ancestrally reconstructed genome as reference should improve mapping efficiency. At first sight, this does not seem to be true, since these reconstructed genomes resulted in a lower percentage of mapped reads than obtained when using the sister species' or the reference genome (57%, 68%, and 69% vs. 71% or 99.6%). However, we believe this is due to the incompleteness of the reconstructed genomes, since only a fraction of the original 31 genomes can actually be aligned and, consequently, is possible to infer the ancestral state from. If we just take these conserved regions into account, we can clearly see that ancestrally reconstructed genomes show higher mapping efficiencies than the correspondent sister genome, especially for longer evolutionary distances, with higher percentage of mapped reads (57%, 68%, and 69% vs. 56%), covered genome (62%, 71%, and 73% vs. 63%), and average depth (1.22X, 1.51X, and 1.55X vs. 1.24X); at short evolutionary rates the increase is marginal, reflecting the high similarity of the sequences. Interestingly, the mapping efficiency increases considerably with evolutionary distance probably due to the fact that, at deep ancestral nodes, more information is used and so a more complete reconstruction is possible (reflected in longer sequences).
In order to improve the apparent lower mapping efficiency of the ancestrally reconstructed genomes, we devised one last approach where we created a hybrid sequence between them and the sister species genome. With this approach, we hoped to mitigate the effect of incomplete genome reconstruction, by using the sister species reference genomes where ancestral information was not available. This new approach greatly improved mapping efficiency, with the best case scenario largely surpassing the sister genome on all metrics: percentage of mapped reads (82% vs. 71%), percentage of covered genome (67% vs. 58%), and average depth (1.41X vs. 1.14X); at shorter distances, as seen above, the improvement was marginal or nonexistent.

| CON CLUS ION
We are entering an era where new approaches should be considered for paleogenomics in light of the growing interest in sequencing the genomes of now extinct species. While considerable effort has been dedicated to account for the effects of miscoding lesions (e.g., Jónsson et al., 2013) on the accuracy of the reconstructed ancient genomes, we believe that ultimately the evolutionary distance that separates extinct from extant relative species will remain a greater challenge. Here, we have explored the possibility of reducing this distance in silico by ancestral genome reconstruction.
Overall, our results show that ancestral genome reconstruction can be helpful at reducing the distance between extinct and extant species in silico, in particular under the "hybrid" genome approach.
However, at the same time, our results also highlight some of the drawbacks associated with such an approach. This work was intended as a preliminary proof of concept and, as so, we can foresee several possible improvement venues. Firstly, we acknowledge that our analysis was limited to birds, whose genomes are not only relatively small compared to those of many other vertebrates, but also well known for their degree of synteny conservation (Zhang et al., 2014). Given that our aim was to improve mappability (as opposed to de novo assembly), we suspect that is unlikely to affect the results; however, future analyses could be done using other taxa to explore this. Secondly, we note that, unlike our in silico data, true ancient DNA extracts contain molecules that are fragmented and thus span a range of template sizes, most of which are very short, and can contain DNA damage driven miscoding lesions (Poinar et al., 2006). While both would affect the efficiency of mapping the data to the reference genomes, including modifying the coverages obtained and possibly even ultimately incurring biases in population genomic analyses done using the genome (Gopalakrishnan et al., 2017;Orlando et al., 2013)-and certainly it might be interesting for future studies to explore this effect in detail, for example, simulating damaged DNA using software such as gargammel (Renaud et al., 2017)-we do not believe that this fundamentally changes our observation that ancestral genome reconstruction has a role to play when mapping deeply divergent taxa to a reference. Thirdly, and probably the most limiting step, improving the multiple sequence alignment between extant species can dramatically increase the fraction of aligned genomic regions and, as such, the genome fraction that can be reconstructed (Leimeister et al., 2019). Lastly, the ancestral inference step might also have room for improvements, by using, for example, more complex Maximum Likelihood (Yang et al., 1995) or Bayesian (Bouckaert et al., 2014;Lartillot et al., 2009) methods and models, especially those using marginal ancestral state reconstruction using the "classical" empirical Bayesian method that uses all sequences in the tree.

CO N FLI C T O F I NTE R E S T S
The authors declare no competing interests.

DATA AVA I L A B I L I T Y S TAT E M E N T
All data and software used in this study are already in the public domain.