Inference of natural selection from ancient DNA

Abstract Evolutionary processes, including selection, can be indirectly inferred based on patterns of genomic variation among contemporary populations or species. However, this often requires unrealistic assumptions of ancestral demography and selective regimes. Sequencing ancient DNA from temporally spaced samples can inform about past selection processes, as time series data allow direct quantification of population parameters collected before, during, and after genetic changes driven by selection. In this Comment and Opinion, we advocate for the inclusion of temporal sampling and the generation of paleogenomic datasets in evolutionary biology, and highlight some of the recent advances that have yet to be broadly applied by evolutionary biologists. In doing so, we consider the expected signatures of balancing, purifying, and positive selection in time series data, and detail how this can advance our understanding of the chronology and tempo of genomic change driven by selection. However, we also recognize the limitations of such data, which can suffer from postmortem damage, fragmentation, low coverage, and typically low sample size. We therefore highlight the many assumptions and considerations associated with analyzing paleogenomic data and the assumptions associated with analytical methods.


Impact Summary
The search for signatures of natural selection on the genome is still most commonly based on screening modern genomes for regions of reduced diversity or increased differentiation between populations. This framework is essentially a snapshot in time of a process that may have played out over many millennia, during which changes in population size, ecology and gene flow between populations may have played a role in determining genetic variation. Here, we outline how utilising ancient DNA (aDNA) techniques to sequence time series of genomes spanning changes in natural selection can provide a more nuanced understanding of how natural selection has impacted genomic variation in present-day populations. In particular, we argue that the advent of paleo-population genomics, in which datasets of multiple individuals spanning millennia have been sequenced, offers unprecedented opportunity to estimate changes in allele frequencies through time. We outline considerations and the types of data that would be needed for the inference of positive selection on traits associated with single and many genes (polygenic), genome-wide negative (background) selection, and balancing selection. However, we recognise that there are currently few datasets existing that are suitable for these types of investigation. There is thus a bias towards study species that have undergone strong selection over relatively recent timescales that are within the scope of aDNA, such as has occurred in domesticated species. We also detail a number of caveats associated with working with aDNA data, which is by its nature comprised of short, degraded DNA fragments, typically with a high degree of missing data and DNA damage patterns. Lastly, we highlight how the predicted move towards increasingly big datasets in aDNA studies will require the adoption of new analytical techniques and efficient data storage. Emerging developments, including the recording of genealogical variation across hundreds or thousands of individuals as tree sequences, and the increased automation of analyses through machine learning, which offer exciting new opportunities for the inference of selection from aDNA.

Introduction
Most population genetic studies use comparisons at a single point in time or over timescales of only a few generations, and infer ancestral states using coalescent-based methods. This snapshot of evolution may only be partially informative, as diverging populations may have experienced changes in allele frequencies due to gene flow and population size changes, which can be difficult to disentangle from signatures of natural selection (Fig. 1). Given the temporal nature of evolution, ancient DNA (aDNA) techniques are obvious and promising tools with which to track the chronology and tempo of genomic change, and thereby provide unique opportunities for detecting distinct footprints of selection. The advent and increasing efficiency of high-throughput sequencing, combined with recent advances in aDNA extraction, library build, and data processing (Pinhasi et al. 2015;Gansauge et al. 2017;Link et al. 2017;Carøe et al. 2018;Renaud et al. 2019;Dabney and Meyer 2019;Martiniano et al. 2019;Wales and Kistler 2019), now allow the generation of paleopopulation genomic datasets, thus offering unprecedented opportunities to better understand the chronology and tempo of evolution at the genomic level.
In this review, we advocate for increased utilization of paleogenomics within the field of evolutionary biology, allowing natural selection to be investigated along the evolutionary continuum, at multiple time points throughout the process. We aim to give a nuanced discussion on the present role and future potential for aDNA data to contribute toward our understanding of selection in a broad range of organisms. We first describe how sampling across a time series can increase our understanding of the selective processes underlying patterns of genomic variation in contemporary data. We highlight the advances that have allowed the field of paleogenomics to progress over the past decade and significant challenges that remain associated with working with aDNA data. We then outline the potential and the limitations of studying different types of selection by incorporating aDNA time series data. Throughout, we try to raise awareness of the shortcomings of such data by exposing its caveats. For example, we discuss the merits of using few ancient samples for elucidating genome-wide processes such as background selection, while acknowledging that the inference of selection will remain limited to a few key study species until sufficient sample sizes accrue. We finish with a look forward to future innovations and a summary of the state of the field.

Temporal Sampling
The use of temporally spaced genetic data is a promising way to circumvent some of the problems inherent to methods of selection inference. The utility of analyzing time series is illustrated by "evolve and resequence" experiments combining experimental evolution under controlled laboratory or field mesocosm conditions with next-generation sequencing. Evolve and resequence experiments have elucidated the genetic changes underlying evolution in real time over multiple generations (Long et al. 2015;Schlötterer et al. 2015;Rajpurohit et al. 2018) but are limited to species with short generation times (e.g., Turner. & Miller 2012;Bosshard et al. 2017;Good et al. 2017) and for asexually reproducing populations (Bennett et al. 1990;Baym et al. 2016;Good et al. 2017). However, due to a lack of recombination, selection dynamics in these populations cannot easily be generalized to sexually reproducing populations, which will be the main focus of this review. Furthermore, the controlled conditions of a laboratory experiment or even field mesocosm cannot capture the full complexity of evolutionary processes in the wild. Experimental populations in evolve and resequence studies can suffer from an excess of rare alleles (if sampled from large wild populations), extended linkage disequilibrium due to limited experimental population size and masking selective sweeps, and pseudoreplication (Baldwin-Brown et al. 2014; Kelly and Hughes 2019). Studies of some natural populations have tracked the action of selection over several generations (Hendry et al. 2000;Grant and Grant 2002;Marques et al. 2018), but these remain inherently rare and limited to instances of unusually rapid evolution.
An alternative and commonly used approach to understand the temporal context of evolution in natural populations is to sample along the so-called "speciation continuum" by comparing sister taxa at different stages of divergence from each other (Feder et al. 2012;Seehausen et al. 2014;Shaw and Mullen 2014). For instance, this approach has been applied to investigations of the accrual and erosion of genomic differentiation due to linked selection (e.g., Burri et al. 2015) and admixture (e.g., Martin et al. 2013), respectively. However, samples from natural populations along the speciation continuum are not equivalent to sampling the same population through time. Ancestral demography, differences in the presence and strength of selection pressures, and the starting substrate of standing genetic variation may be important factors to explain the variation in genomic summary statistics among populations (e.g., Fang et al. 2019;Miller et al. 2019) that are overlooked when comparing across the speciation continuum.
Sampling genomes from multiple time points in the past using aDNA techniques offers the possibility to study the chronology and tempo of natural selection across evolutionary timescales. Using genomes from the past concurrent with ecological data relevant to selection pressures, selection and its timing and strength can be inferred by directly estimating allele frequencies at each time point. It is, to some extent, analogous to "experimental evolution in the lab" and this can allow the accurate joint inference of demography and the disentangling of selection from drift in nonequilibrium populations based on differences in the rate of change in allele frequencies between selected and neutral loci (Bank et al. 2014).

The Scope and Limitations of aDNA
aDNA has previously been defined as DNA recovered postmortem from nonideal biological material (Navascués et al. 2010). We adopt this definition, which can be applied to datasets of museum specimens spanning past decades, through to archaeological remains dated back across millennia. This material is nonideal relative to modern samples in several respects. aDNA is subject to postmortem damage, fragmentation, and decay through processes such as hydrolysis, purination, and deamination (Lindahl 1993;Allentoft et al. 2012). Although postmortem damage complicates downstream inference by introducing alleles not reflective of a sample's diversity, fragmentation imposes a theoretical limit on the age from which mappable DNA fragments can be recovered (e.g., Dabney et al. 2013;Orlando et al. 2013;Meyer et al. 2016). Despite recent advances in sequencing ultrashort DNA fragments from specimens hundreds of thousands of years old, the majority of ancient genomes sequenced to date are in the range of thousands to tens of thousands of years old (Brunson & Reich 2019;Skoglund and Mathieson 2018;Fages et al. 2019). aDNA is typically subject to contamination from external sources, reducing the ratio of endogenous to exogenous content. Of particular concern is the contamination from modern conspecific samples, which map to the reference sample alongside endogenous DNA and thus alter patterns of allele frequencies and genetic diversity. The amount of endogenous DNA surviving in museum and archaeological specimens varies among samples due to factors that include climate, substrate, and exposure to UV radiation, specimen treatment in the museums, in addition to material type. For example, dense material such as the petrous bone has been found to contain a high percentage of endogenous DNA (Pinhasi et al. 2015). Skins and pelts also have high endogenous content, but the DNA is frequently highly fragmented likely as the result of harsh chemical treatment for specimen preservation in museums (e.g., tanning; van der Valk et al. 2017). However, it is often the case that aDNA extracted from museum or archaeological specimens provides low and fragmented coverage of the genome, thereby typically limiting inference based on heterozygosity or specific loci of interest.
Populations can adapt to new selection pressures either from de novo mutations or from standing genetic variation (Barrett and Schluter 2008). Although both de novo mutations and standing variants can rise in frequency in response to selective pressures in the time window afforded by aDNA data, in the former case, selection can only act on a beneficial variant once it exists within the population. Standing genetic variation, on the other hand, is generally expected to allow a more rapid response to changes in selective pressures (Barrett and Schluter 2008). For example, recent time series studies show that adaptation from standing genetic variation can happen within only a few generations after the origin of a new selective pressure (Epstein et al. 2016;Franks et al. 2016;Marques et al. 2018). Adaptation from standing genetic variation is thus limited by the presence of genetic variation to respond to new changes, which can be dependent upon past exposure of an ancestral population to similar selective pressures (Schluter and Conte 2009;Marques et al. 2019) and the overall effective population size. Similarly, much of the genetic substrate contributing toward deleterious recessive mutation load and thereby subject to negative selection is also thought to be maintained as standing variation in heterozygous genotypes (Peischl and Excoffier 2015).
Although our ability to push the limits of aDNA retrieval and sequencing now extends to samples dating hundreds of thousands of years in age, due to the difficulties of working with aDNA detailed above, compiling population datasets of time series data from which allele frequencies can be estimated is limited to more recent timescales (up to tens of thousands of years). Thus, both the temporal scales over which aDNA datasets are likely to span, and the frequency with which both positive and negative selection acts upon standing genetic variation relative to de novo mutations, make standing variation the more tractable genetic substrate to study the effects of selection using aDNA data.
There are few existing paleogenomic datasets consisting of multiple individuals that span temporal changes in selection pressures. The most compelling findings of selection are from rich datasets associated with recent and artificially strong selective regimes, such as domestication processes, incorporating pre-and early domestication samples (see Irving-Pease et al. 2018 for a review). Such studies have been conducted on domestic species including horses (Librado et al. 2017), maize (Ramos-Madrigal et al. 2016), and dogs (Ollivier et al. 2016). The application of the methods outlined in this review to natural populations remains a rarity.

Detecting Positive Selection on a Monogenic or Oligogenic Trait
Positive selection acting upon a single (monogenic) or few genes (oligogenic) and sites linked to the targets of selection causes the selected allele(s) at linked sites to rise to high frequency within the population. This reduces genetic diversity within the region of the genome linked to the gene(s) targeted by selection and increases differentiation and lineage sorting of these genomic region in comparison with other populations. Studies sampling contemporary populations can therefore detect positive selection on monogenic or oligogenic traits by investigating patterns of coalescence (e.g., Hermisson and Pennings 2017), measures of population differentiation such as F ST (Lewontin and Krakauer 1973;Beaumont 2005), patterns in the site frequency spectrum (Tajima 1989;Fay and Wu 2000), or the extent of linkage disequilibrium (Kim and Nielsen 2004). However, contemporary data represent only a single point in time. A major challenge is to disentangle the various effects upon the genome of ancient population structure, positive and background selection, and nonequilibrium demography. Selection and demographic bottlenecks can leave similar patterns of genomic variation, including reduced genetic diversity in affected genomic regions, which increases lineage sorting and differentiation between populations with different demographic and evolutionary histories (Charlesworth et al. 1993;Zeng et al. 2006;Crisci et al. 2012;Li et al. 2012;Comeron 2014).
Changes in ecological conditions, geographic distribution, rates of gene flow, and population size can all influence the strength and consistency of selection, and may thus heavily confound selection estimators. High F ST values, for instance, can be indicative of an ancestral selective sweep, but may also be caused by demographic processes (Nielsen 2005;Excoffier et al. 2009) or background selection (Cruickshank and Hahn 2014;Burri 2017). A recent study estimated that more than 95% of the human genome is affected by background selection or biased gene conversion, and thus is evolving in a nonneutral manner (Pouyet et al. 2018). Furthermore, parallel adaptation for a derived haplotype at a specific locus in two populations can be misinterpreted as selection on the ancestral haplotype in a third population in three-population comparisons, such as implemented in the population branch statistic (see Mathieson 2019; Fig. 1). Current attempts to account for these confounding factors using only contemporary samples are either limited to simple models or rely on strong assumptions about the strength of selection and distribution of beneficial variants (Li et al. 2012). Finally, there are limitations to how far back in the past applying coalescent approaches to only contemporary samples can reach, due to lineages coalescing in ancestral bottlenecks and selection events. Inferences about historic periods of selection may therefore be restricted to relatively recent time scales and will not span all historical changes in selective pressure, for example, shifts in the selective regime associated with strong demographic founder effects during the colonization of new habitats.
Allele frequencies inferred from aDNA from a time series of samples with known ecological context can be used to infer selection, while controlling for many of these confounding factors (Bank et al. 2014;Malaspinas 2016). The foundations for inferring the underlying mode of evolution (i.e., under neutrality or selection) from time series allele frequency data are based upon the Wright-Fisher model. The model was named after Sewall Wright and Ronald Fisher, who famously debated the extent to which drift or selection was the driving evolutionary forces underlying fluctuations in color polymorphism frequency in a time series dataset collected from a scarlet tiger moth (Panaxia dominula) population (Fisher and Ford 1947;Wright 1948). The Wright-Fisher model is a simple approximation of genetic drift in a population of constant size (N diploid individuals) with nonoverlapping generations, in which alleles are randomly sampled from the previous generation.
There are several available methods for inferring selection as a cause of directional allele frequency shift with a trajectory that is inconsistent with neutral evolution under a Wright-Fisher model (Bollback et al. 2008;Malaspinas et al. 2012;Feder et al. 2014;Foll et al. 2015;Gompert 2016;Ferrer-Admetlla et al. 2016;Schraiber et al. 2016). Malaspinas (2016) provided a dedicated review of how these methods work and what differentiates them from each other. These methods can then characterize selective sweeps in terms of timing, duration, and the strength of selection measured as selection coefficients (see Fig. 2; Bank et al. 2014;Malaspinas 2016). The different statistical methods using time series data to infer selection mainly differ in the statistical approach used to estimate allele frequency probabilities. As a result, different methods are suitable for different study systems, depending on the population size, the magnitude of the selection coefficient, and the parameter set to be inferred (see Malaspinas 2016). Available methods vary in their underlying assumptions and the variables that they are able to estimate. For example, some estimators can jointly infer allele age (Malaspinas et al. 2012;Schraiber et al. 2016) or population size and selection coefficients (Foll et al. 2015;Ferrer-Admetlla et al. 2016;Schraiber et al. 2016) and account for variation in the strength of selection through time (Shim et al. 2016). However, it is important to note that most of these methods are unable to distinguish between direct and linked selection (Bank et al. 2014): they measure the by-product of a sweep, which is the directional changes in allele frequencies at both the target and linked sites, but do not necessarily identify the target of selection if that is unknown a priori.
Despite the availability of several methods for inferring selection from time series datasets, their application to aDNA datasets from natural populations remains limited (see Table S1 for an overview). Examples of applications to aDNA datasets have typically been on human-induced selection during domestication (Ludwig et al. 2009;da Fonseca et al. 2015) or selection in humans due to dietary changes associated with domestication (Sverrisdóttir et al. 2014;Mathieson et al. 2015;Buckley et al. 2017;Ye et al. 2017;Mathieson and Mathieson 2018;Mathieson 2019). The paucity of application of such methods to aDNA datasets may reflect the scarcity of available time series of allele frequency data from aDNA, but also restrictive assumptions of the underlying Wright-Fisher model, in particular the effect of migration on allele frequencies. This last point can to some extent be accounted for by considering a spatially structured framework in which selection coefficients and migration rates between demes can be allowed to vary (Mathieson and McVean 2013). The starting allele frequency and dominance of a beneficial allele can influence the speed of the sweep and therefore the difference in the trajectory through time from neutrally evolving loci and the required density of sampling points through time needed to detect the sweep (Feder et al. 2014;Malaspinas 2016; Fig. 3). The difficulties in inferring the mode of evolution is nicely illustrated by the re-evaluation of the trajectory of alleles in genes associated with coat color in horses, which were inferred to have changed consistent with directional selection (Ludwig et al. 2009), drift (Malaspinas 2012), and balancing selection (Steinrücken et al. 2014).

Detecting Polygenic Selection on a Polygenic Trait
In contrast to phenotypes with a relatively simple genetic basis, polygenic traits are genetically more complex, being determined by the effect of allele frequency changes at hundreds or thousands of loci. Polygenic selection from standing variation might be of equal or greater importance than selective sweeps in rapid  adaptive events (Mather 1943;Pritchard and Di Rienzo 2010;Jain and Stephan 2017). Indeed, as many phenotypic traits are polygenic, the quantitative variation associated with these traits is likely to play an important role in adaptation and contribute toward individual fitness in a given set of environmental conditions (Gratten et al. 2008;Besnier et al. 2015;Bosse et al. 2017). Even though the collective effect of polygenic traits under selection can be significant, individual allele frequency shifts are more subtle than those under a selective sweep model, and therefore harder to detect with traditional methods for selection inference. Studies looking for polygenic adaptation in contemporary genomic datasets typically rely on sets of loci associated with a specific trait and identified by genome-wide association studies (GWAS) (Turchin et al. 2012;Berg and Coop 2014;Robinson et al. 2015;Racimo et al. 2018). Derivations in the mean effect size of a set of loci compared to a null model or another population are indicative of selection. A key limitation of investigating polygenic adaptation using only contemporary samples is in determining the timing and onset of polygenic selection. Estimation of the timing of polygenic adaptation in an ancestral population can be achieved using just modern samples, but this requires a dataset of known quantitative trait loci and the establishment of the past splits and migration among populations (Racimo et al. 2018). The relationship among populations using population genomic data is increasingly estimated as an admixture graph (Patterson et al. 2012;Pickrell and Pritchard 2012;Lipson et al. 2013). Admixture graphs represent a consensus topology inferred from the majority of neutral loci, in which drift is represented by branch length. Admixture events are then inferred from loci that are a poor fit for this consensus tree model and are incorporated into the graph to increase the fit of the graph to the data. Racimo et al. (2018) expanded this approach to generate an admixture graph from putatively neutral loci. They then separately considered the fit of allele frequency shifts at GWAS loci to the admixture graph to identify when GWAS loci evolved differently to neutral loci (i.e., inconsistent with genetic drift), thereby inferring when polygenic selection occurred in the evolutionary history of the sampled populations. Racimo et al. (2018) proposed that the method should be applicable to admixture graphs that include ancient populations, as commonly incorporated into human population genomic studies (e.g., Lazaridis et al. 2014;Raghavan et al. 2015). However, care is needed, as the method requires sufficient samples from each time period to ensure accurate estimates of allele frequencies and also to avoid artefacts from postmortem damage and low coverage. In addition, linkage disequilibrium structure may vary among populations and through time affecting the accuracy of comparing markers discovered in another population or across temporally stratified data (see Martin et al. 2017). Similar to datasets containing only modern samples, this approach is restricted to species with known GWAS identified loci. A modified version of the Q B statistic (Racimo et al. 2018), the S B statistic developed by Refoyo-Martínez et al. (2019), similarly uses the signal of allele frequency differences between populations, discordant with the consensus topology of an admixture graph. This method does not require gene-trait association data, making it a promising approach for identifying genome-wide targets and the timing of selective sweeps in model and nonmodel organisms (Refoyo-Martínez et al. 2019), but it is unclear if the method would be sufficiently sensitive to detect polygenic selection.
The detection of polygenic selection from time series genetic data requires methods that consider genome-wide patterns of subtle changes in allele frequencies that are distinguishable from genetic drift. Although the method of Racimo et al. (2018) is dependent upon the loci under selection being known a priori, a theoretical framework developed by Buffalo and Coop (2019) can partition the variance of genome-wide allele frequency changes through time into those evolving neutrally through drift and those linked to (unknown) loci evolving under additive polygenic selection. However, this approach is subject to many of the caveats discussed below in that it assumes a constant population size and the model would be violated by migration or other temporal variation in population composition. Therefore, although this approach is supported by simulations, and has been demonstrated to be effective at estimating temporal covariance in allele frequencies associated with linked selection in lab-based experimental evolution (Buffalo and Coop 2019a, b), it may be limited in its application to real-life aDNA data. The effect of population stratification on polygenic signals from modern samples has recently been highlighted, when two studies found that the signal for height selection in Europe was less pronounced in the U.K. Biobank dataset, which is less confounded by population structure than the GIANT consortium dataset (Berg et al. 2018;Sohail et al. 2019). Sampling through time increases the chances of stratification in a population genomics dataset (Pickrell and Reich 2014), and so would need to be carefully accounted for.
A recent modelling study by Hayward and Sella (2019) found that shifts in mean phenotype toward a new optimum through polygenic adaptation following a sudden environmental change were driven in the short term by the small frequency changes in moderate and large effect alleles. In the long term, the contribution of subtle changes in large-effect alleles is replaced by large allele frequency changes, including fixation, of moderate and small effect alleles (Hayward and Sella 2019). The ability of temporal sampling approaches, such as those of Racimo et al. (2018) and Buffalo and Coop (2019), may vary between these proposed shortand long-term phases, with the more extreme frequency shifts of the latter intuitively being more detectable. We look forward to future investigations into this temporal change in the signature of polygenic selection.
The results of scans for alleles or genes evolving under polygenic selection can be used to search subnetworks of interacting genes in biological pathways and identify those with unusual features to better understand the interaction with phenotype or the environment. For example, Gouy et al. (2017) applied such a method to identify the polygenic basis and the biological processes involved in convergent adaptation to high altitude in modern humans. The method has been tested on the time series data from Mathieson et al. (2015) and can therefore be applied to aDNA datasets, provided there are sufficient sample sizes, and considering the caveats of population stratification, migration, and linkage disequilibrium changes through time (A. Gouy, pers. comm.). An advantage of incorporating aDNA time series data into such an analysis would be to better determine if selection acts independently at different times on different genes or simultaneously on multiple genes within a network in response to a novel selection pressure: independent and epistatic selection sensu Gouy and Excoffier (2019).

Detecting Purifying Selection
Negative or purifying selection-the removal of deleterious alleles from a population-can lead to a reduction in genetic diversity in regions of the genome because neutral polymorphisms at sites linked to deleterious mutations are also removed from the population: a process called background selection (Charlesworth et al. 1993). The effectiveness of purifying selection in removing deleterious mutations depends both upon the selection coefficient of the mutation (s) and effective population size (N e ), and in an idealized population is thus determined by N e s (Charlesworth 2009). In this context, we refer to the variance N e rather than the inbreeding N e , the former being the measure of variance in allele frequency drift per generation in an idealized Wright-Fisher population (Wright 1931;Crow and Denniston 1988). Therefore, although selection will act rapidly to remove strongly deleterious mutations, given a sufficiently large effective population size, weakly deleterious mutations may segregate for multiple generations before they are effectively removed from a population (Kimura et al. 1963). In particular, weakly deleterious recessive mutations can be retained as effectively neutral alleles in heterozygous state (Peischl and Excoffier 2015). The term mutation load is broadly used for the measure of deleterious mutations within an individual (Henn et al. 2015). Prolonged bottlenecks and subsequent expansions, as, for example, under serial founder effects associated with range expansion or domestication events, result in reduced efficacy of selection and increased drift (Lohmueller 2014). As a consequence of these demographic scenarios, weakly deleterious mutations can rise to high frequency within an affected population, and weakly deleterious recessive mutations can become exposed in homozygous form as they rise to fixation through drift; thus under a recessive model, the mutation load resulting from nonequilibrium demography is predicted to have a greater population-level impact (Peischl and Excoffier 2015). As a result, the signature of background selection detected in comparisons among modern populations can be similar to that of positive polygenic selection in that it reduces genetic diversity and increases genetic differentiation among populations (Charlesworth et al. 1993). Studies solely based on modern population data also lack resolution of the timing of purifying selection relative to demographic changes, for example, pre-and post-bottleneck, when recessive alleles are exposed in homozygous genotypes, or during other demographic events such as extinctions.
In contrast to the methods for detecting positive selection on single or few loci of large effect, which have potentially prohibitively dense temporal sampling requirements for most aDNA datasets currently available, an assessment of the strength of negative selection can be made from a large number of independent (unlinked) loci using relatively few samples. The difficulty is how to disentangle the effects of negative selection from those caused purely by demography, given that both reduce genetic diversity. One approach is to look for differences in genetic diversity across regions of the genome that differ in recombination rate, because the impact of selection on genetic diversity will be greatest where recombination rates are lowest. This approach was used by Murray et al. (2017) in their analysis of DNA from museum samples of the once abundant but now extinct passenger pigeon (Ectopistes migratorius). Hung et al. (2014) had previously reported surprisingly low genetic diversity in passenger pigeons and had concluded that this reflected a history of dramatic population size fluctuations. To distinguish between the effects of selection and demography, Murray et al. (2017) mapped their passenger pigeon scaffolds to the chicken genome assembly, and because karyotype and synteny are strongly conserved across bird genomes, they were able to establish that genetic diversity was much lower in regions of the genome with lower rates of recombination. They concluded from this that the much lower than expected genetic diversity of the passenger pigeon was largely a consequence of the impact of selection on linked loci, rather than demographic instability, and they suggested that this might have been a consequence of passenger pigeons having had a very large effective population size.
Although the genomic investigation of the extinct passenger pigeon sampled across a narrow temporal window, other studies have sampled the genomic signature of the extinction process over longer timescales. For example, a loss of genetic diversity and increase in the fraction of the genome composed of runs of homozygosity and accumulation of deleterious mutations were detected in one of the last surviving mammoths (Mammuthus primigenius), dated to 4300 years ago, when compared with an older 44,800 years ago sample (Palkopoulou et al. 2015;Rogers and Slatkin 2017). When species recover from a bottleneck rather than going extinct, aDNA time series data can shed light on the strength of purifying selection acting upon the accumulated deleterious mutations. In an ongoing study, a comparison of the per-genome accumulation of nonsynonymous mutations (see Do et al. 2015) across a global dataset of killer whales (see Foote et al. 2019), the strongest signature of purifying selection is in genomes sampled from Iceland and Norway. Comparing with the genome of a Danish sample dated to 7500 years ago, which was inferred to be ancestral to the modern Icelandic and Norwegian populations, revealed that most of the purging of nonsynonymous mutations had occurred during the Holocene, subsequent to the inferred bottleneck during the last glacial period (see Foote et al. 2016). Thus, as with other forms of selection, sampling across different time periods can inform us of the timing of purifying selection and relate this to changes in demography or environmental variables.

Detecting Balancing Selection
Balancing selection is the umbrella term used for evolutionary processes that maintain polymorphisms in a population. Different mechanisms can lead to balancing selection. Heterozygote advantage refers to the process whereby individuals with a heterozygous genotype have a higher fitness than those with either homozygous genotype (Lindtke et al. 2017). Under negative frequency dependence, the fitness of a genotype is determined by the frequency of other genotypes, meaning that a genotype remains advantageous if rare. This type of selection is most often found in host-pathogen or predator-prey systems (Stahl et al. 1999;Leffler et al. 2013;Le Rouzic et al. 2015;Sato 2018). In genetically structured populations with gene flow, variable selection pressures can result in balancing selection (Levene 1953;Hedrick 2006). Although positive and negative selection have both been extensively studied, balancing selection has gained relatively little attention, likely because it is more difficult to detect as its effects span shorter genomic regions and may be transient in time (Fijarczyk and Babik 2015). As a result, there is little consensus on how prevalent this form of selection is and what role it has in maintaining genetic diversity.
Depending on the time scale, balancing selection will leave different patterns in the genome. Recent balancing selection is characterized by the increase in frequency of an allele at a specific locus. Balancing selection over long evolutionary time scales, on the other hand, will result in increased sequence diversity around the selected locus and long gene genealogies (Charlesworth 2006;Fijarczyk and Babik 2015). However, detecting the footprints of balancing selection in contemporary genomes is not a straightforward task: the patterns left in the genome can either be misinterpreted as other selection processes or may be caused by demogra-phy, introgression, or population structure (Fijarczyk and Babik 2015). For example, the signatures of recent or transient balancing selection can be misidentified as ongoing positive selection. Alternatively, signatures from long-term balancing selection, that is, increased gene diversity, can also be caused by population structure. Due to these difficulties, methods using only contemporary data to detect balancing selection typically have low statistical power.
Because the frequency of alleles evolving under balancing selection is expected to change less over time than expected under neutral drift (Fig. 4), temporally sampled data can be helpful to detect balancing selection. If alleles are truly under balancing selection, one can expect them to neither reach fixation nor get lost from the population. Although the maintenance of polymorphism is challenging to detect from single-time point data, these patterns should be detectable over longer periods of time, provided evenly distributed temporal sampling (Fig. 4).

Caveats and Considerations
The inference of selection of time series paleogenomic data that we have advocated above typically depends upon simple evolutionary models, such as the Wright-Fisher model, that have a number of assumptions based around an idealized population. In reality, time series aDNA data from most species contravene such models through a history of admixture, overlapping generations and changes in effective population size. Thus, genetic differentiation between temporal samples may be due to drift, selection, or migration ). Sample-rich paleogenomic datasets such as those for horses and humans (Reich 2018;Fages et al. 2019) highlight the fluidity of population structure through time, such that a time series of samples from a given location rarely represents a single continuous population, in which older samples are directly ancestral to younger ones. Furthermore, there can be behavioral differences that can cause sample bias of a subset of a population to accrue in a location (e.g., Allentoft et al. 2010;Pečnerová et al. 2017), thereby invalidating the model assumptions. Ascertainment bias can also occur during the collection of specimens, causing museum datasets to be biased toward a particular sex or phenotype (Cooper et al. 2019). As such, great care is required to rule out migration or population replacement when inferring drift or selection as the driver of allele frequencies from time series data. New approaches are increasingly being developed to estimate how direct of an ancestor an ancient sample is to a modern sample, by estimating the drift along the branch from the most recent common ancestor to the ancient sample (Rasmussen et al. 2014;Racimo et al. 2016;Schlebusch et al. 2017;Schraiber 2018). The shorter the branch, the more directly ancestral the ancient sample was to the modern sample, and thus, the more the dataset represents a continuous population through time. Alternatively, it is possible to test for continuity explicitly using coalescent simulation methods (Bramanti et al. 2009), an approach that was recently extended to include structured populations (Silva et al. 2017). However, a comprehensive investigation into how these confounding variables violate the assumptions and impact the inferences of the methods outlined above is currently lacking. As a minimal next step, further simulation work is needed to understand how migration from sampled or unsampled populations into the study population part way through a time series influences the results. Beyond this, theoretical work is needed to feed into method and tool development that can infer natural selection in datasets with complex demographic histories that violate the assumptions of Wright-Fisher or other simple models. Additionally, we need a better understanding of the spatial and temporal sampling requirements to be able to detect different types of selection. We have alluded to the fact that changes in processes that can leave a genome-wide signature, such as polygenic or background selection, can be inferred from even a small number of genomes sampled at a few temporal intervals. In contrast, the temporal signature of processes such as balancing and positive selection on a monogenic or oligogenic trait would require more dense temporal sampling of multiple genomes from each time point to be able to estimate allele frequency variation. However, the effect of the density of temporal and spatial sampling and the number of genomes per sample are additional variables that need to be incorporated into simulations to be able to provide more quantitative and formal guidance for future empirical studies.

Future Directions
As we enter the futuristic sounding year 2020, a number of methodological and technical advances loom on the near horizon, which we see greatly contributing to the kinds of analyses we have outlined here. Of key importance is the development of methods to handle "big data" such as the genomic datasets composed of hundreds and thousands of individuals (e.g., Reich 2018;Fages et al. 2019). Two recent papers published back-toback (Kelleher et al. 2019;Speidel et al. 2019) together with an accompanying perspective (Harris 2019) introduce new methods, relate and tsinfer, which estimate genealogies in the presence of recombination at an unprecedented scale. Recombination events result in small differences in the genealogy of contiguous sequences; tsinfer records these differences thereby efficiently encoding variation across the genomes of thousands of individuals. This method greatly reduces the data storage requirements and processing time of large datasets of thousands of genomes (Kelleher et al. 2019). The extension of methods such as tsinfer and relate to aDNA datasets presents new challenges, for example, accounting for postmortem damage patterns and high sequencing error rates, when estimating recombination events. Trees inferred using tsinfer have already been used to inform analysis of aDNA (Scheib et al. 2019), and improved methods to deal with the complexities of aDNA are under active development (J Kelleher, pers. comm.).
To accompany these new approaches to encoding the genomic variation within large datasets, machine learning approaches are emerging as valid inferential tool in population genomics (Schrider and Kern 2018;Mondal et al. 2019). Such data-driven approaches base their inferences by learning the relationship between inputs (e.g., summary statistics of genetic diversity or full genotype information) and outputs (e.g., strength and time of selection) from a large collection of data points, which can be provided by simulations (Sheehan and Song 2016). Machine learning, specifically deep learning and convolutional neural networks, have been successfully applied to population genetic data to infer population size changes (Flagel et al. 2018;Chan et al. 2018) and predict targets of natural selection (Torada et al. 2019). Existing methods can be extended to analyze aDNA data by incorporating (i) the temporal dimension and (ii) missingness of sequencing data obtained from degraded ancient samples. These functionalities can be addressed by employing recurrent layers in the network and encoding the statistical uncertainty of aDNA data in input nodes. As such, deep learning is likely to be a suitable framework for the inference of past selective events from aDNA.

Summary
Our goal in writing this review was to highlight the potential for paleogenomic time series datasets to enhance our understanding of selective processes, while at the same time cautioning on the many potential pitfalls inherent in working with such challenging samples and datasets. The growth of the field of paleogenomics during the past decade has been close to exponential, and datasets of hundreds of ancient genomes are now available for some study systems. However, it is important to recognize that datasets of this magnitude are still exceptional. Our take on the current state of the field is that the method development for working with paleogenomic datasets has progressed ahead of the widespread availability of such data. But in the knowledge that the generation of many large-scale datasets for a range of taxa is underway, we anticipate that the relationship between method development and study systems with which to apply them will soon change. We therefore hope that this review will serve to enthuse evolutionary biologists to consider incorporating paleogenomic data in their future study design.

AUTHOR CONTRIBUTIONS
MD, DDdM, KG, A-SM, MDM, GGRM, ASTP, NOT, DW, LD, and ADF conceived and developed the manuscript outline during a writing retreat. MD and ADF drafted the initial manuscript. All authors contributed to later versions of the manuscript.