Paleo‐metagenomics of North American fossil packrat middens: Past biodiversity revealed by ancient DNA

Abstract Fossil rodent middens are powerful tools in paleoecology. In arid parts of western North America, packrat (Neotoma spp.) middens preserve plant and animal remains for tens of thousands of years. Midden contents are so well preserved that fragments of endogenous ancient DNA (aDNA) can be extracted and analyzed across millennia. Here, we explore the use of shotgun metagenomics to study the aDNA obtained from packrat middens up to 32,000 C14 years old. Eleven Illumina HiSeq 2500 libraries were successfully sequenced, and between 0.11% and 6.7% of reads were classified using Centrifuge against the NCBI “nt” database. Eukaryotic taxa identified belonged primarily to vascular plants with smaller proportions mapping to ascomycete fungi, arthropods, chordates, and nematodes. Plant taxonomic diversity in the middens is shown to change through time and tracks changes in assemblages determined by morphological examination of the plant remains. Amplicon sequencing of ITS2 and rbcL provided minimal data for some middens, but failed at amplifying the highly fragmented DNA present in others. With repeated sampling and deep sequencing, analysis of packrat midden aDNA from well‐preserved midden material can provide highly detailed characterizations of past communities of plants, animals, bacteria, and fungi present as trace DNA fossils. The prospects for gaining more paleoecological insights from aDNA for rodent middens will continue to improve with optimization of laboratory methods, decreasing sequencing costs, and increasing computational power.

In past decades, the community of midden researchers has dwindled with few trained in molecular techniques and genomics, delaying full exploration of midden aDNA analysis and its potential from early studies confirming the presence of aDNA in fossil rodent middens. Ancient DNA in middens was used to identify the dung, diet, and Pleistocene biogeography of a rare rodent in northern Chile (Kuch et al., 2002) and of an extinct ground sloth preserved in southern Argentina (Hofreiter, Betancourt, Sbriller, Markgraf, & Gregory McDonald, 2003;Hofreiter et al., 2000). Ancient DNA from a midden in Tiburón Island in the Gulf of California showed that a unique haplotype of desert bighorn sheep inhabited this land-bridge island 1,500 years ago and that its successful introduction in 1975 represents an "unintentional rewilding" (Wilder et al., 2014). Ancient DNA from middens in the Grand Canyon, USA, provided evidence of papillomavirus (PV) infection and long-term codivergence with packrats over the last 27,000 years, the oldest known PV sequence (Larsen, Cole, & Worobey, 2018). High-throughput sequencing and metabarcoding have been reported from South African, Australian (Murray et al., 2012), and Chilean (Díaz et al., 2019) middens. Metabarcoding of middens in the hyperarid Atacama Desert of northern Chile records the response of plant-pathogen communities in response to changing climates during the past 50,000 years (Wood et al., 2018). Because of their dense spatiotemporal distribution, fossil rodent middens in the Americas offer the chance to genetically profile entire communities through time and space and to reconstruct time-lapse molecular phylogeographies for individual species. To achieve this potential, a necessary step is to continue to improve the genetic characterization of these promising deposits, the principal aim of the present study.
DNA from ancient and highly degraded materials is more accessible than ever before with modern high-throughput DNA sequencing platforms. Taxonomic composition of ancient samples can be inferred using sophisticated bioinformatic algorithms and massive genetic reference databases (Harbert, 2018). Common sequencing approaches to the taxonomic identification of environmental DNA (eDNA) include amplicon sequencing and whole-genome shotgun sequencing (Taberlet, Coissac, Pompanon, Brochmann, & Willerslev, 2012). Amplicon sequencing produces short fragments of a specific gene that can be compared to robust databases. In contrast, whole-genome shotgun sequencing produces random reads from all of the available DNA molecules in a sample, but with a theoretical bias toward high-copy molecules (e.g., organellar genomes). Amplicon sequencing can sometimes produce imprecise results due to the relatively small samples of the overall genome being compared and because it may produce bias against rare molecules, particularly in degraded samples (Ficetola et al., 2015;Ficetola, Taberlet, & Coissac, 2016;Zinger et al., 2019). Nevertheless, amplicon data may, in some cases, produce more complete taxonomic coverage relative to shotgun methods (Tessler et al., 2017). Whole-genome shotgun sequencing inherently generates vastly more data from a theoretically larger sample of the genome; therefore, it may overcome underrepresentation of rare molecules and provide more information to make precise taxonomic classifications from highly degraded samples.
For this study, we attempt both shotgun and amplicon methods for metagenomics, but we focus on the whole-genome shotgun sequencing methods for the exploration of aDNA in packrat midden sequences from western North America. We analyzed aDNA from 25 packrat midden samples of different ages, both unprocessed vouchers (still indurated with crystallized urine) and processed material (after crystallized urine was removed through soaking in water) to determine whether the extraction and classification of endogenous DNA from are possible with both unprocessed and processed material. The packrat middens for which sequencing libraries were attempted in this research are between 300 and 48,000 C 14 years old and come from two different localities. City of Rocks National Reserve (COR), characterized by mean annual temperature of 9°C and mean annual precipitation of 280 mm, is in south-central Idaho, USA, near the northern end of the known paleomidden distribution.
Guadalupe Canyon on the eastern piedmont of the Sierra Juarez in northern Baja California, Mexico, is characterized by mean annual temperature of 22.5°C and mean annual precipitation of ~70 mm. These two sites more or less span the range of current climatic conditions across which North American packrat middens are preserved and have been studied.
DNA extraction and shotgun sequencing produced data on eleven samples; amplicon samples were reviewed for a subset of six samples. Shotgun metagenomic data were then analyzed to identify taxa present in each sample. The shotgun metagenomic results generally outperform amplicon data and more completely characterize paleo-communities that are consistent with macrofossil communities identified in the middens and associated taxa.

| Midden samples
Packrat midden material was sampled from the North American Packrat Midden Collection archived at the University of Arizona's Tree Ring Laboratory, Tucson, AZ. There are two kinds of samples in the Tucson collection, processed and unprocessed (indurated) midden vouchers. Midden processing entails placing a discrete (approximately shoebox-sized or larger) sample of the midden in a bucket of water. The crystallized packrat urine (amberat; see Figure 1) is soluble in tap water and usually dissolves in one to two weeks, releasing plant material, fecal pellets, and other remains. After dissolving, the soaked midden is wet-screened through geologic sieves, is dried in a drying oven overnight, and placed into large plastic bags. This material eventually is sorted and identified under a dissecting scope using modern material and then farmed out to other laboratories for geochemical and other analyses. Processed midden is the most abundant material in the University of Arizona Collection. A major question about midden aDNA analysis is the degree to which aDNA survives the current routine processing of middens, which involves both prolonged soaking and oven-drying the material. Also available for some middens are indurated vouchers, meaning chunks of middens that are still embedded with crystallized urine and have not been processed to release/recover the contained plant macrofossils and other remains. The availability of these indurated vouchers permits assessing of routine processing on midden aDNA.
Both processed and voucher midden samples were taken for DNA extraction from the two sites, twelve from the City of Rocks and thirteen from Guadalupe Canyon. The samples from COR were dated from ~47,500 C 14 years old to 115 C 14 years old (initial results partially published in Weppner, Pierce, & Betancourt, 2013). The Guadalupe Canyon samples were dated between 54,400 C 14 years old and 3,545 C 14 years old (Holmgren et al., 2014).

| DNA extractions
DNA extractions from the middens were completed in an aDNA laboratory at the American Museum of Natural History's Sackler Institute for Comparative Genomics. This is a PCR-free room with laboratory workspaces designated for low DNA content samples and clean protocols in place to limit residual DNA contamination.  Using cutadapt (Martin, 2011), the resulting sequences were demultiplexed by locus and had their primer sequences trimmed; furthermore, sequences were only kept if >75 bp. BBMerge (Bushnell, Rood, & Singer, 2017) was used to merge overlapping read pairs, and the resulting full-length sequences were aligned against the NCBI "nt" database using BLAST (Johnson et al., 2008). Resulting hits (>95% similarity) are reported.

| Whole-genome shotgun sequencing
A total of 21 samples were selected for whole-genome shotgun sequencing, nine from the Guadalupe Canyon site and 12 from the City of Rocks National Reserve. Samples were chosen for sequencing based on total DNA content (>0.525 ng/µl) and by manual inspection of the Bioanalyzer fragment size distribution to identify samples with majority low molecular weight (<1 kb) DNA. These criteria were imposed under the assumption that endogenous ancient DNA will be highly fragmented and that higher molecular weight DNA is likely from modern contamination. The library preparation was done using the KAPA Hyper Library Preparation Kit (KAPA Biosystems). Library preparation and sequencing were carried out at the New York Genome Center on the Illumina HiSeq 2500 platform for paired-end reads of 125 base pairs.

| Shotgun metagenomic classification
The whole-genome shotgun data were processed with a custom pipeline to merge overlapping reads, trim low complexity data, remove duplicate reads and reads shorter than 30 bp, and map reads to the NCBI "nt" reference database ( Figure 2 Breitwieser, & Salzberg, 2016). Taxonomic assignments were converted to standard classifications based on the NCBI taxonomy database (Federhen, 2012) and the R "taxonomizr" code library (Sherrill-Mix, 2019). Visualizations of the results were coded in R 3.5.2 (R Core Team, 2019) and the "ggplot2" graphics library (Wickham, 2009).

| Shotgun metagenomics pipeline and data availability
All code and parameter settings for individual programs for the shotgun metagenomics portion of this project are published in a public code repository (https ://github.com/rsh42 9/Neoto maSeq ), and raw shotgun sequence data are available through the NCBI SRA at BioProject PRJNA488629 (https ://www.ncbi.nlm.nih.gov/biopr oject/ PRJNA 488629).

| DNA quantification
The concentration of DNA present in each extraction as measured by the 2100 Agilent Bioanalyzer confirms that DNA can be extracted from midden material up to 31,760 years old ( Figure

| Whole-genome shotgun sequencing
Eleven of the 21 DNA samples submitted for sequencing were successfully sequenced with midden ages ranging from 345 to 31,760 C 14 years. The other ten samples failed library preparation procedures and, accordingly, could not be sequenced. Of these ten samples, eight of them were from Guadalupe Canyon, meaning that all but one of the extractions from this hotter, drier site failed to go through library preparation. Overall, between approximately 0.11% and 6.7% of the raw read pairs were uniquely classified given the filtering and classification pipeline (Table 1); however, only processed (washed and sieved) midden samples (sequence samples PPC524-PrA and TS211F-2Pr) yielded greater than 1% classified reads.

| Metagenomic classification
Direct classifications placed some of the DNA fragments at the species level, but there is more confidence in higher classifications (e.g., phylum and family level) where stronger consensus is established.

| Macrofossil analysis comparison
Standard plant macrofossil analysis was performed on the middens collected from the City of Rocks locality (Table S1)

| aDNA Confirmation
Assessment of DNA damage through mapping of cytosine deamination and read length shows that reads that map to chloroplast loci are putatively ancient (Figure 7). In general, all samples analyzed show elevated (5%-10%) cytosine to thymine substitutions near the 5′ end of the DNA molecule dropping to 2.5%-3% by 30bp from the 5′ end ( Figure 7a). Also, most samples show read lengths (from merged paired-end reads) of 50-200 bp (Figure 7b). Sequence reads from the processed midden samples (PPC524-1-6-10 and TS211F-2Pr) show elevated levels of C to T mismatches across the entire length of the reads relative to the data from raw midden samples ( Figure 7a).

| D ISCUSS I ON
Application of metagenomics to packrat midden paleoecology has a lot to contribute to the study of these materials. In a single analysis, aDNA metagenomic data can identify major lineages of eukaryotesplants, animals, and fungi-and prokaryotes. This method provides the basis for a new way to more rapidly conduct paleoecological studies, as numerous samples can be batch-processed and data on a wide range of taxonomic groups can be recovered. Also, it does not require specialized knowledge for morphological identification of fragmented macrofossils, helping make studies in this system more available to the scientific community at large.
The taxonomic characterization of DNA from packrat midden series has the potential to generate a millennial-scale timeline of community composition. These data and future developments to ancient metagenomics methods will aid in understanding the effects of changing environments and provide a clearer picture of the processes involved in ecosystem turnover. This may require improvements to and expansion on the methods presented here, but should be feasible by sampling more ancient packrat middens across a wider transect of geography and time, more replication of sequencing, deeper sequencing, experimentation with target capture and metabarcoding methods, and the establishment of more complete reference databases for global and regional flora and fauna relevant to ecosystems represented in packrat middens.

| DNA recovery and midden handling
Processing of packrat middens for macrofossil analysis is not compatible with ancient DNA protocols. In this study, we successfully recover DNA from both processed and voucher specimen middens ( Figure 3). Typical analysis of packrat midden plant macrofossils involves dissociating the midden matrix in water (Betancourt et al., 1990). This protocol appears to further degrade any endogenous DNA and introduce significant contamination. The processed samples analyzed here are dominated by bacterial sequences (Figure 4) low overlap with the plant macrofossil taxa known from those samples ( Figure 6). Obvious contaminants include reads mapping to Echinodermata (in one sample) and humans, which indicates the presence of contamination above that of the unprocessed ancient samples (Figure 4). While this is unfortunate, it is not surprising; it has long been known that aDNA samples must be handled with extreme care to preserve the remaining material and to limit contamination from modern sources (Cooper, 2000). However, midden processing is a destructive process in the context of aDNA work, and we encourage developing new field and laboratory protocols that better ensure aDNA preservation, and more routine archiving of unprocessed midden material as voucher specimens.

| Macrofossils versus aDNA
In these analyses, we show that DNA extracted from packrat middens can be attributed to a range of organisms consistent with known or

(C)
<1 ka 1−5 ka >25 ka F R T 5 0 4 − 1 − 6 − 1 9 F R T 5 1 1 A − 2 − 6 − 1 9 P P C 5 2 4 − 1 − 6 − 1 9 P P C 5 2 9 − P rA P P C 5 2 9 − U n A P P C 5 2 9 − U n B T S 2 1 1 F − 2 P r T S 2 1 1 F − 4 − 6 − 7 T S 2 1 1 F − U n . It may also be that DNA fragments are being misclassified due to missing data in the reference database or bias toward more complete genomic records (Harbert, 2018). However, the putative ancient sequences described here closely match the appropriate midden surroundings ( Figure 6) and importantly are missing common wind-pollinated genera (e.g., Quercus and Platanus) that inhabit Central Park, which is the closest pollen source to where the laboratory work was conducted.
Conversely, there are macrofossils identified in the middens that do not show up in the aDNA analysis. There are at least three reasons why this may be: (a) The specific sample taken from the midden might not have the taxon in question; (b) the aDNA that macrofossils might have once contained is no longer viable for analysis and accordingly failed at the extraction or sequencing step; or (c) there may be a lack of genomic data to place sequence reads in the fossil group. Like any fossil, there is presumably a point at which midden contents will be too degraded to recover DNA from, no matter how sensitive sequencing technology becomes, and this process intuitively should occur more rapidly for some species and tissues, and in some midden localities. For example, the best characterized and highest performing midden samples came from the Pinnacle Pass Cave locality, PPC524 and PPC529 (Figure 6).

| Tracking plant communities
It is well established that packrat midden plant macrofossils provide insights into changes in plant communities through time. The patterns and relative abundances of plant families identified in the shotgun metagenomic analysis here provide evidence of similar changes ( Figure 4) to what has previously been observed in these middens (Weppner et al., 2013). In the City of Rocks midden series, we observe that the oldest middens contain DNA dominated primarily by Poaceae (grasses) and Asteraceae (composites) (Figure 4d). This composition may suggest tree-less shrubland, much like modern Artemisia sagebrush ecosystems. In the middle section of this series (~1-5 ka), the  Figure 4d). Accordingly, metagenomics seems like a viable alternative to macrofossils, but, given the incomplete overlap between these methods, adding metagenomics to prior macrofossil work should indeed refine the understanding of sites even further. Like many new molecular methods that had previously required more manual species identifications, the physical and molecular methods are highly synergistic-combining them creates more refined results [e.g., (Parducci et al., 2019;Weiskopf et al., 2018)]. For instance, sometimes, macrofossils in this study were more effective for certain taxa (e.g., Fabaceae; Figure 5), but metagenomics provides insights into other cases (e.g., all bacteria and most metazoan taxa; Figure 4).

| Beyond plants
While our focus is on flowering plants and conifers, there are abundant data on a variety of lineages. Prokaryotes are dominant in the dataset, and further work could potentially tease apart any modern contaminants from focal ancient sequences. Viruses are present in low quantities, but recent work in other packrat middens has isolated ancient viral sequences (Larsen et al., 2018). The nonplant eukaryotes, such as vertebrates and arthropods, are even more workable, given that there are clearly defined macrofossils present in packrat middens for other vertebrates and insects (Elias, 1990(Elias, , 1992Hall, Van Devender, & Olson, 1988;MacKay & Elias, 1992;Mead, Devender, & Cole, 1983;Van Devender & Mead, 1978). Not only might this be able to establish what animals are generally in the area (e.g., arthropod sequences are the most common), it could also be useful for paleo-population work on the packrats themselves as well as their diet and parasites.

| Amplicon for packrat aDNA
Metagenomic sequencing in this study far outperformed amplicon results; however, our goal with amplicon sequences was to use somewhat longer fragments (~185 bp for rbcL and ~300 bp for ITS2) to see if species-or genus-level identification would be possible from the midden aDNA. Toward our goal, when amplicon sequencing was successful, it was highly capable of getting confident genus-level identifications. Unfortunately, the success rate for amplifying these amplicon loci was low in our samples (Table 2), and only two of the six tested samples yielded identifiable sequences. Future studies should add onto this work by attempting more replicates and comparing shorter amplicon sequences (that are more likely to amplify and may identify sequences at the genus or family level). In either case, it is worth noting that a clear advantage of metagenomics is its ability to get results on multitudes of taxa from microbes to plants and animals all in a single run, unlike the more singular approach taken in amplicon sequencing. The amplicon method has indeed been successful in South American ancient rodent middens (Díaz et al., 2019), and we expect that further refinement of amplicon sequencing methods applied to ancient packrat middens will yield positive results.

| Caveats: aDNA confirmation
The packrat midden aDNA samples analyzed in this study contain fragments of DNA from plants and animals that are consistent with what we would expect to and do find through morphological studies in these same packrat middens. As expected, the DNA recovered from packrat middens is highly degraded (Figure 7). The level of DNA damage provides a challenge for building profiles of taxonomic diversity. The damage assessment results suggest that much of the plant DNA extracted from these middens is ancient and derived from macrofossil fragments in each midden. However, further analysis to filter data based on the DNA damage patterns (Skoglund et al., 2014) may be required to see if DNA from modern sources can be removed from this metagenomic sample. Given the DNA damage results (Figure 7), it is still possible that there may be relatively little ancient or a combination of ancient and modern DNA mixed in our samples. Alternatively, all the DNA could be ancient but degraded and fragmented such that the damage patterns are obscured.

| Suggestions for additional analyses
The whole-genome shotgun metagenomic classification results can and should be tested in future studies with other classifiers, such as FALCON and Kraken (Breitwieser, Baker, & Salzberg, 2018;Pratas et al., 2018;Wood & Salzberg, 2014). These software tools use different methods to classify reads and may result in different classifications than those reported here as classifiers are known to produce variable results (Harbert, 2018;McIntyre et al., 2017). Comparing the results of several classifications could create a more comprehensive image of what is present in the sample if certain taxa are identified by all methods.

| Future directions
To maximize the potential of packrat midden ancient DNA, future studies should further account for the complexities of the samples.
There are many prospects for follow-up research on aDNA metagenomics. As with all NGS projects, sequencing depth is important for the scale of insights that may be generated, and as sequencers get cheaper and produce more reads, truly deep sequencing likely will become even more feasible. This may be important, as some bacterial studies find that very deep sequencing may be important to detect rare elements of a community (Nicholls, Quick, Tang, & Loman, 2018). Another technical aspect is the quality of reference databases. There are exceedingly few quality nuclear genomes that are currently deposited for plants. However, this is slowly but surely starting to change with rapidly falling sequencing costs and new technologies like long-read sequencing (Chen et al., 2018;Jung, Winefield, Bombarely, Prentis, & Waterhouse, 2019). As these databases become fuller, projects like this can even simply be reanalyzed to get more refined results. Target capture methods, as have recently been developed for plants (Johnson et al., 2019), may also be viable and help to enrich sequence data for regions where reference data with high taxonomic coverage exist. Confidence in aDNA analysis of packrat middens will increase with sequencing more frequent negative controls and increasing replication, as contamination often results from even the most careful studies.

| CON CLUS ION
Metagenomics of packrat midden aDNA has the potential to be a boon for paleoecology, helping to transform the field from one that requires substantial expertise for micro-and macrofossil identification to one that benefits additionally from experts working with high-throughput methods and big data. Here, we show that packrat middens up to ~30,000 years old contain recoverable DNA that is taxonomically consistent with macrofossils found in these deposits. Further investigation into the taxonomic composition of middens with aDNA analysis throughout the region could refine our understanding and the timeline of past climate change (Harbert & Nixon, 2018) and species migration and extinction, and this will better inform the study of the effects of current and future climate change. Deeper sequencing and targeted generation of reference sequences should improve classification precision in future work by providing clearer and more easily verifiable taxonomic classifications for degraded and mixed aDNA samples from packrat middens. The results presented here are an important step toward unlocking the potential of packrat middens as a diverse source of Late Quaternary aDNA and expanding the rich TA B L E 2 Packrat midden amplicon sequencing ecological information that can be obtained through the study of packrat midden plant macrofossils.

ACK N OWLED G M ENTS
We thank the American Museum of Natural History and the Gerstner Family Foundation for support of R.H.'s postdoctoral research. We also thank AMNH's NSF REU program (Award Number: 1358465) for support of the work done by G.M. Our gratitude is extended to Apurva Narechania, Martine Zilversmidt, and Chase Nelson for their consultation and feedback during the development of this project.

CO N FLI C T O F I NTE R E S T
None Declared. S.W.C. and J.B. helped revise later drafts.

DATA AVA I L A B I L I T Y S TAT E M E N T
All code and parameter settings for individual programs for the shotgun metagenomics portion of this project are published in a public code repository (https ://github.com/rsh42 9/Neoto maSeq ), and raw shotgun sequence data are archived and available through the NCBI SRA at BioProject PRJNA488629 (https ://www.ncbi.nlm.nih.gov/ biopr oject/ PRJNA 488629).