Metagenomic analysis of historical herbarium specimens reveals a postmortem microbial community

Advances in DNA extraction and next‐generation sequencing have made a vast number of historical herbarium specimens available for genomic investigation. These specimens contain not only genomic information from the individual plants themselves, but also from associated microorganisms such as bacteria and fungi. These microorganisms may have colonized the living plant (e.g., pathogens or host‐associated commensal taxa) or may result from postmortem colonization that may include decomposition processes or contamination during sample handling. Here we characterize the metagenomic profile from shotgun sequencing data from herbarium specimens of two widespread plant species (Ambrosia artemisiifolia and Arabidopsis thaliana) collected up to 180 years ago. We used blast searching in combination with megan and were able to infer the metagenomic community even from the oldest herbarium sample. Through comparison with contemporary plant collections, we identify three microbial species that are nearly exclusive to herbarium specimens, including the fungus Alternaria alternata, which can comprise up to 7% of the total sequencing reads. This species probably colonizes the herbarium specimens during preparation for mounting or during storage. By removing the probable contaminating taxa, we observe a temporal shift in the metagenomic composition of the invasive weed Am. artemisiifolia. Our findings demonstrate that it is generally possible to use herbarium specimens for metagenomic analyses, but that the results should be treated with caution, as some of the identified species may be herbarium contaminants rather than representing the natural metagenomic community of the host plant.


| INTRODUC TI ON
The world's herbaria house a vast number of plant specimens, ~350 million, some of which are up to 400 years old. Due to advances in DNA extraction and sequencing, especially next-generation sequencing (NGS), it is now feasible to include these specimens in genetic studies. This offers the possibility to use them to track changes over time if samples from different time periods are available (e.g., historical herbarium samples and modern populations from the same location; Bieker & Martin, 2018). The incorporation of historical herbarium samples into genetic analyses enables us to better estimate evolutionary metrics, such as the nucleotide substitution rate, than with modern data alone . In combination with the metadata found on herbarium labels (e.g., sampling location, altitude), changes in the distribution, extinctions in specific areas or introductions of new species can be inferred, which can inform conservation and invasive species control efforts (Nualart et al., 2017) in the face of the profound alteration of many of Earth's ecosystems due to anthropogenic activity (Peñuelas et al., 2013).
Herbarium specimens can provide a wealth of genomic information not only about the specimen itself, but also about microorganisms colonizing its tissues. These microbes originate either through colonization of the living plant (antemortem), such as host-associated commensal taxa, through colonization around the time of death (perimortem), such as pathogens that killed the plant, or through later (postmortem) colonization of the herbarium sheets. Thus, the metagenomic community we can observe today may contain DNA from environmental bacteria involved in decomposition or storage conditions (e.g., bacteria and fungi overgrowth), or contamination from sample handling (e.g., skin microbes) and laboratory sources (e.g., reagents contaminated with enzyme expression vectors ;Warinner et al., 2017).
Since the 1940s, chemical pesticides such as herbicides and fungicides have become widely used (Hahn, 2014;Vats, 2015).
These chemicals can affect not only crops and their associated microbial communities, but also plants that grow close to fields.
Herbarium specimens hold potential to shed more light on the impact of these treatments on the metagenomic composition when specimens pre-dating the common use of pesticides are compared to younger samples (Délye, Deulvot, & Chauvel, 2013).
When mounted herbarium specimens include the roots, metagenomic analysis of those organs can also provide insights into pollution levels, because for example the overabundance of nitrogen can alter root-associated bacterial composition (Lang, Willems, Scheepens, & Burbano, 2019).
Degradation of DNA is a well-known effect in historical tissue samples such as those from plant and animal natural history collections. Typical ancient DNA (aDNA) damage includes fragmentation to about 40-500 bp in length and varied degrees of hydrolytic deamination (Dabney, Meyer, & Pääbo,2013). The deamination of cytosine to uracil plays a major role, leading to apparent C to T transitions in the downstream analysis (Dabney et al., 2013;Staats et al., 2011). Weiß et al. (2016) found that the DNA in preserved plant specimens decays about six times faster than in moa bones (Allentoft et al., 2012), so DNA fragmentation of herbarium samples, despite their relatively young age (up to 400 years), can be comparable to those of animal remains several hundreds or thousands of years old. In addition, the endogenous DNA content of herbarium samples can be relatively low. For example,  found that as little as 16% of Illumina shotgun sequencing reads derived from herbarium specimens collected between 1839 and 1898 mapped to the Arabidopsis thaliana reference genome. For these reasons, Shepherd & Perrie (2014) suggest that the same precautions as for truly ancient samples should be used when working with herbarium samples. These include strict separation of pre-and post-PCR laboratories and performing pre-PCR steps in dedicated clean laboratories (Pääbo et al., 2004).
A limited number of studies have successfully characterized particular members of the metagenomic communities contained in historical herbarium specimens. For example, Malmstrom et al. (2007) were able to detect yellow dwarf viruses in up to 100-yearold herbarium specimens from various California grasses (Poaceae). Miller et al. (2016) used real-time PCR to detect the fungal pathogen Discula destructiva in up to 66-year-old herbarium specimens of Asian dogwood (Cornus). In another collection of studies, domestic potato (Solanum tuberosum) and tomato (Solanum lycopersicum) herbarium specimens known to be infected by the oomycete plant pathogen Phytophthora infestans were used to analyse single nucleotide polymorphisms (SNPs), simple sequence repeats (SSRs) and complete genome sequences of the pathogen (Martin et al., 2013;Saville, Martin, & Ristaino, 2016;Yoshida et al., 2013). In the only study thus far to compare complete metagenomic communities in herbarium specimens, Schubert et al. (2014) showed that the community can differ substantially between herbarium vouchers of the same species.
Following from previous observations that the endogenous DNA in herbarium specimens can be rapidly lost (or contaminated), we sought to better understand this process by characterizing the nature of the metagenomic community of historical herbarium specimens. We also wished to investigate whether particular microbes are responsible for postmortem colonization of mounted plants in herbarium collections. For this we used both previously published and novel shotgun sequencing data from modern and historical collections of Ambrosia artemisiifolia (Sánchez Barreiro et al., 2017). Am. artemisiifolia is an annual weed that is mostly found in disturbed habitats such as grain and cultivated fields and along roadsides (Bassett & Crompton, 1975;Payne, 1970). It is native to North America and became invasive following its introduction to Europe (Chauvel et al. 2006). To further elucidate which taxa are specific to herbarium specimens, we also included in the analysis previously published sequencing data from herbarium and modern specimens of Ar. thaliana, an annual plant native to Eurasia and Africa (Durvasula et al., 2017;Exposito-Alonso et al., 2018;1001 Genomes Consortium, 2016) that colonized North America after its probable introduction from Europe in the early 17th century (Exposito-Alonso et al., 2018).
We compare DNA sequenced from herbarium tissues collected between 1835 and 1993 with that from freshly collected, present-day tissues. We show that it is possible to extract metagenomic profiles from shotgun sequencing data even from the oldest samples. We found significant differences between modern and herbarium sample microbial communities and were able to identify three species as possible ubiquitous contamination of plant tissues in herbarium collections.

| Previously published data
For this study, we obtained previously published sequence data from leaf tissue destructively sampled from herbarium-mounted (Exposito-Alonso et al., 2018)

| Newly generated data
We generated new shotgun sequence data from wild-collected modern (North American) specimens of Am. artemisiifolia as well as from leaf tissue destructively sampled from a selection of historical herbarium specimens (North American and European). Table S1 presents a complete summary of the plant samples and data sources included in this study.

| DNA extraction, library preparation and sequencing
In this study, we generated additional sequence data for 43 presentday and historical herbarium-derived N. American Am. artemisiifolia samples where leaf tissue had been previously collected and DNA extracted according to the methods described in Martin et al. (2014).
Twenty-six of these samples had been previously converted into Illumina NGS libraries in Sánchez Barreiro et al. (2017), so for our study these indexed libraries were simply pooled and Illuminasequenced (see Table S1 for details about the sequencing method).
Seventeen of the present-day samples, along with an extraction blank, were converted into NGS libraries and shotgun-sequenced as described below.
For the previously extracted herbarium samples, blunt-end Illumina library preparation was conducted in dedicated, positively pressurized aDNA laboratories at the NTNU University Museum or the University of Copenhagen using either NEBNext library kit E6070L as in Sánchez Barreiro et al. (2017) or the BEST single-tube protocol (Carøe et al., 2017). Both methods involved the ligation of customized blunt-end adapters (Meyer & Kircher, 2010). Samplespecific, single-or dual-indexing barcodes were incorporated into each library using custom, indexed primers during the indexing PCR (Kircher, Sawyer, & Meyer, 2012). Indexing PCR was carried out in 100-µl reactions with 10-20 µl library template, 0.2 mM each dNTP, 0.2 µM forward primer, 0.2 µM reverse primer, 0.05 U/µl AmpliTaq Gold polymerase, 1 × AmpliTaq Gold Buffer, 2.5 mM MgCl 2 , 0.4 mg/ml bovine serum albumin, with the remaining reaction volume being completed with molecular biology-grade water. For each library, the optimal number of indexing PCR cycles was estimated using qPCR.
Indexing PCR was carried out under the following conditions: 95°C for 10 min, 10-15 cycles of 95°C for 30 s, 60°C for 1 min, 72°C for 45 s and a final extension of 72°C for 5 min. The amplified libraries were purified using either Qiagen QIAquick PCR purification columns with a final incubation for 15 min at 37°C prior to eluting in 32 µl Qiagen EB buffer, or SPRI beads prepared as in Rohland and Reich (2012) with a final incubation for 15 min at 37°C prior to eluting into 30 µl EBT buffer (Qiagen EB buffer with 0.05% Tween-20).
These libraries were pooled and Illumina-sequenced (see Table S1 for details of the sequencing method).
For the 17 modern wild-collected Am. artemisiifolia DNA samples that had not previously been shotgun-sequenced, the DNA was sheared to a mean length of 500 bp using a Covaris LE220, and blunt-end Illumina library preparation was conducted using the BEST single-tube protocol (Carøe et al., 2017) and customized blunt-end adapters. Sample-specific, dual-indexing barcodes were incorporated into each library using custom, indexed primers during the indexing PCR. Indexing PCR was carried out in 100-µl reactions with 7.5 µl library template, 0.25 mM each dNTP, 0.25 µM forward primer, 0.25 µM reverse primer, 1 µl Herculase II Fusion DNA polymerase, 20 µl 5 × Herculase II Reaction Buffer, and 65.3 µl molecular biology-grade water. For each library, the optimal number of indexing PCR cycles was estimated using qPCR. Indexing PCR was carried out under the following conditions: 95°C for 3 min, 10-18 cycles of 95°C for 20 s, 60°C for 20 s, 72°C for 40 s and a final extension of 72°C for 5 min. The amplified libraries were purified using AMPure XP beads (Beckman Coulter) prior to eluting into 27 µl EB buffer (Qiagen). These libraries were pooled equimolarly and Illuminasequenced (see Table S1 for details of the sequencing method).
In addition, we collected 10 historical herbarium samples of European Am. artemisiifolia from the Naturalis herbarium (NHN) collection, converted them into Illumina libraries along with an extraction blank, and shotgun-sequenced them (see Table S1 for details of the sequencing method). For these samples, leaf tissue was collected from herbarium sheets using gloves and sterile forceps. Tissue homogenization, DNA extraction and library preparation were performed in a dedicated, positively pressurized aDNA laboratory at the NTNU University Museum. Tissue homogenization was achieved using a Qiagen TissueLyser LT and a combination of stainless steel and tungsten carbide beads that had been sterilized in a 10% bleach solution and then rinsed in molecular biology-grade water. The DNA extraction was conducted as in Martin et al. (2014); in brief, the method involved the use of a Qiagen DNeasy Plant Mini Kit, following the manufacturer's protocols except for the addition of proteinase K (2.2 mg/ml final concentration) during the lysis/incubation step, which was conducted overnight for up to 16 hr.
For Am. artemisiifolia specimens, reads were mapped against an unpublished, 1.37-Gbp Am. artemisiifolia draft genome assembly. The endogenous DNA content was estimated as the number of raw alignments to the reference genome divided by the total number of raw sequences retained after trimming and quality filtering, and was calculated by paleomix. To test if the endogenous DNA content differs between herbarium and modern specimens of the same species, R version 3.4.2 was used to perform a Welch two-sample t test.

| Metagenomic profiling with blast/megan
For removing host sequences prior to the metagenomic profiling, bwa version 0.7.15 aln (Heng Li & Durbin,2009) was used for mapping with a minimum mapping quality (MAPQ) score of zero.
Sequences were mapped against their respective reference genome.
Unmapped reads were extracted from the resulting BAM files with samtools version 0.1.19 (Li, 2011;) and converted to FASTA files using the picardtools version 2.9.1 (http://broad insti tute.github.io/picard) samtofastq tool. If the resulting FASTA file of unmapped reads contained more than 1 million reads, 1 million reads were randomly selected using a custom Python script. Otherwise, all unmapped reads were used for the blast (Basic Local Alignment Search Tool) search.
The reads were then blasted against the nonredundant nucleotide database from NCBI using blastn (Camacho et al., 2009) with maximum e-value = 0.01 and maximum target sequences = 500.
For paired-end data, only reads from read 1 were used for the blast search. The blast results were imported to megan version 6.11.1 (Huson, Auch, Qi, & Schuster, 2007) using the weighted LCA (lowest common ancestor) algorithm with the following parameters: min score = 100, max expected = 0.01, min percent identity = 0, top percent = 10, min support percent = 0.1, min support = 1, percent to cover = 80. In the megan analysis, hits against Viridiplantae, Metazoa and unclassified taxa, such as "environmental samples<Bacteria>" and "unclassified Bacteria", were ignored. For taxonomic-level assignment, megan used the March 2018 version of the US National Center for Biotechnology Information (NCBI) nucleotide to taxonomy mapping database. For the taxonomic binning, megan used the NCBI taxonomy. The weighted LCA algorithm first assigns each reference sequence a weight based on the number of reads that have hits against that sequence. Each read is then placed on the NCBI taxonomy on the node that is above 75% of the total weight of all the references against which the read has significant hits. As reference sequences in the NCBI database can be associated with several species (up to 1,000), a read may be placed on a higher node even if it only has one significant hit. To compare the different samples, megan's compare function was used, ignoring unassigned reads. Counts were normalized to the smallest count. Bray-Curtis distances were used for a principal coordinates analysis (PCoA). This distance method takes the abundance of species into account (Mitra, Gilbert, Field, & Huson, 2010).
To test if the variance in the PCoA plot could be explained by various sample groupings (modern Ar. thaliana, herbarium Ar. thaliana, modern Am. artemisiifolia and herbarium Am. artemisiifolia), the distance matrix was exported from megan and a pairwise PERMANOVA test with 9,990 permutations was performed in R version 3.4.2. For the 20 taxa that explain most of the variation in the PCoA, we performed a Welch two-sample t test in R version 3.6.2 to test if some taxa are predominantly found in herbarium specimens. Possible herbarium contaminants were identified as those individual microbial taxa for which the presence/absence data indicated a significant difference between both conspecific sample groups (herbarium Am. artemisiifolia vs. modern Am. artemisiifolia and herbarium Ar. thaliana vs. modern Ar. thaliana) and a nonsignificant difference between herbarium samples of Am. artemisiifolia and Ar. thaliana. These taxa (Alternaria alternata, Alternaria solani and Eimeria mitis) were present in more than 30% of herbarium samples from Ar. thaliana and Am. artemisiifolia specimens and were absent or nearly absent in modern specimens. The megan analysis was repeated with those taxa disabled.

| Alignments to reference genomes for important microbial taxa
Because the megan analysis identified Al. alternata, Al. solani and E. mitis as a major component of the metagenomic community, especially in herbarium specimens, we sought to better characterize the sequences originating from these genomes. Thus we used paleomix version 1.2.13.4 and adapterremoval version 2 (Schubert et al., 2016) to remove adapter contamination and then to directly map all raw reads against the 33.5-Mbp Al. alternata (Nguyen, Lewis, Lévesque, & Gräfenhan, 2016), the 33.1-Mbp Al. solani (Wolters et al., 2018) and the 72.2-Mbp E. mitis (GenBank assembly accession: GCA_000499745.2) reference genomes. The mapping was performed with bwa version 0.7.15 aln (Li & Durbin, 2009) with no MAPQ filtering. Information about the raw genomic sequencing depth was obtained from the paleomix summary files. mapdamage version 2.0.8 (Jónsson, Ginolhac, Schubert, Johnson, & Orlando, 2013) was used to obtain frequencies of base misincorporation (damage patterns) for mapped reads. mapdamage results were plotted only for samples with ≥ 0.2 × mean sequencing depth of the Al. alternata or Al. solani genome and the highest-depth for the E. mitis genome since no depth higher than 0.06 × was obtained for this genome.

| RE SULTS
The samples derived from herbarium collections contained a significantly lower proportion of endogenous DNA content than modern samples (Arabidopsis thaliana: mean herbarium samples: 0.77, mean modern samples: 0.95, p = 5.889 × 10 −6 ; Ambrosia artemisiifolia: mean herbarium samples: 0.84, mean modern samples: 0.91, p = 7.999 × 10 −6 , Figure 1). From the metagenomic profiles obtained for all study samples, a total of 205 different species-level taxa could be identified in the entire data set. Only two of the 20 most common species were found in the extraction blanks: Eimeria mitis was present in both blanks and Methylorubrum extorquens was only present in the extraction blank performed alongside the herbarium Am. artemisiifolia specimens. All species identified in the blanks were found in low abundance except E. mitis, which was observed in the extraction blank corresponding to modern Am. artemisiifolia specimens.
The most abundant metagenomic species observed in the overall data set is Albugo laibachii, a known pathogen of Ar. thaliana (Kemen et al., 2011). Indeed, we find this species in all Ar. thaliana samples, but also in two herbarium and one modern Am. artemisiifolia sample (Figure 2). Most of the differentiation between modern and herbarium samples in the PCoA (Figure 3a) is explained by the presence of Alternaria alternata, E. mitis and Al. solani in herbarium specimens. E. mitis is present in a majority of Ar. thaliana (19 of 34) and Am. artemisiifolia (25 of 46) herbarium samples and is almost absent in modern Ar. thaliana (2 of 28) and Am. artemisiifolia (4 of 17) samples. E. mitis is the only taxon among the most common taxa of the entire data set that was found in both the extraction blanks that were performed alongside the Am. artemisiifolia samples. Up to 0.8% of raw reads map against the E. mitis reference genome. These sequences show no consensus pattern of aDNA damage ( Figure S1).  Table S3). The difference is smaller between herbarium and modern samples from the same species than between species, with herbarium versus modern Am. artemisiifolia being more similar than herbarium versus modern Ar. thaliana. Herbarium samples of Am. artemisiifolia versus Ar. thaliana are more similar than any other comparison between species. This difference between the different host species (herbarium/modern Ar. thaliana versus herbarium/modern Am. artemisiifolia) becomes more pronounced when "herbarium contamination" taxa (Al. alternata, Al. solani, E. mitis) are removed from the analysis (Figure 3b). This also leads to samples from the same species becoming more similar, and in the case of Ar. thaliana, the difference between the two groups becomes less significant ( Figure 6; Table S3).

| D ISCUSS I ON
Principally because it reduces the endogenous DNA content of historical samples, metagenomic "contamination" reduces the achievable sequencing depth-of-coverage of a target host genome. Thus, the metagenomic content of historical samples is a major factor determining which will receive valuable sequencing resources in shotgun sequencing studies of samples from natural history and archaeological collections (Carpenter et al., 2013). In this study, we present a new perspective on the use of herbarium specimens with a comparison of metagenomic composition of shotgun sequencing data from (historical) herbarium and modern plant samples. Our results confirm previous observations that herbarium samples have a lower endogenous DNA content than modern samples. This could indicate that diverse microbes colonize herbarium specimens during preparation or storage, and therefore reduce the proportion of host DNA of the sequenced reads (endogenous content). Indeed, we identify several microbial species that are predominant only in herbarium samples.
We find that a particular microbial community develops postmortem in plant tissues preserved in herbaria. By removing this "herbarium contamination", the natural metagenomic community of historical specimens can be recovered. Through comparison with contemporary specimens, microbial changes over time can be determined.
The different metagenomic communities we observe between herbarium and modern samples could also arise from differences in sample handling in the laboratory. It is well known that common reagents and extraction kits can be contaminated with microbial DNA (Salter et al., 2014). So far, more than 180 genera have been identified as laboratory contaminants (Glassing, Dowd, Galandiuk, Davis, F I G U R E 2 Per-sample abundance of most important microbial taxa explaining variation in the PCoA. Circle area scales with the relative abundance of each species in each sample. Species are ordered by their importance for the PCoA, with Albugo laibachii accounting for most of the variance. Green: herbarium Ambrosia artemisiifolia, orange: modern Am. artemisiifolia, blue: herbarium Arabidopsis thaliana, pink: modern Ar. thaliana, yellow: blank extraction negative controls carried out alongside and for comparison with Am. artemisiifolia samples & Chiodini, 2016). The number and composition of contaminants can differ between extraction kits and even batches of the same kit. To distinguish the real metagenomic composition from background contamination, it is recommended to sequence an extraction blank (Glassing et al., 2016). From the most abundant species, only Eimeria mitis was found in both extraction blanks and Methylorubrum extorquens was present in low abundance in one extraction blank.
Because only extraction blanks for the Ambrosia artemisiifolia specimens were available, we additionally compared the identified species with the list of known laboratory contaminants (Glassing et al., 2016). None of the species that are most abundant in our data set were previously detected as laboratory contamination.
Although the presence of "herbarium contamination" taxa in herbarium but not modern samples might be explained by a change in the genetic diversity of Am. artemisiifolia and North American Arabidopsis thaliana populations, or by the fact that the modern Ar. thaliana samples are not collected from wild populations but grown in a glasshouse, we judge these possibilities unlikely as some of the species that are found in herbarium Ar. thaliana samples, but not in modern ones, are also found in herbarium samples of a different species (Am. artemisiifolia) and are nearly absent in modern, wild collected Am. artemisiifolia samples. Thus, we conclude that most of the abundant species found predominantly in herbarium samples colonized the specimens after collection, during either sample preparation or storage.  (Kimura & Tsuge, 1993). The fungus can also colonize indoor environments and induce allergic reactions (Gabriel, Postigo, Tomaz, & Martínez, 2016). As the specimens were sampled from ten different herbaria, from both North America and Europe ( A possible way to determine if the taxa colonized the herbarium specimens shortly after collection or later during storage is the DNA damage pattern. In older specimens, DNA decay leads to smaller fragment size, and deamination of cytosine to uracil leads to characteristic C-to-T transitions in the resulting sequences. When reads are mapped to a reference genome, the frequency of these transitions can be measured. If the DNA damage in reads mapping to Al. alternata is similar to that of reads mapping to the host specimen (Ar. thaliana or Am. artemisiifolia), the taxa probably colonized the specimen before or shortly after collection. Significantly lower levels of DNA damage or the absence of DNA damage in microbial reads could indicate that the herbarium specimen was infected more recently (Warinner et al., 2017), during storage in herbaria or handling in the laboratory. It is also possible that due to other factors such as the concentration of nucleases in the cells, the DNA of the fungus and the host specimen age differently. Therefore, probably occurred during or shortly after collection. Al. alternata is also found in three modern Am. artemisiifolia specimens that were collected into silica gel from the same wild population on the same day. It is possible that too little silica gel was used, resulting in slower drying of the plant material. This would indicate that the fungus is already on the plants when they are collected and is able to spread during the relatively slow drying process that is typical in the preparation of herbarium voucher specimens.

F I G U R E 3 Species-level
Interestingly, the second most abundant species, E. mitis, is a pathogen of chickens (Blake, 2015). Species of the genus Eimeria can infect diverse animal hosts, including cattle and goats (Chartier & Paraud, 2012). This species is primarily found in herbarium specimens (54% herbarium Am. artemisiifolia, 55%  (Warinner et al., 2017). However, the fact that this species was also found in the extraction blank indicates it is more likely a laboratory reagent contamination in most of the samples. Herbarium samples are usually more prone to laboratory contamination as they have low DNA concentrations. Therefore, small contaminations are having a greater effect than in modern samples with high DNA concentration. More sequencing effort per sample would be needed to determine if E. mitis is rather a laboratory contamination or a herbarum contamination or a combination of both.
Hymenobacter sp. PAMC26554 is common in herbarium samples (in seven Am. artemisiifolia and 13 Ar. thaliana samples), but is also found in four modern Am. artemisiifolia samples. This strain has been isolated from Antarctic lichens and is UV-radiation-resistant and adapted to cold climate (Oh, Han, Ahn, Park, & Kim, 2016).
Therefore, this species can probably survive common measures against contamination in herbarium collections, including freezing specimens before archival in the collections. It is also possible that UV-radiation-resistant bacteria survive in dedicated clean laboratory facilities used to handle ancient and historical materials, as these facilities are commonly sterilized using UV light. As the herbarium samples used in this study were processed in three different facilities, and given that these species are not present in all samples and are absent from the extraction blanks we processed, we propose rather that the herbarium specimens themselves are infected. Because some of these species were identified in herbarium specimens from different host plants (Ar. thaliana and Am. artemisiifolia), they seem to be specific to herbaria rather than specific to species.
Future research comparing the metagenomic communities of herbarium versus modern specimens in a broad panel of plants will be crucial to verifying these findings. Moreover, in future studies the DNA damage pattern of reads mapping to the host plant genome should be compared with those mapping to the herbarium-specific microbial genomes. This may help to determine when the infections developed (e.g., during drying of plant material, during long-term storage or during sample processing in the laboratory). Ultimately, this could inform efforts to prevent these infections or contaminations in the future, which would therefore enable more faithful preservation of these valuable specimens, along with their genomic and metagenomic contents.
The most abundant species in the data set, Albugo laibachii, is found in all Ar. thaliana herbarium and modern specimens, in two Am. artemisiifolia herbarium specimens, and one modern Am. artemisiifolia specimen. Albugo laibachii is an oomycete and an obligate biotrophic pathogen to Ar. thaliana (Thines et al., 2009). Am. artemisiifolia is not known to be a host of A. laibachii, but can be infected by the closely related species Albugo tragopogonis (Gerber et al., 2011), which is not represented in the database used for the blast search.
By removing the originating species from the database in their blast search, Warinner et al. (2017) found that reads are then assigned to the closest relative of that species. It is thus possible that reads originated from Albugo tragopogonis were falsely assigned to Albugo laibachii, the closest related species in the database.
Pseudomonas viridiflava is the most common pathogen of Ar.
thaliana populations in North America (Jakob et al., 2002) and was indeed found in 47% of the Ar. thaliana herbarium specimens in this study, but not in the modern samples. As the pathogen was previously found in modern Ar. thaliana specimens, and because the severity of P. viridiflava infection is highly related to environmental factors (Everett & Henshall, 1994), the absence of this species in modern Ar. thaliana samples in this study probably results from these specimens being cultivated in glasshouses rather than growing in the wild.
Removal of possible "herbarium contamination" taxa (Al. alternata, Al. solani and E. mitis) from the analysis brought the metagenomic communities of conspecific herbarium and modern samples closer to each other ( Figure 3) and increased the differences between species, except for modern Ar. thaliana and herbarium Am. artemisiifolia, where the differences decrease ( Figure 6). This indicates that the original host microbiome can be revealed from herbarium specimens when contaminating taxa are known, and therefore presenting the possibility to study microbial changes over time. Indeed, we find some species that are more common in either modern or herbarium specimens. For Ar. thaliana, the bacterium P. viridiflava is only found in herbarium specimens (16/34) and absent in all modern specimens, whereas Bacillus thuringiensis is only found in modern specimens (3/28). As the modern Ar. thaliana specimens were actually grown in glasshouses, these differences probably do not indicate real shifts of the metagenomic composition in wild populations.
Moreover, the presence of B. thuringiensis only in modern specimens could be a glasshouse contamination and might not be present on wild specimens. To further investigate this, future studies should compare the metagenomic composition of herbarium specimens with wild collected specimens.
For Am. artemisiifolia, the bacterium N. massiliensis is only found in North American herbarium specimens (14/46) and is more common in specimens from the east coast of the USA ( Figure 5). It is only found in four individuals collected from the Midwest, and in one individual from the Southeast. Notably, it was not detected in any of the European herbarium specimens. The bacterium was identified in specimens from five different herbaria and except for the MO herbarium, not all samples from each herbarium are infected. We therefore believe that N. massiliensis is not a herbarium contamination and the geographical pattern we observe is due to a real difference in the metagenomic composition. The fungal plant pathogen Cercospora beticola is more common in herbarium specimens (13/46) than in modern specimens (3/17) and is also absent in European specimens.
Interestingly, the three modern samples containing this pathogen are the same as where Al. alternata was found. The geographical distribution of N. massiliensis presence/absence in North America, as well as its complete absence in introduced historical populations in Europe, could indicate that the source population is more likely the western or southeastern population than the eastern population. It is also possible that the complete metagenomic community was not transmitted during Am. artemisiifolia's introduction to Europe, as the primary introduction vector is seeds (e.g., contaminated birdseed) and not whole plants (Chauvel et al., 2006).
The bacteria Aureimonas sp. AU20 and M.extorquens are more common in modern specimens (7/17 for both taxa) than in herbarium specimens (1/46 for Aureimonas sp. AU20 and 3/46 for M. extorquens). Both are found in only one herbarium specimen from Europe. The shift in the metagenomic composition of Am. artemisiifolia specimens through time could indicate genetic changes that increase the resistance to species such as N. massiliensis and C. beticola. Another factor could be the near-ubiquitous use of pesticides and fungicides in modern times. Am. artemisiifolia is usually found in disturbed habitats such as next to roads and abandoned and actively used agricultural fields (Bassett & Crompton, 1975;Payne, 1970) and could therefore be exposed to pesticides used in crop production. for preservation, and are therefore at such low abundances that they could not be detected using our methods.
To ensure that herbarium specimens can be used for metagenomic studies, we suggest the creation of a database of common herbarium contaminants that are found across species and herbaria.
This would make it possible to exclude these contaminants from the analysis and observe real shifts in the metagenomic makeup. To get a better understanding of the infection of Al. alternata and Al.
solani, we suggest performing experiments with freshly collected specimens from several species where part of the plant is dried on silica gel and another part is preserved with traditional herbarium methods. It would thus be possible to test if the presence of these metagenomic taxa is due to the preservation method.
Our study is based on leaf material, but could be expanded to other plant parts (e.g., stem or roots). In that case, a similar approach should be used to determine possible "herbarium contamination", as these contaminants could vary between different parts of the plant. We also expect to find different metagenomic communities on other plant parts (see Busby et al., 2017). For example, root samples contain rhizome soil bacteria that are not found on leaves. Herbarium specimens often also contain roots, enabling the comparison of metagenomic root communities over time. This could give insights into, for example, the effects of historical changes in pollution levels (Lang et al., 2019), pesticide use and fertilization practices.
We have shown that it is possible to obtain the metagenomic composition from genome-skimming data sets based on herbarium specimens that were more than 180 years old. As some of the identified species were exclusive to herbarium specimens, found in both tested species and in herbaria from two different continents, we conclude that they have colonized the specimens during sample preservation or storage. Future studies should include herbarium specimens from additional plant species and compare them to modern wild-collected specimens to investigate the frequency of these species as herbarium specimen contaminants. Investigations of the DNA damage pattern may elucidate when the specimens were colonized, which could inform efforts to preserve specimens in modern herbaria. We suggest that future metagenomic studies on herbarium plant material should consider that some of the identified species may not belong to the natural metagenomic community of the host plant, and thus should be excluded from further analysis and biological interpretation.

ACK N OWLED G M ENTS
We thank our colleagues A. Frisch, M. Bendiksby and M. Nygård for their helpful comments as the study progressed. We thank F. Vieira for his technical help in the bioinformatic analysis. We thank A. L.
Kolstad for his help with the statistical analysis. For their generous contributions of tissues destructively sampled for this work, we also thank the herbarium staff and curators at the National Herbarium of the Netherlands (NHN), the Herbarium of the National Museum of Natural History (US), the Herbarium of the Academy of Natural Sciences (PH), the New York Botanical Garden Herbarium (NY), the Harvard University Herbaria (A) and the Herbarium of the Missouri Botanical Garden (MO). Illumina HiSeq4000 sequencing was performed by the NTNU Genomics Core Facility (GCF), which is funded by the Faculty of Medicine and Health Sciences, Norwegian University of Science and Technology (NTNU), and the Central Norway Regional Health Authority. We thank three anonymous reviewers for their useful comments which greatly improved the manuscript.

CO N FLI C T O F I NTE R E S T
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.