A nuclear DNA barcode for eastern North American oaks and application to a study of hybridization in an Arboretum setting

Abstract DNA barcoding has proved difficult in a number of woody plant genera, including the ecologically important oak genus Quercus. In this study, we utilized restrictionsite‐associated DNA sequencing (RAD‐seq) to develop an economical single nucleotide polymorphism (SNP) DNA barcoding system that suffices to distinguish eight common, sympatric eastern North American white oak species. Two de novo clustering pipelines, PyRAD and Stacks, were used in combination with postclustering bioinformatic tools to generate a list of 291 potential SNPs, 80 of which were included in a barcoding toolkit that is easily implemented using MassARRAY mass spectrometry technology. As a proof‐of‐concept, we used the genotyping toolkit to infer potential hybridization between North American white oaks transplanted outside of their native range (Q. michauxii, Q. montana, Q muehlenbergii/Q. prinoides, and Q. stellata) into a horticultural collection surrounded by natural forests of locally native trees (Q. alba and Q. macrocarpa) in the living collection at The Morton Arboretum (Lisle, IL, USA). Phylogenetic and clustering analyses suggested low rates of hybridization between cultivated and native species, with the exception of one Q. michauxii mother tree, the acorns of which exhibited high admixture from either Q. alba or Q. stellata and Q. macrocarpa, and a hybrid between Q. stellata that appears to have backcrossed almost exclusively to Q. alba. Together, RAD‐seq and MassARRAY technologies allow for efficient development and implementation of a multispecies barcode for one of the more challenging forest tree genera.

In the past, markers such as restriction fragment length polymorphisms (RFLP), amplified fragment length polymorphisms (ALFP), and simple-sequence repeats (SSR) have been used for population genetics, phylogenetics, and genotype/phenotype studies in a wide range of forest trees, including oaks (Chagné, Lalanne, Madur, Kumar, & Frigério, 2002;Curtu et al., 2007;Durand et al., 2010;Hipp & Weber, 2008;Hipp et al., 2014;Van Droogenbroeck et al., 2004). However, ascertainment bias (Putman & Carbone, 2014), reproducibility issues, and species specificity (Twyford & Ennos, 2012) often limit their utility across a range of species. Moreover, diagnosis of hybrids often requires a high number of loci. EST-linked SSRs (Durand et al., 2010;Sullivan et al., 2016) and SSRs not known to be associated with coding regions (Craft & Ashley, 2006) have proven valuable for understanding hybridization in systems with a relatively small number of species, but the large number of loci needed to diagnose hybrids in multispecies hybrid zones makes SSRs a difficult marker to apply. With the advent of inexpensive high-throughput sequencing, SNPs have become a marker of choice for rapid genotyping (Alberto et al., 2013;Chancerel et al., 2011;Lijavetzky, Cabezas, Ibáñez, Rodríguez, & Martínez-Zapater, 2007;Novaes et al., 2008).
There are several ways to identify 100s or 1,000s of SNPs in model organisms (Appleby, Edwards, & Batley, 2009;Yan et al., 2010). In the past few years, restrictionsite-associated DNA sequencing (RAD-seq) and related methods (e.g., genotyping by sequencing, double-digest RAD) have emerged as particularly efficient approaches to genotype as well as to identify SNPs for further study (Baird et al., 2008;Barchi et al., 2011;Davey & Blaxter, 2010;Pegadaraju, Nipper, Hulke, Qi, & Schultz, 2013;Scaglione et al., 2012). These methods have been applied with great success to understanding phylogeny, species circumscription, and hybridization patterns in oaks Fitz-Gibbon, Hipp, Pham, Manos, & Sork, 2017;Hipp et al., 2014;McVay, Hauser, Hipp, & Manos, 2017). However, the costs are often relatively high per individual and require a library of identified individuals for comparison, making RAD-seq and related methods impractical for identification of individual trees outside of the context of tree diversity research programs.
DNA barcoding is often problematic in trees due to lineage sorting, hybridization, or low polymorphism (Arca et al., 2012;Hollingsworth, Graham, & Little, 2011;Percy et al., 2014;Simeone, Piredda, Papini, Vessella, & Schirone, 2013). Oaks are no exception in this regard: to date, an oak DNA barcode has seemed impracticable, as no combination of a small number of loci provides reliable species diagnosis (Denk & Grimm, 2010;Hubert et al., 2014;Hipp, 2015;Oh & Manos, 2008;Piredda, Simeone, Attimonelli, Bellarosa, & Schirone, 2011;Simeone et al., 2013;though cf. Chen, Zeng, Liao, Yan, &Yang et al., 2017 for recent successes in East Asian oaks). In this study, we present an 80-SNP DNA barcode for species identification and hybrid studies of a suite of common eastern North American white oaks. Our addresses solves the problem of identifying barcode-informative loci through careful selection of a core set of informative and easily amplified SNPs from a broadly sampled RAD-seq dataset . We then develop PCR primers for multiplex amplification and SNP genotyping using the MassARRAY platform and apply this DNA barcode to a pilot study of hybridization among white oaks in the living collections of The Morton Arboretum. The resulting genotyping toolkit builds on a previous 2-species oak barcode using MassARRAY technology (Guichoux, Lagache, Wagner, Léger, & Petit, 2011) and provides a model for rapid, inexpensive, high-throughput assessments of hybridization and species identity in multispecies oak stands and living collections.

| Sampling
For marker development, we used RAD-seq data from 63 samples selected to represent 15 species or species groups (Supplementary File 1) reported from a previous study .
Quercus muehlenbergii and Q. prinoides were treated together as a single species in our study, because our analyses as well as current phylogenetic work Pham, Hipp, Manos, & Cronn, 2017) suggest that the two are not distinguishable using multilocus nuclear data. The European white oaks and East Asian white oaks were treated as two species groups to identify markers that distinguish the two groups from one another and to study possible patterns of gene flow between American and Eurasian white oak. We did not, however, attempt to distinguish the Eurasian white oak species from one another, as our focus is on hybridization among eastern North American species. For SNP evaluation (Figure 1

| RAD-seq genotyping and clustering
RAD-seq DNA extraction, library preparation, and sequencing were conducted as presented previously in oaks Hipp et al., 2014Hipp et al., , 2018McVay, Hauser et al., 2017. Briefly, DNA for all RAD-seq samples was extracted from fresh or frozen material using the DNeasy plant extraction protocol (DNeasy, Qiagen, Valencia, CA, USA). DNA extractions were gel-quantified in agarose by visual comparison with a 1 kb+ Ladder (Invitrogen, Thermo Fisher Scientific, Waltham, MA, USA). Extraction concentrations ranged from 5 to 10 ng DNA/μl extraction. RAD sequencing library preparation was conducted at Floragenex, Inc. (Portland, OR, USA) following the methods of Baird et al. (2008) with PstI as the restriction enzyme. RAD-seq fastq data were analyzed using PyRAD version 1.64 (Eaton, 2014) (www.dereneaton.com/software). In this pipeline, sequences are clustered first by individual using USEARCH (Edgar, 2010), which allows sequences within clusters to vary in indels, nucleotide polymorphisms, and sequencing strand (direction).
After clustering, heterozygosity and sequencing errors are jointly estimated from the base counts observed across all sequences, sites, and clusters using the likelihood equation of Lynch (2008).
Heterozygotes are inferred by a binomial probability based on these parameters. Bases that could not be assigned with ≥95% probability were treated as unknown (N). Any locus possessing more than two haplotypes within individuals after correcting for sequencing errors was discarded, under the assumption that it included one or more paralogous sequences. For each individual, each locus was summarized as a consensus sequence, and consensus sequences were clustered among individuals to generate a data matrix for each locus. Due to variation in sequencing coverage and mutation F I G U R E 1 Flow diagram outlining analysis used in this study and future directions. RAD-seq (orange arrows) data from previous project were analyzed using PyRAD/RADami and Stacks. A stringent approach (dotted purple arrows) and relaxed approach (solid blue arrows) were used to develop two lists of potential SNP (pSNPs). These were subjected to the primer design software Assay Design Suite and data analysis using TYPER 4.0 software, screened, and checked for SNP validation. A list of verified SNPs was re-entered to the primer design software to select for 80 suitable SNP multiplexes, which comprise the barcode (turquoise arrows). Applications of the barcode include genotyping of North American white oaks and hybridization studies (yellow arrows) at the restriction site defining RAD loci, the resulting data matrix was not complete for all loci in all individuals. The following parameters were specified for PyRAD: minimum depth of reads per withinsample cluster: 10; maximum number of sites in a read that can have a quality score of less than twenty: 4; clustering threshold (percent similarity): 0.85; minimum number of samples in each across-sample cluster: 20; maximum number of individuals with a shared heterozygous site in an across-sample cluster: 3. All other settings used default values.
For comparison, RAD-seq data were also clustered using Stacks (Catchen, Amores, Hohenlohe, Cresko, & Postlethwait, 2011;Catchen, Hohenlohe, Bassham, Amores, & Cresko, 2013). Barcodes were first removed using gawk prior to filtering out low-quality reads with the "process_radtags" module of Stacks (Catchen et al., 2011(Catchen et al., , 2013. The wrapper program "denovo_map.pl" provided by Stacks was used to perform de novo assembly of RAD-seq. Ten reads were set as the minimum to be considered as a stack (m); distance between two stacks to be assembled to loci was set to two (M); and the distance between loci to be merged was set to four (n). Filtering criteria were set to 75% of individuals in a population required to be present in a loci (r) and 10 as the minimum number of populations in which a locus must be present to be considered (p).

| SNP identification in RAD-seq dataset
SNPs were identified using two methods. In the first, pairwise F ST was calculated between all pairs of oak species using the RADami (version ≥1.1-2) (Hipp et al., 2014) and hierfstat (version 0.04-10) (Goudet, 2013) packages of R. The resulting F ST matrix identified loci with fixed or nearly fixed differences (pairwise F ST ≥ 0.95) between any pair of white oak species, identified the location of SNPs within the locus, and identified how many pairwise comparisons were at or above the threshold F ST level (of 0.95). This approach generated a list of more than 1,090 loci containing 4,020 potential SNPs of F ST ≥ 0.95, all of which were exported as FASTA files. Loci were then filtered under both stringent criteria and relaxed criteria. The stringent approach excluded SNPs with ambiguity codes of any kind and resulted in 1,068 potential SNPs (Supplementary File 2). These were further subjected to manual inspection, and SNPs were retained if they were present in at least 50% of individuals from one species to be considered for SNP screening; this threshold was chosen to maximize the number of SNPs we could consider in design without wasting undue time on SNPs that might provide little power to distinguish among species. FASTA files were formatted with custom bash scripts that retrieved one species-specific SNP per locus sequence and masked other SNPs using IUPAC ambiguity codes (Supplementary File 2). The relaxed approach treated heterozygotic and ambiguous positions as potentially informative, as long as at least one individual per species was homozygous at the position distinguishing species; this expanded the list of potential SNPs to 1698. For both the relaxed and stringent approaches, loci were considered suitable for SNP identification only if they had at least 20 bp available for primer design at both the 5′ and 3′ ends.
For Stacks data, the program "populations" was used to compute pairwise F ST between species. Loci were derived from the vcf output file. Output files (vcf and fasta) were obtained and further processed with custom bash scripts and manually selected in Excel (Microsoft, Washington State, USA) to create a list of potential SNPs.
The corresponding position of a SNP within a RAD-seq stack was identified by a custom script from the output files (sumstats and vcf; Supplementary File 3).
Stand-alone NCBI-BLASTN version 2.2.30 (Altschul, Gish, Miller, Myers, & Lipman, 1990) was utilized to determine overlap of loci between PyRAD and Stacks. The command "makeblastdb" was used to convert each loci dataset into a database. BLASTN (version 2.2.30+) was used to compare PyRAD loci to a Stacks loci database followed by reciprocal BLASTN of Stacks loci against a PyRAD loci database.
The e-value of 1E−05 was selected as the threshold for recognizing loci as orthologous; this is a commonly used value and is relatively arbitrary, but given the restricted nature of the dataset provides a framework for comparing loci. The output was saved as tabular format. Loci that matched between the PyRAD and Stacks outputs with alignment length exceeding more than 95% were treated as duplicates and excluded from downstream analyses.

| Primer design
Assay Design 4.0 Suite (Agena Biosciences, San Diego, CA, USA) was used to design primers for MassARRAY analysis (Bradić, Costa, & Chelo, 2012). A maximum of 40 SNPs were multiplexed, and a total of four multiplexes were screened per chip. Primers with a high percentage of Ns, indels, and primer-dimer formation potential were not considered for multiplexing.

| SNP evaluation
After SNP screening, SNPs present in all individuals of a species were marked as fixed SNPs, whereas those present in the majority of individuals were marked as nearly fixed; 11.6 ± 7.7 (SD) SNPs selected as diagnostic for each of the 13 species and two species groups targeted (   to be monophyletic. Phylogenetic analysis was conducted using PhyML (Guindon et al., 2010) with mode = "A la carte" and number of bootstraps = 100, as implemented in phylogeny.fr (Dereeper et al., 2008). FigTree version 1.4.2 (http://tree.bio. ed.ac.uk/software/figtree/) was used to visualize results.

| Sampling and DNA extraction from white oak acorns
Acorns were collected in fall of 2013 and 2014 from nine mother trees in The Morton Arboretum oak collection site (Figure 2).
Mother trees were selected to be species of Illinois that are not native to The Morton Arboretum. Three species-Quercus stellata, the subject of a detailed distribution study (Prasad, Gardiner, Iverson, Matthews, & Peters, 2013), Q. michauxii, and Q. montana-are southern species that have the potential to migrate north with climate change. The fourth, Q. muehlenbergii/Q. prinoides, ranges to northern Illinois but is not present in native stands on the grounds of The Morton Arboretum. Acorns were subjected to a five-minute floating test in cold water. Any floating acorns were discarded. The remaining acorns were counted and underwent cold stratification for 2 months. A maximum of 50 acorns per species were planted and grown in the glasshouse for 3 months, with the exception of Q. montana with 100 acorns.
After germination, leaves were harvested and stored in zipper locking bags with paper towels to remove excess moisture and stored at −4°C.

| Postgenotyping analysis
Following MassARRAY genotyping, SNPs were concatenated and converted to Phylip format for phylogenetic analysis. UPGMA trees were calculated in Geneious R9 (Kearse et al., 2012). The RAxML BlackBox Web server was used to run ML analysis with default settings (Stamatakis, Hoover, & Rougemont, 2008), and trees were uploaded to iTOL for annotation (Letunic & Bork, 2016). The Eurasian white oaks (Q. petraea, Q. robur, Q. fabri, and Q. aliena) were used to root the ML tree for purposes of visualization. Dataset templates retrieved from the iTOL help page were used for color-coding species (with the exception of the Eurasian white oaks, which were colored the same) and to highlight mother trees and learning samples   (Pritchard et al., 2000). DISTRUCT v1.1 was used for the visualization (Rosenberg, 2004). Vertical black lines represent grouping of individuals according to their population assignment. Each vertical line color-codes the genotype probability of an individual according to population grouping. Black bars over individuals represent adult trees. Black dots over individuals represent mother trees producing acorns genotyped in this study. (b) Estimation of best cluster number (K) calculated by STRUCTURE with respect to mean log likelihood of the data over five iterations (left axis) and ΔK plotted as a function of K (K = 1-6) (right axis) identity was not used as a prior. Markov chain Monte Carlo (MCMC) was run using a burn-in period of 100,000 MCMC steps followed by 1,000,000 data-gathering steps. Markov chains were run from K = one to six populations, five replicates for each value of K. To estimate the best number of clusters (K) given the data, mean ΔK over replicates was calculated (Evanno, Regnaut, & Goudet, 2005;Owusu, Sullivan, Weber, Hipp, & Gailing, 2015). K = 6 showed the highest support based on ΔK (Figure 3b).

| RE SULTS
Using the methods described above, we worked from a set of 32,959 RAD-seq loci to a candidate set of 844 SNPs. We developed primers for 344 loci (Table 1)

| SNP identification in RAD-seq data
Two different approaches were used to generate a list of potential SNPs for barcoding. First, the stringent screening approach resulted in the identification of 1090 loci of potential utility, which were exported as FASTA files (Figure 1, dashed line). We generated a list of 395 potential SNPs of which the majority (264 SNPs) were discarded by the primer design tool and 131 showed primer compatibility across four multiplexes. Of the 131 SNPs, 83 (63%) were added to the list of verified SNPs. The majority of these SNPs were diagnostic for Q. bicolor, Q. michauxii, or Q. lyrata. For Q. michauxii, we found more than fixed SNPs that were considered for the genotyping toolkit.
The relaxed screening approach (Figure 1, green

| Primer design and SNP genotyping
The Assay Design Suite was used to design suitable primers pairs for more than 844 potential SNPs, 395 from the stringent selection approach and 449 from the relaxed selection approach. The primer design software rejected 59% of these due to the presence of ambiguous nucleotides close to the SNP site, potential primerdimer formation, or placement of SNP sites too close to the 5′ or 3′ end of the amplified region. Primers were designed for the remaining 344 (40%) potential SNPs, of which we combined 291 SNPs into eight multiplexes for screening ( Table 1); 187 of these SNPs were confirmed as fixed (44) or nearly fixed (143) in a species, while 93 SNPs were identified as false positives based on failure to discriminate species in a broader range of 63 samples. Eleven SNPs did not amplify. Among 44 fixed SNPs, the majority belonged to Q. michauxii (11) followed by Q. muehlenbergii/prinoides (6) ( Table 2).
Quercus montana exhibited the most nearly fixed SNPs (13). The majority of Q. stellata SNPs (15) were found to be heterozygous or overlapping with another species. The most false positives were determined for Q. muehlenbergii/Q. prinoides and representatives of the Eurasian white oaks. This is somewhat remarkable given that Q. muehlebergii/prinoides was represented by six individuals in the initial RAD-seq screening, comparable to the number we had for Q. alba (4) and Q. michauxii (5).
The remaining 187 (65%) of the screened SNPs showed the dinucleotide pattern indicated during SNP identification using the RADseq data and were maintained in this study. We reduced the list of

Total
Passed Multiplexed Detailed breakdown of the SNP screens. Total corresponds to the established lists of potential SNPs and was considered for the primer development. "Passed" represents the number of SNPs meeting the criteria for primer development. "Multiplexed" represents the number of SNPs successfully multiplexed with the maximum of 40 SNP per multiplex and ordered for screening.

| Analysis of SNP data
The developed barcoding toolkit was used to genotype a total of 324 samples. The sample set consisted of 228 acorns, 37 adult trees not previously genotyped (including nine mother trees), and 59 learning samples drawn from the 63 samples originally sequenced using RAD-seq and used to identify SNPs. Maximum-likelihood analysis of the SNP dataset from parent trees and learning samples recovers a topology comparable to that described in previous studies (Hipp et al., 2014 (Figure 4). UPGMA analyses provide no additional insights over maximum-likelihood analyses and are excluded from further discussion.
A total of 70 SNPs in two multiplexes were successfully amplified. Both multiplexes together were required to identify species.
Amplification of SNPs varied among samples. The MassARRAY amplification success rate per mother tree was assessed by calculating the percentage of missing data per seedling for the two multiplexes and summarized by mother tree (Table 3). Overall, the rate of MassARRAY amplification is high, ranging from 57% to 100%, with the exception for two mother trees, Q. montana (602-2000*3) and Q. muehlenbergii (704-46*2 sd T) (Table 3).
Species were recovered as monophyletic based on the barcode SNP set with bootstraps values between 62% and 100% (Figure 4).

| Species integrity in a botanical garden
We used the barcoding toolkit to genotype nine mother trees from five species not native to the region and 228 acorns from them, as well as 28 additional adult trees (Table 3). STRUCTURE analysis was performed with 247 samples using 66 loci to estimate rates of hybridization in the genotyped samples. A K of 6 explained substantial variance in the data without overfitting (Figure 3b). Of 228 acorns, 191 acorns could be assigned to a single species, whereas 24 acorns showed >5% admixture (Figure 3). The majority of hybrid acorns were identified from a single mother tree, Q. michauxii 645-48*2. Of 63 acorns from this mother tree, 21 were admixed with either Q. alba or Q. stellata or Q. macrocarpa (Figure 3a). We observed no hybridization among acorns from the Q. muehlenbergii mother tree; however, two acorns from Q. prinoides 745-52*1 showed admixture with Q. macrocarpa. An interesting result was observed for Q. stellata.
Here, the adult trees separate from the progeny, but three adultsone of which includes the mother tree of the acorns germinated for this study-genotyped as hybrids with Q. alba. The progeny then genotyped as Q. alba or a hybrid between Q. alba and Q. stellata (one plant) rather than Q. stellata (Figure 3a). Furthermore, of 35 acorns from Q. stellata 210-91*1, two exhibited admixture from Q. montana or Q. macrocarpa (Table 3).
One acorn in our study (Quercus_stellata_STE_210_91_1_47) appears to be misclassified as Q. michauxii (Figure 3a). As the halfsiblings of this individual genotype as Q. alba or as hybrids between Q. alba and Q. stellata, we treat this individual as a labeling error during leaf harvest or DNA extraction.

| D ISCUSS I ON
Advances in next-generation sequencing technologies allow researchers to generate high amounts of reduced-representation data by RAD-seq, GBS, anchored enrichment, and other methods at relatively low per-base-pair cost. Such methods reduce the risk of ascertainment bias, as the primary criteria for developing genotyping strategies using these method are generally associated with marker number rather than variability per se. Moreover, although the choice of the restriction enzymes in RAD-seq, GBS, and related F I G U R E 4 Maximum-likelihood tree of mother trees, potential father trees, and learning samples. Species are color-coded (see legend). The mother trees genotyped in this study are marked with blue boxes. Learning samples are indicated with purple boxes. Q. aliena, Q. fabri, Q. robur, and Q. petraea are combined in this study and represented as Eurasian (Old World) white oaks. The tree is rooted by the Eurasian white oaks. Annotations were made in iTOL (Letunic & Bork, 2016) methods may introduce genome sampling bias (as, for example,

| Comparing pipelines: Stacks vs PyRAD
The bioinformatic queries of the RAD-seq dataset we undertook used different strategies to identify SNPs of potential genotyping value. PyRAD uses an alignment-based algorithm (USEARCH/ VSEARCH for clustering, MUSCLE for multiple alignment) to detect loci whereas Stacks uses an "off-by-N" threshold for clustering. It has been demonstrated that Stacks may recover as few as half the loci PyRAD does when indels are frequent (Eaton, 2014). In our study, the use of PyRAD for locus detection resulted in a higher number of SNPs that we initially screened (233 SNPs) and subsequently verified via MassARRAY and considered for the barcode (100 SNPs). Using Stacks, we identified ~25% as many SNPs (58 SNPs screened, 19 suitable for genotyping). PyRAD is expected to outperform Stacks in phylogenetic samples (Eaton, 2014), making it a better fit to our multispecies sample. Stacks is more suitable for the analysis of population data (Catchen et al., 2013) and may be as effective as PyRAD for within-species studies or studies among two close relatives, but our results support previous findings that at least in a phylogenetically structured multispecies context, PyRAD is likely to be more efficient at detecting shared loci.

| Constraints on primer design
The RAD-seq loci we used to design primers were approximately 85 bp in length (ignoring indels), which introduces a strong constraint on primer design. Listed are the individual IDs and species names of the corresponding nine mother trees. The amount of collected and planted acorns in Fall/Winter of 2014 per mother tree and their germination rate. The table is sorted by the number of germinated acorns. The amplification rate is divided into two categories: 71%-100% of markers amplified and 0%-70% of markers amplified. Hybridization frequency per mother tree is based only on individuals for whom 71%-100% of markers amplified.
using RAD-seq. This limitation could easily be circumvented using longer sequencing reads. In our study, primer design and multiplex suitability cut our list of potential SNPs by half: Of more than 800 potential SNPs, only 187 reached the final set of SNPs that we considered in detail for the final reactions. The high rate of attrition expected during any such project makes increasing sequence read length at the outset of the project-thus increasing the percentage of the genotype sequenced that is suitable for SNP amplification-an efficient way of increasing the number of potential SNPs.

| SNP quality
In most traditional genotyping projects, marker discovery aims to identify loci that harbor the greatest possible variation, with an eye toward capturing both within-and among-population variation. In our study, we used pairwise F ST to identify markers that minimize variation within species while maximizing differences among species, with the expectation that at least some outlier loci would exhibit strong pairwise differentiation for all species pairs. We had a relatively small sampling of numerous species to begin with, so we expected that our initial screening would discover false positives.
We did find that by screening more stringently, we could correct for some false positives: Our stringent screening approach yielded 83 SNPs, whereas the relaxed approach exhibited a higher rate of false positives.
We were not certain at the outset of the study that we would find sufficient numbers of fixed or nearly fixed SNPs to develop a barcode in oaks, which are renowned for hybridization and lineage sorting (Hipp, 2015). In our study, we found 40 apparently fixed SNPs unique to nine species (based on our admittedly small sampling within species) and 147 nearly fixed SNPs, as well as two fixed SNPs at the clade level that were useful for distinguishing southern species from northern species in our sample (the clade comprising Quercus stellata and its relatives; Table 2). We found very few fixed or nearly fixed SNPs in Q. stellata and Q. macrocarpa, probably for two reasons: in Q. macrocarpa, we started with a larger candidate pool in the RAD-seq dataset (N = 9 individuals), which may have rendered our F ST threshold for SNP discovery too stringent; and both species are wide in geographic range and known to hybridize with numerous other white oak species (Burger, 1975;Hardin, 1975;Rushton, 1993;Stoynoff & Hess, 2002;Whittemore & Schaal, 1991).
Furthermore, during SNP screening we observed SNPs that we initially judged to be fixed or nearly fixed SNPs in Q. stellata that also turned out to be present in individuals of Q. alba, Q. montana, and Q. michauxii. Despite this fact, our barcode set suffices to delimit even these species, although it may reduce our power to distinguish hybrids from relatively pure individuals in, for example, Q. stellata.
Our ability to distinguish species is an important result, given that 10-20 loci were needed to reliably distinguish just two white oak species in a previous barcoding study using the same technology (Guichoux et al., 2011). With additional sampling of individuals, we may find more false-positive loci. However, given that the toolkit presented here is effective at identifying samples beyond the initial RAD-seq sequencing run, and apparently effective in distinguishing hybrids from genetically pure individuals, we expect it to be robust to additional sampling.

| Hybridization in a mixed-species horticultural collection
The cultivated white oaks from which we sampled acorns exhibited low rates of hybridization for most individuals. This finding is at some odds with the traditional view, still quite common, that sympatric white oaks are utterly promiscuous (Burger, 1975;Hardin, 1975;Van Valen, 1976) or at least genetically cryptic (Craft & Ashley, 2006) and the expectation of high rates of hybridization in genotyped acorns (Moran et al., 2012 individual is also relatively close to the margin of a natural white oak forest, and it may be that in such a case, intraspecific pollination was swamped by interspecific pollination from the adjacent forest. The Q. stellata offspring we genotyped yielded an interesting result. The mother tree, unexpectedly, appears to have been an F 1 or early-generation backcross with Q. alba, but the offspring all appear to be nearly pure Q. alba based on the STRUCTURE analysis. Our results suggest that the mother tree of these offspring is a hybrid that mates preferentially with white oak, although this result bears additional study. This result is not outlandish, as the individuals sampled are all adjacent to a large forest with large numbers of Q. alba, and Q. stellata is known to hybridize with Q. alba (Whittemore & Schaal, 1991). However, our finding that all acorns cluster predominantly with Q. alba may suggest that the number of Q. stellata SNPs we developed is simply too low to distinguish backcrosses from pure individuals. Of the 66 loci analyzed using STRUCTURE, only three were designed to identify Q. alba, and one of these was a nearly fixed SNP; and 12 were designed to identify Q. stellata, but all 12 were nearly fixed SNPs. Additional genotyping, ideally of controlled crosses, will be helpful what power we have to distinguish between pure individuals, F 1 hybrids, and backcrosses between at least these two species.
By contrast, the Q. prinoides mother included in this study is a low-growing shrub planted in a cluster with two others of its species and substantially further from the forest margin than the Q. michauxii mother tree discussed above. None of its acorns appear to be hybrids (Figure 3). It may well be that landscape position plays an important role in hybridization patterns in this garden. However, generalizing our findings will take a more thorough study of numerous mother trees over a number of years, evaluating landscape position relative to other species. The barcoding presented here should facilitate such studies.

| CON CLUDING REMARK S
Living collections may be the last harbor for many endangered trees and important repositories of genetic diversity, but only if planting strategies and management are optimized to control gene flow and genetic diversity . DNA barcoding can be an effective tool to quickly assess whether newly introduced endangered species are susceptible to hybridization with native populations, but DNA barcoding is often problematic in trees due to incomplete lineage sorting and hybridization (Arca et al., 2012;Percy et al., 2014;Pham et al., 2017;Simeone et al., 2013). In this study, we solve the problem of DNA barcoding for a regional sample of interbreeding white oaks, producing a novel genotyping tool that works to identify species and their hybrids. The developed barcoding toolkit should be useful in predicting which species are more likely to hybridize, monitoring hybridization in response to climate change, and ultimately predicting how species introductions and planting strategies will impact hybridization in artificial plantings. Such attention to genetic conservation will help to guide and manage future plantings of oaks in common gardens, living collections, and elsewhere. Leroy for helping with the preparation of the in-house scripts. Last but not least, we would like to thank Dr. Remy Petit for insights at the outset of this project.

CO N FLI C T O F I NTE R E S T
None declared.

AUTH O R S' CO NTR I B UTI O N S
ALH and EF conceived and designed the study and formulated analyses. ALH and MH generated and analyzed the RAD-seq dataset. ALH developed analytical tools to identify SNPs in the clustered RAD-seq