Tracking invasions of a destructive defoliator, the gypsy moth (Erebidae: Lymantria dispar): Population structure, origin of intercepted specimens, and Asian introgression into North America

Abstract Genetic data can help elucidate the dynamics of biological invasions, which are fueled by the constant expansion of international trade. The introduction of European gypsy moth (Lymantria dispar dispar) into North America is a classic example of human‐aided invasion that has caused tremendous damage to North American temperate forests. Recently, the even more destructive Asian gypsy moth (mainly L. d. asiatica and L. d. japonica) has been intercepted in North America, mostly transported by cargo ships. To track invasion pathways, we developed a diagnostic panel of 60 DNA loci (55 nuclear and 5 mitochondrial) to characterize worldwide genetic differentiation within L. dispar and its sister species L. umbrosa. Hierarchical analyses supported strong differentiation and recovered five geographic groups that correspond to (1) North America, (2) Europe plus North Africa and Middle East, (3) the Urals, Central Asia, and Russian Siberia, (4) continental East Asia, and (5) the Japanese islands. Interestingly, L. umbrosa was grouped with L. d. japonica, and the introduced North American population exhibits remarkable distinctiveness from contemporary European counterparts. Each geographic group, except for North America, shows additional lower‐level structures when analyzed individually, which provided the basis for inference of the origin of invasive specimens. Two assignment approaches consistently identified a coastal area of continental East Asia as the major source for Asian invasion during 2014–2015, with Japan being another source. By analyzing simulation and laboratory crosses, we further provided evidence for the occurrence of natural Asian–North American hybrids in the Pacific Northwest, raising concerns for introgression of Asian alleles that may accelerate range expansion of gypsy moth in North America. Our study demonstrates how genetic data contribute to bio‐surveillance of invasive species with results that can inform regulatory management and reduce the frequency of trade‐associated invasions.


| INTRODUC TI ON
Biological invasions pose significant threats to native biodiversity and the global economy (Clavero & Garcia-Berthou, 2005;Pimentel, Lach, Zuniga, & Morrison, 2000;Simberloff et al., 2013). Preventing or at least minimizing introductions of invasive species remains a major challenge for pest management programs (Courchamp et al., 2017). Design of effective preventative strategies can be facilitated by identifying routes of introduction, particularly the geographic origin of the invasion (Estoup & Guillemaud, 2010). Historical records sometimes provide direct evidence of how and from where invasive pests arrived, but such records have become less available with the complexity of modern trades and transportations. Alternatively, indirect methods based on genetic similarity between invasive populations and populations from the organism's native range allow such inference (Dlugosch & Parker, 2008). For example, if the invading population shows a close genetic relationship with a particular native population, the latter is often considered the origin of the invasion (Darling, Bagley, Roman, Tepolt, & Geller, 2008;Estoup & Guillemaud, 2010;Rosenthal, Ramakrishnan, & Cruzan, 2008). A successful inference draws on the premises that genetic differences exist among native populations and such differentiation can be detected by a set of genetic markers (Cornuet, Piry, Luikart, Estoup, & Solignac, 1999;Picq et al., 2017;Roe et al., 2018).
The gypsy moth Lymantria dispar (L.) is one of the most destructive invasive pests, ranking third among the most costly invasive insects in the world (Bradshaw et al., 2016). Its native range encompasses the temperate forests of Europe and Asia (Pogue & Schaefer, 2007).
Although adult gypsy moths do not feed, the caterpillars are notoriously polyphagous, feeding on a wide variety of host plants belonging to 400-600 species (APHIS, 2016;Liebhold, Gottschalk, et al., 1995). During gypsy moth outbreaks, host plants can be completely defoliated and killed, causing shifts in natural forest composition as a result of host tree and seedling mortality (APHIS, 2016;Campbell & Sloan, 1977;Gottschalk, 1990). Three subspecies of gypsy moth are currently recognized, broadly corresponding to their geographic regions of origin: L. d. dispar (L.) mainly from Europe and introduced to North America, L. d. asiatica (Vnukovskij) from Asia east of the Ural Mountains, and L. d. japonica (Motschulsky) restricted to the Japanese archipelago (Pogue & Schaefer, 2007).
The subspecies L. d. dispar, also known as European gypsy moth (EGM), was introduced to the United States (U.S.) by a French artist and naturalist, Étienne Léopold Trouvelot, presumably from France between 1868 and 1869 (Forbush & Fernald, 1896). After a decade of remaining undetected, the population experienced successive outbreaks and has since invaded the northeastern U.S. and the Canadian Maritimes (Liebhold, MacDonald, Bergdahl, & Mastro, 1995). As one would expect from the introduction history, some studies identified an intermingled genetic pattern among populations from North America and Europe Zahiri, Schmidt, Schintlmeister, Yakovlev, & Rindos, 2019). Alternatively, other studies revealed significant distinctiveness between the North American population and contemporary European populations (Wu et al., 2015;Zhang et al., 2019). If the latter scenario is supported, it may suggest that the North American population is on the evolutionary trajectory of becoming a lineage independent from its source population(s).
After the initial invasion of EGM, the U.S. and Canada have undertaken tremendous efforts to prevent subsequent introductions of the Asian gypsy moth (AGM), a collective term referring to subspecies L. d. asiatica and L. d. japonica as well as three closely related species once considered subspecies of gypsy moth (APHIS, 2016;Bogdanowicz, Schaefer, & Harrison, 2000;Pogue & Schaefer, 2007).
The most notable difference between AGM and EGM is that AGM females are capable of sustained ascending flight over several kilometers before oviposition (Yang et al., 2012). This characteristic contrasts with the incapability of flight in most EGM females including the North American population, which usually oviposit near the site of pupation (Pogue & Schaefer, 2007). Asian gypsy moth also has different host preferences and exhibits elevated levels of polyphagy compared with EGM (Baranchikov, 1989). Thus, if established, AGM populations can potentially spread faster and further, while consuming plant species currently unaffected by EGM. The introduction of AGM in North America may also result in introgressive hybridization of flight and host preference alleles that can be partially inherited in the hybrid progenies, which could accelerate the spread of future outbreaks (Keena, Grinberg, & Wallner, 2007). Since the 1990s, AGM egg masses (rarely pupae) attached to surfaces of cargo and on ship superstructures have been intercepted annually at ports of entry in North America (APHIS, 2016;Savotikov, Smetnik, & Orlinskii, 1995; Figure 1).
Tracking invasions of gypsy moths depend on how populations are structured across native and invasive ranges. Multiple genetic groups have been recognized within L. dispar, some of which are connected by gene flow (Bogdanowicz, Mastro, Prasher, & Harrison, 1997;deWaard et al., 2010;Keena, Côté, Grinberg, & Wallner, 2008;Wu et al., 2015;Zahiri et al., 2019;Zhao, Wu, Kurenshchikov, Ilyinykh, & Shi, 2019). However, the complex evolutionary history of gypsy moth, further confounded by human-aided movement, cannot be delineated by a small number of nuclear or mitochondrial markers, which in turns limits the capacity to determine the origin of invasion. Larger genetic datasets, on the other hand, are more likely to resolve population structuring at both regional and local scales. A recent proof-of-concept study used single nucleotide polymorphism (SNP) data to differentiate eight laboratory-reared L. dispar colonies . Based on the observed differentiation, the study K E Y W O R D S admixture zone, amplicon sequencing, Asian gypsy moths, assignment test, Invasive species, natural hybrids demonstrated the potential for genomic data to assign individuals to their source populations. However, because the genetic diversity of colony moths is biased due to excessive, long-term inbreeding, it remains unclear the extent to which the study applies to wild populations .
In this study, we have generated a panel of 60 genetic markers (55 nuclear and 5 mitochondrial) derived from natural populations.
We used this panel to genotype over one thousand gypsy moth specimens sampled across Asia, Europe, North America, and North Africa, representing the largest dataset of field-collected gypsy moth to date. We supplemented the above sampling with L. umbrosa collected from Hokkaido, Japan, due to its possible genomic admixture with L. d. japonica (Pogue & Schaefer, 2007;Zhang et al., 2019).
We characterized genetic differentiation for a better understanding of evolutionary relationships between species, subspecies, and populations. Based on the recovered geographic structuring, we further investigate the ancestry of AGM intercepted at U.S. ports of entry and evidence of Asian introgression into North America. Two different approaches, the discriminant analysis of principal components (DAPC; Jombart, Devillard, & Balloux, 2010) and the support vector machine predictive (svm) model (Chen et al., 2018), were used for the assignment analyses. Our application provides critical information for the development of effective management measures, helps monitor and prevent additional AGM introductions, and ultimately facilitates safer bilateral trade.

| Sample collection and DNA extraction
We covered as much of the geographic distribution of Lymantria dispar as possible, with an intense sampling effort in East Asia and Europe. Gypsy moth samples (n = 1,181) were collected in 17 countries at 89 different geographic locations (average of 37 specimens/ locality) (Figure 2, Table S1). Adult moths were caught in traps baited with disparlure as the sex-attractant pheromone. Specimens of L. umbrosa (n = 37) were collected from Hokkaido and morphologically identified (Hannah Nadel). We included L. umbrosa in all subsequent analyses given its sister status with L. dispar and potential introgression into the latter species (Zhang et al., 2019). We further genotyped egg masses intercepted at U.S. ports of entry (Table S3) and male moths trapped in the Pacific Northwest between 2014 and 2015 ( Figure 3, Table S4). Those moths primarily comprised individuals diagnosed as EGM by U.S. Department of Agriculture (USDA) plus four specimens diagnosed as AGM. Specimens were stored dry either at −20°C or −70°C. Genomic DNA was extracted using the Qiagen DNeasy Blood & Tissue Kit (Qiagen, Venlo, the Netherlands) following the manufacture's protocol or a modified Proteinase K protocol (Maniatis, Fritsch, & Sambrook, 1982). For the latter protocol, a moth leg or antenna was placed in 500 µl extraction buffer (5% Proteinase K, 0.1% Tergitol, 1X TE buffer) and incubated at 37°C overnight. The extraction was deactivated by heating at 75°C for 30 min and then stored at −20°C.

| Genome-wide genetic marker discovery
To identify polymorphic amplicons while reducing ascertainment bias, we first selected 12 individuals (two per locality) from a broad geographic range including Germany, Japan, China, South Korea, the Russian Far East, and U.S. For each of these samples, we generated a double-digest restriction site-associated DNA (ddRAD) library (Peterson, Weber, Kay, Fisher, & Hoekstra, 2012), cutting with SbfI and MspI while simultaneously ligating P1 (SbfI) and P2 (MspI) adaptors to the fragmented DNA. Following amplification, the 12 samples were pooled and double size-selected (targeted range: 350-550 bp) using AMPure XP beads. This size-selected library was then sequenced using an Illumina MiSeq platform with paired-end reads (2 × 300 bp) at the Cornell University's BioResource Center.
After an initial quality check using FastQC (Andrews, 2010), raw Illumina reads were assembled into contigs (unique consensus sequences from multiple Illumina reads) using NGen (v.11) and default parameters. The resulting contigs (n = 1,713) were then filtered and (c) monomorphic contigs to exclude uninformative regions. After applying these filters, 117 contigs associated with multiple SNPs were initially selected. Primer pairs were developed to amplify a fragment of each selected contig using BatchPrimer3 (You et al., 2008; for primers sequences, see Table S2). All amplicons, or called loci hereinafter, were constrained to between 350-425 bp in size.
Primers were tested through individual PCRs with a sub-sample of gypsy moth populations. Sixty loci showing robust amplification (n nuclear = 55 and n mitochondrial = 5) were kept in the final genotyping panel. The median number of SNP per locus is 42 (25% quantile = 30, 75% quantile = 57). Sequence variants detected by a custom Perl script (available upon request) defined alleles at those loci for the individuals genotyped.

| Individual genotyping
Individual moths were genotyped at 60 loci in four randomly divided multiplex PCRs using the QIAGEN Multiplex PCR Kit (Qiagen, Venlo, the Netherlands). Primers targeting mitochondrial loci were added at 0.5X concentration relative to the nuclear primers to account for the increased copy number of the mitochondrial genome.
For allele calls at each locus, we ran the custom Perl script to extract reads and assign them to the appropriate locus and individual.

F I G U R E 2
Collection sites of wild populations of Lymantria dispar throughout the Holarctic region. Two populations of L. umbrosa were collected from Hokkaido, Japan F I G U R E 3 Collection sites for 37 male moths intercepted in the Pacific Northwest between 2014 and 2015. Four of those moths were initially identified as Asian gypsy moth by USDA The script (a) trimmed adapters and low-quality reads, (b) merged overlapping reads, (c) identified reads corresponding to each locus, (d) collapsed identical reads for each individual, and (e) identified the top two haplotypes for individuals at all loci (i.e., their diploid genotypes). We set a minimum read length of 225 bp and applied a matching command (-x) that requires 90% of the first 40 bp of a read to align to and match a reference sequence at each locus, thereby minimizing PCR artifacts and paralogs. Specimens that had < 10X coverage across all loci or failed at more than 15 loci (>25% missing data) were excluded from subsequent analyses. The final worldwide dataset include 950 field-collected L. dispar and 36 L. umbrosa specimens.

| Population diversity analysis
Genetic diversity was estimated for nuclear loci for all populations of L. dispar and L. umbrosa. Mitochondrial loci were excluded because they were linked and heterozygosity indices did not apply. Summary statistics including number of polymorphic loci, average number of alleles per locus, mean observed and expected heterozygosity, and average gene diversity were calculated for each country or geographic region in Arlequin 3.5 (Excoffier & Lischer, 2010). Similarly, average number of private alleles per locus was computed using the R package poppr 2.8.2 (Kamvar, Tab ima, & Grünwald, 2014). We further assessed linkage disequilibrium between all pairs of nuclear loci using Genepop 4.7 (Rousset, 2008) with 10,000 dememorizations in one million Markov chain approximations. Pairwise fixation indices (F ST ) were estimated in Genepop among genetic clusters identified below.

| Population structure analyses
To characterize genetic differentiation among our samples, we used Bayesian and multivariate clustering methods. The Bayesian clustering analysis was performed in STRUCTURE 2.3.4 (Pritchard, Stephens, & Donnelly, 2000) with the admixture model, which allowed individuals to have mixed ancestry from multiple clusters.
Only nuclear data were used due to the requirement of unlinked markers. We tested from K = 1 to K = 10 with 10 replicates for each value of K. Markov chain Monte Carlo was run for one million generations after a burn-in of 100,000 generations. The best K value was selected using the ΔK method in STRUCTURE Harvester (Earl & vonHoldt, 2012;Evanno, Regnaut, & Goudet, 2005), and the associated replicates were aligned under the Greedy algorithm of CLUMP 1.1.2 (Jakobsson & Rosenberg, 2007). Mean individual and population matrix were plotted in DISTRUCT 1.1 (Rosenberg, 2004). We discarded individual membership coefficients with posterior probabilities < 0.05 and assigned corresponding values proportionally to other membership coefficients so that the total coefficients for each individual still summed to one. Following a similar approach for analyzing genetic structure that may exhibit hierarchies (Coulon et al., 2008), we repeated the analysis on each of the K groups inferred in the previous step. Individual run length was adjusted to 500,000 generations with 200,000 burn-in period. This approach allows the identification of both higher and lower levels of population structure (Balkenhol et al., 2014).
A multivariate, hierarchical Discriminatory Analysis of Principal Components (DAPC) analysis was performed as follows based on combined nuclear and mitochondrial data. First, we used a hierarchical Ward algorithm through the function find.clusters to determine the minimum number of geographic groups that are genetically distinct from each other while best summarizing variations in the data.
The number of retained principal components (PC) was optimized by cross-validation starting with 300 PCs to avoid model over-fitting. Second, to further explore lower-level population structure, we repeated DAPC for individual geographic groups inferred in the previous step using a priori information of specimen's group membership. The optimal number of clusters per group was then chosen by varying K from 1 to 20. Lastly, after determining the total number of genetic clusters present among all groups, we carried out a final DAPC analysis with the full dataset and a priori information of specimen's cluster membership to display relationships among clusters.
The analysis started with 300 PCs and was cross-validated. All multivariate analyses were run on the devel version of R package adegenet 2.1.1 (Jombart et al., 2010). Two types of assignment tests were performed using the predictive model in DAPC and the svm predictive model implemented in the R package assignPOP 1.1.5 (Chen et al., 2018). To measure conclusiveness in the assignment, we adopted the "relative probability" following Schmidt et al. (2019), which was defined as the ratio between the highest and the second highest posterior probability of membership for a given specimen. A specimen was only assigned to a given reference group if the relative probability was ≥ 2, otherwise it was considered inconclusive.

| Assignment of intercepted AGM
Congruence between the two types of tests would indicate a robust assignment.
Before the assignment test, diagnostic power of the reference data was assessed through resampling validation. For validation in DAPC, 10 individuals were randomly selected from each reference group to form a test set, and the remaining specimens were used as the training set. A total of 20 replicates were performed with a different random test set each time. When test individuals were correctly assigned back to their original population, we proceeded to predict geographic origins of intercepted AGM specimens by the function predict.dapc. For validation in assignPOP, the Monte Carlo procedure randomly selected 10% individuals from each reference group as the testing data and the remaining 90% as the training data, which was crossed by top half loci with highest F ST values and all 60 loci. We ran 50 replicates for each cross-validation under the svm model retaining the same number of PCs as in DAPC. Then, assignment test was performed using the function assign.X.

| Detection of ongoing hybridization
When AGM evades inspection at ports of entry, it may hybridize with the North American population. Evidence for ongoing hybridization can be demonstrated by detecting early-generation progenies (i.e., F1, F2, and backcrosses). Here, we focused on the Pacific Northwest because this region is not only a likely hotspot for AGM introduction, but also witnessed recurrent introduction of EGM transported by humans from its established range on the Atlantic Coast. We examined 37 male gypsy moths captured in Washington and Oregon during 2014 and 2015 for their identity as AGM, EGM, or hybrids based on posterior probability of membership. Among them, four moths had been previously identified as AGM and the remaining 33 as EGM by the USDA standard gypsy moth assay (available upon request), which does not test for hybrids.
References for hybrid assignment were established through two approaches. First, we randomly selected 100 specimens collected from East Asia (including L. umbrosa) to represent the AGM parent and another random 100 U.S. specimens as the EGM parent. First generation (F1) and second generation (F2 and backcrosses between F1 and EGM) progenies were simulated for 100 individuals for each progeny group using the function hybridize in adegenet. For demonstration purposes, we did not simulate further generations. Field-collected parents and simulated hybrids were all used as reference. Second, because hybridization between geographic races of L. dispar can generate abnormal sex ratios or intersex progenies (Goldschmidt, 1934), which may violate the assumption from standard hybridization simulation, we per-

| Genetic diversity among worldwide populations
Linkage disequilibrium was only found in 5% of all pairwise comparisons from the 55 nuclear loci after correcting for multiple comparisons, suggesting that they are mostly unlinked. Population diversity statistics are summarized in Table 1. Despite variation in number of specimens sampled from Asia, Europe, and North America, the numbers of polymorphic loci were similar across countries or regions (mean = 50), but loci that were invariable within each country or region were different. The mean number of alleles per locus and private allele per locus were much greater in East Asia than other populations (two tailed t test, t (18) = 10.13, p < .0001; t (18) = 6.56, p < .0001). Average genetic diversity and heterozygosity were also highest in East Asia.

| Population structure of gypsy moth
The ΔK method selected K = 5 for STRUCTURE runs based on nuclear data only ( Figure S1). The worldwide populations of L. dispar and L. umbrosa were separated into five geographic groups For population structure using DAPC, the hierarchical Ward algorithm did not determine a single optimum number of genetic clusters and found equal support for models from a range of approximately K = 5 to K = 15 (50 PCs retained, Figure S3). The K = 5 model recovered the same groups recovered by the STRUCTURE analysis. We re-ran the analysis without the mitochondrial loci, and the result was nearly identical ( Figure S4). We then repeated the DAPC analysis for each of the five geographic groups to explore lower-level structure. The BIC scores and structure-like posterior probability of membership for individual specimens are shown in Figure S5. While the North America population did not support TA B L E 1 Sampling countries and regions, number of sampled individuals, and summary population diversity indices based on nuclear data. For detailed sampling localities in each country or region, see Table S1 further partitioning, the other four groups can be divided into 2

| Geographic origin of intercepted AGM
Based on results from the DAPC analysis, we selected four East Asian clusters as reference for the assignment test ( Figure 5a  The two types of test, DAPC and assignPOP (PC = 26), produced nearly identical assignments for 84 AGM eggs and five adults. All specimens but one were decisively assigned (relative probability ≥ 2) by the former test and all decisively by the latter test. For the only inconclusive specimen (relative probability < 2) in DAPC, its most likely origin was the same cluster that received the decisive assignment by assignPOP. Since multiple eggs were sampled from the same egg mass, those sibling eggs should be assigned to the same reference. In our case, all but one egg mass had identical or at least congruent assignment (i.e., one egg assigned and one egg inconclusive) among sibling eggs. The only exception, which was identified by both tests, produced a conflicting result that assigned one egg to the AGM admixture zone and other three eggs to L. d. japonica. This could be due to mislabeling or contamination.
Of the 26 AGM egg masses, 21 (80.8%) were assigned to the AGM admixture zone and four (19.0%) to L. d. japonica (Figure 5b; Table   S3). The percentage was similar for AGM adults, with four of five moths originated from the AGM admixture zone and the fifth moth from Japan. No egg mass or adult was assigned to either the main China cluster or L. umbrosa.

| Detection of ongoing Hybridization in the Pacific Northwest
Cross-validation for the simulated dataset demonstrated that AGM and EGM parents can be easily distinguished from first and second generation hybrids in either DAPC or assignPOP (PC = 10, Figure 6a).
Assignment accuracy for the parents ranged between 98.0% and 100%. Both tests also readily identified F1-EGM backcrosses (accuracy > 96.0%). However, about half of the time neither method was able to differentiate between F1 and F2 progenies. This was not unexpected since genomic profiles can be similar between those two groups given the number of markers used. For the laboratory dataset, F1 progenies seemed to follow Mendelian inheritance and resembled those generated by simulation (Figure 6b). Cross-validation using DAPC and assignPOP (PC = 13) showed perfect assignment of resampled parents and F1 with either top half high F ST loci or the full set of loci.
Among the 37 L. dispar adult males trapped in Washington and Oregon, the simulated dataset confirmed four AGM and 15 EGM specimens as pure parents. Both DAPC and assignPOP yielded identical assignment for those 19 moths (Table 3, Table S4).
However, as many as 13 moths were likely backcrosses between less than 0.05, but they could not be decisively assigned to either F1, F2 or backcross. Those individuals were considered as "undetermined hybrids." When using the laboratory dataset, DAPC and assignPOP identified at least 5 individual as hybrids. Those five specimens were also undoubtedly identified as hybrids under the simulated dataset (Table S4). suggest five major geographic groups, which are best regarded as the upper hierarchical level of population structure (Coulon et al., 2008). Four of the five groups can be further partitioned into 2-5 genetic clusters. In total, we defined 13 clusters whose relationships mirror the geography across the Northern Hemisphere when viewed from above the North Pole (Figure 4b). Our findings support the overall population structure recovered from genomic SNP and microsatellite studies Wu et al., 2015;Zhang et al., 2019) and offer new insight into fine-scale regional structuring.
The nominated subspecies L. d. dispar is thought to comprise populations from the western part of Eurasia with low genetic diversity (deWaard et al., 2010;Harrison, Wintermeyer, & Odell, 1983;Wu et al., 2015). In our study, it approximately corresponds to group B, which can be divided into as many as five clusters, including three from Europe. Historical work from Goldschmidt (1940) proposed two geographic races in Europe, dispar and mediterranea, for populations in northern and southern Europe, respectively. Even though the mediterranea race has never been formally recognized, both STRUCTURE and DAPC analyses show that moths from the Mediterranean region can be separated from other European populations (e.g., Germany), a finding that warrants re-examination of differentiation within Europe. Additionally, it has been shown that populations from the Mediterranean region (Spain and Italy) possessed mitochondrial lineages that branched out early in the gene tree (Wu et al., 2015), which could suggest association with former glacial refugia during the Pleistocene and signal a more complicated evolutionary history than previously thought (Hewitt, 2000). Our results also showed that most L. d. dispar specimens possess an ad- incompatibility (Goldschmidt, 1940;Higashiura et al., 2011). Although we cannot rule out the introgression hypothesis, close examination of Note: Lymantria umbrosa is included in AGM for assignment purpose. UH: undetermined hybrids, individual with relative probability < 2 but could be either F1, F2, or Backcross (BC) at probability > .95 (probability of being a purebred < .05). Inconclusive: individual with relative probability < 2 and could be either a purebred or hybrid.
our L. umbrosa specimens confirmed key morphological characteristics that match with the species diagnosis in Pogue and Schaefer (2007), such as creamy ground color of the forewing, white on the hindwing in contrast with dark margin, white legs, and tan-colored head and thorax. Those characteristics are often used to morphologically identify the two species, since male genitalia is indistinguishable between L.
umbrosa and L. dispar (Pogue & Schaefer, 2007). A mutually nonexclusive hypothesis for the observed pattern would involve ancestral polymorphism, in which the nuclear genome is expected to coalescence more slowly than the mitogenome does. Therefore, both processes may account for the close relationship between the two taxa. To fully understand their taxonomic status, use of genomic SNP data, additional sampling throughout the Hokkaido Island, and re-examination of the designated type specimens is required.

| Genetic distinctiveness of the introduced North American population
Despite plentiful historical records on the North American population after it was introduced from Europe (Forbush & Fernald, 1896), it is yet to reach a consensus regarding its origin and distinctiveness relative to contemporary European populations. Our full and nuclear-only datasets both identified clear differentiation between moths from the two continents, a result also supported by independent microsatellite data (Keena et al., 2008;Wu et al., 2015).
The separation was not recovered by genomic SNPs generated from laboratory-reared colonies that no longer resemble wild populations . However, whole nuclear genome data from a few field-collected specimens corroborated the split between U.S. and European samples (Zhang et al., 2019). Additionally, F ST values based on all nuclear loci between the North American and European populations were comparable to or even exceeded values among subspecies (Table 2, e.g., European L. d. dispar versus. L. d. asiatica or L. d. japonica versus. L. d. asiatica).
It has been demonstrated that founder effects in conjunction with adaptation to new environments can drive significant allele frequency changes between the founding and source population in merely tens of generations, leading to genomic differentiation (Shultz, Baker, Hill, Nolan, & Edwards, 2016;Marques, Jones, Palma, Kingsley, & Reimchen, 2018). population are yet to be found in other populations sampled from the native range. A similar finding was reported from microsatellite data as well (Wu et al., 2015). All those lines of evidence support the unique genetic status of the North American population (i.e., an independent evolutionary lineage) distinct from contemporary European populations. Future studies are needed to understand whether such distinctiveness is a result of genetic drift, local adaptation (e.g., Friedline et al., 2019), historical admixture from multiple source populations, or a combination of those factors.

| Geographic origin of AGM intercepted in the U.S
Using genetic data to infer the origin of biological invasions depends on the extent of population structuring in the species' native range and whether such differentiation can be recovered by the selected markers. Previously, Picq et al. (2017)

| Hybridization between AGM and EGM in the Pacific Northwest
Despite predeparture inspection at Asian seaports and arrival inspection at U.S. ports of entry, a few AGM egg masses (occasionally pupae) may still be overlooked and not intercepted (Bogdanowicz, Wallner, Bell, Odell, & Harrison, 1993 However, among the undetermined hybrids, there was one individual that had the sum of posterior probability of membership between F1 and F2 greater than 0.95, suggesting its identity as a nonbackcross hybrid (Table S4). Compared with the simulated dataset, the laboratory dataset only represents one scenario of the hybridization process (i.e., U.S. × Japan) and thus is more conservative. It still unequivocally identified at least five specimens as hybrids. We expect that other crossing scenarios (e.g., U.S. × China or U.S. × Korea) should produce similar results, because divergence within AGM is much smaller than that between AGM and the North American population (Figure 4b,  (Keena et al., 2007). A small percentage of females from backcrosses to EGM voluntarily attempted flight (Keena et al., 2007). Introgression of Asian alleles could thus accelerate the spread of gypsy moth in North America (Keena, Wallner, Grinberg, & Cardé, 2001). In addition, although not evident among our U.S. samples (see STRUCTURE analysis), it is unknown whether hybridization and introgression have contributed to the shift of allele frequencies in the North American population, because such events may not be limited to the Pacific Northwest. Therefore, long-term monitoring is required and immediate eradication of any potential hybrids is critical to prevent the introgression of Asian alleles into North America.

| CON CLUS IONS
We characterized hierarchical genetic differentiation among worldwide populations of Lymantria dispar and its sister species L. umbrosa.
Five major geographic groups were recovered, each of which, except for North America, can be further divided into multiple clusters.
Population structure of gypsy moth is more complicated than previously thought and admixture is pervasive, even at species level. Our results support evolutionary distinctiveness of the introduced North

ACK N OWLED G EM ENTS
We are grateful to Heung-Sik Lee, Alamaz Orozumbekov, Yasutomo Higashiura, Yuri Baranchikov, Tamara Freyman, Vasily Ponomarev, Paul Schaefer, Melody Keena, Milan Zubrik, Frank Kruger, Matthias Kolb, Horst Marohn and Mike South for their assistance in procuring specimens from Asian and Europe. We thank John Molongoski, Nevada Trepanowski, and Marjorie Palmeri for their assistance in laboratory work. We thank Hannah Nadel for morphological identification of specimens. We thank the associated editor and anonymous reviewers for their comments that greatly improved the manuscript.
This study was funded by USDA, APHIS, Agriculture Quarantine Inspection program. JAA was partially supported by the NSF grant DEB-1748389.