Assessing the potential of genotyping‐by‐sequencing‐derived single nucleotide polymorphisms to identify the geographic origins of intercepted gypsy moth (Lymantria dispar) specimens: A proof‐of‐concept study

Forest invasive alien species are a major threat to ecosystem stability and can have enormous economic and social impacts. For this reason, preventing the introduction of Asian gypsy moths (AGM; Lymantria dispar asiatica and L. d. japonica) into North America has been identified as a top priority by North American authorities. The AGM is an important defoliator of a wide variety of hardwood and coniferous trees, displaying a much broader host range and an enhanced dispersal ability relative to the already established European gypsy moth (L. d. dispar). Although molecular assays have been developed to help distinguish gypsy moth subspecies, these tools are not adequate for tracing the geographic origins of AGM samples intercepted on foreign vessels. Yet, this type of information would be very useful in characterizing introduction pathways and would help North American regulatory authorities in preventing introductions. The present proof‐of‐concept study assessed the potential of single nucleotide polymorphism (SNP) markers, obtained through genotyping by sequencing (GBS), to identify the geographic origins of gypsy moth samples. The approach was applied to eight laboratory‐reared gypsy moth populations, whose original stocks came from locations distributed over the entire range of L. dispar, comprising representatives of the three recognized subspecies. The various analyses we performed showed strong differentiation among populations (FST ≥ 0.237), enabling clear distinction of subspecies and geographic variants, while revealing introgression near the geographic boundaries between subspecies. This strong population structure resulted in 100% assignment success of moths to their original population when 2,327 SNPs were used. Although the SNP panels we developed are not immediately applicable to contemporary, natural populations because of distorted allele frequencies in the laboratory‐reared populations we used, our results attest to the potential of genomewide SNP markers as a tool to identify the geographic origins of intercepted gypsy moth samples.

The present proof-of-concept study assessed the potential of single nucleotide polymorphism (SNP) markers, obtained through genotyping by sequencing (GBS), to identify the geographic origins of gypsy moth samples. The approach was applied to eight laboratory-reared gypsy moth populations, whose original stocks came from locations distributed over the entire range of L. dispar, comprising representatives of the three recognized subspecies. The various analyses we performed showed strong differentiation among populations (F ST ≥ 0.237), enabling clear distinction of subspecies and geographic variants, while revealing introgression near the geographic boundaries between subspecies. This strong population structure resulted in 100% assignment success of moths to their original population when 2,327 SNPs were used. Although the SNP panels we developed are not immediately applicable to contemporary, natural populations because of distorted allele frequencies in the laboratory-reared populations we used, our results attest to the potential of genomewide SNP markers as a tool to identify the geographic origins of intercepted gypsy moth samples.

| INTRODUCTION
Invasive alien insects represent a major threat for biodiversity and ecosystem stability, and can have considerable economic and social impacts (Bradshaw et al., 2016;Kenis et al., 2009). The European gypsy moth, Lymantria dispar dispar Linnaeus, is a perfect example of an invasive alien species that is responsible for severe tree growth losses in its new habitat (Bradshaw et al., 2016). Since its accidental introduction from Europe into eastern North America in the late 1860s (Pogue & Schaefer, 2007), the European gypsy moth (EGM) has caused billions of dollars in losses for the forest industry and urban communities, and has required important investments in pest management (Bradshaw et al., 2016). In addition, this moth has altered biodiversity in its new habitat by contributing to a decline in oak populations in eastern North America (Morin & Liebhold, 2016). This severe impact of EGM can be explained by the wide variety of hardwood and coniferous trees defoliated by its larvae (Liebhold et al., 1995) and by periodic population irruptions, leading to outbreaks that cover large areas.
Fortunately, range expansion of the gypsy moth in North America has been limited by the inability of EGM females to fly (Pogue & Schaefer, 2007). Dispersal is accomplished through crawling of caterpillars from tree to tree or larval ballooning, that is, aerial dispersal using silk, and is therefore limited to short distances (<100 m) (Lance & Barbosa, 1982;Nickason, 2001;Weseloh, 1997). Nevertheless, long-distance displacements have occurred repeatedly as a result of accidental transportation of egg masses on firewood, household goods and vehicles (Bigsby, Tobin, & Sills, 2011). Consequently, EGM has spread at a rate of 3 to 29 km/year since its introduction (Tobin, Liebhold, & Anderson Roberts, 2007) and is now considered established on a territory ranging from Quebec to North Carolina on the east coast, and inland to Wisconsin (Tobin, Bai, Eggen, & Leonard, 2012). EGM is currently considered one of the most destructive non-native insects in eastern North America (Aukema et al., 2011).
Two Asian subspecies of L. dispar have been described and both are referred to as "Asian gypsy moth": L. dispar asiatica Vnukovskij, present in continental Asia, and L. dispar japonica Motschulsky, distributed across the Japanese archipelago (Pogue & Schaefer, 2007;Wu et al., 2015). The Asian gypsy moth (AGM) is considered a greater threat than its European counterpart to North America's forests due to the strong flight capability of its females and its broader host range (Pogue & Schaefer, 2007). AGM is not established in North America, but egg masses and adult moths are recurrently intercepted during foreign vessel inspections in North American ports, and occasional escapees have been removed through major eradication campaigns (APHIS-USDA, 2006. Given that both AGM subspecies can interbreed with L. d. dispar (M. Keena, unpublished data), the escape and establishment of AGM in North America could lead to the introgression of traits such as strong female flight capacity and extended host range into North American EGM populations. For example, 16%-33% of the female progeny obtained from crosses between AGM and their North American counterpart are capable of strong flight (Keena, Grinberg, & Wallner, 2007). Thus, the establishment of AGM could result in accelerated spread and more severe outbreaks. In this context, the accurate identification of moths and egg masses intercepted on foreign vessels is critical if we are to avoid AGM establishment and introgression of undesirable traits into North American gypsy moth populations.
Distinguishing moths of each L. dispar subspecies using morphological characters has proven to be a nontrivial task, and gypsy moth egg masses cannot be visually identified at the subspecies level. For these reasons, several molecular diagnostic tools aimed at separating AGM from EGM (both North American and European populations of L. dispar dispar) have been developed (e.g., Bogdanowicz, Schaefer, & Harrison, 2000;deWaard et al., 2010;Garner & Slavicek, 1996;Pfeifer, Humble, Ring, & Grigliatti, 1995), including a qPCR-based suite of assays designed by our group (Stewart et al., 2016). However, molecular tools aimed at identifying the geographic origins of intercepted samples are still lacking; tracing the origins of such samples is critical to negotiations undertaken by Canadian and American regulatory authorities with trading partners to help prevent future introductions.
The advent of next-generation sequencing (NGS) has greatly facilitated the development of diagnostic single nucleotide polymorphism (SNP) markers, including for nonmodel organisms. Such NGS-derived SNPs have also been shown to provide enhanced resolution relative to classical markers (e.g., microsatellites), as in the identification of the geographic origins of individuals (Larson et al., 2014;Puckett & Eggert, 2016). Here we present the results of a proof-of-concept study aimed at assessing the potential of genotyping-by-sequencing (GBS)-derived SNPs to trace the geographic source of gypsy moth samples. In view of the proof-of-concept nature of our work and difficulties in rapidly obtaining fresh gypsy moth samples from the field during periods of low population densities, our study focused on eight laboratory-reared gypsy moth populations whose original stocks came from locations distributed over the entire range of L. dispar, comprising representatives of the three recognized subspecies.
In the present work, the SNPs obtained by GBS (Mascher, Wu, Amand, Stein, & Poland, 2013) were first submitted to both classical and more recent analytical approaches used in the field of population genomics (linkage disequilibrium network analysis, population structure analysis and F ST calculations) to assess their usefulness in discriminating our gypsy moth populations. Then, the success of these SNPs in assigning moths to their original population was evaluated using two different methods: a multivariate approach, discriminant analysis of principal components (DAPC; Jombart, Devillard, & Balloux, 2010) and a Bayesian approach implemented in the Genetic Stock Identification

K E Y W O R D S
genotyping-by-sequencing, invasive species, laboratory populations, Lymantria dispar, population assignment, SNP filtering software gsi_sim (Anderson, Waples, & Kalinowski, 2008). Finally, we evaluated the impact of modifying a SNP filtering parameter (e.g., linkage disequilibrium filtering) on the outcome of our analyses. Although the SNP panels we developed using the aforementioned assignment methods are not immediately applicable to contemporary, wild gypsy moth populations, our results attest to the potential of genomewide SNP markers as a tool to identify the geographic origins of intercepted gypsy moth samples.

| Insect material and DNA extraction
In view of the challenge associated with collecting fresh gypsy moth samples at various locations over this species' vast geographic range, we opted for the use of specimens derived from material collected in the context of earlier studies (Keena et al., 2007;Keena, Côté, Grinberg, & Wallner, 2008;H. Nadel, unpublished data) and subsequently maintained in the laboratory as distinct, population-specific colonies (see Figure 4 for a map showing collection locations). Such samples had the advantage of being readily available and well characterized with respect to female flight capability, two features that weighed heavily in our decision to use this material and in our assessment of its suitability for a proof-of-concept study. This material also provided an opportunity to assess the impact of laboratory rearing, over several generations, on genetic diversity.
The laboratory populations we used were established using 4 to 58 egg masses per population collected in the field 6 to 25 years ago (see Table 1). At each generation in the laboratory, 100 egg masses/ population were chosen at random from all those produced by the previous generation (note: in the laboratory, the gypsy moth has a generation about every 8 months and each mated female lays a single egg mass containing 500-2,000 eggs). From these egg masses, 400 to 500 eggs were randomly selected and reared to the adult stage. Then, ~150 adult pairs/population were formed for mating to produce the next generation. Given that the date of adult emergence from pupae can vary, adults selected to produce eggs for the next generation were sampled to provide an even coverage of emergence dates. Although the populations used in our study had been bred in the laboratory for several generations, they appeared to have maintained their original biological and behavioural attributes. Of course, these insects have unavoidably undergone some adaptation to laboratory rearing, but every effort was made to limit, as much as possible, the loss of genetic diversity relative to the original field samples.
For the work presented here, we took a random sample of 12 specimens (six females and six males; Table 1) from each laboratory population, for a total of 96 moths. The number of individuals we selected from each population may be viewed as an approximation of the average number of gypsy moths caught in pheromone traps during endemic periods (Streifel, 2016); this value was deemed sufficient for the purpose of our study, given that accurate estimates of population differentiation and genetic diversity can be obtained with small samples (4-8 individuals) when many markers are considered (i.e., >1,000 SNPs; Willing, Dreyer, & van Oosterhout, 2012;Nazareno, Bemmels, Dick, & Lohmann, 2017). Specimens were assigned to one of the three recognized L. dispar subspecies (L. d. dispar, L. d. asiatica and L. d. japonica), using the criteria of Pogue and Schaefer (2007

| Genotyping-by-sequencing library construction and sequencing
Prior to library construction, samples were diluted to 20 ng/μl. To prepare a reduced-representation library for sequencing, we used a modified genotyping-by-sequencing (GBS) protocol where two restriction enzymes (PstI/MspI) and a Y-adapter are employed (Mascher et al., 2013). To ensure sufficient read depth, the library was sequenced on three P1v3 chips using HiQ reagents on an Ion Proton sequencer

| Bioinformatic analysis and genotyping
Prior to analysis, read quality was assessed using the FastQC software (Andrews, 2016). SNP calling was performed without a reference genome, using the Universal Network Enabled Analysis Kit (UNEAK) pipeline (Lu et al., 2013) with default parameter settings: minimum tag count c = 5, error tolerance rate in the network filter e = 0.03 and minimum minor allele frequency mnMAF = 0.05. Briefly, UNEAK retains all reads containing a barcode and a restriction enzyme cut site, in addition to being devoid of missing data in the first 64 bp after the barcode. Reads are then clustered into tags (i.e., reads displaying 100% identity), and only tags with c ≥ 5 are retained. Then, networks of tags differing by one bp are built. In these different networks, tags with read counts corresponding to 3% (e = 0.03) or less of read counts from the adjacent tags are considered errors. The edges connecting the "error" tags to "real" tags are then sheared, thus dividing networks into subnetworks or decreasing the number of tags present in networks. At the end of the process, only tag-pair networks are retained as potential SNPs; networks with multiple tags are discarded.

| SNP filtering
Among SNPs identified by UNEAK, we retained those genotyped in ≥80% of individuals and present in at least seven populations of eight Impacts of the different filtering steps on SNP counts are reported in Table 2. Missing data filtering and LD calculations were carried out using VCFtools (Danecek et al., 2011), whereas the other filtering procedures were conducted using an in-house R script. The resulting VCF file was converted to file formats suitable for each subsequent analysis using PGDSpider v2.0.9.2 (Lischer & Excoffier, 2012).

| Linkage disequilibrium network analysis
The analysis of linkage disequilibrium (LD), that is, the nonrandom association of alleles from different loci, can reveal various evolutionary processes in population genomic data sets, including local adaptation and geographic structure (Kemppainen et al., 2015). To explore LD patterns in our data set, we used linkage disequilibrium network analysis (LDna) as implemented in the LDna R package (Kemppainen et al., 2015). This analytical procedure, which identifies groups of loci in high LD, does not require knowledge of locus positions within the genome and is therefore applicable to species without a reference genome. In addition, LDna explores LD across the entire genome, thus generating information about LD among widely scattered loci, which classical LD analysis does not do. Given that LDna is sensitive to missing data, rare alleles and loci displaying heterozygosity > 0.5 (Picq, McMillan, & Puebla, 2016), the analyses were conducted on the filtered data set. Prior to carrying out LDna analysis, LD was measured as the squared pairwise correlation coefficient (r 2 ) between loci using VCFtools (Danecek et al., 2011) and the matrix of pairwise LD values was generated using an in-house R script.

| Detecting outlier SNPs
Single nucleotide polymorphisms with extreme F ST values can greatly affect population differentiation estimates and phylogenetic inferences (Luikart, England, Tallmon, Jordan, & Taberlet, 2003 (q-value is the false discovery rate analogue of the p-value). Second, an F dist approach implemented in Arlequin V3.5 (Excoffier, Hofer, & Foll, 2009) was run using a hierarchical island model with 50,000 simulations, three simulated groups (i.e., three subspecies) and 100 demes per group. Finally, SNPs were identified as outliers when their F ST value was inferior or superior to the 1st and 99th percentile of the F ST simulated distribution, respectively. This outlier detection approach takes into account the population structure that generates an important excess of false-positive outliers when it is ignored (Excoffier et al., 2009). As neutral and outlier markers can reveal different genetic differentiation patterns (Luikart et al., 2003), analyses of population structure, population differentiation and population assignment were run on data sets with and without outlier SNPs.

| Population structure
Population structure was inferred using a model-based method employing a maximum-likelihood approach implemented in admixture v1.3.0 (Alexander, Novembre, & Lange, 2009) and a k-means algorithmic method implemented in the find.clusters function of the adegenet R package (Jombart et al., 2010). admixture uses the same statistical model as structure (Falush, Stephens, & Pritchard, 2003), but runs faster as a result of a new optimized algorithm calculating ancestry. We ran admixture with cross-validation for a number of groups (K) varying from 2 to 10. For each K value, calculations were repeated 10 times, using different random seeds to assess the stability of the estimate. The optimal K was identified as being the one exhibiting the lowest cross-validation error compared to other K values. The number of clusters was also assessed using a k-means method, a clustering algorithm which finds a given number K of groups maximizing the variation between groups. The find.clusters function (adegenet R package) was used to run sequentially the k-means algorithm with an increasing number of clusters K and to determine the optimal number of groups by the Bayesian information criterion method. The function was run several times to assess the stability of the optimal number of groups found. Before running the find.clusters function, missing values present in the data set were replaced by the mean allele frequency calculated for the entire set of individuals.
The extent of pairwise population differentiation was quantified through the computation of the unbiased F ST estimator θ (Weir & Cockerham, 1984). Significance was determined by running 1,000 permutations using GenoDive 2.0b25 (Meirmans & Van Tienderen, 2004) and assessed against an FDR-adjusted p-value to account for multiple testing (Benjamini & Hochberg, 1994). A UPGMA dendrogram based on F ST values was generated using the hclust function in the stats R package. A heatmap organized by subspecies and geographic origins was produced to illustrate population pairwise F ST , and a hierarchical analysis of molecular variance (AMOVA) was computed among populations nested within subspecies groups (Excoffier, Smouse, & Quattro, 1992). Finally, a PCA was conducted on genotypes to summarize the overall variability among individuals. This PCA was computed using the dudi.pca function implemented in the ade4 R package (Dray & Dufour, 2007).

| TaqMan PCR assay vs. genotyping-bysequencing results
Our team recently developed a qPCR-based suite of assays aimed at as on the detection of North American "N" and Asian "A" alleles of the "FS1" nuclear marker (Garner & Slavicek, 1996)

| Moth assignment to population
To perform assignment tests, we first employed discriminant analy- to rank the SNPs and to test the assignment power of these SNPs (Anderson, 2010). The assignment of individuals from the holdout set to a given population was computed using the R function predict.dapc (R package adegenet; Jombart et al., 2010), which is based on the outcome of the DAPC analysis. To assess consistency of the assignment results, calculations were made for 10 different training sets/holdout sets, for which individuals were randomly sampled.
For comparative purposes, we also used a Bayesian assignment approach implemented in the gsi_sim Genetic Stock Identification software (Anderson et al., 2008), available in the assigner R package (function assignment_ngs; Gosselin, 2016

| Genotyping and SNP filtering
We obtained an average of 84 million reads for the three sequencing runs. The UNEAK pipeline identified an average of 411,000 tags (unique sequences) per individual, computed from 2,370,000 reads.
Of the 58,309 SNPs identified by UNEAK, 2,327 SNPs remained after applying the filtering procedure ( presence, in about 50% of them, of an indel within a mononucleotide repetitive region located in an otherwise identical sequence background. When the LD filtering criterion was further constrained to r 2 > 0.80 in at least four populations, the 50% proportion increased to 95%. The UNEAK pipeline allows only one mismatch between sequences to call a SNP. Thus, when a sequencing error produces an indel in both alleles of a locus, the number of mismatch among alleles increases and, as a consequence, UNEAK considers the alleles with an indel as originating from a different locus (Data S1). For each SNP pair highly linked, the SNP presenting less missing data and supported by a greater number of reads was kept. At the end of the filtering process, five individuals (5%) presented a proportion of missing data >30% and were removed from the data set (Table 1).

| Linkage disequilibrium network analyses
We

| Identifying outlier SNPs
Of the 2,327 SNPs remaining after the filtering procedure (

| Population structure
Analysis of the 2,194 neutral SNP data set using admixture enabled the The genetic split between the eight sampled populations was also discerned using DAPC. Once again, a small number of the replicated runs pointed to different optimal K values (7 and 9). For K = 7, F I G U R E 1 Linkage disequilibrium network analysis (LDna) applied to gypsy moth population genomic data. (a) Clustering tree of pairwise LD values from the 2,387 filtered SNPs from eight L. dispar populations (φ = 23, |E|min = 3). Branches corresponding to single-outlier clusters (SOCs) are highlighted in blue and labelled with their individual designation (e.g., in 315_0.65, 315 = cluster number and 0.65 = the LD threshold at which the SOC merged with another cluster). All identified SOCs result from population structuring and the population associated with each SOC is indicated on the right of the tree (see Table 1 (Figure 2b). For K = 9, the moth UC7M, identified by admixture as having mixed ancestry, formed here a group by itself (data not shown).
We also ran admixture and DAPC analyses on outlier SNPs only and on both neutral and outlier SNPs. In all cases, the number of populations detected was also k = 8, corresponding to the populations studied (Data S5.1).

| Impact of the absence of LD filtering
The presence of high-LD SNPs did not significantly affect the results of our outlier SNP detection procedure. When Arlequin and BayeScan were run several times on data sets with and without high-LD SNPs,

| TaqMan PCR assay vs. genotyping-bysequencing (GBS) results
Using the COI-based TaqMan assay of Stewart et al. (2016), moths from Connecticut (UC), Greece (KG) and Lithuania (LJ) were assigned to the L. dispar dispar subspecies (Table 3), in line with results of analyses based on genomewide SNPs (e.g., Figures 2 and 4). However, the same assay identified the central Russian population (RB) as L. dispar dispar whereas analysis of GBS-derived SNPs strongly suggested it belonged to the L. dispar asiatica subspecies (see Figures 2 and 4).
For an independent assessment of the occurrence of Asian introgression into EGM, we used the FS1 nuclear marker assay of Stewart et al. (2016), which detects the presence of North American "N" and Asian "A" FS1 alleles in unknown samples. The A allele was detected in all four populations examined (Table 3) and its frequency increased from west to east (20%-100%). In the Greek (KG) and Lithuanian (LJ) populations, the A allele was the major allele (55% and 87.5%, respectively), flagging these two European populations as displaying significant Asian introgression. By contrast, analysis of genomewide SNPs did not detect Asian introgression in the Connecticut (UC) and Greek (KG) populations, while the Lithuanian (LJ) displayed moderate Asian introgression, but less than what the FS1 genotype might suggest. The UC7M individual found to display Asian admixture using GBS-derived SNPs was heterozygous (A/N) for the FS1 marker, as were three other specimens from the Connecticut population. Earlier FS1 genotyping of this population revealed a 20% proportion of A/N genotype (Keena et al., 2008).

| Moth assignment to population
With the exception of results obtained for the 12-SNP panel, the assignment success of individuals to their respective population was high (≥86.25%), irrespective of the number of SNPs and method used ( Figure 5a); however, the mean assignment success to population was low (51.25%) for the 12-SNP panel, following computation using the gsi_sim method. An assignment success of 100% was obtained for the 48-SNP panel using the gsi_sim method while the same level of success required the full SNP data set when using DAPC. However, the DAPC method outperformed gsi_sim for panels containing fewer than 48 SNPs (Figure 5a).
We conducted a DAPC using priors corresponding to populations characterized earlier as having either flightless or flight-capable females. This analysis yielded two SNPs whose allele frequency correlated with the presence of flying females in these populations ( Figure 5b). Only one individual in each of the UC and LJ populations had a discordant haplotype for TP59129. For TP7787, one or two discordant haplotypes were observed for the LJ and RB populations, respectively. A blastx search against the NCBI nonredundant database, using the reads bearing these SNPs as query, failed to produce significant matches to known proteins.

| DISCUSSION
The aim of the present proof-of-concept study was to assess the feasibility of using GBS-derived SNPs to identify source populations of gypsy moth samples. The success of such a strategy is heavily  (Cornuet, Piry, Luikart, Estoup, & Solignac, 1999). For example, for populations that are relatively well differentiated (e.g., F ST = 0.11), such as European cattle breeds, the correct assignment of an individual to its source breed can reach 85% with only 90 SNPs (Negrini et al., 2009). By contrast, for populations of the American lobster, which display minimal genetic differentiation (F ST = 0.002), assignment success has been shown to plateau at 31%, using as many as 10,156 SNPs (Benestan et al., 2016). The different analyses conducted in the present study revealed strong differentiation among the eight gypsy moth populations examined (minimum F ST value = 0.237). As a consequence, we obtained 100% assignment success using all 2,327 SNPs (neutral and outlier), irrespective of the assignment method employed (i.e., DAPC or gsi_sim). In addition, assignment success remained generally high (86%-100%) using fewer SNPs (12,24,48 and 96), although the Bayesian method generated a lower score (51%) for the 12-SNP panel (Figure 5a). The two assign-  (Allendorf, Luikart, & Aitken, 2012). A mean pairwise F ST value of 0.284 was reported by Keena et al. (2008) for six of the populations used in the present study (see Table 1), a value computed one generation after initial field collection (see Table 5 in Keena et al. (2008) Keena et al. (2008) and Wu et al. (2015), respectively. It is worth pointing out that with twice-lower F ST values, the correct assignment of a bovine to its source breed reached 85% using 90 SNPs (Negrini et al., 2009). Thus, the high degree of genetic differentiation observed in natural gypsy moth populations suggests that an assignment procedure based on GBS-derived SNPs applied to intercepted moths has a high chance of success.
With respect to gypsy moth population structure, our study supports conclusions made earlier by other workers (Keena et al., 2008;Wu et al., 2015), including a clear delineation of the three subspe- using six markers, four of which were also used by Wu et al. (2015).
Interestingly, the status of the North American population varied as a function of the markers used in the analyses. For example, when a mitochondrial marker was included, the North American population was deemed distinct from European L. dispar populations, whereas it formed a single group with the European moths when only nuclear markers were considered (Keena et al., 2008). It is difficult to say whether this outcome is due to cyto-nuclear discordance as only one mitochondrial marker was used in this particular instance. However, this comparison illustrates the fact that when only a few markers are considered, some patterns of population structure can be overemphasized due to a very low proportion of the genome being sampled.
Future work based on genomewide SNPs and a larger sample of contemporary North American and European specimens should help clarify the extent to which the North American population is genetically distinct from its European progenitor.
Sample size (10-12 moths/population) is another feature of our study design that had the potential of biasing our evaluation of population genetic differentiation and genetic diversity. Although they could be considered relatively small, our sample sizes were initially deemed sufficient to achieve our objectives given that the large number of SNPs generated by high-throughput sequencing tends to lessen the requirement for large sample sizes (Jeffries et al., 2016). For example, a simulation study revealed that accurate estimates of population differentiation could be obtained from small sample sizes (n = 4-6) if a large number of markers were considered (>1,000 SNPs; Willing et al., 2012). An empirical study on an Amazonian plant species provided support for this assessment, generating accurate estimates of F ST using ≥1,500 SNPs and ≥8 individuals per population (Nazareno et al., 2017). The results of these studies suggest that the genetic diversity estimates reported here (F ST , F IS , etc.) are reliable as they are based on 10-12 individuals/population and 2,327 SNPs.
For comparative purposes, we assessed the nuclear FS1 genotype of all L. d. dispar moths included in our study, using the TaqMan assay of Stewart et al. (2016), to assess the occurrence of Asian introgression into these populations, considered here to be European gypsy moths on the basis of their mitochondrial COI haplotype (Table 3) Although female flight is believed to be a trait observed uniquely in Asian gypsy moth subspecies, studies on the world distribution of this trait have revealed that L. d. dispar populations from the northeastern parts of Europe possess gliding and flight-capable females (Keena et al., 2008;Reineke & Zebitz, 1998). This observation raises regulatory concern as 20% of merchandize imported on European vessels originates from north-eastern Europe, and currently vessels from this part of Europe are not subject to phytosanitary regulation for gypsy moth (CFIA, 2014). In our data set, we identified two high F ST outlier markers that enable separation of moths of the North American and Western Europe populations with flightless female (UC, KG) from those of north-eastern Europe (LJ) and Asian populations with flightcapable female. The sequences containing these SNPs did not reveal significant matches to known proteins, so future work will aim to determine whether these two SNPs are present in genes or genomic regions linked to female flight capacity or reveal population structure due to other selected traits. Although the usefulness of these two markers as predictors of female flight capability appears limited, their identification here suggests that research aimed at localizing the genomic determinants of flight capacity using other genomewide SNPbased approaches (e.g., QTL, GWAS) will likely be rewarding.
High linkage disequilibrium (LD) between markers can introduce bias in analyses requiring independence of loci, such as outlier SNP detection or model-based methods to characterize population structure.
For example, LD between markers in close proximity in a genome ("background LD" or BLD) could cause overestimation of population divergence and incorrect estimation of population admixture using the STRUCTURE software (Falush et al., 2003). Given the uncertainty about the type of bias caused by SNPs in high LD in genomic analyses (Pérez-Figueroa et al., 2010), the presence of some high-LD SNPs in our data set provided an opportunity to assess the influence of these SNPs on the outcome of our outlier identification and population structure analyses. Interestingly, the presence of high-LD SNPs failed to cause significant variation in either the outlier SNP identification results or in population structure characterization. This absence of effect may be due to the low proportion of high-LD SNPs in our data set (1.5%) and their assumed random distribution along the genome.
Indeed, in such situations, model-based methods used to evaluate population structure seem to perform reasonably well, irrespective of the presence of some high-LD SNPs in the data set (Falush et al., 2003).
In conclusion, the present proof-of-concept study demonstrates the power of a genomewide SNP-based approach to correctly assign gypsy moths to their source populations. Our work also shows that when populations are well differentiated (F ST ≥ 0.237), high assignment success (>81.88%) can be achieved using as few as 24 SNPs.
However, because our analyses are based on laboratory-reared populations, the SNP panels we developed are not immediately applicable to wild gypsy moth populations. We are now setting out to repeat the present work using a large panel (>50) of contemporary, natural populations, with numerous individuals per population (30-40), which will enable an accurate assessment of population allele frequencies.
Because natural gypsy moth populations are expected to be less well differentiated than the populations examined here, we will likely need more SNPs for successful discrimination of populations. In this respect, the foreseeable availability of a gypsy moth reference genome to map GBS reads should greatly increase the number of SNPs available for such discrimination. In addition, recently commercialized tools for target enrichment (e.g., AmpliSeq ™ ) are expected to provide a high degree of latitude with respect to the number of SNPs that can be selected for rapid sequencing in a single run. Altogether, the above considerations offer great hope for the development of a molecular tool capable of accurately assigning intercepted gypsy moths to their source populations.