RAPTURE (RAD capture) panel facilitates analyses characterizing sea lamprey reproductive ecology and movement dynamics

Abstract Genomic tools are lacking for invasive and native populations of sea lamprey (Petromyzon marinus). Our objective was to discover single nucleotide polymorphism (SNP) loci to conduct pedigree analyses to quantify reproductive contributions of adult sea lampreys and dispersion of sibling larval sea lampreys of different ages in Great Lakes tributaries. Additional applications of data were explored using additional geographically expansive samples. We used restriction site‐associated DNA sequencing (RAD‐Seq) to discover genetic variation in Duffins Creek (DC), Ontario, Canada, and the St. Clair River (SCR), Michigan, USA. We subsequently developed RAD capture baits to genotype 3,446 RAD loci that contained 11,970 SNPs. Based on RAD capture assays, estimates of variance in SNP allele frequency among five Great Lakes tributary populations (mean F ST 0.008; range 0.00–0.018) were concordant with previous microsatellite‐based studies; however, outlier loci were identified that contributed substantially to spatial population genetic structure. At finer scales within streams, simulations indicated that accuracy in genetic pedigree reconstruction was high when 200 or 500 independent loci were used, even in situations of high spawner abundance (e.g., 1,000 adults). Based on empirical collections of larval sea lamprey genotypes, we found that age‐1 and age‐2 families of full and half‐siblings were widely but nonrandomly distributed within stream reaches sampled. Using the genomic scale set of SNP loci developed in this study, biologists can rapidly genotype sea lamprey in non‐native and native ranges to investigate questions pertaining to population structuring and reproductive ecology at previously unattainable scales.


| INTRODUC TI ON
Over the past decade, several technological and methodological advances have dramatically increased the number of loci available to study a range of ecological and evolutionary questions. The additional information gained has substantially increased the precision and accuracy of population genetic inference (e.g., estimation of relatedness, inbreeding, spatial structure, Barth et al., 2017;Kardos, Luikart, & Allendorf, 2015;Kardos, Taylor, Ellegren, Luikart, & Allendorf, 2016) and enabled locus-specific analyses (e.g., outlier tests, association analyses; Allendorf, Hohenlohe, & Luikart, 2010;Luikart, England, Tallmon, Jordon, & Taberlet, 2003). Research and monitoring of nonmodel organisms that employ expansive suites of loci are increasingly used in combination with demographic data over spatial and temporal scales previously not possible (e.g., Waples et al., 2018).
Sea lamprey (Petromyzon marinus) are an anadromous fish native to the Atlantic Ocean and can be found in North American and European waters (Beamish, 1980). Currently, several sea lamprey populations in Europe are considered threatened (Mateus, Rodríguez-Muñoz, Quintella, Alves, & Almeida, 2012). Following the opening of the Welland Canal in 1829 (Morman, Cuddy, & Rugen, 1980), sea lamprey and many other non-native species invaded the North American Great Lakes, and by 1946, sea lamprey had invaded all five Great Lakes (Christie, 1974;Lawrie, 1970). The introduction of sea lamprey into the Great Lakes exposed large piscivorous native fish species to a parasitic predator and led to catastrophic declines in abundance of ecologically and economically important species, including the top predators (Smith & Tibbles, 1980). An extensive control program, conducted since the late 1950s, has generated considerable information about the species' life cycle and ecology (Applegate, 1950;Hansen et al., 2016). Like other anadromous fishes, adult sea lamprey spawn in streams. Filter-feeding larvae typically live for two to five years before transforming into parasitic juveniles and migrating to the sea or lake in search of a host. In the Great Lakes, parasitic sea lampreys typically spend 12-18 months in the lake before migrating back into streams to spawn. Unlike many other anadromous fishes, sea lampreys do not home to natal streams (Bergstedt & Seelye, 1995;Waldman, Grunwald, Roy, & Wirgin, 2004) but use chemical cues (pheromones) from larval lampreys to identify streams with suitable spawning and rearing habitat (Bjerselius et al., 2000;Sorensen & Vrieze, 2003), resulting in highly panmictic populations within lakes.
In the Great Lakes, sea lamprey control has been largely dependent on annual lampricide treatments that kill larvae in streams and dams that prevent adults from accessing spawning habitat (Applegate, Howell, Moffett, & Smith, 1961;Christie et al., 2003).
The status of sea lamprey populations and success of the control program as whole are determined largely from mark-recapture methods to estimate stream-specific adult sea lamprey abundance from trap catches (Mullett et al., 2003) and prey wounding estimates from native lake trout (Salvelinus namaycush) populations (Rutter & Bence, 2003). Though sea lamprey populations are currently below historic (postinvasion) levels, abundance and prey wounding estimates have exceeded target levels in some lakes in recent years (Sullivan, Adair, & Woldt, 2016). Annual assessments in several larger Great Lakes tributaries and connecting waterways have revealed that sizable numbers of outmigrating juveniles have been produced (Sullivan et al., 2016), yet the sources of those lampreys are unknown. Sea lamprey demographic data, specifically adult abundance in untrappable streams and stream-specific recruitment, are needed to track the status of these populations. Capture-mark-recapture methods employing traps are often used to estimate the number of adults (Mullett et al., 2003), but trapping in large nonwadable rivers is not feasible with conventional methods due to the rivers' large width, length, depth, and flow. Genetic data and genetic pedigree reconstruction methods may be a valuable alternative approach for estimating adult sea lamprey abundance in Great Lakes tributaries, as has been demonstrated with other fishes in large systems (e.g., Abadia-Cardoso, Anderson, Pearse, & Garza, 2013;Bravington, Eveson, Grewe, & Davies, 2017;Steele et al., 2013). Studies of native lamprey in support of conservation initiatives involving parentage analysis include Pacific lamprey (Lampetra tridentata; Hess et al., 2015) and European River lamprey (Lampetra fluviatilis; Rougemont et al., 2015;Hume, Rechnagel, Bean, Adams, & Mable, 2018).
Genomic scale tools are lacking for invasive and native populations of sea lamprey throughout their range. Current genetic assays that are routinely employed for sea lamprey in the Great Lakes (e.g., microsatellite DNA genotyping; Bryan, Libants, Warrillow, Li & Scribner, 2003, Filcek, Gilmore, Jones, & Scribner, 2005 or mtDNA; Waldman et al., 2004) are not adequate to meet management needs because they lack sufficient power to answer high priority questions (e.g., SLCB, 2016). For example, existing data and the number of genetic markers available can be used for sea lamprey on small spatial scales or experimentally (e.g., Dawson, Jones, Scribner, & Gilmore, 2009;Derosier, Jones, & Scribner, 2007) but are insufficient to accurately determine parentage or pedigree relationships in large Great Lakes tributary systems where large numbers (1,000s or 10,000s) of adults could be spawning. However, others have shown that similar inferences are possible using genomic data (e.g., Baetscher, Clemento, Ng, Anderson, & Garza, 2017;Strucken et al., 2016).
A number of molecular techniques now allow targeted sequencing of hundreds or thousands of loci and samples at relatively low cost. A spatially referenced and standardized Great Lakes sea lamprey genomics database that ties larval pedigrees to locations could assist managers in achieving monitoring and sea lamprey suppression goals for invasive Great Lakes populations in North America. Applications for invasive populations include improved larval assessment, enumeration of spawning adult abundance, identification of genes associated with physiological pathways under selection, and use of genetic control methods (McCauley, Docker, Whyard, & Li, 2015). Applications for conservation of anadromous populations in Europe and the Atlantic coastal populations of North America include investigations of and characterization of effective population size (Waples, 2010;Waples, Larson, & Waples, 2016), the study of local adaptation (Savolainen, Lascoux, & Merila, 2013) important for reintroductions, or simply be used in the context of genetic monitoring of reintroductions (Sard et al., 2016). Analyses could be further enabled by the existence of an annotated sea lamprey genome (Smith et al., 2018).
Protocols that combine targeted sequence capture with restriction site-associated DNA sequencing (Ali et al., 2016;Hoffberg et al., 2016) and highly multiplexed amplicon sequencing protocols (GT-Seq;Campbell et al., 2015) are beginning to be widely used for conservation genomic applications (Meek & Larson, 2019). Greater understanding of sea lamprey mating ecology and movement dynamics of sibling groups prior to metamorphosis could greatly aid sea lamprey control efforts. Thus, the overarching project objective of this study was to develop a panel of SNPs for sea lamprey and to apply the multilocus SNP panel to conduct genetic pedigree analyses of larval sea lampreys to characterize aspects of the species mating ecology and dispersal behavior. Specific objectives were to (a) estimate larval membership to half-and full-sibling family groups and the number of adult sea lampreys that produced larvae captured in two Great Lakes tributaries representing extreme ranges in size and sampling accessibility, (b) determine sea lamprey spawning locations, and range of downstream larval dispersal in the St. Clair River based on locations of siblings, and (c) characterize levels of spatial genetic structure among Great Lakes tributaries. staff, respectively, in May and June of 2017 during routine assessment sampling. Samples were collected by applying granular bayluscide to nonwadable waters (SCR; 2.6 hectare area) which allowed larvae to be collected downstream in drift nets, or using ABP-2 backpack electrofishers in wadable waters (DC).Sampling locations in DC were chosen so that the spatial extent of the larval sea lamprey distribution was well represented (see map in Figure 1). All individuals were retained irrespective of size/age, and therefore, because of the area sampled, samples were expected to include a broad range of ages and family groups. Samples (whole body) were stored in 95% ethanol and delivered to the laboratory at Michigan State University (MSU). In the laboratory, we measured body length for each individual from DC and used a mixture of four Gaussian distributions to group larval body lengths into age 0, age 1, and age 2, age 3+ years (Sethi, Gerken, & Ashline, 2017). We used the DC length-frequency distributions to age larvae from the SCR because of small sample size (Table 1). To facilitate exploration of other analyses focusing on spatial genetic structure, samples were also collected from widely dispersed spawning adult sea lampreys from Great Lakes tributaries by USFWS staff (Brule River in the Lake Superior basin [n = 25], Carp River in the Lake Michigan basin [n = 15], St. Mary's River in the Lake Huron basin [n = 25], Table 1 and Figure 1).

| Morphological and genetic verification of species identity
Field crews distinguished larval sea lampreys from larvae belonging to native lamprey species (Ichthyomyzon spp., Lampetra appendix) using morphometric features including trunk myomere counts and pigment location and density. We also used a genetic method to verify species identity. DNA was extracted from 288 larval sea lampreys collected in the SCR (n = 37, Table 1) and DC (n = 251, Table 1) using Qiagen DNeasy Kits (Qiagen, Valencia, CA) to validate field species identification using sea lamprey specific cytochrome c oxidase subunit I (COI) single-species DNA barcoding PCR primers (Gingera et al., 2016). At the time DNA was extracted from samples, body weight and total length were recorded. PCRs were carried out as described in Gingera et al. (2016) and contained 20 ng of DNA at a concentration of 10 ng/µl and 23 µl of master mix (2.5 µl of Amplitaq Gold PCR Buffer II-no Mg+2, 1.5 µl of 25 mM MgCl 2 , 2.5 µl of 2 mM dNTPs, 0.5 µl each of the sea lamprey forward (5′-GGCAACTGACTTGTACCMCTAATACTTGGT-3′) and reverse (5′-GGCTAAGTGTAAGGAAAAGATTGTTAGGTCGAC-3′) primers at 10 pmol/µl, 15.37 µl of diH2O and 0.13 µl of Amplitaq Gold (Applied Biosystems) at 5 U/µl) for a total volume of 25 µl. Cycling conditions were as follows: an initial denaturation step of 5 min at 95°C, followed by 35 cycles of 30 s at 95°C, 30 s at 55°C and 30 s at 72°C, and a final extension step of 5 min at 72°C. PCR products (expected size of 225 bp) were run on 1.5% agarose gels along with a 100 bp DNA ladder (Invitrogen) and stained with ethidium bromide for visualization. All PCRs were conducted included negative (no DNA) and positive (known DNA from adult sea lamprey) controls.

| Genomic library preparation
2.3.1 | Sample preparation and locus discovery using RAD sequencing DNA samples extracted from larval sea lamprey from DC and SCR were used to construct RAD libraries used for SNP discovery.
Double-stranded DNA concentrations were determined using Quant-iT PicoGreen assays (Thermo Scientific), and samples were normalized to a concentration of 10 ng/µl before proceeding with RAD library preparation and sequencing.
Restriction site-associated DNA (RAD) libraries were prepared following the "BestRAD" protocol described in Ali et al. (2016). After normalizing DNA concentrations, DNA was digested using high fidelity SbfI (New England Biolabs). After 60 min of digestion at 37°C and F I G U R E 1 Map showing locations of Great Lakes tributaries where sea lamprey were collected (a), including larvae in the St. Clair River (SCR), Michigan, USA (b), and in Duffins Creek (DC), Ontario, Canada (c). Larval sampling locations in SCR and DC are indicated by dots. Adults were sampled in the Brule River, St. Mary's River, and Carp Lake Outlet. Locations in the St. Clair were not numbered in the map, like Duffins Creek, because no siblings were identified, and thus, similar analyses (e.g., Figure 7)  to two. We preformed de novo assembly first, as recommended by Paris, Stevens, and Catchen (2017). When forming the catalog, we used 15 individuals from each population that had the highest number of total reads. We allowed one mismatch between observed and expected barcodes and restriction site sequences during pro-cess_radtags. In addition, reads with uncalled bases or low-quality reads (default STACKS value: average phred score of 10 in a 22 bp sliding window across the read) were removed. PCR duplicates were removed using clone_filter to avoid inflated coverages that could lead to incorrect genotype calling (Andrews, 2014

| RAD capture methods
Using the SNP discovery dataset described above, we designed RNA baits needed to apply the targeted RAD capture (RAPTURE) sequencing approach described by Ali et al. (2016). Consensus sequences for variable RAD loci identified above (output by the STACKs "populations" module) were aligned to the sea lamprey genome (Smith et al., 2018) using BWA-mem (Li, 2013). RAD loci mapping to multiple locations were excluded from the dataset using SAMtools (Li et al., 2009). Mapping coordinates for the remaining loci were lengthened by 100 base pairs using bedtools slopBed, and extended sequences were extracted from the reference using bedtools getFasta (Quinlan & Hal, 2010). Initial bait QC found that sequences had high GC content on average, so we opted to design two 80 bp baits offset by 20bp for each variable locus. These sequences were input into Arbor Biosciences' complementary bait design pipeline. In order to maintain a bait, we required 0 off-target hits with a T m > 60°C, fewer than 10 off-target hits with Tm between 62.5 and 65°C, and at most 1 off-target hits with T m > 65°C. We also required 0 alignments to the sea lamprey mitome, dG >= −15, 0 heterodimers with other baits, and less than 25% repeat masking. Two 80 bp sequences were selected for each variable RAD locus (n = 3,764 loci).
RAD capture libraries were prepared using the library prepara-

| RAPTURE bioinformatics
Quality of RAPTURE sequencing data was initially inspected using FastQC (Andrews, 2014). Reads in the two files were exchanged whenever the barcode was found at the start of read 2 using a custom perl script (bRAD_flip_trim.pl), reads were demultiplexed using process_radtags (--inline_null -e sbfI --barcode_dist_1 1 --re-tain_header), and PCR duplicates were removed using clone_filter (Catchen et al., 2011). Adapter sequences were removed from the end of reads using Trimmomatic, and reads were trimmed whenever the mean base quality across a sliding window of four bases dropped below Q15 (Bolger, Lohse, & Usadel, 2014). Read pairs were removed from the dataset if one or both reads were less than 50 bp after trimming. To create a reference to map remaining reads to, we merged the sea lamprey reference genome (Smith et al., 2018; GenBank: PIZI00000000.1) and mitome (Lee & Kocher, 1995; GenBank: U11880.1), resulting in a fasta file that was normalized using Picard NormalizeFasta (http://broad insti tute.github.io/picard).
Read mapping was performed using BWA-mem with default settings (Li, 2013). Mapped bam files were filtered using SAMtools view in order to exclude reads with mapping qualities <30 and reads not mapping in proper pairs. SAMtools was also used to sort and index filtered bam files (Li et al., 2009).
Variants were detected, and genotypes were called using gstacks (Catchen et al., 2011;Maruki & Lynch, 2017) and FreeBayes (Garrison & Marth, 2012). Gstacks was run using default parameters, and FreeBayes was run without applying population or binomial observation priors, an expected cross-contamination rate of 1%, and a minimum base quality score of 20. Variants were only reported whether the alternate allele was supported by greater than 2 reads and whether the total depth across all individuals was greater than 1000XBiallelic SNPs that were detected by both programs and overlapped targeted RAD loci were extracted from the FreeBayes VCF file using VCFtools (Danecek et al., 2011). VCFtools was also used to remove individuals with greater than 75% missing data and set genotypes to missing whether the genotype quality (GQ) was less than 10. We also required genotypes at a site to have a minimum mean depth of 10× and no more than 30% missing genotypes. Sequencing characteristics of the RAPTURE panel, including allele read balance and mean total depth for each variable position, are reported as part of an R Markdown document on the project's GitHub repository.

| Estimation of population diversity and differentiation using RAD capture panel
Summary statistics describing levels of inter-and intrapopulation diversity were calculated using genotypes called by FreeBayes from the RAD capture dataset. Before calculating summary statistics, DC samples were reduced to include only a single individual from each full-sibling family group based on pedigree reconstructions described below. No SCR larvae pairs were full siblings so all individuals were used. Individuals with greater than 50% missing data were also removed at this point, leaving 140 individuals for analyses.
When attempting to establish pedigree relationships, population measures of genetic diversity are important to the accuracy of inferred interindividual relationships (Blouin, 2003). Specifically, biallelic SNPs with high minor allele frequencies are expected to be most informative for determining kinship (Krawczak, 1999).
Genetic diversity measures including observed heterozygosity (H O ), expected heterozygosity (H E ), Wright's inbreeding coefficient (F IS ), and minor allele frequencies (MAFs) were calculated for each locuspopulation combination. MAFs were calculated using the "minorAllele" function of the R package adegenet (Jombart, 2008;Jombart & Ahmed, 2011), and H O , H E , and F IS values were calculated using the "basic.stats" function of the R package hierfstat (Goudet, 2004).
Distributions of MAF and F IS were plotted using the R package gg-plot2 (Wickham, 2016). RAD loci were plotted on scaffolds using the R function "preparegenome plot" from the quantsmooth package (Oosting, Eilers, & Menezes, 2018).
Given that the SNP discovery library was constructed from individuals sampled from two populations, we were interested to document whether the SNP loci were polymorphic in other Great Lake tributaries and basins, and to document levels of interpopulation variance in allele frequency. Pairwise estimates of F ST (Nei, 1973) were estimated among sampling sites using the "pairwise_Gst_Nei" and global F ST was estimated using "diff_stats," both from the mmod R package (Winter, 2012). Differentiation statistics were estimated using the 140 individual downsampled dataset (Table 1).
Discriminant analysis of principal components (DAPC; Jombart, Devillard, & Balloux, 2010) was used to examine patterns of interpopulation spatial genetic structure. DAPC provides a multivariate clustering approach that summarizes genetic data into a user-defined number of principal components before performing discriminant analysis on those retained principal components. This strategy maximizes variation among groups while minimizing variance within groups, providing a means of detecting genetic structure (Jombart et al., 2010). Contributions of individual SNP loci to the DAPC can be used to determine which loci most strongly influence the observed patterns (Jombart et al., 2010).Locus loadings have different ranges along each axis and were therefore transformed and presented as percentile ranks to allow for comparisons. DAPC was performed using the adegenet R package using the downsampled dataset.

| Detecting outlier loci
Genotypes for thousands of SNPs allowed analyses to ascertain whether some loci appear to be acting in a non-neutral manner and/ or explain a disproportional amount of variation among population samples. We used the R package OutFLANK (Whitlock & Lotterhos, 2015) to detect loci potentially experiencing selection. OutFLANK uses a trimmed distribution of locus-specific F ST values to infer a distribution of neutral markers. Loci with significant deviations based on this expected distribution are considered outliers and possibly subject to selection. We used the default values of trimming 5% of loci from each tail of the overall F ST distribution and a 0.10 false discovery rate (Benjamini & Hochberg, 1995) to reduce the likelihood of false-positive detections. Outlier analyses were also conducted using the dataset that was downsampled to include a single individual from each full-sibling family.

| Assessment of power for larval sea lamprey pedigree reconstruction
The power to correctly infer full-and half-sibling relationships among larval offspring when no adult lamprey were genotyped was assessed using simulated pedigrees analyzed using the genetic pedigree reconstruction assignment program COLONY (Jones & Wang, 2010;Wang, 2004). Each simulation began by creating a breeding matrix where the total number of parents (N parents ) was assumed to be 10, 100, or 1,000. These numbers were chosen to assess assignment power over a broad range of possible breeding scenarios and spawning adult abundance. For each simulation, the size of the breeding matrix (N females × N males ) was allowed to vary such that the N males :N females sex ratio was sampled from a uniform distribution (range: 1-2, Hansen et al., 2016). For i in 1 to N females columns in the matrix, the number of males (N mates ) that mated with the ith female was randomly drawn from a Poisson distribution (λ = 3), given that evidence for polygyny and polyandry was established by Gilmore (2004). The number of eggs fertilized (N offspring ) for the ith female was sampled from a uniform distribution (range: 25,000-1,00,000, Manion & Hanson, 1980). For each female, N offspring were nonrandomly distributed among mated males following Equation (1). The random process of identifying mates per female sometimes resulted in a limited number of males not mating with any females. To ensure the sex ratio remained relatively unaltered, we repeated the above process with any unmated males, except that N mates were females rather than males.
Thus, each simulation creates a polygamous mating structure, not just a polyandrous mating structure because multiple matings, from the perspective of males or females, occurred by chance. As noted above, the number of mates was drawn from a Poisson distribution, but the actual individuals mated were chosen randomly. The mean number of mates per sex was driven by the sex ratio (ranges from 1 to 2 males per female, drawn randomly) and the size of the breeding matrix (containing 10, 100, or 1,000 total adults). In the simulations, the average number of mates for a female was ~3 (range 2.7-3.2 across breeding matrices simulated) and for a male, it was ~2 (range: 1.5-2.2 across breeding matrices simulated).
After the breeding matrix was created, 100 offspring were randomly sampled from the total number produced by the breeding matrix. Each set of N parents simulations was evaluated at 100, 200, or 500 independent loci (N loci ), where the allele frequencies for N loci were randomly drawn from the observed allele frequencies based on the original RAD discovery dataset. The above information was provided to the simulation module within COLONY (Wang, 2013) using in-house wrapper scripts in R (version 3.5, R Core Team, 2018) to generate 100 independent simulated datasets to be run in COLONY (available in the GitHub repository). Thus, a total of 900 independent simulated datasets were evaluated across the nine different combinations of N parents and N loci .
When analyzing the simulated datasets in COLONY, we assumed a polygamous mating system and no sibship prior was used. Each genetic pedigree was inferred using the full-likelihood method, with high precision and a "long" run. We used the Linux version of COLONY (colony2s.ifort.out, version 2.06) to infer full-and half-sibling relationships for each independent simulation. Analyses were conducted in parallel on MSU iCER HPCC. For each simulation, the "Best Configuration" genetic pedigree was analyzed to assess power. Importantly, known parental genotypes were not provided to COLONY when the program was reconstructing the pedigrees, but that the parents that actually produced any given offspring were known, as their names were embedded in the names of their respective offspring. In addition, these data require comparisons between inferred and known dyadic (pairwise) relationships to assess error and accuracy (below). That is, the inferred (via COLONY) and known (based on simulation information embedded in offspring names) familial relationship was determined by counting how many parents the two individuals shared: Full siblings share both parents, half-siblings share one parent, and unrelated dyads share no parents. Using this information, accuracy, the false-positive rate and the false-negative rate can be calculated for each simulation. Here, accuracy is defined as the proportion of inferred dyads (full-sibling, half-sibling, or unrelated dyads) that were correctly assigned. The false-positive rate is 1-accuracy (i.e., the proportion of inferred dyads (full-sibling, half-sibling, or unrelated dyads) that were incorrectly assigned).
Given the direct relationship between accuracy and the false-positive rate, we report only accuracy. The false-negative rate is defined as the proportion of known dyads (full-sibling, half-sibling, or unrelated dyads) that were incorrectly inferred. The expected distributions of accuracy and false-negative rates for full-sibling, half-sibling, and unrelated dyads across 10, 100, or 1,000 parents genotyped at 100, 200, or 500 loci are presented below. Note that COLONY does provide probabilities that any given dyad in the genetic pedigree reconstructed is true, but no user-defined cutoff was chosen because we implemented the full-likelihood method of COLONY, which effectively balances the false-positive and false-negative rates throughout the dataset (Jones & Wang, 2010;Wang, 2004).
Additional details, including step-by-step instructions, are available in the project's GitHub repository.

| Pedigree reconstruction
We inferred full-and half-sibling relationships for each age cohort of larval sea lampreys sampled within DC and the SCR separately in COLONY. Larval age was estimated using length-frequency distributions of larvae sampled in each stream (as described above). The above simulations assumed that loci were independent of each other.
Thus, we selected loci for genetic pedigree reconstruction that were two megabases apart on the same chromosome (i.e., independent, as determined from Smith et al., 2018) because the RAPTURE panel contained multiple loci per chromosome (see below). Loci chosen were genotyped in >80% of individuals (97% ± 3%, mean ± 1SD) and had high minor allele frequencies (0.20 ± 0.20, median: 0.12). The number of loci per chromosome varied based on chromosome size (range: 1-14, median = 1). The above filtering identified 454 independent loci to be used in genetic pedigree reconstruction, which is comparable to the simulations based on 500 loci (see above). All other COLONY input parameters were the same as described for the simulations above.
We also estimated the effective number of breeders (N b ) that produced each larval sample using the sibship method based on pedigree (1) N offspring × N mates ,N mates − 1, … ,1 sum 1,2, … ,N mates membership to full-and half-sibling groups, as implemented in program COLONY (Wang, 2004(Wang, , 2009. N b is typically lower that the number of adults associated with pedigreed offspring because the variance in offspring produced by adults and realized breeding sex ratio (Araki, Waples, Ardren, Cooper, & Blouin, 2007;Duong, Scribner, Forsythe, Crossman, & Baker, 2013). We also estimated the effective number of breeders (N b ) using the LDNe method (Waples & Do, 2008) as implemented in NeEstimator Version 2.1 (Do et al., 2014). We used the random mating model, a P crit allele frequency cutoff of 0.05, and only considered interscaffold locus pairs. Reported confidence intervals were generated by jackknifing across samples. We also repeated the analysis using all locus pairs in order to determine the extent to which including physically linked loci would downwardly bias N b estimates.

| Statistical analyses for pedigrees
Assuming directional (downstream) dispersal of offspring and assuming the farthest upstream locale where a full or half-sibling was collected were a proxy for the spawning location, we were able to coarsely characterize larval dispersal. Randomization procedures were used to test whether the observed number of within sampling location dyads that were related (N RW ) was due to chance alone.
The randomization procedure first enumerated all possible pairwise relationships among the number of genotyped (N gt ) offspring (i.e., Dyad total = N gt !∕(2!(N gt − 2)!)), with individual sampling locations recorded, as well. N RW dyads were randomly assigned as related, and the proportion of N RW observed within the same sampling locations, (P NW = N RW ∕Dyad total ), was recorded. The above procedure was repeated 1,000 times to generate a null distribution, representing the expected proportion of dyads observed within sampling locations due to chance. We calculated statistical significance as the proportion of the null distribution greater than the observed P NW per cohort. In addition, Fisher's exact tests were used to test whether the P NW per cohort differed significantly and thus whether spatial dispersion of larvae was concordant among age cohorts. Finally, a measure of coancestry was calculated per pedigree using an R function created by Bartron, Sard, and Scribner (2018), which was based on Cockerham's (1967) derivation of coancestry for a group of individuals. All statistical tests were conducted at α = 0.05. All analyses and graphics were completed in R, aided by "tidyverse" (Wickham, 2017).

| Genetic verification of species identity
Species identity of putative sea lamprey larvae for the SNP discovery panel was validated based on amplification using the sea lamprey specific mitochondrial COI PCR primers (Gingera et al., 2016). All larvae returned to the laboratory and screened from DC and SCR were sea lampreys based on positive PCR amplification. Coordinates for recovered targeted RAD loci were extracted from the fasta file produced by gstacks and cross-referenced with mapping coordinates for targeted loci using bedtools intersect (Quinlan & Hal, 2010

| Genetic diversity and differentiation
Locus   (Table 3). Pairwise comparisons indicated the strongest differentiation existed between DC (Lake Ontario basin) and the upper Great Lakes sampling sites (Table 3).

| Estimates of spatial genetic structure
Using DAPC, we transformed SNP variation over all loci into 149 principal components (PCs), of which we retained 100 based on

| Assessment of power of larval sea lamprey pedigree assignment
Across all simulations, the power to correctly infer known related dyads increased when more loci were used ( Figure 5). Conversely, resolving sibling relationships became more difficult when more parents were used in the breeding matrix. Initially, we evaluated false-negative rate (proportion of known dyads incorrectly inferred) across all N parents × N loci comparisons. Generally, most true full-and half-sibling dyads were identified using 200 or 500 loci when 10, 100, or 1,000 parents contributed to simulated offspring ( Figure 5).
Across all parameters evaluated, most unrelated dyads could be resolved as well ( Figure 5).
Given that interindividual dyadic relationships will not be known in datasets generated by field sampling and subsequent genotyping, it was important to assess accuracy among inferred dyads (the proportion of inferred dyads that were correctly inferred). Most inferred full-sibling and unrelated dyads were correctly inferred across most N parents × N loci comparisons ( Figure 6). Accuracy was related to both N parent and N loci . More half-sibling dyads were incorrectly inferred when N parents increased from 10 to 1,000. However, increasing N loci (100 to 500) resulted in a great improvement in assignment accuracy.

| Larval sea lamprey pedigree assignments in the SCR and DC
A total of 250 larval lampreys were successfully genotyped from DC (n = 217) and the SCR (n = 33). Parameters estimated for cohorts from DC are likely tied to larval sample sizes that varied across age classes (Table 1). The number of independent adult genotypes inferred for each genetic pedigree in DC varied among cohorts (N s range: 16-42).
The mean and variance in adult reproductive success varied across cohorts (Table 4). High variance in reproductive success for age-1 and age-2 larvae is reflected in comparatively lower estimated effective numbers of breeding adults (N b ) in age-1 and age-2 age cohorts relative to the age-3 cohort (Table 4). Mean coancestry ranged from 0.03 to 0.10 across the three cohorts sampled in DC (Table 4), largely reflecting differences in numbers of breeding adults, adult mean and variance in reproductive success, and number of full-and half-sibling dyads ( Table 3). Estimates of N b based on the number and variability in family size within cohorts (Wang, 2004 estimator) were comparable to those based on linkage disequilibrium (Table 4) (Figure 7). The spatial distribution of dyads (P NW per section pair) differed significantly (p < .05) among age-1, age-2, and age-3 and older cohorts sampled in DC.  (Derosier et al., 2007). Population genetic analyses demonstrate that the SNPs developed from larval sea lampreys from two populations were variable across the Great Lakes and exhibit levels of interpopulation variance in allele frequency that are comparable to mtDNA (Waldman et al., 2004) and microsatellite analyses (Bryan et al., 2005).Identification of outlier loci (Figure 4; features detailed in GitHub repository) portends developments in areas of gene-assisted control (McCauley et al., 2015), for example, the exploration of potential rapid evolution in response to intense selection associated with pesticide treatments (Dunlop et al., 2018).

| Evidence for spatial genetic structure
Our genetic structure results are consistent with previous analyses of spatial genetic structure of sea lampreys across the Great Lakes (Bryan et al., 2005;Waldman et al., 2004) based on microsatellites. This result was thus expected because sea lampreys do not home to natal streams. Little evidence of genetic differentiation was documented among populations within each Great Lake basin, consistent with tagging studies (Bergstedt & Seelye, 1995) that have demonstrated lack of natal homing. Data on SNP allele frequency variation within Great Lakes basins relative to more substantial variation among basins are consistent with the hypothesis that adult sea lampreys select streams based on olfactory queues associated with larval abundance in streams rather than cues associated with natal origins (Buchinger, Siefkes, Zielinski, Brant, & Li, 2015).Previous analyses of sea lamprey spatial genetic structure in terms of genetic affinities among populations based on mtDNA (Waldman et al., 2004) and microsatellite loci (Bryan et al., 2005), and coalescence-based model evaluation and estimation of levels of population bottlenecks (Bryan et al., 2005) indicated a sequential pattern of movements from Lake Ontario to Lake Erie and then into the upper Great Lakes.

TA B L E 3
Pairwise F ST estimates for sea lamprey sampled from five spawning streams, including age-specific variation among cohorts (1, 2, 3 refer to age classes) sampled from Duffins Creek (DC) and the St. Clair River (SCR)

F I G U R E 3
Ordination of discriminant analysis of principal components (DAPC) for individual larval sea lamprey sampled from five spawning streams, including multiple cohorts from Duffins Creek and the St. Clair River. 95% confidence ellipses are also provided for each sampling site. The inset figure illustrates cumulative percent of variation explained for 100 retained principal components

| Effects of outlier loci on spatial population structure
Information concerning interpopulation variance in SNP allele frequency could inform future applications of these markers in areas of sea lamprey control or conservation. Emerging DNA methods are being widely used to detect environmentally mediated selection that can enhance the ability of an invasive species to thrive in introduced habitats (Ellner & Sasaki, 1996).  Scaffold −log 10 (p) be produced by genetic drift during range expansion from the lower Great Lakes to the upper Great Lakes (Excoffier, Foll, & Petit, 2009). Nonetheless, OutFLANK has been shown to have a low false-positive rate relative to other outlier detection methods under range expansion scenarios (Whitlock & Lotterhos, 2015). Further attention to these putatively adaptive loci, others identified in future studies, and those previously observed in closely related taxa (Hess, Campbell, Close, Docker, & Narum, 2013) may help to shed light on the genetic basis for local adaptation in invasive and native sea lamprey populations.
Ordination of individuals in multivariate space (Figure 3) also revealed that larval sea lampreys from disparate locations can be genetically differentiated. As expected, individuals were largely separated by lake of origin, with individuals from Lake Ontario (DC) being separated from all other locations along the first DAPC axis and Lake Michigan and St. Clair River being separated from Lake Superior origin individuals along DAPC axis 2. Data from the SCR revealed that larvae from different cohorts cluster with different groups of adults from different streams (age-2 larvae with adults from the St. Mary's River and age-3 larvae with adults from the Carp River; Figure 3).
These preliminary results are based on only two tributaries but suggest an intriguing explanation that the genetic similarities of larvae  Barba et al., 2010). Importantly, genetic tagging studies can confidently identify full-and half-sibling relationships without genotyping parents (Jones & Wang, 2010), especially when several hundred loci are genotyped (Santure et al., 2010) as demonstrated by simulated and empirical data shown here.

| Analyses of simulated and empirical pedigrees
Simulations demonstrated that with 200 or 500 SNP loci separated by 2 MB across the genome, pedigrees could be established with high confidence, even with 1,000 adults contributing to simulated offspring ( Figures 5 and 6 (Waples et al., 2016). We sought to determine how much estimates of N b differed when using all loci pairs in the RAPTURE panel or only loci pairs that were not physically linked.
Importantly, results did not demonstrably change between the two approaches.The lack of change is likely due to the fact that lamprey have a large number of chromosomes and that for this set of loci, the number of intrachromosomal locus pairs is negligible relative to the number of interchromosomal pairs (Waples et al., 2016). Expansion of the number of loci for pedigree analyses would greatly increase statistical power to resolve pedigrees.
Given results in this study, a full suite of RAD loci may be utilized for pedigree reconstruction, further expanding the spatial and demographic scale that these loci may be used. We note that simulations were conducted assuming independent loci and with mean and variance in reproductive success lower than empirically estimated for the DC larval samples in this study (Table 4). Incidence of multiple SNPs per locus means that SNPs could be phased into microhaplotypes, which would potentially increase the number of alleles per locus and our ability to correctly resolve pedigrees (Baetscher et al., 2017). Results collectively suggest further simulation studies would be useful to better understand how violating assumptions regarding independent loci affect estimates of N b and genetic pedigree reconstruction.
The main ecological findings in this study highlight several po-  Such information will inform potential large-scale close-kin mark-recapture studies.

| Effects of potential aging error on pedigree inference and N b
Larval lamprey were aged based on established length-frequency distributions (Sethi et al., 2017). Aging error, particularly in tributaries of low productivity, can lead to individuals from multiple year cohorts being assigned to a single-year class (Dawson et al., 2009).
Because of the semelparous breeding with different adults producing each cohort, the most prudent approach would be tiered to ( using either LD (Waples et al., 2016) or family size (Wang, 2009). The problem of overlapping size distributions is not likely to affect age-1 individuals because of greater size discrimination relative to size differences among members of older age classes.

| Effects of ascertainment bias
SNP genotyping panels are influenced by ascertainment, and the resulting biases might affect subsequent analyses (Lachance & Tishkoff, 2013). Biased ascertainment has been observed in SNP genotyping panels developed for model (Clark, Hubisz, Bustamante, Williamson, & Nielsen, 2005) and nonmodel species (Seeb, Templin, et al., 2011), specifically when ascertainment is conducted using few populations or individuals that are not representative of the genetic diversity encountered across the range (Seeb, Carvalho, et al., 2011).
Our ascertainment process may have resulted in the detection of an  (Haasl & Payseur, 2011). Interestingly, our DAPC depicts patterns of population structure similar to those described in previous studies conducted using microsatellites (Bryan et al., 2005). Assignment tests (Bradbury et al., 2011), parentage analysis, and individual identification are expected to be minimally affected by ascertainment bias because using markers with higher heterozygosity markers increases statistical power (Morin et al., 2004). Temporal F ST and linkage disequilibrium-derived estimates of effective population size are also expected to be minimally affected by ascertainment (Morin et al., 2004). Demographic inferences can likely be made after correcting for biases induced by ascertainment scheme (Wakeley, Nielsen, Liu-Cordero, & Ardlie, 2001). Ascertainment bias has the potential to increase false-positive rates for genome-wide scans for selection (Lachance & Tishkoff, 2013).

| Future applications for the RAD capture sea lamprey SNP array
With the large number of loci now available for sea lamprey, future applications in many areas of sea lamprey research and assessment are not limited by statistical power of the markers available, but by the ability to collect adequate sample sizes in the context of appropriate spatial scales. Data presented here represent one of the few studies to employ the RAD capture genotyping method for fish (others include Margres et al., 2018;Ali et al., 2016;Komoroske et al., 2019;review in Meek & Larson, 2019 1. Analyses of pedigrees from multiple year classes in DC demonstrate that estimates of the number of spawning adults and effective numbers of breeding adults that produced each cohort in a larval sample can be obtained. A useful next step would be to estimate the total river spawner abundance for each year (cohort-specific), which would require additional analyses. One benefit to sea lamprey control of using pedigree analysis of larvae rather than mark-recapture of adults to estimate stream-specific annual adult abundance is that any captured adults could be immediately removed from the system rather than marked and released (i.e., allowed to reproduce). Pedigree analysis of larvae could also provide estimates of adult abundance in rivers where traps are not operated (e.g., rivers lacking dams). One drawback to larval-based adult abundance estimates, however, is that it requires a time lag (e.g., provides estimates one or more years after spawning occurred), whereas mark-recapture of adults provides information to managers during the year of spawning.
2. Identification of sibling sea lamprey across life stages (i.e., linking premetamorph juveniles in a stream to siblings sampled as spawning adults years later) could address one of the highest priorities of the sea lamprey control program: the ability to identify the stream of origin of the parasitic and adult lampreys in each lake. One benefit of genetic mark-recapture for stream origin questions is that marked individuals do not need to be handled; only their siblings. Thus, a sample of larvae collected during treatments or other outmigrant/ larvae collections could be used to recapture (as adults or parasitic juveniles) their siblings. This seems to have at least the same power as coded wire tagging studies but does not rely on assumptions of handling effects and postrelease mortality (Johnson et al., 2014).
3. The sea lamprey assessment program could benefit from having the ability to characterize dispersion of larvae from the same fulland half-sibling families. Previously, we have shown (Derosier et al., 2007;Gilmore, 2004) Duong et al., 2013;Sard et al., 2016). Additionally, adaptive loci may be identified and used to inform which fish are reintroduced in specific locations to improve reintroduction outcomes.

CO N FLI C T O F I NTE R S T
None declared.

AUTH O R CO NTR I B UTI O N S
GB, JA. CH, PH, KT, and KS designed the research; GB, CH, PH, and KT were involved in field collection of samples; NS, SS, JH, and JK were responsible for laboratory analyses; and NS, SS, JH, and JA conducted bioinformatic and statistical analysis of molecular data and field data. All authors contributed to writing of the manuscript.