Estimating effective population size using RADseq: Effects of SNP selection and sample size

Abstract Effective population size (Ne) is a key parameter of population genetics. However, N e remains challenging to estimate for natural populations as several factors are likely to bias estimates. These factors include sampling design, sequencing method, and data filtering. One issue inherent to the restriction site‐associated DNA sequencing (RADseq) protocol is missing data and SNP selection criteria (e.g., minimum minor allele frequency, number of SNPs). To evaluate the potential impact of SNP selection criteria on Ne estimates (Linkage Disequilibrium method) we used RADseq data for a nonmodel species, the thornback ray. In this data set, the inbreeding coefficient F IS was positively correlated with the amount of missing data, implying data were missing nonrandomly. The precision of Neestimates decreased with the number of SNPs. Mean Ne estimates (averaged across 50 random data sets with2000 SNPs) ranged between 237 and 1784. Increasing the percentage of missing data from 25% to 50% increased Ne estimates between 82% and 120%, while increasing the minor allele frequency (MAF) threshold from 0.01 to 0.1 decreased estimates between 71% and 75%. Considering these effects is important when interpreting RADseq data‐derived estimates of effective population size in empirical studies.

based on genetic markers. Demographic approaches estimate N e as a function of parameters such as mean and variance in offspring number, survival-at-age, and birth rate. However, demographic methods often rely on strong assumptions such as discrete generations (Caballero, 1994;Nomura, 2002) or if overlapping generations are admitted, stable age structure (Robin S. Waples, Do, & Chopelet, 2011). Only one demographic method allows demographic stochasticity and heterogeneity at the expense of challenging data demands such as individual-level information (Engen, Lande, Saether, & Gienapp, 2010). This might explain why the method has not been much used so far (but see Trask, Bignal, McCracken, Piertney, & Reid, 2017).
Genetic methods have gained in popularity and power due to recent advances in genotyping and sequencing technologies as well as in computer processing speed. They rely on the extraction of genetic signals (allele frequencies) which are theoretically known to be affected by population demography, mainly effective population size. Among the genetic methods available, single-sample approaches are appealing since they require sampling only at one point in time. The most popular N e estimator is based on a measure of linkage disequilibrium (LD), that is, the nonrandom association of alleles at different loci. The LD method has been widely used during the last decade for a variety of organisms, including mammals (Cervantes, Pastor, Gutiérrez, Goyache, & Molina, 2011;Juarez et al., 2016), insects (Francuski & Milankov, 2015), reptiles (Bishop, Leslie, Bourquin, & O'Ryan, 2009), and fishes (Pilger, Gido, Propst, Whitney, & Turner, 2015;Wilson, McDermid, Wozney, Kjartanson, & Haxton, 2014).
Empirical estimates of N e are often biased because all methods rely on strong assumptions which are likely violated in natural populations (R. S. Waples, Antao, & Luikart, 2014). Numerous recent genetic studies have documented how more realistic simulations or real data, which do not fulfill methods' assumptions, lead to biased N e estimates (Gilbert & Whitlock, 2015;Hare et al., 2011;Luikart, Ryman, Tallmon, Schwartz, & Allendorf, 2010;Marandel et al., 2019;Robinson & Moyer, 2013;Russell & Fewster, 2009;Robin S. Waples & Do, 2010). Among the various sources of bias, the assumption of nonoverlapping generations is often violated. In this case, the amount and the direction of bias as well as the precision of estimates are highly dependent on life history traits, thus species-specific, but also on the sampling fraction (Marandel et al., 2019;. Other factors such as unequal sex ratio, high level of inbreeding and high variance in family sizes have also been found to bias N e estimates (Montarry et al., 2019).
For nonmodel species, the absence of a reference genome challenges the development of genetic markers and the assessment of genomic ascertainment bias, and more generally the amount of expected species-specific bias for N e estimates. A widely used method to develop de novo genetic markers and genotype individuals in one single step is the restriction associated DNA sequencing (RADseq), which provides thousands of sequenced SNP (single-nucleotide polymorphism) markers across many individuals at reasonable costs (Davey & Blaxter, 2010). A drawback of the method is the numerous sources of genotyping errors (Mastretta-Yanes et al., 2015) and missing data (information missing for certain individuals for certain markers). One case of genotyping errors is dropped alleles, that is, one allele is not typed making a heterozygous individual appearing homozygous (Bilton et al., 2018). Missing data and random allelic dropouts can bias LD estimates (Akey, Zhang, Xiong, Doris, & Jin, 2001;Bilton et al., 2018) and subsequently bias LD based N e estimates and increase their variance (Nunziata & Weisrock, 2018;Russell & Fewster, 2009). The degree of bias in LD estimates depends on allele frequency (Akey et al., 2001). Further, rare alleles are known to cause positive bias in N e estimates (e.g., Nunziata & Weisrock, 2018;Russell & Fewster, 2009). For microsatellites, this has led to the recommendation to select those with minor allele frequency (MAF) >0.01 if sample size >100 (Robin S. Waples & Do, 2010). For SNPs, rare alleles can be avoided by keeping only the SNPs with highest polymorphic content (Phillips et al., 2004).
However, the effect of the MAF threshold remains poorly known, but see Nunziata and Weisrock (2018).
The aim of this study was to determine the effects of the MAF, the proportion of missing data and the number of SNPs on N e estimates when applying the LD approach to RADseq data. These effects were explored using empirical data collected for the thornback ray (Raja clavata, Figure 1) in the Bay of Biscay.

| Sampling
Overall 159 thornback rays were sampled in the Bay of Biscay be-

| RAD-sequencing protocol and bioinformatics
All individuals were genotyped by sequencing using a RADseq protocol to effectively subsample the genome of multiple individuals at homologous genomic regions. The library construction followed the original protocol by Baird et al. (2008) with slight modifications.
Briefly, 1 μg of genomic DNA from each individual was digested with the restriction enzyme SbfI-HF (New England Biolabs), and then ligated to a P1 adapter labeled with a unique barcode. We used 16 barcodes of 5-bp and 16 barcodes of 6-bp length in our P1 adapters to build 32-plex libraries. The 159 individuals were part of a wider sample including individuals from other regions. From these, one pool of three individuals and seven pools of 32 individuals were made by mixing individual DNA in equimolar proportions and sheared to an average size of 500 and 350 bp, respectively, using a Covaris S220 sonicator (KBiosciences). A size-selected step was carried out on agarose gel to keep DNA fragments within the size range 500-1000 bp for the 3-plex and 300-700 pb for the 32-plexes. Each library was then submitted to end-repair, A-tailing, and ligation to P2 adapter before PCR amplification for 18 cycles. Amplification products from six PCR replicates were pooled for each library, gel-purified after size selection and quantified on a 2,100 Bioanalyzer using the High Sensitivity DNA kit (Agilent). The 3-plex library was sequenced in paired-ends 300 reads using Illumina Miseq technology. Each 32-plex library was sequenced on a separate lane of an Illumina Hiseq 2500 instrument by INTEGRAGEN, using 100-bp single reads.
Individuals were genotyped on 389 483 putative SNPs spread on 35 134 RAD loci (a sequence starting or ending with a restriction enzyme site). Given the variability due to laboratory work and sequencing protocols, we chose to retain only the loci with a calling rate percentage above 50% (i.e., maximum percentage of missing data (NA) of 50%) and a MAF above 0.01 for the 159 individuals to remove spurious SNPs. This raw data set had 43 088SNPs (Le Cam et al., 2019). Given the large number of amplification cycles (18), a preliminary analysis of the dependence of the heterozygote miscall rate on mean SNP read depth was carried out using the R package whoa (Anderson, 2019) as suggested by a reviewer. The potential heterozygote miscall rate was estimated from comparing the observed number of heterozygous individuals with the number expected given the allele frequency and assuming the SNP was in Hardy-Weinberg equilibrium (Hendricks et al., 2018). Based on this a data set containing only genotypes with read depths between 30 and 300 copies, MAF ≥0.01 and NAs ≤0.5 was created (referred to as full data, 17 843 SNPs). Removing genotypes with low and very high read depths (below 30 and above 300 copies) reduced the number of SNPs but also increased the proportion of missing data. A second data set with a maximum NA of 25% was therefore created (referred to as reduced data, 4,816 SNPs). The lower NA threshold value is more in line with common practices in empirical studies (e.g., 15% missing data in Pazmino, Maes, Simpfendorfer, Salinas-de-Leon, and van Herwerden (2017), 25% in Rodriguez-Ezpeleta et al. (2016)).
The randomness of missing data in the full data set was tested by estimating the Spearman rank correlation between the proportion of missing data and the inbreeding index where H obs is the observed proportion of heterozygous individuals and H exp the expected proportion under Hardy-Weinberg equilibrium for a given SNP. We also tested the correlation between the proportion of missing data and the proportion of heterozygous individuals (H obs ).
Seven individuals including five from outside the Bay of Biscay were genotyped twice. This replicate data set was used to explore genotyping error and allelic dropout. Allelic dropout corresponded to one of the replicate genotypes being heterozygous but not the other one or one being homozygous for the major allele and the other for the minor allele. The correlation between the proportion of replicated individuals exhibiting dropout and the inbreeding coefficient for all 159 individuals was tested.

| Effective population size
The single point estimation method linkage disequilibrium (LD) is based on linkage disequilibrium due to the nonrandom association of alleles at different gene loci. LD is measured at one point in time by the covariance between loci. We used NeEstimatorV2.1 (Do et al., 2014) for estimating effective population size N e . All samples from different years and cohorts (size classes) were pooled for estimation.
First, the effect of the number of SNPs was evaluated by drawing randomly (without replacement) 500 to 4,000 SNPs from the reduced data set with MAF ≥0.01 and percent missing data NA ≤25%. Fifty replicate data sets were created, and the mean and coefficient of variation (standard deviation/mean) of replicate N e estimates were calculated. The effects of four thresholds for the minimum MAF were then evaluated for the full data set: 0.01, 0.02, 0.05, and 0.1 where a value of 0.01 means that the selected SNPs had a MAF in the range 0.01 to 0.5. This rather wide range of threshold values was chosen to explore the shape of the relationship between the MAF filter and N e estimates. High threshold values (>0.05) might be unsuitable for practical applications (see discussion). The MAF filter was combined with a filter for NA for each SNP which had six levels between 25% and 50% (5% steps). Combining MAF and NA thresholds led to 24 empirical genetic datasets for which N e was estimated. The number of available SNPs varied between data sets from 1549 to 17 842 (Table 1).
For standardization, 2000 SNPs were randomly selected (without replacement) from each data set, except for the smallest data set for which it was 1,000 (NA ≤25%, MAF ≥0.1). This number of SNPs was sufficient to stabilize estimates (see results). An ANOVA was fitted to replicate log-transformed N e estimates for comparing the effects of MAF and missing data filters, as well as their interaction. Residuals were checked for normality.
The effect of the sample size on N e estimates was evaluated with the reduced data set (MAF ≥0.01; NA ≤25%) by creating random data sets with the number of individuals ranging from 25 to 150.
A rarefaction curve analysis was calculated as a function of sample size. A parametric model (Michaelis-Menten) was fitted using nonlinear least-squares to estimate N e free of sample size effects, which corresponds to the model asymptote.
All data handling and analysis of results were carried out in R (R Development Core Team, 2008).

| Exploratory analysis
In the raw data, the estimated mean miscall rate decreased strongly as the minimum read depth increased (Figure 3). It was around 0.75 considering all SNPs in the raw data set. Therefore, further analyses were restricted to genotypes with read depth in the range 30 to 300 copies (full data set).
The distribution of MAF values in the full data set was nonuniform with most of the SNPs displaying MAF <0.1 (Figure 4a) and NA >25% (Figure 4b). The distribution of missing individuals was nonuniform across individuals with 33 individuals missing more than 50% of SNPs (Figure 4c).

| Ne estimation
The mean estimate of N e across the 50 random data sets stabilized at around 1,500 SNPs and uncertainty decreased with the number of SNPs ( Figure 6). The coefficient of variation (CV) decreased from 0.29 for 500 SNPs to 0.07 for 2,000 SNPs and 0.02 for 4,000 SNPs. This indicates that the 2000 SNPs used for exploring the effects of missing data and MAF thresholds were sufficient for obtaining reliable estimates.
The effects on N e estimates of the thresholds for MAF and NA were important (Figure 7). Mean N e estimates ranged between 237 and 1,784 corresponding to a factor of 7.5. Mean values decreased by 71% to 75% with increasing MAF threshold and increased by 82% to 120% with NA. For example, for the smallest NA (25%), mean N e (averaged across 50 replicates) decreased by 76% from 982 to 237 as the MAF threshold increased from 0.01 to 0.1. In contrast, for the smallest MAF threshold (0.01), the mean N e increased by 82% from 982 to 1784 when NA increased from 25% to 50%.
The ANOVA revealed that the effect of the MAF threshold value was eight times larger than that of NA (Table 2). There was a weak but significant interaction between the two factors.

| D ISCUSS I ON
Empirical genetic data from thornback rays sampled in the Bay of Biscay were used to explore the effects of data selection on N e estimates. Genetic markers were obtained from a RADseq protocol in which individuals with missing data for a given SNP are common (Nunziata & Weisrock, 2018) and genotyping errors frequent (Mastretta-Yanes et al., 2015). In the thornback ray data, for 11% of SNPs the genotype differed between the two replicates on av- Subsampling the data with different thresholds for missing data and minimum minor allele frequency permitted us to evaluate the effects of these two factors on N e estimates. Depending on the combination of threshold values, N e estimates varied by up to a factor of 7.5. In comparison, the well-known effect of underestimation of N e due to ignoring overlapping generations is only around 30% in the thornback ray, independent of the census population size (Marandel et al., 2019 (Marandel et al., 2019). Further, the 159 individuals were probably not enough to obtain stabilized estimates. This might not be surprising given that simplified genetic simulations for a thornback ray like species indicated that around 1% of the population needed to be sampled to obtain reliable estimates (Marandel et al., 2019) and the Bay of Biscay population is potentially large (Marandel, Lorance, & Trenkel, 2016). The rarefaction analysis with the reduced data set indicated a stabilized effective population size of 903 (asymptote of fitted model). For the thornback ray population in the Irish Sea and Bristol Channel Chevolot, Ellis, Rijnsdorp, Stam, and Olsen (2008) estimated N e as being 283 using five microsatellites and a temporal estimation method with samples from two time periods. Given the difference in approach, it is unknown whether their sample of 363 individuals and number of microsatellites was sufficient and hence whether the two effective population size estimates can be compared. If they are comparable, the Bay of Biscay populations would be the larger one.
Other genetic simulations for thornback ray populations in European waters using contrasted assumptions for migration rates suggested a stable large scale population structure with little exchange (Marandel et al., 2018). This could mean that migration might not be expected to impact much allele frequencies in European thornback ray population, hence effective population size estimates. Selection can also cause nonrandom association of alleles within and across loci which will again be interpreted as genetic drift (underestimation of N e ) by the linkage disequilibrium estimator (Waples & Do, 2010). Contrary to genetic drift that affects all loci in the genome, selection only affects certain loci (depending on the genetic architecture) but its effect should be diluted when using a large number of SNPs (»100) as done here).
Depending on the genetic determinism of the selected traits (monogenic to polygenic) and the intensity of the selective process, the effect on N e estimates is hard to predict.
Further, physically unlinked SNPs were assumed, which is clearly unrealistic (Waples, Larson, & Waples, 2016). The number of truly independent SNPs is equal to the number of chromosomes, which is 98 for thornback ray (Nygren, Nilsson, & Jahnke, 1971) times the average number of crossing-over per chromosome in thornback ray.
The finite number of chromosomes will create linkage disequilibrium (more precisely gametic linkage disequilibrium) purely due to physical linkage between SNPs, rather than true N e changes (Waples et al., 2016).
Based on our results as a guide for practitioners we recommend to use the lowest feasible percentage of missing data, though the precise threshold value will depend on the overall sample size and the expected effective population size. The main principle is to maintain a sufficiently large sample size (in terms of genotyped individuals) for all SNPs included in the analysis. It is more difficult to make recommendations regarding the threshold value for the minor allele frequency. It is important to keep in mind, that in a perfect Fisher-Wright population, thresholds on MAF values are nonsense since any filtration will remove important genetic information to infer N e .
However, empirical datasets will always contain loci with alleles of spurious low or very low frequencies. There are a growing number of methods to discard spurious SNPs with a low MAF within the bioinformatics pipeline by taking conservative filters on minimum read depth of the loci. Here, a read depth of at least 30 was required to reduce replication error. At the least, the lowest possible MAF filter should be chosen (compromise between loosing relevant genetic information and noise) and the results for different threshold values should be compared.
In conclusion, for nonmodel species special attention should be paid to the interpretation of N e estimates as large bias in estimates might occur when using the LD method. For thornback ray, we found that nonrandomly missing data, allele frequency filters and sample size had much larger effects than the expected bias due to ignoring overlapping generations (Marandel et al., 2019). We expect these TA B L E 2 Analysis of variance for testing the effects of threshold values for percent of missing data (NA) and minimum minor allele frequency (MAF) on log-transformed effective population size (N e ) estimates