Population assignment from genotype likelihoods for low-coverage whole-genome sequencing data

Low-coverage whole genome sequencing (WGS) is increasingly used for the study of evolution and ecology in both model and non-model organisms; however, effective application of low-coverage WGS data requires the implementation of probabilistic frameworks to account for the uncertainties in genotype likelihood data. Here, we present a probabilistic framework for using genotype likelihood data for standard population assignment applications. Additionally, we derive the Fisher information for allele frequency from genotype likelihood data and use that to describe a novel metric, the effective sample size, which figures heavily in assignment accuracy. We make these developments available for application through WGSassign, an open-source software package that is computationally efficient for working with whole genome data. Using simulated and empirical data sets, we demonstrate the behavior of our assignment method across a range of population structures, sample sizes and read depths.


| INTRODUCTION
In just a few years, next-generation sequencing (NGS) technologies have revolutionized the study of evolution and ecology in both model and non-model organisms, and have become established as standard tools in molecular ecology. In particular, whole-genome sequencing (WGS) can provide sequence data from a large proportion of the genome and is increasing in use. While large-scale WGS projects can be prohibitively expensive at the read depths necessary for accurately calling individual genotypes, low-coverage WGS offers a cost-effective approach that reduces the read depth per individual while retaining sufficient information for genomic analyses. However, since low-coverage WGS precludes the ability to call individual genotypes, probabilistic frameworks are used to account for the uncertainty in an individual's genotype (Buerkle & Gompert, 2013; Nielsen et al., 2011). Extending common analyses in the field of molecular ecology to accommodate genotype uncertainty through the direct use of genotype likelihoods is a necessary advance for broadening the utility of low-coverage WGS.
The creation of probabilistic frameworks for allele frequency estimation, genotype calling and single nucleotide polymorphism (SNP) calling has made low-coverage WGS practical for many applications (Kim et al., 2011; Nielsen et al., 2011, 2012). By first estimating the joint site frequency spectrum for individuals without calling individual genotypes, priors on allele frequency can improve the calling of individuals' genotypes and SNPs. Population genetic analyses have been further advanced through the development of methods that quantify genetic differentiation and investigate population structure with principal components analysis, while accounting for uncertain genotypes (Fumagalli et al., 2013). Similarly, accurate estimates of individual admixture proportions (Skotte et al., 2013) and pairwise relatedness (Korneliussen & Moltke, 2015) can be obtained using genotype likelihoods. The widespread use of these methods is facilitated by software that is both user-friendly and computationally efficient (e.g. ANGSD (Korneliussen et al., 2014), ngsTools (Fumagalli et al., 2014), PCangsd (Meisner & Albrechtsen, 2018)). However, a fundamental analysis for molecular ecology yet to be developed for low-coverage WGS data is population assignment.
Population assignment methods are used to determine an individual's population of origin and have provided insight into ecological and evolutionary processes, such as dispersal, hybridization and migration, as well as informed conservation and management decisions (Manel et al., 2005). The traditional assignment test uses an individual's multilocus genotype and the source populations' allele frequencies to calculate the likelihood of the genotype originating from each of the populations (Paetkau et al., 1995; Rannala & Mountain, 1997).
Using this framework, the recent increase in available markers (e.g. from RADseq approaches) has made possible highly accurate assignment of individuals among weakly differentiated populations by using subsets of loci that are informative for population structure (e.g. Benestan et al., 2015; DeSaix et al., 2019; Ruegg et al., 2014). The traditional assignment test is readily extended to analyses such as genetic stock identification (GSI), to determine the proportion of source populations in a mixture of individuals (Smouse et al., 1990). To date, methods for performing assignment tests require known genotypes and have not been implemented to use genotype likelihoods.
Assignment tests are well suited for application with low-coverage WGS data, because they rely heavily on allele frequency estimates, for which a number of approaches are already developed.
However, a challenge with using low-coverage WGS data for assignment tests is that the allele frequency estimates may be uncertain, which could lead to inaccurate assignment results. While this challenge is not unique to low-coverage WGS data, as low sample size also increases uncertainty regardless of sequencing coverage, the challenge of accurate allele frequency estimation is compounded for low-coverage WGS by low read depth. For accurate allele frequency estimation from low-coverage WGS data, specific recommendations include aiming for individual sequencing depths of 1x (Buerkle & Gompert, 2013) or having at least 10 individuals sequenced with a total per-population sequencing depth of at least 10x (Lou et al., 2021). The goal of these strategies is to maximize information for estimating allele frequencies given finite resources for sequencing depth and number of samples. Lower sequencing depth decreases the amount of information about population allele frequency, while larger sample sizes increase it. However, information is not directly quantified in these studies; rather, comparison of known versus simulated allele frequencies was used to arrive at these general rules of thumb (Buerkle & Gompert, 2013; Lou et al., 2021). The development of an information metric that accounts for read depth variation across genotypes would provide a valuable way to quantify the thresholds of information needed for parameter estimation with low-coverage WGS data. For population assignment tests, an information metric of this sort would allow researchers to more directly identify the sample size and sequencing depth needed to perform accurate assignment given the genetic differentiation of their samples. Furthermore, given that unequal sample size among reference populations is a source of bias in assignment tests with called genotypes (Wang, 2017), an information metric would allow the identification and mitigation of biased assignment due to the combined influence of unequal sample sizes and sequencing depths among populations.
Here, we present WGSassign, an open-source software package of population assignment tools for genotype likelihood data from low-coverage WGS. The objectives of WGSassign are (1) to provide common assignment methods that use genotype likelihoods instead of called genotypes; (2) to evaluate the information available in low-read-depth sequencing data for allele frequency estimation; and (3) to achieve computational efficiency for processing large numbers of samples with genome-wide data. WGSassign provides methods for individual assignment and leave-one-out cross-validation of samples of known origin. Additionally, it calculates a z-score metric that can indicate when samples originate from an unsampled source population. For the second objective, we calculate Fisher information (Casella & Berger, 2021) and determine the effective sample size: the number of samples with completely observed genotypes that would yield the same amount of statistical information for estimating allele frequency as the observed genotype likelihoods in a data set. This calculation of effective sample size has broad utility for population genomics studies using low-coverage WGS.
We validate WGSassign and investigate its behaviour with an extensive set of simulations and demonstrate its use on two empirical data sets. In the first, we apply WGSassign to weakly differentiated groups of yellow warblers (Setophaga petechia). In the second, we apply WGSassign to two well-differentiated Chinook salmon (Oncorhynchus tshawytscha) populations to demonstrate that when sufficient effective sample sizes of the source population are available, unknown individuals can be assigned accurately, even at extremely low read depths.

| Population assignment
We assume that there are K sampled source populations to which an individual can be assigned using data from L biallelic loci in the genome. Let G_ℓ ∈ {0, 1, 2} denote the (unobserved) genotype of an individual at locus ℓ, recorded as the number of copies of one designated allele. With low-coverage sequencing data, G_ℓ is not observed with certainty. Rather, evidence about the unknown genotype is obtained from sequencing reads covering the locus. Let R_ℓ denote the sequencing read data from an individual at locus ℓ. The evidence for the state of G_ℓ from the read data is summarized as the likelihood of the genotype given the read data, which is simply the probability of the read data given the genotype, considered as a function of the genotype:

g_{ℓ,j} = P(R_ℓ | G_ℓ = j), for j ∈ {0, 1, 2}.

Without loss of generality, we consider these likelihoods to be scaled so that they sum to one: g_{ℓ,0} + g_{ℓ,1} + g_{ℓ,2} = 1. Such likelihoods are typically a function of the number of reads of each allele observed and the corresponding base quality scores, and they are computed during genotype calling by a variety of programmes such as bcftools (Li, 2011; Li et al., 2009), GATK (McKenna et al., 2010) and ANGSD (Korneliussen et al., 2014). An accessible review of the different models providing genotype likelihoods is found in Lou et al. (2021).
Performing population assignment using read data from an individual (rather than from directly observed genotypes) requires, for each locus ℓ, the likelihood that the individual came from a given source population k, given the individual's read data. This is simply the probability of the read data from the individual given that the individual came from source population k, with allele frequency f_{k,ℓ}. Thus, we require P(R_ℓ | f_{k,ℓ}), which can be calculated from Equations (1) and (2) using the law of total probability:

P(R_ℓ | f_{k,ℓ}) = Σ_{j=0}^{2} g_{ℓ,j} P(G_ℓ = j | f_{k,ℓ}).

If the L loci in the genome are not in linkage disequilibrium (LD), and are hence independent of one another within source populations, then the likelihood of source population k given R, the read sequencing data across the entire genome, is simply the product over loci:

P(R | f_k) = ∏_{ℓ=1}^{L} P(R_ℓ | f_{k,ℓ}),
where f_k denotes the set of all L allele frequencies in population k. Of course, with WGS, some variants may be near one another and will then likely be in LD. In such a case, Equation (4) is not correct but, rather, is a composite-likelihood approximation to the true likelihood (which is largely intractable). Composite-likelihood estimators often produce unbiased results but, because they do not take account of the dependence of different variables in the likelihood, they typically underestimate the uncertainty in the estimates (Larribe & Fearnhead, 2011). Given the unbiased nature of composite-likelihood estimators, LD pruning of the WGS data is not necessary. For each individual of unknown origin, this likelihood can be computed for each source population k, and the relative values of those likelihoods give the evidence that the individual came from each of the source populations. If the prior probability π_k that an individual came from source population k is available for k ∈ {1, …, K}, then the likelihoods can be used to compute the posterior probability that the individual came from each of the source populations:

P(Z = k | R) = π_k P(R | f_k) / Σ_{k′=1}^{K} π_{k′} P(R | f_{k′}),

where Z is a random variable indicating the origin of the individual.
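The likelihood and posterior calculations above can be sketched in a few lines of Python, assuming Hardy-Weinberg genotype proportions within each source population. The function names and data layout here are illustrative, not WGSassign's API.

```python
import math

def hwe_prior(g, f):
    # P(G = g | allele frequency f) under Hardy-Weinberg proportions
    return math.comb(2, g) * f**g * (1.0 - f) ** (2 - g)

def pop_log_likelihood(geno_likes, freqs):
    """Composite log-likelihood that one individual's reads came from a
    population with per-locus allele frequencies `freqs`.

    geno_likes: one (g0, g1, g2) tuple of scaled genotype likelihoods per locus
    freqs:      one population allele frequency per locus
    """
    total = 0.0
    for (g0, g1, g2), f in zip(geno_likes, freqs):
        # law of total probability: P(R_l | f) = sum_g P(R_l | G = g) P(G = g | f)
        p = g0 * hwe_prior(0, f) + g1 * hwe_prior(1, f) + g2 * hwe_prior(2, f)
        total += math.log(p)
    return total

def pop_posteriors(geno_likes, pop_freqs, priors=None):
    """Posterior probability that the individual came from each population."""
    k = len(pop_freqs)
    priors = priors or [1.0 / k] * k
    lls = [pop_log_likelihood(geno_likes, pf) for pf in pop_freqs]
    m = max(lls)  # subtract the max log-likelihood for numerical stability
    w = [math.exp(ll - m) * pr for ll, pr in zip(lls, priors)]
    s = sum(w)
    return [x / s for x in w]
```

Working in log space and subtracting the maximum before exponentiating avoids underflow, which matters when the product runs over hundreds of thousands of loci.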
In practice, the allele frequencies in each source population are not known with certainty. Accordingly, these frequencies must be estimated from sequencing read data from individuals known to be from the source populations (these are often referred to as 'reference samples'). We estimate these by maximum likelihood. The probability of the read data, R^{(i)}_ℓ, from the ith reference sample at locus ℓ, given that it came from source population k, is, following Equation (3),

P(R^{(i)}_ℓ | f_{k,ℓ}) = Σ_{j=0}^{2} g^{(i)}_{ℓ,j} P(G_ℓ = j | f_{k,ℓ}),

where the genotype likelihoods are now adorned with a superscript (i) to denote that they are for the ith reference sample. Assuming the samples from source population k are not related, the log-likelihood for f_{k,ℓ}, given the read data from all n_k reference samples from population k, is:

log L(f_{k,ℓ}) = Σ_{i=1}^{n_k} log P(R^{(i)}_ℓ | f_{k,ℓ}).

In our implementation, we first use the expectation-maximization algorithm (Dempster et al., 1977) from ANGSD (Kim et al., 2011), and the code as implemented in PCangsd (Meisner & Albrechtsen, 2018), to obtain the maximum likelihood estimates (MLEs) of the population allele frequencies, f̂_{k,ℓ}, from the reference samples. Then, when calculating P(R | f_k), we substitute f̃_{k,ℓ} for f_{k,ℓ}, calculated as follows:

f̃_{k,ℓ} = (2 n_k f̂_{k,ℓ} + 1) / (2 n_k + 2),

where, again, n_k is the number of reference samples from source population k. This provides a correction for cases in which an allele exists in a source population but was not detected in the reference samples from that population; effectively, it adds one more individual to the sample that carries one copy of the allele not previously seen in that reference population. Without this correction, P(R^{(i)}_ℓ | f_{k,ℓ}) = 0 for an individual carrying an allele absent from the reference samples, and the log-likelihood cannot be calculated. This approach is identical to the 'Frequency Criterion' used in GENECLASS 2.0 with the 'adjustable default value' set to 1/(2n + 1). Another approach, due to Rannala and Mountain (1997), that places beta priors, independently for each population and locus, on the allele frequencies, has also been widely used in population assignment methods. Implementing that approach with genotype likelihoods is more computationally challenging than with observed genotypes, and since extensive simulations (not shown) revealed no substantial differences between the two methods, we adopted the 'Frequency Criterion' approach.
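A minimal sketch of the allele frequency correction described above, assuming the "add one extra individual carrying one copy of the unseen allele" reading of the text (so an allele unobserved in n reference diploids receives frequency 1/(2n + 2)). The function name is illustrative.

```python
def correct_frequency(f_hat, n_ref):
    """Adjust an MLE allele frequency so unseen alleles get nonzero mass.

    Conceptually adds one diploid individual (two gene copies) carrying one
    copy of the previously unseen allele; with f_hat = 0 this yields
    1 / (2 * n_ref + 2), and the correction leaves f_hat = 0.5 unchanged.
    """
    return (2 * n_ref * f_hat + 1.0) / (2 * n_ref + 2.0)
```

Because the corrected frequency is strictly inside (0, 1), the log-likelihood is always finite, even for individuals carrying alleles never observed in the reference samples.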

| Fisher information and effective sample size
As should be clear from the preceding development, the accuracy of population assignment depends, at least in part, on the accuracy of the estimates of the allele frequencies from each source population.
In this section, we develop the theory (which is then implemented in WGSassign) that provides the user with a measure of allele frequency estimate accuracy, calculated from the genotype likelihoods in the reference samples, that takes account of both sample size and read depth. We define this metric as the effective sample size: the number of diploid individuals with called genotypes that provide the same amount of information for allele frequency as the observed information from the low-coverage WGS samples. Fewer individuals sampled and lower sequencing depth will result in less information in the data regarding allele frequency.
As noted above, estimates of the allele frequencies are made by maximum likelihood using the sequencing data on the reference samples from each source population. Fisher information is a statistical metric that quantifies the amount of information in a sample for estimating an unknown, continuous parameter (Fisher, 1922).
It measures the curvature of the log-likelihood function and is inversely related to the variance. In visual terms, a sharply peaked log-likelihood curve (i.e. one with greater curvature) for a parameter indicates greater certainty in the estimated parameter (and also higher Fisher information) than a flatter log-likelihood function.
Formally, the curvature is measured by the negative second derivative of the log-likelihood function. The observed Fisher information for allele frequency is that negative second derivative evaluated at the MLE:

I_o(f̂_{k,ℓ}) = −(d²/df²) log L(f) |_{f = f̂_{k,ℓ}}.

Appendix A shows how I_o^{(i)}(f̂_{k,ℓ}), the observed Fisher information for f_{k,ℓ} in the reads from a single individual, i, is found. The observed Fisher information from all n_k reference samples is then the sum of the individual contributions:

I_o(f̂_{k,ℓ}) = Σ_{i=1}^{n_k} I_o^{(i)}(f̂_{k,ℓ}).

To derive ñ_ℓ, our effective sample size metric for locus ℓ, we compare this observed Fisher information to the expected Fisher information that would be obtained from 2ñ gene copies with allelic type directly observed (Appendix A) from a population in which the true allele frequency is f̂_{k,ℓ}:

I_e(f̂_{k,ℓ}) = 2ñ / [f̂_{k,ℓ}(1 − f̂_{k,ℓ})].

Equating I_o(f̂_{k,ℓ}) to I_e(f̂_{k,ℓ}) and solving for ñ yields:

ñ_ℓ = f̂_{k,ℓ}(1 − f̂_{k,ℓ}) I_o(f̂_{k,ℓ}) / 2.

This is the number of diploid individuals with perfectly observed genotypes that provides the same information (and hence accuracy) for estimating f_{k,ℓ} as is available from the sequencing read data from the n_k reference samples from source population k. We term ñ_ℓ, calculated as above, the effective sample size of the read data from the reference samples of source population k at locus ℓ. In practice, to avoid issues of non-differentiability on the boundaries of the parameter space (i.e. at f = 0 or f = 1), we calculate ñ_ℓ using f̃_{k,ℓ}. The effective sample size for a population is then derived by taking the mean of ñ_ℓ across all loci: ñ = (1/L) Σ_{ℓ=1}^{L} ñ_ℓ. In practice, the estimates of information are highly variable for rare alleles; therefore, we recommend this calculation be done for loci with a minor allele frequency > 0.05.
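The effective sample size calculation can be illustrated numerically: compute the observed information as the negative second derivative of the log-likelihood at the MLE (here via finite differences rather than the closed form of Appendix A), then scale by f̂(1 − f̂)/2. This is a self-contained sketch assuming Hardy-Weinberg genotype proportions, not WGSassign's implementation.

```python
import math

def hwe_prior(g, f):
    # P(G = g | allele frequency f) under Hardy-Weinberg proportions
    return math.comb(2, g) * f**g * (1.0 - f) ** (2 - g)

def observed_information(geno_likes, f, eps=1e-5):
    """Observed Fisher information: negative second derivative of the
    log-likelihood of the reference reads, via central finite differences."""
    def loglik(x):
        return sum(
            math.log(g0 * hwe_prior(0, x) + g1 * hwe_prior(1, x) + g2 * hwe_prior(2, x))
            for g0, g1, g2 in geno_likes
        )
    return -(loglik(f + eps) - 2.0 * loglik(f) + loglik(f - eps)) / eps**2

def effective_sample_size(geno_likes, f_hat):
    """Equate the observed information with the expected information from
    2*n_tilde directly observed gene copies, 2*n_tilde / (f(1 - f)),
    and solve for n_tilde."""
    return f_hat * (1.0 - f_hat) * observed_information(geno_likes, f_hat) / 2.0
```

With perfectly certain genotype likelihoods, the effective sample size recovers the actual number of diploid individuals; with completely uninformative likelihoods (all three genotypes equally likely), it falls to zero.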
Fisher information and effective sample size calculated in this way are useful summaries for understanding the trade-offs between sequencing more individuals at lower depth versus fewer individuals at higher depth, at least as it pertains to accurately estimating allele frequencies. In the context of population assignment, the effective sample size, in particular, provides an accessible metric for how good (or bad) the source-population allele frequencies can be expected to be. As we will see later, Fisher information also provides a valuable way to standardize the effective sample size of the reference samples from each population, an important consideration when using WGSassign. A useful statistic for accomplishing this is the individual-specific average effective size for individual i:

ñ^{(i)} = (1/L) Σ_{ℓ=1}^{L} f̂_{k,ℓ}(1 − f̂_{k,ℓ}) I_o^{(i)}(f̂_{k,ℓ}) / 2,

where I_o^{(i)}(f̂_{k,ℓ}) is the contribution to the observed Fisher information of the reads from individual i; ñ^{(i)} ranges between 0 and 1.
We also implement a z-score calculation for determining whether an individual's genotype is unlikely to have come from one of the K source populations but rather from an unsampled population.

| Simulations to illustrate the effective sample size
We used the R programming language to run simulations that illustrate how Fisher information and effective sample size vary across a range of simulated read depths and true allele frequencies. Our simulations assumed a sample size of 100 diploid individuals and a single biallelic locus, with allelic types within individuals being independent of each other.
For each individual, we simulated read depth from a Poisson distribution with mean D_ave, and simulated the allelic type of each read by sampling from the two gene copies within the individual with equal probability and switching the allelic type with probability 0.01 to simulate sequencing errors. Genotype likelihoods from the reads were calculated according to the simulation model. We calculated the maximum likelihood estimate (MLE) for f from the genotype data as the observed proportion of alleles, and for the sequencing read data, we used the EM algorithm to compute the MLE. Using these estimates, we then computed the observed information from the genotypes and from the genotype likelihoods.
To determine the effective sample size, we calculated the expected information for observed genotypes, assuming the true value of f was the MLE from the genotype likelihoods, and then used Equation (12).
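A compact version of this simulation can be written as follows; the Poisson read depths, 0.01 per-read error rate and EM estimator follow the description above, while the function names and fixed seed are illustrative (the original used R, this sketch uses Python).

```python
import math
import random

def simulate_genotype_likelihoods(true_f, n_ind, mean_depth, err=0.01, seed=1):
    """Simulate per-individual reads (Poisson depth) at one biallelic locus
    and return scaled genotype likelihoods under a simple error model."""
    rng = random.Random(seed)

    def poisson(lam):
        # Knuth's multiplication method; adequate for small lambda
        limit, k, p = math.exp(-lam), 0, 1.0
        while True:
            p *= rng.random()
            if p <= limit:
                return k
            k += 1

    likes = []
    for _ in range(n_ind):
        g = sum(rng.random() < true_f for _ in range(2))  # true genotype 0/1/2
        depth = poisson(mean_depth)
        # each read samples one of the two gene copies, then errs w.p. err
        n_alt = sum(
            rng.random() < (g / 2.0) * (1 - err) + (1 - g / 2.0) * err
            for _ in range(depth)
        )
        raw = []
        for cand in (0, 1, 2):  # likelihood of the reads for each genotype
            p_alt = (cand / 2.0) * (1 - err) + (1 - cand / 2.0) * err
            raw.append(p_alt**n_alt * (1 - p_alt) ** (depth - n_alt))
        s = sum(raw)
        likes.append(tuple(x / s for x in raw))
    return likes

def em_allele_frequency(likes, iters=200):
    """EM for the MLE of allele frequency from genotype likelihoods,
    using Hardy-Weinberg genotype proportions as the prior at each step."""
    f = 0.5
    for _ in range(iters):
        total = 0.0
        for g0, g1, g2 in likes:
            w0 = g0 * (1 - f) ** 2
            w1 = g1 * 2 * f * (1 - f)
            w2 = g2 * f**2
            s = w0 + w1 + w2
            total += (w1 + 2 * w2) / s  # expected alt-allele copies
        f = total / (2 * len(likes))
    return f
```

Zero-depth individuals fall out naturally: their three genotype likelihoods are equal, so they contribute no information to the EM update.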

| Genetic simulations
To demonstrate the efficacy of WGSassign in performing population assignment for a range of sample sizes, read depths and levels of genetic differentiation among populations, we simulated a series of genetic data sets using the coalescent simulation program msprime (Kelleher et al., 2016). The first simulation included two populations, each with an effective size of 1000, exchanging migrants. We simulated ancestry for a genomic sequence of 10^8 bases with a recombination rate of 10^−8 and a mutation rate of 10^−7, per site and per generation.
To vary the genetic differentiation between populations, we varied the lineage migration rate parameter between 0.0005 and 0.05 in 10 equal increments. From both populations, we sampled 10, 50, 100 or 500 individuals. Pairwise F_ST was calculated between the two populations using the sampled individuals, and the genetic variants were output in variant call format (VCF).
With the VCF file output from msprime, we used bcftools (Li, 2011; Li et al., 2009) to remove any SNPs with a minor allele frequency (MAF) less than 0.05 and randomly selected 100,000 of the remaining SNPs. Genotype likelihoods were produced with vcfgl (https://github.com/isinaltinkaya/vcfgl) based on mean read depths of 0.1X, 0.5X, 1X, 5X, 10X or 50X. For each of the 240 parameter combinations (10 migration rates, 4 sample sizes and 6 read depths), we simulated 10 replicates, for a total of 2400 simulated data sets. Genotype likelihood output was converted to Beagle file format with custom scripts, and we used these data as input into WGSassign.
To determine the influence of genetic differentiation on assignment accuracy, we calculated the effective sample size and leave-one-out (LOO) assignment accuracy for each population. In WGSassign, LOO is performed by iteratively removing an individual of known origin from its source population, calculating allele frequencies within the source populations using the remaining individuals and then calculating the likelihood that the removed individual originated from each of the different source populations. The LOO method is widely used to avoid the bias that arises when the training data also include the data being tested. The assigned population was determined by maximum likelihood.
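The LOO procedure can be sketched as below, assuming Hardy-Weinberg proportions and a simple EM estimator of allele frequencies; this illustrates the logic, not WGSassign's optimized implementation, and all names are illustrative.

```python
import math

def hwe(g, f):
    # P(G = g | allele frequency f) under Hardy-Weinberg proportions
    return math.comb(2, g) * f**g * (1 - f) ** (2 - g)

def em_freq(likes, iters=100):
    """MLE of allele frequency at one locus from genotype likelihoods."""
    f = 0.5
    for _ in range(iters):
        dose = 0.0
        for gl in likes:
            w = [gl[g] * hwe(g, f) for g in range(3)]
            s = sum(w)
            dose += (w[1] + 2 * w[2]) / s
        f = dose / (2 * len(likes))
    return min(max(f, 1e-6), 1 - 1e-6)  # keep off the boundary

def loglik(ind, freqs):
    """Log-likelihood that one individual's reads came from a population."""
    return sum(
        math.log(sum(gl[g] * hwe(g, f) for g in range(3)))
        for gl, f in zip(ind, freqs)
    )

def loo_accuracy(pops):
    """pops: dict name -> list of individuals, each a list of per-locus
    (g0, g1, g2) genotype likelihoods. Returns LOO assignment accuracy."""
    correct = total = 0
    for name, inds in pops.items():
        for i, ind in enumerate(inds):
            n_loci = len(ind)
            freqs = {}
            for k, members in pops.items():
                # drop the focal individual from its own reference panel
                ref = [m for j, m in enumerate(members) if not (k == name and j == i)]
                freqs[k] = [em_freq([m[l] for m in ref]) for l in range(n_loci)]
            best = max(freqs, key=lambda k: loglik(ind, freqs[k]))
            correct += best == name
            total += 1
    return correct / total
```

Recomputing allele frequencies for every held-out individual is what makes LOO runtime grow quadratically with the number of samples per population, as reported in the runtime results.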
In the second simulation, we conducted a deeper assessment of the behaviour of effective sample size and its influence on assignment accuracy. We implemented two-population island models as in the previous simulation, but included all combinations of 10, 30, 60 and 100 individuals per population and read depths of 0.5X, 0.75X, 1X, 2X, 4X and 6X, with 10 replicates, for a total of 5760 simulations. We set the migration rate at 0.005 for moderate genetic differentiation based on the previous simulation. In each run, we simulated an extra 20 individuals from each of the two populations; these individuals were held out from allele frequency calculations for their respective populations and used to measure standard assignment accuracy. After performing the initial assignment, if one population had a higher effective sample size than the other, individuals were removed to standardize the effective sample sizes, and assignment was performed again. In this simulation, all SNPs with MAF > 0.05 were used.
In the third simulation, we assessed the performance of the WGSassign z-score metric for determining whether an individual of unknown origin that is assigned to a population is actually from an unsampled population. We implemented a three-population stepping-stone model with 20, 60 or 110 individuals per population using msprime. We varied the migration rate parameter between 0.0001 and 0.01 in 20 equal increments. Individuals had simulated mean read depths of 1X or 5X. We used populations 1 and 2 in the stepping-stone model as reference populations and calculated the reference z-scores using WGSassign from all but 10 individuals in these two populations. We assigned 10 individuals from population 3 and 10 from population 2 to the reference populations (1 and 2) using WGSassign. We calculated the z-scores of these individuals' assignments to demonstrate the behaviour of the z-score metric for correctly assigned individuals (i.e. the individuals from population 2 that were assigned to population 2) versus individuals from an unsampled population (i.e. the individuals from population 3 that were assigned to population 2).

In a final set of simulations, we assessed how read depth and absolute sample size combine to produce effective sample size in a single population; the migration rate was set at 0.005, and effective sample size was calculated for all replicate simulations. Sample sizes and read depths were chosen such that 'equal sequencing effort' could be compared, in this case a total sequencing depth of 60X (e.g. from 120 individuals at 0.5X to 10 individuals at 6X).

| Application to empirical data
We used WGSassign on data from yellow warblers to test its accuracy when applied to individuals from a species exhibiting isolation by distance (Bay et al., 2021; Gibbs et al., 2000). Previous work on yellow warblers has found weak differentiation between populations, with pairwise F_ST values on the order of 0.01 or less (Gibbs et al., 2000).
Blood samples from 105 individuals were collected via brachial venipuncture in 2020 and 2021. These served as reference samples from three populations, North, Central and South, previously described in Bay et al. (2021) and Gibbs et al. (2000). We extracted DNA from blood using the manufacturer's protocol for Qiagen DNeasy Blood and Tissue Kits. Whole-genome sequencing libraries were prepared following modifications of Illumina's Nextera Library Preparation protocol (Schweizer & DeSaix, 2023) and sequenced on a HiSeq 4000 at Novogene Corporation Inc., with a target sequencing depth of 2X per individual.
Sequences were trimmed with TrimGalore version 0.6.5 (https://github.com/FelixKrueger/TrimGalore) and mapped to the NCBI yellow warbler reference genome (Sayers et al., 2022) (accession number JANCRA010000000) using the Burrows-Wheeler Aligner software version 0.7.17 (Li & Durbin, 2009). After mapping, the resulting SAM files were sorted, converted to BAM files and indexed using Samtools version 1.9 (Li et al., 2009). We used MarkDuplicates to mark duplicate reads. We implemented principal components analysis (PCA) to ensure that the reference samples from each of our source populations showed geographic signatures of clustering. To assess our ability to accurately assign individuals of unknown origin to breeding populations, we determined the accuracy of assignment of the known breeding-origin individuals using WGSassign's leave-one-out approach.
For the second empirical data set, we applied WGSassign to previously published data from Chinook salmon (Thompson et al., 2020) to assess its utility in situations with low to extremely low read depth and poor-quality DNA. For this scenario, we entertained the task of assigning Chinook salmon to either the Klamath River basin or the Sacramento River basin. These populations are quite distinct, with pairwise F_ST values between the basins on the order of 0.1, so it should be quite easy to distinguish fish from the two basins. However, in the WGS data from Thompson et al. (2020), there were several fish from rivers in the Klamath basin, collected from carcasses, with low read depth. These fish were excluded from most analyses in Thompson et al. (2020) because they did not reliably cluster with other fish from their populations on a PCA; however, we evaluate here whether their basin of origin can be recovered using WGSassign. Additionally, through downsampling of reads from the BAM files, we investigate whether average read depths as low as 0.001X in the sample being assigned can deliver accurate assignments.
We included fish from the closely related Feather River Spring, Feather River Fall, San Joaquin Fall and Coleman Late Fall collections as members of the Sacramento River source population, while fish from the closely related Salmon River Fall and Spring and Trinity River Fall and Spring collections constitute samples from the Klamath River source population. With 64 fish in each source population, we removed the 12 fish from each that had the fewest sequencing reads to serve as our 24 'unknown' fish to be assigned to the populations. The remaining 52 fish in each population served as the reference samples.
The genotype likelihoods for the reference samples were in a VCF file produced by GATK. This was filtered using bcftools (Danecek et al., 2021) to retain only biallelic SNPs with a MAF > 0.05 and missing data in fewer than 30% of the samples. Additionally, data from chromosome 28, which holds a region strongly differentiated between spring-run and fall-run Chinook salmon (Thompson et al., 2020), were excluded. These genotype likelihoods were stored in a Beagle-formatted file using a custom script.
The data for the test samples were extracted from BAM files.
We used samtools stats (Li et al., 2009) to determine the average read depth in each BAM file and used that number with samtools view to downsample each BAM five times, with five separate seeds, to average read-depth levels of 0.001X, 0.005X, 0.01X, 0.05X, 0.1X, 0.5X and 1.0X, whenever those read depths were lower than the full read depth of the file. Genotype likelihoods for the 24 individuals were then called with ANGSD v0.940 (Korneliussen et al., 2014). These analyses also speak to the relative value of additional sequencing depth versus additional samples for estimating allele frequencies that has been noted previously (Buerkle & Gompert, 2013; Fumagalli, 2013; Lou et al., 2021).

| Genetic simulations
In the first simulation, genetic differentiation between the sampled individuals from the two populations ranged from F_ST = −0.003 to 0.13. Across all read depths within each sample-size category (10, 50, 100, 500), assignment accuracy increased with genetic differentiation, and generally high assignment accuracy was achieved even with low genetic differentiation (Figure 2). Accuracy above 90% was reached for all simulations in the 500-sample category with F_ST > 0.004, the 100-sample category with F_ST > 0.006, the 50-sample category with F_ST > 0.015 and the 10-sample category with F_ST > 0.043. Within each sample size category, increasing average read depth, and therefore effective sample size, resulted in higher assignment accuracy, especially when populations had weak genetic differentiation (Figure 2).
Runtime for the simultaneous calculation of Fisher information, effective sample size and allele frequency for populations in WGSassign was fast. With two populations and 100,000 loci analysed in parallel with 20 threads, runtime was less than 10 s for populations with 100 samples or fewer, and between 15 and 30 s for populations with 500 samples. Leave-one-out assignment requires population allele frequencies to be recalculated for each individual in the population, and the time required for that recalculation increases linearly with sample size. Accordingly, runtime for LOO cross-validation is expected to increase quadratically with increasing number of samples per population, and we observe this: for 100 samples in each of the two populations at 1X mean individual read depth, LOO assignment had a mean runtime of 51 s and, for 500 samples, runtime was 1743 s.
The second set of simulations showed that at weak to moderate genetic differentiation (mean F_ST = 0.0055), assignment accuracy was close to 100% when the effective sample sizes of the two populations were equal and included at least eight effective individuals (Figure 3a). However, at higher effective sample sizes, the two populations could have different effective sample sizes and still yield high assignment accuracy (e.g. effective sample sizes of 20 vs. 100). Assignment bias occurred when the differences between the effective sample sizes were sufficient that individuals were incorrectly assigned only from the population with the lower effective sample size (Figure 3b).
Importantly, when effective sample size is roughly equivalent between the two populations but the number of samples and read depth differ, assignment accuracy is still high and unbiased (Figure 3c,d). This pattern was apparent up through the maximum tested magnitude difference of 12 (Figure 3c,d).
At higher genetic differentiation (F_ST > 0.1), samples can readily be identified as coming from an unsampled population using the z-score metric in WGSassign (Figure 4). At such high differentiation, individuals from an unsampled population tend to have z-scores less than −3, compared to individuals correctly assigned to a population, which have z-scores in (−3, 3), as expected of a standard unit normal.
With weaker genetic differentiation (F_ST < 0.1), sample size and read depth have a more noticeable effect on the behaviour of the z-score metric (Figure 4). Generally, higher reference sample sizes and read depths allow individuals from unsampled populations to be distinguished from individuals that are truly from a sampled source population.
The simulations demonstrating the relationship of read depth and absolute sample size to effective sample size in a single population highlighted that prioritizing sample size over sequencing depth results in higher effective sample size. In the provided example of an equal sequencing effort of 60X (i.e. total sequencing depth of a single population), effective sample size increased as more samples at a lower read depth were used, from a low of 7.8 for 10 individuals at 6X, which increased threefold to 24.5 for 120 individuals at 0.5X (Figure 5).

F I G U R E 1 (a) Observed information calculated for simulated data summarized either as fully observed genotypes (purple) or as genotype likelihoods (orange) computed from sequencing read data of different depths simulated from the genotypes. Fully observed genotype data are not affected by read depth, but an independent set of fully observed genotypes was simulated for each different value of read depth, and these are all shown in the figure. (b) Effective sample sizes calculated for simulated genotype likelihood data. In each figure, the facet headers give the true population allele frequency, the x-axis gives the average read depth in the simulations, and the distribution of quantities in the y direction is summarized as boxplots showing the median (dark line), the first and third quartiles (the edges of the boxes), the largest (or smallest) value no further than 1.5 × the interquartile range from the third (first) quartile (the whiskers), and outliers beyond the whiskers (individual points). All simulations had 100 individuals.

Chinook salmon were accurately assigned to either the Sacramento or Klamath river basins even at read depths as low as 0.001X (Figure 6). All 12 test samples from the Sacramento river were correctly assigned at all read depth levels, and, of the 12 Klamath test fish, 11 were correctly assigned at all read depth levels, while one was correctly assigned at all read depth levels except for one of the five replicates at read depth 0.001X. The four samples with the lowest full read depth (the four at the bottom of Figure 6) have log-likelihood ratios that are noticeably smaller than those of the remaining 20 fish even when downsampled to similar read depth levels, suggesting that these samples suffer from factors other than low read depth, such as poor-quality DNA or contamination. The number of informative sites per individual varied from 11,866 to 906,505 at full read depth, and from 370 to 3257 at 0.001X, while the total number of sites varied from 955,185 at full depth to 48,220 at 0.001X (Table 1). Evidently, at low read depths, each individual assignment relies on a set of informative SNPs that overlaps little with the informative SNPs in other individuals.

| DISCUSSION
Here, we present WGSassign and demonstrate its utility for population assignment with low-coverage WGS data. Our results, from both simulated and empirical data, show that low-coverage WGS data can be used to achieve high assignment accuracy even among weakly differentiated populations (F_ST < 0.01). We show that balancing effective sample size among populations is essential for avoiding assignment bias due to variation in the precision of allele frequency estimation for different populations. Effective sample size can also be used to guide decisions in study design when choosing the number of samples and sequencing depth for a given population. The ability to perform population assignment on large numbers of individuals, cost-effectively sequenced at low coverage across the whole genome, further expands the utility of low-coverage WGS for population and conservation genomics.

F I G U R E 3 Mean assignment accuracy (a) and mean difference in assignment accuracy (b; assignment accuracy of population 2 − population 1) were compared for populations with an array of effective sample sizes, listed on the axes as ranges. Equal effective sample sizes are along the plots' diagonals. Assignment accuracy was high when effective sample sizes were sufficiently high, even when unbalanced. When effective sample sizes were approximately equal, assignment accuracy was high regardless of the combination of read depths and number of samples (c, d). Approximately equal effective sample sizes are found along the diagonal, where the axes display a range of the magnitude of difference for depth (y-axis) and sample size (x-axis) for population 2 in relation to population 1 (e.g. 2/3 − 1.5x indicates the number of individuals in the sample from population 2 is between two-thirds and three-halves of the sample size from population 1). The centre tile of the plot, 2/3 − 1.5x, indicates when effective sample size is equal due to approximately similar sample numbers and read depths.

Such gains are especially relevant for population assignment studies that are conducted at a large scale. For example, in the mid-2000s, an arduous, international, multi-laboratory study was undertaken to standardize a DNA database of 13 microsatellite loci for genetic stock identification of Chinook salmon at a coast-wide scale (Seeb et al., 2007). With today's sequencing power, a low-coverage WGS approach could provide a cost-effective method for creating a reference baseline of known populations without the need for extensive standardization of genetic markers. Fish of unknown origin could be sequenced at very low read depth and still be accurately assigned to populations from the reference baseline. Furthermore, using WGS data streamlines the process of adding new reference populations to compare to previous analyses because the loci used for assignment are not pre-selected to maximize genetic differentiation and thereby potentially subject to ascertainment bias.

| Performance of WGSassign and implications for population assignment studies
We note that WGSassign can be used in conjunction with other clustering approaches for low-coverage WGS data (e.g. PCAngsd; Meisner & Albrechtsen, 2018). Notably, the formal population assignment implemented in WGSassign requires a priori delineation of populations. In species that live in discrete population groups, this can be done without genetic data. However, when species are distributed more continuously, unsupervised clustering approaches in tandem with geography and other covariates (e.g. behaviour, morphology) can be used to delineate reference populations. Assignment accuracy from WGSassign on a set of hold-out individuals can then be used to determine whether the identified populations are informative for assignment. Complementary clustering methods are also useful for identifying whether test samples are from populations not represented in the reference samples, as well as for identifying admixed individuals. Importantly, clustering methods for population structure can be biased by variation in sequencing depth among individuals (Lou et al., 2021), whereas WGSassign is less influenced by such variation. Accordingly, WGSassign is expected to give more reliable assignment in the face of sequencing depth variation than unsupervised clustering approaches.

| Accounting for population sample size and read depth with effective sample size
Our development of the effective sample size metric provides a powerful tool for population genomics studies using low-coverage WGS data and for informing study design. Previous studies have provided recommendations for the number of individuals and sequencing depth required to accurately estimate allele frequencies with low-coverage WGS data (Buerkle & Gompert, 2013; Fumagalli, 2013; Lou et al., 2021). Effective sample size provides a metric to quantify these recommendations and to determine the precision of allele frequency estimation needed for different applications. For example, the recommendation of Lou et al. (2021) of at least 10 individuals sequenced at 1X average depth for allele frequency estimation corresponds to an effective sample size of 2.4 individuals in the simulations from this study (Figure 5), and does provide sufficient precision to achieve accurate assignment at moderate genetic differentiation (Figure 3). However, at weaker genetic differentiation among populations, effective sample size needs to be increased for accurate assignment. Quantifying the information gained under different study designs can inform researchers on how to allocate sequencing resources more efficiently.

F I G U R E 5
The relation between read depth and number of samples in determining the effective sample size for a single population highlights the potential for different sampling design strategies. Notably, effective sample size increases more rapidly with changes in the number of samples than in read depth. The x-axis gives the number of samples from a single population, the y-axis gives the mean read depth for the corresponding population, and the value listed is the mean effective sample size across 10 replicate simulations using all SNPs with minor allele frequency > 0.05 (126,871 SNPs). Tiles outlined in red have equal sequencing effort based around 60 individuals at 1X. Given the same amount of sequencing effort, effective sample size increases when the number of samples is prioritized over sequencing depth, with a low of 7.8 for 10 individuals at 6X and a high of 24.5 for 120 individuals at 0.5X. Off-diagonal values allow the comparison of sampling design strategies of different sequencing effort, for example, sequencing 20 individuals at 1X (effective sample size = 6.1) versus sequencing 10 individuals at 2X (effective sample size = 4.6).

| Further improvements for population assignment
Currently, in our implementation of WGSassign, the issue of only a single allele being observed in a population, thereby producing a likelihood of 0, is avoided by correcting a population's minor allele frequency of 0 at a given locus to $\frac{1}{2n+2}$, where $n$ is the number of individuals in the population. Essentially, this treats the locus as having a rare allele that would be observed in a single copy if one more individual were sampled. Another approach specifies a formal prior for the allele frequencies in each population (Rannala & Mountain, 1997). We note that the latter approach yields performance that is very similar to ours; however, implementing a prior for allele frequencies that accounts for the a priori expectation

F I G U R E 6 Log-likelihood ratios for assignment at different read depth levels for the Chinook salmon data. On the y-axis are the different Chinook salmon samples, labelled by their population, a colon, their ID number and then, in parentheses, the average read depth of their aligned data at full depth. On the x-axis is the log-likelihood ratio in favour of assignment to their own (correct) population on a 'pseudo-log' scale that accommodates negative values. Positive numbers indicate correct assignment. Colours denote the read depths after downsampling. There are five points for each individual at each value of downsampling, reflecting the five different seeds used for downsampling.

We note that we can rewrite $L_i'(\pi) = vw$ and take the derivative of that easily using the product rule: $(vw)' = v'w + w'v$. To do so, we first find the derivatives $v' = u''$ and $w' = -u'u^{-2}$. Then, we put them together with the product rule: $L_i''(\pi) = u''u^{-1} - (u')^2u^{-2}$. Restoring the $k,\ell$ subscript to $\pi$, and the $(i)$ superscript and subscript to $g$, negating, taking the sum over the $n_k$ individuals and evaluating at the MLE yields the observed information in Equation (10).
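The frequency correction just described can be sketched as follows (a minimal sketch; the symmetric clamp of frequencies equal to 1 is our assumption about how the same problem is handled at the other boundary):

```python
def correct_freqs(freqs, n):
    # Avoid likelihoods of exactly zero: a frequency of 0 (or 1) is moved
    # to 1/(2n + 2) (or 1 - 1/(2n + 2)), as if the rare allele would appear
    # in a single copy were one more individual sampled; n is the number of
    # individuals in the population.
    eps = 1.0 / (2 * n + 2)
    return [min(max(f, eps), 1.0 - eps) for f in freqs]
```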

Expected Fisher information from observed genotypes
Under Hardy-Weinberg equilibrium, the allelic types of the two gene copies within a locus are independent of one another, and thus a sample of $n$ diploids with fully observed genotypes is equivalent to a sample of $2n$ gene copies, each an independent Bernoulli trial with success probability $\pi$. Finding the expected Fisher information in such a case is a standard exercise, but we repeat it here for completeness.

In previous applications, with far fewer markers, determining such a distribution of the log probability of the observed data has been done through simulation, for example in the 'exclusion method' of Cornuet et al. (1999); however, with genomic-scale data, it would be impractical to simulate thousands of new multilocus genotypes, each with potentially millions of loci, to assess whether each individual (with its own specific read depth values) might be from a population not included among the source populations. Instead of simulation, we develop the expected distribution of log probabilities using a central limit theorem (CLT) approximation. Note that, since $P(R \mid \pi_k)$ is a product over many loci, $\log P(R \mid \pi_k)$ is a sum over loci. We write the contribution of locus $\ell$ to that sum as $W_\ell = f(g_\ell, \pi_{k,\ell})$, where the notation $f(g_\ell, \pi_{k,\ell})$ emphasizes that $W_\ell$ is a deterministic function of $\pi_{k,\ell}$ and the vector of genotype likelihoods $g_\ell = (g_{\ell,0}, g_{\ell,1}, g_{\ell,2})$. It is important to recognize in this context that $\pi_{k,\ell}$ is considered fixed while $g_\ell$ is a random variable; by extension, $W_\ell$ is a random variable as well. By the CLT, the sum of very many independent $W_\ell$ random variables can be approximated by a normal distribution with mean $\mu = \sum_\ell E[W_\ell]$ and variance $\sigma^2 = \sum_\ell \mathrm{Var}(W_\ell)$.

For reference, the per-individual likelihood $u$ and its derivative $u'$ are:
$u = g_0(1-\pi)^2 + g_1\,2\pi(1-\pi) + g_2\pi^2 = g_0(1 - 2\pi + \pi^2) + g_1(2\pi - 2\pi^2) + g_2\pi^2$,
$u' = g_0(2\pi - 2) + g_1(2 - 4\pi) + g_2\,2\pi = 2\pi(g_0 + g_2 - 2g_1) + 2(g_1 - g_0)$.
Thus, we correct the z-score so that it exhibits a mean of 0 and a variance of 1 for the reference samples, themselves, from population $k$. With $i = 1, \ldots, n_k$ denoting the reference samples from population $k$, we calculate their mean $m_k = \frac{1}{n_k}\sum_{i=1}^{n_k} z_i$ and standard deviation $s_k = \sqrt{\frac{1}{n_k - 1}\sum_{i=1}^{n_k}(z_i - m_k)^2}$. Then, we assess whether an unknown individual $A$ assigned to population $k$ may have come from an unsampled population using the standardized score $\tilde{z}_A = (z_A - m_k)/s_k$. As in the likelihoods calculated by WGSassign, values of $\tilde{\pi}_{k,\ell}$ are used in place of values of $\pi_{k,\ell}$ in all of the above calculations.
Furthermore, when calculating the z-scores for each individual from the reference samples, the value of $\pi_{k,\ell}$ used must be one estimated while leaving that individual out of the sample (analogous to the LOO procedure described in the paper).
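The standardization described above can be sketched as follows (a minimal sketch; the use of the sample standard deviation with an $n-1$ denominator is our assumption):

```python
import numpy as np

def standardized_z(z_ref, z_new):
    # z_ref: raw z-scores of the n_k reference individuals from population k,
    # each computed with leave-one-out allele frequencies.
    # z_new: raw z-score of the individual being assessed for population k.
    m = np.mean(z_ref)
    s = np.std(z_ref, ddof=1)  # sample standard deviation
    return (z_new - m) / s
```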
Let a diploid individual's genotype at locus $\ell$ ($1 \le \ell \le L$) be represented by $G_\ell \in \{0, 1, 2\}$, which counts the number of copies of the alternate allele carried by the individual at locus $\ell$. Denote by $\pi_{k,\ell}$ the true, but typically unknown, frequency of the alternate allele at locus $\ell$ within source population $k$. Under the assumption of Hardy-Weinberg equilibrium, the probability of $G_\ell$ when the individual is from population $k$ is $P(G_\ell = g \mid \pi_{k,\ell}) = \binom{2}{g}\pi_{k,\ell}^{\,g}(1-\pi_{k,\ell})^{2-g}$. The full derivation of the method is shown in Appendix B. In short, we determine the expected distribution of log probabilities of an individual's genotype likelihoods arising from a population (given the individual's allele counts across loci and the population's allele frequencies), using a central limit theorem approximation. The z-score is then calculated by subtracting the mean expected likelihood from the observed likelihood and dividing the difference by the standard deviation of the expected likelihoods. Given that the actual distribution of the z-score is likely to deviate from a standard normal distribution, we further standardize the observed z-score by the z-scores of the reference individuals from the source populations. Individuals truly from an assigned population are expected to have z-scores within several (e.g. three) standard deviations of the normal distribution, while individuals from an unsampled but differentiated population are expected to have z-scores that fall below the expected range of a standard unit normal random variate. The specific standard deviation cut-off for the z-score must be determined from the specific empirical data.
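A minimal sketch of the Hardy-Weinberg genotype prior and the resulting log probability of an individual's read data under a candidate population (illustrative only; the function names are ours, not the WGSassign API):

```python
import numpy as np

def genotype_priors(f):
    # Hardy-Weinberg probabilities of carrying 0, 1 or 2 copies
    # of the alternate allele when its population frequency is f
    return np.array([(1 - f) ** 2, 2 * f * (1 - f), f ** 2])

def log_read_prob(gl, freqs):
    # gl: (L, 3) genotype likelihoods for one individual across L loci
    # freqs: (L,) allele frequencies of the candidate population
    # log P(R | pop) = sum_l log sum_g gl[l, g] * P(G = g | freqs[l])
    per_locus = np.array([gl[l] @ genotype_priors(freqs[l])
                          for l in range(len(freqs))])
    return float(np.log(per_locus).sum())
```

Assignment then amounts to evaluating this quantity for each candidate population and taking the maximum.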

Finally, to illustrate the relation of effective sample size to read depth and absolute sample size for the purpose of study design, we simulated from a two-population island-model coalescent to produce 10 replicates of all combinations of sample sizes in {10, 12, 15, 20, 30, 60, 80, 120} and read depths in {0.5X, 0.75X, 1X, 2X, 3X, 4X, 5X, 6X}, for a total of 640 simulations. The two populations had the same number of samples and read depths.

We used GATK version 4.1.4.0 (McKenna et al., 2010) to mark read duplicates and clipped overlapping reads with the clipOverlap function from bamUtil (https://genome.sph.umich.edu/wiki/BamUtil:_clipOverlap). To reduce sequencing depth variation, we used the DownsampleSam function from GATK to downsample reads from BAM files with greater than 2X coverage to 2X coverage. To identify genetic markers from low-coverage WGS data, we used stringent filtering options in ANGSD version 0.9.40 (Korneliussen et al., 2014) of mapping quality > 30 and base quality > 33. We retained SNPs with read data in at least 50% of individuals and an MAF > 0.05. The genetic data are stored at https://doi.org/10.5061/dryad.h9w0vt4pj (DeSaix et al., 2023).

Genotype likelihoods for the test samples were estimated using the -sites option to call only the sites found in the Beagle-formatted file of the reference samples. After genotype likelihood estimation in the test samples, the Beagle file of reference samples was filtered to include only the sites output by ANGSD. The total number of sites in each data set was recorded, as was the number of informative sites (those with unequal likelihoods for the three different genotypes) within each individual. The resulting Beagle files were then passed to WGSassign to compute the likelihood of population origin for each of the test fish, and the results were plotted using R.

| Effective sample size simulations

Fisher information and effective sample size are shown for three representative values of $\pi$ (0.05, 0.3 and 0.5) in Figure 1. As expected, observed Fisher information for allele frequency from sequencing read data increases as the average sequencing depth increases, reaching a limit at the observed information from fully observed genotypes. The absolute value of the observed Fisher information varies widely over the different allele frequencies; however, the relative values of information from genotypes and from sequencing reads vary less, and the effective sample size is largely consistent across the range of minor allele frequencies from 0.05 to 0.5, showing the effective sample size to be a useful metric. The flattening of the curves for observed information from sequencing data as the average read depth increases indicates the diminishing returns of additional sequencing depth. In other words, if a researcher had the option to sequence 10 individuals at 6X from a population or 120 individuals at 0.5X, the latter strategy would provide over three times as much information regarding allele frequency estimation despite the same sequencing effort.
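The effective sample size summarized here can be sketched at a single locus as follows (a sketch under the assumption that effective sample size is the ratio of observed information from the read data to the expected information from one fully observed diploid genotype, $2/(f(1-f))$; illustrative only, not the package's code):

```python
import numpy as np

def observed_info(gl, f):
    # gl: (n, 3) genotype likelihoods at one locus; f: the allele frequency MLE.
    # Per-individual likelihood u and its first two derivatives in f:
    u = gl[:, 0] * (1 - f) ** 2 + gl[:, 1] * 2 * f * (1 - f) + gl[:, 2] * f ** 2
    du = gl[:, 0] * (2 * f - 2) + gl[:, 1] * (2 - 4 * f) + gl[:, 2] * 2 * f
    d2u = 2 * gl[:, 0] - 4 * gl[:, 1] + 2 * gl[:, 2]  # constant in f
    # Negative second derivative of the log-likelihood, summed over individuals
    return float(np.sum((du / u) ** 2 - d2u / u))

def effective_sample_size(gl, f):
    # Expected information from one fully observed diploid genotype: 2/(f(1-f))
    return observed_info(gl, f) * f * (1 - f) / 2.0
```

With perfectly informative likelihoods this recovers the actual number of individuals, and with completely flat likelihoods it is zero, matching the intuition that uncertain genotypes carry less information about the allele frequency.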

3.3 | Application to empirical data

Variant calling with the Yellow warbler samples identified 5,301,627 SNPs. Using all SNPs, Yellow warbler reference samples were accurately assigned to either the North, Central or East populations using leave-one-out self-assignment. All 35 reference samples from both the North and East populations were assigned with 100% accuracy, and of the 35 birds from the Central population, 34 were correctly assigned.
Our implementation of WGSassign allows users to perform population assignment analyses from genotype likelihood data. Features of WGSassign include standard and leave-one-out (LOO) population assignment, as well as calculations of effective sample sizes (of both individuals and populations) and a z-score metric for determining whether an individual is from an unsampled population. Importantly, as implemented, these analyses can be parallelized across loci, which allows fast computation on data produced from low-coverage WGS, even for computationally intensive applications such as LOO assignment. Studies of wild populations are typically limited in the number of samples available for sequencing, where 50 may be a large number of samples for a given population. With such a sample size, leave-one-out assignment at a standard low-coverage read depth of 1X could be expected to have a runtime on the order of minutes for multiple populations and a million loci.

F I G U R E 2 Each point represents a single simulation run of the two-population island model when effective sample sizes were greater than 0.1 individuals. Panels are ordered by the number of individuals (10, 50, 100, 500) sampled from each of the two populations. The proportion of correctly assigned individuals, via LOO cross-validation, for one population is given on the y-axis, and genetic differentiation (F_ST) between the two populations is on the x-axis. The points are coloured by effective sample size (log10 scale) of the population. Assignment accuracy in simulation runs with similar genetic differentiation increases with greater effective sample sizes (lighter colours).
Implicit in standard population assignment tests is that there will always be a population with a maximum likelihood of assignment, even if the individual does not originate from any of the reference populations. To address this issue, we developed a z-score metric for testing whether an individual could be from an unsampled population. The z-score is based on the individual's observed likelihood of assignment in relation to the expected likelihood of a hypothetical individual from the same population with the same allele count data as the individual being tested. The z-score metric functions as expected at higher genetic differentiation (F_ST > 0.05) and with larger reference samples, distinguishing the majority of incorrectly assigned individuals as having much lower z-scores (outside the 90% expected mass of the distribution of z-scores) than correctly assigned individuals. We recommend that any study that may have incomplete sampling coverage of all genetically distinct populations test for correct assignment with the z-score metric. However, since this metric is limited by sample size and genetic differentiation, a robust approach towards using it would involve first observing the metric's behaviour by testing it on individuals of known origin, calculating z-scores both for the population they are from and for the other populations. For high assignment accuracy, source populations need sufficient effective sample sizes in relation to the genetic differentiation among the populations. For example, in our simulations at low to moderate levels of genetic differentiation (mean F_ST = 0.0055), an effective sample size of roughly eight individuals was sufficient when effective sample sizes were balanced (Figure 3). If the reference populations' effective sample sizes are sufficiently high for the given genetic differentiation, individual samples can have extremely low read depth and still be accurately assigned. Our results from downsampled Chinook salmon data showed that individuals were still correctly assigned to populations (F_ST = 0.1) when individual samples had average read depths as low as 0.001X. While the minimum sequencing coverage needed for highly accurate population assignment depends on genetic differentiation, this has powerful implications for population assignment studies, especially those conducted at a large scale.

F I G U R E 4 Results from the three-population stepping-stone model demonstrate the behaviour of the z-score metric in identifying individuals from an unsampled population (Pop3) assigned to a population in the reference, compared to individuals correctly assigned to their source population of origin (Pop2). The column facets list the number of samples used for the reference populations, while the rows give the population of origin and sequencing depth. Symmetric lines subtending 90%, 99% and 99.9% of the mass of a standard unit normal random variate are given by vertical lines (dotted, dashed and solid, respectively). In this simulation, Pop3 individuals are expected to be incorrectly assigned to Pop2 (since there are no Pop3 individuals in the reference set), and accordingly the z-score metric should reflect this by falling outside the mass of the standard unit normal random variate.
TA B L E 1 Numbers of informative SNPs (i.e. those covered by at least one read, such that the genotype likelihood is not equal for all three genotypes) at different downsampled coverage levels of the Chinook salmon data.
Since $(\log u)' = u'u^{-1}$, we have that $L_i'(\pi) = u'u^{-1}$. Proceeding, define $v$ and $w$ as follows: $v = u'$ and $w = u^{-1}$.

For a single such variable $Y_i$, we have $P(Y_i = y \mid \pi) = \pi^y(1-\pi)^{1-y}$, so the log-likelihood for that single observation is $L_i(\pi) = y\log\pi + (1-y)\log(1-\pi)$. It follows that $L_i'(\pi) = y/\pi - (1-y)/(1-\pi)$ and $L_i''(\pi) = -y/\pi^2 - (1-y)/(1-\pi)^2$. The expected Fisher information in a single gene copy is the expectation of the negative second derivative given the true value of $\pi$: $E[-L_i''(\pi)] = \pi/\pi^2 + (1-\pi)/(1-\pi)^2 = 1/(\pi(1-\pi))$. Since information from independent variables is additive, the information for $2n$ such Bernoulli variables is $2n/(\pi(1-\pi))$; evaluating this under the assumption that the true value of $\pi$ is $\hat{\pi}_{k,\ell}$ gives $I^e_{k,\ell}$ in Equation (11).

APPENDIX B: Z-SCORE CALCULATION

In order to assess whether an individual $A$'s genotype could not plausibly have come from one of the $K$ source populations, even though it was assigned to population $k$, we wish to compare $A$'s log read probability given that it originated from population $k$, $\log P(R^{(A)} \mid \pi_k)$, to the distribution of log read probability values expected from individuals that actually are from population $k$. Complicating matters, these log read probabilities are heavily influenced by the read depth and, to a lesser extent, by the relationship between allele depths (how many reads of each allele were seen) and the genotype likelihoods. So, in fact, we must compare $\log P(R^{(A)} \mid \pi_k)$ to the distribution of $\log P(R \mid \pi_k)$ expected from an individual that originates from source $k$, but also has read depths at each locus exactly the same as individual $A$, and genotype likelihoods that exhibit the same relationship to allele depths as those in individual $A$ (this relationship will be influenced by such factors as the base quality scores and the genotype likelihood model used).
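The standard Bernoulli result above can be checked numerically with a short sketch (illustrative only):

```python
def bernoulli_info(pi):
    # Expected Fisher information in a single gene copy: 1 / (pi * (1 - pi))
    return 1.0 / (pi * (1.0 - pi))

def neg_second_derivative(y, pi):
    # -d^2/dpi^2 of the Bernoulli log-likelihood y*log(pi) + (1-y)*log(1-pi)
    return y / pi ** 2 + (1 - y) / (1 - pi) ** 2

def expected_neg_second_derivative(pi):
    # Expectation over Y ~ Bernoulli(pi) of the negative second derivative
    return pi * neg_second_derivative(1, pi) + (1 - pi) * neg_second_derivative(0, pi)
```

Across a range of allele frequencies, the expectation of the negative second derivative matches $1/(\pi(1-\pi))$, as derived above.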