Sequence coverage required for accurate genotyping by sequencing in polyploid species

Polyploidy plays an important role in the evolution of eukaryotes, especially for flowering plants. Many of ecologically or agronomically important plant or crop species are polyploids, including sycamore maple (tetraploid), the world second and third largest food crops wheat (hexaploid) and potato (tetraploid) as well as economically important aquaculture animals such as Atlantic salmon and trout. The next generation sequencing data enables to allocate genotype at a sequence variant site, known as genotyping by sequencing (GBS). GBS has stimulated enormous interests in population based genomics studies in almost all diploid and many polyploid organisms. DNA sequence polymorphisms are codominant and thus fully informative about the underlying genotype at the polymorphic site, making GBS a straightforward task in diploids. However, sequence data may usually be uninformative in polyploid species, making GBS a far more challenging task in polyploids. This paper presents novel and rigorous statistical methods for predicting the number of sequence reads needed to ensure accurate GBS at a polymorphic site bared by the reads in polyploids and shows that a dozen of reads can ensure a probability of 95% to recover all constituent alleles of any tetraploid genotype but several hundreds of reads are needed to accurately uncover the genotype with probability confidence of 90%, subverting the proposition of GBS using low coverage sequence data in the literature. The theoretical prediction was tested by use of RAD‐seq data from tetraploid potato cultivars. The paper provides polyploid experimentalists with theoretical guides and methods for designing and conducting their sequence‐based studies.


| INTRODUC TI ON
Advancement in next-generation sequencing technology (NGS) allows identification of DNA sequence variants at the genome-wide scale and at a very competitive cost in comparison to traditional DNA molecular marker development techniques such as restriction fragment length polymorphisms (RFLPs, Botstein et al., 1980), amplified fragment length polymorphisms (AFLPs, Vos et al., 1995) and Microsatellites (Jarne & Lagoda, 1996). The new sequencing technique has been fast evolved for the discovery, sequencing and genotyping of sequence variant markers at a genome-wide scale, at a population level and at a much competitive cost (Davey et al., 2011). The digital nature of sequence based marker data enables genotyping by sequencing (GBS) at the marker loci, which has motivated enormous interests in population and quantitative genetic analyses in diploid species (Nielsen et al., 2011;Narum et al., 2013;van de Gejin et al., 2015).
There may be three possible genotypes, that is, two homozygotes and one heterozygote, at any polymorphic locus in diploids.
Diagnosing an individual genotype at any diploid sequence variant site may be done directly from surveying whether one or two alleles are present at the variant site. Thus, GBS in diploids is a straightforward task. However, it is much more challenging to carry out GBS in polyploid species because a much larger number of genotypes would be possible for a polyploid individual at a polymorphic locus with a given number of polymorphic alleles. Taking tetraploids as an example, there would be three possible heterozygote genotypes at a bi-or triallelic locus. Thus, sequence data from polyploids may be usually uninformative to recover the genotype at sequence variant sites with the certainty like that in diploids as explained above.
Two fundamental questions are raised on designing GBS experiments and the following analysis in polyploids in order for accurate genotype prediction from the sequence data. First, for a polyploid individual, what number of sequence reads is needed to ensure the presence of the marker alleles at a sequence variant site with a given probability confidence? Second, what does a sequence coverage need to be designed for accurately predicting a polyploid genotype with a given probability confidence? Adequately answering these questions is obviously critical for any genetic or genomic studies involving GBS in polyploid species. This study reports statistically rigorous methods to answers these two questions in tetraploid species as an example and the analyses presented here may be extended to any other polyploids without substantial technical obstacles. We explored the statistical prediction using the sequence data from restriction-site associated DNA sequencing (RAD-seq, Baird et al., 2008) experiment with tetraploid cultivars of potato (Solanum tuberosum L.) and discussed limitations and challenges of GBS in tetraploids in light of the theoretical prediction and complexities of sequence data in quantitative genetic analyses.

| Design parameters of GBS experiments in tetraploids
Although analysis presented below is generic for polyploids at any ploidy level, we focus here on tetraploids as an example. To design a GBS experiment in tetraploid species, one needs to address two key questions. First, what sequence coverage at a polymorphic site is needed to ensure all k constituent alleles of a tetraploid genotype at the site will be present in the sequence data for a given probability confidence P? Second, what sequence coverage is needed to ensure a tetraploid genotype to be accurately predicted with a given probability confidence interval P ± .
To answer these two questions, we assume that generation of sequence data from a tetraploid genotype can be assimilated as a sampling process of multinomial distribution. We consider a tetraploid individual genotype at a polymorphic site as Data of the sequence reads covering the polymorphic site from a tetraploid individual is denoted by n = (n A n C n G n T ) with n x being the number of sequence reads carrying nucleotide x = A, C, G or T at the polymorphic site. We assume that n = (n A n C n G n T ) follows a multinomial distribution as To formulate the questions aforementioned, we consider m, the number of segregating alleles, possibly being equal to 2, 3 and 4, respectively. It needs to be noted that n is the number of sequence reads of a tetraploid individual.
For question 1, the smallest number of n sequence reads, which is required to all k constituent alleles of a tetraploid genotype at the site to be present in the sequence data for a given confidence P, will satisfy Equation (2) is a generic form and needs to be specified for a given m, the number of alleles at the polymorphic site.
When m = 4, For question 2, the smallest number of n sequence reads satisfies where p = n X ∕n as formulated in Equation (1), P and δ are prior given.
From Equation (7), one can have √ n∕P(1 − P) = Φ −1 (1 − q∕2) with q = Pr p ∈ P ± and thus, When m = 3 and without loss of generality, let and p 1 ,p 2 ,p 3 = (n 1 , n 2 , n 3 )∕n. For i = 1, 2, let Since n 3 = n − n 1 − n 2 , min n P − ≤p ≤ P + , Equation (9) may be numerically solved for n as we programmed in R scripts (Appendix A). Alternatively, one can implement a Monte-Carlo approach to calculate n. Specifically, to generate a large number, say 10 6 , of random samples of x 1 , Note that the right hand side of (11) depends on the sample size n, which can be estimated by solving the equation When m = 4, the formulation is similar to the case of m = 3.
Specifically, it is to solve n from where P i = ¼, p i , ′ i and r i are similarly defined as the above.
Equation (12) may be numerically solved for n as programmed in R scripts or through a Monte Carlo approach described above.

| Collection and analysis of RAD-seq data from tetraploid potato cultivars
We conducted a breeding programme to generate an outbred segregation population from crossing two autotetraploid potato clones, 12-1-20 and 12-5-12, which are offspring from two autotetraploid potato cultivars, Atlantic and Longshu-3 (a Chinese variety cultivated in the northwest provinces of China). The outbred offspring population is composed of 304 individuals. The young leaf samples were collected from these offspring and their parental plants.
Genomic DNA samples were extracted from the leaf samples using the Qiagen kit. The DNA samples were checked for quality and quantity by use of NanoDrop 2000c, then justified to a concentration of 100 ng/μl and stored at -80°C for further use. It is noted that each of the two parents were prepared in two biological replicate samples but no replication was made for the offspring individuals.

| Construction of RAD-seq libraries
The DNA samples made above were used to carry out restriction site associated DNA sequencing (RAD-seq) experiments. Protocol for constructing the RAD-seq library was modified from our previous work (Jiang et al., 2016), which confers the optimal target of minimizing the presence of undesirable DNA segments such as mitochondria DNA and DNA representing the RNA genes in the library and the optimal target of an even coverage of the DNA segments over the target genome. Each genomic DNA sample of 1 μg was firstly digested by two restriction enzymes (EcoRI-HF and MspI, plantbiology.msu.edu/pgsc_download.shtml) and only those reads, which were uniquely mapped with the mapping quality score ≥20, were to be included in further analyses. After the above steps of quality filtering, the sequence reads were subjected to screening for polymorphic sites embedded. A polymorphic site was declaimed if there were two or more different alleles (i.e., nucleotides) observed at the sequence site, that is, a SNP (single nucleotide polymorphism). When there were two or more polymorphic sites detected within a sequence read, the one with the highest Q value was chosen as the polymorphic site for the read and the others were ignored. For a demonstration purpose, only the sequence data from the two parental lines was used in the following analyses. The computational scripts developed here can be straightforwardly implemented to analyse any other data set. Table 1 lists the number of sequence reads needed to ensure the presence of the number of polymorphic alleles of a tetraploid genotype (m) in the sequence data for a given probability confidence (P) as well as the number of sequence reads required to ensure a tetraploid genotype (g i ) to be accurately predicted for a given probability confidence (P ± ) with = 5 %. It can be seen from the table that about one dozen of reads would be sufficient to ensure the presence of all constituent alleles of a tetraploid genotype in the sequence data with a probability of >95%. Obviously, a larger number of sequence reads are required to detect a larger number of constituent alleles of a tetraploid genotype. For a given m (the number of constituent alleles of a tetraploid genotype), the evenness of the genotype alleles (e.g., AAGG) requires fewer sequence reads to detect their presence than the unevenness of the genotype alleles (e.g., AAAG or AGGG).

| Expected sequencing coverages required for detecting and genotyping at a SNP marker in tetraploids
Although a moderate number of sequence reads are needed to detect the presence of alleles of a tetraploid genotype at a high statistical confidence (p > 95%), the number of reads needed to accurately predict the tetraploid genotype will be few hundreds, depending on m, the number of genotype constituent alleles. The larger is m, the larger number of sequence reads is required to predict the genotype accurately. Again, the evenness of the genotype alleles requires a small number of sequence reads to predict the genotype than the unevenness of the alleles. It needs to be stressed that one may plug in any value of P and δ in the program for numerical analyses.

| Observed sequencing coverages required for detecting and genotyping at a SNP marker in tetraploid potato data
In the present study, we focus on the polymorphic marker data identified from the sequence data of the parental lines, each of which had two replicates of expected 4 M or a total 8 M sequence reads.
Following the criteria described above for screening SNPs, we detected a total of 1,686,373 SNPs from the sequence data. Genome distribution of the SNPs is illustrated in Figure 1 After filtering out those SNPs from the reads with coverage of less than 10, we obtained a total of 11,151 and 11,908 SNPs in the two parental plants, 12-1-20 and 12-5-12. Of the SNP markers, 11,116 (11,751), 33 (35) and two (122) have two, three and four segregating alleles respectively from plant 12-1-20 (or 12-5-12). This indicates that it is feasible to detect the presence of component alleles of a tetraploid genotype from the sequence data according to the theoretical prediction (Table 1). Because up to 95% of sequence variants show two alleles, we focused on the marker loci with two segregating alleles. There are three possible tetraploid genotypes, g 1 , g 2 , and g 3 , at a two-allele SNP locus. For given n 1 and n 2 , observed numbers of alleles at the locus, we calculated and with Pr {n 1 , n 2 �g k } = respectively, where g k = (k 1 , k 2 ) = (1, 3), (2, 2) or (3, 1) corresponding to three possible tetraploid genotypes at a two-allele locus, and n = n 1 + n 2 . Based on Equations (12) and (13), we calculated . The calculation can be similarly extended to the cases of three and four alleles.
It is clear that the nearer δ is to zero, the more certainly is the tetraploid genotype recovered.  Table 2. It is clear that the proportion increases as the coverage of the sequence reads increases as predicted in Table 2. However, it is also clear that there is a substantially large proportion of genotypes predicted with a poor level of certainty even from sequence data with an exceptionally high level of coverage, and thus highlighting that it is clearly not a trivial task to recover a tetraploid genotype at a sequence variant site.
It should be noted that the two questions addressed in the present study will merge into one in diploids because appearance of two alleles at any sequence variant site of a diploid genome will suffice to infer an individual heterozygote genotype at the site. Assuming binomial distribution of sampling alleles from a diploid heterozygote genotype, one can simply the above equations to predict n, the number of sequence reads that are needed to ensure the presence of two alleles of a heterozygote genotype in diploid with a prior given probability P, from log(1 − P)∕log(1∕2) + 1, from which n ≈ 5, 6 or 8 for p = .90,.95 or .99, respectively. This clearly indicates that sequencing coverage is a much less demanding for GBS in diploids than in polyploids. The formulation above has been programmed in R scripts which are presented as Text S1.  Note: M, number of segregating alleles; Dosage, k A : k C : k G : k T , indicates allele configuration of a tetraploid genotype; P (or P ± ) is probability confidence of the prediction.

| DISCUSS ION
The next generation sequencing (NGS) technique enables generation of a large number of short sequence reads which may represent a reference genome with a predesigned coverage. Fast advancement in NGS makes it unprecedentedly economically doable to discover genome-wide sequence variants at a population scale. This has greatly facilitated sequence data driven population based genetic and genomic studies in ecology and evolutionary biology (Ellegren, 2014;Hill, 2012;Kim et al., 2016), and also motivated enormous methodological researches for modelling and analysing the NGS data (Li, 2011;Catchen et al., 2013;Begun et al., 2007). Digital nature of sequence reads data generated from NGS experiments allows genotyping at a polymorphic site embedded in any sequence reads, that is, genotyping by sequence (GBS). Codominance of polymorphic nucleotide alleles makes it feasible to predict individual genotype at any polymorphic site in diploids even with low coverage sequence reads (Wickland et al., 2017) because GBS in diploids merely needs to survey whether there is one (homozygote) or two alleles (heterozygote) present at the site. However, it is much more challenging to recover genotype of a polyploid even from codominant marker data including those identified from the sequence reads as detailed in Luo et al. (2000).
The challenge of GBS in polyploids may be attributed to several characteristic features of polyploid sequence data. First, the sequence data is highly relevant to but may be usually uninformative about allele dosage of a polyploid genotype at any sequence variant site (Margarido & Heckerman, 2015). Second, polyploid sequence data shows complex patterns of bio-and nonbiological variations (Dang et al., 2021;Gerard et al., 2015). Although enormous research efforts have been made to explore GBS in polyploid species (Griffin et al., 2011;Margarido & Heckerman, 2015;McKinney et al., 2018;Poland & Rife, 2012), little is yet known to what extent a polyploid genotype at a polymorphic site may be reliably recovered from sequence reads data. Third, there has been evidence showing there is allele loss at a substantial proportion of sequence variant sites in diploid and polyploid sequence data in the literature (Gautier et al., 2013;Shestak et al., 2021) and from our own unpublished tetraploid potato data. This adds an extra angle of complexity to GBS in polyploids.
The present study exploits two logically related but conceptually different questions, the numbers of sequence reads required to detect the presence of polymorphic alleles and to recover genotype at the polymorphic site in polyploids. These are highly relevant to how to effectively design a GBS experiment and how to reliably call for genotype from the sequence reads data in polyploids. Taking tetraploid as example, the paper provides generic statistical methods to predict the numbers of sequence reads to be required to ensure the presence or/and dosage of constituent alleles of a tetraploid genotype with a pregiven probability confidence. The formulation is made by assuming a multinominal distribution to model the sampling process of generating sequence reads from a given tetraploid genotype as in many polyploid sequence data modelling and analysis (Griffin et al., 2011;Margarido & Heckerman, 2015;McKinney et al., 2018;Poland & Rife, 2012).
Under the multinomial distribution assumption, we showed that a dozen of sequence reads are sufficient to ensure the presence of any tetraploid genotype alleles in the reads data with a probability confidence of 95%. However, it is much more demanding in term of the sequence coverage to accurately predict a tetraploid genotype at a polymorphic site embedded in sequence reads, specifically a few hundreds of the sequence reads are needed to reliably predict the genotype from the sequence data with a probability confidence of 90%. This is in a sharp contrast to the widely accepted proposition of GBS in diploids and polyploids from low coverage sequence data (Gorjanc et al., 2017;Blischak et al., 2018;Wickland et al., 2017), and indicates that use of low coverage sequence data (i.e., <200×) would not be possible to lead to reliable GBS in tetraploids. The theoretical predictions made in the present paper were tested using RAD-seq data from two tetraploid potato cultivars which were crossed to generate an outbred segregation population. The RAD-seq data analysis shows a clear increase in accuracy of GBS as sequence coverage increases and that there is still a large proportion of genotypes predicted with a poor accuracy even at the polymorphic sites with a sequence coverage of no less than 1,000x. This may be partly explained by the fact that part of the RAD-seq data violates the multinomial distribution assumption. We have previously showed that the RAD-seq data may bear significant extra variation over that of the multinomial distribution (Dang et al., 2021). The messy nature of the sequence data in polyploids is also demonstrated elsewhere (Gerard et al., 2015).
It is clear from the present study that accurate GBS in polyploids needs to be built upon high-coverage sequence data. However, it could be financially infeasible to sequence a polyploid population at a high coverage, and thus GBS in polyploids may not be feasible at a genome wide scale for many laboratory-based projects. On the other hand, use of genotype information at molecular markers has been demonstrated to be effective in improving the efficiency of quantitative genetic analysis in tetraploids (Hackett et al., 2014).
Instead of sequencing the whole genome, restricted site associated DNA sequencing (RAD-seq) targets only a part of the genome and thus enables the targeted regions to be sequenced with a high coverage. We proposed a computer-based method to search for an optimal combination of restriction enzymes to be implemented in designing a RAD-seq experiment so as for sequence reads to be generated with an approximately even distribution over the target genome (Jiang et al., 2016). The RAD-seq data analysed here demonstrated the evenness, which is important for genomewide analysis of quantitative traits (Chen et al., 2020). The present RAD-seq data analysis shows only a portion of sequence reads in a sufficient coverage for accurate genotype prediction, and thus may further decrease the density of the sequence variant markers over the target genome. It must be pointed out that the number of markers does not necessarily need to be very large for many population based quantitative genetic analysis because the efficiency of the analysis is largely determined by the number of effective recombinants in the genome under question (Luo et al., 2002).
Taking cultivated potato (Solanum tuberosum L.) as an example, it has a genetic linkage map of length 1,416 cM, which was estimated from a large intensive marker data set (Yu et al., 2020). If the potato genome is covered by 500 evenly distributed markers, the intermarker distance would be ~3 cM. Given all these, the designed size of RAD-seq data would be appropriate to obtain the required number of sequence variant markers with sufficiently high coverages so that GBS could be accurately predicted.
It needs to be acknowledged that two parameters, P and δ, in- The present study has been focused on GBS prediction directly from sequence data of any tetraploid individual. It has been well documented that accuracy in predicting individual genotype at a polymorphic marker locus may be substantially improved from using the individual's pedigree information in addition to its marker data as demonstrated in diploids (Chen et al., 2014) and in tetraploids (Luo et al., 2000). We will show how offspring sequence data may be implemented to improve the accuracy of prediction of their parental genotypes (Luo et al., unpublished data).
It has been well documented that newly formed polyploid genomes have been subject to the dynamic changes in both structure and function when compared to their diploid relatives and ancient countpartners (Soltis & Soltis, 1995;Comai et al., 2000). This may imply sequence coverage for accurate GBS may be more demanding in the newly formed polyploids than in ancient polyploids with more stable genomes. Tetraploid potato (Solanum tuberosum L.) has a domestication history of 8000 to 10,000 years (Spooner, 2005), suggesting a fairly genetically stable genome and thus that the sequence coverage estimated from the present study would be informative for accurate GBS in the tetraploid crop.
TA B L E 2 Proportion of predicted two-allele tetraploid genotypes with -0.05 ≤ δ ≤ 0.05 from sequence data with varying coverages of sequence reads collected for the two parental potato cultivars (12-1-20 and 12-5-12) Numerical analysis involves evaluation of bi-and/or multinomial probability distribution and may be recognized to be computationally challenging, particularly in the case of calculating multinomial probabilities. Method was proposed to approximate multinomial probability function in the classical literature in statistics, for example, Johnson (1960). By combining it with a more recently proposed method for approximating gamma function in Yang and Tian (2018), we tested and demonstrated the efficiency and accuracy of the method in statistically modelling sequence data from tetraploid potato (Dang et al., 2021). Additionally, numerical computation took only about 10 minutes on a Dell precision workstation computer to analyse the sequence data set from an individual sample.

ACK N OWLED G EM ENTS
The research is supported by research grants (31871240)

DATA AVA I L A B I L I T Y S TAT E M E N T
The sequence data analysed in the present study has been made available at http://figsh are.com/artic les/datas et/1-20-1_quali ty_inf_ txt/16989784.