Keywords:

  • genomics;
  • population genetics;
  • theory

Abstract


The use of diploid sequence markers is still challenging despite the good quality of the information they provide. There is a problem common to all sequencing approaches [traditional cloning and sequencing of PCR amplicons as well as next-generation sequencing (NGS)]: when no variation is found within the sequences from a given individual, homozygosity can never be asserted with certainty. As a consequence, sequence data from diploid markers are mostly analysed at the population level (not the individual level), particularly in animal studies. This study aims to help solve this problem. Using Bayes' theorem and the binomial law, several useful results are derived, among them: (i) the number of sequence reads per individual (or sequencing depth) required to ensure, at a given probability threshold, that heterozygotes are not considered erroneously as homozygotes, as a function of the observed heterozygosity (Ho) of the locus in the population; (ii) a way of estimating Ho from low-coverage NGS data; (iii) a way of testing the null hypothesis that a genetic marker corresponds to a single and diploid locus, in the absence of data from controlled crosses; (iv) strategies for characterizing sequence genotypes in populations that minimize the average number of sequence reads per individual; (v) a rationale to decide which variations need to be considered along the sequence, as a function of the affordable sequencing depth, the level of polymorphism desired and the risk of sequencing error. For traditional sequencing technology, optimal strategies appear surprisingly different from the usual empirical ones. The average number of sequence reads required to obtain 99% of fully determined genotypes never exceeds six, this value corresponding to the worst situation, when Ho equals 0.6. This threshold value of Ho is strikingly stable when the tolerated proportion of non-fully resolved genotypes varies within a reasonable range. These results do not rely on the Hardy–Weinberg equilibrium assumption or on diallelism of nucleotidic sites.


Introduction


Diploid sequence genotypes, compared with other types of genetic markers, provide the most precise information for molecular ecology studies (Chenuil, 2006), because both the maternal and paternal alleles of each individual and the evolutionary relationships among alleles are known. Two factors have limited the spread of such markers: (i) the absence of universal PCR primers for nuclear loci in nonmodel organisms and (ii) technical complexity and the absence of guidelines to obtain reliable diploid sequence genotypes. Recently, potentially universal nuclear genetic markers became available (Chenuil et al., 2010), significantly reducing the first limitation. This article contributes to overcoming the second difficulty. Until recently, obtaining sequence genotypes required several experimental steps after PCR (i.e. cloning and sequencing of several clones per individual). A new set of technologies, known as next-generation sequencing (NGS), potentially allows obtaining, at a moderate cost, thousands of individual DNA sequence reads from a mixture of PCR fragments or from a genomic library, avoiding the cloning step and greatly increasing throughput. Although few publications yet report population genetic studies using this technique (Bentley et al., 2009; Galan et al., 2010; Ekblom & Galindo, 2011), individual labelling with PCR primers slightly modified at their 5′ end allows tracking the sequence genotype of several individuals at numerous predetermined loci in a single pyrosequencing NGS run. Whatever the technology, classical or NGS, the absence of a rigorous approach to establish the homozygosity of an individual when all the sequenced clones or sequence reads are identical precludes researchers from using this valuable information at the individual level. Thus, such studies are restricted to markers displaying low polymorphism [e.g. SSCP, Sunnucks et al. (2000)] or rely upon dubious interpretations.
In this article, I first show that Bayesian probabilities can help decide whether an individual can be recorded as homozygous at a given locus, given an error probability threshold. I also propose a simple test of the hypothesis that a sequence marker corresponds to a single diploid locus (SDL) and a solution to identify sequencing errors. Then, I use these results to optimize the strategy for genotyping numerous individuals (i.e. minimizing the average number of sequences required per individual while respecting certain conditions of practical simplicity and genotyping reliability) for both NGS and traditional approaches (which for simplicity I will call OGS, for old-generation sequencing). These results are briefly compared with recent studies dealing with NGS data analyses at the genotype or the population level.

The basic theoretical framework


Sequencing depth and assessment of individual homozygosity

Table 1 displays the meaning of the different symbols and abbreviations employed in this article. The probability that an individual is heterozygous when all x clones sequenced (or reads, in case of NGS) display the same DNA sequence, noted P(het|x), can be inferred using Bayes' theorem (Bayes & Price, 1763).

  P(het|x) = P(x|het) P(het) / P(x)
Table 1. Abbreviations, symbols and parameters

P(a|b): probability of occurrence of event 'a' given that event 'b' is true
x: the x clones sequenced [or number of reads, in case of next-generation sequencing (NGS)] from the individual considered display the same DNA sequence
xi: allele i was drawn x times in the x sequences (reads) from the individual considered
het: the individual is heterozygous
Ho: observed heterozygosity (proportion of heterozygous individuals)
Hap: apparent heterozygosity (proportion of individuals displaying two distinct sequences among their reads)
pi: frequency of allele i
α: probability of considering that the individual is homozygous when it actually is heterozygous [threshold value for P(het|x)]
d: observed sequencing depth from an NGS run result (average among individuals), or number of clones sequenced per individual
μ: proportion of genotypes that are not fully resolved under a given strategy
T1: average total number of sequences using strategy 1 (direct sequencing and, when necessary, cloning and sequencing of a single set of Sμ clones, whose number is determined as a function of μ)
T2: average total number of sequences using strategy 2 (cloning and sequencing of a first set of S1 clones, then, in case of similarity, of S2 clones, S1 and S2 being determined as a function of μ and minimizing T2)
S or Sμ: number of clones to sequence under strategy 1 for individuals appearing heterozygous after direct sequencing, respecting a maximum proportion μ of non-fully determined genotypes
S1: number of clones sequenced in the first sequencing round under strategy 2, respecting a maximum proportion μ of non-fully determined genotypes
S2: number of clones sequenced in the second sequencing round (for individuals displaying S1 identical sequences after the first round) under strategy 2, respecting a maximum proportion μ of non-fully determined genotypes
CS: cloning-sequencing
CWIRN: cumulated within-individual read numbers
DAS: direct amplicon sequencing
NGS: next-generation sequencing
OGS: old-generation sequencing
SDL: single diploid locus

Each of these three factors is simply obtained.

The probability of having all x clones identical, given that the individual is heterozygous, is the probability of drawing x times the same allele. Each allele has a probability of 0.5 of being drawn in a diploid individual, and two alleles are possible (explaining the factor 2).

  P(x|het) = 2 (1/2)^x = (1/2)^(x−1)

By definition, the observed heterozygosity is the proportion of heterozygotes in the considered group of individuals (population); it also represents the probability of being heterozygous in this group, independently of any assumption about the mode of union of gametes, because population allele frequencies are not used to compute it.

  P(het) = Ho

The probability of drawing x times the same allele in any individual of the population corresponds to two mutually exclusive and complementary possibilities, depending on whether the individual is heterozygous or homozygous; in the latter case, the probability of drawing x identical alleles is one.

  P(x) = P(x|het) P(het) + P(x|hom) P(hom) = Ho (1/2)^(x−1) + (1 − Ho)

From these expressions, we obtain

  P(het|x) = Ho (1/2)^(x−1) / [Ho (1/2)^(x−1) + 1 − Ho]    (1)

This equation allows determining how many clones need to be sequenced so that, in case all of them appear identical, the individual can be considered homozygous with a probability of being wrong no higher than α. The number of clones required is the smallest integer greater than or equal to x*, x* being the value of x solving P(het|x) = α

  x* = 1 + log2[Ho (1 − α) / (α (1 − Ho))]    (2)

When Ho is high (0.9), nine clones are required to obtain a value of α of 5%, whereas for an observed heterozygosity of 0.2, for instance, four clones are sufficient for this threshold. Figure 1 displays the values of x* as a function of Ho.

Figure 1. Total number of sequence reads required to assess, when all reads are identical, that the individual is homozygous with a risk α of being wrong, as a function of the observed heterozygosity (Ho).
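As a cross-check of eqns 1 and 2, here is a minimal Python sketch (function names are mine, not from the article) computing P(het|x) and the required number of reads; it reproduces the worked values above (nine reads for Ho = 0.9 and four for Ho = 0.2, at α = 0.05).

```python
import math

def p_het_given_x(ho: float, x: int) -> float:
    """Eqn 1: probability that an individual is heterozygous when all x
    reads (or clones) show the same sequence, given the observed
    heterozygosity Ho of the locus."""
    num = ho * 0.5 ** (x - 1)
    return num / (num + 1.0 - ho)

def min_reads(ho: float, alpha: float = 0.05) -> int:
    """Eqn 2: smallest integer x such that P(het | x) <= alpha."""
    x_star = 1 + math.log2(ho * (1 - alpha) / (alpha * (1 - ho)))
    return math.ceil(x_star)

# Worked examples from the text:
print(min_reads(0.9))  # 9 reads needed for Ho = 0.9, alpha = 0.05
print(min_reads(0.2))  # 4 reads suffice for Ho = 0.2
```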

When the expected heterozygosity (He) is known (for instance, when previous studies provided the allele frequency distribution), it may be interesting to replace Ho by He(1 − FIS), using Wright's F-statistics, though in general practical cases it is simpler and more reliable to use Ho.

The probability that a given individual is heterozygous given that all x sequence reads yielded allele ‘i’ is similarly derived, except that in this case, the Hardy–Weinberg equilibrium must be assumed.

  P(het|xi) = P(xi|het) P(het) / P(xi)

The first term, P(xi|het), is the probability, for a heterozygote, of carrying allele i and of drawing x times the same one of its two distinct alleles. In a panmictic population, the frequency of heterozygotes displaying allele i is 2pi(1 − pi) within the whole population, and the same value divided by Ho among heterozygotes.

  P(xi|het) = [2 pi (1 − pi) / Ho] (1/2)^x

The last probability factor is obtained by reasoning similar to that for eqn 1 (summing the alternative cases of heterozygous and homozygous individuals) and using the above result.

  P(xi) = P(xi|het) Ho + pi² = 2 pi (1 − pi) (1/2)^x + pi²

Thus

  P(het|xi) = 2 pi (1 − pi) (1/2)^x / [2 pi (1 − pi) (1/2)^x + pi²] = (1 − pi) (1/2)^(x−1) / [(1 − pi) (1/2)^(x−1) + pi]    (3)

As expected, the probability that the individual is heterozygous is higher if it displays x reads identical for a rare allele than for a frequent allele.
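Eqn 3 can be sketched the same way; a minimal Python illustration (function name mine) of how the frequency pi of the displayed allele changes the conclusion, under the Hardy–Weinberg assumption stated above.

```python
def p_het_given_xi(p_i: float, x: int) -> float:
    """Eqn 3 (assumes Hardy-Weinberg equilibrium): probability that an
    individual is heterozygous when all x reads display allele i, whose
    population frequency is p_i."""
    num = 2 * p_i * (1 - p_i) * 0.5 ** x
    return num / (num + p_i ** 2)

# A rare allele seen x times leaves much more doubt than a common one:
print(p_het_given_xi(0.05, 4))  # ~0.70: likely still heterozygous
print(p_het_given_xi(0.9, 4))   # ~0.014: almost surely homozygous
```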

Detecting sequencing errors and testing the hypothesis of a single diploid locus

The above results rely on the assumptions that all identified reads correspond to true alleles (no sequencing errors) and to a single diploid locus (the SDL hypothesis) without allelic bias, which is not always true (see the two following paragraphs). These assumptions may be examined using the same tool, the cumulated distribution of within-individual read numbers (abbreviated CWIRN distribution; see the third paragraph below).

Sequencing errors and allelic bias

Although sequencing errors are generally negligible in direct amplicon sequencing when target DNA is not limiting, artefactual alleles may be produced during the experimental procedures of both NGS and OGS. After cloning of a PCR product, numerous single-mutation errors have been documented in different, often unpublished, studies (Faure et al., 2007; Egea, 2011): the number of haplotypes exceeds two per individual, with one (or several) clones from a single individual differing by a single mutation from the major haplotypes, even when the locus is a well-known single diploid marker. Such errors are not observed for all the loci characterized by identical laboratory practices in a given team (unpublished). A way to visualize most of such mutants is to highlight the haplotypes obtained from a single individual in the graphical representation of the global haplotype network. Such mutants generally appear as singletons very closely related (generally by a one-step mutation) to another haplotype present in the same individual (Egea, 2011 and unpublished). The safest and least biased solution to this important concern is to remove from the alignment the nucleotide positions where such mutations are suspected to occur (cf. tunings section below). NGS technologies generate numerous base-calling errors of new types, about which knowledge and tools progress rapidly (Nielsen et al., 2011; Nothnagel et al., 2011; Meacham et al., 2011). Some algorithms provide quality scores estimating the risk of base-calling error at the initial step of image analysis, and others compute genotype likelihoods accounting for such uncertainty (reviewed in Nielsen et al., 2011). With both OGS and NGS, repetitive sequences are more prone to error. Allelic biases may also affect the proportion of sequence reads obtained from an individual: they have been reported for both OGS and NGS, shorter alleles often being more efficiently amplified (PCR) (Wattier et al., 1998), cloned or integrated in NGS runs, and thus over-represented in sequence results.

Departures from the SDL hypothesis

When a molecular marker is characterized after a PCR step, and even by transcriptome or genome sequencing, it is not straightforward to determine whether it corresponds to a SDL, or whether distinct paralogs are confounded, whether the amplified DNA region is located on an autosome or in a cytoplasmic or haploid genome, or whether it belongs to a polyploid genome. A traditional and reliable way to obtain this information involves characterizing the progeny of controlled crosses between parents with distinct genotypes. This is often impossible in nonmodel species. Non-diploid markers or multiple loci are readily identified when some individuals display more than two distinct alleles (assuming these are not due to sequencing errors). But for polysomic loci with low polymorphism, or for paralogs with overlapping allele distributions, the expected number of such observations may be low (even zero, when there are only two alleles), in which case an appropriate test is required (cf. next section). Conversely, a marker should not be automatically rejected when a single individual (out of hundreds) displays more than two alleles. This may also result from PCR contamination (i.e. neither from sequencing errors nor from a mixture of loci). In general, examination of the sequences allows distinguishing PCR contamination from sequencing errors. Removing such individuals, if they are extremely rare compared with those displaying two distinct alleles, should not, in general, bias the test of the SDL hypothesis proposed below.

Distributions of read numbers within individuals and tests

If a molecular marker is diploid and if the different alleles are amplified, cloned and sequenced with the same efficiency (i.e. assuming no allelic bias and no sequencing error), each heterozygote must provide two distinct alleles, each in the same proportion of 0.5 in the amplicon (PCR product). On the contrary, with polysomy or paralogy, some of the genotypes displaying distinct sequences may actually not contain both alleles in identical proportions. More generally, if d clones are sequenced, the probability of obtaining i times a given sequence [and (d − i) times the other sequence] from a heterozygote follows a binomial law with parameters d and 1/2:

  fi = P(i|het, d) = C(d, i) (1/2)^d    (4)

Except for i = d/2 when d is even, the symmetrical cases i and d − i cannot be distinguished in cloning-sequencing results; their observations must therefore be pooled (the corresponding proportions are doubled). Table 2 displays numerical values of the expected frequencies for the values of d relevant for OGS (these relevant values are established further in the manuscript, Fig. 4), together with an example of expected counts for a particular case. The contingency table containing the observed and expected counts, after binning of symmetrical cases, can be submitted to a classical test (e.g. chi-square or exact tests). If the test is significant, the marker cannot be considered a SDL. For a tetrasomic locus, genotypes of the form 'ABBB' follow a binomial law whose probability parameter departs from one half (0.75 or 0.25 in this case). In cases of polyploidy, as well as for mixtures of paralogous loci, the cells corresponding to the most balanced distribution [half of the clones of each type or, when d is odd, (d − 1)/2 and (d + 1)/2 clones of each type] will always be depleted compared with the SDL hypothesis (right cells in Table 2).

Table 2. Expected values of sequence type distribution (for single diploid locus tests).

(a) Expected frequencies for all values of d (number of clones sequenced) relevant for OGS (see also Fig. 2 for larger d values, more typical of NGS approaches)

d = 2
  Sequence distribution:  0 + 2     1 + 1
  fi:                     0.5       0.5
d = 3
  Sequence distribution:  0 + 3     1 + 2
  fi:                     0.25      0.75
d = 4
  Sequence distribution:  0 + 4     1 + 3     2 + 2
  fi:                     0.125     0.5       0.375
  f′i:                    –         0.5714    0.4286
d = 5
  Sequence distribution:  0 + 5     1 + 4     2 + 3
  fi:                     0.0625    0.3125    0.625
  f′i:                    –         0.3333    0.6667
d = 6
  Sequence distribution:  0 + 6     1 + 5     2 + 4     3 + 3
  fi:                     0.03125   0.1875    0.4687    0.3125
  f′i:                    –         0.1935    0.4839    0.3226
d = 7
  Sequence distribution:  0 + 7     1 + 6     2 + 5     3 + 4
  fi:                     0.015625  0.1094    0.3281    0.5469
  f′i:                    –         0.1111    0.3333    0.5556

(b) Expected counts in the case d = 4, for a total of 32 individuals cloned

Sequence distribution:  0 + 4   1 + 3   2 + 2   Total
Expected counts:        4       16      12      32
When all the individuals that would be heterozygous under the SDL hypothesis are unambiguously identified, as for instance after direct amplicon sequencing (DAS), these expected fi values can be compared with the observed values, allowing a test of the SDL hypothesis. In other cases, when only NGS or cloning-sequencing (CS) data are available, testing the SDL hypothesis requires considering exclusively the individuals in which two distinct sequences are already detected, and these do not represent all the heterozygotes. However, they represent a known proportion of the heterozygotes: all heterozygotes except those displaying all d sequences identical. This corresponds to removing the class (i = 0 or d) from the contingency table (left cell in Table 2). The test can then be carried out in exactly the same way, but the expected frequencies in the remaining cells must be corrected, replacing fi by f′i as follows:

  f′i = fi / [1 − (1/2)^(d−1)]    (5)

For this modified test, the minimum number of clones to sequence is d = 4 (not two as in the former situation), since for d = 3 only two columns can be compared (Table 2a). These tests do not rely on the Hardy–Weinberg assumption, but they assume that all alleles are amplified and sequenced with the same efficiency (no allelic bias) and that there are no sequencing errors. Thus, when the SDL test is significant, it is recommended to inspect allele sequences critically, to check whether these assumptions may be violated, and to avoid discarding the locus when unbiased corrections are possible. The possibility of allelic bias or sequencing errors should thus be assessed independently of the SDL test, upon examination of the primary sequence of alleles. Sequencing errors can affect the distribution in a typical way, and it is important to identify such cases because they can sometimes be corrected easily (cf. section 'Tunings to reduce costs, work time, and experimental errors'). Assuming that mutations induced by experimental steps are rare and random, erroneous reads are expected to be singletons and to differ from true alleles by one or a few mutations. If the locus is prone to such sequencing errors, the CWIRN distribution will display an excess of unique reads (and, by symmetry, also of 'd − 1' reads, because the test applies to individuals displaying at most two distinct sequences) (more explanations in Fig. 2 and Appendix S1 in Supporting Information, for d = 20). Simple tests are possible (comparing these classes against the pooled other classes); the power of such tests may be limited with read numbers as small as the relevant values determined for OGS strategies (d < 8, Table 2) but sufficient with sequencing depths above 15, common in NGS approaches.
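As an illustration of the SDL test (eqns 4 and 5), the following Python sketch (function names mine) builds the expected pooled class frequencies of Table 2 and a chi-square statistic; as noted above, an exact test would be preferable for small counts.

```python
from math import comb

def sdl_expected_freqs(d: int, drop_monomorphic: bool = False):
    """Expected frequencies of the read splits i + (d - i) among
    heterozygotes under the single-diploid-locus hypothesis (eqn 4),
    pooling the symmetrical classes i and d - i (Table 2a).
    With drop_monomorphic=True, the 0 + d class is removed and the
    remaining classes renormalised (eqn 5), as required when only
    NGS or cloning-sequencing data are available."""
    freqs = {}
    for i in range(0, d // 2 + 1):
        f = comb(d, i) * 0.5 ** d
        if i != d - i:
            f *= 2  # pool the indistinguishable classes i and d - i
        freqs[(i, d - i)] = f
    if drop_monomorphic:
        del freqs[(0, d)]
        total = sum(freqs.values())
        freqs = {k: f / total for k, f in freqs.items()}
    return freqs

def chi2_statistic(observed: dict, expected_freqs: dict) -> float:
    """Classical chi-square statistic; compare it to a chi-square law
    (e.g. scipy.stats.chi2) or use an exact test for small counts."""
    n = sum(observed.values())
    return sum((observed[k] - n * f) ** 2 / (n * f)
               for k, f in expected_freqs.items())

# Reproduces Table 2a for d = 4: {(0, 4): 0.125, (1, 3): 0.5, (2, 2): 0.375}
print(sdl_expected_freqs(4))
```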

Figure 2. Cumulated distribution of within-individual read numbers simulated for 100 individuals and d = 20 reads per individual. For each individual, the number of alleles occurring i times among the 20 reads is reported for all i (from 0 to 20), and this is summed over all individuals to build the CWIRN distribution. The upper part represents any single diploid locus (SDL) distribution with Ho = 0.9, without sequencing error (black bars) or with one erroneous allele out of 20 reads in 5% of the individuals (white bars; the arrows emphasize the classes with erroneous alleles). For a SDL, the distribution depends only on Ho and sequencing errors and does not change whether or not panmixia holds. The lower part represents: a double diploid locus (DDL) when both loci display the same two equifrequent alleles, or a tetrasomic locus (black bars); a triple diploid locus (TDL) displaying the same two equifrequent alleles, or a hexasomic locus (white bars). In those cases, contrary to a SDL, predicting the distribution shape requires knowing more than Ho and the sequencing error modalities; these simulations correspond to panmixia, but the modes may be visible (as here for DDL or tetrasomy, highlighted by three arrows), helping to determine locus number or ploidy.

When the SDL hypothesis is rejected, and when sequencing errors are not suggested by the shape of the CWIRN distribution, one may also wonder whether rejection is due to the mixture of two, three or more loci. The CWIRN distribution for a double locus will be a linear combination of binomial distributions of event probability 1/2 (genotypes of the form AABB), 1/4 and 3/4 (for genotypes such as ABBB). In the general case of l loci, it will be a combination of more numerous binomial laws, whose event probabilities are i/(2l) (with i = 1 to 2l − 1); if homozygous genotypes are included in the distribution (as in Fig. 2), there is also a binomial component of event probability 1. Appendix S1 explains the construction of expected CWIRN distributions in some particular cases: a SDL (with and without random sequencing errors), and double and triple loci sharing alleles, among individuals with fewer than three distinct alleles. Remarkably, for a SDL, the distribution depends only on Ho and on sequencing errors, and does not change whether or not panmixia holds. Figure 2 illustrates three cases: a SDL with and without sequencing error, a tetrasomic and a hexasomic locus. It suggests that a relatively large sequencing depth (at least 20×) is required to infer visually the number of confounded loci from the shape of the distribution, multimodality being hardly visible in the tetrasomic (particular) case and not visible in the hexasomic one, although the distribution shapes are very distinct.
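For intuition, the SDL case of the CWIRN distribution (upper panel of Fig. 2) is easy to simulate; a Python sketch (function name mine) under the assumptions of no sequencing error and no allelic bias:

```python
import random
from collections import Counter

def simulate_cwirn_sdl(n_ind: int = 100, d: int = 20,
                       ho: float = 0.9, seed: int = 1) -> Counter:
    """Simulate the cumulated within-individual read-number (CWIRN)
    distribution for a single diploid locus: for each individual,
    count how many of the d reads each allele received, and cumulate
    these counts over all individuals (cf. Fig. 2, upper panel)."""
    rng = random.Random(seed)
    cwirn = Counter()
    for _ in range(n_ind):
        if rng.random() < ho:
            # Heterozygote: each read picks either allele with prob. 1/2.
            k = sum(rng.random() < 0.5 for _ in range(d))
            cwirn[k] += 1        # reads carrying the first allele
            cwirn[d - k] += 1    # reads carrying the second allele
        else:
            cwirn[d] += 1        # homozygote: all d reads identical
    return cwirn
```

Plotting the resulting counts against i reproduces the bimodal shape of the upper panel of Fig. 2 (a mode near d/2 from heterozygotes and a peak at d from homozygotes plus monomorphic heterozygote draws).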

Optimization of the strategy to reliably characterize sequence genotypes in populations


These results allow optimizing the strategy to obtain sequence genotypes for large samples of individuals. Aspects of the reasoning below rely upon the assumption that the marker is a SDL. This SDL hypothesis may be tested a posteriori, avoiding additional experiments, in some cases (cf. below, Fig. 4).

Different approaches (DAS, CS or NGS) have different strengths. When DAS provides a simple, unambiguous sequence, the homozygosity of an individual at a given locus is established with certainty (assuming the DNA amount in the PCR reaction was not limiting and all alleles are equally amplified). By contrast, when the set of sequence reads from an individual (from either NGS or CS) reveals two distinct alleles, heterozygosity is established. In these two opposite but favourable cases, sequence genotypes are fully characterized. The complementary cases, when mixed sequences are obtained from DAS or when there is no variation among the haplotypes obtained from NGS or CS, are less straightforward. Knowing these opposed properties of the two types of approaches and the theoretical results presented above (except eqn 3, which is not used hereafter) allows establishing optimal strategies for reliable genotype inference with minimal effort and cost. I will consider separately the cases using NGS and those using traditional approaches, which I will call old-generation sequencing (OGS) and which include both DAS and CS. Preliminarily, a brief comparison of NGS and traditional approaches is necessary.

Choosing between ‘Next’- and ‘Old’-generation sequencing

NGS methods may now appear as the solution of choice because they provide, by far, the highest throughput and the lowest cost (per nucleotide). However, several limitations affect NGS. (i) Read sizes are limited to ca. 400 bp (Titanium GS-FLX 454 pyrosequencing). (ii) All pooled fragments must have similar sizes to avoid a bias favouring smaller fragments and, even respecting this condition, it is difficult to control the number of copies of a given locus that will be obtained, owing to heterogeneity in affinity among DNA fragments. (iii) Background is lacking for the use of multiplex identifier (MID) tags allowing individual identification within the pool. (iv) NGS error rates are higher than with OGS, although in OGS error rates increase when cloning is used after PCR. (v) Finally, setting up an NGS run with individual tags for population genotyping is profitable only when a minimum number of loci is pooled (while respecting the similar-sizes constraint), though it is possible to combine different species in a run (Chenuil and Aurelle, ongoing study). Thus, OGS is likely to coexist with NGS for some time. Furthermore, we cannot exclude the possibility that significant progress also occurs in single-amplicon or single-clone sequencing.

Next-generation sequencing approaches

Prior to genotype inference based on sequence reads, the number of individuals and loci that can be simultaneously analysed in a single NGS run must be determined. The average number of reads per individual per locus (i.e. the sequencing depth) can be estimated as the total number of sequence reads provided by the NGS run divided by the number of loci and by the number of individuals whose amplicons were pooled in the run. If the observed heterozygosity (Ho) of the locus is known, eqn 2 allows estimating the required sequencing depth a priori, and the number of individuals and/or loci pooled in the run can be adjusted accordingly. In other cases, one has to choose the sequencing depth arbitrarily. Then, Ho can be estimated from the raw result of the run. Calling 'apparent heterozygosity' (Hap) the proportion of individuals displaying two distinct sequences among their reads, and d the observed sequencing depth (average number of reads per individual), Ho is easily obtained: the individuals for which all reads are identical comprise two types, the true homozygotes and those heterozygotes for which the same allele was drawn d times. With a reasoning similar to the above (e.g. eqn 1), this can be expressed

  1 − Hap = (1 − Ho) + Ho (1/2)^(d−1)

so

  Ho = Hap / [1 − (1/2)^(d−1)]    (6)
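Eqn 6 is a one-liner; a Python sketch (function name mine) that also reproduces the special case Ho = 8 Hap/7 used later for d = 4:

```python
def estimate_ho(hap: float, d: int) -> float:
    """Eqn 6: observed heterozygosity Ho from the apparent
    heterozygosity Hap (fraction of individuals showing two distinct
    sequences among their reads) and the sequencing depth d. A
    heterozygote shows a single sequence in all d reads with
    probability (1/2)**(d - 1), hence the correction factor."""
    return hap / (1 - 0.5 ** (d - 1))

print(estimate_ho(0.7, 4))  # 0.8, i.e. Ho = 8 * Hap / 7 for d = 4
```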

From the obtained set of reads, inferring individual genotypes is then relatively straightforward (illustrated in Fig. 3). The presence of two distinct haplotypes unambiguously defines the genotype of heterozygous individuals, assuming that a SDL was contained in the original amplicon, a hypothesis that can be checked a posteriori (cf. above). Eqn 1 gives the probability that an individual displaying a single type of read is heterozygous. If the Hardy–Weinberg equilibrium can be assumed, and if allele frequencies can be estimated from previous data, eqn 3 provides the probability that a given individual is heterozygous as a function of the number of reads obtained, x, and pi, the frequency of the displayed haplotype 'i', which is obtained from the run result.

Figure 3. Flowchart representing the steps to follow to assess individual genotypes from an NGS run with individual tags. For each individual and locus combination, the steps are represented from top to bottom; diamonds represent questions, answers are in circles, and rectangles represent conclusions or data processing (rounded corners) or actions involving experimental work (sharp corners). Rectangles bordered by thick lines represent conclusions that are certainties (i.e. not depending on any predefined risk value). Parameters are explained in Table 1 and in the body of the manuscript. Using allele frequency information (pi) is not necessary; it is useful in case of panmixia (eqn 3).

For the individuals whose probability of being heterozygous exceeds a given threshold, direct sequencing allows determining which of them actually are homozygous (i.e. those for which DAS yields an unambiguous simple sequence). For the others, there is a single unknown allele, which may be deduced from the chromatogram (or from the sequence expressed with the IUPAC ambiguity code), visually or using dedicated software (see next section, 'OGS'), by 'subtracting' the known allele sequence.

Old-Generation Sequencing approaches

When NGS is not appropriate, one possibility is to start by performing DAS for all the individuals to genotype. This will provide good results for homozygous individuals, avoiding the cloning and sequencing of identical clones, but may produce very few usable data when the observed heterozygosity (Ho) is high, because very few readable (homozygous) sequences will be recovered. Therefore, the best strategy depends on Ho, which should thus be estimated initially.

Preliminary phase to estimate Ho

The first phase, allowing Ho assessment, can consist of performing DAS or CS for each of about 32 individuals.

Direct sequencing.  DAS of a set of individuals directly provides the value of Ho (assuming the marker is a SDL), which is simply the proportion of unreadable sequences due to heterozygosity of the individual; these can easily be identified by visual inspection of the chromatogram.

Cloning sequencing.  Although this approach is more demanding, one may prefer CS to DAS in some cases, for instance when Ho is expected to be high, in order to obtain a sufficient number of interpretable sequences. In case of a mixture of paralogous loci, this approach may also allow designing more specific internal primers in the hope of getting rid of some paralogs. If four clones are sequenced (recommended as a minimum, but sufficient, number; see below), Ho is directly deduced from the observed data Hap using eqn 6 with d = 4: Ho = 8 Hap/7. Sequencing two clones (Ho = 2 Hap) or three clones also allows determining Ho, but sequencing a minimum of four clones allows using these results to perform the SDL test using fi (cf. the dedicated section and Table 2), and the effort and cost required for sequencing are minor relative to cloning.

Routine genotyping phase

Favourable case: genotypes can be deduced from double chromatograms.  In favourable situations, it is possible to deduce, after DAS, the two sequences from ambiguous (double) chromatograms. This generally occurs for limited polymorphism values and sequence lengths. The most frequent alleles are often identified from homozygous individual sequences, so that diploid sequence genotypes can be deduced from ambiguous double sequences relatively easily. If the length of the two mixed sequences is moderate, there is generally no shift between their corresponding chromatogram peaks, and it is thus possible to use dedicated software to analyse the sequence files written with the IUPAC ambiguity code (Dixon, 2009; Garrick et al., 2010; Stephens & Donnelly, 2003). These programs provide estimates of the probability of correct inference for each genotype or allele (but see Garrick et al., 2010), and the presence of indels is not a problem, contrary to what is often supposed. When the two sequences become shifted at some point in the chromatogram, the programs cited above cannot be used; by contrast, it often becomes possible for the scientist to visually distinguish the two mixed sequences owing to their shift, and sequence genotypes may then be obtained in favourable cases, generally when polymorphism is low. Cloning may thus not be required to determine both sequences. With such markers, however, cloning a few individuals is nevertheless useful to test that the marker behaves as a single diploid locus (SDL test).

In the general case, cloning is required at least for some individuals. Figure 4 schematizes the explanations below and displays the different phases for this general situation, according to polymorphism values (Ho).


Figure 4.  Flowchart representing which strategy is optimal, steps and parameters (as a function of the value of Ho represented in the horizontal axis) for OGS in the general case (i.e. when heterozygotes cannot be deduced from chromatograms). The average total number of sequences per individual is represented by the grey thick dotted curve behind text and flowchart symbols (which are the same as in Fig. 2). The parameter values (number of clones to sequence, and threshold Ho values) are shown for μ = 0.01 (99% of fully determined genotypes). Note that threshold Ho value (around 0.6) is nearly invariant when μ varies and can be considered a general result. Strategy 1 (direct amplicon sequencing, followed by cloning sequencing when necessary) is the best for Ho lower than 0.6, and for higher Ho, strategy 2 is the best (cloning and sequencing 3 clones in a first step, then when necessary cloning additional clones). Single diploid locus (SDL) boxes indicate which frequency functions (eqns 4 or 5) must be used, and their position indicates at which steps of routine genotyping data can be used to test the SDL hypothesis (combined or appended to SDL tests eventually performed after the preliminary phase).


After a cloning step, the strategy minimizing the number of sequencing reactions consists in sequencing only two clones and then, when both clones are identical, adding clone sequences one by one until the second allele is obtained. However, this solution would be tedious and poorly efficient in terms of the organization of laboratory work. I thus decided to focus on simple strategies respecting the constraint that only two sessions of sequencing are carried out per individual. Two such strategies are possible. In strategy 1, the first step consists in DAS for a set of individuals and is followed, for unreadable (heterozygous) individuals, by a second step of CS in which S clones are sequenced per individual (S being chosen so as to minimize the average total number of sequences). In strategy 2, the first step consists in cloning and sequencing a first set of S1 clones; then, for the individuals in which all clones display the same sequence, a second set of S2 clones is sequenced (S1 and S2 being chosen so as to minimize the average total number of sequences). To determine the threshold value of Ho under which strategy 1 is better than strategy 2, I used the following parameters and computations. T1 (respectively T2) represents the average number of sequences per individual necessary to obtain a proportion (1 − μ) of fully determined genotypes (both alleles known) under strategy 1 (respectively strategy 2). Noting Sμ the number of sequences necessary to ensure that the proportion of individuals whose genotypes are not fully determined does not exceed μ, and summing the first and second steps, we obtain:

  T1 = 1 + Ho · Sμ

From a heterozygous individual, the probability of getting all S clones identical is 2^(1 − S), so the proportion of undetermined genotypes is μ = 2^(1 − Sμ) Ho. Therefore, Sμ is the smallest integer greater than or equal to 1 − [Ln(μ/Ho)/Ln(2)]. As expected, the smaller the value of μ, the higher the number of sequences Sμ, for realistic parameter values (μ < Ho). Under strategy 1, the undetermined genotypes are individually identified owing to the direct sequencing step, and all are heterozygotes (Fig. 4).
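This ceiling can be computed directly; a small Python sketch (the function name is mine):

```python
import math

def s_mu(mu, ho):
    """Smallest integer S such that the proportion of heterozygotes left
    undetected after S cloned sequences (all S clones identical, with
    probability 2**(1 - S) per heterozygote) does not exceed mu:
    mu = 2**(1 - S) * Ho  =>  S >= 1 - log2(mu / Ho)."""
    return math.ceil(1 - math.log2(mu / ho))
```

For example, tolerating μ = 0.01 at Ho = 0.6 requires sequencing 7 clones per putative heterozygote.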

The average number of sequences required for characterizing genotypes under strategy 2, as a function of S1 and S2, is: T2 = S1 + S2 [2^(1 − S1) Ho + (1 − Ho)].

Under this strategy, the proportion of individuals which are heterozygous and for which the sequence of one allele is missing is the proportion of individuals which are heterozygous (Ho) and which display a single allele after the first sequencing round (2^(1 − S1)), and for which the second round of S2 sequences provided S2 sequences identical to the one obtained in the first round (2^(−S2)); this proportion is therefore Ho · 2^(1 − S1 − S2). Under strategy 2, the individuals whose genotypes are not fully determined (only one haplotype recovered) cannot be individually identified, but instead are erroneously reported as homozygotes (Fig. 4). Noting S2μ the threshold value of S2 corresponding to the proportion of undetermined genotypes μ (for a given S1), S2μ is obtained by equating μ with the proportion of individuals which are heterozygous and for which all (S1 + S2μ) clones sequenced were identical. S2μ is the smallest integer greater than or equal to:

  1 − S1 − [Ln(μ/Ho)/Ln(2)]

Replacing S2μ by its expression as a function of Ho, μ and S1 in T2 gives:

  T2(S1) = S1 + {1 − S1 − [Ln(μ/Ho)/Ln(2)]} · [2^(1 − S1) Ho + (1 − Ho)]

The next step is to determine the value of S1 minimizing T2 for a given value of μ, noted S1μ. This requires studying the variation of the function T2(S1), which reveals that there is a single value of S1 minimizing T2 (Appendix S2). The minimum relevant number of clones to sequence is 2, and for values exceeding 10 sequences, the probability that a heterozygote displays only identical clones drops below 2^(−9) (ca. 2 × 10^(−3)), which represents a very low proportion of genotypes that will not be fully determined. Therefore, T2 can be computed for all relevant integer values of its argument (starting from two) until the minimum is found, to deduce S1μ. This was carried out to find the results presented in Table 3 and Fig. 4.
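The scan described above can be sketched as follows (a sketch, not the author's code; function names are mine, and S2 is taken as the threshold ceiling derived earlier, floored at zero for very large S1):

```python
import math

def t2(s1, mu, ho):
    """Average number of sequences per individual under strategy 2,
    with S2 set to its threshold value for the tolerated proportion mu
    of nonfully determined genotypes."""
    s2 = max(math.ceil(1 - s1 - math.log2(mu / ho)), 0)
    return s1 + s2 * (2 ** (1 - s1) * ho + (1 - ho))

def optimal_s1(mu, ho, s1_max=10):
    """T2(S1) has a single minimum (Appendix S2), so scanning the
    integers from 2 to about 10 is sufficient."""
    return min(range(2, s1_max + 1), key=lambda s1: t2(s1, mu, ho))
```

Consistent with the results reported below, this scan gives an optimal first round of three clones for μ = 0.01 and four clones for μ = 0.001 at high Ho.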

Table 3.   Threshold values of Ho for different proportions of nonfully determined genotypes.

                          μ = 0.001    μ = 0.01    μ = 0.03
  Threshold Ho            0.59         0.60        0.61
  T (at threshold Ho)     7            5.15        4.3

The minimum average number of sequences per genotype [min(T1, T2)] as a function of Ho, for a proportion of nonfully determined genotypes of μ = 0.01, is represented in Fig. 4. The threshold Ho value under which strategy 1 outperforms strategy 2 is close to 0.60. Strikingly, when stringency varies (μ = 0.001 or 0.03), the Ho threshold is nearly invariant, between 0.59 and 0.61 (Table 3). The mean number of sequences to perform per individual is never higher than 4.3 for μ = 0.03, 5.15 for μ = 0.01, and 7 for μ = 0.001. These maxima correspond to the threshold values of polymorphism (Ho).

When strategy 2 is better (i.e. Ho > threshold), the optimal number of clones to sequence in the first round, S1μ, is larger than two: three for μ = 0.03 or 0.01 (for all values of Ho), and four for μ = 0.001. With μ = 0.01, the number of sequences to perform to obtain an individual genotype is never higher than 8 (max = 3 + 5 sequences per individual, which occurs when Ho > 0.64), but the average number of sequences per determined genotype is much lower (4–5) over this range of Ho values (Fig. 4).

Optimizing the timing and the parameters of the SDL test

The above deductions depend on the hypothesis that the marker is a SDL; thus, the test of this hypothesis should be carried out as soon as possible, to avoid wasting experimental work in case it is rejected (Table 2). Nevertheless, it is worth avoiding CS performed solely for this test when routine genotyping is expected to rapidly provide sufficient data for it. The right part of Fig. 5 (OGS) displays a flowchart of the distinct phases (preliminary estimation of Ho, SDL test and routine genotyping), whose order, interactions and parameters were optimized as explained below.


Figure 5.  Global view of the phases required for reliable and optimized sequence genotyping, for the different possible cases. Next-generation sequencing (NGS) or OGS approaches are represented, with two cases for NGS, according to whether or not the locus was confirmed as a single diploid locus (SDL) and its observed heterozygosity was previously known. The grey background highlights the steps undertaken to test the SDL hypothesis. For OGS approaches, when Ho was assessed by cloning and sequencing four clones per individual (CS4), an initial SDL test can be performed without obtaining additional data, whereas when Ho was estimated after direct amplicon sequencing, cloning is required for the initial SDL testing, and it is optimal to choose the number of clones to sequence according to Ho (see text); in both cases, data obtained subsequently should be used to improve the SDL test with higher sample sizes (dotted arrowed line).


When Ho is lower than 0.6, routine genotyping will employ strategy 1 (DAS, then CS for putative heterozygotes). In such cases, the preliminary phase (dedicated to assessing Ho) allowed identifying <20 putative heterozygotes (32 · Ho) if it employed DAS, or <17 heterozygotes (32 · Hap = 32 · Ho · 7/8) if it used CS (with four sequences per cloning), from which the SDL test will be possible. To optimize the use of the clone sequences obtained, the SDL test should be performed by sequencing five, six or seven clones per individual, according to Ho as recommended in Fig. 4, so that the data can also serve for routine genotyping (Fig. 5). The number of individuals is rather low in such cases, and so is the power of the test. If the SDL hypothesis is not rejected, routine genotyping may be started, and additional data obtained for individuals appearing heterozygous after DAS may be added, or probability values may be combined, to obtain a more powerful SDL test. When the same number of clones was sequenced at the preliminary and subsequent steps, and when heterozygous individuals were identified in the same way (either DAS or CS), the data can be appended to perform the SDL test. In other cases, one can use combined P-values from different SDL tests (Brown, 1975). When heterozygosity is lower than 0.5, the power of the initial SDL tests will be quite low, but few scientists will want to invest in OGS sequence genotyping for loci displaying such moderate polymorphism. When Ho is higher than 0.6, the SDL test will be performed initially from a larger number of individuals. It is nevertheless highly recommended, when SDL was not rejected initially, to include the data subsequently obtained (during routine genotyping) and repeat the test with larger sample sizes and more power.

Tunings to reduce cost, work time and experimental errors

It clearly appeared that Ho strongly influences the accuracy and the effort required to achieve sequence genotyping. All else being equal, lower-Ho situations require smaller numbers of reads. Although this possibility is generally overlooked, there are multiple ways to reduce Ho in a perfectly reliable fashion with sequence data. Haplotypes can be pooled into fewer allelic categories, for instance by considering only indels and not point mutations, only transversions and not transitions, or by considering (or removing) only some particular nucleotide positions, etc.
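Such a pooling can be sketched as follows (a hypothetical illustration; function and variable names are mine). Alleles that are identical over a chosen subset of alignment positions merge into one category, mechanically lowering Ho:

```python
def collapse_haplotypes(haplotypes, keep_positions):
    """Pool haplotypes into fewer allelic categories by retaining only a
    chosen subset of alignment positions. `haplotypes` maps an allele
    name to its aligned sequence; alleles identical over the kept
    positions fall into the same category."""
    categories = {}
    for name, seq in haplotypes.items():
        key = "".join(seq[i] for i in keep_positions)
        categories.setdefault(key, []).append(name)
    return categories

alleles = {"a1": "ACGT", "a2": "ACGA", "a3": "GCGT"}
# Ignoring the last (e.g. artefact-prone) position pools a1 with a2.
merged = collapse_haplotypes(alleles, keep_positions=[0, 1, 2])
```

As stressed in the text, the kept positions must be chosen once and applied identically to all individuals, so that analyses remain internally consistent.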

In NGS, especially when Ho is not known a priori and the sequencing depth obtained may not be satisfactory, one may avoid the step of DAS possibly required for some individuals, for instance by removing a few nucleotide positions. In OGS approaches, when the initial Ho is close to the threshold between strategies 1 and 2, and thus the maximum average number of sequences per genotype is required, reducing Ho may save time and money. In particular, it may allow avoiding the first cloning step. In addition, such a procedure can be used to correct some particular technical artefacts, such as those generated by cloning and PCR (cf. above section on error). Positions subject to such genetic instability during cloning (or other steps) could be eliminated from the alignment. This may also eliminate some real haplotypes, but it will not bias the analyses derived from the data set, provided the polymorphism is neither interpreted for itself (as an absolute value) nor compared with data sets not treated the same way.

Conclusion


Figure 5 summarizes the different procedures to follow to obtain reliable sequence genotypes in three different situations. In the first situation, NGS methods are employed on a locus, or several comparable loci, which are already known (i.e. which were shown to correspond to single diploid loci and whose polymorphism values, Ho, are already estimated). This study provides a simple way to determine an optimal sequencing depth, which, for instance, allows determining how many loci or individuals can be combined in a single NGS run. In the second case, NGS methods are employed on one or several previously unknown loci. The section on sequencing error and the SDL tests led me to recommend a sequencing depth of about 40× (at least 20×) in such cases, so that CWIRN distributions can be properly interpreted and allow testing the SDL hypothesis and detecting a possible impact of sequencing errors. When analysing NGS outputs (the two previous situations), eqn 1 allows computing, for each apparent homozygote, the probability that it is actually heterozygous, according to the observed sequencing depth for this particular individual (see also Fig. 3). The third situation corresponds to the use of traditional sequencing methods (OGS). I showed that a good strategy consists in a preliminary estimation of Ho from about 32 individuals, followed by a test of the SDL hypothesis and then, when SDL is not rejected, by routine genotyping using the optimal strategy illustrated in Fig. 4.
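Eqn 1 itself is not reproduced in this excerpt; assuming it follows the Bayes/binomial reasoning used throughout (a heterozygote yields d identical reads with probability 2^(1−d), a homozygote always does), the per-individual check can be sketched as (function names are mine):

```python
def p_het_given_uniform_reads(ho, d):
    """Bayes: P(heterozygote | all d reads identical), given the
    population observed heterozygosity Ho. A heterozygote produces d
    identical reads with probability 2**(1 - d); a homozygote always
    does. Sketch of the logic attributed to eqn 1, whose exact form is
    not reproduced in this excerpt."""
    miss = ho * 2 ** (1 - d)
    return miss / (miss + (1 - ho))

def min_depth(ho, threshold=0.01):
    """Smallest read depth d at which an apparent homozygote is a
    hidden heterozygote with probability <= threshold."""
    d = 1
    while p_het_given_uniform_reads(ho, d) > threshold:
        d += 1
    return d
```

As expected from the text, more polymorphic loci (higher Ho) require deeper coverage before homozygosity can be asserted at a given probability threshold.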

These results are based on simple theoretical developments and require neither strong nor unverifiable assumptions (only eqn 3, which is independent of the rest of the manuscript, assumes panmixia). They allow reliably inferring diploid sequence genotypes and provide practical rules to design efficient sequence genotyping strategies for population genetic studies. Previous studies have developed theoretical frameworks for analysing NGS data at the genotype or population level, but their scopes and assumptions were different. Gompert et al. (2010) developed a Bayesian analysis of molecular variance to quantify population genetic structure; sequence data were not individually tagged and the analyses considered populations, not individual genotypes. Huang et al. (2009) developed a genotype-calling approach based on whole-genome re-sequencing data, mainly devoted to mapping; they used sliding windows for accurate identification of recombination breaks and of SNPs resulting from errors. A maximum likelihood method was developed by Lynch (2009). This method, aimed at estimating allele frequencies, provided a robust conceptual framework for genotyping individuals while estimating the error rate at single nucleotide positions. The assumptions of this method were more restrictive than those made in this study: no more than two nucleotides were assumed to segregate per site. From my experience with marine invertebrate taxa, positions in which three or four nucleotides segregate are not rare at all among polymorphic sites (verified for protein-coding, ribosomal and intronic regions). Other assumptions, such as Hardy–Weinberg equilibrium, are also made in that study, though this assumption is said to be easily tested and managed a posteriori.
In the present study, sequencing error rates were not parameterized; rather, the problem of sequencing errors was assumed to be solved before genotyping: a test of the presence of sequencing errors, based on the distribution of individual read numbers, was proposed, as well as an empirical but efficient strategy to detect and remove artefactual alleles (which is not applicable to SNPs because it uses information on allele similarity).

A typical field where this study would be of use is phylogeography, or population genetics, for nonmodel species. About fifty potentially universal introns (EPICs) were identified in metazoans and can be tested for PCR amplification in a couple of days (Chenuil et al., 2010). Each of the promising primer pairs (about 25 introns on average) should be amplified in 20–50 individuals (using MID-tags added to PCR primers to lower costs). Even a portion of a 454 run (for instance 25,000 correct-size reads) is largely sufficient to test those numbers of individuals and loci with a sequencing depth of about 25–75×, allowing the SDL and sequencing-error tests to be performed. For the loci that successfully passed the SDL-SE tests, a second run can be planned with more individuals and a lower sequencing depth, determined according to the Ho obtained from the previous run. When few loci are promising after PCR, few loci will be pooled in the run, so the sequencing depth may easily reach 100×. Such a high coverage potentially allows designing distinct single diploid loci from a single mixture of paralogous loci.
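The run-budgeting arithmetic above can be sketched as a trivial but handy check (function and parameter names are mine), assuming reads are split evenly across MID-tagged individuals and loci:

```python
def depth(total_reads, n_loci, n_individuals):
    """Average sequencing depth per individual and per locus when
    total_reads are split evenly across n_loci amplified in each of
    n_individuals MID-tagged individuals."""
    return total_reads / (n_loci * n_individuals)

# 25,000 correct-size reads over 25 introns:
# 50 individuals -> 20x per individual and locus; 20 individuals -> 50x.
```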

Acknowledgment


Gilbert Maurel is thanked for his reminders about derivatives, and Didier Aurelle for his comments on a preliminary presentation of these ideas.

References

  • Bayes, T. & Price, M. 1763. An essay towards solving a problem in the doctrine of chances. By the late Rev. Mr Bayes, F.R.S. communicated by Mr Price, in a letter to John Canton A.M.F.R.S. Phil. Trans. 53: 370–418.
  • Bentley, G., Higuchi, R., Hoglund, B., Goodridge, D., Sayer, D., Trachtenberg, E.A. et al. 2009. High-resolution, high-throughput HLA genotyping by next-generation sequencing. Tissue Antigens 74: 393–403.
  • Brown, M. 1975. A method for combining non-independent, one-sided tests of significance. Biometrics 31: 987–992.
  • Chenuil, A. 2006. Choosing the right molecular genetic markers for studying biodiversity: from molecular evolution to practical aspects. Genetica 127: 101–120.
  • Chenuil, A., Hoareau, T.B., Egea, E., Penant, G., Rocher, C., Aurelle, D. et al. 2010. An efficient method to find potentially universal population genetic markers, applied to metazoans. BMC Evol. Biol. 10: 276.
  • Dixon, C.J. 2009. OLFinder: a program which disentangles DNA sequences containing heterozygous indels. Mol. Ecol. Res. 10: 335–340.
  • Egea, E. 2011. Histoire évolutive, structures génétique, morphologique et écologique comparées dans un complexe d’espèces jumelles: Echinocardium cordatum (Echinoidea, Irregularia). PhD thesis, 321 pp. Aix-Marseille université, Marseille.
  • Ekblom, R. & Galindo, J. 2011. Applications of next generation sequencing in molecular ecology of non-model organisms. Heredity 107: 1–15.
  • Faure, B., Bierne, N., Tanguy, A., Bonhomme, F. & Jollivet, D. 2007. Evidence for a slightly deleterious effect of intron polymorphisms at the EF1 alpha gene in the deep-sea hydrothermal vent bivalve Bathymodiolus. Gene 406: 99–107.
  • Galan, M., Guivier, E., Caraux, G., Charbonnel, N. & Cosson, J.F. 2010. A 454 multiplex sequencing method for rapid and reliable genotyping of highly polymorphic genes in large-scale studies. BMC Genomics 11: 296.
  • Garrick, R.C., Sunnucks, P. & Dyer, R.J. 2010. Nuclear gene phylogeography using PHASE: dealing with unresolved genotypes, lost alleles, and systematic bias in parameter estimation. BMC Evol. Biol. 10: 118.
  • Gompert, Z., Forister, M.L., Fordyce, J.A., Nice, C.C., Williamson, R.J. & Buerkle, C.A. 2010. Bayesian analysis of molecular variance in pyrosequences quantifies population genetic structure across the genome of Lycaeides butterflies. Mol. Ecol. 19: 2455–2473.
  • Huang, X.H., Feng, Q., Qian, Q., Zhao, Q., Wang, L., Wang, A.H. et al. 2009. High-throughput genotyping by whole-genome resequencing. Genome Res. 19: 1068–1076.
  • Lynch, M. 2009. Estimation of allele frequencies from high-coverage genome-sequencing projects. Genetics 182: 295–301.
  • Meacham, F., Boffelli, D., Dhahbi, J., Martin, D.I.K., Singer, M. & Pachter, L. 2011. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12: 451.
  • Nielsen, R., Paul, J.S., Albrechtsen, A. & Song, Y.S. 2011. Genotype and SNP calling from next-generation sequencing data. Nat. Rev. Genet. 12: 443–451.
  • Nothnagel, M., Herrmann, A., Wolf, A., Schreiber, S., Platzer, M., Siebert, R. et al. 2011. Technology-specific error signatures in the 1000 Genomes Project data. Hum. Genet. 130: 505–516.
  • Stephens, M. & Donnelly, P. 2003. A comparison of Bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73: 1162–1169.
  • Sunnucks, P., Wilson, A.C., Beheregaray, L.B., Zenger, K., French, J. & Taylor, A.C. 2000. SSCP is not so difficult: the application and utility of single-stranded conformation polymorphism in evolutionary biology and molecular ecology. Mol. Ecol. 9: 1699–1710.
  • Wattier, R., Engel, C.R., Saumitou-Laprade, P. & Valero, M. 1998. Short allele dominance as a source of heterozygote deficiency at microsatellite loci: experimental evidence at the dinucleotide locus Gv1CT in Gracilaria gracilis (Rhodophyta). Mol. Ecol. 7: 1569–1573.

Supporting Information


Appendix S1 Detection of departure from the SDL hypothesis, detection of the phenomenon of sequencing error and determination of the copy number in case of multiple loci (polyploidy or paralogy) using the cumulated distribution of within-individual read numbers.

Appendix S2 Finding the value of S1 minimizing T2, given μ and H0.
