Favourable case: genotypes can be deduced from double chromatograms. In favourable situations, the two sequences can be deduced, after DAS, from ambiguous (double) chromatograms. This generally occurs when polymorphism and sequence length are limited. The most frequent alleles are often identified from the sequences of homozygous individuals, so that diploid sequence genotypes can be deduced from ambiguous double sequences relatively easily. If the two mixed sequences are of moderate length, there is generally no shift between their corresponding chromatogram peaks; dedicated software can then be used to analyse the sequence files written with the IUPAC ambiguity code (Dixon, 2009; Garrick et al., 2010; Stephens & Donnelly, 2003). These programs provide estimates of the probability of correct inference for each genotype or allele (but see Garrick et al., 2010), and the presence of indels is not a problem, contrary to what is often supposed. When the two sequences become shifted at some point in the chromatogram, the programs cited above cannot be used; by contrast, the shift often allows the scientist to distinguish the two mixed sequences visually, and sequence genotypes may then be obtained in some favourable cases, generally when polymorphism is low. Cloning may thus not be required to determine both sequences. With such markers, however, cloning a few individuals is still useful to test that the marker behaves as a single diploid locus (SDL test).
In the general case, cloning is required at least for some individuals. Figure 4 schematizes the explanations below and displays the different phases for this general situation, according to polymorphism values (Ho).
Figure 4. Flowchart showing the optimal strategy, steps and parameters (as a function of the value of Ho, represented on the horizontal axis) for OGS in the general case (i.e. when heterozygotes cannot be deduced from chromatograms). The average total number of sequences per individual is represented by the thick grey dotted curve behind the text and flowchart symbols (which are the same as in Fig. 2). The parameter values (number of clones to sequence, and threshold Ho values) are shown for μ = 0.01 (99% of fully determined genotypes). Note that the threshold Ho value (around 0.6) is nearly invariant when μ varies and can be considered a general result. Strategy 1 (direct amplicon sequencing, followed by clone sequencing when necessary) is the best for Ho lower than 0.6; for higher Ho, strategy 2 is the best (cloning and sequencing three clones in a first step, then sequencing additional clones when necessary). Single diploid locus (SDL) boxes indicate which frequency functions (eqns 4 or 5) must be used, and their position indicates at which steps of routine genotyping the data can be used to test the SDL hypothesis (combined with or appended to any SDL tests performed after the preliminary phase).
After a cloning step, the strategy that minimizes the number of sequencing reactions consists in sequencing only two clones and then, when both clones are identical, adding clone sequences one by one until the second allele is obtained. However, this solution would be tedious and inefficient with respect to the organization of laboratory work. I thus decided to focus on simple strategies respecting the constraint that only two sequencing sessions are carried out per individual. Two such strategies are possible. In strategy 1, the first step consists in DAS for a set of individuals and is followed, if necessary, for unreadable (heterozygous) individuals, by a second step of CS, in which S clones are sequenced per individual (S being chosen so as to minimize the average total number of sequences). In strategy 2, a first set of S1 clones is cloned and sequenced in the first step; then, for the individuals in which all clones display the same sequence, a second set of S2 clones is sequenced (S1 and S2 being chosen so as to minimize the average total number of sequences). To determine the threshold value of Ho under which strategy 1 is better than strategy 2, I used the following parameters and computations. T1 (respectively T2) represents the average number of sequences per individual necessary to obtain a proportion (1 − μ) of fully determined genotypes (both alleles known) under strategy 1 (respectively strategy 2). Denoting by Sμ the number of sequences necessary to ensure that the proportion of individuals whose genotypes are not fully determined does not exceed μ, and summing the first step (one direct sequence per individual) and the second step (Sμ clones for the proportion Ho of heterozygous individuals), we obtain:
$$T_1 = 1 + H_o S_\mu$$
From a heterozygous individual, the probability that all S clones are identical is $2^{1-S}$, so the proportion of undetermined genotypes is $\mu = 2^{1-S_\mu} H_o$. Therefore, Sμ is the smallest integer greater than or equal to $1 - \ln(\mu/H_o)/\ln 2$. As expected, the smaller the value of μ, the higher the number of sequences Sμ for realistic parameter values (μ < Ho). Under strategy 1, the undetermined genotypes are individually identified owing to the direct sequencing step and all are heterozygotes (Fig. 4).
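The computation of Sμ and of the strategy 1 cost can be sketched in a few lines of Python. This is only an illustration: the function names `s_mu` and `t1` are hypothetical, and the expression used for T1 (one direct sequence per individual plus Sμ clones for the heterozygous fraction Ho) is an assumption based on the description of strategy 1.

```python
import math


def s_mu(ho: float, mu: float) -> int:
    """Smallest integer S such that 2**(1 - S) * ho <= mu."""
    if not 0 < mu < ho <= 1:
        raise ValueError("expected 0 < mu < ho <= 1")
    # S >= 1 - ln(mu/ho)/ln(2), i.e. S >= 1 - log2(mu/ho)
    return math.ceil(1 - math.log2(mu / ho))


def t1(ho: float, mu: float) -> float:
    """Assumed strategy 1 cost: 1 direct sequence + S_mu clones for heterozygotes."""
    return 1 + ho * s_mu(ho, mu)


if __name__ == "__main__":
    for ho in (0.2, 0.5, 0.7):
        print(ho, s_mu(ho, 0.01), t1(ho, 0.01))
```

For Ho = 0.5 and μ = 0.01, seven clones are needed, since 2^(1−7) × 0.5 ≈ 0.008 ≤ 0.01 whereas six clones would leave about 0.016 of genotypes undetermined.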
The average number of sequences required for characterizing genotypes under strategy 2 as a function of S1 and S2 is:
$$T_2 = S_1 + S_2\left[2^{1-S_1} H_o + (1 - H_o)\right]$$
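Replacing S2 by its expression as a function of Ho, μ and S1 in T2 (S2 = Sμ − S1, because the S1 + S2 clones sequenced in total must leave no more than a proportion μ of undetermined genotypes) gives:
$$T_2 = S_1 + (S_\mu - S_1)\left[2^{1-S_1} H_o + (1 - H_o)\right]$$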
The next step is to determine the value(s) of S1 minimizing T2 for a given value of μ. This requires studying the variation of the function T2(S1), which shows that a single value of S1 minimizes T2 (Appendix S2). The minimum relevant number of clones to sequence is 2, and for values exceeding 10 sequences, the probability that a heterozygote displays only identical clones drops below $2^{-9}$ (ca. $2 \times 10^{-3}$), which represents a very low proportion of genotypes that will not be fully determined. Therefore, T2 can be computed for all relevant integer values of its argument (starting from two) until the minimum is found, which gives the optimal value of S1. This was carried out to find the results presented in Table 3 and Fig. 4.
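The exhaustive search over S1 described above can be sketched as follows. This is a minimal illustration under two assumptions: that the second round contains S2 = Sμ − S1 clones (so that S1 + S2 clones in total bring the undetermined proportion below μ), and that S1 is scanned from 2 up to Sμ. The names `s_mu`, `t2` and `best_s1` are hypothetical.

```python
import math


def s_mu(ho: float, mu: float) -> int:
    """Smallest integer S such that 2**(1 - S) * ho <= mu."""
    return math.ceil(1 - math.log2(mu / ho))


def t2(ho: float, mu: float, s1: int) -> float:
    """Average sequences per individual under strategy 2 for a given S1."""
    s2 = max(s_mu(ho, mu) - s1, 0)  # assumed size of the second round
    # Individuals needing a second round: heterozygotes whose S1 clones
    # are all identical, plus all homozygotes.
    p_identical = 2 ** (1 - s1) * ho + (1 - ho)
    return s1 + s2 * p_identical


def best_s1(ho: float, mu: float) -> int:
    """Integer S1 (from 2 upwards) minimizing T2."""
    upper = max(s_mu(ho, mu), 2)
    return min(range(2, upper + 1), key=lambda s1: t2(ho, mu, s1))


if __name__ == "__main__":
    for mu in (0.001, 0.01, 0.03):
        s1 = best_s1(0.7, mu)
        print(mu, s1, round(t2(0.7, mu, s1), 3))
```

With Ho = 0.7 and μ = 0.01, this sketch selects S1 = 3 with T2 of about 5.375; with μ = 0.001 it selects S1 = 4.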
Table 3. Threshold values of Ho for different proportions of nonfully determined genotypes.
| | μ = 0.001 | μ = 0.01 | μ = 0.03 |
| T (at threshold Ho) | 7 | 5.15 | 4.3 |
The minimum average number of sequences per genotype [min(T1, T2)] as a function of Ho for a proportion of nonfully determined genotypes of μ = 0.01 is represented in Fig. 4. The threshold Ho value under which strategy 1 is preferable to strategy 2 is close to 0.60. Strikingly, when stringency varies (μ = 0.001 or 0.03), the Ho threshold is nearly invariant, between 0.59 and 0.61 (Table 3). The mean number of sequences to perform per individual is never higher than 4.3 for μ = 0.03, 5.15 for μ = 0.01, and 7 for μ = 0.001. These maxima correspond to the threshold values of polymorphism (Ho).
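The comparison of the two strategies can be explored numerically with a simple grid scan over Ho. The sketch below rests on assumptions that are not spelled out in the text: T1 = 1 + Ho·Sμ, S2 = Sμ − S1, an integer Sμ, and a grid step of 0.001. The function names are hypothetical. Because of these choices, the values it returns should land close to, but not necessarily exactly on, those reported in Table 3.

```python
import math


def s_mu(ho, mu):
    """Smallest integer S such that 2**(1 - S) * ho <= mu."""
    return math.ceil(1 - math.log2(mu / ho))


def t1(ho, mu):
    """Assumed strategy 1 cost: direct sequence + S_mu clones for heterozygotes."""
    return 1 + ho * s_mu(ho, mu)


def min_t2(ho, mu):
    """Strategy 2 cost at the best S1, assuming S2 = S_mu - S1."""
    total = max(s_mu(ho, mu), 2)

    def t2(s1):
        return s1 + (total - s1) * (2 ** (1 - s1) * ho + (1 - ho))

    return min(t2(s1) for s1 in range(2, total + 1))


def threshold_ho(mu, steps=1000):
    """Smallest grid value of Ho (> mu) at which strategy 2 beats strategy 1."""
    for k in range(1, steps + 1):
        ho = k / steps
        if ho > mu and min_t2(ho, mu) < t1(ho, mu):
            return ho
    return None  # strategy 1 is never worse on this grid


if __name__ == "__main__":
    for mu in (0.001, 0.01, 0.03):
        print(mu, threshold_ho(mu))
```

Under these assumptions, the crossover stays near Ho = 0.6 for the three values of μ, in line with the near-invariance noted above.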
When strategy 2 is better (i.e. Ho > threshold), the optimal number of clones to sequence at the first round is larger than two: three for μ = 0.03 or 0.01 (for all values of Ho), and four for μ = 0.001. With μ = 0.01, the number of sequences to perform to obtain an individual genotype is never higher than 8 (max = 3 + 5 sequences per individual, which occurs when Ho > 0.64), but the average number of sequences per determined genotype is much lower (4–5) for this range of Ho values (Fig. 4).