Comparing the van Oosterhout and Chybicki-Burczyk methods of estimating null allele frequencies for inbred populations


  • P. Campagne,

    Corresponding author
    1. IRD, UR072 – BEI C/O CNRS, Laboratoire Évolution Génomes et Spéciation, Gif sur Yvette Cedex, France
    2. NSBB Project, IRD/icipe – African Insect Science for Food and Health, Duduville Campus, Nairobi, Kenya
    3. Université Paris-Sud 11, Orsay cedex, France
    4. Department of Ecology, Evolution & Natural Resources, School of Environmental & Biological Sciences, Rutgers University, New Brunswick, NJ, USA
  • P. E. Smouse,

    1. Department of Ecology, Evolution & Natural Resources, School of Environmental & Biological Sciences, Rutgers University, New Brunswick, NJ, USA
  • G. Varouchas,

    1. Université Paris-Sud 11, Orsay cedex, France
  • J.-F. Silvain,

    1. IRD, UR072 – BEI C/O CNRS, Laboratoire Évolution Génomes et Spéciation, Gif sur Yvette Cedex, France
    2. Université Paris-Sud 11, Orsay cedex, France
  • B. Leru

    1. IRD, UR072 – BEI C/O CNRS, Laboratoire Évolution Génomes et Spéciation, Gif sur Yvette Cedex, France
    2. NSBB Project, IRD/icipe – African Insect Science for Food and Health, Duduville Campus, Nairobi, Kenya
    3. Université Paris-Sud 11, Orsay cedex, France


In spite of the usefulness of codominant markers in population genetics, the existence of null alleles raises challenging estimation issues in natural populations that are characterized by positive inbreeding coefficients (F > 0). Disregarding the possibility of F > 0 in a population will generally lead to overestimates of null allele frequencies. Conversely, estimates of inbreeding coefficients (F) may be strongly biased upwards (excess homozygotes) in the presence of nontrivial frequencies of null alleles. An algorithm has been presented for the estimation of null allele frequencies in inbred populations (van Oosterhout method), using external estimates of the F-statistics. The goal of this study is to introduce a modification of this method and to provide a formal comparison with an alternative likelihood-based method (Chybicki-Burczyk). Using simulated data, we illustrate the strengths and limitations of these competing methods. Under most circumstances, the likelihood method is preferable, but for highly inbred organisms, a modified van Oosterhout method offers some advantages.


Population geneticists have been traditionally interested in homozygosity excess, reflecting either inbreeding or hidden population structure (e.g. see Chakraborty & Li 1992). In this vein, null alleles have long been an issue for widely deployed genetic markers, and that remains true for microsatellites, which may be characterized by an increased frequency of null alleles (Dakin & Avise 2004). Moreover, many data originating from next-generation sequencing and chip-based approaches also suffer from severe null allele problems (Franke et al. 2008; Hohenlohe et al. 2011; Pfender et al. 2011).

The utility of codominant markers in revealing heterozygote deficiency and population structure (notably FIS and FST) is complicated by the presence of hidden null alleles (Chakraborty et al. 1992; Brookfield 1996; Chapuis & Estoup 2007), because they contribute to apparent homozygote excess, over and above the effects of inbreeding and reduced gene flow. Undetected null alleles may also introduce bias into the estimation of allelic frequencies, affecting any subsequent population genetic analyses (Dakin & Avise 2004). Conversely, failure to account for inbreeding (FIS > 0) will lead to overestimates of null allele frequencies (Van Oosterhout et al. 2006).

Joint estimation of both the inbreeding coefficient (F) and the null allele frequency (r) is a formidable challenge (see Chybicki & Burczyk 2009). Different estimation methods have been developed to estimate or establish both parameters. One might estimate the null allele frequency (r), using an externally supplied estimate of F. Alternatively, one might jointly estimate F and a set of r-coefficients, employing a different r-parameter for each locus.

There are currently two methods of dealing with the estimation of null allele frequencies in nonequilibrium populations. First, Van Oosterhout et al. (2006) have proposed an algorithm using external estimates of F. The method has been implemented within an Excel macro (Null Allele Estimator) and is henceforth referred to as the VO method. Second, Chybicki & Burczyk (2009) have designed an alternative likelihood-based method for the simultaneous estimation of F and r, implemented in the INEst software and henceforth referred to as the CB method.

The VO method has been used in several empirical studies (e.g. Basic & Besnard 2006; Perrin et al. 2007; Potter et al. 2008; Billard et al. 2010; Elias et al. 2010; Shirk et al. 2010) and has also been proposed to adjust allele frequencies by including provision for a null allele (e.g. Dewoody et al. 2006; Carlon & Lippé 2007). Notwithstanding its popularity, theoretical validation for this method remains unavailable, and its reliability remains at issue. While Van Oosterhout et al. (2006) state that the system ‘returns a single real solution’, the possible existence of two solutions is acknowledged in the implementation of the method (null allele estimator). The matter needs further attention.

The CB method was published with a detailed evaluation of its performance (Chybicki & Burczyk 2009), but only over a restricted range of conditions: sample sizes (N > 50); moderately inbred populations (F < 0.2); and L = 5 or 10 loci. It would be useful to evaluate the performance of this method over the wider range of parameters encountered in practice (e.g. N < 50, F > 0.2, L > 10).

The aims of this comparison were (i) to provide some clarifications on the VO and CB methods, using theoretical analysis and simulation; (ii) to assess their actual performance when dealing with a wide range of inbreeding, by comparing a modification (VOm) and the CB method; and (iii) to improve our understanding of the arrays of situations that favour one method over the other.

Materials and methods

System of equations

Assuming that amplification failures are due to a mutation in the primer recognition sequence, the frequency of individuals in the data set that exhibit a single allelic product at a locus is the sum of two terms: the proportion of true homozygotes and the proportion of heterozygote individuals carrying a null allele. The equation used in the Van Oosterhout et al. (2006) study is:

\[ y_k = p_k^2 + F\,p_k(1 - p_k) + 2\,p_k\,r\,(1 - F) \tag{1} \]

where yk is the observed frequency of apparent homozygotes for the k-th allele in the data set (including both true homozygotes and those due to a heterozygous null allele); pk is the real frequency of the k-th allele; r is the frequency of the null allele; and F is the mean inbreeding coefficient or fixation index.

The quadratic equation (Eqn. (1)) can be rearranged to return a single solution for each (non-null) allele frequency, consistent with all parameters being within appropriate bounds:

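\[ p_k = \frac{-\left[F + 2r(1 - F)\right] + \sqrt{\left[F + 2r(1 - F)\right]^2 + 4(1 - F)\,y_k}}{2(1 - F)} \tag{2} \]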

Because the sum of allele frequencies equals 1, the following constraint is added:

\[ r + \sum_{k=1}^{K} p_k = 1 \tag{3} \]

Substitution of Eqn. (2) into Eqn. (3) yields:

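\[ r + \sum_{k=1}^{K} \frac{-\left[F + 2r(1 - F)\right] + \sqrt{\left[F + 2r(1 - F)\right]^2 + 4(1 - F)\,y_k}}{2(1 - F)} = 1 \tag{4} \]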

We can observe yk, so if we supply an external estimate of F, Eqn. (4) yields an estimate of r. The number of solutions to Eqn. (4) can vary, and the conditions under which there are 0, 1 or 2 admissible solutions require formal examination.
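The varying number of solutions can be explored numerically. The following Python sketch is an illustration only, not the Null Allele Estimator macro or the authors' R script. It assumes that Eqn. (1) has the standard inbreeding form y_k = p_k² + F·p_k(1 − p_k) + 2·p_k·r(1 − F) and that F < 1; the function names (`allele_freq`, `vo_residual`, `vo_solutions`) and the grid-scan strategy are hypothetical choices made here.

```python
import math

def allele_freq(y_k, r, F):
    """Positive root of (1-F) p^2 + [F + 2r(1-F)] p - y_k = 0 (assumed form of Eqn. 1)."""
    a = 1.0 - F                      # requires F < 1
    b = F + 2.0 * r * (1.0 - F)
    return (-b + math.sqrt(b * b + 4.0 * a * y_k)) / (2.0 * a)

def vo_residual(r, y, F):
    """Left-hand side minus right-hand side of the constraint r + sum(p_k) = 1."""
    return r + sum(allele_freq(y_k, r, F) for y_k in y) - 1.0

def vo_solutions(y, F, grid=2000, tol=1e-12):
    """Scan r over [0, 1] and bisect every sign change of the residual."""
    roots = []
    lo, g_lo = 0.0, vo_residual(0.0, y, F)
    for i in range(1, grid + 1):
        hi = i / grid
        g_hi = vo_residual(hi, y, F)
        if g_lo == 0.0:
            roots.append(lo)         # residual vanishes exactly on a grid point
        elif g_lo * g_hi < 0.0:
            a, b, g_a = lo, hi, g_lo
            while b - a > tol:
                mid = 0.5 * (a + b)
                g_mid = vo_residual(mid, y, F)
                if g_a * g_mid <= 0.0:
                    b = mid
                else:
                    a, g_a = mid, g_mid
            roots.append(0.5 * (a + b))
        lo, g_lo = hi, g_hi
    return roots

# Illustrative input: y_k computed from p = (0.5, 0.3), r = 0.2, F = 0.1
p_true, r_true, F_true = [0.5, 0.3], 0.2, 0.1
y_obs = [p * p + F_true * p * (1 - p) + 2 * p * r_true * (1 - F_true) for p in p_true]
solutions = vo_solutions(y_obs, F_true)
```

For this illustrative input, the scan returns two roots: the generating value r = 0.2 and a second, larger root. Because each p_k is a convex function of r under the assumed form, the residual can cross zero at most twice, which is consistent with the 0, 1 or 2 solutions discussed in the text.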

Note that Eqn. (1) of Van Oosterhout et al. (2006) does not explicitly incorporate the proportion of null allele homozygotes (‘no types’) when calculating the yk. We here propose a modified formulation of the problem (henceforth VOm) that adds this component, and will show that the modification improves the performance of the estimators considerably.


To assess the efficacy of all methods investigated here, we evaluated simulated data sets, generated according to the following procedure:

  1. For a single locus, we generated a random distribution of K + 1 alleles {a0, a1, a2, …, aK}, with k = 0 for the null allele (with frequency r) and k = 1,…, K for the assayable alleles, with frequencies pk, respectively.
  2. Specifying an inbreeding coefficient (0 < F < 1), we generated a (K + 1) × (K + 1) matrix of genotypic products:
    1. Homozygote individuals:
      \[ P(a_k a_k) = p_k^2 + F\,p_k(1 - p_k) \]
    2. as well as for homozygous ‘no types’:
      \[ P(a_0 a_0) = r^2 + F\,r(1 - r) \]
    3. Heterozygote individuals:
      \[ P(a_g a_k) = 2\,p_g\,p_k\,(1 - F), \quad g \neq k, \quad p_0 = r \]
      where P(akak) and P(agak) are the probabilities of the homozygote and heterozygote genotypes, respectively, and where g, k = 0,…, K.
  3. We used this collection of probabilities, whose sum is ‘1’, to generate N random genotypes, according to a multinomial distribution.
  4. Heterozygote genotypes carrying a null allele were tallied as (apparent) homozygotes for the assayable allele. Individuals homozygous for the null allele were ignored when running the VO algorithm, because the method does not account for missing values. For the modified VO method (henceforth VOm) and the CB method, we allowed for null homozygotes (‘no types’) in the analysis.
  5. For empirical data sets with excesses of heterozygotes, one is not usually concerned with either inbreeding or null alleles, so we removed simulated trials with an excess of heterozygotes.
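Steps 1 to 4 of the procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' R code: the genotype probabilities are taken to be the standard inbreeding expressions (p² + Fp(1 − p) for homozygotes, 2·p_g·p_k(1 − F) for heterozygotes), allele 0 denotes the null allele, and all function names are hypothetical. Step 5 (discarding trials with heterozygote excess) is not implemented.

```python
import random

def genotype_probs(freqs, F):
    """Genotype probabilities for alleles 0..K, where freqs[0] is the null allele frequency r."""
    probs = {}
    n = len(freqs)
    for g in range(n):
        for k in range(g, n):
            if g == k:
                probs[(g, k)] = freqs[g] ** 2 + F * freqs[g] * (1.0 - freqs[g])
            else:
                probs[(g, k)] = 2.0 * freqs[g] * freqs[k] * (1.0 - F)
    return probs

def simulate_phenotypes(freqs, F, N, rng):
    """Draw N genotypes (multinomial sampling) and convert them to observable phenotypes."""
    probs = genotype_probs(freqs, F)
    genotypes = rng.choices(list(probs), weights=list(probs.values()), k=N)
    phenotypes = []
    for g, k in genotypes:
        if g == 0 and k == 0:
            phenotypes.append(None)      # null homozygote: 'no type'
        elif g == 0:
            phenotypes.append((k, k))    # null heterozygote scored as apparent homozygote
        else:
            phenotypes.append((g, k))
    return phenotypes

def apparent_homozygote_freqs(phenotypes, K):
    """Frequencies y_k of single-band phenotypes among typed individuals."""
    typed = [ph for ph in phenotypes if ph is not None]
    return [sum(1 for ph in typed if ph == (k, k)) / len(typed) for k in range(1, K + 1)]

# Illustrative call: r = 0.2, two assayable alleles, F = 0.1, N = 200
sample = simulate_phenotypes([0.2, 0.5, 0.3], 0.1, 200, random.Random(42))
y_sample = apparent_homozygote_freqs(sample, K=2)
```

Dropping the `None` entries before tallying mimics the treatment of ‘no types’ as missing data; keeping their count is what the VOm and CB analyses require.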

Simulations were performed on multilocus data sets, considering the entire range of possible F-values and using realistic values of r, in order to compare the performance of the methods. L-locus samples of size N were generated and the number of alleles at a locus (K) was allowed to vary randomly between 5 and 15, according to an equiprobable distribution within a simulated data set. Null allele frequencies were randomly drawn for each locus, using a negative exponential distribution approximating an empirical distribution of null alleles published by Dakin & Avise (2004) (Appendix S2, Supporting information). The distribution was bounded above by r = 0.33, allowing for an approximately 10% maximum missing value fraction for any one locus in an equilibrium (F = 0) population. F-values used to generate the samples were drawn randomly from a uniform distribution, to explore the behaviour of the methods under a wide range of nonequilibrium situations. Simulations were run for two different sample sizes of N = 25 and N = 50 and for two different numbers of loci (L = 10 and L = 20). Simulations were replicated 5000 times for each (N, L) combination.
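The sampling design described in this paragraph might be sketched as below. The exponential mean used here (`ASSUMED_MEAN_R = 0.1`) is a placeholder chosen for illustration; the fitted distribution is given in Appendix S2 and is not reproduced in this text. The truncation by rejection sampling and the function names are likewise assumptions.

```python
import random

R_MAX = 0.33            # upper bound on r stated in the text
ASSUMED_MEAN_R = 0.1    # hypothetical mean of the exponential; see Appendix S2 for the fitted value

def draw_null_frequency(rng, mean=ASSUMED_MEAN_R, upper=R_MAX):
    """Draw r from an exponential distribution truncated at `upper` (rejection sampling)."""
    while True:
        r = rng.expovariate(1.0 / mean)
        if r <= upper:
            return r

def draw_dataset_design(rng, n_loci):
    """One simulated data set: a shared F, plus K and r drawn independently for each locus."""
    return {
        "F": rng.random(),  # uniform on [0, 1)
        "loci": [{"K": rng.randint(5, 15), "r": draw_null_frequency(rng)}
                 for _ in range(n_loci)],
    }

# With F = 0, the expected 'no type' fraction at the upper bound is r^2
max_no_type_fraction = R_MAX ** 2

design = draw_dataset_design(random.Random(1), n_loci=10)
```

The quantity `max_no_type_fraction` evaluates to about 0.109, which corresponds to the approximately 10% maximum missing value fraction mentioned above.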

The analysis of the simulated samples was performed using two types of methods for the estimation of null alleles. For the VO (and VOm) estimation methods, we implemented an R script (R Development Core Team 2011). While the VOm method allowed simultaneous estimation of F and r (see Appendix S3, Supporting information), the actual inbreeding coefficient used to simulate the samples was supplied as the required ‘external estimate’ of F, when using the VO method. How one is to provide such a simultaneous estimate of r and F from real genetic data, for the VOm method, is a matter to which we shall return below.

As an alternative to the VO and VOm methods, we also deployed a likelihood joint-estimation method for r and F (henceforth, labelled CB), proposed by Chybicki & Burczyk (2009). The CB analyses were carried out with the original INEst software (Chybicki & Burczyk 2009), based on the population inbreeding model (PIM), but we automated the repetitive analysis tasks, using Visual Basic scripts. The PIM-based estimator corresponds to the maximum likelihood of the genetic model, using the following multinomial function (L):

display math(5)

where nlg0, nlgk and nl00 represent sample counts of individuals having at the l-th locus phenotypes aga0, agak and a0a0, respectively; n-values are the phenotypic counts; P-values are the allele frequencies; rl is the frequency of the null allele for the l-th locus; Kl is the number of visible alleles (excluding the null allele) for the l-th locus; and c is a combinatorial constant that depends on the phenotypic counts in the sample.
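As a rough illustration of how such a multinomial likelihood can be evaluated, the Python sketch below computes a single-locus log-likelihood from phenotypic counts. It assumes the same phenotype probabilities as in the simulation procedure (apparent homozygotes, visible heterozygotes and ‘no types’), omits the combinatorial constant c, and is not the INEst implementation; the function name and argument layout are hypothetical. A multilocus log-likelihood would be the sum of such terms over loci, with a shared F.

```python
import math

def pim_log_likelihood(p, r, F, n_single, n_het, n_no_type):
    """
    Single-locus log-likelihood (up to an additive constant).
    p         : visible allele frequencies p_1..p_K, with sum(p) + r == 1
    n_single  : n_single[g] = count of single-band phenotypes for visible allele g
    n_het     : dict {(g, k): count} for visible heterozygotes, g < k (0-based indices into p)
    n_no_type : count of individuals with no amplification product
    Counts with zero probability are not handled (math.log would raise an error).
    """
    loglik = 0.0
    for g, count in enumerate(n_single):
        if count:
            prob = p[g] ** 2 + F * p[g] * (1.0 - p[g]) + 2.0 * p[g] * r * (1.0 - F)
            loglik += count * math.log(prob)
    for (g, k), count in n_het.items():
        if count:
            loglik += count * math.log(2.0 * p[g] * p[k] * (1.0 - F))
    if n_no_type:
        loglik += n_no_type * math.log(r * r + F * r * (1.0 - r))
    return loglik

# Counts equal to the expected values for p = (0.5, 0.3), r = 0.2, F = 0.1, N = 1000
counts_single, counts_het, counts_none = [455, 219], {(0, 1): 270}, 56
ll_true = pim_log_likelihood([0.5, 0.3], 0.2, 0.1, counts_single, counts_het, counts_none)
ll_other = pim_log_likelihood([0.6, 0.35], 0.05, 0.1, counts_single, counts_het, counts_none)
```

With counts that match their expectations exactly, the generating parameters give a higher log-likelihood than any other parameter set, as the comparison of `ll_true` and `ll_other` illustrates.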

The VO, VOm and CB estimation methods were applied to the same sets of simulated genotypes. For all three methods, we computed the bias and root-mean-squared error (RMSE) of the estimated frequency (r) of the null allele:

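\[ \mathrm{bias} = \frac{1}{n}\sum_{j=1}^{n} (r_j - \rho_j), \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{j=1}^{n} (r_j - \rho_j)^2} \tag{6} \]

where rj is the estimate and ρj the parametric value from the j-th simulated trial, and n is the number of trials. By virtue of measuring the differences between the estimated and parametric values, the RMSE is a reflection of the accuracy of the estimation and depends on both the squared bias and the variance of the estimator.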
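For concreteness, the two summary statistics can be computed as in the short Python sketch below, which assumes the usual definitions (mean difference, and square root of the mean squared difference); the function name is illustrative.

```python
import math

def bias_and_rmse(estimates, truths):
    """Mean error and root-mean-squared error of estimates against parametric values."""
    diffs = [est - true for est, true in zip(estimates, truths)]
    n = len(diffs)
    bias = sum(diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return bias, rmse

# Two hypothetical trials: errors of -0.1 and +0.1 cancel in the bias but not in the RMSE
bias, rmse = bias_and_rmse([0.1, 0.3], [0.2, 0.2])
```

The example shows why both statistics are reported: an estimator can be nearly unbiased on average while still being inaccurate in individual trials.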

Two solutions (r1 < r2) can sometimes be returned with the VO method (Appendix S1, Supporting information). Knowing the external F-value and each of the solutions (r1 < r2), it is possible to compute, for r1 and for r2, the corresponding binomial probabilities of observing the ‘no-type’ class at its sample frequency, given the frequency expected under each solution. Whenever two solutions existed, we chose the more likely.
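One way to implement this choice is sketched below in Python. It assumes that the expected ‘no-type’ frequency for a candidate r is r² + rF(1 − r) and that the observed number of ‘no types’ among N sampled individuals is binomially distributed; the function names are hypothetical and the details in Appendix S1 may differ.

```python
from math import comb

def no_type_fraction(r, F):
    """Expected frequency of null homozygotes (assumed form: r^2 + rF(1 - r))."""
    return r * r + F * r * (1.0 - r)

def binomial_pmf(k, n, q):
    """Probability of k successes in n trials with success probability q."""
    return comb(n, k) * q ** k * (1.0 - q) ** (n - k)

def pick_solution(r1, r2, F, n_no_type, N):
    """Return whichever candidate makes the observed 'no-type' count more probable."""
    prob1 = binomial_pmf(n_no_type, N, no_type_fraction(r1, F))
    prob2 = binomial_pmf(n_no_type, N, no_type_fraction(r2, F))
    return r1 if prob1 >= prob2 else r2

# Hypothetical candidates r1 = 0.1 and r2 = 0.6 with F = 0.2 and N = 50
few_no_types = pick_solution(0.1, 0.6, 0.2, n_no_type=1, N=50)
many_no_types = pick_solution(0.1, 0.6, 0.2, n_no_type=20, N=50)
```

With one ‘no type’ in 50 individuals the smaller root is retained, whereas 20 ‘no types’ favour the larger root.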


Systems of equations

Van Oosterhout et al. (2006) stated that Eqn. (1) ‘can be solved, and in conjunction with equation (5) [Eqn. (2) here], return[s] a single real solution’. We demonstrate (Appendix S1, Supporting information) that this statement is not correct, and we provide elements to discriminate the cases where the system returns 0, 1 or 2 solutions. Graphically, the solutions represent the intersections between an asymmetrical curve and a horizontal line; we defined r1 < r2 as the left and right intersections (Appendix S1, Supporting Information), respectively. Van Oosterhout et al. (2006) provided an estimate of r (via an Excel macro), but that algorithm sometimes returns solutions for which the sum of allele frequencies (including r) is slightly > 1; such a solution set is not admissible.

This ‘no-solution’ problem appears to result from an incomplete formulation of the problem. Recall that Eqn. (1) does not incorporate the frequency of ‘no types’ (null homozygotes). If we include provision for them, in proportion [r² + rF(1 – r)], the expectation of the kth ‘single-type’ frequency becomes:

\[ y_k = \frac{p_k^2 + F\,p_k(1 - p_k) + 2\,p_k\,r\,(1 - F)}{1 - r^2 - rF(1 - r)} \tag{1m} \]

After simplification, the discriminant (the term under the radical) of (Eqn. (3)) becomes:

display math(7)

positive as long as < 1. That means, we can rewrite (Eqn. (3)) and (Eqn. (4)) as:

display math(3m)


display math(4m)

While the r2 solution in the original VO equations is not a constant, the modified VOm equation (Eqn. (4m)) yields r2m = 1, independently of the value of F, and that solution is excluded de facto. The other solution is r1m < 1 and is admissible, by virtue of the fact that there are assayable alleles with positive frequency (Appendix S1, Supporting Information). The practical solution, if it exists at all, is unique (r1m) and is conditional on the F-value supplied.
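The behaviour of the modified system can be checked numerically. The Python sketch below assumes that the modified expectation is the original single-band probability divided by the typed fraction 1 − r² − rF(1 − r), and solves the constraint r + Σ p_k = 1 by bisection on [0, 1) for a supplied F < 1. It is an illustration with hypothetical function names, not the authors' R implementation.

```python
import math

def no_type_fraction(r, F):
    return r * r + F * r * (1.0 - r)

def vom_allele_freq(y_k, r, F):
    """Positive root of (1-F) p^2 + [F + 2r(1-F)] p - y_k * (typed fraction) = 0."""
    a = 1.0 - F                      # requires F < 1
    b = F + 2.0 * r * (1.0 - F)
    c = y_k * (1.0 - no_type_fraction(r, F))
    return (-b + math.sqrt(b * b + 4.0 * a * c)) / (2.0 * a)

def vom_residual(r, y, F):
    return r + sum(vom_allele_freq(y_k, r, F) for y_k in y) - 1.0

def solve_vom(y, F, tol=1e-12):
    """Return the root r1m below 1, 0.0 if the root would be negative, or None on failure."""
    lo, hi = 0.0, 1.0 - 1e-9
    if vom_residual(lo, y, F) <= 0.0:
        return 0.0                   # estimate bounded below at zero
    if vom_residual(hi, y, F) >= 0.0:
        return None                  # no sign change below 1 (e.g. no observed heterozygotes)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if vom_residual(mid, y, F) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Illustrative input generated from p = (0.5, 0.3), r = 0.2, F = 0.1
p_true, r_true, F_true = [0.5, 0.3], 0.2, 0.1
typed = 1.0 - no_type_fraction(r_true, F_true)
y_obs = [(p * p + F_true * p * (1 - p) + 2 * p * r_true * (1 - F_true)) / typed for p in p_true]
r_hat = solve_vom(y_obs, F_true)
residual_at_one = vom_residual(1.0, y_obs, F_true)
```

In this example the bisection recovers r = 0.2, and the residual at r = 1 is exactly zero whatever the value of F, in line with the statement that r2m = 1 is always a (non-admissible) solution.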

At this stage, the VOm method still requires specification of an F-estimate, and as a practical matter, F (typically, the analytical target of deeper interest) is seldom known a priori. Moreover, each of the L loci ‘fits’ any given F-value to a different degree. We insert an average F-parameter that is most consistent with the variance in the observed heterozygosities of the L loci under examination. We refer the reader to Appendix S3 (Supporting information) for a detailed description of that numerical optimization routine, but it permits simultaneous estimation of a general F-value for the L loci and a separate r-value for each.

Comparison of VO, VOm and CB algorithms

The VO method yielded high rates of algorithmic failure (no solution): for [0 < F < 1/3], [1/3 < F < 2/3] and [2/3 < F < 1], no solution was returned in 17.1, 52.3 and 61.2% of the simulations with N = 25; for the same three intervals, no solution was returned in 16.4, 57.8 and 63.8% of the simulations with N = 50 (Table 1). Algorithmic failure resulted in very few simulated data sets for which null allele frequencies could be estimated for all loci (e.g. for N = 50 and L = 10 loci, only 7.4% of the data sets provided a full set of estimates).

Table 1. Performances of the VO, VOm and CB estimation methods for different ranges of the parametric inbreeding coefficient (F). Analyses were carried out on simulated samples of variable size (N = 25 or 50), with different numbers of loci (L = 10 or 20). While a simultaneous estimation of inbreeding coefficient and null allele frequency was possible with the VOm and the CB methods, the results for the VO method were obtained by supplying the parametric value of F as the external estimate. (a) Failure rates, biases and root-mean-squared errors (RMSE) for VO, VOm and CB estimates of null allele frequency (r) (b) Failure rates, biases and root-mean-squared errors (RMSE) for VOm and CB estimates of the inbreeding coefficient (F)
(a) Estimates of the null allele frequency (r)

                       0 < F < 1/3                 1/3 < F < 2/3               2/3 < F < 1
Method  N    L       Bias     RMSE     Failure   Bias     RMSE     Failure   Bias     RMSE     Failure
VO      25   -       0.328    0.505    0.171     0.308    0.389    0.527     0.006    0.134    0.612
VOm     25   10      0.017    0.102    0.000     0.079    0.177    0.001     0.144    0.279    0.180
CB      25   10     −0.019    0.075    0.005    −0.024    0.077    0.006    −0.010    0.069    0.539
VOm     25   20      0.009    0.103    0.000     0.097    0.187    0.001     0.166    0.292    0.172
CB      25   20     −0.015    0.073    0.012    −0.026    0.079    0.018    −0.010    0.071    0.610
VO      50   -       0.123    0.115    0.164     0.011    0.339    0.578    −0.009    0.409    0.638
VOm     50   10      0.019    0.079    0.000     0.052    0.127    0.000     0.117    0.245    0.082
CB      50   10     −0.023    0.058    0.000    −0.021    0.058    0.000    −0.006    0.049    0.272
VOm     50   20      0.023    0.080    0.000     0.050    0.132    0.000     0.122    0.253    0.080
CB      50   20     −0.020    0.055    0.001    −0.021    0.058    0.001    −0.007    0.048    0.329

(b) Estimates of the inbreeding coefficient (F)

                       0 < F < 1/3                 1/3 < F < 2/3               2/3 < F < 1
Method  N    L       Bias     RMSE     Failure   Bias     RMSE     Failure   Bias     RMSE     Failure
VOm     25   10     −0.084    0.120    0.000    −0.090    0.126    0.000    −0.070    0.073    0.016
CB      25   10     −0.016    0.082    0.005    −0.001    0.055    0.006    −0.012    0.039    0.539
VOm     25   20     −0.057    0.102    0.000    −0.120    0.147    0.000    −0.063    0.074    0.000
CB      25   20     −0.025    0.070    0.012    −0.001    0.039    0.018    −0.014    0.030    0.610
VOm     50   10     −0.059    0.103    0.000    −0.053    0.087    0.000    −0.025    0.041    0.000
CB      50   10      0.012    0.058    0.000     0.006    0.039    0.000    −0.005    0.025    0.272
VOm     50   20     −0.071    0.105    0.000    −0.050    0.083    0.000    −0.025    0.036    0.000
CB      50   20      0.006    0.043    0.001     0.005    0.028    0.001    −0.002    0.019    0.329

The modification incorporated into VOm solved the problem of multiple solutions, and it was thus possible to attempt simultaneous estimation of F and r by minimizing the variation among the expected heterozygosities computed for the L loci (Appendix S3, Supporting Information). The VOm method yielded a consistent solution (r > 0) whenever the parametric value of F was comparable to the F-value realized in the sample, although it returned a slightly negative r-value when the parametric F-value was much larger than that encountered in the sample, due to sampling fluctuations (approximately 20% of the simulations). In those cases, any slightly negative values of the r1m estimate were reported as bounded below at r1m = 0. Conversely, when the proportion of observed homozygotes in the sample was close to 1, the VOm method could provide estimates of r > 0.99 (clearly impossible, in the absence of ‘no types’); these were attributed to algorithmic failure, which occurred in a low proportion of cases (Table 1).

The CB method (PIM algorithm) was characterized by 0–61% failure rates. Problems of convergence (Table 1) occurred with high inbreeding (F > 2/3), large numbers of parameters to be estimated (e.g. L = 20) and small sample sizes (e.g. N = 25), indicating overparameterization of the model, relative to the available data. With larger sample sizes (e.g. N = 50), the frequency of such problems was substantially reduced.

We also compared the methods for bias and RMSE. Due to the variation in the number of solutions returned by the VO method, the RMSE of the r estimate was increased (Table 1). As a consequence, only 19.0–24.5% of the trials yielded apparently reliable estimates of r1. The VO method performs poorly in estimating the null allele frequency (Table 1, Fig. 1).

Figure 1.

Plot of estimated null allele frequencies (r) with the VO method, as a function of their parametric values in simulated data sets where N = 25 (a) and N = 50 (b). The upper and lower dark grey bands represent simulations for which the real inbreeding coefficient was 0 < F < 1/3; middle grey, 1/3 < F < 2/3; most of the points corresponding to 2/3 < F < 1 (light grey) are hidden. Estimates were obtained by supplying the parametric F-value used in the simulations as the required external estimate of F.

In general, the VOm method yielded substantially better performance than its VO progenitor (with an externally imposed F-value), even though its performance, measured as bias and RMSE, was dependent on the apparent inbreeding in the sample. The RMSE of r increased as population inbreeding increased, particularly with small sample sizes (N = 25), decreased substantially for large sample sizes (N = 50) and was minimal (0.079–0.103) for parametric values of F < 1/3 (Table 1).

The inbreeding coefficient (F) provided by the VOm method was consistently an underestimate (−0.120 < bias < −0.025), and that bias was larger than the corresponding bias from the CB method (−0.025 < bias < 0.012) (Table 1). While the CB method also exhibited a small downward bias in the r estimate (−0.023 <  bias < −0.006), it yielded a consistent decrease in RMSE with increasing numbers of loci (L), as long as the available sample size (N) was large enough for the convergence of the algorithm (Table 1).

Finally, the performances of both the VOm and CB methods (Fig. 2) were affected by high population inbreeding, translating into high failure rates in the CB method and low accuracy in the VOm method. The biases of the r estimate in the VOm and CB methods are comparable for moderate inbreeding (e.g. F < 1/3), but the CB method consistently exhibited the lowest RMSE for both r and F estimates, provided that it did not fail (Table 1).

Figure 2.

Comparison of the VOm (a, b) and CB (c, d) methods with a plot of estimated null allele frequencies (r) and inbreeding coefficients (F), as a function of their parametric values in simulated data sets where L = 10 and N = 50. In (a and c), the dark grey dots represent simulations for which the real inbreeding coefficient was 0 < F < 1/3; middle grey, 1/3 < F < 2/3; light grey, 2/3 < F < 1. The black line represents equality between the estimated and parametric values of F or r.


Simulations confirmed that the uncorrected VO method, without including provision for ‘no types’, suffered severe restrictions in its range of application by giving rise to ‘0-solution’ or ‘2-solution’ issues. Its high failure rate did not allow further development towards the simultaneous estimation of both the null allele frequency (r) and the inbreeding coefficient (F). Clearly, VOm should replace VO. As with the likelihood-based CB method, it yielded improved performance, in terms of failure rate, as well as providing plausible estimates of r, in terms of bias and RMSE.

Limits to applicability of the VOm and CB methods

With simultaneous estimation of F and r, the VOm treatment provides reasonable estimates of r across a wide range of inbreeding coefficients (F < 2/3), but for F > 2/3, the RMSEs became large. The proportion of cases where the CB method did not converge was low (<1%) within a broad range of inbreeding (F < 2/3); we had very few problems with the CB method when the parametric F < 2/3 and N = 50, even with L = 20 loci. But when population inbreeding was high (F > 2/3), overparameterization of the system was responsible for an increased probability of algorithmic failure (e.g. >0.5 for N = 25).

Method of choice

The VOm method is substantially more effective than the antecedent VO method, but its accuracy did not improve for L = 20, relative to L = 10, and RMSE remained high for F > 2/3, although it did decrease with large sample sizes (N = 50). The RMSE values for the CB method are generally lower than those for the VOm method, but when F > 2/3, the algorithm frequently fails, particularly for small sample sizes. Under those circumstances, the VOm method may represent a high RMSE (but attainable) alternative. Moreover, given that the estimates of r are characterized by downward biases for the CB and upward biases for the VOm method, the two methods may provide complementary outcomes (Appendix S4, Supporting Information). For both CB and VOm methods, the only long-term solution for large parameter arrays (multiple loci, multiple allele frequencies, r-values and an F-coefficient) is larger sample sizes; N = 50 is probably minimal.

Additional considerations

By introducing biases in population structure estimates, null alleles compromise subsequent inference that relies on widely used codominant markers. It seems clear that error-prone assay issues now emerging in the context of next-generation sequencing procedures will have to be dealt with in similar fashion. Although similar treatment may help with null allele frequency estimation, simultaneously providing improved estimates of the frequencies (pk: k = 1,…, K) of assayable alleles, it remains unclear how such changes will impact the computation of differentiation and diversity estimates for multiple populations. Indeed, introducing F-dependent estimates of r might not be free of consequences for subsequent population structure analyses. Due to the importance of such issues, special care should be exercised in selecting genetic marker loci that have a minimum of assay ambiguity.


The authors would like to thank George Ong'amo and Stéphane Dupas, as well as the anonymous reviewers, for much helpful commentary on the manuscript. PC was supported by IRD (Institut de Recherche pour le Développement) and the US National Science Foundation (NSF-DEB-0514956); PES was supported by the US Department of Agriculture and New Jersey Agricultural Experiment Station (USDA/NJAES-17111) and by (NSF-DEB-0514956); JFS and BL were supported by IRD.

P.C., P.S., and G.V. contributed to new analytical tools; P.C. ran the simulations; P.C., P.S., J.S., and B.L. wrote the paper.

Data Accessibility

The R code and the simulated data sets have been uploaded to Dryad doi:10.5061/dryad.842n6.