#### System of equations

Assuming that amplification failures are due to a mutation in the primer recognition sequence, the frequency of individuals in the data set that exhibit a single allelic product at a locus is the sum of two terms: the proportion of true homozygotes and the proportion of heterozygote individuals carrying a null allele. The equation used in the Van Oosterhout *et al*. (2006) study is:

$$y_k = p_k^2 + F p_k (1 - p_k) + 2 p_k r (1 - F) \qquad (1)$$

where *y*_{k} is the observed frequency of apparent homozygotes for the *k*-th allele in the data set (including both true homozygotes and those due to a heterozygous null allele); *p*_{k} is the real frequency of the *k*-th allele; *r* is the frequency of the null allele; and *F* is the mean inbreeding coefficient or fixation index.

The quadratic equation (Eqn. (1)) can be rearranged to return a single solution for each (non-null) allele frequency, consistent with all parameters being within appropriate bounds:

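$$p_k = \frac{-\left[F + 2r(1 - F)\right] + \sqrt{\left[F + 2r(1 - F)\right]^2 + 4(1 - F)\,y_k}}{2(1 - F)} \qquad (2)$$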

Because the sum of allele frequencies equals 1, the following constraint is added:

$$\sum_{k} p_k + r = 1 \qquad (3)$$

Substitution of Eqn. (2) into Eqn. (3) yields:

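$$\sum_{k} \frac{-\left[F + 2r(1 - F)\right] + \sqrt{\left[F + 2r(1 - F)\right]^2 + 4(1 - F)\,y_k}}{2(1 - F)} + r = 1 \qquad (4)$$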

We can observe *y*_{k}, so if we supply an external estimate of *F*, Eqn. (4) yields an estimate of *r*. The number of solutions to Eqn. (4) can vary, and the conditions under which there are 0, 1 or 2 admissible solutions require formal examination.
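As a minimal sketch of how such an equation can be solved numerically, the Python snippet below assumes the standard inbreeding-model form of Eqn. (1), *y*_{k} = *p*_{k}² + *F* *p*_{k}(1 − *p*_{k}) + 2*p*_{k}*r*(1 − *F*), takes the positive root of the quadratic for each *p*_{k}, and scans *r* over [0, 1] for sign changes of the constraint, refining each by bisection. The function names, the grid size and the tolerance are illustrative assumptions and not the authors' R implementation.

```python
import math

def allele_freq(y_k, r, F):
    """Positive root of (1-F) p^2 + [F + 2 r (1-F)] p - y_k = 0 (assumed form)."""
    a = 1.0 - F
    b = F + 2.0 * r * (1.0 - F)
    if a == 0.0:                      # F = 1: the equation is linear in p
        return y_k / b
    return (-b + math.sqrt(b * b + 4.0 * a * y_k)) / (2.0 * a)

def constraint(r, y, F):
    """Sum of the allele frequencies implied by r, plus r, minus 1."""
    return sum(allele_freq(y_k, r, F) for y_k in y) + r - 1.0

def null_freq_solutions(y, F, steps=1000, tol=1e-12):
    """Return every r in [0, 1) at which the constraint changes sign.

    A tangent (double) root produces no sign change and would be missed.
    """
    roots = []
    grid = [i / steps for i in range(steps + 1)]
    for lo, hi in zip(grid, grid[1:]):
        f_lo, f_hi = constraint(lo, y, F), constraint(hi, y, F)
        if f_lo == 0.0:
            roots.append(lo)
        elif f_lo * f_hi < 0.0:
            while hi - lo > tol:      # plain bisection inside the bracket
                mid = 0.5 * (lo + hi)
                f_mid = constraint(mid, y, F)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
    return roots
```

Because the constraint is a convex function of *r* under this assumed form, the scan can return zero, one or two roots, which mirrors the varying number of solutions noted above.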

Note that Eqn. (1) of Van Oosterhout *et al*. (2006) does not explicitly incorporate the proportion of null-allele homozygotes (‘no types’) when calculating the *y*_{k}. Here we propose a modified formulation of the problem (henceforth VO_{m}) that adds this component, and we will show that the modification improves the performance of the estimators considerably.

#### Simulations

To assess the efficacy of all methods investigated here, we evaluated simulated data sets, generated according to the following procedure:

- For a single locus, we generated a random distribution of *K* + 1 alleles {*a*_{0}, *a*_{1}, *a*_{2}, …, *a*_{K}}, with *k* = 0 for the null allele (with frequency *r*) and *k* = 1, …, *K* for the assayable alleles, with frequencies *p*_{k}, respectively.
- Specifying an inbreeding coefficient (0 < *F* < 1), we generated a (*K* + 1) × (*K* + 1) matrix of genotypic products:
  - Homozygote individuals: $P(a_k a_k) = p_k^2 + F p_k (1 - p_k)$
  - as well as for homozygous ‘no types’: $P(a_0 a_0) = r^2 + F r (1 - r)$
  - Heterozygote individuals: $P(a_g a_k) = 2 p_g p_k (1 - F)$, for *g* ≠ *k* (with *p*_{0} = *r*)

  where P(*a*_{k}*a*_{k}) and P(*a*_{g}*a*_{k}) are the probabilities of the homozygote and heterozygote genotypes, respectively, and where *g*, *k* = 0, …, *K*.
- We used this collection of probabilities, whose sum is 1, to generate *N* random genotypes, according to a multinomial distribution.
- Heterozygote genotypes carrying a null allele were tallied as (apparent) homozygotes for the assayable allele. Individuals homozygous for the null allele were ignored when running the VO algorithm, because the method does not account for missing values. For the modified VO method (VO_{m}) and the CB method, we allowed for null homozygotes (‘no types’) in the analysis.
- For empirical data sets with excesses of heterozygotes, one is not usually concerned with either inbreeding or null alleles, so we removed simulated trials with an excess of heterozygotes.
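The single-locus procedure above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the script used in the study: genotype probabilities are taken to follow the usual inbreeding model (homozygotes *p*² + *Fp*(1 − *p*), heterozygotes 2*p*_{g}*p*_{k}(1 − *F*)), the function names are hypothetical, and the final filtering of trials with an excess of heterozygotes is omitted.

```python
import random
from collections import Counter

def genotype_probs(freqs, F):
    """freqs[0] is the null allele frequency r; freqs[1:] are p_1..p_K.

    Returns a dict mapping unordered genotypes (g, k), g <= k, to probabilities.
    """
    probs = {}
    n = len(freqs)
    for g in range(n):
        probs[(g, g)] = freqs[g] ** 2 + F * freqs[g] * (1.0 - freqs[g])
        for k in range(g + 1, n):
            probs[(g, k)] = 2.0 * freqs[g] * freqs[k] * (1.0 - F)
    return probs

def simulate_phenotypes(freqs, F, N, rng):
    """Draw N genotypes (a multinomial sample) and score them as phenotypes."""
    probs = genotype_probs(freqs, F)
    genotypes = list(probs)
    weights = [probs[g] for g in genotypes]
    counts = Counter()
    for g, k in rng.choices(genotypes, weights=weights, k=N):
        if g == 0 and k == 0:
            counts["no_type"] += 1        # null homozygote: no product
        elif g == 0:
            counts[(k, k)] += 1           # null heterozygote scored as homozygote
        else:
            counts[(g, k)] += 1           # genotype observed as is
    return counts
```

For the VO method the `"no_type"` entry would be dropped before analysis, whereas it would be retained for VO_{m} and CB.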

Simulations were performed on multilocus data sets, considering the entire range of possible *F*-values and using realistic values of *r*, in order to compare the performance of the methods. *L*-locus samples of size *N* were generated, and the number of alleles at a locus (*K*) was allowed to vary randomly between 5 and 15, with equal probability, within a simulated data set. Null allele frequencies were randomly drawn for each locus, using a negative exponential distribution approximating an empirical distribution of null alleles published by Dakin & Avise (2004) (Appendix S2, Supporting information). The distribution was bounded above by *r* = 0.33, allowing for a maximum missing-value fraction of approximately 10% (*r*² ≈ 0.11) for any one locus in an equilibrium (*F* = 0) population. *F*-values used to generate the samples were drawn randomly from a uniform distribution, to explore the behaviour of the methods under a wide range of nonequilibrium situations. Simulations were run for two sample sizes (*N* = 25 and *N* = 50) and for two numbers of loci (*L* = 10 and *L* = 20). Simulations were replicated 5000 times for each (*N*, *L*) combination.
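The per-locus parameter draws described in this paragraph might be sketched as below. Several details are assumptions made only for illustration: the exponential rate (`rate=10.0`) is a placeholder, since the fitted value is given in Appendix S2 and not here; the truncation at 0.33 is done by rejection; and the assayable allele frequencies are obtained by normalising uniform random numbers, which is one of many ways to generate a "random distribution" of alleles.

```python
import random

R_MAX = 0.33  # upper bound on the null allele frequency, as stated in the text

def draw_null_freq(rng, rate=10.0, r_max=R_MAX):
    """Exponential draw truncated above by rejection; `rate` is a placeholder."""
    while True:
        r = rng.expovariate(rate)
        if r <= r_max:
            return r

def draw_locus(rng):
    """Return [r, p_1, ..., p_K] with K uniform on 5..15 and frequencies summing to 1."""
    K = rng.randint(5, 15)                    # both end points included
    r = draw_null_freq(rng)
    raw = [rng.random() for _ in range(K)]
    total = sum(raw)
    return [r] + [(1.0 - r) * x / total for x in raw]

def expected_no_type_fraction(r, F):
    """Expected proportion of null homozygotes under the inbreeding model."""
    return r * r + F * r * (1.0 - r)
```

With *r* = 0.33 and *F* = 0, `expected_no_type_fraction` gives 0.33² ≈ 0.109, which is the roughly 10% ceiling on missing values mentioned above.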

The analysis of the simulated samples was performed using two types of methods for the estimation of null alleles. For the VO (and VO_{m}) estimation methods, we implemented an R script (R Development Core Team 2011). While the VO_{m} method allowed simultaneous estimation of *F* and *r* (see Appendix S3, Supporting information), the actual inbreeding coefficient used to simulate the samples was supplied as the required ‘external estimate’ of *F*, when using the VO method. How one is to provide such a simultaneous estimate of *r* and *F* from real genetic data, for the VO_{m} method, is a matter to which we shall return below.

As an alternative to the VO and VO_{m} methods, we also deployed a likelihood joint-estimation method for *r* and *F* (henceforth, labelled CB), proposed by Chybicki & Burczyk (2009). The CB analyses were carried out with the original INEst software (Chybicki & Burczyk 2009), based on the population inbreeding model (PIM), but we automated the repetitive analysis tasks, using Visual Basic scripts. The PIM-based estimator corresponds to the maximum likelihood of the genetic model, using the following multinomial function (*L*):

- (5)

where *n*_{lg0}, *n*_{lgk} and *n*_{l00} represent sample counts of individuals with phenotypes *a*_{g}*a*_{0}, *a*_{g}*a*_{k} and *a*_{0}*a*_{0}, respectively, at the *l*-th locus; *n*-values are the phenotypic counts; *P*-values are the allele frequencies; *r*_{l} is the frequency of the null allele for the *l*-th locus; *K*_{l} is the number of visible alleles (excluding the null allele) for the *l*-th locus; and *c* is a combinatorial constant that depends on the phenotypic counts in the sample.
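To illustrate the structure of a multinomial likelihood of this kind, the sketch below evaluates a log-likelihood over loci from the three classes of phenotypic counts defined above. It is not the INEst implementation: the phenotype probabilities are assumed to follow the standard inbreeding model, the constant *c* is dropped because it does not depend on the parameters, and the data layout (a list of dictionaries with hypothetical keys `r`, `p`, `n_hom`, `n_het`, `n_null`) is invented for the example.

```python
import math

def pim_log_likelihood(loci, F):
    """Log-likelihood (up to the constant log c) summed over loci.

    Each locus is a dict with:
      'r'      null allele frequency
      'p'      list of visible allele frequencies (0-based index)
      'n_hom'  counts of apparent homozygotes, one per visible allele
      'n_het'  dict mapping (g, k), g < k, to heterozygote counts
      'n_null' count of 'no types'
    """
    total = 0.0
    for locus in loci:
        r, p = locus["r"], locus["p"]
        terms = [(locus["n_null"], r * r + F * r * (1.0 - r))]
        for g, n in enumerate(locus["n_hom"]):
            # true homozygotes plus heterozygotes carrying the null allele
            prob = p[g] ** 2 + F * p[g] * (1.0 - p[g]) + 2.0 * p[g] * r * (1.0 - F)
            terms.append((n, prob))
        for (g, k), n in locus["n_het"].items():
            terms.append((n, 2.0 * p[g] * p[k] * (1.0 - F)))
        # skip empty classes so that a zero probability with zero count is harmless
        total += sum(n * math.log(prob) for n, prob in terms if n > 0)
    return total
```

A joint estimate of *F* and the per-locus frequencies would then be obtained by maximising this function with a numerical optimiser, which is left out here.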

The VO, VO_{m} and CB estimation methods were applied to the same sets of simulated genotypes. For all three methods, we computed the bias and root-mean-squared error (*RMSE*) of the estimated frequency (*r*) of the null allele:

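$$\text{bias} = \frac{1}{J}\sum_{j=1}^{J} (r_j - \rho_j), \qquad RMSE = \sqrt{\frac{1}{J}\sum_{j=1}^{J} (r_j - \rho_j)^2} \qquad (6)$$

where *r*_{j} is the estimate and *ρ*_{j} the parametric value from the *j*-th simulated trial, and *J* is the number of simulated trials. By virtue of measuring the differences between the estimated and parametric values, the *RMSE* is a reflection of the accuracy of the estimation and depends on both the squared bias and the variance of the estimator.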
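A short sketch of these two summary statistics is given below, taking the bias as the mean of the differences *r*_{j} − *ρ*_{j} and the *RMSE* as the square root of the mean squared difference. The function name is illustrative. The last lines check numerically that the squared *RMSE* equals the squared bias plus the variance of the errors.

```python
import math

def bias_and_rmse(estimates, true_values):
    """Mean error and root-mean-squared error over simulated trials."""
    errors = [r - rho for r, rho in zip(estimates, true_values)]
    n = len(errors)
    bias = sum(errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    return bias, rmse

# Toy example with three trials sharing the same parametric value
estimates = [0.10, 0.20, 0.30]
true_values = [0.10, 0.10, 0.10]
bias, rmse = bias_and_rmse(estimates, true_values)

# Variance of the errors around their mean (divisor n, not n - 1)
errors = [r - rho for r, rho in zip(estimates, true_values)]
variance = sum((e - bias) ** 2 for e in errors) / len(errors)
decomposition_gap = rmse ** 2 - (bias ** 2 + variance)   # zero up to rounding
```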

Two solutions (*r*_{1} < *r*_{2}) can sometimes be returned with the VO method (Appendix S1, Supporting information). Knowing the external *F*-value and each of the solutions (*r*_{1} < *r*_{2}), it is possible to compute the corresponding binomial probabilities of observing the expected frequency of the ‘no-type’ class in the sample, for *r*_{1} and for *r*_{2}. Whenever two solutions existed, we chose the more likely.
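One way to read this selection rule is sketched below: for each candidate *r*, the expected ‘no-type’ proportion is computed from *r* and the external *F*, and the binomial probability of the observed ‘no-type’ count in a sample of size *N* is evaluated. The expected proportion is assumed to be *r*² + *Fr*(1 − *r*), and the function names are hypothetical; the procedure actually used is described in Appendix S1.

```python
import math

def no_type_probability(r, F):
    """Expected proportion of null homozygotes under the inbreeding model."""
    return r * r + F * r * (1.0 - r)

def binomial_pmf(k, n, q):
    """Probability of k successes in n trials with success probability q."""
    return math.comb(n, k) * q ** k * (1.0 - q) ** (n - k)

def more_likely_solution(solutions, F, n_no_type, N):
    """Pick the candidate r under which the observed no-type count is most probable."""
    return max(
        solutions,
        key=lambda r: binomial_pmf(n_no_type, N, no_type_probability(r, F)),
    )
```

For example, with candidates 0.2 and 0.6, *F* = 0.1 and 3 ‘no types’ among 50 individuals, the expected counts are about 2.8 and 19.2, so the smaller solution is retained.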