François Rousset, Laboratoire Génétique et Environnement, Institut des Sciences de l’Évolution, CC065, USTL, Place E. Bataillon, 34095 Montpellier Cedex 05, France. Tel: +33 67 14 46 30; fax: +33 67 14 36 22; e-mail: Rousset@isem.univ.montp2.fr

Abstract

I describe a method of analysis of genetic differentiation which is suitable for comparing genetic and demographic estimates of the ‘neighborhood size’ (more precisely, the product of density D and second moment of dispersal distance σ^{2}) in continuous populations sampled at the smallest scale. This method is based on results of models of isolation by distance common to a wide variety of dispersal distances. The performance of this method is tested by simulation for some highly leptokurtic dispersal distributions, and it is applied to a previous study of a kangaroo-rat (Dipodomys spectabilis) population. In this case, genetic and demographic estimates are within a factor of two of each other. Thus, in line with some previous examples, this study shows that better agreement between genetic and demographic estimates may be attained than is usually recognized.

In many populations, it is difficult to define subpopulations or demes on the basis of a spatial clustering of individuals. The expected genetic differentiation in such ‘continuous’ populations was first discussed by Wright (1943). Many later works on this issue are simulation studies of lattice models (e.g. Sokal & Wartenberg, 1983; Epperson, 1995). Here I will describe a method of analysis which allows the comparison of the observed and expected differentiation in such ‘continuous’ populations. This method is not based on simulation but on the mathematical analysis of Malécot’s lattice model (Malécot, 1951). The performance of the method will be investigated by simulations and it will be used to reanalyse data for a population of kangaroo-rats, Dipodomys spectabilis (Waser & Elliott, 1991).

Basic theory

As noted by Malécot (1975), the best model we have at present for ‘continuous’ populations is the lattice model with one individual per lattice node, because there are difficulties in the mathematical formulation of other models. For diploid individuals this is a gametic migration model. In a continuous population, the local density may fluctuate in space and time, but in lattice models the position of individuals is rigidly fixed and density does not fluctuate. It may be considered as a limit case of a model of a continuous population in which competition is strong enough to keep local densities at a constant level.

Let us suppose that a population follows this lattice model with one individual per node. Let Q_{r} be the probability of identity of genes at (geographical) distance r and Q_{w} be the probability of identity of two genes within an individual. Then the parameters

a_{r} ≡ (Q_{w} − Q_{r})/(1 − Q_{w})

contain the same information as the F_{ST}/(1 − F_{ST}) parameters discussed in Rousset (1997) (with gametic migration, Q_{w} is equivalent to Q_{0} in that paper). The theory developed for such parameters then gives expected values of a_{r} parameters in ‘continuous’ populations in one- and two-dimensional habitats. Based on this theory, it is possible to estimate Dσ^{2}, where D is the population density and σ^{2} is the average squared axial parent–offspring distance. More details will be given below for the case of two-dimensional populations, and the proposed method of analysis will be applied to the population studied by Waser & Elliott (1991). I first define a simple estimator of a_{r}.

Estimation

The practical problem is to estimate a_{r} between pairs of individuals. The simplest way to obtain an estimator of a_{r} with low bias is to estimate the numerator without bias and the denominator with both low bias and low variance. While unbiased estimation of the numerator is easy, the variance of estimates of the denominator obtained from pairs of individuals could result in a serious bias on the overall estimate. However, it follows from the assumptions of the above model that the denominator 1 − Q_{w} of the a_{r} parameter is the same for all pairs of individuals. Hence we can define an estimator of a_{r} at some distance, where the denominator is estimated from all pairs of individuals at all distances and where the numerator is estimated from the genotypes of a pair of individuals at this distance.

We may use a simple extension of earlier methods (Cockerham, 1973): briefly, we consider the indicator variable X_{ij:u} defined so that it has value 1 if gene j (j = 1, 2) of individual i is of allelic type u, and value 0 otherwise. Then, considering all possible alleles, the expectation E[∑_{u}(X_{ij:u} − X_{ij′:u})^{2}] is 2(1 − Q_{w}) for pairs of genes within individuals, and E[∑_{u}(X_{ij:u} − X_{i′j′:u})^{2}] is 2(1 − Q_{r}) for pairs of genes between individuals. Consider the average value X_{i.:u} ≡ (X_{i1:u} + X_{i2:u})/2 of X_{ij:u} for individual i, and the average X_{..:u} ≡ (X_{11:u} + X_{12:u} + X_{21:u} + X_{22:u})/4 for a pair P of individuals. Then for this pair

SS_{w(P)} ≡ ∑_{i,j,u}(X_{ij:u} − X_{i.:u})^{2} and SS_{b(P)} ≡ ∑_{i,j,u}(X_{i.:u} − X_{..:u})^{2},

for which E[SS_{w(P)}] = 2(1 − Q_{w}) and E[SS_{b(P)}] = 1 − Q_{w} + 2(Q_{w} − Q_{r}). For each pair of individuals, the numerator Q_{w} − Q_{r} can be estimated without bias by (2SS_{b(P)} − SS_{w(P)})/4, and a single estimate for the denominator 1 − Q_{w} is ∑_{k}^{P}SS_{w(k)}/(2P), summed over the P different pairs of individuals. Finally,

â^{*} ≡ (2SS_{b(P)} − SS_{w(P)})/(2∑_{k}^{P}SS_{w(k)}/P)

is an estimator of a for pair P.

Based on the same logic, other estimators can be defined. For example, SS_{w(P)} has the same expectation as ∑_{k}^{P}SS_{w(k)}/P, and hence another estimator of a is

â ≡ SS_{b(P)}/(∑_{k}^{P}SS_{w(k)}/P) − 1/2

The simulations described below show that this estimator performs slightly better than the previous one.
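The one-locus computation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the program used for the analyses in this paper: the function names (`allele_indicators`, `pair_sums_of_squares`, `pairwise_a`) and the input format (one tuple of two allele labels per individual) are hypothetical, and the sums of squares are taken as SS_w = ∑(X_{ij:u} − X_{i.:u})^{2} and SS_b = ∑(X_{i.:u} − X_{..:u})^{2} over both genes of both individuals, which reproduces the expectations quoted in the text.

```python
import numpy as np
from itertools import combinations


def allele_indicators(genotype, alleles):
    """Return a (2 genes x n alleles) array with X[j, u] = 1 if gene j is allele u."""
    return np.array([[1.0 if g == u else 0.0 for u in alleles] for g in genotype])


def pair_sums_of_squares(geno_a, geno_b, alleles):
    """Within- and between-individual sums of squares for one pair at one locus."""
    # X has shape (2 individuals, 2 genes, n alleles)
    X = np.stack([allele_indicators(geno_a, alleles),
                  allele_indicators(geno_b, alleles)])
    ind_mean = X.mean(axis=1, keepdims=True)        # X_i. for each individual
    pair_mean = X.mean(axis=(0, 1), keepdims=True)  # X_.. for the pair
    ss_w = ((X - ind_mean) ** 2).sum()
    # Each individual mean is counted once per gene (two genes per individual).
    ss_b = 2.0 * ((ind_mean - pair_mean) ** 2).sum()
    return ss_w, ss_b


def pairwise_a(genotypes):
    """Return all pairs with the two pairwise estimates (a_star, a_hat) at one locus.

    Assumes complete genotypes and at least one heterozygote in the sample,
    otherwise the common denominator is zero.
    """
    alleles = sorted({g for geno in genotypes for g in geno})
    pairs = list(combinations(range(len(genotypes)), 2))
    ss = np.array([pair_sums_of_squares(genotypes[i], genotypes[j], alleles)
                   for i, j in pairs])
    ss_w, ss_b = ss[:, 0], ss[:, 1]
    mean_ss_w = ss_w.mean()                          # sum_k SS_w(k) / P
    a_star = (2.0 * ss_b - ss_w) / (2.0 * mean_ss_w)
    a_hat = ss_b / mean_ss_w - 0.5
    return pairs, a_star, a_hat
```

Because the two estimators differ only in whether the pair-specific or the average SS_w enters the numerator, their averages over all pairs coincide; they differ in how the sampling noise is distributed among pairs.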

Multilocus estimates are defined as the sum of locus-specific numerators divided by the sum of locus-specific denominators (Weir & Cockerham, 1984). When the genotype is unknown for some locus in one individual from a pair, this locus is not included in the sum over loci when computing the denominator of â for that particular pair of individuals. Such procedures ensure that the estimator is asymptotically unbiased when the number of independent loci increases.

An application

Under the lattice model of isolation by distance in two dimensions,

a_{r} ≈ ln(r)/(4Dπσ^{2}) + constant,

where σ^{2} is the second moment of dispersal distance and D is the population density (Rousset, 1997). One can then estimate 4Dπσ^{2} from the inverse of the slope of the regression line of a_{r} against the logarithm of distance.
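The regression step can be illustrated with a short Python sketch. It assumes that pairwise distances and pairwise estimates of a_{r} are already available as two arrays of equal length; the function name `estimate_4pi_D_sigma2` and the `min_distance` argument are hypothetical, and ordinary least squares via `numpy.polyfit` is simply one convenient way to obtain the slope.

```python
import numpy as np


def estimate_4pi_D_sigma2(distances, a_values, min_distance=0.0):
    """Estimate 4*pi*D*sigma^2 as the inverse slope of a_r on ln(distance).

    Pairs at distance <= min_distance are dropped; with the default value
    this removes pairs at null distance, whose logarithm is undefined.
    """
    d = np.asarray(distances, dtype=float)
    a = np.asarray(a_values, dtype=float)
    keep = d > min_distance
    slope, intercept = np.polyfit(np.log(d[keep]), a[keep], 1)
    return 1.0 / slope, slope, intercept


# Usage on noise-free synthetic values built with a slope of 1/20:
d = np.array([0.0, 1.0, 2.0, 4.0, 8.0, 16.0])
a = np.where(d > 0, np.log(np.where(d > 0, d, 1.0)) / 20.0 + 0.1, 0.0)
estimate, slope, intercept = estimate_4pi_D_sigma2(d, a)
```

A slope close to zero or negative gives an unbounded or negative estimate, so the sign of the slope should be inspected before the inverse is interpreted.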

I have applied this method of analysis to the population of Dipodomys spectabilis studied by Waser & Elliott (1991). The aim of their study was, as here, to compare observed and expected differentiation in a population that seemed a good example of a system which should function according to models of isolation by distance.

Waser & Elliott (1991) estimated adult density, D̃ = 2.30 × 10^{−4} adults m^{−2}, and σ̃^{2} = 6235 m^{2} by mark–recapture experiments (this estimate is σ_{T}^{2} in Waser & Elliott, 1991). The demographic estimate of 4Dπσ^{2} is therefore 18.02. No information is available on the effects of age-structure on ‘effective’ population size, but ultimately these effects should be taken into account and would somewhat affect the expected value. There may also be a slight tendency for males to mate at greater distance from their birthplace than where they live (P. Waser, personal communication).

The genotypic data of 107 individuals at six enzyme loci (on average 93 genotypes determined per locus) were provided by P. M. Waser. For â^{*} and â, respectively, the slopes of the regressions are 1/22.77 and 1/32.3674, giving genetic estimates of 4Dπσ^{2} that are 1.26 and 1.8 times the demographic estimate, if we exclude all comparisons of individuals at distances below the demographic estimate σ̃ (Fig. 1). It is appropriate to exclude such points in this comparison since the approximate linear relationship is expected to hold less well at shorter distances (Rousset, 1997). However, if we do not exclude all pairwise estimates at distances below σ̃, but only the 11 pairwise comparisons for pairs of individuals sampled at ‘null’ distance (in the same mound), which cannot be included in the analysis on the logarithmic distance scale, the slopes do not differ much, being 1/25.1 and 1/31.2.
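The arithmetic behind this comparison is easy to check. The snippet below uses only the values quoted in the text (density, σ^{2} and the two inverse slopes); the variable names are illustrative.

```python
import math

D = 2.30e-4        # adults per square metre (Waser & Elliott, 1991)
sigma2 = 6235.0    # square metres (sigma_T^2 in Waser & Elliott, 1991)

# Demographic estimate of 4 * D * pi * sigma^2
demographic = 4.0 * math.pi * D * sigma2

# Genetic estimates: inverses of the regression slopes quoted in the text
genetic = {"a_star": 22.77, "a_hat": 32.3674}
ratios = {name: value / demographic for name, value in genetic.items()}

print(round(demographic, 2))
print({name: round(r, 2) for name, r in ratios.items()})
```

The ratios compare the inverse slopes (the genetic estimates of 4Dπσ^{2}) with the demographic value, not the slopes themselves.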

Efficiency of the estimators

It is desirable to evaluate the efficiency of the estimators for realistic dispersal distributions, in particular distributions with large kurtosis. The commonly used discrete probability distributions are not appropriate here because high kurtosis can be achieved only by assuming a low dispersal probability, i.e. that most offspring reproduce exactly where their parent reproduced. So I have considered dispersal distributions where the probability m_{k} of axial dispersal by k steps (K_{min} < k < K_{max}) is M/k^{n} for some parameters M and n (a truncated variant of the discrete Pareto, or Zeta, distribution; see e.g. Patil & Joshi, 1968). Roughly, M is a parameter controlling the total dispersal rate, and n controls the kurtosis. By suitable choice of the parameter values, large kurtosis can be obtained with large migration rates (i.e. the two-dimensional total migration rate, 1 − m_{0}^{2}, is > 0.5).
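To make this family of distributions concrete, the following Python sketch builds a symmetric truncated zeta distribution of axial dispersal and computes its second moment and excess kurtosis. Several things here are assumptions rather than statements from the text: the function names are hypothetical, K_{min} is taken as 0, the upper limit is treated as inclusive, and M is obtained by requiring the probabilities to sum to one given the non-dispersal probability m_{0}. The example values (m_{0} = 0.4, n = 3.798, steps up to 49) are those of parameter set (a) in the simulations reported with Table 1.

```python
import numpy as np


def truncated_zeta_axial(m0, n, k_max):
    """Return steps k = 1..k_max and probabilities m_k = M / k**n (one direction).

    M is chosen so that m0 + 2 * sum(m_k) = 1 (dispersal is symmetric).
    Treating k_max as inclusive is an assumption of this sketch.
    """
    k = np.arange(1, k_max + 1, dtype=float)
    weights = k ** (-float(n))
    M = (1.0 - m0) / (2.0 * weights.sum())
    return k, M * weights


def axial_moments(m0, n, k_max):
    """Second moment, excess kurtosis and 2-D total migration rate."""
    k, m_k = truncated_zeta_axial(m0, n, k_max)
    sigma2 = 2.0 * np.sum(k ** 2 * m_k)       # mean is zero by symmetry
    mu4 = 2.0 * np.sum(k ** 4 * m_k)
    kurtosis_x = mu4 / sigma2 ** 2 - 3.0      # one-dimensional excess kurtosis
    migration_2d = 1.0 - m0 ** 2              # as defined in the text
    return sigma2, kurtosis_x, migration_2d


sigma2, kurtosis_x, migration_2d = axial_moments(m0=0.4, n=3.798, k_max=49)
print(sigma2, kurtosis_x, migration_2d)
```

With these values the second moment should come out close to 1, and the kurtosis is large even though the two-dimensional migration rate is well above 0.5, which is the property the paragraph above relies on.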

Mutation was assumed to follow a two-allele model with symmetric mutation at rate u = 10^{−4}. Starting from an undifferentiated state with each allele at frequency 1/2, the population was first allowed to evolve for 10^{4} generations. Then, one-locus data were generated by completely sampling every 1000 or 2000 generations from a population of 100 × 100 individuals evolving according to a discrete generation model with such distributions of gamete dispersal (250 × 250 individuals for case (d) in Table 1). Two thousand such samples were generated for each migration model. For large K_{max}, the simulations are rather time-consuming and this sets a limit to the total size of the population considered.

Table 1. Performance of the estimators. See text for further details

Out of each recorded population, a block of up to 20 × 20 individuals was sampled; a variable number of multilocus samples of l loci were generated from l polymorphic one-locus samples. The smallest sample size considered ((10 × 10) individuals × 6 loci) is similar to the Dipodomys sample size. For each sample the estimates of Dσ^{2} were computed as described above from the regression of a_{r} estimates on the logarithm of distance. The accuracy of this method of analysis will be given in terms of the relative error on the estimates of Dσ^{2}: Table 1 gives estimates of the relative bias (i.e. the bias of estimated Dσ^{2} divided by Dσ^{2}), of the relative mean square error E[((estimate − Dσ^{2})/(Dσ^{2}))^{2}], and of the probability that the estimate is within a factor of two of Dσ^{2} (‘×2 coverage probability’). Different sets of parameter values of the axial dispersal distributions were investigated:

(a) m_{0} = 0.4 and m_{k} ∝ 1/k^{3.798} for all 0 < k < 49. With such parameters, σ^{2} = 1 (4πDσ^{2} = 12.56) and the one-dimensional kurtosis is γ_{2}(x) = 46.31, resulting in a two-dimensional kurtosis γ_{2}(r) = γ_{2}(x)/2 − 1 = 22.15, slightly higher than the kurtosis for the Dipodomys example (γ̃_{2}(r) = 19.4, P. Waser, personal communication). The simulations show that for realistic sample sizes the relative bias is generally low (a few per cent at most), and that 81–95% of estimates fall within a factor of two of the parameter value, depending on sample size. Thus, it is reasonable to expect such an accuracy from these methods of analysis. In terms of sampling design, for a comparable total number of individuals multiplied by loci, it is better to increase the number of loci (from six to 13) than to increase the number of individuals (from 100 to 225).

(b) This is a variant of the previous model with m_{0}=0.6 and m_{k} ∝ 1/k^{3.86295} for all 0 < k < 25, such that γ_{2}(x)=26.03. Overall, the performances are similar. In both cases the â statistic performed consistently slightly better than the â^{*} statistic in terms of mean square error and coverage probability. A negative estimate was obtained only once among all replicates, for the smallest sample size.

(c) It should be increasingly difficult to estimate σ from analyses at short distances when σ increases. To investigate the performance of the method in such circumstances, I considered the following distribution of dispersal distances characterized by m_{0}=0.7, m_{1}=m_{−1}=0.06, m_{2}=m_{−2}=0.0504 and m_{k} ∝ 1/k^{2.51829} for 2 < k < 16. For this distribution, σ=2. The results show that the bias increases (27% for â) and the coverage decreases (70%).

(d) Further simulations were run for the same dispersal distribution to investigate whether sampling at larger distances would improve the results. â performed better when samples of (20 × 20) individuals were considered, but when (10 × 10) individuals were sampled every two steps in a (20 × 20) block on the lattice, the estimation was not improved. Thus, as σ increases, it is necessary to increase both the range of distances and the number of individuals sampled, and almost complete sampling of the population in a 10σ × 10σ block (as in the present case and in cases (a) and (b)) seems necessary to obtain precise estimates.

(e) Only one example is shown for a simple stepping stone model with nearest-neighbour dispersal at rate m=1/2, where the method performs well, as one would expect.
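The three summaries used to report performance (relative bias, relative mean square error and ×2 coverage probability) follow directly from their definitions in the text. The sketch below shows one way to compute them from a set of replicate estimates; the function name `performance` and the toy input values are purely illustrative and are not simulation results.

```python
import numpy as np


def performance(estimates, true_value):
    """Relative bias, relative mean square error and x2 coverage probability."""
    est = np.asarray(estimates, dtype=float)
    rel_err = (est - true_value) / true_value
    rel_bias = rel_err.mean()
    rel_mse = np.mean(rel_err ** 2)
    # An estimate is 'covered' if it lies within a factor of two of the
    # true value; negative estimates therefore never count as covered.
    covered = (est >= true_value / 2.0) & (est <= 2.0 * true_value)
    return rel_bias, rel_mse, covered.mean()


# Toy replicate estimates of D * sigma^2 for a true value of 1
rel_bias, rel_mse, coverage = performance([0.5, 1.0, 2.0, 4.0], true_value=1.0)
```

Coverage is bounded while the relative mean square error is not, so a few very large estimates (from slopes near zero) can dominate the latter without much affecting the former.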

Discussion

Most studies of differentiation under isolation by distance have assumed that for different dispersal distributions, the expected differentiation is a function of ‘neighborhood’ size Dσ^{2}, but this cannot be so in general (Rousset, 1997). This fact had already been noticed in one of the first simulation studies (Rohlf & Schnell, 1971, pp. 316–319), and casts doubts on the robustness, for alternative dispersal distributions with the same neighborhood parameters, of many proposed methods of analysis.

Here, a method of analysis has been defined which is in agreement with a model of isolation by distance appropriate for various dispersal distributions. Its applicability differs from that of the related method described in Rousset (1997). Three different cases must be distinguished: (1) there is a spatial clustering of individuals (e.g. the human population reconsidered in Rousset, 1997), so the model considered in the present paper is not appropriate; (2) there is no spatial clustering of individuals in the population, but samples of individuals are taken far apart, i.e. with the physical distance between individuals within samples being small relative to the dispersal distances of individuals and relative to the distance between samples (e.g. the snail population reconsidered in the same paper). The model used here would in principle be appropriate for such populations, but the estimators may perform very poorly and it was not possible to investigate their properties in very large populations. So the method described in Rousset (1997) is preferable in this case; (3) a continuous sampling from a continuous population where most dispersal events occur within a few times the interindividual distance: this is the case considered in this paper, illustrated by the Dipodomys population.

The estimators may or may not seem efficient depending on the objectives of the study, but they yield estimates that are much closer to the parameter value than is usually recognized (e.g. Waser & Elliott, 1991; Johnson & Black, 1995; Koenig et al., 1996; Peterson, 1996), when σ is small and when most individuals are sampled within an area of about 10σ × 10σ. The latter conditions imply either some independent knowledge of dispersal distances (as for the comparison of genetic and demographic estimates) or some stepwise procedure in which a preliminary estimate is used to determine a final sample size. It should be possible to define more efficient estimators, provided the underlying model is approximately correct, as suggested by the present analyses.

Few studies may achieve the sampling scheme described above, but the one by Waser & Elliott (1991) comes closer to this ideal than others. Using the â estimator, it is found that the genetic estimate of Dσ^{2} is 1.8 times the demographic estimate in this study. We cannot expect a perfect agreement, both because Malécot’s model is not an exact description of the population and because the demographic estimate may be biased. However, the agreement obtained here is similar to the accuracy to be expected from the genetic estimator according to the simulations, if the genetic model is correct.

Using a spatial autocorrelation method, Waser & Elliott (1991) found no pattern of isolation by distance, while according to a simulation study (Sokal & Wartenberg, 1983), such a pattern was expected for values of 4Dπσ^{2} similar to the demographic estimate. The exact cause of this discrepancy is unclear (Waser & Elliott, 1991; Epperson, 1995). However, Waser & Elliott noted that Sokal & Wartenberg did not consider highly leptokurtic dispersal distributions, while the observed dispersal distribution was highly leptokurtic. Previously inferred discrepancies between genetic and demographic estimates at a local spatial scale may in large part result from similar problems. The present example, as well as those given in Rousset (1997), shows that a better agreement may be obtained by focussing on theoretical results that are robust to different distributions of dispersal distances.

Acknowledgments

I thank P. Waser for providing the data set and helpful discussion, M. Raymond for some venerable lines of code, J. Lagnel for advice on programming, and A. Estoup and Y. Michalakis for comments. This work was supported by the Service Commun de Biosystématique de Montpellier, and Grants LR963223 from the Région Languedoc-Roussillon, ACCSV3 9503077 and GDR 1105 from CNRS. A program performing the analyses described in this paper will be included in future versions of the Genepop package (Raymond & Rousset, 1995). This is paper 99.064 of the Institut des Sciences de l’Évolution.