Homoplasy has recently attracted the attention of population geneticists, as a consequence of the popularity of highly variable stepwise mutating markers such as microsatellites. Microsatellite alleles generally refer to DNA fragments of different size (electromorphs). Electromorphs are identical in state (i.e. have identical size), but are not necessarily identical by descent due to convergent mutation(s). Homoplasy occurring at microsatellites is thus referred to as size homoplasy. Using new analytical developments and computer simulations, we first evaluate the effect of the mutation rate, the mutation model, the effective population size and the time of divergence between populations on size homoplasy at the within and between population levels. We then review the few experimental studies that used various molecular techniques to detect size homoplasious events at some microsatellite loci. The relationship between this molecularly accessible size homoplasy and the actual amount of size homoplasy is not trivial, the former being considerably influenced by the molecular structure of microsatellite core sequences. In a third section, we show that homoplasy at microsatellite electromorphs does not represent a significant problem for many types of population genetics analyses carried out by molecular ecologists, the large amount of variability at microsatellite loci often compensating for their homoplasious evolution. The situations where size homoplasy may be more problematic involve high mutation rates and large population sizes together with strong allele size constraints.
Originally, the concept of homoplasy was used by evolutionists to describe the fact that a character present in two species is not derived from the same character in a common ancestral species but that the similarity is due to factors such as convergence, parallelism or reversion. Homoplasy is therefore often considered as the ‘noise’ in opposition to its corollary, homology, which is the evolutionary ‘signal’ (Scotland 1992). For a genetic marker, homoplasy occurs when different copies of a locus are identical in state, although not identical by descent. This identity in state is generated through mutation event(s) and is thus related to the way mutation produces new alleles. Hence, homoplasy per se is implicit in the mutation model of the marker so that it is difficult to consider the two evolutionary features (homoplasy and mutation model) independently. Homoplasy is expected under the stepwise mutation model (SMM, Kimura & Ohta 1978), the two phase model (TPM, Di Rienzo et al. 1994), the generalized stepwise model (GSM, i.e. a simplified version of the TPM) and the K-allele model (KAM, Crow & Kimura 1970), while homoplasy is not expected under the infinite allele model (IAM, Kimura & Crow 1964). These mutation models are defined in information box 1, with special reference to microsatellite markers. However, homoplasy also depends on evolutionary factors independent of the mutation model such as the mutation rate, the effective population size, and the time of divergence between populations, although this has been poorly formalized theoretically. While homoplasy is a recurrent theme in molecular evolution (e.g. Coyne et al. 1979 for allozymes and Hassanin et al. 1998 for DNA sequences), it has recently attracted the attention of population geneticists, as a consequence of the popularity of microsatellite loci as genetic markers. Microsatellites indeed mutate in a stepwise fashion at a relatively high rate (reviewed in Estoup & Angers 1998; Ellegren 2000a; Schlötterer 2000).
Microsatellite alleles generally correspond to DNA fragments of different size revealed by electrophoretic methods (electromorphs). Electromorphs are identical in state (i.e. have identical size), but are not necessarily identical by descent due to convergent mutation(s) that occurred in the lineages connecting the gene copies and their most recent common ancestor. Homoplasy at microsatellite electromorphs is thus referred to as size homoplasy (SH). For some categories of microsatellites, a fraction of SH may be detected because a given electromorph may hide different sequences that can be observed using molecular techniques such as single-strand conformation polymorphism (SSCP) or DNA sequencing (references listed in Table 1). This corresponds to the molecularly accessible size homoplasy (MASH). Note that these definitions can be used for any piece of DNA exhibiting length variation and following a stepwise mutation process (e.g. minisatellites). The differences between SH and MASH have been insufficiently underlined in previous works dealing with SH at microsatellite markers, and both quantities have often been mixed up. It is essential to understand that MASH represents only a subset of the SH that actually occurs at microsatellite electromorphs. Moreover, the relationship between SH and MASH is not trivial, as it depends on many factors (e.g. mutation rate and model, effective population size, molecular structure of microsatellite loci) so that drawing conclusions about SH from MASH is often difficult, especially in a population genetics context.
Table 1. Publications in which molecular methods were used to detect size homoplasy at microsatellite markers (cf. molecularly accessible size homoplasy). Taxonomic levels of investigation are indicated.
Homoplasy is obviously important when assessing the phylogenetic relationships among species. An important issue is whether SH at microsatellite loci affects the population genetics analyses usually carried out within species by molecular ecologists. It seems to be widely held that studying population structure or phylogenetic relationships among populations with microsatellites introduces significant problems because of size homoplasy. Indeed, the fact that microsatellites follow a stepwise mutation process gave the impetus for developing new statistics that take into account allele size differences and are hence supposedly better adapted to microsatellite data analysis (Goldstein et al. 1995; Slatkin 1995). Moreover, uncovering MASH modifies genetic variability measures such as the number of allelic forms. How MASH data affect population differentiation and structure has rarely been evaluated (but see Angers et al. 2000).
In this review, we first evaluate the effect of various mutational and populational factors on SH at microsatellite electromorphs using analytical developments and computer simulations. Then, we summarize and discuss information obtained recently about MASH. Finally, we address the question of the effect of SH and MASH on various types of population genetics analyses.
Size homoplasy within and between populations: theoretical aspects
Size homoplasy (SH) at microsatellite markers has, so far, been poorly formalized theoretically. Moreover, the effect of the main mutational and populational factors on SH is difficult to evaluate experimentally, for instance by applying various molecular techniques to microsatellite electromorphs (cf. section Molecularly accessible size homoplasy). Therefore, we first made some analytical developments and ran simulations to address the question of the effect on SH of the mutation rate, the mutation model, genetic drift and the time of divergence between populations.
Size homoplasy within and between populations
Let us define an index of (size) homoplasy as the probability that, at a given locus, two gene copies with the same state are not identical by descent. For microsatellites, the allelic state is the size of the electromorph, which itself corresponds to a given number of repeat units. The index of (size) homoplasy (P) can be expressed as a function of the probability of identity in state, namely the homozygosity, using various formulae from the literature. Consider a population at mutation-drift equilibrium (i.e. with a population size stable over time) and a locus mutating according to the SMM, a mutation model often considered for microsatellite loci (but see information box 1). Under the SMM, the probability of drawing two alleles of the same size is equal to the homozygosity, which is equal to $(1 + 2M)^{-0.5}$, with $M = 4N_e\mu$, Ne being the effective population size in number of diploid individuals and µ the mutation rate (Kimura & Ohta 1978). Under the IAM, the probability of drawing two alleles identical by descent is equal to the homozygosity $(1 + M)^{-1}$ (Kimura & Crow 1964), since two gene copies are identical in state only if they come without mutation from the same ancestor. Under the SMM, the probability of drawing two gene copies identical by descent provided that they have the same state (i.e. copies of a given electromorph for microsatellites) is simply the ratio of the IAM homozygosity over the SMM homozygosity, and thus,
$$P = 1 - \frac{(1 + 2M)^{0.5}}{1 + M} \qquad (1)$$
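The index defined above can be evaluated numerically. The following is a minimal sketch, assuming the closed form that results from dividing the IAM homozygosity by the SMM homozygosity quoted in the text; the function names are illustrative and not taken from the original work.

```python
def homozygosity_smm(m):
    """Identity in state under the SMM at equilibrium: (1 + 2M)^-0.5."""
    return (1.0 + 2.0 * m) ** -0.5


def homozygosity_iam(m):
    """Identity by descent (IAM homozygosity) at equilibrium: (1 + M)^-1."""
    return 1.0 / (1.0 + m)


def size_homoplasy_index(ne, mu):
    """P = 1 - (IAM homozygosity / SMM homozygosity), with M = 4 * Ne * mu."""
    m = 4.0 * ne * mu
    return 1.0 - homozygosity_iam(m) / homozygosity_smm(m)


if __name__ == "__main__":
    # Mutation rates spanning the range usually considered for microsatellites.
    for mu in (5e-3, 5e-4, 5e-5):
        for ne in (50, 500, 5000):
            p = size_homoplasy_index(ne, mu)
            print(f"mu={mu:g}  Ne={ne:>4}  P={p:.3f}")
```

Because P depends on Ne and µ only through their product, the same value of P is obtained for a small population with a fast-mutating locus and a large population with a slow-mutating one.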
In order to also evaluate SH between populations, we have extended the same definition of the index of SH to the case of two panmictic isolated populations at mutation-drift equilibrium that diverged from a common ancestral population. We will consider the following definition: the index of SH is the probability that two gene copies identical in state (i.e. copies of a given electromorph for microsatellites), one taken in each population, are not identical by descent. Note that different populations may not share electromorphs: in this case the index of SH is zero. It is also worth noting that although the index of SH between populations is computed from the fraction of electromorphs shared by the two populations, it includes size homoplasious events which occurred both before and after the divergence between the two populations. Two isolated populations will tend to lose common electromorphs by drift. On the other hand, common electromorphs may be produced by mutation. Hence, it is expected that the index of SH between populations will increase with time from an initial level (at time of divergence) towards 1. Instead of the homozygosity, which is only defined for a single population, the probability of identity in state corresponding to the parameter $b(t) = \sum_i x_i(t)\,y_i(t)$ is now considered, where $x_i(t)$ and $y_i(t)$ are the frequencies of the i-th electromorph in populations x and y, respectively, at generation t after divergence. Using the same rationale as before, an analytical resolution of P as a function of both the time since divergence between the two populations and M can be obtained for the SMM and is given in information box 2 (equation 3). Equations 1 and 3 show how variation of P depends on M and thus on both the mutation rate and the effective population size.
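As an illustration of the between-population probability of identity in state, the sketch below computes b from two sets of electromorph frequencies. The dictionary layout (electromorph size mapped to frequency) and the function name are assumptions made for this example only.

```python
def identity_in_state_between(freq_x, freq_y):
    """Sum over shared electromorphs of x_i * y_i.

    freq_x, freq_y: dicts mapping electromorph size to its frequency
    in population x and population y (hypothetical input layout).
    """
    shared = set(freq_x) & set(freq_y)
    return sum(freq_x[size] * freq_y[size] for size in shared)


if __name__ == "__main__":
    pop_x = {100: 0.5, 102: 0.5}
    pop_y = {102: 0.25, 104: 0.75}
    # Only the 102 electromorph is shared by the two populations.
    print(identity_in_state_between(pop_x, pop_y))
```

When the two populations share no electromorph the sum is empty and b equals zero, which corresponds to the case noted in the text where the index of SH is zero.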
A GSM with allele size constraints is a more realistic mutation model than a strict SMM for microsatellites (though it still oversimplifies the actual evolutionary dynamics, see information box 1). However, an analytical resolution of P is difficult under the GSM, as well as under both the SMM and GSM with allele size constraints. In these cases, SH was analysed both within and between populations using computer simulations based on the coalescence process. Details of the simulation method and parameters of the mutation models (variance of the geometric distribution under a GSM, and range of allelic states for size constraints) are given in information box 3.
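To give a flavour of how a coalescent-based simulation can estimate the index of SH, here is a deliberately simplified sketch. It is not the procedure of information box 3: it assumes a single population of constant size, pairs of gene copies only, a strict unbounded SMM, and a continuous-time approximation of mutation. All names and default values are illustrative.

```python
import random


def simulate_pair(ne, mu, rng):
    """Simulate the mutations separating two gene copies.

    Returns (number of mutations, net change in repeat number).
    Assumptions: coalescence time is exponential with mean 2 * Ne
    generations; mutations arrive at rate 2 * mu along the two
    lineages; each mutation adds or removes one repeat (SMM).
    """
    t_coal = rng.expovariate(1.0 / (2.0 * ne))
    n_mut, net = 0, 0
    elapsed = rng.expovariate(2.0 * mu)
    while elapsed < t_coal:
        n_mut += 1
        net += rng.choice((-1, 1))
        elapsed += rng.expovariate(2.0 * mu)
    return n_mut, net


def estimate_homoplasy(ne, mu, n_pairs=40000, seed=1):
    """Monte Carlo estimate of P: share of same-size pairs that are not IBD."""
    rng = random.Random(seed)
    same_state = 0
    ibd = 0
    for _ in range(n_pairs):
        n_mut, net = simulate_pair(ne, mu, rng)
        if net == 0:            # identical in state (same electromorph)
            same_state += 1
            if n_mut == 0:      # identical by descent (no mutation at all)
                ibd += 1
    return 1.0 - ibd / same_state


if __name__ == "__main__":
    ne, mu = 1000, 5e-4
    m = 4 * ne * mu
    analytical = 1.0 - (1.0 + 2.0 * m) ** 0.5 / (1.0 + m)
    print("simulated :", round(estimate_homoplasy(ne, mu), 3))
    print("analytical:", round(analytical, 3))
```

Under these assumptions the simulated value should approach the analytical SMM expectation as the number of pairs grows. Replacing the one-step draw by a multistep or bounded rule is where such a sketch would have to be extended to mimic a GSM or size constraints.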
Effect of the mutation rate and the effective population size
In contrast to the mutation rate and the effective population size, which are difficult to estimate, the heterozygosity (H) is easily estimated from molecular data (e.g. Nei 1987). At mutation-drift equilibrium, the heterozygosity is a simple function of M under the SMM (i.e. $H = 1 - (1 + 2M)^{-0.5}$), so that the function P = f(H) is easily deduced. The relationship between P and H was computed by simulation under other mutation models. Figure 1 illustrates how SH within populations increases with the level of variability (H) at a marker evolving under various mutation models. Under the SMM, a large proportion (> 50%) of electromorphs are not identical by descent (i.e. are homoplasious) for H values > 0.7, such values being commonly observed at microsatellite markers. Because P depends on M and thus both on µ and Ne, SH may be substantial at microsatellite loci with a low mutation rate provided that the population size is large. The relationship between P and Ne is given in Fig. 2 for three mutation rates covering the range usually considered for microsatellite markers (i.e. $5 \times 10^{-3}$ to $5 \times 10^{-5}$, reviewed in Estoup & Angers 1998). This figure shows that for a mutation rate of $5 \times 10^{-4}$, a value often taken as a mean value for microsatellites, a significant amount of homoplasious electromorphs (e.g. P > 0.05) is present within populations only for large population sizes (e.g. Ne > 200). Microsatellites with a low mutation rate (µ = $5 \times 10^{-5}$) show reduced levels of SH even for Ne values of several thousand individuals. In contrast, highly mutating markers (µ = $5 \times 10^{-3}$) quickly reach high levels of SH (e.g. P > 0.5 for Ne > 500).
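The deduction of P = f(H) under the SMM can be written out explicitly. The short derivation below is a sketch that assumes the equilibrium expressions quoted in the text, namely an SMM homozygosity of (1 + 2M)^-0.5 and an IAM homozygosity of (1 + M)^-1; it is not reproduced from the original information boxes.

```latex
% From H = 1 - (1 + 2M)^{-1/2}:
(1 + 2M)^{1/2} = \frac{1}{1 - H},
\qquad
1 + M = \frac{1}{2}\left[1 + \frac{1}{(1 - H)^{2}}\right]

% Substituting into P = 1 - (1 + 2M)^{1/2} / (1 + M):
P = 1 - \frac{2\,(1 - H)}{1 + (1 - H)^{2}}

% Worked values:
% H = 0.5  ->  P = 1 - 1.0/1.25  = 0.20
% H = 0.7  ->  P = 1 - 0.6/1.09  \approx 0.45
% H = 0.9  ->  P = 1 - 0.2/1.01  \approx 0.80
% P = 0.5 is reached at H = \sqrt{3} - 1 \approx 0.73
```

With this expression, P is about one half for heterozygosities slightly above 0.7, in line with the order of magnitude quoted above for markers as variable as typical microsatellites.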
Figure 3 represents the relationship between the index of SH between two populations and the time since separation of those two populations expressed in number of generations (t) for different effective population sizes (Ne), and assuming markers mutating at a rate of $5 \times 10^{-4}$ under various mutation models. For µ = $5 \times 10^{-4}$, the index of SH is near 1 in less than c. 6000 generations for very different Ne values (50–5000) and under all mutation models. This means that after a relatively short separation time all microsatellite electromorphs shared by two populations are homoplasious. The effect of µ on P between populations is illustrated in Fig. 3 for three values of µ ($5 \times 10^{-3}$, $5 \times 10^{-4}$ and $5 \times 10^{-5}$) for Ne = 500 and under a SMM. The mutation rate has a marked effect on SH since P is near 1 for a separation time of only 500 generations for µ = $5 \times 10^{-3}$ and more than 20 000 generations for µ = $5 \times 10^{-5}$. It is worth mentioning that, although SH between populations can reach high levels, it involves only the fraction of electromorphs which are common to the two populations. This fraction of electromorphs will decrease with time provided that constraints on size are not too strong, so that two populations with high P values between them (e.g. possibly near 1) may still show a large level of differentiation.
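One way to see why the index of SH between populations rises so quickly is to look at the probability that two gene copies, one from each population, are still identical by descent. The expression below is a sketch under stated assumptions (two isolated populations, an ancestral population at equilibrium with the same M, mutation rate µ per lineage per generation); it is not equation 3 of information box 2, which is not reproduced here.

```latex
% Probability that neither lineage mutated since divergence (t generations)
% nor in the ancestral population before coalescence:
\Pr(\text{IBD at } t) \approx \frac{e^{-2\mu t}}{1 + M}

% Index of SH between populations, with b(t) the identity in state:
P(t) = 1 - \frac{e^{-2\mu t}}{(1 + M)\, b(t)}

% Order of magnitude of the factor e^{-2 \mu t}:
% \mu = 5 \times 10^{-3},\ t = 500     ->  e^{-5} \approx 0.007
% \mu = 5 \times 10^{-4},\ t = 6000    ->  e^{-6} \approx 0.002
% \mu = 5 \times 10^{-5},\ t = 20\,000 ->  e^{-2} \approx 0.135
```

The identity-by-descent term decays exponentially with µt, whereas identity in state b(t) is continually replenished by convergent mutations, so the ratio shrinks and P(t) tends towards 1. The slower decay for µ = 5 × 10^-5 is consistent with the longer separation times reported above for low mutation rates.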
Postdivergence vs. predivergence size homoplasy
Remember that the index of SH between two populations includes size homoplasious events that occurred both before and after the divergence between the two populations. While the fraction of SH events which occurred before the divergence [P(t = 0)] is larger for large Ne (Fig. 2), the fraction of SH events which occurred after the divergence [P(t > 0)] increases more quickly for small than for large Ne values. This was illustrated by computing the proportion of homoplasious alleles which occurred after divergence, denoted PAD. Figure 4 gives the relationship between PAD and t in number of generations for different Ne values, µ = $5 \times 10^{-4}$, and under various mutation models. PAD clearly increases more quickly and reaches higher plateau values for small Ne. This means that, for small Ne values and after a relatively short separation time, homoplasious electromorphs have been almost exclusively generated after divergence, while both categories of homoplasious events (before and after divergence) are represented for large Ne.
Effect of the mutation model
Figures 1–4 illustrate the effect on SH of the mutation model and especially the occurrence of multistep mutations (GSM with $\sigma^2 = 0.36$ or 2) and of constraints on allele size (K = 10, 20 and 50). For SH both within and between populations, P is minimum under an unbounded GSM with frequent multistep mutations ($\sigma^2 = 2$) and maximum for a bounded SMM with strong size constraints (K = 10). The differences among P values within populations for the mutation models increase with the level of variability of markers (i.e. when µ and/or Ne increase, Figs 1 and 2). This trend is less marked for P values between populations (Fig. 3). In any case, the mutation models studied here have less effect on SH than variation in µ and Ne.
Molecularly accessible fraction of size homoplasy
At least a fraction of SH may be experimentally detected at microsatellite loci because a given electromorph may hide different sequences. This corresponds to the molecularly accessible size homoplasy (MASH). Interrupted [e.g. (CT)n1AT(CT)n2TT(CT)n3] microsatellite loci and loci showing variation in their flanking regions provide favourable situations to detect SH in the core sequence of microsatellites. The rationale for this is that the rate of mutations within the repeat region ($10^{-3}$ or $10^{-4}$, reviewed in Estoup & Angers 1998; Ellegren 2000a; Schlötterer 2000) is considerably higher than the rate of point mutations in the core sequence of interrupted loci (C → A and C → T in the above example) and in adjacent flanking regions (c. $10^{-9}$, Lehmann et al. 1996) or the recombination rate between the repeat and the point mutation ($10^{-6}$ for sequences 100 bp long, Hilliker et al. 1991). Based on a different rationale, compound microsatellite loci [e.g. (CT)n1(GT)n2] were also used to detect SH. In this case, electromorphs of a given size corresponding to n repeats may exhibit two or more repeated areas with a total number of repeats summing to n for different combinations of n1 and n2 values. Several authors have taken advantage of these situations to detect SH in various taxonomic groups, either within or among populations and species (see references listed in Table 1). However, most of these studies included a relatively low number of loci, electromorphs and individuals, because they were usually based on time-consuming and costly cloning-sequencing procedures applied to microsatellite electromorphs. This makes it difficult to draw clear relationships between MASH and SH in a population genetics context. A proper study of SH through MASH in a population genetics context should use a method that can distinguish electromorph sequences and analyse a sufficiently large number of individuals and loci per population at a low cost in both time and money.
The single-strand conformation polymorphism method (SSCP, Orita et al. 1989) presents these qualities. SSCP variants are usually referred to as conformers. Several studies have shown that the SSCP technique can detect both mutations along the flanking regions and differences in the molecular structure of the microsatellite core sequence, such as different numbers of repeats in the two parts of a compound microsatellite (Orti et al. 1997; Angers et al. 2000; Sunnucks et al. 2000). Angers et al. (2000) showed that three electromorphs at a compound (CT)n1(GT)n2 locus gave eight conformers when 240 individuals from 13 populations of the freshwater snail Bulinus truncatus were genotyped. Sequence determination through classical sequencing procedures revealed that each conformer corresponds to a different combination of repeats in the (CT)n1 and (GT)n2 arrays. The presence/absence and different locations of interruption(s) in the core sequence of interrupted loci may also be assessed by SSCP (unpublished data).
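The reason a single electromorph of a compound locus can hide several sequences is purely combinatorial, as the following sketch shows. It assumes a hypothetical (CT)n1(GT)n2 core with at least one repeat in each array and ignores flanking regions; the function names are illustrative.

```python
def compound_structures(total_repeats, min_repeats=1):
    """All (n1, n2) pairs with n1 + n2 = total_repeats for a (CT)n1(GT)n2 core."""
    return [
        (n1, total_repeats - n1)
        for n1 in range(min_repeats, total_repeats - min_repeats + 1)
    ]


def core_sequence(n1, n2):
    """Core sequence of a hypothetical compound locus with n1 CT and n2 GT repeats."""
    return "CT" * n1 + "GT" * n2


if __name__ == "__main__":
    structures = compound_structures(5)
    for n1, n2 in structures:
        seq = core_sequence(n1, n2)
        # Every sequence has the same length, hence the same electromorph.
        print(n1, n2, seq, len(seq))
```

All these sequences migrate as one electromorph, yet each is in principle a distinct allele that SSCP or sequencing may separate. How many of them actually occur in a sample is an empirical matter that this enumeration does not address.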
Even if efficient techniques such as SSCP are used, it remains difficult to draw clear relationships between MASH and SH in a population genetics context. Whatever the molecular techniques used to reveal SH events, an unknown fraction of SH remains undetectable because even identical sequences or conformers may not be identical by descent, so that MASH generally underestimates SH. A common misinterpretation in MASH studies is that perfect microsatellites are less homoplasious than interrupted or compound microsatellites. This probably arises from a circular line of argument: the ability to detect SH events is dependent on the molecular structure of the microsatellite core sequence and there is no possible way to detect such SH in the core sequence of a perfect microsatellite (except from rare point mutations in the flanking regions). It may be even possible that the opposite conclusion prevails since several studies indicate that, although mutation processes may vary greatly among loci, perfect microsatellites tend to show less deviation from the SMM than interrupted or compound microsatellites (reviewed in Estoup & Cornuet 1999 and cf. section Size homoplasy within and between populations: theoretical aspects). For MASH studies based on the sequencing of flanking regions it is worth stressing that the ability to detect homoplasious electromorphs will greatly depend on the number of nucleotides sequenced in the flanking regions. These limits underline the usefulness of developing analytical or simulated SH indexes that are independent of the molecular structure of microsatellite core sequences and of the length of flanking regions.
Some interesting results emerged, however, from MASH studies. In agreement with theoretical expectations, the MASH studies referenced in Table 1 show globally that size homoplasy is usually higher among species than between populations of the same species, and rarer within populations. Variation in the amount of MASH was also observed among microsatellite loci (e.g. Garza & Freimer 1996; Viard et al. 1998; Culver et al. 2001). This could reflect: (i) the stochasticity among loci of the coalescence process that leads to a large variation of the time to a common ancestor and thus to a large variation in the number of mutations occurring at each locus; (ii) variation of mutation rates and models among microsatellite loci, as illustrated by our analytical and simulation results; and (iii) differences in the ability to detect SH due to differences of the molecular structure of the core sequences among loci.
Variation in the amount of MASH was also observed among electromorphs of the same locus (Viard et al. 1998; Culver et al. 2001). Viard et al. (1998) have shown that the number of sequence alleles per electromorph was globally correlated with the size in base pairs of the core sequence, suggesting that electromorphs with large number of repeats hide more alleles than shorter ones. This supports the idea that electromorphs with large numbers of repeats are less stable (i.e. have higher mutation rates) than shorter ones (Primmer et al. 1996; Ellegren 2000b). However, the large difference in the amount of MASH between some electromorphs was unlikely to result only from differences in the number of repeats and from the ability to detect SH event due to the molecular structure of the core sequence. It may again reflect different coalescence times between the electromorph gene copies. Unspecified electromorph-specific factors could be also involved, such as selective factors favouring electromorphs with a given number of repeats or acting on linked genes. More quantitative population studies (e.g. using the SSCP technique) on more loci and electromorphs are needed to clarify the factors responsible for variation in the amount of MASH observed among microsatellite loci and electromorphs at a given locus.
Consequence of size homoplasy on population genetics analysis
Various types of population genetics analyses can be conducted within and between populations. The following section focuses on the potential effect of SH and the consequences of uncovering MASH for some of these analyses, namely observed level of polymorphism, F-statistics, analysis of continuous populations under isolation by distance models, genetic relatedness between pairs or groups of individuals, assignment methods and phylogenetic relationships among populations.
For a given number of loci, additional variability detected in MASH studies resulted in significant differentiation between more pairs of populations when sequences or conformers were considered rather than electromorphs (Viard et al. 1998; Angers et al. 2000). This is at least partly because higher levels of polymorphism result in higher asymptotic power for exact tests of population differentiation (Rousset & Raymond 1995). Note that this expectation can be extended to any comparison of categories of markers with different levels of polymorphism, such as microsatellites vs. allozymes (Estoup et al. 1998). Although this issue has rarely been discussed, it has been supported for an island model of population structure (Goudet et al. 1996) as well as for other types of analyses (Robertson & Hill 1984; Rousset & Raymond 1995).
Expected values of Wright's F-statistics (Wright 1951), taking into account allele size differences ( ρST and ρIS) or not (FST and FIS), were described by Rousset (1996) for loci evolving under stepwise mutation processes (SMM and GSM) and considering an island model of migration. Rousset (1996) showed that there is virtually no effect of homoplasy on the parameters FIS and ρIS and no simple effect on the parameters FST and ρST. For the mutation process and hence homoplasy to matter, it is necessary that the ratios of coalescent times of genes within and among individuals of a subpopulation for FIS and ρIS and within and among subpopulations for FST and ρST are different for a sufficiently long time for two or more mutation events to occur. FIS and ρIS are virtually insensitive to mutation processes because most coalescent events for pairs of genes within subpopulations occur before the occurrence of mutations. On the other hand, the ratios of coalescent times of genes within and among subpopulations are different for a longer time. Therefore, the mutation processes affect more (although weakly) the values of FST and ρST than the values of FIS and ρIS, especially for markers with high mutation rates. Note that the ratios of coalescent times of genes within and among subpopulations will be on average different for shorter times when the migration rate between subpopulations increases, so that the effect of the mutation processes decreases when the migration rate increases. Considering more specifically the estimation of the number of migrants per generation (Nm), Rousset (1996) showed that, for a given value of the mutation rate, FST is a poorer measure of Nm under the IAM than under stepwise processes (SMM or GSM), and FST is a poorer measure of Nm under the GSM than under the SMM. 
Another important conclusion of Rousset's (1996) theoretical studies is that the mutation model has a much lower impact on the parameters FST and ρST than the mutation rate (see also the review of Hedrick 1999 on the effect of the mutation rate on FST estimators). The small sample properties of FIS, ρIS, FST and ρST estimators still need to be investigated for different mutation models, for example using computer simulations. However, it is likely that the main conclusions of Rousset (1996) on the corresponding parameters would roughly hold for these estimators.
Specific mutation features such as the occurrence of strong allele size constraints considerably increase SH and appear to substantially affect gene flow estimation between subpopulations. Gaggiotti et al. (1999) and Paetkau et al. (1997) have shown that such range constraints can lead to serious overestimation of the number of migrants between subpopulations (Nm) from the FST estimator θ (Weir & Cockerham 1984) or from RST values (Slatkin 1995), the latter statistic being an analogue of GST (Crow & Aoki 1984) that takes into account allele size differences (Rousset 1996). This overestimation is more likely to occur when population sizes are large, size constraints pronounced and mutation rates high. One other interesting result from the simulation study of Gaggiotti et al. (1999) is that, even under a SMM, θ-based estimates of Nm are better than RST-based estimates for moderate sample sizes (say 50 diploid individuals per population) and numbers of loci scored (say 10). This is due to the larger variance of RST and confirms the conclusions of empirical studies comparing both categories of statistics (reviewed in Estoup & Angers 1998). Note that deviation from a SMM due to mutation steps of more than one repeat unit will not bias the expected values of ρST and RST but will increase the already large variance of those statistics (Zhivotovsky & Feldman 1995).
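For readers unfamiliar with how Nm is derived from a differentiation statistic, the sketch below uses the classical island-model approximation FST ≈ 1/(4Nm + 1). This relation is not stated in the text and is assumed here for illustration only; it neglects mutation, so it is exactly the kind of estimate that homoplasy and size constraints can bias.

```python
def nm_from_fst(fst):
    """Number of migrants per generation under FST = 1 / (4 * Nm + 1).

    Assumes an island model at equilibrium and negligible mutation.
    """
    if not 0.0 < fst < 1.0:
        raise ValueError("fst must lie strictly between 0 and 1")
    return (1.0 / fst - 1.0) / 4.0


if __name__ == "__main__":
    for fst in (0.05, 0.1, 0.2, 0.5):
        print(f"FST={fst:.2f}  Nm={nm_from_fst(fst):.2f}")
```

Because the relation is strongly non-linear, any mechanism that lowers the differentiation statistic at low FST, such as shared electromorphs maintained by size constraints, inflates the apparent number of migrants disproportionately.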
Unexpectedly, the empirical study of Angers et al. (2000) showed that values of the FIS estimator f̂ (Weir & Cockerham 1984), and therefore of the mean population selfing rate Ŝ = 2f̂/(1 + f̂) (Pollak 1987), were considerably larger for several populations of the self-fertilizing freshwater snail B. truncatus when conformers were considered instead of electromorphs. However, larger f̂ values are expected for populations of highly self-fertilizing species since conformer analysis will reveal heterozygous genotypes that are likely to have been generated by mutation. An alternative explanation for this result is the mixture of sampled individuals originating from subgroups with similar electromorphs but different conformers (i.e. a ‘Wahlund effect’ more visible on conformers), so that mutation-drift-migration equilibrium did not hold for some populations. Angers et al. (2000) also showed that pairwise estimates of the FST estimator θ were either lower, or larger, with conformers than with electromorphs, depending on whether electromorphs were shared or not among population pairs. On the other hand, and in agreement with the theoretical expectations of Rousset (1996), estimates of θ values computed over all populations were very similar for electromorphs and conformers.
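The relation between the FIS estimate and the selfing rate quoted above is easy to evaluate. The sketch below simply applies Ŝ = 2f̂/(1 + f̂); the function name and the example values are illustrative and are not data from Angers et al. (2000).

```python
def selfing_rate(f_is):
    """Mean population selfing rate from an FIS estimate: S = 2f / (1 + f)."""
    if f_is <= -1.0:
        raise ValueError("f_is must be greater than -1")
    return 2.0 * f_is / (1.0 + f_is)


if __name__ == "__main__":
    # Hypothetical FIS values, chosen only to show the shape of the relation.
    for f in (0.0, 0.5, 0.8, 0.95):
        print(f"f={f:.2f}  S={selfing_rate(f):.3f}")
```

The relation is concave, so a given shift in f̂ translates into a larger shift in Ŝ at low values of f̂ than close to complete selfing.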
Continuous population evolving under isolation by distance
In numerous species, individuals are continuously distributed and dispersal is restricted in space. This has prompted the development of models of a continuous population evolving under isolation by distance, as well as methods for estimating the demographic parameter $D\sigma^2$, where D is the population density and $\sigma^2$ is the average squared parent-offspring distance. One of these methods is based on the increase, at a local scale, of genetic differentiation between pairs of individuals [using estimators calculated between individuals of a parameter analogous to the parameter FST/(1 − FST)] with geographical distance (Leblois et al. 2000; Rousset 2000; Sumner et al. 2001). Simulation studies have shown that the mutation model of the markers has little influence on the efficiency of this method. In particular, SH typically generated under stepwise models (SMM and GSM) is not prejudicial for such estimations (Leblois 2000). On the other hand, the precision of the estimation considerably increases with the genetic diversity of markers, whatever the level of SH (Leblois 2000).
Genetic relatedness between pairs or groups of individuals
Our understanding of the evolution of mating systems and social behaviours depends on the possibility to differentiate individuals genetically and to estimate with sufficiently high precision the genetic relatedness between pairs or among groups of individuals (Queller & Goodnight 1989). Microsatellite markers have quickly become the markers of choice for these studies (reviewed in Estoup & Angers 1998). It is important to differentiate: (i) direct parentage analysis as estimated by the simple sharing/nonsharing criterion of alleles between potential parents and offspring (e.g. Estoup et al. 1994 in social insects, Primmer et al. 1995 in birds, Streiff et al. 1998 in trees, Rossiter et al. 2000 in mammals); and (ii) the estimation of relatedness statistics between two individuals or groups of individuals of unknown pedigree in a population (Queller & Goodnight 1989; Chakraborty & Jin 1993; Blouin et al. 1996). In the first case, the number of generations considered is small (usually one generation), so that the way the genetic markers mutate does not really matter, provided that the mutation rate is not too high. Hence, direct parentage analysis is not affected by SH, and only depends on the number and level of variability of markers. This is less obvious for the estimation of relatedness statistics as the number of generations considered is larger. However, such relatedness estimations can be interpreted as an F-statistics-like analysis (Rousset 2002). Hence, it is expected that the mutation model and associated features such as SH are not an important issue, or are at least less important than the number and level of variability of the markers. It is also likely that the relative reduction of polymorphism due to the stepwise mutation process of microsatellites for a given mutation rate is largely compensated by their high mutation rate and the large number of independent loci that can be analysed. This is yet to be investigated, for instance using computer simulations.
The variability of microsatellites is often so high that, even with a small number of loci and a large number of individuals, all individuals have unique multilocus genotypes. It is therefore possible to address issues such as discrimination, relationships, structure, and classification, not only at the population level (using allelic frequencies) but also at the individual level (using genotypes) (reviewed in Estoup & Angers 1998). Moreover, individual-based analyses potentially allow the analysis of contemporary gene flow, as opposed to gene flow estimates derived from indirect approaches based on FST estimators (e.g. Paetkau et al. 1995; Rannala & Mountain 1997). Simulation studies showed that the mutation process of markers, especially the occurrence of homoplasious mutations, affects the efficiency of assignment or exclusion methods (Cornuet et al. 1999). Other things being equal (effective population size and mutation rate), genetic markers were always more efficient for various assignment methods when evolving under the nonhomoplasious IAM than under the homoplasious SMM, even for equal values of the FST estimator θ (Cornuet et al. 1999). Using measures taking into account the size difference of microsatellite alleles such as (δµ)² (Goldstein et al. 1995) did not result in any improvement. However, it is worth noting that the mutation model appears to be a less important issue for assignment methods than the level of variability of markers. Preliminary simulation studies with different mutation rates indicate that assignment scores are influenced greatly by marker variability, the best scores being obtained with the most variable markers (for an equal value of θ, unpublished results). This result agrees with the empirical study of Estoup et al. (1998), who observed a much higher assignment score with the frequency method when using highly variable microsatellite markers than when using moderately variable allozymes, although there was no significant difference between θ values computed for each class of markers.
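As an illustration of what a frequency-based assignment involves, the following minimal sketch scores a multilocus genotype in each candidate population and assigns it to the population where it is most likely. It assumes Hardy-Weinberg proportions within loci and independence among loci. The function names, the toy allele frequencies and the `floor` pseudo-frequency used for alleles absent from a population are illustrative assumptions, not the implementation used in the studies cited above.

```python
import math

def genotype_log_likelihood(genotype, allele_freqs, floor=0.01):
    """Log-likelihood of a diploid multilocus genotype in one population.

    Assumes Hardy-Weinberg proportions within loci and independence among loci.
    `floor` is a hypothetical pseudo-frequency for alleles absent from a sample.
    """
    total = 0.0
    for locus, (a, b) in genotype.items():
        freqs = allele_freqs[locus]
        pa = max(freqs.get(a, 0.0), floor)
        pb = max(freqs.get(b, 0.0), floor)
        # Homozygote: p^2; heterozygote: 2pq.
        total += math.log(pa * pa if a == b else 2.0 * pa * pb)
    return total

def assign(genotype, populations, floor=0.01):
    """Return the name of the population in which the genotype is most likely."""
    return max(
        populations,
        key=lambda name: genotype_log_likelihood(genotype, populations[name], floor),
    )

# Toy allele frequencies (keys are allele sizes in repeat units; values are made up).
populations = {
    "pop1": {"locusA": {10: 0.7, 11: 0.3}, "locusB": {20: 0.6, 22: 0.4}},
    "pop2": {"locusA": {11: 0.2, 12: 0.8}, "locusB": {22: 0.5, 23: 0.5}},
}
individual = {"locusA": (10, 11), "locusB": (20, 22)}
best = assign(individual, populations)
```

In this toy case the individual carries alleles that are common in `pop1` and partly absent from `pop2`, so it is assigned to `pop1`. With more variable markers, more alleles are private or nearly private to a population, which is one intuitive reason why variability matters so much for assignment scores.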
Phylogenetic relationships among populations
Theoretical and empirical studies show that microsatellites may not be the most efficient category of molecular marker for reconstructing phylogenetic relationships among populations, at least when a ‘low’ number of loci (say < 10) is used (but see the reviews of Goldstein & Pollock 1997 and Estoup & Angers 1998).
Phylogenetic reconstructions among closely related populations may not be affected by the mutation model and thus by SH since genetic divergence is essentially due to random drift. In this case, although they do not consider the possibility of SH, the classical distances of Cavalli-Sforza & Edwards (1967) and of Nei et al. (1983) have been shown to reconstruct microsatellite phylogenies better than distances taking into account allele size differences (Takezaki & Nei 1996). This is essentially due to the lower coefficient of variation and acceptable linearity with time of these classical distances when short periods of divergence are considered.
For distantly related populations, genetic divergence is due to both random drift and mutation. When computed from microsatellite data, classical genetic distances are no longer linear with time for homoplasious SMM markers and tend to underestimate divergence time over large time scales (Takezaki & Nei 1996). This is particularly true when there is no overlap of allele frequency distributions between two populations. Hence, in spite of their high coefficient of variation, distances taking into account allele size differences (e.g. Goldstein et al. 1995) become more appropriate in this situation (Goldstein & Pollock 1997). Although these distances were derived assuming a strict SMM, their linearity with time has been shown to be independent of the assumptions of both single-repeat mutation steps and symmetry in the mutation rate. However, loci with frequent large mutational changes will have a larger coefficient of variation (Zhivotovsky & Feldman 1995).
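The saturation of classical distances when allele frequency distributions no longer overlap can be illustrated with a small single-locus sketch. It uses the usual per-locus definitions of the D_A distance of Nei et al. (1983), 1 − Σ√(x_i y_i), and of (δµ)², the squared difference in mean allele size (Goldstein et al. 1995). The function names and the toy frequencies are assumptions made for illustration only.

```python
import math

def nei_da(freq_x, freq_y):
    """Per-locus D_A: 1 - sum_i sqrt(x_i * y_i), ignoring allele sizes."""
    alleles = set(freq_x) | set(freq_y)
    shared = sum(math.sqrt(freq_x.get(a, 0.0) * freq_y.get(a, 0.0)) for a in alleles)
    return 1.0 - shared

def delta_mu_squared(freq_x, freq_y):
    """Per-locus (delta mu)^2: squared difference in mean allele size."""
    mean_x = sum(size * f for size, f in freq_x.items())
    mean_y = sum(size * f for size, f in freq_y.items())
    return (mean_x - mean_y) ** 2

# Toy populations: keys are allele sizes (repeat numbers), values are frequencies.
pop_a = {10: 0.5, 11: 0.5}
pop_b = {12: 0.5, 13: 0.5}   # no overlap with pop_a, but close in size
pop_c = {20: 0.5, 21: 0.5}   # no overlap with pop_a, and far in size

da_ab, da_ac = nei_da(pop_a, pop_b), nei_da(pop_a, pop_c)
dm_ab, dm_ac = delta_mu_squared(pop_a, pop_b), delta_mu_squared(pop_a, pop_c)
```

Here D_A has reached its maximum for both pairs and cannot distinguish them, whereas (δµ)² still separates the close pair from the distant one. In real data the size-based distance pays for this with a much larger sampling variance, as noted in the text.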
The accuracy and linearity with time of all genetic distances are strongly affected by allele size constraints, especially for distantly related populations. Nauta & Weissing (1996) addressed this issue for two panmictic populations deriving from a common ancestral population. Constraints on allele size do not appear to be relevant for small populations. On the other hand, measures of classical genetic distances not taking into account allele size differences become unreliable in large populations. This is particularly true for microsatellite loci characterized by reduced allelic ranges and high mutation rates (Goldstein et al. 1995; Feldman et al. 1997; Pollock et al. 1998). Several distances taking into account allele size differences were recently proposed to statistically account for size constraints (Feldman et al. 1997; Pollock et al. 1998).
Although several studies based on various molecular techniques enabled a fraction of SH to be detected at some microsatellite loci, it remains difficult to draw clear relationships between MASH and SH, the latter being better described by analytical developments and computer simulations. Additional population studies on more loci, electromorphs and individuals (e.g. using the SSCP technique) are required to evaluate empirically the potential effect of SH on population genetics studies and to clarify the factors responsible for the variation observed in the amount of MASH among microsatellite loci and electromorphs. However, a major conclusion of this paper is that SH does not represent a significant problem for many types of population genetics analyses realized by molecular ecologists, and the large amount of variability at microsatellite loci often largely compensates for their homoplasious evolution. Therefore, the gain associated with the production of MASH data in routine population genetic studies appears to be minimal in most cases. Considering appropriate population genetic models as well as a reasonably large number of loci is certainly more important than focusing on SH and mutation models of microsatellites. The situations in which size homoplasy may be problematic involve high mutation rates and large population sizes together with strong allele size constraints. In recent years, important advances have been made towards the goal of estimating the probability of obtaining a given gene sample configuration in order to make likelihood-based statistical inferences from molecular data, rather than drawing inferences based on summary statistics (Wilson & Balding 1998; Beaumont 1999; Bahlo & Griffith 2000; Stephens & Donnelly 2000). Although very promising, such approaches would require thorough testing of the effects of the mutation model assumed for loci, especially when homoplasious markers with high variability such as microsatellites are considered. If those effects turned out to be important, it would be appropriate to consider more realistic mutation models than the commonly used SMM (e.g. a GSM with parameter values inferred from pedigree studies) and/or to include mutation model parameters as additional variables in these methods.
We thank R. Streiff, C. Primmer, D. Paetkau, R. Leblois, F. Rousset and two anonymous referees for constructive comments on an earlier version of the manuscript. This work was financially supported by the INRA department SPE for A.E. and J.-M.C.
Arnaud Estoup's current research mainly focuses on the theory and application of various methods for demographic and historical inferences using molecular markers (especially microsatellites) in the context of local scale and nonequilibrium systems. Philippe Jarne is the head of a group involved in various evolutionary subjects, and Jean-Marie Cornuet is a quantitative and evolutionary geneticist, with a particular interest in statistical and inferential methods.
Information Box 1: Mutation models and size homoplasy at microsatellite loci
• Infinite allele model (IAM, Kimura & Crow 1964). Under this model, a mutation involves any number of tandem repeats and always results in an allelic state not previously encountered in the population.
• Stepwise mutation model (SMM, Kimura & Ohta 1978). This model describes the loss or gain with equal probability (if the model is symmetrical) of a single tandem repeat.
• Two phase model (TPM, Di Rienzo et al. 1994) and generalized stepwise model (GSM). Under the TPM, the state of the mutating allele changes by an absolute number of X repeat units, which are added or withdrawn with equal probability for a symmetrical mutation model. X is equal to 1 with probability $P_{SMM}$ and, with probability $1 - P_{SMM}$, X follows a geometric distribution ($P(X = x) = \alpha(1 - \alpha)^{x-1}$) specified by its variance $\sigma^2 = (1 - \alpha)/\alpha^2$. The latter term allows the occurrence of mutation steps of more than a single tandem repeat. The GSM corresponds to a simplified version of the TPM in which $P_{SMM} = 0$.
• K-allele model (KAM, Crow & Kimura 1970). Under this model, there are exactly K possible allelic states and any allele has a constant probability [µ/(K − 1)] of mutating towards any of the other K − 1 allelic states.
For all these mutation models, except the IAM, alleles can mutate towards allelic states already present in the population and hence generate size homoplasy. Moreover, the range of possible allelic states can be restricted for all these models but the IAM, for instance by imposing reflecting boundaries defining a given number (K) of possible continuous allelic states (e.g. Feldman et al. 1997).
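The models defined above can be sketched as a single mutation function. This is a minimal illustration, not the code used for the simulations in this paper. The function names, the coding of allelic states as integers 1..K, and the particular way a state is folded back at a reflecting boundary (mirrored about the boundary) are assumptions. The IAM is left out because every mutation simply creates a label never seen before.

```python
import random

def geometric_step(alpha, rng):
    """Draw x >= 1 with P(X = x) = alpha * (1 - alpha)**(x - 1)."""
    x = 1
    while rng.random() >= alpha:
        x += 1
    return x

def reflect(state, low, high):
    """Fold a state back into [low, high] (assumed mirror-type reflecting boundaries)."""
    if high <= low:
        raise ValueError("need at least two allelic states")
    while state < low or state > high:
        state = 2 * low - state if state < low else 2 * high - state
    return state

def mutate(state, model, rng, p_smm=0.0, alpha=0.5, k=None):
    """Return the allelic state after one mutation under SMM, TPM, GSM or KAM.

    States are repeat numbers; if k is given, they are restricted to 1..k.
    """
    if model == "KAM":
        # Uniform choice among the other k - 1 states.
        new = rng.randrange(1, k)          # 1 .. k-1
        return new if new < state else new + 1
    if model == "SMM":
        step = 1
    elif model in ("TPM", "GSM"):
        p_single = p_smm if model == "TPM" else 0.0   # GSM is the TPM with P_SMM = 0
        step = 1 if rng.random() < p_single else geometric_step(alpha, rng)
    else:
        raise ValueError(f"unknown model: {model}")
    new = state + step * rng.choice((-1, 1))          # symmetrical gain or loss
    return reflect(new, 1, k) if k is not None else new
```

Under every model handled here a mutation can land on a state already present in the population, which is exactly how size homoplasy arises. Restricting the range with `k` makes such returns more frequent.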
Microsatellite mutations can be studied using a number of approaches. The most straightforward and conclusive way is the direct detection of mutation events in pedigree genotyping. Most studies indicate that the TPM or the GSM are the most realistic mutation models among those defined above for microsatellite loci (reviewed in Estoup & Cornuet 1999; Ellegren 2000a; Schlötterer 2000). Moreover, several lines of evidence have suggested that mechanisms which may or may not be associated with selective factors counteract the elongation of microsatellite arrays (e.g. Bowcock et al. 1994; Garza et al. 1995; Samadi et al. 1998). Such allele size constraints act on the range of allele sizes, reducing the number of possible allelic states and hence favouring SH (Lehmann et al. 1996; Nauta & Weissing 1996; Feldman et al. 1997). While the null assumption may be that all microsatellite loci have the same evolutionary dynamics (mutation rate and model), it is now well understood that these dynamics differ between loci and species. Several factors contributing to the observed differences in the evolutionary dynamics of microsatellites have been suggested: repeat number, sequence of the repeat motif, length of the repeat unit, flanking sequence, interruptions in the microsatellite, recombination rate, transcription rate, age and sex, efficiency of the mismatch repair system, etc. (reviewed in Ellegren 2000a; Schlötterer 2000). Finally, it is most likely that a large fraction of microsatellites evolve neutrally, but in some selected cases, microsatellite variation can be associated with an altered phenotype (reviewed in Schlötterer 2000).
Information Box 2: Analytical formulation of the index of (size) homoplasy
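Index of (size) homoplasy within population under a KAM (see the main text for an analytical resolution under the SMM). Under the KAM, the homozygosity for a locus at mutation-drift equilibrium is equal to [1 + M/(K − 1)]/[1 + MK/(K − 1)]. Taking the probability of identity by descent to be the equilibrium homozygosity under the IAM, $(1 + M)^{-1}$, the index of size homoplasy is hence equal to
$$P_{KAM} = 1 - \frac{1 + MK/(K - 1)}{(1 + M)\,[1 + M/(K - 1)]}.$$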
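The within-population index can be evaluated numerically from the equilibrium homozygosities quoted in this box: $(1 + M)^{-1}$ under the IAM, $(1 + 2M)^{-0.5}$ under the SMM and [1 + M/(K − 1)]/[1 + MK/(K − 1)] under the KAM. The sketch below treats the IAM homozygosity as the probability of identity by descent. The function names and the example values M = 1 and K = 10 are illustrative assumptions.

```python
import math

def homozygosity_iam(m):
    """Equilibrium homozygosity under the IAM, with m = 4 * Ne * mu."""
    return 1.0 / (1.0 + m)

def homozygosity_smm(m):
    """Equilibrium homozygosity under the unconstrained SMM."""
    return 1.0 / math.sqrt(1.0 + 2.0 * m)

def homozygosity_kam(m, k):
    """Equilibrium homozygosity under the KAM with k allelic states."""
    return (1.0 + m / (k - 1.0)) / (1.0 + m * k / (k - 1.0))

def homoplasy_index(identity_by_descent, identity_in_state):
    """Probability that two copies identical in state are not identical by descent."""
    return 1.0 - identity_by_descent / identity_in_state

m = 1.0
p_smm = homoplasy_index(homozygosity_iam(m), homozygosity_smm(m))
p_kam = homoplasy_index(homozygosity_iam(m), homozygosity_kam(m, 10))
```

For the KAM the ratio simplifies algebraically to M²/[(1 + M)(K − 1 + M)], which makes explicit that homoplasy vanishes as K grows (the IAM limit) and increases with M.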
Index of (size) homoplasy between two populations under the SMM and KAM. Let us define an index of (size) homoplasy (P) as the probability that, at a given locus, two gene copies identical in state (electromorphs of the same size for microsatellites) and drawn from two panmictic isolated populations x and y at mutation-drift equilibrium are not identical by descent. In this case the probability of identity in state corresponds to the parameter $b(t) = \sum_i x_i(t)\,y_i(t)$, where $x_i(t)$ and $y_i(t)$ are the frequencies of the i-th electromorph in each population at generation t after divergence.
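The index of (size) homoplasy is then simply equal to $P_{SMM}(t) = 1 - b_{IAM}(t)/b_{SMM}(t)$ for the SMM and $P_{KAM}(t) = 1 - b_{IAM}(t)/b_{KAM}(t)$ for the KAM, where $b_{IAM}(t)$ gives the probability of identity by descent. Under the IAM, Nei (1987) has shown that $b_{IAM}(t) \cong b_{IAM}(0)e^{-2\mu t} = b_{IAM}(0)e^{-MT}$, with t and T being the time elapsed since divergence expressed in generations and in 2Ne generations, respectively.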
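Under the SMM, the corresponding formula has been given by Li (1976) in terms of the modified Bessel functions $I_i$, with M = 4Neµ, and $I_i(2x) = \sum_{k=0}^{\infty} x^{2k+i}/[k!\,(k+i)!]$ where i > 0 and $I_{-i}(x) = I_i(x)$.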
Under the KAM, $b_{KAM}(t) \cong 1/K + [b_{KAM}(0) - 1/K]\,e^{-MTK/(K-1)}$, with K the number of possible allelic states.
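If the population was at mutation-drift equilibrium when it diverged, $b_{IAM}(0)$, $b_{KAM}(0)$ and $b_{SMM}(0)$ are equal to $(1 + M)^{-1}$, $[1 + M/(K - 1)]/[1 + MK/(K - 1)]$ and $(1 + 2M)^{-0.5}$, respectively, and the two indexes of (size) homoplasy are obtained by substituting these values into the expressions above.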
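The between-population index can also be explored numerically. The sketch below is a reconstruction under stated assumptions, not the formulae as printed in Li (1976) or in this box.

For the KAM, identity in state is assumed to relax towards 1/K at rate 2µK/(K − 1) per generation, which follows from the KAM definition in information box 1. For the SMM, the net change in size difference between two lineages separated for t generations is treated as a difference of two Poisson counts. This gives probabilities $e^{-MT} I_r(MT)$ for a net shift of r repeats, and the equilibrium probability that two ancestral copies differ by r repeats is taken as $(1 + 2M)^{-0.5}\rho^{|r|}$ with $\rho = (1 + M - \sqrt{1 + 2M})/M$. Function names, truncation limits and example values are illustrative.

```python
import math

def bessel_i(order, z, terms=200):
    """Modified Bessel function of the first kind (integer order, z >= 0), by power series."""
    order = abs(order)                      # I_{-i}(z) = I_i(z)
    if z == 0.0:
        return 1.0 if order == 0 else 0.0
    log_half = math.log(z / 2.0)
    # Terms are evaluated in log space to avoid overflowing the factorials.
    return sum(
        math.exp((2 * k + order) * log_half - math.lgamma(k + 1) - math.lgamma(k + order + 1))
        for k in range(terms)
    )

def b_iam(m, t_scaled):
    """Identity under the IAM, T generations-of-2Ne after divergence (equilibrium start)."""
    return math.exp(-m * t_scaled) / (1.0 + m)

def b_kam(m, t_scaled, k):
    """Identity in state under the KAM (assumed exponential relaxation towards 1/k)."""
    b0 = (1.0 + m / (k - 1.0)) / (1.0 + m * k / (k - 1.0))
    return 1.0 / k + (b0 - 1.0 / k) * math.exp(-m * t_scaled * k / (k - 1.0))

def b_smm(m, t_scaled, max_step=100):
    """Identity in state under the unconstrained SMM (assumed Bessel-function form)."""
    root = math.sqrt(1.0 + 2.0 * m)
    rho = (1.0 + m - root) / m
    z = m * t_scaled
    total = bessel_i(0, z) + 2.0 * sum(rho ** r * bessel_i(r, z) for r in range(1, max_step + 1))
    return math.exp(-z) * total / root

def homoplasy_between(m, t_scaled, model, k=None):
    """Index of size homoplasy between two populations: 1 - b_IAM / b_model."""
    b_state = b_smm(m, t_scaled) if model == "SMM" else b_kam(m, t_scaled, k)
    return 1.0 - b_iam(m, t_scaled) / b_state

p_within = homoplasy_between(1.0, 0.0, "SMM")    # reduces to the within-population value
p_diverged = homoplasy_between(1.0, 1.0, "SMM")  # after 2Ne generations of isolation
```

At T = 0 both models fall back on the within-population index, and the index then rises with divergence time, in line with the qualitative behaviour described in the main text.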
Information Box 3: Estimation of the index of (size) homoplasy using coalescence-based simulation
To estimate the index of (size) homoplasy within a population or between two populations, we have to evaluate the probability of identity in state Pss of a pair of genes (i.e. the probability that two gene copies have the same state) and the probability of identity by descent Pnm, which corresponds to the probability that no mutation occurred in the two lineages connecting these two gene copies and their most recent common ancestor (MRCA). For microsatellites, the allelic state is the size of the electromorph, which itself corresponds to a given number of repeat units. The index of (size) homoplasy is 1 − (Pnm/Pss). In a panmictic isolated population of constant size (and hence at mutation-drift equilibrium), the number of generations to reach the MRCA (TMRCA) of two gene copies at a selectively neutral locus follows a geometric distribution ($\Pr(X = x) = a(1 - a)^{x-1}$, with a = 1/(2Ne)), usually approximated by an exponential distribution of expectation 2Ne. Along these two equally long branches (each of TMRCA generations), mutation occurs with probability µ per generation. The number of mutations follows a binomial distribution with parameters 2TMRCA and µ, usually approximated by a Poisson distribution of parameter 2µTMRCA. For each pair of gene copies, we first drew a TMRCA value according to a geometric distribution, and then simulated a number of mutations according to a binomial distribution. Starting from an initial state, the gene copy state was mutated according to the mutation model considered as many times as there were mutations. The whole process was repeated 100 000 times for each set of parameters, and the number of times there was no mutation (nnm) was recorded as well as the number of times the final state was identical to the initial state (nss). The index of (size) homoplasy was estimated as 1 − (nnm/nss). The above process was used to compute the index of (size) homoplasy within population.
For the index of (size) homoplasy between two isolated populations which diverged t generations ago, we simply added the time of divergence to the simulated value of TMRCA. The exactness and accuracy of our simulations were thoroughly checked by comparing simulated values of homoplasy to their theoretical values for two mutation models, the SMM and KAM, for which exact analytical formulae are given in the main text and in information box 2.
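The simulation procedure described in this box can be sketched in a few lines for the unconstrained SMM. This is a minimal re-implementation written for illustration and is not the authors' program. The function names, the random seed and the parameter values (Ne = 500, µ = 5 × 10⁻⁴, hence M = 1) are assumptions.

```python
import math
import random

def geometric(p, rng):
    """Number of trials up to and including the first success (success probability p)."""
    return int(math.log(1.0 - rng.random()) / math.log1p(-p)) + 1

def binomial(n, p, rng):
    """Exact Binomial(n, p) draw, jumping from success to success (fast for small p)."""
    successes, position = 0, geometric(p, rng)
    while position <= n:
        successes += 1
        position += geometric(p, rng)
    return successes

def simulate_index(ne, mu, n_pairs, rng, divergence=0):
    """Estimate the index of size homoplasy under the SMM as 1 - nnm / nss.

    divergence: generations since two populations split (0 = within population).
    """
    n_no_mutation = n_same_state = 0
    for _ in range(n_pairs):
        # Time to the MRCA, plus the divergence time for between-population pairs.
        t_mrca = geometric(1.0 / (2.0 * ne), rng) + divergence
        # Mutations along the two branches of t_mrca generations each.
        n_mutations = binomial(2 * t_mrca, mu, rng)
        state = 0
        for _ in range(n_mutations):
            state += rng.choice((-1, 1))      # symmetrical single-step mutation
        n_no_mutation += n_mutations == 0
        n_same_state += state == 0
    return 1.0 - n_no_mutation / n_same_state

rng = random.Random(42)
estimate = simulate_index(ne=500, mu=5e-4, n_pairs=100_000, rng=rng)
theory = 1.0 - math.sqrt(1.0 + 2.0 * 1.0) / (1.0 + 1.0)   # SMM value for M = 1
```

With these settings the estimate should fall close to the analytical SMM value for M = 1 (about 0.134), up to sampling error. Passing `divergence=t` reproduces the between-population case by adding the divergence time to each simulated TMRCA, and replacing the ±1 step by another mutation rule gives the corresponding estimates for the GSM or for constrained allele ranges.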
With respect to the mutation model, the change in the number of repeat units under the generalized stepwise mutation model (GSM) followed a geometric distribution, with a variance $\sigma^2 = (1 - \alpha)/\alpha^2$ modelled as in Estoup et al. (2001). Note that microsatellite mutation data in humans suggest a mean value of $\sigma^2 = 0.36$ (Dib et al. 1996). Because the frequency and range of multistep mutations appear to differ considerably among loci, smaller ($\sigma^2 = 0$, i.e. SMM) and larger values ($\sigma^2 = 2$) of this parameter were also included in some simulations. Allele size constraints were modelled by imposing reflecting boundaries on an allele size range of K = 10, 20 or 50 possible continuous allelic states (e.g. Feldman et al. 1997).
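Because the GSM is parameterized here by the variance σ² rather than by α, a simulation needs the conversion between the two. Solving σ² = (1 − α)/α² for α gives the positive root of σ²α² + α − 1 = 0. The helper below is a small sketch of that conversion, and its name is an assumption.

```python
import math

def alpha_from_variance(sigma2):
    """Geometric parameter alpha such that (1 - alpha) / alpha**2 == sigma2."""
    if sigma2 < 0.0:
        raise ValueError("variance must be non-negative")
    if sigma2 == 0.0:
        return 1.0                      # every step has size 1: strict SMM
    return (-1.0 + math.sqrt(1.0 + 4.0 * sigma2)) / (2.0 * sigma2)

alpha_human = alpha_from_variance(0.36)   # value suggested by human data
alpha_large = alpha_from_variance(2.0)    # larger variance used in some simulations
mean_step_human = 1.0 / alpha_human       # mean step size of the geometric distribution
```

For σ² = 2 this gives α = 0.5, and for σ² = 0.36 it gives α ≈ 0.78, i.e. roughly four mutations in five change the allele by a single repeat.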