**Molecular Ecology**

# Statistical analysis of amplified fragment length polymorphism data: a toolbox for molecular ecologists and evolutionists

**Box 1 Analysing dominant genetic data such as AFLPs: two alternative philosophies**

Two different approaches exist to analyse individual AFLP profiles (Kosman & Leonard 2005). One can choose to focus on the pattern of band presence or absence and to compare it between samples. This is termed here the band-based approach, which is usually conducted at the individual level. Alternatively, one may decide to adopt the allele frequency-based approach, which involves estimating allele frequencies at each AFLP locus. This strategy is thus population orientated; moreover, as AFLPs are dominant markers, allele frequencies are accessible only with preliminary assumptions or additional data about the inbreeding coefficient in the examined population(s).

For each approach, this box presents the main metrics or estimators used as starting points and discusses their respective statistical advantages and drawbacks.

**Metrics for the band-based approach**

The band-based approach usually resorts to several metrics called ‘coefficients of similarity’, which can be viewed as distance measures. Estimates of band frequencies, applied in some assignment tests (Duchesne & Bernatchez 2002), also belong to the band-based approach.

The properties of the main coefficients of similarity can be illustrated with an example where two individuals *i* and *j* are genotyped at *n* AFLP loci. The following table summarizes the pattern of matches and mismatches among their respective band presences and absences:

| | Individual *i*: band presence (1) | Individual *i*: band absence (0) |
|---|---|---|
| Individual *j*: band presence (1) | *a* | *b* |
| Individual *j*: band absence (0) | *c* | *d* |

where *n* = *a* + *b* + *c* + *d*.

*Jaccard coefficient (Jaccard 1908)*

$$J = \frac{a}{a + b + c}$$

The Jaccard coefficient only takes into account the bands present in at least one of the two individuals, and is therefore unaffected by homoplasic absent bands (when the absence of the same band is due to different mutations).

*Dice coefficient (Dice 1945)*

$$D = \frac{2a}{2a + b + c}$$

The Dice coefficient is equivalent to the Nei and Li coefficient (Nei & Li 1979) and the Sørensen coefficient (Sørensen 1948).

Comparable to the Jaccard coefficient, the Dice coefficient gives more weight to the bands present in both individuals. It thus lays the emphasis on the similarity between individuals, rather than on their dissimilarity.
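The band-based coefficients of this box can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the function names are hypothetical, profiles are assumed to be scored as 0/1 lists of equal length, and the simple-matching coefficient (described in the next entry) is included for comparison.

```python
from typing import Sequence, Tuple


def match_counts(profile_i: Sequence[int], profile_j: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (a, b, c, d) following the 2 x 2 table of this box.

    a: band present in both individuals
    b: band present in j, absent in i
    c: band absent in j, present in i
    d: band absent in both individuals
    """
    if len(profile_i) != len(profile_j):
        raise ValueError("profiles must be scored at the same loci")
    a = b = c = d = 0
    for band_i, band_j in zip(profile_i, profile_j):
        if band_i and band_j:
            a += 1
        elif band_j:
            b += 1
        elif band_i:
            c += 1
        else:
            d += 1
    return a, b, c, d


def jaccard(a: int, b: int, c: int, d: int) -> float:
    # Shared absences (d) are ignored; undefined when neither profile has a band.
    denominator = a + b + c
    return a / denominator if denominator else float("nan")


def dice(a: int, b: int, c: int, d: int) -> float:
    # Shared presences are counted twice; shared absences (d) are ignored.
    denominator = 2 * a + b + c
    return 2 * a / denominator if denominator else float("nan")


def simple_matching(a: int, b: int, c: int, d: int) -> float:
    # Shared presences and shared absences carry the same weight.
    n = a + b + c + d
    return (a + d) / n if n else float("nan")


# Hypothetical profiles scored at six loci.
counts = match_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
print(counts, jaccard(*counts), dice(*counts), simple_matching(*counts))
```

For these two made-up profiles the counts are *a* = 2, *b* = 1, *c* = 1 and *d* = 2, so the Jaccard coefficient (0.5) is lower than the Dice coefficient (about 0.67), which weights the shared bands more heavily.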

*Simple-matching coefficient (Sokal & Michener 1958)*

$$SM = \frac{a + d}{n}$$

The simple-matching coefficient maximizes the amount of information drawn from an AFLP profile by considering all scored loci. Double band absence and double band presence are given the same biological importance, which may not be appropriate when band-absence homoplasy is frequent. This coefficient has interesting Euclidean metric properties that allow its use in an analysis of molecular variance (AMOVA; Excoffier *et al*. 1992).

**Metrics for the allele frequency-based approach**

Allelic frequencies can be extracted from dominant data following at least five procedures:

*The square-root procedure*

This procedure simply uses the inbreeding coefficient and the square root of the frequency of null homozygotes (i.e. of band absences) to calculate the frequency of the null allele (Stewart & Excoffier 1996). Because the inbreeding coefficient is rarely known, this method often implies assuming Hardy–Weinberg equilibrium. Moreover, estimates tend to be downwardly biased for null alleles with low frequencies.
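As a worked illustration of the square-root procedure, the sketch below assumes the standard population-genetic expectation that the frequency of null homozygotes is x = q² + F·q(1 − q), where q is the null allele frequency and F the inbreeding coefficient, and solves this quadratic for q. The function name and the restriction to 0 ≤ F ≤ 1 are choices made for this example, not part of the cited method.

```python
import math


def null_allele_frequency(absence_freq: float, f_is: float = 0.0) -> float:
    """Estimate the null allele frequency q from the frequency of band absences.

    Assumes x = (1 - F) * q**2 + F * q (null homozygote frequency under
    inbreeding coefficient F). With F = 0 this reduces to q = sqrt(x),
    i.e. the Hardy-Weinberg case.
    """
    if not 0.0 <= absence_freq <= 1.0:
        raise ValueError("absence_freq must lie in [0, 1]")
    if not 0.0 <= f_is <= 1.0:
        raise ValueError("this sketch only handles 0 <= F <= 1")
    if f_is == 1.0:
        # Complete inbreeding: no heterozygotes, so q equals x.
        return absence_freq
    discriminant = f_is ** 2 + 4.0 * (1.0 - f_is) * absence_freq
    return (-f_is + math.sqrt(discriminant)) / (2.0 * (1.0 - f_is))


# Hypothetical locus where 25% of individuals lack the band.
print(null_allele_frequency(0.25))        # Hardy-Weinberg equilibrium
print(null_allele_frequency(0.25, 0.5))   # assumed F = 0.5
```

With the same observed frequency of band absence, a higher assumed F gives a lower estimate of q, which is why the choice of F matters for downstream statistics.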

*The Lynch & Milligan procedure*

Lynch & Milligan (1994) refined the square-root method by proposing a more accurate estimate of null allele frequencies. However, their procedure requires restricting the analysis to AFLP loci where the frequency of band absence is higher than 3/*N*, *N* being the number of samples. It thus induces a bias in the choice of loci for analysis together with a loss of information, and this can ultimately bias estimates of genetic diversity and differentiation (Isabel *et al*. 1999).

*The Bayesian procedure*

Zhivotovsky (1999) introduced a Bayesian approach that gives satisfactory estimates of null allele frequencies, even in case of moderate departure from Hardy–Weinberg equilibrium. Because of these advantages, this procedure is now routinely employed.

*The moment-based procedure*

Hill & Weir (2004) developed a robust moment-based method using the observed mean and variance of the frequency of null homozygotes at any given locus. The underlying model assumes random mating, Hardy–Weinberg equilibrium, linkage equilibrium, no mutation from common ancestor and equally distant populations.

*The Holsinger procedure*

Holsinger *et al*. (2002) used a Bayesian framework to estimate allelic frequencies, *F*_{IS} and *F*_{ST} from dominant data. This procedure only assumes that *F*_{IS} and *F*_{ST} are similar across loci and that the pattern of allelic frequency variation follows a beta distribution among populations. However, it seems to provide poor estimates of allelic frequencies and *F*_{IS}.

**Box 2 Metrics commonly used to measure genetic diversity with AFLPs**

Genetic diversity can be estimated at different hierarchical levels (within populations, among populations within regions, among regions, etc.) and total diversity can be partitioned in components at these different levels. Such a hierarchical approach has been used for several of the metrics described below. The partitioning of genetic diversity among populations is specifically addressed in a subsequent section (Population structure) and in Box 3.

**Band-based metrics**

*The various similarity coefficients (see Box 1)*

These coefficients can be calculated for each pair of individuals in the population and then averaged to give a measure of the genetic diversity in the population.
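The averaging step can be sketched as follows. The example uses the Jaccard coefficient, but any pairwise coefficient could be passed in; the function names and the toy profiles are hypothetical, and reporting one minus the mean similarity as a diversity value is a convention chosen for this sketch.

```python
from itertools import combinations
from typing import Callable, Sequence


def jaccard_similarity(profile_i: Sequence[int], profile_j: Sequence[int]) -> float:
    """Jaccard coefficient between two 0/1 band profiles of equal length."""
    shared = sum(1 for x, y in zip(profile_i, profile_j) if x and y)
    either = sum(1 for x, y in zip(profile_i, profile_j) if x or y)
    return shared / either if either else float("nan")


def mean_pairwise_similarity(
    profiles: Sequence[Sequence[int]],
    coefficient: Callable[[Sequence[int], Sequence[int]], float] = jaccard_similarity,
) -> float:
    """Average a pairwise coefficient over all pairs of individuals."""
    pairs = list(combinations(profiles, 2))
    if not pairs:
        raise ValueError("at least two individuals are required")
    return sum(coefficient(p, q) for p, q in pairs) / len(pairs)


# Hypothetical population of three individuals scored at four loci.
population = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
]
similarity = mean_pairwise_similarity(population)
print(similarity, 1.0 - similarity)
```

The number of pairs grows quadratically with sample size, so for large data sets a vectorized implementation would be preferable to this plain loop.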

*The Shannon index of phenotypic diversity S, derived from the Shannon-Weaver index (Shannon 1948)*

In this index, *p*_{i} is the frequency of band presence at the *i*th marker within the population. The index gives more weight to the presence than to the absence of bands. This weighting has no real biological support, although it might account for the occurrence of homoplasic absences of bands.

*The nucleotide diversity π*

In a population, π can be defined as the average number of nucleotide differences per site between two randomly chosen DNA sequences. Borowsky (2001) proposed a simple equation to estimate π from band data, assuming Hardy–Weinberg equilibrium, an absence of band homoplasy, and a low overall π. In that equation, Φ_{e} is the proportion (over all polymorphic loci) of mismatched bands between two individuals drawn at random in the population, and *m* is the number of bases screened per band. For AFLP data, *m* is the total number of bases in both restriction sites and in the selective extensions.

Clark & Lanigan (1993) also introduced a method to assess π, but it requires the prior calculation of allelic frequencies for diploid species. The estimate of Innan *et al*. (1999) has to be implemented by computer because of its complexity, and its validity rests on the assumption that the GC content of the screened genome is around 50%. Computer simulations, however, have shown that when the GC content departs from this value, the method still gives reliable estimates as long as π remains small.

**Allele frequency-based metrics**

Allele frequencies have to be estimated first to calculate these metrics (see Box 1).
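A minimal sketch of this two-step workflow is given below: allele frequencies are first estimated from band absences with the square-root procedure (assuming Hardy–Weinberg equilibrium), and Nei's average gene diversity, one of the metrics described in this part of the box, is then computed as the mean of 2*pq* over loci. The function name and the toy data are hypothetical, and the small-sample correction 2*N*/(2*N* − 1) is the usual Nei (1978) factor applied here as an assumption.

```python
import math
from typing import Sequence


def nei_gene_diversity(band_matrix: Sequence[Sequence[int]], unbiased: bool = False) -> float:
    """Average expected heterozygosity over loci from dominant 0/1 data.

    band_matrix holds one row per diploid individual and one column per locus.
    Step 1: null allele frequency q = sqrt(frequency of band absence) (HWE assumed).
    Step 2: per-locus diversity 2 * p * q, averaged over loci.
    """
    n_individuals = len(band_matrix)
    if n_individuals == 0:
        raise ValueError("band_matrix is empty")
    n_loci = len(band_matrix[0])
    if n_loci == 0:
        raise ValueError("no loci were scored")
    total = 0.0
    for locus in range(n_loci):
        absences = sum(1 for individual in band_matrix if individual[locus] == 0)
        q = math.sqrt(absences / n_individuals)  # null (absence) allele
        p = 1.0 - q                              # presence allele
        total += 2.0 * p * q
    diversity = total / n_loci
    if unbiased:
        # Small-sample correction for N diploid individuals.
        diversity *= 2.0 * n_individuals / (2.0 * n_individuals - 1.0)
    return diversity


# Hypothetical sample: four individuals, two loci (the second is monomorphic).
sample = [[1, 0], [1, 0], [1, 0], [0, 0]]
print(nei_gene_diversity(sample), nei_gene_diversity(sample, unbiased=True))
```

Swapping the square-root step for another estimator of allele frequencies (for instance a Bayesian one) would leave the second step unchanged, which is the main reason for keeping the two steps separate.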

*Percentage of polymorphic loci P at the x% confidence level*

This corresponds to the percentage of loci at which *p*, the frequency of the presence allele, falls within the bounds set by the chosen level *x*. The percentage of polymorphic loci *P* is sometimes calculated based on band frequencies.

*Nei's average gene diversity per locus H (Nei 1973)*

This parameter is equivalent to the average expected heterozygosity *H*_{e} in the population:

$$H_e = \frac{1}{n} \sum_{i=1}^{n} 2 p_i q_i$$

where *p*_{i} and *q*_{i} stand for the frequencies of the presence and absence alleles at locus *i*, respectively, and *n* is the number of loci examined.

For a small sample size, an unbiased estimate of *H*_{e} is given by the formula (Nei 1978; Nei 1987):

$$H_e = \frac{2N}{2N - 1} \cdot \frac{1}{n} \sum_{i=1}^{n} 2 p_i q_i$$

where *N* is the number of diploid samples. Kosman (2003) showed that Nei's average gene diversity calculated from band frequencies, instead of allele frequencies, was equivalent to the average number of pairwise differences within populations (see simple-matching coefficient, Box 1).

**Box 3 Comparison of *F*_{ST} estimators using real and simulated AFLP data**

**Real data**

We used a data set obtained for 13 vascular plant species sampled at 9–16 localities in Fennoscandia (nine individuals per locality on average) and scored for 65–181 polymorphic AFLP markers to compare the estimates of genetic differentiation obtained by different approaches (Skrede *et al*. 2006; P.B. Eidesen, D. Ehrich, I.G. Alsos, unpublished data). Region-wide estimates of differentiation were calculated by using the following methods: (1) *F*_{ST} according to Lynch & Milligan (1994) based on allele frequencies estimated (a) as band frequencies, (b) using the square-root method (assuming Hardy–Weinberg equilibrium), (c) using the Bayesian method with uniform priors, and (d) using the Bayesian method with non-uniform priors (see Box 1). Differentiation was calculated using AFLP-SURV. In addition, (e) *G*_{ST} was calculated in POPGENE from allele frequencies estimated according to the square-root method assuming Hardy–Weinberg equilibrium. Furthermore, differentiation was estimated (2) by the Bayesian approach of Holsinger *et al*. (2002) using the default settings in Hickory, and either (a) the full model or (b) the f-free model; and (3) as Φ_{ST} in an AMOVA based on pairwise differences between AFLP profiles using ARLEQUIN.

Figure 1a shows that there are large discrepancies between some of the differentiation estimates for real data. *G*_{ST} (method 1e) is often the highest, followed by the two band-based measures (methods 1a and 3). Estimates based on Bayesian estimation of allele frequencies are the lowest, especially those using uniform priors. However, despite these large discrepancies, the relative differences between the data sets are largely comparable. The low Bayesian *F*_{ST} estimates obtained for *Betula nana* are a noteworthy exception.

Figures 1b and 1c show the sensitivity of two estimators based on allele frequencies to assumptions about *F*_{IS}: *F*_{ST} based on allele frequencies estimated with the Bayesian method with non-uniform priors (1d) and with the square-root method (1b), both calculated in AFLP-SURV. Kremer *et al*. (2005) showed that multilocus estimates of gene diversity were surprisingly robust to changes in assumed *F*_{IS}. The same was true in general for *F*_{ST} (square-root method). For some data sets, however, *F*_{ST} (square-root) increased slightly with *F*_{IS}, and *F*_{ST} (Bayesian non-uniform) almost always increased considerably. However, as for the comparison of methods, relative differences between the species were in general conserved.

[ Figure 1. Comparison of differentiation estimators (real data), and sensitivity of two estimators to assumptions about *F*_{IS}. ]

**Simulated data**

Genetic drift was simulated at 300 biallelic loci for 10 populations of 100 diploid individuals each. Initial allele frequencies at each locus were chosen from a beta distribution with shape parameters 0.3 and 0.8. Preliminary simulations showed that these values produced U-shaped marker frequency distributions resembling those often observed in empirical data sets. Initial genotypes in each population were chosen at random from the same ancestral frequencies. There was no mutation, no migration and no selfing, and mating was random. Each generation, *F*_{ST} was calculated from the diploid genotypes of all individuals in all populations according to Weir (1996) and considered as the real value. Simulations were stopped when *F*_{ST} reached 0.05 or 0.25 (five replicates each). Samples of 10 and 50 individuals were taken from each population, genotypes were converted to dominant data and these were analysed in the same way as the 13 data sets above. Simulations were carried out in R (The R Development Core Team 2004) using a script available from D.E. on request.

Figure 2 shows that the *F*_{ST} estimates based on allele frequencies calculated with the square-root method (method 1b; according to Lynch & Milligan 1994) and the Bayesian method with non-uniform priors (method 1d) are closest to the real value based on the diploid genotypes. All estimates based on allele frequencies improved for larger sample sizes. *G*_{ST} based on allele frequencies calculated with the square-root method (method 1e) was particularly sensitive to small sample sizes. The approaches based on band frequencies (methods 1a and 3) overestimated differentiation considerably, as did the Bayesian calculations performed by Hickory (methods 2a and 2b). Larger sample sizes did not improve these last estimates.

Estimates were also calculated for 150 loci (not shown). Differences were in general very small (< 0.01). For the two estimates based on the square-root method (1b and 1e; *F*_{ST} = 0.05, *n* = 10), however, the bias increased by 0.01–0.02. These simulations used only one type of initial distribution of allele frequencies, chosen to resemble the 13 empirical data sets, and their results may thus not be representative of all possible situations.

[ Figure 2. Comparison of differentiation estimators (simulated data). ]
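The simulation design described in this box can be approximated with the short Python sketch below. It is not the authors' R script: it tracks allele frequencies with Wright–Fisher binomial sampling of gene copies instead of explicit diploid genotypes, and it replaces the Weir (1996) estimator with a simpler (*H*_T − *H*_S)/*H*_T statistic computed from the population frequencies. All function names, the seed, and the `max_generations` guard are assumptions made for this illustration.

```python
import random
from typing import List, Tuple


def binomial_draw(rng: random.Random, n: int, p: float) -> int:
    """Number of successes in n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def fst_from_frequencies(freqs: List[List[float]]) -> float:
    """(H_T - H_S) / H_T summed over loci; a stand-in for Weir (1996)."""
    n_pops = len(freqs)
    total_ht = total_hs = 0.0
    for locus in range(len(freqs[0])):
        values = [pop[locus] for pop in freqs]
        mean_p = sum(values) / n_pops
        total_ht += 2.0 * mean_p * (1.0 - mean_p)
        total_hs += sum(2.0 * p * (1.0 - p) for p in values) / n_pops
    return (total_ht - total_hs) / total_ht if total_ht > 0 else 0.0


def simulate_drift(
    n_loci: int = 300,
    n_pops: int = 10,
    n_diploid: int = 100,
    target_fst: float = 0.05,
    seed: int = 1,
    max_generations: int = 10000,
) -> Tuple[List[List[float]], int]:
    """Let isolated populations drift until the target F_ST is reached."""
    rng = random.Random(seed)
    copies = 2 * n_diploid
    # Ancestral frequencies of the band-presence allele, beta(0.3, 0.8) as in the text.
    ancestral = [rng.betavariate(0.3, 0.8) for _ in range(n_loci)]
    # Every population starts from the same ancestral frequencies.
    freqs = [[binomial_draw(rng, copies, p) / copies for p in ancestral]
             for _ in range(n_pops)]
    generation = 0
    while fst_from_frequencies(freqs) < target_fst and generation < max_generations:
        # No mutation, no migration: each population resamples its own gene pool.
        freqs = [[binomial_draw(rng, copies, p) / copies for p in pop] for pop in freqs]
        generation += 1
    return freqs, generation


def sample_dominant_phenotypes(
    freqs: List[List[float]], n_sample: int, rng: random.Random
) -> List[List[List[int]]]:
    """Sample individuals and score a band (1) if either allele is a presence allele."""
    return [
        [[1 if (rng.random() < p or rng.random() < p) else 0 for p in pop]
         for _ in range(n_sample)]
        for pop in freqs
    ]
```

A call such as `simulate_drift(target_fst=0.25)` followed by `sample_dominant_phenotypes(freqs, 10, random.Random(2))` would give band matrices that can be fed to the different estimators. Because the stopping statistic differs from the one used in the box, the numerical results are not expected to match Figure 2.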

Dr Stéphanie Manel, Fax: +33 4 76 51 42 79; E-mail: stephanie.manel@ujf-grenoble.fr

## Abstract

The amplified fragment length polymorphism (AFLP) technique has recently gained considerable popularity and is now frequently applied to a wide variety of organisms. Technical specificities of the AFLP procedure have been well documented over the years, but information about the statistical analysis of AFLPs remains scarce or scattered. In this review, we describe the various methods available to handle AFLP data, focusing on four research topics at the population or individual level of analysis: (i) assessment of genetic diversity; (ii) identification of population structure; (iii) identification of hybrid individuals; and (iv) detection of markers associated with phenotypes. Two kinds of analysis methods can be distinguished, depending on whether they are based on the direct study of band presences or absences in AFLP profiles (‘band-based’ methods), or on allelic frequencies estimated at each locus from these profiles (‘allele frequency-based’ methods). We investigate the characteristics and limitations of these statistical tools; finally, we call for a wider adoption of methodologies borrowed from other research fields, such as those specifically designed to deal with binary data.