**Molecular Ecology**

# Standardizing methods to address clonality in population studies

## Errata

### This article is corrected by:

- Errata: Erratum Volume 17, Issue 13, 3222, Article first published online: 28 June 2008

Present address: Ifremer, Centre de Brest BP70, Department DEEP, 29280 Plouzané, France

**Box 1 Genotypic vs. clonal membership**

*a) Assessing whether all replicates of the same MLG are part of the same clone*

The probability of a given genotype *i* under the assumption of Hardy–Weinberg equilibrium can be estimated as:

$$p_{gen} = 2^{h} \prod_{i=1}^{l} f_{i}\, g_{i} \qquad \text{(eqn 1)}$$

where *l* is the number of loci, *f*_{i} and *g*_{i} the frequencies of the two alleles at the *i*^{th} locus (estimated using the round-robin method, see text), and *h* the number of heterozygous loci.

When taking into account departures from Hardy–Weinberg equilibrium (using *F*_{IS}), this equation becomes:

$$p_{gen}(F_{IS}) = 2^{h} \prod_{i=1}^{l} f_{i}\, g_{i} \left(1 + z_{i} F_{IS(i)}\right) \qquad \text{(eqn 2)}$$

where *l* is the number of loci, *h* is the number of heterozygous loci, *f*_{i} and *g*_{i} are the allelic frequencies of the alleles *f* and *g* at the *i*^{th} locus (with *f* and *g* identical for homozygotes), *F*_{IS(i)} is the *F*_{IS} estimated for the *i*^{th} locus (using allelic frequencies estimated with the round-robin method), and *z*_{i} = 1 if the *i*^{th} locus is homozygous (*f*_{i} = *g*_{i}) and *z*_{i} = –1 if the *i*^{th} locus is heterozygous.

When the same genotype is detected *n* times in a sample of *N* sampling units, the probability that the repeated genotypes originate from distinct sexual reproductive events (i.e. from different zygotes, thus being different genets), derived from the binomial expression, is:

$$p_{sex} = \sum_{i=n}^{N} \frac{N!}{i!\,(N-i)!}\, (p_{gen})^{i} (1 - p_{gen})^{N-i} \qquad \text{(eqn 3)}$$

In this calculation, the probability of the genotype *p*_{gen} can be replaced by *p*_{gen}(*F*_{IS}) to consider possible departures from Hardy–Weinberg equilibrium, in order to obtain a more conservative estimate of *p*_{sex}.

A Monte Carlo procedure can be applied to ensure that the set of loci used provides enough power to discriminate all MLGs present in the sample:
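The genotype and *p*_{sex} calculations described in part (a) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes the usual forms *p*_{gen} = 2^h ∏ f_i g_i, its *F*_{IS} variant with the factor (1 + z_i F_IS(i)), and a binomial tail sum from *n* to *N* for *p*_{sex}. The function names and the `(f, g, heterozygous)` tuple layout are hypothetical choices made for the example.

```python
from math import comb, prod


def p_gen(loci, fis=None):
    """Probability of a multilocus genotype.

    loci: list of (f, g, heterozygous) tuples, one per locus, where f and g
          are the frequencies of the two alleles (f == g for homozygotes).
    fis:  optional list of per-locus F_IS values; when given, each locus term
          is multiplied by (1 + z * F_IS), z = 1 (homozygous) or -1 (heterozygous).
    """
    h = sum(1 for _, _, het in loci if het)  # number of heterozygous loci
    terms = []
    for k, (f, g, het) in enumerate(loci):
        term = f * g
        if fis is not None:
            z = -1 if het else 1
            term *= 1 + z * fis[k]
        terms.append(term)
    return (2 ** h) * prod(terms)


def p_sex(pgen, n, N):
    """Binomial probability of observing a genotype at least n times among
    N sampling units if every copy came from a distinct sexual event."""
    return sum(comb(N, i) * pgen ** i * (1 - pgen) ** (N - i)
               for i in range(n, N + 1))


# Hypothetical two-locus genotype: one homozygous, one heterozygous locus.
genotype = [(0.5, 0.5, False), (0.2, 0.3, True)]
pg = p_gen(genotype)
pg_fis = p_gen(genotype, fis=[0.1, 0.1])
print(pg, pg_fis, p_sex(pg, n=3, N=220))
```

A small *p*_{sex} for a genotype sampled *n* times suggests that the replicates are unlikely to come from separate zygotes.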

Fig. B1.1: Box plot describing the genotypic resolution of microsatellites in a data set of the seagrass *Cymodocea nodosa* containing 220 sampling units genotyped using nine microsatellites, analysed for all possible combinations of *K* loci (*K* = 1, ..., *l*; *l* is the number of loci available). The edges of the boxes show the minimum and maximum number of genotypes and the central line shows the average number of genotypes identified in the sample using *K* microsatellites (Alberto *et al*. 2005). The example illustrated here shows that a set of seven loci allows an accurate determination of the number of genotypes in the sample.

*b) Ascertaining that each distinct MLG belongs to a distinct clone, or genet* (Halkett *et al*. 2005a); *defining clonal lineages (MLL)*

This procedure can be used if the distribution of genetic distances among sampling units does not follow a strictly unimodal distribution but shows high peaks toward low distances. Such peaks may reveal somatic mutations or scoring errors in the data set, which result in low distances among slightly distinct MLGs that actually derive from a single reproductive event. The use of the frequency distribution of distances to detect such events has been proposed four times so far, to our knowledge (Douhovnikoff & Dodd 2003; Meirmans & Van Tienderen 2004; Arnaud-Haond *et al*. 2005; Rozenfeld *et al*. 2007). In a recent work on *Posidonia* (Arnaud-Haond *et al*. 2007) we introduced the concept of MLL to designate genets represented by slightly distinct MLGs, due to mutation or scoring errors. We propose a two-step approach, consisting of (i) screening each MLG pair presenting an extremely low distance, which generates a primary small peak in the frequency distribution of distances and makes it bimodal rather than unimodal (see the dashed line in Fig. B1.2), and then (ii) using *p*_{sex} on the set of identical loci in order to estimate the likelihood that those slightly distinct MLGs are actually derived from distinct reproductive events. When this likelihood is lower than a chosen threshold (in that case 0.01), the slightly distinct MLGs may be considered as being derived from the same genet and as slightly distinct representatives of the same MLL. Numerous distance metrics can be chosen, such as the number of distinct alleles, Jaccard similarity, in particular for multibanding patterns (Douhovnikoff & Dodd 2003), or the number of microsatellite motifs (Arnaud-Haond *et al*. 2007) under the hypothesis of a stepwise mutation model for somatic mutations.

Fig. B1.2: (A) Frequency distribution of the pairwise number of allele differences between MLGs for the same sample of *C. nodosa* (Alberto *et al*. 2005), compared with (B) the frequency distribution of the pairwise distances in a set of seeds from the same location (Cadiz, Spain) in which neither identical MLGs nor somatic mutations are expected. The *x*-axis represents the number of allele differences and the *y*-axis is the frequency distribution for each *x* rank. The dashed line in the adult distribution represents the threshold below which identical MLGs have a *p*_{sex}, estimated after excluding the slightly different loci, that supports the slightly distinct MLGs as having originated from the same MLL (i.e. from the same zygote).

**Box 2 Clonal richness estimates**

The index of clonal diversity proposed by Ellstrand & Roose (1987) for a sample of size *N* in which *G* genotypes are discriminated is estimated as:

$$P_{d} = \frac{G}{N} \qquad \text{(eqn 4)}$$

This modification was proposed by Dorken & Eckert (2001):

$$R = \frac{G - 1}{N - 1} \qquad \text{(eqn 5)}$$

such that the smallest possible value in a monoclonal stand is always 0, independently of sample size, and the maximum value is still 1, when all the different samples analysed correspond to distinct clonal lineages.
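As a quick illustration of the two richness indices, the sketch below computes them from a list of lineage labels, one per sampling unit. It assumes the standard definitions *P*_{d} = *G*/*N* and *R* = (*G* − 1)/(*N* − 1); the function name and input format are hypothetical.

```python
def clonal_richness(mll_labels):
    """Return (P_d, R) for a sample given one lineage label per sampling unit."""
    N = len(mll_labels)        # sample size
    G = len(set(mll_labels))   # number of distinct genotypes or lineages
    if N < 2:
        raise ValueError("at least two sampling units are required")
    return G / N, (G - 1) / (N - 1)


# A monoclonal stand gives R = 0 whatever the sample size,
# whereas P_d would still be 1/N.
print(clonal_richness(["a", "a", "a", "a"]))   # (0.25, 0.0)
print(clonal_richness(["a", "a", "b", "c"]))
```

The first call shows why the modified index is easier to compare across samples of different sizes: it reaches 0 in a monoclonal stand, while the original index depends on *N*.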

These indices provide an estimate of the clonal (vs. sexual) input, provided that the set of loci allows clonal membership to be assessed, as previously detailed. Otherwise, the index may overestimate clonal input, as it ignores the production of the same multilocus genotype through sexual reproduction (Stoddart 1983; Uthike *et al*. 1998). To estimate the extent of this possible bias in estimating sexual input, a method involving two components was developed (Stoddart 1983; Stoddart & Taylor 1988). The first is the estimate of genotypic diversity in the sample:

$$G_{o} = \frac{1}{\sum_{i=1}^{G} p_{i}^{2}} \qquad \text{(eqn 6)}$$

where *p*_{i} is the observed frequency of the *i*^{th} of *G* genotypes, as described in Stoddart (1983). This first component is also the inverse of the Simpson index of genotypic heterogeneity commonly used to describe clonal diversity (equation 20). It is used in a ratio with the second component, the expected genotypic diversity under Hardy–Weinberg and random assortment between all pairs of loci:

((eqn 7))

where *D* is the sum of all *p*_{i}^{2} for all *p*_{i} where (*p*_{i} × *N*) > 1, and *P* the sum of *p*_{i} for all (*p*_{i} × *N*) < 1. The clonal input is then estimated as:

((eqn 8))

When the data set used is made of markers exhibiting high polymorphism and allowing an optimal discriminating power, a very high number of genotypes may be expected and *P* will be negligible. The estimator (equation 19) will approximate estimator (15) as the number of multilocus lineages is more accurately estimated and, once full resolution of MLLs is reached, *P*_{d} (or *R*) provides a reliable estimate of the clonal input.

**Box 3 Clonal heterogeneity and evenness estimates**

*Clonal heterogeneity*

Simpson index:

$$\lambda = \sum_{i=1}^{G_{pop}} p_{i}^{2} \qquad \text{(eqn 9)}$$

where *p*_{i} is the frequency of the MLL *i* in the population, and *G*_{pop} the number of distinct MLLs in the population. An unbiased estimator of λ for a sample of size *N* is:

$$L = \sum_{i=1}^{G} \frac{n_{i}(n_{i} - 1)}{N(N - 1)} \qquad \text{(eqn 10)}$$

where *G* is the number of MLLs detected in the sample, and *n*_{i} is the number of sampled units with the MLL *i*.

The Simpson index can be modified to vary positively with heterogeneity (Pielou 1969), as an index first proposed in economics (Gini 1912; Peet 1974), and the resulting *complement of Simpson index* then describes the probability of encountering distinct MLLs when randomly taking two units in the sample:

Simpson's complement:

$$D = 1 - \lambda \qquad \text{(eqn 11)}$$

for which the unbiased estimator from a sample of size *N* is *D** = 1 – *L*, which ranges from 0 to almost 1 − (1/*G*).

As proposed for species heterogeneity indices, the *reciprocal of Simpson index* is:

Simpson's reciprocal:

$$\frac{1}{\lambda} \qquad \text{(eqn 12)}$$

for which the unbiased estimator for a sample of size *N* is 1/*L*.

Simpson's reciprocal ranges from 1 to *G*, and it can be interpreted as the number of equally represented MLLs required to obtain the same heterogeneity as observed in the sample (Hurlbert 1971; Hill 1973), or as the ‘apparent number of *clonal lineages* in the sample’.

The Shannon-Wiener index describes clonal diversity as:

$$H' = -\sum_{i=1}^{G_{pop}} p_{i} \log p_{i} \qquad \text{(eqn 13)}$$

using the estimator:

$$H'' = -\sum_{i=1}^{G} \frac{n_{i}}{N} \log \frac{n_{i}}{N} \qquad \text{(eqn 14)}$$

This index quantifies the level of uncertainty regarding the MLL of a sample unit taken at random (Pielou 1966). It increases with the number of MLLs and with the evenness in the assignment of individuals (ramets) to the MLLs, since both lead to a greater uncertainty in predicting the MLL of a randomly drawn sample unit.
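The heterogeneity estimators of this box can be computed from the number of sampling units per MLL. The sketch below is illustrative only and rests on stated assumptions: the Simpson estimator is taken as *L* = Σ *n*_{i}(*n*_{i} − 1)/[*N*(*N* − 1)], the Shannon-Wiener estimator uses natural logarithms (the base is a choice made here, not one stated in the text), and the function name and dictionary keys are hypothetical.

```python
from collections import Counter
from math import inf, log


def clonal_heterogeneity(mll_labels):
    """Simpson-based and Shannon-Wiener estimates from one MLL label per unit."""
    counts = list(Counter(mll_labels).values())  # n_i for each detected MLL
    N = sum(counts)
    if N < 2:
        raise ValueError("at least two sampling units are required")
    L = sum(n * (n - 1) for n in counts) / (N * (N - 1))
    shannon = -sum((n / N) * log(n / N) for n in counts)
    return {
        "simpson_L": L,                          # estimator of lambda
        "complement": 1 - L,                     # D* = 1 - L
        "reciprocal": 1 / L if L > 0 else inf,   # 1/L, undefined if all MLLs unique
        "shannon": shannon,                      # natural-log Shannon-Wiener estimate
    }


print(clonal_heterogeneity(["a", "a", "b", "b"]))
```

When every sampling unit carries a distinct MLL, *L* is 0 and the reciprocal is reported as infinite in this sketch; other conventions are possible.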

*Clonal evenness*

A way of describing clonal equitability, which is independent of clonal richness but not explicitly described by any diversity index (see above), is to use an evenness index. So far the most widely used evenness index in clonal plant studies is based on Simpson's complement index (Hurlbert 1971; Fager 1972):

$$V = \frac{D^{*} - D_{min}}{D_{max} - D_{min}} \qquad \text{(eqn 15)}$$

with *D*_{min} and *D*_{max} being the approximate minimum and maximum values of Simpson's complement index given the sample size *N* and the sample clonal richness *G*, estimated as:

$$D_{min} = \frac{(2N - G)(G - 1)}{N^{2}} \cdot \frac{N}{N - 1}, \qquad D_{max} = \frac{G - 1}{G} \cdot \frac{N}{N - 1}$$

This evenness formulation can also be used with the Shannon-Wiener index (e.g. Hurlbert 1971). Alternatively, evenness can be estimated as *V*′, the ratio of observed to maximal diversity (using either heterogeneity index). In this case, when using the Shannon-Wiener index, the corresponding evenness index, sometimes called Pielou's evenness (*J*′, Pielou 1975) and hereafter referred to as such, can be estimated as:

$$J' = \frac{H''}{\log G} \qquad \text{(eqn 16)}$$

where *H*″ is the sample estimate of the Shannon-Wiener index and *G* the number of MLLs detected in the sample.

**Box 4 Power law (Pareto) distribution of clonal membership**

The distribution of elements into size classes has been shown to follow a power law for a very broad diversity of systems and phenomena, all of which (from distributions in social sciences to astrophysics and the commonality of gene expression) conform to a particular probability density distribution referred to as the Pareto distribution (e.g. Pareto 1897 in Vidondo *et al*. 1997; Ueda *et al*. 2004). A power law distribution applies to systems where the distribution of elements into classes is highly skewed, with far fewer large classes than small ones. The use of a power distribution allows the efficient and parsimonious description of the distribution of the studied elements into classes. We therefore propose here the use of the Pareto distribution as a continuous approximation to describe the discrete distribution of sample units, or ramets (elements), into groups of clonal sizes (classes), where clonal sizes are defined by the number of sampling units belonging to that clone (MLL). This relationship is described by the equation:

$$N_{\geq X} = a X^{-\beta} \qquad \text{(eqn 17)}$$

where *N*_{≥X} is the number of sampled ramets belonging to lineages (MLLs) containing *X*, or more, ramets in the sample of the population studied, and the parameters *a* and β are fitted by regression analysis. In practice, the power slope (–β) is derived as the slope of the fitted log-log regression equation describing the rate of decline in the relative frequency of ramets that belong to MLLs of size equal to or larger than a given number of ramets *X* (when both are in log scale; Fig. B4.1). The parameter β (–slope) therefore indicates the scaling of the partitioning of the ramets among MLL size classes (Fig. B4.1).
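The log-log regression described in this box can be sketched as below. This is a hedged illustration, not the authors' procedure: it assumes the relationship *N*_{≥X} = *a* *X*^{−β}, uses raw counts of ramets rather than relative frequencies (which changes *a* but not β), and fits by ordinary least squares. The function names are hypothetical.

```python
from math import exp, log


def reverse_cumulative(mll_sizes):
    """For each observed clone size X, count the ramets in MLLs of size >= X."""
    return [(x, sum(s for s in mll_sizes if s >= x))
            for x in sorted(set(mll_sizes))]


def fit_pareto(mll_sizes):
    """Least-squares fit of log N_{>=X} on log X; returns (a, beta)."""
    points = reverse_cumulative(mll_sizes)
    if len(points) < 2:
        raise ValueError("at least two distinct clone sizes are required")
    xs = [log(x) for x, _ in points]
    ys = [log(n) for _, n in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return exp(intercept), -slope   # beta is minus the log-log slope


# Hypothetical sample: three MLLs found once, one found twice, one found four times.
sizes = [1, 1, 1, 2, 4]
print(reverse_cumulative(sizes))
print(fit_pareto(sizes))
```

A larger β corresponds to a steeper decline, that is, a sample dominated by small MLLs with few large clones.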

Fig. B4.1: (a) Distribution of replicates among MLLs in *Cymodocea nodosa* from Alfacs Bay (Alberto *et al*. 2005), showing the steep decline in number of MLLs with increasing clonal membership typical of power law distributions; (b) transformed into a log-log reverse cumulative distribution.

**Box 5 Spatial components of clonality**

*Edge effect*

In order to test whether, for the sampling design used, apparently unique or rare MLLs are distributed preferentially towards the edges of the sampling area, thereby inducing a possible overestimation of clonal diversity, the following index can be estimated:

$$E_{e} = \frac{D_{u} - D_{a}}{D_{a}}$$

with *D*_{u} the average geographic distance between unique MLLs and the centre of the sampling area, and *D*_{a} the average geographic distance between all sampling units and the centre of the sampling area. The significance of this index is tested against the null hypothesis of random distribution of unique and multiply represented MLLs. In practice, the likelihood of the observed difference *D*_{u} – *D*_{a} being due only to chance and not to an edge effect can be tested by permuting *x* times the positions of the samples (i.e. randomly reassigning the sample units to the sampling coordinates), and calculating the index for each permutation to obtain an empirical distribution of *E*_{e}. If the observed *E*_{e} value lies beyond the critical value (a function of the chosen alpha) in the distribution of *E*_{e} in the permuted data, then a significant edge effect is present that may cause indices of clonal diversity to overestimate the population diversity.

*Aggregation index*

In order to test for the existence of spatial aggregation of clonemates, or MLGs belonging to identical MLLs, the aggregation index *A*_{c} can be estimated as follows:

with *P*_{sg} being the average probability of clonal identity of all sample unit pairs and *P*_{sp} the average probability of clonal identity among pairwise nearest neighbours; these are estimated from the respective observed proportions in the sample. This index will typically range from 0, when the probability between nearest neighbours does not differ on average from the global one, to 1, when all nearest neighbours preferentially share the same MLL, in a situation of spatially distant distinct clonal lineages. The statistical significance of the calculated aggregation index can be tested against the null hypothesis of spatially random distribution of samples using a resampling approach, whereby the individuals sampled are randomly assigned to the existing sampling coordinates.
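The permutation test described for the edge effect can be sketched as follows. Several points are assumptions made for the example and are not taken from the text: the index is computed as *E*_{e} = (*D*_{u} − *D*_{a})/*D*_{a}, the centre of the sampling area is taken as the centroid of the sampling coordinates, the test is one-sided, and all function and parameter names are hypothetical. The same reshuffling of labels over fixed coordinates could serve for the aggregation index.

```python
import random
from collections import Counter
from math import hypot


def edge_effect_index(coords, mll_labels):
    """E_e from (x, y) coordinates and one MLL label per sampling unit."""
    cx = sum(x for x, _ in coords) / len(coords)   # assumed centre: centroid
    cy = sum(y for _, y in coords) / len(coords)
    dist = [hypot(x - cx, y - cy) for x, y in coords]
    counts = Counter(mll_labels)
    unique = [d for d, lab in zip(dist, mll_labels) if counts[lab] == 1]
    if not unique:
        raise ValueError("no unique MLL in the sample")
    d_u = sum(unique) / len(unique)   # mean distance of unique MLLs to centre
    d_a = sum(dist) / len(dist)       # mean distance of all units to centre
    return (d_u - d_a) / d_a


def edge_effect_test(coords, mll_labels, n_perm=1000, seed=0):
    """Return (observed E_e, one-sided permutation p-value)."""
    rng = random.Random(seed)
    observed = edge_effect_index(coords, mll_labels)
    labels = list(mll_labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)           # reassign units to sampling coordinates
        if edge_effect_index(coords, labels) >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical layout: four corner units and one central unit.
coords = [(0, 0), (2, 0), (0, 2), (2, 2), (1, 1)]
labels = ["u1", "u2", "c", "c", "c"]
print(edge_effect_test(coords, labels, n_perm=200))
```

With so few sampling units the p-value is not informative; the example only shows the mechanics of the test.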

S. Arnaud-Haond, E-mail: sarnaud@ifremer.fr; eserrao@ualg.pt

## Abstract

Although clonal species are dominant in many habitats, from unicellular organisms to plants and animals, ecological and particularly evolutionary studies on clonal species have been strongly limited by the difficulty in assessing the number, size and longevity of genetic individuals within a population. The development of molecular markers has allowed progress in this area, and although allozymes remain of limited use due to their typically low level of polymorphism, more polymorphic markers have been discovered during the last decades, supplying powerful tools to overcome the problem of clonality assessment. However, population genetics studies on clonal organisms lack a standardized framework to assess clonality, and to adapt conventional data analyses to account for the potential bias due to the possible replication of the same individuals in the sampling. Moreover, existing studies used a variety of indices to describe clonal diversity and structure such that comparison among studies is difficult at best. We emphasize the need for standardizing studies on clonal organisms, and particularly on clonal plants, in order to clarify the way clonality is taken into account in sampling designs and data analysis, and to allow further comparison of results reported in distinct studies. In order to provide a first step towards a standardized framework to address clonality in population studies, we review, on the basis of a thorough review of the literature on population structure of clonal plants and of a complementary review on other clonal organisms, the indices and statistics used so far to estimate genotypic or clonal diversity and to describe clonal structure in plants. We examine their advantages and weaknesses as well as various conceptual issues associated with statistical analyses of population genetics data on clonal organisms. We do so by testing them on results from simulations, as well as on two empirical data sets of microsatellites of the seagrasses *Posidonia oceanica* and *Cymodocea nodosa*. Finally, we also propose a selection of new indices and methods to estimate clonal diversity and describe clonal structure in a way that should facilitate comparison between future studies on clonal plants, most of which may be of interest for clonal organisms in general.