Correspondence site: http://www.respond2articles.com/MEE/

# Clarifying and developing analyses of biodiversity: towards a generalisation of current approaches

Article first published online: 4 JAN 2012

DOI: 10.1111/j.2041-210X.2011.00181.x

© 2012 The Author. Methods in Ecology and Evolution © 2012 British Ecological Society

Additional Information

#### How to Cite

Pavoine, S. (2012), Clarifying and developing analyses of biodiversity: towards a generalisation of current approaches. Methods in Ecology and Evolution, 3: 509–518. doi: 10.1111/j.2041-210X.2011.00181.x

#### Publication History

- Issue published online: 7 JUN 2012
- Article first published online: 4 JAN 2012
- Received 19 July 2011; accepted 22 November 2011 Handling Editor: Robert Freckleton

### Keywords:

- AMOVA;
- analysis of variance;
- Euclidean geometry;
- functional diversity;
- Monte Carlo tests;
- nucleotide diversity;
- optimization;
- parametric tests;
- phylogenetic diversity;
- quadratic entropy

### Summary

- Top of page
- Summary
- Introduction
- Development of quadratic entropy
- *H*_{D} optimisation
- Estimating and comparing levels of diversity
- Among-collection diversity
- Partitioning diversity
- Testing the effects of factors on diversity: directions for future research
- Conclusions
- Acknowledgements
- References
- Supporting Information

**1.** Quadratic entropy (QE) was developed as a fundamental function for measuring the diversity within a collection, such as a community, or population, from indices of abundance and distance among categories, such as species or alleles. Based on a literature review in the fields of genetics, ecology and statistics and new developments, I analyse the potential of this function for biodiversity studies.

**2.** Quadratic entropy was established as a generalisation of well-known diversity indices, and has been widely used in molecular ecology and genetics research. It is now integrated within more general frameworks for analysing functional and phylogenetic diversity in ecology.

**3.** Quadratic entropy can be maximised by removing categories, and several collections can share the maximum diversity, even with highly distinct compositions. Clarifying these statements, I identify all potential indices of the abundance of the categories that maximise QE.

**4.** By quantifying changes in diversity when mixing collections together, QE can measure differences among collections. Here, I provide a geometric interpretation of these differences that demonstrates their relevance as classical geometric distances.

**5.** A critical aspect of these distances is obtained if QE is strictly concave; that is, diversity always strictly increases by mixing distinct collections together. More generally, QE can be used to evaluate the effects of various factors on diversity in a framework designated ANOQE (analysis of QE). Generalising ANOVA (analysis of variance), ANOQE uses QE to measure distances between centroids.

**6.** Importantly, QE is estimated from sampled data and thus requires estimators. Based on these estimators, tests have been developed to compare levels of diversity. Tests of factor effects are evaluated by parametric, jackknife, bootstrap and permutational approaches. However, the procedures associated with these tests that have been suggested thus far only treat a few factors.

**7.** There is an urgent need for the development of such approaches in biology to deal with experimental factors, observed population and community structure, and different spatial and temporal scales. Together, QE and the ANOQE procedure are likely to have a critical impact on all scientific disciplines interested in any form of diversity.

### Introduction

Traditionally, biodiversity has been measured by counting categories (e.g., species, alleles) in collections (e.g., communities, populations) of entities (organisms). However, studies now are evolving towards a more synthetic approach in which different scales and explanatory factors are considered. At an ecological level, species have different phylogenies and life histories that make them similar in some aspects and unique in others. At a genetic level, some DNA sequences share more nucleotides than others. According to these considerations, an index of diversity should include: (i) differences among categories (e.g., nucleotide distances among alleles, phylogenetic/functional distances among species); (ii) the proportions of these categories within a collection (e.g., relative abundance of an allele within a population or of a species within a community); and (iii) factors that might impact the level of diversity (e.g., spatio-temporal scales or factor levels in an experimental design).

Quadratic entropy (QE), defined as the expected distance between two entities in a collection, satisfies these requirements (Rao 1982a). In genetics, nucleotide distances among alleles are used to compare genetic diversity among species and to reveal factors that influence populations, including mutation rate and effective population size (Nei & Li 1979). In ecology and conservation, the taxonomic, phylogenetic and functional distances between species are used to prioritise the conservation of species and areas (Faith 1992; Pavoine, Ollier & Dufour 2005a), to evaluate ecosystem services (Lavorel *et al*. 2010) and to understand the ecological processes that structure community assemblages (Warwick & Clarke 1995; Pavoine & Dolédec 2005; Hardy & Senterre 2007). With this definition, QE is closely related to classical species encounter theory (e.g., Patil & Taillie 1982).

Interestingly, QE can be partitioned across different levels of factors in exactly the same way that variance is partitioned in analysis of variance (ANOVA) (Rao 1982b, 1986). This type of partitioning was thus designated analysis of quadratic entropy, ANOQE (Liu 1991). Through this analysis, it is possible to evaluate the effects of experimental or structural (e.g., spatio-temporal) factors on biodiversity. Despite its development in the 1980s, this approach remains under-exploited, and is still in its infancy with respect to testing the effects of multiple factors. In addition, the connections between ANOQE and other independent developments are not widely acknowledged in the biological literature (Excoffier, Smouse & Quattro 1992; Gower & Krzanowski 1999; Legendre & Anderson 1999; McArdle & Anderson 2001). Analysis of molecular variance, AMOVA (Excoffier, Smouse & Quattro 1992), which tests for genetic structures in hierarchical subdivisions of a population, is a striking example of such a connected approach (Pavoine 2005), with the original publication cited about 4000 times (ISI Web of Science 2011). By contrast, the original report on ANOQE (Rao 1982a) has only 232 citations. However, ANOQE has more potential than AMOVA. It can be employed to analyse nested, crossed, fixed, random or mixed factors, and has several ramifications allowing multivariate factorial analysis (Pavoine, Dufour & Chessel 2004; Pavoine & Bailly 2007) as well as decomposition of diversity across a taxonomic (Ricotta 2005) or phylogenetic (Pavoine, Love & Bonsall 2009; Pavoine, Baguette & Bonsall 2010) tree, all of which render the approach practical for use by biologists. Recently, Rao (2010) re-emphasised the importance that ANOQE is likely to have for future research on diversity.

In this context, the objectives of this report are to review the developments made thus far with respect to QE and related approaches, to introduce new developments and to propose new directions for future research.

### Development of quadratic entropy

#### *H*_{D} Function (QE Index)

Measures similar to the QE index have been developed independently in the fields of functional ecology, genetics, taxonomy and economics since the report by Hendrickson & Ehrlich (1971) (see Appendix S1 for details). Rao (1982a) defined the DIVC index (DIVersity Coefficient) as

$$\mathrm{DIVC}(P) = \int d(X_1, X_2)\, dP(X_1)\, dP(X_2)$$

where *P* is the probability distribution function of a variable *X* and *d*(*X*_{1},*X*_{2}) is a non-negative symmetric function that measures the difference between two individuals with *X* = *X*_{1} and *X* = *X*_{2}. The term ‘generalised quadratic entropy function’ appeared in Rao (1982c), and the simpler version, ‘quadratic entropy’, was retained in subsequent reports (e.g. Rao 1986, 2010; Rao & Nayak 1985).

In current biodiversity studies, the formula simplifies to

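$$H_{\mathbf{D}}(\mathbf{p}) = \sum_{k=1}^{S}\sum_{l=1}^{S} p_k\, p_l\, d_{kl} = \mathbf{p}^{t}\mathbf{D}\mathbf{p} \qquad \text{(eqn 1)}$$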

where **p** is a vector of proportions (*p*_{1},*p*_{2},…,*p*_{S}) in the set *A*^{S} = {**p** : *p*_{k} ≥ 0 for all *k* and ∑_{k}*p*_{k} = 1}. Contrary to other developments (Izsák & Pavoine 2011), matrix **D** = (*d*_{kl}), which contains the differences between categories (e.g., taxonomic, phylogenetic, functional differences among species, nucleotide differences among alleles), is considered here to be independent from **p**. Based on pair-wise comparisons, *H*_{D}(**p**) is thus the mean difference between two random categories and, therefore, it cannot integrate higher levels of inter-relationships among more than two categories.
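As a minimal sketch of this definition, the index can be computed as the quadratic form of **p** and **D**. The function name `quadratic_entropy`, the three-species community and its dissimilarity values are hypothetical and serve only as an illustration.

```python
import numpy as np

def quadratic_entropy(p, D):
    """Return H_D(p) = p^t D p for proportions p and dissimilarity matrix D."""
    p = np.asarray(p, dtype=float)
    D = np.asarray(D, dtype=float)
    if p.ndim != 1 or D.shape != (p.size, p.size):
        raise ValueError("D must be an S x S matrix matching the length of p")
    if np.any(p < 0) or not np.isclose(p.sum(), 1.0):
        raise ValueError("p must be non-negative and sum to 1")
    return float(p @ D @ p)

# Hypothetical community: species 1 and 2 are close, species 3 is distinct.
D_example = np.array([[0.0, 0.2, 1.0],
                      [0.2, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])
p_example = np.array([0.5, 0.3, 0.2])
h_example = quadratic_entropy(p_example, D_example)
```

The value is the expected dissimilarity between two categories drawn independently according to **p**, so a pair drawn twice from the same category contributes zero whenever the diagonal of **D** is zero.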

Lau (1985) and Rao (1986) demonstrated a useful property of *H*_{D}(**p**) (see *Partitioning Diversity*) if **D** = (*d*_{kl}) satisfies the following conditions:

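$$d_{kl} = d_{lk} \ge 0,\quad d_{kk} = 0 \ \text{ for all } k, l, \quad \text{and } \left(\sqrt{2\,d_{kl}}\right) \text{ is Euclidean} \qquad \text{(eqn 2)}$$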

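#### *H*_{D} Generalises Well-Known Indices

Let *d*_{kl} = 1 for all *k*≠*l* (which is equivalent to **D** = **1****1**^{t}−**I**, with **1** being the *S* × 1 vector of ones and **I** the *S* × *S* identity matrix); *H*_{D}(**p**) is then equal to the Gini–Simpson index (variation in a qualitative variable, Rao 1982a):

$$H_{\mathbf{D}}(\mathbf{p}) = \mathbf{p}^{t}\left(\mathbf{1}\mathbf{1}^{t} - \mathbf{I}\right)\mathbf{p} = 1 - \sum_{k=1}^{S} p_k^{2} \qquad \text{(eqn 3)}$$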

In comparison with the Gini–Simpson index, *H*_{D} can integrate the fact that species are not equivalent but differ in terms of phylogenetic and taxonomic positions, as well as functional traits, or the fact that some alleles might share more nucleotides than others.

In addition, *H*_{D} also generalises the variance of a quantitative variable (Rao 2010). The variance of the variable *Y* is usually written as the mean squared deviation between *y*_{k} values and their mean, *y*_{•}. However, it can be rewritten as the average squared difference among values:

$$\sum_{k=1}^{S} p_k \left(y_k - y_{\bullet}\right)^{2} = \sum_{k=1}^{S}\sum_{l=1}^{S} p_k\, p_l\, \frac{\left(y_k - y_l\right)^{2}}{2}$$

Using *d*_{kl}=(*y*_{k}−*y*_{l})^{2}/2 leads to Eqn 1. Although the variance only considers a single quantitative trait to characterise the categories, *H*_{D} can consider multivariate distances among categories, which is necessary, for instance, when measuring taxonomic and phylogenetic distances among species, or functional distances based on multiple traits.
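Both special cases can be checked numerically. The sketch below uses made-up proportions and trait values; it compares *H*_{D} with the Gini–Simpson index when all categories are equidistant, and with the weighted variance when *d*_{kl} is half the squared trait difference.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # hypothetical proportions
S = p.size

# Case 1: all categories equidistant (d_kl = 1 for k != l, d_kk = 0).
D_equal = np.ones((S, S)) - np.eye(S)
h_equal = float(p @ D_equal @ p)
gini_simpson = 1.0 - float(np.sum(p ** 2))

# Case 2: half squared differences of a single quantitative trait y.
y = np.array([1.0, 4.0, 6.0])   # hypothetical trait values
D_trait = (y[:, None] - y[None, :]) ** 2 / 2.0
h_trait = float(p @ D_trait @ p)
weighted_mean = float(np.sum(p * y))
weighted_var = float(np.sum(p * (y - weighted_mean) ** 2))
```

Here `h_equal` matches `gini_simpson` and `h_trait` matches `weighted_var`, which is the sense in which the quadratic form contains both indices.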

#### *H*_{D} Is Generalised

In contrast, two generalisations of *H*_{D} have been developed in ecology. The objectives of these generalisations are to include a parameter in the index of diversity that modifies the importance given to rare species in comparison with more abundant ones. The structure of a community is then described by a vector where abundant species are progressively given more weight until only the most abundant species dominates. This vector might be used to test whether different processes underlie the presence of rare vs. abundant species.

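Ricotta & Szeidl (2006) defined, for dissimilarities *d*_{kl} in [0,1],

$$Q_a(\mathbf{p}) = \frac{1 - \sum_{k=1}^{S} p_k \left(\sum_{l=1}^{S} p_l \left(1 - d_{kl}\right)\right)^{a-1}}{a - 1}$$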

With *a* = 2, *Q*_{a} = *H*_{D}; with *a* = 0, *Q*_{a} is a generalisation of the richness; and with *a* tending to 1, *Q*_{a} is a generalisation of the Shannon (1948) index.

Pavoine *et al.* (2009) also generalised *H*_{D} in the context of a phylogenetic tree describing the estimated dates of speciation among species. Each period in a phylogeny defines groups of species that descend from it, just as dividing a genealogical tree at a given time defines sets of related families. The index of biodiversity was defined as

$$I_a = \sum_{K} \left(t_K - t_{K-1}\right) \frac{1 - \sum_{i} p_{i,K}^{\,a}}{a - 1}$$

The values (*t*_{K}−*t*_{K−1}) are the lengths of periods, and *p*_{i,K} is the proportion of the *i*th group defined at period *K*. With *a* = 0, *I*_{a} generalises the richness and is equal to the famous Faith (1992) index of phylogenetic diversity, apart from the use of an additive constant (the height of the tree). With *a* = 2, *I*_{a} = *H*_{D}, where **D** is defined from the tree. When *a* tends to 1, *I*_{a} is a generalisation of the Shannon (1948) index.

#### Proposed Functions of *H*_{D}

Recent reports in ecology have proposed using functions of *H*_{D} instead of *H*_{D} itself. The objective of this is to obtain intuitive measures that are more easily interpretable by ecologists and conservation biologists. The method consists of obtaining the ‘doubling’ or ‘replication’ property: if *N* equally diverse, equally large and maximally dissimilar assemblages are pooled, the diversity of the pooled assemblages must be *N* times the diversity of the individual assemblages. With the condition that the distances (values in **D**) lie in [0,1], Ricotta, Burrascano & Blasi (2010) suggested using 1/(1−*H*_{D}). Chao, Chiu & Jost (2010) developed a related formula for the special case of distances among species obtained from rooted phylogenetic trees or functional dendrograms: $1/(1 - H_{\mathbf{D}}/\bar{T})$. Here, the distances among species in **D** are equal to half the sum of the branch lengths on the shortest path that connects two species in a tree, and $\bar{T}$ is the average distance between a species and the root of the tree.
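The replication property of the transformation 1/(1−*H*_{D}) can be illustrated with a small sketch. It assumes the simplest setting, in which every pair of distinct species is maximally dissimilar (*d*_{kl} = 1); the helper name `effective_number` and the assemblage sizes are hypothetical.

```python
import numpy as np

def effective_number(p, D):
    """Return 1 / (1 - H_D(p)); assumes all dissimilarities lie in [0, 1]."""
    h = float(p @ D @ p)
    return 1.0 / (1.0 - h)

# One assemblage: S equally abundant, maximally dissimilar species.
S = 4
p_single = np.full(S, 1.0 / S)
D_single = np.ones((S, S)) - np.eye(S)

# Pool N such assemblages of equal size that share no species.
N = 3
p_pooled = np.full(N * S, 1.0 / (N * S))
D_pooled = np.ones((N * S, N * S)) - np.eye(N * S)

single = effective_number(p_single, D_single)
pooled = effective_number(p_pooled, D_pooled)
```

In this setting `pooled` equals `N * single`, whereas the untransformed *H*_{D} only moves from 1 − 1/*S* to 1 − 1/(*NS*).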

Given that the function *f*(*x*) = 1/(1−*x*) increases monotonically, these transformations do not change how communities are ranked according to their diversity. Rather, they change the absolute value of the difference between the diversity of two communities, and will thus impact our interpretation of how much more (or less) diversified a community is in comparison to others. These transformations are therefore useful to compare the levels of diversity among distinct communities (Jost 2006). However, considering the developments to date regarding the quadratic entropy index, the function *H*_{D}, without any transformation, has the advantage of being integrated in a more general, inferential framework: tests have been developed based on *H*_{D} to evaluate the effects of various factors on diversity.

It is very common in statistics that the metric associated with a test is a function of the metric used for an intuitive interpretation. As a result, both *H*_{D} and its transformation could be used in the future to test and interpret, respectively, the effects of different factors on levels of diversity. Nevertheless, transformations have so far only been proposed for measuring the diversity within and among collections (Ricotta *et al.* 2010). To my knowledge, transformations associated with the effects of multiple factors affecting biodiversity have not been suggested.

### *H*_{D} optimisation

The previous section introduced QE (function *H*_{D}) and positioned it among the indices that it generalises and, in contrast, among the indices that generalise it. This section goes one step further in characterising QE through analysis of its maximum. According to *H*_{D}, what would be the characteristics of a population with maximum possible genetic diversity? What would be the characteristics of a community with maximum ecological diversity? Precise definition of the maximum of a diversity index is important because many conservation strategies are based on preserving the maximum possible amount of diversity.

#### Maximising *H*_{D} in a Euclidean Space

Maximisation of *H*_{D} has been studied in the case where matrix **D** is fixed. All species that could be found in a community, or all alleles that could occur within a population, are therefore known, and the genetic, taxonomic, phylogenetic and functional distances among them are fixed. Thus, *H*_{D} is optimised over all possible vectors, **p**. This maximisation approach, in which **D** is fixed, is coherent with the maximisation of more classical indices of diversity (e.g., Shannon 1948; Simpson 1949), where the number of categories is fixed and the distances among categories are fixed as equal. Regardless, for these more classical indices the maximum is unambiguously obtained for the evenness of the proportions of the categories. When the distances among the categories are not equal, the maximum is quite different. Pavoine, Ollier & Pontier (2005b) have obtained the maximum value and a maximising vector, **p**, for *H*_{D}. Pavoine & Bonsall (2009) demonstrated that several vectors **p** can lead to the maximum. I formulate these findings in a new proposition and provide a demonstration of this in Appendix S2. Figure 1 gives examples of the *H*_{D} values when three theoretical categories are considered.

##### Proposition 1

Let *H*_{D} be the function defined in Eqn 1, with **D** being an *S* × *S* matrix satisfying condition 2. Given that **D** satisfies condition 2, there exists a Euclidean space with points *M*_{k}, *k* = 1,…,*S*, such that ||*M*_{k}*M*_{l}||^{2}/2 = *d*_{kl} for all *k* and *l*. Let *T* be a set of *s* points on the boundary of the smallest hypersphere (SEH) that encloses the *S* points *M*_{k}, *k* = 1,…,*S*. Let **D**_{s} = (*d*_{kl}) be the *s*×*s* symmetric square subset of matrix **D** that corresponds to set *T*. Any vector that gives the proportions defined as for the *s* points in *T* and as zero for the remaining points with the constraint that is the radius of the SEH and that contains non-negative values is a maximiser.

An important consequence of this proposition is that, according to *H*_{D}, diversity can be maximised by reducing richness (i.e., the number of categories, alleles or species), and this depends on matrix **D**. The choice of **D** is therefore critical.
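A brute-force sketch can illustrate how maximising *H*_{D} may remove a category. It assumes three hypothetical categories placed on a single trait axis with *d*_{kl} = (*y*_{k}−*y*_{l})^{2}/2, and it searches a coarse grid over the simplex rather than using the analytical result of the proposition.

```python
import itertools
import numpy as np

# Three hypothetical categories on one trait axis; d_kl = (y_k - y_l)^2 / 2.
y = np.array([0.0, 1.0, 2.0])
D = (y[:, None] - y[None, :]) ** 2 / 2.0

# Coarse grid over the simplex {p : p_k >= 0, sum_k p_k = 1}.
steps = 100
best_value, best_p = -1.0, None
for i, j in itertools.product(range(steps + 1), repeat=2):
    if i + j > steps:
        continue
    p = np.array([i, j, steps - i - j]) / steps
    value = float(p @ D @ p)
    if value > best_value:
        best_value, best_p = value, p

# Diversity of the perfectly even community, for comparison.
p_even = np.full(3, 1.0 / 3.0)
even_value = float(p_even @ D @ p_even)
```

On this grid the maximum is reached at **p** = (0.5, 0, 0.5): the intermediate category receives zero weight, and the even community has a lower *H*_{D}. With equal distances among the three categories the even community would be the maximiser instead, which is why the choice of **D** matters.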

#### Choice of Matrix **D**

The fact that *H*_{D} can be maximised by removing some categories has been considered a poor property for an index of diversity (e.g., Shimatani 2001). This is because the simple counting of categories such as species or alleles is still deeply rooted in traditional biodiversity studies, and strongly influences research associated with new diversity indices. Alternatively, I suggest treating as a strength the fact that *H*_{D}, in its general definition, departs from traditional indices of biodiversity and complements them. *H*_{D} is a new, non-redundant synthetic measure of biodiversity.

Its connection with traditional indices depends on the choice of matrix **D**. In particular, **D** = (*d*_{kl}) is ultrametric if and only if *d*_{kl}≥0 ∀*k*,*l*; *d*_{kl}≤max(*d*_{kj},*d*_{jl}) ∀*j*,*k*,*l*; and *d*_{kk}≤min_{l≠k}*d*_{kl}∀*k* (which includes equidistances). Ultrametric distances are important in ecology and conservation as taxonomic distances, and the phylogenetic distances defined as the time from speciation, are ultrametric. With **D** being ultrametric, *H*_{D} has several interesting properties: (i) the maximising vector of proportions is unique; (ii) it has no zero values; and (iii) it reflects species’ contributions to diversity (Pavoine *et al.* 2005a).
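The three ultrametric conditions listed above translate directly into a check on a candidate matrix. This is a minimal sketch; the function name `is_ultrametric` and both example matrices are hypothetical.

```python
import itertools
import numpy as np

def is_ultrametric(D, tol=1e-12):
    """Check d_kl >= 0, d_kl <= max(d_kj, d_jl), and d_kk <= min_{l != k} d_kl."""
    D = np.asarray(D, dtype=float)
    S = D.shape[0]
    if np.any(D < -tol):
        return False
    for j, k, l in itertools.product(range(S), repeat=3):
        if D[k, l] > max(D[k, j], D[j, l]) + tol:
            return False
    for k in range(S):
        others = np.delete(D[k], k)
        if others.size and D[k, k] > others.min() + tol:
            return False
    return True

# Taxonomic-style distances: species 1 and 2 share a genus, species 3 does not.
D_taxo = np.array([[0.0, 1.0, 2.0],
                   [1.0, 0.0, 2.0],
                   [2.0, 2.0, 0.0]])

# Trait-based distances on one axis are generally not ultrametric.
y = np.array([0.0, 1.0, 2.0])
D_trait = (y[:, None] - y[None, :]) ** 2 / 2.0
```

The triple loop is cubic in *S*, which is acceptable for a sketch but would need vectorising for large species pools.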

### Estimating and comparing levels of diversity

Now that *H*_{D} has been defined, identified as a generalised measure, and described based on an optimisation study as a new integrative measure of diversity, it is important to consider that it is estimated from samples, and that therefore estimators are required. Analysing estimators for *H*_{D} is a prerequisite for testing the effects of factors on its values.

#### Nayak Parametric Estimator for *H*_{D}

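Nayak (1983) obtained an estimator for *H*_{D} with the assumption that the observed proportions are drawn from a multinomial distribution. Let **π** = (*π*_{1},*π*_{2},…,*π*_{S})^{t} be the unknown vector of proportions. Let *N*_{k} be a variable denoting the observed number of entities from category *k* (*k* = 1,…,*S*) and *n* the total number of entities observed. Let *p*_{k} = *N*_{k}/*n* and **p** = (*p*_{1},*p*_{2},…,*p*_{S})^{t}. An estimator for *H*_{D}(**π**) is

$$H_{\mathbf{D}}(\mathbf{p}) = \mathbf{p}^{t}\mathbf{D}\mathbf{p}$$

Nayak (1983) demonstrated that this estimator is asymptotically unbiased, and that its variance tends towards 0 when *n* tends towards infinity. *H*_{D}(**p**) is therefore an appropriate estimator for *H*_{D}(**π**) (Nayak 1983).

Generally (see Propositions 4.4.6 and 4.4.7 in Nayak 1983),

$$\sqrt{n}\,\left[H_{\mathbf{D}}(\mathbf{p}) - H_{\mathbf{D}}(\boldsymbol{\pi})\right] \xrightarrow{d} N\!\left(0,\ 4\boldsymbol{\pi}^{t}\mathbf{D}\boldsymbol{\Sigma}\mathbf{D}\boldsymbol{\pi}\right)$$

where **Σ** is the covariance matrix of the multinomial proportions, diag(**π**)−**ππ**^{t}. Under this condition, a confidence interval for *H*_{D} can be specified for large samples (high *n* values) at level 100(1−*α*)%, with **p** as an estimate of **π** and $\hat{\sigma}^{2}$ as an estimate of 4**π**^{t}**D**Σ**D****π** obtained by replacing **π** by **p**:

$$H_{\mathbf{D}}(\mathbf{p}) \pm \varepsilon_{\alpha/2}\,\frac{\hat{\sigma}}{\sqrt{n}}$$

where *ɛ*_{α/2} is the threshold from the normal distribution *N*(0,1) associated with *α*.

From this, Nayak (1983) developed tests for differences between *H*_{D} estimates assuming that the proportions are drawn from a multinomial distribution, and that the sample size is large. Consider *H*_{0}:*H*_{D}(**p**_{1})=*H*_{D}(**p**_{2})=…=*H*_{D}(**p**_{r}), meaning that the diversities within *r* samples are equal. Let *n*_{i} be the number of entities observed in sample *i*, let $\hat{\sigma}_{i}^{2}$ be obtained by replacing **π** with **p**_{i} in the expression 4**π**^{t}**D**Σ**D****π**, and let

$$\bar{H} = \frac{\sum_{i=1}^{r} \left(n_i/\hat{\sigma}_{i}^{2}\right) H_{\mathbf{D}}(\mathbf{p}_i)}{\sum_{i=1}^{r} n_i/\hat{\sigma}_{i}^{2}}$$

be the averaged diversity over all samples. *H*_{0} is rejected with a risk of error *α* if

$$\sum_{i=1}^{r} \frac{n_i \left[H_{\mathbf{D}}(\mathbf{p}_i) - \bar{H}\right]^{2}}{\hat{\sigma}_{i}^{2}} > \chi^{2}_{r-1,\alpha}$$

where $\chi^{2}_{r-1,\alpha}$ is the threshold from a $\chi^{2}$ distribution with *r*−1 degrees of freedom associated with *α*. This test might be used to compare the level of nucleotide diversity among populations or the level of taxonomic, phylogenetic and functional diversity among communities.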
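A large-sample confidence interval of this kind can be sketched as follows. The sketch assumes that the asymptotic variance is 4**π**^{t}**D**Σ**D****π** with Σ taken as the multinomial covariance diag(**π**)−**ππ**^{t} (the usual delta-method form), evaluated at the observed proportions; the function name and the counts are hypothetical, and the code is not a reproduction of Nayak's (1983) procedure.

```python
from statistics import NormalDist
import numpy as np

def qe_confidence_interval(counts, D, alpha=0.05):
    """Plug-in estimate of H_D with a normal-approximation interval."""
    counts = np.asarray(counts, dtype=float)
    D = np.asarray(D, dtype=float)
    n = counts.sum()
    p = counts / n                                # observed proportions
    h = float(p @ D @ p)                          # plug-in estimate
    sigma = np.diag(p) - np.outer(p, p)           # assumed multinomial covariance
    var_hat = 4.0 * float(p @ D @ sigma @ D @ p)  # estimated asymptotic variance
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # standard normal threshold
    half_width = z * np.sqrt(var_hat / n)
    return h, h - half_width, h + half_width

# Hypothetical sample of 100 individuals from three species.
counts_example = [50, 30, 20]
D_example = np.array([[0.0, 0.2, 1.0],
                      [0.2, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])
h_hat, lower, upper = qe_confidence_interval(counts_example, D_example)
```

The interval relies on the multinomial assumption and on a large *n*; it does not apply when abundances are recorded as biomass or cover rather than counts.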

#### Non-parametric Alternatives

Field data will not necessarily satisfy the multinomial assumption. Additionally, in ecological studies, the number of individuals per sample is not always known, with quantities instead being broadly measured as biomass, percentage cover or density. Alternatives to Nayak's *H*_{D} estimates and tests for differences between *H*_{D} estimates should therefore be developed.

Other approaches do exist for more traditional indices that do not include a matrix **D** but that could be adapted for application to *H*_{D}. For instance, Magurran (2004) suggests that it could be more beneficial to measure the diversity index within a number of samples, instead of within a single large sample. Such samples could be jackknifed to improve diversity estimates, and obtain confidence intervals. Diversity curves analysing *H*_{D} as a function of the number of samples could be computed. These curves may or may not attain an asymptote. In addition, Ricotta *et al.* (2010) suggested an estimator for *H*_{D} based on rarefaction methods, where the relative abundances of species are replaced with species’ contributions to the expected species richness. This estimator uses the presence/absence of species in samples, instead of their abundance. Finally, if a single large sample is available, the applicability of bootstrap approaches will depend on how the data have been collected (e.g., Liu 1991; Liu & Rao 1995).

A further alternative to Nayak (1983) would be to develop a test of the homogeneity of multivariate dispersions (Anderson 2006) accounting for the fact that sampled categories are weighted (e.g., a species is weighted by its biomass). Whether it is possible to compare the level of *H*_{D} among populations or communities in real datasets is thus still an open question.

### Among-collection diversity

Previous sections have focused on measuring diversity within a collection. However, a critical component of diversity is the average distance between collections. In genetics, measuring dissimilarities among populations allows evaluation of the degree of isolation among populations. In ecology, measuring dissimilarities among communities can reveal critical processes, including dispersal and colonisation, environmental filters and competition (Pavoine & Bonsall 2011).

#### A Unified Approach for Diversity and Dissimilarities

Diversity and dissimilarities are two related concepts, as the diversity of a collection is zero if all of its entities are the same. Increasing numbers of studies therefore incorporate nucleotide, taxonomic, phylogenetic and functional dissimilarities among organisms. These dissimilarities should also be considered when measuring the dissimilarities among whole collections.

According to Rao & Nayak (1985), any non-negative and concave index of diversity, *H*, might be partitioned as

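$$H(\mathbf{p}_{\bullet}) = \sum_{i} \lambda_i H(\mathbf{p}_i) + \sum_{i} \lambda_i\, C(\mathbf{p}_i, \mathbf{p}_{\bullet}) \qquad \text{(eqn 4)}$$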

where *C*(**p**_{i},**p**_{•}) is a measure of differences between a collection *i* with the vector of proportions **p**_{i}, and a theoretical average collection with the vector of proportions **p**_{•}=∑_{i}*λ*_{i}**p**_{i}, *λ*_{i} is a weight given to collection *i* with ∑_{i}*λ*_{i}=1. If *C* is symmetric (*C*(**p**,**q**)=*C*(**q**,**p**)), then Eqn 4 can be rewritten as

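$$H(\mathbf{p}_{\bullet}) = \sum_{i} \lambda_i H(\mathbf{p}_i) + \sum_{i}\sum_{j} \lambda_i \lambda_j\, D(\mathbf{p}_i, \mathbf{p}_j) \qquad \text{(eqn 5)}$$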

where

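$$D(\mathbf{p}_i, \mathbf{p}_j) = 2H\!\left(\frac{\mathbf{p}_i + \mathbf{p}_j}{2}\right) - H(\mathbf{p}_i) - H(\mathbf{p}_j) \qquad \text{(eqn 6)}$$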

*D*(**p**_{i},**p**_{j}) is a measure of dissimilarity, induced by the function *H*, between a collection with the vector of proportions **p**_{i} and a collection with the vector of proportions **p**_{j}. Rao & Nayak (1985) demonstrate that *C* is a symmetric function if, and only if, *H* can be written as a form of *H*_{D}. *H*_{D} therefore unifies the diversity and dissimilarity concepts because the diversity of a collection is measured based on dissimilarities among categories, and dissimilarities among collections are calculated from diversity. Equation 5 corresponds to a first, simple version of diversity partitioning, where the diversity in the collections pooled together (SST=*H*_{D}(**p**_{•})) is equal to the sum of the averaged diversity within collections (SSW=∑_{i}*λ*_{i}*H*_{D}(**p**_{i})), and the diversity among collections (SSB = SST−SSW). If **D** satisfies condition 2, *H*_{D} is concave (Rao 1986), which ensures that *D*(**p**_{i},**p**_{j})≥0 and SSB ≥ 0.
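The partition SST = SSW + SSB can be checked numerically. The sketch below assumes that the pairwise dissimilarity takes the Jensen-difference form *D*(**p**_{i},**p**_{j}) = 2*H*_{D}((**p**_{i}+**p**_{j})/2) − *H*_{D}(**p**_{i}) − *H*_{D}(**p**_{j}), that is, the gain in diversity obtained by mixing two collections in equal parts; the collections, weights and trait values are made up.

```python
import numpy as np

def qe(p, D):
    """Quadratic entropy H_D(p) = p^t D p."""
    return float(p @ D @ p)

def pair_dissimilarity(p_i, p_j, D):
    """Gain in diversity from mixing two collections in equal parts."""
    return 2.0 * qe((p_i + p_j) / 2.0, D) - qe(p_i, D) - qe(p_j, D)

# Hypothetical trait-based distances, Euclidean by construction.
y = np.array([1.0, 4.0, 6.0])
D = (y[:, None] - y[None, :]) ** 2 / 2.0

# Three hypothetical collections (rows) and their weights lambda_i.
P = np.array([[0.6, 0.3, 0.1],
              [0.2, 0.2, 0.6],
              [0.1, 0.8, 0.1]])
lam = np.array([0.5, 0.3, 0.2])

p_pooled = lam @ P                                   # average collection
sst = qe(p_pooled, D)                                # total diversity
ssw = sum(l * qe(p, D) for l, p in zip(lam, P))      # within collections
ssb = sst - ssw                                      # among collections
ssb_pairs = sum(lam[i] * lam[j] * pair_dissimilarity(P[i], P[j], D)
                for i in range(3) for j in range(3))
```

The equality of `ssb` and `ssb_pairs` holds for any symmetric **D**; the non-negativity of `ssb` is what requires the Euclidean condition on **D**.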

#### A Geometric Interpretation

A measure of the distance between populations or communities should indicate the degree of differences among these populations or communities in terms of the organisms they contain, and in terms of differences among these organisms. From a functional point of view, two communities that share no species might be evaluated as similar if the species they contain have similar life histories. The biological meaning of *D*(**p**_{i},**p**_{j}), as a measure of the distance among populations or communities, has been questioned. Pavoine & Bonsall (2009) demonstrate that the dissimilarity among two collections with maximum diversity is zero, even if the two collections have different compositions and thus fall into different categories. There is therefore a need to clarify how *D*(**p**_{i},**p**_{j}) measures the distance among two collections.

According to Eqn 6, these distances among collections are defined as the degree to which diversity increases by mixing two collections together. As Ricotta (2005) highlights, using this excess of diversity as a distance among collections is not necessarily straightforward. However, the geometric interpretation of *D*(**p**_{i},**p**_{j}) provided below demonstrates that *D* is definitely a meaningful distance metric.

If **D** satisfies condition 2, then it is possible to define a Euclidean space in which each category *k* will be positioned at a point *M*_{k}, such that ||*M*_{k}*M*_{l}||^{2}/2=*d*_{kl} for all *k* and *l*. Let **M** be the *S*×*n* matrix of coordinates, with points (categories) as rows and axes as columns, where *n* is the dimension of the Euclidean space. Then, each collection might be positioned at a centroid of the category points: a collection *i* with the vector of proportions **p**_{i} is positioned at point *G*_{i} with the vector of coordinates **M**^{t}**p**_{i}, such that ||*G*_{i}*G*_{j}||^{2}/2=*D*(**p**_{i},**p**_{j}) (Champely & Chessel 2002; Pavoine *et al.* 2004). The distance between two collections is therefore equal to half the squared distance between their centroids. In this Euclidean space, the decomposition SST = SSW+SSB given above is a decomposition of geometric variability (Pavoine *et al.* 2004; Cuadras 2008).

A consequence of this geometric interpretation is that *D*(**p**_{i},**p**_{j}) = 0 if, and only if, the collections *i* and *j* have identical centroids. Examples are given in Fig. 2. In the extreme case where the distances among categories can be plotted in only one dimension, that is where ANOQE = ANOVA, then *D*(**p**_{i},**p**_{j}) compares the means. Thus, the geometric interpretation here given provides justification for the use of *D*(**p**_{i},**p**_{j}): the ANOVA framework defines the distances between two collections as differences between means, whereas the ANOQE framework more generally defines the distances between two collections as differences between centroids. The fact that distances are measured from means (or centroids) must be kept in mind when interpreting values from the *D*(**p**_{i},**p**_{j}) index. For instance, in Fig. 2a, if points are species and if the Euclidean space is defined by the body size of the species, then a community with species A, D and E and a community with species B, C and F might be considered as equivalent because the species within them have equal mean body sizes.
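The centroid interpretation can be sketched numerically. The code below assumes coordinates obtained by an eigendecomposition of −**QDQ**, with **Q** the centring matrix **I** − **1****1**^{t}/*S*, and the Jensen-difference form of *D*(**p**_{i},**p**_{j}); the four categories at the corners of a unit square and the three collections are hypothetical.

```python
import numpy as np

def qe(p, D):
    """Quadratic entropy H_D(p) = p^t D p."""
    return float(p @ D @ p)

def euclidean_coordinates(D, tol=1e-10):
    """Coordinates X with ||x_k - x_l||^2 / 2 = d_kl, from -QDQ = X X^t."""
    S = D.shape[0]
    Q = np.eye(S) - np.ones((S, S)) / S            # centring matrix
    eigval, eigvec = np.linalg.eigh(-Q @ D @ Q)
    if eigval.min() < -tol:
        raise ValueError("D does not satisfy the Euclidean condition")
    keep = eigval > tol
    return eigvec[:, keep] * np.sqrt(eigval[keep])

# Four hypothetical categories at the corners of a unit square.
M = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = ((M[:, None, :] - M[None, :, :]) ** 2).sum(axis=-1) / 2.0

# Collections 0 and 1 share no category; collection 2 holds one category only.
P = np.array([[0.5, 0.0, 0.0, 0.5],
              [0.0, 0.5, 0.5, 0.0],
              [1.0, 0.0, 0.0, 0.0]])

X = euclidean_coordinates(D)
G = P @ X                                          # one centroid per collection

def centroid_dissimilarity(i, j):
    """Half the squared distance between the centroids of collections i and j."""
    return float(np.sum((G[i] - G[j]) ** 2) / 2.0)

def mixing_dissimilarity(i, j):
    """Gain in diversity from mixing collections i and j in equal parts."""
    return 2.0 * qe((P[i] + P[j]) / 2.0, D) - qe(P[i], D) - qe(P[j], D)
```

Collections 0 and 1 occupy opposite diagonals of the square, so their centroids coincide and their dissimilarity is zero although they share no category. Collection 2 lies away from that centroid and is at a positive distance from both.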

#### Consequence of the Strict Concavity of *H*_{D}

It is clear that if two sets of points in a multidimensional space are identical and associated with equal weights, their centroids must be identical. However, the converse is not true, except under particular conditions. More formally, *D*(**p**_{i},**p**_{j}) = 0 if **p**_{i} = **p**_{j}, but the converse is true only if *H*_{D} is strictly concave; diversity always strictly increases by mixing distinct collections together. Here, I provide further conditions on **D** such that *H*_{D} is strictly concave (see Appendix S3). In this case *D*(**p**_{i},**p**_{j}) = 0 if and only if **p**_{i} = **p**_{j}, meaning that the distance between two collections is zero if and only if their compositions are identical. If *H*_{D} is not strictly concave, the distance might be zero even if the compositions are not identical, and thus the strict concavity for *H*_{D} could be a prerequisite for a more intuitive measure of among-collection diversity.

##### Proposition 2

Let **p** ∈ *A*^{S} and **D** be an *S*×*S* matrix satisfying condition 2, and let **Q** = **I** − **1****1**^{t}/*S*, where **1** is the *S*×1 vector of ones and **I** the *S*×*S* identity matrix. *H*_{D}(**p**) = **p**^{t}**D****p** is strictly concave if and only if *S* points with coordinates **X**, defined as −**QDQ** = **XX**^{t}, are embedded in exactly *S*−1 dimensions; that is to say, if and only if rank(−**QDQ**) = *S*−1.

Note that ultrametric distances, including equidistances and, thus, the particular case of the Gini–Simpson index, satisfy this property. If *H*_{D} is strictly concave, then there is a single collection with a maximum value of *H*_{D}.
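The rank criterion of Proposition 2 is straightforward to evaluate. This sketch assumes that **Q** is the usual centring matrix **I** − **1****1**^{t}/*S* and uses a numerical tolerance for the rank; the function name and the example matrices are hypothetical.

```python
import numpy as np

def is_strictly_concave(D, tol=1e-10):
    """Check rank(-QDQ) == S - 1, with Q = I - 11^t / S (assumed centring matrix)."""
    D = np.asarray(D, dtype=float)
    S = D.shape[0]
    Q = np.eye(S) - np.ones((S, S)) / S
    rank = np.linalg.matrix_rank(-Q @ D @ Q, tol=tol)
    return bool(rank == S - 1)

# Equidistant categories (Gini-Simpson case): points span S - 1 dimensions.
D_equal = np.ones((4, 4)) - np.eye(4)

# Four categories at the corners of a unit square: only 2 dimensions for S = 4.
M = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D_square = ((M[:, None, :] - M[None, :, :]) ** 2).sum(axis=-1) / 2.0

strict_equal = is_strictly_concave(D_equal)
strict_square = is_strictly_concave(D_square)
```

For the square, distinct collections can share a centroid, which is the situation in which the dissimilarity between two different compositions is zero.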

### Partitioning diversity

Eqn 5 is a first diversity partitioning that analyses the effect of one factor on *H*_{D}, but QE can be more generally applied to estimate the effects of multiple factors.

#### Evaluating the Contributions of Factors on Values of *H*_{D}–ANOVA is generalised

If and only if matrix **D** satisfies condition 2, the quadratic entropy is completely concave (Lau 1985; Rao 1986). This property means that an ANOVA-like analysis (ANOQE) can be performed using *H*_{D} instead of variance (Rao 1986, 2010). It is therefore possible to test the effects on diversity of experimental or observational factors, such as spatial and temporal scales, or subdivisions of populations or communities into demes or patches. *H*_{D} is decomposed into a number of non-negative components, assigned to specified nested and/or crossed factors and interactions. For instance, consider two crossed factors, *X*_{1} with *r* levels and *X*_{2} with *s* levels. Let *p*_{ijk} be the proportion of the *k*th category for the *i*th level of *X*_{1} and the *j*th level of *X*_{2}. Let **p**_{ij}=(*p*_{ij1},…,*p*_{ijk},…,*p*_{ijS}) be the corresponding vector of proportions. Let *λ*_{ij} be the weight attributed to this sample; for instance, *λ*_{ij}=∑_{k}*p*_{ijk}/∑_{i,j,k}*p*_{ijk}. Let *λ*_{i•}=∑_{j}*λ*_{ij}; *λ*_{•j}=∑_{i}*λ*_{ij}; **p**_{i•}=∑_{j}*λ*_{ij}**p**_{ij}/*λ*_{i•}; **p**_{•j}=∑_{i}*λ*_{ij}**p**_{ij}/*λ*_{•j}; and **p**_{••}=∑_{i,j}*λ*_{ij}**p**_{ij}. Then

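- *H*_{D}(**p**_{••}) = ∑_{i,j}*λ*_{ij}*H*_{D}(**p**_{ij}) + SSB(*X*_{1}) + SSB(*X*_{2}) + SSB(*X*_{1}×*X*_{2}), with SSB(*X*_{1}) = *H*_{D}(**p**_{••}) − ∑_{i}*λ*_{i•}*H*_{D}(**p**_{i•}) and SSB(*X*_{2}) = *H*_{D}(**p**_{••}) − ∑_{j}*λ*_{•j}*H*_{D}(**p**_{•j}) (eqn 7)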

SSB(*X*_{1}) and SSB(*X*_{2}) are the main effects of *X*_{1} and *X*_{2}, respectively.

SSB(*X*_{1}×*X*_{2}) is the component of diversity stemming from the interaction between factors *X*_{1} and *X*_{2}. It can be deduced from Eqn 7 as the remainder once the within-sample diversity and the two main effects have been subtracted from the total diversity, and it has a simple form only if *λ*_{ij}=*λ*_{i•}*λ*_{•j} (independence of prior probabilities):

SSB(*X*_{1}×*X*_{2}) = −∑_{i,j}*λ*_{i•}*λ*_{•j}**p**_{c}^{t}**D****p**_{c}

where **p**_{c}=**p**_{ij}−**p**_{i•}−**p**_{•j}+**p**_{••} (Nayak 1983). It is also possible to evaluate the conditional effect of *X*_{2} given *X*_{1} as follows (Nayak 1986a):

- (eqn 8)
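The two-factor decomposition can be sketched numerically as follows. This is an illustration under stated assumptions rather than a reference implementation: each main effect is taken as the total diversity minus the weighted mean diversity of the marginal collections, the interaction is obtained as the remainder, the function names are hypothetical, and the data are simulated with product weights *λ*_{ij}=*λ*_{i•}*λ*_{•j} so that the remainder can be compared with the quadratic form in **p**_{c}.

```python
import numpy as np

def H(p, D):
    """Quadratic entropy H_D(p) = p^t D p."""
    return float(p @ D @ p)

def margins(P, lam):
    """Marginal weights and proportion vectors for two crossed factors."""
    lam_i = lam.sum(axis=1)                                  # lambda_i.
    lam_j = lam.sum(axis=0)                                  # lambda_.j
    p_i = np.einsum("ij,ijk->ik", lam, P) / lam_i[:, None]   # p_i.
    p_j = np.einsum("ij,ijk->jk", lam, P) / lam_j[:, None]   # p_.j
    p_all = np.einsum("ij,ijk->k", lam, P)                   # p_..
    return lam_i, lam_j, p_i, p_j, p_all

def anoqe_two_crossed(P, lam, D):
    """P has shape (r, s, S); lam has shape (r, s) and sums to 1; D has shape (S, S)."""
    r, s = lam.shape
    lam_i, lam_j, p_i, p_j, p_all = margins(P, lam)
    sst = H(p_all, D)
    ssw = sum(lam[i, j] * H(P[i, j], D) for i in range(r) for j in range(s))
    ssb1 = sst - sum(l * H(p, D) for l, p in zip(lam_i, p_i))
    ssb2 = sst - sum(l * H(p, D) for l, p in zip(lam_j, p_j))
    inter = sst - ssw - ssb1 - ssb2          # interaction obtained as the remainder
    return {"SST": sst, "SSW": ssw, "SSB(X1)": ssb1,
            "SSB(X2)": ssb2, "SSB(X1xX2)": inter}

def interaction_closed_form(P, lam, D):
    """-sum_ij lambda_i. lambda_.j p_c^t D p_c, valid for product weights only."""
    r, s = lam.shape
    lam_i, lam_j, p_i, p_j, p_all = margins(P, lam)
    return -sum(lam_i[i] * lam_j[j] * H(P[i, j] - p_i[i] - p_j[j] + p_all, D)
                for i in range(r) for j in range(s))

# Simulated example: r = 2 levels of X1, s = 3 levels of X2, S = 4 categories.
rng = np.random.default_rng(0)
r, s, S = 2, 3, 4
X = rng.normal(size=(S, 2))                                    # hypothetical coordinates
D = 0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # half squared distances
P = rng.dirichlet(np.ones(S), size=(r, s))                     # proportion vectors p_ij
lam = np.outer([0.4, 0.6], [0.2, 0.3, 0.5])                    # product weights

parts = anoqe_two_crossed(P, lam, D)
closed = interaction_closed_form(P, lam, D)
```

With half squared Euclidean distances, every component returned in `parts` is non-negative, and the remainder agrees with the closed form up to rounding error.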

Similar approaches have been developed by Gower & Krzanowski (1999; analysis of distance), Legendre & Anderson [1999; distance-based redundancy analysis (db-RDA)], Anderson (2001; nonparametric multivariate analysis of variance) and McArdle & Anderson (2001; extended development of db-RDA for applications to dissimilarity matrices that do not satisfy condition 2). In addition to these closely related approaches, ANOQE generalises the well-known ANOVA method (Fisher 1925) and its matching categorical analysis of variance (Light & Margolin 1971). Despite the extensive use of ANOVA and the general character of ANOQE, ANOQE is still rarely used in ecology and genetics.

#### Other Specific *H*_{D} Partitions in Ecology

Other kinds of *H*_{D} partitioning complementing the ANOQE framework have already been developed in the field of ecology. First, taxonomic diversity can be measured by *H*_{D^{Tax}}, where **D**^{Tax} represents taxonomic distances among species that depend on the hierarchy in the taxonomy. The distance between two species from the same genus is 1, while that between two species from different genera but the same family is 2, and so on. Consider a taxonomy including species, genera, families and orders. Let **p**, **g**, **f** and **o** be the vectors of the relative abundances of the species, the genera, the families and the orders, respectively. According to Shimatani (2001),

*H*_{D^{Tax}}(**p**) = *G*(**p**) + *G*(**g**) + *G*(**f**) + *G*(**o**)

where *G* is the Gini–Simpson index (see *H*_{D} *generalises well-known indices*). This stems from an additive property of *H*_{D} (if **D**=**D**_{1}+**D**_{2}, then *H*_{D}=*H*_{D1}+*H*_{D2}) and from the fact that *H*_{D} equals the Gini–Simpson index when all distances between distinct categories are equal to 1 (Eqn 3). This taxonomic decomposition (or any other additive partitioning of matrix **D**) can be applied to any component of ANOQE, such that these two complementary approaches allow studies of the effects of factors on the diversity of different taxonomic levels (Ricotta 2005). As Pavoine *et al.* (2009) highlighted, a similar approach can be applied to phylogenies divided into evolutionary periods, and a more complex decomposition of *H*_{D} on a phylogenetic tree can be found in Pavoine *et al.* (2010).
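The additive property behind the taxonomic decomposition can be illustrated with a short sketch. The taxonomy and abundances below are hypothetical, and **D**^{Tax} is assumed to be the number of taxonomic levels (species, genus, family, order) at which two species differ, which gives the distances 1, 2, and so on described above. Under that assumption, *H*_{D} computed with **D**^{Tax} should equal the sum of the Gini–Simpson indices of the species, genus, family and order abundances.

```python
import numpy as np

def gini_simpson(p):
    """Gini-Simpson index: 1 minus the sum of squared proportions."""
    return 1.0 - float(np.sum(np.asarray(p, dtype=float) ** 2))

def aggregate(p, labels):
    """Sum species proportions within each taxon given integer labels."""
    return np.bincount(labels, weights=p)

# Hypothetical nested taxonomy for six species.
species = np.arange(6)
genus = np.array([0, 0, 1, 2, 2, 3])
family = np.array([0, 0, 0, 1, 1, 2])
order = np.array([0, 0, 0, 0, 0, 1])
levels = [species, genus, family, order]

# Hypothetical relative abundances of the six species.
p = np.array([0.30, 0.10, 0.20, 0.15, 0.05, 0.20])

# D_tax[k, l] counts the levels at which species k and l belong to different taxa.
D_tax = sum((lab[:, None] != lab[None, :]).astype(float) for lab in levels)

H_tax = float(p @ D_tax @ p)
sum_of_gini = sum(gini_simpson(aggregate(p, lab)) for lab in levels)
```

Each level contributes a 0/1 matrix to `D_tax`, and the quadratic entropy of a 0/1 matrix of this kind is the Gini–Simpson index of the abundances aggregated at that level, so `H_tax` and `sum_of_gini` agree.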

To understand the respective roles of the proportions and the distances between categories in the measurement of diversity, another decomposition of *H*_{D} is as follows (Shimatani 2001),

*H*_{D}(**p**) = *G*(**p**) × *A* + *B*

where *G* is the Gini–Simpson index, *A* is the unweighted average distance, and *B* evaluates covariations between proportions and distances. This equation is therefore related to the well-known decomposition of the expected product of two variables: *E*(*XY*)=*E*(*X*)*E*(*Y*)+*Cov*(*X*,*Y*). In an ecological setting, analysing component *B* may be informative for understanding the determinants of community diversity and structure. Indeed, processes that limit similarity among species (including some cases of competition or mutualism) are expected to lead to positive covariance between abundance and distance, whereas processes that increase similarity among species, including environmental pressures, are expected to lead to negative covariance between abundance and distance.
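A minimal sketch of this second decomposition is given below. It assumes that **D** has a zero diagonal, that *A* is the mean distance over the *S*(*S*−1) ordered pairs of distinct categories, and that *B* is the sum over those pairs of the centred products of abundances and distances; this explicit form of *B* follows from the covariance identity quoted above and is not taken from Shimatani (2001). The function name and the data are hypothetical.

```python
import numpy as np

def shimatani_components(p, D):
    """Return (G, A, B) such that p^t D p = G * A + B, for D with a zero diagonal."""
    p = np.asarray(p, dtype=float)
    D = np.asarray(D, dtype=float)
    S = len(p)
    off = ~np.eye(S, dtype=bool)          # ordered pairs of distinct categories
    n_pairs = S * (S - 1)
    products = np.outer(p, p)[off]        # p_k * p_l for k != l
    distances = D[off]
    G = float(products.sum())             # Gini-Simpson index, 1 - sum p_k^2
    A = float(distances.mean())           # unweighted average distance
    # Centred cross-products: n_pairs times the covariance between p_k p_l and d_kl.
    B = float(((products - G / n_pairs) * (distances - A)).sum())
    return G, A, B

# Hypothetical example: four categories placed on one trait axis.
trait = np.array([0.0, 1.0, 3.0, 7.0])
D = np.abs(trait[:, None] - trait[None, :])
p = np.array([0.4, 0.3, 0.2, 0.1])

G, A, B = shimatani_components(p, D)
H_D = float(p @ D @ p)
```

In this example the two most abundant categories are also the closest ones on the trait axis, so `B` is negative: abundance and distance covary negatively.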

These particular decompositions of *H*_{D} complement the ANOQE framework, providing additional details on the relative impacts of **D** and **p** on each individual component of ANOQE.

### Testing the effects of factors on diversity: directions for future research

A crucial step towards concrete applications of ANOQE to real data sets will be the development of the tests associated with ANOQE. Although tests have been suggested, there is a critical need for further development.

#### One Factor

The decomposition for a single factor is given in *A unified approach for diversity and dissimilarities*. The effect of one factor on diversity has been evaluated with parametric tests by assuming multinomial distributions and large samples (Nayak 1986b), with bootstrap (Liu 1991; Liu & Rao 1995) and with permutational schemes (Anderson 2001; Pavoine & Dolédec 2005; Hardy & Senterre 2007; Hardy 2008). Anderson (2001) emphasised that one of the assumptions of these tests might be that observations have similar distributions, which means that a test for the equality of *H*_{D} measured per level of a factor (see *Estimating and comparing levels of diversity*) could be necessary before evaluating the effect of the factor. Nevertheless, a critical issue was observed by Nayak (1986b), who demonstrated that the components SST (total diversity) and SSB (diversity between levels of factors; i.e., the effect of a factor) are asymptotically independently distributed. Contrary to ANOVA, SSB/SST rather than SSB/SSW should therefore be used in statistical inference associated with ANOQE. In addition, this ratio has a direct interpretation as the proportion of explained diversity. Another important aspect of these tests is the definition of the null hypothesis, H_{0}. For instance, the Nayak (1986b) test is based on the null hypothesis that the true proportion vectors for each level of the factor are equal; that is, H_{0}: *π*_{1} = … = *π*_{r}. H_{0} implies that the dissimilarities among collections are zero (SSB = 0), but the converse is true only if *H*_{D} is a strictly concave function (see *Consequence of the strict concavity of* *H*_{D}).
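A generic permutational scheme for one factor can be sketched as follows. This is not the procedure of any of the studies cited above: it assumes that individuals are sampled independently, that collections are weighted by their relative abundance, and that individuals can be freely exchanged among collections under H_{0}. The function names, the default number of permutations and the example data are hypothetical.

```python
import numpy as np

def quadratic_entropy(p, D):
    """H_D(p) = p^t D p."""
    return float(p @ D @ p)

def ssb_over_sst(counts, D):
    """Proportion of diversity explained by the factor, SSB / SST.

    counts is an (r, S) array of abundances: r collections, S categories.
    SST must be positive (at least two categories present overall).
    """
    counts = np.asarray(counts, dtype=float)
    lam = counts.sum(axis=1) / counts.sum()             # collection weights
    P = counts / counts.sum(axis=1, keepdims=True)      # proportions per collection
    sst = quadratic_entropy(lam @ P, D)                 # total diversity
    ssw = sum(l * quadratic_entropy(p, D) for l, p in zip(lam, P))
    return (sst - ssw) / sst

def permutation_test(counts, D, n_perm=999, seed=0):
    """Permute individuals among collections, keeping collection sizes fixed."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=int)
    r, S = counts.shape
    # One entry per individual: its category and the collection it belongs to.
    category = np.repeat(np.tile(np.arange(S), r), counts.ravel())
    collection = np.repeat(np.repeat(np.arange(r), S), counts.ravel())
    observed = ssb_over_sst(counts, D)
    n_extreme = 0
    for _ in range(n_perm):
        shuffled = rng.permutation(collection)
        permuted = np.zeros((r, S), dtype=int)
        np.add.at(permuted, (shuffled, category), 1)    # rebuild the abundance table
        if ssb_over_sst(permuted, D) >= observed - 1e-12:
            n_extreme += 1
    p_value = (n_extreme + 1) / (n_perm + 1)
    return observed, p_value

# Hypothetical example: two collections with no category in common.
D = np.ones((3, 3)) - np.eye(3)
counts = np.array([[20, 0, 0],
                   [0, 0, 20]])
observed, p_value = permutation_test(counts, D, n_perm=199)
```

Because the pooled composition is unchanged by the permutations, SST is constant across them, so ranking permutations by SSB/SST is equivalent to ranking them by SSB. The assumption of independently sampled individuals is restrictive for patchily distributed organisms.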

#### Two Factors

Few solutions have been proposed with more than one factor. For nested factors, solutions have been developed with permutational schemes (Excoffier, Smouse & Quattro 1992; Pavoine & Dolédec 2005).

Nayak (1986a) tackled the question of two crossed factors. Let *π*_{ij}, *π*_{i•}, *π*_{•j} and *π*_{••} be the unknown vectors of proportions associated with level *i* of factor *X*_{1} and level *j* of factor *X*_{2}, level *i* of factor *X*_{1}, level *j* of factor *X*_{2}, and the whole studied collection (e.g., a population, a community, a region), respectively (see *Evaluating the contributions of factors on values of* *H*_{D}–*ANOVA is generalised*). Considering multinomial distributions and the hypotheses H_{0}: *π*_{i•}=*π*_{••} ∀*i*; H_{0}: *π*_{•j}=*π*_{••} ∀*j*; and H_{0}: *π*_{ij}=*π*_{i•}+*π*_{•j}−*π*_{••} ∀*i*,*j*, Nayak (1983) found that the asymptotic distributions of *SS*(*X*_{1}), *SS*(*X*_{2}) and *SS*(*X*_{1}×*X*_{2}) under these respective null hypotheses depend on unknown parameters. However, he subsequently provided tests for the conditional effect of one factor given the other (*ρ*^{2}(2|1), see Eqn 8) (Nayak 1986a). For example, the test for a conditional effect of *X*_{2} given *X*_{1} corresponds to the null hypothesis H_{0}: *π*_{i1}=…=*π*_{is}, for all *i* = 1,…,*r*. Under H_{0}, the asymptotic distribution of *r*(*S*−1)(*n*_{•••}−*r*)*ρ*^{2}(2|1) can be approximated by (Nayak 1986a). This test might help to distinguish the roles of, for example, space vs. time or of two experimental treatments on the level of diversity.

Alternative solutions based on permutational testing were discussed by Legendre & Anderson (1999) in the context of the related db-RDA approach and the issues raised should be considered in developing permutational tests for crossed factors with ANOQE. A critical step is to decide whether the factors considered are fixed, random or mixed. For instance, AMOVA was developed only for nested, random factors.

#### Restrictions

A key challenge in the analysis of real data is that data need to satisfy all of the hypotheses on which tests are based. Most tests described in the last two sections are associated with restrictive hypotheses that the vectors of proportions should satisfy: for instance, independence among individual samplings, multinomial distributions and large samples (Nayak 1986a,b). In the context of nested ANOQE, Pavoine & Dolédec (2005) tackled the question of non-independence of sampled individuals. As organisms may be distributed patchily, sampling within a patch may lead to a high number of individuals from a single species being drawn simultaneously. Considering a single factor, Hardy (2008) introduced various tests that provide inferential solutions in the case of correlations between the vectors of proportions **p** and the matrix **D**, as well as in the case of spatial autocorrelation. It is therefore crucial that constraints related to field or experimental data in biology be acknowledged to allow for further development of the ANOQE framework.

### Conclusions

The quadratic entropy index (QE) was used here based on vectors of proportions (index *H*_{D}), and I have highlighted how this function is maximised and how it can be estimated from sampled data. QE provides a novel view of diversity and a powerful, rigorous statistical framework for analysing biodiversity. Within the ANOQE framework, QE has the potential to identify and test nested and crossed factors underlying biodiversity, while placing ANOQE in a geometric view reinforces its connections with ANOVA. The title ‘analysis of variance’, abbreviated ANOVA by Tukey, stems from the fact that variances are used to measure differences among means (Sokal & Rohlf 1995). Here, using the ANOQE approach, I have shown that quadratic entropies may be used to measure distances between centroids.

QE, ANOQE and related approaches have been developed several times in genetical and ecological research. They may be implemented in several software applications, which are presented in Appendix S4. Despite these developments, further progress still requires advances in test procedures, which thus far have been limited to nested factors or two crossed factors. Identification of questions raised by the differences between fixed, random and mixed factors is also required. Finally, for concrete application to experimental or observational data, these developments should be made in association with research from different biological fields, identifying needs for dealing with particular scenarios, such as broad definitions of proportions (e.g., biomass, densities), and spatial and temporal autocorrelations.

I hope to have highlighted that the QE and ANOQE frameworks could have a strong impact and applicability to any scientific field with a focus on the analysis of multivariate data.

### Acknowledgements

I would like to particularly thank Daniel Chessel for all of the discussions we had, for his availability during my Master's Degree and PhD research and for his continuous encouragement.

### References

- (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecology, 26, 32–46.
- (2006) Distance-based tests for homogeneity of multivariate dispersions. Biometrics, 62, 245–253.
- (2002) Measuring biological diversity using Euclidean metrics. Environmental and Ecological Statistics, 9, 167–177.
- (2010) Phylogenetic diversity measures based on Hill numbers. Philosophical Transactions of the Royal Society London Series B, 365, 3599–3609.
- (2008) Distance-based multi-sample tests for general multivariate data. Advances in Mathematical and Statistical Modeling (eds B.C. Arnold, N. Balakrishnan, J.-M. Sarabia & R. Minguez), pp. 61–70. Birkhäuser, Boston.
- (1992) Analysis of molecular variance inferred from metric distances among DNA haplotypes: application to human mitochondrial DNA restriction data. Genetics, 131, 479–491.
- (1992) Conservation evaluation and phylogenetic diversity. Biological Conservation, 61, 1–10.
- (1925) Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, UK.
- (1999) Analysis of distance for structured multivariate data and extensions to multivariate analysis of variance. Applied Statistics, 48, 505–519.
- (2008) Testing the spatial phylogenetic structure of local communities: statistical performances of different null models and test statistics on a locally neutral community. Journal of Ecology, 96, 914–926.
- (2007) Characterizing the phylogenetic structure of communities by an additive partitioning of phylogenetic diversity. Journal of Ecology, 95, 493–506.
- (1971) An expanded concept of “species diversity”. Notulae Naturae, 439, 1–6.
- ISI Web of Science (2011) Thomson Reuters. URL http://wokinfo.com/ [accessed 11 November 2011].
- (2011) New concentration measures as kinds of the quadratic entropy. Ecological Indicators, 11, 540–544.
- (2006) Entropy and diversity. Oikos, 113, 363–375.
- (1985) Characterization of Rao's quadratic entropies. Sankhyā, 47A, 295–309.
- (2010) Using plant functional traits to understand the landscape distribution of multiple ecosystem services. Journal of Ecology, 99, 135–147.
- (1999) Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs, 69, 1–24.
- (1971) An analysis of variance for categorical data. Journal of the American Statistical Association, 66, 534–544.
- (1991) Bootstrapping one way analysis of Rao's quadratic entropy. Communications in Statistics-Theory and Methods, 20, 1683–1703.
- (1995) Asymptotic distribution of statistics based on quadratic entropy and bootstrapping. Journal of Statistical Planning and Inference, 43, 1–18.
- (2004) Measuring Biological Diversity. Blackwell Publishing, Oxford, UK.
- (2001) Fitting multivariate models to community data: comment on distance-based redundancy analysis. Ecology, 82, 290–297.
- (1983) Applications of entropy functions in measurement and analysis of diversity. PhD thesis, University of Pittsburgh, Pittsburgh, PA.
- (1986a) An analysis of diversity using Rao's quadratic entropy. Sankhyā, 48B, 315–330.
- (1986b) Sampling distributions in analysis of diversity. Sankhyā, 48B, 1–9.
- (1979) Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences of the United States of America, 76, 5269–5273.
- (1982) Diversity as a concept and its measurement. Journal of the American Statistical Association, 77, 548–561.
- (2005) Méthodes statistiques pour la mesure de la biodiversité – Statistical methods for measuring biodiversity. PhD thesis, Université Lyon 1, Villeurbanne, France.
- (2007) New analysis for consistency among markers in the study of genetic diversity: development and application to the description of bacterial diversity. BMC Evolutionary Biology, 7, e156.
- (2009) Biological diversity: distinct distributions can lead to the maximization of Rao's quadratic entropy. Theoretical Population Biology, 75, 153–163.
- (2011) Measuring biodiversity to explain community assembly: a unified approach. Biological Reviews, 86, 792–812.
- (2005) The apportionment of quadratic entropy: a useful alternative for partitioning diversity in ecological data. Environmental and Ecological Statistics, 12, 125–138.
- (2004) From dissimilarities among species to dissimilarities among communities: a double principal coordinate analysis. Journal of Theoretical Biology, 228, 523–537.
- (2005a) Is the originality of a species measurable? Ecology Letters, 8, 579–586.
- (2005b) Measuring diversity from dissimilarities with Rao's quadratic entropy: are any dissimilarity indices suitable? Theoretical Population Biology, 67, 231–239.
- (2009) Hierarchical partitioning of evolutionary and ecological patterns in the organization of phylogenetically-structured species assemblages: application to rockfish (genus: *Sebastes*) in the Southern California Bight. Ecology Letters, 12, 898–908.
- (2010) Decomposition of trait diversity among the nodes of a phylogenetic tree. Ecological Monographs, 80, 485–507.
- (1982a) Diversity and dissimilarity coefficients: a unified approach. Theoretical Population Biology, 21, 24–43.
- (1982b) Diversity: its measurement, decomposition, apportionment and analysis. Sankhyā, A44, 1–22.
- (1982c) Gini–Simpson index of diversity: a characterization, generalization and applications. Utilitas Mathematics, 21, 273–282.
- (1986) Rao's axiomatization of diversity measures. Encyclopedia of Statistical Sciences, Vol. 7 (eds S. Kotz & N.L. Johnson), pp. 614–617. Wiley and Sons, New York.
- (2010) Quadratic entropy and analysis of diversity. Sankhyā, 72A, 70–80.
- (1985) Cross entropy, dissimilarity measures, and characterizations of Quadratic Entropy. IEEE Transactions on Information Theory, 31, 589–593.
- (2005) Additive partitioning of Rao's quadratic diversity: a hierarchical approach. Ecological Modelling, 183, 365–371.
- (2006) Towards a unifying approach to diversity measures: Bridging the gap between the Shannon entropy and Rao's quadratic index. Theoretical Population Biology, 70, 237–243.
- (2010) Incorporating functional dissimilarities into sample-based rarefaction curves: from taxon resampling to functional resampling. Journal of Vegetation Science, 21, 280–286.
- (1948) A mathematical theory of communication. Bell System Technical Journal, 27, 379–423.
- (2001) On the measurement of species diversity incorporating species differences. Oikos, 93, 135–147.
- (1949) Measurement of diversity. Nature, 163, 688.
- (1995) Biometry, 3rd edn. W.H. Freeman, New York.
- (1995) New ‘biodiversity’ measures reveal a decrease in taxonomic distinctness with increasing stress. Marine Ecology Progress Series, 129, 301–305.

### Supporting Information

**Appendix S1.** Some details on the history of the development of the function *H*_{D}.

**Appendix S2.** Proof for Proposition 1.

**Appendix S3.** Proof for Proposition 2.

**Appendix S4.** Implementation.

As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials may be re-organized for online delivery, but are not copy-edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

| Filename | Format | Size | Description |
|---|---|---|---|
| MEE3_181_sm_AppendixS1-S4.pdf | | 87K | Supporting info item |
