A new method for indicator species analysis in the framework of multivariate analysis of variance

For a set of plots that are grouped according to some external criteria, such as selected environmental variables or different experimental treatments, two relevant questions usually asked by community ecologists are: “Are these groups compositionally different from each other?” and “Which species contribute most to the compositional differences among the groups?” These questions are generally answered with different tools: the compositional dissimilarity among groups of plots is usually tested with dissimilaritybased multivariate analysis of variance (Pillar & Orlóci, 1996; Anderson, 2001), whereas the compositional characterization of the different groups is performed by means of indicator species analysis (Dufrêne & Legendre, 1997; Chytrý et al., 2002). Dissimilaritybased multivariate analysis of variance (dbMANOVA) is applicable to any type of compositional data, Received: 2 December 2020 | Revised: 24 February 2021 | Accepted: 4 March 2021 DOI: 10.1111/jvs.13013


RICOTTA eT Al.
irrespective of the number of species sampled and the way they are sampled (i.e., presence/absence scores, number of individuals, species cover, etc.), provided that a meaningful measure is used to adequately represent the dissimilarity between pairs of plots (or relevés, communities, assemblages, quadrats, sites, etc.).
Given a community composition matrix containing the presence/ absence or the abundance values x sj of S species (s = 1, 2, . . . , S) in N plots (j = 1, 2, . . . , N), let the number of plots in group k (k = 1, 2, . . . , K) be N k such that ∑ K k = 1 N k = N and d ij be the compositional dissimilarity between plot i and plot j. In the simplest situation of a single-factor analysis on a set of plots partitioned into K groups, the very essence of db-MANOVA is to compare the plot-to-plot dissimilarity within groups with the plot-to-plot dissimilarity among groups with an adequate test statistic (Figure 1). The larger the value of between-group dissimilarity compared to within-group dissimilarity, the more likely it is that the groups are compositionally different from each other (Anderson, 2001). For db-MANOVA, hypothesis testing is usually performed with randomization tests where the N plots are randomly permuted among the K groups, thus creating a reference distribution under the null hypothesis of no compositional difference between the different groups of plots.
Once the compositional differences among the groups of plots have been verified, the identification of diagnostic species for the different groups is a relevant step for their ecological characterization (Dufrêne & Legendre, 1997). In vegetation science, the concept of indicator or diagnostic species which we use here as synonyms has usually been associated with the species concentration in particular groups of plots. According to Chytrý et al. (2002), diagnostic species "include species which preferably occur in a single vegetation unit (character species) or in a few vegetation units (differential species)". Several methods have been developed to identify indicator species, (Dufrêne & Legendre, 1997;Chytrý et al., 2002;Podani & Csányi, 2010;Ricotta et al., 2015). Indicator species are fundamentally defined as those species that are more common in a given group of plots than expected by chance alone. Therefore, to determine if a given species is significantly associated with a target group of plots, most of these methods compare the actual abundance of that species in the target group of plots with a reference distribution obtained from a random null model in which the species occurrences or abundances are permuted among the N plots. The null hypothesis is that all plots have equal probability to host each species, irrespective of the species' environmental preferences (De Cáceres & Legendre, 2009).
Although dissimilarity-based MANOVA and indicator species analysis are apparently unrelated, in this paper we will show that for a particular group of dissimilarity measures the classical partitioning of variation used for hypothesis testing in one-factor db-MANOVA can be additively decomposed into species-level values, thus connecting indicator species analysis with multivariate analysis of variance.

| ME THODS
Several test statistics have been proposed for dissimilarity-based MANOVA. For review, see Pillar (2013) and references therein.
For dissimilarity matrices with Euclidean properties (see Gower & Legendre, 1986), Pillar and Orlóci (1996) proposed as test criterion the sum of squared dissimilarities between groups. Given a symmetric N × N dissimilarity matrix whose elements d ij represent the pairwise dissimilarities between plot i and j such that d ij = d ji and d ii = 0, the between-group sum of squared dissimilarities Q B is: where is the total sum of squared dissimilarities. Hence, to calculate Q T we have to add the N (N − 1) ∕2 plot-to-plot squared dissimilarities d 2 ij in the sub-diagonal half of the dissimilarity matrix and then divide by the total number of plots N. The term Q W is the sum of squares within the K groups, which is obtained as: where the sum of squares within group k is: ij ijk F I G U R E 1 Schematic example of a semimatrix of pairwise dissimilarities among nine plots clustered into two groups. The very essence of dissimilarity-based MANOVA is to compare the withingroup dissimilarity with the between-group dissimilarity with an adequate test statistic. The larger the value of between-group dissimilarity compared to within-group dissimilarity, the more likely it is that the groups are compositionally different from each other | 3 of 7

Journal of Vegetation Science
RICOTTA eT Al.
The indicator variable ijk takes the value 1 if plots i and j are both in group k. Otherwise it is 0. That is, Q Wk is obtained by adding all squared dissimilarities d 2 ij between all plots that occur in group k and then dividing the sum by the number of plots in that group N k .
The sum of squares could be equally computed from deviations of single plots from group centroids, as described, among others, by Edgington (1987) or Manly (1991), but using plot-to-plot dissimilarities does not require the computation of group centroids. McArdle and Anderson (2001) further showed that this partition of the total sum of squares is also valid for dissimilarity coefficients that do not have Euclidean properties, thus making dissimilarity-based MANOVA a very flexible component of the ecologist's toolbox.
Since the distribution of Q B under the null hypothesis of no compositional difference among the different groups of plots is generally unknown, a reference distribution of values of the test statistic is then obtained by Mantel randomization of the dissimilarity matrix.
For dissimilarity-based one-way MANOVA, this is equivalent to randomly permuting the N plots among the K groups. Note that in onefactor MANOVA, since the total sum of squares Q T is invariant over permutations, some test statistics, such as Q B , Q W and the Anderson ing. That is, for the same data and permutations they will lead to identical p-values (Pillar, 2013).
A desirable property for a dissimilarity index is its ability to be decomposed into species-level values (Ricotta & Podani, 2017). In this way, it is possible to see which species contribute most to plotto-plot dissimilarity. In this framework, the advantage of adopting Q W as a test statistic over Q B or Q B ∕Q W is that it can be decomposed into intuitively simple species-level values, provided that the squared dissimilarities d 2 ij are also decomposable in the same way. Take for example the Euclidean distance is additively decomposable into the species-level contributions . Therefore, in a sum-ofsquares-based test, if the plot-to-plot compositional dissimilarity is calculated with the Euclidean distance, the within-group sum of squares becomes: where the contribution of single species to overall Q W is: Geometrically, a smaller value of Q W denotes higher within-group concentration and larger between-group differences (Cai, 2006 where Obs (

| Data
We used data on plant communities along a primary succession on the foreland of the Rutor Glacier (northern Italy). The data were sampled by Caccianiga et al. (2006) and have been used in many previous papers for exploring community assembly rules along ecological gradients (e.g., Caccianiga et al., 2006;Ricotta et al., 2016;Ricotta et al., 2018). The data set can be found in Appendices S2 and S5 Based on the species abundances in each plot, we calculated the Euclidean distance between all pairs of plots. Next, using Q W and Q Ws as test statistics, we tested for overall and species-wise compositional differences among the three successional stages with dissimilarity-based MANOVA. All calculations were done with a new R function available in Appendices S3 and S6. Significance is computed by randomly permuting the 59 plots (data vectors) among the three successional stages with 9,999 replicates. For all species that showed significant compositional difference among the three groups of plots, we also tested for pairwise differences among all pairs of groups (9,999 permutations) using the same approach as described in Pillar and Orlóci (1996) for multivariate contrasts.

| RE SULTS
The overall compositional differences among the three groups of plots and between all pairs of groups proved highly significant with the minimum possible p-value for 9,999 randomizations (p = 0.0001) showing that the three successional stages are compositionally well distinguishable. The results of the single-species MANOVA are shown in Figure 2, together with the species abundances in each group and the contributions of single species to the compositional differences among the groups (SES s ).
From a total of 45 species, we identified 32 with a clear preference for certain successional stages (p < 0.05). It therefore appears that more than two thirds of the species sampled have rather narrow ecological requirements allowing them to colonize only the most ecologically suitable habitats along the primary succession. Caccianiga et al. (2006) showed that the establishment of the first pioneer species is associated with random dispersal mechanisms that drive the colonization of the glacial deposits by early-successional ruderal forbs, such as Cerastium uniflorum, Oxyria digyna, or Tussilago farfara, whereas the late-successional stages are preferentially colonized by stress-tolerant graminoids (sensu Grime, 1974), such as as Achillea moschata or Trifolium pallescens, which are among the species that contribute most to the compositional differences among the successional stages ( Figure 2).

| D ISCUSS I ON
In this paper, we showed that when we test for compositional differences among two or more groups of plots with one-factor db-MANOVA, the within-group sum of squares Q W can be additively decomposed into species-level values, provided that the squared dissimilarities d 2 ij can also be additively decomposed into their specieslevel contributions d 2 ijs such that d 2 ij = ∑ S s = 1 d 2

ijs
. In the remainder we will call a dissimilarity coefficient d ij that conforms to this property 'squared decomposable' (S-decomposable).
The species that contribute most to the compositional differences among the groups can be appropriately called indicator species as they are preferentially concentrated in particular groups of plots. Note however that, while classical indicator species analysis usually identifies indicators of single groups of plots, with the proposed method some species may be related to more than one group. By taking into account combinations of groups of plots and adopting a multiple-contrast approach (Pillar & Orlóci, 1996) A desirable aspect of this method for identifying indicator species is that it is based on the same dissimilarity measure used for multivariate analysis of variance (see Pavoine et al., 2013). This puts indicator species analysis and multivariate analysis of variance under the same mathematical umbrella. Regarding the most appropriate dissimilarity coefficient for db-MANOVA, being S-decomposable, the Euclidean distance appears to be a natural choice. Likewise, all Euclidean distances computed on transformed species abundances, such as the Chord or the Hellinger distance, may be equally adequate (see Legendre & Gallagher, 2001). Note that, according to Pillar (2013), a comprehensive evaluation of the power and accuracy of one-factor db-MANOVA showed that the Euclidean distance gave better results compared to other dissimilarity coefficients in spite of its well-known limitations for summarizing compositional dissimilarity among plots (see Orlóci, 1978;Legendre & Gallagher, 2001).
However, the same species-level decomposition can also be extended to virtually any dissimilarity measure of the form √ ij provided that ij is additively decomposable (A-decomposable) such that . Note however that since the Bray-Curtis dissimilarity summarizes the difference in species abundances between plots i and j compared to the total species abundance in both plots (see Ricotta & Podani, 2017), the contribution of each species depends on the values of other species. This is conceptually distinct from classical indicator species analysis which is usually species-specific.

F I G U R E 2
Results of the distance-based MANOVA for the 45 species of the Rutor data set. The species abundances in each group of plots on a five-point ordinal scale and the contribution of each species to the compositional differences among the groups (SES s ) are also shown. The species showing significant compositional differences among the three groups of plots (p < 0.05; 9,999 permutations) are marked with an asterisk. For those species, we also tested for significant pairwise differences among all pairs of groups. For each species, letters indicate individually significant tests (p < 0.05; 9,999 permutations) after correction for multiple tests according to Benjamini and Yekutieli (2001) RICOTTA eT Al.
If the square root of an A-decomposable dissimilarity coefficient √ ij is used in one-factor db-MANOVA, we obtain: Therefore, dealing with A-decomposable dissimilarities, Equation 9 tells us that the within-group sum of squares of any dissimilarity measure of the form √ ij can be additively decomposed into species-level values. At the same time, the term Q W in Equation   9 can also be interpreted as the within-group sum of non-squared A-decomposable dissimilarities ij . As such, Q W is a special case of the "average dissimilarity within groups" used by Mielke and Berry (2001) in multi-response permutation procedures (MRPP), which is a class of multivariate permutation methods for testing for compositional differences among a-priori-defined groups of plots.
This makes the differences between randomization tests based on squared and non-squared dissimilarities much fuzzier than one might think. A short overview of MRPP and its relationship with db-MANOVA is shown in Appendix S4.
Although randomization tests based on A-decomposable nonsquared dissimilarities have been rarely used in ecology (partly because they are not included in common statistical packages), they can be equally adopted in one-factor designs for testing for compositional differences among a-priori-defined groups of plots and for identifying the indicator species that contribute most to the compositional differences among the groups. This allows to expand the number of suitable dissimilarities to the whole arsenal of A-decomposable coefficients already available in the ecological literature (Legendre & De Cáceres, 2013). Also, the partitioning of other test statistics could be evaluated, such as the between-groups sum of squares (Q B ) defined in Equation 1. In this case, the proposed approach could be further extended to multi-factor group comparisons (see Pillar & Orlóci, 1996), when it may be relevant to identify indicator species of each group considering the factors separately and their interactions.
Which of these dissimilarities will show the best performance in statistical and biological terms (i.e., accuracy and power vs biological interpretability)? Can the proposed approach be further extended to functional and phylogenetic dissimilarities, such as the large class of Adecomposable coefficients introduced by Pavoine and Ricotta (2019)? These are relevant questions, and their answers can help shed light on the effects of ecological processes on community composition.

CO N FLI C T S O F I NTE R E S T
The authors declare no conflicts of interest.

AUTH O R CO NTR I B UTI O N S
CR conceived the idea. CR, SP and VP developed the methodology.
CR, SP and BC analyzed the data. SP wrote the R script. CR took the lead in writing the main text; all authors revised the manuscript critically and approved the final version.

DATA AVA I L A B I L I T Y S TAT E M E N T
Data on the alpine vegetation used in the worked example are available in Appendices S2 and S5; the R scripts for the dissimilaritybased MANOVA decomposition are available in Appendices S3 and S6. (9)