Aim: Introduction of a novel approach to the classification of vegetation data (species by plot matrices). This approach copes with a large amount of noise, groups irregularly shaped in attribute space and species turnover within groups.
Method: The proposed algorithm (Isopam) is based on the classification of ordination scores from isometric feature mapping. Ordination and classification are repeated in a search for either high overall fidelity of species to groups of sites, or high quantity and quality of indicator species for groups of sites. The classification is performed either as a hierarchical, divisive method or as non-hierarchical partitioning. In divisive clustering, resulting groups are subdivided until a stopping criterion is met. Isopam was tested on 20 real-world data sets. The resulting classifications were compared with solutions from eight widely used clustering algorithms.
Results: When looking at the significance of species fidelities to groups of sites, and at quantity and quality of indicator species, Isopam often achieved high ranks as compared with other algorithms.
Finding patterns in a more or less noisy, fuzzy and moving target like plant species composition is a persistent challenge in vegetation science. Among other approaches, classification techniques are important tools for meeting this challenge. Many objective numerical clustering techniques have become available over the last decades. However, their outputs frequently appear to be unsatisfactory to field ecologists. The derived classes often do not match expert guesses regarding ecologically meaningful partitions. This mismatch may be an explanation for the fact that handcrafted vegetation tables frequently continue to be published, even in times of teraflop computing.
The reasons for the differences between expert knowledge and results from clustering techniques might include inappropriate algorithms (Legendre & Legendre 1998). Three further reasons can be identified in the data themselves and in the way algorithms deal with the data. (1) Species occurrences are not equally informative, and ecologically meaningful information is hidden within a large amount of irrelevant variation, hereafter referred to as noise (Gauch & Whittaker 1981). Here, information is considered irrelevant if it does not contribute to understanding structure, i.e. the organization of objects along gradients or in groups (Legendre & Legendre 1998). Signal detection is a strength of the human brain, which cannot be said for all clustering algorithms. (2) Expert groupings may be irregularly shaped in attribute space, while most numerical algorithms search for spherical or other regular geometries. (3) Groups are made up of a shifting set of typical species. One and the same plant community inhabiting a specific site type can exhibit considerable species turnover in the geographic domain, as described by Gleason (1926). This problem grows with increasing geographic distance.
Summarizing points (1) to (3), vegetation shows nonlinear behaviour, with structure emerging from background noise at unexpected spots. How do non-numerical approaches deal with this behaviour? Traditional phytosociology, for example, relies on the fidelity of species to groups of sites and on the occurrence of selected, typical (“diagnostic”) species as a classification aid (Ellenberg & Müller-Dombois 1974). This procedure results in groups with irregular shapes in attribute space. Species turnover within groups is taken into account by introducing “local diagnostic” species. Another factor, which is not a specifically data-related problem, may also contribute heavily to the above-mentioned mismatch: strictly data-driven classifications are often not, or only partially, transferable to new data sets. In contrast, expert classifications often aim to approach general solutions taking into account previous work.
The Twinspan algorithm (two-way indicator species analysis, Hill 1979) mimics the indicator or “diagnostic” species approach. It is based on the partitioning of ordination axes resulting from reciprocal averaging. This partitioning is followed by an adjustment of group assignments based on indicator species. The latter step introduces a desirable irregular shape of clusters. Twinspan has been (probably overly) criticized, mainly due to the fact that the existence of a single dominant, conspicuous gradient is assumed in partitioning. Furthermore, the adjustment of group assignments adds considerable complexity (McCune & Grace 2002).
We examined whether there are other ways to address the disorder in vegetation in an objective, unsupervised, data-driven way, bearing in mind the sometimes nonlinear data structure and taking advantage of the methodological experience gained before the advent of the computer age. The current paper introduces a novel approach to classification called Isopam (isometric feature mapping and partitioning around medoids). This partitional and optionally hierarchical method relies on existing recent methods of nonlinear dimensionality reduction and on parameter optimization based on species occurrences.
Figure 1 provides a simplified overview of the Isopam processing chain. The proposed algorithm is based on ordination space partitioning. For ordination, Isomap (isometric feature mapping, Tenenbaum et al. 2000) as implemented in the R package Vegan is used (version 1.15-2, J. Oksanen). PAM (partitioning around medoids, Kaufman & Rousseeuw 1990) as implemented in the R package Cluster was drawn on for partitioning (version 1.11.13, M. Maechler). Both ordination and partitioning are repeated in an exhaustive search for partitions with either high overall fidelities of species to groups of sites or high quantity and quality of indicator species. These indicator species are found in a data-driven way. The search for good partitions includes an optional assessment of the most advantageous number of groups. If used as a hierarchical clustering algorithm, the resulting groups are subdivided using the same approach until a predefined stopping criterion is met. Isopam is a function in R (R Development Core Team 2009) and can be downloaded and installed, along with the accompanying package, from within R (http://cran.r-project.org/). Users unfamiliar with R can select the function with some of the more important options from the Windows-based program Juice (Tichý 2002, available at http://www.sci.muni.cz/botany/juice/).
Ordination can help to reveal essential underlying structures in vegetation data. Ordination space partitioning may thus be beneficial for finding ecologically meaningful groups. In the present paper, Isomap is proposed for dimensionality reduction. This extension of a technique published by Glenn De'ath (1999) is not based on straight dissimilarity measures between all plots, as in other ordination methods. Instead, it measures dissimilarities only between close neighbours (similar sampling units), resulting in a net-like construction. The dissimilarity between more distant (dissimilar) sampling units is measured along the strings of this net, using the sampling units in between as stepping-stones and following the shortest path (Williamson 1978). The implicit assumption is that meaningful dissimilarity measures exist between observations that do not have any species in common, as long as they are connected by mediating observations. Mediated dissimilarity is especially useful in cases of high species turnover (beta diversity). The use of these geodesic distances yields a consistently strong correlation between ecological distance and dissimilarity in ordination space (De'ath 1999) and addresses the above-mentioned problem of species turnover within groups. Metric multidimensional scaling (principal coordinate analysis) of the resulting dissimilarity matrix, as the final step in Isomap, allows for a low-dimensional representation of these dissimilarities.
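The mediated dissimilarities described above can be illustrated with a minimal sketch. It is not the Vegan implementation used in this study: the function names, the toy data and the choice of Bray-Curtis dissimilarity with a k-nearest-neighbour net are assumptions made for illustration, and the final metric scaling step is omitted.

```python
import math

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0

def geodesic_distances(plots, k):
    """Return (direct, geodesic) dissimilarity matrices for a list of plots.

    Each plot is linked to its k most similar plots; dissimilarities between
    all other pairs are accumulated along the shortest path through this net.
    """
    n = len(plots)
    direct = [[bray_curtis(plots[i], plots[j]) for j in range(n)] for i in range(n)]
    geo = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in sorted(others, key=lambda j: direct[i][j])[:k]:
            geo[i][j] = geo[j][i] = direct[i][j]  # undirected link in the net
    # Floyd-Warshall shortest paths along the strings of the net
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = geo[i][m] + geo[m][j]
                if via < geo[i][j]:
                    geo[i][j] = via
    return direct, geo

# Hypothetical gradient with complete species turnover between its ends
plots = [
    [4, 2, 0, 0, 0],
    [2, 4, 2, 0, 0],
    [0, 2, 4, 2, 0],
    [0, 0, 2, 4, 2],
    [0, 0, 0, 2, 4],
]
direct, geo = geodesic_distances(plots, k=2)
print(direct[0][3], direct[0][4])  # both saturate at the maximum
print(geo[0][3], geo[0][4])        # mediated values still differ
```

In this toy example, the first plot shares no species with the last two, so the direct dissimilarity is at its maximum in both cases, whereas the path-based values still order the plots along the gradient. With k = 1 the net falls apart into unconnected pieces and some path lengths stay infinite, which is the situation in which a larger minimum k is needed.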
Compared with other, more frequently used ordination methods, Isomap is not limited to analyses of short gradients (as is principal component analysis, Pearson 1901) and is less affected by artifacts than detrended correspondence analysis (Hill & Gauch 1980). Like non-metric multidimensional scaling (NMDS, Kruskal 1964), it is able to cope with non-linear features and linearizes the fit between species data and environmental variation (Beals 1984; De'ath 1999). Unlike NMDS, it handles large amounts of data and is not prone to local minimum traps (McCune & Grace 2002). The mediated dissimilarities are Isomap's strength but also its major pitfall, since the resulting patterns are highly dependent on how the network of observations is constructed. Each observation is linked to the most similar observations; these are, in turn, linked to similar observations, and so on. The number of connected neighbours can be limited with a similarity threshold. Alternatively, a certain number of nearest neighbours (k) is considered. Isopam internally searches for an optimum k. As k changes, different views on data structure come to light. There is no “best” number for k, since all solutions reflect reality in some way. These properties of Isomap can be exploited in various ways. For example, the parameterization can be optimized for maximum fit to the original dissimilarities; although this is in some way a step back, it may be seen as a valid replacement of NMDS. In the current study, parameters are optimized for maximum discrimination of vegetation units, either in terms of overall fidelity of species to groups of sites, or in terms of indicator species. The number k used to construct the final neighbourhood matrix is found in the brute-force approach described below.
Isomap is still rarely used in ecology, but has already been shown to be useful for pattern detection. For example, Mahecha & Schmidtlein (2008) employed Isomap to disentangle the enormous attribute space of a complete national floristic survey and revealed structures of sampling bias: borders between administrative units led to hiatuses in the Isomap ordination space. These hiatuses were mainly caused by relatively small numbers of species with similar distribution. These similarities were an artifact and based on different taxonomic concepts. However, it is remarkable that these “species” caused structure due to their joint occurrence, even if they contributed little to overall variation. In phytosociology, the process of delineating structure is similar: few species with conspicuously similar distribution are used as indicators for underlying conditions.
The next step is to partition the ordination space. Partitioning around medoids (PAM, Kaufman & Rousseeuw 1990) was chosen as a suitable clustering algorithm. PAM creates spheroid clusters in attribute space. In our case, clustering takes place along Isomap ordination axes (an expedient distortion of conventional dissimilarity patterns). The clusters are therefore irregular in terms of conventional dissimilarity measures. PAM operates on a distance matrix (in this case, distances in the coordinate system spanned by Isomap axes) and searches for a predefined number of medoids. Each medoid is centered in a group of observations in a way that minimizes dissimilarity to the other group members. Unlike means, medoids are always real observations. PAM is similar to the more widely used k-means algorithm but tends to be more robust (Kaufman & Rousseeuw 1990).
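The medoid search can be sketched as follows. This is a simplified illustration and not the routine from the R package Cluster used in this study: the function names are hypothetical, the initial medoids are chosen greedily, and a swap is accepted as soon as it lowers the total dissimilarity, whereas the original algorithm evaluates all swaps before choosing one.

```python
def total_cost(dist, medoids):
    """Sum of dissimilarities between each observation and its closest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))

def pam(dist, n_clusters):
    """Simplified partitioning around medoids on a square distance matrix."""
    n = len(dist)
    # Build phase: greedily add the observation that lowers the cost the most
    medoids = []
    while len(medoids) < n_clusters:
        candidates = [c for c in range(n) if c not in medoids]
        medoids.append(min(candidates, key=lambda c: total_cost(dist, medoids + [c])))
    # Swap phase: replace a medoid by a non-medoid whenever the cost decreases
    improved = True
    while improved:
        improved = False
        current = total_cost(dist, medoids)
        for pos in range(n_clusters):
            for c in range(n):
                if c in medoids:
                    continue
                trial = medoids[:pos] + [c] + medoids[pos + 1:]
                cost = total_cost(dist, trial)
                if cost < current - 1e-12:
                    medoids, current, improved = trial, cost, True
    # Assign every observation to its closest medoid
    labels = [min(range(n_clusters), key=lambda g: dist[i][medoids[g]]) for i in range(n)]
    return medoids, labels

# Hypothetical scores on a single ordination axis
scores = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
dist = [[abs(a - b) for b in scores] for a in scores]
medoids, labels = pam(dist, 2)
print(sorted(medoids), labels)
```

The returned medoids are indices of real observations, which is the property that distinguishes them from the cluster means used in k-means.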
Based on a brute-force approach, three parameters are optimized: the number of clusters used by PAM, Isomap k (the number of neighbours used in Isomap) and the number of Isomap dimensions to be taken into account for clustering. Here, brute force (or exhaustive search) means that all possible clustering solutions out of a predefined range of parameter settings are examined in order to determine an optimum partition with respect to defined criteria. The upper limits of the ranges of parameter values are user-defined (the defaults are suggestions based on the observed range of solutions). The lower limits are all set to two (if necessary, the minimum k is increased for connecting all sampling units in a single network). A loop written in R (R Development Core Team 2009) allows for testing possible combinations of the three parameters.
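The exhaustive search over the three parameters can be pictured as a nested loop. The sketch below is written in Python rather than R and is not the Isopam code: the function name, the argument names and the toy scoring function are hypothetical, and in practice the score would come from running the ordination and the partitioning and then evaluating the optimality criterion.

```python
from itertools import product

def exhaustive_search(score, max_clusters, max_k, max_dims, min_k=2):
    """Evaluate every parameter combination and keep the best-scoring one.

    `score` is any callable taking (clusters, k, dims) and returning a number;
    lower limits are two, except that `min_k` may be raised if a larger
    neighbourhood is needed to connect all sampling units.
    """
    best = None
    for clusters, k, dims in product(range(2, max_clusters + 1),
                                     range(min_k, max_k + 1),
                                     range(2, max_dims + 1)):
        value = score(clusters, k, dims)
        if best is None or value > best[0]:
            best = (value, {"clusters": clusters, "k": k, "dims": dims})
    return best

# Toy criterion with a known optimum, standing in for the real evaluation
toy_score = lambda clusters, k, dims: -((clusters - 3) ** 2 + (k - 5) ** 2 + (dims - 2) ** 2)
print(exhaustive_search(toy_score, max_clusters=6, max_k=8, max_dims=4))
```

The number of evaluations is the product of the three range lengths, which is why the upper limits largely determine the run time.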
The default optimality criterion in Isopam is currently derived from the G statistic (Sokal & Rohlf 1995). G values are computed for all species, adjusted using Williams' correction (Williams 1976), and standardized (Botta-Dukát et al. 2005). Standardization is necessary to correct for the influence of the number of clusters. The resulting value GS is a measure of a species' overall capacity to distinguish clusters in a given partition (that is, the species' fidelity). In optimization, either the mean GS value of all species is used or, in order to make use of the discriminating power of indicator species, only species exceeding a customizable threshold of GS are considered. In the latter case, the product of the resulting number of indicator species and their averaged GS value is used as the optimality criterion (equations in the Supporting Information, Appendix S2). GS could easily be replaced by other statistics, e.g. by a criterion derived from Fisher's exact test (Fisher 1922; Tichý et al. 2010).
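The following sketch shows how such a value could be computed for one species from its presences in each cluster. It is an illustration under stated assumptions and not the Isopam code: the function name is hypothetical, the G statistic is taken for a 2 × k presence/absence table, Williams' correction follows the usual formula for tests of independence, and the standardization is assumed to subtract the expected value (the degrees of freedom, k − 1) and divide by the expected standard deviation, sqrt(2(k − 1)). The exact equations used in Isopam are those given in the appendix.

```python
import math

def standardized_g(present, sizes):
    """Adjusted and standardized G for one species.

    present[j] is the number of plots in cluster j that contain the species,
    sizes[j] is the total number of plots in cluster j.
    """
    n = sum(sizes)
    k = len(sizes)
    df = k - 1
    rows = [sum(present), n - sum(present)]  # presences and absences overall
    if rows[0] == 0 or rows[1] == 0:
        g_adj = 0.0  # species in all plots or in none: no information
    else:
        table = [list(present), [s - p for s, p in zip(sizes, present)]]
        g = 0.0
        for i in range(2):
            for j in range(k):
                observed = table[i][j]
                if observed > 0:  # empty cells contribute nothing
                    expected = rows[i] * sizes[j] / n
                    g += 2.0 * observed * math.log(observed / expected)
        # Williams' correction for a table with 2 rows and k columns
        q = 1.0 + ((n * sum(1.0 / r for r in rows) - 1.0)
                   * (n * sum(1.0 / s for s in sizes) - 1.0)) / (6.0 * n * df)
        g_adj = g / q
    return (g_adj - df) / math.sqrt(2.0 * df)

# A species confined to one of two clusters versus an evenly spread species
print(standardized_g([10, 0], [10, 10]))
print(standardized_g([5, 5], [10, 10]))
```

Under these assumptions, the indicator-based criterion would be obtained by counting the species whose value exceeds the chosen threshold and multiplying that count by their mean value.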
In hierarchical mode, the algorithm continues dividing groups until the number of indicator species exceeding a defined limit of GS falls below a defined number per group, or until groups become too small. Three members (sites) provide the minimum group size necessary for subdivision. Relaxing the customizable stopping criteria leads to more levels of clustering. Details are given in the R documentation accompanying the program package.
Isopam was tested on 20 real-world examples of species by plot matrices (Table 1) with different sizes and structures (Supporting Information, Appendix S1). The resulting partitions were compared with solutions from eight widely used clustering algorithms. The comparisons were based on fixed cluster numbers (2, 3, 4, 5 and 6 clusters, disregarding the numbers suggested by Isopam in order to avoid bias).
Table 1. Data sets used for the performance test; more details on data sets and clustering parameters are given in the Supporting Information, Appendices S1 and S2; * Data accessible via VegBank (http://www.vegbank.org); ** Data accessible via VegetWeb (Ewald 2005); *** Add-ons in R packages: N as “bryceveg” in labdsv (version 1.3-1, D. W. Roberts), F as “vegtf” in ade4 (version 1.4-11, Dray & Thioulouse 2007).
For evaluation, we opted for measures that take species distributions across clusters as a starting point (Aho et al. 2008). Species-based approaches are an intuitive choice when plant species composition is the target (Dale 1991). In the present paper, six evaluators were used for comparison (Table 2; equations in the Supporting Information, Appendix S3). All evaluators used are non-geometric (Aho et al. 2008): they test for high fidelity of species to clusters instead of looking at the overall compactness of clusters. Three evaluators take into account quantity and quality of indicator species (isa.ind, fsh.ind, phi.ind), which is also the default in Isopam. Three additional evaluators express cluster separation considering all species (isa.all, fsh.all, phi.all). The scores resulting from tests on 20 data sets and five clustering levels differed in their order of magnitude. For the comparison of algorithm performances, all results were therefore transformed into ranks, using averaged ties. The distribution of achieved ranks was visualized with box plots (Fig. 2).
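The rank transformation with averaged ties can be sketched as below. The function name is hypothetical, and the direction of ranking (a higher score receives a higher rank) is an assumption made for this illustration.

```python
def average_ranks(scores):
    """Rank scores from lowest (rank 1) to highest; tied scores share their mean rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next score equals the current one
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the block
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks

# Hypothetical evaluator scores of four algorithms on one data set
print(average_ranks([0.2, 0.9, 0.9, 0.5]))  # [1.0, 3.5, 3.5, 2.0]
```

Because ranks are computed separately for each evaluator, data set and clustering level, scores of very different magnitude become comparable.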
Table 2. Evaluators for the comparison of clustering algorithms. a–c are measures based on quality and quantity of indicator species, d–f are measures based on the distribution of all species across clusters. The equations are given in the Supporting Information, Appendix S3.
a. isa.ind: Number of species with P ≤ 0.05 in indicator species analysis (ISA, Dufrêne & Legendre 1997), multiplied by the negative logarithm of their median P
b. fsh.ind: Number of species with P ≤ 0.05 in Fisher's exact test, multiplied by the negative logarithm of their median P
c. phi.ind: Number of species with standardized phi > 0.35 (Tichý & Chytrý 2006), multiplied by their mean maximum phi
d. isa.all: Median maximum indicator value of species in ISA
e. fsh.all: Negative logarithm of the median P of all species in Fisher's exact test
f. phi.all: Mean maximum standardized phi of species
Eight clustering algorithms were compared to Isopam, including the two non-hierarchical methods k-means and PAM. All algorithms, with the exception of Twinspan, were based on Bray-Curtis distances. PAM is the algorithm used in Isopam and was briefly described above. k-means was computed with the algorithm proposed by Hartigan & Wong (1979) using the principal coordinates of a square-root-transformed Bray-Curtis distance space as a starting point (Legendre & Legendre 1998). k-means was run with 100 starting configurations. The final solution was selected based on the Calinski & Harabasz (1974) criterion. As for hierarchical methods, five widely used agglomerative algorithms were applied: single linkage (or nearest neighbour) clustering, complete linkage (or farthest neighbour) clustering, average linkage clustering, flexible beta clustering (with beta −0.25) and Ward's incremental sum of squares clustering. All of these, except flexible beta, were computed using the standard settings in hclust (R Development Core Team 2009). Flexible beta (−0.25) was computed using agnes (R package Cluster, version 1.11.13, M. Maechler). As a divisive method, Twinspan (Hill 1979) was applied, using the modified version proposed by Roleček et al. (2009). In contrast to the original code, this version is not constrained to cluster numbers of 2, 4, 8, etc. Pseudo-species cut levels in Twinspan were always set to 0, 0.1, 0.2 and 0.4 times the maximum possible cover value. The minimum group size was set to 2, and Whittaker's beta (Whittaker 1960) was used to determine cluster heterogeneity. Beta is a measure required by the modified Twinspan algorithm in the search for the next group to be divided. Isopam was run with unlimited k, fixed numbers of clusters and otherwise default settings. Optimization was based on a search for high quantity and quality of indicator species.
The 20 data sets used for testing include habitats from North America, Central America, Europe and South and Central Asia. All data sets are described in the literature, and many of them are freely available, either from online databases or as add-ons in R packages. Even if some geographic and ecological scattering was intentional, the selection was mainly determined by data access and availability of published descriptions.
Parameters of the resulting Isopam partitions (Isomap k and dimensions, number of indicators used) are given in the Supporting Information, Appendix S2. Isopam produced clusters that were more balanced in size than those produced by other algorithms. Thus, small outlier clusters as observed with single linkage or average linkage were not an issue (Fig. 3). The box plots in Fig. 2 summarize results for the six evaluations (indicator-based evaluations in the left column and evaluations based on all species in the right column). Boxes and whiskers represent the distributions of performance ranks achieved by the algorithms for 20 data sets. Some marked differences in median ranks could be observed, with similar results in evaluations based on indicator species analysis (isa.ind, isa.all) and Fisher's exact test (fsh.ind, fsh.all). Both gave low scores to solutions with unbalanced cluster sizes and small outlier clusters as produced by single and average linkage clustering (si and av in Fig. 2; see also Fig. 3). In contrast, evaluators based on the phi coefficient of association (phi.ind and phi.all) tended to reward the occurrence of small outlier clusters (see also Chytrý et al. 2002). Accordingly, results for single and average linkage clustering were inconsistent, with high scores in phi-based evaluations and low scores in the others. With isa.ind, isa.all, fsh.ind and fsh.all, high ranks were reached more frequently with Isopam than with other algorithms. Tests based on indicator species and corresponding tests based on all species led to similar rankings. Clustering success of algorithms depended in part on the data used. We refrain from interpreting outcomes for single tables because a sample of 20 data sets is not large enough to draw general conclusions.
Isopam is a powerful tool when groups with many good indicator species and high overall fidelities of species to clusters are needed. What are the potential reasons for the observed performance? Three data-related problems generally hamper the use of clustering algorithms by vegetation scientists: a bad “signal to noise ratio,” irregularly shaped groups in attribute space and a shifting set of typical species. The first issue is addressed by adopting Isomap as a starting point for the partitioning procedure. Isomap collects most of the variation and therefore most of the signal or structure in a low-dimensional attribute space. Accordingly, the number of dimensions used for the best partition is usually low. In practice, more than six dimensions are rarely used (see Supporting Information, Appendix S2).
Rankings based on indicator species and rankings based on all species (Fig. 2) were similar. The optimization of indicators thus leads to good solutions for most species. For Isopam, the contrary is also true: good solutions regarding all species tend to be good solutions regarding indicators. Nevertheless, there are differences between clustering solutions obtained with indicators and without them. It is up to the interpreter to determine which solution is more useful to understand the data structure.
The proposed number of clusters – if this optional feature is used – depends on the optimality criterion and certainly does not represent the ultimate truth. GS values tend to decrease monotonically with increasing numbers of clusters, while indicator numbers tend to show a hump-backed behaviour. In this study, the optimality criterion, which is the product of both, usually peaked at low numbers of clusters (Appendix S2). The test criterion fsh.ind, which is based on Fisher's exact test, leads to even more conservative estimates when used as optimality criterion (data not shown). A replacement of the adjusted and standardized G statistic (GS) by other optimality criteria is worth exploring.
We consider the sensitivity of Isomap to k as its strength. An assumption underlying Isopam is that Isomap solutions represent legitimate views on data structure, each revealing different aspects. This versatile feature of Isomap is used for finding meaningful partitions according to the defined criteria (knowing that there are many other possible solutions when other criteria are used). Even if all Isomap solutions represent legitimate views, this does not mean that all are useful. Sometimes, with k too large, the net of neighbours may connect groups of observations that should remain separate (short-circuits). If this is the case, it is unlikely that many good indicators are found. Such solutions are most likely skipped in the iterative search for a good clustering solution.
The irregular shape of groups conflicts with most clustering algorithms that opt for spherical and other regular shapes. This is taken into account by the underlying dissimilarity measure in Isomap. Here, dissimilarity is measured along the strings of a net of neighbourhoods. PAM creates spherical clusters in the Isomap attribute space. However, re-projected to a conventional dissimilarity space, any shape may emerge. Measuring distances along the strings of a net of neighbours is also a good means to keep track of shifts in species composition: even a complete species turnover does not prevent an ecologically meaningful dissimilarity measure.
As is the case for all data-driven clustering algorithms, Isopam suffers from a fundamental constraint: the resulting clusters are fitted to the data in use and they are highly dependent on sampling (Bruelheide 2000). As an alternative, Isopam could be easily transformed into an expert system by fixing the indicator species sets or by fixing medoids in PAM. The first of these options is implemented.
The comparison with other clustering algorithms was based on a relatively small range of evaluators. External tests in particular could provide additional insights. For instance, the separability of groups based on available environmental data could be tested. The drawbacks are that the ecological relevance of available measured variables is never guaranteed and that vegetation reacts with time lags to environmental change. Other criteria could address the robustness of clustering solutions across random subsamples of data sets. Further internal tests are proposed by Aho et al. (2008). The data necessary for such tests are becoming increasingly available; databases like VegBank or VegetWeb are invaluable for fostering methodological progress (Ewald 2005; Bekker et al. 2007).
The Isopam search for the best partition according to defined criteria represents a constrained brute-force approach. It is constrained by limiting the possible choices of partition to solutions from the Isomap–PAM processing chain. In theory, full brute force with an evaluation of all possible partitions would be preferable. However, the number of possible partitions (Bell number) is large, even for small data sets (e.g. 115 975 possibilities for 10 sample units). Limitation is currently necessary. Isopam is modular and hence open to modifications. Other ordinations in place of Isomap and other partitioning procedures besides PAM can be tested, and the same applies to the in-built optimality criterion. We hope that the presented results will stimulate more research along these lines.
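The number of possible partitions quoted above can be checked with a short calculation based on the Bell triangle. The function name is hypothetical; the recurrence itself is standard.

```python
def bell_number(n):
    """Number of ways to partition n sample units into non-empty groups."""
    row = [1]  # first row of the Bell triangle
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]

print(bell_number(10))  # 115975
```

The count grows very quickly with the number of sample units, which illustrates why an evaluation of all possible partitions is not feasible for data sets of realistic size.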
Isopam code and documentation can be downloaded from within R or via the web interface of the CRAN network (http://cran.r-project.org/). Isopam with reduced options can also be used within the framework of Juice (http://www.sci.muni.cz/botany/juice/). Several parameters are adjustable, although the default settings should work well in most cases. Parameters define whether clustering should be hierarchical or not, whether optimization will be based on all species or indicator species, and whether optimizations should include the number of clusters. A cluster number can be defined in partitional mode. Other settings determine the thresholds for indicator definition or predefined indicator species if Isopam is used as an expert system. The R package provides ordered synoptic tables summarizing the frequency of species in groups along with the significance of their association to clusters, a plot function for dendrograms, and group medoids from PAM (the most typical plots per group in terms of overall species composition).
In the present paper, a clustering algorithm is proposed that explicitly addresses several problems inherent to vegetation data. Isopam often outperformed other algorithms in terms of quantity and quality of indicator species per group. The test also showed that all considered algorithms could qualify as a good choice, depending on the data in use. Accordingly, strategies combining different algorithms and the selection of optimality criteria will probably gain in importance.
Acknowledgements. We thank all those who provided data for the tests, including all of the authors who uploaded data to VegetWeb or VegBank. We also thank the developers and maintainers of these databases. Thanks go to Jason Collison for contributions to the code. We thank K. Aho, J. Podani and D. Roberts for their discussion and review of this manuscript. L. Tichý was supported by the Czech Science Foundation (206/09/0329) and by the long-term research plan of Masaryk University (Czech Ministry of Education, MSM 0021622416).