Spherical k-means clustering is good for interpreting multivariate species occurrence data

Authors


Summary

  1. Clustering multivariate species data can be an effective way of showing groups of species or samples with similar characteristics. Most current techniques classify the samples first and then the species. A disadvantage of classifying the samples first is that relatively subtle differences between occurrence profiles of species can be obscured.
  2. The k-means method of clustering minimizes the sum of squared distances between cluster centres and cluster members. If the entities to be clustered are projected on the unit sphere, then a natural measure of dispersion is the sum of squared chord distances separating the entities from their cluster centres; k-means clustering with this measure of dispersion is called spherical k-means (SKM). We also consider a variant in which the sum of squared perpendicular distances to a central ray is minimized.
  3. Unweighted SKM is liable to produce clusters of very rare species. This feature can be avoided if each point on the unit sphere is weighted by the length of the ray that produced it. The standard SKM algorithm converges to very numerous local optima. To avoid this problem, we have developed a computationally intensive algorithm that uses multiple randomizations to select high-quality seed species.
  4. The species clustering can be used to define simplified attributes for the samples. If the samples are then classified using the same technique, the resulting matrix of clustered species and clustered samples provides a biclustering of the data. The strength of the relationship between clusters can be measured by their mutual information, which is effectively the entropy of the biclustering.
  5. The technique was tested on five ecological and biogeographical datasets ranging in size from 30 species in 20 samples to 1405 species in 3857 samples. Several variants of SKM were compared, together with results from the established program Twinspan. When judged by entropy, SKM always performed adequately and produced the best clustering in all datasets but the smallest.

Introduction

Methods of classifying species and samples from multivariate species occurrence data were much investigated in the 1960s and 1970s. A distinction was made between Q-mode methods, in which the samples or stands were clustered, and R-mode methods, in which the species were clustered. Occasionally, as in Lambert & Williams's (1962) nodal analysis and Hill's (1979) program Twinspan, both samples and species were clustered, one after the other. By the end of the 1970s, it was accepted that the correct procedure is to classify the samples first. R-mode methods were in eclipse.

More recently, in the period 1995–2010, there has been renewed interest in numerical classification, mainly in the fields of text mining (Manning, Raghavan & Schütze 2008) and genomics.

Along with the general increase of interest in numerical classification, two-way classification has received increased attention. Two-way classification is variously known as biclustering (Madeira & Oliveira 2004; Gupta & Aggarwal 2010), co-clustering (Banerjee et al. 2007; Jain 2010) or two-mode clustering (Van Mechelen, Bock & De Boeck 2004; Schepers & Van Mechelen 2011; Hageman, Malosetti & van Eeuwijk 2012). The term biclustering, used here, was apparently introduced by Mirkin (1996), who does indeed cite Twinspan as an example. There has, however, been little flow from methodologies used in text mining and bioinformatics into ecology.

A promising approach to clustering and biclustering is to treat these methods as fitting models to a data matrix. An interesting example is set out by Martella & Vichi (2012). They and several other authors (ter Braak et al. 2009; Schepers & Van Mechelen 2011) use the least-squares criterion to approximate either a raw matrix or a similarity matrix. Approximations to a raw matrix based on unweighted least squares are generally not suitable for occurrence data in ecology and biogeography. We set out a crude multiplicative model for such data, but do not use it except as a means of estimating the Akaike Information Criterion to select the numbers of clusters.

Our interest in R-mode clustering was rekindled during a study of European plant distributions (Finnie et al. 2007). For this purpose, we compared species distributions with cluster centroids, using the cosine measure of similarity. This measure is widely used in text mining (Manning, Raghavan & Schütze 2008). The clustering algorithm of Finnie et al. (2007) was agglomerative, building up clusters from pairs of similar individual species. It was rather complicated and had some arbitrary parameters. Therefore, in a subsequent study of British and Irish liverworts (Preston, Harrower & Hill 2011), we used a simpler method. We called it Clustaspec. It starts by being agglomerative, and continues with a second phase in which the smallest clusters are systematically removed and their species distributed to larger ones. When Clustaspec was applied to other datasets, it usually gave good results, but it had a tendency to generate small clusters of rare species confined to special habitats. We were not entirely satisfied with it.

Both the method of Finnie et al. (2007) and Clustaspec tidied up the final clustering by means of an iterative relocation algorithm, by which each species was allocated to the nearest cluster centre, repeating the process until stability was reached. For clustering in Euclidean space, this method is known as the k-means algorithm (Krishna & Murty 1999). The algorithm of Finnie et al. and Clustaspec defined proximity in terms of the cosine similarity measure. Their relocation algorithm was therefore a case of the spherical k-means (SKM) algorithm, whose properties have been investigated by Vinh (2008). There is, however, an important difference. In the SKM algorithm described by Vinh, the objects to be clustered are first projected on the surface of the unit hypersphere, and are thereafter clustered by the SKM algorithm. In the algorithm used by us, the unit hypersphere was not considered, the cluster centres being calculated simply as the centroids of untransformed vectors. As explained below, this amounts to weighted SKM, with weights proportional to the length of the untransformed vectors. The weights make a big difference.

In Clustaspec, we used the SKM algorithm merely for tidying up the clusters. Vinh (2008) shows that the SKM algorithm will converge to a local optimum of the SKM objective function, defined as the sum of squared chord distances between cluster centres and individual cluster vectors. He also points out that there are very many such local optima. Indeed, there are so many local optima that the quest for the global optimum can be very arduous. For this quest, we have devised an algorithm based on ‘key species’. These are defined as those species that are most closely aligned to the cluster centres. Key species were used by Finnie et al. (2007) and Preston, Harrower & Hill (2011) to name the clusters. In the algorithm described below, they are used also to initiate the clusters.

Data and methods

Datasets

Five datasets were studied in detail (Table 1):

Table 1. Five datasets studied in detail, and the number of clusters into which they were grouped; abundance classes for the Dune and Arable datasets used the van der Maarel and DAFOR scales respectively

| | Dune | Danube | Arable | Liverwort | Vascular |
|---|---|---|---|---|---|
| Area sampled | Netherlands, Terschelling | Germany, E of Ulm | Britain and Ireland | British Isles and Channel Islands | British Isles and Channel Islands |
| Data type | Abundance class | Biomass % | Abundance class | Presence-absence | Presence-absence |
| Sample units | 2 × 2 m quadrats | Meadows | Arable fields | 10 × 10 km squares | 10 × 10 km squares |
| Number of species | 30 | 94 | 164 | 300 | 1405 |
| Number of samples | 20 | 25 | 812 | 3459 | 3857 |
| Number of non-zero items | 197 | 788 | 11 003 | 116 973 | 1 510 290 |
| Number of species clusters | 7 | 6 | 9 | 10 | 20 |
| Number of sample clusters | 5 | 8 | 12 | 12 | 24 |

  1. Dune meadow data, discussed by Jongman, ter Braak & van Tongeren (1995);
  2. Danube meadow data from a 25 km² study area east of Ulm, as discussed by Mueller-Dombois & Ellenberg (1974) and used in the manual for Twinspan (Hill 1979);
  3. The Arable bryophyte dataset analysed by Preston et al. (2010);
  4. The Liverwort dataset used by Preston, Harrower & Hill (2011); and
  5. An equivalent dataset for British and Irish native vascular plants; the dataset comprises all native records mapped by Preston, Pearman & Dines (2002), including old records as well as recent ones, but excluding records of native species from localities where they are known to be introduced.

Computer programs

The program Clustaspec was written in R by Harrower for classifying liverworts. Our program for spherical k-means was subsequently written by Hill in Fortran, using the GNU Fortran G77 v0.5.25 compiler for Windows XP (Free Software Foundation 1999). Both Clustaspec and the new program, Spherikm (SPHERIcal K-Means), can be downloaded from the BRC website http://www.brc.ac.uk.

As in Euclidean k-means clustering, the number of clusters, k, has to be specified in advance. The best clustering is defined to be that which minimizes the sum of squared distances between cluster members and their centroids. Specifically, let

$$A = [a_{ij}] \quad (i = 1, \dots, m;\; j = 1, \dots, n)$$

be a matrix specifying the occurrence of $n$ species in $m$ samples; the value of $a_{ij}$ is either the quantity of species $j$ found in sample $i$, or 1 or 0 if $A$ is a matrix of presences and absences. Let $a_j$ be the vector of elements corresponding to species $j$, i.e.

$$a_j = (a_{1j}, a_{2j}, \dots, a_{mj})^{T}.$$

Define

$$b_j = \frac{a_j}{\lVert a_j \rVert}.$$

This is the projection of aj on the unit hypersphere. Then the spherical k-means problem is to find a set of cluster centres

$$x_1, x_2, \dots, x_k$$

on the unit hypersphere that minimize the sum of squared chord distances between the vectors $b_j$ and the cluster centres. In symbols, the criterion to be minimized is

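$$\sum_{h=1}^{k} \sum_{j \in \text{cluster } h} \lVert b_j - x_h \rVert^{2} = 2 \sum_{h=1}^{k} \sum_{j \in \text{cluster } h} \left(1 - b_j^{T} x_h\right).$$

As $b_j^{T} x_h$ is simply the cosine of the angle between $a_j$ and $x_h$, an equivalent problem is to maximize the sum of cosines between the vectors and their cluster centres. In our calculations, we have used a weighted version of the summed cosine criterion, i.e.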

$$\sum_{h=1}^{k} \sum_{j \in \text{cluster } h} w_j\, b_j^{T} x_h.$$

Different weighting systems, from $w_j = 1$ (SKM as usually understood) to $w_j = \lVert a_j \rVert$, are compared below. When $w_j = \lVert a_j \rVert$, the centroid of a cluster of weighted points $b_j$ on the unit hypersphere is exactly aligned with the centroid of the vectors $a_j$ in the original space.
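
The two identities used in this paragraph can be checked numerically. The following is a minimal Python sketch, not part of Spherikm; the three occurrence vectors and the trial centre are hypothetical values chosen only for illustration.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def unit(v):
    n = norm(v)
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

# Hypothetical occurrence vectors for three species over four samples.
a = [[3.0, 1.0, 0.0, 0.0], [1.0, 1.0, 0.0, 0.0], [0.0, 2.0, 2.0, 1.0]]
b = [unit(v) for v in a]  # projections on the unit hypersphere

# A hypothetical trial centre on the unit hypersphere.
x = unit([1.0, 1.0, 1.0, 0.0])

# Squared chord distance between unit vectors equals 2 - 2 * cosine.
chord_sq = [sum((p - q) ** 2 for p, q in zip(bj, x)) for bj in b]
cosines = [dot(bj, x) for bj in b]

# With weights w_j = ||a_j||, the weighted sum of the b_j is simply the
# sum of the untransformed a_j, so both centroids point the same way.
weighted_sum = [sum(norm(aj) * bj[i] for aj, bj in zip(a, b)) for i in range(4)]
plain_sum = [sum(aj[i] for aj in a) for i in range(4)]
```

Minimizing summed squared chord distance and maximizing summed cosine are therefore the same problem, and length weighting reduces the centre calculation to an ordinary centroid of the raw vectors.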

We have made much use of the spherical k-means algorithm. The SKM algorithm starts with an initial set of trial cluster centres, and derives a new set by the following two steps (Vinh 2008).

  1. The membership assignment step – each vector is assigned to the cluster of the trial cluster centre to which it is closest; and
  2. The centre adjustment step – new cluster centres are located at the centroid of the members defined by step 1.

If these two steps are repeated, the algorithm converges to a local optimum.
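
The two steps can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the Fortran code of Spherikm: the function name `skm`, the `power` parameter (weights equal to vector length raised to `power`), the rule that an empty cluster keeps its previous centre, and the example data are all our own choices, and no species vector is assumed to be all zero.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def skm(vectors, seeds, power=1.0, max_iter=100):
    """Weighted spherical k-means started from seed indices.

    Returns (labels, centres). Weights are ||a_j|| ** power.
    """
    b = [unit(v) for v in vectors]
    w = [math.sqrt(dot(v, v)) ** power for v in vectors]
    centres = [unit(vectors[s]) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Membership assignment: the closest centre has the largest cosine.
        new_labels = [
            max(range(len(centres)), key=lambda h: dot(bj, centres[h]))
            for bj in b
        ]
        if new_labels == labels:
            break  # no vector moved, so a local optimum has been reached
        labels = new_labels
        # Centre adjustment: weighted centroid of the members of each
        # cluster, rescaled to lie on the unit hypersphere.
        for h in range(len(centres)):
            members = [j for j, lab in enumerate(labels) if lab == h]
            if members:  # an empty cluster keeps its previous centre
                total = [sum(w[j] * b[j][i] for j in members)
                         for i in range(len(b[0]))]
                centres[h] = unit(total)
    return labels, centres

# Hypothetical data: six species in four samples, seeded with species 0 and 3.
vectors = [[5, 4, 0, 0], [4, 5, 0, 0], [1, 1, 0, 0],
           [0, 0, 3, 4], [0, 0, 4, 3], [0, 1, 2, 2]]
labels, centres = skm(vectors, seeds=[0, 3])
```

Different seed choices can lead to different local optima, which is why the choice of seeds matters so much in practice.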

Our algorithm, mentioned in the introduction, is based on seed vectors. The initial clusters consist of a set S of seed vectors s1, s2,…, sk, selected from a1, a2,…, an. Let SKM(S) denote the local optimum reached when the spherical k-means algorithm is started from the seed vectors S. In each of the clusters defined by SKM(S), there will be a best-aligned vector (a ‘key species’ in the terminology used above). A self-regenerating set of seeds S is one such that the key vectors in the clusters of SKM(S) are identical to the seeds. When solutions are restricted to those local optima derived from self-regenerating seeds, the search for the (restricted) global optimum is more tractable. The algorithm proceeds in three stages, using random or restricted-random vectors S as seeds for SKM(S):

  1. Make a shortlist of suitable seeds;
  2. Select a list of k seeds sequentially from the shortlist by adding in the most frequently selected key vector that is not already in the selected list; and
  3. Adjust the list by trying out alternative seed lists in which each element of the list s1, s2,…, sk selected at stage 2 is replaced with an unselected vector from the shortlist. If any replacement seed list decreases the sum of squared deviations, select the best and repeat stage 3 with the new seed list until stability is reached.

This process, consisting of stages 1–3, is repeated 10 times and the best solution out of the 10 replicates is retained.
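
The notions of key vector and self-regenerating seed set can be illustrated as follows. This Python sketch assumes W1 weighting, so that each cluster centre is the direction of the plain sum of its members; the function names, the example vectors and the cluster labels (taken as given, as if returned by an SKM run) are hypothetical.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def key_vectors(vectors, labels, k):
    """Index of the best-aligned member (key species) of each cluster.

    Assumes W1 weighting: the centre of a cluster is the direction of
    the sum of its untransformed member vectors.
    """
    keys = []
    for h in range(k):
        members = [j for j, lab in enumerate(labels) if lab == h]
        centre = unit([sum(vectors[j][i] for j in members)
                       for i in range(len(vectors[0]))])
        keys.append(max(members, key=lambda j: dot(unit(vectors[j]), centre)))
    return keys

def is_self_regenerating(seeds, keys):
    """True if the key vectors of SKM(S) are exactly the seeds S."""
    return sorted(seeds) == sorted(keys)

# Hypothetical clustering of six species, assumed to be the result SKM(S)
# obtained from the seed set S = [0, 3].
vectors = [[5, 4, 0, 0], [4, 5, 0, 0], [1, 1, 0, 0],
           [0, 0, 3, 4], [0, 0, 4, 4], [0, 1, 2, 2]]
labels = [0, 0, 0, 1, 1, 1]
keys = key_vectors(vectors, labels, k=2)
```

Here the key vectors are species 2 and 4, so the seed set [0, 3] is not self-regenerating; a search restricted to self-regenerating sets would go on to try [2, 4] as seeds.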

Perpendicular spherical k-means

There are potentially two variants of the spherical k-means problem. They differ, as explained in the discussion, in how much leverage is given to aberrant cluster members. In spherical k-means as outlined above, we minimize the sum of squared chord distances. This method is abbreviated below as CSKM for chord spherical k-means. However, in principle an equally suitable criterion is the sum of squared perpendicular distances (Fig. 1). There is a small complication with this method, in that the minimum is not generally achieved by dropping perpendiculars to the ray through the centroid of the cluster. Specifically, let the minimizing ray be x. Then, ignoring the weights, we seek to minimize

$$D = \sum_j \left[1 - (b_j^{T} x)^{2}\right]$$

subject to the constraint that x is on the unit hypersphere, i.e.

$$x^{T} x = 1.$$
Figure 1.

The two types of spherical k-means. Vector x is the centre of the cluster and bi is a member of the cluster. In ordinary (chord) SKM we minimize the sum of squared chord distances $\sum \lVert BC \rVert^{2}$, while in perpendicular SKM we minimize the sum of squared perpendicular distances $\sum \lVert BA \rVert^{2}$.

To find the direction of $x$, we solve the problem with Lagrange multipliers and minimize

$$\Lambda = \sum_j \left[1 - (b_j^{T} x)^{2}\right] - \lambda \left(x^{T} x - 1\right).$$

Differentiating with respect to $x$, $\Lambda$ is minimized when

$$-2 \sum_j (b_j^{T} x)\, b_j - 2 \lambda x = 0.$$

Therefore

$$x = -\frac{1}{\lambda} \sum_j (b_j^{T} x)\, b_j.$$

This relationship allows us to solve for $x$ iteratively, starting with a trial vector $x^{(0)}$, which is the centroid of the $b_j$, and then repeating the process so that

$$x^{(1)} = -\frac{1}{\lambda^{(1)}} \sum_j (b_j^{T} x^{(0)})\, b_j,$$

and so on. The value of $-1/\lambda^{(1)}$ is chosen to be the positive value that places $x^{(1)}$ on the unit hypersphere. Note that because the vectors $b_j$ and $x$ are in the positive quadrant, all the coefficients $b_j^{T} x$ are also positive. Once the direction of $x$ is known, calculation of $D$, the sum of squared deviations, is immediate.
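
The iteration for the central ray can be sketched in Python as follows. This is a minimal illustration rather than the Spherikm implementation: the function name, the convergence tolerance and the three example vectors are assumptions, and weights are ignored as in the derivation.

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def perpendicular_centre(b, tol=1e-12, max_iter=1000):
    """Central ray x minimizing the sum of squared perpendicular distances.

    b is a list of unit vectors. Returns (x, D).
    """
    dim = len(b[0])
    # Trial vector x(0): the direction of the centroid of the b_j.
    x = unit([sum(bj[i] for bj in b) for i in range(dim)])
    for _ in range(max_iter):
        coeff = [dot(bj, x) for bj in b]
        # New x is proportional to sum_j (b_j . x) b_j; rescaling to unit
        # length plays the role of the positive multiplier in the text.
        new_x = unit([sum(c * bj[i] for c, bj in zip(coeff, b))
                      for i in range(dim)])
        moved = max(abs(p - q) for p, q in zip(new_x, x))
        x = new_x
        if moved < tol:
            break
    d = sum(1.0 - dot(bj, x) ** 2 for bj in b)
    return x, d

# Hypothetical cluster: two identical members and one aberrant member.
b = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
x, d = perpendicular_centre(b)

# For comparison: D evaluated on the ray through the centroid.
centroid_ray = unit([2.0, 1.0])
d_centroid = sum(1.0 - dot(bj, centroid_ray) ** 2 for bj in b)
```

In this example the minimizing ray coincides with the two well-aligned members and gives D = 1, whereas the ray through the centroid gives D = 1·2, which illustrates why the centroid ray is not in general the solution.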

Biclustering and measures of concentration

Biclustering of the data was achieved by first clustering the species, then condensing the data to account for species clusters (i.e. adding together the species vectors in each cluster), transposing, and clustering the samples by the same method. Suppose, for example, that a given sample contains species A, B, C and D, all with quantity 1, and that A, C and D belong to Species-cluster 1 and B belongs to Species-cluster 2. The composition of the sample for the purposes of the secondary clustering is Species-cluster 1 quantity 3, Species-cluster 2 quantity 1.
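
The condensing step in this example amounts to summing quantities within species clusters. A minimal Python sketch, with a hypothetical function name and dictionary layout, is:

```python
def condense(sample, species_cluster):
    """Sum the quantities in one sample within each species cluster."""
    totals = {}
    for species, quantity in sample.items():
        cluster = species_cluster[species]
        totals[cluster] = totals.get(cluster, 0) + quantity
    return totals

# The example from the text: species A, B, C and D, all with quantity 1.
sample = {"A": 1, "B": 1, "C": 1, "D": 1}
species_cluster = {"A": 1, "B": 2, "C": 1, "D": 1}
composition = condense(sample, species_cluster)
print(composition)  # {1: 3, 2: 1}
```

Applying this to every sample gives the condensed matrix whose rows are then clustered in the second pass.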

With presence data, a well-known goodness-of-fit measure for a two-way table is the chi-squared statistic based on the sum of squared deviations between observed and expected values in cluster cells, $\sum (o - e)^{2}/e$. This statistic does not generalize readily to data types where the original values are quantities or are ordinal classes. A measure that generalizes better is the dimensionless (geometric) mean ratio of observed to expected values. Let $I$ denote a cluster of samples and $J$ denote a cluster of species. The observed value $o_{IJ}$ is the sum of matrix elements in clusters $I$ and $J$, i.e.

$$o_{IJ} = \sum_{i \in I} \sum_{j \in J} a_{ij}.$$

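Then the expected value is defined as

$$e_{IJ} = \frac{\left(\sum_{J'} o_{IJ'}\right) \left(\sum_{I'} o_{I'J}\right)}{\sum_{I'} \sum_{J'} o_{I'J'}}.$$

Concentration can be measured by the statistic

$$K = \exp\!\left(\frac{\sum_{I} \sum_{J} o_{IJ} \ln\left(o_{IJ}/e_{IJ}\right)}{\sum_{I} \sum_{J} o_{IJ}}\right).$$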

In reporting results, K is called the ‘concentration ratio’ because it measures the geometric mean ratio of the observed values in the cluster cells to those that would be expected if species occurred at random. In the case where the data aij are presences and absences, K is effectively the G statistic of Sokal & Rohlf (1981) which measures the entropy (more properly the mutual information) of the biclustering. It can be argued that mathematically the best solution is that which maximizes the entropy (Banerjee et al. 2007).
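
The concentration ratio is easily computed from a table of bicluster totals. The Python sketch below assumes that expected values are the product of the marginal totals divided by the grand total, and that K is the geometric mean of observed/expected weighted by the observed totals; the function name and the two small tables are hypothetical.

```python
import math

def concentration_ratio(o):
    """Concentration ratio K for a table of bicluster totals.

    o[I][J] is the observed total for sample cluster I and species
    cluster J. Empty cells contribute nothing to the weighted mean.
    """
    row = [sum(r) for r in o]
    col = [sum(c) for c in zip(*o)]
    n = sum(row)
    log_sum = 0.0
    for i, r in enumerate(o):
        for j, obs in enumerate(r):
            if obs > 0:
                expected = row[i] * col[j] / n
                log_sum += obs * math.log(obs / expected)
    return math.exp(log_sum / n)

# A perfectly concentrated 2 x 2 biclustering and a structureless one.
k_concentrated = concentration_ratio([[2, 0], [0, 2]])
k_random = concentration_ratio([[1, 1], [1, 1]])
```

The first table gives K = 2 and the second K = 1. For presence data with total N, 2N ln K is the G statistic referred to above.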

Cluster presentation and choice of cluster numbers

For clarity of presentation, the clusters, once defined, were arranged by a two-stage process. First they were ordered by correspondence analysis (Hill 1982; Jongman, ter Braak & van Tongeren 1995). Then they were clustered hierarchically by Ward's method (Legendre & Legendre 1998), an agglomerative technique which at each stage unites the pair that minimally increases the total within-cluster variance. Clusters were ordered so that the hierarchy resulting from Ward's method could be presented cleanly. In other words, when groups were united, they were placed side by side. Correspondence analysis order was retained if there was a choice, with the cluster having minimum axis score appearing as the first in the final order. The hierarchy was printed out in Newick format for viewing in Dendroscope (Huson et al. 2007). We give two examples in the Supplementary Information.

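For selecting cluster numbers, the biclustering was approximated by fitting a multiplicative model with the same row totals, column totals and cluster totals

$$\hat{a}_{ij} = \frac{a_{i+}\, a_{+j}\, o_{IJ}}{o_{I+}\, o_{+J}} \quad (i \in I,\; j \in J),$$

where a subscript + denotes summation over the index that it replaces, and then calculating an analogue of the concentration ratio

$$K' = \exp\!\left(\frac{\sum_i \sum_j a_{ij} \ln\left(a_{ij}/\hat{a}_{ij}\right)}{\sum_i \sum_j a_{ij}}\right).$$

K′ measures the size of the residuals after fitting $\hat{a}_{ij}$. If the values $a_{ij}$ were counts, then the G statistic, which is distributed as $\chi^{2}$, would be

$$G = 2 N \ln K',$$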

where N is the total count, i.e. ΣΣaij. Let F be the number of fitted constants, k1 the number of species clusters and k2 be the number of sample clusters. Then in this case, ignoring a constant offset in AIC,

display math

If this criterion is to be applied where $a_{ij}$ are not counts, then an analogue for N needs to be found. If the values $a_{ij}$ are presences and absences (0 or 1), N can be taken to be the total $\sum\sum a_{ij}$. If $a_{ij}$ are quantities such as species abundance values, a suitable choice of N is, in the notation of Hill (1973), the number N2, i.e. $(\sum\sum a_{ij})^{2} / \sum\sum a_{ij}^{2}$. The value of AIC calculated here using N2 is called ‘quasi-AIC’, to emphasize the fact that it is not based on likelihood in a statistical model.
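
The quantities entering quasi-AIC can be illustrated with a short Python sketch. It rests on assumptions that should be stated plainly: the fitted values use one closed form that reproduces the row, column and bicluster totals (row total times column total times bicluster total, divided by the two cluster marginal totals), K′ is taken as the weighted geometric mean of observed/fitted, and no row, column or cluster is empty. The function names and the small data matrix are hypothetical, and the AIC penalty term is not computed.

```python
import math

def fit_multiplicative(a, sample_cluster, species_cluster):
    """Fitted values with the same row, column and bicluster totals as a.

    a[i][j] is the quantity of species j in sample i.
    """
    m, n = len(a), len(a[0])
    row = [sum(r) for r in a]
    col = [sum(a[i][j] for i in range(m)) for j in range(n)]
    cell, crow, ccol = {}, {}, {}
    for i in range(m):
        big_i = sample_cluster[i]
        crow[big_i] = crow.get(big_i, 0.0) + row[i]
        for j in range(n):
            key = (big_i, species_cluster[j])
            cell[key] = cell.get(key, 0.0) + a[i][j]
    for j in range(n):
        big_j = species_cluster[j]
        ccol[big_j] = ccol.get(big_j, 0.0) + col[j]
    return [[row[i] * col[j] * cell[sample_cluster[i], species_cluster[j]]
             / (crow[sample_cluster[i]] * ccol[species_cluster[j]])
             for j in range(n)] for i in range(m)]

def residual_ratio(a, fitted):
    """Analogue of the concentration ratio, measuring residual size."""
    total = sum(map(sum, a))
    logs = sum(x * math.log(x / f)
               for ra, rf in zip(a, fitted)
               for x, f in zip(ra, rf) if x > 0)
    return math.exp(logs / total)

def hill_n2(a):
    """Hill's N2: squared grand total divided by the sum of squares."""
    total = sum(map(sum, a))
    return total ** 2 / sum(x * x for r in a for x in r)

# Hypothetical data: three samples (rows) and three species (columns).
a = [[2, 1, 0], [1, 2, 0], [0, 1, 3]]
fitted = fit_multiplicative(a, sample_cluster=[0, 0, 1],
                            species_cluster=[0, 0, 1])
k_prime = residual_ratio(a, fitted)
n2 = hill_n2(a)
```

For a presence-absence matrix every non-zero element equals 1, so N2 reduces to the total number of presences, in agreement with the choice of N made above.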

Testing the methods

The standard SKM analyses, for the purposes of this paper, are those in which projections of data on the unit hypersphere are weighted in proportion to $\lVert a_j \rVert$, the length of their vectors. These are signified as W1, as the weights are $\lVert a_j \rVert^{1\cdot 0}$. Both the chord variant CSKM and the perpendicular variant PSKM have been tested. W00 is SKM as usually understood, with species and samples projected on the unit sphere and given equal unit weight $\lVert a_j \rVert^{0\cdot 0}$. Two other weightings were considered. In W0, species were weighted as in W00 but samples in the subsequent sample clustering were weighted as in W1. W0·5 is defined similarly, with species weights $\lVert a_j \rVert^{0\cdot 5}$ and sample weights $\lVert a_j \rVert^{1\cdot 0}$.
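
The four weighting schemes differ only in the power to which vector length is raised. As a minimal Python sketch (the dictionary layout, function name and example vector are hypothetical):

```python
import math

def weight(vector, power):
    """Weight of a projected vector: its original length raised to power."""
    return math.sqrt(sum(x * x for x in vector)) ** power

# (species power, sample power) for each scheme, as described in the text.
SCHEMES = {
    "W00": (0.0, 0.0),
    "W0": (0.0, 1.0),
    "W0.5": (0.5, 1.0),
    "W1": (1.0, 1.0),
}

# A hypothetical species vector of length 5.
a_j = [3.0, 4.0]
species_weights = {name: weight(a_j, powers[0])
                   for name, powers in SCHEMES.items()}
```

For this vector the species weight is 1 under W00 and W0, the square root of 5 under W0·5, and 5 under W1.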

Datasets other than the vascular plant dataset were transposed to check whether it is better to cluster the species first and then the samples, or vice-versa. Transposed analyses, in which the samples were clustered first and the species clustered second, are denoted by W00 Transposed, W1 Transposed, etc.

Twinspan does not produce a specific number of clusters, but does generate a hierarchy for both species and samples. To compare it with the other methods, clusters were defined on the basis of the higher levels of the hierarchy, trying to avoid very small clusters that would give the other methods an unfair advantage. This process was not automated and clusters were selected by eye.

Results

Concentration ratios

Except for the Dune dataset, the highest concentration ratios were found with either standard weighted CSKM or PSKM (Table 2). In the biogeographical datasets, the PSKM arrangement was the most concentrated, whereas in the Danube and Arable datasets, the CSKM arrangement was more concentrated. Twinspan produced less highly concentrated solutions. Clustaspec produced results that were rather similar to those from SKM but were somewhat less concentrated.

Table 2. Concentration ratios for biclustering by various clustering methods. CSKM and PSKM are chord and perpendicular spherical k-means respectively; W00, W0, W0·5 and W1 are differing weighting schemes. Transposed biclusterings were performed by clustering the samples first and the species second. Maximum values are shown in bold type

| Analysis type | Dune | Danube | Arable | Liverwort | Vascular |
|---|---|---|---|---|---|
| CSKM W00 | 1·45994 | 1·49856 | 1·11625 | 1·17084 | |
| CSKM W0 | 1·44953 | 1·54587 | 1·14634 | 1·18142 | |
| CSKM W0·5 | 1·50525 | 1·56026 | 1·19799 | 1·20960 | 1·16624 |
| CSKM W1 | 1·51112 | **1·56470** | **1·22544** | 1·22135 | 1·17142 |
| CSKM W00 Transposed | 1·42943 | 1·49001 | 1·18576 | 1·18150 | |
| CSKM W0 Transposed | 1·51533 | 1·51884 | 1·20869 | 1·20149 | |
| CSKM W0·5 Transposed | 1·51533 | 1·51999 | 1·21039 | 1·21359 | |
| CSKM W1 Transposed | 1·51575 | 1·49826 | 1·21362 | 1·22167 | |
| PSKM W00 | 1·49113 | 1·41471 | 1·15671 | 1·19455 | |
| PSKM W0 | 1·50812 | 1·54737 | 1·18087 | 1·20631 | |
| PSKM W0·5 | 1·50566 | 1·56142 | 1·21264 | 1·22513 | 1·16852 |
| PSKM W1 | 1·51112 | 1·54140 | 1·22415 | **1·22743** | **1·17321** |
| PSKM W00 Transposed | 1·42943 | 1·46029 | 1·16147 | 1·20285 | |
| PSKM W0 Transposed | 1·51533 | 1·51842 | 1·21127 | 1·21437 | |
| PSKM W0·5 Transposed | 1·51533 | 1·48263 | 1·20934 | 1·21829 | |
| PSKM W1 Transposed | 1·51575 | 1·49668 | 1·20983 | 1·22495 | |
| Twinspan | 1·46822 | 1·51006 | 1·18407 | 1·15397 | |
| Twinspan Transposed | 1·36876 | 1·37165 | 1·10600 | 1·16533 | |
| Clustaspec | 1·50423 | 1·54139 | 1·17147 | 1·19633 | |
| Clustaspec Transposed | **1·51952** | 1·42099 | 1·19820 | 1·21366 | |

Dune meadow data

The Dune dataset, the most species-poor, is small enough to be displayed in full in Fig. 2. Some samples had much bare ground. In sample 17, only Anthoxanthum odoratum had cover value greater than 2; its cover value 4 signifies less than 5% vegetation cover. Two biclusterings are shown. The first (Fig. 2a), with a concentration ratio of 1·51, is the standard SKM solution for seven species clusters and five sample clusters. The second (Fig. 2b), with concentration ratio 1·44, shows the simplified solution with five species clusters and four sample clusters suggested by the quasi-AIC statistic. For this dataset and not the others, better results were obtained for the (7, 5) case by clustering the samples first and then the species. For the preferred (5, 4) case, it was better to cluster the species first.

Figure 2.

Dune dataset, showing (a) the standard W1 solution resulting from both CSKM and PSKM (concentration ratio 1·51) and (b) the simplified solution with minimum quasi-AIC (concentration ratio 1·44). Species names are abbreviated as in ter Braak & Šmilauer (1998).

Danube meadow data

The Danube dataset is displayed in Fig. 3, which shows six clusters for 34 species and omits the 60 species with lowest average biomass. An expanded version of the figure, colour-coded for concentration ratios and including the PSKM biclustering, is given in the Supporting Information, along with the solutions suggested by quasi-AIC, which have five species clusters and five sample clusters. Concentration ratios were 1·56 for the (6, 8) case and 1·47 for the (5, 5) case.

Figure 3.

Danube Meadow dataset with biclustering by CSKM, concentration ratio 1·56. Values are biomass%. Species with average biomass less than 0·4% of the total have been omitted. The symbol + indicates presence but with less than 0·5% of the biomass in that sample.

Other datasets

Table 3 shows bicluster totals for the arable field dataset. The concentration ratio was 1·23 for the (9, 12) case, which was investigated in detail. The minimum quasi-AIC was found with 24 species clusters and 28 sample clusters. This solution had concentration ratio 1·37, and is set out briefly in the Supporting Information.

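Table 3. Arable bryophyte data, showing (a) bicluster totals and (b) individual cell concentrations (observed/expected) for the standard CSKM biclustering. Rows are species clusters; columns are sample clusters. The mean concentration for the whole biclustering is 1·23, which is the weighted geometric mean of the individual cell concentrations in (b), weighted by the totals in (a)

(a)

| Cluster | 1 | 2 | 4 | 6 | 3 | 7 | 5 | 8 | 9 | 11 | 12 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 469 | 97 | 174 | 263 | 51 | 29 | 7 | 48 | 49 | 17 | 11 | 8 |
| 5 | 694 | 541 | 1162 | 1741 | 233 | 43 | 67 | 264 | 1102 | 575 | 219 | 416 |
| 2 | 172 | 472 | 336 | 400 | 81 | 5 | 48 | 28 | 116 | 79 | 9 | 173 |
| 4 | 32 | 35 | 49 | 50 | 37 | 8 | 77 | 8 | 6 | 4 | 8 | 4 |
| 3 | 263 | 231 | 237 | 351 | 299 | 143 | 24 | 246 | 160 | 112 | 87 | 108 |
| 6 | 53 | 8 | 8 | 51 | 11 | 137 | 1 | 88 | 57 | 29 | 41 | 6 |
| 7 | 45 | 92 | 82 | 206 | 25 | 17 | 30 | 50 | 200 | 279 | 35 | 534 |
| 8 | 180 | 137 | 255 | 1026 | 88 | 81 | 80 | 363 | 1099 | 1303 | 489 | 527 |
| 9 | 6 | 1 | 13 | 50 | 1 | 15 | 0 | 19 | 54 | 93 | 238 | 17 |

(b)

| Cluster | 1 | 2 | 4 | 6 | 3 | 7 | 5 | 8 | 9 | 11 | 12 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 4·21 | 1·03 | 1·29 | 1·09 | 1·06 | 1·04 | 0·36 | 0·74 | 0·30 | 0·12 | 0·17 | 0·08 |
| 5 | 1·08 | 1·00 | 1·49 | 1·25 | 0·84 | 0·27 | 0·60 | 0·71 | 1·15 | 0·69 | 0·57 | 0·69 |
| 2 | 0·98 | 3·20 | 1·59 | 1·06 | 1·07 | 0·11 | 1·57 | 0·28 | 0·45 | 0·35 | 0·09 | 1·06 |
| 4 | 1·10 | 1·43 | 1·40 | 0·80 | 2·96 | 1·11 | 15·22 | 0·47 | 0·14 | 0·11 | 0·46 | 0·15 |
| 3 | 1·28 | 1·33 | 0·95 | 0·79 | 3·36 | 2·78 | 0·67 | 2·05 | 0·52 | 0·42 | 0·71 | 0·56 |
| 6 | 1·19 | 0·21 | 0·15 | 0·53 | 0·57 | 12·28 | 0·13 | 3·39 | 0·86 | 0·50 | 1·55 | 0·14 |
| 7 | 0·31 | 0·75 | 0·47 | 0·66 | 0·40 | 0·47 | 1·18 | 0·59 | 0·93 | 1·47 | 0·41 | 3·92 |
| 8 | 0·35 | 0·32 | 0·41 | 0·93 | 0·40 | 0·63 | 0·89 | 1·22 | 1·44 | 1·95 | 1·60 | 1·10 |
| 9 | 0·13 | 0·03 | 0·23 | 0·50 | 0·05 | 1·30 | 0·00 | 0·71 | 0·79 | 1·55 | 8·67 | 0·39 |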

Bicluster totals and concentrations for the liverwort and vascular-plant datasets are not shown here but are given in the Supporting Information.

Discussion

Themes and algorithm

When differing weighting schemes W0, W0·5 and W1 were applied, it became apparent that a relatively small suite of cluster themes appeared repeatedly. An analysis of themes for liverwort analyses (see Supporting Information) revealed 16 themes from 8 analyses, each of which had 10 species clusters. Four themes, namely Southwest coast, Irish Atlantic, Calcicole montane and Eastern snowpatch, were nested within larger W1 themes. Two themes, Middle western and Rather upland, were intermediate between W1 themes.

The algorithm, based on random seeds, cannot be guaranteed to converge to the global optimum. Our use of cluster seeds is similar to the MedoidKNN procedure proposed by Kalogeratos & Likas (2011). Our algorithm is somewhat complicated, but we found that simpler algorithms were frustratingly unable to locate really good solutions. Solutions that were close to the optimum displayed almost all the same themes. For example, the second-best solution for CSKM W1 applied to the vascular plants was found in two of the ten main replicates. It had mean cosine 0·79932 compared with 0·79939 for the best. Its 20 themes were the same. Of its key species, 13 were identical to those in the best solution, and five appeared as number 2 in order of alignment to the best solution. Of the remaining two, Carex echinata was 5th in order of alignment to the moorland cluster, and Alisma plantago-aquatica had moved from the eutrophic lowland cluster to the aquatic lowland cluster. In the best solution the cosine similarity of A. plantago-aquatica to the eutrophic cluster was 0·910, while its similarity to the aquatic cluster was 0·878. Clearly these are small differences, but in our judgement, the mathematically suboptimal solutions were for the most part somewhat inferior ecologically.

The algorithm is not especially quick. Typically, a solution for one of the larger problems required about 50 000 iterations of the SKM algorithm. Applied to the arable dataset, with 11 003 elements, the calculation took 27 and 40 min respectively for CSKM and PSKM to extract nine species clusters and 12 sample clusters, using a desktop computer with a 2.8 GHz processor. Calculations for the vascular plant dataset, which is 140 times bigger, took about a week, partly because of the large size of the dataset and partly because more groups were sought.

We have no doubt that efficiency could be improved, but this would require either parallel processing or a more subtle algorithm.

What makes a good clustering?

From the early days of plant ecology, clustering has been used for data exploration. During the period 1950–1980 investigators sought objectivity through the use of numerical methods. The methods of Braun-Blanquet and his followers were frequently attacked as lacking objectivity. However, Goodall (1953) noted early on that Braun-Blanquet's method of ‘character species’ could in principle be made objective. It has much resemblance to the algorithm based on key species, used here.

In biogeographical analyses (Finnie et al. 2007; Preston, Harrower & Hill 2011), we have successfully employed R-mode methods that rely on the cosine measure of similarity. Forty years earlier, Orloci (1967) had proposed the method of ‘optimal agglomeration’. This is essentially Ward's minimum variance method (Legendre & Legendre 1998) on the surface of a hypersphere. It also uses the cosine measure of similarity but never achieved much popularity. This may well be because optimal agglomeration used unweighted vectors, i.e. the weighting scheme W0, which in our study proved less satisfactory than W1 (Table 2).

How then should we judge clustering methods? Their ability to extract clear patterns is essential. They should not pick out minor groups at the expense of the broad picture. For these reasons, the concentration ratio has all the hallmarks of a good criterion by which biclusterings can be judged. Perhaps it should be used directly, just as maximum entropy methods are used in other applications. We do not know of a direct algorithmic approach to the maximization problem and have therefore used variants of SKM and compared them by the concentration ratio (Table 2). In principle, the ‘double k-means’ approach explored by Martella & Vichi (2012) could be extended from k-means to SKM using the concentration ratio as objective function. The problem of avoiding local optima would be just as severe with double SKM as with ordinary SKM, but double SKM might be useful to clean up approximate solutions derived by sequential clustering (species, followed by samples).

A good clustering should not have too many or too few clusters. For the two smaller datasets, the application of quasi-AIC to restrict cluster numbers was successful. For the Dune Meadow data, the groups (Fig. 2b) make obvious ecological sense and are: (i) dicots (and one grass) of low-nutrient permanent grassland; (ii) dicots (and one moss and annual grass) of short turf; (iii) competitive pasture grasses (and one dicot); (iv) dune-slack margins; and (v) dune-slack centres. For the Danube Meadow data (Supporting Information, Figure S1b) the five groups are: (i) Dry calcareous grassland (Mesobromion); (ii) Poa pratensis (dominant in one aberrant sample); (iii) coarse pasture grasses (and four dicots); (iv) wetland grasses (confined to a sample that was regularly inundated); and (v) dicots (plus two grasses and one sedge). This classification brings out themes corresponding to two main gradients, dry to wet, and high-grass to high-dicot. In addition, it distinguishes an aberrant sample.

With the arable bryophyte data, quasi-AIC suggested a substantial increase in cluster numbers from (9, 12) to (24, 28). The 24 × 28 concentration matrix is shown in the Supporting Information (Fig. S2). There is undoubtedly much structure even at this level of subdivision, but in most applications it is preferable to have a succinct overview. Indeed, Preston et al. (2010) recognized just six species assemblages based on detrended correspondence analysis followed by k-means clustering. Many of the clusters recognized by both CSKM W1 and PSKM W1 with nine species clusters and 12 sample clusters are broadly similar to assemblages described by Preston et al. (2010).

The hierarchy derived by Ward's method (illustrated in Supplementary Information Fig. S3 and Fig. S4) also provides an overview. There is indeed no straightforward answer to what makes a good clustering. It depends on whether the investigator is looking for detail or for broad features.

Comparison of methods

All the classifications outlined above produced recognizable patterns that can be interpreted in ecological or biogeographical terms. There was a clear progression from the more balanced W1 analyses to the W0 analyses, which generated some small but rather distinct clusters of rare species as well as some large clusters. The pattern is shown for liverwort clusters (Table 4). The two least concentrated biclusterings resulted from Twinspan and CSKM W0; here the largest sample clusters were 1402 and 790, i.e. 41% and 23% of all 3459 samples. The Twinspan classification was especially uneven, and failed to distinguish a category of montane species. In CSKM W0, the maximum cell concentration of 52·1 was for 10 Irish-Atlantic species in a cluster of 33 hectads among which 26 were in Ireland and 7 in Britain. Clearly the W0 biclusterings were too uneven to be generally suitable. The W0·5 biclusterings, on the other hand, were nearly as concentrated as the W1 biclusterings.

Table 4. Liverwort cluster size in relation to concentration of biclustering; CV is coefficient of variation in cluster size, spec refers to species cluster size, samp to sample cluster size
Analysis    Concentration  Max cell concentration  CV spec  CV samp  Min spec  Max spec  Min samp  Max samp
CSKM W1     1·221          11·5                    0·37     0·48     17        51        82        559
PSKM W1     1·227          11·6                    0·38     0·37     13        51        74        439
CSKM W0·5   1·210          13·6                    0·39     0·70     11        51        49        695
PSKM W0·5   1·225          15·4                    0·25     0·61     19        40        46        635
CSKM W0     1·181          52·1                    0·67     0·90     10        73        16        790
PSKM W0     1·206          53·3                    0·50     0·94     10        57        21        793
Twinspan    1·154          21·4                    0·65     1·70     13        77        5         1402
Clustaspec  1·196          17·1                    0·56     0·38     10        66        81        526
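
The evenness measures quoted with Table 4 can be illustrated with a minimal sketch. It assumes that CV is the standard deviation of cluster sizes divided by their mean; whether the population or the sample standard deviation was used is not stated here, so the population form is an assumption. The function names and the two lists of cluster sizes are hypothetical; only the totals 3459, 1402 and 790 come from the text.

```python
from statistics import mean, pstdev


def coefficient_of_variation(sizes):
    """Population standard deviation of cluster sizes divided by their mean
    (the choice of population rather than sample SD is an assumption)."""
    return pstdev(sizes) / mean(sizes)


def largest_cluster_share(largest, total):
    """Percentage of all samples that fall in the largest cluster."""
    return 100.0 * largest / total


TOTAL_SAMPLES = 3459  # number of samples quoted in the text

twinspan_share = largest_cluster_share(1402, TOTAL_SAMPLES)
cskm_w0_share = largest_cluster_share(790, TOTAL_SAMPLES)
print(f"Twinspan largest sample cluster: {twinspan_share:.1f}% of samples")
print(f"CSKM W0 largest sample cluster: {cskm_w0_share:.1f}% of samples")

# Hypothetical cluster sizes, for illustration only (not taken from the paper)
even_sizes = [300, 310, 290, 305, 295]
uneven_sizes = [20, 30, 50, 100, 1300]
print(f"CV, even clustering:   {coefficient_of_variation(even_sizes):.2f}")
print(f"CV, uneven clustering: {coefficient_of_variation(uneven_sizes):.2f}")
```

Rounded to whole numbers, the two shares agree with the 41% and 23% given in the text, and the uneven hypothetical clustering has a much larger CV than the even one.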

The PSKM W1 classification of the arable dataset produced two essentially single-species clusters, Bryum klinggraeffii in one cluster and B. violaceum in the other. This dataset has less inherent structure than the other datasets from Britain and Ireland, because it was obtained from a single, rather uniform habitat that is confined to the lowlands. The liverwort and vascular plant datasets cover the whole environment, including woods, grasslands, rivers, coasts and mountains. The CSKM and PSKM methods produced very similar results for a given weighting when applied to these data.

Apart from the fact that PSKM minimizes the sum of squared distances to rays not passing through exact cluster centroids, the main difference between CSKM and PSKM is that PSKM maximizes $\sum_j (b_j^{T} x)^2$ whereas CSKM maximizes $\sum_j (b_j^{T} x)$. This distinction underlies the main practical difference between them, namely that CSKM emphasizes overall conformity to the centroid, whereas PSKM pays less attention to species that are more deviant, emphasizing those that are well aligned. PSKM produced marginally higher-entropy biclusterings than CSKM for the two biogeographical datasets. Our analyses do not indicate that either of the two is always better. We note in passing that the most truly spherical k-means clustering would be angular spherical k-means (ASKM), which minimizes the sum of squared angles to a central ray. ASKM would take longer to compute than CSKM, because as with PSKM the position of the central ray has to be calculated by a recursion formula. ASKM would be more sensitive to poorly-aligned elements than CSKM, but we have not programmed it and do not report on its properties here.
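
The contrast between the two objectives can be sketched for a single, unweighted cluster of unit vectors. For unit vectors, the squared chord distance to a unit centre x is 2 - 2 b^T x, and the squared perpendicular distance to the ray through x is 1 - (b^T x)^2, so the chordal centre is the normalized sum of the points and the perpendicular ray is the leading eigenvector of the matrix formed by summing b b^T over the points. The code below is an illustration, not the authors' program: the function names are hypothetical, the ray-length weighting is omitted, non-negative data are assumed, and plain power iteration is used as an assumed stand-in for the recursion formula mentioned in the text.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def chordal_centre(points):
    """Unit vector x maximizing sum_j b_j^T x: the normalized sum of the points."""
    dim = len(points[0])
    return normalize([sum(p[i] for p in points) for i in range(dim)])


def perpendicular_ray(points, iterations=200):
    """Unit vector x maximizing sum_j (b_j^T x)^2, by power iteration.

    Assumption: non-negative data, so the chordal centre is a valid
    starting vector (not orthogonal to the leading eigenvector).
    """
    x = chordal_centre(points)
    dim = len(x)
    for _ in range(iterations):
        # Apply sum_j b_j b_j^T to x without forming the matrix explicitly
        proj = [dot(p, x) for p in points]
        x = normalize([sum(w * p[i] for w, p in zip(proj, points))
                       for i in range(dim)])
    return x


# Hypothetical cluster: three well-aligned points and one deviant point
points = [normalize(p) for p in [[1.0, 0.1], [1.0, 0.0], [1.0, 0.2], [0.1, 1.0]]]
centre = chordal_centre(points)
ray = perpendicular_ray(points)

print("chordal centre:", [round(c, 3) for c in centre])
print("perpendicular ray:", [round(c, 3) for c in ray])
print("alignment of deviant point with centre:", round(dot(points[3], centre), 3))
print("alignment of deviant point with ray:", round(dot(points[3], ray), 3))
```

In this toy example the perpendicular ray lies closer to the three well-aligned points than the chordal centre does, which is the behaviour described above: the squared objective gives less influence to the deviant point.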

Although the differing weightings of CSKM and PSKM produced results that differ in their concentration ratios, the selection of a preferred weighting may on occasion be better judged by the requirements of the user rather than by differences in concentration ratio. The particular choice may depend on the dataset in question. To our way of thinking, the W1 methods produced a satisfactory classification of the liverworts, which have very few ubiquitous species. When applied to vascular plants, among which widespread species are more frequent, the W1 weighting produced three groups of almost ubiquitous species, differing in the rather small areas of Britain and Ireland from which they are absent. For vascular plants, therefore, W0·5 weightings generated a more interesting set of patterns, which will be reported elsewhere.

Conclusions

Spherical k-means is shown to be a powerful clustering method, especially for R-mode analyses. It has hitherto been neglected because it tends to produce very unequal cluster sizes unless the commoner species are given greater weight. It also requires careful programming to avoid unsatisfactory local optima. There is no general answer to whether CSKM or PSKM is better; we recommend running both and selecting the solution with the higher concentration ratio. The quasi-Akaike criterion is good for selecting the number of clusters in small datasets, but in large datasets convenience is likely to be the main consideration.

Acknowledgements

This work was partly funded by the Joint Nature Conservation Committee (JNCC) and the Natural Environment Research Council under the Biological Records Centre (BRC) work programme for 2010 and 2011. We are grateful to John Birks, Cajo ter Braak and an anonymous referee for helpful comments on an earlier draft, and thank the British Bryological Society and the Botanical Society of the British Isles for permission to use their data.
