Keywords:

  • biclustering;
  • biogeography;
  • co-clustering;
  • entropy;
  • mutual information;
  • R-mode;
  • Twinspan

Summary

  1. Clustering multivariate species data can be an effective way of showing groups of species or samples with similar characteristics. Most current techniques classify the samples first and then the species. A disadvantage of classifying the samples first is that relatively subtle differences between occurrence profiles of species can be obscured.
  2. The k-means method of clustering minimizes the sum of squared distances between cluster centres and cluster members. If the entities to be clustered are projected on the unit sphere, then a natural measure of dispersion is the sum of squared chord distances separating the entities from their cluster centres; k-means clustering with this measure of dispersion is called spherical k-means (SKM). We also consider a variant in which the sum of squared perpendicular distances to a central ray is minimized.
  3. Unweighted SKM is liable to produce clusters of very rare species. This feature can be avoided if each point on the unit sphere is weighted by the length of the ray that produced it. The standard SKM algorithm is liable to converge to any one of a very large number of local optima. To avoid this problem, we have developed a computationally intensive algorithm that uses multiple randomizations to select high-quality seed species.
  4. The species clustering can be used to define simplified attributes for the samples. If the samples are then classified using the same technique, the resulting matrix of clustered species and clustered samples provides a biclustering of the data. The strength of the relationship between clusters can be measured by their mutual information, which is effectively the entropy of the biclustering.
  5. The technique was tested on five ecological and biogeographical datasets ranging in size from 30 species in 20 samples to 1405 species in 3857 samples. Several variants of SKM were compared, together with results from the established program Twinspan. When judged by entropy, SKM always performed adequately and produced the best clustering in all datasets but the smallest.
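
As an illustration of points 2 and 3, the following is a minimal sketch of weighted spherical k-means in plain Python; it is not the authors' program. Each species row is projected onto the unit sphere, weighted by the length of its original ray, and assigned to the centre with the highest cosine similarity, which is equivalent to the smallest squared chord distance because chord² = 2 − 2·cos. The function name weighted_skm, its arguments and the toy matrix are hypothetical, and the seeds are simply passed in by the caller; the randomization-based seed selection described in point 3 is not reproduced here. The sketch also assumes non-negative rows with at least one non-zero entry.

```python
import math

def normalize(v):
    """Return the unit vector along v and the length of v."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v], n

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_skm(rows, seeds, iters=20):
    """Weighted spherical k-means (illustrative sketch, hypothetical API).

    rows  : non-zero, non-negative occurrence vectors (one per species)
    seeds : indices of the rows used as initial cluster centres
    """
    units, weights = zip(*(normalize(r) for r in rows))
    centres = [list(units[s]) for s in seeds]
    labels = None
    for _ in range(iters):
        # Assign each point to the centre with the largest cosine similarity.
        new_labels = [
            max(range(len(centres)), key=lambda k: dot(u, centres[k]))
            for u in units
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        # New centre: weighted sum of member unit vectors, projected back
        # onto the unit sphere.
        for k in range(len(centres)):
            members = [i for i, lab in enumerate(labels) if lab == k]
            if not members:
                continue  # keep the old centre for an empty cluster
            total = [
                sum(weights[i] * units[i][j] for i in members)
                for j in range(len(units[0]))
            ]
            centres[k], _ = normalize(total)
    # Weighted sum of squared chord distances to the assigned centres.
    dispersion = sum(
        w * (2 - 2 * dot(u, centres[lab]))
        for u, w, lab in zip(units, weights, labels)
    )
    return labels, centres, dispersion

# Toy data: five species (rows) in four samples (columns).
rows = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 1, 1], [0, 0, 3, 2], [1, 0, 0, 0]]
labels, centres, dispersion = weighted_skm(rows, seeds=[0, 2])
```

Because each unit vector is multiplied by its original length, the weighted sum for a cluster equals the sum of the raw rows, so abundant species pull the centre more strongly than rare ones.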
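
To make point 4 concrete, the sketch below computes the mutual information between species clusters and sample clusters for a given biclustering. It assumes the standard definition, I = Σ p(a,b) · log[p(a,b) / (p(a) · p(b))], applied to the table of occurrence totals collapsed by cluster and measured in nats. The summary does not state the exact formula used, so this choice, the function name bicluster_mutual_information and the toy matrix are all assumptions made for illustration.

```python
import math

def bicluster_mutual_information(matrix, row_labels, col_labels):
    """Mutual information (in nats) between row clusters and column clusters.

    matrix     : species-by-sample table of non-negative occurrences
    row_labels : cluster index of each species (row)
    col_labels : cluster index of each sample (column)
    """
    n_r = max(row_labels) + 1
    n_c = max(col_labels) + 1
    # Collapse the data matrix to a cluster-by-cluster table of totals.
    table = [[0.0] * n_c for _ in range(n_r)]
    for i, row in enumerate(matrix):
        for j, x in enumerate(row):
            table[row_labels[i]][col_labels[j]] += x
    total = sum(map(sum, table))
    row_p = [sum(r) / total for r in table]
    col_p = [sum(table[a][b] for a in range(n_r)) / total for b in range(n_c)]
    mi = 0.0
    for a in range(n_r):
        for b in range(n_c):
            p = table[a][b] / total
            if p > 0:  # empty cells contribute nothing
                mi += p * math.log(p / (row_p[a] * col_p[b]))
    return mi

# Toy data: two species groups that occupy disjoint sets of samples.
blocks = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
mi_blocks = bicluster_mutual_information(blocks, [0, 0, 1, 1], [0, 0, 1, 1])

# The same labels applied to a uniform matrix carry no information.
uniform = [[1, 1, 1, 1] for _ in range(4)]
mi_uniform = bicluster_mutual_information(uniform, [0, 0, 1, 1], [0, 0, 1, 1])
```

Under this definition, a biclustering whose species clusters align perfectly with its sample clusters scores highest, while one whose clusters are unrelated scores zero.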