14. Cluster Analysis

  1. Alvin C. Rencher

Published Online: 27 MAR 2003

DOI: 10.1002/0471271357.ch14

Methods of Multivariate Analysis, Second Edition

How to Cite

Rencher, A. C. (2002) Cluster Analysis, in Methods of Multivariate Analysis, Second Edition, John Wiley & Sons, Inc., New York, NY, USA. doi: 10.1002/0471271357.ch14

Author Information

  1. Brigham Young University, USA

Publication History

  1. Published Online: 27 MAR 2003
  2. Published Print: 22 FEB 2002

ISBN Information

Print ISBN: 9780471418894

Online ISBN: 9780471271352

Keywords:

  • clusters;
  • classification;
  • similarity;
  • dissimilarity;
  • pattern recognition;
  • numerical taxonomy;
  • hierarchical clustering;
  • partitioning;
  • Minkowski metric;
  • profile;
  • agglomerative method;
  • divisive method;
  • tree diagram;
  • dendrogram;
  • distance matrix;
  • inversion;
  • crossover;
  • monotonic;
  • ultrametric;
  • space-conserving;
  • space-contracting;
  • space-dilating;
  • chaining;
  • outliers;
  • monothetic;
  • polythetic;
  • splinter group;
  • k-means;
  • seeds;
  • mixtures of distributions;
  • density estimation;
  • modes;
  • dense point;
  • cross-validation;
  • clustering variables

Summary

In cluster analysis we search for patterns in a data set by grouping the observation vectors into clusters. The goal is to find an optimal grouping for which the observations within each cluster are similar, but the clusters are dissimilar to each other. We seek to find the natural groupings in the data.

To group the observations into clusters, many techniques begin with similarities between all pairs of observations. In many cases the similarities are based on some measure of distance. Since distance increases as two sampling units move farther apart, distance is actually a measure of dissimilarity. It is also possible to cluster the variables, in which case the similarity could be a correlation.
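
As a small illustration (not taken from the chapter), the following sketch computes a pairwise distance matrix under the Minkowski metric for a made-up data matrix X; setting p = 2 gives ordinary Euclidean distance, and p = 1 gives the city-block distance. The use of NumPy and SciPy is an assumption made purely for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Six made-up observations on two variables (two visually separated groups).
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Minkowski metric with p = 2 (Euclidean); p = 1 gives city-block distance.
D = squareform(pdist(X, metric="minkowski", p=2))
print(D)  # 6 x 6 symmetric dissimilarity matrix with zeros on the diagonal
```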

Two common approaches to clustering the observation vectors are hierarchical clustering and partitioning. In the agglomerative approach to hierarchical clustering, we start with n clusters, one for each observation, and end with a single cluster containing all n observations. At each step, an observation or a cluster of observations is absorbed into another cluster. In the divisive approach to hierarchical clustering we start with a single cluster containing all n observations and end with n clusters of a single item each. In partitioning, we simply divide the observations into g clusters. This can be done by starting with an initial partitioning or with cluster centers and then reallocating the observations according to some optimality criterion. Other clustering methods we discuss are based on fitting mixtures of multivariate normal distributions or searching for regions of high density, sometimes called modes.

At each step of an agglomerative hierarchical method, the two closest clusters are merged into a single new cluster. Different approaches to measuring distance between clusters give rise to different hierarchical methods. The following methods are discussed: single linkage (nearest neighbor), complete linkage (farthest neighbor), average linkage, the centroid method, the median method, Ward's method (incremental sum of squares), and the flexible beta method. Various properties of hierarchical methods are discussed, and the methods are compared with regard to these properties.
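
As a hedged sketch (the chapter is not tied to any particular software), most of these linkage rules are available in SciPy's hierarchy module; the made-up matrix X from the earlier sketch is repeated so the snippet runs on its own, and the flexible beta method is not included here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Each call returns an (n - 1) x 4 merge history: the two clusters joined,
# the distance between them, and the size of the new cluster.
for method in ["single", "complete", "average", "centroid", "median", "ward"]:
    Z = linkage(X, method=method)
    print(method, Z[-1, 2])  # distance at which the final two clusters merge

# scipy.cluster.hierarchy.dendrogram(Z) draws the corresponding tree diagram.
```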

Divisive hierarchical methods are generally of two classes: monothetic and polythetic. In a monothetic approach, the division of a group into two subgroups is based on a single variable, whereas the polythetic approach uses all p variables to make the split.

Two approaches to partitioning are the k-means technique and methods based on MANOVA sums of squares and products matrices for clusters. In the k-means method, we first select g items to serve as seeds. There are various ways to choose the seeds, and for each of these the number of clusters, g, must be specified in advance. Alternatively, a minimum distance between seeds may be specified, and all items that satisfy this criterion are then chosen as seeds.

After the seeds are chosen, each remaining point in the data set is assigned to the cluster with the nearest seed. As soon as a cluster has more than one member, the cluster seed is replaced by the centroid. After all items are assigned to clusters, each item is examined to see if it is closer to the centroid of another cluster than to the centroid of its own cluster. If so, the item is moved to the new cluster and the two cluster centroids are updated. This process is continued until no further improvement is possible.
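
A minimal sketch of this k-means procedure is given below, using scikit-learn as an assumed tool (the chapter describes the algorithm itself, not a particular implementation); the first g observations of the made-up matrix X serve as seeds purely for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

g = 2
seeds = X[:g]  # an arbitrary choice of g seed points for illustration
km = KMeans(n_clusters=g, init=seeds, n_init=1).fit(X)

print(km.labels_)           # cluster membership after reallocation converges
print(km.cluster_centers_)  # final cluster centroids
```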

Other clustering methods include mixtures of distributions and density estimation. In mixtures of distributions, we assume the existence of g distributions (usually multivariate normal), and we wish to assign each of the n items in the sample to the distribution it most likely belongs to.
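
A hedged sketch of the mixture-of-normals idea, using scikit-learn's GaussianMixture as an assumed tool: each observation is assigned to the component under which its estimated posterior probability is highest.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gm.predict(X))        # most likely component for each observation
print(gm.predict_proba(X))  # posterior membership probabilities
```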

In the method of density estimation, we seek regions of high density, sometimes called modes. No assumption is made about the form of the density. One approach is simply to separate regions with a high concentration of points from regions with a low concentration.
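
One concrete mode-seeking possibility, chosen here only as an illustration rather than drawn from the chapter, is the mean shift procedure, which moves each point toward a nearby region of high density and groups the points that converge to the same mode.

```python
import numpy as np
from sklearn.cluster import MeanShift

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

ms = MeanShift(bandwidth=2.0).fit(X)  # bandwidth controls the density window
print(ms.labels_)           # points grouped around the estimated modes
print(ms.cluster_centers_)  # the modes themselves (dense regions of points)
```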

To choose the number of clusters in hierarchical clustering, we can select g clusters from the dendrogram by cutting across the branches at a given level of the distance measure shown on one of the axes. To determine the value of g that provides the best fit to the data, one approach is to look for large changes in the distances at which clusters are formed.
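
A short sketch of this idea with SciPy (an assumed tool): the third column of the linkage matrix holds the distances at which clusters are merged, and cutting the tree just below an unusually large jump in those distances yields the corresponding number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

Z = linkage(X, method="ward")
print(Z[:, 2])  # merge distances in order; look for a large jump at the end

# Cut the tree below the large final jump; here that leaves g = 2 clusters.
labels = fcluster(Z, t=5.0, criterion="distance")
print(labels)
```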

To check the validity of a cluster solution, we could test the hypothesis that there are no clusters or groups in the population from which the sample at hand was taken. A cross-validation approach can also be used to check the validity or stability of a clustering result. The data are randomly divided into two subsets, say A and B, and the cluster analysis is carried out separately on each of A and B. The results should be similar if the clusters are valid.
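
A minimal sketch of this split-sample check, under assumptions not made in the chapter (made-up data, and k-means with g = 2 as the clustering method applied to each half): broadly similar centroids and groupings in A and B lend some support to the solution.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two made-up populations of 50 points each, well separated in the plane.
X = np.vstack([rng.normal([1, 2], 0.5, size=(50, 2)),
               rng.normal([8, 8], 0.5, size=(50, 2))])

idx = rng.permutation(len(X))
A, B = X[idx[:50]], X[idx[50:]]  # random split into subsets A and B

centers_A = KMeans(n_clusters=2, n_init=10).fit(A).cluster_centers_
centers_B = KMeans(n_clusters=2, n_init=10).fit(B).cluster_centers_
print(centers_A)
print(centers_B)  # similar centroids in A and B suggest a stable solution
```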

In some cases, it may be of interest to cluster the p variables rather than the n observations. As a similarity measure between each pair of variables, we would usually use the correlation. Clustering of variables can sometimes be done successfully with factor analysis, which groups the variables corresponding to each factor.
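
As a hedged sketch of clustering variables, the correlation matrix can be converted to a dissimilarity matrix, here 1 - |r| (one common choice, assumed rather than prescribed by the chapter), and an ordinary hierarchical method applied to it.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
Y = rng.normal(size=(100, 4))  # 100 made-up observations on p = 4 variables
Y[:, 1] += Y[:, 0]             # variables 0 and 1 made correlated
Y[:, 3] += Y[:, 2]             # variables 2 and 3 made correlated

R = np.corrcoef(Y, rowvar=False)  # p x p correlation matrix
D = 1.0 - np.abs(R)               # correlation-based dissimilarity
np.fill_diagonal(D, 0.0)

Z = linkage(squareform(D, checks=False), method="average")
print(fcluster(Z, t=2, criterion="maxclust"))  # two groups of variables
```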

Many examples with real data sets illustrate the techniques in this chapter, and the problems at the end of the chapter further develop and illustrate the methods.