The snake for visualizing and for counting clusters in multivariate data

Authors

  • Adam Petrie

    Corresponding author
    1. Department of Statistics, Operations, and Management Science, University of Tennessee, Knoxville, TN 37996, USA
  • Thomas R. Willemain

    1. Department of Decision Sciences and Engineering Systems, Rensselaer Polytechnic Institute, Troy, NY 12180, USA

Abstract

We introduce the ‘snake’, a new tool for the visualization and exploration of a multivariate dataset. The snake connects all of the data points along a single short path. Using techniques from the Traveling Salesman Problem (TSP), it is possible to find such a path in polynomial (nearly quadratic) computational time. A plot of the individual segment lengths versus their position along the path transforms the original multidimensional dataset into a one-dimensional ‘time series’ of interpoint distances. The snake traces the local structure of a datacloud, so this visualization is most useful for detecting density fluctuations: regions of high density appear as many consecutive short segments, while regions of low density appear as many consecutive long segments. Dips in the time series reveal the presence of clustering and can be used to count the number of modes in the datacloud. We illustrate the technique on a variety of artificial and real-world datasets.

Copyright © 2010 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 236-252, 2010
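The construction described in the abstract can be illustrated with a minimal sketch. The abstract does not say which TSP technique the authors use, so the code below substitutes a greedy nearest-neighbour heuristic, which runs in quadratic time, purely as an assumption. The function names (`snake_path`, `segment_lengths`, `make_blobs`), the synthetic three-cluster dataset, and the jump threshold are all hypothetical and chosen for illustration; counting long jumps is a crude stand-in for the dip-based mode counting the abstract mentions, not the authors' procedure.

```python
import math
import random


def snake_path(points):
    """Greedy nearest-neighbour path that visits every point once.

    Assumption: a simple stand-in for the TSP techniques named in the
    abstract. Runs in O(n^2) time for n points.
    """
    unvisited = set(range(1, len(points)))
    path = [0]  # start arbitrarily at the first point
    while unvisited:
        last = points[path[-1]]
        nearest = min(unvisited, key=lambda j: math.dist(last, points[j]))
        unvisited.remove(nearest)
        path.append(nearest)
    return path


def segment_lengths(points, path):
    """Lengths of consecutive segments: the one-dimensional 'time series'."""
    return [math.dist(points[a], points[b]) for a, b in zip(path, path[1:])]


def make_blobs(centers, n_per_cluster, spread, seed=0):
    """Hypothetical test data: Gaussian blobs around the given centers."""
    rng = random.Random(seed)
    return [
        tuple(c + rng.gauss(0.0, spread) for c in center)
        for center in centers
        for _ in range(n_per_cluster)
    ]


# Three well-separated clusters in three dimensions.
points = make_blobs([(0, 0, 0), (10, 10, 10), (20, 0, 10)], 30, 0.5)
path = snake_path(points)
lengths = segment_lengths(points, path)

# Illustrative threshold (an assumption): segments longer than 5 are
# treated as jumps between clusters; runs of short segments are dense regions.
jumps = [i for i, d in enumerate(lengths) if d > 5.0]
print("positions of long segments:", jumps)
print("estimated number of clusters:", len(jumps) + 1)
```

Plotting `lengths` against position along the path gives the series of interpoint distances: each dense cluster shows up as a run of short segments, separated by a few long segments where the path crosses a sparse region.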
