The Relevant-Set Correlation Model for Data Clustering



This paper introduces a model for clustering, the Relevant-Set Correlation (RSC) model, that requires no direct knowledge of the nature or representation of the data. Instead, the RSC model relies solely on the existence of an oracle that accepts a query in the form of a reference to a data item, and returns a ranked set of references to items that are most relevant to the query. The quality of cluster candidates, the degree of association between pairs of cluster candidates, and the degree of association between clusters and data items are all assessed according to the statistical significance of a form of correlation among pairs of relevant sets and/or candidate cluster sets. The RSC significance measures can be used to evaluate the relative importance of cluster candidates of various sizes, avoiding the problems of bias found with other shared-neighbor methods that use fixed neighborhood sizes. A scalable clustering heuristic based on the RSC model is also presented and demonstrated for large, high-dimensional datasets using a fast approximate similarity search structure as the oracle. © 2008 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 1: 000-000, 2008
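As a rough illustration of the idea that clustering can proceed from relevant sets alone, the following minimal sketch computes a correlation between two relevant sets returned by an oracle. Everything in it is an assumption made for illustration rather than the definition used in the paper: the "form of correlation" is taken here to be the Pearson correlation between set-membership indicator vectors, and `toy_oracle`, `set_correlation`, and the one-dimensional `data` are hypothetical names and values. The significance assessment mentioned above is not shown.

```python
import math


def set_correlation(a, b, n):
    """Pearson correlation between the membership indicator vectors of
    sets a and b, drawn from a collection of n items.
    (Assumed form of correlation; not taken from the paper.)"""
    sa, sb = len(a), len(b)
    if sa in (0, n) or sb in (0, n):
        raise ValueError("correlation is undefined for empty or full sets")
    inter = len(a & b)
    return (n * inter - sa * sb) / math.sqrt(sa * sb * (n - sa) * (n - sb))


def toy_oracle(query, k, data):
    """Hypothetical oracle: given an item id, return the ids of the k most
    relevant items, ranked by absolute difference of 1-D values.
    The clustering side only ever sees the returned ids."""
    ranked = sorted(range(len(data)),
                    key=lambda i: (abs(data[i] - data[query]), i))
    return ranked[:k]


# Hypothetical data: two well-separated groups of four items each.
data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
n = len(data)

# Relevant set of each item, obtained only through oracle queries.
rel = {q: set(toy_oracle(q, 4, data)) for q in range(n)}

same = set_correlation(rel[0], rel[1], n)  # items from the same group
diff = set_correlation(rel[0], rel[5], n)  # items from different groups
print(same, diff)
```

In this toy setting, items from the same group have identical relevant sets and therefore maximal correlation, while items from different groups have disjoint, complementary relevant sets and minimal correlation. Because the measure accounts for the set sizes and the collection size, it can in principle compare sets of different sizes, which is the property the abstract contrasts with fixed-neighborhood shared-neighbor methods.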