This article introduces an agglomerative Bayesian model-based clustering algorithm that outputs a nested sequence of two-way cluster configurations for an input data matrix. Each two-way cluster configuration in the output hierarchy is specified by a row configuration and a column configuration whose Cartesian product partitions the data matrix. Variable selection is incorporated into the algorithm by using a mixture model to identify row clusters that form distinct groups defined by the column clusters. A primitive similarity measure between two clusters is the multiplicative change in model posterior probability implied by their merger, and the hierarchy is formed by iteratively merging the cluster pair that maximizes some fixed monotonic function of this quantity. A naive implementation of the algorithm takes this function to be the identity. However, when this naive algorithm is applied to gene expression data, where the number of genes being studied typically far exceeds the number of experimental samples available, the imbalanced dimensionality of the data results in an algorithmic bias toward merging samples. To counteract this bias, alternative functions of the similarity measure are considered that prevent degenerate behavior of the algorithm. The resulting improvements in the output cluster configurations are demonstrated on simulated data, and the method is then applied to real gene expression data. © 2012 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2012
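The greedy merging scheme described above can be illustrated with a minimal sketch. It is not the article's implementation: it assumes a simplified conjugate Gaussian model for each block (known variance `sigma2`, a zero-mean normal prior with variance `tau2` on the block mean), a uniform prior over configurations so that the change in posterior probability reduces to a ratio of marginal likelihoods, and it omits the variable-selection mixture entirely. All function and parameter names (`log_marginal`, `log_merge_ratio`, `two_way_agglomerate`, `g`) are hypothetical. The function `g` is applied to the log of the merge ratio; with `g` as the identity this selects the same pair as the naive algorithm, because the logarithm is monotonic. The alternative functions considered in the article are not reproduced here.

```python
import itertools
import math
import numpy as np

def log_marginal(values, sigma2=1.0, tau2=10.0):
    """Log marginal likelihood under x_i ~ N(mu, sigma2), mu ~ N(0, tau2).

    This simple block model is an assumption made for illustration only.
    """
    x = np.asarray(values, dtype=float).ravel()
    n = x.size
    s, ss = x.sum(), (x ** 2).sum()
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            + 0.5 * math.log(sigma2 / (sigma2 + n * tau2))
            - (ss - tau2 * s ** 2 / (sigma2 + n * tau2)) / (2 * sigma2))

def block(X, idx, other, axis):
    """Entries of X for cluster `idx` on `axis` crossed with cluster `other`."""
    return X[np.ix_(idx, other)] if axis == 0 else X[np.ix_(other, idx)]

def log_merge_ratio(X, a, b, other_clusters, axis):
    """Log multiplicative change in evidence from merging clusters a and b.

    a and b are index lists along `axis` (0 = rows, 1 = columns);
    other_clusters is the current partition of the opposite axis.
    """
    total = 0.0
    for c in other_clusters:
        total += (log_marginal(block(X, a + b, c, axis))
                  - log_marginal(block(X, a, c, axis))
                  - log_marginal(block(X, b, c, axis)))
    return total

def two_way_agglomerate(X, g=lambda log_ratio: log_ratio):
    """Return the nested sequence of (row partition, column partition) pairs.

    g is a fixed monotonic function of the log merge ratio; the identity
    corresponds to the naive choice.
    """
    rows = [[i] for i in range(X.shape[0])]
    cols = [[j] for j in range(X.shape[1])]
    hierarchy = [([list(r) for r in rows], [list(c) for c in cols])]
    while len(rows) > 1 or len(cols) > 1:
        best = None
        # Consider every row pair and every column pair as a merge candidate.
        for axis, clusters, others in ((0, rows, cols), (1, cols, rows)):
            for i, j in itertools.combinations(range(len(clusters)), 2):
                score = g(log_merge_ratio(X, clusters[i], clusters[j], others, axis))
                if best is None or score > best[0]:
                    best = (score, axis, i, j)
        _, axis, i, j = best
        clusters = rows if axis == 0 else cols
        clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
        del clusters[j]
        hierarchy.append(([list(r) for r in rows], [list(c) for c in cols]))
    return hierarchy
```

For a matrix with `n` rows and `p` columns, the returned hierarchy contains `n + p - 1` configurations, starting from singleton clusters on both axes and ending with a single row cluster and a single column cluster. Because every row pair and every column pair competes on the same score, this sketch also makes it easy to see where a dimensional imbalance between rows and columns could influence which axis is merged first.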