A parametric mixture model for clustering multivariate binary data



Traditional latent class analysis (LCA) uses a mixture model in which the binary responses on each subject are independent conditional on cluster membership. However, in many practical applications, the responses are correlated because they are observed on the same subject; this is known as local dependence. In this paper, we extend the LCA model to allow for local dependence within each cluster in order to improve clustering accuracy. The clustering problem is hard because of its unsupervised nature (the true cluster memberships and even the true number of clusters are unknown), the difficulty of estimating a correlation matrix for each cluster, and the paucity of information in binary data. Therefore, we follow a parametric approach in which we fit a mixture model whose components follow multivariate Bernoulli distributions (one for each cluster). An extension of a family of parametric models by Oman and Zucker [1] is adopted for this purpose, and the model is fitted by maximum likelihood. The Bayesian information criterion (BIC) due to Schwarz [2] is employed to select the number of clusters. Subjects are assigned to clusters using the maximum posterior rule. The proposed method is tested and compared with the LCA method via simulation and by applying both methods to two real data sets. Significant improvement is demonstrated relative to the LCA method. Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 3: 3-19, 2010
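As a rough illustration of the pipeline described in the abstract (fit a mixture by maximum likelihood, compare BIC across candidate numbers of clusters, classify by maximum posterior), the following is a minimal sketch of the baseline LCA model only, in which items are independent within a cluster. It is not the proposed locally dependent model: the Oman and Zucker family is not specified in this abstract and is not reproduced here. The function names `e_step` and `fit_lca`, the random initialisation, the fixed iteration count, and the BIC sign convention (lower is better) are all assumptions made for the example.

```python
import math
import random


def e_step(data, weights, probs):
    """Return posterior membership probabilities and the log-likelihood."""
    resp, loglik = [], 0.0
    for x in data:
        # log of weight_k * prod_j p_kj^x_j * (1 - p_kj)^(1 - x_j)
        logp = [
            math.log(w) + sum(math.log(p if xj else 1.0 - p) for xj, p in zip(x, pk))
            for w, pk in zip(weights, probs)
        ]
        top = max(logp)  # log-sum-exp for numerical stability
        total = top + math.log(sum(math.exp(v - top) for v in logp))
        loglik += total
        resp.append([math.exp(v - total) for v in logp])
    return resp, loglik


def fit_lca(data, n_clusters, n_iter=100, seed=0, eps=1e-6):
    """EM for a mixture of independent Bernoulli items (baseline LCA).

    Assumption: a fixed number of EM iterations from one random start;
    a real analysis would use a convergence check and several restarts.
    """
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    weights = [1.0 / n_clusters] * n_clusters
    probs = [[rng.uniform(0.25, 0.75) for _ in range(d)] for _ in range(n_clusters)]
    for _ in range(n_iter):
        resp, _unused = e_step(data, weights, probs)
        # M-step: weighted proportions, clipped away from 0 and 1
        for k in range(n_clusters):
            nk = max(sum(r[k] for r in resp), eps)
            weights[k] = nk / n
            for j in range(d):
                pj = sum(r[k] * x[j] for r, x in zip(resp, data)) / nk
                probs[k][j] = min(max(pj, eps), 1.0 - eps)
        total = sum(weights)
        weights = [w / total for w in weights]
    resp, loglik = e_step(data, weights, probs)
    # Free parameters: (K - 1) mixing weights plus K * d item probabilities
    n_params = (n_clusters - 1) + n_clusters * d
    bic = -2.0 * loglik + n_params * math.log(n)  # lower is better here
    # Maximum posterior rule: assign each subject to its most probable cluster
    labels = [max(range(n_clusters), key=lambda k: r[k]) for r in resp]
    return {"weights": weights, "probs": probs, "bic": bic, "labels": labels}
```

To choose the number of clusters under this sketch, one would call `fit_lca(data, k)` for each candidate `k` and keep the fit with the smallest `bic`; `data` is assumed to be a list of equal-length lists of 0/1 responses, one per subject.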
