Modelling Heterogeneity With and Without the Dirichlet Process
Abstract
We investigate the relationships between Dirichlet process (DP) based models and allocation models for a variable number of components, based on exchangeable distributions. It is shown that the DP partition distribution is a limiting case of a Dirichlet–multinomial allocation model. Comparisons of posterior performance of DP and allocation models are made in the Bayesian paradigm and illustrated in the context of univariate mixture models. It is shown in particular that the unbalancedness of the allocation distribution, present in the prior DP model, persists a posteriori. Exploiting the model connections, a new MCMC sampler for general DP-based models is introduced, which uses split/merge moves in a reversible jump framework. The performance of this new sampler is then compared with that of some traditional samplers for DP-based models.
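The limiting relationship stated in the abstract can be checked numerically. The sketch below is an illustration under assumptions, not the paper's own construction: it assumes the Dirichlet–multinomial allocation (DMA) model uses k components with symmetric Dirichlet(alpha/k, ..., alpha/k) weights, and it compares the resulting probability of a partition with the standard DP (Ewens) partition probability with total mass alpha. The function names and the parameter alpha are hypothetical choices made for this example.

```python
from math import lgamma, log, exp

def log_dp_partition_prob(sizes, alpha):
    """Log-probability of a partition with the given group sizes under a DP
    with total mass alpha (Ewens formula):
    alpha^d * Gamma(alpha) / Gamma(alpha + n) * prod_j (n_j - 1)!"""
    n, d = sum(sizes), len(sizes)
    return (d * log(alpha) + lgamma(alpha) - lgamma(alpha + n)
            + sum(lgamma(m) for m in sizes))

def log_dma_partition_prob(sizes, alpha, k):
    """Log-probability of the same partition under a DMA model with k
    components and symmetric Dirichlet(alpha/k) weights (an assumed
    parametrization). The factor k!/(k-d)! counts the ways of assigning
    distinct component labels to the d non-empty groups."""
    n, d = sum(sizes), len(sizes)
    if d > k:
        return float("-inf")  # more groups than available components
    delta = alpha / k
    log_labelings = lgamma(k + 1) - lgamma(k - d + 1)
    return (log_labelings + lgamma(alpha) - lgamma(alpha + n)
            + sum(lgamma(delta + m) - lgamma(delta) for m in sizes))

def gap(sizes, alpha, k):
    """Absolute difference between the DMA and DP log-probabilities."""
    return abs(log_dma_partition_prob(sizes, alpha, k)
               - log_dp_partition_prob(sizes, alpha))

if __name__ == "__main__":
    sizes, alpha = [3, 1, 1], 1.0
    for k in (5, 50, 500, 5000):
        print(k, gap(sizes, alpha, k))
```

Under these assumptions the gap shrinks as k grows, which is the sense in which the DP partition distribution arises as a limit of the allocation model.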