Bayesian non-parametric hidden Markov models with applications in genomics
Article first published online: 12 OCT 2010
DOI: 10.1111/j.1467-9868.2010.00756.x
© 2010 Royal Statistical Society
Issue

Journal of the Royal Statistical Society: Series B (Statistical Methodology)
Volume 73, Issue 1, pages 37–57, January 2011
Additional Information
How to Cite
Yau, C., Papaspiliopoulos, O., Roberts, G. O. and Holmes, C. (2011), Bayesian non-parametric hidden Markov models with applications in genomics. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73: 37–57. doi: 10.1111/j.1467-9868.2010.00756.x
Publication History
- Issue published online: 5 JAN 2011
- Article first published online: 12 OCT 2010
- [Received October 2008. Final revision May 2010]
Keywords:
- Block Gibbs sampler;
- Copy number variation;
- Local and global clustering;
- Partial exchangeability;
- Partition models;
- Retrospective sampling
Summary. We propose a flexible non-parametric specification of the emission distribution in hidden Markov models and we introduce a novel methodology for carrying out the computations. Whereas current approaches use a finite mixture model, we argue in favour of an infinite mixture model given by a mixture of Dirichlet processes. The computational framework is based on auxiliary variable representations of the Dirichlet process and consists of a forward–backward Gibbs sampling algorithm of similar complexity to that used in the analysis of parametric hidden Markov models. The algorithm involves analytic marginalizations of latent variables to improve the mixing, facilitated by exchangeability properties of the Dirichlet process that we uncover in the paper. A by-product of this work is an efficient Gibbs sampler for learning Dirichlet process hierarchical models. We test the proposed Monte Carlo algorithm against a wide variety of alternatives and find significant advantages. We also investigate by simulations the sensitivity of the proposed model to prior specification and data-generating mechanisms. We apply our methodology to the analysis of genomic copy number variation. Analysing various real data sets, we find significantly more accurate inference compared with state-of-the-art hidden Markov models which use finite mixture emission distributions.
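
For orientation, the forward–backward sampling step mentioned in the summary can be illustrated with the standard forward-filtering backward-sampling recursion for a finite-state hidden Markov model. This is a minimal sketch under stated assumptions, not the authors' algorithm: it takes the emission log-likelihoods as given and does not include the Dirichlet process mixture, the auxiliary variables or the analytic marginalizations that the paper describes. The function name `ffbs` and its arguments are hypothetical and chosen only for this illustration.

```python
import numpy as np


def ffbs(log_lik, trans, init, rng):
    """Draw one hidden state path from p(states | data) for a finite HMM.

    Illustrative only (hypothetical helper, not taken from the paper).
    log_lik: (T, K) array of log p(y_t | state k)
    trans:   (K, K) row-stochastic transition matrix
    init:    (K,) initial state distribution
    rng:     numpy.random.Generator
    """
    T, K = log_lik.shape
    alpha = np.empty((T, K))

    # Forward pass: filtered probabilities p(s_t | y_1..t), normalised at
    # each step. Subtracting the row maximum avoids underflow in exp().
    a = init * np.exp(log_lik[0] - log_lik[0].max())
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ trans) * np.exp(log_lik[t] - log_lik[t].max())
        alpha[t] = a / a.sum()

    # Backward pass: sample the last state from the final filtered
    # distribution, then each earlier state given the one after it.
    states = np.empty(T, dtype=int)
    states[-1] = rng.choice(K, p=alpha[-1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * trans[:, states[t + 1]]
        states[t] = rng.choice(K, p=w / w.sum())
    return states
```

In a Gibbs sampler of the kind the summary outlines, a draw like this would alternate with updates of the emission parameters given the sampled path; the cost of one pass is of order T K^2 for T observations and K states.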
