A Segmentation/Clustering Model for the Analysis of Array CGH Data
Abstract
Microarray‐CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments, each representing a homogeneous region of the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not assign a biological status to the detected segments. We propose a new model for this segmentation/clustering problem that combines a segmentation model with a mixture model. To estimate the parameters of the model by maximum likelihood, we present a new hybrid algorithm, dynamic programming–expectation maximization (DP–EM), which combines dynamic programming (DP) with the expectation-maximization (EM) algorithm. We also propose a model selection heuristic to choose the number of clusters and the number of segments. We illustrate the procedure on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.
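To make the dynamic programming ingredient concrete, the following is a minimal sketch of optimal segmentation of a one-dimensional profile into a fixed number of segments. It is not the DP–EM algorithm itself: it assumes a plain least-squares cost (Gaussian noise with a common variance) and omits the mixture model and the EM updates entirely. The function name `segment_dp` and its parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def segment_dp(y, n_segments):
    """Return segment end indices (exclusive) that minimize the total
    within-segment sum of squared deviations from the segment mean.

    Illustrative assumption: least-squares cost, fixed number of segments.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(y)")

    # Prefix sums give the cost of any segment y[i:j] in O(1).
    s1 = np.concatenate(([0.0], np.cumsum(y)))
    s2 = np.concatenate(([0.0], np.cumsum(y ** 2)))

    def cost(i, j):
        # Sum of squared deviations of y[i:j] from its own mean.
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    # best[k, j] = minimal cost of splitting y[:j] into k + 1 segments.
    best = np.full((n_segments, n + 1), np.inf)
    prev = np.zeros((n_segments, n + 1), dtype=int)
    for j in range(1, n + 1):
        best[0, j] = cost(0, j)
    for k in range(1, n_segments):
        for j in range(k + 1, n + 1):
            for i in range(k, j):
                c = best[k - 1, i] + cost(i, j)
                if c < best[k, j]:
                    best[k, j] = c
                    prev[k, j] = i

    # Backtrack from the end of the signal to recover the segment ends.
    ends = [n]
    for k in range(n_segments - 1, 0, -1):
        ends.append(int(prev[k, ends[-1]]))
    return sorted(ends)
```

For a toy profile such as `[0, 0, 0, 0, 1, 1, 1, -1, -1, -1]` with three segments, the sketch places segment ends after positions 4, 7, and 10. In a segmentation/clustering setting, a step of this kind would presumably be alternated with mixture-parameter updates, with the least-squares cost replaced by a mixture likelihood; that coupling is not shown here.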