Part of this work was done when the author was with HIIT.
From black and white to full color: extending redescription mining outside the Boolean world†
Article first published online: 10 APR 2012
Copyright © 2012 Wiley Periodicals, Inc.
Statistical Analysis and Data Mining: The ASA Data Science Journal
Volume 5, Issue 4, pages 284–303, August 2012
How to Cite
Galbrun, E. and Miettinen, P. (2012), From black and white to full color: extending redescription mining outside the Boolean world. Statistical Analy Data Mining, 5: 284–303. doi: 10.1002/sam.11145
A preliminary version of this paper appeared in SDM 2011.
- Issue published online: 19 JUL 2012
- Manuscript Accepted: 8 MAR 2012
- Manuscript Revised: 17 FEB 2012
- Manuscript Received: 23 JUN 2011
Keywords
- redescription mining
- bioclimatic niche finding
- numerical data
- missing data
- data mining
Redescription mining is a powerful data analysis tool used to find multiple descriptions of the same entities. Consider geographical regions as an example. They can be characterized by the fauna that inhabits them on the one hand and by their meteorological conditions on the other. Finding such redescriptions, a task known as niche-finding, is of great importance in biology. Current redescription mining methods can handle only Boolean data. This restricts the range of possible applications or makes discretization a prerequisite, entailing a possibly harmful loss of information. In niche-finding, while the fauna can be naturally represented using Boolean presence/absence data, the weather cannot. In this paper, we extend redescription mining to categorical and real-valued data with possibly missing values using a surprisingly simple and efficient approach. We provide an extensive experimental evaluation to study the behavior of the proposed algorithm. Furthermore, we show the statistical significance of our results using recent innovations in randomization methods.
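To make the niche-finding example concrete, the following minimal sketch pairs a Boolean fauna attribute with a real-valued climate attribute and checks how closely the two describe the same set of regions. It is not the algorithm proposed in the paper. The toy data, the function names, the use of Jaccard similarity as the agreement measure, and the choice to treat a missing value as not matching are all assumptions made here for illustration.

```python
# Toy data invented for illustration; none of it comes from the paper.
# Fauna side: Boolean presence/absence of one (hypothetical) species per region.
species_present = {"A": True, "B": True, "C": False, "D": True, "E": False}
# Climate side: real-valued mean temperature in Celsius; None marks a missing value.
mean_temp_c = {"A": -3.0, "B": 1.5, "C": 12.0, "D": 0.0, "E": None}


def support_boolean(values):
    """Return the regions where the Boolean attribute holds."""
    return {region for region, v in values.items() if v}


def support_interval(values, low, high):
    """Return the regions whose value lies in [low, high].

    Assumption of this sketch: a missing value never matches.
    """
    return {
        region
        for region, v in values.items()
        if v is not None and low <= v <= high
    }


def jaccard(left, right):
    """Jaccard similarity of two region sets (1.0 means identical sets)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


# Two descriptions of (ideally) the same regions, one per data view.
fauna_side = support_boolean(species_present)
climate_side = support_interval(mean_temp_c, -5.0, 2.0)
score = jaccard(fauna_side, climate_side)
```

Here the interval on the temperature is used directly, with no discretization into Boolean bins beforehand, which is the kind of query the abstract argues Boolean-only methods cannot express without losing information.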