AZ is a statistical consultant for Protagen AG, Dortmund. AZ is an Associate Editor of Biometrical Journal, Statistics in Medicine, and Methods of Information in Medicine. IRK has declared no conflicts of interest for this article.
Mining data with random forests: current options for real-world applications
Article first published online: 23 DEC 2013
© 2013 John Wiley & Sons, Ltd.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery
Volume 4, Issue 1, pages 55–63, January/February 2014
How to Cite
Ziegler, A. and König, I. R. (2014), Mining data with random forests: current options for real-world applications. WIREs Data Mining Knowl Discov, 4: 55–63. doi: 10.1002/widm.1114
- Issue published online: 23 DEC 2013
- Manuscript Accepted: 25 OCT 2013
- Manuscript Revised: 30 SEP 2013
- Manuscript Received: 30 JAN 2013
Random forests are fast, flexible, and represent a robust approach to mining high-dimensional data. They are an extension of classification and regression trees (CART) and perform well even when the number of features is large and the number of observations is small. In analogy to CART, random forests can handle continuous, categorical, and censored time-to-event outcomes. The tree-building process of random forests implicitly allows for interactions between features and for high correlation between features. Approaches are available for measuring variable importance and for reducing the number of features. Although random forests perform well in many applications, their theoretical properties are not fully understood. Recently, several articles have provided a better understanding of random forests, and we summarize these findings. We survey different versions of random forests, including random forests for classification, random forests for probability estimation, and random forests for survival data. We discuss the consequences of (1) no selection, (2) random selection, and (3) a combination of deterministic and random selection of features. We then review a backward elimination and a forward selection procedure, the determination of trees representing a forest, and the identification of important variables in a random forest. Finally, we provide a brief overview of different areas of application of random forests. WIREs Data Mining Knowl Discov 2014, 4:55–63. doi: 10.1002/widm.1114
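The ideas summarized in the abstract can be illustrated with a short sketch. The following example (not taken from the article; it uses scikit-learn's `RandomForestClassifier` as one common implementation, and all data and parameter values are illustrative) shows the ingredients mentioned above: trees grown on bootstrap samples, a random subset of features tried at each split, an out-of-bag error estimate, and impurity-based variable importance scores that can be used to reduce the number of features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# High-dimensional toy data: many features, only a few informative ones,
# mimicking the "large p, small n" setting discussed in the abstract.
X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees in the forest
    max_features="sqrt",   # size of the random feature subset tried per split
    oob_score=True,        # out-of-bag estimate of prediction accuracy
    random_state=0,
)
forest.fit(X, y)

# Impurity-based variable importances; they are normalized to sum to 1,
# so features can be ranked and the least important ones eliminated.
importances = forest.feature_importances_
oob_accuracy = forest.oob_score_
top_features = np.argsort(importances)[::-1][:5]  # five highest-ranked features
```

A backward elimination scheme of the kind reviewed in the article would repeatedly refit the forest after dropping the lowest-ranked features according to `importances`.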