Original Article
Data mining, neural nets, trees — Problems 2 and 3 of Genetic Analysis Workshop 15
Article first published online: 28 NOV 2007
DOI: 10.1002/gepi.20280
© 2007 Wiley-Liss, Inc.
Issue

Genetic Epidemiology
Supplement: Genetic Analysis Workshop 15: Summaries of the Design and Analysis of Genomic Data
Volume 31, Issue S1, pages S51–S60, 2007
Additional Information
How to Cite
Ziegler, A., DeStefano, A. L. and König, I. R. (2007), Data mining, neural nets, trees — Problems 2 and 3 of Genetic Analysis Workshop 15. Genetic Epidemiology, 31: S51–S60. doi: 10.1002/gepi.20280
Publication History
- Issue published online: 28 NOV 2007
- Article first published online: 28 NOV 2007
Keywords:
- association analysis;
- classification tree;
- genome-wide analysis;
- GWA;
- logic regression;
- logistic regression;
- logistic trees;
- neural net;
- random forest;
- single nucleotide polymorphism;
- SNP
Abstract
Genome-wide association studies using thousands to hundreds of thousands of single nucleotide polymorphism (SNP) markers and region-wide association studies using a dense panel of SNPs are already in use to identify disease susceptibility genes and to predict disease risk in individuals. Because these tasks are becoming increasingly important, three different data sets were provided for the Genetic Analysis Workshop 15, allowing examination of various novel and existing data mining methods both for classification and for identification of disease susceptibility genes and of gene-by-gene or gene-by-environment interactions. The approach most often applied in this presentation group was random forests because of its simplicity, elegance, and robustness. It was used for prediction and, as a first step, for screening for interesting SNPs. The logistic tree with unbiased selection approach appeared to be an interesting alternative for efficiently selecting interesting SNPs. Machine learning, specifically ensemble methods, might be useful as pre-screening tools for large-scale association studies because, compared with standard statistical approaches, they can be less prone to overfitting, can require less computer processor time, can easily include pair-wise and higher-order interactions, and can also have a high capability for classification. However, improved implementations that are able to deal with hundreds of thousands of SNPs at a time are required. Genet. Epidemiol. 31(Suppl. 1):S51–S60, 2007. © 2007 Wiley-Liss, Inc.
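
To make the idea of random forests as a pre-screening step more concrete, the following is a minimal sketch in plain Python. It is not the implementation used by any workshop contributor. The function names (`simulate`, `screen_snps`, and the helpers), the simulated genotypes, the effect size, and the choice of summed Gini impurity decrease as the importance measure are all assumptions made for illustration only. SNP 0 is simulated as the single causal marker, and every other SNP is noise.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X, y, rows, features):
    """Return (impurity decrease, feature, threshold) of the best split, or None."""
    parent = gini([y[i] for i in rows])
    best = None
    for f in features:
        for t in (0, 1):  # genotypes are coded 0/1/2, so split at <=0 or <=1
            left = [y[i] for i in rows if X[i][f] <= t]
            right = [y[i] for i in rows if X[i][f] > t]
            if not left or not right:
                continue
            child = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            gain = parent - child
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best


def grow(X, y, rows, mtry, depth, importance, rng, n_total):
    """Grow one tree recursively, adding weighted impurity decreases to importance."""
    if depth == 0 or len(rows) < 10:
        return
    # Random feature subset at every node: the "random" in random forest.
    features = rng.sample(range(len(X[0])), mtry)
    split = best_split(X, y, rows, features)
    if split is None or split[0] <= 0:
        return
    gain, f, t = split
    importance[f] += gain * len(rows) / n_total
    left = [i for i in rows if X[i][f] <= t]
    right = [i for i in rows if X[i][f] > t]
    grow(X, y, left, mtry, depth - 1, importance, rng, n_total)
    grow(X, y, right, mtry, depth - 1, importance, rng, n_total)


def screen_snps(X, y, n_trees=50, mtry=None, max_depth=3, seed=0):
    """Rank SNPs by summed Gini decrease over a forest of bootstrapped trees."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = mtry or max(1, int(p ** 0.5))
    importance = [0.0] * p
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]  # bootstrap sample of individuals
        grow(X, y, boot, mtry, max_depth, importance, rng, n)
    ranking = sorted(range(p), key=lambda j: importance[j], reverse=True)
    return ranking, importance


def simulate(n=300, p=20, seed=1):
    """Hypothetical data: SNP 0 raises disease risk, all other SNPs are noise."""
    rng = random.Random(seed)
    X = [[rng.choice((0, 1, 2)) for _ in range(p)] for _ in range(n)]
    y = [1 if rng.random() < (0.85 if row[0] >= 1 else 0.15) else 0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    ranking, importance = screen_snps(X, y)
    # The top-ranked SNPs would be carried forward to a second-stage analysis.
    for j in ranking[:5]:
        print(f"SNP {j}: importance {importance[j]:.3f}")
```

In a two-stage design of this kind, only the highest-ranked SNPs would be passed on to a more detailed model. The sketch handles 20 markers, so it says nothing about the scaling problem the abstract raises for hundreds of thousands of SNPs.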
