The Effect of Retrospective Sampling on Estimates of Prediction Error for Multifactor Dimensionality Reduction


Corresponding author: Alison A. Motsinger-Reif, Ph.D., Bioinformatics Research Center, Department of Statistics, 1 Lampe Dr., North Carolina State University, Raleigh NC, 27695-7566, USA. Tel: 919-515-3574; Fax: 919-515-7315; E-mail:


The standard in genetic association studies of complex diseases is replication and validation of positive results, with an emphasis on assessing the predictive value of associations. In response to this need, a number of analytical approaches have been developed to identify predictive models that account for complex genetic etiologies. Multifactor Dimensionality Reduction (MDR) is a commonly used, highly successful method designed to evaluate potential gene-gene interactions. MDR relies on classification error in a cross-validation framework to rank and evaluate potentially predictive models. Previous work has demonstrated the high power of MDR, but has not considered the accuracy and variance of the MDR prediction error estimate. Currently, we evaluate the bias and variance of the MDR error estimate as both a retrospective and prospective estimator and show that MDR can both underestimate and overestimate error. We argue that a prospective error estimate is necessary if MDR models are used for prediction, and propose a bootstrap resampling estimate, integrating population prevalence, to accurately estimate prospective error. We demonstrate that this bootstrap estimate is preferable for prediction to the error estimate currently produced by MDR. While demonstrated with MDR, the proposed estimation is applicable to all data-mining methods that use similar estimates.