Benchmarking Variable Selection in QSAR



This article is corrected by:

  1. Errata: Correction: Benchmarking Variable Selection in QSAR Volume 31, Issue 10, 752, Article first published online: 12 October 2012


Variable selection is important in QSAR modeling since it can improve model performance and transparency, as well as reduce the computational cost of model fitting and predictions. Which variable selection methods that perform well in QSAR settings is largely unknown. To address this question we, in a total of 1728 benchmarking experiments, rigorously investigated how eight variable selection methods affect the predictive performance and transparency of random forest models fitted to seven QSAR datasets covering different endpoints, descriptors sets, types of response variables, and number of chemical compounds. The results show that univariate variable selection methods are suboptimal and that the number of variables in the benchmarked datasets can be reduced with about 60 % without significant loss in model performance when using multivariate adaptive regression splines MARS and forward selection.