• random forest;
• feature importance sampling;
• ADMET;
• machine learning;
• virtual screening


Ensemble methods generally perform well when their base classifiers are both diverse and accurate. In the present study, a feature importance sampling-based adaptive random forest (fisaRF) is proposed to achieve better classification performance than the original one-step random forest (RF). Instead of simple random sampling, fisaRF uses a convenient yet effective scheme, feature importance sampling (FIS), to select a more suitable feature subset at each splitting node, which strengthens the accuracy of individual trees without sacrificing the diversity between them. In addition, iteratively reusing the feature importance obtained in the previous step adaptively captures the most significant features in the data and effectively handles multiple classification problems that are not easily solved with other feature importance indexes. fisaRF was applied to three structure–activity relationship (SAR) data sets proposed by Xue et al. [1] and to disinfection by-products (DBPs) data, and was compared with the original one-step RF built with simple random sampling. The comparison showed that fisaRF improves both the classification accuracy and the prediction confidence for each sample, and it is therefore considered a useful tool for screening potential lead compounds. Copyright © 2011 John Wiley & Sons, Ltd.
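
The core idea of feature importance sampling can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sample_features`, the `floor` parameter, and the importance values are hypothetical and chosen only for illustration. The sketch assumes that candidate features at a splitting node are drawn without replacement with probability roughly proportional to importance scores taken from a previous step, and that a small floor keeps every feature selectable so that trees remain diverse.

```python
import random


def sample_features(importances, k, rng, floor=1e-3):
    """Draw k distinct feature indices, favouring features with high importance.

    `floor` is an assumed smoothing term, not something the abstract specifies.
    """
    if not 0 < k <= len(importances):
        raise ValueError("k must be between 1 and the number of features")
    # Clip negative scores and add the floor so every feature keeps a nonzero weight.
    weights = [max(w, 0.0) + floor for w in importances]
    candidates = list(range(len(weights)))
    chosen = []
    for _ in range(k):
        # Draw one remaining feature in proportion to its weight, then remove it.
        pick = rng.choices(candidates, weights=[weights[i] for i in candidates], k=1)[0]
        chosen.append(pick)
        candidates.remove(pick)
    return chosen


rng = random.Random(0)

# First step: equal importances reduce to simple random sampling, as in a plain RF.
uniform = [1.0] * 6
first_subset = sample_features(uniform, k=3, rng=rng)

# Later step: importances estimated from the previous forest (made-up values here).
previous_step = [0.50, 0.30, 0.10, 0.05, 0.03, 0.02]
subset = sample_features(previous_step, k=3, rng=rng)
```

In an adaptive scheme of this kind, the importances measured on the forest grown with `subset`-style sampling would be fed back as `previous_step` for the next iteration. How many iterations to run and how importance is measured are left open here, since the abstract does not specify them.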