Feature importance sampling-based adaptive random forest as a useful tool to screen underlying lead compounds

Authors

  • Dong-Sheng Cao,

    1. Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, P. R. China
  • Yi-Zeng Liang,

    Corresponding author
    1. Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, P. R. China
  • Qing-Song Xu,

    Corresponding author
    1. School of Mathematical Sciences and Computing Technology, Central South University, Changsha 410083, P. R. China
  • Liang-Xiao Zhang,

    1. Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, P. R. China
  • Qian-Nan Hu,

    1. Systems Drug Design Laboratory, College of Pharmacy, Wuhan University, Wuhan 430071, China
  • Hong-Dong Li

    1. Research Center of Modernization of Traditional Chinese Medicines, Central South University, Changsha 410083, P. R. China

Abstract

Ensemble approaches generally perform well when their base classifiers are both diverse and accurate. In the present study, a feature importance sampling-based adaptive random forest (fisaRF) is proposed to achieve better classification performance than the original one-step random forest (RF). Instead of simple random sampling, fisaRF uses a convenient yet effective strategy called feature importance sampling (FIS) to select a more suitable feature subset at each splitting node, thereby strengthening the accuracy of individual trees without sacrificing the diversity between them. In addition, the iterative use of the feature importance obtained in the previous step adaptively captures the most significant features in the data and deals effectively with multiple classification problems, which are not easily solved by other feature importance indexes. The proposed fisaRF was applied to three structure–activity relationship (SAR) data sets proposed by Xue et al. [1] and to disinfection by-products (DBPs) data, and was compared with the original one-step RF induced by simple random sampling. The comparison showed that fisaRF effectively improves the classification accuracy and the prediction confidence for each sample, making it a useful tool for screening the underlying lead compounds. Copyright © 2011 John Wiley & Sons, Ltd.
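
The feature importance sampling step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact weighting scheme, so the sketch assumes that the sampling probability of a feature is simply proportional to its importance from the previous step. The function names, the `floor` smoothing constant, and the example importance values are all hypothetical and not taken from the paper.

```python
import numpy as np


def importance_sampling_probs(importances, floor=1e-6):
    """Turn raw importance scores into sampling probabilities.

    Assumption: probability is proportional to importance. `floor` is a
    hypothetical smoothing constant so that features with zero (or negative)
    importance keep a small chance of being drawn.
    """
    w = np.clip(np.asarray(importances, dtype=float), 0.0, None) + floor
    return w / w.sum()


def sample_feature_subset(importances, mtry, rng):
    """Draw `mtry` distinct candidate features for one splitting node.

    Simple random sampling, as in the original RF, would correspond to
    passing equal importances for every feature.
    """
    p = importance_sampling_probs(importances)
    return rng.choice(len(p), size=mtry, replace=False, p=p)


# Illustrative importances, e.g. taken from a previously grown forest.
rng = np.random.default_rng(0)
prev_importance = np.array([0.50, 0.30, 0.15, 0.05, 0.0, 0.0])
subset = sample_feature_subset(prev_importance, mtry=3, rng=rng)
print("candidate features at this node:", sorted(subset.tolist()))
```

In an adaptive scheme of the kind the abstract outlines, the importances produced by one forest would be fed back as `importances` when growing the next one. Sampling without replacement keeps the candidate features at a node distinct, and the random draw still varies from node to node, which is how diversity between trees could be preserved.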