Roughly balanced bagging for imbalanced data

Authors

  • Shohei Hido,

    Corresponding author
    1. IBM Research-Tokyo Research Laboratory, 1623-14 Shimo-Tsuruma, Yamato-shi, Kanagawa 242-8502, Japan
    2. Department of Systems Science, Kyoto University, Yoshida-Honmachi, Kyoto 606-8501, Japan
    • IBM Research-Tokyo Research Laboratory, 1623-14 Shimo-Tsuruma, Yamato-shi, Kanagawa 242-8502, Japan
  • Hisashi Kashima,

    1. IBM Research-Tokyo Research Laboratory, 1623-14 Shimo-Tsuruma, Yamato-shi, Kanagawa 242-8502, Japan
    Current affiliation:
    1. Department of Mathematical Informatics, University of Tokyo, Tokyo 113-8656, Japan
  • Yutaka Takahashi

    1. Department of Systems Science, Kyoto University, Yoshida-Honmachi, Kyoto 606-8501, Japan

Abstract

The class imbalance problem appears in many real-world applications of classification learning. We propose an ensemble algorithm, “Roughly Balanced (RB) Bagging”, that uses a novel sampling technique to improve the original bagging algorithm for data sets with skewed class distributions. With this sampling method, the numbers of samples drawn from the largest and smallest classes differ within each subset, but they are effectively balanced when averaged over all of the subsets, which fits the bagging approach more appropriately. Individual models in RB Bagging tend to show larger diversity, one of the keys to effective ensemble models, than those of existing bagging-based methods for imbalanced data, which use exactly the same number of majority and minority examples for every training subset. In addition, the proposed method makes full use of all of the minority examples through under-sampling, which is done efficiently using negative binomial distributions. Numerical experiments on benchmark and real-world data sets demonstrate that RB Bagging performs better than the existing “balanced” methods and other common methods in terms of area under the ROC curve (AUC), a widely used metric for the class imbalance problem. Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 412-426, 2009
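
The sampling idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the parameters of the negative binomial distribution, so the sketch assumes that the number of majority examples in each subset is the number of failures observed before as many successes as there are minority examples, with success probability 0.5. Under that assumption the expected majority count equals the minority count, while individual subsets vary in size. The function names, the choice to keep every minority example in each subset, and drawing majority examples with replacement are likewise illustrative assumptions, not details taken from the paper.

```python
import random


def draw_majority_count(n_minority, rng, p=0.5):
    """Draw a negative binomial variate: failures before n_minority successes.

    Assumption: p = 0.5, so the expected value equals n_minority.
    """
    successes = 0
    failures = 0
    while successes < n_minority:
        if rng.random() < p:
            successes += 1
        else:
            failures += 1
    return failures


def roughly_balanced_subset(majority, minority, rng):
    """Build one training subset whose class sizes are only roughly equal.

    Assumptions: all minority examples are kept, and majority examples
    are drawn with replacement.
    """
    n_majority = draw_majority_count(len(minority), rng)
    sampled_majority = [rng.choice(majority) for _ in range(n_majority)]
    sampled_minority = list(minority)
    return sampled_majority, sampled_minority


if __name__ == "__main__":
    rng = random.Random(42)
    majority = list(range(1000))        # hypothetical majority-class indices
    minority = list(range(1000, 1020))  # hypothetical minority-class indices
    sizes = [
        len(roughly_balanced_subset(majority, minority, rng)[0])
        for _ in range(10)
    ]
    # Majority counts differ from subset to subset but centre on len(minority).
    print(sizes)
```

In a full bagging procedure, one base classifier would be trained on each such subset and their predictions aggregated. Those steps are omitted here because the abstract does not specify them.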
