The class imbalance problem appears in many real-world applications of classification learning. We propose an ensemble algorithm, “Roughly Balanced (RB) Bagging”, which uses a novel sampling technique to improve the original bagging algorithm for data sets with skewed class distributions. With this sampling method, the numbers of examples drawn from the largest and smallest classes differ from subset to subset, but they are effectively balanced when averaged over all of the subsets, which is more faithful to the approach of bagging. Individual models in RB Bagging tend to show larger diversity, one of the keys to effective ensemble models, than those of existing bagging-based methods for imbalanced data, which use exactly the same number of majority and minority examples in every training subset. In addition, the proposed method makes full use of all of the minority examples by under-sampling the majority class, which is done efficiently using negative binomial distributions. Numerical experiments on benchmark and real-world data sets demonstrate that RB Bagging outperforms the existing “balanced” methods and other common methods in terms of area under the ROC curve (AUC), a widely used metric for the class imbalance problem.

Copyright © 2009 Wiley Periodicals, Inc. Statistical Analysis and Data Mining 2: 412-426, 2009
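The sampling idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `roughly_balanced_indices` is hypothetical, and the choices of a success probability of 0.5 for the negative binomial draw (so that the expected majority count equals the minority count) and of sampling majority examples with replacement are assumptions made here for illustration.

```python
import numpy as np

def roughly_balanced_indices(y, minority_label=1, rng=None):
    """Return row indices for one roughly balanced training subset.

    Assumptions (not taken from the abstract): binary labels,
    negative binomial success probability 0.5, and majority examples
    drawn with replacement.
    """
    rng = np.random.default_rng(rng)
    y = np.asarray(y)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_idx = np.flatnonzero(y != minority_label)

    # Keep every minority example in the subset.
    n_min = minority_idx.size

    # The majority count varies per subset; with p = 0.5 its
    # expected value equals n_min, so subsets balance on average.
    n_maj = rng.negative_binomial(n_min, 0.5)

    # Under-sample the majority class to the drawn size.
    maj_sample = rng.choice(majority_idx, size=n_maj, replace=True)
    return np.concatenate([minority_idx, maj_sample])
```

Each base classifier of the ensemble would be trained on the rows selected by one such call, and the predictions would then be aggregated as in ordinary bagging.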