The integrated discrimination improvement (IDI) is commonly used to compare two risk prediction models; it summarizes the extent a new model increases risk in events and decreases risk in non-events. The IDI averages risks across events and non-events and is therefore susceptible to Simpson's paradox. In some settings, adding a predictive covariate to a well calibrated model results in an overall negative (positive) IDI. However, if stratified by that same covariate, the strata-specific IDIs are positive (negative). Meanwhile, the calibration (observed to expected ratio and Hosmer–Lemeshow Goodness of Fit Test), area under the receiver operating characteristic curve, and Brier score improve overall and by stratum. We ran extensive simulations to investigate the impact of an imbalanced covariate upon metrics (IDI, area under the receiver operating characteristic curve, Brier score, and R2), provide an analytic explanation for the paradox in the IDI, and use an investigative metric, a Weighted IDI, to better understand the paradox. In simulations, all instances of the paradox occurred under stratum-specific mis-calibration, yet there were mis-calibrated settings in which the paradox did not occur. The paradox is illustrated on Cancer Genomics Network data by calculating predictions based on two versions of BRCAPRO, a Mendelian risk prediction model for breast and ovarian cancer. In both simulations and the Cancer Genomics Network data, overall model calibration did not guarantee stratum-level calibration. We conclude that the IDI should only assess model performance among a clinically relevant subset when stratum-level calibration is strictly met and recommend calculating additional metrics to confirm the direction and conclusions of the IDI. Copyright © 2016 John Wiley & Sons, Ltd.