The Hosmer–Lemeshow test is a commonly used procedure for assessing goodness of fit in logistic regression. It has, for example, been widely used for evaluation of risk-scoring models. As with any statistical test, the power increases with sample size; this can be undesirable for goodness of fit tests because in very large data sets, small departures from the proposed model will be considered significant. By considering the dependence of power on the number of groups used in the Hosmer–Lemeshow test, we show how the power may be standardized across different sample sizes in a wide range of models. We provide and confirm mathematical derivations through simulation and analysis of data on 31,713 children from the Collaborative Perinatal Project. We make recommendations on how to choose the number of groups in the Hosmer–Lemeshow test based on sample size and provide example applications of the recommendations. Copyright © 2012 John Wiley & Sons, Ltd.