## 1 Introduction

Logistic regression is a widely used statistical technique for the analysis of binary outcomes. It yields an equation for the probability of one of the binary outcomes (success/failure) as a function of predictor variables. An assessment of goodness of fit is an essential part of any modeling exercise. The Hosmer–Lemeshow test for the goodness of fit of logistic regression is very popular owing to its ease of implementation, simplicity of interpretation, and widespread adoption by popular statistical packages. It is widely used for the evaluation of risk-scoring models in medicine that are developed using a wide range of sample sizes, as the following examples show. Cole *et al.* [1] used the Hosmer–Lemeshow test to compare three PREM scoring models for predicting survival of preterm infants; they developed the models using samples ranging in size from 1434 to 4748. Newman *et al.* [2] used the test to evaluate a model for high neonatal total serum bilirubin level using a sample of 51,387 infants. In another example, Krag *et al.* [3] used the Hosmer–Lemeshow test to evaluate a prediction model for metastases constructed using a sample of 443 breast cancer patients. Nashef *et al.* [4] constructed EuroSCORE—a scoring system for the prediction of early mortality in cardiac surgical patients in Europe—using a data set of 13,302 patients and evaluated goodness of fit using the Hosmer–Lemeshow test. The final example is a meta-analysis by Hukkelhoven *et al.* [5] of prognostic models for traumatic brain injury, which they evaluated using the Hosmer–Lemeshow test. They developed the models using samples ranging in size from 124 to 2269 and validated the models using data sets that ranged in size from 409 to 2269.

The Hosmer–Lemeshow test is a chi-square test conducted by sorting the *n* records in the data set by estimated probability of success, dividing the sorted set into *g* equal-sized groups, and evaluating the Hosmer–Lemeshow statistic $\hat{C}$ [6]:

$$\hat{C} \;=\; \sum_{i=1}^{g}\left[\frac{(O_{s,i}-E_{s,i})^{2}}{E_{s,i}}+\frac{(O_{f,i}-E_{f,i})^{2}}{E_{f,i}}\right],$$

where *O*_{s,i} and *O*_{f,i} are the observed numbers of successes and failures and *E*_{s,i} and *E*_{f,i} are the expected numbers of successes and failures in the *i*th group. In this study, we assume that the expected numbers, *E*_{s,i} and *E*_{f,i}, are generated by fitting a logistic regression model to the data set and not from an external model. Under the null hypothesis that the model fits the data, we show that $\hat{C}$ follows a *χ*^{2} distribution with *ν* = (*g* − 2) degrees of freedom. Thus, the *p*-value for the Hosmer–Lemeshow test is

$$p \;=\; \int_{\hat{C}}^{\infty} f_{g-2}(x)\,dx,$$

where $f_{g-2}(x)$ is the probability density function of the *χ*^{2} distribution with *g* − 2 degrees of freedom evaluated at *x*. The value of *g* is user-defined, but a commonly used value is *g* = 10; this value has been adopted as the default by most statistical packages.
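As an illustration, the sort-group-compare computation described above can be sketched in Python. This is a minimal sketch, not the implementation used by any particular statistical package; the function name and the use of NumPy/SciPy are our own choices, and it assumes every group has nonzero expected counts.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow C statistic and p-value (illustrative sketch).

    y     : array of 0/1 outcomes
    p_hat : fitted success probabilities from the logistic model
    g     : number of groups (default 10, the common convention)
    """
    y = np.asarray(y)
    p_hat = np.asarray(p_hat)
    order = np.argsort(p_hat)                 # sort records by estimated probability
    y, p_hat = y[order], p_hat[order]
    groups = np.array_split(np.arange(len(y)), g)  # g near-equal-sized groups
    C = 0.0
    for idx in groups:
        O_s = y[idx].sum()                    # observed successes in group i
        E_s = p_hat[idx].sum()                # expected successes in group i
        O_f = len(idx) - O_s                  # observed failures in group i
        E_f = len(idx) - E_s                  # expected failures in group i
        C += (O_s - E_s) ** 2 / E_s + (O_f - E_f) ** 2 / E_f
    p_value = chi2.sf(C, df=g - 2)            # upper tail of chi-square(g - 2)
    return C, p_value
```

A well-calibrated model should produce a moderate statistic and an unremarkable *p*-value, while a miscalibrated one inflates the observed-versus-expected discrepancies in each group.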

It is well known that the power—the probability of correctly rejecting a poorly fitting model—of a chi-square test increases with sample size, and the Hosmer–Lemeshow test is no exception. However, this may not be a desirable property for a goodness of fit test: ideally, the likelihood of rejection of a good, albeit not perfect, regression model should be independent of sample size. Kramer and Zimmerman [7], in a recent paper, demonstrate this undesirable feature of the Hosmer–Lemeshow test.

Our goal in this paper is twofold. First, we seek to establish the dependence of the power of the Hosmer–Lemeshow test on sample size and on the number of groups used in performing the test. Second, we seek guidelines in using the Hosmer–Lemeshow test to enable consistent assessment of goodness of fit for models developed with samples of varying sizes. In doing this, we opt for a parsimonious approach: as currently implemented, the test has one modifiable parameter (the number of groups, *g*) that we adjust so that the power of the test remains consistent across a wide range of sample sizes. We do not suggest a new test but simply provide recommendations on the value of *g* to use in the test as currently implemented. As we show, adjusting this single parameter does not solve the problem for indefinitely large samples, but it can make the test uniformly powered across approximately three orders of magnitude in sample size.

We investigated the properties of $\hat{C}$ analytically and numerically. We performed the numerical analysis with simulated data sets as well as actual data drawn from the US Collaborative Perinatal Project (CPP). Our numerical results confirm that the statistical distribution of $\hat{C}$ is, under a wide range of circumstances, a non-central chi-square distribution whose non-centrality parameter, *λ*, is a function of (i) the degree of deviation of the regression model from perfect fit, (ii) the sample size, and (iii) the number of groups *g*. Furthermore, we derive the interrelationship among the power of the test, *λ*, and *g*. Using these results, we provide a set of recommendations on the choice of *g* to allow assessment of goodness of fit of logistic regression models across data sets of different sizes.
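The interrelationship among power, *λ*, and *g* can be sketched numerically. Assuming, as stated above, that under a misspecified model the statistic follows a non-central chi-square distribution with *g* − 2 degrees of freedom and non-centrality *λ*, the power at level *α* is the non-central tail probability beyond the central chi-square critical value. The function name and the *α* default below are our own illustrative choices.

```python
from scipy.stats import chi2, ncx2

def hl_power(lam, g=10, alpha=0.05):
    """Approximate power of the Hosmer-Lemeshow test, assuming the
    statistic ~ noncentral chi-square(g - 2, lam) under the alternative."""
    crit = chi2.ppf(1.0 - alpha, df=g - 2)    # rejection threshold under H0
    return ncx2.sf(crit, df=g - 2, nc=lam)    # P(statistic > crit | lambda)
```

For a fixed *λ*, the power typically falls as *g* grows (the critical value rises faster than the non-centrality is concentrated), which is the lever behind adjusting *g* to keep power consistent across sample sizes.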

Several authors have examined the behavior of the chi-square statistic as a test for fit under various null and alternative hypotheses. Moore and Spruill, in a seminal study [8], examined the large-sample distribution of several chi-square statistics. One of the cases they consider—that of random cells—is relevant to our study. We have also drawn on results proved by Dahiya and Gurland [9] for non-null distributions for the chi-square statistic.

The plan of the paper is as follows. In the next section, we present relevant distributional properties of the Hosmer–Lemeshow statistic. This is followed by discussions of the simulation study and analysis of the CPP data set. Finally, we discuss the implications of the results leading to our recommendations on the choice of *g* and present some of the limitations of our approach.