Standardizing the power of the Hosmer–Lemeshow goodness of fit test in large data sets

Authors


Michael L. Pennell, Division of Biostatistics, College of Public Health, The Ohio State University, 246 Cunz Hall, 1841 Neil Avenue, Columbus, OH 43210, U.S.A.

E-mail: mpennell@cph.osu.edu

Abstract

The Hosmer–Lemeshow test is a commonly used procedure for assessing goodness of fit in logistic regression. It has, for example, been widely used for evaluation of risk-scoring models. As with any statistical test, the power increases with sample size; this can be undesirable for goodness of fit tests because in very large data sets, small departures from the proposed model will be considered significant. By considering the dependence of power on the number of groups used in the Hosmer–Lemeshow test, we show how the power may be standardized across different sample sizes in a wide range of models. We present mathematical derivations and confirm them through simulation and through analysis of data on 31,713 children from the Collaborative Perinatal Project. We make recommendations on how to choose the number of groups in the Hosmer–Lemeshow test based on sample size and provide example applications of the recommendations. Copyright © 2012 John Wiley & Sons, Ltd.

1 Introduction

Logistic regression is a widely used statistical technique for the analysis of binary outcomes. It yields an equation for the probability of one of the binary outcomes (success/failure) as a function of predictor variables. An assessment of goodness of fit is an important and essential part of any modeling exercise. The Hosmer–Lemeshow test for the goodness of fit of logistic regression is very popular owing to ease of implementation, simplicity of interpretation, and widespread adoption by popular statistical packages. It is widely used for the evaluation of risk-scoring models in medicine that are developed using a wide range of sample sizes, as the following examples show. Cole et al. [1] used the Hosmer–Lemeshow test to compare three PREM scoring models for predicting survival of preterm infants. They developed the models by using samples ranging in size from 1434 to 4748. Newman et al. [2] used the test to evaluate a model for high neonatal total serum bilirubin level by using a sample of 51,387 infants. In another example, Krag et al. [3] used the Hosmer–Lemeshow test to evaluate a prediction model for metastases constructed using a sample of 443 breast cancer patients. Nashef et al. [4] constructed EuroSCORE—a scoring system for the prediction of early mortality in cardiac surgical patients in Europe—by using a data set of 13,302 patients and evaluated goodness of fit by using the Hosmer–Lemeshow test. The final example is a meta-analysis by Hukkelhoven et al. [5] of prognostic models for traumatic brain injury that they evaluated using the Hosmer–Lemeshow test. They developed the models by using samples ranging in size from 124 to 2269 and validated the models by using data sets that ranged in size from 409 to 2269.

The Hosmer–Lemeshow test is a chi-square test conducted by sorting the n records in the data set by estimated probability of success, dividing the sorted set into g equal-sized groups, and evaluating the Hosmer–Lemeshow C statistic [6]:

\hat{C} = \sum_{i=1}^{g}\left[\frac{(O_{s,i} - E_{s,i})^{2}}{E_{s,i}} + \frac{(O_{f,i} - E_{f,i})^{2}}{E_{f,i}}\right]    (1)

where Os,i and Of,i are the observed number of successes and failures and Es,i and Ef,i are the expected number of successes and failures in the ith group. In this study, we assume that the expected numbers, Es,i and Ef,i, are generated by fitting a logistic regression model to the data set and not from an external model. Under the null hypothesis that the model fits the data, it can be shown that Ĉ follows a χ2 distribution with ν = (g − 2) degrees of freedom [6]. Thus, the p-value for the Hosmer–Lemeshow test is

p = \int_{\hat{C}}^{\infty} f_{\chi^2_{g-2}}(x)\,dx    (2)

where f_{\chi^2_{g-2}}(x) is the probability density function of the χ2 distribution with g − 2 degrees of freedom evaluated at x. The value of g is user-defined, but a commonly used value is g = 10; this value has been adopted as the default by most statistical packages.
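Equations (1) and (2) translate directly into code. The following Python sketch is our own illustration (function and variable names are not from the paper); it computes Ĉ and its p-value for a model that was fitted to the same data set, so that ν = g − 2 degrees of freedom are used.

```python
import numpy as np
from scipy import stats

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow C statistic and p-value; a minimal sketch.

    y     : array of 0/1 outcomes
    p_hat : fitted success probabilities from the logistic regression
    g     : number of groups (g = 10 is the common default)
    """
    order = np.argsort(p_hat)              # sort records by estimated probability of success
    C = 0.0
    for idx in np.array_split(order, g):   # g (near-)equal-sized groups
        n_i = len(idx)
        O_s, E_s = y[idx].sum(), p_hat[idx].sum()   # observed and expected successes
        O_f, E_f = n_i - O_s, n_i - E_s             # observed and expected failures
        C += (O_s - E_s) ** 2 / E_s + (O_f - E_f) ** 2 / E_f
    nu = g - 2                             # df when the expected counts come from a model fit to the same data
    return C, stats.chi2.sf(C, nu)         # Eq. (2): upper-tail chi-square probability
```

With g left at its default of 10, this mirrors the test as implemented in most statistical packages.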

It is well known that the power—the probability of correctly rejecting a poorly fitting model—of a chi-square test increases with sample size, and the Hosmer–Lemeshow test is no exception. However, this may not be a desirable property for a goodness of fit test: ideally, the likelihood of rejection of a good, albeit not perfect, regression model should be independent of sample size. Kramer and Zimmerman [7], in a recent paper, demonstrate this undesirable feature of the Hosmer–Lemeshow test.

Our goal in this paper is twofold. First, we seek to establish the dependence of the power of the Hosmer–Lemeshow test on sample size and on the number of groups used in performing the test. Second, we seek guidelines in using the Hosmer–Lemeshow test to enable consistent assessment of goodness of fit for models developed with samples of varying sizes. In doing this, we opt for a parsimonious approach: as currently implemented, the test has one modifiable parameter (the number of groups, g) that we adjust so that the power of the test remains consistent across a wide range of sample sizes. We do not suggest a new test but simply provide recommendations on the value of g to use in the test as currently implemented. As we show, adjusting this single parameter does not solve the problem for indefinitely large samples, but it can make the test uniformly powered across approximately three orders of magnitude in sample size.

We investigated properties of Ĉ analytically and numerically. We performed the numerical analysis with simulated data sets as well as actual data drawn from the US Collaborative Perinatal Project (CPP). Our numerical results confirm that the statistical distribution of Ĉ is, under a wide range of circumstances, non-central chi-square whose non-centrality parameter, λ, is a function of (i) the degree of deviation of the regression model from perfect fit, (ii) the sample size, and (iii) the number of groups g. Furthermore, we derive the interrelationship among the power of the test, λ, and g. Using these results, we provide a set of recommendations on the choice of g to allow assessment of goodness of fit of logistic regression models across data sets of different sizes.

Several authors have examined the behavior of the chi-square statistic as a test for fit under various null and alternative hypotheses. Moore and Spruill, in a seminal study [8], examined the large-sample distribution of several chi-square statistics. One of the cases they consider—that of random cells—is relevant to our study. We have also drawn on results proved by Dahiya and Gurland [9] for non-null distributions for the chi-square statistic.

The plan of the paper is as follows. In the next section, we present relevant distributional properties of the Hosmer–Lemeshow statistic. This is followed by discussions of the simulation study and analysis of the CPP data set. Finally, we discuss the implications of the results leading to our recommendations on the choice of g and present some of the limitations of our approach.

2 Distributional properties of the Hosmer–Lemeshow statistic

The following are four key properties:

  1. Under the alternative hypothesis that the data were not drawn from the fitted model, the Hosmer–Lemeshow statistic follows a non-central chi-square distribution; that is,

    \hat{C} \sim \chi^2_{g-2}(\lambda)    (3)

    where λ is the non-centrality parameter. Because the model parameters in logistic regression are estimated using maximum likelihood, Ĉ is a Chernoff–Lehmann statistic that Moore and Spruill [8] label as T2n. The result in Equation (3) is, then, the second case under Theorem 5.1 of Moore and Spruill (closely related results are Theorem 1 in [9] and Theorem 2 in [6]). In simulation studies, we may estimate λ as

    \hat{\lambda} = \bar{C} - \nu    (4)

    where ν = g − 2 and C̄ is Ĉ averaged over the K identically generated samples.

  2. It can be shown that the non-centrality parameter is proportional to the sample size; that is,

    \lambda \propto n    (5)

    This result follows from Theorem 1 in [9] and is illustrated by Figure 1 in [7], which shows the dependence of Ĉ on sample size.

  3. Let λ0 = λ(ν0) and λ1 = λ(ν1) denote the non-centrality parameters corresponding to ν0 and ν1 degrees of freedom, respectively. If the level of significance (α) and the power of the test (1 − β) are held constant, then the following approximation holds, with greater accuracy for larger λ:

    \frac{\lambda_1}{\lambda_0} \approx \sqrt{\frac{\nu_1}{\nu_0}}    (6)

    We provide the proof of this statement in Appendix A.

  4. The power of the Hosmer–Lemeshow test varies most rapidly with λ or with ν when the critical value y(α) approximately coincides with the maximum of the non-null distribution, that is, when

    \lambda \approx z_{\alpha}\sqrt{2\nu}    (7)

    where zα is the (1 − α) quantile of the standard normal distribution and y(α) is defined implicitly in

    \int_{y(\alpha)}^{\infty} f_{\chi^2_{\nu}}(x)\,dx = \alpha

    We provide the proof of this statement in Appendix A. Figure 1 illustrates this approximation for the point of inflection in power as a function of λ. As expected, the accuracy increases with increasing λ and ν. Thus, for instance, at the α = 0.05 level of significance, the power of the test is relatively insensitive to any change from g = 10 (i.e., ν = 8) if λ is either much larger or much smaller than zα√(2ν) ≈ 6.58. Because λ ∝ n (Property II), the power of the Hosmer–Lemeshow test is approximately 0.5 and most sensitive to change in sample size when λ, zα, and ν satisfy the critical relationship in Equation (7).

Figure 1.

Power versus the non-centrality parameter λ, for ν = 4, 9, 16, 25, 36, 49, 64, 81, and 100, at α = 0.05. The dots show the location of the point of inflection as predicted in Equation (7).
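Properties I–IV can be explored numerically. The short sketch below is our own illustration, not code from the paper; it computes the power of the test directly from the non-central chi-square distribution and evaluates it at λ = zα√(2ν) from Equation (7), for the degrees of freedom shown in Figure 1. The computed power sits near 0.5 at that point, which is the point of inflection described in Property IV.

```python
import numpy as np
from scipy import stats

alpha = 0.05
z_alpha = stats.norm.ppf(1 - alpha)

for nu in (4, 9, 16, 25, 36, 49, 64, 81, 100):
    y_crit = stats.chi2.ppf(1 - alpha, nu)       # critical value y(alpha) under the null
    lam_star = z_alpha * np.sqrt(2 * nu)         # Eq. (7): predicted point of inflection
    power = stats.ncx2.sf(y_crit, nu, lam_star)  # power under the non-central alternative
    print(f"nu = {nu:3d}   lambda* = {lam_star:6.2f}   power(lambda*) = {power:.3f}")
```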

3 Simulation study

3.1 Methods

We investigated the sensitivity of the power of the Hosmer–Lemeshow test to sample size (n) and number of groups (g) by using simulated data. We simulated each binary outcome by using the following model for the logit (Y):

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_1^2 + \beta_3 X_2 + \beta_4 Z + \beta_5 (Z \times X_1)    (8)

where X1 and X2 are independent standard normal variables and Z ∼ Binomial(n = 1,p = 0.5) independent of X1 and X2. In each run, we generated (X1i, X2i, Zi) for each individual and calculated the success probability

\text{prob}_i = \frac{\exp(Y_i)}{1 + \exp(Y_i)}    (9)

which was used to assign either success or failure as the outcome for the individual, using independent draws from Binomial (1,probi). We then fit the following logistic regression model to the data:

\text{logit}(p_i) = \hat{\beta}_0 + \hat{\beta}_1 X_{1i}    (10)

where β̂0 and β̂1 are the maximum likelihood estimates of the intercept and the slope, respectively. We examined the goodness of fit by using the Hosmer–Lemeshow test implemented with g = 6, 10, 18, 34, 66, and 130. We repeated the process over K = 5000 runs for each of five sample sizes (n = 500, 1000, 2000, 4000, and 25,000) and six sets of values of the βi's described in models 1 through 6 in Table 1. For n = 25,000, we also used g = 5002 to test the robustness of the recommendations that we finally make (Section 5). Note that, in each case, the fitted model is not the same as the true model, which allows us to examine the power of the Hosmer–Lemeshow test when important variables are missing (X2, Z) or mismodeled (a missing quadratic trend in X1 or interaction with Z). The true models include a range of deviations from the fitted model. Model 1 exhibits the greatest deviation: a quadratic term in X1 and a large interaction with Z. Models 2 and 3 are the closest to the fitted model, and models 4–6 are intermediate cases. The average probability of success (psuccess) ranges from 0.055 to 0.874 among the six models.
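One arm of the simulation can be condensed as follows. This sketch (ours) uses model 3 of Table 1, which is fully specified in the text, fits the reduced model of Equation (10), and then estimates the non-centrality parameter via Equation (4) and the power at α = 0.05; the seed, the value of K, and the helper names are illustrative.

```python
import numpy as np
from scipy import stats
import statsmodels.api as sm

rng = np.random.default_rng(1)

def hl_pvalue(y, p_hat, g):
    """Hosmer-Lemeshow statistic and p-value (compact form of the earlier sketch)."""
    C = 0.0
    for idx in np.array_split(np.argsort(p_hat), g):
        m_i, o_s, e_s = len(idx), y[idx].sum(), p_hat[idx].sum()
        C += (o_s - e_s) ** 2 / e_s + ((m_i - o_s) - (m_i - e_s)) ** 2 / (m_i - e_s)
    return C, stats.chi2.sf(C, g - 2)

def one_run(n, g):
    """One data set from model 3 of Table 1: Y = X1 + Z + 0.5(Z x X1)."""
    x1 = rng.standard_normal(n)
    z = rng.binomial(1, 0.5, n)
    logit = x1 + z + 0.5 * z * x1                      # true log-odds; Z and the interaction are omitted from the fit
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-logit)))  # Eq. (9): outcome drawn with the implied success probability
    X = sm.add_constant(x1)
    fit = sm.Logit(y, X).fit(disp=0)                   # fitted model, Eq. (10): intercept and X1 only
    return hl_pvalue(y, fit.predict(X), g)

K, n, g = 1000, 1000, 10
runs = np.array([one_run(n, g) for _ in range(K)])
lam_hat = runs[:, 0].mean() - (g - 2)                  # Eq. (4): lambda-hat = C-bar minus nu
power = np.mean(runs[:, 1] < 0.05)
print(f"lambda_hat = {lam_hat:.2f}, power = {power:.3f}")  # compare with Tables 2 and 3, model 3, n = 1000, g = 10
```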

Table 1. Summary of simulation models. Y is the log-odds (logit) of a successful outcome.

Model   Definition                         psuccess
1       —                                  0.256
2       Y = 2 + X1 + Z + 0.5(Z × X1)       0.874
3       Y = X1 + Z + 0.5(Z × X1)           0.585
4       —                                  0.529
5       —                                  0.528
6       —                                  0.055
Table 2. Values of the estimated non-centrality parameters (λ̂) for the simulation models summarized in Table 1 for various sample sizes and numbers of groups used. '*' denotes cases where the Hosmer–Lemeshow statistic does not follow a non-central chi-square distribution, and '+' denotes a case where the chi-square assumption was satisfied for the data presented in the table but was not for a previous set of 5000 data sets. The superscripts A through F label numbers used in Figure 3.

                              Number of groups (g)
Model   Sample size (n)    6        10        18        34        66        130       5002
1       500                9.22     11.23     12.51     12.78     12.34     10.88
        1000               18.96    23.42     25.88     26.64     26.35     24.84
        2000               38.47    47.13     52.39     54.58     54.87     53.58
        4000               77.13    95.15     105.91    110.40    111.74    110.63
        25,000             484.33   598.10    666.24    698.31    710.37    713.16    586.97
2       500                0.23     0.20      0.31      0.76      1.22      2.00
        1000               0.15     0.18      0.36      0.57      1.14      2.31
        2000               0.22     0.37      0.46      0.73      1.23      2.36
        4000               0.33     0.30      0.45      0.62      1.21      2.49
        25,000             1.46     1.84      2.14      2.66      3.26      4.34      91.16
3       500                0.15     0.18      0.30      0.32      0.53      0.84
        1000               0.25     0.29      0.35      0.47      0.69      1.40
        2000               0.50     0.51      0.51      0.60      0.70      1.42
        4000               0.94     1.13      1.21      1.50      1.68      1.82
        25,000             5.86     6.93      7.42      7.87      8.06      8.36      32.96
4       500                2.63     3.22^A    3.62      3.84      3.87      3.90
        1000               5.27     6.56^B    7.42      8.04      8.47      8.47
        2000               10.38    13.20^C   15.26     16.72^E   17.61     17.52
        4000               20.89    26.46^D   30.71     33.63     35.31     35.99
        25,000             131.20   166.57    194.29    213.18    225.76    234.19    210.76
5       500                2.25     2.78      3.20      3.26      3.44      3.20
        1000               4.54     5.78      6.58      7.09      7.10      6.79
        2000               9.05     11.47     13.29     14.31     15.03     14.64
        4000               18.36    23.33     26.89     29.16     30.19     30.53
        25,000             114.04   145.47    168.64    183.73    193.21    198.51    164.65
6       500                0.59     0.38      0.04+     *         *         *
        1000               1.22     1.26      0.93      *         *         *
        2000               2.29     2.63      2.66      1.72      *         *
        4000               4.82     5.90      6.42      6.09      4.29      0.38
        25,000             30.21    38.08     44.02     47.42     48.10     45.23     *

3.2 Results

We present the estimated non-centrality parameter λ̂ and the estimated power for models 1 through 6 for different values of g and n = 500, 1000, 2000, 4000, and 25,000 in Tables 2 and 3, respectively. From Table 2, we observed that when λ̂ was substantially larger than its standard deviation, it was proportional to n, supporting Equation (5). However, when λ̂ was large, it was only weakly dependent on the number of groups, g. When the success probability was small (model 6) and a large number of groups were used, the estimated non-centrality parameters were sometimes negative, which suggests that the chi-square assumption was not valid in these scenarios (denoted by '*' in Tables 2 and 3). From Table 3, we observed that when λ̂ was large enough, the power of the Hosmer–Lemeshow test increased with sample size and decreased with g.

Table 3. Estimated power of the Hosmer–Lemeshow test for the different simulation models as functions of sample size and number of groups used at the α = 0.05 level of significance. '*' denotes cases where the Hosmer–Lemeshow statistic does not follow a non-central chi-square distribution, and '+' denotes a case where the chi-square assumption was satisfied for the data presented in the table but was not for a previous set of 5000 data sets.

                              Number of groups (g)
Model   Sample size (n)    6        10       18       34       66       130      5002
1       500                0.678    0.652    0.567    0.418    0.268    0.135
        1000               0.952    0.949    0.909    0.812    0.649    0.430
        2000               1.000    0.999    0.999    0.995    0.974    0.889
        4000               1.000    1.000    1.000    1.000    1.000    0.999
        25,000             1.000    1.000    1.000    1.000    1.000    1.000    1.000
2       500                0.054    0.054    0.061    0.074    0.093    0.099
        1000               0.052    0.053    0.058    0.064    0.069    0.095
        2000               0.060    0.064    0.061    0.064    0.071    0.084
        4000               0.063    0.061    0.063    0.067    0.072    0.079
        25,000             0.135    0.125    0.110    0.111    0.097    0.088    0.268
3       500                0.053    0.052    0.053    0.052    0.042    0.039
        1000               0.063    0.062    0.053    0.054    0.055    0.052
        2000               0.074    0.071    0.062    0.058    0.055    0.060
        4000               0.100    0.089    0.080    0.075    0.074    0.067
        25,000             0.458    0.419    0.325    0.242    0.177    0.133    0.077
4       500                0.217    0.192    0.153    0.119    0.095    0.058
        1000               0.415    0.391    0.319    0.245    0.180    0.124
        2000               0.735    0.742    0.660    0.552    0.414    0.279
        4000               0.971    0.975    0.962    0.920    0.813    0.646
        25,000             1.000    1.000    1.000    1.000    1.000    1.000    0.678
5       500                0.184    0.168    0.147    0.110    0.088    0.053
        1000               0.360    0.345    0.288    0.218    0.154    0.099
        2000               0.677    0.660    0.587    0.470    0.350    0.228
        4000               0.952    0.959    0.940    0.878    0.739    0.553
        25,000             1.000    1.000    1.000    1.000    1.000    1.000    0.498
6       500                0.075    0.055    0.046+   *        *        *
        1000               0.100    0.087    0.064    *        *        *
        2000               0.176    0.155    0.112    0.068    *        *
        4000               0.386    0.346    0.268    0.180    0.097    0.049
        25,000             0.999    0.999    0.999    0.996    0.967    0.824    *

Figure 2 supports Equation (3) regarding the distributional property of the Hosmer–Lemeshow statistic: the broken lines in the figure show the expected results if Ĉ follows a χ²ν(λ̂) distribution, and the dots show actual results from simulation. The correspondence is seen to be close. The solid gray line with slope of unity corresponds to the null hypothesis, λ = 0. We present the tabulated results from model 4 graphically in Figure 3 to demonstrate the interrelationship among power, λ, and g at α = 0.05. The contour lines for power, obtained from numerical solutions of Equations (A.1) and (A.2) with α = 0.05, confirm the trends already noted in Table 3 that power increased as λ increased (at constant g) and as g decreased (at constant λ).

Figure 2.

Power (large dots) at various values of α and sample size for model 4 of the simulation study as shown in Table 3, with number of groups g = 34. The solid line, showing the type-I error rate, α, is included for reference. The broken lines show the power for non-central chi-square distributions with λ = λ̂ as shown in Table 2. The close fit between the power estimated from simulation runs (large dots) and from exact non-central chi-square distributions (broken lines) supports result I in Section 2.

Figure 3.

Contour plot of power (values in boxes) with level of significance held constant at α = 0.05. The encircled letters A through F illustrate results from model 4, with λ̂ values from Table 2: A through D correspond to g = 10 and n = 500, 1000, 2000, and 4000, respectively; the arrows show the effect of increasing g with sample size according to Equation (16) presented in Section 5.

As an interesting aside, we note that in our simulation study, the Hosmer–Lemeshow test appeared to be highly sensitive to mismodeling where the fitted model neglected a quadratic term in the exact model, as happened in models 1, 4, 5, and 6, although the power was much smaller for model 6 owing to the low success probability. It was less sensitive to interactions neglected in the fitted model (e.g., Z × X1) and least sensitive to variables missing in the fitted model (e.g., Z and X2).

4 Analysis of the Collaborative Perinatal Project data

4.1 Methods

In addition to simulated data, we used a subset of data from the US CPP to investigate the variation of the power of the Hosmer–Lemeshow test owing to both variation in sample size and variation in the number of groups used. The CPP was a prospective study of pregnant women and their children conducted between 1959 and 1974 at 12 centers in the USA, resulting in 6700 data items on about 58,000 pregnancies. We used a reduced version of the data (31,713 children) presented in [10].

Our outcome was a binary variable (uw) indicating low birthweight: uw = 1 if birthweight < 2500 g, and uw = 0 if birthweight ≥ 2500 g. For each of four sample sizes (n = 1000, 2000, 4000, and 8000), K = 5000 random samples were drawn from the data set, and the following model was fit to each sample:

\text{logit}\{P(\text{uw}_i = 1)\} = \beta_0 + \beta_1\,\text{BMImom}_i + \beta_2\,\text{sex}_i + \beta_3\,\text{age}_i + \sum_{a=2}^{5}\gamma_a\,\text{smk}_{a,i}    (11)

where BMImom is the mother's pre-pregnancy BMI, sex is the infant's sex coded 0 for male and 1 for female, age is the mother's age in years, and the smka are a set of four indicator variables resulting from smk, a five-level categorical variable related to the mother's smoking status: smk = 1 for never smoker; smk = 2 for ex-smoker; and smk = 3, 4, and 5 for current smokers smoking < 10, 10–19, and ≥ 20 cigarettes per day, respectively. For each of the K samples, we tested the goodness of fit by using the Hosmer–Lemeshow test with g = 6, 10, 18, 34, 66, and 130 groups.
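The subsampling and fitting step can be sketched as follows. The file name, the reduced value of K, and the reuse of the hosmer_lemeshow helper from Section 1 are our own choices; the column names (uw, BMImom, sex, age, smk) follow the description above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cpp = pd.read_csv("cpp_children.csv")   # hypothetical file holding the 31,713-child subset

rng = np.random.default_rng(2023)
n, K = 4000, 200                        # the paper uses K = 5000 samples per sample size
pvals = []
for _ in range(K):
    samp = cpp.sample(n=n, random_state=int(rng.integers(1 << 31)))
    # Eq. (11): smoking status enters as four indicators via C(smk), never-smoker as the reference level
    fit = smf.logit("uw ~ BMImom + sex + age + C(smk)", data=samp).fit(disp=0)
    p_hat = fit.predict(samp)
    _, p = hosmer_lemeshow(samp["uw"].to_numpy(), p_hat.to_numpy(), g=10)  # helper from Section 1
    pvals.append(p)
print(np.mean(np.array(pvals) < 0.05))  # estimated power at alpha = 0.05
```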

4.2 Results

We summarize the CPP results in Table 4. As with the simulation study, λ̂ increased with n. The average test statistic value (C̄) also increased with n, which was expected because C̄ = λ̂ + ν. The power was relatively insensitive to the number of groups, which can be attributed to the small observed deviations from the fitted model (small values of λ̂). The variation in λ̂ with g (with n held constant) was larger for this model than in the simulation models. This was, in part, an artifact of the smallness of the deviation of the model from the null model, resulting in values of λ̂ that were not large compared with statistical fluctuations therein.

Table 4. Values of the average Hosmer–Lemeshow C statistic (C̄), estimated non-centrality parameter (λ̂), and power (at the α = 0.05 level of significance) for the Collaborative Perinatal Project data.

                               Number of groups (g)
        Sample size (n)     6       10      18      34      66      130
C̄       1000                4.17    8.17    16.48   32.87   65.86   131.21
        2000                4.63    9.00    17.33   34.24   66.94   131.78
        4000                5.58    10.30   19.48   37.13   70.09   134.88
        8000                7.81    13.36   23.94   43.21   77.01   141.16
λ̂       1000                0.17    0.17    0.48    0.87    1.86    3.21
        2000                0.63    1.00    1.33    2.24    2.94    3.78
        4000                1.58    2.30    3.48    5.13    6.09    6.88
        8000                3.81    5.36    7.94    11.21   13.01   13.16
Power   1000                0.050   0.053   0.058   0.066   0.068   0.075
        2000                0.076   0.082   0.084   0.092   0.085   0.080
        4000                0.129   0.138   0.155   0.164   0.142   0.119
        8000                0.297   0.311   0.347   0.370   0.294   0.209

5 Recommendations

The results so far are consistent with the distributional properties of the Hosmer–Lemeshow statistic presented in Section 2. The key observation is that for a given actual (non-null) distribution, the power of the Hosmer–Lemeshow test increases with sample size but (usually) decreases with the number of groups used for the test. If the goal is to keep power constant as sample size increases, then a possible strategy is to increase the number of groups. The missing link now is the dependence of the non-centrality parameter λ on g. Unfortunately, in the absence of close knowledge of the actual (non-null) distribution, the dependence cannot be specified, especially for finite values of g. Numerical results (Table 2) show that the dependence is weak for moderate–large λ̂. For example, in simulation model 1, with n = 500, λ̂ ranged from 9.22 when g = 6 to 10.88 when g = 130; with n = 25,000 the range was larger (484–713) but modest compared with the changes in λ̂ with sample size at a given g. To further complicate things, the changes in λ̂ with g are sensitive to the nature of the deviation from the null hypothesis (e.g., the values of the β's in the simulations). Therefore, in light of our findings and in the absence of knowledge of the exact relationship, we assume that λ is independent of g; we feel this is adequate for the purpose of providing rules of thumb for future use. From Equations (5) and (6), we see that if the power is to remain constant at a given level of significance,

\frac{n_1}{n_0} = \frac{\lambda_1}{\lambda_0} \approx \sqrt{\frac{\nu_1}{\nu_0}}    (12)

Thus, we may perform comparable Hosmer–Lemeshow tests on samples of sizes n0 and n1 with numbers of groups g0 and g1, respectively, where

\sqrt{\frac{g_1 - 2}{g_0 - 2}} = \frac{n_1}{n_0}    (13)

that is,

g_1 = 2 + (g_0 - 2)\left(\frac{n_1}{n_0}\right)^2    (14)

Once standard values g0 and n0 are chosen, we can apply the test uniformly across sample sizes by using Equation (14). In the absence of any universal criterion for a good fit, choosing an appropriate g0 and n0 is difficult. However, we do provide some recommendations based on our simulations.
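As a quick check of Equation (14), the following lines (ours) reproduce the group counts used in Section 3 and referred to again later in this section for the benchmark n0 = 1000 and g0 = 10.

```python
def groups_for(n, n0=1000, g0=10):
    """Eq. (14): number of groups giving roughly the same power as the benchmark (n0, g0)."""
    return 2 + (g0 - 2) * (n / n0) ** 2

for n in (1000, 2000, 4000, 25000):
    print(n, groups_for(n))
# -> 10.0, 34.0, 130.0, 5002.0, the values of g used in Sections 3 and 5
```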

There are three further points of note. First, we must have g < n for the Hosmer–Lemeshow test; g < n ∕ 5 (i.e., at least five observations per group) is preferable. Second, for the Hosmer–Lemeshow statistic to be distributed approximately as χ²g−2(λ), it is necessary that g ≥ 6. Finally, the chi-square approximation also fails when the event rate is small and g is large. As in [11], we define an event as the binary outcome ('success' or 'failure') that occurs with the smallest probability. When the event is rare (e.g., psuccess or 1 − psuccess ⩽ 0.1), the chi-square approximation breaks down as the number of groups approaches the expected number of events, n × min(psuccess, 1 − psuccess). In most of our simulations, we did not experience problems with the chi-square assumption except when the number of groups exceeded the expected number of events. However, we did experience two scenarios for simulation model 6 where the assumption was violated at just over 1.5 events/group (g = 34, n = 1000 and g = 66, n = 2000).

We offer recommendations to offset the increase in power with increasing sample size, tailored to two different scenarios:

  1. Assessment of the goodness of fit of a model by using a single sample, where the goal is to either reject or not reject the null hypothesis that the model fits the data at a given level of significance.

    1. For sample sizes up to n = n0 = 1000, use g = 10 (the currently used standard). The choice of n0 = 1000 appears to be reasonable based on our simulations;¹ this is best supported by simulation models 4 and 5, where there is moderate disagreement between the true and fitted models and the power of the Hosmer–Lemeshow test is not already limited by a small event rate. In these scenarios, the power is reasonably small (30–40%) when g = 10 and n = 1000. However, at n = 2000 and larger, the power is around 70% and higher, which is larger than we would like for models that exhibit moderate lack of fit.
    2. For 1000 < n ≤ 25,000, use

      g = \max\left\{10,\; \min\left[2 + 8\left(\frac{n}{1000}\right)^2,\; \frac{\min(m, n - m)}{2}\right]\right\}    (15)

      where m is the number of successes. The reasoning for this formula requires some explanation. First, to achieve comparable power with our benchmark of n0 = 1000 and g0 = 10, one should choose g according to Equation (14); that is,

      g = 2 + 8\left(\frac{n}{1000}\right)^2    (16)

      Table 2 demonstrates the use of Equation (16). For n = 500 and 1000, use g = 10; for n = 2000, use g = 34; and for n = 4000, use g = 130. Figure 3 illustrates implementation of Equation (16) for model 4. Among the models, model 4 presents the worst case, as the estimated non-centrality parameter for g = 10 (at n = 1000) is very close to the critical value of 6.58 obtained from Equation (7). In the case of rare events, one must be cautious; as seen in our simulations, the chi-square approximation of the test statistic becomes questionable as g approaches the number of events, min(m,n − m). On the basis of our simulations, a conservative requirement would be g ≤ no. of events/2 to be comfortable with the chi-square assumption. Combining this upper bound for g with Equation (16) along with the stipulation that g never falls below the currently used standard of g = 10, we obtain Equation (15).

    3. For samples larger than n = 25,000, we do not recommend the use of the Hosmer–Lemeshow test. This is due to the rapid increase in the value of g with n in Equation (16), which surpasses n itself at around n = 125,000; it is preferable that there are at least five observations per group, that is, g < n ∕ 5. For very large samples where n > 25,000, our recommendation (Equation (15)) therefore breaks down: the approach will almost always default to using g = no. of events/2, which would be much smaller than the g based on Equation (16), resulting in an over-powered test. We are currently investigating strategies to extend the Hosmer–Lemeshow test beyond this limit.
  2. Comparison of the goodness of fit among models fit to large samples of disparate sizes, where the goal is to achieve a fair comparison among model fits, unbiased by the disparity in power due to different sample sizes. It is possible to compare models fit with large samples, including those with sample sizes above n =  25,000 and small event rates. To do so, set n0 as the size of the smallest sample and g0 = 10; for the other samples, determine g, the number of groups to be used for the Hosmer–Lemeshow test, using
    g = 2 + 8\left(\frac{n}{n_0}\right)^2    (17)
    We may now use the p-values obtained from performing the test on each sample to compare the goodness of fit of the different models. (A short computational sketch of Equations (15) and (17) follows this list.)
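A compact implementation of the two rules is sketched below (function names are ours): the first function applies Equation (15) for assessing a single model, with the event-count cap and the g = 10 floor, and the second applies Equation (17) for comparing models across samples. In practice the returned values are rounded to whole numbers of groups.

```python
def assessment_groups(n, n_events):
    """Eq. (15): g for assessing the fit of a single model (scenario 1)."""
    g_power = 2 + 8 * (n / 1000) ** 2            # Eq. (16): power-matching value for n0 = 1000, g0 = 10
    return max(10, min(g_power, n_events / 2))   # cap at half the number of events, floor at g = 10

def comparison_groups(n, n0, g0=10):
    """Eq. (17): g for comparing fits across samples, with n0 the smallest sample size."""
    return 2 + (g0 - 2) * (n / n0) ** 2

# Scenario 1 in the rare-event setting of simulation model 6 (p_success about 0.055):
for n in (500, 1000, 2000, 4000, 25000):
    print(n, assessment_groups(n, n_events=0.055 * n))
# -> 10, 10, 34, 110, 687.5, close to the averaged recommendations reported in Table 5
```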

We examined the performance of our recommendations under the first scenario, assessment of the goodness of fit of a single model by using the data generated from simulation model 6, our rare event scenario (psuccess = 0.055). We provide the results of this simulation study in Table 5. For each n we considered, the chi-square assumption appeared to be satisfied, and our method successfully standardized the power; for example, when n =  25,000 our method had 8% power, whereas under the default g = 10, the power was 99.9%. Note that because the event rate was low, our method usually recommended g = m ∕ 2 except for n = 2000 where g was calculated using Equation (16). For more common events, our method would use Equation (16) more often.

Table 5. Comparison of our recommendations for g with the commonly used g = 10 when applied to data generated from simulation model 6 (psuccess = 0.055). Cell entries for g and the non-centrality parameter are averages across 5000 simulated data sets.

             Our recommendations                g = 10
n            g         λ̂        Power          λ̂        Power
500          10.0      0.38      0.055          0.38      0.055
1000         10.0      1.26      0.087          1.26      0.087
2000         34.0      1.72      0.068          2.63      0.155
4000         109.8     1.73      0.060          5.90      0.346
25,000       688.0     9.79      0.075          38.08     0.999
Table 6. Summary of recalculation of p-values for the 'small deviation' model by Kramer and Zimmerman [7] using our recommendations. The sample size, n, and the mean of the Hosmer–Lemeshow statistic, C̄, are from [7]. The number of groups we recommend, greco, and the estimated p-values, p, are also shown. Estimated p-values evaluated with g = 10 groups, p10, are shown for comparison.

                     n         C̄       λ̂       greco    Cest       p         p10
Model assessment     5000      10.0     2.0      202      202.0      0.45      0.27
                     10,000    14.0     6.0      700      704.0      0.43      0.08
                     50,000    45.4     37.4     3500     3535.4     0.33      < 0.001
Model comparison     5000      10.0     2.0      10       10.0       0.27      0.27
                     10,000    14.0     6.0      34       38.0       0.21      0.08
                     50,000    45.4     37.4     802      837.4      0.17      < 0.001

It is also instructive to test our recommendations on the numbers provided from a study on the sensitivity of the Hosmer–Lemeshow test performed by Kramer and Zimmerman [7]. They simulated the logit of a binary variable by using a model similar to the one we used (Equation (8)) but with 23 variables. They distorted the resulting probabilities in a prescribed manner to cause a 'small deviation', generated values of the binary variable, and fit a logistic regression model to the binary outcome data using the 23 predictors. They repeated this process K = 1000 times for each of three sample sizes n = 5000, 10,000, and 50,000. The average success probability in each scenario was 0.14. In Table 6, we focus our attention on the average value of the Hosmer–Lemeshow statistic, C̄, for each sample size (from Table 2 in [7]). Assume that the goal of the analysis was to assess the goodness of fit by using a single set of data. Applying Equation (4) to their 'small deviation' model, we find that λ̂ = 2.0 for n = 5000. For the 14% event rate expected in these simulations, we recommend that the number of groups be greco = 202 for n = 5000. Also, assuming that λ̂ remains constant with g, we can estimate the Hosmer–Lemeshow statistic as Cest = λ̂ + (greco − 2) and obtain the corresponding p-value based on the χ²greco−2 distribution. We repeated this process for sample sizes n = 10,000 and n = 50,000 (although, strictly, our recommendations do not extend to n = 50,000). The upper half of Table 6 summarizes the results and shows that the p-value remained almost constant over a wide range of sample sizes. In contrast, estimated p-values based on g = 10 groups, p10, showed marked decrease with increase in sample size. In addition, that the p-values were not significant was consistent with the 'small deviation' from perfection of the simulation model. We were unable to reject the null hypothesis for any of the values of n at α = 0.05, whereas Kramer and Zimmerman did reject the null hypothesis at larger sample sizes using a constant g = 10.

Now let the goal, instead, be to compare the fit of the model when applied to samples of three different sizes— n = 5000, 10,000 and 50,000—generated using Kramer and Zimmerman's small deviation model. Starting with the smallest sample, we set n0 = 5000 and g0 = 10. Then, using Equation (17) with n =  10,000, we obtain g = 34, and similarly, with n = 50,000, we obtain g = 802. As before, we used the numbers provided by Kramer and Zimmerman to calculate Cest and the corresponding p-value. We summarize the results in the lower half of Table 6. Fit was comparable across the different sized samples, which is reasonable given that all the data were generated using the exact same model.
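The recalculation behind Table 6 amounts to a few lines of arithmetic. The sketch below (ours) reproduces the model-assessment rows from the published values of C̄ under the assumption, stated above, that λ̂ does not change with g.

```python
from scipy import stats

def recalculated_pvalue(C_bar, g_orig, g_new):
    """Re-express a mean HL statistic reported with g_orig groups on a g_new-group scale."""
    lam_hat = C_bar - (g_orig - 2)     # Eq. (4)
    C_est = lam_hat + (g_new - 2)      # estimated statistic if g_new groups had been used
    return stats.chi2.sf(C_est, g_new - 2)

# 'Small deviation' model of Kramer and Zimmerman [7], model-assessment rows of Table 6:
print(recalculated_pvalue(10.0, 10, 202))    # n = 5000,   p approximately 0.45
print(recalculated_pvalue(14.0, 10, 700))    # n = 10,000, p approximately 0.43
print(recalculated_pvalue(45.4, 10, 3500))   # n = 50,000, p approximately 0.33
```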

6 Discussion

The most serious limitation of our recommendation in assessing the goodness of fit of a model is the relatively modest upper limit on sample size, n ≤ 25,000. We offer two possible strategies to circumvent the restriction on sample size, although—unlike the standard implementation of the Hosmer–Lemeshow test—neither strategy leads to a unique and reproducible verdict on fit. The strategies are sketched rather broadly; the implementation details are subjects of future investigation.

The first strategy is to select random subsamples of a standard size (say, ns = 1000) from the large sample (n >  25,000) and assess model fit by using the subsamples and g = 10. The p-value will differ from subsample to subsample; the final verdict on the rejection of the null hypothesis will depend on a judicious assessment of the set of p-values obtained. So, while this strategy avoids the problem of escalating power with sample size, the goodness of fit measure it produces derives from a subjective assessment of a set of fit statistics.

The second strategy that we suggest is to run the model assessment on the large sample over a range of values of g, keeping at least five observations per group (i.e., g ≤ n ∕ 5). If goodness of fit is accepted at small g (e.g., g ≤ 10), then one can be confident that the model fits the data. If the p-value remains fairly constant at small values (e.g., p < 0.05) across g, then it is likely that the model does not fit the data well. The answer, however, is less clear if the p-value changes with g, as one probably has a model with borderline fit. In this case, one should carefully examine both the trend in p-values with g and their magnitudes before making a final decision regarding fit. Note that of the five studies cited in Section 1 as examples of the use of the Hosmer–Lemeshow test, only one had a sample size larger than 25,000. Hence, our recommendations should be applicable to a substantial proportion of biomedical studies.

This study focused on adapting the Hosmer–Lemeshow test, as currently implemented, to obtain uniformity in power across sample sizes. The resulting recommendations are based on asymptotic approximations and a key assumption—that the non-centrality parameter (λ) that characterizes the exact distribution is insensitive to the number of groups (g) used in the test. Possible violations of these approximations and assumptions limit the scope of the study. Also, our results are limited by the models that we considered in both the simulation study and analysis of the CPP data. However, we carefully chose models that are commonly fit in medical and epidemiologic studies, which were the focus of this paper. Work in progress focuses on a new goodness of fit test that may be used uniformly across sample sizes without changing tuning parameters such as the number of groups. We also plan on exploring the dependence of power of the Hosmer–Lemeshow test on factors that we did not focus on in this study, such as the number of predictors.

Appendix A: Proof of Property III

Property III

Let λ0 = λ(ν0) and λ1 = λ(ν1) denote the non-centrality parameters corresponding to ν0 and ν1 degrees of freedom, respectively. If the level of significance (α) and the power of the test (1 − β) are held constant, then the following approximation holds, with greater accuracy for larger λ:

\frac{\lambda_1}{\lambda_0} \approx \sqrt{\frac{\nu_1}{\nu_0}}

Proof. We may prove this as follows. Let

\int_{y(\alpha)}^{\infty} f_{\chi^2_{\nu}}(x)\,dx = \alpha    (A.1)

and

\int_{y(\alpha)}^{\infty} f_{\chi^2_{\nu}(\lambda)}(x)\,dx = 1 - \beta    (A.2)

To prove Property III, we would first solve Equation (A.1) for y(α) and then solve Equation (A.2) for λ. Neither can be solved in closed form, but numerical solutions may easily be obtained. However, for large ν, we can prove expression (6) by using a normal approximation to the non-central chi-square distribution, as we will show now. The non-central chi-square distribution χ²ν(λ) with ν = g − 2 degrees of freedom has mean ν + λ and variance 2ν + 4λ and may be approximated with the normal distribution N(ν + λ, 2ν + 4λ) for large ν. Define

y_{\alpha} = \nu + z_{\alpha}\sqrt{2\nu}    (A.3)

where zα is the (1 − α) quantile of the standard normal distribution. Here, yα ≈ y(α), the critical value for the Hosmer–Lemeshow statistic at the α level of significance. If zβ is the (1 − β) quantile of the standard normal distribution, then we must have

y_{\alpha} = \nu + \lambda - z_{\beta}\sqrt{2\nu + 4\lambda}    (A.4)

for the power of the test to be (1 − β). We now equate the two expressions for yα and expand the square root in powers of 2λ ∕ ν, a Taylor series that converges if 2λ < ν:

z_{\alpha}\sqrt{2\nu} = \lambda - z_{\beta}\sqrt{2\nu}\left(1 + \frac{2\lambda}{\nu}\right)^{1/2} \approx \lambda - z_{\beta}\sqrt{2\nu}\left(1 + \frac{\lambda}{\nu} - \frac{\lambda^{2}}{2\nu^{2}} + \cdots\right)    (A.5)

so that

(z_{\alpha} + z_{\beta})\sqrt{2\nu} \approx \lambda\left(1 - z_{\beta}\sqrt{\frac{2}{\nu}}\left(1 - \frac{\lambda}{2\nu} + \cdots\right)\right)    (A.6)

Keeping only the leading term in the inner parentheses and expanding the resulting factor as a series in z_{\beta}\sqrt{2/\nu} (which converges if z_{\beta}\sqrt{2/\nu} < 1), we arrive at

\lambda \approx (z_{\alpha} + z_{\beta})\sqrt{2\nu}\left(1 + z_{\beta}\sqrt{\frac{2}{\nu}} + \cdots\right)    (A.7)

which is rearranged to obtain

\lambda \approx (z_{\alpha} + z_{\beta})\sqrt{2\nu} + 2z_{\beta}(z_{\alpha} + z_{\beta}) + \cdots    (A.8)
\lambda \approx (z_{\alpha} + z_{\beta})\sqrt{2\nu}    (A.9)

where the last approximation is valid for large λ or small | zβ |. Because zα and zβ are held fixed, Equation (A.9) gives λ ∝ √ν, so that λ1 ∕ λ0 ≈ √(ν1 ∕ ν0), which is expression (6). This completes the proof of Property III.
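As noted above, Equations (A.1) and (A.2) are easy to solve numerically. The sketch below (ours, with illustrative values of α and power) finds the exact λ and compares it with the leading-order value (zα + zβ)√(2ν) of Equation (A.9); the relative agreement improves as ν, and hence λ, grows, which is the sense in which Property III holds.

```python
import numpy as np
from scipy import stats, optimize

alpha, target_power = 0.05, 0.80                  # illustrative level and power
z_a = stats.norm.ppf(1 - alpha)
z_b = stats.norm.ppf(target_power)                # z_beta, the (1 - beta) quantile

def exact_lambda(nu):
    """Solve Eqs. (A.1)-(A.2): the lambda giving the target power at level alpha with nu df."""
    y_crit = stats.chi2.ppf(1 - alpha, nu)        # y(alpha), Eq. (A.1)
    f = lambda lam: stats.ncx2.sf(y_crit, nu, lam) - target_power
    return optimize.brentq(f, 1e-6, 1e4)          # Eq. (A.2) solved for lambda

for nu in (8, 32, 128, 512):
    approx = (z_a + z_b) * np.sqrt(2 * nu)        # leading term, Eq. (A.9)
    print(f"nu = {nu:4d}   exact lambda = {exact_lambda(nu):7.1f}   leading-order value = {approx:7.1f}")
```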

Appendix A: Proof of Property IV

Property IV

The power of the Hosmer–Lemeshow test varies most rapidly with λ or with ν when the critical value y(α) (as defined in Equation (A.1)) approximately coincides with the maximum of the non-null distribution, that is, when

\lambda \approx z_{\alpha}\sqrt{2\nu}

Proof. To show this, we start with the expression for the power in Equation (A.2) and use the normal approximation to the non-central chi-square distribution:

1 - \beta = \int_{y(\alpha)}^{\infty} f_{\chi^2_{\nu}(\lambda)}(x)\,dx \approx \int_{z_{0}}^{\infty} \phi(z)\,dz    (A.10)

where ϕ(z) is the standard normal density evaluated at z and

z_{0} = \frac{y(\alpha) - (\nu + \lambda)}{\sqrt{2\nu + 4\lambda}}

We can now approximate the rate of change of power with λ as

\frac{\partial(1 - \beta)}{\partial\lambda} \approx -\phi(z_{0})\,\frac{dz_{0}}{d\lambda}    (A.11)

where

\frac{dz_{0}}{d\lambda} = -\frac{2z_{0}}{2\nu + 4\lambda} - \frac{1}{\sqrt{2\nu + 4\lambda}}    (A.12)
\frac{dz_{0}}{d\lambda} \approx -\frac{1}{\sqrt{2\nu + 4\lambda}}    (A.13)

Because z0 is of order one and (2ν + 4λ) is large for the normal approximation to be valid, the second term in Equation (A.12) dominates the first, leading to Equation (A.13). Consequently, we may approximate the rate of change of power as

\frac{\partial(1 - \beta)}{\partial\lambda} \approx \frac{\phi(z_{0})}{\sqrt{2\nu + 4\lambda}}    (A.14)

The rate of change of power is maximum when the derivative of Equation (A.14) with respect to λ is zero; that is,

\frac{\partial^{2}(1 - \beta)}{\partial\lambda^{2}} \approx \frac{z_{0}\,\phi(z_{0})}{2\nu + 4\lambda} - \frac{2\,\phi(z_{0})}{(2\nu + 4\lambda)^{3/2}} = 0    (A.15)

where dz0 ∕ dλ has been approximated using Equation (A.13). We rearrange Equation (A.15) as

z_{0} = \frac{2}{\sqrt{2\nu + 4\lambda}}    (A.16)
z_{0} \approx 0    (A.17)

because 2ν + 4λ is large. To verify that this is indeed a maximum of the rate of change of power with λ, we evaluate the dominant term in the third derivative of power with respect to λ at z0 = 0:

\frac{\partial^{3}(1 - \beta)}{\partial\lambda^{3}} \approx \frac{\phi''(z_{0})}{(2\nu + 4\lambda)^{3/2}} = \frac{(z_{0}^{2} - 1)\,\phi(z_{0})}{(2\nu + 4\lambda)^{3/2}} < 0 \quad \text{at } z_{0} = 0    (A.18)

because ϕ(z0) has a maximum at z0 = 0. Thus, the rate of change of power is maximized at z0 ≈ 0, which corresponds to power of approximately 0.5. Letting y(α) = yα as defined in Equation (A.3) and using Equation (A.17), Equation (A.16) is recast as

\nu + z_{\alpha}\sqrt{2\nu} \approx \nu + \lambda    (A.19)

so that

\lambda \approx z_{\alpha}\sqrt{2\nu}

which is identical to Equation (7). One may use similar arguments to show that power varies most rapidly with ν (with λ held constant) at z0 ≈ 0, leading again to the relationship between λ and ν in Equation (7).

This completes the proof of Property IV.

  • 1

    The implicit assumption in this recommendation is that, for instance, with α = 0.05 and n = 1000 (so that g = 10 and ν = 8), the degree of lack of fit corresponding to the critical value λ ≈ 6.58 obtained using Equation (7) represents a good fit, so that we are willing to accept lower power for smaller values of n.
