Open Access

MEASURING OVERFITTING IN NONLINEAR MODELS: A NEW METHOD AND AN APPLICATION TO HEALTH EXPENDITURES

SUMMARY

When fitting an econometric model, it is well known that we pick up part of the idiosyncratic characteristics of the data along with the systematic relationship between dependent and explanatory variables. This phenomenon is known as overfitting and generally occurs when a model is excessively complex relative to the amount of data available. Overfitting is a major threat to regression analysis in terms of both inference and prediction.

We start by showing that the Copas measure becomes confounded by shrinkage or expansion arising from in-sample bias when applied to the untransformed scale of nonlinear models, which is typically the scale of interest when assessing behaviors or analyzing policies. We then propose a new measure of overfitting that is both expressed on the scale of interest and immune to this problem. We also show how to measure the respective contributions of in-sample bias and overfitting to the overall predictive bias when applying an estimated model to new data.

We finally illustrate the properties of our new measure through both a simulation study and a real-data illustration based on inpatient healthcare expenditure data, which shows that the distinctions can be important. Copyright © 2013 John Wiley & Sons, Ltd.

1 INTRODUCTION

When fitting a model, it is well known that we pick up part of the idiosyncratic characteristics of the data along with the systematic relationship between dependent and explanatory variables. This phenomenon is known as overfitting and generally occurs when a model is excessively complex relative to the amount of data available. Overfitting is a major threat to regression analysis in terms of both inference and prediction. When models greatly over-explain the data at hand, this casts doubt on both the statistical significance and magnitude of the estimates. In addition, because the relations found by chance in the estimation sample will not replicate, the predictive performance of the model deteriorates when it is applied to new data. An important distinction has thus to be made between retrospective or in-sample prediction, where a model predicts outcomes within the dataset used to estimate it, and prospective or out-of-sample prediction, where a previously estimated model forecasts new outcomes in a different dataset.

When parameter estimates from the estimation sample are used to make predictions in a new sample from the same population, the plot of the actual outcomes against their forecasts should lie on the 45° line in the absence of overfitting. The deviation from the 45° line is referred to as shrinkage. An early measure of shrinkage is the adjusted multiple correlation coefficient proposed by Wherry (1931). Shrinkage is also often measured by means of cross-validation (CV) techniques, following the pioneering work by Larson (1931). Our work builds on the seminal paper by Copas (1983), who proposed measuring shrinkage as 1 minus the least squares slope of the regression of the observed outcomes on their out-of-sample predictions. Later, Copas (1987) also suggested estimating this slope by CV, and his measure gained further popularity. Many applied researchers use this measure, especially in the health sciences (e.g., Harrell et al., 1996; Harrell, 2001) and in health econometrics (e.g., Blough et al., 1999), where it is sometimes referred to as the Copas test of overfitting. The Copas measure of shrinkage is often assessed for various competing models and the results displayed in league tables. The resulting ranking then feeds into model selection along with other diagnostic tools (e.g., Basu et al., 2006). It should however be stressed that predictive shrinkage is relevant not only to health economics but to any field of economics, especially when well-calibrated predictions are important. Government budgeting and risk adjustment are two such examples.

In this paper, we revisit the Copas measure of shrinkage in the case of nonlinear models. With nonlinear models, estimation often takes place on a different scale from that of the dependent variable. The former is sometimes referred to as the scale of estimation and the latter as the scale of interest, because the original scale is typically the scale of scientific or policy interest. Although estimating on a transformed scale often addresses a statistical issue, spending policies typically depend on actual dollars or on the incremental cost-effectiveness of different drugs and devices. Moreover, there is a common confusion in the literature between models for E(ln(y)|x) and ln(E(y|x)). A statement about a model for ln(y) does not allow one to judge the fit of the model to actual expenditures. This is sometimes referred to as the retransformation problem (Manning, 1998; Mullahy, 1998).

Shrinkage is usually measured on the scale of estimation (see for instance Copas, 1983, 1997, in the specific case of the logistic regression), but a few authors (Veazie et al., 2003; Basu et al., 2006) found it more meaningful to assess the Copas slope on the scale of interest as shrinkage is then measured in the same unit as the dependent variable. We show that this alternative Copas measure does not constitute a measure of shrinkage arising from overfitting alone as it also picks up the shrinkage or expansion caused by any misspecification that generates a bias in the estimation sample (in-sample bias hereafter). For that reason, the scale-of-interest Copas measure should instead be viewed as a measure of the calibration of the out-of-sample predictions. To correct this problem, we propose a new measure of overfitting that is both expressed on the scale of interest and immune to in-sample bias. We also show that calibration of the out-of-sample predictions can be expressed as their in-sample calibration multiplied by 1 minus our new measure of overfitting. This relation makes it possible to measure the respective contributions of in-sample bias and overfitting to the overall predictive bias when applying an estimated model to new data. It illustrates the trade-off that the analyst faces when comparing—and selecting—competing models. Should flexibility be increased in order to reduce in-sample bias? Or should nonlinearity be reduced and some secondary covariates be left out in order to contain overfitting? Our expression indicates on what side of this trade-off a given model lies, thus providing the analyst with guidance on optimal specification choice.

The major contribution of this paper is to show how overfitting and in-sample bias have different effects on the scale of estimation versus the scale of interest. The scale-of-interest Copas test combines both. We show how to separate and test for the two separately and how to immunize the test of overfitting.

The rest of the paper is organized as follows. Section 2 presents our new measure of overfitting along with its relation with in-sample and out-of-sample calibrations. Section 3 describes the setting of a simulation study that aims at showing the behavior of our new measure of overfitting when the true model specification is known. Our data-generating processes (DGPs) mimic healthcare expenditures as they are typically heavily right skewed and their models highly nonlinear. Section 4 presents the results of these simulations, and Section 5 illustrates our new measure of overfitting with real data from a well-known hospitalist study (Meltzer et al., 2002). Section 6 concludes.

2 METHODS

2.1 The Copas measure of shrinkage and overfitting

We start by restricting the nonlinear models analyzed to members of the generalized linear model (GLM) family (Nelder and Wedderburn, 1972), which notably includes the linear model for untransformed continuous variables, Poisson regression for counts, logistic and probit regression for binary variables, and parametric proportional hazard models for durations. In the GLM family, the distribution of the observed outcomes, yi, i = 1, …, n, is assumed to be a member of the exponential family, and the expectation μi ≡ E(yi|xi) is related to the linear predictor, ηi = β0 + βK′xi ≡ β′zi, via the link function gT, that is, gT(μi) = ηi. In addition, the variance of yi is assumed to be a function of its expectation, that is, V(yi|xi) = vT(μi). Note that xi refers to a vector of K covariates, βK to the vector of their corresponding parameters, β0 to a constant term, and subscript T to the true model followed by yi. Note also that β ≡ (β0, βK′)′ and zi ≡ (1, xi′)′.

In his seminal paper, Copas (1983) proposed a very convenient measure of the shrinkage caused by overfitting in the case of the linear regression with multivariate normal covariates. This measure exploits the fact that the conditional expectation of a new outcome, E(yi^new | xi), can be expressed as a linear function of its out-of-sample prediction, ŷi^out. Because these two quantities should be equal in the absence of overfitting, 1 minus the least squares slope of this regression provides a measure thereof. Copas (1983, 1997) also claimed that this method can be generalized to the entire GLM family and illustrated this in the specific case of the logistic regression. In this paper, we formally show that Copas' intuition is correct by deriving a large-sample approximation of the Copas shrinkage factor in the GLM framework. To do so, let us express the conditional expectation of a new outcome on the scale of estimation, gT(E(yi^new | xi)), as a linear function of its out-of-sample linear predictor, η̂i^out,

gT(E(yi^new | xi)) = Δ0 + Δ η̂i^out    (1)

where the covariates xi have been appropriately centered.

Note that this centering does not lead to any loss of generality as it merely redefines the constant term of the model, β0. Equation (1) can be viewed as the best linear approximation of gT(E(yi^new | xi)) in the maximum likelihood sense. In the absence of overfitting, η̂i^out is a well-calibrated predictor of gT(E(yi^new | xi)), and Δ equals 1. On the other hand, the further Δ falls below 1, the greater the shrinkage resulting from overfitting. The quantity 1 − Δ can thus be interpreted as a measure of the shrinkage caused by overfitting. We show in Appendix A (in the supporting information available online) that, asymptotically, Δ can be expressed as

display math(2)

where I is the Fisher information matrix. This confirms what Copas (1997) found for the logistic regression, namely that shrinkage increases with the number of covariates K and decreases with the GLM deviance. So, the better the in-sample fit, the smaller the shrinkage. Because Δ is defined on the transformed scale of y, we refer to 1 − Δ as the estimation-scale Copas shrinkage throughout.
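To make this concrete, the sketch below shows one plausible way to estimate the estimation-scale Copas shrinkage 1 − Δ̂ by v-fold CV for a log-gamma GLM: the out-of-sample linear predictors are assembled across folds, and Δ is then estimated as the slope of a same-family GLM of y on these predictors, in the spirit of Equation (1). This is a minimal sketch, not the authors' code; it assumes Python with numpy and a recent statsmodels, and the function name and fold construction are our own choices.

```python
import numpy as np
import statsmodels.api as sm

def estimation_scale_copas(y, X, v=10, seed=0):
    """Return the estimated estimation-scale shrinkage, 1 - Delta_hat."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % v                     # random, roughly equal folds
    family = sm.families.Gamma(link=sm.families.links.Log())
    Xc = sm.add_constant(np.asarray(X, dtype=float))
    eta_out = np.empty(n)                             # out-of-sample linear predictors
    for k in range(v):
        train, test = fold != k, fold == k
        fit = sm.GLM(y[train], Xc[train], family=family).fit()
        eta_out[test] = Xc[test] @ fit.params         # linear predictor for held-out obs
    # Slope of a same-family GLM of y on eta_out: Delta = 1 means no overfitting,
    # Delta < 1 means shrinkage (cf. Equation (1)).
    calib = sm.GLM(y, sm.add_constant(eta_out), family=family).fit()
    return 1.0 - calib.params[1]
```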

Although the vast majority of studies measure shrinkage on the estimation scale (for instance, Blough et al., 1999), a few authors (Veazie et al., 2003; Basu et al., 2006; Hill and Miller, 2010) have found it more meaningful to measure it on the untransformed scale or the scale of interest. This was originally seen as a two-degree-of-freedom measure, but for better comparability with the estimation-scale version and without any loss of generality, we here assume that the observations yi are centered. We formalize this measure as

E(yi^new | xi) = δ ŷi^out    (3)

where ŷi^out is the out-of-sample prediction of yi and Equation (3) is the best linear approximation of E(yi^new | xi) in the least squares sense. The absence of overfitting results in δ = 1, whereas δ is smaller than 1 when shrinkage occurs. For instance, if y represents individual healthcare expenditure, δ = 0.95 would mean that deviations above (below) average healthcare expenditure are overestimated (underestimated) by the out-of-sample predictions by 5%. We refer to 1 − δ as the scale-of-interest Copas shrinkage throughout.
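The scale-of-interest slope δ of Equation (3) can be estimated in the same spirit, now regressing the centered outcomes on their centered cross-validated predictions on the raw scale. Again, this is only an illustrative sketch under the same assumptions (numpy, recent statsmodels); the no-intercept least squares slope mirrors the centering assumed above.

```python
import numpy as np
import statsmodels.api as sm

def interest_scale_copas(y, X, v=10, seed=0):
    """Return the estimated scale-of-interest Copas shrinkage, 1 - delta_hat."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    rng = np.random.default_rng(seed)
    fold = rng.permutation(n) % v
    family = sm.families.Gamma(link=sm.families.links.Log())
    Xc = sm.add_constant(np.asarray(X, dtype=float))
    y_out = np.empty(n)                               # cross-validated predictions of y
    for k in range(v):
        train, test = fold != k, fold == k
        fit = sm.GLM(y[train], Xc[train], family=family).fit()
        y_out[test] = fit.predict(Xc[test])           # predictions on the raw scale
    yc, pc = y - y.mean(), y_out - y_out.mean()       # centering, as assumed in the text
    delta = (pc @ yc) / (pc @ pc)                     # no-intercept least squares slope
    return 1.0 - delta
```

For instance, a returned value of 0.05 corresponds to δ̂ = 0.95, that is, 5% shrinkage of the out-of-sample predictions.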

2.2 Overfitting and in-sample bias

To illustrate the effect of in-sample bias on measurement of overfitting, let us consider the situation where a ‘wrong’1 link function gW is used for the model of yi. Note that subscript W refers to a misspecified—or wrong—model throughout.

Remarkably, the estimation-scale Copas statistic is not affected by such misspecification. This can be shown by replacing gT with gW, and the estimates β̂0 and β̂K with the corresponding parameter estimates under the wrong link function, β̂W,0 and β̂W,K, in Equation (1) and rearranging the equation as

gW(E(yi^new | xi)) = Δ0 + Δ η̂W,i^out    (4)

This means that, in the absence of overfitting, that is, when the transformed conditional expectation of yi^new, gW(E(yi^new | xi)), equals its out-of-sample linear predictor η̂W,i^out, the estimation-scale Copas statistic still equals 1 despite the aforementioned misspecification. The reason is that the misspecification has been made twice: first when computing the linear predictor, η̂W,i^out, and then again when assessing the estimation-scale Copas statistic defined in Equation (1).

On the other hand, the scale-of-interest Copas statistic is generally affected by the in-sample bias that results from the said misspecification. The reason is that, in the absence of overfitting, ŷi^out is unlikely to equal E(yi^new | xi) when the link function is misspecified. Indeed, E(yi^new | xi) is the expectation of the outcome, μi, which will generally not equal gW⁻¹(η̂W,i^out) when the wrong link function gW is used. So, in Equation (3), the slope δ generally does not equal 1 even in the absence of overfitting. We nonetheless argue that the statistic δ remains extremely meaningful, as it can be interpreted as a broader measure of calibration of the out-of-sample predictions, sensitive to both overfitting and in-sample bias.

2.3 A new scale-of-interest measure of overfitting

Our objective here is to propose a new measure of overfitting that is both expressed on the scale of interest and immune to in-sample bias. Let us start by defining α, which is the slope of the linear regression of the conditional expectation of the outcome, E(yi | xi), on its in-sample prediction, ŷi^in:

E(yi | xi) = α ŷi^in    (5)

Similar to Equation (3), this equation is the best approximation of E(yi | xi) in the least squares sense, and the observations yi are assumed to be centered. When using in-sample predictions, no overfitting can occur because the same observations are used both when estimating and when predicting. When the model is well specified, α equals 1, whereas α will generally differ from 1 in the presence of in-sample bias. Note that α can equal 1 even when the link function of the model is misspecified, so 1 − α should not be interpreted as a measure of in-sample bias per se. We merely view α as a measure of calibration of the in-sample predictions and the quantity 1 − α as a measure of the shrinkage arising from in-sample bias. It may be stressed that α can be greater than 1, in which case it indicates predictive expansion, that is, under-prediction of large outcomes and over-prediction of small ones.

Interestingly, Equations (3) and (5) implicitly define our new measure of overfitting. To see this, it should be stressed that the expectation of yi is always assumed to be the same both within and outside the sample: E(yi^new | xi) = E(yi | xi) = μi. Substituting μi for these two expectations in Equations (3) and (5) yields

μi = α γ ŷi^out    (6)

where

γ = δ / α    (7)

It is important to note that the same model, possibly misspecified, is used to compute both the in-sample predictions ŷi^in and the out-of-sample predictions ŷi^out. Any deviation between these two quantities can thus only be caused by overfitting. We therefore interpret the quantity 1 − γ as a measure of the shrinkage caused by overfitting alone. In the absence of overfitting, γ equals 1, and no shrinkage arises. When overfitting occurs, the out-of-sample predictions lose their relation with the outcome, γ diminishes, and the measured shrinkage increases.

Further insight can be gained by expressing relation (7) in terms of shrinkage:

1 − δ = (1 − α) + α (1 − γ)    (8)

The overall shrinkage when predicting outcomes in a new sample, 1 − δ, is thus the sum of the shrinkage due to in-sample bias, 1 − α, and the shrinkage caused by overfitting, 1 − γ, multiplied by the in-sample calibration factor, α. In the absence of in-sample bias (i.e., α = 1), our measure of overfitting equals the out-of-sample shrinkage, 1 − δ. At the other extreme lies the case where the model is so biased that α = 0, in which case overfitting plays no role as it cannot further deteriorate the fit. It may also be noted that the term 1 − δ can be negative in the presence of large expansion caused by in-sample bias (i.e., 1 − α < 0).
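As a worked illustration of Equations (3), (5), (7), and (8), the sketch below computes sample analogues of δ, α, and γ from the outcomes together with in-sample and cross-validated predictions of the same (possibly misspecified) model. The helper name and the use of simple no-intercept least squares slopes on centered quantities are our assumptions, not the authors' implementation; with these sample analogues the decomposition in Equation (8) holds by construction.

```python
import numpy as np

def shrinkage_decomposition(y, y_in, y_out):
    """Return (1 - delta, 1 - alpha, 1 - gamma); see Equations (3), (5) and (7)."""
    y, y_in, y_out = (np.asarray(a, dtype=float) for a in (y, y_in, y_out))
    yc = y - y.mean()
    p_in = y_in - y_in.mean()                 # centered in-sample predictions
    p_out = y_out - y_out.mean()              # centered out-of-sample predictions
    alpha = (p_in @ yc) / (p_in @ p_in)       # in-sample calibration slope (Eq. (5))
    delta = (p_out @ yc) / (p_out @ p_out)    # out-of-sample calibration slope (Eq. (3))
    gamma = delta / alpha                     # overfitting alone (Eq. (7))
    # Equation (8): (1 - delta) == (1 - alpha) + alpha * (1 - gamma)
    return 1.0 - delta, 1.0 - alpha, 1.0 - gamma
```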

The preceding decomposition of out-of-sample calibration illustrates the trade-off that analysts face when comparing—and selecting—competing models or when judging the adequacy of the model they have selected. As discussed, the extent of model flexibility has to balance in-sample quality of fit against containment of overfitting. This is the spirit of Mallows' Cp and the Akaike information criterion, which both measure the goodness of fit of a given model while penalizing for the number of covariates. However, model flexibility is a function not only of the number of covariates but also of the nonlinearity of the functional form. Our method decomposes the overall out-of-sample bias into in-sample bias and overfitting, thus indicating on what side of this trade-off a given model lies and providing the analyst with guidance on optimal specification choice. Unlike Mallows' Cp and the Akaike information criterion, however, quality of fit is not derived from the mean square error but is assessed through the more restrictive predictive calibration. In particular, our expression does not account for the predictive discrimination of the model (e.g., van Houwelingen and le Cessie, 1990). Our decomposition should therefore be used not as an alternative to, but as a complement to, standard goodness-of-fit statistics. In economics, predictive calibration is often of interest per se, especially in government budgeting and risk adjustment, where getting the average prediction of a given population right is important.

Finally, Table 1 summarizes the main features of the measures of shrinkage defined earlier, and Appendix B (supporting information available online) provides more insight into δ, α, and γ by means of large-sample approximations.

Table 1. Main characteristics of the shrinkage statistics

                           Effect captured
Symbol   Scale             Misspecification   Overfitting
Δ        Estimation        No                 Yes
δ        Interest          Yes                Yes
α        Interest          Yes                No
γ        Interest          No                 Yes

3 SIMULATION DESIGN

The simulations aim at showing the behavior of our shrinkage statistics when the specification of the model is known. We emulate healthcare expenditure models as they are typically highly nonlinear given their strictly positive and right-skewed dependent variable. Our explanatory variables include an evenly split dummy variable, which can be thought of as representing gender. We also include a categorical variable with a 50%, 35%, and 15% split, which approximately corresponds to the adult, child, and elderly age classes found in many countries. In addition, we include both uniformly and normally distributed variables to account for a variety of quantitative factors. We choose a sample size of 5000, which falls within the range of most observational surveys available in practice.

Table 2 shows the different models used to generate the dependent variable, Y. The baseline case is a GLM model with a gamma distribution and logarithmic link, which is one of the most widely used models for healthcare expenditures (Blough et al., 1999). A series of scenarios is then generated with the extended estimating equations (EEE, Basu and Rathouz, 2005), which generalize the GLM framework notably through its Box–Cox link function. The EEE model provides us with the opportunity to progressively modify the link function while keeping the distribution of Y unchanged. Because it is restricted to the special case where the Box–Cox parameter λ equals 0, the log-gamma GLM will be biased for any other value of λ when estimated with such data. We also generate additional scenarios using the generalized gamma (GENGAM) model (Manning et al., 2005) to show how the shrinkage statistics are affected by the efficiency of the estimation. We take advantage of the fact that, with GLM models, efficiency is solely conditioned by the choice of the distribution and use the GENGAM to progressively modify the distribution of Y while keeping the logarithmic link between the linear predictor and the expectation of Y. The GLM model, which we have restricted to the special case where the GENGAM shape parameter κ equals 1.5, will thus be inefficient for any other value thereof.

Table 2. Simulation data-generating processes

Data-generating process      β0        ν/σ      Skewness   Kurtosis
Log-gamma GLM               −0.312     0.5      3.26       20.4
EEE (gamma distribution)
  λ = −0.75                 −0.339     0.515    3.50       29.5
  λ = −0.5                  −0.329     0.509    3.35       22.3
  λ = −0.25                 −0.320     0.505    3.29       21.0
  λ = 0.25                  −0.306     0.494    3.25       20.0
  λ = 0.5                   −0.297     0.494    3.22       19.6
  λ = 0.75                  −0.291     0.494    3.21       19.4
  λ = 1                     −0.283     0.494    3.20       19.2
GENGAM
  κ = 0.5                   −0.723     1.26     4.86       53.6
  κ = 1                     −0.516     1.38     3.78       28.6
  κ = 2                     −0.020     1.38     2.79       14.6
  κ = 3                      0.398     1.21     2.32        9.9

Notes: 1. All parameters have been determined numerically so that E(Y) = 1 and V(Y) = 2.2. β0 is the constant term of the linear predictor, ν the ancillary parameter of the EEE, and σ that of the GENGAM. The higher moments of the distributions have been computed numerically. 2. EEE, extended estimating equations; GENGAM, generalized gamma model; GLM, generalized linear model.

Each DGP presented in Table 2 is generated 400 times. For each repetition, the explanatory variables are randomly generated first. In order to reduce the Monte Carlo variation in the simulation results, the same explanatory variables are used over all scenarios. At each iteration, we draw the binary and categorical variables so as to ensure exact 50–50 and 50–35–15 splits, which also contribute to containing the Monte Carlo variation. As for the quantitative variables, they are drawn from the standard uniform and normal distributions. For each DGP, the covariate matrix is then duplicated v = 2.10 times, and the dependent variable Y is randomly drawn for each of these validation samples. Finally, the shrinkage factors are estimated using v-fold CV (Geisser, 1975) where all validation samples have the same size (n = 5000) and covariate matrix and only differ with respect to Y.
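To fix ideas, the sketch below shows what one repetition of the baseline log-gamma scenario could look like. Only β0 = −0.312, the covariate types, and the exact 50–50 and 50–35–15 splits are taken from the text and Table 2; the slope coefficients and the gamma shape are illustrative assumptions, not the authors' exact settings.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 5000
female = (np.arange(n) % 2 == 0).astype(float)        # exact 50-50 split
age = np.repeat([0, 1, 2], [2500, 1750, 750])         # exact 50-35-15 split
rng.shuffle(age)
u = rng.uniform(size=n)                               # uniform quantitative covariate
z = rng.normal(size=n)                                # normal quantitative covariate
X = np.column_stack([female, age == 1, age == 2, u, z]).astype(float)

beta0 = -0.312                                        # constant term from Table 2
beta = np.array([0.2, 0.3, 0.5, 0.2, 0.1])            # illustrative slopes (assumption)
mu = np.exp(beta0 + X @ beta)                         # logarithmic link
shape = 0.5                                           # gamma shape ("nu" in Table 2)
y = rng.gamma(shape, mu / shape)                      # E(Y|x) = mu, heavy right skew

# Fit the (correctly specified) log-gamma GLM on this estimation sample; the
# shrinkage statistics would then be computed on fresh validation samples that
# reuse the same covariate matrix, as described in the text.
fit = sm.GLM(y, sm.add_constant(X),
             family=sm.families.Gamma(link=sm.families.links.Log())).fit()
print(np.round(fit.params, 3))
```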

4 SIMULATION RESULTS

Table 3 shows the simulation results relative to the specification of the link function. It can first be seen that the scale-of-interest Copas shrinkage (1 − δ̂) observed in the GLM and EEE models can be substantially different. For instance, for λ = −0.75, it amounts to 2.99% expansion for the GLM and 10.04% shrinkage for the EEE model. Because overfitting alone cannot lead to expansion, this indicates that another factor is in play. This is confirmed by the measure of in-sample shrinkage, which shows near-perfect calibration in the case of the EEE model (i.e., 1 − α̂ ≈ 0) and substantial deviations for the GLM that span from 7.70% expansion for λ = −0.75 to 6.76% shrinkage for λ = 1. These results demonstrate that the scale-of-interest Copas cannot be considered a measure of shrinkage arising from overfitting alone. As discussed in Section 2, we propose measuring the latter with our new measure of shrinkage, 1 − γ̂. As expected, the simpler GLM exhibits lower overfitting than the EEE model, which requires the estimation of the nonlinear parameter λ. For the highly nonlinear case where λ = −0.75, shrinkage due to overfitting alone in the EEE (9.75%) is more than twice as large as what is observed for the GLM (4.37%). Note also that when the nonlinearity decreases, there is less overfitting in the EEE model as sample-specific variation in λ has less effect on the predicted values.

Table 3. Measure of shrinkage (percentage) and specification of the link function

                              Estimated model: log-gamma GLM                Estimated model: EEE
Data-generating process       1 − δ̂          1 − α̂          1 − γ̂          1 − δ̂          1 − α̂          1 − γ̂
EEE, λ = −0.75                −2.99 (0.26)    −7.70 (0.26)    4.37 (0.06)    10.04 (0.67)    0.35 (0.13)     9.75 (0.66)
EEE, λ = −0.5                  0.35 (0.26)    −4.57 (0.26)    4.71 (0.07)     9.22 (0.49)    0.28 (0.12)     8.99 (0.48)
EEE, λ = −0.25                 3.31 (0.23)    −1.78 (0.22)    5.01 (0.07)     7.62 (0.33)    0.22 (0.10)     7.41 (0.32)
Log-gamma GLM (λ = 0)          5.70 (0.21)     0.36 (0.21)    5.37 (0.07)     7.07 (0.15)    0.26 (0.08)     6.83 (0.13)
EEE, λ = 0.25                  7.31 (0.24)     1.79 (0.23)    5.63 (0.08)     7.57 (0.37)    0.05 (0.09)     7.53 (0.36)
EEE, λ = 0.5                   8.84 (0.23)     3.35 (0.22)    5.69 (0.08)     6.82 (0.21)   −0.07 (0.10)     6.89 (0.19)
EEE, λ = 0.75                 10.52 (0.23)     4.86 (0.22)    5.96 (0.08)     6.74 (0.21)   −0.09 (0.10)     6.83 (0.19)
EEE, λ = 1                    12.86 (0.30)     6.77 (0.27)    6.54 (0.13)     6.98 (0.27)    0.20 (0.18)     6.80 (0.20)

Notes: 1. Average Monte Carlo estimates of shrinkage over 400 repetitions; standard errors are in parentheses. Both the data-generating process and the estimated EEE are constrained to the gamma distribution. 2. EEE, extended estimating equations; GLM, generalized linear model.

The simulations also illustrate our decomposition of the scale-of-interest Copas shrinkage (Equation (8)). For the misspecified GLM model, in-sample shrinkage can add to the shrinkage due to overfitting and further deteriorate out-of-sample calibration. This is shown by our simulations when λ > 0. On the other hand, it also happens that in-sample bias and overfitting work in opposite directions, as in our simulations when λ < 0. For instance, for λ = −0.5, the out-of-sample predictions obtained with the GLM are well calibrated (1 − δ̂ = 0.35%, not significantly different from zero) as in-sample expansion and overfitting cancel each other. For the EEE model, the scale-of-interest Copas shrinkage approximately equals the shrinkage due to overfitting alone. This model, even though its link function is well specified, shows considerable out-of-sample shrinkage (1 − δ̂ of up to 10%), which is driven by overfitting alone. That is why the scale-of-interest Copas shrinkage, 1 − δ̂, even though it does not measure overfitting per se, remains a valuable measure of out-of-sample predictive performance.

The simulation results presented in Table 4 show the relationship between shrinkage and misspecification of the GLM distribution. As expected, such misspecification does not lead to in-sample bias, as 1 − α̂ is never significantly different from zero. In such cases, the scale-of-interest Copas shrinkage, 1 − δ̂, is an adequate measure of overfitting (1 − δ̂ ≈ 1 − γ̂). Both measures show that the efficiency loss of the misspecified GLM model, even though it does not adversely affect in-sample calibration, leads to greater out-of-sample overfitting. For instance, for κ = 3, misspecification of the distribution results in an increase in out-of-sample shrinkage of 1.23 percentage points compared with the GENGAM model. This is due to the double burden of inefficiency: inefficient models not only are less precise in sample but also have reduced out-of-sample predictive performance, as this precision loss leads to greater overfitting. Again, measuring overfitting is very useful as it reveals the out-of-sample shortcomings of the simpler GLM specification, which appears to be unbiased when judged on in-sample grounds only. Conversely, this also shows that the efficiency gain of the GENGAM more than cancels out the greater overfitting induced by its greater complexity when the underlying DGP has a proportional response and both estimators have a log link.

Table 4. Measure of shrinkage (percentage) and specification of the distribution

                              Estimated model: log-gamma GLM               Estimated model: GENGAM
Data-generating process       1 − δ̂         1 − α̂         1 − γ̂          1 − δ̂         1 − α̂         1 − γ̂
GENGAM, κ = 0.5               5.50 (0.23)    0.28 (0.23)    5.23 (0.08)     4.60 (0.30)    0.20 (0.31)    4.41 (0.06)
GENGAM, κ = 1                 5.56 (0.23)    0.28 (0.23)    5.29 (0.08)     5.36 (0.24)    0.24 (0.25)    5.13 (0.08)
Log-gamma GLM, κ = 1.5        5.70 (0.21)    0.36 (0.21)    5.37 (0.07)     5.70 (0.21)    0.36 (0.21)    5.37 (0.07)
GENGAM, κ = 2                 5.66 (0.23)    0.29 (0.23)    5.39 (0.08)     5.50 (0.26)    0.38 (0.25)    5.15 (0.08)
GENGAM, κ = 3                 5.67 (0.23)    0.30 (0.23)    5.39 (0.08)     4.64 (0.34)    0.51 (0.34)    4.16 (0.06)

Notes: 1. Average Monte Carlo estimates of shrinkage over 400 repetitions; standard errors are in parentheses. The estimated GLM has a gamma distribution and logarithmic link function; the GENGAM is unconstrained. 2. GENGAM, generalized gamma model; GLM, generalized linear model.

5 ILLUSTRATION

We use a sample of 6500 observations from a hospitalist study that took place at the University of Chicago hospital (Meltzer et al., 2002). The outcome variable is patient-level healthcare expenditure excluding physician fees, and the key covariates relate to physician characteristics: whether the physician is a hospitalist and disease-specific experience. Many control variables are also present such as patient comorbidities, relative utilization weight of diagnosis, admission month dummy variables, and an indicator for transfer from another institution.

In our illustration, we fit a log-gamma GLM to these data as it has been widely applied in this context. Table 5 shows our measures of shrinkage for this model as a function of sample size and number of CV splits. Note that, to measure shrinkage with a representative smaller sample, we randomly drew 101 quarters (n = 1625) of the full sample, computed the scale-of-interest Copas shrinkage for each of them, and picked the subsample with the median value.

Table 5. Shrinkage (percentage) in the log-gamma generalized linear model according to sample size and number of cross-validation splits, hospitalist data

               10-fold CV                                                   Twofold CV
               Raw scale                                 Estimation scale   Raw scale                                  Estimation scale
Sample size    1 − δ̂         1 − α̂         1 − γ̂        1 − Δ̂             1 − δ̂         1 − α̂         1 − γ̂         1 − Δ̂
6500           16.11 (0.65)   14.32 (0.05)   2.10 (0.79)  1.33 (0.14)        17.32 (2.77)   14.18 (1.30)   3.62 (4.00)    2.37 (0.69)
1625           19.79 (1.92)   14.49 (0.14)   6.19 (2.34)  4.89 (0.51)        23.72 (7.39)   13.96 (3.34)   11.15 (9.76)   8.34 (2.21)

Notes: 1. Twofold and 10-fold estimates have been averaged over 400 repetitions; standard errors are in parentheses. An unbiased measure of the standard error of CV estimates has yet to be found (Arlot and Celisse, 2010), and the standard errors reported here are upper bounds for those of the average CV estimates. 2. CV, cross-validation.

Let us start by interpreting the full-sample results obtained by repeated 10-fold CV. There is a striking difference between the raw-scale Copas shrinkage, 1 − δ̂, and our new measure of the shrinkage arising from overfitting alone, 1 − γ̂: whereas the former shows significant shrinkage (16.11%), the latter reveals that overfitting plays only a secondary role (2.10%). The most important problem, by far, is the lack of fit within the sample, as shown by the in-sample measure of shrinkage (1 − α̂ = 14.32%). In the quarter sample, the shrinkage caused by overfitting increases considerably, to 6.19%, but in-sample misspecification still remains the main issue. Note also that the scale-of-interest measure of overfitting, 1 − γ̂, is at least 25% higher than its estimation-scale counterpart, 1 − Δ̂, for both sample sizes. This demonstrates that the scale of analysis matters when assessing overfitting.

Table 5 also shows the relative gain from using 10-fold CV over twofold CV, which is widely used when measuring shrinkage and is also known as the two-way Copas test. Because CV methods consist of holding out part of the data for validation when estimating the model, they yield an overly pessimistic estimate of its predictive accuracy. By holding out 50% of the sample, the twofold CV is more prone to this bias than the 10-fold CV, which makes use of 90% of the data when estimating the model. What Table 5 shows is that this bias also depends on sample size. In the full sample, our measure of shrinkage caused by overfitting alone is inflated from 2.10% to 3.62%, whereas this quantity jumps from 6.19% to 11.15% in the quarter-sample case. Moreover, the twofold CV is not only more biased than the 10-fold CV here, it is also less efficient. We can see that the standard errors reported for the out-of-sample shrinkage estimates are approximately four times greater when using the twofold CV. The accuracy of the in-sample measures of shrinkage is hit even harder, with standard errors more than 23 times larger. The twofold CV should thus be avoided, unless computational cost is an issue and the sample size is large enough that holding out half the data does not excessively impact the estimation of the model.
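A minimal way to reproduce this kind of comparison is to recompute a CV-based shrinkage statistic over repeated random fold assignments and average the results, once with v = 10 and once with v = 2. The sketch below assumes an estimator with the signature used in the Section 2 sketches; it is not the authors' code.

```python
import numpy as np

def repeated_cv(shrinkage_fn, y, X, v, reps=400):
    """Average a CV-based shrinkage estimator over `reps` random fold assignments."""
    est = np.array([shrinkage_fn(y, X, v=v, seed=r) for r in range(reps)])
    return est.mean(), est.std(ddof=1)

# Example, reusing the interest_scale_copas sketch from Section 2:
#   mean10, sd10 = repeated_cv(interest_scale_copas, y, X, v=10)
#   mean2,  sd2  = repeated_cv(interest_scale_copas, y, X, v=2)
```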

6 DISCUSSION

In this paper, we start by showing that the Copas measure of overfitting (Copas, 1983; Veazie et al., 2003) becomes confounded by shrinkage—or expansion—arising from in-sample bias when it is applied to the untransformed scale of a nonlinear model. This is an important shortcoming as this is typically the scale of interest in terms of assessing behaviors or policy analysis. That is why we have proposed a new measure of overfitting that is both expressed on the scale of interest and immune to in-sample bias.

We then show that out-of-sample predictive calibration can be expressed as in-sample calibration times 1 minus this new measure of overfitting. In addition to providing a large-sample approximation of our new measure of overfitting, we also show its behavior through a simulation study where specification of the model is known and give an illustration with real data. Both our simulation and illustration are based on healthcare expenditure models as such models are typically highly nonlinear given the strictly positive and right-skewed dependent variable.

Our simulations demonstrate that our new measure of overfitting is immune to in-sample bias, whereas the scale-of-interest Copas is not. In fact, in-sample bias can outweigh overfitting. Thus, when evaluating the out-of-sample predictive accuracy of their model, analysts should take into account both overfitting and in-sample model specification. Large in-sample bias calls for actions such as adding flexibility to the model, whereas large overfitting requires reducing model complexity and/or increasing the efficiency of the estimation method. That last point is well illustrated by our simulations, which show that an inefficient GLM can lead to considerable out-of-sample bias despite its in-sample robustness to model misspecification within the exponential family. This is in line with Manning and Mullahy (2001), who showed that selecting an inappropriate model can lead to a loss of precision; our results go one step further by showing that this can also lead to forecast problems even when the estimates are consistent in the estimation sample. More generally, our results highlight the fact that indiscriminately preferring an unbiased estimator over an efficient one is by no means a safe strategy, as the inefficiency of the former not only reduces the power for inference but also ultimately results in biased out-of-sample predictions.

Our real-data illustration shows that the scale on which overfitting is measured matters a lot, as the estimated shrinkage can differ substantially between the original and estimation scales. The scale-of-interest measure of overfitting we propose might thus be relevant to those primarily interested in the original scale, which carries specific scientific or policy interpretations for them.

The illustration also confirms that in-sample bias matters a lot, as the resulting shrinkage dominates the one due to overfitting in all our examples. Finally, by comparing the results obtained with the full sample with those obtained with a sample only one fourth as large, the illustration shows that the role played by sample size can be considerable.

It may be remarked that we have used the GLM family for our simulations because it is a very convenient way to introduce in-sample biases by means of inadequate link functions and inefficiency through wrong distributional assumptions. However, the GLM framework is not required for measuring overfitting. A related point is that, when using a log-GLM as the baseline scenario, the shape of the distribution needs to be monotonically decreasing in order to obtain the type of over-dispersed data that we used, which is a common feature of healthcare expenditures. That is why we have also carried out a small-scale simulation (results not shown) with a DGP that is bell shaped instead of monotonically decreasing. To do so, we used lognormal distributions with varying log-scale error variances to change the coefficient of skewness of the untransformed dependent variable. This exercise yielded the same pattern of results as shown earlier, without any qualitative changes. The actual application presented in this paper confirms this in a case where the data are still heavy tailed after a log transformation intended to achieve symmetry on the scale of estimation.

Finally, we suggest a modified approach to calibrating out-of-sample predictions using CV. As noted by Blough et al. (1999), a practical advantage of the Copas (1983) preshrunk estimator is that the shrinkage parameter can be estimated from the data independently, for instance by CV. We thus suggest correcting the Copas (1983) preshrunk predictor by adjusting the out-of-sample predictions on the scale of interest instead of the scale of estimation, that is, ȳ + δ̂ (ŷi^out − ȳ), where ȳ represents the average healthcare expenditure. Indeed, in addition to correcting for in-sample miscalibration, our suggested preshrunk predictor would also have the valuable advantage of dissociating the estimation method from recalibration. Given the numerous challenges raised by most data, analysts might address these challenges with what they consider to be the most appropriate estimation method and later recalibrate their predictions using our scale-of-interest preshrunk predictor.
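Assuming the recalibration takes the linear shrink-toward-the-mean form given above, the adjustment itself is a one-liner; the function name is ours and the sketch is purely illustrative.

```python
import numpy as np

def preshrink(y_hat_out, y_bar, delta_hat):
    """Scale-of-interest preshrunk prediction: y_bar + delta_hat * (y_hat_out - y_bar)."""
    return y_bar + delta_hat * (np.asarray(y_hat_out, dtype=float) - y_bar)
```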

To conclude, we think that our approach can contribute to better model selection by challenging the model that is chosen on the basis of in-sample grounds only. We recommend starting by selecting a few models with different degrees of flexibility on the basis of rigorous residual analysis. In a second step, we propose running our decomposition analysis for all the selected models, thus allowing the analyst to quantify both in-sample and out-of-sample biases.

Using this information, the analyst can assess to what extent the greater flexibility of more complex models compared with less flexible alternatives (e.g., EEE vs log-link GLM) reduces in-sample bias and whether this greater flexibility translates or not into superior out-of-sample predictive performance.

ACKNOWLEDGEMENTS

We would like to thank the Swiss National Science Foundation for supporting Marcel Bilger's postdoctoral studies at the University of Chicago. No conflict of interest exists with this financing source. The revised paper has benefited from the comments of Randall Ellis, Boston University, when the paper was presented at the International Health Economics Association in Toronto in July 2011 and at the Annual Health Econometrics Workshop held at the University of Minnesota, in September 2011.

1. ‘…all models are wrong, but some are useful’ (Box and Draper, 1987). By wrong specifications, we mean specifications that are less adequate approximations in the functional form or distributional assumptions.
