## 1 INTRODUCTION

When fitting a model, it is well known that we pick up part of the idiosyncratic characteristics of the data along with the systematic relationship between the dependent and explanatory variables. This phenomenon is known as *overfitting* and generally occurs when a model is excessively complex relative to the amount of data available. Overfitting is a major threat to regression analysis in terms of both inference and prediction. When a model greatly over-explains the data at hand, this casts doubt on both the statistical significance and the magnitude of the estimates. In addition, because relations found by chance in the estimation sample will not replicate, the predictive performance of the model deteriorates when it is applied to new data. An important distinction must therefore be made between retrospective, or *in-sample*, prediction, where a model predicts outcomes within the dataset used to estimate it, and prospective, or *out-of-sample*, prediction, where a previously estimated model forecasts new outcomes in a different dataset.

When parameter estimates from the estimation sample are used to make predictions in a new sample from the same population, the plot of the actual outcomes against their forecasts should lie on the 45° line in the absence of overfitting. The deviation from the 45° line is referred to as *shrinkage*. An early measure of shrinkage is the adjusted multiple correlation coefficient proposed by Wherry (1931). Shrinkage is also often measured by means of cross-validation (CV) techniques, following the pioneering work of Larson (1931). Our work builds on the seminal work of Copas (1983), who proposed measuring shrinkage as 1 minus the least squares slope of the regression of the observed outcomes on their out-of-sample predictions. Copas (1987) later suggested estimating this slope by CV, and his measure gained further popularity. Many applied researchers use this measure, especially in health sciences (e.g., Harrell *et al*., 1996; Harrell, 2001) and health econometrics (e.g., Blough *et al*., 1999), where it is sometimes referred to as the Copas test of overfitting. The Copas measure of shrinkage is often assessed for various competing models and the results displayed in league tables. The resulting ranking then feeds into model selection along with other diagnostic tools (e.g., Basu *et al*., 2006). It should be stressed, however, that predictive shrinkage is relevant not only to health economics but to any field of economics in which well-calibrated predictions matter. Government budgeting and risk adjustment are two such examples.
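The CV version of the Copas slope can be sketched in a few lines. The following is an illustrative implementation, not the authors' code: the function name, the 10-fold scheme, and the simulated design (one relevant covariate plus 19 irrelevant ones) are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def copas_shrinkage(y, X, n_splits=10):
    """Cross-validated Copas shrinkage: 1 minus the least squares slope of
    the observed outcomes regressed on their out-of-sample predictions.
    Illustrative sketch; the CV scheme is an assumption of this example."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), n_splits)
    preds = np.empty(n)
    for test_idx in folds:
        train = np.setdiff1d(np.arange(n), test_idx)
        # Fit OLS (with intercept) on the training folds only
        Xtr = np.column_stack([np.ones(train.size), X[train]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        # Predict on the held-out fold
        Xte = np.column_stack([np.ones(test_idx.size), X[test_idx]])
        preds[test_idx] = Xte @ beta
    # Regress observed outcomes on their CV predictions;
    # a slope below 1 (positive shrinkage) signals overfitting
    Z = np.column_stack([np.ones(n), preds])
    gamma, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return 1.0 - gamma[1]

# Overfitting-prone design: 20 regressors, only the first is relevant
n, p = 100, 20
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(size=n)
shrinkage = copas_shrinkage(y, X)
print(shrinkage)  # typically positive in this overfitted setting
```

Because 19 of the 20 regressors are pure noise relative to the sample size, the CV slope tends to fall below 1, yielding positive shrinkage.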

In this paper, we revisit the Copas measure of shrinkage in the case of nonlinear models. With nonlinear models, estimation often takes place on a different scale from that of the dependent variable. The former is sometimes referred to as the *scale of estimation* and the latter as the *scale of interest* because the original scale is typically the scale of scientific or policy interest. Although the scale of estimation is often chosen to address a statistical issue, spending policies depend on actual dollars or on the incremental cost-effectiveness of different drugs and devices. Moreover, there is a common confusion in the literature between models of E(ln(*y*)|*x*) and models of ln(*E*(*y*|*x*)). A statement about a model for ln(*y*) is not sufficient to judge the fit of the model to actual expenditures. This is sometimes referred to as the retransformation problem (Manning, 1998; Mullahy, 1998).
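A small simulation makes the retransformation problem concrete. With normal errors on the log scale, E(*y*|*x*) = exp(E(ln *y*|*x*)) · exp(σ²/2), so simply exponentiating a fitted log model understates actual expenditures. The design below (coefficients, σ = 1) is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Log-scale model: ln(y) = 1 + x + e, with e ~ N(0, sigma^2)
sigma = 1.0
x = rng.normal(size=200_000)
y = np.exp(1.0 + x + rng.normal(scale=sigma, size=x.size))

naive = np.exp(1.0 + x)                   # exp(E[ln y | x]): naive retransformation
correct = np.exp(1.0 + x + sigma**2 / 2)  # E[y | x] under normal log-scale errors

ratio_naive = np.mean(y / naive)      # roughly exp(sigma^2 / 2), i.e. well above 1
ratio_correct = np.mean(y / correct)  # roughly 1
print(ratio_naive, ratio_correct)
```

The naive back-transform misses expenditures by a factor of about exp(σ²/2) on average, which is exactly why a good fit on the log scale says little about fit in dollars.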

Shrinkage is usually measured on the scale of estimation (see, for instance, Copas, 1983, 1997, in the specific case of logistic regression), but a few authors (Veazie *et al*., 2003; Basu *et al*., 2006) found it more meaningful to assess the Copas slope on the scale of interest, as shrinkage is then measured in the same units as the dependent variable. We show that this alternative Copas measure does not constitute a measure of shrinkage arising from overfitting alone, as it also picks up the shrinkage or expansion caused by any misspecification that generates a bias in the estimation sample (in-sample bias hereafter). For that reason, the scale-of-interest Copas measure should instead be viewed as a measure of the calibration of the out-of-sample predictions. To correct this problem, we propose a new measure of overfitting that is both expressed on the scale of interest and immune to in-sample bias. We also show that the calibration of the out-of-sample predictions can be expressed as their in-sample calibration multiplied by 1 minus our new measure of overfitting. This relation makes it possible to measure the respective contributions of in-sample bias and overfitting to the overall predictive bias when applying an estimated model to new data. It illustrates the trade-off that the analyst faces when comparing—and selecting—competing models. Should flexibility be increased in order to reduce in-sample bias? Or should nonlinearity be reduced and some secondary covariates be left out in order to contain overfitting? Our expression indicates on which side of this trade-off a given model lies, thus providing the analyst with guidance on optimal specification choice.
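In schematic notation (the symbols below are ours, not the paper's), write $\beta_{\text{in}}$ for the in-sample calibration slope on the scale of interest, $\beta_{\text{out}}$ for its out-of-sample counterpart, and $\omega$ for the proposed measure of overfitting. The multiplicative relation described above then reads:

```latex
% Hypothetical notation: beta_in, beta_out and omega are illustrative labels
% for the in-sample calibration slope, the out-of-sample calibration slope,
% and the new overfitting measure, respectively.
\beta_{\text{out}} \;=\; \beta_{\text{in}}\,\bigl(1 - \omega\bigr)
```

In this form, $\beta_{\text{out}}$ can deviate from 1 for two distinct reasons: the model is biased in-sample ($\beta_{\text{in}} \neq 1$) or it overfits ($\omega \neq 0$), which is precisely the decomposition exploited in the rest of the paper.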

The major contribution of this paper is to show that overfitting and in-sample bias affect the scale of estimation and the scale of interest differently. The scale-of-interest Copas test combines both. We show how to disentangle the two, test for each separately, and immunize the test of overfitting against in-sample bias.

The rest of the paper is organized as follows. Section 2 presents our new measure of overfitting along with its relation to in-sample and out-of-sample calibration. Section 3 describes the setting of a simulation study that aims to show the behavior of our new measure of overfitting when the true model specification is known. Our data-generating processes (DGPs) mimic healthcare expenditures, which are typically heavily right skewed and modeled with highly nonlinear specifications. Section 4 presents the results of these simulations, and Section 5 illustrates our new measure of overfitting with real data from a well-known hospitalist study (Meltzer *et al*., 2002). Section 6 concludes.