Minimum sample size for developing a multivariable prediction model: PART II ‐ binary and time‐to‐event outcomes
Abstract
When designing a study to develop a new prediction model with binary or time‐to‐event outcomes, researchers should ensure their sample size is adequate in terms of the number of participants (n) and outcome events (E) relative to the number of predictor parameters (p) considered for inclusion. We propose that the minimum values of n and E (and subsequently the minimum number of events per predictor parameter, EPP) should be calculated to meet the following three criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, (ii) small absolute difference of ≤0.05 in the model's apparent and adjusted Nagelkerke's R2, and (iii) precise estimation of the overall risk in the population. Criteria (i) and (ii) aim to reduce overfitting conditional on a chosen p, and require prespecification of the model's anticipated Cox‐Snell R2, which we show can be obtained from previous studies. The values of n and E that meet all three criteria provide the minimum sample size required for model development. Upon application of our approach, a new diagnostic model for Chagas disease requires an EPP of at least 4.8 and a new prognostic model for recurrent venous thromboembolism requires an EPP of at least 23. This reinforces why rules of thumb (eg, 10 EPP) should be avoided. Researchers might additionally ensure the sample size gives precise estimates of key predictor effects; this is especially important when key categorical predictors have few events in some categories, as this may substantially increase the numbers required.
1 INTRODUCTION
Statistical models for risk prediction are needed to inform clinical diagnosis and prognosis in healthcare.1-3 For example, they may be used to predict an individual's risk of having an undiagnosed disease or condition (“diagnostic prediction model”), or to predict an individual's risk of experiencing a specific event in the future (“prognostic prediction model”). They are typically developed using a multivariable regression framework, such as logistic or Cox (proportional hazards) regression, which provides an equation to estimate an individual's risk based on their values of multiple predictors (such as age and smoking, or biomarkers and genetic information). Well‐known examples are the Wells score for predicting the presence of a pulmonary embolism4, 5; the Framingham risk score and QRISK2,6, 7 which estimate the 10‐year risk of developing cardiovascular disease (CVD); and the Nottingham Prognostic Index, which predicts the 5‐year survival probability of a woman with newly diagnosed breast cancer.8, 9
Researchers planning or designing a study to develop a new multivariable prediction model must consider sample size requirements for their development data set. Our related paper considered this issue for prediction models of a continuous outcome using linear regression.10 Here, we focus on binary and time‐to‐event outcomes, such as the risk of already having a pulmonary embolism, or the risk of developing CVD in the next 10 years. In this situation, the effective sample size is often considered to be the number of outcome events (eg, the number with existing pulmonary embolism, or the number diagnosed with CVD during follow‐up). In particular, a well‐used "rule of thumb" for sample size is to ensure at least 10 events per candidate predictor (variable),11-13 where "candidate" indicates a predictor in the development data set that is considered, before any variable selection, for inclusion in the final model. Note that, if a predictor is categorical with three or more categories, or continuous and modelled as a nonlinear trend, then including the predictor requires two or more parameters to be included in the model. Therefore, we refer to events per predictor parameter (EPP) here, rather than events per variable.
The 10 EPP rule has generated much debate. Some authors claim that the EPP can sometimes be lowered below 10.14 In contrast, Harrell generally recommends at least 15 EPP,15 and others identify situations where at least 20 EPP or up to 50 EPP are required.16-19 However, a concern is that any blanket rule of thumb is too simplistic, and that the number of participants required will depend on many intricate aspects, including the magnitude of predictor effects, the overall outcome risk, the distribution of predictors, and the number of events for each category of categorical predictors.16 For example, Courvoisier et al20 concluded that “There is no single rule based on EPP that would guarantee an accurate estimation of logistic regression parameters.” A new sample size approach is needed to address this.
In this article, we propose the sample size (n) and number of events (E) in the model development data set must, at the very least, meet the following three criteria: (i) small optimism in predictor effect estimates as defined by a global shrinkage factor of ≥0.9, (ii) small absolute difference of ≤0.05 in the model's apparent and adjusted Nagelkerke's R2, and (iii) precise estimation of the overall risk or rate in the population (or similarly, precise estimation of the model intercept when predictors are mean centred). The values of n and E (and subsequently EPP) that meet all three criteria provide the minimum values required for model development. Criteria (i) and (ii) aim to reduce the potential for a developed model to be overfitted to the development data set at hand. Overfitting leads to model predictions that are more extreme than they ought to be when applied to new individuals, and most notably occurs when the number of candidate predictors is large relative to the number of outcome events. A consequence is that a developed model's apparent predictive performance (as observed in the development data set itself) will be optimistic, and its performance in new data will usually be lower. Therefore, it is good practice to reduce the potential for overfitting when developing a prediction model,15 which criteria (i) and (ii) aim to achieve. In addition, criterion (iii) aims to ensure that the overall risk (eg, by a key time point for prediction) is estimated precisely, as fundamentally, before tailoring predictions to individuals, a model must be able to reliably predict the overall or mean risk in the target population.
The article is structured as follows. Section 2 introduces our proposed criterion (i), for which the key concepts of a global shrinkage factor and the Cox‐Snell R2 are introduced.21 The latter needs to be prespecified to utilise our sample size formula, and so in Section 3, we suggest how realistic values of the Cox‐Snell R2 can be obtained in advance of any data collection, eg, by using published information from an existing model in the same field, including values of the C statistic or alternative R2 measures. Extension to criteria (ii) and (iii) is then made in Section 4. Section 5 then provides two examples, which demonstrate our sample size approach for diagnostic and prognostic models. Section 6 raises a potential additional criterion to consider: ensuring precise estimates of key predictor effects, to help ensure precise predictions across the entire spectrum of predicted risk. Section 7 concludes with discussion.
2 SAMPLE SIZE REQUIRED TO MINIMISE OVERFITTING OF PREDICTOR EFFECTS
To adjust for overfitting during model development (and thereby improve the model's predictive performance in new individuals), statistical methods for penalisation of predictor effect estimates are available, where regression coefficients are shrunk toward zero from their usual estimated value (eg, from standard maximum likelihood estimation).22-26 Van Houwelingen notes that “… shrinkage works on the average but may fail in the particular unique problem on which the statistician is working.”22 Therefore, it is important to minimise the potential for overfitting during model development, and this criterion forms the basis of our first sample size calculation. Our approach is motivated by the concept of a global shrinkage factor (a measure of overfitting), and so we begin by introducing this, before then deriving a sample size formula.
2.1 Concept of a global shrinkage factor for logistic and Cox regression
A global (uniform) shrinkage factor, S (typically a value between 0 and 1), multiplies each estimated predictor effect to pull predicted risks back toward the overall mean risk. After fitting a logistic regression model using standard maximum likelihood, the model can be revised using

ln(p̂ᵢ/(1 − p̂ᵢ)) = α* + Sβ̂₁X₁ᵢ + Sβ̂₂X₂ᵢ + ⋯   (1)

where the β̂ terms denote the original predictor effect estimates (ln odds ratios) from maximum likelihood, and α* is the intercept that has been re‐estimated (after shrinkage of predictor effects) to ensure perfect calibration‐in‐the‐large, such that the overall predicted risk still agrees with the overall observed risk in the development data set (for details on how to do this, we refer to the works of Harrell15 and Steyerberg1). Similarly, after fitting a proportional hazards (Cox) regression model using standard maximum likelihood, the model can be revised using

ĥᵢ(t) = ĥ₀(t)exp(Sβ̂₁X₁ᵢ + Sβ̂₂X₂ᵢ + ⋯)   (2)

where ĥ₀(t) is the baseline hazard, which (like α*) should be re‐estimated after the shrinkage is applied.
Example of a global shrinkage factor
Van Diepen et al developed a prognostic model for 1‐year mortality risk in patients with diabetes starting dialysis.29 They used a logistic regression framework, with backwards selection to choose predictors in a dataset of 394 patients with 84 deaths by 1 year, and the estimated model is shown in Table 1. To examine overfitting, the authors used bootstrapping to estimate a global shrinkage factor of 0.903, indicating that the original model was slightly overfitted to the data. Therefore, a revised prediction model was produced by multiplying the original β̂ coefficients (ln odds ratios) from the original logistic regression model by a global shrinkage factor of S = 0.903.
TABLE 1 The prognostic model of van Diepen et al,29 before and after adjustment for overfitting

| Predictor | Developed (unpenalised) model: β̂ | Final (penalised) model adjusted for overfitting: Sβ̂ = 0.903β̂ |
|---|---|---|
| Intercept | 1.962 | 1.427 (α*, re‐estimated) |
| Age (years) | 0.047 | 0.042 |
| Smoking | 0.631 | 0.570 |
| Macrovascular complications | 1.195 | 1.078 |
| Duration of diabetes mellitus (years) | 0.026 | 0.023 |
| Karnofsky scale | −0.043 | −0.039 |
| Haemoglobin level (g/dl) | −0.186 | −0.168 |
| Albumin level (g/l) | −0.060 | −0.054 |
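As a concrete illustration of this revision step, the short Python sketch below (our illustration, not code from the paper) multiplies the published coefficients by S = 0.903. Note that the revised intercept α* cannot be computed from Table 1 alone; it must be re‐estimated in the development data so that the mean predicted risk matches the overall observed risk.

```python
# Minimal sketch: applying van Diepen et al's global shrinkage factor (Table 1).
S = 0.903
betas = {
    "age": 0.047, "smoking": 0.631, "macrovascular_complications": 1.195,
    "diabetes_duration": 0.026, "karnofsky": -0.043,
    "haemoglobin": -0.186, "albumin": -0.060,
}
# Shrunken coefficients are S * beta; alpha* must be re-estimated on the data itself.
shrunk = {name: round(S * b, 3) for name, b in betas.items()}
print(shrunk)  # matches the penalised column of Table 1 (to rounding), eg age -> 0.042
```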
2.2 Expressing sample size in terms of a global shrinkage factor
Rather than estimating shrinkage after model development (eg, via bootstrapping, as in Table 1), our first criterion requires the expected shrinkage to be calculated in advance. A well‐known closed‐form heuristic estimate of the global shrinkage factor (proposed by van Houwelingen and le Cessie, hence the subscript VH) is

S_VH = 1 − (p/LR)   (3)

where p is the total number of predictor parameters considered, and LR is the likelihood ratio (chi‐squared) statistic of the developed model,

LR = −2(ln L_null − ln L_model)   (4)

with L_null and L_model the maximised likelihoods of the intercept‐only (null) model and the developed model, respectively. We use R²_CS,app to denote the apparent ("app") estimate of a prediction model's Cox‐Snell ("CS") R2 performance as obtained from the model development data set. It can be shown (eg, see the works of Magee31 or Hendry and Nielsen32) that the LR statistic can be expressed in terms of the sample size (n) and R²_CS,app as follows:

LR = −n ln(1 − R²_CS,app)   (5)

or, equivalently,

R²_CS,app = 1 − exp(−LR/n)   (6)

Substituting Equation 5 into Equation 3 then expresses the expected shrinkage in terms of n, p, and R²_CS,app:

S_VH = 1 + p/(n ln(1 − R²_CS,app))   (7)

2.3 Criterion (i): calculating sample size to ensure a shrinkage factor ≥ 0.9
Equation 7 provides a closed‐form solution for the expected shrinkage conditional on n, p, and R²_CS,app. Therefore, if we could specify a realistic value for R²_CS,app in advance of our study starting, we could identify values of n and p that correspond to a desired shrinkage factor (eg, 0.9), thus informing the required sample size. However, a major problem is that R²_CS,app is a postestimation measure of model fit, whereas for a sample size calculation, this needs to be specified in advance of collecting the data when designing a new study. Furthermore, due to overfitting in the model development data set, the observed R²_CS,app is generally an upwardly biased (optimistic) estimate of the Cox‐Snell R2 as it is estimated in the same data used to develop the model. Thus, in new data, the actual Cox‐Snell R2 performance is likely to be lower.

We therefore propose that researchers prespecify R²_CS,adj, an adjusted (approximately unbiased) estimate of the model's expected R²_CS performance in new individuals from the same population. In other words, R²_CS,adj is a modification of R²_CS,app to adjust for optimism (caused by overfitting) in the model development data set. For generalised linear models such as logistic regression, Mittlboeck and Heinzl suggest that R²_CS,adj can be obtained by33

R²_CS,adj = S_VH × R²_CS,app = (1 + p/(n ln(1 − R²_CS,app))) × R²_CS,app   (8)

where R²_CS,adj corresponds to the underlying population value.33 By rearranging Equation 8, we can express R²_CS,app in terms of R²_CS,adj:

R²_CS,app = R²_CS,adj/S_VH   (9)

Our sample size criterion is based on prespecifying R²_CS,adj, rather than R²_CS,app, and so substituting Equation 9 into Equation 7 gives

S_VH = 1 + p/(n ln(1 − R²_CS,adj/S_VH))   (10)

which can be rearranged to give the sample size required for a chosen shrinkage factor:

n = p/((S_VH − 1) ln(1 − R²_CS,adj/S_VH))   (11)

For example, if we plan to consider p = 20 predictor parameters and prespecify an R²_CS,adj of at least 0.1, then to target an expected shrinkage of 0.9, we need a sample size of

n = 20/((0.9 − 1) ln(1 − 0.1/0.9)) = 1698

and so at least 1698 participants are required.
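As a quick check, Equation 11 can be scripted directly; the following minimal Python sketch (the function name is ours) reproduces the calculation above.

```python
import math

def sample_size_criterion_i(p: int, r2_cs_adj: float, s_vh: float = 0.9) -> float:
    """Equation 11: n = p / ((S_VH - 1) * ln(1 - R2_CS,adj / S_VH))."""
    return p / ((s_vh - 1) * math.log(1 - r2_cs_adj / s_vh))

print(round(sample_size_criterion_i(p=20, r2_cs_adj=0.1)))  # 1698
```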
2.4 Translating the calculated sample size to the number of events and EPP
It may be surprising that the overall outcome proportion (or overall outcome rate) is not directly included in the right‐hand side of the sample size Equation 11, especially because the total number of events, E (which depends on the outcome proportion or rate), is often considered the effective sample size for binary and time‐to‐event outcomes.15 However, the outcome proportion (rate) is indirectly accounted for in the sample size calculation via the chosen R²_CS,adj, as the maximum value of R²_CS for the intended population of the model depends on the overall outcome proportion (rate) for that population. As the outcome proportion decreases, the maximum value of R²_CS decreases. This is explained further in Section 3.4. Therefore, after n is derived from the sample size equation 11, E can be obtained by combining the calculated n with the outcome proportion (rate) for the intended population. Similarly, EPP can be obtained.
For example, for binary outcomes, E = nϕ and EPP = nϕ/p, where ϕ is the overall outcome proportion in the target population (ie, the overall prevalence for diagnostic models, or the overall cumulative incidence by a key time point for prognostic models). In our aforementioned hypothetical example, where 1698 subjects were needed based on an R²_CS,adj of 0.1 and S_VH of 0.9, then if the intended setting has ϕ of 0.1 (ie, overall outcome risk is 10%), the required E = 1698 × 0.1 = 169.8. With 20 predictor parameters, the required EPP = (1698 × 0.1)/20 = 8.5. However, if the intended setting has ϕ of 0.3, then E = 509.4 and EPP = 25.5. The big change in EPP arises because, although the chosen value of R²_CS,adj is fixed at 0.1, the maximum value of R²_CS is much higher for the setting with the higher outcome proportion.
We can explain this further using Nagelkerke's "proportion of total variance explained,"34 which is calculated as R²_CS/max(R²_CS). If two models have the same R²_CS,adj (say at 0.1, as in the aforementioned examples), then Nagelkerke's measure of predictive performance will be lower for the model whose setting has a higher outcome proportion, as the max(R²_CS) is larger in that setting. Models with lower performance have larger overfitting concerns,22 and therefore require a larger EPP to minimise overfitting than models with high performance. This explains why the EPP was larger when ϕ was 0.3 than when it was 0.1 in the aforementioned example. It highlights that a blanket rule of thumb (such as at least 10 EPP) is unlikely to be sensible to meet criterion (i), as the actual EPP depends on the setting/population of interest (which dictates the overall outcome proportion or rate) and expected model performance.
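The arithmetic in this section is easily reproduced; a minimal Python sketch, assuming the hypothetical n = 1698 and p = 20 from above:

```python
n, p = 1698, 20
for phi in (0.1, 0.3):  # overall outcome proportion in the target population
    E = n * phi
    print(f"phi = {phi}: E = {E:.1f}, EPP = {E / p:.1f}")
# phi = 0.1: E = 169.8, EPP = 8.5
# phi = 0.3: E = 509.4, EPP = 25.5
```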
3 HOW TO PRESPECIFY R²_CS,adj BASED ON PREVIOUS INFORMATION
Our sample size proposal in Equation 11 requires researchers to provide a value for the model's R²_CS,adj, that is, to prespecify the anticipated Cox‐Snell R2 value if the model were applied to new individuals. How should this be done? We recommend using R²_CS,adj values from previous prediction model studies for the same (or similar) population, considering the same (or similar) outcomes and time points of interest. For example, the researcher could consult systematic reviews of existing models and their performance, which are increasingly available,35 or registries that record the prediction models available in a particular field.36
Often, a new prediction model is developed specifically to update or improve upon the performance of an existing model, by using additional predictors. Then, the existing model's R²_CS,adj could be used as a lower bound for the new model's anticipated R²_CS,adj. In this situation, if the apparent Cox‐Snell estimate, R²_CS,app, is available in an article describing the development of the existing model, then its R²_CS,adj can be derived using Equation 8, as long as the study's n and p can also be obtained. In addition, as in van Diepen et al's example (Table 1), a global shrinkage factor may be reported directly for an existing model development study, and if so, R²_CS,adj can be derived from a simple rearrangement of Equation 10, again as long as the study's n and p are also available.

Note that, if an R²_CS estimate is available from an external validation study of an existing model, there is no need for adjustment (ie, R²_CS,adj = R²_CS), as the validation dataset provides a direct estimate of the model's performance in new individuals (free from overfitting concerns as there is no model development therein).

Other options to obtain R²_CS,adj from the existing literature are now described. For guidance on choosing an R²_CS,adj value in the absence of any prior information, please see our discussion in Section 7.
3.1 Using the LR statistic to derive the Cox‐Snell R²_CS,app
If the R²_CS,app or R²_CS,adj is not available in the publication of an existing model, the LR value may be reported, which would allow R²_CS,app to be derived using Equation 6, then S_VH for the model derived using Equation 7 (assuming the model's n and p are also provided), and finally R²_CS,adj using Equation 8. Alternatively, if only ln L_model is reported, then ln L_null can be calculated directly, allowing the LR statistic (Equation 4), and then R²_CS,app and R²_CS,adj to be derived using Equations 6 and 8, respectively. For example, in a logistic regression model, the ln L_null value can be calculated using

ln L_null = E ln(E/n) + (n − E) ln(1 − (E/n))   (12)

where E is the total number of outcome events.
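These conversions are easy to automate for the binary case before turning to time‐to‐event outcomes. Below is a minimal Python sketch (helper names are ours); as a check, plugging in the development values reported later in Section 5.2 (R²_CS,app = 0.056, n = 1200, p = 6) returns the R²_CS,adj of 0.051 used there.

```python
import math

def lnL_null_binary(E: int, n: int) -> float:
    """Equation 12: null model log-likelihood for a logistic model."""
    return E * math.log(E / n) + (n - E) * math.log(1 - E / n)

def r2_cs_app_from_lr(lr: float, n: int) -> float:
    """Equation 6: R2_CS,app = 1 - exp(-LR / n)."""
    return 1 - math.exp(-lr / n)

def r2_cs_adj_from_app(r2_app: float, n: int, p: int) -> float:
    """Equations 7 and 8: S_VH = 1 + p/(n ln(1 - R2_app)); R2_adj = S_VH * R2_app."""
    s_vh = 1 + p / (n * math.log(1 - r2_app))
    return s_vh * r2_app

print(round(r2_cs_adj_from_app(0.056, n=1200, p=6), 3))  # 0.051
```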
For a time‐to‐event outcome, assuming (for simplicity) exponentially distributed survival times, the ln L_null value can instead be calculated using

ln L_null = E ln(E/T) − E   (13)

where T is the total follow‐up time across all participants.
3.2 Using other pseudo‐R2 statistics to derive R²_CS,app
As R²_CS has a maximum value less than 1, Nagelkerke's R2 is sometimes reported,34 which divides R²_CS,app by the maximum value, max(R²_CS) (defined in Equation 23), as follows:

R²_Nagelkerke = R²_CS,app/max(R²_CS)   (14)

Hence, if R²_Nagelkerke is reported, R²_CS,app can be calculated by rearranging Equation 14 to give

R²_CS,app = R²_Nagelkerke × max(R²_CS)   (15)

and then R²_CS,adj calculated via Equation 8.
McFadden's R2 is also sometimes reported, defined as

R²_McFadden = 1 − (ln L_model/ln L_null)   (16)

Hence, if R²_McFadden is reported, we can rearrange Equation 16 to obtain ln L_model, and subsequently derive the LR statistic using Equation 4, the Cox‐Snell R²_CS,app from Equation 6, S_VH from Equation 7 (assuming the model's n and p are also provided), and finally R²_CS,adj via Equation 8.
For time‐to‐event models, O'Quigley et al suggest a modified version of R²_CS,app obtained by replacing n with the number of events (E)38:

R²_OQ = 1 − exp(−LR/E)   (17)

Hence, if R²_OQ and E were reported, the LR value could be found using

LR = −E ln(1 − R²_OQ)   (18)

and then R²_CS,app can be obtained using Equation 6, S_VH using Equation 7, and finally R²_CS,adj using Equation 8.
A similar measure to R²_OQ is Royston and Sauerbrei's R²_D,40 and the two are approximately related by

R²_D ≈ R²_OQ/(R²_OQ + (π²/6)(1 − R²_OQ))   (19)

Hence, if R²_D is reported, it can be used to obtain R²_OQ by rearranging Equation 19 as

R²_OQ ≈ (π²/6)R²_D/(1 − R²_D + (π²/6)R²_D)   (20)

allowing LR, R²_CS,app, S_VH and then R²_CS,adj to be derived as explained previously. R²_D itself can be derived from their proposed D statistic (the ln(hazard ratio) comparing two groups defined by the median value of the model's risk score in the population of application) using

R²_D = (D²/κ²)/(π²/6 + D²/κ²)   (21)

where κ² = 8/π. As R²_D and R²_OQ are reasonably similar, we tentatively suggest R²_D as a proxy for R²_OQ when only R²_D (or D) is reported; though, we recognise that further research is needed on the link between R²_D and R²_OQ.
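A minimal Python sketch of these pseudo‐R2 conversions (with Equation 20 as reconstructed above); as a check, feeding in the R²_D of 0.23, E = 161, and n = 1200 from the Section 5.2 example returns the R²_CS,app of about 0.052 derived there.

```python
import math

PI2_6 = math.pi ** 2 / 6

def r2_cs_app_from_nagelkerke(r2_nag: float, max_r2_cs: float) -> float:
    """Equation 15."""
    return r2_nag * max_r2_cs

def r2_oq_from_r2_d(r2_d: float) -> float:
    """Equation 20 (as reconstructed here)."""
    return PI2_6 * r2_d / (1 - r2_d + PI2_6 * r2_d)

def lr_from_r2_oq(r2_oq: float, E: int) -> float:
    """Equation 18: LR = -E ln(1 - R2_OQ)."""
    return -E * math.log(1 - r2_oq)

lr = lr_from_r2_oq(r2_oq_from_r2_d(0.23), E=161)
print(round(1 - math.exp(-lr / 1200), 3))  # Equation 6 -> ~0.052
```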
3.3 Using values of the C statistic to derive R²_CS,adj
Jinks et al derived a formula to predict the D statistic (and thus R²_D via Equation 21) when only the C statistic is reported for a survival model41:

D = 5.50(C − 0.5) + 10.26(C − 0.5)³   (22)

Table 2 shows values of D (and the corresponding R²_D from Equation 21) predicted from Equation 22 for selected values of the C statistic, as taken from the work of Jinks et al.41 Thus, if only the C statistic is reported, we can use Equation 22 to predict Royston's D statistic and calculate R²_D (using Equation 21) as a proxy, and then R²_OQ (Equation 20), LR (Equation 18), R²_CS,app (Equation 6), and finally R²_CS,adj (Equation 8) computed sequentially.
TABLE 2 Values of D and R²_D (from Equation 21) predicted from Equation 22 for selected values of the C statistic (values taken from table 1 in the work of Jinks et al41)

| C | D | R²_D | C | D | R²_D |
|---|---|---|---|---|---|
| 0.50 | 0 | 0 | 0.72 | 1.319 | 0.294 |
| 0.52 | 0.110 | 0.003 | 0.74 | 1.462 | 0.338 |
| 0.54 | 0.221 | 0.011 | 0.76 | 1.610 | 0.382 |
| 0.56 | 0.332 | 0.026 | 0.78 | 1.765 | 0.427 |
| 0.58 | 0.445 | 0.045 | 0.80 | 1.927 | 0.470 |
| 0.60 | 0.560 | 0.070 | 0.82 | 2.096 | 0.512 |
| 0.62 | 0.678 | 0.099 | 0.84 | 2.273 | 0.552 |
| 0.64 | 0.798 | 0.132 | 0.86 | 2.459 | 0.591 |
| 0.66 | 0.922 | 0.169 | 0.88 | 2.652 | 0.627 |
| 0.68 | 1.050 | 0.208 | 0.90 | 2.857 | 0.661 |
| 0.70 | 1.182 | 0.250 | 0.92 | 3.070 | 0.692 |
Further evaluation of the performance of Jinks' formula is required, eg, using simulation and across settings with different cumulative outcome incidences. Indeed, based on figure 5 in the work of Jinks et al,41 the potential error in the predictions of D appears to increase as C increases, and is about ±0.25 when C is 0.8. Nevertheless, Equation 22 serves as a good starting point and works well in our applied example (see Section 5.2.1). Further research is also needed to ascertain how to predict D from other measures, such as Somers' D statistic.
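Equations 21 and 22 are simple to script; the sketch below reproduces the C = 0.70 row of Table 2.

```python
import math

def d_from_c(c: float) -> float:
    """Equation 22 (Jinks et al): D = 5.50(C - 0.5) + 10.26(C - 0.5)^3."""
    return 5.50 * (c - 0.5) + 10.26 * (c - 0.5) ** 3

def r2_d_from_d(d: float) -> float:
    """Equation 21 (Royston and Sauerbrei), with kappa^2 = 8/pi."""
    v = d ** 2 / (8 / math.pi)
    return v / (math.pi ** 2 / 6 + v)

d = d_from_c(0.70)
print(round(d, 3), round(r2_d_from_d(d), 3))  # 1.182 0.25
```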
3.4 The anticipated value of R²_CS,adj may be small
Cox‐Snell R2 values for logistic and survival models are usually much lower than R2 values for linear regression models, with values often less than 0.3. A key reason is that (unlike for linear regression) the R²_CS has a maximum value less than 1, defined by

max(R²_CS) = 1 − exp(2 ln L_null/n)   (23)

For example, for an outcome proportion of 5%, the max(R²_CS) is 0.33, and for an outcome proportion of 1%, the max(R²_CS) is 0.11. Therefore, especially in situations where the outcome proportion is low, researchers should anticipate a model with a (seemingly) low R²_CS,app value, and subsequently a low R²_CS,adj value.
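Equation 23 can be evaluated per participant, so the maximum R²_CS for a given outcome proportion follows from a couple of lines of code; this sketch reproduces the 0.33 and 0.11 quoted above.

```python
import math

def max_r2_cs_binary(phi: float) -> float:
    """Equation 23, using the per-participant form of Equation 12 for proportion phi."""
    lnL_null_per_subject = phi * math.log(phi) + (1 - phi) * math.log(1 - phi)
    return 1 - math.exp(2 * lnL_null_per_subject)

print(round(max_r2_cs_binary(0.05), 2))  # 0.33
print(round(max_r2_cs_binary(0.01), 2))  # 0.11
```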
Low values of R²_CS or R²_Nagelkerke do not necessarily indicate poor model performance. Consider the following three examples. First, Poppe et al used a Cox regression to develop a model ("PREDICT‐CVD") to predict the risk of future CVD events within two years in patients with atherosclerotic CVD,42 and directly report an R²_CS of 0.04. However, the corresponding C statistic is 0.72, which shows discriminatory magnitude typical of many prognostic models used in practice. Second, Hippisley‐Cox and Coupland use the QResearch database to produce three models (QDiabetes) that estimate the risk of future diabetes in a general population.43 In their validation of their "model A," there were 27 311 incident cases of diabetes recorded in 1 322 435 women (3.77 cases per 1000 person‐years) during follow‐up, and the reported R²_D was 0.505. Using the approach described previously to convert R²_D to LR, this leads to an R²_CS of 0.02; however, the corresponding D statistic of 2.07 and C statistic of 0.89 are large. Third, in a risk prediction model for venous thromboembolism (VTE) in women during the first 6 weeks after delivery,44 the R²_CS was 0.001 due to the extremely low event risk (7.2 per 10 000 deliveries), but the model still had important discriminatory ability as the corresponding C statistic was 0.70.
4 ADDITIONAL SAMPLE SIZE CRITERIA
Criterion (i) focuses on shrinkage of predictor effects, which is a multiplicative measure of overfitting (ie, on the relative scale). Harrell suggests also evaluating overfitting on the absolute scale, and checking that key model parameters are estimated precisely.15 We now address this with two further criteria.
4.1 Criterion (ii): ensuring a small absolute difference in the apparent and adjusted R²_Nagelkerke
Recall that Nagelkerke's R2 expresses the Cox‐Snell R2 as a proportion of its maximum possible value (Equation 14). We define δ as the absolute difference between the model's apparent and adjusted R²_Nagelkerke:

δ = (R²_CS,app/max(R²_CS)) − (R²_CS,adj/max(R²_CS))   (24)

where max(R²_CS) depends on the outcome proportion, as shown in Equation 23. Substituting R²_CS,app = R²_CS,adj/S_VH (Equation 9) gives

δ = R²_CS,adj(1 − S_VH)/(S_VH max(R²_CS))   (25)

and rearranging for S_VH identifies the shrinkage factor required to ensure a chosen δ:

S_VH = R²_CS,adj/(R²_CS,adj + δ max(R²_CS))   (26)

To apply this criterion, researchers must prespecify R²_CS,adj (as they did for criterion (i)) and also the value of max(R²_CS) as outlined for Equation 23. Then, sample size equation 11 can be used to derive the sample size needed to satisfy criterion (ii). This is only necessary when the calculated value of S_VH from Equation 26 is larger than that chosen for criterion (i), as then the sample size required to meet criterion (ii) will be larger than that for criterion (i).

For example, return to the hypothetical example with p = 20 and an anticipated R²_CS,adj of at least 0.1, and consider a setting with an outcome proportion of 5%, such that the max(R²_CS) is 0.33. Then, to ensure δ is ≤ 0.05, we require

S_VH = 0.1/(0.1 + (0.05 × 0.33)) = 0.86

As this is smaller than the shrinkage factor of 0.9 chosen for criterion (i), the sample size of 1698 identified for criterion (i) already satisfies criterion (ii). Had R²_CS,adj been 0.2, then

S_VH = 0.2/(0.2 + (0.05 × 0.33)) = 0.92

which is larger than 0.9, and so Equation 11 should then be applied with S_VH = 0.92 to obtain the required sample size for criterion (ii).
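The criterion (ii) check is a one‐line calculation (Equation 26); this sketch reproduces the two shrinkage factors in the example above.

```python
def s_vh_criterion_ii(r2_cs_adj: float, max_r2_cs: float, delta: float = 0.05) -> float:
    """Equation 26: shrinkage so apparent and adjusted Nagelkerke R2 differ by <= delta."""
    return r2_cs_adj / (r2_cs_adj + delta * max_r2_cs)

print(round(s_vh_criterion_ii(0.1, 0.33), 2))  # 0.86 -> criterion (i) dominates
print(round(s_vh_criterion_ii(0.2, 0.33), 2))  # 0.92 -> criterion (ii) dominates
```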
4.2 Criterion (iii): ensure precise estimate of overall risk (model intercept)
Our third criterion is that n should allow precise estimation of the overall outcome risk, which can be framed in terms of a null model (ie, no predictors included). For example, for a binary outcome, an approximate 95% confidence interval for the overall outcome proportion (ϕ) is ϕ ± 1.96√(ϕ(1 − ϕ)/n). To ensure a small absolute margin of error of ≤ δ (eg, 0.05), we require 1.96√(ϕ(1 − ϕ)/n) ≤ δ, which leads to

n ≥ (1.96/δ)² ϕ(1 − ϕ)   (27)

These sample sizes aim to ensure precise estimation of the overall risk in the population of interest. Strictly speaking, we are more interested in precise estimation of the mean risk in an actual model including multiple predictors. If we centre predictors at their mean value, then the model's intercept is the logit risk for an individual with mean predictor values. The corresponding risk for this individual will often be very similar (though not identical) to the mean risk in the overall population. Furthermore, the variance of the estimated risk for this individual will be approximately ϕ(1 − ϕ)/n, as obtained by inverting the information matrix X′V⁻¹X and replacing the individual variances pᵢ(1 − pᵢ) with a constant variance ϕ(1 − ϕ). Thus, it follows that Equation 27 is also a good approximation to the sample size required to precisely estimate the mean risk in a model containing predictors centred at their mean.
For a time‐to‐event outcome, we can instead consider precise estimation of the cumulative incidence (overall risk) by a key time point t. Assuming exponentially distributed survival times, this is F(t) = 1 − exp(−λ̂t), where λ̂ is the estimated rate (number of events per person‐year), given by E/T, where T is the total person‐years of follow‐up. An approximate 95% confidence interval for λ̂ is λ̂exp(±1.96/√E), and hence an approximate 95% confidence interval for the estimated F(t) is (1 − exp(−λ̂exp(−1.96/√E)t), 1 − exp(−λ̂exp(+1.96/√E)t)). Therefore, to ensure a small absolute margin of error, such that the lower and upper bounds of the confidence interval are within δ (eg, 0.05) of the true value, we must ensure both the following are satisfied:

[1 − exp(−λ̂exp(1.96/√E)t)] − [1 − exp(−λ̂t)] ≤ δ
[1 − exp(−λ̂t)] − [1 − exp(−λ̂exp(−1.96/√E)t)] ≤ δ   (28)
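Criterion (iii) is also simple to script. The binary case is a direct formula (Equation 27); for the survival case, the smallest adequate number of events can be found by searching over E in the reconstructed Equation 28. The two‐year time point in the final line is our assumption, purely for illustration.

```python
import math

def n_criterion_iii_binary(phi: float, delta: float = 0.05) -> float:
    """Equation 27: n = (1.96/delta)^2 * phi * (1 - phi)."""
    return (1.96 / delta) ** 2 * phi * (1 - phi)

def events_criterion_iii_survival(rate: float, t: float, delta: float = 0.05) -> int:
    """Smallest E satisfying both inequalities of Equation 28 (as reconstructed here),
    based on the 95% CI rate * exp(+/-1.96/sqrt(E)) for an exponential event rate."""
    F = 1 - math.exp(-rate * t)  # assumed true cumulative incidence by time t
    for E in range(1, 1_000_000):
        hi = 1 - math.exp(-rate * math.exp(+1.96 / math.sqrt(E)) * t)
        lo = 1 - math.exp(-rate * math.exp(-1.96 / math.sqrt(E)) * t)
        if hi - F <= delta and F - lo <= delta:
            return E
    raise ValueError("no adequate E found")

print(round(n_criterion_iii_binary(0.174)))         # 221 (Chagas example, Section 5.1)
print(events_criterion_iii_survival(0.065, t=2.0))  # 28 events, under these assumptions
```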
5 WORKED EXAMPLES
To summarise our sample size approach for researchers, we provide a step‐by‐step guide in Figure 1. The sample size (and corresponding number of events and EPP) that meets criteria (i) to (iii) provides the minimum sample size required for model development. We now present two worked examples to illustrate our approach.
FIGURE 1 Step‐by‐step guide to calculating the minimum sample size required when developing a prediction model with a binary or time‐to‐event outcome
5.1 A diagnostic prediction model for chronic Chagas disease
Our first example considers the minimum sample size required for developing a diagnostic model for predicting a binary outcome (disease: yes or no). Brasil et al developed a logistic regression model containing 14 predictor parameters for predicting the risk of having chronic Chagas disease in patients with suspected Chagas disease.46 Upon external validation in a cohort of 138 participants containing 24 with Chagas disease, the model had an estimated C statistic of 0.91 and an R²_Nagelkerke of 0.48. Consider that a researcher wants to update this model and improve the predictive performance. Our sample size approach can be applied as follows.
5.1.1 Steps 1 and 2: identifying values for p, max(R²_CS), and R²_CS,adj
Assume the new model will consider 24 predictor parameters (p = 24). We then need to prespecify max(R²_CS) and R²_CS,adj. To achieve this, we can convert the R²_Nagelkerke value for Brasil's existing model into a R²_CS value. Assume the disease prevalence is 17.4%, as in the Brasil validation study, and use Equation 12 to calculate the log‐likelihood for the null model in Brasil's validation study:

ln L_null = 24 ln(24/138) + (138 − 24) ln(1 − (24/138)) = −63.8

and hence, from Equation 23, max(R²_CS) = 1 − exp(2 × (−63.8)/138) = 0.60. Now, we can use Equation 15 to obtain

R²_CS = R²_Nagelkerke × max(R²_CS) = 0.48 × 0.60 = 0.29

This estimate is free from optimism, as it was obtained in a different data set to that used for model development. Therefore no adjustment is needed, because R²_CS,adj = R²_CS = 0.29 here.
5.1.2 Step 3: criterion (i) ‐ ensuring a global shrinkage factor of 0.9
We use 0.29 as the anticipated R²_CS,adj of our new model. We now use Equation 11 to estimate the sample size required to ensure an expected shrinkage factor (S_VH = 0.90) conditional on the number of predictor parameters (p = 24):

n = 24/((0.9 − 1) ln(1 − 0.29/0.9)) = 617

and so about 617 participants are required to meet criterion (i).
5.1.3 Step 4: criterion (ii) ‐ ensuring a small absolute difference in the apparent and adjusted R²_Nagelkerke
We now want δ ≤ 0.05, a small difference in the model's apparent and adjusted R²_Nagelkerke. Using Equation 26, we obtain

S_VH = 0.29/(0.29 + (0.05 × 0.60)) = 0.906

As this is larger than the 0.90 used for criterion (i), criterion (ii) dictates a larger sample size; applying Equation 11 with this shrinkage factor gives a required sample size of 668 participants.
5.1.4 Step 5: criterion (iii) ‐ ensure precise estimate of overall risk (model intercept)
Using Equation 27 with ϕ = 0.174 and δ = 0.05, we require n ≥ (1.96/0.05)² × 0.174 × (1 − 0.174) = 221, and so about 221 participants are needed to precisely estimate the overall outcome proportion.
5.1.5 Step 6: minimum sample size that ensures all criteria are met
The largest sample size required was 668 subjects to meet criterion (ii), and so this provides the minimum sample size required for developing our new model. It corresponds to 668 × 0.174 = 116.2 events, and an EPP of 116.2/24 = 4.84, which is considerably lower than the “EPP of at least 10” rule of thumb.
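The whole Chagas calculation can be reproduced in a few lines; a hedged Python sketch using the rounded inputs above (small differences from the published 668, 116.2, and 4.84 reflect rounding of intermediate values):

```python
import math

p, r2_adj, max_r2, phi = 24, 0.29, 0.60, 0.174

def n_for_shrinkage(s: float) -> float:
    """Equation 11."""
    return p / ((s - 1) * math.log(1 - r2_adj / s))

n_i = n_for_shrinkage(0.9)                    # criterion (i):   ~617
s_ii = r2_adj / (r2_adj + 0.05 * max_r2)      # Equation 26:     ~0.906
n_ii = n_for_shrinkage(s_ii)                  # criterion (ii):  ~664
n_iii = (1.96 / 0.05) ** 2 * phi * (1 - phi)  # criterion (iii): ~221
n_min = max(n_i, n_ii, n_iii)
print(round(n_i), round(n_ii), round(n_iii))
print(round(n_min * phi, 1), round(n_min * phi / p, 2))  # events and EPP
```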
5.2 A prognostic model to predict a recurrence of VTE
Our second example considers the sample size required to develop a prognostic model with a time‐to‐event outcome. Ensor et al developed a prognostic time‐to‐event model for the risk of a recurrent VTE following cessation of therapy for a first VTE.47 The sample size was 1200 participants, with a median follow‐up of 22 months, a total of 2483 person‐years of follow‐up, and 161 (13.42% of) individuals had a VTE recurrence by end of follow‐up.47 The model included predictors of age, gender, site of first clot, D‐dimer level, and the lag time from cessation of therapy until measurement of D‐dimer (often around 30 days). These predictors corresponded to six parameters in the model, which was developed using the flexible parametric survival modelling framework of Royston and Parmar48 and Royston and Lambert.49 Although Ensor's model performed well on average, the model's predicted risks did not calibrate well with the observed risks in some populations.47 Therefore, new research is needed to update and extend this model, eg, by including additional predictors. We now identify suitable sample sizes to inform such research.
5.2.1 Steps 1 and 2: identifying values for p, max(R²_CS), and R²_CS,adj
Assume that there are 25 potential predictor parameters for inclusion in the new model, and thus, p = 25. We next need to identify suitable values for max(R²_CS) and R²_CS,adj.
Calculating max(R²_CS)
The max(R²_CS) was not reported for the Ensor data, but we should expect it to be quite small because the outcome proportion is low. For example, assuming (for simplicity) that an exponential survival model was fitted to the Ensor data (E = 161 events across T = 2483 person‐years of follow‐up), Equation 13 provides ln L_null, and subsequently Equation 23 gives max(R²_CS) = 0.37. Thus, max(R²_CS) is considerably less than 1.
Obtaining a sensible value for R²_CS,adj from the study authors
As R²_CS,app was not reported for the Ensor model, we need to obtain it. We contacted the original authors, who told us their model's R²_CS,app was 0.056 in the development data set. Thus, let us use this value to derive R²_CS,adj from Equation 8. Based on Ensor's sample size of 1200, and six predictor parameters, we obtain

S_VH = 1 + 6/(1200 × ln(1 − 0.056)) = 0.913
R²_CS,adj = 0.913 × 0.056 = 0.051

and so 0.051 provides a sensible value for the anticipated R²_CS,adj of the new model. This corresponds to Nagelkerke's proportion of variation explained of 0.051/0.37 = 0.14 (or 14%).
Calculating a sensible value for R²_CS,adj from other reported information
Had we been unable to contact the study authors, R²_CS,adj could have been estimated indirectly from other available information. The model's reported C statistic was 0.69, and so we can use Equation 22 to predict the corresponding D statistic:

D = 5.50(0.69 − 0.5) + 10.26(0.69 − 0.5)³ = 1.12

and then R²_D can be derived from Equation 21:

R²_D = (1.12²/κ²)/(π²/6 + 1.12²/κ²) = 0.23

Using R²_D as a proxy, we can then use Equation 20 to obtain R²_OQ = 0.33, and the number of reported events (E = 161) to derive the LR statistic from Equation 18:

LR = −161 × ln(1 − 0.33) = 64.5

Finally, Equation 6 gives

R²_CS,app = 1 − exp(−64.5/1200) = 0.052

Thus, the indirect approach estimates that R²_CS,app is 0.052 for the Ensor model. This is reassuringly close to the estimate of 0.056 provided directly by the study authors.
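This indirect chain is reproduced by the following Python sketch (Equation 20 as reconstructed in Section 3.2); unrounded intermediate values give R²_CS,app ≈ 0.052, as above.

```python
import math

PI2_6 = math.pi ** 2 / 6
C, E, n = 0.69, 161, 1200

D = 5.50 * (C - 0.5) + 10.26 * (C - 0.5) ** 3    # Equation 22: ~1.12
v = D ** 2 / (8 / math.pi)
r2_d = v / (PI2_6 + v)                           # Equation 21: ~0.23
r2_oq = PI2_6 * r2_d / (1 - r2_d + PI2_6 * r2_d) # Equation 20: ~0.33
LR = -E * math.log(1 - r2_oq)                    # Equation 18: ~64
print(round(1 - math.exp(-LR / n), 3))           # Equation 6:  0.052
```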
5.2.2 Step 3: criterion (i) ‐ ensuring a global shrinkage factor of 0.9
Based on an anticipated R²_CS,adj of 0.051, for a model with 25 predictor parameters and a targeted expected shrinkage of 0.9, the sample size required is

n = 25/((0.9 − 1) ln(1 − 0.051/0.9)) = 4286
5.2.3 Step 4: criterion (ii) ‐ ensuring a small absolute difference in the apparent and adjusted R²_Nagelkerke
We now want δ ≤ 0.05 for the difference in the apparent and adjusted R²_Nagelkerke. Recall, assuming an exponential model for simplicity, we calculated that the max(R²_CS) = 0.37. Then, using Equation 26, we obtain

S_VH = 0.051/(0.051 + (0.05 × 0.37)) = 0.73

As this is below the 0.9 chosen for criterion (i), the sample size required for criterion (ii) is smaller than that for criterion (i), and so criterion (ii) is already satisfied.
5.2.4 Step 5: criterion (iii) ‐ ensure precise estimate of overall risk
Applying the approach of Equation 28 (with δ = 0.05), based on the overall event rate of 0.065 per person‐year in the Ensor data, also gives a much smaller sample size requirement than criterion (i).
5.2.5 Step 6: minimum sample size that ensures all criteria are met
The largest sample size required was 4286 participants to meet criterion (i), which therefore provides the minimum sample size required for developing our new model. This assumes the new cohort will have a similar follow‐up, censoring rate, and event rate to that reported by Ensor et al, where the mean follow‐up per person was 2.07 years, 13.42% of individuals had a VTE recurrence by end of follow‐up, and the event rate was 0.065.47
Then, the required 4286 participants corresponds to about 4286 × 2.07 = 8872 person‐years of follow‐up, and 8872 × 0.065 ≈ 577 outcome events, and thus an EPP of 577/25 ≈ 23. This is over twice the “EPP of at least 10” rule of thumb. Figure 2 shows that an EPP of 10 only ensures a shrinkage factor of 0.79, which would reflect relatively large overfitting.
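The Figure 2 relationship between EPP and expected shrinkage can be sketched by solving Equation 10 for S_VH with fixed‐point iteration; the assumptions here are p = 25, R²_CS,adj = 0.051, and an overall event proportion of 0.1342, as in the Ensor data.

```python
import math

p, r2_adj, phi = 25, 0.051, 0.1342

def expected_shrinkage_for_epp(epp: float) -> float:
    """Solve Equation 10, S = 1 + p / (n ln(1 - R2_adj/S)), for the n implied by the EPP."""
    n = epp * p / phi  # participants giving E = epp * p events
    s = 0.9            # starting value; simple fixed-point iteration converges here
    for _ in range(200):
        s = 1 + p / (n * math.log(1 - r2_adj / s))
    return s

print(round(expected_shrinkage_for_epp(10), 2))  # ~0.8 (read as 0.79 from Figure 2)
print(round(expected_shrinkage_for_epp(23), 2))  # ~0.9, matching the minimum EPP of 23
```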
FIGURE 2 Expected global shrinkage factor (S_VH) for increasing EPP, for a model with p = 25 predictor parameters and an anticipated R²_CS,adj of 0.051
5.2.6 What if the sample size is not achievable?
If the required sample size is larger than achievable, the temptation may be to simply assume a larger R²_CS,adj value (as this is anticonservative for criterion (i), it lowers the calculated sample size). Rather, to ensure an S_VH of 0.9 (ie, an expected shrinkage of 10%), the researcher should lower p by reducing the number of candidate predictors. For example, predictors could be prioritised based on previous evidence (eg, systematic reviews). After data collection, unsupervised learning techniques such as principal component analysis may be useful, which are blinded to the outcome data. Figure 3 shows how changing p changes the required sample size to meet criterion (i). For example, if a researcher was restricted to a sample size of about 2000 participants, then they would need to reduce p to 12 to ensure an expected shrinkage of 0.90. This is because, for an S_VH of 0.9 and R²_CS,adj of 0.051, the sample size required is

n = 12/((0.9 − 1) ln(1 − 0.051/0.9)) = 2057
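Mirroring Figure 3, looping Equation 11 over p shows the trade‐off directly (R²_CS,adj = 0.051, S_VH = 0.9):

```python
import math

for p in (25, 20, 15, 12):
    n = p / ((0.9 - 1) * math.log(1 - 0.051 / 0.9))
    print(p, round(n))  # 25 -> 4286, 20 -> 3429, 15 -> 2572, 12 -> 2057
```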
FIGURE 3 Sample size required to meet criterion (i) (expected shrinkage S_VH = 0.9) for different numbers of predictor parameters (p), based on an anticipated R²_CS,adj of 0.051
6 POTENTIAL ADDITIONAL CRITERION: PRECISE ESTIMATES OF PREDICTOR EFFECTS
Ideally, predictions should also be precise across the entire spectrum of predicted values, not just at the mean. This is challenging to achieve, but is helped by ensuring the sample size will give precise estimates of the effects of key predictors;50 hence, this may form a further criterion for researchers to check (ie, in addition to criteria (i) to (iii)). Briefly, for a particular predictor of a binary or time‐to‐event outcome, the sample size required to precisely estimate its association with the outcome (ie, an odds ratio or hazard ratio) depends on the assumed magnitude of this effect, the variability of the predictor's values across subjects, the predictor's correlation with other predictors in the model, and the overall outcome proportion in the study.51-53 Ideally, we want to ensure a sample size that gives a precise confidence interval around the predictor's effect estimate.54 However, this is taxing, as closed‐form solutions for the variance of adjusted log odds ratio or hazard ratios, from logistic and Cox regression, respectively, are nontrivial. One solution is to use simulation‐based evaluations.54, 55 However, perhaps a more practical option is to utilise readily available power‐based sample size calculations that calculate the sample size required to detect (based on statistical significance) a predictor's effect for a chosen type I error level (eg, 0.05) and power.51-53, 56 As such sample size calculations are likely to be less stringent than those based on confidence interval width (especially for predictors with large effect sizes), we might use a high power, say of 95%, in the calculation.
Checking sample size for predictor effects will be laborious with many predictors, and so it may be practical to focus on the subset of key predictors with smallest variance of their values, as these predictors will have the least precision. In particular, when there are important categorical predictors but with few subjects and/or outcome events in some categories, substantially larger sample sizes may be needed to avoid separation issues (ie, no event or nonevents in some categories).57 In addition, any predictors whose effect is small (and thus harder to detect), but still important, may warrant special attention.
For example, returning to the VTE prediction model from Section 5.2, a key predictor in the original model by Ensor et al was age,47 with an adjusted log hazard ratio of −0.0105. Although this is close to zero, as age is on a continuous scale, the impact of age on outcome risk is potentially large; for example, it corresponds to an adjusted hazard ratio of 0.66 comparing two individuals aged 40 years apart. Based on the results presented by Ensor et al,47 the standard deviation of age was 15.21 and the overall outcome occurrence by end of follow‐up was 13.5%. Based on these values, and assuming other included predictors explain 20% of the variation in age, then the sample size approach of Hsieh and Lavori52 suggests 4718 subjects are required to have 95% power to detect a prognostic effect for age. This is larger than the 4286 subjects required to meet criterion (i), and so, to be extra stringent beyond criteria (i) to (iii), the researcher might raise the recommended sample size to 4718 subjects, if possible.
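For readers wishing to reproduce this check, here is a minimal Python sketch of the Hsieh and Lavori calculation (as we understand their formula; the function name is ours):

```python
from statistics import NormalDist

def n_cox_continuous_predictor(log_hr: float, sd_x: float, event_prop: float,
                               r2_with_others: float, alpha: float = 0.05,
                               power: float = 0.95) -> float:
    """Hsieh & Lavori: events = (z_{1-a/2} + z_power)^2 / ((log_hr*sd_x)^2 (1 - R2));
    dividing by the event proportion converts required events into participants."""
    z = NormalDist().inv_cdf
    events = (z(1 - alpha / 2) + z(power)) ** 2 / (
        (log_hr * sd_x) ** 2 * (1 - r2_with_others))
    return events / event_prop

print(round(n_cox_continuous_predictor(-0.0105, 15.21, 0.135, 0.20)))  # ~4718
```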
7 DISCUSSION
Sample size calculations for prediction models of binary and time‐to‐event outcomes are typically based on blanket rules of thumb, such as at least 10 EPP, which generate much debate and criticism.14, 16, 57 In this article, building on our related work for linear regression,10 we have proposed an alternative approach that identifies the sample size, events, and EPP required to meet three key criteria, which minimise overfitting whilst ensuring precise estimates of overall outcome risk. Criterion (i) aims to ensure the optimism of predictor effect estimates is small, as defined by a global shrinkage factor of ≥ 0.9. This idea extends the work of Harrell who suggests that, after a model is developed, if the shrinkage estimate "falls below 0.9, for example, we may be concerned with the lack of calibration the model may experience on new data."15 Our premise is the same, except we focused on calculating the expected shrinkage before data collection, to inform sample size calculations for a new study. Criterion (ii) extends this idea to ensure the optimism is small on the R²_Nagelkerke scale, such that there is a difference of ≤ 5% in the apparent and adjusted percentage of variation explained by the model. Lastly, criterion (iii) ensures the sample size will precisely estimate the overall outcome risk, which is fundamental.
By utilising the model's anticipated Cox‐Snell R2, the sample size calculations are essentially tailored to the model and setting at hand, because the Cox‐Snell R2 reflects many factors including the outcome proportion (ie, outcome prevalence or cumulative incidence) and the overall fit (performance) of the model. It therefore reflects the traits of a particular model and setting, unlike a blanket EPP rule.16 In our examples, the sample sizes required often differed considerably from an EPP of 10, reinforcing the idea that this rule is too simplistic.57 Indeed, the required EPP was much higher in our second example (23) than in our first (4.8), illustrating the problem with a blanket EPP rule trying to cover all situations.14, 16-18
Section 3 also showed how to obtain a realistic value for Cox‐Snell R2 based on previous models to make our proposal more achievable in practice. If no previous prediction model exists for the outcome and setting of interest, then information might be used from studies in a related setting or using a different but similar outcome definition or time points to those intended for the new model. Information can also be borrowed from predictor finding studies (eg, studies aiming to estimate the prognostic effect of a particular predictor adjusted for other predictors58). Typically, these studies apply multivariable modelling, and although mainly focused on predictor effect estimates, they often report the C statistic and pseudo‐R2 values.
Further research is needed to help researchers when there are no existing studies or information to identify a sensible value of the expected Cox‐Snell R2. Medical diagnosis and prediction of health‐related outcomes are, generally speaking, low signal‐to‐noise ratio situations. It is not uncommon in these situations to see R²_Nagelkerke values in the 0.1 to 0.2 range. Therefore, in the absence of any other information, we suggest that sample sizes be derived assuming the value of R²_CS,adj corresponds to an R²_Nagelkerke of 0.15 (ie, R²_CS,adj = 0.15 × max(R²_CS)). An exception is when predictors include "direct" (mechanistic) measurements, such as including the baseline version of the binary or ordinal outcome (eg, including smoking status at baseline when predicting smoking status at 1 year), or direct measures of the processes involved (eg, including physiologic function of patients in intensive care when predicting risk of death within 48 hours). Then, in this special situation, a larger R²_Nagelkerke value may be a more appropriate default choice.
The rule of having an EPP of at least 10 stems from limited simulation studies examining the bias and precision of predictor effects in the prediction model.11-13 Jinks et al41 alternatively developed sample size formulae for a time‐to‐event prediction model based on the D statistic.40 They suggest predefining the D statistic that would be expected; then, based on a desired significance level or confidence interval width, their formulae provide the number of events required to achieve this. However, their method does not account for the number of candidate predictors and does not consider the potential for overfitting when developing a model. Our sample size calculations address this, and are meant to be used before any data collection. In situations where a development data set is already available, containing a specific number of participants and predictors, our criteria could be used to identify whether a reduction in the number of predictors is needed before starting model development. Indeed, Harrell already illustrated this concept by using the shrinkage estimate from the full model (including all predictors) to gauge whether the number of predictors should be reduced via data reduction techniques.15 Ideally, this should be done blind to the estimated predictor effects (ie, just calculate the shrinkage factor for the full model, but do not observe the predictor effect estimates and associated p‐values), as otherwise decisions about predictor inclusion are influenced by a "quick look" at the effect estimates from the full model results. Similarly, when planning to use a predictor selection method (such as backwards selection) during model development, researchers should define p as the total number of parameters due to all predictors considered (screened), and not just the subset that are included in the final model.59 As Harrell notes,15 the value of p should be honest.
Section 6 also highlighted the potential additional requirement to ensure precise estimates of key predictor effects. In particular, special attention may be given to those predictors with strong predictive value (and thus most influential to the predicted outcome risk), especially if the variance in their values is small, or when events or nonevents in some categories of the predictor are rare, as this leads to larger sample sizes. For example, van Smeden et al highlighted that “separation” between events and nonevents is an important consideration toward the required sample size, which occurs when a single predictor (or a linear combination of multiple predictors) perfectly separates all events from all nonevents, and thus causes estimation difficulties.57 This may lead to substantially larger EPP to resolve the issue (eg, so that all categories of a predictor have both events and nonevents). For such reasons, we labelled our criteria (i) to (iii) proposal as the “minimum” sample size required.
Further research should identify how our sample size criteria relate to those of van Smeden et al, who focused on sample size in regard to the mean squared error of predictions from the model.60 Specifically, they use simulation to evaluate the characteristics that influence the mean squared prediction error of a logistic model, and identify that the outcome proportion and number of predictors are important,60 in addition to total sample size. This leads to a sample size equation to minimise root mean squared prediction error in a new model development study. Harrell also suggested using simulation to inform sample size, and illustrates this for a logistic regression model with a single predictor.15 For example, one could simulate a very large dataset from an assumed prediction model, and quantify the mean squared (prediction) error and mean absolute (prediction) error of a model developed from this data set. Then, repeat this process, each time removing an individual at random, until a sample size is identified below which the mean squared (prediction) error is unacceptable.
In summary, we have proposed criteria for identifying the minimum sample size required when developing a prediction model for binary or time‐to‐event outcomes. We hope this, and our related paper,10 encourages researchers to move away from rules of thumb, and to rather focus on attaining sample sizes that minimise overfitting and ensure precise estimates of overall risk within the model and setting of interest. We are currently writing software modules to implement the approach.
ACKNOWLEDGEMENTS
We wish to thank two reviewers and the Associate Editor for their constructive comments which helped improve the article upon revision. Danielle Burke and Kym Snell are funded by the National Institute for Health Research School for Primary Care Research (NIHR SPCR). The views expressed are those of the authors and not necessarily those of the NHS, the NIHR, or the Department of Health. Karel G.M. Moons receives funding from the Netherlands Organisation for Scientific Research (project 9120.8004 and 918.10.615). Frank Harrell's work on this paper was supported by CTSA (award UL1 TR002243) from the National Centre for Advancing Translational Sciences. Its contents are solely the responsibility of the authors and do not necessarily represent official views of the National Centre for Advancing Translational Sciences or the US National Institutes of Health. Gary Collins was supported by the NIHR Biomedical Research Centre, Oxford.
REFERENCES