Design of data collection
Ideally, the data needed to develop a new prediction model come from a prospective study, performed in study participants who share most of the clinical characteristics of the target patients for the model (i.e. to ensure generalizability of the model) [3]. In diagnostic model development, this means that a sample of patients suspected of having the disease is included, whereas a prognostic model requires subjects who might develop a specific health outcome over a certain time period. For example, the prognostic VTE recurrence prediction models were developed from prospective cohorts of VTE patients at risk of a recurrent event [40, 79].
Randomized clinical trials (RCTs) are in fact more stringently selected prospective cohorts. Data from RCTs can thus also be used for prognostic model development, yet, given the stringent inclusion and exclusion criteria, generalizability may be hampered [14, 18]. In contrast, a retrospective cohort design is prone to incomplete data collection, as information on the predictors and outcomes is commonly obtained less systematically, and is therefore more likely to yield biased prediction models. The traditional case–control design is hardly suitable for risk prediction model development (and validation). However, a nested case–control or case–cohort design can be chosen for prediction modeling studies in specific circumstances, such as a rare outcome or expensive predictor measurements [41–43].
Predictors
Many patient-related variables (e.g. sex, age, comorbidities, severity of disease, test results) that are known or assumed to be related to the targeted outcome may be studied as predictors. Out of all such potential predictors, a selection of the most relevant candidate predictors has to be chosen for inclusion in the analyses, especially when the number of subjects with the outcome is relatively small, as we will describe below (see Tables 2 and 3: of all characteristics of patients suspected of PE, we chose to include only seven predictors in our analyses). In contrast to etiological study designs, in which only causally related variables are considered, non-causal variables can also be highly predictive of outcomes [14]. For example, one of the predictors of the Wells diagnostic PE rule is tachycardia (see Tables 2 and 3). Although there is no causal relation between tachycardia and PE, its predictive ability is substantial.
Table 2. Univariable analyses of each candidate diagnostic predictor compared with the presence or absence of the outcome (pulmonary embolism, PE; reference standard)

| Predictor | N (PE yes, n = 73) | Sens (%) (95% CI) | PPV (%) (95% CI) | N (PE no, n = 525) | Spec (%) (95% CI) | NPV (%) (95% CI) |
|---|---|---|---|---|---|---|
| Clinical signs and symptoms of DVT | 26 | 36 (26–47) | 46 (33–58) | 31 | 94 (92–96) | 91 (89–93) |
| PE most likely diagnosis | 61 | 84 (73–90) | 18 (15–23) | 272 | 48 (44–52) | 95 (92–97) |
| Heart rate > 100 beats min⁻¹ | 25 | 34 (24–46) | 23 (16–31) | 86 | 84 (80–87) | 90 (87–92) |
| Recent immobilization or surgery | 23 | 32 (22–43) | 24 (17–34) | 71 | 86 (83–89) | 90 (87–92) |
| Previous DVT or PE | 18 | 25 (16–36) | 21 (14–31) | 66 | 87 (84–90) | 89 (86–92) |
| Hemoptysis | 5 | 7 (3–15) | 24 (11–45) | 16 | 97 (95–98) | 88 (85–91) |
| Presence of malignancy | 5 | 7 (3–15) | 19 (9–38) | 21 | 96 (94–97) | 88 (85–91) |
| Positive result D-dimer test | 70 | 96 (89–99) | 24 (20–29) | 219 | 58 (54–62) | 99 (97–100) |
Table 3. Multivariable diagnostic model to confirm or reject the diagnosis pulmonary embolism

| Predictor | Model 1 (basic model): Regression coefficient (SE) | OR (95% CI) | P-value | Model 2 (basic model + D-dimer): Regression coefficient (SE) | OR (95% CI) | P-value |
|---|---|---|---|---|---|---|
| Intercept | −3.75 (0.34) | – | – | −5.88 (0.66) | – | – |
| Clinical signs and symptoms DVT | 1.93 (0.33) | 6.9 (3.6–13.2) | < 0.01 | 1.50 (0.36) | 4.5 (2.2–9.0) | < 0.01 |
| PE most likely diagnosis | 1.32 (0.34) | 3.8 (1.9–7.3) | < 0.01 | 1.23 (0.36) | 3.4 (1.7–6.9) | < 0.01 |
| Heart rate > 100 beats min⁻¹ | 0.90 (0.31) | 2.4 (1.3–4.5) | < 0.01 | 0.56 (0.33) | 1.7 (0.9–3.3) | 0.09 |
| Recent immobilization or surgery | 0.71 (0.32) | 2.0 (1.1–3.8) | 0.03 | 0.61 (0.35) | 1.8 (0.9–3.6) | 0.08 |
| Previous DVT or PE | 0.91 (0.34) | 2.5 (1.3–4.8) | < 0.01 | 0.88 (0.37) | 2.4 (1.2–5.0) | 0.02 |
| Positive result D-dimer test | NA | NA | NA | 3.11 (0.61) | 22.3 (6.8–73.1) | < 0.01 |
Predictors that are difficult to measure, or that have high interobserver variability, might not be suitable for inclusion in a prediction model, because such variability will reduce the predictive ability of the model when it is applied to other individuals. A subjective predictor like ‘other diagnosis less likely’ of the Wells PE rule might be scored differently by residents and by more experienced physicians. Furthermore, it is of utmost importance to define predictors accurately and to describe their measurement in a standardized way. This enhances the applicability and predictive stability of the developed prediction model across multiple populations and settings [33].
Continuous predictors (such as the D-dimer level in the Vienna prediction model [8], blood pressure or weight) can be used in prediction models, but preferably should not be recoded as categorical variables. Converting a continuous variable into categories often causes substantial information loss [44, 45]. Moreover, the thresholds chosen for categorization are usually driven by the development data at hand, making the developed prediction model unstable and less generalizable when applied to other individuals. Continuous predictors should thus be kept continuous, although it is important to assess the linearity or shape of the predictor–outcome association and to transform the predictor if necessary [13, 16, 44–46].
The decision on which candidate predictors to select for a prediction model development study is mainly based on prior knowledge, either clinical or from the literature. Preferably, predictor selection should not be based on the statistical significance of the predictor–outcome association in univariable analysis [12, 13, 47, 48] (see also the section on actual modeling). Also, it is often tempting to include as many predictors as possible in the model development. But if the number of outcome events in the data set is limited, there is a high chance of erroneously including predictors in the model, based on chance alone [12, 13, 47, 48]. To prevent this, although it is not based on firm scientific evidence, one might apply as a rule of thumb the so-called ‘EPV (events per variable) 1:10’ rule: at most one candidate predictor per 10 outcome events in the data set, to secure reliable prediction modeling [49–51]. Other methods to limit the number of candidate predictors are to combine several related variables into one single predictor, or to remove candidate predictors that are highly correlated with others [13].
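The EPV 1:10 rule of thumb reduces to simple arithmetic. A minimal sketch (the function name is ours, not from the source):

```python
# Rough sample-size check using the 'EPV 1:10' rule of thumb:
# at most one candidate predictor per 10 outcome events.
def max_candidate_predictors(n_events: int, epv: int = 10) -> int:
    """Upper bound on the number of candidate predictors for a data set."""
    return n_events // epv

# The PE example data set (Table 2) contains 73 outcome events,
# so roughly 7 candidate predictors can be supported.
print(max_candidate_predictors(73))  # -> 7
```

This matches the seven predictors retained in the worked example of Tables 2 and 3.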
Outcome
The outcome of a prediction model should be chosen such that it reflects a clinically significant and patient-relevant health state, for example, death (yes or no), or the absence or presence of (recurrent) pulmonary embolism. A clear and comprehensive predefined outcome definition limits the potential for bias. This includes a proper protocol for standardized (blinded or independent) outcome assessment [4]. In prognostic prediction research, a clearly defined follow-up period is needed during which the development of the outcome is assessed. For example, the PESI score, developed to identify PE patients with a low risk of short-term mortality in whom outpatient treatment may be safe, used 30 days of follow-up to assess the outcome of PE recurrence or mortality during that period [10].
Missing data
As in all types of research, missing data on predictors or outcomes are unavoidable in prediction research as well [52, 53]. This can influence the model development, as missing data frequently follow a selective pattern in which the missingness of predictor results is related to other variables or even to the outcome [54–59]. Removal of all participants with missing values is not sensible, as the non-random pattern of missing data inevitably causes an undesired non-random selection of the participants with complete data as well. Moreover, it reduces the effective sample size. As a consequence, the model will be prone to inaccurate (biased) and attenuated effect size estimations. Guidelines recommend filling in these missing data using imputation techniques [55, 58–61]. These techniques use all available information on a patient, and on similar patients, to estimate the most likely value of the missing test results or outcomes in patients with missing data.
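To make the contrast concrete, the sketch below compares complete-case analysis (listwise deletion) with a deliberately oversimplified single mean imputation. The data values are hypothetical, and real prediction research should use multiple imputation as the cited guidelines recommend; this only illustrates that imputation retains patients who would otherwise be discarded.

```python
# Complete-case analysis discards every patient with any missing value;
# imputation retains them. (Single mean imputation shown only for
# illustration; guidelines recommend multiple imputation.)
def complete_cases(rows):
    """Keep only rows without missing (None) values."""
    return [r for r in rows if None not in r]

def mean_impute(rows, col):
    """Replace missing values in one column by the observed mean."""
    observed = [r[col] for r in rows if r[col] is not None]
    mean = sum(observed) / len(observed)
    return [tuple(mean if (i == col and v is None) else v
                  for i, v in enumerate(r)) for r in rows]

# Hypothetical rows of (D-dimer level, age); one D-dimer value is missing.
data = [(500.0, 64), (None, 71), (2200.0, 58), (900.0, 80)]
print(len(complete_cases(data)))   # 3 patients survive listwise deletion
print(len(mean_impute(data, 0)))   # all 4 are retained after imputation
```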
A predictor with many missing values, however, suggests difficulties in acquiring data on that predictor, even in a research setting. In clinical practice, that specific variable will likely be frequently missing as well, and one might question whether it is prudent to include such a predictor in a prediction model.
Actual modeling
Prediction models are usually derived using multivariable regression techniques, and many books and papers have been written on how to develop a prediction model [12, 13, 16, 62]. In brief, a binary outcome commonly calls for a logistic regression model for diagnostic or short-term (e.g. 1-month or 3-month) prognostic outcomes, or survival modeling for long-term, time-to-event prognostic outcomes. There are two generally accepted strategies to arrive at the final model, yet there is no consensus on the optimal method to use [12–14, 16].
The full model approach includes all candidate predictors not only in the multivariable analysis but also in the final prediction model, that is, no predictor selection whatsoever is applied. The main advantage of this approach is a bypass of improper predictor selection due to chance (predictor selection bias) [13]. The difficulty remains, however, to adequately preselect the predictors for inclusion in the modeling and requires much prior knowledge [16, 17].
To overcome this issue, the second method uses predictor selection in the multivariable analyses, either by backward elimination of ‘redundant’ predictors or by forward selection of ‘promising’ ones. The backward procedure (see Table 3) starts with the full multivariable model (all predictors included, taking into account the ‘EPV 1:10’ rule addressed above) and then subsequently removes predictors based on a predefined removal criterion, for example, the Akaike Information Criterion (AIC) or a nominal significance level (based on the so-called log-likelihood ratio (LR) test) [12]. If predictors are added to the multivariable model one by one, this is called forward selection. With this approach, however, some variables will not be considered at all, and thus the overall model (that is, the model with all candidate predictors) is never assessed. If selection is applied, backward selection is therefore often preferred over forward selection. Strict selection (e.g. based on the often used significance level of < 0.05) will lead to a low number of predictors in the final model, but also increases the risk of unintentional exclusion of relevant predictors, or of inclusion of spurious predictors that by chance were significant in the development data set. This risk increases when the data set is relatively small and/or the number of candidate predictors relatively large [12, 13, 18]. Conversely, the use of less stringent exclusion criteria (e.g. P < 0.25) leaves more predictors, but potentially also less important ones, in the model.
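The backward procedure can be sketched generically. In the schematic below, `fit_aic` stands in for fitting a logistic regression on the listed predictors and returning its AIC; the toy scoring function at the bottom is entirely hypothetical and exists only so the selection loop has something to minimize.

```python
# Schematic backward elimination: starting from the full model, repeatedly
# drop the predictor whose removal gives the lowest AIC, and stop when no
# removal improves (lowers) the AIC.
def backward_select(predictors, fit_aic):
    current = list(predictors)
    current_aic = fit_aic(current)
    while len(current) > 1:
        # AIC of every model with exactly one predictor removed
        candidates = [[p for p in current if p != drop] for drop in current]
        best_subset = min(candidates, key=fit_aic)
        best_aic = fit_aic(best_subset)
        if best_aic >= current_aic:   # no removal improves the fit
            break
        current, current_aic = best_subset, best_aic
    return current

# Toy AIC stand-in: reward informative predictors, penalize noise.
informative = {"dvt_signs", "pe_likely", "tachycardia"}
toy_aic = lambda preds: (100 - 5 * len(informative & set(preds))
                         + 2 * len(set(preds) - informative))

selected = backward_select(
    ["dvt_signs", "pe_likely", "tachycardia", "noise1", "noise2"], toy_aic)
print(sorted(selected))  # -> ['dvt_signs', 'pe_likely', 'tachycardia']
```

In practice, `fit_aic` would refit the regression model at each step; a significance-level criterion based on the LR test follows the same loop with a different stopping rule.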
The predictors in the final model, regardless of the selection procedure used, are all considered associated with the targeted outcome, yet their individual contributions to the probability estimation vary. The multivariable modeling assigns each predictor a weight in the probability estimate, mutually adjusted for the other predictors' influence. For each individual, the probability of having or developing the outcome can then be calculated from these regression coefficients (see legend of Table 3).
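Using the coefficients of model 1 in Table 3, this calculation can be sketched as follows (the predictor names in the dictionary are our shorthand labels for the Table 3 rows):

```python
import math

# Predicted PE probability from the basic model (Table 3, model 1):
# logit(p) = intercept + sum of coefficients of the predictors present.
COEFS = {
    "dvt_signs": 1.93,                  # clinical signs and symptoms DVT
    "pe_likely": 1.32,                  # PE most likely diagnosis
    "tachycardia": 0.90,                # heart rate > 100 beats/min
    "immobilization_or_surgery": 0.71,  # recent immobilization or surgery
    "previous_dvt_pe": 0.91,            # previous DVT or PE
}
INTERCEPT = -3.75

def predicted_probability(present):
    """Logistic transformation of the linear predictor."""
    lp = INTERCEPT + sum(COEFS[p] for p in present)
    return 1 / (1 + math.exp(-lp))

# A patient with clinical signs of DVT and a heart rate > 100 beats/min:
p = predicted_probability({"dvt_signs", "tachycardia"})
print(round(p, 2))  # -> 0.28
```

A patient with none of the predictors has the baseline probability implied by the intercept alone, roughly 2%.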
Developed regression models (logistic, survival, or other) might be too complicated for (bedside) use in daily clinical care. To improve user-friendliness, the coefficients are often rounded to numbers that can be easily scored by clinicians (see Table 1, Wells PE score). Such simplification, however, might hamper the accuracy of the model and thus needs to be applied with care [18]. Instead, one may use the original regression equation to create an easy-to-use web-based tool or a nomogram to calculate individual probabilities. As an example of such a comprehensive model presentation in the VTE domain, we refer to the Vienna prediction model nomogram and web-based tool [8].
Model performance measures
A prediction model should be able to correctly distinguish diseased from non-diseased individuals (discrimination) and should produce predicted probabilities that are in line with the actual outcome frequencies or probabilities (calibration). As the clinical potential of a prediction model largely depends on these two properties, both should be assessed and reported as part of the model development.
Discrimination can be expressed as the area under the receiver operating characteristic (ROC) curve (AUC) for a logistic model or the equivalent c-index for a survival model. The AUC (or c-index) represents the probability that, for any pair of individuals, one with and one without the outcome, the predicted outcome probability is higher for the individual with the outcome than for the one without (see Fig. 1). A c-index of 0.5 represents no discriminative ability, whereas 1.0 indicates perfect discrimination [33, 63, 64].
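This pairwise definition translates directly into code. A minimal sketch (ties are conventionally counted as one half):

```python
# The c-index as a pairwise probability: among all (event, non-event)
# pairs, the fraction in which the patient with the event received the
# higher predicted probability; ties count as 0.5.
def c_index(probs_events, probs_nonevents):
    concordant = sum(
        1.0 if pe > pn else 0.5 if pe == pn else 0.0
        for pe in probs_events
        for pn in probs_nonevents
    )
    return concordant / (len(probs_events) * len(probs_nonevents))

# Three patients with the outcome, three without (hypothetical predictions):
print(c_index([0.9, 0.6, 0.8], [0.2, 0.7, 0.1]))  # 8 of 9 pairs concordant
```

For large data sets this O(n²) loop is replaced by rank-based formulas, but the quantity computed is the same.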
A calibration plot provides insight into this calibrating potential of a model. First, individuals are grouped into deciles of increasing model-derived predicted probability of having the outcome. The mean predicted probability per decile is then plotted on the x-axis against the observed outcome frequency on the y-axis (see Fig. 2). A slope equal to 1 (the diagonal) reflects optimal calibration. A formal statistical test examines the so-called ‘goodness-of-fit’. The Hosmer–Lemeshow test is regularly used, but might lack statistical power to detect overfitting [12, 13, 65].
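The grouping step can be sketched as below. The toy data use only two risk groups instead of deciles so the example stays small; the logic is identical for ten.

```python
# Calibration sketch: sort patients by predicted probability, split them
# into equally sized risk groups, and pair each group's mean predicted
# probability with its observed outcome fraction. For a well-calibrated
# model these points lie near the diagonal of the calibration plot.
def calibration_points(predicted, observed, n_groups=10):
    pairs = sorted(zip(predicted, observed))
    size = len(pairs) // n_groups
    points = []
    for g in range(n_groups):
        # last group absorbs any remainder
        chunk = pairs[g * size:(g + 1) * size] if g < n_groups - 1 else pairs[g * size:]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        obs_frac = sum(o for _, o in chunk) / len(chunk)
        points.append((mean_pred, obs_frac))
    return points

# Hypothetical predictions and observed outcomes (1 = event, 0 = no event):
pred = [0.1, 0.2, 0.15, 0.7, 0.8, 0.75]
obs = [0, 0, 1, 1, 1, 0]
print(calibration_points(pred, obs, n_groups=2))
```

Plotting these (x, y) points, with a smoothed curve through them, yields the calibration plot described in the text.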
Internal validation
Independent of the approach used to arrive at the final multivariable model, a major problem in the development phase is the fact that the model has been fitted optimally to the available data. All model development techniques are prone to produce ‘overfitted’, overoptimistic and thus unstable models when applied to other individuals, especially if small data sets (with a limited number of outcomes) or large numbers of predictors are used for model development [12, 13]. To ascertain the best-fitted and most stable model, it is essential to continue with a so-called internal validation procedure [12, 13, 48].
Several techniques are available to evaluate the optimism, or amount of overfitting, in the developed model. The simplest method is to randomly split the data set into a development and a validation set and to compare the model's performance in both. The development sample then consists of only part (e.g. two-thirds) of the original data set, and the (one-third) validation set differs from it only by chance. This method is therefore not only inefficient, as it decreases the statistical power for model development, but it also does not provide a truly independent validation of the model. Hence, this random split-sample method should preferably not be used [16, 18, 22].
A more advanced method that avoids wasting development data is bootstrapping [12, 13, 47]. The aim of this procedure is to mimic random sampling from the source population. Multiple samples (e.g. 500) of the same size as the study sample are drawn with replacement (bootstrap samples). In each sample, all development steps of the model are repeated, which may well yield different models. These bootstrap models are then applied to the original sample. This yields an average estimate of the amount of overfitting, or optimism, in the originally estimated regression coefficients and predictive accuracy measures, which are adjusted accordingly [12, 13].
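The bootstrap optimism estimate can be sketched generically: redevelop the model in each bootstrap sample, compare its apparent performance in that sample with its performance in the original data, and average the difference. The toy "model" below (a sample mean scored by negative mean squared error) is purely illustrative; in a real analysis `develop` would rerun all modeling steps, including predictor selection, and `performance` would be, for example, the c-index.

```python
import random

# Schematic bootstrap optimism correction: the returned value is
# subtracted from the apparent performance of the original model.
def optimism(data, develop, performance, n_boot=500, seed=42):
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample = [rng.choice(data) for _ in data]  # draw with replacement
        model = develop(sample)                    # repeat all modeling steps
        apparent = performance(model, sample)      # performance in bootstrap sample
        tested = performance(model, data)          # performance in original data
        diffs.append(apparent - tested)
    return sum(diffs) / n_boot

# Toy illustration: a sample mean scored by negative mean squared error
# looks better on its own sample, and the procedure detects this.
outcomes = [1.0, 2.0, 3.0, 4.0]
fit_mean = lambda s: sum(s) / len(s)
neg_mse = lambda m, d: -sum((x - m) ** 2 for x in d) / len(d)
print(optimism(outcomes, fit_mean, neg_mse, n_boot=200))
```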
Added value of a new test or biomarker
Often, when developing a prediction model, there is a particular interest in estimating the added diagnostic or prognostic predictive value of a new biomarker or (e.g. imaging) test result beyond existing or established predictors. A good predictive value of such a biomarker or test result by itself, that is, in isolation, is no guarantee of relevant added predictive value when it is combined with the standard predictors [64, 66–70]. Preferably, the new biomarker should be modeled as an extension or supplement to the existing predictors. When developing a diagnostic prediction model following the work-up in practice, this step-by-step model extension approach is rather sensible: the information of each subsequent test or biomarker result is explicitly added to the previously obtained information, and the target disease probability is adjusted (see Table 3 (model 2), Figs 1 and 2).
Measures of discrimination such as the AUC (or c-statistic) are insensitive for detecting small improvements in model performance, especially if the AUC of the basic model is already large [26, 35, 64, 69, 70]. Therefore, other measures have been suggested to evaluate the added value of a new biomarker or (imaging) test. Reclassification tables (see Table 4) provide insight into the improvement in correct classification of patients. By quantifying the extent to which an extended model improves the correct classification of patients into diseased or non-diseased categories, compared with the basic model, the net reclassification improvement (NRI) estimates the added value of the biomarker. Improved classification (NRI > 0.0) indicates that, with the extended model, more diseased patients are categorized as high probability and more non-diseased patients as low probability [69, 71].
Table 4. Quantifying the added value of a D-dimer test using a reclassification table

PE yes (n = 73)

| Model 1 without D-dimer | Model 2 with D-dimer: ≤ 25% | Model 2 with D-dimer: > 25% | Total |
|---|---|---|---|
| ≤ 25% | 22 | 17 | 39 |
| > 25% | 1 | 33 | 34 |
| Total | 23 | 50 | 73 |

PE no (n = 525)

| Model 1 without D-dimer | Model 2 with D-dimer: ≤ 25% | Model 2 with D-dimer: > 25% | Total |
|---|---|---|---|
| ≤ 25% | 445 | 47 | 492 |
| > 25% | 13 | 20 | 33 |
| Total | 458 | 67 | 525 |
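The NRI follows directly from the off-diagonal counts in Table 4: among the 73 patients with PE, 17 were correctly reclassified upward and 1 incorrectly downward; among the 525 without PE, 13 were correctly reclassified downward and 47 incorrectly upward. A minimal sketch of the calculation:

```python
# Net reclassification improvement from reclassification-table counts:
# (fraction of events moving up minus moving down) plus
# (fraction of non-events moving down minus moving up).
def nri(up_events, down_events, n_events,
        up_nonevents, down_nonevents, n_nonevents):
    event_term = (up_events - down_events) / n_events
    nonevent_term = (down_nonevents - up_nonevents) / n_nonevents
    return event_term + nonevent_term

# Counts from Table 4 (25% threshold, model 1 vs model 2 with D-dimer):
value = nri(17, 1, 73, 47, 13, 525)
print(round(value, 3))  # -> 0.154
```

The positive value indicates that adding the D-dimer result, on balance, moves patients into the correct risk category.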
The NRI is highly dependent on the categorization of the probability threshold(s): different thresholds may result in very different NRIs for the same added test. To overcome this problem of arbitrary cut-off choices, another option is to calculate the so-called integrated discrimination improvement (IDI), which considers the magnitude of the improvement or worsening in reclassification probability by a new test over all possible categorizations or probability thresholds [12, 69, 72].
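One common way to compute the IDI, sketched below with entirely hypothetical predicted probabilities, is as the change in the mean predicted probability among events minus the change among non-events, for the extended versus the basic model:

```python
# Integrated discrimination improvement (IDI): how much the extended model
# raises mean predicted probabilities in events and lowers them in
# non-events, relative to the basic model. No thresholds are involved.
def idi(p_new_events, p_old_events, p_new_nonevents, p_old_nonevents):
    mean = lambda xs: sum(xs) / len(xs)
    gain_events = mean(p_new_events) - mean(p_old_events)
    gain_nonevents = mean(p_new_nonevents) - mean(p_old_nonevents)
    return gain_events - gain_nonevents

# Hypothetical: the extended model raises predictions in two patients with
# the outcome and lowers them in two patients without it.
value = idi([0.6, 0.8], [0.5, 0.6], [0.1, 0.2], [0.2, 0.3])
print(round(value, 2))  # -> 0.25
```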