• clinical prediction rule;
  • D-dimer;
  • probability;
  • risk assessment;
  • venous thromboembolism


Contents

  1. Summary
  2. Introduction
  3. Risk prediction models: definition and rationale
  4. Development
  5. Validation
  6. Impact and implementation
  7. Concluding remarks
  8. Disclosure of Conflict of Interest
  9. References

Summary

Risk prediction models can be used to estimate the probability of either having (diagnostic model) or developing (prognostic model) a particular disease or outcome. In clinical practice, these models are used to inform patients and guide therapeutic management. Examples from the field of venous thromboembolism (VTE) include the Wells rules for patients suspected of deep venous thrombosis and pulmonary embolism, and more recently prediction rules to estimate the risk of recurrence after a first episode of unprovoked VTE. In this paper, we describe the three phases that are recommended before a prediction model may be used in daily practice: development, validation, and impact assessment. In the development phase, the model is constructed, commonly using multivariable logistic (diagnostic) or survival (prognostic) regression analysis, and its performance is expressed in terms of discrimination, calibration, and (re-)classification. In the validation phase, the developed model is tested in a new set of patients using these same performance measures. This is important, as model performance is commonly poorer in a new set of patients, for example because of case-mix or domain differences. Finally, in the impact phase, the ability of a prediction model to actually guide patient management is evaluated. Whereas single cohort designs are preferred in the development and validation phases, this last phase calls for comparative, ideally randomized, designs: therapeutic management and outcomes after using the prediction model are compared with a control group not using the model (e.g. usual care).


Introduction

In recent years, risk prediction models have become increasingly popular to aid clinical decision-making. These models are developed to estimate the probability of having (a diagnostic prediction model) or developing (a prognostic prediction model) a certain outcome (e.g. disease, event, complication) in an individual, given the individual's demographics, test results, or disease characteristics. The probability estimates can guide care providers, as well as the individuals themselves, in deciding upon further management [1-4]. In the field of venous thromboembolism (VTE), well-known prediction models are those developed by Wells and colleagues. These rules aid the diagnostic process in patients suspected of deep venous thrombosis (DVT) or pulmonary embolism (PE) (see Table 1) [5, 6]. Yet many more prediction models have been developed in the VTE domain, such as the prognostic models to assess recurrence risk in patients who suffered a VTE [7-9], the Pulmonary Embolism Severity Index (PESI) for short-term mortality risk in PE patients [10], and various other diagnostic models for both DVT and PE, for example those developed by Oudega et al. [11] or Aujesky et al. [10] (see Box 1 for several examples from the VTE domain).

Table 1. Risk prediction models (clinical decision rules) in the diagnosis of DVT and pulmonary embolism

Wells DVT [5]                                          Points assigned
  Active cancer                                        1
  Recent surgery                                       1
  Entire leg swollen                                   1
  Unilateral calf swelling > 3 cm                      1
  Unilateral edema                                     1
  Collateral superficial veins                         1
  Alternative diagnosis as likely or more likely       −2

Wells PE [6]                           Regression coefficient   Points assigned
  Clinical signs and symptoms of DVT   1.8                      3.0
  Other diagnosis less likely          1.5                      3.0
  Heart rate > 100 beats min−1         1.1                      1.5
  Recent immobilization or surgery     0.92                     1.5
  Previous DVT or PE                   0.87                     1.5
  Active cancer                        0.81                     1.0

Notes:
  1. DVT unlikely (score ≤ 1 and low D-dimer): 33% (1268/3875) [90].
  2. DVT present if DVT unlikely: 0.7% (95% CI 0.3–1.3%) [90].
  3. PE unlikely (score ≤ 4 and qualitative D-dimer negative): 42% of suspected patients (95% CI 33–52%) [39].
  4. PE present if PE unlikely: 1.7% (95% CI 1.0–2.8%) [39].

With the increasing number of risk prediction models developed and reported each year, the methodology for developing, validating, and implementing these models receives increasing attention, as reflected in recent books and series of publications [4, 12-22]. Unfortunately, publication of a prediction model is no guarantee of its quality, as various recent reviews have shown [23-27]. The very recent PROGRESS series reviews common shortcomings in model development and reporting [22]. The fact that multiple prediction models are being developed for a single clinical question, outcome, or target population suggests that there is still a tendency toward developing ever more new models, rather than first validating existing models or adjusting them to new circumstances. As a consequence, there is still a huge mismatch between the number of papers on model development and the number on validation, and even more so on the implementation of prediction models [22, 28-30].

In this article, we review the literature on methods for developing, validating, and assessing the impact of prediction models, building on three recent series of such papers [4, 14-18, 31]. We illustrate this throughout with examples from the diagnostic and prognostic VTE domain, complemented with empirical data on a diagnostic model for PE. We stress that the empirical data, based on a recent publication of a validation study of the Wells PE rule [6] for suspected PE in primary care [32], are used for illustration purposes only, and are by no means intended to define the best diagnostic model or work-up for suspected PE or to compare our results with existing reports on the topic. Our sole aim here is to illustrate the methods used in prediction modeling, to improve understanding and interpretation of such studies.

Risk prediction models: definition and rationale


Risk prediction models estimate the risk (absolute probability) of the presence or absence of an outcome or disease in individuals based on their clinical and non-clinical characteristics [1-3, 12, 33, 34]. Depending on the amount of time until outcome assessment, prediction research can be diagnostic (outcome or disease present at this moment) or prognostic (outcome occurs within a specified time frame). Although we illustrate some of our methods with empirical data of a diagnostic modeling study, the methods described in this article for prediction model development, validation, and impact assessment can be mutatis mutandis applied to both situations [18].

In clinical diagnostic practice, doctors incorporate information from history-taking, clinical examination, laboratory or imaging test results to judge and determine whether or not a suspected patient has the targeted disease. In essence, prediction model development mimics this diagnostic work-up by combining all this patient information, further summarized as predictors of the outcome, in a statistical multivariable model [2, 12, 33, 35-38].

For each unique combination of predictors, a prediction model provides an estimated probability that allows for risk stratification of individuals or groups. Hence, it can guide physicians in deciding upon further diagnostic tests or treatments. For example, patients with a high probability of having a disease might be suitable candidates for further testing, whereas in low-probability patients it might be more effective to refrain from further testing. For instance, the combination of the Wells PE rule and a negative D-dimer test can safely rule out PE in about 40% of all patients suspected of having PE. In these patients further testing can be withheld, thus improving the efficiency of the diagnostic process [39].
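As an illustration, this score-and-threshold logic can be sketched in a few lines of Python. The points are taken from Table 1 (Wells PE rule); the dictionary keys are our own shorthand, and the 'PE unlikely' rule (score ≤ 4 plus a negative D-dimer) follows the notes to Table 1:

```python
# Sketch of the Wells PE rule (points from Table 1) combined with a dichotomous
# D-dimer test to classify a suspected patient as "PE unlikely" or not.
WELLS_PE_POINTS = {
    "clinical_signs_dvt": 3.0,
    "other_diagnosis_less_likely": 3.0,
    "heart_rate_gt_100": 1.5,
    "recent_immobilization_or_surgery": 1.5,
    "previous_dvt_or_pe": 1.5,
    "active_cancer": 1.0,
}

def wells_pe_score(findings):
    """Sum the points of every finding that is present."""
    return sum(pts for item, pts in WELLS_PE_POINTS.items() if findings.get(item))

def pe_unlikely(findings, d_dimer_positive):
    """'PE unlikely' when the score is <= 4 and the D-dimer test is negative."""
    return wells_pe_score(findings) <= 4 and not d_dimer_positive

patient = {"heart_rate_gt_100": True, "previous_dvt_or_pe": True}
print(wells_pe_score(patient))                       # 3.0
print(pe_unlikely(patient, d_dimer_positive=False))  # True
```

In roughly 40% of suspected patients this combination classifies PE as unlikely, in whom further imaging can then be withheld [39].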

The urge to develop a prediction model usually starts with a clinical question on how to tailor further management given the patient's probability of having or developing a certain outcome or disease. For example, patients with unprovoked VTE might benefit from prolonged anticoagulant therapy, but, because of the associated bleeding risk, only those at high risk of recurrence. Several tools have been developed to estimate the prognostic probability of developing a recurrent VTE, to determine whether or not secondary prevention is indicated in a subset of patients [7-9, 40].

As addressed previously, to become clinically valuable, a prediction model ideally follows three clearly distinct steps, namely development, validation, and impact/ implementation [12, 14, 18, 22, 28, 34].


Development

Design of data collection

Ideally, the data needed to develop a new prediction model come from a prospective study, performed in study participants who share most of the clinical characteristics with the target patients for the model (i.e. generalizability of the model) [3]. For diagnostic model development, this means including a sample of patients suspected of having the disease, whereas a prognostic model requires subjects who might develop a specific health outcome over a certain time period. For example, the prognostic VTE recurrence prediction models were developed from prospective cohorts of VTE patients at risk of a recurrent event [7-9, 40].

Randomized clinical trials (RCTs) are in fact more stringently selected prospective cohorts. Data from RCTs can thus also be used for prognostic model development, yet—given the stringent inclusion and exclusion criteria—there is a chance of hampered generalizability [14, 18]. In contrast, a retrospective cohort design is prone to incomplete data collection as information on the predictors and outcomes is commonly less systematically obtained and therefore more prone to yield biased prediction models. The traditional case-control design is hardly suitable for risk prediction model development (and validation). However, a nested case-control or case-cohort design can be chosen for prediction modeling studies in specific circumstances, like a rare outcome or expensive predictor measurements [41-43].


Predictors

Many patient-related variables (e.g. sex, age, comorbidities, severity of disease, test results) that are known or assumed to be related to the targeted outcome may be studied as predictors. Out of all such potential predictors, a selection of the most relevant candidate predictors has to be made for inclusion in the analyses, especially when the number of subjects with the outcome is relatively small, as we describe below (see Tables 2 and 3: of all characteristics of patients suspected of PE, we chose to include only seven predictors in our analyses). In contrast to etiological study designs, in which only causally related variables are considered, non-causal variables can also be highly predictive of outcomes [14]. For example, one of the predictors of the Wells diagnostic PE rule is tachycardia (see Tables 2 and 3). Although there is no causal relation between tachycardia and PE, its predictive ability is substantial.

Table 2. Univariable analyses of each candidate diagnostic predictor against the presence or absence of the outcome (PE)

                                      PE yes (n = 73)                         PE no (n = 525)
Predictor                             N    Sens % (95% CI)  PPV % (95% CI)    N    Spec % (95% CI)  NPV % (95% CI)
Clinical signs and symptoms of DVT    26   36 (26–47)       46 (33–58)        31   94 (92–96)       91 (89–93)
PE most likely diagnosis              61   84 (73–90)       18 (15–23)        272  48 (44–52)       95 (92–97)
Heart rate > 100 beats min−1          25   34 (24–46)       23 (16–31)        86   84 (80–87)       90 (87–92)
Recent immobilization or surgery      23   32 (22–43)       24 (17–34)        71   86 (83–89)       90 (87–92)
Previous DVT or PE                    18   25 (16–36)       21 (14–31)        66   87 (84–90)       89 (86–92)
Hemoptysis                            5    7 (3–15)         24 (11–45)        16   97 (95–98)       88 (85–91)
Presence of malignancy                5    7 (3–15)         19 (9–38)         21   96 (94–97)       88 (85–91)
Positive result D-dimer test          70   96 (89–99)       24 (20–29)        219  58 (54–62)       99 (97–100)

DVT, deep venous thrombosis; PE, pulmonary embolism; CI, confidence interval; Sens, sensitivity; Spec, specificity; PPV, positive predictive value; NPV, negative predictive value. To illustrate the development steps of a risk prediction model, we use data from a study in which the Wells PE rule was validated in a primary care setting; 598 patients suspected of having pulmonary embolism were included in the analysis. History-taking, clinical examination, and a dichotomous D-dimer test were performed in all participants (see reference [32] for a detailed discussion of all study logistics). Presence or absence of the outcome pulmonary embolism (PE) was assessed by a composite reference standard, including spiral CT scanning, V/Q scanning, angiography, and 3 months of follow-up. Although heart rate and D-dimer concentration are continuous variables, they were analyzed here as dichotomous predictors, in accordance with the original definition in the diagnostic model (heart rate) and the dichotomous test used in the study (D-dimer). Note that no single predictor is sufficient to confirm or reject the diagnosis of pulmonary embolism. Of note, these univariable analyses were not used to select predictors for the multivariable model. The diagnostic accuracy measures are defined as follows: sensitivity, the proportion of all diseased patients correctly classified as such based on the predictor; specificity, the proportion of all non-diseased patients correctly classified as such; PPV, the probability of having PE given that the predictor classified the patient as having PE; NPV, the probability of not having PE given that the predictor classified the patient as not having PE.
Table 3. Multivariable diagnostic model to confirm or reject the diagnosis of pulmonary embolism

                                  Model 1 (basic model)                     Model 2 (basic model + D-dimer)
Predictor                         Coefficient (SE)  OR (95% CI)     P       Coefficient (SE)  OR (95% CI)      P
Intercept                         −3.75 (0.34)      –               –       −5.88 (0.66)      –                –
Clinical signs and symptoms DVT   1.93 (0.33)       6.9 (3.6–13.2)  < 0.01  1.50 (0.36)       4.5 (2.2–9.0)    < 0.01
PE most likely diagnosis          1.32 (0.34)       3.8 (1.9–7.3)   < 0.01  1.23 (0.36)       3.4 (1.7–6.9)    < 0.01
Heart rate > 100 beats min−1      0.90 (0.31)       2.4 (1.3–4.5)   < 0.01  0.56 (0.33)       1.7 (0.9–3.3)    0.09
Recent immobilization or surgery  0.71 (0.32)       2.0 (1.1–3.8)   0.03    0.61 (0.35)       1.8 (0.9–3.6)    0.08
Previous DVT or PE                0.91 (0.34)       2.5 (1.3–4.8)   < 0.01  0.88 (0.37)       2.4 (1.2–5.0)    0.02
Positive result D-dimer test      NA                NA              NA      3.11 (0.61)       22.3 (6.8–73.1)  < 0.01

DVT, deep venous thrombosis; PE, pulmonary embolism; OR, odds ratio; CI, confidence interval; SE, standard error; NA, not applicable. Model development started with the seven candidate predictors of the Wells PE rule (see Table 2), plus the D-dimer test to quantify its added value (see Table 4). With 73 PE cases in total (Table 2), the 1:10 events-per-variable (EPV) rule allows no more candidate predictors than this, to prevent an overfitted model. Five predictors remained in the final Model 1 after backward selection using the AIC: in this data set, active malignancy and hemoptysis had no independent value beyond the other predictors. The intercept reflects the baseline risk; the regression coefficient reflects the relative weight of each predictor. The exponent of a regression coefficient yields the odds ratio (OR) of the predictor. An OR of 2.0 for 'recent immobilization or surgery' in Model 1 indicates that the odds of PE in a patient suspected of PE are twice as high when the predictor is present as when it is absent, all other predictors kept constant. To illustrate the effect of adding a new diagnostic biomarker to an existing prediction model (Model 1), we present a second model. Adding a dichotomous D-dimer test (Model 2) yields regression coefficients that differ from the first model: the changed coefficients reflect that the history-taking and physical examination predictors are correlated with the D-dimer result, and that the D-dimer test contributes most to the predicted probability. For each individual, the predicted probability of PE can be calculated as: probability of PE = exp(lp)/(1 + exp(lp)), where 'lp' is the linear predictor: the intercept (baseline risk) plus the sum of all predictors multiplied by their regression coefficients.
For example, for a patient in whom the physician considers PE the most likely diagnosis and who has a heart rate of 120 beats per minute, while all other predictors are absent (no signs and symptoms of DVT, no recent immobilization or surgery, no previous DVT or PE), Model 1 (without D-dimer) yields: lp = −3.75 + (0 × 1.93) + (1 × 1.32) + (1 × 0.90) + (0 × 0.71) + (0 × 0.91) = −1.53; pPE = exp(−1.53)/(1 + exp(−1.53)) = 17.8%. The probability of PE for this patient is thus approximately 18%.
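The probability calculation of the worked example can be reproduced in a short Python sketch using the Model 1 coefficients from Table 3 (the variable names are our own shorthand):

```python
import math

# Model 1 (basic model, without D-dimer): intercept and coefficients from Table 3.
INTERCEPT = -3.75
COEFS = {
    "clinical_signs_dvt": 1.93,
    "pe_most_likely": 1.32,
    "heart_rate_gt_100": 0.90,
    "recent_immobilization_or_surgery": 0.71,
    "previous_dvt_or_pe": 0.91,
}

def predicted_probability(findings):
    """Probability of PE = exp(lp) / (1 + exp(lp)), lp = intercept + sum of
    coefficients of the predictors that are present."""
    lp = INTERCEPT + sum(c for name, c in COEFS.items() if findings.get(name))
    return math.exp(lp) / (1 + math.exp(lp))

# Worked example: PE most likely diagnosis plus a heart rate > 100 beats per minute.
p = predicted_probability({"pe_most_likely": True, "heart_rate_gt_100": True})
print(round(p, 3))  # 0.178
```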

Predictors that are difficult to measure, or have high interobserver variability, might not be suitable for inclusion in a prediction model because this will influence the predictive ability of the model when applied in other individuals. A subjective predictor like ‘other diagnosis less likely’ of the Wells PE rule might be scored differently by residents and more experienced physicians. Furthermore, it is of utmost importance to define predictors accurately and to describe the measurements in a standardized way. This enhances applicability and predictive stability across multiple populations or settings of the prediction model to be developed [33].

Continuous predictors (such as the D-dimer level in the Vienna prediction model [8], blood pressure, or weight) can be used in prediction models, but preferably should not be converted into categorical variables. Categorization often causes substantial loss of information [44, 45]. Moreover, the chosen thresholds are usually driven by the development data at hand, making the developed prediction model unstable and less generalizable when applied to other individuals. Continuous predictors should thus be kept continuous, although it is important to assess the linearity or shape of the predictor–outcome association and to transform the predictor if necessary [13, 16, 44-46].

The decision on which candidate predictors to select for a prediction model development study is mainly based on prior knowledge, clinical or from the literature. Preferably, predictor selection should not be based on the statistical significance of the predictor–outcome association in univariable analysis [12, 13, 47, 48] (see also the section on actual modeling). Also, it is often tempting to include as many predictors as possible in the model development, but if the number of outcome events in the data set is limited, there is a high chance of erroneously including predictors in the model based on chance alone [12, 13, 47, 48]. To prevent this, although not based on firm scientific evidence, one might apply the so-called '1:10 events per variable (EPV)' rule of thumb: for each candidate predictor, the data set should contain at least 10 outcome events to secure reliable prediction modeling [49-51]. Other methods to limit the number of candidate predictors are to combine several related variables into one single predictor, or to remove candidate predictors that are highly correlated with others [13].
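As a minimal illustration of this rule of thumb:

```python
# The 'EPV 1:10' rule of thumb: at most one candidate predictor
# per ten outcome events in the development data set.
def max_candidate_predictors(n_events, events_per_variable=10):
    return n_events // events_per_variable

# With the 73 PE cases of the example study, at most seven candidate
# predictors can be considered.
print(max_candidate_predictors(73))  # 7
```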


Outcome

The outcome of a prediction model has to be chosen such that it reflects a clinically significant and patient-relevant health state, for example death, or the absence or presence of (recurrent) pulmonary embolism. A clear and comprehensive predefined outcome definition limits the potential for bias; this includes a proper protocol for standardized (blinded or independent) outcome assessment [4]. For prognostic prediction research, a clearly defined follow-up period is needed within which the development of the outcome is assessed. For example, the PESI score, developed to identify PE patients with a low risk of short-term mortality in whom outpatient treatment may be safe, used 30 days of follow-up to assess the outcome of PE recurrence or mortality during that period [10].

Missing data

As in all types of research, missing data on predictors or outcomes are unavoidable in prediction research [52, 53]. This can influence model development, as missing data frequently follow a selective pattern in which the missingness of predictor results is related to other variables or even to the outcome [54-59]. Removing all participants with missing values is not sensible: the non-random pattern of missing data inevitably causes a non-desired, non-random selection of the participants with complete data, and it reduces the effective sample size. As a consequence, the model will be prone to inaccurate (biased) and attenuated effect size estimates. Guidelines therefore recommend handling missing data with imputation techniques [55, 58-61]. These techniques use all available information on a patient, and on similar patients, to estimate the most likely value of the missing test results or outcomes.
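As a deliberately simplified illustration of the idea behind imputation (the cited guidelines recommend more sophisticated multiple-imputation techniques, which also propagate the uncertainty of the imputed values), a single mean imputation per predictor column could look like this; the data are invented:

```python
# Deliberately minimal sketch: single mean imputation per predictor column.
# Real prediction research should use multiple imputation (e.g. chained
# equations) rather than this simple column-mean fill-in.
def impute_column_means(rows):
    """rows: list of dicts; None marks a missing predictor value."""
    cols = {key for row in rows for key in row}
    means = {}
    for col in cols:
        observed = [row[col] for row in rows if row.get(col) is not None]
        means[col] = sum(observed) / len(observed)
    return [
        {col: (row.get(col) if row.get(col) is not None else means[col]) for col in cols}
        for row in rows
    ]

data = [
    {"age": 60, "d_dimer": 1.0},
    {"age": 70, "d_dimer": None},   # missing D-dimer value
    {"age": None, "d_dimer": 0.5},  # missing age
]
completed = impute_column_means(data)
print(completed[1]["d_dimer"])  # 0.75 (mean of 1.0 and 0.5)
```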

A predictor with many missing values, however, suggests difficulties in acquiring data on that predictor, even in a research setting. In clinical practice that specific variable will likely be frequently missing as well, and one might question whether it is prudent to include such a predictor in a prediction model.

Actual modeling

Prediction models are usually derived using multivariable regression techniques, and many books and papers have been written on how to develop a prediction model [12, 13, 16, 62]. In brief, a binary outcome commonly calls for a logistic regression model for diagnostic or short-term (e.g. 1 or 3 months) prognostic outcomes, or for survival modeling for long-term, time-to-event prognostic outcomes. There are two generally accepted strategies to arrive at the final model, yet there is no consensus on the optimal method [12-14, 16].

The full model approach includes all candidate predictors not only in the multivariable analysis but also in the final prediction model, that is, no predictor selection whatsoever is applied. The main advantage of this approach is a bypass of improper predictor selection due to chance (predictor selection bias) [13]. The difficulty remains, however, to adequately preselect the predictors for inclusion in the modeling and requires much prior knowledge [16, 17].

To overcome this issue, the second method applies predictor selection in the multivariable analyses, either by backward elimination of 'redundant' predictors or by forward selection of 'promising' ones. The backward procedure (see Table 3) starts with the full multivariable model (all candidate predictors included, respecting the 'EPV 1:10' rule addressed above) and subsequently removes predictors based on a predefined criterion, for example the Akaike Information Criterion (AIC) or a nominal significance level for removal (based on the log-likelihood ratio (LR) test) [12]. If predictors are added to the multivariable model one by one, this is called forward selection. With this approach, however, some variables will not be considered at all, and thus the overall model (that is, the model with all candidate predictors) is never assessed. If selection is applied, backward selection is therefore often preferred over forward selection. Strict selection (e.g. based on the commonly used significance level of < 0.05) leads to a low number of predictors in the final model, but also increases the risk of unintentionally excluding relevant predictors, or of including spurious predictors that were significant in the development data set by chance alone. This risk increases when the data set is relatively small and/or the number of candidate predictors relatively large [12, 13, 18]. Conversely, a less stringent exclusion criterion (e.g. P < 0.25) leaves more predictors, but potentially also less important ones, in the model.
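The backward elimination loop described above can be sketched generically. Here `aic` stands in for refitting the regression model on a reduced predictor set; the toy AIC function below is purely illustrative (only predictors 'a' and 'b' carry log-likelihood):

```python
# Skeleton of backward elimination driven by the Akaike Information Criterion:
# start from the full model and repeatedly drop the predictor whose removal
# lowers the AIC most, stopping when no removal improves it.
def backward_eliminate(predictors, aic):
    current = set(predictors)
    while True:
        best_candidate, best_aic = None, aic(current)
        for p in current:
            candidate_aic = aic(current - {p})
            if candidate_aic < best_aic:
                best_candidate, best_aic = p, candidate_aic
        if best_candidate is None:
            return current
        current = current - {best_candidate}

# Toy stand-in for model refitting: AIC = 2k - 2*log-likelihood, where only
# the informative predictors 'a' and 'b' contribute to the log-likelihood.
def toy_aic(preds):
    log_likelihood = 10 * len(preds & {"a", "b"})
    return 2 * len(preds) - 2 * log_likelihood

print(sorted(backward_eliminate({"a", "b", "c", "d"}, toy_aic)))  # ['a', 'b']
```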

The predictors of the final model, regardless of the selection procedure used, are considered all associated with the targeted outcome, yet the individual contribution to the probability estimation varies. The multivariable modeling assigns the weight of each predictor, mutually adjusted for each other's influence, to the probability estimate. For each individual, the probability of having or developing the outcome can then be calculated based on these regression coefficients (see legend Table 3).

Developed regression models, whether logistic, survival, or other, might be too complicated for (bedside) use in daily clinical care. To improve user-friendliness, the coefficients are often rounded to numbers that can be easily scored by clinicians (see Table 1, Wells PE score). Such simplification, however, might hamper the accuracy of the model and thus needs to be applied with care [18]. Instead, one may use the original regression equation to create an easy-to-use web-based tool or nomogram to calculate individual probabilities. As an example of such a comprehensive model presentation in the VTE domain, we refer to the Vienna prediction model nomogram and web-based tool [8].
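For illustration only (this is not the procedure used to derive the Wells points), one simple way of turning regression coefficients into a bedside point score is to scale each coefficient by the smallest one and round, here to the nearest half point, using the Model 1 coefficients from Table 3:

```python
# One possible simplification scheme (a sketch, not the Wells derivation):
# divide each coefficient by the smallest one and round to the nearest half
# point, trading a little accuracy for bedside usability.
def to_points(coefficients):
    smallest = min(abs(c) for c in coefficients.values())
    return {name: round(2 * c / smallest) / 2 for name, c in coefficients.items()}

model1 = {  # Model 1 coefficients from Table 3
    "clinical_signs_dvt": 1.93,
    "pe_most_likely": 1.32,
    "heart_rate_gt_100": 0.90,
    "recent_immobilization_or_surgery": 0.71,
    "previous_dvt_or_pe": 0.91,
}
print(to_points(model1))
```

The resulting points no longer reproduce the fitted probabilities exactly, which is why a nomogram or web-based calculator based on the original equation is often preferable.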

Model performance measures

A prediction model should be able to distinguish diseased from non-diseased individuals correctly (discrimination) and should produce predicted probabilities that are in line with the actual outcome frequencies or probabilities (calibration). As the clinical potential of a prediction model largely depends on these two items, both should be assessed and reported as part of the model development.

Discrimination can be expressed as the area under the receiver-operating characteristic (ROC) curve (AUC) for a logistic model, or the equivalent c-index in a survival model. The AUC (or c-index) represents the chance that, for two individuals, one with and one without the outcome, the predicted outcome probability will be higher for the individual with the outcome than for the one without (see Fig. 1). A c-index of 0.5 represents no discriminative ability, whereas 1.0 indicates perfect discrimination [33, 63, 64].
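This pairwise interpretation of the c-index translates directly into code; the sketch below counts concordant (event, non-event) pairs, scoring ties as one half, on invented predicted probabilities:

```python
from itertools import product

# c-statistic as a pairwise comparison: among all (event, non-event) pairs,
# the fraction in which the event patient received the higher predicted
# probability; tied predictions count as one half.
def c_statistic(p_events, p_nonevents):
    pairs = list(product(p_events, p_nonevents))
    concordant = sum(1.0 if pe > pn else 0.5 if pe == pn else 0.0
                     for pe, pn in pairs)
    return concordant / len(pairs)

# 8 of 9 pairs are concordant: c = 8/9, roughly 0.89.
print(c_statistic([0.9, 0.6, 0.4], [0.5, 0.2, 0.1]))
```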


Figure 1. Receiver-operating curves (ROCs) for the model without and with D-dimer testing. The overall discriminative abilities of both models can be assessed using receiver-operating curves. The sensitivities and 1-specificities of both models over all possible probability thresholds are presented in this graph. The higher the areas under these ROCs are, the better the overall discriminative performance of the model with a maximum of 1 and a minimum of 0.5 (diagonal reference line).


A calibration plot provides insight into the calibration of a model. First, individuals are grouped into deciles of increasing model-derived predicted probability of having the outcome. The mean predicted probability of each decile is then plotted on the x-axis against the observed outcome frequency on the y-axis (see Fig. 2). A slope equal to 1 (the diagonal) reflects optimal calibration. A formal statistical test examines the so-called 'goodness-of-fit'. The Hosmer–Lemeshow test is regularly used, but might lack statistical power to detect overfitting [12, 13, 65].
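The points behind such a calibration plot can be computed as follows; this is a sketch on invented toy data (quartiles rather than deciles, to keep the example small):

```python
# Data behind a calibration plot: rank patients by predicted probability,
# split them into equal-sized risk groups, and pair each group's mean
# predicted probability with its observed outcome frequency.
def calibration_points(predictions, outcomes, n_groups=10):
    ranked = sorted(zip(predictions, outcomes))
    size = len(ranked) // n_groups
    points = []
    for g in range(n_groups):
        group = ranked[g * size:(g + 1) * size] if g < n_groups - 1 else ranked[g * size:]
        mean_pred = sum(p for p, _ in group) / len(group)
        observed = sum(y for _, y in group) / len(group)
        points.append((mean_pred, observed))
    return points

preds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = [0, 0, 1, 0, 1, 1, 0, 1]
print(calibration_points(preds, outcomes, n_groups=4))
```

Plotting these (mean predicted, observed) pairs against the diagonal yields the calibration curve; systematic deviation from the diagonal signals miscalibration.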


Figure 2. Calibration curve of model 2 (basic model + D-dimer). After addition of the D-dimer test to the basic model (see Table 3), the calibration of model 2 was assessed. In the ideal situation, the calibration curve follows the diagonal line in the plot: The predicted probability and observed outcome frequency are the same for all individuals. As the model has been fitted optimally in the data set, there is a chance of overfitting. Therefore, bootstrapping was used to shrink the estimates of the model, indicated by the bias-corrected line.


Internal validation

Independent of the approaches used to arrive at the final multivariable model, a major problem in the development phase is the fact that the model has been fitted optimally for the available data. All model development techniques are prone to produce ‘overfitted’ or overoptimistic and thus unstable models when applied in other individuals, especially if small data sets (limited number of outcomes) or large numbers of predictors are used for model development [12, 13]. To ascertain the best-fitted and most stable model, it is essential to continue with a so-called internal validation procedure [12, 13, 48].

Several techniques are available to evaluate the optimism or amount of overfitting in the developed model. The simplest method is to randomly split the data set into a development and a validation set and to compare model performance between the two. However, the development sample then consists of only part (e.g. two-thirds) of the original data set, while the remaining (one-third) validation set differs from it only by chance. The method therefore not only lacks efficiency, as it reduces the statistical power for model development, but also does not provide an independent validation of the model. Hence, this random split-sample method should preferably not be used [16, 18, 22].

A more advanced method that avoids wasting development data is bootstrapping [12, 13, 47]. The aim of this procedure is to mimic random sampling from the source population. Multiple samples (e.g. 500) of the same size as the study sample are drawn with replacement (the bootstrap). In each sample, all development steps of the model are repeated, which may well yield different models. These bootstrap models are then applied to the original sample. This yields an average estimate of the amount of overfitting or optimism in the originally estimated regression coefficients and predictive accuracy measures, which are adjusted accordingly [12, 13].
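The bootstrap optimism procedure can be sketched generically. Here `develop_model` and `performance` stand in for the full modeling procedure and a performance measure such as the c-index; the toy example below uses a simple mean-and-squared-error stand-in on invented data:

```python
import random

# Bootstrap optimism: repeat the full modelling procedure in samples drawn
# with replacement, score each bootstrap model on its own sample and on the
# original data, and average the performance drop ("optimism").
def bootstrap_optimism(data, develop_model, performance, n_boot=500, seed=42):
    rng = random.Random(seed)
    optimism = []
    for _ in range(n_boot):
        sample = rng.choices(data, k=len(data))  # drawn with replacement
        model = develop_model(sample)            # all development steps repeated
        optimism.append(performance(model, sample) - performance(model, data))
    return sum(optimism) / n_boot

# Toy stand-ins: the "model" is the sample mean, and "performance" is the
# negative mean squared error of that mean on a data set.
data = [1.0, 2.0, 3.0, 4.0, 5.0]
def develop(sample):
    return sum(sample) / len(sample)
def perf(model, dataset):
    return -sum((x - model) ** 2 for x in dataset) / len(dataset)

# A positive result: apparent performance on the bootstrap sample is
# optimistic compared with performance on the original data.
print(bootstrap_optimism(data, develop, perf))
```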

Added value of a new test or biomarker

Often, when developing a prediction model, there is particular interest in estimating the added (diagnostic or prognostic) predictive value of a new biomarker or (e.g. imaging) test result beyond existing or established predictors. A good predictive value of such a biomarker or test result by itself, that is, in isolation, is no guarantee of relevant added predictive value when combined with the standard predictors [64, 66-70]. Preferably, the new biomarker should be modeled as an extension or supplement to the existing predictors. When developing a diagnostic prediction model following the work-up in practice, this step-by-step model extension approach is rather sensible: the information of each subsequent test or biomarker result is explicitly added to the previously obtained information, and the target disease probability is adjusted accordingly (see Table 3 (Model 2), Figs 1 and 2).

Measures of discrimination such as the AUC (or c-statistic) are insensitive to small improvements in model performance, especially if the AUC of the basic model is already large [26, 35, 64, 69, 70]. Therefore, other measures have been suggested to evaluate the added value of a new biomarker or (imaging) test. Reclassification tables (see Table 4) provide insight into the improvement in correct classification of patients. The net reclassification improvement (NRI) estimates the added value of the biomarker by quantifying the extent to which the extended model, compared with the basic model, improves the correct classification of patients into diseased and non-diseased categories. Improved classification (NRI > 0) means that, with the extended model, more diseased patients are categorized as high probability and more non-diseased patients as low probability [69, 71].

Table 4. Quantifying the added value of a D-dimer test using a reclassification table

PE yes (n = 73)

                              Model 2 with D-dimer
  Model 1 without D-dimer     ≤ 25%    > 25%    Total
  ≤ 25%                          22       17       39
  > 25%                           1       33       34

PE no (n = 525)

                              Model 2 with D-dimer
  Model 1 without D-dimer     ≤ 25%    > 25%    Total
  ≤ 25%                         445       47      492
  > 25%                          13       20       33

Based on the a priori chosen cut-off value, patients are classified as having a low or high probability of having PE; in case of a high probability, patients are referred for further diagnostic testing (e.g. spiral CT scanning). For this example study, the cut-off value of a 25% PE risk was chosen arbitrarily for illustration purposes; in clinical practice, a lower cut-off value would obviously be preferred. In this example, classification improved under the model with D-dimer for 17 of 73 (i.e. 0.23) patients who experienced PE events and worsened for 1 of 73 (0.01), giving a net gain in reclassification proportion of 0.22 (i.e. 0.23 − 0.01). Of the patients who did not experience an event, 47 of 525 (0.09) were reclassified worse by the model with D-dimer and 13 of 525 (0.02) better, resulting in a net gain in reclassification proportion of −0.07. The total net gain in reclassification proportion (NRI) therefore was 0.22 − 0.07 = 0.15 (95% CI 0.04–0.27). The integrated discrimination improvement (IDI) was 0.09 (95% CI 0.08–0.11).

The NRI is highly dependent on the categorization chosen for the probability threshold(s): different thresholds may result in very different NRIs for the same added test. To overcome this problem of arbitrary cut-off choices, another option is to calculate the so-called integrated discrimination improvement (IDI), which considers the magnitude of the improvement or worsening in reclassification probabilities by a new test over all possible categorizations or probability thresholds [12, 69, 72].
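The NRI of Table 4 can be reproduced directly from the reclassification counts; the sketch below does exactly that (small differences from the in-text components are due to rounding of the intermediate proportions).

```python
# Reclassification counts from Table 4: events (PE yes, n = 73) and
# non-events (PE no, n = 525) moving between the <=25% and >25% categories
# when the D-dimer result is added to the model.
n_events, n_nonevents = 73, 525
events_up, events_down = 17, 1    # events reclassified higher (better) / lower (worse)
nonev_up, nonev_down = 47, 13     # non-events reclassified higher (worse) / lower (better)

nri_events = (events_up - events_down) / n_events        # about 0.22
nri_nonevents = (nonev_down - nonev_up) / n_nonevents    # about -0.065
nri = nri_events + nri_nonevents                         # about 0.15
```

The IDI cannot be computed from the counts alone; it requires the individual predicted probabilities under both models (the mean difference in predicted risk between models among events, minus the same quantity among non-events).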


Validation

If a developed prediction model shows acceptable or good performance based on the internal validation in the development data set, it is not guaranteed that the model will behave similarly in a different group of individuals [15, 34].

Therefore, it is essential to assess the performance of the prediction model with patient data not used in the development process and preferably selected by different researchers and in different institutes, countries or even clinical settings or protocols. This is commonly referred to as independent or external validation [15, 17, 21, 28, 73, 74].

Essentially, formal external validation comprises estimating, in a new set of individuals, the predicted outcome probabilities with the originally developed model and comparing these predictions with the actual outcomes. Importantly, external validation does not mean repeating the analytic steps or refitting the developed model in the new validation data and then comparing the model performance [15, 17, 22, 74]. Three methods of external validation are available; each can be carried out prospectively, but also retrospectively if data sets with the necessary information on predictors and outcomes are available [15, 17, 22, 28, 34, 73, 74].
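This "apply, do not refit" principle can be sketched as follows. The two-predictor diagnostic model, its coefficients, and the validation case-mix below are all invented for illustration; the point is that the original coefficients are applied unchanged and only the performance is re-measured.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical "originally developed" diagnostic model with fixed coefficients:
# logit(p) = -6.0 + 0.8 * (age in decades) + 1.5 * (D-dimer positive)
b = np.array([-6.0, 0.8, 1.5])

# Simulated validation sample with a slightly different baseline risk (case-mix).
n = 500
X = np.column_stack([
    np.ones(n),
    rng.normal(6.5, 1.0, n),               # age in decades
    (rng.random(n) < 0.4).astype(float),   # D-dimer positive yes/no
])
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ np.array([-6.4, 0.8, 1.5]))))).astype(float)

# External validation: apply the original model as-is, without refitting.
p = 1 / (1 + np.exp(-(X @ b)))

# Calibration-in-the-large: mean predicted risk vs observed event rate.
mean_predicted, observed = p.mean(), y.mean()

# Discrimination in the new sample: c-statistic (proportion of concordant pairs).
pe, pn = p[y == 1], p[y == 0]
c = (pe[:, None] > pn[None, :]).mean() + 0.5 * (pe[:, None] == pn[None, :]).mean()
```

Here the validation sample has a lower baseline risk than the model assumes, so the mean predicted risk overshoots the observed event rate, which is exactly the kind of miscalibration an external validation is meant to expose.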

Temporal validation

Temporal validation may be performed by splitting a large (development) data set non-randomly, based on the moment of participant inclusion [15, 17, 18, 22]. One may argue that this is not a form of independent or external validation but rather a form of non-random split-sample internal validation, as the entire data set is established by the same researchers using the same definitions and measurements. However, it results in more variation between the development and validation sample than random splitting [17]. Prospective evaluation of the model in a new study sample by the same researchers in the same institutions, only later in time, may allow for even more variation [17]. For example, to develop a DVT prediction model for the primary care setting, Oudega et al. [11] studied a large prospective cohort of suspected patients. The newly developed rule was then validated by Toll et al. [75] in largely the same primary care practices, but with participants recruited during a later time period.

Geographical validation

As with temporal validation, one may assess the performance of a prediction model in other institutes or countries by non-randomly splitting a large development data set based on institute or country [17]. A more external or independent validation is obtained when the model is validated in other institutes or countries by different researchers, as was carried out by Klok and colleagues for the revised Geneva score to diagnose PE [76]. Because of the greater variation in case-mix (the inclusion and exclusion criteria chosen) and even in the measurement of predictors and outcome, the latter variant provides a more thorough and independent validation. Obviously, external validations may combine temporal and geographical validation.

Domain validation

Perhaps the most extreme and rigorous form of external validation is assessment of the prediction model in a completely different clinical domain or setting [15, 17, 22, 28, 34, 73, 74]. Domain validation may, for example, comprise a model developed in secondary care and validated in primary care, a model developed in adults and validated in children, or a model developed for predicting fatal events and validated for its ability to predict non-fatal events.

Model updating or adjustment

The external validation procedure provides quantitative information on the discrimination, calibration, and classification of the model in a population that differs from the development population [15, 22, 28, 73, 74]. Ideally, the performance is comparable in the development and validation sample, indicating that the model can be used in the source populations of both [15]. Often, however, the model performance in the new individuals is worse than that found in the development study. This does not mean that further use of the model should be abandoned. While it might be tempting simply to build a new model that will, obviously, fit the data at hand better, this only adds to the unchecked proliferation of models for the same clinical situation [4, 12, 17, 22, 77-79].

There are no strict criteria for defining poor or acceptable performance [28, 58, 73, 74]. If the prediction model is considered to perform poorly, the original model can be adapted to the circumstances of the validation sample [22, 77-79]. These so-called updating methods include a very simple adjustment of the baseline risk, simple adjustment of the predictor weights, re-estimation of the predictor weights, or addition or removal of predictors, and have been described extensively elsewhere [12, 34, 77-80]. The updated prediction model should preferably be externally validated as well [4, 17].
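A minimal sketch of the two simplest updating methods follows (all numbers are hypothetical): re-estimating only the intercept while keeping the predictor weights fixed, and logistic recalibration, which re-estimates the intercept plus one overall slope for the linear predictor of the original model.

```python
import numpy as np

rng = np.random.default_rng(3)

# Linear predictor of the original model, evaluated in a validation sample
# (simulated here). Outcomes are generated with a lower baseline risk than the
# model assumes, so the original model overestimates risk in the new setting.
n = 400
lp = rng.normal(-0.5, 1.2, n)
y = (rng.random(n) < 1 / (1 + np.exp(-(lp - 0.8)))).astype(float)

def fit_offset_intercept(lp, y, iters=500, lr=0.5):
    """Method 1: re-estimate only the intercept, keeping predictor weights
    fixed (a logistic model with lp entering as an offset)."""
    a = 0.0
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(a + lp)))
        a += lr * np.mean(y - p)
    return a

def fit_recalibration(lp, y, iters=2000, lr=0.2):
    """Method 2: logistic recalibration, re-estimating the intercept and a
    single overall calibration slope for the linear predictor."""
    a, s = 0.0, 1.0
    for _ in range(iters):
        p = 1 / (1 + np.exp(-(a + s * lp)))
        a += lr * np.mean(y - p)
        s += lr * np.mean((y - p) * lp)
    return a, s

a = fit_offset_intercept(lp, y)     # should land near the true shift of -0.8
a2, s2 = fit_recalibration(lp, y)   # slope should land near 1.0 here
```

In this simulated setting only the baseline risk differs, so the intercept update alone suffices and the recalibration slope stays close to 1; a slope clearly different from 1 would signal that the predictor weights themselves need re-estimation.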

From a clinical perspective, external validation is often approached differently. Instead of pursuing the best possible fit of the model, the main question is whether patient outcomes, for example the failure rate (incorrect predictions), remain acceptable and adequate when the model is applied in another population. For example, the AMUSE-2 study validated the use of the Wells PE rule in a primary care setting by comparing its efficiency (i.e. the proportion of patients categorized as low risk by the Wells PE rule and D-dimer testing) and safety (i.e. the proportion of ‘missed’ PE cases in this low-risk group) with generally accepted failure rates from secondary care studies [32]. Further updating was not considered.

Impact and implementation


The final step toward implementation of a developed, validated (and, if needed, updated) prediction model is quantification of its impact when it is actually used to direct patient management in clinical care [4, 17, 22, 28, 74]. To what extent does use of the prediction model contribute to (changes in) the behavior and (self-)management of patients and doctors? And ultimately, what are the effects on health outcomes and the cost-effectiveness of care? Although this final step is important for improving health care, reviews have shown that this type of prediction modeling study is performed even less frequently than external validation studies [22, 28, 29].

The largest difference from a validation study is that impact studies require a control group [4, 17, 28]: the effects on decision-making and health outcomes of prediction model guided care must be compared with those of standard care.

A major disadvantage of the ordinary RCT design, in which each consecutive patient is randomized to either the index arm (prediction model guided management) or the control arm (care-as-usual), is the impossibility of blinding and, consequently, the potential learning curve of the treating physicians. This eventually makes the two groups increasingly alike and dilutes the potential effect [4, 17]. Such learning effects are prevented by randomizing clusters rather than patients: all patients within a cluster, for example a doctor or hospital, receive the same type of intervention [81]. Unfortunately, cluster RCTs require more individuals than the standard RCT design to obtain the same power and are therefore often costly to perform.
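The extra sample size a cluster design needs can be quantified with the standard design-effect formula; the cluster size, intracluster correlation, and baseline sample size below are hypothetical.

```python
# Design effect (inflation factor) for a cluster-randomized trial: with m
# patients per cluster and intracluster correlation coefficient rho, the
# sample size of an individually randomized trial is multiplied by
# DE = 1 + (m - 1) * rho.
def design_effect(m, rho):
    return 1 + (m - 1) * rho

n_individual = 600        # patients needed by a standard RCT (hypothetical)
m, rho = 50, 0.05         # e.g. 50 patients per doctor, modest clustering
n_cluster = n_individual * design_effect(m, rho)   # about 2070 patients
```

Even a modest intracluster correlation more than triples the required sample size in this example, which is why cluster trials are often costly.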

The stepped wedge design is an appealing variant of the standard cluster-randomized trial when a new, often complex, intervention has to be implemented in routine care [17, 82]. All clusters (e.g. hospitals) eventually switch from usual care to the intervention, but the exact moment of transition is randomly assigned across the clusters. This design improves statistical efficiency. Moreover, potential problems in implementing the new intervention can be detected early in the course of the trial and reacted upon immediately.

There are also several non-randomized study designs that can be used to assess impact, and these may even be worthwhile before deciding to start a cluster (stepped wedge) RCT [4, 17]. A prospective before–after impact study compares patient outcomes before and after implementation of the prediction model. Although less complex and time-consuming, it is prone to temporal effects and differences between subjects. A before–after study within the same doctors is even simpler: doctors are asked to document the treatment decision before and after exposure to the prediction model for the same patient. No follow-up is involved, and it is easy and cheap to perform, but as with prospective before–after studies, there is the potential for temporal effects. An interesting and cost-efficient alternative for impact studies that would require a long follow-up period is decision analytic modeling [17, 68, 83-85]. This is a mathematical approach that combines information on patient outcomes and health effects, usually from prior RCTs or meta-analyses, with the predictive accuracy measures, and their uncertainty, of the prediction model. Such models estimate the (cost-)effectiveness of implementing the prediction model in daily clinical care, compared with usual care. If the outcomes show that the new prediction model does not improve clinical care and thus patient outcomes, one might question whether an (often costly and time-consuming) trial is worth performing [17, 68]. As an example of a decision analytic model, we refer to the cost-effectiveness analysis of using a primary care clinical decision rule, combined with a qualitative D-dimer test, in suspected DVT [86] (see Box 1).
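The core arithmetic of such a decision-analytic comparison can be sketched in a few lines. All probabilities, costs, and QALY values below are invented for illustration (not taken from the AMUSE-1 analysis); "referral" stands for further testing such as CT, and the small QALY penalty for referral-avoidable testing stands in for radiation and contrast risks.

```python
# Toy decision-analytic comparison of two management strategies:
# expected cost and QALYs per patient, then an incremental comparison.
strategies = {
    #               P(referral)  cost if referred  cost if not  QALY if referred  QALY if not
    "usual care":   (0.95,       1200.0,           150.0,       0.800,            0.805),
    "rule-guided":  (0.55,       1200.0,           150.0,       0.800,            0.805),
}

results = {}
for name, (p_ref, c_ref, c_no, q_ref, q_no) in strategies.items():
    expected_cost = p_ref * c_ref + (1 - p_ref) * c_no
    expected_qaly = p_ref * q_ref + (1 - p_ref) * q_no
    results[name] = (expected_cost, expected_qaly)

d_cost = results["rule-guided"][0] - results["usual care"][0]
d_qaly = results["rule-guided"][1] - results["usual care"][1]
# In this toy the rule-guided strategy costs less without losing health, so it
# dominates; otherwise one would report the incremental cost-effectiveness
# ratio, iCER = d_cost / d_qaly (extra cost per QALY gained).
```

A real analysis would propagate the uncertainty of every input (e.g. by probabilistic sensitivity analysis) rather than use point estimates.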

Concluding remarks


The advantages of using risk prediction models in clinical care, namely more individually tailored risk management and thus increased efficiency and ultimately cost-effectiveness, drive the popularity of developing and using prediction models. Yet, despite this popularity, there is also concern that the use of prediction models will lead to so-called ‘cookbook medicine’, a situation in which the doctor's gut feeling (or gestalt) is completely bypassed by the use of prediction rules [14, 28, 87]. We believe that probabilities estimated by a prediction model should not replace but rather support the doctor's decision-making [4, 14, 17]. In fact, a model should serve as a useful tool to incorporate all the single pieces of information and thereby aid clinical reasoning. Sometimes, physicians are as good as prediction models at identifying those individuals who actually are diseased, whereas prediction models are better or more efficient at identifying those individuals in whom disease can be excluded. For instance, a recent meta-analysis by Lucassen and colleagues showed that gestalt (that is, the estimated probability of a patient being diseased or not, based on clinical reasoning) is just as sensitive as the application of a risk prediction model for ruling out PE, but much less specific [39]. As a consequence, although physicians using their own gut feeling miss no more PE cases, many more patients are unnecessarily referred for spiral CT scanning. Of note, this is not only associated with higher costs but also exposes more patients to the inherent risks of CT scanning: radiation and contrast nephropathy. For an ultimate answer on the cost-effectiveness, and thus the impact, of this model, a decision analytic study should still be performed, as discussed above.

To conclude, we aimed to provide a comprehensive overview of the steps in risk prediction modeling—from development to validation to impact assessment—the preferred methodology per step and the potential pitfalls to overcome. We hope this will guide future research on this topic and enhance applied studies of risk prediction modeling in the field of thrombosis and hemostasis.

Box 1. Examples of risk prediction models in the VTE domain

- Oudega CDR for DVT (R. Oudega): The Wells DVT CDR was not safe in primary care, and therefore a new CDR for primary care was developed. N = 1295; logistic regression*; outcome: confirmed DVT [11].
- Vienna prediction model (S. Eichinger): VTE recurrence risk is high in patients with a first (unprovoked) event, yet the actual risk in individual patients is unknown. N = 929; Cox regression; outcome: recurrence of VTE during a median follow-up of 43.3 months [8].
- PESI model (D. Aujesky): Outpatient treatment of patients with PE may be safe; the PESI model was developed to identify patients with a low risk of short-term mortality in whom this indeed may be safe. N = 10 354; logistic regression; outcome: death from any cause within 30 days [10].
- AMUSE-2 study (G.J. Geersing): This study validated the Wells CDR for PE as a safe tool in patients suspected of PE in a primary care domain. N = 598; safety and efficiency§; outcome: confirmed PE [32].
- Revised Geneva rule for PE (F.A. Klok): The revised Geneva rule for PE was validated in a new cohort of patients. N = 300; c-statistic; outcome: confirmed PE [76].
- Validation study of the Oudega CDR for DVT (D.B. Toll): This study validated the Oudega CDR for DVT in different subgroups, that is, based on age, gender, and previous VTE. N = 2086; safety and efficiency§; outcome: confirmed DVT [75].
- Wells DVT rule with D-dimer (P. Wells): In this landmark RCT, the safety of not performing CUS in patients with a low Wells CDR score and a negative D-dimer test was demonstrated. N = 601; safety and efficiency§; outcome: confirmed DVT [88].
- CEA study for the AMUSE-1 strategy (A. ten Cate-Hoek): In this study, all costs and effects of not referring a patient with suspected DVT and a low score on the Oudega CDR were quantified, demonstrating its cost-effectiveness. NA (cost-effectiveness modeling study); Markov model; outcome: costs per QALY and iCER [86].
- OTPE trial for outpatient PE management (D. Aujesky): This RCT demonstrated that it is safe to treat patients out of hospital if their PESI score is low (PESI classes I and II). N = 344; proportion of outcome in index and control groups; outcome: symptomatic recurrent DVT or PE within 90 days [89].

CDR, clinical decision rule; DVT, deep venous thrombosis; PE, pulmonary embolism; VTE, venous thromboembolism; PESI, Pulmonary Embolism Severity Index; CEA, cost-effectiveness analysis; NA, not applicable as this is a cost-effectiveness modeling study; QALY, quality-adjusted life year; iCER, incremental cost-effectiveness ratio. *Using backward stepwise selection. Clinical variables in the model were selected based on significance levels; for laboratory variables, forward selection was used. The selection process (backward or forward) was not described in detail in the paper. §Most diagnostic validation studies of CDRs in suspected VTE do not routinely evaluate model validity in a formal manner, but focus on clinical outcomes only, that is, safety (the proportion of ‘missed’ VTE cases in a low-risk population) and efficiency (the proportion of patients identified as low risk); see the main text for more detail on the difference between formal model validity analyses and clinical outcome validity. The Wells RCT randomized 601 patients with a low score on the Wells CDR for DVT (score ≤ 1) into a D-dimer testing group and a no D-dimer testing group; in addition, it also randomized 495 DVT-likely patients, but these data are less relevant when the aim is to exclude DVT.


References
  • 1
    Wasson JH, Sox HC, Neff RK, Goldman L. Clinical prediction rules. N Engl J Med 1985; 313: 7939.
  • 2
    Laupacis A, Sekar N, Stiell IG. Clinical prediction rules. A review and suggested modifications of methodological standards. JAMA 1997; 277: 48894.
  • 3
    Grobbee DE, Hoes AW. Clinical Epidemiology- Principles, Methods and Applications for Clinical Research. London: Jones and Bartlett Publishers, 2009.
  • 4
    Moons KG, Altman DG, Vergouwe Y, Royston P. Prognosis and prognostic research: application and impact of prognostic models in clinical practice. BMJ 2009; 338: b606.
  • 5
    Wells PS, Anderson DR, Bormanis J, Guy F, Mitchell M, Gray L, Clement C, Robinson KS, Lewandowski B. Value of assessment of pretest probability of deep vein thrombosis in clinical management. Lancet 1997; 350: 17958.
  • 6
    Wells PS, Andersen DR, Rodger M, Ginsberg JS, Kearon C, Gent M, Turpie AG, Bormanis J, Weitz J, Chamberlain M, Bowie D, Barnes D, Hirsh J. Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: increasing the models utility with the SimpliRED D-dimer. Thromb Haemost 2000; 83: 41620.
  • 7
    Rodger MA, Kahn SR, Wells PS, Anderson DA, Chagnon I, Le Gal G, Solymoss S, Crowther M, Perrier A, White R, Vickars L, Ramsay T, Betancourt MT, Kovacs MJ. Identifying unprovoked thromboembolism patients at low risk for recurrence who can discontinue anticoagulant therapy. CMAJ 2008; 179: 41726.
  • 8
    Eichinger S, Heinze G, Jandeck LM, Kyrle PA. Risk assessment of recurrence in patients with unprovoked deep vein thrombosis or pulmonary embolism: the vienna prediction model. Circulation 2010; 121: 16306.
  • 9
    Tosetto A, Iorio A, Marcucci M, Baglin T, Cushman M, Eichinger S, Palareti G, Poli D, Tait RC, Douketis J. Predicting disease recurrence in patients with previous unprovoked venous thromboembolism: a proposed prediction score (DASH). J Thromb Haemost 2012; 10: 101925.
  • 10
    Aujesky D, Obrosky DS, Stone RA, Auble TE, Perrier A, Cornuz J, Roy PM, Fine MJ. Derivation and validation of a prognostic model for pulmonary embolism. Am J Respir Crit Care Med 2005; 172: 10416.
  • 11
    Oudega R, Moons KG, Hoes AW. Ruling out deep venous thrombosis in primary care. A simple diagnostic algorithm including D-dimer testing. Thromb Haemost 2005; 94: 2005.
  • 12
    Steyerberg EW. Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. New York: Springer, 2008.
  • 13
    Harrell FE. Regression Modeling Strategies. New York: Springer-Verlag, 2001.
  • 14
    Moons KG, Royston P, Vergouwe Y, Grobbee DE, Altman DG. Prognosis and prognostic research: what, why, and how? BMJ 2009; 338: b375.
  • 15
    Altman DG, Vergouwe Y, Royston P, Moons KG. Prognosis and prognostic research: validating a prognostic model. BMJ 2009; 338: b605.
  • 16
    Royston P, Moons KG, Altman DG, Vergouwe Y. Prognosis and prognostic research: developing a prognostic model. BMJ 2009; 338: b604.
  • 17
    Moons KGM, Kengne AP, Grobbee DE, Royston P, Vergouwe Y, Altman DG, Woodward M. Risk prediction models: II. External validation, model updating, and impact assessment. Heart 2012; 98: 6918.
  • 18
    Moons KGM, Kengne AP, Woodward M, Royston P, Vergouwe Y, Altman DG, Grobbee DE. Risk prediction models: I. Development, internal validation, and assessing the incremental value of a new (bio)marker. Heart 2012; 98: 68390.
  • 19
    Hemingway H, Croft P, Perel P, Hayden JA, Abrams K, Timmis A, Briggs A, Udumyan R, Moons KGM, Steyerberg EW, Roberts I, Schroter S, Altman DG, Riley RD. Prognosis research strategy (PROGRESS) 1: a framework for researching clinical outcomes. BMJ 2013; 346: e5595.
  • 20
    Hingorani AD, Windt DA, Riley RD, Abrams K, Moons KGM, Steyerberg EW, Schroter S, Sauerbrei W, Altman DG, Hemingway H. Prognosis research strategy (PROGRESS) 4: stratified medicine research. BMJ 2013; 346: e5793.
  • 21
    Riley RD, Hayden JA, Steyerberg EW, Moons KGM, Abrams K, Kyzas PA, Malats N, Briggs A, Schroter S, Altman DG, Hemingway H. Prognosis research strategy (PROGRESS) 2: prognostic factor research. PLoS Med 2013; 10: e1001380.
  • 22
    Steyerberg EW, Moons KGM, van der Windt DA, Hayden JA, Perel P, Schroter S, Riley RD, Hemingway H, Altman DG. Prognosis research strategy (PROGRESS) 3: prognostic model research. PLoS Med 2013; 10: e1001381.
  • 23
    Bouwmeester W, Twisk JW, Kappen TH, Klei WA, Moons KG, Vergouwe Y. Prediction models for clustered data: comparison of a random intercept and standard regression model. BMC Med Res Methodol 2013; 13: 19.
  • 24
    Mallett S, Royston P, Dutton S, Waters R, Altman D. Reporting methods in studies developing prognostic models in cancer: a review. BMC Med 2010; 8: 20.
  • 25
    Mallett S, Royston P, Waters R, Dutton S, Altman D. Reporting performance of prognostic models in cancer: a review. BMC Med 2010; 8: 21.
  • 26
    Tzoulaki I, Liberopoulos G, Ioannidis JP. Assessment of claims of improved prediction beyond the Framingham risk score. JAMA 2009; 302: 234552.
  • 27
    Collins G, Mallett S, Omar O, Yu L-M. Developing risk prediction models for type 2 diabetes: a systematic review of methodology and reporting. BMC Med 2011; 9: 103.
  • 28
    Reilly BM, Evans AT. Translating clinical research into clinical practice: impact of using prediction rules to make decisions. Ann Intern Med 2006; 144: 2019.
  • 29
    van Dieren S, Beulens JWJ, Kengne AP, Peelen LM, Rutten GEHM, Woodward M, van der Schouw YT, Moons KGM. Prediction models for the risk of cardiovascular disease in patients with type 2 diabetes: a systematic review. Heart 2012; 98: 3609.
  • 30
    Bouwmeester W, Zuithoff NPA, Mallett S, Geerlings MI, Vergouwe Y, Steyerberg EW, Altman DG, Moons KGM. Reporting and methods in clinical prediction research: a systematic review. PLoS Med 2012; 9: e1001221.
  • 31
    Moons KG. Criteria for scientific evaluation of novel markers: a perspective. Clin Chem 2010; 56: 53741.
  • 32
    Geersing GJ, Erkens PMG, Lucassen W, Buller HR, ten Cate H, Hoes AW, Moons KGM, Prins MH, Oudega R, Van Weert HC, Stoffers H. Safe exclusion of pulmonary embolism using the Wells rule and qualitative D-dimer testing in primary care: prospective cohort study. BMJ 2012; 345: e6564.
  • 33
    Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996; 15: 36187.
  • 34
    Toll DB, Janssen KJ, Vergouwe Y, Moons KG. Validation, updating and impact of clinical prediction rules: a review. J Clin Epidemiol 2008; 61: 108594.
  • 35
    Moons KG, de Groot JA, Linnet K, Reitsma JB, Bossuyt PM. Quantifying the added value of a diagnostic test or marker. Clin Chem 2012; 58: 140817.
  • 36
    Moons KG, Biesheuvel CJ, Grobbee DE. Test research versus diagnostic research. Clin Chem 2004; 50: 4736.
  • 37
    Sackett DL, Haynes RB, Guyatt GH, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine, 2nd edn. Boston: Little, Brown and Company, 1991.
  • 38
    Moons KG, Grobbee DE. Diagnostic studies as multivariable, prediction research. J Epidemiol Community Health 2002; 56: 3378.
  • 39
    Lucassen W, Geersing GJ, Erkens PMG, Reitsma JB, Moons KGM, Buller HR, van Weert HC. Clinical decision rules for excluding pulmonary embolism: a meta-analysis. Ann Intern Med 2011; 155: 44860.
  • 40
    Kearon C, Iorio A, Palareti G. On behalf of The Subcommittee on Control of Anticoagulation of The SSC of The ISTH. Risk of recurrent venous thromboembolism after stopping treatment in cohort studies: recommendation for acceptable rates and standardized reporting. J Thromb Haemost 2010; 8: 23135.
  • 41
    Biesheuvel CJ, Vergouwe Y, Oudega R, Hoes AW, Grobbee DE, Moons KG. Advantages of the nested case-control design in diagnostic research. BMC Med Res Methodol 2008; 8: 48.
  • 42
    Rutjes AW, Reitsma JB, Vandenbroucke JP, Glas AS, Bossuyt PM. Case-control and two-gate designs in diagnostic accuracy studies. Clin Chem 2005; 51: 133541.
  • 43
    Ganna A, Reilly M, de Faire U, Pedersen N, Magnusson P, Ingelsson E. Risk prediction measures for case-cohort and nested case-control designs: an application to cardiovascular disease. Am J Epidemiol 2012; 175: 71524.
  • 44
    Altman DG, Royston P. The cost of dichotomising continuous variables. BMJ 2006; 332: 1080.
  • 45
    Royston P, Altman DG, Sauerbrei W. Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med 2006; 25: 12741.
  • 46
    Sauerbrei W, Royston P, Binder H. Selection of important variables and determination of functional form for continuous predictors in multivariable model building. Stat Med 2007; 26: 551228.
  • 47
    Steyerberg EW, Eijkemans MJ, Harrell FE Jr, Habbema JD. Prognostic modelling with logistic regression analysis: a comparison of selection and estimation methods in small data sets. Stat Med 2000; 19: 105979.
  • 48
    Sun GW, Shook TL, Kay GL. Inappropriate use of bivariable analysis to screen risk factors for use in multivariable analysis. J Clin Epidemiol 1996; 49: 90716.
  • 49
    Peduzzi P, Concato J, Feinstein AR, Holford TR. Importance of events per independent variable in proportional hazards regression analysis. II. Accuracy and precision of regression estimates. J Clin Epidemiol 1995; 48: 150310.
  • 50
    Concato J, Peduzzi P, Holford TR, Feinstein AR. Importance of events per independent variable in proportional hazards analysis I. Background, goals, and general strategy. J Clin Epidemiol 1995; 48: 1495501.
  • 51
    Peduzzi P, Concato J, Kemper E, Holford TR, Feinstein AR. A simulation study of the number of events per variable in logistic regression analysis. J Clin Epidemiol 1996; 49: 13739.
  • 52
    Janssen KJ, Donders AR, Harrell FE Jr, Vergouwe Y, Chen Q, Grobbee DE, Moons KGM. Missing covariate data in medical research: to impute is better than to ignore. J Clin Epidemiol 2010; 63: 7217.
  • 53
    Janssen KJ, Vergouwe Y, Donders AR, Harrell FE Jr, Chen Q, Grobbee DE, Moons KGM. Dealing with missing predictor values when applying clinical prediction models. Clin Chem 2009; 55: 9941001.
  • 54
    Vach W. Some issues in estimating the effect of prognostic factors from incomplete covariate data. Stat Med 1997; 16: 5772.
  • 55
    van der Heijden GJ, Donders AR, Stijnen T, Moons KG. Imputation of missing values is superior to complete case analysis and the missing-indicator method in multivariable diagnostic research: a clinical example. J Clin Epidemiol 2006; 59: 11029.
  • 56
    Gorelick MH. Bias arising from missing data in predictive models. J Clin Epidemiol 2006; 59: 111523.
  • 57
    Wood AM, White IR, Royston P. How should variable selection be performed with multiply imputed data? Stat Med 2008; 27: 322746.
  • 58
    Vergouwe Y, Royston P, Moons K, Altman D. Development and validation of a prediction model with missing predictor data: a practical approach. J Clin Epidemiol 2009; 63: 20514.
  • 59
    de Groot JA, Janssen KJ, Zwinderman AH, Moons KG, Reitsma JB. Multiple imputation to correct for partial verification bias revisited. Stat Med 2008; 27: 58809.
  • 60
    Little RA. Regression with missing X's; a review. J Am Stat Assoc 1992; 87: 122737.
  • 61
    Donders AR, van der Heijden GJ, Stijnen T, Moons KG. Review: a gentle introduction to imputation of missing values. J Clin Epidemiol 2006; 59: 108791.
  • 62
    Harrell F, Lee K, Mark D. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med 1996; 15: 36187.
  • 63
    Hanley JA, McNeil BJ. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 1982; 143: 2936.
  • 64
    Cook NR. Use and misuse of the receiver operating characteristic curve in risk prediction. Circulation 2007; 115: 92835.
  • 65
    Hosmer DW, Lemeshow S. Applied Logistic Regression 2nd ed. New York: Wiley, 2000.
  • 66
    Moons KG, van Es GA, Deckers JW, Habbema JD, Grobbee DE. Limitations of sensitivity, specificity, likelihood ratio, and bayes’ theorem in assessing diagnostic probabilities: a clinical example. Epidemiology 1997; 8: 127.
  • 67
    Moons KG, van Es GA, Michel BC, Buller HR, Habbema JD, Grobbee DE. Redundancy of single diagnostic test evaluation. Epidemiology 1999; 10: 27681.
  • 68
    Moons KGM. Criteria for scientific evaluation of novel markers: a perspective. Clin Chem 2010; 56: 53741.
  • 69
    Pencina MJ, D'Agostino RB Sr, D'Agostino RB Jr, Vasan RS. Evaluating the added predictive ability of a new marker: from area under the ROC curve to reclassification and beyond. Stat Med 2008; 27: 15772.
  • 70
    Pepe MS, Feng Z, Huang Y, Longton G, Prentice R, Thompson IM, Zheng Y. Integrating the predictiveness of a marker with its performance as a classifier. Am J Epidemiol 2008; 167: 3628.
  • 71
    Cook NR, Ridker PM. Advances in measuring the effect of individual predictors of cardiovascular risk: the role of reclassification measures. Ann Intern Med 2009; 150: 795802.
  • 72
    Pencina MJ, D'Agostino RB, Vasan RS. Statistical methods for assessment of added usefulness of new biomarkers. Clin Chem Lab Med 2010; 48: 170311.
  • 73
    Justice A, Covinsky K, Berlin J. Assessing the generalizability of prognostic information. Ann Intern Med 1999; 130: 51524.
  • 74
    Altman D, Royston P. What do we mean by validating a prognostic model? Stat Med 2000; 19: 45373.
  • 75
    Toll DB, Oudega R, Vergouwe Y, Moons KG, Hoes AW. A new diagnostic rule for deep vein thrombosis: safety and efficiency in clinically relevant subgroups. Fam Pract 2008; 25: 38.
  • 76
    Klok FA, Kruisman E, Spaan J, Nijkeuter M, Righini M, Aujesky D, Roy PM, Perrier A, Le Gal G, Huisman MV. Comparison of the revised Geneva score with the Wells rule for assessing clinical probability of pulmonary embolism. J Thromb Haemost 2008; 6: 404.
  • 77
    Steyerberg EW, Borsboom GJJM, van Houwelingen HC, Eijkemans MJC, Habbema JDF. Validation and updating of predictive logistic regression models: a study on sample size and shrinkage. Stat Med 2004; 23: 256786.
  • 78
    Janssen KJ, Moons KG, Kalkman CJ, Grobbee DE, Vergouwe Y. Updating methods improved the performance of a clinical prediction model in new patients. J Clin Epidemiol 2008; 61: 76–86.
  • 79
    Janssen KJ, Vergouwe Y, Kalkman CJ, Grobbee DE, Moons KG. A simple method to adjust clinical prediction models to local circumstances. Can J Anaesth 2009; 56: 194–201.
  • 80
    van Houwelingen H. Validation, calibration, revision and combination of prognostic survival models. Stat Med 2000; 19: 3401–15.
  • 81
    Campbell MK, Elbourne DR, Altman DG. CONSORT statement: extension to cluster randomised trials. BMJ 2004; 328: 702–8.
  • 82
    Brown C, Lilford R. The stepped wedge trial design: a systematic review. BMC Med Res Methodol 2006; 6: 54.
  • 83
    Drummond M, Sculpher M, Torrance G, O'Brien B, Stoddart G. Basic types of economic evaluation. In: Methods for the Economic Evaluation of Health Care Programmes. 3rd ed. New York: Oxford University Press, 2005.
  • 84
    Schaafsma JD, van der Graaf Y, Rinkel GJE, Buskens E. Decision analysis to complete diagnostic research by closing the gap between test characteristics and cost-effectiveness. J Clin Epidemiol 2009; 62: 124–8.
  • 85
    den Ruijter HM, Vaartjes I, Sutton-Tyrrell K, Bots ML, Koffijberg H. Long-term health benefits and costs of measurement of carotid intima–media thickness in prevention of coronary heart disease. J Hypertens 2013; 31: 782–90.
  • 86
    ten Cate-Hoek AJ, Toll DB, Buller HR, Hoes AW, Moons KGM, Oudega R, Stoffers HE, van der Velde EF, van Weert HC, Prins MH, Joore MA. Cost-effectiveness of ruling out deep venous thrombosis in primary care versus care as usual. J Thromb Haemost 2009; 7: 2042–9.
  • 87
    Kawamoto K, Houlihan CA, Balas EA, Lobach DF. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 2005; 330: 765.
  • 88
    Wells PS, Anderson DR, Rodger M, Forgie M, Kearon C, Dreyer J, Kovacs G, Mitchell M, Lewandowski B, Kovacs MJ. Evaluation of D-dimer in the diagnosis of suspected deep vein thrombosis. N Engl J Med 2003; 349: 1227–35.
  • 89
    Aujesky D, Roy PM, Verschuren F, Righini M, Osterwalder J, Egloff M, Renaud B, Verhamme P, Stone RA, Legall C, Sanchez O, Pugh NA, N'Gako A, Cornuz J, Hugli O, Beer HJ, Perrier A, Fine MJ, Yealy DM. Outpatient versus inpatient treatment for patients with acute pulmonary embolism: an international, open-label, randomised, non-inferiority trial. Lancet 2011; 378: 41–8.
  • 90
    ten Cate-Hoek AJ, Prins MH. Management studies using a combination of D-dimer test result and clinical probability to rule out venous thromboembolism: a systematic review. J Thromb Haemost 2005; 3: 2465–70.