In medical statistics, many alternative strategies are available for building a prediction model based on training data. Prediction models are routinely compared by means of their prediction performance in independent validation data. If only one data set is available for training and validation, then rival strategies can still be compared based on repeated bootstraps of the same data. Often, however, the overall performance of rival strategies is similar, and it is thus difficult to choose one model. Here, we investigate the variability of the prediction models that results when the same modelling strategy is applied to different training sets. For each modelling strategy, we estimate a confidence score based on the same repeated bootstraps. We obtain a new decomposition of the expected Brier score, as well as estimates of population-average confidence scores. The latter can be used to distinguish rival prediction models with similar prediction performance. Furthermore, at the subject level, a confidence score may provide useful supplementary information for new patients who want to base a medical decision on predicted risk. The ideas are illustrated and discussed using data from cancer studies, including data with a high-dimensional predictor space.
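
The bootstrap idea can be made concrete with a small sketch. The abstract does not define the confidence score or the Brier score decomposition, so everything below is an assumption made for illustration: the modelling strategy is a toy one-predictor logistic regression fitted by gradient descent, the data are simulated, the spread of a subject's predicted risk across bootstrap models stands in for a subject-level variability measure, and the decomposition shown is the generic mean-plus-variance identity rather than the decomposition derived in the paper. All function names are hypothetical.

```python
import math
import random
import statistics


def fit_logistic(xs, ys, steps=100, lr=0.5):
    """Toy modelling strategy (assumption): one-predictor logistic
    regression fitted by a fixed number of gradient descent steps."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            grad_a += p - y
            grad_b += (p - y) * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def predict_risk(model, x):
    """Predicted risk for a subject with predictor value x."""
    a, b = model
    return 1.0 / (1.0 + math.exp(-(a + b * x)))


def bootstrap_predictions(xs, ys, x_new, n_boot=50, seed=1):
    """Apply the same strategy to n_boot bootstrap training sets and
    collect the predicted risk of one new subject from each model."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        model = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(predict_risk(model, x_new))
    return preds


# Simulated toy data (not from the cancer studies mentioned in the text).
rng = random.Random(0)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)]
ys = [1 if rng.random() < 1.0 / (1.0 + math.exp(-x)) else 0 for x in xs]

preds = bootstrap_predictions(xs, ys, x_new=1.0)
mean_risk = statistics.fmean(preds)   # bootstrap-averaged predicted risk
spread = statistics.pstdev(preds)     # variability across bootstrap models

# Generic identity for a subject with observed outcome y_new:
# average squared error over models = error of the averaged prediction
# plus the variance of the predictions across models.
y_new = 1
expected_brier = statistics.fmean([(y_new - p) ** 2 for p in preds])
decomposed = (y_new - mean_risk) ** 2 + spread ** 2

print(f"mean risk {mean_risk:.3f}, spread {spread:.3f}")
print(f"expected Brier {expected_brier:.6f}, decomposed {decomposed:.6f}")
```

Under these assumptions, a small spread means that the bootstrap models largely agree on the subject's risk, while a large spread signals that the prediction depends strongly on which training set happened to be drawn.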