Calibration of probability predictions from machine-learning and statistical models

Aim: Predictions from statistical models may be uncalibrated, meaning that the predicted values do not have the nominal coverage probability. This is most easily seen with probability predictions in machine-learning classification, including the common species occurrence probabilities. Here, a predicted probability of, say, 0.7 should indicate that out of 100 cases with these environmental conditions, and hence the same predicted probability, the species should be present in 70 and absent in 30. Innovation: A simple calibration plot shows that this is not necessarily the case, particularly not for overfitted models or algorithms that use non-likelihood target functions. As a consequence, ‘raw’ predictions from such a model could easily be off by 0.2 and are unsuitable for averaging across model types, and the resulting maps may hence be substantially distorted. The solution, a flexible calibration regression, is simple and can be applied whenever deviations are observed. Main conclusions: ‘Raw’, uncalibrated probability predictions should be calibrated before interpreting or averaging them in a probabilistic way.


DORMANN
Here I want to draw attention to the phenomenon of uncalibrated predictions, in particular for machine-learning models of binary data.
The discrepancy between a machine-learning prediction and the actual frequency stems from overfitting, which can lead to perfect separation of classes, and from the use of non-likelihood loss functions, which lead to a non-probabilistic weighting of prediction misfits. As such, other classifications and even discrete distributions may also cause similar problems. This observation is not new (Pearce & Ferrier, 2000), and indeed corrections have been proposed and re-examined by, among others, Platt (2000), Niculescu-Mizil and Caruana (2005) and Lin, Lin, and Weng (2007). Correct probability predictions are particularly important, as Platt (2000) points out, when they form part of an actual probability-based decision, or when they are averaged with other methods, so that a common measure is required. In the specific field of species distribution models of presence-absence data, Pearce and Ferrier (2000) featured such calibration prominently, yet hardly any study or even standard has picked it up (Araújo et al., 2019; Peterson et al., 2011; Sofaer et al., 2019); for notable exceptions see Franklin (2010), Johnston et al. (2015, 2019), Guisan, Thuiller, and Zimmermann (2017) and Fink et al. (2020).
Diagnosing such non-probabilistic behaviour is simple, and, for all practical purposes, calibration is straightforward. Hence it should be applied to any model type as part of the prediction process, before predicting, cross-validating and making effect plots and maps or using predictions in any other probabilistic interpretation.

| THE CALIBRATION PLOT AND A DEMONSTRATION OF BIAS
A little analysis demonstrates where calibration comes into a statistical analysis. The example here is that of an Australian bird species' distribution (as presence-absence data at a scale of 50 × 50 km²), using climate and land-cover predictors. The example itself is immaterial and merely illustrative [R code (R Core Team, 2019) and data can be found in Supporting Information Appendix S1], although in the context of species distribution analysis presence-absence data and predicted occurrence probability are particularly common (some recent examples are Derville, Torres, Iovan, & Garrigue, 2018; Marca et al., 2019; Martínez et al., 2018; Robinson, Ruiz-Gutierrez, & Fink, 2018; Sabatini et al., 2018; Sofaer et al., 2019).
The data were analysed using a traditional binomial generalized linear model (GLM) and some machine-learning approaches: random forest, (simple) neural networks, boosted regression trees and support vector machines. All approaches managed to fit the data well, measured as root mean square error (RMSE) and log-likelihood in fivefold cross-validation [Table 1; for area under the curve (AUC) see Supporting Information Appendix S1].
The calibration plot (see Box 1, also known as a reliability diagram) has been a statistical goodness-of-fit measure for a long time, but seems to have fallen out of fashion (despite recommendations by Harrell, 2001, 2015). It simply regresses observed data against model fits (step I.4 in Box 1) and thus provides information additional to the common (pseudo-)R²-value (Chalcraft, 2019). For binary data, this requires a binomial GLM and link-scale predictions. The slope of this regression line, the calibration slope β1, should be unity, and the calibration intercept, β0, zero. Figure 1 shows some calibration curves. In this specific illustration, the GLM, the simple neural network and the support vector machine lie on the expected calibration line (with intercept near 0 and slope near 1); the two tree-based approaches display misfit.
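The calibration regression described above can be sketched in a few lines. The following is a minimal illustration, not the code of Appendix S1: it fits a two-parameter binomial GLM of the observations on the link-scale (logit) predictions by Newton-Raphson, using only the Python standard library. The function names `logit` and `calibration_coefficients` and the toy data are assumptions made for this example.

```python
import math

def logit(p, eps=1e-6):
    """Map a probability to the link (log-odds) scale, clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def calibration_coefficients(y, p, n_iter=50):
    """Fit y ~ Bernoulli(expit(b0 + b1 * logit(p))) by Newton-Raphson.

    Returns (b0, b1), the calibration intercept and slope.
    """
    x = [logit(pi) for pi in p]
    b0, b1 = 0.0, 1.0  # start at perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = mu * (1.0 - mu)
            g0 += yi - mu            # score for the intercept
            g1 += (yi - mu) * xi     # score for the slope
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        if det < 1e-12:              # information matrix (near-)singular
            break
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1

# Toy data: three groups of 10 sites with observed frequencies 0.2, 0.5, 0.8.
y = [1] * 2 + [0] * 8 + [1] * 5 + [0] * 5 + [1] * 8 + [0] * 2
calibrated = [0.2] * 10 + [0.5] * 10 + [0.8] * 10
# Overconfident predictions: the same log-odds, doubled.
overconfident = [1.0 / (1.0 + math.exp(-2.0 * logit(pi))) for pi in calibrated]

print(calibration_coefficients(y, calibrated))
print(calibration_coefficients(y, overconfident))
```

In this constructed example the first set of predictions matches the observed frequencies, so the fitted intercept and slope sit at 0 and 1, whereas the overconfident predictions yield a slope well below 1.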
In this case, the predictions can be calibrated using a flexible calibration regression, such as a structurally constrained generalized additive model (GAM; see step II.1, Box 1). A simple correction based on the binomial GLM estimates (step I.4 in Box 1) is typically insufficient, as lines need not be sigmoidal. Thus, the observed values are regressed against the fitted values using a GAM constrained to be monotonically increasing, as suggested for this purpose in the appendix of Johnston et al. (2015) and detailed in Pya & Wood (2015), yielding the panels on the right of Figure 1. Platt (2000), after whom the probability-rescaling is sometimes referred to as 'Platt scaling', noted that this kind of calibration regression will overfit and suggested regularization, cross-validation or replacement of the actual observed values by values moved away from the margin (i.e., 0 or 1). The latter step is motivated by an application of Bayes' rule, and the replacement target values are then, for 1s: $t_1 = \frac{N_1 + 1}{N_1 + 2}$, and for 0s: $t_0 = \frac{1}{N_0 + 2}$, with $N_0$ and $N_1$ representing the number of 0s and 1s in the training data. Platt sketches a customized calibration procedure, upon which Lin et al. (2007) improve. In their case of support vector machines, the deviation followed a clear sigmoidal pattern (as also seen in Figure 1, e.g., bottom left), and hence Platt suggested fitting a sigmoidal function. For more serpentine curves, one could jackknife the calibration curve to prevent it from overfitting, but the structurally constrained GAM will often curb this problem sufficiently.
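To make the two ingredients of this paragraph concrete, the sketch below computes the smoothed target values of Platt (2000), t_1 = (N_1 + 1)/(N_1 + 2) and t_0 = 1/(N_0 + 2), and then fits a monotonically increasing calibration curve. Note the substitution: instead of the monotone GAM used in the paper, the example uses isotonic regression (pool-adjacent-violators), a simpler step-function technique that shares only the monotonicity constraint. The function names and the toy data are assumptions for illustration.

```python
def platt_targets(y):
    """Replace 0/1 observations by Platt's targets moved away from the margin."""
    n1 = sum(1 for v in y if v == 1)
    n0 = len(y) - n1
    t1 = (n1 + 1.0) / (n1 + 2.0)
    t0 = 1.0 / (n0 + 2.0)
    return [t1 if v == 1 else t0 for v in y]

def isotonic_fit(p, targets):
    """Pool-adjacent-violators: non-decreasing fit of targets against p.

    Returns the predictions in ascending order and the calibrated value
    fitted to each of them. (Stand-in for the monotone GAM, not the same model.)
    """
    order = sorted(range(len(p)), key=lambda i: p[i])
    blocks = []  # each block holds [sum of targets, number of points]
    for i in order:
        blocks.append([targets[i], 1])
        # Merge backwards while the previous block mean exceeds the current one.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, n = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += n
    fitted = []
    for s, n in blocks:
        fitted.extend([s / n] * n)
    return [p[i] for i in order], fitted

# Hypothetical raw predictions and observations for six sites.
y_obs = [0, 0, 1, 0, 1, 1]
p_raw = [0.1, 0.2, 0.3, 0.4, 0.8, 0.9]
p_sorted, p_cal = isotonic_fit(p_raw, platt_targets(y_obs))
print(list(zip(p_sorted, p_cal)))
```

New predictions would then be calibrated by looking up (or interpolating between) the fitted values; a smooth monotone GAM avoids the plateaus that this step function produces.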
Note that any calibration requires the observed data to be unbiased. If, for example, 0s are unreliable and possibly largely attributable to low detection probabilities, calibration cannot restore actual probabilities (Fithian & Hastie, 2013). This applies particularly to any use-availability analysis as employed, for example, in resource-selection or presence-only data, which requires quantification of detection probabilities.

TABLE 1 Spatial-block cross-validation RMSE and log-likelihood of the five model types without ('raw') and with ('cal') calibration of their predictions. The models were fitted on the eastern/western half of the data and predicted to the other half (and vice versa). For log-likelihood the sum of the two folds is given; RMSE values represent means. Rank-independent metrics, such as area under the curve (AUC), show no effect of calibration (see Supporting Information Appendix S1)
For machine learning, one cause for the discrepancy between model predictions and the 1:1 line may lie in the target function of the modelling approaches. A GLM (and GAM, for that matter) maximizes the log-likelihood, which is founded in probability theory. In contrast, machine-learning algorithms may minimize Gini impurity or maximal risk (minimax), or maximize variance reduction (Hastie, Tibshirani, & Friedman, 2009). While this may make for good class separation, it does not bode well for nominal probabilities. For instance, fitting the GLM with a target function that maximizes accuracy (i.e., the proportion of correctly predicted values, using prevalence as the threshold) instead of the log-likelihood yields distorted probabilities (Figure 2).
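Why an accuracy-type target gives no incentive towards nominal probabilities can be seen in a small constructed example. In the sketch below, two hypothetical sets of predictions classify the same sites correctly and therefore have identical accuracy, but only the log-likelihood penalizes the set whose values are pushed towards 0 and 1. The data and function names are assumptions made for illustration; the threshold of 0.5 equals the prevalence of the toy data.

```python
import math

def accuracy(y, p, threshold=0.5):
    """Proportion of sites classified correctly when thresholding p."""
    return sum((pi > threshold) == (yi == 1) for yi, pi in zip(y, p)) / len(y)

def log_likelihood(y, p):
    """Bernoulli log-likelihood of the observations given predictions p."""
    return sum(math.log(pi) if yi == 1 else math.log(1.0 - pi)
               for yi, pi in zip(y, p))

# Eight sites with prevalence 0.5; presence is more frequent in the first four.
y = [1, 1, 1, 0, 1, 0, 0, 0]
moderate = [0.7, 0.7, 0.7, 0.7, 0.3, 0.3, 0.3, 0.3]
extreme = [0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01]

for name, p in (("moderate", moderate), ("extreme", extreme)):
    print(name, accuracy(y, p), round(log_likelihood(y, p), 3))
```

An optimizer that only sees accuracy cannot distinguish between the two sets, so nothing stops it from returning the extreme one.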
In this illustration, the cause behind the misfit for boosted regression trees must be something different, as this approach uses the likelihood as target function. The most likely candidate is overfitting, which can lead to perfect separation of the two classes (0s and 1s) and resulting misfits (e.g., Heinze & Schemper, 2002). In neural networks, we can tune the back-propagation by the decay rate, and setting this to very low values increases the chance of overfitting, and hence of deviation from the calibration line. Similarly, the cost-parameter of a support vector machine can be tuned in such a way that deviations are either prominent or absent.

| DOES IT MATTER?
But does it matter: will a map of occurrence probabilities based on calibrated predictions look all that different from one produced by current practice? That depends to a large extent on the dominant environmental conditions in the depicted region. If for this region predictions are either close to 0 or close to 1, then the two maps will be nearly indistinguishable.
The more shades of grey the prediction has, the more the maps will indeed differ. In this arbitrary case study, prediction maps for randomForest with and without calibration look very similar, apart from south of Perth and Tasmania (Figure 3).
The functional relationship between predictors and response is also affected by calibration (Figure 4). The already strong 'raw' effects become de facto thresholds after calibration. This is not an artefact of the calibration, but rather depicts more truthfully the actual predictions of the model itself: the randomForest really fits a threshold, and only because its predictions are uncalibrated probabilities does it not appear so strongly in the raw predictions.
The ultimate aim of calibration is better predictions: if model predictions are biased, that should show under external validation.
Splitting the data into an eastern and a western half, using one for training and predicting to the other (spatial block cross-validation: Roberts et al., 2017), I found that calibration increased prediction error and log-likelihood in some cases and decreased them in others (Table 1). That is to be expected, as class-separating algorithms should also yield very good validation results for these relatively small samples, particularly for rank-based validation measures such as AUC (see Supporting Information Appendix S1).
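The different behaviour of rank-based and probability-based validation measures follows directly from their definitions, as the following sketch illustrates. A monotonically increasing recalibration leaves the ordering of predictions, and hence the AUC, untouched, while RMSE (and likewise log-likelihood) respond to the predicted values themselves. The data and the linear shrinkage map are hypothetical and stand in for a fitted calibration curve.

```python
import math

def rmse(y, p):
    """Root mean square error between observations and predictions."""
    return math.sqrt(sum((yi - pi) ** 2 for yi, pi in zip(y, p)) / len(y))

def auc(y, p):
    """Probability that a random presence is ranked above a random absence
    (ties count one half)."""
    pos = [pi for yi, pi in zip(y, p) if yi == 1]
    neg = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical validation data with rather extreme raw predictions.
y = [0, 0, 1, 0, 1, 1, 0, 1]
raw = [0.05, 0.10, 0.20, 0.40, 0.60, 0.80, 0.90, 0.95]
# Hypothetical monotone recalibration: shrink predictions towards 0.5.
recal = [0.5 + 0.6 * (pi - 0.5) for pi in raw]

print("AUC :", auc(y, raw), auc(y, recal))
print("RMSE:", rmse(y, raw), rmse(y, recal))
```

Any strictly increasing calibration curve would give the same result for AUC, which is why a good AUC says nothing about whether predictions can be read as probabilities.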

FIGURE 2 Generalized linear model (GLM) fitted using accuracy rather than log-likelihood as optimization target; compare to Figure 1

While it is beyond dispute that machine-learning predictions may require calibration before being interpretable as probabilities (Platt, 2000), the best way to achieve such calibration is a matter of continuous refinement (e.g., Lin et al., 2007). The pragmatic GAM-based approach presented here is not the final say on the topic. It is important to notice, however, that calibration effects can be stark (Figure 1) and that without calibration regression probabilities can easily be misinterpreted.

| RECOMMENDATIONS
Overfitting may lead to full separation of categories in the fitted model. As a consequence, uncalibrated probability predictions may well be biased. This is the case for any classification model, including not only data following Bernoulli, binomial and multinomial distributions, but also discrete data such as Poisson and negative binomial. Calibrating predictions is straightforward and approximately restores the actual interpretation as predicted probabilities. Such calibration is necessary when averaging probabilities and interpreting predictions as probabilities, but not when options are merely ranked. Rarely will ecologists or practitioners use such predicted probabilities alone to decide on a management strategy, and often the uncertainty of these predictions may be substantially larger than the calibration misfit. However, as part of sound craftsmanship, statistical analysts can be expected to do what they know is right: calibrate their predictions.