Mind the gap: Performance metric evaluation in brain‐age prediction

Abstract Estimating age based on neuroimaging‐derived data has become a popular approach to developing markers for brain integrity and health. While a variety of machine‐learning algorithms can provide accurate predictions of age based on brain characteristics, there is significant variation in model accuracy reported across studies. We predicted age in two population‐based datasets, and assessed the effects of age range, sample size and age‐bias correction on the model performance metrics Pearson's correlation coefficient (r), the coefficient of determination (R²), Root Mean Squared Error (RMSE) and Mean Absolute Error (MAE). The results showed that these metrics vary considerably depending on cohort age range; r and R² values are lower when measured in samples with a narrower age range. RMSE and MAE are also lower in samples with a narrower age range due to smaller errors/brain age delta values when predictions are closer to the mean age of the group. Across subsets with different age ranges, performance metrics improve with increasing sample size. Performance metrics further vary depending on prediction variance as well as mean age difference between training and test sets, and age‐bias corrected metrics indicate high accuracy—also for models showing poor initial performance. In conclusion, performance metrics used for evaluating age prediction models depend on cohort and study‐specific data characteristics, and cannot be directly compared across different studies. Since age‐bias corrected metrics generally indicate high accuracy, even for poorly performing models, inspection of uncorrected model results provides important information about underlying model attributes such as prediction variance.
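The dependence of r and R² on age range described in the abstract can be illustrated with a minimal simulation (the numbers below are hypothetical, not from the study): if predictions equal true age plus noise with fixed variance, r and R² drop in a narrower age range while MAE and RMSE stay roughly constant, isolating the range effect on the correlation-based metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
n, noise_sd = 10_000, 5.0  # hypothetical sample size and prediction error (years)

def metrics(age_lo, age_hi):
    age = rng.uniform(age_lo, age_hi, n)      # true ages
    pred = age + rng.normal(0, noise_sd, n)   # predictions with fixed error variance
    r = np.corrcoef(age, pred)[0, 1]          # Pearson's r
    r2 = 1 - np.sum((pred - age) ** 2) / np.sum((age - age.mean()) ** 2)  # R^2
    mae = np.mean(np.abs(pred - age))         # Mean Absolute Error
    rmse = np.sqrt(np.mean((pred - age) ** 2))  # Root Mean Squared Error
    return r, r2, mae, rmse

wide = metrics(45, 80)    # wide age range
narrow = metrics(45, 60)  # narrow age range
# r and R^2 are lower in the narrow range; MAE/RMSE barely change here,
# because the error variance is identical in both samples.
```

In real models the errors themselves also shrink with a narrower range (predictions regress toward the group mean age), which is why the study observes lower RMSE/MAE as well.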

Corr. age range models (N = 11,495)
SI Figure 5: Performance metrics calculated in UK Biobank (UKB) subsets with different age ranges, where the lower age limit was kept constant while the upper age limit was varied. Predictions are based on models trained using 10-fold cross-validation within each subset, i.e. the age range is equal for training and test sets. The x-axes indicate the age range for each of the subsets. Sample size is kept constant across subsets, and represents the maximum number of participants available within the narrowest age range (45-60y).

Corr. shuffled data models (N = 622)
SI Figure 6: Age-bias correction in Cam-CAN models with 0, 10, 25, 50, and 75% of shuffled data. Corr = corrected.

SI Figure 7: Age-bias correction in Cam-CAN models with 0, 25, 50, and 75% randomly shuffled data. SF = shuffle fraction in %. For all models, the relationship between predicted and true age improves after age-bias correction, and the delta values show a flat relationship with true age. Corr = corrected.

SI Figure 8: Age-bias correction with the correction fit applied to the predictions in a training set, and the fit coefficients α and β used to correct the predictions in a separate test set (N = 18,578) using UK Biobank (UKB) data. Performance metrics are shown for models with 0, 10, 25, 50, and 75% of shuffled data. All models improve after correction, and the models with the poorest initial prediction accuracy (highest fraction of shuffled data) show the largest improvement. Corr = corrected.

Effects of age-bias correction using separate training and test sets - UKB data
SI Figure 9: Age-bias correction with the correction fit applied to the predictions in a training set (N = 18,578), and the fit coefficients α and β used to correct the predictions in a separate test set (N = 18,578) using UK Biobank data. Performance metrics are shown for models with 0, 25, 50, and 75% of shuffled data. For all models, the relationship between predicted and true age improves after age-bias correction, and the corrected delta values show a flat relationship with true age. The variance decreases with lower initial performance / higher shuffle fraction. Corr = corrected. For a detailed description of the plots, see Figure 9 in the main manuscript.
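The correction procedure described in this caption can be sketched as follows (a minimal illustration with simulated data; variable names and coefficients are ours, not from the study): fit predicted age = α × age + β in the training set, then apply α and β to correct the test-set predictions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a typical age bias: predictions regress toward the mean age.
age_train = rng.uniform(45, 80, 5_000)
age_test = rng.uniform(45, 80, 5_000)
pred_train = 0.5 * age_train + 31 + rng.normal(0, 3, 5_000)
pred_test = 0.5 * age_test + 31 + rng.normal(0, 3, 5_000)

# Fit predicted = alpha * age + beta in the training set only.
alpha, beta = np.polyfit(age_train, pred_train, deg=1)

# Correct test-set predictions by adding the age-dependent residual of the fit.
pred_test_corr = pred_test + (age_test - (alpha * age_test + beta))

# After correction, the predicted-vs-true slope is close to 1
# (equivalently, the corrected deltas are flat against true age).
slope_before = np.polyfit(age_test, pred_test, 1)[0]
slope_after = np.polyfit(age_test, pred_test_corr, 1)[0]
```

Fitting α and β on the training set and applying them to a held-out test set avoids leaking the test-set age information into the correction.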

Effects of age-bias correction including a quadratic age term - UKB data
SI Figure 10: Age-bias correction including a non-linear (quadratic) term in UK Biobank models with 0, 25, 50, and 75% randomly shuffled data. SF = shuffle fraction in %. For all models, the relationship between predicted and true age improves after age-bias correction, and the corrected delta values show a flat relationship with true age. The variance decreases with lower initial performance / higher shuffle fraction. Corr = corrected. For a detailed description of the plots, see Figure 9 in the main manuscript.
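The quadratic variant of the correction can be sketched in the same way (a hypothetical example with simulated non-linear bias; the coefficients are illustrative only, and the fit is done within one set for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated non-linear age bias (illustrative coefficients).
age = rng.uniform(45, 80, 5_000)
pred = 0.01 * (age - 62.5) ** 2 + 0.6 * age + 25 + rng.normal(0, 3, 5_000)

# Fit predicted = a*age^2 + b*age + c, then correct as in the linear case.
coef = np.polyfit(age, pred, deg=2)
expected = np.polyval(coef, age)
pred_corr = pred + (age - expected)

# The corrected deltas show a flat relationship with true age,
# since the fit residuals are uncorrelated with age by construction.
delta_slope = np.polyfit(age, pred_corr - age, 1)[0]
```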

Test sets with varying age ranges, training set held constant
Corr. age range models (N = 9,000)
SI Figure 11: Performance metrics calculated in UK Biobank (UKB) test sets with different age ranges when using Support Vector Regression instead of XGBoost regression. Predictions are based on a model trained on the full age range. The x-axes indicate the age range for each of the test sets. Sample size is kept constant across training and test sets, and represents the maximum number of participants available within the narrowest age range (65-82y).

Training sets with varying age ranges, test set held constant
Corr. age range models (N = 9,000)
SI Figure 12: Performance metrics calculated in a UK Biobank (UKB) test set (age range = 65-82y) when using Support Vector Regression instead of XGBoost regression. Predictions are based on models trained with different age ranges. The x-axes indicate the age range of the training sets applied to the same test set. Sample size is kept constant across training and test sets, and represents the maximum number of participants available within the narrowest age range (65-82y).

Corr. age range models (N = 18,050)
SI Figure 13: Performance metrics calculated in UK Biobank subsets with different age ranges when using Support Vector Regression instead of XGBoost regression. Predictions are based on models trained using 10-fold cross-validation within each subset, i.e. the age range is equal for training and test sets. The x-axes indicate the age range for each of the subsets. Sample size is kept constant across subsets, and represents the maximum number of participants available within the narrowest age range (65-82y).

Corr. shuffled data models (N = 37,156)
SI Figure 15: Age-bias correction in UK Biobank (UKB) models with 0, 10, 25, 50, and 75% randomly shuffled data, using Support Vector Regression instead of XGBoost regression. All models improve after correction, and the models with the poorest initial prediction accuracy (highest fraction of shuffled data) show the largest improvement. Corr = corrected.

Training and test sets with equal age ranges
SI Figure 16: Age-bias correction in UKB models with randomly shuffled data when using Support Vector Regression instead of XGBoost regression. SF = shuffle fraction in %. For all models, the relationship between predicted and true age improves after age-bias correction, and the corrected delta values show a flat relationship with true age. The variance decreases with lower initial performance / higher shuffle fraction. Corr = corrected. For a detailed description of the plots, see Figure 9 in the main manuscript.

Age-bias correction applied to delta values instead of predictions - UKB data
SI Figure 17: Age-bias correction applied to the brain age delta values instead of the predictions [1], shown for models with 0, 25, 50, and 75% randomly shuffled data in UKB. Delta is calculated as predicted age − true age. The deltas are corrected by fitting Delta = α × Ω + β, where Ω represents chronological age, and α and β represent the slope and intercept. If the corrected delta is subtracted from predicted age (third column), we see the strong relationships between predicted and true age as observed in Figure 8 in the main manuscript. Since the delta value contains the prediction minus age, and age is used in the correction fit, these correction procedures are mathematically equivalent [2], and corrected deltas are thus not exempt from the potential issues related to poorly performing models.
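The mathematical equivalence of the two correction routes can be checked numerically (a minimal sketch with simulated data; the coefficients are illustrative): correcting the predictions with the fit predicted = a × age + b, or correcting the deltas with the fit Delta = α × Ω + β, yields identical corrected deltas, since α = a − 1 and β = b.

```python
import numpy as np

rng = np.random.default_rng(3)

age = rng.uniform(45, 80, 2_000)
pred = 0.6 * age + 25 + rng.normal(0, 4, 2_000)
delta = pred - age  # brain age delta: predicted age - true age

# Route 1: fit the correction on the predictions (predicted = a*age + b).
a, b = np.polyfit(age, pred, 1)
delta_corr_pred = (pred + age - (a * age + b)) - age  # corrected prediction minus age

# Route 2: fit the correction on the deltas (delta = alpha*age + beta).
alpha, beta = np.polyfit(age, delta, 1)
delta_corr_delta = delta - (alpha * age + beta)

# Both routes reduce to pred - a*age - b, so the corrected deltas coincide.
```

Because the two routes are algebraically identical, corrected deltas inherit the same caveats as corrected predictions for poorly performing models.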

Alternative model error metrics
Alternative model error metrics such as Median Absolute Error (MedAE), weighted MAE (wMAE), Relative Squared Error (RSE), and Relative Absolute Error (RAE) also vary depending on age range, as shown in SI Figures 18-20. MedAE generally shows the same behaviour as MAE across different age ranges, but can be a useful metric as it is less affected by outliers. Weighted MAE is adjusted for the sample age range (calculated as MAE / age range [3,4]); however, this metric only takes into account the range and not the underlying distribution. RSE is closely related to R², and can be expressed as RSE = √(1 − R²). For a model that describes a high proportion of the variance, RSE thus tends towards smaller values, whereas poorer models will have larger RSE values. RAE is related to MAE, but compares the prediction residuals ŷ − y to the deviations of the true values from their mean, ȳ − y.

SI Figure 18 shows MedAE, wMAE, RSE, and RAE for test sets with different age ranges based on a model trained on the full age range. As evident from the plots, these error metrics are influenced by variable range and mean age differences between training and test sets, similarly to RMSE and MAE.

Corr. age range models (N = 9,000)
SI Figure 18: Alternative error metrics calculated in UK Biobank (UKB) test sets with different age ranges. Predictions are based on a model trained on the full age range. The x-axes indicate the age range for each of the test sets. Sample size is kept constant across training and test sets, and represents the maximum number of participants available within the narrowest age range (65-82y). wMAE = weighted MAE, RSE = Relative Squared Error, RAE = Relative Absolute Error.

SI Figure 19 shows MedAE, wMAE, RSE, and RAE calculated in a test set that is held constant while predictions are based on training sets with different age ranges. As evident from the plots, these error metrics are influenced by variable range, prediction variance, and mean age differences between training and test sets, similarly to RMSE and MAE.

Corr. age range models (N = 9,000)
SI Figure 19: Alternative error metrics calculated in a UK Biobank (UKB) test set (age range = 65-82y). Predictions are based on models trained with different age ranges. The x-axes indicate the age range of the training sets applied to the same test set. Sample size is kept constant across training and test sets, and represents the maximum number of participants available within the narrowest age range (65-82y). wMAE = weighted MAE, RSE = Relative Squared Error, RAE = Relative Absolute Error.
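Following the formulas above, the four alternative metrics can be computed as below (a sketch using the definitions stated in the text; wMAE divides MAE by the sample age range, and RSE is written as √(1 − R²) with R² = 1 − SS_res / SS_tot):

```python
import numpy as np

def alt_metrics(y_true, y_pred):
    """MedAE, wMAE, RSE, and RAE as defined in the text."""
    resid = y_pred - y_true
    medae = np.median(np.abs(resid))                                # Median Absolute Error
    wmae = np.mean(np.abs(resid)) / (y_true.max() - y_true.min())   # MAE / age range
    # RSE = sqrt(SS_res / SS_tot), which equals sqrt(1 - R^2)
    # for R^2 = 1 - SS_res / SS_tot.
    rse = np.sqrt(np.sum(resid ** 2) / np.sum((y_true - y_true.mean()) ** 2))
    # RAE compares absolute residuals to absolute deviations from the mean.
    rae = np.sum(np.abs(resid)) / np.sum(np.abs(y_true - y_true.mean()))
    return medae, wmae, rse, rae

# Toy example with hypothetical true and predicted ages.
example = alt_metrics(np.array([50., 55, 60, 65, 70]),
                      np.array([52., 54, 61, 63, 71]))
```

Because RSE and RAE divide by deviations from the mean age, they (like R²) grow when the age range narrows, in contrast to RMSE and MAE.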

SI Figure 20 shows MedAE, wMAE, RSE, and RAE calculated in subsets where 10-fold cross-validations are run within different age-range subsets. In contrast to RMSE and MAE, the relative model error metrics wMAE, RSE, and RAE show increasing error values with a narrower age range, following the same trends as R² and r. This is due to how these metrics are calculated, where relative values are obtained by dividing by ȳ − y (RSE and RAE) or the age range (wMAE). Here, there is no mean age difference between training and test sets, so the error metrics are influenced only by variable range and prediction variance.

SI Figure 20: Alternative error metrics for models with different age ranges. Predictions are based on 10-fold cross-validated models, i.e. the age range is equal for training and test sets. The x-axes indicate the age range for each of the subsets. In contrast to RMSE and MAE, the error measures RSE, RAE, and weighted (w) MAE increase with a narrower age range (indicating larger model error), in line with the r and R² patterns. This is due to how these metrics are calculated, where relative values are obtained by dividing by ȳ − y (RSE and RAE) or the age range (wMAE). However, all metrics vary depending on the range of the predicted variable. wMAE = weighted MAE, RSE = Relative Squared Error, RAE = Relative Absolute Error.