Considerations on brain age predictions from repeatedly sampled data across time

Abstract Introduction Brain age, the estimation of a person's age from magnetic resonance imaging (MRI) parameters, has been used as a general indicator of health. However, the marker requires further validation for application in clinical contexts. Here, we show how brain age predictions perform for the same individual at various time points and validate our findings with age‐matched healthy controls. Methods We used densely sampled T1‐weighted MRI data from four individuals (from two densely sampled datasets) to observe how brain age corresponds to age and is influenced by acquisition and quality parameters. For validation, we used two cross‐sectional datasets. Brain age was predicted by a pretrained deep learning model. Results We found small within‐subject correlations between age and brain age. We also found evidence for the influence of field strength on brain age, which replicated in the cross‐sectional validation data, and inconclusive effects of scan quality. Conclusion The absence of maturation effects for the age range in the presented sample, brain age model bias (including training age distribution and field strength), and model error are potential reasons for small relationships between age and brain age in densely sampled longitudinal data. Clinical applications of brain age models should consider the possibility of apparent biases caused by variation in the data acquisition process.


BACKGROUND: WHAT IS BRAIN AGE AND WHAT IS IT GOOD FOR?
Brain age refers to the estimation of a person's age from magnetic resonance imaging (MRI) parameters (Franke & Gaser, 2019). This has been done using either neural networks on 3D data (Leonardsen et al., 2022) or tabular data containing region-averaged metrics (Korbmacher et al., 2023; Vidal-Pineiro et al., 2021). Brain age becomes particularly interesting when assuming that lifespan changes in the brain follow normative patterns and that deviations from such patterns might be indicative of disease or disease development (Marquand et al., 2019; Kaufmann et al., 2019). A predicted age that is elevated relative to chronological age in adults may be indicative of psychiatric, neurodegenerative, and neurological disorders (Kaufmann et al., 2019) and of poorer health, for example as measured by various cardiometabolic risk factors (Beck et al., 2022; Korbmacher et al., 2022). Hence, brain age is a promising developing biomarker of general brain health (Franke & Gaser, 2019).
However, revealing connections between brain age and structural and functional brain architecture is needed to fully understand the biological underpinnings of brain age and its potential clinical implications (Vidal-Pineiro et al., 2021). Furthermore, large cross-sectional samples are often used, in which confounders, in particular differences in MRI acquisition, could obscure the predictive power of brain age (Jirsaraie et al., 2022). Hence, contributions of individual differences to brain age estimates require a closer examination. With the aim of assessing the effects of automated MRI scan quality control (QC) metrics on brain age predictions, we used a pretrained deep neural network model (Leonardsen et al., 2022) to predict brain ages from densely sampled T1-weighted MRI data from three individuals (BBSC1-3) scanned in total N_BBSC = 103 times over a 1-year interval (Wang et al., 2022), and an independent data set including one individual (FTHP1) scanned N_FTHP = 557 times over a 3-year interval. We first observed within-subject prediction error and correlations between chronological and predicted age, revealing small, nonsignificant correlations and larger prediction errors than previously shown in between-subjects analyses.
We then tested associations of QC metrics with brain age using linear random intercept models, showing potential associations between QC parameters and brain age as well as associations between acquisition parameters and brain age. Finally, we validate the findings in cross-sectional data and investigate differences in the variability in predictions between longitudinal and cross-sectional datasets.

Weak correlation between brain age and age
Interestingly, we also find systematically underestimated brain ages across subjects (Figure 1) with the underestimations being stronger for a field strength of 3T than 1.5T for FTHP1 (Table 1), and as compared with age-matched cross-sectional data (

Scan quality and acquisition: possible reasons for inaccurate brain age predictions?
We used linear random intercept models at the participant level to examine associations of individual QC metrics (see Figure 3;  2). This was also true when using the entire cross-sectional data (combining TOP and NCNG data), yet correlations between age and brain age were more similar at 1.5T (r = 0.98, 95% CI [0.97, 0.98], p < .001) and 3T (r = 0.92, 95% CI [0.91, 0.93], p < .001).
While our findings indicate an association between QC parameters EFC and FBER and brain age in all BBSC subjects when controlling for age and constant scanning parameters and scanner site, no QC parameters were significantly associated with brain age after adjustments for multiple comparisons in FTHP1. Based on that, one could speculate that scan quality impacts brain age predictions when participant ages are sampled from under-represented age groups within the prediction model. For example, Jirsaraie et al. (2022) showed that neural networks' reliability of brain age predictions was lowest at the ends of the age distributions across scanning sites, and predictions were less consistent when image quality was low. Furthermore, QC metrics might be sensitive to individual differences, and vary across scanner sites. FTHP1 results also suggest a strong effect of field strength on brain age. This indicates overall that brain age estimates are potentially dependent on intraindividual variables in addition to acquisition parameters and other scanner site-specific covariates. While we cannot generalize from the obtained single-subject results (FTHP1) on field strength, the additional analyses on external datasets support the effect of field strength congruent with Jirsaraie et al.'s (2022) findings of lower prediction errors at 1.5T compared with 3T. This was expressed in our analyses as generally higher brain age estimates at 1.5T compared with 3T, and higher prediction errors at 3T in both cross-sectional and longitudinal data. Finally, we show that prediction error in longitudinal data can be much higher than anticipated from cross-sectional estimates, without the presence of mental or physical disorder (see BBSC3 in Table 1; compare Table 2 and Supplement 3).

FIGURE 3
Standardized quality control metrics at 3T per subject. For an overview of scan quality control metrics at 1.5T (only applicable for FTHP1), see Supplement 2.
A potential approach for future brain age modelling could be to employ multiple, more specific models which are better tuned to individual differences, developmental trajectories, and scan quality. Such models could, for example, be trained on data with a smaller age range and a single field strength. Depending on these parameters, brain age predictions can then be made by a model selected based on the available scan and the group the individual belongs to.
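The model-selection idea described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the registry keys, the age bands, and the placeholder predictors are hypothetical and do not correspond to any trained model from this study.

```python
from typing import Callable, Dict, Tuple

# Key: (field strength in tesla, minimum age, maximum age); all values are hypothetical.
ModelKey = Tuple[float, float, float]
Predictor = Callable[[object], float]


def select_model(registry: Dict[ModelKey, Predictor],
                 field_strength: float, age: float) -> Predictor:
    """Return the model registered for this field strength and age band."""
    for (tesla, min_age, max_age), model in registry.items():
        if tesla == field_strength and min_age <= age < max_age:
            return model
    raise LookupError(f"no model for {field_strength}T at age {age}")


# Placeholder predictors standing in for separately trained networks.
registry: Dict[ModelKey, Predictor] = {
    (1.5, 18.0, 45.0): lambda scan: 30.0,
    (1.5, 45.0, 95.0): lambda scan: 60.0,
    (3.0, 18.0, 45.0): lambda scan: 31.0,
    (3.0, 45.0, 95.0): lambda scan: 61.0,
}

model = select_model(registry, field_strength=3.0, age=48.0)
# A real model would receive the preprocessed T1-weighted volume instead of None.
prediction = model(None)
```

Raising an error when no model covers the scan makes the gap explicit, whereas silently falling back to a general model would reintroduce the bias that the specific models are meant to avoid.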

Participants
We used two datasets for the analyses, both of which had previously received ethics approval, with all participants having given formal consent (Opfer et al., 2022; Wang et al., 2022, 2023). The first dataset was the Bergen Breakfast Scanning Club (BBSC) dataset (Wang et al., 2022, 2023), including three male subjects (BBSC2: start age = 27, BBSC1: start age = 30, and BBSC3: start age = 40) who were scanned over the period of circa 1 year with a summer break in the middle of the scanning period (Wang et al., 2022). This resulted
T1-weighted volumes of FTHP1 were acquired at different scanners with various scanning parameters (see Opfer et al., 2022 or https://www.kaggle.com/datasets/ukeppendorf/frequentlytraveling-human-phantom-fthp-dataset). All imaging sites involved in the scanning of FTHP1 were informed that the scan was acquired for the purpose of MRI-based volumetry. Furthermore, all FTHP sites were asked to use acquisition parameters in accordance with the ADNI recommendations for magnetization prepared rapid gradient-echo (MP-RAGE) MRI for volumetric analyses. Thus, the range of FTHP acquisition parameters is representative of MRI-based volumetry in everyday clinical routine at nonacademic sites. However, the scan quality might be higher than during average clinical assessments, as only a few scans were affected by motion artifacts (relatively young healthy subject).
TOP data (Tønnesen et al., 2018)
Before prediction, the volumes were automatically processed using Freesurfer version 5.3 (Fischl, 2012) and FSL version 6.0 (Jenkinson et al., 2012; Smith et al., 2004), both being widely used open-source software packages (see for overview of advantages and disadvantages compared with other packages: Man et al., 2015) which were validated in clinical and nonclinical samples (Clerx et al., 2015; Fischl, 2012; Jenkinson et al., 2012; Smith et al., 2004). The processing procedure included skull-stripping as part of Freesurfer's recon-all pipeline, linearly orienting to MNI152 space (6 degrees of freedom) using FSL's linear registration, and excess border removal. While linear registration in FSL is sensitive to atrophy and high levels of noise (Dadar et al., 2018), this does not apply to the current quality-controlled data, which include only healthy controls. As Freesurfer's skull stripping algorithm can include errors (Falkovskiy et al., 2016; Waters et al., 2019), the images were manually checked for accuracy. A step-by-step processing tutorial including necessary code can be found at https://github.com/estenhl/pyment-public.

Brain age estimation
We applied a fully convolutional neural network (Gong et al., 2021; Peng et al., 2021) trained on 53,542 minimally processed MRI T1-weighted whole-brain images from individuals aged 3-95 years collected at a variety of scanning sites (both 1.5 and 3T field strength) (SFCN-reg detailed in Leonardsen et al., 2022) to estimate participants' ages directly from the MRI using Python v3.9.13. The model was tested in both clinical and nonclinical samples (Leonardsen et al., 2022) and presented high accuracy and test-retest reliability compared with other brain age models (Dörfel et al., 2023).

QC metrics
QC metrics were extracted for each T1-weighted volume by using the automated MRIQC tool version 22.0.6 (Esteban et al., 2017). Of these metrics, we used those which are calculated for the whole brain or volume, namely (1) noise measures: contrast-to-noise ratio, signal-to-noise ratio, and coefficient of joint variation of gray and white matter, (2) measures based on information theory: entropy-focus criterion (EFC) and foreground-background energy ratio (FBER), (3) white-matter to maximum intensity (WM2MAX), and (4) other measures: full-width half-maximum (FWHM).

Statistical analyses
All statistical analyses were conducted using R (v4.
We also examined single individual acquisition parameters in the FTHP dataset (including only one subject, FTHP1) as fixed effects in addition to the fixed age effect. The acquisition parameters of interest were field strength, manufacturer, and slice thickness. Acquisition parameters not used as fixed effects were used as random effects at the level of the intercept in addition to scanner site. All p-values were adjusted for multiple testing using Holm correction, marked with p_Holm.
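For readers unfamiliar with the Holm procedure, the adjustment can be sketched in a few lines. The analyses reported here were run in R; the Python function below is only an illustrative re-implementation of the standard step-down rule, and the raw p-values are made up.

```python
def holm_adjust(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the tests, sorted by ascending raw p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Step-down multiplier: m for the smallest p-value, m - 1 for the next, ...
        candidate = (m - rank) * p_values[idx]
        # Keep adjusted values monotone in the sorted order and cap them at 1.
        running_max = max(running_max, candidate)
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical raw p-values for four QC metrics.
raw = [0.001, 0.04, 0.03, 0.2]
adjusted = holm_adjust(raw)
```

Because the multiplier shrinks with each rank, Holm correction is never more conservative than a plain Bonferroni correction while still controlling the family-wise error rate.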
Standardized β-values (β_std) were obtained by scaling QC metrics for each subject individually, which makes β-weights comparable across predictors.
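Scaling a QC metric separately within each subject can be illustrated as below. This is a minimal sketch: the record layout (`subject`, `efc`) and the values are hypothetical, and it assumes every subject has at least two scans with non-identical values, so that the standard deviation is defined and non-zero.

```python
from collections import defaultdict
from statistics import mean, stdev


def zscore_within_subject(rows, metric):
    """Scale one QC metric to mean 0 and SD 1 separately for each subject."""
    values_by_subject = defaultdict(list)
    for row in rows:
        values_by_subject[row["subject"]].append(row[metric])
    # Per-subject mean and sample standard deviation.
    stats = {s: (mean(v), stdev(v)) for s, v in values_by_subject.items()}
    return [
        (row[metric] - stats[row["subject"]][0]) / stats[row["subject"]][1]
        for row in rows
    ]


# Hypothetical EFC values for two subjects.
scans = [
    {"subject": "BBSC1", "efc": 0.50},
    {"subject": "BBSC1", "efc": 0.52},
    {"subject": "BBSC1", "efc": 0.54},
    {"subject": "BBSC2", "efc": 0.60},
    {"subject": "BBSC2", "efc": 0.70},
]
efc_z = zscore_within_subject(scans, "efc")
```

Scaling within subjects removes between-subject differences in the level and spread of a metric, so the resulting coefficients describe within-subject variation only.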
Finally, as a validation step, we estimated brain ages for healthy controls in the NCNG and TOP datasets and correlated the estimates with age, both for the entire sample and for subjects who were age-matched to the longitudinal, densely sampled individuals (mean age ± 5 years). This provided a baseline understanding of differences in inter- and intra-subject brain age variability. In a second step, brain age gap (BAG) was examined by field strength and scanner site in the validation sample.
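The validation steps described above (age-matching within ± 5 years, correlating predicted with chronological age, and comparing BAG by field strength) can be sketched as follows. The record fields (`age`, `predicted`, `tesla`) and all numbers are hypothetical, BAG is taken as predicted age minus age, and the snippet does not reproduce the reported analyses, which were run in R.

```python
from math import sqrt
from statistics import mean


def brain_age_gap(predicted, age):
    """BAG: predicted age minus chronological age."""
    return predicted - age


def age_matched(sample, target_age, window=5.0):
    """Keep participants whose age lies within target_age +/- window years."""
    return [s for s in sample if abs(s["age"] - target_age) <= window]


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_bag_by_field_strength(sample):
    """Average BAG separately for each field strength."""
    groups = {}
    for s in sample:
        groups.setdefault(s["tesla"], []).append(
            brain_age_gap(s["predicted"], s["age"]))
    return {tesla: mean(gaps) for tesla, gaps in groups.items()}


# Hypothetical cross-sectional validation records.
validation = [
    {"age": 46.0, "predicted": 48.0, "tesla": 1.5},
    {"age": 52.0, "predicted": 55.0, "tesla": 1.5},
    {"age": 49.0, "predicted": 45.0, "tesla": 3.0},
    {"age": 54.0, "predicted": 51.0, "tesla": 3.0},
    {"age": 30.0, "predicted": 29.0, "tesla": 3.0},
]
matched = age_matched(validation, target_age=50.0)
r_matched = pearson_r([s["age"] for s in matched],
                      [s["predicted"] for s in matched])
bag_by_tesla = mean_bag_by_field_strength(matched)
```

Restricting the age range in this way also narrows the variance in age, which by itself tends to lower the correlation between age and brain age relative to the full sample.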
Materials and Methods) and brain age, while controlling for age in BBSC1-3. Entropy-focus criterion (EFC, β_std = −0.489, p_Holm < .001) and the foreground-background energy ratio (FBER, β_std = 0.456, p_Holm < .001) were significant predictors of brain age. In a separate analysis of FTHP1, scanned at different sites using different scanning parameters, we included scanner site, field strength, and slice thickness as random factors, rendering none of the QC metrics significant after correcting for multiple testing (p_Holm = 1). Follow-up analyses in FTHP1 focused on examining acquisition parameters. We observed individual fixed effects of field strength, manufacturer, and slice thickness in one model each, while keeping scanner site and the other acquisition parameters as random effects at the level of the intercept, revealing only significant associations of field strength (β = −1.141, p_Holm < .001) with brain age. For validation, we replicate this finding in healthy controls from the TOP and NCNG (see Materials and Methods section). We found differences in BAG at different field strengths (β = −3.547, p < .001), with Mean BAG-1.5T = 1.357 ± 3.285 and Mean BAG-3T = −2.19 ± 4.06 using the entire out-of-sample test data, with this difference being attenuated when regressing out age (β = −5.318, p < .001). When age-matching FTHP1 and including only the N = 162 participants aged 50 ± 5 years (N = 49 scanned at 1.5T), the effect of field strength appears stronger (β = −6.294, p < .001), with Mean BAG-1.5T = 2.38 ± 2.71 and Mean BAG-3T = −3.92 ± 4.35, yet smaller when regressing out the age-effect (β = −1.942, p < .001). In the case of age-matching, also correlations between age and brain age are stronger at 1.5T compared with 3T (Table

in a total number of N_BBSC = 103 scans, relatively equally distributed across subjects (N_BBSC1 = 38, N_BBSC2 = 40, N_BBSC3 = 25). The second dataset was the frequently travelling human phantom (FTHP) MRI dataset (Opfer et al., 2022), including one male subject (FTHP1: start age = 48) with 157 imaging sessions at 116 locations, resulting in a total of N_FTHP = 557 MRI volumes. Of these, we excluded N = 6 volumes based on errors in the processing pipeline, resulting in a final sample for the main analyses of N_FTHP = 551. For QC (Supplement 1), we removed another N_FTHP = 25 volumes which were repeat-sequences run at the same scanner and time without changing head position or acquisition parameters, resulting in a final sample for the supplemental analyses of N_FTHP = 526. Finally, as additional validation data, we selected healthy controls from two of the cross-sectional out-of-sample test datasets described in Leonardsen et al. (2022): locally collected data (TOP; Tønnesen et al., 2018) and the Norwegian Cognitive NeuroGenetics sample (NCNG; Espeseth et al., 2012), as these provided most MRI scans on healthy controls. Together these datasets include a total of N = 209 scans of healthy controls at 1.5T (Mean age = 54.66 ± 15.51), and N = 856 scans of healthy controls at 3T (Mean age = 32.93 ± 10.55).

Table 2
Intraindividual correlations between brain age and chronological age at 3T for BBSC1-3 and FTHP1. Dot color was gray, with overlapping dots presented as darker.

TABLE 1
Age, predicted age, brain age gap (BAG), and prediction error by subject and field strength.

Subject | Field strength | N observations | Mean age | SD age | Mean prediction | SD prediction | Mean BAG | SD BAG | MAE | RMSE
The presented data refer to the longitudinal, densely sampled data of few individuals. BAG, brain age gap; MAE, mean absolute error; RMSE, root mean squared error. BAG is calculated as the difference between predicted age and age.

TABLE 2
Correlations between age-matched cross-sectional subsamples' ages and brain age estimates.

Matched subject | Field strength | N subjects | Pearson's r [95% CI]* | Mean age | SD age | Mean prediction | SD prediction
Matched subject refers to the longitudinally sampled subjects presented in Table 1. Mean ages for the respective subjects with an interval of five years were used to sample from the cross-sectional validation set consisting of 3T and 1.5T data from TOP and NCNG samples. BAG, brain age gap; MAE, mean absolute error; RMSE, root mean squared error. BAG is calculated as the difference between predicted age and age.
2022; Leonardsen et al., 2022) treating predictions for age groups which are underrepresented in the training sample and differences in field strength with care. In that sense, the observed within-subjects variability associated with acquisition- or scanner-specific effects might be used to estimate the minimum size of true within-subject changes (e.g., due to disease) to be detected with a given power.
Previous findings outlined the influence of scanner site and of scan quality, as indicated by the Euler number (Rosen et al., 2018), on brain age predictions (Jirsaraie et al., 2022; Leonardsen et al., 2022). Lower quality scans lead to higher prediction errors. We hence hypothesize that there might be additional reasons for inaccuracies in brain age predictions caused by factors beyond the characteristics of the brain age model, in particular scan quality and acquisition parameters.

FIGURE 2
Intraindividual correlations between brain age and chronological age at 1.5T and 3T for FTHP1. Dot color was gray, with overlapping dots presented darker.
multiple QC metrics. Furthermore, random-effects models were chosen because they can account for variances that depend on different grouping variables, such as ID, scanner site, field strength, and slice thickness. Hence, linear random intercept models at the participant level were used to examine associations of individual QC metrics and brain age, while controlling for age in the BBSC dataset, by running one model for each QC metric. Similarly, for dataset 2, we entered each QC metric as a fixed effect in addition to the fixed effect of age in a single model. However, we used different random effects, namely, scanner site, field strength, and slice thickness, as dataset 2 contained only FTHP1.