Large‐Scale metabolomics: Predicting biological age using 10,133 routine untargeted LC–MS measurements

Abstract Untargeted metabolomics is the study of all detectable small molecules, and in geroscience it has shown great potential to describe biological age, a complex trait impacted by many factors. Unfortunately, sample sizes are often too small to achieve adequate statistical power and to minimize potential biases caused by, for example, demographic factors. In this study, we present an analysis of biological age in ~10,000 routine toxicological blood measurements. The untargeted screening samples, obtained from ultra-high-pressure liquid chromatography quadrupole time-of-flight mass spectrometry (UHPLC-QTOF), cover >300 batches and >30 months, lack pooled quality controls, lack controlled sample collection, and have previously only been used in small-scale studies. To overcome experimental effects, we developed and tested a custom neural network model and compared it with existing prediction methods. Overall, the neural network predicted chronological age with an RMSE of 5.88 years (r² = 0.63), improving upon the 6.15 years achieved by existing normalization methods. We used the feature importance algorithm SHapley Additive exPlanations (SHAP) to identify compounds related to biological age. Most importantly, the model returned known aging markers such as kynurenine, indole-3-aldehyde, and acylcarnitines, along with a potential novel aging marker, cyclo(leu-pro). Our results validate the association of tryptophan and acylcarnitine metabolism with aging in a highly uncontrolled large-scale sample. We have also shown that robust computational methods make it possible to deploy large LC-MS datasets for metabolomics studies, reducing the risk of bias and empowering aging studies.


Metabolomics is the study of small molecules in biological samples and is a strong tool to explain complex traits such as biological age, a product of lifestyle, genomic alterations, inflammation, time since birth, etc. (Franceschi et al., 2018; Kennedy et al., 2014). Several studies have modeled age and identified and validated aging markers using targeted (known molecules) and untargeted (mostly unknown molecules) metabolomics (Ahadi et al., 2020; Auro et al., 2014; Darst et al., 2019; Rist et al., 2017; Robinson et al., 2020; Verri Hernandes et al., 2022; Yu et al., 2012). The output of these aging models is interpreted as biological age, and any discrepancy from chronological age is interpreted as accelerated/decelerated aging, which has been found to correlate with life expectancy (Hertel et al., 2016).
Human-based large-scale metabolomics studies of age exist, but most are focused on targeted metabolomics, that is, a small subset of compounds, because untargeted data are susceptible to experimental variance and computationally heavy to preprocess (T. Kim et al., 2021). Consequently, a great fraction of metabolites is neglected, although their inclusion might provide a better understanding of aging. To our knowledge, the largest untargeted study is by O. Robinson et al. (2020), who successfully analyzed 2239 samples and predicted chronological age with high accuracy (Robinson et al., 2020). Despite this, factors such as demography introduce a risk of sampling bias, and we believe methods facilitating large-scale untargeted metabolomics studies will help overcome these issues.
In this study, we first introduce the use of untargeted mass spectrometry screening data (UHPLC-QTOF in positive ESI mode) as a large-scale resource of metabolomics data. Our data were acquired for toxicological screenings in routine forensic casework across several years and hundreds of analytical batches, and we aim to deploy them as a valuable resource in biological research. Second, we aim to predict the biological age of the donors to identify potential biomarkers and to validate our method against the existing literature (Figure 1).

| Data
The data are antemortem whole blood samples (UHPLC-QTOF) from drivers suspected of driving under the influence of drugs. The sampling period spans January 2017 to December 2020 and contains 394 batches. The samples were kept at cooled conditions until reaching the analysis site, typically within 2-7 days (see experimental procedures). Of the samples from the period, 93% are from males, with a mean age of 28.9 years (SD 9.2). The age distribution is skewed, posing a challenge for making balanced predictions in the machine learning models (Figure 2a). To comply with GDPR, all visualizations in this study are anonymized such that all age groups have at least 5 observations, while calculations are based on all observations, spanning ages 15 to over 90 years. XCMS peak calling yielded 12,686 features, but features having >20% batch-wise missing values were removed. We also removed female samples to avoid confounders and finally removed sample outliers using principal component analysis (PCA) (Rist et al., 2017).
The final dataset used for the analysis contained 9929 samples and 8038 features. The observations were partitioned into training data (n = 7943) for fitting the models, development data (n = 993) for tuning NN hyperparameters, and testing data (n = 993) for hold-out data model evaluation.
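The partitioning scheme above can be sketched as follows (the random seed and split mechanism are illustrative assumptions; the counts are those stated in the text), together with the RMSE metric used to evaluate all models:

```python
import numpy as np

rng = np.random.default_rng(42)  # seed is an assumption for illustration
n = 9929
idx = rng.permutation(n)

# 7943 / 993 / 993 split into training, development, and test indices.
train, dev, test = idx[:7943], idx[7943:7943 + 993], idx[7943 + 993:]

def rmse(y_true, y_pred):
    """Root mean squared error (years), the evaluation metric used throughout."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```

The development set tunes the NN hyperparameters, while the test set is touched only once for the final evaluation.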
Strong experimental effects appeared in the PCA plot of the data (Figure 2b). Samples from 2020 clustered, as did the semesters of 2020, which might be caused by high retention time drift (Figure S1). In sum, the first two components of the PCA accounted for 44% of the total variance in the data. The PCA did not reveal any patterns of age, neither in the first two components nor in the following 16, indicating that only weak/few signals might explain age (Figure 2c, Figure S2). As the data lacked pooled quality controls (QCs) and had only a few internal standards, a QC-based correction of the 394 batches was impossible.

| Neural networks outperform existing normalization methods
We performed a small-scale comparison of normalization methods suited for data lacking QCs (Table 1) (W. E. Johnson et al., 2007). We predicted age using machine learning and used the root mean squared error (RMSE) to assess the normalization methods. No method improved upon naive methods such as row normalization, which achieved a test RMSE of 6.15 years.
The first two components (PCA) were plotted for each normalized dataset to assess whether the batch effect was removed (Figure S3).
WaveICA2.0 and Combat seemed to remove the batch structure but did not improve the machine learning performance.
The preprocessing methods used range from naive normalizations (row normalization, quantile) to between-batch normalization (Combat) and, finally, to between- and within-batch normalization (WaveICA2.0). We assessed the effect size of the differences between model-normalization pairs by bootstrapping the RMSE values and concluded that the model type was more influential than the normalization type (Figure S4). Expanding with additional normalization methods would most likely not benefit the analysis and would be outside the scope of this study.
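As a concrete reading of the naive baseline, row normalization can be implemented by dividing each sample's intensities by the sample total (the exact variant used here is defined in the experimental procedures; this sketch is one common choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.gamma(shape=2.0, size=(6, 100))  # toy matrix: samples x features

# Row normalization: scale each sample so its intensities sum to 1,
# removing sample-wide dilution/enrichment effects.
X_row = X / X.sum(axis=1, keepdims=True)

print(np.allclose(X_row.sum(axis=1), 1.0))  # True
```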
As no normalization method improved the model RMSEs, we implemented a neural network (NN), as it offers a flexible framework for custom architectures. Consequently, we implemented a ratio layer in the NN, motivated by the hypothesis that compound ratios have normalizing properties; that is, the ratio of two compounds is robust to general effects that dilute/enrich both compounds by the same factor in each sample. To avoid an explosion of model parameters leading to overfitting, the first layer of the NN consisted of only 12 nodes condensing the information of the 8038 features. The second layer contained all unique ratios of the first layer's output (1st layer outputs = 12 weights, ratios = 66), followed by two hidden layers and one output layer. See experimental procedures for a detailed motivation and the specific architecture.

FIGURE 2 Raw data. (a) Distribution of age across all 10 k individuals, minimum 5 individuals per age, 13 individuals older than 75 y. (b) Principal components of 4th root transformed data, colored by semester. (c) PCA colored by whether samples are more or less than the median age. The first 18 components colored by age are shown in Figure S2.

As neural networks are heuristic methods, their fits differ from run to run despite using the same data and the same architecture.
To ensure repeatability, we performed a hyperparameter optimization to decide the optimal NN architecture and refitted the best architecture 1000 times on the training and development data. Finally, we evaluated the 1000 fitted NN models on the held-out test data to get 1000 predictions per observation, which were averaged to one prediction. The testing data were not used for any model optimization or decision-making in the process. We did not test any of the normalization methods on the NN model, as the architecture should automatically normalize, and the normalization methods had no or only a small effect on the RMSE of the simple models (Table 1). The averaged NN predictions achieved a test RMSE of 5.88 years (r² = 0.63; Figure 3). Interestingly, the model predictions showed a smaller bias than the models from the model screening (Table 1), which underestimated the age of old individuals (Figure S5). Despite this, old samples were still underestimated in the model predictions, which is potentially caused by two factors: First, the age distribution is skewed, making the model focus on the densest part of the distribution (21-29 years old). Second, the model might be mean biased, thus finding a balance between using the mean age (a safe guess, never completely wrong) and the actual aging patterns. In practice, we must estimate accelerated aging by comparing individuals against the mean prediction of their age group and not the chronological age; otherwise, all 40-year-olds would appear 5 years younger.
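The ensembling step above damps the run-to-run variation of the heuristic fits; with hypothetical prediction matrices it looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_test = 1000, 993

# Hypothetical test-set predictions from 1000 refitted NNs (rows = models).
preds = 28.9 + rng.normal(scale=3.0, size=(n_models, n_test))

# Average the 1000 predictions per observation into one final prediction.
ensemble = preds.mean(axis=0)
assert ensemble.shape == (n_test,)
```

Averaging over many refits shrinks the model-to-model noise per observation, which is why the ensemble is more stable than any single fit.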

| SHapley additive exPlanation (SHAP) values identify age-associated features
We extracted feature importance from each of the 1000 NN models that we also used to make the averaged NN predictions (Figure 4).

The calculated SHAP values indicated that the NN model primarily interpreted linear responses from the feature intensities. When the color gradient of a feature goes from blue to red (low to high feature intensity), the NN model interprets the feature as positively correlated with age (Figure 4a). On the contrary, red-to-blue gradients mean the model interprets features as negatively correlated with age. Hence, SHAP values provide deeper insight into the models than the global importance in, for example, PLS and random forest.
Molecules positively correlated with the SHAP values included cyclo(leu-pro), acylcarnitines, cortisol, and benzoic acid. Molecules negatively correlated with the SHAP values included 17:0 Lyso PC and the tryptophan pathway metabolites serotonin, indole-3-aldehyde (I3A), kynurenate, and indole-3-methyl acetate. Some of the most important predictors remained unknown due to low fragmentation intensities or no database matches (see Data S1 for RT and m/z and methods for ID level 2 annotations).
Because of the ratios, the NN model potentially assigns high importance to normalizing features that are not correlated with age. To further investigate the acylcarnitine and tryptophan pathways, we therefore calculated false discovery rates based on Spearman p-values, using the correlations from all 8038 features vs. age (Figure S7). This analysis yielded 1867 correlated features at FDR ≤ 0.01. To quickly assess the combined information in the 1867 correlated features, we performed a PCA and plotted the first 18 axes, but no age pattern appeared (Figure S8).
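The correlation screen can be sketched on toy data as follows (the Benjamini-Hochberg procedure is our assumption for how the FDR values were obtained):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
age = rng.uniform(15, 75, size=200)
# Toy feature matrix: feature 0 tracks age, the remaining 49 are noise.
X = rng.normal(size=(200, 50))
X[:, 0] = age + rng.normal(scale=5.0, size=200)

# Spearman p-value of each feature vs. age.
pvals = np.array([spearmanr(X[:, j], age).pvalue for j in range(X.shape[1])])

def bh_fdr(p):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(p)
    order = np.argsort(p)
    adj = p[order] * len(p) / (np.arange(len(p)) + 1)
    adj = np.minimum.accumulate(adj[::-1])[::-1]  # enforce monotonicity
    out = np.empty_like(adj)
    out[order] = np.clip(adj, 0, 1)
    return out

fdr = bh_fdr(pvals)
n_hits = int((fdr <= 0.01).sum())  # features passing FDR <= 0.01
```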

| Cyclo (leu-pro)-a potential novel marker of age
Cyclo(leu-pro) was the second most important metabolite in the NN model and had the smallest FDR value, with a positive correlation to age of r = 0.26 (Spearman rho). The correlation was consistent throughout life, although additional observations would increase certainty for the 50+ years group (Figure 4b). Cyclo(leu-pro) is a cyclic dipeptide (2,5-diketopiperazine) found in the diet and is biologically active through its antimicrobial, antiviral, and antitumor properties (Rhee, 2004; Zhao et al., 2021).

| Acylcarnitine metabolism
The identified acylcarnitines are also known markers of age and aging-related diseases (Di Cesare et al., 2022; Jarrell et al., 2020).
Decanoyl carnitine was the most important acylcarnitine in our data, and it has also been identified as a female aging predictor in a study by Di Cesare et al. (Di Cesare et al., 2022). Acylcarnitines are positively correlated with a wide range of inflammation-driven diseases, including type 2 diabetes, cardiovascular diseases, and nonalcoholic fatty liver disease (Dambrova et al., 2022). Finally, acylcarnitines play an essential role in transporting free fatty acids to the mitochondria for energy metabolism; hence, dysregulation might have a wide array of physiological effects.
Consistent with the literature, our identified acylcarnitines increase with age.

| Biased compounds
The NN model also included a cocaine derivative, benzoylecgonine, which peaked in concentration at ages 27-29 and had low concentrations for very young and very old samples. Sampling bias caused benzoylecgonine's importance, and including it as a predictor allows us to isolate the confounding variables. If omitted, the metabolites affected by cocaine metabolism might become new predictors, and we would lose the ability to separate endogenous aging markers from cocaine markers. Finally, the exogenous compound benzoic acid also appeared as a top predictor. Benzoic acid is potentially derived from cocaine, but as it increases throughout life (in contrast to benzoylecgonine), it most likely originates from processed food (Floriani et al., 2014; Leth et al., 2010).

| DISCUSSION
We have used a highly uncontrolled cohort to validate some of the important pathways in aging. Despite the limitations of uncontrolled sample collection, we have shown that routine untargeted measurements are a valuable resource in population-wide studies of metabolomic biomarkers. The feature selection returned known metabolites that strongly associate with inflammaging, the interplay between inflammation and aging (Franceschi et al., 2018). We also identified a new metabolite, cyclo(leu-pro), which is not an endogenous human metabolite. The acylcarnitines are also associated with age and aging-related diseases and have been found to increase the hazard ratio of heart attack over a 20-year follow-up period. Hence, it makes sense that the NN model includes the acylcarnitine pathway and tryptophan derivatives to estimate the biological age. Better data quality may allow biological aging scores based on these two aging pathways exclusively. This would ensure a meaningful age score explaining mortality rather than chronological age, as molecules associated with chronological age are not guaranteed to explain mortality (Hertel et al., 2016; L. C. Johnson et al., 2019). Despite this, J. Hertel et al. (2016) showed that the full metabolome correlates with mortality rates (which in turn relate to biological age) using a longitudinal study design (Hertel et al., 2016).
Unfortunately, we were unable to draw any direct conclusions about the association of our age score (predictions) with biological age because we lack longitudinal data. Despite this, our error between age scores and chronological age (r² = 0.63) was on level with the existing literature (Hertel et al., 2016; L. C. Johnson et al., 2019; Menni et al., 2013; Robinson et al., 2020). We also saw that several of the identified compounds were related to inflammatory responses, which is likely to reflect biological aging. This shows the potential of using observational metabolomics data in geroscience to obtain generalizing aging scores, independent of, for example, fasting, diet, smoking, and BMI (although we had to remove females because of low sample size).
By combining untargeted data with a large sample size, we ob- […] (Nielsen et al., 2016; Wang et al., 2022).
Our use of neural networks entailed several strengths and weaknesses. The model performed better than elastic net and gradient boosting combined with different batch-normalized datasets.
Because we removed the batch normalization step, which needs several samples to correct batch effects, our computational setup can predict on a single new sample from the same screening laboratory.
Also, our modeling proved robust to the strong experimental effects, which contributed more than 40% of the variance. Unfortunately, neural networks require large datasets due to the training process (partitioning into test, train, and validation data) and due to the many parameters, which make the models susceptible to overfitting.
Neural networks are considered black-box models, as they are difficult to explain. SHAP values overcome this methodological issue. The industry has already adopted them to ensure fairness in predictive modeling (e.g., avoiding racial bias), and (high-impact) biological studies are beginning to use them as well (Buergel et al., 2022; Weis et al., 2022). We only used the SHAP values to validate whether the neural network fits followed the linear trends of the raw data (if not, we disregarded the compound), and further investigation might explain the peculiar trends or elucidate the tryptophan and acylcarnitine pathways more profoundly.
We believe that untargeted large-scale metabolomics provides a strong tool for describing biological age, and that screening data are a valuable data resource, although several aspects can be improved.
First, other screening data from the health sector might provide interesting observational data on aging in diseases. Second, modeling time until death might yield more meaningful results for biological aging than modeling chronological age. Third, uncontrolled studies (such as this one) contribute population-wide aging profiles, making it possible to isolate effects when investigating diseases associated with age.
In summary, this study is a proof of principle under extremely limited experimental conditions, and we have shown that robust methods exploit the large sample size and model age to rescue/extract/identify known biomarkers and potentially new ones. This shows the power of using routine measurements and suggests that large-scale untargeted studies might become even more informative under better sampling and controlled conditions. Finally, we find that untargeted metabolomics shows great potential in geroscience, as it monitors a wide range of physiological responses to aging and inflammation.

| Biological material, sample extraction, and untargeted screening
All steps of the data acquisition pipeline have been described in detail by R. Telving et al. (Telving et al., 2016) and T. Wang et al. (Wang et al., 2022).
In short, whole blood samples were collected by the police on suspicion of individuals driving under the influence of drugs. The samples were stored and transported at cooled conditions until receipt at the laboratory, typically within 2-7 days, and then stored at −18 °C before the analysis (within 1-5 additional days). The analysis was based on ultra-high-performance liquid chromatography quadrupole time-of-flight mass spectrometry (UHPLC-QTOF) and was performed on evaporated and reconstituted 30 kDa filtered supernatant from precipitated whole blood, as described in detail in Telving et al. (Telving et al., 2016). All samples were manually approved by a forensic chemist, provided that the QCs, internal standards, and retention times were within specifications.
Two large changes in the laboratory protocol were implemented during the sampling period from 2017 to 2020. First, the sample tubes were changed from FC (sodium fluoride, sodium EDTA, citric acid, sodium citrate; pH 5.9) to FX (sodium fluoride, potassium oxalate; pH 7.4). Second, the sample extraction procedure was changed from a manual to a semi-automated procedure (30-10-2018), which most likely does not affect the data quality negatively.
For identification purposes, fragmentation analysis was carried out. The grouping and alignment happened simultaneously on all files in standard XCMS functions. Table S1 describes the steps of the workflow (peak identification, alignment, grouping, and filling) and the XCMS parameter settings.

| Preprocessing methods
Before experimental effect correction, the feature intensities were fourth-root transformed and any female samples were removed.
Samples were removed as extreme outliers if their PCA scores were more than 1.5 × the 95th quantile away from the median PCA score.
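Our reading of the outlier rule can be sketched as follows (PCA via SVD on a toy matrix; the number of components used for the distance is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))  # toy samples x features, already transformed
X[0] += 50.0                    # plant one extreme outlier

# PCA scores via SVD on the centered matrix (first two components).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]

# Outlier rule as described in the text: distance from the median score
# beyond 1.5 * the 95th quantile of those distances.
dist = np.linalg.norm(scores - np.median(scores, axis=0), axis=1)
keep = dist <= 1.5 * np.quantile(dist, 0.95)
```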

| Modeling
The data used for modeling were randomly partitioned into training, development, and test sets. The screened models included elastic net and gradient boosting (Chen & Guestrin, 2016; Zou & Hastie, 2005). The models were tuned and evaluated using fivefold cross-validation in the R package caret (ver. 6.0.86), with the tuneLength set to 5 for hyperparameter tuning (Kuhn, 2008). For the model screening, only the training and development partitions were used, and the models were selected to cover linear and tree-based models with varying tendencies to overfit. The best preprocessing-model combination was evaluated on the test set to achieve an unbiased performance estimate.

| Neural networks model using ratios
To improve the statistical analysis and the following interpretation, the data must be normalized to correct for experimental noise. Existing normalization methods neglect the normalizing potential of compound ratios, so we implemented a ratio-based NN model. This NN model was compared against quantile normalization, Combat (W. E. Johnson et al., 2007), WaveICA2.0 (Deng et al., 2021), row normalization (see experimental procedures), and fourth-root transformed raw data.
Studies have shown that ratios work well for normalizing the biomarkers because ratios follow the same distribution across samples, as opposed to single compound intensities which depend on, for example, sample quality and batch effect (Petersen et al., 2012).
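The self-normalizing property can be checked directly: scaling both compounds of a pair by the same sample-wide factor leaves their ratio unchanged, unlike the raw intensities:

```python
import numpy as np

a, b = 4.0, 2.0  # hypothetical intensities of two compounds in one sample
for dilution in (0.5, 1.0, 2.0):
    # The sample-wide factor cancels in the ratio.
    assert np.isclose((a * dilution) / (b * dilution), a / b)
print(a / b)  # 2.0
```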
With untargeted screening data, thousands of features yield millions of ratios, making brute-force approaches extremely inefficient computationally and sensitive to spurious correlations (i.e., ratios that randomly correlate with a feature of interest but are impossible to validate). Consequently, we present a ratio-based neural network that exploits the anticipated self-normalizing effect of ratios by selecting only a small fraction of informative features to be used for the ratios (Figure 1).
A neural network consists of layers that each contain multiple linear models. The linear models of the first layer use the feature intensities of a sample as input. During training, each linear model is fitted to find a pattern in the feature intensities that improves the final prediction. For instance, one linear model might use features that correlate with age and another linear model might use features that explain the baseline intensity of the sample. The following layer in the neural network also consists of linear models, and these models use the outputs from the first layer to combine the information, that is, age-correlated features and features explaining baseline intensity (Figure S8 and Figure S9).
By introducing ratios between the first layer and the second layer, we can force the output of one linear model (e.g., age score) to be divided by the output of another linear model (e.g., a baseline score). We can imagine that a high-quality sample will have its age score divided by a high baseline score, and vice versa for low quality samples. This will normalize the age scores between samples with different baselines. In principle, the model outputs of the first layer could also represent two different pathways, single compounds, noise levels, etc., conditioned that the output benefits the final prediction accuracy. In practice, it is most likely the penalized linear models find the age-correlated features and some stable features that reflect the state of the sample.

| NN Architecture
Pytorch (ver. 1.10.1) was used for the implementation (Paszke et al., 2017). The architecture and hyperparameters of the NN model were decided by a grid search. The best model consisted of one input layer and one ratio layer followed by a dense neural network of two hidden layers and an output layer ( Table 2). The Adam optimizer was used with a learning rate of 3e-4 on batches of 500 samples and a total of 1000 epochs to ensure convergence (due to unstable learning). All layers, except the ratio layer, used a dropout of 0.1.
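The architecture can be sketched in PyTorch as follows (hidden-layer sizes, the ReLU activation, and the epsilon guard in the ratio layer are assumptions on our part; Table 2 and the linked repository hold the exact specification):

```python
import itertools
import torch
import torch.nn as nn

class RatioNet(nn.Module):
    """Sketch of the ratio-based NN: a condensing layer, a ratio layer
    over all unique pairs, two hidden layers, and one output."""

    def __init__(self, n_features=8038, n_condensed=12, hidden=(64, 32)):
        super().__init__()
        # First layer condenses the 8038 features into 12 scores.
        self.condense = nn.Linear(n_features, n_condensed)
        # All unique pairs of the 12 condensed scores: 12 choose 2 = 66.
        self.pairs = list(itertools.combinations(range(n_condensed), 2))
        self.head = nn.Sequential(
            nn.Linear(len(self.pairs), hidden[0]), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(), nn.Dropout(0.1),
            nn.Linear(hidden[1], 1),
        )

    def forward(self, x):
        z = self.condense(x)
        i, j = zip(*self.pairs)
        # Ratio layer: each condensed score divided by another; the small
        # epsilon guarding against division by zero is an assumed detail.
        ratios = z[:, list(i)] / (z[:, list(j)] + 1e-6)
        return self.head(ratios).squeeze(-1)

model = RatioNet()
out = model(torch.randn(4, 8038))  # 4 hypothetical samples -> 4 ages
```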
The model can be found at https://github.com/JohanLassen/untargeted_ratios; it is meant as a building block for NN implementations using ratios, not as a plug-and-play implementation.
The NN model was fitted on the training data (80% of the data) and evaluated during training on the development data. A total of 1000 NN models (using the final architecture) were fitted and then evaluated on the held-out test data.

| Feature Importance
SHAP values were calculated from the 1000 fitted PyTorch models using the SHAP package (Lundberg & Lee, 2017). This generated 1000 Shapley value matrices (of n observations and p features), which were averaged to one matrix of the same size (n observations and p features). Following this, global importance was calculated by taking the variance of each feature. SHAP values are emerging in the metabolomics field and are applicable to a large range of different models (Buergel et al., 2022; Weis et al., 2022).
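The averaging and global-importance computation described above can be sketched on hypothetical SHAP matrices:

```python
import numpy as np

rng = np.random.default_rng(7)
n_models, n_obs, n_feat = 1000, 200, 10

# Hypothetical SHAP matrices from the 1000 fitted models: (models, n, p).
shap_stack = rng.normal(size=(n_models, n_obs, n_feat))
shap_stack[:, :, 3] += np.linspace(-2, 2, n_obs)  # one feature with real signal

# Average the 1000 matrices to one (n, p) matrix ...
shap_mean = shap_stack.mean(axis=0)
# ... then take the per-feature variance as global importance.
global_importance = shap_mean.var(axis=0)

print(int(global_importance.argmax()))  # 3: the feature carrying signal
```

Averaging first cancels model-to-model noise, so the variance-based ranking reflects consistent attributions rather than single-fit artifacts.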

| Feature annotation
The features were annotated by metID (Shen et al., 2022) using HMDB, the Fiehn HILIC database, and our own in-house database of approximately 400 metabolites. All features discussed in the manuscript have been manually quality assured, while non-significant hits have not been fully assessed (included in supplementary, all reference spectra in Data S2). All discussed compounds are ID level 1 matched by m/z, fragmentation pattern, and retention time against our in-house library (following the guidelines of the Metabolomics Standards Initiative) (Sumner et al., 2007) (annotations in Data S1, reference spectra in Data S2). Among the top 5 features, we excluded one annotation, M255T346, which matched poorly to the 6-dehydrotestosterone glucuronide reference spectra (data not shown).

TABLE 2 Neural network model specifications.

AUTHOR CONTRIBUTIONS
The study was conceived by JKL and PV. Samples were provided by MJ and JBH. Data analysis was performed by JKL, TW, and PV. The manuscript was drafted by JKL and PV and reviewed by JBH, MJ, TW, and KLN. The final version was approved by all authors.

ACKNOWLEDGMENTS
We are thankful to Innovation Fund Denmark for funding (TraceAge).
PV & JL are supported by a PhD grant from AUFF NOVA (Aarhus University Research Foundation).

CONFLICT OF INTEREST STATEMENT
The authors declare they have no conflicts of interest.

DATA AVAILABILITY STATEMENT
Data sharing must comply with GDPR, meaning published data must be anonymized, and the pseudo-anonymized data (used for the analysis) will be deleted once the study reaches its end.