Explainable machine learning framework to predict personalized physiological aging

Abstract
Attaining personalized healthy aging requires accurate monitoring of physiological changes and identifying subclinical markers that predict accelerated or delayed aging. Classic biostatistical methods mostly rely on supervised variables to estimate physiological aging and do not capture the full complexity of inter‐parameter interactions. Machine learning (ML) is promising, but its black box nature eludes direct understanding, substantially limiting physician confidence and clinical usage. Using a broad population dataset from the National Health and Nutrition Examination Survey (NHANES) study including routine biological variables, and after selecting XGBoost as the most appropriate algorithm, we created an innovative explainable ML framework to determine a personalized physiological age (PPA). PPA predicted both chronic disease and mortality independently of chronological age. Twenty‐six variables were sufficient to predict PPA. Using SHapley Additive exPlanations (SHAP), we implemented a precise quantitative metric associated with each variable, explaining physiological (i.e., accelerated or delayed) deviations from age‐specific normative data. Among the variables, glycated hemoglobin (HbA1c) displays a major relative weight in the estimation of PPA. Finally, clustering profiles of identical contextualized explanations reveals different aging trajectories, opening opportunities for specific clinical follow‐up. These data show that PPA is a robust, quantitative and explainable ML‐based metric that monitors personalized health status. Our approach also provides a complete framework applicable to different datasets or variables, allowing precision physiological age estimation.


Variable selection and merging
To generate a consistent and large database with a maximal number of common biological variables across subjects, we performed manual data cleaning to eliminate redundant outcomes, both within the same year and across different years. This step requires human intervention. For instance, some variables hold different labels and/or codes across years (e.g., LBDHDLSI, "HDL-cholesterol (mmol/L)" from 1999 to 2002 and LBDHDDSI, "Direct HDL-Cholesterol (mmol/L)" from 2003 to 2018), or are reported in different units (e.g., serum glucose in mmol/L as LBDSGLSI and in mg/dL as LBXSGL). To streamline cleaning by the investigators (IA, LC, LP, PK and PM) and ensure reproducibility, a web interface was developed (Fig. S1).
For each available laboratory variable, investigators were independently asked to select variables based on the inclusion criteria described above. A similarity algorithm (using cosine similarity and Levenshtein distance) based on the "SAS" and "text" labels proposed a list of potentially synonymous terms to investigators. A manual search tool with autocompletion was also available.
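As an illustration of how such a suggestion list could be produced, the sketch below ranks candidate labels against a query label. How the two measures were actually combined is not stated in the text; the averaged score in `label_similarity`, the bag-of-words tokenization, and the function names are assumptions made for this example only.

```python
import math
import re
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words count vectors of two labels."""
    tokens_a = Counter(re.findall(r"[a-z0-9]+", a.lower()))
    tokens_b = Counter(re.findall(r"[a-z0-9]+", b.lower()))
    dot = sum(count * tokens_b[token] for token, count in tokens_a.items())
    norm_a = math.sqrt(sum(c * c for c in tokens_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tokens_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_similarity(a: str, b: str) -> float:
    """Assumed combined score: mean of cosine and normalized edit similarity."""
    longest = max(len(a), len(b)) or 1
    edit_similarity = 1.0 - levenshtein(a.lower(), b.lower()) / longest
    return 0.5 * (cosine_similarity(a, b) + edit_similarity)


def suggest_synonyms(query: str, labels: dict, top_n: int = 3) -> list:
    """Return the codes whose text labels are most similar to the query label."""
    ranked = sorted(labels.items(),
                    key=lambda item: label_similarity(query, item[1]),
                    reverse=True)
    return [code for code, _ in ranked[:top_n]]


# Codes are taken from the text; the glucose label wording is illustrative.
labels = {
    "LBDSGLSI": "Glucose, serum (mmol/L)",
    "LBDHDDSI": "Direct HDL-Cholesterol (mmol/L)",
    "LBXNEPCT": "Segmented neutrophils percent",
}
suggestions = suggest_synonyms("HDL-cholesterol (mmol/L)", labels)
```

With these toy labels, the 2003 to 2018 HDL code is ranked first for the 1999 to 2002 HDL label, which is the kind of candidate an investigator would then confirm or reject manually.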
When identical variables were measured several times, the mean value was used (for example, variables LB2NEPCT and LBXNEPCT both corresponded to the same biological variable, i.e., segmented neutrophils percent). Biological variables expressed in international (SI) units were preferred over their non-SI counterparts. In case of disagreement among investigators, a collegial decision was made at the consensus phase. After this step, and considering the distribution of the number of available variables for a given number of subjects, the largest dataset with the minimum amount of missing data was defined. The cut-off for this distribution retained variables available for at least 50,000 individuals. Individuals with more than 10% missing values were also dropped from the database. After processing, the selected dataset contained 60,322 individuals with 48 laboratory variables (Table S1).

(A) Distribution of the number of individuals by chronological age and gender. The amount of data from 12 to 20 years was twice that of other ages, and the number of available subjects decreased by 25% from 70 to 79 years old. No major gender imbalance was observed across age groups. (B) Uniform distribution of missing data across chronological age and gender. (C) Proportion of missing data by variable. The amount of missing data was low (25% of individuals with one missing value, representing 0.6% of the total values). Missing values were mainly related to the lack of C-reactive protein, folate, albumin, and creatinine data.

gender (B). UMAP revealed some clustering along the second dimension by gender, with mostly males in the upper part of the UMAP and females in the lower part. In addition, the first dimension mainly contains chronological age information, with a gradient from youngest to oldest from right to left.

The left scatter plot illustrates the distribution of residuals on the train dataset and the right scatter plot on the test dataset.
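The merging and filtering rules described for dataset construction (averaging repeated measurements, keeping variables available for enough individuals, dropping individuals with too many missing values) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the function names, the merged column name `neutrophils_pct`, the `CRP` column name, and the toy values are hypothetical, and the thresholds are scaled down so that the example fits in four rows (the text uses 50,000 individuals and 10%).

```python
def merge_duplicates(row, groups):
    """Average repeated measurements of one biological variable, ignoring missing ones."""
    merged = dict(row)
    for target, sources in groups.items():
        values = [row[s] for s in sources if row.get(s) is not None]
        for s in sources:
            merged.pop(s, None)
        merged[target] = sum(values) / len(values) if values else None
    return merged


def filter_dataset(rows, min_individuals, max_missing_fraction):
    """Keep well-populated variables, then drop individuals with too many gaps."""
    variables = sorted({name for row in rows for name in row})
    # Keep variables measured in at least `min_individuals` subjects.
    kept = [v for v in variables
            if sum(row.get(v) is not None for row in rows) >= min_individuals]
    # Drop subjects whose fraction of missing values exceeds the threshold.
    filtered = []
    for row in rows:
        missing = sum(row.get(v) is None for v in kept)
        if kept and missing / len(kept) <= max_missing_fraction:
            filtered.append({v: row.get(v) for v in kept})
    return kept, filtered


# Toy data: None marks a missing value; all numbers are made up.
raw_rows = [
    {"LB2NEPCT": 60.0, "LBXNEPCT": 58.0, "LBXSGL": 90.0, "CRP": 0.2},
    {"LB2NEPCT": None, "LBXNEPCT": 55.0, "LBXSGL": 101.0, "CRP": None},
    {"LB2NEPCT": None, "LBXNEPCT": 62.0, "LBXSGL": 95.0, "CRP": None},
    {"LB2NEPCT": None, "LBXNEPCT": None, "LBXSGL": 88.0, "CRP": None},
]
groups = {"neutrophils_pct": ["LB2NEPCT", "LBXNEPCT"]}
rows = [merge_duplicates(row, groups) for row in raw_rows]

# Scaled-down thresholds for the toy example.
kept, filtered = filter_dataset(rows, min_individuals=3, max_missing_fraction=0.10)
```

In this toy run the sparsely measured `CRP` column is discarded first, and the fourth individual is then dropped because half of the remaining variables are missing.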
The performance is markedly inferior to that obtained with XGBoost or the multilayer perceptron, with a strong performance discrepancy across age groups (younger people are predicted to be older, and conversely).

The mean absolute SHAP values of the 20 most important variables are shown for the whole population (gray) and for the female (green) and male (purple) populations. Variable importance is similar across genders.

Each point represents the SHAP value of one variable for one individual; its color encodes the value of the variable (red for high, blue for low). On the x-axis, a positive or negative SHAP value means that the variable, for that individual, contributes positively or negatively, respectively, to the estimation of physiological age.
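The global importance shown in these panels is the mean of absolute SHAP values per variable, computed on the whole population and on each gender separately. A minimal sketch of that aggregation is given below; the SHAP values are made-up numbers standing in for the output of an explainer applied to the trained model, and the function names are hypothetical.

```python
def mean_abs_shap(shap_rows, variables):
    """Mean absolute SHAP value per variable over a set of individuals."""
    n = len(shap_rows)
    return {v: sum(abs(row[v]) for row in shap_rows) / n for v in variables}


def rank_variables(importance, top_n=20):
    """Variables sorted by decreasing mean absolute SHAP value."""
    return sorted(importance, key=importance.get, reverse=True)[:top_n]


variables = ["HbA1c", "albumin", "creatinine"]

# Hypothetical per-individual SHAP values (in years of estimated age).
shap_rows = [
    {"gender": "F", "HbA1c": 2.0, "albumin": -1.0, "creatinine": 0.5},
    {"gender": "F", "HbA1c": -1.0, "albumin": -1.5, "creatinine": 0.0},
    {"gender": "M", "HbA1c": 3.0, "albumin": 1.0, "creatinine": -0.5},
    {"gender": "M", "HbA1c": -2.0, "albumin": 1.0, "creatinine": 0.5},
]

importance = {
    "all": mean_abs_shap(shap_rows, variables),
    "female": mean_abs_shap([r for r in shap_rows if r["gender"] == "F"], variables),
    "male": mean_abs_shap([r for r in shap_rows if r["gender"] == "M"], variables),
}
ranking = {group: rank_variables(scores) for group, scores in importance.items()}
```

Taking the absolute value before averaging matters: a variable that pushes some individuals older and others younger would otherwise average out to a misleadingly small importance.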

Fig. S9: Global explainability of the PPA model, with variables ordered by mean absolute SHAP value, for male (A) and female (B) individuals.
Each point represents the SHAP value of one variable for one individual; its color encodes the value of the variable (red for high, blue for low). On the x-axis, a positive or negative SHAP value means that the variable, for that individual, contributes positively or negatively, respectively, to the estimation of physiological age. A similar explainability profile is found for males and females.

Each dot represents an individual. The color indicates the corresponding chronological age (scale on the right). The x-axis corresponds to the actual value of the variable, while the y-axis corresponds to the SHAP value assigned to this individual for this variable. The dotted line corresponds to a SHAP value of 0: when an individual has a variable value for which the SHAP value is 0, the variable has no impact on the estimated physiological age.
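To make the sign and zero-point interpretation concrete, the sketch below computes exact Shapley values for a toy three-variable age model. It is a simplified illustration under stated assumptions: the model `toy_age_model` and its coefficients are invented, and a single reference individual (`baseline`) replaces the background distribution that SHAP implementations normally average over.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values of `predict` at `x` relative to one reference point."""
    n = len(x)

    def value(subset):
        # Variables in `subset` take the individual's value, others the reference value.
        mixed = [x[i] if i in subset else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def toy_age_model(v):
    """Hypothetical age estimate from HbA1c, albumin and creatinine."""
    hba1c, albumin, creatinine = v
    return (40.0 + 6.0 * (hba1c - 5.0) - 2.0 * (albumin - 4.0)
            + 2.0 * (hba1c - 5.0) * (albumin - 4.0) + 3.0 * (creatinine - 1.0))


baseline = [5.0, 4.0, 1.0]    # reference individual (assumption)
individual = [6.0, 4.5, 1.0]  # creatinine equals the reference value

phi = shapley_values(toy_age_model, individual, baseline)
shift = toy_age_model(individual) - toy_age_model(baseline)
```

Here the HbA1c contribution is positive (the estimate moves older), the albumin contribution is negative (it moves younger), and the creatinine contribution is 0 because its value matches the reference. The three contributions sum to the difference between the individual's estimate and the reference estimate.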
(B) Heatmap of contextualized SHAP values as a function of chronological age. The color of each pixel indicates the average SHAP value of a variable (x-axis) as a function of chronological age (y-axis).

The grid search is presented for each model, together with the best hyperparameters found.

Pathologies
Liver diseases: MCQ160L, MCQ500 (Ever told you had any liver condition)
Coronary Heart Diseases: presence of a condition among MCQ160C (Ever told you had coronary heart disease), MCQ160F (Ever told you had a stroke), MCQ160B (Ever told had congestive heart failure), MCQ160D (Ever told you had angina/angina pectoris) and MCQ160E (Ever told you had heart attack)
Diabetes: presence of a condition among DIQ010 (Doctor told you have diabetes) and DIQ160 (Ever told you have prediabetes)
Thyroid diseases: presence of a condition among MCQ160H (Ever told you had a goiter), MCQ160M (Ever told you had a thyroid problem) and MCQ160I (Ever told you had thyroid disease)
Arthritis: presence of a condition among MCQ160N (Doctor ever told you that you had gout?), MCQ160A (Doctor ever said you had arthritis) and ARQ125E (Ever told had Ankylosing Spondylitis)
Cancer: MCQ220 (Ever told you had cancer or malignancy)
Kidney diseases: presence of a condition among KIQ020, KIQ022 (Ever told you had weak/failing kidneys) and OHQ144 (Have kidney disease w/ renal dialysis?)
Bronchitis: presence of a condition among MCQ160K (Ever told you had chronic bronchitis), MCQ160o (Ever told you had COPD?) and MCQ010 (Ever been told you have asthma)
Auto-immune digestive disease: presence of a condition among ARQ125C (Ever told you had Ulcerative Colitis) and ARQ125D (Ever told you had Crohns Disease)
Digestive ulcer: MCQ200 (Ever told had stomach/duod/peptic ulcer)
Eye disease: presence of a condition among VIQ090 (Ever told had glaucoma) and VIQ310 (Told had macular degeneration)
Dermatologic disease: presence of a condition among DEQ053, MCQ070 (Ever told had Psoriasis?) and AGQ180 (Doctor told have eczema)
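The "presence of a condition among" rule amounts to a logical OR over the listed questionnaire items. A minimal sketch for three of the pathology groups is shown below; it assumes that an affirmative answer is coded 1 and that unanswered items do not count as present, neither of which is stated in the list above. The function and constant names are hypothetical.

```python
# Subset of the pathology definitions listed above.
PATHOLOGY_CODES = {
    "Diabetes": ["DIQ010", "DIQ160"],
    "Cancer": ["MCQ220"],
    "Eye disease": ["VIQ090", "VIQ310"],
}

YES = 1  # assumption: questionnaire answers are coded 1 for "Yes"


def pathology_flags(answers, groups=PATHOLOGY_CODES, yes=YES):
    """Flag a pathology when at least one of its questionnaire items is answered yes."""
    return {name: any(answers.get(code) == yes for code in codes)
            for name, codes in groups.items()}


# Hypothetical individual: prediabetes reported, no cancer, eye items unanswered.
answers = {"DIQ010": 2, "DIQ160": 1, "MCQ220": 2}
flags = pathology_flags(answers)
```

Treating unanswered items as absent is a design choice of this sketch; an analysis could instead mark such individuals as missing for that pathology.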