The interpretable machine learning model associated with metal mixtures to identify hypertension via EMR mining method

Abstract There are limited data available regarding the connection between hypertension and heavy metal exposure. The authors intended to establish an interpretable machine learning (ML) model, with high efficiency and robustness, that identifies hypertension based on heavy metal exposure. The datasets were obtained from the US National Health and Nutrition Examination Survey (NHANES, 2013–2020.3). The authors developed 5 ML models for hypertension identification by heavy metal exposure and evaluated them against 10 discrimination characteristics. The authors then chose the optimally performing model, after parameter adjustment by a Genetic Algorithm (GA), for identification. Finally, to visualize the model's decision-making, the authors used the SHapley Additive exPlanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) algorithms to illustrate the features. The study included 19 368 participants in total. The best-performing model, an eXtreme Gradient Boosting (XGB) model with GA for hypertension identification by 16 heavy metals, was selected (AUC: 0.774; 95% CI: 0.772–0.776; accuracy: 87.7%). According to SHAP values, Barium (0.02), Cadmium (0.017), Lead (0.017), Antimony (0.008), Tin (0.007), Manganese (0.006), Thallium (0.004), and Tungsten (0.004) in urine, and Lead (0.048), Mercury (0.035), Selenium (0.05), and Manganese (0.007) in blood positively influenced the model, while Cadmium (−0.001) in blood negatively influenced the model. Study participants' hypertension associated with heavy metal exposure was identified by an efficient, robust, and interpretable GA-XGB model with SHAP and LIME. Barium, Cadmium, Lead, Antimony, Tin, Manganese, Thallium, and Tungsten in urine, and Lead, Mercury, Selenium, and Manganese in blood are positively correlated with hypertension, while Cadmium in blood is negatively correlated with hypertension.


INTRODUCTION
The number of individuals with hypertension has doubled since 1990, and 1.28 billion adults worldwide currently have hypertension. 1 Risk factors for hypertension include obesity, lack of physical exercise, alcohol consumption, etc. 2 According to epidemiological data, exposure to environmental metals is associated with hypertension. 3 Genetics, diet, and lifestyle are currently well-established risk factors for hypertension, 4 while previous research has suggested that metal exposure may also contribute to hypertension's etiology. 5 Metals can enter the body in a variety of ways, such as air inhalation, skin contact, and digestion. 6 Essential elements play an important part in human physiological activities, including immunity, metabolism, and development. 7 However, insufficient or excessive essential elements may exert adverse effects on human health. 7,8 Harmful metals can disrupt the body's homeostasis and damage organs. 9 Numerous epidemiological studies have focused on the impacts of metal exposure on hypertension; however, the results remain uncertain. For example, a systematic review found a positive correlation between blood cadmium levels and hypertension prevalence in adults, 10 whereas a cross-sectional study in Canada 5 showed that urinary cadmium was inversely correlated with blood pressure among general adults. The conflict between research results might be due to study designs or exposure levels; therefore, further studies are necessary to confirm the association.
In this way, a new analytical approach could be used to better identify the association between hypertension and heavy metal exposure. 16,17
With the development of computer science and expanding information sources, researchers face a huge challenge when mining hidden meaning from big data. 18 Due to its black-box nature, ML imposes fewer requirements on data preprocessing, which increases the possibility of analyzing large volumes of information and can support hazard identification and other decision-making for health. 19 In our study, we chose the NHANES (2013−2020.3) datasets for mining the connection between hypertension and heavy metal exposure.
We selected 5 ML models to identify hypertension by heavy metal exposure, compared their performance characteristics, and then used a Genetic Algorithm (GA) to improve the efficiency of the best one.
Further, the study incorporated an advanced EMR mining technique based on SHAP 20 and Local Interpretable Model-Agnostic Explanations (LIME) 21 into the evaluation of the heavy metals' contributions to the identification of hypertension, boosting the likelihood of early intervention.

Demographic characteristics of the study participants
Participants' demographics and other relevant characteristics were collected in NHANES. Characteristics included gender, age (in years at screening), race/Hispanic origin (with NH Asian), education level (college or above, high school or equivalent, and less than high school), poverty-to-income ratio (PIR) (≤1, 1−4, and ≥4), 22 and body mass index (BMI, kg/m²).

Heavy metals
Our analyses included the urinary and blood levels of 16 heavy metals.

Pre-processing of features
In our research, we selected 22 variables (also known as features in the ML field). Among them, 19 were continuous and 3 were categorical. We eliminated features with a missing rate of 10% or above.
Missing values of continuous variables were imputed with the median, those of unordered categorical variables with the mode, and those of ordinal categorical variables with their nearest-neighbor values.
We used the Standard Scaler to standardize features and one-hot encoding to transform categorical variables for the ML models. 25 We used Principal Component Analysis (PCA) and the Select K Best (SKB) algorithm to extract features. 26 We removed variables that contributed little to the model during preprocessing in order to avoid overfitting.
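The preprocessing steps above can be sketched with scikit-learn on synthetic data. This is an illustrative reconstruction, not the authors' code: the data, feature counts, and the choice of k are placeholders.

```python
# Sketch of the described preprocessing: median imputation,
# standardization, one-hot encoding, and SelectKBest feature scoring.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(0)
X_num = rng.normal(size=(100, 5))                # 5 continuous features
X_num[rng.random(X_num.shape) < 0.05] = np.nan   # ~5% missing values
X_cat = rng.integers(0, 3, size=(100, 1))        # 1 categorical feature
y = rng.integers(0, 2, size=100)                 # binary outcome

# Median imputation, then standardization, for continuous variables
X_num = SimpleImputer(strategy="median").fit_transform(X_num)
X_num = StandardScaler().fit_transform(X_num)

# One-hot encoding for the categorical variable
X_cat = OneHotEncoder().fit_transform(X_cat).toarray()

X = np.hstack([X_num, X_cat])

# Keep the top-k features by univariate F-score
selector = SelectKBest(f_classif, k=6).fit(X, y)
X_sel = selector.transform(X)
print(X_sel.shape)
```

In the study itself, 22 features were reduced to the top 18 by SKB score; the synthetic k=6 here merely shows the mechanism.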

Model establishment
The study data were split into train and test sets by repeated K-Fold cross-validation. 27 We employed 5 ML algorithms, including Deep Neural Networks (DNN), Support Vector Machine (SVM), Gaussian Naive Bayes (GNB), Decision Tree (DT), and eXtreme Gradient Boosting (XGB), to establish models for the identification of hypertension by heavy metal exposure. These five models have their own characteristics. The DNN method is usually more accurate, with a simple structure for data training; meanwhile, it also has strong black-box characteristics, that is, its discrimination principle is difficult for people to understand. 28 SVM is relatively insensitive to the data but can process nonlinear, multidimensional datasets. 29 GNB performs well on small-scale data, can handle multiple classification tasks, and is suitable for incremental training, though noise and redundancy may occur. 30,31 Visual analytics are supported by DT, which is easy to comprehend and interpret but susceptible to over-fitting. 32 XGB is an optimized distributed gradient boosting library designed to be highly efficient, flexible, and portable 33 ; however, XGB has too many model parameters to adjust manually for optimal efficiency. 34 We chose the model best suited for identifying the disease after comparing the discrimination features of the five models, and then used GA to adjust its parameters to overcome this drawback. SHAP and LIME values were used to illustrate our model with related risk variables for hypertension identification in participants from 2013 to March 2020. 35 SHAP was used for global interpretation, and LIME for local interpretation.
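The five-model comparison under repeated K-Fold cross-validation can be sketched as follows. This is an illustrative reconstruction on synthetic data: scikit-learn's GradientBoostingClassifier stands in for XGB so the example needs no xgboost dependency, and all hyperparameters are placeholders.

```python
# Sketch of the model-comparison step: five model families evaluated
# by repeated K-Fold AUC on synthetic data. GradientBoostingClassifier
# is a stand-in for XGB; all settings are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=18, random_state=0)
cv = RepeatedKFold(n_splits=5, n_repeats=2, random_state=0)

models = {
    "DNN": MLPClassifier(hidden_layer_sizes=(16,), max_iter=500, random_state=0),
    "SVM": SVC(random_state=0),
    "GNB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "XGB*": GradientBoostingClassifier(random_state=0),  # stand-in for XGB
}
for name, model in models.items():
    auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc").mean()
    print(f"{name}: mean AUC = {auc:.3f}")
```

The repeated K-Fold splitter yields 10 fits per model here (5 folds × 2 repeats), giving the averaged AUC used for comparison.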

Statistical analysis
Continuous variables were described as the median (interquartile range), and categorical variables as the number (percentage). The chi-square test was used to compare group-specific characteristics. Geometric means (geometric standard deviations) were used to describe heavy metals. Across the 8+ years (three data release cycles), trends were examined using the Mann-Kendall test.
The indicators used for model effectiveness testing included the average area under the curve (AAUC) 36 with 95% confidence intervals (95% CI), best AUC (BAUC), average precision score (APS), average recall, average F1 score, average accuracy, average Brier score loss, average cross-entropy loss, average Jaccard index, and average Cohen's kappa of each model under repeated K-Fold cross-validation.
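The per-fold metrics listed above can all be computed with `sklearn.metrics`; a minimal sketch on toy predictions (not the study's data) shows the mapping from indicator name to function:

```python
# How the evaluation metrics listed in the text map onto
# sklearn.metrics functions. y_true/y_prob are toy placeholders.
import numpy as np
from sklearn import metrics

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.4, 0.1])
y_pred = (y_prob >= 0.5).astype(int)  # threshold at 0.5

results = {
    "AUC": metrics.roc_auc_score(y_true, y_prob),
    "precision score": metrics.average_precision_score(y_true, y_prob),
    "recall": metrics.recall_score(y_true, y_pred),
    "F1 score": metrics.f1_score(y_true, y_pred),
    "accuracy": metrics.accuracy_score(y_true, y_pred),
    "Brier score loss": metrics.brier_score_loss(y_true, y_prob),
    "cross-entropy loss": metrics.log_loss(y_true, y_prob),
    "Jaccard index": metrics.jaccard_score(y_true, y_pred),
    "Cohen's kappa": metrics.cohen_kappa_score(y_true, y_pred),
}
for name, value in results.items():
    print(f"{name}: {value:.3f}")
```

In the study each metric is averaged over the repeated K-Fold splits to obtain the "average" indicators (AAUC, APS, etc.).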
Python 3.9.7 was used for all analyses, with p < .05 considered statistically significant. An overview of our methodology is shown in Figure 1.

Participants' demographic characteristics
The characteristics of study participants are summarized in Table 1.
We analyzed 19 368 participants in total. Among them, 2650 were diagnosed with hypertension, 9397 were men, and the average age was 57. Participants with hypertension were more likely to be older, non-Hispanic white, to have a higher BMI, and to have an education level of college or above (all p < .05).

Heavy metals' concentrations
The heavy metal concentrations in urine and blood for each data release cycle are described in Table 2. Across the data release cycles, Barium, Cadmium, Cobalt, Cesium, Manganese, Lead, Antimony, Tin, Thallium, and Tungsten in urine, and Lead, Cadmium, Mercury, Selenium, and Manganese in blood showed significant trends (all p for trend < .05).

Models' preprocessing
In feature selection, PCA determined that at least 18 variables were required to retain more than 90% of the original information, and SKB feature scores ranged from 0.01 to 1083.44. We selected the top 18 features by score for our ML models; the 5 ML algorithms were then applied to the NHANES datasets using repeated K-Fold cross-validation to train the models.

Models' performance
The XGB model had the best AAUC (AUC: 0.766; 95% CI: 0.763−0.769), BAUC (0.927), and APS (0.38), significantly higher than the corresponding values of the other four models (p < .05). To pursue a better AAUC and APS in identifying hypertension, we adopted GA for parameter adjustment and obtained the GA-XGB model (AUC: 0.774; 95% CI: 0.772−0.776; accuracy: 87.7%).

Models' comparison
Table 3 shows the performance comparison of the ML models. The AAUC, BAUC, APS, average recall, average F1 score, average accuracy, average Brier score loss, average cross-entropy loss, average Jaccard index, and average Cohen's kappa for all 5 ML models are shown in Table 3. XGB reached the best result among the five models on 5 of the 10 performance indicators. Notably, the AAUC (AUC: 0.766; 95% CI: 0.763−0.769), BAUC (0.927), and APS (0.38) of XGB were the best of all 5 ML models. The comparison results demonstrate that XGB has the best performance of the five for hypertension identification. We then used GA to adjust the parameters of XGB to further improve its effectiveness, as shown on the far right of Table 3.
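The GA parameter adjustment can be sketched with a minimal standard-library genetic algorithm. Everything here is a placeholder: `fitness()` stands in for the cross-validated AUC of an XGB model, and the two tuned hyperparameters (learning rate and tree depth) are illustrative choices, not the authors' search space.

```python
# A minimal genetic algorithm for hyperparameter tuning, stdlib only.
# fitness() is a hypothetical stand-in for cross-validated model AUC;
# in practice it would train and score the model for each candidate.
import random

random.seed(42)

def fitness(lr, depth):
    # Hypothetical smooth objective peaking near lr=0.1, depth=6.
    return 1.0 - (lr - 0.1) ** 2 - 0.01 * (depth - 6) ** 2

def random_individual():
    return (random.uniform(0.01, 0.5), random.randint(2, 10))

def mutate(ind):
    lr, depth = ind
    lr = min(0.5, max(0.01, lr + random.gauss(0, 0.05)))
    depth = min(10, max(2, depth + random.choice([-1, 0, 1])))
    return (lr, depth)

def crossover(a, b):
    # Child inherits one hyperparameter from each parent.
    return (a[0], b[1])

population = [random_individual() for _ in range(20)]
for generation in range(30):
    population.sort(key=lambda ind: fitness(*ind), reverse=True)
    parents = population[:10]                              # selection
    children = [mutate(crossover(random.choice(parents),
                                 random.choice(parents)))
                for _ in range(10)]                        # crossover + mutation
    population = parents + children

best = max(population, key=lambda ind: fitness(*ind))
print(best)
```

The GA thus searches the parameter space by selection, crossover, and mutation rather than exhaustively, which is why it suits XGB's many tunable parameters.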

Feature importance visualization
SHAP and LIME were used to visualize the features' influence on hypertension identification by the GA-XGB model. The SHAP&LIME summary plot demonstrates the impact of each selected feature on the model's identification of hypertension (Figure 3).
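SHAP feature attributions are Shapley values from cooperative game theory. A minimal exact computation for a hypothetical three-feature value function illustrates the quantity that SHAP's tree explainer computes efficiently; the value function and its contributions (borrowed from the SHAP magnitudes reported in this study) are placeholders, not the paper's fitted model.

```python
# Exact Shapley values for a toy three-feature "model". v(S) is a
# hypothetical value function: the expected model output when only
# the features in S are known. It is NOT the paper's fitted model.
from itertools import permutations

FEATURES = ("urinary_barium", "blood_lead", "blood_cadmium")

def v(subset):
    # Hypothetical additive value function; contributions echo the
    # SHAP magnitudes reported in the study, for illustration only.
    base = 0.14  # hypothetical baseline output
    contrib = {"urinary_barium": 0.02, "blood_lead": 0.048,
               "blood_cadmium": -0.001}
    return base + sum(contrib[f] for f in subset)

def shapley(feature):
    # Average marginal contribution of `feature` over all orderings.
    total, n = 0.0, 0
    for order in permutations(FEATURES):
        i = order.index(feature)
        before = frozenset(order[:i])
        total += v(before | {feature}) - v(before)
        n += 1
    return total / n

for f in FEATURES:
    print(f, round(shapley(f), 4))
```

For an additive value function the Shapley value of each feature equals its contribution exactly; for a real tree ensemble the contributions interact, and SHAP's tree explainer recovers the same averages without enumerating all orderings.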

Prediction interpretation
In the SHAP decision plot on the right side of Figure 4, each line represents one participant. The lines converge to the single point of 0.877. The features are listed in descending order based on the plotted observations. The tree plot on the left side of Figure 4 shows the optimal logic of discrimination and serves as one of the basic trees of the decision logic.

F I G U R E 2 The best receiver operating characteristic curve and precision-recall curve for the models.

DISCUSSION
In our study, we developed an interpretable ML strategy for identifying hypertension in relation to heavy metal exposure in the 2013−2020.3 NHANES data. The GA-XGB model was chosen to identify hypertension because it performed the best among the 5 ML algorithms.
The GA-XGB model performed well, with an average AUC of 0.774 and an accuracy of 0.877. Meanwhile, combining the game-theoretic SHAP method with LIME compensates for the shortcomings of each algorithm alone, demonstrating the significance of each model feature both globally and locally through the summary and decision plots. Our results indicate that the SHAP&LIME-GA-XGB model has encouraging potential for hypertension identification by heavy metal exposure.
Our research builds on earlier studies that used ML algorithms to predict diseases, [27,33,35] which found that sophisticated classification algorithms can increase prediction accuracy.
ML, a subset of artificial intelligence, uses mathematical algorithms to find and classify structures in heterogeneous data so that decisions can be made. 18,37 It is difficult to understand how an ML algorithm reaches its conclusions. 38 Meanwhile, medical decision-making has been hindered by this lack of understandability.
The United States has placed significant emphasis on heavy metal exposure through a variety of environmental programs beginning in 2013. 39 Also from 2013, the standard ICD-10 definition of hypertension has been applied to the disease records of NHANES. Environmental heavy metal exposure levels decreased as a direct result of policy and treatment programs, while hypertension incidence also varied. 40 We utilized large amounts of data to develop ML models, focusing on the heavy metal concentrations in each participant's urine and blood. The AAUC of the GA-XGB model was 0.774, demonstrating good efficiency. In addition, we used 5 ML algorithms, which have been described in other ML studies addressing hypertension [12][13][14] or other diseases, to identify hypertension by heavy metal exposure. Some of them were efficient and applicable to raw data. In particular, the accuracy of algorithm prediction improved with the improvement of data authenticity. 41 Further, we evaluated the multi-level prediction potential of the ML models.

F I G U R E 4 The SHAP-GA-XGB decision plot.
Consistent with our SHAP findings, a systematic review suggested that high levels of Cadmium might increase the incidence of hypertension. 10 In addition, systolic and diastolic blood pressure were statistically significantly associated with peak blood Lead level, and a blood Lead level ≥ 6.87 µg/dL was associated with hypertension. 42 Moreover, heavy metals showed particular relevance for special populations. One study suggested that increased Manganese during pregnancy might be a potential risk factor for inducing pregnancy hypertension, 43 and urinary Antimony was consistently and dose-responsively associated with increased blood pressure and hypertension, with Antimony being the major contributor among children. 44 When it comes to the analysis and explanation of particular features, experts will benefit in the future from continuous tracking, because it will help them reach logical conclusions rather than simply accepting the algorithm's predictions. Further studies could also focus on validating the model's performance by expanding the database and increasing the participation of clinicians' judgment. 45

Limitations
The study has several limitations. First, other characteristics, which might have demonstrated dynamic correlations, were not disaggregated because of computational constraints when analyzing limited-access data. Second, although the ICD-10 is normative, the diagnosis of hypertension was self-reported by participants in the NHANES questionnaire data, 46 which may have introduced information bias. Any resulting misclassification of hypertension may have affected the ML models' ability to accurately identify it. Third, numerous data were missing due to the strict inclusion criteria for study participants, which may have led to bias. Finally, the models' complexity of interpretation may limit their reproducibility.

CONCLUSIONS
In our study among US NHANES 2013−2020.3 participants, the SHAP&LIME-GA-XGB model was found to be an efficient, robust, and interpretable ML model for identifying hypertension associated with heavy metal exposure.

F I G U R E 1
Overview plot.
The SHAP value plot on the left side of Figure 3 globally indicates that Barium (0.02), Cadmium (0.017), Lead (0.017), Antimony (0.008), Tin (0.007), Manganese (0.006), Thallium (0.004), and Tungsten (0.004) in urine, and Lead (0.048), Mercury (0.035), Selenium (0.05), and Manganese (0.007) in blood positively influence the model, while Cadmium (−0.001) in blood negatively influences the model. Additionally, the SHAP&LIME summary plot shows that older age, non-Hispanic origin, a lower education level, a higher PIR, and a higher BMI are related to higher hypertension risk. The SHAP interaction value plot on the upper right side of Figure 3 demonstrates the interactions between the main features. The LIME value plot on the lower right side of Figure 3 locally indicates the feature importance for a single sample's discrimination (the 5000th sample). SHAP values illustrate the features' contributions to the model's identification of hypertension.
The GA-XGB model had the best performance in terms of classification robustness, according to the comprehensive comparison results across the 10 discrimination characteristics. At the same time, we effectively avoided over-fitting and under-fitting through repeated K-Fold splitting. We applied SHAP values for global interpretation and LIME values for local interpretation to the GA-XGB model with the intention of achieving the best interpretability, because the ML methodology is otherwise hard to comprehend correctly and its identification results hard to demonstrate visually. A positive SHAP value indicates a positive conditional association between the related feature and hypertension, whereas a negative SHAP value implies the opposite. SHAP with a tree explainer can visualize the decision-making process of the model. The findings of SHAP were comparable to those of previous studies, which primarily focused on determining how heavy metal exposure affects hypertension. The hazard ratio of the highest quartile of Cadmium compared with the reference group was 1.42 (95% confidence interval [CI] 1.09−2.02) for cases of hypertension. Through various survey strategies, the US NHANES study investigated the US population for demographics, dietary, examination, laboratory, and questionnaire data; all data can be found on the US CDC's NHANES website.

T A B L E 1 The study participants' characteristics in NHANES (2013−2020.3).
T A B L E 3 Comparison of ML models' performance. AAUC, average area under the curve; APS, average precision score; BAUC, best area under the curve; DNN, deep neural networks; DT, decision tree classifier; GNB, Gaussian naive Bayes; NA, null; SVM, support vector machine; XGB, extreme gradient boosting.