Ischemic stroke prediction using machine learning in elderly Chinese population: The Rugao Longitudinal Ageing Study

Abstract Objective To compare logistic regression (LR) with machine learning (ML) models for predicting the risk of ischemic stroke in an elderly population in China. Methods We used 2208 records from the Rugao Longitudinal Ageing Study (RLAS) to assess ischemic stroke risk prediction. Input variables included 103 phenotypes. For 3‐year ischemic stroke risk prediction, we compared the discrimination and calibration of the LR model with those of five ML methods: Random Forest (RF), Gaussian kernel Support Vector Machines (SVM), Multilayer Perceptron (MLP), K‐Nearest Neighbors (KNN), and Gradient Boosting Decision Tree (GBDT). Results Age, pulse, waist circumference, education level, β2‐microglobulin, homocysteine, cystatin C, folate, free triiodothyronine, platelet distribution width, QT interval, and QTc interval were significant predictors of ischemic stroke. The ML approach drew on more biochemical and ECG‐related multidimensional phenotypic indicators than the LR model, which placed more importance on general demographic indicators. Compared with the LR model, SVM provided the best discrimination and calibration (C‐index: 0.79 vs. 0.71, an 11.27% improvement in model utility), with the best performance in both validation and test data. Conclusion In a comparison of LR with five ML models, combining ML with multiple phenotypes gave more accurate ischemic stroke prediction. Together with other studies of elderly populations in China, these results indicate that ML techniques, especially SVM, show good long‐term predictive performance, which points to the potential value of ML in clinical practice.


INTRODUCTION
Stroke is of great concern as one of the major diseases worldwide.
In China, stroke is the leading cause of death and disability in adults, and morbidity and mortality rates are stable or increasing. Some studies indicate that the prevalence of stroke in China (2.6% in 2020) is already higher than the estimated global prevalence (1.2% in 2019) (Tu et al., 2023). Stroke in China has been reported to occur mostly in the elderly (Bonita et al., 2004; Feigin et al., 2014; Wang et al., 2008; Yong-Jun, 2010). Ischemic stroke is the major type of stroke, and its acute treatment depends heavily on early prediction; timely treatment decisions are the cornerstone of acute stroke management (Wang et al., 2008).
In recent years, a large body of research on stroke and its complications has been based on LR. For example, Johnston et al. (2007) validated and refined scores for predicting very early stroke risk after transient ischemic attack by deriving new unified scores based on LR. Lian et al. (2020) used LR in cohort data to develop a scale for early prediction of brain-heart syndrome after ischemic stroke: the PANSCAN scale. Such risk scores are usually derived from LR risk models and have been validated mainly in Europe and the United States. However, these LR models may underestimate stroke risk in the contemporary Chinese elderly population because of poor calibration, which may lead to inaccurate identification of individuals at high risk of ischemic stroke who would benefit from timely treatment. New risk models should therefore be developed for use in such populations (Chien et al., 2010; Leung et al., 2018; Xing et al., 2019).
With the widespread use of big data in recent years, artificial intelligence has shown strong predictive value for stroke risk, drawing on the ability of ML to automate the decision-making process (Hung et al., 2017; Khosla et al., 2010; Leung et al., 2018; Liu et al., 2019; Weng et al., 2017). For instance, Wu and Fang (2020) showed that ML combined with a data balancing technique is an effective tool for stroke prediction using unbalanced data. Constructing risk prediction models with more robust utility is therefore the goal of our further work. To the best of our knowledge, previous AI-based research on stroke in the Chinese elderly consists of interesting but limited attempts. Existing stroke risk prediction models have poor utility, are mostly based on few variables and small samples, and little is known about their utility for predictive assessment (Yong-Jun, 2010).
The objectives of this study were to (i) identify important indicators that differ between the elderly population with ischemic stroke and the healthy elderly population in China; (ii) develop and compare LR and ML models to identify significant predictors of ischemic stroke and to predict potential ischemic stroke episodes; and (iii) find the optimal ML method for deciding in advance whether preventive interventions are needed, to provide diagnostic support for clinical medicine.

Study population
Data used in this study were obtained from the Rugao Longitudinal Ageing Study (RLAS). RLAS was designed to examine aging health trajectories and outcomes, and its design has been described elsewhere (Liu et al., 2016). This is a population-based longitudinal study con-

Model development and evaluation
Ten-fold cross-validation was used to select features in the model and adjust hyperparameters. Briefly, 10-fold cross-validation means that all data are divided into ten equal parts. In each round, nine of these parts are used for training and the remaining part is held out for testing (Kohavi, 1995; Wu & Fang, 2020). Finally, the results of the ten rounds are averaged to obtain a more stable estimate. The detailed methods and results of parameter tuning are given in Tables S3-S9 and Figure S1 in the Supplementary Material. The derivation and validation of the ML methods were done with Python 3.9 and the Scikit-learn toolkit.
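The fold mechanics described above can be sketched in plain Python. This is a minimal illustration, not the study's code: the helper names `k_fold_indices` and `cross_validate` and the `fit_and_score` callback are hypothetical, and the toy "score" only demonstrates the averaging step.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]

def cross_validate(n_samples, fit_and_score, k=10, seed=0):
    """Average a score over k train/held-out splits.

    fit_and_score(train_idx, held_out_idx) is a hypothetical callback that
    trains on train_idx and returns a score computed on held_out_idx.
    """
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, held_out in enumerate(folds):
        # The other k-1 folds together form the training part for this round.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        scores.append(fit_and_score(train, held_out))
    return mean(scores)

# Toy usage: the "score" is just the held-out fraction, to show the mechanics.
avg = cross_validate(100, lambda train, held: len(held) / (len(train) + len(held)))
```

In practice the callback would fit a classifier on the training indices and return its C-index on the held-out fold; scikit-learn provides equivalent splitters.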

FIGURE 1
A flowchart describing the general framework of the study. Models were built using the training dataset, and the test dataset was used for computing the C-index shown in Table 2.

Statistical analysis
Continuous variables were expressed as the mean ± standard deviation, and one-hot encoding was applied to categorical variables. The Wilcoxon rank-sum test and the χ² test were used for statistical comparisons to identify indicators that differed between the elderly population with ischemic stroke and the healthy elderly population. Specific details are further elaborated in Supplementary Method S2. A two-tailed p < .05 was considered statistically significant.
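As an illustration of the χ² comparison for a categorical variable, the sketch below computes the Pearson statistic for a 2x2 table using only the standard library. The counts are hypothetical, the function name `chi_square_2x2` is ours, and no continuity correction is applied, so the result may differ from statistical packages that apply one by default.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test (no continuity correction) for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square tail equals a two-sided normal tail.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = stroke / no stroke, columns = history yes / no.
stat, p = chi_square_2x2([[10, 20], [20, 10]])
```

A p value below .05 from such a test would flag the variable as differing between the two groups under the significance threshold stated above.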
Figure 1 shows the statistical analysis procedure followed in this study. Because the resulting data samples were heavily imbalanced, we gave higher weights to samples of the minority class (the ischemic stroke class) so that the LR and ML models could learn as much as possible from them, by setting the class weight parameter to "balanced" in the classifier, and we grid-searched the hyperparameters with 10-fold cross-validation. Seventy-five percent of the data set was drawn by stratified sampling from the entire participant group for training/validation; the remaining 25% was used for testing.
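A minimal scikit-learn sketch of this workflow is given below, under stated assumptions: the data are synthetic stand-ins generated with `make_classification`, the hyperparameter grid is illustrative only (the study's grids are in its Supplementary Material), and the feature scaling step and the stratified, shuffled folds are our additions rather than choices confirmed by the text.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic, imbalanced stand-in data (about 10% positives); NOT the RLAS cohort.
X, y = make_classification(n_samples=600, n_features=20, weights=[0.9, 0.1],
                           random_state=0)

# Stratified 75% / 25% split keeps the class ratio similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Gaussian-kernel SVM; class_weight="balanced" up-weights the minority class.
# The scaler is an assumption of this sketch, not a step stated in the paper.
model = make_pipeline(StandardScaler(),
                      SVC(kernel="rbf", class_weight="balanced"))

# Illustrative grid only; the actual search ranges are not reproduced here.
param_grid = {"svc__C": [0.1, 1, 10], "svc__gamma": ["scale", 0.01]}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
search = GridSearchCV(model, param_grid, scoring="roc_auc", cv=cv)
search.fit(X_train, y_train)

# For a binary outcome the C-index equals the area under the ROC curve.
test_auc = roc_auc_score(y_test, search.decision_function(X_test))
```

With `class_weight="balanced"`, scikit-learn weights each class by `n_samples / (n_classes * class_count)`, so the rarer class receives the larger weight.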

RESULTS
A total of 2208 records were included. The mean age was 78.0 ± 4.4 years; 52.8% (n = 1166) of participants were female and 43.6% (n = 963) were illiterate. Overall, 7.6% (n = 167) had a history of cerebral infarction (medical history diagnosed at the township health center level or higher), 23.1% (n = 511) had a history of smoking, either currently or for more than 6 consecutive months in the past, and 34.9% (n = 771) had a history of alcohol consumption, either currently or for more than 6 consecutive months in the past. Participants with ischemic stroke were found to be older (79.6 vs. 77.9), to have wider waist circumference (95.2 vs. 90.3), lower education levels, and more stable pulse, and to be more likely to have a history of cerebral infarction (47.6% vs. 6.7%) compared with the healthy elderly population. In addition, cystatin C (1.2 vs. 1.1), folate (8.9 vs. 10.5), homocysteine (18.2 vs. 15.9), and β2-microglobulin levels (2.5 vs. 2.3) in these ischemic stroke patients differed significantly from those in the healthy elderly population.

Comparisons of LR and ML models to predict risk of ischemic stroke
We used the training dataset to build models with the different methods and optimized them to reduce prediction errors. These models were then applied to the test dataset to check model performance and determine the best predictor variables. Table 2 shows the hyperparameter settings of the LR and five ML models, as well as the C-index in the test dataset.
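For a binary outcome such as ischemic stroke versus no stroke, the C-index can be read as the fraction of (case, non-case) pairs in which the case receives the higher predicted risk, which is the same quantity as the area under the ROC curve. The sketch below is a plain-Python illustration with made-up scores; the function name `c_index_binary` is ours.

```python
def c_index_binary(labels, scores):
    """Fraction of (case, non-case) pairs in which the case has the higher score.

    Ties in score count as half a concordant pair. For a binary outcome this
    equals the area under the ROC curve.
    """
    cases = [s for y, s in zip(labels, scores) if y == 1]
    controls = [s for y, s in zip(labels, scores) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                concordant += 1.0
            elif case_score == control_score:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Toy example with made-up risk scores (1 = ischemic stroke, 0 = no stroke).
toy = c_index_binary([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])  # 3 of 4 pairs concordant
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases.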
Overall, using the C-index as the evaluation metric, some of the models show a large improvement in performance on the test data compared to the validation scores, which indicates that the models generalize well and are of high practical value (Ambale-Venkatesh et al., 2018; Harrell, 1982). Specifically, the receiver operating characteristic (ROC) curves in Figure 2 show that performance varies widely among the models. The C-index of SVM reaches a maximum value of 0.79, making it one of the best performing ML models. The C-index of RF, MLP, and LR ranges from 0.71 to 0.73, indicating good performance, while the performance of KNN and GBDT is not satisfactory. Due to its high validation score and C-index, SVM was confirmed as the best performing model and was

Selection of important predictors for the prediction model
Table 3 lists the top ten important predictor variables learned by LR, RF, and GBDT on the test dataset. These are the variables that play a key role in the models described above. Among the ML models, only RF and GBDT are shown because, for KNN, MLP, and SVM, the inputs and outputs of the models are intuitive, but determining feature importance from the underlying nonlinear model parameters is very challenging and remains to be addressed.
We observed that the older age of ischemic stroke patients in China, compared with the healthy elderly population, may reflect the duration of risk exposure, and that the wider waist circumference may be associated with irregular dietary habits and prolonged sedentary activity. In addition, lower education levels and a history of cerebral infarction appear to be significantly associated with an increased prevalence of ischemic stroke. Equally important, among the biochemical markers, higher β2-microglobulin pointed to a greater likelihood of renal impairment, suggesting a high correlation between renal impairment and ischemic stroke. Similarly, higher levels of homocysteine and cystatin C can lead to atherosclerosis and thrombosis by damaging vascular endothelial cells and affecting lipid metabolism.
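One common way to obtain such rankings with scikit-learn is sketched below, as an illustration rather than the authors' procedure: LR is ranked by the absolute value of its coefficients on standardized inputs, and RF and GBDT by their built-in impurity-based `feature_importances_`. The data are synthetic, the feature names are hypothetical, and the models are fitted on the whole toy data set for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with hypothetical feature names; NOT the RLAS cohort.
X, y = make_classification(n_samples=500, n_features=15, n_informative=5,
                           random_state=0)
names = [f"feature_{i}" for i in range(X.shape[1])]
X_std = StandardScaler().fit_transform(X)  # makes LR coefficients comparable

def top_k(importances, k=10):
    """Return the k feature names with the largest importance values."""
    order = np.argsort(importances)[::-1][:k]
    return [names[i] for i in order]

lr = LogisticRegression(max_iter=1000).fit(X_std, y)
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
gbdt = GradientBoostingClassifier(random_state=0).fit(X, y)

top10 = {
    "LR": top_k(np.abs(lr.coef_[0])),        # absolute standardized coefficients
    "RF": top_k(rf.feature_importances_),    # impurity-based importances
    "GBDT": top_k(gbdt.feature_importances_),
}
```

Kernel SVM, KNN, and MLP expose no comparable attribute, which matches the difficulty noted above; model-agnostic approaches such as permutation importance are one possible alternative.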
Lower folate levels also lead to higher homocysteine, emphasizing the role of thrombosis as a common pathway leading to ischemic stroke. In addition to biomarkers with such a significant profile, lower free triiodothyronine and platelet distribution width also indicate the

TABLE 3
The top 10 important factors in the LR, RF, and GBDT methods.

DISCUSSION
In this large prospective study of RLAS participants, we developed a new risk prediction model for ischemic stroke in older adults in China. By comparing LR with five ML models, we identified SVM as a better risk prediction model (11.27% improvement in C-index compared with the LR model). The higher discrimination of this SVM relative to LR can translate into meaningful public health benefits. For example, a recent analysis of 100,000 British adults reported that a CVD polygenic risk score with a 0.012 increase in the C-index could prevent 7% more CVD events than a traditional risk score alone (Sun et al., 2021). Compared with previous prediction studies of stroke in the Chinese elderly, this model can draw on more risk factors and hence make more accurate early risk predictions, which is crucial for the treatment and prognosis of patients with ischemic stroke (Liu et al., 2007).
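The 11.27% figure is consistent with a relative gain in C-index computed from the rounded values reported for SVM (0.79) and LR (0.71); reading it this way is our interpretation, and the symbols below are introduced only for illustration:

```latex
\[
\frac{C_{\mathrm{SVM}} - C_{\mathrm{LR}}}{C_{\mathrm{LR}}}
= \frac{0.79 - 0.71}{0.71}
= \frac{0.08}{0.71}
\approx 0.1127 = 11.27\%
\]
```

The absolute difference, by contrast, is 0.08 on the C-index scale.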

Logistic regression and machine learning
In contrast to previous risk prediction studies on ischemic stroke, we effectively evaluated the potential of ML techniques to improve risk
include the availability of data on certain risk factors and the need for regular updates and recalibration of clinical diagnostic guidelines (Liu et al., 2020). In this case, underestimation of stroke risk due to poor model calibration may result in failure to identify high-risk individuals who could benefit from prophylactic drug therapy.

Methodological considerations
A major advantage of our model over previous stroke studies is that it is based on a prospective cohort of high-quality longevity and aging data from Rugao, a representative city in China. We found that the sample characteristics of the
evant and original, and it will be effective for subsequent advances in precision medicine (Benjamin et al., 2015; Cheng et al., 2014; Dawber et al., 1951).
Our study also has some limitations.

CONCLUSIONS
Our prospective study shows that, for ischemic stroke prediction in the elderly population in China, ML techniques improve risk prediction compared to the traditional LR approach, with SVM providing the best discrimination and calibration performance. Older age, wider waist circumference, lower education level, kidney injury, atherosclerosis, thrombosis, hypothyroidism, and heart rate derangement were important predictors of ischemic stroke. Our work provides some guidance for the application of big data based on multiple phenotypes. By constructing ML models, we can obtain more meaningful risk predictions, identify biomarkers, and form data-driven hypotheses.

FIGURE 2
The receiver operating characteristic curves for LR and ML models in the test data set. Receiver operating characteristic curves for the logistic regression (LR), Random Forest (RF), Gaussian kernel Support Vector Machines (SVM), Multilayer perceptron (MLP), K-Nearest Neighbors Algorithm (KNN), and Gradient Boosting Decision Tree (GBDT) models. Each area under the curve indicates the corresponding C-index of the model.

TABLE 2
Validation score and test performance (both using the concordance index as the evaluation criterion) for each of the LR and ML models.

predictive ML model for ischemic stroke in the elderly population in China.
Our study was conducted in the Chinese elderly cohort RLAS, and the SVM risk prediction model outlined in the study needs further external validation and refinement in clinical practice, in other aging populations in China, and potentially in other low- and middle-income countries, as the RLAS cohort may not be representative of the entire Chinese elderly population or other populations.

TABLE 1
Characteristics of the elderly study population in Rugao (Conventional Indicators).
Note: p value indicates the significance level of the hypothesis test (Wilcoxon rank-sum test for numeric variables and χ² test for categorical variables; a two-tailed p value < .05 was considered statistically significant). hsCRP, high-sensitivity C-reactive protein; GLU, blood glucose; HDLC, high-density lipoprotein cholesterol; LDLC, low-density lipoprotein cholesterol; TG, triglyceride; UA, uric acid; SBP, systolic blood pressure; DBP, diastolic blood pressure.
The relative variable importance of each variable can be assessed by visualizing the coefficients (which can be seen in Figure S2 in the Supplementary Material).
On the other hand, in our training of supervised ML, we used a rigorous data cleaning method to ensure the quality of ischemic stroke data obtained from the RLAS cohort for training and testing, while taking into account the selection of variable features that are more readily available and have simpler and more convenient clinical utility than image data. Based on the aging difference in China, this risk prediction model construction for ischemic stroke in Chinese urban elderly is rel-
portion of SBP > 140 mm Hg in China was lower than that in clinic diagnoses in the United States (California clinic) and the United Kingdom (Oxford clinic) (36% vs. 60%, 36% vs. 48%), while the proportion of DBP > 90 mm Hg was higher (32% vs. 28%, 32% vs. 29%), and the proportion of HD was lower (8% vs. 18%, 8% vs. 10%). Therefore, blindly applying predictive models from previous studies will result in