Comparison of machine‐learning models for the prediction of 1‐year adverse outcomes of patients undergoing primary percutaneous coronary intervention for acute ST‐elevation myocardial infarction

Abstract
Background: Acute ST-elevation myocardial infarction (STEMI) is a leading cause of mortality and morbidity worldwide, and primary percutaneous coronary intervention (PCI) is the preferred treatment option.
Hypothesis: Machine learning (ML) models have the potential to predict adverse clinical outcomes in STEMI patients treated with primary PCI. However, the comparative performance of different ML models for this purpose is unclear.
Methods: This retrospective registry-based study recruited consecutive hospitalized patients diagnosed with acute STEMI and treated with primary PCI from 2011 to 2019 at Tehran Heart Center, Tehran, Iran. Four ML models, namely Gradient Boosting Machine (GBM), Distributed Random Forest (DRF), Logistic Regression (LR), and Deep Learning (DL), were used to predict major adverse cardiovascular events (MACE) during 1-year follow-up.
Results: A total of 4514 patients (3498 men and 1016 women) were enrolled, with MACE occurring in 610 (13.5%) subjects during follow-up. The mean age of the population was 62.1 years, and the MACE group was significantly older than the non-MACE group (66.2 vs. 61.5 years, p < .001). The learning process utilized 70% (n = 3160) of the total population, and the remaining 30% (n = 1354) served as the testing data set. The DRF and GBM models demonstrated the best performance in predicting MACE, with areas under the curve of 0.92 and 0.91, respectively.
Conclusion: ML-based models, such as DRF and GBM, can effectively identify STEMI patients at high risk of adverse events during follow-up. These models can inform personalized treatment strategies, ultimately improving clinical outcomes and reducing the burden of disease.

Acute ST-elevation myocardial infarction (STEMI) is a major cause of morbidity and mortality worldwide, with primary percutaneous coronary intervention (PCI) being the preferred treatment option.1 Despite significant advances in treatment, some patients still experience adverse clinical outcomes, such as cardiogenic shock, heart failure, and death.
Early identification of patients at high risk of adverse clinical outcomes is critical for improving clinical outcomes and reducing the burden of disease. Machine learning (ML) models have shown great potential in identifying high-risk patients and predicting adverse clinical outcomes in various medical conditions, including STEMI.2 However, the comparative performance of various ML models for predicting adverse clinical outcomes of STEMI patients treated with primary PCI remains unclear.
Therefore, in this study, we aimed to compare the performance of four ML models, namely Logistic Regression (LR), Distributed Random Forest (DRF), Gradient Boosting Machine (GBM), and Deep Learning (DL), for predicting adverse clinical outcomes of STEMI patients treated with primary PCI. By leveraging a large data set of clinical and imaging data, we aimed to identify the most accurate and reliable ML model for this purpose. Our findings may help improve the early identification of high-risk patients and facilitate personalized treatment strategies to improve clinical outcomes and reduce the burden of disease.

| Study design and patient selection
In this retrospective study, conducted at Tehran Heart Center (THC), Tehran, Iran, from 2011 to 2019, all consecutive hospitalized patients diagnosed with acute STEMI and treated with primary PCI were recruited. Of the 6340 patients initially entered with a diagnosis of STEMI, 1286 subjects were excluded because they did not undergo PCI and were managed either with medical treatment or scheduled for coronary artery bypass grafting (CABG) surgery, given the complexity of their coronary disease. A further 540 patients were excluded due to substantial missing data in our registration system. Eventually, 4514 subjects were enrolled in the study for further analysis. All patients had at least 1 year of postdischarge follow-up. In general, follow-ups were conducted three times: 1, 6, and 12 months after hospital discharge. Nonetheless, our analysis exclusively used data from the 1-year follow-up assessment as the final outcome, and all models were trained with the specific objective of predicting major adverse cardiovascular events occurring within this 1-year timeframe.

| Study endpoints
The main endpoint of the present study was a composite of major adverse cardiovascular events (MACE) during 1-year follow-up, comprising myocardial infarction, emergent revascularization, hemodynamic instability, and all-cause mortality. Hemodynamic instability was defined as low systolic blood pressure requiring inotrope therapy and/or mechanical ventilation during admission.

| Statistical analysis
As illustrated in Table 1, continuous and categorical variables are presented as means and frequencies, respectively. Continuous variables were compared with the independent-samples t test, and the χ2 test or Fisher exact test was used, as appropriate, to evaluate the relationship between categorical variables and the final adverse outcomes. The significance level for all statistical analyses was set at a p value below .05.

| Data extraction and processing
Demographic, clinical, and paraclinical variables were extracted from electronic health records. A total of 156 initial variables were identified for each patient. Based on our primary analysis (explained in Section 2.6), we eventually retained the 50 most important variables for further model development.

| Missing data
Missing data occurred in several variables. Our approach to missing data combined two methods: (1) imputation or (2) data removal. If a variable was important for the prediction process and its missing values were few, we used the K-nearest neighbor (KNN) approach to impute them. This is a common strategy for handling missing data that replaces missing values with predicted ones while influencing the final analysis less than other traditional methods. On the other hand, if the number of missing values was substantial and the variable was not sufficiently important, the variable was removed.
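The study's pipeline was implemented in R; purely as an illustrative sketch of the idea, a minimal KNN imputation in Python might look like the following (assuming, for simplicity, that only the target column has gaps and distances are Euclidean over the remaining columns; the function name and data layout are hypothetical, not from the study):

```python
import math

def knn_impute(rows, target_idx, k=2):
    """Fill missing values (None) in column `target_idx` with the mean of
    that column among the k nearest complete rows, where distance is
    Euclidean distance over the remaining columns."""
    complete = [r for r in rows if r[target_idx] is not None]
    filled = []
    for r in rows:
        if r[target_idx] is not None:
            filled.append(list(r))
            continue

        def dist(other):
            # Compare on every feature except the one being imputed.
            return math.sqrt(sum(
                (a - b) ** 2
                for i, (a, b) in enumerate(zip(r, other))
                if i != target_idx))

        neighbors = sorted(complete, key=dist)[:k]
        new_row = list(r)
        new_row[target_idx] = sum(n[target_idx] for n in neighbors) / len(neighbors)
        filled.append(new_row)
    return filled
```

For example, a row whose second column is missing would receive the mean of that column from its two closest complete rows.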

| Feature selection
The next step, after data preprocessing and missing-value management, was to choose the best variables for model development. In this step, called "feature selection," we used two methods. First, based on traditional statistical analysis, we determined the variables that differed significantly between the two groups. Second, we applied the more precise L1 regularization (Lasso regression) method, one of the best-established feature-selection techniques in data science; L1 regularization identifies the variables most important for predicting the final outcome. Using this method together with the traditional analysis, we eventually selected the 50 variables most important for model development.
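The reason L1 regularization works for feature selection is that its penalty drives the coefficients of weak predictors exactly to zero. The core operation inside coordinate-descent Lasso solvers is the soft-thresholding operator; the sketch below (illustrative Python, not the study's R code, with hypothetical variable names) shows how it zeroes out small coefficients while shrinking the rest:

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator used in coordinate-descent Lasso:
    shrinks z toward zero by lam, returning exactly 0 when |z| <= lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def select_features(coefs, lam):
    """Keep only the features whose shrunken coefficient is nonzero."""
    shrunk = {name: soft_threshold(c, lam) for name, c in coefs.items()}
    return {name: c for name, c in shrunk.items() if c != 0.0}
```

A coefficient of 0.2 with a penalty of 0.5 is eliminated outright, while stronger predictors survive with reduced magnitude; ranking variables by which survive at increasing penalties yields an importance ordering.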
TABLE 1 Demographic, diagnostic, and procedural characteristics of the study patients.

| Model development
To create the prediction models for the data set of 4514 patients, the data were randomly divided into three subsets: a training set comprising 56% of the total population (n = 2528), a validation set comprising 14% (n = 632), and a testing set comprising 30% (n = 1354). Four prediction models were developed using the R programming language: the GBM and DRF models, which fall under ensemble machine-learning methods, a DL model, and an LR model. The models were trained on the training set, their hyperparameters were tuned on the validation set, and they were finally fitted to the testing set to determine their performance metrics, which were compared to identify the most accurate model.
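As a minimal sketch of this three-way partition (in Python rather than the study's R, with a hypothetical seed; rounding the 56%/14% fractions reproduces the counts quoted above for n = 4514):

```python
import random

def split_indices(n, train_frac=0.56, val_frac=0.14, seed=42):
    """Randomly partition n record indices into training, validation,
    and testing subsets using the 56/14/30 proportions."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

With n = 4514 this yields 2528 training, 632 validation, and 1354 testing indices, matching the counts reported in this section.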

| The ensemble machine learning methods
Ensemble ML methods are algorithms that employ multiple learning techniques to achieve better predictive accuracy than any individual learning algorithm alone. Several popular contemporary ML algorithms are, in fact, ensemble learners, including the Random Forest (RF) and GBM. Bagging, as used in RF, and boosting, as used in GBM, are two methods of ensembling that operate by consolidating a group of weak learners, such as decision trees, into a single, powerful learner.
GBM is a popular machine-learning algorithm used for both regression and classification problems.3 It is an ensemble method that combines the predictions of multiple weaker models to make a more accurate prediction. The basic idea behind gradient boosting is to train a sequence of models, where each subsequent model learns to correct the mistakes made by the previous model; in other words, each new model is trained to minimize the errors of the previous one, and the final prediction is made by combining the predictions of all the models.
The DRF model is another ensemble ML algorithm that uses an ensemble of decision trees to perform classification or regression tasks.4 It is called "distributed" because the model is trained on a distributed computing system, which allows parallel processing of the data. RFs are known for their high accuracy, robustness against overfitting, and ability to handle large data sets. They work by building multiple decision trees on randomly sampled subsets of the data and combining their predictions into a final prediction. In a DRF, these decision trees are built on different nodes of the computing system, allowing faster and more efficient training.
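The "fit each new model to the previous model's errors" idea can be shown with a toy example: boosting decision stumps (single-split trees) on the residuals of a running prediction. This is an illustrative pure-Python sketch of the principle, not the tuned R implementation used in the study:

```python
def fit_stump(x, residuals):
    """Find the single threshold split on x that minimizes the squared
    error of the residuals; return it as a prediction function."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:                      # all x identical: predict zero correction
        return lambda xi: 0.0
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_rounds=20, lr=0.3):
    """Start from the mean, then repeatedly fit a stump to the current
    residuals and add a damped version of its prediction."""
    pred = [sum(y) / len(y)] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return pred
```

Each round shrinks the residuals, so after a few dozen rounds the combined prediction closely fits the training targets; the learning rate `lr` damps each step, which is what controls overfitting in practice.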

| The LR model
The LR model is a typical method for predicting the class of a categorical type variable (the "target variable") using independent variables (the "predictors").LR employs the log odds (the logarithm of the odds) to estimate the probability of one target out of two possible outcomes (binary LR), and the class of the target variable is determined based on this probability. 5
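The log-odds-to-class mapping described above amounts to two lines of arithmetic; as a sketch (illustrative Python, with a hypothetical function name):

```python
import math

def predict_class(log_odds, threshold=0.5):
    """Convert the linear predictor (log odds) to a probability with the
    logistic function, then to a binary class label by thresholding."""
    prob = 1.0 / (1.0 + math.exp(-log_odds))
    return prob, int(prob >= threshold)
```

A log-odds of 0 corresponds to a probability of exactly 0.5, the usual decision boundary for binary LR.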

| The DL model
The DL model uses weights, assigned during a learning process, to connect nodes in consecutive layers. The output of the final layer is the probability of the target variable, which is then converted to the predicted class based on previous learning. Although the DL method is primarily used for developing prediction models on large data sets, it can also be applied to data sets of any size; in such cases, techniques should be implemented to augment the training data set and improve the final estimations.5 The DL model employed in this study is based on a multilayer perceptron (MLP), which is particularly suited to tabular data. The model comprises four layers: an input layer, two hidden layers (each containing ten nodes), and an output layer.
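A forward pass through such an MLP is just repeated weighted sums followed by nonlinearities. The sketch below is illustrative Python only (the study's model was built in R); the ReLU hidden activation and random initialization are assumptions, not details reported by the authors:

```python
import math
import random

def mlp_forward(x, weights, biases):
    """Forward pass of a fully connected network: each layer computes a
    weighted sum plus bias and applies a nonlinearity; the final layer
    uses a sigmoid so the output is a probability."""
    a = x
    for layer, (W, b) in enumerate(zip(weights, biases)):
        z = [sum(wi * ai for wi, ai in zip(row, a)) + bi
             for row, bi in zip(W, b)]
        if layer < len(weights) - 1:
            a = [max(0.0, zi) for zi in z]                   # ReLU (hidden)
        else:
            a = [1.0 / (1.0 + math.exp(-zi)) for zi in z]    # sigmoid (output)
    return a[0]

def random_mlp(sizes, seed=0):
    """Randomly initialized weights/biases for layer sizes like [5, 10, 10, 1]."""
    rng = random.Random(seed)
    weights = [[[rng.uniform(-1, 1) for _ in range(a)] for _ in range(b)]
               for a, b in zip(sizes, sizes[1:])]
    biases = [[0.0] * b for b in sizes[1:]]
    return weights, biases
```

With sizes [n_features, 10, 10, 1] this mirrors the four-layer architecture described above, and the sigmoid guarantees the output lies strictly between 0 and 1.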

| Model evaluation
We evaluated the predictive power of the models on a single testing data set. Several performance metrics should be considered for this purpose, such as the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, accuracy, sensitivity, specificity, F1-score, and the Matthews correlation coefficient (MCC). The F1-score is a statistical index representing the harmonic mean of precision and recall and is widely used for evaluating the performance of classification models. The MCC, in turn, is a more robust index of a classifier's accuracy, ranging between −1 and +1, where +1 indicates perfect prediction without any chance effect or error. The MCC is computed from the confusion matrix as MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN)), and it provides a comprehensive measure of classification performance that considers true and false positives and negatives.
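Both metrics follow directly from the four confusion-matrix counts; as a small worked sketch in Python (illustrative only, not the evaluation code used in the study):

```python
import math

def f1_and_mcc(tp, fp, fn, tn):
    """Compute the F1-score (harmonic mean of precision and recall) and
    the Matthews correlation coefficient from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return f1, mcc
```

For instance, a classifier with TP = 2, FP = 1, FN = 1, TN = 6 has precision = recall = 2/3, hence F1 = 2/3, and MCC = (12 − 1)/√(3 · 3 · 7 · 7) = 11/21 ≈ 0.524.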

| RESULTS
The demographic, diagnostic, and procedural characteristics of the enrolled patients are depicted in Table 1. The whole data set is divided into two groups: the MACE group (n = 610) versus the non-MACE group (n = 3904). The mean age of the total population was 62.1 years, and patients in the MACE group were significantly older than those in the non-MACE group (66.2 vs. 61.5 years, p < .001). Most of the population in both groups was male. Importantly, as the primary PCI protocol for STEMI patients in our center is identical across all scenarios, the pain-to-door, door-to-device, and pain-to-device times were all similar between the two groups. However, the mean lesion length was higher in the MACE group compared with the non-MACE group (27.6 vs. 30.8 mm, p = .003).

| Models performance
To assess the performance of the models in predicting adverse outcomes in a given population, their performance metrics must be evaluated. These metrics for our four models are presented in Table 2 and provide a basis for comparing the models, enabling selection of the most suitable one for predicting adverse outcomes in this population.
Across all metrics, the two ensemble-based models, GBM and DRF, outperformed the other two models.

| The learning curves
The learning curve is a graphical representation of the performance of an ML model during training and validation, and it is a crucial tool for understanding how well a model is learning from the data. It shows the relationship between the training set size and the model's performance: as the training set grows, performance on the training data usually improves, while performance on the validation set tends to plateau. A persistent gap between the curves indicates that the model has learned the patterns of the training data too well, leading to overfitting, which reduces generalization performance on unseen validation data. Finding the right balance between the size of the training set and the complexity of the model is therefore crucial for achieving good generalization.
The learning curves for the two ensemble models are presented in Figure 1. Examining the overall trends of the training and validation curves shows that the DRF model performs better in terms of narrowing the difference between the two curves, whereas the GBM model shows an initial divergence between the curves during the early stages of learning. While the performance metrics of these models are comparable on our testing data set, we anticipate that on a larger testing data set the DRF model would outperform the GBM model.

| DISCUSSION
In the medical field, ML is often employed to create predictive models for complex data sets because of its ability to handle high-dimensional relationships between features. In this study, we aimed to predict the occurrence of 1-year MACE in STEMI patients treated with primary PCI using four ML models: GBM, DRF, DL, and LR. The performance of each model was then compared, with the DRF model exhibiting the highest AUC value, surpassing the other models.
Several conventional risk-score systems have been developed to predict outcomes in these patients.[8][9][10][11] However, despite their popularity, these score systems have limitations: they do not include important predictors such as echocardiographic parameters and laboratory data, which may reduce their effectiveness in subgroups of patients. In such cases, ML-based models that rely on electronic medical records and artificial intelligence can provide a more comprehensive approach to outcome prediction. ML-based models can capture a greater number of variables and more complex relationships between features, thereby improving the accuracy and specificity of STEMI outcome prediction.
In 2023, Kasim reported that ML models developed using ML-based feature selection demonstrated superior performance compared to the conventional risk score, TIMI (AUC: 0.81). Among the individual ML models, SVM Linear with selected features exhibited the best performance, outperforming even the best-performing stacked EL model (AUC: 0.934, confidence interval [CI]: 0.893-0.975 vs. AUC: 0.914, CI: 0.871-0.957). Additionally, the women-specific model demonstrated better performance than the general non-gender-specific model (AUC: 0.919, CI: 0.874-0.965). These findings indicate the potential for ML models to improve risk stratification for patients with cardiovascular disease and may contribute to personalized medical decision-making.12,13
In a noteworthy study by Aziz et al., ML algorithms demonstrated superior predictive performance over the traditional TIMI score system. Notably, the TIMI score system underestimates the risk of mortality: 90% of nonsurviving patients were classified as high risk (>50%) by the ML algorithm, in contrast to 10%-30% of nonsurviving patients by TIMI. These results indicate that ML algorithms may provide a more accurate and reliable risk-stratification approach in clinical settings.14
The study conducted by Bei Shi showed that a deep-learning model, developed for risk stratification of mortality in patients with acute myocardial infarction from a large prospective national registry, had excellent performance in predicting prognosis and outperformed the conventional risk-prediction model.17 One of the main advantages of RF models, in turn, is their ability to handle larger data inputs, nonlinear variables, and variable interactions while avoiding overfitting.

FIGURE 1 The learning curves of the two ensemble models.

| Study limitations
There were some limitations to this study that should be considered when interpreting the results. First, the single-center design may affect the generalizability of the findings; however, Tehran Heart Center is a highly regarded referral hospital in Iran with a diverse patient population.
Second, the retrospective nature of the study raises concerns in interpreting its results and extending the applicability of the prediction models. Third, the sample size was not large enough to allow highly precise training and testing of the models; the low event rate limited the study's statistical power, and further research with a larger sample size is required in this field. Finally, the lack of external validation is another limitation.
The absence of external validation, wherein model performance is assessed on previously unseen data, can reduce generalizability. In ML studies, particularly those involving complex models, the lack of external validation can mask overfitting, where the model fits noise in the training data rather than capturing underlying patterns. Incorporating diverse validation data sets, using cross-validation techniques, and conducting replication studies are strategies to address this problem. As our clinic is one of the few national referral centers with a well-structured registry system for primary PCI patients in the country, we were unable to obtain other structured data for accurate external validation of our models; nevertheless, we evaluated our models on a testing data set that was not seen during the learning process.

| CONCLUSION
Our study has led us to the conclusion that ML-based models can be an effective tool for identifying STEMI patients who are at the highest risk of developing adverse events during follow-up.The field of personalized medicine is one that stands to benefit greatly from these algorithms, as they can aid physicians in detecting high-risk patients earlier and taking appropriate preventive measures.By leveraging the predictive power of ML algorithms, physicians can make more informed decisions about patient care, potentially leading to improved outcomes.

TABLE 2 Performance metrics of the different machine-learning models. AUC, area under the curve of the ROC plot; MCC, Matthews correlation coefficient.