Comparison among random forest, logistic regression, and existing clinical risk scores for predicting outcomes in patients with atrial fibrillation: A report from the J‐RHYTHM registry

Abstract Background Machine learning (ML) has emerged as a promising tool for risk stratification. However, few studies have applied ML to risk assessment of patients with atrial fibrillation (AF). Hypothesis We aimed to compare the performance of random forest (RF), logistic regression (LR), and conventional risk schemes in predicting the outcomes of AF. Methods We analyzed data from 7406 nonvalvular AF patients (median age 71 years, female 29.2%) enrolled in a nationwide AF registry (J‐RHYTHM Registry) and who were followed for 2 years. The endpoints were thromboembolisms, major bleeding, and all‐cause mortality. Models were generated from potential predictors using an RF model, stepwise LR model, and the thromboembolism (CHADS2 and CHA2DS2‐VASc) and major bleeding (HAS‐BLED, ORBIT, and ATRIA) scores. Results For thromboembolisms, the C‐statistic of the RF model was significantly higher than that of the LR model (0.66 vs. 0.59, p = .03) or CHA2DS2‐VASc score (0.61, p < .01). For major bleeding, the C‐statistic of RF was comparable to the LR (0.69 vs. 0.66, p = .07) and outperformed the HAS‐BLED (0.61, p < .01) and ATRIA (0.62, p < .01) but not the ORBIT (0.67, p = .07). The C‐statistic of RF for all‐cause mortality was comparable to the LR (0.78 vs. 0.79, p = .21). The calibration plot for the RF model was more aligned with the observed events for major bleeding and all‐cause mortality. Conclusions The RF model performed as well as or better than the LR model or existing clinical risk scores for predicting clinical outcomes of AF.


| INTRODUCTION
Atrial fibrillation (AF) is the most common sustained arrhythmia seen in the elderly population and is associated with an increased risk of thromboembolisms, major bleeding, and mortality. 1 Treatment decisions for AF are often made by risk prediction models built using a regression analysis, but their accuracy is modest. [2][3][4][5][6] AF is a highly heterogeneous condition caused by various underlying disorders, and a simple risk score may limit the performance of the risk stratification. 7 Therefore, more accurate and personalized risk stratification approaches are required.
Machine learning (ML), the use of mathematical algorithms that address the higher dimensional, nonlinear relationships among many variables, is making significant progress. [8][9][10] Promising applications of ML in cardiology include improved automated risk prediction and interpretation of medical imaging, both of which could have a dramatic impact on the practice of cardiology. Several studies have shown that ML outperforms traditional logistic models in risk prediction. Mortazavi et al. showed an improved prediction of readmissions for worsening heart failure with ML models as compared to a logistic regression (LR) analysis. 11 In another study using a large multicenter database, the ML model was more accurate in detecting clinical deterioration in hospitalized patients than the traditional regression models. 12 In a more recent study using patients admitted to the intensive care unit, Hyland et al. developed a new approach that provides early identification of patients at risk for circulatory failure with a much lower false-alarm rate than conventional threshold-based systems. 13 However, contradictory results have also been reported. 14,15 While AF patients represent an important target population for whom adverse events need predicting, few studies have applied ML to their risk assessment. Therefore, the aim of this study was to compare the discrimination and calibration performance of an ML algorithm, the random forest (RF) model, against a stepwise LR model and several conventional score-based risk predictors in predicting thromboembolisms, major bleeding, and all-cause mortality, using a prospective nationwide registry of AF patients. [16][17][18]

2 | METHODS

| Patients
For this study, we used individual patient data from the J-RHYTHM Registry. [16][17][18] The J-RHYTHM Registry is an observational, prospective cohort study that enrolled patients with AF between January and July of 2009 at 150 sites within Japan. In this post-hoc study, after excluding patients with mitral stenosis or those who had undergone mechanical valve replacements (n = 410), the final cohort included 7406 patients. Warfarin was used as an oral anticoagulation therapy because no direct oral anticoagulants were available when this registry was carried out. The study protocol conformed to the 1975 Declaration of Helsinki and was approved by the Nippon Medical School institutional review board and review board at each enrolling center.
All patients gave their written informed consent. The data that support the findings of this study are available from the corresponding author upon reasonable request.

| Endpoints
The endpoint of thromboembolisms included ischemic strokes, transient ischemic attacks, and systemic embolisms. Major bleeding as the safety endpoint included intracranial hemorrhage, gastrointestinal bleeding, and other causes of bleeding requiring hospitalization. The all-cause mortality was also tallied. The diagnostic criteria for each event have been described in research design papers. 16,17 The patients were followed for 2 years, or until an endpoint, whichever occurred first. All analyses of the rates of the endpoints were based on the first event during follow-up. A local investigator ascertained the events, and members of the outcomes review committee adjudicated all outcomes.

| Risk scores
The components of the CHADS2 2 and CHA2DS2-VASc 3 scores for thromboembolisms and the HAS-BLED, 4 ORBIT, 5 and ATRIA 6 scores for major bleeding are shown in the Supplementary file (Appendix S1).
In the CHA2DS2-VASc score, we modified the "V" criterion to include coronary artery disease only, because no data were available regarding peripheral artery disease and aortic plaque. The time in therapeutic range (TTR) was determined with the method of Rosendaal et al. 19 and a labile international normalized ratio (INR) was defined as TTR < 60%.
For this determination, the target INR level was set at 1.6-2.6 for patients aged 70 years or older and at 2.0-3.0 for patients aged younger than 70 years, in keeping with Japanese guidelines for AF pharmacotherapy. 20 We assessed the predictive accuracy of the CHADS2 and CHA2DS2-VASc scores for thromboembolisms and all-cause mortality 21 and the HAS-BLED, ORBIT, and ATRIA scores for major bleeding.
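The TTR calculation can be illustrated with a minimal Python sketch, given here as an illustration and not as the registry's actual code. It assumes the usual reading of the Rosendaal method, in which the INR changes linearly between consecutive measurements and the TTR is the fraction of follow-up time spent inside the target range. The function names and the `(day, inr)` input format are hypothetical.

```python
def target_inr_range(age):
    """Return the (low, high) target INR range for a given age, as stated in the text."""
    return (1.6, 2.6) if age >= 70 else (2.0, 3.0)


def rosendaal_ttr(measurements, low, high):
    """Fraction of follow-up time with the INR inside [low, high].

    measurements: list of (day, inr) pairs sorted by day (hypothetical format).
    The INR is assumed to change linearly between consecutive measurements.
    """
    total_days = 0.0
    days_in_range = 0.0
    for (day0, inr0), (day1, inr1) in zip(measurements, measurements[1:]):
        span = day1 - day0
        if span <= 0:
            continue  # skip duplicate or unsorted visits
        if inr0 == inr1:
            fraction = 1.0 if low <= inr0 <= high else 0.0
        else:
            # Share of the interpolated INR segment that lies inside the range
            seg_low, seg_high = min(inr0, inr1), max(inr0, inr1)
            overlap = max(0.0, min(seg_high, high) - max(seg_low, low))
            fraction = overlap / (seg_high - seg_low)
        total_days += span
        days_in_range += fraction * span
    if total_days == 0:
        raise ValueError("need at least two measurements on different days")
    return days_in_range / total_days


def is_labile_inr(ttr):
    """Labile INR as defined in the text: TTR below 60%."""
    return ttr < 0.60


# Illustrative values only, not registry data
low, high = target_inr_range(age=72)
ttr = rosendaal_ttr([(0, 1.4), (30, 2.0), (60, 2.4)], low, high)
print(f"TTR = {ttr:.1%}, labile INR: {is_labile_inr(ttr)}")
```

Because the target range depends on age, the same INR series can give a different TTR for a patient younger than 70 years.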

| Statistical analysis
The statistical analyses were performed with R project software (R foundation, Vienna, Austria). An RF analysis was performed using the Scikit-learn open-source ML library, version 0.21.2. In this study, we used an RF algorithm, which is a decision tree-based ensemble learning method for the classification, regression, and clustering of the data. 22 The RF analysis was composed of three steps: (1) missing values imputation, (2) classification model building, and (3) feature selection.

| Classification model building
The RF classifier was trained (80% of an overall cohort) and tested (20%) on the feature-selected variables. After hyperparameter tuning and feature selection on the training data, the model was fit to the training data set. The predictive capacity of the models was estimated by the mean value and 95% confidence interval of the C-statistic over 5-fold cross-validation. In this study, the RF model was fit using 1000 trees.
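The evaluation described above can be sketched in plain Python, as a minimal illustration and not the authors' code. The sketch computes the C-statistic by pairwise concordance, builds 5 folds, and summarizes the per-fold values as a mean with a 95% confidence interval. The stratified fold assignment, the normal-approximation interval (mean ± 1.96 SE), and all function names are assumptions, since the text does not state how the folds or the interval were constructed. In scikit-learn, the forest itself would typically be created with `RandomForestClassifier(n_estimators=1000)`.

```python
import math
import random
import statistics


def c_statistic(y_true, scores):
    """Probability that a random event case scores higher than a random non-event case.

    Ties count as 0.5. For a binary outcome this equals the area under the ROC curve.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


def stratified_folds(y_true, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with a similar event rate in every fold.

    Stratification is an assumption of this sketch, not a detail given in the text.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for label in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == label]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def mean_and_ci95(values):
    """Mean and normal-approximation 95% confidence interval of per-fold values."""
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return m, (m - 1.96 * se, m + 1.96 * se)
```

In use, a model would be fitted on each `train_idx`, scored on the matching `test_idx` with `c_statistic`, and the five resulting values passed to `mean_and_ci95`.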

| Feature Selection
We used 42 variables in this study (Supplementary file, Appendix S2).
The feature selection on the training data was performed using a sequential forward floating selection (SFFS). 24 SFFS belongs to a family of greedy search algorithms that are used to select a subset of features suitable for model building.
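The floating search can be sketched as follows, as a generic illustration and not the implementation used in the study. After each forward step that adds the best remaining feature, the sketch conditionally removes features for as long as doing so beats the best subset found so far at the smaller size. The `score_fn` callback is a hypothetical placeholder for whatever criterion is used (for example, a cross-validated C-statistic on the training data), and the target subset size `k` is an assumed parameter.

```python
def sffs(features, score_fn, k):
    """Sequential forward floating selection (generic sketch).

    features: list of candidate feature names.
    score_fn: callable taking a tuple of feature names, returning a score
              where higher is better (hypothetical; supplied by the caller).
    k: target number of features.
    Returns (best subset of size k, its score).
    """
    if not 1 <= k <= len(features):
        raise ValueError("k must be between 1 and the number of features")
    selected = []
    best_by_size = {}  # subset size -> (score, subset)
    while len(selected) < k:
        # Forward step: add the feature that gives the highest score
        remaining = [f for f in features if f not in selected]
        best_f = max(remaining, key=lambda f: score_fn(tuple(selected + [f])))
        selected = selected + [best_f]
        score = score_fn(tuple(selected))
        size = len(selected)
        if size not in best_by_size or score > best_by_size[size][0]:
            best_by_size[size] = (score, list(selected))
        # Floating step: drop features while that beats the best smaller subset
        while len(selected) > 2:
            drop = max(
                selected,
                key=lambda f: score_fn(tuple(x for x in selected if x != f)),
            )
            reduced = [x for x in selected if x != drop]
            reduced_score = score_fn(tuple(reduced))
            if reduced_score > best_by_size[len(reduced)][0]:
                selected = reduced
                best_by_size[len(reduced)] = (reduced_score, list(reduced))
            else:
                break
    score, subset = best_by_size[k]
    return subset, score
```

The backward step is what separates SFFS from plain forward selection: a feature that was chosen early can be dropped later if a better combination emerges without it.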

| Permutation importance
To provide a description of an individualized prediction made by the algorithm, we measured the permutation importance on a testing dataset. 22,25 The permutation importance was calculated by measuring how the performance of a classifier decreased when a single predictive variable was randomly shuffled. Because shuffling breaks the association between the variable and target clinical outcome, the resulting drop in performance of the classifier as measured by the area under the curve (AUC) was indicative of how much the classifier depended on the predictive variable.
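The shuffling procedure can be sketched as follows, as a minimal illustration and not the study's code. The sketch shuffles one variable at a time in the test data and records the mean drop in AUC. The `predict` callback, the dictionary-per-patient row format, and the number of repeats are assumptions introduced for illustration.

```python
import random


def auc(y_true, scores):
    """Area under the ROC curve via pairwise concordance (ties count as 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one event and one non-event")
    total = 0.0
    for p in pos:
        for n in neg:
            total += 1.0 if p > n else 0.5 if p == n else 0.0
    return total / (len(pos) * len(neg))


def permutation_importance(predict, rows, y_true, feature, n_repeats=10, seed=0):
    """Mean drop in AUC when one feature is randomly shuffled across patients.

    predict: callable mapping one row (dict of feature -> value) to a risk score
             (hypothetical stand-in for a fitted classifier).
    rows:    list of dicts, one per patient in the test set.
    """
    rng = random.Random(seed)
    baseline = auc(y_true, [predict(r) for r in rows])
    drops = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row with only the chosen feature replaced
        permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
        drops.append(baseline - auc(y_true, [predict(r) for r in permuted]))
    return sum(drops) / len(drops)
```

A variable the classifier never uses gives an importance of exactly zero, because shuffling it leaves every prediction unchanged.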

| Stepwise LR analysis
We used the logit link function of R for stepwise multivariable LR. The predictive capacity of the regression model was estimated via the mean value and 95% confidence interval for the C-statistic over the 5-fold cross-validation iterations.

| Model calibration
The performances of the RF and LR models were evaluated with calibration plots comparing the expected and actual event rates for the outcomes. The RF outputs were reconverted into posterior probabilities by fitting sigmoid functions. 26 The risk of the outcomes was calculated for each sub-interval bounded by the quintiles. A calibration slope smaller than one indicated an overestimation of the event risks for that quintile. We also evaluated the relationship between the existing clinical risk scores and the event rate. The existing risk scores were presented as a continuous score or classified into three categories (low, intermediate, and high risk) based on previous literature. [2][3][4][5][6] The high-risk event rate cutoff value was defined as the maximum event rate (mean value of the highest quintile interval) in the calibration curve of the RF model. We calculated the net reclassification improvement (NRI) index and its 95% confidence interval to assess the added value of the LR model or risk scores compared to the RF model. 27 A continuous NRI was used to compare the RF and LR, and a categorical NRI was used to compare the RF model and existing risk scores. [2][3][4][5][6]
The baseline variables are presented as the number and frequency or mean ± SD values, or the median and interquartile range. The DeLong test was used to compare the C-statistics between the models. 28 A two-tailed p value of <.05 was considered significant.
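Two of the quantities above, the quintile-based calibration table and the continuous NRI, can be sketched in a few lines of Python. This is an illustration under stated assumptions and not the authors' code. The predicted risks are assumed to be probabilities already (the sigmoid rescaling of the RF outputs is not shown), the continuous NRI uses the common category-free definition, and the function and argument names are hypothetical.

```python
def calibration_by_quintile(y_true, risk):
    """Return five (mean predicted risk, observed event rate) pairs.

    Patients are ordered by predicted risk and split into five
    equally sized groups, from lowest to highest risk.
    """
    n = len(risk)
    if n < 5:
        raise ValueError("need at least five patients")
    order = sorted(range(n), key=lambda i: risk[i])
    table = []
    for q in range(5):
        idx = order[q * n // 5:(q + 1) * n // 5]
        expected = sum(risk[i] for i in idx) / len(idx)
        observed = sum(y_true[i] for i in idx) / len(idx)
        table.append((expected, observed))
    return table


def continuous_nri(y_true, risk_ref, risk_new):
    """Category-free NRI of risk_new relative to risk_ref.

    Events should move up in predicted risk and non-events should move down:
    NRI = [P(up|event) - P(down|event)] + [P(down|non-event) - P(up|non-event)].
    """
    up_e = down_e = n_e = 0
    up_n = down_n = n_n = 0
    for y, ref, new in zip(y_true, risk_ref, risk_new):
        if y == 1:
            n_e += 1
            up_e += int(new > ref)
            down_e += int(new < ref)
        else:
            n_n += 1
            up_n += int(new > ref)
            down_n += int(new < ref)
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n
```

A well-calibrated model gives pairs that lie close to the diagonal of the calibration plot. A positive NRI means that `risk_new` moves patients in the correct direction more often than the reference model does.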

| RESULTS
The baseline characteristics of the patients are shown in Table 1. We analyzed 7406 patients with nonvalvular AF (age 69.8 ± 10.0 years, female 29.2%). A total of 6404 patients (86.5%) were taking warfarin.
The prevalences of a previous stroke or transient ischemic attack and of major bleeding were 13.8% and 4.5%, respectively. Supplemental Table S2 shows the number of patients with the thromboembolism risk scores and major bleeding risk scores divided into three categories.

| Permutation importance of the RF model
The features in the order of the permutation importance of the RF model for predicting the three types of outcomes are shown in Figure 2. For the total mortality, creatinine clearance, age, and congestive heart failure were the three main features (Figure 2(C)). The permutation importance of risk factors in the LR model is shown in the Supplementary file (Figure S1).

| Independent predictors of the stepwise LR model
The independent predictors of the three types of outcomes found by the LR model are shown in Figure 3. For thromboembolisms, of the nine predictors found, the labile INR, height, body weight, age, and strokes were predictors common to the LR and RF models, but the type of AF and the use of calcium channel blockers and beta-blockers were not picked up by the RF model (Figure 3(A)). For major bleeding, the independent predictors in the LR model not picked up by the RF model were the body weight and AF type (Figure 3(B)). For all-cause mortality, the total cholesterol, ALT, and diastolic blood pressure were independent predictors not picked up by the RF model (Figure 3(C)).

| Model calibration
In Figure Figure S2).

| Net reclassification improvement
The NRI for the outcomes between the models are presented in Supplemental     Table 1 the performance of three ML approaches (RF, gradient boosting, and neural networks) and the LR model in predicting strokes, major bleeding, and mortality, using two global AF registries (ORBIT-AF and GARFIELD-AF). The cross-registry validation revealed that the LR model had a similar or better discrimination and calibration performance for these three outcomes compared to ML. They also reported the superiority of gradient boosting among the ML models.
In our study, we showed that the discriminatory power of the RF

The high-risk event rate cutoff values were 3.0%, 4.1%, and 7.8% for thromboembolisms, major bleeding, and all-cause mortality, respectively (red shaded area). The abbreviations and categorical grouping are shown in Table 1 and the Supplementary File (Appendix S1 and Table S1). RF: random forest, LR: stepwise logistic regression

direct oral anticoagulants), comorbidities, treatment or survival rate, number of censors, 14 and tuning of the model hyperparameters.
ML models are often thought of as black boxes that take input and produce output. Interactions between the features and intermediate steps that affect output are poorly understood. The algorithm of the RF model is also a black box, but has the advantage of revealing factors (permutation importance) that contribute to improving the accuracy of the model and discovering complex interactions, even in high-dimensional environments. 25

| Study limitations
This study had many strengths, including the large number of sites and patients studied and the high quality of the clinical data collected through the registry, but had some limitations. The J-RHYTHM Registry was limited to cardiology practices that actively volunteered to participate in this nationwide registry and was not a randomized or blinded study. In this study, 86.5% of the patients were on anticoagulants, which may have confounded the models for the prediction of thromboembolisms and major bleeding. Additionally, no direct oral anticoagulants were used. This study was conducted with patients of Asian race only; therefore, outcomes may differ in other races. Although the event rates for the three endpoints were very low (6.2%), we did not consider the class imbalance. To address this problem, a technique such as the synthetic minority over-sampling technique (SMOTE) could be applied to achieve better classifier performance. 33 In this study we used the RF model and did not employ support vector machines or neural networks. The advantage of RF over support vector machines and neural networks is that RF works well for data analyses with a mixture of categorical and continuous values. In future studies, other types of ML algorithms should be tried.
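The over-sampling idea mentioned above can be illustrated with a minimal sketch of the basic SMOTE procedure. It was not applied in this study and is shown only to make the technique concrete. Each synthetic case is placed at a random point on the line between a minority-class case and one of its nearest minority-class neighbours. The sketch assumes continuous features only, and the function name, the parameters, and the use of Euclidean distance are choices made for illustration.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Generate n_new synthetic minority-class samples (basic SMOTE sketch).

    minority: list of feature vectors (lists of floats) from the minority class.
    k:        number of nearest minority neighbours to choose from.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority-class neighbours of the chosen sample
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment between base and neighbour
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic
```

Over-sampling of this kind would normally be applied to the training folds only, so that the test data keep the original event rate.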

| CONCLUSIONS
Our study showed that the RF model performed as well as or better than the LR model or existing risk scoring schemes for predicting clinical outcomes. The RF model was also able to provide information on the relative importance of individual risk factors. The RF model has the potential to be implemented clinically and improve the decision making in patients with AF.

ACKNOWLEDGEMENT
This study was planned by the Japanese Society of Electrocardiology and supported by grants from the Japan Heart Founda-

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.