A heuristic‐based hybrid sampling method using a combination of SMOTE and ENN for imbalanced health data

Health datasets often exhibit class imbalance, with healthy individuals forming the majority class and patients the minority class. Machine learning methods are frequently used for the prediction and detection of diseases, and class imbalance presents a significant challenge for machine learning classifiers, particularly in their ability to classify data from the minority and majority classes accurately and effectively. Therefore, data preprocessing is crucial before classifying imbalanced data. In this study, we present the GASMOTEPSO_ENN method, which combines the synthetic minority oversampling technique (SMOTE) and edited nearest neighbour (ENN) algorithms using genetic algorithm (GA) and particle swarm optimization (PSO) heuristics as a preprocessing method for classifying imbalanced health datasets. In the experiments, the chronic kidney disease (CKD), cerebral stroke prediction (CSP), and PIMA Indian diabetes (PID) datasets were utilized to assess the performance of the proposed method with metrics derived from the confusion matrix. The GASMOTEPSO_ENN method can classify the various diseases into the two classes of patients and healthy individuals with acceptable Matthews correlation coefficient (MCC) scores using machine learning algorithms (logistic regression (LR) 1.00 for the CKD dataset, extreme gradient boosting (XGBoost) 0.94 for the CSP dataset, and support vector machine (SVM) 0.87 for the PID dataset). Moreover, the proposed method also performed well on the other metrics across all datasets, and the analysis of the model results in relation to the existing literature reveals that the proposed model produces superior results.

Machine learning algorithms are widely used to develop classification models for early diagnosis and prediction of diseases in clinical studies (Dev et al., 2022; Ganie & Malik, 2022; Iwendi et al., 2022; Krishnamoorthi et al., 2022; Lilhore et al., 2022; Mehbodniya et al., 2022; Rahmayanti et al., 2022; Rezapour & Hansen, 2022). With such models, costly procedures in the health sector and related care costs are reduced. At the same time, they reduce physicians' workload, allowing more attention to be devoted to each patient. With the burgeoning global population, it is clear that physicians need assistant models developed with machine learning algorithms, which enable early detection and monitoring of various diseases.
Machine learning algorithms were originally designed to work with datasets characterized by complete and consistent data, assuming a balanced distribution of classes. However, within the healthcare domain, imbalanced classification presents a significant challenge. This problem arises when the dataset used to train machine learning models exhibits an uneven distribution of classes. This situation is frequently seen in medical data due to the low prevalence of some diseases. Conventional machine learning algorithms are built on the assumption of an equal number of instances for each class. Consequently, these algorithms tend to develop a bias in favour of the majority class, which typically corresponds to the healthy population. The resulting bias can considerably undermine the predictive accuracy of these models. Misclassifying individuals who are, in reality, patients as healthy can have severe health implications, potentially leading to life-threatening situations. In the realm of healthcare, where the timeliness and precision of diagnoses are of paramount importance, addressing the challenge of imbalanced classification becomes even more critical. Numerous scholars have presented techniques for addressing imbalanced classification. Various oversampling techniques have been devised to strengthen class boundaries, mitigate the risk of overfitting, and improve discrimination. Chawla et al. (2002) proposed the synthetic minority oversampling technique (SMOTE), which generates synthetic minority samples by interpolating between pre-existing minority samples and their nearest minority neighbours. The success of this method in generating synthetic samples has led to the development of several variants, including SMOTE-Tomek (Batista et al., 2004), SMOTE-ENN (Batista et al., 2004), Borderline-SMOTE (Han et al., 2005), SVM-SMOTE (Nguyen et al., 2011), and KMeans-SMOTE (Douzas et al., 2018).
The existing research literature on imbalanced health data identifies significant research gaps that call for further investigation (Fotouhi et al., 2019). These gaps underscore the need for exploring effective data pre-processing techniques tailored to address rare medical conditions.
They also stress the importance of developing and evaluating machine learning methodologies specifically designed to overcome the challenges posed by imbalanced health data. This study endeavours to introduce a new sampling method specifically crafted to tackle the issues encountered in imbalanced health datasets. This paper holds substantial importance by addressing the challenges of imbalanced health datasets through the introduction of an innovative sampling method. The primary goal of this research is to significantly contribute to the early detection and effective treatment of diseases. By extensively exploring the impact of unevenly distributed classes on the performance of machine learning classification methods, the study aims to enhance the accuracy and speed of diagnoses within the healthcare domain. To address the challenges posed by imbalanced datasets, we propose a hybrid sampling method named GASMOTEPSO_ENN. This method combines SMOTE oversampling with genetic algorithm (GA) and particle swarm optimization (PSO) heuristics, along with the edited nearest neighbour (ENN) undersampling method. Subsequently, we evaluate the performance of the model on imbalanced health datasets within a binary-class scenario. The main contributions of the proposed system in this paper are as follows:
• SMOTE utilizes the class imbalance ratio (IR) as the basis for determining the sampling rate necessary for synthesizing samples in the minority class. When the IR exhibits high values within a given dataset, certain minority samples may be artificially synthesized numerous times, potentially resulting in sample overlap or overfitting (Raghuwanshi & Shukla, 2020). In this study, we develop a hybrid sampling strategy that combines the SMOTE technique, which has been successfully used in the literature, with heuristic methods for balancing imbalanced datasets.
• The GA is used to overcome the disadvantages of the SMOTE method that stem from using a single sampling rate, such as overlapping samples and overfitting. To balance the imbalanced classes in the dataset, the GA heuristic is used to identify a distinct optimal sampling rate for synthesizing each new sample generated within the minority class.
• The utilization of the support vector machine (SVM) classifier has proven to be efficacious in addressing binary classification problems. In instances where the dataset possesses imbalanced classes or a high IR value, the generalization performance of the classifier is significantly impacted by this factor. In order to address this issue, the SMOTE approach, which leverages the PSO heuristic, is employed to generate novel artificial samples within the region known as the margin. The PSO algorithm is employed to guide the search for synthetic samples that have the potential to enhance the performance of the SVM.
• The SMOTE algorithm is combined with the PSO and GA heuristics for the synthesis of new minority class instances. However, these synthetic samples may have noisy or invalid samples that negatively affect classifier performance. To eliminate such samples, we propose a novel hybrid sampling technique, which integrates the GASMOTE, SMOTE-PSO, and ENN methods. This study utilizes two methods, namely, SMOTE-PSO and GASMOTE, to generate new samples of the minority class. The ENN method is employed to eliminate noise samples from the dataset.
• The proposed method has had a notable impact on enhancing the effectiveness of machine learning classifiers such as logistic regression (LR), random forest (RF), support vector machine (SVM), decision tree (DT), Naive Bayes (NB), and extreme gradient boosting (XGBoost) for accurately distinguishing between patients and healthy individuals within imbalanced health datasets.
The rest of this paper is organized as follows. In Section 2, we briefly introduce studies on the application of sampling methods for imbalanced datasets in the prediction and classification of diseases. In Section 3, we summarize background information, including a description of datasets, data preprocessing, imputation methods for missing values, sampling methods for class imbalance, heuristic methods, machine learning algorithms for classification, and model evaluation metrics. Section 4 introduces the heuristic-based hybrid sampling method that combines the oversampling method SMOTE and the undersampling method ENN, using the GA and PSO heuristics that we propose to balance the imbalanced health dataset.
Section 5 presents the experimental procedure and comparative study. Finally, the paper is concluded in Section 6.

| RELATED WORK
Developing a proficient learning model utilizing machine learning approaches on datasets characterized by an imbalanced class distribution represents a difficult challenge. The problem of imbalanced class distribution can frequently be observed in practical classification tasks like spam detection (Li & Liu, 2018), disease detection (Chakraborty et al., 2021; Fernandes et al., 2019; Fujiwara et al., 2020; Li et al., 2017; Ramaswamy & Mukherjee, 2020; Sonak et al., 2016; Tallo & Musdholifah, 2018), credit card fraud detection (Dal Pozzolo et al., 2014; Wei et al., 2013), and cyber security (Bagui & Li, 2021). Generally, a serious class imbalance problem is encountered as a result of the low incidence rates of some diseases in health datasets. Given the prevalence of class imbalance in real-world applications, researchers have proposed diverse approaches to address this issue. Sailasya and Kumari (2021) conducted a comparative analysis of multiple machine learning algorithms with the aim of predicting stroke occurrence based on a multitude of physiological characteristics. The dataset's missing values were imputed using the mean value of the respective column, while the imbalanced class distribution was balanced through the random undersampling method. In this experimental study, the LR, DT, RF, k-nearest neighbour (kNN), SVM, and NB algorithms were used to classify patients and healthy individuals.
Based on the empirical findings, the most effective algorithm was found to be the NB algorithm, which had an accuracy rate of approximately 82%. Additionally, with regard to the precision, recall, and F1-scores attained, the NB classifier exhibited superior performance. Rana et al. (2021) proposed advanced machine learning and deep learning techniques for the prediction of stroke. In the experimental study, an analysis was conducted to compare the efficacy of a proposed model against various other machine learning algorithms, including those based on ensembles, DT, and NB. According to the experiments, the proposed model using an artificial neural network (ANN) gave the best performance, with a receiver operating characteristic (ROC) score of 0.84, compared to other studies. Al-Zubaidi et al. (2022) proposed a supervised model using different machine learning classification methods. The SMOTE technique is utilized for rebalancing the imbalanced stroke dataset, while simple mean imputation is used to fill the missing values within the dataset. To test the proposed model on the dataset, LR, RF, SVM, DT, and Voting classifiers were used. Based on the experimental results, the RF algorithm showed a superior accuracy of 94.7% compared to other machine learning models. The RF algorithm also demonstrated superior performance in terms of precision, recall, and F1-score. Liu et al. (2019) proposed a hybrid machine learning approach aimed at forecasting the occurrence of cerebral stroke in a medical dataset that is incomplete and imbalanced. This method proposes a novel approach utilizing automated hyperparameter optimization (AutoHPO) for addressing the challenge of an imbalanced dataset. The proposed approach comprises two primary stages, namely, imputing missing values in the dataset followed by hyperparameter optimization. During the initial phase, the RF algorithm is employed to impute missing values. In the ensuing phase, AutoHPO employing a deep neural network (DNN) is utilized for stroke prediction on the cerebral stroke prediction (CSP) dataset. According to the experimental results, the model has demonstrated remarkable accuracy in predicting stroke, thus diminishing the false negative (FN) rate. In addition, the probability of stroke has been estimated at a low cost with a reliable and valid reference. Wang and Cheng (2021) proposed a novel methodology for addressing imbalanced medical datasets through multiple combined methods. The study employs SMOTE for oversampling, random undersampling (RUS) for undersampling, PSO for feature selection, and the MetaCost algorithm for accurate classification. In the experimental study, a total of nine medical datasets were utilized to establish, corroborate, and contrast the suggested approach. Based on the empirical findings, it has been ascertained that the suggested approach has the potential to significantly enhance class imbalance performance. This study's experimental results were utilized to assess the efficacy of our hybrid sampling model on the chronic kidney disease (CKD) dataset within the framework of our comparative investigation. Mienye and Sun (2021) improved cost-sensitive machine learning algorithms suited to imbalanced classification problems in order to balance imbalanced medical data. In the proposed model, the minority class learning is improved by applying more heavily weighted penalties to wrong predictions of the minority class, and the model is thereby shown to perform better than the standard algorithms. In the experimental study, the proposed model is evaluated with model evaluation metrics such as precision, recall, F-measure, and ROC. According to the experimental results, the XGBoost algorithm performed comparably to other classifier algorithms (LR, DT, and RF) on the PIMA Indian diabetes (PID) dataset.
Better performance was obtained from studies in the literature using the Haberman Breast Cancer, Cervical Cancer Risk Factors, and CKD medical datasets. The results of this study (Mienye & Sun, 2021) and the related studies (Chittora et al., 2021; Maulidevi & Surendro, 2022; Pranto et al., 2020) were compared to evaluate the performance of our hybrid sampling model on the CKD and PID datasets. Gan et al. (2020) proposed an integrated TANBN with a cost-sensitive classification algorithm, namely, AdaC-TANBN, as a means of enhancing classification performance for imbalanced medical diagnosis data. In the experimental studies, the Cleveland heart, Indian liver patient, Dermatology, and Cervical cancer risk factors datasets from the UCI learning repository were used to evaluate the performance of this model. Based on the empirical findings, the model exhibited higher efficacy than the other approaches. In addition, the experimental results indicate that the model can help clinicians make efficient decisions. Bhattacharya et al. (2020) proposed a preprocessing strategy that removes most of the data-quality challenges within an existing multimodal stroke dataset. The preprocessing procedure involves several steps, such as modifying and standardizing the unprocessed data within the dataset, eliminating irrelevant features, addressing incomplete data, and rectifying imbalanced class distributions. Furthermore, the Antlion optimization (ALO) algorithm is proposed as a means of efficiently determining optimal hyperparameters for a DNN with minimal computational overhead. A detailed comparison has been conducted between the performance of the suggested model and other commonly utilized classification models. Based on the findings obtained through experimentation, the proposed model proved to be more efficient in terms of training time and surpassed other methods in terms of performance. Xu et al. (2020) proposed a novel approach, namely, the misclassification synthetic minority oversampling technique, which seeks to address the limitations inherent in the SMOTE method with respect to sample generation. This method integrates the advantages of the RF algorithm with the SMOTE and ENN algorithms to sample imbalanced data. In experimental studies, the performance of the proposed method and conventional algorithms on ten datasets taken from the UCI learning repository was compared according to the model performance criteria.
The findings of the experiment indicate that the proposed methodology exhibited superior performance compared to conventional algorithms and other resampling methods in the domain of medical diagnosis. Devarriya et al. (2020) proposed two new fitness functions (F2-score and Distance score) to overcome the imbalanced breast cancer data classification problem. Performance results were obtained by comparing the D-score and F2-score with other comparison algorithms in the literature. In experimental studies, the performance of the proposed metrics on the Wisconsin Breast Cancer dataset taken from the UCI learning repository was compared to other existing model performance metrics. Based on the empirical findings, the suggested metrics exhibited superior performance in comparison to all alternative methods for addressing the classification issue relating to imbalanced datasets.
Within the scope of this study, a comprehensive summary table has been created based on the literature review, presenting the advantages and disadvantages of the conducted studies. This detailed summary is provided in Table 1. To solve the class imbalance problem, we propose a heuristic-based hybrid sampling method combining the SMOTE oversampling and the ENN undersampling methods using the PSO and the GA heuristics. Since the SMOTE method and its derivatives have been used successfully in many studies and the method is very convenient to integrate into a new method, we use it to resample minority class samples. Also, since the ENN algorithm successfully removes noisy samples from the dataset, we use this method as an undersampling technique. Since nature-inspired optimization techniques and algorithms have been successfully used in solving real-time problems, we combine these methods with the SMOTE algorithm to overcome the class imbalance problem encountered with the random oversampling (ROS) and RUS methods. Thus, inspired by the existing studies and optimization algorithms in the literature, we design a heuristics-based hybrid sampling method to balance imbalanced health datasets. We use the CKD, CSP, and PID datasets to test the proposed model in binary classification and evaluate its performance with 6 machine learning classification algorithms against model performance metrics.

| DATA AND METHODS
This section presents an overview of the datasets utilized in the research study, the techniques employed in data preprocessing, the devised hybrid heuristic sampling methodology, the machine learning classification algorithms implemented, and the metrics utilized for assessing the efficacy of the proposed approach.

| Dataset description
In this section, the CKD, CSP, and PID datasets used in the experimental studies are discussed. These datasets were selected because they concern some of the leading causes of death worldwide according to the report published by the World Health Organization (WHO) in 2020 (WHO, 2020). These datasets are suitable for the problems discussed in this study because they have both imbalanced class distributions and missing values. Furthermore, detailed descriptions of the aforementioned datasets are provided in the respective subsections. Missing values in the dataset were not taken into account.

TABLE 1 Advantages and disadvantages of related studies on imbalanced medical data.

Liu et al. (2019). Method: hybrid approach (RF, AutoHPO, and DNN). Advantages: addressed the imbalanced dataset using AutoHPO and RF. Disadvantages: the two-stage process may increase complexity.

Wang and Cheng (2021). Method: hybrid approach (SMOTE, RUS, PSO, MetaCost). Advantages: the recommended method efficiently tackles imbalanced datasets, allowing for quick diagnosis and treatment, potentially saving costs and time. Disadvantages: combining techniques may increase complexity, posing interpretability challenges in understanding complex models.

Mienye and Sun (2021). Method: improved cost-sensitive ML algorithms, XGBoost. Advantages: cost-sensitive algorithms excel in accuracy, precision, recall, and AUC, enhancing minority class focus and overall performance. Disadvantages: the cost-sensitive approach may increase majority class mispredictions and model complexity, reducing interpretability.

Gan et al. (2020). Method: AdaC-TANBN. Advantages: AdaC-TANBN improves performance on imbalanced medical datasets by adjusting misclassification costs, prioritizing the minority class for enhanced overall performance. Disadvantages: limited dataset use and computational intensity necessitate further research for a comprehensive understanding of AdaC-TANBN's applicability and performance.

Bhattacharya et al. (2020). Method: Antlion + DNN. Advantages: efficient model with shorter training time and superior performance compared to other methods. Disadvantages: limited datasets and computational intensity highlight the need for further research for broader applicability and comprehensive comparisons.

Xu et al. (2020). Method: RFMSE. Advantages: the dynamic adjustment of over-sampling rates during hybrid sampling in RFMSE is a promising approach, contributing to better outcomes in handling imbalanced medical datasets.

Devarriya et al. (2020). Method: new fitness functions (F2-score, Distance score). Advantages: DGP and F2GP excel in breast cancer classification, outperforming traditional methods with outstanding accuracy on imbalanced datasets. Disadvantages: DGP and F2GP need validation on diverse datasets; GP tree intricacies may pose challenges in complex scenarios, requiring further study.

| Chronic kidney disease dataset
The CKD dataset from the Kaggle website (Rubini & Eswaran, 2015) was used in this study to predict chronic kidney disease. The dataset has a total of 25 attributes, including the class label, containing blood tests and other measurements from patients with and without CKD.
This dataset consists of a total of 400 records: 150 healthy individuals and 250 patients. It is an imbalanced dataset with an IR of approximately 1.66. The 25 features that contribute to the prediction of CKD are as follows: age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, bacteria, blood glucose random, blood urea, serum creatinine, sodium, potassium, haemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, anaemia, and the classification label.
Also, the dataset contains many examples where certain features are missing.

| Cerebral stroke prediction dataset
The CSP dataset retrieved from the Kaggle platform (Tianyu Liu & Fan, 2019) comprises 5110 instances and 12 attributes. The dataset is characterized by an imbalanced distribution, with stroke events accounting for 4.8% (249 instances) of the 5110 recorded samples. The dataset is utilized for predicting the likelihood of a stroke occurring in a patient based on the input parameters. The 12 attributes that contribute to the prediction of stroke are as follows: id, gender, hypertension, heart disease, marriage status, work type, residence type, smoking status, age, body mass index (BMI), average glucose level, and stroke. Missing values occur in 30% of the smoking status feature and 3.9% of the BMI feature.

| PIMA Indian diabetes dataset
The PID dataset from the Kaggle website (Smith et al., 1988) comprises samples exclusively from females with a minimum age of 21 years. The PID dataset comprises a total of 768 observations, of which 268 have been identified as diabetic and the remaining 500 as nondiabetic. The dataset is imbalanced, with an IR of approximately 1.9. The dataset encompasses a distinct set of diagnostic parameters and metrics that enable the pre-identification of patients having a chronic disease or diabetes. There are 8 influential features that make a noteworthy contribution to predicting diabetes: number of pregnancies, BMI, insulin level, age, blood pressure, skin thickness, glucose, and DiabetesPedigreeFunction. Meanwhile, the PID dataset is devoid of explicit instances of missing values. However, certain measurements pertaining to biological attributes take a value of zero. Such incorrect measurements have the potential to adversely affect machine learning algorithms. Hence, we replace these zero values with "nan" and treat them as missing. The resulting missing values affect 0.6% of the glucose, 4.5% of the blood pressure, 29.5% of the skin thickness, 48.6% of the insulin, and 1.4% of the BMI features.
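As a minimal sketch of this preprocessing step (the column names follow the common Kaggle release of the PID dataset and the file path is an assumption), zero readings in the physiologically implausible columns can be converted to missing values with pandas:

```python
import numpy as np
import pandas as pd

# Columns where a value of zero is physiologically implausible (assumed Kaggle column names).
zero_invalid_cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

pid = pd.read_csv("diabetes.csv")  # PID dataset from Kaggle; the path is an assumption
pid[zero_invalid_cols] = pid[zero_invalid_cols].replace(0, np.nan)

# Share of missing values per column, comparable to the percentages reported above.
print((pid[zero_invalid_cols].isna().mean() * 100).round(1))
```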

| Data preprocessing
Data preprocessing is the step in which data is cleaned, transformed, and integrated to prepare the dataset for the specific data mining task. The efficacy of machine learning techniques is contingent upon a multitude of factors, including the quantity, quality, and variety of data. Real-world datasets can often contain incomplete, inconsistent, or redundant data. Therefore, it is very important to address these issues, along with many other issues, as the process of data preprocessing in data mining is known to exert a notable influence on the overall generalization efficacy of machine learning techniques. In the data preprocessing phase of this study, the multiple imputation using chained equations (MICE) method will be adopted for the missing data problems in the datasets, and the inter-quartile range (IQR) method will be adopted for the detection of outliers.
In addition, a new heuristic-based hybrid sampling approach will be proposed to balance the data with the imbalanced class distribution.

| Imputation methods for missing values
There exist numerous methods employed to address the issue of absent values within a dataset.The simplest of these methods is the deletion method, in which rows or columns containing missing values are deleted.Removing samples with missing values in datasets with few samples can result in a lot of data loss.Simple assignment methods are proposed to load the data with missing values instead of deleting them.In these methods, samples with missing values are filled with values such as mean, mode, and median.The MICE method is a successfully used missing value assignment method.The MICE method constitutes a specific technique for multiple-imputation interventions that effectively address instances of missing data values within a given dataset.The present imputation strategy involves the utilization of iterative predictive regression models to substitute the missing values present in the dataset (Azur et al., 2011).At each iteration, the imputation of each specified variable in the dataset is performed by utilizing other variables present in the dataset.The process of iteration ought to be persistently pursued until the attainment of convergence pertaining to the regression coefficients (Azur et al., 2011).This model has been designed based on the characteristic features of the individual variables.For example, logistic regression is employed for the purpose of modelling binary variables, whereas predictive mean matching is specifically employed for continuous variables (Azur et al., 2011).
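For illustration, scikit-learn's IterativeImputer performs chained-equations-style imputation in the spirit of MICE (note that it produces a single imputation rather than multiple imputations); a minimal sketch on a toy matrix:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Toy feature matrix with missing entries (np.nan), standing in for a health dataset.
X = np.array([
    [25.0, 120.0, np.nan],
    [47.0, np.nan, 1.2],
    [np.nan, 140.0, 2.3],
    [61.0, 135.0, 1.9],
])

# Each feature with missing values is modelled as a function of the other features,
# and the procedure iterates until the imputations stabilise.
imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```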

| Sampling methods for imbalanced dataset
Class imbalance is a problem that arises when the class frequencies of a dataset are heavily skewed. In many real-world applications, the classes of a dataset are distributed such that at least one data class has fewer samples (the minority class) than the other classes (the majority classes) (Tanha et al., 2020). The degree of class imbalance is typically measured using the IR, a metric derived from the relative proportion of samples belonging to the majority and minority classes (Stamatatos, 2008). Specifically, the total size of the majority class is divided by that of the minority class to obtain a numerical ratio. The IR is defined as the quotient of the sample size of the majority class X_maj to that of the minority class X_min. The IR is given in Equation (1).

IR = X_maj / X_min  (1)
As the frequency of observations belonging to the minority class decreases, the class imbalance ratio increases. This phenomenon consequently leads to challenges such as suboptimal model performance, overfitting, or underfitting when utilizing machine learning algorithms. Data preprocessing is a widely used approach to mitigate the effects of a high class imbalance ratio (Nnamoko & Korkontzelos, 2020). To this end, researchers have proposed methodologies to address the issue of class imbalance in datasets, which encompass data-level methods, algorithm-level approaches, and hybrid solutions.
In this study, we focus on data-level approaches to address the issue of imbalanced class distribution within the dataset. Data-level techniques are designed to prepare the dataset for machine learning algorithms by restoring balance to the data distribution. These methods are categorized into three major techniques: undersampling, oversampling, and hybrid sampling. In their simplest forms, the RUS approach involves the elimination of random samples from the majority class, while the ROS approach involves the duplication of random samples from the minority class (Van Hulse et al., 2007). Because the sample sizes of the minority and majority classes become equal, a state of balance is obtained. While the RUS method effectively reduces the training duration by using a subset of the majority class, this strategy may discard important information housed within the excluded majority class samples. In this case, the removal of random samples from the majority class has the potential to significantly alter the distribution of samples within that class, thereby affecting its representative characteristics (Ahmed et al., 2017). Consequently, there is a considerable risk that machine learning classification algorithms misclassify majority samples, ultimately resulting in diminished classifier performance. The ROS methodology does not incur information loss and exhibits superior performance relative to the RUS methodology (Çürükoğlu, 2019). However, the ROS technique, which randomly replicates samples from the underrepresented class to address the class imbalance, may lead certain machine learning classification algorithms to become biased towards the duplicated data. Consequently, the model is confronted with the challenge of overfitting, which hinders its ability to produce generalized classifications of test samples (Ahmed et al., 2017). The ROS and RUS methodologies thus come with a set of strengths and limitations. In response to the limitations associated with indiscriminate ROS and RUS, novel oversampling and undersampling approaches have been introduced by researchers. Wilson (1972) proposed the ENN undersampling method, which modifies the majority class by removing some of the majority samples from the original majority class. This method uses the kNN algorithm to remove samples whose class label differs from that of the majority of their neighbours. The number of nearest neighbours used in the ENN method is set to three by default. With this method, majority samples whose labels differ from those of their three nearest neighbours are removed. Conversely, if a minority sample has a label different from its three nearest neighbours, the three neighbours are removed. Chawla et al. (2002) proposed a novel oversampling method called SMOTE. The SMOTE method is an oversampling method (Chawla et al., 2002) that is often used to deal with data where class imbalance is common, such as medical data (Blagus & Lusa, 2015). SMOTE generates new samples by interpolation at a random point on the line connecting a positive sample and one of its nearest neighbours (Chawla et al., 2002). A new synthetic sample is created according to Equation (2) as a linear combination of two samples from the minority class (X_i and X_j).
X_new = X_i + α × (X_j − X_i)  (2)

where X_new is the new synthetic sample. A random sample from the minority class (X_i) is selected to create a new synthetic sample. Then X_j is taken randomly from among the k neighbouring minority class samples of X_i, based on the Euclidean distance. The value α is a random float chosen in the range (0, 1) (Demidova & Klyueva, 2017). Although the SMOTE method exhibits superior performance when compared to traditional oversampling techniques, it may give rise to distinct complications, including the creation of overlapping, boundary, and noise samples. This method has been used successfully in many studies, and many SMOTE extensions have been suggested in the literature (Fernández et al., 2018).
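For illustration, the imbalanced-learn library provides off-the-shelf implementations of SMOTE, ENN, and their combination; the following sketch on synthetic data shows these standard building blocks (not the heuristic-guided variant proposed in this paper):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours
from imblearn.combine import SMOTEENN

# Synthetic imbalanced binary dataset (roughly 9:1 imbalance ratio).
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("original:", Counter(y))

# Plain SMOTE oversampling followed by ENN noise removal (k = 3 neighbours by default in ENN).
X_sm, y_sm = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
X_enn, y_enn = EditedNearestNeighbours(n_neighbors=3).fit_resample(X_sm, y_sm)
print("after SMOTE + ENN:", Counter(y_enn))

# Or the combined estimator in one step.
X_comb, y_comb = SMOTEENN(random_state=42).fit_resample(X, y)
print("SMOTEENN:", Counter(y_comb))
```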
Approaches that combine the advantages of undersampling and oversampling methods are called hybrid approaches. Heuristic approaches are employed as alternative methods in the examination of health-related datasets due to their proven efficacy in addressing a multitude of issues (Nizam Ozogur & Orman, 2022). In this study, to overcome the class imbalance problem, we propose a heuristic-based hybrid sampling method combining the SMOTE oversampling method and the ENN undersampling method using the PSO and the GA heuristics. The PSO and the GA methodologies are explicated in Section 3.3.

| Outlier detection methods
Outliers are data points that deviate significantly from the typical values within a dataset or population. These outliers can have negative impacts on statistical analysis in several ways. First, they tend to introduce more variability and increase the margin of error, thereby weakening the statistical power of tests and making it harder to detect meaningful patterns or relationships. Second, if outliers are not randomly distributed, they can violate the assumption of normality, making it challenging to apply certain statistical techniques that rely on this assumption. Finally, outliers can heavily influence estimates or results, potentially leading to biased or misleading conclusions, especially when they represent extreme values that are far from the majority of the data points. Most machine learning algorithms do not work well in the presence of outliers, as they can produce statistically erroneous results. Therefore, it is very important to detect and address outliers before model development. Various outlier detection methods are used to identify and process these exceptional data points, including statistical methods, visualization techniques, and machine learning algorithms. In general, the IQR proximity rule, a statistical method, is used to detect and remove outliers in imbalanced datasets. The IQR measures variability by dividing a dataset into quartiles Q1, Q2, and Q3, which are the 25th, 50th, and 75th percentiles of the dataset, respectively. The interquartile range is calculated as IQR = Q3 − Q1.
Using this method, a lower bound and an upper bound are defined for the detection of outliers. Equation (3) is used to calculate the upper bound, and Equation (4) is used to calculate the lower bound.

Upper bound = Q3 + 1.5 × IQR  (3)
Lower bound = Q1 − 1.5 × IQR  (4)

Based on these values, any data point below the lower bound or above the upper bound is considered an outlier.
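As a small sketch of this rule (using the conventional 1.5 × IQR multiplier assumed in Equations (3) and (4)), the bounds can be computed per feature with NumPy:

```python
import numpy as np

def iqr_bounds(feature: np.ndarray, factor: float = 1.5):
    """Return (lower, upper) outlier bounds for one feature using the IQR proximity rule."""
    q1, q3 = np.percentile(feature, [25, 75])
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

values = np.array([3.1, 2.9, 3.0, 3.2, 2.8, 9.7])  # toy feature with one extreme value
low, high = iqr_bounds(values)
outliers = values[(values < low) | (values > high)]
print(low, high, outliers)  # 9.7 is flagged as an outlier
```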

| Heuristic methods
Heuristic methods are algorithms, often inspired by natural phenomena, that can handle a wide range of problems quickly and efficiently. These methods aim to find good or nearly optimal solutions for optimization problems, although they cannot guarantee an exact solution. Consequently, a heuristic solution ought to be regarded as effective rather than optimal. Various classifications of heuristic methods exist, including but not limited to biology-based, physics-based, swarm-based, social-based, music-based, chemistry-based, sport-based, mathematics-based, and hybrid methods incorporating elements from multiple categories (Nizam Ozogur & Orman, 2022). In this section, the GA, a biology-based algorithm, and the PSO, a swarm-based algorithm, are explained.

| Genetic algorithm
The GA is an evolutionary algorithm that is based on principles of natural selection and the evolutionary theory proposed by Charles Darwin (Mathew, 2012). The GA functions by preserving a group of potential solutions and progressively transforming them through iterations. During every iteration, parents are chosen from the existing population, and their progeny are designated as the next generation's offspring. It is common for the most capable members of a population to engage in reproduction and transmit their genetic traits to future generations. As a result, the quality of solutions can be improved through this process. It is crucial to acknowledge that individuals who possess less optimal traits may also achieve survival and reproductive success by mere chance. The utilization of the GA has become prolific due to its efficacy in addressing intricate optimization problems across various domains, rendering it a desirable and applicable approach.

| Particle swarm optimization
The PSO is derived from the phenomenon observed in natural ecosystems whereby flocks of birds, schools of fish, and herds of animals demonstrate an ability to effectively respond to environmental changes, locate rich food sources, and avoid predators through the use of information-sharing techniques (Kennedy & Eberhart, 1995). The PSO algorithm utilizes a population-based approach in which each individual member is referred to as a particle, and the entire group is considered a swarm. Each member of the collective of particles is initially assigned a random state upon initialization and utilizes past personal and collective experience to dynamically adjust the scope of the search space in pursuit of the optimal position within the swarm.
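The standard PSO velocity and position updates are not written out in this paper; the sketch below uses the conventional inertia and acceleration parameters (w, c1, c2), which are assumptions here, to show how particles move toward their personal and global best positions on a toy objective:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere(x):
    """Toy objective to minimise (sum of squares)."""
    return float(np.sum(x ** 2))

dim, n_particles, iters = 2, 20, 100
w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (conventional values)

pos = rng.uniform(-5, 5, (n_particles, dim))
vel = np.zeros_like(pos)
pbest = pos.copy()
pbest_val = np.array([sphere(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for _ in range(iters):
    r1, r2 = rng.random((n_particles, dim)), rng.random((n_particles, dim))
    # Velocity update: inertia + attraction to personal best + attraction to global best.
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    pos = pos + vel
    vals = np.array([sphere(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()

print("best position:", gbest, "objective:", sphere(gbest))
```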

| Classification
In this section, we describe the machine learning classification algorithms employed in the study to predict and diagnose diseases.

| Logistic regression
The LR classifier is a widely used statistical approach that allows multivariate analysis and modelling of a binary dependent variable (Tu, 1996). The multivariate analysis obtains coefficient estimates for all predictors incorporated in the final model, with each estimate adjusted for the remaining predictors present in the model. The coefficients evaluate each predictor's contribution to the outcome risk estimate (Zou et al., 2019). Within the field of medicine, this methodology may be employed for prognostication pertaining to the likelihood of disease among a designated cohort.

| Random forest
The RF algorithm is a supervised machine learning methodology that belongs to the ensemble technique category. This ensemble learning method uses bootstrapping, averaging, and bagging to train multiple decision trees (Wang et al., 2023). The RF classifier produces its final output by aggregating the decisions of many individual decision trees. In the context of classification tasks, the outcome of the algorithm is determined by selecting the class that is most frequently identified across the trees (Zidi et al., 2023).

| Support vector machine
The SVM is an extensively utilized supervised learning technique that was initially introduced for binary classification tasks by Cortes and Vapnik in 1995 (Cortes & Vapnik, 1995). The SVM employs an algorithmic procedure to identify the optimal line capable of accurately categorizing data points into one of two defined classes. This optimal line is referred to as a hyperplane (Tanveer et al., 2022). The SVM exhibits remarkable computational efficiency, making it effective even in non-linearly separable scenarios. This is achieved through the kernel trick, which facilitates the construction of the classifier without requiring explicit awareness of the feature space (Cristianini & Shawe-Taylor, 2000).

| Decision tree
The DT algorithm is a supervised machine learning method that segregates data samples into distinct categories through predetermined criteria in the decision-making process (Lee et al., 2022). A decision tree is a graph with a tree structure that has a single root node with directed connections to child nodes, which might in turn have branches to other nodes (Rokach & Maimon, 2005). The first cell of the decision tree is called the root or root node. Each observation is classified as "yes" or "no" according to the condition at the root.
Below the root are internal nodes, and each of these nodes denotes a test on an attribute. The root node is connected to the internal nodes through branches. Each individual branch symbolizes an outcome produced by the test, whereby each observation is assessed and assigned a binary value of either "yes" or "no". At the bottom of the decision tree are the leaves, which represent the result of the test.

| Naive Bayes
The NB algorithm is a straightforward machine learning method that utilizes Bayes' theorem, incorporating a strong assumption of conditional independence among the features given a specific class (Vembandasamy et al., 2015). The algorithm is employed to make predictions regarding the class labels of test samples. This is achieved by identifying the maximum posterior probability that corresponds to each label (Kim & Lee, 2022).

| Extreme gradient boosting
The XGBoost algorithm was developed by Chen and Guestrin (2016) as a research project at the University of Washington. The XGBoost framework is a supervised machine learning tool that utilizes gradient-boosted decision trees. The approach entails an ensemble learning technique that combines the predictions of several low-performing models, resulting in a strong overall prediction.
XGBoost, also known as "Extreme Gradient Boosting", has gained widespread popularity owing to its exceptional performance in multiple machine learning tasks, including classification and regression.
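As a brief illustration of how the six classifiers used in this study can be instantiated with scikit-learn and the xgboost package (the hyperparameters shown here are library defaults or placeholders, not the tuned values from the paper), a sketch:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier

# The six classifiers evaluated in the experiments; hyperparameters here are defaults.
classifiers = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(random_state=42),
    "SVM": SVC(probability=True, random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "NB": GaussianNB(),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}
```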

| Performance evaluation
The present study employs a range of metrics to evaluate the proposed model's performance on the dataset. These include accuracy (Acc), precision (P), recall (R), F1-score (F1), Matthews correlation coefficient (MCC), and the receiver operating characteristic (ROC) curve. The evaluation metrics are computed by utilizing the values derived from the confusion matrix presented in Table 2.
The confusion matrix is an analytical tool widely employed for assessing the performance of a classification model, often used for evaluating its predictive accuracy. Within such a matrix, TP and FP denote the number of true positives and false positives, respectively, while TN and FN denote the number of true negatives and false negatives. In datasets with class imbalance, the minority class observations are treated as the positive class, while the majority class observations are treated as the negative class. Accuracy is the ratio of correct predictions to the overall number of predictions generated. Precision is mathematically represented as the ratio of true positives to the sum of true positives and false positives. Recall, also known as the true positive rate (TPR), refers to the proportion of actual positives that are correctly identified as positive. Hence, this particular metric holds significant importance in scenarios where the dataset exhibits an imbalanced class distribution, with the target class representing a minority class.
Specificity, also known as the true negative rate (TNR), represents the proportion of TN cases in relation to all negative outcomes. The F1-score is a metric utilized in classification tasks that aggregates both the precision and recall of a classifier into a single value; this is accomplished by calculating the harmonic mean of these two measures. The MCC is a performance metric widely employed in machine learning for the evaluation of binary and multiclass classification models. The MCC considers all four components of the confusion matrix, thereby providing an inclusive and reliable indicator of classification quality. The ROC curve is a graphical representation that illustrates the performance of a model in terms of its TPR and false positive rate (FPR). The area under the ROC curve (AUC) summarizes this two-dimensional curve in a single value; the closer the AUC value is to 1, the better the performance of the classifier. The relational expressions for these metrics are given in Equations (5), (6), (7), (8), (9), and (10).

Acc = (TP + TN) / (TP + TN + FP + FN)  (5)
P = TP / (TP + FP)  (6)
R = TP / (TP + FN)  (7)
Specificity = TN / (TN + FP)  (8)
F1 = 2 × (P × R) / (P + R)  (9)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))  (10)
TABLE 2 Confusion matrix for two-class classification.

                    Predicted positive    Predicted negative
Actual positive     TP                    FN
Actual negative     FP                    TN
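For reference, these metrics can be computed directly from predictions with scikit-learn; a small sketch with made-up labels purely for illustration:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef, roc_auc_score,
                             confusion_matrix)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # illustrative ground-truth labels (1 = patient)
y_pred = [1, 0, 1, 0, 0, 0, 1, 1]                    # illustrative hard predictions
y_score = [0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6]   # illustrative predicted probabilities

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP, FP, TN, FN:", tp, fp, tn, fn)
print("Acc:", accuracy_score(y_true, y_pred))
print("P:", precision_score(y_true, y_pred))
print("R:", recall_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))
print("MCC:", matthews_corrcoef(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_score))
```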

| THE PROPOSED METHODOLOGY BASED ON HYBRID HEURISTIC SAMPLING
In this section, we propose a framework for a heuristic-based hybrid sampling method, with the aim of addressing the challenge of imbalanced health data. The framework efficaciously integrates undersampling and oversampling methodologies by incorporating heuristics, thereby accomplishing a balanced class distribution. The sampling system framework, depicted in Figure 1, comprises four primary parts, namely, dataset preprocessing, heuristics-based data oversampling, data undersampling, and a machine learning classifier model. The health diagnosis and classification process of the proposed framework is as follows:
1. Preprocessing step. In this step, missing values are detected in the dataset and invalid values are marked as missing values. We recommend using the MICE algorithm introduced in Section 3.2.1 to fill in these identified missing values. The MICE algorithm is a specific multiple-imputation technique for missing value imputation and takes into account the relationships between features; thus, it can be efficiently applied within the proposed method. Additionally, the preprocessing applied to all datasets comprises the conversion of categorical attributes into numerical values via encoding techniques, as well as feature scaling through the utilization of the Standard Scaler method.
2. Data oversampling step. In this study, a novel combined method is presented to tackle the issue of class imbalance in health datasets. The proposed method integrates the strengths of two existing methodologies, namely, the GASMOTE (Jiang et al., 2016) technique and the SMOTE-PSO (Cervantes et al., 2017) algorithm, in order to achieve a more effective solution.
FIGURE 1 The framework of the proposed hybrid heuristic sampling model.

Jiang et al. (2016) proposed the GASMOTE method, which combines the SMOTE oversampling method and the GA heuristic to balance the imbalanced classes in the dataset. The SMOTE algorithm employs a uniform sampling rate for all samples belonging to the minority class. As a result of generating new samples with the same sampling rate, the generalization performance of machine learning classification algorithms is adversely affected. Due to the varying roles of samples in the sampling and classification processes, it is imperative that each sample be assigned a distinct sampling rate. The GASMOTE algorithm is capable of identifying the optimal combination of diverse sampling rates for the samples belonging to the minority class. The optimal sampling rate is searched for by the GA, and oversampling is performed for the minority class samples with the optimal sampling rates found (Jiang et al., 2016). Cervantes et al. (2017) proposed the SMOTE-PSO method as an approach to address the challenges posed by imbalanced classes and high IR in datasets. This method combines the SMOTE oversampling technique with the PSO algorithm to improve the generalization performance of the SVM classifier. In the SMOTE-PSO approach, novel instances are synthesized in areas that exhibit a decline in minority class density.
Unlike traditional approaches that focus solely on positive class samples, this method also incorporates samples from the majority class that are located in proximity to the negative samples. By doing so, it aims to capture the characteristics and boundaries of the minority class more effectively. The utilization of the PSO algorithm is aimed at enhancing the effectiveness of the SVM classifier through the amplification of informative samples. Unlike the synthetic samples created by SMOTE, the synthetic samples in SMOTE-PSO are specifically generated in the margin region, which is a critical region for SVM classification. By leveraging the combined power of SMOTE and PSO, the SMOTE-PSO method provides a robust solution for handling imbalanced datasets. It aims to enhance the classifier's ability to generalize and improve overall classification performance. The main steps of the proposed GASMOTEPSO_ENN sampling procedure are as follows:
Normalize and preprocess the data.
Apply the GASMOTE algorithm to balance the dataset:
    Initialize a population Pop of P individuals for GASMOTE.
    Set BestSmotedTrain as an empty set.
    while the termination condition is not met do
        Generate SmotedTrain_i based on individual X_i.
        Evaluate fitness using the G-mean.
        Update BestSmotedTrain if the fitness is improved.
        Perform selection, crossover, and mutation.
    end while
    Output BestSmotedTrain.
Apply the SMOTE-PSO algorithm for further balancing:
    Split the data using k-fold cross-validation: BestSmotedTrain, Test.
    Initialize the PSO with the support vectors and create the population.
    Optimize the SVM parameters with PSO.
    Evaluate the final SVM performance.
Apply the ENN algorithm to remove noise:
    for each sample in BestSmotedTrain do
        Find its k nearest neighbours.
        if the sample's class label differs from the majority label of its neighbours then
            Remove the sample.
        end if
    end for
    Output BestBalancedTrain.
Apply the IQR rule to remove outlier examples:
    Calculate the interquartile range (IQR) for each feature in BestBalancedTrain.
    for each feature do
        Identify samples outside the IQR bounds using Equations (3) and (4).
        Remove identified samples from BestBalancedTrain.
    end for
Output the final balanced and noise-filtered training set: BestBalancedTrain.
End
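For orientation only, the following sketch strings together off-the-shelf stand-ins for the main stages above (plain SMOTE oversampling, ENN cleaning, and IQR filtering); the GA-driven per-sample sampling rates and the PSO-guided margin sampling that distinguish GASMOTEPSO_ENN are not reproduced here and would replace the plain SMOTE call.

```python
import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import EditedNearestNeighbours

# Simplified stand-in pipeline: plain SMOTE instead of the GA/PSO-guided oversampling.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)                            # oversampling stage
X_bal, y_bal = EditedNearestNeighbours(n_neighbors=3).fit_resample(X_bal, y_bal)   # ENN noise removal

# IQR-based outlier filtering (Equations (3) and (4)) applied feature by feature.
q1, q3 = np.percentile(X_bal, 25, axis=0), np.percentile(X_bal, 75, axis=0)
iqr = q3 - q1
mask = np.all((X_bal >= q1 - 1.5 * iqr) & (X_bal <= q3 + 1.5 * iqr), axis=1)
X_final, y_final = X_bal[mask], y_bal[mask]
print(X_final.shape, np.bincount(y_final))
```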
| EXPERIMENTAL RESULTS

In this section, the proposed model's effectiveness in the binary classification of imbalanced health datasets is evaluated through experimental studies utilizing machine learning classifiers. The present investigation employed the CKD, CSP, and PID datasets to evaluate the efficacy of the GASMOTEPSO_ENN method and to contrast it with existing methods in the literature. For the binary classification problems, the LR, RF, SVM, DT, NB, and XGBoost methods, which are frequently preferred and successfully applied in the literature, were selected for the classification of patients and healthy individuals. The performance evaluation metrics explicated in Section 3.5 are employed to quantify the efficacy of the classifiers. The selection of optimal parameters for the classification algorithms was conducted utilizing the GridSearchCV approach offered by the Scikit-Learn library. The experiments were performed on a laptop with an AMD Ryzen 7 5700U CPU @ 1.80 GHz (8 cores), 8 GB RAM, and a 64-bit Windows 11 Pro operating system. The Python programming language and the scikit-learn machine learning library were employed to conduct the computations. The experimental environment utilized for the development of the model is Jupyter Notebook.
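As an illustration of the hyperparameter selection step (the parameter grid below is an illustrative assumption, not the grid used in the paper), GridSearchCV can be applied as follows:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Stand-in training data; in the study this would be the resampled training set.
X_train, y_train = make_classification(n_samples=300, weights=[0.7, 0.3], random_state=1)

# Illustrative grid for the SVM classifier; the actual search spaces may differ.
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.1], "kernel": ["rbf"]}

search = GridSearchCV(SVC(), param_grid, scoring="f1", cv=5)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)
```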
The results of the experiments are contingent on two distinct scenarios. Initially, the results of the 6 distinct classifiers examined in Section 3.4 were evaluated to ascertain their efficacy in accurately distinguishing between patients and healthy individuals while disregarding the imbalanced dataset issue. Resampling methods were not employed, in order to validate the effectiveness of the proposed GASMOTEPSO_ENN method against the imbalanced datasets in their original form. Subsequently, an investigation was carried out to evaluate the effectiveness of the GASMOTEPSO_ENN method through the use of the 6 distinct classifiers for predicting patients and healthy individuals on the imbalanced datasets. Furthermore, a comparative analysis was undertaken of the proposed GASMOTEPSO_ENN model with similar studies discussed in the published literature. The findings of these experiments are explicated in the following sections.

| Experiment 1: Performance of the classifiers on the imbalanced datasets

In the first experiment, categorical variables were encoded with the use of one-hot encoding, while continuous variables were subjected to standard scaling methodology. The datasets were partitioned into training and testing subsets, wherein the proportion of the testing subset was set to 0.33, to make them suitable for the machine learning classifier models. A comparative analysis was conducted based on the model evaluation metrics to assess the respective performances of each classifier. The obtained test results can be observed in Tables 3, 4 and 5. The best results for each dataset are highlighted in bold font.
It is commonly observed that the healthy class represents the majority, while the patient class is the minority in health datasets. The aim of studies on the prediction and classification of diseases is to estimate the patient class with a very small error. Unfortunately, because machine learning methods are designed to work with balanced class data, the model learning tends toward the majority class in such datasets. Consequently, the models exhibit substandard predictive proficiency, particularly for the minority class.

TABLE 3 Evaluation of the performance of the classifiers on the chronic kidney disease (CKD) dataset without data balancing.

Table 3 demonstrates that the SVM model outperforms the other five models in terms of performance metrics on the given dataset, with 1.00 accuracy, 1.00 precision, 1.00 recall, 1.00 F1-score, 1.00 MCC, and 0.99 AUC scores. Based on these findings, it can be asserted that on this imbalanced dataset the SVM model surpasses the alternative five models. The findings presented in Table 4 indicate that, of all the models examined, the LR and XGBoost algorithms perform markedly better in terms of accuracy on the imbalanced CSP dataset. Traditionally, the assessment of a classifier's efficiency is conducted through consideration of its accuracy. Notwithstanding, in the context of an imbalanced dataset, this score may prove inadequate, as evidenced by the results presented in Table 4. It is possible to develop a classifier that attains an accuracy level of 95% in a domain where the majority class accounts for 95% of the samples; this can be achieved simply by assigning every new sample to the majority class. Nonetheless, in the context of an imbalanced dataset, a high level of overall accuracy (95%) may lack significance if the classifier fails to recognize any individual instance within the minority class. Although high levels of accuracy are observed for all classifiers, low scores are noted in the other model performance metrics. The low precision value of the model suggests that it categorizes individuals who are actually patients as healthy. The deleterious effects of this phenomenon on patient care and disease progression are significant. Based on the results displayed in Table 5, each of the classifiers demonstrated comparable performance in predicting the patients and healthy individuals within the PID dataset. Although the classifiers failed to produce satisfactory results in terms of the MCC measure, they performed averagely on the other evaluation measures. The MCC is a statistical metric that offers heightened reliability, as it only yields a high score when estimation proves to be effective in all four categories represented in the confusion matrix. The MCC value indicates that the models achieved suboptimal performance in producing an objective outcome for an imbalanced dataset. Moreover, the IR values of the CKD, CSP, and PID datasets are 1.66, 19.52, and 1.86, respectively. The analysis has revealed that the models' competency is correlated not only with the IR of the datasets but also with their size. The CKD dataset is relatively limited in scope, and due to its relatively small IR, each of the classifiers employed to evaluate the CKD dataset demonstrated a high level of efficacy. On the CSP dataset, characterized by a substantial IR and a large size, all of the classifier models demonstrated a marked rate of accuracy. Nonetheless, other performance metrics displayed notably lower results. This condition has given rise to classifier models that exhibit inadequate predictive efficacy, particularly with regard to the minority class, indicating an insufficiency to differentiate between patients and healthy individuals.
Within the PID dataset, possessing an IR ratio of 1.86, and classified as medium-sized, it was observed that all classifier models were weak to effectively distinguish between patients and healthy individuals.Furthermore, inadequate performance was documented across all evaluation metrics.Moreover, Figures 2, 3 and 4 show the classifiers' ROC curves and the corresponding AUC values for the CKD, CSP, and PID datasets respectively.According to the ROC curve given in Figure 2, we can say that the accuracy rates of the two classes that rapidly exceed 0.9 are high and parallel to each other.At the same time, we can see that the success of the model is high since the AUC region, which is under the curve where the classes pass, occupies a high place.The ROC curves given in Figures 3 and 4 show that the ability of the 6 classifiers to distinguish between patient and healthy individuals is quite poor.
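To make the accuracy paradox described above concrete, the short sketch below (a minimal illustration using scikit-learn on synthetic labels with a 95:5 class ratio, not data drawn from the studied datasets) scores a trivial rule that labels every sample as healthy: accuracy reaches roughly 0.95 while the recall for patients and the MCC both collapse to 0, which is why MCC is reported alongside accuracy throughout this study.

```python
# Illustration of the accuracy paradox on a 95:5 class distribution.
# The "classifier" simply assigns every sample to the majority (healthy) class.
import numpy as np
from sklearn.metrics import accuracy_score, matthews_corrcoef, recall_score

rng = np.random.default_rng(42)
y_true = rng.choice([0, 1], size=1000, p=[0.95, 0.05])  # 0 = healthy, 1 = patient
y_pred = np.zeros_like(y_true)                          # always predict "healthy"

print("Accuracy:", accuracy_score(y_true, y_pred))               # ~0.95, despite learning nothing
print("Recall (patient class):", recall_score(y_true, y_pred))   # 0.0: every patient is missed
print("MCC:", matthews_corrcoef(y_true, y_pred))                  # 0.0: no better than chance
```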

5.2 | Experiment 2: Performance comparison of the proposed model
In this experiment, the performance of the proposed model is compared with the results of related studies addressing imbalance problems in the literature. In the data preprocessing stage, the MICE method is employed to address missing values in the dataset. The data were then scaled, with one-hot encoding applied to categorical variables and standard scaling to continuous variables. The proposed GASMOTEPSO_ENN method is used to balance the imbalanced classes in the datasets. Following the balancing of the dataset, the IQR method was employed to identify and remove outliers. Subsequently, all datasets were partitioned into training and testing subsets, with a test-set proportion of 0.33, for use in the machine learning classification models. The observed test results are presented in Tables 6, 7 and 8, with the best results for each dataset highlighted in bold font.
The obtained results highlight that the proposed GASMOTEPSO_ENN method is proficient in addressing imbalanced data and enhancing prediction performance, as validated by the model evaluation metrics. The results clearly demonstrate that applying the GASMOTEPSO_ENN method yields a noticeable improvement in the model evaluation metrics, producing superior results across all analysed datasets. The classification models utilizing the GASMOTEPSO_ENN method exceeded 90% for the majority of the evaluation metrics on the analysed datasets. The findings indicate that the GASMOTEPSO_ENN method produces a variety of high-quality synthetic samples, ultimately enhancing the capability of the base classifiers. The results also show that the mean AUC performance of the classifiers on datasets characterized by minor imbalances exceeds 0.90, while the average AUC performance on datasets exhibiting significant imbalances is 0.97. Furthermore, Figures 5, 6 and 7 show the classifiers' ROC curves and the corresponding AUC values for the CKD, CSP, and PID datasets, respectively. Upon balancing the datasets through the proposed methodology, a noteworthy improvement in estimating patients and healthy individuals was observed, surpassing the results of the first experiment.
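As a rough illustration of the preprocessing and balancing steps described at the start of this experiment, the sketch below chains MICE-style imputation, encoding and scaling, hybrid resampling, IQR-based outlier removal, and the 0.33 train/test split. It is a minimal sketch only: imbalanced-learn's off-the-shelf SMOTEENN stands in for the proposed GASMOTEPSO_ENN (which additionally combines SMOTE and ENN under GA and PSO heuristics), and the file name and target column are hypothetical.

```python
# Minimal sketch of the preprocessing flow (assumptions: file name, column names,
# and plain SMOTEENN in place of the proposed GASMOTEPSO_ENN are hypothetical).
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (activates IterativeImputer)
from sklearn.impute import IterativeImputer
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from imblearn.combine import SMOTEENN

df = pd.read_csv("ckd.csv")          # hypothetical dataset file
y = df.pop("class")                  # hypothetical target column (patient / healthy)

num_cols = df.select_dtypes(include="number").columns
X = pd.get_dummies(df)               # one-hot encode categorical variables

# MICE-style imputation of missing values (IterativeImputer approximates MICE).
X = pd.DataFrame(IterativeImputer(random_state=0).fit_transform(X),
                 columns=X.columns, index=X.index)

# Standard scaling of the continuous variables.
X[num_cols] = StandardScaler().fit_transform(X[num_cols])

# Hybrid resampling; SMOTEENN is used here as a stand-in for GASMOTEPSO_ENN.
X_bal, y_bal = SMOTEENN(random_state=0).fit_resample(X, y)

# IQR-based outlier removal on the continuous variables of the balanced data.
q1, q3 = X_bal[num_cols].quantile(0.25), X_bal[num_cols].quantile(0.75)
iqr = q3 - q1
keep = ~((X_bal[num_cols] < q1 - 1.5 * iqr) | (X_bal[num_cols] > q3 + 1.5 * iqr)).any(axis=1)
X_bal, y_bal = X_bal[keep], y_bal[keep]

# Train/test split with the 0.33 test proportion used in the experiments.
X_train, X_test, y_train, y_test = train_test_split(
    X_bal, y_bal, test_size=0.33, stratify=y_bal, random_state=0)
```

In the proposed method, the GA and PSO heuristics guide how SMOTE and ENN are combined; the stand-in above omits that heuristic search.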
Therefore, it can be concluded that the proposed GASMOTEPSO_ENN method possesses a higher degree of validity and accuracy. The GASMOTEPSO_ENN method exhibits greater stability and robustness in its applicability, particularly when dealing with datasets that possess significant imbalances, such as the CSP dataset. The GASMOTEPSO_ENN method addressed the class imbalance in the preprocessed datasets, compensating for the shortcomings of machine learning algorithms, which are not effective on imbalance problems. The present study offers a comparative analysis between the GASMOTEPSO_ENN method and similar studies in the literature for the CKD, CSP, and PID datasets, as demonstrated in Tables 9, 10 and 11.
Based on the MCC results obtained from the 6 classification algorithms used with the GASMOTEPSO_ENN method, only the findings from the most effective machine learning model have been incorporated into Tables 9, 10 and 11. It has been observed that several classification algorithms exhibit comparable or superior performance relative to those reported in the existing literature. The analyses reported in Tables 9 and 10 for the CKD and CSP datasets, respectively, show that the GASMOTEPSO_ENN method achieves performance levels comparable to those of alternative sampling methods. Moreover, the GASMOTEPSO_ENN method exhibited outstanding performance in comparison with other related studies that utilized the PID dataset, as depicted in Table 11. The achieved results, demonstrating enhanced performance in disease diagnosis with the GASMOTEPSO_ENN method, have practical implications for real-world applications. The method's improved accuracy, especially on datasets with significant imbalances, suggests its potential to reduce misdiagnoses and optimize treatment planning. This could lead to more efficient resource allocation, improved patient outcomes, and advancements in healthcare quality. The method's stability makes it a valuable tool for addressing common challenges in medical datasets, offering practical benefits to researchers and clinicians in enhancing disease diagnosis and treatment.
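The comparison tables report, for each dataset, only the classifier with the highest MCC. As a sketch of that selection step (run on synthetic data with a generic scikit-learn classifier set, which need not match the six models or hyperparameters used in this study; XGBoost would come from the separate xgboost package), candidates could be ranked as follows:

```python
# Train several candidate classifiers and keep the one with the highest MCC
# (hedged sketch on synthetic data; the paper's classifier set and tuning may differ).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import matthews_corrcoef

X, y = make_classification(n_samples=600, n_features=10, weights=[0.65, 0.35], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

candidates = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),  # stand-in for XGBoost
}

mcc = {name: matthews_corrcoef(y_test, clf.fit(X_train, y_train).predict(X_test))
       for name, clf in candidates.items()}
print(sorted(mcc.items(), key=lambda kv: kv[1], reverse=True))
print("Best model by MCC:", max(mcc, key=mcc.get))
```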

6 | CONCLUSIONS AND FUTURE WORK
The issue of imbalanced class distribution is a common challenge in numerous real-world datasets. Health datasets, in particular, tend to have highly imbalanced distributions owing to difficulties such as the scarcity of minority-class samples and the cost of data collection. In health data analysis, the prediction and classification of diseases is considered one of the most important issues, and accurate identification of patients in the early stages of a disease is critical to treatment. Accurately predicting patients and healthy individuals in health datasets, which often exhibit imbalanced class distributions, poses a significant challenge for machine learning methods. Conventional machine learning algorithms often struggle to accurately predict patients belonging to the minority class because they assume a balanced class distribution or equal misclassification costs.
This negatively affects the treatment of the disease, as patients are presented as healthy. Various data-level methods such as undersampling, oversampling, and hybrid sampling approaches have commonly been employed in the literature to address imbalanced datasets.
This paper focuses on combining undersampling and oversampling methods with heuristics for class-imbalanced datasets. After balancing the classes in the dataset using the proposed GASMOTEPSO_ENN method, predictive models are built by applying machine learning classifiers to the balanced data. The study utilized performance metrics such as accuracy, precision, recall, F1-score, AUC, and MCC to comprehensively assess the efficiency of the classifiers. The evaluation results confirm that the performance of the GASMOTEPSO_ENN method is acceptable to excellent across the three imbalanced health datasets, and that it achieves the best performance among the related studies compared, under different evaluation metrics. The experimental results also showed that the imbalance ratio and the size of a dataset affect the performance of classification algorithms. The proposed GASMOTEPSO_ENN method stands out in its ability to effectively address imbalanced class distributions in health datasets, displaying superior performance and holding significant promise for improving the accuracy of disease prediction in real-world healthcare applications. Although the proposed method has yielded impressive results in handling class imbalance and performed well on a variety of health datasets, it has some limitations. Primarily, the focus of this research is limited to binary classification scenarios within the healthcare domain. To make the method more inclusive and adaptable to various medical conditions and classification complexities, future work will extend it to multi-class imbalanced scenarios. Moreover, the current study concentrates mainly on the technical aspects of managing data imbalance and classifier efficiency. Future work will focus on the integration of feature selection, resampling, and machine learning techniques, with the objective of developing a more comprehensive approach that enhances predictive accuracy and model generalization and fosters more effective decision-making in clinical applications. Furthermore, there is potential to broaden the evaluation criteria and to obtain feedback from medical professionals in clinical settings so as to better assess the method's practicality and effectiveness in real medical scenarios. These suggestions will guide future researchers as they strive to create more advanced solutions for addressing imbalanced data issues and implementing them seamlessly in clinical applications.
TABLE 10 Comparison with other cerebral stroke prediction (CSP) predictive models.

TABLE 1 Advantages and disadvantages of the methods in the literature.
Study | Method | Advantages | Disadvantages
… | … | Comprehensive comparison of multiple ML algorithms. | Loss of information by random undersampling.
Rana et al. (2021) | SMOTE-TOMEK + ANN | The proposed ANN model achieved the highest ROC score (0.84). | The training time of the ANN model is longer than that of some other models.
Al-Zubaidi et al. (2022) | ML algorithms: DT, RF, LR, SVM, hard voting, and soft voting classifiers | The RF algorithm gave the highest accuracy of 94.7% among all tested algorithms. | …

5.1 | Experiment 1: Performance comparison of the different classifiers without data balancing
This experiment aimed to compare the performance of 6 distinct classifiers in estimating the patients and healthy individuals present in the datasets without employing data balancing techniques. The data preprocessing stage consists of the implementation of the MICE method to mitigate the impact of missing values within the datasets. The data then underwent a scaling process wherein categorical variables were encoded using one-hot encoding and continuous variables were scaled using standard scaling.
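Since the MICE step is central to both experiments, the fragment below sketches how MICE-style multiple imputation can be approximated with scikit-learn's IterativeImputer; this is a hedged approximation in which several imputed copies are drawn by posterior sampling and then pooled, and the toy matrix is illustrative only.

```python
# MICE-style multiple imputation approximated with scikit-learn's IterativeImputer.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([[7.0, 120.0, np.nan],
              [6.1, np.nan, 1.2],
              [np.nan, 140.0, 0.9],
              [5.4, 110.0, 1.1]])            # toy feature matrix with missing entries

# Draw several imputed datasets (sample_posterior=True adds the stochastic draws
# that characterize MICE) and pool them by averaging.
imputations = [IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
               for seed in range(5)]
print(np.mean(imputations, axis=0))
```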

FIGURE 2 Receiver operating characteristics (ROC) curves of the various classifiers trained with the chronic kidney disease (CKD) dataset without data balancing.
FIGURE 3 Receiver operating characteristics (ROC) curves of the various classifiers trained with the cerebral stroke prediction (CSP) dataset without data balancing.
FIGURE 4 Receiver operating characteristics (ROC) curves of the various classifiers trained with the PIMA Indian diabetes (PID) dataset without data balancing.
TABLE 6 Evaluation of the performance of the classifiers on the chronic kidney disease (CKD) dataset.

FIGURE 5 Receiver operating characteristics (ROC) curves of the various classifiers trained with the chronic kidney disease (CKD) dataset.
TABLE 7 Evaluation of the performance of the classifiers on the cerebral stroke prediction (CSP) dataset.

FIGURE 6 Receiver operating characteristics (ROC) curves of the various classifiers trained with the cerebral stroke prediction (CSP) dataset.
FIGURE 7 Receiver operating characteristics (ROC) curves of the various classifiers trained with the PIMA Indian diabetes (PID) dataset.
TABLE 9 Comparison with other chronic kidney disease (CKD) predictive models.
TABLE 4 Evaluation of the performance of the classifiers on the cerebral stroke prediction (CSP) dataset without data balancing.
TABLE 5 Evaluation of the performance of the classifiers on the PIMA Indian diabetes (PID) dataset without data balancing.
TABLE 8 Evaluation of the performance of the classifiers on the PIMA Indian diabetes (PID) dataset.
TABLE 11 Comparison with other PIMA Indian diabetes (PID) predictive models.