Multifactor data analysis to forecast an individual's severity over novel COVID‐19 pandemic using extreme gradient boosting and random forest classifier algorithms

AI and machine learning are increasingly applied in the medical industry. The COVID‐19 epidemic began to spread quickly across the planet around the start of 2020. At hospitals, there were more patients than there were beds, and it was challenging for medical personnel to identify the patients who needed treatment right away. A machine learning approach is used to predict which COVID‐19 patients are at high risk. A straightforward Python Flask web application is employed to accept input data and return results, executing the machine learning model on the backend. Here, the XGBoost algorithm, a supervised machine learning method, is applied. To predict high‐risk patients based on their current underlying health issues, the model uses patient characteristics and criteria such as age, sex, and health issues including diabetes, asthma, hypertension, and smoking, among others. The XGBoost model predicts the patient's severity with an accuracy of about 98% after data pre‐processing and training. Age, diabetes, sex, and obesity emerge as the most important factors for the models. Patients and hospital personnel will benefit from this project's assistance in making timely choices and taking appropriate action. It will let medical personnel decide how much time and space to devote to high‐risk COVID‐19 patients, providing treatment that is both efficient and ideal. With this programme and the necessary patient data, hospitals may decide whether a patient needs immediate care or not.


INTRODUCTION
The majority of viruses that damage the human respiratory system and cause serious sickness are members of the coronavirus family. More than 100 million individuals were impacted by the most recent outbreak of the new corona illness, which also claimed the lives of numerous people all over the world. Death is a typical outcome of widespread deficiencies in the capacity to recognise COVID-19 disease and the lack of appropriate medicine to conduct specific actions.
The worldwide coronavirus disease 2019 (COVID-19) pandemic affects humans and results in acute respiratory issues and irregular breathing patterns. The test most often used to diagnose COVID-19 in the first instance is RT-PCR, short for reverse transcription-polymerase chain reaction. Nevertheless, because the test is conducted during the first five days after exposure, a large percentage of false alarms are generated. Effective COVID-19 diagnosis requires the patient's underlying health condition report; various characteristics, including sex, age, diabetes, asthma, high blood pressure, and smoking, are taken into consideration to forecast the patient's high-risk severity. Even so, the ability of healthcare experts to evaluate the findings is extremely important in the medical area. Thus, a COVID-19 detection system that automatically analyses health reports has become more important in recent years.
In December 2019, the new coronavirus (COVID-19) epidemic initially surfaced in China, and in March 2020 the World Health Organization (WHO) proclaimed it a worldwide pandemic. The new coronavirus's symptoms ranged from asymptomatic infection to minor symptoms to severe illness, including respiratory failure, the failure of numerous organs, and death. Medical practitioners have received more encouragement, support, and gratitude than ever before since the COVID-19 epidemic broke out.
The global medical systems have been under a lot of stress since the COVID-19 epidemic. Medical professionals began to wonder which patients required urgent treatment as the demand for beds increased. Due to the seriousness of some COVID-19 cases, hospitalization is necessary, and certain patients may need to be admitted to the critical care unit. This puts a lot of strain on the already overworked medical staff who work in hospitals. This pressure might be reduced by having a better understanding of who is susceptible to dying from COVID-19. It would enable the identification of those who genuinely require therapy and the customization of care for these high-risk situations. Based on their baseline state of health, high-risk COVID-19 patients with severe outcomes would receive priority care. The strain on hospital staff would lessen, and hospitals would be able to make better use of their limited resources.
Healthcare workers' lives and health must be protected under these unprecedented circumstances if there is to be a more meaningful global response to the present COVID-19 outbreak. Machine learning techniques may be used to assess a complex system's inner and outward links. When it comes to risk categorization, machine learning models perform better than conventional prediction models because they can recognize nonlinear correlations, support more multi-factorial algorithm optimization, and have built-in model validation techniques.
The remaining sections of this article are arranged as follows. In Section 2, we look at recent machine learning-based COVID-19 detection methods that have been investigated by different researchers. The suggested COVID-19 detection model's technique is then described in more depth in Section 3. Section 4 presents the experimental findings and a discussion of the suggested model. Lastly, Sections 5 and 6 respectively contain the discussion and conclusion sections, with a future scope and declaration.

LITERATURE REVIEW
Assaf et al.1 developed three distinct machine-learning models for predicting patient deterioration that were connected both to the APACHE II risk-prediction record and to the most recent prognosticators. A deep learning model was used by Li et al.2 to predict mortality using analysis of the patient's age, gender, and other factors as well as their basic health status. For the task of predicting the pandemic, Chordia et al.3 claimed that the ARIMA model performs better than the Prophet model. During a pandemic, Jing Hu et al.4 created a Salesforce application for educational purposes. Hina Gull et al.5 presented a case study of a small Pakistani city with few medical services during COVID-19. A technique for creating an artificial neural network that can forecast a patient's likelihood of survival is provided by Marcus Ong Eng Hock et al.6 A method and process for diagnosing a patient's illness and treating them were offered by Barnhill et al.; this includes a technique for obtaining patient records from another site, searching the records using a trained neural network, producing an analytical estimate, and optionally moving that estimate to another site. Tom Bragg and colleagues proposed a method in which molecular genomic data and/or medical data are provided, the data are preprocessed, a certain number of variables are chosen based on the combined/mutual information content of the supplied data, and prediction data are repeatedly produced using machine learning. These stages are used to track the progression of an illness and/or identify people at high risk.
Yatsuhashi Hiroshi et al.7 developed likelihood criteria for locating and diagnosing COVID-19. Obtaining patient information for COVID-19 patients, including detailed reports about patient description and behaviours, is another matter entirely (see Section 6.3). Based on well-recognized biomarkers that can be evaluated in the majority of standard clinical laboratories worldwide, Krysko Olga et al.8 discovered the predictive elements and provided a technique for obtaining forecasts for COVID-19 patients. Machine learning, an adaptive neuro-fuzzy inference system, and superior beetle antennae search meta-heuristics are all included in the prediction prototype given by Miodrag Zivkovic et al.9 For the massive COVID-19 (coronavirus disease) dataset that William Mary et al.10 studied, the SVM classification technique would provide improved accuracy and be useful in predicting the future. There were several distinct kinds of cases, such as confirmed cases, recovered cases, death cases, and active cases. The COVID-19 transmission model was explored by Yun Xiang Liu et al.11 in order to observe short-term prediction; the application of polynomial regression may better forecast the number of new confirmed patients daily. Hassam Tahir and colleagues developed stringent SOPs to stop COVID-19 by correctly predicting the viral load scenario; ResNet-50 and the Mask R-CNN outperformed the Faster R-CNN.12 According to Sumit Bhardwaj et al., the SVR has demonstrated superior dependability compared to other linear, polynomial, and logistic regression models; the SVR can lessen the spikes in the dataset and hence curb the spread of the illness.13 For patients whose COVID-19 was predicted to be severe, Jiahao Qu et al. developed a linear SVM classifier that would evaluate serious patients in three phases; eventually, ferritin was introduced to have a better effect on severity.14
According to Jagadish Ware V and colleagues, SVM performed better than the other representations when connecting the regression models; by treating the holidays as an input to the Facebook Prophet model, the effect of the holidays may also be taken into account.15 Li Hou et al.16 evaluated the VGG system's efficacy using two image pre-processing techniques, image decomposition and discriminative search, with an accuracy rate of 86.84%. According to Prajoy Podder et al.,17 ML classifiers are inherited to anticipate COVID-19 and ICU needs; with the use of random forest (RF), logistic regression (LR), and a stacking ensemble, COVID-19 exposure may be predicted with an accuracy of 94.39%. Se-Min Hyun et al.18 found that information on breathing, heart rate, and body temperature helped forecast the results of a classification algorithm for COVID-19-infected people; the model allowed XGBoost to function at its best on pre-processed data. The ANN-GWO reportedly attained a MAPE of 6.23%, 13.15%, and 11.4% in the training, testing, and validation stages, respectively, according to Sina Ardabili et al.19 Time-series statistics covering the period from January 22 to September 15, 2020, have been used to develop the training and assessment procedures. Lindsay Schirato et al.20 reached their conclusion by precisely identifying health factors that influence an elevated mortality rate and aid health care services in recognizing patients with an increased probability of a fatal outcome. By utilizing this machine learning model, Ryan Yixiang Wang et al.21 were able to determine that a person's likelihood of spreading an infection may be predicted with 0.91 and 0.92 accuracy.
The established models, according to Pratima Kumari et al.,22 are used to predict the number of rising confirmed cases, frequent confirmed cases, and increasing mortality cases. The output of this model may be utilized to plan and support additional health decision-making processes, as revealed by Narayana Darapaneni et al.23 According to Alice Feng et al.,24 a model that can achieve an AUROC of roughly 0.8-0.84 and an RMSE of 5.7-1.5 for people who are hospitalized and spend time in the intensive care unit, along with longitudinal EHR statistics, can be effectively used to provide a comprehensive forecast of a person's health risk based on past histories of their physical conditions. On the active, death, and recovered instances provided by polynomial LASSO and polynomial LR, Vartika Bhadana et al.25 observed that this ML model offered the best outcome; because of the ups and downs in dataset values, SVM displays inferior results overall. Zhao et al.26 integrated the analysis of patient data using R and MATLAB and found that the BP neural network system can analyze and anticipate the expansion trend of COVID-19; when the grade is appropriate, it is 0.99. An energy-efficient social collaborative tracing system based on Bluetooth Low Energy was developed by Moremada et al.;27 based on the collected statistics, the algorithm is then expected to forecast the possibility that COVID-19 would be contracted. Liu et al.28 used the NAR dynamic neural network model for forecasting; by incorporating the number of errors from logistic regression, ARIMA prediction, and the SEIR standard, the NAR dynamic neural network exceeds the evaluation standard in the time forecast of the new crown epidemic.
SEIR-PADC dynamic models for the COVID-19 eruption were built for the GCC countries by Sedaghat et al.28 The SEIR-PADC model was successfully used; they applied MATLAB optimization with a convergence criterion to identify the SEIR-PADC model's best-fitting prediction for COVID-19 statistics.29 The COVID-19 epidemic in Indonesia was predicted using a scenario-based model developed by Mantoro and colleagues; Support Vector Regression (SVR) and Susceptible Infectious Recovered (SIR) classical approaches were used in this study.30 Segmentation, data augmentation, and data modification are all connected via a method called COVID-SDNet, which was created by Tabik et al.31 The outcomes demonstrate the sophisticated generalization capabilities of COVID-SDNet; the Mild and Normal-PCR+ classes have fewer prospects for improvement because these conditions have insufficient or non-visual structures. Thakur et al.32 investigated the COVID pandemic mathematically using prediction and optimization; peak period, transmission rate, mortality, and positivity may all be predicted using the given mathematical formulae. Using two hybridized deep learning processes, a ResNet model and a GoogleNet model, as well as a deep learning model using the mayfly optimization approach, Yenurkar et al.33 offered future forecasting prediction of COVID-19.
A straightforward technique for COVID-19 pneumonia detection from x-ray pictures using deep feature extraction and XGBoost was recommended by Munindra Lunagariya et al.34 as a solution to the COVID wave's detection-kit scarcity. A deep learning model with 96.45% accuracy was proposed by Marko Sarac et al.35 for intelligent diagnosis of coronavirus using computed tomography images; the study focused on non-clinical methods to identify COVID-19 carriers. The approach developed by Nasiri et al.36 might be used as an early detection tool to help radiologists diagnose the disease more quickly and accurately. Zivkovic et al.37 suggested a hybrid CNN and XGBoost model that was optimised using a modified arithmetic optimisation approach for COVID-19 early diagnosis from x-ray images. Using deep features and PSO-optimized XGBoost, Domingos Alves Dias Junior proposed an automated method for classifying COVID-19 patients based on chest x-ray images,38 achieving an F1-score of 99.25% and accuracy, precision, and recall ratings of 98.71%, 98.89%, and 99.63% for the proposed method. Finally, a unique FusionNet model has been put forth for the accurate categorization of COVID-19 disorders.39

METHODOLOGY
The project's workflow is shown in Figure 1. The dataset obtained from Kaggle (see Section 6.3) is pre-processed using the different procedures shown in Figure 1 before being utilized for model training. The pre-processing of the dataset was completed entirely in a Jupyter notebook, a well-liked IDE among machine learning engineers. The steps used for data pre-processing are as follows: 1. In order to obtain high-quality data, the rows in the dataset that included null values were eliminated, and the columns that did not apply to the problem statement were removed from the dataset. 2. Then, all the values in the symptom columns were converted to binary integers, where 1 indicates that a patient has a symptom and 0 indicates that the patient does not. Working with the data is made simpler in this manner. 3. The random forest has the restriction that it will give a column greater weight if it has a high cardinality, that is, a large number of different values. The age column was causing this. Hence, in order to decrease the cardinality of the age column, age was separated into four groups: children (0-12), adolescents (13-18), adults (19-59), and senior adults (60 years and above). 4. The dataset for model training was then exported after the columns had been rearranged and exploratory data analysis (EDA) had been performed.
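The pre-processing steps above can be sketched in pandas roughly as follows. The miniature table and its column names are hypothetical stand-ins for the real Kaggle columns:

```python
import pandas as pd

# Hypothetical miniature of the patient table; real column names may differ.
df = pd.DataFrame({
    "age": [8, 15, 34, 72, None],
    "diabetes": ["yes", "no", "yes", "no", "yes"],
    "hypertension": ["no", "no", "yes", "yes", None],
})

# 1. Drop rows containing null values to keep only high-quality records.
df = df.dropna().copy()

# 2. Convert symptom columns to binary integers (1 = present, 0 = absent).
for col in ["diabetes", "hypertension"]:
    df[col] = (df[col] == "yes").astype(int)

# 3. Bin age into four groups to reduce the column's cardinality:
#    children (0-12), adolescents (13-18), adults (19-59), seniors (60+).
bins = [0, 12, 18, 59, 200]
labels = [0, 1, 2, 3]
df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels,
                         include_lowest=True).astype(int)
df = df.drop(columns=["age"])
```

After these steps every column is an integer, which is the form the model training code expects.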
The model was trained using conventional machine learning techniques. The XGBoost machine learning model was created using XGBClassifier() from the xgboost library. The steps involved in model training are as follows. Random forest: 1. Class imbalance is an issue that the random forest algorithm is prone to. Class imbalance means that there are not enough minority-class occurrences for a model to recognize the boundary of the outcome. This is the rationale behind using the Synthetic Minority Oversampling Technique (SMOTE) from the imblearn library: to address the issue of class imbalance, the minority class is oversampled in the training dataset before fitting the model. 2. The SelectFromModel() class from sklearn, a feature selection utility, is employed next to extract the most crucial characteristics. A threshold of 0.04 is employed to obtain the top six characteristics: intubed, pneumonia, age group, diabetes, hypertension, and covid res. These features are used to train the model. 3. After training the model, the predictions are examined. For XGBoost, the pre-processed dataset is used directly to train the model, and the predictions are then examined.
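A minimal sketch of this training pipeline using only scikit-learn: simple duplication via `resample` stands in for imblearn's SMOTE, a random forest stands in for the final classifier, and synthetic data replaces the patient table:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.utils import resample

# Synthetic, imbalanced stand-in for the pre-processed patient table.
X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           weights=[0.8, 0.2], random_state=0)

# 1. Balance the classes. The paper uses imblearn's SMOTE; plain duplication
#    of minority rows is shown here so the sketch needs only scikit-learn.
X_min, y_min = X[y == 1], y[y == 1]
X_up, y_up = resample(X_min, y_min, n_samples=(y == 0).sum(), random_state=0)
X_bal = np.vstack([X[y == 0], X_up])
y_bal = np.concatenate([y[y == 0], y_up])

# 2. Keep only features whose importance exceeds the 0.04 threshold.
selector = SelectFromModel(RandomForestClassifier(random_state=0),
                           threshold=0.04)
X_sel = selector.fit_transform(X_bal, y_bal)

# 3. Train the final classifier on the selected features.
model = RandomForestClassifier(random_state=0).fit(X_sel, y_bal)
```

The real pipeline swaps in `imblearn.over_sampling.SMOTE` for step 1 and `xgboost.XGBClassifier` where XGBoost is trained directly on the pre-processed data.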
According to the system's flowchart, the user first submits the most recent health information. There are two ways to upload the data: if it pertains to a single patient, a straightforward form with the patient's basic information and symptoms is submitted; if it pertains to numerous patients, a CSV file with each patient's medical information is uploaded. Figure 5 illustrates the CSV file's format. The web app offers a download option for CSV files. The machine learning algorithm then forecasts high-risk individuals using this health data, producing a result file. The patient's information is included in the resultant file, along with a high-risk column with the values 1 and 0, which indicates whether or not the patient has a considerable risk of COVID-19. Following that, an email can be sent to patients who, based on the prognosis, are at high risk. The email includes a warning message, a link to Google Maps that displays nearby hospitals, a health report outlining the symptoms behind the patient's high-risk status, and a result file with the patient's basic information and a high-risk column containing the words "YES" or "NO", denoting high risk and low risk, respectively.

Random forest procedure
One of the most well-liked and often used algorithms among data scientists is the random forest. Random forest and other supervised machine learning techniques are extensively applied to classification and regression problems. It develops decision trees from several samples and then combines their outputs: a majority vote for classification and an average for regression.
One of the most important features of the random forest algorithm is its ability to handle data sets with both continuous variables, as in regression, and categorical variables, as in classification. It performs well for classification and regression tasks. In this section, you will learn how the random forest works and how to use it for a classification task.
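As a toy illustration of how a forest aggregates its trees, the per-tree votes and outputs below are invented; the aggregation rules are the real ones:

```python
from collections import Counter
from statistics import mean

# Hypothetical outputs from five decision trees for one patient.
tree_votes = [1, 0, 1, 1, 0]               # classification: high-risk (1) or not (0)
tree_outputs = [0.9, 0.2, 0.8, 0.7, 0.1]   # regression: per-tree numeric outputs

# Classification: the forest's answer is the majority vote across trees.
majority = Counter(tree_votes).most_common(1)[0][0]

# Regression: the forest's answer is the average of the per-tree outputs.
average = mean(tree_outputs)
```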
Given that each split produces exactly two child nodes (a binary tree), scikit-learn calculates the importance of each node j in a decision tree using the Gini importance formula (see Equation 1):
ni_j = w_j * C_j - w_left(j) * C_left(j) - w_right(j) * C_right(j)    (1)
where w_j is the weighted number of samples reaching node j, C_j is the impurity value of the node, and left(j) and right(j) are its children from the left and right splits. Feature importance is computed this way in the tree.pyx source file.
The importance of each feature i on a decision tree is the sum of the importances of the nodes that split on feature i, divided by the total importance of all nodes (see Equation 2):
fi_i = ( sum over nodes j that split on feature i of ni_j ) / ( sum over all nodes k of ni_k )    (2)
These are then normalised to a value between 0 and 1 by dividing by the sum of all feature importances (see Equation 3):
normfi_i = fi_i / ( sum over all features j of fi_j )    (3)
At the random forest level, the final significance of a feature is the average of its importance across all trees: the sum of its normalised importance over the trees is divided by the total number of trees T (see Equation 4):
RFfi_i = ( sum over all trees j of normfi_ij ) / T    (4)
where RFfi_i is the significance of feature i as determined by all the trees of the random forest model, and normfi_ij is the normalised significance of feature i in tree j.
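In scikit-learn, these tree-averaged, normalised Gini importances are exposed directly as the fitted forest's `feature_importances_` attribute; a small sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the six selected patient features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ implements the averaged, normalised node importances
# described above, so the values are non-negative and sum to 1.
for i, imp in enumerate(forest.feature_importances_):
    print(f"feature {i}: {imp:.3f}")
```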
Random forest is a supervised machine learning (ML) strategy. Random forests receive a forecast from each tree and vote for the best response during bootstrapping, the process of building decision trees from randomly chosen data samples. To comprehend what the random forest algorithm accomplishes, see Figure 2. Bagging describes how a portion of the dataset's features is picked at random for training. In this project, a random forest technique from Python's scikit-learn module is employed.

XGBOOST algorithm
The random forest system is the foundation of the XGBoost algorithm. Figure 3 makes it easier to comprehend how the XGBoost algorithm has evolved over time. Boosting is a method for reducing the flaws of weak models and increasing the impact of effective ones. To reduce errors, gradient boosting employs the gradient descent mechanism. The following XGBoost algorithmic enhancements contributed to its higher results. 1. XGBoost is one of the quickest implementations of gradient boosted trees; it achieves this by addressing one of the main drawbacks of gradient-boosted trees. 2. Imagine there are thousands of characteristics and, hence, thousands of potential splits: there are thousands of possible splits and losses if we take every one of them into account in order to form a new branch. 3. To overcome this inefficiency, XGBoost narrows the range of possible feature splits by analysing the distribution of features across all data points in a leaf. 4. Although XGBoost makes use of a few regularisation methods, its speedup is by far its most important feature, since it enables quick examination of a huge number of hyperparameter settings. 5. This is advantageous because there are several hyperparameters to adjust that aim to prevent overfitting. 6. Each decision tree's individual prediction scores are then added up to produce the final prediction; it is important to observe that the trees make an effort to complement one another. Our model may be described mathematically as y_hat_i = sum over k = 1..K of f_k(x_i), with f_k in F (see Equation 5), where K is the number of trees, F is the set of potential CARTs, and each f_k is a function in the functional space F.
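The additive, residual-fitting idea behind gradient boosting can be sketched in pure Python: starting from the mean, each round fits a depth-1 "stump" to the residuals (the negative gradient of the squared loss) and adds it to the ensemble. The data and learner here are toys, not the real XGBoost implementation:

```python
# Toy 1-D regression data (invented values).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9]

def fit_stump(xs, residuals):
    """Find the split that best predicts the residuals with two constants."""
    best = None
    for split in xs:
        left = [r for x, r in zip(xs, residuals) if x <= split]
        right = [r for x, r in zip(xs, residuals) if x > split]
        lmean = sum(left) / len(left) if left else 0.0
        rmean = sum(right) / len(right) if right else 0.0
        err = sum((r - (lmean if x <= split else rmean)) ** 2
                  for x, r in zip(xs, residuals))
        if best is None or err < best[0]:
            best = (err, split, lmean, rmean)
    _, split, lmean, rmean = best
    return lambda x: lmean if x <= split else rmean

# Start from the mean prediction, then repeatedly fit a stump to the
# residuals and add its output to the running prediction (Equation 5:
# the final prediction is the sum of the per-tree functions).
pred = [sum(ys) / len(ys)] * len(xs)
for _ in range(10):
    residuals = [y - p for y, p in zip(ys, pred)]
    stump = fit_stump(xs, residuals)
    pred = [p + stump(x) for p, x in zip(pred, xs)]

mse = sum((y - p) ** 2 for y, p in zip(ys, pred)) / len(ys)
```

Each added stump reduces the remaining error, which is why the ensemble's trees "complement one another" as described above.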

Binary classification
Binary classification puts a classification rule into practice by splitting a group of elements into two categories. In this study, the patients are divided into high-risk and low-risk categories by utilizing binary classification. The result is presented as "YES" and "NO" in the result file, and as "1" and "0" in the dataset, respectively.
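A sketch of the binary rule and the 1/0 to YES/NO rendering; the 0.5 threshold and the scores are illustrative, not taken from the paper:

```python
def classify(score, threshold=0.5):
    """Binary classification rule: assign each patient to one of two categories."""
    return 1 if score >= threshold else 0

def to_label(flag):
    """Render the dataset's 1/0 coding as the result file's YES/NO."""
    return "YES" if flag == 1 else "NO"

scores = [0.91, 0.12, 0.55]                 # hypothetical model scores
flags = [classify(s) for s in scores]       # → [1, 0, 1]
labels = [to_label(f) for f in flags]       # → ["YES", "NO", "YES"]
```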

About dataset
The data was obtained from the Mexico government dataset (see Section 6.3) hosted on the https://www.kaggle.com website. There are 566,603 rows in the dataset, and the columns (age, pregnancy, diabetes, copd, asthma, inmsupr, hypertension, other illnesses, cardiovascular, obesity, chronic renal disease, smoking, interaction with other covid, covid res, icu, and other health problems) include information for both men and women. The dataset thus collects patient data from COVID-19 patients that includes details about each patient's medical history and behaviour.

Processing
The raw dataset (see Section 6.3) for this project was acquired from the Kaggle website, and Python was used to process the data. The Python libraries used include Imblearn (for over-sampling), NumPy, Pandas (for processing CSV files), Matplotlib, Seaborn, and Scikit-learn (for the random forest classifier, metrics computation, and feature selection). Python script execution takes place in a Jupyter notebook. The raw dataset was pre-processed to exclude any null values, consistency issues, outliers, and duplicate columns. Once the data is transformed into integer form, the columns are rearranged to provide the required structure. The pre-processed and adjusted dataset was then made ready for training.

Machine learning
Python is utilized to put the machine learning (ML) model into practice. The independent variables (factors) and the dependent variable (intended outcome), that is, x and y, are separated from the adjusted dataset (see Section 6.3). Next, a training dataset and a test dataset were produced from x and y. The model was then fitted using the independent variables and the dependent variable. Hence, the model is trained.
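The split-and-fit procedure can be sketched with scikit-learn; synthetic data stands in for the adjusted dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# x holds the independent variables, y the dependent (target) variable.
x, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Produce a training dataset and a test dataset, then fit the model on
# the training portion only.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(random_state=0).fit(x_train, y_train)
test_accuracy = model.score(x_test, y_test)
```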

EXPERIMENTAL RESULTS
The performance of the suggested model is evaluated using measures including accuracy (see Equation 6), precision (see Equation 7), recall (see Equation 8), F1 score (see Equation 9), AUC, and ROC. Each measure is described in the paragraphs that follow. The counts involved are: condition positive (P), the number of actually positive examples in the data; condition negative (N), the number of actually negative data points; true positive (TP), a correct test result indicating the presence of a condition or characteristic; true negative (TN), a correct test result indicating the absence of a condition or characteristic; false positive (FP), a test result that mistakenly indicates the existence of a certain condition or quality; and false negative (FN), a test result that erroneously implies the absence of a particular condition or feature.
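All four measures can be computed directly from the confusion-matrix counts; the counts used below are hypothetical:

```python
def metrics(tp, tn, fp, fn):
    """Compute the four measures from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)            # Equation 6
    precision = tp / (tp + fp)                            # Equation 7
    recall = tp / (tp + fn)                               # Equation 8
    f1 = 2 * precision * recall / (precision + recall)    # Equation 9
    return accuracy, precision, recall, f1

# Hypothetical counts for illustration.
acc, prec, rec, f1 = metrics(tp=80, tn=90, fp=10, fn=20)
```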

Accuracy
The accuracy of the two models is compared; the more accurate one is useful and is incorporated into the Python web application.

AUC (area under curve)
The AUC value determines how successfully a model categorizes. Both models' AUC values were higher than 0.6.

Precision
Precision for both classes, 0 and 1, is taken into account. The precision produced by both models is equal.

Recall
For both groups, recall is also taken into account.For class 0, XGBoost performs better than random forest, while for class 1, random forest performs better.

F1 score
The F1-score assessment metric combines precision and recall into a single metric by taking their harmonic mean.

System's working
1. The training dataset is used to develop the machine learning (ML) model at first. Columns for id, sex, pneumonia, age, pregnancy, diabetes, COPD, asthma, hypertension, obesity, chronic kidney illness, and tobacco use are included in the training dataset (see Section 6.3). 2. A prediction is then made using the patient's data. The output and forecast are coded as binary using the numbers 1 and 0; the digits 1 and 0 are replaced with the phrases "YES" and "NO", respectively. 3. The result is created by combining the forecast with the patient's name, email address, and identification number. The four columns in the result file are id, high-risk, name, and email. Once the result file has been created, an email is sent to the patients; the email includes both the result file and a warning letter.
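Step 3 above, building the four-column result file, might look like the following sketch; the names and email addresses are invented:

```python
import csv
import io

# Hypothetical patient records joined with the model's 1/0 predictions.
patients = [
    {"id": 1, "name": "A. Kumar", "email": "a@example.com"},
    {"id": 2, "name": "B. Rao", "email": "b@example.com"},
]
predictions = [1, 0]

# Build the result file, replacing the 1/0 coding with YES/NO.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["id", "name", "email", "high-risk"])
writer.writeheader()
for patient, pred in zip(patients, predictions):
    row = dict(patient, **{"high-risk": "YES" if pred == 1 else "NO"})
    writer.writerow(row)

result_csv = out.getvalue()
```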

Python web application
1. Python is the foundation of the web application.
2. The web application was created with the Flask application framework for Python. Flask is a Python-based web development framework developed by Armin Ronacher, who also founded the Pocco Python community. Flask is built on the Werkzeug WSGI toolkit and the Jinja2 template engine. 3. For this web application, Gunicorn is chosen as the application server. Gunicorn is the "Green Unicorn" Python WSGI HTTP server for UNIX; WSGI, or Web Server Gateway Interface, is the standard interface for web applications written in Python. Gunicorn and other WSGI servers are necessary for the operation of every respected web framework, including Flask and Django. 4. JavaScript and the Bootstrap CSS framework are both used in the web application. Bootstrap is a well-liked solution for developing mobile-first, responsive websites; it may be downloaded and used without cost and was created by Mark Otto and Jacob Thornton at Twitter. 5. The email's body was created using an internet tool and a personalized email template.
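A minimal Flask sketch of such a backend; the `/predict` route name and the fixed age rule are illustrative stand-ins for the real trained model:

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    # The real app runs the trained ML model here; a fixed age rule
    # stands in so the sketch is self-contained.
    data = request.get_json()
    high_risk = 1 if data.get("age", 0) >= 60 else 0
    return jsonify({"high_risk": high_risk})

# In production this WSGI app is served by Gunicorn, e.g.:
#   gunicorn app:app
```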
The machine learning (ML) model was trained using the data from 104,606 patients. Here a method is provided for predicting a patient's COVID-19 illness severity. Hospital resource management may be improved with the use of a severity prediction machine learning (ML) model, according to healthcare professionals; critical patients' lives may depend on the distribution of resources to them.2 A description of the project's outcomes follows.
On the pre-processed data, exploratory data analysis (EDA) is first carried out, looking at how often patients had different symptoms. Patients who smoke (19%), have obesity (16%), or have chronic renal illness (16%) make up the largest groups; almost half of all patients fall into one of these three categories. Cardiovascular illness, hypertension, asthma, COPD, diabetes, and pneumonia patients together account for over 18% of all patients. Moreover, 34% of individuals have other illnesses.
The majority of people in the population are between the ages of 40 and 80, according to an analysis of the population's age distribution. So, it might be concluded that patients between the ages of 40 and 60 have a high likelihood of developing severe COVID-19 soon. This is in line with the first dataset (see Section 6.3), which revealed that those who were 30 or older had a higher risk of developing a severe COVID-19 infection. These people account for a sizeable portion of the working population in the country. The dataset has more males than females, per the analysis of gender distribution. Even so, gender is not an especially important factor for deciding whether a patient is at high risk. The dataset had an equal number of males and females before data pre-processing; several records from the original dataset were eliminated during pre-processing (particularly by deleting null values), which increased the gap between the number of males and females. There are around 42,000 females and 63,000 males in the pre-processed dataset (see Section 6.3). 71% of patients do not have high risk factors, compared to 29% of patients who do. There are notably fewer patients with high risk than without, which creates the imbalanced classification problem. As a result, the model is trained using the SMOTE technique (Synthetic Minority Over-sampling Technique); SMOTE was applied only to the random forest model. Imbalanced classification is addressed with the SMOTE approach: SMOTE is an oversampling approach that produces synthetic examples of the minority class from the minority-class records in the original dataset. Here, the minority class for which synthetic samples are produced is made up of high-risk patients. Because it generates synthetic samples rather than exact replicas of the original minority class samples, this approach avoids the over-fitting issue of random oversampling.
The variables (factors) that make up the training dataset (see Section 6.3) are shown in Figure 4. The number 1 in the gender column indicates females, whereas the number 2 indicates males. "Yes" is denoted by a 1 in all columns save the age group, whereas "No" is denoted by a 0 in those fields. Children (0-12 years), adolescents (13-18 years), adults (19-59 years), and senior adults (60 years and above) are the four age categories into which patients are separated. Classifier algorithms are more likely to select characteristics with more unique values (high-cardinality features); they prioritize these high-cardinality qualities more, which is undesirable, and most classification algorithms have this behavior as their default setting. To counter it, the age group feature replaces the age feature. For convenience of usage, all the data is kept in integer form. Information regarding the patients' current health is shown in Figure 5. This input file is used to train the machine learning (ML) model to identify patients at high risk.
The best features are obtained via feature selection with a threshold value of 0.04 when using random forest. With this threshold, the top six characteristics selected were age, chronic disease, intubation, pneumonia, hypertension, and diabetes. The model is then trained using these selected attributes. Figure 6 shows the feature importance of the XGBoost model, obtained by applying the feature-importance plot technique to the model.
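Thresholding feature importances at 0.04 can be sketched as below. The importance scores shown are hypothetical placeholders, not the paper's actual values; in practice they would come from a fitted random forest's `feature_importances_` attribute.

```python
# Hypothetical importance scores (placeholders, not the paper's values);
# in practice: dict(zip(feature_names, rf_model.feature_importances_))
importances = {
    "age_group": 0.22, "chronic_disease": 0.12, "intubation": 0.11,
    "pneumonia": 0.09, "hypertension": 0.06, "diabetes": 0.05,
    "asthma": 0.03, "smoking": 0.02, "pregnancy": 0.01,
}

THRESHOLD = 0.04  # keep only features at or above this importance
selected = [f for f, imp in importances.items() if imp >= THRESHOLD]
print(selected)
# ['age_group', 'chronic_disease', 'intubation', 'pneumonia',
#  'hypertension', 'diabetes']
```

The model is then retrained on only the `selected` columns of the training data.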
The XGBoost model's confusion matrix is displayed in Figure 7. The numbers of correct and incorrect predictions are quantified using counts or percentages.
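For the binary high-risk label, the four confusion-matrix counts can be computed directly; this is a minimal sketch with hypothetical labels, equivalent to what scikit-learn's `confusion_matrix` would produce.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 = high-risk."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1          # high-risk correctly flagged
        elif t == 1 and p == 0:
            fn += 1          # high-risk missed
        elif t == 0 and p == 1:
            fp += 1          # low-risk wrongly flagged
        else:
            tn += 1          # low-risk correctly passed
    return tn, fp, fn, tp

# Hypothetical labels for illustration
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]
print(confusion_counts(y_true, y_pred))  # (3, 1, 1, 3)
```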

FIGURE 7 XGBoost confusion matrix.
Table 1 depicts the model's classification report, and Figure 8 shows the model's output. For convenience, 1 is replaced with YES and 0 with NO, and the patient's name, email address, and ID are then combined with this output. The output dataset (see section 6.3) contains a target-output column named "High-risk", which takes the values 1 and 0. If High-risk is 1, the patient will soon be at high risk from COVID-19; otherwise, the risk to the patient is low. Figure 12 displays the accuracy-versus-threshold graph for the Random Forest model. Accuracy was higher for threshold values from 0.06 to 0.16; among these, recall improved at thresholds between 0.13 and 0.16, so any threshold value in that range is suitable. For XGBoost, the features are chosen internally and automatically.
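The accuracy-versus-threshold analysis for the random forest can be reproduced by sweeping a decision threshold over predicted class probabilities; the sketch below uses hypothetical probabilities in place of a fitted model's `predict_proba` output.

```python
def sweep_thresholds(y_true, proba, thresholds):
    """Accuracy and recall at each decision threshold:
    predict high-risk (1) when proba >= t."""
    rows = []
    for t in thresholds:
        pred = [1 if p >= t else 0 for p in proba]
        correct = sum(yt == yp for yt, yp in zip(y_true, pred))
        tp = sum(yt == 1 and yp == 1 for yt, yp in zip(y_true, pred))
        pos = sum(y_true)
        rows.append((t, correct / len(y_true), tp / pos if pos else 0.0))
    return rows

# Hypothetical labels and predicted probabilities for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
proba = [0.9, 0.2, 0.75, 0.6, 0.3, 0.15, 0.55, 0.45]
for t, acc, rec in sweep_thresholds(y_true, proba, [0.06, 0.13, 0.16, 0.5]):
    print(f"threshold={t:.2f} accuracy={acc:.2f} recall={rec:.2f}")
```

Plotting accuracy (and recall) from these rows against the threshold yields a graph of the kind shown in Figure 12, from which a suitable operating threshold can be read off.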

FIGURE 13 Health report of the patient.
For hospital administrators, using the web application works as follows. The web tool makes it easy to upload patient data; once a file is uploaded, the machine learning model runs in the background and creates a CSV file as output, which administrators can retrieve via the download option. When the email option is selected, patients receive an email built from a custom HTML template. The email contains a warning message, a link to Google Maps listing nearby hospitals, and an attachment with the health report. Figure 13 displays the patient's health report, which lists the factors contributing to the patient's high-risk status.
Figure 14 shows the email that will be delivered to high-risk patients. Emails are sent with the Python Flask-Mail module, which uses the Google SMTP server. A distinctive email format was developed with an online tool using HTML and CSS. The patient's name, along with the health report and result file, is dynamically inserted into the email before sending. The same send-email button may also be used to send bulk emails when there are several patients.

FIGURE 14 The email sent to high-risk patients.
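The paper sends mail through Flask-Mail and the Google SMTP server; as a library-neutral sketch, the snippet below assembles a comparable message (plain-text and HTML bodies plus a health-report attachment) with the Python standard library. All addresses, subject text, and filenames here are illustrative placeholders.

```python
from email.message import EmailMessage

def build_alert_email(patient_name, to_addr, report_bytes):
    """Assemble a high-risk alert email with the health report attached.
    Addresses, subject, and filename are hypothetical placeholders."""
    msg = EmailMessage()
    msg["Subject"] = "COVID-19 High-Risk Alert"
    msg["From"] = "hospital@example.com"
    msg["To"] = to_addr
    # Plain-text fallback, then an HTML alternative (as in the paper's
    # HTML/CSS template), then the report as an attachment.
    msg.set_content(f"Dear {patient_name}, please seek medical advice promptly.")
    msg.add_alternative(
        f"<html><body><p>Dear <b>{patient_name}</b>, "
        "you have been classified as high-risk.</p></body></html>",
        subtype="html",
    )
    msg.add_attachment(report_bytes, maintype="application",
                       subtype="pdf", filename="health_report.pdf")
    return msg

msg = build_alert_email("A. Patient", "patient@example.com", b"%PDF-1.4 ...")
print(msg["Subject"])  # COVID-19 High-Risk Alert
```

Sending the assembled message would then be a matter of handing it to an SMTP client (Flask-Mail in the paper's setup); looping over the patient list gives the bulk-email behaviour described above.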

DISCUSSION AND CONCLUSION
Much recent research uses machine learning methods to predict high-risk COVID-19-positive individuals. Most of those studies were able to use patient data from hospital medical records, whereas the raw dataset used here was obtained from the Kaggle website. The findings diverge from those of the earlier research 1,2,5 in terms of methodology: this work uses Random Forest and XGBoost for classification, while earlier studies also employed a number of other algorithms, including Support Vector Machine, Auto-encoder, Artificial Neural Network, Logistic Regression, and Regression Tree, to determine which approach worked best for their dataset (see section 6.3). Figure 6, from the XGBoost model, shows the effects of sex, diabetes, and cardiovascular disease.
A machine learning prototype is offered to assess the severity of COVID-19 in people who test positive. The random forest model showed a precision of 71%, while the XGBoost model achieved an accuracy of 98%; as a consequence, the web application classifies patients using the XGBoost model. The model predicts only the severity of a COVID-19 diagnosis, yet it can aid medical professionals in deciding whether to admit a patient who tests positive for the virus. A coronavirus infection may significantly affect a person's quality of life. Future work could create a model that forecasts the disease's course alongside its current severity. This would motivate individuals to seek medical help as soon as possible, preventing the illness from worsening the affected person's condition. Many patients need not be admitted to the intensive care unit (ICU) if they can receive care elsewhere.
The presented work gauges a patient's severity based on the symptoms reflected in their reports. If patients are not correctly diagnosed in person at well-equipped medical facilities, lives may be lost. At present, this model forecasts a patient's outcome based only on their symptoms, but in the future it may also consider a variety of other aspects and characteristics. When this information is combined with the spectrum of therapeutic resources currently available, it becomes possible to plan effectively for the total amount of medical resources required.
This project may eventually be merged with other healthcare applications, and the research can be extended to incorporate COVID-19 wave prediction. Many machine learning techniques may be employed and assessed to predict high-risk patients.

FIGURE 2 Development of the random forest algorithm.
FIGURE 3 Development of the XGBoost algorithm.

FIGURE 4 Training dataset.

FIGURE 5 Patient's health data: input file for the model.

FIGURE 6 XGBoost feature importance.

Figures 9, 10, and 11 show comparison bar charts for the Random Forest and XGBoost models. Dark brown represents Random Forest, and light brown represents XGBoost. Class 0 refers to patients who are not at high risk, and class 1 to those who are.

FIGURE 9 Random forest versus XGBoost performance for class 0.
FIGURE 10 Random forest versus XGBoost performance for class 1.

FIGURE 11 Random forest versus XGBoost performance: accuracy and ROC AUC.
FIGURE 12 Random forest: accuracy versus threshold.
The input file (the patient's data) is then used to create the forecast. The model is assessed using ROC (Receiver Operating Characteristic), AUC (Area Under the Curve), accuracy, recall, precision, and other metrics. Among the 15 factors used to classify patients are sex, age, COPD, asthma, hypertension, diabetes, pneumonia, intubation, pregnancy, COVID result, cardiovascular disease, chronic renal disease, tobacco usage, and other disorders. Feature selection using a threshold on feature importance is applied in the Random Forest model, but not in the XGBoost model.
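The accuracy, precision, and recall metrics named above follow directly from the confusion-matrix counts; the sketch below uses hypothetical counts for illustration.

```python
def metrics(tp, fp, fn, tn):
    """Precision, recall, accuracy, and F1 from binary confusion counts
    (tp = high-risk correctly flagged, fn = high-risk missed, etc.)."""
    precision = tp / (tp + fp)            # flagged patients truly high-risk
    recall = tp / (tp + fn)               # high-risk patients caught
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, accuracy, f1

# Hypothetical confusion counts for illustration
p, r, a, f = metrics(tp=80, fp=20, fn=20, tn=80)
print(round(p, 2), round(r, 2), round(a, 2), round(f, 2))  # 0.8 0.8 0.8 0.8
```

For triage, recall on the high-risk class is the metric to watch: a missed high-risk patient (false negative) is costlier than a false alarm.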
TABLE 1 Classification report of the model.
FIGURE 8 XGBoost output file.