Construction and validation of a machine learning‐based nomogram: A tool to predict the risk of getting severe coronavirus disease 2019 (COVID‐19)

Abstract

Background: Identifying patients who may develop severe coronavirus disease 2019 (COVID‐19) will facilitate personalized treatment and optimize the distribution of medical resources.

Methods: In this study, 590 hospitalized COVID‐19 patients were enrolled (training set: n = 285; internal validation set: n = 127; prospective set: n = 178). After filtering with two machine learning methods in the training set, 5 of 31 clinical features were selected for model building to predict the risk of developing severe COVID‐19. Multivariate logistic regression was applied to build the prediction nomogram, which was validated in two different sets. Receiver operating characteristic (ROC) analysis and decision curve analysis (DCA) were used to evaluate its performance.

Results: From 31 potential predictors in the training set, 5 independent predictive factors were identified and included in the risk score: C‐reactive protein (CRP), lactate dehydrogenase (LDH), Age, Charlson/Deyo comorbidity score (CDCS), and erythrocyte sedimentation rate (ESR). We then generated a nomogram based on these features for predicting severe COVID‐19. In the training cohort, the area under the curve (AUC) was 0.822 (95% CI, 0.765–0.875), and in the internal validation cohort it was 0.762 (95% CI, 0.768–0.844). We further validated the nomogram in a prospective cohort, with an AUC of 0.705 (95% CI, 0.627–0.778). The internally bootstrapped calibration curve showed favorable consistency between the nomogram's predictions and the observed outcomes, and DCA indicated a high clinical net benefit.

Conclusion: Our prediction model, based on five clinical characteristics of COVID‐19 patients, can help clinicians predict the potential risk of developing critical illness and thus optimize medical management.


| INTRODUCTION
Coronavirus disease 2019 (COVID-19) emerged in early December 2019 and has since spread worldwide. As of May 5, 2020, the cumulative number of cases had surpassed 3,500,000, with over 240,000 deaths worldwide, and the global epidemic situation remains very serious. The coronavirus elicits distinct immune responses and has immune-escape features, which drive critical pathogenic processes during inflammation and can subsequently lead to lung infection, edema, acute respiratory distress syndrome (ARDS), multiple organ dysfunction, and even death. 1 Therefore, rapid and accurate prediction of the course of COVID-19 pneumonia can help guide effective treatment.
As research progresses, more information about COVID-19 pneumonia is being revealed. Lauer et al. 2 reported that, under conservative assumptions, the estimated median incubation period of COVID-19 is 5.1 days, and that in only 101 of every 10,000 cases would symptoms not develop until after 14 days of active monitoring or isolation. As reported by Huang et al., patients with COVID-19 present primarily with fever, dry cough, and myalgia or fatigue. Although the prognosis of most patients is thought to be favorable, older people and those with weakened immunity may have worse outcomes, including death. 3 Over the course of infection, serial evaluation of lymphocyte count dynamics and inflammatory markers, including lactate dehydrogenase (LDH), C-reactive protein (CRP), and interleukin-6 (IL-6), may help to identify cases with poor prognosis and prompt medical intervention to improve outcomes. 4 Zhou et al. 5 reported that increased odds of in-hospital mortality were associated with older age (odds ratio 1.10; 95% CI, 1.03-1.17, per year increase; p = .0043), making age a potential risk factor. Another study demonstrated that older age, preexisting hypertension, raised cytokine levels (IL-2R, IL-6, IL-10, and TNF-α), and high LDH levels were significantly associated with severe COVID-19 on admission. 6 The week after disease onset is a critical period, during which patients with severe illness may develop dyspnea and hypoxemia that can quickly progress to ARDS or end-organ failure. 7 Therefore, identifying high-risk patients whose disease is likely to progress is of great importance both for delivering personalized medical care and for optimizing the distribution of medical resources at the macro level.
Gong et al. 8 provided a nomogram to help clinicians identify early the patients who will progress to severe COVID-19, but they did not consider clinical factors such as underlying comorbidities, a widely acknowledged risk factor. Ji et al. 9 used the CALL score model (Comorbidity-Age-Lymphocyte count-Lactate dehydrogenase) to estimate the risk of progression in COVID-19 patients, but the sample size was limited, which may make the results unstable; for example, the hazard ratio for LDH > 500 was 9.8 (2.8-33.8).
Because COVID-19 is a systemic disease, multiple indicators must be taken into account. Comorbidities, age, biochemical indicators (LDH, CRP, and blood urea nitrogen [BUN]), and blood indicators are all potential influencing factors. In this study, we selected 5 effective indicators from 31 candidates and used machine learning methods to establish a 5-feature nomogram. More importantly, to confirm its predictive accuracy, we further verified the nomogram in a prospective cohort. This tool could help clinicians predict the progression of COVID-19 and provide better centralized management.

| Study population
We retrospectively collected data on 412 patients who were centrally treated at Jinyintan Hospital, Wuhan, from January 1 to February 6, 2020, and diagnosed with Common or Severe type COVID-19. For additional validation, 178 patients were prospectively recruited from February 6 to March 10, 2020. This study was approved by the Ethics Review Committee of Wuhan Jinyintan Hospital and Shanghai General Hospital.

| Diagnostic criteria
According to the "New Coronavirus Pneumonia Diagnosis and Treatment Program (Trial Version 6)" promulgated by the General Office of the National Health

| Study design and data processing
Because most patients with Light type COVID-19 do not need medical support or hospitalization, and Critical type patients were few in the hospital, we analyzed only the Common and Severe types, which occupied most of the medical system's capacity and equipment.
The research design comprised three phases to identify and validate a clinical signature-based nomogram that predicts whether a Common type COVID-19 patient will progress to the Severe type. The study flowchart is shown in Figure 1. Initially, we collected For continuous variables, Maximally Selected Rank Statistics (MSRS) was used to generate the optimal cutoff value, and all variables were transformed into dichotomous data. 11 Following this initial filtration, the widely used Least Absolute Shrinkage and Selection Operator (LASSO) method, 12 with 10-fold cross-validation, was applied to select suitable features. In parallel, we used Support Vector Machine-Recursive Feature Elimination (SVM-RFE) for feature selection. 13 Finally, we took the intersection of the clinical features selected by the LASSO and SVM-RFE methods and used multivariate logistic regression (LG) and random forest (RF) to test the predictive power of the model. Because it outperformed random forest, multivariate logistic regression was used to build the predictive nomogram in 285 patients, which was internally validated in 127 patients. In the independent validation phase, the candidate features were validated in a prospective cohort (n = 178) (Figure 1).
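The dichotomization and intersection steps described above can be sketched in a few lines. The study itself used R; the Python below is only an illustration. As a loudly simplified stand-in for MSRS, it picks the cutoff that maximizes a plain chi-square statistic and omits the multiple-testing adjustment that MSRS applies across candidate cutoffs. The toy LDH values, the outcome labels, and the two extra features ("BUN", "Lymphocyte") in the selected sets are hypothetical; only the five shared features come from the text.

```python
from typing import Optional, Sequence


def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def best_cutoff(values: Sequence[float], severe: Sequence[int]) -> float:
    """Return the cutoff c (rule: value > c) with the largest chi-square statistic.

    Simplified stand-in for maximally selected rank statistics: every observed
    value except the largest is tried as a candidate cutoff.
    """
    best_c: Optional[float] = None
    best_stat = -1.0
    for c in sorted(set(values))[:-1]:  # the largest value would empty the "high" group
        hi_sev = sum(1 for v, y in zip(values, severe) if v > c and y == 1)
        hi_non = sum(1 for v, y in zip(values, severe) if v > c and y == 0)
        lo_sev = sum(1 for v, y in zip(values, severe) if v <= c and y == 1)
        lo_non = sum(1 for v, y in zip(values, severe) if v <= c and y == 0)
        stat = chi_square_2x2(hi_sev, hi_non, lo_sev, lo_non)
        if stat > best_stat:
            best_c, best_stat = c, stat
    if best_c is None:
        raise ValueError("need at least two distinct values")
    return best_c


# Hypothetical toy data: LDH values and whether each patient became severe (1) or not (0).
ldh_values = [180, 200, 220, 240, 300, 320, 350, 400]
severe_flags = [0, 0, 0, 0, 1, 1, 1, 1]

cutoff = best_cutoff(ldh_values, severe_flags)
ldh_high = [int(v > cutoff) for v in ldh_values]  # dichotomized predictor

# Hypothetical outputs of the two selectors; the final signature is their intersection.
lasso_selected = {"CRP", "LDH", "Age", "CDCS", "ESR", "BUN"}
svm_rfe_selected = {"CRP", "LDH", "Age", "CDCS", "ESR", "Lymphocyte"}
shared_features = sorted(lasso_selected & svm_rfe_selected)
```

Taking the intersection rather than the union keeps only features that both selectors agree on, which trades some sensitivity for a smaller and more stable signature.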

| Statistical analysis
To run the 10-fold cross-validated LASSO and the 5-fold cross-validated SVM-RFE algorithms, we used the R packages glmnet and e1071, respectively. The random forest algorithm was implemented with the R package randomForest, and the number of decision trees was set to 1000. To assess the nomogram's performance in the validation data, we used the concordance index and calibration plots, generated with the R package rms. To support its clinical application, the predictive performance of the nomogram was evaluated by receiver operating characteristic (ROC) analysis and the area under the curve (AUC). Decision curve analysis (DCA) was performed to plot net benefit (NB) and the cost-benefit ratio and to assess the utility of the models for decision making. 14 In risk assessment tools, the predicted probability that a patient is diagnosed with a given disease is denoted P_i; when P_i reaches a certain threshold (denoted P_t), the patient is classified as positive, which in our study means Severe COVID-19. The formula is as follows: The Hosmer-Lemeshow goodness-of-fit test was used to validate model fit. 15 Statistical analyses were conducted in R version 3.6.1, and a two-sided p < .05 was considered statistically significant.
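For the net benefit used in DCA, the standard decision-curve definition, assumed here to be the form the text refers to, is NB = TP/n - (FP/n) × P_t/(1 - P_t), where TP and FP are the numbers of true and false positives when patients with P_i at or above P_t are classified as positive, and n is the total number of patients. A minimal Python sketch follows (the study used R); the function names and the toy predictions are hypothetical.

```python
from typing import Sequence


def net_benefit(probs: Sequence[float], severe: Sequence[int], threshold: float) -> float:
    """Net benefit of treating patients whose predicted probability is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(probs)
    tp = sum(1 for p, y in zip(probs, severe) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, severe) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(severe: Sequence[int], threshold: float) -> float:
    """Reference strategy: classify every patient as positive."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(severe) / len(severe)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


# Hypothetical predicted probabilities and observed outcomes (1 = Severe COVID-19).
toy_probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_severe = [1, 1, 0, 1, 0, 0]

nb_model = net_benefit(toy_probs, toy_severe, threshold=0.5)
nb_all = net_benefit_treat_all(toy_severe, threshold=0.5)
```

A decision curve is obtained by evaluating these functions over a range of thresholds; a model is useful at a given threshold when its net benefit exceeds both the treat-all value and zero (the treat-none strategy).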

| Patient characteristics
In total, 590 patients with confirmed COVID-19 were recruited from February 1st through March 10th. Two hundred twenty-six (38.5%) of them developed  Table 1.

| Building a predictive signature
To develop a clinically applicable tool for predicting the probability that a COVID-19 patient will develop severe disease, we constructed a nomogram that incorporates the clinical covariates (Figure 4A). The predictors were CRP, LDH, Age, CDCS, and ESR, and the risk score of each covariate produced by the LG model is listed (Figure 4B). Taking one subject as an example (blue track in Figure 4A), the scores for the chosen features add up to 197, and hence the probability of progressing to severe COVID-19 is 0.495. As a further example, consider two patients with clearly different risk scores. The low-risk patient scored 74 points, corresponding to a probability of progression of 0.12, and the CT scan showed no worsening of pneumonia after a 10-day hospitalization. The high-risk patient scored 361 points, corresponding to a probability of 0.89, and developed severe lung lesions within 10 days (Figure 7A,B), followed by permanent lung damage.
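The way a nomogram turns a logistic regression model into points can be illustrated with a short Python sketch (the study used R). It assumes the common convention in which the predictor with the largest coefficient is worth 100 points and the others are scaled proportionally. The coefficients and intercept below are invented for illustration and are not the fitted values from this study, so the output will not reproduce the 197-point or 361-point examples above.

```python
import math
from typing import Dict

# Hypothetical log-odds coefficients for the five dichotomized predictors (1 = above cutoff).
HYPOTHETICAL_COEFS: Dict[str, float] = {
    "CRP": 0.9, "LDH": 1.2, "Age": 0.8, "CDCS": 0.6, "ESR": 0.7,
}
HYPOTHETICAL_INTERCEPT = -2.5


def feature_points(coefs: Dict[str, float]) -> Dict[str, float]:
    """Points per predictor: the largest coefficient maps to 100 points."""
    if min(coefs.values()) <= 0:
        raise ValueError("this sketch assumes all coefficients are positive")
    max_beta = max(coefs.values())
    return {name: 100.0 * beta / max_beta for name, beta in coefs.items()}


def total_points(patient: Dict[str, int], coefs: Dict[str, float]) -> float:
    """Sum the points of every predictor that is present (coded 1) for the patient."""
    points = feature_points(coefs)
    return sum(points[name] * patient[name] for name in coefs)


def probability_from_points(points: float, coefs: Dict[str, float], intercept: float) -> float:
    """Map total points back to the linear predictor, then apply the logistic function."""
    max_beta = max(coefs.values())
    linear_predictor = intercept + points * max_beta / 100.0
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# Hypothetical patient with elevated CRP and LDH only.
patient = {"CRP": 1, "LDH": 1, "Age": 0, "CDCS": 0, "ESR": 0}
pts = total_points(patient, HYPOTHETICAL_COEFS)
risk = probability_from_points(pts, HYPOTHETICAL_COEFS, HYPOTHETICAL_INTERCEPT)
```

Because total points are a linear rescaling of the model's linear predictor, reading the probability off the points axis is equivalent to evaluating the logistic regression directly.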

| DISCUSSION
Since COVID-19 broke out in Hubei, and even with the optimal control achieved in China, the cumulative number of confirmed cases worldwide has surpassed three million, and the threat of the coronavirus persists. Among patients with COVID-19, CRP was increased in 86.22% and ESR in 90.22%. 16 To date, a growing number of independent risk factors have been identified, and several systemic scoring systems have been built to assess disease progression and prevent severe outcomes. Ji et al. 9 reported a novel scoring model, named CALL, for predicting disease course, which included comorbidity, age, lymphocyte count, and LDH, with the AUC reaching 0.91 (95% CI, 0.86-0.94). However, this model has been validated only once, and its robustness requires further confirmation, especially because the sample sizes in some subgroups, such as LDH, are fairly small. Another model for early prediction of severe COVID-19 showed that older age and higher LDH, CRP, RDW, DBIL, and BUN and lower ALB on admission correlated with higher odds of severe COVID-19, with an AUC of 0.912 (95% CI, 0.846-0.978) in the training set and 0.853 (95% CI, 0.790-0.916) in the validation set. 8 Apart from the relatively limited sample size, the large number of continuous variables included can make real-world implementation inconvenient.
To help control this major health incident and improve the allocation of medical resources, we extracted the clinical data of 590 cases from Wuhan Jinyintan Hospital and established a prediction model that included ESR, CDCS, Age, LDH, and CRP. Notably, CDCS, the Charlson/Deyo comorbidity score, is a system for classifying patients' comorbidities according to the International Classification of Diseases (ICD) diagnosis codes found in administrative data, such as hospital abstract data. Each comorbidity category has an associated weight (from 1 to 6), based on the adjusted risk of mortality or resource use, and the weights are summed into a single comorbidity score for each patient. A score of zero means that no comorbidities were found. The higher the score, the more likely the patient is to have a poor outcome. These scoring systems have been reported to be associated with overall survival in various types of cancer and with mortality in other conditions, such as ischemic stroke, acute cholecystitis, acute hip fracture, and so forth. [17][18][19][20][21] Given the efficacy of CDCS, we took the novel step of incorporating this scoring system into our prediction model, in accordance with the TRIPOD Statement, 22 which could also help explain the high mortality of COVID-19 patients with multiple comorbidities. The five significant indices were those selected by both the LASSO and SVM analyses, machine learning methods for classification and regression that enhance the prediction accuracy and interpretability of the resulting statistical model. ROC, DCA, and calibration analyses were then performed to assess performance, and triple verification was applied. The predictive nomogram indicated that the probability of progression from the Common type to the Severe type reaches about 50% when the total points reach 197.
The AUCs of the training set, the internal validation set, and the prospective validation set were 0.822, 0.762, and 0.705, respectively. However, some limitations should be addressed in future investigations. The AUC values are lower than 0.9, and more cases should be recruited to optimize the prediction model for more precise forecasting. In addition, the patient data were derived from Wuhan, Hubei Province, so the situation outside Hubei Province could differ, and multicenter analysis is urgently needed.
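The additive structure of the CDCS described in this section can be illustrated with a short Python sketch. The weights below are the commonly cited Charlson weights for the 17 Deyo categories and should be verified against the published coding tables before any real use. The category keys are labels chosen for this sketch, the mapping from ICD codes to categories is not shown, and the cutoff used in this study to dichotomize the score is not reproduced here.

```python
from typing import Iterable

# Commonly cited Charlson weights for the 17 Deyo categories (verify before real use).
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "cerebrovascular_disease": 1,
    "dementia": 1,
    "chronic_pulmonary_disease": 1,
    "rheumatologic_disease": 1,
    "peptic_ulcer_disease": 1,
    "mild_liver_disease": 1,
    "diabetes": 1,
    "diabetes_with_complications": 2,
    "hemiplegia_or_paraplegia": 2,
    "renal_disease": 2,
    "any_malignancy": 2,
    "moderate_or_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
    "aids": 6,
}


def comorbidity_score(conditions: Iterable[str]) -> int:
    """Sum the weights of a patient's comorbidity categories; 0 means none recorded.

    Each category is counted once. This sketch does not apply hierarchy rules
    (for example, counting only the more severe form of liver disease).
    """
    unique = set(conditions)
    unknown = unique - CHARLSON_WEIGHTS.keys()
    if unknown:
        raise KeyError(f"unknown comorbidity categories: {sorted(unknown)}")
    return sum(CHARLSON_WEIGHTS[c] for c in unique)


# Hypothetical patients.
score_none = comorbidity_score([])
score_mild = comorbidity_score(["diabetes", "chronic_pulmonary_disease"])
score_high = comorbidity_score(["metastatic_solid_tumor", "congestive_heart_failure"])
```

A single integer score of this kind is convenient for a nomogram because it condenses many comorbidity indicators into one predictor.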

| CONCLUSION
Through filtering by two machine learning methods, LASSO and SVM-RFE, five independent predictive features for severe COVID-19 were selected: CRP, LDH, Age, CDCS, and ESR. Based on these features, we built a prediction tool that can predict severe COVID-19 early and aid medical decision making for COVID-19 patients.