Early predictors of severe COVID‐19 among hospitalized patients

Abstract
Background: Limited research has been conducted on early laboratory biomarkers to identify patients with severe coronavirus disease (COVID‐19). This study fills this gap to ensure appropriate treatment delivery and optimal resource utilization.
Methods: In this retrospective, multicentre, cohort study, 52 and 64 participants with severe and mild cases of COVID‐19, respectively, were enrolled during January‐March 2020. Least absolute shrinkage and selection operator and binary forward stepwise logistic regression were used to construct a predictive risk score. A prediction model was then developed and verified using data from four hospitals.
Results: Of the 50 variables assessed, eight were independent predictors of severe COVID‐19 and were used to calculate risk scores: age (odds ratio [OR] = 14.01, 95% confidence interval [CI] 2.1–22.7), number of comorbidities (OR = 7.8, 95% CI 1.4–15.5), abnormal bilateral chest computed tomography images (OR = 8.5, 95% CI 4.5–10), neutrophil count (OR = 10.1, 95% CI 1.88–21.1), lactate dehydrogenase (OR = 4.6, 95% CI 1.2–19.2), C‐reactive protein (OR = 16.7, 95% CI 2.9–18.9), haemoglobin (OR = 16.8, 95% CI 2.4–19.1) and D‐dimer levels (OR = 5.2, 95% CI 1.2–23.1). The model was effective, with an area under the receiver‐operating characteristic curve of 0.944 (95% CI 0.89–0.99, p < 0.001) in the derivation cohort and 0.8152 (95% CI 0.803–0.97; p < 0.001) in the validation cohort.
Conclusion: Predictors based on the characteristics of patients with COVID‐19 at hospital admission may help predict the risk of subsequent critical illness.


| Research overview
The derivation cohort consisted of patients who visited Henan Provincial People's Hospital (Henan) from January to March 2020. This is a tertiary teaching hospital in central China, with 5,000 hospital beds, including 300 intensive care unit (ICU) beds. The annual volume of infectious patients admitted to this hospital is approxi-

| Data collection and clinical assessment
COVID-19 diagnoses were confirmed using positive real-time reverse-transcription polymerase chain reaction assays of nasal and pharyngeal swab specimens, sputum or stool samples, or specific immunoglobulin (Ig) M or IgG antibody testing of the serum. 7 A team of experienced infectious disease clinicians reviewed and cross-checked the data. Two clinicians independently checked each record. The criterion for variable selection was in accordance with previous studies and related specifically to patients with severe disease. 8,9 We included all patients with available data on clinical status during hospitalization (laboratory findings, clinical symptoms and signs, illness severity and discharge status). On assessment, the data from Henan Provincial People's Hospital were more complete than those from the other four hospitals; therefore, they were used for the derivation cohort, while the data from the other four hospitals were used for the validation cohort.

| Outcome definitions
We defined the severity of COVID-19 (severe v. mild) based on the Diagnostic and Treatment Guidelines for COVID-19 issued by the Chinese National Health Committee (Version 7). 7 to the hospital, 20 laboratory parameters were collected, including routine indexes of blood examination, such as lymphocyte, platelet and neutrophil counts, as well as haemoglobin and C-reactive protein levels. Inflammatory cytokines, including procalcitonin, interleukin (IL)-6, IL-10, serum ferritin protein, coagulation function indicators (including D-dimer and fibrinogen levels) and liver function indicators (such as lactate dehydrogenase levels), were also included.

| Potential predictive variables
Further, we collected data on immune function indicators, including B-lymphocyte count, T-lymphocyte count, natural killer cell count and IgM and IgG antibody titres.

| Variable selection and model construction
In the derivation cohort, 65 patients (Table 1) were included in the variable selection and risk score development. As described herein, 50 variables were entered in the selection process. Least absolute shrinkage and selection operator (LASSO) regression was applied to minimize the potential collinearity of variables measured from the same patient and over-fitting of variables. Imputation was considered for variables with less than 20% missing values. We used predictive mean matching to impute numeric features, logistic regression to impute binary variables and Bayesian polytomous regression to impute factor features. We used penalized LASSO regression for multivariable analyses, with 10-fold cross-validation for internal validation. LASSO here is a logistic regression model that penalizes the absolute size of the regression coefficients according to the value of λ. With larger penalties, the estimates of weaker factors shrink towards zero, so that only the strongest predictors remain in the model. The most predictive covariates were selected at the λ value that minimized the cross-validation error (λ min). The LASSO regression was performed with the R package 'glmnet' (R version 4.0.3). Subsequently, variables identified by the LASSO regression analysis were entered into binary logistic regression models, and those that were consistently statistically significant were used as early predictors and in the construction of the risk score, which was then used to construct a risk score card calculator. Two-sided p-values of < 0.05 were considered statistically significant.
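The shrinkage behaviour described above can be illustrated with a minimal sketch. It is not the authors' glmnet analysis: it assumes a plain proximal-gradient solver written in standard-library Python, synthetic data with one informative and one uninformative variable, and two arbitrary penalty values. The function names and all numbers are hypothetical, and the cross-validated choice of λ is not shown.

```python
import math
import random


def soft_threshold(value, threshold):
    """Shrink a value towards zero by `threshold` (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, lam, step=0.5, n_iter=500):
    """Minimise mean logistic loss + lam * sum(|beta_j|); the intercept is not penalised."""
    n, p = len(X), len(X[0])
    beta0, beta = 0.0, [0.0] * p
    for _ in range(n_iter):
        grad0, grad = 0.0, [0.0] * p
        for xi, yi in zip(X, y):
            z = beta0 + sum(b * v for b, v in zip(beta, xi))
            resid = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted probability minus outcome
            grad0 += resid / n
            for j in range(p):
                grad[j] += resid * xi[j] / n
        beta0 -= step * grad0
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta0, beta


# Synthetic, illustrative data only: column 0 drives the outcome, column 1 is noise.
rng = random.Random(0)
X, y = [], []
for _ in range(150):
    signal, noise = rng.gauss(0, 1), rng.gauss(0, 1)
    prob = 1.0 / (1.0 + math.exp(-2.0 * signal))
    X.append([signal, noise])
    y.append(1 if rng.random() < prob else 0)

_, weak_penalty = fit_lasso_logistic(X, y, lam=0.01)   # small lambda: the strong predictor survives
_, strong_penalty = fit_lasso_logistic(X, y, lam=1.0)  # large lambda: all coefficients shrink to zero
print("lambda = 0.01:", weak_penalty)
print("lambda = 1.0: ", strong_penalty)
```

In practice λ would be chosen by cross-validation rather than fixed by hand, and a dedicated library would replace this hand-written solver.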

| Assessment of accuracy
The accuracy of the COVID risk score was assessed using the area under the receiver-operating characteristic curve (AUC). For internal validation of the accuracy estimates and to reduce overfit bias, we used 200 bootstrap resamples. The calibration curve was plotted, and the Brier score was calculated to evaluate the accuracy of the model. The Brier score is calculated as $\mathrm{Brier} = (Y - p)^2$, averaged over all patients, where $Y$ is the actual outcome variable (0 or 1) and $p$ is the predicted probability calculated by the model. The closer the Brier score is to 0, the more accurate the model is. Table 2. Data on some laboratory variables (including IL-6 and T lymphocyte count) were missing in the validation cohort, either because the relevant tests could not be conducted or owing to limits on the laboratory staff's authorization to view medical records at the four hospitals.
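As a worked illustration of the two accuracy measures named above, the following sketch computes the AUC (as the proportion of severe/mild pairs ranked correctly) and the Brier score in plain Python. The six outcomes and predicted probabilities are made up for the example and are not study data; the bootstrap resampling step is not shown.

```python
def brier_score(outcomes, probabilities):
    """Mean squared difference between the 0/1 outcome and the predicted probability."""
    return sum((y - p) ** 2 for y, p in zip(outcomes, probabilities)) / len(outcomes)


def auc(outcomes, probabilities):
    """Share of (severe, mild) pairs in which the severe case gets the higher score; ties count 0.5."""
    positives = [p for y, p in zip(outcomes, probabilities) if y == 1]
    negatives = [p for y, p in zip(outcomes, probabilities) if y == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical example: 1 = severe, 0 = mild.
outcomes = [1, 1, 1, 0, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.5, 0.2, 0.1]

example_auc = auc(outcomes, probabilities)            # 8 of 9 pairs ranked correctly
example_brier = brier_score(outcomes, probabilities)  # 0.71 / 6
print(round(example_auc, 3), round(example_brier, 3))
```

Here one severe case (0.4) is scored below one mild case (0.5), so eight of the nine pairs are ordered correctly.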

| Predictors of severe disease
Predictors of severe COVID-19 were analysed using data from the derivation cohort. Fifty variables measured at initial hospital admission (Table 1)

| Construction of risk score and nomogram score calculator card
The COVID risk score was constructed based on the coefficients from the logistic model. We used the following formula for the lo- to allow clinicians to automatically calculate a risk score which can be used to determine the likelihood (with 95% CIs) that a hospitalized patient with COVID-19 will develop severe illness (Figure 2).
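How a logistic model's coefficients translate into a predicted probability can be sketched as follows. This is a generic illustration, not the published score: the intercept, the two predictor names and their odds ratios are hypothetical placeholders, and the only facts relied on are that a logistic coefficient equals the natural logarithm of its odds ratio and that the probability is the inverse logit of the linear predictor.

```python
import math


def severe_probability(intercept, coefficients, patient):
    """Inverse logit of intercept + sum(coefficient * predictor value)."""
    linear_predictor = intercept + sum(
        coefficients[name] * patient[name] for name in coefficients
    )
    return 1.0 / (1.0 + math.exp(-linear_predictor))


# HYPOTHETICAL values for illustration only; these are not the study's coefficients.
example_intercept = -3.0
example_coefficients = {
    "raised_crp": math.log(4.0),      # coefficient = ln(odds ratio), here OR = 4
    "raised_d_dimer": math.log(2.0),  # here OR = 2
}

low_risk = severe_probability(
    example_intercept, example_coefficients, {"raised_crp": 0, "raised_d_dimer": 0}
)
high_risk = severe_probability(
    example_intercept, example_coefficients, {"raised_crp": 1, "raised_d_dimer": 1}
)
print(round(low_risk, 3), round(high_risk, 3))
```

With both hypothetical predictors present, the odds of severe disease are multiplied by 4 × 2 = 8 relative to a patient with neither.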

| Assessment results of derivation and validation of prediction model for severe COVID-19 cases
Taking the predicted probability calculated by the model as the independent variable and the hospitalization outcome as the classification outcome variable, the performance of the prediction model was

| DISCUSSION
We identified clinical predictors of severe COVID-19 and developed a prediction model to identify the development of critical ill- in patients with COVID-19. 9,10 In the current study, however, we found that lower haemoglobin levels and higher D-dimer and C-reactive protein levels were associated with a higher risk of severe COVID-19. Additionally, lower T-lymphocyte count and increased IL-6 level were associated with a higher risk of severe COVID-19 in the derivation cohort. However, because the relevant tests could not be conducted in the validation hospitals, these two laboratory variables were unavailable in the validation cohort; therefore, we did not include them in the validation model. To the best of our knowledge, this is the first study to report the combination of these five laboratory variables to predict the severity of COVID-19.
The findings for these laboratory predictors suggest that patients with severe cases of COVID-19 have a strong inflammatory response together with cellular immune dysfunction and coagulopathy.
This indicates that these laboratory predictors can be used to identify severe cases of COVID-19 early, so that clinicians can control the cytokine storms and immune cell exhaustion of severe cases during treatment. 11,12 A limitation of this study is its small sample size. The data used in developing the predictive model were entirely from central China, which could potentially limit the global generalizability of the results.
Additional validation studies of COVID-19 predictors should be conducted in regions outside central China. 8  In the future, we intend to explore the possibility of developing an online calculator for the rapid and convenient assessment of the risk of developing severe COVID-19 illness in patients on admission.

ACKNOWLEDGEMENTS
We wish to acknowledge all healthcare professionals who have

CONFLICT OF INTEREST
The authors declare that there is no conflict of interest regarding the publication of this article.

AUTHOR CONTRIBUTIONS
YY and QZ designed the study, analysed the data and prepared

CODE AVAILABILITY
Not applicable.

CONSENT TO PARTICIPATE
The need for written informed consent was waived because deidentified retrospective data were used.

DATA AVAILABILITY STATEMENT
The data sets used during the current study are available from the corresponding author on reasonable request.