Bayesian Logistic Injury Severity Score: A Method for Predicting Mortality Using International Classification of Disease-9 Codes

Authors

  • Randall S. Burd MD, PhD,

    1. From the Department of Surgery, UMDNJ-Robert Wood Johnson Medical School (RSB), New Brunswick, NJ; the Computer Engineering and Computer Science Department, University of Louisville (MO), Louisville, KY; and the Department of Statistics, Columbia University (DM), New York, NY.
  • Ming Ouyang PhD,

  • David Madigan PhD


  • Presented in poster form at the Eastern Association for the Surgery of Trauma, Fort Myers, FL, January 16, 2007.

Randall S. Burd, MD, PhD; e-mail: burdrs@umdnj.edu.

Abstract

Objectives:  Owing to the large number of injury International Classification of Disease-9 revision (ICD-9) codes, it is not feasible to use standard regression methods to estimate the independent risk of death for each injury code. Bayesian logistic regression is a method that can select among a large number of predictors without loss of model performance. The purpose of this study was to develop a model for predicting in-hospital trauma deaths based on this method and to compare its performance with the ICD-9–based Injury Severity Score (ICISS).

Methods:  The authors used Bayesian logistic regression to train and test models for predicting mortality based on injury ICD-9 codes (2,210 codes) and injury codes with two-way interactions (243,037 codes and interactions) using data from the National Trauma Data Bank (NTDB). They evaluated discrimination using area under the receiver operating curve (AUC) and calibration with the Hosmer-Lemeshow (HL) h-statistic. The authors compared performance of these models with one developed using ICISS.

Results:  The discrimination of a model developed using individual ICD-9 codes was similar to that of a model developed using individual codes and their interactions (AUC = 0.888 vs. 0.892). Inclusion of injury interactions, however, improved model calibration (HL h-statistic = 2,737 vs. 1,347). A model based on ICISS had similar discrimination (AUC = 0.855) but showed worse calibration (HL h-statistic = 45,237) than those based on regression.

Conclusions:  A model that incorporates injury interactions had better predictive performance than one based only on individual injuries. A regression approach to predicting injury mortality based on injury ICD-9 codes yields models with better predictive performance than ICISS.

Accurate risk adjustment for injured patients is an essential requirement for clinical trials involving trauma patients and for research assessing outcome after injury. In addition, stratifying injured patients by risk of death is important for evaluating and comparing clinical performance within and between institutions. Despite the need for an effective risk adjustment tool, an accurate, generalizable, and widely accepted method has not been devised.1,2 The Injury Severity Score (ISS) has been one of the most commonly used scoring systems in the trauma literature, but has important limitations.1 This score is based on anatomic injuries and is calculated from values of the Abbreviated Injury Scale (AIS) using an empirically derived method. ISS has a nonmonotonic relationship with mortality rate because the same ISS value can be derived from different AIS triplets.3–5 The New Injury Severity Score (NISS) is a modification of ISS described in 1997 that performs better than ISS, but has similar limitations.5,6

The Trauma and Injury Severity Score (TRISS) is an estimate of the probability of survival that is calculated using ISS, the Revised Trauma Score (calculated using Glasgow Coma Scale score, blood pressure, and respiratory rate), age, and type of injury (blunt vs. penetrating).7 Because it incorporates physiologic data and injury characteristics in addition to anatomic injury, it is not surprising that TRISS is a more accurate predictor of survival than ISS. After its introduction, TRISS rapidly became a standard method for predicting survival after injury. This score continues to have widespread use as a measure of performance at trauma centers, applied either directly as the predicted probability of survival for individual patients or in the derivation of institution-specific statistics (Z-score or W-score) for comparing performance to benchmark databases.1 Because calculation of TRISS requires values for respiratory rate and the verbal component of the Glasgow Coma Scale, this score cannot be directly assessed among intubated patients, a subpopulation of trauma patients in whom accurate estimates of survival may be most needed. In addition, physiologic data are often missing from records in trauma databases, particularly records of patients with a high risk of mortality,8 and TRISS cannot be calculated for records from administrative data sets that have no physiologic data.

More recently, the International Classification of Disease-9 revision (ICD-9)–based Injury Severity Score (ICISS) was described, a method that uses individual ICD-9 injury codes for estimating injury severity.9 This score is appealing because ICD-9 codes are available in trauma and nontrauma databases, and it does not require physiologic data for calculation. ICISS is based on calculating the proportion of survivors among all patients with each injury ICD-9 code. While this value is a proportion rather than a ratio, it has been called the “survival risk ratio” (SRR). This approach was taken because of the challenge of using standard methods to perform a regression with the large number (>2,000) of individual ICD-9 injury codes entered as binary predictors. In its original description, ICISS was calculated as the product of the SRRs of the injuries for each patient.9 This product method supposes independence of the risk of individual injuries, generally not a reasonable assumption. It was subsequently observed that the single worst SRR for each patient predicted survival better than the product of all SRRs.10 In an attempt to partially address the problem of independence, ICISS was further modified by calculating SRRs only from records that had a single injury, followed by calculation of the product of these SRRs.11 While this method performed better than the original ICISS derivation, some injury codes do not occur in isolation in any record, and information potentially available from patients with multiple injuries is lost. Because it is mathematically inaccurate to view ICISS as an estimate of the probability of survival, ICISS is best described as an ad hoc score. 
Despite the mathematical limitations of ICISS, this score performs well, leading some to suggest that ICISS is now the preferred risk adjustment method for trauma, especially when physiologic parameters are not available.11 Because of its initially promising performance, investigators have recently begun using ICISS as a risk adjustment method for research purposes.12–14
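The ICISS variants described above can be illustrated with a minimal sketch. The record layout (a list of codes paired with a death indicator), the function names, and the toy codes "A" and "B" are hypothetical and are not taken from any ICISS implementation; the sketch only shows how a survival risk ratio and two of the ICISS variants are computed.

```python
from collections import defaultdict
from math import prod


def survival_risk_ratios(records):
    """Return {code: proportion of survivors among records containing that code}.

    `records` is a list of (codes, died) pairs; this layout is hypothetical.
    """
    seen = defaultdict(int)
    survived = defaultdict(int)
    for codes, died in records:
        for code in set(codes):
            seen[code] += 1
            if not died:
                survived[code] += 1
    return {code: survived[code] / seen[code] for code in seen}


def iciss_product(codes, srr):
    """Original ICISS: product of the SRRs of all of a patient's injuries."""
    return prod(srr[code] for code in set(codes))


def iciss_worst(codes, srr):
    """Variant: the single worst (lowest) SRR among a patient's injuries."""
    return min(srr[code] for code in set(codes))


# Toy data with made-up codes, for illustration only.
records = [
    (["A", "B"], False),
    (["A"], True),
    (["B"], False),
    (["A", "B"], True),
]
srr = survival_risk_ratios(records)
score_product = iciss_product(["A", "B"], srr)
score_worst = iciss_worst(["A", "B"], srr)
```

Restricting `records` to patients with exactly one injury code before calling `survival_risk_ratios` would give the independent-SRR variant, which also shows why codes that never occur in isolation receive no SRR under that approach.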

In our research work with administrative databases, we required a risk adjustment method based on injury ICD-9 codes that did not require physiologic data. While ICISS has performed adequately in initial studies, we were concerned that its mathematical limitations would lead to bias in some applications. Standard regression methods are not a suitable alternative because of a variety of practical and theoretical difficulties due to the large number of ICD-9 injury codes. Bayesian logistic regression is a method that can incorporate large numbers of predictors without loss of model performance, addressing some of the limitations of standard regression methods in this context.15 The specific purpose of this study was to use Bayesian logistic regression to develop models for predicting injury mortality using ICD-9 codes and to compare the predictive performance of these models with ICISS. Our overall goal was to develop a predictive model that performs better than currently available methods for predicting injury mortality.

Methods

Data Sources and Subject Selection

The Institutional Review Board at the University of Medicine and Dentistry of New Jersey-Robert Wood Johnson Medical School reviewed this study and approved it as exempt from informed consent requirements. We obtained data from the National Trauma Data Bank Version 5.0 (NTDB) and the Nationwide Inpatient Sample (NIS). The NTDB is a voluntary trauma database maintained by the American College of Surgeons Committee on Trauma that has accrued data since 1989 from adult and pediatric trauma centers. For this study, we used records from 2000 to 2004 with at least one ICD-9 diagnosis code (n = 911,700). We excluded records if age was missing (n = 23,023; 2.5%) or injury type was designated as “burn” or was missing (n = 28,703; 3.1%). To avoid double counting, we also excluded records of patients transferred to another hospital (n = 40,360; 4.4%) or with a missing discharge status (n = 68,713; 8.3%). After exclusions, the final data set contained 760,034 records (83% of records with at least one ICD-9 code).

The NIS is a database maintained by the Healthcare Cost and Utilization Project (HCUP) and is the largest all-payer inpatient database in the United States.16 The NIS contains all discharge records from community (nonfederal) hospitals that are selected to represent a 20% sample of hospitals based on five characteristics: ownership (public vs. private), number of beds, teaching status, location (urban vs. rural), and U.S. census region. Because the NIS contains discharges from sampled hospitals, each record includes a weight that is required for making national projections. The database includes information available in a typical discharge record including age, gender, admission source (including transfer from another hospital), diagnoses classified using ICD-9 codes, discharge disposition (including transfer to another hospital), and hospital death. We obtained records from the NIS from 2003 (n = 323,725) with a primary ICD-9 discharge diagnosis for injury (800 to 959.9). We excluded those records with a primary diagnosis of late effects of injury (905 to 909), foreign bodies entering through an orifice (930 to 939), and burns (940 to 949). To avoid double counting, we also excluded records of patients transferred to another hospital, yielding a final data set containing 276,366 records (85% of records with a primary injury diagnosis).

Statistical Approach

Conventional binary logistic regression is a form of regression that can be used to predict a binary dependent variable (such as survival or nonsurvival) based on a set of independent variables. Logistic regression specifically models the log-odds of the binary variable as a linear combination of the independent variables. In the context of predicting trauma mortality using ICD-9 codes, one would work with a model in the form

$$\log\left(\frac{p(\text{death})}{1 - p(\text{death})}\right) = \beta_0 + \beta_1\,\text{code}_1 + \beta_2\,\text{code}_2 + \cdots + \beta_n\,\text{code}_n$$

where p(death) denotes the probability of death for a particular patient, n represents the number of injury codes, and code_i (i = 1, . . . n) assumes the value 1 or 0 depending on the presence or absence of the ith ICD-9 code. The widely used maximum likelihood method chooses values for the regression coefficients (i.e., β0, β1, β2, . . . βn) that maximize the “likelihood,” essentially the probability of the data, given a particular set of values for the regression coefficients. The underlying principle of this method is that given two candidate sets of values for the regression coefficients, a modeler would generally favor the coefficient set that makes the available data seem most probable. While other alternatives can be considered, most statistical software uses maximum likelihood to estimate regression coefficients for a logistic regression equation.
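As a minimal sketch of the log-odds model and the likelihood principle described above, the following computes a predicted probability of death and the log-likelihood of a small data set for a given set of coefficients. The function names, the dictionary-based coefficient layout, and the code label "X" are hypothetical choices made for illustration.

```python
import math


def predicted_probability(beta0, betas, codes_present):
    """p(death) under the log-odds model; `betas` maps code -> coefficient.

    Codes that are absent contribute 0, matching a 0/1 indicator.
    """
    log_odds = beta0 + sum(betas.get(code, 0.0) for code in set(codes_present))
    return 1.0 / (1.0 + math.exp(-log_odds))


def log_likelihood(beta0, betas, records):
    """Log of the probability of the observed outcomes given the coefficients."""
    total = 0.0
    for codes, died in records:
        p = predicted_probability(beta0, betas, codes)
        total += math.log(p) if died else math.log(1.0 - p)
    return total


# Toy data: the patient with code "X" died, the patient without it survived.
records = [(["X"], True), ([], False)]
fits_well = log_likelihood(-2.0, {"X": 4.0}, records)
fits_poorly = log_likelihood(-2.0, {"X": -4.0}, records)
```

In this toy example the coefficient set that assigns a high risk to code "X" has the larger log-likelihood, so maximum likelihood would favor it over the alternative.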

While maximum likelihood estimation is useful for many applications, these estimates become unreliable as the number of predictor variables increases. As a general rule, logistic regression models are more reliable when the ratio of observations to independent (predictor) variables is greater than 10:1.17,18 Below this ratio, several types of problems can result, including overfitting that occurs when variables that appear statistically significant are just noise (Type I error), under- or overestimation of the variance of estimates, lack of convergence, and poor predictive accuracy. Because the number of ICD-9 injury codes with all possible fourth- and fifth-digit variants is large (>2,000), conventional logistic regression methods are not suitable for developing models predicting mortality rate among injured patients using each injury ICD-9 code. The limitations of conventional logistic regression are likely to become even more apparent as the number of model parameters increases with the addition of other variables.

Bayesian logistic regression is an alternative method that can be used to obtain reliable estimates of regression coefficients when large numbers of predictors are considered. The Bayesian approach initially treats each regression coefficient as a random variable drawn from a “prior” probability distribution. Prior distributions can reflect specific knowledge about the regression coefficients (e.g., that a particular coefficient is positive in sign). Regression modelers more typically choose symmetric, zero-centered prior distributions with a high variance that reflects little or no prior knowledge. The Bayesian approach then computes “posterior” distributions that blend information from the prior with the information from the data as encoded in the likelihood. In this way, the Bayesian approach computes probability distributions for regression coefficients conditional on the observed data. It is common to report the mode of the posterior distribution as the best guess for the regression coefficients (often referred to as “maximum a posteriori” estimation).

The Bayesian approach to logistic regression counteracts the tendency for overfitting by using prior distributions that result in regression coefficient estimates somewhere between the maximum likelihood estimates and zero, while remaining neutral about whether each regression coefficient is positive or negative. The statistical literature sometimes refers to this process as “shrinkage.”19 This approach can be effective even in applications in which the number of regression coefficients exceeds the number of observations in the data. Choosing a specific form for the prior distribution amounts to deciding the nature of the shrinkage that should be applied. In this study, we specified independent Laplace priors for each regression coefficient. The Laplace distribution exhibits a cusp (discontinuous first derivative) at its mode of zero. As a direct consequence, the posterior mode for many regression coefficients is precisely zero. In this way, Bayesian logistic regression accomplishes simultaneous shrinkage and variable selection. Selecting the variance of the prior dictates the amount of shrinkage. To choose the prior variance, we used cross-validation, a widely used data-driven strategy. In Bayesian logistic regression, the mathematical form of the posterior distribution is straightforward, but determination of the posterior mode requires iterative numerical optimization algorithms.15 The software that we used implements a cyclic coordinate descent algorithm, but other approaches are possible.20 Bayesian logistic regression has been used to develop accurate prediction models in diverse domains in which large numbers of predictors are required, including text classification, genetic association studies, microarray analysis, and veterinary management.15,21–23
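The link between a Laplace prior and simultaneous shrinkage and selection can be written out explicitly. The following is a sketch in notation introduced here for illustration, not the notation of the cited software: λ is an assumed rate parameter of the Laplace prior, x_i is the vector of binary predictors for record i, y_i is 1 for death and 0 for survival, N is the number of records, and n is the number of predictors.

```latex
p(\beta_j) = \frac{\lambda}{2}\exp\left(-\lambda\,\lvert\beta_j\rvert\right),
\qquad \operatorname{Var}(\beta_j) = \frac{2}{\lambda^{2}}

\hat{\beta}_{\mathrm{MAP}}
  = \arg\max_{\beta}\left[\sum_{i=1}^{N}\log p(y_i \mid x_i, \beta)
  \;-\; \lambda\sum_{j=1}^{n}\lvert\beta_j\rvert\right]
```

Under these assumptions the posterior mode maximizes the log-likelihood minus an absolute-value penalty, which is the same objective as L1-penalized (lasso) logistic regression. A smaller prior variance corresponds to a larger λ and therefore to more shrinkage and more coefficients estimated as exactly zero.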

We used Bayesian logistic regression to develop models predicting in-hospital death based on ICD-9 injury codes. We have designated this method for predicting trauma deaths as the Bayesian Logistic Injury Severity Score (BLISS). We used records of patients from the NTDB admitted before 2003 as a training data set (n = 447,442) and the remaining records as a testing data set (n = 312,592). We used data from the NIS from 2003 as a second test data set. Because trauma centers are likely to be represented in the NIS, overlap of individual records in the two test data sets was likely, but its extent could not be determined because of the absence of suitable identifiers in each database. We used NTDB test data to test internal validity and the NIS test data to test external validity.

The training software Bayesian Binary Regression (BBRtrain) and classification software (BBRclassify) that we used are publicly available (http://www.stat.rutgers.edu/~madigan/BBR/). We first developed a simple BLISS model (“reference” model) using only injury ICD-9 codes, entering individual injury codes into the model as dichotomous variables. The BBR software has two available prior distributions: Laplace and Gaussian. We chose a Laplace distribution as the form of the prior distribution for each regression coefficient rather than a Gaussian distribution because it empirically resulted in better model performance. We used 10-fold cross-validation to choose the value of the prior variance. This technique involves splitting the training data randomly into 10 pieces, training the model using a particular candidate value of the prior variance on nine-tenths of the data, and then measuring how well the candidate value performs on the omitted one-tenth. Tenfold cross-validation repeats this procedure 10 times, omitting each piece once, and then computes an average performance measure for the candidate prior variance. The entire procedure is repeated for different candidate values of the prior variance, and the best value is chosen. The documentation for the software describes the details of the cross-validation procedure.
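The cross-validation procedure for choosing the prior variance can be sketched as follows. This is a generic illustration and not the procedure implemented in the BBR software: the function names are hypothetical, and `fit_and_score` stands in for an assumed user-supplied callback that trains a model on the training indices with a candidate prior variance and returns a performance measure on the held-out indices (higher is better).

```python
import random


def k_fold_indices(n_records, k=10, seed=0):
    """Randomly split record indices 0..n_records-1 into k nearly equal pieces."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_score(n_records, candidate, fit_and_score, k=10, seed=0):
    """Average held-out performance of one candidate prior variance."""
    scores = []
    for held_out in k_fold_indices(n_records, k, seed):
        held = set(held_out)
        train = [i for i in range(n_records) if i not in held]
        scores.append(fit_and_score(train, held_out, candidate))
    return sum(scores) / k


def choose_prior_variance(n_records, candidates, fit_and_score, k=10):
    """Return the candidate with the best average held-out performance."""
    return max(
        candidates,
        key=lambda v: cross_validated_score(n_records, v, fit_and_score, k),
    )
```

Each candidate value is scored on the same random split because the seed is fixed, so differences between candidates are not caused by different partitions of the data.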

To further improve model performance, we evaluated two- and three-way interactions between injury ICD-9 codes. It is very likely that there are clinical scenarios in which the influence of one injury (e.g., pelvic fracture) on the probability of mortality is different when other specific injuries (e.g., a femur fracture) are also present. To convey this idea, one can include variables that model these relationships, called “interaction terms.” Interaction terms are represented by the product of pairs of independent variables. In the case of binary predictors, an interaction term representing the influence of one injury on a second injury (e.g., a femur fracture and a pelvic fracture) assumes the value of “1” when both injuries are present and “0” when only one injury or neither injury is present. Although interaction terms are most commonly considered for injury pairs (two-way interactions), these terms can similarly be used to represent injury triples (three-way interactions).
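A minimal sketch of how binary interaction terms can be generated from the injury codes in a single record is shown below. The function name, the `*` separator, and the feature labels are hypothetical; a feature that appears in the returned set has the value 1 for that record, and every other feature has the value 0.

```python
from itertools import combinations


def interaction_features(codes, max_order=2):
    """Return the main-effect and interaction features present in one record.

    An interaction feature is present only when every code in it is present,
    which is the product of the corresponding 0/1 indicators.
    """
    unique = sorted(set(codes))
    features = set(unique)
    for order in range(2, max_order + 1):
        for combo in combinations(unique, order):
            features.add("*".join(combo))
    return features


# Hypothetical labels standing in for ICD-9 injury codes.
two_way = interaction_features(["pelvis_fx", "femur_fx"])
three_way = interaction_features(["pelvis_fx", "femur_fx", "spleen_inj"], max_order=3)
```

Generating only the combinations that actually occur in a record, rather than all possible pairs of codes, keeps the feature set limited to interactions observed in the data.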

We next developed new models by selecting several additional variables (age, injury type, and E-codes) to include with injury ICD-9 codes as predictors. We added age into separate models as a continuous variable (normalized to a mean of zero and a variance of one) and as a categorical variable (either 5- or 15-year intervals). Injury type (blunt vs. penetrating) was only available in the NTDB data set. Each NTDB record included a single E-code (classification of external cause of injury or poisoning), while up to four E-codes were available for each record in the NIS. We excluded E-codes for accidental poisoning due to medications and other substances (E850 to E869) and complications of medical care (E870 to E879) because these causes were more likely to represent consequences of treatment rather than injury. These E-codes occurred in 0.5% of records in the NTDB database and in 35.5% of records in the NIS. After exclusion of these E-codes from the NIS, 10.9% of records had no E-code listed. Similar to ICD-9 codes, we added individual E-codes as dichotomous variables. Finally, we also considered interactions between each injury code and age, injury type, and E-codes in separate models.
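The two ways of entering age can be sketched as follows. The function names and category labels are hypothetical, the use of the population standard deviation is an assumption, and the interval boundaries (starting at age 0) are also an assumption because the text does not state where the 5- and 15-year intervals begin.

```python
from statistics import mean, pstdev


def standardize(ages):
    """Rescale ages to a mean of zero and a variance of one."""
    center = mean(ages)
    spread = pstdev(ages)  # population standard deviation (an assumption)
    return [(age - center) / spread for age in ages]


def age_category(age, width=15):
    """Label the fixed-width age interval that contains `age`."""
    lower = (int(age) // width) * width
    return f"age_{lower}_{lower + width - 1}"


z_scores = standardize([20, 40, 60])
label_15 = age_category(38)
label_5 = age_category(38, width=5)
```

Each category label would then be entered as its own dichotomous variable, in the same way as the individual injury codes.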

Evaluation of Model Performance

We evaluated model discrimination using area under the receiver operating curve (AUC). Because the NIS is a survey sample, conventional methods cannot be used to estimate or compare the AUC of different models. We therefore estimated AUC accounting for sample weights and estimated the variance of AUC using 1,000 bootstrap replicates.24 We used standard categorization of AUC values, designating each as excellent (0.90–1.00), good (0.80–0.89), fair (0.70–0.79), or poor (0.60–0.69).
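One way to account for sample weights when estimating AUC is to weight every pair of one death and one survivor by the product of the two record weights. The sketch below is an illustration under that assumption and is not necessarily the estimator of the cited reference; the function name and argument layout are hypothetical, and the quadratic loop is written for clarity rather than speed.

```python
def weighted_auc(scores, died, weights):
    """Weighted probability that a death is scored higher than a survivor.

    Ties in the predicted score count as one half.
    """
    deaths = [(s, w) for s, d, w in zip(scores, died, weights) if d]
    survivors = [(s, w) for s, d, w in zip(scores, died, weights) if not d]
    numerator = 0.0
    for s_death, w_death in deaths:
        for s_surv, w_surv in survivors:
            if s_death > s_surv:
                numerator += w_death * w_surv
            elif s_death == s_surv:
                numerator += 0.5 * w_death * w_surv
    denominator = sum(w for _, w in deaths) * sum(w for _, w in survivors)
    return numerator / denominator


scores = [0.9, 0.8, 0.3, 0.1]
died = [1, 0, 1, 0]
auc_equal_weights = weighted_auc(scores, died, [1, 1, 1, 1])
auc_survey_weights = weighted_auc(scores, died, [1, 1, 3, 1])
```

With equal weights this reduces to the usual AUC. A bootstrap variance estimate would repeat the calculation on records resampled with replacement and take the variance of the resulting values.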

The optimal method for assessing model calibration has been a topic of considerable discussion in previous studies related to mortality prediction in trauma.25 For this study, we used two methods for evaluating model calibration: calculation of the Hosmer-Lemeshow (HL) h-statistic and inspection of calibration curves. For the HL h-statistic, the patients are grouped into “deciles of risk” by first using the model to calculate each patient’s predicted probability of death, and then ranking the patients according to this risk probability. The patients are then divided into 10 groups based on their probability of death. Observed and predicted mortality for each group are then used to calculate the HL h-statistic. Smaller values of the HL h-statistic suggest better model fit, and its null distribution conveys the scale of individual values. Because of the size of the data sets that we used, we did not report the p-value associated with the HL h-statistic: with a large sample size, the power of the goodness-of-fit test to reject the null hypothesis is high, and the p-value may lead to misleading conclusions.26 For models derived from the NIS, we used survey weights to derive national projections of the predicted number of deaths, actual number of deaths, and total number of trauma admissions in each probability group. We then used these values to calculate the overall HL h-statistic. When derived using this method, the HL h-statistic represented an estimate of the true national value of this statistic and was not based only on sampled observations.
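The grouping and summation behind a Hosmer-Lemeshow type statistic can be sketched as follows. This is an unweighted illustration with a hypothetical function name; it assumes groups of nearly equal size formed from the ranked predictions, whereas a fixed-cutpoint grouping of the predicted probabilities is another common convention, and the text does not settle which was used.

```python
def hosmer_lemeshow(predicted, died, groups=10):
    """Sum over risk groups of (observed - expected)^2 / (n * p * (1 - p)).

    `p` is the mean predicted probability of death within the group.
    """
    ranked = sorted(zip(predicted, died))
    n = len(ranked)
    statistic = 0.0
    for g in range(groups):
        chunk = ranked[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue  # fewer records than groups
        size = len(chunk)
        expected = sum(p for p, _ in chunk)
        observed = sum(d for _, d in chunk)
        mean_p = expected / size
        variance = size * mean_p * (1.0 - mean_p)
        if variance > 0:
            statistic += (observed - expected) ** 2 / variance
    return statistic


well_calibrated = hosmer_lemeshow([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0], groups=1)
poorly_calibrated = hosmer_lemeshow([0.5, 0.5, 0.5, 0.5], [1, 1, 1, 1], groups=2)
```

A survey-weighted version would replace the counts in each group with sums of record weights, in line with the national projections described above.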

We next compared BLISS to ICISS. Three methods for obtaining ICISS were evaluated: ICISS-1 (based on the product of all SRRs), ICISS-2 (based on the single worst injury SRR), and ICISS-3 (based on the product of independent SRRs). We calculated each SRR value using data in the NTDB training data set.

We estimated nationwide trauma admission counts using sample weights available for each record in the NIS. We performed analyses using SAS 8.2 (SAS Institute, Cary, NC) for summary statistics and Stata 8.0 (Stata Corp., College Station, TX) for survey statistics and calculation of HL h-statistic.
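National projections from a weighted sample amount to summing record weights. The sketch below assumes a hypothetical layout in which each record is a (weight, died) pair; the weights shown are made up for illustration and are not NIS values.

```python
def national_projection(records):
    """Return projected admissions, projected deaths, and the weighted mortality rate."""
    admissions = sum(weight for weight, _ in records)
    deaths = sum(weight for weight, died in records if died)
    return admissions, deaths, deaths / admissions


# Made-up sample weights for three sampled discharges.
sample = [(5.0, False), (4.8, True), (5.2, False)]
admissions, deaths, mortality_rate = national_projection(sample)
```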

Results

Overview of Data

We observed several differences between the training data set and each testing data set. Compared to patients from the NTDB training set, patients in the NTDB testing set were younger, were less commonly female, less commonly had penetrating injuries, were more commonly transferred from another hospital, and had a higher mortality rate. Patients in the NIS testing set were older and more commonly female than those in the NTDB training set, were less commonly transferred from another hospital, and had a lower mortality rate (Table 1).

Table 1.   Summary Statistics

| | NTDB Training Data | NTDB Testing Data | NIS Testing Data* |
|---|---|---|---|
| No. of records | 447,442 | 312,592 | 276,366 |
| Projected national No. of cases | — | — | 1,330,838 |
| Mean age (yr) | 38.8 (IQR 32) | 38.6 (IQR 32) | 54.9 (IQR 48) |
| Female (%) | 34.0 | 33.7 | 50.6 |
| Penetrating injury (%) | 12.6 | 11.8 | — |
| Transferred from another hospital (%) | 16.2 | 18.9 | 3.3 |
| Mean length of stay (days) | 5.3 (IQR 5)† | 5.3 (IQR 5)‡ | 5.4 (IQR 4) |
| Mortality rate (%) | 4.8 | 5.1 | 2.8 |

  — = not available or applicable; IQR = interquartile range; NTDB = National Trauma Data Bank; NIS = Nationwide Inpatient Sample.
  *Estimates and statistics were calculated using survey sampling weights.
  †Length of stay missing in 6,829 (1.5%) of records.
  ‡Length of stay missing in 3,759 (1.2%) of records.

Development of BLISS

Among 2,489 possible injury ICD-9 codes, 2,210 codes occurred in the NTDB training set. Among these 2,210 codes, 704 were eliminated by the variable selection process used by the Bayesian logistic regression algorithm, leaving 1,506 codes in the final model (Table 2). Model discrimination was good for the NTDB test data (AUC 0.888) and NIS test data (AUC 0.825) (Table 2). The relative values of the HL h-statistic and inspection of calibration curves showed that calibration was better using the NTDB test data than the NIS test data (Table 2 and Figure 1).

Table 2.   Comparison of the Features and Performance of Models Predicting Hospital Mortality Developed Using ICISS and BLISS

| Model | NTDB Training Data: No. of variables entered into model | NTDB Training Data: No. of variables in final model | NTDB Test Data: AUC (95% CI) | NTDB Test Data: HL h-statistic | NIS Test Data: AUC (95% CI) | NIS Test Data: HL h-statistic |
|---|---|---|---|---|---|---|
| BLISS | | | | | | |
| Main effects | | | | | | |
| Injury codes alone (reference model) | 2,210 | 1,506 | 0.888 (0.885, 0.891) | 2,737 | 0.825 (0.820, 0.851) | 3,570 |
| +Age | 2,211 | 1,493 | 0.906 (0.903, 0.908) | 2,918 | 0.855 (0.854, 0.862) | 2,998 |
| +Categorical age (5-yr intervals) | 2,228 | 1,515 | 0.906 (0.904, 0.909) | 2,893 | 0.858 (0.853, 0.862) | 7,766 |
| +Categorical age (15-yr intervals) | 2,216 | 1,499 | 0.906 (0.903, 0.909) | 2,974 | 0.825 (0.820, 0.830) | 7,730 |
| +Injury type (blunt vs. penetrating) | 2,211 | 1,502 | 0.888 (0.885, 0.892) | 2,674 | — | — |
| +E-codes | 2,811 | 1,794 | 0.899 (0.896, 0.902) | 2,410 | 0.857 (0.853, 0.861) | 4,018 |
| Interactions | | | | | | |
| +Two-way injury interactions | 243,037 | 4,002 | 0.892 (0.889, 0.895) | 1,347 | 0.830 (0.825, 0.835) | 1,848 |
| +Two- and three-way injury interactions | 2,497,957 | 2,544 | 0.891 (0.888, 0.894) | 1,346 | 0.825 (0.820, 0.830) | 2,069 |
| +Age (15-yr intervals) injury interactions | 15,476 | 3,231 | 0.903 (0.900, 0.906) | 2,484 | 0.852 (0.848, 0.857) | 10,489 |
| +Injury type–injury interactions | 4,423 | 2,408 | 0.889 (0.886, 0.892) | 2,612 | — | — |
| +E-code–injury interactions | 96,370 | 1,912 | 0.900 (0.897, 0.903) | 2,095 | 0.839 (0.834, 0.843) | 2,999 |
| ICISS | | | | | | |
| ICISS-1 (traditional method) | 2,210 | 1,590* | 0.855 (0.851, 0.859) | 45,267 | 0.726 (0.719, 0.733) | 134,280 |
| ICISS-2 (single worst SRR) | 2,210 | 1,590* | 0.866 (0.862, 0.869) | 13,113 | 0.743 (0.737, 0.750) | 53,746 |
| ICISS-3 (independent SRR) | 2,210 | 512* | 0.867 (0.863, 0.870) | 6,396 | 0.793 (0.787, 0.798) | 21,848 |

  — = not available or applicable; AUC = area under the receiver operating curve; HL = Hosmer-Lemeshow; ICISS = International Classification of Disease-9–based Injury Severity Score; NIS = Nationwide Inpatient Sample; NTDB = National Trauma Data Bank; SRR = survival risk ratio.
  *Variables with SRR < 1.

Figure 1.

 Calibration curves for Bayesian Logistic Injury Severity Score (BLISS; reference and 2-way interaction models) and ICD-9–based Injury Severity Score (ICISS; independent survival risk ratio [SRR]) for National Trauma Data Bank (NTDB) test data (A) and Nationwide Inpatient Sample (NIS) test data (B).

We then analyzed the performance of the model in specific subgroups rather than in all records within the NTDB test data set. We focused these subgroup analyses on discrimination. We first calculated AUC for patients divided into 5-year age increments (Figure 2). Discrimination was good to excellent for patients <60 years old, but declined in each subsequent age group, being only fair (AUC 0.765) among patients ≥85 years old. The HL h-statistic was lower among patients <60 years old (2,489) than those ≥60 years old (2,922). The calibration curve generated from patients ≥60 years old was shifted to the left compared to the curve generated from younger patients, showing that the model underpredicted death more often in older patients (Figure 3). While the NIS data set had an age distribution skewed toward older patients, a similar trend in age-related AUC (Figure 2) and leftward shift of the calibration curve for older patients was observed. When patients were further subdivided by type of injury (blunt versus penetrating), AUC was observed to have an age-related decline among those with blunt injuries but not among those with penetrating injuries (Figure 4).

Figure 2.

 Comparison of areas under the receiver operating curve (AUC) in 5-year age increments (lines); age distribution in each test data set (bars). NTDB = National Trauma Data Bank; NIS = Nationwide Inpatient Sample.

Figure 3.

 Calibration curves for all patients, patients <60 years and patients ≥60 years in the National Trauma Data Bank (NTDB) test data set.

Figure 4.

 Age-associated change in discrimination of the Bayesian Logistic Injury Severity Score (BLISS) model by type of injury (National Trauma Data Bank [NTDB] test data). AUC = area under the receiver operating curve.

Development of an Improved Model

We next evaluated the effects of interactions between individual injuries on model performance. Among the 2,210 injury codes occurring in the training data set, we observed 243,037 pairs and 2,497,957 triplets. The Bayesian logistic regression procedure selected 4,002 features for the model that included two-way interactions and 2,544 features for the model that included three-way interactions (Table 2). Inclusion of two-way interactions yielded a model with similar discrimination (Table 2), but improved calibration (Table 2 and Figure 1) compared to the reference model. The addition of three-way injury interactions, however, did not yield better performance compared to the model with two-way interactions.

We then evaluated the effects of several variables on model performance including age, gender, injury type, transfer status, and E-codes (Table 2). AUC for both testing data sets was only slightly greater after the addition of age either as a continuous or categorical variable (Table 2). While models that included age had similar values of the HL h-statistic, the addition of age “smoothed” the appearance of the calibration curves. While the addition of injury type had no noticeable effect on model performance, the addition of E-codes yielded a slight improvement in AUC and calibration compared to the reference model (Table 2). Models that included interactions between age and injuries and between E-codes and injuries also showed a slight improvement in both discrimination and calibration, while no effect of interactions between injury type and injury was observed (Table 2). The relative impact of individual main effect terms and their interactions was similar in both test data sets (Table 2).

Comparison of BLISS with ICISS

We next compared the performance of BLISS to that of ICISS. Consistent with previous studies of ICISS, we found that the independent SRR approach (ICISS-3) yielded the best model performance among the three ICISS methods evaluated (Table 2). Although the AUC of BLISS was only slightly better than that of ICISS, BLISS was better calibrated (Table 2, Figure 1). Inspection of calibration curves showed that the improved performance of BLISS over ICISS was most apparent among patients at lower risk for mortality (Figure 1).

Discussion

In this study, we describe the development of BLISS, a new method for risk adjustment for trauma patients based on ICD-9 codes. Similar to ICISS, this method can be used to predict death using only injury ICD-9 codes, making it useful for risk adjustment based on data from both trauma and administrative data sets. A key advantage of BLISS over ICISS is that its underlying approach is a regression method that can accommodate very large numbers of potential predictors. This feature allowed us to consider not only individual ICD-9 injury codes but also the interactions between these codes and other covariates. The reference BLISS model that included only individual ICD-9 injury codes was better calibrated than ICISS. Although BLISS and ICISS use different approaches for variable “selection,” it is interesting to note that BLISS and ICISS are based on a similar number of ICD-9 codes (1,506 and 1,590, respectively). The addition of two-way injury interactions to BLISS yielded a model with further improved calibration. While it is intuitive that unique combinations of injury have a synergistic effect on mortality, the impact of injury interactions on outcome has not been consistently observed in recent studies.27,28 Our study suggests the potential importance of injury interactions on mortality and the need to consider these interactions when developing models predicting mortality. We also observed that injury interactions with E-codes improved model performance. This finding emphasizes the potential importance of the link between the cause of injury and injury mortality.29 It is interesting to note that we did not observe improved model performance when including the type of injury (blunt versus penetrating), either as a main effect or as an interaction term, despite its importance in TRISS and its more recent variants.30

Our analysis of the performance of BLISS in subgroups of patients identified the strengths and limitations of this method. BLISS performed quite well among patients who sustained penetrating injuries, but performed less well among patients who sustained blunt injuries. Several factors may account for this finding. Penetrating injuries may be more visible or require surgery more often than blunt injuries, leading to more accurate representation of penetrating injuries than of blunt injuries using injury codes. When no autopsy is performed for patients with blunt injuries, the exact nature of the injury may not be known, while it may be more apparent among patients dying of penetrating injuries. Age-related factors may also have less impact on mortality after some types of penetrating injury than after a blunt injury. Differences in model performance among subsets of patients with different injury types are not unique to BLISS and are the basis for including injury type in the calculation of TRISS. Of note, we have also observed a similar difference based on injury type with ICISS.

We observed an important association between age and model performance. Predictive power worsened (lower AUC and higher HL h-statistic) with increasing age, particularly among patients sustaining a blunt injury. The addition of age to the model improved performance, but no further improvements were observed when the interaction of age with injury was added to the model. In other words, the increased risk of death associated with older age is more strongly associated with age itself than with age-related factors specific to individual injuries. A potential factor that may account for an age-related decline in model performance is the presence of comorbid conditions, which are more likely to be observed in older than in younger patients. This concept is consistent with calibration curves that showed increased underprediction of death among older compared to younger patients. While the NTDB contains a data field for preexisting comorbid conditions, reporting of this field has not been consistent among trauma centers and was therefore not used for our analyses. In the NIS, designators are not available to distinguish ICD-9 codes representing preexisting comorbidity from those occurring as a result of injuries or treatment of these injuries. Future revisions of BLISS will need to consider the contribution of comorbid conditions when accurate data become available.

We believe that development of an externally valid model using a nontrauma database such as the NIS is an additional challenge in the field of risk adjustment for trauma that has received little attention to date. As evidenced by its relative performance in the NTDB and NIS data sets, BLISS has excellent internal validity but more limited external validity. While we observed some improvement in model performance by the addition of age, two-way interactions, and E-codes, decreased performance remained among older patients—a finding that was most striking when the model was applied to the NIS data set. Two factors may account for more limited external validity. First, when compared with trauma registries, codes for many injuries are often omitted in administrative data sets.31 Second, most hospitals in the NIS data set are not trauma centers and likely have patients with different injury profiles and outcomes than hospitals reporting to the NTDB. Both of these factors will tend to lead to predictions that underestimate mortality, particularly among patients with multiple injuries.

Limitations

There are important limitations of BLISS that should be recognized. BLISS in its current form requires injury ICD-9 codes that are not available in the out-of-hospital or emergency department setting, limiting its use to evaluation of in-hospital deaths. A similar dependence on data not available early after injury is also observed with TRISS and ICISS. Another limitation of BLISS is that it is based on ICD-9 injury codes rather than a more structured system for classifying injury, such as the Abbreviated Injury Scale “post dot” classification or the Barell matrix.32 Because of the large number of injury ICD-9 codes, several distinct codes within the same body region may represent similar injuries with similar severity. For example, head injuries can be classified using more than 400 different codes, and fractures other than skull fractures using more than 300 different codes. While a high level of granularity may aid in the development of a more accurate model, this advantage comes at the cost of requiring a large amount of data because codes may only be rarely used and may only be variations of more common codes. The issue will become even more important when ICD-10 is adopted, because this coding system will have significantly more individual codes than ICD-9. BLISS has a theoretical advantage over ICISS for managing redundant coding, however, because it uses variable selection rather than a specified set of injury ICD-9 codes. In addition, while the methodology of using “SRRs” requires that either ICD-9 codes (ICISS) or Abbreviated Injury Scale values (as used in the Trauma Registry Abbreviated Injury Scale [TRAIS]) are selected, Bayesian logistic regression permits modeling using both coding systems simultaneously, thereby selecting the ICD-9 codes or AIS codes most predictive of death. Larger groupings of ICD-9 codes (e.g., using the Barell matrix) can also be entered into models along with individual codes, allowing the variable selection method to choose from among both individual codes and groupings.

Not surprisingly, data preparation and model development for BLISS became computationally more intensive as the number of potential predictors increased. Nevertheless, because Bayesian logistic regression favors a sparse model, we observed that the number of predictors selected after training did not exceed several thousand, even when very large numbers of predictors were initially entered. For this reason, calculation of probabilities once regression coefficients were estimated was quite efficient. Despite the increased computation time required for models with large numbers of predictors, we performed all experiments on a conventional desktop computer. The reference BLISS model can easily be implemented using software for calculating ICISS that has been modified to calculate a probability from regression coefficients, rather than a product of SRRs. Models with interaction terms require additional data processing that is within the scope of researchers with basic skills in large database manipulation. If BLISS is independently validated and gains widespread acceptance, software will be needed to make this method accessible to users with a range of data management skills. We believe that the mathematical soundness and improved accuracy of BLISS when compared to ICISS justify any additional effort required for model development and implementation.
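The difference between the two calculations can be sketched as follows. This is a minimal illustration rather than the software used in the study: the function names, the lookup-table layout, and the handling of codes absent from a table (an SRR of 1 or a coefficient of 0) are assumptions, and any numeric values supplied to these functions would be hypothetical.

```python
import math

def iciss_survival(codes, srr):
    """ICISS-style estimate of survival: the product of the survival risk
    ratios (SRRs) of the patient's injury codes.
    Assumption: a code missing from the table contributes an SRR of 1."""
    prob = 1.0
    for code in set(codes):
        prob *= srr.get(code, 1.0)
    return prob

def bliss_mortality(codes, intercept, coef):
    """Logistic estimate of death: the inverse logit of the intercept plus the
    coefficients of the codes present. Codes not retained by the sparse model
    have no entry in `coef` and therefore contribute 0."""
    z = intercept + sum(coef.get(code, 0.0) for code in set(codes))
    # Numerically stable inverse logit for large positive or negative z.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)
```

Both functions reduce to a table lookup followed by a single pass over the patient's codes, so scoring a record costs time proportional to the number of codes it carries, regardless of how many candidate predictors were entered during training. Note that the first function returns a probability of survival and the second a probability of death.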

Development of a classifier requires selection of a relevant subset of predictors while ignoring those predictors that do not improve classification. Because the number of potential predictors that we considered was quite large, one may be concerned that predictors were selected that are not “clinically plausible” in relation to trauma mortality. Relevance of a predictor must, however, be evaluated in relationship to the goal of the modeling task.33,34 Our goal was to develop an accurate classifier using a parsimonious set of predictors derived from injury ICD-9 codes rather than to identify individual predictors with clinical relevance to mortality. Many predictors that appear in the final BLISS models can be expected to be relevant to classification as well as relevant to trauma mortality. Other predictors, however, may have apparent relevance only to classification. Because our goal was to develop an accurate classifier, internal and external model validation rather than inspection of features for clinical plausibility was the appropriate approach for evaluating our results. Similarly, while numerical methods (e.g., Monte Carlo or bootstrap methods) exist for computing the variance of each coefficient for regularized regression, we have not pursued these strategies because model accuracy, rather than interpretation of the magnitude and direction of regression coefficients, is the more appropriate criterion in this setting.

Conclusions

We have presented the novel use of a regression method for predicting trauma mortality that permits evaluation of a very large number of potential predictor variables. This method has theoretical advantages over ICISS and was shown to be more accurate than ICISS in this preliminary report. Better model calibration was the most striking improvement of BLISS over ICISS, especially when two-way injury interactions were modeled. We believe that BLISS, even in its current form, is preferable to ICISS. BLISS is easily modified to include additional covariates, making it amenable to further revisions. For example, a “TRISS”-like score could be developed incorporating physiologic data in addition to age and mechanism of injury. In the development of a BLISS modification that incorporates physiologic variables, injuries could be modeled either as individual ICD-9 codes or in a more structured manner using groupings of injury ICD-9 codes. A prediction method using Bayesian logistic regression would have a potential advantage over TRISS, which models ISS (a score with a nonmonotonic association with injury mortality) as a continuous variable. We have also shown the usefulness of BLISS for studying additional factors associated with trauma death. As an example, we were able to demonstrate the importance of considering injury interactions when modeling trauma mortality. While an obvious application of BLISS will be for benchmarking performance and for risk adjustment in research studies, it will also be a useful method for studying the complex relationships between injury and outcome that require methods for managing high-dimensional data. Whether BLISS should ultimately replace more familiar and established risk adjustment methods used for these purposes will require validation by other investigators.
