An illness score is a number assigned to a patient that correlates with a probability that a specific outcome will follow. This number is computed with varying degrees of complexity from several variables. Selection of an illness score or outcome model by the clinician or researcher is based on characteristics such as validated predictive accuracy, transferability to the intended patient population, and ease of application. The use of an illness score to manage individual patients or establish a prognosis has several limitations, but illness severity scores are still a valuable and currently underutilized research tool. Although several diagnosis-specific and diagnosis-independent scores have been proposed in recent years and are being used clinically, scoring systems in general have achieved limited adoption in the veterinary clinical research setting. Potential reasons for this include lack of familiarity with score applications to research, the cumbersome or subjective nature of some of the models available, and lack of prospectively demonstrated association between model score and outcome. The purpose of this review is to discuss the many applications of validated illness severity scores, appropriate score selection, and model construction.
Illness severity scores are gaining increasing popularity in veterinary medicine. This article discusses their applications in both clinical medicine and research, reviews the caveats pertaining to their use, and discusses some of the issues that arise in appropriate construction of a score. Illness severity scores can be used to decrease bias and confounding and add important contextual information to research by providing a quantitative and objective measure of patient illness. In addition, illness severity scores can be used to benchmark performance, and establish protocols for triage and therapeutic management. Many diagnosis-specific and diagnosis-independent veterinary scores have been developed in recent years. Although score use in veterinary research is increasing, the scores available are currently underutilized, particularly in the context of observational studies. Analysis of treatment effect while controlling for illness severity by an objective measure can improve the validity of the conclusions of observational studies. In randomized trials, illness severity scores can be used to demonstrate effective randomization, which is of particular utility when group sizes are small. The quality of veterinary scoring systems can be improved by prospective multicenter validation. The prevalence of euthanasia in companion animal medicine poses a unique challenge to scores based on a mortality outcome.
APACHE — acute physiology and chronic health evaluation
AUROC — area under receiver operator characteristic
CIBDAI — canine inflammatory bowel disease activity
ICNARC — Intensive Care National Audit and Research Centre
ICU — intensive care unit
MPM — mortality probability model
PPI — proton pump inhibitors
SAPS — simplified acute physiology
SPI — severity prediction index
Applications of Illness Severity Scores
Applications for the Individual Patient
Illness severity scores may be broadly considered as either diagnosis-specific or diagnosis-independent. Diagnosis-specific scores assess particular facets of a patient's primary problem, such as proteinuria in the feline renal IRIS score,1 or stool frequency in the canine inflammatory bowel disease score.2,3 Diagnosis-independent scores provide an objective assessment of a patient's global physiologic illness status, typically derived from variables such as blood pressure and temperature.4–6 Severity scores can provide an objective tool for baseline assessment at admission or some other defined time point, or can be calculated daily for patient trending. Scores calculated for a specific patient often are presented as an “outcome prediction” score. Basing therapy on prediction of survival for the individual patient is not an appropriate use of scores. However, scores still can be useful in a clinical setting as an adjunctive tool for patient assessment, taken together with a traditional clinical assessment. Scores can provide the perspective of a database that may be larger than the experience of a single clinician,7,8 and help remove subjectivity from patient assessment.9 Inclusion of an objective validated score in the medical record may protect the clinician from accusations of false prognostication. When a clinician's assessment has included consideration of a score, accuracy of prediction has been shown to improve.8 In human medicine, head-to-head comparisons between scores and clinicians have shown that experienced clinicians can accurately and consistently outperform even sophisticated models in predicting recovery of an individual from disease.10,11 In agreement with this observation, when a veterinary score designed to predict the survival of critically ill foals was assessed, clinician prediction of outcome outperformed the score prediction, with an accuracy of 83% versus 81%.
Combining the foal score with the clinician's assessment, however, improved the accuracy of survival prediction by an additional 12%.8 Clinician assessments can be heavily swayed by recent case experience, and the inclusion of an objective score in the patient assessment process can help avoid this pitfall and ensure consistency.
Inappropriate Score Use
Illness severity scores are not designed to be used in isolation to predict outcome for individuals; predictions should be applied on a population basis. The confidence interval surrounding any prediction for an individual is substantially wider than for a group, and the introduction of this additional uncertainty into the estimate typically ensures that individual estimations are not clinically helpful. An additional issue, particularly for probability estimates, is that a probability intrinsically implies application to a population. For instance, where a dichotomous outcome is being predicted (eg, death, development of renal failure, remission failure), a score of 70% cannot predict a 70% chance of mortality for any 1 individual, but in a group of 100 similar patients, 70 may be expected to die and 30 to survive.12 Although one can confidently predict a 40% mortality rate for canine intensive care unit (ICU) patients with a severity prediction index (SPI2) of 0.6, it is impossible to discriminate the survivors from the nonsurvivors within that group.13 Similarly, the 10% of patients that survive in a 90% mortality prediction group will not have falsified the odds by surviving, but instead confirmed the validity of the probabilities.12 These issues reflect the difficulty of applying to an individual a probability estimate ranging from 0 to 1 when the individual result will be 0 or 1.9 Clinicians should be aware of this issue, and take particular care if score results are quoted to clients. Scores should not be used for prognostication of individuals, and treatment decisions should not be made solely on the result of a score, particularly where the treatment decision could be deleterious to a misclassified patient.12,14,15
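The population-level nature of such a probability can be illustrated with a short simulation (a minimal sketch; all numbers are invented): a 70% mortality prediction says nothing about any single patient, yet across repeated cohorts of 100 similar patients the expected number of deaths is reliably close to 70.

```python
import random

random.seed(1)

# A validated score assigns each of 100 similar patients a 70% mortality
# probability. No individual fate can be foretold, but the group-level
# prediction is dependable.
p_mortality = 0.70
n_patients = 100

# Simulate many hypothetical cohorts of 100 such patients.
cohort_deaths = [
    sum(random.random() < p_mortality for _ in range(n_patients))
    for _ in range(10_000)
]

mean_deaths = sum(cohort_deaths) / len(cohort_deaths)
print(round(mean_deaths))  # close to the predicted 70 deaths per 100
```

Any single simulated cohort may see anywhere from roughly 55 to 85 deaths, which is exactly why the prediction cannot be read back onto an individual patient.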
Applications in Triage and Clinician Performance Benchmarking
In human medicine, scores have been used to assist appropriate triage of patient groups, for example triage of patients waiting for coronary artery bypass grafting,16 transplant,16 and to guide appropriate site of care for pneumonia patients.17 The authors of a veterinary score assessing illness severity in canine pancreatitis patients suggested that the score be used to assist prediction of requirement for ICU care and appropriate triage.18 Appropriate roles for the use of available scores in the management or triage of groups of veterinary patients have yet to be fully defined, particularly in companion animal medicine where the focus has been on individual recovery rather than the economic impact of group management. Table 1 details several diagnosis-specific and diagnosis-independent veterinary severity scores, together with some details of their construction and validation. In general terms, use of a score as a component of the overall clinical assessment of the individual is appropriate, whereas use of the score as the sole means of justifying a treatment or euthanasia decision is not. Users of scores must take into consideration the issues of score validation and transferability (discussed below) before applying them to patients.
| Model | Outcome Parameter | Number of Variables | Model Features | Model Format | Patient Numbers for Construction | Validation Method | AUROC (if Reported) |
|---|---|---|---|---|---|---|---|
| Clinical severity index for canine acute pancreatitis18 2008 | Mortality at discharge | 4 | Data collected over a 7-year period, primary and referral population | Table | 61 dogs assessed, 14 deaths | Spearman's ρ correlation analysis on construction data set | None |
| Model estimating recovery from ARF in dogs requiring hemodialysis40 2008 | Alive and dialysis free at 30 days post discharge | 13–15 | Presumed referral population, retrospective data collection, collection period not stated | Table | 182 dogs assessed, 96 deaths | Model discrimination assessed on construction data set | 0.91 |
| Canine IBD activity index2 (CIBDAI score) 2003 | Histologic grade of lesions, serum CRP and haptoglobin | 6 | Presumed referral population, prospective data collection over 18 months | Table | 58 dogs assessed | Spearman's correlation analysis | None |
| Model estimating survival probability in hospitalized foals13 2006 | Mortality at discharge | 6 | Multicenter referral population, prospective data collection | Logistic regression equation | 577 foals, 98 deaths | Validated prospectively on an independent data set | AUC not stated for construction or validation data set; 90% sensitivity and 46% specificity for survival at cut-point selected |
| Model estimating survival probability in canine ICU patients (SPI2)8 2001 | Mortality at 30 days post admission | 7 | Multicenter referral population, prospective data collection | Logistic regression equation | 624 dogs, 243 deaths | Validated on an independent data set | AUC on construction data = 0.76, on validation data = 0.68 |
| Modified Glasgow coma scale for estimating survival in canine head trauma49 2001 | Mortality at 48 hours | Neuro exam, 3 main categories of assessment | Referral population, retrospective data collection over a 9-year period | Table | 38 dogs, 7 deaths | Logistic regression on construction data set; P value in univariate analysis reported | None |
Illness severity scores have been used as a performance measure in human medicine. Actual mortality is compared with the mortality predicted by the score to calculate a mortality ratio; units achieving lower or higher mortality rates than the scores predict can be designated as high or low achievers. Ratios of actual to predicted mortality have been used to rank the performance of physicians (eg, cardiac surgeons) or, more commonly, health care units,6,19 and to rank the performance of emergency doctors working shifts on a rolling patient population.20 Before mortality ratios can be compared as a performance measure, similarities among patient groups must be carefully established.14 This approach is an unexplored area in veterinary medicine, but as subspecialization and treatment complexity increase, the use of these tools for the objective evaluation of both individual veterinarians and unit performance may become appropriate. In conjunction with other performance measures, this approach may be a more ethical method by which to allocate clinician remuneration than a commission-oriented system.
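As a minimal sketch of the benchmarking arithmetic described above (units, counts, and probabilities are all hypothetical), the mortality ratio is simply observed deaths divided by the deaths the score predicts:

```python
# Hypothetical benchmarking sketch: compare each unit's actual deaths with
# the deaths its patients' scores predict. A ratio below 1 suggests the
# unit outperformed the score's expectation; above 1, the reverse.

def mortality_ratio(observed_deaths, predicted_probabilities):
    """Observed deaths divided by the sum of score-predicted probabilities."""
    expected = sum(predicted_probabilities)
    return observed_deaths / expected

# Unit A: 12 deaths among 50 patients whose scores each predicted a 30%
# mortality probability (expected deaths = 15).
ratio_a = mortality_ratio(12, [0.3] * 50)
# Unit B: 20 deaths against the same expected mortality.
ratio_b = mortality_ratio(20, [0.3] * 50)

print(round(ratio_a, 2), round(ratio_b, 2))  # 0.8 (high achiever), 1.33 (low achiever)
```

As the text notes, such comparisons are only meaningful once comparability of the underlying patient groups has been established.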
The ability to select the most efficacious treatment or form an accurate prognosis often is hampered by the limited availability of well-designed observational studies and randomized controlled trials in veterinary medicine. Case series and retrospective descriptive studies make up a substantial proportion of the veterinary literature. These often are performed over several years, and have many pitfalls including lack of case homogeneity and lack of defined control groups. The findings of observational studies frequently are compromised by confounding, defined as the presence of an extraneous factor distorting the relationship between the outcome and the variable under study.21 Control of confounders minimizes erroneous conclusions about the relationships between exposure and outcome. A common confounder of the relationship between treatment and outcome is illness severity. Objective quantification of illness severity facilitates analytical control of this variable, and can improve the quality of observational studies.
Use of Illness Severity Scores in the Management of Confounding
The severity of a patient's illness or derangement of physiologic status can have a substantial impact on the association between treatment and outcome. Quantifying illness severity in an objective manner facilitates the use of design or analytical techniques to remove the confounding effects of illness variation among patient groups, allowing the “true” impact of the treatment to be identified. For instance, a minimum or maximum illness score can be set as a predefined criterion for study entry. Alternatively, illness severity can be entered as a covariable with treatment in a regression analysis of treatment effect on outcome. This approach will allow the effect of treatment on outcome to be estimated, while controlling for illness severity. In this way, beneficial treatment effects can be identified that otherwise would be missed, and conversely treatments that may appear as risk factors in the initial analysis can have their true effect identified. This approach has been used in studies of human patients for many years. Several scoring systems for objectively quantifying illness severity that operate independent of primary diagnosis have been validated and are in common use in human medicine. 
These include the acute physiology and chronic health evaluation (APACHE) score,22 mortality probability model (MPM) score,23 simplified acute physiology (SAPS) score,24 and the Intensive Care National Audit and Research Centre (ICNARC) model.25 The APACHE score was first constructed in 1981 and is now in its 4th iteration.26 As an example of use, in a recent cohort study investigating administration of proton pump inhibitors (PPI) as a risk factor for nosocomial pneumonia, measurement of illness severity across the cohort by the APACHE II score was pivotal in eliminating PPI administration as an independent risk factor for pneumonia, even though the initial analysis showed a weak positive association.27 Because the initial analysis did not include illness severity, it failed to identify that PPIs generally are administered to the more critically ill, and therefore to patients more prone to acquiring pneumonia.27 A diagnosis-independent illness severity score, the SPI, has been constructed and validated for dogs,4,13 although adoption to date in veterinary research unfortunately has been limited. The SPI score has been recalibrated in a multicenter study to give the SPI2 score, which is based on 7 variables, and is diagnosis-independent and prospectively validated. The authors are not aware of any veterinary studies in which illness severity has been objectively quantified and managed in the data analysis as a covariable with outcome. However, there has been a recent trend of documenting illness severity in patient groups, with several recent studies recording the SPI or SPI2 scores of patient groups.28–30 The canine inflammatory bowel disease activity (CIBDAI) score has been used to objectively define treatment groups31 and provide a benchmark for correlation of imaging findings with clinical disease.32 With the increasing availability of appropriately validated scores in veterinary medicine, research use is likely to increase.
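The PPI example can be illustrated with a toy stratified analysis. The counts below are invented; they show how a treatment given preferentially to sicker patients appears harmful in a crude comparison, and how controlling for illness severity (here by stratification, a simpler stand-in for entering severity as a regression covariable) removes the distortion:

```python
# Hypothetical counts illustrating confounding by illness severity: the
# treatment is given mostly to severely ill patients, so a crude analysis
# makes it look harmful even though mortality is identical within each
# severity stratum.

# stratum -> (treated_deaths, treated_total, untreated_deaths, untreated_total)
strata = {
    "mild":   (2, 20, 8, 80),    # 10% mortality in both arms
    "severe": (40, 80, 10, 20),  # 50% mortality in both arms
}

def crude_rates(strata):
    """Mortality rates ignoring severity (the confounded comparison)."""
    td = sum(s[0] for s in strata.values())
    tn = sum(s[1] for s in strata.values())
    ud = sum(s[2] for s in strata.values())
    un = sum(s[3] for s in strata.values())
    return td / tn, ud / un

treated, untreated = crude_rates(strata)
print(round(treated, 2), round(untreated, 2))  # 0.42 vs 0.18: spurious harm

# Within each severity stratum, the treatment has no effect at all.
for name, (td, tn, ud, un) in strata.items():
    print(name, td / tn == ud / un)  # True for both strata
```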
Demonstration of Effective or Ineffective Randomization
The process of randomization in controlled trials is intended to equally distribute, and thus eliminate, potential confounding factors such as age or degree of illness severity among treatment groups. However, when case numbers are small or confounding factors are numerous or variable, randomization may not be successful in achieving this goal, and inaccurate estimates of a treatment effect can result. Delineating the severity of illness of animals assigned to treatment and control groups allows for both documentation and analytical control of ineffective randomization, decreasing the risk of an important treatment effect being missed or wrongly assessed.9 Observational studies are typically undertaken when randomization is impractical or unethical. In this study type, scores can be used to objectively describe or stratify patient groups by illness severity.20 Equivalent illness severity among treatment or exposure groups, despite lack of formal randomization, then can be demonstrated or disproved. In a recent study in human patients investigating the role of early hyperglycemia in survival from head trauma, blood glucose concentration was higher in the nonsurviving patients.33 Illness severity also was worse in nonsurvivors, suggesting hyperglycemia to be an epiphenomenon only. However, when illness severity score was included as a covariable in the analysis, 2 levels of blood glucose were identified that were associated with increased mortality after allowing for the effect of illness severity. Thus, blood glucose concentrations that might trigger therapeutic intervention were identified.33
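A simple numerical check of randomization balance can be sketched as follows. The baseline illness scores and the trial are invented, and the standardized mean difference is a common general-purpose balance measure (values near 0 indicate balance), offered here as an illustration rather than as part of any cited score:

```python
import statistics

# Hypothetical baseline illness scores in a small two-arm trial. A large
# standardized mean difference (SMD) indicates that randomization failed
# to balance illness severity, so the score should be carried into the
# analysis as a covariable.
treatment = [12, 15, 9, 18, 14, 11, 16, 13]
control   = [22, 19, 25, 17, 21, 24, 18, 20]

def standardized_mean_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

smd = standardized_mean_difference(treatment, control)
print(round(smd, 1))  # -2.5: a large imbalance in illness severity
```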
Provision of Objective Context
Reporting a score that has a known and previously validated association with illness severity provides important contextual information in descriptive studies, giving observations greater interpretability and external validity.12 High external validity allows study findings to be better generalized to the wider population. A retrospective veterinary study documenting use of human serum albumin used the veterinary SPI2 score in this manner.28
Reduction of Required Sample Sizes
Illness severity scores are an effective tool by which to improve power and so decrease the sample size required to detect a significant difference between treatment or exposure groups in clinical research. Stratifying patients by severity of illness to ensure patient group homogeneity can decrease the sample size required to measure an effect.14 This approach is of particular relevance to veterinary medicine where case numbers often are small. Multivariable regression can be thought of conceptually as the ultimate form of stratified analysis. In this context, the overall sample size required to identify a statistically significant measure of effect between a variable and outcome is decreased when an additional variable that explains a significant degree of the data variation is introduced.34 In a hypothetical example, consider an observational study investigating the association between medication with an analgesic and comfort level after fracture repair. The relationship between analgesic administration and comfort level will be confounded by the severity of the fracture. Clinicians are more likely to give analgesics to patients with severe fractures, and patients with severe fractures are more likely to be painful. An analysis that does not include a measure of fracture severity is likely to require a very large sample size to identify any association between analgesic administration and improved comfort in this situation, and a reverse association (ie, analgesic administration associated with less comfort) may even be identified. However, if the variable “fracture severity” is introduced into a multivariable analysis, the data variation explained by this variable is effectively “subtracted” and any association between the analgesic and patient comfort will reach statistical significance with lower patient numbers.
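The fracture example can be sketched numerically. This simulation uses invented coefficients; the point is that removing the outcome variation explained by the confounder leaves a much smaller residual variance, and the sample size required to detect a given effect scales with that variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: postoperative comfort depends strongly on fracture
# severity and weakly on analgesic dose, and clinicians dose sicker
# patients more heavily. Subtracting the variation explained by severity
# shrinks the residual variance against which the analgesic effect is
# tested.
n = 2_000
severity = rng.normal(0, 1, n)
analgesic = 0.5 * severity + rng.normal(0, 1, n)   # sicker patients get more
comfort = -2.0 * severity + 0.3 * analgesic + rng.normal(0, 1, n)

raw_var = comfort.var()

# Regress severity out of the outcome (ordinary least squares) and keep
# the residual variance.
beta = np.polyfit(severity, comfort, 1)
residual_var = (comfort - np.polyval(beta, severity)).var()

print(raw_var > 3 * residual_var)  # True: most outcome variance was severity
```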
Critical Evaluation of an Outcome Prediction Score
Components of Outcome Prediction Scores
Appropriate selection and comparison of scores for clinical or research use requires some understanding of different score construction and validation methods. The approach to assessment of these factors is reviewed below.
Assessment of Model Validity
There are several forms of validity, including construct, content, and predictive validity. A scoring system will have construct and content validity if it evaluates aspects of disease or illness known by previous research or clinical experience to correlate with severity across all aspects of the disease pathophysiology.35 Predictive validity refers to the correlation between actual and predicted outcomes of patients to whom the score is applied.35 It is this aspect of validity that is addressed in quantitative statistical score validation. The purpose of this process is to establish that the score predicts outcome reliably, and to give a quantitative measure of score performance. The process of validation ideally involves prospectively comparing predicted with actual outcomes. This is an essential step in score assessment, and is unfortunately lacking in several recent veterinary scores. Some debate exists over the appropriate population on which to perform validation. All models will perform well if tested on the same data set used for model construction: evaluating a score on the patient data that generated its associations yields a biased evaluation and an artificially inflated measure of performance.12 The area under receiver operator characteristic (AUROC) of 0.91 reported for the canine hemodialysis recovery prediction score (detailed in Table 1) likely suffered from this issue. The authors of the original SPI score acknowledged the bias inherent in this method of validation, and in response pursued validation of the SPI2 score on data collected prospectively at several centers.13 Appropriate validation can also be performed by randomly dividing the original data set into separate score construction and validation groups.
Where the available data set is small, and there is concern that too much predictive information will be lost by dividing it in this way, a statistical technique called jackknifing or bootstrapping can be used. With this approach, observations are randomly and repeatedly removed from the data set, the model is recalculated, and the predictive validity of the model is repeatedly reassessed. This technique can provide an estimate of model predictive power and stability in the absence of an independent data set.36,37 These validation techniques, although important, only evaluate internal validity, defined as validity of inferences with respect to the population used for the study. Thus, a model with excellent internal performance still may perform poorly if applied to a new and different population. Reasons for this effect include different patient populations, for instance primary versus referral, or a different patient-diagnosis mix. A model constructed and validated in an ICU with a focus on management of oncologic patients may perform quite differently in an ICU with a focus on acute trauma. Furthermore, different management techniques and different resource availability (eg, availability of mechanical ventilation) will have considerable impact on score transferability between centers. For these reasons, a multicenter approach ideally should be taken in collecting the data used to construct a score, improving score robustness and transferability. The most recent human APACHE score was calculated on data collected from 104 ICUs.22 The canine SPI2 score was calculated on data obtained from 4 centers.13 These issues should be considered before applying a score in a new context. It is appropriate when presenting a score to provide considerable detail of the population characteristics used for score construction, to allow the user to assess whether the score can appropriately be transferred to the intended patient group.
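A minimal bootstrap validation sketch follows, with an invented data set and a deliberately simple cut-point "model": the model is refitted on each resample and its predictive accuracy reassessed, yielding a stability estimate without an independent validation set:

```python
import random

random.seed(2)

# Invented illness scores and mortality outcomes for 12 patients.
scores = [3, 5, 6, 8, 9, 11, 13, 14, 16, 18, 20, 22]
died =   [0, 0, 0, 0, 1, 0,  1,  1,  0,  1,  1,  1]
data = list(zip(scores, died))

def accuracy(sample, cutpoint):
    """Fraction of patients correctly classified as death if score > cutpoint."""
    return sum((s > cutpoint) == bool(d) for s, d in sample) / len(sample)

def fit_cutpoint(sample):
    """Toy 'model': choose the cut-point maximizing accuracy on the sample."""
    candidates = sorted({s for s, _ in sample})
    return max(candidates, key=lambda c: accuracy(sample, c))

# Refit on bootstrap resamples and repeatedly reassess predictive accuracy.
accuracies = []
for _ in range(500):
    resample = [random.choice(data) for _ in data]
    cut = fit_cutpoint(resample)
    accuracies.append(accuracy(data, cut))

mean_accuracy = sum(accuracies) / len(accuracies)
print(round(mean_accuracy, 2))
```

The spread of the 500 accuracy estimates, not just their mean, is what indicates model stability; a real application would refit the full multivariable model at each iteration.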
Discrimination and Calibration
Model calibration reflects the ability of a model to predict outcomes across the relevant outcome range in a population. For example, in a model predicting a 15% mortality rate in 1 group and 90% mortality in another group, the model is said to have good calibration if 15 of 100 patients in the 1st group and 90 of 100 patients in the 2nd group do in fact die. The accuracy of a model reflects the degree to which model predictions reflect the true outcome, and is closely related to model discrimination. Model discrimination measures the ability of the model to discriminate between those individuals expected to live and those expected to die (ie, did the model identify the correct 15 and the correct 90 patients?). Discrimination and calibration are not interchangeable: they measure different characteristics, and characterization of discrimination may only be truly appropriate once good calibration has been demonstrated.9 Calibration typically is evaluated by the Hosmer-Lemeshow goodness-of-fit test,14,38–40 whereas model discrimination is evaluated by the area under the receiver operator characteristic (AUROC) curve; the ROC curve plots the true positive proportion against the false positive proportion predicted by the model at cut-points throughout the predictive range.41 An AUROC of 1.0 indicates perfect discrimination, whereas an AUROC of 0.5 indicates the model is no more predictive than a coin flip. The SPI2 score achieved an AUROC of 0.76 on the construction sample and 0.68 on the validation sample,13 whereas a score predicting canine recovery from hemodialysis achieved an AUROC of 0.91 on the construction sample.42 An example of an AUROC curve with various values is shown in Figure 1.
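Discrimination can be computed directly from predicted risks and observed outcomes: the AUROC equals the probability that a randomly chosen nonsurvivor is assigned a higher predicted risk than a randomly chosen survivor (ties counting half). A minimal sketch with invented values:

```python
# AUROC via its rank interpretation: the proportion of
# (nonsurvivor, survivor) pairs in which the nonsurvivor received the
# higher predicted risk. Risks and outcomes below are invented.

def auroc(predicted_risk, outcome):
    pos = [p for p, o in zip(predicted_risk, outcome) if o == 1]  # died
    neg = [p for p, o in zip(predicted_risk, outcome) if o == 0]  # survived
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

risk =    [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
outcome = [0,   0,   0,   1,   0,   1,   1,   1]

print(auroc(risk, outcome))  # 0.9375: one misranked pair out of 16
```

A perfectly discriminating model would score 1.0; a coin flip, 0.5. Note that this measure says nothing about calibration: multiplying every risk by 0.1 would leave the AUROC unchanged.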
Other measures of model accuracy include the R statistic from Shapiro's Q test and the Brier score.38,40 As models drift over time or are used with new populations, they tend to retain discrimination but lose calibration.43 Reporting of odds ratios of mortality (expected/predicted mortality) at various levels of score, or reporting model sensitivity and specificity at selected cut-points (eg, 83% sensitivity and 43% specificity when a score ≥ 20 is taken to predict death) are cruder evaluation techniques, as is simply reporting a higher mortality rate in patients with higher scores.18
An important issue facing model use is whether a model constructed on 1 case population can be transferred to another and retain predictive accuracy. For instance, a severity score predicting survival in canine surgical patients44 was constructed in 1994 on a population of referral patients undergoing laparotomy and subsequently managed in an ICU. This score might not transfer well to a population of patients undergoing exploratory laparotomy for foreign body removal at a general practice in 2009. Thus, transferability depends heavily on the differences between the construction population and end-user population. All models constructed to date have been found to drift in accuracy both over time and when applied to new source populations.43,45 The reasons for this drift, and potential methods to correct it in new populations,42,13 have been the subject of much debate. Clear and quantitative descriptions of the construction population can assist the end user in appropriate model selection. Geographical location, case mix, primary versus referral population mix, mortality rates, euthanasia rates, and other descriptive measures of the construction patient population may assist in establishing potential transferability. The commonly used models in humans are adjusted approximately every 5 years to account for changes over time.42 External validity tends to be better when the construction data set is obtained from multiple locations.22,13
Intended model use is also an important factor. If a score is used to stratify, characterize, or demonstrate equivalence among patient groups, then drift in the quantitative predictive power of the model over time may be less of an issue. Describing 2 patient groups as having a similar average APACHE II score of 15, or 2 groups of canine acute pancreatitis patients as having dissimilar clinical severity indices of 4 and 6, respectively, has its own internal validity and adds contextual information. This holds true even when scores are no longer well correlated to a specific mortality outcome. At the time of score construction in 1985, an APACHE II score of 15 correlated to a 17% mortality risk.46 Numerous studies since that time have demonstrated drift, with the APACHE II now consistently overpredicting mortality.26 However, by Medline search approximately 265 studies of human patients in the last year used the APACHE II score as a descriptive stratification term without reference to expected mortality. This is a valid technique employing a widely used and well-recognized clinical language to improve the power, relevance, and transferability of research findings. Use of a score in this manner loses validity when predicted mortality derived from an outdated model is compared with actual mortality as a measure of treatment effect.
Veterinary Model Types
Several of the more recent veterinary models available are detailed in Table 1. The main advantage of a disease-specific model is that model development has been performed on a relatively homogenous patient population where many of the clinical features specific to the disease process have been selected as variables. In this way, the end user of the model can expect reasonable fit and transferability of the model, often with the measurement of fewer variables than with diagnosis-independent models.47 In addition, the selected outcome often is of particular relevance to the disease in question, for instance development of systemic bacteremia after mastitis,48 requirement for long term hemodialysis after acute renal failure,42 or likelihood of recovery from surgical colic.49 Disadvantages of disease-specific models include lack of availability for rare diseases, difficulty selecting the primary disease in patients with multiple comorbidities, and lack of applicability in stratifying patient groups with multiple heterogeneous disease processes.
Models that operate independently of the primary diagnosis assume that across a wide and representative population variables that do not make reference to the primary diagnosis, such as age and physiologic status, will suffice to model patient outcome.4,25 Difficulties arise if the primary disease mix of the end-user's population differs substantially from the population on which the model was developed. An important advantage to this model type is the applicability of the score to relatively heterogeneous patient groups, or to individuals where a single primary diagnosis is difficult to assign or before diagnosis has been reached. Some models compromise between these 2 methodologies, and assign a “primary diagnosis” coefficient as a component of the final score.26
Features of Model Construction
Variables used to determine outcome probabilities vary widely depending on the outcome being evaluated. They can be categorized into patient factors, such as age, serum creatinine concentration,13 and tumor histological features,50 or factors relating to treatments and other processes of care, for example, whether the patient is admitted on a medical or surgical service.13 Variables can be weighted with respect to their relative contribution to outcome prediction; for example, age was given a double weighting compared with heart rate in the 1st SAPS score for human ICU patients.5 Variables can be selected on the basis of expert opinion,5 by multivariable logistic regression analysis of patient data in a population of patients where the outcome is known,8 or by a combination of the two. In an example of the former, the canine modified Glasgow coma score for predicting outcome after head trauma was devised on neurological markers selected by expert opinion as analogous to the human Glasgow coma scale,51 whereas the variables in the foal survival score8 were selected by regression analysis. In multivariable logistic regression analysis, the variables that are most statistically predictive of the outcome (modeled as linearly related to its log odds) are identified.8 Ideally, selected variables should be independent of treatment or care process, because these factors are likely to vary among groups of clinicians and institutions. Inclusion of variables of this nature may decrease external validity and thus limit application of the model to the wider population.14 Selection of variables that are infrequently measured or expensive to measure, however predictive, also may limit the wider applicability of the model.9 As a general rule, the more variables included in the model, the greater the estimated standard errors become, and the more dependent the model becomes on the observed data.
Using a large number of variables for score calculation may result in “over-fitting” of the model to the construction data set, and decrease model stability when it is used prospectively.52,53 It has been recommended that the number of variables contributing to a model be limited to equal (positive outcomes/10) − 1 to assist model stability.53 A recent canine pancreatitis severity prediction score was limited by the data set to the study of 14 deaths as the primary outcome, allowing assessment of only (14/10) − 1 = 0.4 variables with convincing stability.18 Finally, variable selection is performed with the aim of minimizing potential for interobserver variability, measurement error, and subjective bias. In a scoring system with a mortality outcome devised to reflect severity of injury in human burn patients, a precise definition of inhalation injury based on clinical and laboratory criteria was provided, with inhalation injury for the purposes of the score additionally defined as that requiring mechanical ventilation.54 In this way, score miscalculation because of clinician subjectivity or measurement error associated with a single test was minimized.
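The rule of thumb cited above can be written as a simple calculation, applied here both to the pancreatitis data set from the text and to a hypothetical larger data set:

```python
# Rule of thumb from the cited recommendation: limit the number of
# model variables to (positive outcomes / 10) - 1 for stability.
def max_stable_variables(positive_outcomes):
    return positive_outcomes / 10 - 1

# The pancreatitis data set with 14 deaths supports (14/10) - 1,
# that is, fewer than 1 variable with convincing stability:
print(max_stable_variables(14))   # roughly 0.4
# A hypothetical data set with 120 positive outcomes supports 11:
print(max_stable_variables(120))  # 11.0
```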
The degree to which euthanasia outnumbers “natural” death in veterinary medicine poses a unique challenge to models developed on a mortality outcome. The performance and timing of euthanasia reflect multiple factors, including severity of patient illness, owner financial and emotional status, diagnosis of a disease anticipated to be terminal at some future point, subjective assessment of the degree of suffering, and individual clinician perspective. If all euthanized patients are excluded from the model development data set, the available patient data may be limited and biased. If all patients are included regardless of euthanasia status, the significance of a particular variable as a risk factor for death may be masked by patients euthanized for financial reasons. Attempting to determine the exact reason for euthanasia and to discriminate among patient subsets on that basis can be challenging, because in-depth owner questioning at a time of emotional stress may be neither practical nor appropriate. Clinician subjectivity and clinician misconceptions also can be sources of error.55 The handling of euthanized patients in veterinary models varies from complete exclusion56 to exclusion of some subsets42 to complete inclusion.8
The outcome selected as the end-point of the model, for example death,13 survival from surgical colic,49 or recovery of dogs from renal failure with hemodialysis,42 requires careful consideration. The outcome should be specific, easily and accurately determined, and an appropriate end-point for all patients to which the model is applied. As an example, time to tumor recurrence may seem useful as an outcome closely associated with tumor malignancy; in reality, however, this outcome may be more closely associated with the type and frequency of patient screening tests than with tumor behavior. Equally, the length of an animal's ICU stay may be more reflective of an owner's financial constraints than of the patient's requirement for hospital care, the quality of care provided, or the severity of underlying disease. The outcome should be clarified with a specific definition, for instance “mortality at 30 days posthospital admission.” If a more ambiguous outcome is used, for example “occurrence of melanoma metastasis,” the means and methods by which this outcome is to be determined must be clearly stated.14
Examples of modeled outcomes include the probability of survival to hospital discharge,8 the probability that a certain length of hospital stay will be exceeded,6 the probability of recurrence of a specific tumor type,50 and the probability that a septic focus is present.56 The CIBDAI index correlated historical and physical findings with bowel histology scores.2 Outcome probabilities are calculated from the measured variables either directly or indirectly through an intermediary score. The canine Glasgow coma scale51 uses an intermediary score to model the probability of survival after head trauma. With this score, a patient is assigned a number for each of several components of a neurological examination; these are summed, and the total score then is read against a graph to obtain the percentage probability of survival. The more sophisticated and useful models allow calculation of percent outcome probability across the full range of the model, rather than giving only a selected cut-point at which death would be predicted.
Models can be presented in many different ways. At its simplest, a table is provided from which a score known to correlate with outcome can be calculated by hand at the cage side. A widely used example is the 5-point neurologic score used in patients with thoracolumbar disc disease, in which a score of 5 reflects negative deep pain status and an increased risk of failure to return to function.57 Alternatively, a graph charting score against expected outcome can be provided.51 Increasing in complexity, a logistic regression equation can be provided into which variable values are inserted and the equation solved for probability. An example of this approach is shown in Figure 2, for SP2 score calculation. Although time consuming, these types of scores still can be calculated by hand with a scientific calculator.13,42 At the extreme of complexity, score calculation in some applications requires use of a computer and continuously updated software, for which subscription fees may be required.22
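To illustrate how a logistic regression equation is solved for probability, the sketch below uses entirely hypothetical intercept, coefficients, and variable names; they are invented for illustration and are not those of the SP2 score or any published model. The weighted variable values are summed with the intercept to give the logit, and the logistic function converts the logit to a probability between 0 and 1.

```python
import math

# Hypothetical example only: the intercept and coefficients below are
# invented for illustration and do not come from any published score.
INTERCEPT = -5.0
COEFFICIENTS = {"age_years": 0.25, "creatinine_mg_dl": 1.1}

def outcome_probability(patient):
    # Linear predictor (logit): intercept + sum of coefficient * value.
    logit = INTERCEPT + sum(
        coef * patient[name] for name, coef in COEFFICIENTS.items()
    )
    # Logistic function converts the logit to a probability in (0, 1).
    return 1 / (1 + math.exp(-logit))

p = outcome_probability({"age_years": 10, "creatinine_mg_dl": 2.0})
print(f"Predicted outcome probability: {p:.2f}")  # about 0.43
```

This is the calculation a clinician performs by hand with a scientific calculator; the advantage of the equation form is that it returns a probability across the full range of the model rather than a single cut-point.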
Model Building Process
When intensive care clinicians in human medicine were polled with respect to desirable score properties, they requested a score that was simple and easy to calculate, and that reflected the degree of physiologic derangement of the patient.25 The challenge of model building is to balance simplicity in use and parsimony in the number of variables required against predictive power. Failure to achieve this balance is likely to result in poor model adoption. The process by which the model is constructed, whether by “expert opinion” selection of variables, multivariable logistic regression analysis, or complex neural network techniques, is unimportant, as long as the model is demonstrated to have good internal, external, and predictive validity.
What can we learn from illness severity scores? We can decrease bias and confounding and add important contextual information to our research. We can supplement our clinical judgment with objective measures of patient illness. We can benchmark performance and establish protocols for triage and therapeutic management based on objective measures. Many diagnosis-specific and diagnosis-independent veterinary scores have been presented in recent years. Score adoption may be maximized by the development of scores that can be calculated simply, are well validated, summarize a patient's physiologic condition, and reflect an accepted and trusted development methodology.25 Clinician and researcher access to illness severity scores and score updates would be greatly assisted by the establishment of a central online score repository, perhaps associated with one of the specialty colleges. The prevalence of euthanasia in companion animal medicine poses a unique challenge to scores based on a mortality outcome.
This study is supported by a grant from the Ontario Veterinary College Pet Trust.