# Clinical studies utilising ordinal data: Pitfalls in the analysis and interpretation of clinical grading systems

Article first published online: 2 JUN 2011

DOI: 10.1111/j.2042-3306.2011.00414.x

© 2011 EVJ Ltd

#### How to Cite

Boden, L. (2011), Clinical studies utilising ordinal data: Pitfalls in the analysis and interpretation of clinical grading systems. Equine Veterinary Journal, 43: 383–387. doi: 10.1111/j.2042-3306.2011.00414.x

#### Publication History

- Issue published online: 2 JUN 2011

Veterinary clinicians often utilise and report grading systems in studies that evaluate the degree of severity of clinical findings such as lameness, laryngeal function and heart murmurs (Steiss *et al*. 1989; Christley *et al*. 1997; Ramzan *et al*. 2003; Schumacher *et al*. 2003; Hopper *et al*. 2004; van Eps and Pollitt 2004; Arkell *et al*. 2006; Helwegen *et al*. 2006; Lightfoot *et al*. 2006; Lindegaard *et al*. 2007; Church *et al*. 2009; Perkins *et al*. 2009; Viñuela-Fernández *et al*. 2011; Barakzai and Dixon 2011; Davidson *et al*. 2011; Menzies-Gow *et al*. 2011; Tranquille *et al*. 2011; Witte *et al*. 2011). The degree of severity of clinical outcomes is typically measured on an ordinal scale using rankings or grades that can be described in words (mild, moderate or severe), letters (A–D) or numbers (1–5). These grades are often assigned to findings for which precise measurements on a continuous scale are not available (Plant *et al*. 2007). As such, they are approximate, subjective measures, and there is often debate over the usefulness of such grading systems, particularly with respect to their agreement with other grading systems and their repeatability and reproducibility when used by the same or different clinicians. Despite these difficulties, grading systems are sometimes employed as a basis of comparison of clinical outcomes in studies investigating the efficacy of one or more clinical interventions (Barakzai *et al*. 2009).

Although these types of studies appear frequently in the veterinary literature, clinicians should not be complacent and assume that these studies have been appropriately reported and analysed. A cautious and critical appraisal of the methodology used in each study is essential. It is not uncommon for studies using ordinal data to be published despite using incorrect statistical methods. For example, in the medical literature, ordinal data were presented appropriately in only 39–49% of relevant papers and analysed appropriately in only 57–63% (Avram *et al*. 1985; LaValley and Felson 2002; Jakobsson 2004; Jakobsson and Westergren 2005; Plant *et al*. 2007).

This paper broadly describes the different types of data, in particular those used in clinical grading systems. It specifically addresses appropriate and inappropriate methods of presenting, analysing and interpreting ordinal data.

### Types of data

A summary of the types of data utilised in clinical studies is described in Figure 1. The types of data obtained in a study dictate the types of statistical analyses that can be performed. Data can be described as either quantitative (numerical data) or qualitative (categorical data). Quantitative data can be represented as continuous interval data or ratio data. Interval data have an arbitrary zero value (e.g. temperature, where 0°C is an arbitrary point) (Plant *et al*. 2007), while ratio scales have a meaningful zero reference point (e.g. quantity of different blood parameters, where the value zero means an absence of that parameter). Qualitative data are either nominal or ordinal data. Nominal data comprise unordered categories such as gender, breed, colour and diagnosis. Because of the descriptive nature of these categories, the data cannot be ranked (Stevens 1946; Plant *et al*. 2007). For example, a Thoroughbred horse is different from a Shetland pony, but one of these values cannot be ranked ‘higher’ or ‘more’ than the other (Plant *et al*. 2007). Ordinal data comprise ordered categories that can be applied to describe a clinical outcome (e.g. normal, mild, moderate, severe or numerical rankings 0, 1, 2, 3, 4). Numerical ordinal data are often used in grading systems to describe clinical pathology, such as the 4 point (Hackett *et al*. 1991), 5 point (Lane 1993), 6 point (Dixon *et al*. 2001) and Havemeyer grading systems for laryngeal function. The limitation with these data is that the interval or distance between each grade is arbitrary in terms of its clinical meaning (Plant *et al*. 2007).
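The arbitrary spacing between grades can be made concrete with a small sketch. The grades and the relabelling below are hypothetical and chosen only for illustration: an order-preserving relabelling of the grades reverses the comparison of the two group means, whereas the comparison of the medians is unchanged.

```python
from statistics import mean, median

# Hypothetical grades (0 = normal ... 4 = severe) for two groups of horses.
group_a = [0, 0, 0, 4, 4]
group_b = [1, 2, 2, 2, 2]

# Order-preserving relabelling: the ranking of the grades is unchanged,
# only the (arbitrary) distance between grade 3 and grade 4 differs.
relabel = {0: 0, 1: 1, 2: 2, 3: 3, 4: 10}


def recode(grades):
    """Apply the hypothetical relabelling to a list of grades."""
    return [relabel[g] for g in grades]


for label, a, b in [("original", group_a, group_b),
                    ("relabelled", recode(group_a), recode(group_b))]:
    print(label, "means:", mean(a), mean(b), "medians:", median(a), median(b))
```

With the original codes group B has the higher mean; after relabelling group A does. The medians rank the groups the same way under both codings, which is why rank-based summaries are preferred for ordinal grades.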

### Grading systems

The quality of the grading system for a specific clinical outcome is fundamental to any study using these types of measures (Streiner and Norman 2008). As a veterinarian basing clinical decisions on good evidence-based medicine, one should be aware of whether a particular grading system has been assessed appropriately. This includes consideration of the following:

- What is the type of data: quantitative or qualitative (nominal or ordinal)?
- Have grades within the grading system been assigned objectively?
  - (i) Have grading systems been measured independently and without bias?
- How consistent are grading systems?
  - (i) Repeatability (intraobserver agreement)?
  - (ii) Reproducibility (interobserver agreement)?
- What is the relationship between different grading systems if more than one grading system exists?
- Have grading systems been appropriately validated with respect to a particular clinical outcome? In other words, are the grades within a grading system strongly correlated with degrees of a clinical outcome and/or can they be reliably used to predict a clinical outcome?

If grading systems have not been appropriately and reliably assessed, it is difficult to justify their usefulness as a comparative measure of the efficacy of one or more interventions on a clinical outcome.

### Data presentation and analysis

Data presentation and analysis need to account for the intrinsic nature of the data collected in a study. Parametric statistical tests make specific distributional assumptions about the data (e.g. that they are normally distributed), whereas nonparametric statistical tests make few or no assumptions about the form of the underlying data distribution (Petrie and Watson 2006). Therefore, some parametric statistical methods (including the *t* test and ANOVA) are not appropriate for ordinal or nominal data because the data violate the assumptions on which these tests are based. Although less powerful, nonparametric techniques are often the most appropriate statistical methods for the analysis of ordinal or nominal data. Table 1 contrasts commonly used but inappropriate methods with appropriate statistical tests for ordinal data.

| Purpose of analysis | Commonly used but inappropriate tests | Appropriate tests | Null hypothesis | Useful references |
|---|---|---|---|---|
| Descriptive | Mean, standard deviation, 95% confidence intervals, line plots, histograms | Median, mode, interquartile range, range, box plots (box and whiskers) | | |
| Measuring agreement | Measures of correlation between grades do not measure agreement. | Weighted Kappa | Null hypothesis is that any agreement occurs purely by chance. Assumes ordinal data use similar scales and also depends on prevalence of condition tested for. | Cohen (1960, 1968); Siegel and Castellan (1988); Agresti (1990); Ludbrook (2002); Fleiss et al. (2003); Dohoo et al. (2010) |
| Measuring correlation (covariation) | Pearson's product moment correlation coefficient is inappropriate if the association between variables is not linear. | Spearman's rank correlation coefficient; Kendall's Tau rank correlation coefficient; Somers' D coefficient; Goodman-Kruskal Gamma coefficient; Kendall's coefficient of concordance; Chi-squared tests (binary) | Null hypothesis is that 2 ordinal variables are independent (i.e. no association); the ranks of one variable do not covary with the ranks of the other variable. In other words, as the ranks of one variable increase, the ranks of the other variable are not more likely to increase (or decrease). | Kendall (1938); Siegel and Castellan (1988) |
| Measuring strength and direction of relationship (prediction) | Linear regression | Ordinal logistic regression (proportional odds); quantile regression; Jonckheere-Terpstra test for trend; Page's test for trend; Bartholomew's test for trend | Null hypothesis is that predictor variables and an ordinal outcome are independent (i.e. no association). | Siegel and Castellan (1988); Fleiss et al. (2003); Liu and Agresti (2005); Koenker (2005); Dohoo et al. (2010) |
| Measuring differences; assessing efficacy of an intervention (change in grade) | *t* test (paired or unpaired); ANOVA | Mann-Whitney U; Fligner-Policello; Wilcoxon signed rank; Friedman; Kolmogorov-Smirnov; Cramer-von Mises | Null hypothesis is that ranks or grades are similar in both groups. A Mann-Whitney test can be regarded as a comparison of the median values, on the assumption that the 2 distributions are similar in shape (with similar variances). | Siegel and Castellan (1988); Sprent (1993); Zar (2010) |
| | | Strickland Lu test | Difference between changes observed in 2 independent groups; degree-of-change data for each group and category representing ‘no change’ must be specified. | Strickland and Lu (2003) |

It is intuitive to recognise that one cannot calculate a mean (average) breed of a group of horses that includes Thoroughbreds, Shetlands, Arabs and Cobs, or the mean severity of lesions that are ranked using words such as mild, moderate or severe (Plant *et al*. 2007). Equally, other summary statistics such as the standard deviation or the sum of these categories cannot be calculated, as these calculations assume that there is equal spacing between categories. It is less intuitive to recognise that numerically ranked data, such as the 6 point grading system (Dixon *et al*. 2001) for laryngeal function, cannot be described using a mean, standard deviation or sum either. This is because it is easy to forget the assumptions behind the allocation of a numerical grade or rank. As a result, these numerical ordinal data are often misinterpreted as continuous interval or ratio data and described incorrectly. Ordinal data should be described by the median, range and/or interquartile range, the mode, or by proportions or percentages within each score or rank. Nonparametric graphs such as box-plots (box and whisker plots) should be utilised instead of other types of summary plots such as line graphs (using means and confidence intervals). Inclusion of tables with frequencies and percentages of each grade may be a more transparent and useful way of presenting the data than either type of graph.
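As an illustration, the summaries recommended above can be computed with Python's standard library. This is a minimal sketch: the grades are hypothetical, and the `summarise` helper and the choice of the "inclusive" quartile method are assumptions made for the example, not part of any published grading protocol.

```python
from collections import Counter
from statistics import median, multimode, quantiles

# Hypothetical grades on a 6 point scale for 13 horses.
grades = [1, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 5, 6]


def summarise(grades):
    """Summarise ordinal grades without using a mean or standard deviation."""
    q1, _, q3 = quantiles(grades, n=4, method="inclusive")
    counts = Counter(grades)
    n = len(grades)
    return {
        "median": median(grades),
        "interquartile range": (q1, q3),
        "range": (min(grades), max(grades)),
        "mode": multimode(grades),
        # Frequency and percentage of horses within each grade.
        "frequencies": {g: (counts[g], round(100 * counts[g] / n, 1))
                        for g in sorted(counts)},
    }


summary = summarise(grades)
for name, value in summary.items():
    print(name, value)
```

With an even number of observations `median` may return a value lying between two grades; `statistics.median_low` is one alternative if a reported median must itself be a grade.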

Studies using ordinal grading systems may aim to measure consistency within grading system(s). An assessment of intra- or inter-observer agreement is usually accomplished using Kappa statistics. A Kappa statistic takes into consideration the potential for any agreement between grades due to chance. The magnitude of the Kappa depends on the extent of the agreement as well as the prevalence of the condition being tested for (Dohoo *et al*. 2010). A weighted Kappa is usually the most appropriate test for ordinal grades (Dohoo *et al*. 2010). This is because, if a grading system is measured on a 6 point scale, grades of 6 and 5 should be considered in closer agreement than grades of 6 and 1 (Dohoo *et al*. 2010). A weight matrix specifies how much agreement should be assigned to pairs of grades which are in partial agreement. However, there are no clear guidelines about how a weight matrix should be defined, and changing a weight matrix can completely alter the apparent degree of agreement (G. Innocent, personal communication). The weight matrix used in the Kappa analysis should therefore be described in the materials and methods of studies employing this type of analysis.
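The sensitivity of Kappa to the weight matrix can be sketched as follows. The function below is an illustrative implementation of Cohen's Kappa with two commonly used weighting schemes (linear and quadratic); the gradings are hypothetical and the function name and arguments are assumptions made for this example.

```python
def weighted_kappa(rater_a, rater_b, n_grades, weights="linear"):
    """Cohen's kappa for grades coded 1..n_grades, with optional agreement weights."""
    n = len(rater_a)
    counts = [[0] * n_grades for _ in range(n_grades)]
    for a, b in zip(rater_a, rater_b):
        counts[a - 1][b - 1] += 1
    # Marginal proportions of each grade for the two observers.
    row = [sum(counts[i]) / n for i in range(n_grades)]
    col = [sum(counts[i][j] for i in range(n_grades)) / n for j in range(n_grades)]

    def weight(i, j):
        """Agreement credit (1 = full, 0 = none) for a pair of grades."""
        distance = abs(i - j) / (n_grades - 1)
        if weights == "unweighted":
            return 1.0 if i == j else 0.0
        if weights == "linear":
            return 1.0 - distance
        if weights == "quadratic":
            return 1.0 - distance ** 2
        raise ValueError(f"unknown weighting scheme: {weights}")

    cells = [(i, j) for i in range(n_grades) for j in range(n_grades)]
    observed = sum(weight(i, j) * counts[i][j] / n for i, j in cells)
    expected = sum(weight(i, j) * row[i] * col[j] for i, j in cells)
    return (observed - expected) / (1.0 - expected)


# Hypothetical gradings of 12 horses by two observers on a 6 point scale:
# six exact agreements and six disagreements of a single grade.
observer_1 = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
observer_2 = [1, 2, 3, 4, 5, 6, 2, 1, 4, 3, 6, 5]

kappas = {scheme: weighted_kappa(observer_1, observer_2, 6, scheme)
          for scheme in ("unweighted", "linear", "quadratic")}
for scheme, value in kappas.items():
    print(scheme, round(value, 3))
```

For these data the same gradings give a Kappa of 0.40 unweighted, about 0.74 with linear weights and about 0.91 with quadratic weights, which is why the weighting scheme needs to be reported.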

Some studies incorrectly report correlation coefficients rather than Kappa statistics when referring to intra- or inter-observer agreement within grading systems. However, it is critical to appreciate that measuring correlation simply measures the association between different grades (i.e. how much the distribution of grades varies) assessed during repeated measurements by the same or different observers. Thus a correlation coefficient does not measure repeatability or reproducibility, nor does it measure the ability of the grading system to predict another grading system or to predict the clinical outcome (Bland and Altman 1986). However, if the aim of the study is to determine whether the grades of a grading system increase or decrease systematically (i.e. co-vary) with a clinical outcome or perhaps with another grading system, Spearman's rank correlation coefficient and Kendall's Tau rank correlation coefficient can be used (Table 1). It should be noted that rank correlations are adversely affected by ties in ranks (which are inevitable when considering a grading system with relatively few levels compared to the number of observations). The parametric Pearson's product moment correlation coefficient is an inappropriate test because typically, ordinal data do not meet the required underlying assumptions of linearity, homoscedasticity and normality. Thus, it is very important for authors to examine these assumptions by investigating the nature of association between the data (using scatter plots) before calculating a correlation coefficient.
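The difference between correlation and agreement can be shown with a short sketch. The data are hypothetical: a second examination that always scores one grade higher than the first. Spearman's coefficient is computed here as Pearson's coefficient applied to mid-ranks (tied grades share the average of their ranks); the helper names are assumptions made for this example.

```python
import math


def midranks(values):
    """Rank the values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's product moment correlation coefficient."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson's coefficient of the mid-ranks."""
    return pearson(midranks(x), midranks(y))


# Hypothetical grades: the second examination is always one grade higher.
first_exam = [1, 1, 2, 2, 3, 3, 4, 5]
second_exam = [grade + 1 for grade in first_exam]

rho = spearman(first_exam, second_exam)
exact_agreement = sum(a == b for a, b in zip(first_exam, second_exam)) / len(first_exam)
print("Spearman rho:", round(rho, 3), "exact agreement:", exact_agreement)
```

The rank correlation is perfect although the two examinations never assign the same grade, so a high correlation coefficient on its own says nothing about repeatability.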

The strength (magnitude) and direction of a relationship between a grading system and other parameters can be measured using regression techniques suited to ordinal data. Regression enables us to describe the relationship between one variable and another, and to predict one from the other, using an appropriate equation. These analyses may be quite complex and require advice from an experienced statistician, particularly if both the outcome and the predictor variable are of an ordinal nature. Such techniques include ordinal logistic regression models (proportional odds models) and quantile regression models. Linear regression using a grading system as an outcome is an inappropriate technique for ordinal data.
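To show what a proportional odds model describes, the sketch below evaluates one for given parameter values rather than fitting it to data. It assumes the common parameterisation logit P(grade ≤ j | x) = threshold_j − beta·x; the thresholds, the slope `beta` and the function names are hypothetical values chosen for illustration.

```python
import math


def cumulative_probabilities(x, thresholds, beta):
    """P(grade <= j | x) per threshold, from logit P(grade <= j | x) = threshold_j - beta * x."""
    return [1.0 / (1.0 + math.exp(-(t - beta * x))) for t in thresholds]


def grade_probabilities(x, thresholds, beta):
    """P(grade = j | x) for every grade, by differencing the cumulative probabilities."""
    cumulative = cumulative_probabilities(x, thresholds, beta) + [1.0]
    return [cumulative[0]] + [cumulative[j] - cumulative[j - 1]
                              for j in range(1, len(cumulative))]


# Hypothetical values for a 4 grade scale: three increasing thresholds, one slope.
thresholds = [-1.0, 0.5, 2.0]
beta = 0.8

for x in (0, 1, 2):
    print(x, [round(p, 3) for p in grade_probabilities(x, thresholds, beta)])
```

Under this parameterisation a positive `beta` shifts probability towards higher grades as the predictor increases, and the odds of being at or below any given grade change by the same factor, exp(beta), for every one-unit change in the predictor. That shared factor is the "proportional odds" assumption, and it should be checked when such a model is fitted.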

Finally, a grading system is often used as a measure of the efficacy of one or more interventions on a clinical outcome (e.g. studies that compare 2 interventions with respect to an outcome measured by a grading system). In order to measure the effect of treatments on a clinical outcome, differences in grades (possibly measured repeatedly over multiple times) are typically analysed using statistical tests such as Wilcoxon signed rank (paired groups), Mann-Whitney U (unpaired groups), Kruskal-Wallis (2 or more unmatched groups), Friedman (2 or more matched groups), Kolmogorov-Smirnov, Cramer-von Mises and Strickland Lu tests. The specific null hypothesis for each test may vary and should be checked before the method is used. Sometimes differences in grades are summarised into a binary outcome (improvement or no improvement after an intervention) and the relationship between this outcome and the intervention can then be analysed using Chi-squared tests or logistic regression. The obvious limitation with these latter methods is that the magnitude of the increase or decrease in grade cannot be taken into account.
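The loss of information from a binary summary can be seen in a small sketch. The improvements in grade are hypothetical, and the Mann-Whitney U statistic is computed directly by counting pairs (ties count as one half); no p-value is calculated here, which would normally come from tables or from a library routine such as `scipy.stats.mannwhitneyu`.

```python
def mann_whitney_u(sample_a, sample_b):
    """U statistic for sample_a: pairs in which a exceeds b, counting ties as one half."""
    u = 0.0
    for a in sample_a:
        for b in sample_b:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


def proportion_improved(improvements):
    """Binary summary: fraction of horses whose grade improved at all."""
    return sum(1 for g in improvements if g > 0) / len(improvements)


# Hypothetical number of grades by which each horse improved after two interventions.
treatment_x = [1, 1, 1, 0, 1]
treatment_y = [3, 2, 3, 0, 2]

u_y = mann_whitney_u(treatment_y, treatment_x)
u_x = mann_whitney_u(treatment_x, treatment_y)
print("U for Y:", u_y, "U for X:", u_x)
print("improved:", proportion_improved(treatment_x), proportion_improved(treatment_y))
```

Both interventions give the same proportion improved (0.8), so an analysis of the binary outcome sees no difference, whereas the rank-based statistic (20.5 of 25 pairs favouring treatment Y) reflects the larger improvements in that group.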

### Conclusions

Although studies in the veterinary literature frequently utilise ordinal data to describe grading systems for clinical outcomes, some of these published studies present and analyse the data inappropriately as their limitations have not been adequately exposed through the peer review process. As a result, future authors may inappropriately repeat these techniques and mistakes (Plant *et al*. 2007).

Inappropriate reporting and analysis of ordinal data may result in unjustified conclusions. Nonparametric tests are not as powerful as parametric tests, and as such result in a lower probability of rejecting a false null hypothesis. Conversely, inappropriate use of parametric tests in these types of studies may result in inappropriate conclusions being drawn due to an increase in *type I* errors. In other words, clinicians could infer that significant differences between grading systems or treatments exist, when, in fact, there is no difference (Plant *et al*. 2007; Dohoo *et al*. 2010).

The availability of user-friendly statistical software has made it easier for clinicians to neglect the advice of a statistician or epidemiologist and to present data in a way which can be misleading (Plant *et al*. 2007). With this in mind, the statistical tests described in this paper are not meant to form an exhaustive or prescriptive list of methods for analysing ordinal data. All parametric and nonparametric tests are subject to a number of assumptions, and using them well requires a good knowledge of when each test is appropriate. As such, a summary paper such as this will never take the place of good statistical advice sought prior to study design. However, in order to practise good evidence-based medicine, readers (clinicians), peer reviewers and editors of veterinary publications should make themselves familiar with appropriate and inappropriate statistical methodologies so that they are able to be vigilant and critical of potentially misleading results which may be reported in these types of studies.

### Author's declaration of interests

The author did not declare any conflict of interest.

### Source of funding

Defra is the source of funding for the author's postdoctoral research position at the University of Glasgow.

### Acknowledgements

The author would like to acknowledge Garry Anderson, University of Melbourne and Dr Giles Innocent, Biomathematics and Statistics Scotland (BioSS), for providing constructive feedback on this paper.

### References

- Abramson (2011) WINPEPI updated: Computer programs for epidemiologists, and their teaching potential. Epidemiol. Perspect. Innov. 8, 1. doi: 10.1186/1742-5573-8-1
- Agresti (1990) Categorical Data Analysis, John Wiley & Sons, New York.
- Arkell *et al*. (2006) Evidence of bias affecting the interpretation of the results of local anaesthetic nerve blocks when assessing lameness in horses. Vet. Rec. 159, 346-349.
- Avram *et al*. (1985) Statistical methods in anesthesia articles. An evaluation of two American journals during two six-month periods. Anesth. Analg. 64, 607-611.
- Barakzai and Dixon (2011) Correlation of resting and exercising endoscopic findings for horses with dynamic laryngeal collapse and palatal dysfunction. Equine vet. J. 43, 18-23.
- Barakzai *et al*. (2009) Race performance after laryngoplasty and ventriculocordectomy in National Hunt racehorses. Vet. Surg. 38, 941-945.
- Bland and Altman (1986) Statistical methods for assessing agreement between two methods of clinical measurement. Lancet i, 307-310.
- Christley *et al*. (1997) Cardiorespiratory responses to exercise in horses with different grades of idiopathic laryngeal hemiplegia. Equine vet. J. 29, 6-10.
- Church *et al*. (2009) Evaluation of discriminant analysis based on dorsoventral symmetry indices to quantify hindlimb lameness during over ground locomotion in the horse. Equine vet. J. 41, 304-308.
- Cohen (1960) A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20, 37-46.
- Cohen (1968) Weighted Kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. Psychol. Bull. 70, 213-220.
- Davidson *et al*. (2011) Exercising upper respiratory videoendoscopic evaluation of 100 nonracing performance horses with abnormal respiratory noise and/or poor performance. Equine vet. J. 43, 3-8.
- Dixon *et al*. (2001) Laryngeal paralysis: A study of 375 cases in a mixed-breed population of horses. Equine vet. J. 33, 452-458.
- Dohoo *et al*. (2010) Veterinary Epidemiologic Research, 2nd edn., VER Inc., Charlottetown.
- Fleiss *et al*. (2003) Statistical Methods for Rates and Proportions, 3rd edn., John Wiley and Sons, New York.
- Hackett *et al*. (1991) The reliability of endoscopic examination in assessment of arytenoid cartilage movement in horses. Part I: Subjective and objective laryngeal evaluation. Vet. Surg. 20, 174-179.
- Helwegen *et al*. (2006) Measurements of right ventricular internal dimensions and their relationships to severity of tricuspid valve regurgitation in National Hunt Thoroughbreds. Equine vet. J., Suppl. 36, 171-177.
- Hopper *et al*. (2004) Radiographic evaluation of sclerosis of the third carpal bone associated with exercise and the development of lameness in Standardbred racehorses. Equine vet. J. 36, 441-446.
- Jakobsson (2004) Statistical presentation and analysis of ordinal data in nursing research. Scand. J. Caring Sci. 18, 437-440.
- Jakobsson and Westergren (2005) Statistical methods for assessing agreement for ordinal data. Scand. J. Caring Sci. 19, 427-431.
- Kendall (1938) A new measure of rank correlation. Biometrika 30, 81-89. doi: 10.1093/biomet/30.1-2.81
- Koenker (2005) Quantile Regression, Cambridge University Press, Cambridge.
- Lane (1993) Recurrent laryngeal neuropathy: Current attitudes to aetiology, diagnosis and treatment. *Proceedings of the 15th Bain-Fallon Memorial Lectures, ACVA, Artamon*. pp 173-192.
- LaValley and Felson (2002) Statistical presentation and analysis of ordered categorical outcome data in rheumatology journals. Arthritis Rheum. 47, 255-259.
- Lightfoot *et al*. (2006) An echocardiographic and auscultation study of right heart responses to training in young National Hunt Thoroughbred horses. Equine vet. J., Suppl. 36, 153-158.
- Lindegaard *et al*. (2007) Sedation with detomidine and acepromazine influences the endoscopic evaluation of laryngeal function in horses. Equine vet. J. 39, 553-556.
- Liu and Agresti (2005) The analysis of ordered categorical data: An overview and survey of recent developments. Sociedad Estadistica e Investigacion Operativa Test 14, 1-73.
- Ludbrook (2002) Statistical techniques for comparing measurers and methods of measurement: A critical review. Clin. Exp. Pharmacol. Physiol. 29, 527-536.
- Menzies-Gow *et al*. (2011) Repeatability and reproducibility of the Obel grading system for equine laminitis. Vet. Rec. 167, 52-55.
- Perkins *et al*. (2009) Variability of resting endoscopic grading for assessment of recurrent laryngeal neuropathy in horses. Equine vet. J. 41, 342-346.
- Petrie and Watson (2006) Statistics for Veterinary and Animal Science, 2nd edn., Blackwell Publishing Ltd., Oxford.
- Plant *et al*. (2007) Frequency of appropriate and inappropriate presentation and analysis methods of ordered categorical data in the veterinary dermatology literature from January 2003 to June 2006. Vet. Dermatol. 18, 260-266.
- Ramzan *et al*. (2003) The application of a scintigraphic grading system to equine tibial stress fractures: 42 cases. Equine vet. J. 35, 382-388.
- Schumacher *et al*. (2003) The effects of local anaesthetic solution in the navicular bursa of horses with lameness caused by distal interphalangeal joint pain. Equine vet. J. 35, 502-505.
- Siegel and Castellan (1988) Nonparametric Statistics for the Behavioural Sciences, 2nd edn., McGraw-Hill, New York.
- Sprent (1993) Applied Nonparametric Statistical Methods, 2nd edn., Chapman & Hall, London.
- Steiss *et al*. (1989) Electroacupuncture in the treatment of chronic lameness in horses and ponies: A controlled clinical trial. Can. J. vet. Res. 53, 239-243.
- Stevens (1946) On the theory of scales of measurement. Science 103, 677-680.
- Streiner and Norman (2008) Health Measurement Scales: A Practical Guide to Their Development and Use, Oxford University Press, Oxford.
- Strickland and Lu (2003) Estimates, power and sample size calculations for two-sample ordinal outcomes under before-after study designs. Stat. Med. 22, 1807-1818.
- Tranquille *et al*. (2011) Histopathologic features of distal tarsal joint cartilage and subchondral bone in ridden and pasture-exercised horses. Am. J. vet. Res. 72, 33-41.
- van Eps and Pollitt (2004) Equine laminitis: Cryotherapy reduces the severity of the acute lesion. Equine vet. J. 36, 255-260.
- Viñuela-Fernández *et al*. (2011) Comparison of subjective scoring systems used to evaluate equine laminitis. Vet. J. 188, 171-177.
- Witte *et al*. (2011) Association of owner-reported noise with findings during dynamic respiratory endoscopy in Thoroughbred racehorses. Equine vet. J. 43, 9-17.
- Zar (2010) Biostatistical Analysis, 4th edn., Prentice Hall, London.