Clinical studies utilising ordinal data: Pitfalls in the analysis and interpretation of clinical grading systems


  • L. Boden

    1. Boyd Orr Centre for Population and Ecosystem Health, Institute of Biodiversity, Animal Health and Comparative Medicine, College of Medical, Veterinary and Life Sciences, University of Glasgow, UK.

Veterinary clinicians often utilise and report grading systems in studies that evaluate the degree of severity of clinical findings such as lameness, laryngeal function and heart murmurs (Steiss et al. 1989; Christley et al. 1997; Ramzan et al. 2003; Schumacher et al. 2003; Hopper et al. 2004; van Eps and Pollitt 2004; Arkell et al. 2006; Helwegen et al. 2006; Lightfoot et al. 2006; Lindegaard et al. 2007; Church et al. 2009; Perkins et al. 2009; Viñuela-Fernández 2011; Barakzai and Dixon 2011; Davidson et al. 2011; Menzies-Gow et al. 2011; Tranquille et al. 2011; Witte et al. 2011). The degree of severity of clinical outcomes is typically measured on an ordinal scale using rankings or grades that can be expressed in words (mild, moderate or severe), letters (A–D) or numbers (1–5). These grades often describe findings for which precise measurements on a continuous scale are not available (Plant et al. 2007). They are therefore approximate, subjective measures, and there is often debate over the usefulness of such grading systems, particularly with respect to their agreement with other grading systems and their repeatability and reproducibility when used by the same or different clinicians. Despite these difficulties, grading systems are sometimes employed as a basis of comparison of clinical outcomes in studies investigating the efficacy of one or more clinical interventions (Barakzai et al. 2009).

Although these types of studies appear frequently in the veterinary literature, clinicians should not be complacent and assume that these studies have been appropriately reported and analysed. A cautious and critical appraisal of the methodology used in each study is essential. It is not uncommon for studies using ordinal data to be published despite using incorrect statistical methods. For example, in the medical literature, ordinal data were presented appropriately in only 39–49% of relevant papers and analysed appropriately in only 57–63% (Avram et al. 1985; LaValley and Felson 2002; Jakobsson 2004; Jakobsson and Westergren 2005; Plant et al. 2007).

This paper broadly describes different data types, in particular those used in clinical grading systems. It specifically addresses appropriate and inappropriate methods of presenting, analysing and interpreting ordinal data.

Types of data

Figure 1 summarises the types of data utilised in clinical studies. The types of data obtained in a study dictate the types of statistical analyses that can be performed. Data can be described as either quantitative (numerical data) or qualitative (categorical data). Quantitative data can be represented as continuous interval data or ratio data. Interval data have an arbitrary zero value (e.g. temperature, where 0°C is an arbitrary point) (Plant et al. 2007), while ratio scales have a meaningful zero reference point (e.g. quantity of different blood parameters, where the value zero means an absence of that parameter). Qualitative data are either nominal or ordinal data. Nominal data comprise unordered categories such as gender, breed, colour and diagnosis. Because of the descriptive nature of these categories, the data cannot be ranked (Stevens 1946; Plant et al. 2007). For example, a Thoroughbred horse is different from a Shetland pony, but one of these values cannot be ranked ‘higher’ or ‘more’ than the other (Plant et al. 2007). Ordinal data comprise ordered categories that can be applied to describe a clinical outcome (e.g. normal, mild, moderate, severe or numerical rankings 0, 1, 2, 3, 4). Numerical ordinal data are often used in grading systems to describe clinical pathology, such as the 4 point (Hackett et al. 1991), 5 point (Lane 1993), 6 point (Dixon et al. 2001) and Havemeyer grading systems for laryngeal function. The limitation of these data is that the interval or distance between each grade is arbitrary in terms of its clinical meaning (Plant et al. 2007).
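The consequence of arbitrary spacing between grades can be illustrated with a minimal sketch. The grades, group names and the alternative coding below are hypothetical and chosen only for illustration: two numerical codings that preserve the same order of grades reverse the ranking of the group means, whereas the ranking of the medians is unchanged.

```python
from statistics import mean, median

# Hypothetical grades (0 = normal ... 3 = severe) for two groups of horses.
group_a = [0, 0, 3, 3, 3]
group_b = [2, 2, 2, 2, 2]

# Two codings that keep the same ORDER of grades but assume different gaps.
equal_spacing = {0: 0, 1: 1, 2: 2, 3: 3}
wide_top_gap = {0: 0, 1: 1, 2: 2, 3: 10}  # assumption: 'severe' is far from 'moderate'


def recode(grades, coding):
    """Replace each grade with the number assigned to it by a coding."""
    return [coding[g] for g in grades]


for name, coding in [("equal spacing", equal_spacing), ("wide top gap", wide_top_gap)]:
    a, b = recode(group_a, coding), recode(group_b, coding)
    print(f"{name}: mean A={mean(a)}, mean B={mean(b)}, "
          f"median A={median(a)}, median B={median(b)}")
```

Under equal spacing group B has the higher mean, but under the second coding group A does; the median of group A exceeds that of group B in both cases because it depends only on the order of the grades.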

Figure 1. Types of quantitative and qualitative data.

Grading systems

The quality of the grading system for a specific clinical outcome is fundamental to any study using these types of measures (Streiner and Norman 2008). As a veterinarian basing clinical decisions on good evidence-based medicine, one should be aware of whether a particular grading system has been assessed appropriately. This includes consideration of the following:

  • What is the type of data: quantitative, qualitative (nominal or ordinal)?
  • Have grades within the grading system been assigned objectively?
    (i) Have grading systems been measured independently and without bias?
  • How consistent are grading systems?
    (i) Repeatability (intraobserver agreement)?
    (ii) Reproducibility (interobserver agreement)?
  • What is the relationship between different grading systems if more than one grading system exists?
  • Have grading systems been appropriately validated with respect to a particular clinical outcome? In other words, are the grades within a grading system strongly correlated with degrees of a clinical outcome and/or can they be reliably used to predict a clinical outcome?

If grading systems have not been appropriately and reliably assessed, it is difficult to justify their usefulness as a comparative measure of the efficacy of one or more interventions on a clinical outcome.

Data presentation and analysis

Data presentation and analysis need to account for the intrinsic nature of the data collected in a study. Parametric statistical tests make certain distributional assumptions about the data, whereas nonparametric statistical tests make fewer assumptions about the underlying data distribution (Petrie and Watson 2006). Therefore, some parametric statistical methods (including the t test and ANOVA) are not appropriate for ordinal or nominal data because the data violate the assumptions on which these tests are based. Although less powerful, nonparametric techniques are often the most appropriate statistical methods for the analysis of ordinal or nominal data. Table 1 lists some of the appropriate and inappropriate statistical tests used to analyse ordinal data.

Table 1. Some of the common appropriate and inappropriate statistical tests for the analysis of ordinal data. This is not an exhaustive list of statistical tests and should not replace the advice of an experienced statistician. Other statistical tests for ordinal data are described in more detail elsewhere (Abramson 2011; StatXact 9:

Descriptive
  • Commonly used but inappropriate tests: Mean, standard deviation, 95% confidence intervals, line plots, histograms
  • Appropriate tests: Median, Mode, Interquartile range, Range, Box plots (Box and whiskers)

Measuring agreement
  • Commonly used but inappropriate tests: Measures of correlation between grades do not measure agreement.
  • Appropriate tests: Weighted Kappa
  • Null hypothesis: Null hypothesis is that any agreement occurs purely by chance. Assumes ordinal data use similar scales and also depends on prevalence of condition tested for.
  • Useful references: Cohen (1960, 1968); Siegel and Castellan (1988); Agresti (1990); Ludbrook (2002); Fleiss et al. (2003); Dohoo et al. (2010)

Measuring correlation (covariation)
  • Commonly used but inappropriate tests: Pearson's product moment correlation coefficient is inappropriate if the association between variables is not linear.
  • Appropriate tests: Spearman's rank correlation coefficient; Kendall's Tau rank correlation coefficient; Somer's D coefficient, Goodman-Kruskal Gamma coefficient; Kendall's coefficient of concordance; Chi-squared tests (binary)
  • Null hypothesis: Null hypothesis is that 2 ordinal variables are independent (i.e. no association); the ranks of one variable do not covary with the ranks of the other variable. In other words, as the ranks of one variable increase, the ranks of the other variable are not more likely to increase (or decrease).
  • Useful references: Kendall (1938); Siegel and Castellan (1988)

Measuring strength and direction of relationship
  • Commonly used but inappropriate tests: Linear regression
  • Appropriate tests: Ordinal logistic regression (proportional odds); Quantile regression; Jonckheere-Terpstra test for trend; Page's test for trend; Bartholomew's test for trend
  • Null hypothesis: Null hypothesis is that predictor variables and an ordinal outcome are independent (i.e. no association).
  • Useful references: Siegel and Castellan (1988); Fleiss (2003); Liu and Agresti (2005); Koenker (2005); Dohoo et al. (2010)

Measuring differences; Assessing efficacy of an intervention (change in grade)
  • Commonly used but inappropriate tests: t test (paired or unpaired)
  • Appropriate tests: Mann-Whitney U; Fligner-Policello; Wilcoxon signed rank; Friedman; Kolmogorov-Smirnov; Cramer-von Mises
  • Null hypothesis: Null hypothesis is that ranks or grades are similar in both groups. A Mann-Whitney test can be regarded as a comparison of the median values, on the assumption that the 2 distributions are similar in shape (with similar variances).
  • Useful references: Siegel and Castellan (1988); Sprent (1993); Zar (2010)
  • Appropriate tests: Strickland Lu test
  • Null hypothesis: Difference between changes observed in 2 independent groups; degree-of-change data for each group and category representing ‘no change’ must be specified.
  • Useful references: Strickland and Lu (2003)

It is intuitive to recognise that one cannot calculate a mean (average) breed of a group of horses that includes Thoroughbreds, Shetlands, Arabs and Cobs or the mean severity of lesions that are ranked using words such as mild, moderate or severe (Plant et al. 2007). Equally, other summary statistics such as standard deviation or the sum of these categories cannot be calculated as these calculations assume that there is equal spacing between categories. Conversely, it is less intuitive to recognise that numerically ranked data, such as those from the 6 point grading system (Dixon et al. 2001) for laryngeal function, cannot be described using a mean, standard deviation or sum. This is because it is easy to forget the assumptions behind the allocation of a numerical grade or rank. As a result, these numerical ordinal data are often misinterpreted as continuous interval or ratio data and described incorrectly. Ordinal data should be described by median, range and/or interquartile ranges, mode or by proportions or percentages within each score or rank. Nonparametric graphs such as box-plots (box and whisker plots) should be utilised instead of other types of summary plots such as line graphs (using means and confidence intervals). Inclusion of tables with frequencies and percentages of each grade may be a more transparent and useful way of presenting the data than either type of graph.
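The descriptive summaries recommended above can be sketched in a few lines of Python using only the standard library. The grades are hypothetical, and the function name `summarise` is introduced here for illustration only. Quartile definitions differ between software packages, so the interquartile range reported by another program may differ slightly.

```python
from collections import Counter
from statistics import median, quantiles

# Hypothetical grades on a 6 point scale for ten horses.
grades = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6]


def summarise(grades):
    """Summaries suited to ordinal grades: no mean, standard deviation or sum."""
    counts = Counter(grades)
    n = len(grades)
    # 'inclusive' treats the data as the whole population of interest;
    # other quartile conventions give slightly different cut points.
    q1, _, q3 = quantiles(grades, n=4, method="inclusive")
    highest_count = max(counts.values())
    return {
        "n": n,
        "median": median(grades),
        "iqr": (q1, q3),
        "range": (min(grades), max(grades)),
        "modes": sorted(g for g, c in counts.items() if c == highest_count),
        # grade -> (frequency, percentage), the table suggested in the text
        "table": {g: (counts[g], 100 * counts[g] / n) for g in sorted(counts)},
    }


summary = summarise(grades)
for grade, (freq, pct) in summary["table"].items():
    print(f"grade {grade}: {freq} ({pct:.0f}%)")
```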

Studies using ordinal grading systems may aim to measure consistency within grading system(s). An assessment of intra- or inter-observer agreement is usually accomplished using Kappa statistics. A Kappa statistic takes into consideration the potential for any agreement between grades due to chance. The magnitude of the Kappa depends on the extent of the agreement as well as the prevalence of the condition being tested for (Dohoo et al. 2010). A weighted Kappa is usually the most appropriate test for ordinal grades (Dohoo et al. 2010). This is because, if a grading system is measured on a 6 point scale, grades of 6 and 5, respectively, should be considered in closer agreement than grades of 6 and 1 (Dohoo et al. 2010). A weight matrix specifies how much agreement should be assigned to pairs of grades which are in partial agreement. However, there are no clear guidelines about how a weight matrix should be defined, and changing a weight matrix can completely alter the degree of agreement believed to be there (G. Innocent, personal communication). The weight matrix used in the Kappa analysis should be described in the materials and methods of studies employing this type of analysis.
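The sensitivity of a weighted Kappa to the choice of weight matrix can be illustrated with a minimal sketch. It assumes Cohen's weighted Kappa written in terms of disagreement weights (zero on the diagonal), which is equivalent to specifying agreement weights; the two observers' grades are hypothetical and the function names are introduced here for illustration only.

```python
def weighted_kappa(rater_a, rater_b, categories, weight):
    """Cohen's weighted Kappa: 1 - (weighted observed / weighted chance disagreement).

    `weight(i, j)` is the disagreement weight for category indices i and j
    and must be 0 when i == j.
    """
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must grade the same cases")
    index = {c: i for i, c in enumerate(categories)}
    k, n = len(categories), len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = sum(weight(i, j) * observed[i][j] for i in range(k) for j in range(k))
    den = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if den == 0:
        raise ValueError("Kappa is undefined when chance disagreement is zero")
    return 1 - num / den


def unweighted(i, j):
    return float(i != j)      # any disagreement counts fully


def linear(i, j):
    return abs(i - j)         # disagreement grows with distance


def quadratic(i, j):
    return (i - j) ** 2       # distant grades penalised more heavily


# Hypothetical grades from two observers on a 6 point scale.
observer_a = [1, 2, 3, 4, 5, 6, 2, 3, 4, 5]
observer_b = [1, 3, 3, 5, 5, 6, 2, 4, 4, 6]
scale = [1, 2, 3, 4, 5, 6]

for name, w in [("unweighted", unweighted), ("linear", linear), ("quadratic", quadratic)]:
    print(name, round(weighted_kappa(observer_a, observer_b, scale, w), 3))
```

With these data the same pairs of grades give a Kappa of roughly 0.52, 0.78 and 0.92 under the three weighting schemes, which is why the weight matrix needs to be reported.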

Some studies incorrectly report correlation coefficients rather than Kappa statistics when referring to intra- or inter-observer agreement within grading systems. However, it is critical to appreciate that measuring correlation simply measures the association between different grades (i.e. how much the distribution of grades varies) assessed during repeated measurements by the same or different observers. Thus a correlation coefficient does not measure repeatability or reproducibility, nor does it measure the ability of the grading system to predict another grading system or to predict the clinical outcome (Bland and Altman 1986). However, if the aim of the study is to determine whether the grades of a grading system increase or decrease systematically (i.e. co-vary) with a clinical outcome or perhaps with another grading system, Spearman's rank correlation coefficient and Kendall's Tau rank correlation coefficient can be used (Table 1). It should be noted that rank correlations are adversely affected by ties in ranks (which are inevitable when considering a grading system with relatively few levels compared to the number of observations). The parametric Pearson's product moment correlation coefficient is an inappropriate test because typically, ordinal data do not meet the required underlying assumptions of linearity, homoscedasticity and normality. Thus, it is very important for authors to examine these assumptions by investigating the nature of association between the data (using scatter plots) before calculating a correlation coefficient.
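The distinction between correlation and agreement can be shown with a small sketch. It assumes a hypothetical pair of observers in which the second always assigns one grade higher than the first, and computes Spearman's rank correlation coefficient using average ranks for ties; the function names are introduced here for illustration only.

```python
from math import sqrt


def midranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # average of positions start+1 .. end+1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson's correlation calculated on the midranks."""
    rx, ry = midranks(x), midranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / sqrt(vx * vy)


# Hypothetical data: observer B always grades one point higher than observer A.
observer_a = [1, 1, 2, 2, 3, 3, 4, 4]
observer_b = [g + 1 for g in observer_a]

rho = spearman(observer_a, observer_b)
exact_agreement = sum(a == b for a, b in zip(observer_a, observer_b)) / len(observer_a)
print(f"Spearman's rho = {rho:.2f}, proportion of identical grades = {exact_agreement:.2f}")
```

Here the rank correlation is perfect although the two observers never assign the same grade, so a high correlation coefficient on its own says nothing about repeatability or reproducibility.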

The strength (magnitude) and direction of a relationship between a grading system and other parameters can be measured using nonparametric regression techniques. Regression describes the relationship between one variable and another using an appropriate equation, which can then be used for prediction. These analyses may be quite complex and require advice from an experienced statistician, particularly if both the outcome and the predictor variable are of an ordinal nature. Some of these regression techniques may include ordinal logistic regression models (proportional odds models) and quantile regression models. Linear regression using a grading system as an outcome is an inappropriate technique for ordinal data.
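To show what a proportional odds model implies, the following sketch computes the probability of each grade from a commonly used form of the model, logit P(grade ≤ j | x) = cutpoint_j − beta × x. The cutpoints, the coefficient and the predictor values are hypothetical and are not estimated from any data; fitting such a model requires dedicated statistical software and, ideally, statistical advice.

```python
from math import exp


def category_probabilities(cutpoints, beta, x):
    """Probability of each grade under logit P(grade <= j | x) = cutpoint_j - beta * x.

    `cutpoints` must be increasing; k - 1 cutpoints define k ordered grades.
    """
    cumulative = [1 / (1 + exp(-(c - beta * x))) for c in cutpoints] + [1.0]
    # Grade probabilities are differences between successive cumulative probabilities.
    return [cumulative[0]] + [hi - lo for lo, hi in zip(cumulative, cumulative[1:])]


# Hypothetical values for a 4 grade scale and a binary predictor (0 or 1).
cutpoints = [-1.0, 0.5, 2.0]
beta = 0.8  # assumption: x = 1 shifts animals towards higher grades

for x in (0, 1):
    probs = category_probabilities(cutpoints, beta, x)
    print(x, [round(p, 3) for p in probs])
```

The single coefficient beta applies at every cutpoint: the odds of being at or below any given grade differ between x = 0 and x = 1 by the same factor, exp(beta). This is the proportional odds assumption, and it should be checked before the model is relied upon.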

Finally, a grading system is often used as a measure of the efficacy of one or more interventions on a clinical outcome (e.g. studies that compare 2 interventions with respect to an outcome measured by a grading system). In order to measure the effect of treatments on a clinical outcome, differences in grades (possibly measured repeatedly over multiple times) are typically analysed using statistical tests such as Wilcoxon signed rank (paired groups), Mann-Whitney U (unpaired groups), Kruskal-Wallis (2 or more unmatched groups), Friedman (2 or more matched groups), Kolmogorov-Smirnov, Cramer-von Mises and Strickland Lu tests. The specific null hypothesis for each test may vary and should be checked before the method is used. Sometimes differences in grades are summarised into a binary outcome (improvement or no improvement after an intervention) and the relationship between this outcome and the intervention can then be analysed using Chi-squared tests or logistic regression. The obvious limitation with these latter methods is that the magnitude of the increase or decrease in grade cannot be taken into account.
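As an illustration of one of these tests, the sketch below computes the Mann-Whitney U statistic for two unpaired groups, with a two-sided P value from a normal approximation corrected for ties. The grades are hypothetical, no continuity correction is applied, and the normal approximation is only a rough guide in samples this small; an exact test from statistical software would be preferable in practice.

```python
from collections import Counter
from math import erfc, sqrt


def mann_whitney_u(x, y):
    """Return (U for x, two-sided P) using a tie-corrected normal approximation."""
    n1, n2 = len(x), len(y)
    # U counts the pairs in which x exceeds y; tied pairs count as one half.
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    n = n1 + n2
    tie_sizes = Counter(list(x) + list(y)).values()
    tie_term = sum(t ** 3 - t for t in tie_sizes)
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        raise ValueError("all observations are identical")
    z = (u - n1 * n2 / 2) / sqrt(variance)
    p = erfc(abs(z) / sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u, p


# Hypothetical post-treatment grades (higher = more severe) in two unpaired groups.
control = [3, 4, 4, 5, 3, 4, 5, 4]
treated = [2, 3, 2, 3, 4, 2, 3, 3]

u, p = mann_whitney_u(control, treated)
print(f"U = {u}, approximate two-sided P = {p:.4f}")
```

Because grading systems have few levels, ties are unavoidable, which is why the variance includes a tie correction here.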


Although studies in the veterinary literature frequently utilise ordinal data to describe grading systems for clinical outcomes, some of these published studies present and analyse the data inappropriately as their limitations have not been adequately exposed through the peer review process. As a result, future authors may inappropriately repeat these techniques and mistakes (Plant et al. 2007).

Inappropriate reporting and analysis of ordinal data may result in unjustified conclusions. Nonparametric tests are not as powerful as parametric tests, and as such result in a lower probability of rejecting a false null hypothesis. Conversely, inappropriate use of parametric tests in these types of studies may result in inappropriate conclusions being drawn due to an increase in type I errors. In other words, clinicians could infer that significant differences between grading systems or treatments exist, when, in fact, there is no difference (Plant et al. 2007; Dohoo et al. 2010).

The availability of user-friendly statistical software has made it easier for clinicians to neglect the advice of a statistician or epidemiologist and to present data in a way which can be misleading (Plant et al. 2007). With this in mind, the statistical tests described in this paper are not meant to form an exhaustive or prescriptive list of methods for analysing ordinal data. All parametric and nonparametric tests are subject to a number of assumptions that require a good knowledge of when these tests are appropriate or inappropriate to use. As such, a summary paper such as this will never take the place of good statistical advice sought prior to study design. However, in order to practise good evidence-based medicine, readers (clinicians), peer reviewers and editors of veterinary publications should make themselves familiar with appropriate and inappropriate statistical methodologies so that they are able to be vigilant and critical of potentially misleading results which may be reported in these types of studies.

Author's declaration of interests

The author did not declare any conflict of interest.

Source of funding

Defra is the source of funding for the author's postdoctoral research position at the University of Glasgow.


The author would like to acknowledge Garry Anderson, University of Melbourne and Dr Giles Innocent, Biomathematics and Statistics Scotland (BioSS), for providing constructive feedback on this paper.