Inter-observer variability of two grading systems for equine glandular gastric disease

Background: Equine glandular gastric disease (EGGD) is recognised as a separate entity to equine squamous gastric disease (ESGD) and it is recommended that lesions are graded differently. Currently, no validated scoring system exists for EGGD. Objectives: To determine inter-observer reliability of two previously described grading systems for EGGD and to assess if agreement improved with gastroscopy experience, specialist training or familiarity with the descriptive system. Study design: Cross-sectional survey. Methods: A link to an electronic questionnaire containing 20 images of glandular lesions was circulated. Respondents were asked to score lesions using descriptive terminology and a 0-2 verbal rating scale (VRS). Krippendorff's alpha reliability estimate was used to assess inter-rater agreement. A mixed effects model was used to determine which descriptive categories were associated with lesions being described as severe and decision to treat. Results: Eighty-two veterinarians responded, 49 diplomates and 33 non-diplomates. There was no agreement when all four descriptive variables were combined (α = 0.19). Agreement was fair to moderate for severity (α = 0.52), distribution (α


| INTRODUCTION
Ordinal scoring systems are commonly used in both human and veterinary medicine. They facilitate documentation of disease, assessment of response to treatment and standardisation of research. Equine gastric ulcers were historically scored using a 0-4 grading scale recommended by the Equine Gastric Ulcer Council. 1 This system was shown to have better inter-observer agreement than a number/severity scoring system, with high kappa values (>0.8) between three observers for both squamous and glandular lesions. 2 Equine glandular gastric disease (EGGD) is now considered a separate entity to equine squamous gastric disease (ESGD) with regard to risk factors, clinical signs, pathophysiology, treatment and prognosis. [3][4][5] The incomplete understanding of the pathophysiology of glandular disease makes sub-classification of lesions difficult and means that grading systems for squamous disease may not be accurate for EGGD. The current recommendation is to use descriptive terminology, which classifies lesions based on four categories: severity, distribution, shape and appearance. 6 A recent statement 3 proposed inclusion of the terms nodular and erythematous to allow a more accurate description of lesions. A novel verbal rating scale (VRS), grading lesions from 0 to 2, has also been used in research. 7 Verbal rating scales are commonly used to score pain in people [8][9][10] and are defined as ordinal scales where words are used to describe the severity of a condition.
Although there is poor correlation between endoscopic findings and histological analysis of lesions, [11][12] gastroscopy remains the best method for antemortem diagnosis of EGGD. The use of an accurate and repeatable grading system is important in both clinical and research settings. Currently, no validated scoring system exists for EGGD, and recent published work has reverted to the original EGUS scale to group data and facilitate statistical analysis. 4,13,14 The main objectives of this study were (a) to determine inter-observer reliability of descriptive terminology and a verbal rating scale (VRS) for EGGD and (b) to assess if agreement improved with gastroscopy experience, specialist training or familiarity with the descriptive system. It was hypothesised that there would be poor agreement for both scales and that agreement would be better among experienced endoscopists, those with specialist training and those familiar with the descriptive system. To ascertain which factors were associated with respondents considering a lesion to be clinically significant, a secondary objective was to determine which other descriptive variables were associated with lesions being described as severe and which factors influenced the decision to treat.

| MATERIALS AND METHODS
An electronic questionnaire containing 20 images of glandular lesions in the antrum and pylorus of the stomach was drafted (Data S1). All questions were closed-ended, and the survey was anonymous. A set of introductory questions established the respondent's experience of gastroscopy, specialist status and scoring system currently used.
Following this, a series of 20 still images of gastric glandular lesions was displayed sequentially. Respondents were asked to grade each image using the current descriptive terminology 3 (Figure 1) and a verbal rating scale 7 (Figure 2). For each image, respondents were asked whether they would recommend treatment based solely on that image. Although artificial, this was asked as a measure of ascribing clinical significance. Each respondent viewed the images in the same order and could navigate back to previous images.
Electronic invitations containing a link to complete the questionnaire were circulated to both specialists and primary care veterinarians using listservs including the American College of Veterinary
Of the internal medicine diplomates, 28 (57%) used the descriptive system, 16 (33%) used the original EGUS 0-4 scoring system, four (8%) used both and one respondent (2%) used the descriptive system in combination with the VRS. Among non-specialists, 16 (49%) used the descriptive system, 13 (39%) used the EGUS system and three (9%) used both. There was no difference between the groups for scoring system used (P = .9). The average number of gastroscopic examinations performed per month is displayed in Figure 3. Specialists were more likely to perform a higher number of gastroscopies per month than non-specialists (P = .004).

| Inter-observer agreement
Inter-observer agreement coefficients for descriptive and VRS grading systems are displayed in Table 1. The same cut-offs as Cohen's kappa can be used when interpreting Krippendorff's alpha coefficient, with α > 0.8 considered to reflect strong agreement. 16,17 Agreement was fair to moderate for severity (α = 0.52), distribution (α = 0.44), appearance (α = 0.38) and shape (α = 0.32). Agreement for the VRS was similar to that for severity (α = 0.53). Agreement was higher among specialists than non-specialists for all descriptive categories and across both scoring systems. Agreement was higher among respondents who currently use the descriptive system in practice, regardless of diplomate status. Overall, the VRS and 'severity' category showed similar inter-observer agreement. Amongst non-diplomates, the VRS had higher agreement than the 'severity' category. When diplomates, those using the descriptive system in practice and experienced respondents (average >10 gastroscopies per month) were examined separately, the VRS had lower agreement than the 'severity' category.

FIGURE 1 Descriptive system for equine glandular gastric disease, as described by Rendle et al 3 (severity: mild, moderate or severe)
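The agreement coefficients above are Krippendorff's alpha values. As a rough illustration of how such a coefficient is obtained, the following is a minimal sketch for nominal categories, built from the coincidence matrix of paired ratings within each image. The function name and the data layout (one list of ratings per image, `None` for a missing rating) are assumptions made for this example. The nominal distance metric is also an assumption: severity and the VRS are ordered scales, and the text does not state which metric was applied to them.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: one list of ratings per rated item (e.g. per image);
    None marks a missing rating. Items with fewer than two ratings
    carry no information about agreement and are skipped.
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue
        # Every ordered pair of ratings within an item, weighted by 1/(m - 1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals per category and overall number of pairable values
    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    )
    if expected == 0:
        return float("nan")  # only one category used: alpha is undefined
    # alpha = 1 - D_o / D_e, with D_o = observed / n
    # and D_e = expected / (n * (n - 1))
    return 1.0 - (n - 1) * observed / expected
```

For example, `krippendorff_alpha_nominal([[1, 1], [0, 0], [2, 2]])` gives 1.0, because every pair of ratings within an image matches.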

| Relationship between VRS and description of severity
Description of severity was associated with VRS score (P < .001).
A lesion with a VRS score of 2 was more likely to be described as severe than a lesion with a VRS score of 1 (OR 75.2, 95% CI 51.12-110.48, P < .001).
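As a quick consistency check on the odds ratio above, the sketch below assumes the interval is a Wald-type interval, symmetric on the log-odds scale (an assumption; the text does not state how the interval was computed). Under that assumption the geometric midpoint of the bounds should be close to the reported odds ratio, and the width of the interval gives the implied standard error of the log odds ratio. The function name is hypothetical.

```python
import math


def log_scale_summary(lower, upper, z=1.96):
    """Return (geometric midpoint, implied SE of the log odds ratio).

    Assumes the confidence interval is symmetric on the log scale.
    """
    log_lower, log_upper = math.log(lower), math.log(upper)
    midpoint = math.exp((log_lower + log_upper) / 2)
    standard_error = (log_upper - log_lower) / (2 * z)
    return midpoint, standard_error


# Bounds reported for a VRS score of 2 versus 1 (OR 75.2)
midpoint, standard_error = log_scale_summary(51.12, 110.48)
```

Here the midpoint comes out close to the reported 75.2, so the interval is consistent with that assumption.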

| Factors contributing to lesions being described as severe
Univariable analysis is presented in Table S1 and results of multivariable analysis are presented in Table 2. Appearance (P = .005) and shape (P < .001), but not distribution (P = .08), were associated with lesions being described as severe. Depressed lesions were more likely to be described as severe compared to flat lesions (OR 4.6, 95% CI 2.22-9.55, P < .001). Haemorrhagic or fibrinosuppurative lesions were more likely to be described as severe than erythematous lesions (OR 2.9, 95% CI 1.51-5.39, P = .001 and OR 2.9, 95% CI 1.51-5.71, P = .001, respectively). Diplomates were less likely to describe lesions as severe (OR 0.5, 95% CI 0.28-0.94, P = .03). Experience level and scoring system currently used did not contribute to lesions being described as severe. Intraclass correlation coefficient was .6 for image (P = .01) and .27 for observer (P < .001), indicating that image accounted for the majority of clustering. The Hosmer-Lemeshow test indicated acceptable model fit (χ² = 11.51, P = .17).
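The Hosmer-Lemeshow P value can be reproduced from the test statistic as a small worked check. The sketch below assumes 8 degrees of freedom, which is the conventional choice of ten risk groups minus two; the text reports only the statistic and the P value, so the degrees of freedom are an assumption. For an even number of degrees of freedom the chi-square upper tail has a closed form, so no statistics library is needed. The function name is hypothetical.

```python
import math


def chi_square_sf_even_df(statistic, df):
    """Upper-tail probability of a chi-square variable with even df.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = statistic / 2
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for k in range(1, df // 2):
        term *= half / k  # builds (x/2)^k / k! incrementally
        total += term
    return math.exp(-half) * total


# Statistic reported above; df = 8 is assumed, not stated in the text
p_value = chi_square_sf_even_df(11.51, df=8)
```

With these inputs the result rounds to .17, in line with the reported value.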

| Factors contributing to decision to treat
Results of univariable analysis are presented in Table S2 and multivariable analysis is displayed in Table 3. Appearance (P < .001) and shape (P = .03) were associated with decision to treat. Respondents were more likely to treat depressed lesions compared to flat lesions (OR 3, 95% CI 1.22-7.63, P = .02). Distribution was not associated with decision to treat (P = .13). Diplomates were less likely to treat lesions (OR 0.5, 95% CI 0.24-0.93, P = .03) than non-diplomates.

| DISCUSSION
Poor inter-observer agreement for endoscopy in people has been previously described, with less experienced operators performing worse. 18,19 The Havemeyer scale for grading arytenoid function was found to have fair to moderate agreement between observers, which improved when scales were transposed to dichotomous grades. 20 Ordinal scales have also been shown to have poor inter-rater agreement when assessing lameness in horses. 21 Interestingly, when experience alone was examined regardless of specialist status, this was not the case; however, among respondents already using the descriptive system in practice, regardless of specialist status, agreement across all descriptive categories was comparable to that of the diplomate group. This suggests that additional training to increase familiarity with the system may improve agreement. The VRS had the least agreement amongst the experienced group but performed better than the severity descriptor amongst non-diplomates. The use of a number may reflect the original EGUS scale, with which respondents may be more familiar.
TABLE 2 Multivariable binomial logistic regression to determine which factors were associated with lesions being described as 'severe' in the descriptive scoring system. Image and observer are included as random effects
When descriptive variables were examined individually, severity had the highest level of agreement, although this was only moderate. This is an unexpected finding as it is arguably the most subjective parameter, as the endoscopic assessment of severity alone cannot be used to infer clinical signs. 6 Agreement was worst for shape and appearance, which may reflect an unfamiliarity with the terms in use and the range of lesions seen in the glandular mucosa. Clear language for both defining and setting boundaries for each category may be necessary to improve agreement. Utilisation of descriptive terminology was less common among non-specialists. This likely represents a difference in training but may be due to unfamiliarity with the descriptive system or to a wish to facilitate communication with owners and trainers, who may be more familiar with the original scoring system.
Appearance of loss of mucosal integrity (ie haemorrhagic, fibrinosuppurative and depressed lesions) was associated with lesions being described as severe. Depressed lesions may reflect the appearance of a traditional ulcer or erosion, rather than an inflammatory process per se. Flat and erythematous lesions were less likely to be considered severe. Lesions with a mixed appearance were not associated with severity, although this may be due to the low numbers in this category. Distribution was not associated with lesions being described as severe. This is at odds with the original EGUS scoring system which was based around lesion distribution. 1 The VRS does not take distribution into account. Given the association between severity and VRS, it may be possible to extrapolate this to the VRS, or similar, using the severity category of the descriptive system. Grouping lesions for statistical analysis could be done using severity as the primary variable, with a 0-3 scale to incorporate normal, mild, moderate and severe. This would add an additional category, making it easier to document improvement as well as resolution of lesions.
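The 0-3 grouping suggested above can be written down as a simple ordinal encoding. The following is only an illustrative sketch: the class name, member names and helper function are hypothetical and are not part of either published scoring system.

```python
from enum import IntEnum


class SeverityGrade(IntEnum):
    """Hypothetical 0-3 encoding of the severity category."""
    NORMAL = 0
    MILD = 1
    MODERATE = 2
    SEVERE = 3


def grade_change(before, after):
    """Difference in grade between two examinations.

    Negative values indicate improvement; a follow-up grade of
    NORMAL indicates resolution.
    """
    return SeverityGrade[after.upper()] - SeverityGrade[before.upper()]
```

With this encoding, `grade_change("severe", "mild")` returns -2, so improvement can be recorded even when the lesion has not resolved.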

| Decision to treat
The same factors (shape and appearance) associated with a lesion being considered severe were associated with the decision to treat. This is unsurprising, as a severe lesion would typically imply clinical significance. Diplomates were less likely to treat lesions than non-diplomates. The reason for this is unclear. In practice, the decision to treat may be driven by several factors including clinical signs, owner/trainer demands, financial constraints and individual clinician preference.

| Limitations
This was an opt-in survey and may not be representative of specialist and non-specialist populations as a whole. It is unknown what degree of familiarity respondents had with the various scoring systems, as data were only collected regarding which system each respondent currently used. The use of a scoring system does not necessarily reflect familiarity with that system. It was impossible to exclude bias, with veterinarians being aware that they were participating in a study and knowing that their answers would be viewed.
Responses were completely anonymised to minimise the impact of bias; however, an effect on precision and diagnostic accuracy may remain. There is evidence that inter-rater reliability declines under less controlled conditions. 24 Consistency of interpretation was difficult to control. Diagnostic drift is a situation in which the assignment of scores may vary slightly in consistency through the scoring process.
This may occur in situations such as this, where there are a large number of samples to be examined or when category characteristics/boundaries are poorly defined. 25 Observers may also subconsciously compare and score an image relative to those previously viewed, particularly for more subjective parameters such as severity.
Inter-observer variability is arguably less important in a clinical setting, particularly if the same clinician is performing follow-up examinations. Assessment of intra-observer reliability is required to see how these scoring systems perform amongst individual clinicians.
The 0-4 EGUC scoring system was not included in this study as the current recommendation is that it should not be applied to EGGD. 6 Although this scale was previously validated using both squamous and glandular lesions, this was performed at a time when glandular disease was not considered a separate pathological process to squamous disease.
Still images were used in this study and may be less representa- Another approach to validate a scoring system is to analyse the relationship between the scores and relevant parameters of disease severity. 25 To the authors' knowledge, there is currently no published work examining the response of specific types of glandular lesions to treatment. The lack of information on biopsies combined with current incomplete understanding of the pathophysiology and clinical signs pertaining specifically to EGGD means that this cannot be undertaken at present but warrants future attention.

| CONCLUSION
There was no inter-observer agreement for the descriptive system when all four variables were included. The severity category showed the best agreement, and this was similar to the VRS.
Severity was significantly associated with the VRS, suggesting that it may be possible to extrapolate to the latter. The lack of agreement for appearance, shape and distribution identified in this study highlights the need for better definition of these particular parameters. As more is understood about clinical signs pertaining to EGGD and response of specific lesion types to treatment, a more comprehensive scoring system may be developed. In the meantime, additional training to increase familiarity with the descriptive system may improve agreement.

ACKNOWLEDGEMENTS
The authors express their appreciation to Professor David Brodbelt for his advice regarding data analysis.

CONFLICT OF INTERESTS
No competing interests have been declared.

AUTHOR CONTRIBUTIONS
R. Tallon gathered, analysed and interpreted the data and drafted the manuscript. M. Hewetson critically revised the manuscript. Both authors were involved in conception, study design and approval of the final version of the manuscript.

ETHICAL ANIMAL RESEARCH
The study was approved by the Social Sciences Research Ethical Review Board at the Royal Veterinary College (URN SR2018-1678).

INFORMED OWNER CONSENT
Images used in this study were obtained from clinical records and used anonymously. Completion of the questionnaire was taken as participant consent.

DATA ACCESSIBILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.

PEER REVIEW
The peer review history for this article is available at https://publons.com/publon/10.1111/evj.13334.