To test the intra- and interobserver variability, among clinicians with an interest in systemic sclerosis (SSc), in defining digital ulcers.
Thirty-five images of finger lesions, incorporating a wide range of abnormalities at different sites, were duplicated, yielding a data set of 70 images. Physicians with an interest in SSc were invited to take part in the Web-based study, which involved looking through the images in a random sequence. The sequence differed for individual participants and prevented cross-checking with previous images. Participants were asked to grade each image as depicting “ulcer” or “no ulcer,” and if “ulcer,” then either “inactive” or “active.” Images of a range of exemplar lesions were available for reference purposes while participants viewed the test images. Intrarater reliability was assessed using a weighted kappa coefficient with quadratic weights. Interrater reliability was estimated using a multirater weighted kappa coefficient.
Fifty individuals (most of them rheumatologists) from 15 countries participated in the study. There was a high level of intrarater reliability, with a mean weighted kappa value of 0.81 (95% confidence interval [95% CI] 0.77, 0.84). Interrater reliability was poorer (weighted κ = 0.46 [95% CI 0.35, 0.57]).
The poor interrater reliability suggests that if digital ulceration is to be used as an end point in multicenter clinical trials of SSc, then strict definitions must be developed. The present investigation also demonstrates the feasibility of Web-based studies, for which large numbers of participants can be recruited over a short time frame.
Digital ulcers, which are common in patients with systemic sclerosis (SSc) (1, 2), are painful and disabling. These types of ulcers are often used as a primary end point in clinical trials of SSc-related digital ischemia and vasculopathy, as was the case in 2 recent multicenter placebo-controlled trials (3, 4). Patient-assessed digital ulcer “activity” was among a core set of measures proposed for use in trials of SSc-related Raynaud's phenomenon (5). However, digital ulcers are difficult to define, which raises concerns about their reliability as an outcome measure. Problems include 1) making the distinction between “healed” and “nonhealed” ulcers, which can be difficult with lesions situated on the fingertips and the extensor surfaces, since an ulcer crater can persist for months after an acute episode, and 2) determining which sites to use. The study by Korn et al (3) included only those ulcers at or distal to the proximal interphalangeal joints, and yet, more proximal ulcers (e.g., those over the metacarpophalangeal joints) can be very painful, and healing can be difficult.
Reproducible measures of outcome of digital ischemia are badly needed, and it is important to ensure that those measures that already exist are as robust as possible. The aim of this study was to test the intra- and interobserver variability, among clinicians with an interest in SSc, in defining digital ulcers.
The study was Web-based. Thirty-five images of finger lesions, incorporating a wide range of abnormalities at different sites (including digital tip and extensor surfaces), were duplicated, yielding a data set of 70 images. All of the images were obtained by a trained medical photographer, and the majority included a white circular disc (6.4 mm in diameter) to give participants an idea of the size of the ulcer. When an image included more than one lesion that might be classified as an ulcer, the lesion meant for evaluation was marked with an arrow.
Physicians with an interest in SSc (identified via SSc-related organizations) were invited to participate in the study. Each was asked to look through the images in a random sequence, which differed for individual participants and did not allow cross-checking with previous images. Before viewing the test images, the participants were asked to view 13 exemplar lesions, which had been agreed upon by a working group of 7 clinicians who designed the study and which were available for reference purposes while the participants viewed the test images. Participants were then asked to grade the test images on a 3-point ordinal scale, as follows: either “no ulcer” or “ulcer,” and if “ulcer,” then either “inactive” or “active.” The grade for each image was recorded when the participant clicked on the chosen answer, after which the next image appeared.
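The randomization step can be illustrated with a short sketch. The paper does not describe how the sequences were generated, so the function name `presentation_order`, the `(lesion, copy)` labeling, and the per-participant seed are assumptions made here for illustration only.

```python
import random
from typing import List, Optional, Tuple


def presentation_order(n_lesions: int = 35,
                       seed: Optional[int] = None) -> List[Tuple[int, int]]:
    """Return one participant's viewing order for the duplicated image set.

    Each lesion appears twice, as copy 1 and copy 2, giving 2 * n_lesions
    images. The list is shuffled independently for every participant.
    """
    rng = random.Random(seed)  # hypothetical: one seed per participant
    images = [(lesion, copy) for lesion in range(n_lesions) for copy in (1, 2)]
    rng.shuffle(images)
    return images


# Example: the viewing order for one participant.
order = presentation_order(seed=1)
```

Whether the original Web application placed any constraint on the spacing between the 2 copies of a lesion is not stated; this sketch applies none.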
The study design permitted an analysis of both intra- and interrater reliability. Intrarater reliability was assessed using a weighted kappa coefficient with quadratic weights, which is a measure of agreement between ratings for ordinal data. Interrater reliability was estimated with a multirater weighted kappa coefficient, using the individual rater's first and second ratings separately. The reliability of ratings between particular pairs of categories was assessed using interclass kappa coefficients (6), which have been recommended in scale development to assess the ability of raters to distinguish between particular pairs of categories in a categorical scale (7). To assess whether the reliability of the scale might be improved by combining adjacent categories, we computed the binary kappa coefficients for “no ulcer” versus “ulcer” (“active” and “inactive” combined), and for “no ulcer” and “inactive” ulcer combined versus “active” ulcer. For all of the coefficients, a value of 1 represented complete agreement, and a value of 0 represented agreement no better than chance.
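For readers unfamiliar with the statistic, the following is a minimal sketch of Cohen's weighted kappa with quadratic weights applied to one rater's first and second gradings. The analyses in the study were run in Stata; this Python function, its name, and the coding of grades as 0 ("no ulcer"), 1 ("inactive"), and 2 ("active") are assumptions introduced for illustration.

```python
from typing import Sequence


def weighted_kappa(first: Sequence[int], second: Sequence[int],
                   n_categories: int = 3) -> float:
    """Cohen's weighted kappa with quadratic weights for paired ordinal grades.

    Grades are integers 0 .. n_categories - 1 (assumed coding).
    """
    if len(first) != len(second) or len(first) == 0:
        raise ValueError("need two non-empty rating lists of equal length")
    n, k = len(first), n_categories

    # Observed joint proportions of (first grade, second grade).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Quadratic disagreement weights: 0 on the diagonal, 1 for the extremes.
    d_observed = d_expected = 0.0
    for i in range(k):
        for j in range(k):
            w = ((i - j) / (k - 1)) ** 2
            d_observed += w * observed[i][j]
            d_expected += w * row[i] * col[j]
    if d_expected == 0.0:
        raise ValueError("kappa is undefined when all grades fall in one category")
    return 1.0 - d_observed / d_expected
```

Collapsing adjacent categories, as described above, amounts to recoding the grades before calling the function with `n_categories=2` (for example, mapping 1 and 2 to a single "ulcer" grade); with 2 categories the quadratic weights reduce to the ordinary unweighted kappa.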
The 95% confidence intervals (95% CIs) for the means of the kappa coefficients for intrarater reliability were obtained using the nonparametric bootstrap method, with resampling by raters. Because the values for interrater reliability were similar for first and second ratings, the mean of the 2 values is presented, with 95% CIs again estimated by the bootstrap method. To investigate variation in reliability between images, the multirater weighted kappa coefficient pooled across both ratings was recalculated, dropping each image in turn. An increase in the coefficient when an image was omitted suggested lower reliability for that image, whereas a decrease suggested higher reliability. Images were then ranked according to these changes, and examples of images with high and low reliability are presented below. All statistical analyses were carried out using Stata software (8).
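The bootstrap step with resampling by raters can be sketched as follows. The paper does not state which type of bootstrap interval or how many resamples were used, so the percentile interval, the default of 2,000 resamples, and the function name `bootstrap_mean_ci` are assumptions for illustration.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_mean_ci(rater_kappas: Sequence[float],
                      n_resamples: int = 2000,
                      alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean of per-rater kappa coefficients.

    Raters (not images) are the resampling unit: each resample draws
    len(rater_kappas) raters with replacement and records the mean kappa.
    """
    rng = random.Random(seed)
    n = len(rater_kappas)
    means = sorted(mean(rng.choices(rater_kappas, k=n))
                   for _ in range(n_resamples))
    # Positions of the alpha/2 and 1 - alpha/2 percentiles, kept in range.
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]
```

The function takes one kappa coefficient per rater and returns the lower and upper limits of the interval for their mean.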
Fifty individuals (most of them rheumatologists) from 15 different countries participated in the study. Table 1 shows the intra- and interrater reliabilities overall and in pairs by category.
| Comparison | Intrarater reliability, weighted κ (95% CI)* | Interrater reliability, multirater weighted κ (95% CI)† |
|---|---|---|
| Overall weighted kappa coefficient | 0.81 (0.77, 0.84) | 0.46 (0.35, 0.57) |
| No ulcers vs. inactive ulcers | 0.54 (0.45, 0.64) | −0.12 (−0.31, 0.07) |
| No ulcers vs. active ulcers | 0.90 (0.86, 0.95) | 0.60 (0.45, 0.76) |
| Inactive ulcers vs. active ulcers | 0.63 (0.57, 0.69) | 0.28 (0.16, 0.40) |
| No ulcers vs. active ulcers/inactive ulcers | 0.77 (0.73, 0.81) | 0.34 (0.10, 0.58) |
| No ulcers/inactive ulcers vs. active ulcers | 0.75 (0.71, 0.79) | 0.49 (0.17, 0.81) |
The degree of intrarater reliability was high. Across individual raters, the weighted kappa coefficient ranged from 0.36 to 0.96, with a median of 0.82. Averaged across all raters, the weighted kappa value was 0.81 (95% CI 0.77, 0.84). When individual categories of lesions were compared in pairs, intrarater reliability was greatest for the distinction between “no ulcer” and “active” ulcer, as might be expected for an ordered categorical scale, with an interclass kappa coefficient of 0.90 (95% CI 0.86, 0.95). Lower levels of reliability were observed between “inactive” ulcer and “active” ulcer (interclass kappa coefficient 0.63 [95% CI 0.57, 0.69]) and between “no ulcer” and “inactive” ulcer (0.54 [95% CI 0.45, 0.64]). When adjacent categories were combined, intrarater reliability remained high, both for “no ulcer” versus “ulcer” (“active” and “inactive” combined) (0.77 [95% CI 0.73, 0.81]) and for “no ulcer” and “inactive” ulcer combined versus “active” ulcer (0.75 [95% CI 0.71, 0.79]). Nevertheless, these values were slightly lower than the overall weighted kappa coefficient, suggesting that a 3-point scale would be better than a binary scale for studies involving a single observer.
Values for interrater reliability (50 raters) were very similar between first and second assessments; therefore, values were pooled to provide a more precise assessment (Table 1). The level of agreement between raters was poorer than that within raters (multirater weighted kappa coefficient 0.46 [95% CI 0.35, 0.57]). When individual categories of lesions were compared in pairs, interrater reliability was highest for “no ulcer” versus “active” ulcer (kappa coefficient 0.60 [95% CI 0.45, 0.76]) and was poorer when comparing “inactive” with “active” ulcer (0.28 [95% CI 0.16, 0.40]). For the distinction between “no ulcer” and “inactive” ulcer, the interclass kappa coefficient was −0.12 (95% CI −0.31, 0.07), indicating no consistency among observers in distinguishing between these 2 categories. These results were consistent with those for intrarater reliability, with the interclass kappa coefficient being lowest for the distinction between “no ulcer” and “inactive” ulcer. When “no ulcer” was compared with “ulcer” (“active” and “inactive” combined), the kappa coefficient was 0.34 (95% CI 0.10, 0.58), and when “no ulcer” and “inactive” ulcer combined was compared with “active” ulcer, the kappa coefficient was 0.49 (95% CI 0.17, 0.81).
The source of the variability was then explored as outlined above. Examples of images with high and low reliability are shown in Figure 1. Raters mostly agreed that the lesion over the proximal interphalangeal joint (Figure 1A) was not an ulcer, but that the lesion over the fingertip (Figure 1B) was an active ulcer. The lesion shown in Figure 1C (low reliability) had progressed to gangrene, and without guidelines as to whether or not gangrenous lesions should be classified as ulcers, there were clearly wide differences in opinion. The lesion shown in Figure 1D (also low reliability) was an example of a more commonly encountered lesion; here, the difference in opinion was mainly regarding whether the lesion was active or inactive. To explore whether the site of the lesion might influence reliability, a subgroup analysis of different sites was performed.
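The image-by-image exploration described in the statistical methods (recalculating agreement with each image dropped in turn) can be sketched as below. The study used a multirater weighted kappa coefficient; for brevity this sketch substitutes a simpler stand-in statistic, the proportion of rater pairs giving identical grades, which is not chance-corrected and is not the authors' measure. All function names are hypothetical.

```python
from itertools import combinations
from typing import Callable, List

Ratings = List[List[int]]  # ratings[rater][image], grades coded as integers


def mean_pairwise_agreement(ratings: Ratings) -> float:
    """Stand-in statistic: share of (rater pair, image) cells with equal grades."""
    n_images = len(ratings[0])
    pairs = list(combinations(range(len(ratings)), 2))
    agree = sum(ratings[a][i] == ratings[b][i]
                for a, b in pairs for i in range(n_images))
    return agree / (len(pairs) * n_images)


def leave_one_image_out(ratings: Ratings,
                        statistic: Callable[[Ratings], float]) -> List[float]:
    """Change in the statistic when each image is dropped in turn.

    A positive change means agreement rose without the image, that is,
    the image itself was graded less reliably than the rest.
    """
    full = statistic(ratings)
    changes = []
    for drop in range(len(ratings[0])):
        reduced = [[g for i, g in enumerate(rater) if i != drop]
                   for rater in ratings]
        changes.append(statistic(reduced) - full)
    return changes
```

Sorting the images by the returned changes gives the ranking from least to most reliably graded; any multirater agreement function with the same signature could be passed in place of the stand-in.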
Of the 35 digital lesions used as test images, 15 were located at the digital tip, 14 at extensor surfaces, and 6 at other sites (radial or ulnar side of the finger). When subgroup analyses of intra- and interrater reliability were performed for lesions at different sites (digital tip, extensor surface of digit, other), there was little evidence that reliability was affected by the location of the lesions being evaluated (Table 2).
| Location depicted (no. of images) | Intrarater reliability, weighted κ (95% CI)* | Interrater reliability, multirater weighted κ (95% CI)† |
|---|---|---|
| Digital tip (15) | 0.81 (0.77, 0.85) | 0.42 (0.25, 0.59) |
| Extensor surface of digit (14) | 0.81 (0.77, 0.85) | 0.47 (0.28, 0.67) |
| Other (6) | 0.79 (0.75, 0.84) | 0.53 (0.25, 0.80) |
Despite being associated with major morbidity and disability, digital ulcers in patients with SSc are a neglected area of clinical research. No previous studies have focused on defining digital ulcers and/or on their reliability as an outcome measure in clinical trials. Our results indicate that when grading digital lesions on a 3-point ordinal scale as “active” ulcer, “inactive” ulcer, or “no ulcer,” intrarater reliability was good (mean weighted kappa 0.81). In contrast, interrater reliability was poor (multirater weighted kappa 0.46), and this remained the case when categories of lesions were combined (“no ulcer” versus “ulcer,” and “no ulcer” plus “inactive” ulcer versus “active” ulcer). These findings have implications for multicenter clinical trials: studies involving several observers may be difficult to interpret, because some types of lesions may be classified as ulcers in some centers and not in others.
It was not the task of this Web-based study to define digital ulceration, and therefore, we did not provide strict textual definitions of ulceration. Instead, we included a set of similarly produced images of exemplar lesions, each categorized by the working group that designed the study as being an image of an “active” ulcer, an “inactive” ulcer, or “no ulcer.” However, because of the broad spectrum of digital lesions in patients with SSc, many of the 35 lesions used as test images for the study were quite different from the exemplar lesions. Also, the purpose of the exemplar images was more to “set the scene” than to influence opinion, since our aim was to test intra- and interobserver variability in current practice. We accept that use of the exemplar lesions could have influenced opinion, and that it could be argued that they should not have been included. However, we believe that the inclusion of exemplar images is unlikely to have changed the main conclusion drawn from our results, i.e., that if digital ulceration is to continue to be used as an end point in multicenter clinical trials of SSc involving multiple observers, then strict definitions must be developed (and perhaps used in training sessions prior to multicenter clinical trials).
Our results also suggest that definitions of digital ulcers should have an emphasis on lesions most likely to be perceived as “active” by rheumatologists, because interrater reliability was highest when comparing “no ulcer” with “active” ulcer (0.60). In contrast, interrater reliability was poorest when comparing “no ulcer” with “inactive” ulcer (−0.12), suggesting that inclusion of an “inactive” (or “healed”) category is unlikely to be clinically meaningful.
A key point in developing definitions for use in future studies should be the description of an active ulcer. This definition would be important with regard to clinical decision-making, such as initiation of or a change in therapy, since disease activity would be a central factor in these decisions. Also, a change in the classification of an ulcer from active to inactive may represent a more meaningful clinical end point in interventional studies, when compared with complete healing. As noted above, in the current study we deliberately did not provide strict textual definitions, since our aim was to examine current practice. From the responses received, it is clear that the participating investigators had very different opinions on what constituted an active ulcer. This variability suggests that inspection alone is insufficient to achieve a valid assessment of a clinical end point.
The high level of intrarater reliability may be explained in part by recall, with the rater remembering the score he or she gave to a particular image earlier in the sequence (even though raters were not permitted to go back in the sequence to make comparisons). It is possible that intrarater reliability would have been lower if participants had reexamined the images at a later date. However, it seems likely that individual observers were consistent in their reporting, suggesting that digital ulceration would be a robust end point in smaller studies involving a single observer. Most of the observers were clinicians with an interest in SSc and, therefore, were representative of those most likely to be involved in clinical trials of SSc-related digital ulceration.
There are a number of limitations to our study. First, the study focused solely on finger ulcers, but toe ulcers in SSc are also a major cause of morbidity, with the results of a previous investigation suggesting that ∼26% of SSc patients experience foot ulceration (9). Second, examination of photographs may not be the exact equivalent of direct assessment of ulcers. When seen in 2 dimensions only, some details of the ulcers may be lost, although in this study, the quality of the images (all of which were obtained by a trained medical photographer) was high. Third, there was no possibility of asking the patient details about duration, change in pain, or discharge from the ulcer, all of which might influence the observer's decision as to whether this was a new ulcer. However, it could be argued that assessing an ulcer in a photographic image is more objective than direct physical assessment.
This study demonstrates the feasibility of Web-based studies, for which large numbers of participants can be recruited over a short time frame. Digital ulcers undoubtedly should be a key end point in studies of SSc-related digital ischemia and related problems, but there is a clear need to improve interrater reliability. The next step will be to develop a set of diagnostic criteria (including size, wet/dry, slough/no slough, surface epithelialization) and then to validate these criteria in further studies, both Web-based and in the clinical setting.
Dr. Herrick had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study design. Herrick, Roberts, Silman, Anderson, Goodfield, McHugh, Muir, Denton.
Acquisition of data. Herrick, Tracey, Anderson, McHugh, Muir, Denton.
Analysis and interpretation of data. Herrick, Roberts, Silman, Goodfield, Denton.
Manuscript preparation. Herrick, Roberts, Silman, Anderson, Goodfield, Muir, Denton.
Statistical analysis. Roberts.
We are grateful to all of the clinicians, mainly members of the UK Systemic Sclerosis Study Group and the Scleroderma Clinical Trials Consortium, who participated in the study, and also to Steve Cottrell and colleagues from the Medical Illustration Department, Salford Royal National Health Service Foundation Trust, for collecting the images.