To compare the diagnostic performance of a computer-based method for measuring joint space width with the Sharp joint space narrowing (JSN) scoring method in patients with rheumatoid arthritis (RA).
A random sample of patients with early RA, for whom sequential hand radiographs and Sharp scores were available, was selected from the National Data Bank for Rheumatic Diseases. Hand joint space width was measured using an automated, computer-based method in random order and with blinding for clinical information. We constructed a receiver operating characteristic curve and compared the diagnostic performance of the computer-based and Sharp methods based on the areas under the curve.
One hundred twenty-nine patients with early RA who underwent serial radiography were included. Changes in the computer-based and Sharp methods were highly correlated (r = 0.75, P < 0.001). The computer-based method was significantly more discriminant than the Sharp JSN subscale. The area under the curve of the computer-based method was 0.96 (95% confidence interval [95% CI] 0.94, 0.99) compared with 0.93 (95% CI 0.89, 0.96) for the Sharp subscale (P = 0.024). At the most discriminant cutoff, specificity of the computer-based method was 88.4% compared with 81.4% for the Sharp subscale (P = 0.11); sensitivity was 87.6% for the computer-based method compared with 82.2% for the Sharp subscale (P = 0.19). The signal-to-noise ratio for the computer-based method was 83% compared with 70% for the Sharp subscale (P = 0.013).
The computer-based method for measuring joint space width is more discriminant than the semiquantitative Sharp JSN subscale.
Rheumatoid arthritis (RA) is a chronic inflammatory autoimmune disease that leads to progressive joint destruction, functional disability, and extraarticular complications. Structural joint damage correlates with long-term functional decline in RA patients (1). Thus, controlling progressive joint damage has become a key treatment objective (2). Conventional radiography permits measurement of structural joint damage, and films can be masked or randomized for standardized damage scoring. Radiographic measures of structural joint damage are currently considered the gold standard of treatment efficacy studies in RA (3), and are used extensively in clinical trials as the primary outcome measure. Furthermore, radiographic assessment of joint damage is required by the Food and Drug Administration as a measure of disease progression in clinical trials of disease-modifying antirheumatic drugs (4).
Two aspects of radiographic damage, the occurrence of erosions and gradual joint space narrowing (JSN), measure distinct aspects of the process of structural joint damage. JSN is thought to reflect damage to joint cartilage, whereas erosions reflect bone damage from synovitis. Several scoring systems have been developed to assess the amount of radiographic damage. Of these, the composite method of Sharp (5) and the global method of Larsen (6) are the most widely used (7). Both methods assess JSN of individual joints using a categorical scale (the Sharp scale is scored from 0 to 4, and the Larsen scale either 0 or 1). The Sharp scoring method provides a semiquantitative JSN subscale (JSNSharp) by summing the scores of the individual joints.
While these traditional scoring methods are widely used in clinical studies (3), they are fundamentally subjective and quantify structural changes with discrete scales (8, 9). Reducing continuous phenomena into broad categories loses information and carries the risk of ceiling and floor effects. Traditional scoring methods also require specially trained assessors, which limits the number of potential readers and increases the cost of radiographic assessment. New software for use in measuring the joint space width has been developed (9–14), and this provides a quantitative, reproducible, and more objective measure by which to assess structural joint damage in patients with RA (15, 16). Use of these techniques will be facilitated as radiology departments adopt fully digital modalities.
While computer-based methods have been shown to be more reproducible than traditional semiquantitative methods, they also tend to be more time consuming (13). Duryea et al have developed an automated computer-based method that automatically locates joints (10) and measures joint space width on digitized images (11), with only minimal human intervention and no user-entered seed points. This computer-based method has been validated and has shown good correlation with the JSNSharp (17). The reproducibility error introduced by human intervention in this method is very small (<1%), and the average reading time is only ∼4 minutes per hand radiograph (16). The objective of this study was to assess the diagnostic performance of this method in a longitudinal cohort study of RA and compare it with JSNSharp.
Our study included an inception cohort of consecutive patients with early RA seen at the Wichita Arthritis Center, an outpatient rheumatology clinic, between January 1973 and August 1998. The patients were followed up prospectively after their first clinic visit with clinical, radiographic, laboratory, demographic, and self-reported data. Details of this cohort are described elsewhere (18). For this analysis, we included a random sample of patients with sequential hand radiographs and complete Sharp scores. To increase the homogeneity of the study sample, we excluded a priori observations with a time interval between consecutive radiographs of >100 months and hand radiographs in which more than half of the joints were very severely damaged (completely destroyed joints, severely subluxated joints, or joint arthroplasty).
Paired hand radiographs were digitized using a Lumiscan LS 75 laser film digitizer (Lumisys, Sunnyvale, CA) with a 0.1-mm pixel size at a 12-bit gray-scale resolution. The digitized images were archived and placed on the hard disk of a personal computer for further analysis. Assessments were made in a blinded and randomized manner using a software tool to automatically determine the joint locations and delineate the joint margins to obtain an average joint space width measurement for each joint (Figure 1). A more detailed description of the software algorithm can be found elsewhere (10, 11). Readers verified the automated joint space width measurement of the second through the fifth metacarpophalangeal joints and the second through the fifth proximal interphalangeal joints on each hand radiograph, and made corrections where necessary. A patient's mean hand joint space width, JSWsoft, was the average joint space width in all 16 joints measured.
The measurement work was spread among 4 readers. The readers consisted of 1 medical physicist (JD) and 3 physicians with experience in radiographic assessment (AF, PdP, GN). The interobserver agreement with this computerized method was high, with an interreader root mean square standard deviation of 0.011 mm or an interreader intraclass correlation coefficient of 0.99 (19). Severely damaged joints with no detectable joint space width, which represented 2.6% of all joints, were assigned a value of 0.00 mm. Some joint margins were insufficiently distinct to be reliably traced with the program, and joint space width could not be assessed using the software. Reasons for indistinct joint margins included incorrectly positioned joints (oblique incidences), poor radiographic quality (over- or underexposed radiographs, blurred images), and disease (flexed joints). Overall, 68 of 4,124 joints (1.6%) were not measurable, in which case the hand joint space width was based on the remaining joints.
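The averaging rule described above can be sketched as follows. This is a minimal illustration, not the study software: the function name `mean_hand_jsw` and the use of `None` to mark an unmeasurable joint are assumptions made for the example, and the widths are made-up values.

```python
from typing import Optional, Sequence


def mean_hand_jsw(joint_widths_mm: Sequence[Optional[float]]) -> Optional[float]:
    """Average joint space width (mm) over the joints that could be measured.

    None marks a joint whose margins could not be traced (hypothetical encoding);
    0.0 marks a destroyed joint with no detectable joint space, which still counts.
    """
    measured = [w for w in joint_widths_mm if w is not None]
    if not measured:
        return None  # no measurable joint on this radiograph
    return sum(measured) / len(measured)


# Illustrative values only, not study data: one unmeasurable and one destroyed joint.
widths = [1.0, 2.0, None, 0.0]
print(mean_hand_jsw(widths))
```

Destroyed joints pull the mean down because they enter as 0.00 mm, whereas unmeasurable joints are simply left out of the denominator.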
The radiographs had been previously read and scored for damage by Dr. John T. Sharp, according to his published method (5). Sequential hand radiographs of each patient were displayed side by side, but in random and blinded chronological order. JSN was graded on a scale from 0 to 4 for 16 joints in each hand, with 0 = normal joint space width, 1 = asymmetric JSN, 2 = <50% JSN, 3 = >50% JSN, and 4 = ankylosis. The scores of all assessed joints were summed to obtain JSNSharp, and they ranged from 0 to 132 (5).
We compared the change in the Sharp JSN subscale (ΔJSNSharp) and in the computer-based method (ΔJSWsoft). We first analyzed the correlation of ΔJSNSharp and ΔJSWsoft. We then analyzed the diagnostic performance of the 2 scoring methods. It was assumed that most RA patients have some progression of structural joint damage during the time interval separating 2 radiographs and that the average extent of radiographic damage can only worsen. Analysis of radiographic damage in a large cohort of RA patients has shown that radiographic damage occurs at a constant rate over time and that structural joint damage can be considered irreversible. This was particularly true before anti–tumor necrosis factor therapy became available. No true standard of joint space width is currently available. The most definitive criterion available is the chronological order in which the hand radiographs were obtained. Therefore, we used the temporal order in which radiographs were obtained as our standard. The order was categorized as either “correct” or “incorrect.” We then classified the paired readings as either “increasing damage scores” or “decreasing damage scores” at various cutoff levels. Sensitivity was defined as the proportion of films with correct temporal sequence that had increasing damage scores. Specificity was defined as the proportion of films with incorrect temporal sequence that had decreasing damage scores.
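As a rough sketch of these definitions, the code below computes sensitivity and specificity at a single cutoff. It assumes that change scores are oriented so that larger values mean more damage (a change in joint space width would be negated first) and that any score not above the cutoff counts as "decreasing"; both conventions, and all names, are assumptions for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    damage_changes: Sequence[float],
    correct_order: Sequence[bool],
    cutoff: float = 0.0,
) -> Tuple[float, float]:
    """Sensitivity and specificity of a change score at one cutoff.

    damage_changes: change per presented film pair, larger = more damage.
    correct_order:  True if the pair was presented in true chronological order.
    """
    tp = fn = tn = fp = 0
    for change, correct in zip(damage_changes, correct_order):
        increasing = change > cutoff  # classified as "increasing damage"
        if correct:
            tp += increasing
            fn += not increasing
        else:
            fp += increasing
            tn += not increasing
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one pair in each temporal order")
    return tp / (tp + fn), tn / (tn + fp)


# Made-up change scores: three pairs in correct order, three in reversed order.
changes = [2, 0, 1, -1, 3, -2]
order = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(changes, order, cutoff=0.0)
```

Sweeping `cutoff` over the observed change scores yields the points of the ROC curve.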
We used the area under the curve (AUC) of a receiver operating characteristic (ROC) curve to compare the diagnostic performance of the 2 scoring methods. An ROC curve is a plot of the sensitivity versus (1 − specificity) of a test, where the points on the curve correspond to different cutoff points used to designate test positivity. In the case of a “2-alternate forced choice” experiment, the AUC of a nonparametric ROC curve plot corresponds to the probability of making the correct choice (20). In the present study, the area under ROC curves corresponded to the probability that the scoring method would identify the correct chronological order of randomly selected pairs of hand radiographs. When both radiograph sets were given the same score, it was assumed that the scoring method would select 1 of the 2 radiographs randomly (20). The nonparametric estimates of the AUC of the ROC curve (trapezoidal method) were compared using Wilcoxon's signed rank test and adjusted for correlations among the scoring techniques, since these measurements were derived from the same radiographs (21, 22). We then chose the most discriminant cutoff for each scoring method and computed the sensitivity and specificity of each method. Sensitivity and specificity were compared using McNemar's probability test.
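The forced-choice interpretation of the AUC can be illustrated with a short sketch. It uses the standard pairwise (Mann-Whitney) form of the nonparametric AUC, in which a tie counts as half a correct choice, mirroring the random selection assumed for equal scores. The function name and the sample scores are illustrative assumptions, and this is not the ROCKIT implementation.

```python
from typing import Sequence


def auc_forced_choice(
    correct_order_scores: Sequence[float],
    reversed_order_scores: Sequence[float],
) -> float:
    """Probability that a correct-order pair scores higher than a reversed-order pair.

    Scores are damage changes oriented so that larger = more damage.
    Ties contribute 0.5, as if the method picked one of the two at random.
    """
    if not correct_order_scores or not reversed_order_scores:
        raise ValueError("both groups must be non-empty")
    wins = 0.0
    for p in correct_order_scores:
        for n in reversed_order_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(correct_order_scores) * len(reversed_order_scores))


# Made-up scores for illustration only.
auc = auc_forced_choice([1, 2, 3], [0, 1, 2])
```

A coarse scale produces many ties, each worth only 0.5, which is one way a categorical score can lose discriminative performance relative to a continuous measurement.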
An alternative way to examine whether the computer-based joint space width method is a more sensitive technique by which to assess progressive JSN than JSNSharp is to compare the signal-to-noise ratios (9). The signal-to-noise ratio conveys the strength of the signal compared with the noise (variability) of the method. Use of techniques with higher signal-to-noise ratios decreases the sample size required to detect change in clinical studies. Signal-to-noise ratio was defined as the mean change in a radiographic scoring method divided by the standard deviation of that change. Signal-to-noise ratios should always be considered in conjunction with the technique's discrimination, since methods that are insensitive to change will have misleadingly low standard errors and high signal-to-noise ratio values (23).
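A minimal sketch of this ratio is shown below. Two choices are assumptions rather than details given in the text: the sample (n − 1) standard deviation is used, and the magnitude of the mean is taken so that a narrowing joint space (negative change) and a rising score (positive change) are on the same footing.

```python
import statistics
from typing import Sequence


def signal_to_noise(changes: Sequence[float]) -> float:
    """Magnitude of the mean change divided by the sample SD of the change."""
    sd = statistics.stdev(changes)  # sample SD; needs at least two values
    if sd == 0:
        raise ValueError("standard deviation is zero; ratio is undefined")
    return abs(statistics.mean(changes)) / sd


# Made-up changes in mean joint space width (mm), for illustration only.
snr = signal_to_noise([-0.30, -0.10, -0.20, 0.05, -0.25])
```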
To evaluate the significance of differences in signal-to-noise ratios between the techniques, we performed a permutation test. The null distribution was derived by randomly assigning the order of the films in each pair, with both orders equally likely. The P value was then calculated as the proportion of 5,000 permutations in which the difference was more extreme than the observed value. Statistical analysis was performed with ROCKIT (24) and with SAS software, version 8.2 (SAS Institute, Cary, NC). All statistical tests were 2-sided. P values less than 0.05 were considered significant.
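One possible reading of this permutation test is sketched below. The text does not give implementation details, so several points are assumptions: swapping the film order of a pair is modeled as flipping the sign of that pair's change for both methods at once, both change scores are oriented so that larger means more damage, the signed ratio is used, and "beyond the observed value" is taken as an absolute difference at least as large as the observed one.

```python
import random
import statistics
from typing import Sequence


def snr(changes: Sequence[float]) -> float:
    """Signed mean change divided by the sample SD."""
    return statistics.mean(changes) / statistics.stdev(changes)


def permutation_p_value(
    changes_a: Sequence[float],
    changes_b: Sequence[float],
    n_permutations: int = 5000,
    seed: int = 0,
) -> float:
    """Two-sided permutation P value for the difference in signal-to-noise ratios."""
    if len(changes_a) != len(changes_b):
        raise ValueError("both methods must score the same film pairs")
    rng = random.Random(seed)
    observed = snr(changes_a) - snr(changes_b)
    extreme = 0
    for _ in range(n_permutations):
        # Each pair keeps or reverses its film order with equal probability;
        # the same flip applies to both methods because they read the same films.
        signs = [1 if rng.random() < 0.5 else -1 for _ in changes_a]
        perm_a = [s * a for s, a in zip(signs, changes_a)]
        perm_b = [s * b for s, b in zip(signs, changes_b)]
        if abs(snr(perm_a) - snr(perm_b)) >= abs(observed):
            extreme += 1
    return extreme / n_permutations
```

The default of 5,000 permutations matches the number stated above; the fixed `seed` only makes the sketch reproducible.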
We examined sequential pairs of hand radiographs of 129 RA patients separated by a median followup period of 4.0 years. All patients had progressive RA; 77% were rheumatoid factor positive, and in all patients, the first radiograph was obtained within 2 years of disease onset. Patient characteristics are shown in Table 1. The mean ± SD increase in ΔJSNSharp between the 2 radiographs was 10.5 ± 13.1 Sharp score units, and the mean ± SD decrease in ΔJSWsoft was −0.16 ± 0.19 mm. ΔJSWsoft and ΔJSNSharp were well correlated (r = 0.75, P < 0.001) (Figure 2).
| Characteristic | Value |
|---|---|
| Age at baseline, median (IQR) years | 56 (44–65) |
| Education level, median (IQR) years | 12 (12–14) |
| Disease duration at first radiographic assessment, median (IQR) years | 0.9 (0.4–1.9) |
| Followup between first and second radiographs, median (IQR) years | 4.0 (2.3–5.7) |
| Ethnic origin, % | |
| DMARDs, % of person-time | |
| % rheumatoid factor positive | 77 |
| Glucocorticoids (low-dose), % of person-time | 13 |
| Computer-based joint space width at baseline, median (IQR) mm | 1.29 (1.16–1.46) |
| Sharp joint space narrowing score at baseline, median (IQR) | 0 (0–7) |
The computer-based method was more discriminant than JSNSharp (P = 0.024) (Figure 3). The quantitative computer-based method yielded an AUC of 0.96 (95% confidence interval [95% CI] 0.94, 0.99), and the Sharp scoring method produced an AUC of 0.93 (95% CI 0.89, 0.96). These results represent the probability of ordering radiographs in the correct sequence (20), meaning that the computer-based method identified the correct chronological order in 96% of randomly selected pairs of hand radiographs, compared with 93% using the JSNSharp. Another way to represent these data is to compare specificity and sensitivity of the most discriminant cutoff for each scoring method. The specificity of the computer-based scoring method was 88.4% for a cutoff at −0.002 mm, compared with a specificity of 81.4% for a cutoff at 0 Sharp score units (P = 0.11 by McNemar's test). With the same cutoffs, the sensitivity of the computer-based method was 87.6%, compared with a sensitivity of 82.2% (P = 0.19 by McNemar's test). Nevertheless, it is important to remember that specificity and sensitivity figures vary with the choice of the cutoff value.
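For readers unfamiliar with McNemar's test on paired classifications, a minimal sketch follows. The variant used in the study is not specified, so the exact binomial form shown here is an assumption, and the counts in the example are made up. Here `b` and `c` are the numbers of film pairs classified correctly by one method but not the other.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar P value from the two discordant counts."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical discordant counts, not study data.
p = mcnemar_exact(4, 12)
```

Only discordant pairs carry information in this test, which is one reason differences in sensitivity or specificity at a single cutoff can fail to reach significance even when the AUCs differ.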
The signal-to-noise ratio for the computer-based method (JSWsoft) was 83%, compared with 70% for JSNSharp, a relative increase of 19%. The P value of the permutation test was 0.0132, indicating that JSWsoft was significantly more sensitive than JSNSharp to changes over time. The difference in discriminative performance was, in large part, explained by the inability of semiquantitative scoring methods to detect minor changes in joint space width. In 31 radiograph pairs (24%), JSNSharp could not detect any change between the baseline and the followup radiograph, which resulted in a change score of 0. This is manifested as a straight line in the ROC curves (Figure 3). In this subset (ΔJSNSharp = 0), the average narrowing by JSWsoft was −0.033 mm (95% CI −0.049, −0.017), and the discriminative performance of the computer-based method was as effective as that in the rest of the sample (AUC 0.88, 95% CI 0.79, 0.97). Radiographic pairs with undetectable changes in JSNSharp typically had shorter time intervals (median 2.8 years compared with 4.4 years for radiographic pairs with positive JSNSharp changes). This suggests that the relative advantage in discriminative performance of the computer-based method over JSNSharp might become more obvious with shorter time intervals between radiographs, as is customary in randomized controlled trials of RA.
We examined the diagnostic performance of an automated, computer-based scoring method in a longitudinal cohort study of RA and compared it with the JSN subscale of the Sharp scoring method. To our knowledge, this is the first attempt to formally ascertain discriminative performance, sensitivity, and specificity of an automated, computer-based method for scoring hand radiographs in RA. The results indicate that quantitative computer-based methods increase the diagnostic performance of radiographic assessment of progressive JSN compared with current scoring methods, such as the Sharp score. The relative increase in discriminative performance was particularly clear in radiograph pairs with shorter followup periods, suggesting that use of this scoring method might improve the effectiveness of radiographic assessment of JSN in short-term trials of RA.
Twenty-four percent of patients demonstrated no change in JSNSharp but still presented significant narrowing of JSWsoft (−0.033 mm). These results corroborate data from Angwin et al (9), who observed no change in Sharp scores in 47% of their patients after 2 years, but a significant reduction in joint space width (−0.027 mm) using a different computer-based method. Those authors also found a 25% improvement in the signal-to-noise ratio with the computer-based method over the total Sharp score, which compares with a 19% improvement in the signal-to-noise ratio over JSNSharp in this study. These variations can be explained by differences in study populations, in lengths of followup, and in the computer-based methods. The computer-based method used in this analysis is more automated and is based on radiographs of 16 joints per hand instead of 12 joints in the previous method (16).
In the absence of a true criterion standard for joint space width, we used the temporal order in which radiographs were obtained as the most definitive criterion by which to establish discriminative performance. This supposes that most patients with active RA in this cohort have some progression of structural joint damage during the time interval separating 2 radiographs, which is a reasonable assumption given a median time interval between the 2 radiographs of 4 years.
We compared the computer-based method only with JSNSharp instead of with the total Sharp score (erosion and JSN subscale) because we believe that JSN and erosion are 2 distinct phenomena of structural joint damage in RA that should be assessed separately. Future development of software tools may allow joint space width in wrists to be measured and erosion size to be assessed, which would complement the software assessment of hand joint space width. This computer-based method was based on 16 joints and compared with JSNSharp based on 32 joints, in hand radiographs only. Modified versions of the Sharp scoring method have incorporated foot radiographs, which increases sensitivity to change (25). This computer-based method performs well on metatarsophalangeal joints on radiographs of the foot as well. While increasing the number of assessed joints is likely to increase the discriminative performance of both scoring methods, it is unlikely to remove the relative advantage of the computer-based joint space width method over the Sharp method in the assessment of JSN.
We arbitrarily removed radiographic pairs with very long time intervals (>100 months) to increase the homogeneity of our study sample. Sensitivity analyses with various time intervals did not qualitatively influence our conclusions (data not shown). We also excluded a priori hand radiographs with a majority of severely damaged joints, which may affect the generalizability of these findings to more advanced disease. The performance of the computer-based method might be decreased with severely damaged joints, when eroded edges and subluxated joints make it technically difficult to delineate the joint margins. However, joint space width measurement in RA is most relevant in earlier stages of the disease, before major joint damage has occurred (9).
While computer-based methods of measuring joint space width offer a quantitative, reliable, and more objective means by which to assess JSN in patients with RA (15, 16), they tend to be very time consuming, which has limited their clinical use (13). In this study, we used an automated and validated computer-based method that automatically locates joints and measures joint space width with only minimal human intervention, which allows a more time-efficient assessment of joint space width (11, 16, 26). Furthermore, it does not require highly trained personnel, as is the case for the traditional scoring methods. The diagnostic performance of this computer-based method was significantly better than that of the semiquantitative Sharp subscale. More research is needed to establish the diagnostic performance of this scoring method with shorter followup periods between consecutive radiographs, which are customary in randomized controlled trials of RA. Computer-based methods and quantitative measurements of joint space width have the potential to increase the statistical power of radiographic assessment of structural joint damage progression in RA.
We thank Dr. J. T. Sharp for scoring the radiographs.