Swollen and tender joints, important in assessing rheumatoid arthritis (RA) activity, have traditionally been evaluated by health professionals. Whether patients can accurately evaluate joints is uncertain. This study evaluated 1) the reliability of patient-assessed swollen joint counts (SJCs) and tender joint counts (TJCs) versus those assessed by a physician, nurse, and B-mode ultrasonography (US) and 2) patient-derived Disease Activity Score in 28 joints (DAS28) compared with physician-, nurse-, and US-derived DAS28.
Fifty RA patients self-assessed 28 joints (shoulders, elbows, wrists, metacarpophalangeal, proximal interphalangeal, and knees) for swelling and tenderness. They were then assessed separately by a physician, a nurse, and an ultrasonographer. Nine patients were tested twice (intraobserver reliability), and reliability was assessed at the patient level (28 joints) by intraclass correlation coefficients (ICCs) and at the joint level by prevalence-adjusted bias-adjusted kappa.
TJC reliability was good for patient versus physician (ICC 0.85 [95% confidence interval (95% CI) 0.65, 0.94]) and patient versus nurse (ICC 0.76 [95% CI 0.47, 0.90]). However, SJC reliability was poor for patient versus physician (ICC 0.41 [95% CI −0.05, 0.72]) and patient versus nurse (ICC 0.44 [95% CI −0.005, 0.74]). SJC reliability was poor in all assessors compared with B-mode US, particularly patient-assessed SJC (ICC 0.22 [95% CI −0.25, 0.61]). However, patient-derived DAS28 correlated well with US-derived DAS28 (ICC 0.95 [95% CI 0.87, 0.98]). Intraobserver reliability was good for all assessors for TJC, but was lower for SJC.
Patient-derived DAS28 is at least as reliable as physician-, nurse-, or US-derived DAS28, despite poor reliability in patient-assessed SJC.
Joint count assessments involving evaluation of swollen and tender joints are well-known measures of disease activity in rheumatoid arthritis (RA) (1). They are essential components of disease activity indices such as the Disease Activity Score in 28 joints (DAS28) (2) and the American College of Rheumatology (ACR) response criteria (3), which are important outcome measures in assessing the therapeutic effect in RA, both in clinical practice and in trials. Until now, physicians and, more recently, nurses have been in charge of this data collection. However, joint count assessments are sometimes considered time-consuming and are often not performed by rheumatologists during consultations (4).
Having patients collect these data is potentially advantageous, and not only because it would alleviate the time burden on health professionals. Patients could also assess their own level of disease activity between physician consultations, which would allow any disease flare to be managed promptly and thus help maintain tight control of disease activity.
There have been conflicting reports on whether patient self-assessments could replace physician assessments (5, 6). A recent systematic review of self-report versus assessor joint counts in RA concluded that self-reported tender joint counts (TJCs) correlated well with those performed by a trained assessor or physician (7). However, correlation of self-reported swollen joint counts (SJCs) with those by assessors was poor. On the other hand, physician joint counts, regarded as the gold standard in the past, are also liable to interobserver variation, especially for SJCs (8–12). Furthermore, it appears that physicians have poor sensitivity for detection of synovitis when compared with ultrasonography (US).
US, a validated modality for synovitis assessment in RA, is more sensitive than joint count assessment for evaluation of synovitis (13, 14). As a result, using a US-derived SJC to calculate the DAS28 would be more representative of true disease activity and, therefore, a better gold standard. To our knowledge, no studies to date have evaluated the metrologic properties of joint count assessment by patients compared with assessment by a physician, a nurse, and US in the same cohort.
These observations prompted us to conduct a study aimed at evaluating the reliability of patient self-assessed SJCs and TJCs versus joint counts assessed by a physician, a nurse, and B-mode US. We also compared the DAS28 derived by patients with physician-, nurse-, and US-derived versions of the DAS28.
PATIENTS AND METHODS
Study design and patients.
This was a cross-sectional, single-center study, approved by the local ethics committee. Adult patients fulfilling the 1987 ACR (formerly the American Rheumatism Association) criteria for RA (15) were consecutively recruited in the day hospital and outpatient clinics in Cochin Hospital between April 1 and July 12, 2009. All patients gave their written informed consent.
Patients were asked to self-assess 28 joints (shoulders, elbows, wrists, metacarpophalangeal [MCP], proximal interphalangeal [PIP], and knees) for swelling and tenderness in a question/mannequin format. Instructions were provided based on the European League Against Rheumatism handbook of clinical assessments in RA (16). A nurse practitioner (MM), experienced in joint count assessment, provided a 5-minute training session on the detection of swollen and tender joints prior to commencement. The self-assessment was followed by a joint count examination performed independently by a different nurse practitioner (CLB) and a rheumatologist (PPC), both blinded to the patients' clinical data. US was performed using a MyLab60 machine (10–13 MHz probe; Esaote Biomedica) on the corresponding 28 joints by 1 physician (AR-W). The US definition of synovitis included synovial hypertrophy and/or synovial effusion detected by B-mode. US descriptions of synovial hypertrophy and synovial effusion were based on the Outcome Measures in Rheumatology Clinical Trials consensus (17). Although this was predominantly a description of the small joints of the hands, synovial hypertrophy and/or synovial effusion were also included in the US synovitis definition for large joints (18). Patients were evaluated using the scanning protocol by Backhaus et al (19). Synovitis was graded on B-mode with a semiquantitative score from 0 to 3, adapted from the synovitis grading system proposed by Scheel et al (20) with the following subjective definitions: grade 0 = normal, grade 1 = mild synovitis, grade 2 = moderate synovitis, and grade 3 = marked synovitis.
A joint was considered as having synovitis if the grading was ≥1 through B-mode US. Power Doppler was then performed to further identify degrees of activity on synovial hypertrophy identified by B-mode with semiquantitative scoring based on Szkudlarek et al (21).
The physician who performed the US was experienced in detection of synovitis through an initial period of extensive examination on normal subjects (not conducted specifically for this study). The applied cutoff of normality on the 28 joints scanned in both B-mode and power Doppler in this study was based on the images and guidelines used in the US scanning protocol from a previous study of a prospective multicenter RA trial that evaluated US scoring systems of synovitis (18).
Patient and disease characteristics were collected after the joint evaluation. These included sex, age, disease duration, history of surgery related to RA, history of RA treatments, and C-reactive protein (CRP) level in the preceding 3 months. For patients with joint replacements, those joints were omitted from the analysis. Other outcome variables included patient global assessment using a 0–100 visual analog scale (VAS) and functional impairment using the Health Assessment Questionnaire (22). Patients were also asked whether self-assessment of joint counts was acceptable through a VAS, with 0 being unacceptable and 100 being completely acceptable. Time taken to complete the self-assessment was recorded. The latest plain radiographs of the hands (<2 years old) were reviewed and scored for erosions using the Larsen score (23).
The DAS28 was calculated using the TJCs and SJCs obtained by the patients, physician, and nurse. In the case of the US-derived DAS28, SJCs obtained by US were used along with TJCs derived from the respective groups for which the comparison was made. For instance, when the patient-derived DAS28 was compared with the US-derived DAS28, the TJC derived by the patient was used for the US-derived DAS28 calculation, whereas for the physician-calculated DAS28, the US-derived DAS28 used the TJC derived by the physician.
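The substitution described above can be sketched in a few lines. This is a minimal illustration only: the article does not print the equation, so the commonly published 4-variable DAS28-CRP form (CRP in mg/liter, patient global assessment on a 0–100 VAS) is assumed here, and the function name and all input values are hypothetical rather than study data.

```python
import math


def das28_crp(tjc28, sjc28, crp_mg_l, global_vas):
    """Commonly published 4-variable DAS28-CRP (assumed form, not quoted from the article)."""
    return (0.56 * math.sqrt(tjc28)
            + 0.28 * math.sqrt(sjc28)
            + 0.36 * math.log(crp_mg_l + 1)
            + 0.014 * global_vas
            + 0.96)


# Hypothetical patient: the same TJC, CRP, and global VAS feed both scores.
patient_tjc, patient_sjc = 6, 2
us_sjc = 8            # swollen joints according to B-mode US (hypothetical)
crp, vas = 5.0, 40

# Patient-derived DAS28: patient TJC + patient SJC.
patient_das28 = das28_crp(patient_tjc, patient_sjc, crp, vas)

# US-derived DAS28 for the patient comparison: patient TJC + US SJC.
us_das28 = das28_crp(patient_tjc, us_sjc, crp, vas)

print(round(patient_das28, 2), round(us_das28, 2))
```

Because only the SJC term changes between the two scores in each comparison, the remaining components are shared by construction, which is worth bearing in mind when reading the DAS28 reliability results.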
All patients were asked to return to complete an intraobserver assessment within 48 hours, provided that they were in a stable disease state, and joint count assessments were repeated to assess the intraobserver reliability of the patients, physician, nurse, and US.
Statistical analysis was performed using SAS software, version 9.3 (SAS). Descriptive statistics were performed for patient and disease characteristics. Intraobserver reliability (between the same observer at 2 different time points) of TJC and SJC was expressed through the intraclass correlation coefficient (ICC). Interobserver reliability (between different observers, i.e., the patients, physician, nurse, and US) was calculated at the patient level by ICC and at the joint level through percentage of agreement and prevalence-adjusted bias-adjusted kappa (PABAK) using the S of Bennett (24, 25). This was used instead of the kappa statistic because of the paradoxical effect present in situations of low disease prevalence in which the proportion of total joints having synovitis in the cohort is expected to be low. The kappa is substantially reduced in these situations, no matter how good the strength of agreement is on what was or was not synovitis. Part of this paradoxical effect could be ameliorated with the use of PABAK (24). PABAK values were interpreted as follows: 0–0.2 = poor agreement, 0.21–0.4 = fair agreement, 0.41–0.6 = moderate agreement, 0.61–0.8 = substantial agreement, and 0.81–1 = near-perfect agreement (26). DAS28 scores derived by patients were compared with those derived by the physician, nurse, and US using ICC. A scatter plot of the difference between US- and patient-assessed SJC against US-assessed SJC was used to visually detect a potential systematic bias and the spread across the range.
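The kappa paradox mentioned above can be made concrete with a small sketch. For a binary rating (swollen or not), PABAK reduces to 2 × observed agreement − 1, which is Bennett's S for 2 categories; the formula can in principle give negative values when observed agreement falls below 50%. The function name and the 2 × 2 counts below are hypothetical and chosen only to mimic a low prevalence of synovitis; they are not study data.

```python
def agreement_stats(both_pos, only_a, only_b, both_neg):
    """Percentage agreement, Cohen's kappa, and PABAK for two raters on a binary item."""
    n = both_pos + only_a + only_b + both_neg
    po = (both_pos + both_neg) / n                 # observed agreement
    pa = (both_pos + only_a) / n                   # rater A: proportion rated swollen
    pb = (both_pos + only_b) / n                   # rater B: proportion rated swollen
    pe = pa * pb + (1 - pa) * (1 - pb)             # agreement expected by chance
    kappa = (po - pe) / (1 - pe) if pe < 1 else float("nan")
    pabak = 2 * po - 1                             # Bennett's S with 2 categories
    return po, kappa, pabak


# Hypothetical joint-level table: 200 joints, few of them swollen.
po, kappa, pabak = agreement_stats(both_pos=2, only_a=10, only_b=10, both_neg=178)
print(f"agreement={po:.2f}  kappa={kappa:.2f}  PABAK={pabak:.2f}")
```

In this invented example the two raters agree on 90% of joints, yet kappa stays in the "poor" band because nearly all agreement is on non-swollen joints, whereas PABAK reflects the observed agreement directly.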
Factors that potentially contributed to differences in the reliability of patient detection of SJC compared with US were evaluated through a sensitivity analysis. This included the type of joint assessed and radiologic documentation of joint damage (comparing joints that appeared to be severely damaged with those that did not on radiographs) for wrist, MCP, and PIP joints.
In addition, results obtained with the primary US definition of synovitis (B-mode grade ≥1, taken as equivalent to swelling on the joint count) were compared with results obtained using a higher B-mode threshold, in which grades 0 and 1 were equivalent to no swelling and grades 2 and 3 were equivalent to swelling by joint count.
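The effect of the B-mode threshold on the US-derived SJC can be sketched as follows. The function name and the list of 28 grades are hypothetical; `None` marks a replaced joint, which the Methods state was omitted from the analysis.

```python
def us_swollen_joint_count(grades, threshold=1):
    """Count joints whose B-mode grade (0-3) is at or above the threshold.

    Joints recorded as None (e.g., joint replacements) are skipped.
    """
    return sum(1 for g in grades if g is not None and g >= threshold)


# Hypothetical B-mode grades for the 28 joints of one patient.
grades = [0, 1, 1, 2, 0, 3, 1, 0, 0, 1, 2, 0, 0, 0,
          1, 1, 0, 0, 2, 0, 0, 1, 0, 0, None, 0, 1, 0]

sjc_primary = us_swollen_joint_count(grades, threshold=1)   # grade >= 1 counts as swollen
sjc_strict = us_swollen_joint_count(grades, threshold=2)    # only grades 2 and 3 count
print(sjc_primary, sjc_strict)
```

With many grade 1 joints, as in this invented patient, raising the threshold removes a large share of the joints classified as swollen, which is the mechanism the sensitivity analysis is designed to probe.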
Characteristics of the 50 patients are illustrated in Table 1. Patients had longstanding disease (median [quartile 1, quartile 3] disease duration 15 years [10, 21 years]) but mild disease activity (median [quartile 1, quartile 3] DAS28 of 3.5 [2.6, 4.5]), and all were receiving ≥1 disease-modifying antirheumatic drug (DMARD). Biologic DMARDs were used in 35 patients (70%), and 32 (64%) were receiving concurrent steroid therapy, highlighting a group that originally had active disease. Of the patients, 14% had a DAS28 <2.6. The median (quartile 1, quartile 3) SJC and TJC assessed by the physician were 5 (3, 7) and 2 (0, 6), respectively.
Patients found assessing their joints to be easy and acceptable. Acceptability was high as measured by VAS (mean ± SD 77 ± 26 mm, range 0–100 mm), and ease of performing the joint counts was also high (mean ± SD VAS 83 ± 22 mm, range 0–100 mm). The mean ± SD time taken to complete the self-assessment was 12 ± 6 minutes.
Intraobserver reliability of joint count assessment by the patients, physician, nurse, and US.
There were 9 patients who completed intraobserver reliability testing. Results were good for TJC, with the difference in the number of joints detected initially and on retesting ranging from 0 to 6 for the patient, from 0 to 3 for the physician, and from 0 to 1 for the nurse, with ICCs of 0.942 (95% confidence interval [95% CI] 0.855, 0.978), 0.984 (95% CI 0.960, 0.994), and 0.993 (95% CI 0.980, 0.997), respectively. However, the intraobserver reliability of SJC was poor for patient self-assessment, with an ICC of 0.564 (95% CI 0.155, 0.808), with the difference in the number of swollen joints on initial and retesting ranging from 0 to 10. There was 1 outlier with a large discrepancy in the results of 10 joints (i.e., 0 swollen joints at baseline and 10 swollen joints 2 days later). When this patient was excluded, the ICC was 0.867 (95% CI 0.684, 0.947). On the other hand, the intraobserver reliabilities of SJC for the physician, nurse, and B-mode US were good, with ICCs of 0.798 (95% CI 0.544, 0.918), 0.849 (95% CI 0.646, 0.940), and 0.880 (95% CI 0.713, 0.953), respectively. The numbers of swollen joints found on the initial test and on retesting were similar as well, ranging from 0 to 3 for the physician and nurse and ranging from 0 to 5 for B-mode US.
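The sensitivity of a test–retest ICC to a single discordant patient in a sample of 9 can be illustrated with a short sketch. The article does not state which ICC form was computed, so a two-way random-effects, absolute-agreement, single-measurement ICC (often written ICC(2,1)) is assumed here, and the 9 pairs of counts are invented for illustration; they are not the study data.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1) from a list of rows (one per patient, one value per occasion or rater)."""
    n, k = len(ratings), len(ratings[0])
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)    # between patients
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)    # between occasions
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols                     # residual
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = ss_err / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical SJC pairs (first visit, retest): 8 stable patients plus 1 outlier.
stable = [(2, 2), (4, 5), (1, 1), (6, 5), (3, 3), (0, 1), (5, 5), (7, 6)]
outlier = (0, 10)

icc_all = icc_2_1(stable + [outlier])
icc_without_outlier = icc_2_1(stable)
print(round(icc_all, 2), round(icc_without_outlier, 2))
```

In this invented sample, removing the one patient whose count moved from 0 to 10 raises the ICC substantially, which mirrors the direction of the change reported above when the outlier was excluded.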
Interobserver reliability of patient, physician, and nurse assessment of SJC or synovitis with B-mode US.
With B-mode US as the gold standard, the patients, physician, and nurse underestimated the number of joints that had synovitis (Table 2). At the patient level (28 joints), SJC was not reliable when compared with US. The ICCs were poor in all 3 groups when compared with US, with the lowest being the patient (ICC 0.220 [95% CI −0.253, 0.609]), followed by the nurse (ICC 0.294 [95% CI −0.178, 0.656]) and physician (ICC 0.412 [95% CI −0.045, 0.726]), with wide 95% CIs. At the joint level, the percentage agreement of synovitis (i.e., whether the joint was swollen or not) between patient and US scoring was 79% (Table 3). The mean percentage agreement at the joint region level for patient-, physician-, and nurse-assessed SJCs with B-mode US-assessed SJCs was similar in most joint regions (data not shown). Of the small joints of the hands, the wrist was the least reliable joint for the patients, physician, and nurse.
Table 2. Description of SJC, TJC, and DAS28 derived by the patients, physician, nurse, and B-mode US*
Table 3. Interobserver agreement of B-mode US with patient, physician, and nurse assessment of synovitis at joint and patient level*
Columns: agreement at joint level (values not reproduced here); reliability at patient level, ICC (95% CI)
Patient vs. US: 0.220 (−0.253, 0.609)
Physician vs. US: 0.412 (−0.045, 0.726)
Nurse vs. US: 0.294 (−0.178, 0.656)
Patient vs. physician: 0.407 (−0.050, 0.724)
Patient vs. nurse: 0.444 (−0.005, 0.744)
Nurse vs. physician: 0.308 (−0.163, 0.665)
* US = ultrasonography; PABAK = prevalence-adjusted bias-adjusted kappa; ICC = intraclass correlation coefficient; 95% CI = 95% confidence interval.
Due to the large difference in the number of joints detected as having synovitis by B-mode US as compared with clinical assessment, a sensitivity analysis was performed on synovitis grading by B-mode US. When the threshold for synovitis was increased to a grading of ≥2 on B-mode US (i.e., a score of either 2 or 3 was equivalent to “swollen” on the joint count as opposed to a score of 1, 2, or 3), the total number of joints classified as swollen by B-mode US fell by 69%. At the joint level, agreement of the patient-, physician-, and nurse-assessed SJC with B-mode US improved, particularly for the PIP joints (agreement rising from 51% to 83%). When compared with US, PABAK improved to moderate agreement for the patients (0.261 to 0.580) and to substantial agreement for the physician and the nurse (0.381 to 0.634 and 0.360 to 0.733, respectively).
Reliability of patient self-assessed SJC with nurse and physician assessments.
When physician-assessed SJC was used as the gold standard, patients tended to underestimate the number of swollen joints (Table 2). Patient self-assessed SJC was not reliable at the patient level (28 joints), with an ICC of 0.407 (95% CI −0.050, 0.724). At the joint level, PABAK was moderate at 0.560. Compared with the nurse, self-assessed SJC was also unreliable, with an ICC of 0.444 (95% CI −0.005, 0.744). The ICC between the physician and the nurse on the SJC did not fare any better at 0.308 (95% CI −0.163, 0.665), although agreement at the joint level was 84%.
Level of disagreement in synovitis increased as the number of SJC detected by US increased.
The level of agreement between self-assessed SJC and B-mode US is further illustrated by the scatter plot in Figure 1. When compared with B-mode US, it was clear that the level of disagreement increased as the number of swollen joints detected by US increased. A similar trend was also seen in the scatter plots for the nurse and the physician (data not shown). When patients were compared with the physician, there was a similar trend of the level of disagreement increasing as the number of swollen joints increased (Figure 1).
Reliability of patient self-assessed TJC with nurse and physician assessments.
Patients detected more tender joints compared with the nurse and the physician (Table 2). Self-assessed TJCs were reliable when compared with the physician and the nurse at the patient level, with mean ICCs of 0.850 (95% CI 0.648, 0.940) and 0.760 (95% CI 0.473, 0.901), respectively. At the joint level, the agreement was similar when compared with the physician and the nurse (PABAK 0.643 and 0.620, respectively). However, agreement was better between the physician and the nurse (PABAK 0.723) (Table 4). Mean percentage agreement at the joint region level for TJC showed that agreement was good for all regions (results not presented). In the assessment of TJC, patients were asked to point out which joints also had spontaneous pain. It was shown that patients' spontaneous pain correlated well with self-assessed TJCs (r = 0.682, P < 0.0001).
Table 4. Interobserver agreement of the patients, nurse, and physician in TJC at the joint level and the patient level*
Columns: agreement at joint level (values not reproduced here); reliability at patient level, ICC (95% CI)
Patient vs. physician: 0.850 (0.648, 0.940)
Patient vs. nurse: 0.760 (0.473, 0.901)
Nurse vs. physician: 0.873 (0.697, 0.950)
* TJC = tender joint count. See Table 3 for additional definitions.
Reliability of the DAS28 between the patients, physician, and nurse with US.
The physician and nurse tended to underestimate the DAS28, whereas patients tended to report higher DAS28, when the US-derived DAS28 was regarded as the reference gold standard (Table 2). Although this was the case, good correlations were seen with DAS28 scores obtained by the respective groups (Table 5). The patient-derived DAS28 was well-correlated with the US-derived DAS28, with an ICC of 0.949 (95% CI 0.871, 0.980). The physician DAS28 and nurse DAS28 showed excellent reliability when compared with the US DAS28, with ICCs of 0.978 (95% CI 0.944, 0.992) and 0.973 (95% CI 0.932, 0.990), respectively. When the patient-derived DAS28 was compared with those derived by the physician and nurse, excellent correlation was reported as well (ICCs 0.900 [95% CI 0.756, 0.961] and 0.878 [95% CI 0.709, 0.952], respectively). A scatter plot of the difference in the patient DAS28 versus the physician DAS28 showed that the level of disagreement did not widen as the DAS28 increased (Figure 1).
Table 5. Reliability of the DAS28 derived from the patients, physician, nurse, and US at the patient level*
Comparison: ICC (95% CI)
Patient vs. US: 0.949 (0.871, 0.980)
Physician vs. US: 0.978 (0.944, 0.992)
Nurse vs. US: 0.973 (0.932, 0.990)
Patient vs. physician: 0.900 (0.756, 0.961)
Patient vs. nurse: 0.878 (0.709, 0.952)
Nurse vs. physician: 0.906 (0.771, 0.963)
* DAS28 = Disease Activity Score in 28 joints. See Table 3 for additional definitions.
Sensitivity analysis with reliability and degree of radiologic damage on radiograph.
In the sensitivity analysis of radiologic damage as measured through the Larsen score, there was no statistical difference in the reliability of the SJC by the patients, physician, or nurse when compared with B-mode US (data not shown).
The results of this study showed that self-assessments of TJC were reproducible and correlated well with those derived by the physician and nurse. However, in terms of intraobserver reliability, patients were poor assessors of synovitis (SJC) as compared with the other groups. The interobserver reliability of the SJC when compared with US was poor for all groups, especially the patients. However, agreement at the joint level was good, except for the wrist. Despite large interobserver differences in SJCs, the DAS28 scores derived by the patients were well-correlated with the US-derived and physician-derived DAS28 scores.
The results of this study confirm the findings of previous studies of self-assessment of joint counts. Self-assessments of TJC were well-correlated with those by trained assessors (5, 6, 27), whereas poor correlation of patient-derived SJC with physician-derived SJC has been noted in several studies (6, 28–31). There are limited data in published literature on correlations of the patient-derived DAS28, but preliminary evidence indicates that the correlations are good when compared with the DAS28 derived by trained clinical assessors (32–34). Although the reliability of patient-derived SJCs is poor, interobserver reliability of physician-assessed SJC has been understudied and is liable to interobserver differences (9, 10). In addition, when compared with US, agreement was poor overall in one study (35). To our knowledge, the current study is the first in which US was used as the gold standard for SJC instead of trained clinical assessors. US had been validated as a tool for synovitis assessment (18) and is more sensitive at detecting synovitis than clinical examination (13, 14).
The intraobserver reliability of self-assessed SJC appeared to be different from the test–retest correlations from other studies on self-assessment (29, 31). This was attributed to one outlier who had a test–retest SJC difference of 10 despite reporting no change in disease activity. A lack of understanding of SJC detection by the patient and inadequate training were the most likely reasons for this.
On an individual joint level, there was a paradoxical effect in results, with generally good agreement and poor kappa results. This was because the total number of joints with synovitis was small when compared with the number of joints without synovitis; therefore, PABAK was used. With the scatter plot, it was clear that agreement became poorer between patient and US detection of SJC as the number of US-detected SJC increased. This may restrict self-assessment of joint counts to patients with low disease activity, for use between clinical visits only.
There are several reasons for the large disagreement between patient and US detection of synovitis. First, the US synovitis definition included both synovial hypertrophy and effusion, and the latter can be difficult to detect through physical examination in joints such as the shoulders and elbows. Second, the threshold for synovitis on B-mode US may have been too low (i.e., too sensitive) compared with detection of synovitis through physical examination, which was particularly evident in the PIP joints. Last, the degree of patient education and training may have been inadequate.
Despite large discrepancies in swollen joint detection, the patient-derived DAS28 correlated well with the US-derived DAS28 and those derived by the physician and the nurse. We selected the US-derived DAS28 as the gold standard for 2 reasons. First, US is more sensitive than physical examination for detection of synovitis (13, 14). Second, the US-derived DAS28 has shown face validity, external validity, sensitivity to change, and discriminant capacity when compared with the clinical DAS28 (36). The slight overestimation of the patient DAS28 was largely due to the higher TJCs reported by patients and the subsequent higher weighting for TJC than SJC in the DAS28 equation. The higher weighting of the TJC, patient global assessment, and CRP level compared with the SJC could explain why the DAS28 derived by patients still correlated well with those derived by US or the physician despite poor reliability in patient SJCs.
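The weighting argument can be made explicit with a short worked calculation. The article does not print the equation; the commonly published 4-variable DAS28-CRP form is assumed below (CRP in mg/liter, GH = patient global assessment on a 0–100 VAS), and the joint counts used in the comparison are hypothetical.

```latex
\[
\mathrm{DAS28\text{-}CRP}
  = 0.56\sqrt{\mathrm{TJC28}}
  + 0.28\sqrt{\mathrm{SJC28}}
  + 0.36\ln(\mathrm{CRP}+1)
  + 0.014\,\mathrm{GH}
  + 0.96
\]

% Hypothetical discrepancy: one assessor counts 2 joints, another counts 8.
\[
\Delta_{\mathrm{SJC}} = 0.28\left(\sqrt{8}-\sqrt{2}\right) \approx 0.28 \times 1.414 \approx 0.40,
\qquad
\Delta_{\mathrm{TJC}} = 0.56\left(\sqrt{8}-\sqrt{2}\right) \approx 0.56 \times 1.414 \approx 0.79
\]
```

Under these assumptions, a disagreement of 6 swollen joints shifts the score by about 0.4 points, half the shift produced by the same disagreement in tender joints, and the square root further compresses differences at higher counts.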
The DAS28 including CRP level (DAS28-CRP) was used in this study because it has recently been observed to have a similar validation profile to that of the DAS28 including erythrocyte sedimentation rate (ESR) (37). With the logarithmic calculation of the ESR component of the DAS28, it has been observed that the ESR contribution is higher in its lower range even though it is still within the normal limits; thus, small variations in the ESR can decisively influence the final DAS28 score. Although not formally validated, this is not observed to be true for the DAS28-CRP, which usually yields a lower score compared with the DAS28 using the ESR (37, 38). The main objective of our study was to look at the impact of the SJC and TJC by the respective groups and the subsequent relationship of the DAS28 scores. Therefore, the DAS28-CRP was preferred.
Although patients received training and explanation prior to the self-assessment, the exact time required for education is unclear. SJC reliability is examiner dependent but improves after standardization and training. Radner et al studied the effects of education on improving the reliability of self-assessment in 43 patients with RA who received 15 minutes of training. The ICCs for self-assessed SJC improved in the small joints of the hands when compared with assessments by physicians and biometricians, but those for TJC did not (39).
There are limitations to this study. First, it was a cross-sectional study, and therefore longitudinal evaluation and the assessment of the benefits of further training were not possible. Second, the sample size was restricted by the time required for US assessment, especially in regard to the number of patients tested for intraobserver reliability. Third, some patients and the clinical assessors may have had a recall bias with a test–retest time interval of under 48 hours, although blinding of patient clinical data was carried out.
However, patients were consecutively recruited, and patient self-assessments were comprehensively compared at the clinical assessor level (physician and nurse) and at the imaging level (US). Intraobserver reliability was investigated using the same group of patients assessed twice by all assessor groups, making this a real-life clinical situation.
There is potential for utilization of patient assessments between consultation appointments. Patient assessment of joint counts and subsequent calculation of the DAS28 from these counts have a number of potential advantages for daily clinical practice. However, as shown in this study, there are a number of issues that need to be resolved. The optimum time and structure of training to improve self-assessed SJC reliability are yet to be determined. It is also important to verify the observation that self-assessment of SJC is reliable at levels of low disease activity.
Although we have demonstrated that the patient-derived DAS28 is well-correlated with the US-derived DAS28, patient self-assessed SJCs are not reliable in synovitis detection. Agreement with US-detected synovitis is better with low disease activity. The structure of education and training to improve reliability needs to be further evaluated before implementation of self-assessment into daily clinical practice.
All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Cheung had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Cheung, Ruyssen-Witrand, Gossec, Paternotte, Le Bourlout, Mazieres, Dougados.
Acquisition of data. Cheung, Ruyssen-Witrand, Paternotte, Le Bourlout, Mazieres.
Analysis and interpretation of data. Cheung, Ruyssen-Witrand, Gossec, Paternotte, Dougados.