Responsiveness of self-report and therapist-rated upper extremity structural impairment and functional outcome measures in early rheumatoid arthritis




To provide a responsiveness analysis of the self-report and therapist-rated upper extremity functional outcome measures used in a rehabilitation trial.


A variety of commonly used therapist-assessed and self-report structural impairment and functional outcome measures were compared for the ability to detect and measure change in wrist and hand status in an early rheumatoid arthritis population over 12 months. Responsiveness was measured using the standardized response mean (SRM) and effect size (ES).


The most responsive measures were the Michigan Hand Outcomes Questionnaire (SRM 0.49 [95% confidence interval (95% CI) 0.27, 0.72], ES = 0.37 [95% CI 0.21, 0.54]), dominant metacarpophalangeal joint ulnar deviation (SRM 0.46 [95% CI 0.27, 0.65], ES = 0.58 [95% CI 0.34, 0.82]), and mean power handgrip test (SRM 0.45 [95% CI 0.26, 0.64], ES = 0.32 [95% CI 0.18, 0.45]). The least responsive measure was the Health Assessment Questionnaire (SRM −0.12 [95% CI −0.31, 0.08], ES = −0.08 [95% CI −0.21, 0.05]).


Over 12 months, there was substantial variation in the ability of wrist and hand outcome measures to detect change in an early RA population. Careful consideration is required to choose the most appropriate measure that can detect change.


The ability of an outcome measure to detect clinical change over time is important. The capacity of an outcome measure to detect change in status over time is referred to in this article as responsiveness, which is a measure of the association between change in an observed outcome score and change in the true value of the construct.

Measuring change is challenging. With self-report outcomes, measurement can be affected by individuals' self-appraisal changing over the course of their disease (1) and changes in habituation and coping mechanisms (2). Responsiveness is context and population specific, and although disease-specific scales have been reported as being more responsive than generic scales (3), individualized patient preference scales (i.e., those scales that concentrate on symptoms of individual patients) have been reported as the most responsive (4). There is little information that identifies what difference in outcome is required to represent a meaningful change, and patients' and clinicians' ratings of change remain the gold standard.

Responsiveness can be assessed using two broad approaches: anchor based and distribution based. Anchor-based approaches use two end points for classification, for example, the very best health scenario (perfect health, no pain) and the very worst health scenario (death, worst imaginable pain). Responsiveness can then be described as the distance moved between these anchor points.

A distribution-based approach does not use an anchor for defining meaningful change, and is limited in part because of this. Distribution approaches estimate the signal-to-noise ratio in a population. These estimates are linked to measurement error, variation, and sample size. If large differences exist between baseline and followup, then a reasonable measure will detect this change. If the difference is small, then only a particularly responsive measure will detect it. The statistical power to detect these differences is linked to sample size and variation in the population: the larger the sample size and the more homogeneous the population, the more likely a measure is to detect change.
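The effect of sample size on a distribution-based index can be illustrated with a minimal sketch. It assumes a common large-sample approximation for the standard error of a standardized mean change, SE ≈ sqrt(1/n + SRM²/(2n)); this formula, the function name `srm_ci`, and the example sample sizes are illustrative assumptions, and the article does not state how its own confidence intervals were derived.

```python
import math


def srm_ci(srm, n, z=1.96):
    """Approximate 95% CI for an SRM from n paired observations.

    Uses a large-sample standard error approximation (an assumption,
    not a method stated in the article).
    """
    se = math.sqrt(1.0 / n + srm ** 2 / (2.0 * n))
    return srm - z * se, srm + z * se


# The same SRM of 0.45 evaluated at three sample sizes:
# the interval narrows as n grows.
for n in (30, 116, 500):
    low, high = srm_ci(0.45, n)
    print(f"n = {n:3d}: approx. 95% CI ({low:.2f}, {high:.2f})")
```

With n = 116 this approximation gives roughly (0.26, 0.64), which is close to the interval reported for mean handgrip strength, although that agreement should not be read as confirmation of the method the authors used.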

Debate exists over whether each index of responsiveness provides similar rank ordering. Research into hip replacement surgery outcomes suggests not (3), whereas in lumbar spine surgery, each index produced similarly ordered ranking (5). There is little research into the comparative responsiveness of rheumatology outcome measures. This provided the rationale for conducting this analysis.

This study reports the comparative responsiveness of commonly used therapist-rated and patient self-report upper extremity structural impairment and functional outcome measures in an early rheumatoid arthritis (RA) population recruited into a rehabilitation trial. The baseline characteristics of this sample and its capacity to benefit were representative of a wider population of early RA patients. These data have been published elsewhere (6). All of the patients were receiving full secondary care intervention, and data from similar published patient cohorts indicated that during the first 12 months of specialist care, mean group upper extremity function improves (7). The outcomes considered in this analysis were all related to upper extremity function, and it was hypothesized that they would show an improvement over the followup period.

Materials and Methods

Outcome measures were identified from a critical review and were selected if they 1) were reported in published rheumatology research studies within the past 15 years, 2) were published in English, and 3) addressed relevant World Health Organization International Classification of Functioning, Disability, and Health domains of either upper extremity body functions and structures, functional activity, or social participation. Three self-report functional ability measures, 5 structural impairment measures, and 3 therapist-rated functional measures were compared.

Multi-Regional Ethics Committee approval was obtained, and 116 patients with a diagnosis of RA of less than 5 years' duration were recruited from 8 occupational therapy departments across the southern UK. One independent, experienced assessor (JA) measured outcomes at baseline and 12 months using standardized protocols recording 1) self-report functional outcome, 2) therapist-rated structural impairment measures, and 3) therapist-rated functional measures.

Self-report functional outcome was measured with the Health Assessment Questionnaire (HAQ) (8), the Disabilities of the Arm, Shoulder, and Hand questionnaire (DASH) (9), and the Michigan Hand Outcomes Questionnaire (MHQ) (10). The HAQ measures functional ability across dressing/grooming, rising, eating, walking, hygiene, reach, grip, and domestic activities. Four-point ordinal responses are scored for 20 items, from 0 = without any difficulty to 3 = unable to do. It takes approximately 10 minutes to complete and 2 minutes to score, and the summary score ranges from 0 to 3. The DASH is a 30-item questionnaire including 21 physical function items, 6 symptom items, and 3 social role function items. A 6-point ordinal scale grades perceived difficulty for each task, taking approximately 10 minutes to complete and 5 minutes to score. Scores range from 0 to 100. The MHQ assesses overall hand function, activities of daily living, pain, work performance, aesthetics, and satisfaction with hand function using a 5-point ordinal scoring system. The questionnaire takes 10 minutes to complete and 12 minutes to score. Scores range from 0 to 100.

Therapist-rated structural impairment measures include 1) the dominant and nondominant Total Active Motion (TAM) of the wrist and individual digits (11), which assesses range of motion and sums total active flexion and extension of the wrist; 2) the hand component of the Signals of Functional Impairment (SOFI) (12), which records the ability to grip a 6–8-cm tube, grip all fingers around a pencil, perform a round pincer grip between the thumb and index finger, and oppose the thumb to the base of the fifth finger; each is graded on a 3-point ordinal scale, where 0 = able to fully complete, 1 = able to partially complete, and 2 = unable to complete; 3) the total summary score of dominant and nondominant metacarpophalangeal (MCP) joint ulnar deviation, measured using a 360° goniometer; and 4) the dominant and nondominant 28-joint Ritchie score (13).

Therapist-rated functional measures include 1) the dominant and nondominant mean power handgrip strength, measured in newtons using the MIE Digital Grip analyzer (MIE, Leeds, UK); 2) the Grip Ability Test (GAT) (14), which takes approximately 3 minutes to complete and 2 minutes to score, and consists of placing 25 cm of Tubigrip (Mölnlycke Health Care, Bedfordshire, UK) onto the nondominant hand, placing a 30 × 10–mm metal paperclip onto an 11.5 × 16–cm envelope, and pouring 200 ml of water into a cup from a 1-liter jug containing 1,000 ml of water; a summary score is produced; and 3) the applied strength (lifting tins and pouring water from a 2,000-ml jug) and applied dexterity (button board and 9-hole peg board) tasks from the Arthritis Hand Function Test (AHFT) (15). These functional tests produce a nominal score for lifting tins and pouring water and a timed summary score for the button and 9-hole peg boards. The tests take 2 minutes to complete and 30 seconds to score.

Baseline and 12-month results were compared using the standardized response mean (SRM; calculated by dividing the mean change score by the SD of the change scores) and the effect size (ES; calculated by dividing the mean change score by the SD of the baseline scores). Larger absolute values indicate greater responsiveness to change and better discrimination; a negative value indicates that the mean followup score was lower than the mean baseline score.
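The two indices can be sketched in a few lines of code. This is a minimal illustration: it assumes that change is computed as followup minus baseline and that the sample SD (n − 1 denominator) is used, neither of which is stated explicitly in the text. The function name `responsiveness` and the scores are hypothetical and are not study data.

```python
from statistics import mean, stdev


def responsiveness(baseline, followup):
    """Return (SRM, ES) for paired baseline and followup scores.

    Change is taken as followup minus baseline (an assumption), and
    stdev() is the sample SD with an n - 1 denominator (an assumption).
    """
    if len(baseline) != len(followup):
        raise ValueError("baseline and followup scores must be paired")
    changes = [f - b for b, f in zip(baseline, followup)]
    mean_change = mean(changes)
    srm = mean_change / stdev(changes)   # mean change / SD of change
    es = mean_change / stdev(baseline)   # mean change / SD of baseline
    return srm, es


# Hypothetical scores for 5 participants on a 0-100 scale (not study data).
baseline = [40.0, 55.0, 60.0, 45.0, 50.0]
followup = [50.0, 60.0, 58.0, 55.0, 57.0]
srm, es = responsiveness(baseline, followup)
print(f"SRM = {srm:.2f}, ES = {es:.2f}")
```

Because the two indices share a numerator and differ only in the denominator, a measure can rank differently on each when the variability of change differs markedly from the variability at baseline.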


Results

Eighty-four women and 32 men were recruited, with a mean ± SD age of 57.3 ± 13.7 years. Participants had been diagnosed with RA for a mean ± SD of 10.6 ± 11.4 months and were receiving a mean ± SD of 1.3 ± 0.6 disease-modifying antirheumatic drugs; the baseline median HAQ disability score was 1.1 (interquartile range 0.6–1.8).

There were no losses to followup for the therapist-rated measures over the 12 months. One hundred seven participants (92%) completed the HAQ, 80 participants (69%) completed the MHQ, and 104 participants (90%) completed the DASH with sufficient data to permit change data to be calculated. There were no substantial differences in baseline profiles between participants who did and did not complete questionnaires.

Table 1 shows the summary of responsiveness analysis of self-report functional ability measures. Table 2 shows therapist-assessed structural impairment measures. Only the most responsive TAM score of the wrist and hand (i.e., the little finger) has been included here. Table 3 shows the summary of therapist-assessed functional measures.

Table 1. Summary of the responsiveness of self-report upper extremity–related function measures from 0 to 12 months*

| Measure (score range) | Mean ± SD change | Range | SRM (95% CI) | ES (95% CI) |
|---|---|---|---|---|
| HAQ (0–3) | −0.06 ± 0.48 | −1.75 to 0.88 | −0.12 (−0.31, 0.08) | −0.08 (−0.21, 0.05) |
| DASH (0–100) | −4.48 ± 14.23 | −45.00 to 24.16 | −0.31 (−0.51, −0.12) | −0.21 (−0.34, −0.08) |
| MHQ (0–100) | 7.03 ± 14.28 | −23.21 to 48.00 | 0.49 (0.27, 0.72) | 0.37 (0.21, 0.54) |

* HAQ = Health Assessment Questionnaire; DASH = Disabilities of the Arm, Shoulder, and Hand questionnaire; MHQ = Michigan Hand Outcomes Questionnaire; SRM = standardized response mean; 95% CI = 95% confidence interval; ES = effect size.

Table 2. Summary of responsiveness of therapist-assessed structural impairment from 0 to 12 months*

| Measure | Mean ± SD change | Range of change | SRM (95% CI) | ES (95% CI) |
|---|---|---|---|---|
| Ritchie score, dominant | −1.42 ± 6.00 | −32 to 30 | −0.24 (−0.43, −0.05) | −0.19 (−0.34, −0.04) |
| Ritchie score, nondominant | −1.97 ± 6.47 | −32 to 18 | −0.30 (−0.50, −0.11) | −0.25 (−0.41, −0.10) |
| Ulnar deviation, dominant | 11.61 ± 25.14 | −46 to 198 | 0.46 (0.27, 0.65) | 0.58 (0.34, 0.82) |
| Ulnar deviation, nondominant | 4.52 ± 10.18 | −24 to 59 | 0.44 (0.25, 0.63) | 0.23 (0.13, 0.32) |
| SOFI, dominant | −0.17 ± 1.35 | −6 to 6 | −0.13 (−0.32, 0.06) | −0.10 (−0.25, 0.05) |
| SOFI, nondominant | −0.26 ± 1.30 | −6 to 3 | −0.20 (−0.39, −0.02) | −0.16 (−0.31, −0.01) |
| TAM little finger, dominant | 11.22 ± 29.04 | −72 to 122 | 0.39 (0.20, 0.58) | 0.26 (0.13, 0.39) |
| TAM little finger, nondominant | 9.02 ± 33.29 | −88 to 138 | 0.28 (0.09, 0.47) | 0.22 (0.07, 0.37) |

* SOFI = Signals of Functional Impairment; TAM = Total Active Motion; SRM = standardized response mean; 95% CI = 95% confidence interval; ES = effect size.

Table 3. Summary of responsiveness of therapist-assessed functional ability from 0 to 12 months*

| Measure | Mean ± SD change | Range of change | SRM (95% CI) | ES (95% CI) |
|---|---|---|---|---|
| Mean handgrip strength, newtons, dominant | 26.15 ± 57.87 | −118.00 to 289.00 | 0.45 (0.26, 0.64) | 0.32 (0.18, 0.45) |
| Mean handgrip strength, newtons, nondominant | 21.96 ± 54.50 | −141.00 to 165.00 | 0.40 (0.21, 0.59) | 0.27 (0.14, 0.40) |
| Grip Ability Test | −7.45 ± 26.17 | −112.86 to 105.07 | −0.28 (−0.47, −0.10) | −0.18 (−0.30, −0.06) |
| AHFT applied dexterity: peg board, dominant | −2.23 ± 7.69 | −37.28 to 19.61 | −0.29 (−0.48, −0.10) | −0.20 (−0.33, −0.07) |
| AHFT applied dexterity: peg board, nondominant | −1.30 ± 4.83 | −19.94 to 13.50 | −0.27 (−0.46, −0.08) | −0.10 (−0.17, −0.03) |
| AHFT applied dexterity: button board | −22.70 ± 108.77 | −904.76 to 44.35 | −0.21 (−0.40, −0.02) | −0.11 (−0.21, −0.01) |
| AHFT applied strength: lifting tins task | 0.25 ± 1.57 | −7.00 to 9.00 | 0.16 (−0.03, 0.35) | 0.09 (−0.02, 0.19) |
| AHFT applied strength: pouring task | 79.87 ± 381.45 | −1,400.00 to 1,500.00 | 0.21 (0.02, 0.40) | 0.13 (0.01, 0.25) |

* AHFT = Arthritis Hand Function Test; SRM = standardized response mean; 95% CI = 95% confidence interval; ES = effect size.

Over 12 months, the most responsive self-report upper extremity measure was the MHQ, and the HAQ was the least responsive. Of the 4 therapist-rated impairment outcome measures, the summary score of dominant hand MCP joint ulnar deviation was the most responsive, and the wrist and hand components of the SOFI were the least responsive. Mean handgrip strength was the most responsive therapist-rated functional measure, and the applied strength task from the AHFT was the least responsive to change.

Negative SRM and ES results (indicating an improvement of ability) were obtained for the HAQ, the DASH, the 28-joint Ritchie score, and the timed outcome measures (the GAT and AHFT applied dexterity tasks).
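As a worked check of the sign convention, the SRMs for the self-report measures can be approximately recovered from the rounded mean change and SD of change reported in Table 1. Small discrepancies from the published values are to be expected, because the inputs used here are already rounded.

```latex
\begin{align*}
\mathrm{SRM} &= \frac{\text{mean change}}{\text{SD of change}} \\[4pt]
\mathrm{SRM}_{\mathrm{HAQ}}  &= \frac{-0.06}{0.48} = -0.125
  && \text{(reported: } -0.12\text{)} \\
\mathrm{SRM}_{\mathrm{DASH}} &= \frac{-4.48}{14.23} \approx -0.31
  && \text{(reported: } -0.31\text{)} \\
\mathrm{SRM}_{\mathrm{MHQ}}  &= \frac{7.03}{14.28} \approx 0.49
  && \text{(reported: } 0.49\text{)}
\end{align*}
```

The HAQ and DASH are scored so that lower values mean less disability, so their negative SRMs reflect improvement, whereas the MHQ is scored in the opposite direction and its positive SRM also reflects improvement.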


Discussion

Published data on the responsiveness of outcome measures are rare, yet the ability of an outcome measure to detect and measure change is important for any longitudinal intervention program. This study compared the responsiveness of 13 upper extremity–related outcome measures. Responsiveness varied considerably. The most responsive measures were those that used interval scales and the least responsive were those that used 3- and 4-point ordinal rating scores across fewer items (for example, the HAQ and the SOFI). Although these ordinal measures are likely to be less time consuming for patients to complete and easier to score, they may provide insufficient options to register meaningful change over the 12 months. Interval scales have the potential to measure change in finer units and in this analysis were more responsive.

Even those outcome measures that have been specifically designed for this patient population, for example, the GAT, showed lower responsiveness levels to change than generalized measures such as the MHQ. Those measures that have published accounts of patient and expert clinician involvement in their validation, e.g., the DASH, MHQ, and AHFT, were more responsive than those that did not.

A single expert assessor collected study data. Arguably, these measurements may have less measurement error compared with data collected by a number of different assessors. As such, these results may represent the best-case scenario. These results reflect a population- and situation-dependent scenario; thus, they may not be transferable to other populations undergoing different treatments or at different disease stages.

It would have been valuable to ask participants to complete a visual analog scale indicating their perceptions of the clinical significance of the changes in their hand function. This would have provided an additional reference point that could help define clinically meaningful change.

Over 12 months of followup in an early RA population, the most responsive structural impairment measure, therapist-assessed functional measure, and self-report functional measure were the dominant MCP joint ulnar deviation score, mean handgrip strength measured in newtons, and the MHQ, respectively. Ordinal outcome measures were less responsive than measures using continuous interval scales in this population.


All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Adams had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Adams, Burridge, Hammond, Cooper.

Acquisition of data. Adams, Burridge.

Analysis and interpretation of data. Adams, Mullee, Burridge, Hammond, Cooper.


We would like to thank Dr. Paula Kersten and the reviewers for constructive comments on this article.