In this issue of Arthritis Care & Research, Taylor and McPherson report using a form of item-response theory (IRT) called Rasch analysis to evaluate the Health Assessment Questionnaire (HAQ) disability index and the Short Form 36 (SF-36) physical function score (1). Methods similar to Rasch analysis were used to develop scores for the Scholastic Aptitude Test (SAT), an evaluation of academic skills used in college admissions that is familiar to US readers. One of the underlying premises of IRT is that if you can answer a difficult question, you can also (probabilistically) answer all easier questions, and, therefore, questions can be ranked by difficulty. One might rank mathematical ability in the order addition, subtraction, multiplication, division, algebra, calculus, etc. In addition, Rasch analysis can assign a difficulty score to each item (question). With such information, the level of academic skill can be measured. Assessments such as the SAT have to be reliable. Reliability means that, for example, if you administer the SAT twice, perhaps a month apart, you will get very similar results. The agreement between the results of the 2 administrations is a measure of test reliability, and can be expressed as a correlation coefficient. The reliability of the SAT is approximately 0.90 (2).
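
As a minimal sketch of the ranking idea, the dichotomous Rasch model expresses the probability of a correct answer as a logistic function of the difference between a person's ability and an item's difficulty, both placed on the same logit scale. The item names and difficulty values below are hypothetical, chosen only to mirror the ordering described above; they are not estimates from the SAT or from any real test.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical item difficulties in logits (illustrative values only).
ITEM_DIFFICULTY = {
    "addition": -2.0,
    "multiplication": -1.0,
    "algebra": 0.5,
    "calculus": 2.0,
}


def success_profile(ability: float) -> dict:
    """Map each hypothetical item to the probability of answering it correctly."""
    return {
        item: rasch_probability(ability, difficulty)
        for item, difficulty in ITEM_DIFFICULTY.items()
    }


if __name__ == "__main__":
    # A person whose ability equals the difficulty of "algebra" has an even
    # chance on that item, better odds on easier items, worse on harder ones.
    for item, probability in success_profile(0.5).items():
        print(f"{item:15s} {probability:.2f}")
```

The model is probabilistic: a person who can usually answer the hardest item is very likely, but not certain, to answer the easier ones.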

It is useful to think of disability assessment as being similar to academic skills assessment. Using Rasch analysis, one can assign difficulty scores to individual disability assessment questions (3, 4). For example, it is easier to lift a cup to your mouth or walk on flat ground than it is to do outside work or walk 2 miles. If the disability scale is constructed correctly, disability can be measured on a linear continuous scale, just as one would use a thermometer to measure body temperature.

Practically, however, questionnaires (and we extend this discussion to visual analog scales for pain, etc.) deviate significantly from the ideal. Although the SAT actually tests academic ability, clinical questionnaires test perception of functional ability or pain. Given the same painful stimulus, patients will differ in their assessments of the severity of pain. Similarly, people with the same level of physical ability will differ in their assessment of their physical ability.

Therefore, the first problem with clinical assessments is that there is no absolute standard. However, we can assume that each person has his or her own innate, internal standard, and that standard is usually not too distant from the mean of all patients. Clinicians know that some patients are high reporters of pain and disability while others are low reporters, and they make mental adjustments based on such facts. This wavering standard, while of considerable importance, defines interpretability, but does not have anything to do with reliability. Interpretability is a keystone of clinical evaluations, but not of clinical trials or blind evaluation of scores.

The second problem with questionnaire assessments is that they most often have poor reliability. This is not only related to HAQ scores (r = ∼0.85) or pain scores (r < 0.7–0.8), but also to physician's global, joint tenderness and swelling counts (r < 0.8), and to the Disease Activity Score (DAS) (r = ∼0.8) (5–7). From reliability we can estimate the minimal detectable change (MDC), also called the reliable change or the smallest real difference (8). For the technically minded, $\mathrm{MDC} = 1.96 \times \sqrt{2} \times \mathrm{SEM}$, where $\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - r}$ is the standard error of measurement. The consequence of this is that, given 2 measurements in an individual patient, we can only say with confidence that differences between 2 assessments that are equal to or exceed ∼0.75 for HAQ, ∼2.0 for DAS, and ∼3.5 for VAS pain are statistically significant. The reason that the necessary differences are so great is that the uncertainty (reliability estimate) is applied not to the change score but to each of the 2 test measurements.
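
The arithmetic can be sketched as follows, assuming the usual form of the calculation: SEM = SD × √(1 − r) and MDC = 1.96 × √2 × SEM, where the √2 reflects measurement error in each of the 2 assessments. The reliability of 0.85 is the HAQ value quoted above; the SD of 0.7 HAQ units is an assumed, illustrative value, not a figure taken from the cited studies.

```python
import math


def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """SEM = SD * sqrt(1 - r), where r is the test-retest reliability."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1.0 - reliability)


def minimal_detectable_change(sd: float, reliability: float, z: float = 1.96) -> float:
    """MDC = z * sqrt(2) * SEM; sqrt(2) accounts for error in both measurements."""
    return z * math.sqrt(2.0) * standard_error_of_measurement(sd, reliability)


if __name__ == "__main__":
    # r = 0.85 is the HAQ reliability quoted in the text.
    # SD = 0.7 is an ASSUMED between-patient SD, used only for illustration.
    mdc = minimal_detectable_change(sd=0.7, reliability=0.85)
    print(f"Illustrative HAQ MDC: {mdc:.2f}")
```

Under these assumed inputs the result is close to the ∼0.75 quoted for the HAQ; a different SD would shift it proportionally.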

There are 3 settings in which reliability is important: clinical trials in groups of patients, blind assessments in individual patients, and informed assessments in clinic patients. In clinical trials, low reliability can be overcome by increasing sample size. Blind assessments occur when one tries to interpret 2 scores (e.g., before and after treatment) without the use of additional information. Such assessments are made by insurance companies, third-party providers, and regulatory authorities, among others, to see whether the patient has improved or gotten worse, or whether the clinician is doing a good job. However, there is too much variability in clinical measures for blind assessments to be useful or valid. Rheumatologists should resist blind-assessment evaluations on the grounds that they represent very poor science.
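
To illustrate how sample size can compensate for low reliability in a trial, the sketch below uses classical test theory, under which measurement error shrinks a standardized effect by a factor of √r, together with the normal-approximation sample size formula for comparing 2 group means. The function name, the effect size of 0.5, and the choice of 80% power are assumptions made for illustration only.

```python
import math
from statistics import NormalDist


def n_per_group(true_effect_size: float, reliability: float,
                alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients per arm needed to detect a between-group difference.

    true_effect_size is the difference in means divided by the SD of
    error-free scores; measurement error attenuates it by sqrt(reliability).
    """
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    observed_effect_size = true_effect_size * math.sqrt(reliability)
    return math.ceil(2.0 * (z_alpha + z_beta) ** 2 / observed_effect_size ** 2)


if __name__ == "__main__":
    # Hypothetical true effect size of 0.5 at several reliability levels.
    for r in (1.0, 0.85, 0.70):
        print(f"reliability {r:.2f}: {n_per_group(0.5, r)} patients per group")
```

In this simplified setting the required sample size grows in proportion to 1/r, which is a cost a trial can absorb but an individual patient assessment cannot.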

It has been suggested, based on blind assessments, that HAQ scores vary too much to be useful for clinical decision making (9). This observation could be extended to the physician's swollen and tender joint examination, physician's global assessment, and the DAS, which are even more variable and less reliable than the HAQ (6, 10). However, no clinician uses such data out of context. The clinical examination makes use of informed assessments that, in a Bayesian sense, draw on additional information to gauge the precision of each clinical variable. For example, by using prior data and knowledge of prior variability (longitudinal analysis), agreement or disagreement with other variables, and additional questioning, the clinician can obtain a much better idea of the precision of the observed change than can be done with blind assessments. This additional information can dramatically increase the effective reliability of clinical observations.
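
One simple way to see why additional observations help is the Spearman-Brown formula, which gives the reliability of the average of k comparable measurements. This is a deliberately simplified stand-in for the clinician's informed assessment, not a Bayesian analysis: it assumes the prior measurements are parallel (same true score, independent errors), which real longitudinal data will satisfy only approximately.

```python
def spearman_brown(reliability: float, k: int) -> float:
    """Reliability of the mean of k parallel measurements (Spearman-Brown)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    return k * reliability / (1.0 + (k - 1) * reliability)


if __name__ == "__main__":
    # Single-measurement reliability of 0.85 is the HAQ value quoted earlier;
    # the numbers of prior observations are hypothetical.
    for k in (1, 2, 4):
        print(f"{k} measurement(s): reliability {spearman_brown(0.85, k):.3f}")
```

Even a few prior observations raise the effective reliability noticeably, which is consistent with the argument that context makes individual scores far more interpretable.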

Taylor and McPherson's study raised some additional issues. They presented evidence that the SF-36 physical function scale was a “better” scale than the HAQ for the evaluation of patients with psoriatic arthritis (PsA). Based on separation statistics, which are related to reliability, the SF-36 physical function scale was far more reliable than the HAQ in patients with PsA. Moreover, Taylor and McPherson showed that the HAQ was biased (inaccurate). In their study, 3.1% of people with PsA had the lowest possible score (floor effect) on the SF-36 physical function scale compared with 30.4% on the HAQ. The HAQ simply did not handle low levels of disability as well as the SF-36. Not only did the 2 scales give different scores (even when standardized), but the disability score could also differ according to illness, even when ability was the same.

These observations suggest, although they require confirmation, that the HAQ is not a good questionnaire to use in patients with PsA. James Fries, inventor of the HAQ, characterized the HAQ as a “disability” instrument and the SF-36 scale as an “ability” scale. The distinction is apt, as is illustrated in Figure 1 of Taylor and McPherson's article and in an additional study by Wolfe et al, which used a larger and more generalizable sample (n = 8,931) of patients with rheumatoid arthritis (RA) (11). In RA, however, at the levels of HAQ usually observed in the clinic, the HAQ was at least as good as the SF-36 physical function scale.

The study by Taylor and McPherson should remind us that measurement counts. Good assessment tools are reliable and accurate, for it is not only change that we are interested in, but status (i.e., “How is the patient doing?”). Clinical trials, with their emphasis on change, de-emphasize this quality, a quality that is central to clinical care. We continue to recommend the HAQ to clinicians based on its many virtues: a huge body of normative data, ease of scoring and suitability for use in the clinic, and detailed and well-documented associations with long-term outcomes such as mortality and work disability. If the HAQ is to be used in patients with PsA, particularly limited joint PsA, it needs to be interpreted carefully, and with full knowledge of its limitations.


1. Taylor WJ, McPherson K. Using Rasch analysis to compare the psychometric properties of the Short Form 36 Physical Function score and the Health Assessment Questionnaire Disability Index in people with psoriatic arthritis and rheumatoid arthritis. Arthritis Rheum 2007; 57: 723–9.
2. The College Board. Test characteristics of the SAT Reasoning Test: reliability, difficulty levels, completion rates (March 2005–June 2006). URL:
3. Wolfe F, Michaud K, Pincus T. Development and validation of the health assessment questionnaire. II. A revised version of the health assessment questionnaire. Arthritis Rheum 2004; 50: 3296–305.
4. Wolfe F. Which HAQ is best? A comparison of the HAQ, MHAQ and RA-HAQ, a difficult 8 item HAQ (DHAQ), and a rescored 20 item HAQ (HAQ20): analyses in 2,491 rheumatoid arthritis patients following leflunomide initiation. J Rheumatol 2001; 28: 982–9.
5. RUNMC. DAS-Score.NL. URL:
6. Lassere MN, van der Heijde D, Johnson KR, Boers M, Edmonds J. Reliability of measures of disease activity and disease damage in rheumatoid arthritis: implications for smallest detectable difference, minimal clinically important difference, and analysis of treatment effects in randomized controlled trials. J Rheumatol 2001; 28: 892–903.
7. Kvien TK, Uhlig T, Mowinckel P, Pincus T. Test-retest reliability of DAS-28 and other standard assessment tools in patients with rheumatoid arthritis: defining cut-offs for changes exceeding the measurement error [abstract]. Ann Rheum Dis 2007; 64: 213.
8. Schmitt JS, Di Fabio RP. Reliable change and minimum important difference (MID) proportions facilitated group responsiveness comparisons using individual threshold criteria. J Clin Epidemiol 2004; 57: 1008–18.
9. Greenwood MC, Doyle DV, Ensor M. Does the Stanford Health Assessment Questionnaire have potential as a monitoring tool for subjects with rheumatoid arthritis? Ann Rheum Dis 2001; 60: 344–8.
10. Wolfe F, Pincus T, Fries JF. Usefulness of the HAQ in the clinic [letter]. Ann Rheum Dis 2001; 60: 811.
11. Wolfe F, Michaud K, Strand V. Expanding the definition of clinical differences: from minimally clinically important differences to really important differences: analyses in 8,931 patients with rheumatoid arthritis. J Rheumatol 2005; 32: 583–9.