Development and validation of the health assessment questionnaire II: A revised version of the health assessment questionnaire




The Health Assessment Questionnaire (HAQ) has become the most common tool for measuring functional status in rheumatology. However, the HAQ is long (34 questions, including 20 concerning activities of daily living and 14 relating to the use of aids and devices) and somewhat burdensome to score, has some floor effects, and has psychometric problems relating to linearity and confusing items. We undertook this study to develop and validate a revised version of the HAQ (the HAQ-II).


We developed the 10-item HAQ-II using Rasch analysis and a 31-question item bank that included 20 HAQ items. Five original items from the HAQ were retained. We studied the HAQ-II in 14,038 patients with rheumatic disease over a 2-year period to determine its validity and reliability.


The HAQ-II was reliable (reliability of 0.88, compared with 0.83 for the HAQ), measured disability over a longer scale than the HAQ, and had no nonfitting items and no gaps. Compared with the HAQ, modified HAQ, and Medical Outcomes Study Short Form 36 physical function scale, the HAQ-II was as well correlated or better correlated with clinical and outcome variables. The HAQ-II performed as well as the HAQ in a clinical trial and in prediction of mortality and work disability. The mean difference between the HAQ and HAQ-II scores was 0.02 units.


The HAQ-II is a reliable and valid 10-item questionnaire that performs at least as well as the HAQ and is simpler to administer and score. Conversion from HAQ to HAQ-II and from HAQ-II to HAQ for research purposes is simple and reliable. The HAQ-II can be used in all places where the HAQ is now used, and it may prove to be easier to use in the clinic.

The Health Assessment Questionnaire (HAQ) is the most important and widely used functional status questionnaire in rheumatology. Developed by Fries et al in 1980 (1, 2), it is used in most clinical trials and observational outcome studies (3), and it has been translated into most languages in the industrialized countries (3, 4).

The HAQ is the best predictor of mortality (5), work disability (6), joint replacement (7), and medical costs (8). It is effective in rheumatoid arthritis (RA), osteoarthritis (OA), and other rheumatic conditions. The US Food and Drug Administration accepts it as a measure for evaluation of the prevention of disability.

Despite its extraordinary success, there are reasons to consider its revision (9–12). The HAQ is long. It is composed of 20 questions concerning activities of daily living (ADLs) and 14 questions relating to the use of aids and devices. In addition, its scoring is not simple. Subsequently, a modified HAQ (M-HAQ) with 8 ADLs was developed to address the length and scoring problems (13). A further modification, the multidimensional HAQ (MD-HAQ), added more complex ADLs (14). Like the HAQ, the M-HAQ predicts important long-term outcomes (15–17).

The HAQ also has something of a “floor” problem, in that many persons with physical disability can have normal HAQ scores. In addition, the HAQ is not a linear scale; a 0.25 difference at one level of disability (e.g., a HAQ score of 0.50) may not mean the same as that at another level (e.g., a HAQ score of 1.75) (18). Previous analyses have also suggested that some of the individual questions are not being answered correctly or are being misunderstood by patients (9).

Given the track record of the HAQ and its modified versions, the development of a new version should not be undertaken lightly. A new questionnaire should not only be shorter, and better on a theoretical basis, but it must also be shown to be at least as good as the original HAQ in terms of construct validity, discriminant validity, predictive validity, and reliability. In addition, it must have mean scores that are similar to those produced by the HAQ so that there can be interconversion of the questionnaires. In this report, we describe validation studies of a revised HAQ, the HAQ-II, that was developed using an item bank and Rasch analysis, an item response theory model for measurement (11, 19–28).


Development of the HAQ-II.

For background in understanding the development of the HAQ-II that preceded this report, we present the following information. In January 2001, the National Data Bank for Rheumatic Diseases (NDB) mailed surveys to participants in its long-term outcomes studies, as previously described (8, 29). In addition to the standard HAQ that was mailed to all participants, one-third of participants received 1 of 3 test questionnaires. All of the test questionnaires contained 31 test questions and none of the questions on aids and devices that are present as modifiers for the HAQ. The questions were as follows:

Are you able to:

  • Walk a mile?

  • Walk 2 or more miles?

  • Go up a flight of stairs?

  • Go up 2 or more flights of stairs?

  • Open a previously unopened jar?

  • Vacuum in the house?

  • Do outside work (such as yard work)?

  • Wait in a line for 15 minutes?

  • Lift heavy objects?

  • Move heavy objects?

  • Change the bedding?

  • Dress yourself, including shoelaces and buttons?

  • Shampoo your hair?

  • Stand up from a straight chair?

  • Get in and out of bed?

  • Cut your meat?

  • Lift a full cup or glass to your mouth?

  • Open a new milk carton?

  • Walk outdoors on flat ground?

  • Climb up 5 steps?

  • Wash and dry your body?

  • Take a tub bath?

  • Get on and off the toilet?

  • Reach and get down a 5-pound object (such as a bag of sugar) from just above your head?

  • Bend down to pick up clothing from the floor?

  • Run errands and shop?

  • Get in and out of a car?

  • Do chores such as vacuuming or yard work?

  • Open car doors?

  • Open jars which have been previously opened?

  • Turn faucets on and off?

The first test questionnaire gave 4 choices: 1) without any difficulty, 2) with some difficulty, 3) with much difficulty, and 4) unable to do. The second test questionnaire changed the wording slightly to 1) with no difficulty, 2) with a little difficulty, 3) with a lot of difficulty, and 4) unable to do. The third test questionnaire used only 3 categories headed by these instructions: “The following items are about activities you might do during a typical day. Does your health now limit you in these activities? If so, how much?” The categories were 1) yes, limited a little; 2) yes, limited a lot; and 3) no, not limited at all.

The 31 questions, which included 20 from the HAQ, were used as a test item bank. The first two formats were chosen to see whether discrimination between “much difficulty” and “unable” could be improved by changing wording, since previous analyses had shown problems in the discrimination between these response levels. The third format sought to solve the discrimination problem by eliminating one of the categories.

Selection of HAQ-II items.

The goal of the questionnaire development was to obtain a reliable, statistically valid, unidimensional scale that captured as much of the disability continuum as possible. Using Rasch analyses, we applied an iterative procedure that balanced 4 concerns: removal of misfitting items, maximization of scale length, elimination of items with overlapping difficulties, and removal of gaps along the disability–difficulty continuum. The alternative wordings did not improve the psychometric properties of the potential questionnaires, and they were discarded. The 10-item HAQ-II questionnaire (Table 1) that emerged from the analyses of the 31 items best balanced the concerns of item fit, scale length, and evenly spaced items. The HAQ-II contains 5 of the original HAQ questions and 5 new questions.

Table 1. Wording of the revised Health Assessment Questionnaire (HAQ-II) and its category percentages and item thresholds in 2,374 patients with rheumatoid arthritis*

We are interested in learning how your illness affects your ability to function in daily life. Place an X in the box which best describes your usual abilities over the past week. Are you able to:

| Item / category description (score) | Without any difficulty (0) | With some difficulty (1) | With much difficulty (2) | Unable (3) | Item threshold† |
| --- | --- | --- | --- | --- | --- |
| Get on and off the toilet? | | | | | |
| Open car doors? | | | | | |
| Stand up from a straight chair? | 47.8 | 45. | | | |
| Walk outdoors on flat ground? | 52.6 | 37. | | | |
| Wait in a line for 15 minutes? | 37.1 | 41.5 | 16.5 | 5.0 | 0.244 |
| Reach and get down a 5-pound object (such as a bag of sugar) from just above your head? | 32.1 | 45.3 | 12.7 | 9.9 | −0.241 |
| Go up 2 or more flights of stairs? | 19.6 | 39.8 | 29.8 | 10.9 | −0.876 |
| Do outside work (such as yard work)? | 13.3 | 45.6 | 22.0 | 19.1 | −1.452 |
| Lift heavy objects? | | | | | −2.172 |
| Move heavy objects? | 3.4 | 39.6 | 31.1 | 25.9 | −2.430 |

* Except where indicated otherwise, values are percentages of patients selecting category 0, 1, 2, or 3 for each item. The score of the questionnaire is the sum of the individual item scores divided by 10 or the mean of the item scores if 8 or 9 items are completed. The HAQ-II is not to be scored if fewer than 8 items are completed.

† Item thresholds are derived from Rasch analysis. The more negative the threshold, the more difficult it is to perform the task. See Patients and Methods for additional details.
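The scoring rule stated in the Table 1 footnote can be sketched in a few lines of Python. This is a minimal illustration rather than an official scoring program; the function name `score_haq2` and the use of `None` for an unanswered item are assumptions made for the example.

```python
from typing import Optional, Sequence


def score_haq2(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Score the 10-item HAQ-II using the rule in the Table 1 footnote.

    Each response is 0-3, or None if the item was left blank (assumed encoding).
    Returns None when fewer than 8 items were completed.
    """
    if len(responses) != 10:
        raise ValueError("the HAQ-II has exactly 10 items")
    answered = [r for r in responses if r is not None]
    if any(r not in (0, 1, 2, 3) for r in answered):
        raise ValueError("item scores must be 0, 1, 2, or 3")
    if len(answered) < 8:
        return None  # not scored with fewer than 8 completed items
    # With all 10 items this equals sum / 10; with 8 or 9 items it is the mean.
    return sum(answered) / len(answered)
```

For example, ten responses that sum to 13 give a score of 1.3, and a form with only 7 completed items is left unscored.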

In addition, we examined the existing HAQ questionnaires for scale length, reliability, misfitting items, and gaps in the scale. These analyses included 2,229 patients completing the HAQ, 7,289 completing the M-HAQ, 8,065 completing the MD-HAQ, and 2,374 completing the final version of the HAQ-II (Tables 1 and 2).

Table 2. Characteristics of versions of the Health Assessment Questionnaire (HAQ)*

| Scale (no. of patients analyzed) | Separation | Reliability | Misfitting items† | INFIT | OUTFIT | Reversed thresholds | Scale length, logits | Duplicated thresholds, %‡ | Item gaps§ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HAQ (2,229) | 2.33 | 0.83 | Hygiene | 1.30 | 1.54 | Hygiene, dressing¶ | 7.2 | 25 | 1 |
| M-HAQ (7,289) | 2.05 | 0.81 | Turn faucets | 1.21 | 1.34 | | 7.1 | 29 | 1 |
| MD-HAQ (8,065) | 2.41 | 0.85 | Participate in sports | 1.29 | 1.76 | Participate in sports; walk 2 miles | 9.3 | 30 | 2 |
| | | | Walk 2 miles | 1.25 | 1.39 | | | | |
| | | | Turn faucets | 1.20 | 1.34 | | | | |
| HAQ-II (2,374) | 2.75 | 0.88 | | None | None | | 10.0 | 17 | 0 |

* M-HAQ = modified HAQ; MD-HAQ = multidimensional HAQ; HAQ-II = revision of HAQ tested in the present study (see Patients and Methods for definitions of column headings not given below).

† Those with INFIT or OUTFIT statistic >1.3. The maximum misfit statistic for the HAQ-II was 1.17.

‡ Threshold locations occupied by ≥2 item thresholds. Values refer to the number of duplications divided by the total number of item thresholds.

§ Spaces between item threshold locations that are ≥1 logit in length.

¶ Reversed threshold for hygiene was due to the items “shampoo hair” and “take a tub bath.” HAQ refers to categories rather than items.

In the Rasch model of disability, functional ability is considered to lie upon a linear “ruler,” similar to an ordinary ruler, where no disability is the anchor at one end and maximum disability is the anchor at the other end. The range of disability is expressed in logits, a completely linear measure (Table 2). The longer the scale length is in logits, the better the scale is in representing disability. As suggested above, the scale is anchored at one end by the task that is easiest to do and at the other end by the task that is most difficult to do (Table 1, thresholds). An item (question) difficulty (threshold) represents the position in logits that the item occupies on the linear disability scale. Since each task is composed of 4 levels (0, 1, 2, 3), a 10-item questionnaire actually addresses 40 levels of difficulty. In the Rasch analyses, however, we used only 30 levels of difficulty, with each level of an item (HAQ question) representing the point or threshold where the probability is 0.5 of being in 0 as opposed to 1, 1 as opposed to 2, and 2 as opposed to 3. In a “perfect” scale, the 30 thresholds lie equidistant from each other on the disability continuum. In the current analyses, we define gaps in the disability continuum scale as spaces between item threshold locations that are ≥1 logit in length. Duplicated thresholds are defined as threshold locations occupied by ≥2 item thresholds. The presence of gaps, duplicated locations, and nonlinear spacing of thresholds correlates with decreased precision in the assessment of disability.
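The scale descriptors defined above (scale length, gaps of ≥1 logit, and duplicated thresholds) can be illustrated with a short Python sketch. The function `scale_summary` and the example thresholds are hypothetical. Two choices are assumptions that the text does not specify: scale length is taken as the distance between the extreme thresholds, and two thresholds are treated as occupying the same location when they agree after rounding to one decimal place.

```python
from collections import Counter
from typing import Dict, Sequence


def scale_summary(thresholds: Sequence[float], gap_size: float = 1.0,
                  decimals: int = 1) -> Dict[str, float]:
    """Summarize a set of item thresholds expressed in logits."""
    ordered = sorted(thresholds)
    # Scale length: distance between the extreme thresholds (assumed definition).
    length = ordered[-1] - ordered[0]
    # Gaps: spaces between neighbouring thresholds of at least `gap_size` logits.
    gaps = sum(1 for a, b in zip(ordered, ordered[1:]) if b - a >= gap_size)
    # Duplicates: thresholds sharing a rounded location with another threshold.
    # The rounding precision is an assumption made for this sketch.
    counts = Counter(round(t, decimals) for t in ordered)
    duplicated = sum(c - 1 for c in counts.values() if c >= 2)
    return {
        "length_logits": length,
        "gaps": gaps,
        "duplicated_pct": 100 * duplicated / len(ordered),
    }


# Hypothetical thresholds, not taken from the study data.
summary = scale_summary([-2.0, -1.5, -1.5, 0.0, 0.4, 2.0])
```

With these made-up values the scale spans 4 logits, contains 2 gaps, and has 1 duplicated location among its 6 thresholds.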

The overall reliability of a scale can be estimated by examining the person separation statistics and the person model reliability. Separation is the measure of spread in the test sample expressed in units of the test error (30). Reliability is the ratio of the true measure variance to the observed measure variance and is the same as Cronbach's alpha (31). Reliabilities of ≥0.85 are satisfactory. Rasch analysis produces two additional statistics. The mean square INFIT (INFIT) and mean square OUTFIT (OUTFIT) statistics are measures of “signal to noise” that allow one to determine how well an item or an individual level (0, 1, 2, 3) of an item fits the Rasch model (32). An item that has a high INFIT or OUTFIT statistic (>1.3) may not fit the model because it is “noisy” (not understood well or ambiguous) or is measuring a second dimension. For example, if a person was asked to evaluate his/her ability to perform tasks with which he/she had little current experience, the replies might be inaccurate measures of actual ability and would be identified as being “noisy” by the Rasch model. A further indication of this problem is often seen with reversed thresholds. A reversed threshold occurs in these analyses when being “unable” to do an activity (scored as “3” on the HAQ/HAQ-II) appears to be easier to do than doing the activity with “much difficulty” (scored as “2”).
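The two definitions above are linked algebraically. If separation G is the true spread divided by the measurement error, and reliability R is the true variance divided by the observed (true plus error) variance, then R = G² / (1 + G²). The following Python sketch shows the relationship; the function names are illustrative and are not part of any Rasch software.

```python
import math


def reliability_from_separation(separation: float) -> float:
    """R = G^2 / (1 + G^2), from R = true variance / (true + error variance)."""
    return separation ** 2 / (1 + separation ** 2)


def separation_from_reliability(reliability: float) -> float:
    """Inverse relationship: G = sqrt(R / (1 - R)), valid for 0 <= R < 1."""
    return math.sqrt(reliability / (1 - reliability))


def reliability_from_variances(observed_variance: float,
                               error_variance: float) -> float:
    """Ratio of true measure variance to observed measure variance."""
    true_variance = observed_variance - error_variance
    return true_variance / observed_variance
```

As a check, a separation of 2.75 (the HAQ-II value in Table 2) corresponds to a reliability of about 0.88.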

Although the HAQ has 20 items, these collapse into 8 categories after scoring. In the analyses shown in Table 2, we report results at the item level as well as at the category level.

Validation studies.

After development of the HAQ-II, the new questionnaire together with the HAQ was used in 4 consecutive biannual surveys mailed to participants in the NDB. The data from these assessments were then used in the validation studies that form the basis of this report. In addition to the HAQ and HAQ-II data, the NDB collects further data in its biannual, detailed 28-page questionnaire. At each assessment, demographic variables are recorded, including age, sex, ethnic origin, education level, current marital status, medical history, work status, and total family income. Data concerning disease status and activity variables were collected using the following instruments: the M-HAQ (13); visual analog pain, global disease severity, and fatigue scales (33); the Arthritis Impact Measurement Scales anxiety and depression scales (34, 35); the Rheumatology Distress Index; the Rheumatoid Arthritis Disease Activity Index (36–38); and the Work Limitations Questionnaire, a 25-item, self-administered questionnaire measuring the degree to which health problems interfere with ability to perform job roles (39). Patients also completed the Medical Outcomes Study Short Form 36 (SF-36), from which the physical function (PF) scale was calculated (40, 41). Utilities were measured using the EuroQol (42–44) and Short Form 6D (45). Analyses based on the above data were restricted to the 14,038 persons who completed all the HAQ, HAQ-II, M-HAQ, and PF scales.

A second set of data was available from 693 consecutive patients who were identified in 2003 from the clinical practices of 40 US and Canadian rheumatologists, the Rheumatoid Arthritis Evaluation Study (RAES) cohort. This data set was used to examine the correlation between the HAQ and HAQ-II and physical examination findings and laboratory and Disease Activity Score values (46).

The MD-HAQ (14) was not a part of the biannual NDB assessments and therefore was not included in the validation report. However, separate data from the NDB, where the MD-HAQ was collected as part of screening evaluations, were available for 15,543 patients. These data are presented briefly to describe the floor effects of this questionnaire.

The HAQ and HAQ-II questionnaires were also used simultaneously in an open-label clinical trial of 837 RA patients in community practice starting a new disease-modifying antirheumatic drug (DMARD). Pretest and posttest data from this study were analyzed after 3 months of therapy.

Statistical analysis.

Rasch analysis was performed using Winsteps version 3.31 (Winsteps, Chicago, IL) (47) and RUMM 2010 version 3.3 (RUMM Laboratory, Duncraig, Australia) (48). Validation analyses, including correlation analysis, generalized estimating equations (GEE), and linear and Cox regression analysis, were performed using Stata version 8.1 (Stata Corporation, College Station, TX) (49). Comparison of correlation coefficients was made using the Fisher z transformation. The t-test was used to compare the HAQ and the HAQ-II in the clinical trial. The Bland-Altman limits of agreement procedure was used to assess the 2-SD difference between questionnaires administered at the same time to the same patients (50).
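As an illustration of how the Fisher z transformation is used to compare two correlation coefficients, the sketch below implements the textbook test for correlations from independent samples. This is an assumption made for simplicity: in this study the correlations were computed in the same patients, so the exact procedure used may have differed. The function name `fisher_z_compare` is hypothetical.

```python
import math
from typing import Tuple


def fisher_z_compare(r1: float, n1: int, r2: float, n2: int) -> Tuple[float, float]:
    """Compare two correlations, assuming independent samples.

    Returns the z statistic and the two-sided P value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)  # Fisher z transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    # Two-sided P value from the standard normal distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# The same pair of correlations at two sample sizes.
z_large, p_large = fisher_z_compare(0.61, 14038, 0.58, 14038)
z_small, p_small = fisher_z_compare(0.61, 693, 0.58, 693)
```

Under this simplified test, a difference of 0.61 versus 0.58 is statistically significant with 14,038 observations per correlation but not with 693, which shows how a large sample can make a small difference in correlation reach significance.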


Questionnaire analysis.

Rasch analysis was used to categorize the 4 HAQ questionnaires (Tables 1 and 2). The most difficult task was “move heavy objects,” which had an item threshold of –2.430 (Table 1). At the other end of the spectrum, “get on/off toilet” was the easiest task to perform (threshold 2.254). Other items held intermediate positions. The differences among items in regard to their difficulty can also be seen in the percentages of patients selecting each category of a given item. These percentage results paralleled the Rasch item thresholds. The HAQ-II had the longest scale (Table 2), as measured in logits, indicating that it captured more of the continuum of disability than did the other questionnaires. The MD-HAQ also had a long scale, by virtue of the difficult items “participate in sports and games” and “walk 2 miles.” However, these items misfit the Rasch model, indicating a lack of unidimensionality and/or inaccurate assessment. The HAQ also had items that did not fit the Rasch model. Within the HAQ hygiene category, the items “take a tub bath” and “shampoo hair” misfit the model. This, in turn, led to the misfitting of the hygiene category. We also noted gaps in the scales of all the HAQ family questionnaires except for the HAQ-II. Duplicated thresholds were least common in the HAQ-II. These data indicate that the HAQ-II had the most favorable psychometric characteristics as measured by reliability, fit, scale length, reversed thresholds, and item gaps.

The floor (scores of 0) and ceiling (scores of 3) effects and the mean scores for the HAQ questionnaires that were studied in the validation sample are shown in Table 3. Data from the SF-36 PF scale are included for comparison. The HAQ-II had the least floor effect of the HAQ family of questionnaires (5.8%), as would be expected from its long logit length. The HAQ had a greater floor effect (10.1%), and the M-HAQ had the greatest floor effect (24.5%). Although data on the MD-HAQ were not available in the validation sample, 15,543 screening questionnaires using the MD-HAQ were available in the NDB. In this sample, 4.4% of patients were at the floor in the MD-HAQ.

Table 3. Mean scores and floor and ceiling effects for validation study questionnaires (n = 14,038 patients)*

| Scale | Score, mean ± SD | Patients at floor, % | Patients at ceiling, % |
| --- | --- | --- | --- |
| SF-36 PF | 47.08 ± | | |
| HAQ-II | 1.07 ± 0.66 | 5.8 | 0.1 |
| HAQ | 1.09 ± 0.72 | 10.1 | 0.2 |
| M-HAQ | 0.51 ± 0.49 | 24.5 | 0.2 |

* SF-36 PF = Medical Outcomes Study Short Form 36 physical function scale (see Table 2 for other definitions).
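Floor and ceiling effects of the kind reported in Table 3 are simply the percentages of patients scoring at the lowest and highest possible values. A minimal Python sketch follows, with a hypothetical function name and made-up scores.

```python
from typing import Sequence, Tuple


def floor_ceiling(scores: Sequence[float], minimum: float,
                  maximum: float) -> Tuple[float, float]:
    """Return (% of scores at the minimum, % of scores at the maximum)."""
    n = len(scores)
    if n == 0:
        raise ValueError("no scores supplied")
    at_floor = sum(1 for s in scores if s == minimum)
    at_ceiling = sum(1 for s in scores if s == maximum)
    return 100 * at_floor / n, 100 * at_ceiling / n


# Made-up HAQ-style scores on the 0-3 scale, not study data.
floor_pct, ceiling_pct = floor_ceiling([0, 0, 1.5, 3], minimum=0, maximum=3)
```

For the HAQ family of questionnaires the floor is a score of 0 and the ceiling a score of 3, as defined above.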

Results of validation studies.

Diagnostic groups.

There were 14,038 persons who completed all of the HAQ, HAQ-II, M-HAQ, and PF scales. Of these, 10,916 (77.8%) had RA, 2,478 (17.7%) had OA, and 644 (4.6%) had fibromyalgia.

Distribution characteristics.

The disability indexes differed in their distributions (Figure 1). The HAQ and HAQ-II tended toward normal distributions except at the floor (lowest levels). The PF scale was not normally distributed, with more values at the tails than might be expected. The M-HAQ floor effect was profound and contributed to its non-normality.

Figure 1.

Distribution characteristics of 4 functional status questionnaires in 14,038 patients with rheumatic disease. Floor effect is noted by the percentage of values at 0 for each questionnaire. Ceiling effect for the Medical Outcomes Study Short Form 36 (SF-36) physical function scale is indicated by the percentage of values at 100. Curved lines are superimposed normal distribution curves. HAQ = Health Assessment Questionnaire; HAQ-II = revision of HAQ tested in the present study; M-HAQ = modified HAQ.

As noted above, among the HAQ family of questionnaires, the floor effect was least for the HAQ-II (5.8%) and greatest for the M-HAQ (24.5%) (Table 3). The PF scale had combined ceiling and floor effects of 6.4%, compared with 5.9% for the HAQ-II. Thus, relative to the HAQ-II, the PF represented a shifting of the distribution curve to the right.

Correlates of functional status questionnaires.

The HAQ-II results were correlated with clinical and outcomes variables at levels similar to those of the HAQ, M-HAQ, and PF scale (Tables 4 and 5). For the 14,038-patient NDB data set (Table 4), HAQ-II correlations were greater than HAQ correlations in 15 of 16 instances. Compared with M-HAQ correlations, HAQ-II correlations were greater for 7 variables and less for 6 variables. Compared with PF scale correlations, HAQ-II correlations were greater for 14 variables, less for 1 variable, and equal for 1 variable. Although these differences achieved statistical significance owing to the large sample size, they were clinically insignificant. The results of the analyses should be considered to show no important difference between the questionnaires. Correlation levels were similar in a data set of serial patients from clinical practice—the RAES cohort (Table 5)—but with the smaller sample size (n = 693), there were no significant differences in the correlations among the questionnaires at the 0.05 probability level.

Table 4. Correlations between results of functional status questionnaires and clinical and outcome variables in the National Data Bank for Rheumatic Diseases Sample (n = 14,038 patients)*

| Variable | HAQ-II | HAQ | M-HAQ | SF-36 PF scale |
| --- | --- | --- | --- | --- |
| HAQ-II (0–3 scale) | 1.00 | 0.91 | 0.84 | −0.85 |
| HAQ (0–3 scale) | 0.91 | 1.00 | 0.86 | −0.80 |
| SF-36 PF scale (0–100) | −0.85 | −0.80 | −0.72 | 1.00 |
| M-HAQ (0–3 scale) | 0.84 | 0.86 | 1.00 | −0.72 |
| EuroQol utility (0–1 scale) | −0.67 | −0.64 | −0.69 | 0.62 |
| RADAI score (0–10) | 0.65 | 0.63 | 0.66 | −0.61 |
| Rheumatology Distress Index (0–100 scale) | 0.61 | 0.59 | 0.61 | −0.58 |
| Global disease severity (0–10 VAS) | 0.61 | 0.58 | 0.59 | −0.59 |
| Pain (0–10 VAS) | 0.61 | 0.59 | 0.61 | −0.57 |
| Fatigue (0–10 VAS) | 0.56 | 0.54 | 0.52 | −0.53 |
| SF-6D utility (0–1 scale) | −0.56 | −0.54 | −0.48 | 0.60 |
| Work Limitations Questionnaire index (0–100 scale) | 0.56 | 0.54 | 0.55 | −0.53 |
| QOL scale (0–100 VAS) | −0.54 | −0.51 | −0.52 | 0.53 |
| AIMS depression scale (0–10) | 0.44 | 0.42 | 0.47 | −0.42 |
| Sleep disturbance (0–10 scale) | 0.41 | 0.40 | 0.42 | −0.38 |
| AIMS anxiety scale (0–10) | 0.38 | 0.36 | 0.41 | −0.36 |
| Social Security disability, last 6 months (%) | 0.34 | 0.32 | 0.34 | −0.30 |
| GI severity (0–10 scale) | 0.33 | 0.31 | 0.34 | −0.30 |
| Total direct medical costs, $ | | | | −0.22 |
| Total joint replacement, % | | | | −0.18 |

* SF-36 PF = Medical Outcomes Study Short Form 36 physical function; RADAI = Rheumatoid Arthritis Disease Activity Index; VAS = visual analog scale; SF-6D = Short Form 6D; QOL = quality of life; AIMS = Arthritis Impact Measurement Scales; GI = gastrointestinal (see Table 2 for other definitions).

Table 5. Correlations between results of functional status questionnaires and clinical practice variables from the Rheumatoid Arthritis Evaluation Study (n = 693 patients)*

| Variable | HAQ-II | HAQ | M-HAQ |
| --- | --- | --- | --- |
| HAQ-II (0–3 scale) | 1.00 | 0.92 | 0.85 |
| HAQ (0–3 scale) | 0.92 | 1.00 | 0.84 |
| M-HAQ (0–3 scale) | 0.85 | 0.84 | 1.00 |
| Pain (0–10 VAS) | 0.66 | 0.66 | 0.67 |
| Patient's assessment of global disease severity (0–10 VAS) | 0.62 | 0.60 | 0.61 |
| Fatigue (0–10 VAS) | 0.57 | 0.56 | 0.55 |
| Physician's assessment of global disease severity (0–10 VAS) | 0.48 | 0.50 | 0.50 |
| Disability (stopped work) | 0.41 | 0.42 | 0.35 |
| Tender joint count (range 0–28) | 0.37 | 0.39 | 0.40 |
| ESR, mm/hour | 0.25 | 0.27 | 0.22 |
| Swollen joint count (range 0–28) | | | |
| Joint surgery, no/yes | 0.20 | 0.23 | 0.11 |

* VAS = visual analog scale; DAS28 = Disease Activity Score in 28 joints; ESR = erythrocyte sedimentation rate (see Table 2 for other definitions).

Correlates of functional status questionnaires for different diagnostic groups.

Correlations between questionnaire results and clinical and outcome variables were not significantly different for the different diagnostic groups (P > 0.05) (Table 6).

Table 6. Correlations between results of functional status questionnaires and clinical and outcome variables according to diagnostic group*

| Variable | RA (n = 10,916) | OA (n = 2,478) | Fibromyalgia (n = 644) |
| --- | --- | --- | --- |
| HAQ-II (0–3 scale) | | | |
| HAQ (0–3 scale) | 0.91 | 0.89 | 0.89 |
| SF-36 PF scale (0–100) | −0.86 | −0.84 | −0.82 |
| M-HAQ (0–3 scale) | 0.85 | 0.82 | 0.85 |
| EuroQol utility (0–1 scale) | −0.68 | −0.66 | −0.64 |
| RADAI score (0–10) | 0.66 | 0.65 | 0.64 |
| Rheumatology Distress Index (0–100 scale) | 0.62 | 0.61 | 0.56 |
| Pain (0–10 VAS) | 0.62 | 0.59 | 0.57 |
| Global disease severity (0–10 VAS) | 0.61 | 0.60 | 0.58 |
| SF-6D utility (0–1 scale) | −0.57 | −0.56 | −0.42 |
| Fatigue (0–10 VAS) | 0.57 | 0.58 | 0.48 |
| Work Limitations Questionnaire index (0–100 scale) | 0.55 | 0.55 | 0.49 |
| QOL scale (0–100 VAS) | −0.54 | −0.55 | −0.49 |
| AIMS depression scale (0–10) | 0.45 | 0.44 | 0.41 |
| Sleep disturbance (0–10 scale) | 0.42 | 0.40 | 0.35 |
| AIMS anxiety scale (0–10) | 0.39 | 0.38 | 0.35 |
| Social Security disability, last 6 months (%) | 0.34 | 0.30 | 0.38 |
| GI severity (0–10 scale) | 0.33 | 0.35 | 0.33 |
| Total direct medical costs, $ | | | |
| Total joint replacement, % | | | |

* RA = rheumatoid arthritis; OA = osteoarthritis (see Tables 2 and 4 for other definitions).

Clinical trial results: comparison of HAQ and HAQ-II.

To assess the ability of the HAQ and HAQ-II to perform in a clinical trial setting, 837 RA patients who received a DMARD over a 3-month period in an open-label clinical trial were studied. At the start of therapy, the HAQ score was 1.50 and the HAQ-II score was 1.41. Effect sizes were calculated for the before–after difference for the HAQ and HAQ-II. The effect size for the HAQ-II was 23.0 (95% confidence interval [95% CI] 18.4–27.4). The effect size for the HAQ was 24.8 (95% CI 20.0–29.5). These differences were not significant (P = 0.298).

Change in HAQ and HAQ-II scores over time.

Using a population-averaged model (GEE) restricted to RA patients who had completed both questionnaires (n = 10,494), the change in HAQ score per year of disease duration was 0.014 (95% CI 0.013–0.016, Wald χ² = 519.1). The equivalent value for the HAQ-II score was 0.012 (95% CI 0.011–0.013, Wald χ² = 423.7).

Predictive ability: mortality.

For 10,281 persons who completed more than one NDB questionnaire, Cox regression was used to estimate the ability of the HAQ and HAQ-II to predict future mortality. The hazard ratio (HR) for the HAQ was 2.28 (95% CI 1.89–2.75, likelihood ratio χ² = 77.2); the HR for the HAQ-II was 2.44 (95% CI 2.00–2.97, likelihood ratio χ² = 80.11).

Predictive ability: Social Security disability awards.

For 6,472 patients age <65 years who were not receiving US Social Security benefits at their first assessments, Cox regression was used to estimate the ability of the HAQ and HAQ-II to predict future Social Security benefits. The HR for the HAQ was 5.47 (95% CI 4.72–6.35, likelihood ratio χ² = 552.6); the HR for the HAQ-II was 6.06 (95% CI 5.19–7.07, likelihood ratio χ² = 549.7).

Conversion of HAQ and HAQ-II scales.

To understand how the HAQ-II might be substituted for the HAQ, and vice versa, we first graphed the relationship between the two variables using locally weighted scatterplot smoothing (lowess) regression and linear regression (Figure 2). Lowess regression will demonstrate the nonlinear aspects of the relationship between variables. Since the relationship shown in Figure 2 was essentially linear, we performed linear regression and described the relationship between variables by the regression intercept and coefficient. Based on the regression analyses of 14,038 observations, HAQ-II = 0.158 + 0.83 × HAQ and HAQ = 0.039 + 0.989 × HAQ-II (R² = 0.821). The M-HAQ and the PF scale differ too much from the HAQ and HAQ-II for useful conversions and are not described.

Figure 2.

Regression of HAQ on HAQ-II in 14,038 patients with rheumatic disease. Locally weighted scatterplot smoothing (lowess) regression indicates graphically the nonlinear aspects of the HAQ and HAQ-II relationship compared with linear regression. The lines are virtually superimposable, indicating validity of a linear predictive model. See Figure 1 for other definitions.
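The linear conversion between the two scales can be written as a pair of small Python functions. This is a sketch for converting group means in research data, not individual patient scores. The function names are illustrative, and the constants are transcribed from the regression equations, with the HAQ intercept entered as 0.039, the value under which the mean scores in Table 3 map onto each other.

```python
def haq_to_haq2(mean_haq: float) -> float:
    """Estimate a mean HAQ-II score from a mean HAQ score."""
    return 0.158 + 0.83 * mean_haq


def haq2_to_haq(mean_haq2: float) -> float:
    """Estimate a mean HAQ score from a mean HAQ-II score."""
    return 0.039 + 0.989 * mean_haq2


# Consistency checks against values reported in the text.
estimated_haq2 = haq_to_haq2(1.09)   # Table 3 mean HAQ
estimated_haq = haq2_to_haq(1.07)    # Table 3 mean HAQ-II
slope_product = 0.83 * 0.989         # equals R squared in simple regression
```

The two conversions reproduce the Table 3 means to within about 0.01 units, and the product of the two slopes is approximately 0.821, the reported R².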

The strong relationship between the HAQ and the HAQ-II allows reliable interconversion of research data from the HAQ to the HAQ-II and from the HAQ-II to the HAQ, although this should be confined to adjustment of means and, perhaps, to distribution-independent analyses such as median regression. However, individual patient data cannot be converted, since the level of agreement, even with a correlation coefficient >0.9, is not high enough. Although the difference between the HAQ and HAQ-II mean scores was only 0.02 units, and Lin's concordance correlation was 0.902, the Bland-Altman 95% limits of agreement values were –0.567 and 0.622. The distance between these values is too great for substitution of one functional measure for another in an individual patient.
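The Bland-Altman limits of agreement referred to above are the mean of the paired differences plus and minus a multiple of their standard deviation. The sketch below uses 1.96 SD; the function name, the direction of the difference (first measure minus second), and the example scores are assumptions for illustration only.

```python
import statistics
from typing import Sequence, Tuple


def bland_altman_limits(a: Sequence[float], b: Sequence[float],
                        k: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean difference, lower limit, upper limit) for paired scores."""
    if len(a) != len(b):
        raise ValueError("paired measurements must have equal length")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    return mean_diff, mean_diff - k * sd_diff, mean_diff + k * sd_diff


# Made-up paired scores on the 0-3 scale, not study data.
haq = [1.0, 1.5, 2.0, 0.5]
haq2 = [0.8, 1.6, 1.7, 0.6]
mean_diff, lower, upper = bland_altman_limits(haq, haq2)
```

Read this way, the reported limits of –0.567 and 0.622 correspond to a standard deviation of the individual differences of roughly 0.3 units, which is large relative to the 0 to 3 range of the scale.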


The validation results of this study suggest that the HAQ-II performs at least as well as the original HAQ. This should not be surprising, since 5 of the 10 HAQ-II items come directly from the HAQ. In addition, poorly fitting items of the HAQ were removed, and the overall item content of the HAQ-II was selected with careful attention to psychometric properties using Rasch analysis. Although the HAQ has 20 items (plus 14 aids and device modifiers), the method of scoring the HAQ reduces the questionnaire to 8 categories. In effect, the HAQ is an 8-item questionnaire, but one that gets some additional reliability from the redundancy of multiple questions in each category. Given the (de facto) 8-item HAQ and the 10-item HAQ-II, the HAQ-II, all things being equal, should perform as well as or better than the original HAQ.

The HAQ-II was developed using Rasch analysis and an item bank of questions in which each question has an intrinsic and measurable difficulty. For example, it is easier to walk on flat ground or get up from a chair than it is to walk up 2 flights of stairs or to walk 2 miles. If questions are selected properly, it is possible to select starting questions about actions that are very easy to do and to end with questions about actions that are very difficult to do. Each question, moreover, has sublevels of difficulty. Walking 2 miles can be done without difficulty, with some difficulty, with great difficulty, or not at all, and each level represents a separate measure of difficulty. Thus, a 10-item questionnaire can represent 4 × 10 separate levels of difficulty or 30 item thresholds. In developing a questionnaire, all of the levels must be considered. An ideal questionnaire would therefore space out the individual difficulties as evenly as possible. It is an axiom of proper questionnaire scaling that, on average, a person who can accomplish activities at a given level of difficulty can also accomplish all items that have lesser degrees of difficulty.

In addition to evenly spacing item difficulties, it is desirable to have a questionnaire that measures a long span of difficulties. It is relatively easy to capture the functional level of persons who are severely disabled (e.g., unable to walk or to arise), but it is much more difficult to measure items at the other end of the spectrum. That is the reason that floor effects are commonly seen in the HAQ series of questionnaires. The problem with questions at the floor end of the disability spectrum is that they often have to refer to activities that people do not often do or that are not necessarily a part of the unidimension of function as much as they are of dimensions such as the performance of athletic activities.

Rasch analysis provides statistical methods to identify items that do not “fit” the hypothesized unidimensional Rasch model or that are not answered accurately. The SF-36 PF scale, which otherwise has superb psychometric properties, has items that do not fit the Rasch model. Similarly, the MD-HAQ questions regarding participation in sports and walking 2 miles do not satisfy the fit criteria. In general, items that are not clearly understood or are not completed add noise (inaccuracy) to the measurement scale, since persons guess at their ability to perform these activities. A further example of this problem can be found in the HAQ question regarding bathing. Because many people use showers instead of bathtubs, arthritis patients' responses indicate that it is more “difficult” to take a bath “with difficulty” than it is to be “unable” to take a bath at all.

A questionnaire with evenly spaced, well-fitting items can provide a good measurement tool, much as a ruler can. However, if the integers on the ruler are not evenly spaced or tend to clump together, the ruler will be less useful as a measurement tool. Furthermore, it is possible to design a “perfect” scale and yet have a scale that is not clinically useful or that is insensitive to change. The validation studies of the HAQ-II show that it performs as well as the “gold standard” HAQ in identifying treatment effect and predicting important outcomes such as mortality or work disability. In addition, it is as strongly related to clinical and outcome variables as is the HAQ, or even more so.

The 10-item scale is easier than the HAQ to use and score in the clinic and in research studies. Because the scales are so closely allied (Figure 2) and have mean scores that differ by only 0.02 units, it is relatively easy to substitute one scale for the other. The very large sample size of this study (n = 14,038) provides assurance of the accuracy of the process of converting research data from the HAQ to the HAQ-II and vice versa.

Although we have indicated above that the HAQ and HAQ-II cannot be substituted for one another in individual patients, that warning applies only to contiguous observations, for example, a HAQ score at observation 2 followed by a HAQ-II score at observation 3. If the substitution is continued to observation 4, however, the new scale then has 2 observations of its own and can take over from the old one. As with all such changes, experience and thoughtful use of the questionnaire will allow substitution.

The structure of the HAQ-II may seem strange, since it does not use ADL categories. The HAQ places its 20 questions into 8 ADL categories. Each category has its own score, a score that is based only on the most abnormal answer in the category. Ideally, the overall HAQ score would be a measure of functional disability averaged over all of the ADL categories. One problem with this visualization is that categories would somehow have to be weighted, either to be equal in difficulty or to represent some known, expected weight or value for the category. However, there are no known weights, nor is there evidence that equality of categories is rational or correct. In practice, the situation is worse. The HAQ hygiene category, for example, has a Rasch difficulty of −0.82 compared with a difficulty of −0.68 for “activities” (9). Hygiene, which should be much easier than “activities,” is not, and is driven almost entirely by the very difficult “take a bath” question. It is therefore the case that the actual item difficulties, rather than their categorization, are what drive the HAQ score. The HAQ-II ignores ADL categorization, as does the SF-36, in order to build a psychometrically valid questionnaire. This may not be a loss, since it is difficult to express ADL category performance based on a single question within a category. Clinicians who require detailed information regarding specific categories or activities (e.g., hand function) should consider the use of activity- or area-specific questionnaires.
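The difference between category-based scoring and item-based scoring can be sketched as follows. This is a simplified illustration, assuming responses are coded 0 to 3, that the overall category-based score is the mean of the category scores, and that an item-based score is the mean of the item responses. It ignores the aids and devices modifiers and any missing-data rules. The function names and the example responses are hypothetical.

```python
def category_max_score(responses_by_category):
    """Category-based score: each category is scored by its most abnormal
    (highest) response, and the category scores are then averaged.

    ASSUMPTION: aids/devices adjustments and missing-data rules are omitted.
    """
    category_scores = [max(items) for items in responses_by_category.values()]
    return sum(category_scores) / len(category_scores)


def item_mean_score(item_responses):
    """Item-based score: the plain mean of all item responses."""
    return sum(item_responses) / len(item_responses)


if __name__ == "__main__":
    # Hypothetical patient: one very difficult item dominates its category.
    responses = {
        "hygiene": [0, 0, 3],      # e.g., only the bath question is scored 3
        "activities": [1, 1, 1],
    }
    all_items = [r for items in responses.values() for r in items]
    print(category_max_score(responses))  # driven by the single worst item
    print(item_mean_score(all_items))
```

In this example a single extreme response sets the score for its whole category, which illustrates why individual item difficulties, rather than the categories themselves, drive a category-based score.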

There has been increasing recognition of the conceptual importance of separating functional limitations and disability (51–53). Among the limitations of both the HAQ and the HAQ-II is that they mix items measuring functional limitations with items measuring disability. Nine of the 10 HAQ-II items assess functional limitations; only one (“doing outside work”) is a measure of disability. Ideally, both instruments would assess only functional limitations. Future functional and disability assessments are likely to become more sophisticated as the interactions among illness, function, disablement, and society are better recognized (54).

In conclusion, the HAQ-II is a reliable and valid 10-item questionnaire that performs at least as well as the HAQ and is simpler to administer and score. Conversion from HAQ to HAQ-II and from HAQ-II to HAQ for research purposes is simple and reliable. The HAQ-II can be used in all places where the HAQ is now used, and it may prove to be easier to use in the clinic.