The Health Assessment Questionnaire (HAQ) is the most important and widely used functional status questionnaire in rheumatology. Developed by Fries et al in 1980 (1, 2), it is used in most clinical trials and observational outcome studies (3), and it has been translated into most languages in the industrialized countries (3, 4).
The HAQ is the best predictor of mortality (5), work disability (6), joint replacement (7), and medical costs (8). It is effective in rheumatoid arthritis (RA), osteoarthritis (OA), and other rheumatic conditions. The US Food and Drug Administration accepts it as a measure for evaluation of the prevention of disability.
Despite its extraordinary success, there are reasons to consider its revision (9–12). The HAQ is long. It is composed of 20 questions concerning activities of daily living (ADLs) and 14 questions relating to the use of aids and devices. In addition, its scoring is not simple. Subsequently, a modified HAQ (M-HAQ) with 8 ADLs was developed to address the length and scoring problems (13). A further modification, the multidimensional HAQ (MD-HAQ), added more complex ADLs (14). Like the HAQ, the M-HAQ predicts important long-term outcomes (15–17).
The HAQ also has something of a “floor” problem, in that many persons with physical disability can have normal HAQ scores. In addition, the HAQ is not a linear scale; a 0.25 difference at one level of disability (e.g., a HAQ score of 0.50) may not mean the same as that at another level (e.g., a HAQ score of 1.75) (18). Previous analyses have also suggested that some of the individual questions are not being answered correctly or are being misunderstood by patients (9).
Given the track record of the HAQ and its modified versions, the development of a new version should not be undertaken lightly. A new questionnaire should not only be shorter and better on a theoretical basis, but must also be shown to be at least as good as the original HAQ in terms of construct validity, discriminant validity, predictive validity, and reliability. In addition, it must have mean scores that are similar to those produced by the HAQ so that there can be interconversion of the questionnaires. In this report, we describe validation studies of a revised HAQ, the HAQ-II, that was developed using an item bank and Rasch analysis, an item response theory model for measurement (11, 19–28).
The validation results of this study suggest that the HAQ-II performs at least as well as the original HAQ. This should not be surprising, since 5 of the 10 HAQ-II items come directly from the HAQ. In addition, poorly fitting items of the HAQ were removed, and the overall item content of the HAQ-II was selected with careful attention to psychometric properties using Rasch analysis. Although the HAQ has 20 items (plus 14 aids and device modifiers), the method of scoring the HAQ reduces the questionnaire to 8 categories. In effect, the HAQ is an 8-item questionnaire, but one that gets some additional reliability from the redundancy of multiple questions in each category. Given the (de facto) 8-item HAQ and the 10-item HAQ-II, the HAQ-II, all things being equal, should perform as well as or better than the original HAQ.
The HAQ-II was developed using Rasch analysis and an item bank of questions in which each question has an intrinsic and measurable difficulty. For example, it is easier to walk on flat ground or get up from a chair than it is to walk up 2 flights of stairs or to walk 2 miles. If questions are chosen properly, it is possible to start with questions about actions that are very easy to do and to end with questions about actions that are very difficult to do. Each question, moreover, has sublevels of difficulty. Walking 2 miles can be done without difficulty, with some difficulty, with great difficulty, or not at all, and each level represents a separate measure of difficulty. Thus, a 10-item questionnaire can represent 4 × 10 = 40 separate levels of difficulty, or 3 × 10 = 30 item thresholds (the boundaries between adjacent response levels). In developing a questionnaire, all of the levels must be considered. An ideal questionnaire would therefore space out the individual difficulties as evenly as possible. It is an axiom of proper questionnaire scaling that, on average, a person who can accomplish activities at a given level of difficulty can also accomplish all items that have lesser degrees of difficulty.
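The relationship among item difficulty, response levels, and thresholds can be illustrated with a short sketch. It assumes a partial credit form of the polytomous Rasch model, which the text does not specify. The threshold values and function names are hypothetical and are not estimates from this study. For simplicity, the latent scale is oriented so that higher response categories become more likely as the person location `theta` increases.

```python
import math


def category_probabilities(theta, thresholds):
    """Probability of each response category for one item.

    Assumes a partial credit Rasch model: the log-weight of category k is
    the sum of (theta - threshold_j) over the first k thresholds, with an
    empty sum (0) for category 0.
    """
    cumulative = [0.0]
    for delta in thresholds:
        cumulative.append(cumulative[-1] + (theta - delta))
    peak = max(cumulative)  # subtract the maximum for numerical stability
    weights = [math.exp(c - peak) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def expected_score(theta, thresholds):
    """Expected response category (0..K) for a person located at theta."""
    probs = category_probabilities(theta, thresholds)
    return sum(k * p for k, p in enumerate(probs))


def count_levels_and_thresholds(n_items, n_response_levels):
    """Number of response levels and of thresholds across a questionnaire."""
    return n_items * n_response_levels, n_items * (n_response_levels - 1)


# Hypothetical threshold locations (in logits) for an easy and a hard item.
ITEM_THRESHOLDS = {
    "get up from a chair": [-2.0, -0.5, 1.0],
    "walk 2 miles": [0.5, 1.5, 3.0],
}

if __name__ == "__main__":
    print(count_levels_and_thresholds(10, 4))
    for name, thresholds in ITEM_THRESHOLDS.items():
        print(name, round(expected_score(0.0, thresholds), 2))
```

Under these assumed values, a person at the same location has a higher expected response on the easier item than on the harder one, which is the ordering property that the axiom above describes.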
In addition to evenly spacing item difficulties, it is desirable to have a questionnaire that measures a long span of difficulties. It is relatively easy to capture the functional level of persons who are severely disabled (e.g., unable to walk or to arise), but it is much more difficult to measure function at the other end of the spectrum. That is the reason that floor effects are commonly seen in the HAQ series of questionnaires. The problem with questions at the floor end of the disability spectrum is that they often have to refer to activities that people seldom do, or that belong less to the single dimension of function than to other dimensions, such as the performance of athletic activities.
Rasch analysis provides statistical methods to identify items that do not “fit” the hypothesized unidimensional Rasch model or that are not answered accurately. The SF-36 PF scale, which otherwise has superb psychometric properties, has items that do not fit the Rasch model. Similarly, the MD-HAQ questions regarding participation in sports and walking 2 miles do not satisfy the fit criteria. In general, items that are not clearly understood or are not completed add noise (inaccuracy) to the measurement scale, since persons guess at their ability to perform these activities. A further example of this problem can be found in the HAQ question regarding bathing. Because many people use showers instead of bathtubs, arthritis patients' responses indicate that it is more “difficult” to take a bath “with difficulty” than it is to be “unable” to take a bath at all.
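A reversal of this kind, in which a nominally milder response sits at a harder location than a more severe one, can be thought of as a disordering of the item's thresholds. The following is a minimal sketch of how such a reversal might be flagged. The threshold values and the function name are hypothetical; they are not the fit statistics or estimates used in this study.

```python
def find_disordered_thresholds(thresholds):
    """Return index pairs (j, j + 1) where threshold j exceeds threshold j + 1.

    For a well-behaved item, thresholds increase from the mildest response
    boundary to the most severe one, so the returned list is empty.
    """
    return [
        (j, j + 1)
        for j in range(len(thresholds) - 1)
        if thresholds[j] > thresholds[j + 1]
    ]


# Hypothetical threshold locations (in logits), for illustration only.
well_ordered_item = [-1.5, 0.0, 1.5]
reversed_item = [-1.0, 1.2, 0.4]  # middle boundary lies above the last one

if __name__ == "__main__":
    print(find_disordered_thresholds(well_ordered_item))
    print(find_disordered_thresholds(reversed_item))
```

An item flagged in this way would be a candidate for rewording or removal, in line with the removal of poorly fitting items described above.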
A questionnaire with evenly spaced, well-fitting items can provide a good measurement tool, much as a ruler can. However, if the integers on the ruler are not evenly spaced or tend to clump together, the ruler will be less useful as a measurement tool. Furthermore, it is possible to design a “perfect” scale and yet have a scale that is not clinically useful or that is insensitive to change. The validation studies of the HAQ-II show that it performs as well as the “gold standard” HAQ in identifying treatment effect and predicting important outcomes such as mortality or work disability. In addition, it is as strongly related to clinical and outcome variables as is the HAQ, or even more so.
The 10-item scale is easier than the HAQ to use and score in the clinic and in research studies. Because the scales are so closely allied (Figure 2) and have mean scores that differ by only 0.02 units, it is relatively easy to substitute one scale for the other. The very large sample size of this study (n = 14,038) provides assurance of the accuracy of the process of converting research data from the HAQ to the HAQ-II and vice versa.
Although we have indicated above that the HAQ and HAQ-II cannot be substituted for each other in individual patients, that warning applies only to contiguous observations (for example, observations 2 and 3). If the substitution is continued to observation 4, the new scale, which now has 2 observations, can take over from the old one. As with all such changes, experience and thoughtful use of the questionnaire will allow substitution.
The structure of the HAQ-II may seem strange, since it does not use ADL categories. The HAQ places its 20 questions into 8 ADL categories. Each category has its own score, a score that is based only on the most abnormal answer in the category. Ideally, the overall HAQ score would be a measure of functional disability averaged over all of the ADL categories. One problem with this visualization is that categories would somehow have to be weighted, either to be equal in difficulty or to represent some known, expected weight or value for the category. However, there are no known weights, nor is there evidence that equality of categories is rational or correct. In practice, the situation is worse. The HAQ hygiene category, for example, has a Rasch difficulty of −0.82 compared with a difficulty of −0.68 for “activities” (9). Hygiene, which should be much easier than “activities,” is not, and is driven almost entirely by the very difficult “take a bath” question. It is therefore the case that the actual item difficulties, rather than their categorization, are what drive the HAQ score. The HAQ-II ignores ADL categorization, as does the SF-36, in order to build a psychometrically valid questionnaire. This may not be a loss, since it is difficult to express ADL category performance based on a single question within a category. Clinicians who require detailed information regarding specific categories or activities (e.g., hand function) should consider the use of activity- or area-specific questionnaires.
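The effect of category-based scoring can be contrasted with a simple item mean in a short sketch. The responses and the number of items shown in each category are invented for illustration, and the adjustment for aids and devices is omitted. The use of an unweighted item mean for a questionnaire without categories is an assumption, since the text does not state the HAQ-II scoring rule.

```python
def category_based_score(responses_by_category):
    """Average, over categories, of the most abnormal (highest) item response.

    Each item is assumed to be scored 0 (no difficulty) to 3 (unable).
    """
    category_scores = [max(items) for items in responses_by_category.values()]
    return sum(category_scores) / len(category_scores)


def item_mean_score(responses):
    """Unweighted mean of all item responses, ignoring categories."""
    return sum(responses) / len(responses)


# Hypothetical responses to 20 items grouped into 8 ADL categories.
example_responses = {
    "dressing": [0, 1],
    "arising": [0, 0],
    "eating": [1, 0, 0],
    "walking": [0, 0],
    "hygiene": [0, 0, 3],  # a single very difficult item sets the category
    "reach": [1, 0],
    "grip": [0, 0, 0],
    "activities": [1, 1, 2],
}

if __name__ == "__main__":
    all_items = [r for items in example_responses.values() for r in items]
    print(category_based_score(example_responses))
    print(item_mean_score(all_items))
```

In this invented example, the single response of 3 in the hygiene category raises the category-based score well above the item mean, which illustrates how individual item difficulties, rather than the categories themselves, can drive the score.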
There has been increasing recognition of the conceptual importance of separating functional limitations and disability (51–53). Among the limitations of both the HAQ and the HAQ-II is that they mix items measuring functional limitations with items measuring disability. Nine of the 10 HAQ-II items assess functional limitations; only one (“doing outside work”) is a measure of disability. It would be ideal if both instruments assessed only functional limitations. Future functional and disability assessments are likely to have increasing sophistication as the interactions among illness, function, disablement, and society become increasingly recognized (54).
In conclusion, the HAQ-II is a reliable and valid 10-item questionnaire that performs at least as well as the HAQ and is simpler to administer and score. Conversion from HAQ to HAQ-II and from HAQ-II to HAQ for research purposes is simple and reliable. The HAQ-II can be used in all places where the HAQ is now used, and it may prove to be easier to use in the clinic.