The Stanford Health Assessment Questionnaire Disability Index (HAQ) is the gold standard functional status questionnaire in rheumatology, but it is lengthy. Three shorter versions, the modified HAQ (MHAQ), the Multidimensional HAQ (MDHAQ), and the HAQII, are often used in outcomes research as HAQ substitutes. We developed conversion formulas from these modified versions to the original HAQ.
Analysis was limited to rheumatoid arthritis (RA) patients, using one randomly selected observation per patient at which the HAQ was recorded in conjunction with the MHAQ (n = 29,596), the MDHAQ (n = 13,665), or the HAQII (n = 15,823). Models were developed in a randomly selected 80% of the data (development sample), and the remaining 20% was used for model validation.
Two conversion formulas were developed for each of the MHAQ, the MDHAQ, and the HAQII: a short model and a long model inclusive of questions common to both the modified measures and the original HAQ. Short models explained 81–83%, and long models 82–86%, of the variance. Predicted HAQ values of zero were assigned to all cases with an MDHAQ or HAQII score of zero, with remaining cases used for model estimation. Bland-Altman plots demonstrated good concordance between actual and predicted values for each measure. The validation sample closely approximated the results from the development sample (0.005 ≤ ΔR2 ≤ 0.009) for each measure.
We have developed and validated highly accurate conversion formulas from the MHAQ, MDHAQ, and HAQII to the original HAQ in a large sample of RA patients. The developed models are useful for conversion of measures in the research setting. Because of substantial variability at the individual patient level, application of the formulas to individual patients is inadvisable.
As the gold standard functional status questionnaire in rheumatology, the Stanford Health Assessment Questionnaire Disability Index (HAQ) is used in most clinical trials and observational outcome studies (1, 2) and is recommended by the American College of Rheumatology for measurement of physical function (3). While originally conceived as a measurement of patient outcome in rheumatoid arthritis (RA) patients (4), the HAQ has been successfully applied to a variety of rheumatic diseases (5). Additionally, the HAQ has been shown to distinguish between placebo and treatment groups (6, 7), with changes over time agreeing with and augmenting clinical and laboratory evidence of change (8–11). The HAQ is also the best predictor of mortality (12), work disability (13), joint replacement (14), and medical costs (15) as compared with other measures of RA-related disease activity.
Since the inception of the original HAQ, several modified versions have been created in an effort to improve the precision of information gained and/or to reduce the length of the original 41-item instrument. The HAQ asks the patient to rate, on a 4-point ordered-category item scale, the degree of difficulty they have experienced over the last week with each of 20 tasks grouped into 8 functional areas, with scores further adjusted based on an additional 21 questions regarding the use of companion aids or devices. Scores are then converted into an overall mean score ranging from 0–3, with 0 indicating no functional impairment and 3 indicating complete impairment (4, 5, 16).
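The scoring just described can be sketched in code. The following Python is a minimal illustration, not the official scoring program. It assumes the commonly described HAQ scoring rules, which are not spelled out in the text above: each functional area takes its highest item score, a reported aid or device raises an area score below 2 to 2, and at least 6 of the 8 areas must be answered. The function name, argument layout, and area labels are hypothetical.

```python
def haq_disability_index(area_items, area_aid_used, min_areas=6):
    """Return a 0-3 HAQ score, or None if too few areas were answered.

    area_items: dict mapping area name -> list of item responses (0-3 or None).
    area_aid_used: dict mapping area name -> True if an aid/device was reported.
    The scoring rules used here are assumptions (see the lead-in).
    """
    area_scores = []
    for area, items in area_items.items():
        answered = [r for r in items if r is not None]
        if not answered:
            continue  # an area with no answered items cannot be scored
        score = max(answered)  # assumed rule: highest item score in the area
        if area_aid_used.get(area, False) and score < 2:
            score = 2  # assumed rule: aid or device raises the area score to 2
        area_scores.append(score)
    if len(area_scores) < min_areas:
        return None
    return sum(area_scores) / len(area_scores)


# Hypothetical patient: every area has item responses 0 and 1,
# and an aid is reported for the first area only.
example_items = {f"area_{i}": [0, 1] for i in range(1, 9)}
example_aids = {"area_1": True}
example_score = haq_disability_index(example_items, example_aids)
```

In this hypothetical case, seven areas score 1 and one area is raised to 2, so the mean is 9/8.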
The modified HAQ (MHAQ), the Multidimensional HAQ (MDHAQ), and the HAQII are the most prominent of these attempts at improvement and are often used in outcomes research as HAQ substitutes without demonstrated equivalence; these may be reported simply as “HAQ” scores without specifying which instrument was used. Additionally, some studies initially use one HAQ version and later switch to another. Although these instruments are at times used interchangeably (17), it is difficult to compare summary scores between instruments because each has different psychometric properties. One example of such a problem is the “floor effect” phenomenon, which has been observed to varying degrees in each version of the HAQ. The floor effect is observed when a patient has a completely normal score (0.0) on the instrument despite some functional limitations (18). In other words, if an instrument has a floor effect, it cannot discriminate between individuals who have relatively good (but not perfect) function. Even the original HAQ deviates from a normal distribution at values near zero (19) and has been shown to fail to detect clinical improvement in up to 10% of patients (19–21). With the advent of more effective pharmacologic therapies and their increasing use in patients with milder disease, the floor effect has likely become more pervasive over time, particularly as shorter variations of the original HAQ have come into more common use, because assessing fewer items may miss subtle functional impairment.
Conceived as a shortened version of the HAQ, the MHAQ asks patients to answer 8 questions, 1 in each of the 8 functional areas explored with the HAQ (22). The MHAQ assesses the degree of change in difficulty with specific tasks over the preceding 3 months, and is therefore subject to recall bias, although it has been shown that the MHAQ is correlated with HAQ change scores (23). For comparison with the original HAQ, MHAQ scores are converted to a range between 0 and 3. However, while correlated with HAQ scores, MHAQ scores lack sensitivity to change (24–26), are routinely lower than HAQ scores by ∼0.3–0.5 units (21, 27), and tend to cluster at the lower end of the scale, leading to a non-normal distribution of values (19, 20). These observations yield the conclusion that the MHAQ also has a more pronounced “floor effect” than the HAQ, preventing numerical improvement in scores despite clinical improvement in function in as many as 25% of patients (19–21, 26, 28).
The MDHAQ was created as a further modification of the HAQ and was designed with 10 formally scored activity questions, as well as an additional 3 nonscored items to assess psychological status, with the resultant score again converted into an overall mean score ranging from 0–3 (29, 30). The nonscored psychological status questions were added in an attempt to measure additional health dimensions (sleep, anxiety, and depression) in addition to functional status (30), and therefore they may be omitted from our comparison of physical function scales. Compared with both the HAQ and MHAQ, the MDHAQ demonstrates a less prominent floor effect. However, similar to the HAQ, the MDHAQ deviates from a normal distribution at values near zero (21, 30). Additionally, the MDHAQ has more even spacing of scores than the HAQ and MHAQ, making a change of 0.5 more similar across the range of the scale (30), although outliers remain (19).
Based on the original HAQ, the HAQII is a 10-item functional questionnaire with scores ranging from 0–3. Shorter and simpler than the HAQ, the HAQII has demonstrated levels of reliability and validity similar to the original HAQ. It has a smaller floor effect than the HAQ and MHAQ, potentially failing to detect clinical improvement in only 5.8% of patients (19, 31, 32). Like the HAQ, the HAQII deviates from a normal distribution at values near zero (19). Conversion formulas between the HAQII and HAQ have been developed previously (19). A key benefit of the HAQII over the MHAQ and the MDHAQ is that it is more closely correlated with the original HAQ, with the previously derived conversion formula from HAQII to HAQ demonstrating R2 = 0.821 (31). Additionally, more even spacing of item difficulties from Rasch analysis produces the greatest uniformity of change across the range of the scale as compared with the abovementioned tools (19, 33). To our knowledge, no published formulas exist to convert MHAQ or MDHAQ scores to the original HAQ. Because rheumatic disease registries and trials vary substantially in which HAQ version they use, we sought to develop a method of conversion from MHAQ and MDHAQ to the original HAQ, as well as to confirm the previous work that provided a conversion formula between the HAQII and HAQ. Prior work has not evaluated the effect on model fit of adding the individual questions composing the HAQII or the questions common to the HAQII and HAQ. We examined the effect of adding these common questions to the model, since we expected their inclusion to produce a better conversion formula.
MATERIALS AND METHODS
Since 1998, various versions of the HAQ have been completed a total of 203,041 times by 35,009 unique participants in the National Data Bank for Rheumatic Diseases (NDB) long-term outcomes study. These methods have been previously described (15, 34). Utilizing previously collected NDB data from 1998–2008, analysis was limited to RA patients residing in the US, Canada, and Puerto Rico at a time in which the HAQ was asked simultaneously with at least one of the MHAQ, MDHAQ, or HAQII. Analysis was limited to questionnaires administered in English, which excluded 68 MHAQ and 70 HAQII questionnaires. To prevent bias introduced from multiple measurements on individual participants, only one pair of data per patient (HAQ and the corresponding MDHAQ, MHAQ, or HAQII) was included based on selection of a random observation.
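The selection of one random observation per patient can be illustrated as follows. This is a minimal sketch under stated assumptions, not the procedure actually run on the NDB data: the record layout, the `patient_id` key, and the fixed seed are hypothetical.

```python
import random
from collections import defaultdict


def one_random_observation_per_patient(records, seed=0):
    """Keep a single randomly chosen record for each patient.

    records: iterable of dicts, each with a hypothetical "patient_id" key.
    A seeded generator is used only so the sketch is reproducible.
    """
    rng = random.Random(seed)
    by_patient = defaultdict(list)
    for record in records:
        by_patient[record["patient_id"]].append(record)
    # One draw per patient avoids repeated measurements on the same person.
    return [rng.choice(observations) for observations in by_patient.values()]


# Hypothetical paired observations (HAQ recorded together with HAQII).
example_records = [
    {"patient_id": 1, "haq": 1.250, "haqii": 1.2},
    {"patient_id": 1, "haq": 1.000, "haqii": 0.9},
    {"patient_id": 2, "haq": 0.375, "haqii": 0.3},
    {"patient_id": 3, "haq": 2.125, "haqii": 2.0},
    {"patient_id": 3, "haq": 2.000, "haqii": 2.1},
]
selected = one_random_observation_per_patient(example_records)
```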
Descriptive statistics were used to compare scores from the HAQ with scores from the MHAQ, MDHAQ, and HAQII, and with typical univariate transformations (e.g., base 10 log, natural log, square root, and second, third, and fourth order polynomials) of the MHAQ, MDHAQ, and HAQII. Box-Cox transformation of the MHAQ, MDHAQ, and HAQII was also evaluated. Univariable regressions were performed with each of a variety of explanatory variables and interactions between variables believed to be important indicators of the HAQ score, including domains of demographics, patient habits, comorbidities, RA-specific factors, and other health-related quality of life indicators. Adjusted R2 was used to determine model fit; variables that did not improve R2 by at least 0.02 were considered not to contribute to the model and were excluded from further analyses. Although sex and age were not statistically significant in the HAQII model, they were considered clinically important and were further evaluated during model development. We also evaluated model fit without the addition of age and sex for the MHAQ and MDHAQ models, with addition of all individual questions composing each of the measures, and with addition of individual question responses common to both the HAQ and each of the MHAQ, MDHAQ, or HAQII to each of the models. Common HAQ categories/questions are: wash (able to wash and dry your body?), dress (able to dress yourself, including shoelaces and buttons?), cup (able to lift a full cup or glass to your mouth?), faucet (able to turn faucets on and off?), bend (able to bend down and pick up clothing from the floor?), in car (able to get in and out of a car?), bed (able to get in and out of bed?), walk (able to walk outdoors on flat ground?), reach (able to reach up and get a 5-pound object [e.g., a bag of sugar] from above your head?), toilet (able to get on and off the toilet?), open car (able to open car doors?), and stand (able to stand up from a straight chair?). 
At least 6 of the 8 common questions for MHAQ and MDHAQ, and 4 of the 5 common questions for HAQII, were required to be present in evaluation of the longer models.
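The variable-selection rule and the completeness requirement above can be sketched as small helper functions. This is an illustration only: the adjusted R2 formula is the standard one, while the function names and the small floating-point tolerance are assumptions introduced here.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Standard adjusted R2 for a linear model with an intercept."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def contributes_to_model(adj_r2_without, adj_r2_with, threshold=0.02):
    """True if adding the variable improves adjusted R2 by at least `threshold`.

    The 1e-12 tolerance is an assumption that guards against floating-point
    rounding when the improvement is exactly at the threshold.
    """
    return (adj_r2_with - adj_r2_without) >= threshold - 1e-12


def enough_common_items(responses, required):
    """True if at least `required` common-question responses are present.

    responses: list of item responses, with None marking a missing answer.
    For example, required=6 of 8 for MHAQ/MDHAQ and 4 of 5 for HAQII.
    """
    return sum(r is not None for r in responses) >= required


# Hypothetical usage: a 0.03 gain is kept, a 0.01 gain is not.
kept = contributes_to_model(0.80, 0.83)
dropped = contributes_to_model(0.80, 0.81)
```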
Model development was based on 80% of the data with the remaining 20% used for validating the final models. Predicted values were constrained to the 0–3 range of the HAQ. If predicted values were <0 or >3 they were replaced with the value 0 or 3, respectively. Quantile-quantile plots (not shown) were used to assist in the comparison between the predicted HAQ scores derived from the model and the actual HAQ scores obtained at the same point in time. The above analysis was repeated for each measure (MHAQ, MDHAQ, and HAQII). As visual inspection suggested slight nonlinearity at values <1 for MHAQ, MDHAQ, and HAQII, splines with knots at 0.125, 0.25, 0.5, and 0.75 were used to attempt improvement in model fit for each of the measures. Final models were chosen based on the best fit, which was based on improvement in adjusted R2 of at least 0.02. Two models were developed for the conversion of each measure (MHAQ, MDHAQ, and HAQII) to the HAQ: a primary (long) model inclusive of the common questions as described above and a short model with inclusion of only the measure to be transformed with age and sex if statistically significant.
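The 80%/20% split and the constraint of predicted values to the 0–3 range can be sketched as follows. This is a minimal illustration with hypothetical function names and a fixed seed, not the Stata code used in the analysis.

```python
import random


def split_development_validation(rows, development_fraction=0.8, seed=0):
    """Randomly split rows into development and validation samples."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(round(development_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def constrain_to_haq_range(predicted, low=0.0, high=3.0):
    """Replace predictions below 0 with 0 and above 3 with 3."""
    return max(low, min(high, predicted))


# Hypothetical usage with 100 placeholder rows.
development, validation = split_development_validation(range(100))
```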
Further model modifications were investigated once it was discovered that the original linear models for predicting HAQ scores from MDHAQ and HAQII scores predicted 0.1% and 0% zeroes, respectively. Zero-inflated normal models were evaluated but showed minimal or no improvement in model fit and did little to address the issue of a lack of predicted zeroes. Models predictive of HAQ values of zero were developed by assigning predicted HAQ values of zero to all cases with an MDHAQ or HAQII score of zero and then using the remaining cases to estimate a linear regression model as before. This was not done for MHAQ because of the pronounced floor effect of the measure. Predicted HAQ values for the remaining cases were estimated using the coefficients from these models, and the squared correlation between the predicted and actual HAQ values for all cases was calculated.
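The zero-assignment approach can be illustrated with a deliberately simplified sketch. It assumes a single predictor (the short-form score) and ordinary least squares; the models in this study also considered age, sex, and individual question responses, so this is not the fitted model. All function names and the example data are hypothetical.

```python
def fit_simple_ols(x, y):
    """Least-squares intercept and slope for one predictor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_zero_assigned_model(scores, haq_values):
    """Estimate the regression using only cases with a nonzero score."""
    nonzero = [(s, h) for s, h in zip(scores, haq_values) if s != 0]
    xs, ys = zip(*nonzero)
    return fit_simple_ols(xs, ys)


def predict_haq(score, intercept, slope):
    """Predict 0 for a score of 0; otherwise use the regression, kept in 0-3."""
    if score == 0:
        return 0.0
    return max(0.0, min(3.0, intercept + slope * score))


def squared_correlation(a, b):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    return sab * sab / (saa * sbb)


# Hypothetical data: two cases score 0 on the short form.
scores = [0.0, 0.0, 0.5, 1.0, 2.0]
haq_values = [0.0, 0.25, 0.55, 1.0, 1.9]
intercept, slope = fit_zero_assigned_model(scores, haq_values)
predictions = [predict_haq(s, intercept, slope) for s in scores]
r2_all_cases = squared_correlation(predictions, haq_values)
```

The squared correlation is computed over all cases, including those assigned zero, which mirrors the evaluation described above.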
Graphical fits were presented using local polynomial curves and 95% confidence intervals (95% CIs) with bin widths of 0.1. Bland-Altman concordance statistics were used to evaluate concordance between actual HAQ values and predicted HAQ values for each measure and were presented graphically with local polynomial curves superimposed. Bland-Altman plots allowed us to determine if the differences between the observed and predicted values exhibited any systematic bias over the range of possible values. Regression diagnostics were used to look for multicollinearity by measurement of the variance inflation factor (VIF), an index that measures how much the variance of a coefficient is increased because of collinearity. For demographic data, differences in sample means between development and validation samples for each measurement tool and differences between samples with missing and non-missing predictor variables were evaluated using 2-sample t-tests. All analyses were performed using Stata statistical software, version 10.1 (Stata).
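The two diagnostics named above can be sketched briefly. The Bland-Altman quantities shown (mean difference and conventional 95% limits of agreement) and the VIF definition, 1 / (1 − R2) from regressing one predictor on the others, are the standard textbook forms; whether the analysis reported exactly these summaries is an assumption, and the function names are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(actual, predicted):
    """Bland-Altman summary for paired actual and predicted values.

    Returns the mean difference (bias), the conventional 95% limits of
    agreement (bias +/- 1.96 SD of the differences), and the plotted
    points as (pair mean, pair difference).
    """
    differences = [a - p for a, p in zip(actual, predicted)]
    pair_means = [(a + p) / 2 for a, p in zip(actual, predicted)]
    bias = mean(differences)
    sd = stdev(differences)
    return {
        "bias": bias,
        "lower_limit": bias - 1.96 * sd,
        "upper_limit": bias + 1.96 * sd,
        "points": list(zip(pair_means, differences)),
    }


def variance_inflation_factor(r2_of_predictor_on_others):
    """VIF for one predictor, given the R2 from regressing it on the rest."""
    return 1.0 / (1.0 - r2_of_predictor_on_others)


# Hypothetical actual and predicted HAQ values.
summary = bland_altman([1.0, 2.0, 3.0], [0.5, 2.0, 2.5])
```

A systematic trend in the plotted differences across the pair means would suggest bias that varies over the range of the scale.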
Single random observations of individual patients simultaneously obtained for each of the MHAQ (n = 29,596), MDHAQ (n = 13,665), and HAQII (n = 15,823) at the same time a HAQ was completed were available. Table 1 displays the characteristics of these patients. No significant differences in patient characteristics were observed between the 80% development and the 20% validation samples for each measurement tool. There were small, but important, differences in patients with missing data, with all differences occurring when missing predictors represented <3% of the total. The effects of missing data are outside the scope of our analysis. Box-Cox transformations did not improve any of the models to a significant degree. Table 2 displays coefficients and SEs for both the short and long final models. For clarity, we have included the equations chosen as the best models for each measure within the text. We have developed 2 models for conversion back to the HAQ for each measure, a longer version as well as a more parsimonious version, with usage recommendations described below. For each measure, the predicted values for the 20% validation sample closely approximated the actual HAQ values and demonstrated a nearly identical line to the development sample (data not shown). Figure 1 shows the graphical representation of the actual versus predicted MHAQ, MDHAQ, and HAQII values using the 20% validation sample. Bland-Altman plots for long models (Figure 2) show that the actual HAQ values for each measure have slightly higher variability than the predicted HAQ values, with positive correlations for all measures: 0.244 for MHAQ, 0.245 for MDHAQ, and 0.205 for HAQII.
Table 1. Characteristics of RA patients in model and validation samples*
MHAQ (n = 30,754)
MDHAQ (n = 13,764)
HAQII (n = 15,929)
For each measure, all differences of sample means had P > 0.10 between model development and validation samples. RA = rheumatoid arthritis; MHAQ = modified Health Assessment Questionnaire; MDHAQ = Multidimensional HAQ.
HAQ version indicates corresponding MHAQ, MDHAQ, and HAQII.
Marital status was not routinely collected at the time MDHAQ questionnaires were asked.
Table 2. Model coefficients (SE) for short and long models with adjusted R2 for development and validation samples*
Dependent variable is the Health Assessment Questionnaire (HAQ) for all models. Short indicates most parsimonious model. Long indicates model inclusive of individual question responses common to both the HAQ and each of the modified HAQ (MHAQ), Multidimensional HAQ (MDHAQ), or HAQII. For short models, 3.5%, 0.7%, and 0.2%, and for long models 11.4%, 8.3%, and 4.5%, of cases for MHAQ, MDHAQ, and HAQII, respectively, were dropped during model development because of missing data on the predictor variables. See Materials and Methods section for descriptions of variables. N/A = not applicable.
The square root of MHAQ was used in the conversion model.
In the development sample, the square root of MHAQ was more closely correlated with HAQ than the untransformed MHAQ variable (0.881 versus 0.857). Using the square root of MHAQ, model fit was better with the inclusion of age and sex (ΔR2 = 0.03). Splines with knots at 0.125, 0.25, 0.50, and 0.75 each improved R2 by only 0.01, which did not meet the 0.02 threshold for improved model fit; therefore, they were not used. Adding individual question responses common to both the HAQ and the MHAQ to the model improved R2 by 0.02. For the above reasons the long model was chosen as best:
The average VIF for the final model was 2.85. As expected, there was some strong, but not perfect, multicollinearity between the composite score and the individual items (maximum VIF = 10.00), but the model is still estimable. Since we are primarily interested in the model's predictive power, rather than the meaning of the individual model coefficients, this multicollinearity is not considered to be detrimental to our purposes.
Numerically transformed versions of the MDHAQ did not significantly improve correlation to HAQ scores or model fit; therefore, the untransformed variable was used for analysis. Model fit was better with inclusion of age and sex (ΔR2 = 0.02). The use of splines at the 0.125, 0.25, 0.50, and 0.75 levels did not significantly improve model fit for MDHAQ (ΔR2 <0.01); therefore, they were not included in the final model. As compared with the short model, addition of all individual questions composing the MDHAQ improved R2 by 0.02. Addition of the individual question responses common to both the HAQ and the MDHAQ did not significantly improve R2 (ΔR2 = 0.01). The average VIF for the model, including the question responses common to both the HAQ and the MDHAQ, was 3.28. Similar to the MHAQ model, the individual items in the MDHAQ model demonstrated some multicollinearity (maximum VIF = 14.86); however, we again believe that this is not detrimental to our purposes. Using models predictive of zero, the percentage automatically scored as zero was 3.6% and 4.0% for short and long models, respectively. When rounded to the nearest hundredth, the squared correlation between predicted and actual HAQ values for all cases was identical for each model (R2 = 0.81 and R2 = 0.82), regardless of inclusion of zero values. For the above reasons, the long model (obtained by assigning predicted HAQ values of zero to all cases with an MDHAQ score of zero and then using the remaining cases to estimate a linear regression model) was chosen as best:
Numerically transformed versions of the HAQII did not significantly improve correlation to HAQ scores or model fit; therefore, the untransformed variable was used for analysis. Inclusion of age and sex into the HAQII model produced a nonsignificant improvement (ΔR2 <0.01). The use of splines at the 0.125, 0.25, 0.50, and 0.75 levels did not significantly improve the model fit (ΔR2 <0.01). The addition of all individual questions composing the HAQII, and the addition of only the individual question responses common to both the HAQ and the HAQII, each improved R2 by 0.03 as compared with the short model; therefore, in favor of parsimony, the model inclusive of only the common questions was chosen. The average VIF for the final model, including the question responses common to both the HAQ and the HAQII, was 2.85, with a maximum VIF of 5.77, indicating no significant issues with multicollinearity. Using models predictive of zero, the percentage automatically scored as zero was 7.3% and 7.5% for short and long models, respectively. The squared correlation between the predicted and actual HAQ values for all cases was identical for each model (R2 = 0.83 and R2 = 0.86), regardless of the inclusion of zero values. For the above reasons, the long model (obtained by assigning predicted HAQ values of zero to all cases with a HAQII score of zero and then using the remaining cases to estimate a linear regression model) was chosen as best:
In this manuscript, we developed conversion formulas from the MHAQ, MDHAQ, and HAQII to the HAQ in a large sample of RA patients. We demonstrated that average MHAQ and MDHAQ scores were 0.58 and 0.34, respectively, lower than HAQ scores, while average HAQII scores were only minimally lower (by 0.04) than HAQ scores. Since a change of 0.22 in the HAQ is considered clinically significant (32), these differences are important and illustrate that the MHAQ and the MDHAQ are not equivalent to the original HAQ. One strength of our analysis is the comparison of simultaneously collected measurement of the HAQ with the MHAQ, MDHAQ, and HAQII scales. Although our data set includes relatively few values at the upper end of the HAQ scales, which in turn produces less certainty in the model at the upper ends of each scale, this finding is not different from other studies using these measures (20, 30, 33, 35). Additionally, the clustering of values at the lower end of each scale does not significantly impact the model fit for any of the measures, as evidenced by the nonsignificant changes we found in our attempts to use splines to correct for this issue. We have demonstrated consistency between our models as shown by graphical representation of the 20% validation sample. Although Bland-Altman plots show that the actual HAQ values for each measure have slightly higher variability than the predicted HAQ values, this effect is minimal, and overall the plots demonstrate a good level of concordance for each measure when used for population means. While we would not expect our models to explain 100% of the variance when moving from a 41-item questionnaire to 8- and 10-item questionnaires, the lowest adjusted R2 for our models was in excess of 0.80, indicating that at most 20% of the variance remains unexplained. In contrast with the narrow 95% CI of the predicted mean values for HAQ scores, we observed relatively large 95% CIs for individual predicted values (Figure 1). 
This illustrates that application of our conversion formulas to the individual patient for clinical care purposes is inadvisable. Moreover, prior work has also demonstrated that the HAQ and HAQII are not interchangeable in an individual patient (19, 33).
For each measure we have developed 2 models for conversion back to the HAQ: a longer, more explanatory version, as well as a more parsimonious version. It is likely that existing data sets will not have all the variables that were at our disposal. We believe selection of the appropriate conversion model may be based on available data (i.e., use the short model if individual question responses are not available) rather than through imputation of data, since there were only small improvements in model fit when moving to longer models. In fact, the largest improvement we found was going from the most parsimonious model to the expanded model for the HAQII, which demonstrated an improvement in R2 of 0.03. For the MDHAQ and the HAQII, we found no significant improvements over models inclusive of only common questions between measures versus models inclusive of all individual questions composing the measures. Notably, all 8 MHAQ questions are taken from the original HAQ without modification or addition of other questions. We found slightly different coefficients for conversion between the HAQII and the HAQ than with prior work done in this area. Wolfe et al (19) had previously reported a conversion equation based on 14,038 observations, of which 10,916 were limited to RA (within the same database), which yielded HAQ = 0.039 + 0.989 × HAQII (the previously published model, HAQ = 0.39 + 0.989 × HAQII, is in error and should read HAQ = 0.039 + 0.989 × HAQII).
Our updated formula, HAQ = 0.038 + 0.998 × HAQII, is very similar. The updated formula will lead to slightly higher values (our model gives HAQ values that are 1.0% higher), but should be more robust for RA cohorts since our analysis was based on an additional 4,862 RA-specific simultaneous observations for the HAQ and the HAQII.
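As a worked illustration, the two HAQII-to-HAQ formulas quoted above can be written as small functions. The coefficients are taken directly from the text; the function names are hypothetical, and constraining the result to the 0–3 range follows the Materials and Methods description. As noted above, such conversions are intended for group-level research use and not for individual patients.

```python
def haq_from_haqii_updated(haqii):
    """Updated formula from the text: HAQ = 0.038 + 0.998 x HAQII."""
    predicted = 0.038 + 0.998 * haqii
    return max(0.0, min(3.0, predicted))  # keep within the HAQ range


def haq_from_haqii_previous(haqii):
    """Corrected earlier formula (19): HAQ = 0.039 + 0.989 x HAQII."""
    predicted = 0.039 + 0.989 * haqii
    return max(0.0, min(3.0, predicted))


# Hypothetical cohort mean HAQII of 1.0.
updated_value = haq_from_haqii_updated(1.0)
previous_value = haq_from_haqii_previous(1.0)
difference = updated_value - previous_value
```

At a HAQII of 1.0 the updated formula gives a slightly higher predicted HAQ than the earlier one, in line with the direction described in the text.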
In addition to the measures discussed above, other modified versions of the HAQ have been created; however, these measures are not widely used at this time. Paramount among such measures is the Patient-Reported Outcome Measurement Information System HAQ, also known as the improved HAQ, which is composed of modified versions of the 20 questions found in the original HAQ. It is written in present tense, uses a 5-point ordered-category item scale, and uses a scoring scale ranging between 0 and 100 with adjustment for 4 questions asking about the use of aids, devices, or assistance (36). Should additional versions of the HAQ become popular in the future, it would be appropriate at that time to develop similar models to convert values to the original HAQ in order to compare outcomes between studies. Ideally, it would be prudent to develop similar conversion models back to the original HAQ as part of a development process for any new HAQ-derived measure. Overall, we believe the models we have developed are most useful for conversion of the MHAQ, the MDHAQ, and the HAQII to the HAQ in the research setting. As we limited our analysis to RA patients, these formulas may not be applicable to other patient populations.
Although all the measures discussed above show validity in measuring function, the HAQII requires the least manipulation of data in order to compare with the original HAQ and has been shown to have the greatest uniformity between values across the range of the scale. In light of ever improving treatments for RA, which produce a greater likelihood of remission, the large floor effect of the MHAQ makes it the least desirable of the measures we evaluated. Based on the strong relationship between the HAQ and the HAQII, and on the previously reported validation studies demonstrating equivalent prediction of outcomes such as mortality and work disability (19), we recommend that future studies use the HAQII when a HAQ substitute is required. The above models will allow comparison of prior data where collections of different versions of the HAQ were used.
All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be submitted for publication. Dr. Michaud had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study conception and design. Anderson, Sayles, Curtis, Michaud.
Acquisition of data. Wolfe, Michaud.
Analysis and interpretation of data. Anderson, Sayles, Curtis, Michaud.
The authors thank Robin High of the University of Nebraska Medical Center's Department of Biostatistics for his assistance with the estimation of the zero-inflated normal models.