Crosscultural Measurement Equivalence of the Health Assessment Questionnaire II


Address correspondence to Martijn A. H. Oude Voshaar, MSc, Institute for Behavioural Research, Faculty of Behavioural Sciences, University of Twente, PO Box 217, 7500 AE Enschede, The Netherlands. E-mail:



To evaluate the crosscultural measurement equivalence of the US and Dutch Health Assessment Questionnaire II (HAQ-II) in rheumatoid arthritis (RA).


Item response theory (IRT) analyses were performed on US (n = 18,747) and Dutch (n = 1,022) HAQ-II data to evaluate the equivalence of crosscultural item performance. The observed inconsistencies were modeled by assigning country-specific item parameters to biased items. The impact of crosscultural item bias on the comparability of the Dutch and US total scores was analyzed by evaluating the agreement between physical function levels estimated from an IRT model with country-specific item parameters for biased items and physical function levels estimated from the original model that did not account for cultural bias.


Two items showed significant crosscultural bias. However, the agreement in physical function estimates between the respecified and original models was very high, with an intraclass correlation coefficient >0.99 and Bland-Altman limits of agreement ranging from −0.08 to −0.01 on a latent scale with a mean of 0 and an SD of 1.


This study suggests that the Dutch and US HAQ-II produce total scores that can be interpreted interchangeably across countries in RA studies, despite some minor bias at the item level.


Rheumatoid arthritis (RA) is a musculoskeletal disease frequently associated with an impaired ability to execute everyday activities. Physical function (PF) is therefore a core outcome domain in this field. The benefits of item response theory (IRT) are increasingly being recognized in the assessment of patient-reported outcomes, including PF. For example, the Rasch analysis–based Health Assessment Questionnaire II (HAQ-II) was developed to overcome the limited measurement range and cumbersomeness of assessing patients with the original HAQ disability index (DI) [1].

Although most psychometric properties of the HAQ-II are well established [2], nothing is currently known about its crosscultural measurement equivalence. However, PF scores are often pooled or compared across countries. It is important to verify that the observed score differences or similarities across cultures reflect true differences or similarities in PF levels rather than cultural bias.

Traditionally, researchers have tried to achieve measurement equivalence through elaborate crosscultural translation procedures. These procedures aim to develop equivalent versions of measures by having qualified translators and content experts reach consensus about items that are difficult to translate. Although important, these procedures do not necessarily yield equivalent versions from a measurement perspective. Crosscultural measurement equivalence only exists when the relations between the observed scale scores and the latent attribute measured by the scale are identical across cultures [3]. Therefore, crosscultural measurement equivalence studies are a necessary next step in evaluating crosscultural validity. The aim of the present study was to investigate the crosscultural measurement equivalence of the original US and Dutch versions of the HAQ-II using IRT modeling.

Significance & Innovations

  • The Health Assessment Questionnaire II (HAQ-II) is a psychometrically sound 10-item physical function questionnaire that was developed using Rasch analysis.
  • This study is the first to use item response theory modeling to evaluate the crosscultural measurement equivalence of the HAQ-II.
  • The current study demonstrates that Dutch and US versions of the HAQ-II produce total scores that can be interpreted interchangeably in studies with rheumatoid arthritis (RA) patients.
  • The results of this study support the crosscultural validity of the HAQ-II in RA.

Patients and methods

Data from the Dutch version of the HAQ-II were analyzed from 2 studies. In the first study, 3 waves of data collection were carried out in the outpatient rheumatology clinic at Medisch Spectrum Twente in Enschede, The Netherlands [4]. All consecutive patients visiting the clinic answered demographic questions and completed standard self-reported measures of disease activity and health status. The baseline data of 472 patients with a clinical diagnosis of RA were used. In the second study, a random selection of 999 outpatients with a clinical diagnosis of RA was asked to participate in a survey study on fatigue; 550 patients agreed to participate in this study (Nikolaus, et al: unpublished observations).

The US HAQ-II data represented a random sample of 18,747 subjects with RA who were participating in a longitudinal study of RA outcomes in the US National Data Bank for Rheumatic Diseases (NDB) [5]. Participants were volunteers recruited from the practices of US rheumatologists who completed mailed or internet questionnaires about their health at 6-month intervals between 2002 and 2011. The participants were not compensated for their participation. For each patient, the diagnosis of RA was made by their rheumatologist. The NDB uses an open cohort design in which patients are enrolled continuously.


The HAQ-II consists of 10 items [1]. Each item is scored on a 4-point rating scale, where 0 = without any difficulty and 3 = unable to do. The HAQ-II is scored by averaging the items when at least 8 items are completed. Pain and general health were assessed using an 11-point numerical rating scale (range 0–10) in the Dutch studies and a 10-cm visual analog scale in the US study. Higher values indicated worse pain or general health in both countries.
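The scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch, not the official scoring routine; the function name `haq2_score` and the use of `None` for unanswered items are assumptions made for the example.

```python
from typing import Optional, Sequence


def haq2_score(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Mean of the answered HAQ-II items, or None if fewer than 8 are answered.

    Each response is an integer from 0 (without any difficulty) to
    3 (unable to do); None marks an unanswered item (assumed convention).
    """
    if len(responses) != 10:
        raise ValueError("the HAQ-II has 10 items")
    answered = [r for r in responses if r is not None]
    if any(r not in (0, 1, 2, 3) for r in answered):
        raise ValueError("item scores must be 0, 1, 2, or 3")
    if len(answered) < 8:
        return None  # too many missing items to compute a score
    return sum(answered) / len(answered)
```

For example, a patient who skips 2 items and answers the remaining 8 still receives a score, whereas a patient who skips 3 items does not.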

Statistical analysis

IRT model

The generalized partial credit model (GPCM) [6] was used to obtain item and person parameters using marginal maximum likelihood estimation for all analyses. The GPCM is an IRT model for polytomously scored items such as the items of the HAQ-II. It models the relationship between item responses and the measured trait using person and item parameters. In the case of the HAQ-II, person parameters reflect an individual's estimated standing on a latent PF continuum. Because higher HAQ-II scores indicate worse PF, higher values on the latent continuum also indicate worse PF. Each HAQ-II item is characterized by a discrimination parameter that reflects the ability of the item to discriminate between levels of PF and by 3 threshold parameters that each mark the location on the latent PF continuum where 2 consecutive response options are equally likely to be endorsed. Higher threshold values indicate that respondents favor the higher response option only at higher levels on the latent continuum (i.e., higher thresholds reflect the easiness of the item at the level of response options). Since item characteristics are statistically separated from respondent characteristics, crosscultural equivalence can be investigated by evaluating whether the item parameters are identical across cultures.
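The role of the discrimination and threshold parameters can be illustrated with a minimal Python sketch of the GPCM category probabilities. It assumes the common parametrization in which the log-odds of choosing category k over category k − 1 equal α(θ − β_k); the software used in the study may parametrize the model differently, and the function names and parameter values below are hypothetical.

```python
import math
from typing import List, Sequence


def gpcm_probabilities(theta: float, alpha: float,
                       thresholds: Sequence[float]) -> List[float]:
    """Category probabilities for one item under the GPCM.

    Assumed parametrization: category k has the unnormalized log-probability
    sum_{v=1..k} alpha * (theta - beta_v), with 0 for the lowest category.
    """
    logits = [0.0]
    for beta in thresholds:
        logits.append(logits[-1] + alpha * (theta - beta))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - peak) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expected_item_score(theta: float, alpha: float,
                        thresholds: Sequence[float]) -> float:
    """Model-expected item score (0-3 for a HAQ-II item) at a given theta."""
    probs = gpcm_probabilities(theta, alpha, thresholds)
    return sum(k * p for k, p in enumerate(probs))
```

Under this parametrization, a respondent located exactly at a threshold is equally likely to choose either of the 2 adjacent response options, and the expected item score rises with θ (i.e., with worse PF).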

Differential item functioning (DIF) analysis

Preceding the analysis of crosscultural measurement equivalence, DIF was evaluated for background variables that might confound the results of the main analysis. DIF occurs if, at a given level of the latent trait, the item response depends on a background variable. Subsequently, to investigate crosscultural measurement equivalence, we evaluated the presence of crosscultural DIF. Estimates of the GPCM parameters were computed, assuming a separate ability distribution for each country. Lagrange multiplier (LM) statistics, which are based on the difference between mean observed item scores and mean item scores expected by the model, were calculated to establish DIF [7]. In addition to formally testing the null hypothesis that item parameters are the same across countries, the LM tests are accompanied by effect size (ES) statistics that reflect the mean deviation of observed item scores from the model expectations (absolute residuals) [7]. The technique is sensitive to uniform DIF (i.e., an item is systematically more difficult for either population) and nonuniform DIF (i.e., an item is less strongly related to the overall PF trait in either population) [7]. Because of the large sample sizes in this study and the high number of statistical tests performed, we expected minor model deviations to reach statistical significance. Therefore, an ES >0.10 was considered to indicate significant DIF across countries, irrespective of the statistical significance of the LM tests. This cutoff point has previously been used in research related to the original HAQ DI, which has the same response format as the HAQ-II [8]. Initially, the much larger US sample dominated the concurrent item parameter estimates and spuriously inflated the deviation of observed Dutch scores from the model expectations. To obtain comparable parameter estimates, we created 5 random US samples with the same number of respondents as the total Dutch sample and reran the analysis 5 times, once with each of the random US samples. The results were analyzed on a sample-by-sample basis and pooled by calculating ES statistics after averaging the observed and expected mean item scores for each score group across samples.
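The subsampling and pooling steps can be sketched as follows. This is a simplified illustration in Python, not the authors' software: it assumes the ES is the mean absolute difference between observed and expected mean item scores over the score groups, and that each US subsample is drawn without replacement and independently of the others (the text does not specify this). All function names are hypothetical.

```python
import random
from typing import List, Sequence


def draw_subsamples(ids: Sequence[int], size: int,
                    n_samples: int = 5, seed: int = 0) -> List[List[int]]:
    """Draw n_samples random subsamples, each without replacement."""
    rng = random.Random(seed)
    pool = list(ids)
    return [rng.sample(pool, size) for _ in range(n_samples)]


def effect_size(observed: Sequence[float], expected: Sequence[float]) -> float:
    """Mean absolute residual of mean item scores over the score groups."""
    if len(observed) != len(expected) or not observed:
        raise ValueError("need one observed and one expected mean per score group")
    return sum(abs(o - e) for o, e in zip(observed, expected)) / len(observed)


def pooled_effect_size(observed_by_sample: Sequence[Sequence[float]],
                       expected_by_sample: Sequence[Sequence[float]]) -> float:
    """Average each score group's means across samples, then compute the ES."""
    n = len(observed_by_sample)
    observed = [sum(group) / n for group in zip(*observed_by_sample)]
    expected = [sum(group) / n for group in zip(*expected_by_sample)]
    return effect_size(observed, expected)
```

An item would then be flagged for DIF when its ES exceeds the 0.10 cutoff, regardless of the P value of the LM test.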

Next, we created an overall fitting model for the full data set that accounted for the observed crosscultural DIF. This was achieved by assigning country-specific item parameters to items exhibiting DIF. To evaluate the fit of this final respecified model, LM statistics pertaining to the form of the item response curves were calculated after separate parameters had been assigned to items with substantial crosscultural DIF. The rationale of this test is to partition the latent PF continuum into a number of segments and to evaluate whether the item characteristic curve of an item conforms to the form predicted by the model in each of these segments [9]. These statistics can be used to identify misfitting items and together provide a test of overall model fit. Fit was considered acceptable if no item had an ES >0.10.

Impact of DIF

Finally, we investigated the influence of crosscultural DIF on total test scores. Because DIF might compromise the comparability of scores across cultures but might also cancel out across items, its impact was investigated within a framework that pertains to the total scores [10]. We first estimated the latent PF level of each patient from the item parameters of the original model. We then assigned country-specific parameters to the items exhibiting significant DIF and reestimated the latent PF levels. The agreement between the 2 sets of latent PF estimates was analyzed by calculating the intraclass correlation coefficient (ICC) and the limits of agreement according to the Bland-Altman method [11]. The ICCs were calculated from a 2-way random-effects model for single measures. The assumptions of the 95% limits of agreement approach, namely that the mean and SD of the differences are constant throughout the range of measurements and that the differences are approximately normally distributed, were visually inspected after the limits of agreement and mean differences were calculated.
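The 2 agreement statistics can be sketched in plain Python. The sketch assumes that the differences are computed as the first set of estimates minus the second, and that the ICC is the absolute-agreement, single-measures form of the 2-way random-effects model (often written ICC(2,1)); the text does not state the direction of the differences or the agreement definition, and the function names are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def bland_altman_limits(a: Sequence[float],
                        b: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean difference, lower limit, upper limit) for a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    return mean_diff, mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


def icc_two_way_random_single(a: Sequence[float], b: Sequence[float]) -> float:
    """ICC(2,1) for 2 sets of estimates on the same patients."""
    n, k = len(a), 2
    grand = (sum(a) + sum(b)) / (n * k)
    row_means = [(x + y) / k for x, y in zip(a, b)]
    col_means = [sum(a) / n, sum(b) / n]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for pair in zip(a, b) for v in pair)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Here `a` and `b` would hold the latent PF estimates from the original and the respecified model for the same patients, in the same order.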

Results

The demographic and clinical characteristics of the study sample are shown in Table 1. The US sample was composed of fewer men and the mean age was lower; the samples were otherwise comparable. The Dutch samples did not significantly differ by HAQ-II scores, pain, and general health, although patients in the first study were significantly younger (mean difference 3.87 years; P < 0.01) and had significantly shorter disease duration (mean difference 4.96 years; P < 0.01).

Table 1. Sample characteristicsᵃ

|  | Dutch patients (n = 1,022) | US patients (n = 18,747) |
|---|---|---|
| Sex, % women | 69.4 | 78.8 |
| Age, years | 64.1 ± 13.3 | 60.8 ± 13.2 |
| Disease duration, years | 15.3 ± 13.3 | 14.8 ± 11.31 |
| HAQ-II (range 0–3) | 1.01 ± 0.65 | 0.99 ± 0.67 |
| Pain (range 0–10)ᵇ | 4.4 ± 2.5 | 3.8 ± 2.8 |
| General health (range 0–10)ᵇ | 4.4 ± 2.7 | 3.7 ± 2.5 |

ᵃ Values are the mean ± SD unless otherwise indicated. HAQ-II = Health Assessment Questionnaire II.
ᵇ Numerical rating scale for Dutch patients; visual analog scale for US patients.

DIF analysis

Because the US and Dutch samples significantly differed from each other in terms of age and sex, we evaluated the presence of DIF for age (after creating 3 equally large age groups) and sex in the combined Dutch and US samples preceding further analysis. The results indicated that no items showed DIF for either variable. The ES statistics ranged from 0.00–0.05 for age and 0.00–0.07 for sex.

Next, 5 crosscultural DIF analyses were done using the 5 US samples and the Dutch data (Table 2). Most tests reached statistical significance in most of the samples. However, the observed deviations from the model expectations were minor and only items 3 (are you able to stand up from a straight chair) and 7 (are you able to walk up 2 flights of stairs) exceeded the ES threshold of 0.10. Moreover, items 3 and 7 were consistently identified as exhibiting uniform DIF in all 5 samples, whereas none of the remaining items were flagged for DIF in any of the samples (Table 2).

Table 2. Differential item functioning across countriesᵃ

| Item | LM (range) | df | P (range) | ES (range) |
|---|---|---|---|---|
| Get on and off the toilet? | 24.86 (16.07–29.98) | 3 | 0.00 (0.00–0.00) | 0.04 (0.03–0.05) |
| Open car doors? | 4.47 (2.32–3.03) | 3 | 0.26 (0.09–0.51) | 0.02 (0.01–0.02) |
| Stand up from a straight chair? | 79.01 (70.11–88.46) | 3 | 0.00 (0.00–0.00) | 0.11 (0.11–0.11) |
| Walk outdoors on flat ground? | 6.60 (1.81–19.28) | 3 | 0.30 (0.00–0.61) | 0.01 (0.00–0.01) |
| Wait in a line for 15 minutes? | 33.54 (23.87–36.83) | 3 | 0.00 (0.00–0.00) | 0.07 (0.06–0.07) |
| Reach and get down an object from just above your head? | 9.39 (4.37–12.14) | 3 | 0.06 (0.01–0.19) | 0.02 (0.02–0.02) |
| Go up 2 or more flights of stairs? | 89.05 (73.26–98.45) | 3 | 0.00 (0.00–0.00) | 0.12 (0.10–0.12) |
| Do outside work (such as yard work)? | 6.09 (2.45–10.28) | 3 | 0.18 (0.02–0.48) | 0.01 (0.00–0.02) |
| Lift heavy objects? | 17.94 (11.37–32.29) | 3 | 0.00 (0.00–0.01) | 0.01 (0.00–0.01) |
| Move heavy objects? | 5.48 (4.17–7.57) | 3 | 0.15 (0.06–0.24) | 0.03 (0.01–0.05) |

ᵃ LM (range) = the mean and range of the Lagrange multiplier statistic over the 5 random samples; P (range) = the mean and range of P values over the 5 random samples; ES (range) = effect size defined as the mean and range of observed scores minus the expected scores over 3 score groups, averaged over 5 random samples.

IRT modeling

Item 7, which had the largest amount of DIF according to the previous analysis, was assigned country-specific parameters first. The DIF analyses were repeated at this point to evaluate whether the DIF in item 3 was still present because the presence of DIF in 1 item potentially influences the estimated item parameters of the other items [12]. The DIF in item 3 was still present at this point; therefore, item 3 was assigned country-specific parameters as well. The subsequent analysis failed to identify more items with substantial DIF and fit for this respecified model was investigated next.
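The stepwise procedure described above can be summarized in a short Python sketch. The model refitting itself is not shown: `estimate_effect_sizes` is a hypothetical callback standing in for the IRT software, and the function name and default cutoff argument are assumptions made for the illustration.

```python
from typing import Callable, Dict, List


def free_dif_items_stepwise(
    estimate_effect_sizes: Callable[[List[int]], Dict[int, float]],
    threshold: float = 0.10,
) -> List[int]:
    """Return the items given country-specific parameters, in the order chosen.

    estimate_effect_sizes(freed) is a hypothetical callback that refits the
    model with country-specific parameters for the items in `freed` and
    returns the crosscultural DIF effect size of every remaining item.
    """
    freed: List[int] = []
    while True:
        sizes = estimate_effect_sizes(list(freed))
        flagged = {item: es for item, es in sizes.items()
                   if item not in freed and es > threshold}
        if not flagged:
            return freed  # no remaining item exceeds the cutoff
        # Free the item with the largest DIF first, then refit and recheck.
        freed.append(max(flagged, key=flagged.get))
```

Freeing 1 item at a time and refitting guards against the possibility that DIF in 1 item distorts the parameter estimates, and hence the apparent DIF, of the other items.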

Overall model fit

The validity of the model with country-specific parameters was evaluated in the total data set. The LM tests for item characteristic curves were obtained, along with the estimated item parameters. For the present analysis, observed and posterior expected scores were computed using a partitioning of respondents into 3 equally large score level groups. As anticipated, most of the LM statistics reached statistical significance because of the sample size, especially in the US sample. However, Table 3 shows that the absolute differences between the observed mean item scores and mean item scores expected by the model were of a negligible magnitude for most items, considering the 0–3 rating scale range of the HAQ-II. At the scale level, the mean difference between the model-expected HAQ-II scores and observed HAQ-II scores was 0.019 (1.009 versus 0.99) for the US and 0.043 (1.053 versus 1.01) for the Dutch data. Taking into consideration the small discrepancy between observed and expected HAQ-II item and total scores, the overall conclusion was that the model with country-specific parameters for items 3 and 7 fit very well. This finding indicates that the same underlying latent scale of PF applied to both countries, except that the US patients experienced relatively more difficulty standing up from a straight chair (item 3) and going up 2 flights of stairs (item 7) than the Dutch patients, as indicated by lower threshold parameters (Table 3).

Table 3. Overall model fit and item parametersᵃ

| Item | α (SD) | β1 (SD) | β2 (SD) | β3 (SD) | US sample (n = 18,747): LM (P) | US sample: ES | Dutch sample (n = 1,022): LM (P) | Dutch sample: ES |
|---|---|---|---|---|---|---|---|---|
| Get on and off the toilet? | 1.69 (0.031) | 1.21 (0.001) | 4.13 (0.001) | 5.87 (0.006) | 207.42 (<0.001) | 0.02 | 81.76 (<0.001) | 0.11 |
| Open car doors? | 1.64 (0.030) | 1.30 (0.072) | 3.90 (0.508) | 4.5 (0.001) | 83.92 (<0.001) | 0.01 | 2.16 (0.340) | 0.02 |
| Stand up from a straight chair? (Dutch) | 1.71 (0.156) | 1.06 (0.124) | 3.70 (0.012) | 5.34 (0.009) |  |  | 10.31 (0.010) | 0.02 |
| Stand up from a straight chair? (US) | 1.71 (0.028) | 0.09 (0.002) | 3.10 (0.003) | 4.95 (0.005) | 239.87 (<0.001) | 0.02 |  |  |
| Walk outdoors on flat ground? | 1.81 (0.029) | 0.78 (0.001) | 3.04 (0.005) | 4.54 (0.026) | 263.8 (<0.001) | 0.03 | 4.85 (0.090) | 0.04 |
| Wait in a line for 15 minutes? | 1.90 (0.026) | −0.67 (0.001) | 1.34 (0.001) | 2.93 (0.001) | 88.61 (<0.001) | 0.02 | 13.36 (<0.001) | 0.09 |
| Reach and get down an object from just above your head? | 1.91 (0.028) | −0.35 (0.001) | 2.22 (0.001) | 2.88 (0.003) | 41.4 (<0.001) | 0.01 | 7.02 (0.030) | 0.04 |
| Go up 2 or more flights of stairs? (Dutch) | 1.75 (0.114) | −0.29 (0.014) | 1.453 (0.001) | 2.65 (0.001) |  |  | 2.81 (0.250) | 0.03 |
| Go up 2 or more flights of stairs? (US) | 2.06 (0.029) | −1.498 (0.010) | 1.142 (0.001) | 2.71 (0.001) | 51.6 (<0.001) | 0.01 |  |  |
| Do outside work (such as yard work)? | 2.96 (0.042) | −2.76 (0.001) | 1.24 (0.001) | 2.68 (0.003) | 88.4 (<0.001) | 0.03 | 13.4 (0.060) | 0.04 |
| Lift heavy objects? | 3.99 (0.067) | −5.94 (0.004) | −0.95 (0.002) | 1.73 (0.002) | 239.8 (<0.001) | 0.02 | 3.96 (0.140) | 0.04 |
| Move heavy objects? | 3.99 (0.072) | −5.56 (0.001) | −0.35 (0.001) | 2.35 (0.001) | 56.4 (<0.001) | 0.02 | 6.94 (0.030) | 0.03 |

ᵃ α = discrimination parameter reflecting the relative strength of the item with the latent trait, like factor loadings in factor analysis; β = threshold parameter reflecting the point on the latent scale where respondents have a 50% likelihood of choosing consecutive response options (i.e., item difficulty); LM = Lagrange multiplier statistic; ES = effect size.

Effect of DIF

The agreement in latent PF estimates between the original model and the model respecified to allow country-specific item parameters for items 3 and 7 was very high. The ICC was >0.999 and the limits of agreement ranged from −0.08 to −0.01 (Figure 1). Given that the latent scale was set to have a mean of 0 and an SD of 1 for the US patients (mean of 0.137 and SD of 1.072 for the Dutch patients), these results suggest that the bias at the test level was very minor.

Figure 1. Agreement between latent Health Assessment Questionnaire II (HAQ-II) scores and latent HAQ-II scores with country-specific item parameters for items 3 and 7, displayed in a Bland-Altman diagram including lines for the mean difference and upper and lower 95% limits of agreement (R² = 0.10).

Discussion

The objective of this study was to evaluate the crosscultural equivalence of the Dutch HAQ-II using IRT modeling. The results suggest that all HAQ-II items functioned equivalently across sex and age, but items 3 and 7 were both slightly more difficult for US patients. However, at the test level, the impact of DIF on total HAQ-II scores was negligible, supporting the crosscultural equivalence of Dutch and US HAQ-II scores.

The observed bias in items 3 and 7 is difficult to explain in terms of obvious translation or cultural issues. One of 2 previous studies that evaluated the cultural equivalence of US and European versions of PF scales also found that an item about standing up from a straight chair was easier for Dutch than for Canadian patients [13]. Conversely, items about ascending stairs were not culturally biased in either of the studies [13, 14]. Given the small absolute magnitude of DIF, the strict definition of DIF used in the current study, and the inconsistent findings of similar items exhibiting DIF in previous studies, cultural bias of these items should probably not be exaggerated.

In the subsequent analysis of the impact of DIF, we found that the PF levels of patients estimated from a model that accounted for the observed DIF by assigning country-specific item parameters to significantly biased items and those from a model that did not were highly in agreement with each other. At the test level, the observed DIF appeared to have very little impact on estimated PF level.

In cases where substantial test-level bias is observed, the procedure used in this study can be employed to make scores comparable again between countries. The resulting model would then argue that the same construct is measured in both countries, but some of the item locations on the latent scale are different, reflecting the observed bias. This is an interesting feature because in some cases it is impossible to correct for observed crosscultural bias. For example, even if it were known beforehand that patients from the 2 countries respond differently to the item “are you able to stand up from a straight chair,” it would be challenging to correct for this observed difference without changing the conceptual and semantic meaning of the question. This approach can also be extended to situations where different, culturally relevant items are used for different cultures, provided there are sufficient anchor items administered in both countries that are not culturally biased.

Previous studies have established the unidimensionality of the HAQ-II while using the same data as the current study. Therefore, we did not analyze the dimensionality of the HAQ-II preceding the IRT analysis [1, 4]. Considering that previous studies have demonstrated the psychometric comparability of the HAQ-II to the original HAQ-DI, the HAQ-II may be substituted for the HAQ-DI in clinical research in RA. The HAQ-II has the advantage of being shorter and easier to score, which reduces the burden on physicians and patients.

The present study has several limitations. First, the HAQ-II items were administered alongside other PF items, which might have influenced response behavior. Generalization of the results to situations where the HAQ-II is administered as an autonomous questionnaire should therefore be done with some reservation. Second, the different study settings in the 2 countries caused the samples to differ in key demographics. Although we did not observe DIF for age or sex, unobserved systematic differences between samples might have influenced the results.

Overall, this study suggests that the Dutch and US HAQ-II produce total scores that can be interpreted interchangeably across cultures, despite some minor bias at the item level.


All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Mr. Oude Voshaar had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Oude Voshaar, Glas, ten Klooster, Taal, Wolfe, van de Laar.

Acquisition of data. Ten Klooster, Taal, Wolfe, van de Laar.

Analysis and interpretation of data. Oude Voshaar, Glas.