Disease-related differential item functioning in the work instability scale for rheumatoid arthritis: Converging results from three methods


  • Kenneth Tang,

    Corresponding author
    1. University of Toronto, Li Ka Shing Knowledge Institute of St. Michael's Hospital, and Institute for Work & Health, Toronto, Ontario, Canada
    • Mobility Program Clinical Research Unit from the Li Ka Shing Knowledge Institute of St. Michael's Hospital, 30 Bond Street, Toronto, Ontario, Canada, M5B 1W8
  • Canadian Arthritis Network Work Productivity Group



The 23-item Work Instability Scale for Rheumatoid Arthritis (RA-WIS) is a promising measure to assess risk for future work disability. Validated in both rheumatoid arthritis (RA) and osteoarthritis (OA), it has high potential for cross-disease applications. Our objective was to examine disease-related differential item functioning (DIF) in the RA-WIS.


Workers with RA (n = 120) or OA (n = 130) were recruited from 3 sites and completed a questionnaire consisting of demographic and health- and work-related variables, including the RA-WIS (range 0–23, where 23 = highest work instability). Multiple DIF detection methods were applied for comparability: 1) Mantel-Haenszel and Breslow-Day procedures, 2) hierarchical 3-step sequential logistic regression procedure, and 3) a 1-parameter item response theory approach (Rasch analysis). Both tests of significance (chi-square and F tests) and effect size statistics (ΔMH, ΔR2) were assessed to confirm items demonstrating uniform or nonuniform DIF. A 2-step purification procedure was applied to establish a DIF-free conditioning variable (total RA-WIS score) for DIF analyses. The resultant impact of disease-related DIF at the scale level was also evaluated.


All 3 DIF detection methods converged to reveal 3 RA-WIS items as having significant disease-related uniform DIF. Two items (“difficulty opening doors” and “pressure on hand”) were more likely affirmed in RA, while 1 item (“very stiff”) was more likely affirmed in OA. Overall, only a marginal impact at the scale level was found due to a small proportion of scale items exhibiting DIF and the bidirectional nature of DIF effects.


RA-WIS scores can be directly compared between RA and OA without significant concerns for DIF-related measurement bias.


Work disability associated with rheumatoid arthritis (RA) or osteoarthritis (OA) is an important concern in the working population (1–3). Measures that can identify workers experiencing difficulties meeting their job demands are needed to facilitate early interventions, and to help minimize the risk for more adverse outcomes such as permanent work loss. Recent evidence suggests the Work Instability Scale for Rheumatoid Arthritis (RA-WIS) is a promising measure for such a purpose (4, 5). The RA-WIS is a self-report multi-item scale that examines a wide range of constructs, including perceptions of performance and stamina at work, issues related to time management, symptom control, and cognitive distress. Difficulties with these constructs contribute to what developers have termed “work instability” (WI), defined as a mismatch between a person's functional and cognitive abilities in relation to demands at work (4). The RA-WIS has been validated in both RA (4, 6) and OA (6, 7), which is important for encouraging standardized applications of this outcome across multiple forms of arthritis. Yet, do RA-WIS items function consistently between RA and OA, and are scale scores directly comparable across these 2 forms of arthritis? Because the RA-WIS was originally developed for RA (4), we hypothesized that some of its items may have less intrinsic relevance for other forms of arthritis, which could bias comparisons of scores across different populations.

Differential item functioning (DIF) analysis can assess whether the probability of item response is systematically linked to respondent characteristics that are unrelated to the concept of the measure itself (8, 9). Such analysis could be applied to help inform whether the RA-WIS operates consistently between RA and OA at both the item and scale levels. Evidence of “disease-related” DIF in the RA-WIS could indicate some underlying difference in item relevance or in the way specific items are perceived or interpreted between different arthritis subgroups, and raise concerns for cross-disease measurement bias. If a considerable proportion of scale items exhibit such DIF, this could invalidate direct comparisons of RA-WIS scores between RA and OA. Moreover, this would also suggest the need to identify unique score cut points for each form of arthritis.

Examples of popular DIF detection methods include the Mantel-Haenszel procedure (10), logistic regression (11) and, more recently, approaches based on item response theory (IRT) measurement frameworks. To date, most attention has been given to investigations of DIF associated with age (12, 13), sex (12, 14, 15), language/translations (16–18), or culture (12, 19, 20), but few studies have examined disease-related DIF (21). Given the high relevance of work disability in both RA and OA and the potential of the RA-WIS for cross-disease applications, the current objective was to assess for disease-related DIF in this measure and its impact on the comparability of scores between RA and OA at the scale level. Multiple DIF detection methods were applied for this investigation to afford an opportunity to examine the comparability of results across different procedures.

Significance & Innovations

  • Diverse methods converged to identify 3 items in the 23-item Work Instability Scale for Rheumatoid Arthritis (RA-WIS) as demonstrating significant disease-related differential item functioning (DIF) between rheumatoid arthritis (RA) and osteoarthritis (OA).

  • Effects of DIF were shown to cancel each other at the scale level; therefore, cross-disease comparisons of RA-WIS summed scores between RA and OA may be conducted without significant concerns for measurement bias.


Recruitment and sample size.

A total of 250 workers who had been diagnosed with either RA (n = 120) or OA (n = 130) were recruited for a 1-year cohort study from one of the following sites: 2 tertiary-level rheumatology clinics in urban teaching hospitals in Toronto, Ontario, Canada, or an outpatient arthritis treatment program providing multidisciplinary services in Vancouver, British Columbia, Canada. To be included, study participants must have been working for pay at least 1 month prior to recruitment, be able to understand written English, and have provided written consent for study participation. Research ethics board approval for this study was obtained at all of the participating institutions. At study baseline, all of the participants completed a questionnaire consisting of a series of demographic and health- and work-related variables, including the RA-WIS.

Measure: RA-WIS.

The RA-WIS consists of 23 dichotomous items, and aims to quantify the degree of mismatch between an individual's functional abilities and work demands (4). Participants responded to scale items by affirming (yes = 1) or not affirming (no = 0) specific work-related experiences that would indicate WI. Total scale scores range from 0–23, where higher scores indicate greater WI. In addition to existing evidence supporting its internal consistency, validity, and responsiveness (6, 7), its unidimensionality and lack of DIF for age or sex have also been demonstrated from Rasch analyses (4, 7). Among our 250 eligible participants, 239 completed the RA-WIS at baseline, with <10% missing entries (i.e., completed ≥21 of 23 items). Only data from these participants were included for the current analysis.

Descriptive statistics.

Descriptive statistics were used to provide an overview of the characteristics of study participants. Variables assessed included demographic information (e.g., age, sex, marital status) and health- (e.g., disease duration, Health Assessment Questionnaire [HAQ] [22]) and work-related variables (e.g., occupation type).

Overview of DIF analytic strategy.

Our overall approach was to apply 3 different methods to detect 2 types of DIF: uniform and nonuniform. Uniform DIF is a consistent between-group difference in item response probability across the full range of the measured (latent) trait, whereas nonuniform DIF is indicated by a varied between-group difference across the trait (group × trait interaction) (23). In this study, both tests of significance and magnitude measures (where available) were used in combination to identify both types of DIF, in accordance with previous recommendations (24, 25). Some debate exists regarding the optimal statistical criteria to confirm DIF by the various methods (26, 27); therefore, to minimize potential Type I errors, we have applied conservative statistical thresholds and also examined relative magnitudes or “degrees” of DIF among scale items.

Method 1: Mantel-Haenszel and Breslow-Day procedures.

The Mantel-Haenszel procedure (10) is a widely used approach that identifies uniform DIF based on analysis of 3-way (2 × 2 × κ) contingency tables via cross-tabulation of item response (column) by group (row) for every level of the “conditioning variable” (i.e., level of measured trait), where κ represents the number of possible scores for the measure (28). A chi-square test was initially performed to test the null hypothesis that there is no relationship between arthritis type (group) and an item being affirmed (response), controlling for the total RA-WIS score (conditioning variable). If the null hypothesis is rejected (significant chi-square), then disease-related DIF is suggested. The Mantel-Haenszel procedure also provides an effect size statistic known as the “common odds ratio” (αMH) that ranges from 0 to positive infinity, with the value of 1 indicating no DIF. For ease of interpretation, αMH can be transformed into a delta difference (ΔMH) using an ln formula (ΔMH = −2.35 ln[αMH]) (29) to place it on a scale that centers at 0 (ranges from negative to positive infinity). This statistic provides an indication of both DIF magnitude and direction, where a value of ≥|1.5| has been proposed as a significant effect size (28). For this study, we applied a threshold of a P value with Bonferroni correction (PBonf) of <0.05 for the chi-square test as an initial screen, and also expected ΔMH ≥|1.5| as a confirmatory test to verify uniform DIF. To complement the Mantel-Haenszel procedure, the Breslow-Day procedure was applied to assess nonuniform DIF (30, 31). The Breslow-Day procedure provides a chi-square test that assesses the homogeneity of the αMH across different levels of the conditioning variable. That is, if the difference in the probability of item affirmation between RA and OA varied across the range of total RA-WIS scores (PBonf < 0.05), then nonuniform DIF is indicated. No effect size statistic is currently available for the Breslow-Day procedure.
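The calculations described above can be sketched in a few lines of Python. This is a minimal illustration rather than the software used in the study: the function name, the layout of the 2 × 2 counts, the continuity-corrected form of the chi-square, and the use of 23 tests for the Bonferroni correction are all assumptions, and the counts are hypothetical.

```python
import math

def mantel_haenszel(strata):
    """Mantel-Haenszel DIF statistics for one item (illustrative sketch).

    strata: one (a, b, c, d) tuple of counts per level of the conditioning
    variable, where a/b = group 1 affirming/not affirming and
    c/d = group 2 affirming/not affirming (layout is an assumption).
    Returns (alpha_MH, delta_MH, continuity-corrected chi-square, P).
    """
    num = den = sum_a = sum_e = sum_v = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        num += a * d / n
        den += b * c / n
        sum_a += a
        sum_e += (a + b) * (a + c) / n  # expected count for cell a
        sum_v += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    alpha = num / den                # common odds ratio; 1 = no DIF
    delta = -2.35 * math.log(alpha)  # delta difference, centered at 0
    chi2 = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v
    p = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square survival function, 1 df
    return alpha, delta, chi2, p

# Hypothetical counts for two score levels (not study data).
alpha, delta, chi2, p = mantel_haenszel([(20, 10, 10, 20), (15, 5, 5, 15)])

# Two-step rule from the text: Bonferroni-corrected P < 0.05, then |delta| >= 1.5.
n_tests = 23  # assumed: one test per RA-WIS item
flagged = (min(1.0, p * n_tests) < 0.05) and (abs(delta) >= 1.5)
```

With this layout, αMH > 1 (negative ΔMH) means higher odds of affirmation in group 1 at matched total scores; which group is entered first is a convention and should be checked against the software actually used.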

Method 2: logistic regression (LR).

LR is also a common method for detecting DIF (11, 32), and a main advantage of this approach is the ability to simultaneously test for both uniform and nonuniform DIF (33). A hierarchical 3-step sequential binary LR modeling process was applied (34):

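\[
\begin{aligned}
\text{Model 1: } & \ln\!\left(\frac{p_i}{1-p_i}\right) = b_0 + b_1\,\mathrm{tot} \\
\text{Model 2: } & \ln\!\left(\frac{p_i}{1-p_i}\right) = b_0 + b_1\,\mathrm{tot} + b_2\,\mathrm{group} \\
\text{Model 3: } & \ln\!\left(\frac{p_i}{1-p_i}\right) = b_0 + b_1\,\mathrm{tot} + b_2\,\mathrm{group} + b_3\,(\mathrm{tot} \times \mathrm{group})
\end{aligned}
\]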

where pi = probability of affirming item i; b = parameter estimate; model 1: conditioning variable entered (tot = total RA-WIS score); model 2: group variable entered (group = arthritis type); and model 3: interaction term entered (conditioning × group variable).

As an initial screen, we examined the discrepancy in the −2 log likelihood between models 1 and 3 using a chi-square distribution with 2 df (11). If the model fit is significantly better (significant chi-square) in model 3, then some disease-related DIF is suggested for the item. A recommended 1% significance level (35) was applied for this initial screen. We then performed a 2-stage method to verify and differentiate uniform and nonuniform DIF (27, 36). The effect size for uniform DIF was assessed by the difference in Nagelkerke's R2 (ΔR2) between models 1 and 2, while the effect size for nonuniform DIF was determined by ΔR2 between models 2 and 3 (34). Criteria for negligible (ΔR2 < 0.035), moderate (0.035 < ΔR2 < 0.070), and large (ΔR2 > 0.070) magnitudes of DIF have been proposed (27, 37). In this study, we sought a minimum of ΔR2 > 0.035 to confirm either form of DIF.
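The 3-model sequence and its two statistics can be sketched as follows. This is an illustration under stated assumptions, not the study's implementation: a small Newton-Raphson fit in pure Python stands in for a statistical package, all function names are hypothetical, and the data are synthetic responses built with a constant group shift (i.e., uniform DIF by construction).

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logit(X, y, iters=50, tol=1e-8):
    """Fit a binary logistic model by Newton-Raphson; return its log likelihood."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            for r in range(k):
                grad[r] += (yi - p) * xi[r]
                for c in range(k):
                    hess[r][c] += p * (1.0 - p) * xi[r] * xi[c]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        ll += yi * eta - math.log1p(math.exp(eta))
    return ll

def nagelkerke(ll_model, ll_null, n):
    """Nagelkerke's R2: Cox-Snell R2 rescaled by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    return cox_snell / (1.0 - math.exp(2.0 * ll_null / n))

def lr_dif(tot, group, y):
    """Screen (models 1 vs 3, 2 df) and effect sizes (delta R2) for one item."""
    n = len(y)
    ll0 = fit_logit([[1.0] for _ in y], y)  # intercept only, for R2
    ll1 = fit_logit([[1.0, t] for t in tot], y)
    ll2 = fit_logit([[1.0, t, g] for t, g in zip(tot, group)], y)
    ll3 = fit_logit([[1.0, t, g, t * g] for t, g in zip(tot, group)], y)
    chi2 = 2.0 * (ll3 - ll1)
    p_value = math.exp(-chi2 / 2.0)  # exact chi-square survival function for 2 df
    r1, r2, r3 = (nagelkerke(ll, ll0, n) for ll in (ll1, ll2, ll3))
    return {"chi2": chi2, "p": p_value,
            "dR2_uniform": r2 - r1, "dR2_nonuniform": r3 - r2}

# Synthetic data: 10 respondents per total score (0-20) and group, with a
# constant 1.5-logit shift between groups. Not study data.
tot, group, y = [], [], []
for g in (0, 1):
    for t in range(21):
        p = 1.0 / (1.0 + math.exp(-(0.3 * (t - 10) + 1.5 * g)))
        k = round(10 * p)
        for j in range(10):
            tot.append(float(t))
            group.append(float(g))
            y.append(1.0 if j < k else 0.0)

result = lr_dif(tot, group, y)
```

In practice the same quantities would usually be taken from a standard logistic regression routine; the point of the sketch is only the order of the comparisons (models 1 versus 3 for the screen, then 1 versus 2 and 2 versus 3 for the effect sizes).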

Method 3: 1-parameter IRT approach (Rasch analysis).

The final approach was to apply a Rasch analysis to detect DIF, which required an initial fitting of the data to the dichotomous Rasch probabilistic model (38). This model asserts that the probability of a person endorsing an item is a logistic function of the difference between the person's ability (level of WI [θ]) and the level of WI represented by the item (b):

\[
\ln\!\left(\frac{p_{ni}}{1-p_{ni}}\right) = \theta_n - b_i
\]

where pni = the probability that a person n will affirm item i.
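As a small illustration of the model described above, the probability of affirmation can be computed directly from the person and item locations. The function name and the item locations below are hypothetical; the second part only shows what a constant between-group shift in item location (uniform DIF) looks like numerically.

```python
import math

def rasch_probability(theta, b):
    """P(affirm) under the dichotomous Rasch model; theta and b are in logits."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# A person located exactly at the item's level of WI affirms it half the time.
p_equal = rasch_probability(0.0, 0.0)

# Hypothetical uniform DIF: the same item is "easier" to affirm in one group
# (location -0.5) than in the other (location +0.5) at every level of WI.
levels = [-2.0, 0.0, 2.0]
curve_group_1 = [rasch_probability(t, -0.5) for t in levels]
curve_group_2 = [rasch_probability(t, 0.5) for t in levels]
```

Because both curves share the same slope, the group 1 curve lies above the group 2 curve across the whole range, which is the pattern described as a systematic shift between item characteristic curves.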

Criteria for fit to the Rasch model.

Proper fit to the Rasch model and transformation to interval-level scaling require a number of criteria to be satisfied: 1) accordance of data structure to the probabilistic form of Guttman scaling (39), 2) unidimensionality (40), and 3) local independence of items. Statistical criteria to be used to evaluate the fit to the Rasch model have been extensively described in a previous review (41), and are only described briefly here.

Fit of data to the Guttman structure.

The item–trait interaction chi-square statistic was used to determine if adequate fit to the Rasch model has been achieved. A customary PBonf > 0.05 threshold was applied to indicate accordance with the Guttman structure. If this was not met, we would perform a sequential removal of the most misfitting item from the model (according to the magnitude of item-fit chi-square statistic) until proper fit is achieved. The Person separation index (PSI) (42) was also assessed to provide an indicator of reliability for the measure, and a minimum of >0.70 was expected.

Tests of unidimensionality and local independence.

Tests of dimensionality were undertaken once the above statistical criteria for fit to the Guttman structure had been satisfied. This was assessed by performing a principal component analysis of the residuals to detect signs of multidimensionality within the scale (43). If a scale is unidimensional, no residual associations (factor structure) within the first residual component should exist once the factor for which item associations exist is extracted. To test this formally, we identified all positively (>0) and negatively (<0) loaded items based on the first residual component, and calculated summed scores associated with these subsets. Then, independent t-tests were conducted to compare the person logit estimates derived from these subsets (44). For a unidimensional scale, the percentage of significant tests (i.e., outside ±1.96) is expected to be less than 5% (or lower bound of associated 95% binomial proportions) (45). Local dependency is defined as consistency among item responses that is unaccounted for by individual differences on the measured construct (43, 46). Residual correlation >0.2 between any item pairs would indicate local dependency.
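The decision rule for unidimensionality can be sketched as below. Two points are assumptions rather than details given in the text: each person's pair of logit estimates is compared with a t statistic based on the two standard errors, and the binomial confidence limit uses the normal approximation. The function name and inputs are hypothetical.

```python
import math

def unidimensionality_check(est_pos, se_pos, est_neg, se_neg, crit=1.96):
    """Share of persons whose two subset-based estimates differ significantly.

    est_pos/est_neg: person logit estimates from the positively and negatively
    loading item subsets; se_pos/se_neg: their standard errors.
    Returns (proportion significant, lower 95% bound, passes).
    """
    n = len(est_pos)
    significant = 0
    for a, sa, b, sb in zip(est_pos, se_pos, est_neg, se_neg):
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > crit:
            significant += 1
    prop = significant / n
    # Normal-approximation lower bound of the 95% binomial interval (assumed form).
    lower = max(0.0, prop - 1.96 * math.sqrt(prop * (1.0 - prop) / n))
    return prop, lower, (prop < 0.05) or (lower < 0.05)

# Hypothetical estimates for 20 persons; only the first pair differs markedly.
est_pos = [3.0] + [0.0] * 19
est_neg = [0.0] * 20
se = [0.5] * 20
prop, lower, passes = unidimensionality_check(est_pos, se, est_neg, se)
```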

DIF assessment.

Item characteristic curves (ICCs) derived from Rasch analysis can provide a graphical illustration of the relationship between the latent trait (i.e., WI) and the probability that respondents with a given level of this trait will affirm an item (28, 47). In this study, ICCs were analyzed both visually and statistically to determine the presence of DIF. An item without disease-related DIF should display overlapping ICCs for RA and OA. Uniform DIF would be indicated by a systematic shift between the 2 ICCs across the full spectrum of the latent trait, whereas nonuniform DIF would be indicated by nonparallel ICCs that may intersect over the range of the latent trait. Statistically, DIF was identified by a 2-way analysis of variance of the residuals, where statistical significance in the F test (PBonf < 0.05) for arthritis type (RA versus OA) or the interaction between arthritis type and level of WI would verify the presence of uniform or nonuniform DIF, respectively (48). Effect size (magnitude) statistics have yet to be established for this approach, and therefore, only tests of significance were relied upon to identify items with DIF. Rasch analysis was carried out using RUMM2020 software (49).

Purification of the conditioning variable.

For all DIF analyses in this study, the recommended 2-step purification procedure (24) was applied to verify that items are not exhibiting “pseudo-DIF,” i.e., an apparent opposing DIF caused by other DIF items (50). By this procedure, an initial run was performed to identify probable DIF candidates. Then, all items were retested with a DIF-free conditioning variable (i.e., a “purified” RA-WIS scale score) except for candidate items, which were examined with the specific item included in the scale score as part of the conditioning variable (32).
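The second step of the purification procedure amounts to rebuilding the conditioning variable item by item. The sketch below follows the description above for summed scores; the function name, the 0-based item indices, and the toy responses are hypothetical.

```python
def conditioning_scores(responses, dif_candidates):
    """Build the conditioning score used to retest each item.

    responses: one list of 0/1 item scores per person.
    dif_candidates: set of 0-based item indices flagged in the initial run.
    Returns {item index: list of per-person conditioning scores}.
    """
    n_items = len(responses[0])
    clean = [i for i in range(n_items) if i not in dif_candidates]
    scores = {}
    for item in range(n_items):
        # Non-candidate items are conditioned on the purified score only;
        # a candidate item is conditioned on the purified score plus itself.
        used = clean + [item] if item in dif_candidates else clean
        scores[item] = [sum(person[i] for i in used) for person in responses]
    return scores

# Toy example: 2 persons, 4 items, item index 2 flagged in the first run.
responses = [[1, 0, 1, 1], [0, 1, 1, 0]]
scores = conditioning_scores(responses, {2})
```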

DIF impact.

In addition to identifying specific items exhibiting DIF, assessments of the overall impact at the scale level are also important (25). Therefore, if disease-related DIF was evident, we would examine how WI differences between RA and OA subgroups in our current study sample would be influenced if different versions of the RA-WIS (ignoring versus accounting for DIF) were applied.


A summary of the demographic and health-related characteristics of the study sample is provided in Table 1. Only age (P < 0.0001) and occupation type (P = 0.02) differed between arthritis subgroups. The sample mean ± SD of the RA-WIS was 8.3 ± 6.3, and distributions of item-level response are shown in Table 2.

Table 1. Demographic and health- and work-related characteristics of all of the study participants (n = 239)*

|  | RA (n = 112) | OA (n = 127) | Difference, P |
| --- | --- | --- | --- |
| Age |  |  |  |
|  No. available | 110 | 124 |  |
|  Mean ± SD, years | 46.1 ± 10.6 | 53.8 ± 6.7 | < 0.0001 |
|  Range, years | 19–64 | 24–65 |  |
| Sex |  |  |  |
|  No. available | 111 | 125 |  |
|  Women, no. (%) | 98 (88.3) | 101 (80.8) | ns |
| Marital status |  |  |  |
|  No. available | 111 | 125 |  |
|  Married, % | 50.5 | 59.2 | ns |
|  Divorced, % | 19.8 | 20.0 |  |
|  Widowed, % | 3.6 | 2.4 |  |
|  Single, % | 24.3 | 13.6 |  |
|  In committed relationship, % | 1.8 | 4.8 |  |
| Education level |  |  |  |
|  No. available | 111 | 125 |  |
|  High school or less, % | 18.0 | 15.2 | ns |
|  Some college/university, % | 18.9 | 16.8 |  |
|  University/college/technical school graduate, % | 63.1 | 68.0 |  |
| Occupation type |  |  |  |
|  No. available | 108 | 120 |  |
|  Business, finance, administration, % | 40.7 | 45.0 | 0.02 |
|  Health, science, arts, sports, % | 28.7 | 40.8 |  |
|  Sales and services, % | 23.2 | 10.8 |  |
|  Trades, transport, equipment operators, % | 7.4 | 3.3 |  |
| General health: SF-1 (range 1–5, 5 = poor) |  |  |  |
|  No. available | 112 | 126 |  |
|  Mean ± SD | 2.8 ± 0.8 | 2.6 ± 1.0 | ns |
| Duration of arthritis |  |  |  |
|  No. available | 109 | 123 |  |
|  <1 year, % | 8.3 | 10.6 | ns |
|  1–5 years, % | 29.4 | 41.5 |  |
|  >5 years, % | 62.4 | 48.0 |  |
| Self-rated arthritis severity (range 1–7, 1 = very mild and 7 = very severe) |  |  |  |
|  No. available | 112 | 127 |  |
|  Mean ± SD | 3.0 ± 1.7 | 3.4 ± 1.9 | ns |
| HAQ disability (range 0–3, 3 = high disability) |  |  |  |
|  No. available | 112 | 127 |  |
|  Mean ± SD | 0.7 ± 0.6 | 0.7 ± 0.6 | ns |
| Work instability: RA-WIS (range 0–23, 23 = high work instability) |  |  |  |
|  No. available | 112 | 127 |  |
|  Mean ± SD | 7.9 ± 5.9 | 8.7 ± 6.8 | ns |

  • * RA = rheumatoid arthritis; OA = osteoarthritis; ns = not statistically significant (P > 0.05); SF-1 = Short Form 1; HAQ = Health Assessment Questionnaire; RA-WIS = Work Instability Scale for Rheumatoid Arthritis.

  • Chi-square proportions test or independent t-test applied.

  • National Occupational Classification developed by Human Resources and Skills Development Canada.

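Table 2. Distribution of response for Work Instability Scale for Rheumatoid Arthritis items, by arthritis type*

| Item | RA (n = 112), % yes | RA, % no | RA, % missing | OA (n = 127), % yes | OA, % no | OA, % missing |
| --- | --- | --- | --- | --- | --- | --- |
| 1. Slow | 34.8 | 65.2 | 0.0 | 41.7 | 55.9 | 2.4 |
| 2. Reduce | 11.6 | 88.4 | 0.0 | 15.8 | 82.7 | 1.6 |
| 3. Worried | 26.8 | 73.2 | 0.0 | 29.9 | 69.3 | 0.8 |
| 4. Pain | 31.3 | 68.8 | 0.0 | 50.4 | 49.6 | 0.0 |
| 5. Stamina | 56.3 | 43.8 | 0.0 | 65.4 | 34.7 | 0.0 |
| 6. Holiday | 9.8 | 90. |  |  |  |  |
| 7. Push | 45.5 | 54.5 | 0.0 | 54.3 | 45.7 | 0.0 |
| 8. Face | 40.2 | 58.9 | 0.9 | 36.2 | 63.8 | 0.0 |
| 9. Say no | 39.3 | 60.7 | 0.0 | 42.5 | 56.7 | 0.8 |
| 10. Watch | 51.8 | 48.2 | 0.0 | 55.9 | 43.3 | 0.8 |
| 11. Open | 37.5 | 62. |  |  |  |  |
| 12. Extra | 40. |  |  |  |  |  |
| 13. Frustrated | 18.8 | 81. |  |  |  |  |
| 14. Give up | 11.6 | 88.4 | 0.0 | 15.8 | 83.5 | 0.8 |
| 15. Get on | 27.7 | 72.3 | 0.0 | 44.9 | 54.3 | 0.8 |
| 16. Tired | 44.6 | 55.4 | 0.0 | 45.7 | 54.3 | 0.0 |
| 17. Restrict | 18.8 | 81.3 | 0.0 | 23.6 | 75.6 | 0.8 |
| 18. Getup | 25.9 | 74. |  |  |  |  |
| 19. Stiff | 27.7 | 72.3 | 0.0 | 57.5 | 42.5 | 0.0 |
| 20. All manage | 33.0 | 65.2 | 1.8 | 36.2 | 63.8 | 0.0 |
| 21. Stress | 35.7 | 64.3 | 0.0 | 31.5 | 67.7 | 0.8 |
| 22. Pressure | 49.1 | 50.0 | 0.9 | 30.7 | 69.3 | 0.0 |
| 23. Good bad | 69.6 | 30.4 | 0.0 | 65.4 | 33.9 | 0.8 |

  • * RA = rheumatoid arthritis; OA = osteoarthritis.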

DIF identified by Mantel-Haenszel, Breslow-Day, and LR procedures.

The Mantel-Haenszel procedure identified 3 RA-WIS items that met both our statistical criteria (chi-square PBonf < 0.05, ΔMH ≥|1.5|) for uniform DIF (Figure 1). Two of these items showed a systematically greater probability of being affirmed in RA (item 11: “difficulty opening doors,” item 22: “pressure on hand”), while the other item was more likely affirmed in OA (item 19: “very stiff”). No items showed significant nonuniform DIF according to Breslow-Day tests. These findings were shown to be reproducible when DIF was assessed by statistics (−2 log likelihood and ΔR2) based on the LR procedure (Figure 2).

Figure 1.

Forest plot illustrating the magnitude and direction of the delta difference effect size (ΔMH) of individual Work Instability Scale for Rheumatoid Arthritis (RA-WIS) items using the Mantel-Haenszel differential item functioning (DIF) detection procedure (broken lines show the ΔMH ≥|1.5| threshold). * = items 11, 19, and 22 met both statistical criteria for uniform disease-related DIF (test of significance chi-square P value with Bonferroni correction <0.05 and ΔMH ≥|1.5|); RA = rheumatoid arthritis; OA = osteoarthritis.

Figure 2.

Forest plot illustrating the magnitude and direction of the model difference Nagelkerke's R2 (ΔR2) effect size statistic to detect uniform and nonuniform differential item functioning (DIF) using the hierarchical 3-stage logistic regression procedure (broken lines show the ΔR2 = >0.035 threshold). * = items 11, 19, and 22 met statistical criteria for uniform disease-related DIF (ΔR2 = >0.035 for difference between logistic regression models 1 and 2); RA-WIS = Work Instability Scale for Rheumatoid Arthritis; RA = rheumatoid arthritis; OA = osteoarthritis.

DIF identified by the IRT approach.

Initial fitting of RA-WIS data to the Rasch model indicated some deviations from the Guttman pattern (item–trait interaction χ2 = 134.9, df 69, P = 0.000004; PSI = 0.90). We first removed 2 misfitting items (items 18 and 20) to improve fit to the Rasch model (item–trait interaction χ2 = 101.9, df 63, P = 0.001; PSI = 0.90). Preliminary DIF assessment revealed items 11, 19, and 22 with uniform DIF (Figure 3) and none with nonuniform DIF. Further model modifications were made to examine whether DIF effects would cancel out at the scale level. Testlets (super items) were created for items demonstrating local dependency (residual r > 0.2), which included items 4/19 (r = 0.31), 2/14 (r = 0.30), 9/10 (r = 0.24), 1/5 (r = 0.22), and 12/13 (r = 0.22). Item 15 was subsequently removed due to poor item fit (fit residual = −3.47, chi-square P = 0.002). Reassessment after these modifications verified DIF for items 11 and 22, and also for the testlet consisting of items 4/19. A final testlet was created to combine items/testlets exhibiting DIF, resulting in a 13-item/testlet model that met criteria for proper fit to the Rasch model (χ2 = 64.3, df 39, P = 0.007; PSI = 0.88) and was free of DIF. Unidimensionality was affirmed in this final model, as only 3.7% of independent t-tests comparing person logit estimates (derived from subsets of all positively versus negatively loaded items) were statistically significant.

Figure 3.

Item characteristic curves of 2 Work Instability Scale for Rheumatoid Arthritis items demonstrating significant disease-related uniform differential item functioning as detected by a 1-parameter item response theory approach (Rasch analysis). A, Item 11, “I have great difficulty opening some of the doors at work” (workers with rheumatoid arthritis [RA] have a greater probability of affirming the item at all levels of work instability [WI]), B, Item 19, “I get very stiff at work” (workers with osteoarthritis [OA] have a greater probability of affirming the item at all levels of WI).

Since 2 of the 3 items exhibiting DIF (items 11 and 22) were related to hand functioning, we further examined whether there was a preexisting difference in the prevalence of “hand involvement” in our sample by comparing the HAQ gripping subscale score (range 0–3, where 3 = most disability) between RA and OA. This was our best available indicator, as we had not collected data on arthritis location or other information specific to hand functioning. HAQ gripping scores were found to differ between arthritis types (χ2 = 21.04, P < 0.001), with the majority (62.2%) with OA scoring 0 (no disability), whereas more than half (58.0%) with RA had a score of 2 or 3 (high disability).

Impact of DIF.

With the original 23-item RA-WIS, the difference in mean ± SD RA-WIS between arthritis subgroups was 0.9 ± 6.4 (OA: 8.7 ± 6.8, RA: 7.9 ± 5.9), or a 3.9% difference. After excluding the 3 items showing disease-related DIF (i.e., 20-item RA-WIS), both groups had lower overall mean ± SD scores (OA: 7.7 ± 6.0, RA: 6.7 ± 5.2), but the relative difference between subgroups (0.9 ± 5.6) remained relatively similar. This represented only a marginally larger difference (4.5%) in the modified scale range (0–20).
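The two percentages above appear to express the mean between-group difference as a share of the maximum possible scale score. Under that reading (an inference from the reported figures, not a formula given in the text), the arithmetic is:

\[
\frac{0.9}{23} \times 100 \approx 3.9\% \quad \text{(23-item RA-WIS)}, \qquad \frac{0.9}{20} \times 100 = 4.5\% \quad \text{(20-item RA-WIS)}
\]

The absolute difference is unchanged at 0.9 points; the relative difference rises only because the scale range shrinks from 23 to 20.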


The ability to apply the same work disability outcomes to different populations can be useful from the perspective of comparability; however, there is little current evidence to suggest that outcome scores derived from such measures have direct comparability when applied to different populations. Ideally, a given scale score should consistently represent the same level of the underlying trait, but this assumption could be threatened if items do not function consistently (i.e., have bias) across clinical populations. The RA-WIS is a measure that has been independently validated in both RA and OA, and an examination of potential DIF associated with specific forms of arthritis is of interest. This study found evidence of disease-related DIF in the RA-WIS at the item level. Results from 3 different DIF detection methods converged to reveal the same 3 items showing significant uniform DIF, and 2 other items nearing the detection thresholds (Table 3); however, it was also determined that such DIF ultimately had only a minimal impact on the comparability of RA-WIS scores between RA and OA at the scale level. This was likely attributable to the fact that only a small proportion (13% [3 of 23]) of scale items showed significant DIF, and also to the bidirectional nature of such DIF; much of the effect therefore “cancelled” out at the scale level. Given the minimal overall impact, we believe the direct comparison of RA-WIS scores between RA and OA is appropriate, and caution is needed only when cross-disease comparisons are made at the item level.

Table 3. Summary of DIF detection methods and statistical criteria applied and study findings*

|  | Method 1: Mantel-Haenszel/Breslow-Day procedures | Method 2: logistic regression | Method 3: 1-parameter IRT (Rasch analysis) |
| --- | --- | --- | --- |
| Step 1 (initial screen): test of significance | PBonf < 0.05 for chi-square (1 df) test | P < 0.01 for chi-square test (2 df) of difference of −2 log likelihood between models 1 and 3 | PBonf < 0.05 for ANOVA F test |
| Step 2 (confirmatory test): effect size statistic | Absolute ΔMH ≥ 1.5 | Nagelkerke's ΔR2 ≥ 0.035 (uniform DIF: model 2 minus 1, nonuniform DIF: model 3 minus 2) | None available |
| Uniform DIF: met threshold | Items 11, 19, 22 | Items 11, 19, 22 | Items 11, 19, 22 |
| Uniform DIF: near threshold | Item 4 (chi-square P = 0.01) | Item 4 (ΔR2 = 0.029) | Item 4 (P = 0.01) |
|  | Item 15 (chi-square P = 0.02) | Item 15 (ΔR2 = 0.027) | Item 15 (P = 0.002) |
| Nonuniform DIF: met threshold | None | None | None |
| Nonuniform DIF: near threshold | Item 4 (chi-square P = 0.01) | Item 4 (ΔR2 = 0.033) | Item 4 (P = 0.04) |
|  |  | Item 6 (ΔR2 = 0.031) | Item 12 (P = 0.03) |

  • * DIF = differential item functioning; IRT = item response theory; PBonf = P value with Bonferroni correction; ANOVA = analysis of variance.

  • Reflects results from the preliminary DIF assessment (i.e., prior to model modifications during Rasch analysis).

  • For this study, we confirmed DIF (uniform or nonuniform) only if an item met both the test of significance and the effect size statistical criteria.


What specific underlying factors might account for the observed DIF in the RA-WIS? It seems probable that disease-related DIF for items 11 (“difficulty opening doors”) and 22 (“pressure on hand”) could be directly related to subgroup differences in hand involvement, given significant differences in HAQ gripping disability between RA and OA observed in our cohort. Such DIF could be relevant whenever there is a significant imbalance in the extent of hand involvement between arthritis subgroups being compared. This illustrates an inherent challenge for items with high anatomic specificity, which may not have equal relevance to different arthritis types. One other potential implication to consider is whether such measurement bias could extend beyond direct comparisons between just RA and OA. Presumably, direct comparisons between any 2 cohorts where the extent of hand involvement is significantly unbalanced might pose similar challenges related to these 2 specific RA-WIS items. Users should be aware of the potential for measurement bias for these RA-WIS items due to their anatomic specificity.

The observation that items 19 (“very stiff”) and 4 (“pain or stiffness”) met or closely approximated the statistical threshold for DIF suggests that “stiffness” is likely the specific biased element, since it is common to both items. The opposing direction of DIF for these items (i.e., more likely affirmed in OA) was somewhat surprising. Since the RA-WIS was originally designed for RA, we had anticipated that few, if any, items would show such a strong DIF effect toward OA. A possible explanation is that while morning stiffness is prevalent in RA, it may have less impact during typical work hours in the daytime. An additional factor to consider is the potential influence of work context. Perhaps more workers with OA in our sample were simply working at jobs where prolonged sitting or standing is required, thus increasing their propensity to experience certain clinical signs (i.e., stiffness) compared to those with RA. We did find a subgroup difference in terms of occupation type, although the classification system applied was too broad to be informative on differences in the specific nature of the job requirements. Nonetheless, observed disease-related DIF for these items suggests that stiffness at work could be an experience that may have greater intrinsic relevance for workers with OA.

Precise scoring estimation of outcomes is fundamental to proper interpretation of results. While the current analysis suggests that disease-related DIF ultimately had little overall impact at the scale level, for future comparisons of WI between RA and OA it may still be worthwhile to consider potential strategies to account for DIF for the 3 most relevant items, where possible. One strategy may be to perform a sensitivity analysis with a DIF-free version of the RA-WIS (i.e., 20 items) to confirm subgroup differences in WI where such biases could be a concern. A second option may be to explore item-splitting approaches to “adjust” for DIF (19, 21) in order to establish disease-specific parameters for individual items to facilitate cross-disease comparisons. The increased complexity of scoring the RA-WIS with such an approach, however, is a potential tradeoff that must be considered, especially from the perspective of clinical practicality.

Our relatively small sample size is a study limitation to be considered, as it is below the typical recommendation of n > 200 per comparison group for DIF assessments. However, >100 per group has been considered adequate for binary LR and 1-parameter IRT methods (35). Moreover, we believe that converging findings from multiple well-established DIF detection methods strengthened our results, as did the conservative statistical criteria we applied to help minimize potential Type I error. Other methodologic strengths were the use of both parametric and nonparametric approaches and the application of both summed and latent trait scores (i.e., IRT) as the conditioning variable to provide diverse perspectives in our DIF assessments.

We conclude that although 3 RA-WIS items showed disease-related DIF, this had a negligible impact on the comparability of scores at the scale level. This suggests that, ultimately, RA-WIS scores can be directly compared between RA and OA without significant concerns about DIF-related measurement bias. It is important to reiterate that the RA-WIS was originally intended as a disease-specific measure, with items specifically developed for RA, and therefore evidence of some disease-related DIF at the item level was not unexpected. In fact, we believe the relatively small proportion of items exhibiting DIF is indicative of the strong resonance of the overall concept of WI with workers who have other forms of arthritis such as OA, where work disability is also an important concern. With ongoing interest in applying work-specific measures in a broad range of other rheumatic conditions, item biases are expected to be increasingly important to consider, and similar studies in the future will be useful to ensure that cross-disease comparisons of outcomes are appropriately conducted.


All authors were involved in drafting the article or revising it critically for important intellectual content, and all authors approved the final version to be published. Mr. Tang had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study conception and design. Tang.

Analysis and interpretation of data. Tang.


The authors declare that Abbott had no direct role in the study design, data collection, analysis and interpretation of the data, writing of the manuscript, approval of manuscript content, or decision to publish this work. Neither the submission nor the publication of this article was contingent on the approval of Abbott.


The authors would like to acknowledge the participating institutions where data for the current study objective were collected: Mount Sinai Hospital (Toronto, Ontario, Canada), the Martin Family Centre for Arthritis Care and Research at St. Michael's Hospital (Toronto, Ontario, Canada), and the Mary Pack Arthritis Program (Vancouver, British Columbia, Canada). The authors would also like to thank the Institute for Work & Health and the Arthritis Community Research & Evaluation Unit for providing in-kind support. Finally, the authors wish to acknowledge individuals who have made contributions to the overall Canadian Arthritis Network Work Productivity project: Xingshan Cao (data analyst), Paul Clarke (research coordinator), Timea Donka (research assistant), Rebecca Dubé (research assistant), Katherine Edwards (research assistant), Taucha Inrig (research assistant), Carol Kennedy (research assistant), Jessica Lee (research coordinator), Xin Li (postdoctoral fellow), Samra Mian (research assistant), Ludmila Mironyuk (research coordinator), Anusha Raj (research associate), Pam Rogers (research coordinator), Rebeka Sujic (research coordinator), Debbie Sutton (data analyst), Ada Todd (research coordinator), Dwayne Van Eerd (research coordinator), Rebecca Wickett (research coordinator), Jessica Widdifield (research coordinator), and Wei Zhang (graduate student). Investigators of the Canadian Arthritis Network Work Productivity Group are as follows: Dr. Dorcas E. Beaton (principal investigator), Dr. Claire Bombardier (principal investigator), Dr. Aslam H. Anis (coinvestigator), Dr. Elizabeth M. Badley (coinvestigator), Dr. Monique A. M. Gignac (coinvestigator), and Dr. Diane Lacaille (coinvestigator).