Toronto Health Economics and Technology Assessment Collaborative, University of Toronto, Toronto General Research Institute, University Health Network, Institute for Work & Health, Centre of Research Expertise in Improved Disability Outcomes, University Health Network Rehabilitation Solutions, Toronto Western Hospital, Toronto, Ontario, Canada
Toronto Health Economics and Technology Assessment Collaborative, Leslie Dan Pharmacy Building, University of Toronto, 6th Floor, Room 658, 144 College Street, Toronto, Ontario, Canada M5S 3M2
The most widely used neck-specific measure in intervention trials is the 10-item Neck Disability Index (NDI), which is assumed to be a unidimensional interval scale, as shown by how NDI data are scored, analyzed, and interpreted. Our objective was to use modern measurement methods to test this assumption (and thereby to also test the validity of calculating summed scores and parametric statistics on NDI data) through Rasch analysis.
NDI data from 521 trial subjects with neck pain were fit to the Rasch model. We examined threshold ordering of NDI items, fit of data to model expectations, presence of differential item functioning, and whether or not the set of NDI items collectively measure a single construct, which is a requirement for calculating summative scores.
There was a lack of fit of data to the Rasch model (χ2 = 140.35, 70 df; P < 0.001). Five items (personal care, lifting, headaches, work, and recreation) had disordered response thresholds. Differential item functioning was detected for age and sex. The NDI items did not contribute to a single construct. Unidimensionality and interval scaling were achieved by removing 2 of the 10 items (resulting in the NDI-8), and converting NDI-8 ordinal (paper) summative scores to NDI-8 interval (Rasch-weighted) scores.
As originally proposed and conventionally used, the NDI is not a unidimensional scale, and has only ordinal scaling. This raises fundamental doubts about the practice of calculating change scores and other parametric statistics on NDI data. A revised 8-item version provides unidimensional interval-level measurement of neck pain disability.
The Neck Disability Index (NDI) is the most widely used neck-specific measure in published intervention studies (1–13). It has been cited in over 251 scientific articles and translated into more than 20 languages (available at http://www.cmcc.ca/Portals/0/pdfs/rs_%20NDI_manual.pdf). The NDI was developed to assess the effect of patients' neck pain on their activities of daily living, and was based on the Oswestry Low Back Pain Disability Questionnaire, a region-specific questionnaire for patients with low back pain (14). The NDI consists of 5 items derived from the Oswestry Questionnaire (14) and 5 items identified by literature review and clinicians, with minimal input from patients. When published in 1991, the NDI was the first self-administered instrument to measure disability related to neck pain (1).
The NDI is assumed to be a unidimensional interval scale (as shown by how its response data are scored, analyzed, and interpreted) (15–17), but this assumption has not been tested. Verifying the assumption that the NDI is a unidimensional interval-level scale is important. Unidimensionality is necessary to validly summate NDI response data into a single score, and is a fundamental requirement of construct validity (18, 19). Interval-level measurement is a requisite assumption of the parametric statistics (such as means, effect sizes, and minimal detectable change) that are used to analyze outcomes and compare treatment responses across groups (18, 20, 21). Interval-level scaling allows for the straightforward interpretation of NDI change scores where, for example, a change in score from 25 to 20 is equivalent to a change from 10 to 5. This is a characteristic of fundamental measurement and, in order to provide this, NDI data must satisfy the basic rules for constructing an interval scale as described in the theory of additive conjoint measurement (22). In contrast, ordinal scale scores represent a rank-ordering of persons on a measured trait, and relative distances between scores are meaningless (23). Thus, if we wish to use the NDI to calculate change scores, or apply parametric statistics given appropriate distributions, we must obtain a measure of neck pain disability that is free from bias and satisfies the rules for constructing interval scale data.
Although the measurement properties of the NDI have been extensively examined using classic test theory methods (1, 4, 15–17, 24, 25), the assumption of interval-level measurement of neck pain disability has not been verified. Our objective was therefore to use modern psychometric approaches, specifically Rasch analysis, to determine whether the NDI is a unidimensional interval scale, and, if necessary, to alter the NDI using the fewest modifications possible to have it achieve interval-level scaling.
PATIENTS AND METHODS
The NDI consists of 10 items: pain intensity, personal care, lifting, reading, headaches, concentration, work, driving, sleeping, and recreation. Each item has 6 response categories ranging from 0–5, where 0 describes the best situation and 5 describes the worst. The response categories are ordered, adjectival categories that differ descriptively across items, and respondents are asked how their neck pain affects each item. NDI scores are calculated by summing the numeric response to all items to generate a final score that ranges from 0–50. The developers of the NDI have suggested an interpretation for these scores in which a score of ≤4 corresponds to no disability and scores ≥35 correspond to complete disability (1). Minimal clinically important change and minimal detectable change have been estimated to be 3.5 NDI points and 5 NDI points, respectively (16, 17).
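The conventional scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration, not an official scoring routine: the function name and the decision to reject incomplete or out-of-range responses are assumptions made for the example, since the handling of missing items is not described here.

```python
from typing import Sequence

# Item names as listed in the text; the order is only illustrative.
NDI_ITEMS = (
    "pain intensity", "personal care", "lifting", "reading", "headaches",
    "concentration", "work", "driving", "sleeping", "recreation",
)


def ndi_raw_score(responses: Sequence[int]) -> int:
    """Sum ten item responses (each 0-5) into the conventional 0-50 raw score."""
    if len(responses) != len(NDI_ITEMS):
        # Assumption: incomplete questionnaires are rejected rather than prorated.
        raise ValueError(f"expected {len(NDI_ITEMS)} responses, got {len(responses)}")
    for item, value in zip(NDI_ITEMS, responses):
        if not 0 <= value <= 5:
            raise ValueError(f"response for '{item}' must be 0-5, got {value}")
    return sum(responses)
```

The result is an ordinal raw score; nothing in this summation guarantees that equal differences in the score reflect equal differences in disability.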
Patients and setting.
We pooled subject-level baseline data from 2 randomized trials of neck pain treatments, including age, sex, level of education, neck pain intensity, and NDI responses (26, 27). Data were pooled in order to achieve an adequate sample size as recommended by Linacre (28). Patients had a primary problem of mechanical (nonspecific) neck pain and were recruited in the first trial through the membership of a Southern California health maintenance organization, having presented with neck pain for treatment (27), or in the second trial through newspaper advertisement (26). Potential subjects were excluded from the trials if they had 1) neck pain due to fracture, severe spondylarthropathy, myelopathy, vascular disease, infection, tumor, or other nonmechanical causes of neck pain; 2) progressive neurologic deficit or severe coexisting disease; or 3) pain involving third-party liability or a compensation claim.
Rasch analysis is a process of testing whether data from a scale, such as the NDI, satisfy the rules for constructing interval scale measurement. In doing so, it uses a measurement model developed by the Danish mathematician Georg Rasch (29), which provides a template for such measurement against which the pattern of observed data can be contrasted. Where the observed pattern of response does not deviate from that expected by the model, a transformation of ordinal into interval scaling is achieved. The process of Rasch analysis has been described in detail elsewhere (30, 31). Briefly, the process involves testing a number of attributes of the scale and assumptions of the model. Where the items have more than 2 response options, as in the case of the NDI, an increase in response option should reflect an increase in the underlying trait. If this does not occur, the transition points (thresholds) between adjacent response options are said to be disordered. This may necessitate collapsing of categories to ensure a correct ordering of thresholds.
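For orientation, the polytomous Rasch model in its partial credit form is commonly written as follows. The notation is an assumption introduced here and does not come from the study: \theta_n is the location of person n on the latent trait, \delta_{ik} is the k-th threshold of item i, and m_i is the highest response category of item i (5 for each NDI item).

```latex
P(X_{ni} = x) =
  \frac{\exp\left( \sum_{k=1}^{x} (\theta_n - \delta_{ik}) \right)}
       {\sum_{j=0}^{m_i} \exp\left( \sum_{k=1}^{j} (\theta_n - \delta_{ik}) \right)},
  \qquad x = 0, 1, \ldots, m_i,
  \qquad \sum_{k=1}^{0} (\cdot) \equiv 0 .
```

Under this form, thresholds are ordered for item i when \delta_{i1} < \delta_{i2} < \cdots < \delta_{i m_i}; a threshold that falls below its predecessor is what the text calls disordered.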
When observed responses are equivalent or do not greatly differ from the expected responses from the model, data are said to fit the Rasch model. A series of summary and individual item and person fit statistics are used to test the difference between observed responses and responses expected from the model (31). Generally, chi-square fit statistics are required to be nonsignificant (Bonferroni adjusted). Residual fit statistics are expected to be within a given range: ±2.5 for individual items, and with a mean fit residual value close to 0.0 and an SD approaching 1.0 (usually <1.4) for summary statistics.
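The item-level screening rules in this paragraph can be sketched as below. This is an illustration under stated assumptions: the function names are hypothetical, the overall significance level of 0.05 is assumed (it is not stated in the text), and the Bonferroni adjustment is applied by dividing that level by the number of items tested.

```python
def bonferroni_alpha(alpha: float, n_tests: int) -> float:
    """Per-test significance level after a Bonferroni adjustment."""
    return alpha / n_tests


def item_misfits(
    fit_residual: float,
    chi_square_p: float,
    n_items: int,
    alpha: float = 0.05,          # assumed overall significance level
    residual_limit: float = 2.5,  # +/-2.5 range given in the text
) -> bool:
    """Flag an item whose fit residual is outside the range or whose chi-square is significant."""
    large_residual = abs(fit_residual) > residual_limit
    significant_chi_square = chi_square_p < bonferroni_alpha(alpha, n_items)
    return large_residual or significant_chi_square
```

With 10 items, the assumed per-item significance level becomes 0.005, so an item is flagged either by a fit residual beyond ±2.5 or by an unadjusted chi-square probability below that level.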
Item bias or differential item functioning can also contribute to item misfit. Differential item functioning is a form of bias in which one subgroup (e.g., women) with given levels of a latent trait responds differently to an item compared with another subgroup (e.g., men) with similar levels of that latent trait. Differential item functioning is detected by conducting an analysis of variance for each item, comparing across levels of subject characteristics and levels of latent trait. We tested differential item functioning for age (19–30 years, 31–45 years, 46–59 years, and 60–78 years) and sex (women, men).
A person separation index, which is an estimate of a scale's internal consistency, was also calculated. This index is calculated using the same formula as Cronbach's alpha (32), except that logit scale estimates for each person derived from the Rasch analysis are used, rather than their raw scores. We considered ≥0.70 acceptable because our focus was on group-level measurement (33).
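One common way to express a person separation index is as the share of variance in the person estimates that is not attributable to measurement error. The sketch below follows that formulation; whether RUMM2020 computes the index in exactly this way is an assumption, and the function and argument names are hypothetical.

```python
from statistics import mean, variance
from typing import Sequence


def person_separation_index(
    estimates: Sequence[float], standard_errors: Sequence[float]
) -> float:
    """(variance of person logit estimates - mean squared standard error) / variance."""
    if len(estimates) != len(standard_errors):
        raise ValueError("each person estimate needs a standard error")
    observed_variance = variance(estimates)  # sample variance of logit estimates
    error_variance = mean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance
```

Values near 1 indicate that persons are well separated relative to the precision of their estimates; the 0.70 cutoff in the text would be applied to this quantity.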
The process of Rasch analysis also involves testing the assumption of local independence (34). Local independence assumes that a response to an item is independent of responses to other items, after controlling for the latent trait (35). Breaches of this assumption can arise because of local response dependency, in which the answer to one item would determine the answer to another, and multidimensionality. These are tested by an examination of the residuals, with the former tested by looking for residual correlations ≥0.30 and the latter tested by generating estimates from sets of items that represent high-positive and high-negative loadings of the first component of the residuals (36). These estimates are compared by a series of independent t-tests, and <5% should be outside of the range −1.96 to +1.96. If the lower binomial confidence interval overlaps 5%, the test is interpreted as nonsignificant and the set of items is considered to form a unidimensional scale (31). All analyses were conducted using RUMM2020 software, version 4.1 (Rumm Laboratory, Perth, Western Australia).
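The decision rule for the unidimensionality test can be sketched as follows. Several details are assumptions made for illustration: each person's t value is computed from the two subset estimates and their standard errors, and the binomial confidence interval uses the normal approximation; the software's exact computations may differ. All names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def proportion_significant(
    est_pos: Sequence[float], se_pos: Sequence[float],
    est_neg: Sequence[float], se_neg: Sequence[float],
    critical: float = 1.96,
) -> float:
    """Share of persons whose two subset estimates differ beyond +/-1.96."""
    flags = []
    for a, sa, b, sb in zip(est_pos, se_pos, est_neg, se_neg):
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)  # assumed form of the per-person t
        flags.append(abs(t) > critical)
    return sum(flags) / len(flags)


def binomial_ci(p: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Normal-approximation 95% confidence interval for a proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half_width), min(1.0, p + half_width)


def supports_unidimensionality(p: float, n: int, limit: float = 0.05) -> bool:
    """Nonsignificant when the lower confidence bound does not exceed 5%."""
    lower, _ = binomial_ci(p, n)
    return lower <= limit
```

For example, if 6% of 521 persons had significant t values, the lower bound of the approximate interval would fall below 5% and the test would be read as nonsignificant; at 10%, it would not.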
The pooled sample (n = 521) consisted of 338 women (64.88%) and 183 men (35.12%) with a mean ± SD age of 44.95 ± 11.50 years. Mean ± SD neck pain intensity in the past week was 5.17 ± 1.87 and mean ± SD NDI score was 13.57 ± 5.75. Some college education or less was achieved by 289 subjects (55.47%), 157 (30.13%) achieved a college education, and 75 (14.40%) achieved a professional or graduate degree. Response ranges for individual NDI item scores varied from 3–6. We did not collapse response categories selected by <1% of the subjects into the preceding adjacent category because RUMM2020 uses the conditional pairwise maximum likelihood algorithm that provides estimates in the presence of null categories (37).
Rasch analysis of data from the 10-item NDI scale demonstrated a lack of data fit to the model, with a significant item-trait total chi-square interaction (Table 1, analysis 1). The mean ± SD fit residual for items was 0.57 ± 2.13, whereas the SD should be closer to 1.0 when there is adequate fit to the model. The mean ± SD fit residual for persons was −0.22 ± 1.06, which indicated no serious misfit among subjects in the sample, on average. Two items, headaches and recreation, showed statistically significant deviation (misfit) from the model expectations (Table 2). The Bonferroni-adjusted chi-square probability values for headaches and recreation were <0.001. The fit residual values for headaches and lifting were large (4.94 and 3.42, respectively), which also suggested misfit of these items (Table 2).
Table 1. Summary of Rasch analyses of the Neck Disability Index*

| Analysis no., description | Item fit residual, mean ± SD | Person fit residual, mean ± SD |
| 1. Initial Rasch analysis (n = 521) | 0.57 ± 2.13 | −0.22 ± 1.06 |
| 2. Headaches, lifting deleted (n = 521) | 0.07 ± 1.22 | −0.30 ± 0.98 |
| 3. Random sample 1 (n = 260) | 0.26 ± 1.08 | −0.27 ± 0.95 |
| 4. Random sample 2 (n = 260) | 0.13 ± 0.83 | −0.27 ± 0.95 |
| 5. Random sample 3 (n = 260) | 0.14 ± 1.01 | −0.29 ± 1.00 |
| 6. Random sample 4 (n = 260) | 0.11 ± 1.18 | −0.31 ± 1.01 |
| 7. Random sample 5 (n = 260) | 0.18 ± 0.71 | −0.27 ± 0.96 |
| 8. Random sample 6 (n = 260) | 0.19 ± 1.01 | −0.28 ± 0.97 |
| 9. Random sample 7 (n = 260) | 0.22 ± 0.77 | −0.26 ± 0.95 |
| 10. Random sample 8 (n = 260) | 0.08 ± 0.77 | −0.28 ± 0.97 |
| 11. Random sample 9 (n = 260) | 0.10 ± 1.11 | −0.29 ± 0.96 |
| 12. Random sample 10 (n = 260) | 0.19 ± 0.91 | −0.27 ± 0.96 |

Additional columns (values not shown): Item-trait total chi-square; Test of unidimensionality, percent (95% CI).

* Random samples refer to samples of 260 patients that were randomly drawn from the pooled sample of 521 patients after the items headaches and lifting were deleted. PSI = person separation index; 95% CI = 95% confidence interval.
Table 2. Individual item fit of the Neck Disability Index
The large, positive fit residual value for headaches implied a low level of discrimination. A low level of discrimination was also implied by the characteristic curve of the headaches item, in which observed responses were not as steep as the expected curve. Specifically, responses from groups with lower levels of disability were above what was expected by the model, and responses from groups with higher levels of disability were below what was expected by the model. This suggested that the headaches item captures a trait that is different than the trait captured by other items (34).
The threshold patterns for the 10 NDI items were examined on a threshold map. Distances between thresholds varied across items, which supported the use of the unrestricted (partial credit) version of the Rasch model for this analysis (38). Five items (personal care, lifting, headaches, work, and recreation) demonstrated disordered thresholds. The category probability curves for lifting, in which the third category probability curve is entirely overlapped by the other curves, are shown in Figure 1. This indicates that there is no location on the neck pain disability scale (and therefore, no level of neck pain disability) that the third response category of the lifting item was more likely to be selected than other response categories. Disordered thresholds occur when there are too many response categories or category labels are confusing. Compare, for example, the third response category for lifting, “Pain prevents me from lifting heavy weights off the floor, but I can manage if they are conveniently positioned,” with the fourth response category for lifting, “Pain prevents me from lifting heavy weights, but I can manage light to medium weights if they are conveniently positioned.” These response categories require the ability to clearly differentiate between heavy and light to medium weights, and it would appear that, in this sample at least, this is not the case.
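The effect of a disordered threshold on category probability curves can be reproduced with a short sketch of the partial credit form of the Rasch model. The threshold values below are hypothetical and chosen only to show the pattern; they are not the estimates obtained for the lifting item. The function names are likewise assumptions.

```python
import math
from typing import List, Sequence, Set


def category_probabilities(theta: float, thresholds: Sequence[float]) -> List[float]:
    """Partial credit model: probability of each category 0..m at person location theta."""
    # Category x has numerator exp(sum over k <= x of (theta - threshold_k)); empty sum is 0.
    logits = [0.0]
    for delta in thresholds:
        logits.append(logits[-1] + (theta - delta))
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(value - peak) for value in logits]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(
    thresholds: Sequence[float], low: float = -6.0, high: float = 6.0, step: float = 0.05
) -> Set[int]:
    """Categories that are the most probable response somewhere on a grid of locations."""
    modal = set()
    n_steps = int(round((high - low) / step))
    for i in range(n_steps + 1):
        probs = category_probabilities(low + i * step, thresholds)
        modal.add(probs.index(max(probs)))
    return modal


# Hypothetical thresholds (logits) for a 6-category item.
ordered = [-2.0, -1.0, 0.0, 1.0, 2.0]
disordered = [-2.0, 0.5, -0.5, 1.0, 2.0]  # second threshold exceeds the third
```

With the ordered thresholds every category is the most likely response over some range of the scale. With the disordered set, the third category (index 2) is never the most likely response at any location, which is the pattern described for the lifting item.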
The test for unidimensionality showed that positively and negatively loading sets of items provided statistically significantly different person estimates (Table 1, analysis 1). This suggested that NDI items did not contribute to a single underlying construct. Another breach of unidimensionality was the differential item functioning exhibited by the item lifting, for sex (F-ratio = 10.73, df = 1; Bonferroni-adjusted P = 0.001) and age (F-ratio = 5.11, df = 3; Bonferroni-adjusted P = 0.002). Thus, lifting did not have the same meaning in women compared with men, or in younger subjects compared with older subjects. Finally, the test of local independence identified response dependency between headaches and recreation.
The results of our initial Rasch analysis suggested that the NDI is not a unidimensional interval scale. Our objective then became to determine whether interval-level scaling could be achieved by altering the NDI with the fewest modifications possible. Based on the statistical results outlined above and on clinical reasoning, we proceeded by removing headaches and lifting from the data to test whether adequate fit of the response data to the model could be achieved. Our clinical reasoning for removing headaches and lifting was based on a previous study in which the validity of these items was questioned because their scores did not change at treatment discharge (24). The authors surmised that headaches may not be a common symptom experienced by all neck pain patients, and therefore not sensitive to change. Similarly, patients who have not attempted to lift objects of varying weights may not know how to respond to this item.
With the removal of these 2 items, adequate fit to the model was achieved (Table 1, analysis 2). None of the remaining 8 items demonstrated statistically significant misfit or large fit residual values. None of the items comprising the NDI-8 exhibited differential item functioning for sex or age. The test of unidimensionality was nonsignificant, suggesting that the 8 remaining items contributed to a single underlying trait. There were no significant correlations between items, and only marginal local dependency (−0.31) was identified between concentration and recreation. Personal care and work continued to demonstrate disordered thresholds.
The targeting of the NDI-8 item thresholds for subjects in our sample is shown in Figure 2. There is good coverage of thresholds over the breadth of neck pain–related disability. The mean ± SD person location value (−1.69 ± 1.04) indicates that on average, our sample is centered over the lower end of the neck pain disability scale as defined by the remaining 8 items. The person separation index (reliability) for the NDI-8 was 0.76, which is sufficient for group-level comparisons.
We verified the robustness of this NDI-8 solution by determining whether adequate fit of response data from 10 samples of 260 patients randomly drawn from our pooled sample could be achieved (Table 1, analyses 3–12). In each of these 10 samples, adequate fit of NDI-8 response data was demonstrated by nonsignificant item-trait interaction total chi-square statistics, good item and person fit to the model expectations, and nonsignificant tests of unidimensionality. The person separation index was adequate for group comparisons (range 0.73–0.80).
The NDI-8 as a simple summed ordinal scale will yield a score between 0 and 40. We transformed the NDI-8 ordinal scale into a 0–50 interval scale, using simulated data (5,000 cases) based on the item difficulties and threshold estimates of our observed response data (39). This ensured that we had a similar, equal precision of estimates for every raw score point. This provides NDI users with a straightforward conversion between NDI-8 ordinal (paper) summative scores and NDI-8 interval scores on a scale length (0–50) that NDI users are familiar with (Table 3). The lack of one-to-one concordance between ordinal and interval scores, in which ordinal scores tend to spread more widely across the end-regions of the interval scale compared with the mid-region, is illustrated in Table 3. A visualization of the spacing of ordinal scores along the NDI-8 interval scale that is required to achieve interval scaling is provided in Figure 3.
Table 3. Conversion table between Neck Disability Index 8-item (NDI-8) ordinal (summative) scores and the NDI-8 interval scores
Scores obtained by summing responses to 8 items on paper.
NDI-8 ordinal scores transformed into a 0–50 interval scale.
The consistencies between the original NDI ordinal scale and the NDI-8 ordinal and NDI-8 interval (Rasch-weighted) scales, estimated by an intraclass correlation coefficient (type 3,1), were 0.93 and 0.87, respectively (40). We also compared the construct validity of the original NDI scale and the interval NDI-8 scale by determining the correlation of their sum scores with neck pain intensity. For the NDI ordinal scores, the correlation with neck pain intensity was Spearman's ρ = 0.43 (Pearson's r = 0.42). For NDI-8 interval scores, the correlation was Spearman's ρ = 0.42 (Pearson's r = 0.45).
Thus, despite removal of 2 items and Rasch weighting of NDI scores, the internal consistency (person separation index) and construct validity of the NDI-8 interval scale were comparable with those of the original NDI ordinal scale. The NDI-8 interval scale provides the additional benefit of fulfilling criteria that must be satisfied in order to justify the use of successive integer scores as a basis for measuring neck pain disability, and the use of mathematical functions to analyze NDI response data.
According to the Rasch model, the NDI in its original form is not a unidimensional interval-level scale, but this can be achieved by removing 2 items (headaches and lifting). This yields a valid unidimensional scale that is free of bias for age and sex. A simple summed score from the NDI-8 yields an ordinal score, but an exchange rate between the ordinal values and the interval-level measurement derived from Rasch analysis is provided in Table 3. This relationship raises fundamental doubts about the practice of reporting thresholds for meaningful change for NDI data (16, 17), in that, for example, 5 raw score points on the ordinal scale relate to significantly different interval scale estimates. The difference between 5 and 10 ordinal points is 5.14 interval scale points, yet the difference between 20 and 25 ordinal points is 2.91 interval points. Thus, if use is made of the 5-point minimal clinically important difference (MCID) on the ordinal scale, MCID would be achieved much more easily by persons moving across the center of the scale than persons at the margins.
Tension between clinimetric and psychometric approaches may be created when selected items that have clinical relevance are removed. Our solution does not preclude deleted items being used for clinical management purposes, but it does preclude them being used in summative raw scores. A preponderance of evidence suggests that the NDI is a multidimensional scale. Although Hains et al concluded that the NDI is unidimensional based on exploratory and confirmatory factor analyses (25), Wlodyka-Demaille et al extracted 2 factors in a principal components analysis (41). Stratford et al found that the NDI relates to both the physical and mental component scores of the Short Form 36 health questionnaire (17). Others have concluded that the NDI is a multidimensional scale that includes symptoms (pain, headaches, concentration), impairments (personal care, lifting, reading, driving, sleeping), and disabilities (work, recreation) (42). These conflicting results might be due to limitations associated with using ordinal data in factor analysis (43).
Previous studies have specifically questioned the validity of the NDI items that we identified as misfitting (lifting, headaches) (24). Even without empirical assessment, these items seem problematic. The response category labels are wordy and confusing. More than one concept is contained within the response categories (e.g., lifting ability and pain level in lifting, and headache severity and headache frequency in headaches). The items do not consist of a declarative sentence stem (namely, a single statement under consideration) to which response options represent degrees of endorsement or agreement. Rather, sentence stems are embedded within the response options themselves.
Previous studies have also observed that response categories designed to measure the highest levels of neck pain disability are rarely endorsed by respondents (24, 25, 42). Personal care has been noted consistently. Washing and dressing do not appear to be often affected by neck pain to the degree described by personal care's most severe response category: “I do not get dressed; I wash with difficulty, and stay in bed.” Thus, the level of disability described by NDI response categories that describe the highest levels of neck disability may be experienced rarely or by a very small proportion of neck pain patients; another possibility may be that these extreme categories are not legitimate for neck pain.
Based on the results of the best-evidence synthesis from the Bone and Joint Decade 2000–2010 Task Force on Neck Pain and Its Associated Disorders, our sample is representative of patients with mechanical (nonspecific) neck pain who are typically enrolled in intervention trials, and thus our findings should be generalizable to these patients (44). Lower levels of neck pain disability can be expected in these patients because nonspecific/mechanical neck pain tends to be used as a diagnosis of exclusion; that is, it is assigned once all observable causes or lesions (e.g., cervical spine tumors, fractures, infections, inflammatory arthritides, etc.) have been reasonably ruled out. These patients have milder neck pain, and therefore they do not tend to endorse the most severe response options. This is consistent with the targeting of our current sample, which places the distribution of persons at the lower end of the operational range of the scale.
In our analysis, in which we provide an exchange rate between the NDI-8 ordinal raw score and the transformed interval scale, we chose not to correct the disordered thresholds because this would have precluded the possibility of providing a straightforward exchange between the everyday summed ordinal score and its corresponding interval score. Furthermore, collapsing the response categories would have resulted in a varying number of categories across items, which represents a considerable change from the original design of the NDI scale. Nevertheless, this disordering of response options should be further examined in other samples, preferably with more severe disability, to see if the problem is generic. If so, consideration should be given to collapsing categories at some future point, preferably in a systematic and clinically relevant way.
The NDI has not been previously evaluated with modern measurement methods. We used a methodical approach that is consistent with current recommendations (31, 45). Our analysis offers new insights into the internal construct validity of the NDI, including response category functioning. We offer a modified, 8-item NDI scale that represents true (fundamental) measurement of disability related to neck pain, and because we are promoting good measurement practice, we have provided an exchange table to convert biased ordinal NDI scores to unbiased interval NDI scores. Empirical evidence from several studies suggests that outcome measurement is improved when Rasch-based scores are used (including better precision and responsiveness), and that divergent conclusions can be made when ordinal response data are treated as interval data (46–50). Nonetheless, future studies may further examine the validity of neck pain trial results when ordinal data are subjected to parametric analyses.
In conclusion, our results suggest that the NDI as it was originally proposed and is conventionally used is not a unidimensional scale, and that it only has ordinal scaling properties. A revised 8-item version, the NDI-8, provides unidimensional interval-level measurement of disability related to neck pain.
Dr. van der Velde had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study design. Van der Velde, Beaton, Tennant.
Acquisition of data. Hurwitz.
Analysis and interpretation of data. Van der Velde, Beaton, Tennant.
Manuscript preparation. Van der Velde, Beaton, Hogg-Johnson, Hurwitz, Tennant.
Statistical analysis. Van der Velde, Beaton, Hogg-Johnson, Tennant.
We thank Drs. Roni Evans and Gert Bronfort (Northwestern Health Sciences University, Minnesota) for generously allowing us to analyze their data for this study, as well as Dr. Peter Smith (Institute for Work & Health, Toronto) for his assistance with figures, and Mr. Mike Horton (Psychometric Laboratory for Health Services, University of Leeds, Leeds) and the Measurement Research Unit (Institute for Work & Health, Toronto) for their technical and critical advice.