Internal construct validity of the Oxford Knee Scale: Evidence from Rasch measurement




Symptomatic knee osteoarthritis (OA) is present in approximately 1 in 8 adults age 60 years and older and is associated with significant activity limitation. Several tools have been devised to assess knee problems. The goal of this study was to evaluate the Oxford Knee Scale (OKS) against strict modern psychometric standards through application of the Rasch measurement model.


A total of 224 OKS assessments were included from patients with a clinical diagnosis of knee OA. Data from the OKS were fitted to the Rasch measurement model. We examined the validity of the item scoring functions, the presence of item bias or differential item functioning, the fit of data to model expectations, and whether or not the item set formed a unidimensional scale, thus giving a valid summed score.


The mean age of the 224 patients was 61 years (range 26–90) and 61.5% were women. After rescoring some items, the scale showed good fit to the Rasch model, with a chi-square interaction statistic of 42.663 (36 df, P = 0.206). Overall targeting of the scale (to the patients) was good, with high reliability.


Data from the OKS were consistent with the expectations of the unrestricted (partial credit) derivation of the Rasch model. The targeting of the instrument shows good coverage of thresholds across the whole construct, and the scale has good reliability (internal consistency) with a high person separation. Consequently, this scale can be used with confidence in the knowledge that it is a unidimensional scale largely free of bias.


The prevalence of radiographic and symptomatic knee osteoarthritis (OA) has been reported as 37.4% and 12.1%, respectively, in adults age 60 years and older (1). The knee is the joint most commonly reported as problematic in this age group, and knee problems have been shown to lead to significant activity limitation (2). The ability to monitor the impact of knee problems is therefore a key element in the day-to-day management of knee OA. A number of knee-specific questionnaires have been devised (3). Although most have evidence of reliability and validity, few have been subjected to the scrutiny of modern psychometric methods such as Rasch analysis (4). One such questionnaire, the Oxford Knee Scale (OKS), is a 12-item polytomous questionnaire designed to assess the impact of total knee replacement on patient pain and activity limitation (5). To date it has been validated using traditional approaches, which support its validity, reliability, and responsiveness (3). The goal of this study was to further validate the OKS by modern psychometric analysis, specifically by examining fit of the scale's data to the Rasch measurement model.


Patients and setting.

A total of 224 OKS assessments were included in the analysis from patients with a clinical diagnosis of OA of the knee. This sample included 55 patients who provided data on 2 occasions to test the stability of the scale over time. The data were drawn from an anonymized database of a routine clinical audit of orthopedic triage and orthopedic clinics.

Rasch analysis.

Rasch analysis is the formal testing of an outcome scale against a mathematical measurement model developed by the Danish mathematician Georg Rasch (4). Polytomous versions of the model are available (6), and the response patterns achieved from the items in the OKS, which are intended to be summed together, are tested against what is expected by the model (a probabilistic form of Guttman scaling [7]). The model assumes that the probability of a given respondent affirming an item is a logistic function of the relative distance between the item location and the respondent location on a linear scale. In other words, the probability that a person will affirm an item (or a category within an item) is a logistic function of the difference between the person's level of, for example, pain (θ) and, in the dichotomous case, the level of pain expressed by the item (b), and only a function of that difference.
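The dichotomous form of the model described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name and the example values of θ and b are assumptions made for the sketch and are not taken from the study.

```python
import math


def rasch_probability(theta: float, b: float) -> float:
    """Probability of affirming a dichotomous item under the Rasch model.

    theta: person location in logits (e.g., the person's level of pain)
    b:     item location in logits (the level of pain expressed by the item)
    """
    # Logistic function of the difference (theta - b), and only of that difference.
    return 1.0 / (1.0 + math.exp(-(theta - b)))


# Hypothetical values, for illustration only.
p_equal = rasch_probability(theta=0.5, b=0.5)   # person and item at the same location
p_above = rasch_probability(theta=1.5, b=0.5)   # person 1 logit above the item
```

When person and item are at the same location the probability is 0.5, and any two person-item pairs separated by the same distance in logits yield the same probability.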

A variety of fit statistics are used to indicate whether the data meet model expectations. The summary statistics include an item-trait interaction chi-square statistic, which assesses the invariance of the item hierarchy across the trait, and a residual statistic, which is the standardized sum of all differences between observed and expected values summed over all persons. Individual item- and person-fit statistics, as residuals and chi-squares, are also available. For the chi-square statistics, if the associated P value is less than 0.05 (or a Bonferroni-adjusted value [8]) then the item (or the scale, for the summary statistic) is deemed to misfit model expectations. The residual fit statistics are standardized values, and values outside the range ±1.96 are deemed to misfit model expectations. High positive residuals are of particular concern, whereas high negative residuals indicate some redundancy in the data (that is, the item is not contributing much to the existing information gained from the other items).

In addition to testing whether or not the data from the scale satisfy the rules for constructing measurement, the approach also allows for the examination of whether or not the category ordering of the polytomous items works as expected (if not, they have disordered thresholds), and whether bias among subgroups in the sample exists for an item. Expressed as differential item functioning (DIF), this tests whether or not, at a given level of the construct being measured, members of different subgroups (for example, men and women) choose the same category (9, 10).

An estimate of the internal consistency reliability of the scale is available as a Person Separation Index. This is equivalent to Cronbach's alpha (11), except that the estimates on the logit scale for each person are used to calculate reliability, rather than the person's raw score, which is how alpha is calculated.
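As a rough illustration of a separation-type reliability, the sketch below uses one common formulation: the proportion of observed variance in the person estimates that is not attributable to measurement error. The function name, the use of the sample variance, and the example data are assumptions for this sketch; the exact computation in RUMM2020 may differ.

```python
from statistics import mean, variance


def person_separation_index(estimates, standard_errors):
    """Separation reliability from person logit estimates and their standard errors.

    Assumed formulation: (observed variance - mean error variance) / observed variance.
    """
    observed_var = variance(estimates)                     # sample variance of person estimates
    error_var = mean([se ** 2 for se in standard_errors])  # mean squared standard error
    return (observed_var - error_var) / observed_var


# Hypothetical person estimates (logits) and standard errors, not study data.
example_estimates = [-2.0, -1.0, 0.0, 1.0, 2.0]
example_errors = [0.5, 0.5, 0.5, 0.5, 0.5]
example_psi = person_separation_index(example_estimates, example_errors)
```

A value close to 1 indicates that the spread of persons along the logit scale is large relative to the error with which each person is located.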

In addition to issues of threshold disordering and DIF, consideration must be given to the assumption of local independence. In practice this comprises 2 aspects, response dependency and unidimensionality. Response dependency arises when items are linked such that the response on one item may be dependent upon the response to another. For example, 3 items in a mobility scale that assess different walking distances will be locally dependent. If you can walk a “mile or more” without difficulty, then you must be able to walk “half a mile” or “100 yards” without difficulty. Such linked questions inflate reliability and confound item and person parameter estimates. Their presence can be detected through the pattern of correlations in the residuals. The Rasch model is also a unidimensional measurement model, thus the assumption is that the items summed together form a unidimensional scale. The absence of any meaningful pattern in the residuals will support the assumption of unidimensionality. A strict test of the assumption of local independence of items has been proposed by Smith (12). This takes the patterning of items in the residuals, looking at the correlation between items and the first residual factor, and uses these patterns to define 2 subsets of items (i.e., the positively and negatively correlated items). Using an independent t-test for the difference in estimates for each person using the subsets of items, the proportion of tests outside the range ±1.96 should not exceed 5%. A confidence interval for a binomial test of proportions is calculated for the observed proportion, and this interval should overlap the 5% expected value for the scale to be considered unidimensional.
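The t-test procedure described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the RUMM2020 implementation: the function name is hypothetical, each person is assumed to have an estimate and a standard error from each of the 2 item subsets, and a normal-approximation interval is assumed for the binomial proportion.

```python
import math


def unidimensionality_check(est_a, se_a, est_b, se_b, crit=1.96):
    """Per-person t-tests comparing estimates from 2 item subsets.

    Returns the proportion of significant tests, its approximate 95%
    confidence interval, and whether that interval overlaps 5%.
    """
    n = len(est_a)
    significant = 0
    for a, sa, b, sb in zip(est_a, se_a, est_b, se_b):
        # t statistic for the difference between the 2 estimates of one person
        t = (a - b) / math.sqrt(sa ** 2 + sb ** 2)
        if abs(t) > crit:
            significant += 1
    proportion = significant / n
    # Normal-approximation confidence interval for a binomial proportion (an assumption)
    half_width = 1.96 * math.sqrt(proportion * (1.0 - proportion) / n)
    lower = max(0.0, proportion - half_width)
    upper = min(1.0, proportion + half_width)
    return proportion, (lower, upper), lower <= 0.05


# Hypothetical data: 100 persons, 8 of whom differ markedly between subsets.
subset_a = [0.0] * 100
subset_b = [0.0] * 92 + [3.0] * 8
errors = [0.5] * 100
proportion, interval, overlaps_5_percent = unidimensionality_check(
    subset_a, errors, subset_b, errors
)
```

In this hypothetical example 8% of the tests are significant, and because the lower bound of the interval lies below 5%, the check would not reject unidimensionality.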

For Rasch analysis, if a scale is well targeted (i.e., the item difficulties are close to the persons' abilities) then a sample size of 108 will give 99% confidence of the person estimate being within ±0.5 logits (13). If the scale is not well targeted then the sample size required for accurate estimation increases to 243. In the current study, a sample of 224 assessments was available and therefore the Rasch analysis can be expected to have an appropriate degree of precision, irrespective of the targeting of the group or the distribution across the response options of each item. The Rasch analysis reported here was undertaken with the RUMM2020 package (14).


A total of 224 OKS assessments were included from patients with a clinical diagnosis of OA of the knee. The mean age was 61 years (range 26–90 years) and 61.5% were women. The median score on the scale was 29.5 (interquartile range 21.3–35.0). Using data from the assessments, a log-likelihood ratio test confirmed that the unrestricted (partial credit) version of the Rasch model was the most appropriate to apply. The threshold pattern for the 12 items of the scale is shown in Figure 1. Those items that are not displayed indicate the presence of disordered thresholds. The threshold distances (e.g., the width of category 1) vary across items, which is consistent with the unrestricted version of the Rasch model.

Figure 1. Category response pattern.

Figure 2 shows the item “For how long have you been able to walk before pain from your knee becomes severe (with or without a stick)” (item 4), where there is no point along the trait at which category 3 is the most probable response. Consequently, the thresholds between categories 2 and 3 and between categories 3 and 4 are reversed, with the former being 1.51 logits and the latter 1.11 logits. This is not how the item was intended to work, and consequently we rescored this item and 3 other items in the current analysis. Of these 4 items, 3 were collapsed into 4 categories (from the original 5 categories) and 1 was collapsed into 3 categories. This was done to ensure that each threshold represented an increasing level of the trait being measured.
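The effect of the reversed thresholds can be illustrated with a small partial credit model sketch. Only the last 2 threshold values (1.51 and 1.11 logits) come from the text; the first 2 thresholds, the 0 to 4 category coding, and the function names are hypothetical choices made for the illustration.

```python
import math


def pcm_category_probabilities(theta, thresholds):
    """Category probabilities for one item under the partial credit model."""
    # Exponent for category k is the sum of (theta - tau_j) over thresholds j <= k.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - tau))
    largest = max(exponents)  # subtract the maximum for numerical stability
    weights = [math.exp(e - largest) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]


def modal_categories(thresholds, lo=-6.0, hi=6.0, steps=1201):
    """Set of categories that are the most probable response somewhere on the trait."""
    modal = set()
    for i in range(steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        probs = pcm_category_probabilities(theta, thresholds)
        modal.add(probs.index(max(probs)))
    return modal


def thresholds_ordered(thresholds):
    """True when every threshold is higher than the one before it."""
    return all(a < b for a, b in zip(thresholds, thresholds[1:]))


# First 2 thresholds are hypothetical; 1.51 and 1.11 are the reversed values in the text.
item4_thresholds = [-2.0, -0.5, 1.51, 1.11]
item4_ordered = thresholds_ordered(item4_thresholds)
item4_modal = modal_categories(item4_thresholds)
```

With these values the thresholds are flagged as disordered and category 3 never appears among the modal categories, which is the pattern that motivates collapsing adjacent categories.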

Figure 2. Category probability curves of item 4. Locn = location; FitRes = Fit Residual; ChiSq[Pr] = chi-square probability; F[Pr] = F fit statistic probability.

All individual items were found to meet model expectation; that is, there was no significant deviation from that expectation (Table 1). The worst fitting item was item 9, “How much has pain from your knee interfered with your usual work (including housework)?”, with a chi-square probability of 0.024, which is not significant at a Bonferroni-adjusted level of 0.05/12 ≈ 0.004. Item 9 did display a high negative residual (−2.264), suggesting some redundancy. In the current data, only 4 respondents had fit residuals >2.58, and their deletion made no difference to the overall model fit. For the summary fit statistics, the chi-square interaction statistic was 42.663 (36 df, P = 0.206), which indicated overall good fit to model expectations.

Table 1. Individual item fit of the Oxford Knee Questionnaire

| Item | Description | Location | Fit statistic: chi-square probability | Fit statistic: residual |
|---|---|---|---|---|
| 1 | How would you describe the pain you usually have from your knee? | −1.901 | 0.924 | 0.426 |
| 2 | Have you had any trouble with washing and drying yourself (all over) because of your knee? | 1.668 | 0.075 | −1.291 |
| 3 | Have you had any trouble getting in and out of a car or using public transport because of your knee? (whichever you would tend to use) | 0.591 | 0.232 | 0.259 |
| 4 | For how long have you been able to walk before pain from your knee becomes severe? (with or without a stick) | 0.562 | 0.322 | 0.824 |
| 5 | After a meal (sat at a table), how painful has it been for you to stand up from a chair because of your knee? | 0.369 | 0.717 | 0.381 |
| 6 | Have you been limping when walking, because of your knee? | −0.562 | 0.661 | 0.017 |
| 7 | Could you kneel down and get up again afterwards? | −1.464 | 0.740 | −0.031 |
| 8 | Have you been troubled by pain from your knee in bed at night? | −0.139 | 0.060 | 1.601 |
| 9 | How much has pain from your knee interfered with your usual work (including housework)? | −0.327 | 0.024 | −2.264 |
| 10 | Have you felt that your knee might suddenly “give way” or let you down? | 0.275 | 0.947 | 0.891 |
| 11 | Could you do the household shopping on your own? | 0.314 | 0.440 | −0.134 |
| 12 | Could you walk down one flight of stairs? | 0.615 | 0.060 | −1.465 |
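The fit criteria described in the Rasch analysis section can be applied to the values in Table 1 with a short script. This is a sketch under stated assumptions: the Bonferroni adjustment is taken as 0.05 divided by the 12 items, the residual limit is the ±1.96 range given in the text, and the variable and function names are illustrative.

```python
# Item number -> (location, chi-square probability, fit residual), from Table 1.
ITEM_FIT = {
    1: (-1.901, 0.924, 0.426),
    2: (1.668, 0.075, -1.291),
    3: (0.591, 0.232, 0.259),
    4: (0.562, 0.322, 0.824),
    5: (0.369, 0.717, 0.381),
    6: (-0.562, 0.661, 0.017),
    7: (-1.464, 0.740, -0.031),
    8: (-0.139, 0.060, 1.601),
    9: (-0.327, 0.024, -2.264),
    10: (0.275, 0.947, 0.891),
    11: (0.314, 0.440, -0.134),
    12: (0.615, 0.060, -1.465),
}


def flag_items(item_fit, alpha=0.05, residual_limit=1.96):
    """Return the adjusted alpha and the items flagged by each criterion."""
    adjusted_alpha = alpha / len(item_fit)  # Bonferroni adjustment (assumed: one test per item)
    chi_square_misfit = [i for i, (_, p, _) in item_fit.items() if p < adjusted_alpha]
    residual_outside = [i for i, (_, _, r) in item_fit.items() if abs(r) > residual_limit]
    return adjusted_alpha, chi_square_misfit, residual_outside


adjusted_alpha, chi_square_misfit, residual_outside = flag_items(ITEM_FIT)
location_sum = sum(loc for loc, _, _ in ITEM_FIT.values())
```

Under these assumptions no item is flagged on the chi-square criterion, item 9 is the only item with a residual outside ±1.96, and the item locations sum to approximately zero, as expected for a scale centered on zero.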

The individual item plot for item 8, “Have you been troubled by pain from your knee in bed at night?” is shown in Figure 3. There is a clear systematic difference in response by sex (i.e., DIF), with women always giving a higher (worse) response on this item at any level of the construct. We split the item into 2 sex-specific items, 1 for men and 1 for women. This action appeared to improve the overall item-trait interaction statistic, but the gain in chi-square was not significant (5.6, 3 df, P > 0.05). Item 8 also showed DIF by age. No DIF was found for repeated visits.
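The non-significance of the reported gain can be checked numerically. The sketch below assumes that the gain of 5.6 is referred to a chi-square distribution with 3 degrees of freedom, and it uses the closed-form upper-tail probability that exists for that specific case; the function name is illustrative.

```python
import math


def chi2_sf_3df(x: float) -> float:
    """Upper-tail probability of a chi-square distribution with 3 degrees of freedom."""
    # Closed form for 3 df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2)
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)


gain_p_value = chi2_sf_3df(5.6)        # P value for the reported gain in chi-square
gain_is_significant = gain_p_value < 0.05
```

The resulting P value is above 0.05, which agrees with the conclusion that splitting item 8 by sex did not significantly improve fit.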

Figure 3. Differential item functioning plot for item 8.

The distributions of persons and items on the same logit scale are shown in Figure 4. The mean ± SD of the person estimate was 0.687 ± 1.860; therefore, with a scale centered on zero, this scale seemed to be well targeted for this group. After rescoring, the Person Separation Index was 0.924, suggesting a highly reliable scale, one that can be used to measure individual patient change.

Figure 4. Targeting graph for the Oxford Knee Scale. The upper part of the graph is the distribution of persons, and the lower half is the distribution of the thresholds from the items.

No significant correlations were found among the residuals, suggesting that response dependency was absent within this set of items. Neither was there any indication of dependency across persons (high negative fit residuals), given that some assessments were repeated measures. Finally, the principal components analysis (PCA) of the residuals was used to identify patterns of items to be used to compare person estimates. These person estimates were derived from 2 independent sets of items that were contrasted on the first principal component (i.e., high positive and high negative loading items). The comparison involves an individual t-test for each person, and 8% (95% confidence interval 5–11%) of these tests were found to be outside the range ±1.96. Because the confidence interval overlaps 5%, this supports the unidimensionality of the scale.


Data from the OKS were consistent with the expectations of the unrestricted (partial credit) derivation of the Rasch model. Some rescoring was necessary and the items, with the exception of “Have you been troubled by pain from your knee in bed at night?” were free of DIF. The summary fit statistics, as well as the independent t-test analysis, support the assumption of unidimensionality. This is important because the scale appears to measure both pain and function, and therefore it is shown to measure a higher order construct combining both attributes. This has also been shown for the Western Ontario and McMaster Universities Osteoarthritis Index (15).

The targeting of the instrument shows good coverage of thresholds across the whole construct and good reliability (internal consistency) with a high person separation. Consequently, this scale can be used with confidence in the knowledge that it is a unidimensional scale largely free of bias. Given the observed fit to the Rasch measurement model, a transformation of the ordinal nonlinear score into an interval linear metric can be obtained. Future studies using the scale may give consideration to reducing the number of response options, for example, by reducing the number of options to the range of 1–4. However, the pattern of disordered thresholds observed in this study would need to be verified in further studies before this option is considered.

Previous work has argued that some of the items may be redundant (16). No evidence was found to support this claim in the current analysis. Redundancy would be manifest through high negative fit residuals, which was not the case. The targeting graph also shows that the items were well spread across the continuum of activity limitation and sufficiently covered the range of patients.

The presence of DIF by sex and age for item 8 may be more problematic. There is no obvious reason as to why the question “Have you been troubled by pain from your knee in bed at night?” should elicit a different response from men and women at the same level of impact. We overcame the problem with a technical solution: splitting the item into 2, one for each sex. This means that, for example, all women have missing values for the men's item, and vice versa. However, we found that although this improved fit to model expectations, it did not do so in a significant manner. Consequently, this disturbance of fit did not appear to bias the score in any significant manner, and suggests that for the time being the 12 items can be summed without any adjustment. However, further studies should continue to look for the presence of DIF for this item, and evaluate its impact upon person estimates.

There are a number of limitations to the study. The patients were drawn from those presenting for orthopedic triage for consideration of joint replacement and therefore probably represent a group with at least moderate disease severity. Therefore, the results do not generalize to the total population of those with OA, but the scale was designed for assessing patients presenting with at least moderate disease severity. In this analysis, we did not look at postintervention data, and it will be important to demonstrate an absence of DIF across pre- and postintervention data to demonstrate the validity of change scores. Although in this study we were not directly examining the responsiveness of the scale, the high person separation, giving many distinct points on the knee problem ruler, is supportive of a highly responsive scale, which has been observed elsewhere (3). The invariance of the scale, as shown by the absence of DIF over time, is a requirement for a valid computation of responsiveness.

In conclusion, the OKS meets the expectations of the Rasch measurement model and can be considered a robust scale of pain and activity limitation for patients with knee OA.


Dr. Conaghan had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.

Study design. Conaghan, Emerton, Tennant.

Acquisition of data. Conaghan, Emerton.

Analysis and interpretation of data. Conaghan, Tennant.

Manuscript preparation. Conaghan, Emerton, Tennant.

Statistical analysis. Conaghan, Tennant.