Validation of a Prediction Model to Estimate Health Utilities Index Mark 3 Utility Scores from WOMAC Index Scores in Patients with Osteoarthritis of the Hip


Deborah Marshall, PhD, Vice President, Global Health Economics and Outcomes, i3 Innovus, 1016-A Sutton Dr., Burlington, ON, Canada L7L 6B8. E-mail:


Objective:  To examine the validity of a newly developed prediction model translating osteoarthritis (OA)-specific health-related quality of life (HRQL) scores measured using the Western Ontario and McMaster Osteoarthritis Index (WOMAC) into generic utility-based HRQL scores measured using the Health Utilities Index Mark 3 (HUI3).

Methods:  Preintervention data from 145 patients with hip OA and complete WOMAC and HUI3 baseline assessments from the Alberta Hip Improvement Project study were used to validate three utility prediction models. These models were estimated using data from a previous study of knee OA patients. Predictive performance was assessed using the mean absolute prediction error (MAE) criterion and several other criteria.

Results:  The validation sample appeared healthier (on the basis of the HUI3 and WOMAC) than the subjects used to estimate the prediction models. Nevertheless, predictive performance in the validation sample exceeded that in the model sample. The results from the validation sample support the conclusions from the original study in that the primary model identified during model development (a model using WOMAC subscales, their interactions, their square terms, age, OA duration, their square terms, and gender) performed better on the MAE criterion than competing models.

Conclusion:  These results support the external validity of the prediction model for the retrospective estimation of HUI3 utility scores for use in economic evaluation.


Measurement of health-related quality of life (HRQL) has become increasingly common in clinical and policy research, and has been particularly important in evaluating interventions to manage chronic diseases, such as osteoarthritis (OA), that affect multiple health domains, including pain and mobility, over a long period. Most OA trials measure HRQL using the Western Ontario and McMaster Osteoarthritis Index (WOMAC) scale, an OA-specific HRQL instrument. The WOMAC allows for the computation of an overall total measure of HRQL using a weighted sum of the responses to each of the individual items. It does not, however, provide a generic, preference-based summary measure of HRQL.

A preference-based single summary score of HRQL is a “standardized” measure of health outcome. It facilitates the comparison of effectiveness of different interventions targeted at different health conditions, even if these conditions have different effects on survival and/or different health domains. Preference-based measures of HRQL are commonly expressed as utility scores. The utility score is a measure of preference for a state of health relative to “anchor” states; it is customary to assign a utility value of 1.0 for “normal health” and 0.0 for “dead.” Intermediate health states are assigned utility values between 0 and 1, based on the relative desirability of that state [1–3]. The utility score for health states experienced by patients can be combined with the length of time spent in these states to produce a total number of quality-adjusted life-years (QALYs) over the time frame of the comparison. QALYs are the most commonly used metric for cost-utility analysis (CUA) [4]; CUA in turn is the standard approach for assessing the value for money spent on health technologies. Indeed, demonstration of value for money is commonly required by insurers considering coverage of new health technologies.
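The QALY calculation described above can be sketched as a sum of utility multiplied by time spent in each health state. This is a minimal illustration only: the function name `qalys` and the example health profile are hypothetical and are not taken from the study, and discounting is deliberately ignored.

```python
def qalys(states):
    """Sum utility x duration over a sequence of (utility, years) pairs.

    Each pair is a health state's utility score and the number of years
    spent in that state. No discounting is applied in this sketch.
    """
    return sum(utility * years for utility, years in states)


# Hypothetical profile: 2 years at utility 0.5, then 3 years at utility 0.8.
profile = [(0.5, 2.0), (0.8, 3.0)]
print(round(qalys(profile), 2))
```

In a cost-utility analysis, the difference in QALYs between two interventions would then form the denominator of an incremental cost per QALY ratio.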

It would be useful then if WOMAC scores could be reliably mapped to a preference-based measure of HRQL for use in situations where preference-based HRQL data were not collected within the study. Grootendorst et al. [5] have developed and estimated such a prediction model. Their preferred model mapped WOMAC scores, along with basic demographic and OA disease severity data, into Health Utilities Index (HUI3) utility scores. The HUI3 is a commonly used measure of the utility or value that the general population places on different health states [1,6,7]. The models developed by Grootendorst et al. [5] were estimated using data from a prospective study of patients with OA of the knee that evaluated the cost-effectiveness of hylan G-F 20. Although Grootendorst et al. [5] report that the HUI3 prediction models performed well in predicting out-of-sample utility scores, there is no evidence on the predictive performance of this model in a completely independent group of OA patients. We therefore generated such evidence and report our results here.


Data Source for Prediction Model Development—Hylan G-F 20 Study

Data for initial prediction model development were obtained from a previously reported multicenter, randomized, controlled, open-label study of 1-year duration, performed from 1996 to 1997, where patients were randomized to either “appropriate care with hylan G-F 20 (AC + H)” or “appropriate care without hylan G-F 20 (AC).” Patients in this study had symptomatic knee OA of mild to moderate severity and had received prior treatment with analgesics. Patients randomized to the AC + H group received hylan G-F 20 administered as a series of three intra-articular injections at intervals of 1 week, were followed for 1 year, and were assessed by the clinical investigator at baseline and 12 months and by telephone interview at months 1, 2, 4, 6, 8, 10 and 12. Further details pertaining to study design are provided in Grootendorst et al. [5] and Raynauld et al. [8].

Prediction models were developed using data on an intent-to-treat population that included 2186 assessments available for patients in either treatment group (n = 255). Some patients randomized to AC “crossed over” because they sought Synvisc® (hylan G-F 20; Biomatrix Inc., Ridgefield, NJ) treatment either on their own or through the investigator during the course of the trial. Data after crossover were excluded from the analysis. Moreover, assessments with incomplete data were excluded, leaving observations on 1833 complete assessments.

Prediction Model Development

Regression models were developed to predict HUI3 utility scores as a function of subjects' WOMAC score, demographics, and OA severity. Several predictive models were assessed. All models incorporated either the individual item questions from the WOMAC (i.e., patient ratings of the extent of impairment to mobility and the degree of pain and stiffness along a five-point Likert scale ranging from “none” to “extreme”) or the WOMAC subscales based on the responses to the individual item questions [9]. The models varied in the demographic (age, gender) and clinical (OA duration and Kellgren x-ray grade) variables included.

Each model was assessed by its ability to predict HUI3 scores using out-of-sample data from the hylan G-F 20 study, that is, data from the study not used to develop the model. The primary criterion identified a priori for assessing model performance was the mean absolute error (MAE). For comparison purposes, all regression models were also evaluated on the root mean square error (RMSE), intraclass correlation coefficient (ICC), and mean error (ME) criteria (see section on External Validation of Prediction Models for descriptions of these criteria). For each model and each of the criteria, bootstrapping was used to estimate a mean criterion value (MAE, RMSE, ICC, or ME) and its 95% confidence interval. Further details pertaining to prediction model development are provided in Grootendorst et al. [5].

Prediction Models

Final prediction models included WOMAC subscale scores along with their pairwise interactions to allow for nonlinearity. A total of four prediction models were developed:

  • Model 1: WOMAC subscales along with their interactions.

  • Model 2: WOMAC subscales along with their interactions as well as age, age squared, and gender.

  • Model 3 (primary model): WOMAC subscales, along with their interactions, age, OA duration (in the study knee), and their second-order terms as well as gender.

  • Model 4: WOMAC subscales, their interactions, age, duration of OA (in the study knee), and the squares of age and OA duration as well as gender and indicators of Kellgren x-ray grade.

Although model 4 outperformed model 3 on the MAE criterion, we nevertheless selected model 3 as our primary model given that the difference in performance was slight and it is likely that many potential users of the prediction models would lack data on x-ray grade.
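The structure of model 3 can be sketched as a set of regressors built from the three WOMAC subscales, age, OA duration, and gender. The sketch assumes that squared WOMAC subscale terms are included, as listed in the footnotes to Table 4. The names `model3_terms` and `predict_hui3` are hypothetical, and the regression coefficients are not given in this article, so any coefficients passed in are placeholders rather than the published estimates.

```python
def model3_terms(pain, stiffness, function, age, oa_years, female):
    """Build the model 3 regressors (WOMAC + DEMOG + YRSOA) as a dict."""
    return {
        # WOMAC subscales
        "pain": pain,
        "stiffness": stiffness,
        "function": function,
        # Pairwise interactions between subscales
        "pain_x_stiffness": pain * stiffness,
        "pain_x_function": pain * function,
        "stiffness_x_function": stiffness * function,
        # Squared subscale terms
        "pain_sq": pain ** 2,
        "stiffness_sq": stiffness ** 2,
        "function_sq": function ** 2,
        # Demographics
        "age": age,
        "age_sq": age ** 2,
        "female": 1 if female else 0,
        # Years since onset of OA
        "oa_years": oa_years,
        "oa_years_sq": oa_years ** 2,
    }


def predict_hui3(terms, coefs, intercept):
    """Linear prediction; coefs must supply one coefficient per term."""
    return intercept + sum(coefs[name] * value for name, value in terms.items())


# Hypothetical patient; placeholder coefficients (all zero) for illustration.
terms = model3_terms(pain=10, stiffness=4, function=30, age=50, oa_years=5, female=True)
placeholder_coefs = {name: 0.0 for name in terms}
print(predict_hui3(terms, placeholder_coefs, intercept=0.5))
```

Model 1 would use only the WOMAC terms, and model 2 would add the demographic terms without the OA duration terms.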

External Validation of the Prediction Models

Validation data source—Alberta Hip Improvement Project (HIP) study.  The external validation data set consisted of hip OA patients from the Alberta HIP Study. This study compared the efficacy, cost-effectiveness, and long-term safety of “alternative hip bearing surfaces” versus “conventional hip arthroplasty device surfaces.” Patients needing hip replacements or resurfacing were selected as candidates for the study when they presented to surgeons in orthopedic offices. The data collected contain information from patient interviews, orthopedic office charts, surgical and hospital charts, and administrative data sources at baseline, months 3, 6, 9, 12, and 18 after hip arthroplasty and yearly thereafter. A total of 145 patients with complete data for the WOMAC, HUI3, age, gender, and OA duration at the baseline assessment were used to validate the model algorithm.

The duration of OA and Kellgren x-ray grades were not specified as part of the original study design; as such, the model prediction estimates could not be examined without making assumptions about baseline x-ray grade. Consequently, model 4 was not considered in the validation because x-ray grade was not a study design element in the Alberta HIP Study. For the purpose of validation, only baseline patient assessment data were considered. Postbaseline observations were not used because of the nature of the intervention, which essentially altered the patient's fundamental OA status because the arthritic hip was replaced.

Methods for Handling Missing Data and Imputation—External Validation

There were no missing observations for the HUI3. The scores for patients with missing responses to the WOMAC subscales were imputed as suggested in the respective scoring manual [9]. Two patients who had missing data for the time from onset of OA were excluded from the analysis. For the WOMAC, 2 patients had imputed values for the pain subscale, 1 patient had an imputed value on the stiffness subscale, and 70 patients had imputed values on the function subscale. Of those patients with imputed values on the function subscale, the majority (66 of 70) were due solely to missing responses to function item 17, which addresses “light domestic duties” (males 44% missing; females 47% missing).

Baseline comparison of the patient populations (development vs. external validation).  The characteristics of the patient population from the hylan G-F 20 study were compared with the patient population of the Alberta HIP Study data set. Clinical and demographic variables, the WOMAC total and subscale scores, the domains of the HUI, and overall score were compared descriptively at baseline. Continuous variables were compared using an unpaired t-test, and a chi-square test was used to compare nominal data.

Validation of Prediction Models

Each eligible patient's baseline data from the Alberta HIP Study validation data set were used for validating the model.

The following four measures of predictive performance and the 95% confidence interval were estimated for each of the models developed by Grootendorst et al. [5]. Three of the four criteria (ME, MAE, and RMSE) are derived as a function of the forecast error (the difference between a subject's actual and predicted HUI3 score) of all subjects in the validation sample. The ME is the average of the forecast errors. The MAE is the average of the absolute forecast errors. The RMSE is the positive square root of the average squared forecast error. The fourth criterion, the ICC, a measure of agreement, is the ratio of the between-subject variability to the total residual variability from a two-way mixed model ANOVA.
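The three error-based criteria can be written directly from the definitions above. The following is a minimal sketch using only the Python standard library; the function names and the three example scores are hypothetical, and the ICC is omitted because its exact ANOVA formulation is not specified here.

```python
import math


def forecast_errors(actual, predicted):
    """Forecast error per subject: actual minus predicted HUI3 score."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and equal length")
    return [a - p for a, p in zip(actual, predicted)]


def mean_error(actual, predicted):
    """ME: average of the forecast errors (signed, so errors can cancel)."""
    errors = forecast_errors(actual, predicted)
    return sum(errors) / len(errors)


def mean_absolute_error(actual, predicted):
    """MAE: average of the absolute forecast errors."""
    errors = forecast_errors(actual, predicted)
    return sum(abs(e) for e in errors) / len(errors)


def root_mean_square_error(actual, predicted):
    """RMSE: positive square root of the average squared forecast error."""
    errors = forecast_errors(actual, predicted)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical scores for three subjects.
actual = [0.40, 0.60, 0.80]
predicted = [0.50, 0.55, 0.70]
print(mean_error(actual, predicted))
print(mean_absolute_error(actual, predicted))
print(root_mean_square_error(actual, predicted))
```

Because the ME keeps the sign of each error, over- and underpredictions offset one another, whereas the MAE and RMSE penalize every individual miss.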

Mean error is useful in assessing the ability of a model to predict group-level (average) HUI3 scores because, at the group level, the patient-specific idiosyncratic components of HUI3 tend to cancel each other out. In contrast, MAE, ICC, and RMSE are useful for assessing prediction at the individual patient level. We used bootstrapping to evaluate the performance of the predictive models. For each of 1000 bootstrapped replications, 145 patients were sampled with replacement from the Alberta HIP Study validation data set. For each replication, the criteria (MAE, RMSE, ICC, and ME) were calculated for the predicted scores from each of the three predictive models. For each criterion and each predictive model, the mean criterion value was calculated as the mean of the resulting bootstrapped values, and the 95% confidence interval was taken as the 2.5th and 97.5th percentiles of their distribution.
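The bootstrap procedure can be sketched as follows. This is an illustration under stated assumptions rather than the authors' R or SAS code: the data are simulated, the names `bootstrap_ci` and `percentile` are hypothetical, and the percentile rule (linear interpolation) is one of several common conventions.

```python
import random


def mae(actual, predicted):
    """Mean absolute forecast error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def percentile(sorted_values, q):
    """Percentile (q in 0..100) of pre-sorted values, linear interpolation."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac


def bootstrap_ci(actual, predicted, criterion, n_reps=1000, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) of the criterion
    over n_reps resamples of patients drawn with replacement."""
    rng = random.Random(seed)
    n = len(actual)
    values = []
    for _ in range(n_reps):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients
        values.append(criterion([actual[i] for i in idx],
                                [predicted[i] for i in idx]))
    values.sort()
    return sum(values) / n_reps, percentile(values, 2.5), percentile(values, 97.5)


# Simulated data for 145 hypothetical patients (not study data).
sim = random.Random(1)
actual = [sim.uniform(0.0, 1.0) for _ in range(145)]
predicted = [a + sim.gauss(0.0, 0.15) for a in actual]

mean_mae, lower, upper = bootstrap_ci(actual, predicted, mae)
print(mean_mae, lower, upper)
```

Resampling whole patients keeps each actual score paired with its own prediction, which is what the criteria require.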


Statistical analyses were conducted using R software version 2.1.0 (R Development Core Team, 2005) [10] and SAS version 8.2 [11].


Baseline Comparability

The mean age of the 145 subjects from the Alberta HIP Study was 46.9 years, significantly younger than the 255 subjects from the hylan G-F 20 prediction model sample, which had a mean age of 63.1 years (P < 0.001, Table 1). The gender distribution between the two studies also differed: Only 32% of the subjects in the validation data were female compared with 70% in the prediction model sample (P < 0.001, Table 1). This was likely due to the stringent exclusion criteria of the validation sample in which women of childbearing age were not allowed to participate. The body-mass index of the subjects in the validation sample was significantly lower (P < 0.001) than the model sample subjects (26.7 vs. 32.5 kg/m2, respectively). The majority of subjects in both studies were white. The years since the onset of OA, and SF-36 physical and mental component scales were not statistically significantly different between the two sample groups. No inferences were made regarding the differences in x-ray grade between the two population samples, as grading for the validation sample patients was not a study design element. The distribution of socioeconomic status in the model sample differed from that in the validation sample; however, no formal statistical comparisons were performed.

Table 1.  Baseline demographics

| Variable | Hylan G-F 20 study (n = 255), model development data set | Alberta HIP study (n = 145), validation data set | P-value |
|---|---|---|---|
| Continuous variables | Mean (SD) | Mean (SD) | |
| Age (years) | 63.1 (9.98) | 46.9 (6.47) | <0.001 |
| Weight (kg) | 86.6 (19.32) | 81.7 (22.34)* | 0.011 |
| Years since onset of osteoarthritis (years) | 9.5 (9.59) | 9.3 (8.15) | 0.416 |
| SF-36 Physical Component scale | 28.3 (7.25) | 29.1 (7.87) | 0.858 |
| SF-36 Mental Component scale | 50.8 (11.84) | 48.8 (15.02) | 0.077 |
| BMI | 32.5 (7.63) | 26.7 (5.05)§ | <0.001 |
| Discrete variables | % | % | |
| X-ray grade | | | |
| Grade 0 | 3 | NA | NA |
| Grade I | 11 | NA | |
| Grade II | 26 | NA | |
| Grade III | 34 | NA | |
| Grade IV | 26 | NA | |
| Household income | | | |
| Other knee affected by osteoarthritis | 84.7 | NA | NA |
| Other knee requires treatment | 54.1 | NA | NA |
| Joints affected by osteoarthritis (right or left hip, or spine, or interphalangeal joints (hand), or thumb carpal meta-carpal joint, or first (MTP) | 99.6 | NA | NA |
| Prior surgery of the study knee (years) | 30.59 | NA | NA |
| Total number of “now” conditions** | | | NA |
| 1–2 current health problems | 14.62 | NA | |
| 3+ current health problems | 1.58 | NA | |
| Overall health in the past 4 weeks | | | |
| Very good | 16.27 | NA | NA |
| Very poor | 1.19 | NA | |

  • Please note that the sample size was n = 255 for the hylan G-F 20 study and n = 145 for the HIP study with the exception of: *n = 141, n = 253, n = 128, §n = 138.

  • For the validation component using the Alberta HIP Study patients, x-ray grading was not assessed at baseline.

  • ** The total number of medical problems reported from the Clinical Health Assessment Questionnaire A, responses include: high blood pressure, heart attack, other heart condition, stroke, mental illness, depression, diabetes, cancer, alcohol or drug problem, kidney problem, lung problem, cataract, asthma, severe allergies, liver or gall bladder problem, ulcers or stomach problem, neurological problem, fracture of spine, thyroid or endocrine disorder, problems with prostate (men) or uterus/ovaries (women).

  • BMI, body-mass index = weight in kg/(height in m)²; MTP, metatarsophalangeal; NA, not available.

The WOMAC total score (Table 2) differed significantly between the model and the validation sample groups (P = 0.002). The validation sample had consistently lower scores in all three subscales of the WOMAC (pain, stiffness, and function; P = 0.007, P = 0.019, and P = 0.001, respectively), indicating that the validation sample subjects had less morbidity than the subjects in the model sample.

Table 2.  Baseline WOMAC scores

| WOMAC scores* | Hylan G-F 20 study (n = 255), model development data set, mean (SD) | Alberta HIP study (n = 145), validation data set, mean (SD) | P-value |
|---|---|---|---|
| WOMAC total score | 18.0 (3.95) | 16.8 (4.44) | 0.002 |
| WOMAC subscale scores | | | |
| Pain (0–20) | 11.6 (2.81) | 10.9 (3.34) | 0.007 |
| Stiffness (0–8) | 5.1 (1.46) | 4.8 (1.38) | 0.019 |
| Function (0–68) | 39.9 (9.25) | 36.6 (11.24) | 0.001 |

  • * The higher the score, the worse the health state.

  • WOMAC, Western Ontario and McMaster Osteoarthritis Index.

The mean overall HUI3 utility score at baseline was 0.52 in the validation sample and 0.48 in the model sample (Table 3, P = 0.957). Although this difference is not statistically significant, it is consistent with the overall impression from the data that the model sample patients were less healthy, probably because of their more advanced age and comorbidities. For example, the model sample patients had worse vision, hearing, speech, dexterity, emotion, and cognition. Interestingly, the only statistically significant differences between the two groups were in the single-attribute HUI3 utility scores for ambulation and pain, where the validation sample patients were significantly worse in both. This finding is consistent with the fact that these patients were on a waiting list to receive hip replacement.

Table 3.  Baseline HUI3 utility scores

| HUI3 utility scores* | Hylan G-F 20 study (n = 255), model development data set, mean (SD) | Alberta HIP study (n = 145), validation data set, mean (SD) | P-value |
|---|---|---|---|
| Overall | 0.48 (0.23) | 0.52 (0.21) | 0.957 |
| Single attribute utility scores | | | |
| Vision | 0.93 (0.88) | 0.96 (0.07) | 0.639 |
| Hearing | 0.93 (0.19) | 0.99 (0.05) | 0.999 |
| Speech | 0.98 (0.08) | 1.00 (0.03) | 0.998 |
| Ambulation | 0.92 (0.14) | 0.82 (0.14) | <0.001 |
| Dexterity | 0.95 (0.13) | 1.00 (0.02) | 0.999 |
| Emotion | 0.92 (0.16) | 0.94 (0.11) | 0.909 |
| Cognition | 0.92 (0.16) | 0.98 (0.06) | 0.999 |
| Pain | 0.53 (0.25) | 0.43 (0.26) | <0.001 |

  • * The higher the score, the better the overall health utility.

  • HUI3, Health Utilities Index Mark 3.

Model Performance and Prediction

Table 4 reports the ME, MAE, RMSE, and ICC prediction performance of each of the prediction models 1 to 3 in both the model (hylan G-F 20) and validation (Alberta HIP) samples (model 4 results are not reported because information on x-ray grade was unavailable in the validation sample). Somewhat surprisingly, the validation sample outperformed the model sample in three of the four prediction performance measures (MAE, RMSE and ICC). The model sample had a lower ME score than did the validation sample.

Table 4.  Comparison of model performance criteria

| Performance criterion | Model | Hylan G-F 20 study (model development data set): mean | Hylan G-F 20 study: 95% confidence interval* | Alberta HIP study (validation data set): mean | Alberta HIP study: 95% confidence interval* |
|---|---|---|---|---|---|
| MAE | Model 1 | 0.1645 | 0.1486, 0.1798 | 0.1382 | 0.1221, 0.1548 |
| MAE | Model 2 | 0.1652 | 0.1488, 0.1813 | 0.1375 | 0.1214, 0.1544 |
| MAE | Model 3 | 0.1629 | 0.1457, 0.1779 | 0.1360 | 0.1195, 0.1524 |
| RMSE | Model 1 | 0.2083 | 0.1872, 0.2290 | 0.1698 | 0.1496, 0.1898 |
| RMSE | Model 2 | 0.2096 | 0.1868, 0.2310 | 0.1713 | 0.1526, 0.1905 |
| RMSE | Model 3 | 0.2066 | 0.1846, 0.2273 | 0.1684 | 0.1480, 0.1885 |
| ICC | Model 1 | 0.5360 | 0.4513, 0.6109 | 0.5676 | 0.4725, 0.6539 |
| ICC | Model 2 | 0.5379 | 0.4536, 0.6134 | 0.5642 | 0.4683, 0.6456 |
| ICC | Model 3 | 0.5557 | 0.4696, 0.6293 | 0.5745 | 0.4805, 0.6593 |
| ME | Model 1 | −0.0007 | −0.0442, 0.0412 | 0.0179 | −0.0126, 0.0434 |
| ME | Model 2 | −0.0005 | −0.0431, 0.0427 | 0.0175 | −0.0129, 0.0440 |
| ME | Model 3 | −0.0006 | −0.0422, 0.0397 | 0.0120 | −0.0183, 0.0376 |

  • * Percentile 95% confidence intervals obtained from the distribution (of MAE, RMSE, ICC or ME).

  • RMSE, root mean square error; MAE, mean absolute error; ME, mean error; ICC, intraclass correlation coefficient.

  • Model 1 = WOMAC.

  • Model 2 = WOMAC + DEMOG.

  • Model 3 [Primary Model] = WOMAC + DEMOG + YRSOA.

  • WOMAC = f{Pain, Stiffness, Function, (Pain × Stiffness), (Pain × Function), (Stiffness × Function), Pain², Stiffness², Function²}.

  • DEMOG = f{Age, Age², Gender}.

  • YRSOA = f{Years since onset of OA, (Years since onset of OA)²}.

  • WOMAC = Western Ontario and McMaster Osteoarthritis Index.

  • WOMAC Pain (0–20).

  • WOMAC Stiffness (0–8).

  • WOMAC Function (0–68).

Models that included the WOMAC subscale scores and their pairwise interaction terms, as well as demographics (age, age squared, gender), and the years of OA in the study knee (years of OA, years of OA squared), that is, model 3 (primary model) and model 4, performed marginally better as they were better able to account for the variability in the response than model 1, which excluded these covariates. The results from the validation sample support the conclusions from the original study in that the primary model identified during development (model 3) consistently performed better on the MAE, RMSE, ICC, and ME criteria than the secondary models (models 1 and 2).

Figure 1 illustrates the relationship between the actual and predicted HUI3 scores for both the model and the validation samples. A unit change on the horizontal axis is equivalent to a unit change along the vertical axis; consequently a slope of 1 would be indicative of the perfect agreement between the two measures (i.e., the 45-degree line from the origin to 1; unity slope).

Figure 1.

(a) Predicted versus actual HUI3 score—Hylan G-F 20 Study. (b) Predicted versus actual HUI3 score—Alberta HIP Study. HUI3, Health Utilities Index Mark 3.

The slope of the graph of actual versus predicted HUI3 scores for the model sample (Fig. 1a) does not equal 1, and more observations fall above the line. The data points were not uniformly distributed about the 45-degree line because the prediction model overpredicted HUI3 scores for those with low HUI3 scores and underpredicted HUI3 scores for those with higher HUI3 scores. The hypothesis that the intercept = 0 and slope = 1 could not be rejected.

This finding is confirmed by the ME estimate for the primary and secondary models (Table 4), which, although negative, is very close to zero. When the observed HUI3 scores were regressed on the predicted HUI3 scores in the validation sample, the fitted line was y = 0.059 + 0.906 × (Predicted HUI3 from Model 3). At the 5% level of significance, there was no evidence that the intercept differs from zero (P = 0.240) or that the slope differs from one (P = 0.329). Because the slope is less than one, a greater proportion of the data points fall below the theoretical 45-degree line (or above it, if the axes are reversed; see Fig. 1b). This is consistent with the ME estimates for the primary and secondary models of Table 4, which, although close to zero, are larger and more positive than those from the model sample. In Figure 1, the predicted HUI3 scores are compressed between 0.2 and 0.8, while the actual HUI3 scores for the samples are more spread out. These data suggest that the actual “extreme” HUI3 scores are being underestimated in this population. This compression of predicted scores could be attributed to the differences in the two instruments, with the WOMAC not being able to predict the eight attributes of the HUI3 from the two attributes (ambulation and pain) that it covers.
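The calibration check described above (regressing observed on predicted scores, then testing intercept = 0 and slope = 1) can be sketched with ordinary least squares. This is a minimal illustration, not the authors' analysis: the function name `calibration_fit` and the four data points are hypothetical, and the P-values use a normal approximation to the t distribution, which is an assumption that is only reasonable for large samples.

```python
import math
from statistics import NormalDist


def calibration_fit(observed, predicted):
    """OLS of observed on predicted, with separate two-sided tests of
    H0: intercept = 0 and H0: slope = 1 (normal approximation)."""
    n = len(observed)
    mx = sum(predicted) / n
    my = sum(observed) / n
    sxx = sum((x - mx) ** 2 for x in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(predicted, observed))
    slope = sxy / sxx
    intercept = my - slope * mx
    # Residual variance with n - 2 degrees of freedom.
    resid = [y - (intercept + slope * x) for x, y in zip(predicted, observed)]
    s2 = sum(r * r for r in resid) / (n - 2)
    se_slope = math.sqrt(s2 / sxx)
    se_intercept = math.sqrt(s2 * (1.0 / n + mx * mx / sxx))

    def two_sided_p(z):
        return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

    return {
        "intercept": intercept,
        "slope": slope,
        "p_intercept_eq_0": two_sided_p(intercept / se_intercept),
        "p_slope_eq_1": two_sided_p((slope - 1.0) / se_slope),
    }


# Hypothetical predicted and observed scores (not study data).
predicted = [0.2, 0.4, 0.6, 0.8]
observed = [0.25, 0.40, 0.62, 0.77]
fit = calibration_fit(observed, predicted)
print(fit)
```

A slope below 1 with a positive intercept, as in the fitted line reported above, is the pattern expected when predictions are compressed toward the middle of the scale.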


Grootendorst et al. [5] developed models to predict HUI3 HRQL scores using the WOMAC scores of a group of individuals with OA of the knee. These models were used to predict HRQL of a group of individuals with OA of the hip and thereby validate the models. These “validation” subjects were drawn from different geographic locations, had different clinical presentations, and experienced substantially lower levels of morbidity than did the subjects used to estimate the models. Despite these differences, the ability of the models to predict HRQL scores of the validation subjects actually exceeded the models' predictive performance in the original model subjects.

There are several limitations to the present analysis. First, in the validation data set, x-ray grade was not measured alongside other clinical demographics. As such, the model prediction estimates could not be examined without making assumptions about baseline x-ray grade. Second, the sample size in the validation sample was driven by the nature of the intervention and thus data post baseline were not utilized as the intervention essentially altered the patient's fundamental OA status. Had the intervention been different, perhaps data post baseline could have been utilized, resulting in a larger sample size and more robust validation estimates. Third, the fact that validation sample subjects had lower WOMAC scores (i.e., less morbidity from OA) than model sample subjects was not anticipated. A priori, we expected greater morbidity in the validation study group as these subjects underwent “hip replacement” as opposed to model study subjects, who received intra-articular injections to reduce pain in the hip/knee. One reason for this is that validation subjects were younger and had less comorbidity than the average hip replacement patient. The mean age for total hip arthroplasty in the general population has been reported as 62 years [12], compared with 47 years in the validation sample. It seems likely that validation subjects had worn out hips because of more intense physical activity and might not necessarily be representative of patients with hip replacements.

Finally, although the prediction models can reliably predict group-average utility scores, they cannot accurately predict patient-level utility scores, because utility scores also vary with dysfunction in health domains not captured in the WOMAC. For patient-level analyses, we recommend that utility scores be measured directly.


The prediction models developed by Grootendorst et al. [5] were validated using Alberta HIP Study data. These models provide researchers with a tool that can reliably map disease-specific HRQL scores measured with the WOMAC into utility scores for use in situations where directly measured preference-based HRQL data are unavailable. The predicted utility scores can be used to calculate QALYs for cost-effectiveness analysis in clinical and economic appraisals of interventions that target chronic diseases, such as OA, which affect primarily functional capacity, not longevity. One must, however, use caution when using this tool, as it was developed to function on a group level and should not be applied to predict patient-level utility.

The validation analysis was funded by the Alberta Improvements for Musculoskeletal Disorders Study (AIMS). AIMS was initiated by the Alberta Ministry of Health to improve the care and quality of life for patients with acute and chronic musculoskeletal disorders. An independent advisory committee was assembled to oversee the study design, analysis, and interpretation of the findings. Innovus Research Inc. is an independent health economics research organization responsible for the design and execution of the study. The authors would like to thank the Genzyme Corporation for permitting the data from the hylan G-F 20 study [8] to be used for this study.

Financial support: Alberta Improvements for Musculoskeletal Disorders Study (AIMS). AIMS was initiated by the Alberta Ministry of Health to improve the care and quality of life for patients with acute and chronic musculoskeletal disorders.

Conflict of interest: David Feeny and George Torrance have a proprietary interest in Health Utilities Incorporated, Dundas, Ontario, Canada. HUInc. distributes copyrighted Health Utilities Index (HUI) materials and provides methodological advice on the use of HUI.