To compare the EuroQol (EQ-5D) and Short Form 6D (SF-6D) among multiethnic Asian patients with knee osteoarthritis (OA) scheduled for total knee replacement in Singapore.
Patients were asked to complete questionnaires including the EQ-5D, Short Form 36, Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and Lequesne knee index. EQ-5D and SF-6D utility scores were calculated using the scoring algorithms developed from the UK general population. Agreement between the 2 instruments was assessed by comparing their score distributions, means, medians, intraclass correlation coefficients (ICCs), and a Bland-Altman plot. Correlations of the EQ-5D and SF-6D with WOMAC and Lequesne knee index scores were also examined.
A consecutive sample of 258 knee OA patients (127 English-speaking and 131 Chinese-speaking) participated. The mean ± SD EQ-5D utility score was 0.49 ± 0.31 (range −0.25–1.00) and the mean SF-6D utility score was 0.63 ± 0.12 (range 0.32–0.89). In a hypothetical example, this 0.14-point difference in mean utility scores yielded a difference of $10,000/quality-adjusted life year (QALY) in cost-effectiveness ratios. The score distribution was bimodal for the EQ-5D and normal for the SF-6D. This poor agreement was also demonstrated by the Bland-Altman plot and the low ICC (range 0.18–0.54). Correlations of the WOMAC and Lequesne index with the EQ-5D were higher than with the SF-6D.
Using different preference-based health-related quality of life instruments may yield different utility scores, which could have a great impact on QALY estimates. This highlights the importance of selecting appropriate instruments for economic evaluation. Additional research is needed to determine which instrument (the EQ-5D or the SF-6D) should be used in OA patients.
Preference-based health-related quality of life (HRQOL) measures are commonly used to elicit health state values for calculating quality-adjusted life years (QALYs), which are an essential component of cost-utility analysis. Cost-utility analysis and other economic evaluations have become increasingly important in informing selection of cost-effective treatments, given the increasing dilemma of new and costly therapeutic options coupled with finite health resources.
The EuroQol (EQ-5D) (1) and Short Form 6D (SF-6D) (2) are preference-based HRQOL instruments increasingly used for economic evaluations of clinical interventions and health programs. Both instruments classify a respondent's self-reported health status according to a specific descriptive/classification system and assign a utility score from a scale on which 1 represents a state of full health and 0 represents being dead. Both instruments measure health in terms of physical function, pain, and mental health, and have a scoring function derived from statistical modeling of preferences for multideficit health states elicited from the UK general population (3). However, some differences between these 2 instruments need to be noted. First, the SF-6D describes more health states than the EQ-5D (18,000 versus 243 health states) and therefore may capture smaller health changes (4). Second, EQ-5D utility scores are health preferences measured using the time trade-off (TTO) method, whereas the SF-6D utility function is derived using the standard gamble (SG) method (1, 2). These differences may lead to different utility scores when applied to the same subject, and therefore caution needs to be exercised when choosing these instruments for measuring utility in a particular study.
Knee osteoarthritis (OA), a chronic degenerative disease, is one of the leading causes of pain and physical disability worldwide (5). Its impact on HRQOL of patients has been demonstrated to vary to a certain extent across sociocultural contexts (6, 7). The EQ-5D has been widely used to measure HRQOL of patients with OA (8–13), whereas several studies have used the SF-6D (a more recently available instrument) in patients with OA (14, 15). These 2 instruments have been compared in other diseases (14, 16–25) and in a general population (26, 27). However, some gaps are noted in the existing literature. First, conclusions as to which instrument performs better have not been consistent across diseases and studies, highlighting the necessity for such comparisons to be made in a wider spectrum of diseases and sociocultural contexts. Second, only a few studies in Western countries have compared the performance of the EQ-5D and SF-6D in patients with OA (9, 14), a disease in which economic evaluations are expected to be of increasing importance (28). Therefore, the goal of the present study was to compare the performance of the EQ-5D and SF-6D in multiethnic Asian patients with knee OA in Singapore.
In this institutional review board–approved study, a consecutive sample of patients with knee OA was recruited from the Department of Orthopaedic Surgery at Singapore General Hospital, a tertiary referral hospital in Singapore, from August to December 2005. Patients were eligible if they were diagnosed with knee OA by their attending orthopedic surgeon based on clinical and radiographic features, were scheduled for total knee replacement (TKR), and consented to participate in this study. Each patient was interviewed by a trained interviewer in either English or Chinese using an identical, pretested questionnaire containing the EQ-5D, Short Form 36 (SF-36), Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC), and Lequesne knee index and assessing sociodemographic data and chronic medical conditions.
The EQ-5D self-report questionnaire measures 5 domains of HRQOL including mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each of the 5 domains is assessed by a single question with 3 response levels (no problem, some problems, and extreme problems). The EQ-5D defines a total of 243 health states. The EQ-5D scoring algorithm was first developed using TTO-based preference scores for a sample of these health states from a representative sample of the UK general population (1). This EQ-5D algorithm is used worldwide and generates scores ranging from −0.59 to 1.00, with negative scores representing health states worse than being dead, 0 representing being dead, and 1.00 representing the state of full health. Both English and Chinese versions of the EQ-5D have been validated in Singapore patients with rheumatic diseases including OA (29, 30).
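As an illustration of how a 5-digit EQ-5D profile maps to a utility score, the following is a minimal sketch of an additive scoring function. The decrement values are assumed from the commonly published UK TTO value set and are not given in this article, so they should be checked against reference (1) before use; the function and variable names are hypothetical.

```python
# Sketch of EQ-5D (3-level) scoring with the UK TTO value set.
# ASSUMPTION: the decrements below are taken from the commonly published
# UK tariff, not from this article; verify against reference (1).
DECREMENTS = {
    "mobility":           {1: 0.0, 2: 0.069, 3: 0.314},
    "self_care":          {1: 0.0, 2: 0.104, 3: 0.214},
    "usual_activities":   {1: 0.0, 2: 0.036, 3: 0.094},
    "pain_discomfort":    {1: 0.0, 2: 0.123, 3: 0.386},
    "anxiety_depression": {1: 0.0, 2: 0.071, 3: 0.236},
}
DOMAINS = list(DECREMENTS)  # domain order matches the profile digits
CONSTANT = 0.081  # subtracted once if any domain is above level 1
N3 = 0.269        # subtracted once if any domain is at level 3


def eq5d_utility(profile: str) -> float:
    """Return the utility score for a profile such as '21121'."""
    levels = [int(ch) for ch in profile]
    if len(levels) != 5 or any(level not in (1, 2, 3) for level in levels):
        raise ValueError("profile must be 5 digits, each 1, 2, or 3")
    if all(level == 1 for level in levels):
        return 1.0  # full health
    score = 1.0 - CONSTANT
    for domain, level in zip(DOMAINS, levels):
        score -= DECREMENTS[domain][level]
    if 3 in levels:
        score -= N3
    return round(score, 3)


# Number of distinct health states: 3 levels in each of 5 domains.
N_STATES = 3 ** 5
```

Under these assumed decrements, the worst state (33333) scores about −0.59 and full health (11111) scores 1.00, which agrees with the range stated above.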
The SF-6D, developed from the SF-36, is a multidimensional health classification system assessing the 6 health domains of physical functioning, role limitation, social functioning, pain, mental health, and vitality, with 4–6 levels for each domain. An SF-6D health state is defined by selecting 1 level from each domain, which results in a total of 18,000 health states. The SF-6D scoring algorithm was developed using the SG method from a sample of 249 SF-6D health states from a representative sample of the UK population (2). Utility scores generated by the SF-6D range from 0.29 to 1.00, with 1.00 representing full health and 0.29 representing the worst possible health state defined by the SF-6D (i.e., all domains being at the worst level). English and Chinese versions of the SF-6D have been demonstrated to be equivalent in Singapore (31).
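The total of 18,000 SF-6D health states follows from multiplying the number of levels in each domain. The short sketch below makes that arithmetic explicit; the per-domain level counts are an assumption based on the SF-36-derived classification (the text states only that each domain has 4–6 levels), and the names are illustrative.

```python
from math import prod

# ASSUMPTION: levels per domain for the SF-36-derived SF-6D; the article
# states only "4-6 levels for each domain" and the 18,000 total.
SF6D_LEVELS = {
    "physical_functioning": 6,
    "role_limitation": 4,
    "social_functioning": 5,
    "pain": 6,
    "mental_health": 5,
    "vitality": 5,
}

# One level is chosen per domain, so the state count is the product.
sf6d_states = prod(SF6D_LEVELS.values())

# For comparison, the EQ-5D has 5 domains with 3 levels each.
eq5d_states = 3 ** 5
```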
The WOMAC, a 24-item OA-specific index, consists of 3 domains: pain, stiffness, and physical function. Each of these 24 items is graded on either a 5-point Likert scale or a 100-mm visual analog scale (32, 33). In this study, we used Likert scale scoring (version LK 3.0). The domain score ranges from 0 to 20 for pain, 0 to 8 for stiffness, and 0 to 68 for physical function. A rescaled score is calculated for each domain with 100 indicating no symptoms and functional limitation and 0 indicating extreme symptoms and functional limitation.
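The rescaling of WOMAC domain scores can be sketched as follows. A simple linear transformation is assumed here, since the text gives only the end points (100 for no symptoms, 0 for extreme symptoms); the function name is hypothetical.

```python
# Maximum raw Likert scores per WOMAC domain, as stated in the text.
WOMAC_MAX = {"pain": 20, "stiffness": 8, "physical_function": 68}


def rescale_womac(domain: str, raw_score: float) -> float:
    """Map a raw domain score to 0-100, where 100 means no symptoms.

    ASSUMPTION: linear rescaling; the article does not give the formula.
    """
    max_score = WOMAC_MAX[domain]
    if not 0 <= raw_score <= max_score:
        raise ValueError(f"raw score must lie between 0 and {max_score}")
    return 100.0 * (max_score - raw_score) / max_score
```

For example, under this assumption a raw stiffness score of 4 out of 8 would be rescaled to 50.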
The Lequesne knee index, an interview format knee OA–specific index, consists of 3 domains assessing pain or discomfort (5 items), maximum distance walked (1 item), and physical disability (4 items) (34, 35). Each domain has a score ranging from 0 (with no problems) to 8 (with extreme problems). This instrument generates a single global index with scores ranging from 0 to 24.
EQ-5D and SF-6D utility scores were calculated using the scoring algorithms developed by Dolan (1) and Brazier et al (2), respectively. Scores for both instruments were compared at the domain and scale levels. At the domain level, we assessed the degree of agreement between the domains of these 2 instruments measuring similar health constructs using Spearman's rank correlation, with correlation coefficients of >0.50, 0.35–0.50, and <0.35 considered strong, moderate, and weak, respectively, based on the literature (36).
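The correlation thresholds above can be written as a small helper. This is only a sketch: treating the thresholds as applying to the absolute value of the coefficient is an assumption, and the function name is illustrative.

```python
def correlation_strength(r: float) -> str:
    """Classify a correlation coefficient using the cutoffs in the text.

    ASSUMPTION: the cutoffs apply to the magnitude of r, so the sign
    (direction of association) is ignored.
    """
    magnitude = abs(r)
    if magnitude > 0.50:
        return "strong"
    if magnitude >= 0.35:
        return "moderate"
    return "weak"
```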
At the scale level, we compared the distribution of scores and the mean ± SD and median (interquartile range [IQR]) utility scores generated by both instruments. Paired comparisons were made for the entire sample as well as for several subgroups categorized by ethnicity (Chinese or other ethnic groups), language spoken (English or Chinese), self-reported general health (excellent, good, or fair), and dichotomous levels of impairment. Self-reported general health was defined as excellent, good, or fair if a patient's response to the first question of the SF-36, “In general, would you say your health is?” (which is not part of the SF-6D), was excellent/very good, good, or fair/poor, respectively. Patients with different levels of impairment were classified according to their Lequesne global index score using 10 as a cutoff point as recommended in the literature (35). The degree of agreement between utility scores of both instruments was also examined by calculating the intraclass correlation coefficient (ICC; 2-way random-effects model with absolute agreement) (14) and using a Bland-Altman plot of EQ-5D and SF-6D utility scores (37). For 2 instruments, an ICC ≥0.7 suggests an acceptable level of agreement (38). Associations between the 2 preference-based instruments and OA-specific questionnaires were assessed using Spearman's rank correlation and plotting EQ-5D and SF-6D utility scores against Lequesne global index scores.
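For readers who wish to reproduce this type of agreement analysis, the following is a minimal sketch of a 2-way random-effects ICC with absolute agreement for 2 instruments. It assumes the single-measure form of the coefficient, which the text does not specify, and it assumes that truncation in a sensitivity analysis means setting negative scores to 0; the analysis in this study was run in SPSS, and the function names here are hypothetical.

```python
def icc_absolute_agreement(scores_a, scores_b):
    """2-way random-effects, absolute-agreement, single-measure ICC.

    ASSUMPTION: single-measure form; the article does not say whether
    single or average measures were reported.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with >= 2 subjects")
    rows = list(zip(scores_a, scores_b))
    n, k = len(rows), 2  # n subjects, k instruments
    grand = sum(a + b for a, b in rows) / (n * k)
    row_means = [(a + b) / k for a, b in rows]
    col_means = [sum(scores_a) / n, sum(scores_b) / n]

    # Sums of squares for subjects, instruments, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in rows for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def truncate_negative(scores):
    """Set scores below 0 to 0 (ASSUMED meaning of truncation)."""
    return [max(0.0, s) for s in scores]
```

Because absolute agreement penalizes systematic offsets, two sets of scores that are perfectly correlated but shifted by a constant yield an ICC below 1, which is why a strong correlation and a low ICC can coexist.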
Data were entered into a Microsoft Excel spreadsheet (Microsoft, Redmond, WA) and were analyzed using SPSS software, version 13.0 (SPSS, Chicago, IL). All statistical tests were 2-tailed and conducted at a 5% level of significance.
A consecutive sample of 258 patients with knee OA participated in this study, including 127 English-speaking patients and 131 Chinese-speaking patients. Characteristics of patients in each language group and the total sample are shown in Table 1. The mean age of patients was 66.6 years and the mean duration of OA was 6 years; the majority were women (83%) and of Chinese ethnicity (89%). The characteristics of English- and Chinese-speaking patients were generally similar (Table 1). No significant differences in WOMAC and Lequesne global index scores between English- and Chinese-speaking patients were found (P > 0.1 by Student's t-tests).
Table 1. Characteristics of patients by language group. Values are the number (percentage) unless indicated otherwise.
| Characteristic | English (n = 127) | Chinese (n = 131) | Total (n = 258) |
|---|---|---|---|
| Age, mean ± SD years | 65.3 ± 7.9 | 67.8 ± 7.1 | 66.6 ± 7.6 |
| Female sex | 97 (76.4) | 116 (88.5) | 213 (82.6) |
| Ethnicity | | | |
| Chinese | 99 (78.0) | 131 (100) | 230 (89.1) |
| Malay | 10 (7.9) | – | 10 (3.9) |
| Indian | 14 (11.0) | – | 14 (5.4) |
| Other | 4 (3.1) | – | 4 (1.6) |
| Years of education | | | |
| No formal education | 34 (26.8) | 73 (55.7) | 107 (41.5) |
| 1–6 | 44 (34.6) | 41 (31.3) | 85 (32.9) |
| 7–10 | 33 (26.0) | 13 (9.9) | 46 (17.8) |
| >10 | 12 (9.4) | 2 (1.6) | 14 (5.4) |
| Married | 113 (89.0) | 122 (93.1) | 235 (91.1) |
| Retirees/homemakers | 103 (81.1) | 119 (90.8) | 222 (86.0) |
| Body mass index, mean ± SD kg/m2† | 28.6 ± 5.4 | 27.8 ± 3.9 | 28.2 ± 4.7 |
| Duration of osteoarthritis, mean ± SD years | 5.9 ± 5.6 | 6.1 ± 4.7 | 6.0 ± 5.2 |
| Knee scheduled for surgery | | | |
| Right | 75 (59.1) | 74 (56.5) | 149 (57.8) |
| Left | 50 (39.4) | 56 (42.7) | 106 (41.1) |
| Both | 2 (1.6) | 1 (0.8) | 3 (1.2) |
| Presence of chronic medical conditions‡ | 87 (68.5) | 89 (67.9) | 176 (68.2) |
| Lequesne global index score, mean ± SD | 14.6 ± 4.2 | 15.0 ± 3.8 | 14.8 ± 4.0 |
| WOMAC scores, mean ± SD | | | |
| Pain | 67.5 ± 16.6 | 67.9 ± 15.0 | 67.7 ± 15.7 |
| Stiffness | 60.6 ± 25.7 | 61.5 ± 25.0 | 61.0 ± 25.3 |
| Physical function | 60.8 ± 16.9 | 62.0 ± 12.8 | 61.4 ± 15.0 |
The degree of agreement between EQ-5D and SF-6D domains measuring similar constructs was weak to moderate. Spearman's correlation coefficients were 0.44 between SF-6D pain and EQ-5D pain/discomfort and 0.43 between SF-6D mental health and EQ-5D anxiety/depression. SF-6D physical functioning correlated weakly with EQ-5D mobility, self-care, and usual activities (r = 0.28–0.33), reflecting the fact that these domains measured somewhat different aspects of HRQOL.
The 3 most commonly reported EQ-5D profiles were 11121 (14.8%), 21121 (12.5%), and 21221 (12.1%), whereas reported SF-6D profiles were spread across a very large number of states, none of which was reported by more than 4 patients (1.6%). The distribution of EQ-5D utility scores was bimodal (P < 0.001 by Kolmogorov-Smirnov test), whereas that of SF-6D utility scores was normal (P = 0.467 by Kolmogorov-Smirnov test) (Figures 1 and 2). Mean ± SD and median (IQR) EQ-5D utility scores were 0.49 ± 0.31 and 0.66 (0.57), respectively, and corresponding SF-6D utility scores were 0.63 ± 0.12 and 0.63 (0.17), respectively (Table 2). Poor agreement between the 2 instruments was observed despite the strong correlation between them (r = 0.534) (Table 3). The ICC was relatively low for the total sample as well as for the examined subgroups, ranging from 0.18 to 0.54 (Table 2). Given the concern that ICC might be adversely affected by scaling differences between the EQ-5D and the SF-6D, we performed a sensitivity analysis by recalculating the ICC with truncation of EQ-5D scores <0. Interestingly, recalculated ICC values were generally similar to ICC values before truncation, rather than being higher (Table 2). This suggests that observed differences in the ICC are likely to be real rather than due to differences in the scaling of these 2 instruments. The Bland-Altman plot of EQ-5D and SF-6D utility scores (Figure 3) demonstrated a wide interval (i.e., 1.05) between the upper and lower limits of agreement (39).
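The width of the limits-of-agreement interval can be related to the spread of the paired differences. The sketch below assumes the conventional Bland-Altman limits of the mean difference ± 1.96 SD, which the text does not state explicitly, and takes differences as SF-6D minus EQ-5D; the function name is illustrative.

```python
from statistics import mean, stdev


def limits_of_agreement(sf6d_scores, eq5d_scores):
    """Return (lower, upper) Bland-Altman limits of agreement.

    ASSUMPTION: limits are mean difference +/- 1.96 sample SDs, with
    differences taken as SF-6D minus EQ-5D.
    """
    diffs = [s - e for s, e in zip(sf6d_scores, eq5d_scores)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias - half_width, bias + half_width


# Under the same assumption, the reported interval width of 1.05 implies
# an SD of the paired differences of roughly 0.27 utility points.
implied_sd = 1.05 / (2 * 1.96)
```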
Table 2. EQ-5D and SF-6D utility scores and ICCs for all patients and for subgroups.
| Group | n | EQ-5D, mean ± SD | SF-6D, mean ± SD | EQ-5D, median (IQR) | SF-6D, median (IQR) | P† | ICC without truncation | ICC with truncation |
|---|---|---|---|---|---|---|---|---|
| All patients | 258 | 0.49 ± 0.31 | 0.63 ± 0.12 | 0.66 (0.57) | 0.63 (0.17) | < 0.001 | 0.467 | 0.454 |
| Ethnicity | | | | | | | | |
| Chinese | 230 | 0.51 ± 0.30 | 0.64 ± 0.12 | 0.66 (0.57) | 0.64 (0.17) | < 0.001 | 0.471 | 0.446 |
| Non-Chinese | 28 | 0.33 ± 0.32 | 0.56 ± 0.09 | 0.20 (0.60) | 0.57 (0.14) | 0.002 | 0.304 | 0.390 |
| Language | | | | | | | | |
| English | 127 | 0.47 ± 0.31 | 0.62 ± 0.12 | 0.62 (0.57) | 0.62 (0.18) | < 0.001 | 0.391 | 0.378 |
| Chinese | 131 | 0.51 ± 0.31 | 0.64 ± 0.12 | 0.66 (0.54) | 0.64 (0.16) | < 0.001 | 0.538 | 0.531 |
| General health status‡ | | | | | | | | |
| Excellent | 50 | 0.62 ± 0.27 | 0.69 ± 0.11 | 0.69 (0.18) | 0.67 (0.18) | 0.269 | 0.450 | 0.450 |
| Good | 131 | 0.48 ± 0.31 | 0.63 ± 0.11 | 0.62 (0.57) | 0.62 (0.15) | < 0.001 | 0.444 | 0.416 |
| Fair | 77 | 0.42 ± 0.32 | 0.60 ± 0.12 | 0.59 (0.64) | 0.58 (0.15) | < 0.001 | 0.431 | 0.455 |
| Lequesne global index score | | | | | | | | |
| ≤10 | 37 | 0.76 ± 0.09 | 0.71 ± 0.11 | 0.78 (0.11) | 0.72 (0.19) | 0.042 | 0.176 | 0.176 |
| >10 | 221 | 0.44 ± 0.31 | 0.61 ± 0.12 | 0.62 (0.57) | 0.61 (0.13) | < 0.001 | 0.419 | 0.405 |
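Table 3. Spearman's rank correlations of EQ-5D and SF-6D utility scores.
| | EQ-5D utility score | SF-6D utility score |
|---|---|---|
| SF-6D utility score | 0.534† | |
| Lequesne knee index | | |
| Pain or discomfort | −0.570† | −0.328† |
| Maximum distance walked | −0.523† | −0.353† |
| Activities of daily living | −0.533† | −0.281† |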
Mean SF-6D utility scores exceeded mean EQ-5D scores by 0.14 for the entire sample (Table 2 and Figure 3). Conversely, median EQ-5D utility scores exceeded median SF-6D scores by 0.03 for the entire sample. This pattern was generally found across the examined subgroups (Table 2). The differences in utility scores between Chinese and non-Chinese patients and between Chinese- and English-speaking patients were larger for the EQ-5D compared with the SF-6D. The difference in utility scores between different patient groups with various self-reported general health or impairment in HRQOL was more evident for the EQ-5D than the SF-6D (Table 2).
Correlations of EQ-5D utility scores with WOMAC and Lequesne index scores were strong, with the exception of the WOMAC stiffness domain. In contrast, correlations of SF-6D utility scores with these domains were only weak to moderate (Table 3). The plot of EQ-5D and SF-6D utility scores to Lequesne global index scores is shown in Figure 4. There was a downward trend in utility scores of the EQ-5D and SF-6D with higher Lequesne global index scores, and this trend was generally more evident for the EQ-5D than the SF-6D.
Using an appropriate and valid utility index derived from a preference-based HRQOL instrument is one of the key determinants in ensuring the quality of cost-utility analysis in health care. It is therefore important to understand the performance of different preference-based instruments across various diseases and sociocultural contexts. In this study, we found that the EQ-5D and SF-6D demonstrated poor agreement at both domain and scale levels in a population of Asian patients with knee OA. These findings highlight concerns regarding the choice of preference-based HRQOL instruments for economic evaluations in OA. If such instruments do not accurately assess HRQOL, the results of economic evaluations where effectiveness is measured in terms of utilities would be called into question. Our findings also highlight the importance of exploring the underlying factors contributing to the observed differences between these 2 instruments.
The difference in mean utility scores for the 2 instruments in the present study deserves comment. EQ-5D and SF-6D score ranges in the present study were very similar to those in published studies of knee OA (14) and other diseases (22, 24). Mean SF-6D utility scores exceeded mean EQ-5D utility scores by 0.14, which is higher than the differences of 0.08 in patients with OA reported by Brazier et al (14), 0.04 in patients with human immunodeficiency virus (24), and 0.03 in patients with rheumatoid arthritis (21), but is lower than the difference of 0.18 in patients with spinal diseases (22). This difference of 0.14 also exceeds the minimal important difference (MID) for both the EQ-5D (MID = 0.121) and the SF-6D (MID = 0.035) for knee OA (40). There are several possible reasons for this marked difference. First, the health descriptive system of the SF-6D does not measure health states as severe as those assessed by the EQ-5D (14). This could have contributed to the lower EQ-5D scores (as compared with the SF-6D) seen in this study, which assessed patients with severe knee OA requiring knee replacement surgery (Table 2). Second, EQ-5D utility scores are TTO based, whereas SF-6D scores are SG based. SG has been shown to yield higher values than TTO (27, 41). The magnitude of the difference between EQ-5D and SF-6D scores in this study was larger than that seen in the study by Brazier et al (14). This may reflect differences in severity of OA, with all patients in the present study having severe disease and being scheduled for TKR, whereas those in the study by Brazier et al were either scheduled for TKR (with severe disease) or were from a rheumatology clinic (with less severe disease). Another possible reason could be the presence of cultural/attitudinal differences between patients in these 2 studies. These factors may also account for the observation that correlations between various EQ-5D and SF-6D domains in this study were lower than those reported in the study by Brazier et al (14).
Notably, the present study also demonstrated a crossover effect in TTO and SG values (27, 42), with SF-6D scores being higher than EQ-5D scores for patients with worse health states, and the converse being seen for patients with better health states (Figure 3). As shown in Figure 3, there are clearly 2 clusters, distinguished by utility scores greater than 0.50 or lower than 0.50, which are probably due to the specific N3 term used in the UK EQ-5D scoring algorithm. Differences in utility scores (i.e., SF-6D minus EQ-5D) were above the mean difference for almost all patients who reported extreme problems for at least 1 EQ-5D domain (i.e., level 3, N3 = 1), but were below the mean difference for almost all patients who did not report extreme problems for any EQ-5D domain (i.e., level 1 or 2, N3 = 0). This finding is consistent with that reported in a previous study of patients with knee OA by Brazier et al, who also suggested that the use of the N3 term in the scoring algorithm was the reason for the observed bimodal distribution (14).
A simple example may illustrate the impact on decision making of using these different instruments. Assume that all patients with OA have 10 expected life years remaining, that TKR costs $15,000 and non-TKR costs $0, and that TKR improves the HRQOL of patients from a Lequesne global index score >10 to ≤10, holding other variables constant. The corresponding gain in mean utility (Table 2) is 0.32 for the EQ-5D (0.76 versus 0.44) and 0.10 for the SF-6D (0.71 versus 0.61), i.e., 3.2 and 1.0 QALYs over 10 years. The incremental cost-effectiveness ratio for TKR compared with non-TKR would therefore be $4,688 per QALY using the EQ-5D and $15,000 per QALY using the SF-6D. Thus, 2 contrasting conclusions would be reached using these 2 instruments if a threshold of $10,000 per QALY is applied (43).
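The arithmetic behind this hypothetical example can be sketched as follows. The utility values are the subgroup means from Table 2 (Lequesne global index score ≤10 versus >10); the sketch assumes no discounting and a utility gain that stays constant over the 10 remaining life years, and the function name is illustrative.

```python
def icer(incremental_cost: float, utility_gain: float, years: float) -> float:
    """Incremental cost per QALY gained.

    ASSUMPTIONS: no discounting; the utility gain is constant over the
    whole time horizon; the comparator costs nothing.
    """
    qalys_gained = utility_gain * years
    return incremental_cost / qalys_gained


COST_TKR = 15_000   # assumed cost of TKR in the example, in dollars
LIFE_YEARS = 10     # assumed remaining life expectancy

# Mean utilities from Table 2: Lequesne score <=10 minus Lequesne score >10.
eq5d_gain = 0.76 - 0.44   # about 0.32
sf6d_gain = 0.71 - 0.61   # about 0.10

icer_eq5d = icer(COST_TKR, eq5d_gain, LIFE_YEARS)  # about 4,687.5 per QALY
icer_sf6d = icer(COST_TKR, sf6d_gain, LIFE_YEARS)  # about 15,000 per QALY

THRESHOLD = 10_000  # willingness-to-pay threshold used in the example
eq5d_cost_effective = icer_eq5d <= THRESHOLD
sf6d_cost_effective = icer_sf6d <= THRESHOLD
```

With these inputs the two ratios differ by roughly $10,300 per QALY and fall on opposite sides of the $10,000 threshold.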
The abovementioned reasons for differences between EQ-5D and SF-6D scores arise from the different health descriptive systems and methods for eliciting preferences, and are therefore generally consistent across different studies and patient groups (14). However, some differences between these 2 instruments are specific to patients with knee OA and need to be highlighted. In these patients, the EQ-5D has some advantages over the SF-6D at an aggregate level (i.e., not head-to-head comparison). First, the EQ-5D demonstrated better discriminative ability between patients with more severe and less severe impairment (based on Lequesne global index scores) than the SF-6D (Table 2). Second, the EQ-5D has better convergent construct validity, as shown by the presence of stronger correlations with WOMAC and Lequesne scores (Table 3). However, the distribution of EQ-5D scores was bimodal whereas that of the SF-6D was normal, raising concerns regarding the scoring algorithm of the EQ-5D. At the individual level, the use of the EQ-5D in patients with OA raises some concerns compared with the SF-6D because some of the participating patients reported that their pain or physical disability was somewhere between the none and moderate options offered by the EQ-5D. These patients had to select one of these options, neither of which best described their health status. The performance of the EQ-5D in OA may be improved to a certain degree if more response options are added to the EQ-5D, as has been suggested by some authors (44).
Our results need to be interpreted in light of our study limitations. First, the number of patients reporting mild impairment (measured using the Lequesne global index) was small, which limits the generalizability of our findings. It is possible that the SF-6D may perform better than the EQ-5D in patients with mild to moderate OA, given that it measures a milder spectrum of reduction in health states (14) and has more levels per domain. Second, test–retest reliability of these 2 instruments was not compared. Finally, scoring algorithms for both instruments used in the present study were developed from a general population of the UK because no such algorithm is available in Singapore. However, use of an algorithm from the same population for both the EQ-5D and SF-6D should result in a more valid comparison.
In conclusion, the agreement between EQ-5D and SF-6D utility scores was less than optimal among multiethnic Asian patients with knee OA undergoing TKR in Singapore. This difference in utility scores between the EQ-5D and the SF-6D could result in a large variation in QALY estimates, which highlights the importance of selecting appropriate preference-based HRQOL instruments in economic evaluation. Further comparative studies are warranted to elucidate the comparative performance of these instruments in patients with knee OA in a variety of sociocultural contexts and to determine which instrument provides the most accurate assessment of health utilities in these patients.
Dr. Thumboo had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis.
Study design. Xie, Li, Thumboo.
Acquisition of data. Xie.
Analysis and interpretation of data. Xie, Li, Thumboo.
Manuscript preparation. Xie, Li, Thumboo.
Statistical analysis. Xie, Luo.
Expertise in rheumatology or orthopedic surgery. Lo, Yeo, Yang, Fong, Thumboo.