Manuela Joore, Department of Clinical Epidemiology and Medical Technology Assessment, Maastricht University Medical Centre plus. PO Box 5800, 6202 AZ Maastricht, The Netherlands. E-mail: email@example.com
Objective: This article investigates whether differences in utility scores based on the EQ-5D and the SF-6D have an impact on the incremental cost–utility ratios in five distinct patient groups.
Methods: We used five empirical data sets of trial-based cost–utility studies that included patients with different disease conditions and severity (musculoskeletal disease, cardiovascular and pulmonary disease, and psychological disorders) to calculate differences in quality-adjusted life-years (QALYs) based on EQ-5D and SF-6D utility scores. We compared incremental QALYs, incremental cost–utility ratios, and the probability that the incremental cost–utility ratio was acceptable within and across the data sets.
Results: We observed small differences in incremental QALYs, but large differences in the incremental cost–utility ratios and in the probability that these ratios were acceptable at a given threshold, in the majority of the presented cost–utility analyses. More specifically, in the patient groups with relatively mild health conditions the probability of acceptance of the incremental cost–utility ratio was considerably larger when using the EQ-5D to estimate utility, whereas in the patient groups with worse health conditions it was considerably larger when using the SF-6D to estimate utility.
Conclusions: Much of the appeal in using QALYs as a measure of effectiveness in economic evaluations is in the comparability across conditions and interventions. The incomparability of the results of cost–utility analyses using different instruments to estimate a single index value for health severely undermines this aspect and reduces the credibility of the use of incremental cost–utility ratios for decision-making.
Instruments that estimate a single index value for health are increasingly used to measure preferences for health states for the estimation of quality-adjusted life-years (QALYs) in cost–utility analyses. These measures are essentially generic health-related quality of life instruments with pre-existing preference weights that can be attached to each permutation of responses. Several widely used instruments that estimate a single index value for health are available, including the EQ-5D and the SF-6D, which uses responses to 11 of the questions on the SF-36 questionnaire. These measures differ in scoring algorithm and health state descriptive system, and as a result utility scores may vary according to the choice of instrument [3,4]. Indeed, for a large range of clinical conditions there is evidence for differences between the utility estimates of these two instruments for a given patient [5–14]. Moreover, evidence suggests there are differences in the level of agreement between the two instruments over the range of ill health, potentially causing differences in the estimated change in health state utility across patient groups.
The EQ-5D and SF-6D scoring algorithms are derived using different protocols and different valuation methods (time trade-off and standard gamble, respectively). The literature suggests a cross-over of standard gamble and time trade-off values: Standard gamble values are higher for more severe states, and the opposite applies for the milder states. It has indeed been observed that for milder states the EQ-5D time trade-off utilities were higher than the SF-6D standard gamble utilities. This may partly explain the narrower range of the SF-6D-derived utilities as compared with EQ-5D-derived utilities, which may indicate less sensitivity of the former. With regard to the descriptive system, the operationalization of health in the two instruments is not exactly the same: The “vitality” and “social functioning” dimensions of the SF-6D are not explicitly included in the EQ-5D. This would cause the SF-6D to be more sensitive in situations where these dimensions of health are affected. Also, the SF-6D dimensions have more levels than the EQ-5D. Furthermore, because there is evidence for floor effects in the SF-6D and ceiling effects in the EQ-5D, the instruments differ in their description of “full health” and “worse health” [3,4,7]. As a result, the EQ-5D is thought to be sensitive in patient groups with severe health states at baseline, but less sensitive in patient groups with mild health states at baseline. The reverse would apply to the SF-6D.
To our knowledge, two articles reported on the impact of the choice of instrument on the incremental cost–utility ratio. McDonough et al. concluded, based on a review of the literature, that across studies the EQ-5D tended to provide more favorable incremental cost–utility ratios than the SF-6D. Grieve et al. showed that, in the case of a model-based study of antiviral therapy for patients with mild hepatitis, using the SF-6D rather than the EQ-5D resulted in more favorable incremental cost–utility ratios. Nevertheless, the key concern is whether the differences in utility estimates affect the estimated impact of health-care interventions, and hence incremental cost–utility ratios, because these are used in the appraisal of a medical technology.
This article investigates whether the differences in utility scores based on the EQ-5D and the SF-6D have an impact on the probability that the incremental cost–utility ratio is acceptable in five distinct patient groups. This issue is addressed using five empirical data sets of patients with different disease conditions and different health state severity. By examining the impact of the choice of instrument on whether the incremental cost–utility ratios are acceptable across patient groups, this study specifically focuses on the difference in the description of “full health” and “worse health” between the two instruments by examining floor and ceiling effects. The article is structured as follows. We first introduce the data sources we used to investigate the impact of the choice of EQ-5D or SF-6D utility estimates on the incremental cost–utility ratios. Next we describe the EQ-5D and SF-6D instruments and the analyses we performed. In the results section we compare within and across studies: baseline utility scores, incremental QALYs, incremental cost–utility analyses, and the uncertainty surrounding the incremental cost–utility ratios.
This research concerns secondary analyses of the data from five separate studies conducted in The Netherlands, comprising in total 794 patients. The studies concerned patients with cardiovascular (hypertension), pulmonary (asthma), mental (panic disorder), and musculoskeletal (ankylosing spondylitis and osteoarthritis) disorders. All studies, except for the study with osteoarthritis patients, were cost-effectiveness analyses conducted alongside a randomized clinical trial. The osteoarthritis study was designed as a historical comparison between two matched patient groups. In all studies active treatments were compared.
The study on hypertension concerned outpatients with hypertension who were randomized to either home blood pressure management or continuance of office blood pressure management. Measurements took place at baseline, 2, 4, 6, and 12 months. The patients with mild asthma (18 years or older, GINA severity stages I to III) were randomized to a nurse-led telemonitoring intervention or care as usual. In this study assessments took place at baseline, and 4, 8, and 12 months. The group of patients with active ankylosing spondylitis received a 3-week course of spa treatment in a spa resort either in Austria or in The Netherlands, or continued usual care at home. Quality of life and cost data were collected at baseline, and 7, 12, 26, and 52 weeks later. The patients with panic disorder were treated with sertraline or received usual care. Measurements took place at baseline, and 12 and 24 weeks later. Finally, patients with osteoarthritis who underwent a total hip replacement either received a joint recovery program or received care as usual. Quality of life and costs were assessed at baseline, and 4 and 40 weeks later.
In each study both the EuroQol and the Medical Outcomes Study 36-Item Short Form Health Survey (SF-36) were completed at baseline and at each follow-up by the patients themselves. From the data sources seven incremental cost–utility ratios could be calculated: A. home blood pressure management versus care as usual for hypertensive outpatients; B. nurse-led telemonitoring versus care as usual for asthmatics; C. spa treatment in Austria versus spa treatment in The Netherlands, and both spa treatments versus care as usual for ankylosing spondylitis (D and E); F. sertraline versus care as usual for panic disorder; and G. a joint recovery program versus care as usual after hip replacement surgery in osteoarthritis. Care as usual depends, of course, on the specific disease treated and in each study reflected current care for these patients in The Netherlands according to clinical guidelines if available.
Health-Related Quality of Life Measures
The EQ-5D instrument was developed by a European Group as a standard nondisease-specific instrument for describing and valuing quality of life. It is a questionnaire with a descriptive classification system consisting of five dimensions (mobility, self-care, usual activities, pain/discomfort, anxiety/depression), each with three levels. The descriptive system allows for 243 discrete health states. These health states are assigned values, using a tariff based on the time trade-off. Several tariffs are available, but for this study the original UK population tariff was applied in all studies. The EQ-5D utilities range from −0.59 to 1.00.
The SF-36 is a generic health status instrument, comprising eight scales. For the SF-6D, the items of the SF-36 are converted into a six-dimensional health state classification system, with between four and six levels per dimension. This yields 18,000 different health states. The health states are assigned preference weights derived from valuations of a sample of 249 SF-6D health states using the standard gamble in a representative sample of the UK population. The SF-6D utilities range from 0.29 to 1.00.
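The number of health states in each descriptive system is the product of the number of levels per dimension. The following is a minimal sketch of that arithmetic in Python. The EQ-5D levels come from the text above; the per-dimension level counts assumed for the SF-6D (6, 4, 5, 6, 5, 5) are not given in this article and are taken from the published SF-6D classification, so they should be checked against it.

```python
from math import prod

# EQ-5D: five dimensions with three levels each (as described in the text).
eq5d_levels = [3, 3, 3, 3, 3]

# SF-6D: ASSUMED level counts for physical functioning, role limitations,
# social functioning, pain, mental health, and vitality.
sf6d_levels = [6, 4, 5, 6, 5, 5]

# The number of distinct health states is the product of the level counts.
eq5d_states = prod(eq5d_levels)
sf6d_states = prod(sf6d_levels)

print(eq5d_states, sf6d_states)
```

Under these assumptions the products reproduce the 243 and 18,000 health states reported above.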
For this study only patients who completed both the SF-36 and the EQ-5D at baseline were included in the analysis. Floor and ceiling effects in the baseline utility scores were investigated by computing the proportion of patients at the lowest and highest possible utility score. For each patient two outcomes in QALYs were calculated: one based on the EQ-5D utility scores and the other based on the SF-6D utility scores. The patient-level QALYs were estimated by applying the area-under-the-curve method, thus assuming linear change between the discrete follow-up points in time. See Supporting information for details on the QALY calculation at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_Joore.asp
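The area-under-the-curve calculation and the floor/ceiling check described above can be sketched as follows. This is an illustration only, not the code used in the study: the function names, the measurement schedule, and all utility values are hypothetical.

```python
def qaly_auc(times_in_years, utilities):
    """Area under the utility curve, assuming linear change between visits."""
    if len(times_in_years) != len(utilities) or len(utilities) < 2:
        raise ValueError("need at least two matched time/utility points")
    total = 0.0
    for i in range(1, len(utilities)):
        interval = times_in_years[i] - times_in_years[i - 1]
        # Trapezoid: interval length times the mean of the two utilities.
        total += interval * (utilities[i] + utilities[i - 1]) / 2.0
    return total


def proportion_at(scores, bound, tol=1e-9):
    """Share of patients whose score sits at a scale bound (floor or ceiling)."""
    return sum(abs(s - bound) <= tol for s in scores) / len(scores)


# Hypothetical patient measured at baseline, 4, 8, and 12 months.
times = [0.0, 4 / 12, 8 / 12, 1.0]
eq5d_utilities = [0.70, 0.80, 0.80, 0.90]
qaly = qaly_auc(times, eq5d_utilities)

# Hypothetical baseline EQ-5D scores; 1.0 is the highest possible score.
baseline_eq5d = [1.0, 0.85, 1.0, 0.69]
ceiling_share = proportion_at(baseline_eq5d, 1.0)

print(round(qaly, 3), ceiling_share)
```

The same two functions would be applied once to the EQ-5D scores and once to the SF-6D scores of each patient, with the floor and ceiling bounds set to the lowest and highest possible score of the instrument in question.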
The time horizon was the study duration. To obtain the incremental QALYs in each study, multiple regression analysis was applied to control for differences in baseline utility between the study arms. All analyses were performed using SPSS, version 16.0.
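The exact regression specification is not reported here. One common form, assumed for this sketch, regresses patient-level QALYs on a treatment indicator and baseline utility, and reads the incremental QALY off the treatment coefficient. The Python below obtains that coefficient by partialling baseline utility out of both variables (the Frisch–Waugh–Lovell result), which avoids any external library; all data are hypothetical and constructed so that the true adjusted difference is 0.05.

```python
def _residuals(y, x):
    """Residuals from a simple least-squares fit of y on x with an intercept."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return [yi - mean_y - slope * (xi - mean_x) for xi, yi in zip(x, y)]


def adjusted_increment(qalys, treated, baseline):
    """Treatment coefficient in the OLS model QALY ~ treated + baseline."""
    res_qaly = _residuals(qalys, baseline)
    res_treated = _residuals(treated, baseline)
    return (sum(a * b for a, b in zip(res_qaly, res_treated))
            / sum(b * b for b in res_treated))


# Hypothetical data: the treated arm happens to start at higher baseline utility.
baseline = [0.50, 0.60, 0.70, 0.80, 0.55, 0.65]
treated = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
qalys = [0.10 + 0.05 * t + 0.80 * b for t, b in zip(treated, baseline)]

# Raw difference in mean QALYs between arms, ignoring baseline imbalance.
unadjusted = (sum(q for q, t in zip(qalys, treated) if t == 1.0) / 3
              - sum(q for q, t in zip(qalys, treated) if t == 0.0) / 3)
# Difference after controlling for baseline utility.
adjusted = adjusted_increment(qalys, treated, baseline)

print(round(unadjusted, 3), round(adjusted, 3))
```

In this constructed example the unadjusted difference overstates the effect because the treated patients started healthier; the adjusted coefficient recovers the 0.05 built into the data.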
For each study two incremental cost–utility ratios were calculated by dividing the difference in costs between the two alternatives by the difference in QALYs, once with QALYs based on the EQ-5D and once with QALYs based on the SF-6D. To gain insight into the uncertainty around the incremental cost-effectiveness ratios, nonparametric bootstrap simulations were conducted. In the bootstrap simulations a sample of “costs,” “QALY based on EQ-5D,” and “QALY based on SF-6D” trios, equal in size to the original sample, was drawn a thousand times with replacement. From these data 95% uncertainty intervals for the differences in both QALYs were calculated based on the 2.5th and the 97.5th percentile. The difference in the joint distribution of the incremental results is shown in cost-effectiveness planes (CE plane). The difference in decision uncertainty is presented in cost-effectiveness acceptability curves. Cost-effectiveness acceptability curves present uncertainty as the probability that an intervention has the greatest net benefit as a function of the willingness to pay (WTP) for a certain effect (in our case a QALY). The probability that one intervention is preferred over the other is represented graphically in the CE planes as the proportion of the joint density (ΔC, ΔE) to the lower right of a WTP line. A WTP line is a straight line through the origin of the CE plane that connects points with equal WTP values. This proportion can be estimated repeatedly while rotating the WTP line counterclockwise from horizontal (i.e., WTP = 0) to vertical (i.e., WTP = infinite). Hence, the shape of the cost-effectiveness acceptability curve is dependent upon the location of the joint density (ΔC, ΔE) within the CE plane.
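The bootstrap and acceptability-curve steps can be sketched in Python as below. This is a hedged illustration rather than the study's procedure: resampling within each arm, the nearest-rank percentile rule, the function names, and all patient data are assumptions made for the example. A replicate counts as acceptable at a given WTP when its incremental net benefit, WTP × ΔE − ΔC, is positive, which corresponds to lying to the lower right of the WTP line.

```python
import random


def mean(values):
    return sum(values) / len(values)


def bootstrap_increments(arm_new, arm_ref, n_boot=1000, seed=0):
    """Resample (cost, QALY_EQ5D, QALY_SF6D) trios with replacement.

    Returns one (d_cost, d_qaly_eq5d, d_qaly_sf6d) tuple per replicate.
    Resampling within each arm is an assumption of this sketch.
    """
    rng = random.Random(seed)
    replicates = []
    for _ in range(n_boot):
        new = [rng.choice(arm_new) for _ in arm_new]
        ref = [rng.choice(arm_ref) for _ in arm_ref]
        replicates.append(tuple(
            mean([p[k] for p in new]) - mean([p[k] for p in ref])
            for k in range(3)
        ))
    return replicates


def percentile_interval(values, lower=0.025, upper=0.975):
    """Nearest-rank 2.5th and 97.5th percentiles of the replicates."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return ordered[round(lower * last)], ordered[round(upper * last)]


def acceptability(replicates, wtp, qaly_index):
    """Share of replicates with positive incremental net benefit at this WTP."""
    return sum(wtp * r[qaly_index] - r[0] > 0 for r in replicates) / len(replicates)


# Hypothetical patients: (cost in euros, QALY from EQ-5D, QALY from SF-6D).
arm_new = [(1500.0, 0.82, 0.74), (900.0, 0.70, 0.71), (1300.0, 0.88, 0.76), (1100.0, 0.76, 0.72)]
arm_ref = [(1000.0, 0.75, 0.73), (1200.0, 0.79, 0.72), (800.0, 0.68, 0.70), (1400.0, 0.84, 0.75)]

replicates = bootstrap_increments(arm_new, arm_ref)
interval_eq5d = percentile_interval([r[1] for r in replicates])
thresholds = (0, 20000, 40000, 80000)
ceac_eq5d = [acceptability(replicates, wtp, 1) for wtp in thresholds]
ceac_sf6d = [acceptability(replicates, wtp, 2) for wtp in thresholds]
```

Because each replicate keeps a patient's cost and both QALY values together, the EQ-5D and SF-6D curves are computed from the same resampled patients and differ only through the utility instrument.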
In total 583 patients were included in the analyses in the underlying study. The characteristics of the patients included in the five data sets that were considered for this comparative study are shown in Table 1. At baseline, the ranking of the utility scores was the same for the EQ-5D and the SF-6D. The highest utility scores were observed in the hypertension data set (0.842 and 0.773, respectively), and the lowest in the osteoarthritis data set (0.339 and 0.584, respectively). The EQ-5D utility scores were higher than the SF-6D utility scores in the data sets with a relatively high utility score (hypertension, asthma) and lower in the data sets with a relatively low utility score (osteoarthritis). Floor effects were not observed. Ceiling effects were mostly found in the hypertension and asthma data sets, and more prevalent in the EQ-5D than in the SF-6D utility scores. Across the data sets, at baseline the range of EQ-5D utility scores (0.503) was 2.7 times larger than the range of SF-6D utility scores (0.189). Also, the standard deviations of the utility scores at baseline were larger for the EQ-5D utility scores.
Table 1. Sample size, patient characteristics, and baseline EQ-5D and SF-6D utility scores
The incremental EQ-5D QALY varied from minus 0.011 (the equivalent of 4 days in perfect health lost; osteoarthritis) to 0.055 (the equivalent of 20 days in perfect health gained, spa treatment in Austria versus in The Netherlands for ankylosing spondylitis). For the SF-6D the incremental QALYs varied from minus 0.012 (4 days lost, asthma) to 0.031 (11 days gained, spa treatment in Austria vs. care as usual for ankylosing spondylitis). The 2.5th to 97.5th percentile confidence intervals surrounding the incremental QALYs all included zero, except for the EQ-5D incremental QALY in the asthma and Spa Austria versus care as usual data sets.
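The "days in perfect health" equivalents quoted above follow from multiplying the incremental QALY by the number of days in a year. A small check in Python, assuming a 365-day year (the conversion factor is not stated in the text):

```python
DAYS_PER_YEAR = 365  # assumed conversion factor


def qaly_to_days(qaly):
    """Express a QALY difference as days in perfect health."""
    return qaly * DAYS_PER_YEAR


# Incremental QALYs reported in the paragraph above.
reported = {
    "EQ-5D, osteoarthritis": -0.011,
    "EQ-5D, spa Austria vs. The Netherlands": 0.055,
    "SF-6D, asthma": -0.012,
    "SF-6D, spa Austria vs. care as usual": 0.031,
}
days = {label: round(qaly_to_days(value)) for label, value in reported.items()}
print(days)
```

Rounded to whole days, these give 4 days lost, 20 days gained, 4 days lost, and 11 days gained, matching the figures in the text.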
The incremental benefit from the new intervention was larger for EQ-5D QALYs than for SF-6D QALYs in the hypertension and asthma data sets. The same was found in the ankylosing spondylitis data set for the comparison of the spa treatment in Austria with care as usual and with the spa treatment in The Netherlands. The largest difference in incremental QALY between the EQ-5D and SF-6D was observed in the asthma data set: 0.045 QALY difference (16 days). The confidence intervals for the EQ-5D incremental QALY were all larger than for the SF-6D incremental QALY.
Incremental Cost–Utility Analyses
The point estimates of the incremental cost–utility ratios indicated dominance in the comparison of the two spa treatments for ankylosing spondylitis, irrespective of the utility instrument used. The cost–utility ratios in the hypertension, asthma, and ankylosing spondylitis (spa treatment in Austria vs. usual care) data sets were more acceptable when based on EQ-5D utility. The ratios in the panic disorder, osteoarthritis, and ankylosing spondylitis (spa treatment in The Netherlands vs. usual care) data sets were lower when based on SF-6D utility. The incremental costs, QALYs, and cost–utility ratios are presented in Table 2, ranked from the data set with the highest baseline utility score (A: hypertension) to the data set with the lowest baseline utility score (G: osteoarthritis).
Table 2. Incremental costs, QALYs, and cost–utility ratios based on the EQ-5D and SF-6D utility estimates
The comparisons are ranked from higher (A) to lower (G) baseline utility score.
CAU, care as usual; CI, confidence interval; ICUR, incremental cost–utility ratio; NE, northeast; NW, northwest (inferior); QALY, quality-adjusted life-year; SE, southeast (dominant); SW, southwest.
Comparisons (table rows): Home BP management vs. CAU; Nurse-led telemonitoring vs. CAU; Spa Austria vs. The Netherlands; Spa Austria vs. CAU; Spa The Netherlands vs. CAU; Sertraline vs. CAU; Joint recovery program vs. CAU (osteoarthritis of the hip).
On the CE planes (Fig. 1) it is shown that the uncertainty surrounding the point estimates of the incremental cost–utility ratios in all seven comparisons was larger if the QALY is based on EQ-5D utility. The difference in the probability of acceptance of cost-effectiveness between the EQ-5D and SF-6D is shown in the acceptability curves (Fig. 2: The comparisons are ranked from higher [A] to lower [G] baseline utility score). In Figure 1A, B, F and G the joint density (ΔC, ΔE) is, partly, in the southwest quadrant, indicating less costly and less effective. If the WTP increases, more of this part of the joint distribution falls above the WTP line, and the probability of acceptance decreases. This is why the cost-effectiveness acceptability curves in Figure 2A, B, F and G are falling. For ceiling ratios between €0 and €80,000 per QALY the smallest differences between EQ-5D- and SF-6D-derived cost per QALY were observed in the ankylosing spondylitis data. The largest differences were observed in the asthma and panic disorder data (Fig. 2B and F). At a ceiling ratio of €40,000 per QALY the probability of acceptance of nurse-led telemonitoring for asthma was 0.55 larger when using EQ-5D utility estimates to calculate QALYs (Fig. 2B). In the panic disorder data set it is the other way around. At €40,000 per QALY the probability of accepting sertraline was 0.45 larger when using SF-6D utility estimates to calculate QALYs (Fig. 2F). By “reading” Figure 2 from left to right and from top to bottom, it is shown that first (higher baseline utility scores; A, B) the probability that the intervention is cost-effective is higher based on the EQ-5D, while later (lower baseline utility scores; F, G) the probability is higher when based on SF-6D utility scores.
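Why an acceptability curve falls when the joint density lies in the southwest quadrant can be seen from a small Python sketch. The four (ΔC, ΔE) points below are hypothetical replicates, all less costly and less effective; a point is accepted at a given WTP when WTP × ΔE − ΔC is positive, that is, when the savings outweigh the value of the QALYs lost.

```python
def acceptance_probability(points, wtp):
    """Share of (delta_cost, delta_qaly) points with positive net benefit."""
    return sum(wtp * d_qaly - d_cost > 0 for d_cost, d_qaly in points) / len(points)


# Hypothetical replicates in the southwest quadrant (cheaper, less effective).
southwest = [(-500.0, -0.010), (-400.0, -0.020), (-300.0, -0.005), (-200.0, -0.010)]

thresholds = (0, 10000, 30000, 55000, 80000)
curve = [acceptance_probability(southwest, wtp) for wtp in thresholds]
print(curve)  # [1.0, 1.0, 0.5, 0.25, 0.0]
```

Each point drops out once the WTP exceeds its own savings per QALY forgone (for example, €500 / 0.010 = €50,000 per QALY for the first point), so the curve can only move downward as the WTP rises.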
Discussion and Conclusions
We investigated the impact of the differences in utility scores based on the EQ-5D and the SF-6D on the probability that incremental cost–utility ratios were acceptable, in five distinct patient groups. Our main findings are the following. First, EQ-5D utility scores were higher than SF-6D utility scores for disease states with higher baseline utility scores and lower for states with lower baseline utility scores. This is in line with previous evidence: Healthier individuals tend to have higher mean scores on the EQ-5D, and less healthy individuals tend to have higher scores on the SF-6D. Also, the considerable ceiling effects observed in the EQ-5D utility scores in the data sets with relatively mild conditions (hypertension, asthma) are in line with the literature. Floor effects were not observed. Second, the observed differences in incremental QALY were very small, and mainly occurred in the trials with baseline utility scores at either end of the health spectrum. The EQ-5D provided more favorable incremental cost–utility ratios for data sets with higher baseline utility and the opposite in the data sets with lower baseline utility. In light of the observed ceiling effects, which are expected to decrease sensitivity to change, in mainly the data sets with higher baseline utility, this finding seems somewhat counterintuitive. Third, the uncertainty surrounding the incremental cost–utility ratios was larger when using EQ-5D utility scores in each data set. This is not surprising, taking into account the considerably larger standard deviation of EQ-5D utility scores. Fourth, the probability that the incremental cost–utility ratios were acceptable was considerably larger when using EQ-5D utility scores in the data sets with higher baseline utility scores, whereas the opposite was found in the data sets with lower baseline utility scores.
Addressing our main question: Even though the differences in incremental QALYs were rather small, the choice of instrument had considerable impact on the probability that incremental cost–utility ratios are acceptable. Moreover, across the patient groups and comparisons we included in this study, we found that the cost–utility ratios were more acceptable when using EQ-5D in relatively mild health conditions and using SF-6D in relatively serious health conditions. This result does not confirm the findings of McDonough et al., who concluded, based on a review of the literature, that across studies the EQ-5D tended to provide more favorable incremental cost–utility ratios. This result also is not in line with the expectation that the ceiling effects in the EQ-5D would lead to more favorable EQ-5D results in severe conditions.
The data sets we used reflect a considerable severity range and very different areas in health: musculoskeletal disease, cardiovascular and pulmonary disease, and psychological disorders. Nevertheless, it certainly was a convenience sample of studies as available in our department. The systematic effect of the choice for EQ-5D or SF-6D on whether incremental cost–utility ratios are acceptable that we observed can be a result of the specific sample of studies we used. For instance, we did not observe floor effects in the SF-6D, although others did [3,31–32]. This may be the reason why in the data set with a relatively serious condition (osteoarthritis), somewhat counterintuitively, the SF-6D utility scores translated into a larger probability that the intervention was cost-effective. If floor effects were present in the SF-6D baseline utility scores, this result could be reversed. In addition, in the data sets we used only small differences in QALYs were observed between the interventions. Although this is a rather common finding when comparing a new intervention with the best available alternative, it would favor the instrument that overall is more sensitive to change. Taking into account the differences in health state description and scoring algorithm between the EQ-5D and SF-6D, it is not clear which instrument would overall be more sensitive to change. It is expected that this will differ between patient groups and interventions.
The findings of this study suggest that besides the differences in the definition of worse and full health, other sources of differences in change in utility score as measured with EQ-5D and the SF-6D play a role. Therefore, it is of great importance that we further improve our understanding of the impact of the choice of the utility instrument on the probability that an incremental cost–utility ratio is acceptable. If feasible, we would recommend the use of more than one utility instrument in trial-based economic evaluations to obtain as much comparative data as possible. In addition, we strongly encourage researchers to publish any available cost-effectiveness data in which two or more instruments are used to estimate a single index value for health. More explicitly, because in this area the burden of evidence arises from a series of analyses and not a single study, repeating our analyses for other conditions and interventions with a different type and magnitude of effect seems worthwhile. In addition, other sources of differences between the instruments need to be investigated. For instance, the differences in domain structure may have an impact on utility change. Furthermore, differences in utility change may arise from differences in the interval properties of the utility scales of the EQ-5D and SF-6D.
In conclusion, we observed small differences in incremental QALYs, but remarkably large differences in the probability that the incremental cost–utility ratio is acceptable in the majority of the presented cost–utility analyses. More specifically, in the patient groups with relatively mild health conditions the probability of acceptance of the incremental cost–utility ratio was considerably larger when using the EQ-5D to estimate utility, whereas in the patient groups with relatively serious health conditions it was considerably larger when using the SF-6D to estimate utility. A systematic difference in the probability of accepting the cost–utility of interventions as a result of the choice of utility instrument would seriously bias the comparability of the results of economic evaluations. This is problematic, because much of the appeal in using QALYs as a measure of effectiveness in economic evaluations is in the comparability across conditions and interventions. The incomparability of the results of cost–utility analyses using different utility instruments reduces the credibility of the use of incremental cost–utility ratios for decision-making.
An earlier version of this work was presented at the International Health Economics Association Conference, July 2007, Copenhagen, Denmark. Two anonymous reviewers are kindly acknowledged for their valuable comments.
Source of financial support: The data in this study are taken from studies funded by The Netherlands Organisation of Health Research and Development, the Dutch Health Care Insurance Board, the Land Salzburg, the Gasteiner Tal Tourismusgesellschaft, the Kurzentrum Thermentempel, the Gasteiner Heilstollen from Austria, Zorgvoorzieningen Nederlands NV, IZA Zorgverzekering, Dick van Toll Assurantieen BV, and Yakult BV Netherlands.