Discrete Choice Modeling for the Quantification of Health States: The Case of the EQ-5D

Authors

  • Elly A. Stolk PhD,

    1. Institute for Medical Technology Assessment and Department of Health Policy and Management, Erasmus University Rotterdam, Rotterdam, The Netherlands;
  • Mark Oppe Msc,

    1. Institute for Medical Technology Assessment and Department of Health Policy and Management, Erasmus University Rotterdam, Rotterdam, The Netherlands;
  • Luciana Scalone PhD,

    1. Center for Health Technology Assessment and Outcomes Research, University of Milan, Milan, Italy, and Center for Health Associated Research and Technology Assessment Foundation, Milan, Italy;
  • Paul F. M. Krabbe PhD

    Corresponding author
    1. Department of Epidemiology, Biostatistics and Health Technology Assessment, Radboud University Nijmegen Medical Centre, Nijmegen, The Netherlands

Paul F.M. Krabbe, Department of Epidemiology, Biostatistics and Health Technology Assessment, Radboud University Nijmegen Medical Centre, PO Box 9101, 6500 HB Nijmegen, The Netherlands. Email: p.krabbe@ebh.umcn.nl

ABSTRACT

Objectives:  Probabilistic models have been developed to establish the relative merit of subjective phenomena by means of specific judgmental tasks involving discrete choices (DCs). The attractiveness of these DC models is that they are embedded in a strong theoretical measurement framework and are based on relatively simple judgmental tasks. The aim of our study was to determine whether the values derived from a DC experiment are comparable to those obtained using other valuation techniques, in particular the time trade-off (TTO).

Methods:  Two hundred nine students completed several tasks in which we collected DC, rank, visual analog scale, and TTO responses. DC data were also collected in a general population sample (N = 444). The DC experiment was designed using a Bayesian approach, and involved 60 choices between two health states and a comparison of all health states to being dead. The DC data were analyzed using a conditional logit and a rank-ordered logit model, relying, respectively, on TTO values and the value for being dead to anchor the DC-derived values to the 0 to 1 quality-adjusted life-year (QALY) scale.

Results:  Although modeled DC data broadly replicated the pattern found in TTO responses, the DC consistently produced higher values. The two methods for anchoring DC-derived values on the QALY scale produced similar results.

Conclusions:  On the basis of the high level of comparability between DC-derived values and TTO values, future valuation studies based on a combination of these two techniques may be considered. The results further suggest that DC can potentially be used as a substitute for TTO.

Introduction

Composite measures of health outcomes such as “quality-adjusted life-years” (QALYs) require weights or values attached to different health states that reflect the levels of health associated with these states. The standard gamble (SG) and time trade-off (TTO), which have emerged from health economics research, are frequently used to assign values to health states [1]. Psychology has contributed another technique, the visual analog scale (VAS) [2]. Unfortunately, there are theoretical and empirical drawbacks to all of these techniques [3]. Responses to the SG and TTO are likely to be influenced by factors extraneous to judgments about health levels, such as risk aversion or time preference. Moreover, empirical violations of the normative axioms supporting the use of these techniques have been noted. Regarding VAS, critics question its interval properties and point to its lack of a relation to economic theory. In the literature on health state valuation, arguments are raised for and against different techniques, but this debate has not led to consensus [4]. For this reason, and in light of the diverging empirical results, continued work on improving the methods is warranted.

Probabilistic discrete choice (DC) modeling offers an alternative approach for exploring people's values, although this approach is also not without problems and criticism [5–7]. Such DC models can be used to analyze data obtained through approaches involving choices, ranks, or matches between alternatives, as defined by attributes and levels [8]. The DC models were initially developed for the analysis of real-world data, but researchers quickly became aware of their potential for the analysis of stated preference data, which allow exploration of a broader range of preference-driven behaviors than is possible on the basis of real-world data [9]. This strategy was first developed in transport economics and marketing. There, instead of modeling people's actual choices (revealed preferences), Louviere et al. modeled the choices made by subjects in carefully constructed experimental studies based on stated preferences: discrete choice experiments (DCEs) [9]. The term DCE refers to an experiment that is constructed to collect stated preference data that are consistent with the requirements for DC modeling. Recognizing that the DCE framework offers a conceptual basis for the evaluation of the benefits of health programs, the technique is now being used to extend economic evaluations in health care with information about the value of nonhealth outcomes such as waiting time, location of treatment, and type of care [10–12]. More recently, DCEs and accompanying DC models have also been considered for health state valuation [13–17].

DC modeling has good prospects for health state valuation. The statistical literature classifies it among the probabilistic choice models that are grounded in modern measurement theory and consistent with economic theory (i.e., the random utility model). All DC models have in common that they can establish the relative merit of one phenomenon with respect to others. If the phenomena are characterized by specific attributes with certain levels, extended probabilistic choice models would permit estimating the relative importance of the attributes and their associated levels, and even estimating overall values for different combinations of attribute levels. A promising feature of DC models is that the derived values only relate to the attractiveness of a health state; they are not expressed in trade-offs between improved health and something else, as in TTO and SG. Bias as a result of these extraneous factors may therefore be prevented. Moreover, DC models have a practical advantage: when conducting DCEs, health states may be evaluated in a self-completion format. The scope for valuation research is thereby widened as compared to existing TTO protocols for deriving values for health state measurement instruments such as EQ-5D.

But DC models are not without problems when used for health state valuation. The analytical procedure on which analysis of DCE data is based assumes that the difference in values between choice options (e.g., two health states) can be inferred from the proportion of respondents that chose one option over the other. This implies that the relative position of all health states on the latent scale would lie between the “best” and the “worst” health states. For the estimation of QALYs, however, those values need to be scaled on the full health–dead scale. If DC modeling is used to value health, a way must be found to link the derived values under this model to the scale required to calculate QALYs. Yet, there is no consensus on what is the best way to handle the arbitrarily scaled DC values obtained, so it remains uncertain just how valid and informative DC-based values are.

A strategy for rescaling DC values may be to rescale by anchoring them on values obtained for the best and worst health state using other valuation techniques, such as TTO or SG. Nevertheless, the rationale for this approach is unclear, when part of the motivation to explore the DC model as a potential candidate to produce health state values comes from the limitations of existing valuation methods. Alternatively, the DCE may be designed in such a way that the derived health state values can be related to the value of the state “dead.” A simple manner to achieve this seems to be by DCE designs in which respondents are presented one bad health state at a time and asked if they consider it better or worse than being dead. The value difference between these bad health states and being dead would then be estimated from the observed probabilities between the bad health states and being dead. Nevertheless, Flynn et al. [18] have asserted that the precision of the final estimates for the health states, in particular the region around “dead,” may be largely based on the presence of respondents who consider none of the presented health states to be worse than dead. A problem is that the DC model will not accurately capture the error distribution and therefore produces biased estimates. Furthermore, under random utility theory, responses of those who consider all life worth living are perceived to reflect an infinite value difference between health states and dead. This is not necessarily an accurate representation of their preferences, and causes an estimation problem. The values derived from the DC model will then depend on the proportion of respondents who exhibit this preference.

These problems in estimating DC models are less likely to arise in studies comparing health states to each other rather than to being dead. By mixing these two designs, the ability to relate the health state values to being dead may be maintained, while limiting (not omitting) the effect of the aforementioned biases. The procedure has been demonstrated by McCabe et al. [16] and Salomon [5]. These authors mixed the state “dead” in the choice set as a health state, so that a parameter for the state “dead” is estimated as part of the model.

Because none of the various methods to anchor DC-derived values on the full health–dead scale required for QALY computation is without problems, it is hard to say which strategy should be used. Experimentation with the various anchoring strategies is therefore required, and convergence with alternative methods for health state valuation needs to be explored, both to give advice on this matter and to see whether any of the proposed strategies is capable of producing health state values that may be accepted by the research community.

This article considers the application of DC modeling for deriving health state values. Research on novel, enhanced, and feasible measurement tools is conducted by the EuroQol group to support improvement of the group's health status measurement instrument, the EQ-5D. This work is motivated by the perceived limitations of the traditional valuation techniques and by the prospects of DC models for health state valuation. We analyzed congruence across methods (DC, rank, VAS, and TTO) and across samples with the aim of determining whether DC modeling produces value estimates that are comparable to traditional methods. The main focus of the study was to compare DC values to values elicited with the standard TTO technique.

Methods

EQ-5D States

The EuroQol EQ-5D is a generic measurement instrument to describe and value health states [19]. The EQ-5D classification describes health states according to five attributes: mobility, self-care, usual activities, pain/discomfort, and anxiety/depression. Each attribute has three levels: “no problems,” “some problems,” and “severe problems.” Health state descriptions are constructed by taking one level for each attribute, thus defining 243 (3^5) distinct health states, where “11111” represents the best and “33333” the worst state. An EQ-5D health state may be converted to a single summary index by applying a formula that essentially attaches weights to each of the levels in each dimension. This formula reflects the values of EQ-5D health states as obtained from respondents in a sample of interest. Usually, this is a representative sample of the general population, but in the current study, both a student sample and a general population sample were used.
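The construction of the descriptive system can be illustrated with a minimal sketch that enumerates all states as five-digit strings. The names `ATTRIBUTES`, `LEVELS`, and `all_states` are illustrative choices and do not come from the study.

```python
from itertools import product

# Five EQ-5D attributes, each scored 1 (no problems) to 3 (severe problems).
ATTRIBUTES = ("mobility", "self-care", "usual activities",
              "pain/discomfort", "anxiety/depression")
LEVELS = (1, 2, 3)

def all_states():
    """Return every EQ-5D state as a five-digit string, e.g. '11111'."""
    return ["".join(str(level) for level in combo)
            for combo in product(LEVELS, repeat=len(ATTRIBUTES))]

states = all_states()
print(len(states), states[0], states[-1])  # 243 11111 33333
```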

Not all EQ-5D states were included in the experiment. We constructed a DCE of 60 pairs of EQ-5D states, following the methodology described below. For the three other judgmental tasks in our study protocol, a set of 17 EQ-5D health states was selected. The set comprised five very mild, four mild, four moderate, three severe states, and state “33333.” The 17 states are: 11112, 11113, 11121, 11131, 11133, 11211, 11312, 12111, 13311, 21111, 22222, 23232, 32211, 32223, 32313, 33323, and 33333. The same 17 states were used in the Dutch EQ-5D TTO valuation study [20].

Respondents

For practical reasons, this study included a general population sample (target N = 400) and a student sample (target N = 200). The comparisons across valuation methods and of strategies for anchoring values obtained using DC models relative to dead and full health were done on the basis of student data. DC responses were also collected from the general population in order to draw tentative conclusions about the possibility of extrapolating results from the student sample to the general population.

Students were recruited at Erasmus University in Rotterdam, The Netherlands. Each student was offered €20 for participating. The general population sample consisted of members of an Internet panel. This panel included approximately 104,000 people. Stratified sampling was used to select a research sample from the panel that was representative of the Dutch general population in terms of age, sex, and education. The stratified sampling procedure was performed in three rounds, so the final round allowed for over- or undersampling of specific groups if the desired distribution over the strata had not yet been attained. The incentive offered to the panel members consisted of a €2.50 donation to a charity chosen by the respondent and a chance to win gift certificates or other prizes in a lottery.

People in the general population sample were only administered the DCE. The students completed (in this order) the DCE, ranking, VAS, and TTO task in the presence of one of the researchers or a research assistant. To become familiar with the type of health state descriptions, all respondents were administered the EQ-5D prior to the judgmental tasks.

Judgmental Tasks

DCE.  In the DCE, all respondents were presented with a forced choice between two EQ-5D states. After this paired comparison task, the students were prompted to answer a second question related to each of the two health states separately. This extra question offered “dead” as a choice, phrased as, “Would you rather be dead than living in this health state?” In the remainder of the article, we will refer to the two outcomes as DCE data and DCEdead data, respectively.

The DCE was programmed as a computer experiment. The respondents logged in to a Web site where they were presented with a number of choices between two EQ-5D states that were randomly selected from the choice set. Our general population sample received nine DCs; students received 18 DCs, and thus compared 36 states to being dead. It was a pragmatic decision to select the choices for each individual randomly rather than to use a blocked design. Level balance was not a criterion for design construction either, and we were confident that systematic effects would be filtered out given the large number of questions and the large sample size.

Ranking, VAS, and TTO.  The ranking, VAS, and TTO tasks were performed as described in Lamers et al. [20]. The valuation procedure may be summarized as follows. First, students rank-ordered the 17 EQ-5D states selected for these tasks, supplemented with “dead” and state “11111,” by putting the card with the “best” health state on top and the “worst” one at the bottom. Next, students valued the rank-ordered health states on the EuroQol VAS using a bisection method that specified the order in which various states needed to be valued. The TTO valuation task followed the VAS valuation. TTO was executed using a computer-assisted personal interviewing method that followed standard TTO protocols based on the original UK study protocol [21]. This implies that the health states were presented in random order, that the TTO task was facilitated by a visual aid, and that the respondents were led by a process of outward titration to select a length of time t in state “11111” (perfect health) that they regarded as equivalent to 10 years in the target state (for states better than dead) or to select a length of time (10-t) in the target state followed by t years in state “11111” (for states worse than dead).

Experimental Design of the DCE

The DCE design was constructed using a Bayesian efficient approach, which to our knowledge has not been applied in health economics before. Most DCEs in health economics have applied orthogonal designs. These allow the uncorrelated estimation of main effects, assuming that all interactions are negligible. A limitation of orthogonal designs is that orthogonality is compromised if, for the purpose of data analysis, categorical multilevel variables need to be transformed into a set of dummy variables. Moreover, in optimal orthogonal designs, the efficiency of the design is optimized for the situation that choices are made randomly. This is true under the restrictive assumption that the estimates of the parameters in the utility model are equal to zero (β = 0). This implies that two choice options within a pair have a 50% probability of being preferred, irrespective of their attribute levels. If β = 0 does not hold, the design will not be optimally efficient for producing information in regard to the true parameter effects [22,23]. Both issues with orthogonal designs apply to EQ-5D valuation, so we decided to look elsewhere.

To construct a Bayesian efficient design, we used a computer algorithm (see Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i8_Stolk.asp) that was obtained at the time from Rose and Bliemer and is described in [24]; it is now publicly available in the software package nGENE. The algorithm entailed an iterative procedure whereby a great many designs, each with the desired number of choice situations, were randomly selected from the full factorial design and compared by their D-error, which was computed on the basis of expected values of the model parameters. In the Bayesian framework, these expected values are known as priors. Because the priors were not perfectly known, they were included in the design algorithm as distributions from which values were sampled rather than as point estimates. This way, when priors deviate from their expected values, the impact on the efficiency of the design is minimized. To that end, the Bayesian efficient design algorithm uses nested Monte Carlo simulation. The best design remaining after 2000 iterations, each containing 1000 draws for the priors, was selected for this study. The probability that this design is the optimal one is small because a more efficient design is likely to exist. Even if not optimal, the design will still be efficient, given the large number of iterations in the Monte Carlo simulation.
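The search described above can be sketched in a few lines. This is not the Rose and Bliemer algorithm itself but a simplified illustration under stated assumptions: a binary logit with the 10 main-effect dummies only (no N3 term, constant, or interactions), normally distributed priors with a standard deviation of 20% of the prior mean, the D-error taken as det(I)^(-1/K) of the Fisher information matrix, and far fewer iterations and draws than the 2000 and 1000 used in the study. All function names are hypothetical.

```python
import math
import random

# Prior means for MO2, MO3, SC2, SC3, UA2, UA3, PD2, PD3, AD2, AD3 (Table 1).
PRIOR_MEANS = [-0.108, -0.434, -0.140, -0.346, -0.090,
               -0.240, -0.147, -0.463, -0.119, -0.354]

def dummies(state):
    """10 main-effect dummies for a state given as a tuple of five levels."""
    row = []
    for level in state:
        row += [float(level == 2), float(level == 3)]
    return row

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [r[:] for r in matrix]
    n, det = len(a), 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[pivot][c]) < 1e-12:
            return 0.0
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det
        det *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return det

def d_error(design, beta):
    """D-error of a paired design under a binary logit with parameters beta."""
    k = len(beta)
    info = [[0.0] * k for _ in range(k)]
    for s1, s2 in design:
        diff = [x - y for x, y in zip(dummies(s1), dummies(s2))]
        p = 1.0 / (1.0 + math.exp(-sum(b * d for b, d in zip(beta, diff))))
        w = p * (1.0 - p)  # contribution of this pair to the information
        for i in range(k):
            for j in range(k):
                info[i][j] += w * diff[i] * diff[j]
    det = determinant(info)
    return float("inf") if det <= 0 else det ** (-1.0 / k)

def bayesian_d_error(design, rng, draws=30, rel_sd=0.20):
    """Mean D-error over prior draws (assumed normal, sd = rel_sd * |mean|)."""
    total = 0.0
    for _ in range(draws):
        beta = [rng.gauss(m, rel_sd * abs(m)) for m in PRIOR_MEANS]
        total += d_error(design, beta)
    return total / draws

def random_design(rng, n_pairs=60):
    """Draw n_pairs random pairs of states from the full factorial."""
    def state():
        return tuple(rng.randint(1, 3) for _ in range(5))
    return [(state(), state()) for _ in range(n_pairs)]

rng = random.Random(1)
candidates = [random_design(rng) for _ in range(10)]
best = min(candidates, key=lambda d: bayesian_d_error(d, rng))
best_error = bayesian_d_error(best, random.Random(2))
print(round(best_error, 3))
```

Averaging the D-error over draws, rather than evaluating it at the prior means only, is what makes the chosen design less sensitive to misspecified priors.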

The DC model we intended to estimate included main effect terms for the five categorical three-level EQ-5D domains (transformed into a set of 10 dummies) and the so-called N3 term. This is a nonmultiplicative interaction term that is frequently used in EuroQol valuation models. It allows for measuring the “extra” disutility when reporting severe (level 3) problems on at least one EQ domain [19]. In addition, it was considered that the model would need to include an alternative specific constant as recommended in the literature [25] to control for unobserved systematic effects on choices, such as a tendency to always choose the same option. Accordingly, with 12 parameters to estimate (the 10 dummies, the N3 term, and the constant), a minimum of 12 pairs is required on the basis of degrees of freedom. It was decided to increase this number to 60 pairs to allow for extension of the model with interaction terms, if relevant.
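As an illustration of this coding, the sketch below maps a state to its 10 level dummies and the N3 indicator. The function name `design_row` and the dictionary layout are assumptions made for the example; only the labels MO2 to AD3 and N3 come from the text.

```python
DOMAINS = ("MO", "SC", "UA", "PD", "AD")

def design_row(state):
    """Map a state such as '21131' to its 10 level dummies plus N3."""
    levels = [int(ch) for ch in state]
    row = {}
    for domain, level in zip(DOMAINS, levels):
        row[f"{domain}2"] = int(level == 2)  # "some problems"
        row[f"{domain}3"] = int(level == 3)  # "severe problems"
    # N3 equals 1 as soon as any domain is at level 3.
    row["N3"] = int(any(level == 3 for level in levels))
    return row

print(design_row("21131"))
```

Level 1 is the reference category, so state “11111” maps to a row of zeros.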

The priors for the main effects were obtained by taking the weighted average of the parameter estimates from three TTO-based EQ-5D studies [20,21,26]. We used a standard error of 20% surrounding these priors to account for the possibility that parameter estimates modeled on the basis of DCE data might be different from those elicited with TTO. The prior parameter estimates of the interactions were set to 0 (Table 1).

Table 1.  Model parameters for the Bayesian efficient design
Main effects*  Priors for main effects  Interactions (priors = 0)
MO2            −0.108                   MO2*SC2  SC2*UA2  UA2*PD2  PD2*AD2
MO3            −0.434                   MO2*SC3  SC2*UA3  UA2*PD3  PD2*AD3
SC2            −0.140                   MO2*UA2  SC2*PD2  UA2*AD2  PD3*AD2
SC3            −0.346                   MO2*UA3  SC2*PD3  UA2*AD3  PD3*AD3
UA2            −0.090                   MO2*PD2  SC2*AD2  UA3*PD2
UA3            −0.240                   MO2*PD3  SC2*AD3  UA3*PD3
PD2            −0.147                   MO2*AD2  SC3*UA2  UA3*AD2
PD3            −0.463                   MO2*AD3  SC3*UA3  UA3*AD3
AD2            −0.119                   MO3*SC2  SC3*PD2
AD3            −0.354                   MO3*SC3  SC3*PD3
                                        MO3*UA2  SC3*AD2
                                        MO3*UA3  SC3*AD3
                                        MO3*PD2
                                        MO3*PD3
                                        MO3*AD2
                                        MO3*AD3

* The abbreviations MO2 to AD3 represent the five categorical three-level EQ-5D domains transformed into a set of 10 dummies. The first level (no problems) was used as reference category.

The algorithm produced a design of 60 pair-wise comparisons of two EQ-5D states. To further improve the design, we identified and altered dominant choices in which logical consistency predicts that one alternative will always be preferred. Nine dominant choices were identified. In five pairs, the worst state was improved to escape from dominance; in the other four, the best state was made worse. The alterations were made randomly, but in accordance with the following rules: 1) the D-efficiency of the design was improved with the alterations; and 2) the new health state was not included yet in the choice set. This strategy resulted in a choice set of 60 pairs including 106 unique health states (94 states were included once, 10 twice, and 2 were included three times). The final set of 60 pairs is presented in Table 2. The D-error of this design was 1.11.
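A dominant choice can be detected mechanically. The sketch below assumes that a state dominates another when it is at least as good on every attribute and differs on at least one; the function names are hypothetical and the rule is one plausible reading of “logical consistency,” not the authors' code.

```python
def dominates(a, b):
    """True if state a is at least as good as b on every attribute and the
    two states differ (a lower level means fewer problems)."""
    return a != b and all(x <= y for x, y in zip(a, b))

def is_dominant_pair(a, b):
    """True if either state in the pair dominates the other."""
    return dominates(a, b) or dominates(b, a)

print(is_dominant_pair("11121", "22121"))  # True: the first state is never worse
print(is_dominant_pair("21231", "22323"))  # False: each state is better somewhere
```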

Table 2.  Final set of 60 pairs of EQ-5D health states for the discrete choice experiment (asterisk marking the nine states that were manually altered)
Choice  Option 1  Option 2    Choice  Option 1  Option 2
 1      21231     22323       31      13211     21233
 2      23223     31113       32      33311     22133
 3      11112     12221       33      32112     23312
 4      33322     23312       34      21112     22111
 5      22331     23233       35      32211     13333
 6      32133     22312       36      13131     13113
 7      33123*    22233*      37      22313     23231
 8      23212     32121       38      31313     32231
 9      32322     33131       39      12123     33321
10      11231     32111*      40      22311     32123
11      33222     11312       41      11133     21123
12      13122     21212       42      31311     21313
13      22221     13212       43      21212     32213
14      22312     11212       44      11121     22112*
15      22132     12321       45      13313     31221
16      12332     31333       46      21321*    12111
17      22333     33332       47      33323     23122
18      31222     12112       48      11223     32321
19      31131     13111       49      23313     32222
20      12233     13132       50      31323     22321
21      31131     12121       51      33113*    32332
22      33131     21323       52      22131     21212
23      33122     31132       53      23222     31113
24      11133     32211*      54      12222     33121
25      12231     21121       55      31132     21333
26      12312     13131       56      12213     31232
27      21111*    11311       57      23312     13123
28      11223     12313*      58      21211     32313
29      13231     31231       59      31133     21331
30      31123     12212       60      13321     13231

Analysis

Observed values derived from rank, VAS, and TTO responses for 17 states.  The rank data were analyzed using the “law of comparative judgment” (LCJ) model, as introduced by Thurstone [27,28]. To model the rankings within the Thurstonian framework, the rankings are transformed (“exploded”) into paired comparisons. The analytical procedure assumes that the difference in value between two health states can be inferred from the proportion (i.e., probabilities) of respondents who preferred one health state to another. The resulting matrix of probabilities is subsequently transformed into Z values (i.e., normal distribution). The LCJ values are obtained by taking the mean of all the columns of the Z matrix, as described by Krabbe [28].
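The steps of this procedure (explode rankings into pairs, compute choice proportions, convert to Z values, average) can be sketched as follows. This is a minimal Thurstone Case V illustration, not the authors' implementation. It assumes complete rankings of the same items, averages Z values over all other items for each target item, and clips proportions of 0 or 1 (the `clip` parameter is an assumption) because their Z values would be infinite.

```python
from itertools import combinations
from statistics import NormalDist, mean

def lcj_values(rankings, clip=0.01):
    """Thurstone Case V scale values from rankings listed best item first."""
    items = sorted(rankings[0])
    wins = {(a, b): 0 for a in items for b in items if a != b}
    for ranking in rankings:
        # "Explode" one ranking into all implied paired comparisons.
        for better, worse in combinations(ranking, 2):
            wins[(better, worse)] += 1
    n = len(rankings)
    inv = NormalDist().inv_cdf
    values = {}
    for j in items:
        z = []
        for i in items:
            if i == j:
                continue
            # Proportion preferring j over i, kept away from 0 and 1.
            p = min(max(wins[(j, i)] / n, clip), 1 - clip)
            z.append(inv(p))
        values[j] = mean(z)
    return values

# Toy data: four respondents ranking three hypothetical states.
rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"], ["A", "B", "C"]]
values = lcj_values(rankings)
print({k: round(v, 2) for k, v in values.items()})
```

The resulting values lie on an interval scale with an arbitrary origin, so they still need to be anchored before they can be compared with VAS or TTO values.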

Mean VAS and TTO values were obtained with approaches commonly used in EQ-5D valuation studies (described, e.g., in [20,21]). Observed VAS values were obtained on a scale with the end points “best imaginable health” (= 100) and “worst imaginable health” (= 0). To use these values in health state valuation, they need to be rescaled such that state “11111” has a value of 1 and being dead has a value of 0. Rescaling was performed at the respondent level on the basis of the observed VAS scores for the various health states, and the scores that were recorded for “dead” and “perfect health,” using the following equation [19]:

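\[
\mathrm{VAS}_{\text{rescaled}} = \frac{\mathrm{VAS}_{\text{state}} - \mathrm{VAS}_{\text{dead}}}{\mathrm{VAS}_{11111} - \mathrm{VAS}_{\text{dead}}}
\]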

The same procedure that was applied in the Dutch valuation study [20] was used for estimating values from TTO responses. For states regarded as better than dead, the TTO value is t/10; for states worse than dead, values are computed as −t/(10 − t). These negative values were subsequently bounded at −1 with the commonly used transformation v′ = v/(1 − v). Linear regression analysis was used to interpolate values for all EQ-5D states from the values for the 17 states that were observed.
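The two formulas and the bounding transformation can be combined in one small function. The sketch below is illustrative; the function name and the `horizon` parameter are assumptions, with the 10-year time frame taken from the task description.

```python
def tto_value(t, better_than_dead, horizon=10.0):
    """TTO value for an indifference point t (years in state '11111').

    Better than dead: v = t / horizon.
    Worse than dead:  v = -t / (horizon - t), then bounded via v / (1 - v).
    """
    if better_than_dead:
        return t / horizon
    if t >= horizon:
        raise ValueError("t must be smaller than the time horizon")
    v = -t / (horizon - t)
    return v / (1.0 - v)

print(tto_value(7.5, better_than_dead=True))   # 0.75
print(tto_value(5.0, better_than_dead=False))  # -0.5
```

Substituting v = −t/(10 − t) into v/(1 − v) gives −t/10, which shows why the transformed values cannot fall below −1.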

Estimated value prediction models and rescaling methods.  For the TTO task, the predicted values for all 243 EQ-5D states were derived after interpolation from the values for the 17 states that were included in the TTO task. The TTO model included an intercept, interpreted as any deviation from full health, as well as dummy variables for the 10 main effects and for the N3 parameter.

We modeled and rescaled DCE-derived values in two different ways. The applied DC models were a conditional logit model (estimated only on the DCE data, Stata: clogit) and a rank-ordered logit model (estimated on DCE and DCEdead data, Stata: rologit), as explained below.

Neither the TTO nor the DC model adjusts for the fact that there are several observations per respondent.

First, we used the conditional logit model to analyze the DCE data obtained from the 60 pair-wise comparisons of EQ-5D states. The model included dummy variables for the 10 main effects and the N3 parameter. The values derived from this model are on an undefined scale. To link the DCE-derived health state values to the QALY scale, we used TTO values for the worst health state (33333) and the best health state (11111) as anchor points for rescaling. For the general population, we used TTO values obtained from the Dutch EQ-5D valuation study (i.e., −0.329 [20]). For the student sample, we used the empirical TTO values derived in this study. We will refer to the resulting values as the DC values.
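One way to read this anchoring step is as a linear transformation of the latent utilities that maps the utility of state 11111 to 1 and the utility of state 33333 to its TTO value. The sketch below assumes this linear form; the function name and the latent utilities in the example are hypothetical, and only the anchor value −0.329 is taken from the text.

```python
def rescale_to_qaly(u, u_best, u_worst, v_best=1.0, v_worst=-0.329):
    """Linearly map a latent utility u onto the QALY scale.

    u_best and u_worst are the latent utilities of states 11111 and 33333;
    v_best and v_worst are the TTO values used as anchors.
    """
    share = (u - u_worst) / (u_best - u_worst)
    return v_worst + (v_best - v_worst) * share

# Hypothetical latent utilities: 0 for 11111, -4.0 for 33333, -2.0 in between.
print(round(rescale_to_qaly(-2.0, u_best=0.0, u_worst=-4.0), 4))  # 0.3355
```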

Alternatively, we derived health state values from the DCE data on the QALY scale by anchoring the values on the value for being dead (thus: 0). For this purpose, we modeled the information obtained from both the DCE and DCEdead data. The data of these two response tasks were combined to infer how the respondent would have rank ordered the two EQ-5D states and “dead” from most to least preferred. These rank orderings were analyzed using a rank-ordered logit model. Besides the dummy variables for the 10 main effects and the N3 parameter, this model also includes a parameter for the state of being dead, which can be used to rescale the values and put them on the full health–dead (1–0) scale, as demonstrated by McCabe et al. [16]. The value for being dead is anchored at zero by dividing all coefficients by the coefficient for “dead.” By additionally restricting the value of full health to 1, values are produced in the 0 to 1 range for states better than dead, and negative values for states worse than dead. We will refer to the resulting value set as DCdead.
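The rescaling on “dead” can be sketched as follows, assuming that coefficients are expressed as disutilities relative to state 11111 (positive values, as in the DCEdead column of Table 5). The five level coefficients below are taken from that column; the coefficient for “dead” (3.0) is a hypothetical placeholder, and the function name is illustrative.

```python
# Disutility coefficients from the rank-ordered logit (DCEdead column, Table 5).
# Only a subset of parameters is listed; other levels are omitted here.
COEF = {"MO2": 0.297, "MO3": 1.169, "SC2": 0.296, "SC3": 0.691, "UA2": 0.410}
DEAD_COEF = 3.0  # hypothetical placeholder, not an estimate from the study

def dcdead_value(active_terms, coef=COEF, dead_coef=DEAD_COEF):
    """Value on the full health (1) to dead (0) scale.

    Each coefficient is divided by the coefficient for 'dead', and the
    rescaled decrements are subtracted from 1 (the value of state 11111).
    """
    return 1.0 - sum(coef[term] / dead_coef for term in active_terms)

print(round(dcdead_value([]), 3))              # full health
print(round(dcdead_value(["MO2", "UA2"]), 3))  # state 21211
```

A state whose summed disutility exceeds the coefficient for “dead” receives a negative value, that is, it is valued as worse than dead.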

Across-method and across-sample comparison.  Intraclass correlation coefficients (mixed model, average measures) and mean absolute differences were computed to estimate the degree of correspondence between different methods. The intraclass correlation coefficients were also used to compare the DC derived values of students and the general population. Except for the DC model (Stata 10 SE), all statistical analyses were performed in SPSS (V. 17.0; Chicago, IL).
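The mean absolute difference is simple to compute. The sketch below applies it to the 17 observed mean TTO values and the corresponding predicted DCE values listed in Table 4. It is an illustration of the measure only and need not reproduce the figures reported in the article, which may be based on a different set of states.

```python
# (TTO observed mean, DC model DCE prediction) per state, from Table 4.
VALUES = {
    "11112": (0.81, 0.92), "11211": (0.83, 0.93), "12111": (0.86, 0.95),
    "11121": (0.86, 0.95), "21111": (0.89, 0.95), "11113": (0.52, 0.65),
    "11312": (0.55, 0.62), "11131": (0.43, 0.64), "22222": (0.58, 0.70),
    "13311": (0.51, 0.58), "32211": (0.56, 0.56), "11133": (0.24, 0.40),
    "23232": (0.29, 0.32), "32223": (0.28, 0.28), "32313": (0.23, 0.21),
    "33323": (0.08, 0.09), "33333": (-0.10, -0.10),
}

def mean_absolute_difference(pairs):
    """Average of |a - b| over all (a, b) pairs."""
    pairs = list(pairs)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)

mad = mean_absolute_difference(VALUES.values())
higher = sum(dce > tto for tto, dce in VALUES.values())
print(round(mad, 3), higher)
```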

Results

Respondents

Data were elicited in a sample of 444 persons in the general population and 209 students. The general population sample was representative in terms of sex, age, and level of education (Table 3). All students completed the rank, VAS, and TTO tasks. They also completed the DC task, but because of a problem with data storage, responses of five students were not saved. DC responses of those who continually chose only one option were removed from the data set. This applied to six people in the general population sample and none of the students. The DC model was therefore estimated on responses of 204 students and 438 people in the general public. Their responses included no missing values.

Table 3.  Characteristics of the two samples
                          Sample (N = 444)  General population norms* (%)  Students (N = 209)
Male, % (N)               48.2 (214)        50.1                           30.6 (64)
  18–24                   3.8 (17)          5.9                            79.7 (51)
  25–34                   7.9 (35)          9.0                            18.8 (12)
  35–44                   10.8 (48)         11.3                           1.5 (1)
  45–54                   9.7 (43)          10.1
  55–64                   10.4 (46)         8.6
  65–74                   5.6 (25)          5.2
Female, % (N)             51.8 (230)        50.0                           69.4 (145)
  18–24                   4.7 (21)          5.8                            82.7 (120)
  25–34                   9.2 (41)          9.0                            16.5 (24)
  35–44                   11.5 (51)         11.1                           0.8 (1)
  45–54                   10.4 (46)         9.9
  55–64                   10.1 (45)         8.5
  65–74                   5.9 (26)          5.7
Marital status, % (N)
  Single                  23.4 (104)                                       68.4 (143)
  Married/living together 59.0 (262)                                       16.7 (35)
  Widowed                 3.2 (14)
  Divorced                10.4 (46)                                        1.4 (3)
  Missing, other          4.1 (18)                                         13.5 (28)
Educational level, % (N)
  Low                     27.0 (120)        26.3
  Middle                  40.1 (178)        42.5
  High                    32.9 (146)        31.3                           100.0 (209)
Age, Mean (SD)            45.5 (14.6)                                      22.7 (3.4)
EQ–5D index, Mean (SD)    0.83 (0.23)                                      0.93 (0.1)

* Source: Survey Sampling International, Minicensus data (The Netherlands).

Preference Data Elicited in Students Using Ranks, VAS, and TTO

The observed mean values for the 17 health states that were obtained in students using ranks, VAS, and TTO are presented in Table 4 and Figure 1. All methods yielded a negative value for state 33333 (rank: −0.06; LCJ: −0.15; VAS: −0.07; TTO: −0.1). Compared to the Dutch TTO-based valuation algorithm, the student sample gave on average slightly higher values for the health states (not presented). The intraclass correlations between the four value sets were high (>0.96, P < 0.001). Yet, the absolute values differed across the methods, in particular between VAS and LCJ, between TTO and VAS, and between TTO and LCJ. VAS values tended to be lower than TTO values. Nevertheless, the mean ranks were similar to the VAS values. This similarity may be caused by the relation between the judgmental tasks of the ranking and VAS: the rank-ordered health states were valued using VAS in a specific order. Application of LCJ to rank data resulted in values that were higher than VAS and TTO values.

Table 4.  Observed and rescaled mean (SD) ranks, visual analog scale (VAS), and time trade-off (TTO) values (N = 209) and predicted discrete choice (DC) values (N = 204) for the 17 EQ-5D states
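| State | Rank: mean | Rank: SD | Rank: rescaled | Thurstone (exploded ranks): LCJ | VAS (observed): mean | VAS (observed): SD | VAS (normalized): mean | VAS (normalized): SD | VAS (normalized): rescaled | TTO (observed): mean | TTO (observed): SD | DC model (predicted): DCE | DC model (predicted): DCE dead |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11111 | 1.01 | 0.21 | 1.00 | | 98.83 | 3.35 | 100.00 | 0.00 | 1.00 | | | 1.00 | 1.00 |
| 11112 | 4.17 | 1.97 | 0.81 | 0.96 | 82.72 | 12.87 | 81.97 | 14.20 | 0.82 | 0.81 | 0.28 | 0.92 | 0.93 |
| 11211 | 4.20 | 1.78 | 0.81 | 0.92 | 82.07 | 11.46 | 80.99 | 14.53 | 0.81 | 0.83 | 0.25 | 0.93 | 0.93 |
| 12111 | 4.25 | 2.27 | 0.80 | 0.88 | 81.03 | 15.79 | 79.61 | 21.14 | 0.80 | 0.86 | 0.25 | 0.95 | 0.95 |
| 11121 | 4.34 | 1.92 | 0.80 | 0.91 | 81.54 | 13.69 | 80.80 | 14.31 | 0.81 | 0.86 | 0.22 | 0.95 | 0.95 |
| 21111 | 4.36 | 2.48 | 0.80 | 0.88 | 80.99 | 14.41 | 79.87 | 15.93 | 0.80 | 0.89 | 0.18 | 0.95 | 0.95 |
| 11113 | 8.25 | 2.99 | 0.56 | 0.64 | 57.50 | 21.18 | 53.98 | 23.79 | 0.54 | 0.52 | 0.44 | 0.65 | 0.63 |
| 11312 | 8.97 | 2.19 | 0.52 | 0.58 | 55.12 | 16.16 | 51.18 | 20.21 | 0.51 | 0.55 | 0.34 | 0.62 | 0.60 |
| 11131 | 9.40 | 2.83 | 0.49 | 0.55 | 51.20 | 19.38 | 46.73 | 22.18 | 0.47 | 0.43 | 0.45 | 0.64 | 0.60 |
| 22222 | 9.98 | 1.92 | 0.45 | 0.49 | 48.47 | 14.59 | 43.70 | 18.63 | 0.44 | 0.58 | 0.36 | 0.70 | 0.71 |
| 13311 | 10.27 | 2.76 | 0.44 | 0.52 | 48.26 | 18.14 | 43.71 | 22.84 | 0.44 | 0.51 | 0.38 | 0.58 | 0.56 |
| 32211 | 11.05 | 2.81 | 0.39 | 0.44 | 42.78 | 18.41 | 37.33 | 24.99 | 0.37 | 0.56 | 0.40 | 0.56 | 0.54 |
| 11133 | 12.85 | 2.94 | 0.28 | 0.33 | 31.54 | 18.90 | 24.85 | 23.52 | 0.25 | 0.24 | 0.49 | 0.40 | 0.38 |
| 23232 | 13.62 | 2.21 | 0.23 | 0.24 | 27.42 | 14.09 | 20.47 | 18.60 | 0.21 | 0.29 | 0.43 | 0.32 | 0.30 |
| 32223 | 14.21 | 1.94 | 0.20 | 0.24 | 24.75 | 13.35 | 17.49 | 18.42 | 0.18 | 0.28 | 0.44 | 0.28 | 0.26 |
| 32313 | 14.66 | 2.01 | 0.17 | 0.21 | 21.98 | 13.50 | 13.89 | 22.86 | 0.14 | 0.23 | 0.46 | 0.21 | 0.21 |
| 33323 | 16.78 | 1.50 | 0.04 | 0.03 | 10.44 | 9.69 | 1.47 | 17.57 | 0.02 | 0.08 | 0.49 | 0.09 | 0.09 |
| dead | 17.43 | 1.94 | 0.00 | 0.00 | 7.59 | 11.38 | 0.00 | 0.00 | 0.00 | | | | |
| 33333 | 18.37 | 0.82 | −0.06 | −0.15 | 3.68 | 5.87 | −6.62 | 21.11 | −0.07 | −0.10 | 0.48 | −0.10 | −0.11 |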
Figure 1. Comparison of values elicited from the student sample: Observed rank, Thurstone scaling (LCJ) based on ranks, VAS, and TTO values for the 17 empirically measured EQ-5D health states, and the derived values of the same 17 states based on the DC task (DCE).
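The "Rescaled" rank values in Table 4 appear to be consistent with a linear transformation that maps the mean rank of dead to 0 and the mean rank of state 11111 to 1. The sketch below illustrates that reading; the formula is inferred from the table and is not stated by the authors, and the function name `rescale_rank` is hypothetical.

```python
def rescale_rank(mean_rank, rank_full_health=1.01, rank_dead=17.43):
    """Map a mean rank onto a scale with dead = 0 and state 11111 = 1.

    The default anchors are the mean ranks of 11111 and dead in Table 4.
    The linear form is an assumption that happens to match the table.
    """
    return (rank_dead - mean_rank) / (rank_dead - rank_full_health)


print(round(rescale_rank(4.17), 2))   # state 11112 -> 0.81
print(round(rescale_rank(9.98), 2))   # state 22222 -> 0.45
print(round(rescale_rank(18.37), 2))  # state 33333 -> -0.06
```

A state ranked below dead on average, such as 33333, receives a negative value under this transformation.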

Comparing DC and TTO

Table 5 presents the parameter estimates obtained for DC, DCdead, and TTO. We only report the models that included the N3 parameter, because these performed slightly better than the models without N3. All coefficients were statistically significant, except for mobility level 2 in the TTO model.

Table 5.  Parameter estimates for the models based on data derived by discrete choice experiment (DCE) and time trade-off (TTO)
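| Parameter | DCE general population: Coef | SE | Sig. | DCE students: Coef | SE | Sig. | DCEdead students: Coef | SE | Sig. | TTO students: Coef | SE | Sig. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Constant* | −0.017 | 0.04 | 0.674 | −0.094 | 0.04 | 0.028 | N/A | | | −0.103 | 0.02 | 0.000 |
| MO2 | −0.270 | 0.07 | 0.000 | −0.344 | 0.08 | 0.000 | 0.297 | 0.07 | 0.000 | −0.012 | 0.02 | 0.603 |
| MO3 | −1.454 | 0.08 | 0.000 | −1.405 | 0.09 | 0.000 | 1.169 | 0.07 | 0.000 | −0.091 | 0.03 | 0.001 |
| SC2 | −0.545 | 0.07 | 0.000 | −0.374 | 0.07 | 0.000 | 0.296 | 0.06 | 0.000 | −0.055 | 0.02 | 0.012 |
| SC3 | −1.116 | 0.08 | 0.000 | −0.834 | 0.08 | 0.000 | 0.691 | 0.07 | 0.000 | −0.079 | 0.03 | 0.002 |
| UA2 | −0.302 | 0.07 | 0.000 | −0.508 | 0.08 | 0.000 | 0.410 | 0.07 | 0.000 | −0.054 | 0.02 | 0.021 |
| UA3 | −0.914 | 0.08 | 0.000 | −1.338 | 0.09 | 0.000 | 1.062 | 0.07 | 0.000 | −0.169 | 0.03 | 0.000 |
| PD2 | −0.148 | 0.07 | 0.024 | −0.370 | 0.07 | 0.000 | 0.335 | 0.06 | 0.000 | −0.087 | 0.02 | 0.000 |
| PD3 | −1.362 | 0.08 | 0.000 | −1.751 | 0.09 | 0.000 | 1.521 | 0.07 | 0.000 | −0.297 | 0.02 | 0.000 |
| AD2 | −0.484 | 0.08 | 0.000 | −0.543 | 0.08 | 0.000 | 0.424 | 0.07 | 0.000 | −0.069 | 0.02 | 0.001 |
| AD3 | −1.530 | 0.08 | 0.000 | −1.675 | 0.09 | 0.000 | 1.351 | 0.07 | 0.000 | −0.231 | 0.02 | 0.000 |
| N3 | −0.604 | 0.12 | 0.000 | −0.855 | 0.14 | 0.000 | 0.918 | 0.13 | 0.000 | −0.128 | 0.02 | 0.000 |
| Dead dummy | N/A | | | N/A | | | 6.066 | 0.16 | 0.000 | N/A | | |

Samples and model fits:

| Model | N | Obs | Log-likelihood | Pseudo-R2 | R2 |
|---|---|---|---|---|---|
| DCE general population | 438 | 7,884 (438*9*2) | −2035.03 | 0.26 | |
| DCE students | 204 | 7,334 (204*18*2) | −1793.29 | 0.30 | |
| DCEdead students | 204 | 11,016 (204*18*3) | −3557.94 | | |
| TTO students | 209 | 3,553 (209*17) | | | 0.35 |

* In the set of DCE coefficients, the constant represents the alternative specific constant, capturing a tendency to always choose the first option. In TTO, the constant represents the disutility associated with any deviation from full health in so far as it is not attributable to any of the five domains.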

Despite some differences (e.g., incentives provided, mode of administration, number of judgments) between the two study samples (students, general population), we observed a strong relationship between the DC-derived values of the two samples (Fig. 2). Although more health states seem to be valued negatively by the general population, this is mostly because of the rescaling on the basis of the TTO value for the worst EQ-5D state, “33333”: −0.329 for the general population and −0.098 for students. Comparison of Figures 2 and 3 suggests that the parameter estimates obtained using DCE in different samples are closer to each other than the DCE-derived and TTO-derived estimates. Figure 3 shows that DC produced higher values than TTO when rescaled on the basis of the TTO values for “33333” and “11111.” The intraclass correlation between the TTO and DC values was 0.93 (P < 0.001; confidence interval [CI] 0.12–0.98) in the general population, and 0.96 (P < 0.001; CI 0.53–0.99) among the students. The mean absolute difference between the student TTO and student DC was 0.060 (SD = 0.039).
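As an illustration of the rescaling described above, the sketch below turns the student DCE coefficients from Table 5 into values anchored at 1 for state 11111 and at the student TTO value for state 33333 (−0.098). The linear rescaling and the omission of the alternative specific constant are assumptions; they happen to reproduce the predicted DCE values in Table 4, but the authors' exact procedure is not spelled out in this section. All function and variable names are hypothetical.

```python
# Student DCE coefficients from Table 5 (conditional logit, latent scale).
DCE_STUDENTS = {
    "MO": {2: -0.344, 3: -1.405},
    "SC": {2: -0.374, 3: -0.834},
    "UA": {2: -0.508, 3: -1.338},
    "PD": {2: -0.370, 3: -1.751},
    "AD": {2: -0.543, 3: -1.675},
}
N3 = -0.855  # extra term applied when any dimension is at level 3
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def latent_utility(state):
    """Sum of the main-effect coefficients (plus N3) for a state like '11213'."""
    levels = [int(ch) for ch in state]
    total = sum(DCE_STUDENTS[dim][lvl]
                for dim, lvl in zip(DIMENSIONS, levels) if lvl > 1)
    if 3 in levels:
        total += N3
    return total


def anchored_value(state, tto_worst=-0.098):
    """Rescale linearly so that 11111 -> 1 and 33333 -> tto_worst (assumed form)."""
    share_of_worst = latent_utility(state) / latent_utility("33333")
    return 1.0 - (1.0 - tto_worst) * share_of_worst


for s in ("11112", "11113", "22222", "33333"):
    print(s, round(anchored_value(s), 2))
```

Because the worst state receives a more negative TTO value in the general population (−0.329), the same latent utilities are stretched over a wider range there, which is why more states end up below zero.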

Figure 2. DC values for the 243 EQ-5D health states derived from discrete choice judgments by the general population (Dutch) compared with values derived from similar judgments by Dutch students.

Figure 3. Comparison of TTO (Dutch algorithm) values with DC (Dutch general population) values.

Absolute values of health states may differ between methods, but in many applications of health state values the main focus is on marginal differences (e.g., comparisons before and after a medical intervention). Therefore, marginal difference scores were computed for all pairwise combinations of the 243 derived EQ-5D states (29,403 combinations), separately for the TTO and the DC (Fig. 4). This analysis shows again that DC values are higher overall than TTO values, and also that marginal differences between TTO and DC values for individual pairs of states can be as large as about 0.20. The mean of marginal differences between DC values and TTO values for individual pairs was 0.086 (SD 0.002).
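The pairwise computation can be sketched as follows. The enumeration of the 243 states and their 29,403 unordered pairs follows directly from the text; the two value sets are toy stand-ins (a hypothetical additive function with an assumed weight), not the TTO and DC algorithms from the study.

```python
from itertools import combinations, product
from statistics import mean

# All 3^5 = 243 EQ-5D states, written as five-digit strings such as "11213".
STATES = ["".join(levels) for levels in product("123", repeat=5)]
# Every unordered pair of distinct states: 243 * 242 / 2 = 29,403.
PAIRS = list(combinations(STATES, 2))


def toy_values(weight):
    """Hypothetical value set: each step away from level 1 costs `weight`."""
    return {s: 1.0 - weight * sum(int(ch) - 1 for ch in s) for s in STATES}


def marginal_differences(values):
    """Difference in value within each pair of states."""
    return [values[a] - values[b] for a, b in PAIRS]


method_a = marginal_differences(toy_values(0.10))  # stand-in for one method
method_b = marginal_differences(toy_values(0.12))  # stand-in for another

# Mean absolute discrepancy between the two sets of marginal differences.
gap = mean(abs(x - y) for x, y in zip(method_a, method_b))
print(len(STATES), len(PAIRS), round(gap, 3))
```

With real value sets, `toy_values` would be replaced by dictionaries holding the TTO-based and DC-based value for each of the 243 states.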

Figure 4. Marginal difference scores between the derived values of the 243 EQ-5D states (29,403 combinations) for the TTO (Dutch students) and the DC (Dutch students).

Anchoring DC Values on “Dead”

Students considered a health state to be worse than dead in about 10% of the cases. The DC model parameter estimates derived from the DCdead data are presented in Table 5.

The values produced by the two different models (DC vs. DCdead) are congruent (Fig. 5); the intraclass correlation between the two value sets was 0.99 (P < 0.001; CI 0.92–0.99), while the mean absolute difference between the values was 0.019 (SD 0.009). The DCdead values were slightly lower than values derived from the DCE involving pair-wise comparison of EQ-5D states, except for mild health states. Therefore, the difference between the DCdead values and the TTO values was slightly smaller than the difference between the DC values and the TTO values.
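One way to read the DCEdead column of Table 5 is that the coefficient of the dead dummy fixes the distance between full health and dead, so that dividing a state's summed coefficients by 6.066 places it on a scale with full health = 1 and dead = 0. The sketch below follows that reading. It is an assumption that reproduces the "DCE dead" values in Table 4, not a procedure quoted from the authors; treating the positive coefficients as disutilities is likewise an interpretation, and all names are hypothetical.

```python
# Student DCEdead coefficients from Table 5 (rank-ordered logit).
DCE_DEAD = {
    "MO": {2: 0.297, 3: 1.169},
    "SC": {2: 0.296, 3: 0.691},
    "UA": {2: 0.410, 3: 1.062},
    "PD": {2: 0.335, 3: 1.521},
    "AD": {2: 0.424, 3: 1.351},
}
N3 = 0.918          # extra term applied when any dimension is at level 3
DEAD_DUMMY = 6.066  # coefficient of the dead dummy
DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")


def value_anchored_on_dead(state):
    """Value with 11111 -> 1 and dead -> 0 (assumed rescaling)."""
    levels = [int(ch) for ch in state]
    disutility = sum(DCE_DEAD[dim][lvl]
                     for dim, lvl in zip(DIMENSIONS, levels) if lvl > 1)
    if 3 in levels:
        disutility += N3
    return 1.0 - disutility / DEAD_DUMMY


for s in ("11112", "11113", "22222", "33333"):
    print(s, round(value_anchored_on_dead(s), 2))
```

Under this reading, a state is valued below zero whenever its summed disutility exceeds the dead coefficient, as is the case for 33333.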

Figure 5. The DC values (Dutch students) derived from discrete choices between pairs of EQ-5D health states compared with the DC values (Dutch students) derived from discrete choices of separate EQ-5D health states plus being dead.

Interaction Terms in the DC Models

The analysis of the DC models expanded with first-order interaction terms showed that 10 of the 40 interaction terms were statistically significant. However, three main effects that were significant in the main-effects model (mobility level 2, pain level 2, anxiety/depression level 2) were no longer statistically significant once the interaction terms were included. The increase in the amount of explained variance (pseudo-R2) caused by the inclusion of the interaction terms was marginal (main effects: 0.266; main effects + interactions: 0.277).
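The count of 40 is what one obtains when every level dummy of one dimension is crossed with every level dummy of each other dimension. The short sketch below enumerates these terms under that assumption; the paper does not list the interaction terms explicitly, and the naming scheme (e.g., `MO2xSC3`) is hypothetical.

```python
from itertools import combinations, product

DIMENSIONS = ("MO", "SC", "UA", "PD", "AD")
LEVELS = (2, 3)  # level 1 is the reference category

# Cross each level dummy with the level dummies of every other dimension:
# C(5, 2) = 10 dimension pairs, times 2 x 2 level combinations = 40 terms.
INTERACTIONS = [
    f"{dim_a}{lvl_a}x{dim_b}{lvl_b}"
    for dim_a, dim_b in combinations(DIMENSIONS, 2)
    for lvl_a, lvl_b in product(LEVELS, repeat=2)
]

print(len(INTERACTIONS))
print(INTERACTIONS[:4])
```

Adding 40 terms to a model with 10 main effects and one N3 term spreads the available choice data thinly, which may help explain why some main effects lost significance.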

Discussion

We have presented a systematic comparison of ranks and VAS, TTO, and DC (DCE-derived) values for EQ-5D health states in order to investigate whether or not modeling DCE data produces health state values that are comparable to other conventional valuation techniques, TTO in particular. DC values broadly replicated the pattern found in TTO responses. This observation applies to both samples (general population, students) and, in students, to both strategies that were applied to anchor the DC values on the full health (= 1)–dead (= 0) scale. Besides similarities, there were also systematic differences. DC values were consistently higher than TTO values, which were in turn higher than VAS values. Values derived from rank data were higher when analyzed using LCJ than when using mean ranks. Instead of the classic case V model used here, more general Thurstonian models with unrestricted covariance structures may be more appropriate [29]. The results suggest a systematic difference across the methods, with DC values being the highest of all.

The fact that differences were found between DC modeling and TTO is in line with the findings of several other studies where DC models have been applied in the analysis of rank or DCE data. Salomon compared rank-based models and TTO for EQ-5D using data from the UK general population survey. He found that the rank-based models produced slightly higher values [5]. Ratcliffe et al. [14] compared TTO and DC modeling for a disease-specific outcome measure. DCE-derived values seemed higher than TTO values. A more complex relation was found between rank and TTO data, with better convergence for mild states. McCabe et al. compared values derived from rank data with SG values for SF-6D and HUI health states; the rank data produced higher values [16]. It thus seems that TTO and DC models are largely measuring the same latent construct (quality of a health state), but the techniques do not produce identical results.

The main difficulty we met in applying DC models is that these models generate values on an arbitrary scale, not on the metric of the quality (of life) component of the QALY scale. We have explored the possibility of anchoring the values derived from DCE data on the QALY scale directly by using “dead” as a choice option. This strategy yielded values that were comparable to those derived from the DCE where two EQ-5D states were compared to each other and anchored on the basis of some TTO values. Although this is a promising result with regard to the possibility of using DC models and their associated DCEs as a stand-alone valuation technique, further research is warranted to explore the relationship between the outcomes of the DCEdead approach and those of a DC model that is anchored on TTO. For example, the difference between the TTO value for state “33333” of students and that of the general population raises the question of whether results about comparability of the two anchoring strategies can be generalized from students to the general population.

If combined use of DC modeling and TTO is considered for health state valuation, the strategy for linking DC and TTO data may need to be further explored. Anchoring on the worst state, 33333, may have contributed to systematic differences between TTO and DC values because of bias resulting from the problems TTO has in valuing states worse than being dead. At the other end of the valuation space, the DC values may be incorrectly anchored with respect to full health. A reliable estimate of the difference between full health and nonoptimal states cannot be obtained from the collected choice data, because of dominance issues similar to the ones pertaining to dead. The problem can be circumvented with a valuation task in which respondents choose between scenarios that vary in quality of life and in one other domain of health, such as length of life or risk of dying, as suggested by Flynn et al. [18]. In that case, however, health state values would be influenced by risk aversion or time preference, the very characteristics for which TTO and SG values are criticized. In this circumstance, a pragmatic solution may be to use a large number of TTO values for anchoring (possibly excluding the value for state 11111) and apply statistical routines to adjust the parameters of the DC model to fit the TTO data set. Another approach may be the use of specific models that are suitable to deal with dominant health states to calibrate the metric distances in this region [30].

Furthermore, in application of DC models for health state valuation, the added value of different DC models may need to be explored. The DC models employed in the current study are variants of the frequently used multinomial logit model [8,25]. This model makes the simplifying assumptions that the error terms are independently, identically distributed (the IID assumption) and that the ratio of the probabilities of two alternatives i and k does not depend on any alternatives other than i and k (the IIA assumption). Several other models relax the IIA assumption. Examples include the mixed logit model, the generalized extreme value model, and the probit model [25]. The first of these is considered the most promising for DC analysis [31]. While mixed logit models are arguably more powerful, they also require higher data quality. We refrained from powering our study for these more complex models, because our aim was to make a global comparison of TTO and DC values, and then to study the strengths and weaknesses of various ways of anchoring relative DC values on the QALY scale. If one considers application of DC models for health state valuation, we would recommend larger designs that permit estimation of more complex models to alleviate concerns about bias caused by violation of the IIA assumption.
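The IIA property of the multinomial logit model can be checked numerically. In the sketch below, the utilities of the alternatives are arbitrary illustrative numbers and the function name is hypothetical; the point is only that the ratio of the choice probabilities of i and k stays the same when a third alternative is added.

```python
import math


def mnl_probabilities(utilities):
    """Multinomial logit choice probabilities for a dict of utilities."""
    top = max(utilities.values())  # subtract the maximum for numerical stability
    weights = {name: math.exp(u - top) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Illustrative utilities (assumed values, not estimates from the study).
two_options = mnl_probabilities({"i": 1.0, "k": 0.2})
three_options = mnl_probabilities({"i": 1.0, "k": 0.2, "m": 0.7})

ratio_two = two_options["i"] / two_options["k"]
ratio_three = three_options["i"] / three_options["k"]

# Both ratios equal exp(1.0 - 0.2), whatever other alternatives are offered.
print(ratio_two, ratio_three, math.exp(0.8))
```

Models such as mixed logit allow this ratio to change with the composition of the choice set, which is the sense in which they relax the IIA assumption.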

Modern DC models build upon the early work and basic principle of Thurstone's law of comparative judgment (LCJ). In fact, the class of choice- and rank-based scaling models, with its lengthy history (1927 to the present), is one of the few areas in the social and behavioral sciences that have a strong underlying theory. In this respect, it may be interesting to explore the possibility of extending or combining DC models with other closely related (fundamental) measurement models (e.g., Rasch models and item response theory models [32,33]). This might be an important area for future research.

To conclude, we believe that a strategy based on TTO data supplemented by health state values derived from DC modeling may be a feasible and accurate option. Although there are small differences in results from the two conceptually different valuation methods, there seems to be a clear systematic relation that would make conversion from one method to the other feasible and defensible.

This work is partially based on our inspiring discussions and reflections during several meetings of the EuroQol Valuation Task Force. Therefore, we wish to express our thanks in particular to Frank de Charro, Nancy Devlin, Ben van Hout, Paul Kind, and David Parkin. We thank Eva de Vries, Anke Naaborgh, and Mark Huijsman for their assistance in the data collection. We thank our reviewers for their helpful comments.

This work was presented during a workshop at the ISPOR 11th European Congress, Athens, Greece, November 10, 2008.

Source of financial support: This research was made possible by a grant from the EuroQol Group.
