Esther W. de Bekker-Grob, Department of Public Health, Erasmus MC—University Medical Centre Rotterdam, PO Box 2040, 3000 CA Rotterdam, The Netherlands. E-mail: firstname.lastname@example.org
Objectives: Discrete choice experiments (DCEs) in health economics commonly present choice sets in an unlabeled form. Labeled choice sets are less abstract and may increase the validity of the results. We empirically compared the feasibility, respondents' trading behavior, and convergent validity between a labeled and an unlabeled DCE for colorectal cancer (CRC) screening programs in The Netherlands.
Methods: A labeled DCE version presented CRC screening test alternatives as “fecal occult blood test,” “sigmoidoscopy,” and “colonoscopy,” whereas the unlabeled DCE version presented them as “screening test A” and “screening test B.” Questionnaires were sent to participants and nonparticipants in CRC screening.
Results: Total response rate was 276 (39%) out of 712 and 1033 (46%) out of 2267 for unlabeled and labeled DCEs, respectively (P < 0.001). The labels played a significant role in individual choices; approximately 22% of subjects had dominant preferences for screening test labels. The convergent validity was modest to low (participants in CRC screening: r = 0.54; P = 0.01; nonparticipants: r = 0.17; P = 0.45) largely because of different preferences for screening frequency.
Conclusion: This study provides important insights into the feasibility of labeled and unlabeled DCEs and the differences in their results. The inclusion of labels appeared to play a significant role in individual choices but reduced the attention respondents gave to the attributes. As a result, unlabeled DCEs may be more suitable for investigating trade-offs between attributes and for respondents who are not familiar with the alternative labels, whereas labeled DCEs may be more suitable for explaining real-life choices such as uptake of cancer screening.
Estimates of public and patients' preferences are of great importance in informing policy decision-making and improving adherence to public health-care interventions or programs. Discrete choice experiments (DCEs) have become a commonly used technique in health economics to elicit preferences. The DCE is an attribute-based survey method for measuring benefits (utility). In a DCE, subjects are presented with a sequence of (hypothetical) scenarios (choice sets) and are asked to choose between two or more competing alternatives that vary along several characteristics or attributes of interest. DCEs assume that subjects' preferences (as summarized by their utility function) are revealed through their choices (for further details, see Bliemer and Rose, Hensher et al., Louviere et al., and Ryan et al.).
A fundamental question that arises in the application of DCEs is whether to present the choice sets in a labeled or an unlabeled form. The unlabeled form assigns generic titles to the alternatives in the choice set, such as “alternative A,” “alternative B,” and so on. The labeled form assigns labels that convey information about the alternative. In marketing applications, labels tend to consist of brand names and logos, which consumers have learned to associate with different product characteristics and feelings. In the context of health economics, labels tend to consist of generic or brand-name medications, specific screening tests (e.g., colonoscopy, sigmoidoscopy), specific treatments (e.g., surgery vs. conservative treatment), or other descriptors. An advantage of assigning labels is that the alternatives are more realistic and the choice task is less abstract for the subject, which adds to the validity of the results. Hence, the results may be better suited to support decision-making at the policy level. Nevertheless, by far most DCEs applied in health economics have used unlabeled alternatives.
The aim of our study was to empirically compare the feasibility, respondents' trading behavior, and convergent validity of a labeled and an unlabeled DCE. All of these aspects were explored in the context of a DCE study investigating population preferences for colorectal cancer (CRC) screening programs in The Netherlands. We were convinced that the specific aspects of endoscopy (sigmoidoscopy, colonoscopy) or the fecal occult blood test (FOBT) that determine its burden could not be fully captured by presenting an unlabeled “screening test A” variant to subjects. For that reason, we expected differences between an unlabeled and a labeled DCE.
Theoretical Basis of Labeled and Unlabeled DCEs
The aim of discrete choice modeling is to estimate the weights that respondents place on the attributes of alternatives. An individual acting rationally is expected to evaluate the set of available alternatives and to choose the alternative that yields the greatest relative utility. Thus, an individual will choose alternative A over B if U(XA, Z) > U(XB, Z), where U represents the individual's indirect utility function, XA and XB represent the attributes of alternatives A and B, and Z represents the socioeconomic and other characteristics of the individual that influence his or her utility. Choices made in DCEs are analyzed by using random utility theory (i.e., an error term is included in the utility function to reflect the unobservable factors in the individual's utility function). Thus, an individual will choose alternative A over B if V(XA, Z) + εA > V(XB, Z) + εB, where V is the measurable component of utility estimated empirically, and εA and εB reflect the unobservable factors in the individual's utility function for alternatives A and B, respectively (XA, XB, and Z defined as above).
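Under the common additional assumption that the error terms εA and εB are independently Gumbel distributed, this choice rule yields the familiar logit choice probabilities. The following sketch is our own illustration, not code from the study; all coefficient and attribute values are invented.

```python
import math

def choice_probabilities(v):
    """Logit choice probabilities for a list of deterministic utilities V."""
    exps = [math.exp(value) for value in v]
    total = sum(exps)
    return [e / total for e in exps]

# Deterministic utility as a weighted sum of attribute levels: V = beta . x
# (coefficients and levels below are invented for illustration)
beta = [0.8, -0.3]                 # e.g., mortality reduction (+), test burden (-)
x_a = [1.8, 1.0]                   # attribute levels of alternative A
x_b = [2.7, 0.0]                   # attribute levels of alternative B
v_a = sum(b * x for b, x in zip(beta, x_a))   # 0.8*1.8 - 0.3*1.0 = 1.14
v_b = sum(b * x for b, x in zip(beta, x_b))   # 0.8*2.7          = 2.16
p_a, p_b = choice_probabilities([v_a, v_b])
assert p_b > p_a                   # the higher-utility alternative is more likely
assert abs(p_a + p_b - 1.0) < 1e-12
```

With equal deterministic utilities, the probabilities reduce to 0.5 each, as expected from the symmetry of the error terms.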
There are two general types of DCEs: 1) unlabeled and 2) labeled DCEs. Unlabeled DCEs use generic titles for the alternatives (e.g., radio-imaging “A” or “B”). Labeled DCEs use alternative-specific titles (e.g., “computer tomography” or “MRI scan” [magnetic resonance imaging]). The number of alternatives (whether labeled or unlabeled) in a choice set is unrestricted from a theoretical point of view. The decision whether to use a labeled or an unlabeled DCE is an important one. A labeled alternative itself conveys information to respondents. This matters in choice and other decision tasks, because 1) respondents may use labeled alternatives to infer information that they perceive as missing; and 2) these inferences may be (and usually are) correlated with the random component. Although we may not know exactly what respondents find relevant in a label when forecasting uptake of, for example, a health-care intervention, it may be worthwhile to find out whether respondents prefer one alternative label to another. A labeled DCE can take into account effects that respondents may have learned to associate with different health-care intervention characteristics and feelings, and, as a result, may be more suitable for prediction. Unlabeled and labeled DCEs both have their merits. If each of the labeled options has A attributes with L levels and the choice sets are of size M, then there are L^(MA) possible choice sets, assuming that all labels are presented in a choice set and that the same label does not appear more than once in a choice set. If the options are unlabeled, then there are L^A possible items that can be included in each position of each choice set. If the choice sets are of size M and the same item is not allowed to appear more than once in a choice set, then there are “L^A choose M” possible choice sets of size M. Therefore, the design of an unlabeled DCE can be much smaller.
For example, two alternatives with four attributes and three levels yield 6561 (i.e., 3^(2×4) = 3^8) possible alternative combinations for a labeled DCE compared with “just” 81 (i.e., 3^4) possible alternative combinations for an unlabeled design. Other merits of unlabeled DCEs include that 1) they do not require the identification and use of all alternatives within the universal set of alternatives, provided that the attribute levels are sufficiently broad to represent all alternatives; 2) they might be more robust in terms of not violating the IID assumption (i.e., that error terms are independent and identically distributed), because the alternatives may be less correlated with the attributes than in labeled DCEs; and 3) they encourage respondents to choose an alternative by trading off attribute levels, which may be desirable from a nonmarket valuation perspective. On the other hand, merits of labeled DCEs include that 1) they are more realistic and less abstract, so that responses may better reflect the real preference structure; and 2) they can be used to study the main effect of the labels.
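The arithmetic above can be checked directly; a small sketch of ours, with L = 3 levels, A = 4 attributes, and M = 2 alternatives per choice set:

```python
from math import comb

L, A, M = 3, 4, 2   # levels, attributes, alternatives per choice set

labeled_choice_sets = L ** (M * A)      # each label appears once per set: 3^8
unlabeled_profiles = L ** A             # possible unlabeled alternatives: 3^4
unlabeled_choice_sets = comb(unlabeled_profiles, M)   # "L^A choose M"

print(labeled_choice_sets)    # 6561
print(unlabeled_profiles)     # 81
print(unlabeled_choice_sets)  # 3240
```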
CRC is the most frequently occurring malignancy within the European Union and the second leading cause of cancer-related death in the Western world [8,9]. Various countries have implemented a national CRC screening program to detect CRC at an early stage or are investigating the prerequisites for implementation [10,11]. There are several screening tests eligible for use in a population-based screening program, such as fecal occult blood tests (FOBTs), sigmoidoscopy, or colonoscopy. This study aimed to investigate individual preferences for CRC screening using a DCE.
The questionnaire design phase involved extensive background research, expert opinions, and interviews with screened individuals. Experts (n = 3) were asked to comment on a list of test characteristics derived from our extensive literature review. Potential screenees (n = 40), both participants in a CRC screening program (n = 20) and screening naive individuals (n = 20), could also comment on the list of test characteristics and rank them in order of importance. Based on these data, we selected the most important test characteristics. The levels for each test characteristic incorporated the range of possible test outcomes based on the current literature (for more detail on how the qualitative data were used to select the final test labels, attributes, and levels, see work of Hol et al. [L. Hol, E.W. de Bekker-Grob, L. van Dam, et al., unpubl. ms] and van Dam et al.). Table 1 lists the labels, attributes, and attribute levels chosen. The labeled CRC screening tests (“FOBT,” “sigmoidoscopy,” and “colonoscopy”) may evoke individual feelings that may not be captured by the unlabeled CRC screening tests (“CRC screening test A” and “CRC screening test B”). Notably, the invasiveness of the alternative test was (indirectly) described by the levels of five attributes: “side effects of the test,” “complication risk of the test,” “preparation for the patient,” “location of screening,” and “duration of screening.” Directly stating “how a sample is taken” is, in our case, equivalent to the screening test label: “taking a sample from your motion” is equal to the FOBT, and “a tube into your back passage throughout your colon” is equal to colonoscopy.
If the unlabeled DCE had directly included this information about “how the sample is taken” (thus, actually naming the test), then the unlabeled DCE would in effect have been a labeled DCE as well; the attribute “how the sample is taken” would interact with all other attributes, and a restricted design would be needed to avoid implausible combinations of attribute levels (i.e., the attribute levels would be alternative specific and, thus, a labeled DCE). Another point to note is that, for some attributes, the unlabeled experiment had a smaller level range than the feasible options in the labeled experiment. As a result, we avoided some extreme combinations, such as 30 screening tests per 10 years yielding a reduction in mortality from only 3.0% to 2.7%, which added to utility balance in the unlabeled DCE.
Table 1. Attributes and levels for the unlabeled and labeled discrete choice experiments (alternatives A, B, and C)

Reduction in mortality
  Unlabeled model: Options A and B: from 3.0% to 0.3%, 1.2%, 1.8%, 2.7%. Option C (no test): from 3.0% to 3.0%.
  Labeled model: FOBT: from 3.0% to 1.8%, 2.3%, 2.7%. Sigmoidoscopy: from 3.0% to 0.9%, 1.5%, 1.8%. Colonoscopy: from 3.0% to 0.1%, 0.5%, 0.8%. No test (Option C; base): from 3.0% to 3.0%.

Frequency of screening per 10 years
  Unlabeled model: Options A and B: 1, 2, 5, 10. Option C: 0.
  Labeled model: FOBT: 3, 10, 30. Sigmoidoscopy: 1, 2, 10. Colonoscopy: 1, 2, 5. No test: 0.

Complication risk of screening
  Unlabeled model: Options A and B: none, small. Option C: none.
  Labeled model: No test: none.

Location of screening
  Unlabeled model: Options A and B: at home, hospital. Option C: none.
  Labeled model: FOBT: at home. No test: none.

Duration of screening
  Unlabeled model: Options A and B: 10, 30, 60, 90 min. Option C: 0 min.
  Labeled model: FOBT: 30 min. Sigmoidoscopy: 15 min. Colonoscopy: 105 min. No test: 0 min.

Preparation for patient
  Unlabeled model: Options A and B: none; enema, no fasting; drinking 0.75l + fasting; drinking 4l + fasting. Option C: none.
  Labeled model: Sigmoidoscopy: enema, no fasting. Colonoscopy: drinking 4l + fasting. No test: none.

Side effects of screening
  Unlabeled model: Options A and B: none, mild pain. Option C: none.
  Labeled model: Sigmoidoscopy: mild pain. Colonoscopy: mild pain. No test: none.

FOBT, fecal occult blood test.
The combination of the attributes and attribute levels of the unlabeled design resulted in 2048 CRC screening test alternatives (4^4 × 2^3). A fractional factorial design, based on a Web site containing a library of more than 200 orthogonal arrays, was used to reduce the number of alternatives to a manageable level of 16 alternatives in which orthogonality and level balance were fulfilled. These 16 alternatives were paired with another orthogonal array by using the fold-over technique (i.e., cyclic design), which caused minimal overlap between attribute levels. Each choice set (i.e., a set of available alternatives) contained two screening test alternatives and an opt-out (see Table 2 for an example). The unlabeled design had an efficiency of 95% compared with an optimal choice set design, and all main effects were uncorrelated, according to the results of an analysis using the software of Street and Burgess.
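The fold-over (cyclic) pairing can be sketched as follows. This is our illustration of the general technique, not the study's actual design code, and the example levels are invented:

```python
def fold_over(alternative, n_levels):
    """Shift every attribute level cyclically by one step (levels coded 0..L-1)."""
    return [(level + 1) % levels for level, levels in zip(alternative, n_levels)]

# Illustrative unlabeled design: four 4-level and three 2-level attributes
n_levels = [4, 4, 4, 4, 2, 2, 2]
first = [0, 2, 3, 1, 0, 1, 0]            # an alternative from the orthogonal array
second = fold_over(first, n_levels)      # its fold-over partner in the choice set

print(second)  # [1, 3, 0, 2, 1, 0, 1]

# The cyclic shift guarantees minimal overlap: the paired alternatives never
# share an attribute level, so respondents must trade off every attribute.
assert all(a != b for a, b in zip(first, second))
```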
Table 2. An example of an unlabeled choice set
Alternatives: Screening test A (A), Screening test B (B), No screening test (C).
Attribute rows included: preparation (e.g., enema, no fasting); mortality risk of colorectal cancer decrease (e.g., from 3% to 2.7%; from 3% to 1.8%); frequency of screening test in the next 10 years; and time duration (min).
Attribute levels in the labeled DCE were alternative specific. In other words, different CRC screening test labels (FOBT, sigmoidoscopy, colonoscopy) were associated with different sets of outcomes. As a result, implausible combinations of attribute levels and labels were minimized. Furthermore, the implausible combinations of the longest screening interval with the highest risk reduction, and of the shortest screening interval with the lowest risk reduction, were blocked. Optimal designs for labeled DCEs, which require a design with two-way interactions, are not available for the general case. Fortunately, SAS software (Version 9.1, SAS Institute Inc., Cary, NC) is capable of generating highly efficient designs in such circumstances. Hence, for the labeled DCE, a D-efficient design was generated with SAS, which resulted in 84 choice sets divided over 7 versions of the questionnaire (D-error 0.573). Each choice set contained two CRC screening test alternatives and an opt-out (see Table 3 for an example).
Table 3. An example of a labeled choice set
Alternatives: two labeled CRC screening tests and No screening test (C).
Attribute rows included: preparation (e.g., enema, no fasting); mortality risk of colorectal cancer decrease (e.g., from 3% to 0.9%; from 3% to 2.3%); frequency of screening test in the next 10 years; and time duration (min).
Both the unlabeled and the labeled DCE contained a dominant choice set (i.e., a choice set in which one screening test alternative is logically preferable) to assess respondents' understanding of the questionnaire (i.e., a rationality test). Testing for internal validity should not automatically lead to deleting responses based on “irrational” preferences, although this may be “common” practice (e.g., [16–20]). Deleting “irrational” responses may remove valid preferences, induce sample selection bias, and reduce the statistical efficiency and power of the estimated choice models. Therefore, further sensitivity analyses were conducted to quantify the effect of including and excluding “irrational” responses.
All respondents received the same information prior to the questionnaire: an information brochure explaining the different current CRC screening tests (FOBT, sigmoidoscopy, and colonoscopy; i.e., how the sample could be obtained) and their characteristics (advantages and disadvantages). Both DCEs were pilot-tested to make sure that respondents could manage the length of the questionnaires and to check for any problems with interpretation and face validity. None of the respondents raised any problems with understanding the questionnaires, so the pilot test did not result in any changes to the questionnaires.
Study Sample and Elicitation Mode
The questionnaires were sent by mail to subjects who had recently participated in a regional call–recall CRC screening program (unlabeled n = 212; labeled n = 769) and to randomly selected screening naive subjects from the same region (Groot-Rijnmond) (unlabeled n = 500; labeled n = 1498). It was not possible to directly calculate the statistical power needed to inform the sample size for a choice experiment. Other studies showed that a sample size of 42 to 208 respondents was sufficient to answer 16 unlabeled choice sets [22–24]. A larger number of labeled than unlabeled DCEs were distributed because the design of the labeled DCE, with its alternative-specific parameters, meant that more coefficients had to be estimated; a larger sample size allows these parameters to be estimated more precisely. All respondents were between 50 and 74 years of age. Besides the choice sets, the questionnaires also included background variables such as age, sex, endoscopy (i.e., sigmoidoscopy or colonoscopy) experience, familiarity with CRC because of cases among family or friends, and standardized questions (EQ-5D) to measure self-reported health state. A reminder was sent to nonresponders 4 weeks later.
Chi-square and Student t-tests were used to assess the differences between the characteristics of respondents of the unlabeled and labeled DCEs (for participants in CRC screening and for screening naive respondents separately).
To assess feasibility, we determined the response rate, rationality test outcome, missing values, and the self-rated ease of the task. We used chi-square tests to compare differences in these aspects of feasibility.
Both DCEs were analyzed by using multinomial logit regression models, in which the unlabeled DCE had generic parameters and the labeled DCE had alternative-specific parameters. These models were implemented in SAS software (Version 9.1). A priori, we expected all attributes to be important and all attributes except “mortality reduction” to have a negative effect on utility.
To assess the degree of trading behavior, we tested for dominant preferences, that is, whether respondents based their responses entirely on one specific attribute or on one label (one specific CRC screening test). Chi-square tests were used to assess differences between the two DCEs for participants in CRC screening and for screening naive respondents separately.
Finally, relative utility values for different screening test profiles were determined based on the weights that respondents placed on the attributes of the alternatives. The total utility value of a screening test profile was equal to the sum of the coefficient weights of its attribute levels [25–28]. The agreement between the labeled and unlabeled DCE outcomes depends strongly on the scale of both DCEs. In DCEs, the scale is not identified, so anything that depends on the scale cannot be interpreted reliably; only measures based on correlation are really informative. Therefore, convergent validity between the two variants was assessed by determining the degree of agreement by means of Pearson correlations (r). Note that perfect agreement exists only if the relative utility outcomes of the unlabeled and labeled DCEs lie along the line of equality, whereas perfect correlation (i.e., strength of the relation between the two approaches) exists if the relative utility outcomes lie along any straight line.
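As a sketch of this procedure (our illustration; all coefficients, levels, and profiles below are invented, not the study's estimates): the total utility of a profile is the sum of its level coefficients, and convergent validity is the Pearson correlation between the utilities the two variants assign to the same profiles.

```python
import math

def total_utility(coefficients, profile):
    """Sum the coefficient of each attribute level present in the profile."""
    return sum(coefficients[attr][level] for attr, level in profile.items())

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented coefficient estimates for two DCE variants
coeffs_unlabeled = {"mortality": {"2.7%": 0.2, "1.8%": 0.8},
                    "frequency": {"1": 0.1, "10": 0.5}}
coeffs_labeled   = {"mortality": {"2.7%": 0.1, "1.8%": 0.6},
                    "frequency": {"1": 0.3, "10": 0.2}}
profiles = [{"mortality": "2.7%", "frequency": "1"},
            {"mortality": "2.7%", "frequency": "10"},
            {"mortality": "1.8%", "frequency": "1"},
            {"mortality": "1.8%", "frequency": "10"}]
u = [total_utility(coeffs_unlabeled, p) for p in profiles]
v = [total_utility(coeffs_labeled, p) for p in profiles]
print(round(pearson_r(u, v), 2))  # 0.71
```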
The total response rate was 276 (39%) out of 712 and 1033 (46%) out of 2267 for the unlabeled and labeled DCEs, respectively (P < 0.001). In total, 4 (1%) out of 276 respondents and 30 (3%) out of 1033 respondents, who left three or more DCE questions unanswered, were excluded from further analyses. Of the respondents to the unlabeled DCE, 44% came from the CRC screening group and 56% from the screening naive group; for the labeled DCE, these figures were 53% and 47%. To correct for this imbalance, all further analyses considered the CRC screening group and the screening naive group separately. Respondents to the unlabeled and labeled variants did not differ with respect to mean age, sex, or endoscopy experience (P > 0.13) (Table 4).
The response rate was higher for the labeled DCE than for the unlabeled DCE (Table 5). The labeled DCE led to a higher response rate especially in the CRC screening group (71% vs. 57%, P < 0.001; 33% vs. 31% for the screening naive group, P = 0.51). An equal proportion of respondents in the CRC screening group passed the rationality test irrespective of DCE approach (91% vs. 91%; P = 0.96). Nevertheless, in the screening naive group, more respondents failed the rationality test with the labeled design (18% vs. 5% for the unlabeled design; P < 0.001). The proportion of missing values was 1% for both DCEs irrespective of the response group. Most respondents indicated that they had no difficulties in completing the DCE task, and the groups did not perceive the task differently (P = 0.28 and P = 0.61 for the CRC screening group and the screening naive group, respectively) (Table 5).
Table 5. Differences in several aspects of feasibility
The effects (i.e., positive or negative direction) of the coefficients of both DCEs were consistent with a priori expectations (and therefore showed theoretical validity), except for the attribute “frequency of screening” in the unlabeled approach (details in the Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_deBekkerGrob.asp). The positive coefficient of this attribute in the unlabeled DCE suggests that respondents preferred a higher over a lower frequency of screening per 10 years.
Regarding the unlabeled DCE, all attributes except “location of screening” proved to be important for both groups' preferences for CRC screening tests (see Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_deBekkerGrob.asp). The positive constant term suggests that respondents from the CRC screening group preferred “CRC screening test” over “no CRC screening test” when all other attributes were set to zero.
Regarding the labeled DCE, all attributes proved to be important for both groups' preferences for CRC screening tests (see Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_deBekkerGrob.asp; note that five out of seven attributes (i.e., location of screening, preparation for the patient, side effects of screening, complication risk, and screening duration) had one alternative-specific level, so their coefficients were absorbed into the coefficient of the alternative label). The positive and significant alternative-specific constants suggest that the CRC screening group had a positive attitude toward “CRC screening test” over “no CRC screening test,” irrespective of the screening test used (i.e., FOBT, sigmoidoscopy, or colonoscopy). The same phenomenon was seen in the screening naive group, although the alternative-specific constant of the FOBT did not significantly differ from the base level “no CRC screening test” (P = 0.16).
The outcomes of the sensitivity analyses, which excluded the respondents who failed the rationality test, were quite similar to those obtained when these responses were retained (data not shown). To avoid removing valid preferences, inducing sample selection bias, and unnecessarily reducing the statistical efficiency and power of the estimated choice models, we included the responses of respondents who failed the rationality test in all further analyses.
Respondents' Trading Behavior
The labeled DCE led to more dominant preferences (i.e., responses based entirely on one specific attribute or label) (Table 6). This difference was significant for both the CRC screening group (41% vs. 21% for the unlabeled DCE; P < 0.001) and the screening naive group (39% vs. 24% for the unlabeled DCE; P = 0.001). The difference was caused by the test labels: 24% and 21% of the CRC screening and screening naive respondents, respectively, had dominant preferences for screening test labels. Table 6 also shows that the attributes of the two DCEs did not account for the difference in the proportion of dominant preferences (0.07 < P < 0.77).
Table 6. Differences in respondents' trading behavior
Difference between respondents to the unlabeled and labeled DCEs.
CRC, colorectal cancer; DCE, discrete choice experiment; n.a., not applicable.
Based on the coefficients of the multinomial logit regression models of the unlabeled and labeled DCEs (see Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_deBekkerGrob.asp), Figure 1 plots the difference in relative utility values for different realistic CRC screening programs for CRC screening naive and CRC screening respondents, respectively (see Table 7 for more details about the total relative utility scores). By using Pearson correlations, the convergent validity between the unlabeled and labeled DCEs was found to be low for screening naive respondents (r = 0.17; P = 0.45) but modest for respondents with screening experience (r = 0.54; P = 0.01). A regression of the labeled DCE outcomes (dependent variable) on the unlabeled DCE outcomes (independent variable) showed both a scaling and a shift phenomenon. The intercept was 0.99 (P < 0.01) and 0.90 (P < 0.01), and the scaling factor was 0.19 (P = 0.45) and 0.51 (P = 0.01) for screening naive respondents and respondents with screening experience, respectively; that is, in the labeled DCE, respondents reacted about one-fifth or half as strongly to the attributes. Leaving the attribute “frequency of screening” out of consideration (i.e., ignoring its relative utility values), we found that the strength of the relation between the two approaches was reasonably good for screening naive respondents (r = 0.71; P = 0.03; and r = 0.53; P = 0.07 for low and high frequency levels, respectively) and very good for respondents with screening experience (r = 0.93; P < 0.001; and r = 0.95; P < 0.001 for low and high frequency levels, respectively).
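The shift-and-scaling comparison corresponds to an ordinary least-squares regression of the labeled-DCE utilities on the unlabeled-DCE utilities; a sketch of ours with invented numbers, not the study's data:

```python
def ols(x, y):
    """Return (intercept, slope) of an ordinary least-squares fit of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

x = [0.3, 0.7, 0.9, 1.3]        # relative utilities, unlabeled DCE (invented)
y = [1.05, 1.25, 1.35, 1.55]    # relative utilities, labeled DCE (invented)
intercept, slope = ols(x, y)
print(round(intercept, 2), round(slope, 2))  # 0.9 0.5

# slope < 1: respondents react less strongly to the attributes in the labeled
# variant; intercept > 0: a general upward shift toward the labeled tests
```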
Table 7. Relative utility scores for realistic CRC screening programs
Realistic CRC screening program (test invasiveness* / mortality risk decreases from 3% to . . . % / frequency per 10 years)
Relative utility score
CRC screening respondents
CRC screening naive respondents
* Each type of test (sigmoidoscopy, colonoscopy, FOBT) had one fixed level for the following five attributes: complication risk, location of screening, screening duration, preparation for the patient, and side effects of screening; see Table 1 for more detailed information.
This study shows that it is feasible to use realistic alternatives in labeled DCEs in a health-care context. The labeled DCE led to a higher response rate, especially in the CRC screening group, which was familiar with the context. Nevertheless, more respondents who were not familiar with the context failed the rationality test with the labeled design. The inclusion of labels appeared to play a significant role in individual choices and increased nontrading behavior. The convergent validity between the two DCE variants was low but better for respondents with CRC screening experience.
In health economics, there are no previous publications directly comparing labeled and unlabeled DCEs empirically. Nevertheless, a DCE in ecological economics considered the effects of employing a labeled rather than an unlabeled DCE. That study showed that the inclusion of alternative-specific labels reduced the attention that respondents gave to the attributes (i.e., increased nontrading behavior). This is in line with our study, in which 24% and 21% of the CRC screening experienced and screening naive respondents, respectively, focused only on the screening test labels. The ecological economics study also demonstrated convergent validity between a labeled and an unlabeled DCE, in contrast to our study.
In line with the focus of this article, the results of the unlabeled and labeled DCEs are described only briefly (for further information about the practical outcomes of these DCEs for CRC screening practice, see work of van Dam et al. and Hol et al. [unpubl. ms.]). The respondents in our labeled experiment actually received more, and partly different, information than those in the unlabeled experiment, particularly if they had experience with one of the options. This might explain the differences in our outcomes between the screening naive and CRC screening respondents. Note that, if the reader wants to compare the beta coefficients of CRC screening respondents and screening naive respondents directly (see Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_deBekkerGrob.asp), scale effects might be an issue (for more details, see the work of Swait and Louviere).
The positive direction (effect) of the attribute “frequency of screening per 10 years” in the unlabeled approach (see Appendix at: http://www.ispor.org/Publications/value/ViHsupplementary/ViH13i2_deBekkerGrob.asp) seems to be inconsistent with utility theory. Nevertheless, these “irrational” responses may be explained by respondents making additional assumptions or bringing additional information to the choice [32,33]. Ryan et al., for example, provided evidence that respondents assumed tests with higher costs would be of higher quality. Respondents in our study might have associated a higher frequency of screening with a more effective test. The differences in preferences for screening frequency between the two DCE approaches demonstrate the importance of continuing research into the biases present across these elicitation methods. Mixed methods may be useful to gain more insight into the internal validity of DCEs. Qualitative techniques, such as the think-aloud technique, may show that seemingly “irrational” choice behavior may not be so irrational after all.
The predominant use of unlabeled experiments in health care may result from the perception that labeled experiments are difficult to construct. The design of a labeled DCE does generally require a larger sample size, because interactions between the alternative label and the attributes are usually assumed. Indeed, this may not be feasible in a health-care setting (e.g., when the target group of patients or medical specialists is too small). Nevertheless, this holds not only for labeled DCEs but also for unlabeled DCEs in which all (two-way) interactions between attributes are taken into account. Such unlabeled DCEs may even be much larger than a labeled DCE because, in a labeled DCE, many characteristics can be compressed into one label, whereas in an unlabeled DCE, all possible interactions have to be taken into account.
Another explanation for the predominant use of unlabeled experiments in health care may be that labeled DCEs are not (yet) necessary there. Although it is not clear why labeled DCEs in health economics are rarely used, it should be clear that the design must fit the research objectives and not the other way around. If the alternative labels are expected to carry important differences, then it may be preferable to use a labeled DCE design; underestimating the role of the alternative labels may lead to poor or even wrong predictions of which alternatives people actually prefer. On the other hand, if the objective is to estimate attribute values, it may be desirable to use an unlabeled DCE to reduce nontrading behavior caused by alternative labels.
This study had some limitations. First, we conducted the two DCEs in two samples. It might have been preferable (from a theoretical point of view) to conduct both DCEs in the same group of respondents (i.e., all respondents fill in one DCE and then the other, in random order). Nevertheless, that was not possible because of respondent burden. As a result, we cannot directly compare the absolute values of the utility levels for the attributes and tests. Second, the designs of the two DCEs were not exactly the same. The combination of D-efficiency criteria and the use of alternative-specific and generic attribute levels in the labeled and unlabeled DCE, respectively, resulted in different choice sets being presented to the respondents. We have no reason to believe that this influenced the results to a large extent. Third, testing the convergent validity between the unlabeled and labeled DCE was based on a comparison of the total utility of alternatives. The labeled DCE had five attributes with one alternative-specific level; therefore, a direct comparison of the coefficients of the attributes (taking the scale factor into account) was not possible. Fourth, two attribute levels of the alternative-specific attribute “frequency” of the FOBT (3 and 30 times screening per 10 years) were not presented in the unlabeled DCE. Therefore, we could include only three total utility scores of (hypothetical) CRC screening programs with the FOBT in our convergent validity test between the two DCE variants.
This study provides important insights into the feasibility of labeled and unlabeled DCEs and the differences in their results. The inclusion of labels appeared to play a significant role in individual choices but reduced the attention respondents gave to the attributes. There was low convergent validity between the two DCE variants, largely because of different preferences for screening frequency. The choice for a labeled or unlabeled DCE may depend on the type of respondents and the research question. Unlabeled DCEs may be more suitable for investigating trade-offs between attributes and for respondents who are not familiar with the alternative labels, whereas labeled DCEs may be more suitable for explaining real-life choices such as uptake of cancer screening.
Source of financial support: Grant support was from the Dutch Cancer Society (KWF; EMCR 2006-3673, and EMCR 2008-4117).