QALYs without bias? Nonparametric correction of time trade‐off and standard gamble weights based on prospect theory

Abstract Common health state valuation methodologies, such as standard gamble (SG) and time trade‐off (TTO), typically produce different weights for identical health states. We attempt to alleviate these differences by correcting the confounding influences modeled in prospect theory: loss aversion and probability weighting. Furthermore, we correct for nonlinear utility of life duration. In contrast to earlier attempts at correcting TTO and SG weights, we measure and correct all these tenets simultaneously, using newly developed nonparametric methodology. These corrections were applied to three less‐than‐perfect health states, measured with TTO and SG. We found considerable loss aversion and probability weighting for both gains and losses in life years, and we observe concave utility for gains and convex utility for losses in life years. After correction, the initially significant differences in weights between TTO and SG disappeared for all health states. Our findings suggest new opportunities to account for bias in health state valuations but also the need for further validation of resulting weights.

Bleichrodt (2002) proposed that these discrepancies in elicited QALY weights may result from empirically invalid assumptions in the theoretical frameworks underlying TTO and SG. More specifically, Bleichrodt argued that TTO and SG weights are biased because they are obtained under the assumptions of expected utility (EU) theory, which has been shown to be descriptively invalid for health outcomes (Bleichrodt, Abellan-Perpiñan, Pinto-Prades, & Mendez-Martinez, 2007; Treadwell & Lenert, 1999). Additionally, although discounted QALY models exist (for an overview, see Hansen & Østerdal, 2006), TTO and/or SG weights are commonly derived under the linear QALY model, which assumes linear utility of life duration (and no discounting of future life years). However, many authors have found diminishing marginal utility of life years; that is, life years that occur in the distant future tend to receive less weight than do life years in the nearer future (Abellan-Perpinan, Pinto-Prades, Mendez-Martinez, & Badia-Llach, 2006; Bleichrodt & Pinto, 2005; Wakker & Deneffe, 1996). In order to obtain QALYs without bias, a methodological shift may be required in HSV towards the use of descriptive utility models such as prospect theory (PT).
PT is characterized by four tenets (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992). These are (a) reference dependence: utility derived from a good is defined over differences from a reference point (RP), instead of over the overall consumption of that good; (b) loss aversion: the utility function has an inflection point at the RP and is steeper for losses than for gains; (c) diminishing sensitivity: utility is concave for gains and convex for losses, which indicates diminishing sensitivity to outcomes further from the RP; and (d) probability weighting: the decision maker overweights small probabilities and underweights large probabilities (Kahneman & Tversky, 1979; Tversky & Kahneman, 1992). PT is usually applied to decisions about money but has also been extended to health outcomes (Bleichrodt & Pinto, 2000; Miyamoto & Eraker, 1989). Importantly, as Bleichrodt (2002) proposed, the tenets modeled in PT will likely affect the TTO and SG methods differently, with loss aversion exerting an upward bias on both methods, utility curvature affecting only TTO, and probability weighting affecting only SG.
Given the increased importance of CUA in informing health policy (Drummond, Sculpher, Claxton, Stoddart, & Torrance, 2015), it is imperative to validly determine the weights that are ascribed to the relevant health states. The valuation of these health states, for example, when obtaining tariffs for the commonly used EuroQol (EQ-5D) generic utility classification system (Versteegh et al., 2016), would necessarily occur within a descriptive context (Bleichrodt, Pinto, & Wakker, 2001). This means that the status quo of applying EU and/or the linear QALY model to derive TTO and SG weights (a) will not capture actual preferences, as these may include, for example, loss aversion, and (b) may lead to different TTO and SG weights according to Bleichrodt (2002). 1 As such, our main motivation is to address the discrepancy between TTO and SG weights by obtaining these QALY weights using derivations based on a descriptively valid but nonnormative theory (PT). We will refer to this process, where TTO and SG weights are obtained while incorporating loss aversion, nonlinear utility, and/or probability weighting into their derivation, as correction for PT. If correcting TTO and SG for PT is feasible, it could be used to correct observed responses in HSVs, allowing corrected weights to be used when calculating QALYs to express health benefits in CUAs, as commonly done.
Some studies have attempted to test Bleichrodt's (2002) predictions about PT and correct HSV techniques by assuming PT or adjusting for utility curvature (Attema & Brouwer, 2009;Martin, Glasziou, Simes, & Lumley, 2000;Oliver, 2003;van Osch, Wakker, van den Hout, & Stiggelbout, 2004;Wakker & Stiggelbout, 1995). Yet to date, no study has been able to simultaneously correct both TTO and SG for loss aversion, utility curvature, and probability weighting (see Appendix S1 for an overview of earlier studies on corrections). In this study, we adapted a recently proposed methodology (Abdellaoui, Bleichrodt, L'Haridon, & Van Dolder, 2016) to measure these three deviations without parametric assumptions and elicit TTO and SG weights without assuming EU or the linear QALY model. In other words, we provide the first empirical test of predictions by Bleichrodt (2002) and show how correcting for PT alleviates the discrepancies between TTO and SG.
Our study features several methodological improvements compared with previous attempts at correcting TTO and/or SG weights for PT (see Appendix S1). First, our adaptation of the nonparametric method (Abdellaoui et al., 2016) enables us to determine utility curvature, loss aversion, and probability weighting separately for each individual, without assuming a specific parameter or parametric form for these functions (as opposed to work by van Osch et al., 2004; Martin et al., 2000; van der Pol & Roux, 2005). We believe this is relevant, as large heterogeneity typically exists for PT elicitations (Pinto-Prades & Abellan-Perpiñan, 2012), warranting an individual measurement approach. Furthermore, applying specific parametric forms within experimental elicitation can confound results (Abdellaoui, 2000), thus allowing considerable bias to remain after correction (Wakker, 2008; Wakker, 2010). Second, we attempt to limit the heterogeneity surrounding RPs by providing all subjects with the same RP, which is a hypothetical expected life duration (following the successful procedure described in Attema, Brouwer, & L'Haridon, 2013). This is important, because even though reference dependence appears to be the most central tenet of PT, earlier work on the location of the RP suggests that individuals use multiple different health outcomes as RP (Bleichrodt et al., 2001; van Osch et al., 2004; van Osch & Stiggelbout, 2008; van Osch, van den Hout, & Stiggelbout, 2006).

1 These statements hold regardless of whether one believes EU to be the normative standard (as Tversky, 1979, and Wakker, 2010, do), which would, for example, classify loss aversion as "irrational" or a bias. We will make no such claims and will refer to deviations from EU and the linear QALY model as generating bias in TTO and SG.

| THEORETICAL FRAMEWORK
We describe health outcomes as $(\beta, t)$, where $\beta$ represents health status and $t$ indicates the age at which the health profile ends (e.g., living with chronic back pain until 70). Throughout, subscripts (e.g., $x$ and $y$) are used to refer to possible health profiles faced by a single agent, with age of onset (e.g., current age) denoted by $t_a$. We will often suppress $t_a$ by denoting $(\beta_x, t_x)$ as $(\beta_x, T_x)$, with duration defined by $T_x = t_x - t_a \geq 0$. We refer to $(\beta_x, T_x)$ as chronic health profiles. We let $(\beta_x, T_x)_p(\beta_y, T_y)$ denote the risky prospect that provides health profile $(\beta_x, T_x)$ with probability $p$ and health profile $(\beta_y, T_y)$ with probability $1 - p$. Preferences are denoted using the conventional notations $\succ$, $\succcurlyeq$, and $\sim$ to represent strict preference, weak preference, and indifference, respectively. Also, we assume weak-ordered preferences; that is, they are complete, meaning that decision makers have preferences over all risky prospects, and transitive (if $x \succcurlyeq y$ and $y \succcurlyeq z$, then $x \succcurlyeq z$). Health profiles $(\beta_x, T_x)$ starting and ending at $t_a$ (so that $t_a = t_x$) will thus have $T_x = 0$ (i.e., they equal immediate death), and, for brevity, we will denote such profiles of the form $(\beta_x, 0)$ as $D$, for any $\beta_x$. As in Miyamoto, Wakker, Bleichrodt, and Peters (1998), we assume indifference between all profiles denoted $D$ for any $\beta$. Finally, we assume monotonicity for duration, that is, $(\beta_x, T_x) \succ (\beta_x, T_y)$ for $T_x > T_y$ and any $\beta_x$. The general QALY model assumes that preferences for health profiles $(\beta_x, T_x)$ are represented by the general utility function $V(\beta_x, T_x) = U(\beta_x) \cdot L(T_x)$. In this model, $L(T)$ and $U(\beta)$ denote utility functions over life years and health status, respectively. This QALY model, and the preference foundations underlying it, typically relies on EU to some extent (for axiomatizations, see Miyamoto & Eraker, 1988; Miyamoto & Eraker, 1989).
To derive corrected TTO and SG weights, we will extend this model to incorporate insights from PT under risk. That is, we assume that preferences can be represented by the general QALY model, including the extensions we outline below.
Several preliminaries are required before defining our full model (Equations (1) and (2)). We assume that preferences for health profiles are defined relative to an RP, which we denote as $(\beta_r, T_r)$. Following Wakker (2010), we define this RP as a point of comparison, which may differ during different parts of the analysis. Given that no plausible theory of RP selection is available (Wakker, 2010), we let the RP depend on framing of the decision context. Hence, $(\beta_r, T_r)$ refers to an expected health profile described in a decision task, which is taken as the neutral point. This health profile has health status $\beta_r$, endured for $T_r$ years. Throughout, for brevity, we denote the duration of all other health profiles as deviations from the RP; that is, we denote health profiles $(\beta_x, T_x)$ as $(\beta_x, T^*_x)$, with $T^*_x = T_x - T_r$. We will restrict our model to health profiles $(\beta_x, T^*_x) \succcurlyeq D$ with $\beta_x \succcurlyeq \beta_r$ for any $T^*_x$. In other words, we assume our model holds for a restricted outcome domain including only health profiles weakly preferred to immediate death, where health status remains at $\beta_r$ or is improved.
Within this outcome domain, we model PT by incorporating sign dependence for life duration, that is, by modifying $L(T)$ in the general QALY model to $L^i(T^*)$. In our model, $L^i(T^*)$ is a standard, real-valued ratio scale utility function with $L^+(T_r) = 0$, which may be different for gain outcomes ($(\beta_x, T^*_x)$ with $\beta_x \sim \beta_r$ and $T^*_x \geq 0$) and loss outcomes ($(\beta_x, T^*_x)$ with $\beta_x \sim \beta_r$ and $T^*_x < 0$). We do not modify $U(\beta)$ in our model, which implies that changes in health status will be evaluated as in the conventional general QALY model. We incorporate loss aversion 2 by taking $L^-(T^*) = \lambda L^i(T^*)$ for $T^* < 0$. Here, $\lambda$ denotes a loss aversion index, with $\lambda > 1$ ($\lambda = 1$, $\lambda < 1$) indicating loss aversion (loss neutrality, gain seeking). Furthermore, we incorporate nonlinear weighting of probabilities by means of probability weighting functions $w^i(p)$, $i = +, -$, for gains and losses respectively, that assign a number to each probability $p$, with $w^i(0) = 0$ and $w^i(1) = 1$.
We will apply this model to risky prospects with at most two outcomes, that is, binary prospects. Thus, preferences over risky prospects with both gain and loss outcomes, that is, mixed prospects, are evaluated by Equation (1), whereas preferences over risky prospects $(\beta_x, T^*_x)_p(\beta_y, T^*_y)$ for either gains or losses are evaluated by Equation (2), where $i = +$ $[-]$ when $T^*_x, T^*_y > [<]\ 0$, that is, both outcomes are gains or losses. Whenever $w^i(p) = p$, $\lambda = 1$, and no distinction is made between gains and losses (i.e., no reference dependence), our model reduces to the general QALY model.

2 In our simplified approach, we model PT over life duration by assuming attribute-specific evaluation (as in Bleichrodt et al., 2009). Loss aversion is, thus, defined over life duration, as it is not meaningful on $U(\beta_x)$ when health status is considered a qualitative measure (Bleichrodt and Miyamoto, 2003). This does not affect our analysis, as we only consider improvements in health status.

| SG and TTO correction for PT
TTO weights are obtained by eliciting duration $T_y$, which yields indifference between $(\beta_x, T_x)$ and $(\mathrm{FH}, T_y)$, with $T_x > T_y$ and FH denoting full health. SG weights, on the other hand, are obtained from indifferences between a certain outcome $(\beta_x, T_x)$ and a risky prospect $(\mathrm{FH}, T_x)_p(D)$, where $p$ is normally varied until indifference is obtained. Often, TTO and SG weights (i.e., $U(\beta_x)$) are derived under the assumptions of EU and the linear QALY model, which is a special case of the general QALY model with $L(T) = T$, $U(\mathrm{FH}) = 1$, and $V(D) = 0$. Under these assumptions, indifferences $(\beta_x, T_x) \sim (\mathrm{FH}, T_y)$ and $(\beta_x, T_x) \sim (\mathrm{FH}, T_x)_p(D)$ yield $U(\beta_x) = T_y / T_x$ and $U(\beta_x) = p$, respectively.
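The uncorrected derivation can be illustrated with a minimal Python sketch. It assumes only what is stated above (EU, $L(T) = T$, $U(\mathrm{FH}) = 1$, $V(D) = 0$); the function names and the example values are hypothetical and serve illustration only.

```python
def tto_weight_linear(t_y: float, t_x: float) -> float:
    """Uncorrected TTO weight from (beta_x, T_x) ~ (FH, T_y) with L(T) = T.

    U(beta_x) * T_x = 1 * T_y, hence U(beta_x) = T_y / T_x.
    """
    if t_x <= 0 or not 0 <= t_y <= t_x:
        raise ValueError("require 0 <= T_y <= T_x and T_x > 0")
    return t_y / t_x


def sg_weight_eu(p: float) -> float:
    """Uncorrected SG weight from (beta_x, T_x) ~ (FH, T_x)_p(D) under EU.

    U(beta_x) * T_x = p * T_x + (1 - p) * 0, hence U(beta_x) = p.
    """
    if not 0 <= p <= 1:
        raise ValueError("p must be a probability")
    return p


if __name__ == "__main__":
    # Hypothetical responses: 15 of 20 years in full health, and p = 0.85.
    print(tto_weight_linear(15.0, 20.0))
    print(sg_weight_eu(0.85))
```

Because the SG weight equals the raw indifference probability and the TTO weight equals the raw ratio of durations, any probability weighting, loss aversion, or utility curvature in the responses passes directly into these weights.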
Our correction for PT involves deriving TTO and SG weights by means of our theoretical model based on PT. The application of our theoretical model requires assumptions about the RP used in TTO and SG. Typically, TTO and SG exercises are framed with the impaired health state $(\beta_x, T_x)$ as RP. Furthermore, earlier work on SG 3 has suggested that the outcome that remains constant, that is, the time spent with reduced health status $(\beta_x, T_x)$, usually is taken as RP (Bleichrodt et al., 2001; van Osch et al., 2006). Hence, throughout the paper, we will make the following assumption about the RP for TTO and SG: $(\beta_r, T_r) = (\beta_x, T_x)$.

| Parameter elicitation
In order to correct both TTO and SG weights for PT, that is, to be able to compute the outcome of Equations (3) and (4), one needs to elicit the following: (a) $L^i(T^*)$ with $T^*_x$ as RP to allow estimation of $L^-(T^*_y)$, (b) probability weighting functions $w^i(p)$, $i = +, -$, and (c) a loss aversion coefficient $\lambda$, which reflects overweighting of losses with $T^*_x$ as RP. This means that $t_x$ should be kept constant across TTO and SG and the elicitation of $L^i(T^*)$, to ensure that $\lambda$ refers to the same theoretical construct throughout (i.e., the same kink around the RP, see Section 4.4).
We report the results of an experiment in which we compare TTO and SG weights derived assuming EU and the linear QALY model to QALY weights corrected for PT (i.e., by Equations (3) and (4)). In this experiment, PT parameters were elicited using methodology based on the work by Abdellaoui et al. (2016). To reduce the influence of order effects and test for consistency, multiple counterbalancing procedures were applied between participants and consistency checks were in place (see Appendix S3). The experiment was computerized in Matlab. Subjects were 99 students of the Rotterdam School of Management (58 female) who were rewarded with course credits. Experimental sessions lasted for approximately 55 min and were run on computers in sessions of four subjects sitting adjacently in separate cubicles. An instructor was present at all times to answer questions.

| TTO and SG weight elicitation
We elicited TTO and SG weights for a total of four health states (one practice state) from the EQ-5D-5L (five level) descriptive system (Herdman et al., 2011). These health states reflected an array of mildly aversive health states, in order to avoid health states that could be considered worse than death (Dolan, 1997). The following health states were used: 22222 (practice, β p ), β 1 = 21211, β 2 = 31221, and β 3 = 32341. We applied a bisection choice-based elicitation procedure with four consecutive choices, as choice-based procedures produce more consistent measurements than matching (Noussair, Robin, & Ruffieux, 2004). Subjects were asked to imagine having lived until age 50 in perfect health after which they contracted a disease that would affect their quality of life for their remaining life expectancy of 20 years. TTO and SG were completed for these remaining 20 years (i.e., t a = 50). In both cases, the maximum expected age of death was 70 years; that is, subjects made decisions with regard to the quality of life for age 50 to 70 (followed by death), which ensured that t x was constant for both TTO and SG.
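A bisection choice-based procedure of this kind can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' Matlab implementation: the function name, the starting interval, and the rule of reporting the midpoint of the final interval are all hypothetical.

```python
from typing import Callable


def bisect_indifference(prefers_alternative: Callable[[float], bool],
                        lo: float, hi: float, n_choices: int = 4) -> float:
    """Narrow down an indifference value with a fixed number of binary choices.

    prefers_alternative(v) returns True if the subject chooses the
    alternative option when it offers the value v (e.g., v years in full
    health in TTO, or success probability v in SG).
    """
    for _ in range(n_choices):
        mid = (lo + hi) / 2
        if prefers_alternative(mid):
            hi = mid  # alternative already attractive: indifference lies below
        else:
            lo = mid  # alternative not yet attractive: indifference lies above
    return (lo + hi) / 2


if __name__ == "__main__":
    # Simulated TTO subject who is indifferent at 13 years in full health.
    estimate = bisect_indifference(lambda years: years > 13.0, 0.0, 20.0)
    print(estimate)
```

With four choices on a 20-year interval, the final interval is 1.25 years wide, so the reported midpoint lies within 0.625 years of the true indifference point for a subject who answers without error.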

| Nonparametric method
We adapted Abdellaoui et al.'s (2016) nonparametric methodology to measure PT under risk in the health domain. In order to elicit $L^i(T^*)$ with the same $t_x$ as RP as in TTO and SG, we instructed subjects to take living from current age until 70 in perfect health as RP, that is, $(\beta_r, T_r) = (\mathrm{FH}, 70 - t_a)$. Elicitation consisted of four stages (an elaborate description of the method and instructions can be found in Appendices S1, S4, and S5). The first stage connected utility for gains ($L^+(T^*)$) to the utility for losses ($L^-(T^*)$). The second and third stages employed the trade-off method of Wakker and Deneffe (1996) to measure a standard sequence of utility for gains and utility for losses, respectively. The fourth stage measured probability weighting, separately for gains and losses; that is, $w^+(p)$ and $w^-(p)$. Our methodology thus makes it possible to completely elucidate PT's tenets in the health domain, without imposing parametric assumptions on $L^i(T^*)$ and $w^i(p)$. Each of the four stages had slightly different instructions (see Appendix S5), providing the context for the trade-offs that subjects were required to make. Subjects had to choose between two medicines that could amend their situation but would not affect their health status, which remained constant at perfect health. All indifferences were elicited using a bisection choice-based procedure with a slider (following Abdellaoui et al., 2016) where subjects first performed three binary choices. This procedure zoomed in to the point at which subjects would become indifferent but still allowed subjects to specify the final value and adjust accordingly. To allow estimation of $L^-(T^*_y)$ in Equation (3) regardless of the number of years given up in TTO, subjects' standard sequence continued to at least 20 years above and below $t_x$ (i.e., living until 70), to avoid extrapolation beyond the measured curve. 5
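To illustrate how a measured standard sequence allows $L^-(T^*_y)$ to be read off without a parametric form, the sketch below linearly interpolates between elicited points. It assumes that consecutive elements of a trade-off standard sequence are equally spaced in utility (the defining property of the method of Wakker and Deneffe, 1996) and that utility is normalized to $-1$ at the last elicited loss; the function name and the example sequence are hypothetical.

```python
def interpolate_utility(sequence, t_star):
    """Normalized utility of t_star by linear interpolation.

    sequence: [0, x_1, ..., x_k], durations relative to the RP, all gains
    (increasing) or all losses (decreasing). Element j is assigned utility
    j / k for gains and -j / k for losses.
    """
    k = len(sequence) - 1
    if k < 1 or sequence[0] != 0:
        raise ValueError("sequence must start at 0 and contain >= 2 points")
    sign = -1.0 if sequence[-1] < 0 else 1.0
    xs = [abs(x) for x in sequence]
    if any(b <= a for a, b in zip(xs, xs[1:])):
        raise ValueError("sequence must be strictly monotone")
    t = abs(t_star)
    if t_star * sign < 0 or t > xs[-1]:
        raise ValueError("t_star lies outside the measured range")
    for j in range(k):
        if xs[j] <= t <= xs[j + 1]:
            frac = (t - xs[j]) / (xs[j + 1] - xs[j])
            return sign * (j + frac) / k
    raise AssertionError("unreachable for validated input")


if __name__ == "__main__":
    # Hypothetical loss sequence (years below the RP) for one subject.
    losses = [0, -4, -9, -15, -22]
    print(interpolate_utility(losses, -6.5))
```

The function raises an error rather than extrapolating when the requested duration lies beyond the last elicited point, which mirrors the motivation for continuing the sequence to at least 20 years.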

| Analyses of curvature for L i (T)
We used two methods to investigate the curvature of $L^i(T^*)$, that is, utility curvature: a nonparametric method and a parametric method (similar to Abdellaoui et al., 2016). For these analyses of utility curvature, we normalized all durations by dividing by subjects' highest absolute elicited duration for gains and losses, respectively ($T^*_{kG}$ or $-T^*_{kL}$). This resulted in $T^*$ being in the range $[-1, 1]$. Next, we calculated the area under the curve (AUC) of $L^i(T^*)$ separately for both domains, by setting $L^+(T^*_{kG}) = 1$ and $L^-(T^*_{kL}) = -1$. If utility of life duration is linear, the area under this normalized curve equals one half. Utility for gains in life duration is convex (concave) if the AUC is smaller (larger) than one half, whereas for losses, the opposite direction holds (convex > ½, concave < ½). This method of analyzing utility curvature is nonparametric. We also analyzed $L^i(T^*)$ parametrically by fitting the most commonly used power utility family by nonlinear least squares, using the same normalizations. For this family, $L^+(T^*) = (T^*)^{\alpha}$ and $L^-(T^*) = -(-T^*)^{\alpha}$ with $\alpha > 0$. For gains [losses], $\alpha > 1$ corresponds to convex [concave] utility, $\alpha = 1$ corresponds to linear utility, and $\alpha < 1$ corresponds to concave [convex] utility.

5 After 25 steps, the standard sequence elicitation was terminated to avoid overburdening our subjects. When necessary, $L^-(T^*_y)$ was obtained by extrapolation.
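Both curvature analyses can be sketched in a few lines of Python. The sketch assumes equally spaced utilities along the standard sequence, computes the AUC on absolute (mirrored) values so that one function serves both domains, and replaces nonlinear least squares with a simple grid search over $\alpha$; the function names, the grid bounds, and the example data are hypothetical.

```python
def normalize(sequence):
    """Map a standard sequence [0, x_1, ..., x_k] onto the unit square.

    Durations are divided by the largest absolute duration; utilities are
    assumed equally spaced (j / k), as a trade-off standard sequence implies.
    """
    k = len(sequence) - 1
    top = abs(sequence[-1])
    xs = [abs(x) / top for x in sequence]
    us = [j / k for j in range(k + 1)]
    return xs, us


def area_under_curve(xs, us):
    """Trapezoid-rule area under the piecewise-linear normalized curve."""
    return sum((xs[j + 1] - xs[j]) * (us[j] + us[j + 1]) / 2
               for j in range(len(xs) - 1))


def classify(auc, domain, tol=1e-9):
    """Gains: AUC > 1/2 is concave. Losses: AUC > 1/2 is convex."""
    if abs(auc - 0.5) <= tol:
        return "linear"
    above = auc > 0.5
    if domain == "gains":
        return "concave" if above else "convex"
    return "convex" if above else "concave"


def fit_power_alpha(xs, us, lo=0.05, hi=5.0, step=0.001):
    """Grid-search stand-in for nonlinear least squares on u = x ** alpha."""
    best_alpha, best_sse = lo, float("inf")
    for i in range(int(round((hi - lo) / step)) + 1):
        alpha = lo + i * step
        sse = sum((x ** alpha - u) ** 2 for x, u in zip(xs, us))
        if sse < best_sse:
            best_alpha, best_sse = alpha, sse
    return best_alpha


if __name__ == "__main__":
    # Hypothetical gain sequence generated from a power utility, alpha = 0.8.
    gains = [20 * (j / 5) ** (1 / 0.8) for j in range(6)]
    xs, us = normalize(gains)
    auc = area_under_curve(xs, us)
    print(auc, classify(auc, "gains"), fit_power_alpha(xs, us))
```

A dedicated optimizer (for example, a nonlinear least squares routine from a numerical library) would normally replace the grid search; the grid is used here only to keep the sketch dependency-free.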

| Analyses of loss aversion
Several definitions of loss aversion exist, with $\lambda$ being interpreted in various manners (see Köbberling & Wakker, 2005). Köbberling and Wakker (2005) defined loss aversion ($\lambda$) as the kink of utility at the RP. That is, they define loss aversion as $U'_{\uparrow}(0) / U'_{\downarrow}(0)$, with $U'_{\uparrow}(0)$ representing the left derivative and $U'_{\downarrow}(0)$ the right derivative of $U$ at the RP. Hence, we computed each subject's coefficient of loss aversion ($\lambda$) over the first steps in their standard sequence for gains and losses, denoted as $x^+_1$ and $x^-_1$. Loss aversion is then defined as the ratio $x^+_1 / (-x^-_1)$ (Abdellaoui et al., 2016). A subject was classified as loss averse if $x^+_1 / (-x^-_1) > 1$, loss neutral if $x^+_1 / (-x^-_1) = 1$, and gain seeking if $x^+_1 / (-x^-_1) < 1$ (as in Wakker, 2010).
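The classification rule can be written out as a short Python sketch. The function names and the example values are hypothetical; only the ratio of the first gain step to the absolute first loss step and the three-way classification are taken from the text.

```python
def loss_aversion_index(x1_gain: float, x1_loss: float) -> float:
    """Ratio of the first gain step to the absolute first loss step."""
    if x1_gain <= 0 or x1_loss >= 0:
        raise ValueError("expect x1_gain > 0 and x1_loss < 0")
    return x1_gain / -x1_loss


def classify_loss_attitude(lam: float) -> str:
    """Three-way classification around lambda = 1."""
    if lam > 1:
        return "loss averse"
    if lam < 1:
        return "gain seeking"
    return "loss neutral"


if __name__ == "__main__":
    # Hypothetical subject: first gain step of 4 years, first loss step of -2.
    lam = loss_aversion_index(4.0, -2.0)
    print(lam, classify_loss_attitude(lam))
```

Intuitively, both first steps span the same utility difference, so a subject who needs a larger gain than loss to cover that difference has a utility function that is steeper for losses near the RP.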

| Probability weighting
We used certainty equivalents elicited at varying probabilities to obtain the weighting functions, similar to Attema, Bleichrodt, and L'Haridon (2018). In particular, we used linear interpolation to obtain $w^+(p)$ and $w^-(p)$, using $p = 0.1, 0.3, 0.5, 0.7, 0.9$. Furthermore, we used Tversky and Kahneman's one-parameter inverse S-shaped probability weighting function $w^i(p) = p^{\gamma} / (p^{\gamma} + (1 - p)^{\gamma})^{1/\gamma}$ with $i = +, -$, estimated by nonlinear least squares. The $\gamma$-parameter controls the shape of the probability weighting function. If $\gamma = 1$, there is no probability transformation and $w^i(p) = p$. However, if $\gamma < 1$, decision makers underweight large probabilities and overweight small probabilities. This corresponds to the commonly found inverse S-shaped weighting function. If $\gamma > 1$, the opposite pattern holds, corresponding to an S-shaped weighting function.
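Both ways of obtaining decision weights can be sketched in Python. The one-parameter function follows the formula given above; the interpolation routine assumes the endpoints $w(0) = 0$ and $w(1) = 1$ stated in the theoretical framework. Function names and the elicited weights in the example are hypothetical.

```python
def tk_weight(p: float, gamma: float) -> float:
    """Tversky-Kahneman one-parameter probability weighting function."""
    if not 0 <= p <= 1 or gamma <= 0:
        raise ValueError("require 0 <= p <= 1 and gamma > 0")
    if p in (0.0, 1.0):
        return p
    num = p ** gamma
    return num / (num + (1 - p) ** gamma) ** (1 / gamma)


def interpolate_weight(elicited: dict, p: float) -> float:
    """Piecewise-linear weighting function through elicited (p, w) pairs."""
    knots = [(0.0, 0.0)] + sorted(elicited.items()) + [(1.0, 1.0)]
    for (p0, w0), (p1, w1) in zip(knots, knots[1:]):
        if p0 <= p <= p1:
            return w0 + (w1 - w0) * (p - p0) / (p1 - p0)
    raise ValueError("p must lie in [0, 1]")


if __name__ == "__main__":
    # gamma = 0.84 is the median reported for losses; weights are made up.
    print(tk_weight(0.1, 0.84), tk_weight(0.9, 0.84))
    elicited = {0.1: 0.15, 0.3: 0.30, 0.5: 0.45, 0.7: 0.60, 0.9: 0.80}
    print(interpolate_weight(elicited, 0.4))
```

The interpolated function can take any shape the five elicited points imply, whereas the one-parameter form ties the degree of overweighting of small probabilities to the degree of underweighting of large ones.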

| RESULTS
Two subjects expressed unwillingness to trade off any life years, which prevented the experiment from being completed. These subjects were removed from further analyses. As can be seen in Appendix S3, we included several repetitions to test for consistency. At the aggregate level, we observed significant differences between the consistency indifference value and the value for $x^i_2$ (i.e., the second step) in the standard sequence elicitation for both gains and losses (paired t tests: ps < .01). Furthermore, we found a difference for the consistency checks in the probability sequence for gains (paired t test: p = .007), but not for losses (paired t test: p = .62). Correlations between consistency checks and original values were high, suggesting strong association between these values (Kendall's τs > 0.51, ps < .003).
Twenty-nine subjects violated monotonicity for health states; that is, they valued at least one health state that was better or equal on each dimension lower than its dominated counterpart (e.g., 21211 vs. 31221). As we consider it plausible that all subjects prefer more health to less, we reran the full analyses excluding these subjects and found no differences in the main results. Hence, we report the results for the full sample (n = 97).

| Curvature of L + (T) and L − (T)
We observed median AUC for gains equal to 0.555, and for losses, this nonparametric analysis produced a median AUC of 0.561, which were both significantly different from 0.5 (Wilcoxon signed ranks tests: ps < .001). After parametrically fitting a power function to the data, we found a median α of 0.787 for gains and 0.757 for losses (significantly smaller than 1, Wilcoxon signed ranks tests: ps < .001). Thus, both parametric and nonparametric results demonstrated L + (T * ) to be concave and L − (T * ) to be convex. Table 1 shows the classification of subjects' curvature for gains (L + (T * )) and losses (L − (T * )) at the individual level, both parametrically and nonparametrically. The most common pattern was concave curvature for L + (T * ) and convex curvature for L − (T * ) as was found in an earlier implementation of this method (Attema et al., 2018). This conclusion holds for both nonparametric (53%) and parametric (53%) results.

| Loss aversion
Utilizing Köbberling and Wakker's (2005) definition, we found a median loss aversion index of λ = 2 (interquartile range: 1.00-3.52). Thus, we found considerable loss aversion at the aggregate level, with the median being significantly higher than 1 (Wilcoxon test: p < .001). At the individual level, the majority of subjects demonstrated loss aversion, with 72% (n = 70) classifying as loss averse, and 15% (n = 15) and 13% (n = 12) classifying as loss neutral and gain seeking, respectively.

TABLE 1 Classification of subjects by curvature of $L^+(T^*)$ (rows, gains) and $L^-(T^*)$ (columns, losses)

                   Losses
Gains              Concave   Convex   Linear   Total
Nonparametric
  Concave             19       51        0       70
  Convex               7       17        1       25
  Linear               0        1        1        2
Parametric
  Concave             19       51        0       70
  Convex               6       18        1       25
  Linear               0        1        1        2

| Probability weighting (w i (p))
Figure 1 shows the median decision weights assigned to p = 0.1, 0.3, 0.5, 0.7, 0.9. As can be seen from the plots, we observe inverse S-shaped probability weighting for both gains and losses, with more pronounced overweighting of small probabilities for losses. Using Tversky and Kahneman's one-parameter function, we found a median γ = 0.92 for gains and a median γ = 0.84 for losses (both significantly lower than 1, Wilcoxon tests: ps < .04). Both analyses demonstrated that the typical inverse S-shaped probability transformation was the most prevalent in our data, for both gains and losses. Moving to the individual level, for gains, we found γ < 1 for 56 subjects (58%) and γ > 1 for 41 subjects (42%). For losses, we found more pronounced inverse S-shaped probability weighting, with 71 (73%) and 26 (27%), respectively.

FIGURE 1 Probability weighting functions for gains ($w^+(p)$) and losses ($w^-(p)$)

4.4 | Health state correction
Table 2 shows QALY weights for all health states elicited using TTO and SG, where uncorrected refers to weights elicited assuming EU and linear QALYs, whereas corrected weights are elicited by means of Equations (3) and (4). To test the sensitivity of our results to linear interpolation, we also corrected TTO and SG weights by using power utility to estimate $L^-(T^*_y)$ and the Kahneman and Tversky probability weighting function to estimate $w^+(p)$ and $w^-(1 - p)$; these are indicated by "Parametric Corrections" in Table 2. An initial difference in TTO and SG weights existed (paired t tests, all ps < .001), with SG weights being higher than TTO for all $\beta_x$. Our results show that the corrected weights were lower than the uncorrected weights for TTO and SG (paired t tests: all ps < .01). The initially significant difference between the uncorrected weights disappeared for all β only after applying nonparametric corrections (paired t tests: all ps > .09). The parametric corrections left significant and substantial differences between TTO and SG weights. Finally, we performed four isolated corrections. For the sake of brevity, we only report the results of the nonparametric corrections (see the Supporting Information for results of these analyses for parametric corrections). First, we corrected TTO for utility curvature only, with λ = 1. Second, TTO weights were corrected for loss aversion only, with linear utility (i.e., $L^i(T^*) = T^*$). Third, we corrected SG for probability weighting only, with λ = 1. Finally, SG weights were corrected for loss aversion only, with $w^i(p) = p$. This allows us to demonstrate the influence of each correction in isolation. Table 3 shows that correcting for loss aversion had a stronger downward influence on TTO weights than correcting for curvature of $L^i(T^*)$, and both correcting for probability weighting and correcting for loss aversion had a substantial negative influence on SG weights.

| DISCUSSION
This paper provides the first empirical test of Bleichrodt's (2002) predictions about PT, demonstrating that it may be possible to correct the weights typically used in HSV, that is, to reduce bias in TTO and SG.  We estimated the full set of PT's parameters in the health domain, in order to obtain more descriptively valid outcomes, which can be used in the QALY model. Our results are consistent with PT (Kahneman & Tversky, 1979): We observe concave utility curvature for gains and convex utility curvature for losses, inverse S-shaped probability weighting, and considerable loss aversion. In general, the estimates of utility curvature for gains in life duration and loss aversion (when applicable) of earlier work are similar to ours (e.g., Attema, Brouwer, & L'Haridon, 2013;Bleichrodt & Pinto, 2000;Bleichrodt & Pinto, 2005), but different results are found for the utility function for losses in life duration. These differences might be explained by methodological differences, which is a hypothesis that could be tested in future work. Furthermore, we replicated the typical finding that SG weights are higher than TTO weights. By means of corrections similar to those proposed by Bleichrodt et al. (2001), we attempted to remove the systematic bias in these weights, by simultaneously accounting for loss aversion, probability weighting, and utility curvature. Consequently, as predicted by Bleichrodt (2002), the weights assigned to both TTO and SG were markedly lower than their uncorrected counterparts. Moreover, they were no longer significantly different.
Although successful attempts at correcting SG and/or TTO weights using parametric methodology are reported in earlier work (Martin et al., 2000;van der Pol & Roux, 2005;van Osch et al., 2004), our parametric corrections were not able to fully account for the discrepancies between these methods. This seemed to be driven by SG weights remaining higher when parametric estimations for probability weighting were used. Given that our nonparametric estimations of probability weighting allowed full flexibility of the weighting function (see Abdellaoui, 2000), these findings suggest that parametric estimations of probability weighting may produce different results.
Our results demonstrate that, considered in isolation, loss aversion had a stronger downward influence on TTO weights than utility curvature, whereas both probability weighting and loss aversion lowered SG weights considerably. Although these findings are generally in line with previous studies, we observed a downward effect of correcting TTO for utility curvature. Probably, this is caused by the convexity found for losses in life years and the framing of our TTO and SG exercises (which both featured losses in life years from the RP in a reduced health state). Future work could shed light on the degree to which this discrepancy may be caused by the nonparametric method or the framing used in our work.
Several limitations of our study need noting. First, several subjects violated monotonicity for the health states used. Although excluding these subjects from the sample did not alter our results, we expect that these errors in decision making are to be attributed to either (a) imprecision of preferences or (b) error propagation, that is, early errors cascading into later stages of the task. Considering the use of only relatively mild health states, for which subjects may have no precise preference ordering in mind, some overlap may occur within our method. Regarding error propagation, it is good to note that during utility elicitation, subjects could rectify errors by adjusting the final indifference value on the slider to any nondominant value in life years, that is, fix their earlier "errors." Testing for error propagation, by performing an error simulation as described by Bleichrodt and Pinto (2000), confirmed that errors did not have a propagating effect on the standard sequence we elicited for gains and losses. 6 Second, concerns may be raised about the role of the RP in this paper. We find that the observed discrepancies between TTO and SG can be removed by correcting under the assumption that decision makers utilize the guaranteed outcome (β x , T x ) as RP (which ensures that t x remains constant). However, earlier work on health-related preferences has suggested that individuals may also use their own current health and life expectancy as RP (van Nooten & Brouwer, 2004;van Nooten, Koolman, & Brouwer, 2009). In our work, we found no evidence of such effects. 7 A related limitation concerns our assumption that subjects use the fixed outcome in both TTO and SG as their RP, which is crucial for our results as our corrections depend on a constant T r throughout the multiple parts of the experiment. Earlier work, however, demonstrated that SG subjects may also use the time spent in full health as their RP (van Osch & Stiggelbout, 2008). 
To our knowledge, such work does not exist for TTO methods. Therefore, future work should explore the possibility of correcting under the assumption that subjects use full health as RP, for both TTO and SG.
Finally and perhaps most importantly, the primary goal of the present research was merely to provide the first empirical test of Bleichrodt's (2002) predictions for TTO and SG weights, and our findings should be interpreted in this context. We observed considerable differences from nationally representative findings. For example, the Dutch tariff (Versteegh et al., 2016) for health state β 1 (21211) is 0.876, whereas we elicited a raw TTO weight of 0.665. Our sample, consisting of young, healthy students, will have contributed strongly to this initial discrepancy, next to differences in methodology. We also note that after correction, the discrepancy between tariffs and corrected weights increases. After the nonparametric correction, the QALY value of state β 1 decreases to 0.492. Clearly, this calls for further investigation of the methods used here, also in other (general public) samples, in order to further explore the impact of corrections and further refine the methods used. This future research may also clarify whether our framing may have yielded relatively low weights and how the methods used here can be simplified to be suitable for use in general public samples.

6 The difference between TTO and SG weights was nonsignificant in all simulations (k = 1,000) for β 1 and β 2 , and our results were replicated in the majority of simulations for β 3 (over 70%). These simulations suggest that our correction method is quite robust to error propagation.

7 We tested for associations between subjects' self-reported life expectancy and their estimates for loss aversion, utility curvature, and probability weighting; no such associations were observed for raw and corrected health state weights (all Kendall's τs < 1.52, all ps > .13).

| CONCLUSION
With the increasing importance of economic evaluations in health care, the question of how best to estimate health state valuations has become a crucial one. Conventional methodologies, such as TTO and SG, systematically arrive at different valuations of the same health state. PT may offer an explanation for this phenomenon (Bleichrodt, 2002), which had not previously been tested directly. Using the nonparametric method (Abdellaoui et al., 2016), we demonstrated that it may be possible to significantly reduce these biases in HSVs. After correction for loss aversion, probability weighting, and utility curvature, TTO and SG weights for three health states were no longer significantly different. This is an encouraging finding, but at the same time, the resulting low absolute values highlight the need for future research. Notwithstanding these important limitations, our findings do suggest the feasibility and relevance of this approach and may prove to be a first step in the move towards QALYs without bias.