### Abstract

- Top of page
- Abstract
- 1 Introduction
- 2 Experimental Methods and Procedures
- 3 Replication of Results and Analysis of Aggregate Decisions
- 4 Analysis of Individual Decisions
- 5 Conclusion
- Acknowledgement
- References
- Supporting Information

We compare the consistency of choices in two methods used to elicit risk preferences on an aggregate as well as on an individual level. We ask subjects to choose twice from a list of nine decisions between two lotteries, as introduced by Holt and Laury (2002, 2005), alternating with nine decisions using the budget approach introduced by Andreoni and Harbaugh (2009). We find that, while on an aggregate (subject pool) level the results are consistent, on an individual (within-subject) level, behaviour is far from consistent. Within each method as well as across methods we observe low (simple and rank) correlations.

### 1 Introduction

Measuring and controlling for risk aversion in laboratory economic experiments is commonplace. However, while the concept of measuring risk aversion – for a given utility function – is relatively straightforward theoretically, finding the appropriate test for risk attitudes is still discussed extensively in the literature. In this article we compare results of two methods that can be used to elicit risk preferences – the popular method by Holt and Laury [HL (2002, 2005)] and a newer approach by Andreoni and Harbaugh [AH (2009)]. Both methods are developed to provide data which can be easily interpreted in a constant relative risk aversion (CRRA) framework. Our design, using each method twice, alternating between methods, allows us to check for consistency of aggregate-level as well as individual-level measurements of risk aversion within and between methods. This within-individual benchmark makes our study unique in the literature studying consistency of risk attitudes. We find that while analysis of aggregate data indicates consistency in behaviour, this evidence is weak on the individual level, both within methods and between methods.

Why should we care about consistency? In this study we take the role of the practitioner who uses elicitation methods to control for risk attitudes of participants. Hence, we require usable and reasonably robust value estimates (*ex post*) based on an (*a priori*) theoretical framework for which a method was designed. As such, these estimates should not be too sensitive to small manipulations in task descriptions and pay-offs; and they should consistently identify risk attitudes – degrees of risk averse, neutral or seeking behaviour – on an aggregate as well as on the individual level. We perform this analysis for HL and AH.

A large body of the literature suggests and discusses problems with elicitation methods, in particular in connecting experimental results to the underlying theoretical models. These problems have been acknowledged by adopting empirical strategies, in particular by using utility functions that incorporate stochastic elements affecting individual choices (e.g. Loomes and Sugden, 1995, 1998; Loomes *et al*., 2002), by allowing for effects of interdependence between choices and the choice options presented (Starmer and Sugden, 1993), or by capturing explicitly the idea of heterogeneity of players (Ballinger and Wilcox, 1997). Unfortunately this literature – despite its insightful considerations – does not provide an implementable toolkit for the elicitation of risk attitudes in laboratory experiments.

Harrison and Rutström's (2008) survey addresses the practical issue of risk elicitation, reviewing different risk-elicitation methods and discussing ways used to estimate risk attitudes. This review compares different elicitation methods and discusses specific characteristics of the methods. It focuses on comparing (cross-sectional) group aggregated data and does not compare differences in elicitation methods on an individual or participant level. One reason for this might be that several studies have found that, for individuals as well as for group aggregates (like averages), different elicitation methods yield different measures.1

Isaac and James (2000) compare implied risk attitudes of 34 subjects that resulted from choices made in a first-price auction to measurements based on the Becker–DeGroot–Marschak (BDM) procedure,2 finding that experimental choices in the auction cannot be aligned to risk attitudes based on the BDM procedure. Their results indicate that the two methods only weakly preserve the order in the measures of risk aversion, and that the methods are not just shifters of risk aversion measures within individuals. Ranked correlations (across individuals) are only around 39%.

We add to this literature on within-subject consistency of risk elicitation.3 Several studies have found that risk attitudes are not stable within individuals in experimental settings: Berg *et al*. (2005) found that implied risk attitudes depend on whether individual decisions are measured using auctions for a risky or a riskless asset. Hey *et al*. (2009) compared willingness to pay, willingness to accept, BDM measures and choices over pairwise lotteries, finding inconsistencies and in some cases even negative correlations between results of the different methods within individuals. Anderson and Mellor (2009) compare results of the method developed by HL and survey results on gambles (over job and investment choices), finding that except for a small fraction of *superconsistent* (‘consistently consistent’, p. 152) decision-makers, the methods did not provide consistent within-individual estimates of risk attitudes. Comparing HL results and decisions in a choice setting they refer to as the ‘Deal or No Deal game’ (named after a popular TV show), Deck *et al*. (2008) find that decisions are not consistent and conclude that one elicitation method is treated as an investment (HL), whereas the other is treated as a gambling decision.

When understanding risk attitudes as some (stable) personal characteristic, one can also question if risk attitudes remain constant over time. Andersen *et al*. (2008) and Harrison *et al*. (2005a) investigate temporal stability of risk attitudes using HL. They argue for temporal stability given their results, although subjects do not necessarily make the same choice over a 6-month and 17-month period respectively. The reason for this is that they attribute changes in decisions to order effects (Harrison *et al*. 2005a) and to the fact that there is no shift in the distribution of risk attitudes over time (Andersen *et al*. 2008). Lönnqvist *et al*. (2011) also look at intertemporal stability using HL and a survey; their results indicate that the assumption of stability is problematic and that the predictive power of implied risk attitudes based on HL and decisions in the trust game is low.

Each of these studies compares the results from one risk-elicitation method with the results from another choice setting (like an auction, a trust game, a game show or a survey) in which choices are also likely to be driven by risk attitudes. Our approach differs from this literature by comparing the results of two risk-elicitation methods (each applied twice) to measure choice stability over a short time frame within individual and within method (as opposed to the long time frame in Harrison *et al*., 2005a). In addition, we investigate aggregate and individual cross-method consistency using two methods of risk elicitation that have the same theoretical starting point but employ different procedures.

Closest to our study is Dave *et al*. (2010; in a study on a cross section of the Canadian population), who also compare the results of two methods, HL and an approach by Eckel and Grossman (2002). They find that implied risk attitudes of the two methods differ (in the Eckel and Grossman method more individuals are risk neutral) and that HL leads to more inconsistent choices, particularly among individuals with lower mathematical skills.4 Our study differs from their approach not only in our second elicitation method but also in that we have a more homogeneous (student) population as our experimental subjects (see Supporting Information). In the study by Dave *et al*. (2010) this mattered for their result, as mathematical skills, which were widely distributed across their experimental subjects, changed the accuracy of measures between methods. Furthermore, our approach of letting subjects make decisions for both methods twice provides us with an internal benchmark when comparing results across methods.5 In addition, our methods of choice are based on the same decision variable used to determine risk attitudes (an optimal probability over gains; see the description of the methods below for more detail); furthermore, both methods were designed with the same theoretical framework (i.e. utility function) in mind. This removes two potential reasons why different methods might provide different results on risk attitudes for an individual.

We find (a) that the two methods provide divergent pictures of the overall risk attitude of the groups in our subject pool, hence whether subjects are predominantly risk neutral or risk averse depends on the elicitation method; (b) that within-subject consistency of individual decisions throughout the experiment is limited for both methods, even within methods; and (c) that individual-level consistency decreases further when comparing the two methods. These results confirm outcomes of prior research and call into question how experimental results on risk attitudes can be used for more than general statements about groups of subjects. The observation of limited cross-method consistency is further aggravated considering that the internal consistency is not much better within than across methods. That is, the problem does not only seem to be that measures depend on framing. Hence, both methods fall short of some main criteria we would have seen as desirable.

### 2 Experimental Methods and Procedures

In our experiment we used the methods by HL and AH. Both methods are based on the idea that they provide good measures for a CRRA utility function and the underlying idea of stable individual risk attitudes.6 They ask experimental participants to make choices over lotteries where the main decision variable is the probability of winning; they both allow relatively straightforward calculations of risk aversion parameters and they are both laboratory specific and can be incentivised similarly, which increases the comparability of data collected. We incentivised (*ex post* randomly) selected rounds of both methods such that they yield the same expected value for a risk neutral decision-maker. This further increases the comparability between the methods.

We briefly outline the two approaches as we implemented them. The original studies contain full details, and screenshots can be found in the Supporting Information.

#### 2.1 Holt and Laury's Method

HL consists of a menu of lotteries (or multiple price list, MPL) with changing probabilities over ten constant pairs of outcomes. For each pair of outcomes in the list there is a more and a less risky option (based on the variance of outcomes). Participants were able to see an MPL and were asked to make choices separately for each row between a pair of lotteries. For each further decision row down, the probability mass on the higher pay-off increased by 10%, making the safer option A (i.e. the option with a lower variance in pay-offs) less attractive. We deviate from HL slightly by leaving out the certain option (i.e. 100% probability of the higher pay-off) to avoid any reference point of safety. This reduced our number of choices from ten to nine in each round where we use HL. Over the two rounds we played one set-up in which participants played over slightly higher stakes (a gamble between 10 and 8 vs. a gamble between 19.25 and 0.5) and one in which they played over slightly lower stakes (8 and 6.4 vs. 15.4 and 0.4). The higher stake set-up just scaled up pay-offs and therefore implies slightly higher risk premia for choosing the more secure option. However, the estimated bounds for CRRA coefficients remain the same for each number of safe choices.7 Table 1 provides an example for the set-up with the slightly lower stakes.

Table 1. Multiple price list design by Holt and Laury

| Option A | | | | Option B | | | |
|---|---|---|---|---|---|---|---|
| *p* | *X* | 1−*p* | *Y* | *p* | *X* | 1−*p* | *Y* |
| 0.1 | 8 | 0.9 | 6.4 | 0.1 | 15.4 | 0.9 | 0.4 |
| 0.2 | 8 | 0.8 | 6.4 | 0.2 | 15.4 | 0.8 | 0.4 |
| 0.3 | 8 | 0.7 | 6.4 | 0.3 | 15.4 | 0.7 | 0.4 |
| 0.4 | 8 | 0.6 | 6.4 | 0.4 | 15.4 | 0.6 | 0.4 |
| 0.5 | 8 | 0.5 | 6.4 | 0.5 | 15.4 | 0.5 | 0.4 |
| 0.6 | 8 | 0.4 | 6.4 | 0.6 | 15.4 | 0.4 | 0.4 |
| 0.7 | 8 | 0.3 | 6.4 | 0.7 | 15.4 | 0.3 | 0.4 |
| 0.8 | 8 | 0.2 | 6.4 | 0.8 | 15.4 | 0.2 | 0.4 |
| 0.9 | 8 | 0.1 | 6.4 | 0.9 | 15.4 | 0.1 | 0.4 |
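The link between the number of safe choices and bounds on a CRRA coefficient can be illustrated with a short sketch. It assumes a normalised CRRA utility (*x*^α − 1)/α, with ln *x* as the limit at α = 0, which ranks lotteries in the same way as *x*^α for α > 0; the function names, the bisection bracket of [−3, 3] and the tolerance are illustrative choices rather than part of the original design. For each row of the list it solves for the α at which a decision-maker would be indifferent between option A and option B, and it repeats the calculation for the higher-stakes pay-offs to check that scaling leaves these values unchanged.

```python
import math

# Low-stakes pay-offs from Table 1: option A pays 8 or 6.4, option B pays 15.4 or 0.4.
LOW_STAKES = (8.0, 6.4, 15.4, 0.4)
# Higher-stakes pay-offs from the text: 10 or 8 versus 19.25 or 0.5.
HIGH_STAKES = (10.0, 8.0, 19.25, 0.5)


def crra_utility(x, alpha):
    """Normalised CRRA utility (x**alpha - 1) / alpha; log(x) is the limit at alpha = 0."""
    if abs(alpha) < 1e-8:
        return math.log(x)
    return (x ** alpha - 1.0) / alpha


def eu_gap(p, alpha, payoffs):
    """Expected utility of option B minus option A when the higher pay-off has probability p."""
    a_high, a_low, b_high, b_low = payoffs
    eu_a = p * crra_utility(a_high, alpha) + (1 - p) * crra_utility(a_low, alpha)
    eu_b = p * crra_utility(b_high, alpha) + (1 - p) * crra_utility(b_low, alpha)
    return eu_b - eu_a


def indifference_alpha(p, payoffs, lo=-3.0, hi=3.0, tol=1e-10):
    """Bisection for the alpha at which options A and B are equally attractive."""
    f_lo = eu_gap(p, lo, payoffs)
    if f_lo * eu_gap(p, hi, payoffs) > 0:
        raise ValueError("no sign change inside the bracket")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = eu_gap(p, mid, payoffs)
        if f_lo * f_mid <= 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


def row_thresholds(payoffs):
    """Indifference alphas for the nine rows p = 0.1, ..., 0.9."""
    return [indifference_alpha(k / 10, payoffs) for k in range(1, 10)]


low = row_thresholds(LOW_STAKES)
high = row_thresholds(HIGH_STAKES)
for k, alpha in enumerate(low, start=1):
    print(f"row {k} (p = {k / 10:.1f}): indifferent at alpha = {alpha:.2f}")
```

Under these assumptions a participant with *n* safe choices and a single switching point has an α between the indifference values of rows *n* and *n*+1, which is how ranges such as 1.15>*α*>0.85 for four safe choices arise.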

#### 2.2 Andreoni and Harbaugh's Method

In AH individuals have to trade off the probability of winning against the amount they can win in a gamble; they allocate a budget (or convex risk budget, CRB) in each decision between the probability (*p*) of winning and the amount to be won (*x*) – with the complementary probability (1−*p*) individuals get a pay-off of zero. Individuals receive a budget *b* which they can use to buy extra percentage points of winning at a price or exchange rate (*e*) per percentage point. Hence, individuals choose a pair (*p*,*x*) such that *x*=*b*−*p*·*e*.

As the method of AH is less common we give the following example here: In round D the participant starts with a budget of $88. The participant can *buy* extra probability of winning at the cost of $2.75 for each percentage point of winning. She could, for example, choose to buy ten percentage points at a cost of 10·$2.75 = $27.50. Consequently, she would get $88−$27.50 = $60.50 with a corresponding probability of *p* = 10% and $0 otherwise (with 1−*p* = 90%). The participant can continue to buy further winning probability, or reduce the winning probability and get a higher amount in case of winning. The participant will (move the slider and) adjust her combination of probability and amount won until some optimal point is reached.
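The budget line in this example also shows how a single choice can be mapped to a CRRA coefficient. The sketch below is a derivation under stated assumptions, not necessarily the exact formula used in the study: it assumes the participant maximises *p*·*x*^α subject to *x* = *b* − *p*·*e* with α > 0, in which case the first-order condition gives α = (*b* − *p*·*e*)/(*p*·*e*). The function names are hypothetical.

```python
def amount_won(b, e, p):
    """Prize x = b - p * e left after buying p percentage points at price e each."""
    return b - p * e


def implied_alpha(b, e, p):
    """Alpha for which p maximises p * x**alpha on the budget line x = b - p * e."""
    spent = p * e
    if not 0 < spent < b:
        raise ValueError("p must lie strictly inside the budget line")
    return (b - spent) / spent


def optimal_p(b, e, alpha):
    """Percentage points bought by a decision-maker with exponent alpha (alpha > 0 assumed)."""
    return b / (e * (1 + alpha))


# Round D from the text: budget 88, price 2.75 per percentage point.
print(amount_won(88, 2.75, 10))      # prize when ten percentage points are bought
print(implied_alpha(88, 2.75, 10))   # alpha implied by that choice
print(optimal_p(88, 2.75, 1.0))      # choice of a risk neutral decision-maker
```

Under these assumptions, buying ten percentage points in round D would imply an α above one (risk loving), while a risk neutral participant would spend half of the budget on probability.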

AH vary the range of potential winning gains (*b*) as well as the price *e* at which gains can be exchanged for additional probability of winning. We implemented the AH method with the deviation that we did not use lotteries involving losses. As in the original study we presented the probability of winning as a green shaded area in a pie chart. The amount received in case of winning was illustrated as a green shaded area in a bar chart. Participants were able to change the probability of winning by moving a slider. While the gamble was graphically presented on the computer screen, the probability of winning and the amount to be won were also stated as numbers on the screen. Table 2 shows the budgets *b* (hence, the maximum amount that could be won with probability zero) in each round as well as the price *e* of one extra percentage point of winning probability. These combinations were each presented to participants twice.

Table 2. Pairs of maximum gain and cost of probability

| Round | A | B | C | D | E | F | G | H | I |
|---|---|---|---|---|---|---|---|---|---|
| Budget (*b*) | 27.3 | 56 | 172 | 88 | 49.4 | 39.2 | 54.5 | 207 | 116 |
| Price (*e*) of one extra per cent of winning probability | 0.28 | 1.17 | 10.75 | 2.75 | 0.77 | 0.41 | 0.68 | 8.62 | 2.42 |

#### 2.3 Experimental Design

We used a within-subject design in which individuals make choices based on the risk-elicitation methods introduced by HL and AH. We analysed decisions of 78 experimental participants from a regular student population throughout seven sessions. Participants were recruited online from the experimental subject pool at the Queensland University of Technology using ORSEE (Greiner, 2004) and through announcements in tutorials. Some participants were also recruited in common places at the university in personal communication; however, when asking students in person to participate in the experiment, the same information was used for recruitment, including the organiser (researchers at the School of Economics and Finance), average earnings (around 20 Australian dollars) and time estimated to complete the experiment (around 30 min). It was also pointed out to the students that there would be no minimum payment for participating in the experiment. It is worth noting that this recruitment of asking students personally to participate was somewhat less controlled than is common in many economic experiments. However, as we were interested in within-subject comparisons and were still drawing from a relatively homogeneous student population, this was of minor concern.

The risk-elicitation methods were implemented in a computer laboratory using custom-made, Java-based software. Upon arrival at the laboratory, participants were seated at computers and were asked to work through experimental instructions and start the experiment. Instructions included examples of how to make choices in the experiment and two test questions for each risk-elicitation method. Further help by the experimenter was available upon request of participants. When participants had passed the test questions, they started the experiment, going through two rounds of nine choices for each risk-elicitation method, alternating between the methods. The order of the risk-elicitation methods was switched for about half of our experimental sessions (we did not find significant order effects across participants' decisions depending on the order of the methods).

To avoid portfolio-building or wealth effects in the course of the experiment, after completing the experiment, one of the two rounds was randomly chosen for payment. For this round one choice of each method was randomly selected. Thus, for each method 1 of 18 decisions was payment relevant. For the two choices that were selected, participants were given the opportunity to change their earlier decisions; we did so to test whether participants, once they knew that this decision would be paid, would change their decisions.8 Furthermore, the changes in decisions also provide an indicator of the reliability of previously recorded choices over (potentially) hypothetical stakes. Finally, participants were given a questionnaire that asked for some demographic information and student status. After students had finished the questionnaire they were paid and could leave the computer laboratory. Average payments were 17 Australian dollars (SD 18), of which 10 dollars (SD 5) were paid for decisions in HL and 7 dollars (SD 17) for decisions in AH.

### 3 Replication of Results and Analysis of Aggregate Decisions

In a first step we replicated some of the (central) results in the approaches by HL and AH that were relevant for our comparison. Both studies considered deriving parameter estimates for a CRRA utility function of the form *U*(*x*) = *x*^{1−*r*}/(1−*r*), as introduced in HL, or similarly *U*(*x*) = *x*^{α} as in AH. As both notations are equivalent for our purpose (with *α* = 1−*r*), we only report values for *α*. In both methods the probability chosen was the main choice variable of interest for the analysis. For this utility function, HL grouped experimental decision-makers into categories of individuals with a certain risk attitude, based on the estimated coefficient *α*. Although the method used by HL does not allow such a coefficient to be calculated directly, bounds on it can be determined by looking at the switching points from more risky to less risky choices. These bounds are, however, difficult to identify if individuals have more than one switching point. Dealing with these issues, HL counted the number of safe choices that an individual had made and grouped individuals into categories that this number of safe choices would have implied if they had only a single switching point (SSP). Table 3 reports our replicated results for our two pay-off set-ups comparable to the low stakes set-up of HL, as well as the original results in HL in their two treatments with low and high monetary pay-offs. The last column contains the empirical distribution of CRRA coefficients based on our AH data to allow a comparison.

Table 3. Overall distribution of risk attitudes

| Risk attitude | Number of safe choices | Range of *α* | HL (replicated) (%) | | HL (2002) (%) | | AH (%) |
|---|---|---|---|---|---|---|---|
| | | | (1) | (2) | (3) | (4) | (5) |
| Highly risk loving | 0–1 | *α*>1.95 | 1 | 1 | 1 | 1 | 5 |
| Very risk loving | 2 | 1.95>*α*>1.49 | 0 | 7 | 1 | 1 | 2 |
| Risk loving | 3 | 1.49>*α*>1.15 | 8 | 5 | 6 | 4 | 6 |
| Risk neutral | 4 | 1.15>*α*>0.85 | 29 | 21 | 26 | 13 | 61 |
| Slightly risk averse | 5 | 0.85>*α*>0.59 | 17 | 23 | 26 | 19 | 11 |
| Risk averse | 6 | 0.59>*α*>0.32 | 22 | 19 | 23 | 23 | 9 |
| Very risk averse | 7 | 0.32>*α*>0.03 | 10 | 22 | 13 | 22 | 4 |
| Highly risk averse | 8 | 0.03>*α*>−0.37 | 4 | 4 | 3 | 11 | 2 |
| Stay in bed | 9–10 | −0.37>*α* | 9 | 4 | 1 | 6 | 0 |

The AH risk-elicitation method allows for a straightforward calculation of CRRA coefficients under the functional form as described above; we do this for each decision that experimental participants take and report the distribution of all the decisions by all participants based on the implied *α*-coefficient.9 We do not replicate the full analysis by AH, who answer five questions on EUT.10 Instead we focus on whether using a CRRA framework with a simple utility function as characterised before is reasonable. We confirm their regression results over all decisions showing that budget allocations of the winning probability and the winning prize are approximately constant over the size of winning stakes. Like AH, we find very small standard errors in the regression results and negligibly small coefficients. This indicates that CRRA is a reasonable assumption.

We find that the classification in terms of risk attitudes of our subject pool when using the HL method follows a similar distribution to the one reported by HL in their original contribution. Furthermore, we can generally identify a noticeable degree of risk aversion in our subject pool and also find a tendency of (slightly) increasing risk aversion when the stakes over which the lotteries are played increase. Although our stakes are always close to the lower stake set-up of HL, slightly increasing the stakes shifts the results in the expected direction.

Harrison *et al*. (2005b) pointed out that order effects may influence the results, leading to higher risk aversion in the second round. This may affect our results and explain different levels of risk taking between the two rounds. However, we assume that the effect will be similar across subjects and should not change the rank order of participants based on their risk attitudes. For this reason we use rank correlations in our analysis.

Using AH's method we find that coefficients indicate a higher number of risk neutral choices compared to results in HL, some risk averse choices as well as some decisions that are risk loving. Analysing the total distribution of the results indicates that the two methods, despite drawing on a very similar notion of utility functions and both being theoretically legitimate risk-elicitation procedures, do not provide the same result. The average risk attitude in HL is between slightly risk averse and risk averse (on average 5.44 safe choices are made), while the average decision in AH is risk neutral (with a tendency towards risk aversion). This is true despite the fact that the expected monetary pay-off is the same across methods, as in expected value terms the same amount can be earned in both methods.

### 4 Analysis of Individual Decisions

To get a better understanding of these differences in the results, we analysed the decisions of our participants on a within-subject basis. That is, as all our participants made 18 decisions in each method, we can analyse to what extent each individual decided consistently within and across the two methods.

#### 4.1 Internal consistency of the methods

In a first step we analysed to what extent individual participants made consistent decisions within one risk-elicitation method. For this, we used correlations of individual decisions over the two rounds. For the HL method, the number of safe choices made in the first and the second period, which were used to calculate CRRA coefficients as shown in Table 3, gave a correlation of 55% and a ranked (Spearman's *ρ*) correlation of 62%. We also considered a second way to measure the degree of risk aversion for which we did not assume that participants have a clearly determinable SSP, but calculated the average risk premium within their farthest switching points. (This corresponds to an approach described by Andersen *et al*., 2006, who are, however, critical of this procedure.) These averages were correlated at a level of 68% (*ρ* = 69%) over the two rounds of HL. Figure S2 in the Supporting Information also provides a picture of the dispersion of the difference between safe choices in the first round (over lower stakes) and the second round (over higher stakes), indicating that there is a slight shift towards risk aversion, but that it is not a one-directional shift.

As in the HL method the idea of an SSP from less to more risky options is important, we also looked at whether assuming the general prevalence of SSP was reasonable for our sample, and how many of the participants with SSP consistently chose the same number of safe choices over the two rounds. Of our 78 participants in the experiment, 48 had an SSP in both rounds of the HL method.11 Of these, 22 chose the same number of safe choices in both periods.12 Of the 22 (HL-consistent) individuals, 10 participants were in the risk neutral category as introduced above and 12 were either risk averse or risk loving.

When asked to reconsider their choices knowing the period that was going to be paid, 8 of 78 participants wanted to change their decision. One of these eight increased the number of safe choices, while all others increased the number of risky choices. Hence, this indicates that generally participants were fine with the choices they had made earlier. To get a better understanding of which individuals switched their decisions, we investigated if they differed in some way from the other participants. However, we found no noticeable correlation with respect to gender, age, their estimated risk attitude or their native language. There was also no order effect of which method was played first or of whether the choice that could be reconsidered had just been made in the round before. There was only some small correlation (of 13%) showing that individuals were somewhat more likely to change decisions around the risk neutral switching point. Hence, there is no observable explanation why individuals changed their decisions in HL.

To analyse the internal consistency in the AH set-up, we similarly first looked at correlations between decisions of individuals made between the rounds. For this purpose we calculated implied CRRA *α*-coefficients for each decision as described in footnote 3. These coefficients showed correlations that ranged between 15% and 60% for the same lottery (i.e. the same choice over a corresponding maximum gain and price of an extra probability of winning) over the two rounds. Ranked correlations were between 30% and 57% across individuals, indicating that looking at only ordinal risk attitudes of individuals lowered the effect of outliers, but did not lead to a greater consistency over the rounds.13 There was no apparent relationship between the stake of the lottery (*b*) and the correlation between the two rounds; that is, it was not clear which factors led to higher consistency over the rounds. We looked at correlations both for the raw decision variable (i.e. the probability chosen in a given period) and for the implied *α* coefficients for each round of the game. Figures 1a and 1b illustrate these correlations for each round. As can be seen, for all combinations of *b* and *e* a positive relationship exists, but correlations are far from perfect.

In a second step we therefore tried to find an individual aggregate for the CRRA coefficient over the different CRB choice allocations. We did so by averaging the coefficients for each individual over each round.14 To find out if such an aggregation was appropriate, we tested whether there was a positive or negative relationship between the maximum gain and the implied CRRA coefficient. We found no such relationship for most individuals.15

Having done this aggregation, we compared (round) average *α* values of the two rounds; they showed a correlation of 70% by individual and a ranked correlation of 72%. To get a better picture of robustness of the CRRA coefficients, we also looked at whether participants changed their decisions when being informed that a certain round would be selected for final pay-off. The result showed that – compared with the HL method – many participants (a total of 27) changed their choices. However, changing decisions in HL and AH can have a very different leverage and the two are hence not directly comparable. In AH comparatively small changes can be made by adjusting the budget allocation only slightly, while changing in HL essentially always implies a significant shift in measured risk attitudes. Furthermore, in AH the percentage change among those individuals who revised their decisions was noticeable; on average, participants who changed their choices moved 12% towards safer choices and absolute changes were 30%.16

Again, as for the HL method we investigated potential reasons for changing decisions. We found some variables that are correlated with decision changes in the AH method. Non-native speakers are more likely to make changes (correlation of 24%), indicating that understanding the task might play a role. Age and gender, however, play no role. Furthermore, individuals with higher values of *α* are less likely to change their choices (correlation of 22%). However, these relationships do not seem to be strong. Individuals who change their HL choice are not more likely to do so for their AH choice as well.

Finally, we investigated to what extent using average CRRA coefficients derived using the AH method allowed us to reliably classify participants into broad categories of risk averse, risk neutral and risk loving individuals. We therefore tested whether the average CRRA coefficient *α* was significantly different from one using confidence intervals of 2 within-subject standard deviations. We found that for only 5 of the 78 participants was the CRRA coefficient *α* significantly different from 1; from our estimates these five participants were risk averse and all other participants were approximately risk neutral.17
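One reading of this classification rule can be sketched as follows. The sketch assumes the interval is a participant's mean *α* plus or minus two sample standard deviations of that participant's own decisions; whether the analysis used a sample standard deviation or another dispersion measure is an assumption here. The function name and the three example vectors are hypothetical and are not data from the experiment.

```python
from statistics import mean, stdev


def classify(alphas, width=2.0):
    """Classify one participant from their implied alphas using mean +/- width * SD."""
    centre = mean(alphas)
    spread = stdev(alphas)  # sample standard deviation (an assumption of this sketch)
    lower, upper = centre - width * spread, centre + width * spread
    if upper < 1:
        return "risk averse"
    if lower > 1:
        return "risk loving"
    return "approximately risk neutral"


# Hypothetical participants, for illustration only.
tight_low = [0.55, 0.60, 0.50, 0.65, 0.58, 0.52, 0.61, 0.57, 0.60]
noisy_mid = [0.7, 1.3, 0.9, 1.1, 1.0, 0.8, 1.2, 0.95, 1.05]
tight_high = [1.5, 1.6, 1.55, 1.45, 1.5]

for name, values in [("tight_low", tight_low), ("noisy_mid", noisy_mid), ("tight_high", tight_high)]:
    print(name, classify(values))
```

Because the interval widens with within-subject dispersion, a participant whose choices vary a lot across decisions is classified as approximately risk neutral under this rule even when the average *α* is some distance from one.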

#### 4.2 Comparison across methods

Our data also allow us to compare the two risk-elicitation methods on a within-individual basis. One way to do so is trying to make predictions based on one method of how an individual would have made decisions in the other method. Following this rationale, we used the average risk aversion coefficient derived using the AH method to predict how an individual with this parameter would have decided in the HL framework.

We found that this would have predicted 76% and 75% of decisions in the two rounds of the HL method respectively. However, in this comparison any individual who has multiple switching points (MSP) will have some incorrect predictions, even if both methods estimate the same coefficient. To alleviate this effect, we looked at individuals with SSP only, which showed 83% and 82% correct predictions for single (rows of binary) HL choices (hence not the overall implied risk attitude) over the two periods. These numbers indicate a high level of comparability.

However, we read these numbers with care. The reason is that we used AH to determine individual-specific risk attitudes and then predicted choices in HL. A simple benchmark is hence to assume that all participants have the same risk attitude and to see how well this counterfactual predicts the choices made in HL. We did so by assuming all individuals to be risk neutral. This should bias the comparison in favour of the AH method, as aggregate analysis for both methods indicates risk aversion. We found that assuming risk-neutral participants would have predicted the choices made by individuals under the HL method equally well (85% and 82% respectively). Hence, our individual-specific estimates do not outperform the counterfactual.
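The prediction exercise and its risk-neutral benchmark can be sketched as follows. This is an illustration under stated assumptions rather than our estimation code: it assumes the baseline payoffs of Holt and Laury (2002), nine rows with the probability of the high payoff running from 0.1 to 0.9, and the power utility u(x) = x^α; all names and the example participant are hypothetical.

```python
# Assumed payoffs: the Holt and Laury (2002) baseline menu.
SAFE = (2.00, 1.60)   # option A: high and low payoff
RISKY = (3.85, 0.10)  # option B: high and low payoff
PROBS = [i / 10 for i in range(1, 10)]  # nine rows, chance of the high payoff


def expected_utility(payoffs, p_high, alpha):
    """Expected utility of a two-outcome lottery under u(x) = x**alpha."""
    high, low = payoffs
    return p_high * high ** alpha + (1 - p_high) * low ** alpha


def predict_hl_choices(alpha):
    """Predicted choice ('A' or 'B') in each of the nine rows."""
    return [
        "B" if expected_utility(RISKY, p, alpha) > expected_utility(SAFE, p, alpha)
        else "A"
        for p in PROBS
    ]


def hit_rate(predicted, observed):
    """Share of rows in which the prediction matches the observed choice."""
    return sum(a == b for a, b in zip(predicted, observed)) / len(observed)


# Hypothetical participant: AH-based alpha and observed HL choices.
alpha_ah = 0.5
observed = list("AAAAABBBB")
individual = hit_rate(predict_hl_choices(alpha_ah), observed)
benchmark = hit_rate(predict_hl_choices(1.0), observed)  # everyone risk neutral
print(individual, benchmark)
```

Comparing the two hit rates participant by participant shows whether the individual-specific coefficient adds predictive power over a common risk-neutral assumption.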

We therefore reverted to the categorisation of participants into groups of people with different risk attitudes as in Table 3. We allocated individuals to these risk categories according to each of the two methods. Using this approach, 10% of participants were grouped into the same risk attitude category by both methods. The main reason for this is that the AH method (on average) classifies individuals as more risk neutral than the HL method. In this sense one could say the AH method ‘shifts’ behaviour of individuals towards risk neutrality. We observe an average shift of 27%; however, the shift is not only in one direction (the average absolute shift is about 33%), and the rank correlation of the allocations to risk categories is *ρ* = 38%.18 This is surprisingly close to what Isaac and James (2000) found in their study comparing risk attitudes of individuals using a first-price auction and the BDM procedure, which was the first approach comparing two elicitation methods on an individual level. Figure 2 illustrates this relationship.
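For readers who wish to reproduce this kind of comparison, the share of identical classifications and the rank correlation between two ordinal classifications can be computed as in the sketch below. It uses Spearman's rank correlation with average ranks for ties, which is an assumption about the exact statistic; the integer category codes and function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def agreement_share(x, y):
    """Share of participants placed in the same category by both methods."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


# Hypothetical ordinal category codes (higher = more risk averse).
hl_categories = [5, 6, 4, 7, 5, 3]
ah_categories = [4, 5, 4, 5, 4, 4]
print(agreement_share(hl_categories, ah_categories))
print(spearman(hl_categories, ah_categories))
```

The two statistics answer different questions: exact agreement can be low even when the ordering of participants is partly preserved, which is the pattern a uniform shift towards risk neutrality would produce.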

### 5 Conclusion

Using the risk-elicitation methods developed by HL and AH, we tested their internal and external consistency across and within individuals. We find correlations of about 60–70% between decisions in the two periods within method and within individual. By comparison, cross-method predictions and correlations were smaller and can only be established on an aggregate level. Hence, the two methods are not procedurally invariant, either over the full subject pool (as visible in Table 3) or on an individual level. However, as low cross-method correlations are also due to the low within-method consistency of decisions, as visible in our within-method benchmark, a correlation of *ρ* = 38% between the methods is not negligible. Evidently, one would like to have risk-elicitation methods with more consistency.

This result of low cross-method consistency seems undesirable, considering that *a priori* one would have guessed that the two methods would yield similar results, and it seems difficult to determine a better method *ex post*. The gap between the *a priori* compatibility and the *ex post* divergence of the methods also cannot be resolved empirically given our data, as no observable variable explains the difference. Any reasoning seems highly speculative given that the two methods have the same theoretical motivation and the same decision variable, as in both methods individuals choose over probabilities.19

Moreover, the comparison between methods did not provide an unambiguous guideline as to which method should be preferred, as both are subject to inconsistencies. While individuals were more consistent over the two rounds in the HL method than in the AH method, for both it is problematic to clearly identify the risk attitude of an individual.

In both methods we are not describing small errors as inconsistencies, but shifts that are crucial for the interpretation and meaningfulness of estimated coefficients. This conclusion holds even though we repeated the tasks over only two rounds, and increasing the number of repetitions might lead to more inconsistencies. The analysis of the HL method can be improved by disregarding or simplifying inconsistent choices, but this might not be advisable, as Jacobson and Petrie (2009) have shown. We also find few *superconsistent* individuals (as did Anderson and Mellor, 2009); in our subject pool individual inconsistencies are an almost universal problem. Like most of the earlier literature, we read our individual-based cross-method correlations as (unsatisfactorily) low.

While we have no clear means to determine which of the two methods is the correct or superior one, we can evaluate to what extent the desirable characteristics mentioned at the beginning of the study are met by the methods. In the aggregate both methods allow statements about the overall risk attitude, and we would conclude that the subject pool is on average (moderately) risk averse. However, while both methods are informative about the risk attitude of a group of subjects, it seems difficult to reliably infer the risk attitude of an individual from them. Hence, given the criteria a practitioner would want a method to satisfy, both methods seem to meet only the most basic ones, primarily allowing statements about the general prevalence of risk attitudes in the population. This is somewhat disappointing considering that risk aversion is essentially an individual-based concept.

Our findings are particularly crucial from the perspective of a practitioner for whom measuring risk attitudes is not the last step and ultimate goal, but who would like to use this information for further analysis, for example when quantifying the role of risk attitudes in decisions where risk and other elements determine outcomes jointly. Without individual consistency of decisions, it is questionable to what extent HL (or AH) can be used as a control for risk aversion, as is sometimes done when interpreting other experimental games. We would now be more careful when using them as a stable measure of individual risk attitudes in experiments that try to take out the risk aspect of other decisions, or when using experimental results on risk attitudes as indicators of whether individuals are risk averse, risk neutral or risk seeking, such as in public good decisions (Gangadharan and Nemes, 2009), in trusting decisions (Houser *et al*., 2010) or when linking them to genetic data (Zhong *et al*., 2009).20 However, it would be very useful to have such a tool.