Subjective neuronal coding of reward: temporal value discounting and risk


Dr W. Schultz, as above.


A key question in the neurobiology of reward relates to the nature of coding. Rewards are objects that are advantageous or necessary for the survival of individuals in a variety of environmental situations. Thus reward appears to depend on the individual and its environment. The question arises whether neuronal systems in humans and monkeys code reward in subjective terms, objective terms or both. The present review addresses this issue by dealing with two important reward processes, namely the individual discounting of reward value across temporal delays, and the processing of information about risky rewards that depends on individual risk attitudes. The subjective value of rewards decreases with the temporal distance to the reward. In experiments using neurophysiology and brain imaging, dopamine neurons and striatal systems discount reward value across temporal delays of a few seconds, despite unchanged objective reward value, suggesting subjective value coding. The subjective values of risky outcomes depend on the risk attitude of individual decision makers; these values decrease for risk-avoiders and increase for risk-seekers. The signal for risk and the signal for the value of risky reward covary with individual risk attitudes in regions of the human prefrontal cortex, suggesting subjective rather than objective coding of risk and risky value. These data demonstrate that important parameters of reward are coded in a subjective manner in key reward structures of the brain. However, these data do not rule out that other neurons or brain structures may code reward according to its objective value and risk.


Basic fluid, food and sexual rewards influence the brain via a number of sensory receptors including somatosensory, gustatory, visual, olfactory and auditory modalities. These receptors are not selective for reward, and the brain needs to extract the reward information from polysensory events. Thus the functions of rewards are not defined by the specificity of sensory receptors but are inferred from the influence of rewards on behaviour. The question arises whether the neuronal coding of reward information follows the objective value of rewards or reflects the subjective influence of rewards on behavioural reactions. The term ‘subjective’ refers to particular characteristics of individual decision makers, including inborn or acquired attitudes, beliefs and needs.

Information systems can work well with explicit signals that represent key variables for the specific functions they process. Theoretical considerations and behavioural studies have identified a number of variables underlying reward functions. These include the value and risk of rewards. Investigations into the neuronal mechanisms of reward might benefit from the identification of explicit signals for reward variables in two ways. First, the existence of an explicit neuronal signal would indicate the validity and importance of the particular variable for neuronal mechanisms underlying reward function and encourage further behavioural and theoretical work on that particular variable. By contrast, it is rather difficult to investigate brain mechanisms of a function for which there is no variable coded by neuronal signals. Second, once a signal has been identified, further research can characterise its properties and determine the nature of the information it conveys. Together with anatomical and physiological data on neuronal connectivity this knowledge might lead to the functional architecture of neuronal reward processing. This review will describe data from our laboratory concerning the subjective nature of explicit neuronal signals for reward value and risk.

Subjective coding of reward value during temporal discounting

Conceptual background

The kind, magnitude and probability of reward objects determine the value of positively motivating outcomes of behaviour. Blaise Pascal in 1654 famously noted that humans tend to maximize the summed product of value and probability, the mathematical expected value, when making decisions about future outcomes. Animals rationally and consistently prefer larger over smaller rewards (Boysen et al., 2001; Watanabe et al., 2001). However, the objective measurement of subjective reward value by choice preferences reveals that rewards lose some of their value when they are delayed. In fact, rats, pigeons, monkeys and humans often prefer smaller rewards that occur earlier over larger rewards occurring later (Ainslie, 1974; Rodriguez & Logue, 1988; Richards et al., 1997). However, reward value may not always decrease monotonically with increasing delays, as some rewards may have reduced value if consumed too early because of scheduling of energy demand or supply, or competing activity of the animals (Raby et al., 2007). Thus, temporal discounting reflects the more general notion that rewards are particularly valuable at particular points in time and lose some of their value at other times. Taken together, the subjective value of reward appears to vary across time, even though the physical reward remains unchanged.
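The expected value referred to above is the probability-weighted sum of the possible outcome values. A minimal sketch follows; the function name and the example gamble are illustrative assumptions and are not taken from the studies reviewed here.

```python
def expected_value(outcomes):
    """Sum of value * probability over (value, probability) pairs."""
    total_probability = sum(p for _, p in outcomes)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(value * p for value, p in outcomes)


# Hypothetical gamble: 1.0 unit of reward with p = 0.5, nothing otherwise.
risky_gamble = [(1.0, 0.5), (0.0, 0.5)]
# Hypothetical safe option: 0.5 units of reward for certain.
safe_option = [(0.5, 1.0)]

ev_risky = expected_value(risky_gamble)
ev_safe = expected_value(safe_option)
```

Both options have the same expected value, so any consistent preference between them would have to come from factors other than objective expected value.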

The factors underlying temporal discounting include the less tangible nature of distant rewards, the uncertainty associated with future events, the need for nutritional and energy supply at particular, often immediate, points in time, the general lack of patience, the propensity for impulsive behaviour, and several irrational and emotional factors associated with temporal delays. The different factors have led to various concepts of temporal discounting. The most common model assumes a reduction in the scalar subjective reward value by delay. Many economists favor exponential discounting with a constant reduction in subjective reward value per unit time (Samuelson, 1937). By contrast, behavioural psychologists describe choices between differently delayed rewards by a hyperbola according to which the rate of discounting is larger in the near than the far future (Ainslie, 1975; Ho et al., 1999). A generalised hyperbola combining hyperbolic and exponential discounting often fits the data best (Loewenstein & Prelec, 1992). Temporal delays of reward not only influence behavioural choices but also reduce the efficacy of rewards for learning, even with constant overall reward density and rate (Holland, 1980). Taken together, irrespective of particular discounting functions, temporal delays contribute importantly to the subjective valuation of reward.
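The difference between the discounting models described above can be made concrete with a short sketch. The generalised hyperbola is written here in the two-parameter form V = A/(1 + αD)^(β/α), which is an assumption about the intended parameterisation; the function names and the parameter value 0.2 are hypothetical and chosen only for illustration.

```python
import math


def exponential_value(amount, delay, k):
    """Exponential discounting: the same proportion of value is lost per unit time."""
    return amount * math.exp(-k * delay)


def hyperbolic_value(amount, delay, k):
    """Hyperbolic discounting: V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay)


def generalised_hyperbolic_value(amount, delay, alpha, beta):
    """Assumed two-parameter form: V = A / (1 + alpha * D) ** (beta / alpha).

    With alpha == beta this reduces to the simple hyperbola; as alpha
    approaches 0 it approaches exponential discounting with rate beta.
    """
    return amount / (1.0 + alpha * delay) ** (beta / alpha)


def fractional_loss(value_fn, delay, step=1.0, **params):
    """Fraction of the current value lost between `delay` and `delay + step`."""
    now = value_fn(1.0, delay, **params)
    later = value_fn(1.0, delay + step, **params)
    return (now - later) / now


# Hypothetical discounting parameter, for illustration only.
near_exp = fractional_loss(exponential_value, 0.0, k=0.2)
far_exp = fractional_loss(exponential_value, 10.0, k=0.2)
near_hyp = fractional_loss(hyperbolic_value, 0.0, k=0.2)
far_hyp = fractional_loss(hyperbolic_value, 10.0, k=0.2)
```

Under these assumptions the exponential model loses the same fraction of value per second at both delays, whereas the hyperbolic model loses a larger fraction in the near than in the far future.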

Temporal discounting may be due to the reduction in the scalar reward value of the conditioned, reward-predicting stimulus (input stage) or, alternatively, may involve value alterations during the decision or choice process (output stage). In agreement with basic principles of reinforcement theory (Sutton & Barto, 1998), neurons code the value of future rewards in their responses to conditioned, reward-predicting stimuli (Schultz & Romo, 1990; Critchley & Rolls, 1996; Fiorillo et al., 2003) or in relation to movements leading to reward (Watanabe, 1996; Hollerman et al., 1998; Samejima et al., 2005). Alterations of these reward signals by temporal delays may constitute neuronal correlates for the temporal discounting of subjective reward value.

Experimental design

The simplest and most easily interpretable predictive neuronal reward value signals consist of responses to Pavlovian conditioned stimuli without choices, whereas the use of two-choice options makes interpretations of neuronal data less straightforward. An alteration of simple predictive reward signals by different reward delays would locate the value change at the input stage of decision making. A subjective neuronal value signal would decrease with increasing reward delay, whereas an objective value signal should remain constant as physical reward magnitude remains identical with all delays. Pavlovian licking responses and intertemporal choices between differently delayed rewards provide appropriate behavioural measures for temporal discounting. Testing both humans and monkeys in tasks with similar delays would facilitate comparisons of behavioural and neuronal discounting between these species.

Pavlovian temporal discounting task

In our experiments with monkeys and humans (Kobayashi & Schultz, 2008; Gregorios-Pippas et al., 2009), different visual stimuli predict the same physical amount of reward after fixed delays of 2–16 s (Fig. 1A). Each stimulus is associated with a specific reward delay. Rewards are a small quantity of liquid for animals and a picture of a money bill for humans (a fixed, known percentage of which is paid out immediately after the experiment). Such rewards produce learning and approach behavior, even though it is possible that they do not represent true ‘primary’ positive reinforcers and that their ‘primary’ reward value is downstream from their immediate effect on the body (Wise, 2002). Reward delay is defined as the interval between stimulus onset and reward onset. Intertrial intervals (from reward offset until next stimulus onset; ITI) are adjusted such that mean cycle time (stimulus–reward delay + ITI) is identical across all stimulus–reward delays (animals, 22.0 ± 0.5 s; humans, 15.5 s fixed + 2 s Poisson-mean truncated at 8 s). Thus, overall reward density and rate are constant across all reward delays. One of our experiments uses an ITI with fixed mean; hence cycle time covaries with reward delay (Fiorillo et al., 2008). However, the discounting data are comparable in these ITI versions.
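The ITI adjustment described above amounts to subtracting each stimulus–reward delay from a fixed cycle time. The sketch below uses the mean cycle time of 22.0 s and the four delays of the monkey experiments. It ignores the ± 0.5 s variation and the duration of reward delivery, and the function and variable names are hypothetical.

```python
def adjusted_iti(delay_s, cycle_time_s=22.0):
    """ITI that keeps (stimulus-reward delay + ITI) equal to the cycle time."""
    iti = cycle_time_s - delay_s
    if iti <= 0:
        raise ValueError("delay must be shorter than the cycle time")
    return iti


delays_s = [2.0, 4.0, 8.0, 16.0]
itis_s = {d: adjusted_iti(d) for d in delays_s}

# One reward per cycle, so the reward rate is the inverse of the cycle time
# and is identical for every delay.
rewards_per_s = {d: 1.0 / (d + itis_s[d]) for d in delays_s}
```

Because every cycle lasts the same time and contains one reward, the reward rate computed this way is the same for all four stimuli.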

Figure 1.

Temporal reward discounting as a test for subjective reward value: experimental design and behavioural data. (A) Pavlovian conditioned stimuli predicting liquid reward after four different delays, as used for neurophysiological experiments in monkeys. (B) Subjective estimates of elapsed time by the peak interval procedure (PIP) in humans. Note the slightly shorter estimates of longer delays. (C) Subjective estimates of reward value in monkeys. For each reward delay, choice probabilities for an adjusted early (2-s delay) over fixed later reward increase with the magnitude of the early reward. Longer reward delays are associated with lower indifference values (horizontal line at P = 0.5 choice), indicating reduced subjective value of later rewards despite constant physical reward magnitude. Data are from the intertemporal choice task using the adjusting amount procedure, separately for the four delays. (D) Hyperbolic fitting to decreasing indifference values across longer delays, as obtained from intertemporal choices between early and later rewards shown in C. Value = 162/(1 + 0.31 delay). Horizontal dotted line indicates constant objective reward value. (E) Subjective estimations of reward value in humans. Data are from the intertemporal choice task between an adjusted immediate and a fixed delayed reward (£20) using the adjusting amount procedure. (F) Hyperbolic fitting to decreasing indifference values across longer delays, as obtained from intertemporal choices between early and later rewards shown in E. Value = 20/(1 + 0.07 delay); 15 participants. Delays are derived from their mean estimated values in the PIP task.

Modified peak interval procedure (PIP)

In addition to the subjective valuation of delayed rewards, temporal delays themselves are perceived and processed in a subjective manner, showing variations among individuals (Meck, 2005). Thus a comprehensive analysis of timing processes should incorporate both objective and subjectively estimated delays. We use the PIP to assess the subjective time perception of reward delays in an objective manner (Roberts, 1981). In unrewarded PIP test trials, the same stimulus as in the main discounting task outlasts the normal reward time by three times the stimulus–reward interval. Our monkey experiments use anticipatory licking as the behavioural PIP measure (Fiorillo et al., 2008). Licking increases half-way through each stimulus–reward delay and declines close to its end, suggesting subjective delay estimation. In our human imaging experiments, participants press a particular PIP button once to indicate the expected time of reward and another PIP button once to terminate the PIP trial (Gregorios-Pippas et al., 2009). Participants underestimate all delays slightly, and the longest delay of 13.5 s by approximately 1–2 s (Fig. 1B). All further analyses for the human experiment are based on PIP estimated subjective delays.

Intertemporal choice task

The adjusting amount procedure assesses in an objective manner the subjective value of rewards delivered after different delays (Richards et al., 1997). Each trial contains two visual choice options. We use the same visual stimuli as in the Pavlovian task and present them at fixed left and right positions. Choice of one option produces the earliest reward whose amount is varied experimentally, whereas choice of the alternative option results in one of the later rewards whose amount is fixed. Systematic variation of the amount of the early reward allows us to establish psychometric functions of choice preference, measured as probability of choosing the early reward. Varying the early rather than the late reward allows us to assess subjective reward value as close as possible to the value-predicting stimulus. Choice indifference (choice probability P = 0.5) implies that the two options are valued as subjectively equivalent. The amount of the early reward that produces choice indifference indicates the subjective value of each late reward, as measured in millilitres (Figs 1C and D) or British pounds (£; Figs 1E and F).

In the neurophysiological experiment (Kobayashi & Schultz, 2008), monkeys indicate their choice by a saccadic eye movement from a central fixation spot to one of the two choice options. We adjust the amounts of the earliest reward (2 s delay) and measure the probability of choosing this reward. Sigmoidal fitting allows us to determine the choice-indifference point.
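The step from choice probabilities to an indifference point can be sketched as follows. This is a minimal illustration that assumes a two-parameter logistic function and a least-squares grid search; the fitting method actually used in the study is not specified here. The reward amounts and choice probabilities are synthetic, generated from a known logistic curve, and all names are hypothetical.

```python
import math


def logistic(amount, midpoint, slope_scale):
    """Probability of choosing the early reward as a function of its amount."""
    return 1.0 / (1.0 + math.exp(-(amount - midpoint) / slope_scale))


def fit_indifference_point(amounts, p_choose_early, midpoints, slope_scales):
    """Grid search for the logistic parameters with the smallest squared error.

    The fitted midpoint is the amount at which P(choose early) = 0.5,
    i.e. the choice-indifference point.
    """
    best = None
    for midpoint in midpoints:
        for scale in slope_scales:
            sse = sum(
                (logistic(a, midpoint, scale) - p) ** 2
                for a, p in zip(amounts, p_choose_early)
            )
            if best is None or sse < best[0]:
                best = (sse, midpoint, scale)
    return best[1], best[2]


# Synthetic example: early-reward amounts (ml) and noise-free choice
# probabilities generated from an assumed midpoint of 0.30 ml.
amounts_ml = [0.10, 0.20, 0.30, 0.40, 0.50]
true_midpoint_ml, true_scale_ml = 0.30, 0.05
p_early = [logistic(a, true_midpoint_ml, true_scale_ml) for a in amounts_ml]

midpoint_grid = [i / 100 for i in range(0, 61)]   # 0.00 to 0.60 ml
scale_grid = [i / 100 for i in range(1, 21)]      # 0.01 to 0.20 ml
indifference_ml, fitted_scale_ml = fit_indifference_point(
    amounts_ml, p_early, midpoint_grid, scale_grid
)
```

With real, noisy choice data a maximum-likelihood fit on trial counts would usually be preferable to least squares on proportions; the grid search is used here only to keep the sketch free of dependencies.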

In the human imaging experiment (Gregorios-Pippas et al., 2009), participants choose by differential button press between the two options. We obtain the choice-indifference point with the iterative and converging parameter estimation by sequential testing (PEST) procedure (Luce, 2000). The amount of an immediate reward (0 s delay) starts at 50% of the fixed amount of the later reward (£20) and is iteratively changed to produce preference reversal while halving the step size on every reversal, thus asymptotically approaching choice indifference. The adjusted amount of the immediate reward is shown immediately after button press.
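The converging adjustment described above can be sketched as a simple staircase. This follows only the description in the text (start at 50% of the later amount, move towards a preference reversal, halve the step on every reversal) and is not the complete PEST rule set. The initial step of £5, the stopping criterion, the simulated participant and all names are assumptions made for illustration.

```python
def staircase_indifference(prefers_immediate, late_amount=20.0,
                           initial_step=5.0, min_step=0.05, max_trials=200):
    """Adjust the immediate amount until choices reverse within a small step.

    `prefers_immediate(amount)` returns True if the immediate reward of
    size `amount` is chosen over the fixed later reward.
    """
    amount = 0.5 * late_amount      # start at 50% of the later reward
    step = initial_step             # assumed starting step size
    previous_choice = None
    for _ in range(max_trials):
        choice = prefers_immediate(amount)
        if previous_choice is not None and choice != previous_choice:
            step /= 2.0             # halve the step on every reversal
            if step < min_step:
                break               # converged: last two amounts bracket indifference
        # Make the immediate option less attractive if it was chosen,
        # more attractive if it was rejected.
        amount += -step if choice else step
        amount = min(max(amount, 0.0), late_amount)
        previous_choice = choice
    return amount


# Simulated, noise-free participant who values the delayed 20 pounds at 14 pounds.
hypothetical_subjective_value = 14.0
estimate = staircase_indifference(
    lambda amount: amount > hypothetical_subjective_value
)
```

For this deterministic participant the estimate ends close to the assumed subjective value; real participants choose with some noise, so convergence would be less regular.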

In each human and animal participant we fit the early reward amounts at choice indifference across the delays with different functions and obtain the discounting factors by minimizing the mean squared errors (Figs 1D and F). Employed functions are: hyperbolic, V = A/(1 + kD); and exponential, V = A e^(−kD), where V is value, A is amount of late reward, D is delay (in s) and k is discounting factor.
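The fitting step can be sketched as a one-parameter search for the k that minimizes the mean squared error. The indifference values below are not measured data: they are generated from the hyperbola given in the legend of Fig. 1D, so the example shows only the mechanics of the fit and cannot by itself demonstrate which function describes behaviour better. The grid search and all names are assumptions.

```python
import math


def hyperbolic(amount, delay, k):
    """V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay)


def exponential(amount, delay, k):
    """V = A * exp(-k * D)."""
    return amount * math.exp(-k * delay)


def fit_discount_factor(model, amount, delays, values, k_grid):
    """Return the k on the grid with the smallest mean squared error, and that error."""
    best_k, best_mse = None, float("inf")
    for k in k_grid:
        mse = sum(
            (model(amount, d, k) - v) ** 2 for d, v in zip(delays, values)
        ) / len(delays)
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse


# Synthetic indifference values generated from Value = 162/(1 + 0.31 delay).
delays_s = [2.0, 4.0, 8.0, 16.0]
scale = 162.0
synthetic_values = [hyperbolic(scale, d, 0.31) for d in delays_s]

k_grid = [i / 1000 for i in range(1, 1001)]   # 0.001 to 1.000
k_hyp, mse_hyp = fit_discount_factor(hyperbolic, scale, delays_s, synthetic_values, k_grid)
k_exp, mse_exp = fit_discount_factor(exponential, scale, delays_s, synthetic_values, k_grid)
```

A continuous optimiser would normally replace the grid, and with measured data the two models would be compared on their residual errors in the same way as `mse_hyp` and `mse_exp` here.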

Temporal value discounting in behavioural responses

Anticipatory licking changes depending on the length of the reward delay. Licking starts earlier and occurs on a higher proportion of trials with shorter reward delays, thus indicating higher subjective values for earlier rewards (Fiorillo et al., 2008; Kobayashi & Schultz, 2008).

Performance in the intertemporal choice task shows progressively lower indifference values for increasingly delayed rewards. In monkeys, indifference values decrease monotonically across the delays of 4, 8 and 16 s by approximately 25, 50 and 75%, respectively, compared to reward after 2 s (Fig. 1C; Kobayashi & Schultz, 2008). A hyperbolic discount function fits the decrease in reward value significantly better than an exponential function (Fig. 1D). Mean hyperbolic discounting factors in the two animals of the study are 0.17 and 0.31. Extension of the delays of both choice options results in preference reversal typical for hyperbolic discounting. In humans, the amount of the immediate reward in the PEST procedure converges regularly at choice indifference (Gregorios-Pippas et al., 2009). Indifference values decrease monotonically across the four delays of 4, 6, 9 or 13.5 s (Fig. 1E). Both hyperbolic and exponential functions fit the decrease without significantly different correlation coefficients R² (Fig. 1F). The mean hyperbolic discounting factor is 0.05 across all 15 participants. However, the mean discounting factor is 0.11 when seven participants who fail to discount are excluded. Note that the temporal delays in the range of a few seconds are much shorter than the delays of weeks and months used in other human temporal discounting studies. It is rather astonishing that humans indeed discount over such short delays, in particular as they receive the money outcome only after the experiment, and the lack of significant discounting in half of the participants is not entirely surprising.
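The preference reversal mentioned above follows directly from the shape of the hyperbola and does not occur with exponential discounting. The sketch below uses the discounting factor of 0.31 reported for one animal; the reward amounts, the delays and the added common delay of 20 s are hypothetical values chosen for illustration.

```python
import math


def hyperbolic_value(amount, delay, k):
    """V = A / (1 + k * D)."""
    return amount / (1.0 + k * delay)


def exponential_value(amount, delay, k):
    """V = A * exp(-k * D)."""
    return amount * math.exp(-k * delay)


def prefers_early(value_fn, early, late, k, added_delay=0.0):
    """True if the early option has the higher discounted value.

    `early` and `late` are (amount, delay in s) pairs; `added_delay`
    is added to both options.
    """
    early_amount, early_delay = early
    late_amount, late_delay = late
    return (value_fn(early_amount, early_delay + added_delay, k)
            > value_fn(late_amount, late_delay + added_delay, k))


# Hypothetical options: small reward after 2 s vs. double reward after 8 s.
early_option = (50.0, 2.0)
late_option = (100.0, 8.0)
k = 0.31

hyp_now = prefers_early(hyperbolic_value, early_option, late_option, k)
hyp_extended = prefers_early(hyperbolic_value, early_option, late_option, k, added_delay=20.0)
exp_now = prefers_early(exponential_value, early_option, late_option, k)
exp_extended = prefers_early(exponential_value, early_option, late_option, k, added_delay=20.0)
```

With hyperbolic discounting the early option is preferred at the original delays but the later, larger option is preferred once both are pushed 20 s into the future. With exponential discounting the ratio of the two values is independent of the added delay, so the preference stays the same.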

These data demonstrate substantial behavioural temporal discounting of reward value at delays of a few seconds in both humans and monkeys. However, humans show substantially weaker temporal discounting than monkeys at similar short delays, although it is unclear how reward value, which influences the steepness of discounting, compares between money for humans and juice for monkeys. Behavioural value discounting occurs despite constant reward rates in the ITI-adjusted schedule, suggesting that reward delay dominates the subjective valuation of delayed rewards over overall reward rate (amount per time).

Temporal value discounting in responses of dopamine neurons

Neuronal systems involved in the temporal discounting of reward value include the principal reward structures, namely the dopamine system, ventral striatum, orbitofrontal cortex and amygdala. Lesions of the ventral striatum or basolateral amygdala accentuate the preference of rats for small immediate over larger delayed rewards and thus increase temporal discounting (Cardinal et al., 2001; Winstanley et al., 2004), whereas excitotoxic and dopaminergic lesions of the orbitofrontal cortex decrease temporal discounting (Kheramin et al., 2004; Winstanley et al., 2004). Neurophysiological studies demonstrate that midbrain dopamine neurons code reward value. Their responses to reward-predicting stimuli increase monotonically with magnitude, probability and their summed product, expected value (Fiorillo et al., 2003; Tobler et al., 2005).

The majority of midbrain dopamine neurons respond with activation to reward-predicting stimuli. The dopamine responses decrease monotonically across the predicted reward delays (Fig. 2A), despite the same amount of reward being delivered after each delay (Fiorillo et al., 2008; Kobayashi & Schultz, 2008). Closer inspection of the population response reveals an initial, rather nondifferential, component and a subsequent, differential, part that decreases in amplitude with longer delays. The initial nondifferential component lasts until 110 ms after the stimulus and probably reflects response generalisation or pseudoconditioning for which dopamine neurons are known to be sensitive (Waelti et al., 2001; Tobler et al., 2003). Generalised responses are due to the physical similarity between conditioned, predictive, stimuli whereas pseudoconditioning arises when a ‘primary’ reinforcer sets a contextual background and induces nonspecific responses to any event within this context (Sheafor, 1975). The subsequent differential response decrease with increasing delays becomes significant at 110–360 ms after the stimulus (arrow and shaded area in Fig. 2B). Fitting exponential and hyperbolic functions to the responses of each dopamine neuron reveals slightly better overall goodness of fit for hyperbolic than for exponential discounting (Fig. 2C). Corresponding to the steeper behavioural discounting seen with smaller than with larger rewards (Kirby & Marakovic, 1995), reduction in reward magnitude to one-fourth produces significantly steeper neuronal discounting.

Figure 2.

Coding of subjective reward value by monkey dopamine neurons during temporal discounting. (A) Responses of single dopamine neuron to stimuli predicting the same physical reward magnitude after different delays. Responses decrease with increasing delay. The four stimuli indicating the specific reward delays alternated randomly, and the trial types are separated for analysis and display. For each rastergram, the sequence of trials runs from top to bottom. Black tick marks show times of neuronal impulses. Histograms show mean discharge rate for each delay. (B) Average population responses to reward-predicting stimuli decrease with increasing delay (87 neurons in two monkeys). Coloured traces refer to delays of 2 s (black, top), 4 s (blue), 8 s (green) and 16 s (orange, bottom). Shaded zone and arrow indicate the second, more specific, component of the neuronal response which varies particularly well with reward delay. (C) Hyperbolic fitting of mean normalised neuronal population response to reward-delay-predicting stimuli (54 neurons in one animal). Data are from shaded zone in B. CS− indicates response to unrewarded control stimulus.

These data suggest that temporal delays affect dopamine responses to reward-predicting stimuli in a similar manner as they affect behavioural licking and intertemporal choice preferences. The decrease in dopamine responses with increasing reward delay is indistinguishable from the effects of lower reward magnitude. This similarity suggests that temporal delays affect dopamine responses via changes in reward value. For dopamine neurons, delayed rewards seem to appear simply as if they were smaller rewards. Thus, dopamine neurons seem to code the subjective rather than the objective value of predicted delayed rewards.

An earlier study investigated responses of rat dopamine neurons with rewards of different delays and sizes (Roesch et al., 2007). The neurons show higher responses to stimuli predicting earlier rather than later liquid rewards of the same magnitude (0.5 vs. 1–7 s). Mostly the same neurons also show stronger responses to stimuli predicting larger rather than smaller rewards after identical delays. These results are compatible with the data obtained in monkeys. Interestingly, when tested during an intertemporal choice task, the dopamine responses to the simultaneously appearing stimuli reflect the more valuable of the two reward options irrespective of the subsequent choice, effectively dissociating the dopamine response from the overt behavioral choice.

Taken together, the results suggest that dopamine neurons show temporal discounting of reward value. The discounting would conceivably occur at the input stage during choices between differently delayed rewards. The discounting responses to reward-delay-predicting stimuli and the similarity with magnitude coding suggest that dopamine neurons code the subjective value as derived from multiple reward parameters such as delay and magnitude.

Temporal value discounting in frontal cortex and striatum neurons

Several studies report temporal discounting in reward-related activity of neurons in cortical and subcortical structures together with behavioural indices for temporal discounting of reward value. Premotor cortical neurons in monkeys show lower activations following visual instructions for delayed behavioural responses and rewards (Roesch & Olson, 2005a). Reversal of cue–delay associations leads to reversal of neuronal responses, suggesting a relationship to delay rather than visual stimulus properties. The decreases in premotor responses correlate well with slower behavioural reactions, indicating that the neuronal response decrease may reflect a reduction in general motivational factors by delays rather than reduced reward value per se. About one-third of task-related neurons in monkey dorsolateral prefrontal cortex show delay-related reductions in responses to chosen cue targets in choice trials (Kim et al., 2009). In the orbitofrontal cortex of monkeys, neurons show temporal discounting of cue responses (Roesch & Olson, 2005b). The same neurons also code reward magnitude, suggesting that temporal discounting may indeed reflect the reduced subjective valuation of reward. Probing reward magnitude coding in delay-discounting neurons is a good test for reward value, as only a subset of orbitofrontal neurons show the graded reward value coding typical of dopamine neurons. Orbitofrontal neurons also show reduced movement-related responses with increasing delays, but these responses do not seem to covary with explicitly tested reward value (Roesch et al., 2006). Neurons in the ventral striatum of rats show temporal discounting of responses to reward-predicting odours (Roesch et al., 2009).

Taken together, reward-related neuronal responses undergo temporal discounting in a number of brain structures outside the dopamine system, suggesting that subjective reward coding is not limited to dopamine neurons and might be a rather widespread phenomenon in many neurons coding reward information. This conclusion should not indicate that rewards are coded by all reward neurons in a subjective manner. Humans in particular are well able to assess reward value in an objective manner, but this capacity may involve cognitive mechanisms not yet investigated in animal experiments.

Temporal value discounting in human brain

As invasive neurophysiological studies are routinely only possible in animals, the knowledge gained from these studies should be used to interpret the responses obtained in human imaging studies and extend the human studies further. However, the experimental conditions of most human temporal discounting studies differ in several important aspects from those employed in animals. Many human discounting studies identify separate brain systems for mediating immediate and delayed rewards (McClure et al., 2004), except one investigation assessing scalar discounting (Kable & Glimcher, 2007). Furthermore the reward delays of days, weeks and months are well beyond the range of a few seconds used in animals, and even the shortest tested delays of minutes are impractical with animals (McClure et al., 2007). Although hypothetical and real monetary rewards may produce similar discounting (Johnson & Bickel, 2002), any reward paid out after long delays as a sum over many trials constitutes a less direct and motivating event than a reward delivered immediately after every trial.

Human neuroimaging studies demonstrate consistent blood oxygen level-dependent (BOLD) responses to reward in the ventral striatum (O’Doherty, 2004). These signals reflect reward value by coding the quantity and probability of reward (Knutson et al., 2005; Preuschoff et al., 2006; Tobler et al., 2007b). Regression analysis of BOLD responses to the Pavlovian conditioned stimuli predicting rewards after delays of 4, 6, 9 and 13.5 s identifies a region in the ventral striatum in which the BOLD responses decrease monotonically with increasing delay (Fig. 3A). Similar to the relatively mild behavioural discounting with these short delays (Fig. 1F), the decrease in BOLD responses is small when averaged across all participants. However, median split of the group of 15 human participants into seven behavioural discounters and seven nondiscounters demonstrates significant decreases in BOLD responses in the discounters but not in the nondiscounters (Fig. 3B). The decrease in BOLD responses is well fitted by both hyperbolic and exponential functions in the discounters, with a slightly but significantly better fit for exponential than hyperbolic functions (Fig. 3C). Lower reward magnitudes (£5) induce stronger BOLD discounting than larger magnitudes (£20). Thus the discounting of BOLD responses occurs with Pavlovian reward predictors irrespective of choice, becomes steeper with lower reward magnitudes and, due to the constant cycle time, reflects the delay rather than the rate of reward.

Figure 3.

Coding of subjective reward value in human ventral striatum during temporal discounting. (A) Location of activation in ventral striatum, as derived from regressing BOLD responses on individual indifference values measured in each participant in the intertemporal choice task. (B) Time courses of BOLD responses to stimuli predicting four different reward delays. Data are shown separately for seven discounters (top) and seven nondiscounters (bottom), as defined by behavioural discounting. Delays are indicated as assessed in the PIP task. (C) BOLD responses to reward-delay-predicting stimuli, separately for discounters and nondiscounters. Data show signal changes at peaks of time courses (4 s after stimulus onset; shaded intervals in B) at peak voxel of BOLD response shown in A. Hyperbolic function fits the data (mean ± SEM) based on mean subjective PIP-estimated delays from discounters and nondiscounters, respectively, for objective intervals of 4, 6, 9 and 13.5 s. For similar fitting with exponential function, see Gregorios-Pippas et al. (2009). (D) Correlations between BOLD and behavioural discounting for hyperbolic fits to the data (Pearson correlation on 15 participants). The y-axis shows discount factors from fits to peak BOLD responses of individual participants to the differential reward-delay-predicting stimuli in the discounting task. The x-axis shows behavioural factors from fits to reward values at behavioural choice indifference in the intertemporal choice task using adjusting-amount and PEST procedures. Hyperbolic fits: R² = 0.55 (P < 0.01 against 0 slope). Measures shown in B–D are from peak voxel of circled area shown in A.

The analysis separating discounters from nondiscounters suggests that the decreases in BOLD responses in the ventral striatum are related to individual degrees of behavioural discounting. A more formal analysis demonstrates significant correlations between discounting factors for BOLD and behavioural responses in individual participants, for both hyperbolic and exponential functions (Fig. 3D). These results suggest that the decreases in BOLD responses to reward-delay-predicting stimuli match behavioural discounting not only between the two categorical groups of discounters vs. nondiscounters but also at the level of individual participants.

A previous human imaging study suggests an influence of financial status on BOLD responses during learning (Tobler et al., 2007a). Individuals with higher assets would value the modest monetary rewards of the study less than would poorer participants, as marginal utility usually decreases with increasing personal finances (Kreps, 1990). In keeping with steeper behavioural discounting with smaller compared to larger rewards (Kirby & Marakovic, 1995), the subjectively lower reward magnitudes of richer participants should lead to stronger discounting. Indeed, ventral striatal BOLD responses in the seven discounters decrease more in participants with higher rather than lower assets for delays of 13.5 s, along with mildly steeper behavioural discounting in the intertemporal choice task. Thus the brains of richer participants seem to discount rewards more steeply across intervals of a few seconds. Although these laboratory data conform with the idea of steeper discounting of subjectively smaller rewards, they run counter to the notion that humans are usually more risk-seeking with smaller rewards, as more risk-seeking is commonly associated with less discounting. Future discounting experiments may address this issue by controlling for both reward size and risk attitude.

Other studies find substantial decreases in BOLD responses to reward-delay-predicting stimuli across longer delays of hours, days and months. As in our study, the decreases occur in the ventral striatum but also in specific regions of the frontal cortex (Kable & Glimcher, 2007). In studies aiming to distinguish neural processes categorically between immediate and larger rewards, prediction of the early reward activates the striatum whereas waiting for later rewards is associated with activations in several frontal structures (McClure et al., 2004, 2007; Tanaka et al., 2004). BOLD responses in the striatum are related to individual degrees of discounting (Hariri et al., 2006; Wittmann et al., 2007).

Taken together, these results demonstrate increasingly smaller striatal BOLD responses with increasing temporal delays of reward in a key reward structure, the ventral striatum. The decrease in BOLD responses within the time frame of single-neuron studies may reflect the known temporal sensitivities of reward responses in dopamine and orbitofrontal neurons projecting to the ventral striatum. Although striatal BOLD responses to reward usually reflect reward value (Knutson et al., 2005; Preuschoff et al., 2006; Tobler et al., 2007b), none of the studies specifically identified a signal for reward value per se without discounting; thus there is a faint possibility that some of the decreases in BOLD responses might simply reflect temporal delays rather than a delay-induced decrease in reward value. Nevertheless, the data suggest that the reduction in BOLD responses in the ventral striatum with temporal delays reflects the decrease in subjective reward value. In addition, both the interpersonal variations in discounting and the effects of personal assets confirm the subjective nature of neuronal reward processing with temporal delays.

Subjective coding of risky rewards

Conceptual background

Rewards are objects that are advantageous or necessary for the survival of individuals in a variety of environmental situations. Individuals live in a world with limited resources, which are incompletely known (epistemological uncertainty), partly unpredictable or inherently stochastic. Thus, uncertainty is an inherent feature of resources in many natural situations and importantly determines survival. Faced with uncertainty, humans tend to engage in superstitious beliefs and behaviour in order to reduce the uncertainty and bias uncertain events towards favourable outcomes. The degree of superstition is often inversely related to the degree of knowledge of the world, and the acquisition of knowledge can reduce superstition. This relationship suggests that human brains detect the uncertainty in environmental events. Thus, comprehensive investigations of the neuronal mechanisms underlying behaviour directed at obtaining important resources should include uncertainty.

Rewards have specific magnitudes and occur with specific probabilities. Therefore rewards can be adequately described by probability distributions of reward values, and choices between two rewards can be viewed as choices between probability distributions. Decision makers prefer probability distributions of reward with the highest expected value (the anticipated ‘mean’, first moment of the distribution) while taking into account the variance (second moment) or its square root, the standard deviation. Variance or standard deviation reflects the spread of a distribution and indicates how far possible values are away from the expected value. They are measures of the degree of uncertainty in known probability distributions and are referred to as risk (Markowitz, 1952). An alternative, often more appropriate, measure of risk is the coefficient of variation, defined as the standard deviation divided by the expected value (Weber et al., 2004). Intuitively, ‘risk’ refers to the chance of winning or losing relative to the expected value of the probability distribution, rather than the more narrow, common-sense association with loss. By contrast, ‘ambiguity’ refers to the uncertainty in probability distributions that are incompletely known or difficult to capture.
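The quantities defined in this paragraph can be computed directly for any discrete reward distribution. The following is a minimal sketch; the function name `describe_gamble` and the example distributions (a sure outcome of 5 units and an equiprobable gamble between 1 and 9 units) are illustrative assumptions.

```python
import math


def describe_gamble(outcomes):
    """Summarise a discrete reward distribution.

    outcomes: list of (magnitude, probability) pairs whose probabilities sum to 1.
    Returns expected value (first moment), variance (second central moment),
    standard deviation and coefficient of variation (SD / expected value).
    """
    total_p = sum(p for _, p in outcomes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    ev = sum(m * p for m, p in outcomes)
    var = sum(p * (m - ev) ** 2 for m, p in outcomes)
    sd = math.sqrt(var)
    cv = sd / ev if ev != 0 else float("nan")
    return {"expected_value": ev, "variance": var, "sd": sd, "cv": cv}


# Two hypothetical options with the same expected value but different risk.
gamble = describe_gamble([(1.0, 0.5), (9.0, 0.5)])
safe = describe_gamble([(5.0, 1.0)])
print(gamble)
print(safe)
```

Both options share the same expected value; only the variance, standard deviation and coefficient of variation distinguish the gamble from the sure outcome.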

In order to make informed choices, decision makers need predictive information about the probability distributions they are about to choose from. With incomplete predictions, decision making will degenerate towards guessing. Predictions about future rewards are determined by the contingency (dependency) of the reward on a particular stimulus or action (Rescorla, 1967). In the simplest form, Pavlovian predictions are learned without the active participation of the subject. The range of Pavlovian predictions beyond the classically studied vegetative reactions (Pavlov, 1927) is open and debatable. However, for the sake of simplicity we may call any prediction of reinforcers ‘Pavlovian’ if the acquisition of these predictions does not require the subject’s own action. Importantly, with this definition Pavlovian predictions inform about probability distributions of rewards. Thus, Pavlovian predictions include at least the first two moments of these distributions, namely the value (expected value) and the risk (variance or standard deviation). With this concept, classic Pavlovian predictions allow subjects to take the risk of outcomes into account when making informed decisions.

Risk influences behavioural choices depending on the local risk attitude of the individual decision maker. In risk-averse choices, individuals prefer options with lower rather than higher risk, all other parameters being equal. Behavioural attitudes toward risk vary between individuals and depend within the same individual on the domain of the risky event and the situation in which it occurs, as shown in animal foraging (Caraco et al., 1980, 1990) and human choices (Weber et al., 2002). Risky choices are determined by the subjectively perceived riskiness of options, even with constant absolute degree of risk (Weber & Milliman, 1997). Risk attitudes can be related to temporal discounting; steeper discounting results in lower valuation of later higher rewards that would compensate earlier losses, thus leading to risk aversion. Thus risk aversion might be partly explained by temporal discounting, whereas risk seeking would be anticorrelated with temporal discounting and would require overcoming the reduced value of delayed compensation.

Risk attitude can be derived from the subjective utility of objective values assessed by choice preferences. A gradually flattening, concave utility function indicates that the gains achieved by higher outcomes become gradually less important (Fig. 4A). When risk is involved, the steeper slope of the function at lower gains suggests that potential losses loom larger than gains relative to an intermediate reference. For example, a risk avoider prefers a sure gain of £5 over an option with an equal probability of winning £1 or £9 (each P = 0.5). Due to the steep slope at low values, receiving £1 reduces utility relative to £5 by more than receiving £9 increases it, thus discouraging the choice of risky options. Higher risk increases the tendency to choose safe outcomes because the differences in slope become important with larger ranges. Thus risk-avoiders appear to assign lower subjective values to options with higher rather than lower risk, despite the same mean objective value (£5). By contrast, with a convex utility function the slope increases with increasing values, indicating that lower values have little utility (‘small change’) whereas higher values produce disproportionately more utility (Fig. 4B). Convex utility functions are associated with preference for higher rather than lower risk options, called risk-seeking. The utility gained from outcomes exceeding the mean more than offsets the loss incurred by outcomes below the mean; the steeper slope at higher gains encourages the choice of risky options. Thus risk preference can be inferred from utility functions. Taken together, risk influences the subjective valuation of outcomes, and choices are determined not only by the value of outcomes but also by their risk. The valuation of risky outcomes, and possibly even the preceding assessment of risk itself, appear to be subjective.
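The £5 versus £1-or-£9 example can be worked through numerically. The sketch below assumes a power utility function u(x) = x^a, which is one convenient choice and not a function taken from the studies reviewed here; exponents below 1 give a concave function, exponents above 1 a convex one. All function names are illustrative.

```python
def power_utility(x, exponent):
    """Assumed utility function u(x) = x ** exponent (illustrative choice)."""
    return x ** exponent


def expected_utility(outcomes, exponent):
    """Probability-weighted utility of (magnitude, probability) pairs."""
    return sum(p * power_utility(m, exponent) for m, p in outcomes)


def certainty_equivalent(outcomes, exponent):
    """Sure amount whose utility equals the expected utility of the gamble."""
    return expected_utility(outcomes, exponent) ** (1.0 / exponent)


gamble = [(1.0, 0.5), (9.0, 0.5)]  # equiprobable gamble with mean 5
safe_amount = 5.0

for label, exponent in [("concave", 0.5), ("linear", 1.0), ("convex", 2.0)]:
    ce = certainty_equivalent(gamble, exponent)
    if ce < safe_amount:
        preference = "safe option preferred (risk aversion)"
    elif ce > safe_amount:
        preference = "gamble preferred (risk seeking)"
    else:
        preference = "indifferent (risk neutrality)"
    print(f"{label:8s} certainty equivalent = {ce:.2f}: {preference}")
```

Under these assumptions, the certainty equivalent of the gamble falls below its mean for the concave function and rises above it for the convex function, which is the sense in which risk lowers or raises subjective value.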

Figure 4.

 Theory and design for risk experiments in humans. (A) Hypothetical concave utility function with single concave component, associated with risk aversion. Based on such a utility function, decision makers would prefer a safe choice of £5 over a gamble of £1 or £9 occurring with equal probability (P = 0.5 each), as the loss from obtaining £1 weighs more than the gain of £9 (arrow). (B) Hypothetical convex utility function associated with risk seeking. Such a function would be associated with higher subjective value of the gamble compared to the safe outcome. (C) Expected reward value and risk as a function of reward probability. Expected reward value, measured as mathematical expectation of reward, increases monotonically with reward probability (filled circles). Expected value is minimal at P = 0 and maximal at P = 1. Risk, measured as reward variance, follows an inverted U-function of probability and is minimal at P = 0 and P = 1 and maximal at P = 0.5 (open squares). (D) Experimental stimuli used for testing reward value and risk. Twelve different stimuli are associated with different reward magnitudes (ordinate) and probabilities (abscissa) as shown. Expected value of stimuli (sum of probability-weighted magnitudes) is indicated below stimuli and increases with distance from origin. Circles connected by lines indicate two-choice options with two identical expected values, respectively (100 and 200 points) but each with different risk due to specific magnitude–probability combinations.

Experimental design

Although probability denotes the frequency of uncertain events, it is not by itself a monotonic measure of risk. For example, in a two-outcome situation such as reward vs. no reward, outcome value increases monotonically with the probability of outcome whereas risk is maximal at P = 0.5. Thus the degree of risk follows an inverted U-function peaking at P = 0.5 (Fig. 4C). At P = 0.5, there is exactly as much chance to obtain a reward as there is to miss a reward, whereas higher and lower probabilities than P = 0.5 make gains and losses, respectively, more certain and thus are associated with lower risk. Thus, the design distinguishes risk, which varies according to an inverted U-function of probability, from expected value, which increases monotonically with probability.
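The dissociation between expected value and risk in a reward vs. no-reward situation follows from the standard formulas for a two-outcome distribution: expected value is P × magnitude and variance is P × (1 − P) × magnitude². The sketch below tabulates both; the unit magnitude and the probability steps are chosen here for illustration.

```python
def binary_reward_stats(probability, magnitude=1.0):
    """Expected value and variance of a reward of given magnitude
    delivered with the given probability (and no reward otherwise)."""
    expected_value = probability * magnitude
    variance = probability * (1.0 - probability) * magnitude ** 2
    return expected_value, variance


probabilities = [0.0, 0.25, 0.5, 0.75, 1.0]
table = [(p, *binary_reward_stats(p)) for p in probabilities]

for p, ev, var in table:
    print(f"P={p:.2f}  expected value={ev:.4f}  variance={var:.4f}")
```

Expected value rises steadily across the rows, whereas variance is zero at both ends and largest at P = 0.5.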

Risk prediction

The use of imperative, no-choice tasks facilitates the study of basic neuronal mechanisms of risk that are independent of choice and occur before a decision is made. Many processes intervene between the reception of key decision information and the overt behavioural choice. Neuronal signals track expected reward value and risk at an initial perceptual level, and additional subsequent neuronal processes may determine the final choice, including the comparison of previously signalled action values (Sutton & Barto, 1998). Thus, a first step in investigating neuronal mechanisms of reward might focus on neuronal value and risk signals without choice. Nevertheless, to be meaningful for decision making, neuronal reward signals should be also investigated in choice situations.

In a typical experiment in humans, specific pictures predict specific reward magnitudes (100–400 points in steps of 100) at specific probabilities (P = 0.0–1.0 in steps of 0.25), resulting in specific expected value and variance predicted by each stimulus (Fig. 4D; Tobler et al., 2007b). Only one stimulus is presented in imperative trials without choice, whereas two stimuli are shown simultaneously in choice trials. The outcome is presented as the number of points gained (0–400), of which 4% are summed and paid out as British Pence immediately after the experiment.

Risk preference

The risk attitude of participants influences the choice between two simultaneously presented stimuli associated with low and high risk but the same expected value (e.g. connected circles in Fig. 4D). The risky gamble produces one of two equiprobable (P = 0.5) reward magnitudes. Each time the participant chooses the more certain stimulus, the factor of risk aversion increases by one, whereas choosing the more uncertain stimulus decreases it by one (four choices total). An average positive factor indicates risk aversion, a negative factor indicates risk seeking, and a zero factor risk neutrality. We also determine risk attitude at choice indifference by identifying for each risky option the safe amount for which participants are indifferent between the risky and the safe option (certainty equivalent), using the PEST procedure. In another risk assessment, participants rate the pleasantness of the risk-predicting stimuli on a scale ranging from 5 (very pleasant) to −5 (very unpleasant). We quantify risk aversion by comparing the ratings for (P = 0.25 + P = 0.75) and P = 1.0 (Wakker, 1994). Risk attitudes measured by choice preferences and subjective pleasantness ratings correlate in our experiments with factors around r = 0.6 (Tobler et al., 2007b, 2009). Using these risk assessments with different expected values allows us to determine the coding of reward value separately from risk.
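The choice-based scoring rule described above can be sketched in a few lines. The function names and the encoding of choices as the strings `"safe"` and `"risky"` are assumptions made for illustration; the sketch covers only the counting rule and the sign-based classification, not the PEST procedure or the pleasantness ratings.

```python
def risk_aversion_factor(choices):
    """Add 1 for each choice of the more certain stimulus,
    subtract 1 for each choice of the more uncertain stimulus."""
    factor = 0
    for choice in choices:
        if choice == "safe":
            factor += 1
        elif choice == "risky":
            factor -= 1
        else:
            raise ValueError(f"unknown choice label: {choice!r}")
    return factor


def classify_risk_attitude(factor):
    """Positive factor: risk averse; negative: risk seeking; zero: risk neutral."""
    if factor > 0:
        return "risk averse"
    if factor < 0:
        return "risk seeking"
    return "risk neutral"


# Hypothetical participant making four choices.
choices = ["safe", "safe", "risky", "safe"]
factor = risk_aversion_factor(choices)
print(factor, classify_risk_attitude(factor))
```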

Subjective coding of reward risk

When different visual stimuli predict reward with different probabilities, BOLD responses in the lateral orbitofrontal cortex vary according to an inverted U-function of probability without significantly varying with reward value (Tobler et al., 2007b). These data indicate the coding of the risk in the different probability distributions. Risk coding is also found, separately from value coding, in the ventral striatum, subthalamic nucleus, mediodorsal thalamic nucleus, midbrain and bilateral anterior insula when the interval between the prediction and resolution of risk is extended to several seconds (Preuschoff et al., 2006). These latter risk signals have longer latencies than the orbitofrontal risk signal and occur in brain structures that receive dopamine afferents, possibly reflecting input from the similarly slow dopamine risk signal (Fiorillo et al., 2003). The differences in time course between the striatal and orbitofrontal responses may reflect different functions of these risk signals.

A good test for the subjective coding of risk might be to correlate the risk signal with individual risk attitudes across different individuals, as measured by their choice preferences. Indeed, the risk signal, as defined above by fitting an inverted U-function of probability, increases in the lateral orbitofrontal cortex with individual degrees of risk aversion (Fig. 5, top). Risk avoiders seem to have a particularly substantial signal indicating the degree of risk in the upcoming reward (right), whereas risk seekers lack such a signal (left). By contrast, a risk signal in the medial frontal cortex increases with risk seeking (Fig. 5, bottom). The signal is particularly strong in risk seekers (left) and, if used by the brain for biasing decisions, might drive individual choices toward the more risky options.

Figure 5.

 Relation of human frontal risk signals to individual risk attitude. (A) Location of BOLD signal for risk in lateral orbitofrontal cortex covarying with risk as inverted U-function of probability. The signal increases with increasing risk aversion across participants. Risk attitude is measured by choice preferences. (B) Correlation of contrast estimates for risk of individual participants with individual risk aversion. (C) Location of BOLD signal for risk in medial frontal cortex covarying with risk seeking. (D) Risk correlation analogous to B.

These data suggest that risk signals are not the same across different individuals but vary according to individual risk attitudes, suggesting subjective coding of risk. The individual variations in risk signals may explain the different attitudes of individuals towards risk and might influence their choices in risky situations.

Subjective valuation of risky rewards

Risk attitudes determine choice preferences in risky situations. It is generally assumed that choices are directed toward the most highly valued outcomes. Thus, choices biased by risk attitude are based on the subjective valuation of risky outcomes. A more complete investigation of neuronal risk mechanisms should not only assess individual, subjective variations of risk signals but, importantly, consider the influence of risk on reward value.

BOLD signals in parts of prefrontal cortex code expected reward value irrespective of individual risk attitudes. The same BOLD signal also codes risk; the risk coding, but not the value coding, varies with individual risk attitude (Tobler et al., 2007b). These data reveal a combined value and risk signal whose risk component appears to be subjective. However, the result does not yet demonstrate a neuronal correlate for the influence of risk on subjective reward valuation. What we need is not only a signal that codes both value and risk but a direct influence of risk on the value signal, and that influence of risk on the value signal should depend in a consistent manner on risk attitude. This is exactly what BOLD responses in parts of prefrontal cortex do.

BOLD responses in the lateral prefrontal cortex increase with increasing expected value irrespective of risk attitude, suggesting value coding (Fig. 6A; Tobler et al., 2009). Time courses of value responses to the low-risk options are similar irrespective of risk attitude (Figs 6B and C; blue curves). These value-coding activations are influenced by different levels of risk. Importantly, the influence depends on individual risk attitudes measured by choice preferences. The value signal decreases with increasing risk in risk avoiders (Fig. 6B; blue lines with downward arrows toward red lines) but increases with increasing risk in risk seekers (Fig. 6C; upward arrows). The changes occur in both imperative and choice situations. These results suggest a remarkable integration of risk into expected-value signals in the prefrontal cortex.

Figure 6.

 Influence of risk attitude on subjective valuation of risky outcomes in human lateral prefrontal cortex. (A) Location of BOLD signal for risk attitude-dependent influence of risk on reward value. The risk signal is defined by correlation with risk as an inverted U-function of probability. (B) Time courses of BOLD value signal with low-risk outcomes (blue curves). In risk avoiders, increasing risk (red curves) leads to collapse of value coding (green arrows), suggesting reduced subjective valuation of more risky outcomes compared to less risky outcomes. (C) Time courses of BOLD value signal increasing with risk in risk seekers, suggesting increased subjective valuation of more risky than of less risky outcomes. In B and C, the average variance of low-risk and high-risk outcomes is 2500 and 20 000 points², respectively. Low and high expected values are averaged across 50 and 100, and across 150 and 200, points respectively. Time courses are averaged separately for risk avoiders and risk seekers. Risk attitude is measured by choice preferences.

The subjective nature of the BOLD value signal for risky rewards relates well to the influence of risk on behavioural decisions conceptualised by expected utility theory. Just as the lower preference of risk avoiders for risky options reflects lower subjective valuation of risky outcomes (Fig. 4A), the BOLD value responses decrease with more risky outcomes (Fig. 6B). In analogy, the preference of risk seekers for more risky options demonstrates heightened subjective value of risky outcomes and is paralleled by stronger value responses for risky outcomes (Fig. 6C). The correlation between behavioural and lateral prefrontal BOLD responses to risk relates well to alterations in risk attitude induced by electrical stimulation to the prefrontal cortex (Knoch et al., 2006; Fecteau et al., 2007). Taken together, this part of the prefrontal cortex values risky rewards in a subjective manner according to individual risk attitude, reporting less value in risk avoiders but more value in risk seekers.

These neuronal results correlate well with the role of risk in behavioural preferences conceptualised by expected utility theory. The integration of risk and expected value into a subjective value signal is reminiscent of the mean-variance approach in finance theory, which views expected utility as a function of the expected value and risk, based on Taylor series expansion using utility functions (Levy & Markowitz, 1979). However, the data should not be taken to refute neuronal correlates for other approaches to subjective valuation of outcomes such as scalar expected utility theory or prospect theory. Even if these data relate to the mean-variance approach to utility, it is conceivable that other brain structures instantiate other theoretical notions of subjective outcome valuation.
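The Taylor series argument behind the mean-variance approach can be sketched as follows. This is the generic second-order expansion of a twice-differentiable utility function u around the expected value, given here as an illustration rather than as the specific formulation of the cited work; the symbols μ and σ² are introduced for this sketch.

```latex
\[
E[u(x)] \;\approx\; u(\mu) + \tfrac{1}{2}\,u''(\mu)\,\sigma^{2},
\qquad \mu = E[x], \quad \sigma^{2} = E\big[(x-\mu)^{2}\big]
\]
```

The first-order term vanishes because E[x − μ] = 0. With a concave utility function (u'' < 0) variance lowers expected utility, whereas with a convex function (u'' > 0) variance raises it, in line with the opposite effects of risk on the value signal in risk avoiders and risk seekers.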

The assessment of expected utility is traditionally based on preferences during choices. The similarity of ventrolateral prefrontal activations in imperative and choice situations suggests that risk attitude may influence the subjective valuation of risky outcomes in this part of frontal cortex irrespective of overt choices. Subjective reward coding in imperative trials facilitates data interpretation and suggests that the mechanism is operational at the input stage of processes leading to potential choices.


The essential nature of rewards for the survival of the individual and its genes in a competitive environment with limited resources exerts pressure on the brain to process reward with the highest possible energy efficiency. Architecture and energy demand would benefit greatly from processing only the most salient features of rewards at any given time rather than maintaining a complete representation of all the resources in the world at all times, most of which will not be relevant at a given moment. Reward mechanisms would be more energy-efficient by focussing only on those parameters that are currently important for making choices by a given individual, and devoting only a minimum of additional processing to the general situation the individual is currently in. Thus, neuronal mechanisms underlying decisions between early and late rewards would consume relatively little energy if they require only a comparison between the current subjective values of the different options. Similarly for decisions under uncertainty, less neuronal processing is required for a quick and undemanding decision by weighting the risk information according to the current risk attitude of the particular individual rather than getting an objective risk assessment common to all individuals. The subjective valuation of risky outcomes is then a consequence of subjective risk coding. Thus, the subjective neuronal coding of important decision variables of reward may reflect the energy efficiency in processing information imposed by evolutionary pressure.


The work in the author’s laboratory was supported by the Wellcome Trust, the Behavioural and Clinical Neuroscience Institute (BCNI) Cambridge, the Human Frontiers Science Programme, and other grant and fellowship agencies.