Surprise beyond prediction error

Abstract

Surprise drives learning. Various neural “prediction error” signals are believed to underpin surprise-based reinforcement learning. Here, we report a surprise signal that reflects reinforcement learning but is neither un/signed reward prediction error (RPE) nor un/signed state prediction error (SPE). To exclude these alternatives, we measured surprise responses in the absence of RPE and accounted for a host of potential SPE confounds. This new surprise signal was evident in ventral striatum, primary sensory cortex, frontal poles, and amygdala. We interpret these findings via a normative model of surprise. Hum Brain Mapp 35:4805–4814, 2014. © 2014 The Authors. Human Brain Mapping Published by Wiley Periodicals, Inc.

INTRODUCTION

The concept of prediction error has taken center stage in many theories of learning, most notably in reinforcement learning. In “model-free” reinforcement learning, reward prediction errors (RPEs) learn the value of being in some context or state (Balleine et al., 2008; Dayan and Niv, 2008; McClure et al., 2003; O'Doherty et al., 2004). In “model-based” reinforcement learning, state prediction errors (SPEs) can learn an internal model of probable consequences of being in some state—i.e., they learn state transition probabilities. In both cases, PEs capture how “surprising” a reward/state is and how to adjust expectations accordingly (see Information Box). PE theories are appealing because of their conceptual simplicity: they simply learn from unexpected events. Yet it is unclear whether all surprise is reducible to some un/signed PE. This is because most experiments confound different forms of surprise: events far from the average, “expected” value are also improbable. We therefore looked for evidence of improbability-based surprise not reducible to RPE or SPE. We specifically asked whether two identical rewards, with identical RPE, could evoke different brain responses based on their relative probability, while taking care to exclude SPE mechanisms.

In our paradigm, some cues predicted bimodal rewards (one or three coins arose frequently, while two coins were rare; see Fig. 1). Thus subjects seldom observed the average number of coins and rarely received the average monetary payment (the “expected reward”). Instead they usually received the extreme payments of one and three coins. Because RPEs reflect the difference between observed and expected reward, and the average reward corresponds to the expectation, RPEs are zero when participants observe two coins (i.e., $\delta = 2 - 2 = 0$, see Information Box). However, surprise should be highest for these very same trials where RPEs are zero on average. Other cues predicted unimodal rewards: rewards for which this expected value (two coins) was frequent and unsurprising (see Fig. 1). According to model-free reinforcement learning, no learning takes place in the absence of an RPE; i.e., these theories provide no mechanism whereby the subject can learn that the two-coin outcome is surprising in one case but not the other.
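As a worked illustration of this argument, the expected reward and the surprise at a two-coin outcome can be written out explicitly. The sketch below assumes the outcome proportions reported in the Methods (18/40, 4/40, 18/40 for the bimodal case and 4/40, 32/40, 4/40 for the unimodal case); the symbols $r$, $\delta$ and the use of base-2 logarithms are notational assumptions made here.

```latex
\begin{align*}
E[r \mid \text{bimodal}]  &= 1\cdot\tfrac{18}{40} + 2\cdot\tfrac{4}{40} + 3\cdot\tfrac{18}{40} = \tfrac{80}{40} = 2,\\
E[r \mid \text{unimodal}] &= 1\cdot\tfrac{4}{40} + 2\cdot\tfrac{32}{40} + 3\cdot\tfrac{4}{40} = \tfrac{80}{40} = 2,\\
\delta &= 2 - E[r] = 0 \quad \text{for a two-coin outcome under either cue type},\\
-\log_2 \tfrac{4}{40} &\approx 3.32 \text{ bits (bimodal)}, \qquad
-\log_2 \tfrac{32}{40} \approx 0.32 \text{ bits (unimodal)}.
\end{align*}
```

The RPE is identical in both cases, whereas the improbability of the same two-coin outcome differs by about three bits.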

Figure 1.

The trial structure. With 1/3 probability one of three cues is randomly presented. Cues were presented for 0.8 s (behavior) or 1–3 s (fMRI), immediately followed by 1, 2, or 3 monetary units, presented with the indicated conditional probabilities. All cues/outcomes are presented the same number of times, the only predictable structure being in different probabilistic associations between each cue and the reward. Timings for the fMRI and pure behavioral studies are given above.

In contrast, a “model-based system,” which encodes how likely each possible outcome is, may exploit SPEs to learn this discrimination (see Information Box). For this reason, differential brain responses to the two-coin outcome must reflect the model-based system, being either un/signed SPE or surprise per se, i.e., conditional improbability. We seek to identify surprise per se, by contrasting hemodynamic responses to the improbable versus probable “expected reward”, i.e., the two-coin outcome under bimodal versus unimodal distributions, while including SPE covariates in our statistical analysis.

Information Box: Model-Free RPEs and Model-Based SPEs

Surprise as captured by prediction error (PE or $\delta$) has played an essential role in the interpretation of data from single cell recording and from neuroimaging studies (Friston, 2009; Glimcher, 2011; Rescorla and Wagner, 1972; Schultz and Dickinson, 2000; Schultz et al., 1997; Sutton and Barto, 1998): PE is defined as the difference between observed and expected quantities. A scalar RPE features in theories of “model-free” reinforcement learning and permits subjects to calibrate their reward expectations (Rescorla and Wagner, 1972; Schultz et al., 1997; Sutton and Barto, 1998). Following cue $c$, the RPE $\delta$ simply codes the difference between received and expected reward, $\delta = r - V(c)$. This RPE is signed, meaning that more reward than expected, corresponding to positive RPE, has a different meaning from (i.e., is “better than”) less reward than expected, which corresponds to a negative prediction error. During learning, the expected reward $V(c)$ may be updated on each trial according to $V(c) \leftarrow V(c) + \alpha\,\delta$, where $\alpha$ is a learning rate parameter. One could argue though that the amount of surprise should not depend on the sign of the RPE. This notion can be captured with unsigned RPEs which are simply the absolute value of $\delta$, denoted $|\delta|$. Unsigned RPE can be used to guide attention.
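The model-free update described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `rescorla_wagner`, the learning rate of 0.02, the initial expectation and the random seed are assumptions, and the outcome weights are taken from the proportions given in the Methods.

```python
import random

def rescorla_wagner(rewards, alpha=0.1, v0=0.0):
    """Return (final expected reward, list of signed RPEs) for a single cue."""
    v = v0
    rpes = []
    for r in rewards:
        delta = r - v        # signed RPE: received minus expected reward
        rpes.append(delta)
        v += alpha * delta   # move the expectation a fraction alpha toward the outcome
    return v, rpes

# Hypothetical simulation: one bimodal and one unimodal cue (weights from the Methods).
rng = random.Random(0)
bimodal = rng.choices([1, 2, 3], weights=[18, 4, 18], k=2000)
unimodal = rng.choices([1, 2, 3], weights=[4, 32, 4], k=2000)
v_bimodal, _ = rescorla_wagner(bimodal, alpha=0.02)
v_unimodal, _ = rescorla_wagner(unimodal, alpha=0.02)
# Both expectations settle near 2, so a two-coin outcome yields an RPE near zero for either cue.
```

Because both cues converge on the same expectation, the scalar value learned here carries no information about whether two coins are rare or common.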

While RPEs learn the expected value of each cue $c$, SPEs learn the probability of each specific outcome (see Ludvig et al., 2012; Sutton and Barto, 1990). Assuming that one of $N$ discrete outcome states may follow cue $c$, a model-based system may express $N$ signed SPEs, each denoted $\delta_{s}$, and $N$ unsigned SPEs, denoted $|\delta_{s}|$, in response to the attained outcome. Each SPE has the form $\delta_{s} = I(c, s) - P(s \mid c)$, where $I(c, s)$ indicates a binary transition (1 for yes, 0 for no) from cue $c$ to outcome $s$ and $P(s \mid c)$ is the expected probability of this transition. The expected state transition probabilities $P(s \mid c)$ may then each be updated according to $P(s \mid c) \leftarrow P(s \mid c) + \alpha\,\delta_{s}$.

In applying these definitions to our task (see Fig. 1), we assume that the reward $r$ on each trial, which drives model-free RPE learning, is simply equal to the magnitude of financial payoff, i.e., 1, 2, or 3 Swiss francs (CHF, see Fig. 1). Regarding model-based SPE learning, note that there are nine transition probabilities in total in our task: three outcomes $s \in \{1, 2, 3\}$ for each possible cue $c \in \{1, 2, 3\}$ (see Fig. 1). In our task, $N = 3$. To take a concrete example, imagine a trial in which three coins followed cue $c$, then $I(c, 3) = 1$ while $I(c, 1) = 0$ and $I(c, 2) = 0$. The model-based system then expresses three signed SPEs $\delta_{1}, \delta_{2}, \delta_{3}$ and three unsigned SPEs $|\delta_{1}|, |\delta_{2}|, |\delta_{3}|$ in response to the outcome. The expected state transition probabilities $P(s \mid c)$ may then each be updated according to $P(s \mid c) \leftarrow P(s \mid c) + \alpha\,\delta_{s}$. Because this model-based system may learn that two coins are likely to follow cue 2 but not cue 1 or cue 3, it can learn discriminations that the model-free system cannot (see Introduction and Fig. 1).
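The concrete example above (three coins following a cue) can be traced numerically. The following sketch assumes a flat starting estimate of 1/3 per outcome and a learning rate of 0.1; both values, and the function name `spe_update`, are illustrative assumptions rather than values from the study.

```python
def spe_update(p, outcome, alpha=0.1):
    """One model-based update for a single cue.

    p       -- list of expected transition probabilities, one per outcome state
    outcome -- index of the outcome that actually occurred (0-based)
    Returns (updated probabilities, signed SPEs, unsigned SPEs).
    """
    # Binary transition indicator minus expected probability, for every outcome state.
    signed = [(1.0 if s == outcome else 0.0) - p[s] for s in range(len(p))]
    unsigned = [abs(d) for d in signed]
    updated = [p[s] + alpha * signed[s] for s in range(len(p))]
    return updated, signed, unsigned

# Three coins (index 2) follow a cue whose outcomes were all expected with probability 1/3.
new_p, signed, unsigned = spe_update([1 / 3, 1 / 3, 1 / 3], outcome=2, alpha=0.1)
```

Because the signed SPEs sum to zero whenever the current estimates sum to one, the updated estimates remain a valid probability distribution.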

Both the models considered above learn about the rewards/states and express some form of mismatch between prediction and observation. While un/signed PE expresses the (un/signed) arithmetic difference between some expectation and observation (Dayan et al., 2000; Friston et al., 2006; Pearce and Hall, 1980; Roesch et al., 2012), the present study looks for signals which code the conditional surprise or improbability of an event but are not reducible to PE (MacKay, 2003).

METHODS AND MATERIALS

Participants

All subjects had normal or corrected-to-normal vision and were screened to exclude those with a previous history of neurological or psychiatric disease. All gave informed consent and the study was approved by the Ethics Committee of the Canton of Zurich. After completing a consent form and MR safety questionnaire, participants were invited to read the task instructions.

Procedure and Rationale

Naive subjects viewed visual stimuli presented against a black background on a computer monitor while in an fMRI scanner. On each trial one of three visual cues (fractals) was presented at random on the left or the right of the screen. After 2 ± 1 s this cue was replaced by coin(s) in the center of the screen indicating a monetary reward of 1, 2, or 3 Swiss francs (CHF). Subjects stood to win the amount indicated if they correctly reported the side of the cue with a button-press. This task served only to maintain attention and was designed to be easy: on any one trial, the predictive cue was perceptibly either on the left or the right, 5 cm from the midline. In line with this, subjects performed this incidental laterality judgment task at ceiling, for all cues and reward levels. The reward following each cue was sampled randomly from a cue-specific probability distribution—which was unknown to the subjects (see Fig. 1). Each cue yielded two CHF on average but with a different probability distribution over monetary rewards. In this context, conventional model-free RPE learning algorithms (Rescorla and Wagner, 1972) can only learn the average or “expected” reward that is constant over cues, while more recent theories permit subjects to discriminate based on reward variance, risk or precision (Preuschoff and Bossaerts, 2007; Schultz et al., 2008). In theory, the model-based system may exploit SPEs to discriminate cue-specific outcome probabilities even when RPEs cannot help them e.g. when RPE is zero.

We wanted to ensure that all surprise responses in this task were cue-specific; i.e., that they reflected discrimination learning and not some other improbable feature of the outcome. We therefore arranged that, over all trials, cues and rewards were presented with the same (marginal) frequency (Fig. 1). This ensured that novelty/familiarity of cues and outcomes were controlled because subjects saw each cue and reward the same number of times throughout the experiment. This is important because novelty may also elicit responses in the midbrain dopaminergic system implicated in RPE-processing (Ljungberg et al., 1992). Recency effects were constant because the presentation rate of each cue or reward was the same. Thus if subjects failed to discriminate cue-specific reward distributions, they would be equally surprised by all rewards.
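The balance of marginal frequencies can be checked directly from the design. The sketch below assumes the outcome counts per 40 presentations given for the urn procedure in the Methods, and assumes that cues 1 and 3 are bimodal while cue 2 is unimodal (as the Information Box example suggests; the actual cue assignment was counterbalanced across subjects).

```python
# Outcome counts (1, 2, 3 coins) per 40 presentations of each cue; cue labels are an assumption.
counts = {
    "cue 1": [18, 4, 18],   # bimodal
    "cue 2": [4, 32, 4],    # unimodal
    "cue 3": [18, 4, 18],   # bimodal
}

# Marginal frequency of each cue: row sums.
cue_totals = {cue: sum(row) for cue, row in counts.items()}

# Marginal frequency of each outcome: column sums over cues.
outcome_totals = [sum(row[k] for row in counts.values()) for k in range(3)]
```

Under these assumptions every cue and every outcome occurs 40 times per 120 trials, so only the cue-conditional probabilities differ.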

To further control for PE, our regression included trial-specific unsigned and signed PEs as covariates, see (Pearce and Hall, 1980; Roesch et al., 2012). To assess behavioral evidence for learning, we asked subjects to report the probabilistic contingencies explicitly after the fMRI session: this probed their declarative “model” following learning. We also conducted a separate behavioral study with the identical design—except that different subjects were required to report the magnitude of rewards at the end of each trial. The purpose of this study was to provide additional behavioral evidence for the relevance of surprise. In particular we asked if response times increased on conditionally surprising trials.

Behavioral Study 1

In behavioral study 1, we studied sixteen healthy male volunteers (age range: 20–25 years). The purpose of this study was to establish the behavioral relevance of surprise. Subjects observed cue (fractal) to reward (coins) associations on the computer screen and reported the number of coins via key press, as quickly and accurately as possible (timing parameters given in Fig. 1). If subjects correctly reported the number of coins within a 500 ms time-window, they stood to win the equivalent money (a subset of 10 attempts were randomly selected and paid at the end). By experimental design, cues preceded the financial reward, CHF 1, 2, or 3 (see Fig. 1). There were three sessions separated by a 3 min break. All rewards were independent samples from the conditional distributions shown in Figure 1. In each trial of sessions two and three, the cue was drawn randomly with probability 1/3. In session 1, cues were presented in sequence, i.e., 40 presentations of cue 1, then 40 of cue 2, then 40 of cue 3. We used all three sessions for the behavioral analysis. The actual frequencies presented to subjects were forced to be the same as those illustrated in Figure 1: we achieved this by drawing the outcome on each trial without replacement from an “urn” containing 40 outcomes arranged in the proportions given in Figure 1, i.e., [18/40, 4/40, 18/40] and [4/40, 32/40, 4/40]. While this technically introduces a little dependence in trial-by-trial realizations (draws are not independent and identically distributed), it ensures consistent surprise responses between subjects with a relatively small number of trials.
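The urn procedure described above can be sketched as follows. The function name `urn_outcomes` and the seed are hypothetical; the counts are the proportions stated in the text. Shuffling a fixed multiset and reading it off in order is equivalent to drawing without replacement.

```python
import random

def urn_outcomes(counts, rng):
    """Return a shuffled list of coin outcomes with exactly the given counts.

    counts -- number of 1-, 2- and 3-coin outcomes in the urn (e.g. [18, 4, 18])
    """
    urn = [coins for coins, n in zip((1, 2, 3), counts) for _ in range(n)]
    rng.shuffle(urn)   # a random order of a fixed multiset = draws without replacement
    return urn

rng = random.Random(1)
bimodal_draws = urn_outcomes([18, 4, 18], rng)
unimodal_draws = urn_outcomes([4, 32, 4], rng)
```

Every subject therefore experiences exactly the intended frequencies per 40 presentations, at the cost of the slight trial-by-trial dependence noted in the text.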

fMRI

Using fMRI we studied 19 different male participants (age range: 20–25 years), presenting exactly the same cue-reward contingencies as above, but asked subjects to report the laterality (left/right) of the cue on each trial. This incidental behavioral task was the same for all cues and therefore independent of the cue-specific reward associations of interest. This meant that reaction time and response inhibition were not confounded with subjective surprise (as they were in the preceding, strictly behavioral, task). Task instructions introduced subjects to the visual cues (fractals) and outcomes (1, 2, or 3 coins) and informed subjects that each cue would be followed by 1, 2, or 3 coins that were “available to win” (1 coin = CHF 1). On each presentation of a cue, subjects were asked to report the position of the (fractal) cue on the screen by left/right button-press. They were told that success in this task determined their final monetary payoff. Specifically, a random subset of 10 trials per block would be selected after the experiment for payment: if subjects had successfully reported the cue location within time, the corresponding money would be paid out. Subjects were told that they could not predict which cue would appear on any trial but that there “may be a relationship between the cue and the number of coins available.” Participants' earnings were calculated for each session of the experiment.

Task and contingencies

Each trial started with a variable ITI with only a fixation cross visible in the center of the screen. The ITI length was sampled uniformly from the interval 4–6 seconds. The ITI was followed by the presentation of one out of three visual (fractal) cues, randomly on the left or right of the screen, for 1–3 seconds. At the offset of this cue, 1, 2, or 3 coins were presented, indicating money available to win. Following the presentation of coins, participants were shown the fixation cross again. There were three sessions separated by a break. All rewards were independent samples from the conditional distributions shown in Figure 1. In each trial of sessions two and three, the cue was drawn randomly with probability 1/3. In session 1, cues were presented in sequence i.e., forty presentations of cue1-reward, then forty of cue 2-reward then forty of cue 3-reward. The cue-outcome assignments, as well as the order of blocks in session 1, were counterbalanced across subjects. To preclude brain responses based on novelty, familiarity or recency effects, we excluded session one from the fMRI analysis. The fMRI results below therefore report on sessions two and three. Each session was 10-min long with two 3-min breaks in between.

Behavior 2

After scanning, we elicited subjects' belief about the relative frequency of 1, 2, or 3 coins associated with each of the three cues, which probed their declarative knowledge of the probabilistic contingencies (bimodal versus unimodal). To elicit self-reported beliefs about the relative frequency of each outcome, subjects were given three sheets of paper, one for each cue. At the top of each page was a picture of the cue: along the bottom of the page were pictures of 1, 2, and 3 coins (the same pictures that reported outcomes during the task itself). Above each coin(s) was an empty space. For each coin outcome, subjects used a pencil to report, “the percentage of times this number of coins followed this cue.” A histogram was deemed “bimodal” if and only if the probability assigned to outcome 2 was lower than the probability assigned to both outcome 1 and outcome 3. Otherwise the histogram was deemed “unimodal.”
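The scoring rule for self-reported histograms can be written as a short function. This is a sketch of the rule exactly as stated above; the function name `classify_histogram` and the example percentages are hypothetical.

```python
def classify_histogram(p1, p2, p3):
    """Classify a reported histogram over 1, 2 and 3 coins.

    "bimodal" if and only if the value reported for two coins is strictly
    lower than the values for both one coin and three coins; else "unimodal".
    """
    return "bimodal" if (p2 < p1 and p2 < p3) else "unimodal"

# Hypothetical reports (percentages):
label_a = classify_histogram(45, 10, 45)   # two coins rated rarest
label_b = classify_histogram(10, 80, 10)   # two coins rated most frequent
label_c = classify_histogram(30, 30, 40)   # tie with one coin, so not strictly lower
```

Note that under this rule ties and monotonic histograms fall into the “unimodal” category.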

fMRI Data Acquisition

Images were acquired using a Philips Achieva 3T whole-body scanner with an eight channel SENSE head coil (Philips Medical Systems, Best, The Netherlands) at the Laboratory for Social and Neural Systems Research (SNS Lab), Zurich. Subjects viewed the stimuli through a mirror fitted on top of the head coil. We acquired gradient echo T2*-weighted echo-planar images (EPIs) with blood-oxygen-level–dependent (BOLD) contrast (slices/volume, 37; repetition time, 2.47 s). Approximately 350 volumes were collected in each session of the experiment. Scan onset times varied randomly relative to stimulus onset times. Volumes were acquired at a +15° tilt to the anterior commissure-posterior commissure line, rostral > caudal. Imaging parameters were the following: echo time, 30 ms; field of view, 220 mm. The spatial resolution of the functional data was 3 × 3 × 3 mm. A T1-weighted 3D-TFE high-resolution structural image was also acquired for each participant. For this, the following parameters were used: Repetition Time (TR) = 7.4 ms, Echo Time (TE) = 3.4 ms, inversion time (TI) = 876.2 ms (minimum TI delay), Flip angle (deg) = 8, Field of view (FOV) = 250 × 250 (×180) mm, matrix size = 240 (Reconstruction matrix), voxel size = 1 × 1 × 1 mm (1.041 reconstructed); Acquisition time 5.57 min.

fMRI Image Analysis

Statistical parametric mapping (SPM8; Functional Imaging Laboratory, University College London) was used to spatially realign functional data and coregister them to the individual anatomical image before normalizing to standard MNI space and smoothing with an isotropic Gaussian kernel with a full-width at half-maximum of 9 mm.

First-level design (within-subject)

For each subject, we used linear regression to model fMRI BOLD responses to each of the nine cue-conditional outcomes, i.e., one coin following cue 1, two coins following cue 1, three coins following cue 1, one coin following cue 2… etc. We used a standard rapid-event–related fMRI approach in which evoked hemodynamic responses to stimulus events are estimated separately by convolving a canonical hemodynamic response function with a stimulus function encoding the onsets for each event. These nine events were entered into a design matrix together with six movement parameters. Our main objective here was to contrast probable versus improbable rewards, in a condition which has zero RPE on average, i.e., at the expected reward of two coins. To exclude SPE explanations, we therefore added further control variables as follows.
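To illustrate how an event regressor of this kind is built, the sketch below convolves a stick function of event onsets with a double-gamma response function and samples the result once per scan. It is a simplified stand-in for what SPM does, not a reproduction of it: the response-function shape (gamma peaks near 5 s and 15 s, undershoot ratio 1/6), the 0.1 s grid, the 32 s kernel length and the function names are all assumptions; only the repetition time of 2.47 s comes from the acquisition section.

```python
import math

def hrf(t):
    """Double-gamma approximation to a canonical hemodynamic response (t in seconds)."""
    if t <= 0:
        return 0.0
    peak = t ** 5 * math.exp(-t) / math.gamma(6)          # response peaking near 5 s
    undershoot = t ** 15 * math.exp(-t) / math.gamma(16)  # late undershoot near 15 s
    return peak - undershoot / 6.0

def event_regressor(onsets, n_scans, tr=2.47, dt=0.1):
    """Convolve event onsets (seconds) with the HRF and sample once per scan."""
    n_fine = int(round(n_scans * tr / dt))
    stick = [0.0] * n_fine
    for onset in onsets:
        idx = int(round(onset / dt))
        if 0 <= idx < n_fine:
            stick[idx] += 1.0
    kernel = [hrf(i * dt) * dt for i in range(int(32 / dt))]
    signal = [0.0] * n_fine
    for i, s in enumerate(stick):
        if s:
            for j, k in enumerate(kernel):
                if i + j < n_fine:
                    signal[i + j] += s * k
    # Take the fine-grid value at the start of each scan.
    return [signal[min(n_fine - 1, int(round(scan * tr / dt)))] for scan in range(n_scans)]

# Hypothetical single event at 10 s in a 20-scan run.
regressor = event_regressor([10.0], n_scans=20)
```

In the design described above, one such regressor would be built for each of the nine cue-conditional outcomes, with the parametric modulators scaling the height of each stick before convolution.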

Basic SPE model

We included un/signed SPEs as “parametric modulators,” conditional on five different learning rates. Parametric modulators were derived from the learning models described in the Information Box. Specifically, they were

  1. Signed SPEs associated with state-transitions on each trial, math formula (see Information Box), conditional on five learning rates math formula. By extending the notation used in the Information Box, these can be written as math formula.
  2. Unsigned SPEs for each learning rate, i.e., math formula.

We used five learning rates because of evidence that there may be many different learning rates in the brain, operating simultaneously in different areas (O'Doherty et al., 2003; Tobler et al., 2007). We did not take an independent behavioral or autonomic measure of “the learning rate” as a proxy for the neuronal learning rate. While this may be appropriate, it rests on the stronger assumptions that (1) there is a single neuronal learning rate and (2) the behavioral learning rate and the neuronal learning rate are identical. By including five different learning rates, we gave the PE model the best chance to explain BOLD activation.

Augmented model

As a secondary confirmation, to further exclude RPE-based explanations, we confirmed that any effects remained significant in an augmented model which also contained un/signed RPEs. To specify this augmented model, we added two further sets of parametric modulators, also time locked to the outcome of each trial, to the above design:

  1. Signed RPE associated with the monetary outcome on each trial, conditional on five different learning rates math formula. These can be written as math formula.
  2. Unsigned RPEs for each of the five learning rates, i.e., math formula.

In this way, even though RPEs equal 0 for 2 coins on average, we ensure that we are maximally conservative when we make the claim that our surprise responses are not RPE responses: i.e., they are not confounded with any residual component of an RPE signal.

Optimal surprise model

In a third and final model we asked whether activations reflected optimal surprise, conditional on a Bayesian learner. Conditional surprise can be quantified mathematically by Shannon surprise, $-\log P(s \mid c)$, for which subjects must first learn the relative probability of rewards, denoted by $P(s \mid c)$, where again $s \in \{1, 2, 3\}$ (Dayan et al., 2000; Friston 2009; MacKay, 2003). We therefore looked for evidence of a hemodynamic signal that tracked the Shannon surprise expressed by a model-based Bayesian learner. We trained a simple Bayesian model which learned the conditional probability of each reward state following each cue, $P(s \mid c)$, and expressed Shannon surprise $-\log P(s \mid c)$. We assumed that $P(s \mid c)$ was learnt by updating a multinomial distribution over the random number of coins $X$, i.e., $X \sim \mathrm{Mult}(\theta_c)$, under i.i.d. assumptions. In this notation, each element of the 3-vector $\theta_c = (\theta_c^{1}, \theta_c^{2}, \theta_c^{3})$ gives the probability of receiving 1, 2, or 3 coins following cue $c$: the superscript simply indexes these three elements. Assuming an uninformative (Dirichlet) prior $\theta_c \sim \mathrm{Dir}(\alpha_0 m)$, with concentration parameter $\alpha_0$ and uniform base distribution $m = (1/3, 1/3, 1/3)$, the surprise at observing $k$ coins then simply corresponds to $-\log \hat{\theta}_c^{k}$, with $\hat{\theta}_c^{k} = (n_k + \alpha_0/3) / (\sum_j n_j + \alpha_0)$. Here $n_k$ is the number of times that $k$ coins have followed this cue to date, so $\hat{\theta}_c^{k}$ just reports the (regularized) relative empirical frequency of $k$ coins given the cue.

This procedure resulted in a trial-by-trial expression of Shannon surprise which we included as parametric modulator of the outcome for each trial. In addition to this, we included all of the un/signed RPEs and SPEs of the previous model as covariates of no interest. Our design included convolved stimulus events for each cue $c$ and each outcome $s$, and movement parameters as covariates of no interest.
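A Bayesian learner of this kind can be sketched for a single cue as a running count with a Dirichlet prior. The sketch is illustrative only: the concentration value of 1.0, the use of natural logarithms and the function name `shannon_surprise_trace` are assumptions, since the text does not fix them here.

```python
import math

def shannon_surprise_trace(outcomes, concentration=1.0, n_states=3):
    """Trial-by-trial Shannon surprise for the outcomes following one cue.

    outcomes      -- sequence of coin counts (1, 2 or 3) in the order observed
    concentration -- Dirichlet concentration parameter (assumed value)
    The prior mean is uniform over the n_states outcomes.
    """
    counts = [0] * n_states
    trace = []
    for k in outcomes:
        total = sum(counts)
        # Regularized relative frequency of outcome k before seeing this trial.
        p = (counts[k - 1] + concentration / n_states) / (total + concentration)
        trace.append(-math.log(p))
        counts[k - 1] += 1   # update only after surprise is computed
    return trace

# Hypothetical sequence: three one-coin outcomes followed by a two-coin outcome.
trace = shannon_surprise_trace([1, 1, 1, 2])
```

The first outcome has surprise log 3 under the uniform prior, and surprise at the two-coin outcome on the fourth trial is higher because that outcome has not yet been observed for this cue.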

The inclusion of five learning rates increased the ability of RPE (and SPE) to explain variance otherwise attributable to a purely model-based surprise in all three models. In this way, we ensure that we are maximally conservative when we make the claim that our surprise responses are not RPE responses: i.e., they are not confounded with any residual component of an RPE signal.

Second-level design (between-subject)

We used the standard summary-statistic approach for inference. Namely, we treated subject-specific first-level contrast images as observations. To examine the consistency of our effects over subjects we used these contrast images to calculate a one-sample t-statistic. We first tested the contrast between hemodynamic responses to the improbable two CHF outcome versus the probable two CHF outcome. We then tested the group-level effect of trial-by-trial Shannon surprise, as elicited by our Bayesian learner.
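The group-level statistic amounts to a one-sample t-test on one contrast value per subject (computed voxel-wise in practice). A minimal sketch follows; the function name `one_sample_t` and the example values are hypothetical and do not come from the study.

```python
import math
import statistics

def one_sample_t(values):
    """Return (t statistic, degrees of freedom) for H0: population mean = 0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)          # sample standard deviation (n - 1 denominator)
    t = mean / (sd / math.sqrt(n))
    return t, n - 1

# Hypothetical contrast values (improbable minus probable two-coin outcome) at one voxel.
t_stat, dof = one_sample_t([1.0, 2.0, 3.0])
```

With 19 subjects this test has 18 degrees of freedom.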

RESULTS

Behavior 1

In the purely behavioral study, subjects reported the number of coins presented on the screen. For each subject, we compared the average time it took to respond to the improbable two-coin outcome (following bimodal cues) versus the probable two-coin outcome (following the unimodal cue). Using a one-sample summary-statistic approach, a t-test showed that subjects were on average 14 ms (95% CI = [2.5, 22.4]) slower in the improbable case (P = 0.018, df = 15). Supporting Information Figure 1 plots subjects' time to report the “expected reward” (i.e., the two CHF) following each cue.

Behavior 2

A different behavioral measure was taken from the 19 different subjects in the fMRI study (detailed below). In debriefing, we asked these subjects to draw a histogram over coins for each cue (the conditional probability distribution). Grading these as correct if they reported the true contingency (unimodal or bimodal), only six attempts out of $19 \times 3 = 57$ were unsuccessful. Assuming (conservatively) that subjects chose unimodal and bimodal distributions with equal probability at chance (for each cue), a Binomial test gave math formula. Subjects therefore acquired an accurate declarative “model” of the contingencies.
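The binomial test described here can be reproduced approximately from the counts in the text. The sketch assumes 57 attempts in total (19 subjects, three cues each), 51 of them correct, a chance level of 0.5 and a one-sided upper tail; the function name `binomial_upper_tail` is hypothetical, and the resulting value is our computation rather than the figure reported by the authors.

```python
from math import comb

def binomial_upper_tail(successes, n, p=0.5):
    """P(X >= successes) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(successes, n + 1))

# 57 attempts, 6 unsuccessful, so 51 correct classifications.
p_value = binomial_upper_tail(51, 57)
```

Under these assumptions the tail probability is far below conventional significance thresholds.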

As can be seen in Figure 2, the self-reported distributions qualitatively matched the real distributions. Apart from bi- versus unimodality, there was some suggestion of probability distortion (Tversky and Kahneman, 1992): small probabilities tended to be over-estimated and larger ones under-estimated.

Figure 2.

Self-reported frequency of each outcome—1, 2, or 3 coins—conditional on each of the three cues. Blue circles indicate the mean frequency reported by subjects after the task. Dotted lines correspond to 95% confidence bounds. Red squares indicate the true frequency with which each outcome followed each cue, i.e., ground truth (see also Fig 1). The self-reported frequencies reflected the actual frequencies reasonably well.

fMRI Study

We first analyzed brain data with the basic model, in which SPE served as covariates of outcome-related responses. A between-subject (random effects) analysis contrasted hemodynamic responses to the improbable versus probable two-coin outcome. This revealed four regions of activation. These effects were significant following multiple comparison correction across the whole brain [i.e., family-wise error (FWE) cluster-level whole-brain corrected with math formula as cluster-inducing height threshold]. We found significant cluster activations in the right frontal pole, P = 0.026, x = (21, 47, 1), bilateral occipital lobe, P < 0.001, (−21, −79, −11) and (27, −79, −11) respectively, the right amygdala P = 0.012, (54, −4, −14) and the right mid frontal gyrus P = 0.013, (27, 8, 52), see Figure 3. Because the ventral striatum (VS) is strongly implicated in RPE, we wondered whether it would also be sensitive to conditional improbability. A small volume analysis using an anatomical definition revealed activation in the right VS, significant at cluster and peak level (P = 0.037 and P = 0.04). This is also visible in Figure 3D. Importantly, all of these activations were also significant in an augmented model which additionally controlled for un/signed RPE explicitly as a covariate (see “First-level design” section).

Figure 3.

Hemodynamic response to surprising, improbable rewards that carry no RPE. We used linear regression to assess hemodynamic responses to improbable versus probable rewards, under a condition with zero RPE on average. This statistical analysis controlled for un/signed SPEs. We found significant cluster activations in the right frontal pole (A), bilateral occipital lobe (B), the right amygdala (C), and the right mid frontal gyrus (D). The VS activation partly visible in (D) survived small volume correction using an anatomical definition of VS. These activations are consistent with surprise at the sensory properties of the outcome, i.e., the reward state, and/or surprise at the rewarding aspects of the outcome. [Color figure can be viewed in the online issue, which is available at wileyonlinelibrary.com.]

We next looked for evidence that the brain tracks trial-by-trial Shannon surprise (see the final model of “First-level design”). A between-subject (random effects) analysis examined the average effect of Shannon surprise, having controlled for un/signed RPE and SPE in the analysis. We again found strong bilateral occipital activation, P < 0.001, (−18, −91, −5) and (27, −79, −11) respectively and right frontal pole activation, P = 0.017, (21, 47, 1). Additionally, we found activation within the right superior parietal gyrus, P = 0.012, (30, −70, 49). We did not find ventral striatal activity following a small volume correction.

Following the request of a reviewer, we repeated all of the above analyses with 6 mm smoothing and observed a very similar pattern of significant activation in each case. Interestingly, this analysis now revealed a significant Shannon surprise activation in right VS following small volume correction.

DISCUSSION

We have studied reward learning in a passive learning situation. It is known that existing RPE schemes do not fully account for learning in this setting (Dayan and Niv, 2008; Schultz and Dickinson, 2000): For example, they have limited capacity for subjective uncertainty (Preuschoff and Bossaerts, 2007; Schultz et al., 2008) and simply associate each cue or “state” with a single value. Experimental evidence points to simple learning in the absence of RPEs, e.g., experiments in the conditioning literature on what is known as “identity unblocking,” where a change in the identity of the rewarding stimulus leads to new learning, even when the amount of “reward” is properly controlled for (Burke et al., 2008; Bornstein et al., 2011; McDannald et al., 2012; Rescorla, 1999). In contrast, humans and animals can use environmental cues to predict the likelihood of specific outcomes (Balleine, 2005; Balleine et al., 2009; Dayan and Balleine, 2002; d'Acremont et al., 2013; Fletcher et al., 2007; Gläscher et al., 2010; Griffiths, 2007). There is evidence that such “internal models” are learned via other forms of SPE. To isolate the neuronal substrate of surprise that is not attributable to RPE or SPE, we used a simple Pavlovian task that model-free RPEs cannot learn, because RPEs in response to outcomes eliciting high versus low conditional surprise are zero on average, and we statistically controlled for SPE explanations.

We showed that there are surprise responses that cannot be accounted for by PE. We observed surprise signals in the right frontal pole (Fig. 3A), bilateral occipital lobe (Fig. 3B), the right amygdala (Fig. 3C) and the right mid frontal gyrus (Fig. 3D). A small volume analysis revealed significant surprise effects, beyond PE, in the VS. Primary visual and frontal polar activations were replicated across all of our analyses.

Primary visual responses are consistent with subjective surprise at sensory features of the outcome: i.e., reward identity as opposed to a scalar reward value or utility (Alink et al., 2010; Dayan and Niv, 2008; Kok et al., 2012). This response may reflect top–down attention effects that follow in the wake of surprise. In any case, a surprise effect in early visual cortex accords with theories holding that top–down predictions modulate the response of primary sensory regions to incoming sensory information. From this perspective our data emphasize that these predictions are probabilistically sophisticated: neither scalar nor unimodal (Gaussian). Our data may also cast light on earlier studies that showed PEs modulate visual cortex responses and its connectivity during associative learning (den Ouden et al., 2009, 2010; Summerfield and Koechlin, 2008; Summerfield et al., 2008;) but did not dissociate surprise. Our empirical dissociation of surprise serves as a reminder that prediction errors are not the only way to understand such learning.

Frontal polar responses occurred in a region implicated in sophisticated model-based capabilities, including goal-directed reasoning and general problem-solving (Genovesio et al., 2013). This region evolved after the split between New World and Old World primates, and may have specifically evolved during ape and human relation (Genovesio et al., 2013).

There are two distinct notions of surprise relevant to paradigms like ours. The perceptual surprise associated with perceptual state or “identity” of the outcome (based on probabilistic distributions over a perceptual space), which we have emphasized thus far, is at least conceptually distinct from the utility surprise about how rewarding the outcome is. The latter would require a probability distribution over the scalar “utility” or “reward value.” Crucially for us, in our task neither can be learned with simple RPE-based surprise mechanisms. To maximize the subjective and hemodynamic impact of surprising events, perceptual surprise and utility surprise were intentionally aliased or confounded in our design, i.e., a reward value or “utility” of one CHF is associated with a given visual percept (a circle/coin), a reward value of two CHF is associated with another percept (two overlapping circles/coins) and a reward value of three CHF is associated with a third percept (three overlapping circles/coins). In principle, by simultaneously evoking perceptual surprise and utility surprise, our design gains sensitivity to either effect at the cost of losing specificity about which effect is responsible. In practice, however, previous literature suggests a significant disjunction between the brain regions involved in perceptual versus utility processing. That we observed surprise responses in primary (visual) perceptual regions has encouraged us to interpret this in terms of perceptual surprise, i.e., as a consequence of learned associations between the cue and the perceptual properties of the reward. Conversely, the observation of surprise effects in VS and amygdala points to utility surprise.

Our paradigm relates to the literature on implicit statistical learning in which state-state associations are learned without any feedback (e.g., Fiser and Aslin, 2001; Turk-Browne et al., 2005). Our task differs in that we can dissociate reward-independent learning that arises within a classical reward learning task and exclude common PE explanations. Previous studies have examined the neural bases of predictive or causal learning with neutral stimuli (e.g., Corlett et al., 2007; d'Acremont et al., 2013; Fletcher et al., 2001; Gläscher et al., 2010; Turner et al., 2004). Several brain structures appear to code prediction errors in relation to such learning (Boly et al., 2011; Corlett et al., 2010; Friston et al., 2006; Friston, 2009; Gläscher et al., 2010; Schultz and Dickinson, 2000). Our analysis revealed surprise responses beyond PE responses. In other words, these responses were based on the conditional improbability of events, which could not be explained by the most straightforward formulation of PEs.

ACKNOWLEDGMENTS

The authors thank Peter Dayan for helpful comments.
