## INTRODUCTION

The concept of prediction error has taken center stage in many theories of learning, most notably in reinforcement learning. In “model-free” reinforcement learning, reward prediction errors (RPEs) learn the value of being in some context or state (Balleine et al., 2008; Dayan and Niv, 2008; McClure et al., 2003; O'Doherty et al., 2004). In “model-based” reinforcement learning, state prediction errors (SPEs) can learn an internal model of the probable consequences of being in some state, i.e., they learn state transition probabilities. In both cases, PEs capture how “surprising” a reward/state is and how to adjust expectations accordingly (see Information Box). PE theories are appealing because of their conceptual simplicity: they simply learn from unexpected events. Yet it is unclear whether all surprise is reducible to some un/signed PE, because most experiments confound different forms of surprise: events far from the average (“expected”) value are also improbable. We therefore looked for evidence of improbability-based surprise not reducible to RPE or SPE. We specifically asked whether two identical rewards, with identical RPE, could evoke different brain responses based on their relative probability, while taking care to exclude SPE mechanisms.

In our paradigm, some cues predicted bimodal rewards (one or three coins arose frequently, while two coins were rare; see Fig. 1). Thus subjects seldom observed the average number of coins and rarely received the average monetary payment (the “expected reward”). Instead, they usually received the extreme payments of one or three coins. Because RPEs reflect the difference between observed and expected reward, and the average reward corresponds to the expectation, RPEs are zero when participants observe two coins (i.e., $\delta = r - \hat{r} = 2 - 2 = 0$; see Information Box). However, surprise should be highest on these very same trials where RPEs are zero on average. Other cues predicted unimodal rewards, for which this expected value of two coins was frequent and unsurprising (see Fig. 1). According to model-free reinforcement learning, no learning takes place in the absence of an RPE; i.e., these theories provide no mechanism by which the subject can learn that the two-coin outcome is surprising in one case but not the other.
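
The logic of this dissociation can be sketched numerically. The outcome probabilities below are illustrative placeholders, not the exact task frequencies (those are given in Fig. 1); the point is only that both cue types share the same expected reward while assigning very different probabilities to the two-coin outcome.

```python
from math import log

# Hypothetical outcome distributions over coin payoffs (cf. Fig. 1):
bimodal  = {1: 0.45, 2: 0.10, 3: 0.45}   # two coins rare
unimodal = {1: 0.10, 2: 0.80, 3: 0.10}   # two coins frequent

def expected_value(dist):
    """Mean payoff under a distribution over coin counts."""
    return sum(coins * p for coins, p in dist.items())

def surprisal(dist, outcome):
    """Shannon surprisal -log P(outcome): improbability-based surprise."""
    return -log(dist[outcome])

# Both cues predict an expected reward of two coins, so the RPE to a
# two-coin payoff is (approximately) zero in either case ...
rpe_bimodal  = 2 - expected_value(bimodal)    # ~0
rpe_unimodal = 2 - expected_value(unimodal)   # ~0
# ... yet the same outcome is far more improbable under the bimodal cue:
# surprisal(bimodal, 2) >> surprisal(unimodal, 2)
```

Identical RPEs with unequal improbability is exactly the contrast the paradigm exploits.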

In contrast, a “model-based system,” which encodes how likely each possible outcome is, may exploit SPEs to learn this discrimination (see Information Box). For this reason, differential brain responses to the two-coin outcome must reflect the model-based system, being either un/signed SPE or surprise *per se*, i.e., conditional improbability. We seek to identify surprise *per se* by contrasting hemodynamic responses to the improbable versus probable “expected reward”, i.e., the two-coin outcome under bimodal versus unimodal distributions, while including SPE covariates in our statistical analysis.

### Information Box: Model-Free RPEs and Model-Based SPEs

Surprise as captured by prediction error (PE, or $\delta$) has played an essential role in the interpretation of data from single-cell recording and from neuroimaging studies (Friston, 2009; Glimcher, 2011; Rescorla and Wagner, 1972; Schultz and Dickinson, 2000; Schultz et al., 1997; Sutton and Barto, 1998): PE is defined as the difference between observed and expected quantities. A scalar RPE features in theories of “model-free” reinforcement learning and permits subjects to calibrate their reward expectations (Rescorla and Wagner, 1972; Schultz et al., 1997; Sutton and Barto, 1998). Following cue $s$, the RPE simply codes the difference between received and expected reward, $\delta = r - \hat{r}_s$. This RPE is signed, meaning that more reward than expected, corresponding to a positive RPE, has a different meaning from (i.e., is “better than”) less reward than expected, which corresponds to a negative prediction error. During learning, the expected reward may be updated on each trial according to $\hat{r}_s \leftarrow \hat{r}_s + \alpha\delta$, where $\alpha$ is a learning-rate parameter. One could argue, though, that the amount of surprise should not depend on the sign of the RPE. This notion can be captured with unsigned RPEs, which are simply the absolute value of $\delta$, denoted $|\delta|$. Unsigned RPEs can be used to guide attention.
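A minimal sketch of this delta-rule update, one trial at a time (function name and learning rate are illustrative, not from the paper):

```python
def delta_rule(r_hat, r, alpha=0.1):
    """One trial of model-free RPE learning (Rescorla-Wagner style).

    r_hat : current expected reward for the cue
    r     : reward actually received
    Returns the updated expectation, the signed RPE, and the unsigned RPE.
    """
    delta = r - r_hat              # signed RPE: better/worse than expected
    r_hat = r_hat + alpha * delta  # expectation moves toward the observation
    return r_hat, delta, abs(delta)

# Under a bimodal cue the expectation converges toward the average of the
# frequent extreme payoffs, so a two-coin payoff eventually yields an RPE
# near zero even though two coins are rarely observed.
r_hat = 0.0
for r in [1, 3] * 200:             # alternating extreme payoffs
    r_hat, delta, unsigned = delta_rule(r_hat, r)
# r_hat now hovers near 2 coins
```

Note that the learned quantity is a single scalar expectation per cue; no information about the shape of the outcome distribution is retained.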

While RPEs learn the expected value $\hat{r}_s$ of each cue $s$, SPEs learn the probability of each specific outcome (see Ludvig et al., 2012; Sutton and Barto, 1990). Assuming that one of $K$ discrete outcome states may follow cue $s$, a model-based system may express $K$ signed SPEs, each denoted $\delta_{s \to o}$, and $K$ unsigned SPEs, denoted $|\delta_{s \to o}|$, in response to the attained outcome. Each SPE has the form $\delta_{s \to o} = T_{s \to o} - \hat{P}(o \mid s)$, where $T_{s \to o}$ indicates a binary transition (1 for yes, 0 for no) from cue $s$ to outcome $o$ and $\hat{P}(o \mid s)$ is the expected probability of this transition. The expected state transition probabilities may then each be updated according to $\hat{P}(o \mid s) \leftarrow \hat{P}(o \mid s) + \alpha\,\delta_{s \to o}$.
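The SPE update can be sketched for a single cue as follows (function and variable names are illustrative):

```python
def spe_update(p_hat, outcome, alpha=0.1):
    """One trial of model-based SPE learning for a single cue.

    p_hat   : dict mapping each possible outcome o to the estimate P-hat(o | cue)
    outcome : the outcome state actually attained on this trial
    Returns the updated probabilities and the signed SPE for every outcome.
    """
    # Binary transition indicator T: 1 for the attained outcome, 0 otherwise.
    spes = {o: (1.0 if o == outcome else 0.0) - p for o, p in p_hat.items()}
    p_hat = {o: p_hat[o] + alpha * spes[o] for o in p_hat}
    return p_hat, spes
```

Because the signed SPEs sum to $1 - \sum_o \hat{P}(o \mid s)$, the update leaves a normalized probability vector normalized.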

In applying these definitions to our task (see Fig. 1), we assume that the reward on each trial, which drives model-free RPE learning, is simply equal to the magnitude of the financial payoff, i.e., 1, 2, or 3 Swiss francs (CHF; see Fig. 1). Regarding model-based SPE learning, note that there are nine transition probabilities in total in our task: three outcomes for each possible cue (see Fig. 1). To take a concrete example, imagine a trial in which three coins followed cue 1; then $T_{1 \to 3} = 1$ while $T_{1 \to 1} = 0$ and $T_{1 \to 2} = 0$. The model-based system then expresses three signed SPEs and three unsigned SPEs in response to the outcome, and the expected state transition probabilities may each be updated according to $\hat{P}(o \mid s) \leftarrow \hat{P}(o \mid s) + \alpha\,\delta_{s \to o}$. Because this model-based system may learn that two coins are likely to follow cue 2 but not cue 1 or cue 3, it can learn discriminations that the model-free system cannot (see Introduction and Fig. 1).
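The discrimination can be simulated end to end. The outcome frequencies below are again hypothetical stand-ins for the task frequencies in Fig. 1, and the learning rate and trial count are arbitrary:

```python
import random

def learn_cue(outcome_probs, n_trials=2000, alpha=0.05, seed=0):
    """Learn P-hat(o | cue) from sampled outcomes via signed SPEs."""
    rng = random.Random(seed)
    outcomes, weights = zip(*outcome_probs.items())
    p_hat = {o: 1 / len(outcomes) for o in outcomes}   # uniform prior
    for _ in range(n_trials):
        attained = rng.choices(outcomes, weights=weights)[0]
        for o in p_hat:
            spe = (1.0 if o == attained else 0.0) - p_hat[o]  # signed SPE
            p_hat[o] += alpha * spe
    return p_hat

# Hypothetical frequencies: two coins rare under the bimodal cue,
# frequent under the unimodal cue (cf. Fig. 1).
p_bimodal  = learn_cue({1: 0.45, 2: 0.10, 3: 0.45})
p_unimodal = learn_cue({1: 0.10, 2: 0.80, 3: 0.10})

# Unsigned SPE when two coins arrive: large for the bimodal cue, small for
# the unimodal cue -- a discrimination model-free RPEs cannot express.
spe2_bimodal  = abs(1.0 - p_bimodal[2])
spe2_unimodal = abs(1.0 - p_unimodal[2])
```

After learning, the model-based system responds differently to the very same two-coin payoff depending on the cue, whereas the model-free RPE is near zero for both.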

Both of the models considered above learn about rewards/states and express some form of mismatch between prediction and observation. While un/signed PEs express the (un/signed) arithmetic difference between an expectation and an observation (Dayan et al., 2000; Friston et al., 2006; Pearce and Hall, 1980; Roesch et al., 2012), the present study looks for signals that code the conditional surprise or improbability of an event but are not reducible to PE (MacKay, 2003).