Acute stress alters probabilistic reversal learning in healthy male adults

Behavioural adaptation is a fundamental cognitive ability, ensuring survival by allowing for flexible adjustment to changing environments. In laboratory settings, behavioural adaptation can be measured with reversal learning paradigms requiring agents to adjust reward learning to stimulus–action–outcome contingency changes. Stress is found to alter flexibility of reward learning, but effect directionality is mixed across studies. Here, we used model‐based functional MRI (fMRI) in a within‐subjects design to investigate the effect of acute psychosocial stress on flexible behavioural adaptation. Healthy male volunteers (n = 28) did a reversal learning task during fMRI in two sessions, once after the Trier Social Stress Test (TSST), a validated psychosocial stress induction method, and once after a control condition. Stress effects on choice behaviour were investigated using multilevel generalized linear models and computational models describing different learning processes that potentially generated the data. Computational models were fitted using a hierarchical Bayesian approach, and model‐derived reward prediction errors (RPE) were used as fMRI regressors. We found that acute psychosocial stress slightly increased correct response rates. Model comparison revealed that double‐update learning with altered choice temperature under stress best explained the observed behaviour. In the brain, model‐derived RPEs were correlated with BOLD signals in striatum and ventromedial prefrontal cortex (vmPFC). Striatal RPE signals for win trials were stronger during stress compared with the control condition. Our study suggests that acute psychosocial stress could enhance reversal learning and RPE brain responses in healthy male participants and provides a starting point to explore these effects further in a more diverse population.

KEYWORDS: cognitive flexibility, computational modelling, decision-making, TSST

1 | BACKGROUND

Humans and other agents are routinely confronted with decision-making situations under stress, for example, when choosing an efficient and cheap way of commuting to work, despite running late. Different choice options, such as taking the car, bike or train, are associated with relatively stable and predictable levels of cost (such as arriving at work sweaty in case of the bike) and reward (such as exercising in case of the bike). In contrast, a flat tire or a train delay is a more uncertain, less predictable factor. Both stable and uncertain factors interact, in that cycling to work may be rewarding in sunny weather but not on a rainy day. Stress impacts individuals' emotions, mood and physiological responses and may affect their cognitive processing resources, influencing their decision-making strategies (Lupien et al., 2007). This might be especially relevant in situations that afford high behavioural flexibility, for instance, in constantly changing environments. Stress is also an important factor in causing and maintaining psychiatric conditions (McEwen, 2004) and strongly influences health-related behaviour in general (Cohen et al., 2016).
Flexible decision-making requires one to learn what is most rewarding in the current environment and to adapt one's decisions accordingly. With regard to cognitive flexibility as a well-studied subdomain of decision-making, studies have found mixed results for the influence of stress, ranging from beneficial to detrimental effects across paradigms (Goldfarb et al., 2017; Plessow et al., 2011, 2012). In a meta-analysis, acute stress showed a small negative impact for tasks in which reward seeking and risk taking are disadvantageous (d = 0.26 and d = 0.44) but showed no effect if this was not the case (Starcke & Brand, 2016). Similarly, a meta-analysis over a small number of studies investigating the effects of acute stress on cognitive flexibility concluded that stress had a small impairing effect (Hedges' g = −0.30) (Shields, Sazma, et al., 2016). Different processes involved in decision-making are presumably differentially prone to interruption by stress (Schwabe & Wolf, 2009, 2011). Whereas habitual decision-making relies on simple stimulus-related associations, goal-directed decision-making associates actions with a motivational value and is therefore more flexible but also computationally more costly. It has been found that acute and chronic stress cause a shift from goal-directed to habitual decision-making at the neural and behavioural level (Schwabe & Wolf, 2011). One possible explanation for the variable findings is the use of different types of standardized stressors in behavioural experiments. They can be physiological as in the Cold Pressor Task, psychosocial as in the Trier Social Stress Test (TSST) or both as in the Socially Evaluated Cold Pressor Test (Starcke & Brand, 2016). Depending on the type of paradigm, the time point at which physiological effects reach their peak level can differ quite strongly (McRae et al., 2006).
Another source of interindividual differences lies in endocrinological and neural sex differences (Bale & Epperson, 2015) and their interaction with decision-making (Starcke & Brand, 2016). For example, stress exposure before a decision-making task increased brain activation and was related to reward-seeking behaviour in males but not in females. A further source of variability for meta-analytical findings lies in how cognitive flexibility was measured. Both meta-analyses predominantly focused on classical paradigms such as the Wisconsin card sorting test or task-switching tests. While providing valuable insight into overall cognitive flexibility, these paradigms mostly rely on averaged outcome measures. In contrast, tasks designed for computational modelling may provide a more fine-grained measure of behavioural adaptation.
In a behavioural study applying computational modelling, participants under recent and acute stress exhibited suboptimal foraging behaviour with a tendency to overexploit their current options. Increased perseverance has also been observed in a task that differentiates habitual and goal-directed behaviour (Raio et al., 2020). Probabilistic reversal learning requires participants to choose between stimuli with varying reward contingencies. In these paradigms, contingencies are reversed several times throughout the task without announcement and therefore demand behavioural adaptation to a changing environment. A computational mechanism underlying the putative learning process can be formalized by the reward prediction error (RPE), a computational quantity derived from the reinforcement learning framework. RPEs signal the difference between an observed and an expected reward (Dolan & Dayan, 2013) and are used to update the value of a stimulus, a state or an action. The neural signature of RPEs during reversal learning is reliably found in the human ventral frontostriatal circuitry (O'Doherty et al., 2003).
So far, heterogeneous subdomains in the operationalization of decision-making and methodological differences regarding the type of stressor have complicated the picture (Porcelli & Delgado, 2017). Most previous studies on stress effects on decision-making have employed between-subject designs, but subjects vary drastically in individual stress responses, in choice behaviour and in how stress affects performance. In the previously used between-subject designs, it thus remains unclear how much of the stress-related changes to the neural correlates of probabilistic reversal learning can be attributed to the stressor and how much may be related to interindividual differences in stress reactivity. Stress reactivity may also differ, depending on long-term stress exposure (Radenbach et al., 2015) or cognitive (Otto et al., 2013) and personality trait (Raio et al., 2020) variables. The few studies using within-subjects designs to investigate learning are either purely behavioural (Radenbach et al., 2015), focused on psychoimmunological measures (Treadway et al., 2017) or based on neuroimaging methods such as electroencephalography (Cavanagh et al., 2011), which lacks the possibility of precise spatial signal localization and anatomical specificity with respect to the neural representation of RPE signals. To our knowledge, only two studies combine a within-subjects design with computational modelling and functional neuroimaging (fMRI) to elucidate underlying cognitive mechanisms (Carvalheiro et al., 2021; Robinson et al., 2013). However, in both studies, the effectiveness of the stressor was confirmed only with subjective ratings rather than with physiological correlates such as cortisol or heart rate. Furthermore, the nature of the stressors in both studies, namely, uncontrollable sounds (Carvalheiro et al., 2021) and threat of shock (Robinson et al., 2013), differs from our focus on psychosocial stress.
Although both studies employed reward-learning paradigms, they did not tap reversal learning specifically. To assess the effect of psychosocial stress on cognitive flexibility, we used the TSST and fMRI to study probabilistic reversal learning in healthy male participants, employing a within-subjects design. In contrast to previous studies, we used a state-of-the-art hierarchical Bayesian modelling approach (Piray et al., 2019) with a two-level structure, which allowed us to model the impact of stress on behavioural adaptation.

| Study design
Employing a within-subjects design, 38 healthy male adult participants (n = 28 in the final analysed sample, all right-handed) performed a probabilistic reversal learning task during fMRI in two separate sessions 7 days apart (Figure 1). Only male participants were included because cortisol reactivity to the TSST (Liu et al., 2017) and cognitive flexibility (Shields, Trainor, et al., 2016) are subject to sex differences, the latter of which may be further amplified by stress and potential impact of cyclical changes. This would render interpretation of possible effects related to the (physiological) stress response challenging at this stage of studying stress effects on decision-making. As we discuss in the Discussion, we advise that follow-up studies examine these effects in more diverse samples, which include females or other gender identities. Participants were recruited from the database of the Max Planck Institute for Human Cognitive and Brain Sciences in Leipzig, Germany, and through advertising in the local community. They were included only in the absence of medical, neurological and current or lifetime psychiatric disorders assessed by the German version of the Structured Clinical Interview for Diagnostic and Statistical Manual of Mental Disorders (SCID-IV). Clinical interviews were conducted in person by a clinically trained physician (MP). Procedures and materials are identical to those of a previous study from our laboratory using a different paradigm (Luettgau et al., 2018). The study was approved by the ethics committee of the medical faculty at the University of Leipzig, including informed consent prior to inclusion and a full debriefing about the aims of the study after the entire protocol. During the stress condition, participants were exposed to a mock interview and a mental arithmetic task in front of a socially unresponsive committee in white lab coats, following the standardized TSST protocol (Kirschbaum et al., 1993).
During the control condition, participants read a neutral text in absence of the committee (see Supporting Information). Order of session type (stress vs. control) was counterbalanced across participants. In order to prevent confounding effects of circadian rhythm on cortisol levels (Kudielka et al., 2004), both experimental sessions were scheduled at the same time of the day. Acute stress responses were assessed at physiological (cortisol) and subjective (self-report) levels at six time points throughout the session (Figure 2).

| Physiological stress response
We assessed physiological stress response via salivary cortisol, measured six times throughout the experiment at the following time points relative to the start of intervention (stress or control): t1: −30 min; t2: −2 min; t3: +10 min; t4: +15 min; t5: +30 min; t6: +45 min (Luettgau et al., 2018). For collection and extraction of saliva, we used Salivette saliva sampling tubes (Salivette Cortisol®, Sarstedt, Nuembrecht, Germany) (see Supporting Information). Individual cortisol reactivity was determined by calculating the area under the curve (AUC) with respect to ground (AUCg-stress and AUCg-control; see Pruessner et al., 2003) separately for both conditions and subtracting AUCg-control from AUCg-stress. The AUC was calculated based on individual subject-wise time points, to account for slight temporal dispersion in the testing protocol. For an additional analysis to confirm stress reactivity, please refer to the Supporting Information.
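As a minimal sketch of the AUCg computation, the trapezoid rule described by Pruessner et al. (2003) can be written in a few lines of Python. The function name `auc_ground` and all cortisol values are hypothetical and serve only as an illustration; the sampling times are the nominal ones, whereas the study used the actual subject-wise times.

```python
def auc_ground(times_min, values):
    """Area under the curve with respect to ground (trapezoid rule)."""
    if len(times_min) != len(values) or len(values) < 2:
        raise ValueError("need at least two matched time/value pairs")
    area = 0.0
    for i in range(len(values) - 1):
        dt = times_min[i + 1] - times_min[i]          # interval length in minutes
        area += 0.5 * (values[i] + values[i + 1]) * dt
    return area


# Nominal sampling times relative to intervention onset (min).
times = [-30, -2, 10, 15, 30, 45]

# Hypothetical cortisol values for one participant (illustration only).
cortisol_stress = [8.0, 7.5, 12.0, 14.0, 11.0, 9.0]
cortisol_control = [8.0, 7.0, 6.5, 6.5, 6.0, 6.0]

# Cortisol reactivity: AUCg-stress minus AUCg-control.
reactivity = auc_ground(times, cortisol_stress) - auc_ground(times, cortisol_control)
print(f"cortisol reactivity (AUCg difference): {reactivity:.2f}")
```

Because the function takes the time vector as an argument, unequal or subject-specific sampling intervals are handled without further changes.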

| Subjective stress response
Three different visual analogue scales (VAS) ranging from 0 to 100 were used to assess subjective arousal, valence and stress at all time points (T1-T6). Participants were asked to rate how they felt ('Please rate your current state'): arousal on a scale from 0 (sleepy) to 100 (active), valence on a scale from 0 (unhappy) to 100 (happy) and stress on a scale from 0 (not stressed) to 100 (stressed). Analogous to the cortisol values, subjective stress reactivity was determined by calculating the AUC with respect to ground (AUCg-stress and AUCg-control; Pruessner et al., 2003) separately for both conditions and subtracting AUCg-control from AUCg-stress.

| Working memory capacity
Participants also performed the digit span backwards task from the test battery Hamburg-Wechsler-Intelligenztest

FIGURE 1 Study design (a) and task design (b).

| Past subjective stress response
Furthermore, participants filled in a German version of the Perceived Stress Scale (PSS-10; Cohen et al., 1983) at home via an Internet-based survey (LimeSurvey; www.limesurvey.org). They evaluated potential situations in their life with regard to their respective stressfulness during the last 30 days.

| Task design
Participants performed a probabilistic reversal learning task, which included 160 trials and took around 15 min. The task (Boehme et al., 2015; Reiter et al., 2016) was programmed in Matlab (The MathWorks, Natick, MA) with Psychtoolbox (Brainard, 1997). On every trial, participants chose between two cards, each depicting a different geometric figure (different sets of figures were used for the two experimental sessions). The underlying reward structure was not explicitly instructed but had to be inferred: Reward probabilities associated with the two choice options were anti-correlated (i.e. when card A had a reward probability of 80% and therefore a punishment probability of 20%, card B had a reward probability of 20% and a punishment probability of 80% and vice versa). Furthermore, participants were informed of the probabilistic nature of the task but not of the actual probabilities: The currently 'better' card was rewarded with 10 cents in only 80% of all trials. After a fixed number of 55 trials, contingencies reversed, and these reversals repeated four times over the middle experimental phase, followed by another stable phase in the end starting at trial 126 (see Figure S1).

FIGURE 2 Physiological (cortisol) stress response in the saliva (a) and subjective stress responses as measured by the visual analogue scales (b-d) over the course of the session. The dotted vertical line indicates the anticipation period during the intervention (violet shaded area). The dark grey shaded area indicates the time the reversal learning task was performed in the MR scanner. The stress condition is indicated with a red line; the control condition with a blue line. The error bars represent standard error of the mean. Time indicated in parentheses: Minutes before or after the start of the stress or control intervention. The division of time points on the horizontal axis approximately reflects the real-time division.
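The anti-correlated reward schedule can be illustrated with a short Python sketch. It is not the task code (the task was programmed in Matlab with Psychtoolbox); the function names are hypothetical, outcomes are coded as +1 and −1 by assumption, and the intermediate reversal trials in `REVERSALS` are placeholders, since the text fixes only the first reversal (after 55 trials) and the start of the last stable phase (trial 126).

```python
import random


def better_card_per_trial(n_trials, reversal_trials, first_better=0):
    """Index (0 or 1) of the currently better card on every trial.

    reversal_trials holds 1-based trial numbers at which contingencies flip.
    """
    reversals = set(reversal_trials)
    better, current = [], first_better
    for trial in range(1, n_trials + 1):
        if trial in reversals:          # contingencies flip from this trial on
            current = 1 - current
        better.append(current)
    return better


def draw_outcome(choice, better, rng, p_better=0.8):
    """Return +1 (reward) or -1 (punishment) for the chosen card."""
    p_reward = p_better if choice == better else 1.0 - p_better
    return 1 if rng.random() < p_reward else -1


# ASSUMPTION: only 56 and 126 follow from the text; 80 and 103 are placeholders.
REVERSALS = [56, 80, 103, 126]

rng = random.Random(1)                  # fixed seed for reproducibility
better = better_card_per_trial(160, REVERSALS)
# A hypothetical agent that always picks card 0, ignoring feedback.
outcomes = [draw_outcome(0, b, rng) for b in better]
print(sum(o == 1 for o in outcomes), "of", len(outcomes), "trials rewarded")
```

An agent that never adapts is rewarded on most trials only while card 0 is the better option, which is why the reversals require behavioural adaptation.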
Participants were all subjected to the same task structure and were instructed to win as much money as possible because they would be receiving the winnings at the end of the experiment. They received a base rate of €8.00 per hour (8 h in total for all sessions) and earned an additional mean win of €4.99 on the control day and €5.27 on the stress day.
Because feedback was drawn probabilistically on each trial, and we wanted to ensure that the number of probabilistic events was matched between the control and the stress condition, six participants had to be excluded from the final sample to avoid confounds, as they received different task environments in terms of the amount of probabilistic errors due to a programming mistake. Additionally, two participants had to be excluded due to technical failure, and two additional participants had to be excluded because they performed the task below chance level, leaving a total of 28 participants for final analyses.

| Stress response analyses
Cortisol responses (AUCg) and the three subjective VAS scales were compared across conditions (stress vs. control) using two-tailed paired-sample t-tests at a significance level of p < 0.05.
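For illustration, the paired comparison can be sketched in plain Python. The function name `paired_t` and the data are hypothetical, and the effect size shown (mean difference divided by the standard deviation of the differences) is only one common definition of Cohen's d for paired designs, assumed here because the text does not state which variant was used.

```python
import math
import statistics


def paired_t(stress, control):
    """Paired-sample t statistic, degrees of freedom and a paired Cohen's d."""
    if len(stress) != len(control) or len(stress) < 2:
        raise ValueError("need at least two matched observations")
    diffs = [s - c for s, c in zip(stress, control)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)            # sample SD (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    d = mean_diff / sd_diff                      # ASSUMPTION: d based on SD of differences
    return t, len(diffs) - 1, d


# Hypothetical AUCg values for five participants (illustration only).
auc_stress = [740.0, 690.0, 810.0, 655.0, 720.0]
auc_control = [510.0, 560.0, 530.0, 600.0, 495.0]

t_value, df, cohen_d = paired_t(auc_stress, auc_control)
print(f"t({df}) = {t_value:.2f}, d = {cohen_d:.2f}")
```

The two-tailed p-value follows from the t distribution with the returned degrees of freedom; a statistics library such as SciPy (`scipy.stats.ttest_rel`) provides it directly.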

| Behavioural data
Single-trial multilevel generalized linear models (logistic regressions) were conducted using the lme4 package (Bates et al., 2015) in R (Version 4.0.3). Parameter estimates were considered significant at p ≤ .05. We analysed trial-by-trial correct responses (choosing the better option), win-stay (selecting the same stimulus after a win) and lose-switch (switching stimulus after a loss) behaviour with the factors stress condition (CT vs. ST, counterbalanced, effect coded as −0.5 and 0.5) and experimental phase (pre, reversal, post) as fixed effects and subject as a random effect, allowing for an individually varying intercept per subject. For the factor experimental phase, we specified a custom centred contrast, testing the null hypothesis of no performance differences between the first stable and reversal phases and between the late stable and reversal phases, using the hypr package (Rabe et al., 2020). Main effects of condition and phase as well as an interaction effect were added incrementally in two steps. We used χ² tests based on log-likelihood changes to compare a null model, which predicted outcome variables with only the individually varying intercept per subject, to a model including varying intercepts and all main effects. If this showed a significantly better fit, we compared the main effect model to an interaction effect model. For the best-fitting model, the odds ratios of the parameter estimates were computed to assess effect size. Additionally, we performed the same analysis using the cortisol AUCg values instead of condition labels as predictor. Participants were excluded when their performance was below chance (correct responses <50%), as described in Section 2.6. Across all trials, participants missed a relatively low number of trials (0.71%).
Furthermore, to explore the potential moderating impact of past stress exposure as well as working memory capacity on stress-related learning (Otto et al., 2013; Radenbach et al., 2015), we associated these with the stress effect on task performance (based on total correct responses in the stress condition − total correct responses in the control condition). We correlated this value with past subjective stress (PSS-10), as well as working memory performance (digit span backwards task). Due to missing PSS-10 values for four participants, the former analysis was conducted with a reduced sample of 24 participants.

| Computational models
We set up the following model space to describe different learning processes that might have generated the data. It comprised Rescorla-Wagner (RW; Rescorla & Wagner, 1972) and Pearce-Hall (PH; Pearce & Hall, 1980) models and a null model (no-learning). In the RW and PH models, the expected value $Q_{a,t}$ of an action $a$ at trial $t$ is updated via the RPE $\delta Q_{a,t}$ (Equation 1), which is defined as the difference between received reward $R_t$ and previously expected reward value for the chosen stimulus $Q_{a,t}$ (Equation 2):

$$Q_{a,t+1} = Q_{a,t} + \alpha\,\delta Q_{a,t} \tag{1}$$

$$\delta Q_{a,t} = R_t - Q_{a,t} \tag{2}$$

In RW models, we accounted for learning about the unchosen option as indicated by the implicit anti-correlated task structure in different sub-models [Equation 3; κ = 0 for single update (SU), κ = 1 for full double update (DU) and freely fitted κ for individually weighted double update (iDU)]. We further varied whether learning rates α differed for wins and losses but always implemented separate inverse decision noise temperatures β for wins and losses. The PH model encompasses Equations (1) and (2) with a dynamic learning rate depending on a decay over time and the absolute prediction error (see Supporting Information or Pearce & Hall, 1980). In the no-learning model, a stable bias towards one of the stimuli was implemented. For all learning models, trial-wise Q-action values are transformed into choice probabilities by a softmax response model with different inverse decision noise temperatures β following wins and losses:

$$p(a_t) = \frac{\exp(\beta\,Q_{a,t})}{\sum_{a'} \exp(\beta\,Q_{a',t})}$$

The inverse decision temperature parameter β reflects choice stochasticity, with higher values corresponding to more deterministic and lower values to more stochastic choices.
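A single learning step of the RW family can be sketched in Python as follows. This is an illustrative sketch, not the fitted model code: the function names are hypothetical, rewards are coded as +1 and −1 by assumption, and the update of the unchosen option (moving it in the opposite direction, scaled by κ) is one common double-update formulation that is assumed here rather than taken from the text.

```python
import math


def softmax_probs(q, beta):
    """Choice probabilities for the options in q given inverse temperature beta."""
    m = max(q)                                   # subtract max for numerical stability
    weights = [math.exp(beta * (v - m)) for v in q]
    total = sum(weights)
    return [w / total for w in weights]


def rw_step(q, choice, reward, alpha_win, alpha_loss, kappa):
    """One Rescorla-Wagner update for two options; returns (new_q, delta)."""
    alpha = alpha_win if reward > 0 else alpha_loss
    delta = reward - q[choice]                   # RPE: received minus expected reward
    new_q = list(q)
    new_q[choice] = q[choice] + alpha * delta    # update of the chosen option
    # ASSUMPTION: unchosen option is updated in the opposite direction,
    # weighted by kappa (0 = single update, 1 = full double update).
    unchosen = 1 - choice
    new_q[unchosen] = q[unchosen] - kappa * alpha * delta
    return new_q, delta


# Hypothetical parameter values (illustration only).
q_values = [0.0, 0.0]
q_values, rpe = rw_step(q_values, choice=0, reward=1,
                        alpha_win=0.4, alpha_loss=0.6, kappa=1.0)

# The inverse temperature for the next choice depends on the last outcome.
beta_win, beta_loss = 2.0, 1.0
beta = beta_win if rpe > 0 else beta_loss
probs = softmax_probs(q_values, beta)
print(q_values, rpe, probs)
```

Setting `kappa=0.0` reduces the sketch to the single-update model, in which the value of the unchosen option stays unchanged.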
We followed a two-step procedure: First, we fit our model space to the behavioural data of the control condition. Then, the best-fitting model from the control condition was used for modelling behaviour under stress, now with additional 'stress weights' on the free parameters. Taken together, the 'step 1 model space' consisted of eight models for learning under the control condition (in the following, Rescorla-Wagner models are abbreviated RW and Pearce-Hall models PH): RW-SU-1al, RW-SU-2al, RW-DU-1al, RW-DU-2al, RW-iDU-1al, RW-iDU-2al, PH and no-learning. We applied Bayesian model comparison (Piray & Daw, 2020) to find out which of these models explained the data best [see protected exceedance probabilities (PXP) in Figure 4].
To model learning under the stress condition, we added stress weights to the free parameters of the best-fitting model from the first step (RW-DU-2al). The 'step 2 model space' included the DU-2al model without stress effects (RW-DU-2al-NoStress), one model with stress weights affecting only the learning parameters α win and α loss (RW-DU-2al-StressLearning), one model with stress affecting only the temperature parameters β win and β loss (RW-DU-2al-StressBetas) and a full model with stress affecting all free parameters (RW-DU-2al-StressAll). This model space was fitted to combined data from both conditions: Trials were concatenated across control and stress conditions within subjects, with the free stress parameters quantifying the additive effect on the respective parameters for the trials of the stress condition. As in Step 1, model fits were then compared between models.
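The logic of the step 2 model space can be sketched as a lookup of which parameters receive an additive stress weight. The parameter names, the function `effective_params` and all numeric values are hypothetical; in particular, the sketch adds the weight to the parameter value directly, whereas the text does not state whether the additive effect is applied before or after any parameter transformation used during fitting.

```python
# Which free parameters receive a stress weight in each step 2 model.
STEP2_MODELS = {
    "RW-DU-2al-NoStress": set(),
    "RW-DU-2al-StressLearning": {"alpha_win", "alpha_loss"},
    "RW-DU-2al-StressBetas": {"beta_win", "beta_loss"},
    "RW-DU-2al-StressAll": {"alpha_win", "alpha_loss", "beta_win", "beta_loss"},
}


def effective_params(base, stress_weights, model, is_stress):
    """Parameters used on a trial: base value plus stress weight on stress trials."""
    affected = STEP2_MODELS[model]
    params = {}
    for name, value in base.items():
        shift = stress_weights.get(name, 0.0) if (is_stress and name in affected) else 0.0
        params[name] = value + shift
    return params


# Hypothetical values (illustration only).
base = {"alpha_win": 0.3, "alpha_loss": 0.6, "beta_win": 3.0, "beta_loss": 1.5}
weights = {"alpha_win": 0.1, "alpha_loss": -0.1, "beta_win": -0.5, "beta_loss": 0.25}

control_trial = effective_params(base, weights, "RW-DU-2al-StressBetas", is_stress=False)
stress_trial = effective_params(base, weights, "RW-DU-2al-StressBetas", is_stress=True)
print(control_trial)
print(stress_trial)
```

With concatenated data, each trial carries a condition flag, so the same function yields the control parameters for control trials and the shifted parameters for stress trials.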

| Model fitting
Models from both steps were fitted under the hierarchical Bayesian inference approach as implemented in the cbm toolbox (Piray & Daw, 2020) run in Matlab R2018a. This procedure allowed for concurrent model comparison and parameter estimation. Parameter estimation also followed a multilevel modelling approach: The group mean parameter affects individual parameter estimation and vice versa, but the relationship is scaled by how (relatively) well the model explains the individual subject's behaviour.

| fMRI data
Scans were acquired on a Siemens 3-T high-resolution PRISMA MR-System with a 20-channel head coil (Siemens, Erlangen, Germany). Covering the whole brain, 40 slices were acquired in oblique orientation at 20° to the anterior commissure-posterior commissure line and in ascending order with the following parameters: T2*-weighted gradient-echo echo-planar imaging (EPI) (TR: 2.09 s; TE: 22 ms; flip angle: 90°; 3 × 3 mm² in-plane voxel resolution, 0.5 mm gap between slices, voxel size: 3 × 3 × 4 mm). After preprocessing (see Supporting Information), fMRI data were analysed within SPM12. Separate first-level models were computed for the control and stress condition. At the first (individual subject) level, feedback onsets were modelled as events with zero duration, and the RPEs were included as a parametric modulator. Missing trials (when participants were slower than 3 s, or no response was given) were modelled as events of no interest. The six realignment parameters were added as nuisance regressors. Contrast images were computed for the RPE separately for the control and stress condition and subsequently submitted to random-effects group statistics (second level) using a paired t-test to compare activation between conditions (stress/control). To control for multiple comparisons, family-wise error correction (p FWE ) was applied at the whole-brain level at p FWE < 0.05 in SPM. For testing the condition effect, a mask of the RPE main effect over both conditions was used at p FWE < 0.05. In order to differentiate RPE signals for win and loss trials (similar to Carvalheiro et al., 2021), additional first-level models were computed, which differentiated feedback into win and loss trials, again separately for the control and stress day. Both trial types (win and loss) were modelled separately, introducing the RPE as a parametric modulator, which resulted in two contrast images (RPE win and RPE loss).
The contrast images of RPE win and RPE loss were subjected to separate paired t-tests to investigate activation between conditions (stress/control). For testing an effect of trial type, a mask of the RPE main effect over both conditions was used at p FWE < 0.05.

| Sample characteristics
The final sample consisted of n = 28 healthy male adult human participants with a mean age of 26.9 (SD = 5.7; range: 18-41) years, a mean of 12.2 (SD = 1.2) educational years and a mean verbal intelligence of 103.8 (SD = 10.1). The order in which participants performed the control and stress conditions was approximately balanced (i.e. 13 participants performed the control condition on the first day and the stress condition on the second day, and 15 participants performed the stress condition on the first day and the control condition on the second day).

| Stress response analyses
The stress intervention significantly increased subjective stress responses, such as arousal and subjective stress, as well as physiological responses (cortisol levels). Valence was decreased under stress (see Figure 2 and Table 1).

| Behavioural results
The best-fitting multilevel model included a subject-specific intercept, as well as main effects of condition and phase. Predicting correct responses on a single-trial basis indicated the expected task effect in the reversal phase (p < 0.001) and in the last stable phase (p < 0.001). For both phases, correct responses decreased with respect to the first reference phase. Furthermore, there was a main effect of condition (p = 0.020): the odds of a correct response were 1.13 times higher under stress (odds ratio, OR = 1.13), indicating a subtle increase in correct responses (see Table 2 and Figure S2a). As shown in Figure 3b, the effects of stress on correct responses were quite heterogeneous with high interindividual variability. The findings on correct responses were supported by a significant main effect (p = 0.030) of stress when the physiological stress level (AUC) was used as a continuous predictor instead of experimental condition (see Figure S2b and Table S1). In this model, task effects were again significant for the reversal phase (p < 0.001) as well as the last stable phase (p < 0.001).
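To make the size of this effect concrete, an odds ratio can be translated into a change in response probability for a given baseline. The following Python sketch uses a hypothetical baseline accuracy of 70% purely for illustration; it is not a value reported in the study, and the function name `apply_odds_ratio` is likewise an assumption.

```python
import math


def apply_odds_ratio(p_baseline, odds_ratio):
    """Probability obtained after multiplying the baseline odds by odds_ratio."""
    if not 0.0 < p_baseline < 1.0:
        raise ValueError("baseline probability must lie strictly between 0 and 1")
    odds = p_baseline / (1.0 - p_baseline) * odds_ratio
    return odds / (1.0 + odds)


OR_STRESS = 1.13                        # reported odds ratio for the condition effect
log_odds_shift = math.log(OR_STRESS)    # the same effect on the log-odds (logit) scale

p_control = 0.70                        # HYPOTHETICAL baseline accuracy, not from the study
p_stress = apply_odds_ratio(p_control, OR_STRESS)
print(f"log-odds shift: {log_odds_shift:.3f}")
print(f"accuracy: {p_control:.3f} (control) vs. {p_stress:.3f} (stress)")
```

Under this assumed baseline, an odds ratio of 1.13 corresponds to a gain of roughly 2.5 percentage points in accuracy, which illustrates why the effect is described as subtle.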
Regarding win-stay behaviour, the best-fitting multilevel model included a subject-specific intercept, as well as main effects of condition and phase. Task effects of the reversal phase (p < 0.001) and the last stable phase (p < 0.001) were significant, but the effect of experimental condition was not (p = 0.22). Win-stay behaviour decreased in the reversal phase, as well as the last stable phase, with respect to the first reference phase. Similarly, for lose-switch behaviour, the task effects of reversal phase (p < 0.001) and last stable phase (p < 0.001) were significant, but the effect of experimental condition was not (p = 0.73) (see Tables S2 and S3). Lose-switch behaviour increased in the reversal phase, as well as the last stable phase, with respect to the first reference phase.

| Exploratory behavioural analysis of moderator variables
The impact of stress on behavioural performance (Δ correct responses) correlated neither with working memory capacity (r(26) = 0.16, p = 0.42) nor with our measure of past subjective stress (r(22) = −0.19, p = 0.37).

| Computational modelling results
Behaviour in the control condition ('step 1 model space'; see Section 2 for the models) was best explained by an RW model with full double update and two learning rates (the RW-DU-2al) across all participants with a PXP = 0.62 (see Figure 4). This indicates that most participants used the anti-correlated task structure and updated the chosen and the unchosen choice option to a similar extent (full double-update model, DU). Although there was some evidence for use of an individual double update (iDU) in our sample, we decided to focus on full DU-learning, as evident in the majority of participants. Furthermore, the learning rate in win trials was lower than in loss trials (paired t-test on alpha win vs. alpha loss: t(27) = −6.7, p < 0.001), resulting in stronger updates after loss compared with win feedback. In a next step, additional free parameters for potential stress effects were entered for this winning model (the 'step 2 model space'; see Section 2 for an explanation of the models). This resulted in a best fit for RW-DU-2al-StressBetas (PXP = 0.92), indicating that only the temperature parameters β win and β loss were different between the control and stress conditions, but not the learning rates (see Table S4 for parameter estimates). Model comparison resulted in lower protected exceedance probabilities (PXP < 0.1) for all other models (see Figure 4). Choice temperature parameters were significantly higher after win trials compared with loss trials (F(1,27) = 22.77, p < .001) and numerically higher during the control compared with the stress condition, although the latter effect was not significant (F(1, 27) = 0.25, p = .623). When introducing order as an additional scaling effect, as suggested by reviewers, the stress effect was indeed significant (see Supporting Information).

TABLE 1 Subjective and physiological stress responses. Columns: Stress response, Mean difference, t, d, p.

TABLE 2 Multilevel generalized linear modelling results of the best-fitting model predicting correct responses.

| fMRI results
We found a main effect of RPE combined over both conditions in the vmPFC, bilateral ventral striatum, posterior cingulate cortex (PCC) and bilateral insula (p FWE < .05 for the whole brain; see Figure 5 and Table S7). We did not observe significant RPE-related activation differences between control and stress condition when modelling win and loss trials together. On an uncorrected level, there was higher activation in the right insula during stress compared with the control condition, but this did not survive multiple comparison correction ([46, 4, 10], t = 4.02, p FWE SVC main effect = .068, p uncorrected < 0.001; see Figure S9). Parallel to our exploratory behavioural analysis, we assessed potential associations of past subjective stress (PSS) and working memory capacity (WM) with changes in RPE signal induced by acute stress (Δ RPE stress − control). We computed new first-level statistics combining stress and control condition into one model and generated a contrast image with the difference between stress and control condition. These contrast images were then entered into separate second-level models with PSS and WM as covariates, respectively. We did not find significant effects of PSS or WM on the changes in RPE activation.

FIGURE 4 Protected exceedance probability (PXP): (a) 'step 1' model space explaining behaviour in the control condition, (b) 'step 2' model space with added free stress parameters to the best-fitting model of the control condition, in order to detect stress-related parameter differences between control and stress condition. 1al, one learning rate for win/loss trials; 2al, two separate learning rates for win/loss trials; DU, double update; iDU, individual double update; LR, learning rate; PH, Pearce-Hall; RW, Rescorla-Wagner; SU, single update.
Furthermore, when modelling win and loss trials separately, participants under stress compared with the control condition exhibited stronger signalling of RPE for win trials in the left striatum (main effect of condition: [−10, 10, 2], t = 6.43, p FWE whole brain corrected = 0.041; see Figure 6). No significant difference between stress and control condition was observed for RPE in loss trials.

| DISCUSSION
FIGURE 5 Neural activation related to reward prediction error across both conditions. Displayed are clusters showing significant RPE coding in vmPFC, ventral striatum, posterior cingulate cortex and insula at pFWE whole brain corrected < 0.05 combining stress and control conditions (main effect of task).

FIGURE 6 Neural activation related to reward prediction error (RPE) when modelling win and loss trials separately. Left-hand side and middle: Main effect RPE during win trials across conditions (yellow-orange gradient) and the stress effect (stress condition > control condition; red) in the left striatum at pFWE whole brain corrected < 0.05. Right-hand side: Box plot of contrast estimates for RPE win for control and stress condition at the peak voxel (x = −10, y = 10, z = −2).

The present within-subjects study investigated the behavioural and neural effects of acute psychosocial stress on probabilistic reversal learning in healthy male human participants. In short, the stress induction worked properly, and the stress response seemed to last long enough to plausibly affect performance of the task in the scanner (see Figure 2). We found that participants were slightly more accurate under acute stress. Additionally, the neural representation of RPE signals was significantly higher during acute stress for win trials in our sample. Computational modelling of choice behaviour, however, showed no stress effect on learning rates, but rather stress effects on the use of learned values. Specifically, on the behavioural level, participants learned to choose the correct (i.e. more often rewarded) stimulus and adapted their choices after changes in reward contingencies (reversals) during both the control and the stress condition. Unlike previous studies (Shields, Sazma, et al., 2016), we observed more correct responses during the stress compared with the control condition, but the effect size was small (OR = 1.13) and other behavioural measures such as win-stay or lose-switch behaviour were not affected. Furthermore, participants displayed substantial interindividual variability, including better, worse or non-different performance under acute stress in our within-subjects design; therefore, it is challenging to interpret these results by themselves.
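To give a feel for how small an odds ratio of 1.13 is, the following sketch converts it into a change in the probability of a correct response. The baseline accuracy of 0.70 is a purely hypothetical value chosen for illustration and is not taken from the data; the function name is likewise illustrative. Because the reported OR comes from a multilevel model, this back-of-the-envelope conversion should be read as an approximation only.

```python
def apply_odds_ratio(p_baseline, odds_ratio):
    """Return the probability obtained by scaling the baseline odds by odds_ratio."""
    odds = p_baseline / (1.0 - p_baseline)   # probability -> odds
    new_odds = odds * odds_ratio             # apply the odds ratio
    return new_odds / (1.0 + new_odds)       # odds -> probability


# Hypothetical baseline accuracy of 70% in the control condition (assumption).
p_control = 0.70
p_stress = apply_odds_ratio(p_control, 1.13)
print(f"control: {p_control:.3f}, stress: {p_stress:.3f}")  # stress is roughly 0.725
```

Under this assumed baseline, the odds ratio corresponds to a gain of roughly 2.5 percentage points in correct responses, which is consistent with describing the effect as small.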
Follow-up computational modelling analyses of choice behaviour showed that participants' behaviour was best explained by an RW model using RPE to update the expected values of both the chosen and the unchosen choice option, indicating that participants considered the anti-correlated task structure. Acute stress did not affect the learning rate, a parameter that scales the influence of the RPE in updating the expected values. Therefore, within our model space, there was no evidence that stress affected the updating speed of learned expected values itself. In contrast, our modelling analysis did suggest that the degree to which participants used the learned values (temperature parameter) differed between the stress and control condition. More specifically, introducing different temperature parameters for the control and the stress condition explained the observed behaviour best. When introducing order as an additional scaling factor, we found that temperature parameters were higher in the stress condition, indicating that participants followed the learned values more closely during the stress compared with the control condition. Two studies using cognitive computational modelling during learning tasks also observed effects of acute stress on choice temperature, mostly higher stochasticity (Cremer et al., 2021; Radenbach et al., 2015), whereas other studies observed attenuation of model-based behaviour (Otto et al., 2013) or an increased tendency for win-stay behaviour (Raio et al., 2020). However, comparability is limited due to the different tasks used, mainly focusing on the balance between model-free and model-based learning (Cremer et al., 2021; Otto et al., 2013; Radenbach et al., 2015; Raio et al., 2020), which was not the focus of the present study.
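The distinction between learning rate and temperature can be illustrated with a minimal sketch of a double-update Rescorla-Wagner learner with a softmax choice rule. This is not the fitting code used in the study, which relied on hierarchical Bayesian estimation; it only shows the trial-wise computations. Several details are assumptions made for illustration: outcomes are coded as +1 (win) and −1 (loss), the unchosen option is updated with the sign-flipped outcome, values start at zero, the first trial uses the post-win temperature, and all function names are hypothetical.

```python
import math


def du_update(q, choice, reward, alpha_win, alpha_loss):
    """One double-update Rescorla-Wagner step for a two-option task.

    q: tuple (q0, q1) of expected values; choice: 0 or 1;
    reward: +1 for a win, -1 for a loss (coding is an assumption).
    """
    alpha = alpha_win if reward > 0 else alpha_loss
    rpe = reward - q[choice]                # prediction error for the chosen option
    rpe_unchosen = -reward - q[1 - choice]  # fictive error, assumes anti-correlated outcomes
    new_q = list(q)
    new_q[choice] += alpha * rpe
    new_q[1 - choice] += alpha * rpe_unchosen
    return tuple(new_q), rpe


def choice_prob(q, choice, beta):
    """Softmax probability of `choice`; a larger beta means choices follow values more closely."""
    diff = q[choice] - q[1 - choice]
    return 1.0 / (1.0 + math.exp(-beta * diff))


def log_likelihood(choices, rewards, alpha_win, alpha_loss, beta_win, beta_loss):
    """Log-likelihood of a choice sequence plus the trial-wise RPEs."""
    q = (0.0, 0.0)   # assumption: neutral starting values
    beta = beta_win  # assumption: first trial uses the post-win temperature
    ll, rpes = 0.0, []
    for c, r in zip(choices, rewards):
        ll += math.log(choice_prob(q, c, beta))
        q, rpe = du_update(q, c, r, alpha_win, alpha_loss)
        rpes.append(rpe)
        beta = beta_win if r > 0 else beta_loss  # temperature depends on the previous outcome
    return ll, rpes


ll, rpes = log_likelihood([0, 0], [1, -1], alpha_win=0.5, alpha_loss=0.8,
                          beta_win=2.0, beta_loss=1.0)
print(ll, rpes)
```

In this framing, a stress effect on the learning rates would change how quickly `q` moves after feedback, whereas a stress effect on the temperature parameters changes only how strongly a given difference in `q` translates into choice. The trial-wise `rpes` are the kind of quantity that can serve as a parametric regressor in the fMRI analysis.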
On the neural level, RPE signals were correlated with neural activation in a network comprising vmPFC, bilateral ventral striatum, posterior cingulate cortex and insula across both conditions, in line with previous studies using the same paradigm (Boehme et al., 2015; Katthagen et al., 2020; Reiter et al., 2016, 2017) and with meta-analytic findings of RPE fMRI studies (Fouragnan et al., 2018). In another meta-analysis, vmPFC and the posterior cingulate cortex were identified as regions specific to reward delivery (Jauhar et al., 2021). The posterior cingulate cortex has also been suspected to signal change detection (Pearson et al., 2011), which is crucial to perform well during reversal learning. No whole-brain correctable stress effects on RPE representation were observed when assessing win and loss trials together. The trendwise increase of RPE-related activation in the insula during the stress compared with the control condition might contribute to the behavioural effect, as the insula has been implicated in error processing, mainly interpreted to code salience signals (Fouragnan et al., 2018). However, this finding did not survive stringent correction for multiple testing and therefore needs to be interpreted with caution.
When differentiating RPE signals during win and loss trials, we found stronger coding of positive RPEs in the ventral striatum during the stress compared with the control condition in our sample of healthy male participants. This increased neural activation following acute social stress could correspond to better behavioural performance in the stress condition. Stress has been shown to affect the mesolimbic dopaminergic system, although both increasing and inhibiting effects have been described depending on the intensity, duration and controllability of the stressor (Baik, 2020). In line with our finding, stressful experience in rodents has been found to increase reward-evoked dopamine release in the ventral lateral striatum (Stelly et al., 2020). Another study found an increase of negative (unexpected aversive face stimuli) but not of positive (appetitive) prediction error signals in the ventral striatum in a condition of threat (potential of electric shock), although no difference between positive and negative PE signals was observed in the safe condition (Robinson et al., 2013). While we did not differentiate between threatening and safe contexts, this finding suggests that RPE signals are highly context-sensitive. In contrast to our finding, another study observed a blunted positive prediction error signal in the dorsal striatum with impaired performance in win trials (Carvalheiro et al., 2021). In our study, acute social stress was induced using the TSST before scanning, whereas Carvalheiro et al. used aversive sounds inside the scanner to induce stress. Therefore, differences in stress induction likely contribute to the different findings.
In rodents, acute stress improved reversal learning, whereas chronic stress impaired reversal learning (Bryce & Howland, 2015; Hurtubise & Howland, 2017). Differential long-term stress exposure may have led to the heterogeneous effects of stress on reversal learning in our sample. In humans, chronic stress increased the detrimental influence of acute stress on model-based learning (Radenbach et al., 2015). Apart from chronic stress exposure, cognitive capacities or personality traits are further potential explanations for the inconsistent impact of acute stress on learning. A high working memory capacity seems to protect against the attenuation of model-based learning (Otto et al., 2013), whereas trait impulsivity interacts differentially with different aspects of learning, but particularly seems to increase perseveration (Raio et al., 2017). As probabilistic reversal learning does not disentangle model-based and model-free learning, these moderator effects were impossible to replicate here. Exploratory analyses on working memory capacity and past subjective stress did not reveal any such moderating effects in our sample.
We acknowledge that our findings are limited by several factors. First, statistical power in this study is low owing to the small sample size. This means that several of the effects found, especially the MRI results, should be interpreted with great caution and should be replicated independently. Second, we only tested male participants. This was partly due to constraints in recruitment procedures and the fact that a part of the sample was tested as a healthy control sample for a patient study. Nonetheless, regardless of these constraints, we did make the decision to stick with a male-only sample, as there might be gender differences in decision-making (Shields, Trainor, et al., 2016), which may be amplified by stress and by the potential impact of cyclical changes. This could have made interpretation of possible effects related to the (physiological) stress response even more challenging than in the current sample. Furthermore, our sample was homogeneously young and highly educated. This reduced variability might have limited our ability to find differences between both conditions, and these sample characteristics limit the generalizability of effects across sex, gender (female or non-binary), age and education level. We therefore advise additional studies to investigate whether similar effects are found in non-male, older and less highly educated populations, as well as in patient samples. Regarding the task used, it does not allow value and RPE representations in the brain to be temporally disentangled. Stress effects may be related to the value representation and the use of those values during the decision process, as indicated by our modelling findings. Although speculative at this point, our finding of altered choice stochasticity parameters may hint towards this and aligns with recent findings on the importance of computational noise directly affecting value representation (Findling et al., 2019).
Dissociating these computations might be a promising avenue for future studies to determine the neurocomputational processes underlying reversal learning performance increases under acute stress.
Whereas our relatively young and healthy study sample has shown slight beneficial effects of acute stress, other more vulnerable populations may show different patterns. Stress, especially when long term or chronic, is an important factor in causing and maintaining psychiatric illness (McEwen, 2004). Although healthy individuals can adapt to a certain level of stress and even find it beneficial (Lighthall et al., 2013), decision-making frequently goes awry in psychiatric disorders (Cáceda et al., 2014). Our results suggest that it might be worthwhile assessing decision-making under acute stress in populations at risk of developing psychiatric conditions to reveal how stress is involved in maladaptive decision-making. Identification of altered choice behaviour and relevant neural networks in healthy individuals makes it possible to disentangle how stress affects healthy decision-making and what might be a maladaptive psychiatric alteration. As an operationalization of cognitive flexibility, reversal learning is a construct with high relevance for several psychiatric disorders. For instance, cognitive flexibility and its neural correlates are impaired in patients with alcohol use disorder (Reiter et al., 2016), anorexia nervosa (Bernardoni et al., 2017), binge-eating disorder, ADHD (Hauser et al., 2014) or schizophrenia (Schlagenhauf et al., 2014).

| CONCLUSION
Our study combines the advantages of a within-subjects design and fine-grained computational measures to investigate the effect of acute psychosocial stress on probabilistic reversal learning in healthy male adults. Several lines of analysis showed slightly improved performance, reflected in altered choice stochasticity, with whole-brain-correctable neural effects of increased RPE signalling for win trials under stress.