The reward positivity reflects the integrated value of temporally threefold-layered decision outcomes

Received: 11 May 2020 | Revised: 22 January 2021 | Accepted: 29 January 2021 | DOI: 10.1111/psyp.13789


| INTRODUCTION
Fast and efficient outcome evaluation plays an eminent role in the monitoring and adjustment of human behavior. In many situations, making optimal choices relies on the brain's ability to accurately predict the incentive costs and benefits of particular options and, based on that capacity, to learn to adjust future decision strategies accordingly.
The neurocognitive system underlying the monitoring and evaluation of outcomes has been investigated using various methods, including event-related potentials (ERPs) in the human electroencephalogram (EEG). Numerous prior studies have focused on an ERP amplitude variation that emerges at fronto-medial scalp positions about 200 to 350 ms after the onset of external decision feedback (Miltner et al., 1997; for an overview see Sambrook & Goslin, 2015). This effect manifests as a relative negative-going amplitude shift in response to unfavorable outcomes and/or a relative positive-going amplitude shift in response to favorable outcomes. Different terms have been used for this ERP effect in the past (e.g., Feedback-Related Negativity, Feedback-Error-Related Negativity, Medial-Frontal Negativity, Feedback-Correct-Related Positivity, Reward Positivity). In the following, we use the term Reward Positivity (RewP) when referring to absolute amplitudes for a given condition and Delta Reward Positivity (∆RewP) when referring to the amplitude difference between positive and negative action consequences.
According to a widely accepted model of RewP generation, it reflects prediction error signals sent to the anterior midcingulate cortex (aMCC) indexing the inadequacy of outcome expectations (Holroyd & Coles, 2002; see also, e.g., Chase et al., 2011; Sambrook et al., 2018). Within a modified hierarchical reinforcement learning account of this model (Holroyd & Yeung, 2012), the aMCC functions as an option selection and maintenance module in a neural reward processing system. The initial evaluation of action outcomes as better or worse than expected (i.e., signed reward prediction errors) is carried out by the so-called critic regions, the ventral striatum and the orbitofrontal cortex. Based on this evaluation, a phasic dopaminergic prediction error signal is sent to the aMCC, which uses these teaching signals to learn the mean value of a temporally extended behavioral task. The RewP is thought to be an electrophysiological byproduct of the arrival and/or implementation of such teaching signals. In a given situation, the aMCC then chooses and maintains behavioral options by exerting control, if necessary, over the dorsolateral cortex and the dorsal striatum (the actor module). The actor in turn executes the single steps of the extended behavior.
Given the proposed learning and decision-making function of the aMCC (Holroyd & Yeung, 2012), it would be critical that it receives rich and nuanced information about action-outcome values, as reflected by fine-grained variations in RewP amplitudes. Contrary to that, early studies on the ΔRewP effect indicated that it corresponds to a simple dichotomous classification into good and bad outcomes (e.g., Gehring & Willoughby, 2002; Hajcak et al., 2006; Yeung & Sanfey, 2004). However, with an increase in the complexity of experimental tasks, later research repeatedly demonstrated that RewP amplitudes can mirror a more finely graded scaling of outcome values (e.g., Bellebaum et al., 2010; Frömer et al., 2016; Kreussel et al., 2012; Meadows et al., 2016; Osinsky et al., 2012, 2018; also see Sambrook & Goslin, 2015). For instance, hierarchical reinforcement learning has primarily been investigated with pseudo-reward tasks in which an overall goal depends on the success of various subgoals and hence involves multiple decision steps (e.g., Diuk et al., 2013; Ribas-Fernandes et al., 2011). These tasks revealed independent reward prediction errors for subgoal- and overall goal-related actions in the aMCC, supporting the assumption that the aMCC responds to information from multiple levels of abstraction. In this regard, an increasingly recognized question is the modulation of the RewP by the level of feedback information content as well as the subjective weighting of different information features in single- and multi-step contexts (e.g., Cockburn & Holroyd, 2018). In particular, we recently introduced an adaptation of the so-called doors task, in which each decision led to a singular outcome event with two interlaced layers of independent temporal consequences, that is, an immediate monetary consequence and a more delayed monetary consequence (Osinsky et al., 2018).
Thus, the two consequences of each outcome could either converge (e.g., a positive immediate consequence and a positive delayed consequence) or diverge (e.g., a positive immediate consequence but a negative delayed consequence) with regard to their valence. Our results indicated that RewP amplitude is proportional to the additively integrated value of both consequences. Moreover, ∆RewP sensitivity to the two consequence types correlated with choice adjustments in a consequence-specific manner and was relatively stable over a 1-week interval. Hence, signals sent to the aMCC, as reflected by the RewP, appear to convey integrated information about multiple temporal consequences of a single outcome event, with individuals stably differing in their weighting of delayed and instant consequences.
In the study presented here, we aimed to replicate and extend our earlier findings by addressing the following issues. First, we were interested in whether RewP amplitude can reflect even more complexly interlaced outcome information. We therefore increased the number of temporal consequence layers from two to three. That is, each outcome now had an immediate monetary consequence that was paid out directly after the experiment, an intermediate consequence paid out 2 weeks after the experiment, and a delayed consequence paid out 4 months after the experiment. Based on our prior report (Osinsky et al., 2018), we expected that RewP amplitudes would be proportional to the additive value of the three payouts, resulting in a roughly continuous and linear scaling with increasing additive outcome value. However, we also expected that the size of the ΔRewP effect would decrease with increasing payout delay, reflecting the general devaluation of delayed compared to immediate rewards (cf. Cherniawski & Holroyd, 2013; Qu et al., 2013).
Second, we investigated the link between differences in ∆RewP and trial-to-trial behavioral adjustments. The evidence for such a link is somewhat mixed, with some studies reporting a substantive relation (e.g., Arbel et al., 2013; Cohen & Ranganath, 2007; Hewig et al., 2011; Holroyd & Krigolson, 2007; Santesso et al., 2008; Yasuda et al., 2004) whereas others did not (e.g., Chase et al., 2011; Luft et al., 2013, 2014; Walsh & Anderson, 2011). In our previous study (Osinsky et al., 2018), we observed rather small but specific relations between the ∆RewP responsivities to different temporal consequences and consequence-driven adjustments in choice behavior. We therefore expected to replicate this pattern in the current study.
Third, we investigated whether individual differences in RewP sensitivities to the three consequence layers can be measured reliably in terms of classical test theory. In the face of a growing interest in how inter-individual variance in the RewP is related to personality and/or psychopathology (e.g., Bakic et al., 2016; Bress & Hajcak, 2013; Cherniawski & Holroyd, 2013; Mothes et al., 2016; Mussel et al., 2015; Proudfit, 2015; Riepl et al., 2016; Schmidt, Holroyd, et al., 2017; Schmidt, Mussel, et al., 2017; Takács et al., 2015; Tsypes et al., 2019), researchers have recently begun to examine the psychometric properties of absolute RewP amplitudes and ∆RewP effects (e.g., Bress et al., 2015; Ethridge & Weinberg, 2018; Levinson et al., 2017; Luking et al., 2017). These studies have already indicated that the RewP is a reliable biomarker of reward processing. However, since psychometric properties are task- and sample-specific, it remains unclear whether the results obtained for simpler reward stimuli also generalize to more complex and multilayered rewards. We therefore analyzed the internal consistency reliability of absolute RewP amplitudes, both across conditions and condition-wise, as well as of ∆RewP scores in our task.
Fourth, we were interested in whether inter-individual differences in RewP responsivity to the three temporal consequence layers are relatively stable over a period of a few months. In personality research, rank-order consistency is one important prerequisite for considering a variable as trait-like (e.g., Roberts & DelVecchio, 2000). To check for such relative stability of RewP amplitudes, we retested a large subgroup of our participants a second time 3 months after their first participation. Although this study used more complex forms of feedback, we expected RewP reliability and stability to be about as high as with simple feedback, which would support the assumption of trait-like properties.
In addition, we were also interested in whether oscillatory activity in the theta range (4-8 Hz) supports complex feedback processing. Fronto-medial theta (FMθ) activity has been observed in response to a range of processes that require top-down control (Bernat et al., 2015; for a review see Cavanagh & Frank, 2014). Reinforcement learning theory links FMθ to control processes of the aMCC over the actor module when successful execution of the entire task is under threat (Holroyd & Umemoto, 2016). Thus, RewP and FMθ are likely related but not functionally redundant EEG phenomena, with the former reflecting an input signal to the aMCC and the latter representing an output. To further uncover potential similarities and differences between these two signatures in the processing of multicomponent outcomes, we also conducted all analyses mentioned above for FMθ power in an exploratory manner.

| Participants and procedure
As we were also interested in individual differences, we collected data from a relatively large sample of 102 students from the University of Osnabrueck (i.e., all participants who responded to the announcement of the study and fit the inclusion criteria). For three of these students, fewer than 20 trials per condition were available for averaging (see below). Thus, the final sample consisted of 99 participants (71 women, mean age = 22.36, SD = 2.94). All had normal or corrected-to-normal vision and reported being free of any psychiatric or neurological condition. They received partial course credit as well as the money they won in the task (immediate wins: M = 1.97 €, SD = 0.27; intermediate wins: M = 3.92 €, SD = 0.60; delayed wins: M = 10.42 €, SD = 1.40). The participants returned to the laboratory 2 weeks and 4 months after the experiment to collect their monetary rewards. All participants were asked for permission to be contacted for a retest session 3 months later. The final retest sample consisted of 58 individuals (41 women, mean age = 21.52, SD = 2.67) who gave permission, returned to the laboratory, and had a sufficient number of trials available for averaging EEG data (i.e., 20 trials per condition; one participant was excluded). The retest session was scheduled before the final reimbursement of the main session to ensure that enough participants were still available for testing. The task and procedure were the same as in the main testing session. After arrival, participants received general information about the study and gave written informed consent. Afterward, EEG recordings were prepared before participants received instructions and completed the behavioral task. Following the task, individual account significance was assessed with a written question. The entire experiment took approximately 2 hr and was in accordance with the Declaration of Helsinki.

| Task and behavioral analyses
Participants completed a complex reinforcement-learning task during which they could win or lose money for three monetary accounts. The immediate account was cashed out directly after the experiment, the intermediate account after 2 weeks, and the delayed account 4 months after the experiment. The combination of possible consequences for each account (win or loss) was independently manipulated and hence resulted in eight possible outcomes that were probabilistically linked to eight different doors, so that participants could learn optimal choice strategies by trial and error. The participants were told that within each block a certain outcome preferentially hides behind a certain door and that these preferences change after each block. They were not informed about the specific probabilistic ratios. In fact, each outcome was linked to its preferred door 11 times and to every other door three times in each block. In previous studies, we observed that the links are very difficult to learn based on these probabilistic ratios (Osinsky et al., 2018), which ensures almost equal trial numbers for every outcome in the analysis. The option-outcome links were randomized before each block.
Each trial started with a blank screen for 250 ms, followed by the eight doors arranged in a 3 × 3 matrix with a blank square in the middle. Participants chose one door by pressing the corresponding key on a numerical pad. During this choice period, the current balances of all three monetary accounts were presented at the bottom of the screen. After the choice, a fixation cross appeared for 1,000 ms until the feedback was presented for 1,200 ms. The feedback stimulus consisted of a colored letter within a shape that signaled the outcome for all three consequences simultaneously. The stimulus-account mapping was counterbalanced across participants (also see Statistical analysis). All participants who took part in the retest received the same stimulus-account mapping as in the main testing session. A single uppercase letter represented a mythical creature behind the door: either F for fairy [germ. Fee], which added money to one monetary account, or O for orc [germ. Ork], which stole money from this account. The color of the letter was either green or red, with a green letter indicating a win and a red letter a loss to a second monetary account. The letter was surrounded by either a square or a triangle, representing a win or a loss to the third account, respectively. Since perceived reward value decreases with increasing distance to payout time (Green & Myerson, 2004) and losses have a higher subjective value than wins (Tom et al., 2007; Tversky & Kahneman, 1992), the following outcome values were used: immediate: −1 cent, +2 cents; intermediate: −2 cents, +4 cents; delayed: −5 cents, +10 cents. The task consisted of 12 blocks with 32 trials each. All stimuli were presented on a 32-inch monitor at a viewing distance of 130 cm. Timings were controlled using Presentation software (Neurobehavioral Systems, Inc., Albany, CA).
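The three-feature feedback code and the additive outcome value it implies can be illustrated with a short sketch. The payout values are taken from the task description; the group-1 mapping (letter → immediate, color → intermediate, shape → delayed) is one of the three counterbalanced mappings, and the function names are ours:

```python
# Payout values per account as (loss, win) in cents, from the task description.
VALUES = {"immediate": (-1, 2), "intermediate": (-2, 4), "delayed": (-5, 10)}

def decode_feedback(letter, color, shape):
    """Decode one feedback stimulus under the group-1 mapping:
    F (fairy) = immediate win, O (orc) = immediate loss;
    green = intermediate win, red = intermediate loss;
    square = delayed win, triangle = delayed loss."""
    wins = {
        "immediate": letter == "F",
        "intermediate": color == "green",
        "delayed": shape == "square",
    }
    return {acct: VALUES[acct][int(win)] for acct, win in wins.items()}

def additive_value(outcome):
    """Integrated outcome value: the sum of all three signed consequences."""
    return sum(outcome.values())
```

For example, a green F inside a square (all three consequences positive) yields the optimal additive value of +16 cents, whereas a red O inside a triangle yields −8 cents.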
To examine adjustments in choice behavior, for each of the eight outcome conditions in trial N we calculated the frequency of choosing a different door versus the same door again in trial N + 1. Ratios range from 0 (always choosing the same door again after a certain outcome) to 1 (always switching the door after a certain outcome).
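The switching-frequency measure can be sketched as follows (a minimal illustration; the function name is ours):

```python
def switch_ratio(choices, outcomes, condition):
    """Proportion of trials following a given outcome condition on which
    a different door was chosen on the next trial
    (0 = always stay, 1 = always switch)."""
    switches, total = 0, 0
    for n in range(len(choices) - 1):
        if outcomes[n] == condition:
            total += 1
            switches += choices[n + 1] != choices[n]
    return switches / total if total else float("nan")
```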

| Electrophysiological recording and preprocessing
Continuous EEG was recorded using a 64-channel actiCAP electrode cap according to the international 10-10 system. All electrodes were referenced to FCz, and AFz served as ground. Electrode impedances were kept below 20 kΩ using ECI electrode gel. The signal was amplified and recorded with a BrainAmp system and BrainVision Recorder software (Brain Products GmbH, Gilching, Germany), bandpass filtered at 0.016-250 Hz, and digitized at a sampling rate of 500 Hz.
Offline preprocessing of the EEG signal was conducted using BrainVision Analyzer software (Brain Products GmbH, Gilching, Germany). For the ERP analysis, the signal was re-referenced to the mean of all electrodes, and ocular artifacts were corrected using independent component analysis as implemented in BrainVision Analyzer in semi-automatic mode. The signal was bandpass-filtered with a Butterworth zero-phase filter (0.1-30 Hz, roll-off: 48 dB/oct, time constant: 3.18) and segmented into 1,000 ms epochs around feedback onset (−200 to 800 ms). Segments containing voltage steps greater than 100 μV/ms or maximum-minimum differences of 150 μV or larger within a sliding time window of 600 ms were rejected from further analyses. All remaining segments were averaged separately for each of the eight feedback conditions and baseline corrected (−200 to 0 ms). A minimum of 20 segments per condition was required for statistical analysis (artifact-free trial numbers per condition for ERP and wavelet analyses are shown in Supporting Information Table S1). Because visual inspection of the grand average waveforms suggested differences in ∆RewP onsets between conditions, we analyzed ∆RewP onset latencies as the 50% fractional peak latency (Kiesel et al., 2008; Luck, 2014). These onset differences were not statistically significant (F(1.27, 173.72) = 1.23, p = .29, η²p = .01), nor was there a significant interaction of condition and stimulus-mapping group (F(3.56, 170.89) = 0.58, p = .66, η²p = .01). Therefore, the RewP was quantified as the mean amplitude at FCz between 200 and 300 ms post feedback onset. For the time-frequency analysis, the EEG signal was re-segmented into 2,000 ms epochs around feedback onset (−1,000 to 1,000 ms). Segments containing voltage steps greater than 100 μV/ms or maximum-minimum differences of 150 μV or larger were rejected. Remaining segments were transformed using a family of complex Morlet wavelets from 1 to 30 Hz in 30 logarithmic steps.
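The mean-amplitude quantification of the RewP and the simple maximum-minimum artifact criterion can be illustrated with a short NumPy sketch. This is an illustration of the scoring step only, not the BrainVision Analyzer pipeline; function names are ours:

```python
import numpy as np

def is_clean(epoch, max_range=150.0):
    """Maximum-minus-minimum artifact criterion (in µV), as in the text."""
    return (epoch.max() - epoch.min()) < max_range

def mean_amplitude(epoch, times, tmin=0.2, tmax=0.3):
    """Quantify the RewP as the mean amplitude of a baseline-corrected
    single-channel epoch (e.g., FCz) in the 200-300 ms post-feedback
    window. `epoch` is a 1-D voltage array, `times` the matching
    time axis in seconds."""
    mask = (times >= tmin) & (times <= tmax)
    return epoch[mask].mean()
```

A ∆RewP score would then be the difference between `mean_amplitude` values averaged over positive versus negative conditions.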
These modulated Gaussian sine functions are defined as ψ(t) = A · e^(−t²/(2σ²)) · e^(i2πf_c t), with Morlet parameter c = 6.7. Due to unit energy normalization, all frequency levels have the same total energy of 1.0. Resulting power levels for each frequency layer were then baseline-corrected using the −800 to −200 ms time window and transformed to a decibel (dB) scale. Segments were averaged for each participant and condition. FMθ was quantified as the mean power from 4.09 to 7.34 Hz within 200 to 400 ms post feedback onset at FCz.
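The wavelet transform can be sketched as follows. This is a minimal NumPy illustration of a unit-energy complex Morlet transform with dB baseline correction, not the BrainVision Analyzer implementation; we assume here that c is the ratio of the center frequency to the Gaussian's spectral standard deviation:

```python
import numpy as np

def morlet_power(signal, sfreq, freq, c=6.7):
    """Power of `signal` at `freq` Hz via convolution with a unit-energy
    complex Morlet wavelet; c relates the center frequency to the
    spectral SD (sigma_f = freq / c, sigma_t = 1 / (2*pi*sigma_f))."""
    sigma_t = c / (2 * np.pi * freq)                 # temporal SD of the Gaussian
    t = np.arange(-4 * sigma_t, 4 * sigma_t, 1 / sfreq)
    wavelet = np.exp(-t**2 / (2 * sigma_t**2)) * np.exp(2j * np.pi * freq * t)
    wavelet /= np.sqrt(np.sum(np.abs(wavelet)**2))   # unit energy normalization
    analytic = np.convolve(signal, wavelet, mode="same")
    return np.abs(analytic)**2

def db_baseline(power, times, base=(-0.8, -0.2)):
    """Decibel baseline correction relative to the pre-feedback window."""
    mask = (times >= base[0]) & (times <= base[1])
    return 10 * np.log10(power / power[mask].mean())
```

Averaging the dB-scaled power over 4-8 Hz and 200-400 ms at FCz would then yield an FMθ score analogous to the one reported here.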

| Statistical analysis
All statistical analyses were performed using the Statistical Package for the Social Sciences (SPSS for Windows, version 25; Chicago, IL, USA). For all inferential statistics, the alpha level was set at .05. Switching frequencies, RewP amplitude, and FMθ power were analyzed using 2 × 2 × 2 × 3 mixed-design analyses of variance (ANOVAs) with immediate, intermediate, and delayed consequences (positive/negative) as within-subject factors and stimulus-account mapping as a between-subjects factor (3 groups; see below). For resulting interactions between within-subject factors, scores aggregated across non-impacting variables and difference scores (i.e., ∆RewP) of the effective variables were computed. Whenever the assumption of sphericity was violated, Greenhouse-Geisser correction was used to adjust the degrees of freedom. Two-tailed t tests for paired samples were performed to check whether the effect of individual outcomes differed depending on other outcomes.
Because the associations between outcome valence and letters or colors are relatively overlearned compared to geometric shapes and might therefore be more easily processed, the stimulus-account mapping was considered as a between-subjects factor (3 groups), as suggested by an anonymous reviewer. For group 1, the immediate consequence was indicated by the letter, the intermediate consequence by the color of the letter, and the delayed consequence by the shape surrounding the letter. For group 2, the shape signaled the immediate wins and losses, the letter the intermediate consequence, and the color of the letter the delayed consequence. For the third group, the color and shape indicated immediate and intermediate consequences, respectively, whereas the letter signaled the money won or lost for the delayed account. The sample was not split into groups for the stability and reliability analyses because the stimulus mapping was constant between the main session and the retest.
In order to extract the individual responsivity of the RewP to the different consequence types, we calculated ∆RewP difference scores by subtracting aggregated RewP amplitudes for negative consequences from those for positive consequences, separately for each of the three temporal consequence types. To assess internal consistency reliability, Spearman-Brown corrected correlations (r_SB) were calculated between absolute RewP amplitudes and ∆RewP scores resulting from odd and even trials (i.e., split-half reliability). Moreover, we calculated Cronbach's alpha across all eight conditions, indicating the general interrelation between condition-averaged RewP amplitudes (cf. Thigpen et al., 2017). Relative temporal stability of absolute RewP amplitudes and ∆RewP scores from T1 to T2 was assessed with Pearson correlation coefficients. To test whether the RewP has a behavioral effect, multiple regressions of switching frequencies on the ∆RewP scores were performed. Effect sizes were calculated as partial eta squared for ANOVAs and as Cohen's d for t tests. The same statistical analyses were applied to FMθ.
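The split-half reliability estimate can be sketched as follows: the Pearson correlation between odd- and even-trial scores across participants is stepped up with the Spearman-Brown prophecy formula for a test of double length (a standard formula; the function name is ours):

```python
import numpy as np

def spearman_brown_split_half(scores_odd, scores_even):
    """Split-half reliability: correlate odd- and even-trial scores
    across participants, then apply the Spearman-Brown correction
    r_SB = 2r / (1 + r)."""
    r = np.corrcoef(scores_odd, scores_even)[0, 1]
    return 2 * r / (1 + r)
```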

| Time-domain RewP
Mean RewP amplitudes are shown in Figure 2. Grand average ERP waveforms and topographical maps are shown in Figure 3a. There were significant main effects of the consequence factors (F(1,96) = 20.05, p < .001, η²p = .17), with greater amplitudes for positive outcomes for each consequence type. There were significant interactions of immediate and intermediate consequences (F(1,96) = 22.64, p < .001, η²p = .19) as well as of immediate and delayed consequences (F(1,96) = 6.30, p = .014, η²p = .06). These interactions reflected a modulation of the ∆RewP effect size for intermediate and delayed consequences by the immediate factor. In particular, the ∆RewP effect was larger for intermediate (t(98) = 4.78, p < .001, d = 0.68) and delayed consequences (t(98) = 2.54, p = .013, d = 0.36) when the immediate consequence was positive compared to negative. Further, we found a main effect of stimulus-account mapping (F(2,96) = 5.76, p = .004, η²p = .11) and a significant interaction of stimulus-account mapping and immediate consequences (F(2,96) = 5.94, p = .004, η²p = .11). Post hoc tests showed greater amplitudes after positive than after negative immediate outcomes within all stimulus-account mappings (all t(98) > 4.42, all p < .001). However, this effect was significantly larger when the immediate consequence was signaled by the color as compared to the letter (t(65) = 3.74, p < .001, d = 0.91). There were no other significant effects of stimulus-account mapping on RewP amplitude for the immediate consequence. Altogether, all three consequence types induced a significant ∆RewP effect, which, however, was more pronounced for the immediate compared to the intermediate and delayed consequences. Moreover, the general sensitivity of the RewP to intermediate and delayed consequences was stronger when the immediate consequence of a given outcome was positive compared to negative. The strongest RewP was evoked by the optimal outcome, which entailed positive consequences at all three payout time points; the greater the deviation from this optimum, the smaller the RewP. Overall, a similar result pattern was observed for the retest data (see Figures 2 and 3b). A detailed description of the retest results can be found in the Supporting Information.

Figure 2. Mean RewP amplitudes per condition at FCz across groups for the main testing session and the retest. Error bars represent the standard error of the mean (SEM).

Figure 3. Grand averaged ERP waves for each outcome (immediate, intermediate, delayed consequences) are shown in the upper row; the lower row depicts the difference waves for positive minus negative consequences per condition and topographical maps for the difference waves (200-300 ms). Graphs on the left (a) refer to the main testing session (T1); graphs on the right (b) are based on the retest data. All waveforms are collapsed across stimulus-account mapping groups (group-wise results are shown in Supporting Information Figure S1).
Considering all eight condition-wise RewP amplitudes as items, we observed a very high Cronbach's alpha of .99. Moreover, when considering each condition separately, we also found very high split-half reliabilities (r_SB = .92 to .95). With regard to ∆RewP scores, split-half reliabilities were only in a low to moderate range (r_SB = .51 to .61) (see Table 1). For the condition-wise absolute RewP amplitudes, we observed high test-retest correlation coefficients (see Table 1). Concerning the ∆RewP scores, we observed substantive stability for the immediate and delayed consequences. The ∆RewP for intermediate consequences showed a very low and non-significant test-retest stability. Given the rather low split-half reliabilities of the ∆RewP scores, we applied a double correction for attenuation to estimate test-retest stabilities under the assumption of perfectly reliable ∆RewP scores (Muchinsky, 1996). These estimates pointed to a strong temporal stability of ∆RewP scores for immediate consequences and a moderate temporal stability for delayed consequences. Interestingly, we observed no significant interrelations between the three ∆RewP scores (r = −.10 to .13, all p > .21), indicating that the individual responsivities to the three consequence types were independent of each other. Again, this might be caused by the low reliabilities of these difference scores. However, even when we applied a double correction for attenuation, the correlations between the ∆RewP scores remained in a low range (r = −.18 to .23). To analyze the potential link between individual differences in the RewP and consequence-driven decision-making, we conducted multiple regressions of switching frequency (∆SF) scores on the ∆RewP scores for each consequence type. With adjusted R² < .01 for all of the regression models, none of them allowed statistically significant predictions (all F(3,95) < 1.47, all p > .23).
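The double correction for attenuation used above divides the observed correlation by the square root of the product of the two reliabilities, estimating the correlation under perfectly reliable measurement (a standard classical-test-theory formula; the function name is ours):

```python
import math

def disattenuated_r(r_xy, rel_x, rel_y):
    """Double correction for attenuation: estimated correlation between
    two variables if both were measured with perfect reliability,
    r_corrected = r_xy / sqrt(rel_x * rel_y)."""
    return r_xy / math.sqrt(rel_x * rel_y)
```

For instance, an observed test-retest correlation of .30 with split-half reliabilities of .50 and .60 corrects to .30 / √.30 ≈ .55.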
Split-half internal consistency of ∆FMθ scores (Table 1) generally fell below what can be considered acceptable (r_SB = .43 to .66), and the scores showed no strong interrelations between conditions (r = −.04 to .11, all p > .27). In contrast, for absolute FMθ power per condition, we observed good to excellent split-half reliabilities (r_SB = .91 to .92). For ∆FMθ scores, we observed a medium test-retest correlation for immediate consequences (r = .51), but only negligible and non-significant coefficients for intermediate and delayed consequences (r = .10 to .15). However, the temporal stability of the constituent power per consequence condition was moderate across all three consequence types (r = .59 to .64).
Multiple regression analyses of ∆SF scores on ∆FMθ for each consequence type did not yield any statistically significant regression models, with all adjusted R² < .01 (all F(3,95) < 1.32, all p > .27). Accordingly, we found no behavioral effect of FMθ.

| DISCUSSION
Human behavior is fundamentally driven by promoting actions that maximize reward through feedback-based learning. However, the outcomes of one's actions can be complex. This is especially the case when a single action involves multiple consequences that differ in the time at which they will occur. The aim of this study was to shed new light on the relationship between such complex forms of feedback and electrophysiological indices of reward processing (cf. Osinsky et al., 2018) by adding a third temporal dimension to the feedback information content. The results showed gradations in feedback-related activity relative to the valence of the three temporal outcome dimensions. That is, for each consequence, the RewP showed larger amplitudes for wins than for losses, and these differences were greatest for immediate consequences and decreased with increasing delay in payout time. These effects were considerably smaller for FMθ and did not relate to trial-to-trial choice adjustments. With that, our results are in line with recent studies that used similar levels of feedback information content following single or multi-step choices (e.g., Bellebaum et al., 2010; Cockburn & Holroyd, 2018; Frömer et al., 2016; Luft et al., 2014; Osinsky et al., 2018). They also underscore current theoretical assumptions about the role of the aMCC within a hierarchical reinforcement learning system (Holroyd & Umemoto, 2016; Holroyd & Yeung, 2012), which is able to produce and learn from prediction errors that stem from the processing of complex outcome information. However, regarding the use of more complex forms of feedback, it remains unclear whether the RewP reflects multiple prediction errors projected to the aMCC from subcortical critic regions or one overall valuation signal.
Recent studies using multi-step decision paradigms indicated simultaneous dissociable prediction errors at multiple levels of the hierarchy within critic module areas only as long as the reward information of different levels was presented separately (Daw et al., 2011; Diuk et al., 2013; Ribas-Fernandes et al., 2011). Functional imaging results in human and non-human primates show that the subjective value of immediate and delayed rewards is processed in distinguishable neuron populations along an anterior-posterior gradient within the medial prefrontal cortex (Roesch et al., 2006, 2012; Wang et al., 2014). In addition, it has been shown that the firing pattern of neurons in the OFC and VS represents individual reward preferences (O'Doherty et al., 2006; Tremblay & Schultz, 1999). In this light, the lack of substantive interrelations between the ∆RewP effects may indicate that the individual responsivities to immediate and delayed consequences reflect two distinct trait-like properties of the brain, which represent the subjective weighting of outcome aspects. This could be realized by anatomical differences in neuron populations (e.g., size, effective connectivity) that translate into stable differences in the signals sent to the aMCC, or by functional differences such as attentional orienting to different sub-information of complex feedback stimuli (Leong et al., 2017; Niv et al., 2015; Oemisch et al., 2019). Thereby, unique motivational values based on stable preferences for immediate over delayed rewards can be placed on different reward quantities. Hence, stable differences in delay discounting may give rise to differences in item-specific utility by assigning different attentional weights to different stimulus features.
While the results of this study suggest that the values of different outcome dimensions are processed by an integrated system and reflected in the RewP, this assumption need not hold true for unsigned prediction errors or salience signals. Since it has been shown that the time windows before and after the RewP are sensitive to salience (e.g., Nieuwenhuis et al., 2004; Sambrook & Goslin, 2014), the current use of three different visual stimuli (letters, colors, and geometrical shapes) to indicate the three consequence types is a potential confound in this study, because they may be perceptually easier (color and letter) or harder (shape) to decode and, thus, more or less salient. Although the stimulus-account mapping was randomized, small stimulus effects were observed in the RewP time window. However, as can be seen in Supporting Information Figure S1, the stimulus-consequence interaction is most prominent at the N1 and does not seem to reflect perceptual differences between the stimuli themselves. While the role of salience is beyond the scope of this study, this issue deserves further attention in future studies.
If the RewP reflects stable differences in outcome evaluation, one could expect a relation between RewP amplitude and habitual decision-making. However, in the present study, we did not observe any substantive covariance between RewP responsivities to immediate and delayed outcome consequences and choice behavior. This clearly contrasts with our previous and very similar study (Osinsky et al., 2018), where we did observe such a relationship. However, the present task, with its eight response options and three outcome dimensions, is rather difficult to learn and might promote exploration more than exploitation. This is also reflected in the switching frequency away from the optimal outcome, which, at about 47%, is rather high. It should be noted, though, that the literature on the RewP-behavior link is generally highly mixed (for an overview, see Holroyd & Umemoto, 2016). What are the reasons for these inconsistencies? The system might learn simple response-outcome associations in a model-free, low-level fashion that would be linked to trial-to-trial adjustments in behavior. The aMCC, however, chooses and maintains temporally extended goal-directed policies that follow model-based principles and are not directly linked to trial-by-trial behavioral adjustments (Holroyd & Umemoto, 2016; Sambrook & Goslin, 2015). The link between the RewP and behavioral choice should, therefore, be weaker in contexts that favor goal-directed control (Walsh & Anderson, 2012). Despite the fact that delay discounting causes people to place a higher subjective value on instantaneous outcomes, our participants reported that delayed outcomes (offering the greatest objective value) were most important to them.
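For illustration, the delay-discounting effect mentioned above is commonly modeled with a hyperbolic value function (standard textbook notation; the parameters were not estimated in this study):

$$V = \frac{A}{1 + kD},$$

where $A$ is the objective amount, $D$ the delay, and $k$ an individual discount rate. A reward of 10 points delayed by $D = 9$ time units is, for $k = 1$, subjectively worth only $10/(1+9) = 1$ point, which is why instantaneous outcomes can dominate subjective value even when delayed outcomes carry the greatest objective value.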
Hence, model-free prediction errors in this task might be based on subjective value, consistent with research showing that dopaminergic regions were sensitive to subjective value only in the evaluation phase but not during choice (Liu et al., 2011), and with our finding that immediate positive reward has the largest effect on the RewP. In contrast, a later goal-directed system may continue to calculate prediction errors in a model-based fashion, which eventually drove behavioral choice in our task (cf. Bayer & Glimcher, 2005). There is also the possibility that the weighting of immediate and delayed rewards shifts during the course of the experiment, depending on how much money has already been collected for each account.
To extend our previous work (Osinsky et al., 2018), we also analyzed oscillatory correlates of multidimensional feedback processing. FMθ power has consistently been interpreted as a within-network communication mechanism indicating an increased need for cognitive control (for a review, see Cavanagh & Frank, 2014). Although flexible recruitment of different brain areas seems especially crucial when facing multiple external demands, our results indicated that, contrary to the time-domain RewP, FMθ responses to the different delay times are much smaller and depend on the quantity of negative outcomes rather than positive ones. Still, making fine-grained distinctions between multiple favorable options, in order to learn which one is the most rewarding, enables the trade-off between reward maximization and the minimization of resource expenditure. Consistent with this idea and previous findings, the RewP is primarily driven by reward-related variance (Foti et al., 2011; Proudfit, 2015; for a review, see Sambrook & Goslin, 2015). When it comes to adjusting control, however, a fine-grained dissociation may not be as crucial as it is for rewards (i.e., multiple unfavorable outcomes require similar adaptations in control). Therefore, the results of this study are consistent with the idea that the time-domain RewP and FMθ reflect dissociable components of reward and control processing, respectively (e.g., Holroyd & Umemoto, 2016; Luft et al., 2014; Osinsky et al., 2016; Rawls et al., 2020).
Besides the general relation between the RewP and multicomponent outcome processing, we were interested in the psychometric properties of the RewP, an issue which has attracted some attention in recent years (e.g., Bress et al., 2015; Ethridge & Weinberg, 2018; Levinson et al., 2017; Luking et al., 2017). Our analyses pointed to excellent reliabilities of condition-wise absolute RewP amplitudes and also showed that a very high proportion of RewP variance is shared between conditions. Since common true-score variance is largely canceled out when two highly correlated variables are subtracted from each other, it is not surprising that the reliabilities of the ∆RewP (and also ∆FMθ) scores dropped to a rather unsatisfying level. This problem of difference scores has long been known, and some authors have, therefore, questioned the use of such indices (e.g., Cronbach & Furby, 1970). However, because ERPs are not unidimensional measures but rather reflect a mixture of several contributing variables, the application of between-condition contrasts and difference scores is broadly accepted for isolating mechanisms of interest (e.g., Hardy, 2005; Keil et al., 2014; Luck, 2014). Accordingly, the high interrelation between condition-wise RewP amplitudes in our study might be caused by interindividual differences in a few broad and stable neurocognitive, neurophysiological, and/or neuroanatomical variables that contribute large proportions of systematic variance to every RewP measurement under different outcome conditions. Using absolute RewP amplitudes as indices of reward processing may, therefore, even obscure individual differences in the neurocognitive mechanism of interest. Moreover, it has been argued that the concept of reliability in terms of classical test theory is unsuitable for the measurement precision of difference scores (e.g., Fischer, 2003).
Future studies on the psychometrics of ERP difference scores may, therefore, apply other approaches to this issue (e.g., item response theory).
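The reliability drop reported above follows directly from classical test theory. Assuming equal variances of the two condition-wise scores, the reliability of their difference is

$$\rho_{D} = \frac{\tfrac{1}{2}\left(\rho_{xx} + \rho_{yy}\right) - \rho_{xy}}{1 - \rho_{xy}},$$

where $\rho_{xx}$ and $\rho_{yy}$ are the reliabilities of the two scores and $\rho_{xy}$ is their intercorrelation. With illustrative values (not the study's actual estimates) of $\rho_{xx} = \rho_{yy} = .90$ and $\rho_{xy} = .85$, the difference score retains a reliability of only $(.90 - .85)/(1 - .85) \approx .33$, despite excellent condition-wise reliabilities.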

| CONCLUSION
In sum, our results show that the RewP is sensitive to the graded value of action outcomes resulting from the integration of three temporally distinct consequences of a single outcome event. From a reinforcement learning perspective (cf. Holroyd & Coles, 2002; Holroyd & Umemoto, 2016; Holroyd & Yeung, 2012), our data indicate that the aMCC may receive finely nuanced and complex teaching signals about action-outcome values. Moreover, these signals may be subject to relatively stable interindividual differences, which might reflect personal reward preferences. However, because of the low spatial resolution and the inverse problem of EEG, future studies on this issue may employ other techniques (e.g., combined EEG-fMRI recording) to further elucidate where exactly in the brain the integration of immediate and delayed outcome consequences into a single value takes place.

ACKNOWLEDGMENTS
Open access funding enabled and organized by ProjektDEAL.