EEG mismatch responses in a multimodal roving stimulus paradigm provide evidence for probabilistic inference across audition, somatosensation, and vision

The human brain is constantly subjected to a multimodal stream of probabilistic sensory inputs. Electroencephalography (EEG) signatures, such as the mismatch negativity (MMN) and the P3, can give valuable insight into neuronal probabilistic inference. Although reported for different modalities, mismatch responses have largely been studied in isolation, with a strong focus on the auditory MMN. To investigate the extent to which early and late mismatch responses across modalities represent comparable signatures of uni‐ and cross‐modal probabilistic inference in the hierarchically structured cortex, we recorded EEG from 32 participants undergoing a novel tri‐modal roving stimulus paradigm. The employed sequences consisted of high and low intensity stimuli in the auditory, somatosensory and visual modalities and were governed by unimodal transition probabilities and cross‐modal conditional dependencies. We found modality specific signatures of MMN (~100–200 ms) in all three modalities, which were source localized to the respective sensory cortices and shared right lateralized prefrontal sources. Additionally, we identified a cross‐modal signature of mismatch processing in the P3a time range (~300–350 ms), for which a common network with frontal dominance was found. Across modalities, the mismatch responses showed highly comparable parametric effects of stimulus train length, which were driven by standard and deviant response modulations in opposite directions. Strikingly, P3a responses across modalities were increased for mispredicted stimuli with low cross‐modal conditional probability, suggesting sensitivity to multimodal (global) predictive sequence properties. Finally, model comparisons indicated that the observed single trial dynamics were best captured by Bayesian learning models tracking unimodal stimulus transitions as well as cross‐modal conditional dependencies.


| INTRODUCTION
Humans inhabit a highly structured environment governed by complex regularities. The brain is subjected to such environmental regularities by a multimodal stream of sensory inputs ultimately constructing a perceptual representation of the world. The sensory system is thought to capitalize on statistical regularities to efficiently guide interaction with the world enabling anticipation and rapid detection of sensory changes (Bregman, 1994;Dehaene et al., 2015;Friston, 2005;Frost et al., 2015;Gregory, 1980;Winkler et al., 2009).
Neuronal responses to deviations from sensory regularities can be valuable windows into the brain's processing of statistical properties of the environment and corresponding sensory predictions. The presentation of rare deviant sounds within a sequence of repeating standard sounds induces well known mismatch responses (MMRs) that can be recorded with electroencephalography (EEG), such as the mismatch negativity (MMN; Naatanen et al., 1978;Naatanen et al., 2007) and the P3 (or P300;Polich, 2007;Squires et al., 1975;Sutton et al., 1965). The MMN is defined as a negative EEG component resulting from subtraction of standard from deviant trials between ~100 and 200 ms poststimulus. Although the MMN has primarily been researched in the auditory modality, similar early mismatch components have been reported for other sensory modalities, including the visual (Kimura et al., 2011;Pazo-Alvarez et al., 2003;Stefanics et al., 2014) and, to a lesser extent, the somatosensory modality (Andersen & Lundqvist, 2019;Hu et al., 2013;Kekoni et al., 1997). The P3 is a later positive going component in response to novelty between 200 and 600 ms around central electrodes, which has been described for the auditory, somatosensory, and visual modalities and is known for its modality independent characteristics (Escera et al., 2000;Friedman et al., 2001;Knight & Scabini, 1998;Polich, 2007;Schroger, 1996).
Despite being one of the most well-studied EEG components, the neuronal generation of the MMN remains a subject of ongoing debate (Garrido, Kilner, Stephan, & Friston, 2009;May & Tiitinen, 2010;Naatanen et al., 2005). Two prominent but opposing accounts cast the MMN as adaptation-based or memory-based, respectively.
Adaptation-based accounts argue that the observed differences between standard and deviant responses primarily result from neuronal attenuation leading to stimulus specific adaptation (SSA; Jaaskelainen et al., 2004;May et al., 1999). In animals, SSA has been shown to result in response patterns similar to the MMN (Ulanovsky et al., 2003;Ulanovsky et al., 2004) and simulation work suggests that different types of MMN-like responses can be reproduced by pure adaptation models (May & Tiitinen, 2010). However, it remains unclear if the full range of MMN characteristics can be explained by adaptation alone (Fitzgerald & Todd, 2020;Garrido, Kilner, Stephan, & Friston, 2009;Wacongne et al., 2012). The memory-based view, on the other hand, suggests that the MMN is a marker of change detection based on sensory memory trace formation (Näätänen, 1990;Naatanen et al., 2005;Naatanen & Näätänen, 1992). The memory trace stores local information on stimulus regularity and compares it to incoming sensory inputs that may signal changes in the current sensory stream.
While largely neglected by previous interpretations of the MMN, it is becoming increasingly clear that key empirical features of MMRs concern stimulus predictability rather than stimulus change per se. The MMN has been reported in response to abstract rule violations (Paavilainen, 2013), unexpected stimulus repetitions (Alain et al., 1994;Horvath & Winkler, 2004;Macdonald & Campbell, 2011) and unexpected stimulus omissions (Heilbron & Chait, 2018;Hughes et al., 2001;Salisbury, 2012;Wacongne et al., 2011;Yabe et al., 1997). Similar characteristics have been reported for P3 MMRs (Duncan et al., 2009;Prete et al., 2022) and both MMN and P3 responses have been shown to increase for unexpected compared to expected deviants (Sussman, 2005;Sussman et al., 1998). Insights concerning the predictive nature of MMRs have led to further development of the memory-based account of MMN generation into the model-adjustment hypothesis (Winkler, 2007).
This view assumes a perceptual model that is informed by previous stimulus exposure and continually predicts incoming sensory inputs.
The model is updated whenever inputs diverge from current predictions, and the MMN is hypothesized to constitute a marker of such divergence.
The model-adjustment hypothesis is in line with the increasingly influential view that the brain is engaging in perceptual inference to anticipate future sensory inputs (Friston, 2005;Gregory, 1980;Von Helmholtz, 1867). Related theories regard the brain as an inference engine and come with neuronal implementation schemes that accomplish probabilistic (Bayesian) inference in a neurologically plausible manner (Bastos et al., 2012;Friston, 2005, 2010). Process theories such as predictive coding assume that the brain maintains a generative model of its environment which is continuously updated by comparing incoming sensory information with model predictions on different levels of hierarchical cortical organization (Friston, 2005, 2010;Rao & Ballard, 1999;Winkler & Czigler, 2012). Differential influences of SSA and change detection on the MMN are proposed to result from the same underlying process of prediction error minimization, mediated by different post-synaptic changes to (predicted) sensory inputs (Auksztulewicz & Friston, 2016;Garrido, Kilner, Kiebel, & Friston, 2009). As such, the theory has the potential to unify previously opposing theories of MMN generation (Garrido, Kilner, Kiebel, & Friston, 2009;Garrido, Kilner, Stephan, & Friston, 2009) while accounting for its key empirical features (Heilbron & Chait, 2018;Wacongne et al., 2012).
With regard to the proposed universal nature of predictive accounts of brain function, reports of comparable MMRs across different modalities are of particular interest. So far, mismatch signals have been primarily studied in isolation, with a strong focus on the auditory system. However, key properties of the auditory MMN, such as omission responses and modulations by predictability, have also been reported for the visual (Czigler et al., 2006;Kok et al., 2014) and the somatosensory MMN (Andersen & Lundqvist, 2019;Naeije et al., 2018), and modeling studies in all three modalities suggest that MMRs may reflect signatures of Bayesian learning (BL; Gijsen et al., 2021;Lieder et al., 2013;Maheu et al., 2019;Ostwald et al., 2012;Stefanics et al., 2018). While studies directly investigating mismatch signals in response to multimodal sensory inputs are rare, previous research indicates a ubiquitous role for cross-modal probabilistic learning. The brain tends to automatically integrate auditory, somatosensory, and visual stimuli during sequence processing (Bresciani et al., 2006, 2008;Frost et al., 2015) and cross-modal perceptual associations can influence statistical learning of sequence regularities (Andric et al., 2017;Parmentier et al., 2011), modulate MMRs (Besle et al., 2005;Butler et al., 2012;Friedel et al., 2020;Kiat, 2018;Zhao et al., 2015) and influence subsequent unimodal processing in various ways (Shams et al., 2011). Recent advances in modeling Bayesian causal inference suggest that the main computational stages of multimodal inference evolve along a multisensory hierarchy involving early sensory segregation followed by mid-latency sensory fusion and late Bayesian causal inference (Cao et al., 2019;Rohe et al., 2019;Rohe & Noppeney, 2015).
However, the extent to which the MMN and P3 reflect these stages and should be considered sensory specific signatures of regularity violation or the result of modality independent computations in an underlying predictive network is not fully understood.
The current study aimed to investigate the commonalities and differences between MMRs in different modalities in a single experiment and to elucidate to what extent they reflect local, unimodal or global, cross-modal computations. To this end, we employed a roving stimulus paradigm, in which auditory, somatosensory, and visual stimuli were simultaneously presented in a probabilistic tri-modal stimulus stream.
Typically, MMRs are studied with the oddball paradigm, in which rarely presented "oddball" stimuli deviate from frequently presented standard stimuli in some physical feature, such as sound pitch or stimulus intensity. The roving stimulus paradigm, on the other hand, defines deviants and standards in terms of their local sequence position, while the frequency of occurrence of their stimulus features across the sequence is equal (Baldeweg et al., 2004;Cowan et al., 1993). The deviant is defined as the first stimulus that breaks a train of repeating (standard) stimuli. With repetition, the deviant subsequently becomes the new standard, defining a train of stimulus repetitions. Thus, the roving stimulus paradigm is an excellent tool to experimentally induce MMRs, while controlling for differences in physical stimulus features.
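The positional definition of standards and deviants given above can be illustrated with a short labeling routine. This is a minimal sketch rather than the analysis code used in the study: the function name `label_roving_sequence`, the label for the first stimulus, and the convention that a deviant carries the length of the train it terminates are all assumptions made for illustration.

```python
from typing import List, Tuple


def label_roving_sequence(stimuli: List[int]) -> List[Tuple[str, int]]:
    """Label each stimulus of a binary roving sequence (hypothetical helper).

    Returns one (label, count) pair per stimulus:
      - ("first", 0) for the very first stimulus, which has no predecessor,
      - ("standard", k) for a repetition that is the k-th stimulus of its train,
      - ("deviant", n) for a change that ends a train of n stimuli.
    """
    labels: List[Tuple[str, int]] = []
    run = 0  # number of stimuli in the current train so far
    for idx, stim in enumerate(stimuli):
        if idx == 0:
            labels.append(("first", 0))
            run = 1
        elif stim == stimuli[idx - 1]:
            run += 1
            labels.append(("standard", run))
        else:
            labels.append(("deviant", run))
            run = 1  # the deviant opens a new train
    return labels
```

Because labels depend only on local sequence position, both intensity levels take the role of standard and of deviant across the sequence.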
Based on a probabilistic model, we generated sequences of high and low intensity stimuli that were governed by unimodal transition probabilities as well as cross-modal conditional dependencies. This allowed us to test to what extent early and late MMRs are sensitive to local and global violations of statistical regularities and to draw conclusions regarding their potential role in cross-modal hierarchical inference. Specifically, we extracted the MMN and P3 MMRs for each modality and investigated their modality specific and modality general response properties regarding stimulus repetition and change, as well as their sensitivity to cross-modal predictive information. Further, we used source localization to investigate modality specific and modality general neuronal generators of MMRs. Finally, we complemented our average-based analyses with single-trial modeling to investigate if signatures of unimodal and cross-modal Bayesian inference can account for trial-to-trial fluctuations in the MMN and P3 amplitudes.

| MATERIALS AND METHODS
Participants underwent a novel multimodal version of the roving stimulus paradigm. Our paradigm, depicted in Figure 1, consisted of simultaneously presented auditory (A), somatosensory (S), and visual (V) stimuli, which each alternated between two different intensity levels ("low" and "high"). The tri-modal stimulus sequences originated from a single probabilistic model (described in Section 2.3), resulting in different combinations of low and high stimuli across the three modalities in each trial.

| Experimental setup
Each trial consisted of three bilateral stimuli (A, S, and V) that were presented simultaneously by triggering three instantaneous outputs of a data acquisition card (National Instruments Corporation, Austin, Texas, USA) every 1150 ms (inter-stimulus interval).
Auditory stimuli were presented via in-ear headphones (JBL, Los Angeles, California, USA) to both ears and consisted of sinusoidal waves of 500 Hz and 100 ms duration that were modulated by two different amplitudes. The amplitudes were individually adjusted with the participants to obtain two clearly distinguishable intensities (mean of the low intensity stimulus: 81.43 ± 1.22 dB; mean of the high intensity stimulus: 93.02 ± 0.98 dB).
Somatosensory stimuli were administered with two DS5 isolated bipolar constant current stimulators (Digitimer Limited, Welwyn Garden City, Hertfordshire, UK) via adhesive electrodes (GVB-geliMED GmbH, Bad Segeberg, Germany) attached to the wrists of both arms.
The stimuli consisted of electrical rectangular pulses of 0.2 ms duration. To account for interpersonal differences in sensory thresholds, the two intensity levels used in the experiment were determined on an individual basis. The low intensity level (mean: 3.97 ± 0.84 mA) was set in proximity to the detection threshold yet high enough to be clearly perceivable (and judged to be the same intensity on both sides). The high intensity level (mean: 6.47 ± 1.33 mA) was determined for each participant to be easily distinguishable from the low intensity level yet remaining non-painful and below the motor threshold.
Visual stimuli were presented via light emitting diodes (LEDs) and transmitted through optical fiber cables mounted vertically centered to both sides of a monitor. The visual flashes consisted of rectangular waves of 100 ms duration that were modulated by two different amplitudes (low intensity stimulus: 2.65 V; high intensity stimulus: 10 V) that were determined to be clearly perceivable and distinguishable prior to the experiment. Participants were seated at a distance of about 60 cm to the screen such that the LEDs were placed within the visual field at a visual angle of about 67°.
In each of six experimental runs of 11.5 min, a sequence of 600 stimulus combinations was presented. To ensure that participants maintained attention throughout the experiment and to encourage monitoring of all three stimulation modalities, participants were instructed to respond to occasional catch trials (target questions) via foot pedals. In six trials randomly placed within each run the fixation cross changed to one of the letters A, T, or V followed by a question mark. This prompted participants to report if the most recent stimulus (directly before appearance of the letter) in the auditory (letter A), somatosensory (letter T for "tactile"), or visual (letter V) modality was presented with low or high intensity. The right foot was used to press either a left or a right pedal, and the pedal assignment (left = low/ right = high or left = high/right = low) was counterbalanced across participants.
It should be noted that our MMR paradigm, in the form of an attended roving stimulus sequence with relatively long ISI (1.15 s), differs from the classic oddball protocol for MMN elicitation in which participants are engaged in a primary task, often attending a separate modality. Since this is not easily possible with our paradigm (containing auditory, somatosensory, and visual stimuli), we used catch trials in each modality to ensure that attentional resources were distributed largely equally across the simultaneous stimulus streams. The ISI, at the upper end of the range used for MMN elicitation, was set during piloting such that the perceptually demanding tri-modal bilateral stimulation was deemed not to be overwhelming in terms of sensory overload.

| Probabilistic sequence generation
Each of the three sensory modalities (A, S, V) was presented as a binary (low/high) stimulus sequence originating from a common probabilistic model. Three types of stimulus sequences, depicted in Figure 2, were generated with different probability settings. The settings determine the transition probabilities within each modality given the arrangement of the other two modalities (i.e., either congruent or incongruent). One setting defines lower change probability if the other two modalities are congruent (e.g., for any change in modality A from $t-1$ to $t$, S and V were congruent with $p_{100|000}$
In each of six experimental runs, the stimulus sequence was defined by one of the three different probability settings. Each probability setting was used twice during the experiment and the order of the six different sequences was randomized. Participants were unaware of the sequence probabilities and any learning of sequence probabilities was considered to be implicit and task irrelevant.
FIGURE 1 Experimental paradigm. Participants were seated in front of a screen and received sequences of simultaneously presented bilateral auditory beep stimuli (green), somatosensory electrical pulse stimuli (purple) and visual flash stimuli (orange) each at either low or high intensity. On consecutive trials, stimuli within each modality either repeated the previous stimulus intensity of that modality (standard) or alternated to the other intensity (deviant). This created tri-modal roving stimulus sequences, where the repetition/alternation probability in each modality was determined by a single probabilistic model (see Section 2.3). In 1% of trials (catch trials) the fixation cross changed to one of the three letters A, T, or V, interrupting the stimulus sequence. The letter prompted participants to indicate whether the last auditory (letter A), somatosensory (letter T for "tactile") or visual (letter V) stimulus, respectively, was of high or low intensity. Responses were given with a left or right foot pedal press using the right foot.
FIGURE 2 Probabilistic sequence generation. (a) Schematic of state transition matrix (left). Light color shading depicts transitions in the respective modality which were assigned specific transition probabilities: Green = auditory change, purple = somatosensory change, orange = visual change, gray diagonal = tri-modal repetition, white = multimodal change (set to zero). States 0-7 correspond to a specific stimulus combination (right), that is, eight permutations of low (0) and high (1)
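The state coding described in the Figure 2 caption (eight tri-modal states, with transitions classified as repetition, unimodal change, or multimodal change) can be sketched as follows. This is an illustrative sketch only: the ordering of the digits as (A, S, V), the mapping of state indices 0-7 to binary tuples, and the helper names `transition_type` and `others_congruent` are assumptions, not details confirmed by the text.

```python
from itertools import product
from typing import Tuple

State = Tuple[int, int, int]
MODALITIES = ("A", "S", "V")  # assumed digit order: auditory, somatosensory, visual

# Eight tri-modal states; each entry is 0 (low) or 1 (high).
# Index i in this list is an assumed stand-in for "state i" in the figure.
STATES = list(product((0, 1), repeat=3))

CHANGE_NAMES = {
    "A": "auditory change",
    "S": "somatosensory change",
    "V": "visual change",
}


def transition_type(prev: State, curr: State) -> str:
    """Classify a transition by which modalities changed intensity."""
    changed = [m for m, p, c in zip(MODALITIES, prev, curr) if p != c]
    if not changed:
        return "repetition"
    if len(changed) == 1:
        return CHANGE_NAMES[changed[0]]
    return "multimodal change"  # assigned probability zero in the caption


def others_congruent(state: State, modality: str) -> bool:
    """True if the two modalities other than `modality` share one intensity."""
    others = [v for m, v in zip(MODALITIES, state) if m != modality]
    return others[0] == others[1]
```

Under this coding, each state has exactly one repetition and one possible change per modality, so only four of the eight entries in each row of the transition matrix can be non-zero.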
Following the nomenclature suggested by Arnal and Giraud (2012), the resulting stimulus transitions for each modality within the different sequences can be defined as being either predicted (here higher change probability conditional on congruency/incongruency), mispredicted (here lower change probability conditional on congruency/incongruency) or unpredictable (here equal change probability).
For each modality, repetitions are more likely ($p = .825$) than changes ($p = .175$) regardless of the type of probability setting and stimulus, resulting in classic roving standard sequences for each modality (mean stimulus train length: 5, mean range of train length: 2-34 stimuli).
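To make the marginal repetition and change probabilities concrete, the following sketch draws a single binary roving sequence with a fixed change probability of .175. It is a deliberate simplification: the study's sequences come from a tri-modal model with cross-modal conditional dependencies, which is not reproduced here. The function names, the seed handling, and the random choice of the first stimulus are assumptions for illustration.

```python
import random
from typing import List


def generate_roving_sequence(n_trials: int, p_change: float = 0.175,
                             seed: int = 0) -> List[int]:
    """Draw a binary (0 = low, 1 = high) sequence from a first-order chain.

    Simplification: one modality only, with a constant change probability.
    """
    if n_trials <= 0:
        return []
    rng = random.Random(seed)
    seq = [rng.randint(0, 1)]  # first stimulus chosen at random (assumption)
    for _ in range(n_trials - 1):
        if rng.random() < p_change:
            seq.append(1 - seq[-1])  # alternate: this trial is a deviant
        else:
            seq.append(seq[-1])      # repeat: this trial is a standard
    return seq


def train_lengths(seq: List[int]) -> List[int]:
    """Lengths of consecutive runs of identical stimuli."""
    if not seq:
        return []
    lengths = []
    run = 1
    for prev, curr in zip(seq, seq[1:]):
        if curr == prev:
            run += 1
        else:
            lengths.append(run)
            run = 1
    lengths.append(run)
    return lengths
```

For example, `train_lengths(generate_roving_sequence(600))` gives the run lengths for one simulated run of 600 trials.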

| EEG data collection and preprocessing
Data were collected using a 64-channel active electrode EEG system (ActiveTwo, BioSemi, Amsterdam, Netherlands) at a sampling rate of 2048 Hz, with head electrodes placed in accordance with the extended 10-20 system. Individual electrode positions were recorded using an electrode positioning system (zebris Medical GmbH, Isny, Germany).
Preprocessing of the EEG data was performed using SPM12 (Wellcome Trust Centre for Neuroimaging, Institute for Neurology, University College London, London, UK) and in-house MATLAB scripts (MathWorks, Natick, MA). First, the data were referenced against the average reference, high-pass filtered (0.01 Hz), and downsampled to 512 Hz. Subsequently, eye-blinks were corrected using a topographical confound approach (Berg & Scherg, 1994;Ille et al., 2002). The data were epoched using a peri-stimulus time interval of −100 to 1050 ms and all trials were visually inspected and artifactual data removed. Likewise, catch trials were omitted for all further analyses. Furthermore, the EEG data of two consecutive participants were found to contain excessive noise due to hardware issues, resulting in their exclusion from further analyses and leaving data of 32 participants. Finally, a low-pass filter was applied (45 Hz) and the preprocessed EEG data were baseline corrected with respect to the pre-stimulus interval of −100 to −5 ms. To use the general linear model (GLM) implementation of SPM, the electrode data of each participant were linearly interpolated into a 32 × 32 grid for each time point, resulting in one three-dimensional image (with dimensions 32 × 32 × 590) per trial. These images were then spatially smoothed with a 12 × 12 mm full-width half-maximum Gaussian kernel to meet the requirements of random field theory, which the SPM software uses to control the family-wise error rate.
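The baseline correction step can be illustrated for a single channel and a single epoch. This is a generic sketch in plain Python, not the SPM12 routine used in the study; the function name `baseline_correct` and its arguments are hypothetical, and only the window limits (−100 to −5 ms) are taken from the text.

```python
from typing import List


def baseline_correct(epoch: List[float], times_ms: List[float],
                     start_ms: float = -100.0,
                     end_ms: float = -5.0) -> List[float]:
    """Subtract the mean of the pre-stimulus window from every sample.

    `epoch` holds the voltage samples of one channel for one trial and
    `times_ms` holds the peri-stimulus time of each sample in milliseconds.
    """
    if len(epoch) != len(times_ms):
        raise ValueError("epoch and times_ms must have the same length")
    baseline = [v for v, t in zip(epoch, times_ms) if start_ms <= t <= end_ms]
    if not baseline:
        raise ValueError("no samples fall inside the baseline window")
    mean = sum(baseline) / len(baseline)
    return [v - mean for v in epoch]
```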

| Event-related responses and statistical analysis
First, to extract basic MMR signals of each modality from the EEG data, we contrasted standard and deviant trials of each modality with paired t-tests corrected for multiple comparisons by using cluster-based permutation tests implemented in FieldTrip (Maris & Oostenveld, 2007). Two time windows of interest were defined based on the literature (Duncan et al., 2009)
To test for the implicit effect of cross-modal predictability based on the different conditional probability settings in the sequence, a Predictability model was defined that consisted of 37 regressors: an intercept regressor and 18 regressors coding standards and deviants of each modality for each of the three conditions described above: unpredictable (trials originate from sequences with no conditional dependence between modalities), predicted (trials originate from sequences with conditional dependence; trials defined by change being likely), mispredicted (trials originate from sequences with conditional dependence; trials defined by change being unlikely). On the single-participant level, these were coded for congruent and incongruent trials separately, resulting in 36 regressors. By definition, the number of trials in regressors with mispredicted trials was lowest, on average corresponding to around 60 deviant trials and 800 standard trials per modality.
Finally, a P3-Conjunction model was specified that consisted of seven regressors: an intercept regressor and six regressors coding all standards and deviants for each of the three modalities. This model was used to apply SPM's second level conjunction analysis, contrasting standards and deviants across modalities in search of common P3 effects across modalities.
Each GLM was estimated on the single-trial data of each participant using restricted maximum likelihood estimation. This yielded β-parameter estimates for each model regressor over (scalp-) space and time, which were subsequently analyzed at the group level. Second level analyses consisted of a mass-univariate multiple regression analysis of the individual β scalp-time images with a design matrix specifying regressors for each condition of interest as well as a subject factor. Second level beta estimates were contrasted for statistical inference and multiple comparison correction was achieved with SPM's random field theory-based FWE correction (Kilner et al., 2005).

| Source localization
In addition to the TLCD model, different BL models were created to contrast the static train-length-based TLCD model with dynamic generative models tracking transition probabilities. The BL models consist of conjugate Dirichlet-Categorical models estimating probabilities of observations read out by three different surprise functions: Bayesian surprise (BS), predictive surprise (PS), and confidence-corrected surprise (CS).
BS quantifies the degree to which an observer adapts their generative model to incorporate new observations (Baldi & Itti, 2010;Itti & Baldi, 2009) and is defined as the Kullback-Leibler (KL) divergence between the belief distribution prior and posterior to the update. PS is based on Shannon's (1948) definition of information and defined as the negative logarithm of the posterior predictive distribution, assigning high surprise to observed events $y_t$ with low estimated probability of occurrence: $PS(y_t) = -\ln p(y_t \mid s_t) = -\ln p(y_t \mid y_{t-1}, \ldots, y_1)$. CS additionally considers the commitment of the generative model and scales with the negative entropy of the prior distribution (Faraji et al., 2018). It is defined as the KL divergence between the (informed) prior distribution at the current time step and a flat prior distribution $\hat{p}(s_t)$ updated with the most recent event $y_t$.
Following Faraji et al. (2018), surprise quantifications can be categorized as puzzlement or enlightenment surprise. While puzzlement refers to the mismatch between sensory input and internal model belief, closely related to the concept of prediction error, enlightenment refers to the update of beliefs to incorporate new sensory input. In the current study, we were interested in a quantification of the model inadequacy by means of an unsigned prediction error as reflected by surprise. As such, throughout the manuscript, with prediction error we do not refer to the specific term of (signed) reward prediction error as used for example in reinforcement learning but rather use it to refer to the signaling of prediction mismatch. While both PS and CS are instances of puzzlement surprise, CS is additionally scaled by belief commitment and quantifies the concept that a low-probability event is more surprising if commitment to the belief (of this estimate) is high.
BS, on the other hand, is an instance of enlightenment surprise and is considered a measure of the update to the generative model resulting from new incoming observations.
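For reference, the three surprise read-outs can be written side by side. The following is a sketch based on the verbal definitions given here, using $s_t$ for the hidden parameters and $y_t$ for the observation at trial $t$. The argument order of the KL terms and the exact conditioning notation are assumptions and should be checked against Gijsen et al. (2021).

```latex
\begin{align}
  BS(y_t) &= KL\big(\, p(s_{t-1} \mid y_{t-1}, \ldots, y_1) \,\big\|\, p(s_t \mid y_t, \ldots, y_1) \,\big) \\
  PS(y_t) &= -\ln p(y_t \mid s_t) = -\ln p(y_t \mid y_{t-1}, \ldots, y_1) \\
  CS(y_t) &= KL\big(\, p(s_t) \,\big\|\, \hat{p}(s_t \mid y_t) \,\big)
\end{align}
```

Here $\hat{p}(s_t \mid y_t)$ denotes the flat prior $\hat{p}(s_t)$ after it has been updated with the most recent event $y_t$.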
A detailed description of the Bayesian observer, its transition probability version as well as the surprise read-out functions can be found in our previous work on somatosensory MMRs (Gijsen et al., 2021). Here, we will primarily provide a brief description of the specifics of two implementations of Dirichlet-Categorical observer models, a unimodal and a cross-modal model. Both observer models receive stimulus sequences (of one respective modality) as input and iteratively update a set of parameters with each new incoming observation. In each iteration, the estimated parameters are read out by the surprise functions (BS, PS, and CS) to produce an output which is subsequently used as a predictor for the EEG data.
For each modality, the unimodal Dirichlet-Categorical model considers a binary sequence with two possible stimulus identities (low and high) estimating transition probabilities with $y_t = o_t$ for $t = 1, \ldots, T$ with a set of hidden parameters $s^{(i)}$ for each possible transition from $o_{t-1} = i$. This unimodal model does not capture any cross-modal dependencies in the sequence (i.e., the alternation and repetition probabilities conditional on the tri-modal stimulus configuration).
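A minimal sketch of such a unimodal transition-probability observer with a predictive surprise read-out is given below. It assumes a flat Dirichlet prior (one pseudo-count per transition) and perfect memory, so it omits the forgetting parameter tau that the Bayesian model comparison section refers to. The function name `predictive_surprise_tp` and the `prior_count` argument are hypothetical, and the code is not the implementation used in the study.

```python
import math
from typing import List, Optional


def predictive_surprise_tp(seq: List[int],
                           prior_count: float = 1.0) -> List[Optional[float]]:
    """Predictive surprise of a Dirichlet-Categorical transition observer.

    alpha[i][j] is the pseudo-count for transitions from stimulus i to
    stimulus j (0 = low, 1 = high). Before each update, the posterior
    predictive probability of the observed transition is
    alpha[i][j] / sum(alpha[i]), and PS is its negative logarithm.
    """
    alpha = [[prior_count, prior_count], [prior_count, prior_count]]
    # The first stimulus has no predecessor, so no transition surprise.
    surprise: List[Optional[float]] = [None] if seq else []
    for prev, curr in zip(seq, seq[1:]):
        p_obs = alpha[prev][curr] / sum(alpha[prev])
        surprise.append(-math.log(p_obs))
        alpha[prev][curr] += 1.0  # conjugate update with the new observation
    return surprise
```

In a train of repetitions the surprise of each further repetition shrinks, while the surprise of an eventual change grows with the length of the preceding train, which is the qualitative pattern a train-length effect would require.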
Therefore, we defined a cross-modal Dirichlet-Categorical model to address the question whether the conditional dependencies were used by the brain during sequence processing for prediction of stimulus change. The dependencies in the sequence were independent of stimulus identity but provide information about the probability of repetition or alternation ($d_t$) conditional on the congruency of the other modalities. The cross-modal model thus estimates alternation proba-
Gijsen et al. (2021) where the interested reader is kindly referred to for further information. First, the stimulus sequence-specific regressor of each model was obtained for each participant. After z-score normalization, the regressors were fitted to the single-trial, event-related electrode data using a free-form variational inference algorithm for multiple linear regression (Flandin & Penny, 2007;Penny et al., 2003;Penny et al., 2005). The obtained model-evidence maps were subsequently subjected to the BMS procedure implemented in SPM12 to draw inferences across participants with well-established handling of the accuracy-complexity trade-off (Woolrich, 2012

| Bayesian model comparison
The estimated model-evidence maps were used to evaluate the models' relative performance across participants via family-wise BMS (Penny et al., 2010). The model space was partitioned into three types of families to draw inference on different aspects of the involved models. Given that the literature provides some evidence for each of the three surprise read-out functions (BS, PS, CS) to capture some aspect of EEG MMRs, we included all of them in the family-wise comparisons to avoid biasing the comparison of different BL models.
The first model comparison considered the full space of BL models as a single family (BL family) and compared it to the TLCD model (TLCD family) and the null model (NULL family). Since the BL models had their tau parameter optimized, which was not possible for the TLCD model, we applied the same penalization method used in our previous study (Gijsen et al., 2021

| Auditory MMRs
The MMN, as the classic MMR, has originally been studied in the auditory modality and is commonly described as the ERP difference wave calculated by subtraction of standard trials from deviant trials (deviants-standards). This difference wave typically shows a negative deflection at fronto-central electrodes and corresponding positivity at temporo-parietal sites, ranging from around 100 to 250 ms (Naatanen et al., 1978;Naatanen et al., 2007). Correspondingly, we find a signifi-
Within early and late auditory MMR clusters, the response to both standards and deviants was modulated by the number of standard repetitions. The auditory system is known to be sensitive to stimulus repetitions, particularly within the roving standard paradigm (Baldeweg et al., 2004;Cowan et al., 1993;Ulanovsky et al., 2003;Ulanovsky et al., 2004). Therefore, we hypothesized a gradual increase of the auditory response to standard stimuli around the time of the MMN, known as repetition positivity (Baldeweg, 2006;Baldeweg et al., 2004;Haenschel et al., 2005) as well as reciprocal negative modulation of the corresponding deviant response (Bendixen et al., 2007;Naatanen et al., 2007).

| Visual MMRs
We hypothesized visual MMRs to present as an early MMN at occipi-

| Cross-modal P3 effects
In search of a common P3 effect to deviant stimuli, we created conjunctions of the deviants > standards contrasts across the auditory, somatosensory, and visual modalities. The conjunction revealed a common significant cluster starting at ~300 ms (cluster $p_{\mathrm{FWE}} < .05$) that comprised anterior central effects around 300-350 ms followed by more posterior effects from 400 to 600 ms (Figure 4a).
To investigate the modulation of the P3 MMR by predictability, we used two-way ANOVAs with the three-level factor modality (auditory, somatosensory, visual) and the three-level factor predictability condition (predicted, mispredicted, unpredictable). Separate ANOVAs were applied to deviants and standards. We hypothesized that the cross-modal P3 MMR might be sensitive to multisensory predictive information in the sequence, as the P3 has been shown to be sensitive to global sequence statistics (Bekinschtein et al., 2009;Wacongne et al., 2011) and to be modulated by stimulus predictability (Horvath et al., 2008;Horvath & Bendixen, 2012;Max et al., 2015;Prete et al., 2022;Ritter et al., 1999;Sussman et al., 2003). Indeed, within the common P3 cluster, both deviants (299-313 ms, peak $p_{\mathrm{FWE}} < .05$) and standards (316-332 ms, peak $p_{\mathrm{FWE}} < .05$) show significant differences between predictability conditions. No significant interaction of predictability condition with modality was observed.
Post hoc t tests were applied to the peak beta estimates to investigate the differences between the three pairs of conditions. Taken together, this result suggests that stimuli which were mispredicted based on the predictive multisensory configuration resulted in increased responses within the common P3 cluster compared to predicted or unpredictable stimuli, regardless of their role as standards or deviants in the current stimulus train.
For completeness, we also tested the effect of predictability in the earlier MMN cluster, but we did not observe any significant modulations here (results not shown).

| Source localization
The source reconstruction analysis resulted in significant clusters of activation for each modality's MMN as well as the P3 MMR. The results are depicted in Figure 5 and cytoarchitectonic references are described in Table 1.

Table 1) suggests additional activation of hierarchically higher sensory areas such as secondary somatosensory cortex for the sMMN (p_uncorr < .001; left cluster: peak t = 4.21; right cluster: peak t = 5.01) and lateral occipital cortex (fusiform gyrus) for vMMN (part of the primary visual cluster). In addition to the sensory regions, common frontal sources with dominance on the right hemisphere were identified using a conjunction analysis for the MMN of all three modalities. In particular, significant common source activations were found in the right inferior frontal gyrus (IFG; p_FWE < .05; cluster: peak t = 3.15) and right middle frontal gyrus (MFG; p_FWE < .05; cluster: peak t = 2.89). Additional significant common sources include frontal pole (p_FWE < .05; left cluster: peak t = 2.56; right cluster: peak t = 2.28), left inferior temporal gyrus (p_FWE < .05; cluster: peak t = 2.52) and right inferior parietal lobe (p_FWE < .05; cluster: peak t = 2.85).

| DISCUSSION
The present study set out to compare mismatch signals in response to tri-modal sequence processing in the auditory, somatosensory, and visual modalities and to investigate influences of predictive cross-modal information. We found comparable but modality spe-

4.1 | Modality specific mismatch signatures in response to tri-modal roving stimuli
By using a novel tri-modal roving stimulus sequence originating from an underlying Markov process of state transitions, we were able to elicit and extract unique EEG signatures in each of the three sensory modalities (auditory, somatosensory, and visual).
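To make the notion of a roving sequence generated by a Markov process concrete, the following is a minimal single-modality sketch. It is a simplification: the sequences in the study were tri-modal and included cross-modal conditional dependencies, which are not modeled here. The function names, the change probability of 0.2, and the labeling scheme are hypothetical.

```python
import random

def roving_sequence(n_trials, p_change, seed=0):
    """Sample a binary roving sequence (0 = low, 1 = high intensity).

    On each trial the stimulus switches with probability p_change;
    otherwise the previous stimulus repeats.
    """
    rng = random.Random(seed)
    seq = [rng.randint(0, 1)]
    for _ in range(n_trials - 1):
        switch = rng.random() < p_change
        seq.append(1 - seq[-1] if switch else seq[-1])
    return seq

def label_trials(seq):
    """Label trials as 'deviant' (stimulus changed) or 'standard' (repeat).

    Also returns, for each deviant, the length of the preceding train,
    counting every stimulus of that train including its first one.
    """
    labels, train_lengths, run = ["first"], [], 1
    for prev, cur in zip(seq, seq[1:]):
        if cur != prev:
            labels.append("deviant")
            train_lengths.append(run)
            run = 1
        else:
            labels.append("standard")
            run += 1
    return labels, train_lengths

seq = roving_sequence(n_trials=1000, p_change=0.2)
labels, train_lengths = label_trials(seq)
```

Because each deviant becomes the first stimulus of the next train, both intensities serve as standard and as deviant over the course of the sequence, which is the property that separates mismatch effects from effects of physical stimulus differences.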
Of the EEG mismatch signatures, the auditory MMN is one of the most widely researched responses to deviation from an established stimulus regularity (Naatanen et al., 1978;Winkler et al., 2009) (Naatanen et al., 2007) and might speak against pure N1 adaptation (as suggested by Jaaskelainen et al., 2004;May et al., 1999).
The somatosensory equivalent to the auditory MMN (sMMN) reported in the current study shows negative polarity at bilateral temporal electrodes and corresponding central positivity. The sMMN likely reflects an enhanced N140 component, as suggested by Kekoni et al. (1997). However, most previous sMMN studies used oddball paradigms, for which it remains debated whether the sMMN can be distinguished from an N140 modulation by stimulus properties alone. Here, we report an sMMN around the N140 which can be assumed to be independent of stimulus confounds due to the reversed roles of standard and deviant stimuli in the roving paradigm.
Although several previous studies have reported somatosensory mismatch responses, conflicting evidence exists regarding the exact components that may constitute an equivalent to the auditory MMN.
Some studies report a more fronto-centrally oriented negativity (Kekoni et al., 1997;Shen et al., 2018;Spackman et al., 2007;Spackman et al., 2010) or observed such pronounced central positivity that they were led to conclude that it is in fact the central positivity that should be considered the somatosensory equivalent of the aMMN (Akatsuka et al., 2005;Shinozaki et al., 1998). However, some evidence appears to converge on a temporally centered negativity with corresponding central positivity as the primary sMMN around 140 ms (Gijsen et al., 2021;Ostwald et al., 2012).
While the auditory and somatosensory MMNs in the current study were found to be highly comparable in their signal strength, their hypothesized counterpart in the visual modality showed a comparatively weaker response. Nevertheless, we found a significant vMMN at occipital electrodes extending to temporal electrodes within a time window of 100-200 ms poststimulus, with corresponding (fronto-) central positivity. This observation is in line with previous research reporting posterior (Cleary et al., 2013; Kimura et al., 2010; Urakawa et al., 2010) and temporal (Heslenfeld, 2003; Kuldkepp et al., 2013) patterns of vMMN with corresponding central positivity (Cleary et al., 2013; Czigler et al., 2006; File et al., 2017).

| Neuronal generators of MMN signatures
Source reconstruction analyses were used to identify underlying neuronal generators of the modality specific MMN signatures. Interestingly, for each sensory modality, we found generators in the primary and higher order sensory cortices as well as additional frontal generators in IFG and MFG.
For the visual modality, we identified sources in visual areas (V1-V4) and additional frontal activations in IFG and MFG as the neuronal generators underlying the vMMN. Previous studies have shown similar combinations of visual and prefrontal areas (Kimura et al., 2010;Kimura et al., 2011;Kimura et al., 2012;Urakawa et al., 2010;Yucel et al., 2007) and have particularly highlighted the IFG as a frontal generator of the vMMN (Downar et al., 2000;Hedge et al., 2015). Similarly, an fMRI study of perceptual sequence learning in the visual system has shown right lateralized prefrontal activation in addition to activations in visual cortex in response to regularity violations (Huettel et al., 2002). Yet another study has suggested a role for right prefrontal areas in interaction with hierarchically lower visual areas for the prediction of visual events (Kimura et al., 2012), all in line with our results.
Overall, our finding of inferior and middle frontal sources for the MMN in all three modalities provides further evidence for a modality independent role for these generators as previously suggested by Downar et al. (2000). As such, these modality-independent frontal generators might reflect higher stages of a predictive hierarchy working across modalities in interaction with lower modality specific regions, as previously suggested primarily for the auditory modality (Garrido, Kilner, Stephan, & Friston, 2009).

| Modulation of the MMN by stimulus repetition
An important feature of the MMN which theories of its generation have aimed to account for is its sensitivity to stimulus repetition. The MMN is known to increase with prior repetition of standards (Imada et al., 1993; Javitt et al., 1998; Naatanen, 1992; Sams et al., 1983). Correspondingly, in the current study, we find a significant increase of auditory and somatosensory MMN with the length of the preceding stimulus train as well as a comparable tendency for the vMMN. Moreover, we show that this increase was driven by a reciprocal negative modulation of deviant and positive modulation of standard responses, suggesting a combined influence of repetition dependent change detection and dynamics akin to stimulus adaptation.
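The way opposite train length modulations of standards and deviants combine into an increase of the difference wave can be sketched with a simple linear trend. The amplitude values below are hypothetical and noise free, and the least-squares slope is only a stand-in for the parametric train length effects tested in the study.

```python
import numpy as np

def train_length_slope(train_lengths, amplitudes):
    """Least-squares slope of amplitude against train length."""
    slope, _intercept = np.polyfit(train_lengths, amplitudes, deg=1)
    return slope

# Hypothetical mean amplitudes (microvolts) in an MMN window for train lengths 1..6
lengths = np.arange(1, 7)
standard_amp = 0.10 * lengths + 0.2   # standards grow more positive with repetition
deviant_amp = -0.15 * lengths - 0.5   # deviants grow more negative with train length
mmn_amp = deviant_amp - standard_amp  # deviant-minus-standard amplitude

slope_std = train_length_slope(lengths, standard_amp)
slope_dev = train_length_slope(lengths, deviant_amp)
slope_mmn = train_length_slope(lengths, mmn_amp)
```

Since the difference wave is a subtraction, its slope equals the deviant slope minus the standard slope, so a positive standard trend and a negative deviant trend both make the MMN more negative with longer trains.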
The observed positive modulation of standard responses, particularly in the auditory modality, is in line with the repetition positivity account of Baldeweg and colleagues (Baldeweg, 2006; Baldeweg, 2007; Baldeweg et al., 2004; Haenschel et al., 2005). In the auditory modality, repetition positivity has been isolated as a positive slow wave that accounts for repetition-dependent increases of auditory ERPs up to the P2 component (Haenschel et al., 2005). With regard to its functional role, it has been argued to reflect auditory sensory memory trace formation (Baldeweg et al., 2004; Costa-Faidella, Baldeweg, et al., 2011). Interestingly, MMN studies using the oddball paradigm often report an increasing MMN with standard repetition without further dissecting the contributions from standard and deviant dynamics. A contribution of the standard repetition positivity appears to be particularly dominant in roving stimulus paradigms (Cooper et al., 2013), potentially because a memory trace of the standard stimulus identity must be re-established after each change of roles for standard and deviant stimuli. It has even been suggested that the memory trace dynamics of the standard observed in response to roving oddball sequences might in fact be the primary driver of train length effects on MMN amplitudes (Baldeweg et al., 2004; Costa-Faidella, Baldeweg, et al., 2011; Haenschel et al., 2005). Importantly, although some evidence exists to suggest an additional role for train length dependent deviant modulation also in roving paradigms (Cowan et al., 1993; Haenschel et al., 2005), a dissection of combined standard and deviant contributions as performed here is rarely described.
Similar to the aMMN, we found the sMMN to be modulated by stimulus repetition. An early repetition positivity effect in the responses to standards was observed prior to 100 ms indicating comparable sensory adaptation dynamics as described for the aMMN. In the visual modality, a comparable train length effect to auditory and somatosensory modalities was observed but did not reach statistical significance in the vMMN time window. Given the overall weaker response in the current study for vMMN this might not be surprising.
Moreover, discussions about the repetition modulation of vMMN responses are often based on findings concerning the auditory system rather than direct findings in the visual modality. While sensory adaptation to stimulus repetition is generally found throughout the visual system (e.g., Clifford et al., 2007;Grill-Spector et al., 2006) it is rarely directly reported in vMMN studies (but see Kremlacek et al., 2016).
Overall, the vMMN literature seems to suggest that the vMMN may be a rather unstable phenomenon. In fact, by controlling for confounding effects, one study has called the existence of the vMMN for low level features such as the ones used here into question entirely (Male et al., 2020). The vMMN appears to show a much less pronounced spatiotemporal pattern than auditory and somatosensory equivalents, which is reflected in larger variance in the reported topographies and time windows in studies investigating vMMN (but see Section 11 for a discussion of alternative explanations regarding the current study).

| MMN as a signature of predictive processing
Recent research supports the view that Bayesian perceptual learning mechanisms underlie the generation of mismatch responses such as the MMN (Friston, 2005, 2010; Garrido, Kilner, Stephan, & Friston, 2009). Given the proposal of Bayesian inference and predictive processing as universal principles of perception and perceptual learning in the brain (Friston, 2005, 2010), comparable mismatch responses are expected to be found across sensory modalities. Evidence for the predictive nature of mismatch responses, akin to key findings from the auditory modality, is for instance given by studies showing somatosensory (Andersen & Lundqvist, 2019; Naeije et al., 2018) and visual (Czigler et al., 2006; Kok et al., 2014) MMN in response to predicted but omitted stimuli. Moreover, Ostwald et al. (2012) and Gijsen et al. (2021) have shown that single trial somatosensory MMN and P3 MMRs can be accounted for in terms of surprise signatures of Bayesian inference models tracking stimulus transitions.
Similarly, the vMMN has been described as a signature of predictive processing (Kimura et al., 2011;Stefanics et al., 2014), signaling prediction error instead of basic change detection (Stefanics et al., 2018).
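A Bayesian observer that tracks stimulus transitions, as referred to above, can be sketched for a binary sequence with a Beta-Bernoulli model per preceding stimulus. This is a minimal illustration under stated assumptions: the class name `TransitionLearner`, the flat prior with pseudo-count 1, and the absence of any forgetting mechanism are choices made here and do not reproduce the specific models of the cited studies.

```python
import math

class TransitionLearner:
    """Beta-Bernoulli observer tracking first-order transitions of a binary sequence."""

    def __init__(self, prior_count=1.0):
        # counts[prev][cur]: pseudo-count for observing `cur` right after `prev`
        self.counts = [[prior_count, prior_count], [prior_count, prior_count]]

    def predict(self, prev, cur):
        """Posterior predictive probability of `cur` given the previous stimulus."""
        row = self.counts[prev]
        return row[cur] / (row[0] + row[1])

    def update(self, prev, cur):
        """Incorporate one observed transition."""
        self.counts[prev][cur] += 1.0

def predictive_surprise(seq, learner):
    """Per-trial surprise -log p(y_t | y_{t-1}, history), starting at the second trial."""
    out = []
    for prev, cur in zip(seq, seq[1:]):
        out.append(-math.log(learner.predict(prev, cur)))
        learner.update(prev, cur)
    return out

# Toy roving sequence: a train of five 0s, a change to 1, a train of 1s, a change back
seq = [0, 0, 0, 0, 0, 1, 1, 1, 1, 0]
surprise = predictive_surprise(seq, TransitionLearner())
```

In this sketch, surprise decreases over repeated standards and jumps at the change, and the jump is larger after a longer train, which mirrors the train length effects discussed for the mismatch responses.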
Correspondingly, we found comparable mismatch signatures in auditory, somatosensory, and visual modalities. The train length effects observed in our study across modalities have previously been related to predictive processing. Repetition positivity in the auditory modality has been interpreted as a reflection of repetition suppression, resulting from fulfilled prediction (Auksztulewicz & Friston, 2016; Baldeweg, 2007; Costa-Faidella, Baldeweg, et al., 2011). A corresponding negative modulation of deviant responses, on the other hand, would signal a failure to suppress prediction error after violation of the regularity established by the current stimulus train. Under such a view, longer trains of repetitions lead to higher precision in the probability estimate which in turn results in a scaling of the prediction error in response to prediction violation (Auksztulewicz & Friston, 2016; Friston, 2005). In line with these hypotheses, Garrido and colleagues (Garrido, Kilner, Kiebel, & Friston, 2009) (Fardo et al., 2017). As we find involvement of such modality specific sensory and modality independent frontal areas for MMN responses across modalities, our results suggest comparable roles for these sources in a predictive hierarchy.

| P3 Mismatch responses reflect cross-modal processing
In addition to the modality specific MMN responses, deviants in all three modalities elicited a late positive mismatch component in the P3 time window. Despite differences in the exact latency and extent of this response between modalities, we identified a common mismatch cluster from 300 to 350 ms in central electrodes, followed by a slightly more posterior cluster extending from 400 to 600 ms. Particularly the earlier cluster may correspond to the well-known P3a response, which peaks at around 300 ms after change-onset at (fronto-) central electrodes and is thought to be elicited regardless of sensory modality (Escera et al., 2000;Friedman et al., 2001;Knight & Scabini, 1998;Polich, 2007;Schroger, 1996).
The P3a is closely related to the MMN as they are both elicited during active and passive perception of repeated stimuli interrupted by infrequent stimulus deviations (Polich, 2007;Schroger et al., 2015).
While the P3a has been initially related to attentional switches to task-irrelevant but salient stimulus features (Escera et al., 2000; Friedman et al., 2001; Polich, 2007), more recent accounts suggest that the MMN and P3a might reflect two stages of a predictive hierarchy, each representing (potentially differentiable) prediction error responses (Wacongne et al., 2011). Similar to the MMN, P3 responses are known to be modulated by stimulus probability (Duncan-Johnson & Donchin, 1977) and can be elicited by unexpected stimulus repetitions (Duncan et al., 2009; Squires et al., 1975) and omissions of predicted sound stimuli (Prete et al.; Sutton et al., 1967), which provides compelling evidence for a role of the P3 in predictive processing. Similar to the MMN responses described above, we found the individual P3 MMR responses in all three modalities to show reciprocal modulations of standards and deviants by stimulus repetition, which has previously only been reported for the auditory modality (Bendixen et al., 2007). This sensitivity to stimulus repetition of mismatch responses in early and late time-windows has been interpreted in terms of regularity and rule extraction in the auditory modality (Bendixen et al., 2007) and is in line with an account of repetition suppression over and above early sensory adaptation.

(Max et al., 2015). It has thus been suggested that the P3 reflects a higher-level deviance detection system concerned with the significance of the stimulus in providing new information for the system (Horvath et al., 2008). Interestingly, a recent study investigating mismatch responses to different auditory features showed that while the MMN response in an earlier (classical) time window was generally affected by regularity violations, only the later response (P3 range) contained information about the specific features that were violated.
Furthermore, computational studies indicate that P3 responses reflect specific quantities of unexpectedness as well as updates to a prior belief (Jepma et al., 2016;Kolossa et al., 2015).
Overall, current research provides evidence for the view that the MMN reflects prediction errors at earlier hierarchical stages, primarily concerned with more local regularity extraction, whereas P3 responses reflect more global rule violations which require a certain level of abstraction and information integration (Bekinschtein et al., 2009;Wacongne et al., 2011;Winkler et al., 2005). Our findings of a sensitivity of the P3 response to cross-modal predictive information carried by the multimodal configuration of the stimulus sequence further supports such a view. Across modalities, we found an increased P3 response to mispredicted compared to predicted or unpredictable stimuli, regardless of their role as standards or deviants.
Generally, the P3 deviant response in the current study likely reflects a (unsigned) prediction error to a local regularity established by stimulus repetition. However, increased P3 responses to mispredicted stimuli indicate additional violations of global, cross-modal predictions which are extracted from multimodal context information.
The observed pattern suggests influences of precision weighting on prediction errors. In case of both predicted and mispredicted stimuli, the cross-modal predictive context allows for more precise predictions (i.e., high prior precision) than in case of the unpredictable stimuli (low prior precision). Under such an interpretation, the precision for mispredicted deviants is high, resulting in a pronounced prediction error response. Since the precision for predicted deviants is also high, the resulting prediction error response is low because the stimulus was suppressed. Even though the size of prediction error to unpredictable deviants could generally be expected in between those of predicted and mispredicted deviants, the observed response is low (similar to that of a predicted deviant), because the prior precision in this context is low. This interpretation is in line with the fact that no significant difference was found between predicted and unpredictable deviants. A similar modulation of multimodal predictability is found for the P3 response to standards. However, interestingly, in case of the standards, the response to predicted stimuli is significantly lower than to unpredictable stimuli. This difference between standards and deviants could be due to the fact that deviants are generally surprising, even if they are more predictable in terms of their cross-modal configuration. Standards, on the other hand, are generally predicted to occur (high precision) which might result in a pronounced suppression of prediction error in case they are additionally cross-modally predicted.

Linden, 2005; Polich, 2007), whereas parietal regions are presumed to be more involved in task-related P3b responses. The identified sources have been shown to be involved in a fronto-parietal network relevant for the supramodal processing of stimulus transitions and deviance detection (Downar et al., 2000; Huang et al., 2005).
Similarly, a fronto-parietal attention network (Corbetta & Shulman, 2002) has been shown to be involved in oddball processing in the auditory and visual modalities (Kim, 2014). The network consists of two functionally and anatomically distinct parts which closely interact (Vossel et al., 2014). While the dorsal part of the network is believed to be involved in the allocation of top-down, endogenous attention (e.g., triggered by predictive information), the ventral part is involved in bottom-up, exogenous attention allocation and thus, processing of unexpected stimuli. Importantly, it has been shown that this network operates supramodally to facilitate processing of information from multimodal events (Macaluso, 2010;Macaluso & Driver, 2005).
Thus, the predictive information in the multimodal sequences presented in the current study may be processed in such a fronto-parietal network to aid the perception of multimodal stimulus streams. Future research would benefit from studies further investigating such multimodal probabilistic sequences with higher spatial resolution to inform these proposed interpretations.

Bayesian inference and are thus markers of probabilistic sequence processing in the brain.

| Modeling single-trial EEG responses as signatures of Bayesian inference
Within the family of BL models, we found that a cross-modally informed model (UCM), tracking cross-modal conditional dependencies between modalities in addition to unimodal transitions, outperformed a purely unimodal transition probability model (UM) at central electrodes within an early and a late time-window. The cross-modal effects in the late time-window are directly in line with the sensitivity of the P3 cluster to cross-modal predictability discussed above and support an interpretation of P3 mismatch responses to reflect signatures of cross-modal Bayesian inference. Given that cross-modal learning was not explicitly instructed or task-relevant, the results are compatible with the view that the brain is sensitive to cross-modal information by default (Driver & Noesselt, 2008; Ghazanfar & Schroeder, 2006) and that processing multimodal information might be appropriately captured by Bayesian inference (Kording et al., 2007; Shams & Beierholm, 2022). Interestingly, however, an earlier cross-modal effect was found prior to 300 ms which was not reflected in the GLM results, suggesting that potential modulations of MMN signatures by predictability manifest in the dynamics of single trial surprise signals but not in significant mean differences between predictability conditions. Since the earlier cross-modal effect observed in the modeling results was primarily confined to central and frontocentral electrodes it may be related to activity of the frontal generators of the MMN. As discussed above, the frontal cortex is assumed to be involved in MMN generation (Deouell, 2007) in interaction with hierarchically lower sensory sources and has been hypothesized to form top-down predictions about incoming sensory stimuli (Garrido, Kilner, Kiebel, & Friston, 2009; Garrido, Kilner, Stephan, & Friston, 2009).
This assumption is further supported by our source reconstruction results which show modality independent frontal generators in addition to sensory specific regions to underlie the MMN in auditory, somatosensory, and visual modalities.
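The difference between a purely unimodal observer and one that additionally conditions on another modality can be illustrated with a toy comparison. Note the assumptions: this sketch scores how well each observer predicts a simulated stimulus sequence, whereas the model comparison reported in the study concerns how well model-derived surprise explains single trial EEG. The data-generating probabilities (0.1 and 0.4), the function name `log_evidence`, and the context encoding are hypothetical.

```python
import math
import random
from collections import defaultdict

def log_evidence(targets, contexts, prior_count=1.0):
    """Summed sequential log predictive probability of binary targets under a
    Beta-Bernoulli observer that keeps separate counts for each context."""
    counts = defaultdict(lambda: [prior_count, prior_count])
    total = 0.0
    for y, ctx in zip(targets, contexts):
        c = counts[ctx]
        total += math.log(c[y] / (c[0] + c[1]))  # predict before updating
        c[y] += 1.0
    return total

# Toy data (hypothetical): stimulus y switches with probability 0.1 when the
# state z of another modality is 0, and with probability 0.4 when z is 1.
rng = random.Random(1)
y_prev = 0
ys, prevs, zs = [], [], []
for _ in range(2000):
    z = rng.randint(0, 1)
    p_switch = 0.4 if z else 0.1
    y = 1 - y_prev if rng.random() < p_switch else y_prev
    ys.append(y)
    prevs.append(y_prev)
    zs.append(z)
    y_prev = y

# Unimodal observer: context is the previous stimulus of the same modality
lme_unimodal = log_evidence(ys, [(p,) for p in prevs])
# Cross-modally informed observer: context also includes the other modality's state
lme_crossmodal = log_evidence(ys, list(zip(prevs, zs)))
```

When the sequence carries a cross-modal dependency, the cross-modally informed observer assigns higher probability to the observed stimuli despite having more parameters to learn, which is the intuition behind comparing a UM-like against a UCM-like model.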
Regarding the surprise read-out functions of the BL models, we find a slight dominance of CS in earlier mismatch signatures prior to 200 ms, while the late clusters tend to reflect BS. This is well in line with our previous study performed in the somatosensory modality (Gijsen et al., 2021) and other studies have similarly reported a reflection of BS in P3 mismatch responses (Kolossa et al., 2015;Mars et al., 2008;Ostwald et al., 2012;Seer et al., 2016). Given their differences in reading out the probability estimates of the Bayesian observer, the different surprise signatures in the MMN and P3 MMR might provide some insight into their respective computational roles.
CS has been categorized as an instantiation of puzzlement surprise (Faraji et al., 2018) reflecting a mismatch between sensory input and internal model belief which is additionally scaled by belief commitment. Low-probability events are thus more surprising if commitment to the belief (of this estimate) is high. BS reflects incorporation of new information, quantifying an update to the generative model and has been categorized as enlightenment surprise (Faraji et al., 2018).
Accordingly, the MMN may be considered a marker of prediction error scaled by belief commitment, whereas the P3 may reflect the subsequent update of the predictive model.
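The two read-out functions can be made concrete for a Beta-distributed belief about the probability of a binary event. The sketch assumes BS to be the Kullback-Leibler divergence from the belief before to the belief after an observation, and CS to be the divergence from the current belief to the posterior of a naive observer with a flat prior (following the description in Faraji et al., 2018). The Beta-Bernoulli setting, the direction of the divergences, and all function names are assumptions for illustration rather than the exact model specification of the study.

```python
import math

def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:          # shift the argument upwards: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def kl_beta(a1, b1, a2, b2):
    """KL( Beta(a1, b1) || Beta(a2, b2) ) in nats."""
    return (log_beta(a2, b2) - log_beta(a1, b1)
            + (a1 - a2) * digamma(a1)
            + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def bayesian_surprise(a, b, y):
    """Divergence between the belief Beta(a, b) and its update after observing y (0 or 1)."""
    return kl_beta(a, b, a + y, b + 1 - y)

def confidence_corrected_surprise(a, b, y):
    """Divergence between the belief Beta(a, b) and a naive observer's posterior,
    i.e. a flat Beta(1, 1) prior updated with the single observation y."""
    return kl_beta(a, b, 1 + y, 2 - y)

# Belief Beta(20, 2): the event y = 1 is expected with probability of about 0.91
cs_unexpected = confidence_corrected_surprise(20, 2, y=0)
cs_expected = confidence_corrected_surprise(20, 2, y=1)
# Same expected probability, but a more committed (narrower) belief
cs_unexpected_committed = confidence_corrected_surprise(40, 4, y=0)
bs_unexpected = bayesian_surprise(20, 2, y=0)
bs_expected = bayesian_surprise(20, 2, y=1)
```

In this toy setting the unexpected event yields larger values for both read-outs, and CS grows further when the same expectation is held with more commitment, whereas BS quantifies how far the belief itself moves.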
Given that the P3 shows sensitivity to cross-modal prediction violation (GLM results) and tends to carry signatures of multimodal inference and model updating (Bayesian modeling results), we suggest that the P3 likely reflects model updates with respect to the multimodal context. Because the statistical regularities changed across experimental runs, albeit unnoticed by participants, the generative model continuously required updating. As such, our results might be reflective of a volatile sensory environment and relate to previous findings which indicate that later MMRs, such as the P3, reflect belief updates about the volatility of the underlying (hidden) statistics governing sensory observations (e.g., Weber et al., 2020; Weber et al., 2022). We leave it for future research to design experiments which are better suited to evaluate these speculations more specifically and more thoroughly.

| Limitations
Although we gained valuable insights into the commonalities and differences between mismatch responses in different modalities, our study faces certain limitations in its implementation and scope. First, although reports of weak vMMN responses can be found in the literature, an alternative explanation may lie in the stimulation protocol used in the current study. Our visual stimuli consisted of bilateral flash stimuli with two different intensities, which were presented in the periphery of the visual field. Since, to our knowledge, no other study has used visual flash stimuli to elicit vMMN, our results are not directly comparable to previous research. Moreover, due to the retinotopic organization of the visual cortex (Horton & Hoyt, 1991; Sereno et al., 1995), a "far peripheral" placement (i.e., >60°; Strasburger et al., 2011) of the LEDs results in the activation of (primary) visual areas folded deep inside the cortex, in the calcarine sulcus between the hemispheres. It is therefore possible that the visual mismatch responses were not weaker per se but were merely harder to detect by means of EEG.

| CONCLUSION
With the current study, we provide evidence for modality specific and modality independent aspects of mismatch responses in audition, somatosensation, and vision resulting from a simultaneous stream of tri-modal roving stimulus sequences. Our results suggest that responses to stimulus transitions in all three modalities are based on an interaction of hierarchically lower, modality specific areas with hierarchically higher, modality independent frontal areas. We show that similar dynamics underlie these mismatch responses which likely reflect predictive processing and Bayesian inference on unimodal and multimodal sensory input streams.

ACKNOWLEDGMENT
The authors would like to thank the HPC Service of ZEDAT, Freie Universität Berlin, for computing time. Open Access funding enabled and organized by Projekt DEAL.

FUNDING INFORMATION
This work was supported by Berlin School of Mind and Brain, Humboldt Universität zu Berlin (MG and SG, http://www.mind-and-brain.de/home/), and Deutscher Akademischer Austauschdienst (SG, https://www.daad.de/en/). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.