Sensitivity to syllable stress regularities in externally but not self‐triggered speech in Dutch

Several theories of predictive processing propose reduced sensory and neural responses to anticipated events. Support comes from magnetoencephalography/electroencephalography (M/EEG) studies, showing reduced auditory N1 and P2 responses to self‐generated compared to externally generated events, or when the timing and form of stimuli are more predictable. The current study examined the sensitivity of N1 and P2 responses to statistical speech regularities. We employed a motor‐to‐auditory paradigm comparing event‐related potential (ERP) responses to externally and self‐triggered pseudowords. Participants were presented with a cue indicating which button to press (motor–auditory condition) or which pseudoword would be presented (auditory‐only condition). Stimuli consisted of the participant's own voice uttering pseudowords that varied in phonotactic probability and syllable stress. We expected to see N1 and P2 suppression for self‐triggered stimuli, with greater suppression effects for more predictable features such as high phonotactic probability and first‐syllable stress in pseudowords. In a temporal principal component analysis (PCA), we observed an interaction between syllable stress and condition for the N1, where second‐syllable stress items elicited a larger N1 than first‐syllable stress items, but only for externally generated stimuli. We further observed an effect of syllable stress on the P2, where first‐syllable stress items elicited a larger P2. Strikingly, we did not observe motor‐induced suppression for self‐triggered stimuli for either the N1 or P2 component, likely due to the temporal predictability of the stimulus onset in both conditions. Taking into account previous findings, the current results suggest that sensitivity to syllable stress regularities depends on task demands.


| INTRODUCTION
The brain's capacity to formulate predictions of upcoming events in the environment is one of the most studied phenomena across sensory modalities (e.g., Baldeweg, 2006; Blakemore et al., 2000; Rao & Ballard, 1999). These predictions may relate to the timing ('when', temporal prediction) and content/quality ('what', formal prediction) of upcoming sensory events (Arnal & Giraud, 2012; Kotz & Schwartze, 2010) and are based on our acquired knowledge and experience of the world. A special form of prediction generated by the brain relates to the sensory consequences of our own actions. The underlying mechanism is described by the internal forward model of motor control. According to this model, when a motor plan is formulated, an internal copy of the command, termed 'efference copy', is used to generate a prediction of the anticipated sensory feedback. This prediction, or 'corollary discharge', is then compared to the actual sensory feedback (reafference signal), allowing the system to distinguish between self-generated and externally generated sensations and to monitor and adapt our own motor output more readily. This model has also been applied to speech production, linking psycholinguistic models of feedback monitoring at the phoneme and syllable level to general motor control mechanisms (e.g., Hickok, 2012; Kotz & Schwartze, 2016).
As a consequence of this mechanism, the sensory response to internally generated stimulation is suppressed, leading to well-known phenomena such as the inability to tickle oneself (Blakemore et al., 2000). This perceived sensory suppression, termed motor-induced suppression (MIS), goes hand in hand with the suppression of sensory-related neural activity, shown across multiple sensory domains, including somatosensory (Blakemore et al., 2000) and auditory (e.g., Christoffels et al., 2011; Knolle et al., 2012; Niziolek et al., 2013). The degree of MIS reflects the accuracy of the generated prediction: the better the match between predicted feedback and actual sensory feedback, the greater the suppression (e.g., Christoffels et al., 2011). MIS is modulated by stimulus properties, including the predictability of the frequency and timing of tones (Bäss et al., 2008; Knolle et al., 2013a) or manipulations of voice identity (Johnson et al., 2021), voice quality and timing in speech (Aliu et al., 2009; Behroozmand & Larson, 2011), and prototypicality of vowels (Niziolek et al., 2013). MIS is further modulated by experience, with musicians showing different suppression patterns than non-musicians (Ott & Jäncke, 2013). In summary, these findings suggest that greater suppression is indicative of more predictable sensory events and that this suppression may be modulated by experience.
One aspect of MIS that is poorly understood is the extent to which it depends on the predictability of the timing of stimulus onset. Studies investigating MIS typically do so by comparing identical stimuli that are either self-generated (e.g., via button press or speech production) or externally presented (e.g., Knolle et al., 2013a; Niziolek et al., 2013; Ott & Jäncke, 2013; Pinheiro et al., 2018). This is often done in a blocked design, where the externally presented stimuli are presented at the same time intervals as the self-triggered stimuli. Although this approach preserves the temporal structure of the stimulus streams across conditions, it does not fully control for temporal predictability (Hughes et al., 2013). Some studies have attempted to address this question by either introducing temporal uncertainty in the self-generated stimuli (Bäss et al., 2008; Lange, 2011; Pinheiro et al., 2019) or enhancing the temporal predictability of the externally presented stimuli through external cues (Harrison et al., 2021; Ody et al., 2022; Sowman et al., 2012). Their results suggest that the effect of temporal predictability on MIS may depend on design parameters: introducing temporal uncertainty in self-generated stimuli typically leaves MIS preserved, whereas temporal predictability of the externally presented stimuli seems to attenuate this suppression effect.
These observations suggest that MIS may be a suitable measure to investigate the brain's sensitivity to regularities in the formal and temporal structure of speech during production. Within speech and language, regularities exist at multiple (faster and slower) timescales, allowing the formulation of formal (e.g., phonotactic probability) and temporal (e.g., syllable stress) predictions across different processing levels. These predictions are established through exposure to regularities in speech throughout development, and evidence of sensitivity to these regularities is found already in infancy (Nazzi et al., 1998; Saffran et al., 1996). This sensitivity may provide an important foundation in the early stages of language acquisition, by allowing infants to segment the continuous speech signal into words (Jusczyk et al., 1999; Mattys & Jusczyk, 2001; Thiessen & Saffran, 2003), and continues to facilitate speech processing throughout the lifespan, as indicated by both behavioural and neural evidence.
There is ample neural evidence supporting the aforementioned behavioural observations of facilitated processing of more regular items in speech perception, with variations in phonotactic probability and stress patterns modulating neural processing (Bonte et al., 2005; Di Liberto et al., 2019; Emmendorfer et al., 2020; Rothermich et al., 2012; Tremblay et al., 2016). However, data on the neural correlates of these features in speech production are sparse. Functional magnetic resonance imaging (fMRI) investigations have shown sensitivity to distributional statistics such as phonotactic probability, syllable frequencies or mutual information in speech production tasks across the speech network, including auditory as well as motor regions, with reduced blood oxygenation level-dependent (BOLD) signal for items with a higher frequency of occurrence within the language (Papoutsi et al., 2009; Tremblay et al., 2016). These findings are in line with psycholinguistic models proposing that motor plans of more frequently occurring structures are stored in a 'mental syllabary', whereas less frequent articulatory representations need to be compiled from smaller units on the spot (Levelt, 1999; Levelt & Wheeldon, 1994; Schiller et al., 1996). Electrophysiological data on these features in speech production tasks are equally sparse. In a go/no-go task, where the 'go' decision was based on lexical stress position, N200 latency was earlier for words with first-syllable stress (Schiller, 2006). However, this was proposed to be related to the incremental encoding (i.e., from word onset to end) of the metre during speech production, rather than a function of typical/atypical stress patterns, which is further supported by behavioural findings with trisyllabic stimuli. Currently, there seem to be no studies investigating the effect of variations in phonotactic probability in speech production with electrophysiological methods.
The current experiment aimed to investigate how the predictability of phonotactic probability and syllable stress contributes to speech production, extending our knowledge from previous studies investigating speech perception (e.g., Bonte et al., 2005; Emmendorfer et al., 2020) and production (e.g., Schiller, 2006; Tremblay et al., 2016). To approach this question, we focused on MIS, as this allows investigating how such (ir)regularities modulate the accuracy of the predictions. Prior studies have shown that the suppression effect is sensitive to subconscious variations in the predictability of the speech signal (Behroozmand & Larson, 2011; Niziolek et al., 2013). Although some studies have investigated this phenomenon in overt speech production (e.g., Christoffels et al., 2011; Niziolek et al., 2013), this comes with challenges due to artefacts caused by engaging the facial muscles during articulation. Furthermore, overt production leads to variability in the pronunciation of the individual utterances, which can lead to changes in the degree of suppression (Niziolek et al., 2013). This is a particularly relevant constraint in the current design, as less familiar features may show more variability in articulation as well as more speech errors (Heisler & Goffman, 2016; Munson, 2001; Sasisekaran et al., 2010). To circumvent these challenges, we employed a button-press, or motor-to-auditory, paradigm, where the participant triggers the presentation of speech stimuli via button press (e.g., Knolle et al., 2019; Ott & Jäncke, 2013; Pinheiro et al., 2018). Although participants were able to consciously anticipate the upcoming stimulus in our paradigm, our aim was to investigate whether implicit knowledge of statistical regularities in speech (high vs. low phonotactic probability and first- vs. second-syllable stress) would influence the strength of these predictions, resulting in modulations of the N1 and P2 suppression effects similar to those seen for subconscious variation in the speech signal in paradigms using overt speech (Behroozmand & Larson, 2011; Niziolek et al., 2013).
The classical design in these experiments employs three conditions: an auditory-only (AO) condition, where participants are passively presented with auditory pseudowords; a motor-auditory (MA) condition, where participants trigger the presentation of self-produced pseudowords through a button press; and finally, a motor-only (MO) control condition used to correct for the motor activity (MA − MO = MA corrected [MAC]). This design has been applied to investigate MIS in response to a range of stimulus types, including tones (Knolle et al., 2013a), voices, vowels (Knolle et al., 2019) and single syllables (Ott & Jäncke, 2013). These designs typically elicit modulations of the auditory N1 and P2 components. An observed reduction of N1 amplitude in response to self-triggered stimuli is thought to reflect an unconscious, automatic prediction resulting from the efference copy/corollary discharge, whereas P2 suppression reflects a more conscious differentiation between self-generated and externally generated events (e.g., Bolt & Loehr, 2021; Knolle et al., 2013a, 2019; Pinheiro et al., 2018). Here, we investigated the effect of phonotactic and syllable stress regularities on MIS of the N1 and P2 components, using prerecorded utterances of bisyllabic Dutch pseudowords from each participant. If implicit knowledge of statistical regularities of speech influences the accuracy of the prediction generated by the corollary discharge, we would expect this to result in modulations of N1 and P2 suppression effects. Specifically, we aimed to test the following hypotheses: (1) N1 and P2 amplitudes are reduced for self-generated stimuli compared to externally generated stimuli (i.e., main effect of condition [AO vs. MAC], MIS); (2) this reduction in amplitude is modulated by phonotactic probability and syllable stress (i.e., interactions between phonotactic probability [high vs. low] and condition, and between syllable stress [first vs. second] and condition), with high phonotactic probability and first-syllable stress items leading to greater amplitude reduction due to greater predictability; and (3) phonotactic probability and syllable stress may interactively modulate MIS (i.e., three-way interaction between phonotactic probability, syllable stress and condition), where we do not have precise predictions about the nature of this interaction.

| Participants
Thirty-four right-handed native Dutch speakers participated in the study after giving their informed consent. The study was approved by the Ethical Committee of the Faculty of Psychology and Neuroscience at Maastricht University (ERCPN-OZL 205_17_03_2019) and was performed in accordance with the approved guidelines and the Declaration of Helsinki. Participants were invited to complete two sessions: one for recording the stimulus materials, followed by the electroencephalography (EEG) session.
Five participants completed the stimulus recording but did not complete the EEG session due to the COVID-19 pandemic. One participant was excluded from the EEG session due to failure to accurately reproduce the stimuli. One further participant was excluded due to excessive noise in the EEG signal (<100 trials remaining per stimulus per condition). This led to a final sample of 27 participants (9 male, mean age = 21.9 years, standard deviation = 3.8), who completed both sessions of the experiment. The stimulus recording procedures and variations of the EEG paradigm were piloted in an additional nine participants. The stimuli generated from these pilot participants were used to determine the criteria for stimulus selection as described in the following section.

| Stimulus generation
The stimuli for the EEG experiment were prepared on an individual basis. Participants were invited for an initial stimulus recording session scheduled several days prior to the EEG session. The stimuli consisted of four pseudowords (Table 1, adapted from Bonte et al., 2005; Emmendorfer et al., 2020), which differed from each other in phonotactic probability (notsal vs. notfal, quantified based on the log-frequency counts of the consonant clusters at the syllable boundary, calculated as 5.83 and 4.72 for /ts/ and /tf/, respectively; Bonte et al., 2005) and syllable stress (first vs. second syllable). During the EEG experiment, each participant was presented with stimuli in their own voice. As second-syllable stress is rare in Dutch, a 'natural' pronunciation of bisyllabic pseudowords with this stress pattern is challenging. To circumvent this issue, participants were presented with target words generated using a splicing procedure. The target words were spoken by a female Dutch speaker, who produced the syllables of interest by replacing them individually with syllables from existing bisyllabic Dutch words containing the same (spoken) consonant cluster and stress pattern as the target pseudowords (e.g., /badzout/ → /notzout/ and /badsal/ → /notsal/, /ontslag/ → /notslag/ and /ontsal/ → /notsal/; for more details, see Emmendorfer et al., 2020).

TABLE 1
                  SylStr
PhonProb          SylS1       SylS2
High (HPP)        NOTsal      notSAL
Low (LPP)         NOTfal      notFAL

Note: Capitals indicate the stressed syllable (SylS1, first syllable; SylS2, second syllable). The phoneme combination '-ts-' constitutes the HPP and '-tf-' the LPP condition. Abbreviations: HPP, high phonotactic probability; LPP, low phonotactic probability; PhonProb, phonotactic probability; SylStr, syllable stress.

These spliced target words were presented to the participants of the current experiment. After ensuring that the participant could hear and reproduce the differences between the pseudowords, each target was presented 15 times in random order, and the participants were asked to repeat them as accurately as possible. Participants were not explicitly instructed to attend to the stress pattern, as this could lead to exaggerated expression of syllable stress. From the 15 repetitions of each pseudoword, one item was selected as the stimulus for the EEG experiment. To ensure comparability across participants, without manipulating the recordings to deviate from each participant's naturally produced utterance, we selected items that were comparable in the timing of the perceptual centres (p-centres) of the syllables. P-centres are thought to represent the perceived 'beat' of a speech stimulus. The timing of the p-centres was estimated with a beat detection algorithm (custom MATLAB script adapted from Cummins & Port, 1998). Here, the beat or p-centre is defined as the midpoint of each local rise in the amplitude envelope of the recorded signal, representing the vocalic nucleus of a syllable. The duration of the interval between the p-centres of the two syllables in each bisyllabic pseudoword was calculated, and from 10 participants (9 pilot participants and 1 from the final sample), the average interval was calculated for each pseudoword.
These values were used to select the best-fitting stimulus for the participants who completed the subsequent EEG session. For each pseudoword, the item with the closest matching interval was selected. If this item contained acoustic artefacts or a mispronunciation, it was discarded, and the next best item was selected. This procedure allowed the selection of temporally comparable stimuli, while preserving each participant's own pronunciation without editing or manipulating the timing. A representation of the stimuli included in the experiment can be found in Figure 1. Stimuli were filtered with a Hann bandpass filter (80-10,500 Hz) and intensity scaled to 60 dB. Mean stimulus duration was .640 s (standard deviation = .056 s), and the mean interval between p-centres of the stimuli was .319 s (standard deviation = .042 s).
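The p-centre estimation described above (the midpoint of each local rise in the amplitude envelope) can be sketched as follows. The original implementation was a custom MATLAB script adapted from Cummins and Port (1998); the envelope smoothing cutoff and rise threshold below are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from scipy.signal import hilbert, butter, filtfilt

def p_centres(signal, fs, env_cutoff=10.0, rise_thresh=0.5):
    """Estimate p-centres as midpoints of major rises in the amplitude envelope.

    env_cutoff and rise_thresh are illustrative parameters, not the
    settings used in the study's MATLAB script.
    """
    # Smoothed amplitude envelope: magnitude of the analytic signal, low-passed.
    env = np.abs(hilbert(signal))
    b, a = butter(2, env_cutoff / (fs / 2))
    env = filtfilt(b, a, env)
    env = env / env.max()
    # Identify rising runs of the envelope (local minimum -> local maximum).
    rising = np.concatenate(([False], np.diff(env) > 0, [False]))
    trans = np.diff(rising.astype(int))
    starts = np.where(trans == 1)[0]
    ends = np.where(trans == -1)[0]
    centres = []
    for s, e in zip(starts, ends):
        if env[e] - env[s] > rise_thresh:     # keep only major rises (vocalic nuclei)
            centres.append((s + e) / 2 / fs)  # midpoint of the rise, in seconds
    return centres
```

For a bisyllabic token, the function returns two p-centre times whose difference is the inter-p-centre interval used above for stimulus selection.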

| EEG paradigm
The paradigm (adapted from Johnson et al., 2021; Ott & Jäncke, 2013) consisted of three conditions (Figure 2a). In all three conditions, the trial began with the presentation of a fixation cross, followed by a cue (< left and > right) at .4-1.0 s after trial onset. In the MA condition, participants pressed a button (left or right), which triggered the presentation of a stimulus (due to technical limitations, stimulus presentation was delayed by 12.36 ms on average; however, such a delay is far below the detection threshold for trained musicians, which lies around 100 ms [van Vugt & Tillmann, 2014], and would thus be perceived as simultaneous). In the AO condition, participants were presented with the same cue, but the stimulus presentation occurred without a button press, .5 s after cue onset. In the MO condition, participants pressed the cued button, but no stimulus was presented. This condition was included to correct for the motor component in the MA condition. The MAC condition was calculated as MA − MO, thus allowing the comparison of neural activity in response to self-generated (MAC) and externally generated (AO) auditory stimuli. A reduction in N1 and P2 amplitudes for MAC relative to AO is then interpreted as MIS.
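The MA − MO correction amounts to a pointwise subtraction of averaged waveforms. In the toy example below, the MA response is modelled, purely for illustration, as a suppressed copy of the auditory response plus motor activity; the subtraction then isolates the (suppressed) auditory part for comparison with AO:

```python
import numpy as np

# Synthetic averaged waveforms (channels x time); amplitudes are illustrative.
rng = np.random.default_rng(0)
n_ch, n_t = 5, 275
auditory = rng.normal(size=(n_ch, n_t))  # stands in for the auditory response
motor = rng.normal(size=(n_ch, n_t))     # stands in for motor/somatosensory activity

ao = auditory                  # AO: auditory response alone
ma = 0.8 * auditory + motor    # MA: (suppressed) auditory response plus motor activity
mo = motor                     # MO: motor activity alone

mac = ma - mo                  # MAC: motor activity subtracted out
mis = ao - mac                 # MIS effect: AO - MAC difference
```

With these toy arrays the subtraction recovers exactly the suppressed auditory response, which is the assumption the MA − MO correction rests on.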
The EEG recording occurred over the course of six experimental runs, each consisting of 18 blocks (8 MA, 8 AO and 2 MO) (Figure 2b). In each MA and AO block, one stimulus pair was presented. The stimuli within the pair differed from each other in either phonotactic probability or syllable stress (Figure 2c), and each cue/button press corresponded to one stimulus, allowing the participant to anticipate the upcoming token in a mini-block. Each pair was presented twice per run and condition, with the cue/button assignment counterbalanced across blocks. Within each block, the first four trials (always including two left and two right) were excluded from analysis to allow the participant to form an association between cue and word. In four blocks per run (two MA and two AO), four catch trials were included at the end of the block, where the cue-stimulus pairing was switched; that is, the left cue was followed by the stimulus previously associated with the right cue. Participants were instructed to attend to the cue-stimulus pairing and were asked to report at the end of each block whether they noticed a switch. This task was included to ensure the participants were correctly associating the presented stimulus with the cue/button press, and these trials were excluded from analysis. Note that this task required participants to memorize the button-stimulus association and thus consciously anticipate the specific token (notsal vs. notfal and first- vs. second-syllable stress). In the example presented in Figure 2a, the stimulus pair is notsal-notfal, so the participant would form the association of the right button press corresponding to notsal and the left button press corresponding to notfal. If this block contained a switch, the last four trials would have the reversed association (left button notsal and right button notfal). We hypothesized that implicit knowledge of the statistical regularities of the more abstract stimulus features (high vs.
low phonotactic probability and typical vs. atypical syllable stress) would modulate the strength of these predictions as reflected in N1 and P2 suppression effects. The total number of trials per block varied between 14 and 28 trials such that the participant could not anticipate when the catch trials would occur by counting. This resulted in 10-20 trials per block and a total of 90 trials per condition/stimulus/cue assignment included in the analysis (Figure 2b).

| EEG recording
EEG was recorded with BrainVision Recorder (Brain Products, Munich, Germany) using a 63-channel recording setup. Ag/AgCl sintered electrodes were mounted according to the 10% equidistant system, including 57 scalp electrodes, left and right mastoids for offline rereferencing and four electrooculogram (EOG) electrodes to facilitate removal of artefacts caused by eye movements (two placed on the outer canthi and two above and below the right eye). The scalp was cleaned at electrode sites, and electrodes were filled with electrolyte gel to keep impedances below 10 kΩ. Data were acquired with a sampling rate of 1000 Hz, using Fpz as an online reference and AFz as ground. During recording, participants were seated on a comfortable chair in an acoustically and electrically shielded room.

| EEG processing
EEG data were processed using the EEGLAB toolbox (Delorme & Makeig, 2004) and custom MATLAB scripts. The continuous EEG data were filtered using a bandpass filter of 1-30 Hz and then downsampled to a sampling rate of 250 Hz. Noisy channels were identified, removed and interpolated using the EEGLAB plugin clean_rawdata, and the data were re-referenced to the average signal of the two mastoid electrodes. The data were then epoched 0-2.4 s relative to the onset of the trial to remove noisy break intervals, while still including the entire duration of the experimental blocks. The data were then decomposed using independent component analysis (ICA). Two to four independent components, reflecting blinks and horizontal eye movements, were removed for each participant.
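The band-pass filtering and downsampling steps above can be sketched generically. The study used EEGLAB's routines, so the filter order and type here are illustrative stand-ins rather than the actual processing code:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, decimate

def preprocess(eeg, fs=1000, band=(1.0, 30.0), fs_new=250):
    """Band-pass filter then downsample a (channels x samples) array.

    A generic sketch of the 1-30 Hz / 250 Hz steps; the Butterworth
    order and zero-phase filtering are assumptions, not EEGLAB's defaults.
    """
    sos = butter(4, band, btype='bandpass', fs=fs, output='sos')
    filtered = sosfiltfilt(sos, eeg, axis=-1)  # zero-phase band-pass
    # Downsample with built-in anti-aliasing (1000 Hz -> 250 Hz).
    return decimate(filtered, fs // fs_new, axis=-1, zero_phase=True)
```

Applied to a signal containing both in-band (e.g., 10 Hz) and out-of-band (e.g., 60 Hz) components, the function returns a quarter-length signal dominated by the in-band component.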
Initial inspection of the reconstructed data, using a typical pre-stimulus baseline correction, revealed an unexpected negative shift in the AO condition relative to the MAC condition. To explore whether this may be due to systematic differences between conditions in the baseline window, we expanded the pre-stimulus window. This revealed a positive deflection preceding the stimulus/button press in all three conditions, which appeared time locked to the visual cue (Figure S1). This deflection could not be removed through high-pass filtering (Figure S2) or ICA (Figure S3). Although the deflection was present in all three conditions, it was effectively removed during the MA − MO subtraction (Figure 3a). The remaining deflection in the AO condition likely reflects a combination of visual processing of the cue, as well as anticipatory processes and temporal orientation to the upcoming stimulus due to the fixed temporal interval between cue and stimulus in this condition (Figure 2a). This observation violates the assumption of baseline correction that there are no systematic differences between conditions in the selected window. Therefore, the data were instead first epoched −.2 to 1.3 s relative to the onset of the cue, and baseline correction was applied relative to the .2 s prior to cue onset. The data were then subsequently epoched −.6 to .5 s relative to the onset of the stimulus or button press. Although this window does not include the full stimulus duration (.640 s on average), it is sufficient to capture the N1 and P2 components while ensuring that the window contains only data from the current trial and is not contaminated by the presentation of the fixation cross from the subsequent trial. Given the .5-s interval between cue and stimulus onset in the AO condition, the first .1 s of the epoch (−.6 to −.5 s) includes a portion of the baseline window. Due to the variability of the interval duration in the MA and MO conditions, the timing of the baseline window relative to the button press is variable; however, as in the AO condition, the baseline window lies prior to cue onset, during the presentation of the fixation cross. Previous findings have shown differences between self-generated and externally generated auditory stimuli already in the pre-stimulus window (Reznik et al., 2018); thus, a pre-stimulus baseline window may not be appropriate for this type of paradigm, even without the positivity we observed. The pre-cue baseline correction was used throughout the subsequent analysis steps. Epochs with reaction times shorter than 400 ms or longer than 800 ms after cue onset were excluded from analysis. As this resulted in a smaller number of trials in the MA condition compared to the AO condition, trial numbers were equalized across stimuli and conditions for each participant, to avoid unequal numbers of trials biasing the results.

FIGURE 1 Stimuli. Stimuli selected for the electroencephalography (EEG) experiment. Individual intensity contours of the stimuli are represented in grey and mean intensity contours across participants in red. Stimuli from an exemplary participant are represented in black. Timing of the p-centres, representing the onset of the vocalic nucleus, is represented by dashed lines (red: averaged across participants; black: exemplary participant). Individual tokens to be used for each participant were selected based on the interval between the p-centres for first and second syllables. Stimulus onset t = 0 is equivalent to t = 0 in the subsequent event-related potential (ERP) plots. Subplot titles indicate which stimulus is represented: HPP, high phonotactic probability; LPP, low phonotactic probability; SylS1, first-syllable stress; SylS2, second-syllable stress. Notsal and notfal represent stimulus tokens, where capitalization indicates the stressed syllable.

FIGURE 2 Experimental design. (a) Three experimental conditions. In all three conditions, the trial started with the presentation of a fixation cross for .4-1.0 s, followed by a visual cue < or >. In the motor-auditory (MA) condition, participants pressed the button corresponding to the cue (< left or > right), triggering the presentation of the pseudoword stimulus via speakers. In the motor-only (MO) condition, the participants pressed the button, but no stimulus was presented. In the auditory-only (AO) condition, the stimulus was presented .5 s after cue onset without button press. MAC, motor-auditory corrected; MIS, motor-induced suppression. (b) Overview of the electroencephalography (EEG) paradigm timeline. The total EEG measurement lasted 90-100 min, consisting of six runs of approximately 15 min each. Within each run, 18 mini-blocks were presented (8 AO, 8 MA and 2 MO). In each mini-block, one stimulus pair was presented (letters a-d correspond to the stimulus pair presented as denoted in [c]), where one stimulus was associated with the < (left) cue/button press and one with the > (right) cue/button press. Within the eight mini-blocks of AO and MA, each pair was presented twice, with the cue/hand assignment counterbalanced across mini-blocks. AO and MA conditions consisted of 14-28 trials per mini-block. The first four trials were discarded from analysis. In four mini-blocks per run (two AO and two MA), four catch trials where cue/hand assignment was switched were included at the end of a mini-block. These trials were also discarded from analysis, leading to a final 10-20 trials per mini-block. (c) Overview of stimuli and contrasted features: PhonProb, phonotactic probability; HPP, high phonotactic probability; LPP, low phonotactic probability; SylStr, syllable stress; SylS1, first-syllable stress; SylS2, second-syllable stress.
A minimum of 100 trials per stimulus per condition (900 trials total) were retained per participant.
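The pre-cue baseline correction and subsequent stimulus-locked re-epoching can be sketched as follows. The window lengths follow the paper (−.2 s pre-cue baseline; −.6 to .5 s around the stimulus/button press), but the helper function itself is an illustrative sketch, not the authors' code:

```python
import numpy as np

def cue_locked_epoch(cont, cue_idx, stim_idx, fs=250):
    """Apply a pre-cue baseline, then re-epoch around the stimulus/button press.

    cont: (channels x samples) continuous data; cue_idx/stim_idx are the
    sample indices of cue and stimulus (or button-press) onset.
    """
    # Baseline: mean over the .2 s preceding cue onset.
    base = cont[:, cue_idx - int(.2 * fs):cue_idx].mean(axis=1, keepdims=True)
    corrected = cont - base
    # Stimulus/button-press-locked epoch: -.6 to .5 s.
    return corrected[:, stim_idx - int(.6 * fs):stim_idx + int(.5 * fs)]
```

At 250 Hz, the −.6 to .5 s window contains 150 + 125 = 275 samples per epoch, which matches the 275 time points entered into the temporal PCA described below.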
Before moving to the planned analysis of the N1 and P2 components, we first assessed the pre-stimulus positivity via a cluster-based permutation analysis (Maris & Oostenveld, 2007). A systematic difference between conditions in this window would render a direct comparison of the N1 and P2 amplitudes invalid, as we could not exclude that any observed modulations of these components are driven by this deflection rather than by true MIS, as hypothesized. One-sided paired-samples t tests between AO and MAC were performed at each time point in the window −.5 to 0 s relative to stimulus onset, for 1000 random partitions, using the ft_timelockstatistics function of the FieldTrip toolbox (Oostenveld et al., 2011). This analysis revealed a significant difference between AO and MAC: the observed cluster started approximately .3 s prior to stimulus onset and had a broad topographic distribution.
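The logic of the cluster-based permutation test can be illustrated for a single channel: compute pointwise paired t values, sum them within contiguous suprathreshold clusters, and compare the largest observed cluster mass against a null distribution obtained by randomly flipping the sign of each participant's difference waveform. The actual analysis used FieldTrip's ft_timelockstatistics across channels; the function below is a simplified stand-in.

```python
import numpy as np
from scipy import stats

def cluster_perm_test(a, b, n_perm=1000, alpha=.05, seed=0):
    """Simplified single-channel cluster-based permutation test.

    a, b: (subjects x timepoints) arrays; one-sided test of a > b.
    """
    rng = np.random.default_rng(seed)
    diff = a - b
    n = diff.shape[0]
    crit = stats.t.ppf(1 - alpha, n - 1)  # one-sided cluster-forming threshold

    def max_cluster_mass(d):
        # Pointwise paired t values, then largest contiguous suprathreshold sum.
        t = d.mean(0) / (d.std(0, ddof=1) / np.sqrt(n))
        mass, best = 0.0, 0.0
        for tv in t:
            mass = mass + tv if tv > crit else 0.0
            best = max(best, mass)
        return best

    observed = max_cluster_mass(diff)
    # Null distribution: flip the sign of each subject's difference waveform.
    null = np.array([max_cluster_mass(diff * rng.choice([-1, 1], size=(n, 1)))
                     for _ in range(n_perm)])
    p = (np.sum(null >= observed) + 1) / (n_perm + 1)
    return observed, p
```

The cluster mass statistic controls the family-wise error over time points without testing each point independently, which is why it suits densely sampled pre-stimulus windows.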
To isolate the components of interest from this pre-stimulus deflection, we followed a temporal principal component analysis (PCA) approach (Korka et al., 2019) using the event-related potential (ERP) PCA toolkit (Dien, 2010). Average waveforms per participant, condition and stimulus were entered into a temporal PCA. In total, this resulted in nine average waveforms per participant: four AO (one per stimulus), four MA and one MO. The input to the PCA was thus a matrix of 275 variables (time points) and 243 observations (9 waveforms × 27 participants). Based on the results of Horn's parallel test, 15 temporal components (Table S1) explaining 94.99% of the variance in the data were retained for Promax rotation (k = 3) with a covariance matrix structure and Kaiser weighting. PCA decomposition yields two sets of coefficients to describe the EEG signal: factor loadings correspond to the time course of a factor, which is constant across all conditions, participants and electrodes; factor scores correspond to the contribution of each factor to the EEG signal of each observation (participant, condition and electrode) and can be used directly in statistical analyses to quantify differences across observations. The ERP can thus be described as the sum of unstandardized factor loadings multiplied by the corresponding factor scores (Scharf et al., 2022). The unstandardized factor loadings (loadings × standard deviation) of these 15 components are presented in Figure 3b. Components relating to pre- and post-stimulus/button-press activity (represented in grey and black, respectively) were identified based on their peak latency. Figure 3c,d illustrates that the reconstructed ERPs (factor loadings × standard deviation × factor scores) from these component groups accurately align with the pre- and post-stimulus/button-press activity in the original ERPs, while minimizing modulations in the other time window.
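The reconstruction identity (ERP = sum of factor scores × factor loadings) can be illustrated with an unrotated temporal PCA on toy data of the same dimensions. The Promax rotation and Kaiser weighting applied by the ERP PCA toolkit are omitted here, so this is a sketch of the decomposition logic only:

```python
import numpy as np

# Toy data: 243 observations x 275 time points, matching the study's PCA input.
rng = np.random.default_rng(1)
data = rng.normal(size=(243, 275))
data -= data.mean(axis=0)              # centre each time point

# Temporal PCA via eigendecomposition of the time-point covariance matrix.
cov = np.cov(data, rowvar=False)
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1][:15]   # keep 15 components, as in the study
loadings = evecs[:, order]             # factor time courses (275 x 15), unrotated
scores = data @ loadings               # contribution per observation (243 x 15)

# ERP reconstruction from the retained components: scores x loadings^T.
recon = scores @ loadings.T
var_explained = 1 - np.var(data - recon) / np.var(data)
```

With all 275 components retained, the reconstruction is exact; restricting it to a subset of components (as done for the pre- and post-stimulus groups in Figure 3c,d) isolates the activity those components capture.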
Two temporal components reflecting the N1 and P2 observed in the dataset were identified based on their timing and polarity: temporal component 7, peaking at 196 ms post-stimulus/button-press onset, as the N1, and temporal component 5, peaking at 296 ms post-stimulus/button-press onset, as the P2 (represented in red in Figure 3b). These timings are later than the classically observed N1 and P2 latencies. However, a relative delay is consistent with the complexity of the stimuli (Conde et al., 2018) and their slow onset rise time (Onishi & Davis, 1968). When adjusted for the timing of the p-centre of the first syllable of participants' pseudoword pronunciations, the N1 and P2 latencies are shorter, at approximately 125 and 212 ms, respectively. For our analyses, we kept the time locking to stimulus onset, as it resulted in delayed but better aligned N1 and P2 responses across participants.

| Statistical analyses
PCA factor scores corresponding to the N1 and P2 components were entered into statistical analysis in R Version 4.1.0 (R Core Team), using functions from the rstatix (Kassambara, 2021), ggpubr (Kassambara, 2020), gridExtra (Auguie, 2017) and ggplot2 (Wickham, 2016) packages. Normal distribution of the N1 and P2 factor scores was confirmed for all conditions via Shapiro-Wilk tests (Tables S1 and S4), and outlier identification via the boxplot method did not reveal any extreme outliers (points below Q1 − 3 × inter-quartile range [IQR] or above Q3 + 3 × IQR). In two separate 2 × 2 × 2 repeated-measures analyses of variance (ANOVAs) (high vs. low PhonProb × first vs. second SylStr × AO vs. MAC Cond), we tested the following hypotheses for both N1 and P2 factor scores, averaged across electrodes within a region of interest (ROI) determined by each PCA component's peak channel (FCz, FC1, FC2, FC3 and FC4 for N1; Cz, C1, C2, C3 and C4 for P2): (1) N1 and P2 amplitudes are reduced for self-triggered compared to externally generated stimuli (i.e., main effect of Cond, MIS); (2) this amplitude reduction is modulated by phonotactic probability and syllable stress (i.e., interactions between PhonProb and Cond, and between SylStr and Cond), with high phonotactic probability and first-syllable stress items leading to a greater amplitude reduction due to greater predictability; and (3) phonotactic probability and syllable stress may interactively modulate MIS (i.e., a three-way interaction between PhonProb, SylStr and Cond), for which we did not have precise predictions. ANOVA results were corrected for multiple comparisons with Bonferroni-Holm correction (Cramer et al., 2016), and follow-up t tests of simple effects were Bonferroni corrected.
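The Bonferroni-Holm step-down rule applied to the family of ANOVA effects can be sketched as follows. This is an illustrative Python implementation of the standard procedure, not the authors' R code: the smallest p value is multiplied by the family size m, the next by m − 1, and so on, with monotonicity enforced.

```python
import numpy as np

def holm_adjust(pvals):
    """Holm-Bonferroni adjusted p values (step-down procedure)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)           # test the smallest p value first
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiplier shrinks from m down to 1 as we step through the family;
        # the running maximum keeps adjusted p values monotone.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted
```

For example, raw p values of .01, .03 and .04 adjust to .03, .06 and .06: the test that would survive plain Bonferroni at the smallest rank survives here too, while the later steps inherit the stricter earlier bound.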

| Behavioural results
Due to the non-normal distribution of the response time and sensitivity (d′) scores, behavioural analyses were performed using non-parametric tests. For two participants, behavioural data were not recorded for one out of six experimental runs due to a technical error; excluding these participants from the analysis did not change the outcome of the behavioural or ERP results. The average time of button presses was 508.45 ms after cue presentation (standard deviation = 44.77 ms). A non-parametric paired-samples Wilcoxon test revealed a significant difference in response time across conditions (W = 299, p = .007), with longer response times in the MA condition (mean = 518.62 ms, standard deviation = 43.54 ms) than in the MO condition (mean = 499.29 ms, standard deviation = 44.90 ms). This difference likely reflects increased attentional demands in the MA condition, where participants had to direct their attention to the upcoming stimulus.
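A paired-samples Wilcoxon signed-rank test of this kind is straightforward to reproduce; the sketch below uses SciPy with invented response times (not the study's data) purely to illustrate the comparison reported above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-participant mean response times (ms) in two conditions;
# each row position is the same participant, so the test is paired.
ma_rt = np.array([520.0, 531.0, 515.0, 540.0, 510.0, 524.0, 538.0, 517.0])
mo_rt = np.array([500.0, 510.0, 498.0, 515.0, 495.0, 505.0, 515.0, 499.0])

# Wilcoxon signed-rank test on the paired differences (two-sided by default).
w_stat, p_value = stats.wilcoxon(ma_rt, mo_rt)
```

With small samples and no ties, SciPy computes an exact p value; here every difference is positive, so the statistic (the smaller signed-rank sum) is 0.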
Sensitivity to the catch trials was assessed with d′, calculated with the psycho R package. Mean d′ across conditions indicated that participants performed the task well (mean = 3.21, standard deviation = .77). A non-parametric paired-samples Wilcoxon test revealed a significant difference in sensitivity to catch trials across conditions (W = 49.5, p = .0129), with improved sensitivity in the MA condition (mean = 3.36, standard deviation = .77) compared to the AO condition (mean = 3.08, standard deviation = .76). The active button-press task in the MA condition likely facilitated directing attention to the upcoming stimulus, resulting in this improved behavioural performance.
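The d′ statistic itself is the difference between the z-transformed hit and false-alarm rates. The Python sketch below shows the standard formula with one common log-linear correction for extreme rates; the psycho package's exact correction may differ.

```python
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Adding 0.5 to each count (log-linear correction) keeps the z scores
    finite when the raw hit or false-alarm rate is exactly 0 or 1.
    """
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)
```

Chance performance yields d′ = 0, while perfect detection of, say, 20 catch trials with no false alarms gives a corrected d′ just under 4, in line with the high mean sensitivity (3.21) reported here.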

| ERP results
Visual inspection of the ERP grand averages (Figures 4a and 5a) revealed an N1/P2 morphology, with the N1 peaking around 200 ms and the P2 around 300 ms (as noted above, approximately 125 and 212 ms when adjusted for the p-centre of the first syllable of participants' pseudoword pronunciations). In the following sections, we report only significant or otherwise noteworthy main effects and interactions, as well as post hoc simple effects. The full results of the statistical analyses, as well as descriptive statistics, can be found in Tables S2-S4 for the N1 and in Tables S5 and S6 for the P2.

| DISCUSSION
The current study aimed to investigate whether MIS of the N1 and P2 amplitudes is modulated by formal (phonotactic probability) and temporal (syllable stress) predictability in the speech signal. We used a motor-to-auditory paradigm, in which participants triggered the presentation of self-produced pseudowords through a button press. This approach was intended as a step towards investigating speech production, while limiting the interference of motor artefacts, variability in speech production and speech errors present during overt production of pseudowords. We expected to observe an MIS effect, with larger N1 and P2 amplitudes in the AO condition compared to the MA condition. Furthermore, we expected this suppression effect to be modulated by phonotactic probability and/or syllable stress, where high probability items (high phonotactic probability and first-syllable stress) would elicit greater suppression, as they might be more 'prototypical' items in the language (Niziolek et al., 2013). Due to an observed cue-locked pre-stimulus deflection in the AO condition, not present in the MA condition after correcting for motor output (Figure 3a), we applied a temporal PCA approach to isolate the N1 and P2 components of interest. Our hypotheses were tested on the factor scores of two temporal factors that aligned with the N1 and P2 components in the current dataset. We observed an interaction between syllable stress and condition on N1 factor scores, where second-syllable stress items elicited a larger N1 compared to first-syllable stress items, but only in the AO condition. Syllable stress further modulated P2 factor scores, where first-syllable stress items now elicited greater activation compared to second-syllable stress items.
Strikingly, we did not observe any MIS effect for self-triggered (MAC) compared to externally triggered (AO) stimuli, unlike a large body of past literature comparing self-triggered and externally triggered auditory stimuli (e.g., Bäss et al., 2008; Knolle et al., 2012, 2013a, 2013b, 2019; Niziolek et al., 2013; Pinheiro et al., 2018). In the following, we offer interpretations for the effects of syllable stress on the N1 and P2 components and provide possible explanations for the lack of an MIS effect.

FIGURE 4 Effects of phonotactic probability and syllable stress on N1 factor scores. Plots represent the event-related potential (ERP) waveforms (blue and green) and the reconstructed N1 (black; factor loadings × SD × factor score) in microvolt scale and time locked to stimulus onset, the topographic distribution of N1 factor scores and mean N1 factor scores averaged within a frontocentral region of interest (ROI) (FCz, FC1, FC2, FC3 and FC4). Effects of phonotactic probability are presented in (a) and of syllable stress in (b). A significant interaction between syllable stress and condition was observed, driven by an effect of syllable stress in the auditory-only (AO) condition, where second-syllable stress items elicited a larger N1 compared to first-syllable stress items (c, bottom, p = .0009). Note that baseline correction was performed in a 200-ms window prior to cue onset (approximately −.5 s). Due to the pre-stimulus positivity (see Figure 3), AO and motor-auditory corrected (MAC) differ in the pre-stimulus window depicted here. Cond, condition; HPP, high phonotactic probability; LPP, low phonotactic probability; PhonProb, phonotactic probability; SylStr, syllable stress; SylS1, first-syllable stress; SylS2, second-syllable stress.

| Syllable stress variations modulate N1 and P2 components
First- and second-syllable stress items differ from each other both in the acoustic properties of the stimuli and in their likelihood of occurrence in the language at hand. Pitch, intensity and duration serve as perceptual markers of stress in Dutch, with stressed syllables exhibiting increased pitch, intensity and duration relative to their unstressed counterparts. As obligatory auditory evoked potentials, both the N1 and P2 have been shown to be modulated by acoustic features related to perceptual markers of syllable stress. Increased stimulus amplitude and duration have both been associated with increased N1 and P2 amplitudes (Alain et al., 1997; Ostroff et al., 2003; Paiva et al., 2016). Changes in stimulus frequency show the reverse effect: Increases in tonal frequency are associated with decreased amplitudes (Antinoro et al., 1969; Jacobson et al., 1992; Pantev et al., 1995; Wunderlich & Cone-Wesson, 2001). However, in these studies, frequency manipulations occurred predominantly in ranges above 400 Hz, outside the typical f0 range of human speech of roughly 100-250 Hz (e.g., Pépiot, 2014; Pisanski et al., 2020). If the observed N1 and P2 modulations were related to acoustic differences, this would most likely be reflected in increased N1 and P2 factor scores for first-syllable stress items relative to second-syllable stress items, consistent across conditions (AO vs. MAC). The main effect of syllable stress on the P2 component, which does not differ across conditions, is therefore consistent with an effect of acoustic stimulus properties.
The pattern of modulations observed on the N1 component tells a different story. Here, we observe an interactive effect of syllable stress and condition, with the effect of syllable stress present only in the AO condition. The directionality of this effect, with a larger N1 for the less probable second-syllable stress items, further supports the notion that statistical regularities in the temporal structure of speech may contribute to the observed effect. Theories of predictive processing propose that greater surprisal, or unexpected input, leads to increased neural responses (Aitchison & Lengyel, 2017; Friston, 2005; Rao & Ballard, 1999; Spratling, 2017).

FIGURE 5 Effects of phonotactic probability and syllable stress on P2 factor scores. Plots represent the event-related potential (ERP) waveforms (blue and green) and the reconstructed P2 (black; factor loadings × SD × factor score) in microvolt scale and time locked to stimulus onset, the topographic distribution of P2 factor scores and mean P2 factor scores averaged within a frontocentral region of interest (ROI) (FCz, FC1, FC2, FC3 and FC4). Effects of phonotactic probability are presented in (a) and of syllable stress in (b). A main effect of syllable stress on P2 factor scores was observed, where first-syllable stress items elicited a larger P2 compared to second-syllable stress items (c, bottom, p = .0003). Note that baseline correction was performed in a 200-ms window prior to cue onset (approximately −.5 s). Due to the pre-stimulus positivity (see Figure 3), auditory only (AO) and motor-auditory corrected (MAC) differ in the pre-stimulus window depicted here. Cond, condition; HPP, high phonotactic probability; LPP, low phonotactic probability; PhonProb, phonotactic probability; SylStr, syllable stress; SylS1, first-syllable stress; SylS2, second-syllable stress.
Additionally, the forward model proposes that self-triggered stimuli are associated with reduced neural responses, as the sensory outcome can be anticipated based on the motor plan (Blakemore et al., 2000). The lack of an effect of syllable stress in the MA condition suggests that first- and second-syllable stress items could be anticipated equally well when they were self-triggered. However, when externally presented in the AO condition, the less regular second-syllable stress items may have elicited greater surprisal, resulting in a larger neural response. An alternative explanation for the differences across conditions may lie in differing attentional demands. The attention hypothesis proposes that the classically observed suppression effect is a result of attentional resources being divided between the motor output and the sensory input (Horváth, 2015). Although we did not observe a suppression effect, this hypothesis may still explain the lack of sensitivity to syllable stress observed in the MA condition. However, the behavioural results suggest that participants were in fact able to attend better to the upcoming stimuli when these followed a button press than when they were externally triggered, which is not consistent with the attention hypothesis. Thus, if the current results are related to differences in attentional demands, it is more likely that the increased attention to the stimuli in the MA condition resulted in more accurate predictions of the upcoming stimulus, such that both first- and second-syllable stress items could be anticipated equally well.

| No effect of phonotactic probability
The results of the current study stand in contrast to findings from research investigating the effect of sublexical regularities on the mismatch negativity (MMN) using similar stimuli in a passive oddball paradigm (Emmendorfer, 2022; Emmendorfer et al., 2020). These studies observed modulations of the MMN by variations in phonotactic probability, but not syllable stress, in contrast to the present results, which indicate an effect of syllable stress, but not phonotactic probability, on N1 and P2 amplitudes. We first focus on the effects of syllable stress and examine the role that lexical stress plays in Dutch. Although they differ in their probability of occurrence, both first- and second-syllable stress patterns are legal in Dutch, and variations in stress can be used to resolve lexical conflict in cases where the meaning of a word differs depending on which syllable is stressed (e.g., PREsent vs. preSENT; Cutler, 2005; Cutler & Van Donselaar, 2001). The prior studies interpret the lack of an effect of syllable stress as an indicator that, for Dutch speakers, variations in stress patterns are not an important factor in passive listening to individual pseudowords in the absence of a sentence context. Although the stimuli in the current paradigm also consisted of individual pseudowords, the paradigm differed from prior passive oddball studies in that participants had to distinguish between first- and second-syllable stress in order to correctly perform the task. These increased attentional demands, relative to passive listening in an oddball task, likely contributed to the difference in effects of syllable stress across studies.
Moving to the effects of phonotactic probability, prior studies employing similar stimuli showed a facilitative effect of high phonotactic probability relative to low phonotactic probability items on deviance detection (Bonte et al., 2005; Emmendorfer, 2022; Emmendorfer et al., 2020). Here, we did not observe any N1 or P2 modulation by variations in phonotactic probability. The reason for this likely lies in the stimulus selection: Phonotactic probability is modulated at the syllable boundary (~200-250 ms after stimulus onset), whereas the components of interest here peaked around 200 and 300 ms. Thus, the manipulation of phonotactic probability likely occurred too late to contribute to N1 and P2 modulations (though co-articulatory cues may already be present in the first syllable). Additionally, the consonant clusters /ts/ and /tf/ differed from each other less, in both phonotactic probability and phonetic properties, than the clusters /ts/ and /tk/ used in Emmendorfer et al. (2020). Although Bonte et al. (2005) reported MMN modulations for both the notsel vs. notkel and notsel vs. notfel contrasts, these differences were larger for the /ts/ vs. /tk/ contrast. Thus, the relative timing of the consonant clusters within the pseudowords and of the ERP components of interest, together with the smaller phonotactic difference between the stimuli, likely contributed to the absence of an effect of this feature on the N1 and P2 components.

| Lack of MIS due to temporal predictability
The current results also differ crucially from a large body of research investigating MIS (e.g., Bäss et al., 2008; Knolle et al., 2012, 2013a; Pinheiro et al., 2018) in that we do not observe a significant suppression effect. The lack of suppression may be explained by differences in design between previous studies and the current experiment. The typical approach involves a blocked design (but see Knolle et al., 2013b, for an event-related variation), where the button presses generating the stimulus presentation are self-initiated in the MA condition, and the auditory stimuli are then presented at the same temporal intervals in the AO condition. In the current study, participants were instead instructed to press a button following the presentation of a visual cue. This same visual cue was presented in the AO condition, with the stimulus following at a fixed interval of 500 ms after cue presentation, introducing temporal predictability of the stimulus in the AO condition. This is a crucial difference in design compared to prior studies: In the classical approach, the AO and MA conditions differ not only in whether the stimulus is self-generated or externally presented but also in the predictability of the stimulus timing; in the MA condition, the participant can accurately predict the timing, whereas some temporal uncertainty remains in the AO condition.
A growing body of literature suggests that a considerable portion of the suppression effect observed in previous research may be driven by the temporal predictability of the events (Harrison et al., 2021; Hughes et al., 2013; Ody et al., 2022; Sowman et al., 2012; Storch & Zimmermann, 2022). However, several studies have also reported preserved suppression effects when manipulating temporal predictability (Bäss et al., 2008; Lange, 2011; Pinheiro et al., 2019). These studies differ in how they manipulated temporal predictability: While some did so by introducing variable delays between button press and sound presentation to make the timing of the stimuli in the MA condition more unpredictable (Bäss et al., 2008; Lange, 2011; Pinheiro et al., 2019), others did so by also introducing temporal predictability in the AO condition with cues from different modalities (Harrison et al., 2021; Ody et al., 2022; Sowman et al., 2012). Thus, precisely how temporal predictability influences the suppression effect may depend heavily on design parameters. Although the current study did not explicitly manipulate the temporal predictability of the stimulus presentation, the visual cue included in the AO condition led to a close match in temporal predictability across conditions. Thus, the current findings are in line with recent studies reporting a role of temporal prediction in MIS and highlight the need for this effect to be more carefully characterized in future studies.

| Limitations
Although the lack of MIS is consistent with the temporal predictability of the stimulus presentation in the AO condition, this was not the intended outcome of the current experiment. The introduction of the visual cue presented in all three conditions (adapted from Johnson et al., 2021) led to an undesired positive deflection in the AO condition, which was effectively removed from the MA condition when correcting for motor output (MA − MO). PCA can separate overlapping components, but it cannot rule out the possibility that the visual cue affected the processing of the stimuli. Including a visual control condition to subtract from the AO condition might have ameliorated this issue; however, this would only account for purely visual processes. The deflection likely also reflects attentional and anticipatory processes, as participants were instructed to explicitly attend to the stimulus and could anticipate not only which item would be presented but also when it would be presented, due to the constant timing between cue and stimulus. Thus, an additional adjustment to the current paradigm could be to jitter the timing of these events to dissociate the processes associated with the cue from those associated with the stimulus. Varying the time between cue and stimulus could also address the question of whether the suppression effect is driven by the temporal predictability of the stimulus (Bäss et al., 2008; Harrison et al., 2021; Hughes et al., 2013; Lange, 2011; Ody et al., 2022; Pinheiro et al., 2019; Sowman et al., 2012).
A further limitation of the current design lies in the manipulation of phonotactic probability. We selected stimuli similar to those used in our prior oddball paradigms (Bonte et al., 2005; Emmendorfer et al., 2020) to allow comparison of results across studies. However, the manipulation of phonotactic probability likely occurred too late in the stimulus (~200-250 ms after stimulus onset) to influence the processes underlying N1 and P2 generation. Although subtle co-articulatory cues may already be present within the first syllable, these may not be salient enough to elicit differences in surprisal between stimuli. Therefore, the current results do not allow us to draw final conclusions about whether auditory evoked potentials, including the N1 and P2, are sensitive to variations in phonotactic probability. Future studies may investigate this using stimuli better suited to answering this question, for example, by manipulating phonotactic probability at stimulus onset. Alternatively, manipulating local stimulus regularities, as in an oddball paradigm, would allow testing for modulations of the MMN elicited by self-triggered deviants (e.g., Korka et al., 2019), facilitating closer comparisons between experimental approaches.
Finally, we note that the current paradigm, using button press-triggered presentation of self-produced pseudowords, is only an indirect measure of speech production. We selected this approach to avoid interference from motor artefacts associated with overt speech, as well as to control for variability in utterances and errors that may differ across stimulus features (high vs. low phonotactic probability and first-vs. second-syllable stress; Heisler & Goffman, 2016;Munson, 2001;Sasisekaran et al., 2010). Thus, although this approach results in strong experimental control, it does not account for the full complexity of overt speech production. Although other studies have followed a similar approach using button press-triggered speech (Conde et al., 2018;Knolle et al., 2019;Ott & Jäncke, 2013;Pinheiro et al., 2018), future research should consider adapting this paradigm to overt speech production to directly investigate the neural correlates of phonotactic probability and syllable stress during speech production.

| CONCLUSION
In conclusion, the present experiment provides insights into processing differences for phonotactic and syllable stress regularities in speech perception and production by comparing self-triggered (via button press) to externally triggered (own) speech. We report novel observations suggesting that syllable stress regularities influence speech perception in Dutch speakers, with facilitated processing of more regular syllable stress patterns. Considering previous findings showing no such effect in a passive oddball paradigm (Emmendorfer et al., 2020), the role of syllable stress regularities appears to depend on whether task demands require attention to the stimuli. To summarize, the current results suggest that sensitivity to regularities in the phonotactic and temporal structure of speech may be exploited differently in speech perception and production processes. Further investigations controlling for some of the limitations of the current paradigm are needed to confirm the present results.

AUTHOR CONTRIBUTIONS
All authors designed the experiment and refined the manuscript. AKE prepared the materials, collected and analysed the data and wrote the first draft of the manuscript.