Electrophysiological evidence for differences between fusion and combination illusions in audiovisual speech perception

Abstract
Incongruent audiovisual speech stimuli can lead to perceptual illusions such as fusions or combinations. Here, we investigated the underlying audiovisual integration process by measuring ERPs. We observed that visual speech‐induced suppression of P2 amplitude (which is generally taken as a measure of audiovisual integration) for fusions was similar to suppression obtained with fully congruent stimuli, whereas P2 suppression for combinations was larger. We argue that these effects arise because the phonetic incongruency is solved differently for both types of stimuli.


Introduction
Recognizing that the analysis focuses on the comparison between fusions and combinations, it seems odd that the cases of visual or auditory capture are not discussed, as they are generally far more common than the combination perception.
The neuronal underpinning of sensory integration in general, and McGurk in specific need to be introduced in the intro. As it is, there is only a single sentence in the last paragraph, but there is an extensive literature. What is the N1-P2 thought to represent? What is the source? What other multisensory ERP effects have been observed with McGurk? What is the hypothesized underlying difference between fusion and combos, etc.

Methods:
Pg 5 -where were the speakers located?
Pg 7 -Why was the re-referencing not done to the global mean?
Results:
Please report effect sizes for all stats -given the large number of follow up t-tests, perhaps a table would allow for the more precise reporting of statistics, with the pertinent summary wording still in the text?
Why was the choice made to compare AV directly to V as opposed to comparing it to (A+V)?
Discussion:
One difference between fusion and combinations that is not discussed is that the total number of phonemes perceived is the definition of fusion. Fusion occurs in all congruent presentations and the "fusion" perceptions -except for the combination. It's not simply that the combination perception also includes an extra phoneme perceived, which is completely accurate, but it's also that fusion occurs in the congruent trial. The combination trial is fundamentally different from the congruent trials in a way that the fusion trial is not.
Pg 10 -if the p2 is thought to represent feedback from STS, this should be discussed in the introduction.
Pg 11 -"However, participants usually do not notice the AV incongruency in combination stimuli (or in fusion stimuli)." -was this measured?
Pg 11 -define CV -Is this a typo that's supposed to be "AV" or do you mean "consonant-vowel"?
Pg 12 -The discussion of correlational results is rather lacking. Given the graphs of individual data points, I am convinced of the fusion correlation, but not of the combination correlation (see comment about Fig 3 below). Regardless of how these correlations hold up, there needs to be much more in the way of interpreting them for this presented result to be of meaning for the reader.
Fig 3 -It seems that the correlation with combination responses is driven by a minority of participants (7) that reported this perception at any meaningful level.
Typographic:
Pg 3, "Such fusions do not always occur: changing the modality" -should be a semicolon.
Pg 5 -I think "FFMPEG" is written "FFmpeg"

Reviewer: 3 (Daniel Senkowski, Charité-Universitätsmedizin Berlin, Germany)
Comments to the Author
This manuscript describes an EEG study examining the ERP correlates during fusion and combination trials in the McGurk illusion. The research question is interesting and the work could be, in my opinion, of potential interest for publication in EJN. However, I have a few points that should be addressed.
I was wondering whether something has gone wrong in the analysis of AgVb minus Vb trials. I have seen many AV minus V subtraction approaches in AV speech studies but have, thus far, never observed such a large difference between AV minus V vs. A alone as illustrated in Fig. 2, right panel, light gray trace. The trace for the AgVb minus Vb trials simply does not fit with the other six traces (of Auditory "b" and Auditory "g", which should actually be labeled "combinations" and "fusion"). This could also explain why there is such a highly significant difference between Auditory vs. AV congruent for fusion trials, whereas there is no such difference for combination trials. Overall, these results raise skepticism and I would like to encourage the authors to thoroughly double-check all their analysis scripts.

Reply to Reviewer 1
Major
First, it is unclear whether all trials were analyzed, or only those in which the response corresponded to the expected one (e.g. for the fusion stimulus, whether only "d" responses were included). The latter would be recommended if the number of trials allows it. Please clarify.

Response:
We apologize for the confusion and we would like to clarify that we analyzed all data, not just those trials that were perceived 'correctly'. Although our approach is in line with much of the relevant work in which responses are not taken into account either (e.g., Colin et al., 2002, 2004; Saint-Amour et al., 2007; Stekelenburg & Vroomen, 2007), the reviewer has a point that analyses of 'correct' responses may be ideal if the number of trials allows it.
In addition, however, ERPs should ideally also represent averages across comparable numbers of trials per condition for each participant (which also includes V-only; see our response to this Reviewer's comment below where we explain why the AV - V subtraction is critical). If we were to include only those participants who had at least 40 'correct' trials in all conditions (which is 50% 'correct' or more, and 40 trials are quite minimal to base an ERP on), we would end up including only half of our sample (N = 17). As can be seen in the figure below, the critical trends in the grand averages that included all data were quite similar to those that included only 'correct' responses, but we prefer including all participants for the sake of power.
[Figure: grand-average ERPs. Panels: "All trials, N = 32" and "'Correct' trials, N = 17, participants only included if at least 50% correct in all conditions". Traces: Ab, AbVb - Vb, AbVg - Vg. Markers: V onset, A onset. Panel label: Auditory "g".]
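As a minimal sketch of the inclusion rule described in the response above (at least 40 'correct' trials in every condition), the following Python snippet filters participants by their per-condition counts. It is not the analysis script used for the manuscript: the function name, participant IDs, the set of conditions, and all trial counts are made up for illustration.

```python
# Threshold taken from the response: at least 40 'correct' trials per condition.
MIN_CORRECT_TRIALS = 40


def eligible_participants(correct_counts, min_trials=MIN_CORRECT_TRIALS):
    """Return IDs of participants with >= min_trials 'correct' trials in EVERY condition.

    correct_counts maps participant ID -> {condition name: number of 'correct' trials}.
    """
    return [
        pid
        for pid, counts in correct_counts.items()
        if all(n >= min_trials for n in counts.values())
    ]


# Hypothetical counts (not real data); condition labels are illustrative.
example_counts = {
    "p01": {"AbVb": 78, "AgVg": 80, "AbVg": 55, "AgVb": 61},
    "p02": {"AbVb": 79, "AgVg": 77, "AbVg": 12, "AgVb": 70},  # fails on AbVg
    "p03": {"AbVb": 80, "AgVg": 80, "AbVg": 40, "AgVb": 40},  # exactly at threshold
}

print(eligible_participants(example_counts))
```

Requiring the threshold in all conditions at once, rather than per condition, is what makes the rule costly: a single low-count condition removes the participant from every grand average.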
Second, the discussion should be clarified. Particularly, the authors write already in the abstract: "suppression of P2 amplitude (which is generally taken as a measure of AV integration)", but going on into the discussion, their point seems to be that stronger P2 suppression is related to weaker integration. This change in the interpretation should be explained better. Also, the relationship between the current findings and previous studies using incongruent or combination and fusion stimuli should be described in more detail (e.g. Klucharev & al., 2003). Is their explanation related to the salience or predictability of visual speech (e.g. Arnal & al., 2009)? Further, the end of the first paragraph on p. 11 is unclear: "audiovisual integration occurs, at least partly, after the P2…"

Response:
In the abstract, we raise the general point that P2 amplitude suppression is a measure of AV integration in order to justify our approach and analyses. Therefore, we believe this sentence does not need to be changed. We do, however, agree with the reviewer that the fact that stronger suppression might be related to weaker integration needs to be clarified. In fact, we cannot really claim that stronger suppression equals weaker integration per se, but only that P2 suppression may reflect differences in the extent to which AV integration on a phonetic level occurs.
Firstly, when there is no phonetic integration (because the listener is unaware of the phonetic content of artificial speech stimuli), there is no P2 suppression (Baart et al., 2014), which indicates that P2 suppression is related to phonetic integration. Secondly, Stekelenburg and Vroomen (2007) [...] However, the fact remains that we simply cannot be certain whether 'congruency processing' is indeed at the foundation of the differences we observed, as there are multiple reasons why P2 suppression could be different for combinations than for all other stimuli, as we explain in the discussion. We are quite confident, though, that temporal predictability (i.e., the fact that the visual signal leads the auditory onset) cannot explain the patterns in the data for two reasons: (1) visual lead (anticipatory motion) was the same across all stimuli, and (2) temporal properties of the stimuli modulate the N1, but not the P2 (e.g., Baart et al., 2014; Stekelenburg & Vroomen, 2007; Vroomen & Stekelenburg, 2010). This, however, does not exclude predictability or saliency on a phonetic level as a potential explanation, but again, we cannot be sure given the data we have in hand.

Minor
The final sentence of the abstract could be more specific.
Response: We have added the different possible explanations between parentheses.
A McGurk stimulus may be heard according to the visual component (e.g. Alsius & al., 2014), which should be mentioned.

Response:
We assume the reviewer refers to the work by Alsius et al. (2014), in which incongruent stimuli comprised a /mi/-/ni/ contrast. Such materials (i.e., stimuli in which the AV phonetic contrast is different from the specific labial/velar conflicts needed to produce McGurk fusions/combinations) can indeed lead to visual capture, and we now include a general statement about unimodal capture on page 3.
p. 6 "participants indicated which alternative corresponded to their auditory percept". Except in V trials.

Response:
We thank the Reviewer for identifying this mistake, which we corrected by deleting "auditory".

p. 8 Were the pair-wise comparisons corrected in the behavioral responses (see p. 9 FDR)?
Response: We now clarify that all pair-wise comparisons are indeed FDR corrected. The changes are reflected on page 8 and in Figure 1b.
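For readers unfamiliar with FDR correction, the snippet below sketches the Benjamini-Hochberg step-up procedure in Python. This is an assumption for illustration only: the response does not state which FDR procedure or software was used, and the function name, the alpha level of 0.05, and the p-values are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array marking which tests are rejected at FDR level alpha.

    Implements the Benjamini-Hochberg step-up rule: sort the m p-values,
    find the largest rank k with p_(k) <= alpha * k / m, and reject the
    k smallest p-values.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices of p-values, ascending
    thresholds = alpha * np.arange(1, m + 1) / m   # alpha * k / m for k = 1..m
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()             # largest rank meeting the criterion
        reject[order[: k + 1]] = True              # reject everything up to that rank
    return reject


# Hypothetical p-values from four pair-wise comparisons (not real results).
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Because the procedure is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the sorted list meets its threshold.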

"Following an additive model, AV integration effects can be captured by comparing A-only ERPs with AV-V difference waves". However, the first author has an excellent paper (Baart, 2016), according to which subtracting V is not needed. Why did the authors choose to use AV-V rather than AV here? I would like to know, but do not ask that this explanation should be included in the manuscript text unless the authors opt to do so.

Response:
In a recent meta-analysis, the first author indeed found that A vs. AV and A vs. AV - V comparisons yielded similar effects. However, the data included in that paper were always obtained from AV congruent stimuli. In contrast, McGurk stimuli are incongruent by definition. As a result, any differences that arise from direct comparisons between AbVb and AbVg, or between AgVg and AgVb, can potentially reflect processing differences between the visual components of the stimuli. Therefore, we subtracted out the visual activity in the AV - V difference waves.
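The reasoning above can be illustrated with a toy example in Python. It assumes ERPs are stored as NumPy arrays of equal length; the helper name `av_minus_v` and all waveform values are invented for illustration and are not taken from the data. If AV responses were purely additive, a direct comparison of AbVb with AbVg would differ only because the visual components differ, whereas the two AV - V difference waves would be identical.

```python
import numpy as np


def av_minus_v(erp_av, erp_v):
    """Subtract the visual-only ERP from the audiovisual ERP, sample by sample."""
    erp_av = np.asarray(erp_av, dtype=float)
    erp_v = np.asarray(erp_v, dtype=float)
    if erp_av.shape != erp_v.shape:
        raise ValueError("AV and V ERPs must have the same shape")
    return erp_av - erp_v


# Made-up waveforms in microvolts (illustration only, not real data).
a_b = np.array([0.0, -2.0, 3.0, 1.0])   # auditory /b/ alone
v_b = np.array([0.2, 0.4, 0.1, 0.0])    # visual /b/ alone
v_g = np.array([0.5, -0.3, 0.6, 0.2])   # visual /g/ alone

# Toy assumption: AV responses are the plain sum of the unimodal responses.
abvb = a_b + v_b
abvg = a_b + v_g

# The direct AV comparison is non-zero purely because Vb and Vg differ ...
print(abvb - abvg)
# ... whereas the AV - V difference waves coincide (all zeros here).
print(av_minus_v(abvb, v_b) - av_minus_v(abvg, v_g))
```

Any residual difference between the two difference waves in real data can therefore not be attributed to the visual-only responses themselves.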
p. 9-10 The authors may consider adding the significant P2 time windows in the text, and aligning the timelines in Fig. 2 a and b. Also, adding "stimulus" after McGurk fusion/combinations might make the end of the Results clearer.

Response:
Aligning the time-lines as the Reviewer suggests would make the significance plots in Figure 2b too small to read. However, we added the ERPs for each comparison to Figure 2b, which should clarify the time-windows where effects were significant. In the main text, we now include ANOVAs on fixed time-windows for the N1 (100-200 ms) and P2 (200-300 ms), which produce the same pattern of significance as visible in Figure 2b. We therefore did not add more details in the text, as we believe the ANOVAs and the information in Figure 2b give the reader enough information to understand the effects of interest without ambiguity. We also believe our new results section is clearer than before.
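To make the fixed-window measure concrete, the following Python sketch averages an ERP within the N1 (100-200 ms) and P2 (200-300 ms) windows named above. Several details are assumptions not stated in the response: that time is expressed in milliseconds relative to auditory onset, that the lower window bound is inclusive and the upper bound exclusive, and that the ERP is a NumPy array with time on the last axis. The function name and example waveform are hypothetical.

```python
import numpy as np

# Window limits in ms, as given in the response.
N1_WINDOW_MS = (100, 200)
P2_WINDOW_MS = (200, 300)


def mean_amplitude(erp, times_ms, window_ms):
    """Mean amplitude of erp within [start, stop) ms; time is the last axis."""
    erp = np.asarray(erp, dtype=float)
    times = np.asarray(times_ms, dtype=float)
    start, stop = window_ms
    mask = (times >= start) & (times < stop)   # assumed convention: [start, stop)
    if not mask.any():
        raise ValueError("No samples fall inside the requested window")
    return erp[..., mask].mean(axis=-1)


# Toy waveform sampled every 50 ms (illustration only, not real data).
times = np.arange(0, 400, 50)   # 0, 50, ..., 350 ms
erp = times / 100.0             # linear ramp in arbitrary units

print(mean_amplitude(erp, times, N1_WINDOW_MS))
print(mean_amplitude(erp, times, P2_WINDOW_MS))
```

One such value per participant, condition, and window would then feed the ANOVAs; with half-open windows a sample at exactly 200 ms counts toward the P2 and not the N1.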

Reply to Reviewer 2
Introduction
Recognizing that the analysis focuses on the comparison between fusions and combinations, it seems odd that the cases of visual or auditory capture are not discussed, as they are generally far more common than the combination perception.
Response: In our data, it is clear that, on average, fusions/combinations occurred much more often than capture, and occurred with a frequency that resembles the original report by McGurk and MacDonald (1976). However, the Reviewer is right that auditory or visual capture can occur more often than fusions or combinations (depending on the stimuli that are used), and we therefore discuss the issue briefly on page 3.

The neuronal underpinning of sensory integration in general, and McGurk in specific need to be introduced in the intro. As it is, there is only a single sentence in the last paragraph, but there is an extensive literature.
What is the N1-P2 thought to represent? What is the source? What other multisensory ERP effects have been observed with McGurk? What is the hypothesized underlying difference between fusion and combos, etc.

Response:
We would like to clarify that the word-limit of the manuscript format does not allow us to describe the full background on the N1 and P2. However, we now do include the congruence processing differences hypothesis in the introduction (page 4), which is based on the fact that AV congruency is processed at, and after, the P2 peak (but not at the N1). The source of the N1/P2 is well-described in the literature, and we believe that the most important fact for the current manuscript is that these are auditory peaks (originating in auditory areas), which are modulated by visual speech. Since EJN's policy is to make the reviewer comments and our responses to those available (if the manuscript gets accepted), we would like to refer to our response to comment #2 by Reviewer 1, where we expand more on the relevant literature and our line of thought.

Methods:
Pg 5 -where were the speakers located?
Response: The speakers were located left and right of the monitor, which is now clarified on page 5.
Pg 7 -Why was the re-referencing not done to the global mean?
Response: When re-referencing to the global mean, the head is considered a concentric sphere with homogeneous conductive properties, and the electrical field generated by the assembled (horizontal and vertical) dipoles would therefore approach zero. The more electrodes are included, the better this underlying principle is met. Since we have a relatively small number of electrodes, we opted for re-referencing against the average of the two mastoids (see also Baart & Samuel, 2015, among many others, where a similar re-referencing procedure is used with the identical EEG set-up).
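The two referencing schemes discussed above differ only in which channels form the reference signal. The Python sketch below shows both under stated assumptions: data are held as a NumPy array of shape (channels, samples), the two mastoid channels are part of that array, and the channel indices and values are invented. The function names are hypothetical and do not correspond to any particular EEG toolbox.

```python
import numpy as np


def rereference(data, ref_indices):
    """Subtract the mean of the reference channels from every channel.

    data has shape (n_channels, n_samples).
    """
    data = np.asarray(data, dtype=float)
    reference = data[list(ref_indices), :].mean(axis=0, keepdims=True)
    return data - reference


def rereference_to_mastoids(data, left_idx, right_idx):
    """Reference against the average of the two mastoid channels."""
    return rereference(data, [left_idx, right_idx])


def rereference_to_global_mean(data):
    """Reference against the average of all channels."""
    data = np.asarray(data, dtype=float)
    return rereference(data, range(data.shape[0]))


# Toy recording: 3 channels x 2 samples; channels 0 and 2 play the mastoids.
toy = np.array([[1.0, 2.0], [3.0, 4.0], [8.0, 6.0]])

print(rereference_to_mastoids(toy, left_idx=0, right_idx=2))
print(rereference_to_global_mean(toy))
```

After average referencing the channels sum to zero at every sample, which is the zero-sum property the response refers to; it is only a good approximation of the true potential distribution when the electrodes cover the head densely.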

Results:
Please report effect sizes for all stats -given the large number of follow up t-tests, perhaps a table would allow for the more precise reporting of statistics, with the pertinent summary wording still in the text?
Response: The result sections have been changed and effect-sizes are added consistently. The table with the relevant information (e.g., test statistics, p-values and effect sizes) is now included in Figure 1.
Why was the choice made to compare AV directly to V as opposed to comparing it to (A+V)?

Response:
We never compared AV directly to V. In the behavioral data, we assessed the proportion of responses per stimulus type, separately for all stimuli, and in the EEG data, we compared A ERPs to the AV - V difference wave. If one assumes that (1) A amplitude > AV amplitude (which is the basic notion behind N1/P2 peak suppression), and (2) AV ≠ A + V (the rationale behind the additive model), comparing AV to the sum of the unimodal activity (as the Reviewer proposes) yields the exact same differences as comparing AV - V to A (which is what we did).
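The equivalence of the two comparisons follows from rearranging terms. As a sketch, treating each ERP as a waveform over time $t$ (this notation is introduced here only for illustration):

```latex
\[
\mathrm{AV}(t) - \bigl(\mathrm{A}(t) + \mathrm{V}(t)\bigr)
  \;=\; \bigl(\mathrm{AV}(t) - \mathrm{V}(t)\bigr) - \mathrm{A}(t)
\]
```

Both contrasts are zero under pure additivity, and any deviation from additivity has the same size and sign at every time point in either formulation.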

Discussion:
One difference between fusion and combinations that is not discussed is that the total number of phonemes perceived is the definition of fusion. Fusion occurs in all congruent presentations and the "fusion" perceptions -except for the combination. It's not simply that the combination perception also includes an extra phoneme perceived, which is completely accurate, but it's also that fusion occurs in the congruent trial. The combination trial is fundamentally different from the congruent trials in a way that the fusion trial is not.
Response: This is a good point, which we have now added on page 12.
Pg 10 -if the p2 is thought to represent feedback from STS, this should be discussed in the introduction.

Response:
We have now included the rationale on page 4.
Pg 11 -"However, participants usually do not notice the AV incongruency in combination stimuli (or in fusion stimuli)." -was this measured?
Response: No, we have not measured this directly, and we have changed the sentence accordingly (see page 12).
Pg 11 -define CV -Is this a typo that's supposed to be "AV" or do you mean "consonant-vowel"?
Response: We intended to say "consonant-vowel", which we now clarify.
Pg 12 -The discussion of correlational results is rather lacking. Given the graphs of individual data points, I am convinced of the fusion correlation, but not of the combination correlation (see comment about Fig 3 below). Regardless of how these correlations hold up, there needs to be much more in the way of interpreting them for this presented result to be of meaning for the reader.
Response: In hindsight, we agree that the presentation and discussion of the correlation analyses were not very informative. Given the fair comment made by this Reviewer about the combination correlation (see next comment), and the issue about statistical significance of the correlations (raised by Reviewer 3), we decided to delete the correlation analyses from the manuscript. Instead, we have now clarified the theoretical framework (in response to the comments made by the Editor and all Reviewers) and included more analyses, and more details regarding the analyses, also in response to the Editor and all Reviewers.

Fig 3 -It seems that the correlation with combination responses is driven by a minority of participants (7) that reported this perception at any meaningful level.
Response: Yes, the Reviewer is right, and we acknowledge that this makes it rather difficult to draw any firm conclusions about this particular correlation (which is why we had not done so in the original manuscript). As mentioned in our response to this Reviewer's previous comment, we have now deleted the correlation analyses to avoid blurring our analyses/discussion with marginal results.

Typographic:
Pg 3, "Such fusions do not always occur: changing the modality" -should be a semicolon.