We contrasted conditions where endogenous visuospatial attention was focused on a visual channel that was either congruent or incongruent with the accompanying auditory channel, while keeping the amount of multisensory information in the environment constant. This allowed us to determine whether the neural correlates typically associated with the formation of a multisensory percept were sensitive to the locus of visual spatial attention. The primary finding of this study was the robust modulatory effect of attention to AV congruence. This influence was observed in heteromodal cortical regions consistent with previous studies of multisensory fusion, in the SC and in ‘sensory-specific’ visual regions as early as the primary visual cortex.
Audiovisual linguistic integration was employed due to its ecological validity, its robust neural indices (cf. Stevenson et al., 2007) and its close links to previous behavioural paradigms that explored the relationship between attention and MI. Both studies of the McGurk effect (Massaro, 1987; Soto-Faraco et al., 2004) and investigations of the pre-attentive nature of the ventriloquist effect (Driver, 1996) employed linguistic AV integration. Furthermore, a separate psychophysical study here confirmed that AV congruence can enhance the discriminability of visual targets on the attended side, even when subjects are instructed to respond only to events in the visual stream.
Cortical effects of attention to audiovisual congruence
The STS receives convergent input from auditory, visual and somatosensory regions (see Padberg et al., 2003; Schmahmann & Pandya, 1991) and has been frequently associated with MI (Calvert et al., 2000; Beauchamp et al., 2004a,b; Miller & D’Esposito, 2005; Noesselt et al., 2007). The present study demonstrates that attention and congruent AV speech jointly contribute to the activation of the STS.
The STS activation observed here is anatomically consistent with previous fMRI studies on the fusion of a multisensory event into a single percept (Bushara et al., 2003; Miller & D’Esposito, 2005). In contrast, studies that have compared the presentation of a single unisensory stimulus with the simultaneous presentation of two stimuli in different modalities have identified a more posterior section as the putative multisensory component of the STS (the pSTS). Enhanced activation in this region has been reported in response to bimodally (compared with unimodally) presented AV speech and object stimuli irrespective of sensory/semantic congruence (Calvert et al., 2000; Beauchamp et al., 2004a,b) but not for semantically congruent vs. incongruent AV object stimuli (although the perirhinal cortex exhibited an enhanced response; Taylor et al., 2006). The pSTS response is likewise enhanced for bimodally vs. unimodally presented letter/speech sounds but does not vary with the congruent/incongruent nature of these letter/sound pairs (van Atteveldt et al., 2004, 2007). pSTS activation has also been reported in response to spatially congruent vs. incongruent AV stimuli (Noesselt et al., 2007). It has recently been proposed that many of the comparisons that produce pSTS activation may contain several perceptual/cognitive effects that are not related to multisensory perception per se (Hocking & Price, 2008). Hocking & Price (2008) demonstrated that pSTS effects commonly attributed to multisensory influences can be achieved through similar unimodal manipulations of sensory input load and attention.
In the present study, pSTS was found to activate when the selective attention conditions were compared with the low-level SB. This activation of pSTS might be consistent with an attention-independent effect of AV congruence, which was present in all four selective attention conditions (i.e. also in the attInc conditions) but not in the baseline. However, the low-level SB condition also did not require any selective attention to the stimuli; thus, the relative contribution of spatial attention and unattended AV congruence cannot be definitively determined from the current design. Nonetheless, the role of this region in the binding of a multisensory percept appears unlikely, as studies that have specifically compared fused vs. unfused multisensory percepts have failed to observe any activation in this posterior region of the STS (Olson et al., 2002; Bushara et al., 2003; Jones & Callan, 2003; Macaluso et al., 2004; Miller & D’Esposito, 2005; Ojanen et al., 2005; see also Hein & Knight, 2008).
Here, the critical interplay between attention and AV congruence was found in a more anterior region of the STS. In this region the effect of AV congruence was contingent on attention to the multisensory stimulus. Accordingly, activity associated with unattended AV congruence (i.e. during ‘attend-incongruent’ conditions) was not greater than that associated with AV incongruent stimulation alone (i.e. the low-level baseline). This suggests that attention did not simply modulate an overall effect of congruence, which was entirely absent when the AV congruence was unattended, but rather that attention and congruence were both necessary to activate this region. However, due to the symmetrical nature of our design, the contrast ‘attend congruent > attend incongruent’ (attCon > attInc) is identical to ‘unattended incongruent > unattended congruent’. The consequence of this is that the effects attributed to attending to the congruent visual stream could equally arise from the need to ignore the incongruent stream. The response elicited by the SB suggests that this is not the case. The requirement to disregard incongruent streams was present in this condition and, along the STS, the response to attend-congruent conditions was enhanced relative to SB, whereas the attend-incongruent condition induced the same fMRI response as SB. This suggests that the critical factor in the evocation of the STS response to the attCon condition is attention to a crossmodally congruent visual stimulus rather than the need to ignore a crossmodally incongruent visual stimulus.
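The equivalence of the two contrasts can be made concrete with a minimal sketch. It assumes, purely for illustration, a 2 x 2 layout in which one visual stream appears on each side, exactly one of them is congruent with the auditory stream, and one side is attended; the condition coding and all function names below are hypothetical and are not taken from the analysis itself.

```python
from itertools import product

SIDES = ("left", "right")


def other(side):
    """Return the opposite visual hemifield."""
    return "right" if side == "left" else "left"


# Assumed 2 x 2 layout: (attended_side, congruent_side).
# Exactly one lateral stream is congruent with the auditory stream.
conditions = list(product(SIDES, SIDES))


def label_by_attended(attended, congruent):
    """Label a condition by the congruence of the attended stream."""
    return "attCon" if attended == congruent else "attInc"


def label_by_unattended(attended, congruent):
    """Label the same condition by the congruence of the unattended stream."""
    return "unattCon" if other(attended) == congruent else "unattInc"


def contrast(label_fn, positive):
    """Return +1/-1 contrast weights over the four conditions."""
    return [1 if label_fn(a, c) == positive else -1 for a, c in conditions]


# 'attend congruent > attend incongruent'
attended_contrast = contrast(label_by_attended, "attCon")
# 'unattended incongruent > unattended congruent'
unattended_contrast = contrast(label_by_unattended, "unattInc")

for cond, w_att, w_unatt in zip(conditions, attended_contrast, unattended_contrast):
    print(cond, w_att, w_unatt)
```

Under this assumed coding the two weight vectors coincide for every condition, which is why the selective attention conditions alone cannot separate the two accounts and why the comparison with SB is needed.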
As noted above, activation of this STS region has been previously associated with the formation of multisensory percepts (e.g. Miller & D’Esposito, 2005), which in our study may also entail greater comprehension/processing of the auditory text in the ‘attend-congruent’ conditions. Thus, the interplay between attention and congruence in STS may reflect some consequence of the integration of the multisensory input (e.g. increased comprehension). However, it should be stressed that here the semantic linguistic information contained within the auditory stream was always task-irrelevant and could not help with the detection of the visual targets. It is also possible that subjects adopted a strategy of focusing on the auditory stream under congruent AV conditions to use the disparate nature of the AV information to better identify the target stimulus. We consider this unlikely because the auditory stream was misleading with respect to the targets (a continuous stream rather than a decelerating sequence of events as in the visual stream), although the possibility of such a strategic shift in attention cannot be completely excluded. A further alternative interpretation is that attention to a stimulus in the visual modality automatically increases the strength of the neural representation of a related stimulus in a second (auditory) modality. Such spreading of attentional effects has previously been reported within the visual modality (Serences & Boynton, 2007) and similar properties may exist across modalities (see also Macaluso & Driver, 2001; Busse et al., 2005). Although the current data (as is the case with previous studies on MI) cannot disentangle the underlying mechanisms of integration from its possible consequences, our findings indicate that top-down (endogenous attention) and bottom-up (sensory input) factors jointly contribute to the activation of STS during the processing of AV speech.
It is notable that, in this study, no enhanced response during attention to multisensory incongruence when compared with attention to multisensory congruence was detected, either in the whole brain analysis or in the more focused analyses discussed hereafter. Although effects of incongruent AV stimuli have been observed in previous studies, this has normally been in the context of priming effects, where the stimulus in one modality precedes in time the stimulus in the other (Noppeney et al., 2008). Under such conditions, expectancy is violated in incongruent conditions, which may lead to an enhanced neural response to the unexpected (crossmodal) stimulus, as accounted for within such theories as predictive coding (Rao & Ballard, 1999).
Subcortical effects of attention to audiovisual congruence
Although reports of the SC as a multisensory structure are prevalent in the animal literature (Meredith & Stein, 1983; Stein & Meredith, 1993; Stein & Stanford, 2008), the SC is seldom reported in fMRI studies of multisensory processing. Here we show that attention to AV congruence increases the multisensory response bilaterally in the SC. In contrast, Miller & D’Esposito (2005) found that incongruent AV speech produced an enhanced SC response. This difference may be due to the type of incongruent stimuli used (temporally jittered in their study vs. completely unmatched in the present study). MI in the SC has been observed in anesthetized animals. Therefore, one might expect MI to occur primarily in an automatic, attention-independent manner (Stein & Meredith, 1993). However, multisensory responses in the SC are dependent on cortical input (Jiang et al., 2001) and have been linked to overt spatial behaviour in animal studies (Stein et al., 1989). The finding here of an attentional influence on the multisensory response of the SC suggests that the multisensory function of this region is not a passive response to environmental events but is actively shaped by current behavioural goals. Accordingly, the SC would be part of a larger network of cortical and subcortical regions that jointly contribute to the formation of integrated multisensory percepts (Stein & Stanford, 2008).
Effect of attention to audiovisual congruence within early visual cortex
In this study, attention to AV congruence was also observed to modulate the activity in visual regions. This finding is consistent with previous investigations of multisensory processing, which have identified enhanced responses within sensory-specific cortex (Macaluso et al., 2000; Noesselt et al., 2007; Alink et al., 2008). This study adds to this previous literature by showing the attentional modulation of these effects and by demonstrating that multisensory facilitation arises as early in the visual hierarchy as V1 (using meridian mapping in each single subject, Fig. 4A). Fine-resolution fMRI in the macaque has localized functional modulation of the auditory cortex by visual input in both the core and belt fields (Kayser et al., 2007) but multisensory effects in primary visual cortex have not been described in much detail [see Noesselt et al. (2007) who reported AV interaction in V1 based on anatomical criteria]. V1 has been shown to have direct connections with auditory regions in the monkey (Falchier et al., 2002) and it is possible that direct collateral connections drive this effect. However, it is also possible that feedback connections and interactions with higher level association areas may govern this response [Schroeder & Foxe, 2002; Noesselt et al., 2007; see also Ghazanfar et al. (2008) for interactions between STS and auditory regions during face/voice integration]. Our current finding of interplay between endogenous spatial attention and the processing of AV congruence may favour the second hypothesis, which emphasizes the role of high-level associative regions rather than relatively low-level automatic processing. Nonetheless, attention and MI may also affect the visual cortex via different pathways (e.g. frontoparietal for spatial attention and direct projections for MI). Additional work is needed to elucidate the relative contributions of these top-down and collateral influences.
In the present study, the effect of attention on MI appeared to be more pronounced when stimuli were attended in the right visual field. This is an unexpected finding as the right side of the face (normally present in the left visual field) is thought to be a more important source of lingual cues (Geffen et al., 1971; Wolf & Goodale, 1987; Nicholls et al., 2004). However, these findings relate only to centrally presented face stimuli and it is possible that right visual field advantages for part-based processing (Hillger & Koenig, 1991; Rossion et al., 2000) may dominate during the lateralized presentation of faces.