Spatial attention can modulate audiovisual integration at multiple cortical and subcortical sites


Dr S. Fairhall, as above.


The role of attention in multisensory integration (MI) is presently uncertain, with some studies supporting an automatic, pre-attentive process and others suggesting possible modulation through selective attention. The goal of this functional magnetic resonance imaging study was to investigate the role of spatial attention on the processing of congruent audiovisual speech stimuli (here indexing MI). Subjects were presented with two simultaneous visual streams (speaking lips in the left and right visual hemifields) plus a single central audio stream (spoken words). In the selective attention conditions, the auditory stream was congruent with one of the two visual streams. Subjects attended to either the congruent or the incongruent visual stream, allowing the comparison of brain activity for attended vs. unattended MI while the amount of multisensory information in the environment and the overall attentional requirements were held constant. Meridian mapping and a lateralized ‘speaking-lips’ localizer were used to identify early visual areas and to localize regions responding to contralateral visual stimulations. Results showed that attention to the congruent audiovisual stimulus resulted in increased activation in the superior temporal sulcus, striate and extrastriate retinotopic visual cortex, and superior colliculus. These findings demonstrate that audiovisual integration and spatial attention jointly interact to influence activity in an extensive network of brain areas, including associative regions, early sensory-specific visual cortex and subcortical structures that together contribute to the perception of a fused audiovisual percept.


The world is not experienced as multiple streams of isolated sensory signals; rather, sight, sound, smell and touch blend to form compound multisensory percepts. Multisensory integration (MI) is advantageous in interactions with the environment, where multisensory input can reduce sensory uncertainty (McDonald et al., 2000; Frassinetti et al., 2002). Although traditional views of MI posit that it occurs in an automatic manner (Massaro, 1987), recent evidence suggests that attentional control may also be involved (Tiippana et al., 2004).

In the audiovisual (AV) domain, several perceptual phenomena indicate independence between MI and attention. Studies of the ventriloquist effect (Driver, 1996; Bertelson et al., 2000; Vroomen et al., 2001) and the McGurk illusion (McGurk & MacDonald, 1976; Massaro, 1987; Soto-Faraco et al., 2004) suggested that MI is pre-attentive and immune to top-down attentional influences. Conversely, recent behavioural studies that introduced a demanding secondary task to exhaust attentional resources demonstrate that attention can influence MI (Tiippana et al., 2004; Alsius et al., 2005, 2007).

These discordant findings may relate to the complex and multifaceted neuronal mechanisms underlying MI. Multisensory effects have been reported in heteromodal cortical regions [e.g. the superior temporal sulcus (STS), Beauchamp et al., 2004b; Bushara et al., 2003; Calvert et al., 2000, 2001; Miller & D’Esposito, 2005], subcortical areas [thalamus, Komura et al., 2005; superior colliculus (SC), Meredith & Stein, 1983] and brain regions traditionally considered as ‘unisensory’ (for reviews see Schroeder & Foxe, 2005; Macaluso & Driver, 2005; Driver & Noesselt, 2008). Furthermore, these regions may communicate via direct feedforward connections and via polysynaptic feedback loops (Falchier et al., 2002; Schroeder & Foxe, 2002; McDonald et al., 2003; Rockland & Ojima, 2003; Ghazanfar & Schroeder, 2006; Noesselt et al., 2007). Thus, MI and attention may interact at many different processing stages, possibly depending on the stimulus type and task requirements. The few previous studies that found attentional modulation of MI used simple AV stimuli comparing dual multisensory events with the summed response produced by single unisensory stimuli (e.g. paired vs. unpaired gratings and tones; Talsma & Woldorff, 2005). Furthermore, previous studies demonstrating attentional modulation of MI used electroencephalography (Talsma et al., 2007) that has a relatively low spatial resolution and is likely to miss any effect in subcortical regions [but see Busse et al., 2005 using functional magnetic resonance imaging (fMRI)].

The aim of the present study was to investigate the link between selective attention and MI using AV speech stimuli that have been repeatedly demonstrated to cause automatic binding of a fused percept (Driver, 1996; Massaro, 1987; McGurk & MacDonald, 1976). Subjects attended one of two visual streams (speaking lips) presented simultaneously in the left and right hemifields. A central auditory stream (spoken words) was congruent with either the attended or unattended visual stream. fMRI at 3 T and meridian mapping of the visual cortex permitted detailed investigation of the attentional modulation of AV congruence (here indexing MI) during the comparison of conditions equivalent for the overall attentional requirements and the amount of multisensory information present in the environment.

Materials and methods


Thirteen right-handed native Italian speakers participated in the fMRI study (seven female, mean age 26.6 years). No participants reported neurological impairments and all gave written informed consent. The study was approved by the Fondazione Santa Lucia (Scientific Institute for Research Hospitalization and Health Care) Independent Ethics Committee, in accordance with the Declaration of Helsinki.

Stimuli and apparatus

Stimuli consisted of 30-s AV speech segments read by a female speaker. Texts were in Italian and were taken from descriptive sections of the novel ‘Alice in Wonderland’ by Lewis Carroll. Audio tracks were optimized for conversion from digital to pneumatic audiosystems through spectral pre-emphasis and played at a level to be clearly audible above the scanner noise. The auditory stimuli were played to both ears and were perceived centrally. To remove any potential effects of extralingual cues, videos were cropped to include only the mouth and chin (see Fig. 1). Synchronous and asynchronous AV stimuli were constructed by pairing each video track with either the matched auditory track or an unmatched audio track (i.e. the reading of a different section of the novel). Two visual streams were presented simultaneously in the left and right visual field (centred at 4.8° of visual angle horizontally and 4.4° vertically). On each side, the visual stimulus subtended 5.1° horizontally and 4.2° vertically. Eye movements were monitored using the ASL Eye-Tracking System with remote optics, custom-adapted for use in the scanner (Model 504, sampling rate, 60 Hz; Applied Science Laboratories, Bedford, USA).

Figure 1.

 Schematic picture of the stimuli, task and results from the behavioural validation of the task. (A) Stimuli and task. The top two panels show two conditions of selective attention (aR-attCon and aL-attInc). In these two conditions the stimuli were identical (with the right visual stream congruent with the central auditory stream) but subjects attended to either the congruent visual stream (aR-attCon) or the incongruent stream (aL-attInc). The lower panel shows the SB condition when attention was maintained centrally and neither visual stream was congruent with the central auditory stream. (B) Individual 70.7% detection thresholds were determined using the staircase method; an example for one subject is shown here. (C) Group data revealed greater detection accuracy for visual targets embedded in a visual stream that was congruent with the auditory stream compared with the incongruent condition (± SEM). attnCon, attend to audiovisual congruence; attnInc, attend to audiovisual incongruence.


The experiment consisted of 36 30-s blocks, acquired within a single fMRI run. In each block, a single auditory stream and two visual streams (one in each hemifield) were presented (see Fig. 1A). In 28 of these blocks, one of the two visual streams was congruent with the auditory stream. Subjects covertly maintained their attention on either the left or right visual stream (selective attention conditions). The attended visual stimuli were either congruent (50%) or incongruent (50%) with the auditory stream. In this way, the amount of multisensory information present in the environment was fixed and only the locus of attention varied. The resulting four conditions were: attend to AV congruence in the left visual field (aL-attCon), attend to AV incongruence in the left visual field (aL-attInc), attend to AV congruence in the right visual field (aR-attCon) and attend to AV incongruence in the right visual field (aR-attInc). Each of these conditions was repeated for seven blocks.

During the selective attention conditions, the task of the subject was to detect a visual target, consisting of a gradual slowing and cessation of the lip movements. Movement decelerated from 15 to 3 frames/s over a 1-s period and then ceased for 666 ms. Slowing/cessation events occurred on both the attended and unattended side but subjects had to respond to these events only when they occurred in the attended visual field (button press with the right index finger). Slowing/cessation events on the unattended side served as catch trials to ensure that subjects complied with the instruction to selectively maintain attention on one side. There were zero, one or two slowing/cessation events per block, with a total of 18 occurrences on the attended side (targets) and 18 occurrences on the unattended side (catch trials).

The remaining eight 30-s blocks were used to establish a sensory baseline (SB) condition. In these blocks, neither the left nor the right visual stream was synchronized with the auditory stream and subjects were instructed to simply maintain central fixation. At the start of each block, a leftward arrow, rightward arrow (selective attention conditions) or horizontal bar (SB condition) instructed subjects to attend to the left or right visual stream or to maintain attention on the fixation cross. An 8-s interval between the cue and the beginning of the AV stimulation minimized the influence of cue-related activity on the processes of interest occurring during the ensuing 30-s stimulation blocks.

Psychophysical testing

In addition to the imaging paradigm, a separate psychophysics study (eight subjects, mean age 29.8 years) was performed to behaviourally validate the role of MI in the current task, which required detection of visual targets with concurrent but task-irrelevant auditory stimuli. Psychophysical tests included bilateral visual stimuli as in the fMRI experiment (i.e. with congruent AV on the unattended side in the ‘attInc’ conditions; see Fig. 1A) and synthesized scanner noise was superimposed on the audio track.

The experiment was conducted in two phases. In the first phase, subject-specific 70.7% accuracy thresholds for the visual targets were determined for ‘attCon’ and ‘attInc’ conditions. The duration of the deceleration target was increased or decreased using a staircase method (2UP-1DOWN; Levitt, 1971). Hits and correct rejections (i.e. no response when the target was presented in the unattended hemifield) were counted as correct (UP) responses, whereas misses and false alarms were counted as incorrect (DOWN) responses. The staircase procedure terminated after seven turns. Step sizes were in two frames for the first three turns and subsequently in one frame (see Fig. 1B).
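The staircase logic described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study: it assumes that two consecutive correct responses shorten the target (making it harder) and that a single incorrect response lengthens it, which is the rule that converges on 70.7% accuracy (the level p at which p² = 0.5). The starting duration, the lower bound and the simulated observer are hypothetical.

```python
import math
import random


def run_staircase(respond, start_frames=20, max_turns=7, min_frames=1,
                  max_trials=10_000):
    """Run a transformed staircase and return the target duration at each turn.

    respond(duration) must return True for a correct response (hit or correct
    rejection) and False for an incorrect one (miss or false alarm).
    """
    duration = start_frames
    correct_streak = 0
    last_direction = 0      # -1 = target got shorter, +1 = longer, 0 = no move yet
    turn_levels = []        # duration at which each reversal ("turn") occurred
    for _ in range(max_trials):
        if len(turn_levels) >= max_turns:
            break
        # Two-frame steps for the first three turns, one-frame steps afterwards.
        step = 2 if len(turn_levels) < 3 else 1
        if respond(duration):
            correct_streak += 1
            if correct_streak < 2:
                continue            # need two correct responses in a row
            correct_streak = 0
            direction = -1
        else:
            correct_streak = 0
            direction = +1
        if last_direction != 0 and direction != last_direction:
            turn_levels.append(duration)
        last_direction = direction
        duration = max(min_frames, duration + direction * step)
    return turn_levels


def make_observer(rng, midpoint=10.0, slope=0.5):
    """Hypothetical observer with a logistic psychometric function."""
    def respond(duration):
        p_correct = 1.0 / (1.0 + math.exp(-slope * (duration - midpoint)))
        return rng.random() < p_correct
    return respond


if __name__ == "__main__":
    turns = run_staircase(make_observer(random.Random(0)))
    print("durations at the seven turns (frames):", turns)
```

How the seven turn levels are combined into a single threshold (for example, by averaging the later turns) is not stated in the text, so the sketch returns them unaveraged.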

In the second phase, the subject-specific thresholds for ‘attCon’ and ‘attInc’ conditions were averaged and utilized as the fixed target duration. The test consisted of the same visual target detection task as in the fMRI experiment (30-s AV segments, with attention to congruent or incongruent AV) but now with 96 near-threshold targets per condition (inter-target interval, 3 ± 1 s), rather than just nine above-threshold targets as in the fMRI experiment.


Imaging was carried out in a 3-T Allegra head scanner (Siemens, Erlangen, Germany). Blood oxygenation level-dependent contrast was obtained using echo planar T2*-weighted imaging. Thirty transverse slices were acquired with a repetition time of 1.95 s, time to echo of 30 ms, matrix size 64 × 64 and an in-plane resolution of 3 × 3 mm; slice thickness and gap were 2.5 and 1.25 mm, respectively.


Imaging data were analysed using SPM5. The first four image volumes of each run were discarded to allow for stabilization of longitudinal magnetization. Pre-processing included rigid-body transformation (realignment) and slice timing to correct for head movement and slice acquisition delays. The images were then normalized to Montreal Neurological Institute (MNI) space using the co-registered T1 image, resampled to 3 × 3 × 3 mm voxels and smoothed with a Gaussian filter of 8 mm full-width at half maximum. The time series for each participant were high-pass filtered at 128 s and pre-whitened by means of an autoregressive model AR(1). For the identification and analysis of visual regions (see below), operations were performed in subjects’ native (non-normalized) space and a smoothing filter of 4 mm (rather than 8 mm) was applied.

At the first level (subject-specific) analysis, box-car regressors modelling the occurrence of the five different conditions [four selective attention conditions (aL-attCon, aL-attInc, aR-attCon and aR-attInc) plus the SB] were convolved with the standard SPM5 haemodynamic response function. In addition the effects of head movement, subject responses (button presses to the visual targets) and ocular deviations from fixation (see below) were modelled as regressors of no interest. The resulting general linear model produced an image estimating the effect size of the response induced by each of the five conditions of interest. At the second (inter-subject) level, these images were entered into a random effects factorial design with five levels, corresponding to the five conditions, plus an additional subject constant to account for non-condition-specific inter-subject variance. Correction for non-sphericity (Friston et al., 2002) was used to account for possible differences in error variance across conditions and any non-independent error terms for the repeated measures.
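To make the construction of a box-car regressor concrete, the following Python sketch builds one 30-s block and convolves it with a double-gamma response function sampled at the repetition time of 1.95 s. It is an illustration under stated assumptions rather than SPM5 code: the double-gamma shape parameters (6 and 16, with an undershoot ratio of 1/6) are a commonly used approximation of the canonical response, and the block onset and number of scans are hypothetical.

```python
import math

TR = 1.95  # repetition time in seconds, from the acquisition parameters


def gamma_pdf(t, shape):
    """Gamma density with unit scale, evaluated at time t (seconds)."""
    if t <= 0:
        return 0.0
    return math.exp((shape - 1) * math.log(t) - t - math.lgamma(shape))


def double_gamma_hrf(tr, duration=32.0):
    """Assumed double-gamma response sampled every tr seconds, scaled to sum to 1."""
    n_samples = int(duration / tr) + 1
    raw = [gamma_pdf(i * tr, 6.0) - gamma_pdf(i * tr, 16.0) / 6.0
           for i in range(n_samples)]
    total = sum(raw)
    return [value / total for value in raw]


def block_regressor(onsets_s, block_s, n_scans, tr):
    """Box-car (1 during each block, 0 elsewhere) convolved with the response."""
    boxcar = [0.0] * n_scans
    for onset in onsets_s:
        start = int(round(onset / tr))
        stop = min(n_scans, int(round((onset + block_s) / tr)))
        for i in range(start, stop):
            boxcar[i] = 1.0
    hrf = double_gamma_hrf(tr)
    return [sum(boxcar[n - k] * hrf[k] for k in range(min(n + 1, len(hrf))))
            for n in range(n_scans)]


if __name__ == "__main__":
    # Hypothetical example: one 30-s block starting 8 s into a 60-scan series.
    regressor = block_regressor([8.0], 30.0, 60, TR)
    print([round(v, 2) for v in regressor[:25]])
```

One such regressor per condition, together with the nuisance regressors for head movement, button presses and fixation losses, would form the columns of the first-level design matrix.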

The critical contrast for this experiment compared attention to AV congruence vs. attention to AV incongruence [(aL-attCon + aR-attCon) > (aL-attInc + aR-attInc)]. Additional contrasts of interest included the four selective attention tasks vs. SB [(aL-attCon + aR-attCon + aL-attInc + aR-attInc) > (SB)] and the contrast between the attend left vs. right conditions [(aL-attCon + aL-attInc) > (aR-attCon + aR-attInc) and vice versa]. For these comparisons the Statistical Parametric Mapping threshold was set to P-corrected = 0.05 at cluster level (cluster size estimated at P-uncorrected = 0.001), considering the whole brain as the volume of interest. In addition, due to the putative role of the SC in MI (c.f. Stein & Stanford, 2008), corrected P-values for this region were assigned considering a reduced volume of interest (Worsley et al., 1996). This was defined from the average of coordinates reported in previous fMRI studies (retrieved from
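The contrasts listed above can be written as weight vectors over the five condition estimates. The Python sketch below is illustrative only: the ordering of the conditions and the scaling of the weights (each side of a contrast averaged so that the weights sum to zero) are assumptions, and any positive rescaling of a vector tests the same hypothesis.

```python
CONDITIONS = ["aL-attCon", "aL-attInc", "aR-attCon", "aR-attInc", "SB"]


def contrast(positive, negative):
    """Zero-sum weights: positive conditions average to +1, negative to -1."""
    weights = []
    for name in CONDITIONS:
        if name in positive:
            weights.append(1.0 / len(positive))
        elif name in negative:
            weights.append(-1.0 / len(negative))
        else:
            weights.append(0.0)
    return weights


CONTRASTS = {
    # Critical contrast: attend AV congruence > attend AV incongruence.
    "attCon > attInc": contrast(["aL-attCon", "aR-attCon"],
                                ["aL-attInc", "aR-attInc"]),
    # Overall effect of the selective attention tasks vs. sensory baseline.
    "task > SB": contrast(["aL-attCon", "aL-attInc", "aR-attCon", "aR-attInc"],
                          ["SB"]),
    # Lateralized attention: attend left > attend right.
    "aL > aR": contrast(["aL-attCon", "aL-attInc"],
                        ["aR-attCon", "aR-attInc"]),
}


def contrast_value(weights, estimates):
    """Weighted sum of per-condition effect estimates."""
    return sum(w * b for w, b in zip(weights, estimates))


if __name__ == "__main__":
    hypothetical_estimates = [1.2, 0.4, 1.0, 0.5, 0.3]  # made-up numbers
    for name, weights in CONTRASTS.items():
        print(name, weights, round(contrast_value(weights, hypothetical_estimates), 3))
```

The reverse contrasts (for example, attend right > attend left) are obtained by negating the corresponding vector.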

Region of interest analysis of visual regions

To provide a more detailed investigation of the interplay between attention and MI in early visual areas, a ‘moving-lips’ localizer and meridian mapping were performed. At the beginning and end of the experimental session, subjects underwent two fMRI runs with four types of stimuli: moving lips in the left or right visual field (stimulus also flashing at 7.5 Hz to maximize sensory responses), or ribbon-shaped checkerboards running along the horizontal or vertical meridian (flashing at 1 Hz). Each stimulation condition was presented in blocks of 16 s. The duration of each run was approximately 4 min (120 volumes, with the same imaging parameters as the main experiment).

Following acquisition, blood oxygenation level-dependent maps were projected onto each subject’s flattened cortical surface using Freesurfer (Fischl et al., 1999). Horizontal and vertical meridians were used to identify the boundaries of V1/V2 and V2/V3 (see Fig. 4A). Within V1 and V2, the moving-lips localizer permitted identification of the retinotopic response produced by this visual stimulus presented in the left or right visual field. The peak activations in the left/right V1 and V2 were used for subsequent analysis of multisensory effects. In addition, regions of the anterior fusiform gyrus were identified, as these regions had previously been shown to be responsive to the crossmodal congruence of AV speech information (Macaluso et al., 2004). Published MNI space coordinates consistent with the fusiform face area were averaged (x = −40, y = −55, z = −17; x = 37, y = −54, z = −18) (Macaluso et al., 2004; Rhodes et al., 2004; Egner & Hirsch, 2005; Spiridon et al., 2006) and inverse normalized into subject-specific space (see Fig. 4A). The nearest local maxima to this coordinate, meeting the criteria of a positive effect of all five attention conditions and the simple effect of task (four attention conditions vs. SB, capturing the attentional modulation of these ventral stream regions), were selected. This remains a valid localizer as this contrast is orthogonal to the effects of interest (attention to AV congruence, effect of lateralized attention). For each subject and each visual region, we extracted the estimated parameters for the four selective attention conditions vs. SB, averaging values within a 4.5-mm sphere to optimize the signal-to-noise ratio.
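The final extraction step, averaging parameter estimates within a sphere around a peak voxel, can be sketched as follows. This is a hypothetical illustration: it treats 4.5 mm as the sphere radius, takes the voxel spacing as an argument (3 × 3 × 3.75 mm would follow from the in-plane resolution and the slice thickness plus gap, if extraction was done on the native acquisition grid), and represents the parameter image as a nested list indexed as volume[x][y][z].

```python
def sphere_offsets(radius_mm, voxel_mm):
    """Voxel offsets (dx, dy, dz) whose centres lie within radius_mm of the origin."""
    reach = [int(radius_mm // size) for size in voxel_mm]
    offsets = []
    for dx in range(-reach[0], reach[0] + 1):
        for dy in range(-reach[1], reach[1] + 1):
            for dz in range(-reach[2], reach[2] + 1):
                dist_sq = ((dx * voxel_mm[0]) ** 2 +
                           (dy * voxel_mm[1]) ** 2 +
                           (dz * voxel_mm[2]) ** 2)
                if dist_sq <= radius_mm ** 2:
                    offsets.append((dx, dy, dz))
    return offsets


def sphere_mean(volume, centre, radius_mm, voxel_mm):
    """Mean of volume[x][y][z] over in-bounds voxels inside the sphere at centre."""
    values = []
    for dx, dy, dz in sphere_offsets(radius_mm, voxel_mm):
        x, y, z = centre[0] + dx, centre[1] + dy, centre[2] + dz
        if (0 <= x < len(volume) and 0 <= y < len(volume[0])
                and 0 <= z < len(volume[0][0])):
            values.append(volume[x][y][z])
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical 5 x 5 x 5 parameter image filled with a constant value.
    volume = [[[2.0 for _ in range(5)] for _ in range(5)] for _ in range(5)]
    print(len(sphere_offsets(4.5, (3.0, 3.0, 3.0))), "voxels in the sphere")
    print(sphere_mean(volume, (2, 2, 2), 4.5, (3.0, 3.0, 3.0)))
```

On an isotropic 3-mm grid a 4.5-mm radius includes the peak voxel plus its face and edge neighbours but excludes the corner neighbours.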

Figure 4.

 Methods and results for region of interest (ROI) analysis within visual cortex. (A) Method for generating regions of interest in the left hemisphere of a representative subject. (i) V1 and V2 were identified using flat mapping and meridian mapping to delineate boundaries between visual regions. (ii) Subject-specific V1 and V2 peak activations for the right visual field presentations of the localizer stimulus were then selected using the boundaries of these regions and are depicted by green crosses. (iii) Subject-specific fusiform gyrus (FG) coordinates were identified by projecting published MNI coordinates into the brain space of individual subjects (here indicated by the green cross) and selecting the local maxima of the ‘selective attention task > SB’ contrast. Single-subject statistical maps are thresholded at P-uncorrected < 0.001. (B) Group results for the three visual regions (V1, V2 and FG) as a function of hemisphere (left/right), attended visual field [attend left (aL)/attend right (aR) side] and attention to AV congruence [attended (attnCon)/unattended (attnInc) congruence]. All regions showed greater activity for attention to the contralateral than ipsilateral visual field and critically greater activity for attnCon than attnInc conditions but only during attention to the right visual field (*P < 0.05). Effect sizes are expressed in arbitrary units (+SEM). LOC, lateral occipital cortex; RVF, right visual field.

Monitoring of central fixation

Saccades were identified as eye movements with a velocity exceeding 25°/s. An average of 21.5 (SD 25) saccadic deviations were detected per subject, with only 37% of these terminating in the attended visual quadrant. The mean duration of deviations from fixation was 347 ms (SD 125 ms). Neither the frequency nor duration of deviations from fixation varied as a function of AV congruence (T < 1.12, ns). All losses of central fixation were modelled as events of no interest in the first level (subject-specific) fMRI analyses, ensuring that our results at the group level are unaffected by eye movements.
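The velocity criterion can be illustrated with a short Python sketch. It is a simplified reading of the procedure, not the authors' pipeline: it assumes gaze positions in degrees of visual angle sampled at 60 Hz, computes velocity by simple sample-to-sample differencing (with no noise filtering), and groups consecutive supra-threshold samples into one event. The example traces are invented.

```python
import math


def detect_saccades(gaze_deg, sample_rate_hz=60.0, threshold_deg_s=25.0):
    """Return (first_index, last_index) for each run of samples faster than threshold.

    gaze_deg is a sequence of (x, y) gaze positions in degrees of visual angle.
    """
    events = []
    start = None
    for i in range(1, len(gaze_deg)):
        x0, y0 = gaze_deg[i - 1]
        x1, y1 = gaze_deg[i]
        velocity = math.hypot(x1 - x0, y1 - y0) * sample_rate_hz  # deg/s
        if velocity > threshold_deg_s:
            if start is None:
                start = i
        elif start is not None:
            events.append((start, i - 1))
            start = None
    if start is not None:
        events.append((start, len(gaze_deg) - 1))
    return events


if __name__ == "__main__":
    # Invented trace: steady fixation, a 2-degree rightward jump, then fixation.
    trace = [(0.0, 0.0)] * 5 + [(1.0, 0.0), (2.0, 0.0)] + [(2.0, 0.0)] * 5
    print(detect_saccades(trace))
```

Events detected in this way could then be entered as regressors of no interest, as described for the first-level models.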


Behavioural validation of multisensory influences

The separate psychophysical study revealed a significant influence of AV congruence on the detection of near-threshold visual targets (T7 = 2.81, P < 0.05), with targets being more likely to be detected during ‘attCon’ (73.9%) than ‘attInc’ (67.8%) conditions. These behavioural data confirmed that AV congruence affected processing at the attended side, despite the auditory stream being task-irrelevant and non-informative with respect to the visual detection task.

Behavioural data from the imaging session

One subject failed to perform the selective attention task in the scanner (target detection accuracy 11%) and was excluded from further analysis. The remaining subjects identified on average 93% of the above-threshold targets (hits) and responded erroneously to 6% of the catch trials in the unattended visual field (corresponding to 1.1 false alarms per subject). Reaction times for attCon and attInc were not significantly different (1517 vs. 1346 ms; T11 = 1.59) but it should be noted that there were only nine targets for each congruency condition, as this task was designed solely to ensure task compliance during fMRI, while interfering minimally with multisensory effects (the auditory stream was unchanged during target presentation).

Overall effect of selective attention tasks

First, we compared the four selective attention conditions vs. SB to highlight the overall effect of the spatial selective attention task, irrespective of attended side. This produced activation within the dorsal (intra-parietal sulcus, lateral premotor and supplementary motor area) and ventral (ventral premotor cortex) frontoparietal attention networks (c.f. Corbetta & Shulman, 2002; Corbetta et al., 2008; see Fig. 2A). The posterior section of the STS (pSTS) was also activated by the selective attention tasks. The localization of this activation is consistent with both the temporo-parietal junction of the ventral attention network (associated with the attentional demands of the task) and multimodal pSTS (potentially activated by attention-independent MI). In addition, increased activation was observed in extrastriate lateral occipital cortex, consistent with reported locations of the extrastriate body area (for coordinates c.f. Spiridon et al., 2006). Coordinates are presented in Table 1.

Figure 2.

 (A) Positive effect of selective attention task (all selective attention conditions > SB). (B) Positive effect of lateralized covert spatial attention to the right visual stream [attend right (aR) > attend left (aL); left panel] or the left visual stream (aL > aR; right panel). Statistical maps are thresholded at P-uncorrected < 0.001 and projected on a three-dimensional rendering of the canonical MNI space.

Table 1. Location, significance and extent for all contrasts
P-values are family-wise error (FWE)-corrected at the cluster level. aSTS/pSTS, anterior/posterior superior temporal sulcus; BA, Brodmann’s area; LVF/RVF, left/right visual field; MT, middle temporal; LOC, lateral occipital cortex; IFG, inferior frontal gyrus.

Hemisphere | Region | Corrected P-value | Number of voxels | Z-score | Coordinates (mm)
Selective attention tasks > SB
R | pSTS | | | 5.00 | 54, −36, 3
R | pSTS | | | 4.71 | 57, −39, 12
R | Precentral | 0.000 | 504 | 5.11 | 45, −3, 51
R | IFG | | | 5.00 | 42, 9, 27
Attend LVF > attend RVF
R | BA 18 | 0.000 | 931 | 6.80 | 18, −90, 18
R | BA 19 | | | 5.45 | 33, −72, −12
Attend RVF > attend LVF
L | BA 18 | 0.000 | 1362 | 7.27 | −21, −96, 18
L | BA 19 | | | 7.14 | −27, −78, −9
Attend AV congruence > attend AV incongruence
R | mSTS | | | 4.34 | 66, −12, −3
R | mSTS | | | 3.97 | 69, −21, 6
L | mSTS | | | 4.30 | −63, −27, 3
L | mSTS | | | 3.91 | −33, −21, −9
Small volume correction

Effects of lateralized attention

Next we sought to confirm the effectiveness of the manipulation of selective spatial attention to one or the other hemifield, directly comparing conditions of leftward vs. rightward attention and vice versa. This revealed the expected modulation of activity in visual cortex contralateral to the attended side [e.g. activation of the right occipital cortex for leftward attention: (aL-attCon + aL-attInc) > (aR-attCon + aR-attInc)]. The effect of spatial selective attention extended across the extrastriate cortex and peaked in Brodmann’s areas 18 and 19 (see Fig. 2B and Table 1). This result confirms that subjects were indeed focusing spatial attention on the designated visual field and were in compliance with the task instruction.

Influence of attending to audiovisual congruence: attention to multisensory integration

Normalized voxel-wise analysis

The critical contrast of this experiment was the comparison of attention to the AV congruent visual stream vs. attention to the AV incongruent visual stream [(aL-attCon + aR-attCon) > (aL-attInc + aR-attInc)]. This contrast reveals regions that activate selectively when multisensory AV speech was within the current focus of endogenous spatial attention. A robust haemodynamic response was observed extending along both the left and right STS (see Fig. 3A and Table 1). Activity in this region increased when subjects attended the visual stimulus congruent with the auditory stream compared with attention to the incongruent visual stimulus. It is notable that this modulatory effect of attention to MI occurred despite the same amount of multisensory information being present. Furthermore, within this region the activation induced by attending to the AV incongruence was not greater than sensory stimulation alone. Thus, unattended AV congruence was insufficient to activate this region (e.g. when subjects attended to one side but the auditory stream was congruent with the visual stream on the other side; see red bars in Fig. 3A). This suggests that spatial attention does not just modulate responses to congruent AV speech but rather attention and AV congruence are jointly required to modulate the activation of this region.

Figure 3.

 Statistical maps and signal plots of the effect of attention to AV congruence [(aL-attCon + aR-attCon) > (aL-attInc + aR-attInc)] in (A) the STS and (B) the SC. Signal plots show activation during attention to AV congruence (blue bars) and attention to AV incongruence (red bars) compared with the SB. Effect sizes are expressed in arbitrary units (+SEM). Statistical maps are thresholded at P-uncorrected < 0.001. Blue outline in B depicts the volume of interest comprising the SC projected on the average structural scan of the 12 participants in coronal and sagittal (insets) planes.

An a-priori motivated anatomical search volume applied to the SC revealed an interplay between spatial attention and MI in this region. The signal plots of the SC showed that attention to AV congruence produced a greater response than attention to AV incongruence [see Fig. 3B; attCon (in blue) and attInc (in red)]. Again, activity during attention to incongruent AV was not different from SB (see red bars in Fig. 3B that are on average around zero), indicating that AV congruence activates the SC only when it is selectively attended (blue bars).

The inverse contrast (incongruent > congruent) revealed no significant activation.

Analyses of sensory-specific visual areas

To characterize the response within the visual cortex in more detail, three regions were independently identified for subsequent region of interest analysis. These regions were V1, V2 and a region of the fusiform gyrus anatomically consistent with the fusiform face area (see Materials and methods and Fig. 4A). To characterize the response within each of these visual regions, a four-way within-subjects ANOVA was performed with the factors: region (V1/V2/fusiform gyrus), hemisphere (left/right), hemifield-attended (left/right) and attention to AV congruence (attended/unattended MI).

This showed the expected hemifield-attended by hemisphere interaction, reflecting the influence of lateralized attention (F1,11 = 106.2, P < 0.001; c.f. signal plots, Fig. 4B). Critically, the ANOVA revealed a main effect of attention to AV congruence across regions (F1,11 = 6.1, P = 0.031), demonstrating an interplay between attention and MI within striate and extrastriate visual areas. For each region, we sought to confirm our a-priori hypothesis of increased activation for attention to congruent vs. incongruent AV stimuli using a weighted contrast. These additional t-tests revealed significant effects in all three regions: V1 (T11 = 2.09, P = 0.030, one-tailed), V2 (T11 = 2.40, P = 0.018, one-tailed) and the fusiform gyrus (T11 = 2.32, P = 0.020, one-tailed).
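The relationship between the reported T11 values and their one-tailed P-values can be checked with a short Python sketch. This is an illustrative calculation, not part of the original analysis: it integrates the Student's t density numerically with Simpson's rule, using only the standard library, and the number of integration intervals is an arbitrary choice.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def one_tailed_p(t_value, df, n_intervals=2000):
    """Upper-tail probability P(T > t_value) for t_value >= 0.

    Uses the symmetry of the distribution: 0.5 minus the integral from 0 to
    t_value, evaluated with Simpson's rule (n_intervals must be even).
    """
    h = t_value / n_intervals
    total = t_pdf(0.0, df) + t_pdf(t_value, df)
    for i in range(1, n_intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


if __name__ == "__main__":
    for region, t_value in [("V1", 2.09), ("V2", 2.40), ("fusiform gyrus", 2.32)]:
        print(region, round(one_tailed_p(t_value, 11), 3))
```

With 12 subjects retained in the analysis the tests have 11 degrees of freedom, which is the value passed as df here.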

Despite the absence of a significant interaction between attention to MI and visual hemifield (P = 0.21), the signal plots indicate larger effects of attention to MI when spatial attention was directed in the right visual field (‘aR’, see Fig. 4B). Indeed, post-hoc t-tests revealed significant effects of attention to MI during attend-right conditions in all regions and across both hemispheres (all P < 0.05, one-tailed), except for V1 in the right hemisphere (P = 0.051). By contrast, no effect of attention to MI was observed in any of the regions during the attend-left condition (all T11 < 1).


We contrasted conditions where endogenous visuospatial attention was focused on a visual channel that was either congruent or incongruent with the accompanying auditory channel, while keeping the amount of multisensory information in the environment constant. This allowed us to determine whether the neural correlates typically associated with the formation of a multisensory percept were sensitive to the locus of visual spatial attention. The primary finding of this study was the robust modulatory effect of attention to AV congruence. This influence was observed in heteromodal cortical regions consistent with previous studies of multisensory fusion, in the SC and in ‘sensory-specific’ visual regions as early as the primary visual cortex.

Audiovisual linguistic integration was employed due to its ecological validity, robust neural indices (c.f. Stevenson et al., 2007) and the strong relationship to previous behavioural paradigms that explored the relationship between attention and MI. Both the McGurk effect (Massaro, 1987; Soto-Faraco et al., 2004) and investigations of the pre-attentive nature of the ventriloquist effect (Driver, 1996) employed linguistic AV integration. Furthermore, a separate psychophysical study here confirmed that AV congruence can enhance the discriminability of visual targets on the attended side, even when subjects are instructed to respond only to events in the visual stream.

Cortical effects of attention to audiovisual congruence

The STS receives convergent input from auditory, visual and somatosensory regions (see Padberg et al., 2003; Schmahmann & Pandya, 1991) and has been frequently associated with MI (Calvert et al., 2000; Beauchamp et al., 2004a,b; Miller & D’Esposito, 2005; Noesselt et al., 2007). The present study demonstrates that attention and congruent AV speech jointly contribute to the activation of the STS.

Our current activation of STS is anatomically consistent with previous fMRI studies on the fusion of a multisensory event into a single percept (Bushara et al., 2003; Miller & D’Esposito, 2005). In contrast, studies that have compared the presentation of a single unisensory stimulus with the presentation of two simultaneously presented stimuli in different modalities have identified a more posterior section as the putative multisensory component of the STS (the pSTS). Enhanced activation in this region has been reported in terms of the response to bimodally presented (compared with the unimodally presented) AV speech and object stimuli irrespective of sensory/semantic congruence (Calvert et al., 2000; Beauchamp et al., 2004a,b) but not with respect to semantically congruent vs. incongruent AV object stimuli (although the perirhinal cortex exhibited an enhanced response; Taylor et al., 2006). The pSTS response is enhanced in response to bimodally vs. single unimodally presented letter/speech sounds but this pSTS response does not vary with respect to the congruent/incongruent nature of these letter/sound pairs (van Atteveldt et al., 2004, 2007). pSTS activation has also been reported in response to spatially congruent vs. incongruent AV stimuli (Noesselt et al., 2007). It has recently been proposed that many of the comparisons that produce pSTS activation may contain several perceptual/cognitive effects that are not related to multisensory perception per se (Hocking & Price, 2008). Hocking & Price (2008) demonstrated that pSTS effects commonly attributed to multisensory influences can be achieved through similar unimodal manipulations of sensory input load and attention.

In the present study, pSTS was found to activate when the selective attention conditions were compared with the low-level SB. This activation of pSTS might be consistent with an attention-independent effect of AV congruence, which was present in all four selective attention conditions (i.e. also in the attInc conditions) but not in the baseline. However, the low-level SB condition also did not require any selective attention to the stimuli; thus, the relative contributions of spatial attention and unattended AV congruence cannot be definitively determined from the current design. Nonetheless, a role for this region in the binding of a multisensory percept appears unlikely, as studies that have specifically compared fused vs. unfused multisensory percepts have failed to observe any activation in this posterior region of the STS (Olson et al., 2002; Bushara et al., 2003; Jones & Callan, 2003; Macaluso et al., 2004; Miller & D’Esposito, 2005; Ojanen et al., 2005; see also Hein & Knight, 2008).

Here, the critical interplay between attention and AV congruence was found in a more anterior region of the STS. In this region the effect of AV congruence was contingent on attention to the multisensory stimulus. Accordingly, activity associated with unattended AV congruence (i.e. during ‘attend-incongruent’ conditions) was not greater than that for AV incongruent stimulation alone (i.e. the low-level baseline). This suggests that attention did not simply modulate an overall effect of congruence, which was entirely absent when the AV congruence was unattended, but rather that attention and congruence were both necessary to activate this region. It should be noted that, due to the symmetrical nature of our design, the contrast ‘attend congruent > attend incongruent’ (attCon > attInc) is identical to ‘unattend incongruent > unattend congruent’. Consequently, the effects attributed to attending to the congruent visual stream could equally arise from the need to ignore the incongruent stream. The response elicited by the SB suggests that this is not the case. The requirement to disregard incongruent streams was also present in this condition, and along the STS the response to the attend-congruent conditions was enhanced relative to SB, whereas the attend-incongruent conditions induced the same fMRI response as SB. This suggests that the critical factor in the evocation of the STS response to the attCon condition is attention to a crossmodally congruent visual stimulus rather than the need to ignore a crossmodally incongruent visual stimulus.

As noted above, activation of this STS region has been previously associated with the formation of multisensory percepts (e.g. Miller & D’Esposito, 2005), which in our study may also entail greater comprehension/processing of the auditory text in the ‘attend-congruent’ conditions. Thus, the interplay between attention and congruence in STS may reflect some consequence of the integration of the multisensory input (e.g. increased comprehension). However, it should be stressed that here the semantic linguistic information contained within the auditory stream was always task-irrelevant and could not help with the detection of the visual targets. It is also possible that subjects strategically focused on the auditory stream under congruent AV conditions, using the disparity between the auditory and visual information to better identify the target stimulus. We consider this unlikely because the auditory stream was misleading with respect to the target (a continuous stream rather than a decelerating sequence of events as in the visual stream), although the possibility of such a strategic shift in attention cannot be completely excluded. A further alternative interpretation is that attention to a stimulus in the visual modality automatically increases the strength of the neural representation of a related stimulus in a second (auditory) modality. Such spreading of attentional effects has previously been reported within the visual modality (Serences & Boynton, 2007) and similar properties may exist across modalities (see also Macaluso & Driver, 2001; Busse et al., 2005). Although the current data (as is the case with previous studies on MI) cannot disentangle the underlying mechanisms of integration from its possible consequences, our findings indicate that top-down (endogenous attention) and bottom-up (sensory input) factors jointly contribute to the activation of STS during the processing of AV speech.

It is notable that, in this study, no enhanced response was detected during attention to multisensory incongruence compared with attention to multisensory congruence, either in the whole-brain analysis or in the more focused analyses discussed hereafter. Although effects of incongruent AV stimuli have been observed in previous studies, this has normally been in the context of priming effects, where the stimulus in one modality precedes the stimulus in the other (Noppeney et al., 2008). Under such conditions, expectancy is violated in incongruent conditions, which may lead to an enhanced neural response to the unexpected (crossmodal) stimulus, as accounted for within theories such as predictive coding (Rao & Ballard, 1999).

Subcortical effects of attention to audiovisual congruence

Although reports of the SC as a multisensory structure are prevalent in the animal literature (Meredith & Stein, 1983; Stein & Meredith, 1993; Stein & Stanford, 2008), this structure is seldom reported in fMRI studies of multisensory processing. Here we show that attention to AV congruence increases the multisensory response bilaterally in the SC. In contrast, Miller & D’Esposito (2005) found that incongruent AV speech produced an enhanced SC response. This difference may be due to the type of incongruent stimuli used (temporally jittered in their study vs. completely unmatched in the present study). MI in the SC has been observed in anesthetized animals; one might therefore expect MI to occur primarily in an automatic, attention-independent manner (Stein & Meredith, 1993). However, multisensory responses in the SC are dependent on cortical input (Jiang et al., 2001) and have been linked to overt spatial behaviour in animal studies (Stein et al., 1989). The finding here of an attentional influence on the multisensory response of the SC suggests that the multisensory function of this region is not a passive response to environmental events but is actively shaped by current behavioural goals. Accordingly, the SC would be part of a larger network of cortical and subcortical regions that jointly contribute to the formation of integrated multisensory percepts (Stein & Stanford, 2008).

Effect of attention to audiovisual congruence within early visual cortex

In this study, attention to AV congruence was also observed to modulate the activity in visual regions. This finding is consistent with previous investigations of multisensory processing, which have identified enhanced responses within sensory-specific cortex (Macaluso et al., 2000; Noesselt et al., 2007; Alink et al., 2008). This study adds to this previous literature by showing the attentional modulation of these effects and by demonstrating that multisensory facilitation arises as early in the visual hierarchy as V1 (using meridian mapping in each single subject, Fig. 4A). Fine-resolution fMRI in the macaque has localized functional modulation of the auditory cortex by visual input in both the core and belt fields (Kayser et al., 2007) but multisensory effects in primary visual cortex have not been described in much detail [see Noesselt et al. (2007) who reported AV interaction in V1 based on anatomical criteria]. V1 has been shown to have direct connections with auditory regions in the monkey (Falchier et al., 2002) and it is possible that direct collateral connections drive this effect. However, it is also possible that feedback connections and interactions with higher level association areas may govern this response [Schroeder & Foxe, 2002; Noesselt et al., 2007; see also Ghazanfar et al. (2008) for interactions between STS and auditory regions during face/voice integration]. Our current finding of interplay between endogenous spatial attention and the processing of AV congruence may favour the second hypothesis, which emphasizes the role of high-level associative regions rather than relatively low-level automatic processing. Nonetheless, attention and MI may also affect the visual cortex via different pathways (e.g. frontoparietal for spatial attention and direct projections for MI). Additional work is needed to elucidate the relative contributions of these top-down and collateral influences.

In the present study, the effect of attention on MI appeared to be more pronounced when stimuli were attended in the right visual field. This is an unexpected finding as the right side of the face (normally present in the left visual field) is thought to be a more important source of lingual cues (Geffen et al., 1971; Wolf & Goodale, 1987; Nicholls et al., 2004). However, these findings relate only to centrally presented face stimuli and it is possible that right visual field advantages for part-based processing (Hillger & Koenig, 1991; Rossion et al., 2000) may dominate during the lateralized presentation of faces.


Although previous studies of various AV perceptual phenomena suggest that multisensory fusion may be relatively insensitive to attention, here we observed that spatial attention to the visual component of an AV stimulus is critical for the activation of brain regions thought to reflect neural correlates of MI. This interplay between attention and multisensory congruence was manifest in two heteromodal regions (SC and STS) where spatial attention was critical for the evocation of a multisensory response. Furthermore, widespread influences of attention on MI were also observed across unimodal visual regions as early in the visual hierarchy as the primary visual cortex. These results demonstrate that voluntary behaviour and current goals can modulate the integration of the senses, affecting processing in the brain at multiple cortical and subcortical stages.


The Neuroimaging Laboratory is supported by The Italian Ministry of Health.


attend to audiovisual congruence in the left visual field


attend to audiovisual incongruence in the left visual field


attend to audiovisual congruence in the right visual field


attend to audiovisual incongruence in the right visual field




fMRI, functional magnetic resonance imaging


MI, multisensory integration


MNI, Montreal Neurological Institute


pSTS, posterior section of the superior temporal sulcus


SB, sensory baseline


SC, superior colliculus


STS, superior temporal sulcus