The environment often engages multiple sensory channels simultaneously. When events or objects stimulate more than one sensory organ, they are more likely to be detected and more efficiently processed than events stimulating just one sensory modality (e.g. Stein & Stanford, 2008).

Such crossmodal enhancement requires that the sensory inputs arising from one event are correctly assigned to that event (termed the ‘correspondence problem’; Ernst & Bülthoff, 2004). Multisensory research has demonstrated that features that are shared by all sensory systems, such as space, time or meaning, are used for crossmodal binding (for recent reviews see Driver & Noesselt, 2008; Stein & Stanford, 2008). For example, in a cocktail party setting, matching the temporal dynamics of lip movements to those of the speech sound allows us to assign the correct message to each speaker. Furthermore, this ability of crossmodal binding assists in orienting spatial attention to a particular conversation partner while ignoring other sources of distracting sensory information, such as other people’s conversations.

The binding of synchronous auditory and visual (speech) signals seems to happen automatically and pre-attentively (see Driver, 1996; Bertelson et al., 2000; Vroomen et al., 2001a). The voluntary orienting of spatial attention to one particular speaker or, more generally, to an event is an example of endogenous attention allocation. By contrast, stimulus-driven orienting of spatial attention, elicited by the saliency of the stimulus, has been termed exogenous attention. Due to their relatively high salience, such pre-attentively detected, congruent cross-modal stimuli may function as exogenous attentional cues (Spence & Driver, 2000; Stekelenburg et al., 2004; Senkowski et al., 2008).

In this issue, Fairhall & Macaluso (2009) provide evidence for the operation of both mechanisms: (i) pre-attentive crossmodal matching and (ii) enhancement of crossmodal interaction by endogenous attention. The authors presented the lips of two speaking faces, one in the left and one in the right visual hemifield. The centrally presented auditory stream was matched with the lip movements of one speaker only. In such a situation people generally perceive the auditory speech signal as originating from the location of the matching lips. This phenomenon is known as the ‘ventriloquist effect’. Critically, Fairhall & Macaluso also asked participants to perform a visual detection task in one hemifield only.

This task served to systematically manipulate the focus of visual spatial attention. The auditory stream did not have any significance to the participants and thus could be ignored. This paradigm allowed the authors to compare brain responses as assessed with functional magnetic resonance imaging (fMRI) to multisensory, matching audio-visual speech sequences that occurred either at the location of visual spatial attention (‘congruent condition’) or on the opposite side (‘incongruent condition’) while keeping physical stimulation constant. The authors report that visual attention enhances activity in brain regions well accepted as multisensory integration areas (Driver & Noesselt, 2008), such as the superior temporal sulcus (STS) and the superior colliculus (SC), only when the auditory track matched the attended lip movements. Similar effects were observed in several visual brain regions, including the primary visual cortex.

These results suggest that the processing of congruent cross-modal stimuli within the locus of endogenous visual spatial attention is enhanced. Furthermore, the data of Fairhall & Macaluso are compatible with the view that binding of audio-visual speech can occur pre-attentively. When matching audio-visual speech was presented on the side opposite to the locus of visual spatial attention (incongruent condition), participants performed worse in the visual detection task than when matching stimuli were presented on the same side as the focus of participants’ endogenous visual spatial attention (congruent condition). These behavioural results, together with findings from earlier studies (Spence & Driver, 2000; Vroomen et al., 2001b), suggest that the audio-visual binding process that elicited a ventriloquist illusion towards the side opposite to the focus of visual attention took place outside the locus of endogenous attention and resulted in an exogenous attentional shift towards the task-irrelevant side, which in turn interfered with the visual detection task.

In summary, the results described by Fairhall and Macaluso suggest that endogenous (visual) spatial attention amplifies the processing of congruent cross-modal input that is likely integrated and detected pre-attentively.

Fairhall and Macaluso did not attempt to link the enhanced activity in multisensory and early visual cortical regions to behavioural data such as, for example, improved speech comprehension. The question of whether activity changes in multisensory regions due to endogenous spatial attention are accompanied by altered multisensory performance (Alsius et al., 2005) remains to be investigated in future work. Moreover, other top-down influences on multisensory integration, such as plausibility and meaning (Kitagawa & Spence, 2005; Hein et al., 2007), will need to be investigated in more detail to extend our understanding of which multisensory binding processes are subject to voluntary modulation, and under what conditions.

