Stimulation of one sensory system can enhance perception of stimuli presented to a different sensory system. For example, auditory stimuli can enhance the perceived intensity of light (Stein et al., 1996) and, conversely, light can enhance the perceived intensity of acoustic white noise (Odgaard et al., 2004). Vibrotactile pulses aid in the detection of tones and increase their perceived loudness (Gillmeister & Eimer, 2007; Schurmann et al., 2004; Ro et al., 2009). Under noisy acoustic conditions, seeing a talker can lower the auditory speech detection threshold (Grant & Seitz, 2000; Grant, 2001; Bernstein et al., 2004; Kim & Davis, 2004; Schwartz et al., 2004; Eskelund et al., 2011). The speech detection enhancement could be attributable to audiovisual integration that leads to an amodal “integrated neural signal [that] is different (e.g., bigger, smaller, having a different temporal evolution)” (Stein et al., 2010). Alternatively, it could be due to visual guidance for listening to the speech in noise (Nahum et al., 2008).
We investigated speech detection enhancement with respect to an ideal observer, which is a theoretically optimal detector (Green & Swets, 1966; Pelli & Farell, 1999). We used the ideal observer as a standard yardstick to quantify system-level changes with and without multisensory inputs. The ideal observer has full knowledge of the acoustic stimulus to be detected. Its performance is limited only by noise in the stimulus and uncertainty inherent in the task (e.g., uncertainty due to multiple stimuli for the same response).
An ideal-observer model can be used to quantify multisensory facilitation in terms of two orthogonal factors: (i) a non-acoustic stimulus could reduce the internal noise of the perceiver (e.g., a visual speech stimulus might recruit an auditory speech-specific process that may be less noisy than a generic sound-detection process); and (ii) it could facilitate the extraction of the acoustic signal from the noisy input by appropriately focusing perceptual resources on the relevant information in the signal (e.g., by providing a temporal marker or by correlating with the structure of the auditory stimulus). Multisensory processing could, at least in principle, worsen one factor while improving the other (e.g., integrating a noisy but informative non-acoustic signal with the task-relevant acoustic signal could add noise but also increase efficiency).
An additive noise ideal-observer model (Green & Swets, 1966; Pelli & Farell, 1999) explicitly represents and dissociates these two factors. Discriminability between a noisy signal and noise alone, measured in terms of d′, is expressed as:

d′ = √(ηE / (N + Neq))    (1)
where E is signal energy, N is the spectral density of the external noise in the stimulus, Neq is the additive noise in the perceptual system, expressed as an equivalent noise source at the input, and η is the sampling efficiency of the perceptual system. For an ideal observer, η = 1.0 and Neq = 0. For humans, Neq > 0 and 0 ≤ η < 1.
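As an illustration of how the model's parameters trade off, the following sketch computes the predicted d′ for an ideal observer and for a hypothetical human-like observer with nonzero intrinsic noise and reduced sampling efficiency. The numerical values are assumptions chosen for illustration, not measurements from the study.

```python
import math

def d_prime(E, N, Neq, eta):
    """Predicted discriminability under the additive noise
    ideal-observer model: d' = sqrt(eta * E / (N + Neq)).

    E   : signal energy
    N   : spectral density of the external noise
    Neq : equivalent input noise of the perceptual system
    eta : sampling efficiency (ideal observer: eta = 1, Neq = 0)
    """
    return math.sqrt(eta * E / (N + Neq))

# Ideal observer: full efficiency, no intrinsic noise.
ideal = d_prime(E=4.0, N=1.0, Neq=0.0, eta=1.0)   # -> 2.0

# Hypothetical human-like observer: the same stimulus yields a
# lower d' because eta < 1 and Neq > 0.
human = d_prime(E=4.0, N=1.0, Neq=1.0, eta=0.5)   # -> 1.0
```

The comparison makes concrete why the two factors are dissociable: raising η scales d′ without changing how added external noise N degrades it, whereas lowering Neq matters most when N is small.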
Intuitively, the internal or ‘intrinsic’ noise, expressed as the equivalent input noise, characterises the precision of the perceptual system's signal transduction and sensory measurement. For a given sensory–perceptual system, different neural pathways might operate with different amounts of intrinsic noise. The measurable intrinsic noise of a human observer depends on which of the subsystems are recruited, and how or whether the signals are combined. For example, if an auditory speech-specific subsystem had lower intrinsic noise than a general-purpose auditory system, and if a visual stimulus led to an increased utilisation of the hypothetical lower-noise speech subsystem, intrinsic noise reduction should be observed. Alternatively, if the visual signal is combined with the auditory signal to form an amodal speech signal (multisensory integration), the noise in the visual system should contribute to the observed intrinsic noise.
Sampling efficiency (sometimes called statistical efficiency or calculation efficiency) is the fraction of the noise-limited stimulus information that a perceptual system utilises to perform a task. For example, a system that uses the visual stimulus onset time to attend synchronously to the auditory input will exhibit a higher sampling efficiency for detecting the auditory stimulus than one that ignores that information. In general, the more a system uses the spatiotemporal properties specific to the stimulus, the higher should be its sampling efficiency. We expect efficiency to increase if the perceiver can use knowledge about visual and/or tactile stimuli to pick out the auditory signal from its noise background.
Under the assumption of an additive noise ideal-observer model, changes in intrinsic noise (Neq) and sampling efficiency (η) are theoretically independent. As Eqn (1) suggests, these parameters can be empirically determined for both unisensory and multisensory conditions by adding external noise (N) to the signal. To minimise effects of nonlinearity in the perceptual system associated with performance level, measurements are typically made at a constant d′. The ideal-observer model of Eqn (1) can then be rewritten to make explicit that at a constant d′, the signal energy (E) required to achieve the specific d′ is linearly proportional to the total spectral density of the external and internal noise (N + Neq), with a constant of proportionality that is inversely proportional to sampling efficiency:

E = (d′²/η)(N + Neq)    (2)
Hence, an experiment that measures the threshold signal energy (E) as a function of external noise (N) at a constant d′ provides a straightforward means to estimate intrinsic noise (Neq) and sampling efficiency (η).
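The estimation procedure described here can be sketched as a straight-line fit of threshold energy against external noise, with η recovered from the slope and Neq from the intercept. The data values below are hypothetical, generated to be exactly consistent with η = 0.5 and Neq = 0.5 at d′ = 1.5; they are not the study's measurements.

```python
import numpy as np

# Hypothetical threshold energies E measured at four external noise
# levels N (including a no-noise condition), all at a constant d' = 1.5.
N = np.array([0.0, 1.0, 2.0, 4.0])        # external noise spectral density
E = np.array([2.25, 6.75, 11.25, 20.25])  # threshold signal energy

# Eqn (2): E = (d'^2 / eta) * (N + Neq), i.e. linear in N with
# slope = d'^2 / eta and intercept = slope * Neq.
slope, intercept = np.polyfit(N, E, 1)

d_prime = 1.5
eta = d_prime**2 / slope   # sampling efficiency from the slope
Neq = intercept / slope    # equivalent input noise from the x-intercept
```

With these synthetic data the fit returns η = 0.5 and Neq = 0.5, showing how the two parameters fall out of a single threshold-versus-noise function per condition.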
[The additive noise ideal-observer model is known to be an incomplete model for human signal-detection performance. While E is often linearly related to N for a given d′, as dictated by Eqn (2), the slope and intercept of this relationship often depend on d′. That is, the squared d′ is nonlinearly related to the net signal-to-noise ratio E/(N + Neq). More elaborate models attempt to account for this nonlinearity (Lu & Dosher, 2008). Here, however, it is sufficient to isolate efficiency from intrinsic noise and to characterise qualitatively the nonlinearity between d′ and the net signal-to-noise ratio.]
The additive noise ideal-observer model can account for a broad range of tasks, from simple signal detection to object identification (Green & Swets, 1966; Legge et al., 1987; Tjan et al., 1995; Pelli & Farell, 1999). Furthermore, numerous studies have used similar observer models to study the effects of attention and perceptual learning on human performance and, in doing so, they demonstrated that efficiency and intrinsic noise are empirically dissociable (Lu & Dosher, 1998; Gold et al., 1999; Sun et al., 2010; see also Lu & Dosher, 2008, for an extensive review and elaborated theoretical analysis).
The current study
We used the ideal-observer model of Eqns (1) and (2) to investigate the data from two experiments. In Experiment 1, speech detection thresholds were measured at four external noise levels, including a no-noise condition, while holding d′ constant. Stimulus conditions were audio-only (AO), audio-tactile (AT), audiovisual with a stationary rectangle (AVR) and audiovisual speech (AVS). The tactile stimulus extended the generalisability of the findings to an additional sensory system. Having demonstrated that intrinsic auditory noise does not change across different multisensory stimuli, Experiment 2 used a more sensitive paradigm to examine further whether visible speech stimuli confer a significant additional advantage for detection. The four conditions from Experiment 1 were presented, together with the visual rectangle combined with the tactile stimulus (AVRT) and the visual speech combined with the tactile stimulus (AVST). Multisensory integration and speech-specific processing are ruled out as explanations for the auditory speech detection enhancement with audiovisual speech. The results point to the ability to use knowledge about visual and/or tactile stimuli to pick out the auditory signal from its noise background.