Keywords:

  • ideal-observer analysis;
  • multisensory enhancement;
  • speech detection

Abstract

  1. Top of page
  2. Abstract
  3. Introduction
  4. Experiment 1: efficiency and intrinsic noise
  5. Experiment 2: efficiency and linearity of speech detection
  6. General discussion
  7. Authors' contributions
  8. Conflict of interest
  9. Acknowledgements
  10. References
  11. Supporting Information

Acoustic speech is easier to detect in noise when the talker can be seen. This finding could be explained by integration of multisensory inputs or refinement of auditory processing from visual guidance. In two experiments, we studied two-interval forced-choice detection of an auditory ‘ba’ in acoustic noise, paired with various visual and tactile stimuli that were identically presented in the two observation intervals. Detection thresholds were reduced under the multisensory conditions vs. the auditory-only condition, even though the visual and/or tactile stimuli alone could not inform the correct response. Results were analysed relative to an ideal observer for which intrinsic (internal) noise and efficiency were independent contributors to detection sensitivity. Across experiments, intrinsic noise was unaffected by the multisensory stimuli, arguing against the merging (integrating) of multisensory inputs into a unitary speech signal, but sampling efficiency was increased to varying degrees, supporting refinement of knowledge about the auditory stimulus. The steepness of the psychometric functions decreased with increasing sampling efficiency, suggesting that the ‘task-irrelevant’ visual and tactile stimuli reduced uncertainty about the acoustic signal. Visible speech was not superior for enhancing auditory speech detection. Our results reject multisensory neuronal integration and speech-specific neural processing as explanations for the enhanced auditory speech detection under noisy conditions. Instead, they support a more rudimentary form of multisensory interaction: the otherwise task-irrelevant sensory systems inform the auditory system about when to listen.


Introduction

Stimulation to one sensory system can enhance perception of stimuli presented to a different sensory system. For example, auditory stimuli can enhance the perceived intensity of light (Stein et al., 1996) and, conversely, light can enhance the perceived intensity of acoustic white noise (Odgaard et al., 2004). Vibrotactile pulses aid in the detection of tones and increase their perceived loudness (Gillmeister & Eimer, 2007; Schurmann et al., 2004; Ro et al., 2009). Under noisy acoustic conditions, seeing a talker can lower the auditory speech detection threshold (Grant & Seitz, 2000; Grant, 2001; Bernstein et al., 2004; Kim & Davis, 2004; Schwartz et al., 2004; Eskelund et al., 2011). The speech detection enhancement could be attributable to audiovisual integration that leads to an amodal “integrated neural signal [that] is different (e.g., bigger, smaller, having a different temporal evolution)” (Stein et al., 2010). Alternatively, it could be due to visual guidance for listening to the speech in noise (Nahum et al., 2008).

We investigated speech detection enhancement with respect to an ideal observer, which is a theoretically optimal detector (Green & Swets, 1966; Pelli & Farell, 1999). We used the ideal observer as a standard yardstick to quantify system-level changes with and without multisensory inputs. The ideal observer has full knowledge of the acoustic stimulus to be detected. Its performance is limited only by noise in the stimulus and by uncertainty inherent in the task (e.g., uncertainty due to multiple stimuli for the same response).

Ideal-observer model

An ideal-observer model can be used to quantify multisensory facilitations using two orthogonal factors: (i) a non-acoustic stimulus could reduce the internal noise of the perceiver (e.g., a visual speech stimulus might recruit an auditory speech-specific process that may be less noisy than a generic sound-detection process); and (ii) it could facilitate the extraction of the acoustic signal from the noisy input by appropriately focusing perceptual resources to the relevant information in the signal (e.g., by providing a temporal marker or by correlating with the structure of the auditory stimulus). Multisensory processing could, at least in principle, worsen one factor while improving the other (e.g., integrating a noisy but informative non-acoustic signal with the task-relevant acoustic signal could add noise but also increase efficiency).

An additive noise ideal-observer model (Green & Swets, 1966; Pelli & Farell, 1999) explicitly represents and dissociates these two factors. Discriminability between a noisy signal and noise alone, measured in terms of d′, is expressed as:

$$d' = \sqrt{\frac{\eta E}{N + N_{eq}}} \qquad (1)$$

where E is signal energy, N is the spectral density of the external noise in the stimulus, Neq is the additive noise in the perceptual system, expressed as an equivalent noise source at the input, and η is the sampling efficiency of the perceptual system. For an ideal observer, η = 1.0 and Neq = 0. For humans, Neq > 0 and 0 ≤ η < 1.

Intuitively, the internal or ‘intrinsic’ noise, expressed as the equivalent input noise, is the perceptual system's precision for signal transduction and sensory measurement. For a given sensory–perceptual system, different neural pathways might operate with different amounts of intrinsic noise. The measurable intrinsic noise of a human observer depends on which of the subsystems are recruited, and how or whether the signals are combined. For example, if an auditory speech-specific subsystem had lower intrinsic noise than a general-purpose auditory system, and if a visual stimulus led to an increased utilisation of the hypothetical lower-noise speech subsystem, intrinsic noise reduction should be observed. Alternatively, if the visual signal is combined with the auditory signal to form an amodal speech signal (multisensory integration), the noise in the visual system should contribute to the observed intrinsic noise.

Sampling efficiency (sometimes called statistical efficiency or calculation efficiency) is the fraction of the noise-limited stimulus information that a perceptual system utilises to perform a task. For example, a system that uses the visual stimulus onset time to attend synchronously to the auditory input will exhibit a higher sampling efficiency for detecting the auditory stimulus than one that ignores that information. In general, the more a system uses the spatiotemporal properties specific to the stimulus, the higher should be its sampling efficiency. We expect efficiency to increase if the perceiver can use knowledge about visual and/or tactile stimuli to pick out the auditory signal from its noise background.

Under the assumption of an additive noise ideal-observer model, changes in intrinsic noise (Neq) and sampling efficiency (η) are theoretically independent. As Eqn (1) suggests, these parameters can be empirically determined for both unisensory and multisensory conditions by adding external noise (N) to the signal. To minimise effects of nonlinearity in the perceptual system associated with performance level, measurements are typically made at a constant d′. The ideal-observer model of Eqn (1) can then be rewritten to make explicit that at a constant d′, the signal energy (E) required to achieve the specific d′ is linearly proportional to the total spectral density of the external and internal noise (N + Neq), and the proportional constant is inversely proportional to sampling efficiency:

$$E = \frac{d'^{2}}{\eta}\,(N + N_{eq}) \qquad (2)$$

Hence, an experiment that measures the threshold signal energy (E) as a function of external noise (N) at a constant d′ provides a straightforward means to estimate intrinsic noise (Neq) and sampling efficiency (η).
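The estimation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `fit_ideal_observer`, the ordinary least-squares fit in linear units, and the synthetic values are ours and are not taken from the study, whose fitting procedure may differ (for instance, it may have been carried out in log units).

```python
def fit_ideal_observer(noise, energy, d_prime=1.16):
    """Fit E = (d'^2 / eta) * (N + Neq) by ordinary least squares.

    noise, energy: equal-length sequences in linear (not dB) units.
    Returns (eta, n_eq). Hypothetical helper, not the authors' code.
    """
    n = len(noise)
    mean_n = sum(noise) / n
    mean_e = sum(energy) / n
    sxx = sum((x - mean_n) ** 2 for x in noise)
    sxy = sum((x - mean_n) * (y - mean_e) for x, y in zip(noise, energy))
    slope = sxy / sxx                      # equals d'^2 / eta
    intercept = mean_e - slope * mean_n    # equals slope * Neq
    eta = d_prime ** 2 / slope
    n_eq = intercept / slope
    return eta, n_eq


# Synthetic check with made-up values (NOT data from the study).
true_eta, true_neq = 0.03, 2.0
noise_levels = [0.0, 10.0, 100.0, 1000.0]
thresholds = [(1.16 ** 2 / true_eta) * (n + true_neq) for n in noise_levels]
eta_hat, neq_hat = fit_ideal_observer(noise_levels, thresholds)
print(eta_hat, neq_hat)
```

The slope of the fitted line gives the efficiency (η = d′²/slope) and the ratio of intercept to slope gives the equivalent input noise, which is why the two parameters can be separated from a single threshold-versus-noise function.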

[The additive noise ideal-observer model is known to be an incomplete model for human signal-detection performance. While E is often linearly related to N for a given d′, as dictated by Eqn (2), the slope and intercept of this relationship often depend on d′. That is, the squared d′ is nonlinearly related to the net signal-to-noise ratio E/(N + Neq). More elaborated models attempt to account for this nonlinearity (Lu & Dosher, 2008). Here, however, it is sufficient to isolate efficiency from intrinsic noise and to characterise qualitatively the nonlinearity between d′ and the net signal-to-noise ratio.]

The additive noise ideal-observer model can account for a broad range of tasks, from simple signal detection to object identification (Green & Swets, 1966; Legge et al., 1987; Tjan et al., 1995; Pelli & Farell, 1999). Furthermore, numerous studies have used similar observer models to study the effects of attention and perceptual learning on human performance and, in doing so, have demonstrated that efficiency and intrinsic noise are empirically dissociable (Lu & Dosher, 1998; Gold et al., 1999; Sun et al., 2010; see also Lu & Dosher, 2008, for an extensive review and elaborated theoretical analysis).

The current study

We used the ideal-observer model of Eqns (1) and (2) to investigate the data from two experiments. In Experiment 1, speech detection thresholds were measured at four external noise levels, including a no-noise condition, while holding d′ constant. Stimulus conditions were audio-only (AO), audio-tactile (AT), audiovisual with a stationary rectangle (AVR) and audiovisual speech (AVS). The tactile stimulus extended generalisability to an additional sensory system. Having demonstrated that intrinsic auditory noise does not change across different multisensory stimuli, in Experiment 2 a more sensitive paradigm was used to examine further whether visible speech stimuli confer significant additional advantage for detection. The four conditions from Experiment 1 and the combination of the visual rectangle with the tactile stimulus (AVRT) and the visual speech with the tactile stimulus (AVST) were presented. Multisensory integration and speech-specific processing are ruled out as explanations for the auditory speech detection enhancement with audiovisual speech. The results point to the ability to use knowledge about visual and/or tactile stimuli to pick out the auditory signal from its noise background.

Experiment 1: efficiency and intrinsic noise

In Experiment 1, speech detection thresholds were measured at four external noise levels, including a no-noise condition, while holding d′ constant. We wanted to determine whether the ideal-observer model could account for the data, and if so, how efficiency and intrinsic noise might vary across multisensory conditions. Stimulus conditions were audio-only (AO), audio-tactile (AT), audiovisual with a stationary rectangle (AVR), and audiovisual speech (AVS).

Materials and methods

Participants

We tested four participants (ages 19–37 years, mean 25; one male) with American English as their first language, normal or corrected-to-normal vision, normal pure-tone thresholds for ten standard frequencies from 250 to 8000 Hz (ANSI, S3.6-2004), and normal composite scores on the Hearing in Noise Test (Nilsson et al., 1994). The participants had average or better lip-reading ability (Auer & Bernstein, 2007). They gave informed consent and were paid $12/h for their participation. Testing took place over 4–6 sessions (mean 5.5), distributed over 8–71 days (mean 33). Human subject testing was approved by the Institutional Review Board of the St Vincent's Hospital, Los Angeles, CA, USA, which oversees human subjects research at House Ear Institute, Los Angeles, where the data were collected. The experiments were undertaken with the understanding and written consent of each subject, and the study conforms to the Code of Ethics of the World Medical Association (Declaration of Helsinki), printed in the British Medical Journal (18 July 1964).

Stimuli
Auditory

The speech stimulus was a video-recorded ‘ba’ spoken by a female (Bernstein et al., 2004). The 534-ms acoustic syllable was adaptively adjusted in sound level during testing (see below). White noise was presented at 0, 40, 50 and 60 dB SPL. A large (90-s) file of computer-generated acoustic white noise was sampled randomly for each trial, extending both across intervals and between them, at a constant level throughout a run. The acoustic stimulus and the white noise were mixed using a calibrated audio system, including a custom attenuator, and were presented through calibrated ER-3A insert earphones (Etymotic Research Inc.; external noise exclusion 30 dB SPL).

Visual

The visual stimuli included the corresponding video of the talker as she pronounced the ‘ba’ syllable (AVS condition, and AVST in Experiment 2 only) and a static rectangular image (AVR condition, and AVRT in Experiment 2 only; Fig. 1). The visual speech stimulus movement onset coincided with the acoustic syllable onset in the signal-present interval (Fig. 2). The visible syllable was longer than the acoustic signal, as is often true with isolated audiovisual speech syllables. To equate for the contrast energy in the visual stimuli, the non-speech visual stimulus was a static rectangle filled with pixels randomly selected from the rectangular region of the visual speech stimulus including the face (Fig. 1). The viewing distance was 1 m. The face and the rectangle stimuli subtended 6.0° of visual angle horizontally and 8.2° vertically. A fixation cross during AO trials was presented continuously against a grey background and subtended 0.72° of visual angle.

Figure 1. Visual stimuli used in different conditions. (A) Fixation cross. This stimulus appeared during the audio-only (AO) speech stimulus condition. (B) The static rectangle comprised the pixels of the first frame of the visible speech stimulus, in a random static pattern. This stimulus appeared during the audiovisual with a stationary rectangle (AVR) condition and during the conditions with both the stationary rectangle and the tactile stimulus (AVRT). (C) Visual speech stimuli (only one frame is shown). The natural moving visual speech stimulus was shown during the audiovisual speech condition (AVS). It was also shown during the audiovisual speech and tactile (AVST) condition.

Figure 2. Stimulus timing diagram. (E) Each trial comprised two temporal intervals, with the target acoustic ‘ba’ presented in only one of the intervals. All other stimuli for a particular condition were repeated in both intervals. The horizontal extent of the stimuli in the figure corresponds to their temporal interval. The tactile stimulus, the visual rectangle and the acoustic syllable had the same duration (534 ms). (D) In the AVS condition, the talker's face appeared at the beginning of the interval but the mouth did not move until the acoustic signal onset. Dots on either end of the timeline (E) indicate the frames of temporal jitter; the total jitter around a particular interval was always 167 ms (five frames). (F) indicates the noise duration. Up to six stimulus conditions were tested in this study; (E) audio-only (AO), (A and E) audio-tactile (AT), (C and E) audiovisual with a stationary rectangle (AVR), (A, C and E) AVR with tactile (AVRT), (D and E) audio with visual speech (AVS) and (A, D and E) AVS with tactile (AVST). (B) In the (E) AO and (A and E) AT conditions, a fixation cross was displayed for the entire interval.

Tactile

A Bruel & Kjaer 4810 minishaker mounted on a wooden stand that incorporated an armrest delivered a vibration stimulus to the right index fingertip. The stimulus was a 200-Hz haversine pulse train (i.e., pulse duration of 2.5 ms) of total duration 534 ms, with the same onset and offset as the acoustic ‘ba’, presented via a 0.25-inch-diameter circular probe. A custom stimulus delivery system incorporated compensation for finger loading. The minishaker was encased in a foam-lined box to attenuate acoustical emissions, and participants wore earmuffs (Bilsom Comfort model #2315, NRR 25 dB) throughout testing to guard against detecting acoustic radiation from the vibrating device, although no evidence suggested that vibration was detectable in the presence of the acoustic masking noise. The tactile intensity level was set to the average level at which the stimulus was judged to be equal in intensity to the visual rectangle (7.2 μm peak displacement), following an informal cross-modal intensity-matching experiment.

Timing

Synchronised onsets between auditory and visual stimuli, and between auditory and tactile stimuli, were permanently established using a pre-recorded stimulus DVD. Figure 2 illustrates the timing within a trial, during which the auditory ‘ba’ stimulus was randomly presented in only one of two observation intervals. The visual speech stimulus began with freeze frames, but motion onset coincided with acoustic onset. The onsets of the visual rectangle and the tactile stimulus also coincided with the acoustic onset. A total jitter of 167 ms was randomly inserted at the onset and offset of the two observation intervals such that all trials were the same duration. In the AO condition, a fixation cross was presented for the entire 2135 ms of each observation interval, the total duration of the video speech, including freeze frames. Uniform grey frames of 167 ms duration separated observation intervals in addition to the jitter.

Procedure

A two-interval forced-choice (2IFC) paradigm with an adaptive three-down one-up staircase algorithm (Levitt, 1971) was used to obtain 79.4% (d′ = 1.16 for a 2IFC design) detection thresholds. Within each testing block, stimulus condition and noise level were fixed, and the ‘ba’ stimulus amplitude was varied. The adaptive step sizes were as follows: at the beginning of the block, 3-dB steps were used until the first reversal following an error, then 2-dB steps until the third reversal, 1-dB steps until the fifth reversal, 0.5-dB steps until the eighth reversal and 0.1-dB steps for the final four reversals. Thresholds were the arithmetic mean in dB units of all 12 reversal points. In the noise conditions, the initial SNR was −6 dB. In a no-added-noise (quiet) condition, the initial speech level was 10 dB SPL. Two subsequent blocks in each type of condition were initiated with SNRs of 6 dB above the threshold from the previous corresponding stimulus block.
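The convergence point and step-size schedule of the staircase can be illustrated with a short Python sketch. The function names are hypothetical, and the mapping from reversal count to step size reflects our reading of the schedule described above rather than the authors' actual software.

```python
def staircase_target(n_down: int) -> float:
    """Proportion correct at which an n-down one-up staircase converges.

    Follows Levitt (1971): the track is stable where p ** n_down = 0.5.
    """
    return 0.5 ** (1.0 / n_down)


def step_size_db(reversals_so_far: int) -> float:
    """Step size in dB given the number of reversals already recorded.

    Assumed reading of the schedule: 3 dB before the first reversal,
    2 dB until the third, 1 dB until the fifth, 0.5 dB until the eighth,
    and 0.1 dB for the final four reversals.
    """
    if reversals_so_far < 1:
        return 3.0
    if reversals_so_far < 3:
        return 2.0
    if reversals_so_far < 5:
        return 1.0
    if reversals_so_far < 8:
        return 0.5
    return 0.1


TOTAL_REVERSALS = 12
schedule = [step_size_db(k) for k in range(TOTAL_REVERSALS)]
target = staircase_target(3)
print(schedule)
print(round(100 * target, 1))  # percent correct targeted by three-down one-up
```

With three consecutive correct responses required before the level is lowered, the track settles where the probability of three successive correct responses is 0.5, which is the 79.4% point quoted above.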

Participants received 15 practice trials per condition and then executed a variable number of testing blocks per session. The conditions were pseudo-randomly ordered and each condition was presented at every noise level once before any were repeated, resulting in 48 blocks (three repetitions × four conditions × four noise levels). Because the paradigm used adaptive testing, the number of test trials per participant varied somewhat, averaging 65 trials per block.

Participants were told to attempt to detect the auditory stimuli and keep their gaze on the video monitor. They were not explicitly told to attend to the tactile stimuli. It was obvious to the participants that the visual and tactile stimuli were presented in both the signal-present and signal-absent intervals. Participants were instructed to respond as quickly and as accurately as possible when they detected the ‘ba’ auditory stimulus. Responses were made using a two-button box with each button assigned to one of the stimulus intervals. Participants were free to respond during the first interval if they detected the stimulus there. Response times were recorded but not analysed. LEDs affixed to the sides of the monitor and on the button box indicated the correct response after each response. Testing took place in a double-walled sound booth.

Results and discussion

Each participant contributed 16 thresholds (four noise levels × four stimulus conditions), each averaged over three blocks (approximately 200 trials per threshold). In a repeated-measures ANOVA, stimulus type (F(3,9) = 36.24, P < 0.0001) and noise level (F(3,9) = 2015, P < 10^−12) had strong effects on signal thresholds, without any significant interaction. Post hoc pairwise contrasts, corrected for multiple comparisons, revealed the order of signal threshold magnitudes to be AO (27.9 dB SPL) > (AT ≈ AVR) > AVS (25.6 dB SPL). That is, all multisensory conditions improved speech signal detection, with visual speech providing the largest gain.

The ideal-observer model (Eqn (1)) provided a good fit to the data of each participant, accounting for 99% of the variance (Fig. 3A). Intrinsic noise (Neq), efficiency (η), and the standard errors of the estimates were obtained by fitting Eqn (1) to the data (Fig. 3B and C). Intrinsic noise did not vary across stimulus condition (F(3,9) = 1.003, P = 0.435). The average level of intrinsic noise was equivalent to an input noise of 15.8 dB SPL, which is very low compared to the external noise.

Figure 3. Ideal-observer analysis of speech-detection in noise (Experiment 1). (A) Energy (E) of the speech signal is plotted against the power spectral density (N) of the external noise in log units for each participant. The ideal-observer model of Eqn (1) provides an excellent fit to the individual data (R² > 0.99). (B) Equivalent input noise (intrinsic noise) and (C) sampling efficiency were estimated from the fits. Error bars represent ± 1 SE of the estimates. Multisensory conditions had no effect on intrinsic noise but significantly improved efficiency. Efficiencies were AO < (AT, AVR) < AVS, with mean intrinsic noise level estimated at 15.8 dB SPL.

In contrast, efficiency was reliably affected by the stimulus condition (F(3,9) = 23.42, P < 0.001). Post hoc pairwise comparisons showed that efficiency was AO (2.2%) < (AT ≈ AVR) < AVS (3.6%). Efficiency averaged across conditions was 2.8%. Thus, no evidence was obtained for multisensory integration (i.e., either reduced or increased internal noise), but there was reliable evidence for increased auditory efficiency from visual and tactile stimuli.

Experiment 2: efficiency and linearity of speech detection

Experiment 1 showed that the intrinsic noise was very low relative to external noise (equivalent to an external noise at 15.8 dB SPL). Equation (1) implies that d′ ≈ √(ηE/N) whenever external noise is sufficiently high relative to intrinsic noise (N ≫ Neq). That is, d′ measured at high external noise is unaffected by intrinsic noise and can therefore be used as a surrogate for efficiency. This fact was used to obtain a more precise assessment of multisensory facilitation, particularly the relative effect of visual speech. It was also used to characterise any nonlinearity between d′ and SNR, which provides additional insight into the basis for multisensory enhancement.
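The size of the error introduced by the high-noise approximation can be checked numerically. The sketch below is illustrative only. It assumes that the dB SPL levels of the external and equivalent input noise can be compared directly (i.e., that both refer to the same bandwidth), and it uses the Experiment 2 noise levels of 68, 69 and 70 dB SPL together with the 15.8 dB SPL estimate from Experiment 1. The function names are ours.

```python
from math import sqrt


def db_to_linear(level_db: float) -> float:
    """Convert a level in dB to a linear power ratio."""
    return 10.0 ** (level_db / 10.0)


def approximation_ratio(noise_db: float, n_eq_db: float) -> float:
    """Ratio of d' from Eqn (1) to the approximation sqrt(eta * E / N).

    eta and E cancel, leaving sqrt(N / (N + Neq)).
    Assumes both noise levels are expressed over the same bandwidth.
    """
    n = db_to_linear(noise_db)
    n_eq = db_to_linear(n_eq_db)
    return sqrt(n / (n + n_eq))


ratios = {db: approximation_ratio(db, 15.8) for db in (68, 69, 70)}
for db, ratio in ratios.items():
    print(db, ratio)
```

Because the external noise exceeds the equivalent input noise by more than 50 dB, the two expressions for d′ differ by only a few parts per million, which is negligible next to measurement error.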

Experiment 2 was carried out in two phases. In the preliminary phase, SNR thresholds were obtained adaptively at d′ = 1.16 with the signal fixed at 55 dB SPL and external noise varied. The relatively high signal intensity was used to ensure that performance would not be limited by the weak intrinsic noise. In the main phase, a common range of SNRs, chosen based on the results from the preliminary phase and applicable to all participants, was used to measure values of d′ by the method of constant stimuli (i.e., with both noise and signal fixed within blocks). Two additional conditions were tested in Experiment 2, AVRT and AVST, for which tactile stimuli were presented synchronously with the AVR and AVS stimuli, respectively.

Materials and methods

Participants

Applying the same selection criteria as in Experiment 1, six participants (age 19–44 years, mean 26; two male) took part in the main part of the experiment. The testing was completed in three to seven sessions for each subject, collected over 6–93 days. (See Supporting Information, Fig. S1 for the preliminary phase of threshold testing.)

Procedures

The signal level was fixed at 55 dB SPL, and external noise levels of 68, 69 and 70 dB SPL were tested, resulting in three SNR levels. Each test run comprised 36 two-interval forced-choice trials per d′ estimate. Participants were tested in all six conditions at all three SNR levels repeated three times, with each pseudorandomly ordered set of 18 tests completed before the next set.

Results and discussion

Accuracy data in the form of proportion correct for each testing block were converted to d′ [d′ = √2 Φ⁻¹(p), where p is the proportion correct and Φ⁻¹ is the inverse cumulative normal distribution] (Fig. 4). The d′ values were submitted to a repeated-measures ANOVA with three factors: stimulus type (six levels), SNR (three levels), and block (three levels). There were significant main effects of stimulus type (F(5,25) = 8.904, P < 0.001) and SNR (F(2,10) = 130.737, P < 0.001) only. Post hoc pairwise contrasts revealed that in terms of d′, AO (1.2) < AT < (AVR, AVRT, AVS, AVST; mean = 1.8). The AVS, AVST, AVRT and AT stimuli were not reliably different from each other. Visible speech was not a reliably better stimulus than non-speech multisensory stimuli for enhancing auditory speech detection in noise.
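The conversion from proportion correct to d′ can be written as a short Python sketch using only the standard library. The function name and the example counts are hypothetical. How blocks with perfect scores were handled is not stated in the text, so that case is simply left undefined here (`inv_cdf` raises an error when p is 0 or 1).

```python
from math import sqrt
from statistics import NormalDist


def dprime_2ifc(n_correct: int, n_trials: int) -> float:
    """d' for a two-interval forced-choice block: sqrt(2) * Phi^-1(p).

    Undefined for p = 0 or p = 1 (NormalDist.inv_cdf raises StatisticsError).
    """
    p = n_correct / n_trials
    return sqrt(2.0) * NormalDist().inv_cdf(p)


# Illustrative counts for a 36-trial block (NOT data from the study).
chance = dprime_2ifc(18, 36)    # 50% correct corresponds to d' = 0
example = dprime_2ifc(29, 36)   # about 81% correct
print(chance, example)
```

With 36 trials per block, each additional correct response shifts the estimated d′ by a noticeable amount, which is one reason for averaging over repeated blocks.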

Figure 4. Mean d′ as a function of SNR across the six stimulus conditions tested in the main phase of Experiment 2 using the method of constant stimuli. The speech signal was fixed at 55 dB SPL, and the noise levels were at 68, 69 and 70 dB SPL. Bars indicate ± 1 SEM. The lines are slightly staggered horizontally for ease of viewing. The dashed line shows the predicted d′ values from the ideal-observer model (Eqn (1)) fitted with data from Experiment 1. There is a good agreement between the two experiments, except that the log–log slope of the empirical psychometric function (d′ vs. E/N) is steeper than that predicted by the ideal-observer model, suggesting that uncertainty about the speech target was a limiting factor for the participants.

Results across experiments

In Experiment 1, threshold signal energy (E) and noise power spectral density (N) followed a linear relationship, as dictated by the ideal-observer model (Eqn (1)) at the tested d′ of 1.16. We estimated the parameters (Neq and η) of the ideal-observer model with data from Experiment 1 and used the model (Eqn (1)) to predict the d′ values from Experiment 2 (Fig. 4, dashed line). This between-subjects cross-experiments prediction was particularly good in the vicinity of the tested d′ of 1.16, even though the corresponding external noise level for this d′ in Experiment 2 was 10 dB higher than the highest noise level used in Experiment 1.

However, the ideal-observer model estimated using data from Experiment 1 systematically underestimated the values of d′ from Experiment 2 at higher SNR levels (at external noise levels of 69 and 68 dB SPL). The slopes between log(d′) and log(E/N) were significantly greater than the predicted value of 1/2 from Eqn (1) (Fig. 4). That is, the human psychometric functions, measured in Experiment 2, have steeper log–log slopes than that of the ideal-observer model. A steeper log–log slope is consistent with an observer who does not know the signal exactly and has to consider one of many possibilities (Tanner, 1961; Pelli, 1985; Graham, 1989; Tjan et al., 2006). This can be understood intuitively. At high SNR (high E/N), simultaneously considering one of many signal possibilities, even when there was just one specific signal, has no impact on performance. This is because only one of the possibilities is a good match to the high-SNR signal. In this case, the real observer is essentially like the ideal observer who knows the signal exactly. At low SNRs, however, contemplating multiple possibilities other than the exact specification of the signal increases the false alarm rate and thus reduces d′ relative to a signal-known-exactly ideal observer. Hence, compared to that of an ideal observer, a real observer who is uncertain about the precise specification of the signal will have a steeper slope, with a disproportionally poorer performance at low SNR.
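The log–log slope comparison can be made concrete with a small Python sketch. The scaling constant and the 'uncertain observer' exponent below are arbitrary illustrative values, not estimates from the study, and the function name is ours. The slope is unaffected by any constant factor relating SNR in dB to E/N, so only the spacing of the three SNR levels matters.

```python
from math import log10, sqrt


def loglog_slope(snr, d_prime):
    """Least-squares slope of log10(d') against log10(E/N)."""
    xs = [log10(s) for s in snr]
    ys = [log10(d) for d in d_prime]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Three SNR levels spaced 1 dB apart, as in Experiment 2 (-15, -14, -13 dB).
snr = [10.0 ** (db / 10.0) for db in (-15, -14, -13)]

# Observer following Eqn (1) at high external noise: d' = sqrt(c * E/N).
# c = 30 is an arbitrary illustrative constant.
ideal_like = [sqrt(30.0 * s) for s in snr]

# Hypothetical observer with signal uncertainty: steeper power law.
uncertain = [(30.0 * s) ** 0.9 for s in snr]

slope_ideal = loglog_slope(snr, ideal_like)
slope_uncertain = loglog_slope(snr, uncertain)
print(slope_ideal, slope_uncertain)
```

An observer obeying Eqn (1) yields a slope of 1/2 regardless of its efficiency, so a measured slope above 1/2 indicates a departure from the signal-known-exactly model rather than a change in η alone.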

Inasmuch as d′ is monotonically related to efficiency at high external noise level (Eqn (1)), d′ can be used as a surrogate for efficiency to test whether uncertainty reduction is the mechanism responsible for the multisensory facilitation. Figure 5 shows the log–log slope for individual participants against their averaged d′ across all SNR levels. There is a strong negative correlation (r = −0.67, P < 0.001) between the log–log slope and mean d′. If we interpret the log–log slope as uncertainty and mean d′ as efficiency, then the negative correlation means that the higher a participant's uncertainty was about the target, the lower was the participant's efficiency. This can be seen also in terms of the specific conditions in Fig. 5, with low efficiency and high uncertainty for AO (and AT for one participant), and high efficiency and lowest uncertainty for individual AVS, AVR, AVST and AVRT points. This supports the view that visual stimuli, by themselves or together with tactile stimuli, reduce the uncertainty about when or how to listen during auditory speech detection in noise.

Figure 5. Log–log slopes of the empirical psychometric functions obtained in Experiment 2 as a function of mean d′. In experimental conditions, when the external noise level is very high compared to the intrinsic noise (estimated at 15.8 dB SPL; Fig. 3B), d′ is completely determined by efficiency for a given E/N ratio (Eqn (1)) and can therefore be used as a surrogate for efficiency. The log–log slope for each participant in each multisensory condition is plotted against their averaged d′ values across the three external noise levels. A strong negative correlation between the log–log slope and d′ (regression line, r = −0.67, P < 0.001) implies that higher efficiency is associated with a reduction in uncertainty about the speech signal.

General discussion

In Experiment 1, the task-irrelevant visual and/or tactile stimuli increased sampling efficiency for auditory speech detection without having any effect on intrinsic noise (Fig. 3). The relationship of threshold acoustic signal energy estimates across conditions was AO > (AT ≈ AVR) > AVS. In Experiment 2, using the method of constant stimuli with three external noise levels at least five orders of magnitude (50 dB) higher than the intrinsic noise, the order of d′ values was AO < AT < (AVR, AVRT, AVS, AVST), providing further evidence for a multisensory sampling efficiency advantage but with no evidence for a visual speech advantage. Furthermore, across participants and multisensory conditions, the lower log–log slope of the psychometric function (d′ vs. E/N) was associated with higher efficiency (higher d′ obtained in high external noise), which is attributable to reduced uncertainty.

Our results show that perception of visual and/or vibrotactile stimuli increases the statistical efficiency of auditory speech detection in noise without altering noise intrinsic to the perceiver. This is a functional statement about the perceptual system that does not uniquely translate into a specific neural implementation. Nevertheless, this functional finding constrains the probable biological implementation, which we discuss below. Several mechanisms could account for our findings, but a few that have been previously proposed seem unlikely in light of the results reported here.

Multisensory integration and intrinsic noise

Multisensory neuronal integration is frequently offered to explain reduction in perceptual detection thresholds relative to unisensory thresholds (e.g., Stein et al., 1996; Odgaard et al., 2004; Gillmeister & Eimer, 2007; Ro et al., 2009). However, the multisensory enhancements reported for the current study are not attributable to multisensory integration, a term which we take to mean the combination of inputs from multiple sensory representations into a resultant amodal representation. [The term ‘integration’ is used with varying degrees of precision. For some investigators, any response effect of multisensory stimulus combinations that cannot be accounted for by various statistical approaches to data analysis (such as summing unisensory responses) is referred to as an effect of integration. We would prefer to reserve the term integration for effects that can be validly attributed to convergence of multisensory representations with resultant amodal representations. All the other effects can be grouped as multisensory interactions, following the example of Kayser & Logothetis (2007).]

Our definition of multisensory integration is close to the notion of ‘feature fusion’, for which the separate feature representations from different modalities are merged into a single data representation before a perceptual decision is made. An alternative to feature fusion is ‘decision fusion’, where modality-specific representations are used to make perceptual decisions before these sometimes ambiguous decisions are merged to form a percept. Based solely on computational considerations involved in building an automatic audio-visual speech recognition system (including factors such as difficulties in speech segmentations, data rates and distinguishable tokens), Meyer et al. (2004) argued in favor of decision fusion and against feature fusion, a view that is supported by our empirical findings. Of course, an automatic recognition system is not in general expected to be an instantiation of a neural system.

We cannot attribute the observed multisensory enhancement to multisensory integration (akin to feature fusion) because an integrated representation (Stein et al., 2010) from multisensory input would in theory comprise each sensory system's representation of the stimulus as well as that system's internal noise, resulting in a net change (increase or decrease) in the intrinsic noise of the perceiver, which we did not observe. Consistent with our observed lack of change in intrinsic noise, Chandrasekaran et al. (2013) found no change in the magnitude and variability of the firing rates of monkeys' auditory neurons when behavioral performance for vocalisation detection was enhanced (speeded up) by the presence of a visible vocalising face.

While observations of improvements in perceptual performance might imply noise reduction from combining inputs across modalities, such reduction is not guaranteed. For example, when a task-relevant acoustic signal is averaged with a non-acoustic signal that occurs identically with and without the target, all that is contributed is a noisy channel. This increases intrinsic noise without affecting efficiency. (Efficiency is not affected because a sufficiently high external noise in the acoustic stimulus would render the noise from this uninformative channel inconsequential.) In our experiments, no change in intrinsic noise was observed, providing no support for multisensory integration.
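The averaging example can be written out as a short derivation. It is a sketch under stated assumptions: Eqn (1) is taken to have the equivalent-input-noise form d′² = ηE/(N + Neq), the noise sources are independent, and the symbols x, y, z, N_a and N_v (acoustic channel, uninformative channel, their average, and the respective intrinsic noise densities) are introduced here for illustration only.

```latex
\begin{align*}
x &= s + n_{\mathrm{ext}} + n_{a}, \qquad y = n_{v}
  && \text{(acoustic channel; uninformative channel)} \\
z &= \tfrac{1}{2}(x + y)
   = \tfrac{1}{2}\,s + \tfrac{1}{2}\,(n_{\mathrm{ext}} + n_{a} + n_{v})
  && \text{(averaged representation)} \\
d'^{2}_{z} &= \frac{\eta\,(E/4)}{(N + N_{a} + N_{v})/4}
   = \frac{\eta E}{N + (N_{a} + N_{v})}
  && \text{(signal energy and noise densities both scale by } 1/4\text{)} \\
d'^{2}_{z} &\approx \frac{\eta E}{N}
  && \text{when } N \gg N_{a} + N_{v}
\end{align*}
```

In this sketch the equivalent intrinsic noise grows from N_a to N_a + N_v while η is unchanged, and the added term becomes negligible once the external noise N dominates.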

Multisensory stochastic resonance

While the phenomenon of multisensory stochastic resonance (Harper, 1979) seems to resemble some of our findings, the two are unlikely to share the same neural mechanism. Stochastic resonance is a low-signal, low-noise phenomenon, in which a non-informative low-amplitude noise added to a weak signal causes part of the signal to exceed an internal threshold of a nonlinear system (Lugo et al., 2012). The fact that we did not see any multisensory benefit on intrinsic noise, which is a limiting factor on performance in the regime of low external signal and noise, suggests that stochastic resonance is not the relevant mechanism. Indeed, as shown in Fig. 4, our SNRs were 15.8–17.8 dB, whereas (within-modal) stochastic resonance in normal-hearing adults detecting an acoustic pure tone has been demonstrated at −15 or −20 dB SNR (Zeng et al., 2000). The SNRs in our experiments therefore seem too high to benefit from stochastic resonance. Another signature of stochastic resonance is a U-shaped function between detection threshold and external noise level (Lugo et al., 2008), but Figs 3A and 4 show no evidence of such a U shape for the uni- and multisensory conditions. Interestingly, Harper (1979) introduced the phenomenon as an ‘arousal mediator of the sensory interaction’. This general qualification, which suggests an up-regulation of processing during a more precisely defined stimulus interval (when the task-irrelevant noise was on), is in line with our findings, as we discuss in more detail below.

Sampling efficiency

Multisensory enhancements led to increases in sampling efficiency. This means that the perceiver had improved knowledge about the task-relevant acoustic signal whenever a visual and/or tactile stimulus was presented in synchrony, even though the latter stimuli could not by themselves signal which interval contained the acoustic target. Consistent with this interpretation, we observed that the lowering of the slope of the psychometric function was associated with increased efficiency, which is a signature for reduced uncertainty about the signal. This reduction could be due to a more precise marking of the onset moment of the acoustic speech stimuli by the synchronous onset of visual and/or tactile inputs. The visual and/or tactile signal(s) could be used to deploy attention at an advantageous moment in time (Megevand et al., 2013). That is, their onset could cue the moment to attend to the possible occurrence of the auditory stimulus (Power et al., 2011), thus excluding any influence of internal (intrinsic) or external (stimulus) noise outside of the expected temporal interval of the stimulus, increasing sampling efficiency as a result.
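The link between onset uncertainty, efficiency and psychometric slope can be illustrated with a generic signal-detection ‘max-rule’ observer. This is not the analysis used in the experiments; it is a minimal sketch in which the observer monitors M equally likely onset times (a hypothetical parameter), only one of which contains the signal, and chooses the interval with the larger maximum response. The signal strength `a` is in units of the noise standard deviation, and E/N is assumed to be proportional to a².

```python
import math
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution

def percent_correct(a, M, lo=-8.0, hi=12.0, dx=0.01):
    """2IFC proportion correct for a max-of-M observer (trapezoidal integration).

    In the signal interval one of the M monitored channels has mean a;
    every other channel output is unit-variance noise with mean 0.
    """
    n = int(round((hi - lo) / dx))
    total = 0.0
    for i in range(n + 1):
        x = lo + i * dx
        Phi = _STD.cdf(x)
        # Density of the maximum of the M noise-only channels.
        f_noise_max = M * _STD.pdf(x) * Phi ** (M - 1)
        # Probability that the signal-interval maximum exceeds x.
        p_signal_exceeds = 1.0 - _STD.cdf(x - a) * Phi ** (M - 1)
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * f_noise_max * p_signal_exceeds * dx
    return total

def d_prime(a, M):
    """Convert 2IFC proportion correct to d' = sqrt(2) * z(Pc)."""
    return math.sqrt(2.0) * _STD.inv_cdf(percent_correct(a, M))

def loglog_slope(M, a1=1.5, a2=2.5):
    """Slope of log d' against log(E/N), with E/N taken as proportional to a**2."""
    rise = math.log(d_prime(a2, M)) - math.log(d_prime(a1, M))
    run = math.log(a2 ** 2) - math.log(a1 ** 2)
    return rise / run

slope_certain = loglog_slope(M=1)     # onset time known exactly
slope_uncertain = loglog_slope(M=20)  # 20 candidate onset times
d_certain = d_prime(2.0, M=1)
d_uncertain = d_prime(2.0, M=20)

print(f"log-log slope: M=1 {slope_certain:.3f}, M=20 {slope_uncertain:.3f}")
print(f"d' at a=2:     M=1 {d_certain:.3f}, M=20 {d_uncertain:.3f}")
```

In this toy model, removing onset uncertainty both raises d′ at a fixed signal level (higher efficiency) and flattens the log–log psychometric function, the same direction of association reported across participants and conditions.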

Furthermore, if the onset signal from the visual and/or tactile modality is produced under a high threshold (requiring a high signal-to-noise ratio), then the intrinsic noise in the non-acoustic channel will have little effect on the onset signal, leading to no detectable changes in the measured intrinsic noise, consistent with our finding. The actual magnitude of the effect on sampling efficiency could depend on the stimulus-onset asynchrony of the non-auditory signal (Schroeder & Foxe, 2002; Ghazanfar et al., 2005; Kayser et al., 2008; Raij et al., 2010; Megevand et al., 2013), which we did not explicitly manipulate at the stimulus level but which may vary across sensory modalities because of differences in sensory processing. Such variation may explain the observed differences in sampling efficiency across multisensory conditions.

Beyond an explanation tied to attention, the improvement in efficiency with the non-informative multisensory stimulus could be attributable to amplification of ongoing neuronal activity in primary auditory areas by a high-threshold non-auditory signal. For example, an auditory stimulus that is paired with a simultaneous somatosensory stimulus will elicit stronger A1 neuronal responses than an auditory stimulus presented in isolation, as shown in awake behaving macaques (Lakatos et al., 2007). Schroeder and colleagues (Schroeder et al., 2008) theorise that “visual cues amplify the cortical processing of accompanying vocalisations by shifting the phase of ongoing neuronal oscillations so that the auditory inputs tend to arrive during a ‘high excitability state’.” As those authors point out, for a weak auditory signal the effects of phase-shifting could determine whether or not inputs generate reliable postsynaptic effects. Compatible with the current results, their explanation for multisensory enhancement need not imply integration of stimulus representations (e.g., syllables are not integrated with the static visual rectangles) or intrinsic noise reduction, merely auditory response enhancement during the interval when the auditory signal is expected. Thus, improved efficiency could arise at least in part from enhancement in auditory processing induced by the synchronous onset of the visual and/or tactile stimulus. Reducing timing uncertainty in the auditory channel by a visual input can speed up auditory processing, as observed in Chandrasekaran et al. (2013), without affecting the underlying representation.

That the visual and/or tactile stimulus can be used as a cue to deploy attention is compatible with a multisensory interpretation (Bernstein et al., 2013) of the reverse hierarchy theory of speech processing (Nahum et al., 2008). Reverse hierarchy theory (Hochstein & Ahissar, 2002) posits that, under specific conditions, the perceiver can use higher-level knowledge to guide access to lower-level representations that are available in the input. The visual and/or tactile stimuli did not combine with the auditory stimulus but guided discernment of the signal embedded in its acoustic noise background. In fact, it is conceivable that such guidance targets different stages of neural processing, depending on the task and the kind of information available in each sensory stream (Megevand et al., 2013). Here, the onset cue could guide auditory attention to ‘glimpses’ of the vowel that become available as the non-stationary acoustic noise signal varies in amplitude. That is, the onset also reveals approximately how long to attend.

Overall, based on the results in the current study, multisensory enhancement to auditory speech detection cannot be attributed to reduction in intrinsic noise. It can be attributed to more efficient use of auditory input. Unfortunately, behavioral experiments alone cannot adjudicate between several probable neural mechanisms. However, the ideal-observer model can be applied in conjunction with neural measures to further isolate the neural source for enhanced speech detection efficiency.

No speech-specific mechanism

We obtained scant support for the possibility that visual speech stimuli convey unique benefit to the detection task (Grant & Seitz, 2000; Grant, 2001; Bernstein et al., 2004; Kim & Davis, 2004; Schwartz et al., 2004; Eskelund et al., 2011). If there were something special about affording visual and auditory speech stimuli for this detection task, the AVS condition should have produced reliably better thresholds. When acoustic speech signal amplitude varied adaptively and noise was fixed, there was evidence for an AVS advantage (Experiment 1). However, the AVS advantage disappeared when compared with other multisensory stimuli (AVR, AVRT, and AVST) in Experiment 2 (see also additional results in Supporting Information). We also showed similarly enhanced detection with stationary stimuli (the AVRT and AVR conditions) as with visual speech (Bernstein et al., 2004).

A prominent previous explanation for the AVS detection advantage is that fine-grained correlations between the acoustic and visual speech stimuli are used by the perceiver (Grant & Seitz, 2000; Grant, 2001; Schwartz et al., 2004; Eskelund et al., 2011). Such a mechanism would introduce the noise of the visual system into the percept in the AVS condition, but we could not detect any change in the intrinsic noise. The lack of any difference between the visual-speech condition and static rectangle conditions also argues against that explanation.

As suggested above, the correlation between onsets of visual and auditory stimulus events is probably critical to the multisensory advantage (Schwartz et al., 2004), even without recourse to multisensory integrative mechanisms of the type that might result in amodal representations. In Bernstein et al. (2004), when the visual speech stimulus included a preparatory mouth gesture preceding the acoustic stimulus onset, there was evidence for a significant effect of visual speech that was not found when the preparatory gesture was removed from the stimulus.

Lifelong experience of audiovisual speech stimuli potentially makes available predictions about the natural relationships between auditory and visual speech features (Bernstein et al., 2008; Jiang & Bernstein, 2011). Natural running speech with its asynchronously available visual and auditory onsets of syllables or phonemes offers multiple opportunities to predict the features of auditory speech stimuli based on visual speech features. In the current experiment, those opportunities were not available, and there was no evidence that visual speech conferred a special advantage for auditory speech detection. An obvious extension of the current study would be to reintroduce the natural preparatory mouth gesture in the current visual speech stimulus to determine whether its effect is to enhance efficiency, which we predict, rather than reduce intrinsic noise.

Conclusion

Using ideal-observer analysis, we identified the functional mechanism for the multisensory enhancement observed in the detection of a speech token in noise. The enhancement is due to an increase in the statistical efficiency of the perceiver caused by a reduction in the uncertainty about the speech signal, most likely related to its onset time but possibly also due to knowing its temporal extent. Our analysis rejected mechanisms that require combining signals from multiple sensory streams in the multisensory conditions, because we did not observe any change in the perceivers' internal noise. We advocate a flexible scheme of multisensory facilitations for which the point(s) of multisensory interactions can be task-dependent.

Authors' contributions

L.E.B. and B.T. developed the study concept. All authors contributed to the study design. Testing and data collection were performed by E.C. E.C. performed data analysis under the supervision of B.T., and B.T. carried out independent data analysis. B.T., L.E.B. and E.C. drafted the paper. B.T. and L.E.B. provided critical revisions. All authors approved the final version of the paper for submission.

Acknowledgements

This study was supported by NIH (R01s DC008583 and DC008308 to L.E.B. and R01 EY017707 to B.S.T.). The authors thank Brian Chaney and John Jordan for their hardware and software contributions to this research.

Abbreviations

AO, audio-only
AT, audio-tactile
AVR, audiovisual with a stationary rectangle
AVRT, combination of the visual rectangle with the tactile stimulus
AVS, audiovisual speech
AVST, combination of the visual speech with the tactile stimulus
d′, discriminability between a noisy signal and noise alone
E, signal energy
N, spectral density of the external noise in the stimulus
Neq, additive noise in the perceptual system (expressed as an equivalent noise source at the input)
η, sampling efficiency of the perceptual system

Supporting Information

Filename: ejn12471-sup-0001-FigS1.pdf
Format: application/PDF
Size: 137K
Description: Fig. S1. Mean detection thresholds for 12 participants at the performance criterion of 79.4% correct (the preliminary phase of Experiment 2).

Please note: Wiley Blackwell is not responsible for the content or functionality of any supporting information supplied by the authors. Any queries (other than missing content) should be directed to the corresponding author for the article.