Neural coding of formant-exaggerated speech in the infant brain


Yang Zhang, Department of Speech-Language-Hearing Sciences, 164 Pillsbury Dr. SE, University of Minnesota, Minneapolis, MN 55455, USA; e-mail:


Speech scientists have long proposed that formant exaggeration in infant-directed speech plays an important role in language acquisition. This event-related potential (ERP) study investigated neural coding of formant-exaggerated speech in 6–12-month-old infants. Two synthetic /i/ vowels were presented in alternating blocks to test the effects of formant exaggeration. ERP waveform analysis showed significantly enhanced N250 for formant exaggeration, which was more prominent in the right hemisphere than the left. Time-frequency analysis indicated increased neural synchronization for processing formant-exaggerated speech in the delta band at frontal-central-parietal electrode sites as well as in the theta band at frontal-central sites. Minimum norm estimates further revealed a bilateral temporal-parietal-frontal neural network in the infant brain sensitive to formant exaggeration. Collectively, these results provide the first evidence that formant expansion in infant-directed speech enhances neural activities for phonetic encoding and language learning.


Language input is assigned roles of varying importance in acquisition models and theories. The ‘poverty-of-stimulus’ argument asserts that language is unlearnable from the impoverished input data available to children (Chomsky, 1980). In contrast, speech research over the past five decades has established that enriched exposure adaptively guides language acquisition early in life (Höhle, 2009; Kuhl, Conboy, Coffey-Corina, Padden, Rivera-Gaxiola & Nelson, 2008). For example, when talking to infants, people across cultures tend to use exaggerated pitch, elongated words, and expanded vowel space with stretched formant frequencies (Ferguson, 1964; Fernald, 1992; Kuhl, Andruski, Chistovich, Chistovich, Kozhevnikova, Ryskina, Stolyarova, Sundberg & Lacerda, 1997). This special speech style undergoes important age-related changes to accommodate the communicative capacity of the developing mind (Amano, Nakatani & Kondo, 2006; Fernald & Morikawa, 1993; Kitamura, Thanavishuth, Burnham & Luksaneeyanawin, 2001; Liu, Tsao & Kuhl, 2009).

The acoustic alterations of infant-directed speech (IDS) purportedly serve vital social and linguistic functions in early learning. Prosodic exaggeration is thought to direct infants’ attention, modulate arousal and affect initially, and later fulfill more specific linguistic purposes such as lexical segmentation (Cooper & Aslin, 1994; Fernald, 1992). Formant exaggeration is phonetically associated with hyperarticulation (Johnson, Flemming & Wright, 1993), which may facilitate language learning by making the critical acoustic distinctions more salient and the phonetic categories more discriminable (Kuhl et al., 1997). Supporting evidence indicates that vowel space in maternal speech is positively correlated with infants’ speech perception – mothers who tended to ‘stretch out’ their vowels had better-performing babies in phonetic discrimination (Liu, Kuhl & Tsao, 2003). Furthermore, computer models have demonstrated robust unsupervised learning of speech sounds based on IDS input (de Boer & Kuhl, 2003; Kirchhoff & Schimmel, 2005; Vallabha, McClelland, Pons, Werker & Amano, 2007). However, little is known about the neurobiological mechanisms that promote learning by exploiting the physical properties of IDS.

Brain research studies offer new insights into speech processing and language acquisition (Dehaene-Lambertz & Gliga, 2004; Kuhl et al., 2008). Several neurophysiological indices have been shown to be associated with IDS compared to adult-directed speech (ADS), including increased frontal cerebral blood flow (Saito, Aoyama, Kondo, Fukumoto, Konishi, Nakamura, Kobayashi & Toshima, 2007), increased frontal electroencephalography (EEG) power (Santesso, Schmidt & Trainor, 2007), and enhanced event-related potentials (ERPs) in the frontal-temporal-parietal recording sites (Zangl & Mills, 2007). The IDS-induced enhancement in neural activity may work jointly with arousal, attention and affect to strengthen auditory memory for phonological, syntactic and semantic categories. However, none of the previous infant studies controlled the acoustic parameters to determine the linguistic effects of formant exaggeration specific to IDS independent of the prosodic/affective effects primarily drawn from fundamental frequency (f0) modifications that are also found in pet-directed speech (Burnham, Kitamura & Vollmer-Conna, 2002).

The present study utilized synthesized stimuli and high-density EEG to investigate neural coding of vowel formant exaggeration in infants. EEG records electrical potential signals from electrodes placed on the scalp. ERPs, which are derived from averaging EEG epochs time-locked to stimulus presentation, provide a direct noninvasive measure of postsynaptic activities with millisecond resolution, suitable for studying the online cortical dynamics of acoustic and linguistic processing (Dehaene-Lambertz & Gliga, 2004; Näätänen & Winkler, 1999). High-density EEG, which records data from 64 or more electrodes, additionally allows for reliable source estimation of high-quality ERP data (Izard, Dehaene-Lambertz & Dehaene, 2008; Johnson, de Haan, Oliver, Smith, Hatzakis, Tucker & Csibra, 2001; Reynolds & Richards, 2005). Furthermore, advancement in EEG time-frequency analysis has opened a new avenue for studying the event-related oscillations (EROs) in infants (Csibra, Davis, Spratling & Johnson, 2000; Csibra & Johnson, 2007). EROs reflect time-varying neuronal excitability and discharge synchronization at different rhythms subserving communications between neuronal populations for various attentional, memory and integrative functions (Klimesch, Freunberger, Sauseng & Gruber, 2008). Studies have shown that delta (1–4 Hz) and theta (4–8 Hz) activities, among other EROs, are closely associated with linguistic processing (Radicevic, Vujovic, Jelicic & Sovilj, 2008; Scheeringa, Petersson, Oostenveld, Norris, Hagoort & Bastiaansen, 2009). It remains to be tested how formant-exaggerated speech affects neural activation and synchronization in the infant brain.

The experimental design followed a basic assumption in auditory neuroscience – the average response for repeated presentations of the same stimulus (or multiple instances of the same stimulus category) is equivalent to the neural representation of the stimulus (or the category), which codes its acoustic/perceptual features. An alternating block design with an equal stimulus ratio was adopted for this purpose.1 The design took into account developmental changes in infant ERPs. A number of studies have shown that speech perception and auditory ERPs change dramatically in the first year of life, and adult-like language-specific perception occurs by 6 months of age (e.g. Cheour, Ceponiene, Lehtokoski, Luuk, Allik, Alho & Näätänen, 1998; Kuhl, Williams, Lacerda, Stevens & Lindblom, 1992; Polka & Werker, 1994). The developmental changes in the latency, amplitude, polarity and scalp distribution of ERP responses have led to a better understanding of brain mechanisms that support phonetic processing and language learning. There are two salient auditory ERP components at this age, P150, a positive peak at approximately 150 ms, and N250, a negative peak at approximately 250 ms (Dehaene-Lambertz & Dehaene, 1994; Fellman, Kushnerenko, Mikkola, Ceponiene, Leipälä & Näätänen, 2004; Kushnerenko, Ceponiene, Balan, Fellman, Huotilaine & Näätänen, 2002; Novak, Kurtzberg, Kreuzer & Vaughan, 1989; Rivera-Gaxiola, Silva-Pereyra, Klarman, Garcia-Sierra, Lara-Ayala, Cadena-Salazar & Kuhl, 2007; Zangl & Mills, 2007).

Although an exact neurocognitive model is not available to test how exaggerated speech affects neural processing at the segmental level in infants, developmental studies have provided important details about the neural basis of speech perception early in life (Dehaene-Lambertz & Gliga, 2004; Kuhl et al., 2008). Magnetoencephalography (MEG) data show that phonetic discrimination in infants at 6–12 months of age activates the inferior frontal and superior temporal regions in the left brain (Imada, Zhang, Cheour, Taulu, Ahonen & Kuhl, 2006). Functional magnetic resonance imaging (fMRI) data further reveal that activation for speech stimuli in Broca’s area can be found even in 3-month-old infants (Dehaene-Lambertz, Hertz-Pannier, Dubois, Mériaux, Roche, Sigman & Dehaene, 2006). The co-activation in Broca’s and Wernicke’s areas is thought to indicate perceptual-motor binding to promote speech learning. Consistent with imaging results, ERP studies suggest that both left and right auditory regions are sensitive to coding acoustic/phonetic features of speech stimuli with striking similarities between infants and adults (Dehaene-Lambertz & Gliga, 2004). There exists limited evidence for left-hemisphere dominance for speech in infants, which may be attributable to a functional asymmetry of the auditory system in processing rapid acoustic transitions versus slow spectral changes (Poeppel, 2003; Zatorre & Belin, 2001). For instance, EEG and near-infrared spectroscopy data from newborn infants show bilateral activation for speech-like acoustic modulations and right-hemisphere dominance for acoustic modulations at a much slower rate (Telkemeyer, Rossi, Koch, Nierhaus, Steinbrink, Poeppel, Obrig & Wartenburger, 2009). The vowel stimuli in the present study did not contain rapid acoustic transitions and thus provided an opportunity to test functional asymmetry for spectral processing in infants at 6–12 months of age.

The general hypothesis was that formant exaggeration would induce enhanced neural responses for speech processing in the infant brain. Specifically, the present study examined the effects of formant exaggeration in two ERP components (P150 and N250), two ERO bands (delta and theta), and three broad regions of interest (frontal, temporal/central, parietal) in both hemispheres. There were four closely related questions. First, at what time points, or in what ERP components, did the effect occur? Second, was the hypothesized effect mediated by differences in neural synchronization? Third, what cortical regions were affected? Fourth, did the data support early functional asymmetry for spectral processing of formant exaggeration? Answers to these questions would provide an initial account of the neural mechanisms responsible for the facilitative role of formant exaggeration in speech learning and acquisition.



Eighteen normally developing infants were recruited via advertisement. Informed parental consent was obtained in accordance with the procedures approved by the institutional Human Research Protection Program. The parents were paid $20. The infants were full term with normal pregnancies and deliveries and no known auditory or visual problems. Both parents were native English speakers, and no immediate family members had a history of speech, language, or hearing deficits. Two infants did not complete the entire recording session. Four others were excluded due to excessive noise in their EEG data. The remaining 12 infants (seven girls, mean age = 9.2 months, range = 6.5–11.4 months) were included in this report. Infants at this age range were considered appropriate because their auditory N250 responses were known to be robust and sensitive to phonetic learning.


The vowels were created with the HLsyn program (Sensimetrics Corporation, USA) (Figure 1). HLsyn allowed quality synthesis with direct control of formants and quasi-articulatory parameters based on the Klatt formant synthesizer (Hanson & Stevens, 2002). The sounds were 200 ms in duration (Supplementary materials, Sounds 1 and 2). The F1 and F2 parameters were chosen to simulate vowel space expansion based on previous studies (Hillenbrand, Getty, Clark & Wheeler, 1995; Kuhl et al., 1997). Only the male /i/ sounds were used for this study.2 Specifically, the center frequencies of F1 and F2 were 342 Hz and 2322 Hz for the non-exaggerated /i/. The exaggerated /i/ had F1 at 310 Hz and F2 at 2480 Hz. The f0, F3 and F4 were held constant at 138, 3000, and 3657 Hz for both sounds. Both vowels included a rise/fall time of 10 ms. They were resampled at 44.1 kHz and normalized with equal average RMS (root mean square) intensity.
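The RMS equalization described above can be illustrated with a minimal sketch. It assumes plain gain scaling of each waveform to a common RMS level; the pure tones standing in for the vowels, the target level, and the function names are hypothetical and are not taken from the HLsyn workflow.

```python
import math

def rms(samples):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def normalize_rms(samples, target_rms):
    """Scale a waveform by a single gain so that its RMS equals target_rms."""
    current = rms(samples)
    if current == 0:
        raise ValueError("cannot normalize a silent waveform")
    gain = target_rms / current
    return [s * gain for s in samples]

# Hypothetical stand-ins for the two vowels: 200 ms tones sampled at 44.1 kHz.
FS = 44100
N_SAMPLES = round(0.2 * FS)
vowel_a = [0.8 * math.sin(2 * math.pi * 138 * n / FS) for n in range(N_SAMPLES)]
vowel_b = [0.3 * math.sin(2 * math.pi * 138 * n / FS) for n in range(N_SAMPLES)]

TARGET_RMS = 0.1  # arbitrary illustrative level, not a value from the study
vowel_a_norm = normalize_rms(vowel_a, TARGET_RMS)
vowel_b_norm = normalize_rms(vowel_b, TARGET_RMS)
```

After scaling, both waveforms carry the same average RMS level regardless of their original amplitudes, so any ERP difference cannot be attributed to overall intensity.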

Figure 1.

(a) Simulated vowel triangle expansion in the F1-F2 space for the three point vowels, /i/, /a/, and /u/. Only the /i/ sounds were used in this study. (b) Spectral overlay plot for the synthesized exaggerated and non-exaggerated /i/ stimuli.

Stimulus presentation

Infants sat on their parent’s lap in an acoustically and electrically treated chamber (ETS-Lindgren Acoustic Systems). Stimulus presentation used the EEvoke software (ANT Inc., The Netherlands). The sounds were played via a pair of loudspeakers (M-audio BX8a) placed at approximately 1.2 m in front and 40 degrees to the sides of the infant. The sound level was calibrated to be 65 dB SPL at the approximate location corresponding to the center position of the subject’s head. Alternating blocks were presented for a total of 360 trials in 18 blocks, with each block consisting of 20 identical sounds. Block order was counterbalanced among the subjects. The offset-to-onset interstimulus intervals were randomized between 1.1 and 1.2 s, and the inter-block silence period was 5 s. The short blocks (each less than 30 s) and the relatively long randomized ISIs and inter-block silence served to reduce habituation of the important ERP components (Dehaene-Lambertz & Dehaene, 1994; Woods & Elmasian, 1986).
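The timing of the alternating block design can be checked with a short sketch. It assumes a uniform distribution for the randomized interstimulus intervals and treats the 5 s silence as the gap after the last sound of each block; the function and variable names are illustrative and do not describe the EEvoke implementation.

```python
import random

N_BLOCKS = 18            # alternating blocks
TRIALS_PER_BLOCK = 20    # identical sounds per block
STIM_DUR = 0.2           # vowel duration in seconds
ISI_RANGE = (1.1, 1.2)   # offset-to-onset interval in seconds
INTER_BLOCK = 5.0        # silence between blocks in seconds

def build_schedule(first="exaggerated", seed=0):
    """Return a list of (onset_s, block_index, stimulus) tuples."""
    rng = random.Random(seed)
    other = "non-exaggerated" if first == "exaggerated" else "exaggerated"
    schedule = []
    t = 0.0
    for block in range(N_BLOCKS):
        stim = first if block % 2 == 0 else other
        for trial in range(TRIALS_PER_BLOCK):
            schedule.append((round(t, 4), block, stim))
            t += STIM_DUR
            if trial < TRIALS_PER_BLOCK - 1:
                t += rng.uniform(*ISI_RANGE)  # assumed uniform jitter
        t += INTER_BLOCK
    return schedule

schedule = build_schedule()
block0 = [onset for onset, block, _ in schedule if block == 0]
block0_duration = block0[-1] + STIM_DUR - block0[0]
```

Under these assumptions a block lasts at most 20 × 0.2 s + 19 × 1.2 s = 26.8 s, consistent with the statement that each block was shorter than 30 s. Counterbalancing across subjects corresponds to changing the `first` argument.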

EEG recording

Continuous EEG was recorded (bandwidth = 0.016–200 Hz; sampling rate = 512 Hz) using the ASA-Lab system with REFA-72 amplifier (TMS International BV) and WaveGuard cap (ANT Inc., The Netherlands). Head circumference was measured for each infant to determine head size. The EEG cap used shielded wires for 65 sintered Ag/AgCl electrodes in the international 10–20 montage system and the intermediate locations (Figures 2, 3a, and 3b). The ground was positioned at AFz, and the default reference for the REFA-72 amplifier was the common average of all connected unipolar electrode inputs. In the EEG cap, each electrode was surrounded by a silicone ring to hold the conductive gel, which allowed the electrodes to be prefilled with gel and facilitated a smoother cap-fitting procedure. Adjustments on individual electrodes were made to keep impedances at or below 5 kΩ.

Figure 2.

Grand average ERP plot (linked-mastoid reference) for the two speech stimuli in all 64 electrodes.

Figure 3.

(a) Photo illustration of an 11-month-old infant wearing the 64-channel EEG cap during a break in the recording session. (b) Realistic head model showing the 64 standard electrode positions. Nine grouped electrode sites were marked, representing frontal (F), central (C), and parietal (P) regions in the left (L), middle (M), and right (R) divisions. (c) Grand average global field power data for the exaggerated and non-exaggerated vowel stimuli. Significance for time-point-by-time-point comparison was shown by the black and white bars on the x-axis. The black bars on the x-axis showed time windows of significant enhancement for the exaggerated sound relative to the non-exaggerated sound, and the white bars showed time windows of significant reduction [p < .01]. (d) Grand average ERP data (linked-mastoid reference) at the nine regionally grouped electrode sites for point-by-point comparison between the two stimuli. The black bars on the x-axis showed time windows of significant negativity enhancement for the formant-exaggerated speech [p < .01]. A cautionary note is needed for proper interpretation as the electrode site effects are not equivalent to those of cortical regions in the brain.

During the experiment, one or two research assistants sat in front of the infant, silently playing with toys to keep the infant’s attention. With parental permission, a muted cartoon movie was also played on a 20-inch LCD TV at 2.5 m away. The researcher in the control room communicated with the assistant and parent via intercom to coordinate the recording and initiate the session when the infant sat relatively still. A surveillance camera (Canon VB-C50iR) monitored the infant’s behavior. The ASA-Lab automatically saved the online video in synchrony with EEG recording for later assessment of data quality and the infant’s alertness level. When necessary, sound presentation and EEG recording would be paused until the infant sat relatively still again. The entire EEG session, including preparation, lasted approximately 40 minutes.

ERP waveform analysis

ERP averaging was performed offline in BESA (Version 5.2, MEGIS Software GmbH, Germany) following recommended guidelines (DeBoer, Scott & Nelson, 2007; Picton, Bentin, Berg, Donchin, Hillyard, Johnson, Miller, Ritter, Ruchkin, Rugg & Taylor, 2000). The EEG data were bandpassed at 0.5–40 Hz. The ERP epoch length was 700 ms, including a pre-stimulus baseline of 100 ms. The automatic artifact scanning tool in BESA was applied in two steps to detect bad electrodes and noisy signals. First, adjustments of rejection threshold parameters were made on every subject to inspect the effects on the entire recording session and the number of accepted trials. Second, the automatic rejection criteria were determined for all subjects. Epochs with a signal level exceeding 150 μV from the segment baseline or a slew rate exceeding 75 μV/ms were rejected. A subject was excluded if the number of accepted trials for each stimulus condition did not meet the minimum of 40. As the focus of the study was on neural coding of the standard stimuli, the first stimulus in each block was excluded to avoid possible MMN elicitation from the alternating blocks. The average number of accepted trials was 58 for the exaggerated /i/ and 55 for the non-exaggerated /i/. Weighted averaging was calculated for the grand average to minimize influences of individual subjects with fewer trials. A caution against weighted averaging is that it might result in an undesirable bias towards drowsy infants, who could have a large number of uninformative epochs due to the decrease of ERP amplitude. For the present study, online observation and video recordings indicated that there were no alertness problems for the infants (12 out of 18) included in grand averaging.
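The rejection criteria above can be sketched for a single-channel epoch as follows. The sketch assumes that the signal level is measured as deviation from the mean of the 100 ms pre-stimulus baseline and that slew rate is the sample-to-sample difference divided by the sampling interval; the actual BESA scanning tool may compute these quantities differently, and all names are illustrative.

```python
FS = 512.0                    # sampling rate in Hz
AMP_LIMIT_UV = 150.0          # maximum deviation from baseline, in microvolts
SLEW_LIMIT_UV_PER_MS = 75.0   # maximum slew rate, in microvolts per ms
BASELINE_SAMPLES = 51         # about 100 ms at 512 Hz (assumed rounding)
MIN_TRIALS = 40               # minimum accepted trials per condition

def epoch_is_clean(epoch, n_baseline=BASELINE_SAMPLES):
    """Return True if one single-channel epoch passes both criteria."""
    baseline = sum(epoch[:n_baseline]) / n_baseline
    dt_ms = 1000.0 / FS
    for i, value in enumerate(epoch):
        if abs(value - baseline) > AMP_LIMIT_UV:
            return False  # amplitude criterion violated
        if i > 0 and abs(value - epoch[i - 1]) / dt_ms > SLEW_LIMIT_UV_PER_MS:
            return False  # slew-rate criterion violated
    return True

def subject_included(accepted_per_condition):
    """A subject is kept only if every condition has at least MIN_TRIALS epochs."""
    return all(n >= MIN_TRIALS for n in accepted_per_condition.values())

# Three synthetic 358-sample epochs for illustration.
flat_epoch = [0.0] * 358
spike_epoch = [0.0] * 358
spike_epoch[100] = 200.0                                    # exceeds 150 microvolts
steep_epoch = [0.0] * 51 + [-74.0, 74.0] + [0.0] * 305      # 148 microvolts in one sample
```

At 512 Hz one sample spans about 1.95 ms, so a 148 μV jump between adjacent samples corresponds to roughly 75.8 μV/ms and fails the slew criterion even though the amplitude stays within ±150 μV.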

To keep consistency with the majority of previous infant ERP studies using speech stimuli (e.g. Ceponiene, Haapanen, Ranta, Näätänen & Hukki, 2008; Fellman et al., 2004; Rivera-Gaxiola et al., 2007; Zangl & Mills, 2007), the ERP data were re-referenced to linked mastoids. Further analyses were performed in Matlab (Version 7.6) with the EEGLAB software (Version 7.2.11) (Delorme & Makeig, 2004). To improve the signal-to-noise ratio, nine electrode regions were defined for grouped averaging for each subject (Figure 3b): left frontal (LF, including F7, F5, F3, FT7, FC5, and FC3), middle frontal (MF, including F1, Fz, F2, FC1, FCz, and FC2), right frontal (RF, including F8, F6, F4, FT8, FC6, and FC4), left central (LC, including T7, C5, C3, TP7, CP5, and CP3), middle central (MC, including C1, Cz, C2, CP1, CPz, and CP2), right central (RC, including T8, C6, C4, TP8, CP6, and CP4), left posterior (LP, including P7, P5, P3, PO7, PO5, and PO3), middle posterior (MP, including P1, Pz, P2, and POz), and right posterior (RP, including P8, P6, P4, PO8, PO6, and PO4) (Schneider, Debener, Oostenveld & Engel, 2008). To derive an unbiased estimate independent of electrode selection, global field power (GFP) was calculated for comparison by computing the standard deviation of the amplitude data across the 64 electrodes at each sampling point (Hamburger & van der Burgt, 1991; Lehmann & Skrandies, 1980). Repeated-measures ANOVA tests were performed on the mean amplitudes of a 20 ms interval around peaks of interest. The peak search windows (60–200 ms for P150 and 200–500 ms for N250) were confirmed by inspection of the grand mean ERP overlay plots and topography. Peak-to-peak (P150-N250) values were also calculated for statistical comparison. The within-subject factors were stimulus type (exaggerated and non-exaggerated), hemisphere (left and right) and electrode region (frontal, central, and parietal). Where appropriate, either Bonferroni or Greenhouse-Geisser correction was applied to the reported p-values.
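The GFP computation described above can be sketched as the standard deviation across electrodes at each sampling point. The sketch assumes the population form of the standard deviation (division by the number of electrodes); the function name and the toy data are illustrative only.

```python
import math

def global_field_power(data):
    """Compute GFP for `data`, a list of channels that each hold one value per sample.

    Returns one GFP value per sampling point: the standard deviation of the
    amplitudes across all channels at that point.
    """
    n_channels = len(data)
    n_samples = len(data[0])
    gfp = []
    for t in range(n_samples):
        column = [channel[t] for channel in data]
        mean = sum(column) / n_channels
        variance = sum((v - mean) ** 2 for v in column) / n_channels
        gfp.append(math.sqrt(variance))
    return gfp

# Toy example with three channels and two sampling points.
toy = [[2.0, 5.0],
       [4.0, 5.0],
       [6.0, 5.0]]
toy_gfp = global_field_power(toy)
```

Because the channel mean is subtracted at every sampling point, adding the same constant to all electrodes leaves GFP unchanged, which is why the measure does not depend on the choice of reference.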

There have been controversies regarding the use of linked mastoids in ERP research (Picton et al., 2000). The topographical voltage maps of linked-mastoid reference may be incorrect because the mastoids are not neutral electrodes for auditory stimuli (Dehaene-Lambertz & Gliga, 2004; Michel, Murray, Lantz, Gonzalez, Spinelli & Grave de Peralta, 2004). To examine the effects of reference selection, ERP waveform analysis using common average reference was also performed. The common average reference method assumes that the average of all recording electrodes on a volume conductor is approximately neutral. However, this assumption is valid only with accurate spatial sampling of the scalp field, which requires a sufficient number of electrodes with full coverage of the head surface. The average of 64 electrodes with the standard 10-10 montage might be insufficient to qualify as a truly neutral reference (Dien, 1998; Yao, Wang, Oostenveld, Nielsen, Arendt-Nielsen & Chen, 2005). To help evaluate the differences between the two reference methods, the ERP data were further transformed into reference-free current source density (CSD) estimates (Kayser & Tenke, 2006; Perrin, Pernier, Bertrand & Echallier, 1989). In this approach, spherical spline surface Laplacian transform was applied to identify locations and relative magnitudes of current sources and sinks. The CSD waveforms were submitted to unrestricted Varimax-rotated temporal principal components analysis (PCA) based on estimating the source-current covariance matrix from the measured-data covariance matrix (Kayser & Tenke, 2006). The CSD estimates served to sharpen ERP scalp topography by eliminating the volume-conducted contributions and the dependence on reference.

Time-frequency analysis

Time-frequency representations (TFRs) were derived for evoked EROs in the frontal (MF), central (MC) and parietal (MP) electrode regions using continuous wavelet transform (CWT) in Matlab (Csibra & Johnson, 2007; Samar, Bopardikar, Rao & Swartz, 1999). CWT was applied to the averaged ERP signals (referenced to linked mastoids) with complex Morlet wavelets (bandwidth = 1 Hz, center frequency = 0.5 Hz). Morlet scalograms were plotted using the absolute values of the squared coefficient of CWT, and the frequency values on the plots were converted from the normalized scale vector. The power data for delta (1–4 Hz) and theta (4–8 Hz) bands at the MF, MC, and MP electrode sites were subject to further analysis to examine the temporal evolution of formant exaggeration.
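The conversion between wavelet scale and frequency mentioned above follows the standard relation f = Fc / (a × Δt), where Fc is the wavelet center frequency, a is the scale and Δt is the sampling period. The sketch below assumes that the CWT was run on data at the 512 Hz recording rate, which the text does not state explicitly; the function names are illustrative.

```python
def scale_to_frequency(scale, center_freq=0.5, fs=512.0):
    """Pseudo-frequency in Hz for a wavelet scale: f = Fc / (a * dt), with dt = 1 / fs."""
    return center_freq * fs / scale

def frequency_to_scale(freq_hz, center_freq=0.5, fs=512.0):
    """Wavelet scale whose pseudo-frequency equals freq_hz."""
    return center_freq * fs / freq_hz

# Scales bounding the delta (1-4 Hz) and theta (4-8 Hz) bands under these assumptions.
delta_scales = (frequency_to_scale(4.0), frequency_to_scale(1.0))
theta_scales = (frequency_to_scale(8.0), frequency_to_scale(4.0))
```

With a center frequency of 0.5 and 512 Hz sampling, the delta band corresponds to scales 64 to 256 and the theta band to scales 32 to 64; a different sampling rate would shift these values proportionally.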

Temporal evolution analysis

Two-tailed time-point-to-time-point t-tests were performed to obtain accurate information about the temporal evolution of significant differences between the exaggerated and non-exaggerated stimuli (Guthrie & Buchwald, 1991). Guthrie and Buchwald’s method requires three pieces of information to assess ERPs’ temporal evolution at a chosen significance level: (1) sample size (N, representing the number of subjects), (2) number of sampling time points (T), and (3) the autocorrelation value (ø). This parametric test method relies on the computation of the minimum number of consecutive sample points that need to show significant differences, which depends on the autocorrelation estimates and the total number of sample points in the ERP data. The analysis was applied to the GFPs, ERPs, CSDs and TFRs. For the present study, the data for an entire epoch were decimated and assessed in two segments (0–200 ms covering P150 and earlier components, and 200–600 ms covering N250 and later components) in order to be consistent with Guthrie and Buchwald’s specific recommendations for the three parameters. For N = 12, the ø values obtained from all IDS-versus-ADS comparisons in the study were in the range of 0.38 to 0.74, corresponding to at least 5–9 consecutive sample points at the significance level of 0.01. For a conservative estimate, an interval would not be considered to differ significantly in this study unless at least nine consecutive sample points (approximately 18 ms) showed significant differences at the level of 0.01.
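The consecutive-point criterion can be sketched as a search for runs of significant samples. The sketch assumes that the point-by-point p-values have already been computed; it covers only the run-length rule (nine consecutive points at the 0.01 level), not the autocorrelation-based derivation of that threshold, and the function name is illustrative.

```python
def significant_runs(p_values, alpha=0.01, min_run=9):
    """Return (start, end) index pairs, inclusive, for every run of at least
    min_run consecutive samples whose p-value is below alpha."""
    runs = []
    start = None
    for i, p in enumerate(p_values):
        if p < alpha:
            if start is None:
                start = i          # a new run of significant points begins
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    # Close a run that extends to the end of the series.
    if start is not None and len(p_values) - start >= min_run:
        runs.append((start, len(p_values) - 1))
    return runs

# Synthetic p-values: a run of nine significant points, then a run of eight.
example_p = [0.5] * 5 + [0.001] * 9 + [0.5] * 3 + [0.001] * 8 + [0.5]
example_runs = significant_runs(example_p)
```

Only the first run is reported in this example, because the second contains eight significant points and falls one short of the criterion.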

Source localization analysis

Minimum norm estimation (MNE L2-norm) was applied to the averaged ERP data (Hämäläinen & Ilmoniemi, 1994; Izard et al., 2008). The MNE analysis approximated the current source space using hundreds of prefixed, discrete and distributed dipoles directly within the cortex and searched for the optimal estimate with the smallest norm to explain the measured ERP signals. Unlike CSDs that are second-order spatial derivatives representing the current sources and sinks on the scalp, MNEs are modeled as true reference-free estimates of cortical current activities. The implementation for the present study included the following steps:

  • 1 The infant electrode montage was calculated by scaling the standard positions for the WaveGuard EEG cap to fit the average head circumference of the 12 infant subjects.
  • 2 A three-shell layer model was used to approximate the infant head (Reynolds & Richards, 2005). To improve the focality and reliability of the source activities, the MNE procedure included both depth weighting and spatio-temporal weighting (Dale & Sereno, 1993; Lin, Witzel, Ahlfors, Stufflebeam, Belliveau & Hämäläinen, 2006) to avoid bias towards superficial sources. Noise regularization used the lowest 15% values, and baseline noises were weighted by the average over the 64 electrodes. The entries in the main diagonal of the noise covariance matrix were equally proportional to the average noise power over all channels.
  • 3 The total activity at each source location was computed as the root mean square of the source activities of its three components. The total activity solutions were projected to the standard realistic brain model in BESA. The current source data for 750 prefixed locations (x, y, z coordinates covering the entire brain space) at all latencies (358 sample points for the −100–600 ms epoch) were further analyzed for temporal and spatial interpretations.
  • 4 In the temporal analysis, the total MNE activities in each hemisphere were summed at each time point for each stimulus. The MNE differences between the two stimuli at each sample point were then subjected to two-tailed z-test relative to the baseline mean and variance. To examine regional contributions to the total MNE activities, standard anatomical boundaries in the Talairach space were used to define the spatial masks for each region of interest (ROI) in the brain space (Lancaster, Woldorff, Parsons, Liotti, Freitas, Rainey, Kochunov, Nickerson, Mikiten & Fox, 2000). The anatomical ROIs allowed a crude estimation for frontal, temporal, and parietal activities in the two hemispheres separately (Zhang, Kuhl, Imada, Kotani & Tohkura, 2005).
  • 5 In the spatial analysis, temporal integration was performed. The MNE differences between the two stimuli at each location and each time point were converted to z-scores relative to the distribution of baseline activities. The average z-scores were then calculated for the two selected time windows (0–200 ms and 200–600 ms) at each source location and plotted using Matlab visualization functions.
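Steps 3 to 5 can be illustrated with a minimal sketch of the total-activity and baseline z-score computations. The sketch assumes the sample standard deviation of the pre-stimulus samples as the baseline spread and uses a short synthetic series; the function names are illustrative and this is not the BESA or Matlab code used in the study.

```python
import math

def total_activity(x, y, z):
    """Root mean square of the three orthogonal components at one source location."""
    return math.sqrt((x * x + y * y + z * z) / 3.0)

def baseline_z_scores(diff, n_baseline):
    """Convert a difference time series to z-scores relative to the mean and
    sample standard deviation of its first n_baseline (pre-stimulus) samples."""
    base = diff[:n_baseline]
    mean = sum(base) / n_baseline
    variance = sum((v - mean) ** 2 for v in base) / (n_baseline - 1)
    sd = math.sqrt(variance)
    return [(v - mean) / sd for v in diff]

def window_mean(values, start, stop):
    """Average of values[start:stop], e.g. the z-scores within one time window."""
    return sum(values[start:stop]) / (stop - start)

# Synthetic difference series: four baseline samples followed by two post-stimulus samples.
diff_series = [1.0, -1.0, 1.0, -1.0, 5.0, 3.0]
z = baseline_z_scores(diff_series, n_baseline=4)
post_mean_z = window_mean(z, 4, 6)
```

In this toy series the baseline has mean 0 and standard deviation of about 1.15, so the post-stimulus differences of 5 and 3 map to z-scores of about 4.33 and 2.60, with a window average of about 3.46.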


ERP waveform results for linked-mastoid reference

The ERP data for the vowel stimuli showed clear auditory N50, P150 and N250 responses and traceable deflection for N450 in the sustaining negativity in the frontal-central sites (Figures 2, 3). The P150-N250 complex was identified in all 12 infant subjects. A clear N50 response was observed in 10 out of the 12 subjects. Compared with the non-exaggerated /i/, the exaggerated /i/ elicited more negative ERP responses in both the early time window (prior to 200 ms) dominated by P150 and the late window (subsequent to 200 ms) dominated by N250. Point-to-point t-test on the global field power (GFP) showed significantly larger ERPs for the exaggerated sound in three components, N50 (44–62 ms), N250 (228–356 ms), and sustaining negative activities (392–431 ms) [p < .01] (Figure 3c). Significantly smaller GFP values were found at 78–97 ms and 136–166 ms for the exaggerated sound, indicating relatively more negative responses in the positive window.

Repeated-measures ANOVA results were obtained from P150, N250, and peak-to-peak (P150-N250) amplitude data on the LF, LC, LP, RF, RC, and RP sites. For P150, there were significant main effects of stimulus type [F(1, 11) = 5.28, p < .05] and region [F(2, 22) = 11.30, p < .01]. For N250, significant effects were observed for stimulus type [F(1, 11) = 20.45, p < .001], hemisphere [F(1, 11) = 7.66, p < .01] and region [F(2, 22) = 5.75, p < .01], and hemisphere-by-region interaction [F(2, 22) = 4.99, p < .05]. The P150 and N250 responses got progressively closer to baseline in the frontal-central-parietal direction, and the right hemisphere showed dominance in N250. In the peak-to-peak (P150-N250) analysis, significant effects were found for stimulus type [F(1, 11) = 29.97, p < .001] and region [F(2, 22) = 24.38, p < .001], suggesting that the effects in P150 and N250 were not due to an overall shift of the P150-N250 complex.

Regional breakdown of the point-to-point comparisons between exaggerated and non-exaggerated /i/s showed detailed contributions from the frontal, central and parietal recording sites at different time intervals (Figures 3d and 6a). Significant effects were observed for N50 (at RP), P150 (at MC, MP, and RC), and N250 (at LF, MF, MC, MP, RF, RC, and RP) as well as for N450 (at LF, LC, MF, MC, MP, RF, RC and RP).

Figure 6.

(a) Grand average ERP overlay plots of 64 electrodes (linked-mastoid reference) and topography maps of N50, P150, N250 and N450 responses for the two vowel stimuli. (b) Grand average ERP overlay plots (common average reference) and topography maps for the two stimuli. (c) Grand average CSD overlay plots and topography maps for the two stimuli. (d) Grand average Morlet scalograms at the MF (frontal), MC (central), and MP (parietal) sites for the exaggerated and non-exaggerated /i/ sounds in the study.

ERP waveform results for common average reference

The ERP data with common average reference showed clear auditory N50, P150, N250 and N450 responses in the frontal electrode sites (LF, MF and RF) (Figures 4 and 6b). Unlike the ERPs with linked-mastoid reference, the parietal sites (LP, MP, RP) uniformly showed polarity reversal to the frontal sites and the vertex, suggesting the existence of dipole current sources in the left and right temporal cortices. The scalp potential distribution pattern created problematic averaging for the LC and RC sites because the co-occurring sources and sinks at the same sampling time point would cancel each other out in these two electrode groups.

Figure 4.

Grand average ERP data (common average reference) at the nine regionally grouped electrode sites for point-by-point comparison between the two stimuli.

Repeated-measures ANOVA showed similarities and differences between the two reference methods. For P150 amplitudes, there were significant effects for stimulus type [F(1, 11) = 9.86, p < .01] and region [F(2, 22) = 13.49, p < .001]. For N250 amplitudes, there were main effects of region [F(1, 11) = 32.98, p < .0001], hemisphere [F(1, 11) = 5.28, p < .05], and stimulus-by-region interaction [F(1, 11) = 17.25, p < .001]. As in the linked-mastoid reference analysis, the right hemisphere was dominant for N250.

Point-to-point comparisons between exaggerated /i/ and non-exaggerated /i/ showed enhanced N250 in MF, RF, and MC. However, there was no such N250 effect in LF. In contrast to the linked-mastoid reference, the significant differences were more extensive in the parietal electrodes than in the frontal electrodes. The N250 enhancement showed up as increased positivity in LP and RP sites due to polarity reversal. Significantly more negative responses prior to N250 were observed in the midline frontal and central sites (MF and MC), and this effect was observed with reversed polarity in left temporal-parietal sites (LC and LP). Significant enhancement in sustaining negativity after N250 was found in LF, MF, MC and RF with polarity reversal in LP, MP and RP sites. No significant differences were found at any electrode site for the N50 response.
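As an illustration of what a point-to-point comparison involves, the sketch below runs a paired t-test at each time sample and reports the samples that exceed a critical value. It is a simplified stand-in, not the exact procedure used in this study. It assumes 12 infants (as the F(1, 11) degrees of freedom suggest) and a two-tailed criterion of p < .01, for which the critical t with 11 degrees of freedom is 3.106. The function names and the toy amplitudes are hypothetical.

```python
import math

T_CRIT_DF11_P01 = 3.106  # two-tailed critical t, df = 11, alpha = .01


def paired_t(a, b):
    """Paired-samples t statistic for two equal-length lists."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


def significant_samples(cond_a, cond_b, t_crit):
    """Return the time indices where |t| exceeds t_crit.

    cond_a and cond_b are indexed as [subject][time].
    """
    hits = []
    for t in range(len(cond_a[0])):
        a = [subj[t] for subj in cond_a]
        b = [subj[t] for subj in cond_b]
        if abs(paired_t(a, b)) > t_crit:
            hits.append(t)
    return hits


# Hypothetical amplitudes for 12 subjects at 2 time samples.
# Sample 0: no condition difference; sample 1: condition A is about
# 2 units more negative than condition B.
jitter = [(s % 3 - 1) * 0.5 for s in range(12)]
cond_b = [[1.0, 1.0] for _ in range(12)]
cond_a = [[1.0 + j, -1.0 + j] for j in jitter]

hits = significant_samples(cond_a, cond_b, T_CRIT_DF11_P01)
```

A sample-by-sample test of this kind involves many comparisons, so analyses in practice often add a further criterion such as a minimum run of consecutive significant samples; that step is omitted here.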

CSD waveform results

The reference-free CSD estimates showed polarity reversal in the frontal-parietal sites, which was similar to ERP data with common average reference. The existence of dipole current distribution in the bilateral temporal regions produced problematic averaging for LC and RC electrodes (Figures 5 and 6c). Unlike the ERP data, the CSD estimates closely approximated localization of cortical activities, and did not show strong P150 and N250 responses in the midline electrodes.

Figure 5.

Grand average CSDs (current source density) at the nine regionally grouped electrode sites for point-by-point comparison between the two stimuli.

Repeated-measures ANOVA results showed consistent significant effects in P150 for stimulus type [F(1, 11) = 11.27, p < .01] and region [F(2, 22) = 9.76, p < .01]. There were main effects in N250 for region [F(1, 11) = 23.37, p < .0001], hemisphere [F(1, 11) = 5.28, p < .05], and stimulus-by-region interaction [F(1, 11) = 27.51, p < .001]. Point-to-point comparisons showed significant contributions to P150 differences in MC, LP, and MP regions, N250 differences in LF, RF, MC, RC, LP, and RP regions, and sustaining negativity differences in MF, RF, MC, LP and RP regions.

Topography and TFR results

The grand average ERP and CSD overlay plots and topographical maps for the two vowels illustrated the existence of N50, P150, N250 and N450 components (Figures 6a, 6b, 6c). Despite the differences in the statistical comparisons among linked-mastoid reference, common average reference and the reference-free CSDs, the P150 amplitudes were consistently maximal at bilateral frontal electrode sites, and the N250 was dominant in temporal-parietal sites with functional asymmetry in favor of the right hemisphere. The N250 enhancement effect was clearly visible at temporal-parietal sites for exaggerated /i/ relative to the non-exaggerated /i/.

The dominant evoked EROs were found in the delta (1–4 Hz) and theta (4–8 Hz) bands, corresponding in time to the P150-N250 complex in the linked-mastoid reference data (Figure 6d). Differences between the two stimuli were observed with stronger ERO power in favor of the exaggerated sound. The overall ERO power became progressively weaker in the frontal-central-parietal direction. Strong delta and theta activities were present at the frontal and central electrode sites, and the parietal site predominantly showed delta activity. In point-to-point comparisons, significantly enhanced delta activity was found for the exaggerated /i/ sound at all three sites (154–425 ms at MF, 152–388 ms at MC, and 152–502 ms at MP) [p < .01]. Differences in theta band power were significant only in the frontal and central sites (162–281 ms at MF, and 201–283 ms at MC) [p < .01].
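The kind of computation behind a Morlet scalogram can be sketched as follows. This is a minimal illustration rather than the analysis code used in this study: the function name, the number of wavelet cycles, the sampling rate and the test signal are all assumptions. A complex Morlet wavelet (a Gaussian-windowed complex sinusoid) is slid along the signal, and the squared magnitude of the inner product gives power at that frequency and time point.

```python
import cmath
import math


def morlet_power(signal, sfreq, freq, n_cycles=3.0):
    """Wavelet power over time at a single frequency (in Hz).

    n_cycles sets the width of the Gaussian window (an assumed value).
    """
    sigma_t = n_cycles / (2.0 * math.pi * freq)   # temporal SD in seconds
    half = int(round(3.0 * sigma_t * sfreq))      # support of +/- 3 SD
    wavelet = []
    for k in range(-half, half + 1):
        t = k / sfreq
        gauss = math.exp(-t * t / (2.0 * sigma_t ** 2))
        wavelet.append(gauss * cmath.exp(2j * math.pi * freq * t))
    # Normalize the wavelet to unit energy.
    norm = math.sqrt(sum(abs(w) ** 2 for w in wavelet))
    wavelet = [w / norm for w in wavelet]

    n = len(signal)
    power = []
    for i in range(n):
        acc = 0j
        for k in range(-half, half + 1):
            j = i + k
            if 0 <= j < n:  # samples beyond the edges count as zero
                acc += signal[j] * wavelet[k + half].conjugate()
        power.append(abs(acc) ** 2)
    return power


# Hypothetical test signal: a 3 Hz sinusoid sampled at 100 Hz for 2 s.
sfreq = 100.0
signal = [math.sin(2.0 * math.pi * 3.0 * (i / sfreq)) for i in range(200)]

p_delta = morlet_power(signal, sfreq, freq=3.0)  # within the delta band
p_theta = morlet_power(signal, sfreq, freq=7.0)  # within the theta band
mid = len(signal) // 2
```

For this test signal, power at the mid-point is larger at 3 Hz than at 7 Hz, as expected for a pure delta-range oscillation. Applying the transform to the averaged waveform, as here, captures only activity that is phase-locked to the stimulus.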

MNE results

The MNE plots showed activation in frontal, temporal, and parietal regions for the two vowel stimuli in both hemispheres (Figures 7a and 7b). Point-to-point comparisons showed significantly enhanced total cortical activity (N250 and sustaining activity following N250) for the exaggerated sound in both left (231–440 ms, 537–571 ms) and right hemispheres (190–482 ms, 500–600 ms) [p < .01]. As expected, temporal regions were the primary contributor to the enhancement in N250 and sustaining activities. There were also contributions from frontal and parietal regions in the two hemispheres. Unlike the waveform data, the total MNE amplitude data showed no significant difference in P150 for either hemisphere. The lack of significance in total activity for P150 could arise from regularization of cortical source diffusion and depth weighting in the MNE calculation. Significant regional P150 differences were found bilaterally in the frontal MNE activities [p < .01]. Although the peak amplitude for N250 appeared to show right-hemisphere dominance, the mean MNE differences (total MNE activity between the two stimuli in the window of 270–450 ms) were slightly larger in the left hemisphere than in the right.

Figure 7.

(a) Temporal evolution of total MNE activity and regional (frontal, temporal and parietal) MNE activities for the two stimuli in the two hemispheres. The black bars on the x-axis show time windows of significant MNE enhancement for the exaggerated speech relative to baseline current activities [p < .01]. (b) Spatial localization in top, left and right views of the MNE estimates (in nAm) for the two vowel stimuli. The MNE activities were integrated over 0–600 ms at each prefixed spatial location. (c) Spatial localization in top, left and right views of the MNE differences between the two vowel stimuli. The MNE differences were converted into z-scores relative to baseline and integrated over two time windows (0–200 ms and 200–600 ms).

The significant MNE differences for the early window (0–200 ms) were localized primarily in the inferior frontal area in both hemispheres. In the late time window (200–600 ms), the significant differences were found in the left inferior frontal cortex as well as right anterior temporal cortex extending posteriorly to superior temporal, inferior frontal and middle frontal regions (Figure 7c).
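The baseline z-score conversion and window integration described for Figure 7c can be sketched for a single source location as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, the sample standard deviation of the baseline is used as the scaling term, integration is approximated by a sum over samples, and the values are invented for illustration.

```python
import statistics


def baseline_zscore(series, n_baseline):
    """Express every sample in units of baseline standard deviations.

    The first n_baseline samples are treated as the pre-stimulus baseline.
    """
    base = series[:n_baseline]
    mu = statistics.mean(base)
    sd = statistics.stdev(base)  # sample SD (an assumption)
    return [(x - mu) / sd for x in series]


def integrate_window(series, start, stop):
    """Sum the samples with indices in [start, stop)."""
    return sum(series[start:stop])


# Hypothetical difference time series at one location:
# three baseline samples followed by two post-stimulus samples.
diff = [1.0, 2.0, 3.0, 4.0, 6.0]
z = baseline_zscore(diff, n_baseline=3)
late = integrate_window(z, 3, 5)
```

Scaling by baseline variability puts locations with different noise levels on a common footing before the values are summed within each time window.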


Speech scientists have long stressed the importance of formant exaggeration in infant-directed speech for phonetic learning (Burnham et al., 2002; Kuhl et al., 1997). The ERP waveforms (including CSD waveforms), TFRs, and MNE data here provided three lines of evidence in support of this view. Despite striking differences in ERP waveforms due to reference choice, significant enhancement in N250 and sustaining activity following N250 for exaggerated speech was confirmed in all the analyses. The reduced P150 effect, on the other hand, was not consistently found. Unlike N250, the early P150 component presumably reflected acoustic mapping of the spectral differences between the stimuli (Rivera-Gaxiola et al., 2007). This functional distinction was partly supported by the scalp distribution of the components in all three topographical calculations using linked-mastoid reference, average reference, and the reference-free CSD approach. The P150 was dominant in the frontal sites, and the N250 extended posteriorly from frontal to temporal-parietal electrode sites. The timing and scalp distribution of the enhanced negativity in the 200–600 ms window were consistent with the notion that the N250 and sustaining negative responses are linked with phonetic and lexical processing in infants at 6 months of age or older (Mills, Prat, Zangl, Stager, Neville & Werker, 2004; Rivera-Gaxiola et al., 2007; Zangl & Mills, 2007). An alternative interpretation is that the P150 and N250 responses do not necessarily serve the strict bifurcation of auditory vs. linguistic processing. Rather, these two components co-occur and behave similarly in many experimental situations, and may thus reflect connected processes. In line with both of these interpretations, a missing or diminished N250 was found to be associated with lower level of cognitive and linguistic development and diverted central auditory processing (Ceponiene et al., 2002; Fellman et al., 2004; Tonnquist-Uhlen, 1996).

Differential patterns of neural activity for IDS and ADS have been reported in previous infant studies (Saito et al., 2007; Santesso et al., 2007; Zangl & Mills, 2007). Saito and colleagues employed near-infrared spectroscopy in examining neonates’ responses to naturally spoken sentences in the two speech styles. They found that IDS increased frontal activation in neonates, which was mainly attributable to the prosodic exaggeration of IDS and its socio-affective impact. Santesso et al. showed that in 9-month-old infants, the overall frontal activation in terms of EEG power was linearly related to affective intensity of natural sentences spoken in IDS. Zangl and Mills compared ERPs for words spoken in IDS and ADS in 6- and 13-month-old infants and found larger N600–800 responses to IDS than to ADS in both age groups. In the older infants, familiar words additionally showed enlarged N200–400 response to IDS. Given that the IDS stimuli in the previous study were significantly longer in duration and higher in fundamental frequency, maximum pitch, and frequency range than ADS stimuli (Zangl & Mills, 2007), the increased brain activity for IDS would presumably reflect a composite effect of both prosodic and linguistic exaggerations. By controlling acoustic exaggeration other than formants in IDS and ADS, the new ERP data here demonstrated that formant exaggeration alone at the segmental level could produce significant enhancement in neural activation in 6–12-month-old infants, which may serve to strengthen associations between phonetic processing and word learning (Swingley, 2009).

The mechanism for the observed enhancement in N250 and sustaining negativity appears to rely on neural synchronization of evoked EROs time-locked and phase-locked to stimulus presentation. In the literature, the adult theta activity has been linked with arousal/orienting responses and working memory of verbal stimuli (Basar, Basar-Eroglu, Karakas & Schürmann, 1999; Hwang, Jacobs, Geller, Danker, Sekuler & Kahana, 2005; Klimesch, Hanslmayr, Sauseng, Gruber, Brozinsky, Kroll, Yonelinas & Doppelmayr, 2006; Scheeringa et al., 2009; Summerfield & Mangels, 2005). In infants, delta (1–4 Hz) and theta (4–8 Hz) activities are both affected by linguistic processing with increased theta power for affective speech (Orekhova, Stroganova, Posikera & Elam, 2006; Radicevic et al., 2008; Santesso et al., 2007). As the pitch level was controlled in the present study, the observed increases in delta activity at frontal-central-parietal sites, as well as in theta activity at frontal-central sites, could not be due to prosodic processing. Rather, it could be a composite effect of attentional and phonetic encoding processes in response to the acoustically more salient and phonetically more distinct speech (Kuhl et al., 1997). As attention was not controlled in the present study, it remains to be tested whether formant exaggeration alone makes speech more attractive to infant listeners.

The MNE differences between the stimuli revealed a bilateral cortical neural network sensitive to formant exaggeration, including Broca’s area in the left hemisphere and frontal-temporal-parietal regions in the right. Broca’s activation for speech processing has been reported in imaging studies of infants at 3–12 months of age (Dehaene-Lambertz et al., 2006; Imada et al., 2006), suggesting the existence of early perceptual-motor binding in support of language acquisition. The present MNE data further indicate that formant-exaggerated speech leads to enhanced Broca’s activation, which may drive speech learning via interactions with the perceptual-motor system involving temporal, frontal, and parietal cortices in both hemispheres. It is interesting to note that the infant MNE activation patterns for passive listening to speech show a striking resemblance to adult fMRI data during passive listening to music (Lahav, Saltzman & Schlaug, 2007). Adult imaging research has also shown that auditory listening alone can recruit production-related regions including Broca’s area (Love, Haist, Nicol & Swinney, 2006; Meyer, Steinhauer, Alter, Friederici & von Cramon, 2004; Skipper, Nusbaum & Small, 2005; Wilson, Saygin, Sereno & Iacoboni, 2004). The adult data were thought to reflect more general mnemonic and integrative functions for Broca’s area in making associations between motor actions for sound generation (not just speech) and the acoustic product. However, passive listening to nonsense syllables does not reliably elicit inferior frontal activation in adults (Zhang, Kuhl, Imada, Iverson, Pruitt, Stevens, Kawakatsu, Tohkura & Nemoto, 2009; Zhang et al., 2005). As no motor component of speech is measured for comparison in the present design, it remains purely speculative that passive listening to speech might elicit motor activities in the developing mind to mediate phonological acquisition.

The ERP, CSD and MNE data all indicated greater involvement of the right hemisphere for the N250 effect than the left. This result was consistent with a recent study that showed early functional asymmetry of spectral processing in newborns (Telkemeyer et al., 2009). There is a growing literature relating the right hemisphere with speech processing at the prelexical and paralinguistic levels (e.g. Bristow, Dehaene-Lambertz, Mattout, Soares, Gliga, Baillet & Mangin, 2009; Homae, Watanabe, Nakano, Asakawa & Taga, 2006; Scott & Wise, 2004; Simos, Molfese & Brenden, 1997). However, the laterality result directly contradicted previous findings about left-hemisphere dominance in significantly enhanced N200–400 and N600–800 responses for familiar words spoken in IDS relative to ADS (Zangl & Mills, 2007). The laterality inconsistency can be explained by the functional asymmetry model – spoken words involve fine-scale temporal processing of the rapid acoustic transitions in the left brain, and processing steady spectral cues in simple vowel stimuli primarily depends on the right brain (Poeppel, 2003; Zatorre & Belin, 2001). Nevertheless, this model did not specify the time course of functional asymmetry or the time course of interactions of cortical regions in auditory processing. A simple extrapolation would predict the same pattern of functional asymmetry regardless of the time course of brain activities, which was not supported by the current results. Further research is necessary to determine left/right functional asymmetries at different cortical regions and in different time windows and how asymmetry in brain activation varies as a function of stimulus properties, task variables, and subject characteristics.

The reference-dependent and reference-free approaches in the present study showed similarities as well as striking differences. The ERP research field has yet to adopt one standard solution regarding the choice of reference. Caution must be used in interpreting ERP results with different reference methods (Dien, 1998; Yao et al., 2005). The topographical map for common average reference was similar to the CSD map in terms of the polarity reversal pattern. As all electrical activity produces dipolar fields, measurements from the two sides of the dipolar activity will always be negatively correlated. It is noteworthy that polarity reversal in the temporal-parietal electrodes relative to frontal electrodes could potentially cause problems in channel grouping and interpretation. Compared with common average reference, the linked-mastoid reference appears to produce biophysically unrealistic unipolar voltage fields. Although the linked-mastoid reference was quite popular in the past, researchers are encouraged to adopt the common average reference in future studies. While the CSD and MNE solutions have the advantage of being reference-free, these methods are highly susceptible to noise and thus technically challenging to implement when analyzing individual subjects’ data, especially those of infants, where there tends to be more noise.

As children learn to speak only the language(s) that they are exposed to, defining the role of input and the neurobiological mechanisms enabling this feat is central to our understanding of the perceptual and computational processes that adaptively shape both the developing brain and the language outcome. There is cumulative evidence that the acoustic and linguistic modifications in IDS have important functions in the acquisition of phonology and grammar (Burnham et al., 2002; Liu et al., 2003; Morgan & Demuth, 1996; Werker, Pons, Dietrich, Kajikawa, Fais & Amano, 2007). The present results add a neural-level account of how formant exaggeration in speech alters infants’ brain activities for phonetic processing. This account is not without its limitations in explaining the role of formant exaggeration in language acquisition. Research has shown that not all aspects of acoustic exaggeration in IDS necessarily aid speech discriminability or learning (Trainor & Desjardins, 2002). As the distributional and statistical properties of language input are embedded within an interactive social learning environment (Meltzoff, Kuhl, Movellan & Sejnowski, 2009), it seems unlikely that any single property of IDS is indispensable to normal language development.

Of particular interest to theory and practice is that the effects of enriched language exposure, including formant exaggeration, are not limited to infancy. IDS-based input manipulation is conceptualized to be an agent of neural plasticity regardless of age or experience (Zhang et al., 2009). The benefits of various input manipulations have been demonstrated in infants, children, and adults with or without learning disabilities (Bradlow, Kraus & Hayes, 2003; Kuhl et al., 2003; Tallal, 2004; Zhang et al., 2009). Given that early brain measures have predictive power for later language skills (Kuhl et al., 2008; Molfese, 1989), more developmental studies are needed to delineate the role of language input in the social context of language acquisition or effective intervention. In particular, an experimental design focusing on the different spectral and temporal aspects of IDS is necessary to build a better understanding of cortical speech processing and functional asymmetry. Both speech stimuli and nonspeech control can be applied to further investigate the effects of acoustic versus phonetic processing in populations of specific ages and neurological conditions (Dehaene-Lambertz & Gliga, 2004).

In summary, the present study examined the effects of formant exaggeration on cortical speech processing in infants at 6–12 months of age. Despite methodological differences, there was significant enhancement in N250 with right-hemisphere dominance in all reference-dependent and reference-free analysis approaches. Time-frequency analysis indicated increased neural synchronization for processing formant-exaggerated vowel stimuli in the delta band at frontal-central-parietal electrode sites as well as in the theta band at frontal-central sites. Minimum norm estimates further revealed a bilateral cortical neural network (frontal, temporal and parietal regions) in the infant brain sensitive to formant exaggeration, which may facilitate learning via cortical interactions in the perceptual-motor systems. Although there was limited support for the early functional asymmetry for spectral processing of formant exaggeration in the right hemisphere, hemispheric laterality may vary depending on the time course of neural activation.


  • 1

    ERP research in speech perception has primarily used the mismatch negativity (MMN) paradigm. The generation of MMN involves repeated presentation of the same stimulus (or stimulus category), which is occasionally replaced by another stimulus (or stimulus category) (Näätänen, Paavilainen, Rinne & Alho, 2007). The MMN is a powerful measure that reflects experience-dependent neural sensitivity in detecting a sound change. A different paradigm, which requires less recording time, presents stimuli in blocks with an equal occurrence ratio to study neural coding of speech stimuli independent of discriminatory sensitivity (e.g. Molfese & Molfese, 1980; Sharma, Marsh & Dorman, 2000; Tremblay, Friesen, Martin & Wright, 2003; Zangl & Mills, 2007). The alternating block design combined important features of both paradigms while keeping the stimulus presentation time relatively short for the infant participants (Supplemental Figure 1). Unlike the MMN experiment, the alternating block design shifted the focus to neural coding of the standard stimuli alone.

  • 2

    The male voice was chosen in consideration of naturalness of the synthesized vowel stimuli as judged by five adult native speakers of English. The HLsyn software program is based on the Klatt formant synthesizer (Klatt & Klatt, 1990), and male-based speech synthesis has been used in previous infant speech perception studies (Kuhl, Tsao & Liu, 2003; Liu et al., 2003). As pointed out by an anonymous reviewer, female speech that typically contains greater acoustic exaggeration could potentially produce a larger effect in infants than is reported in the present study.


This study was supported by funding sources to YZ, including two Brain Imaging Research Awards from the College of Liberal Arts (CLA) and the Grant-in-aid Program at the University of Minnesota. The first author would like to thank three anonymous reviewers for revision suggestions, Drs Matti Hämäläinen, Iku Nemoto, and Masaki Kawakatsu for technical discussions, and CLA Associate Deans, Jo-Ida C. Hansen and Jennifer Windsor, for support.