It is well attested that 14-month-olds have difficulty learning similar sounding words (e.g., bih/dih), despite their excellent phonetic discrimination abilities. By contrast, Rost and McMurray (2009) recently demonstrated that 14-month-olds’ minimal-pair learning can be improved by the presentation of words by multiple talkers. This study investigates which components of the variability found in multitalker input improved infants’ processing, assessing both the phonologically contrastive aspects of the speech stream and phonologically irrelevant indexical and suprasegmental aspects. In the first two experiments, speaker was held constant while cues to word-initial voicing were systematically manipulated. Infants failed in both cases. The third experiment introduced variability in speaker, but voicing cues were invariant within each category. Infants in this condition learned the words. We conclude that aspects of the speech signal that have been typically thought of as noise are in fact valuable information—signal—for the young word learner.
Research in early language acquisition has been peppered with findings that very young infants have excellent abilities to discriminate speech categories (e.g., Eimas, Siqueland, Jusczyk, & Vigorito, 1971; Werker & Tees, 1984; for a review, see Werker & Curtin, 2005). However, Stager and Werker (1997) (for a review, see Werker & Fennell, 2006) reported that for somewhat older infants (14-month-olds), some of these abilities appear to be ineffective when applied to word learning. Phonological skills, such as the ability to discriminate between native-language phonemes (Werker & Tees, 1984), and to represent the phonology of words in detail (Ballem & Plunkett, 2005; Swingley & Aslin, 2002), seem to have little bearing on the ability to learn words that are phonologically similar (Stager & Werker, 1997; see also Swingley & Aslin, 2007; Werker, Cohen, Lloyd, Casasola, & Stager, 1998; but for recognition of known words, see Fennell & Werker, 2003; Swingley & Aslin, 2002).
Explanations for the failure to learn phonologically similar words typically focus on top-down mechanisms, such as task demands (Werker et al., 1998; Yoshida, Fennell, Swingley, & Werker, 2009) or lexical access (Swingley & Aslin, 2007). Proponents of the former argue that the demands of laboratory word learning tasks are heavy because the children are required to encode both visual and auditory forms in a short time period and then to connect them to one another. This requires children to allocate their limited resources to specific elements of the task (for a review, see Werker & Fennell, 2006). PRIMIR (Werker & Curtin, 2005) describes this as a case where general perceptual processes overwhelm the child’s system, leaving little room for phonetic ones. Additionally, the switch task typically used in these experiments (see Werker et al., 1998) requires that information be represented and organized robustly, as success requires the infant to determine that something is not part of a category. Children this age succeed more easily at positive identification tasks in which they must map an auditory word form to an object (Ballem & Plunkett, 2005). Even infants trained in the style of Stager and Werker (1997) correctly identify word–object pairings when the test is presented using a two-alternative looking paradigm (Yoshida et al., 2009). Limited capacity, coupled with the difficulty of the switch task, might negatively affect 14-month-olds’ use of their discrimination skills in this task. However, as children get older, they become more adept, and by 20 months, they learn phonologically similar words in the switch task (Werker, Fennell, Corcoran, & Stager, 2002).
Alternatively, it has been suggested that processes involved in lexical access, particularly competition (e.g., Dahan, Magnuson, Tanenhaus, & Hogan, 2001; Luce & Pisoni, 1998), interfere with learning (Swingley & Aslin, 2007). In the small lexicon of 14-month-olds, known words are accessed somewhat easily from phonetic input and compete with novel or newly learned words. New words that sound similar to existing words will activate both a novel representation and these existing known words, and do not fare well in the resulting competition. Thus, 14-month-olds learning words like “tog” will have difficulty because they retrieve “dog” instead (Swingley & Aslin, 2007). Similarly, when infants learn two similar words at once, the word forms compete with one another for representation. As a result, each inhibits the other and learning fails, or alternatively, both representations get linked to the referent (as they are both momentarily active in parallel). As children develop, the lexicon expands, resulting in more “balanced” competition: the strength of competitive interactions coming from dog, for example, may be balanced by competition from words like doll, tall, dot, and bog (cf. Thiessen, 2007).
Although both theories explain existing behavioral data, they imply that speech perception is well developed in children at this age, and that top-down factors impede it (Werker & Curtin, 2005). However, it is possible that bottom-up speech perception factors, that is, perceptual abilities that are relevant for speech but not completely developed, may contribute to this failure.
Although discrimination tasks indicate that some category boundaries are established by 1 year (e.g., Werker & Tees, 1984), there is also abundant evidence that children refine their phoneme categories well into the school years (Nittrouer, 2002; Ohde & Haley, 1997; Slawinski & Fitzgerald, 1998). Thus, it is possible that 14-month-olds’ phonetic categories are only partially developed, and the existing categories, while sufficient to succeed at discrimination tasks, may provide a weak platform for word learning.
Rost and McMurray (2009) assessed this by examining the role of acoustic variability in learning phonologically similar words. We hypothesized that if speech categories were still developing, the small set of acoustic exemplars provided in most studies (Stager & Werker, 1997; Werker et al., 1998, 2002) might leave ambiguity about the structure of the phonetic category. Variability could provide more structure to the phonetic category, supporting word learning. Similar effects of variability on category learning have been observed in both visual categorization (Oakes, Coppage, & Dingel, 1997; Quinn, Eimas, & Rosenkrantz, 1993) and in the acquisition of phonetic categories in a second language (Lively, Logan, & Pisoni, 1993), suggesting that this simple manipulation may be an important way to support categories that are not yet fully developed.
Fourteen-month-olds were tested in the switch task (Werker et al., 1998) by habituating them to two novel objects paired with two novel, phonologically similar, words (/buk/ and /puk/, both rhyme with “luke”). Infants were then tested on a same trial, where the word–object pairing was consistent with habituation, and a switch trial, where the word–object pairing was the opposite of what it had been in habituation. If infants internalized the word–object mapping, they should dishabituate on the switch trials. Experiment 1 of that study replicated prior work: infants hearing a small set of exemplars failed to notice the switch. However, Experiment 2 of that study employed multiple exemplars of the words spoken by 18 speakers; infants hearing variable exemplars correctly acquired the two phonologically similar words.
At face value, successful learning in the multitalker condition is surprising. Multitalker variation imposes a significant cost on speech processing in adults (Mullennix, Pisoni, & Martin, 1989), toddlers (Ryalls & Pisoni, 1997), and infants (Jusczyk, Pisoni, & Mullennix, 1992) and, from a purely processing standpoint, could be expected to add significantly to the task demands here.
However, specific types of variability may also play a role in forming appropriate phonetic categories. Under both prototype (Kuhl, 1991; Miller, 1997, 2001) and exemplar (Goldinger, 1998; Pierrehumbert, 2003) theories of speech perception, variability is essential to defining the limits of a category (e.g., what tokens are not a /b/). Developmentally, it is important for the learner to hear variable exemplars in order to delineate the acoustic space encompassed by a phonological category and words. Moreover, as numerous authors have pointed out (Swingley & Aslin, 2002; Yoshida et al., 2009), the switch task relies on infants’ abilities to both identify a word and identify that a given auditory stimulus is not an exemplar of a lexical category. If variability is essential to defining the edge of a category, a lack of variability could be particularly problematic in the switch task.
The multitalker input used in Rost and McMurray (2009) contained multiple sources of variability, both within and between speakers. This included variation in prosodic patterning, fundamental frequency, vowel quality, and voice timbre. These factors do not distinguish /buk/ from /puk/, nor do they serve as cues for voicing more broadly. However, these tokens also contained variation in Voice Onset Time (VOT; the continuous cue that distinguishes voicing, hence the two words to be learned) that is contrastive for the voicing feature distinguishing /buk/ and /puk/. A number of studies have examined the role of such variation in the formation of speech categories. Phonetic investigations of cues like VOT reveal statistical distributions that maintain the separability of /b/ and /p/, but have significant within-category variation (Allen & Miller, 1999; Lisker & Abramson, 1964). Moreover, Maye, Werker, and Gerken (2002) (see also Maye, Weiss, & Aslin, 2008; Teinonen, Aslin, Alku, & Csibra, 2008) have demonstrated that infants are sensitive to these distributions and may use them to learn speech categories. In these studies, infants were exposed to a set of words in which the VOT was statistically distributed into one or two clusters, after which infants’ patterns of discrimination mirrored the number of clusters in the input. Thus, variation in contrastive cues may play a role in category learning (see McMurray, Aslin, & Toscano, 2009) by providing an estimate of the width of the category or its edge.
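To make the distributional manipulation concrete, the following is a minimal Python sketch; it is not the procedure or the stimuli of Maye et al. (2002). It assumes an eight-step VOT continuum and two made-up presentation-frequency profiles (`unimodal` and `bimodal` are hypothetical values chosen only for illustration), and it uses a crude peak count over the frequency histogram as a stand-in for the number of clusters a learner might extract.

```python
def count_modes(freqs):
    """Count local peaks in a list of presentation frequencies."""
    # Collapse runs of equal neighbours so a flat-topped peak counts once.
    collapsed = [f for i, f in enumerate(freqs) if i == 0 or f != freqs[i - 1]]
    # Pad both ends so that a peak at the edge of the continuum is detected.
    padded = [float("-inf")] + collapsed + [float("-inf")]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i - 1] < padded[i] > padded[i + 1]
    )


# Hypothetical frequencies for continuum steps 1..8 (30 tokens in each set).
unimodal = [1, 2, 4, 8, 8, 4, 2, 1]
bimodal = [2, 8, 4, 1, 1, 4, 8, 2]

print(count_modes(unimodal))  # 1
print(count_modes(bimodal))   # 2
```

Both sets contain the same number of tokens over the same continuum; only the shape of the distribution differs, which is the property the distributional-learning studies manipulate.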
Experiments 1 and 2 tested the hypothesis that variability along the contrastive dimension of voicing helps infants define the phonological categories for the words, while simultaneously eliminating noncontrastive variation that might be expected to impede processing. If true, it might suggest that further development of the internal statistical structure of VOT distributions is necessary for phonological categories to be engaged in this case.
These experiments investigated the role of contrastive and noncontrastive phonetic variability in infants’ word learning in the switch-task procedure. Experiments 1 and 2 examined whether variability in a contrastive cue was necessary for minimal-pair learning in the switch task. Our initial hypothesis was that the switch task requires children to determine that a given exemplar is not a member of the /buk/ (or /puk/) category, and as a result, some estimate of the extent of a category along the contrastive dimension may be needed to make this determination. However, this was not the case: across both experiments there was no evidence for learning, even when three cues to voicing varied simultaneously. Indirectly, this provides evidence that the kind of statistical learning first reported by Maye et al. (2002, 2008) (see also Kuhl et al., 2007; McMurray et al., 2009; Vallabha et al., 2007) cannot account for learning in Rost and McMurray (2009), as variability along the contrastive dimension of voicing alone is not sufficient to support learning. We do not argue that infants ignore variability along dimensions such as VOT. Indeed, it is likely to be important in establishing the location of categories within a dimension. However, it seems that this is not the information that they must glean to succeed here by this more advanced age. This suggests that the perceptual development that supports learning on this task is not simply locating categories within a dimension. Rather, some other component of perceptual development must be occurring.
By contrast, Experiment 3 suggests that variability along noncontrastive acoustic dimensions supports minimal-pair learning in the switch task, even when contrastive variability is minimized. Before reaching this conclusion, however, it is important to assess several alternatives. One possible explanation for this is that the stimuli presented in Experiment 3 are more natural than those in Experiments 1 and 2. It is not clear that this is the case: both sets of stimuli were created by manipulating natural speech using similar techniques (cross-splicing), and adult listeners did not report that either sounded unnatural. Nor is it clear that manipulated speech poses a problem in this case, as previous switch-task studies (Stager & Werker, 1997; for a review, see Werker & Fennell, 2006; for an example using voicing contrasts, see Pater et al., 2004) all used unmanipulated natural speech, and 14-month-olds consistently failed to learn minimal-pair words.
A second possibility is that the highly salient variation between speakers was more engaging and thus resulted in better learning. However, our analysis of infant habituation times renders unlikely the possibility that infants were more engaged as they had slightly fewer trials to habituation in Experiment 3 than in Experiments 1 and 2.
A third possibility is that the more naturalistic variation in Experiment 3 also contained secondary cues to voicing. Yet, measurements of our stimuli rule out the possibility that the items retained perceptible variability of cues related to voicing. Moreover, if VOT was treated as a relative cue (which is unlikely given the adult work), Experiment 3 substantially minimized variation in this contrastive dimension, and infants still learned the words.
Because explanations based on naturalness, saliency, contrastive acoustic cues, or task demands do not adequately account for the results of Experiment 3, we are left with irrelevant speaker information as the driving force of this effect. It must therefore be that variability along dimensions that do not typically distinguish words in fact helps 14-month-olds to acquire lexically contrastive phonetic representations.
One simple account for this is that infants might not be fully committed to which cues are relevant for voicing by this age. If this were the case, then variability along indexical dimensions would help infants learn that those dimensions are not relevant; conversely, the relative invariance of VOT would point to its utility in contrasting words. On this account, multitalker variability helps infants with dimensional weighting (Toscano & McMurray, 2010a), the assignment of weight or importance to perceptual dimensions.
Ongoing computational work (Apfelbaum & McMurray, 2010) shows how simple associative learning mechanisms can give rise to this. This model suggests that without speaker variability infants erroneously associate indexical and pitch cues with both words—when the same speaker is heard at test, then, both words receive partial support making it difficult to rule one out. The constant indexical cues, thus, interfere with establishing contrast. Variability in speaker prevents this by spreading association across many possible speakers.
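The logic of this account can be illustrated with a deliberately simplified Python sketch. It is not the Apfelbaum and McMurray (2010) model; it assumes a bare Hebbian-style rule in which every cue present on a trial is strengthened toward the word heard on that trial. The cue names (`vot_short`, `vot_long`, `speaker_0`, and so on), the learning rate, and the number of tokens per word are hypothetical choices made for illustration.

```python
from collections import defaultdict


def make_trials(n_speakers, tokens_per_word=18):
    """Build habituation trials as (active cues, word) pairs."""
    trials = []
    for i in range(tokens_per_word):
        speaker = f"speaker_{i % n_speakers}"
        trials.append((("vot_short", speaker), "buk"))
        trials.append((("vot_long", speaker), "puk"))
    return trials


def train(trials, rate=1.0):
    """Hebbian-style update: strengthen every active cue -> word link."""
    weights = defaultdict(float)
    for cues, word in trials:
        for cue in cues:
            weights[(cue, word)] += rate
    return weights


def support(weights, cues, word):
    """Total associative support that the active cues lend to a word."""
    return sum(weights.get((cue, word), 0.0) for cue in cues)


def competitor_ratio(n_speakers):
    """Support for the wrong word relative to the right one at test."""
    weights = train(make_trials(n_speakers))
    test_cues = ("vot_long", "speaker_0")  # a /puk/ token in a familiar voice
    return support(weights, test_cues, "buk") / support(weights, test_cues, "puk")


print(competitor_ratio(1))   # 0.5
print(competitor_ratio(18))  # about 0.053 (1/19)
```

With a single speaker, the voice cue is linked equally strongly to both words, so the wrong word receives half as much support as the right one at test. With 18 speakers, the same amount of association is spread across voices, and the wrong word receives only a small fraction of the support.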
By this account, multitalker variability might be only one of many types of variability that could yield this same effect. Variability in noncontrastive cues (as is prevalent in infant-directed speech) has been thought to be helpful for word and language learning in young infants, although relatively few reports indicate that this is indeed supportive of learning, as opposed to merely preferred by infants. Singh (2008) is a notable exception. She familiarized 7.5-month-olds to words using both high- and low-affect productions, and found that infants only segmented the words in the presence of high affective variability, that is, high prosodic variability. Similarly, infants segment words from infant-directed speech but not adult-directed speech in novel speech strings containing statistical cues to word boundaries (Thiessen, Hill, & Saffran, 2005). This raises the possibility that highly variable prosody alone may be sufficient to support word learning in this task as well.
These results suggest that the established view that infants use the statistical structure of contrastive cues to learn phonological categories (Kuhl et al., 2007; Maye et al., 2002, 2008; McMurray et al., 2009; Vallabha et al., 2007) may be incomplete. We suggest that by 14 months, even though infants appear to discriminate tokens within a dimension, they might not be fully committed to VOT as a relevant dimension for distinguishing words that vary in voicing, and must determine which dimensions are relevant by examining relative variability.
Of course, the behavioral experiments reported here and in Rost and McMurray (2009) do not offer definitive proof of our dimensional weighting account. Further empirical and computational work will be necessary to fully establish this account. However, as we argue in the subsequent sections, the dimensional weighting account is consistent with both the task demands framework for explaining the switch task and with broader exemplar models of speech (e.g., Pierrehumbert, 2003). Moreover, the use of relative variability as a mechanism of weighting crops up in numerous domains of learning and may represent a general principle of learning. Thus, when the present behavioral data are coupled with the seeming universality of such mechanisms and strong computational models (Apfelbaum & McMurray, 2010; Toscano & McMurray, 2010a), this seems to be quite a reasonable explanation.
Implications for task demands and phonological development
In the task demands framework (Werker & Curtin, 2005; Werker & Fennell, 2006), attentional demands on the infant create an apparent U-shaped developmental trend where infants’ speech perception abilities are intact and preserved, but infants are unable to access them in a difficult task, as they struggle to balance perceptual, phonological, and lexical representations.
There is no doubt that the switch task is particularly hard. Infants fail at the switch-task test but succeed at the easier looking-preference test (Yoshida et al., 2009). Nazzi’s (2005) sorting-by-name task may be more difficult still. This may not be simply an issue of general capacity limits, however: the unique way in which word–object mappings must be used in the switch task may also create task-specific difficulty (e.g., Swingley & Aslin, 2002). There are two interpretations of infants’ difficulties with this task. It could indicate that phoneme perception is robust at this age, but that a difficult task masks children’s ability to deploy these skills (e.g., Werker & Fennell, 2006). Alternatively, our work suggests that this difficult task reveals specific difficulties in speech perception.
In an easy task, such as a checkerboard dishabituation or a looking-preference task, the nature of the task only requires infants to discriminate pairs of speech sounds; it is not necessary to ignore any dimension, as a detectable difference in any of them should be sufficient to drive discrimination. In Maye et al.’s (2002, 2008) work, the relevant statistics within a cue were sufficient to alter discrimination. However, the switch task is closer to a categorization task, in which many sources of information (irrelevant or relevant) may be associated with the response. Thus, it may reveal a second component of perceptual development, dimensional weighting. Dimensional weighting is a key feature of PRIMIR (Werker & Curtin, 2005), but it was not explicitly tied to switch-task failure due to lack of empirical evidence. The results of our experiments suggest this explicit relationship. For 14-month-olds in the switch task, the statistics of contrastive cues are less helpful (as they are relevant to a problem that is already solved) than the statistics of noncontrastive cues (which are relevant to the problem of weighting).
Thus, as numerous researchers have pointed out, the nature of the task is of fundamental importance to understanding results like these (Swingley & Aslin, 2002; Werker & Fennell, 2006; Yoshida et al., 2009). However, the overall difficulty of the task perhaps does not fully describe why. Rather, what is important is the way that the task shapes how particular (and perhaps nonobvious) sources of information contribute to learning, the particular mappings that must be employed at test, and the kind of information used in those mappings (for a similar discussion, see Yoshida et al., 2009). Our interpretation of these results is not that a difficult switch task masks intact phoneme perception, but rather that this difficult task highlights an aspect of speech perception that is not yet well developed at this age. We may be left with the original conclusion of Stager and Werker (1997) that speech perception may not be developed sufficiently in 14-month-olds to fully support word learning.
Importantly, the ability of variation to shape dimensional learning is likely to break down differently depending on the acoustic/phonetic properties in question. This could explain the differences we see between perception of consonants, vowels, fricatives, and liquids (Havy & Nazzi, 2009; Nazzi, 2005; but see Mani & Plunkett, 2007, 2008; Nazzi & New, 2007). Ultimately, differences in performance across phonetic contrasts may derive less from their phonological status (e.g., consonant versus vowel) and more from the statistical structure of the cues to these contrasts, particularly when we look across multiple relevant and irrelevant dimensions. For example, cues like VOT do vary between speakers (Allen, Miller, & de Steno, 2003), but they are largely discriminable without taking this into account; therefore, high variance in speaker cues will quickly reveal the more invariant contrastive VOT cues. However, for vowels, and to a lesser extent, fricatives and liquids, contrastive and noncontrastive acoustic dimensions overlap substantially (vowels: Hillenbrand, Getty, Clark, & Wheeler, 1995; fricatives: Jongman, Wayland, & Wong, 2000). These contrastive dimensions, such as formant structure, F0, and length, are also cues that vary considerably by speaker. In order to use speaker variation to detect such differences, infants may need more sophisticated ways of dealing with this variability or may simply need to learn more about those things that contribute to variance (Cole, Linebaugh, Munson, & McMurray, 2010) before vowels can be a cue to word identity.
As a result, failures in the switch task at 14 months do not represent a reversal of development, a U-shaped curve, or a discontinuity. We suggest rather that speech perception was never fully developed at 12 months, as is evidenced by studies of older children (Nittrouer, 2002; Ohde & Haley, 1997; Slawinski & Fitzgerald, 1998). Reliance largely on discrimination measures resulted in a failure to consider other factors (like dimensional weighting) that are revealed by this task.
Mechanisms of learning
This study hints that dimensional weighting is sensitive to the relative variation along different dimensions. This may in fact represent a general principle of learning. For example, the role of irrelevant variation suggested by this work parallels mechanisms proposed by Gómez (2002) for statistically determined grammatical dependency structures. Adults and infants learned a novel grammar with nonadjacent dependency structure. When intervening elements in the dependency were long and variable, both adults and infants detected the nonadjacent dependencies. When intervening elements did not vary, participants were unable to learn the grammatical dependencies. Consequently, learning of grammatical dependencies in Gómez’s experiment requires high variability in those elements that are not criterial for determining the grammar.
Yu and Smith’s (2007) work on learning word–object mappings via cross-situational statistics illustrates the same point. In this study, subjects learned a small set of word–object mappings solely by noticing the statistical relationship between the sound and the object: whenever a given word was heard, the referent was consistently present. Importantly, competing objects were variable (with respect to the auditory word form). When the competing words were less variable (i.e., there were fewer words each competing more systematically with the referent) subjects struggled much more to learn the word–object pairings.
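The cross-situational logic described above can be sketched in a few lines of Python. This is an illustrative co-occurrence counter under stated assumptions, not Yu and Smith's (2007) design or analysis: the pseudowords, the object labels, and the two trial sets (`variable_trials` and `fixed_trials`) are hypothetical, and the learner is assumed simply to tally how often each word is heard in the presence of each object.

```python
from collections import Counter, defaultdict


def cooccurrence(trials):
    """Tally how often each word is heard while each object is in view."""
    counts = defaultdict(Counter)
    for words, objects in trials:
        for word in words:
            for obj in objects:
                counts[word][obj] += 1
    return counts


def margin(counts, word):
    """Lead of the best-supported object over the runner-up for a word."""
    ranked = counts[word].most_common()
    best = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return best - runner_up


# Hypothetical mapping: bosa->A, gasser->B, manu->C, colat->D.
# Competitors change from trial to trial.
variable_trials = [
    (("bosa", "gasser"), ("A", "B")),
    (("bosa", "manu"), ("A", "C")),
    (("gasser", "colat"), ("B", "D")),
    (("manu", "colat"), ("C", "D")),
    (("bosa", "colat"), ("A", "D")),
    (("gasser", "manu"), ("B", "C")),
]
# The same competitor accompanies "bosa" on every trial.
fixed_trials = [(("bosa", "gasser"), ("A", "B"))] * 3

print(margin(cooccurrence(variable_trials), "bosa"))  # 2
print(margin(cooccurrence(fixed_trials), "bosa"))     # 0
```

When competitors vary, the true referent pulls ahead of every alternative; when the same competitor always accompanies the referent, the counts tie and the mapping stays ambiguous.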
The variability of irrelevant rules, associations, or dimensions may be fundamental to learning. This in turn hearkens back to much older work on cue adaptation or cue neutrality (Bourne & Restle, 1959; Bush & Mosteller, 1951; Restle, 1955), from the learning theoretic tradition. In these studies, animals or adult humans learned two-alternative categorization among stimuli that varied in multiple dimensions (some informative, some not). Crucially, subjects did not know in advance what dimensions to attend to and had to determine this from the relative amount of variability. Thus, an analysis of the relative variability in the input (or its utility in predicting the word/category) may be a core mechanism of learning.
More broadly, one of the critiques commonly leveled at (and by) the statistical learning community is its need to know a priori what units to compute statistics over (Marcus & Berent, 2003; Newport & Aslin, 2004; Remez, 2005; Saffran, 2003; but see Spencer et al., 2009). This work suggests a response to that critique: the system might compute statistics over multiple dimensions simultaneously to “discover” the right ones (using simple estimates of variability or something more complex). The system thereby forms knowledge of the statistical structure of the dimension.
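One simple estimate of this kind can be sketched as follows. The Python example below is an assumption-laden illustration, not the weighting scheme of Toscano and McMurray (2010a): it weights each dimension by the share of its variance that is explained by word identity (between-word variance divided by total variance). The token values, the dimension names `vot_ms` and `f0_hz`, and the function `dimension_weights` are all hypothetical.

```python
from statistics import mean, pvariance


def dimension_weights(tokens, dimensions):
    """Share of each dimension's variance explained by word identity."""
    words = sorted({t["word"] for t in tokens})
    weights = {}
    for dim in dimensions:
        values = [t[dim] for t in tokens]
        total = pvariance(values)
        grand = mean(values)
        between = 0.0
        for word in words:
            group = [t[dim] for t in tokens if t["word"] == word]
            # Weight each word's squared deviation by its share of tokens.
            between += len(group) / len(values) * (mean(group) - grand) ** 2
        weights[dim] = between / total if total > 0 else 0.0
    return weights


# Hypothetical multitalker tokens: VOT separates the words, F0 varies by talker.
tokens = [
    {"word": "buk", "vot_ms": 5, "f0_hz": 120},
    {"word": "buk", "vot_ms": 10, "f0_hz": 210},
    {"word": "buk", "vot_ms": 0, "f0_hz": 180},
    {"word": "buk", "vot_ms": 8, "f0_hz": 250},
    {"word": "puk", "vot_ms": 60, "f0_hz": 125},
    {"word": "puk", "vot_ms": 70, "f0_hz": 205},
    {"word": "puk", "vot_ms": 65, "f0_hz": 185},
    {"word": "puk", "vot_ms": 75, "f0_hz": 245},
]

weights = dimension_weights(tokens, ["vot_ms", "f0_hz"])
print({dim: round(w, 2) for dim, w in weights.items()})
```

In this toy set, nearly all of the VOT variance lies between the two words, whereas the large F0 variance is unrelated to word identity, so VOT receives a weight near 1 and F0 a weight near 0.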
Implications for models of speech perception
This description of dimensional weighting also dovetails with work showing that speech perception in both adults and children is improved in known voices (Creel et al., 2008; Nygaard, Sommers, & Pisoni, 1994; for a review, see also Goldinger, 1998). As each speaker uses production cues differently and even has his/her own habitual VOT (Allen et al., 2003), listeners must learn to be sensitive to talker-specific intracategory differences (Allen & Miller, 2004). In light of our data, such effects could be interpreted as the remnants of dimensions that are not fully down-weighted. Speaker-specific effects have been taken to support exemplar models of speech (e.g., Goldinger, 1998; Pierrehumbert, 2003) in which contrastive and noncontrastive information are stored together as part of the word form. Our results suggest that such models might need to consider the ways that multiple dimensions are encoded and weighted, and how this changes over development.
Perhaps more importantly, a classic issue in speech perception has been the problem of invariance—how can listeners perceive the same word from highly variable acoustic streams? Classic theories have parsed “signal” (that is, the acoustic information we have labeled as being criterial) from “noise” and have attempted to explain category selection on only a few dimensions. By contrast, this work suggests that at least developmentally, the “noise” may be essential to acquiring the signal.