We have demonstrated that basic learning processes can produce many of the behavioral results seen in low-variability Switch task studies (Pater et al., 2004; Stager & Werker, 1997; Werker et al., 1998, 2002), high-variability Switch studies (Rost & McMurray, 2009, 2010), and two-alternative preferential looking studies (Ballem & Plunkett, 2005; Yoshida et al., 2009). The critical link explaining why infant behavior varies so widely across these studies appears to be the structure of the acoustic variability in the training set. When infants are exposed to variable exemplars of words, their learning is focused on the consistent pieces of information—in this case, phonological information. However, when additional information is also highly consistent, as when a single speaker produces the words with highly similar intonation, infants also associate this noncontrastive information with the object. In the end, this redundant information reduces the contrast established by the phonetic cues.
Such an account suggests that despite evidence of adult-like abilities to discriminate between phonological categories (Kuhl et al., 1992; Werker & Tees, 1984), the task of phonological acquisition is not yet complete early in the second year. A further component of this task is to determine which dimensions map meaningfully onto words and which are not useful for identifying words (Dietrich et al., 2007). To accomplish this second task, tracking the variability of different dimensions as they relate to object categories appears to be a highly effective mechanism for determining which cues are phonologically relevant.
5.1. The plausibility of overspecification
Our account of early word learning posits overspecification of lexical representations, such that acoustic information that is irrelevant to lexical identity is encoded as part of the word form. While at first such indiscriminate encoding appears to be a contentious claim, there is a sizable body of evidence supporting such an account. Thirteen-month-olds associate nonspeech sounds (e.g., beeps and whistles) with pictures as though these sounds are appropriate object labels (Woodward & Hoyne, 1999; though Hirsh-Pasek, Golinkoff, & Hollich, 2000, only found this with nonspeech sounds made by humans, e.g., sighs). Further, there is evidence from segmentation tasks that at least as late as 10.5 months, infants encode affective cues along with word forms (Singh, Morgan, & White, 2004) and at 7.5 months encode arbitrary variation in pitch (Singh, White, & Morgan, 2008). When familiarized to words with a happy affect or a specific pitch value, infants show a preference for passages containing matching words only if they contain the same prosodic cues they heard during training.
The PRIMIR framework also suggests that overspecification could account for Switch task failures at 14 months (Werker & Curtin, 2005). Under this framework, infants encode all available information, and the indexical information overwhelms processing capacity, leading children to ignore phonological information. While this accords with our view of overspecification, this account differs from our approach in suggesting that children are ignoring phonological information; instead, we suggest that children use that information, but it is insufficient to overcome the wealth of indexical information that they also use for word recognition. These talker cues suggest that the minimal pair words are the same, leading to the failure. In addition, it is unclear how PRIMIR’s account of overspecification could accommodate the benefit of multiple speaker training in the Switch task.
More directly, adults associate speaker-identity cues with lexical items (e.g., Bradlow, Nygaard, & Pisoni, 1999; Creel, Aslin, & Tanenhaus, 2008). Such associations are also reasonable assumptions for development—languages like Chinese use pitch phonologically, and many cues to speaker identity overlap with cues to phonation type (which is contrastive in languages like Green Mong; Andruski, 2006). Thus, children should not have innate constraints on which cues will be contrastive in their language; they must learn this to cope with the incredible diversity seen across phonologies of the world’s languages. Our account presents a mechanism for learning which acoustic dimensions are most useful phonologically: The relative pattern of covariation between cues and words leads the child to down-weight noncontrastive cues and up-weight contrastive ones. Tracking variability in different acoustic dimensions is a highly effective tool in determining which dimensions should be attended. Crucially, this mechanism is intimately bound up with the emerging lexicon—variability is tracked with respect to specific words and has its effect in the associations between specific acoustic cues and words.
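This covariation mechanism can be pictured as a simple error-driven associative learner. The sketch below is our illustration, not the paper's actual simulations; the cue coding, values, and learning rate are invented. Weights run from a contrastive consonant cue and a noncontrastive talker cue to two objects. When the talker cue varies across trials (as in multiple-speaker training), its associative weights decay toward zero; when it is constant (single-speaker training), it absorbs a share of the weight and dilutes the contrast carried by the consonant cue.

```python
import random

# Minimal delta-rule sketch (illustrative; cue coding and values invented).
# Two words ("b..." vs. "d...") each label one of two objects.
# Cues: a contrastive consonant cue, plus a talker cue that is either
# constant (single-speaker training) or random (multiple-speaker training).

def train(variable_talker, n_trials=2000, lr=0.1, seed=1):
    rng = random.Random(seed)
    # w[cue] = [weight to object 0, weight to object 1]
    w = {"b": [0.0, 0.0], "d": [0.0, 0.0], "talker": [0.0, 0.0]}
    for _ in range(n_trials):
        obj = rng.randrange(2)                   # which object is labeled
        cues = {"b" if obj == 0 else "d": 1.0,
                "talker": rng.choice([-1.0, 1.0]) if variable_talker else 1.0}
        for o in range(2):                       # delta rule per object unit
            target = 1.0 if o == obj else 0.0
            err = target - sum(w[c][o] * v for c, v in cues.items())
            for c, v in cues.items():
                w[c][o] += lr * err * v
    return w

multi = train(variable_talker=True)    # talker weights shrink toward zero
single = train(variable_talker=False)  # talker cue keeps a share of the weight
```

Under multiple-speaker training the talker cue is uncorrelated with the objects, so its weights are driven to zero and the consonant cue carries the full contrast; under single-speaker training the constant talker cue ends up sharing associative weight with the consonant cue for both objects, which is exactly the dilution of contrast described above.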
Our assumption that the 14-month-old does not know which cues are phonemically relevant would seem to discount the learning already completed to this point. However, we are not arguing that 14-month-olds have not learned anything—they can clearly discriminate speech sounds (Eimas, Siqueland, Jusczyk, & Vigorito, 1971; Werker & Tees, 1984), and they may also be able to categorize them (although categorization is irrelevant to Switch task studies, since none have ever asked infants to generalize across different exemplars in a substantive way [e.g., to a new speaker]).
Indeed, in the aforementioned studies by Singh et al. (2004, 2008), infants’ ability to generalize across affects seems to slightly improve between 7.5 and 10.5 months. Thus, the ability to ignore this form of irrelevant variation is clearly developing as well. However, this has only been tested in a discrimination task using highly differentiable, well-known words (bike vs. hat), and so may not be sufficient for learning of minimal pairs at later ages. Thus, while infants may have begun this process of dimensional weighting by 14 months, there is no reason to think it is complete. Indeed, Goldinger (1998) has shown that even into adulthood, speaker-specific information plays an important role in lexical encoding, especially for words that are heard infrequently.
Moreover, if infants already knew which dimensions to attend for word learning, it is unclear why they would learn better in the Rost and McMurray (2009, 2010) studies than in SST studies, and why learning (nonminimal) pairs of words with similar contrasts can improve learning of some contrasts (Thiessen, 2007, 2010). Natural learning settings may allow infants to learn minimal pairs (e.g., ball and doll), because of natural variability in nonlinguistic information in these settings; the SST, however, plays to their weakness by reinforcing encoding of noncontrastive cues. Indeed, while we have focused on auditory information as the source of false associations, it is possible that additional contextual cues could be encoded early in word learning, as is often seen in memory tasks (Godden & Baddeley, 1975).
5.2. Levels of representation
Our model assumes direct connections between acoustic information and lexical items, without intermediate (e.g., phonetic) representations. Such direct connections between acoustic information and object representations are consistent with exemplar models of speech perception and word recognition (Goldinger, 1998; Hawkins, 2003; Pierrehumbert, 2001), and with associated evidence that indexical information affects adult spoken word recognition (Bradlow et al., 1999; Creel et al., 2008; Goldinger, 1998). However, our model contrasts with pure exemplar models in two ways. First, we predict that the pattern of statistical co-occurrence between cues and words eventually results in noncontrastive information (like F0 and indexical information) receiving significantly less weight than contrastive cues (though perhaps in a word-specific way). This can be accomplished by incorporating dimensional weighting into an exemplar approach (Kruschke, 1992, 2001; Nosofsky, 1986; Regier, 2005). While infant speech perception may start like a pure exemplar model, over the course of learning, the ability to (at least partially) abstract away from irrelevant cues emerges.
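The role of dimensional weighting in an exemplar approach can be illustrated with an attention-weighted similarity computation in the spirit of Nosofsky's GCM and Kruschke's ALCOVE. The example below is our toy illustration; the exemplar values, cue names, and the idea of a recognition threshold are invented, not drawn from the paper's simulations. With equal attention to all dimensions, a trained word produced by a new talker matches the stored exemplars only weakly; down-weighting the talker dimension restores a strong match.

```python
import math

# Toy attention-weighted exemplar model (all values invented).
# Training exemplars from a single talker: (voicing_cue, talker_cue, word).
exemplars = [(0.1, 0.9, "bih"), (0.9, 0.9, "dih")]

def best_match(probe, attention, c=4.0):
    # similarity = exp(-c * attention-weighted city-block distance),
    # summed over the exemplars stored for each word
    sims = {}
    for v, t, word in exemplars:
        d = (attention["voicing"] * abs(probe[0] - v)
             + attention["talker"] * abs(probe[1] - t))
        sims[word] = sims.get(word, 0.0) + math.exp(-c * d)
    word = max(sims, key=sims.get)
    return word, sims[word]

new_talker_bih = (0.1, 0.1)   # same consonant, very different voice

equal = best_match(new_talker_bih, {"voicing": 1.0, "talker": 1.0})
tuned = best_match(new_talker_bih, {"voicing": 1.0, "talker": 0.1})
```

With equal attention, the talker mismatch drags total similarity to a small fraction of its tuned value, so the token may fall below any plausible recognition threshold even though the contrastive cue matches perfectly; after down-weighting the talker dimension, the same token is a strong match to the right word.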
Second, and perhaps most important, individual exemplars are not stored in long-term memory—rather, the weights of the network store an accumulation of their traces. Traditionally, exemplar approaches have confounded the encoding of speaker-specific information and the storage of individual lexical exemplars. Our approach shows that these two aspects can be separated; while our model does not need to store specific exemplars of spoken items, it remains sensitive to speaker-specific cues. By incorporating the exemplar-type sensitivity to speaker in a more prototype-style model, associative learning may offer an alternative account of such effects without the controversial claims regarding storage of huge sets of exemplars.
5.3. Other influences in early word learning
While our associative account can explain a fairly large set of the findings in word learning in 14-month-olds, there are a number of other factors that likely play an additional role. For example, as the lexicon grows, infants become more able to learn minimal pair words even with single speakers (Werker et al., 2002). At some point infants become able to generalize their learning of which dimensions are meaningful to new words that they are learning. As currently instantiated, our model weights cues on a word-by-word basis—it cannot generalize this learning across words.
There are multiple ways one may be able to account for this. We earlier described perceptual learning mechanisms that weight dimensions as a whole (perhaps using the relative variability along each dimension as a source of information; Toscano & McMurray, 2010). This may operate alongside the associative mechanisms outlined here, and it would need to work more slowly than these associative mechanisms to yield the right pattern of data. Such a division of labor is reasonable: it would not be too costly if infants committed to the wrong cue-weighting for a single word, while it would be substantially worse if an entire dimension were ruled out prematurely. Thus, one would want associative forms of cue-weighting to work faster than perceptual learning.
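As a concrete illustration of how relative variability could drive whole-dimension weighting, the sketch below assigns each acoustic dimension a weight proportional to how well it separates words relative to its within-word spread. The Fisher-style variance ratio and all token values are our invention for illustration, not Toscano and McMurray's actual statistic.

```python
import statistics

# Hypothetical variance-ratio weighting (statistic and values invented).
# tokens_by_word: {word: [(dim1, dim2, ...), ...]}

def dimension_weights(tokens_by_word):
    n_dims = len(next(iter(tokens_by_word.values()))[0])
    raw = []
    for d in range(n_dims):
        per_word = [[t[d] for t in toks] for toks in tokens_by_word.values()]
        # within-word spread vs. between-word separation on this dimension
        within = statistics.mean(statistics.pvariance(v) for v in per_word)
        between = statistics.pvariance([statistics.mean(v) for v in per_word])
        raw.append(between / (within + 1e-9))
    total = sum(raw)
    return [r / total for r in raw]

# Dimension 0: a VOT-like cue (separates the words, stable within each word).
# Dimension 1: pitch (varies across talkers, similar across words).
weights = dimension_weights({
    "bih": [(10, 200), (12, 180), (9, 260)],
    "dih": [(40, 210), (42, 190), (38, 250)],
})
```

On these invented tokens, nearly all of the weight lands on the VOT-like dimension, because pitch varies widely within each word but barely differs between them, which is the signature of a noncontrastive cue in multiple-talker input.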
Alternatively, Thiessen (2007) presents data suggesting that the distributional structure of the lexicon may guide learning through feedback mechanisms. When children learned very distinct words (dawbow and tawgoo), they were subsequently better able to learn the minimal pair (daw and taw). Upon hearing daw, children may partially activate the learned word dawbow. Feedback from this word could boost activation of the overlapping phonological information, thus helping draw attention to the relevant aspects of word identity. Even if irrelevant dimensions like talker voice were still being considered, this feedback could help overcome their influence. Such a feedback mechanism may explain why children gradually become better able to learn minimal pair words. As phonological neighborhoods become more populated, a greater number of words will be coactivated during novel word learning, leading to a large amount of feedback suppressing irrelevant information. Indeed, Thiessen (2010) has shown just such an effect of populating phonological neighborhoods. While learning dawbow and tawgoo was sufficient to allow learning of daw and taw, this learning did not generalize to other vowel contexts (dee and tee). However, exposure to a wider array of vowel contexts before minimal pair learning helps infants generalize their learning to new contexts.
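A toy version of this feedback loop (architecture and all numbers invented for illustration) makes the logic concrete: hearing daw activates its phonological units, which partially activate the known word dawbow; feedback from that word then boosts the phonological units it shares with the input, while a talker-voice unit, which has no lexical connections, receives no boost.

```python
# Toy interactive-activation sketch (units and numbers invented).
# Input "daw" turns on its phonological units plus a talker-voice unit.
units = {"d": 1.0, "aw": 1.0, "b": 0.0, "ow": 0.0,
         "t": 0.0, "g": 0.0, "oo": 0.0, "talker": 1.0}

lexicon = {"dawbow": ["d", "aw", "b", "ow"],
           "tawgoo": ["t", "aw", "g", "oo"]}

def feedback_cycle(units, lexicon, fb=0.5):
    # Bottom-up: each word's activation is the mean of its phoneme units.
    word_act = {w: sum(units[p] for p in ps) / len(ps)
                for w, ps in lexicon.items()}
    # Top-down: each word feeds activation back to its own phonemes only;
    # the talker unit is unconnected to the lexicon, so it gets no boost.
    boosted = dict(units)
    for w, ps in lexicon.items():
        for p in ps:
            boosted[p] += fb * word_act[w]
    return boosted, word_act

boosted, word_act = feedback_cycle(units, lexicon)
```

After one cycle the phonological units of the input stand out above the talker unit, even though both started at the same activation, which is the sense in which lexical feedback could draw attention toward contrastive information without any explicit instruction about which dimensions matter.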
The lexical-feedback account could complement the associative mechanisms presented in our simulations. At its core, this account suggests that learners achieve dimensional weighting through links between acoustics and concepts. As more words are learned, the consistent weak association between noncontrastive information and lexical identity is more easily generalized to new words. A purely perceptual mechanism for dimensional weighting may have more difficulty accounting for such an effect, as it appears to depend on having formed links between acoustics and concepts.
Finally, word learning in the Switch task may not be an entirely auditory problem. Presentation of known objects with their names prior to Switch task training yields better word learning (Fennell & Waxman, 2010). This suggests the influence of higher level systems on word learning, though it is unclear from this work what those systems might be. The initial presentation may contribute to speaker familiarity or normalization affecting the way speaker cues are coded, or it may alter the attentional state of the infant, changing the way activation thresholds map to responses. Regardless, such results may complement our own by suggesting that this associative core must be embedded in a richer system that supports language behavior and learning.
In this light, the associative mechanisms detailed in this paper are not meant to be an exhaustive model of word learning, but rather to represent one key component to the system. A number of features could be added to our model to make it a more complete account of word learning, including the addition of intermediate stages between the auditory and visual units (Schafer & Mareschal, 2001), dimensional attention weighting (Kruschke, 1992, 2001; Regier, 2005), competition between units, or feedback (McMurray et al., 2009). However, the primary goal of this study was to investigate the locus of variability effects in early word learning, not to provide a comprehensive model of word learning. From these simple simulations, it appears that the influence of variability in early word learning studies is related to how associative mechanisms link acoustic information to visual categories.