Effects of Visual Information on Adults’ and Infants’ Auditory Statistical Learning


Correspondence should be sent to Erik D. Thiessen, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213. E-mail: thiessen@andrew.cmu.edu


Infant and adult learners are able to identify word boundaries in fluent speech using statistical information. Similarly, learners are able to use statistical information to identify word–object associations. Successful language learning requires both feats. In this series of experiments, we presented adults and infants with audio–visual input from which it was possible to identify both word boundaries and word–object relations. Adult learners were able to identify both kinds of statistical relations from the same input. Moreover, their learning was facilitated by the presence of two simultaneous relations. Eight-month-old infants, however, did not appear to benefit from the presence of regular relations between words and objects. Adults, like 8-month-olds, did not benefit from regular audio–visual correspondences when they were tested with tones rather than linguistic input. These differences in learning outcomes across age and input suggest that both developmental and stimulus-based constraints affect statistical learning.

1. Introduction

Learners are able to identify many different kinds of statistical regularities from linguistic input, including phonological and syntactic patterns (e.g., Chambers, Onishi, & Fisher, 2003; Mintz, 2002; Thiessen & Saffran, 2003). Despite the power of statistical learning, though, there is little doubt that human learners are constrained. Learners do not identify all kinds of statistical patterns equally well (e.g., Newport & Aslin, 2004; Peperkamp, Le Calvez, Nadal, & Dupoux, 2006; Redford, 2008; Saffran & Thiessen, 2003). However, most of the research into constraints on human learning has focused on how learners perform when presented with a single learning task. This is insufficient for a complete understanding of statistical learning for two reasons. First, language frequently presents learners with multiple problems simultaneously. For example, when exposed to a novel word form in fluent speech, learners have the opportunity to learn both the word form and its referent. Second, constraints on learning may be especially important when the input is complex enough to support multiple learning problems (e.g., Fiser & Aslin, 2002; Pinker, 1984).

Consider the interaction between the statistical information useful for segmenting words from fluent speech (e.g., Saffran, Aslin, & Newport, 1996) and identifying referents for words (e.g., Smith & Yu, 2008). Taken in isolation, both word segmentation (e.g., Thiessen, Hill, & Saffran, 2005; Toro, Sinnett, & Soto-Faraco, 2005) and referential learning are constrained (Golinkoff, Shuff-Bailey, Olguin, & Ruan, 1995; Landau, Smith, & Jones, 1988; Markman, 1990; Markman & Wachtel, 1988). It is also clear that these learning processes interact. Learners who are previously familiar with a word form map it more easily to a novel referent (e.g., Graf Estes, Evans, Alibali, & Saffran, 2007; Storkel, 2001). Conversely, children map familiar objects to novel labels more easily than unfamiliar objects (e.g., Hall, 1991). Because these learning tasks interact, different constraints may operate when learners are presented with both problems simultaneously. If the interaction between the problems is unconstrained, the additional complexity when they are presented together may hinder learning (Fiser & Aslin, 2002; Pinker, 1984). Alternatively, learning may be constrained in such a way that the learning occurs sequentially, with one problem learned before the other. It is even possible that learning, if appropriately constrained, could be facilitated by the simultaneous presentation of multiple regularities. This could be the case if learning of one regularity provides information that supports learning of the second.

To explore these possibilities, it is critical to present learners with the opportunity to identify simultaneous regularities. This set of experiments did so by building on prior research demonstrating that learners benefit from the embedding of audio input in a visual context (e.g., Hollich, Newman, & Jusczyk, 2005). Appropriate visual information helps learners determine whether speakers are producing one language or multiple languages (Soto-Faraco et al., 2007). Similarly, the presence of a video improves adults’ ability to identify word boundaries in fluent speech (Sell & Kaschak, 2009). In all of these tasks, however, the auditory learning task is the only task, and vision facilitates that task. The current experiments differ by presenting learners with two problems simultaneously: word segmentation and discovery of word–object relations. This better simulates the richness of language, where any single utterance may provide information about many different aspects of language (e.g., Saffran & Wilson, 2003).

2. Experiment 1

All learners in this experiment were presented with words embedded in fluent speech. As in previous statistical learning experiments (e.g., Saffran et al., 1996), these words could be segmented via transitional probabilities that were high within words, and low at word boundaries. A subset of the participants in this experiment (in the no-video condition) were presented solely with fluent speech. The only learning task this group faced was identifying word boundaries.

A second group of participants (in the regular-video condition) saw objects synchronized to the onset and offset of the words in the fluent speech. Each word in the fluent speech was consistently paired with a unique object. As such, this group of participants was presented with two potential statistical regularities to learn: word boundaries, and the relations between particular words and objects.

A third group of participants (in the irregular-video condition) also saw shapes synchronized to words in fluent speech, but these participants saw objects that were not consistently associated with the words. This condition serves as a control to make sure that performance in the regular-video condition is not affected by some aspect of the visual stimuli other than the regular relation between words and shapes.

2.1. Method

2.1.1. Participants

Participants were 60 undergraduates at Carnegie Mellon University. Twenty participants apiece were randomly assigned to one of three stimulus conditions: no-video, regular-video, or irregular-video.

2.1.2. Stimuli

Audio stimuli. All participants were exposed to a stream of synthesized speech used in Saffran et al.’s (1996) experiments. This artificial language contained four words: padoti, bidaku, tupiro, and golabu. The transitional probabilities between syllables within a word were 1.0, and the transitional probabilities between syllables across word boundaries were .33. Two words (bidaku and tupiro) and two part-words (tigola and bupado) were used as test items. Unlike words, part-word test items contained a transition between syllables with low transitional probability.

Visual stimuli. In the no-video condition, participants saw a static checkerboard image for the duration of their exposure to the synthesized speech.

Participants in the regular-video and irregular-video conditions saw looming shapes synchronized with the word boundaries. Shapes appeared at the same instant the word began to play and remained onscreen for the duration of the word. At the beginning of a word, each shape occupied roughly 1/16th of the screen. Over the course of the presentation of the word, the shape increased in size until it filled the screen.

In the regular-video condition, each word was paired with a particular object (padoti: white cross; bidaku: green diamond; tupiro: purple heart; golabu: yellow hexagon). In the irregular-video condition, words and shapes co-occurred with no consistent pattern. The irregular-video condition is necessary because it may be possible to use the video image as a cue to word boundaries. To indicate that discovering word–object relations (as opposed to the mere presence of objects) is facilitative, adults must perform better in the regular-video condition than in the irregular-video condition.
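The transitional-probability structure of the audio stream described above can be illustrated with a short simulation. This is only a sketch under stated assumptions: the function names, the stream length, and the random seed are hypothetical, and the rule that a word never immediately follows itself is an assumption adopted here because it yields the reported boundary probability of roughly .33 with four words. It is not a reconstruction of the original stimulus-generation procedure.

```python
import random
from collections import Counter

# The four words of the artificial language, as listed in the Stimuli section.
WORDS = ["padoti", "bidaku", "tupiro", "golabu"]


def syllables(word):
    """Split a six-letter CVCVCV item into its three CV syllables."""
    return [word[i:i + 2] for i in range(0, len(word), 2)]


def make_stream(n_words, seed=0):
    """Concatenate randomly ordered words into one syllable stream.

    Assumption (not stated in the text): a word never follows itself,
    so each word boundary has three equally likely continuations.
    """
    rng = random.Random(seed)
    stream, previous = [], None
    for _ in range(n_words):
        word = rng.choice([w for w in WORDS if w != previous])
        stream.extend(syllables(word))
        previous = word
    return stream


def transitional_probabilities(stream):
    """TP(y | x) = count(x immediately followed by y) / count(x)."""
    pair_counts = Counter(zip(stream, stream[1:]))
    first_counts = Counter(stream[:-1])
    return {(x, y): c / first_counts[x] for (x, y), c in pair_counts.items()}


stream = make_stream(3000)
tps = transitional_probabilities(stream)
print(tps[("pa", "do")])  # within-word transition (inside padoti)
print(tps[("ti", "go")])  # boundary transition (inside the part-word tigola)
```

In this sketch the within-word transition is exactly 1.0, whereas the transition spanning a word boundary hovers near one third, which is the contrast that distinguishes words from part-words at test.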

2.1.3. Procedure

In all three conditions, participants sat in front of a portable DVD player with a 10″ screen, wearing airline-pilot style headphones. Participants were simply informed that after watching the video, they would answer a series of questions about what they saw and heard.

Segmentation test. There were 16 two-alternative forced-choice questions in the segmentation test. For each question, participants heard a word and a part-word (in counterbalanced order), separated by 1 s of silence. They were asked to circle the item that sounded more like the speech they heard (for discussion of this procedure, see Saffran, Newport, Aslin, Tunick, & Barrueco, 1997).

Word–shape correspondence test. After completion of the 16 segmentation test items, participants in the regular-video condition only were informed that they would now answer an additional series of 16 questions. These questions assessed whether participants learned that particular words corresponded to particular shapes. For each question, participants heard one of the four words from the synthesized speech. They then saw a sequence of four shapes on the screen, looming with the same animation as during the initial exposure. They were asked to circle which of the four shapes went with the word. If discovering word–object relations facilitates segmentation, higher scores on the word–shape correspondence test should be correlated with higher scores on the segmentation test.

2.2. Results and discussion

A one-way ANOVA was performed on participants’ scores on the word-segmentation test as a function of condition. There was a significant effect of condition, F(2, 57) = 7.7, p < .01. Participants performed best in the regular-video condition (M = 12.0, SD = 2.6), and less well in the irregular-video (M = 9.8, SD = 3.0) and no-video (M = 8.9, SD = 2.0) conditions. Scores in all three conditions differed from chance (regular-video: t(19) = 6.9, p < .01; irregular-video: t(19) = 2.8, p < .05; no-video: t(19) = 2.1, p < .05). To follow up the effect of condition indicated by the ANOVA, planned t tests were performed. Here and elsewhere, all t tests reported are two-tailed. There was no significant difference between participants’ performance in the no-video and irregular-video conditions, t(38) = 1.1, p = .30. However, participants in the regular-video condition scored significantly better than participants in either of the other two conditions (regular-video vs. no-video: t(38) = 3.4, p < .01; regular-video vs. irregular-video: t(38) = 2.2, p < .05; Fig. 1).
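For readers who wish to see how a comparison against chance of this kind is formed, the following sketch computes a one-sample t statistic from summary statistics alone. The function name is hypothetical, and the calculation assumes the standard one-sample formula with n = 20 participants and a chance score of 8 out of 16. Because it starts from rounded means and standard deviations rather than raw data, it should be read as an illustration and need not reproduce every reported value exactly.

```python
import math


def one_sample_t(mean, sd, n, chance):
    """t statistic for a sample mean tested against a fixed chance value."""
    standard_error = sd / math.sqrt(n)
    return (mean - chance) / standard_error


# Regular-video condition: M = 12.0, SD = 2.6, n = 20, chance = 8 of 16.
t_regular = one_sample_t(mean=12.0, sd=2.6, n=20, chance=8)
print(round(t_regular, 1))  # 6.9
```

The statistic is evaluated on n - 1 = 19 degrees of freedom, which corresponds to the t(19) notation used for the comparisons against chance.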

Figure 1.

 Scatterplot of participants’ responses (out of 16; chance = 8) on the segmentation test in Experiment 1.

The fact that participants performed best in the regular-video condition indicates that the presence of regular word–object relations facilitated learning. Consistent with this hypothesis, participants successfully identified these regular word–object relations. On average, participants scored 7.5 (out of 16; chance = 4) correct on the correspondence test (SE = 0.8), which was significantly above chance, binomial p < .01. Further, as illustrated by Fig. 2, the correlation between the two tests was positive, r = .64, and significant, p < .01. Higher scores on one test were associated with higher scores on the other.

Figure 2.

 Scatterplot of the relation between participants’ scores on the segmentation test and correspondence test for participants in the regular-video condition.

These results converge to indicate that participants identified relations between words and objects, and that detecting these relations facilitated word segmentation. Logically, the correlation is also consistent with another plausible hypothesis: that success in word segmentation facilitates learning a relation between words and shapes. Indeed, this hypothesis is almost certain to be true in the natural language learning environment, so it may well partially explain the observed correlation in this experiment. Recall, however, that the conditions (no-video, regular-video, and irregular-video) manipulated the availability of the regular relation between words and shapes. The fact that participants performed best on the segmentation test in the regular-video condition indicates that detecting the available relation between words and shapes facilitates performance on the segmentation test. This does not rule out the converse causal relation, that segmentation facilitates learning relations between words and shapes. But it does indicate that the opportunity to identify a regular relation between words and shapes (only available in the regular-video condition) facilitates performance on tests of word segmentation.

3. Experiment 2

Prior experiments have demonstrated that infants are able to segment words from fluent speech via transitional probabilities (e.g., Saffran et al., 1996) and to identify relations between words and shapes (e.g., Stager & Werker, 1997; Thiessen, 2007), but no experiments have assessed both simultaneously. Because infants are the primary learners of language, their performance is both theoretically and pragmatically important and can clarify issues related to continuity of learning across the lifespan and the characteristics of the input that facilitate language acquisition. If infants, like adults, learn best when there are multiple relations available to be learned in the input, they should distinguish between words and part-words most successfully in the regular-video condition. In contrast, if infants learn most easily when presented with simplified stimuli, they should be most successful after exposure to the no-video stimuli.

3.1. Method

3.1.1. Participants

Participants in this experiment were 45 infants between the ages of 7.5 and 9 months (M = 8.12). Infants were randomly assigned to one of three groups: no-video, regular-video, and irregular-video. In order to obtain data from 45 infants, it was necessary to test 59. The additional 14 infants were excluded for the following reasons: fussing or crying (10), parental interference (2), and experimenter error (2). According to parental report, all infants were full term and free of ear infections at the time of testing.

3.1.2. Procedure

This experiment used a slightly modified version of the Headturn Preference Procedure (HPP), presenting the visual stimuli on a central monitor rather than from the side of the room. Preferential looking experiments with a central monitor are commonly and successfully used with infants (e.g., Fernald, 1985). Infant participants were seated on their parents’ lap in a sound-isolated room, approximately one foot away from a 30″ monitor. There were two speakers adjacent to the monitor and a camera mounted above it. The parents wore noise-canceling headphones to eliminate bias. An experimenter outside the room watched the infants over a closed-circuit monitor to initiate test trials and code the direction of the infants’ gaze.

There were two phases to this experiment: the segmentation phase and the test phase. During the segmentation phase, infants heard the synthesized speech from speakers adjacent to the monitor, while the monitor displayed the visual stimuli appropriate to the infants’ condition.

The test phase used the same two words and two part-words as the adult test. Each item was repeated three times, for a total of 12 trials. Before each trial, an attention-getter (a brightly colored Winnie the Pooh video, coupled with an excited exclamation) attracted infants’ gaze to the monitor. Once the infant oriented to the monitor, the experimenter initiated the test trial. Each trial consisted of a repetition of a single word (or part-word), with a pause of 1 s between repetitions. For as long as infants gazed at the monitor, the test item continued to repeat. When infants looked away from the monitor for two continuous seconds, the test trial ended.

3.1.3. Stimuli

The stimuli during the segmentation phase were identical to the audio and video presentations used in Experiment 1 and were presented for an equivalent amount of time. During the test phase, words and part-words were paired with an orange bar rotating like a propeller (it completed one revolution every 3 s). Pilot testing indicated that infants were far more likely to maintain their interest in the experiment if the monitor displayed a moving object rather than a static image. Both the color and the shape of the bar were novel with respect to the segmentation phase of the experiment, and the motion was unlike the looming animation infants saw during the segmentation phase.

3.2. Results and discussion

As illustrated in Fig. 3, infants in all three conditions looked longer at part-word trials than word trials, replicating the results of Saffran et al. (1996). In the no-video condition, infants looked at word trials for 8.9 s (SD = 3.5) and at part-word trials for 9.9 s (SD = 3.7). In the regular-video condition, infants looked at word trials for 7.7 s (SD = 3.5), and at part-word trials for 9.1 s (SD = 3.7). In the irregular-video condition, infants looked at word trials for 7.8 s (SD = 3.0), and at part-word trials for 9.1 s (SD = 2.8). Paired t tests indicate that the difference between looking times on word and part-word trials was significant in all three conditions (no-video: t(14) = 2.5, p < .05; irregular-video: t(14) = 2.4, p < .05; regular-video: t(14) = 2.5, p < .05). These results indicate that infants in all three conditions learned enough about words to distinguish them from part-words.

Figure 3.

 Eight-month-old infants’ looking times to word and part-word test-trials in the No-Video, Irregular-Video, and Regular-Video conditions.

A 2 (test item) × 3 (condition) ANOVA was used to compare performance across the three conditions. There was a main effect of test item, F(1, 42) = 17.9, p < .01, due to the fact that infants in all three conditions preferred the part-word test trials to the word test trials. However, there was neither a main effect of condition, F(2, 42) < 1, ns, nor any interaction between condition and test item, F(2, 42) < 1, ns. This analysis indicates that infants performed equivalently in all three conditions. In contrast, adults were much more successful segmenting words from fluent speech in the regular-video condition than in either the irregular-video or no-video condition. These results raise a question: why do adults benefit from regular relations between words and objects, while infants fail to benefit?

One explanation for infants’ inability to benefit from the regular-video condition is that infants failed to detect the relations between words and shapes present in the regular-video condition. This suggestion is consistent with a variety of converging evidence indicating that 8-month-old infants are relatively insensitive to relations between words and objects in the visual world. Infants at this age have a small vocabulary (e.g., Fenson et al., 2002), and they have difficulty in acquiring names for novel objects in controlled laboratory experiments (e.g., Werker, Cohen, Lloyd, Casasola, & Stager, 1998). If infants do not detect the relation between words and objects in the regular-video condition, they cannot benefit from any facilitation that identifying the relation provides to adult learners.

Indeed, additional testing is consistent with this hypothesis. Although infants in this experiment were not tested on their knowledge of word–object relations, an additional sample of sixteen 8-month-olds was presented with test trials assessing knowledge of word–object pairings. These infants were familiarized with the segmentation stimuli from the regular-video condition, for an equivalent length of time. But rather than being tested on words and part-words, all infants were presented with a word (each infant was tested with only a single word, although all four words were used as test items across the group) paired either with the correct object or with an object that had previously been paired with a different word. These infants showed no difference in their looking time to correct pairings (M = 8.2, SD = 2.9) vs. incorrect pairings (M = 8.4, SD = 3.1), t(15) < 1, ns. This finding replicates prior results indicating that, for infants of this age, identifying word–object relations from a brief laboratory exposure is a difficult task (e.g., Stager & Werker, 1997; Werker et al., 1998).

Infants’ apparent failure to discover word–object relations should not be taken as evidence that infants are insensitive to the presence of the looming objects, or that they are unable to detect audio–visual relations at this age. A wide variety of research indicates that infants are able to make associations between audio and visual stimuli, from studies of early word comprehension (e.g., Tincoff & Jusczyk, 1999) to experiments identifying factors that facilitate infants’ discovery of audio–visual associations (e.g., Bahrick, Flom, & Lickliter, 2002; Gogate & Bahrick, 1998; Lewkowicz, 1986, 2003). But although the ability to learn audio–visual associations has been suggested to be a necessary foundation for later word learning (e.g., Gogate, Walker-Andrews, & Bahrick, 2001), it is not sufficient for adult-like word-learning competence. Young children may be able to discover word–object relations in tasks like the ones used in these experiments (e.g., Smith & Yu, 2008). However, Experiments 1 and 2 converge with a variety of other results (e.g., Stager & Werker, 1997; Werker et al., 1998) to suggest that infants are less adept at discovering such relations than adults. In turn, this makes it less likely that the presence of word–object relations can influence other aspects of learning.

4. Experiment 3

There are at least two plausible hypotheses that can explain why adults detect and benefit from the presence of regular relations between words and shapes, whereas 8-month-olds do not. The first is that adults’ ability to take advantage of the regular-video condition is due to the fact that they are faster, more efficient information processors than young infants (e.g., Pelphrey & Reznick, 2003). To detect a relation between words and shapes, learners must be able to process the identity of the shape (and the word) in the time it is on screen. There are several experimental results suggesting that young infants are less successful in processing multiple sources of information than older infants and adults (e.g., Robinson & Sloutsky, 2007; Stager & Werker, 1997). An alternative hypothesis is that the difference between 8-month-olds and older learners relates to their degree of prior linguistic experience. Adults are well aware that one of the primary functions of words is to refer to features of the visual world such as shape. Eight-month-olds may not yet expect to discover relations between words and shapes (Werker et al., 1998).

Although these hypotheses relate to the difficulties experienced by infant learners, they also suggest predictions about adult learning. From an experience-based perspective, the facilitation adults receive from regular word–object-based relations should be limited to stimuli for which adults expect to discover such relations. For nonlinguistic stimuli, there should be less (or no) facilitation. As a first step toward assessing the experience hypothesis (though further research with infants is required), adult participants in this experiment were exposed to nonlinguistic stimuli. Each of the words from the synthesized speech used in Experiment 1 was replaced by a sequence of three tones. This transformation leaves the statistical structure of the acoustic sequence intact, so that learners can still use transitional probabilities to segment the input (e.g., Saffran, Johnson, Aslin, & Newport, 1999). Importantly, adults should not have strong expectations that tonal sequences correspond to objects. Thus, on the experience-based account, adults’ learning should not be facilitated by the presence of a regular relation between tones and shapes. In contrast, on the capacity account, there is little reason to expect that adults should perform any differently than they did with speech.

4.1. Method

4.1.1. Participants

Participants were 60 Carnegie Mellon undergraduates. None had participated in Experiment 1. Twenty participants apiece were randomly assigned to the no-video, regular-video, and irregular-video conditions.

4.1.2. Stimuli

To create the audio stimuli for this experiment, the syllables from the fluent speech in Experiment 1 were replaced with tones. After replacing each word with a three-tone sequence (called a tone-word), there were four unique tone-words in the audio stimuli. Thus, padoti became the three-tone sequence G#A#F; bidaku was transformed to CC#D#; tupiro to BF#G; and golabu to ADE. Tones were presented to listeners at the same average amplitude and duration as the speech of Experiment 1. Note that these tone-words can be distinguished both by their absolute pitch and by their relative pitch (the changes between particular notes).
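The substitution described above can be sketched as a one-to-one mapping from syllables to tones. The sketch makes two assumptions that go beyond the text: that the three tones of each tone-word replace the three syllables of the corresponding word in order, and that the helper names are hypothetical. The syllable strings "pirogo" and "labutu" are likewise introduced here only to show which syllable sequences would correspond to the tone part-words used at test.

```python
# Syllable-to-tone assignment implied by the four reported tone-words
# (padoti -> G#A#F, bidaku -> CC#D#, tupiro -> BF#G, golabu -> ADE),
# assuming tones replace syllables position by position.
SYLLABLE_TO_TONE = {
    "pa": "G#", "do": "A#", "ti": "F",
    "bi": "C", "da": "C#", "ku": "D#",
    "tu": "B", "pi": "F#", "ro": "G",
    "go": "A", "la": "D", "bu": "E",
}


def syllables(item):
    """Split a six-letter CVCVCV item into its three CV syllables."""
    return [item[i:i + 2] for i in range(0, len(item), 2)]


def tone_word(item):
    """Replace each syllable of an item with its assigned tone."""
    return "".join(SYLLABLE_TO_TONE[s] for s in syllables(item))


# Twelve syllables map onto twelve distinct tones, so every transitional
# probability in the syllable stream carries over unchanged to the tone stream.
assert len(set(SYLLABLE_TO_TONE.values())) == len(SYLLABLE_TO_TONE) == 12

print(tone_word("tupiro"))  # BF#G, a tone-word
print(tone_word("pirogo"))  # F#GA, spans the boundary between tupiro and golabu
print(tone_word("labutu"))  # DEB, spans the boundary between golabu and tupiro
```

Because the mapping is one-to-one, a tone part-word such as F#GA contains a low-probability transition (G to A) in the same way that a spoken part-word does.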

There were four test items used in the segmentation test. Two were tone-words (BF#G and ADE), three-tone sequences with high transitional probabilities. The other two were tone part-words. These two test items (F#GA and DEB) contained a transition with low transitional probabilities.

4.1.3. Procedure

The procedure in this experiment was identical to that of Experiment 1.

4.2. Results and discussion

Participants scored very similarly on the segmentation test in the no-video (M = 9.3, SD = 2.5), irregular-video (M = 9.2, SD = 2.5), and regular-video (M = 9.6, SD = 2.1) conditions. Note that scores in the no-video condition here (M = 9.3) were very similar to scores in the no-video condition from Experiment 1 (M = 8.9). This similarity suggests that participants were equally successful segmenting both speech and tonal materials in the absence of visual information. Participants in all three conditions scored significantly above chance (regular-video: t(19) = 3.4, p < .05; irregular-video: t(19) = 2.1, p < .05; no-video: t(19) = 2.13, p < .05). A one-way ANOVA indicated no significant effect of condition, F(2, 57) < 1. The lack of a significant difference across conditions indicates that participants in the regular-video condition did not benefit from the presence of the regular relation between tone-words and shapes.

For those participants in the regular-video condition, the average score on the correspondence test (out of 16; chance = 4) was 5.0 correct responses (SE = 0.7), which was not significantly above chance (binomial p = .23). This performance contrasts with Experiment 1, where participants exposed to speech averaged 7.5 correct responses on the correspondence test. A t test indicated that performance on the correspondence test differed significantly across these two experiments, t(38) = 2.4, p < .05. Moreover, unlike Experiment 1, there was no significant correlation between the segmentation and correspondence tests, r = −.08, p = .74.
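The cross-experiment comparison of correspondence scores can be illustrated from the reported means and standard errors. This is a sketch under assumptions, not the analysis script: the function name is hypothetical, the Experiment 1 standard error of 0.8 is taken from that experiment's results section, and the two groups are treated as independent samples of 20 participants each (in which case the pooled-variance and unpooled statistics coincide).

```python
import math


def t_from_standard_errors(mean_a, se_a, mean_b, se_b):
    """Independent-samples t statistic built from group means and SEs."""
    return (mean_a - mean_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# Speech (Experiment 1): M = 7.5, SE = 0.8; tones (Experiment 3): M = 5.0, SE = 0.7.
t_corr = t_from_standard_errors(7.5, 0.8, 5.0, 0.7)
print(round(t_corr, 1))  # 2.4
```

With 20 participants per group the statistic has 20 + 20 - 2 = 38 degrees of freedom, matching the t(38) notation above.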

These results converge to suggest that adults do not discover relations between tone sequences and shapes as efficiently as they discover relations between words and shapes (although this should not be taken to mean that adults are unable to discover relations between tones and objects). In turn, this means that when they are asked to distinguish between sequences with high and low transitional probabilities, they receive less benefit from relations between tones and shapes than from relations between words and shapes.

5. General discussion

This series of experiments represents one of the first efforts to determine whether learners can identify multiple statistical relations simultaneously, and how such learning might be constrained. The results clearly indicate that learners are capable of identifying multiple relations simultaneously. When presented with fluent speech in which words are regularly paired with unique objects, adults are able to identify word boundaries and the relations between words and objects. Indeed, adults benefit from being presented with two learning problems rather than one: Their ability to identify words (and discriminate them from part-words) improves when they have experienced those words paired with unique objects. Note that this is not the only logical possibility; it is quite plausible that one learning process could have interfered with the other, such that only learning of one task would be successful, or no learning at all would occur (e.g., Toro et al., 2005). Instead, the presence of a second learning problem actually facilitated learning of the first.

At the same time, it is clear that learning multiple relations is constrained. The current results indicate that learning is limited by at least two kinds of constraints. The first is a developmental constraint: Identical input does not necessarily yield identical learning output at different stages of development. Whereas adults benefited from the presence of regular relations between words and shapes, infants did not. The second is a stimulus constraint: the same formal relations present in the input are more or less likely to be detected by learners as a function of the characteristics of the input. Although adults identify both word forms and word–object relations from linguistic stimuli, they are less adept at doing so when presented with tonal stimuli. There are multiple potential explanations for the difference between speech and tones, but differential experience with word–object and tone–object relations may play a role.

Indeed, both of these constraints may arise from prior experience. Experience with language may provide practiced learners, such as adults, with strong expectations that words relate to objects (e.g., Namy & Waxman, 1998). This leads adults to be sensitive to such relations in linguistic input, but to fail to detect them in tonal input. Eight-month-old infants may not yet have expectations about the relations between words and objects (e.g., Werker et al., 1998) and so reap little benefit from the presence of these relations. The possibility that prior experience leads to constraints on subsequent learning should not be taken to mean that learners will always be unsuccessful in learning from novel stimuli, or stimuli that exhibit novel relations. Rather, it suggests that learning will be facilitated when learners are exposed to stimuli that fit with expectations built up over prior experience (e.g., Norman, Brooks, & Allen, 1989; Vicente & Wang, 1998). But given a lengthier exposure, simplified input, or more robust cues (e.g., Hunter & Ames, 1988), learners might succeed in relatively novel domains, such as identifying the relations between tones and objects in the current experiment. As such, it may be best to characterize at least some of the constraints on learning as soft constraints: biases that privilege exploration of certain kinds of relations but which can be violated if the input clearly indicates that an unprivileged relation is available to be learned. The continual accretion of experience may eventually allow 8-month-olds to transform from novice to expert word learners. Any mechanistic account of learning will benefit from a consideration of the interaction between experience and constraints on learning. Moreover, the ability of different mechanistic accounts to accurately simulate the effect of these factors provides a potent test of an account’s plausibility.


This research was funded by a grant from the National Science Foundation (BCS-0642415) to the author. I thank Jenny Saffran, Dick Aslin, and Elissa Newport for their generous permission to use the language from their original experiments on infant statistical learning, Brent Fiore for his work in creating the audio–visual synchronization, and Ashley Episcopo, Teresa Pegors, and many other research assistants for their help in running participants. Additionally, I thank Michael Kaschak, David Rakison, and three anonymous reviewers for helpful discussion and useful comments on previous versions of this work, as well as the many parents who participated in this research.