4.2. Relation to prior work
As we noted in the introduction, a number of dynamical systems proposals have been advanced that model the integration of spoken and visual information in the VWP. These proposals form an essential foundation for our project, and we incorporate many of their insights into the present work: for example, continuous activation values, feedback connections, and the melding of different sources of constraint in an interactive activation framework. Here, by comparing our account with these projects in detail, we clarify some of the ways in which the current work contrasts with and extends these approaches.
Allopenna et al. (1998) examined the fine-grained temporal dynamics of spoken-word recognition in the VWP. In a behavioral study, they instructed people to pick up a target item in a visual array. They found that the amount of looking to each item in the array (i.e., to the target itself, and to items sharing a phonological onset or rhyme with it) was predicted by lexical activations in TRACE (McClelland & Elman, 1986), an interactive activation model of spoken-word recognition that takes spoken, but not visual, information as its input. Allopenna et al. (1998) also observed a kind of local coherence in eye-movement behaviors: looks to competitor items that shared a rhyme, but not an onset, with the target (e.g., beaker—speaker). Despite the clear difference in onsets, listeners were more likely to look to rhymes than to unrelated distractors that shared no phonological overlap with the target (e.g., carriage). This result is closely related to the reference-item looks we observed, suggesting the formation of localized structure, as predicted by self-organization.
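Concretely, Allopenna et al. (1998) linked TRACE’s lexical activations to fixation probabilities with an exponentiated version of the Luce choice rule, computed only over the items in the display. The following is a minimal sketch of that linking hypothesis; the activation values and the scaling constant k are invented for illustration.

```python
import numpy as np

def fixation_probs(activations, display_items, k=7.0):
    # Exponentiate each displayed item's lexical activation and
    # normalize, so fixation probability tracks relative activation.
    acts = np.array([activations[w] for w in display_items])
    strengths = np.exp(k * acts)
    return dict(zip(display_items, (strengths / strengths.sum()).round(3)))

# Hypothetical activations midway through hearing "beaker":
acts = {"beaker": 0.55, "speaker": 0.30, "carriage": 0.05}
print(fixation_probs(acts, ["beaker", "speaker", "carriage"]))
# The rhyme "speaker" outstrips the unrelated "carriage", mirroring
# the local coherence (rhyme) effect described above.
```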
However, Allopenna et al.’s (1998) model is only partially constrained by the visual information. For purposes of simulating looks (using the Luce choice rule, given activation across the entire lexicon), they restricted their analysis to only those nodes in TRACE that corresponded to objects in the visual display, thus emphasizing the dynamics of the linguistic portion of the signal, and not the visual portion (see also Tanenhaus, Magnuson, Dahan, & Chambers, 2000). Spivey (2007) modified this TRACE-based approach to allow feedback from the visual component to influence the dynamics. He simulated Allopenna et al.’s behavioral findings with a recurrent network with three layers: a lexical layer, whose word nodes were fed raw activation levels from TRACE; a visual layer, whose object nodes were activated when an object was present in the display; and an integration layer, which connected the lexical and visual layers. Like Spivey (2007), we have allowed both visual and verbal information to modulate the dynamics. However, neither Spivey’s nor Allopenna et al.’s approach handles structure above the lexical level. If TRACE were fed the words of a phrase like “The cat that’s beside the star” in succession, then at the last word of the sentence it would continue to predict looks to the star, although behaviorally listeners return to the cat, as the larger phrasal structure implies. The model we have implemented takes a step toward clarifying how dynamical models like these might approach the challenge of integrating syntax-level information across words in sentences.
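Spivey’s (2007) architecture is specified in his own work; the sketch below is only an illustrative stand-in showing how a visual layer can gate, and feed back into, TRACE-style lexical activations through an integration layer. The particular update rule and all parameter values here are our assumptions, not Spivey’s equations.

```python
import numpy as np

def integrate(lexical, visual, n_cycles=10, feedback=0.5):
    lex = np.asarray(lexical, dtype=float)
    vis = np.asarray(visual, dtype=float)
    for _ in range(n_cycles):
        integ = lex * vis             # visual presence gates lexical nodes
        integ /= integ.sum()          # normalize over the display
        lex = lex + feedback * integ  # integrated values feed back,
        lex /= lex.sum()              # sharpening the lexical layer
    return integ.round(3)

# Items: beaker (target), speaker (rhyme), carriage (unrelated);
# all three are pictured, so the visual vector is all ones.
print(integrate([0.55, 0.30, 0.05], [1.0, 1.0, 1.0]))
```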
The dynamical systems approach we have proposed, however, is not the first to address sentence processing in the visual world. Mayberry et al. (2009), for example, simulated anticipatory looking behaviors (e.g., Altmann & Kamide, 1999, 2007, 2009; Knoeferle & Crocker, 2006, 2007) in the visual world using an augmented simple recurrent network (SRN; Elman, 1990) that processed sentence-level input. Behaviorally, listeners robustly use information from the language and the visual context to anticipate upcoming linguistic referents. For example, Altmann and Kamide (1999) showed that listeners hearing “The boy will eat the…” were more likely to look at edible objects like a cake than at inedible objects like a ball or a truck, as predicted by the verb in the sentence. Accordingly, Mayberry et al. (2009) used a multilayer network, with recurrence in the hidden layer, to predict the role assignment of arguments in a scene, given the visual and linguistic contexts. They presented their model (CIANet) with word-by-word sentences (in German) as well as visual contexts depicting scenes. The task of the model was to activate a filler-role representation of the event within the scene that the language referred to (each scene was assumed to contain two possible events). Consistent with the behavioral data, the model used information from the linguistic signal up through the verb (e.g., “The princess is painting the...”) and information from the visual signal (e.g., a pirate who is washing a princess who is painting a fencer) to anticipate the direct object predicted by the union of the language and the scene (e.g., the fencer). Also consistent with the behavioral data, the model favored visual information over stereotyped linguistic information when the two conflicted: given “The pilot was spied on by the…,” for example, in a visual context depicting a wizard who is spying on a pilot who is being fed by a detective, the model anticipated the wizard, consistent with the visual context, and not the detective, a stereotypical and predictable agent of the verb spy based on the language alone. Thus, the model is highly sensitive to the relationship between sentence-level structure in the language and interactions among different items in a visual context.
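CIANet’s full architecture includes additional machinery (e.g., scene input and event-based attention); the fragment below sketches only the Elman-style recurrent core that lets each word be interpreted in the context of the words before it. Layer sizes and weights are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

class SRN:
    def __init__(self, n_in, n_hid, n_out):
        self.W_in = rng.normal(0.0, 0.1, (n_hid, n_in))
        self.W_rec = rng.normal(0.0, 0.1, (n_hid, n_hid))
        self.W_out = rng.normal(0.0, 0.1, (n_out, n_hid))
        self.h = np.zeros(n_hid)  # context: a copy of the last hidden state

    def step(self, x):
        # The new hidden state blends the current word with the stored
        # context, so earlier words constrain later interpretation.
        self.h = np.tanh(self.W_in @ x + self.W_rec @ self.h)
        out = np.exp(self.W_out @ self.h)
        return out / out.sum()    # distribution over role-filler outputs

net = SRN(n_in=20, n_hid=30, n_out=5)
for word in np.eye(20)[:4]:       # four one-hot "words," presented in order
    role_probs = net.step(word)
```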
A very appealing property of the Mayberry et al. (2009) model is that, as an SRN, it learns to relate visual and linguistic information and to use this information to focus looks appropriately. Although the network we have described is not a learning model, it is nevertheless compatible with such an approach. We assume, for example, that the linguistic pulses that modulate the network’s action landscape reflect learned associations between stereotyped (eye-movement) behaviors, linguistic contexts, and visual contexts. Consistent with robust behavioral findings, Mayberry et al.’s model also acts anticipatorily. In this regard, we too found evidence for a kind of anticipation: listeners tended to fixate the reference items as they heard “beside,” before “star” or “square” was named in the sentence, a behavior our model also exhibited. Our model demonstrated this anticipatory behavior because of the semantics assigned to each pulse. The effect of the pulse for “beside,” for example, is to deepen the attractor basins for the reference items, which are beside the item usually being fixated at this point in the trial. This definition of the effect of “beside” is a pure stipulation in our model, unlike in Mayberry et al.’s, where such contingencies are learned. We think it plausible that experience with the word “beside” induces a context-independent tendency to look at objects beside the object currently being looked at.14 This assumption about adult behavior is consistent with a learning paradigm that drives an organism toward helpfully exploratory behavior (e.g., Oudeyer, Kaplan, & Hafner, 2007; Sutton & Barto, 1998).
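To make the stipulation concrete, the toy fragment below renders the assumed semantics of the “beside” pulse: it deepens the attractor basins of items spatially adjacent to the currently fixated item. The grid positions, depths, and boost value are invented; this is a schematic rendering of the idea, not the model’s actual implementation.

```python
positions = {"cat": (0, 0), "star": (1, 0), "square": (0, 1), "dog": (3, 3)}
basin_depth = {item: 0.1 for item in positions}

def beside_pulse(fixated, boost=0.5):
    # Deepen the basins of grid neighbors of the fixated item, making
    # anticipatory looks to the reference items more likely.
    fx, fy = positions[fixated]
    for item, (x, y) in positions.items():
        if abs(x - fx) + abs(y - fy) == 1:
            basin_depth[item] += boost

beside_pulse("cat")
print(basin_depth)  # "star" and "square" basins deepen; "dog" is unchanged
```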
The Mayberry et al. (2009) model, however, is limited in a number of ways. First, the output of the model consists of looks to holistic scenes: for example, the model might activate a scene involving a princess who is painting a fencer, rather than a pirate who is washing a princess, but it does not generate looks to individual items within each scene (e.g., the princess vs. the fencer). At this finer level of analysis, listeners do look to individual items in the display. Additionally, Mayberry et al.’s model assumes that listeners have a rich mental representation of the relationships among all items in a display: listeners do not simply know that a pirate, a princess, and a fencer are present; they also know precisely what each item is doing to all of the other items. However, there is evidence that listeners often store only a minimal amount of information about items in a visual context, according to the task at hand (Ballard, Hayhoe, Pook, & Rao, 1997). There is also the problem of understanding how listeners could grapple with very rich visual contexts, in which items interact in an open-ended number of ways, as in the real (visual) world. By employing a combinatorial generation mechanism (the looking behavior in Impulse Processing arises from the combination of pulses created by the word sequence and the context), our model is positioned to exhibit an appropriately open-ended variety of behaviors.
Roy and Mukherjee (2005) have also addressed the integration of sentence-level and visual information in VWP-like settings. Their model, Fuse, is a probabilistic rule-based model that is trained to interpret referential expressions about items in a visual context and to find the items the language identifies. Fuse processes sentences incrementally, generating a probability distribution across the items in a visual context based on their fit with the language. As each new word is processed, Fuse modulates the distribution of probabilities across the visual display. As Fuse processes an utterance like “The large green block in the far right beneath the yellow block and the red block,” for example, it first allocates higher probabilities to large blocks in the display (“The large…”), then to large green blocks (“…green…”), then to large green blocks on the right (“…block in the far right…”), and so forth.
Like Mayberry et al.’s (2009) CIANet, Roy and Mukherjee’s (2005) Fuse has the desirable trait that it learns to perform its task: it is trained on corpora of real language spoken by real people about real visual contexts. Like our model, Fuse also “interprets” complex referring expressions. Unlike our model, however, and the others we have discussed, Fuse is not explicitly a model of eye movements: Roy and Mukherjee interpret the model’s probability distributions as distributions over attentional foci. If one assumes that attentional foci correspond to fixation locations, then the model can be interpreted as a model of eye movements. Under this assumption, the model makes incorrect predictions about the fixation patterns associated with complex noun phrases. To arrive at an interpretation, the model divides a complex noun phrase into subphrases, such that one phrase identifies the target (e.g., “The large green block…”) and the other identifies landmark items (akin to our reference items) that serve to disambiguate the target (e.g., “beneath the yellow block and the red block”). As the model begins to process the noun phrase identifying the landmark item, its attention shifts to the landmark item in the visual context, rather than remaining on the target item (e.g., higher probabilities are allocated toward yellow and red blocks that are above a large green block, rather than toward a large green block that is below a yellow and a red block). Thus, in processing the last word of “The cat beside the star…,” Fuse would allocate a higher probability to the star in the display, although behaviorally listeners return to the target.
Interestingly, although Roy and Mukherjee do not address the issue in their discussion, Fuse appears to exhibit local coherence behaviors. Roy and Mukherjee (2005) plot the distribution of probabilities over visual objects during the processing of “The large green block…” in a visual context containing large green blocks and small green blocks. The probability bars accompanying the figure suggest that while their model allocated the highest probabilities to large green blocks, it also allocated elevated probabilities to small green blocks, which were at least consistent in color with the language, as compared with, for example, small red blocks. Like looks to rhymes (e.g., Allopenna et al., 1998) and the looks to reference items that we observed, this suggests the formation of local structure despite incongruence with the global context. This behavior of the model is likely a consequence of the way individual words impinge on the system’s probability distributions. Each word (e.g., “green”) is mapped to consistent items in the visual context, and these context-independent probabilities are multiplied together as a sentence is processed. While this independence assumption about the effect of words on the system recapitulates the notion of bottom-up priority, it is nevertheless limited, insofar as it cannot naturally handle more complex expressions (e.g., the model does not shift attention from the landmark back to the target after hearing the modifying clause of a complex noun phrase; instead, the reference of the whole phrase is computed independently of the locus of attention).
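This mechanism can be rendered schematically as follows: each word contributes a context-independent fit score to every displayed item, and the scores are multiplied and renormalized word by word. The fit values below are invented, but the multiplicative combination reproduces the local coherence pattern just described.

```python
import numpy as np

items = ["large green block", "small green block", "small red block"]
word_fit = {                    # hypothetical per-word fit scores
    "large": np.array([0.9, 0.2, 0.2]),
    "green": np.array([0.9, 0.9, 0.1]),
}

p = np.ones(len(items))
for word in ["large", "green"]:
    p *= word_fit[word]         # context-independent evidence combines
    p /= p.sum()                # renormalize over the display
    print(word, dict(zip(items, p.round(2))))
# After "green," the small green block retains an elevated probability
# relative to the small red block: local coherence.
```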
Our dynamical systems approach is also generally consistent with the feature-based approach of Altmann and Kamide (2007, 2009). These authors are especially concerned with results from the visual world indicating that listeners do not simply look to items in a visual context as they are named; listeners also look to items that are related to the language in any number of ways. Their theoretical approach provides a rich account of a large number of results from the VWP: for example, looks to a rope on hearing “snake” as a consequence of physical featural overlap (Dahan & Tanenhaus, 2005), and looks to a trumpet on hearing “piano” as a consequence of categorical featural overlap (Huettig & Altmann, 2005; see also Yee & Sedivy, 2006). Their proposal assumes that entities in the language (e.g., words) and in the visual world (e.g., objects or images) activate corresponding mental representations that are featural in composition and include information about the form, function, associations, and so forth, of the words and visual objects being processed. A visual representation receives a boost in activation when it shares features with a linguistic representation, increasing the likelihood of a saccade to that object in the display. Such effects can be predicted by recurrent networks like the one we describe here if they use distributed codes: the featural encodings of distinct entities overlap, so when one object becomes activated, it partially activates other objects that share features with it. This kind of behavior is well documented in recurrent networks closely related to ours that employ overlapping feature encodings (e.g., Harm & Seidenberg, 2001, 2004; Kawamoto, 1993; McClelland & Kawamoto, 1986; McRae, de Sa, & Seidenberg, 1997). For simplicity, and because our focus is not on feature-overlap effects, we have used localist encodings in the current model, and thus do not consider fine-grained semantic and physical feature overlap, although this is not a necessary restriction. An important question for future modeling research in this area is whether the same sentence-processing dynamics can be sustained when more complex distributed codes are used in the recurrent layers.
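As an illustration of why distributed codes produce such spillover, consider hypothetical binary feature vectors; the normalized overlap between them serves as a stand-in for the partial cross-activation that a recurrent network with distributed representations would exhibit. The feature assignments below are invented.

```python
import numpy as np

features = {                                 # hypothetical feature codes
    "snake": np.array([1, 1, 1, 1, 0, 0]),   # long, thin, coiled, animate
    "rope":  np.array([1, 1, 1, 0, 0, 0]),   # long, thin, coiled
    "couch": np.array([0, 0, 0, 0, 1, 1]),   # unrelated control
}

def coactivation(word, objects):
    # Cosine overlap between the heard word and each pictured object:
    # shared features partially activate featurally similar objects.
    w = features[word]
    return {o: round(float(features[o] @ w) /
                     float(np.linalg.norm(features[o]) * np.linalg.norm(w)), 2)
            for o in objects}

print(coactivation("snake", ["rope", "couch"]))
# {'rope': 0.87, 'couch': 0.0}: hearing "snake" partially activates the
# rope, predicting looks to it (cf. Dahan & Tanenhaus, 2005).
```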
4.3. The abstractness of language
As we noted in the introduction, one may well wonder whether a theory that directly connects language and action can handle abstract uses of language. What about situations where an action specified by language is not immediately carried out (e.g., hearing on the telephone, “Could you pick up a quart of milk on the way home?”), or where the language effects a mental change that does not require any physical response (e.g., “You see, the morning star and the evening star are one and the same object!”)?
An in-depth discussion of these issues is beyond the scope of this paper. Nevertheless, we note that the approach we have outlined has a reasonable answer to this question: in Impulse Processing, perceptions modify a landscape that specifies actions, but the shape of the landscape at any point in time, and the location of the system on the landscape, are determined by the perceiver’s cumulative interaction with the environment. So the obvious action specified by a particular piece of language (e.g., “pick up a quart of milk”) may not dominate behavior at the moment the utterance occurs. It is useful to think about these issues in terms of contrasting structural scales. In the model discussed above, we considered examples in which the referent of a modifying noun (e.g., “star”) attracted some looks during the utterance of a complex noun phrase (e.g., “the cat that’s beside the star”), specifically while the modifying noun was being spoken. However, this tendency was modulated by the strength of the attractor of the head noun, as indicated by the comparison of referential (i.e., cat) and lexical-plus-referential (i.e., bat) ambiguities. Since the scale of the head-noun attractor was relatively large compared to that of the modifying noun, the influence of the modifying noun on the looking behavior was minimal (e.g., given the large attractor basin for cat1 in the garden-path case; see Fig. 5).
Relatedly, we hypothesize that when someone hears a statement that refers to events associated with a remote time, as in the milk example, there is an effect on the hearer’s action landscape at the moment the request is perceived, but this effect is a relatively small deformation. It will cause some minimal activation of motor-related neural pathways associated with the process of purchasing milk, but it will not cause the person to leap up and begin milk-purchasing activities on the spot. This is because the constraints of the current situation keep the magnitude of the deformation associated with the prospective action small relative to the magnitude of the deformations related to the task at hand (in this case, talking on the phone). In the milk-purchasing case, we assume that the deformation caused by the request, though small, persists in a portion of the person’s mental space connected with her plans for traveling home, and, at the appropriate point in the journey, the small deformation becomes enlarged to the point where it causes appropriate action (e.g., driving into the parking lot of a store that sells milk). Similarly, in the case of an abstract mental revision, like learning of the common identity of the morning star and the evening star, Impulse Processing claims that the comprehender’s landscape for action is revised at a small scale when the utterance occurs, and this deformation becomes enlarged later, at points where it becomes relevant (e.g., in acts of drawing a diagram of the solar system).
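Schematically, the point can be made with a one-dimensional action landscape built from Gaussian wells, where each well’s depth represents the magnitude of a deformation. All centers, depths, and widths below are invented; the sketch gestures at the scale idea rather than reproducing the model’s equations.

```python
import numpy as np

def landscape(x, basins):
    # Sum of Gaussian wells: each basin has a center c, a depth d
    # (deformation magnitude), and a width s.
    return -sum(d * np.exp(-(x - c) ** 2 / (2 * s ** 2)) for c, d, s in basins)

x = np.linspace(0.0, 10.0, 1001)
# On the phone: the ongoing task dominates; "buy milk" persists as a
# small deformation.       (center, depth, width)
phone   = [(3.0, 1.0, 0.8), (7.0, 0.05, 0.8)]
# Nearing the store: context enlarges the scale of the milk deformation.
driving = [(3.0, 0.2, 0.8), (7.0, 1.0, 0.8)]
for name, basins in [("phone", phone), ("driving", driving)]:
    V = landscape(x, basins)
    print(name, "-> dominant basin at x =", x[np.argmin(V)])
```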
In the restricted domain of syntactic comprehension on a seconds-long timescale, where words that occur at one point in time constrain the possibilities for words at certain future points, Tabor (2000, 2003, 2009) discusses a neural activation framework, called fractal grammars, that works according to the scale-manipulation principle just outlined. Although the attractors in this framework do not modulate behavior at the millisecond timescale appropriate for modeling eye-tracking data, the framework nevertheless suggests that neural memory manipulation could be a matter of scale manipulation.
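One concrete way to see how scale manipulation can serve as memory is an iterated-function-system stack of the kind used in such frameworks: reading a symbol applies a contractive map, so deeper embeddings occupy smaller scales, and inverting the map recovers the most recent symbol first. The two-symbol construction below is a toy of our own devising, not Tabor’s actual grammars.

```python
OFFSETS = {"a": 0.0, "b": 0.5}

def push(z, sym):
    # Contract the state into the half-interval for sym: each level of
    # embedding is stored at half the scale of the one above it.
    return z / 2 + OFFSETS[sym]

def pop(z):
    # Invert the contraction: the half-interval containing z identifies
    # the most recently pushed symbol (last in, first out).
    for sym, off in OFFSETS.items():
        if off <= z < off + 0.5:
            return sym, (z - off) * 2
    raise ValueError("state out of range")

z = 0.4                   # empty-stack state (arbitrary, for this toy)
for sym in "abb":         # embed three constituents
    z = push(z, sym)
print(round(z, 3))        # a single coordinate encodes the whole stack
for _ in "abb":
    sym, z = pop(z)
    print("popped", sym)  # pops b, b, a: last in, first out
```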
This scale-manipulation view is generally consistent with accounts that ground abstract conceptual knowledge in perceptual and motor systems (e.g., Barsalou, 1999; Barsalou, Simmons, Barbey, & Wilson, 2003), and it is supported by data indicating that language elicits neural activation in regions relevant to the associated action even in the absence of overt muscular responses (e.g., Moody & Gennari, 2010; see also Pulvermüller, 2005). We hypothesize that these neural responses are weak versions of the activation dynamics that would take place if the person actually engaged in the action described by the language. The view is also consistent with the finding that when people are asked to imagine a described scene while viewing a blank wall, their scan paths resemble those they would produce if they were actually viewing the scene (e.g., Spivey & Geng, 2001): the blank wall provides such a weak global context that the eye movements naturally associated with the words are not suppressed and can be detected. This view also helps explain how the strongly input-driven self-organization approach is compatible with the finding that different task constraints produce very distinct scan-path characteristics for the same image (e.g., Yarbus, 1967): the task constraints amplify different attractor basins.
We suggest, then, that Impulse Processing is a framework flexible enough to make headway on the problem of integrating multiple, loosely coordinated information sources; that it makes distinctive, empirically justified predictions; and that it offers a plausible take on the well-known challenges of handling concrete and abstract language in a common framework.