Symbol Interdependency in Symbolic and Embodied Cognition

Authors


Correspondence should be sent to Max M. Louwerse, Institute for Intelligent Systems/Department of Psychology, University of Memphis, 202 Psychology Building, Memphis, TN 38152-3230. E-mail: mlouwerse@memphis.edu

Abstract

Whether computational algorithms such as latent semantic analysis (LSA) can both extract meaning from language and advance theories of human cognition has become a topic of debate in cognitive science, whereby accounts of symbolic cognition and embodied cognition are often contrasted. Albeit for different reasons, in both accounts the importance of statistical regularities in linguistic surface structure tends to be underestimated. The current article gives an overview of the symbolic and embodied cognition accounts and shows how meaning induction attributed to a specific statistical process or to activation of embodied representations should be attributed to language itself. Specifically, the performance of LSA can be attributed to the linguistic surface structure, more than special characteristics of the algorithm, and embodiment findings attributed to perceptual simulations can be explained by distributional linguistic information.

1. Introduction

One of the central research questions in the cognitive sciences concerns the nature of meaning in language comprehension and how meaning can be extracted. Since the cognitive revolution in the 1950s (Miller, 2003), a strong consensus has emerged not only that computers are successful in extracting meaning from language, but also that their processes simulate human cognitive processes. That computers can extract meaning from language and enhance theories of human cognition, the theme of this topiCS issue, carries two presuppositions. The first is that meaning can in fact be extracted from language computationally, and the second is that these computational methods and their findings can advance theories of human cognition. Even though these two presuppositions are not ipso facto linked (Sparck Jones & Willett, 1997), in the cognitive sciences they typically are viewed as being inseparable (Jurafsky & Martin, 2001). Yet a growing divide can be observed with regard to the validity of these two presuppositions (De Vega, Glenberg, & Graesser, 2008; Pecher & Zwaan, 2005; Semin & Smith, 2008).

To some it seems obvious that meaning can be extracted computationally and that there are strong similarities between computational algorithms and the human cognitive processes (Landauer, McNamara, Dennis, & Kintsch, 2007; Rogers & McClelland, 2004). According to this account, which I will concisely call “symbolic cognition,” the meaning of rose primarily is the product of statistical computations from associations between rose and concepts like flower, red, thorny, and love. One influential symbolic cognition account that is taken as exemplary throughout this paper, that of latent semantic analysis (LSA), uses large text corpora to compute semantic similarities between concepts.

To others, however, the computation of amodal linguistic information cannot amount to meaning, and according to them the confidence in symbolic cognition leads the cognitive sciences astray. This account, which I will concisely call “embodied cognition,” expresses an increased concern about linguistic representations of meaning, as well as any analogies between computational and human approaches to meaning extraction (Pecher & Zwaan, 2005; Semin & Smith, 2008). The embodied cognition argument states that meaning does not lie in amodal linguistic systems but is modal in nature. Consequently, connectionist models of symbol manipulation cannot capture meaning. Instead, so the embodied cognition argument goes, at the heart of meaning lies the activation of perceptual and embodied experiences. In other words, according to the embodied cognition account, the meaning of rose comes from the activation of perceptual experiences with roses, their colors, their smell, and the occasions on which we tend to perceive them, rather than from the linguistic information associated with roses.

The literature has used “amodal,” “symbolic,” and “linguistic” as antonyms for “modal,” “embodied,” and “perceptual” (De Vega et al., 2008). We use the term “symbolic” here as a synonym for amodal linguistic, and “embodied” as a synonym for perceptual.

This article reviews both the symbolic and the embodied accounts of cognition. The claim made in this paper is directly related to the two presuppositions mentioned earlier, that is, whether meaning can be extracted from language computationally and whether it advances theories of human cognition. The claim is two-fold. First, the symbolic cognition account tends to place more emphasis on the algorithm than on linguistic regularities. Techniques like LSA are a convenient shortcut to nonlatent first-order word co-occurrences in language. That is, language is organized in such a way that any form of meaning extraction identified by algorithms such as LSA emerges from the linguistic surface structure itself, even though the LSA algorithm can make the computation faster and more efficient. Second, the embodied cognition account underestimates the importance of linguistics in general and—for the purposes of this paper—what can be gleaned from the surface structure of language. The argument made here is that embodied representations are directly mapped onto language because language encodes embodied relations. That is, much of the evidence in favor of embodied cognition can be traced back to patterns in language, at least for those studies that are using linguistic stimuli. These regularities in language can in turn be exploited by language users, for instance, in constructing embodied representations.

The central claim in this paper is reminiscent of a claim made by Deacon (1997, p. 104): The support for language comprehension and language production is vested neither in the brain of the language user, its computational processes, nor in embodied representations, but outside the user, the process, and the representation, in language itself.

2. Symbolic cognition

Symbolic cognitive models are theories of human cognition that take the form of working computer programs (Lewis, 1999). Many computer models fit the label of symbolic cognitive models, including ACT-R (Anderson, 2007), CAPS (Just & Carpenter, 1992), CLARION (Sun & Peterson, 1997), Epic (Meyer & Kieras, 1997), and Soar (Newell, 1990). The focus in this article will not be on these cognitive architectures, but instead on computational algorithms, specifically LSA.

According to LSA, meaning is captured by mapping words into a continuous high-dimensional semantic space. LSA is trained on a corpus of texts, resulting in a semantic space. Input units (words, sentences, paragraphs, or texts) are compared in this semantic space, with a cosine value representing the semantic similarity between the input units. More specifically, the underlying mechanism is as follows. Texts are segmented into contexts (e.g., sentences or paragraphs). The frequency of occurrence of each word in each context is computed. The resulting co-occurrence matrix contains many zeros, since many words appear in only a few contexts. These local associations are next transformed by means of a mathematical compression technique such as singular value decomposition (SVD) into a small number of dimensions (typically 300), yielding more unified knowledge representations by removing noise. That is to say, LSA goes beyond the simple unit-context frequency matrix. Words are similar not only because they appear in the same context (i.e., first-order co-occurrences), but also because they occur in similar contexts (i.e., higher-order co-occurrences). In LSA, words are represented by long vectors of numbers that define a high-dimensional space. The similarity of any two words can be assessed by computing the cosine between their vectors (Deerwester, Dumais, Furnas, Landauer, & Harshman, 1990; Louwerse, Cai, Hu, Ventura, & Jeuniaux, 2006).
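
The mechanism described above can be sketched in a few lines of code. What follows is a minimal illustration under stated assumptions, not the implementation used in the studies cited: the five-context corpus is invented for the purpose, the names `contexts`, `counts`, `reduced`, and `similarity` are hypothetical, raw frequencies are used without any weighting, and two dimensions are retained instead of the roughly 300 mentioned in the text.

```python
import numpy as np

# Toy corpus (hypothetical): each string is one context.
# "rose" and "tulip" never share a context, but both occur with "flower" and "red".
contexts = [
    "rose flower red",
    "rose thorny flower",
    "tulip flower red",
    "car engine road",
    "truck engine road",
]

# Word-by-context frequency matrix (rows: words, columns: contexts).
vocab = sorted({word for context in contexts for word in context.split()})
row = {word: i for i, word in enumerate(vocab)}
counts = np.zeros((len(vocab), len(contexts)))
for j, context in enumerate(contexts):
    for word in context.split():
        counts[row[word], j] += 1

# Singular value decomposition; keep only the k largest dimensions.
# (k = 2 suits this toy corpus; real LSA spaces keep far more dimensions.)
k = 2
U, s, _ = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :k] * s[:k]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def similarity(space, w1, w2):
    """Cosine similarity of two words in the given word-vector space."""
    return cosine(space[row[w1]], space[row[w2]])


# First-order comparison: rows of the raw frequency matrix.
print("raw rose~tulip:", similarity(counts, "rose", "tulip"))
# Higher-order comparison: rows of the reduced space.
print("LSA rose~tulip:", similarity(reduced, "rose", "tulip"))
print("LSA rose~car:  ", similarity(reduced, "rose", "car"))
```

In this toy corpus rose and tulip never appear in the same context, so their raw cosine is zero, whereas in the reduced space their cosine is close to one because both words keep company with flower and red. The cosine between rose and car stays near zero in both spaces.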

Latent semantic analysis has been shown to be very promising in a variety of tasks. For instance, it has been used successfully for information retrieval (Berry, Dumais, & O’Brien, 1995; Deerwester et al., 1990; Salton, Wong, & Yang, 1975; Widdows, 2004).

Landauer and Dumais (1997) tested whether LSA would pass the Test of English as a Foreign Language (TOEFL), the Educational Testing Service exam that every foreigner at an American university needs to take. On 80 multiple-choice test items, the performance of LSA was the same as that of the average student taking the test. When LSA was trained on the content of textbooks (not the questions or the answers), it received a passing grade on a multiple-choice test provided by the textbook publishers (Landauer, Foltz, & Laham, 1998).

Latent semantic analysis turned out to be equally successful in measuring coherence in text. Foltz, Kintsch, and Landauer (1998) reanalyzed texts from two studies that manipulated coherence and assessed readers’ comprehension, and they found that LSA can measure coherence adequately. Similar findings have been reported by Louwerse and Jeuniaux (2009) and McNamara, Cai, and Louwerse (2007). It is therefore not surprising that LSA has been used as an important component of coherence metrics in Coh-Metrix, a Web-based tool that analyzes texts on hundreds of types of coherence relations and measures of language, text, and readability (Graesser, McNamara, Louwerse, & Cai, 2004).
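
To make the idea of a coherence measure concrete, the sketch below scores a text by the mean cosine between adjacent sentences. Two assumptions should be kept in mind: adjacent-sentence cosine is only one possible way to operationalize coherence, and plain word-count vectors stand in here for vectors from a trained LSA space, so only literal word overlap is captured. The function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine between two sparse word-count vectors (Counter objects)."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def coherence(sentences):
    """Mean cosine between each sentence and the sentence that follows it."""
    # Assumption: word counts replace LSA vectors in this sketch.
    vectors = [Counter(sentence.lower().split()) for sentence in sentences]
    pairs = list(zip(vectors, vectors[1:]))
    if not pairs:
        return 0.0
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


# Hypothetical example texts.
coherent = ["the heart pumps blood", "blood carries oxygen", "oxygen reaches the cells"]
incoherent = ["the heart pumps blood", "roses are red", "engines burn fuel"]

print("coherent:  ", coherence(coherent))
print("incoherent:", coherence(incoherent))
```

With vectors from an LSA space in place of the word counts, sentences could score as related without sharing any word, which raw counts cannot capture.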

Latent semantic analysis has also been shown to do well in analyzing other forms of language use. Kintsch (2000) used LSA to design a computational model of metaphor comprehension by computationally modeling the interaction between the meaning of the topic and vehicle of metaphors. Louwerse and Van Peer (2009) presented various examples of LSA successfully extracting meaning from figurative language. Kintsch (2002) illustrated how LSA can even be used to identify the theme and subthemes of a text.

Latent semantic analysis has also been successfully used in document clustering and genre classification. Louwerse (2004) applied LSA to literary texts to determine the style of the author (idiolect) and groups of authors based on gender and literary period (sociolect). Where simple keyword algorithms failed, LSA was able to classify texts in terms of idiolect and sociolect on the basis of lexical consistency. Louwerse, Lewis, and Wu (2008) used LSA for the categorization of Shakespearean plays, and Louwerse, McCarthy, McNamara, and Graesser (2004) applied LSA in combination with other computational linguistic measures to a set of corpora to determine variations in written and spoken registers, distinguishing speech from writing, factual information from situational information, topic consistency versus topic variation, elaborative versus constrained, and narrative versus nonnarrative.

The success of LSA in such a wide range of language analysis tasks made the algorithm an ideal candidate for intelligent essay graders and intelligent tutoring systems. When Landauer et al. (1998) created an LSA space of textbook and student essays, they found that LSA performance correlated better with expert graders than the performance of these graders did with each other. Intelligent tutoring systems like AutoTutor and iSTART use LSA as the tutor’s knowledge base. The tutoring system AutoTutor engages the student in a conversation on a particular topic such as conceptual physics or computer literacy. AutoTutor uses LSA for its model of the world knowledge serving as a model of the long-term memory of a conversational partner and determining the semantic association between a student answer, and ideal good and bad answers (Graesser, VanLehn, Rose, Jordan, & Harter, 2001; Graesser et al., 2004). In the tutoring system iSTART, LSA is not used for content evaluation, but for strategy evaluation. iSTART teaches students how to more efficiently read texts. In this system, LSA augments feedback to students’ self-explanations (McNamara, Boonthum, Levinstein, & Millis, 2007).

Latent semantic analysis is also the engine behind Summary Street, a reading comprehension and writing instruction tool. Students write summaries of a text and Summary Street evaluates these summaries by comparing them to the text and providing feedback about the summary content and writing mechanisms (E. Kintsch, Steinhart, Stahl, & the LSA Research Group, 2000).

In part because of its success in such a variety of language tasks, LSA has been considered to provide a solution to the poverty of the stimulus argument, also called Plato’s problem. Plato’s problem is the psychological dilemma of how humans, observing a relatively small set of events, can construct knowledge representations that are adaptive in a large, potentially infinite variety of situations (Chomsky, 1980). LSA does this by mapping initially meaningless words into a continuous high dimensional semantic space, more or less simulating cognition (Landauer & Dumais, 1997). It is important to note here that the solution to Plato’s problem presumably lies in the added value of the LSA algorithm, a specific powerful sophisticated statistical process, and not in the surface structures in language (Landauer, 1999; Landauer & Dumais, 1997; Landauer et al., 1998). Landauer et al. (1998) say about the added value of LSA:

It is important to note from the start that the similarity estimates derived by LSA are not simple contiguity frequencies, co-occurrence counts, or correlations in usage, but depend on a powerful mathematical analysis that is capable of correctly inferring much deeper relations (thus the phrase “Latent Semantic”), and as a consequence are often much better predictors of human meaning-based judgments and performance than are the surface level contingencies that have long been rejected … by linguists as the basis of language phenomena. (p. 260)

This is an important issue, as it relates to the question of how much meaning can be extracted from language. According to Landauer et al., the answer lies in the computational algorithm; according to the present paper the answer lies in language itself. These answers are of course not mutually exclusive, but whether the bias is toward the algorithm or toward the linguistic structure is important for theories of cognition.

Though for very different reasons than found in the symbolic account, the underestimated role of language in the comprehension process can also be found in the account opposing symbolic cognition, that of embodied cognition.

3. Embodied cognition

An increasing number of cognitive scientists have argued that LSA-like symbol manipulation has little to do with meaning extraction. That is, they argue that without symbol grounding, that is, grounding words to bodily actions in the environment, we can never get past defining a symbol with another symbol. Simple covariation of amodal symbolic patterns basically is not much more than a symbolic merry-go-round (Harnad, 1990). After all, comprehension goes beyond looking up a foreign word in a foreign dictionary and translating it into another foreign word from the same dictionary. Instead, the embodied cognition argument goes, meaning extraction heavily relies on the activation of perceptual experiences more than on linguistic regularities (Barsalou, 1999; Glenberg, 1997; Zwaan, 2004).

Unquestionably, there is a wealth of experimental evidence that comprehension must go beyond symbol manipulation. The results of these experiments show that when linguistic stimuli are processed the information is reenacted. For instance, Pulvermüller, Shtyrov, and Ilmoniemi (2005) applied neurophysiological imaging techniques to determine spatiotemporal activity in the brain when participants were presented with different action words. While participants engaged in a distraction task, spoken words such as kick or lick were presented. Brain activation was observed in those brain areas related to motor actions of the face (inferior frontocentral areas) or leg (superior central areas) corresponding to the action words.

In an experiment by Klatzky, Pellegrino, McCloskey, and Doherty (1989), comprehension of verbally described actions (e.g., the phrase picking up a grape) was facilitated by preceding primes that specified the motor movement (e.g., grasp). Similar evidence showing a correspondence between linguistic information and motor movement comes from Bargh, Chen, and Burrows (1996), who asked participants to read passages describing elderly people. Participants, unaware of the dependent variable being tested, walked more slowly to the elevator after the experiment ended than did a control group.

Other evidence in favor of an embodied cognition account showed that nonlinguistic representations are tightly coupled to language. When participants were asked to verify whether a picture depicted an object in a sentence, participants responded more quickly to a picture of an eagle with its wings spread out after reading The ranger saw an eagle in the sky than after The ranger saw an eagle in the nest (Zwaan, Stanfield, & Yaxley, 2002) or to a picture of a horizontally depicted nail after reading He pounded the nail into the wall than after reading He pounded the nail into the floor (Stanfield & Zwaan, 2001).

Evidence for the coupling of linguistic information to nonlinguistic representations is also found in other studies (Richardson & Matlock, 2007; Spivey & Geng, 2001; Spivey, Tanenhaus, Eberhard, & Sedivy, 2002). For instance, Spivey and Geng (2001) found that subjects acted out the mental image of a passage they heard. Subjects listened to a story that had descriptions of upward, downward, leftward, and rightward events, as in the following text:

Imagine that you are standing across the street from a 40 story apartment building. At the bottom there is a doorman in blue. On the 10th floor, a woman is hanging her laundry out of the window. On the 29th floor, two kids are sitting on the fire escape, smoking cigarettes. On the very top floor, two people are screaming.

Unbeknownst to the participants, their eye movements were recorded. Spivey and Geng found that eye movements followed the direction of the described events (vertical in this case), suggesting that lower-level motor processes are activated along with higher-level cognitive processes.

These embodied cognition experiments show that language comprehension primarily involves the activation of nonlinguistic representations. Moreover, according to the embodied cognition account, linguistic symbols and combinations of these symbols are abstract and arbitrary in nature, are not grounded in the world, and can therefore not form the basis of meaning. The question of to what extent statistical regularities in language affect conceptual processing then seems irrelevant. Instead, amodal linguistic symbols must always activate embodied representations, and only the meshing of these representations constitutes meaning (Glenberg, 1997).

4. Symbol interdependency

On the face of it the symbolic and embodied accounts of cognition seem mutually exclusive. After all, either meaning emerges from associations between linguistic units revealed by powerful statistical computations of large bodies of text (symbolic cognition) or meaning does not come from statistical regularities in the amodal linguistic system, but from perceptual simulations (embodied cognition). However, perhaps there are ways to consider these two accounts not as mutually exclusive, but as mutually reinforcing (Goldstone & Rogosky, 2002). According to the symbolic account all concepts depend on all of the other concepts, while according to the embodied account these concepts have a perceptual basis. Obviously, the view that symbolic and embodied cognition accounts are mutually reinforcing is appealing given that there is considerable psychological evidence supporting both accounts.

The symbol interdependency hypothesis proposes that language comprehension is both embodied and symbolic (Louwerse, 2007, 2008; Louwerse & Jeuniaux, 2008, 2010). According to this hypothesis, language comprehension can be symbolic through interdependencies of amodal linguistic symbols, but it can also be embodied through the references these symbols make to perceptual representations. The symbol interdependency hypothesis thereby makes an important prediction. Language has evolved to become a communicative short-cut for language users and encodes relations in the world, including embodied relations. The symbol interdependency hypothesis thus emphasizes the importance of the language structures, without discarding the notion of symbol grounding. However, many language tasks allow for limited symbol grounding in order to bootstrap meaning through the relations between amodal linguistic symbols (Louwerse & Jeuniaux, 2008). To facilitate this process, language is organized in such a way that language encodes perceptual information.

The prediction that language encodes perceptual information has important implications for the symbolic and embodied cognition accounts presented earlier. For the symbolic cognition account, it means that results obtained from LSA can also be obtained through non-latent patterns in the language surface structure. For the embodied cognition account, it means that results attributed to perceptual simulations can be traced back to language itself.

Louwerse and Jeuniaux (2008, 2010) demonstrated that in most language comprehension tasks the processes linked to symbolic cognition control the early stages of comprehension, in order to allow the language user to create quick-and-dirty representations. The processes linked to embodied cognition control comprehension in subsequent stages allowing the language user to create a complete situation model. In other words, Louwerse and Jeuniaux (2008, 2010) distinguished between shallow (underspecified and incomplete) and deep (specified and complete) language processing, and argued that language processing is typically shallow but can be deep depending on the situations of language use. Indeed, there is evidence that semantic anomalies in texts often go unnoticed (Barton & Sanford, 1993; Van Oostendorp, Otero, & Campanario, 2002). For language processing all that usually counts is “good-enough representations” (Ferreira, Ferraro, & Bailey, 2002). This means that semantic information extracted from language itself, for example, through regularities in its surface structure, might be noisy but provides important cues for good-enough comprehension.

The symbol interdependency hypothesis is not new. It is directly based on Deacon’s (1997) hierarchy of signs, which is in turn based on Peirce’s (1923) semiotic theories. Deacon (1997) argued that different levels of signs have a hierarchical relationship with each other, whereby relations between these signs can operate at one level (symbols being related to other symbols) and at different levels (symbols referring to their referents). Deacon claimed that this hierarchy of different levels of signs can help us explain why humans have language, but other species do not. Humans are a symbolic species—they can make links between symbols, and between symbols and their referents—whereas other species have difficulty making the link between the symbols. Higher species such as chimpanzees, however, approximate the symbolic ability of meaning induction, according to Deacon.

The symbol interdependency hypothesis is also related to Kintsch’s (1998) Construction Integration model. The propositional net formed from the text itself (the textbase) is dominated by symbolic (propositional) representations. From this propositional net an elaborated propositional net is formed using a richer set of representations, presumably including perceptual representations. However, contrary to the Construction Integration model, the symbol interdependency hypothesis places the surface structure, rather than the propositional deep structure of language, at the heart of the construction stage.

The symbol interdependency hypothesis is perhaps most akin to Paivio’s Dual Coding Theory. Paivio (1971, 1986) identified three levels of meaning: a representational, a referential, and an associative level. At the level of representational meaning, verbal and nonverbal stimuli activate the corresponding representational comprehension processes. That is, verbal stimuli are represented in linguistic representational units such as words, whereas nonverbal stimuli are represented in nonlinguistic representational units such as images. At the second level of meaning, referential meaning, interconnections are formed between the verbal and nonverbal representational processes. Verbal stimuli allow for pictorial representations (linking word to picture), and imaginal stimuli allow for linguistic representations (linking picture to word). The third level of meaning, associative meaning, involves intraverbal associations (associative connections between words) and interimaginal representations (associative connections between nonverbal information). Paivio’s theory thus acknowledges relations between the (amodal) linguistic units, as well as relations between (amodal) linguistic units and their (modal) referents. If information is presented verbally, the most immediate representation is also verbal in nature. However, Paivio’s Dual Coding Theory does not state that language encodes perceptual information and that these encodings mediate verbal and nonverbal processes.

Finally, the symbol interdependency hypothesis also shows similarities with Barsalou’s (1999) perceptual symbol systems theory. Barsalou (1999) argues that perceptual states are not transduced into a completely new representational language. Visual objects are not being transduced into amodal descriptions, but into visual representations. When an object is perceived, information is extracted from perceptual representations and transferred to memory. In memory, these extractions (perceptual symbols) function symbolically, standing for their referents and being used in symbolic computation. The perceptual symbol systems theory thus posits that cognition is both symbolic and embodied in nature but, unlike the symbol interdependency hypothesis, it emphasizes the embodied aspect in linguistic processes. This is made more explicit in the language and situated simulation (LASS) theory (Barsalou, Santos, Simmons, & Wilson, 2008). Although according to the LASS theory the linguistic system becomes activated immediately, preceding the activation of a deeper simulation system, the theory argues that embodiment is most relevant to cognition.

In a nutshell, the assumptions behind the symbol interdependency hypothesis are that language encodes perceptual information and that language users make use of these linguistic cues. But what is the evidence for these assumptions? I will start with some general evidence, before moving to more specific evidence in the subsequent sections of this paper.

Language is a cognitive instrument allowing people to communicate meaning in the world around us. What is so convenient about instruments in general is that they have been made and shaped for their instrumental purposes. Hammers are structured in such a way that it is easy to hit nails; screwdrivers are made and shaped to turn screws. Similarly, language has evolved to communicate meaning (Hurford, 2007). This means that language provides its users with linguistic cues for how to understand the world, and language might in fact not be as arbitrary as the embodied cognition account suggests (see Christiansen & Chater, 2008).

There is of course plenty of evidence that language encodes information from the world around us. For instance, it seems more important to know who is doing something than who is undergoing the action. Language has conveniently encoded this in its word order, with languages showing many varieties in the subject-verb-object order, but with no, or hardly any, languages adopting a word order whereby objects precede subjects (Greenberg, 1963). Another simple example of language being shaped for convenient communication is that short words tend to describe objects and events that are frequent (Zipf, 1935). A more recent example questioning an extreme view of arbitrariness of language comes from the relation between phonology and syntax. Actions and objects are nicely distinguished in the form of verbs and nouns. Monaghan, Chater, and Christiansen (2005) have cross-linguistically shown that phonological features alone can determine whether a word is a verb or a noun, raising interesting questions about the arbitrariness of language.

Other evidence suggesting language structures are not accidental comes from Musso et al. (2003), who tested the difference between real and unreal grammatical rules in terms of activation in Broca’s area. German participants with no knowledge of Italian and Japanese were asked to read sentences with real grammatical rules for Italian and Japanese and sentences with fabricated “unreal” grammatical rules in these languages. Increase of activation over time in Broca’s area was specific for “real” language acquisition only, independent of the kind of language. Detecting the difference between real and unreal language turned out not to be acquired over long stretches of time with exposure to considerable amounts of discourse, but was picked up almost instantaneously. Similarly, behavioral studies indicated that infants are more sensitive to normal speech than backward speech, the latter violating several segmental and suprasegmental phonological properties. Even 4-day-old neonates and 2-month-old infants were able to discriminate sentences in their native language from sentences in a foreign language. Importantly, discrimination performance disappeared when the sentences were played backwards (Dehaene-Lambertz & Houston, 1998; Ramus, Hauser, Miller, Morris, & Mehler, 2000).

If language structures the world around us, what then is the evidence that language users use these structures in the comprehension process? For instance, what is the evidence that comprehenders pick up on statistical regularities in language? Immediate evidence comes from word-association tests. Thumb and Marbe (1901) were among the first to investigate which semantically associated words are evoked when participants are presented with a stimulus word. Evidence shows that words from the same syntactic class are typically evoked (e.g., table is more likely to evoke chair than eat). Moreover, common words tend to be evoked more than less common words. The findings suggest that participants rely on statistical frequencies in syntactic and semantic constructions. Of course, these statistical frequencies need to be learned. If semantic associations are based on detecting statistical regularities, this skill needs to be acquired over time. Consequently, children are then predicted to rely mostly on syntagmatic relations, relations belonging to different syntactic categories but occurring in the same context (e.g., soft evoking pillow). Adults, on the other hand, having experienced words in more contexts are predicted to primarily produce paradigmatic relations (e.g., soft evoking hard). That is, whereas syntagmatic relations can develop through being exposed to a word in a context only once, paradigmatic relations can only develop through repeatedly being exposed to a word in various contexts. This is exactly what the experimental evidence shows. The shift from syntagmatic to paradigmatic relations occurs around the time the language user is exposed to considerably more language input, namely around the time the child starts to read (Brown & Berko, 1960; Ervin, 1961). 
Interestingly, adult language users who are less exposed to a language, as is the case with non-native speakers, show the same preference for syntagmatic relations in word association tasks as children (Politzer, 1978). Moreover, the syntagmatic to paradigmatic shift can also be induced with nonsense syllables (McNeill, 1966).

Perhaps a problem with this evidence is its ecological validity. In the experiments described above participants are given a stimulus word and are asked to respond with the word that first comes to mind. They do this for stimulus word after stimulus word. This hardly represents detecting statistical regularities in natural language comprehension, it can be argued. But it turns out that even subtle cues in stimuli are detected as regularities. For instance, Saffran, Aslin, and Newport (1996) showed that 8-month-old infants rely on statistical learning to extract information about word boundaries when presented with brief speech segments. In fact, after only a 2-minute exposure to three-syllable strings, infants were able to distinguish between familiar and novel sound sequences. Much of this evidence is based on participants paying attention to stimuli. But statistical learning also occurs incidentally. For instance, Mordkoff and Yantis (1991) asked participants to quickly respond to a target on a screen. Sometimes these targets co-occurred with a nontarget, sometimes they did not. Moreover, some nontargets appeared with the target more often than others. Even though participants were only asked to make decisions on the target, they learned the correlation between target and nontarget stimuli such that decision times were faster when these nontargets were shown (Mordkoff & Yantis, 1991; Experiments 4 and 5). Participants benefited from the additional information by learning associations even though the task did not require them to do so. Evidence for incidental statistical learning can also be found in language tasks. Saffran, Newport, Aslin, Tunick, and Barrueco (1997) asked participants to create computer illustrations. In the meantime, unsegmented speech from an artificial language was played to them. Both adults and children (first graders) learned the words of the artificial language even without paying much attention to the auditory stimulus.

These results show that statistical learning takes place across stimuli (visual and auditory), across ages (8-month-olds and adults), whether or not participants pay attention to the stimuli (attentional and incidental learning), in both linguistic and nonlinguistic tasks. It is therefore not surprising that the region of the Sylvian fissure responsible for many language capacities is called the association cortex (Caplan, 1996). For instance, the left inferior occipital temporal cortex has been shown to be involved in processing written words. Earlier studies argued that this area was related to the recognition of orthographic information (Warrington & Shallice, 1980), but Polk and Farah (2002) come to a different conclusion and argue that this brain area responds to orthographic regularities in sequences of abstract letter identities. In other words, it detects statistical regularities in word forms.

Even when there are no statistical regularities in incoming information, humans try to find a pattern. Most other animals do not and rely on frequency estimates instead. For instance, if a random sequence of red and green lights is presented, whereby the red light is presented 70% of the time and the green light 30% of the time, nonhumans use the optimal strategy by relying on the frequency. Humans, on the other hand, make (suboptimal) predictions on the basis of the (nonexistent) pattern in the sequence (Hinson & Staddon, 1983). Moreover, there is evidence that the left hemisphere of humans houses a cognitive mechanism that is responsible for these pattern guesses (Wolford, Miller, & Gazzaniga, 2000). These findings suggest that this regularity interpreter is unique to humans and is located in the left hemisphere, similar to the capacity of comprehending and producing language.
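Why pattern guessing is suboptimal in the red/green example can be made concrete with a short calculation. The sketch below assumes, for illustration only, that "relying on the frequency" means always guessing the more frequent light, and that human pattern guessing can be approximated by guessing each colour in proportion to its frequency (probability matching); neither operationalization is taken from the studies cited above.

```python
# Expected accuracy when guessing the next light in a random sequence.
# Assumptions (not from the cited studies): the frequency strategy always
# guesses the more frequent colour; pattern guessing is approximated by
# guessing "red" as often as red actually occurs.

def expected_accuracy(p_red: float, p_guess_red: float) -> float:
    """Chance of a correct guess when red occurs with probability p_red and
    the guesser independently says 'red' with probability p_guess_red."""
    return p_red * p_guess_red + (1 - p_red) * (1 - p_guess_red)

P_RED = 0.7  # red light shown 70% of the time, as in the example above

maximizing = expected_accuracy(P_RED, 1.0)    # always guess red
matching = expected_accuracy(P_RED, P_RED)    # guess red 70% of the time

print(f"always guess red: {maximizing:.2f}")
print(f"match frequency:  {matching:.2f}")
```

Under these assumptions the frequency strategy is correct on 70% of trials, whereas matching the frequencies is correct on only 58% (0.7 × 0.7 + 0.3 × 0.3).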

In sum, there is considerable evidence that language is structured in a way that facilitates processing. From these structures comprehenders can make associations quickly and effortlessly. According to the symbol interdependency hypothesis, this could on the one hand reconcile a symbolic cognition account that contends that meaning emerges from amodal symbolic systems through special constraint satisfaction algorithms and, on the other hand, an embodied cognition account that contends that words have a perceptual basis with embodied representations being activated when words are processed.

The problem that emerges now is that the symbolic cognition account, as discussed earlier, claims that meaning extraction involves powerful statistical processes that go well beyond surface structures in language (Landauer, 1999; Landauer et al., 1998). Though for very different reasons, the limited role of language is also found in the embodied cognition account that claims meaning primarily comes from embodied representations and not from an amodal symbol system (Pecher & Zwaan, 2005). In the remainder of this article I will show that both claims are problematic in that they both underestimate language structures.

5. Semantic regularities in language

Landauer and colleagues argued that the success of LSA primarily lies in the added value of the algorithm more than in the structure of language (Landauer, 1999; Landauer & Dumais, 1997; Landauer et al., 1998). If this is the case, this would greatly affect the validity of the symbol interdependency hypothesis. After all, the emphasis placed by the symbol interdependency hypothesis on language should then shift to the mechanism of meaning extraction. In that case, the primary question becomes whether humans utilize a similar process as LSA in extracting meaning from language. On the other hand, if it can be shown that LSA is a convenient tool to extract meaning from language, but less sophisticated algorithms that simply rely on first-order co-occurrences yield similar results, the emphasis in human meaning induction should be on language more so than on the mechanism.

The purpose of the current section is to investigate whether LSA and nonlatent algorithms using the language surface structure yield comparable results.

5.1. LSA and first-order co-occurrences

In showing that there is no correlation between LSA and first-order co-occurrences, Landauer et al. (Landauer, 1999; Landauer & Dumais, 1997; Landauer et al., 1998) give an example from human-computer interaction (HCI), taking the words human, interface, computer, user, system, response, time, EPS, survey, trees, graph, and minor and showing that LSA induces latent semantic representations from these words. Using a small number of technical documents, Landauer et al. demonstrated that LSA was able to adequately identify semantic similarities between these words, whereas first-order co-occurrences did not yield comparable results. These findings showed LSA’s strength lies in extracting meaning from a small body of texts, whereas first-order co-occurrences fail because of data sparsity.

A fundamental question that needs to be answered is whether first-order co-occurrences yield similar results as LSA if the problem of data sparsity were to be solved, for instance by using a larger corpus (see also Stone, Dennis, & Kwantes, this issue). If LSA yields similar results as first-order co-occurrences, the language surface structure is adequate for inferring word meanings. On the other hand, if LSA yields different results than first-order co-occurrences, first-order co-occurrence relations alone are inadequate for inferring word meanings, and powerful algorithms such as LSA are needed for meaning extraction. In other words, if first-order co-occurrences yield similar results as LSA, then the strength of meaning extraction from text lies in language; if first-order co-occurrences yield different results than LSA, the strength of meaning extraction lies in the algorithm. Note, however, that the question here is not whether meaning extraction should be done without sophisticated algorithms such as LSA. The question instead is to determine the extent to which the simplest of algorithms can extract meaning from language, thereby providing an estimate of the lower bound of what humans can extract from language.

The question of whether simple algorithms with more data yield results similar to those of more sophisticated algorithms such as LSA with less data has been addressed more extensively elsewhere (Budiu, Royer, & Pirolli, 2007; Cai et al., 2004; Louwerse & Zwaan, 2009; Recchia & Jones, 2009). For illustration purposes, Landauer et al.’s (1998) analysis of HCI words was repeated here using a larger corpus.

For this and all subsequent LSA analyses (unless stated otherwise) an LSA space was created using the Touchstone Applied Science Associates (TASA) corpus, which is frequently used to create LSA spaces (Landauer et al., 2007). The TASA corpus consists of approximately 10 million words (92,409 word types) of unmarked English text on language arts, health, home economics, industrial arts, science, social studies, and business. This corpus is divided into 37,600 documents, averaging 166 words per document. Also, for this and all subsequent first-order word co-occurrence analyses (unless stated otherwise) the Web 1T 5-gram corpus (Brants & Franz, 2006) was used. The corpus consists of unigrams, bigrams, trigrams, 4- and 5-grams of information from the Google database. It consists of 1 trillion word tokens (13,588,391 word types) from 95,119,665,584 sentences. Words in the corpus are more like character strings and include email addresses and URLs, as well as punctuation (e.g., I know. is a trigram). The word type counts are therefore considerably inflated.

In the first analysis, semantic associations of all combinations of the 12 Landauer et al. human-computer interaction keywords were computed using LSA cosine values from the TASA space, and TASA and Web 1T 5-gram first-order co-occurrence frequencies. Semantic relations between identical words (e.g., computer–computer) and values yielding zero results (e.g., interface–graph) were removed from the analysis. LSA cosine values and TASA log frequencies yielded a significant correlation (r = .467, p < .001, N = 58). The drawback of this analysis is that about half of the words did not co-occur. In a second analysis, we compared the LSA cosine values and the log frequencies of the Web 1T 5-gram corpus, a corpus 100,000 times larger than the TASA corpus. This comparison again yielded a significant correlation (r = .485, p < .001, N = 102), now with more co-occurrences being included.
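The comparison described above can be sketched in a few lines: drop word pairs with a zero co-occurrence count, log-transform the remaining counts, and correlate them with the LSA cosines. The cosines and counts below are hypothetical values chosen for illustration and are not taken from the TASA or Web 1T 5-gram corpora; the natural logarithm is likewise an assumption, although the Pearson correlation does not depend on the base of the logarithm.

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# (LSA cosine, co-occurrence count) per word pair: HYPOTHETICAL numbers.
pairs = {
    ("human", "computer"): (0.45, 1200),
    ("user", "system"): (0.52, 5400),
    ("graph", "trees"): (0.30, 310),
    ("survey", "minor"): (0.05, 12),
    ("interface", "graph"): (0.02, 0),  # zero count: removed from the analysis
}

# Keep only pairs that co-occur at least once, then log-transform the counts.
kept = [(cos, math.log(count)) for cos, count in pairs.values() if count > 0]
cosines = [cos for cos, _ in kept]
log_freqs = [log_freq for _, log_freq in kept]

r = pearson_r(cosines, log_freqs)
print(f"pairs kept: {len(kept)}, r = {r:.3f}")
```

Removing zero counts before the log transform mirrors the exclusion of pairs such as interface–graph, and it is also what makes the size of the corpus matter: the larger the corpus, the fewer pairs are lost.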

This example, used for illustrative purposes, shows LSA and first-order co-occurrence estimates of semantic similarity are similar. Consequently, results obtained from LSA analyses are likely also to be found in first-order co-occurrence analyses, and vice versa, under the condition that the corpus being used is of adequate size.

A potential problem with first-order co-occurrence estimates is that they supposedly do not allow for synonyms and other strong paradigmatic relationships. For instance, Dumais (2003) argued that the search query “car” does not retrieve “automobile,” whereas LSA would. Perhaps the success of first-order co-occurrence estimates again depends on the size of the corpus. For instance, using the Web 1T 5-gram corpus, the log frequency of car-automobile is 33.33, compared to 35.17 for car-truck, 33.82 for car-cars, 31.33 for car-vehicle, 30.70 for car-motorcycle, and 30.04 for car-train, yielding a correlation of r = .71, p = .04 with LSA findings.

Of course, LSA has some important advantages over first-order co-occurrences. First, it allows for knowledge induction using a far smaller corpus than when large numbers of n-gram combinations are compared. Second, it allows for input units beyond a word, such as sentence, paragraph, or even text comparisons. Finally, LSA uses a considerably faster algorithm than any word co-occurrence algorithm that searches through more than 13 million word types, as with the Web 1T 5-gram corpus. But LSA has an important drawback. Its analysis is latent, whereas a word co-occurrence analysis is overt. This has important implications for a theory of cognition. If LSA is a theory of cognition, questions can be raised regarding the psychological validity of its mechanisms (Glenberg & Robertson, 2000). On the other hand, if LSA is a convenient short-cut to relations that are present in language itself, the success of LSA shifts from the power of the algorithm to the power of language.

Following the correlation between LSA and first-order word co-occurrences, it needs to be determined whether the language humans are exposed to in any way resembles a large language corpus as the one used here. If humans are only exposed to a small fraction of the language that is needed to obtain reliable first-order co-occurrences, human meaning induction must rely on the sophistication of the algorithm. On the other hand, if humans are exposed to a large amount of language, then, at least in theory, in statistical learning humans can rely on the surface structure of language.

The question of how much language humans are exposed to is difficult to answer, as it depends on how language is defined here: word combinations, types, or tokens. Moreover, if an estimate can be given, that estimate is obviously different for different people. Mehl, Vazire, Ramirez-Esparza, Slatcher, and Pennebaker (2007) estimated daily word use from six corpus samples, collected between 1998 and 2004, with a total of 396 participants. Over a period of 17 waking hours an average participant used approximately 16,000 words, albeit with very large individual differences around the mean. If the assumption is made that a language user produces 30% of the language he or she encounters and hears the remaining 70%, the average person processes approximately 53,000 word tokens a day (16,000/0.30), which amounts to almost 20 million word tokens a year. This number should be considered a lower bound, because it does not include language we overhear but do not pay attention to, inner speech, or songs we listen to. Mehl et al.’s (2007) participants were between 17 and 29 years old, with data being very similar across the younger and older age groups.
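The arithmetic behind this estimate is simple enough to spell out. The sketch below uses only the figures given above (16,000 words produced per day and the assumed 30/70 split between produced and heard language); the variable names are illustrative.

```python
WORDS_PRODUCED_PER_DAY = 16_000   # average reported by Mehl et al. (2007)
SHARE_PRODUCED = 0.30             # assumption made in the text: 30% produced, 70% heard

# If 16,000 produced words are 30% of all spoken language a person processes,
# the daily total is 16,000 / 0.30.
tokens_per_day = WORDS_PRODUCED_PER_DAY / SHARE_PRODUCED
tokens_per_year = tokens_per_day * 365

print(f"per day:  {tokens_per_day:,.0f} word tokens")
print(f"per year: {tokens_per_year:,.0f} word tokens")
```

This gives a little over 53,000 word tokens per day and roughly 19.5 million per year, in line with the "almost 20 million" stated above.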

Finally, written language (newspapers, magazines, books, emails, Internet) is not considered in these estimates of language exposure. It is therefore fair to say that the amount of language an average language user is exposed to is at least around 200–500 million word tokens, somewhere between the size of the TASA corpus (10 million word tokens) and the Web 1T 5-gram (1 trillion word tokens). In conclusion, this example might indicate that language overtly encodes some of the relations that LSA reveals in a latent analysis.

5.2. Categorization of semantic knowledge using LSA and first-order co-occurrences

The previous section showed that results obtained with LSA may also be obtained with simpler algorithms, given that the corpus is of an adequate size. The purpose of the second analysis is two-fold. First, the analysis aims to again show that first-order associations allow for effective meaning induction comparable to LSA. The second purpose is to show that categorization of concepts using perceptual features can emerge from language.

Rogers and McClelland (2004, 2008) presented a computational model that simulates human categorization of concepts. Rogers and McClelland’s connectionist model is very similar to the original models described in Rumelhart, McClelland, and the PDP research group (1986). Rogers and McClelland proposed that semantic cognition is formed by activation of neuron-like processing units that form categories over time. Using a large number of experimental studies, the authors computationally modeled these studies on the categorization of semantic concepts, lexical acquisition, and disordered semantic cognition. Central in Rogers and McClelland’s theory is that semantic representations mediate between perceptual features (e.g., red), functional features (e.g., fly), and verbal descriptors (e.g., bird), akin to Collins and Quillian’s (1969) work on semantic networks. However, Rogers and McClelland’s theory differs from semantic network theories, in that categories in Rogers and McClelland’s model emerge in the connectionist process rather than being fixed in a rigid semantic network.

For instance, Rogers and McClelland trained their connectionist model and showed that over time, by identifying features belonging to concepts such as canary, robin, sparrow, and penguin, the network is able to induce that these concepts are members of the category bird. Rogers and McClelland pointed out that their model is very similar to LSA. At the same time their model shows some important differences, for instance, by revealing the emergence of categories over time. Moreover, the authors leave aside the question whether the semantic features are symbolic (linguistic) or embodied (perceptual) in nature.

The argument can be made that the structure that Rogers and McClelland (2004, 2008) find in the output is built into the input. Patterns are not extracted by the network per se but are entered into, and enhanced by, the network (Borsboom & Visser, 2008; Snedeker, 2008). The question can therefore be raised where these input units come from. Based on the discussion of symbolic and embodied cognition earlier, the answer to this question is simple: They either come from linguistic information or from perceptual simulations. But that answer does not quite suffice, for linguistic input requires first-order associations in language, and perceptual input requires perceptual information. And after all, semantic associations are presumed not to be encoded adequately through first-order co-occurrences (Landauer, 1999) and meaning cannot be induced without grounding each and every amodal linguistic symbol (Glenberg, 1997).

The question of whether the surface structure of language allows for the categorization process in Rogers and McClelland’s model was investigated here by taking the verbal descriptors and features used in Rogers and McClelland (2004). Table 1 gives the 16 verbal descriptors, 26 features and a description of the type of features, and the six categories that Rogers and McClelland (2004) obtained from their model.

Table 1. Verbal descriptors and features used by Rogers and McClelland (2004) and categories resulting from their analysis

Categories    Verbal Descriptors    Features    Type of Feature
Animal        Birch                 Bark        Attributive
Bird          Canary                Branches    Attributive
Fish          Cod                   Feathers    Attributive
Flower        Daisy                 Fur         Attributive
Plant         Flounder              Gills       Attributive
Tree          Maple                 Leaves      Attributive
              Oak                   Legs        Attributive
              Penguin               Petals      Attributive
              Pine                  Roots       Attributive
              Robin                 Scales      Attributive
              Rose                  Skin        Attributive
              Salmon                Wings       Attributive
              Sparrow               Fly         Functional
              Sunfish               Grow        Functional
              Sunflower             Living      Functional
              Tulip                 Move        Functional
                                    Sing        Functional
                                    Swim        Functional
                                    Walk        Functional
                                    Big         Visual
                                    Green       Visual
                                    Pretty      Visual
                                    Red         Visual
                                    Twirly      Visual
                                    White       Visual
                                    Yellow      Visual

In the first analysis, semantic associations between 16 verbal descriptors (names of birds, fish, flowers, and trees; Table 1) and their 26 features (attributive, functional, visual; Table 1) were computed. The 16 × 26 matrix was submitted to an MDS analysis using the ALSCAL algorithm (SPSS 15.0.1 MDS procedure; Chicago, IL). The advantages of using LSA in combination with MDS have been described in Louwerse (2007), Louwerse and Van Peer (2009), Louwerse and Zwaan (2009), and Louwerse et al. (2006). The matrix of LSA cosine values was transformed into a matrix of Euclidean distances, and these distances were scaled multidimensionally by comparing them with arbitrary coordinates in an n-dimensional space (low cosine values correspond to large distances, high values to short distances). The coordinates were iteratively adjusted such that Kruskal’s stress was minimized and the degree of correspondence was maximized. Default criteria were used, with an S-stress convergence = 0.001, minimum stress value = 0.005, and maximum iterations = 30. That is, the algorithm stopped iterating when the difference between stress values across iterations was less than the convergence criterion, when the stress value itself was less than the minimum, or when the maximum number of iterations was reached.
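Two ingredients of this procedure can be illustrated without reproducing ALSCAL itself: turning a descriptor × feature matrix into distances between descriptors, and scoring how well a candidate configuration fits those distances. The sketch below is not the SPSS procedure. It assumes that distances are taken between the rows of the matrix, treats the target distances directly as disparities (a metric simplification), and uses hypothetical cosine values and a hand-picked two-dimensional configuration.

```python
import math
from itertools import combinations

def row_distances(matrix):
    """Euclidean distance between every pair of rows (here: descriptors)."""
    n = len(matrix)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = math.dist(matrix[i], matrix[j])
    return dist

def kruskal_stress1(target, coords):
    """Kruskal's stress-1: sqrt(sum (target - fitted)^2 / sum fitted^2),
    where fitted distances are measured in the low-dimensional configuration."""
    numerator = denominator = 0.0
    for i, j in combinations(range(len(coords)), 2):
        fitted = math.dist(coords[i], coords[j])
        numerator += (target[i][j] - fitted) ** 2
        denominator += fitted ** 2
    return math.sqrt(numerator / denominator)

# HYPOTHETICAL cosines: 4 descriptors (rows) x 3 features (columns).
cosines = [
    [0.60, 0.55, 0.05],  # canary
    [0.58, 0.50, 0.08],  # robin
    [0.04, 0.10, 0.62],  # rose
    [0.06, 0.12, 0.57],  # tulip
]
target = row_distances(cosines)

# Hand-picked 2-D coordinates; an MDS algorithm would adjust these iteratively.
coords = [(-0.45, 0.03), (-0.42, -0.03), (0.45, 0.03), (0.42, -0.03)]
stress = kruskal_stress1(target, coords)
print(f"stress-1 = {stress:.3f}")
```

An iterative MDS routine would repeatedly move the coordinates to lower this stress value, stopping under criteria such as those listed above.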

Following Borg and Groenen (1997) among others, a low dimensionality was chosen in order to cancel out over- and underestimation errors in the proximities. The fit of the data was good with a two-dimensional scaling (Kruskal’s stress 1 = .189, R² = .901). The two-dimensional graph is given in Fig. 1, showing an almost perfect categorization of birds, flowers, fish, and trees in each of the quadrants, with plants on the right and animals on the left. These findings show that LSA is able to bootstrap categories of concepts.

Figure 1.

 MDS plot of the LSA analysis of 16 verbal descriptors × 26 features used in Rogers and McClelland (2004; Appendix B.2). Circles are added to emphasize groupings.

The remaining question is whether LSA was able to do this because of the algorithm or because of language encoding these features. To answer this question the same verbal descriptor × feature analysis was conducted using frequency counts in the Web 1T 5-gram corpus. As in the previous word co-occurrence analysis, frequency counts were normalized using z-scores and the 16 × 26 matrix was submitted to an MDS analysis using the ALSCAL algorithm. The fit of the data was considerably poorer, largely because many word co-occurrences did not occur in the corpus, though the fit was still reasonably good (Kruskal’s stress 1 = .302, R² = .602). As before, the categorization of birds, flowers, fish, and trees emerged from the MDS plot (Fig. 2), as well as the distinction between animals and plants.

Figure 2.

 MDS plot of the first-order co-occurrence analysis of 16 verbal descriptors × 26 features used in Rogers and McClelland (2004; Appendix B.2). Circles are added to emphasize groupings.

Two findings from this categorization analysis are noteworthy. First, first-order co-occurrence analyses yielded results very similar to those of the LSA analyses. Second, perceptual features assigned to verbal descriptors yielded a grouping of concept categories. That is, language encodes categorization information that first-order co-occurrence techniques can visualize as adequately as LSA. Both findings support the symbol interdependency hypothesis.

Next, the question is addressed to what extent the amodal linguistic system is organized such that embodied representations are encoded in language, where “embodied representations” are defined by the embodied cognition literature itself.

6. Perceptual information is encoded in language

The analysis in the previous section showed that results obtained using LSA are very similar to results using first-order co-occurrences. In this section, results from well-known previously published studies finding empirical evidence in favor of an embodied cognition account will be placed in a symbolic cognition context. The argument made in this section is that embodied cognition results obtained using linguistic stimuli should at least also be considered from a symbolic cognition perspective, because language has encoded embodied relations, and these linguistic cues are used by language users.

6.1. Modality switching

An important piece of evidence for embodied cognition comes from modality switching studies. Pecher, Zeelenberg, and Barsalou (2003) conducted a study in which the effect of modality switching was investigated. Participants were presented with sentences containing a concept word and a property (e.g., blenders can be loud) and pressed a “true” or “false” button based on the word pair being correct (blenders can be loud) or incorrect (loud can be blenders). The sentence following was either from the same modality (e.g., auditory: leaves–rustling) or a different modality (e.g., gustatory: cranberries–tart). Results showed response times to be faster when a word pair from the same modality followed than when a word pair from a different modality followed. Pecher et al. argued that this demonstrated that sensorimotor systems were activated during conceptual processing. That is to say, perceptual processing across these sensorimotor systems was costly, while perceptual processing within a system was not (see also Pecher, Zanolie, & Zeelenberg, 2007).

If perceptual information is encoded in language, the modality switching costs might be explained by semantic relations between the stimuli. There are three ways to look at this option. First, one can look at combinations of the sentence pairs where a combination refers to the same modality or a different modality (leaves–rustling and blenders–loud vs. leaves–rustling and cranberries–tart). Second, one can look at whether a concept word has a stronger semantic relation with a same-modality property (leaves–rustling) than a different-modality property (leaves–tart). Third, one can look at whether property words have a stronger semantic relationship with property words from the same modality than property words from different modalities (rustling–loud vs. rustling–tart). Evidence for a stronger semantic relation between same-modality combinations than different-modality combinations would provide evidence that perceptual information is encoded in language.

For the first analysis all of the 176 word pairs used in the positive critical trials in Pecher et al. (2003) were used. Each pair included a concept and a property falling into one of six modalities: motor (e.g., pebble–kicked), smell (e.g., soap–perfumed), sound (e.g., horn–blaring), taste (e.g., soup–salty), touch (e.g., blanket–itchy), and visual (e.g., pumpkin–orange). The list of word pairs was randomized, so that a duo of word pairs either referred to the same modality or to a different modality. Next, the LSA cosine values for all word pair duos were computed. As predicted, duos referring to the same modality had a stronger semantic relation than duos referring to different modalities (M = 0.10, SD = 0.13 vs. M = 0.02, SD = 0.08), F(1, 86) = 10.09, p = .002, MSE = 0.01, suggesting that modality shifts can be identified through linguistic information.

For the second analysis, the same set of 176 concept-property pairs was used to determine whether concept words have a stronger semantic relation with the property words related to that concept, than property words related to other concepts. LSA cosine values were computed for all possible concept and property combinations. Two groups of combinations were constructed. One group contained all combinations within a modality. That is, if a concept was initially paired with a modality (e.g., audition) then the same-modality combination group only consisted of combinations of the concept and properties from that modality (e.g., loud, rustling). For instance, blender–rustling was one pair for which the cosine was computed, leaves–loud another pair. In a similar fashion, the second group consisted of word pairs that combined concept and property from different modalities (blender–tart and leaves–tart were part of this set). An ANOVA on the LSA cosine values between the two groups again showed a significant difference between cosines of combinations within a modality versus across modalities, with same-modality groups having higher cosine values (M = 0.107, SD = 0.143) than different-modality groups (M = 0.019, SD = 0.079), F(1, 26542) = 208.96, p < .001, MSE = 0.006, further supporting the idea that language encodes modality shifts.

The third analysis was very much the same as the second analysis, except that the same- and different-modality groups were now populated with cosine values between properties. For instance, loud–rustling was a comparison in the same-modality group, while loud–tart and rustling–tart were comparisons in the different-modality group. An ANOVA again showed a significant difference between the same-modality and the different-modality group, F(1, 24814) = 6,027.49, p < .001, MSE = 0.007, with same-modality comparisons yielding higher cosines (M = 0.087, SD = 0.127) than different-modality comparisons (M = 0.021, SD = 0.073), again showing that modality shifts can be identified through linguistic cues.
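Each of the three analyses reduces to the same statistical step: a one-way ANOVA comparing cosine values in a same-modality group with those in a different-modality group. A minimal sketch of that step follows. The cosine values are hypothetical and far fewer than in the actual analyses, and the function is a plain textbook computation rather than the software used for the reported results.

```python
def one_way_anova(*groups):
    """Return (F, df_between, df_within, MSE) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n_total - k
    mse = ss_within / df_within
    f_value = (ss_between / df_between) / mse
    return f_value, df_between, df_within, mse

# HYPOTHETICAL LSA cosines for illustration only.
same_modality = [0.21, 0.08, 0.15, 0.02, 0.12]        # e.g., loud-rustling
different_modality = [0.03, -0.01, 0.05, 0.00, 0.02]  # e.g., loud-tart

f_value, df1, df2, mse = one_way_anova(same_modality, different_modality)
print(f"F({df1}, {df2}) = {f_value:.2f}, MSE = {mse:.4f}")
```

With two groups the between-groups degrees of freedom are always 1, and the within-groups degrees of freedom equal the number of cosine values minus 2, which is why the reported analyses have 86, 26542, and 24814 error degrees of freedom.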

The findings from these three analyses show that perceptual relations are encoded in language. Language users in turn could utilize these linguistic cues in their sensorimotor simulations.

6.2. Affordances

Further evidence for embodied cognition comes from the activation of affordances. Glenberg and Robertson (2000) presented participants with a setting and one of three sentences: a related sentence matching a typical situation in the world, an afforded sentence matching a situation that is atypical but can be imagined, and a nonafforded sentence that makes the situation described unnatural. An example is given in (1).

  • 1a. Setting: After wading barefoot in the lake, Erik needed something to get dry.
  • 1b. Related: He used his towel to dry his feet.
  • 1c. Afforded: He used his shirt to dry his feet.
  • 1d. Nonafforded: He used his glasses to dry his feet.

Sensibility and envisioning data from participants showed no differences between related and afforded sentences. On the other hand, differences were found between related and nonafforded sentences, and between afforded and nonafforded sentences, unsurprisingly with the lowest scores for nonafforded sentences. Glenberg and Robertson (2000) concluded that participants embody the sentences; because the nonafforded sentence cannot be embodied, it yields low sensibility and envisioning ratings.

Louwerse (2007) tested to what extent language predicts these differences, using LSA and MDS on the semantic relations between stimulus sentences. As with Glenberg and Robertson’s (2000) findings, the computational linguistic results yielded no differences between the related and the afforded sentences. On the other hand, and again similar to Glenberg and Robertson’s results, a significant difference was found between the LSA results of the related sentences and the nonafforded sentences, with the related sentences yielding higher values than nonafforded sentences. When the related, afforded, and nonafforded sentences were compared with the setting sentence, the nonafforded sentence was furthest away in Euclidean distance, whereas the afforded and the related sentences were close to the setting sentence, linking the computational results to Glenberg and Robertson’s experimental results. Finally, computational estimates correlated with Glenberg and Robertson’s sensibility and envisioning ratings, r(54) = .328, p = .01; r(54) = .31, p = .02, respectively (Louwerse, 2007).

These findings suggest that language encodes affordances; language users can in turn use these linguistic cues in forming embodied representations.

6.3. Iconicity

In an iconicity study, Zwaan and Yaxley (2003) presented participants with two words presented underneath one another, each word pair either having an iconic relation (attic above basement) or a reverse-iconic relation (basement above attic). Response times in a semantic judgment task were faster when items had an iconic relation than when they had a reverse iconic relation, presumably because items activated embodied relations and these embodied representations were iconic or reverse-iconic.

Louwerse (2008) tested to what extent these embodied relations were encoded in language. When the word order of the items was investigated, iconic orders (attic-basement) occurred significantly more frequently than reverse-iconic orders (basement-attic). An explanation for this finding is that because humans typically view the world from top to bottom, language has encoded this so that words describing concepts at the top precede those describing concepts at the bottom (Benor & Levy, 2006).
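The word-order comparison amounts to counting how often one word precedes the other within a short span of text. The sketch below illustrates the idea on a made-up sentence; the function name, the window of four following tokens (chosen to mimic the span of a 5-gram), and the text itself are assumptions for illustration and do not reproduce the corpus procedure of Louwerse (2008).

```python
from collections import Counter

def ordered_pair_counts(tokens, pairs, window=4):
    """Count how often `first` is followed by `second` within the next
    `window` tokens, separately for each ordered (first, second) pair."""
    wanted = set(pairs)
    counts = Counter()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 1 + window]:
            if (first, second) in wanted:
                counts[(first, second)] += 1
    return counts

# Made-up text for illustration only.
text = (
    "we searched the house from attic to basement and then "
    "from the basement to the attic and again attic and basement"
)
tokens = text.split()

counts = ordered_pair_counts(
    tokens, [("attic", "basement"), ("basement", "attic")]
)
print("iconic order (attic ... basement):        ", counts[("attic", "basement")])
print("reverse-iconic order (basement ... attic):", counts[("basement", "attic")])
```

In this toy text the iconic order occurs twice and the reverse-iconic order once; in a large corpus the same counts can be compared statistically across many word pairs.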

Louwerse (2008) replicated the Zwaan and Yaxley results in a semantic judgment experiment. Both the Zwaan and Yaxley embodiment variable as well as the word-order frequency variable explained response times; however, word order (symbolic cognition account) did this better than iconicity (embodied cognition account).

When the same semantic judgment experiment was conducted, but with word pairs presented horizontally instead of vertically, word order still significantly explained response times, suggesting that in normal left-to-right reading processes, word order explains semantic judgment.

These results show that iconic relations are encoded in language. Language users can exploit these linguistic cues to activate embodied representations.

6.4. Geographical information

Powerful evidence favoring embodied representations comes from visual imagery. Cognitive representations of world maps seem to come from images rather than from language, and if they do come from language, it is language that describes spatial information (Taylor & Tversky, 1996). Louwerse and Zwaan (2009) investigated to what extent geographical positioning of cities is encoded in general language, testing whether text co-occurrence scores between pairs of cities corresponded to the distance between them. We hypothesized that cities that are located together are talked about in similar contexts, much like the idea behind LSA and first-order co-occurrences. Louwerse and Zwaan selected the 50 largest cities of the United States and determined their longitude and latitude. Next, the semantic relationship between these 50 cities was computed with LSA. Semantic spaces were created using three newspapers, the New York Times, Wall Street Journal, and Los Angeles Times. The 50 × 50 cosine matrix obtained for each newspaper corpus was then supplied to an MDS algorithm. Absolute MDS estimates positively correlated with the actual longitude and latitude of the 50 cities. This finding was replicated using first-order co-occurrences of the 50 cities in the Web 1T 5-gram corpus, ruling out the possibility that the findings should be attributed to the algorithm rather than to linguistic information.
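The pipeline from a similarity matrix to a two-dimensional map can be sketched as follows. This is a hedged illustration, not the analysis of Louwerse and Zwaan (2009): the 50 "cities" are random coordinates, the similarity matrix is an assumed stand-in for LSA cosines (similarity simply decays with distance), classical (Torgerson) MDS is used as a simple substitute for whatever MDS variant was applied, and the function names classical_mds and pairwise_dist are hypothetical. Because MDS axes are arbitrary up to rotation and reflection, the sketch compares inter-point distances rather than raw longitude and latitude.

```python
import numpy as np

def classical_mds(dissim, n_dims=2):
    """Classical (Torgerson) MDS: embed items so that Euclidean distances
    approximate the entries of the dissimilarity matrix `dissim`."""
    n = dissim.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dissim ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)       # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_dims]      # indices of the largest ones
    return eigvecs[:, top] * np.sqrt(np.clip(eigvals[top], 0.0, None))

def pairwise_dist(points):
    """Euclidean distance between every pair of rows in `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

rng = np.random.default_rng(0)
# Hypothetical "cities": random longitude/latitude pairs, not real data.
coords = np.column_stack([rng.uniform(-120, -70, 50), rng.uniform(25, 48, 50)])
geo = pairwise_dist(coords)

# Assumed stand-in for a 50 x 50 LSA cosine matrix: similarity decays with distance.
cosine_like = np.exp(-geo / (10 * geo.max()))
embedding = classical_mds(1.0 - cosine_like)

# Compare distances in the MDS solution with the geographic distances.
upper = np.triu_indices(50, k=1)
r = np.corrcoef(pairwise_dist(embedding)[upper], geo[upper])[0, 1]
print(f"r between MDS distances and geographic distances: {r:.3f}")
```

With real corpus-derived cosines the recovered map would be noisier; the sketch only makes the mechanics of the similarity-to-map step concrete.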

The finding that Louwerse and Zwaan were able to produce a map of the United States of America solely using linguistic information from corpora that did not provide specific spatial information about these cities, shows that language encodes geographical information. Results from experimental studies reported in Louwerse and Zwaan (2009) indicated that 16–35% of the longitude and latitude variance in human location estimates was predicted by the corpus data, indicating that human geographical estimates might be based in part on spatial information coded in language.

6.5. Motor resonance

The embodied cognition literature has presented evidence that linguistic information activates corresponding motor responses. This so-called action-compatibility effect (ACE) shows responses to linguistic stimuli to be faster when the physical response is in the same direction as the movement implied by a sentence (Glenberg & Kaschak, 2002). For instance, in a sentence sensibility task participants responded faster to the sentence Courtney handed you the notebook when the “yes” button was closer to their body than away from their body, while they responded faster to You handed Courtney the notebook when the “yes” button was away from their body (Glenberg & Kaschak, 2002).

The rationale for the ACE is that words and phrases are indexed to perceptual information from which affordances are derived. These affordances are then meshed (Glenberg & Kaschak, 2002). The linguistic information itself is presumed to be arbitrary and not to contribute at all to a distinction between “horizontal” and “vertical,” or between “toward information” and “away information.” On the other hand, according to the symbol interdependency hypothesis, embodied relations are encoded in language. Consequently, language can cue comprehenders in action-sentence compatibility. This hypothesis was tested using the stimuli in Kaschak et al. (2005).

Kaschak et al. (2005) found a difference in participants responding to auditory sentences presented with a visual stimulus depicting a motion in the same versus an opposite direction described by the sentence. All 31 sentences from Kaschak et al. (2005) were used, comprising 16 sentences describing a horizontal movement (He rolled the bowling ball down the alley; The dog was running towards you) and 15 sentences describing a vertical movement (The steam rose from the boat; The sand poured through the hour glass), with half of the sentences in each group in an away vs. towards condition or an up vs. down condition.

A 31 × 31 matrix of cosine values between the Kaschak et al. (2005) sentences was computed and then submitted to an MDS algorithm using the ALSCAL algorithm. The fitting of the data was acceptable with a two-dimensional scaling (Kruskal’s stress 1 = .329, R² = .559). If linguistic information can be clustered according to the horizontal versus vertical dimension, a difference in loadings on MDS Dimension 1 is expected. Similarly, if linguistic information can be clustered according to the toward/away or the up/down dimension, a difference in loadings on MDS Dimension 2 is expected. For Dimension 1 loadings differed between the horizontal and the vertical sentences, F(1, 30) = 17.04, p < .001, MSE = 0.985. For Dimension 2 no difference was found within the horizontal direction category, F(1, 15) = 3.895, p = .07, MSE = 0.337, or the vertical direction category, F(1, 14) = 0.543, p = .474, MSE = 0.759. The likely explanation for the absence of a difference as found in Dimension 2 is the low number of items being compared. What is more, one of the weaknesses of LSA is that it does not take into account word order. It is therefore unable to distinguish whether The shark was drawing near you or You were drawing near the shark (but see Dennis, 2004). Nevertheless, LSA was able to capture the differences between language encoding information from the horizontal condition versus the vertical condition, giving an indication that these embodied relations are encoded in language.
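The group comparison on an MDS dimension amounts to a one-way analysis of variance on the items' coordinates. The following Python sketch shows how such an F value and its mean square error (MSE) might be computed. The function name one_way_f and the eight coordinate values are invented for illustration; they are not the loadings obtained for the Kaschak et al. (2005) sentences.

```python
import numpy as np

def one_way_f(*groups):
    """One-way ANOVA: return (F, df_between, df_within, MSE)."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    all_values = np.concatenate(groups)
    grand_mean = all_values.mean()
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    # Variation of items around their own group mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    mse = ss_within / df_within
    f_value = (ss_between / df_between) / mse
    return f_value, df_between, df_within, mse

# Invented Dimension 1 coordinates for two hypothetical sentence groups.
horizontal = [-1.2, -0.8, -1.0, -0.6]
vertical = [0.9, 1.1, 0.7, 1.3]

f_value, df_between, df_within, mse = one_way_f(horizontal, vertical)
print(f"F({df_between}, {df_within}) = {f_value:.2f}, MSE = {mse:.3f}")
```

A large F indicates that the two groups of sentences occupy different regions of the dimension, which is the pattern the clustering hypothesis predicts.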

To test the role of language in the ACE further, stimuli used in Zwaan and Taylor (2006) were subjected to the same analysis. As in Glenberg and Kaschak (2002) and Kaschak et al. (2005), Zwaan and Taylor (2006) investigated language-based motor resonance. Participants were presented with sentences describing a clockwise motion (Jim tightened the lug nuts) or a counterclockwise motion (Dave removed the screw from the wall). Zwaan and Taylor (2006) asked participants to listen to sentences describing clockwise and counterclockwise information and to make sensibility judgments by turning a knob clockwise or counterclockwise. Subjects responded more quickly when the motion described in the sentences matched their knob rotation than when there was a mismatch between knob and sentence motion. If language encodes embodied relations, clockwise sentences should cluster together, as should counterclockwise sentences, allowing language users to utilize these language features in embodied cognition.

All 18 items for Zwaan and Taylor (2006) were used to create an 18 × 18 LSA cosine matrix. This matrix was next submitted to an MDS algorithm using the ALSCAL algorithm. The fitting of the data was poor with a one-dimensional scaling (Kruskal’s stress 1 = .561, R² = .257). Nevertheless, when loadings on Dimension 1 were compared, a significant difference was found for items describing a clockwise versus items describing a counterclockwise rotation, F(1, 8) = 8.02, p = .02, MSE = 1.29.

These computational linguistic analyses using Kaschak et al.’s (2005) and Zwaan and Taylor’s (2006) stimuli suggest that language encodes cues for motor affordances. The organization of linguistic stimuli does not indicate the direction of embodied representations (up, down, left, right) and it does not account directly for motor responses. However, language seems to have encoded some of the motor affordance information. With limited grounding, meaning can be bootstrapped throughout the linguistic system.

7. Discussion and conclusion

The present paper was motivated by two presuppositions underlying the topic of this journal issue. First, that meaning can be extracted from language computationally and, second, that the computational findings can advance theories of human cognition. The question whether meaning can be extracted computationally and advance theories of human cognition has caused a schism between symbolic and embodied accounts of cognition. Symbolic accounts of cognition have emphasized that meaning can be extracted from language, whereas embodied cognition accounts have emphasized that meaning is constructed from fundamentally embodied representations.

The present paper has described how these two accounts are mutually reinforcing in the symbol interdependency hypothesis. According to this hypothesis, meaning can be induced by symbol grounding, as well as by bootstrapping meaning through relations between the symbols themselves. This bootstrapping process is facilitated by language having encoded embodied information. In using language, speakers encode the perceptual world around them. Comprehenders can use these linguistic cues to decode the perceptual world around them. Because embodied relations are encoded in language, extracting meaning from language computationally is feasible.

The symbol interdependency hypothesis might help explain how symbolic and embodied cognition accounts are mutually reinforcing, but it also generates a number of research questions. For instance, the computational linguistic analyses presented throughout this paper have shown that language encodes perceptual information, but to what extent language users rely on linguistic information in the activation of embodied representation is an open question. After all, the fact that language encodes perceptual information might be the result of language users producing embodied representations in linguistic information, not necessarily language users transducing linguistic information to embodied representations. Furthermore, it might well be the case that under different parameters a symbolic or an embodied cognition account reigns supreme (Louwerse & Jeuniaux, 2010). These parameters might be related to the duration a representation is held in memory (Kaschak & Borreggine, 2008), to the access to nonlinguistic information (Pecher, van Dantzig, Zwaan, & Zeelenberg, 2009), to individual differences based on skill (Madden & Zwaan, 2006) or age (Dijkstra, Yaxley, Madden, & Zwaan, 2004), or to the cognitive task (Louwerse & Jeuniaux, 2010). In other words, the fact that language encodes perceptual information does not rule out the embodied cognition argument that perceptual simulations are made in language processing tasks.

It has often been stated that cognition cannot be exclusively symbol manipulation. That is undoubtedly true, but can cognition be inclusively symbol manipulation? The symbol grounding thought experiment is as follows (Glenberg & Kaschak, 2002; Harnad, 1990; Searle, 1980). You land in a foreign country and all you have is a dictionary of the foreign language you do not speak. You would be hopelessly lost in translation, exchanging one foreign word for another.

But perhaps one should consider a different, and more realistic, version of this thought experiment. Imagine you land in a foreign country, you know some basics of the language spoken in that country, and you are continuously exposed to the foreign language you do not speak. Because embodied relations are encoded in language and because humans are skilled at picking up linguistic regularities, symbolic cognition helps to bootstrap meaning that is obtained through embodied cognition.

More concretely, if one knows the foreign words that refer to cities, the data presented in this paper suggest one is able to predict geographical information (Louwerse & Zwaan, 2009). If one knows the foreign words that refer to time, it is possible to place the words in chronological order, whether the words are referring to days of the week, months of the year, or other time units (Louwerse et al., 2006). If one knows that words refer to spatially related concepts, it is easy to predict which one is located higher and which one lower (Louwerse, 2008). If one knows that words refer to personal pronouns, it is easy to determine which words are first-, second-, and third-person pronouns, and which ones are singular and which ones are plural (Louwerse & Van Peer, 2009). These predictions are not limited to English (Louwerse & Van Peer, 2006; Louwerse et al., 2006), but they seem to be applicable to natural language in general.

The point is this: An embodied component should not be abandoned altogether, but neither should a symbolic component. Cognitive science should be cautious that the pendulum of research that swung towards an exclusively symbolic cognition position in the latter part of the last century does not swing towards an exclusively embodied cognition position in the current century. Symbolic and embodied accounts of cognition are mutually reinforcing. The support for language comprehension and language production is vested neither in computational processes nor in embodied representations, but in language itself.

Acknowledgments

I would like to thank Diane Pecher for sharing the experimental stimuli (section 6.1), and Louise Connell, Simon Dennis, Mike Jones, Chris Kurby, Danielle McNamara, and an anonymous reviewer for comments on earlier drafts of this paper. The usual exculpations apply.
