Language Evolution Can Be Shaped by the Structure of the World



Human languages vary in many ways but also show striking cross-linguistic universals. Why do these universals exist? Recent theoretical results demonstrate that Bayesian learners transmitting language to each other through iterated learning will converge on a distribution of languages that depends only on their prior biases about language and the quantity of data transmitted at each point; the structure of the world being communicated about plays no role (Griffiths & Kalish, 2005, 2007). We revisit these findings and show that when certain assumptions about the relationship between language and the world are abandoned, learners will converge to languages that depend on the structure of the world as well as their prior biases. These theoretical results are supported with a series of experiments showing that when human learners acquire language through iterated learning, the ultimate structure of those languages is shaped by the structure of the meanings to be communicated.

1 Introduction

Human languages have rich structure on many levels, from phonology to semantics to grammar. Where does this structure come from? Many researchers agree that linguistic structure is shaped by the structure of our minds—that our brains contain prior biases that favor the acquisition or retention of some linguistic systems over others. As such, debate generally centers around the nature and origin of these biases. Some suggest that natural selection operates on genes for language (Komarova & Nowak, 2001; Nowak, Komarova, & Niyogi, 2001; Pinker & Bloom, 1990) or on capabilities then used by the language faculty (Hauser, Chomsky, & Fitch, 2002). Others have suggested that humans easily learn language not because of a language-specific genetically encoded mechanism, but because language evolved to be learnable and usable by human brains (Brighton, Smith, & Kirby, 2005; Christiansen & Chater, 2008; Zuidema, 2002). Although these accounts disagree in many particulars, they agree that the structure of language arises from the structure of the brain.

In this study we argue that language evolution can be shaped by the structure of the world in addition to pre-existing cognitive biases. Because language involves communicating about the world, the structure of that world (i.e., the things to be communicated) can interact with people's prior biases to shape the languages that develop. We offer theoretical and experimental support of this proposition. On the theoretical side, we take as our starting point recent work within the “iterated learning” framework (in which new learners receive their data from previous learners). Previous research has shown that when learners are individually Bayesian, an iterated learning chain converges in the limit to the prior distribution over all possible languages (Griffiths & Kalish, 2005, 2007). However, the proof of this rests on certain premises: either languages carry no assumptions about the distribution of meanings/events in the world (Griffiths & Kalish, 2005), or if they do, the frequency with which those events are spoken about is entirely dependent on the language, rather than also the distribution of events in the world (Griffiths & Kalish, 2007). Here, we demonstrate that if either premise is abandoned, the iterated learning process converges to a distribution that depends on the distribution of events in the world as well as the prior biases of the learner. These theoretical results are experimentally tested in a laboratory-based iterated learning experiment, which shows that participants converge on different languages depending on the structure of the space of meanings they are shown.

1.1 Iterated learning

The iterated learning modeling framework is widely used in language evolution research (Griffiths & Kalish, 2007; Kirby et al., 2008; Kirby & Hurford, 2002; Reali & Griffiths, 2009a; Smith, 2009). It views the process of language evolution in terms of a chain of Bayesian learners (or generations), which is illustrated schematically in Fig. 1. The first learner in the chain sees some linguistic data (e.g., utterances), forms a hypothesis about what sort of language would have generated that data, and then produces his or her own data by sampling from his or her posterior distribution; this data then serves as input to the next learner in the chain. Over time, the languages that emerge from this process become non-arbitrary: Two recent articles (both referred to, henceforth, as GK) demonstrate that when the learners are Bayesian, we should expect an iterated learning chain to converge to the prior distribution over all possible languages (Griffiths & Kalish, 2005, 2007). That is, the probability of any given language emerging does not depend on the structure of the world or independent properties of the language—only the assumptions of the learner. The existence of a linguistic bottleneck (in which only a small amount of information is transmitted at each link in the chain) can speed the rate of convergence, but GK's result implies that neither the structure of the meaning space nor the nature of the initial language should have an effect on the language that eventually evolves.
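The chain described above can be sketched as a small Monte Carlo simulation. This is only an illustration under the GK (2005) premise that languages carry no assumptions about which events occur; the two toy languages, the prior, the noise rate, and the uniform choice of meanings are all hypothetical values chosen for the example, not taken from any of the cited studies.

```python
import random

# Hypothetical toy setup: two meanings (0/1), two words (0/1), two languages.
MEANINGS = [0, 1]
PRIOR = {"A": 0.7, "B": 0.3}   # assumed prior bias shared by every learner
NOISE = 0.1                    # assumed chance of producing the unintended word


def word_prob(y, x, h):
    """P(y | x, h): language A names meaning x with word x, B with word 1 - x."""
    intended = x if h == "A" else 1 - x
    return 1 - NOISE if y == intended else NOISE


def sample_language(pairs, rng):
    """Learning step: sample h from P(h | x, y), proportional to P(y | x, h) P(h)."""
    weights = {}
    for h, p in PRIOR.items():
        for x, y in pairs:
            p *= word_prob(y, x, h)
        weights[h] = p
    r = rng.random() * sum(weights.values())
    for h, w in weights.items():
        r -= w
        if r <= 0:
            return h
    return h  # guard against floating-point round-off


def produce(h, m, rng):
    """Production step: name m uniformly drawn meanings using language h."""
    pairs = []
    for _ in range(m):
        x = rng.choice(MEANINGS)
        y = x if h == "A" else 1 - x
        if rng.random() < NOISE:
            y = 1 - y
        pairs.append((x, y))
    return pairs


def run_chain(generations, m, seed=0):
    """Return how often each language was held across the chain."""
    rng = random.Random(seed)
    h = "B"  # arbitrary initial language
    counts = {"A": 0, "B": 0}
    for _ in range(generations):
        h = sample_language(produce(h, m, rng), rng)
        counts[h] += 1
    return {k: v / generations for k, v in counts.items()}


freq = run_chain(generations=50000, m=2)
print(freq)  # under these assumptions the frequencies should sit near PRIOR
```

Because the meanings here are drawn independently of the language and the languages make no claims about event frequencies, the long-run frequencies track the prior regardless of the starting language.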

Figure 1.

(a) Schematic illustration of the typical iterated learning paradigm, which assumes that learner n acquires language on the basis of the language data produced by learner n − 1. (b) A different view of iterated learning recognizes that because individuals produce language to communicate about the world, the data available to learners includes meanings in the world as well as the language produced by the learner before them.

GK's result can be conceptualized as follows. Suppose learners must acquire languages that describe a two-dimensional semantic space of some sort. For illustrative purposes, suppose further that the learners have a prior bias to prefer languages with fewer words and to pay more attention to one of the dimensions, as occurs in human category learning and development (Landau, Smith, & Jones, 1988). This prior bias might impose a distribution over hypotheses h about possible languages, like the illustrative one shown in Fig. 2a: Languages like ha with a few words that classify according to the preferred dimension (the x-axis in this case) have higher prior probability than languages like hb, which have many words, or hc, whose words classify according to the less preferred dimension. GK suggest that languages evolving to describe this space will converge to the prior distribution: 40% of the time ha will emerge, 10% of the time hb will emerge, and so forth. Although this prior and these precise numbers are imaginary, the picture in Fig. 2a provides a schematic illustration of what GK's results mean.

Figure 2.

(a) Intuitive illustration of the results of Griffiths and Kalish (2005, 2007) (GK). Given a two-dimensional semantic space, a learner with a prior bias to favor one dimension of that space (the x-axis) and languages with fewer words might have a prior distribution over languages that puts more probability on ha and less on hb or hc. GK demonstrates that the languages that evolve will converge to this prior distribution. (b) However, if the natural categories in the world have a different structure, we might intuitively expect that languages that capture that structure, like hc, should be more likely to evolve.

It also, however, highlights an apparent oddity within these results. Suppose that the world possesses structure in the form of natural categories of some sort, and these natural categories happen to group items according to the non-preferred dimension, as shown in Fig. 2b: The items observed by the learner—generated by the world—correspond to the black dots, which fall naturally into two clusters. We might intuitively expect that a language like hc would be a better fit to this world (and hence be more likely to evolve) than a language like ha, even though ha has higher prior probability. The results of GK appear to suggest otherwise. Is our intuition simply wrong, or is there a mismatch between the GK derivation and the problem of language evolution within a structured world? In the next section, we argue for the latter.

2 Theoretical result

Our primary result formalizes the learning framework similarly to Griffiths and Kalish (2005), although later we will also address the slightly different formalization of Griffiths and Kalish (2007). For now, we consider a Bayesian learner who sees m meanings or events, denoted $x = \{x^{(1)}, \ldots, x^{(m)}\}$. These meanings are paired with m corresponding utterances, denoted $y = \{y^{(1)}, \ldots, y^{(m)}\}$. The first learner in the chain is shown some initial data consisting of meaning–utterance pairs $(x_0, y_0)$. Then, when shown new events $x_1$, the learner produces utterances $y_1$, so that $(x_1, y_1)$ are the input to the next learner. In general, learner n + 1 sees data $(x_n, y_n)$ and generates $y_{n+1}$ based on events $x_{n+1}$, so that the next learner receives input $(x_{n+1}, y_{n+1})$. The goal of each learner is to estimate the mapping between meanings and utterances, which corresponds to learning the language he or she is exposed to. It is assumed that each learner has the same countable hypothesis space $\mathcal{H}$ of possible languages, such that each $h \in \mathcal{H}$ corresponds to one language. For any learner, acquisition involves a learning step and a production step.

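In the learning step, learner n + 1 sees $(x_n, y_n)$ and computes a posterior distribution over possible languages $h_{n+1}$. Bayes' rule implies that we can express this posterior distribution as follows:

$$P(h_{n+1} \mid x_n, y_n) = \frac{P(y_n \mid x_n, h_{n+1})\, P(h_{n+1} \mid x_n)}{\sum_{h \in \mathcal{H}} P(y_n \mid x_n, h)\, P(h \mid x_n)} \tag{1}$$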

In their derivation, Griffiths and Kalish (2005) assume that each language h makes no assumption about which events x are more likely than any other; given that assumption, they note that P(h|x) = P(h) and proceed with a version of Eq. (1) based on that modification. Alternatively, however, it might be that the language carries with it certain assumptions about what events are possible or probable in the world, in which case the GK assumption is untenable. In other words, simply observing meaningful events x may bias the learner to prefer some languages over others. If this is the case, then P(h|x) does not equal P(h), and the learning step is described by Eq. (1).
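As a minimal sketch of the learning step, the snippet below evaluates Eq. (1) for a toy model in which each language also carries an event model P(x|h). The two languages, their event models, the prior, and the noise rate are hypothetical numbers introduced purely for illustration.

```python
# Hypothetical toy model; every number here is an illustrative assumption.
PRIOR = {"A": 0.7, "B": 0.3}        # P(h)
EVENT_MODEL = {"A": [0.8, 0.2],     # P(x | h): which events each language expects
               "B": [0.2, 0.8]}
NOISE = 0.1                         # chance of producing the unintended word


def word_prob(y, x, h):
    """P(y | x, h): language A names meaning x with word x, B with word 1 - x."""
    intended = x if h == "A" else 1 - x
    return 1 - NOISE if y == intended else NOISE


def normalize(weights):
    total = sum(weights.values())
    return {h: w / total for h, w in weights.items()}


def posterior_given_events(xs):
    """P(h | x), proportional to P(h) times the product of P(x_i | h)."""
    weights = {}
    for h, p in PRIOR.items():
        for x in xs:
            p *= EVENT_MODEL[h][x]
        weights[h] = p
    return normalize(weights)


def posterior(xs, ys):
    """Eq. (1): P(h | x, y), proportional to P(y | x, h) P(h | x)."""
    p_h_given_x = posterior_given_events(xs)
    weights = {}
    for h, p in p_h_given_x.items():
        for x, y in zip(xs, ys):
            p *= word_prob(y, x, h)
        weights[h] = p
    return normalize(weights)


# Merely observing event 1 already shifts belief away from the prior.
print(posterior_given_events([1]))
print(posterior([1], [1]))
```

Here observing an event that language B considers likely lowers the probability of A below its prior value before any utterance is heard, which is exactly the situation in which P(h|x) differs from P(h).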

To see what this shift does to the iterated learning chain, we now turn to the production step. In this step, the learner encounters new meanings $x_{n+1}$, generated from the (objective) distribution Q(x) of meanings in the world. Given these meanings, the learner generates the new utterances $y_{n+1}$ by sampling them from $P(y_{n+1} \mid x_{n+1}, h_{n+1})$, where $h_{n+1}$ is the learner's language (sampled from the posterior distribution in Eq. (1)).

As all people in the chain follow the same learning and production steps, we can calculate $P(h_{n+1} \mid h_n)$, the probability that learner n + 1 acquires language $h_{n+1}$ given that the previous learner used the language $h_n$, in the following way:

$$P(h_{n+1} \mid h_n) = \sum_{x} \sum_{y} P(h_{n+1} \mid x, y)\, P(y \mid x, h_n)\, Q(x) \tag{2}$$

Thus, we have a sequence of random variables $h_1, h_2, h_3, \ldots$ describing the languages acquired by each person in the chain. This constitutes a Markov chain whose transition probabilities are given by $P(h_{n+1} \mid h_n)$. Assuming the chain is ergodic, its stationary distribution $\pi(h)$ satisfies

$$\pi(h_{n+1}) = \sum_{h_n} P(h_{n+1} \mid h_n)\, \pi(h_n) \tag{3}$$

for all $h_{n+1}$. Put another way, the probability distribution over languages $h_n$ ultimately approaches $\pi(h_n)$ as $n \to \infty$.

In the setup used by Griffiths and Kalish (2005), the stationary distribution π(h) corresponds to the prior P(h). However, under our formalization this is no longer the case. To find the stationary distribution in this situation, we make the following “representativeness” assumption: that the posterior probability of a hypothesis given an actual dataset x is close to its expected posterior probability given the generating distribution Q(x). In other words, we assume that $P(h \mid x) \approx \mathbb{E}_{Q(x)}[P(h \mid x)]$, for some x ∼ Q(x). If this assumption holds, then the stationary distribution is approximately $\pi(h) \approx \sum_{x} P(h \mid x)\, Q(x)$. That is, the chain converges to the expected posterior distribution over languages given meaningful events in the world. This is because for $\sum_{x} P(h \mid x)\, Q(x)$ to be the stationary distribution it must be true that:

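$$\begin{aligned}
\sum_{h_n} P(h_{n+1} \mid h_n)\, \pi(h_n)
&= \sum_{h_n} \sum_{x} \sum_{y} P(h_{n+1} \mid x, y)\, P(y \mid x, h_n)\, Q(x)\, \pi(h_n) \\
&\approx \sum_{x} Q(x) \sum_{y} P(h_{n+1} \mid x, y) \sum_{h_n} P(y \mid x, h_n)\, P(h_n \mid x) \\
&= \sum_{x} Q(x) \sum_{y} P(h_{n+1} \mid x, y)\, P(y \mid x) \\
&= \sum_{x} Q(x)\, P(h_{n+1} \mid x) = \pi(h_{n+1})
\end{aligned}$$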

The assumption these results depend on is relatively weak: All it requires is that the events or meanings each learner sees be a representative sample from the true generating distribution Q(x). In the limit where no learner sees any data, the stationary distribution converges to the prior, as P(h|x) = P(h) in that situation. But as the amount of data increases, the languages that evolve will depend on the posterior distribution P(h|x) and the distribution of meanings in the world Q(x). As the posterior depends on both prior and likelihood (P(h|x) ∝ P(h)P(x|h)), this means that the languages that evolve will be sampled from a distribution depending on which ones are favored a priori as well as which ones best capture the meanings in the world. The additional Q(x) term means that the distribution of those meanings matters as well. These results suggest that languages like hc might be more likely to evolve in a world like the one in Fig. 2b than the prior distribution over languages might suggest.
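The relationship between the prior, the expected posterior, and the stationary distribution can be checked numerically on a toy model. The sketch below builds the transition matrix of Eq. (2) exactly for a single meaning–utterance pair per generation and solves Eq. (3) by repeated multiplication. The two languages, their event models, the world distribution, and the noise rate are hypothetical values chosen for illustration; with only one datum per learner the representativeness assumption holds only loosely, so the stationary distribution should be expected to resemble the expected posterior rather than equal it.

```python
# Toy model; every number below is an illustrative assumption, not from the paper.
LANGS = ["A", "B"]
VALUES = [0, 1]                                   # meanings and words, coded 0/1
PRIOR = {"A": 0.7, "B": 0.3}                      # P(h)
NOISE = 0.1                                       # chance of the unintended word
WORLD = [0.2, 0.8]                                # Q(x)
STRUCTURED = {"A": [0.8, 0.2], "B": [0.2, 0.8]}   # P(x | h) differs by language
UNIFORM = {"A": [0.5, 0.5], "B": [0.5, 0.5]}      # GK (2005): P(h | x) = P(h)


def word_prob(y, x, h):
    """P(y | x, h): A names meaning x with word x, B with word 1 - x."""
    intended = x if h == "A" else 1 - x
    return 1 - NOISE if y == intended else NOISE


def normalize(w):
    z = sum(w.values())
    return {h: v / z for h, v in w.items()}


def post_x(x, events):
    """P(h | x), proportional to P(h) P(x | h)."""
    return normalize({h: PRIOR[h] * events[h][x] for h in LANGS})


def post_xy(x, y, events):
    """Eq. (1): P(h | x, y), proportional to P(y | x, h) P(h | x)."""
    px = post_x(x, events)
    return normalize({h: word_prob(y, x, h) * px[h] for h in LANGS})


def transition(events, q):
    """Eq. (2): T[h][h2] = sum_x sum_y P(h2 | x, y) P(y | x, h) Q(x)."""
    T = {h: {h2: 0.0 for h2 in LANGS} for h in LANGS}
    for x in VALUES:
        for y in VALUES:
            pxy = post_xy(x, y, events)
            for h in LANGS:
                for h2 in LANGS:
                    T[h][h2] += pxy[h2] * word_prob(y, x, h) * q[x]
    return T


def stationary(T, steps=2000):
    """Eq. (3), solved by repeatedly applying the transition matrix."""
    pi = {h: 1.0 / len(LANGS) for h in LANGS}
    for _ in range(steps):
        pi = {h2: sum(pi[h] * T[h][h2] for h in LANGS) for h2 in LANGS}
    return pi


def expected_posterior(events, q):
    """sum_x P(h | x) Q(x)."""
    return {h: sum(q[x] * post_x(x, events)[h] for x in VALUES) for h in LANGS}


pi_gk = stationary(transition(UNIFORM, WORLD))
pi_world = stationary(transition(STRUCTURED, WORLD))
exp_post = expected_posterior(STRUCTURED, WORLD)
print("prior              ", PRIOR)
print("stationary (GK)    ", pi_gk)
print("stationary (world) ", pi_world)
print("expected posterior ", exp_post)
```

In this toy setting the chain with uninformative event models recovers the prior, whereas the chain whose languages encode event expectations settles between the prior and the expected posterior, pulled toward the language that better matches the world.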

The derivation offered by Griffiths and Kalish (2007) removes the premise made by Griffiths and Kalish (2005) that languages h and events x are independent. However, their new proof of convergence to the prior depends on a different premise: When h and x are not independent, then the distribution of events x depends entirely upon the language rather than the world. Indeed, the 2007 derivation makes no reference to an external generating distribution, like Q(x), at all.

Our results, taken together with both GK derivations, lead to the following conclusion: Language evolution should be affected by the structure of the world as long as (a) different languages make different predictions about the distribution of events, and (b) the actual distribution of which events are spoken about does depend, at least in part, on the nature of the external world. This second premise, put another way, simply states that language is shaped by the structure of the world as long as which things people talk about (and how often they talk about them) is affected by the distribution of those things in the external world, not just the language the people speak.

One key difference between our result and both GK results is that we predict that the amount of data that is transmitted from one speaker to the next (i.e., the information bottleneck) actually changes the stationary distribution of the chain. This contrasts with the prediction made by the GK derivations that although smaller bottlenecks lead to faster convergence, the language will always be a sample from the prior distribution over languages regardless of bottleneck size. Our results, however, predict that the nature of the emergent languages will change as a function of the size of the bottleneck. When it is large or non-existent (i.e., there is a great deal of data available to each learner), the stationary distribution will converge to the expected posterior distribution over languages given meaningful events in the world; when the bottleneck is small, it will converge to the prior. This provides a natural test of our prediction: When the bottleneck is small, do the resulting languages resemble the ones that emerge when it is not?
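The dependence on the amount of transmitted data can be explored with an exact calculation that enumerates every possible dataset of m meaning–utterance pairs. The following is a sketch under hypothetical assumptions (two toy languages with their own event models, an assumed world distribution, prior, and noise rate); the variable m plays the role of the amount of data passed between generations.

```python
from itertools import product

# Toy model; every number below is an illustrative assumption, not from the paper.
LANGS = ["A", "B"]
VALUES = [0, 1]                               # meanings and words, coded 0/1
PRIOR = {"A": 0.7, "B": 0.3}                  # P(h)
NOISE = 0.1                                   # chance of the unintended word
WORLD = [0.2, 0.8]                            # Q(x)
EVENTS = {"A": [0.8, 0.2], "B": [0.2, 0.8]}   # P(x | h)


def word_prob(y, x, h):
    """P(y | x, h): A names meaning x with word x, B with word 1 - x."""
    intended = x if h == "A" else 1 - x
    return 1 - NOISE if y == intended else NOISE


def prod(values):
    out = 1.0
    for v in values:
        out *= v
    return out


def stationary_for_bottleneck(m, steps=5000):
    """Stationary distribution when each learner sees m meaning-utterance pairs."""
    T = {h: {h2: 0.0 for h2 in LANGS} for h in LANGS}
    for xs in product(VALUES, repeat=m):
        q = prod(WORLD[x] for x in xs)                       # Q(x) for the dataset
        for ys in product(VALUES, repeat=m):
            like = {h: prod(word_prob(y, x, h) for x, y in zip(xs, ys))
                    for h in LANGS}                          # P(y | x, h)
            # Unnormalised Eq. (1): P(y | x, h) P(x | h) P(h)
            w = {h: like[h] * PRIOR[h] * prod(EVENTS[h][x] for x in xs)
                 for h in LANGS}
            z = sum(w.values())
            for h in LANGS:
                for h2 in LANGS:
                    T[h][h2] += (w[h2] / z) * like[h] * q    # Eq. (2)
    pi = {h: 1.0 / len(LANGS) for h in LANGS}
    for _ in range(steps):                                   # Eq. (3) by iteration
        pi = {h2: sum(pi[h] * T[h][h2] for h in LANGS) for h2 in LANGS}
    return pi


results = {m: stationary_for_bottleneck(m) for m in (0, 1, 2, 4)}
for m, pi in results.items():
    print(m, pi)
```

With m = 0 no data are transmitted and the stationary distribution equals the prior; printing the other rows lets one inspect how the distribution moves as each learner receives more data in this toy world.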

In the next section we report experimental results supporting our theoretical findings. When there is a great deal of data available to the learner at each generation, the languages that emerge from an iterated learning chain with humans are shaped by the structure of the world. However, when there is little data, they look substantially different and appear unaffected by the structure of the world. This is consistent with the predictions of our theoretical analysis but not that of GK.

3 Method

We adopt the standard iterated learning paradigm, in which participants form chains in which the output of the nth participant is the input of participant n + 1 and the input for the first participant is random. In a training phase, participants see a number of meaning–word pairs and are asked to learn them. In a test phase, they are shown meanings and asked to produce the corresponding word; these are the pairings for the next participant and correspond to the “language” that exists at that point in the chain. Our question is whether the languages that evolve over the course of a chain depend on the distribution of meanings Q(x), and whether the predicted dependence on bottleneck size is observed.

3.1 Participants

A total of 225 participants from the University of Adelaide participated in what they were told was an “alien language” study in which they were asked to learn a language consisting of labels for visual stimuli (squares of different sizes). Participants consisted of undergraduates from all departments as well as some recruited from a paid participant pool drawing from the surrounding community. The experiment was conducted at a computer terminal and the instructions did not reveal that it was an iterated learning experiment; participants did not know that their answers were to be given to the next participant.

3.2 Stimuli

The “meanings” in our experiments consisted of 36 possible squares differing in size and color, as shown in Fig. 3a. In the control condition, color varied continuously from 0% brightness (black) to 100% brightness (white) in increments of 20%, and size from smallest (10 × 10) to largest (60 × 60) in increments of 10 pixels. In this condition, there is no obvious or privileged way of categorizing the stimuli. In the size condition, the stimuli were more discontinuous along the size dimension: The sizes were 10 × 10, 15 × 15, 20 × 20, 50 × 50, 55 × 55, and 60 × 60. In the color condition they were discontinuous along the color dimension: The colors were 0%, 10%, 20%, 80%, 90%, and 100% brightness.
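For concreteness, the three stimulus spaces can be written out as grids of (size, brightness) pairs. This is a sketch using the levels stated above; it assumes that the dimension not being manipulated in the size and color conditions keeps the evenly spaced levels of the control condition, which the text implies but does not state outright. The variable names are hypothetical.

```python
from itertools import product

# Levels taken from the text: side length in pixels, brightness in percent.
CONTINUOUS_SIZES = [10, 20, 30, 40, 50, 60]
CONTINUOUS_BRIGHTNESS = [0, 20, 40, 60, 80, 100]
CLUSTERED_SIZES = [10, 15, 20, 50, 55, 60]
CLUSTERED_BRIGHTNESS = [0, 10, 20, 80, 90, 100]

# Assumption: the non-manipulated dimension uses the evenly spaced levels.
CONDITIONS = {
    "control": (CONTINUOUS_SIZES, CONTINUOUS_BRIGHTNESS),
    "size": (CLUSTERED_SIZES, CONTINUOUS_BRIGHTNESS),
    "color": (CONTINUOUS_SIZES, CLUSTERED_BRIGHTNESS),
}

# Each condition is a 6 x 6 grid of (size, brightness) stimuli.
STIMULI = {name: list(product(sizes, brightness))
           for name, (sizes, brightness) in CONDITIONS.items()}

for name, items in STIMULI.items():
    print(name, len(items), items[:3])
```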

Figure 3.

(a) Space of stimuli seen in each of the three conditions of the experiment. Stimuli in the control condition varied continuously along the dimensions of size and color; in the size condition they varied discontinuously according to size, and in the color condition they varied discontinuously along the color dimension. These different spaces thus impose different event distributions Q(x). (b) Schematic illustration of the predictions about what the evolved language should look like in each condition. In the size condition, the words should evolve to categorize the stimuli according to size, with one word (w1) applying to the smaller objects and the other (w2) applying to the larger ones; in the color condition, the words should split the space into the dark (w1) and light (w2) objects. Predictions for the control condition are more uncertain, as there are no natural boundaries within this space.

These conditions, then, correspond to worlds with different event distributions Q(x), and each favors languages that partition the stimuli in different ways, as shown in Fig. 3b. In the size condition one would expect the words to categorize by size, in particular, to correspond to the distinction between smaller (w1) and larger (w2) items. Conversely, one would expect the words in the color condition to evolve to distinguish between darker (w1) and lighter (w2) stimuli. Because the control condition contains stimuli that vary continuously along both dimensions, it is less clear what the resulting language should look like. If participants have a prior bias to favor one dimension more than another, one might expect the resulting language to have six words, one for each value along the most important dimension; if they do not have any strong prior bias, one might expect languages to vary idiosyncratically or to evolve toward having one word for all stimuli. Which of these happens is somewhat irrelevant for our purposes; the main goal of running the control condition was to provide a comparison for the other conditions and to make apparent any prior biases that might exist.

3.3 Procedure

Our main question was whether the structure of the resulting language would be different in the size and color conditions. We also wanted to test the prediction about bottleneck size made by our theoretical analysis. This was accomplished by additionally running each of the three conditions through either a small or large bottleneck, resulting in a 2 × 3 design. There were three chains of participants in each of the six conditions. Because convergence is faster when the bottleneck is small, the small bottleneck group had five participants per chain, whereas the large bottleneck group had 20.

Following Kirby et al. (2008), in all conditions, stimuli were pseudo-randomly divided into a seen and an unseen set for each participant. The two bottleneck conditions differed only according to how many items were in the seen set: Participants in the large bottleneck group were presented with half of the entire dataset (i.e., 18 meanings), whereas those in the small bottleneck group saw only four.

Each participant acquired the language in a single session consisting of three rounds, each containing a training and a testing phase, with an optional break in between rounds. In the training phases, participants were shown two randomized exposures to the seen set in which each stimulus was shown on a computer screen with the corresponding word printed below it. Participants could see the next item by pressing the next button. In the testing phases, participants were shown the stimuli and asked to type the corresponding word; they were never given feedback. The testing phases in the first two rounds contained a random half of the seen set and an equal-sized number of items randomly sampled from the unseen set. The final round of testing in all conditions contained the entire stimulus set (i.e., all 36 stimuli).

For the first participants in each chain, the seen set consisted of a random subset of a full language consisting of 36 consonant-vowel-consonant (CVC) words randomly assigned to each of the 36 possible stimuli. For subsequent participants, the language consisted of the meaning–word pairs given by the previous participant in their final round of testing. With each new person in the chain, different items were randomly allocated into the seen and unseen sets. We performed no filtering on the data, not even to remove typos or ensure that the words followed a CVC pattern.
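The flow of data along a chain can be sketched as follows. This is not the experiment software: `make_toy_learner` is a hypothetical stand-in for a human participant, treating "y" as a consonant and drawing 36 distinct initial words are assumptions, and the three training and testing rounds within a session are collapsed into a single step.

```python
import random
import string

VOWELS = "aeiou"
CONSONANTS = "".join(c for c in string.ascii_lowercase if c not in VOWELS)
ALL_CVC = [a + v + b for a in CONSONANTS for v in VOWELS for b in CONSONANTS]


def initial_language(n_meanings, rng):
    """Assign a distinct random CVC word to every meaning index."""
    words = rng.sample(ALL_CVC, n_meanings)
    return dict(enumerate(words))


def split_seen_unseen(n_meanings, seen_size, rng):
    """Pseudo-randomly divide the meanings into a seen and an unseen set."""
    order = list(range(n_meanings))
    rng.shuffle(order)
    return order[:seen_size], order[seen_size:]


def make_toy_learner(rng):
    """Hypothetical learner: recalls seen words, guesses a seen word otherwise."""
    def learner(training, meanings):
        seen_words = list(training.values())
        return {m: training.get(m, rng.choice(seen_words)) for m in meanings}
    return learner


def run_chain(n_participants, seen_size, learner, rng, n_meanings=36):
    """Pass a language down a chain; each output is the next participant's input."""
    language = initial_language(n_meanings, rng)
    history = [language]
    for _ in range(n_participants):
        seen, _unseen = split_seen_unseen(n_meanings, seen_size, rng)
        training = {m: language[m] for m in seen}
        # The final test covers the entire stimulus set, seen and unseen alike.
        language = learner(training, list(range(n_meanings)))
        history.append(language)
    return history


rng = random.Random(0)
history = run_chain(n_participants=5, seen_size=4,
                    learner=make_toy_learner(rng), rng=rng)
print(len(set(history[0].values())), len(set(history[-1].values())))
```

With a seen set of four items, the toy learner can never produce more than four distinct words, which illustrates how a narrow bottleneck alone compresses the vocabulary.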

4 Results

One prediction made by both our analysis and GK's is that convergence should be more rapid when there is less data available to the learner, as in the small bottleneck group. This was borne out in our experiment, as shown in Fig. 4. We measured convergence, as in Kirby et al. (2008), by calculating the Levenshtein distance (Levenshtein, 1966) between adjacent pairs of participants in each chain at generation g. It is given by:

$$E_g = \frac{1}{36} \sum_{m=1}^{36} LD\left(y_g^{m},\, y_{g-1}^{m}\right) \tag{4}$$

where $y_g^{m}$ is the word associated with meaning m by the participant at generation g, $LD(y_g^{m}, y_{g-1}^{m})$ is the normalized Levenshtein distance between words $y_g^{m}$ and $y_{g-1}^{m}$, and 36 is the number of total meanings in our experiment. Levenshtein distance is the number of edits required to turn one string into another (for instance, the distance between bik and bok is 1), and therefore it captures the transmission error from generation to generation.
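The transmission error measure can be sketched in a few lines. The text does not say how the Levenshtein distance is normalized; dividing by the length of the longer word is an assumption made here, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution or match
        prev = cur
    return prev[-1]


def normalized_ld(a, b):
    """Assumed normalization: divide by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def transmission_error(lang_g, lang_prev):
    """Mean normalized distance between the words two generations give each meaning."""
    meanings = list(lang_g)
    total = sum(normalized_ld(lang_g[m], lang_prev[m]) for m in meanings)
    return total / len(meanings)


print(levenshtein("bik", "bok"))
print(transmission_error({0: "bik", 1: "tup"}, {0: "bok", 1: "tup"}))
```

In the experiment the dictionaries would hold all 36 meanings, so the denominator matches the 36 in Eq. (4).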

Figure 4.

Transmission error by generation in each condition. As predicted, languages evolve to be increasingly learnable (decreasing transmission error in later generations), and convergence is faster when the bottleneck is smaller. Note that the slightly lower convergence in the color condition with the small bottleneck is entirely carried by the final participant in one of the chains (Chain A).

It is clear from Fig. 4 that in all conditions, the chains ultimately converged to a low error rate. Convergence was slower in the large bottleneck group, as predicted.

Our main prediction was that the structure of the world would affect the languages when the bottleneck was large but not small, as in the former case the data would exist in sufficiently high quantity that the stationary distribution would be closer to the expected posterior distribution, which includes the distribution of events in the world Q(x). The results, shown in Fig. 5, are consistent with this prediction. In the large bottleneck groups it is evident that there was a substantial effect of the structure of the meaning space on the structure of the resulting languages; all chains in the size condition evolved words whose primary categorization divided the stimuli by size, and all chains in the color condition evolved words which categorized according to color (although this effect was stronger for some chains than others). By contrast, languages in the small bottleneck groups tended to evolve toward having only one word, and there are no observable effects of world structure.

Figure 5.

Languages from the final participant in each of the chains in the large and small bottleneck groups. Each square is laid out as in Fig. 2, where the x axis is size and the y axis is color. Items labeled with the same word are depicted with the same shade; words that are very similar but not identical are depicted with similar shades. Specific words are detailed below each square. It is evident that in the large bottleneck group, as predicted, the stimulus space has a considerable impact on the structure of the resulting language. All languages in the size condition evolved words that categorized more according to size, all languages in the color condition categorized more according to color, and all languages in the control condition were not strongly driven by either dimension. Also as predicted, in the small bottleneck group all languages evolved similarly; there was not enough data at each generation for the structure of the meaning space to affect how the languages evolved.

These differences can be quantified using the adjusted Rand Index or adjR (Hubert & Arabie, 1985). This measure captures the similarity between clusterings; an adjR of 1 indicates that the clusters are identical, whereas 0 is the score one would expect when comparing two random clusterings; scores below 0 indicate that the clusters match less than one would expect by chance. Here, each of the resulting languages corresponds to one “clustering” of the stimuli; for instance, the language in Chain A of the color condition in the large bottleneck group corresponds to a clustering in which the 18 darkest stimuli are in one cluster and the 18 lightest stimuli are in another. We can compare each of the actual clusterings to the canonical color and size clusterings in Fig. 3b. The results are shown in Fig. 6. When the bottleneck is large, the languages in the color condition have a much higher adjR when compared with the canonical color clustering, and languages in the size condition have a much higher adjR when compared with the canonical size clustering. However, in the small bottleneck groups, none of the languages is consistent with either canonical clustering.

Figure 6.

Average adjR values for the final languages in each condition, compared with the canonical clusterings according to size and color: An adjR of 0 indicates that they match no more than would be expected by chance, whereas an adjR of 1 indicates perfect alignment. In the large bottleneck condition, the languages in the size and color conditions match their respective canonical clusterings far above chance; all other languages do not match either clustering. Error bars reflect standard error.

To determine if these results are significant, for each condition and bottleneck group we calculated a structural precision score, which captures the degree to which the evolved languages in each condition differentially match either the canonical color or canonical size clusterings. This score is given by the absolute value of the difference between the adjR value of that language for the canonical color clustering and its adjR value for the canonical size clustering. If the structural precision score is close to 0, it indicates that the evolved language fits the canonical color and canonical size clusterings equally well; in other words, it has not evolved specifically to match either the size or color cluster. A higher structural precision score indicates that the language matches one of the two clusterings better than the other; in other words, it has evolved to fit one “world” more than the other. If the structure of the world did have a significant effect on the structure of the language, we would therefore expect that the structural precision score would be significantly higher in the color and size conditions (where the adjR values capturing the consistency of the language to the canonical size and canonical color should be markedly different from each other) than in the control condition (where those adjR values should be similar). Moreover, we should expect to see this only in the large bottleneck group, where the effect of world structure should be discernible.
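The adjusted Rand Index and the structural precision score can be sketched as follows. The adjR computation follows the standard pair-counting formula of Hubert and Arabie (1985); the grid indexing, the choice of splitting each dimension between its third and fourth level for the canonical clusterings, and the two example languages are assumptions made for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting adjusted Rand Index between two clusterings of the same items."""
    n = len(labels_a)
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:   # degenerate case: both clusterings are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


def structural_precision(language, canonical_color, canonical_size):
    """Absolute difference between the adjR values for the two canonical clusterings."""
    return abs(adjusted_rand_index(language, canonical_color)
               - adjusted_rand_index(language, canonical_size))


# Stimuli indexed by (size level, color level), each level running from 0 to 5.
GRID = [(s, c) for s in range(6) for c in range(6)]
CANONICAL_SIZE = ["small" if s < 3 else "large" for s, c in GRID]
CANONICAL_COLOR = ["dark" if c < 3 else "light" for s, c in GRID]

# Example languages: one splits the 18 darkest from the 18 lightest stimuli,
# the other uses a single word for everything.
color_language = ["w1" if c < 3 else "w2" for s, c in GRID]
one_word_language = ["w1"] * len(GRID)

ari_color = adjusted_rand_index(color_language, CANONICAL_COLOR)
ari_size = adjusted_rand_index(color_language, CANONICAL_SIZE)
score_color = structural_precision(color_language, CANONICAL_COLOR, CANONICAL_SIZE)
score_one_word = structural_precision(one_word_language, CANONICAL_COLOR, CANONICAL_SIZE)
print(ari_color, ari_size, score_color, score_one_word)
```

A language that tracks color perfectly matches the canonical color clustering with an adjR of 1 and the canonical size clustering at roughly chance level, giving a high score, whereas a one-word language matches both at chance and scores 0.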

As expected, the mean values of the structural precision score are larger in the size and color conditions than in the control conditions when the bottleneck is large (size: 0.685, color: 0.670, and control: 0.055) but not when the bottleneck is small (size: 0.0, color: 0.117, and control: 0.109). A two-way ANOVA on bottleneck × condition reveals a significant main effect of bottleneck type (F(1,12) = 15.988, p = .002) and a significant interaction (F(2,12) = 5.333, p = .022). The nature of the interaction is evident in planned comparisons, which show that in the large bottleneck group there is a significant effect of condition (F(2,6) = 6.155, p = .035), but in the small bottleneck group there is not (F(2,6) = 0.528, p = .615). Together, these analyses indicate that, as predicted, the structure of the world has a significant effect on the structure of the language when the bottleneck is large but not when it is small.

5 Discussion

Our work indicates that if there is no a priori assumption that a learner's hypotheses about languages are independent of the external world the learner inhabits, and the external world does actually affect which events or meanings are spoken of, then the languages evolved by Bayesian learners through iterated learning will converge to a distribution that depends on the posterior probability over languages as well as the external structure of the meaning space. Here, we consider the implications and limitations of our findings.

These results differ significantly from previous results by Griffiths and Kalish (2005, 2007) that suggest that the stationary distribution of a chain of Bayesian iterated learners depends only on their prior. This divergence arises because GK either assume that languages make no assumptions about the distribution of events in the world (2005) or that, if they do, the distribution of events spoken about is shaped entirely by the language and not at all by the external world (2007). Note that one cannot simply interpret GK's results in such a way that their sense of a prior itself includes these additional external environmental influences. Interpreting the prior in such a way that it includes the structure of the world strips the GK result of much of its theoretical relevance: If one were to do this, it would be impossible to distinguish between the effects of the cognitive biases of the learner and the effects of the environment in which the learner operates.

As the difference between our results and GK's stems from different underlying premises about the relationship between the structure of the world and the nature of language, a natural question is which set of premises is correct. This is an open question, although we suggest that in at least some circumstances ours are plausible. First, the premise that the distribution of events x that are talked about is shaped at least in part by the external structure of the world (and not just by language) seems eminently reasonable: We talk about cats not just because our language has the word “cat” but also because cats actually do exist, and we have direct experience of them. Second, the premise that languages make assumptions about the distribution of events in the world also appears sensible, at least sometimes. Language learners only start acquiring words after having observed many objects and events in the world, and it seems reasonable for them to expect word meanings to map onto these objects and events in a sensible way. The mapping between grammar and world structure is less obvious, but one might expect that learners' grammatical expectations are affected by their observations of the world; for instance, they might expect salient or frequent characteristics, like number or gender, to be marked grammatically, as has been suggested earlier (e.g., Bybee, 2000; Du Bois, 1987; Evans, 2003). Further exploring in what circumstances this premise is correct is a subject for future research.

Our experimental results further support our theoretical predictions by showing that when people are exposed to different worlds with different distributions of events, the language that emerges reflects that event structure, even when the people, as in most iterated learning experiments, do not realize that they are in the middle of “evolving” a language at all. It is important to note here that the experiments do not indicate, nor did we ever predict, that people's prior biases would play no role in shaping the structure of the language; in fact, it is quite likely that cognitive factors like memory are a major reason that languages in all conditions evolved to have fairly few words. Rather, the implication of the experiments is that world structure plays an additional role on top of the effect of whatever prior biases exist. This is evident from the fact that the differences between the evolved languages in the size, color, and control conditions can only be explained by the differences in the distribution of events Q(x) in those conditions, as the participants were sampled from the same underlying population and thus presumably all shared similar prior biases.

There has been a great deal of experimental work supporting the finding that iterated learning experiments reveal human learners' inductive biases (Griffiths, Christian, & Kalish, 2008; Kalish, Griffiths, & Lewandowsky, 2007; Kirby et al., 2008; Reali & Griffiths, 2009a; Smith & Wonnacott, 2010). How do we reconcile our results with this research? First, we do not deny that prior biases are a factor; our results simply suggest that in certain circumstances they are not the only factor. Second, our results do not invalidate the broad research program of using iterated learning experiments to illuminate people's inductive biases. Many experiments are constructed either so that events are not generated by the external world (i.e., there is no Q(x) distribution) or so that all participants see the same distribution of events (i.e., Q(x) is not manipulated between conditions). We do predict that significant changes in the distribution Q(x) should result in different stationary distributions of the chains, although how large that effect may be in different kinds of experiments is an open empirical question.

Our findings may also resolve an apparent contradiction in the literature. While some theoretical results suggest that language evolution should converge to the prior, there is also theoretical work showing that the structure of the meaning space can affect the nature of the evolving language (Brighton & Kirby, 2001; Kirby, 2001; Maurits, Perfors, & Navarro, 2010; Smith, Jones, Yoshida, & Colunga, 2003). In addition, many empirical cross-linguistic results are consistent with the idea that aspects of world structure can affect linguistic structure. Communicative complexity has been argued to increase with social complexity, both in humans and non-humans (Freeberg, Dunbar, & Ord, 2012; McComb & Semple, 2005), whereas phonological and morphological complexity may be negatively related to community size or isolation (Lupyan & Dale, 2010; Trudgill, 2009). Others have suggested that environmental factors like frequency of use may also affect language structure (Bybee, 2000; Evans, 2003). Although the link between some of these factors and the distribution Q(x) is not always exact, our work may help explain how these findings are possible in light of the GK results.

Further tests of our theoretical predictions will include additional cross-linguistic work as well as experimental, laboratory-based research—for instance, varying the frequency of meanings and initializing chains with languages that do not match the space of meanings (e.g., initializing participants who see the meaning space from the color condition with a language conforming to the canonical size pattern). In addition, a great deal of theoretical work remains. Existing work investigates how GK's results are affected if the chain consists of more than one learner per generation (Burkett & Griffiths, 2010; Smith, 2009), or if learners are capable of “teaching” subsequent learners in the chain (Beppu & Griffiths, 2009). How would our results be affected under these circumstances? Other work has built on the GK result to show that iterated learning with Bayesian learners is equivalent to the Wright–Fisher model of genetic drift (Reali & Griffiths, 2009b); does iterated learning correspond to a different evolutionary model when the stationary distribution it converges to is different?

There are many remaining open questions in addition to these, but our results indicate that the world may matter more than we previously thought. Perhaps language has the structure it does not just because of our brains but because of the world as well.


We thank Natalie May, Tin Yim Chuk, Jia Ong, and Kym McCormick for their help recruiting participants and running the experiment; Simon Kirby and Mike Kalish for helpful discussions; and Tom Griffiths and Rick Dale for useful comments on a draft of the manuscript. AP and DJN were jointly supported by ARC fellowship DP0773794, AP individually by ARC fellowship DE120102378, and DJN individually by ARC fellowship FT110100431.


  1.

    More formally, GK assume that each language h specifies P(y|x,h), the conditional distribution over utterances y given the events x. Our formulation corresponds to assuming that each language maps onto a joint (subjective) probability distribution over events and utterances, P(x,y|h). We can factorize this joint distribution as P(x,y|h) = P(y|x,h)P(x|h). Moreover, as P(h|x) ∝ P(x|h)P(h), in our setup P(h|x) ≠ P(h) in general.

  2.

    Note that this is actually a specific case of a more general result: languages may be shaped by world structure even in situations where the meanings are not a representative sample from Q(x). We focus on the special case in which representativeness holds because it holds in many situations, and because the stationary distribution is more interpretable in this case. Appendix A presents the more general proof.

  3.

    More formally, this derivation assumes that data d are generated from P(d|h) and that h are sampled from P(h|d), for any construal of d. Thus, d can include events x as well as utterances y. Note, however, that if d = (x, y), the events and utterances are both generated entirely by P(d|h) = P(x,y|h); that is, an external event-generating distribution like Q(x) plays no role.

  4.

    The GK results do predict that there will be an effect of bottleneck size if one assumes that learners are selecting the hypothesis with the maximum posterior probability (MAP estimation) rather than sampling directly from the posterior distribution (Griffiths & Kalish, 2007). However, that bottleneck effect differs from the one we predict: unlike our account, it implies that the stationary distribution will still always be centered on the prior, so the bottleneck size does not influence which language is most likely. Instead, under MAP estimation it is the variance of that distribution that depends on the bottleneck size.

  5.

    The reason we calculate the statistics based on this score, rather than on something more intuitive like the adjR values themselves, is that the adjR values are not all independent of each other; for example, if a language is highly consistent with the canonical size clustering, then it will not be consistent with the canonical color clustering. The structural precision score, by combining both values into one measure, ensures that all the values entering into the statistical test are independent of each other.

  6.

    Note that they are also distinct from (although related to) the results of Beppu and Griffiths (2009), which reveal a different way that the stationary distribution of iterated learning can change when learners are provided with data generated by the external world. In that work, which is not focused on language, the hypotheses to be learned are hypotheses about the nature of the external distribution x directly, rather than mappings between x and y (as occurs in language). Like us, Beppu and Griffiths (2009) find that the stationary distribution no longer converges to the prior, although the precise nature of the stationary distribution is not the same as we find here.
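
The Bayes step in footnote 1 can be written out as a short worked equation. This is only a sketch in the footnote's own notation; the explicit normalizing sum over h' is added here for illustration and assumes a discrete space of languages.

```latex
\begin{align*}
P(x, y \mid h) &= P(y \mid x, h)\, P(x \mid h) \\
P(h \mid x) &= \frac{P(x \mid h)\, P(h)}{\sum_{h'} P(x \mid h')\, P(h')}
\end{align*}
```

If P(x|h) took the same value for every language h, it would cancel from the numerator and denominator, leaving P(h|x) = P(h). Observing events therefore shifts beliefs about the language only when languages differ in the events they expect.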

Appendix A

The derivation presented in the theoretical results is actually a special case of a more general derivation (we thank Tom Griffiths for a discussion of this). This more general derivation diverges from the one in the main article at step 4 below:

display math

When the bracketed term is equal to 1, then math formula (or, equivalently math formula). One circumstance when this occurs is, as in the main article, when math formula:

display math

There are presumably other circumstances under which the bracketed term is equal to 1, but they are much less straightforward and harder to interpret intuitively. Because of this, and because the general point of this study is simply that language evolution can be shaped by the structure of the world (a point that is strengthened rather than weakened by this more general derivation), our interpretation in the main article focuses on the version presented there.