A Bayesian Model of Biases in Artificial Language Learning: The Case of a Word-Order Universal


Correspondence should be sent to Jennifer Culbertson, Department of Brain & Cognitive Sciences, University of Rochester, 246 Meliora Hall, Rochester, NY 14627. E-mail: jculbertson@bcs.rochester.edu


In this article, we develop a hierarchical Bayesian model of learning in a general type of artificial language-learning experiment in which learners are exposed to a mixture of grammars representing the variation present in real learners’ input, particularly at times of language change. The modeling goal is to formalize and quantify hypothesized learning biases. The test case is an experiment (Culbertson, Smolensky, & Legendre, 2012) targeting the learning of word-order patterns in the nominal domain. The model identifies internal biases of the experimental participants, providing evidence that learners impose (possibly arbitrary) properties on the grammars they learn, potentially resulting in the cross-linguistic regularities known as typological universals. Learners exposed to mixtures of artificial grammars tended to shift those mixtures in certain ways rather than others; the model reveals how learners’ inferences are systematically affected by specific prior biases. These biases are in line with a typological generalization—Greenberg's Universal 18—which bans a particular word-order pattern relating nouns, adjectives, and numerals.

1. Introduction

1.1. Typological universals: Evidence of universal properties of the cognitive system?

A topic of debate in cognitive science since antiquity is the origin of knowledge, specifically, the relative contributions of environmental experience and learner-imposed structure. One classic field on which this debate has played out is grammatical knowledge. The existence of many statistical generalizations stating that seemingly arbitrary properties are shared by the world's languages—so-called typological universals—has been taken by some as compelling (if circumstantial) evidence that human learners are internally biased to impose those properties on the grammars they acquire, acting as agents of language change.1 That the debate is still very much alive is attested by contrary positions claimed to be superior, or generally more plausible, in recent work such as Nettle (1999), Bybee (2008), Evans and Levinson (2009), Tomasello (2009), and Levinson and Evans (2010). According to a mainstream version of the learning bias hypothesis, languages change over time because the statistical distribution of grammars acquired by one generation of learners is systematically different from the distribution of grammars deployed by the previous generation (Kroch, 2000; Lightfoot, 2006): Learners shift the distribution in favor of grammars exhibiting the properties preferred by their biases—these properties then emerge over time as typological universals (Kirby, 1999). Statistical mixtures are implicated in the variation that permeates language, at all levels. A language changes from an earlier state, predominantly exhibiting some particular pattern, to a later state, predominantly exhibiting a different pattern, as the statistical mixture of grammars used by speakers shifts away from the former pattern toward the latter.

This work addresses direct behavioral evidence that language learners do indeed shift grammar mixtures in favor of those grammars with properties observed to be favored typologically. In this article, we develop a computational learning model to formalize a notion of learning bias that can explain, quantitatively, how learners in artificial language-learning experiments systematically alter the language they are exposed to. In a particular case, we show that the bias identified by the model turns out to indeed formalize a prior preference for grammars obeying the relevant typological regularity (of word order).2

For the test case considered here, the relevant cross-linguistic regularity (see Table 1) is one that was identified by Joseph Greenberg in his groundbreaking typological work; he codified it as “Universal 18.”

Table 1. 
Greenberg’s Universal 18 bans languages combining word orders Adjective–Noun and Noun–Numeral (shaded cell)

  (1) Universal 18: Languages that [predominantly] order adjectives before nouns also [predominantly] order numerals before nouns, but not conversely (Greenberg, 1963, paraphrased).

This may be one of those “seemingly arbitrary” typological generalizations, which to some suggest that learners impose properties on languages. And indeed, a bias parallel to Universal 18 appeared to be at work in the artificial language-learning experiment reported in Culbertson et al. (2012). The goal of the present article is to quantify, in a formal learning model, the bias exhibited by these learners; explaining the basis of such a bias is not in the scope of this article (but see Culbertson et al., 2012, for much discussion). We will treat the learners’ bias as crucially related to the properties of the linguistic structures involved, not entirely reducible to more general factors.

The model we propose addresses a general artificial language-learning paradigm in which participants are exposed to an artificial language displaying variation, motivated by the view of language change discussed above. This experimental paradigm was introduced by Hudson Kam and Newport (2005, 2009) in a line of research showing that, under certain conditions, adult and child learners will reduce the degree of variation in (that is, regularize) a language. The bias of learners to shift mixtures to make them more regular and their bias to favor grammars that respect typological regularities like Universal 18 turn out to interact in a way that makes this experimental paradigm a particularly sensitive one for revealing those biases. The model is tested on learning data (reported in Culbertson et al., 2012) for languages generated by statistical mixtures of rules governing the order of adjectives, numerals, and nouns in simple, two-word phrases. The data consist of utterances produced by participants after exposure training. The fundamental hypothesis, supported by the data reviewed below, is that learners will regularize only in the direction of grammars that obey the constraints exhibited in typological generalizations.

This article proceeds as follows. In section 2, we review the relevant aspects of the experimental results we model as our test case. In section 3, we provide some background, introducing the Bayesian approach to learning. In section 4, we start with a high-level, conceptual introduction to the model, and then move on to provide a technical description. In section 5, we describe how the model parameters are fit and report the results of the modeling process. Section 6 summarizes our findings and presents conclusions that we would like to draw from them concerning learning and linguistic typology. First, however, we will discuss models of artificial language-learning experiments that have been proposed previously and highlight the novel contribution made by this work.

1.2. Previous work

When studying any cognitive function, it is important to separate those phenomena that can be best characterized in domain-independent terms from those that cannot. This is especially true in the case of language, given the debate mentioned above concerning the extent to which language learning relies on general cognitive mechanisms rather than principles specific to the domain of language. To understand the questions at stake in the research reported here, and the nature of the original contribution, requires consideration of the domain-dependence issue, which arises in particular in Bayesian modeling of artificial language learning. Previous work in this area has addressed issues in syntax and word learning using domain-independent models, while the work we report here formalizes a problem in a domain-specific form.

A Bayesian model of an artificial language-learning experiment reported in Wonnacott, Newport, and Tanenhaus (2008) is presented in Perfors, Tenenbaum, and Wonnacott (2010) (for a similar study, see Hsu & Griffiths, 2009). The experiment examines the relationship between distributional information in the input and learners’ willingness to generalize, that is, to allow novel items (here, verbs) to appear in multiple syntactic constructions. Learners exposed to training data in which all verbs are presented in both constructions are contrasted with learners receiving data in which each verb is presented in only one construction. The key question is, given a novel verb presented in one construction, will it be predicted to be felicitous in the other construction as well? In the experiment, and in one version of the model, the finding is that learners are more likely to predict that the verb can be used in the unpresented construction when verbs in the training data appeared in both constructions.

The problem Perfors et al. (2010) are fundamentally interested in concerns what learners infer about some set of data in the absence of negative evidence. Although in the Wonnacott et al. (2008) experiment, data are verbs and constructions, this question arises in a number of other domains, and Perfors et al. (2010) thus seek to explain learners’ behavior using a domain-general model. Models of this kind allow us to see precisely how part of the problem of learning which verbs may appear in which constructions—a well-studied problem in language acquisition—can be formally understood solely in the general terms of the distribution of items into categories.

This work addresses a case of what is hypothesized to be specifically a linguistic rather than domain-general structure. In particular, we ask whether cross-linguistic generalizations about this structure are psychologically real, that is, manifest in the biases of the cognitive language-learning mechanism. In the artificial language-learning experiment being modeled (Culbertson et al., 2012), the stimuli to which learners are exposed are two-word phrases consisting of a modifier (an adjective or a numeral) and a noun. The word order of the phrase locates the modifier in either pre- or post-nominal position. For the design of the experiment and of the model, the hypothesis is that these linguistic (or substantive) dimensions are critical; the particular modifier types and word orders are not interchangeable. This is because the hypothesis under investigation is that a language in which adjectives are predominantly pre-nominal, and numerals are predominantly post-nominal, will be less learnable than the language resulting from exchanging the word orders of the modifier types, all else equal. It is this that Greenberg's Universal 18 predicts, under our general hypothesis that if a language type is very rare among the world's languages, then it is less learnable, or more precisely, that the human language-learning system is biased against those linguistic patterns that are observed to be typologically rare.

In addition to linguistic-structure-specific issues, the Culbertson et al. (2012) experiment also involves the domain-general question of how learners treat variation. Regularization—reduction of variation—was the subject of work by Reali and Griffiths (2009), who developed a Bayesian model of artificial language-learning experiments in which participants are exposed to novel objects which are given inconsistent labels (two novel labels are used for each object, with varying frequencies). As with the Perfors et al. (2010) study, Reali and Griffiths (2009) show that a domain-general model can capture an important aspect of learning behavior—in this case the trajectory of regularization over generations of speakers. Interestingly, Reali and Griffiths (2009) also show that not all sources of variation are subject to regularization by learners; in another experiment, they show that the variation associated with the outcome of a coin toss is not regularized. In the work presented here, we are interested in how regularization interacts with Universal 18’s specific predictions concerning the substantive linguistic biases of learners. According to our hypothesis, regularization only proceeds in directions favored by substantive biases; hence, degree of regularization serves as an index that we can use to infer the particular content of those biases.

Our focus on the substantive, specifically linguistic aspects of Universal 18 affects the formal structure of our model. The bias—prior probability distribution—must depend jointly on the relevant substantive linguistic dimensions. For according to Universal 18, there is no asymmetry between pre- and post-nominal position for adjectives alone, nor for numerals alone: Both types of modifiers occur with great frequency in both positions across the languages of the world. What is very rare is the combination, within a single language, of adjectives in a particular position—pre-nominal—with numerals in a different position—post-nominal. This demands a new type of prior distribution, which we will develop below (section 4).

2. Artificial language learning of word order in the nominal domain

A general method for obtaining experimental evidence connecting learning biases to typological regularities in syntax—the Mixture-Shift Paradigm—was proposed in Culbertson et al. (2012) and applied to the case of Greenberg's Universal 18. The experiment investigated the extent to which, in artificial language learning by adults, there is asymmetric learning of patterns of adjective, numeral, and noun ordering that parallels Universal 18’s ban on the ordering pattern combining pre-nominal adjectives and post-nominal numerals (i.e., Adj–N, N–Num).3

As shown in Table 1, the pattern (Adj–N, N–Num) is attested; therefore, it is clearly possible to learn. However, this does not lead to the conclusion that no relevant constraint in the cognitive system exists. In fact, using the Mixture-Shift Paradigm, Culbertson et al. (2012) provided evidence suggesting informally that a parallel learning bias does appear to affect how adult participants acquire adjective, numeral, noun ordering patterns. In this section, we briefly review the experiment's design, method, and key results.

2.1. Summary of the experiment and its results

The typological pattern of interest in Culbertson et al. (2012) concerns constraints on the distribution of possible combinations of {Noun, Adj} and {Noun, Num} orders: the relevant patterns are shown in (2).

  (2) Possible patterns: ordering combinations of {Noun, Adjective}, {Noun, Numeral} (with typological proportions from Table 1).
    1. Adjective–Noun & Numeral–Noun (27%).
    2. Noun–Adjective & Noun–Numeral (52%).
    3. Noun–Adjective & Numeral–Noun (17%).
    4. *Adjective–Noun & Noun–Numeral (4%).

The hypothesized substantive learning bias is described in (3).

  (3) The hypothesized substantive learning bias.
    i. Favor so-called harmonic ordering patterns, which preserve the position of the noun as either following or preceding both modifier types (i.e., patterns 1 and 2 in (2)).
    ii. Disfavor the particular non-harmonic pattern 4 in (2), which combines Adj–Noun order with Noun–Num order.

These two aspects of the substantive bias were hypothesized to explain key properties of the typological data shown in Table 1.

To test whether learning appears to reflect this bias, Culbertson et al. (2012) exposed adult learners to grammar mixtures that featured one of the four possible ordering combinations as the dominant pattern, accompanied by some variation. In the training phase of the experiment, learners saw pictures of novel objects and heard phrases describing them that were uttered by an “informant” and consisted of either an adjective and a noun or a numeral and a noun. The order used by the informant in any particular description was generated probabilistically following the training probabilities defined by the learner's input condition (word order had no effect on meaning). These input conditions are summarized in Table 2.

Table 2. 
Summary of experimental conditions

                     Condition 1   Condition 2   Condition 3   Condition 4
Dominant adj order   Adj–Noun      Noun–Adj      Noun–Adj      Adj–Noun
Dominant num order   Num–Noun      Noun–Num      Num–Noun      Noun–Num

Note: The dominant order for each modifier type was always used in 70% of utterances; the remaining 30% used the opposite order (not conditional on particular vocabulary items). Dominant patterns in conditions 1, 2 are harmonic. The dominant pattern in condition 4 is hypothesized to be disfavored.

After this training phase, in a testing phase learners were asked to produce descriptions of pictures using these two-word phrases, and the dependent measure of interest was the extent to which they used the dominant ordering pattern of their training condition in the utterances they generated.

The results, illustrated in Fig. 1(a), indicated that learners in conditions 1, 2, and 3 tended to regularize the input grammars—that is, they used the dominant pattern more frequently than did the informant providing their input—although learners in the harmonic conditions 1 and 2 regularized the most. Crucially, however, learners did not regularize when the dominant pattern was the typologically rare (Adj–N, N–Num). The asymmetrical learning behavior in the experiment thus informally appears to support positing a substantive learning bias with the two key properties given in (3).

Figure 1.

 Results reported in Culbertson et al. (2012). (a) Average use of the dominant order by participants in each of the four conditions. The dominant pattern in condition 4 is hypothesized to be disfavored. The stars above conditions 1, 2, and 3 indicate that participants regularized to a level significantly higher than the input (the dotted line at 70% indicates the level at which the dominant pattern was used in the input for each condition). (b) A plot of pAdj−N by pNum−N for individual learners. Different colors correspond to different training conditions; training probabilities are open points. Arrows are typical shifts.

As we will discuss in the next section, the Bayesian approach we adopt assumes that the task of the learner in this case is to infer which grammar (within a hypothesis space) the informant used to generate the training data, and to use this grammar in their own productions. (The precise concept of grammar used in the model is explained below.) Crucially, the Bayesian approach also dictates that probabilistic constraints on the hypothesis space, instantiated by prior biases over this space, can affect which grammar the learner infers. The plot in Fig. 1(a) suggests that learners are generally more likely to infer grammars that are more regular (closer to being deterministic) compared to their input; however, the fact that learners did not regularize to the same extent in all conditions, even though the evidence observed was statistically equivalent in all conditions, suggests that certain grammars are inherently more likely to be inferred than others—the hallmark of a learning bias.

Fig. 1(b) plots each learner's performance as a solid dot in a two-dimensional space, where the conditional probability of producing Adj–N (given that an utterance contains an adjective) appears on the x-axis and the conditional probability of producing Num–N (given that an utterance contains a numeral) appears on the y-axis.4 Each point in this space corresponds to a stochastic language; a completely deterministic version of each ordering pattern lies in one of the four corners. In the upper right corner, labeled L1, is a language generated by a deterministic grammar which uses both Adj–N and Num–N 100% of the time (pure pattern 1 of (2)): This grammar is harmonic. In the lower left corner, labeled L2, is the opposite harmonic grammar, which uses these orders 0% of the time (N–Adj and N–Num are used categorically: pattern 2 of (2)). Deterministic non-harmonic grammars corresponding to patterns 3 and 4 are then in the upper left and lower right corners, respectively (labeled L3, L4).

The solid arrows illustrate what informally appear to be typical shifts made by learners in the experiment. In the case of learners trained in conditions 1, 2, and 3, there are points which, when compared to the training input (the open points), are shifted closer to the L1, L2, and L3 corners, respectively; this illustrates regularization of the input, and several such points can be found for each of these three conditions. That learners in conditions 1, 2, and 3 regularized is also shown in Fig. 1(a), which plots mean behavior. What Fig. 1(a) does not illustrate is the critical fact that there are no such points among the condition-4 learners (as indicated by the dashed arrow in Fig. 1(b)). Furthermore, Fig. 1(b) shows clearly that learners exposed to input condition 4 not only did not regularize, they actually acquired grammars that appeared to be shifted toward either the L1 or L2 corner; in other words, they shifted the language to bring it more in line with the proposed learning bias (3) (and hence more in line with Universal 18). Shifting toward L1 or L2 is also seen for those learners in condition 3 who did not regularize the input pattern. As we will discuss in detail, the behaviors of learners in the harmonic conditions 1 and 2, in condition 3, and in condition 4 are all qualitatively as well as quantitatively distinct. Precisely what kind of underlying prior bias could give rise to this complex outcome is not at all clear. The Bayesian model proposed here will not only validate the claims made by Culbertson et al. (2012) concerning asymmetrical learning and regularization across conditions, it will also reveal the internal structure of a bias which is able to capture the more complex properties of learner behavior.

3. Conceptual introduction to the Bayesian approach

A Bayesian model of learning is one which assumes that the learner follows principles of Bayesian probabilistic inference. These principles form a mathematical basis for inferring the probability that some hypothesis, H, characterizes the source of the set of observations O. For example, Bayesian models of language acquisition generally involve inferring the grammar that is responsible for generating the input utterances. Once H is inferred as an explanation for O, the learner can use H to generate predictions about future observations—or even, as we assume here, to produce new linguistic data, going beyond the set of input utterances O and generating novel productions of their own.

An important feature of Bayesian models is that they incorporate probabilistic constraints on the hypothesis space that the learner searches. These constraints can be thought of as prior knowledge, or biases, about which hypotheses are more or less likely a priori (i.e., prior to receiving the learning data of interest). Taking infant language acquisition as an example, Universal Grammar is typically conceived of as a set of (hard or soft) constraints on possible rules (e.g., Chomsky, 1988). Such a Universal Grammar would guide the process of language acquisition by narrowing down the set of possible rules the learner will consider likely to have generated a given set of utterances. (See Chater & Manning, 2006, for a survey of Bayesian models of language.)

The role that prior knowledge plays in Bayesian models can be seen clearly by looking at a simple version of Bayes’ Theorem, shown in Equation (1).

\[ P(H \mid O) \;\propto\; P(O \mid H)\, P(H) \tag{1} \]


This equation states that the posterior probability a learner assigns to a hypothesis after receiving some observed data is proportional to the likelihood of observing that data if the hypothesis were correct multiplied by the prior probability of that hypothesis. All things being equal, if the prior probability of some hypothesis is low, then the posterior probability of that hypothesis given the observed data will be low; in a Bayesian model of learning, the learner will be less willing to infer that such a hypothesis is a likely explanation for the observed data. Another way of stating this is that if the prior for one hypothesis is lower than that for another hypothesis, then the weight of observed evidence needed to achieve a given level of posterior probability would be greater for the hypothesis with the lower prior probability.
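The effect of the prior described here can be illustrated with a small numerical sketch. The two hypotheses `H1` and `H2`, the helper `posterior`, and all probability values below are hypothetical and chosen only for illustration; none of them come from the experiment.

```python
def posterior(priors, likelihoods):
    """Normalize prior * likelihood over a finite set of hypotheses."""
    unnorm = {h: priors[h] * likelihoods[h] for h in priors}
    z = sum(unnorm.values())
    return {h: w / z for h, w in unnorm.items()}

# Hypothetical case 1: the data fit both hypotheses equally well,
# so any difference in the posterior comes from the prior alone.
equal_fit = {"H1": 0.2, "H2": 0.2}
flat = posterior({"H1": 0.5, "H2": 0.5}, equal_fit)    # H1 and H2 both near 0.5
biased = posterior({"H1": 0.9, "H2": 0.1}, equal_fit)  # H2 drops to about 0.1

# Hypothetical case 2: with a 9:1 prior against H2, the data must favor
# H2 by a 9:1 likelihood ratio just to bring the two hypotheses level.
strong_evidence = {"H1": 0.02, "H2": 0.18}
levelled = posterior({"H1": 0.9, "H2": 0.1}, strong_evidence)  # both near 0.5
```

The second case mirrors the last sentence above: the hypothesis with the lower prior needs more evidence to reach the same posterior probability.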

In proposing a Bayesian model of the experiment in Culbertson et al. (2012), we aim to show that this experiment, and in fact any using the Mixture-Shift Paradigm, is particularly amenable to a Bayesian approach. This is primarily because the paradigm is designed to test hypotheses about the effects of prior knowledge, or biases, on the grammar that learners infer (acquire). In this case, we are interested not only in biases constraining possible sets of rules—the substantive bias (3)—but also in a constraint on the variation encoded in probabilistic rules—the regularization bias. In constructing a Bayesian model, we will provide a formal specification of these two biases. This will improve our understanding of learners’ behavior in the experiment because the model will quantify the strength of the regularization and substantive biases (independently of one another). With the biases quantified, we can see whether they do indeed line up with universals—for example, Universal 18—or not. Furthermore, the model will allow us to determine whether there is reliable evidence for learners’ preferred shifting direction when they do not regularize.

4. Setting up the model

4.1. A graphical representation of the model

In the model we propose, a learner is exposed to training data, and under the influence of prior biases, infers a probability distribution over grammars: the probability each grammar generated the training data. The learner then selects a grammar from this distribution and uses it for production.

A graphical representation of the hierarchical generative model is shown in Fig. 2. The top node of the graph represents the hyperparameters γ and (x, y) that specify the prior constraints on the hypothesis space of grammars, as described in detail below. This is a prior probability distribution over grammars G, written P(G|γ, x, y) or more simply P(G). It is this distribution P(G)—that is, the parameters γ and (x, y) that determine it—which is our primary focus.

Figure 2.

 The graphical structure of the model.

Associated with the central node of the graph is a probability distribution over the space of possible grammars, which a learner in condition k will infer on the basis of his or her training set Trainingk. This distribution gives, for every possible grammar G, the probability that G is responsible for generating all the observed training data—that is, the posterior probability of the grammar, which we will write P(G|Trainingk). Sampling from this posterior distribution, learner j in condition k selects a grammar Gj,k. In this we follow the sampling model of Reali and Griffiths (2009) and Griffiths and Kalish (2007). (Whether learners retain some knowledge of P(G) after selection of Gj,k, or whether they even represent P(G) mentally, does not matter for our purposes here.)

In the test phase of the experiment, this learner j will use his or her learned grammar Gj,k to generate a corpus of utterances, Testingj. The posterior distribution from which the grammar is selected entails a prediction for the distribution over such corpora, the predictive distribution that we will call P(Testingj|Trainingk). In Fig. 2, the dashed line connecting G to Testingj indicates that the latter is in a sense separate from the learning model. During learning, the posterior is inferred based on Trainingk, under the constraints imposed by the prior. After learning, the posterior is no longer undetermined, but a fixed probability distribution used to predict the set of test utterances, Testingj.

This type of model in effect uses analysis by synthesis: perception—the probability distribution over grammars used to analyze the training input—and production—the probability distribution over grammars used to produce the testing output—are the same, as perception simply is inference about the production process. Under instruction to speak like the informant who produced the training input, we assume that the distribution from which an experimental participant selects the grammar that she will use in producing her own utterances is the same as the probability distribution she inferred for the grammar the informant used to produce his utterances. For the model presented here, we further make the simplifying assumption that in the experiment, negligible learning occurs during the testing phase5: the relevant probability of G at test is that which is posterior to the training data only.
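The sample-then-produce assumption can be sketched in a few lines. The sketch below makes several simplifying assumptions that are not part of the model as described: the grammar space is a coarse 0.1 grid rather than continuous, the prior is a flat placeholder (the structured prior that is the focus of this article is not implemented here), the likelihood is the product of two binomials adopted in section 4.2.1, and the number of test trials `N_TEST` is hypothetical. The function names are likewise illustrative.

```python
import random
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with rate p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def grid_posterior(counts, prior, grid):
    """Posterior over grammars (p_adj, p_num) on a discrete grid."""
    adj_n, n_adj, num_n, n_num = counts
    weights = {}
    for pa in grid:
        for pn in grid:
            like = (binom_pmf(adj_n, adj_n + n_adj, pa)
                    * binom_pmf(num_n, num_n + n_num, pn))
            weights[(pa, pn)] = like * prior(pa, pn)
    z = sum(weights.values())
    return {g: w / z for g, w in weights.items()}

def simulate_learner(post, n_test, rng):
    """Sample one grammar from the posterior, then produce test counts with it."""
    grammars = list(post)
    pa, pn = rng.choices(grammars, weights=[post[g] for g in grammars])[0]
    adj_first = sum(rng.random() < pa for _ in range(n_test))
    num_first = sum(rng.random() < pn for _ in range(n_test))
    # Counts are ordered as (Adj-N, N-Adj, Num-N, N-Num).
    return (pa, pn), (adj_first, n_test - adj_first,
                      num_first, n_test - num_first)

N_TEST = 40                               # hypothetical number of test trials per modifier
grid = [i / 10 for i in range(11)]
flat_prior = lambda pa, pn: 1.0           # placeholder, NOT the prior proposed in this article
training_3 = (12, 28, 28, 12)             # condition-3 counts from Table 3
post = grid_posterior(training_3, flat_prior, grid)
grammar, test_counts = simulate_learner(post, N_TEST, random.Random(0))
```

Because learning is assumed to stop after training, `post` is computed once and then only sampled from; the test counts never feed back into it.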

By fitting the parameters in the prior γ, we will uncover whether the testing data across all learners in the experiment support the existence of the hypothesized biases (3). In other words, we will be able to determine whether the testing data are best captured by positing a bias, which results in a low prior expectation for a grammar like pattern 4 of (2), and a relatively high prior expectation for harmonic grammars. We will also gauge the strength of the regularization bias by fitting the parameters (x, y) of the prior. Precisely how these biases can be enforced in the model will be explained in detail below.

4.2. The model

In what follows, we will provide a more in-depth discussion of each part of the model (mathematical details are deferred to section 4.3). Then we will discuss how the hyperparameters of the model, (γ, x, y), were fit and describe the results of the modeling process, including a comparison between the data the model generates and the actual testing data from the experiment reported in Culbertson et al. (2012).

4.2.1. Grammars

In our model, grammars are extremely simple probabilistic context-free grammars (PCFGs), each specified by a pair of probabilities, one for each of two rewrite rules: AdjP → Adj N, and NumP → Num N (the post-nominal rules AdjP → N Adj and NumP → N Num receive the complementary probabilities). All utterances generated according to a given grammar are therefore statistically independent, and a particular grammar is determined by two parameters: let G(padj, pnum) be the grammar in which padj is the probability of the pre-nominal adjective rule AdjP → Adj N (i.e., the conditional probability of having the modifier first, given that the modifier is of type adjective), and pnum is the conditional probability of the pre-nominal numeral rule. Each modifier type (adjective and numeral) occurs an equal number of times in the training data given to each participant in the experiment, so the probability of each modifier type is fixed at 0.5. The fact that our hypothesis space consists of PCFGs means that after learning, the inferred posterior probability distribution constitutes a mixture of stochastic grammars. Treating learning as producing a mixture of grammars is typical in computational models of language learning and language change (e.g., Clark & Roberts, 1993; Kirby, 1999; Niyogi, 2006; Yang, 2002) as well as theoretical models (e.g., Anttila & Yu Cho, 1998; Boersma & Hayes, 2001; Kroch, 2000); however, such mixtures are more often composed of deterministic grammars.
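A grammar of this kind can be sketched as a two-parameter sampler. The function names, the string labels for the four orders, and the fixed random seed are illustrative assumptions; the only quantities taken from the text are the two rule probabilities and the fixed 0.5 probability of each modifier type.

```python
import random

def sample_utterance(p_adj, p_num, rng):
    """Draw one two-word phrase from the grammar G(p_adj, p_num)."""
    # Each modifier type is chosen with probability 0.5.
    if rng.random() < 0.5:
        # AdjP -> Adj N with probability p_adj, otherwise AdjP -> N Adj.
        return "Adj-N" if rng.random() < p_adj else "N-Adj"
    # NumP -> Num N with probability p_num, otherwise NumP -> N Num.
    return "Num-N" if rng.random() < p_num else "N-Num"

def sample_corpus(p_adj, p_num, n, seed=0):
    """Draw n independent utterances from G(p_adj, p_num)."""
    rng = random.Random(seed)
    return [sample_utterance(p_adj, p_num, rng) for _ in range(n)]

# A stochastic grammar, and the deterministic harmonic grammar with both
# modifiers always pre-nominal (pattern 1 of (2)).
mixed = sample_corpus(0.3, 0.7, n=1000)
harmonic = sample_corpus(1.0, 1.0, n=200)
```

Each utterance is drawn independently of the others, which is what makes the counts of each order sufficient to summarize a corpus.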

Consider the grid shown in Fig. 3. Each intersection in the grid represents a possible grammar G(padj, pnum); for example, the point (0.3, 0.7) represents the grammar G(0.3, 0.7) wherein the conditional probability of uttering the order Adj–N (given that the modifier is of type Adj) is 0.3, and the conditional probability of uttering Num–N is 0.7 (given a numeral modifier; recall Fig. 1(b)). The grid in Fig. 3 uses intervals of 0.1, but one can imagine the grid becoming infinitely fine, and in fact the “grid” we use is continuous.

Figure 3.

 A simplified (coarse) grid of grammars.

The input to learners in the experiment is a particular training regime—one of four conditions (defined in Table 2)—which can be characterized by a pair of counts representing the number of utterances (trials) employing each order for that condition during training, as in Table 3. Each grammar on the grid assigns some probability to each set of training counts; although a given set of counts is more likely to have been generated by some grammars than others, in principle each grammar on the grid will in fact have some non-zero probability of having generated those counts (excluding those grammars exactly on the outer edge of the grid). The grammar with the highest likelihood of generating the condition-3 training data, Training3, is the point (0.3, 0.7) shown in Fig. 3. The further a grid point is from this point, the lower its likelihood of generating Training3.

Table 3. 
Training regime for each condition
Training regime   Adj–N   N–Adj   Num–N   N–Num
Condition 1       28      12      28      12
Condition 2       12      28      12      28
Condition 3       12      28      28      12
Condition 4       28      12      12      28

We can calculate the probability that a grammar, G(padj, pnum), generates a pair of counts by independently calculating the probability of the Adj–N counts according to padj, and the probability of the Num–N counts according to pnum, using the binomial formula. Given a particular grammar, the probability, or likelihood, of generating a pair of counts for Trainingk is then simply the product of the two probabilities of the counts for each modifier type in that condition: This is the first basic premise of the model, formulated in (4).

  • (4) The likelihood P(Trainingk|G) is the product of two binomial distributions, one with parameter padj and one with parameter pnum.

(The formal statement of (4) is deferred to section 4.3, Equation (3).) The probability of a set of testing counts, Testingj, given a particular grammar can also be calculated in this same way (see section 5.1).
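Premise (4) can be illustrated with a minimal sketch, assuming the training counts of Table 3 and the coarse 0.1 grid of Fig. 3. The function and variable names are illustrative and are not part of the original model code.

```python
from math import comb

# Training counts from Table 3: (Adj-N, N-Adj, Num-N, N-Num) per condition.
TRAINING = {
    1: (28, 12, 28, 12),
    2: (12, 28, 12, 28),
    3: (12, 28, 28, 12),
    4: (28, 12, 12, 28),
}

def binomial_pmf(c, t, p):
    """Probability of c pre-nominal orders in t trials with rate p."""
    return comb(t, c) * p**c * (1 - p) ** (t - c)

def training_likelihood(p_adj, p_num, condition):
    """P(Training_k | G(p_adj, p_num)) as a product of two binomials."""
    adj_pre, adj_post, num_pre, num_post = TRAINING[condition]
    return (binomial_pmf(adj_pre, adj_pre + adj_post, p_adj)
            * binomial_pmf(num_pre, num_pre + num_post, p_num))

# Evaluate every grammar on the coarse 0.1 grid of Fig. 3 for condition 3.
grid = [i / 10 for i in range(11)]
best = max(((pa, pn) for pa in grid for pn in grid),
           key=lambda g: training_likelihood(g[0], g[1], 3))
print(best)  # (0.3, 0.7)
```

Grammars on the outer edge of the grid (a probability of exactly 0 or 1) receive zero likelihood here, matching the exclusion noted above.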

Following Bayes’ Theorem, to infer a posterior probability distribution over grammars that could have generated the training data, the learner (implicitly) computes both the likelihood assigned to the training data by each grammar and the prior probability of each grammar (see Equation (1) in section 3). Because, for a given modifier type, the likelihood function is a binomial distribution (4), the distribution we posit for the corresponding prior is a beta distribution: This distribution is a natural choice because it is the one that gives the posterior probability distribution—the distribution over grammars which the learner will infer—the same form as the prior: The posterior will also be a beta distribution. The beta distribution is therefore called the conjugate prior to the binomial.6 As the complete likelihood of a grammar is the product of two binomial distributions, the prior we posit involves the product of two beta distributions, one over padj and the other over pnum; each of these is governed by a pair of ‘‘shape’’ parameters, as stated in premise (5):

  • (5) The prior probability of a grammar G is built from the product of two beta distributions with parameters (αadj, βadj) and (αnum, βnum).

The formal statement of (5) is Equation (5) in section 4.3 below. Functioning as a prior, the beta distribution (a special case of the Dirichlet distribution with only two parameters) can be conceived of as information which the learner brings to the task regarding the distribution of padj and pnum—information that, when computing the posterior probability of a grammar, formally functions as prior counts, as though the learner assumes that α−1 pre-nominal and β−1 post-nominal phrases are already given, prior to encountering the training data.

The two parameters of a beta distribution, α and β (both greater than zero), govern the shape of the distribution (its mean, variance, etc.). The mean, α/(α + β), becomes more extreme (closer to 1 or to 0) when α is much greater than β, or vice versa. A distribution that favors more extreme values assigns higher probability to more regular (less random) grammars, that is, those closer to a corner of the space in Fig. 1(b). This can formalize a regularization bias. As we will discuss below, we treat the regularization bias as constant across all word-order patterns. It is illustrated in Fig. 4, where three example beta distributions governing adjective order, and three governing numeral order, proceed from favoring less regular to more regular grammars. In Fig. 4(a), the distributions move from favoring Adj–N and N–Adj equally toward the extreme favoring p(Adj−N) = 1. Exchanging α and β gives Fig. 4(b), where the darkest curve illustrates the opposite extreme, a distribution favoring p(Num−N) = 0, that is, favoring N–Num.

Figure 4.

 Example beta distributions. Adjective beta curves from lighter to darker illustrate progression toward favoring more regular use of Adj–N (shape parameters are (α = 10, β = 10), (α = 15, β = 3), (α = 15, β = 0.1)). Numeral beta curves illustrate progression toward favoring more regular use of N–Num.
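To make the effect of the shape parameters concrete, the following minimal sketch evaluates the mean and density of a beta distribution for the adjective shape parameters quoted in the Fig. 4 caption. The helper names are illustrative, and the density is computed in log space only for numerical convenience.

```python
from math import exp, lgamma, log

def beta_mean(alpha, beta):
    """Mean of a beta distribution: alpha / (alpha + beta)."""
    return alpha / (alpha + beta)

def beta_pdf(p, alpha, beta):
    """Beta density at 0 < p < 1, computed in log space for stability."""
    log_norm = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)  # log B
    return exp((alpha - 1) * log(p) + (beta - 1) * log(1 - p) - log_norm)

# Shape parameters quoted in the Fig. 4 caption for the adjective curves.
for alpha, beta in [(10, 10), (15, 3), (15, 0.1)]:
    print(alpha, beta, round(beta_mean(alpha, beta), 3))
```

As β shrinks relative to α, the mean moves toward 1 and the density piles up near p = 1, which is the regularizing shape discussed above.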

4.2.2. Prior components and the substantive bias

We have just seen that, with certain shape parameters (α, β), a beta distribution can assign higher prior probability to more extreme values of pAdj−N or pNum−N, thus favoring more regular grammars. The relevant cases are the darkest ones shown in Fig. 4; they impose a preference for extreme probability values on one or the other side of the range from 0 to 1. The product of two such distributions (one for adjectives, the other numerals) imposes a preference for grammars near a particular corner of the grammar space shown in Fig. 3—the corner determined by whether the Adj and Num distributions peak to the left or right of 0.5. We will say that a grammar G(pAdj−N, pNum−N) ‘‘favors’’ an ordering pattern, say pattern 1, if pAdj−N > 0.5 and pNum−N > 0.5.

To empirically determine this bias, we will fit the experimental data; for this computation, we do not want to impose any built-in asymmetries, but rather provide the opportunity for the model to assign any a priori probability distribution to the four possible patterns, as the data may require. So for each corner l, let us define a bias component Ll: a probability distribution over grammars (in the grid), formed as a product of two beta distributions, which we will write P(G|Ll). Ll has a modal grammar, the grammar to which it assigns the highest probability. Each of the four components will be said to favor one of the ordering patterns in (2) above if its modal grammar favors that ordering pattern. Ll is the component favoring order pattern l.

The prior probability of a grammar G(padj, pnum), according to component Ll, is the product of probabilities assigned to the particular values (padj, pnum). As different components favor different (padj, pnum) combinations, the values of the shape parameters depend on the component; they will be defined shortly.

The substantive bias is then a prior defined as a mixture of the four components: a distribution given by a weighted sum, with mixing weights γ = (γ1, γ2, γ3, γ4) ranging between 0 and 1, and summing to 1. The prior probability of a grammar G, P(G), thus combines, across all four components, the probability P(G|Ll) that G is generated according to the distribution defined by component Ll. Each component Ll itself has a prior probability P(Ll), which we are calling γl; the contribution of each component to the prior probability of a given grammar is weighted with the corresponding value in the multinomial prior γ.7 In other words, each component distribution assigns some probability to a grammar, but in the prior this probability is weighted by the value of γ assigned to that component. This means that a low weight assigned to a component will lead to a low prior probability of grammars favored by it. This part of the model therefore has the potential to enforce the substantive bias (3). Determining the values comprising γ is a primary objective of this work.

To summarize, the prior probability of a grammar G is equal to the sum of the probabilities of G according to each component, Ll, weighted by the prior probability of Ll, γl, as shown in Equation (2).

$$P(G) = \sum_{l=1}^{4} \gamma_l \, P(G \mid L_l) \tag{2}$$

An important feature of our model is that we constrain the shape parameters of the prior's multiple beta distributions to be equal, to formalize the hypothesis that the strength of the regularization bias is the same in all components, regardless of which word-order pattern they favor. Differences in learning behavior across conditions must therefore be achieved through combining this uniform regularization bias with a substantive bias—a non-uniform prior over components, γ.

More precisely, we require that the shape parameters for all four of the components share two values (x, y), one higher (x) and one lower (y). This means that in the model there are only two free shape parameters, the values of which are combined appropriately to create each component distribution. Crucially, the mean of a beta distribution is greater than 0.5 if α > β, and less than 0.5 if α < β. So component L3, which favors post-nominal adjectives and pre-nominal numerals (pattern 3), will have for its beta distribution governing pAdj−N, α3,adj = y (low), β3,adj = x (high), and for its beta distribution governing pNum−N, α3,num = x (high), β3,num = y (low). In the L4 component, favoring pre-nominal adjectives and post-nominal numerals (pattern 4), x and y will be exchanged. Table 4 specifies exactly how the shape parameters (α, β) for the two beta distributions for each component are determined by the two model parameters (x, y).

Table 4. 
Constraints on the shape parameters (α, β) of the two beta distributions governing each component distribution
Component   α adj   β adj   α num   β num
L1          x       y       x       y
L2          y       x       y       x
L3          y       x       x       y
L4          x       y       y       x

Note: The model parameters (x, y) correspond to higher and lower values, respectively. (Compare Table 3.)
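The way Table 4 and the mixture in Equation (2) fit together can be sketched as follows. This is an illustration under stated assumptions: the weights and the values of (x, y) used in the example calls are hypothetical, chosen only to show the shape of the computation, and the function names are not from the original work.

```python
from math import exp, lgamma, log

def beta_pdf(p, alpha, beta):
    """Beta density at 0 < p < 1, computed in log space."""
    log_norm = lgamma(alpha) + lgamma(beta) - lgamma(alpha + beta)
    return exp((alpha - 1) * log(p) + (beta - 1) * log(1 - p) - log_norm)

def component_shapes(x, y):
    """Table 4: (alpha_adj, beta_adj, alpha_num, beta_num) for L1..L4."""
    return {1: (x, y, x, y), 2: (y, x, y, x), 3: (y, x, x, y), 4: (x, y, y, x)}

def prior_density(p_adj, p_num, gamma, x, y):
    """Equation (2): sum over components of gamma_l * P(G | L_l)."""
    total = 0.0
    for l, (a_adj, b_adj, a_num, b_num) in component_shapes(x, y).items():
        total += (gamma[l] * beta_pdf(p_adj, a_adj, b_adj)
                  * beta_pdf(p_num, a_num, b_num))
    return total

# Illustrative values only (hypothetical weights and shapes, not fitted ones).
gamma = {1: 0.4, 2: 0.4, 3: 0.2, 4: 0.0}
print(prior_density(0.9, 0.9, gamma, x=5.0, y=0.5))  # near the L1 corner
print(prior_density(0.9, 0.1, gamma, x=5.0, y=0.5))  # near the L4 corner
```

With a zero weight on the fourth component, the density near the L4 corner comes only from the tails of the other components, so it is far lower than the density near the L1 corner.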

4.2.3. Posterior probability distribution over grammars

Recall that the learners’ task is to select a grammar from a posterior distribution over grammars inferred after receiving the training data. Given that we have a conjugate prior, the posterior resulting from learning will also be a mixture of the form in Equation (2)—it will be a mixture of posterior components (see Equation (10) in section 4.3). A consequence of this picture, which the results of the experiment will suggest is desirable, is that each component distribution contributes a non-zero probability to the prior probability of every training regime. This in effect allows the learner to conclude, for example, that a training regime as in Table 3 for condition 4 could in fact have been generated by, say, the component favoring pattern 1. This might occur if the prior weight assigned to the L4 component, γ4, were extremely low relative to γ1.

Now that we have specified how to calculate the likelihood, P(Trainingk|G), for condition k given a grammar G (premise (4)), and the prior probability, P(G), of G (premise (5)), we are in a position to compute G’s posterior probability in condition k, which we denote P(G|Trainingk). According to Bayes’ Theorem, Equation (1), the posterior probability of a grammar G is proportional to the product of the prior probability of G and the likelihood with which G generated the observed set of counts, Trainingk.

Consider first the contribution to the posterior from a single component of the prior, Ll. This prior is the product of two beta distributions with (α, β) shape parameters for adjectives and numerals. Because the prior is conjugate to the likelihood function, the posterior component contributed by Ll is also a product of two beta distributions—but the shape parameters are altered in light of the training data: The updated α parameter for adjectives, for example, is the corresponding α parameter for the prior plus the number of Adj–N examples observed in Trainingk (see Equation (9)). The adjective β parameter is likewise increased, by the number of N–Adj examples observed. This is why, as remarked previously, the prior's parameters can be interpreted as prior counts. (Technically, (α−1, β−1) are pseudo-counts: The value of the parameters will not necessarily be integers, and in the standard parameterization of the beta distribution, the exponents, which correspond to counts, have 1 subtracted from each parameter: compare Equations (4) and (7)).
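The conjugate update described in this paragraph amounts to adding observed counts to the prior shape parameters. A minimal sketch, assuming the condition-1 adjective counts from Table 3 and the best-fit (x, y) reported in (7) for component L1; the function names are illustrative.

```python
def update_shapes(alpha, beta, pre_count, post_count):
    """Conjugate beta-binomial update: add observed counts to the prior shapes."""
    return alpha + pre_count, beta + post_count

def beta_mean(alpha, beta):
    """Mean of a beta distribution: alpha / (alpha + beta)."""
    return alpha / (alpha + beta)

# Component L1 with (alpha, beta) = (x, y) = (16.5, 0.001), updated on the
# condition-1 adjective counts (28 Adj-N, 12 N-Adj).
alpha_post, beta_post = update_shapes(16.5, 0.001, 28, 12)
print(round(beta_mean(alpha_post, beta_post), 2))  # 0.79
```

The training proportion is 0.7, so the posterior component for L1 sits closer to the corner than the data alone would place it.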

Importantly, the mixture coefficients of the posterior components, like the shape parameters, also change to reflect the training data. Recall that γl denotes the prior probability of component Ll, P(Ll). That probability changes to P(Ll|Trainingk) after observing the data Trainingk: for example, as Lk is the component that makes the data the most likely, P(Lk|Trainingk) will be increased relative to its prior value P(Lk), while other values will be decreased: The mixture shifts, increasing the relative strength of Lk. The new mixture coefficients are given by Equation (11) of the Appendix.

  • (6) The posterior distribution P(G|Trainingk) is given by Equation (2), with the parameters α, β defining each P(G|Ll), and each mixture parameter γl = P(Ll), updated based on the counts in Trainingk.
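The update of the mixture coefficients can be sketched with the standard conjugate result for mixture priors: each weight is multiplied by the marginal (beta-binomial) probability of the training counts under that component, and the weights are then renormalized. This is assumed to correspond to Equation (11) of the Appendix, which is not reproduced in this section. The flat prior weights below are hypothetical and serve only as an illustration.

```python
from math import comb, exp, lgamma, log

def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_marginal(c, t, alpha, beta):
    """Log beta-binomial probability of c pre-nominal orders in t trials."""
    return (log(comb(t, c)) + log_beta_fn(alpha + c, beta + t - c)
            - log_beta_fn(alpha, beta))

def posterior_weights(gamma, shapes, adj, num):
    """gamma, shapes: lists over components; adj, num: (pre-nominal count, total)."""
    scores = []
    for g, (a_adj, b_adj, a_num, b_num) in zip(gamma, shapes):
        log_evidence = (log_marginal(*adj, a_adj, b_adj)
                        + log_marginal(*num, a_num, b_num))
        scores.append(g * exp(log_evidence))
    norm = sum(scores)
    return [s / norm for s in scores]

x, y = 16.5, 0.001                    # best-fit shapes from (7)
shapes = [(x, y, x, y), (y, x, y, x), (y, x, x, y), (x, y, y, x)]  # Table 4
gamma = [0.25, 0.25, 0.25, 0.25]      # hypothetical flat weights, for illustration
weights = posterior_weights(gamma, shapes, adj=(12, 40), num=(28, 40))  # condition 3
print([round(w, 3) for w in weights])
```

Under flat prior weights, the component matching the training condition gains the most weight. A non-uniform γ rescales these scores before normalization, which is how a very small γl can suppress a component that the data would otherwise favor.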

A learner in condition k will, after receiving his or her training data, sample a grammar G = G(padj, pnum) from the inferred posterior distribution over grammars P(G|Trainingk). In the testing phase of the experiment, if asked to produce an Adj utterance, the learner will produce the pre-nominal form Adj–N with probability padj and the post-nominal order N–Adj with probability (1−padj). This yields another binomial distribution for the testing counts: P(Testingk|G) satisfies premise (4) just as P(Trainingk|G) does. Averaging this binomial distribution over the possible grammars sampled from the beta-mixture posterior gives a beta-binomial-mixture distribution as the predictive distribution for testing-data counts (Equation (15) of the Appendix).
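The generative story in this paragraph (choose a posterior component, sample a grammar from it, then produce testing utterances) can be simulated with a short sketch. The posterior weights and shapes, the number of test trials, and all names below are hypothetical illustration values, not figures from the experiment.

```python
import random

def sample_testing_counts(weights, shapes, n_adj, n_num, rng):
    """Draw one simulated learner: component -> grammar -> testing counts."""
    a_adj, b_adj, a_num, b_num = rng.choices(shapes, weights=weights, k=1)[0]
    p_adj = rng.betavariate(a_adj, b_adj)   # sampled grammar G(p_adj, p_num)
    p_num = rng.betavariate(a_num, b_num)
    # Each test utterance is pre-nominal with the grammar's probability.
    adj_pre = sum(rng.random() < p_adj for _ in range(n_adj))
    num_pre = sum(rng.random() < p_num for _ in range(n_num))
    return adj_pre, num_pre

rng = random.Random(0)
# Hypothetical two-component posterior (shapes as alpha_adj, beta_adj, alpha_num, beta_num).
weights = [0.7, 0.3]
shapes = [(44.5, 12.001, 44.5, 12.001), (12.001, 44.5, 12.001, 44.5)]
print([sample_testing_counts(weights, shapes, 20, 20, rng) for _ in range(5)])
```

Repeating the draw many times approximates the beta-binomial-mixture predictive distribution without computing it in closed form.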

4.3. Formal details

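In this section (which may be skipped without loss of continuity), we spell out formally the premises of the model presented above. Starting with premise (4), let ck,a and ck,n be the counts in Trainingk of Adj–N and Num–N, respectively, out of a total of tk,a and tk,n examples of {Adj, N} and {Num, N}. The likelihood function is given by:

$$P(\mathrm{Training}_k \mid G(p_{adj}, p_{num})) = \mathrm{binomial}(c_{k,a};\, t_{k,a},\, p_{adj}) \cdot \mathrm{binomial}(c_{k,n};\, t_{k,n},\, p_{num}) \tag{3}$$

where the standard binomial distribution is defined by:

$$\mathrm{binomial}(c;\, t,\, p) = \binom{t}{c}\, p^{c} (1-p)^{t-c} \tag{4}$$

Spelling out premise (5), the components of the prior are

$$P(G(p_{adj}, p_{num}) \mid L_l) = P(p_{adj} \mid L_l)\, P(p_{num} \mid L_l) \tag{5}$$

where, for Adj:

$$P(p_{adj} \mid L_l) = \mathrm{beta}(p_{adj};\, \alpha_{l,a},\, \beta_{l,a}) \tag{6}$$

Here and henceforth, the equations pertaining to Num are exactly analogous to the equations given for Adj. The standard beta distribution is defined by:

$$\mathrm{beta}(p;\, \alpha,\, \beta) = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{\int_0^1 p'^{\,\alpha-1}(1-p')^{\beta-1}\, dp'} = \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha, \beta)} \tag{7}$$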

where the normalizing integral over p is by definition the beta function B. The prior-component shape parameters (αl,a, βl,a) and (αl,n, βl,n) will be fit as explained in section 5. As stated in Equation (2), the full prior is a mixture, with coefficients γl = P(Ll), of the prior components specified in Equations (5)–(7).

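As stated in premise (6), for each training condition Trainingk, the posterior has the same form as the prior, except that the shape parameters and mixture coefficients have changed as a function of the Adj–N and Num–N counts in Trainingk. We have

$$P(G(p_{adj}, p_{num}) \mid L_l, \mathrm{Training}_k) = P(p_{adj} \mid L_l, \mathrm{Training}_k)\, P(p_{num} \mid L_l, \mathrm{Training}_k) \tag{8}$$

$$P(p_{adj} \mid L_l, \mathrm{Training}_k) = \mathrm{beta}(p_{adj};\, \alpha_{l,a} + c_{k,a},\, \beta_{l,a} + t_{k,a} - c_{k,a}) \tag{9}$$

The posterior given Trainingk is then:

$$P(G \mid \mathrm{Training}_k) = \sum_{l=1}^{4} P(L_l \mid \mathrm{Training}_k)\, P(G \mid L_l, \mathrm{Training}_k) \tag{10}$$

where the updated mixture coefficients P(Ll|Trainingk) are given by Equation (11) in the Appendix.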

5. Fitting the model parameters

5.1. Procedure

The hyperparameters of this model—γ, x and y—were fit using the production testing data from all the learners in the experiment (48 total, 12 per condition).8 This was done by maximizing the probability of the testing data—that is, the product, across all learners in all of the four conditions, of the probability of each learner's testing data given the posterior distribution for their training condition (see Equation (12) in the Appendix). (From the Bayesian perspective, this maximum-likelihood procedure corresponds to having a flat prior distribution over the hyperparameters, and selecting the parameter values with maximal posterior probability.) Finding the probability of the testing data (counts) for a given learner involves summing, over all grammars in the grammar space, the likelihood of those counts according to the grammar (binomial distributions), weighted by the posterior probability of that grammar (beta distributions): The predictive distribution that results is given by beta-binomial distributions—one per component Ll. The complete predictive distribution is a mixture of these four beta-binomials; see the Appendix for the formal details.
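The objective described here can be sketched as a sum of log predictive probabilities, where each learner's predictive probability is a mixture of beta-binomial products. This is a minimal illustration, not the original fitting code: the posterior weights, posterior shapes, and testing counts in the example are made up, and in the full procedure each learner would be scored against the posterior for his or her own training condition.

```python
from math import comb, exp, lgamma, log

def log_beta_fn(a, b):
    """Log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def beta_binomial_pmf(c, t, alpha, beta):
    """P(c pre-nominal orders in t test trials) with p integrated out."""
    return exp(log(comb(t, c)) + log_beta_fn(alpha + c, beta + t - c)
               - log_beta_fn(alpha, beta))

def predictive_prob(test, weights, shapes):
    """Mixture of beta-binomial products for one learner's testing counts.

    test    = (adj_pre, adj_total, num_pre, num_total)
    weights = posterior mixture coefficients P(L_l | Training_k)
    shapes  = posterior (alpha_adj, beta_adj, alpha_num, beta_num) per component
    """
    adj_c, adj_t, num_c, num_t = test
    return sum(w * beta_binomial_pmf(adj_c, adj_t, a1, b1)
                 * beta_binomial_pmf(num_c, num_t, a2, b2)
               for w, (a1, b1, a2, b2) in zip(weights, shapes))

def log_likelihood(tests, weights, shapes):
    """Objective to maximize: summed log predictive probability over learners."""
    return sum(log(predictive_prob(t, weights, shapes)) for t in tests)

# Hypothetical posterior for one condition and two made-up learners.
weights = [0.6, 0.4]
shapes = [(44.5, 12.001, 44.5, 12.001), (12.001, 44.5, 12.001, 44.5)]
tests = [(9, 10, 8, 10), (2, 10, 1, 10)]
print(log_likelihood(tests, weights, shapes))
```

An optimizer would call `log_likelihood` repeatedly while varying γ, x, and y, recomputing the posterior weights and shapes for each candidate setting.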

The optimization algorithm used here was a quasi-Newton search method which allows parameters to be constrained by upper and lower bounds, implemented in the general-purpose R (R Development Core Team, 2010) optimization function, optim, using the ‘‘L-BFGS-B’’ method (Zhu, Byrd, Lu, & Nocedal, 1994). Here, all γl were bounded during optimization by (0, 20) and then renormalized to sum to one (as the values of γ function as mixture coefficients, with γl = P(Ll)); (x, y) were both bounded by (0.001, 20).

5.2. Results

The parameter values that gave the maximum-likelihood fit to the testing data are shown in (7). Example beta distributions with α and β values similar to the best fit (x, y) can be seen in the two most extreme (darkest) distributions of Fig. 4. The exact means of the beta distributions illustrate the strength of the prior's regularization bias; in training condition 1, for example, the mean (for both adjectives and numerals) would be approximately 0.99994, and for training condition 2, approximately 0.00006—extremely close to 1 and 0. (As the optimization was constrained by y ≥ 0.001, this estimate is actually conservative.)

  • (7) Best-fit parameters (per-participant log likelihood = −5.99):
    • i. γ = (0.6293, 0.3706, 0.0001, 0)
    • ii. (x, y) = (16.5, 0.001)
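As a quick arithmetic check of the means quoted above, assume (following Table 4) that the component favoring pattern 1 has (α, β) = (x, y) and the component favoring pattern 2 has (α, β) = (y, x). The formula α/(α + β) with the best-fit values in (7) then gives:

$$\frac{x}{x+y} = \frac{16.5}{16.501} \approx 0.99994, \qquad \frac{y}{x+y} = \frac{0.001}{16.501} \approx 0.00006.$$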

The γ value for component 4 was found to be robust across starting values of the parameters (γ, x, y). For component 3, some starting values resulted in a higher prior weight. This is due to the strength of the regularization bias, which pulls the posterior for a given condition into the corner of grammar space even with a very low value of the gamma for that condition; thus, a reasonable fit to the data can be obtained even with relatively small (but non-zero) values for γ1 and γ2. High values of γ3 appeared to be in part due to a single participant in condition 4 whose testing proportions are in fact very close to L3—the only learner who shifted across the diagonal in grammar space. To investigate this, we analyzed the data for outliers and removed participants whose testing proportions for either adjectives or numerals were more than two standard deviations from the condition mean for that modifier type (Iglewicz & Hoaglin, 1993). This resulted in the removal of one participant (out of 13) from each condition, including the extreme condition-4 outlier. The γ values reported in (7) and used below to plot the prior and posterior surfaces are thus those which resulted in the maximum-likelihood value found across runs of the optimization algorithm with these outliers removed.

Crucially, the best-fitting values of the prior parameter, γ, which represent the substantive bias, approximate what would be expected according to Greenberg's Universal 18: The weight for component 4 (which favors pattern 4) is at 0, while the weights for the other three components are not (compare hypothesis (3ii), section 2.1). Furthermore, the weights for the two harmonic patterns are much higher than those for the two non-harmonic patterns (compare hypothesis (3i)). Thus, γ parallels what is expected based on the typological distribution of such patterns (see Table 1). Although the fact that the prior component weights follow this distribution is critical, the predictions the model makes for how learners should shift the input mixture are of particular interest. Before examining this in more detail, however, we will first verify that the complexity of our model is warranted by its fit to the data, and that the relevant differences among the four γl parameters are reliable.

5.3. Validation of the modeling results

To confirm quantitatively that (a) the relatively small difference found between γ3 (0.0001) and γ4 (0) was in fact meaningful, and (b) the difference between γ3 and the two harmonic γl values was meaningful, we compared our model—model 1—to two simpler baseline models in which γ contains only two parameters. The first alternative model—model 2—has one parameter for harmonic languages and one for non-harmonic languages (i.e., γ1 and γ2 are constrained to be equal, and γ3 and γ4 are constrained to be equal). The second alternative model—model 3—has one parameter for the disfavored L4 component, and one for the other three (i.e., γ1, γ2, γ3 are constrained to be equal).

The maximum-log-likelihood fit to the testing data was found for each model, and Likelihood Ratio tests (Lehmann, 1986), which allow comparison of nested models with different degrees of freedom, were used to compare these simpler models with our more complex model. The simpler models 2 and 3 assigned the data (natural) log likelihoods of −292.9 and −293.9, respectively, compared to −287.6 for model 1. According to these tests, the more complex model significantly outperforms both simpler models (for model 2 vs. model 1, χ2(2) = 10.7, p = .0048; for model 3 vs. model 1, χ2(2) = 12.68, p = .0018). The more complex model also outperforms both simpler models using the Bayesian Information Criterion test (BIC, Schwarz, 1977), which more heavily penalizes complexity. These results indicate that (a) simply distinguishing harmonic from non-harmonic languages is not enough to account for the results of the experiment; likewise (b) simply distinguishing L4 from all the others is not enough; and (c) even though γ3 is very low according to model 1, the difference between the best-fit values γ3 = 0.0001 and γ4 = 0 is meaningful. Furthermore, we will see below that this difference produces very different posterior (Fig. 7) and predictive (Fig. 9) distributions.
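The likelihood ratio statistic is twice the difference in maximized log likelihood between the nested models. For two degrees of freedom, the chi-square upper-tail probability has the closed form exp(−χ²/2), so the comparison can be sketched with the standard library alone. Because the log likelihoods quoted above are rounded to one decimal place, recomputing from them gives statistics of roughly 10.6 and 12.6, slightly different from the reported 10.7 and 12.68. The function name is illustrative.

```python
from math import exp

def likelihood_ratio_test_df2(loglik_simple, loglik_complex):
    """Return (chi-square statistic, p-value) for nested models differing by 2 df."""
    stat = 2.0 * (loglik_complex - loglik_simple)
    p_value = exp(-stat / 2.0)   # chi-square survival function, exact for df = 2
    return stat, p_value

# Rounded log likelihoods quoted in the text (model 1 = -287.6).
for name, loglik in [("model 2", -292.9), ("model 3", -293.9)]:
    stat, p = likelihood_ratio_test_df2(loglik, -287.6)
    print(name, round(stat, 1), round(p, 4))
```

The closed form applies only to df = 2; other degrees of freedom would need a general chi-square survival function.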

These model comparison results suggest that given the data we have—that is, the particular sample of testing points resulting from the experiment—our model merits the parameter structure we have allowed; nevertheless, we would still like to know whether the differences among the estimated γ values are reliable. To investigate this, we ran the model on data from a second replication experiment, with 55 participants.9 With the parameter values which best fit the data from the original experiment, the model assigned a per-participant log likelihood of −7.13 to the testing data from the second experiment.

We also calculated credible intervals for the γ parameter values using bootstrap sampling of the pooled data from both experiments (Good, 2005). (The intervals derived from the original experiment's data alone do not contradict our claims but simply are not informative due to the small number of participants. The fit shown in (8) indicates the consistency between the two experiments, which is also attested by the credible intervals that result from resampling the pooled data.) Credible intervals are a Bayesian counterpart of confidence intervals, in this case assuming a flat hyperprior over the γl values themselves. These intervals allow us to assess whether the crucial differences among the individual components are reliable, and not simply the result of the particular data sample that the model was fit to. The bootstrap credible intervals we report were calculated by resampling with replacement 1,000 times the individual participant data for each condition and optimizing the model's parameter values for each resampling of the data.
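The resampling scheme described here (participants drawn with replacement within each condition, followed by a refit) can be sketched as follows. The `fit` argument is a hypothetical stand-in for the full optimization of section 5.1, and the toy data and toy estimator in the example are illustrative only.

```python
import random

def bootstrap_estimates(data_by_condition, fit, n_resamples=1000, seed=0):
    """Resample participants with replacement within each condition and refit."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        resampled = {cond: rng.choices(people, k=len(people))
                     for cond, people in data_by_condition.items()}
        estimates.append(fit(resampled))
    return estimates

# Toy example: each "participant" is a single proportion, and the hypothetical
# estimator is the grand mean across conditions.
toy_data = {1: [0.8, 0.9, 1.0], 2: [0.1, 0.2, 0.0]}

def toy_fit(data):
    values = [v for people in data.values() for v in people]
    return sum(values) / len(values)

estimates = bootstrap_estimates(toy_data, toy_fit, n_resamples=200)
print(min(estimates), max(estimates))
```

Resampling within conditions keeps the number of participants per condition fixed in every bootstrap replicate.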

  • (8) Best-fit parameters for pooled data (per-participant log likelihood = −6.82):
    • i. γ = (0.572, 0.427, 0.0006, 0).
    • ii. (x, y) = (10.9, 0.001).

The resulting 95% highest probability density credible intervals (Hoff, 2009) are shown in Fig. 5, along with the estimated median parameter values. Crucially, the intervals for γ1 and γ2 do not overlap with the interval for γ3, and the interval for γ3 does not overlap with the interval for γ4. In fact, of the 1,000 bootstrap samples, none resulted in γ3 being lower than γ4 (and just three resulted in equal values). Furthermore, both γ1 and γ2 were higher than γ3 in 97% of samples. On the other hand, the credible intervals for γ1 and γ2 overlap substantially.
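One common way to approximate a highest probability density interval from a set of sampled values is to take the shortest interval that contains the required share of the sorted samples. The sketch below assumes that approach; it is not necessarily the exact procedure used for Fig. 5, and the function name is illustrative.

```python
from math import ceil

def hpd_interval(samples, mass=0.95):
    """Shortest interval containing at least `mass` of the sampled values."""
    ordered = sorted(samples)
    n = len(ordered)
    window = max(1, ceil(mass * n))   # number of points inside the interval
    starts = range(n - window + 1)
    best = min(starts, key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best], ordered[best + window - 1]

# A skewed toy sample: the outlier at 1000 is excluded from the interval.
print(hpd_interval(list(range(100)) + [1000]))  # (0, 95)
```

When every bootstrap estimate of a parameter is identical, the interval collapses to a single point, as with the (0, 0) interval reported for γ4.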

Figure 5.

 Plot of 95% highest probability density credible intervals around median γ parameter values. Inset plot shows magnified view of log(γ3) and log(γ4). (Because the interval for γ4 is (0, 0), 10^−6 was added to both γ3 and γ4.)

These analyses suggest that we can reliably conclude that {γ1, γ2} > γ3 > γ4, although we cannot determine the relative strengths of γ1 and γ2.

5.4. Analysis of the best-fitting model

Having assessed the reliability of our estimates of the relative strengths of the four components in the model, we return to examining the predictions the model makes. Following the logic of a generative model, we say that the behavior generated by the model is ‘‘predicted’’ by the model, although it is of course true that as the parameters of the model are fit to the experimental data, these are not a priori predictions. For concreteness, we use the best-fit parameters shown in (7).

To see the mixture shift effected by the prior, for a given condition we can compare the likelihood distribution (which is independent of the prior's hyperparameters) to the posterior distribution: The former is what a learner would acquire given no bias, while the latter is what our Bayesian learner acquires under the influence of the biases encoded in the prior. Most straightforwardly, consider Fig. 6, which compares these two plots for condition 1. The peak of the likelihood surface is, as expected, centered around the training (Adj–N, Num–N) probabilities at (0.7, 0.7). The effect of the prior bias in the case of condition 1 generates a posterior distribution which is shifted into the L1 corner, centered at (0.8, 0.8)—as γ1 is large, probability mass is concentrated in the training region; however, the regularization bias pulls the distribution toward the corner.

Figure 6.

 (a) Likelihood and (b) posterior surfaces for condition 1, using best-fit beta shape parameters.

That component 1 (which favors pattern 1) is assigned a high weight in the prior is unsurprising given that participants in training condition 1 appeared quite likely to acquire (a regularized version of) that pattern. As expected, the weight for component 2, favoring the other harmonic pattern L2, is similarly quite high—the posterior is essentially the mirror image of that shown for condition 1 (peak centered at (0.2, 0.2)). For this particular data set, the weight for component 1 is higher than the weight for component 2, but the credible interval analyses discussed in section 5.3 suggest that this may not be a reliable difference. In this particular sample, the difference may arise because learners in the non-harmonic condition 4 were most likely to learn a grammar that was shifted towards L1.

Turning to the non-harmonic components, recall that the weight for component 3 is very low. However, the effect of even such a small prior weight can be seen in Fig. 7(a), the posterior surface for training condition 3: a substantial amount of probability density is placed near the L3 corner. Relative to components 1 and 2, of course, less probability mass falls under the peak that is shifted into the L3 corner; however, the center of that peak, at (0.2, 0.8), is shifted toward the corner to the same extent as in the harmonic conditions. The smaller bumps in the posterior surface predict, in agreement with the intuitive assessment of the raw data in section 2, that some participants exposed to training condition 3 should shift away from the input not into the L3 corner, but toward one of the harmonic corners.

Figure 7.

 Posterior surface for experimental training conditions (a) 3 and (b) 4, using best-fit beta shape parameters.

Although of small magnitude, the prior weight for component 3 is greater than zero, the weight for component 4; this small difference is sufficient to predict—correctly—that the behavior of participants in conditions 3 and 4 should be strikingly different.

For component 4 (which favors pattern 4), the zero prior weight, γ4, results in a posterior which has an extremely low probability density in the vicinity of the L4 corner. For condition-4 learners, where does the model predict that the probability mass should be concentrated instead? As Fig. 7(b) illustrates, it is divided between those grammars closer to L1 and those closer to L2. This result predicts not only that learning a grammar which falls near the L4 corner is extremely unlikely—unsurprising, given the fact that no learner's output falls in the L4 corner—but also that learners in condition 4 will acquire grammars which are shifted significantly toward a harmonic language.

Fig. 8 shows the prior distribution (on a log scale) with the best-fit hyperparameters. The two ‘‘wings’’ of this surface that rise in the L1 and L2 corners are each contributed by the corresponding component L1, L2 of the prior. Because the mixture weight for L4 is γ4 = 0, there is no such wing rising in the L4 corner (i.e., the wing that would be contributed there by L4 is multiplied by 0): instead, the falling portions of the L1, L2 wings produce a falling ‘‘tail’’ for the prior in the L4 corner. Because the mixture weight for L3 is smaller than those of L1 and L2, the ‘‘head’’ of the surface in the L3 corner rises to a lower peak value, and the wings from L1 and L2 encroach upon it.

Figure 8.

 Plot of the prior using best-fit hyperparameters, in log space. Colored dots locate the training points for each condition in this space.

We suggest the following interpretation for this surface. The likelihood surface for condition 1 peaks near the L1 corner where the prior surface is dominated by the L1 wing: Multiplying the condition-1 likelihood by the prior thus gives a posterior that is elevated into the L1 corner, shifting its peak closer to the corner—from (0.7, 0.7) to (0.8, 0.8). The result is uni-modal. The same occurs for condition 2.

The likelihood peak for condition 4 falls in the downward-pointing tail of the surface. The portion of this peak closer to the L4 corner is pushed down when multiplied by the prior; there is no posterior peak in the L4 corner. The portions falling on the wings rising into the L1, L2 corners, however, are lifted up; the result is a posterior with a peak on each of the L1, L2 wings: a bimodal distribution, with peaks shifted from the condition-4 training probabilities (where the likelihood peaks) toward the L1, L2 corners.

The peak of the likelihood surface for condition 3 falls in the most complex region of the prior surface. As with condition 4, in condition 3, the L1 wing of the prior, when multiplied by the condition-3 likelihood, produces a peak shifted toward the L1 corner from the condition-3 training probabilities. The same is true of the L2 wing, yielding a second peak of the condition-3 posterior. At the location of the condition-3 likelihood peak, the strongest influence is from the head of the prior surface: The L3 component's contribution is the highest there. This shifts the likelihood peak toward the L3 corner, producing the third and dominant peak of the tri-modal condition-3 posterior.

In this way, we can understand how—from a natural conjugate prior constructed from the four possible products of two beta distributions with the same shape parameters—the complex differences among the learning behaviors in the various language conditions arise from the interaction of a uniform regularization bias encoded in the shape parameters and non-uniform substantive biases encoded in the mixture coefficients γ.

To illustrate more concretely the fit of the model to the production testing data in the experiment, Fig. 9 shows a sample drawn from the predictive probability distribution for each training condition. The sample (light colored points) shows what the model predicts the testing data should look like, and the dark colored points show the actual testing data for each condition. Fig. 9 therefore provides a qualitative picture of the similarity between the model's predictions and the observed results. The model appears to provide a close fit to the distribution of testing data in all conditions. Although several learners in condition 4 fall slightly outside the predicted distribution for their training condition, the model correctly predicts the directions of learners’ shifts in each condition, indicated by arrows—regularization or movement toward a harmonic language for learners in condition 3 (the outermost dashed contour curve for the condition 3 distribution indicates the three lobes corresponding to the three predicted clusters), and for learners in condition 4, movement toward a harmonic language only.

Figure 9. Sampled data drawn from the predictive distributions (partly transparent lighter points), and actual participant testing data (opaque darker points) for each training condition. Dashed contours around each distribution highlight areas of high probability density.

6. Discussion

The generative model described here (like that of Reali & Griffiths, 2009) posits that participants in an artificial language-learning experiment are Bayesian learners—using experience, in the form of experimental training data, combined with prior biases, to infer a posterior distribution over grammars from which they select a grammar with which they then produce utterances during testing. Learners in the experiment analyzed here clearly did not replicate the training data veridically; however, the results can be captured if we take the inference process to have been systematically affected by learners’ expectations about which word-order patterns are more likely a priori, and what level of variability should be encoded in probabilistic rules.

The bias of learners in the word-order learning experiment, estimated by the model, quantitatively confirms the hypothesis that learning biases parallel typological generalizations concerning the ordering of adjectives, numerals, and nouns: This is the substantive facet of the grammar-learning bias. Furthermore, the substantive bias encoded in the model is able to explain the complex pattern of individual participant results in the experiment (illustrated in Fig. 1(b))—something that is not possible without a formal learning model.

More specifically, the model assigned a prior weight of 0 to the component of the bias that favored the typologically rare pattern 4 (Adj–N, N–Num), meaning that grammars assigned a high probability by this component will be unlikely to be given a correspondingly high probability in the posterior probability distribution acquired by learners: The model correctly predicts that learners in condition 4 will not regularize. Furthermore, because the components favoring the harmonic patterns 1 and 2 are assigned high weights, the model correctly predicts that learners exposed primarily to the rare pattern 4 will actually infer a posterior probability distribution which is shifted toward the typologically favored harmonic corners of grammar space (L1 and L2). Returning to the typological data in Table 1, the results of our model—fit to a set of learning data in which no learner regularizes pattern 4—may appear to predict in some sense that no language should use deterministic pattern 4. The typological data reveal, however, that 4% of the world's languages have pattern-4-dominant grammars. Assuming it is, in fact, true that at least some of these languages use not just a dominant pattern-4 grammar, but a deterministic one, this suggests that in a larger sample of experimental data (or a sample with a larger set of training utterances), some learners should be able to acquire such a pattern. In such an experiment, on the basis of the results reported here, we predict that γ4 would not be zero but would still be lower than γ3 which would in turn still be lower than {γ1, γ2}: The biases will still be revealed to parallel the typology.

The complex pattern of behavior exhibited by learners in condition 3, who were trained on the pattern (N–Adj, Num–N), is also captured by the model. This condition corresponds to a non-harmonic word-order pattern that is robustly typologically attested but considerably less frequent than the harmonic patterns corresponding to conditions 1 and 2. The model assigns to component 3 a weight in the prior which is non-zero, yet not as high as the weights for either harmonic-pattern component. The effect of this prior weight is that individual learners exposed primarily to this pattern are actually predicted to show one of three behavioral outcomes—they will either acquire a regularized version of the training input or they will shift toward one of the two harmonic languages. That the structure of the biases encoded in our model would successfully capture the pattern of behavior for learners in the two non-harmonic conditions was not clear in advance and represents perhaps the major accomplishment of the model.

The model also allows us to evaluate any role played by a bias toward speakers’ native language (English). As Fig. 9 shows, the fact that component 1 is assigned the highest weight results in learners in the non-harmonic training conditions, particularly condition 4, shifting more toward the L1 corner than the L2 corner. However, as we have noted, the difference between L1 and L2 may be an artifact of this sample of data (the bootstrap credible intervals for the components favoring these two languages overlapped substantially). The typological data in Table 1 suggest that L2 may in fact be preferred over L1; however, given that the learners were native English speakers, this is likely not the best population on which to test that hypothesis. Further discussion of the possible role of an English-order bias can be found in Culbertson et al. (2012).

The bias estimated by the model also quantifies an observation in earlier work by Newport and colleagues, that learners tend to reduce variability as they acquire a grammar; this is the regularization facet of the bias. The best-fitting parameters of the model result in distributions whose means are close to 0 or 1, so that more extreme probabilities—more regular rules—are favored; this has the effect of generally pushing the grammars inferred by learners into the (less variable) corners of the grammar space. The interaction between the substantive and regularization facets of the bias results in the finding that learners in condition 4—unlike those in every other condition—do not regularize the input pattern they were exposed to. The model thus provides a precise characterization of the claim that learners regularize their input, but only when that input conforms to typological generalizations; when it does not, learners shift the input in a way that brings it more in line with these generalizations.

To briefly review the technical elements, what was built into the model's bias was only the form of the prior. Because the generative model is the product of two binomial distributions (for the number of pre-nominal Adj and Num utterances), the natural prior to assume is the conjugate prior, a product of two beta distributions (for the pre-nominal probabilities pAdj−N, pNum−N). A single product of beta distributions favors only a single quadrant of grammar space. This would suffice to encode a substantive bias favoring a single word-order pattern. To enable the prior to formalize a more complex bias—for example, the typologically motivated one hypothesized in (3), favoring both harmonic orders and disfavoring one of the disharmonic orders—the form of the prior we posited is a mixture of the four possible products of betas, each favoring one quadrant of grammar space. In leaving free the weights γ of this mixture, we allowed the model free rein in deciding for itself which quadrants to favor, and by how much. The shape parameters of the beta distributions were constrained to produce an equal regularization bias for each quadrant, confining to the four mixture weights γ the substantive aspect of the bias. The strength of the regularization bias (if any) was not predetermined; this was up to the model to decide, by selecting values for the two free beta-distribution shape parameters (x, y). What we have shown here is that a model with a prior structured in this way is able to account for the experimental data, thereby confirming that the hypothesized biases based on typological findings can in fact explain learners’ on-line behavior in an artificial language-learning task.
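
The structure just reviewed can be made concrete with a short sketch of the conjugate update for a mixture of beta products. It is an illustration under assumptions, not the fitted model: the function names, the layout of the shape parameters (Beta(x, y) for a component favoring pre-nominal order, Beta(y, x) otherwise), the numerical values of γ, x, and y, and the training counts are all hypothetical. Only the zero weight on component 4 is taken from the text.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

# Quadrant favored by each component: (Adj pre-nominal?, Num pre-nominal?).
# L1 = (Adj-N, Num-N), L2 = (N-Adj, N-Num), L3 = (N-Adj, Num-N), L4 = (Adj-N, N-Num).
QUADRANTS = {1: (True, True), 2: (False, False), 3: (False, True), 4: (True, False)}

def prior_shapes(prenominal, x, y):
    """Assumed layout: Beta(x, y) favors pre-nominal order when x > y."""
    return (x, y) if prenominal else (y, x)

def posterior_mixture(gamma, x, y, k_adj, n_adj, k_num, n_num):
    """Conjugate update of a mixture of beta products after binomial training data.

    Returns (weights, shapes): updated mixture weights and, per component,
    the updated shape parameters (a, b, c, d).
    """
    log_w, shapes = {}, {}
    for l, (adj_pre, num_pre) in QUADRANTS.items():
        a, b = prior_shapes(adj_pre, x, y)
        c, d = prior_shapes(num_pre, x, y)
        a2, b2 = a + k_adj, b + (n_adj - k_adj)
        c2, d2 = c + k_num, d + (n_num - k_num)
        shapes[l] = (a2, b2, c2, d2)
        if gamma[l] > 0.0:
            # log of: prior weight * marginal likelihood of the training counts
            # (binomial coefficients are shared by all components and cancel).
            log_w[l] = (math.log(gamma[l])
                        + log_beta(a2, b2) - log_beta(a, b)
                        + log_beta(c2, d2) - log_beta(c, d))
    top = max(log_w.values())  # subtract the max for numerical stability
    total = sum(math.exp(v - top) for v in log_w.values())
    weights = {l: (math.exp(log_w[l] - top) / total) if l in log_w else 0.0
               for l in QUADRANTS}
    return weights, shapes

# Illustrative (not fitted) hyperparameters; only gamma_4 = 0 comes from the text.
gamma = {1: 0.5, 2: 0.4, 3: 0.1, 4: 0.0}
x, y = 8.0, 0.5
# Hypothetical condition-4-like training counts: mostly Adj-N and mostly N-Num.
weights, shapes = posterior_mixture(gamma, x, y, k_adj=28, n_adj=40, k_num=12, n_num=40)
```

With these placeholder values, component 4 receives no posterior weight even though it matches the training counts best, so nearly all posterior mass goes to the two harmonic components. This mirrors, qualitatively, the behavior described for condition 4.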

For testing the psychological reality of learning biases that parallel typological universals, the model we have presented can be applied to many other universals besides the one we have studied here, Greenberg's Universal 18. Like this universal, many others can be cast in the form of a dispreference for combining, in the same grammar, two rules (here, AdjP → Adj N and NumP → N Num). A well-known example prohibits (VP → V NP) & (PP → NP P): verb before its object and adposition after its object (making P a post-position rather than a pre-position; Greenberg's Universal 3, Greenberg, 1963). Another type of example, from phonology, prohibits palatalization before back vowels without palatalization before front vowels (Bhat, 1978; Wilson, 2006); this disprefers rule combinations such as (/ka/ → [tʃa]) & (/ki/ → [ki]).

Prior components of the type proposed here are implicated for a pair of rules when those rules are seen to interact typologically in the way that the rules for AdjP and NumP interact in Universal 18. Linguistic research aims to discover general categories, structures, and rules in terms of which universal patterns receive short descriptions; this would tend to minimize the number of such interacting rule pairs. The number of components in the prior (hence the number of mixture coefficients γ) thus scales linearly in the number of such universals, as expected for a formalization of the hypothesis that learning biases parallel universals. Although the relevant type of interaction is attested for a variety of rule pairs, there are few such interactions among three or more rules at once. Prior components of order higher than two (involving, e.g., the product of three beta distributions) would be called for, then, for only a small number of linguistic dimensions. There is reason for optimism, therefore, that the modeling approach discussed here, and the experimental artificial language-learning paradigm it models, will have wide applicability in the exploration of the psychological reality of learning biases that parallel typological universals.


  • 1

     It is well known that every learner capable of generalization must have an inductive bias—this guides generalization beyond the observations (e.g., Mitchell, 1997: 39ff; Geman, Bienenstock, & Doursat, 1992; Griffiths, Chater, Kemp, Perfors, & Tenenbaum, 2010). At issue here is not whether human language learners have biases, but rather, whether these biases parallel linguistic universals.

  • 2

     Whether the grammatical preferences exhibited by the adult experimental participants are the result of learning prior to the experiment we cannot say, but it will be evident that biases based on knowledge of English alone can play only a very limited role in explaining the key findings. For detailed discussion of this issue, see Culbertson et al. (2012).

  • 3

     Throughout this article Adj = Adjective, Num = Numeral, N = Noun, Adj–N = Adjective–Noun order, {Adj, N} = a phrase (AdjP) containing Adj, N in either order; p(Adj−N) = pAdj−N = padj = pa = conditional probability of order Adj–N given {Adj, N}, and similarly for p(Num−N) = pNum−N = pnum = pn.

  • 4

     For this very fruitful way of looking at the experimental data, we are extremely grateful to Don Mathis.

  • 5

     This assumption is supported by the data. Production testing data were separated into four successive trial bins, and two logistic regression models were fit to the testing data: one including trial bin as a fixed effect, the other a baseline model without this factor (participant and item were random effects in both models). A likelihood ratio test comparing these two models did not yield a significant difference, suggesting that trial bin was not a significant factor in explaining the results (χ²(3) = 1.78, p = 0.7).

  • 6

     For the same reason, this prior distribution is also used in Reali and Griffiths (2009). Reali and Griffiths (2009) also use a mixture of beta distributions for one model, but not for the same reason that we use a mixture below.

  • 7

     The prior is a multinomial in the sense that the generative model it provides is equivalent to choosing, independently for each utterance: one of the four components Ll (with probability γl); then a grammar G(pAdj−N, pNum−N) from the distribution Ll; then a modifier type Mod ( = Adj or Num, with uniform probability); and finally a word order, with the pre-nominal order probability given by the appropriate value pMod−N.

  • 8

     Internal to the model, learning by experimental participants uses the input presented during the “training” phase of the experiment as data for inferring a posterior grammar distribution; the data produced by learners in the “testing” phase are then used as “training” data for fitting the model's hyperparameters. For fitting, we follow Culbertson et al. (2012) in including only testing trials in which the participant produced the correct vocabulary items; therefore, each participant may have a different number of testing trials. The maximum number of testing trials was 80 (mean number of trials per participant was 70).

  • 9

     The design and procedure of this experiment are identical to the previous experiment, with the exception of the precise type of feedback given—in the original experiment, non-contingent feedback was given concerning word order during the production testing phase, whereas in this replication experiment, no feedback was given. The results of the two experiments are very similar; details are reported in Culbertson (2010).


This work was supported by an NSF graduate research fellowship awarded to the first author, by a Blaise Pascal International Research Chair funded by the Ile-de-France department and the French national government awarded to the second author, by an IGERT training grant awarded to the Cognitive Science Department at Johns Hopkins University, and by an NIH training grant awarded to the Center for Language Sciences at the University of Rochester. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation or other sponsors. The authors wish to acknowledge the following people for their help with this work: Colin Wilson, Don Mathis, Géraldine Legendre, Mark Pitt, and three anonymous reviewers.


For the derivations of these equations, see the Supplementary Materials.

Given that the shape parameters of the prior components are always x and y, it can be shown that, after training on Trainingk, the updated mixture coefficients of Equation (10), inline image, are as follows:


where the updated shape parameters (inline image, etc.) are given in Equation (9).
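
As a sketch of the form such an update takes under standard beta-binomial conjugacy, the following uses assumed notation that may differ from the authors' own symbols. Component l of the prior is taken to be Beta(a_l, b_l) over p(Adj−N) times Beta(c_l, d_l) over p(Num−N), where each shape pair is (x, y) in one order or the other. The training data are taken to contain k_A Adj–N orders out of n_A Adj trials and k_N Num–N orders out of n_N Num trials.

```latex
% Updated shape parameters (assumed notation):
\[
a'_l = a_l + k_A, \quad b'_l = b_l + (n_A - k_A), \quad
c'_l = c_l + k_N, \quad d'_l = d_l + (n_N - k_N).
\]
% Generic conjugate update of the mixture weights:
\[
\gamma'_l \;\propto\; \gamma_l \,
\frac{B(a'_l, b'_l)}{B(a_l, b_l)} \,
\frac{B(c'_l, d'_l)}{B(c_l, d_l)} .
\]
% Because B is symmetric and every prior pair is (x, y) in some order,
% B(a_l, b_l) = B(c_l, d_l) = B(x, y) for all l, so the denominators cancel:
\[
\gamma'_l \;=\;
\frac{\gamma_l \, B(a'_l, b'_l)\, B(c'_l, d'_l)}
     {\sum_{m=1}^{4} \gamma_m \, B(a'_m, b'_m)\, B(c'_m, d'_m)} .
\]
```

The cancellation in the last step is what the shared shape parameters x and y buy: the updated weights depend on the prior only through γ and the updated beta functions.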

The hyperparameters were fit by maximizing the probability of the testing data from all learners j:


The predictive distribution giving the probability P(Testingj|Trainingk), for learner j in condition k, is, from the posterior mixture (Equation (10)):




For each component Ll, we have the beta-binomial contribution:


using the abbreviations


where in Testingj, the count of Adj–N is inline image out of a total of inline image Adj trials, and inline image and inline image are the corresponding values for Num.
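
The predictive computation described in this appendix can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, the posterior weights and shape parameters are assumed to have been computed already for each training condition, and the beta-binomial is written with its binomial coefficient (treating the testing data as counts), a factor that is constant in the hyperparameters and so does not affect which values maximize the fit.

```python
import math

def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def log_choose(n, k):
    """Log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def log_beta_binomial(k, n, a, b):
    """Log probability of k successes in n trials when p ~ Beta(a, b)."""
    return log_choose(n, k) + log_beta(a + k, b + n - k) - log_beta(a, b)

def log_predictive(weights, shapes, k_adj, n_adj, k_num, n_num):
    """Log P(testing counts | training) under a posterior mixture.

    weights: posterior mixture weights, one per component.
    shapes:  per component, (a, b, c, d): updated beta shape parameters
             for p(Adj-N) and p(Num-N).
    """
    terms = [math.log(w)
             + log_beta_binomial(k_adj, n_adj, a, b)
             + log_beta_binomial(k_num, n_num, c, d)
             for w, (a, b, c, d) in zip(weights, shapes) if w > 0.0]
    top = max(terms)  # log-sum-exp for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))

def log_likelihood(learners):
    """Quantity to maximize over the hyperparameters: the summed log
    predictive probability of every learner's testing counts.

    learners: iterable of (weights, shapes, (k_adj, n_adj, k_num, n_num)),
              pairing each learner with the posterior for their condition.
    """
    return sum(log_predictive(w, s, *counts) for w, s, counts in learners)

# Hypothetical posterior and testing counts for two learners (illustrative only).
post_w = [0.55, 0.45]
post_s = [(36.0, 12.5, 20.0, 28.5), (28.5, 20.0, 12.5, 36.0)]
total = log_likelihood([(post_w, post_s, (30, 35, 10, 35)),
                        (post_w, post_s, (5, 38, 4, 36))])
```

In a full fit, `log_likelihood` would be recomputed for each candidate setting of the mixture weights γ and shape parameters (x, y), with the posterior for each condition rederived from those hyperparameters before scoring the testing data.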