The previous two sections have illustrated how computational and psycholinguistic research on the production of referring expressions are often seen as separate areas, with only limited interaction between them. Yet each of the two areas suffers from limitations which the other would be well placed to rectify. Clearly, as we have seen, there are various aspects of referential behavior which current reg algorithms ignore, but which have received a lot of attention among psycholinguists. Conversely, psycholinguistic theories often rely on intuitive notions such as ‘‘common ground,'' ‘‘adaptation,'' ‘‘alignment,'' or ‘‘salience,'' without defining these precisely. A study by Poesio, Stevenson, Di Eugenio, and Hitzeman (2004) shows how a computational approach can be useful in such cases. In discussing the Centering model of discourse anaphora (Grosz, Joshi, & Weinstein, 1995), these authors demonstrated the extent to which the underlying assumptions of psycholinguistic models need to be explicated. Psycholinguistic experimentation (e.g., Brennan, 1995; Gordon & Hendrick, 1999; Gordon, Hendrick, Ledoux, & Yang, 1999) has suggested that the preference for a pronoun over a name in both production and comprehension is affected by factors such as the salience of the antecedent and the utterance in which it occurred. However, the notions of ‘‘salience'' and ‘‘utterance'' have remained vague, giving rise to several parameters in the Centering model. Poesio and colleagues argued that while ‘‘the best way to test such preferences is through behavioral experiments,'' this is in practice difficult because of ‘‘the enormous number of possible ways of setting the theory's parameters'' (Poesio et al., 2004, p. 310). They therefore set about testing several of these alternative parameter settings computationally, using an annotated corpus to compare different versions of the theory, and explicitly formalizing several hitherto underspecified parameters in the process.
Even though there is undoubted scope for increased collaboration between practitioners in the psycholinguistic and computational camps, we shall see that this requires a re-evaluation of the goals that reg algorithms are designed to achieve, as well as a different focus in psycholinguistic studies.
4.1. Goals of computational algorithms and their relevance for psycholinguistics
As we have seen, the exact goal of reg algorithms, as these are presented in the literature, is often unclear. Dale and Reiter's (1995) Incremental Algorithm is a good example. On the one hand, the authors argued that one way of creating computational models is to ‘‘determine how speakers generate texts and build an algorithm based on these observations (the Incremental Algorithm Interpretation)’’ (Dale & Reiter, 1995, p. 252) and, consequently, their Incremental Algorithm is often understood as aiming to produce referring expressions that resemble those that speakers produce. Yet they state: ‘‘The argument can be made that psychological realism is not the most important consideration for developing algorithms for embodiment in computational systems; in the current context, the goal of such algorithms should be to produce referring expressions that human hearers will understand, rather than referring expressions that human speakers would utter’’ (Dale & Reiter, 1995, p. 253). The ambiguity of their goal also shines through when they write: ‘‘The fact (for example) that human speakers include redundant modifiers in referring expressions does not mean that natural language generation systems are also required to include such modifiers; there is nothing in principle wrong with building generation systems that perform more optimizations of their output than human speakers. On the other hand, if such beyond human-speaker optimizations are computationally expensive and require complex algorithms, they may not be worth performing; they are clearly unnecessary in some sense, after all, since human speakers do not perform them’’ (Dale & Reiter, 1995, p. 253).
The ambiguities surrounding the aim of reg models raise significant problems for evaluating such models. The goal of humanlikeness would call for comparison against corpora or against the results of language production experiments (e.g., van Deemter et al., in press; Gatt & Belz, 2010; Jordan & Walker, 2005; van der Sluis & Krahmer, 2007; Viethen & Dale, 2007). By contrast, the goal of producing expressions that are easiest to understand (Paraboni et al., 2007) would tend to make reading (e.g., self-paced reading or eye movements during reading; Almor, 2000; Garrod, Freudenthal, & Boyle, 1994; Gordon, Grosz, & Gilliom, 1993) or auditory language comprehension paradigms (e.g., recording of eye movements while people identify objects in a visual scene; Brown-Schmidt, 2009; Sedivy, Tanenhaus, Chambers, & Carlson, 1999) the evaluation methods of choice.
4.2. The psychological reality of computational algorithms
Even granted that reg algorithms do not seek to model the actual language production process, but only its output, there are aspects of these algorithms that make them psychologically implausible. Perhaps the most striking property of most computational algorithms that is problematic from a psycholinguistic point of view is their determinism: They always generate the same referring expression in a particular situation or condition. For example, in a situation where there is no other object of the same category as the target object (say, a single car), most algorithmic models either always generate minimally specified expressions (the car) or always generate overspecified expressions (e.g., the red car); given this specific situation, they would not generate a minimally specified expression in some cases and an overspecified expression in others. This contrasts with the results of experiments with human speakers, who produce various types of referring expressions within a single condition. For example, Pechmann (1989) showed that across different speakers, both minimally specified referring expressions (on 21% of experimental trials) and overspecified expressions (on 75% of trials) were produced when there was no other object of the same category (e.g., only a single car), while underspecified expressions were chosen on a small percentage (4%) of trials. Very similar non-deterministic results were observed by Engelhardt et al. (2006), while Dale and Viethen (2010) showed that even when referring to simple objects in simple scenes, different speakers used a large variety of referring expressions to refer to the same object, and the same speaker was likely to vary his or her choice of referring expression considerably in very similar (or even isomorphic) scenarios.
Inter-person variability is not a feature of (spoken or written) language alone: De Ruiter and colleagues, for example, report a large amount of variation between subjects in terms of the type and role of their gestures (de Ruiter, Bangerter, & Dings, 2012). Explanations of inter-person variation are not difficult to think of. Variability may be partly explained by children's exposure to different stimuli—compare Matthews, Butcher, Lieven, and Tomasello (2012), in this journal issue, for relevant experiments. Yet theoretical models struggle to give variation a natural place; at best, they offer a many-to-many relationship between contents and forms, allowing, in particular, that a given content can be expressed through different forms. A good example is Gundel's reformulated givenness hierarchy, which specifically allows different types of referring expressions to be associated with each level in the hierarchy (Gundel, Hedberg, & Zacharski, 2012).
It is important to note that even probabilistic reg algorithms (such as the systems described and evaluated in Gatt & Belz, 2010) are usually deterministic. These models typically use a probability distribution learned from training data to return the most probable referring expression given a particular situation; hence, their output in isomorphic situations will always be the same. An exception is the probabilistic model proposed by Fabbrizio, Stent, and Bangalore (2008), which incorporates individual preferences for particular referring expressions, thereby altering the type of expression it generates depending on the preferences of individual speakers. Another non-deterministic model was proposed by Dale and Viethen (2010), who used different algorithms to mimic different speakers and found that this increased the correlation between the model and human responses as compared to a deterministic model. For a specific speaker, however, the output of both these models remains deterministic; that is, it is assumed that a single speaker always produces the same referring expression in a particular situation. The results of experimental studies are normally reported averaged across participants, so they do not reveal whether individual human speakers are deterministic. However, closer examination of the data of individual participants of almost any study reveals that their responses vary substantially, even within a single experimental condition. For example, we examined the data of Fukumura and van Gompel (2010), who conducted experiments that investigated the choice between a pronoun and a name for referring to a previously mentioned discourse entity. The clear majority (79%) of participants in their two main experiments behaved non-deterministically; that is, they produced more than one type of referring expression (i.e., both a pronoun and a name) in at least one of the conditions.
Indeed, variability in many aspects of individual behavior seems to be the rule rather than the exception. A classic example comes from ballistic research around 1900, which observed that the bullets of a skilled target shooter do not always hit the target, but pile up close to the bull's eye, with fewer and fewer strikes further away from it, giving rise to a bell-shaped probability distribution (Holden & Van Orden, 2009). Linguists have long known that language use is variable as well. Sociolinguists, for example, believe that language change and social register (e.g., idiolects associated with different social strata) cause a phenomenon known as diglossia, where different grammars are represented in the head of a single individual at the same time (Kroch, 2000). For each utterance, the individual is thought to ‘‘choose'' between different grammars, where the probability of choosing a given grammar is affected by the recent history of the individual. The link with reference was made in Gibbs and Van Orden (2012), who discuss variability in speakers' pragmatic choices, including the choice of how to refer (e.g., whether to express privileged information, cf. Horton & Keysar, 1996), proposing to explain these by assuming that ‘‘the bases of any particular utterance (...) are contingencies, which are to an underappreciated extent the products of idiosyncrasy in history, disposition, and situation'' (Holden & Van Orden, 2009).
The idea that human responses may best be viewed as non-deterministic, even within a single speaker, suggests that non-determinism should be an important property of a psychologically realistic algorithm. One approach to modeling non-determinism is exemplified by so-called roulette-wheel generation models (Belz, 2007). Rather than always generating the same, most probable output given a specific input, these models sample alternatives from a non-uniform distribution, returning outputs in proportion to their likelihood. To our knowledge, models of this kind have not yet been exploited for generating referring expressions; however, this may be a promising way to incorporate non-determinism. Another possibility would be to turn current deterministic algorithms for the generation of referring expressions into non-deterministic ones. Regardless of which approach is chosen, the goal should be to make quantitative and testable predictions.
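The contrast between deterministic and roulette-wheel selection can be sketched in a few lines. The candidate descriptions and their probabilities below are purely illustrative (loosely echoing Pechmann's proportions), not taken from any published system:

```python
import random

# Hypothetical distribution over candidate descriptions for one scene,
# e.g. as might be estimated from a corpus. Forms and probabilities
# are illustrative assumptions only.
candidates = {
    "the cup": 0.21,        # minimally specified
    "the black cup": 0.75,  # overspecified
    "the black one": 0.04,  # other
}

def most_probable(dist):
    """Deterministic choice: always return the single most likely form."""
    return max(dist, key=dist.get)

def roulette_wheel(dist, rng=random):
    """Non-deterministic choice: sample a form in proportion to its probability."""
    forms, probs = zip(*dist.items())
    return rng.choices(forms, weights=probs, k=1)[0]

print(most_probable(candidates))   # always "the black cup"
print(roulette_wheel(candidates))  # varies from run to run
```

Over many runs, `roulette_wheel` reproduces the variation in the distribution, whereas `most_probable` collapses it to a single output.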
Consider the Incremental Algorithm (ia) once again (Dale & Reiter, 1995). The original, deterministic version always generates the cup to refer to a black cup in the presence of a blue ashtray and a yellow candle. The reason is that it assumes a fixed preference order, causing it to check the category of the object (cup) before its color (black), and since cup rules out both distractors, color is never tried. As we have seen, research by Pechmann (1989) suggests that speakers do produce overspecified expressions such as the black cup in this situation. To account for this, the ia could be revised slightly, so that color is selected first when it is a discriminating feature. But this would still not fully account for Pechmann's (1989) results, because he showed that although overspecified expressions (e.g., the black cup) are produced most frequently, minimally specified expressions are produced on one quarter of the trials. To account for this, the ia would need to incorporate some form of non-determinism. One possibility would be to include a random process by which the algorithm checks color before type three-quarters of the time and type before color in the remaining quarter (both across speakers and within a single speaker).
If we assume that the decision about which property is checked first is a probabilistic, non-deterministic process, then the algorithm makes interesting predictions that are relevant to psycholinguists. For example, a non-deterministic version of the Incremental Algorithm makes exact, quantitative predictions about when overspecification occurs. Although several psycholinguistic studies have shown that overspecification is common, it remains unclear under exactly what conditions it occurs, and psycholinguistic models do not make clear predictions concerning this issue. We therefore believe that the algorithm provides an important step toward a better understanding of the possible psychological mechanisms involved in overspecification.
To make this more concrete, assume that when referring to a small black cup in the context of a large white cup and a large red cup (so color or size can be used to uniquely characterize the target), speakers produce the black cup four times more often than the small cup. In that case, there is an 80%-20% color-size preference (ignoring, for the sake of argument, possible overspecified expressions like the small black cup). According to the non-deterministic version of the Incremental Algorithm, this pattern arises because speakers first check color in 80% of cases, whereas they first check size in 20% of cases (and the category cup is obligatorily added, because the black one or the small one sounds awkward). Once we have determined the color-over-size preference, we can predict how often overspecification occurs in other situations. When referring to a small black cup in the context of a large white cup and a large black cup (i.e., only size is required to produce a distinguishing description), the algorithm initially chooses color over size in 80% of cases, but because this does not uniquely identify the target, it subsequently adds size, resulting in an overspecified expression (the small black cup) in 80% of cases. In the other 20%, it first checks size, and because this uniquely identifies the target, color will not be added. An 80%-20% split is also predicted to occur when the same target (a small black cup) occurs in a context with a small white cup and a large white cup (so color is required). In 80% of cases, color is checked first, and because this uniquely identifies the target, the algorithm produces the black cup. In the other 20%, size is selected first, but because it does not uniquely identify the target, color is added, resulting in the small black cup. Thus, the algorithm makes clear quantitative predictions that arise from the fact that color is usually checked before size.
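A minimal sketch of such a non-deterministic ia follows. The entity representations, attribute names, and the 80% color-first probability are illustrative assumptions used to mirror the reasoning above, not part of any published algorithm:

```python
import random

# Illustrative target: a small black cup. Entities are attribute -> value dicts.
TARGET = {"type": "cup", "color": "black", "size": "small"}

def nondet_ia(target, distractors, p_color_first=0.8, rng=random):
    """Incremental Algorithm with a probabilistic preference order:
    'type' is always included; color is checked before size with
    probability p_color_first, and size before color otherwise.
    An attribute is included only if it rules out a remaining distractor."""
    order = ["color", "size"] if rng.random() < p_color_first else ["size", "color"]
    chosen = {"type": target["type"]}
    remaining = [d for d in distractors if d["type"] == target["type"]]
    for attr in order:
        if not remaining:
            break
        if any(d[attr] != target[attr] for d in remaining):
            chosen[attr] = target[attr]
            remaining = [d for d in remaining if d[attr] == target[attr]]
    return chosen

# Context where only size is strictly required: both distractors are large.
distractors = [
    {"type": "cup", "color": "white", "size": "large"},
    {"type": "cup", "color": "black", "size": "large"},
]
print(nondet_ia(TARGET, distractors))
```

Run repeatedly on this context, roughly 80% of outputs include both color and size (the small black cup, overspecified) and roughly 20% only size (the small cup), reproducing the predicted split.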
These predictions can be tested in psycholinguistic studies (Gatt, van Gompel, Krahmer, & van Deemter, 2011).
Other algorithms can be made non-deterministic in similar ways. For example, the Greedy Algorithm (Dale, 1989) iteratively selects the property which rules out most distractors from among those that have not yet been ruled out by the properties selected so far. But ties can occur, where two or more properties rule out the same number of distractors, and in this case the choice could be made probabilistic. When referring to a small black cup in the context of a large white cup and a large red cup, for example, the revised Greedy Algorithm might produce the black cup 80% of the time and the small cup 20% of the time (assuming, as before, that the category—cup—is always added given that omission of the category sounds awkward). One interesting prediction is that the same 80%-20% color-over-size preference should occur when referring to e1 in Table 2, even though a priori the color of the target rules out more distractors (e2 and e3) than its size (e2). The reason is that the Greedy Algorithm first selects the property with spoon, because this rules out most distractors (e3, e4, and e5). Next, only the distractor e2 remains, which can be removed by either color or size, resulting in a tie. Assuming an 80%-20% color-over-size preference, people should produce the black cup with the spoon in 80% of cases and the small cup with the spoon in 20% of cases. We believe this is a striking new prediction, especially if color is always chosen over size when it rules out more distractors. If this prediction were to be confirmed, this would provide initial support for the idea that human speakers first select the property that rules out most distractors.
Table 2. Another referential domain
[Figure: The Incremental Algorithm.]
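The tie-breaking idea can be sketched as follows. Because Table 2 itself is not reproduced here, the domain below is a hypothetical reconstruction consistent with the description in the text (with spoon rules out e3, e4, and e5; color rules out e2 and e3; size rules out only e2), and the 80%-20% tie weights are assumptions:

```python
import random

# Hypothetical reconstruction of the Table 2 domain.
TARGET = {"type": "cup", "color": "black", "size": "small", "spoon": True}
DISTRACTORS = {
    "e2": {"type": "cup", "color": "white", "size": "large", "spoon": True},
    "e3": {"type": "cup", "color": "white", "size": "small", "spoon": False},
    "e4": {"type": "cup", "color": "black", "size": "small", "spoon": False},
    "e5": {"type": "cup", "color": "black", "size": "small", "spoon": False},
}
# Illustrative tie-breaking weights: the 80%-20% color-over-size preference.
TIE_WEIGHTS = {"color": 0.8, "size": 0.2}

def greedy(target, distractors, rng=random):
    """Greedy Algorithm: repeatedly select the attribute that rules out the
    most remaining distractors; break ties probabilistically."""
    chosen = ["type"]  # the category is always realized (cf. the text)
    remaining = dict(distractors)
    attrs = ["color", "size", "spoon"]
    while remaining:
        power = {a: sum(1 for d in remaining.values() if d[a] != target[a])
                 for a in attrs if a not in chosen}
        best = max(power.values())
        if best == 0:  # no attribute discriminates any further
            break
        tied = [a for a, n in power.items() if n == best]
        pick = rng.choices(tied, weights=[TIE_WEIGHTS.get(a, 1.0) for a in tied])[0]
        chosen.append(pick)
        remaining = {k: d for k, d in remaining.items() if d[pick] == target[pick]}
    return chosen

print(greedy(TARGET, DISTRACTORS))
```

On this domain, spoon is always selected first (it rules out three distractors); the remaining tie between color and size is then resolved 80%-20%, yielding the black cup with the spoon in about four out of five runs and the small cup with the spoon otherwise.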
Existing metrics of ‘‘humanlikeness’’ (van Deemter et al., in press; Gupta & Stent, 2005; Jordan & Walker, 2005; Viethen & Dale, 2007) were not designed to measure the extent to which an algorithm reflects the variation in a corpus. This is most easily seen in connection with variation between speakers. Consider a simple example involving just one reference task, to which just two speakers are exposed: Speaker a utters np1 and speaker b utters np2. Suppose these two nps are as different as they can be, so it is impossible to match (i.e., resemble) both of them. Now consider two algorithms, each of which is run twice. One algorithm generates the two human-produced nps, while the other behaves deterministically, generating np1 on both runs:
Speaker a: np1. Speaker b: np2.
Algorithm 1: np1 (first run); np2 (second run).
Algorithm 2: np1 (first run); np1 (second run).
Intuitively speaking, Algorithm 1 captures the variation among the two speakers much better than Algorithm 2. Existing evaluation metrics, however, attribute the same score to both algorithms, because these metrics compute the extent to which the descriptions generated by a given algorithm match the speaker-generated descriptions on average, comparing each generated description with each human-produced description. Since both of the descriptions, np1 and np2, match one human-generated description fully (leading to a score of 1) while failing to match the other one entirely (scoring 0), both algorithms end up with the same averaged score of 0.5. The development and deployment of metrics that are able to measure the variation among speakers (or, indeed, within a speaker) is an example of the way in which computational research will need to change if a more psycholinguistic angle on language production is adopted.
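The scoring problem can be made concrete with a small sketch. The exact-match scoring function below is a deliberate simplification standing in for corpus-based humanlikeness metrics; the np labels are the placeholders from the example above:

```python
from itertools import product

def mean_match(generated, human):
    """Average pairwise exact-match score: compare every generated
    description with every human-produced one (1 for a match, 0 otherwise)."""
    pairs = list(product(generated, human))
    return sum(1 for g, h in pairs if g == h) / len(pairs)

human = ["np1", "np2"]       # Speaker a and Speaker b
algorithm1 = ["np1", "np2"]  # reproduces the human variation
algorithm2 = ["np1", "np1"]  # deterministic: the same np on both runs

print(mean_match(algorithm1, human))  # 0.5
print(mean_match(algorithm2, human))  # 0.5 -- the metric cannot tell them apart
```

Both algorithms receive 0.5 because each generated description matches one human description (score 1) and mismatches the other (score 0); the averaging step discards exactly the distributional information that distinguishes them.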
Although it is the ultimate aim of reg to produce referring noun phrases, most current reg algorithms are limited to content determination: They determine which properties are expressed, but they seldom determine how and in which order these are realized, leaving the decision between, for example, the big red car and the red big car to a generic and independent realization algorithm (pace Khan et al., 2012; Siddharthan & Copestake, 2004). But if algorithms are to offer full models of reference, they will need to address linguistic realization in its full generality. Psycholinguists have long suggested that incrementality plays an important role here as well. Pechmann (1989), for example, observed that participants often realized color before they realized size, even though this is not the preferred word order, and argued that this is because color, not being a relative property, is recognized faster than size. Recently, this idea has received support from a study by Brown-Schmidt and Tanenhaus (2006), who showed that the timing of speakers' fixations to a distractor (e.g., a large triangle) predicted whether they produced a prenominal adjective (the small triangle) or a repair following the noun (the triangle…the small one). Furthermore, fixations to the distractor were made earlier when speakers produced a prenominal adjective than a postnominal modifying phrase (the triangle with the small squares in the context of a triangle with large squares). This supports the idea that information that is fixated and accessed first is encoded first in the referring expression. Future algorithms could incorporate these findings, for example, by making sure that highly salient properties that are rapidly encoded are realized earlier in referring expressions than less salient properties.
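One way such a salience-driven realizer might be sketched is shown below. The accessibility scores and the tiny lexicon are hypothetical placeholders (color scored higher than size, following Pechmann's observation), and a full realizer would of course also handle postnominal phrases, agreement, and so on:

```python
# Hypothetical accessibility scores: higher = encoded faster.
# The values are illustrative assumptions only.
ACCESSIBILITY = {"color": 0.9, "size": 0.4}

# Toy lexicon mapping selected properties to words (placeholder entries).
WORDS = {"color": "black", "size": "small", "type": "cup"}

def realize(properties):
    """Order prenominal modifiers by accessibility (fastest-encoded first),
    then append the head noun."""
    mods = sorted((p for p in properties if p != "type"),
                  key=lambda p: ACCESSIBILITY.get(p, 0.0), reverse=True)
    return "the " + " ".join(WORDS[m] for m in mods) + " " + WORDS["type"]

print(realize(["type", "size", "color"]))  # -> "the black small cup"
```

Note that ordering by accessibility here yields the black small cup, the non-preferred adjective order that Pechmann's speakers sometimes produced, which is exactly the kind of behavior a purely grammar-driven realizer would never generate.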
Finally, as was already pointed out by Dale and Reiter (1995), some algorithms are unlikely to be psychologically realistic, because they are computationally very costly to run when there is a large number of distractors. Under such conditions, which are quite typical of real-life situations, such algorithms would thus be very slow. Given the limited processing capacity of humans and their reliance on ‘‘fast and frugal’’ heuristics rather than exact calculations (Gigerenzer & Goldstein, 1996; Simon, 1956; Tversky & Kahneman, 1982), it seems likely that human processing mechanisms use clever shortcuts that will need to be incorporated into future algorithms.