Utility-Based Generation of Referring Expressions


Correspondence should be sent to Markus Guhe, School of Informatics, Informatics Forum, University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, Scotland. E-mail: m.guhe@ed.ac.uk


This paper presents two cognitive models that simulate the production of referring expressions in the iMAP task, a task-oriented dialog. One general model is based on Dale and Reiter’s (1995) incremental algorithm, and the other is a simple template model that has a higher correlation with the data but is specifically geared toward the properties of the iMAP task. The property of the iMAP task environment that is modeled here is that the color feature is unreliable for identifying referents while other features are reliable. The low computational cost of the incremental algorithm for generating referring expressions makes it an interesting starting point for a cognitive model. However, its explanatory power is limited, because it generates uniquely distinguishing referring expressions and because it considers features for inclusion in the referring expression in a fixed order. The first model extends the original incremental algorithm by an ability to adapt to feedback of whether a referring expression was used successfully, but it seems to overpredict the frequency with which distinguishing expressions are made and underpredict the frequency of overspecified referring expressions. The second model produces features for referring expressions purely based on its current estimate of a feature’s utility. Both models predict the observed human behavior of decreasing use of color terms and increasing use of useful feature terms.

1. Introduction

A central aspect of human communication is the ability to refer. Much has been written about how and which referring expressions are produced and understood, but most research focuses on (a) previous interactions between the interlocutors, in particular the dialog history, and considers the levels of referring expressions for well-known objects (loafer vs. shoe), (b) first mentions, where previous acts of referring are not considered at all, or (c) general mechanisms of human cognition, like the perceived salience of the referent’s features, which usually deals with adjectival modification (red vs. large) of fairly simple concrete nouns.

This paper relates the choice of which features of an entity are used in a referring expression to the features’ usefulness in the environment in which the referring expressions are produced. More precisely, it shows that the interlocutors in a task-oriented dialog adapt their choice of features over time and that the different utilities of the referents’ features in the task environment drive this change in behavior. I will present two cognitive models that explain this phenomenon in terms of John Anderson’s (1990) rational analysis, which is an account of how a cognitive system optimizes its adaptation to its environment.

The effect of utility in determining referring expressions has thus far been suggested only with respect to the previous dialog, but it has not been linked to external factors. The current paper, therefore, connects a newly observed phenomenon in dialog (adaptation of feature choice) to a well-known mechanism that has already been used to explain many other behavioral observations (rational analysis).

1.1. Factors influencing the choice of referring expression and the role of the task environment

A large number of factors influences what referring expression a speaker produces. Some of the more important ones are whether it is the first or a repeated reference to the entity, the interaction history between the interlocutors, linguistic economy, and the other entities that the linguistic expression might be a reference to (often called “distractors”).

It is a well-known phenomenon that first references are often more elaborate than subsequent ones; for example, “the large white cup” can later be referred to as “the cup” or “it.” To avoid such effects of reduced expressions, much research investigates only first references. Computational approaches to the generation of referring expressions often even treat all referring expressions as first references and assume that the expressions generated this way will be good enough for most purposes.

When referring to the same entity after its first mention, that is, after it has been introduced into the dialog, speakers not only use reduced expressions but indeed show a preference for repeating expressions that were used previously. The interlocutors form conceptual pacts (Brennan & Clark, 1996; Clark & Wilkes-Gibbs, 1986).

A question that has hardly been addressed at all is how the acts of referring themselves change over time. Conceptual pacts, for example, explain how references to the same entity (do not) change, but they do not offer an explanation of how the overall use of features across referring expressions changes over the course of many interactions. The main work on how referring changes over time in the same task environment has been done by Garrod and Anderson (1987) and by Garrod and Doherty (1994). They describe how a community of speakers establishes a sub-language in referring to entities, that is, how language use changes over the course of subsequent interactions. Their approach is similar to the one put forward here in that they propose an explanation in terms of utility, which is a measure of the successes of previous interactions between the interlocutors, but this utility is not linked to how useful a feature is for distinguishing one entity from another in the given task environment. Their main finding is that dialog partners settle on their own idiosyncratic sub-language. This work has been integrated by Pickering and Garrod (2004) into the more general framework of alignment in dialog, that is, how speakers adapt to each other. These observations will be important in the second model, presented in Section 4.

There is some evidence that extra-linguistic factors play a role in generating referring expressions. For example, the presence of a second character influences whether speakers use a pronoun or the character’s name for references following the introductory mention (Arnold & Griffin, 2007). This is true even if the displayed characters differ in gender, so that the name does not disambiguate any more than the pronoun. The authors argue that the reason for this behavior lies in the speakers’ cognitive load while producing the referring expression, that is, the speakers’ limited cognitive capacities cause the effect and not a property of the task environment.

Taking aspects of cognitive load into account is part of another strand of findings in which the cooperative view on dialog (e.g., Clark, 1996) is contrasted with a speaker-oriented view (e.g., Bard et al., 2000). In both views, the speaker makes the general assumption that what he/she knows is shared knowledge. The cooperative view assumes that every utterance is followed by an acceptance or a rejection, which helps the dialog proceed on the basis of common ground, that is, that both speakers ensure that they agree on their shared knowledge and assumptions (Clark, 1996; Steedman, 2000). According to the speaker-oriented view, the speaker only adapts to the listener’s needs if problems arise in the dialog, for example, by explicit feedback from the listener. This is consistent with the observation that speakers sometimes produce overspecified referring expressions (Dale & Reiter, 1995; Paraboni, van Deemter, & Masthoff, 2007; Pechmann, 1989): Such expressions will not prevent the listener from identifying the target object, and they let the speaker profit from a simple generation process. Since speaker and listener both benefit, the communicative strategy cannot be attributed uniquely to concerns for the listener’s needs.

Overall, therefore, the existing research on referring expressions addresses a whole host of factors that clearly influence which expression a speaker chooses in a given situation. What has not yet been addressed is that the given environment is also such an influence. In this paper, I will make this link to the environment.

1.2. Utility-based selection probability

To establish the link between task environment and the use of features in referring expressions, I will present computational cognitive models that are based on John Anderson’s theory of rational analysis. Rational analysis of cognitive systems states that the “cognitive system operates at all times to optimize the adaptation of the behavior of the organism” (J. Anderson, 1990, p. 28; see also J. Anderson & Schooler, 1991). This is true not only on an evolutionary scale but also for an individual’s adaptations by learning from interactions with the environment. Memories with a high utility are recalled faster and more often than memories with a low utility. Rational analysis thus predicts that a speaker producing a referring expression will recall with a higher probability (more easily and faster) those memories that previously contributed to a successful interaction with the dialog partner (who is part of the environment from the speaker’s point of view). It is also a main mechanism establishing the sub-languages of Garrod and colleagues: Successful decisions about which linguistic means to use are reinforced by feedback from the interlocutors, which causes these means to be used more often, first within the dialog, then within a community of speakers.

Rational analysis is a cornerstone of Anderson’s ACT-R theory, which is the theoretical framework for the models presented below. Available as an implemented cognitive architecture, ACT-R facilitates creation and testing of cognitive models. Here, I use ACT-R 6.0, described in Anderson (2007).

ACT-R is a hybrid cognitive architecture: a production system with a subsymbolic layer. Procedural knowledge is encoded as production rules, or productions, which are if–then rules: If a set of conditions is met, then a specified action is executed. To decide which production to fire in the current situation, ACT-R creates the conflict set (consisting of all productions whose conditions are met), determines their selection probabilities (how likely a production is to be successful in comparison to the other productions in the conflict set), and selects a production according to these probabilities. The selection probability is computed from the productions’ utility values. Utilities are a measure of how often a production was fired (used) successfully in the past (here: how often it contributed to producing a successful description of the landmark to the interlocutor). Selection probabilities are computed as:


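$$P_i = \frac{e^{U_i/(\sqrt{2}\,s)}}{\sum_j e^{U_j/(\sqrt{2}\,s)}}$$

where P_i is the selection probability for production i, U_i is the utility of production i, s is the noise in the utilities (s is a free parameter with a default value of 1), and j ranges over all applicable productions (including i). The equation is given here in the form used by ACT-R 6.0. Thus, selection probabilities are ratios of the exponentiated, noise-scaled utility of production i and the sum of the exponentiated, noise-scaled utilities of all currently applicable productions (including production i).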
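The selection step can be illustrated with a minimal sketch. It assumes the ACT-R 6.0 form of the conflict-resolution equation, in which each utility is divided by √2·s and exponentiated before normalization; the function name, the production names, and the utility values are hypothetical and are not taken from the models described in this paper.

```python
import math


def selection_probabilities(utilities, s=1.0):
    """Map production names to selection probabilities.

    Assumes the ACT-R 6.0 form P_i = exp(U_i / (sqrt(2) * s)) / sum_j exp(U_j / (sqrt(2) * s)).
    """
    scale = math.sqrt(2.0) * s
    # Subtracting the largest utility before exponentiating avoids overflow
    # and leaves the resulting ratios unchanged.
    u_max = max(utilities.values())
    weights = {name: math.exp((u - u_max) / scale) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Hypothetical conflict set with two competing productions.
probs = selection_probabilities({"try-color": 2.0, "try-useful-feature": 1.0})
for name, p in probs.items():
    print(f"{name}: {p:.3f}")
```

With equal utilities, every production in the conflict set is equally likely to fire; a larger s flattens the distribution, and a smaller s makes the choice more nearly deterministic.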

Utility values are learned over time. After a production has been used, its utility is updated according to the following equation:


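$$U_i(n) = U_i(n-1) + \alpha\,[R_i(n) - U_i(n-1)]$$

where U_i(n) is the utility of production i after its nth application, n is the number of applications of the production, α is the learning rate, and R_i(n) is the reward that production i receives for its nth application. The equation is given here in the form used by ACT-R 6.0.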

If the production is applied successfully, its utility is updated with a positive reward; if it is unsuccessful, its utility is updated with a negative one. Anderson (2007, p. 161) points out that this is basically the Rescorla-Wagner learning rule (Rescorla & Wagner, 1972) or the delta rule by Widrow and Hoff (1960). So there is nothing particularly “ACT-R-ish” about this rule; it is a general learning rule (cf. also Sutton & Barto, 1998).
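The effect of repeated rewards can be sketched in a few lines. The sketch assumes the delta-rule form of the update, in which the utility moves a fraction α of the way toward the reward on each application; the learning rate of 0.2 and the reward values of +1 and −1 are illustrative assumptions, not the parameter settings of the models presented here.

```python
def update_utility(utility, reward, alpha=0.2):
    """One delta-rule step: move the utility a fraction alpha toward the reward."""
    return utility + alpha * (reward - utility)


def learn(initial_utility, rewards, alpha=0.2):
    """Apply the update once per reward and return the whole utility trajectory."""
    trajectory = [initial_utility]
    for reward in rewards:
        trajectory.append(update_utility(trajectory[-1], reward, alpha))
    return trajectory


# Hypothetical reward sequences: one production that always succeeds (+1)
# and one that always fails (-1), both starting from a utility of 0.
rising = learn(0.0, [1.0] * 10)
falling = learn(0.0, [-1.0] * 10)
print([round(u, 3) for u in rising])
print([round(u, 3) for u in falling])
```

Under this rule the utility approaches the reward value asymptotically, so recent outcomes count for more than earlier ones.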

In addition to such procedural knowledge, ACT-R contains a number of other modules and mechanisms, details of which can be found in Anderson (2005, 2007) and on the ACT-R website (http://actr.psy.cmu.edu/). Generally speaking, like all cognitive architectures, ACT-R distinguishes between procedural and declarative knowledge. Thus, in addition to the procedural module, which uses the mechanisms outlined above, ACT-R’s second core component is the declarative memory module. For the interaction with the outside world, ACT-R has the visual, aural, manual, and vocal modules, and it has the goal buffer and the imaginal buffer for storing the current goal and current state, respectively.

By using the procedural module, I am assuming that producing referring expressions is a well-learned task, that the models do not learn new procedural knowledge, and that the existing knowledge is only adapted to the demands of the task (environment). Taking an extreme position, it could be argued that basically all linguistic expressions need to contain a referring expression in order to specify what the assertion is “about.” While I hasten to emphasize that I do not make this claim, it seems obvious that referring expressions occur so frequently that it is reasonable to assume that producing them is a well-learned skill.

With respect to the theory of rational analysis, however, not much depends on this choice. The straightforward alternative is to produce referring expressions from declarative instead of procedural knowledge. The main difference is that instead of the selection probability of a production, the models would be based on retrieval probabilities of declarative memory chunks. A related approach would use ACT-R’s instance-based learning mechanisms (Gonzalez, Lerch, & Lebiere, 2003; Taatgen & Wallach, 2002), which explain how people acquire new skills, and a case could be made that doing the task described below involves skill acquisition. However, all these learning mechanisms are based on rational analysis, even though their realization in the models themselves would be somewhat different, and they therefore do not change the substance of the argument.

Using ACT-R has the usual benefit of a unified cognitive theory: the models are immediately related to a wide body of research. It is a means of integrating research from different fields, in this particular case computational linguistics and psycholinguistics. A unified theory of cognition as it was envisioned by Allen Newell (1973, 1990; early versions were developed with Herbert Simon, for example, in Newell & Simon, 1963), thus, is a tool for providing a bridge: It creates computational models of psychological phenomena. However, the models are not just isolated, one-off models. Newell argued that a collection of fragmented accounts (models, empirical findings, theories) of parts of human cognition is not a satisfactory scientific theory. Instead, the fragments need to be integrated into unified theories. (He did not think there could only be one.) The most important advantage of this approach is that it allows for an explanation of new phenomena in known terms. This is also the aim of the research presented here: to explain changes in the referring expressions that speakers choose in dialog with the well-understood mechanism of utility-based selection probability.

2. The iMAP map task

The iMAP (intelligent Map Task Agent Project) task is a modified Map Task (A. H. Anderson et al., 1991). The Map Task is an unscripted, task-oriented dialog in which an Instruction Giver and an Instruction Follower each have a map of the same fictional location. The task is to reproduce a route on the Instruction Follower’s map that initially exists only on the Instruction Giver’s map; see Fig. 1. The task offers a much more natural setting than standard experiments in which participants produce a referring expression for a fixed display, which, after the expression has been produced, is replaced by the next display and participants receive no feedback about an expression’s success. Thus, in such experiments the participants cannot learn from past successes or failures.

Figure 1. Map pair from the iMAP Map Task corpus. Left map, Instruction Giver; right map, Instruction Follower.

2.1. Materials

Cartoon landmarks were the main means of identifying and describing the route. Some landmarks differed between the two maps. In the iMAP task they could differ by:

  • 1 Mismatching in a feature between the two maps (most notably color).
  • 2 Being absent on one of the maps or present on both.
  • 3 Being affected or not by “ink damage” that covered about half the landmarks on the Instruction Follower’s map and that obscured their color (but all other features could be clearly identified).

In addition to the unreliable color feature, which is present on all maps, each map also had one feature with which the landmarks could be distinguished reliably and which was one of:

  • 1 Number (for maps containing bugs or trees as landmarks); cf. Fig. 1.
  • 2 Pattern (fish, cars).
  • 3 Kind (birds, buildings).
  • 4 Shape (aliens, traffic signs).

There were three experimental variables:

  • 1 Homogeneity: whether the landmarks on a map were of just one kind, for example, bugs, or whether the landmarks were of different kinds.
  • 2 Orderliness: whether the ink blot on the Instruction Follower’s map obscured a contiguous stretch of the route (orderly) or a non-contiguous stretch (disorderly); Fig. 1 shows a disorderly map.
  • 3 Animacy: whether the landmarks on a map were animate or inanimate (thus, on the non-homogeneous maps there were only landmarks from the four inanimate or the four animate kinds of landmarks).

2.2. Procedure

The participants were told that the maps “are of the same location but drawn by different explorers”; but they were not told how or where the maps differ. They were instructed to re-create the route on the Instruction Follower’s map as accurately as possible. There was no time limit.

Each dyad completed two simple training maps and was then presented with all combinations of the 2 (homogeneity) × 2 (orderliness) × 2 (animacy) design; thus, each dyad completed eight dialogs, one for each type of landmark (bugs, trees, fish, cars, birds, buildings, aliens, traffic signs). As there were 4 different types of distinguishing feature (number, pattern, kind, shape), the corpus consists of 32 map pairs (64 maps) overall. To avoid effects of the sequence of map presentations, a Latin square design was used. After the fourth map/dialog, participants exchanged roles. There were 32 dyads.

To reduce the variability of referring expressions, before each dialog, each participant was prompted with printed labels to name a few landmarks that would occur on the next map. Landmarks on the maps themselves were not labeled.

2.3. Setup and data collection

Participants sat in front of individual computers, facing each other, but separated by a visual barrier. The communication was recorded using five camcorders. Eye gaze was recorded for Instruction Givers only, using a remote eye tracker. Speech was recorded on separate channels for Instruction Giver and Instruction Follower, each via their own headset microphone, and was later transcribed manually. The routes drawn by the Instruction Follower were recorded by the computer.

As participants were in the same room, each could hear the other speak and each could see the other’s upper torso via a video stream in the left half of their monitor. The right half of the monitor showed the map.

2.4. Participants and data coding

Sixty-four undergraduates participated for course credits. For the current analysis, all referring expressions in the transcribed dialogs were coded for use of color terms and for terms describing the landmark features (number, pattern, kind, shape). Data for other modalities are not considered here.

2.5. Color was not useful but other features were

The iMAP design restricted the utility of color in referring expressions: The color of half the landmarks on the Instruction Follower’s maps was obscured by “ink blots,” and colors of the other landmarks did not always match between the maps, for example, a green landmark could be orange on the other map. Thus, color was an unreliable distinguisher in this particular task environment—it could only be used successfully in 40% of the cases. This is contrary to the usual high degree of salience (and, thereby, utility) of color, which gives rise to the high frequency with which it is used in referring expressions.

A map’s reliable feature matched for 95% of the landmarks. It was used only for those landmarks for which it was useful, for example, number was only used to distinguish bugs and trees (Guhe & Bard, 2008a). To exclude effects of conceptual pacts and reduced expressions, only introductory (first) mentions of landmarks by the Instruction Giver are considered here.

To determine the effect of utility, Guhe and Bard (2008a,b) compared changes in the use of color terms over time with changes in the use of terms mentioning the reliable features. The data show that the use of the color feature decreases from ca. 0.6 of possible mentions to ca. 0.15 of possible mentions over the course of the eight dialogs. Rate of mention also decreases significantly within each of the eight dialogs; see Fig. 2 for the first dialog. Only color behaves this way (Guhe & Bard, 2008a); the reliable features (number, pattern, kind, shape) do not show this pattern. In fact, their rate of mention increases within each dialog.

Figure 2. Change in the rate of mention of color terms (limited utility) and useful feature terms (full utility) in the first dialog (map 1) of the iMAP experiment. The graph also shows the rate of referring expressions in which both features (color and the map’s useful feature) or no features are used. The simulations below use the first 29 landmarks that are introduced into the dialog by the Instruction Giver, because the test map contains 29 route-critical landmarks.

3. The incremental algorithm as cognitive model

This section and the following one present two cognitive models of producing referring expressions. The first model is based on the incremental algorithm of Dale and Reiter. It has the advantage that it is not limited to the iMAP task, but its correlation with the human data is limited. The second model is motivated by the observations of Garrod and colleagues: It assumes that the speakers settled on a way of referring to landmarks and decides which features to include in a referring expression purely on the basis of their utility. While it has a higher correlation with the human data, it is also a model specific to the task at hand; cf. Section 5 for further discussion of these issues.

3.1. The incremental algorithm

Dale and Reiter’s (1995) incremental algorithm for generating referring expressions has two main properties that make it an interesting starting point for a cognitive account of how people produce referring expressions: It has a low computational cost (it has linear complexity, that is, its run-time increases linearly with the number of potential referents), and it can produce non-minimal (overspecified) as well as non-distinguishing (underspecified) referring expressions; see the example below. Because of these properties, the algorithm is often used in computational approaches to referring expressions. It should be noted, however, that the authors did not claim this algorithm to be a cognitive algorithm in the sense of Steedman (1995), that is, an algorithm that describes a cognitive processing mechanism, even though they took some aspects of cognition for motivating parts of it, for example, a feature’s cognitive salience.

The incremental algorithm is a feature-selection algorithm: To describe the referent (the target object), it selects features that describe the referent but do not describe the distractors (the other possible referents in the given situation). By doing this, the algorithm generates a uniquely distinguishing referring expression, that is, a referring expression that refers to only one potential referent in the given situation.

The features are chosen according to an ordered preference list, for example, <color, number, size, orientation>. Starting with the most preferred feature, the algorithm determines its value for the referent and adds the feature’s value to the referring expression if it rules out one or more distractors. Suppose, for example, that the target object is the topmost group of bugs in Fig. 3; thus, the other two groups are the distractor set. The algorithm begins by determining red as the value of the referent’s color feature. Using red in the referring expression rules out the middle group, because it is purple, and the distractor is removed from the distractor set. The algorithm then takes the next feature (number) and determines its value (four). Because the remaining distractor has the value two for number, the algorithm selects the number feature as well. This leaves no distractors, that is, the algorithm has produced the uniquely distinguishing referring expression: four red bugs. (Bugs is added by default, because the incremental algorithm is geared toward producing noun phrases, which syntactically require a noun, which in most cases encodes an entity’s kind. Note that the cognitive model presented below only adds the kind feature for maps that require it. This property of the original algorithm can, of course, easily be incorporated.)

Figure 3. Example display for the incremental algorithm.

The example also demonstrates how the algorithm generates non-minimal referring expressions, because four bugs would be the minimal uniquely distinguishing referring expression. As color comes first in the preference list, it is always checked before number, even if it is not needed for making a uniquely distinguishing expression. Once the algorithm has selected a feature, it does not reverse its decision. This incremental mode of operation is what gives the algorithm its name and is the reason for its low (linear) complexity. The algorithm produces underspecified referring expressions if there are not enough features in the preference list to yield a distinguishing expression, for example, with a preference list <color, size>, the algorithm produces red bugs for the above example.
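The feature-selection loop described above can be sketched as follows. This is a simplified illustration and not Dale and Reiter’s full algorithm: it omits the default addition of the kind feature, and the function name is hypothetical. The display of Fig. 3 is not reproduced here, so several feature values are assumptions: the number of the purple group, the color of the group of two, and the sizes of all groups.

```python
def incremental_algorithm(target, distractors, preference_list):
    """Select features for the target in preference order.

    Returns the selected (feature, value) pairs and the distractors
    that the description fails to rule out.
    """
    remaining = list(distractors)
    description = []
    for feature in preference_list:
        if not remaining:
            break  # the description is already uniquely distinguishing
        value = target.get(feature)
        if value is None:
            continue  # the target has no value for this feature
        # Add the feature only if it rules out at least one distractor;
        # a feature that has been added is never removed again.
        if any(d.get(feature) != value for d in remaining):
            description.append((feature, value))
            remaining = [d for d in remaining if d.get(feature) == value]
    return description, remaining


# Assumed feature values for the three groups of bugs.
target = {"color": "red", "number": 4, "size": "small"}
distractors = [
    {"color": "purple", "number": 3, "size": "small"},  # middle group
    {"color": "red", "number": 2, "size": "small"},     # remaining group
]

full, left_full = incremental_algorithm(
    target, distractors, ["color", "number", "size", "orientation"]
)
short, left_short = incremental_algorithm(target, distractors, ["color", "size"])
print(full, left_full)
print(short, left_short)
```

With the longer preference list the sketch selects color and number, which corresponds to the overspecified expression four red bugs. With the list <color, size> it selects only color and leaves one distractor, which corresponds to the underspecified expression red bugs.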

3.2. Uniquely distinguishing expressions and the fixed preference list

The rationale for the order of features in the preference list is that perceptually salient features like color are preferred over less salient ones and that discrete features like number are preferred over relative features like size or orientation. The motivation for preferring salient features is that they are usually useful, because they are easy to identify for both speaker and listener. (It is, however, rather unclear how to determine the salience of a feature.) The motivation for preferring discrete features is that they are easier to process, because they do not require comparisons, which means they need fewer computations and are, therefore, more efficient to use.

In the original incremental algorithm, the ordering of features is simply assumed, so it is not based on an independent mechanism like rational analysis. Jordan and Walker (2005) used machine learning techniques to determine the order of features in the preference list. These algorithms already incorporate psychological findings like conceptual pacts, but they only provide global adaptations to properties of linguistic corpora and do not account for changes over time and for adaptations to the task environment. A basic version of this approach is used by Kelleher (2007), who orders the features in the preference list according to their frequency in a corpus.

The approach by Goudbeek, Krahmer, and Swerts (2009) is the one most similar to the one taken in this paper. The authors primed participants with preferred or dispreferred referring expressions, which they had established in an earlier experiment. They found that under some circumstances speakers do align to dispreferred referring expressions and suggest that algorithms for generating referring expressions should be made

more psychologically realistic by implementing dynamic instead of fixed preference lists. The order of the preference list can first be determined by preference (e.g., word frequency) and subsequently be adjusted based on contextual input. (p. 7)

The adaptive incremental model presented below is just such a model of dynamic, adaptive changes of selection preferences in the light of their context. The incremental algorithm has two aspects that make it an interesting point of departure for this paper: first, its goal of creating uniquely distinguishing expressions; second, the role of the fixed preference list. The models described below cast doubt on the assumption that people produce uniquely distinguishing expressions, even though more research addressing the issue is needed. They also show that in task-oriented dialogs a fixed preference list may be useful, but only if the choices are also sensitive to the features’ utilities in the task environment. The latter aspect will be explored with the second model.

3.3. Distractor sets

The incremental algorithm operates on distractor sets, that is, sets of objects from which the target object is distinguished. In typical psycholinguistic experiments, the number of objects that a participant has to consider is small enough so that all objects constitute the distractor set. In the iMAP task, however, there are up to 62 landmarks on one map, and it seems highly unlikely that referring expressions could ever be produced by comparing a proposed form against 61 other items. The model, therefore, uses the groups of landmarks shown in Fig. 4 as distractor sets. They are computed by the algorithm presented in Guhe (2007b), which is based on the proposal by Thórisson (1994). This algorithm first identifies the route-critical locations, which are locations where the route either passes in between two landmarks or close by one landmark. There are 16 such locations on the map shown in Fig. 4. Each of these locations forms the center of a distractor set, and each distractor set contains the landmarks that are

  • 1 Within a circle of a fixed radius around the current location and
  • 2 Not “hidden from view,” that is, there is no other landmark between this landmark and the current location.

Figure 4. Map from the iMAP corpus used for the simulations with the assumed distractor sets.

Using this algorithm, there are 29 route-critical landmarks on the map, that is, landmarks that the Instruction Giver must mention to describe the route to the Instruction Follower. These are the landmarks used for the simulations described below.
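The two membership criteria for a distractor set (within a fixed radius, and not hidden from view) can be illustrated with a small geometric sketch. This is not the algorithm of Guhe (2007b), whose details are not given here. The point representation of landmarks, the occlusion test (a landmark counts as hidden if a nearer landmark lies within a tolerance of the line of sight), the tolerance value, and all coordinates are assumptions made for illustration.

```python
import math


def distance_to_segment(p, a, b):
    """Distance from point p to the line segment from a to b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp the projection to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distractor_set(location, landmarks, radius, tolerance=0.5):
    """Names of landmarks within `radius` of `location` that are not hidden."""
    visible = []
    for name, pos in landmarks.items():
        dist = math.dist(location, pos)
        if dist > radius:
            continue  # criterion 1: outside the circle
        # Criterion 2 (assumed test): a nearer landmark close to the line
        # of sight hides this landmark from view.
        hidden = any(
            other != name
            and math.dist(location, other_pos) < dist
            and distance_to_segment(other_pos, location, pos) < tolerance
            for other, other_pos in landmarks.items()
        )
        if not hidden:
            visible.append(name)
    return visible


# Hypothetical layout: B is directly behind A, and D is out of range.
landmarks = {"A": (2.0, 0.0), "B": (4.0, 0.0), "C": (0.0, 3.0), "D": (10.0, 10.0)}
print(distractor_set((0.0, 0.0), landmarks, radius=5.0))
```

In this layout only A and C enter the distractor set: B is occluded by A, and D lies outside the circle.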

While this method to determine distractor sets is reasonable for current purposes, a more comprehensive model of visual search is certainly needed, that is, a model of what landmark to focus on next. I do not claim that these are exactly the distractor sets the participants used, but the sets shown in Fig. 4 will be among any empirically established sets. What is more important for the current purpose than the exact distractor sets is the fact that the distractor sets are not available to the model as wholes, which is what the standard form of the incremental algorithm assumes. Instead, the model knows the “next” landmark for any given landmark, viz. a landmark that the visual system would focus on next, given a current landmark.

This means that the models described in this section use the slight variation of the incremental algorithm shown in Fig. 5. Step 4 of the algorithm models the decision of which feature to try next. In this step, two productions compete to fire, one proposing to try color, the other to try the useful feature. They form a conflict set in the sense described in Section 1.2, and ACT-R decides which production to fire according to the corresponding equations.

Figure 5.

 Incremental algorithm used in the cognitive model.

The maps with number as distinguishing feature, for example, the one in Fig. 4, are the best test case for the corpus, because they show the observed effects most clearly. The maps with kind as their useful feature conflate the notion of the useful feature and the feature that the incremental algorithm selects by default (as pointed out above, for the models presented here this is no problem, because they do not select the kind feature by default; nevertheless, considering the bigger picture, it seems best to use different maps for testing the model); the pattern maps were unintentionally designed in such a way that number was a useful distinguisher as well (cf. Guhe & Bard, 2008a); and the shape maps, while being suitable, do not suggest distractor sets in quite such a straightforward way.

3.4. Non-adaptive versions of the incremental algorithm

To check that the observed data indeed require an adaptation to the task environment by way of linguistic feedback, the human data were tested against some non-adaptive models. For all these models, simulations show no significant correlation with the human data. They differ from the adaptive incremental model described below in that they do not learn from feedback but are the same otherwise.

All models using the incremental algorithm try to produce uniquely distinguishing referring expressions by incrementally removing distractors from the distractor set by comparing feature values of the target object to feature values of distractors. However, the models do not have a fixed preference list. Instead, the (constant) selection probabilities determine which feature is tested; cf. step 4 in Fig. 5. As already mentioned, all models are models of the Instruction Giver and produce expressions mentioning a landmark for the first time. This excludes effects of repeated mentions like conceptual pacts and reduced expressions (Guhe & Bard, 2008b).
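The procedure described in this paragraph can be sketched as follows. This is a minimal Python illustration, not the ACT-R model itself: landmarks are represented as plain dictionaries, and the function names and the requirement that all weights be positive are assumptions made for the example.

```python
import random


def rules_out(target, distractor, feature):
    """A feature rules out a distractor if its value differs from the target's."""
    return distractor[feature] != target[feature]


def incremental_re(target, distractors, probabilities, rng):
    """Pick features until no distractor remains or no feature is left to try.

    `probabilities` maps each candidate feature to a constant, positive
    selection weight; it replaces the fixed preference list.
    Returns the selected features and whether the result is distinguishing.
    """
    if any(w <= 0 for w in probabilities.values()):
        raise ValueError("selection weights must be positive")
    remaining = list(distractors)
    untried = dict(probabilities)
    selected = []
    while remaining and untried:
        features = list(untried)
        weights = [untried[f] for f in features]
        # Probabilistic counterpart of "take the next feature from the list".
        feature = rng.choices(features, weights=weights, k=1)[0]
        del untried[feature]
        survivors = [d for d in remaining if not rules_out(target, d, feature)]
        # Only features that remove at least one distractor are included.
        if len(survivors) < len(remaining):
            selected.append(feature)
            remaining = survivors
    return selected, not remaining
```

Because the loop stops as soon as the distractor set is empty, this sketch never produces overspecified expressions, and it never produces an expression without features when distractors are present.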

The original incremental algorithm accounts neither for the adaptation, because it does not learn, nor for the relative proportions of feature uses by the human speakers, because it is deterministic and its preference list does not change. This does not change when following Kelleher’s (2007) approach and ordering the preference list according to the frequency of feature mentions in the corpus, that is, to test the useful feature before color.

Finally, I estimated the frequency of feature mentions from the averages in the corpus, which are 0.75 for the useful feature and 0.34 for color. A model that tests one of the features first with the corresponding probability (i.e., values normalized to 1, which are 0.71 for the useful feature, 0.29 for color) and the other feature after that shows no significant correlation to the human data.

3.5. The adaptive incremental algorithm

Thus, for a model to show the observed behavioral changes over time, it must be able to learn from the feedback about the success or failure of a referring expression it receives from the “dialog partner.” As outlined above, the feedback is used to update the model’s estimate of a feature’s utility by assigning a positive or negative reward (modeled by the learning equation). Feedback is modeled as a probabilistic function that gives negative feedback in 60% of the cases in which the color feature has been produced, because color fails as a distinguishing feature in 60% of the cases in the iMAP task, and a positive reward otherwise. This, obviously, does not happen in real life but is a standard assumption in cognitive models like this. The changes in the estimated utilities then affect a production’s selection probability, which is a function of the current utilities (modeled by the selection probability equation).
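The feedback loop described here can be sketched in Python. The sketch assumes that the learning equation and the selection probability equation have the standard ACT-R form (difference learning with rate `alpha`, and a softmax over utilities with noise parameter `noise_s`). The default values of `alpha` and `noise_s` are placeholders and are not taken from the simulations. The reward values are those reported for the simulations.

```python
import math
import random

POSITIVE_REWARD = 12.0     # reward values reported for the simulations
NEGATIVE_REWARD = 0.0
COLOR_FAILURE_RATE = 0.6   # color fails as a distinguishing feature in 60% of cases


def feedback(used_color, rng):
    """Probabilistic feedback: negative in 60% of the cases in which color was used."""
    if used_color and rng.random() < COLOR_FAILURE_RATE:
        return NEGATIVE_REWARD
    return POSITIVE_REWARD


def update_utility(utility, reward, alpha=0.2):
    """Assumed ACT-R style learning: move the utility a fraction alpha toward the reward."""
    return utility + alpha * (reward - utility)


def selection_probabilities(utilities, noise_s=1.0):
    """Assumed ACT-R style selection: softmax over utilities scaled by sqrt(2) * s."""
    scale = math.sqrt(2.0) * noise_s
    highest = max(utilities.values())  # subtracted for numerical stability
    weights = {name: math.exp((u - highest) / scale) for name, u in utilities.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}
```

A simulation step would then choose a "propose" production with `selection_probabilities`, generate the expression, call `feedback`, and apply `update_utility` to every production that fired since the last reward.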

The two “propose” productions of step 4 in Fig. 5 are responsible for the model’s change in behavior. They are the only productions forming conflict sets of more than one production in the simulations. (More accurately, rewards affect all productions that fired since the last reward was given. But as no other productions are members of a conflict set with more than one production, the “propose” productions are the only ones for which the changing utilities affect the selection probability.)

The model predicts the changes for the useful feature to a significant degree (a linear regression between model simulations and the human data is significant: β₀ = 0.675, β₁ = 0.402, R² = 0.130, F(1, 27) = 5.20, p < 0.05), while its prediction for the color terms approaches significance (β₀ = 0.556, β₁ = 0.306, R² = 0.060, F(1, 27) = 2.78, p = 0.107); see Fig. 6. The simulations used a start utility of 20 for the “propose” productions, a positive reward of 12 and a negative reward of 0.
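To see why these parameter values make color less attractive over time, it helps to work through the expected rewards. The following sketch assumes that the learning equation has the standard ACT-R difference form with a learning rate 0 < α ≤ 1 (α is not given in this section) and ignores the time discount that ACT-R applies to rewards.

```latex
\begin{align*}
E[R \mid \text{color used}]     &= 0.4 \cdot 12 + 0.6 \cdot 0 = 4.8,\\
E[R \mid \text{color not used}] &= 12,\\
U(n) = U(n-1) + \alpha\,[R(n) - U(n-1)]
  \;&\Rightarrow\; E[U(n)] = (1-\alpha)\,E[U(n-1)] + \alpha\,E[R]
  \;\longrightarrow\; E[R] \quad (n \to \infty).
\end{align*}
```

Under these assumptions, the utility of the production proposing color drifts from its start value of 20 toward 4.8. The production proposing the useful feature is also affected whenever it fires in an expression that contains color, but its expected reward stays above 4.8 and is at most 12, so the selection probability shifts toward the useful feature.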

Figure 6.

 Adaptive incremental model.

There are a few things to note about the simulations. The models generate referring expressions for the 29 route-critical landmarks of the number maps, that is, those landmarks that the route has to go in between or close by in order for it to be reproduced accurately. Values for the simulations given here are always the average of 500 runs on the test map, while the human data are averaged over all first dialogs of a dyad, that is, not just those using the number maps. This is purely for expository purposes; the Appendix contains an overview of model simulations for all dialogs and maps. Finally, while the models process the landmarks in the order in which they are given on the map (assuming that this is the preferred order of mention), the human data show the referring expressions in the temporal order in which they were introduced, which may be different and contain references to other landmarks as well.
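The comparison between simulations and human data rests on a simple linear regression over the per-landmark frequencies. The sketch below shows how the coefficients, R², and the F statistic can be computed for two equally long series. Which series serves as predictor is an assumption here, and the function name is hypothetical. With 29 landmarks, the degrees of freedom come out as (1, 27).

```python
def simple_linear_regression(x, y):
    """Ordinary least squares fit y = b0 + b1 * x with R^2 and the F statistic."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two series of equal length with at least 3 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    syy = sum((yi - mean_y) ** 2 for yi in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("both series must vary")
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    ss_res = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    r_squared = 1.0 - ss_res / syy
    # F statistic for one predictor: explained variance over residual variance.
    f_stat = (syy - ss_res) / (ss_res / (n - 2)) if ss_res > 0.0 else float("inf")
    return {"b0": b0, "b1": b1, "r_squared": r_squared, "f": f_stat, "df": (1, n - 2)}


def average_runs(runs):
    """Average per-landmark values over simulation runs (one list per run)."""
    return [sum(values) / len(values) for values in zip(*runs)]
```

In this setup, `average_runs` would be applied to the 500 simulation runs before the averaged series is passed to `simple_linear_regression` together with the human frequencies.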

The reason for the incremental algorithm’s limited explanatory power seems to be the way in which it predicts under- and overspecified referring expressions, which results from its goal to generate uniquely distinguishing expressions and its stopping to test features once this has been achieved. This has two effects. First, the incremental algorithm predicts uniquely distinguishing expressions in cases where people do not produce them. In the test map, this happens in two regions: (a) on the right side of the map, where there are one green bug, one white bug, and two green bugs, (b) in the lower left of the map, where there are two red bugs and two orange bugs (plus four green bugs and one pink bug). This causes the outliers for landmarks 15, 16, 17, and 23 in the test map. In these cases, the model always selects color and/or the useful feature. The human data, however, do not show such values for these particular landmarks, for example, color is used to refer to the two red bugs with a frequency of 0.29, number with one of 1; for the one white bug the frequency is 0.375 for color and 0.75 for number. These frequencies are not different from the frequencies for the landmarks preceding or following these particular landmarks. In other words, for such cases, the model overpredicts the frequency with which the Instruction Giver produces uniquely distinguishing expressions. The model also overpredicts the frequency of referring expressions using at least one feature, because unlike people it never produces expressions using no feature, that is, expressions mentioning neither color nor number; cf. Fig. 7.
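The two regions can be worked through explicitly. The sketch below lists the minimal feature sets that uniquely distinguish a target. It assumes, for illustration only, that each region forms a single distractor set and that landmarks differ only in color and number. The function name and the dictionary representation are hypothetical.

```python
from itertools import combinations


def distinguishing_sets(target, distractors, features):
    """Minimal feature subsets whose values rule out every distractor."""
    found = []
    for size in range(len(features) + 1):
        for subset in combinations(features, size):
            # Skip supersets of an already found (smaller) solution.
            if any(set(prev) <= set(subset) for prev in found):
                continue
            if all(any(d[f] != target[f] for f in subset) for d in distractors):
                found.append(subset)
    return found


# Region (a): one green bug, one white bug, two green bugs.
one_green = {"color": "green", "number": 1}
one_white = {"color": "white", "number": 1}
two_green = {"color": "green", "number": 2}

# Region (b): two red, two orange, four green bugs, and one pink bug.
two_red = {"color": "red", "number": 2}
region_b_others = [
    {"color": "orange", "number": 2},
    {"color": "green", "number": 4},
    {"color": "pink", "number": 1},
]

print(distinguishing_sets(one_green, [one_white, two_green], ["color", "number"]))
print(distinguishing_sets(two_red, region_b_others, ["color", "number"]))
```

Under these assumptions, the one green bug needs both features, whereas the two red bugs are distinguished by color alone. An algorithm that insists on a distinguishing expression must therefore select color for these landmarks, even though color is the unreliable feature.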

Figure 7.

 Proportion of referring expressions with no features in the adaptive incremental model.

Second, the incremental algorithm underpredicts the frequency of overspecified referring expressions, that is, expressions with both features; cf. Fig. 8. As can be seen from the test map, in almost all cases, one feature suffices to make a distinguishing expression. Thus, with respect to over- and underspecified referring expressions, the incremental algorithm’s behavior is not a good model of the iMAP speakers.

Figure 8.

 Proportion of referring expressions using both features.

Despite a number of outliers and despite not being able to predict the levels of referential under- and overspecification, the model still significantly accounts for variation in the choice of the useful feature terms. This highlights the force of the adaptation process. It would also be easily possible to tweak the model so that the fit of the color feature becomes significant, but doing so would be beside the point. The adaptive incremental algorithm shows that rational analysis can account for the observed phenomenon.

4. A template model

The model using the adaptive incremental algorithm demonstrates two points. First, the utility of different features in the task environment drives the changes observed in the human data; second, it seems doubtful that people always generate carefully crafted uniquely distinguishing referring expressions like the model predicts, at least not in task-oriented dialogs, where referring is not the only task.

A useful next step is to consider the particular circumstances under which the referring expressions in the iMAP corpus were produced instead of applying a generic algorithm. Feedback between the dialog partners not only affects feature choice but also leads them to converge on a particular style of interaction. This is Garrod and Anderson’s (1987) output–input co-ordination principle:

whereby in formulating an utterance the speaker will match as closely as possible the lexical, semantic and pragmatic options used to interpret the last relevant utterance from their interlocutor. Bilateral conformity to the principle quickly produces convergence on a common description scheme. (Garrod & Doherty, 1994, p. 185)

Convergence can be seen as a kind of structural priming, a tendency to use the same structures as recently used by an interlocutor. On the syntactic level, Reitter, Moore, and Keller’s (2006) analysis of the original Map Task corpus suggests that this is especially true for task-oriented dialogs.

This is the guiding principle for the second model. It is a simple template model that chooses the semantic features for referring expressions of the form useful feature–color–noun, but the decision to use a feature is based purely on the utility-based selection probability. In other words, the model makes the decision of whether to use the useful feature and then whether to use color. Note that the order of these tests in the template model has no effect on feature selection: the tests consider only a feature’s utility, not whether the feature rules out distractors, and are therefore independent of each other. That this is the preferred form of referring expressions is a reasonable assumption, because

  • 1 Color and the useful feature distinguish the landmarks in the iMAP task,
  • 2 The participants were prompted to produce referring expressions with this structure in the short naming exercises before the start of each dialog,
  • 3 The landmarks are similar, so it is possible and economical to use similar referring expressions.

The template model simply takes a different starting point than the incremental algorithm. It does not address the question of whether the processes of adapting the utility of semantic features and alignment/structural priming have any mutual influences.

This model does not attempt to produce uniquely distinguishing referring expressions, and, thus, any change in the usage frequencies can only be due to the changes in the estimated utilities. Additionally, it does not use distractor sets; so there is no need to make assumptions about which ones are used in the task. This model also has a slightly lower (but still linear) computational cost, because it is not performing checks against a distractor set.

The template model shows a significant correlation with the human data for both the useful feature and color, cf. Fig. 9 (color: β₀ = 0.848, β₁ = 0.811, R² = 0.645, F(1, 27) = 51.76, p < 0.001; useful feature: β₀ = 0.422, β₁ = 0.544, R² = 0.270, F(1, 27) = 11.38, p < 0.01; simulations used a start utility of 10 for the “use-feature” productions and 7 for the “do not use feature” productions; the positive reward was 12, the negative 0).
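The selection mechanism of the template model can be sketched as follows. The start utilities and rewards are the reported ones. Everything else is an assumption made for illustration: the standard ACT-R forms of the learning and selection equations, the placeholder values for `alpha` and `noise_s`, and the rule that the reward goes to every production that fired for the current expression. In particular, this section does not state how the “do not use feature” productions are rewarded.

```python
import math
import random


def p_use(u_use, u_skip, noise_s=1.0):
    """Probability that 'use feature' wins against 'do not use feature' (softmax)."""
    return 1.0 / (1.0 + math.exp((u_skip - u_use) / (math.sqrt(2.0) * noise_s)))


def run_template_model(n_landmarks, rng, alpha=0.2, noise_s=1.0):
    """Generate one feature decision pair per landmark and learn from feedback."""
    utilities = {
        (feature, choice): (10.0 if choice == "use" else 7.0)  # reported start utilities
        for feature in ("useful", "color")
        for choice in ("use", "skip")
    }
    expressions = []
    for _ in range(n_landmarks):
        fired, used = [], {}
        # Two independent decisions; no distractor set is consulted.
        for feature in ("useful", "color"):
            p = p_use(utilities[(feature, "use")], utilities[(feature, "skip")], noise_s)
            choice = "use" if rng.random() < p else "skip"
            used[feature] = choice == "use"
            fired.append((feature, choice))
        # Assumed feedback: color fails in 60% of the cases in which it is used.
        failed = used["color"] and rng.random() < 0.6
        reward = 0.0 if failed else 12.0
        for key in fired:
            utilities[key] += alpha * (reward - utilities[key])
        expressions.append(used)
    return expressions, utilities
```

Each element of `expressions` records whether the useful feature and color were used, so expressions with both features or with no feature fall out of the two independent decisions. Under the assumed reward rule, only the color decision is differentiated by feedback, so the sketch illustrates the decline of color rather than the full pattern reported for the simulations.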

Figure 9.

 Template model.

Compared to the adaptive incremental model, the template model has a stronger correlation with the human data. This is principally due to the model being better at predicting expressions with both or no features (cf. Figs. 10 and 11), which is an indicator of the frequency of over- and underspecified referring expressions. The template model does not produce outliers, that is, it produces underspecified expressions for the outliers discussed in Section 3.5, and the frequency with which it produces expressions using both features suggests that it produces overspecified expressions with the right frequency (because in almost all cases one feature suffices for a distinguishing expression). A linear regression between model and data for referring expressions containing both features shows a significant correlation (β₀ = 0.793, β₁ = 0.683, R² = 0.467, F(1, 27) = 23.61, p < 0.001; cf. Fig. 10), as does one for expressions with no features, that is, expressions like the bugs or bugs (β₀ = 0.236, β₁ = 0.405, R² = 0.164, F(1, 27) = 5.31, p < 0.05; cf. Fig. 11). The only real discrepancy between model and data is the frequency of color mentions for the first landmark. Because the frequencies are purely based on a feature’s utility, the model contains no mechanism that could cause this value to be lower.

Figure 10.

 Proportion of referring expressions using both features in the template model.

Figure 11.

 Proportion of referring expressions using no features in the template model.

While the model has a higher correlation with the human data, it is also a much more task-specific model than the general purpose adaptive incremental algorithm, that is, it would have to be modified before it can be reused.

5. Conclusions

A feature’s utility in the iMAP task environment is the force driving the Instruction Giver’s change of behavior: The Instruction Giver’s preferences for particular features in expressions referring to landmarks gradually change to using a useful feature (number, pattern, kind, shape) more and more often and an unreliable feature (color) less and less. Thus, by improving the estimates of the different features’ utilities, the Instruction Giver adapts to this property of the task environment.

The explanation in terms of utility-based selection probabilities is based on John Anderson’s (1990) rational analysis. It has the advantage of being a well-studied trait of human cognition. Thus, the explanation put forward here is not specific to the observed phenomenon but is a general cognitive mechanism.

The incremental algorithm, a widely used paradigm for generating referring expressions, proves to be a viable starting point for a cognitive model of the production of referring expressions. In particular, two of its main ideas have been supported by the presented models, namely that (a) generating referring expressions is a feature-selection problem, for which (b) there exists an incremental algorithm with linear complexity.

The incremental algorithm highlights the issues of uniquely distinguishing and over-/underspecified referring expressions. Because the available eye-tracking data are not precise enough in the iMAP corpus, the distractor sets used by the participants cannot be determined. Therefore, new experiments will have to answer the question of whether people create uniquely distinguishing referring expressions with respect to the distractor sets they are using, but it seems unlikely that this is what they do. For the outliers produced by the model (cf. Section 3.5), people do not use the features needed to produce a uniquely distinguishing expression more often than in other referring expressions (assuming participants use similar distractor sets). People also seem to produce more overspecified referring expressions than the incremental algorithm, which stops selecting more features once a reference is uniquely distinguishing. But even so, the adaptive incremental algorithm is a suitable model of the observed behavior.

The adaptive incremental algorithm replaces the fixed preference list with productions that propose features, because the probabilities (preferences) with which these productions fire can change as a function of interactions with the task environment. This results in an improved fit between model and data.

A simple template model predicts the human data with an even higher correlation. This corroborates that the observed changes are driven by changing estimates of a feature’s utility value and not by changes in the way the referring expressions are computed algorithmically, for example, by changing the order of features in the preference list of the incremental algorithm. The model also does not rely on assumptions about distractor sets, further highlighting the role of the effect of rational analysis and the adaptation. The higher correlation of the template model with the data is due to it not producing outliers like the incremental model and to producing over- and underspecified expressions with the frequencies observed in the human data.

While this seems to suggest that the template model is a better cognitive model, it also has less coverage, because it is geared toward this corpus. In a sense it also reintroduces the notion of a preference list, but an important difference is that it does not stop selecting features once it deems a referring expression uniquely distinguishing. Thus, the order in which the template model checks features has no effect on the produced output.

Concluding, it cannot be said that either of the two models is a better model of how humans produce referring expressions. The incremental model is more general while the template model has a higher correlation with the data. The incremental model may better capture the iMAP task during its very early stages, because it is a model of generating referring expressions that can be applied to a wide variety of tasks (and in that sense may be considered an adaptation to the task of generating referring expressions in general), while the template model may better capture the iMAP task after the dialog partners settle on a form of interaction, and in the case of the iMAP task, this may happen quite quickly.

Continuing the line of thought started by Garrod and colleagues and put into a broader framework by Pickering and Garrod (2004), these two models should be integrated into a more comprehensive cognitive model of producing referring expressions. Such a model will additionally have to explain how the alignment between the speakers transforms the “incremental model form” of interaction into the “template model form” of interaction. Such a model can build on findings of convergence and alignment between the dialog partners and will probably use a mechanism like ACT-R’s instance-based learning. This model will, thus, consist of a part for choosing features and a part for deciding on the structure of expressions. It will also need to include a model of visual search for establishing the objects that the referent is contrasted with.


This research was supported by the NSF (grant NSF-IIS-0416128) and by the EPSRC (grant EP/F035594/1).

Thanks to Mark Steedman and in particular to Ellen Gurman Bard for comments on earlier drafts; to Dan Bothell for providing help with the ACT-R system; and to Roger van Gompel, David Reitter, and two anonymous reviewers for their helpful comments.


This appendix contains further results of comparing the human data to simulations with the template model. For expository purposes, the main text of the paper compares the model simulations only to the averages of the first map, that is, the data of the first dialog led by a dyad. The dialogs in the iMAP corpus fall into three groups: (i) map 1, (ii) maps 2–4, and (iii) maps 5–8. Most measures show significant differences between these groups. Our current understanding of the corpus suggests that groups (ii) and (iii) differ because the roles of Instruction Giver and Instruction Follower were exchanged after the fourth dialog. Group (i) shows significant changes in the first quarter/half of the dialog for most measures. More details on these differences are given in Guhe and Bard (2008a). The simulation results in Tables 1–4 show that the observed and modeled adaptation is not just present in the first dialog.

Table 1. Comparison of model and the different groups of maps according to Guhe and Bard (2008a)
 Row: Model ∼ Average of All Maps
 Column groups: Map 1 | Maps 2–4 | Maps 5–8
 Columns in each group: Color | Usef. Feat. | Both | None

Table 2. Comparison between model and the aggregate of all maps
 Row: Model ∼ Average of All Maps
 Columns: Color | Usef. Feat. | Both | None

Table 3. Comparison for the maps having number as useful feature (bug and tree maps) using the same parameter setting as in the simulations in Table 1
 Row: Model ∼ Average of Number Maps
 Column groups: Map 1 | Maps 2–4 | Maps 5–8

Table 4. Comparison between model and the number maps for maps 5–8 with different parameter settings for a better correlation
 Row: Model ∼ Average of Number Maps (Better Correlation)
 Column group: Maps 5–8