Keywords:

  • Generation/production of referring expression;
  • Evaluation metrics for generation algorithms;
  • Psycholinguistics;
  • Reference;
  • Incremental algorithm

Abstract

A substantial amount of recent work in natural language generation has focused on the generation of ‘‘one-shot’’ referring expressions whose only aim is to identify a target referent. Dale and Reiter's Incremental Algorithm (IA) is often thought to be the best algorithm for maximizing the similarity to referring expressions produced by people. We test this hypothesis by eliciting referring expressions from human subjects and computing the similarity between the expressions elicited and the ones generated by algorithms. It turns out that the success of the IA depends substantially on the ‘‘preference order’’ (PO) employed by the IA, particularly in complex domains. While some POs cause the IA to produce referring expressions that are very similar to expressions produced by human subjects, others cause the IA to perform worse than its main competitors; moreover, it turns out to be difficult to predict the success of a PO on the basis of existing psycholinguistic findings or frequencies in corpora. We also examine the computational complexity of the algorithms in question and argue that there are no compelling reasons for preferring the IA over some of its main competitors on these grounds. We conclude that future research on the generation of referring expressions should explore alternatives to the IA, focusing on algorithms, inspired by the Greedy Algorithm, which do not work with a fixed PO.


1. Generation of referring expressions

Sixteen years ago in a classic article in this journal, Dale and Reiter (1995) introduced the Incremental Algorithm (IA) for the Generation of Referring Expressions (GRE). Although this is the most referred-to publication on GRE so far, there has been surprisingly little work that directly assesses the validity of its claims. For even though a number of empirical studies have been carried out, few of these address the type of reference discussed by Dale and Reiter, as we shall argue. Likewise, Dale and Reiter's arguments concerning the computational complexity of the IA have gone largely unchallenged. This article aims to redress this situation, by comparing the IA with its main competitors against data elicited from human participants in a controlled experiment, and by assessing the complexity of the IA. Our main finding is that, in many situations, other, more flexible generation strategies may be preferable to the IA.

1.1. Generation of Referring Expressions

GRE algorithms are computational models of people's ability to refer to objects. Reference is a much studied aspect of language, and a key part of communication, which is why GRE algorithms are a key part of almost any Natural Language Generation program (NLG; Reiter & Dale, 2000). Although GRE has the ultimate aim of producing fully fledged referring expressions (i.e., noun phrases), one of its most important subtasks is to determine the semantic content of referring expressions. It is on this part of GRE—henceforth called Content Determination—that Dale and Reiter focused, and the same is true of this article. Let us sketch what this Content Determination task amounts to. (For details, see Section 2.)

1.2. Dale and Reiter's position

Appelt and Kronfeld had studied GRE, focusing on a range of difficult issues arising from the fact that referring expressions are an integral part of a larger communicative act. The resulting algorithms generated expressions which took discourse context into account (e.g., Appelt & Kronfeld, 1987; Kronfeld, 1989) and often contained more information than was necessary to identify the referent, for example, to make it easier for the hearer to carry out the process of identification, or to meet additional communicative goals (Appelt, 1985). Although their work offered important insights, its success as a model of human reference was essentially unproven, and this came increasingly to be seen as a shortcoming as computational linguists became more data oriented. For this reason, Dale and Reiter decided to refocus, concentrating on what they saw as the core of the problem: identifying a referent in simple situations, where nothing matters except the properties of the referent and of the other objects in the domain (the distractors). Other aspects of the utterance context were temporarily disregarded, not because they were deemed unimportant, but because these researchers thought it wise to focus on simple things first. Even though this refocusing on a narrower view of reference was not universal—some continued to investigate the effects of linguistic context (e.g., Jordan, 2002; Passonneau, 1995; Stone, Doran, Webber, Bleam, & Palmer, 2003) and the interaction with other communicative goals (e.g., Jordan, 2002; Stone et al., 2003)—it has exerted a strong influence on the direction of computational GRE.

Let us spell out Dale and Reiter's assumptions in more detail. The language-generating program takes as its input a knowledge base (KB) whose content is mutually known by speaker and hearer. The KB ascribes properties to objects, formulated as a combination of an attribute and a value. If the domain objects are pets, for example, then one attribute might be type, with values such as dog and poodle. The first aim of GRE is to find properties that identify the intended referent uniquely, using a semantic form like (type: poodle, color: brown). Computing such a semantic form is called Content Determination. The second aim is to express the semantic form in words, for example, the poodle which is brown.
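
To make this setup concrete, the following sketch (in Python; the objects, attribute names, values, and function names are ours, purely for illustration) shows the kind of knowledge base Dale and Reiter assume, and what it means for a semantic form to identify a referent uniquely.

```python
# Sketch of a Dale-and-Reiter-style knowledge base: each object is described
# by attribute-value pairs mutually known to speaker and hearer. Objects,
# attributes, and values below are invented for illustration.
KB = {
    "d1": {"type": "poodle",  "color": "brown", "size": "small"},
    "d2": {"type": "poodle",  "color": "black", "size": "small"},
    "d3": {"type": "spaniel", "color": "brown", "size": "large"},
}

def satisfies(obj, description):
    """True if the object has every attribute-value pair in the description."""
    return all(KB[obj].get(attr) == val for attr, val in description.items())

def distractors(target, description):
    """Objects other than the target that the description fails to rule out."""
    return {o for o in KB if o != target and satisfies(o, description)}

# Content Determination succeeds when the chosen semantic form rules out all
# distractors; here (type: poodle, color: brown) identifies d1 uniquely.
print(distractors("d1", {"type": "poodle", "color": "brown"}))  # -> set()
```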

Dale and Reiter examined a number of algorithms, all of which were based on interpretations of the Gricean maxims, which revolve around brevity, relevance, and truth (Grice, 1975). Roughly speaking, these algorithms sought to minimize the redundancy in a description. In the most extreme case (e.g., Dale, 1989), the Gricean maxims had been interpreted as dictating the choice of the smallest set of properties of a referent that will uniquely identify it. Dale and Reiter contrasted this with a more relaxed interpretation of the maxims, which gave rise to what has come to be known as the IA. They argued that, in the situations envisaged, the IA produces referring expressions that resemble human-generated referring expressions better than its competitors. Following Belz and Gatt (2008), we shall call this the criterion of humanlikeness. Dale and Reiter also argued that the IA is computationally tractable, and that this constitutes another argument in favor of it. The two arguments can be seen as related if one assumes that neither people nor machines can solve computationally intractable problems in real time.

1.3. Aims of the article

Dale and Reiter's empirical position was based on a reading of the psycholinguistic literature (particularly as summarized in Levelt, 1989). Yet—consistent with existing practice in NLG at the time—their paper did not include an empirical test. Later work (Jordan & Walker, 2000; Passonneau, 1995) has tested some of their ideas, as we shall see, but this work has tended to concentrate on different (in fact, more complex) referential situations than the ones on which Dale and Reiter focused. We aim to put Dale and Reiter's original ideas to the test. Like many other studies, our investigation will focus on Content Determination (unlike Krahmer & Theune, 2002; Stone et al., 2003), disregarding words and syntactic constructions. Like Dale and Reiter, we shall focus on the task of identifying the referent, disregarding the well-documented fact that referring expressions can serve other communicative purposes (e.g., Jordan, 2002; Stone et al., 2003), and we shall focus on ‘‘one-shot’’ referring expressions, as produced in a null context where words outside the referring expression do not play a role.

In one respect, we have been less conservative. For although Dale and Reiter focused exclusively on singular descriptions, many references to sets can be generated by variations of the classic GRE algorithms (e.g., van Deemter, 2002; Gardent, 2002; Gatt & van Deemter, 2007; Horacek, 2004). This is something we considered worth testing as part of our general plan. In order not to contaminate the discussion of Dale and Reiter's claims, we only discuss plural references of the kind that can be generated by a relatively simple extension of the IA; moreover, we report results on singulars and plurals separately. For generality, we have studied referring expressions in two different domain types, one of which involves references to furniture (the furniture subcorpus) and the other to photographs of faces of people (the people subcorpus).

The plan for the experiment which is the main focus of this article was first outlined in van Deemter, van der Sluis, and Gatt (2006). The first tentative results were reported in Gatt et al. (2007) concerning the furniture subcorpus, and in van der Sluis et al. (2007) concerning the people subcorpus. The tuna corpus and evaluation method have influenced the field considerably, in particular, after they were chosen as the basis on which GRE algorithms were evaluated in the First NLG Shared Task Evaluation Campaign (STEC) on Attribute Selection for Referring Expressions Generation, in the Spring and Summer of 2007.1 Twenty-two different algorithms were submitted to this STEC for comparative evaluation, coming from 13 different research teams (Belz & Gatt, 2007). tuna also featured in a subset of the tasks organized for the second STEC in this area,2 where the number of submitted systems was even greater (Gatt, Belz, & Kow, 2008a). As of October 2008, the annotated corpus is available from the Evaluations and Language resources Distribution Agency.3

This article offers a definitive statement of the aims and set-up of the tuna experiment, the structure of the annotated corpus, and our analyses of the corpus (based on a larger number of subjects, in a greater number of conditions, and validated against a second corpus), superseding our earlier papers on the topic. Consistent with our aim of testing Dale and Reiter's original hypotheses, this empirical analysis is combined with a discussion of the computational complexity of the IA and its main competitors, and concludes with a discussion of the advantages and disadvantages of incrementality.

2. Some classic GRE algorithms

To test Dale and Reiter's claims, we focus our investigation on the algorithms that were most prominently discussed in their paper. We introduce the algorithms briefly before discussing them in more detail.

  • 1
     Full Brevity (FB). Conceptually, this is the simplest approach, discussed in a number of early contributions (e.g., Appelt, 1985) and formalized by Dale (1989). FB minimizes the number of properties in the description. Technically, FB is not one algorithm but a class, since minimality can be achieved by many different algorithms. FB is computationally intractable, because in the worst case, the time it takes to find a minimal description grows exponentially with the number of properties that the program can choose between (Reiter, 1990).
  • 2
     Greedy Algorithm (GR). This faster algorithm was introduced as an approximation to FB by Dale (1989). Rather than searching exhaustively, it selects properties one by one, always choosing the property that is true of the intended referent and excludes the greatest number of distractors. GR does not always produce a minimal description, because a property that removes the maximum number of distractors at the time of its inclusion might not remove the maximum number of objects in combination with properties that will later be added.
  • 3
     Incremental Algorithm (IA). Like GR, this algorithm selects properties one by one, and stops when their combination identifies the referent. Incrementality, in this broad sense which it shares with GR, was earlier hinted at in Appelt (1985) (where it is also pointed out that ‘‘Choosing a provably minimal description requires an inordinate amount of effort’’). In the IA, the order in which properties are added to the description is not dictated by their discriminatory power, as was the case for GR, but by a fixed preference order (PO) which tells us, broadly speaking, in which order these properties need to be considered. It is important to realize that this PO is logically distinct from the left-to-right order in which attributes occur within a noun phrase. The idea, supported by psycholinguistic work (e.g., Pechmann, 1989), is, rather, that some attributes are more prominent in speakers’ minds than others, which makes them more likely to be included in descriptions. (The fact that, in human language production, some attributes are selected before others, but realized after them, presents an interesting research problem for psycholinguists (e.g., Sedivy et al., 1999), but it goes beyond the scope of this article.) Like GR, IA does not allow backtracking: Once selected, a property is never withdrawn. As others before us have observed (e.g., Fabbrizio, Stent, & Bangalore, 2008b; Jordan & Walker, 2005), the outcome of the algorithm depends on the choice of PO. A schematic sketch of the IA's selection loop is given after this list.
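
The following sketch shows, under our own simplifying assumptions, the selection loop that the description of the IA above amounts to; the data layout and function names are ours, and the special treatment of the type attribute discussed in Section 2.2 is folded in at the end.

```python
def incremental_algorithm(kb, target, preference_order):
    """Sketch of the IA's selection loop (attribute selection only).

    kb: dict mapping object ids to {attribute: value} dicts.
    target: id of the intended referent.
    preference_order: list of attributes, fixed in advance (the PO).
    """
    description = {}
    distractors = set(kb) - {target}
    for attr in preference_order:
        value = kb[target].get(attr)
        if value is None:
            continue
        ruled_out = {d for d in distractors if kb[d].get(attr) != value}
        if ruled_out:                 # include only properties that remove distractors
            description[attr] = value
            distractors -= ruled_out
        if not distractors:           # referent identified: stop (no backtracking)
            break
    # Special treatment of type (Section 2.2): make sure the description
    # contains a property that can be realized as a head noun.
    description.setdefault("type", kb[target].get("type"))
    return description if not distractors else None  # None: no distinguishing description
```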

2.1. An example

A specially constructed example will clarify how these algorithms can lead to different outcomes. Imagine a small domain of objects {a, b, c, d, e, f, g}. Now consider the following five attributes, each of which has two values, denoted as val1 and val2. We choose values which are each other's complement. Since complementary values do not overlap, this means that the choice between different values (of a given attribute) is always obvious once the referent is given. Thus, here and elsewhere in this article, we will focus on the question of attribute selection, on which most research in this area has concentrated.4

  • [Table omitted: the five attributes att1–att5, each with two complementary values (val1, val2), and the objects of which each value is true; the extensions needed for the example are given in the text below.]

Suppose the target referent is the dog e. FB will combine att2 val1 (black, which is true of the objects {a, b, c, e} only) with att4 val1 (outdoor, which is true of {d, e, f, g}), because these two properties jointly single out e while all other descriptions that manage the same feat happen to contain three or more properties. The output thus consists of the properties black and outdoor. Notice that realizing this content into a description would not be straightforward, since neither property maps naturally to a head noun (it is usually the type of an entity which has this role). This description might be realized as, for example, the black animal that lives outdoors, inserting the category or type of the referent (i.e., animal) artificially.

While GR often produces the same description as FB, this is not the case in the present example. GR will start by selecting the property att1 val1 (poodle, corresponding to the set {c, e, f}), because it is the property that excludes most distractors. Even though only two distractors are left, namely c and f, no single property manages to remove both of these at the same time. As the next step, GR will select either att2 val1 (black, removing f) or att3 val1 (bastard, removing c), but in each case a third property is required to remove the last remaining distractor. GR does not generate the smallest possible description, but something like the black poodle that lives outdoors instead.

What description is generated by IA? If the attributes are attempted in the order in which they were listed, so att1 is attempted first, followed by att2, then a version of GR is mimicked precisely. But if att2 is attempted first, followed by att4, then the result is the same as that of FB. In other cases, much lengthier descriptions can result, for example, when att5 is attempted first (narrowing down the set of possible referents to {a, b, c, d, e, f}), and then att2 (narrowing down the set of possible referents to {a, b, c, e}), followed by att1 (resulting in {c, e}) and finally att4 (resulting in the set {e} which contains only the target referent). The resulting description might be realized as the black poodle with a tail, which sleeps outdoors.

This artificial example shows how dramatically the outcome of the IA can depend on the PO. These differences make it difficult to test the hypothesis that IA's descriptions resemble human-generated descriptions. Which of all the possible IAs, after all, is the hypothesis about? It is possible that in most actually occurring situations, different POs lead to very similar descriptions, or ones that are similar in terms of their humanlikeness. We will show that this is not the case, and that there are important differences between the different versions of IA. This outcome will raise the question of how ‘‘good’’ POs might be selected.
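
The greedy strategy that this walkthrough illustrates can be sketched as follows. The code keeps only three of the example's five attributes; the attribute names breed, color, and habitat and the labels of the complementary values are our own stand-ins for att1, att2, and att4, but the extensions of poodle, black, and outdoor are as given in the text.

```python
# Stripped-down encoding of the example above:
# poodle = {c, e, f}, black = {a, b, c, e}, outdoor = {d, e, f, g}.
DOMAIN = {
    "a": {"breed": "other",  "color": "black", "habitat": "indoor"},
    "b": {"breed": "other",  "color": "black", "habitat": "indoor"},
    "c": {"breed": "poodle", "color": "black", "habitat": "indoor"},
    "d": {"breed": "other",  "color": "other", "habitat": "outdoor"},
    "e": {"breed": "poodle", "color": "black", "habitat": "outdoor"},  # target
    "f": {"breed": "poodle", "color": "other", "habitat": "outdoor"},
    "g": {"breed": "other",  "color": "other", "habitat": "outdoor"},
}

def greedy_algorithm(kb, target):
    """Sketch of Dale's (1989) Greedy Algorithm: repeatedly add the property of
    the target that rules out the largest number of remaining distractors.
    (The type handling of Section 2.2 is omitted here.)"""
    description = {}
    distractors = set(kb) - {target}
    while distractors:
        best_attr, best_ruled_out = None, set()
        for attr, value in kb[target].items():
            if attr in description:
                continue
            ruled_out = {d for d in distractors if kb[d].get(attr) != value}
            if len(ruled_out) > len(best_ruled_out):
                best_attr, best_ruled_out = attr, ruled_out
        if best_attr is None:         # no remaining property helps: give up
            return None
        description[best_attr] = kb[target][best_attr]
        distractors -= best_ruled_out
    return description

# GR picks breed first (it excludes four distractors) and then still needs
# color and habitat: three properties, where black + outdoor would have sufficed.
print(greedy_algorithm(DOMAIN, "e"))
```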

2.2. The type attribute

Types are usually realized as nouns. poodle, for example, is a type. Because a referring expression normally requires a noun (we tend to say the brown dog even when brown would identify the referent), Dale and Reiter gave types special treatment. If type is not selected by the algorithm (e.g., because all objects are of the same type), IA adds the property to the description at the end of the search process. Suppose, for example, the referent is a dog. Now if the IA starts by selecting the property brown, and if this property happens to identify the referent uniquely, then the property (type:dog) is added, generating the brown dog. This makes sure that every description contains one property realizable as a noun. This special treatment of nouns—foreshadowed in Appelt (1985)—can be seen as independent of the search strategy. In fact, we believe that the same considerations that make this a good move in combination with IA make it a good move in combination with other algorithms. In our comparison between algorithms, therefore, we have leveled the playing field by making sure that FB and GR apply the same idea. Each of the algorithms considered in this article will therefore add a suitable value of the type attribute at the end of the algorithm. This did not significantly change the outcome of any of the algorithms since, in the data we use for our evaluation, type was never either necessary or sufficient to identify a target referent, alone or in combination with other properties (see Section 4.2.3).

2.3. Plurals

We decided to examine certain kinds of expressions that refer to sets as well. We did this by applying a slightly generalized version of IA to these cases, simplifying a proposal in van Deemter (2002). The original IA stopped adding properties to a description when they singled out the set whose only element is the referent. The new IA does the same, stopping when these properties single out a set of referents. The same idea extends naturally to FB and GR. This technique allows us, for example, to use the description ([att1: val1], [att2: val1]) to refer to the set {c, e}. Note that this approach does not work for collective references (e.g., the parallel lines in this picture; cf. Stone, 2000), which require drastic departures from the algorithms discussed by Dale and Reiter. Another case which this algorithm will not handle is that of plural descriptions involving the union of two different sets, such as the dogs and the cats (van Deemter, 2002; Gatt & van Deemter, 2007; Horacek, 2004). These cases too will be omitted from this study.
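
A minimal sketch of this generalization (our own function name and layout, simplifying van Deemter, 2002): a property is usable only if it is true of every target, and selection stops once every non-target has been ruled out.

```python
def incremental_algorithm_plural(kb, targets, preference_order):
    """Sketch of the IA generalized to references to sets: a property is
    usable only if it is true of every target, and selection stops once the
    chosen properties single out the target set."""
    targets = set(targets)
    description = {}
    distractors = set(kb) - targets
    for attr in preference_order:
        values = {kb[t].get(attr) for t in targets}
        if len(values) != 1 or None in values:
            continue                  # attribute does not apply uniformly to all targets
        (value,) = values
        ruled_out = {d for d in distractors if kb[d].get(attr) != value}
        if ruled_out:
            description[attr] = value
            distractors -= ruled_out
        if not distractors:
            break
    return description if not distractors else None
```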

3. How to test a GRE algorithm

Empirical work relevant to GRE falls into two main classes. On the one hand, psycholinguistic studies on reference span at least four decades, since the foundational work of Krauss and Weinheimer (1964). On the other hand, partly as a result of increasing interest in comparative evaluation in NLG, a number of studies have compared the IA and some other algorithms against corpus data. We give an overview of the main findings from both types of research.

3.1. Psycholinguistic research on reference

As we have seen, the starting point for contemporary GRE research was the notion of brevity. Brevity was implicit in an early paper by Olson (1970), who argued that reference has a primarily contrastive function, so that the content of an identifying description would be expected to be determined by the distractors from which the referent is distinguished. A substantial body of evidence now shows that brevity is not the only factor motivating speakers' choices of properties in referring expressions. The phenomenon of overspecification, whereby speakers select properties which have little or no contrastive value, was observed in early developmental studies (Ford & Olson, 1975; Whitehurst & Sonnenschein, 1978; Whitehurst, 1976); later studies on adult speakers confirmed the tendency. Pechmann (1984) showed that adult speakers begin to articulate a reference before their scanning of a visual domain is complete, and that perceptually salient attributes, such as color, were apt to be selected before other properties, such as size, which required more cognitive effort because their value can be determined only by comparison with other objects. Later work showed that type and color were always used in referring expressions in visual domains, even when they had no contrastive value (Pechmann, 1989; Schriefers & Pechmann, 1988), a result not replicated for size. According to these authors, type has privileged status not only because of syntactic constraints, but because speakers process a referent as a conceptual gestalt, central to which is the referent's object class; similarly, they argued that color forms part of the speaker's mental representation of an object. Similar results have been reported by Mangold and Pobel (1988) and Eikmeyer and Ahlsèn (1996). A slightly different interpretation is given by Belke and Meyer (2002), who interpret the finding in terms of an attribute's codability, that is, the ease with which that attribute can be included in a mental representation of an object. Thus, relative attributes such as size are argued to have low codability. Evidence for overspecification has also been found in connection with locative (Arts, 2004) and relational descriptions (Engelhardt, Bailey, & Ferreira, 2006).

The demonstration that speakers overspecify has obvious relevance to a study comparing the IA with its predecessors, since one of the consequences of a PO is the potential for referential overspecification, with an increased likelihood that attributes which are given high priority (say, color) will be used even though they are not required for a distinguishing description.

3.2. Direct comparisons of the IA with other models

One of the first studies to systematically compare the IA with other algorithms focused on the coconut corpus, a collection of task-oriented dialogs in which interlocutors had to resolve a joint task (buying furniture on a fixed budget) (Jordan, 2000a).5 Jordan's study compared the IA with two models. The first, Jordan's Intentional Influences (II) model (Jordan, 2000a, 2002), views content determination as heavily influenced by communicative intentions over and above the identification intention. Thus, the intention to signal agreement with a proposal (in this case, to buy some item of furniture) may motivate the speaker to repeat a description produced by their interlocutor, for example. The authors also included a computational implementation of the Conceptual Pacts model proposed by Brennan and Clark (1996), in which a speaker's choice of content for a referring expression is in part based on a tacit ‘‘agreement’’ with the interlocutor to refer to objects in a specific way. The IA was outperformed by both models. However, the comparison leaves open the question as to whether the IA is an adequate model of referent identification, since the models to which it was compared explicitly go beyond this.

Still within a dialog context, Gupta and Stent (2005) carried out an evaluation on the coconut and the maptask (Anderson et al., 1991) corpora. They compared the IA and a version of the GR by Siddharthan and Copestake (2004) against a baseline procedure that included the type of an object, and randomly added further properties until a referent was distinguished. Additionally, each algorithm was augmented with dialog-oriented heuristics and coupled with procedures for the realization of modifiers. Thus, the evaluation used a metric which combined (a) the degree of agreement between an algorithm's attribute selections and a human's, and (b) the extent to which the automatic realization of attributes was syntactically a good match to the human realization.

While these studies offer important insights, they do not directly address the questions outlined in Section 1 of this article. In the maptask corpus, for example, most referents are named entities with no true distractors, which can explain why the baseline algorithm outperformed both IA and the GR on this data in Gupta and Stent's study. In the coconut corpus, these two algorithms outperformed the baseline, but the original IA was outperformed by variants that incorporated dialog-oriented heuristics. This is exactly as one would predict, since identification is often not the only referential goal of interlocutors, particularly in the coconut corpus, where other factors have been shown to be paramount (Jordan, 2000a; Jordan & Walker, 2005). The evaluation metric used by Gupta and Stent incorporated syntactic factors, going beyond the purely semantic task-definition that the IA sought to address. This, of course, is only a limitation from the viewpoint of a study like the present one, which focuses on Content Determination.

A study by Viethen and Dale (2006) offers a more straightforward comparison of the IA and the GR. Viethen and Dale stuck to the identification criterion as the sole communicative intention. They tested against a small corpus of 118 descriptions, obtained by asking experimental participants to refer to drawers in a filing cabinet, which differed on four dimensions, namely color, row, column and whether or not a drawer was in a corner. The primary evaluation metric used was recall, defined as the proportion of descriptions in the corpus which an algorithm reproduced perfectly. The comparison of IA and GR revealed a recall rate of 79.6% for the latter, compared with 95.1% for the IA (with both figures excluding relational descriptions). Moreover, the corpus contained a limited number (29) of overspecified descriptions, of which the IA reproduced all but five. Although these results seem favorable for the IA, they only tell us which descriptions are generated by one or more of all 24 (= 4!) possible POs of the attributes. This is because Viethen and Dale combined results from all 24 versions of the algorithm, a legitimate move, but one that obscures the extent to which a given version of the IA, with a particular PO, contributes to the overall rate.

Recently, Di Fabbrizio and colleagues reported on a study that is closely related to the questions of this article (Fabbrizio et al., 2008a,b; see Bohnet, 2008, for a related approach), comparing different versions of the IA.6 One version uses a PO that reflects the frequency of attributes in the tuna corpus as a whole, whereas the other attempts to model different speakers: When modeling a given speaker, the IA uses a PO which reflects the frequency with which this speaker (i.e., this subject in the data collection experiment) used the attribute in question. Although the findings of these studies are intriguing—speaker modeling improved the performance of the algorithm—they need to be treated with some caution. This is because the set of speakers represented in the training set from which speaker-dependent POs were obtained was different from the set of speakers represented in the test set on which their algorithms were evaluated. This makes it difficult to interpret the conclusion that ‘‘speaker constraints can be successfully used in standard attribute selection algorithms to improve performance on this task’’ (Fabbrizio et al., 2008b, p. 156). One possible reason for the improvement is that the constraints in question are sufficiently general to apply to classes of speakers rather than individual speakers, and hence can be generalized from one sample (individuals represented in the training set) to another (those represented in the test set).

In summary, most existing GRE evaluations do not address the question formulated at the beginning of this article, because they either placed the IA within the context of a task (such as collaborative dialog) in which reference is likely to go beyond the primary aim for which IA was designed, or because their evaluation criteria obscure the role of attribute preferences (e.g., by averaging over multiple POs). The work of Dale and Reiter remains central to current work in GRE. To take an example, although the three STECs organized in this area over the past few years have led to novel proposals, with an emphasis on empirical methods (Bohnet, 2007; de Lucena & Paraboni, 2008; Spanger, Kurosawa, & Tokunaga, 2008; Theune, Touset, Viethen, & Krahmer, 2007) and the exploitation of novel frameworks such as genetic algorithms (Hervás & Gervás, 2009; King, 2008), many submissions took the IA or one of its two ‘‘competitors’’ as a starting point. Gricean brevity has also been emphasized as a desirable property of algorithms in recent years (Bohnet, 2007; Gardent, 2002). It therefore seems crucial to put the claims made by Dale and Reiter to the test while maintaining their original assumptions.

3.3. Toward an evaluation methodology

The foregoing discussion raises a number of methodological issues that the evaluation experiment reported below seeks to address. First, the IA is a family of algorithms, since there are as many versions of it as there are POs. The question then arises as to whether all these possible versions should be considered, with the combinatorial explosion that this brings about. Our approach will be to select only those orders which are ‘‘plausible.’’ Where possible, we shall attempt to define plausibility in terms of earlier psycholinguistic work. As we have seen in Section 3.1, however, much of this work has focused on relatively simple, well-defined visual domains with attributes, such as color and size. What of more complex domains in which the variety of attributes increases, and the determination of ‘‘salient’’ or ‘‘conceptually central’’ attributes is more difficult? Psycholinguistic research has had less to say about preferences in these contexts. For this reason, it seemed important to investigate the performance of algorithms in domains that are more complex than the ones that have typically been studied, as well as very simple ones.

Since Dale and Reiter's claims focused on Content Determination, the aims that we set ourselves suggest that a comparison of GRE algorithms should abstract away from differences in lexical choice and syntactic realization. Suppose an intended referent has the properties 〈type:sofa〉 and 〈color:red〉, and two human authors produce the descriptions the settee which is red and the red sofa, respectively. An algorithm which selects both the above properties should be counted as achieving a perfect match to both descriptions. A comparison should also rest on the knowledge that the algorithm and the authors share the same communicative intentions (namely, to identify the referent). Of the studies reviewed above, only Viethen and Dale (2006) and di Fabbrizio et al. (2008a, 2008b) satisfied this requirement. The experiment we employed to collect data aimed to minimize potentially confounding communicative intentions.

In line with Dale and Reiter's starting point, the corpus-based evaluation on which we report here focuses on an assessment of the humanlikeness of the descriptions generated by a given GRE algorithm. In other words, we ask how well an algorithm mimics speakers.

3.4. The evaluation metric

An evaluation that compares automatically produced output against human data should take into account partial matches, something that a simple recall measure does not do. We therefore adopt the Dice coefficient, a well-established metric which computes the degree of similarity between two sets in a straightforward way. (Section 8 will briefly discuss alternative metrics.) In keeping with earlier remarks, we shall be assessing the similarity between sets of attributes (such as color), rather than sets of properties (such as green). The Dice metric is similar to the ‘‘match’’ metric applied to GRE algorithms by Jordan (2000b) (which was defined as X/N, where X is ‘‘the number of attribute inclusions and exclusions that agree with the human data’’ and N is the maximum number of attributes that can be expressed for an entity). Dice is computed by scaling the number of attributes that two descriptions have in common by the overall size of the two sets:

$$\mathrm{Dice}(D_H, D_A) = \frac{2 \times |D_H \cap D_A|}{|D_H| + |D_A|} \qquad (1)$$

where D_H is (the set of attributes in) the description produced by a human author and D_A the description generated by an algorithm. Dice yields a value between 0 (no agreement) and 1 (perfect agreement). We will also report the perfect recall percentage (PRP), the proportion of times when an algorithm achieves a score of 1, agreeing perfectly with a human author on the semantic content of a description. Finally, a description may contain an attribute several times. For instance, the green desk and the red chair contains the attribute color twice. We treat these as distinct; hence, technically, our Dice coefficient is computed over multi-sets of attributes.
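
For concreteness, the following sketch computes Dice over multi-sets of attributes and the PRP over a list of human-system pairs; the function names are ours.

```python
from collections import Counter

def dice(human_attrs, system_attrs):
    """Dice coefficient over multi-sets of attributes (Eq. 1): an attribute that
    occurs twice in a description (e.g. color in 'the green desk and the red
    chair') is counted twice."""
    h, a = Counter(human_attrs), Counter(system_attrs)
    common = sum((h & a).values())              # multi-set intersection
    total = sum(h.values()) + sum(a.values())
    return 2 * common / total if total else 1.0

def perfect_recall_percentage(pairs):
    """Proportion of (human, system) attribute-set pairs with a Dice score of 1."""
    return 100.0 * sum(dice(h, a) == 1.0 for h, a in pairs) / len(pairs)

# The algorithm selects {type, color}; the human used {type, color, size}:
print(dice(["type", "color", "size"], ["type", "color"]))   # 2*2 / (3+2) = 0.8
```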

4. The tuna corpus

The experiment that led to the tuna corpus was carried out over the Internet over a period of 3 months, yielding 2,280 singular and plural descriptions by 60 participants. Since this corpus was originally developed, two smaller datasets have been constructed, using the same overall methodology. These were created in 2008 for the comparative evaluation of algorithms in some of the tasks in the second GRE Shared Task Evaluation Campaign, tuna-reg’08; one of them was also reused in tuna-reg’09, the third and final GRE STEC involving tuna held in 2009 (see Gatt & Belz, 2010, for details). The present study compares algorithms against human descriptions in the original tuna corpus and validates the results against one of the tuna-reg’08 test datasets. The reason is that the original tuna data collection incorporated some design features—discussed next—which may have introduced complications in the interpretations of results. Since the reg’08 test sets omitted these features, this comparison gives an additional measure of confidence in our interpretations. We first describe the design, collection, and annotation of the original tuna corpus, followed by a complete overview of the data sources and their use in Section 4.8.

4.1. General setup of the experiment

Let us use the term domain for a GRE domain in the sense outlined in the previous sections, using domain type to refer to the kinds of objects in a domain, such as furniture. Different domain types can lead to qualitatively different referring expressions (e.g., Koolen et al., 2009; Viethen & Dale, 2006). We therefore let participants refer to objects in two very different domain types, yielding two subcorpora, the furniture subcorpus (Section 4.2.1) and the people subcorpus (Section 4.2.2).

We had to find a setting in which large numbers of reasonably natural human descriptions could be obtained, each referring to an object for the first time (i.e., without being able to rely on previous references to the same referent). These constraints militate against the use of dialogs, which is why we opted for the following, more straightforward approach.7

In the experiment, each trial consisted of one or two target referents and six distractor objects, with the targets clearly demarcated by red borders, as illustrated in Fig. 1. The participants were asked to identify the objects that were surrounded by the red borders. Participants were told that they would be interacting with a language-understanding program which would interpret their description and remove the referents from the domain. This was intended to introduce a measure of interactiveness into the proceedings. It was emphasized that the aim of these descriptions was to identify the referent. The system was programmed in such a way that one or two objects (depending on whether there were one or two target referents) were automatically removed from the domain after a participant had entered his or her description. To emphasize that this task can be performed with different degrees of success, the system removed the correct referent(s) on 75% of the trials, but the wrong one(s) on a quarter of trials, which were randomly determined.

Figure 1.  Trials in the tuna elicitation experiment.

Pilots with this scheme suggested that this did not discourage participants from taking the task seriously. Nonetheless, this feature may have introduced confounding factors in our design. For example, participants’ referential behavior may have altered as a result of the system's ‘‘misinterpreting’’ a description, leading to less risk-taking behavior (Carletta & Mellish, 1996). This is one of the limitations of this methodology which motivated the use of the tuna-reg’08 test data as a second, validating test set, since this dataset was constructed without this feature.

The experiment was designed to produce a corpus that would serve as a lasting resource for GRE evaluation; hence, it took into account a number of factors, of which only a few will concern us here. In the following sections, we describe the materials and design of the experiment, as well as the corpus annotation. A more detailed explanation of the annotation procedure can be found in van der Sluis et al. (2006) and Gatt, van der Sluis, and van Deemter (2008b).

4.2. Materials

Referential domains (corresponding to experimental trials) consisted of one or two images of target referents and six distractor objects. Objects were displayed in a sparse 3 (row) × 5 (column) matrix. The positioning of objects was determined randomly at runtime, for each participant and each trial.

4.2.1. The furniture subcorpus

The furniture subcorpus consists of descriptions of pictures of furniture and household items obtained from the Object Databank,8 a set of realistic, digitally created images developed by Michael Tarr and colleagues at Brown University. Four types of objects were selected from the Databank, corresponding to four values of the type attribute. For each object, there were four versions corresponding to four different values of orientation. Pictures were manipulated to create a version of each type × orientation combination in four different values of color and two values of size, as shown in Table 1. As shown in the table, there are two additional attributes, x-dimension and y-dimension, which describe the location of an entity in the 3 × 5 grid.

Table 1.    Attributes and values in the furniture subcorpus
Attribute                     Possible Values
type                          chair, sofa, desk, fan
color                         blue, red, green, gray
orientation                   front, back, left, right
size                          large, small
x-dimension (column number)   1, 2, 3, 4, 5
y-dimension (row number)      1, 2, 3

4.2.2. The people subcorpus

The people subcorpus consists of references elicited in domains consisting of high-contrast, black-and-white photographs of people, following previous experimental work using the same set (van der Sluis & Krahmer, 2004).

This subcorpus is more complex than the furniture one, because a given portrait can be described using a substantial, and perhaps open-ended, number of different attributes (e.g., the bald man with the friendly smile and the nerdy shirt). Nevertheless, based on van der Sluis and Krahmer (2004), a number of salient attributes were identified, as shown in Table 2.

Table 2.    Attributes and values used in the people subcorpus
Attribute                     Possible Values
type                          person
orientation                   front, left, right
age                           young, old
beard                         0 (false), 1 (true), dark, light, other
hair                          0 (false), 1 (true), dark, light, other
hasglasses                    0 (false), 1 (true)
hasshirt                      0, 1
hastie                        0, 1
hassuit                       0, 1
x-dimension (column number)   1, 2, 3, 4, 5
y-dimension (row number)      1, 2, 3

The attributes beard and hair have values which are of mixed types. Thus, both can take the Boolean values 1 or 0, or one of a set of literal values indicating the color of a person's hair or beard. In the actual corpus annotation, each of these attributes was a combination of two separate ones, one taking a Boolean value indicating whether a person had the attribute (e.g., hashair), and another indicating the color (e.g., hairColor). However, the latter was always used in conjunction with the former, in expressions like dark-haired. Therefore, it seemed reasonable to combine these into a single attribute in this study, thereby also reducing the number of attributes overall, and reducing the number of possible POs for the IA in the process.

4.2.3. Construction of domains

The experiment consisted of 38 experimental trials, divided into 20 furniture trials and 18 people trials, each with one or two targets and six distractors in the sparse matrix. Each trial displays a different domain. For furniture, the domains were constructed by taking each possible combination of attribute-value pairs in each domain type9 and constructing a domain in which that combination was the minimally distinguishing description for the referent(s). For example, in a domain in which the minimal description for the target referent was {〈color: red〉,〈orientation: front〉} (red and facing front), at least one distractor would be a red chair, and at least one other distractor would be a chair facing front, but only the target referent would have both properties. In the people subcorpus, the minimal description was calculated based on a combination of three salient attributes found in the Van der Sluis and Krahmer (2004) study, namely beard, hasglasses, and the attribute age.
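
To make the notion of a minimally distinguishing description precise, here is a small sketch (function names and toy objects ours) of the check that such a domain construction has to satisfy: the chosen attribute combination rules out every distractor, while no proper subset of it does.

```python
from itertools import combinations

def distinguishes(kb, target, attrs):
    """True if the target's values for 'attrs' jointly rule out every distractor."""
    return all(any(kb[d].get(a) != kb[target].get(a) for a in attrs)
               for d in kb if d != target)

def is_minimally_distinguishing(kb, target, attrs):
    """True if 'attrs' distinguishes the target while no proper subset of it does."""
    if not distinguishes(kb, target, attrs):
        return False
    return not any(distinguishes(kb, target, subset)
                   for r in range(len(attrs))
                   for subset in combinations(attrs, r))

# Toy check (objects and values invented): color + orientation is minimal here,
# because color alone and orientation alone each leave a distractor.
kb = {
    "t":  {"type": "chair", "color": "red",  "orientation": "front"},
    "d1": {"type": "chair", "color": "red",  "orientation": "back"},
    "d2": {"type": "chair", "color": "blue", "orientation": "front"},
}
print(is_minimally_distinguishing(kb, "t", ("color", "orientation")))  # True
```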

The type was never part of the minimally distinguishing description because it was assumed, based on robust psycholinguistic findings, that it would be included anyway. The available attribute-value pairs in a domain type were represented an approximately equal number of times. For example, of the 12 furniture domains where orientation was part of the minimal description, a target faced front or back exactly half the time, and left or right in the rest.

4.3. Design

Table 3 presents an overview of the experimental design, which manipulated one within-subjects, and two between-groups factors. (The abbreviations used in the table will be introduced presently.) The within-subjects factor manipulated the Cardinality and Similarity of objects. In the case of plural domains with two target referents, the two referents may or may not be sufficiently similar to be describable by means of the same minimal conjunction of properties. Where this is not possible, one may have to split the description in two parts (set-theoretically, the union of two sets; logically, the disjunction of two conjunctions of properties) saying, for example, the red table and the blue sofa. Accordingly, Cardinality/Similarity had three levels:

Table 3.    Experimental design and number of descriptions within each cell
                      Furniture                 People
                      SG      PS      PD        SG      PS      PD
+fc −loc (N = 15)     105     90      105       90      105     105
−fc +loc (N = 15)     105     90      105       90      105     105
+fc +loc (N = 15)     105     90      105       90      105     105
−fc −loc (N = 15)     105     90      105       90      105     105

  • 1
    Singular (SG): Seven furniture domains and six people domains contained a single target referent.
  • 2
    Plural/Similar (PS): Six furniture domains had two referents with identical values for the attributes with which they could be distinguished from their distractors. For example, two pieces of furniture might both be blue in a domain where the minimally distinguishing description consisted of color. This was also the case in six people domains where, for instance, both targets might be wearing glasses in a domain where 〈glasses: 1〉 sufficed for a distinguishing description. In furniture domains, the two referents in this condition had different values of type (e.g., one was a chair, and the other a sofa), whereas in people domains, they were identical (since all entities were men).
  • 3
    Plural/Dissimilar (PD): In the remaining seven plural furniture trials and the six plural people trials, the targets had different values for the minimally distinguishing attributes. Thus, plural descriptions would always involve a disjunction (i.e., a set union) if they were to be distinguishing. Since disjunction requires significant extensions to the classic algorithms discussed by Dale and Reiter (1995), we shall omit data in this condition.

The first between-groups factor, ±loc, was whether participants were encouraged to use locative expressions. Half of the participants were discouraged, although not prevented, from using locative expressions (−loc condition), whereas the other half (+loc) were not. The former were told that the language-understanding program they were interacting with had access to the same domain representation but had different information about the position of objects, so that using locatives would be counter-productive. Participants in +loc were told that the system had access to the complete domain of objects, including location. Locatives were not included in Dale and Reiter's discussion of GRE algorithms. In recent years, it has become increasingly clear that the location of a referent can play a special role in referring expressions (Arts, 2004), and that location requires special mechanisms for dealing with relations such as above (Dale & Haddock, 1991; Kelleher & Kruijff, 2006). Given that the primary focus of this article is an evaluation of the claims of Dale and Reiter (1995), we do not consider locative descriptions here.

The second between-groups factor sought to determine whether participants would perceive the communicative situation as fault-critical (±fc). The group in the fault-critical (+fc) condition was told that the program would eventually be used in situations where accurate referent identification was crucial, and no opportunity to rectify errors would be available. In this condition, participants could not correct the system's ‘‘mistakes’’ when it removed the wrong referent(s). Subjects in the −fc condition were not told this. Instead, in the 25% of trials where the program removed the wrong referents, they were asked to click on the correct pictures to rectify the ‘‘error’’ made by the program. Once again, a full discussion of the effects of the fc factor would take us far beyond the scope of this article. We shall therefore collapse descriptions from both ±fc conditions in our analysis based on the original tuna Corpus. Although some preliminary analysis of the corpus data suggested that there was no difference between the two conditions in the likelihood of overspecified descriptions, some previous work (Maes, Arts, & Noordman, 2004; von Stutterheim, Mangold-Allwinn, Barattelli, Kohlmann, & Kölbing, 1993) suggests that manipulation of communicative context may affect referential behavior. These complications further motivate our validation against the tuna-reg’08 test data, which did not manipulate a ±fc factor.

4.4. Participants and procedure

The experiment was run over the Internet. Participants were asked for a self-rating of their fluency in English (native speaker, non-native but fluent, not fluent). Participants who rated themselves as not fluent were not included in the corpus. Participants were then randomly assigned to a condition and read the corresponding instructions. The instructions emphasized that the purpose of their descriptions is to identify referents. They were asked to complete the experiment (i.e., all 38 furniture and people trials) in one sitting. Trials were presented in randomized order. Each trial consisted of a presentation of a domain, as shown in Fig. 1, where participants were prompted for a description of the target referent(s). This was followed by a feedback phase, in which the system removed the target referent. Sixty participants completed the experiment, 15 in each group depicted in Table 3.

4.5. Annotation

An XML annotation scheme was developed for the corpus, which pairs each corpus description with a representation of the domain in which it was produced. In the scheme, which is exemplified in Fig. 2, a description is represented in three different ways: (a) the original string typed by a participant (the STRINGDESCRIPTION node); (b) the same string with all substrings corresponding to an attribute annotated using ATTRIBUTE tags (the DESCRIPTION node); (c) a simplified representation consisting only of the set of attributes used by a participant (the ATTRIBUTESET node).

Figure 2.  Example of a corpus instance: ‘‘the sofa facing right.’’

Evaluation required that the domains that were ‘‘seen’’ by humans and algorithms be compatible whenever possible. This was not possible when human-produced expressions contained attributes that were not specified in the domain (e.g., where a person was described as being serious); these were tagged using name = ‘‘other.’’ In Sections 5 and 6, these attributes were treated as different from any system-generated properties.

4.6. Annotation procedure and interannotator agreement

The corpus was annotated by two of the authors based on consensus.10 The reliability of our annotation scheme was evaluated by comparing a subset of 516 descriptions in the corpus with the annotations made by two independent annotators (hereafter, A and B), postgraduate students with an interest in NLG, who used the same annotation manual. The Dice coefficient was used as a similarity metric for comparing the three annotated versions of each description. This allows us to measure the degree to which annotators agreed on the semantic content of a particular description (cf. Passonneau, 2006).

Table 4 displays the pairwise mean and modal Dice scores. In all three pairwise comparisons, there is a high degree of similarity, with the most frequent score being 1. However, agreement was slightly higher between the two annotators A and B than between either of them and the authors. To take the likelihood of chance agreement into account, we used Krippendorff's α (Krippendorff, 1980), which has the general form shown in Eq. (2):

$$\alpha = 1 - \frac{D_o}{D_e} \qquad (2)$$

where D_o is the observed disagreement between annotators and D_e is the disagreement expected when the units of interest are being coded purely by chance. We follow Carletta (1996) in assuming that a value greater than 0.8 indicates high reliability (see Artstein & Poesio, 2008, for discussion). In the present case, the disagreement on two descriptions D_1 and D_2 is calculated as 1 − dice(D_1, D_2). We followed Passonneau (2006) in adopting the following instantiation of Eq. (2):

  • Eq. (3): Passonneau's (2006) instantiation of Eq. (2), expressed in terms of the quantities defined below (formula not reproduced here).

where r is the number of corpus descriptions, m the number of annotators (i.e., 3), i ranges over the individual corpus descriptions, and n_{D_j i} is the number of times the set of attributes D_j has been assigned to description i (of a maximum of 3). The α value obtained was 0.85. This implies that the three sets of independently annotated descriptions were in high agreement and the annotation scheme used is replicable to a high degree.
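
As a rough illustration (not a reproduction of Passonneau's exact instantiation), α can be computed with 1 − Dice as the distance function by comparing the pairwise disagreement observed within descriptions with the pairwise disagreement over all codings pooled; the sketch below assumes every description is coded by the same number of annotators.

```python
from itertools import combinations

def dice(d1, d2):
    d1, d2 = set(d1), set(d2)
    return 2 * len(d1 & d2) / (len(d1) + len(d2)) if (d1 or d2) else 1.0

def krippendorff_alpha(items):
    """Rough sketch of Krippendorff's alpha with 1 - Dice as the distance.

    items: one list of attribute sets per description (one set per annotator).
    Observed disagreement: mean pairwise distance within a description.
    Expected disagreement: mean pairwise distance over all codings pooled.
    """
    within = [1 - dice(a, b) for item in items for a, b in combinations(item, 2)]
    pooled = [coding for item in items for coding in item]
    across = [1 - dice(a, b) for a, b in combinations(pooled, 2)]
    d_o = sum(within) / len(within)
    d_e = sum(across) / len(across)
    return 1 - d_o / d_e if d_e else 1.0

# Three annotators coding two descriptions:
print(krippendorff_alpha([
    [{"type", "color"}, {"type", "color"}, {"type", "color", "size"}],
    [{"type"}, {"type"}, {"type"}],
]))
```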

Table 4.    Mean and modal Dice scores in the interannotator reliability study
                    Mean (SD)      Mode (%)
Authors versus A    0.886 (0.17)   1 (53.2)
Authors versus B    0.891 (0.15)   1 (51.5)
A versus B          0.934 (0.15)   1 (72.2)

4.7. A note on the construction of the tuna-reg’08 datasets

As stated before, we use Test Sets 1 and 2 from the tuna-reg’08 STEC to further validate our results. They were constructed via an online elicitation experiment based on the original design, with the following differences:

  • 1
     Participants were not told that they would be interacting with a natural language-understanding system, and they received no ‘‘feedback’’ of the kind given in the original experiment. Rather, they were asked to identify the objects as though they were typing their descriptions for another person to read.
  • 2
     The between-groups factor manipulating whether or not the communicative situation was fault-critical (±fc) was dropped.
  • 3
     Only singular domains were used in the experiment. This is because the STECs did not include plural descriptions.

This experiment was completed by 218 participants, of whom 148 were native speakers of English. The test sets were constructed by randomly sampling from the data gathered from native speakers only: Both sets contain 112 different domains, divided equally into furniture and people descriptions and sampled evenly from both ±loc experimental conditions. In Test Set 1, there is one description per domain, for a total of 112 descriptions (56 per domain type). We use this as development data in this study (Sections 5 and 6). In Test Set 2, there are two different descriptions for each domain, for a total of 224 descriptions (112 in each domain type). This constitutes our independent test dataset, used for validation of the results on the original tuna corpus.

4.8. Summary of testing and development data

Table 5 shows the number of descriptions in our two test sets, and our development set. Because the furniture and people subcorpora vary so much in complexity, we focus on each one separately in what follows. In each case, the data consist of (a) singular descriptions (SG); and (b) similar plurals (PS) elicited in the −loc condition. These are the two classes of descriptions that can be handled by the algorithms without extensions to deal with disjunction. Since participants in the −loc condition were not prevented from using locative attributes (although they were discouraged), we further exclude from our test data all descriptions that include them.

Table 5.    Descriptions in the test and development data in the two subcorpora.
              Source                   Cardinality   Furniture   People   Total
Test          Original tuna Corpus     SG            156         132      288
                                       PS            158         138      296
              reg’08 Test Set 2        SG            56          56       111
Development   reg’08 Test Set 1        SG            56          56       112

4.9. Algorithms and comparisons

As observed earlier, testing all possible versions of the IA would not be practical, particularly in a domain with a large number of attributes. In each subcorpus, we therefore selected a subset of the possible IAs, focusing on those which prioritize preferred attributes. Although such preferences can often be identified from previous psycholinguistic work, this is not always possible, especially in the case of the people descriptions. For this reason, we used our development data to estimate frequencies with which different attributes were used. The resulting frequency ranking for each subcorpus was used to determine a set of POs that were predicted to be ‘‘optimal.’’
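
For illustration, a frequency-based PO of the kind used here can be derived from development data roughly as follows (a sketch under our own simplifications; ties and the exclusion of type are handled arbitrarily).

```python
from collections import Counter

def preference_order_from_data(descriptions, exclude=("type",)):
    """Rank attributes by how often they occur in a set of human descriptions
    (each description given as a list of attribute names), most frequent first.
    Ties keep first-seen order; this is our simplification."""
    counts = Counter(attr for d in descriptions for attr in d if attr not in exclude)
    return [attr for attr, _ in counts.most_common()]

# Made-up development descriptions, for illustration only:
dev = [["type", "color"], ["type", "color", "size"], ["type", "orientation"]]
print(preference_order_from_data(dev))   # e.g. ['color', 'size', 'orientation']
```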

For each subcorpus, we report mean Dice scores for algorithms based on both the original tuna Corpus and the tuna-reg’08 test data. For validation purposes, we report correlations between these means. Significant correlations are taken to suggest that the two datasets are compatible, in spite of the methodological differences in the data collection. Focusing on the main test dataset (that is, the original tuna data), our statistical analysis then proceeds in two steps. First, we compare the set of optimal IAs to the predicted suboptimal IAs, as well as a random baseline (hereafter referred to as IA-RAND), which always selected the type attribute, and then incrementally added randomly chosen properties to a description until the target referent was identified (this strategy follows Gupta & Stent, 2005). In this part of the study, we are mainly concerned with the impact of different POs on the IA. We then address the question of how the IA compares to other algorithms by identifying the two versions of the IA which are statistically the best and worst, and comparing them to GR and FB. In each case, we report the results of a by-items anova with Tukey's post hoc comparisons.
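
The IA-RAND baseline can be sketched as follows (our own names and data layout; the actual implementation may differ in details such as how candidate attributes are sampled).

```python
import random

def ia_rand(kb, target):
    """Sketch of the IA-RAND baseline (after Gupta & Stent, 2005): always include
    the referent's type, then add randomly chosen properties until the target
    is distinguished."""
    description = {"type": kb[target].get("type")}
    distractors = {d for d in kb if d != target
                   and kb[d].get("type") == kb[target].get("type")}
    remaining = [a for a in kb[target] if a != "type"]
    random.shuffle(remaining)
    for attr in remaining:
        if not distractors:
            break
        value = kb[target][attr]
        description[attr] = value            # added regardless of usefulness
        distractors -= {d for d in distractors if kb[d].get(attr) != value}
    return description if not distractors else None
```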

5. The furniture subcorpus

Our comparison of algorithms begins with the furniture subcorpus. We first identify candidates for ‘‘plausible’’ IAs. As indicated in Section 3.1, there are strong precedents in the psycholinguistic literature for hypothesizing that, of the three attributes in this domain, color will tend to be strongly preferred, whereas size is dispreferred (e.g., Belke & Meyer, 2002; Pechmann, 1989). The situation is less clear with orientation.

Frequencies computed from our development dataset, displayed in Table 6, confirm the hypothesized trends, but also reveal a tie between size and orientation. We therefore expect POs that put color first to perform better overall, while the relative order of size and orientation may make less of a difference. We can test these hypotheses by comparing all six possible IAs, listed below (a sketch of the incremental selection procedure itself follows the list):

Table 6.    Frequency of attribute usage in the development data for the furniture subcorpus. (Locative attributes and other are omitted because they were ignored in this study.)

  Attribute      Frequency (%)
  type           56 (31.6)
  color          49 (27.7)
  orientation    20 (11.3)
  size           20 (11.3)
  • 1
     IA-COS: color >> orientation >> size
  • 2
     IA-CSO: color >> size >> orientation
  • 3
     IA-OCS: orientation >> color >> size
  • 4
     IA-OSC: orientation >> size >> color
  • 5
     IA-SCO: size >> color >> orientation
  • 6
     IA-SOC: size >> orientation >> color
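
To make the procedure concrete, the following is a minimal sketch of incremental content determination with a fixed PO, in the spirit of the IA; the entity representation and function names are ours, purely for illustration, and Dale and Reiter's FindBestValue step for choosing among the values of an attribute is omitted:

    def incremental_algorithm(referent, distractors, preference_order):
        """referent and distractors are dicts mapping attribute -> value."""
        description = {}
        remaining = list(distractors)
        for attr in preference_order:
            value = referent.get(attr)
            ruled_out = [d for d in remaining if d.get(attr) != value]
            if ruled_out:                      # only add properties that remove distractors
                description[attr] = value
                remaining = [d for d in remaining if d not in ruled_out]
            if not remaining:                  # the referent is now distinguished
                break
        description.setdefault("type", referent["type"])   # the head noun is always included
        return description

    # IA-COS on a toy furniture domain:
    target = {"type": "chair", "color": "red", "orientation": "front", "size": "large"}
    others = [{"type": "chair", "color": "blue", "orientation": "front", "size": "large"},
              {"type": "desk", "color": "green", "orientation": "left", "size": "small"}]
    print(incremental_algorithm(target, others, ["color", "orientation", "size"]))
    # -> {'color': 'red', 'type': 'chair'}

On this reading, the IA-RAND baseline of Section 4.9 is the same procedure run with type selected first and the remaining attributes visited in a freshly shuffled random order on each trial.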

The top panel of Table 7 displays the mean Dice scores for each version of the IA within the two Cardinality conditions under consideration, as well as the PRP obtained by the algorithms across conditions. Overall means are reported for both test sets.

Table 7.    Mean Dice scores and standard deviations for the furniture subcorpus, with PRP scores per algorithm. Plural Similar scores are reported for the original tuna data only. c, o, and s stand for color, orientation, and size, respectively. PRP stands for perfect recall percentage.

                Original tuna data                                                  tuna-reg’08 data
                Singular              Plural Similar       Overall                  Singular
                Mean (SD)      PRP    Mean (SD)     PRP    Mean (SD)      PRP       Mean (SD)      PRP
  IA-COS        0.917 (0.12)   60.9   0.797 (0.10)  7      0.857 (0.12)   33.8      0.916 (0.16)   69.1
  IA-CSO        0.917 (0.12)   60.9   0.791 (0.11)  7.6    0.853 (0.13)   34.1      0.916 (0.16)   69.1
  IA-RAND       0.840 (0.15)   31.4   0.755 (0.13)  3.2    0.797 (0.14)   17.2      0.826 (0.18)   34.6
  IA-OCS        0.829 (0.14)   25     0.728 (0.13)  1.9    0.778 (0.14)   13.4      0.829 (0.15)   25.5
  IA-SCO        0.815 (0.14)   19.2   0.730 (0.12)  2.5    0.772 (0.14)   10.8      0.823 (0.15)   18.2
  IA-OSC        0.803 (0.16)   22.4   0.728 (0.13)  1.9    0.765 (0.15)   12.1      0.801 (0.17)   25.5
  IA-SOC        0.780 (0.16)   18.6   0.707 (0.13)  2.5    0.743 (0.15)   10.5      0.782 (0.16)   18.2

  FB            0.841 (0.17)   39.1   0.736 (0.14)  4.4    0.788 (0.16)   21.7      0.845 (0.17)   37.5
  GR            0.829 (0.17)   37.2   0.721 (0.13)  2.5    0.774 (0.16)   19.7      0.845 (0.17)   37.5

5.1. Comparison of the two test datasets

The mean Dice scores on the two datasets are strongly correlated, both if we compare the overall score on the tuna data with the tuna-reg’08 data (r9 = .96; p < .001) and if we compare only the means obtained on the singular descriptions (r9 = .985; p < .001).11 The ranking of the algorithms is largely the same for the two datasets, particularly for the top and bottom rankings. The overall rankings of the algorithms12 on the tuna data also correlated significantly with the rankings on the tuna-reg’08 data (Spearman's ρ9 = .92; p = .001) as did the rankings obtained on the singular descriptions (ρ9 = .98; p < .001). This suggests that the two datasets are largely compatible, and that the issues raised in relation to the original experimental design did not lead to significant deviations.

5.2. Comparing different versions of the IA

For the next part of our analysis, we focus exclusively on the original tuna data. As the table shows, all the algorithms performed worse on the plural descriptions, a point to which we return at the end of this section. However, the relative ordering of the different versions of the IA is stable irrespective of the Cardinality condition. Moreover, the trends seem to go in the predicted direction, with the two top IAs being the ones which place color first, while prioritizing size or orientation leads to a decline. Note also that the PRP, which reflects the extent to which an algorithm agreed perfectly with an individual on a specific domain, declines sharply for those algorithms which do not put the preferred attribute first, while IA-COS and IA-CSO achieve a perfect match with corpus instances more than 60% of the time on the singular data. Moreover, the random baseline (IA-RAND) outperforms all except these two IAs.

A 7(algorithm) × 2(cardinality) univariate anova was conducted to compare all the versions of the IA on the original tuna data. There were highly significant main effects of both factors (algorithm: F(6,2184) = 35.67, p < .001; cardinality: F(1,2184) = 291.95, p < .001). The interaction approached significance (F(6,2184) = 2.02, p = .06).

Pairwise comparisons using Tukey's Honestly Significant Differences yielded the three homogeneous subsets of algorithms (A, B, C) displayed in Table 8. The table shows a clear partition between the two IAs that prioritize color and the other algorithms: the performance of these two stands out as significantly better than that of the five other algorithms. They are also the only algorithms that significantly outperform IA-RAND.

Table 8.    Homogeneous subsets among versions of the IA in the furniture subcorpus. Algorithms that do not share a letter are significantly different at α = 0.05.

  IA-COS    A
  IA-CSO    A
  IA-RAND       B
  IA-OCS        B
  IA-SCO        B   C
  IA-OSC        B   C
  IA-SOC            C

These results indicate that even in a small domain with few dimensions of variation among objects, the humanlikeness of the output of the IA is strongly affected by the PO selected. For the next part of our analysis, we will compare GR and FB against an ‘‘optimal’’ IA, namely IA-COS, and the ‘‘suboptimal’’ IA-SOC.

5.3. Comparing the IA with GR and FB

Table 7 (bottom panel) shows that the Greedy and FB algorithms fall between the two top-scoring IAs which place color first, and the others, with IA-RAND outperforming them narrowly in terms of mean scores. We conducted a 4(algorithm) × 2(cardinality) univariate anova comparing the best and worst IAs to FB and GR. There was a significant main effect of algorithm (F(3,1248) = 38.91, p < .001) and cardinality (F(1,1248) = 1.48, p < .001). Once again, the interaction was not significant (F(3,1248) = 1.48, p > .2). As in the first test, performance on plurals declined for all algorithms.

The results of a post hoc Tukey's test are presented in Table 9, which shows a clean partition between the brevity-oriented algorithms on the one hand, and the best and worst IAs on the other. (If IA-CSO were used in the comparison instead of IA-COS, its position would be identical to that of IA-COS in the present table.) No significant difference was obtained between FB and GR; the reason is probably that, although a Greedy strategy does not guarantee that the briefest description is found, in a domain with relatively few attributes (and therefore few possibilities) it is likely to converge on the same descriptions as a FB strategy. This is supported by the results on the tuna-reg’08 data, where the two algorithms obtained identical results.

Table 9.    Homogeneous subsets among the best and worst IAs with FB and GR. Algorithms that do not share a letter are significantly different at α = 0.05.

  IA-COS    A
  FB            B
  GR            B
  IA-SOC            C

The results in Table 7 show that the humanlikeness of the IA is dependent on the PO. Moreover, the comparison with the other algorithms (Table 9) suggests that the prediction of Dale and Reiter (1995), that an incremental strategy would improve humanlikeness, is only valid for some IAs.

5.4. Discussion

The results so far suggest that the PO is an important component of any analysis that seeks to test Dale and Reiter's (1995) claims. Having said this, choosing a ‘‘good’’ PO based on psycholinguistic studies would have been easy in the furniture domain type, since these studies suggest that color is highly preferred (e.g., Pechmann, 1984).

At a more fine-grained level, matters also depend on the evaluation metric used. With the exception of IA-SOC, the overall means in Table 7 range between 0.75 and 0.9. These differences seem intuitively small, an outcome that could be related to the simplicity of the domain. From an engineering point of view (i.e., one that favors robust, feasible solutions that give reasonable results in the long run), the performance of the random baseline IA-RAND, as well as GR and FB, would seem acceptable. On the other hand, the PRP scores distinguish the top-ranking algorithms more sharply.

One outcome which deserves more comment is the difference between singulars and plurals, with plurals yielding consistently lower Dice scores (Table 7). One possible explanation is that the two referents of a plural description in the furniture subcorpus always had different types. To give the algorithms a fair chance of matching participants’ descriptions, we allowed these algorithms to use unions of types, as in ‘‘the blue desk and chair.’’ It turns out, however, that noun phrases involving unions of types are rare. Instead of ‘‘the blue desk and chair,’’ participants tended to produce descriptions such as ‘‘the blue desk and the blue chair,’’ which uses the color attribute redundantly twice. Since our computation of Dice uses multisets, this lowers the overall score of the algorithms. In our example, the human attribute set would contain two occurrences of color, whereas an algorithm's would contain only one, thus decreasing Dice overall. (In the case of the people subcorpus, both referents in a plural description had the same type value, since both were always men.)
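
To make the multiset point concrete, here is a small sketch of a Dice computation over attribute multisets (our own illustration; the evaluation scripts used for the corpus may differ in detail):

    from collections import Counter

    def multiset_dice(attrs_a, attrs_b):
        """Dice coefficient over multisets: 2 * overlap / (|A| + |B|),
        where the overlap counts shared attributes with multiplicity."""
        a, b = Counter(attrs_a), Counter(attrs_b)
        overlap = sum((a & b).values())
        return 2.0 * overlap / (sum(a.values()) + sum(b.values()))

    # ''The blue desk and the blue chair'': color is used twice by the human,
    # but only once in the algorithm's union-of-types description.
    human = ["type", "type", "color", "color"]
    algorithm = ["type", "type", "color"]
    print(multiset_dice(human, algorithm))   # 2*3 / (4+3) ≈ 0.857 rather than 1.0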

A full discussion of plurals is beyond the scope of this article, but the observations made here do confirm the general thesis that people deviate substantially from Gricean brevity. A closer analysis of plural references suggests, in fact, a substantial impact of the way the objects are categorized (their type) on the form and content of the referring expression.

6. The people subcorpus

We now turn to the people subcorpus, where the number of attributes is greater than in our previous analysis. Hence, the possibilities multiply, both in terms of the number of possible versions of the IA and in terms of the choices available to authors when describing objects.

Compared with the furniture subcorpus, the larger number of attributes (nine excluding type) in the people subcorpus makes testing all possible IAs impractical. Therefore, it is even more crucial to have an a priori estimate of what POs might constitute ‘‘optimal’’ and ‘‘suboptimal’’ IAs. Here, however, the psycholinguistic literature provides less guidance than before. Most of the work cited in Section 3 focused on references to objects with the kinds of attributes we find in the furniture domain type. (Not a lot has been published on attributes such as hasglasses.) Therefore, our reliance on frequencies based on the development dataset is greater than before. Attribute usage frequencies are displayed in Table 10.

Table 10.    Frequency of attribute usage in the development data for the people subcorpus. (Locative attributes and other are ignored in this study.)

  Attribute      Frequency (%)
  type           55 (29.6)
  hasbeard       36 (19.4)
  hasglasses     25 (13.4)
  hashair        22 (11.8)
  age            14 (7.5)
  hasshirt        4 (2.2)
  hassuit         3 (1.6)
  hastie          2 (1.1)
  orientation     1 (0.5)
  x-dimension    12 (6.5)
  y-dimension    12 (6.5)
  other           0

Aside from type, the table suggests a gap between a set consisting of the three attributes hasbeard, hasglasses, and hashair, and the others. To construct different versions of the IA, we took all possible permutations of these three attributes, imposing a fixed order on the other six. Additionally, we again used a version of the IA that reversed the hypothesized ‘‘best’’ order; this is our predicted suboptimal version. This resulted in the following versions of the IA, in addition to IA-RAND (a sketch of the construction follows the list):

  • 1
     IA-GBHOATSS: hasglasses >> beard >> hair >> orientation >> age >> hastie >> hasshirt >> hassuit
  • 2
     IA-GHBOATSS: hasglasses >> hair >> beard >> ⋯ >> hassuit
  • 3
     IA-BGHOATSS: beard >> hasglasses >> hair >> ⋯ >> hassuit
  • 4
     IA-BHGOATSS: beard >> hair >> hasglasses >> ⋯ >> hassuit
  • 5
     IA-HBGOATSS: hair >> beard >> hasglasses >> ⋯ >> hassuit
  • 6
     IA-HGBOATSS: hair >> hasglasses >> beard >> ⋯ >> hassuit
  • 7
     IA-SSTAOHBG: hassuit >> hasshirt >> hastie >> age >> orientation >> hair >> beard >> hasglasses
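
For illustration, this family of POs can be generated mechanically as follows (the code is ours, purely for illustration; attribute names as in Table 10):

    from itertools import permutations

    top = ["hasglasses", "hasbeard", "hashair"]                     # three most frequent attributes
    tail = ["orientation", "age", "hastie", "hasshirt", "hassuit"]  # fixed order for the rest

    pos = [list(p) + tail for p in permutations(top)]   # six permutations of the top three
    pos.append(list(reversed(pos[0])))                  # predicted suboptimal PO (IA-SSTAOHBG)
    for po in pos:
        print(" >> ".join(po))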

As before, Table 11 gives descriptive statistics for all the algorithms, with the different versions of the IA in the top panel.

Table 11.    Mean Dice scores and standard deviations for the people subcorpus, with PRP scores per algorithm. Plural Similar scores are reported for the original tuna data only.

                  Original tuna data                                                  tuna-reg’08 data
                  Singular              Plural Similar        Overall                 Singular
                  Mean (SD)      PRP    Mean (SD)      PRP    Mean (SD)      PRP      Mean (SD)      PRP
  IA-GBHOATSS     0.844 (0.17)   44.7   0.819 (0.21)   44.9   0.831 (0.19)   44.8     0.811 (0.17)   33.9
  IA-BGHOATSS     0.822 (0.17)   36.4   0.776 (0.20)   31.9   0.799 (0.19)   34.1     0.797 (0.17)   32.1
  IA-GHBOATSS     0.776 (0.21)   29.5   0.759 (0.23)   33.3   0.767 (0.22)   31.5     0.77 (0.18)    26.8
  IA-BHGOATSS     0.728 (0.19)   15.9   0.683 (0.24)   18.1   0.705 (0.22)   17       0.792 (0.17)   30.3
  IA-HGBOATSS     0.688 (0.18)   3.8    0.671 (0.20)   9.4    0.679 (0.19)   6.7      0.765 (0.17)   25
  IA-HBGOATSS     0.658 (0.20)   4.5    0.622 (0.22)   6.5    0.640 (0.21)   5.6      0.752 (0.17)   23.2
  IA-RAND         0.598 (0.23)   11.4   0.539 (0.22)   10.1   0.568 (0.23)   10.7     0.527 (0.21)   0
  IA-SSTAOHBG     0.344 (0.11)   0      0.466 (0.19)   13.8   0.407 (0.16)   7        0.344 (0.08)   0

  FB              0.764 (0.23)   34.1   0.693 (0.28)   34.8   0.728 (0.26)   34.4     0.642 (0.23)   19.6
  GR              0.693 (0.20)   8.3    0.634 (0.23)   10.1   0.663 (0.21)   9.3      0.642 (0.23)   19.6

6.1. Comparison of the two test datasets

As before, there were strong positive correlations between the means obtained on the two datasets, both when the overall tuna means are compared with the tuna-reg’08 means (r10 = .9; p < .001) and when the singular subset only is compared (r10 = .9; p = .001). Rankings of algorithms are identical for the singular subset of the tuna data and the tuna-reg’08 data. The rankings on the overall tuna data display some variation in the middle ranks compared with the tuna-reg’08 data, but the two datasets give the same top and bottom rankings (i.e., the top two and bottom two algorithms are the same). This is confirmed by a strong positive correlation between ranks (ρ10 = .88; p = .001).

6.2. Comparing different versions of the IA

At a glance, the table suggests some important differences in the distributions of scores and in the algorithm rankings in this subcorpus compared with the previous one. First, the overall Dice scores are more broadly distributed, with IA-RAND and IA-SSTAOHBG scoring at or below 0.57. Second, there does not seem to be such a sharp difference in performance between the SG and PS conditions, with only relatively small decreases in performance. The exception is IA-SSTAOHBG, which performs worse in the SG condition than in PS. Third, although IA-RAND once again ranks above the worst-performing IA, the random procedure performs worse, in relative terms, than it did on the furniture subcorpus. Fourth, the worst-performing IA has an extremely low overall PRP of 7 and scores 0 on this measure in the singular data, meaning that it does not achieve a perfect match with any of the singular descriptions produced by our subjects.

A 7(algorithm) × 2(cardinality) univariate anova again showed a significant main effect of algorithm (F(6,1876) = 118.47, p < .001) but no main effect of cardinality (F(1,1876) = 2.22, p > .1). However, the interaction was highly significant (F(6,1876) = 6.37, p < .001). These results confirm the preliminary impressions gleaned from the table, where the decline in performance on PS is not as sharp as it was on the furniture data. The interaction is obtained primarily because IA-SSTAOHBG reverses the trend found for all the other algorithms.

Table 12 displays the homogeneous subsets obtained from the Tukey's pairwise comparisons. The table is most interesting for the differences at the two extremes. At the top of the table, the two best-performing algorithms differ significantly from all other algorithms. At the bottom, two distinct subsets identify the worst-performing algorithms, one of which is IA-RAND. Interestingly, the latter does not cluster with the worst-performing PO. Looking at the two best versions of the IA, the distinction between their POs appears subtle. This is also evident in the overlap between groups B and C, which mirrors the overlap found in the furniture subcorpus between algorithms that fall between the extremes (Table 8).

Table 12.    Homogeneous subsets among versions of the IA in the people subcorpus. Algorithms that do not share a letter are significantly different at α = 0.05.

  IA-BGHOATSS   A
  IA-GHBOATSS   A
  IA-BHGOATSS       B
  IA-HGBOATSS       B   C
  IA-HBGOATSS           C
  IA-RAND                   D
  IA-SSTAOHBG                   E

6.3. Comparing the IA with GR and FB

As before, the means for GR and FB in Table 11 suggest that they fall somewhere between the best- and worst-performing IAs. However, whereas their means were not significantly different in the furniture subcorpus, for the people subcorpus a significant difference was found, with FB outperforming GR (and with a much higher PRP). In this section, we compare these two algorithms with one of the best IAs (IA-BGHOATSS) and the worst (IA-SSTAOHBG).

A 4(algorithm) × 2(cardinality) univariate anova revealed the same trends as with the comparison of the IAs in the previous subsection, with a main effect of algorithm (F(3,1072) = 192.63, p < .001), no main effect of cardinality (F(1,1072) = 1.14, p > .25) and a significant interaction (F(3,1072) = 13.44, p < .001), the latter once again because of IA-SSTAOHBG.

Homogeneous subsets from the pairwise comparisons are shown in Table 13. Apart from confirming the superiority of the top-ranked IA, these results also confirm that FB outperformed GR, and both are significantly better than the worst version of the IA.

Table 13.    Homogeneous subsets among the best and worst IAs with FB and GR in the people subcorpus. Algorithms that do not share a letter are significantly different at α = 0.05.

  IA-BGHOATSS   A
  FB                B
  GR                    C
  IA-SSTAOHBG               D

6.4. Discussion

The answers to our questions are generally more clear-cut on this dataset than on the furniture subcorpus. There is an obvious dependency of the IA on the PO: IA-SSTAOHBG does not achieve a single perfect match with any description, and is significantly worse than all other algorithms, while once again, the versions of the IA which perform best are those which prioritize ‘‘preferred’’ (i.e., in the present case, frequent) attributes.

The increased number of choices in this domain type also means that a random incremental procedure is more likely to select a distinguishing combination of attributes which a human author would not select. Another observation concerns the distinction between GR and FB; the two algorithms are significantly different on this dataset, with FB performing better and achieving a PRP which approaches that of some of the higher-ranked IAs. In general, GR is likely to overspecify, including attributes that are not minimally required for a distinguishing reference. The fact that FB achieves a respectable performance means that the validity of the generalization that people are unlikely to be brief is strongly dependent on the domain type.

As we have argued, much of the psycholinguistic literature has shown preferences for attributes such as color. On one interpretation (Pechmann, 1989; Schriefers & Pechmann, 1988), this is because of a holistic, ‘‘gestalt’’ mental representation of objects to which some attributes are central, owing to their salience and the fact that they are not relative (unlike size). A related interpretation is that these attributes have high codability (Belke & Meyer, 2002). However, predictions of overspecification based on these theories do not seem to carry over straightforwardly to attributes such as whether a person wears glasses or whether he or she is bald. The investigation of these phenomena in different domains is an important area of future research.

The main reason why the two Cardinality/Similarity conditions differ less sharply in the people subcorpus compared with the furniture data is that in the Plural Similar condition in the people domains, the two target referents were both of the same type (i.e., both men). The kinds of ‘‘partitioned’’ descriptions that were unavoidable in the furniture corpus (because the two referents in a plural target had different values for the type attribute) would sometimes still arise, as in (4a), where different properties were used to characterize each referent. However, where the description ascribes the same properties to each referent as in the PS condition, the form will be as in (4b).

(4)(a) The man with a beard and the man with glasses.
(b) The men with beards and glasses.

7. Tractability of algorithms

Dale and Reiter claimed that the IA is superior to its competitors in two respects, namely humanlikeness and computational tractability. We have so far focused on the first of these claims, but it is worth discussing the second one as well.

Firstly, suppose we accept tractability as an important consideration. It is then far from clear that this rules out algorithms such as GR, or perhaps even FB. To see why, let us assess the run-time complexity of each of these algorithms, using the following abbreviations:

  • 1
    na = number of properties known to be true of the intended referent.
  • 2
    nd = number of distractors.
  • 3
    nl = number of attributes mentioned in the final referring expression.

Under Dale and Reiter's analysis, GR has a complexity of na × nd × nl, because it needs to make nl passes through the problem, at each stage checking at most na attributes to determine how many of the nd distractors they rule out. By contrast, the IA has a complexity of nd × nl, because it requires nl passes but does not look for the optimal attribute at each stage, since this is fixed by the PO. Although this makes GR more computationally ‘‘expensive’’ than the IA, the standard view regarding the complexity of algorithms is that only the general shape of the function matters and not its fine details. Because both algorithms are polynomial, the standard position suggests that they should probably be tarred with the same brush. In other words, there are no strong computational reasons for preferring IA over GR. A worse complexity class is only reached with FB, whose complexity Dale and Reiter assessed as na^nl (i.e., exponential in nl).
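
The exponential term reflects FB's exhaustive search over combinations of properties. The following is a minimal sketch of a Full Brevity-style search under our own entity representation (an illustration, not Dale and Reiter's original formulation):

    from itertools import combinations

    def full_brevity(referent, distractors, attributes):
        """Return a shortest distinguishing set of attributes,
        trying all combinations of size 1, 2, ... in turn."""
        for size in range(1, len(attributes) + 1):
            for combo in combinations(attributes, size):
                # A combination distinguishes the referent if every distractor
                # differs from it on at least one of the chosen attributes.
                if all(any(d.get(a) != referent[a] for a in combo) for d in distractors):
                    return {a: referent[a] for a in combo}
        return None   # no distinguishing description exists in this domain

In the worst case, the number of candidate combinations examined grows roughly as na raised to the power nl, which is where the exponential factor comes from; bounding nl by a constant, as suggested later in this section, makes the search polynomial.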

It is unusual to assess complexity in terms of variables like nl (which is not known before the end of the calculation) and na (which would require an algorithm of its own to calculate). A similar picture emerges, however, if the algorithms are subjected to a more traditional worst-case analysis (van Deemter, 2002). Such an analysis puts the complexity of IA at nx × np, where np equals the total number of properties expressible in the language (i.e., the number of attribute/value combinations) and nx is the number of objects in the domain. In this analysis, it is easy to see that the order in which properties are tested is irrelevant, because in the worst case, all properties are tested. The same conclusion follows, namely that both GR and IA have polynomial complexity.13

Additionally, it is debatable whether it makes sense to dismiss a GRE algorithm purely because it is computationally ‘‘intractable.’’ It might be partly for this reason that complexity has been discussed relatively rarely in GRE in recent years, when most research has focused on empirical tests of algorithms on miniature domains. Suppose algorithm x produced better output than algorithm y, but at a much slower pace. Would we really want to prefer y over x under all circumstances? The following arguments militate against such a position:

  • 1
     Current GRE algorithms do not pretend to model procedural aspects of human reference production at all; at best, they offer a good approximation of the descriptions produced by a human speaker. Thus, the primary question that determines a choice between algorithms is which one mimics human output better, not which one is faster.
  • 2
     Assessments of computational tractability require one to make assumptions which, in practice, can be debatable. Suppose, for example, that no referring expression can contain more than a hundred properties. This reasonable assumption would instantly remove the variable nl from the formula for the complexity of FB, causing this algorithm to run in polynomial time. Moreover, an algorithm whose theoretical complexity is polynomial (but whose constants have high values) can easily take more time in practice than an exponential one (whose constants have low values). Thus, it is often difficult to assess the practical implications of complexity results.

To sum up: the experiments reported in previous sections have led us to question the empirical superiority of the IA. Our discussion of computational complexity now tends in the same direction. Combining the evidence, greedy approaches to GRE may have a lot to offer after all.

8. Conclusion

The IA is by far the best-known GRE algorithm to date. Krahmer and Theune (2002, p. 223), for example, wrote that ‘‘the IA has become more or less accepted as the state of the art for generating descriptions.’’ Horacek (1997, p. 207) wrote that ‘‘the IA is generally considered best now, and we adopt it for our algorithm, too.’’ Recently, Goudbeek, Krahmer, and Swerts (2009) stated that ‘‘Dale and Reiter's (1995) IA is often considered the algorithm of choice (...), due to its algorithmic simplicity and empirical groundedness.’’ Oberlander (1998, p. 506) stated that the IA ‘‘clearly achieves reasonable output much of the time.’’ The IA has also become the basis of much recent work that seeks to widen the coverage of GRE algorithms. Examples include work on relational descriptions (e.g., Areces, Koller, & Striegnitz, 2008), salience (such as Piwek, 2009, which takes the incremental model ‘‘as a point of departure’’), vague descriptions (van Deemter, 2006), and spatial descriptions (e.g., Kelleher & Kruijff, 2006; Turner, Sripada, Reiter, & Davy, 2007). No other GRE algorithm can boast a similar popularity, particularly for identifying a referent in a ‘‘null context’’ (cf. Section 1). Similarly, the FB and GR strategies are still among the main competitors of the IA. Perhaps the main other contender is the graph-based approach of Krahmer, van Erk, and Verleg (2003), but its main selling point is arguably that it can encode a wide variety of algorithms, including IA, FB, and GR.

We have not been able to confirm the advantages that have been claimed for the IA. From the point of view of run-time complexity, there are no strong reasons for preferring the IA over GR, and in the corpora that we studied, it would be misleading to say that the IA matches human-produced descriptions more closely; for although there always existed a version of the IA that outperformed all other algorithms examined, this is not surprising given the fact that, in simple domain types, any halfway reasonable description can be produced by the IA if a PO is hand-picked.14 The success of the IA depended substantially on the PO: A suboptimal PO produces descriptions that are worse than those of FB and GR. This was true not only when ‘‘unreasonable’’ POs were used but also when all available evidence, including corpus frequencies, was taken into account to find a good PO. Furthermore, as we hope to explain in more detail elsewhere, an analysis of differences between participants in our experiment revealed that some human speakers are modeled more accurately by FB and GR than by any incremental generation strategy: In the people data, FB agreed perfectly with a human author about 61% of the time, for example. Combining all the evidence, we conclude that someone who is looking for a GRE algorithm for a previously unstudied application domain might do better choosing GR (augmented with a Dale and Reiter-style treatment of head nouns, as we have argued in Section 2) instead of an unproven version of the IA.

Because this article has used the assumptions outlined in Dale and Reiter (1995), our evaluations have focused on the extent to which the descriptions produced by an algorithm matched human-produced descriptions. Constructing algorithmic models of human language production is intrinsically interesting, for example because it may one day help computer programs to pass the Turing test (Turing, 1950). Moreover, there is some evidence that referring expressions which resemble those produced by human speakers tend to be more easily understood (Campana, Tanenhaus, Allen, & Remington, 2004). For all these reasons, humanlikeness is currently the prevalent perspective on the evaluation of GRE algorithms, including the recent STEC challenges (Gatt & Belz, 2010). Having said this, we acknowledge that there should also be room for alternative, utility-driven evaluation methods. For, although some psycholinguistic theories emphasize the cooperative nature of reference (e.g., Brennan & Clark, 1996; Clark & Wilkes-Gibbs, 1986), there is evidence that producers do not necessarily maintain a model of a receiver's communicative needs (e.g., Arnold, 2008; Engelhardt et al., 2006; Keysar, Lin, & Barr, 2003).

The shortcomings of the IA became particularly noticeable in connection with the more complex of the two domain types, which involved black-and-white photographs of people's faces (i.e., the people domain type). But our people domains were still comparatively simple. Real people would have been identifiable in terms of their physical features, past actions, and so on. It is unclear whether any of the algorithms discussed here would do well in such situations. Some studies, in fact, suggest that there are domain types in which the IA performs poorly regardless of PO.15 These results suggest to us that future research in GRE should pay close attention to the complexities posed by the large and complex domains that speakers are faced with in real life.

Research on GRE has moved on considerably since 1995, when Dale and Reiter put forward their hypotheses (see Krahmer & van Deemter, 2011, for a survey). Yet the core GRE problem of producing naturalistic ‘‘one-shot’’ descriptions of a single referent continues to attract considerable interest, as was demonstrated by the recent GRE evaluation campaigns. The investigation on which we have reported in this article raises the question whether incrementality is basically on the right track, or whether some other approach to reference generation is perhaps superior. It appears to us that a nuanced answer to this question is called for, since the choice depends on the situation in which the algorithm has to operate.

In situations where it is possible to see, based on experimental evidence for example, that certain attributes are preferred over others (e.g., because they are easier to perceive or to express), the IA has considerable appeal, because this algorithm allows us to translate this evidence directly into a PO.16 In situations where neither intuitions nor experimental evidence is available, a version of the GR is likely to be a better choice than the IA.

The main strength of the GR lies in its ability to determine the usefulness of a property dynamically: In some cases, size will have great discriminatory power, for example, but if 90% of the domain elements have the same size as the target referent, then size will be nearly useless for identifying the target. The GR is able to take such differences into account, because it selects properties based on how many distractors they remove.
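
A minimal sketch of this dynamic selection, in the spirit of the greedy heuristic (our illustration, not a reference implementation), is the following:

    def greedy_algorithm(referent, distractors, attributes):
        """Repeatedly add the property that rules out most of the remaining distractors."""
        description, remaining = {}, list(distractors)
        candidates = list(attributes)
        while remaining and candidates:
            # Discriminatory power is recomputed against the *current* distractor set.
            power = {a: sum(1 for d in remaining if d.get(a) != referent[a])
                     for a in candidates}
            best = max(power, key=power.get)
            if power[best] == 0:          # nothing left removes any distractor: give up
                break
            description[best] = referent[best]
            remaining = [d for d in remaining if d.get(best) == referent[best]]
            candidates.remove(best)
        return description

On the size example above: if 90% of the domain shares the target's size, then size removes only a tenth of the distractors, and some other attribute will typically be chosen first.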

But arguably, discriminatory power is not enough. Suppose only one person in the room has green eyes. This does not necessarily make eye color very useful for referring to this person, because the color of someone's eyes is difficult to perceive from a distance. A defender of the GR might counter that if eye color is imperceptible, this attribute should not be available to the generator: GRE algorithms should only use properties that are common knowledge. The IA does have a subtle advantage in such cases, since its PO allows one to order attributes according to their degree of perceptibility, avoiding a sharp, and ultimately arbitrary, distinction between perceptible and imperceptible. What the IA cannot do, however, is make eye color more highly preferred for referents nearby than for referents further afield: It must always be preferred to the same degree. The fact that different POs can be selected for different domain types and text genres does not alter this.

We want to make a plea for algorithms that determine dynamically which attributes to select for inclusion into a description, based on features of the situation. Discriminatory power can play a role, but so can the extremity of a property (see below; also van Deemter, 2006, and Section 4.1), intentional influences (Jordan, 2000a,b), and alignment (Goudbeek et al., 2009). (A framework suitable for combining several factors into one ‘‘cost’’ is Krahmer et al., 2003.) To show what we have in mind, let us say a bit more about one factor.

The idea of the extremity of a property can be traced back to some well-designed but little-known experiments (Hermann & Deutsch, 1976). When participants were shown pairs of candles that differed in height and width, it turned out that, when asked to refer to a candle that was both taller and fatter than its distractor, speakers overwhelmingly did this by expressing whichever of the two dimensions made the target ‘‘stand out’’ more: When the target candle had a width of 50 mm and a height of 120 mm, for example, while the distractor had a width of 25 mm and a height of 100 mm, speakers referred to the target as ‘‘dick’’ (fat), rather than ‘‘lang’’ (tall), because the relative difference between 50 and 25 mm is greater than that between 120 and 100 mm. Since both properties (i.e., width and height) would have ruled out the only available distractor, their discriminatory power in this situation is equal, yet there was a tendency for speakers to express the more extreme property. Results of this kind are difficult to replicate using a fixed PO. The study of GRE algorithms that avoid a rigid PO and select their properties ‘‘dynamically’’—making use of a combination of discriminatory power, extremity, and other factors—appears to us to be one of the most promising directions of work on GRE at the moment.
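
As a toy illustration of the kind of computation that a fixed PO cannot express, consider the candle example, with the relative difference between target and distractor as one possible (and here entirely our own) operationalization of extremity:

    def more_extreme_dimension(target, distractor):
        """Pick the dimension on which the target stands out most, measured here
        as the ratio of the target's value to the distractor's value."""
        ratios = {dim: target[dim] / distractor[dim] for dim in target}
        return max(ratios, key=ratios.get), ratios

    target = {"width_mm": 50, "height_mm": 120}
    distractor = {"width_mm": 25, "height_mm": 100}
    print(more_extreme_dimension(target, distractor))
    # -> ('width_mm', {'width_mm': 2.0, 'height_mm': 1.2}): say ''the fat candle''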

Footnotes
  • 1
  • 2
  • 3
  • 4

     A different focus would have been possible, for example, because one value (e.g., mammal) may be more general than another value (e.g., dog) of the same attribute. (In Section 4.2.2 we shall encounter this phenomenon.) The IA assumes that the different values of an attribute are always roughly equally preferred, with the choice between them depending on matters such as their discriminatory value (i.e., not on a fixed PO). This assumption can be questioned—if an elephant is pink, its color is more worth mentioning than if it is gray—and this can motivate a different IA, whose PO defines an ordering on properties (i.e., combinations of an attribute and a value) rather than on attributes. The choice of the best value of an attribute has, to the best of our knowledge, never been empirically investigated. See Dale and Reiter (1995) (the FindBestValue function) for an algorithm and van Deemter (2002) for a logical analysis focusing on problems that arise when the values of an attribute overlap. See also Section 8.

  • 5

     Also worth mentioning is Passonneau (1995), which focused on reference in discourse in the context of the pear stories of Chafe (1980), where a number of algorithms in the Gricean tradition were compared with approaches based on centering.

  • 6

     The study was carried out in the context of tuna-reg’08, the second round of STEC using this corpus (Gatt & Belz, 2008). For this STEC, two new test sets were generated by partially reproducing the original tuna methodology. These test sets, which feature in parts of the analysis presented in Sections 5 and 6, are briefly described in Section 4. The study by di Fabbrizio et al. was published after our own studies based on the tuna corpus described in Section 4 (Gatt, van der Sluis, & van Deemter, 2007; van der Sluis, Gatt, & van Deemter, 2007).

  • 7

    Koolen, Gatt, Goudbeek, and Krahmer (2009) replicated the tuna experiment with Dutch speakers, manipulating communicative setting: Some participants were in a non-dialog setting while others were in a dialog setting involving a confederate. In terms of attribute overspecification, no effect of communicative setting was found.

  • 8

    http://stims.cnbc.cmu.edu/Image%20Databases/TarrLab/Objects/TheObjectDatabank.zip. Compare the coconut experiment, where subjects initially saw written descriptions (e.g., ‘‘table-high yellow$400’’) instead of pictures.

  • 9

     Locative attributes were not used in this calculation, as they were randomly determined.

  • 10

    van der Sluis, Gatt, and van Deemter (2006) describes the scheme for manual annotation. The annotated text was processed automatically to produce the representation discussed in Section 4.5.

  • 11

     We compare means for singular descriptions in addition to the overall means on the tuna data because the tuna-reg’08 dataset consisted of singulars only.

  • 12

     Each algorithm was assigned a number indicating its rank, where 1 is the top-ranked (best-performing) algorithm.

  • 13

     Both these calculations are formulated in terms of properties, as if they were primitive entities rather than combinations of an attribute and a value. Although the original IA was incremental in its consideration of attributes, this is not the case for its consideration of values: Given an attribute, the algorithm looked for the most suitable value of this attribute (i.e., using its function FindBestValue). See also note 8.

  • 14

     See Section 2 for illustration. The claim in the text holds only for uniquely identifying descriptions that do not contain any logically superfluous properties.

  • 15

     An example is Paraboni, van Deemter, and Masthoff (2007), where reference in complex, hierarchically structured domains was studied. The authors found that no single IA was able to generate the types of elaborate descriptions (‘‘picture 5 in Section 3,’’ rather than ‘‘picture 5,’’ where the latter was minimally distinguishing) that were most preferred by both authors and readers. Other algorithms, designed to produce carefully over-specified descriptions, proved superior to the IA in such situations.

  • 16

     The same is true if the evidence is different across the different values of an attribute, in which case the IA will operate with a PO defined over properties (i.e., combinations of an attribute and a value) instead of attributes. See also note 8.

References

  • Anderson, A., Bader, M., Bard, E., Boyle, E., Doherty, G. M., Garrod, S., Isard, S., Kowtko, J., McAllister, J., Miller, J., Sotillo, C., Thompson, H. S., & Weinert, R. (1991). The hcrc Map Task corpus. Language and Speech, 34, 351–366.
  • Appelt, D. (1985). Planning English referring expressions. Artificial Intelligence, 26(1), 1–33.
  • Appelt, D., & Kronfeld, A. (1987). A computational model of referring. In Proceedings of the 10th International Joint Conference on Artificial Intelligence (IJCAI-87) (pp. 640–647). Milan: Morgan Kaufmann.
  • Areces, C., Koller, A., & Striegnitz, K. (2008). Referring expressions as formulas of description logic. In Proceedings of the 5th International Conference on Natural Language Generation (INLG-08).
  • Arnold, J. E. (2008). Reference production: Production internal and addressee-oriented processes. Language and Cognitive Processes, 23(4), 495–527.
  • Arts, A. (2004). Overspecification in instructive texts. PhD thesis, University of Tilburg.
  • Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics (survey article). Computational Linguistics, 34(4), 555–596.
  • Belke, E., & Meyer, A. (2002). Tracking the time course of multidimensional stimulus discrimination: Analysis of viewing patterns and processing times during same-different decisions. European Journal of Cognitive Psychology, 14(2), 237–266.
  • Belz, A., & Gatt, A. (2007). The attribute selection for gre challenge: Overview and evaluation results. In Proceedings of UCNLG+MT: Language Generation and Machine Translation.
  • Belz, A., & Gatt, A. (2008). Intrinsic vs. extrinsic evaluation measures for referring expression generation. In Proceedings of 46th Annual Meeting of the Association for Computational Linguistics (ACL-08).
  • Bohnet, B. (2007). is-fbn, is-fbs, is-iac: The adaptation of two classic algorithms for the generation of referring expressions in order to produce expressions like humans do. In Proceedings of the Language Generation and Machine Translation Workshop (UCNLG+MT) at MT Summit XI.
  • Bohnet, B. (2008). The fingerprint of human referring expressions and their surface realization with graph transducers. In Proceedings of the 5th International Conference on Natural Language Generation (INLG-08).
  • Brennan, S., & Clark, H. (1996). Conceptual pacts and lexical choice in conversation. Journal of Experimental Psychology, 22(6), 1482–1493.
  • Campana, E., Tanenhaus, M., Allen, J., & Remington, R. (2004). Evaluating cognitive load in spoken language interfaces using a dual-task paradigm. In Proceedings of the 9th International Conference on Spoken Language Processing (ICSLP’04).
  • Carletta, J. (1996). Assessing agreement on classification tasks: The kappa statistic. Computational Linguistics, 22(2), 249–254.
  • Carletta, J., & Mellish, C. (1996). Risk-taking and recovery in task-oriented dialogues. Journal of Pragmatics, 26, 71–107.
  • Chafe, W. L. (Ed.) (1980). The pear stories: Cognitive, cultural, and linguistic aspects of narrative production. Norwood, NJ: Ablex.
  • Clark, H., & Wilkes-Gibbs, D. (1986). Referring as a collaborative process. Cognition, 22, 1–39.
  • Dale, R. (1989). Cooking up referring expressions. In Proceedings of 27th Annual Meeting of the Association for Computational Linguistics, ACL-89.
  • Dale, R., & Haddock, N. (1991). Generating referring expressions containing relations. In Proceedings of 5th Conference of the European Chapter of the Association for Computational Linguistics.
  • Dale, R., & Reiter, E. (1995). Computational interpretation of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19(8), 233–263.
  • van Deemter, K. (2002). Generating referring expressions: Boolean extensions of the incremental algorithm. Computational Linguistics, 28(1), 37–52.
  • van Deemter, K. (2006). Generating referring expressions that involve gradable properties. Computational Linguistics, 32(2), 195–222.
  • van Deemter, K., van der Sluis, I., & Gatt, A. (2006). Building a semantically transparent corpus for the generation of referring expressions. In Proceedings of 4th International Conference on Natural Language Generation (Special Session on Data Sharing and Evaluation), INLG-06.
  • Eikmeyer, H. J., & Ahlsèn, E. (1996). The cognitive process of referring to an object: A comparative study of German and Swedish. In Proceedings of 16th Scandinavian Conference on Linguistics.
  • Engelhardt, P. E., Bailey, K., & Ferreira, F. (2006). Do speakers and listeners observe the Gricean maxim of quantity? Journal of Memory and Language, 54, 554–573.
  • di Fabbrizio, G. D., Stent, A. J., & Bangalore, S. (2008a). Referring expression generation using speaker-based attribute selection and trainable realization (att-reg). In Proceedings of 5th International Conference on Natural Language Generation (INLG’08) (pp. 211–214).
  • di Fabbrizio, G. D., Stent, A. J., & Bangalore, S. (2008b). Trainable speaker-based referring expression generation. In Proceedings of 12th Conference on Computational Natural Language Learning (CONLL’08) (pp. 151–158).
  • Ford, W., & Olson, D. (1975). The elaboration of the noun phrase in children's object descriptions. Journal of Experimental Child Psychology, 19, 371–382.
  • Gardent, C. (2002). Generating minimal definite descriptions. In Proceedings of 40th Annual Meeting of the Association for Computational Linguistics, ACL-02.
  • Gatt, A., & Belz, A. (2008). The tuna challenge 2008: Overview and evaluation results. In Proceedings of the 5th International Conference on Natural Language Generation (INLG-08).
  • Gatt, A., & Belz, A. (2010). Introducing shared task evaluation to nlg: The tuna shared task evaluation challenges. In E. Krahmer & M. Theune (Eds.), Lecture Notes in Computer Science, Vol 14. Empirical methods in natural language generation (pp. 264–293). New York: Springer.
  • Gatt, A., Belz, A., & Kow, E. (2008a). The tuna challenge 2008: Overview and evaluation results. In Proceedings of 5th International Conference on Natural Language Generation, INLG-08.
  • Gatt, A., & van Deemter, K. (2007). Incremental generation of plural descriptions: Similarity and partitioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP-07.
  • Gatt, A., van der Sluis, I., & van Deemter, K. (2007). Evaluating algorithms for the generation of referring expressions using a balanced corpus. In Proceedings of the 11th European Workshop on Natural Language Generation, ENLG-07.
  • Gatt, A., van der Sluis, I., & van Deemter, K. (2008b). Xml format guidelines for the tuna corpus. Technical report, Computing Science, University of Aberdeen, Available at: http://staff.um.edu.mt/albert.gatt/pubs/tunaFormat.pdf.
  • Goudbeek, M., Krahmer, E., & Swerts, M. (2009). Alignment of (dis)preferred properties during the production of referring expressions. In Proceedings of Workshop ‘‘Production of Referring Expressions: Bridging the gap between computational and empirical approaches to reference’’ (PRE-CogSci 2009).
  • Grice, H. (1975). Logic and conversation. In P. Cole & J. Morgan (Eds.), Syntax and semantics: Speech acts (Vol. III, pp. 41–58). New York: Academic Press.
  • Gupta, S., & Stent, A. J. (2005). Automatic evaluation of referring expression generation using corpora. In Proceedings of 1st Workshop on Using Corpora in NLG, Birmingham, UK.
  • Hermann, T., & Deutsch, W. (1976). Psychologie der Objektbenennung. Bern: Huber.
  • Hervás, R., & Gervás, P. (2009). Evolutionary and case-based approaches to reg. In Proceedings of 12th European Workshop on Natural Language Generation (ENLG-09).
  • Horacek, H. (1997). An algorithm for generating referential descriptions with flexible interfaces. In Proceedings of 35th Annual Meeting of the Association for Computational Linguistics, ACL-97, Madrid (pp. 206–213).
  • Horacek, H. (2004). On referring to sets of objects naturally. In Proceedings of 3rd International Conference on Natural Language Generation, INLG-04.
  • Jordan, P., & Walker, M. (2000). Learning attribute selections for non-pronominal expressions. In Proceedings of 38th Annual Meeting of the Association for Computational Linguistics.
  • Jordan, P. W. (2000a). Influences on attribute selection in redescriptions: A corpus study. In Proceedings of the Cognitive Science Conference.
  • Jordan, P. W. (2000b). Intentional influences on object redescriptions in dialogue: Evidence from an empirical study. PhD thesis, University of Pittsburgh.
  • Jordan, P. W. (2002). Contextual influences on attribute selection for repeated descriptions. In K. van Deemter & R. Kibble (Eds.), Information sharing: Reference and presupposition in natural language generation and understanding (pp. 295–328). Stanford, CA: CSLI Publications.
  • Jordan, P. W., & Walker, M. (2005). Learning content selection rules for generating object descriptions in dialogue. Journal of Artificial Intelligence Research, 24, 157–194.
  • Kelleher, J. D., & Kruijff, G.-J. (2006). Incremental generation of spatial referring expressions in situated dialog. In Proceedings of joint 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, ACL/COLING-06.
  • Keysar, B., Lin, S., & Barr, D. J. (2003). Limits on theory of mind use in adults. Cognition, 89, 25–41.
  • King, J. (2008). OSU-GP: Attribute selection using genetic programming. In Proceedings of the 5th International Conference on Natural Language Generation (INLG-08).
  • Koolen, R., Gatt, A., Goudbeek, M., & Krahmer, E. (2009). Need I say more? On factors causing referential overspecification. In Proceedings of Workshop ‘‘Production of Referring Expressions: Bridging Computational and Psycholinguistic Approaches’’ (pre-cogsci’09).
  • Krahmer, E., & Theune, M. (2002). Efficient context-sensitive generation of referring expressions. In K. van Deemter & R. Kibble (Eds.), Information sharing: Reference and presupposition in language generation and interpretation (pp. 223–264). Stanford, CA: CSLI.
  • Krahmer, E., & van Deemter, K. (2011). Computational generation of referring expressions: A survey, in preparation. Available at: http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00088. Accessed September 23, 2011.
  • Krahmer, E., van Erk, S., & Verleg, A. (2003). Graph-based generation of referring expressions. Computational Linguistics, 29(1), 53–72.
  • Krauss, R., & Weinheimer, S. (1964). Changes in reference phrases as a function of frequency of usage in social interaction: A preliminary study. Psychonomic Science, 1, 113–114.
  • Krippendorff, K. (1980). Content analysis. Newbury Park, CA: Sage Publications.
  • Kronfeld, A. (1989). Conversationally relevant descriptions. In Proceedings of 27th Annual Meeting of the Association for Computational Linguistics, ACL-89.
  • Levelt, W. M. J. (1989). Speaking: From intention to articulation. Cambridge, MA: MIT Press.
  • de Lucena, D., & Paraboni, I. (2008). usp-each frequency-based greedy attribute selection for referring expressions generation. In Proceedings of 5th International Conference on Natural Language Generation (INLG’08) (pp. 219–220).
  • Maes, A., Arts, A., & Noordman, L. (2004). Reference management in instructive discourse. Discourse Processes, 37(2), 117–144.
  • Mangold, R., & Pobel, R. (1988). Informativeness and instrumentality in referential communication. Journal of Language and Social Psychology, 7(3–4), 181–191.
  • Oberlander, J. (1998). Do the right thing ... but expect the unexpected. Computational Linguistics, 24(3), 501–507.
  • Olson, D. R. (1970). Language and thought: Aspects of a cognitive theory of semantics. Psychological Review, 77, 257–273.
  • Paraboni, I., van Deemter, K., & Masthoff, J. (2007). Generating referring expressions: making referents easy to identify. Computational Linguistics, 33(2), 229–254.
  • Passonneau, R. (2006). Measuring agreement on set-valued items (masi) for semantic and pragmatic annotation. In Proceedings of 5th International Conference on Language Resources and Evaluation, LREC-2006.
  • Passonneau, R. J. (1995). Integrating Gricean and attentional constraints. In Proceedings of 14th International Joint Conference on Artificial Intelligence (IJCAI-95).
  • Pechmann, T. (1984). Accentuation and redundancy in children's and adults’ referential communication. In H. Bouma & D. G. Bouwhuis (Eds.), Attention and performance (Vol. 10, pp. 417–432). Hillsdale, NJ: Lawrence Erlbaum.
  • Pechmann, T. (1989). Incremental speech production and referential overspecification. Linguistics, 27, 89–110.
  • Piwek, P. (2009). Salience and pointing in multimodal reference. In Proceedings of Workshop ‘‘Production of Referring Expressions: bridging the gap between computational and empirical approaches to generating reference’’ (PRE-CogSci’09).
  • Reiter, E. (1990). The computational complexity of avoiding conversational implicatures. In Proceedings 28th Annual Meeting of the Association for Computational Linguistics.
  • Reiter, E., & Dale, R. (2000). Building natural language generation systems. Cambridge, UK: Cambridge University Press.
  • Schriefers, H., & Pechmann, T. (1988). Incremental production of referential noun phrases by human speakers. In M. Zock & G. Sabah (Eds.), Advances in natural language generation (Vol. 1, pp. 172–179). London: Pinter.
  • Sedivy, J., Tanenhaus, M., Chambers, C., & Carlson, G. (1999). Achieving incremental semantic interpretation through contextual representation. Cognition, 71, 109–147.
  • Siddharthan, A., & Copestake, A. (2004). Generating referring expressions in open domains. In Proceedings of 42nd Annual Meeting of the Association for Computational Linguistics, ACL-04.
  • van der Sluis, I., Gatt, A., & van Deemter, K. (2006). Manual for the tuna corpus: Referring expressions in two domains. Technical report, University of Aberdeen. Available at: http://www.abdn.ac.uk/~csc264/TunaCorpusManual/. Accessed January 18, 2011.
  • van der Sluis, I., Gatt, A., & van Deemter, K. (2007). Evaluating algorithms for the generation of referring expressions: Going beyond toy domains. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP-07). Borovets, Bulgaria.
  • Spanger, P., Kurosawa, T., & Tokunaga, T. (2008). TITCH: Attribute selection based on discrimination power and frequency. In Proceedings of 5th International Conference on Natural Language Generation (INLG-08).
  • Stone, M. (2000). On identifying sets. In Proceedings of 1st International Conference on Natural Language Generation, INLG-00.
  • Stone, M., Doran, C., Webber, B., Bleam, T., & Palmer, M. (2003). Microplanning with communicative intentions: The spud system. Computational Intelligence, 19(4), 311–381.
  • von Stutterheim, C., Mangold-Allwinn, R., Barattelli, S., Kohlmann, U., & Kölbing, H.-G. (1993). Reference to objects in text production. Belgian Journal of Linguistics, 8, 99–125.
  • Theune, M., Touset, P., Viethen, J., & Krahmer, E. (2007). Cost-based attribute selection for generating referring expressions (graph-fp and graph-sc). In Proceedings of UCNLG+MT: Language Generation and Machine Translation (pp. 95–97).
  • Turing, A. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460.
  • Turner, R., Sripada, S., Reiter, E., & Davy, I. (2007). Selecting the content of textual descriptions of geographically located events in spatio-temporal weather data. In Proceedings of the Conference on Applications and Innovations in Intelligent Systems XV (AI-07), Cambridge, UK.
  • Van der Sluis, I., & Krahmer, E. (2004). The influence of target size and distance on the production of speech and gesture in multimodal referring expressions. In Proceedings of ICSLP-2004, October 4–8, Jeju, Korea.
  • Viethen, J., & Dale, R. (2006). Algorithms for generating referring expressions: Do they do what people do?. In Proceedings of 4th International Conference on Natural Language Generation, INLG-06.
  • Whitehurst, G., & Sonnenschein, S. (1978). The development of communication: Attribute variation leads to contrast failure. Journal of Experimental Child Psychology, 25, 490–504.
  • Whitehurst, G. J. (1976). The development of communication: Changes with age and modeling. Child Development, 47, 473–482.