Introducing a cognitive methodology for the automated detection of connotative meaning in text



Pilot results are used to illustrate the potential of an integrated multi-disciplinary approach to semantic text analysis that combines cognitive-oriented human subject experimentation with Machine Learning (ML) based Natural Language Processing (NLP). The goal of this work is to automate the recognition of connotative meaning in text using a range of linguistic and non-linguistic features. This methodology combines human evaluations of text with automated processing of text in order to address social, cognitive and linguistic aspects of connotative meaning. The design of this study places the human in the loop at the point of identifying the degree of connotation present within a given text, rather than at the point of expert linguistic analysis. Findings from the pilot study show that consensus between human subjects regarding the presence of connotative meaning in text can be achieved, that the degree of connotation present in a given text can be described on a continuum, and that a system can be implemented to recognize extreme cases of connotation (or its opposite, denotation). The protocol used in this pilot study was run using a relatively modest number of human evaluators and set of data. Based on the extensibility of methodology and the promising findings described here, the reliability of the data used to train the ML system can be easily improved by increasing the size of both the human subject sample and the data set.


We glean more from text than what can be explicitly linked back to the definitions of individual words and phrases, or parse trees of the sentences, or the named entities that are identified. Explicit content has been well investigated and the Information Extraction field has made excellent advances in identifying, tagging, and then utilizing named entities, events, and relations in question-answering, summarization, and other information access applications. However, there are other, more amorphous aspects of meaning such as a sincere apology, an urgent request for help, a serious warning, or a perception of personal threat, etc. which humans take away from text, but are not currently being exploited by even the most advanced text analysis systems. While humans tap into this additional level of understanding quite intuitively, it is beyond what Natural Language Processing (NLP) can currently accomplish. For example, after reading a note of apology, one can say, “He does not seem remorseful.” or “He shows remorse.” But can a system be enabled to accomplish this?

The complex nature of this problem requires a highly interdisciplinary methodology. In order to make progress in this area, a range of expertise needs to be applied at various points throughout the duration of data collection and analysis. For example, the automated understanding of connotative meaning of text requires a deep understanding of: the mechanics of how meaning is conveyed in text; the individual and social behaviors involved in the disambiguation of language by humans; as well as a detailed understanding of machine learning and its potential to be applied to text processing. While progress can certainly be made if these elements are investigated separately, we believe that by implementing an integrated methodology, even greater results can be achieved.

Thus, merging cognitive and social cognitive psychology research with sophisticated machine learning could arguably extend current NLP systems. This study combines human subject experimentation and NLP / machine learning experiments into an integrated methodology. Building from a strong tradition of empirical methodologies in NLP research (Liddy, 2003), this research encapsulates an end-to-end research methodology that begins by: 1) establishing a human-understanding baseline for the distinction between connotative and denotative meaning; 2) extending the analysis of the mechanics of literal versus non-literal meaning by applying NLP tools to the human-annotated text, and then; 3) using these cumulative results to train a machine learning system to recognize the potential for connotative meaning at the sentence level, across a much broader corpus. This paper describes the preliminary iteration of this methodology and suggests ways that this approach could be improved for future applications. It also provides a model for future work that seeks to make a similar bridge between psychological investigation and system building.

Related Research

This exploratory work contributes to a growing base of NLP research that investigates implied meaning in text. Recent research in this area focuses on identifying and annotating the affective or emotional attitude expressed in text (Liu, Lieberman, & Selker, 2003; Mihalcea & Liu, 2006; Rubin, Liddy & Kando, 2006; Shaikh, Helmut, & Ishizuka, 2006; Strapparava & Mihalcea, 2008; Strapparava & Valitutti, 2004; Valitutti, Strapparava, & Stock, 2004). Other researchers are working on the automated detection and labeling of sentiment, opinion and certainty (Balog, Mishne, & de Rijke, 2006; Liu & Maes, 2004; Wilson, Wiebe, & Hwa, 2004). We see all of these phenomena as variations of a wider category of interpretation, namely connotative meaning.

In relation to this, work on opinion and evaluative language that is being conducted by Wiebe and associates is particularly noteworthy (Riloff, Wiebe, & Wilson, 2003; Shanahan, Qu, & Wiebe, 2006; Wiebe & Mihalcea, 2006; Wiebe, Wilson, & Cardie, 2005; Wilson, Wiebe, & Hoffmann, 2005; Wilson, Wiebe, & Hwa, 2004). Unlike other studies in which certain words are assumed to universally signal standardized states of mind (such as happy, sad, angry), the approach taken in their research is to evaluate the text in terms of who was speaking and in what context. Their annotation scheme relies on identifying frames of reference and on the disambiguation of nested phrases in order to accurately assign subjective or non-literal attributes to text expressing opinion, sentiment and emotion (Wiebe, Wilson, & Cardie, 2005). The annotation scheme designed and implemented by this team not only creates a baseline standard for human-labeling of the phenomena of interest (necessary for statistical and machine-learning approaches which still rely on human annotation at some point in the system development process), but the scheme also takes into consideration the effects of context on the interpretation of meaning.

Analytic framework: A cognitive approach

We view an excerpt of text to be a stimulus, albeit much more complex than most stimuli used in typical psychological experiments. The meaning of any excerpt of text is tied to a constructive cognitive process that is heavily influenced by previous experience and cues, or features, embedded within the text. It is here that we would like to gain some insight into the notion of connotative meaning and view the work to be done in this area as highly exploratory in regard to where and how various text features are most revelatory. We believe that insight into the connotative meaning of text lies at the intersection of these perspectives and that the nature of this intersection should be explored.

Our goal is to gain a better understanding of:

  • (1) what features are attended to when the text is being interpreted, and
  • (2) which of these features are most salient.

One's ability to derive connotative meaning from text is a behavior that is learned, becoming intuitive, we posit, in much the same way that an individual learns any skill or behavior. When this process of attending and learning is repeated across instances, specific skills become more automatic, or reliable (Logan, 1988; Wyer & Bargh, 1997). This process is considered to be constructive and episodic in nature, yet heavily dependent upon "cues" that work to draw or focus one's attention (D'Eredita & Barreto, 2006). Furthermore, research on communities suggests that the meaning of an artifact (e.g., a specific excerpt of text) is heavily influenced by how it is used in practice (Wenger, 1999). The meaning of text is constructed in a similar manner. Members of a speech community tend to make similar assumptions, or inferences. The mechanics of making such inferences are scaled to the amount of contextual information provided. Our preliminary research suggests that when presented with a sentence that is out of context, an individual seemingly makes assumptions about one or all of the following: who created the text, the context from which it was pulled, and the intended substance of the message. Each of these pieces potentially enables an individual to constrain the artifact to a relatively small and functional set of possible interpretations.



The pilot research conducted in order to investigate how connotative meaning is conveyed and recognized was funded through the Advanced Question and Answering for Intelligence (AQUAINT) Project of the U.S. federal government's Intelligence Advanced Research Projects Activity (IARPA) Office. Funded as an exploratory 'Blue Sky' project, this award enabled us to develop an extensible experimental setup using human participants to establish baseline measures, and to make progress towards training a machine learning recognizer system. This research has resulted in an informed estimation of the challenges inherent in research of this nature, as well as some promising directions for future work.

We approached the pilot phases as a purely exploratory proof-of-concept endeavor. Of primary interest was implementing the interdisciplinary methodology from start to finish, identifying strengths and weaknesses of the approach along the way. For that reason, preliminary evaluations have resulted in highly suggestive, rudimentary findings, which we feel contribute greatly to establishing an extensible methodology that can only be improved in future iterations.


Blog text was used as the corpus for this research, chosen due to its increasing prevalence as a communicative genre. Sentences were deemed the most practical and fruitful unit of analysis because words and even phrases were considered too restrictive, and pieces of text spanning more than one sentence too unwieldy. A single sentence presented enough context while still allowing for a wide range of interpretations. Sentences were randomly selected from a pool of texts automatically extracted from blogs, using a crawler set with keywords such as "oil", "Middle East" or "Iraq." Topics were selected with the intention of narrowing the range of vocabulary used in order to aid the initial machine learning experiments.

Preliminary phase

The initial research began by seeking to determine whether people actually agreed about whether there were implicit meanings associated with a given piece of text. To start, we conducted a series of eight semi-structured, face-to-face interviews. Individuals were presented with 20 sentences selected to include some texts that were expected to be highly connotative as well as some texts expected to be highly denotative. Each interviewee was asked to exhaustively share all possible meanings they could derive from the stimulus text, while also pinpointing what it was about the text that led them to make these conclusions. These responses provided us with some insights into the features people paid attention to when presented with a sentence that was out of context. This protocol also allowed us to test different phrasing of the prompts and to get feedback from participants regarding the clarity of the questions posed.

Next, an open-ended, on-line instrument was created using a similar protocol wherein we presented a series of 20 sentences to participants (N=193) and, for each stimulus text, asked: 1) “What does this sentence suggest?” & “What makes you think this?”; and 2) “What else does this sentence suggest?” & “What makes you think this?” Again, responses were also evaluated to determine any bias that might have resulted from the way in which the prompts were phrased.

Upon content analysis of the responses, we found that while the interpretations of the text were relatively idiosyncratic, how people allocated their attention was more consistent. Most people tended to make assumptions about the (1) author (addressing who created the artifact), (2) context (addressing from where the sentence was taken) and/or (3) intended meaning of the words. We interpreted this to mean that these three areas were potentially important for identifying inferred meanings of texts.

Design of pilot experiment

Next, our efforts focused on designing a reusable and scalable online evaluation tool that would allow us to gather more judgments from a larger sample of subjects using a much larger pool of stimulus texts. Scaling up the human evaluations also allowed us to distinguish between responses that were systematically patterned and those that were more idiosyncratic (or random). That is, are there ‘denotative’ sentences that people consistently interpret differently from ‘connotative’ sentences, or is the interpretation of implied meaning solely based on individual experience and larger context?

To answer this research question, a forced-choice protocol was developed for the evaluation of sentences, in which participants were presented with two sentences and asked to make a choice. In order to protect against bias, half the participants were asked to identify the sentence that provided more of an opportunity to read between the lines, while the other half were asked to select the sentence that provided less of an opportunity to read between the lines. The phrase ‘read between the lines’ was specifically chosen as a result of feedback during preliminary testing. This common language phrase was found to be understandable by a wide array of participants and could be consistently applied to the type of meaning that we are interested in.

We piloted the system with randomly selected samples of both sentences and participants, with the intent to eventually make direct comparisons among more controlled samples of sentences and participants. This has direct implications for the evaluation phase of our pilot. Because sentences were selected at random, without guarantee of a certain number of each type of sentence, our goal was to achieve results on a par with chance. Anything else would reveal systematic bias in the experimental design or implementation. This also provides us with a baseline for future investigations where the stimulus text could be more willfully controlled.

According to the forced-choice design, each online participant was first presented with a series of 32 pairs of sentences, one pair at a time, and asked to identify the sentence that provided more of an opportunity to read between the lines (Table 1). Half the participants were presented with a positive prompt (which sentence provides the most opportunity) and half were presented with a negative prompt (which sentence provides the least opportunity). Positive / negative assignment was determined randomly.

Table 1. Illustration of forced-choice design for sentence evaluation (8 sentences)

This first set of 32 pairs was considered Round 1. The 32 sentences selected in Round 1 were then reassigned into 16 new pairs and presented to participants in Round 2. This process continued for one more round, with Round 3 containing 8 sentences in 4 pairs. After the three rounds were completed, each sentence received a score as follows (negative values indicate the denotative prompt):

  • 0 if it was never selected

  • -1/1 if it did not get past Round 2

  • -2/2 if it did not get past Round 3

  • -3/3 if it was one of the 4 remaining sentences after Round 3 was completed.

As indicated by Table 1, this scheme enables a single person to evaluate, for example, 8 sentences by making just three forced-choice selections. In the pilot study, a forced-choice scenario required a sample of only 13 participants in order to evaluate 832 sentences. We sought eight individual human evaluations per sentence in order to mitigate bias caused by individual differences. This was an improvement over designs we had previously considered and tested, as the forced-choice setup increased the number of sentences and the number of evaluations per sentence, allowing us to increase the reliability of our findings.
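The scoring scheme above can be sketched in a few lines of Python. This is a minimal illustration rather than the system used in the pilot: the function name `score_sentences`, the `choose` callback standing in for a participant's judgment, and the adjacent re-pairing of selected sentences between rounds are all assumptions made for the example. It uses the 8-sentence case illustrated in Table 1.

```python
def score_sentences(sentences, choose, rounds=3, sign=1):
    """Score sentences over successive forced-choice rounds.

    sentences : list of sentence identifiers (length assumed to be a power of two)
    choose    : callable (a, b) -> the sentence the participant selects
    rounds    : number of rounds to run (three in the pilot)
    sign      : +1 for the positive prompt, -1 for the negative prompt
    """
    scores = {s: 0 for s in sentences}
    remaining = list(sentences)
    for _ in range(rounds):
        if len(remaining) < 2:
            break
        # Pair up adjacent sentences (an assumption; the pilot re-paired them
        # in a way not specified here).
        pairs = list(zip(remaining[0::2], remaining[1::2]))
        remaining = [choose(a, b) for a, b in pairs]
        # Each selection moves a sentence one step further from zero.
        for s in remaining:
            scores[s] += sign
    return scores


# Hypothetical participant who always selects the later-labelled sentence.
demo_sentences = [f"s{i}" for i in range(1, 9)]
demo_scores = score_sentences(demo_sentences, lambda a, b: max(a, b))
negative_scores = score_sentences(demo_sentences, lambda a, b: max(a, b), sign=-1)
```

Under this sketch a sentence's absolute score equals the number of rounds in which it was selected, so half of the sentences end at 0 and exactly one of the eight reaches 3 (or -3 under the negative prompt).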


Evaluation of text by human subjects

In the first iteration of the pilot setup, each of 832 sentences was viewed by six different participants, three assigned to a positive group and three to a negative group, as described above. Volunteers were recruited from the population of graduate students at the School of Information Studies through a recruitment message sent to listservs. Because of the forced-choice design of the protocol, the overall number of respondents was less important than the number of evaluations made of each of the 832 sentences. Although we fell short of our goal of eight passes for each sentence, we were able to successfully gather six evaluations for each of the 832 sentences.

Each evaluation resulted in a score. The denotative condition ranged in ratings from 0 to −3 while the connotative condition ranged in ratings from 0 to 3. These were then averaged to achieve an overall score for each sentence. It is important to note that this averaged score for each sentence replaces a typical inter-coder reliability metric.
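As a rough illustration of the averaging step, the sketch below combines six ratings per sentence (three from each condition) into one overall score. The sentence identifiers and rating values are hypothetical and chosen only to show the arithmetic; they are not data from the pilot.

```python
def overall_score(ratings):
    """Average the signed ratings a sentence received across both conditions."""
    if not ratings:
        raise ValueError("a sentence needs at least one rating")
    return sum(ratings) / len(ratings)


# Hypothetical ratings: three from the connotative condition (0 to 3)
# followed by three from the denotative condition (0 to -3).
hypothetical_ratings = {
    "sentence_a": [3, 3, 2, 0, 0, 0],     # leans connotative
    "sentence_b": [0, 0, 0, -3, -3, -2],  # leans denotative
    "sentence_c": [1, 0, 2, -1, 0, -2],   # no clear tendency
}

overall = {sid: overall_score(r) for sid, r in hypothetical_ratings.items()}
```

Averages far from zero indicate agreement across conditions on where a sentence sits, while averages near zero indicate either low ratings throughout or ratings that cancel out.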

Because the sentences were randomly selected, each had a predictable chance of ultimately being identified as connotative or denotative. So every sentence in our corpus has a:

  • 50% chance of not making it out of Round 1, receiving a score of 0

  • 50% * 50%, or 25% chance of not making it past Round 2, receiving a score of −1/1

  • 50% * 50% * 50%, or 12.5% chance of not making it past Round 3, receiving a score of −2/2
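The chance baseline listed above can be written out as a short calculation. The sketch below is illustrative only: the function names are hypothetical, scores are treated as absolute values within a single condition, and the chi-square statistic is one possible way (not necessarily the one used in the pilot) to compare observed score counts against chance. It also makes explicit the remaining 12.5% of sentences that survive all three rounds and receive a score of ±3.

```python
def chance_distribution(rounds=3):
    """Probability of each absolute score if every selection is a coin flip."""
    # A sentence ends with score k if it is selected k times and then passed over.
    dist = {score: 0.5 ** (score + 1) for score in range(rounds)}
    # Sentences selected in every round receive the maximum score.
    dist[rounds] = 0.5 ** rounds
    return dist


def expected_counts(n_sentences, rounds=3):
    """Expected number of sentences at each score under chance."""
    return {s: p * n_sentences for s, p in chance_distribution(rounds).items()}


def chi_square(observed, expected):
    """Pearson chi-square statistic comparing observed and expected counts."""
    return sum((observed.get(s, 0) - e) ** 2 / e for s, e in expected.items())


baseline = chance_distribution()   # {0: 0.5, 1: 0.25, 2: 0.125, 3: 0.125}
expected = expected_counts(832)    # {0: 416.0, 1: 208.0, 2: 104.0, 3: 104.0}
```

A statistic near zero would indicate that the observed distribution of scores is close to the chance outcome; larger values would point to a systematic departure from it.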

Thus, we are able to directly observe whether we get an overall chance outcome or not by simply adding up the numbers across sentences and rounds. Having established a baseline based on chance, we can next control for various features and evaluate the relative impact as systematic differences from the baseline. In other words, we will be able to say with a relatively high degree of certainty that a particular characteristic or feature (sentence structure, behavior, etc.) was responsible for skewing the odds in a reliable manner, because we will be able to control for these variables across various experimental scenarios and compare results both across scenarios and against a chance outcome. This, combined with improved validity resulting from an increased number of human judgments and an increased number of sentences viewed, marks the strength of this methodology.

Additionally, we will be able to compare sentences within each scenario even when an overall chance outcome occurs. For example, in the initial run of our sentences, we achieved an overall chance outcome, with scores distributed according to the 50-25-12.5 proportions described above. However, "anomalies" emerged - sentences that were strongly skewed towards being rated 0 reliably across one condition (connotative or denotative) and a 3 reliably across the other ("opposite" condition). This allowed us to gather a reliable and valid subset of data that can be utilized in ML experiments. See below for a short list of sample sentences grouped according to the overall scores they received:

Denotative examples

  • The equipment was a radar system.

  • There are red stars and statues of Lenin.

  • Kosovo has been part of modern day Serbia since 1912.

  • The projected figure for 2007 is about $ 3100.

Connotative examples

  • Really, I am.

  • In fact, do what you bloody well like.

  • But it's pretty interesting, in a depressing sort of way.

  • It's no more a language than American English or Quebecois French.

Experimental Machine Learning System

Our preliminary analysis suggests that humans are consistent in recognizing the extremes of connotative and denotative sentences. This groundwork enables us to move towards an automatic recognition system that builds on these findings to identify when a text conveys connotative meaning. Machine Learning (ML) techniques will be used to enable a system to first classify a text according to whether it conveys a connotative or denotative level of meaning, and later, identify specific connotations. ML techniques usually assume a feature space within which the system learns the relative importance of features to use in classification. Since humans process language at various levels (morphological, lexical, syntactic, semantic, discourse, and pragmatic), it is assumed in psycho-linguistic theory that some multi-level combination of features is helping them reach consistent conclusions. Hence, the initial machine learning classification decision will be made based on a class of critical features, as cognitive and social-cognitive theory suggests happens in human interpretation of text.

TextTagger, CNLP's Information Extraction System, can currently identify sentence boundaries, part-of-speech tag words, stem and lemmatize words, identify various types of phrases, categorize named entities and common nouns, recognize semantic relations, and resolve co-references in text. We are in the process of designing an ML framework that utilizes these tags and can learn from a few examples provided by the human subject experiments described above, then train on other sets of similar data marked by analysts as possessing the features illustrated by the sentences consistently identified as conveying connotative meaning.

For preliminary ML-based analysis, the data collection included 266 sentences (from the original 832 used in human subject experiments), 145 tagged as strongly connotative and 121 tagged as strongly denotative by subjects. Fifty sentences from each set became a test collection and the remaining 95 connotative and 71 denotative sentences were used for training. Our baseline results (without TextTagger annotations) were: Precision: 44.77; Recall: 60; F: 51.28. After tagging, when the only features used were proper names and common nouns, the results improved: Precision: 51.61; Recall: 92; F: 67.13. Although these results are not as high as some categorization results reported in the literature for simpler categorization tasks such as document labeling or spam identification, we believe that using higher level linguistic features extracted by our NLP technology for this more complex task will significantly improve them. More sophisticated analysis will be conducted during future applications of this methodology.
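For readers who want to relate the reported measures to one another, the sketch below shows the data split arithmetic and the balanced F-measure (the harmonic mean of precision and recall). It assumes that the F value reported here is this balanced measure; the function and variable names are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Split of the 266 strongly rated sentences, as described in the text.
connotative_total, denotative_total = 145, 121
test_per_class = 50
train_connotative = connotative_total - test_per_class  # 95
train_denotative = denotative_total - test_per_class    # 71

# Baseline figures (percentages) reported without TextTagger annotations.
baseline_f = f_measure(44.77, 60)
print(round(baseline_f, 2))
```

Because the harmonic mean is pulled towards the lower of its two inputs, a high recall paired with a precision near 50% still yields a moderate F value.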


Returning to a point made at the beginning of this paper, the design of this study places the human in the loop at the point of identifying the degree of connotation present within a given text, rather than at the point of expert linguistic analysis. Future work will improve and refine this basic pilot implementation. By allowing the ML system to do time- and labor-intensive analysis, and exploiting a natural human ability to ‘know it when they see it’ (in this case ‘it’ referring to connotative meaning), we feel that this pilot methodology has great potential to deliver helpful results.

In addition to the significant contribution this research will make in the area of NLP, it will also provide a model for future work that seeks to create similar bridges between psychological investigation and system building. While progress can certainly be made if these elements are investigated separately, we believe that we can achieve even greater results by implementing a multidisciplinary and integrated methodology. Preliminary results suggest that our approach is viable and that a system composed of multiple layers of analysis — with each level geared towards reducing the variability of the next — holds promise.

Key to the approach developed for this study was the continued and regular participation of our research team members from a range of disciplines, including cognitive psychology, discourse linguistics, and computer science, throughout the duration of the project. This helped to ensure that evaluations produced by human subjects during the initial phases of the study could, whenever possible, be used to set a baseline for machine learning trials. Additionally, human text analysis informed the calibration of TextTagger and the identification of potentially salient linguistic features. The integrated team provided the opportunity for the results from human participant experiments to inform expectations for the performance of the machine learning system.

Limitations of this study and future work

The work described above is preliminary and exploratory. In future work, the notion of speech communities will be addressed. The pilot study looked at a very generalized speech community, expecting to achieve equally generalized results. While this has merit, there is much to be learned by implementing this approach using a more targeted community. A recent project undertaken by researchers at the Air Force Research Lab (AFRL) in Rome, NY applied components of this methodology to a database of military chat communications. This ongoing research is specifically directed at recognizing speaker uncertainty, calm/hurried, or demanding speech.

It is also interesting to note for future work that a subset of strongly denotative sentences did emerge from the data set; however, the reliability of this result was not matched by the subset of sentences that were classified as strongly connotative. This is potentially suggestive of an interpretative process that allows for more factual, denotative text to be perceived as just that, leaving little room for interpretation across a sample of participants. In other words, a degree of variability would be expected on the connotative end of the spectrum, while that variance would be expected to sharply narrow on the denotative end of the spectrum. This hypothesis would support the idea that connotation and denotation are not simply two sides of the same coin, but are examples of different interpretive processes.

The protocol used in this pilot study was run using a relatively modest number of human evaluators and a relatively small set of data. With the experience gained during the pilot, the reliability of the data used to train the ML system can be easily improved by increasing the size of both the human subject sample and the data set. With a more robust set of initial data, ML experiments can progress beyond the basic proof-of-concept results reported here and produce actionable feature sets tuned to specific speech communities.