The art of creating an informative data collection for automated deception detection: A corpus of truths and lies

Authors

  • Victoria L. Rubin

    1. Language and Information Technology Research Lab (LIT.RL), Faculty of Information and Media Studies, University of Western Ontario, North Campus Building, Room 260, London, Ontario, Canada N6A 5B7
  • Niall J. Conroy

    1. Language and Information Technology Research Lab (LIT.RL), Faculty of Information and Media Studies, University of Western Ontario, North Campus Building, Room 260, London, Ontario, Canada N6A 5B7

Abstract

One of the novel research directions in Natural Language Processing and Machine Learning involves creating and developing methods for automatic discernment of deceptive messages from truthful ones. Mistaking intentionally deceptive pieces of information for authentic ones (true to the writer's beliefs) can create negative consequences, since our everyday decision-making, actions, and mood are often impacted by information we encounter. Such research is vital today as it aims to develop tools for the automated recognition of deceptive, disingenuous or fake information (the kind intended to create false beliefs or conclusions in the reader's mind). The ultimate goal is to support truthfulness ratings that signal the trustworthiness of the retrieved information, or alert information seekers to potential deception. To proceed with this agenda, we require elicitation techniques for obtaining samples of both deceptive and truthful messages from study participants in various subject areas. A data collection, or a corpus of truths and lies, should meet certain basic criteria to allow for meaningful analysis and comparison of socio-linguistic behaviors. In this paper we propose solutions and weigh pros and cons of various experimental set-ups in the art of corpus building. The outcomes of three experiments demonstrate certain limitations with using online crowdsourcing for data collection of this type. Incorporating motivation in the task descriptions, and the role of visual context in creating deceptive narratives are other factors that should be addressed in future efforts to build a quality dataset.

INTRODUCTION

As a branch of Library and Information Science (LIS), Natural Language Processing (NLP) is concerned with the algorithmic treatment of texts for the purpose of simulating human-like tasks such as language translation, automated summarization and classification, or sentiment analysis. Similar tasks would otherwise be considered human information behaviors, yet in many situations automation of repetitive or time-consuming intellectual work is a welcome addition to human abilities. For instance, automated spell-checking supplements the human inattention to detail; automated information extraction (finding and systematizing concepts from multiple documents) saves time and effort.

In developing ways to treat textual data, language models are frequently built based on human performance and artifacts of human information behavior (such as typical spelling errors or written summaries). Thus, obtaining reliable, typical samples of texts to model different linguistic phenomena has been one of the challenges in NLP. The typical methodological approach in NLP relies on a data collection consisting of a training dataset, with a test dataset set aside for validating the effectiveness of the methods.
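
The training/test arrangement mentioned above can be illustrated with a minimal sketch. The function name `split_corpus`, the 20% hold-out fraction, and the toy labelled messages are assumptions made for illustration only; they do not describe the procedure used in this study.

```python
import random

def split_corpus(messages, test_fraction=0.2, seed=0):
    """Shuffle labelled messages and hold out a fraction as a test set.

    Hypothetical helper: the fraction and seed are illustrative defaults.
    """
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (training set, test set)

# Toy corpus of (text, label) pairs; the content is invented.
corpus = [(f"message {i}", "deceptive" if i % 2 else "truthful") for i in range(10)]
train, test = split_corpus(corpus)
```

Methods are developed on `train` only, and `test` is consulted once at the end to estimate how well they generalize.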

For our research agenda the task is to work towards automated deception detection with a goal of separating truthful messages from deceptive ones. Deception is commonly defined as a knowing and intentional attempt to foster a false belief or conclusion (Buller & Burgoon, 1996; Rubin, 2010b; Zhou, Burgoon, Nunamaker, & Twitchell, 2004).

Deception has been studied for ages by philosophers and psychologists, linguists and communication analysts. The body of deception research informs us of what is known today about how people both deceive and go about distinguishing lies from truth, no matter how successful they are at it. What this literature does not typically provide, for NLP purposes, is a clear dataset of positive and negative examples (i.e., truthful and deceptive statements) which could then be compared and contrasted in order to model deceptive strategies. Deception research “has been held back by the difficulty of ground truth verification. Finding suitable data “in the wild” and conducting the fact checks to obtain ground truth is costly, time-consuming and labor intensive” (Fitzpatrick & Bachenko, 2012). The authors note that other research relying on fact checking often faces these problems (e.g., Rubin, 2007, 2010a; Rubin, Liddy, & Kando, 2005; Sauri & Pustejovsky, 2009; Wiebe, Bruce, Bell, Martin, & Wilson, 2001).

Various deceptive situations and realms have been examined in deception research, from personal to public. Cheating and plagiarism, fraud and spam, false identities and hoaxes come to mind. Various circumstances in which speakers could be suspected of lying include marketing pitches and political speeches, personal testimonies and insurance claims. But there are typically very few recorded deceptive examples, or examples with undeniable confirmation of the ground truth, since the concept is open to subjective interpretation.

Thus, any research project in NLP (as well as Machine Learning or Artificial Intelligence) that undertakes modeling how people tend to deceive versus tell the truth is confronted with an issue of obtaining empirical data samples in a usable digital format with certain requirements that make it suitable to automation efforts.

Two concurrent corpus building studies should be noted as they were just presented at the first specialized workshop. Fitzpatrick and Bachenko (2012) collected narratives from the public domain that are “likely to have a rich source of ground truth evidence and general background information,” e.g., crime investigation and legal websites, and published police interviews. Gokhman, Hancock, Prabhu, Ott, and Cardie (2012) also discuss the tremendous potential and the unique limitations of the crowdsourcing approach, akin to ours, only in the context of hotel reviews.

STUDY OBJECTIVES

Our research agenda involves creating and developing methods for automatic discernment of deceptive messages from truthful ones in computer-mediated communication context (Rubin, 2010b; Rubin & Conroy, 2011, 2012; Rubin & Vashchilko, 2012). For this, we require elicitation techniques for obtaining samples of both deceptive and truthful messages from study participants in various subject areas. A dataset should meet certain basic criteria to allow for meaningful analysis and comparison of linguistic behavior. This paper examines the difficulties encountered in the experimental design process, and proposes solutions by weighing pros and cons of various experimental set-ups. The process extends to using images as a way to establish ground truth and restricting the topicality (or subject areas) of potentially deceptive content.

Three phases of the experiment were conducted: [1] an initial task involving participants' stories of luck and tragedy, [2] a revised story task using depersonalized stories, and [3] an image description task involving various degrees of truth about the content of the images. The outcomes of these experiments are used to formulate the limitations of using open ended task descriptions, the importance of incorporating motivation in the task descriptions, and the role of visual context in creating deceptive narratives.

BACKGROUND

Deception Detection

People are generally not particularly successful in distinguishing lies from truth (DePaulo, Charlton, Cooper, Lindsay, & Muhlenbruck, 1997; Frank, Paolantinio, Feeley, & Servoss, 2004; Rubin & Conroy, 2012; Vrij, 2000). Several studies which examine communicative behaviors suggest that liars may communicate in ways qualitatively different from truth-tellers. In other words, the current theory indicates that there may be stable differences in behaviors of liars versus truth tellers, and that the differences should be especially evident in the verbal aspects of behavior (Ali & Levine, 2008). Situational contexts for baseline truthful texts are often drastically different, complicating direct comparisons. For instance, in an analysis of synchronous text-based communication, deceivers produced more total words, more sense-based words (e.g., seeing, touching), and used fewer self-oriented but more other-oriented pronouns when lying than when telling the truth (Hancock, Curry, Goorha, & Woodworth, 2008). Compared to truth-tellers, liars showed lower cognitive complexity, used fewer self-references and other-references, and used more negative emotion words (Newman, Pennebaker, Berry, & Richards, 2003).

Automated deception detection research aims to develop tools for the automated recognition of deceptive, disingenuous or fake information (the kind intended to create false beliefs or conclusions in the reader's mind). Mistaking intentionally deceptive pieces of information for authentic ones (true to the writer's beliefs) can create negative consequences, since our everyday decision-making, actions, and mood are often impacted by information we encounter. The ultimate goal is to support truthfulness ratings that signal the trustworthiness of the retrieved information, or alert information seekers and users to potential falsification, misinformation, omission, concealment, or equivocations.

Interpretation of Images

As we seek to establish the ways in which deceivers deviate from normative truth in their communication, the use of images can provide a useful means for establishing an agreed-upon baseline of truthful content. This is not a simple matter of determining truth based on what an image represents since the interpretation of visual images can be measured from various levels of refinement. According to Panofsky (1939), the pre-iconographical level or “ofness” of a picture is what it actually depicts in terms of familiar objects and events. At the iconographical level, a picture is of the entity it represents or symbolizes. At the third level of iconology or interpretation, the meaning of a picture requires for its understanding familiarity with meanings that are imparted by various contexts, artistic and social, of the picture (Svenonius, 1994).

Shatford (1994) also introduced a faceted classification of image attributes, and proposed the idea of viewing an image within four facets, namely, Objects (Who), Activities and Events (What), Place (Where) and Time and Space (When).

Typology of Deceptive Content

In the previous analysis of linguistic cues in deceptive computer-mediated communication (CMC) messages (Rubin & Conroy, 2012), we manually content-analyzed an elicited dataset (per Krippendorff, 2004) focusing on what stories have in common, and in what respects they differ. This systematic qualitative account resulted in an empirically-derived faceted classification of potentially deceptive messages which varied along 5 facets, including Message Theme, or what the message is generally about (i.e., its topicality).

The analysis showed that messages are distributed over 12 thematic categories (Figure 1), the most prominent of which were tragedy or unfortunate circumstances (31%) and unexpected luck (17%).

Figure 1.

Message Themes within Elicited Potentially Deceptive Computer-Mediated Messages (Rubin & Conroy, 2012).

The most prevalent category, Tragedy Theme, described rare incidents which led to unfortunate outcomes of some kind, as exemplified by a story of a death after an argument. Luck was typically described as an unexpected fortune such as winning a lottery or finding something of value.

Other properties identified in deceptive messages (see Figure 2) included Deception Centrality, Deception Realism, Deception Essence, and Self-Distancing Degree. Deception Centrality refers to what proportion of the story is deceptive, and how significant the deceptive part is. Deception Realism refers to how much reality is mixed in with the lie. A message can be based predominantly on reality with a minor deviation, or can be based on a completely imaginary world. Deception Essence is what the deception is about. Each deception essence (event, entity, characteristics, etc.) may vary by the orthogonal deception facets, either in its centrality to the message (such as focal point or minor detail) or in its degree of realism. Finally, Self-Distancing Degree, the distance between the message sender and the plot characters, emerged as a variable dimension across stories, created by misattributions (revealed by liars afterwards) and by the narrator's perspective (revealed by writing stories in the first or third person).

This classification (Figure 2), empirically derived from the content of deceptive messages, particularly the thematic categories of “aboutness,” topicality and essence, is in principle similar to how the meaning of images is structured, as outlined by Panofsky (1939) and Shatford (1994). This structure can be used to provide compatible methods for the analysis of deceptive messages, particularly when the description of images is used as the conduit for eliciting truthful and deceptive accounts of entities, events, characteristics, and so forth.

Figure 2.

Facetted Classification of Potentially Deceptive Messages (redrawn from Rubin & Conroy, 2012).
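
As an illustration only, the five facets described above could be recorded per message with a small record type. The class name `MessageFacets`, the field names, and the example values below are hypothetical; the coding scheme itself is the one reported in Rubin & Conroy (2012), and this sketch does not reproduce its full set of categories.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MessageFacets:
    """One coded message; field names are illustrative, not the authors' schema."""
    theme: str            # Message Theme, e.g. "tragedy" or "unexpected luck"
    centrality: str       # Deception Centrality, e.g. "focal point" or "minor detail"
    realism: str          # Deception Realism, from mostly real to fully imaginary
    essence: str          # Deception Essence, e.g. "event", "entity", "characteristics"
    self_distancing: str  # Self-Distancing Degree, e.g. first- or third-person narration

# An invented example of a coded message, not taken from the dataset.
example = MessageFacets(
    theme="unexpected luck",
    centrality="minor detail",
    realism="mostly real with a minor deviation",
    essence="event",
    self_distancing="first person",
)
```

Keeping each facet as a separate field reflects the point that the facets are orthogonal: a deceptive event can be central or peripheral, and realistic or imaginary, independently.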

Compositional Criteria of Analytical Content

To formulate a detailed and comprehensive content- and linguistic-based analysis of verbal deceptive behavior, we require a dataset that meets the following criteria:

  1. Be sufficient in length and topic variability to replicate and expand on the facetted typology uncovered in the analysis of deception in CMC.
  2. Provide sufficient detail to identify properties such as story essence, centrality, realism and self-distancing.
  3. Consist of sufficient statistical dispersion between deceptive and non-deceptive content to form meaningful comparisons.
  4. Be accurate and verifiable, as much as possible, regarding what is indicated to be deceptive and what is indicated to be non-deceptive.

RESEARCH QUESTION

In this study we explore the phenomenon of qualitative differences between deceptive and truthful messages. To achieve this we asked this question:

Using computer-mediated communication processes, what are the most effective ways to elicit deceptive and truthful content (in the form of user-generated stories) for the purposes of serving as satisfactory experimental data?

By “satisfactory” we refer to the Compositional Criteria of Analytical Content. That is, we seek a dataset which is distinguished from the unsatisfactory by possessing several or all of the criteria listed above. Using generated content to formalize patterns of deceptive language is predicated on the ability to reproduce the very phenomenon of interest. To address this, we proceed based on previous findings, and adapt the experimental design according to intermediate results.

We attempt to elicit deceptive and non-deceptive content by asking participants to write original stories in subject areas restricted to topics which are conducive to deception, namely personal luck and personal tragedy. Following this, a modified version of the experiment was used. Whereas in the original version participants wrote personal stories, the modified version omitted the personal element so that stories became generic. Finally, in order to address the shortcomings encountered in the first set of results, the task was changed again so that images were used for deceptive descriptions. This allowed for the topicality of the stories to be determined. In the image description task, three variations were used. In the truth condition, descriptions were formulated based only on the pre-iconographical facets of images. In the distortion condition, participants embellished their descriptions with the addition of further details and contextual elements. In the false condition, subjects completed missing sections of the images by inventing entirely new descriptions. The images were taken from public-access online sources and selected based on novelty and the variety of visual content. It was expected that the more elaborate and interesting the pictures, the easier it would be for participants to formulate detailed and extensive descriptions, both truthful and deceptive. The process outlined below describes the challenges, findings and subsequent resolutions encountered in the process of constructing a dataset suitable to linguistic analysis.

PHASE 1: PERSONAL TRUTHFUL/DECEPTIVE STORIES

Data Collection Method (Phase 1)

In Phase 1 of this study, we elicited personal stories using Amazon's online survey service, Mechanical Turk (www.mturk.com). The data collection instrument requested each respondent to write a rich, unique short story which was to contain some degree of deception. Writers then ranked their story on a scale of 1 to 5 based on the deceptive content. The five point scale establishes gradations within the truth-deception continuum, without imposing any specific values to participants' self-ratings for the categories in between the two extremes.

In the luck scenario, we devised a task that asked participants to create a story of their own about “luck,” “opportunity” and “good fortune”. Participants were asked to make these stories believable to an average reader. Participants were invited to expand their stories with details and other characteristics such as background circumstances, who was involved, and how the good fortune occurred. Here is a sample: “We are asking inventive and skillful participants to create a story about personal luck, good fortune, or opportunity in their lives. Write as believable and descriptive an account as you can. Be Creative! However, the story should still only be about fortunate events that happened to you unexpectedly. Describe every detail and make it seem real! For example, how did it happen? Who was involved? What were the circumstances?”

For the tragedy scenario, the same format was used which varied only according to subject matter. In this case, participants were asked to devise their story around “misfortune,” “tragedy,” or “lost opportunity”. Again, participants were asked to be believable, descriptive and elaborate on all elements of the narrative.

To help participants reflect on their response, and to aid in the data collection, we posed questions regarding the nature of the deception. This included information about what details were deceptive and how deceptive elements impacted the focus of the story. The purpose of this measure was to determine how deception level is assessed by the participants. For instance, a single key deceptive element may weigh more heavily in the deception self-rank than several “insignificant” story details. Rather than allowing users to freely describe the facet, choices were given in the form of a checklist from which participants could select the most appropriate category. This checklist was derived from previous analyses of deceptive stories (Rubin & Conroy, 2011, 2012).

Data Analysis (Phase 1)

Given the nature of the data used, a qualitative and quantitative assessment was made. The purpose of these analyses was to determine whether the collected stories conformed sufficiently to the data criteria enumerated in the previous section. The qualitative analysis involved reading each story to attend to whether or not the participants met the task specifications in the following aspects: the overall level of detail of the story, the types of participants chosen, the centrality of deception, and the extent to which deception was incorporated in the story. The quantitative analysis summarized the deceptive to non-deceptive content ratio, the average time spent and length of the story and other participant demographic information.

Results & Discussion (Phase 1)

The dataset from Phase 1 showed a total of 40 stories (20 tragedy, 20 luck). The stories were gathered over a period of roughly 6 hours. Each worker worked an average of 13.5 minutes. Given the distribution of deceptive stories in both the tragedy and luck sets there was, simply stated, not enough deceptive content to use for meaningful comparison. There might be several reasons the participants chose to tell a truthful story rather than a deceptive one. In light of this result, explaining the reasons participants did not deceive is the first objective in any potential redesign.

In the luck stories 17 out of the 20 were indicated to be “entirely truthful” (self-rank = 1). Of the 3 other stories that reported any deceptive content, their combined deceptive content was just over half (53%) of the story content of the 3 stories. So, for the dataset whose ideal distribution would contain a 50:50 overall truthful to deceptive ratio, there was in fact 92% truthful content. On its face, this limits the generalizability of any conclusions made based on the linguistic differences and violates the dataset criteria made at the outset. The tragedy scenario fared slightly worse. There were only 5 out of 20 stories with any deceptive content at all. In these stories, the actual deceptive content was minimal; all stories were self-ranked as 2, or “slightly deceptive”. This means that for the entire tragedy dataset, all but 5% of the content was identified as truthful. The respondents in this set were 35% female and 65% male. The respondents were in every age range with most (6) being in the 31–40 year old range.
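
The 92% figure for the luck set can be reproduced with a short calculation. This is a sketch under one stated assumption: every story contributes equally to the total content (the paper does not say how content was weighted). The function name `deceptive_share` is hypothetical.

```python
def deceptive_share(n_stories, n_with_deception, mean_deceptive_fraction):
    """Overall share of deceptive content in a story set.

    Assumes (for illustration) that every story carries equal weight.
    """
    return (n_with_deception / n_stories) * mean_deceptive_fraction

# Luck set: 3 of 20 stories contained deception, and 53% of the content
# of those 3 stories was deceptive.
luck_deceptive = deceptive_share(20, 3, 0.53)
luck_truthful = 1 - luck_deceptive

print(f"deceptive: {luck_deceptive:.1%}, truthful: {luck_truthful:.1%}")
```

Under this assumption roughly 8% of the luck content is deceptive, which leaves about 92% truthful, far from the ideal 50:50 split.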

It seems the greatest problem encountered in the Phase 1 pass was that the task was not sufficiently suited to motivate participants to invent false stories. Without some degree of intentional falsehood in the dataset, language comparison is impossible, and sections where participants elaborate upon the nature and centrality of the deceptive elements are irrelevant. In addition, the stories generally lacked inventive details and tended to be simplistic narratives of apparently true events.

In general, participants seemed not to understand the purpose of the question. Based on the responses given, it seems they were not sufficiently engaged or inclined to actually deceive a reader to invent rich and false descriptions in the stories they were asked to write. In the tragedy scenario, a recurrent message theme was that stories tended to be of a personal and severe nature. For example, in several instances participants reported witnessing the actual traumatic death of a person, or the experience of coping with the illness of a loved one. In these cases, the strength of the memory is indicated within the story itself, and precise details, such as time of day or date of the incident are often mentioned. One completely true story exemplifies this tendency:

Example 1.

It was a sunny day in June. My boyfriend and our good friend Tony started drinking early that day, as Tony was upset that it was the day of his grandmother's funeral, and he was not able to attend. By about 4:00 in the afternoon, they were quite intoxicated. I had enough of their drunken rambling, so I was going back to the house. My boyfriend called me and told me to get down to where they were at, otherwise he was going to go do something stupid. He and Tony decided to swim to try to sober up, so they jumped off the railroad bridge. They were swimming around just fine, and after while they were coming back in from the water. Tony made it to shore, and my boyfriend turned around and jumped back in. He started swimming to where Tony was and started struggling at one point. Tony jumped in to help him, and after a few minutes, they both drowned. I called 911, but by the time they got there, and the dive team got there, they treated it as a recovery and not a rescue. I watched the whole thing happen, and still have not gotten over it. [self-rank=1 (truthful); ID=A26GD3XSS4 …]

Due to the highly personal and emotional nature of stories like this, it is hypothesized that asking participants to describe personal tragedy was contrary to the aims of the experiment. Without sufficient incentive to distort the facts, participants were relying on memories that were highly vivid and accessible and this influenced the choice to use true rather than deceptive accounts. The performance difficulties which arose from the extra effort needed to produce unnaturally untrue modifications to these personal stories may be attributed to the fact that authors perceived no incentive for this additional effort. Since emotional memories contain more sensorial and contextual details than neutral memories (Comblain, D'Argembeau, & Van der Linden, 2005), the emotional nature of tragic memories (being more vivid) would counter any propensity to distort these memories simply for the sake of convincing a hypothetical reader.

As indicated above, the luck condition showed only a slight improvement in the choice to lie. However, most stories indicate that only a “minor detail” comprised this deceptive component. Luck stories were often not of a personal or emotional nature and this may have been due to other factors. Those who chose to write completely true stories were spared from explaining just what was deceptive in their story (by the pre-defined checklist) and the impact and function of those details in the overall narrative. Participants were compensated equally regardless of whether they chose a deceptive or true story in both the luck and tragedy sets. Since they were asked to develop untruthful elements and then explain these decisions and their role in the story, participants may have simply found it easier to avoid follow up questions and complete the survey in the most expedient way possible, by simply telling the truth or otherwise labeling the story as completely truthful.

Topic variability within the tragedy/luck domain was limited as well, and this impacted the first criterion of a usable dataset. This was evidenced particularly in the luck results, as the serendipitous finding of money and winning money at a casino were chosen in over half the stories. In the 3 deceptive stories, interesting and descriptive content was omitted in favor of simplistic and underdeveloped narratives. In addition to being typically short, the following account was reported to be half-deceptive, and did not fully explain the deception elements in reference to what was written. Examples such as this made it difficult to readily observe how lying about luck may differ from true accounts.

Example 2.

I am very lucky to get such a life. My parents are my close friends. They will do what i like to have what i am wishing to have even if i didn't asking them what i am wishing to have. I can share anything with them whatever problems i have they will understand it and help me to solve my problems. [self-rank=3; ID=A2SKPF8W2L …]

PHASE 2: DEPERSONALIZED TRUTHFUL AND DECEPTIVE STORIES

Data Collection (Phase 2)

In Phase 2 of this study, another set of stories was gathered using the Mechanical Turk service. Again, the data collection instrument requested each respondent to write a rich, unique short story containing some degree of deception. The tragedy task was revised to explicitly steer participants away from writing personal accounts. As indicated above, the personal element was believed to be the cause of participants' lack of willingness to deceive. Participants were asked to simply make up any tragic story and incorporate details to make it believable and convincing. The luck task was revised in a similar fashion, and a “depersonalized” version was created. Despite the uniformity of the topics found in the Phase 1 results, no attempt was made to explicitly revise this aspect of the task. By removing the personal element it was believed participants would be more creative in their choice of topics. The task was scaled down considerably and read as follows: “Write a story about being Lucky. Be Inventive and Creative! Be Believable and Descriptive! Use Detail (Who What When Where How?)”.

Results and Discussion (Phase 2)

Our dataset from Phase 2 produced 40 stories (20 tragedy, 20 luck). The stories were gathered over a period of roughly 24 hours. Each worker worked an average of 10.2 minutes. The results showed more deceptive content than the original set. There also seemed to be an overall improvement in the quality, detail, length, and variability of all stories including those truthful and deceptive. In the luck scenario, 6 of the 20 stories reported any deceptive content, and this accounted for 17% of the overall content. The tragedy scenario seems to have been impacted more by removing the personal element. In the tragedy case 30% of the entire content was deceptive, contained within 8 stories with a self-rank greater than 1. The respondents in this set were 35% female and 65% male. The respondents were in every age range with most (7) being in the 31–40 year old range.

Despite the removal of the personal element in the task description, 14 of 20 tragedy stories were written in the first person, indicating a personal point of narrative. Incidents of death, illness and severe personal tragedy were less prominent in both the deceptive and non-deceptive cases. Participants wrote about a range of topics, including lost love, missed work opportunities, broken dreams, and mistaken identity.

Many of the same difficulties involved in motivating participants to create deceptive stories were found in the Phase 2 version as in Phase 1. Even though the personal element was removed to provide more incentive to create untruthful stories, participants were still prone to writing very short paragraphs, often with fewer than the required 5 to 7 sentences. The number of deceptive stories improved only slightly in the Phase 2 condition, and the deceptive content remained relegated to minor details versus more elaborate lies. In addition, an unforeseen problem arose concerning the writing topics. Although the task description required participants to restrict their stories to the specific domains of luck and tragedy, there were vastly different results and quality of responses. Luck stories were still confined to homogeneous topic choices, and remained fairly simplistic. Regardless of the truth value of such stories, surface differences were minimal and truthful events could easily be made to fulfill the deceptive requirements simply by changing the subject or person involved. Either way, an examination of the content showed that it could not be used to examine the differences in deceptive language.

Finally, it was not clear if the participants were creating their own content or using copied excerpts. The length and language used for a few examples in the tragedy scenario raised suspicions about whether the excerpts were copied and pasted from content found online or elsewhere. Indeed, criterion 4, which emphasizes establishing ground truth in deception, remains a problem in the experimental design and highlights the limitations of the existing methods. Beyond trusting the self-rank assigned by participants, there was no way to determine whether what is reported as truth is actually the case. Motivating participants to write rich, linguistically diverse descriptions remains a considerable challenge. At the same time, variations in interpretation and effort exercised by the participants caused us to re-examine the nature of the task in terms of providing a contextual framework for the deception. Therefore, to revise the task it was necessary to formally establish ground truth through some form of objective evidence. This was the basis for methods employed in Phase 3.

PHASE 3: IMAGE DESCRIPTION

Data Collection (Phase 3)

To overcome the problem described in Phase 1 and 2, Phase 3 asked to describe images by their visual content. More specifically, truthful descriptions would correspond to the “ofness” facet of the preiconological level of image meaning as described by Panofsky (1939). A set of 10 images was used across three individual subtasks, each involving a different degree of deception. Phase 3 methods were used in order to supply a ground truth dataset for a better comparison between deceptive and non-deceptive language while also providing a way to target the subject of deception. The images were selected based on certain visual qualities, namely ambiguity, novelty, color range, and intrigue. The following three subtasks were informed by the Deception Realism facet in the Facetted Classification of Deceptive Messages (Rubin & Conroy, 2012, see Figure 2). The first subtask, Phase 3a – Truth Condition, required that participants describe the content of the images using only what they saw, without adding any contextual or background information. In other words, they simply translated the visual information before them into factual accounts of activities, objects, locations, colors and other visual data. The second subtask, Phase 3b – Distortion Condition, participants were asked to extend their descriptions by adding content which distorted the truth of the condition. These descriptions may differ in some nonspecific way from the actual image yet participants were to describe the majority of the image truthfully. Compared to the Phase 3a results, this content was considered deceptive since it deviated in some way from the visual evidence present in the image. And unlike the methods used in Phase 1 and 2, information about what elements of the descriptions were deceptive could be established by referring to the image rather than polling participants separately. 
For the third subtask, Phase 3c – Imagination Condition, the original images were altered by blanking out a significant portion. In this case, participants were asked to describe the image and complete the blanked-out portion with their own ideas. The goal of this task was to extract elaborations on the existing image with potentially untrue details while using the still-visible portion as a contextual basis. The results of this subtask may illustrate how deception occurs in the face of visual evidence as a context or backdrop for the deceptive content.
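
The three-condition design can be summarized as a small data structure. The following is a minimal sketch under stated assumptions, not the tooling used in the study: the names `Condition`, `Description`, and `expected_counts` are hypothetical, and the respondents-per-image figures (4, 2, and 2) are taken from the results reported for each subtask. Treating every Phase 3b and 3c description as deceptive is a simplifying assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum


class Condition(Enum):
    TRUTH = "3a"        # describe only what is visible in the image
    DISTORTION = "3b"   # mostly truthful, with added distortions
    IMAGINATION = "3c"  # fill in a blanked-out portion of the image


@dataclass(frozen=True)
class Description:
    image_id: int       # hypothetical index into the set of 10 images
    condition: Condition
    text: str

    @property
    def is_deceptive(self) -> bool:
        # Simplifying assumption: only the truth condition is non-deceptive.
        return self.condition is not Condition.TRUTH


N_IMAGES = 10
RESPONDENTS_PER_IMAGE = {
    Condition.TRUTH: 4,
    Condition.DISTORTION: 2,
    Condition.IMAGINATION: 2,
}


def expected_counts() -> dict:
    """Number of descriptions each condition should yield."""
    return {c: N_IMAGES * n for c, n in RESPONDENTS_PER_IMAGE.items()}
```

Because ground truth is tied to the image rather than to a participant's self-report, each record needs only the image identifier and the condition to be labeled.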

Results and Discussion (Phase 3)

Phase 3a – Truth Condition

Over a period of roughly 4 days, Mechanical Turk participants described a set of 10 images. Four different participants described each image, producing 40 true descriptions. Each description took an average of 1.8 minutes to compose. The truth condition descriptions were generally satisfactory, although some were not extensive enough to be considered accurate or complete visual descriptions. This is indicated by the range of description lengths, from 12 characters to 605. Participants were asked to be very descriptive and detailed about what was occurring in each picture. Given the limitations of time and space, the descriptions were not excessive in the level of detail included, but were deemed sufficient for this study. On average, descriptions were 205 characters long. The majority of descriptions remained focused on the visual objects without reference to any contextual or cultural interpretations (i.e., pre-iconographic interpretation). Typical descriptions are characterized by two examples:

Example 3.

A pig is lieing2 down in the on a black asphault street in a puddle of water. The puddle is small and the water is brown. The pigs outline is reflected in the water. It is day out. There are white lines on the street. There are modern low buildings on the sides of the road. About half way down the street there are green and yellow triangular flags attached to a building. [ID=2CQU98JHST …]

Example 4 (Figure 3).

Photograph of what seems to be a gymnasium. The floor is made of small strips of wood (the color is brown); much of it collapsed. There are four half-columns against a wall. The wall is brown as well has two doors: one blue (close), one white (open). There is a balcony with a fence (purple) and, at each ends of it, there are two basketball hoops attached to the balcony. Lastly, there is a stage for shows in the back of the gymnasium. The stage has purple and grey curtains. [ID=2D7DEY5COW …]

Figure 3.

Truthfully Described Image (Example 4).
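
The character-length figures reported for the truth condition (a range of 12 to 605 characters and a mean of 205) are simple summary statistics over the collected texts. As a minimal sketch of how such figures might be computed, assuming the descriptions are available as plain strings, the hypothetical helper `length_summary` below returns the shortest, longest, and mean character counts. The sample strings are stand-ins, not corpus entries.

```python
from statistics import mean


def length_summary(descriptions):
    """Return (shortest, longest, mean) character counts for the texts."""
    if not descriptions:
        raise ValueError("need at least one description")
    lengths = [len(text) for text in descriptions]
    return min(lengths), max(lengths), mean(lengths)


# Hypothetical stand-in texts, not taken from the corpus.
sample = ["A pig in a puddle.", "A gymnasium with a collapsed wooden floor."]
shortest, longest, average = length_summary(sample)
print(shortest, longest, round(average, 1))
```

Counting characters rather than words follows the unit used in the reported results; spelling is deliberately left uncorrected, so no normalization is applied before counting.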

Phase 3b – Distortion Condition

Over a period of about 4 days, each image was described by 2 respondents, producing a total of 20 distorted descriptions. The results showed a fairly broad range of interpretation regarding what proportion of the actual image description was to be distorted and the extent of that distortion. On the whole, distorted descriptions were more extensive than the truthful descriptions. This is evidenced by the description lengths, which ranged from 308 to 596 characters. An analysis of the distorted descriptions revealed that oftentimes no conflicting or contradictory information was used when compared to the existing visual information. Instead, additional information was provided which expanded on the image content. Whether or not people were present in the image determined the nature of these additions. For example, when people were present, the distorted content might include their individual background information, hypothetical thoughts or intentions, or narrative elements to create a more complete description. In Example 5, the author expands the visual elements by creating new, plausible details (names, places) which may further explain the scene to a viewer.

Figure 4.

The Image for the Invented Description (Example 5).

Example 5 (Figure 4)

Jason Collier, on leave from Iraq, where he serves in the Navy, won a contest sponsored by the Denver Broncos, to sing a song at half time. The Broncos sponsored a writing contest where the person wrote their story about why they wanted to sing at the Bronco's game. The Broncos were touched by Jason's story about being a Bronco's fan since he was 10 and about his desire to become a professional singer after his tour of Iraq was finished. They wanted to help a member of the military to realize their dream. Jason said that after being in Iraq, he was not nervous singing to such a huge crowd. [ID=2HYO33FDEY …]

These distortions may take on a narrative quality that was not found in the truthful condition. Writers felt compelled to create purposes for the characters present in the scene:

Example 6.

Women is2 talking on her cell phone and is really involved in the call. She is getting a cup of coffee on her way to work. She is in a big hurry but enjoying the conversation on the phone with her friend. Her friend is describing a date that she had last night and the women is laughing along with her. Distracted by the phone call, the women puts way too much sugar in her cup and it overflows with sugar. Now she has to wait for a new cup of coffee and is going to be even more in a hurry. Sometimes when in a hurry, you should slow down. [ID=2A9FL42J1L …]

Sometimes these new details were more subtle and contextually based, but still deviated from a purely objective description.

Example 7 (Figure 5).

It was time for the winter celebration in Iceland. The townspeople got together and built a huge landscape out of snow and ice. the landscape was over 220 feet tall. They got a few people to spell out 2012 with their bodies in the snow. The others stood in the background. A friend took their picture with a camera. Their brightly colored jackets stood out against the pure white town. [ID=2598MZ9O9Z …]

Figure 5.

The Image for the Invented Description (Example 7).

When people were absent from the image, details such as information about how the scenery was constructed, who was taking the photo, or why the photo was taken had to be added or changed to suit the subject matter. In these instances descriptions may include new objects, scenery and their related details. The extent of the distortion varied widely as well. In some cases, the original picture was not referred to at all and writers took liberties to invent entirely new scenes and events. The following example description was a highly distorted version of a picture of a cow in a grassy valley: no people, no traffic, definitely no skyscrapers.

Example 8.

There is a ton of traffic grid-lock in the big city. Taxis, cars, and pedestrians are all crammed into the streets and sidewalks trying to get to work. Trash is littered all over the street. There are dark clouds in the sky and it is beginning to rain. Skyscrapers tower over each side of the street causing it to look even darker. [ID=26Y18N46UX …]

The variation in the interpretation of the task at this phase indicates that for future trials, restricting the image content to one or two styles may provide more accurate guidelines for the participants. Reducing the number of images used in the experiment would also limit the analysis to a select set. Participants expressed no confusion about what they should make up or why. Part of the task asked for user feedback about problems encountered or reflections on the task requirements. The majority of the 20 responses indicated no difficulty completing the task and some reported the task to be enjoyable and interesting (which is not a minor matter when the goal is to elicit creative and rich verbal descriptions). Regardless of potential improvements to the task, the results introduce potential areas for the analysis and understanding of deception. Indicated in these examples is the propensity for authors to express their own questions and fill cognitive gaps regarding contextual information by adding their own creations. Rather than creating information that violates what is visually evident, they add plausible information which complements the evidence. In other words, the descriptions do not necessarily disagree with what is evident in the picture. For example, a picture of a blue sky is never reported to be green. This highlights a distinction regarding different degrees of deception. Interpersonal Deception Theory (IDT) outlines falsification as a type of deception in addition to omission and equivocation (Buller & Burgoon, 1996). Our participants chose to avoid falsification as a form of deception. Instead, false beliefs are fostered in a potential reader by obscuring the confirmable evidence through elaboration and equivocation. Deception through the omission of significant visual elements was also present in the results; however, it is not clear whether that was the result of an intentional deception or merely an oversight.

Phase 3c – Imagination Condition

The final image description phase ran over a period of 4 days as well, with each image described by 2 respondents, producing a total of 20 descriptions. The descriptions ranged from 185 to 442 characters and averaged 288 overall. It seems that the Phase 3b task was more successful in prompting participants to create new, vivid descriptions based on the visual content than was the method of removing a portion of the image (Phase 3c). One reason is that the blanked portion restricted participants too rigidly as to what part of the image to invent. In these circumstances the missing information was accommodated with mundane details, such as a “child playing” in the blanked-out section. Participants were reluctant to offer further details beyond merely introducing a new object or character. Although this amounts to introducing “untrue” information, there was too little integration with the remaining picture to constitute a false but elaborate story. Another problem that reduced the usefulness of the dataset was simple misunderstanding by the respondents. In 5 instances, respondents did not complete the missing portion but instead mentioned the presence of a “white box”. On the whole, the Phase 3c dataset did not yield responses wholly distinct from those of the previous phase, although some useful excerpts were found. The following example places a mountain lion in the scene; the narrative is marked with some hesitation and non-specificity:

Example 9.

This is a beautiful distant photo of a landscape. It is taken in a wild area, with green fields in front and gray and brown mountains looming in the distance. Above, white clouds float amidst deep blue sky. A mountain lion is sitting down in the middle of the field. It is sleek and graceful, and appears to be a female. [ID=2W0JIA0RYN …]

DISCUSSION

Three phases of the experiment elicited deceptive content through participants' stories of luck and tragedy, depersonalized stories of luck and tragedy, and descriptions of images involving various degrees of deception. The outcomes of these experiments reveal some of the challenges in structuring a task to balance deceptive and non-deceptive content and to include sufficient descriptive elements for further linguistic analysis.

The predominant problem encountered in the first two set-ups was the difficulty in satisfying the third item of the compositional criteria: the overall deceptive content was too minimal to create comparison groups. To determine deceptive markers, such as particular word usage or demonstrable cognitive complexity, a corpus of substantially rich content is required for meaningful analysis. However, in an experimental CMC environment such as Mechanical Turk, generating the corpus remains a challenge. In particular, it appears the task was not sufficiently suited to motivate participants to invent false stories, as evidenced by the strong preference to produce non-deceptive stories.

Although topics of luck and tragedy were chosen based on prior assessments of topical preferences for deception, they are not easily integrated with a motivational component that compels subjects to lie well for some perceived gain. For instance, socially-driven extrinsic motivation often involves using deception in order to maintain existing social relationships, such as pretending to enjoy a received gift to spare the feelings of a friend, or to improve how one is perceived by others through impression management (Schlenker, 1980). Such motivations were not present and even seem mismatched with the intention of relating stories of personal woe or accidental fortune. The topicality, it seems, is secondary to having a good reason to lie, and this affected the quality of the deception. Directing the topicality of stories may be better served through variations (such as “personal success,” rather than “luck,” or “pity” rather than “misfortune”) and, therefore, more aligned to individual motivations found in natural instances of deception. Moreover, instructing participants that lies “should” be believable simply does not mimic the real-world socially-driven needs which compel liars in the first place. Writers may have attended mostly to the subject matter of their story, rather than to the secondary outcome the lie is meant to produce: instilling a genuine false belief in a reader. Hence, further efforts must situate the task so that participants are more cognizant of the impact of their lies and the consequences of not succeeding.

Interpersonal deception theory (Buller & Burgoon, 1996) describes the role of context interactivity and relational familiarity on deceptive interactions. The degree of interactivity or “interpersonalness” is posited to alter deceptive cognitions and behaviors since task demands are high and the communication is spontaneous rather than rehearsed. The deception task in this study, since it relies on a low degree of interactivity and a modality which introduces the possibility that participants used rehearsed stories, demonstrates two ways in which the task was detached from other real world encounters, where people are prone to deceive. Without motivation, the most common choice of participants is to relay stories which require the minimal degree of effort. In this case, using readily available and detailed stories is preferable to inventing entirely new, yet plausible accounts.

We encountered limitations with using the Mechanical Turk crowdsourcing service for generating rich deceptive content. Although a convenient method of setting up a written task and gathering content, it proved to be somewhat restrictive in the degree of exercisable control over desired characteristics of the participant group. In particular, there was no way to efficiently distinguish experienced liars from inexperienced ones. Participants may be filtered based on rough demographics such as geographic location or rank, but whether subjects have an interest or aptitude for written communication remains largely uncontrollable in this scenario. Since social factors are often seen to drive deception, online environments, such as social networking tools like Facebook and Twitter, may be further considered as useful data sources. Such sites facilitate natural communication within existing social circles. Alternative methods for sourcing participants, for example by selecting experienced online writers or bloggers, would conceivably contribute significantly to the diversity of the deceptive content and address the multidimensional features of a quality corpus. Of course, the objective of seeking overall patterns of deception across a broad population, such as that provided by Mechanical Turk, likely increases the generalizability of the overall findings, something that is limited when experts rather than non-experts are specifically targeted.

Beyond trusting the self-rank assigned by participants, there was no way to determine whether that which is reported as truth is indeed true. This makes it difficult to satisfy the fourth item of the compositional criteria. Image descriptions were believed to provide a ground truth dataset for comparison, help to disambiguate the types of descriptions expected, and allow for a more precise understanding of the centrality of the deceptive content. In the task redesign, participants were directed to invent contextual details, external objects or people. The deceptive value of descriptive embellishments could be determined simply by referring to what is not immediately apparent in the image. We relied on the participants' pre-iconographical (Panofsky, 1939) understanding of the image for the truth dataset. The distortion condition in the image description task produced more elaborate and useful content than the imagination condition. This difference indicates that to generate descriptive, deceptive text it is beneficial not to restrict the subject of the deception too rigidly. The decision to help generate deceptive text by removing objects and figures (as with the imagination condition) invites participants to merely replace these. In contrast, allowing the participants freedom to distort the image in a non-specific way allows embellishment in a fashion the writer deems appropriate. When this occurs, descriptions are more likely to display evidence of properties such as story essence, centrality, realism and self-distancing. A shortcoming of the image task, however, relates to its ability to mimic a deceptive situation. Iconographical and iconological interpretations occur when participants view and describe these images.
The degree to which these interpretations are reproduced in the descriptions affects the confidence in determining whether the content is in fact deceptive, since this content may not actually conflict with the real, internal beliefs of the author and thus may not qualify as deception, per our definition (Buller & Burgoon, 1996; Rubin, 2010b; Zhou, et al., 2004).

CONCLUSIONS AND FUTURE WORK

To address gaps in deception research and Natural Language Processing, we attempt to build a body of written deceptive content for systematic analysis. NLP requires, at a minimum, a substantial corpus of content through which to analyze and categorize language patterns. The themes of tragedy and luck were shown previously to be natural topical areas for those writing deceptive content. Tasking participants to write deceptively about these topics in a controlled computer-mediated communication environment, however, has proven to be difficult, especially since the conditions are not conducive to motivating deception in the normative sense. It is the perceived rewards which cause people to lie in real-world circumstances; cues to deception are more pronounced when people are motivated to succeed, especially when the motivations are identity-relevant (DePaulo, et al., 2003). The participants in our trials were not impeded from telling the truth, since doing so resulted in no penalty or costs of the type witnessed in other scenarios conducive to deception. When authors did opt to lie, the content was not substantial and lacked the sought-after evidence of centrality, realism and other identified characteristics. Again, this indicates a lack of incentive for participants to create believable yet deceptive accounts when they are given the option to simply tell the truth.

Images have proven beneficial in eliciting deceptive material in a controlled environment. They provide ground truth for comparison, they restrict the domain and the topicality of the descriptions, and they help regulate evaluations of the degree and severity of the deception. Unfortunately, the use of image description in our study precluded investigating how subjects formulate whole lies based around luck or tragedy. Acknowledging the difficulty of the data-gathering task in a crowdsourcing environment, our study indicates that lacking a communicative context between individuals, or otherwise explicit consultation with the experimenters, impacts the quality of the data and the propensity for deception to occur. To capitalize on these findings, future studies will address the need for establishing ground truth data at the outset and will limit the format and nature of data gathering processes to more highly structured approaches.

Acknowledgements

We thank our study participants and two anonymous reviewers. This research is funded by the New Research and Scholarly Initiative Award (10–303) “Towards Automated Deception Detection: An Ontology of Verbal Deception Cues for Computer-Mediated Communication” from the Academic Development Fund at the University of Western Ontario.

Footnotes

  1. Ten-digit identification numbers are listed for all examples matching database entries for easy future reference.

  2. The participants' spelling and capitalization are left uncorrected to preserve the authenticity of their writing style and normal practices.
