The first direct replication on using verbal credibility assessment for the detection of deceptive intentions

Abstract Verbal deception detection has gained momentum as a technique to tell truth‐tellers from liars. At the same time, researchers' degrees of freedom make it hard to assess the robustness of effects. Replication research can help evaluate how reproducible an effect is. We present the first replication in verbal deception research whereby ferry passengers were instructed to tell the truth or lie about their travel plans. The original study found truth‐tellers to include more specific time references in their answers. The replication study that closely mimicked the setting, procedure, materials, coding, and analyses found no lie–truth difference for specific time references. Although the power of our replication study was suboptimal (0.77), Bayesian statistics showed evidence in favor of the null hypothesis. Given the great applied consequences of verbal credibility tests, we hope this first replication attempt ignites much needed preregistered, high‐powered, multilab replication efforts.


| INTRODUCTION
In the challenge to tell truth-tellers from liars, verbal deception detection has emerged as one of the more promising approaches (Oberlader et al., 2016; Vrij, Fisher, & Blank, 2015). Verbal deception detection sets out to identify verbal indicators of deception in statements made about an event.
Based on the notion that liars will have more difficulty providing a convincing and hence detailed account of a fabricated event than truth-tellers, the cognitive approach to deception postulates that these differences in difficulty are reflected in, for example, the richness of the verbal account about the event. Similarly, the theory of Reality Monitoring posits that the content of a statement about a genuinely experienced event can be recalled in more detail than the content of a fabricated event (Johnson, Bush, & Mitchell, 1998). Both the cognitive approach and Reality Monitoring agree on the prediction that truthful statements are richer in detail than deceptive statements. There is a body of research on the verbal deception detection approach, with meta-analytical findings suggesting that detail richness can identify liars and truth-tellers better than chance (Masip, Sporer, Garrido, & Herrero, 2005; Oberlader et al., 2016; Vrij et al., 2015). However, meta-analyses rely on the quality of the original studies and cannot ascertain whether the individual effects reported in studies are reliable (van Elk et al., 2015). For progress in the field of verbal deception detection, replication studies are needed to solidify the findings and to work towards a strong empirical foundation that practitioners can apply. In other words, replication efforts are just as important as new, exploratory studies: "innovation points out paths that are possible; replication points out paths that are likely; progress relies on both" (Open Science Collaboration, 2015, p. 7).

| Replicating verbal deception detection research
The importance of replication research was shown by the landmark finding that only one third to one half of 100 psychological experiments could be replicated (Open Science Collaboration, 2015). A replication study can be conceptual or direct (Nosek et al., 2015). Conceptual replication studies test a previously found effect under new circumstances, taking into account the key ingredients that are believed to matter. A direct replication aims to mimic the original study as closely as possible (Simons, 2014). For any effect to matter, it should be obtainable under similar circumstances. That is, if experiment X finds an effect, a new experiment Y following the procedure, sample size, and analysis of X should be able to detect that same effect. The field of verbal deception detection is characterized by a multitude of interviewing techniques (e.g., asking difficult questions vs. open recall), cues (e.g., plausibility, consistency, and richness of detail), coding of those cues (e.g., what counts as a detail), annotation methods (e.g., manual human annotation and automated information extraction), and analytical approaches (e.g., individual cues vs. predictive modelling with multiple cues). These elements allow for high researchers' degrees of freedom (Gelman & Loken, 2013), that is, aspects on which the researcher has to make decisions when conducting a study and presenting results. The resulting variation between studies makes it hard to assess how robust the effects found in verbal deception detection are. In the current paper, we therefore present the first replication of verbal deception detection research.

| The original study
We aimed to replicate the second experiment of Warmelink, Vrij, Mann, and Granhag (2013). Eighty-four participants (36 male; mean age 58 years, SD = 12.6) were instructed to either tell the truth or lie about the reasons for travelling on a 6-hr-long ferry trip between Portsmouth (UK) and Caen (France). Participants were approached by an interviewer blind to the experimental condition and were asked either a control question ("Please describe in as much detail as possible what you are going to do today at your destination") or a temporal prompt question ("Please describe what your timetable is for today at your destination"). The answers (word count M = 33.7, SD = 20.71) were manually annotated by two independent, trained human judges on specific times (e.g., "half past seven" and "five o'clock"), temporal details (e.g., "earlier" and "1 hour"), and spatial details (e.g., "in Paris" and "to London"). Truthful answers contained more mentions of specific times than deceptive answers (d = 0.54; 95% confidence interval [CI]: [0.10, 0.99]). We chose the effect found for specific times for replication because (a) the very short interview (47 s) is attractive for applied purposes, (b) the dependent measure of specific times is well-automatable (Kleinberg, Mozes, Arntz, & Verschuere, 2017), and (c) the effect size is promising for a field characterized by relatively small effects (DePaulo et al., 2003).

| The current study: Direct replication part
We replicated the time prompt question findings from the second experiment in Warmelink et al. (2013). The study was conducted on a ferry to a Dutch island, and participants were interviewed in Dutch. We extended the original experiment to further test whether actively eliciting specific information benefits deception detection. Note that the additional question came after the replication part so that it could not affect the replication. Our first hypothesis is taken directly from the original study and states that truthful answers to the time schedule question contain a higher proportion of specific time references than deceptive answers.

| The current study: Additional question and coding
Apart from the direct replication part, we also examined whether the proportion of spatial details is higher in truthful than deceptive answers on an additional route description question. Similar to the prompt question mechanism for specific time references in the original study (i.e., asking for specific times enlarges truth-lie differences), asking for a route description might be helpful to invoke truth-lie differences on an additional dimension, namely, spatial details. Because the majority of verbal deception research resorts to humans who count the occurrences of verbal indicators, we further added two conceptually identical hypotheses on the related, computationally extracted constructs (temporal and spatial details).
We expected that the proportion of "time" and "space" references as extracted with word count software is higher in truthful than in deceptive answers for questions on the respective domain (i.e., the time schedule and the route question). The procedure, manipulations, hypotheses, and analyses for the current study were preregistered before data collection (accessible at https://osf.io/w9qe2/register/565fb3678c5e4a66b5582f67). The materials, data, and code are available at https://osf.io/t29dz/. This paper reports all measures, conditions, data exclusions, and considerations used to determine the sample size, as stated in the preregistration.

| Participants
We approached participants on a ferry and interviewed them about their plans at their destination. We collected data from passengers on the ferry from the Dutch mainland (Harlingen) to the Dutch island Terschelling, a trip of approximately 120 min. Similar to the original study, willingness to participate was high, with more than 80% of approached passengers agreeing to partake. We aimed to collect data for the same sample size as the original study (n = 84). As stated in the preregistration, this sample size is nearly identical to the one obtained from an a priori statistical power analysis for the key to-be-replicated effect size of d = 0.54 (one-sided t test, alpha significance level 0.05, and power of 0.80; required n = 88). Our initial sample consisted of 85 participants, of whom six were excluded because they did not follow the instructions properly (e.g., they were not lying in the deceptive condition). Our final sample consisted of 79 participants, randomly assigned to either the truthful (n = 41, 41.46% female, M age = 45.51 years, SD age = 18.39) or deceptive condition (n = 38, 39.47% female, M age = 45.95 years, SD age = 14.68).
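The a priori power analysis mentioned above (one-sided two-sample t test, d = 0.54, alpha = 0.05, target power 0.80) can be checked with a short script. This is a sketch of the standard noncentral-t power computation, not the authors' original calculation (which may have used dedicated software such as G*Power):

```python
from scipy import stats

def power_two_sample(n_per_group, d, alpha=0.05):
    """Power of a one-sided two-sample t test via the noncentral t distribution."""
    df = 2 * n_per_group - 2
    ncp = d * (n_per_group / 2) ** 0.5   # noncentrality for equal group sizes
    t_crit = stats.t.ppf(1 - alpha, df)  # one-sided critical value
    return stats.nct.sf(t_crit, df, ncp)

def required_total_n(d, target_power=0.80, alpha=0.05):
    """Smallest total N (both groups combined) reaching the target power."""
    n = 2
    while power_two_sample(n, d, alpha) < target_power:
        n += 1
    return 2 * n

print(required_total_n(0.54))  # close to the n = 88 stated in the preregistration
```

Because the required n per group sits right at the power boundary for d = 0.54, rounding conventions can shift the total by one participant per group.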

| Design
The design of this experiment is a 2 (Veracity: truthful vs. deceptive; between-subjects) by 2 (Question focus: time schedule vs. route description; within-subjects) design with the proportion of human-coded specific times as the key dependent variable. The focus of the replication is on the time schedule question, identical to the original study's "time prompt" condition.
Additional dependent variables, as outlined in the preregistration, are the proportion of human-coded spatial details, as well as the automatically coded proportions of temporal and spatial details. As a control, we also asked for participants' motivation to be convincing.

| Procedure
Two experimenters gathered the data on the ferry boat on 4 days in 2017. The experimenters approached participants for voluntary participation in a "deception detection experiment." All experimenter-participant interaction was in Dutch. Experimenter 1 approached the participants, asked whether they were willing to participate, and had them sign the informed consent form. Before participants were allocated to either the truthful or deceptive condition, Experimenter 1 established the ground truth by asking the participants what their plans at their destinations were (e.g., "going home" and "weekend trip to Terschelling") and asked for the participants' age and whether they had made the trip before. All participants were randomly assigned to the truthful or deceptive condition: participants chose an envelope from a shuffled stack of all envelopes containing the instructions for truth-tellers or liars. The participants read the instructions according to their condition in the envelope as follows: "You are in the truth condition. In a few minutes, an interviewer will ask you a few questions about your trip. Your task is to tell the truth about what you are going to do at your trip's destination. Try to convince the interviewer that you are telling the truth. There will be no follow-up questions" (truthful condition); and "You are in the lie condition. In a few minutes, an interviewer will ask you a few questions about your trip.
Your task is to lie about what you are really going to do at your trip's destination and to pretend that you are travelling for a different reason. Try to convince the interviewer that you are telling the truth. There will be no follow-up questions" (deceptive condition).
Each participant had 3 min of preparation time before the second experimenter (i.e., the interviewer) arrived.
The interview consisted of two brief questions. The first one (time schedule question) focused on the temporal aspects of the journey, and the interview question was identical (translated to Dutch) to the one asked in the original experiment: "Please describe in as much detail as possible what your timetable is for today at your destination." The additional question that we added (route description question) targeted spatial aspects of the trip and concerned the route description from the moment the participant got off the ferry boat to their destination ("Please describe the route from when you leave the boat to your destination"). Each interview was audio-recorded and later transcribed. After the interview, the experimenter asked for the participants' motivation to provide a convincing story (from 1, very low, to 10, very high), asked them to recall their veracity instructions, and noted the participants' gender.

| Human coding of statements
The transcribed interviews were coded by two independent, trained human judges. Before coding the actual transcripts, both coders received a detailed 3-hr training session with one of the authors (B. K.), using practice statements from a different study (also on truthful and deceptive intentions). The annotation guidelines were identical to those used in the original study. After discussing annotation inconsistencies, the two judges annotated another six full statements, whose annotation was approved by the lead author of the original study and coauthor of the current paper (L. W.). We instructed the coders to annotate and count the number of specific time references (e.g., "quarter past one") using verbatim the same instructions from the original experiment. For the additional hypothesis and the exploratory part, the coders also counted the number of spatial details (e.g., "next to" and "down") and the number of temporal details (e.g., "after," "before," and "subsequently"). To assess the reliability of the coding procedure, we had the first coder score 40% of the statements and the second coder score all statements. The agreement between the two human judges was high (specific time: Pearson correlation coefficient r = 0.90, intraclass correlation ICC = 0.86, p < 0.001; spatial details: r = 0.92, ICC = 0.89, p < 0.001; temporal details: r = 0.68, ICC = 0.71, p < 0.001). For the analysis, we used the judgments of the second coder and standardized the count variables (specific times, spatial, and temporal details) by the word count of each statement per question type (see Table 1 for examples of statements high and low in the human-coded variables).
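The two computations described above, standardizing a detail count by the statement's word count and indexing inter-rater agreement with a Pearson correlation, can be illustrated with a small sketch. The counts below are made-up toy values, not the study's data:

```python
import numpy as np

def detail_proportion(detail_count, statement):
    """Standardize a raw detail count by the statement's word count."""
    return detail_count / len(statement.split())

# Toy example: 2 specific time references in an 8-word answer -> 0.25
p = detail_proportion(2, "around nine we take the bus to town")

# Hypothetical counts from two coders over six statements
coder1 = np.array([2, 0, 1, 3, 1, 0])
coder2 = np.array([2, 1, 1, 3, 0, 0])
r = np.corrcoef(coder1, coder2)[0, 1]  # Pearson agreement index
```

Dividing by word count keeps long, talkative answers from dominating the detail measures, which matters here because answer length differed between the original and the replication.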

| Automated coding of statements
An alternative to human judgments is the Linguistic Inquiry and Word Count (LIWC) software (Pennebaker, Boyd, Jordan, & Blackburn, 2015). LIWC counts how many words in an input text belong to predefined psycholinguistic lexicon categories and has been used in verbal deception research before (e.g., Bond et al., 2017). For the current experiment, we used the categories "time" (e.g., "once" and "since") and "space" (e.g., "above" and "outside"), each of which was standardized by the word count per statement and question type. We used the Dutch translation of the 2007 LIWC version (Boot, Zijlstra, & Geenen, 2017).

Thus, although the time question elicited more specific time answers than the route question, we did not find that the time schedule question elicited more specific times in truth-tellers than in liars (see Table 2).
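The dictionary-based counting that LIWC performs can be sketched as below. The word lists here are illustrative only; the real LIWC lexicon is proprietary and also matches word stems via wildcards, which this sketch omits:

```python
# Illustrative mini-lexicons, NOT the actual LIWC categories
CATEGORIES = {
    "time": {"once", "since", "today", "later", "hour"},
    "space": {"above", "outside", "near", "route", "harbour"},
}

def category_proportions(text):
    """Proportion of words in each category, standardized by word count."""
    words = text.lower().split()
    return {cat: sum(w in lexicon for w in words) / len(words)
            for cat, lexicon in CATEGORIES.items()}

props = category_proportions("We arrive today and walk outside near the hotel")
```

For the 9-word example above, the sketch returns a "time" proportion of 1/9 ("today") and a "space" proportion of 2/9 ("outside," "near"), mirroring how LIWC output is standardized per statement.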

| Additional measure: Motivation
Participants were highly motivated to provide a convincing story.

The resulting Bayes factor can be interpreted as substantial evidence for the null hypothesis that truthful statements do not differ from deceptive ones over the alternative hypothesis. In the original study, there was substantial evidence in favor of the alternative hypothesis, BF10 = 3.24.

| Informed priors
When one possesses evidence about the likelihood of an effect before obtaining new data, this prior belief should be explicitly incorporated into the Bayesian estimation. To do so, we treat the findings of the original study as the prior evidence for the data from the replication study; that is, the posterior distribution of the original study becomes the prior for the replication (Gronau, Ly, & Wagenmakers, 2017). In doing so, we incorporate the belief of the original study (i.e., that there is a moderately sized effect) into the hypothesis testing of the replication and obtain BF01 = 24,555.02, "extreme evidence" in favor of the null.
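For readers who want to verify a default Bayesian t test themselves, the two-sample JZS Bayes factor (Rouder, Speckman, Sun, Morey, & Iverson, 2009) can be computed by numerical integration. This sketch implements only the default Cauchy prior on effect size; the informed-prior analysis above instead uses the original study's posterior as the prior (Gronau et al., 2017), which is not shown here:

```python
import math
from scipy import integrate

def jzs_bf10(t, n1, n2, r=math.sqrt(2) / 2):
    """Two-sample JZS Bayes factor; BF10 > 1 favors a group difference,
    and BF01 = 1 / BF10 quantifies evidence for the null."""
    df = n1 + n2 - 2
    n_eff = n1 * n2 / (n1 + n2)
    # Marginal likelihood under H0 (up to a constant shared with H1)
    like_null = (1 + t**2 / df) ** (-(df + 1) / 2)

    def integrand(g):
        if g <= 0:
            return 0.0
        s = 1 + n_eff * g * r**2
        # Inverse-chi-square(1) prior on g, as in Rouder et al. (2009)
        prior_g = (2 * math.pi) ** -0.5 * g ** -1.5 * math.exp(-1 / (2 * g))
        return s ** -0.5 * (1 + t**2 / (s * df)) ** (-(df + 1) / 2) * prior_g

    like_alt, _ = integrate.quad(integrand, 0, math.inf)
    return like_alt / like_null
```

With the replication's cell sizes (n = 41 and 38), a t value near zero yields a BF10 well below 1, that is, evidence for the null rather than mere failure to reject it.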
Treating the original effect size (here d = 0.54) at face value can be misleading because most published effect sizes are overestimations of the true effect (Gelman & Carlin, 2014; Simonsohn, 2015). Even with informed priors, our current study cannot ascertain the existence or absence of an effect that is a lot smaller than the one suggested in the original paper.

| Temporal details
We explored whether the human-coded temporal details (i.e., including nonspecific time references such as "then" and "after") could help discriminate truthful from deceptive statements. There was only a sig-

| Statement length
The 2

| DISCUSSION
This paper presents the first replication study in the field of verbal deception detection research. The original study found that truthtellers mentioned more specific times than liars when talking about a trip they made. We were not able to find significant differences in the occurrence of specific times between truth-tellers and liars.

| Did the findings replicate?
A judgment of the success of a direct replication should go beyond mere statistical significance testing (Nosek & Errington, 2017; Open Science Collaboration, 2015). To evaluate the current replication, each author was asked "Did the results replicate the original effect?" Out of four authors, none voted "Yes," three voted "No," and one voted "inconclusive." The inconclusive vote was motivated by the low power (calculated a priori for a power of 0.80; post hoc reached power for d = 0.54: 0.77, see below). In addition to these criteria, Bayesian hypothesis testing favored the null hypothesis over the original hypothesis. Taken together, several assessment criteria suggest that the original study did not replicate.

| Differences between original and replication study
We see at least three differences between the original study and the replication that may explain the divergent findings. First, in the replication, the majority of participants reported that they had made the same trip before. This might have enabled the liars to base their lie on previous travels, so that their lies contained many truthful aspects retrieved from previous experience. Although this is certainly ecologically valid, it is in stark contrast to experimental deception research, where the lie is often a complete lie that does not draw on previous experience (e.g., Sooniste, Granhag, Strömwall, & Vrij, 2015). The small number of passengers who had not made the trip before, and the lack of this information for the original study, do not allow us to explore this explanation further. The travellers' experience with their destination and the travel to it might even be a crucial moderator (e.g., Warmelink, Vrij, Mann, Jundi, & Granhag, 2012). Clearly, more research is needed on this matter.
Second, an important aspect of direct replications is that of the setting, population, and time, so that "[e]xact replications are replications of an experiment that operationalize both the independent and the dependent variable in exactly the same way as the original study" (Stroebe & Strack, 2014, p. 61). Although the setting (on a ferry) was mirrored closely, one important difference could have been the participants' native language. In the original study, participants were interviewed in their native English, whereas the replication interviewed participants in their native Dutch. In the absence of evidence that the English and Dutch languages differ in their prevalence of specific time references (for an examination of spatial references, see Van Staden, Bowerman, & Verhelst, 2006, who show that Dutch might be richer in spatial description grammar), we argue that it is unlikely that the language difference affected the chance of replication. Moreover, the underlying theories (e.g., Reality Monitoring) are not limited to a particular language but rather assume that the memory recollection processes are universal. Third, the answers in the replication were lengthier than in the original study. It is possible that the differences mentioned above played a role here, so that, for example, participants were more talkative because they had already made the trip. Importantly, however, that difference in answer length should not have lowered the chance of replication, as lengthier statements are typically better suited for verbal deception detection than shorter ones, and several methods are specifically designed to elicit lengthier and richer verbal accounts (e.g., the model statement technique; Harvey, Vrij, Leal, Lafferty, & Nahari, 2017).
Despite the seemingly minor (or absent) detrimental effects of potential slight deviations from the original, it cannot be established whether these minor variations combined made the replication less likely. In the absence of evidence that such slight variations could have affected the findings, we acknowledge this possibility but cannot say which variation, or which combination of variations, caused the replication failure. To the best of our knowledge and intention, the current replication study is identical to the original in that we operationalized the independent and dependent variables precisely as was done in the original. We therefore deem it fair to call the replication a direct one.

| Statistical power for the replication study
An important methodological aspect of replication efforts is the statistical power of the replication study (i.e., the likelihood that a significant effect of a given size, here d = 0.54, is observed given the sample size and alpha significance threshold; Lakens, 2013). Practical considerations in the current replication study led us to mirror the sample size of the original study, which coincided with a priori calculations for a power of 0.80. The achieved power was marginally smaller (0.77). This implies that, on average in the long run, the chance of observing the original effect if it were there was only 0.77. A single replication attempt, with a 23% chance of incorrectly failing to detect an existing effect of the original size, is thus not enough to conclude that the effect does not exist (at least when one relies on the 5% significance threshold). The latter is amplified by the observation that most published effects are overestimations, and hence, true to-be-replicated effects are smaller than those that are reported (Gelman & Carlin, 2014). Therefore, we can conclude that we could not replicate the original effect of identical size, but we cannot with high confidence ascertain that the effect (i.e., more specific time references for truthful than for deceptive intentions) does not exist. It is possible that such an effect exists but that it is much smaller in magnitude (see also Gelman's "piranha argument" about the unlikely coexistence of large effects in behavioral science; Gelman, 2017). Taken together, if an effect is considered to be important (e.g., for practical or scientific reasons), higher powered studies and more replication attempts are needed.
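The post hoc power figure of 0.77 can be checked directly from the replication's cell sizes and the original effect size. This is a sketch of the standard noncentral-t computation, not necessarily the authors' exact procedure:

```python
from scipy import stats

def achieved_power(n1, n2, d, alpha=0.05):
    """Post hoc power of a one-sided two-sample t test for effect size d."""
    df = n1 + n2 - 2
    ncp = d * (n1 * n2 / (n1 + n2)) ** 0.5  # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha, df)
    return stats.nct.sf(t_crit, df, ncp)

# Replication cells: n = 41 truth-tellers, n = 38 liars; original d = 0.54
print(round(achieved_power(41, 38, 0.54), 2))  # approximately the reported 0.77
```

The complement of this value is the roughly 23% chance of missing an effect of the original size, which is the basis of the argument in the paragraph above.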

| Additional insights
We did not obtain support for the additional hypotheses that truthful statements contain more temporal details (human- and computer-coded) and more spatial details (computer-coded) than deceptive statements. Contrary to our expectation, however, we found that deceptive statements contained more human-coded spatial details than truthful ones. The framework of interpersonal Reality Monitoring predicts that truth-tellers can recall an event in more detail than liars because the latter never experienced it and, therefore, have to resort to fabrication (Johnson et al., 1998; Nahari, 2018). Liars also have fewer cognitive resources available to produce a detailed, rich account of the fabricated event. Although it contradicts the notion that liars lack the cognitive resources to produce statements as detailed as those of truth-tellers, the opposite effect found for spatial details is not an exception. It has previously been argued that expected, factual questions are what liars prepare for and can, therefore, enrich with details (Warmelink et al., 2012). In support of that idea, people who lied about their planned weekend activities mentioned more persons and more locations than those who told the truth (Kleinberg, van der Toolen, Vrij, Arntz, & Verschuere, 2018). A working hypothesis states that liars might overcompensate in their statements because they are particularly inclined to appear convincing, whereas truth-tellers assume that their truthfulness will shine through naturally. In a different study, individual details mentioned by truth-tellers and liars were coded as truthful or false, and a similar pattern emerged: Liars compensated for their inability to provide sufficient truthful detail after a 2-week delay by adding false details, whereas truth-tellers did not (Nahari, 2018). To address these dynamics, the use of unexpected questions (e.g., on the planning of the event) seems a worthwhile addition to future research on this hypothesis.

| CONCLUSION
Truth-telling and lying ferry passengers did not differ significantly in specific time references when asked about the time schedule of their travel plans. It should be noted that both the original and the replication study provide only a point estimate of the effect. This is not uncommon in replication research (e.g., Open Science Collaboration, 2015); ideally, however, any replication effort would consist of multiple, independent replication attempts. In the current study, the lack of high statistical power leaves open the possibility that an actual effect exists. We encourage other researchers in the deception detection community to conduct preregistered, well-powered, multilab replication studies of the core effects of the field to consolidate the science of verbal deception detection. Such a collective effort will help clarify which effects in verbal deception research are reliable.