Automated verbal credibility assessment of intentions: The model statement technique and predictive modeling

Summary Recently, verbal credibility assessment has been extended to the detection of deceptive intentions, the use of a model statement, and predictive modeling. The current investigation combines these 3 elements to detect deceptive intentions on a large scale. Participants read a model statement and wrote a truthful or deceptive statement about their planned weekend activities (Experiment 1). With the use of linguistic features for machine learning, more than 80% of the participants were classified correctly. Exploratory analyses suggested that liars included more person and location references than truth‐tellers. Experiment 2 examined whether these findings replicated on independent‐sample data. The classification accuracies remained well above chance level but dropped to 63%. Experiment 2 corroborated the finding that liars' statements are richer in location and person references than truth‐tellers' statements. Together, these findings suggest that liars may over‐prepare their statements. Predictive modeling shows promise as an automated veracity assessment approach but needs validation on independent data.


| Detecting deceptive intentions
For many years, deception research focused on people lying about their past actions (e.g., what someone was doing at the time of a crime). More recently, attention has also been paid to the detection of deceptive intentions (Mac Giolla, Granhag, & Liu-Jönsson, 2013; Sooniste, Granhag, Knieps, & Vrij, 2013; Warmelink, Vrij, Mann, & Granhag, 2013). There are indications that the principles that apply to the detection of deception about past events also apply to deceptive intentions (Granhag & Mac Giolla, 2014). When truth-tellers report a past event, they can rely on their memory, whereas liars cannot if they discuss an event they have never experienced. A similar logic may apply to lying about intentions. Plans for future actions that are not accompanied by an intention to execute them result in a less detailed mental image of the event than plans that are accompanied by an intention to enact them (Granhag & Knieps, 2011; Szpunar, 2010). It is important to note, however, that past events are imagined in more detail than future events (D'Argembeau & Van der Linden, 2004; Gamboz et al., 2010). Cues to deception concerning intentions might, therefore, be less clear than those for past events.
To date, research into the verbal approach to the detection of deceptive intentions has examined different verbal cues, sometimes with contradictory findings. In one study, passengers at international airports were instructed to lie or tell the truth about their forthcoming trip (Vrij et al., 2011). Those who lied about their journey provided statements that were less plausible and included more contradictions than truthful statements but did not differ in the amount of detail. Building on the notion that the expectedness of the questions asked might moderate the effectiveness of the verbal deception detection approach, another series of experiments asked participants expected and unexpected questions about a fabricated or truthful future event (Fenn, McGuire, Langben, & Blandón-Gitlin, 2015; Warmelink et al., 2013; Warmelink, Vrij, Mann, Jundi, & Granhag, 2012). Although differences in the amount of detail emerged in some studies when unanticipated questions were asked (Sooniste et al., 2013; Warmelink et al., 2013), these effects were absent in other studies (Fenn et al., 2015; Kleinberg, Nahari, Arntz, & Verschuere, 2017). In yet another study, markers of good planning behavior (e.g., effective time allocation and how an action will be carried out) were more prevalent in truthful than in deceptive statements (Mac Giolla et al., 2013). Conversely, deceptive statements contained more justifications for the actions (i.e., why an action will be carried out). Furthermore, a recent study reported that deceptive intentions contained fewer verifiable details than truthful ones (Jupe et al., 2017). Taken together, the literature on the detection of deceptive intentions suggests that the verbal approach could be promising and that the richness of detail might be a useful cue to deception.

| The model statement technique
A model statement is a detailed example of a verbal statement given by someone on a topic unrelated to the current research context, and providing such a statement may help to increase verbal differences between truth-tellers and liars. By reading a detailed example before providing their account, interviewees are thought to learn the level of detail that is expected from their statement, which in turn makes them inclined to provide more detail. Providing more detailed information should be easier for truth-tellers than for liars: The former could easily retrieve details from their memory of a specific event, whereas liars struggle to include sufficient detail to match the expectations raised by the model statement (Vrij, Fisher, & Blank, 2017;Vrij, Hope, & Fisher, 2014). Besides, liars will likely not provide more detailed information after reading a model statement because the provision of extra information could lead to cues that give away their lie (e.g., incriminating information, Nahari, Vrij, & Fisher, 2014) or expose the lack of contextual information in their account (Vrij, Fisher, & Blank, 2017).
There are mixed findings as to the usefulness of the model statement method so far. On the one hand, the provision of a model statement led to lengthier statements and better truth-lie discrimination (i.e., truthful statements were more plausible; see Leal, Vrij, Warmelink, Vernham, & Fisher, 2015). Another study found that the discrimination between truthful and deceptive insurance claims based on the number of verifiable details improved with a model statement (Harvey, Vrij, Leal, Lafferty, & Nahari, 2017). Moreover, a model statement benefited detection accuracy when details inferred from behavior scripts (e.g., "we went to the restaurant and ordered food and something to drink") and complications were counted. These studies suggest that the model statement aids deception detection when the quality of information (e.g., plausibility, verifiability, and number of complications) is measured. On the other hand, several other studies have not found support for the beneficial role of a model statement when the quantity of details is examined. In Bogaard, Meijer, and Vrij (2014), a model statement led to lengthier statements but did not benefit the discrimination between truth-tellers and liars with commonly used verbal content analysis tools measuring quantity of detail (e.g., RM). Likewise, there was no evidence for a beneficial effect of providing a model statement on the amount of "total details" (Ewens et al., 2016) or on statement quantity in children and adolescents (Brackmann, Otgaar, Roos af Hjelmsäter, & Sauerland, 2017). In sum, there are indications that a model statement may improve verbal deception detection when examining verbal aspects other than the quantity of details. Importantly, although some studies failed to find an effect of the model statement, no study indicated that a model statement impeded deception detection, and regarding quantity of details, several studies showed that it increased the information provided (e.g., Bogaard et al., 2014; Leal et al., 2015). The current study tests whether the model statement technique can facilitate the detection of truthful and deceptive intentions.

| Large-scale deception detection
In a setting such as prospective airport passenger screening, large-scale deception detection may be feasible only when data can be collected and analyzed automatically (Kleinberg, Arntz, & Verschuere, in press). A key challenge for verbal deception detection is then the transition from manual, human coding of verbal content towards computer-automated approaches. Although these two methodological lines have the same goal of identifying deceptive and truthful content, they have different advantages and shortcomings (e.g., Hauch, Blandón-Gitlin, Masip, & Sporer, 2015). First, the manual annotation of a text is limited in its large-scale potential because it relies on instructed human coders. The effort and time involved in the human coding approach make it virtually unfit for the assessment of vast numbers of statements in near real time (e.g., in airport settings). Computer-automated approaches are less affected by this requirement, can be scaled up, and allow for text analysis in real time (for a review, see Fitzpatrick, Bachenko, & Fornaciari, 2015). Second, inherent to the involvement of human assessors in manual coding is the lack of perfect reliability of the judgments made. Unlike the output of computer-automated approaches, the agreement between multiple human coders is never perfect, which might pose a threat to validity. Because we are particularly interested in potential large-scale applications, we resort to computer-automated methods as the primary analytical tool in the current study. Several methods have been proposed to integrate verbal deception theory and computer-automated analysis.

| Linguistic Inquiry and Word Count
The Linguistic Inquiry and Word Count (LIWC) software (Pennebaker, Boyd, Jordan, & Blackburn, 2015) examines the proportion of words belonging to one of 92 categories. The attractiveness of the LIWC is that the categories are thought to represent psycholinguistic processes such as the emotional tone of a text (e.g., "lucky" and "melancholic") or the number of cognitive processes in a text (e.g., "know" and "ought").

| Named entity recognition
Recently, it has been proposed to use named entities in verbal deception detection (Kleinberg, Mozes, et al., 2017;Kleinberg, Nahari, & Verschuere, 2016). Named entity recognition (NER) is an information extraction method that identifies and classifies information from natural language into predefined categories (e.g., persons, dates, and times).
Truthful statements are expected to contain more named entities than deceptive statements because truthful accounts (a) are typically richer in detail (Johnson et al., 1998;Masip et al., 2005), (b) contain more verifiable details (Nahari et al., 2014), and (c) are often more contextually embedded (Köhnken, 2004). The named entity-based approach has been shown to be useful for the identification of deceptive and truthful hotel reviews (Kleinberg, Mozes, et al., 2017). These findings suggest that named entities might be a means to measure the liars' strategy of withholding potentially incriminating information (e.g., persons that could be consulted to verify an alibi), resulting in liars' mentioning fewer named entities.

| The current study
We investigated whether it is possible to detect truthful and deceptive statements about planned activities in a computer-automated verbal deception detection workflow (i.e., automated data collection and automated text analysis). Because the majority of verbal deception research has been conducted regarding past activities, we also included a comparison condition of participants who provided a truthful or deceptive statement about their recent activities (Experiment 1). To enhance verbal differences, we provided all participants with a model statement in Experiment 1 and experimentally investigated the provision of the model statement in Experiment 2.
In the first experiment, there were four conditions. In the two truthful conditions, participants were instructed to tell the truth about their (a) forthcoming or (b) past weekend. In the two deceptive conditions, participants were instructed to lie about an activity assigned to them (c) for the forthcoming or (d) about the past weekend. The main focus of this study was the automated detection of deception. All statements were therefore coded automatically using the LIWC and named entity approaches. Because human coding is the standard in the majority of psycholegal deception studies, we added manual annotations on a subset (40%) of the statements of Experiment 1.
We expected several main effects of veracity. On the basis of the theory of RM and the idea that richer mental images accompany genuinely planned activities, it was expected that truthful statements would be lengthier (dependent variable [DV]: no. of words), be richer in detail (DV: richness of detail measured via LIWC and human coding), contain more specific information (DV: named entities), and be more plausible (DV: human-coded plausibility) than deceptive statements.
We also expected that truthful statements would contain more references to how (DV: human-coded how-utterances) an activity was executed and fewer justifications of the actions (i.e., why they executed an activity, DV: human-coded why-utterances) than deceptive statements (Mac Giolla et al., 2013). Last, we expected that the difference between truthful and deceptive statements would be more pronounced for statements about the past than for statements about the future (interaction hypothesis). In the exploratory analysis, we looked at machine learning classification of truthful and deceptive statements and examined individual linguistic predictors.

| Data availability statement
The confirmatory analyses for the two experiments were preregistered before data collection. The preregistrations, data, and supporting information are available at https://osf.io/wqc4p/. The source code to the experimental tasks is available at https://github.com/ben-aaron188/verbal_deception_past_future.

English speakers and had not partaken in previous pilot studies. To ensure that participants had concrete weekend plans, we collected data just before a weekend (Thursday and Friday). All participants were reimbursed with GBP1.50 for this study. Due to simultaneous starting times, we collected data from 347 participants, to which we applied four preregistered exclusion criteria: duplicate IP addresses (n = 23), incomplete data (n = 4), not following the instructions (n = 0), and failing the manipulation check (i.e., not recalling the instructions after writing the statement, n = 28; all participants were asked "How were you instructed to write your statement?" on a scale from 0 = answer truthfully to 100 = answer deceptively; we excluded those who indicated a score higher than 10 in the truthful condition, or a score lower than 90 in the deceptive condition).
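The manipulation-check exclusion rule described above can be written down compactly. The following is only a minimal sketch of that rule; the function name, the record layout, and the example participants are hypothetical and are not taken from the study's source code.

```python
def fails_manipulation_check(condition: str, recalled_score: float) -> bool:
    """Return True if a participant would be excluded by the manipulation check.

    The slider runs from 0 (answer truthfully) to 100 (answer deceptively).
    Truthful participants are excluded above 10, deceptive ones below 90.
    """
    if condition == "truthful":
        return recalled_score > 10
    if condition == "deceptive":
        return recalled_score < 90
    raise ValueError(f"unknown condition: {condition!r}")


# Hypothetical participants (illustrative values, not study data).
participants = [
    {"id": "p1", "condition": "truthful", "score": 3},
    {"id": "p2", "condition": "truthful", "score": 40},
    {"id": "p3", "condition": "deceptive", "score": 95},
    {"id": "p4", "condition": "deceptive", "score": 90},
    {"id": "p5", "condition": "deceptive", "score": 60},
]

kept = [
    p["id"]
    for p in participants
    if not fails_manipulation_check(p["condition"], p["score"])
]
print(kept)  # ['p1', 'p3', 'p4']
```

Scores exactly at the cutoffs (10 and 90) are retained here, following the wording "higher than 10" and "lower than 90".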

| The model statement
We adhered to the suggested guidelines for formulating a model statement (Centre for Research and Evidence on Security Threats, 2016), with one exception. Given the online context of the current investigation, we did not provide an audiotaped version but rather presented the statement as text (as did Harvey et al., 2017). We followed the remaining suggestions and created a statement that (a) is unrelated to the research scenario (here: weekend plans), (b) describes an authentic experience, and (c) is not created on the spot during the interview.
The actual model statement was created by interviewing a friend of one of the authors via telephone about her first day at university.
The interview was transcribed and translated into English from Dutch, resulting in a length of 527 words (Supporting Information S1). To ensure that the participants read the statement, they could only proceed to the next page after 1 min and were informed that they would be asked four multiple-choice questions about the model statement (Supporting Information S2). If a participant failed to answer a question correctly, she or he was redirected to the model statement followed by four new multiple-choice questions.

| Experimental manipulation
Participants were randomly allocated to one of two conditions of veracity (truthful vs. deceptive). Thus, participants gave either a deceptive or truthful statement on their planned or past activities. Liars were assigned an activity that they had to pretend to intend for the coming weekend (or to have done on the past weekend). We allotted an activity to liars to prevent them from drawing on one of their previously experienced weekend activities. To keep the selection of activities standardized, all participants had to choose from a drop-down menu of 31 activities (e.g., attending a wedding; Supporting Information S3).

Past weekend plans
In the past weekend conditions, participants were asked to select at least one activity that they had carried out last weekend and at least three activities that they had not carried out last weekend. For those activities that they indicated to have carried out last weekend, they were asked to report how often they had done them before (on a slider from never to very often). Subsequently, they were asked the same question for the activities that they said they had not carried out last weekend. In the truthful condition, participants were instructed to provide a convincing account about one activity that was randomly chosen from their selected truthful activities. In the deceptive condition, participants were assigned one activity that they, in the previous step, indicated to not have carried out before. For instance, if a participant in the deceptive condition had indicated to have "visited the zoo" but did not "go to a birthday party," the participant could be assigned to declare to have attended a birthday party.
To provide a little more context, we added one extra detail to the selected activity in the deceptive condition. For example, if the determined activity was "throwing a party", the assigned activity was "throwing a party with your friends at your favorite pub" (Supporting Information S4).

Future weekend plans
In the future weekend conditions, participants were asked to select at least one activity that they were planning to do on the upcoming weekend and at least three activities that they were not planning to do. For the planned activities, they were asked to indicate how often they had done them before, how certain they were about carrying out that activity, and how well they had planned that activity. For the activities that they indicated they would not carry out, participants were asked how often they had carried them out before and how certain they were of not carrying them out. Equivalent to the truthful past weekend condition, those in the truthful forthcoming weekend condition were given one activity that they intended to do next weekend. In the deceptive forthcoming weekend condition, they were assigned the activity that had the lowest score on how often they had done it before and the highest score on how certain they were not to carry out that activity. Equivalent to the past weekend plans, we added a little more detail to the activity in the deceptive forthcoming weekend condition (e.g., "Going to a festival in a big city with a friend").
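The assignment rule for the deceptive forthcoming weekend condition can be illustrated as follows. The text does not state how the two criteria were combined, so this sketch assumes a lexicographic ordering (lowest prior frequency first, with ties broken by the highest certainty of not carrying out the activity). The function name, field names, and ratings are hypothetical.

```python
def assign_deceptive_activity(unplanned_activities):
    """Pick the activity to lie about from the activities not planned.

    Assumption of this sketch: prior frequency is the primary criterion
    (lower is better) and certainty of not carrying out the activity is
    the tie-breaker (higher is better).
    """
    best = min(
        unplanned_activities,
        key=lambda a: (a["frequency"], -a["certainty_not"]),
    )
    return best["name"]


# Hypothetical slider ratings (0-100) for three unplanned activities.
unplanned = [
    {"name": "going to a festival", "frequency": 5, "certainty_not": 90},
    {"name": "attending a wedding", "frequency": 5, "certainty_not": 100},
    {"name": "visiting the zoo", "frequency": 40, "certainty_not": 100},
]

assigned = assign_deceptive_activity(unplanned)
print(assigned)  # attending a wedding
```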

| Procedure
Participants accessed the experimental task (advertised as "Lie detection study about your weekend plans") via their Prolific account. The minimal requirement for doing this task was a Web browser. Upon starting the task, participants were informed about the study and gave their consent for participating. Next, they read general instructions explaining that some participants would be instructed to tell the truth about their last (or upcoming) weekend and others would be instructed to lie. On the next page, they gave information about their activities during last weekend or for the forthcoming weekend (see Section 2.1.3). Participants were then randomly allocated to an experimental condition and read instructions according to their veracity and time condition. In particular, participants were told that they were about to write a statement about one specific activity, which was indicated in bold letters alongside these instructions. Participants were then directed to the model statement. Once they proceeded through the model statement and the subsequent multiple-choice test, participants received their statement instructions emphasizing that they should make their story "as detailed, plausible and convincing as possible." In both veracity conditions, participants were reminded to write only about the given activity and that they could take the time to prepare their statement.
Moreover, they were told that each account would be read by deception experts who would determine whether or not they believed the story. If they were believed, they would be rewarded with an additional GBP0.50. We paid the bonus to the participants with the 20% highest overall proportion of named entities in their statement.
On the next screen, participants had to write their statement in a text box. They could only proceed to the next screen if their statement was at least 80 words long and if their statement was proper English. If these criteria were not met, they were reminded about the length and language of the required input via a pop-up. We also disabled the copy-and-pasting functionality to prevent participants from reusing text.
After completing the statement, participants were asked three questions to be answered with a slider from 0 to 100.
1. "How were you instructed to write your statement?" (truthful to deceptive)
2. "How much of your statement is based on truthful elements?" (nothing to all of it)
3. "How motivated were you to write a convincing statement?" (not at all to absolutely)
Before exiting the experiment, all participants provided demographic information.

Linguistic Inquiry and Word Count
We used the LIWC to extract the proportions of words in each statement that belonged to those psycholinguistic LIWC categories that best represent the RM richness of detail. Specifically, we modeled the richness of detail as the sum of the LIWC categories percept (perceptual processes; including the subcategories see, hear, and feel; e.g., saw, touch, and heard), space (spatial references; e.g., down and in), and time (temporal references; e.g., until and end; Bond & Lee, 2005).
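As a minimal sketch of the composite score described above, the richness of detail can be computed by summing the three category proportions. The category names percept, space, and time come from the text; the function name and the example values are hypothetical, and the LIWC output is assumed to be available as a mapping from category name to percentage of words.

```python
def liwc_richness_of_detail(liwc_row):
    """Sum the LIWC categories percept, space, and time for one statement.

    `liwc_row` is assumed to map LIWC category names to the percentage of
    words in the statement that fall into that category.
    """
    return liwc_row["percept"] + liwc_row["space"] + liwc_row["time"]


# Hypothetical LIWC output for a single statement (not study data).
example_row = {"percept": 2.5, "space": 7.0, "time": 5.5, "cogproc": 9.0}

richness = liwc_richness_of_detail(example_row)
print(richness)  # 15.0
```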
Our outcome variable is the number of unique named entities (i.e., each entity is counted only once) relative to the word count of each statement (Kleinberg, Mozes, et al., 2017).
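The named entity outcome variable can be sketched as follows. This example assumes that the entities have already been extracted by some NER tool and are supplied as (text, category) pairs. The whitespace word count and the case-insensitive matching of entities are simplifying assumptions of this sketch, not details reported in the text.

```python
def named_entity_proportion(statement, entities):
    """Unique named entities relative to the word count of a statement.

    `entities` is a list of (text, category) pairs from an NER tool.
    Each distinct pair is counted once, ignoring letter case (assumption).
    """
    word_count = len(statement.split())  # simple whitespace tokenization
    if word_count == 0:
        return 0.0
    unique_entities = {(text.lower(), category) for text, category in entities}
    return len(unique_entities) / word_count


# Hypothetical statement and hand-written entity list (not study data).
statement = (
    "On Saturday I will visit Anna in London and then "
    "Anna and I will drive to Brighton"
)
entities = [
    ("Saturday", "DATE"),
    ("Anna", "PERSON"),
    ("London", "LOCATION"),
    ("Anna", "PERSON"),  # repeated mention, counted only once
    ("Brighton", "LOCATION"),
]

proportion = named_entity_proportion(statement, entities)
print(round(proportion, 3))  # 0.235, i.e., 4 unique entities in 17 words
```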

| Manual coding of statements
A random subset of 147 statements (73 on past weekend plans and 74 on future weekend plans) was rated manually by two coders who were blind to the experimental condition and hypotheses. The coders were instructed to rate each statement as a whole on its plausibility, its richness of detail, the occurrence of how-utterances and why-utterances. 1 Each variable was scored on a Likert scale from 1 (very low/few) to 7 (very high/many). Although recent findings suggest that counting details is more reliable than scale judgments (Nahari, 2016), we decided to follow the procedure of previous intentions studies (Sooniste, Granhag, Strömwall, & Vrij, 2015).
Both coders received a training session in which statements were rated and discussed with an instructor. Further, 40% of the statements (n = 58) were rated by both coders, and the remaining 60% (n = 88) were randomly split between the two coders. The intraclass correlation coefficients were .11 for plausibility (ns), .90 for richness of detail (p < .001), .60 for how-utterances (p < .001), and .67 for why-utterances (p < .001). Because of the very low reliability of plausibility, we decided not to analyze plausibility judgments.
1 Plausibility: "Could this incident have happened as described? Could this be an honest description of someone's weekend activities?" (Leal et al., 2015). Richness of detail: "The inclusion of specific descriptions of place, time, persons, objects and events in the statement" (Vrij, 2015). The occurrence of how-utterances: "Concrete descriptions of activities. This can include, but is not limited to, sentences that included phrases such as 'we planned to…', 'we were going to…', 'we intended to…'" (Mac Giolla et al., 2013). Why-utterances: "There are two types of answers to 'why'. First, wider motivations/reasons why someone planned an activity. Second, motivations/reasons for doing something in a certain way" (Mac Giolla et al., 2013).

| Analytical plan
We conducted separate 2 (veracity: truthful vs. deceptive) by 2 (time: past vs. future) between-subjects ANOVAs with preregistered Bonferroni significance level correction on each of the DVs. For seven key DVs in the main, preregistered, analysis, we adhered to an alpha significance level of .05/7 = .007. The effect size Cohen's f indicates the magnitude of effects, with f = 0.10, f = 0.25, and f = 0.40 for small, moderate, and large effects, respectively (Cohen, 1988).
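The corrected significance level and the effect size benchmarks above amount to simple arithmetic, sketched here. The function names are hypothetical, and treating the three benchmark values as lower bounds of each label is a convention assumed for this sketch.

```python
def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-corrected significance level for n_tests comparisons."""
    return alpha / n_tests


def cohens_f_label(f):
    """Label an effect size using Cohen's (1988) benchmarks for f.

    Assumption: 0.10, 0.25, and 0.40 are treated as lower bounds.
    """
    magnitude = abs(f)
    if magnitude >= 0.40:
        return "large"
    if magnitude >= 0.25:
        return "moderate"
    if magnitude >= 0.10:
        return "small"
    return "below small"


corrected = bonferroni_alpha(0.05, 7)
print(round(corrected, 3))   # 0.007
print(0.004 < corrected)     # True: a hypothetical p = .004 would be significant
print(cohens_f_label(0.23))  # small
```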
To compare the diagnostic efficiency of the DVs, we conducted receiver operating characteristics analyses. We compared the areas under the curve (AUCs) using Venkatraman's (2000) AUC comparison test. In the exploratory analyses, we used a supervised machine learning classification task to predict the veracity of statements. All statistical analyses were conducted with R (R Core Team, 2016). For AUC analysis, we used the pROC R package (Robin et al., 2011). The machine learning analyses were conducted with the caret package (Kuhn, 2017).

Table 1 summarizes the results for the confirmatory analyses, expecting main effects of veracity. There was no significant interaction effect between veracity and time for any of the DVs. For the number of words and how-utterances, a significant main effect of time revealed that the statements were lengthier and contained more how-utterances when they were about past weekend activities than when they were about forthcoming weekend plans. Only for one of the four human-coded DVs was the hypothesis supported: truth-tellers included more how-utterances in their statement than liars.
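For readers unfamiliar with the AUC used in the receiver operating characteristics analyses, the sketch below computes it as the probability that a randomly drawn truthful statement scores higher on a DV than a randomly drawn deceptive statement, with ties counted as one half. This is an illustration in plain Python and not the pROC implementation; Venkatraman's comparison test is not reproduced. The scores are hypothetical.

```python
def auc(positive_scores, negative_scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a random positive case outscores a random
    negative case, counting ties as 0.5.
    """
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical how-utterance ratings (1-7 scale), not study data.
truthful = [5, 6, 4, 7]
deceptive = [3, 5, 2, 6]

print(auc(truthful, deceptive))  # 0.75
```

An AUC of 0.50 corresponds to chance-level discrimination, and 1.00 to perfect separation of the two groups.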

| Exploratory analyses
3.3.1 | Machine learning classification: Experiment 1
To predict the veracity of a statement, we used supervised machine learning classification, which, contrary to classical statistical testing, learns from the data to predict an outcome (for an overview, see Yarkoni & Westfall, 2017). More specifically, in a supervised machine learning task, a classifier algorithm is trained on a subset of the data to predict an outcome class (here: truthful vs. deceptive). To build a classifier algorithm, one selects features (i.e., predictor variables) based on which the relationship to the outcome class is learned. To avoid overfitting, we split the data into a training set (80% of the data) and a holdout test set (20%). During the training phase, we applied a fivefold cross-validation with 10 repetitions (e.g., Ott et al., 2011). The cross-validation procedure ensures that each observation in the training data has been used for building and validating the final predictive model. Once the final model was determined, we assessed the performance on the holdout test set, which was not used in the training phase. This procedure is used as a safeguard to ensure the validity of the final model.
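The data partitioning scheme described above (an 80/20 holdout split followed by fivefold cross-validation with 10 repetitions on the training set) can be sketched as index bookkeeping. This is a plain Python illustration and not the caret API. The function names, the seed, and the sample of 100 observations are hypothetical, and unlike some library implementations the split here is not stratified by class.

```python
import random


def train_test_split(indices, test_fraction, rng):
    """Shuffle the indices and hold out a fraction as the test set."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def repeated_kfold(train_indices, k, repeats, rng):
    """Yield (training, validation) index lists for repeated k-fold CV."""
    for _ in range(repeats):
        shuffled = list(train_indices)
        rng.shuffle(shuffled)  # new random fold assignment per repetition
        folds = [shuffled[i::k] for i in range(k)]
        for i, validation in enumerate(folds):
            training = [
                idx for j, fold in enumerate(folds) if j != i for idx in fold
            ]
            yield training, validation


rng = random.Random(42)  # fixed seed so the sketch is reproducible
train, test = train_test_split(range(100), 0.20, rng)
splits = list(repeated_kfold(train, k=5, repeats=10, rng=rng))

print(len(train), len(test), len(splits))  # 80 20 50
```

Within each repetition, every training observation appears in exactly one validation fold, and the holdout test indices never enter any fold.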
Linear support vector machines (SVMs) create an n-dimensional space, where n equals the number of features, and calculate a linear kernel function that splits the data into two classes (here: truthful and deceptive). The aim is to derive a hyperplane that splits the data in such a way that the distance between the hyperplane and the nearest observations of each class in the n-dimensional space (i.e., the margin) is maximized (Murphy, 2012).

As feature sets, we used (a) all LIWC variables (92 features) and (b) a subset intended to model psychological processes (40 features, e.g., cognitive processes, negative thinking, perceptual processes, see Supporting Information S6). Table 2 shows the performance metrics for both past and forthcoming weekend plans.
The findings suggest the predictive models built on all LIWC variables and the "psychological processes" subset outperform chance classification for prospective weekend plans but not for past weekend plans.

| Other LIWC variables and individual named entities
We explored whether truth-lie differences emerged on individual LIWC or named entity categories. This also enabled us to understand the verbal differences within the composite score of "richness of detail." Table 3 displays the means and effect size of the veracity main effect for the three LIWC subcategories that formed the LIWC richness of detail (i.e., percept, space, and time) and other individual LIWC and named entity categories that were significant veracity predictors in another study with the same approach (Kleinberg, Mozes, et al., 2017). The findings suggest that although the categories percept (f = −0.12), space (f = −0.17), and time (f = 0.23) all differentiated deceptive from truthful statements significantly, their effects ran in different directions. Only the temporal information category ("time") was, as could be expected from RM, higher for truthful than for deceptive statements. The spatial information ("space") and perceptual processes ("percept") were higher in deceptive than in truthful texts. These discrepant findings might explain why the composite index of the LIWC richness of detail did not indicate a significant difference.

| Discussion: Experiment 1
The confirmatory analysis of the first experiment showed that deceptive statements did not differ from truthful statements in length, the richness of detail, named entities, and why-utterances. We found support only for the hypothesis that truthful statements would contain more how-utterances than deceptive ones. The exploratory predictive analysis yielded promising results for machine learning classification tasks. Deceptive and truthful plans for the forthcoming weekend were identified with an accuracy above chance (80.64% and 74.19% for all LIWC variables and psychological processes, respectively). Exploratory analysis also suggested that liars included more references to persons and places than truth-tellers. However, this result may be due to a confound: Liars received slightly more specific instructions for their activities (e.g., "Going on a holiday to Spain with a friend") than truth-tellers (e.g., "Going on a holiday"). As such, the inclusion of person and place references may have been a function of the instructions rather than the veracity. To further investigate these seemingly contradictory findings and to assess the replicability of the predictive modeling results, we ran a second experiment with preregistered hypotheses. The second experiment also allowed us to isolate the effect of the model statement.

(Harvey et al., 2017; Leal et al., 2015), we preregistered the following hypotheses:
• Deceptive statements will contain more (computer-scored) person, location, temporal, spatial, date, and time references than truthful statements.
• The machine learning classification accuracy of truthful and deceptive statements is above chance level. The classifier trained on the data of Experiment 1 performs with above chance level accuracy on the data of Experiment 2.
• The differences in linguistic and verbal content variables between truthful and deceptive statements are larger when a model statement is provided than when it is not, resulting in higher classification accuracy.

| Participants
The data collection procedure was identical to that of Experiment 1. We aimed to replicate the effects found in the first experiment and adhered to the same sample size including a buffer for potential data loss, resulting in 100 participants required per condition. Due to simultaneous starting times, we collected data from 413 participants and, as per the preregistered exclusion criteria, excluded those who could not recall whether they were instructed to write a truthful or deceptive statement after writing the statement (n = 28, final sample = 385). 2 The remaining 385 participants were allocated blockwise into four experimental conditions: a truthful condition with a model statement (n = 90, 66.67% female, M age = 32.56 years, SD age = 9.23), a deceptive condition with a model statement (n = 97, 70.10% female, M age = 32.39, SD age = 10.42), a truthful condition without a model statement (n = 101, 73.27% female, M age = 32.00, SD age = 9.36), and a deceptive condition without a model statement (n = 97, 69.07% female, M age = 33.55, SD age = 11.06). There was no difference between the conditions in gender, χ²(3) = 1.02, p = .795, Cramér's V = 0.03, or age, F(1, 383) = 0.24, p = .626, f = 0.03.
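As a rough check on the gender comparison, the chi-square statistic can be recomputed from the reported cell sizes. The counts below are reconstructed from each condition's n and percentage female, and they assume a binary gender coding, which the text does not state. The function name is hypothetical, and the p value is not computed here.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


# Rows: the four conditions; columns: female, not female.
# Reconstructed counts (assumption): 60/90, 68/97, 74/101, and 67/97 female.
gender_table = [
    [60, 30],   # truthful, model statement
    [68, 29],   # deceptive, model statement
    [74, 27],   # truthful, no model statement
    [67, 30],   # deceptive, no model statement
]

statistic, df = chi_square_independence(gender_table)
print(round(statistic, 2), df)  # 1.02 3
```

Under these assumptions, the recomputed statistic agrees with the reported value for the gender comparison.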

| Changes compared with Experiment 1
Those who read the model statement followed the same procedure as those in Experiment 1. Participants who did not read a model statement were directed to the input field immediately after they received their veracity instructions (including the prompt to be as detailed, plausible, and convincing as possible). This procedure was based on related previous studies (Bogaard et al., 2014; Leal et al., 2015). The instructions provided to deceptive participants were changed to be identical to those given to truth-tellers; that is, all participants received the nonspecific instructions (e.g., "throwing a party").
Table 4 shows that the findings of Experiment 1 were supported for person references and location references, which were both more prevalent in deceptive than in truthful statements. There were no veracity-by-model statement interaction effects. For person references (with > without model statement) as well as for temporal information (without > with a model statement) and date references (without > with a model statement), there was a significant main effect of the provision of the model statement, albeit only for person references in the expected direction. 3

Machine learning classification: Experiment 2
We predicted that the overall classification accuracy with a machine learning approach would be significantly better than chance level. Specifically, we predicted that, with all LIWC categories as features, the resulting classification accuracy would be better than the chance level (here: 50.39% due to a slight condition imbalance). An exact binomial test revealed that accuracy was significantly higher than chance (p = .002).
We also predicted that the classification accuracy would be higher when a model statement was provided than when participants did not read a model statement. We expected that the diagnostic efficiency of the classifier for participants with the model statement would be significantly better than for the participants who did not read the model statement. There was no difference between the two classifiers according to Venkatraman's AUC comparison test (E = 0.04, 2,000 bootstraps, p = .868). Note also that neither classifier's accuracy exceeded chance level.
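The chance level and the exact binomial test can be illustrated with a minimal sketch. The condition sizes are those reported in the Participants section; the number of correct classifications (`n_correct`) is a hypothetical placeholder, not a result from this study, and the one-sided direction of the test is our assumption. The function name `binom_upper_tail` is likewise illustrative.

```python
import math

def binom_upper_tail(k, n, p):
    """Exact one-sided p-value P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1."""
    total = 0.0
    for i in range(k, n + 1):
        # Log-space binomial probability mass to avoid overflow for large n.
        log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                   + i * math.log(p) + (n - i) * math.log1p(-p))
        total += math.exp(log_pmf)
    return min(total, 1.0)

# Condition sizes as reported for Experiment 2.
n_truthful = 90 + 101   # with + without model statement
n_deceptive = 97 + 97   # with + without model statement
n_total = n_truthful + n_deceptive

# Chance level = share of the larger class (always guessing "deceptive").
chance = max(n_truthful, n_deceptive) / n_total
print(f"chance level: {chance:.2%}")  # 50.39%

# HYPOTHETICAL number of correctly classified statements (illustration only).
n_correct = 230
p_value = binom_upper_tail(n_correct, n_total, chance)
print(f"accuracy = {n_correct / n_total:.2%}, one-sided exact p = {p_value:.4g}")
```

With imbalanced classes, testing against the majority-class share rather than 50% is the more conservative choice, because a classifier that always predicts the larger class already reaches that accuracy.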

Cross-experiment machine learning classification
To assess the classification accuracy of machine learning classifiers on independent data, we took the exact SVM classifier trained on the full LIWC feature set of the intentions data from Experiment 1 and tested its performance on the data from Experiment 2. That is, rather than evaluating the performance on holdout data from the same data collection, we tested it on truly independent data from a different sample. This analysis resulted in an accuracy of 61.30% (56.23-66.19%).
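An interval around a classification accuracy can be obtained with an exact (Clopper-Pearson) binomial confidence interval. The sketch below rests on two assumptions that the text does not confirm: that the reported interval is a 95% exact binomial interval, and that 61.30% of 385 corresponds to 236 correct classifications. Function names are illustrative.

```python
import math

def binom_pmf(i, n, p):
    """Binomial probability mass, computed in log space for stability."""
    if p <= 0.0:
        return 1.0 if i == 0 else 0.0
    if p >= 1.0:
        return 1.0 if i == n else 0.0
    log_pmf = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
               + i * math.log(p) + (n - i) * math.log1p(-p))
    return math.exp(log_pmf)

def solve_increasing(f, iters=60):
    """Bisection on [0, 1] for a function that rises from negative to positive."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for k successes in n trials."""
    def upper_tail(p):  # P(X >= k), increasing in p
        return sum(binom_pmf(i, n, p) for i in range(k, n + 1))
    def lower_tail(p):  # P(X <= k), decreasing in p
        return sum(binom_pmf(i, n, p) for i in range(0, k + 1))
    lower = 0.0 if k == 0 else solve_increasing(lambda p: upper_tail(p) - alpha / 2)
    upper = 1.0 if k == n else solve_increasing(lambda p: alpha / 2 - lower_tail(p))
    return lower, upper

# ASSUMPTION: 236 of 385 correct, inferred from the reported 61.30% accuracy.
low, high = clopper_pearson(236, 385)
print(f"accuracy = {236 / 385:.2%}, 95% exact CI = [{low:.2%}, {high:.2%}]")
```

If the assumptions hold, the printed bounds should land close to the reported interval; a noticeable deviation would suggest that a different interval method was used.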

| Exploratory analysis
For comparison purposes, we also explored the length of statement (Table 4)  research showing that a model statement increased information provided by the participants (Bogaard et al., 2014;Leal et al., 2015). Because the primary aim of this study was to test the detectability of deceptive intentions in a potentially large-scale setting, we collected data through an online interface and focused on computer-automated analysis.

| Predicting the veracity of statement
From an applied perspective, such as prospective passenger screening, the prediction accuracy of a model might be more important than the explanatory aspects underlying it. With the use of machine learning, deceptive and truthful statements in Experiment 1 were classified well above chance, with relatively high accuracies of 74.19% (psychological process variables) and 80.65% (all LIWC variables).
To assess the "true" performance of a predictive model, it is important to test it on newly collected data. In fact, most machine learning approaches to verbal deception detection are not evaluated on data from a new sample (Fitzpatrick et al., 2015), and most of the reported accuracy rates in the psycholegal literature were obtained without any cross-validation (see the critique by Levine et al., 2017). We, therefore, examined the robustness of these accuracy rates with cross-validation within the sample as well as on a new sample in the second experiment.
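The difference between the two validation strategies can be sketched as follows. This is an illustration only: it substitutes a simple nearest-centroid classifier for the SVM and synthetic Gaussian features for the LIWC variables, and every name in it (`make_sample`, `cross_val_accuracy`, and so on) is hypothetical. Because both synthetic samples are drawn from the same distribution, the sketch shows the mechanics of the two procedures, not the accuracy drop observed with real data.

```python
import random

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit(X, y):
    """Nearest-centroid 'training': one mean vector per class label."""
    return {label: centroid([x for x, t in zip(X, y) if t == label])
            for label in set(y)}

def predict(model, x):
    """Assign x to the class with the closest centroid (squared Euclidean)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, model[c])))

def accuracy(model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_val_accuracy(X, y, k=10, seed=0):
    """Within-sample estimate: mean held-out accuracy over k folds."""
    scores = []
    for fold in kfold_indices(len(y), k, seed):
        held = set(fold)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(accuracy(model, [X[i] for i in fold], [y[i] for i in fold]))
    return sum(scores) / k

def make_sample(n, shift, seed):
    """SYNTHETIC stand-in for one experiment: 2 features, labels 0/1."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(shift * t, 1.0), rng.gauss(0.0, 1.0)] for t in y]
    return X, y

X1, y1 = make_sample(120, shift=1.0, seed=1)  # stands in for Experiment 1
X2, y2 = make_sample(380, shift=1.0, seed=2)  # stands in for Experiment 2

within = cross_val_accuracy(X1, y1, k=10)    # cross-validation within sample 1
independent = accuracy(fit(X1, y1), X2, y2)  # train on sample 1, test on sample 2
print(f"within-sample CV: {within:.3f}, independent sample: {independent:.3f}")
```

In the independent-sample step, the model is fitted once on the first sample and never sees the second sample before testing, which mirrors the logic of the cross-experiment analysis.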
The current investigation is, to the best of our knowledge, the only one that tested a classifier's accuracy on fresh, independent data from a new sample. The results are promising in that they withstood the cross-experiment test, but they also highlight the drop in accuracy when classifiers were applied to out-of-sample data. The accuracy rates will by definition be higher if the classifier is trained and tested on the same data, compared with a proper validation on a new sample (Yarkoni & Westfall, 2017). Although data from the first experiment suggest accuracies of up to 80%, the independent-sample validation indicated that the true boundaries might be closer to 63% (similar accuracies using

| The model statement technique
We did not find support for the hypothesis that providing a model statement benefits deception detection. Unexpectedly, participants who read a model statement provided fewer date entities and less temporal information but more person entities than those who did not read a model statement. These latter findings would need corroboration.
The absence of a beneficial effect of the model statement was also reported elsewhere (Bogaard et al., 2014; Brackmann et al., 2017; Ewens et al., 2016; Harvey et al., 2017; Leal et al., 2015; Vrij, Leal, et al., 2017). We see two possible explanations. First, hidden moderators might determine the role of the model statement. Looking at the verbal cues, especially details, at a more granular level (e.g., classifying details into verifiable details, script behavior details, and complications) could be an important aspect for further research (the data of the two experiments are openly available). Second, the null findings might be due to boundary conditions of the model statement technique. We provided participants with a model statement about a past event (i.e., first day at uni). Future research could assess whether an alignment of the temporal focus of the model statement and the participants' action (i.e., past or future action) is necessary. Furthermore, the length requirement that we imposed on all statements (minimum of 80 words) could have played a role. Although intended as a safeguard to elicit sufficient information in the online context, it is possible that this resulted in unnatural content and blurred potential truth-lie differences.

| Manual versus automated text analysis
Concerning the large-scale focus in this study, two aspects merit attention.
1. Although the computer-automated analysis performed above chance level in the current study, the value of manual human scoring cannot (yet) be dismissed. Semantic, linguistic concepts such as plausibility are not yet easily automatable. Likewise, promising approaches such as the verifiability approach (i.e., looking at verifiable details; Nahari et al., 2014) are currently limited to manual annotation, which limits their large-scale potential.
The technical question of human versus automated coding performance might best be answered in direct comparisons and rigorous empirical testing. Such a comparison should test which technique yields the best accuracies and, most importantly, produces replicable and generalizable results. Because the aim of the current paper was to predict veracity rather than to illuminate its theoretical underpinnings, we focused more on the machine learning part than on the individual cues underpinning it. We do acknowledge that theory matters and should, in fact, be incorporated into predictive models to make use of the best of both worlds. In the future, hybrid approaches (e.g., Kleinberg, Mozes, et al., 2017) might help bridge the gaps between theory and methods and between human and automated analyses: Human annotations of verifiability, for example, could be used as outcome variables for a predictive linguistic model. Ideally, this could result in a real-time and valid proxy for otherwise manually coded constructs.
2. The current study relied on a passive collection of data. Alternatively, future approaches could explore how dynamic conversational environments (e.g., online chat) facilitate deception detection. Such a line of inquiry might also help to shorten the participation duration, which is essential for applied purposes, and would allow for the targeted elicitation of needed information (e.g., those pieces that could be verified).

| CONCLUSION
Verbal deception detection is a promising yet complex path for the detection of deceptive intentions, both from an academic and from an applied perspective. In two experiments, we found evidence that liars mentioned more person and location references than truth-tellers, which may be exploited for the detection of their false accounts. Predictive modeling with psycholinguistic features yielded promising results above chance level. At the same time, independent validation showed that within-sample cross-validation might still overestimate classification accuracies. The current findings provide novel insights into liars' strategies, highlight the promise of machine learning for deception detection, and emphasize the need for proper validation of predictive deception detection analysis.