Addressing selective attrition in the enhanced response time-based concealed information test: A within-subject replication

Correspondence: Gáspár Lukács, Faculty of Psychology, University of Vienna, Liebiggasse 5, A-1010 Vienna, Austria. Email: gaspar.lukacs@univie.ac.at

Summary: The response time-based concealed information test can reveal when a person recognizes a relevant item among other, irrelevant items, based on comparatively slower responding. Thereby, if a person is concealing knowledge about the relevance of this item (e.g., recognizing it as a murder weapon), this deception can be revealed. A recent study, conducted online and using a between-subject design, introduced a significantly enhanced version by including additional items in the task. While this modified version outperformed the original version, it also resulted in a much higher rate of participant dropouts (i.e., participants leaving the experiment's website without completing the task). The grave implication is that the perceived enhancement is perhaps merely due to selective attrition. Therefore, the current experiment replicates the original one, but using a within-subject design. The results show that there is a large enhancement even when selective attrition is prevented.


| INTRODUCTION
The response time-based concealed information test (RT-CIT) aims to reveal whether a person is concealing knowledge regarding a certain detail. To illustrate the CIT, let us consider a murder case scenario in which the murder weapon is known only to the perpetrator and the investigators. In this case, the CIT could include the actual murder weapon (the probe; e.g., "rifle") and several other weapons (irrelevants; e.g., "knife" and "rope"). These items would be sequentially presented to a suspect in random order. When each item has to be responded to with a keypress, the recognition of the probe (in this case, "rifle") by a guilty person (who is aware of the relevance of that item) will typically result in a slower response to that item than to the irrelevant items.
Thereby, based on the probe-irrelevant RT differences, a guilty person can be distinguished from innocent ones.
The standard CIT (S-CIT) includes a single (randomly chosen) target irrelevant item that requires pressing a response key different from the response key for probe and nontarget irrelevant items. For example, the key "I" has to be pressed whenever the target item appears, while the key "E" has to be pressed whenever any of the other items (probe and nontarget irrelevants) appear.
However, a recent study introduced a significantly enhanced version (E-CIT) by adding filler items to the task (Lukács, Kleinberg, & Verschuere, 2017). In that study, the probes were the participants' certain personal details (birthday, favorite animal, etc.), which were therefore "familiar" (self-related, recognizable, etc.) to the given participant, as opposed to the irrelevants (e.g., other dates, random animal names) that were in this respect rather "unfamiliar" (other-related, etc.). Two corresponding kinds of fillers were added to the task: (a) familiarity-referring words ("FAMILIAR," "RECOGNIZED," and "MINE") that had to be categorized with the same key as the target (and, thus, with the opposite key than the probe and the irrelevants), and (b) unfamiliarity-referring words ("UNFAMILIAR," "UNKNOWN," "OTHER," "THEIRS," "THEM," and "FOREIGN") that had to be categorized with the same key as the probe and irrelevants. It was assumed that responses to the familiar probes (the true personal details of the given participant) would be even slower because they have to be categorized together with unfamiliarity-referring expressions (and opposite to familiarity-referring expressions; see e.g., Greenwald, Poehlman, Uhlmann, & Banaji, 2009). In contrast, participants with random details as probes (i.e., not their own personal details) see no substantial difference between probes and irrelevants (in particular: the probes are no more familiar, self-related, or recognizable, than the irrelevants), and therefore the fillers do not slow down the responses to the probe.
The other important assumption was that the increased cognitive load due to the increased complexity of the test (more items, etc.) required more attention throughout the task, which likely facilitated deeper semantic processing of the stimuli (Lukács et al., 2017, p. 3; see also Visu-Petra, Varga, Miclea, & Visu-Petra, 2013). More specifically, in a CIT without filler items (and with a single probe and single target; Verschuere, Kleinberg, & Theocharidou, 2015), examinees have to look out for a single target item, pressing the alternative key for all other items. Thereby, they may focus on the target and to some extent ignore the content and meaning of the rest of the items. By adding fillers (to be categorized with the same key as the target), examinees have to pay more attention, and are induced to more carefully process the content and meaning of each of the presented items.
This, in turn, also increases the attention to the relevant probe (which otherwise might be ignored to some extent), resulting in even slower keypress responses to it (in case of a recognized probe, such as by a guilty examinee).
The study did not provide any detailed investigation of the underlying fundamental processes; nonetheless, the enhancing effect of fillers was plainly demonstrated: the rate of correctly detected "guilty" and "innocent" (as simulated in the study) participants rose from .68 to .94, as measured by areas under the curves (e.g., National Research Council, 2003, pp. 342-344).

| Selective attrition
Pertinent to the present study, there was an apparent limitation in the study of Lukács et al. (2017). Due to the increased complexity of the task, a substantial ratio of the online participants dropped out (left the experiment's website without completing the task) when using the E-CIT (10.0-11.4%; though also 3.5-8.5% in the S-CIT), as estimated by comparing the number of first practice round completions with full test completions. The factual, precise dropout rates were reported in a subsequent study, which also recorded all participants who began the practice task at all (starting at the instruction page). The difference in this case was striking: 60-63% dropout in the E-CIT, but only 18-19% in the S-CIT.
The very grave implication is that the enhancement in the E-CIT is perhaps merely an artificial construct due to selective attrition (Zhou & Fishbach, 2016): Participants who dropped out when using the E-CIT (but not when using the S-CIT) may have been the ones that are generally less susceptible to the CIT (i.e., would have smaller probe-irrelevant differences). For example, one reason could be that there are participants who are generally less motivated and pay little attention to their tasks. These participants would decide to drop out when they are to perform the E-CIT because the task is too complex for them, while they would complete the simpler S-CIT task, but with suboptimal results. Thereby, it would appear that the E-CIT outperforms the S-CIT, while the larger effect in case of the E-CIT would be merely due to different types of participants having performed it. Meanwhile, the E-CIT has already been employed in further studies (e.g., Kleinberg, & Verschuere, 2018), and even more are in progress. Therefore, assessing this potential confound is an urgent matter.
To address this concern, the current study replicated the original experiment, but using a within-subject design, with each participant performing both the S-CIT and the E-CIT. The procedure was identical 1 to Experiment 1 in Lukács and Ansorge (2019; with only minor differences compared to the study by Lukács et al., 2017), except that only the S-CIT and the E-CIT conditions were measured, both with simulated guilty participants, using a within-subject design.

| METHOD
The methods and analyses were preregistered at https://osf.io/9q8gk (analyses that were not preregistered are under the heading "Exploratory Tests"). All collected data is available via https://osf.io/wu5cf.

| Participants
The experiment was run on Figure Eight (www.figure-eight.com), an online crowdsourcing platform where participants from anywhere in the world can register to complete small online tasks (Peer, Samat, Brandimarte, & Acquisti, 2015). Only subjects who had taken other experiments on the website seriously ("Level 3 contributors") were invited for the present experiment. All participants performed both the S-CIT and the E-CIT, in random order. All participants simulated guilty suspects: the probes were the participants' self-reported autobiographical identity details (e.g., their country of origin).

| Procedure
Before beginning the experiment, all participants provided informed consent. Participants then provided demographic information and selected their three autobiographical details: country of origin, date of birth (month and day), and favorite animal. This was followed by the very short (3 min) LexTALE English competency test (Lemhöfer & Broersma, 2012), in which 60 words are presented, among which 40 are real English words and 20 are nonwords, and the instruction is to decide, for each word, whether it is an actual English word. This test was implemented as described at www.lextale.com, with the only difference that a 4-s time limit applied to each response to curb possible cheating (i.e., looking up the words online or in a dictionary during the task). The LexTALE minimum score for upper intermediate (B2) level is 60% accuracy (Lemhöfer & Broersma, 2012, p. 341). Accordingly, those who did not achieve a score above an even more lenient threshold of 55% clearly did not have the required English skills, and were therefore automatically disqualified and redirected to the Figure Eight website. Then followed the first of the CITs as described below.

| Item selection
Participants were informed that the following task simulates a lie detection scenario, during which they should try to hide their identities. They were then presented with a short list of randomly chosen items within each of the three categories in the task (countries, dates, animals). 3 The participants were asked to choose any items (but a maximum of two per category) that were personally meaningful to them or in any way appeared different from the rest of the items on those lists. Subsequently, the four irrelevants and one target were randomly selected from among the non-chosen items, in each category.
Next, participants were shown their three targets, and were asked to memorize these items to recognize them as requiring a different response during the following task. On the next page, participants were asked to recall the memorized items, and could proceed only if they selected these items correctly from a dropdown menu. If any of the selected items was incorrect, the participant received a warning and was redirected to the previous page.

| S-CIT task
During the S-CIT, items were presented one by one in the center of the screen, and participants had to categorize them by pressing one of two keys ("E" or "I") on their keyboard. Each block contained six different items, which were presented repeatedly in random order (see Test Structure): the probe, the four irrelevants, and the target. Participants had to press "I" whenever the target appeared, while they had to press "E" whenever the probe or an irrelevant appeared.

| E-CIT task
The E-CIT differed from the S-CIT only in that it also contained filler items. Whenever a familiarity-referring filler appeared ("FAMILIAR," "RECOGNIZED," and "MINE") participants had to press the "I" key (same as for targets), while whenever an unfamiliarity-referring filler appeared ("UNFAMILIAR," "UNKNOWN," "OTHER," "THEIRS," "THEM" and "FOREIGN"), they had to press "E" (same as for probe and irrelevants).

| Test structure
The inter-trial interval always randomly varied between 400 and 700 ms. In case of an incorrect response or no response within the given time limit (800 ms, except for practice, see below), the caption "WRONG" or "TOO SLOW," respectively, appeared in red below the stimulus for 400 ms, followed by the next trial.
The main task was preceded by a comprehension check and two practice tasks. The check served to ensure that the participant had fully understood the task; it consisted of 21 randomly ordered trials, including the 12 different main items and each of the 9 possible fillers. During the comprehension check, participants had plenty of time (10 s) to choose a response. However, each trial required a correct response. In case of an incorrect response, the participant was reminded of the instructions and had to repeat the check.
In the following first practice task, the response window was longer than in the main task (2 s instead of 800 ms), while the second practice task had the same design as the main task. Both practice tasks consisted of 14 trials, composed such that any two successive practice tasks together contained all of the possible items in the task (3 probes, 12 irrelevants, 3 targets, 9 fillers). In either practice task, in case of too few valid responses, the participants were reminded of the instructions and had to repeat the practice task. The requirement was a minimum of 60% valid responses (correct key pressed between 150 and 800 ms) for each of the following item types: targets; familiarity-referring fillers; unfamiliarity-referring fillers; main items (probes and irrelevants together).
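The valid-response criterion described above can be illustrated with a short sketch (illustrative Python with a hypothetical trial format, not the original experiment code; the grouping follows the item types listed above):

```python
def practice_passed(trials, min_valid=0.60):
    """Check whether each item type reached the minimum share of
    valid responses (correct key pressed between 150 and 800 ms)."""
    by_type = {}
    for trial in trials:
        by_type.setdefault(trial["type"], []).append(trial)

    def is_valid(trial):
        return trial["correct"] and 150 <= trial["rt"] <= 800

    return all(
        sum(is_valid(t) for t in ts) / len(ts) >= min_valid
        for ts in by_type.values()
    )


# Example: 3 of 5 target trials valid (exactly 60%), so the criterion is met
example = (
    [{"type": "target", "correct": True, "rt": 500}] * 3
    + [{"type": "target", "correct": False, "rt": 500}] * 2
)
assert practice_passed(example)
```

Note that the criterion is computed per item type, so a participant cannot pass by responding well to main items alone while ignoring the fillers.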
These initial practice tasks always corresponded to the CIT version that came first (S-CIT or E-CIT). After the first CIT, there was a last practice round with items corresponding to the upcoming second CIT.
The main task, in each test, contained three blocks, one for each category (countries, dates, or animals; in random order). In each block, each probe, irrelevant, and target was repeated 18 times. Within each of the three blocks, the order of the items was randomized in groups: first, all six items (one probe, four irrelevants, and one target) in the given category were presented in a random order, then the same six items were presented in another random order (but with the restriction that the first item in the next group was never the same as the last item in the previous group).
For the E-CIT only, fillers were placed among these items in a random order, but with the restrictions that a filler trial was never followed by another filler trial, and each of the 9 fillers (3 familiarity-referring, 6 unfamiliarity-referring) preceded each of the 3 probes, 3 targets, and 12 irrelevants exactly once.
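The grouped randomization described above can be sketched as follows (an illustrative Python sketch, not the authors' original implementation; the E-CIT's additional filler-placement constraints are omitted for brevity):

```python
import random


def randomize_block(items, repetitions=18, rng=None):
    """Present all items of a block in successive randomly ordered
    groups; the first item of a group never repeats the last item
    of the previous group."""
    rng = rng or random.Random()
    sequence = []
    for _ in range(repetitions):
        group = list(items)
        rng.shuffle(group)
        # reshuffle until the group-boundary restriction is satisfied
        while sequence and group[0] == sequence[-1]:
            rng.shuffle(group)
        sequence.extend(group)
    return sequence


# One block: the probe, four irrelevants, and the target, 18 repetitions each
trials = randomize_block(["probe", "irr1", "irr2", "irr3", "irr4", "target"])
```

Because each group contains every item exactly once, this scheme also guarantees that no two identical items ever appear in immediate succession and that each item occurs equally often in every segment of the block.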

| Data analysis
For all analyses, RTs below 150 ms were excluded. For RT analyses, only correct responses were used. Accuracy was calculated as the number of correct responses divided by the number of all trials (after the exclusion of those with RT below 150 ms).
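A minimal sketch of these exclusion and aggregation rules (illustrative Python with a hypothetical trial format; the actual analyses were run in R):

```python
def summarize_trials(trials):
    """Mean RT over correct responses and accuracy rate, after
    excluding all trials with RT below 150 ms."""
    kept = [t for t in trials if t["rt"] >= 150]
    correct = [t for t in kept if t["correct"]]
    mean_rt = sum(t["rt"] for t in correct) / len(correct)
    accuracy = len(correct) / len(kept)
    return mean_rt, accuracy


example = [
    {"rt": 430, "correct": True},
    {"rt": 120, "correct": True},   # excluded entirely (RT below 150 ms)
    {"rt": 510, "correct": False},  # counts only toward the accuracy denominator
    {"rt": 470, "correct": True},
    {"rt": 450, "correct": True},
]
mean_rt, accuracy = summarize_trials(example)  # 450.0, 0.75
```

Note the asymmetry: sub-150-ms trials are dropped from both measures, whereas incorrect responses are dropped only from the RT mean but still lower the accuracy rate.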
To demonstrate the magnitude of the observed effects for F tests, partial eta-squared (ηp²) values are shown along with their 90% CIs (Steiger, 2004). For t tests, Welch-corrected statistics are reported for parametric tests (Delacre, Lakens, & Leys, 2017), Wilcoxon statistics for nonparametric tests, and, to demonstrate the magnitude of the observed effects, Cohen's d values as standardized mean differences and their 95% CIs (Lakens, 2013). All analyses were conducted in R (R Core Team, 2019; via: Kelley, 2018; Lawrence, 2016).
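The two t-test statistics mentioned above can be computed as in the following sketch (illustrative Python for the formulas only; the paper's actual analyses, including the confidence intervals, were conducted in R):

```python
import math
from statistics import mean, stdev


def welch_t_and_cohens_d(x, y):
    """Welch's t statistic (no equal-variance assumption) and
    Cohen's d as the mean difference standardized by the pooled SD."""
    mx, my = mean(x), mean(y)
    sx, sy = stdev(x), stdev(y)  # sample standard deviations
    nx, ny = len(x), len(y)
    t = (mx - my) / math.sqrt(sx ** 2 / nx + sy ** 2 / ny)
    pooled_sd = math.sqrt(
        ((nx - 1) * sx ** 2 + (ny - 1) * sy ** 2) / (nx + ny - 2)
    )
    d = (mx - my) / pooled_sd
    return t, d


t, d = welch_t_and_cohens_d([2, 3, 4, 5], [1, 2, 3, 4])
```

Welch's t differs from Student's t only in the denominator (and in its degrees of freedom, omitted here), which makes it robust when the two groups have unequal variances.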

| RESULTS
Aggregated means of RT means and accuracy rates, for the different stimulus types, for S-CIT and for E-CIT, from the present within-subject study as well as from the original between-subject study (Lukács et al., 2017), are given in Table 1.

| DISCUSSION
The practical implication of the larger probe-irrelevant differences in the E-CIT, as opposed to the S-CIT, is straightforward: Adding familiarity-related fillers to the RT-CIT helps to more reliably reveal whether a person is concealing knowledge about a given critical detail.
While the results of the original paper may have been biased to some extent, the current within-subject replication shows that there is a large difference even when selective attrition is prevented. Thus, it is now established that the E-CIT outperforms the S-CIT.

[FIGURE 1: Means and SDs of individual probe-minus-irrelevant (P-I) response time mean differences (left panel) and accuracy rate differences (right panel); per version (S-CIT and E-CIT), and per research design (the present within-subject design versus the original between-subject design in the study by Lukács et al., 2017).]

[TABLE 2: RT means and accuracy rates, per test phase.]
While the enhancing effect of the fillers has now been repeatedly demonstrated, the underlying mechanisms are yet to be explored. The idea that responding is easier (and thus faster) when closely related items share the same response key, and vice versa, is a well-established mechanism supported by dozens of studies (in particular in relation to the Implicit Association Test; e.g., Greenwald et al., 2009; Nosek, Greenwald, & Banaji, 2007; but see also, e.g., Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976, p. 435; Iordan, Greene, Beck, & Fei-Fei, 2015).
The increased task complexity implies increased task difficulty, and hence increased cognitive load. This could affect the task in different ways, and disentangling how this may influence the outcomes is not straightforward. As explained in the introduction, one plausible benefit is that increased cognitive load induces more attention to the presented items and their meaning. In any case, the notion that increased cognitive load facilitates slower deceptive responses is supported by previous RT-CIT papers (Hu, Evans, Wu, Lee, & Fu, 2013; Visu-Petra et al., 2013; Visu-Petra, Miclea, & Visu-Petra, 2012), and even by studies of other deception detection methods (e.g., Vrij et al., 2008).
Related to task difficulty, another aspect that is yet to be investigated is whether or not the E-CIT is susceptible to countermeasures, as it has been shown for other RT-based deception detection methods (e.g., Hu, Chen, & Fu, 2012;Hu, Rosenfeld, & Bodenhausen, 2012;Van Bockstaele et al., 2012;Verschuere, Prati, & De Houwer, 2009). In particular, examinees might be able to manipulate probe-irrelevant RT differences by making deliberately fast responses to probes and/or slow responses to irrelevants. It seems possible that due to the relatively complex design of the E-CIT, examinees may find it difficult to intentionally alter the timing of their responses to only the appropriate subset of the items used (e.g. increase the response times only for the irrelevants by inhibiting fast reactions to these items, and yet still make fast responses to probes). Nonetheless, ultimately, this is an empirical question.

| Disadvantages and advantages of the online setting
A laboratory-based replication of the E-CIT is still needed. Even when selective attrition is prevented, the online setting may allow more noise in the data than strictly controlled lab experiments, for example due to potentially sub-optimal computer hardware used by participants. At the same time, numerous studies have attested to the validity of online research in general (e.g., Buhrmester, Kwang, & Gosling, 2011; Paolacci & Chandler, 2014), to the validity of online RT research in particular (e.g., Germine et al., 2012; Hilbig, 2016; McGraw, Tew, & Williams, 2000; for the HTML5/JavaScript framework as in the present study, see Reimers & Stewart, 2015), and recently even specifically to the validity of the RT-CIT in online settings; all of these unanimously conclude that online RT research such as in the present study is a sound alternative to conventional lab studies, and that results obtained in this environment closely reflect those obtained under strictly controlled laboratory conditions. Note also that, in all RT-CIT experiments, the key dependent variable (the probe-irrelevant difference) is obtained, for each participant, via a within-subject comparison, which also serves as a control for external influences. However, even more assurance is gained by using an overall within-subject research design (i.e., within-subject statistical comparisons), as in the present study.
Importantly, online research has its own advantages as well. Fast and relatively inexpensive data collection allows hundreds of participants to be recruited within days. The available population is very diverse and even international; thus, it can provide a broad demonstration of generalizability, and the obtained samples also more closely reflect the characteristics of possible criminal suspects than the typical lab study involving only university students. Furthermore, the RT-CIT itself may be applied using online (web browser-based) applications in real-life cases (Lukács, 2019): In any scenario where other appropriate software has not been set up, the RT-CIT can be easily administered via an online application using any web browser on a standard computer (or even on smartphones; Lukács, Kleinberg, Kunzi, & Ansorge, 2020).
Nonetheless, future related studies involving online participants should preferably avoid between-subject designs to prevent potential confounds arising from selective attrition, in particular when different conditions involve different levels of difficulty in performing the task.

| Effect of test order
Almost all previous RT-CIT studies have used a between-subject design, and none has examined the effect of repeated testing. The results of the present study indicate that repeated testing (hence familiarity with the task, practice, fatigue, etc.) has no substantial impact on the key outcomes (i.e., on the probe-irrelevant differences).
All in all, this invites future RT-CIT studies to use a within-subject research design. This not only prevents confounds due to selective attrition (as long as the design is properly counterbalanced), but it is in general also more economical as well as more reliable than a between-subject design (as long as the order effect does not confound the results).
However, since the present study used two different CIT versions per subject, it does not provide sufficient evidence for the practical question of whether repeated testing of the same version may influence outcomes (such as deliberate practice by a suspect who is aware of an upcoming CIT test). This issue would require its own dedicated study.

| CONCLUSIONS
The enhancement of the RT-CIT with filler items may be a great advantage with important practical implications (Lukács et al., 2017).
However, a subsequent study demonstrated robust differences in participant dropout rates between the key conditions (60-63% vs. 18-19%). For a proper appraisal of the enhancement, the present study compared the two conditions in a within-subject design, to avoid differences in dropout rates, and ascertained that the difference is real and not due to a confound.

CONFLICT OF INTEREST
The author has no financial or personal conflicts of interest.

DATA AVAILABILITY STATEMENT
All data collected in the present study is available at https://osf.io/wu5cf. The data collected in the study of Lukács et al. (2017) is available at https://osf.io/kv65n/.
2 No participant had to be excluded due to too low accuracy; see preregistration for criteria.

3 The items on these lists never contained any of the probes (i.e., the actual identity details of a given participant), but, within each category, they had the closest possible character length to the given probe (depending on the list of available items), and none of them started with the same letter (except in the case of months). In the case of countries, if the probe included a space (e.g., "New Zealand" or "Czech Republic"), the items on this list were all chosen to include a space as well.