Information leakage in the Response Time‐Based Concealed Information Test

wileyonlinelibrary.com/journal/acp

Summary
The Response Time‐Based Concealed Information Test (RT‐CIT) can reveal when a person recognizes a relevant (probe) item among other, irrelevant items, based on comparatively slow responses to the probe item. For example, if a person is concealing his or her true identity, one can use the suspected identity details as probes, and other, random details as irrelevants. However, in our study, we show that even when participants are merely informed about such probes (i.e., the relevant identity details) before performing the RT‐CIT, their responses will also be slower to these details. Hence, it is more difficult to distinguish such innocent but pre‐informed persons from actually guilty persons. At the same time, we introduce a CIT version with familiarity‐related inducer stimuli, but with no targets, that elicits probe‐minus‐irrelevant RT differences only among guilty participants but not among informed innocent participants. Implications for the theory and the application of CITs are discussed.

Csifcsák, 2017), the same effect has yet to be tested using the RT-CIT. Therefore, in the present research, we test this scenario in an information leakage simulation. Furthermore, we introduce a slightly modified RT-CIT method that is resistant to such information leakage. At the same time, these investigations also serve to empirically support an extended theoretical framework of the CIT (Lukács, Grządziel, Kempkes, & Ansorge, 2019; Lukács, Gula, et al., 2017).

| Three versions of the Response Time-Based Concealed Information Test
The standard RT-CIT consists of a two-alternative forced choice task, where participants classify the presented stimuli as the target or as one of several nontargets by pressing one of two keys (Varga, Visu-Petra, Miclea, & Buş, 2014; Verschuere, Suchotzki, & Debey, 2015). In each trial, a single stimulus is shown. Across trials, typically around five nontargets are presented, among which one is the probe, which is an item that only a guilty person would recognize, and the rest are irrelevants, which are similar to the probe and thus indistinguishable from the probe for an innocent person. For example, in a murder case where the true murder weapon was a knife, the probe could be the word "knife," whereas irrelevants could be "gun," "rope," and so forth. Assuming that the innocent examinees are not informed about how the murder was committed, they would not know which of the items is the probe. The items are repeatedly shown in a random sequence, and all of them have to be responded to with the same response key, except one arbitrary target: a randomly selected, originally also irrelevant item that has to be responded to with the other response key. Because guilty examinees recognize the probe as the relevant item with respect to the deception detection scenario, it will become unique among the irrelevants and in this respect more similar to the rarely occurring target (Lukács et al., 2016; Lukács, Gula, et al., 2017). Due to this conflict between the instructed response classification of probes as irrelevants on the one hand, and the uniqueness of probes and, thus, greater similarity to the alternative response classification as potential targets on the other hand, the responses to the probes will involve response conflict (Seymour & Schumacher, 2009) and will be generally slower in comparison with the irrelevants. Thus, based on the probe-minus-irrelevant RT differences, guilty examinees can be distinguished from innocent examinees.
A recent study significantly improved this method (i.e., significantly increased the accuracy of distinguishing guilty examinees from innocent ones) by adding inducer items to the task (Lukács, 2019; Lukács, Kleinberg, & Verschuere, 2017), inspired by the Implicit Association Test (IAT; Greenwald, McGhee, & Schwartz, 1998; particularly Bluemke & Friese, 2008; Karpinski & Steinman, 2006; see also Agosta & Sartori, 2013). The IAT measures the strength of associations between certain critical items to be evaluated, such as concepts or entities (e.g., various political parties) and certain attribute items (e.g., good vs. bad). The main idea is that responding is easier (and thus faster) when closely related items share the same response key (e.g., Greenwald, Poehlman, Uhlmann, & Banaji, 2009; Nosek, Greenwald, & Banaji, 2007). For example (taken from Bluemke & Friese, 2008), a person with an implicit preference for a specific political party responds faster when having to categorize stimuli related to that party (e.g., party emblems or names of well-known party members) together with positive words (e.g., joy and health). Inversely, the categorization of the same stimuli (for the preferred party) will be slower when they share a response key with negative words (e.g., pain and disease).
Note that the general adverse effect of feature overlap (semantic or any other) on categorization is not a novelty of the IAT. In particular, it has long been argued and widely demonstrated that categorization is most efficient in case of "most attributes common to members of the category and the least attributes shared with members of other categories" (Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976, p. 1435; see also, e.g., Iordan, Greene, Beck, & Fei-Fei, 2015). This, of course, holds not only for concepts but also for simple visual stimuli (e.g., Azizian, Freitas, Watson, & Squires, 2006; Marchand, Inglis-Assaff, & Lefebvre, 2013).
All in all, we assumed that an analogous mechanism may be introduced in the CIT by adding probe-referring "attributes." (We call these attributes inducers, as they serve to induce associations.) In the original study (Lukács, Kleinberg, & Verschuere, 2017), the probes were general autobiographical details (birthday, favorite color, etc.), and correspondingly, the inducers were familiarity- and ownership-related, or, more precisely, self-referring and other-referring. Inducers referring to the participants' own details (e.g., "FAMILIAR" and "MINE") had to be categorized with the same key as the target and, thus, with the key opposite to the response key for the probe (and the irrelevants), whereas inducers referring to other details (e.g., "OTHER" and "THEIRS") had to be categorized with the same key as the probe (and irrelevants). It was assumed that this would have a similar effect as in the IAT: Responses to the self-related probes (true identity details) would be even slower because they have to be categorized together with other-referring expressions (and opposite to self-referring expressions). In contrast, in case of innocents, the probes are not self-related. Hence, the inducers will not slow down the responses to the probe.
Less relevant to the present Introduction, we briefly note that the other additional hypothesized reason for the enhanced effect was that the increased cognitive load (due to the increased complexity) also requires more attention throughout the task, which likely facilitates deeper processing of the stimuli (Lukács, Kleinberg, & Verschuere, 2017, p. 3; see also Visu-Petra, Varga, Miclea, & Visu-Petra, 2013).
In the present study, our first main objective was to test the effect of information leakage on this enhanced version (from here on: Enhanced-CIT; E-CIT). For a basis of comparison for the effect of inducers in respect of the information leakage, we also included the original, standard version with no inducers and only a target along with the probe and irrelevants (Target-CIT). Although we expected an effect of information leakage (i.e., probe-minus-irrelevant difference for informed innocents) for both versions, the Target-CIT may be less susceptible to this manipulation, simply because it has only a small effect (relatively small probe-minus-irrelevant differences) in case of truly guilty participants in the first place (Lukács, Kleinberg, & Verschuere, 2017; Verschuere, Kleinberg, & Theocharidou, 2015).
Note that here we first used the single-probe protocol CIT, in which only one probe is included within each block of the task. We used the multiple-probe protocol CIT (with multiple probes intermixed within each block) in a follow-up experiment, where the relevant differences between the two protocols are also briefly discussed.
Already presupposing that the leakage would indeed render both these versions ineffective, our second main objective was to introduce a leakage-resistant version by a very simple alteration of the E-CIT: removing the target from the task and thereby only leaving inducers along with the probe and irrelevants (Inducer-CIT). Our hypothesis here is that response conflict due to the mere recognition of a probe as a relevant item is brought on by the presence of the target.
The target shares the semantic category of the irrelevant and probe items (e.g., dates in case of looking for a birthday), and its only distinction is that it is the single item that requires a different key response, which makes it unique among the rest of the items, and, consequently, a relevant item in the task. The only other unique and relevant item in the task is the probe. Note that this relevance would be of different origin depending on whether viewed from the perspective of a guilty person or from the perspective of an innocent but informed person. The guilty persons recognize the item as directly related to them (for example, via the committed crime, or because it is their autobiographical detail), whereas the innocent persons recognize the item as one of which they have been informed as relevant to the deception detection test. Nonetheless, in either case, the probe will be recognized as a single relevant item among the irrelevants. Hence, the probe, as opposed to the irrelevants, will share the target's feature of uniqueness and relevance and will therefore be more difficult to categorize together with the irrelevants.
Importantly, if we remove the target from the CIT, there is no such response conflict. Let us consider an example where we try to show whether a person's true country of origin is Germany. A target may be, say, Sweden. In this case, whenever a country appears, the examinee has to consider whether or not it is the target country Sweden, which would require a different key response. However, whenever "Germany" appears, because it is known to the examinee as a relevant country (as suspected country of origin, regardless of whether or not this is true) and therefore unique among the irrelevants, it will take more time to decide that its specific relevance and uniqueness do not invite a different key response. If we do not include the target "Sweden," there is no unique, task-relevant country to be categorized with a different response key. Therefore, relevant or not, all countries, including Germany, will be categorized with the same response keys with equal ease. Neither guilty nor informed innocent participants will have a response conflict due to the relevance of the probe, hence no slower responses to the probe and no probe-minus-irrelevant difference.
However, even without targets, the Inducer-CIT would still be sensitive to self-related information of guilty participants because, for the guilty participants, the probes and the self-referring inducers (that are categorized opposite to the probe) share the feature of self-relatedness. There is, at the same time, no unique, specifically relevant item among the inducers. The inducers are also distinct from the categories of the rest of the items (e.g., countries, to which both probe and irrelevants belong, as well as the target, in the standard CIT). They constitute an additional category of familiarity- and ownership-related words, including two subcategories: one with inducers that refer directly to self-relatedness (and familiarity and ownership) and one with inducers that refer to the opposite (other-relatedness, unfamiliarity, etc.). Probes have to be categorized together with other-related inducers and opposite to self-related inducers. The guilty participant's (but not the informed innocent participant's) relation to the probe is one of self-relatedness, and therefore, in this case, the probe's required response key is in conflict with the key that has to be pressed when self-referring inducers are displayed. Due to this conflict, guilty participants are expected to respond slower to the probe than to the irrelevants (and consequently have larger probe-minus-irrelevant differences). However, in case of innocent participants, being informed of the probe leads only to its recognition as relevant, but not to its association with self-relatedness. Again, note also that there are no unique, relevant items in this task apart from this probe, but merely inducers that create the semantic dimension of self- and other-relatedness and respective associations with response keys. Hence, no conflict, no response slowing to the probes and no probe-minus-irrelevant differences.
For a procedural overview of the three versions, see Table 1.

| Study overview
In six groups of participants, we tested all three RT-CIT versions (Target-CIT, E-CIT, Inducer-CIT), with two conditions (groups) in each: simple guilty participants (with their own details as probes) and informed innocent participants (with randomly chosen, originally irrelevant details as probes, but thoroughly informed about these details).

| Methods
The experiment was preregistered at https://osf.io/fh6at (Foster & Deardorff, 2017; Wagenmakers, Wetzels, Borsboom, van der Maas, & Kievit, 2012). All behavioral data (original and aggregated) are available via https://osf.io/zr39m/, along with the entire original task (written in JavaScript) and its simplified version for demonstration purposes (task version can be chosen; trial number reduced to five in every phase; no restrictions for error rates).

TABLE 1 Note: Overview of the presence (Yes) or absence (No) of the item types (probes, irrelevants, targets, and inducers) as described in the text. Note that the procedural differences between task versions concern the target and inducer items exclusively. All three tasks are fully identical in procedure in every other respect.

| Participants
This experiment was run on Figure Eight (www.figure-eight.com; formerly known as CrowdFlower), an online crowdsourcing platform where participants from anywhere in the world can register to complete small online tasks (Peer, Samat, Brandimarte, & Acquisti, 2015). Hence, this website may also be used to offer participation in online experiments by providing a link to the task to be completed. People registered on this site as "contributors" complete many such tasks, and their performance may be rated after the completion of the tasks by the "customers" who offered those tasks. Based on these ratings, contributors are categorized into three levels, where contributors with the best ratings are categorized as "Level 3." When creating a new task, a customer (in this case, the current authors) may choose the lowest level of contributors that are allowed to take the task. We set this to "Level 3"; hence, only such "Level 3" contributors were allowed to participate in the study.
We opened slots for 310 participants for our experiment, paying 1.20 USD per completed task. The task could only be completed in one uninterrupted session from one IP address: another attempt from an IP address that was already stored with a completed task resulted in a warning prompt on the first page of the task that did not allow continuation. Possibly due to simultaneous starting times, 313 participants completed the task (see dropout rates in Appendix A).
Each participant was randomly assigned to perform one of the three RT-CIT versions: Target-CIT, E-CIT, or Inducer-CIT. Each participant was also randomly assigned to the guilty or informed innocent condition.
In the guilty condition, the probes were participants' self-reported autobiographical identity details (e.g., their country of origin), simulating a guilty suspect. In the informed innocent condition, the probes were randomly chosen, originally irrelevant identity details, but these participants were informed about these details in advance (simulating an innocent suspect exposed to information leakage; see Section 2.1.2).
Our exclusion criteria were at least 50% accuracy for each of the following item categories: targets, self-referring inducers, and other-referring inducers (Lukács, Kleinberg, & Verschuere, 2017 [...] the only difference that a 4-s time limit applied to each response to curb possible cheating (i.e., looking up the words online or in a dictionary during the task). The LexTALE minimum score for upper intermediate (B2) level is 60% accuracy (Lemhöfer & Broersma, 2012, p. 341).
Consequently, those who did not achieve a score above our more lenient threshold of 55% clearly did not have the required English skill and were therefore automatically disqualified and redirected to the [...] (presumed) reliance on semantic associations, which requires a clear understanding of basic English, and then followed one of the CITs as described below.

Item selection
Participants were informed that the following task simulates a lie detection scenario, during which they should try to hide their identities. They were also told that they may actually not see their own details in the task, in which case they are in the "innocent" condition, simulating an innocent suspect. They were then presented a short list of randomly chosen items within each of the three categories in the task (countries, dates, and animals). The items on these lists never contained any of the probes (the actual identity details of a given participant), but, within each category, they had the closest possible character length to the given probe (depending on the list of available items), and none of them started with the same letter (except in case of months). In case of countries, if the probe included a space (e.g., "New Zealand" or "Czech Republic"), the items on this list were all chosen to include a space as well. The participants were asked to choose any (but a maximum of two per category) items that were personally meaningful to them or in any way appeared different from the rest of the items on those lists. Subsequently, the items for the task were randomly selected from the nonchosen items (as this assures that the irrelevants were indeed irrelevant).

1 It is perhaps noteworthy that even a few of the guilty participants (11 out of 157), who were supposed to provide their true autobiographical details, could not recall them correctly. Although these persons may have made an incorrect selection by accident, it is also possible that they in fact did not provide their true details honestly, but rather chose them randomly. Hence, it may be advisable to implement this check in all future tasks.
For participants in the guilty condition, their self-reported identity details served as the probe, in each of the three categories, whereas the four irrelevants and one target (where this applied) were randomly chosen from among the nonchosen items. (The target was of course not used in the Inducer-CIT; see section.) For a participant in the informed innocent condition, six items were selected for each of the three categories. Out of these six, one was randomly assigned to be a probe, whereas the remaining five served as irrelevants and target (where applicable). Thus, in either condition and in any of the CIT versions, there were altogether three probes and 12 irrelevants, whereas only in the Target-CIT and E-CIT, there were an additional three targets (one per block, see below) as well.

Information leakage
Following the item selection, all innocent participants were informed of the selected probes on a dedicated "background information" page, which described a person who "committed a serious crime, but is now hiding his true identity," and told them that they were one of the suspects (see full text in Appendix B; see also the very similar "background story" in Lukács, Gula, et al., 2017). The country of origin, date of birth, and favorite animal of this person were pointed out repeatedly. On the next page, all participants had to type in all three of these details correctly in order to proceed with the test. If any of the entered items was incorrect, the participant received a warning and was redirected to the background information page.

Targets
Next, participants in the Target-CIT and E-CIT versions were presented with their three targets and were asked to memorize these items in order to recognize them as requiring a different response during the following task. On the next page, participants were asked to recall the memorized items and could proceed only if they selected these items correctly from a dropdown menu. If any of the selected items was incorrect, the participant received a warning and was redirected to the previous page in order to have another look at the same items.
(For the Inducer-CIT, this target learning phase was omitted.)

Task designs
In each RT-CIT task, the items were presented one by one in the center of the screen, and participants had to categorize them by pressing one of two keys ("E" or "I") on their keyboard. The design of the Verschuere [...]. Namely, participants were told that pushing the "I" key means "YES," they recognize the item, whereas pushing the "E" key means "NO," they do not recognize the item, and they were correspondingly instructed to say YES to the targets and NO to all other words (i.e., both the irrelevants and the probes).
In case of the E-CIT and Inducer-CIT, the description was slightly modified to focus on familiarity: Participants were told that pushing the "I" key means that the displayed item is "FAMILIAR" to them, whereas pushing the "E" key means that the item is "UNFAMILIAR" to them.
Participants in these groups also had to categorize nine different inducers: three referring to familiarity ("FAMILIAR," "RECOGNIZED," and "MINE") had to be categorized as familiar ("I" key), whereas the other six referring to unfamiliarity ("UNFAMILIAR," "UNKNOWN," "OTHER," "THEIRS," "THEM," and "FOREIGN") had to be categorized as unfamiliar ("E" key). In case of the E-CIT (but not in the Inducer-CIT), participants were also instructed to respond FAMILIAR to the targets. In both E-CIT and Inducer-CIT, all other words (irrelevants and probes) had to be categorized as unfamiliar.
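The key assignments described above can be summarized in a short sketch. This is only an illustration under stated assumptions: the function and dictionary names are hypothetical and do not come from the original JavaScript task; only the keys ("E", "I"), the inducer words, and the item types follow the text.

```python
# Hypothetical summary of the response-key mapping in the three RT-CIT versions.
# Names (VERSIONS, correct_key, item type labels) are illustrative assumptions.
SELF_INDUCERS = ["FAMILIAR", "RECOGNIZED", "MINE"]
OTHER_INDUCERS = ["UNFAMILIAR", "UNKNOWN", "OTHER", "THEIRS", "THEM", "FOREIGN"]

VERSIONS = {
    "Target-CIT": {"targets": True, "inducers": False},
    "E-CIT": {"targets": True, "inducers": True},
    "Inducer-CIT": {"targets": False, "inducers": True},
}


def correct_key(version, item_type):
    """Return the instructed key ("E" or "I") for an item type in a version."""
    spec = VERSIONS[version]
    if item_type == "target":
        if not spec["targets"]:
            raise ValueError(f"{version} contains no targets")
        return "I"  # targets: YES / FAMILIAR
    if item_type in ("self_inducer", "other_inducer"):
        if not spec["inducers"]:
            raise ValueError(f"{version} contains no inducers")
        return "I" if item_type == "self_inducer" else "E"
    if item_type in ("probe", "irrelevant"):
        return "E"  # probes and irrelevants: NO / UNFAMILIAR
    raise ValueError(f"unknown item type: {item_type}")
```

Note that the probe always shares the "E" key with the other-referring inducers, which is the source of the conflict expected for guilty participants.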
In previous studies, reminder captions were displayed on the screen throughout the task (e.g., "Recognize?" or "Familiar to you?" on the top of the screen), but, to avoid any related potential confounds, we simply omitted any of these captions altogether. Arguably, this could hardly have any relevant effect, but in order to demonstrate this, we ran a smaller preliminary within-subject experiment (using one guilty E-CIT group only; n = 52), briefly presented in Appendix C, which shows that indeed the presence or absence of captions makes no difference.
The intertrial interval (i.e., between the end of one trial and the beginning of the next) always randomly varied between 100 and 300 ms. In case of a correct response, the next trial followed. In case of an incorrect response or no response within the given time limit, the caption "WRONG" or "TOO SLOW" in red color appeared, respectively, below the stimulus for 400 ms, followed by the next trial.
The main task was preceded by a comprehension check and two practice tasks. The check served to ensure that the participant had fully understood the task. The items consisted of 12-21 randomly ordered trials, including 10-12 different main items (two probes and eight irrelevants and, for the Target-CIT and E-CIT, two targets; each of which was randomly chosen from one out of the three categories, [...] In the following first practice task, the response window was longer than in the main task (2 s instead of 800 ms), whereas the second practice task had the same design as the main task. Both practice tasks consisted of 9-14 trials, in a way that two successive tasks always contained all of the possible items in the task (for Target-CIT: three probes, 12 irrelevants, three targets; for E-CIT: nine inducers in addition; and for Inducer-CIT: also nine additional inducers, but no targets). In either practice task, in case of too few valid responses, the participants received corresponding feedback, were reminded of the instructions, and had to repeat the practice task. The requirement was a minimum of 60% valid responses (correct key between 150 and 800 ms) for each of the following item types (when the given type existed in the given CIT version): targets, self-referring inducers, other-referring inducers, and main items (probes and irrelevants together).
Note that in previous online experiments, the exclusion was set to 50%, which is, however, chance level, and seemed too low to us. Also, probe and irrelevants were previously treated separately.
In previous experiments, the separate check for probes was included primarily to ensure that the participant had understood the task, but here, this was already fully ensured through the comprehension check.
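The practice requirement described above (at least 60% valid responses per item type, with probes and irrelevants pooled) could be checked roughly as follows. This is a minimal sketch, not the authors' implementation: the trial dictionary layout, the function names, and the inclusive treatment of the 150 and 800 ms bounds are assumptions.

```python
def valid_response_rates(trials, low_ms=150, high_ms=800):
    """Rate of valid responses per item type; probes and irrelevants are pooled.

    Each trial is assumed to be a dict with keys "type", "correct" (bool),
    and "rt" (milliseconds, or None when no response was given).
    """
    counts = {}
    for t in trials:
        group = "main" if t["type"] in ("probe", "irrelevant") else t["type"]
        # Valid = correct key pressed within the RT window (bounds assumed inclusive).
        valid = bool(t["correct"]) and t["rt"] is not None and low_ms <= t["rt"] <= high_ms
        total, n_valid = counts.get(group, (0, 0))
        counts[group] = (total + 1, n_valid + int(valid))
    return {g: n_valid / total for g, (total, n_valid) in counts.items()}


def passes_practice(trials, min_rate=0.60):
    """True if every item type present reaches the minimum valid-response rate."""
    return all(rate >= min_rate for rate in valid_response_rates(trials).values())
```

Item types that do not occur in a given version (e.g., targets in the Inducer-CIT) simply never appear in the counts, so they impose no requirement.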
The main task, in each test, contained three blocks, each for one separate category (countries, dates, or animals; in random order). In each block, each probe, irrelevant, and target (where applicable) were repeated 18 times (hence, altogether 54 probe, 216 irrelevant, and 54 target trials). From the Inducer-CIT, target trials were omitted (leaving 54 probe and 216 irrelevant trials). Within each of the three blocks, the order of the items was randomized in groups: first, all five or six items (one probe, four irrelevants, and, where applicable, one target) in the given category were presented in a random order, and then the same five or six items were presented in another random order (but with the restriction that the first item in the next group was never the same as the last item in the previous group).
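The grouped randomization of the main items within one block can be sketched as below. This is an illustrative reconstruction under assumptions: the function name and the reshuffle-until-valid strategy are hypothetical, and the placement of inducers among these items is deliberately not modeled here.

```python
import random


def block_sequence(items, repetitions=18, rng=None):
    """Order the main items of one block in successive randomly shuffled groups.

    Each group contains every item once; the first item of a group never
    equals the last item of the previous group.
    """
    if len(set(items)) < 2:
        raise ValueError("need at least two distinct items")
    rng = rng or random.Random()
    sequence = []
    for _ in range(repetitions):
        group = list(items)
        rng.shuffle(group)
        # Reshuffle until the group boundary does not repeat an item.
        while sequence and group[0] == sequence[-1]:
            rng.shuffle(group)
        sequence.extend(group)
    return sequence


# Trial counts over three blocks, as stated in the text.
PROBE_TRIALS = 3 * 1 * 18        # 54
IRRELEVANT_TRIALS = 3 * 4 * 18   # 216
TARGET_TRIALS = 3 * 1 * 18       # 54 (Target-CIT and E-CIT only)
```

For the Inducer-CIT, the same function would be called with five items (one probe and four irrelevants) instead of six.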
In the E-CIT and Inducer-CIT, inducers were placed among these items in a random order, but with the restrictions that an inducer trial was never followed by another inducer trial, and each of the nine inducers (three self-referring and six other-referring) preceded each of the three probes, three targets (for E-CIT, but not for Inducer-CIT [...] date, and animal). Again, in case of guilty participants, the probes were simply their own details (hence, the instruction read: "please select again the truly self-related details that you yourself gave in the very beginning"). More importantly, in case of informed innocent participants, these were the details they were informed about in the beginning, in the frame of the background information (hence, the instruction read: "please select below the details of the criminal as it was described in the beginning"). All participants who gave any of the details incorrectly were excluded from the analyses (see Section 2.1.1).² This is a new (but preregistered) exclusion method, the purpose of which was to ensure that the participants in the informed innocent conditions were indeed properly informed about the probe items. (Nonetheless, for the sake of treating data in all conditions similarly, the same method was applied to the guilty conditions as well.) After the task, there was a short survey where participants rated the personal importance of the items used in the task (their country of origin, birthday, and favorite animal; on a scale from one to six, where one is "entirely unimportant" and six is "very important"), and finally, the participants were given a brief explanation about the purpose of the study.

| Data analysis
We conducted preregistered analyses, unless explicitly specified otherwise. [...] For secondary analyses, in Appendix D, we report (a) a mixed ANOVA to explore the potential effects of item saliency (countries vs. animals) and its interactions, and (b) all tests described so far (regarding probe and irrelevant RT means) with probe and irrelevant accuracy rates rather than correct RTs.
On the request of reviewers of an earlier version of this manuscript, we (a) added Bayesian analyses to all F and t tests, and (b) calculated AUCs for each of the CIT versions with simulated naive, uninformed innocents (contrasted to both guilty and informed innocent participants). For each simulation, we took a randomly generated sample of 100 participants from a normal distribution with a mean of zero (rnorm function in R, with set.seed at 100). The SDs for the Target-CIT and E-CIT were based on the data for the same respective CIT versions in the very similar study of Lukács, Kleinberg, and Verschuere (2017). The results were reanalyzed using the criteria in the present paper and excluding the item category of "favorite color" (as the remaining categories of countries, dates, and animals correspond exactly to those in the present paper).

2 All participants who failed on any of the details for the first time were excluded from all analyses. Nonetheless, as an additional check, these participants were once warned about the incorrect selections and were asked to try again. Over half of them (24 out of 39) selected the probes correctly at this second time, indicating that, although they may have been uncertain or confused by this question, they had been properly informed. After this second check, regardless of failed selections, all participants were allowed to finish the task and receive payment.

3 One reviewer insisted that we run this exploratory ANOVA instead of the following two preregistered tests: (a) a t test between the guilty E-CIT and guilty Inducer-CIT version (in order to estimate the difference between their efficiency in case of uninformed innocents) and (b) a one-way analysis of variance (ANOVA) for informed innocent conditions only (to explore possible differences in susceptibility to information leakage). However, which particular method was used here made no difference for the conclusions. The full original, preregistered analysis is uploaded to https://osf.io/zr39m/.
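The simulation of naive innocents and the resulting AUC can be illustrated with a short sketch. It is written in Python rather than R, so the seeded values will not match those produced by rnorm with set.seed(100); the function names and any SD value passed in are hypothetical, and the AUC is computed here via pairwise comparisons, which is an assumption about method, not a description of the authors' code.

```python
import random


def simulate_naive_innocents(sd, n=100, seed=100):
    """Draw n simulated probe-minus-irrelevant differences from N(0, sd)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, sd) for _ in range(n)]


def auc(guilty, innocent):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen guilty predictor exceeds a
    randomly chosen innocent one, counting ties as one half.
    """
    wins = 0.0
    for g in guilty:
        for i in innocent:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(guilty) * len(innocent))
```

An AUC of 0.5 corresponds to chance-level classification and 1.0 to perfect separation of the two groups.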

Effect sizes
In order to demonstrate the magnitude of the observed effects for F tests, partial eta-squared ($\eta_p^2$) values are shown along with their 90% CIs (Steiger, 2004). We report Welch-corrected t tests (Delacre, Lakens, & Leys, 2017) and Cohen's d values as standardized mean differences and their 95% CIs (Kelley, 2018; Lakens, 2013). We used the conventional alpha level of .05 for all statistical significance tests.
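As an illustration of the two between-group statistics named above, the following sketch computes a Welch-corrected t statistic (with Welch-Satterthwaite degrees of freedom) and Cohen's d. The pooled-SD form of d is an assumption, since the text does not state which variant was used, and the confidence intervals are omitted; function names are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx = variance(x) / len(x)  # squared standard error of mean(x)
    vy = variance(y) / len(y)
    t = (mean(x) - mean(y)) / sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def cohens_d(x, y):
    """Standardized mean difference using the pooled SD (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)
```

Here x and y would be the individual probe-minus-irrelevant RT differences of the two groups being compared.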

Bayesian analysis
We report Bayes factors using the default r-scale of 0.707 (Morey & Rouder, 2018). The Bayes factor is the ratio between the likelihood of the data under one hypothesis and the likelihood of the data under the other hypothesis (Jarosz & Wiley, 2014; Wagenmakers, 2007). For example, with the alternative hypothesis in the numerator, a Bayes factor (BF) of 3 means that the obtained data are three times as likely to be observed if the alternative hypothesis is true, whereas a BF of 0.5 means that the obtained data are twice as likely to be observed if the null hypothesis is true. Here, for more readily interpretable numbers, we denote Bayes factors as $\mathrm{BF}_{10}$ for supporting the alternative hypothesis and as $\mathrm{BF}_{01}$ for supporting the null hypothesis. Thus, for example, $\mathrm{BF}_{01} = 2$ again means that the obtained data are twice as likely under the null hypothesis as under the alternative hypothesis. Typically, BF = 3 is interpreted as the minimum likelihood ratio for "substantial" evidence for either the null or the alternative hypothesis (Jeffreys, 1961).
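The two notations can be written out explicitly. The following is the standard definition, with $D$ denoting the observed data and $H_0$, $H_1$ the null and alternative hypotheses (symbols introduced here for illustration):

```latex
\[
\mathrm{BF}_{10} = \frac{p(D \mid H_1)}{p(D \mid H_0)}, \qquad
\mathrm{BF}_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)} = \frac{1}{\mathrm{BF}_{10}} .
\]
```

Accordingly, $\mathrm{BF}_{10} = 0.5$ and $\mathrm{BF}_{01} = 2$ express the same evidence: the data are twice as likely under the null hypothesis.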
For all analyses, RTs below 150 ms were excluded. For RT analyses, only correct responses were used. Accuracy was calculated as the number of correct responses divided by the number of all trials (after the exclusion of those with an RT below 150 ms).
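The trial-level rules above translate into a per-participant summary along the following lines. This is a minimal sketch with hypothetical names and data layout; in particular, keeping no-response trials (RT of None) in the accuracy denominator as incorrect trials is an assumption not stated in the text.

```python
from statistics import mean


def summarize(trials, min_rt=150):
    """Mean correct RT and accuracy for probes and irrelevants of one participant.

    Each trial is assumed to be a dict with keys "type", "correct" (bool),
    and "rt" (milliseconds, or None when no response was given).
    """
    # Exclude trials with an RT below 150 ms from all analyses.
    kept = [t for t in trials if t["rt"] is None or t["rt"] >= min_rt]
    out = {}
    for item_type in ("probe", "irrelevant"):
        of_type = [t for t in kept if t["type"] == item_type]
        correct_rts = [t["rt"] for t in of_type if t["correct"] and t["rt"] is not None]
        out[item_type] = {
            "mean_rt": mean(correct_rts),                 # correct responses only
            "accuracy": len(correct_rts) / len(of_type),  # correct / all kept trials
        }
    # Predictor used for group comparisons and classification.
    out["probe_minus_irrelevant"] = out["probe"]["mean_rt"] - out["irrelevant"]["mean_rt"]
    return out
```

The returned probe-minus-irrelevant difference is the single value per participant that enters the group-level and classification analyses.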

| Group-level response time analysis
All means and SDs of individual RT means, for the different stimuli types, in all guilty and innocent conditions, are given in Table 2.
We conducted an ANOVA, with between-subjects factors Knowledge (guilty vs. informed innocent) and Version (Target-CIT, E-CIT, and Inducer-CIT), on probe-minus-irrelevant RT mean differences (

| Individual classification
Probe-minus-irrelevant differences in RT mean were used as predictor variables to calculate AUCs, which are shown for each condition in

| Discussion
In Experiment 1, we have shown that the RT-CIT is vulnerable to information leakage when the targets were present in the E-CIT, whereas it was not affected when using the Inducer-CIT with no targets but only inducers. It was also shown that both the Target

In Experiment 1, we used the single-probe (SP) protocol Target-CIT, where each category is presented in separate blocks (see right panel in Figure 2). The SP has several practical advantages, such as applicability even in case of a limited number of probe items (Podlesny, 2003), compatibility with common test procedures and scoring algorithms (Krapohl, 2011), and sequential testing to narrow down possibilities (Lukács, Kleinberg, & Verschuere, 2017). Consequently, practitioners currently also consider the SP protocol to be the only viable option (Ogawa, Matsuda, Tsuneoka, & Verschuere, 2015). Furthermore, in our experiment, we used the SP protocol for the E-CIT and Inducer-CIT versions, which would be unnecessarily complex with a corresponding MP protocol (Lukács, Kleinberg, & Verschuere, 2017). Hence, for comparability, we wanted to use the SP protocol in the Target-CIT as well.
However, in each block, the inducers constitute an additional nine items, among which there are three that have to be categorized as "targets" (i.e., opposite to the probe and irrelevants). Therefore, in respect of the number of different items, the MP protocol with its multiple items in each block (in particular, three targets) is in fact more comparable with the Inducer-CIT. Furthermore, it has been repeatedly shown that the MP protocol clearly outperforms the SP protocol in the (RT-based) Target-CIT (Eom, Sohn, Park, Eum, & Sohn, 2016; Lukács, Kleinberg, & Verschuere, 2017). Therefore, from the practical perspective, the future use of the suboptimal SP Target-CIT seems unlikely. Finally, we expect the larger effect to also be present in case of informed innocents, leading to increased statistical power.

FIGURE 1 Means and SEs of individual probe-minus-irrelevant response time (RT) mean differences in Experiment 1 for the guilty participants (with their own details as probes) and for the informed innocent participants (with random details as probe, but informed about it), in the three CIT versions: Target-CIT (with probes, irrelevants, and targets), E-CIT (with probes, irrelevants, targets, and self-referring inducers), and Inducer-CIT (with probes, irrelevants, and other-referring inducers)
Consequently, in Experiment 2, we ran the same study as in Experiment 1, but with the MP Target-CIT. In the MP protocol (see left panel in Figure 2), items related to the different categories (e.g., weapons, locations, and dates) are completely intermixed within blocks. Additionally, we again included the Inducer-CIT to ensure that the null finding among informed innocents from Experiment 1 is replicable.

| Participants
Participants were sampled via Figure Eight as in Experiment 1, with the same procedure and exclusion criteria. However, in this experiment, to ensure that we have the intended minimum sample size in each condition, we preregistered a procedure to sample more participants (n + 5 more for n missing) in case of fewer than 50 valid completions (after all exclusions, see below) in any given condition.
Following that procedure, we first opened 220 slots. Then we opened six more slots for the guilty MP Target

| Procedure
The procedure corresponded exactly to that in Experiment 1, except for a slight modification in the description of required responses. Namely, in Experiment 1, the instruction text related the response keys to recognition in case of Target-CIT and to familiarity in case of Inducer-CIT (see Section 2.1.2). To make the instructions more straightforward, in Experiment 2, we removed these explanatory references and simply instructed participants to press the given keys when the corresponding items were displayed (e.g., press the key "I" for the target details and press "E" for everything else, without further explanation about what these keypresses mean). Furthermore, there were no reminder captions displayed at any point (i.e., not even during the first two practice phases as in Experiment 1).
The arrangement of items was also identical to that in Experiment 1, except that in the MP Target-CIT, instead of one category per block, each block contained an equal number of items intermixed from each category (countries, dates, or animals). Within each of the three blocks, the order of the items was randomized in groups: First, all 18 items (three probes, 12 irrelevants, and three targets) were presented in a random order (but with the restriction that target trials were never followed by another target trial, and probe trials were never followed by another probe trial). Then the same 18 items were presented in another random order (but with the restriction that the first item in the next group was never the same as the last item in the previous group).
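The grouped randomization described above can be sketched with simple rejection sampling: reshuffle a group until it satisfies the restrictions. This is only an illustration, not the original experiment code. The item labels, the function names, the default of two groups per block, and the composition of the 18 items (three probes, 12 irrelevants, and three targets, the counts that sum to 18) are assumptions.

```python
import random

RESTRICTED_TYPES = ("probe", "target")  # these item types may not repeat back to back


def has_forbidden_repeat(order):
    """True if a probe directly follows a probe, or a target directly follows a target."""
    return any(
        prev[0] == cur[0] and cur[0] in RESTRICTED_TYPES
        for prev, cur in zip(order, order[1:])
    )


def shuffle_group(items, rng, previous_last=None, max_tries=10_000):
    """Reshuffle until the order satisfies the within-group and boundary restrictions."""
    for _ in range(max_tries):
        order = list(items)
        rng.shuffle(order)
        if previous_last is not None and order[0] == previous_last:
            continue  # first item must differ from the last item of the previous group
        if not has_forbidden_repeat(order):
            return order
    raise RuntimeError("no valid order found within max_tries")


def build_block(items, rng, n_groups=2):
    """Concatenate n_groups independently randomized presentations of the same items."""
    block = []
    for _ in range(n_groups):
        previous_last = block[-1] if block else None
        block.extend(shuffle_group(items, rng, previous_last))
    return block


# Hypothetical item set: each item is a (type, label) pair
ITEMS = (
    [("probe", f"probe_{i}") for i in range(3)]
    + [("irrelevant", f"irrelevant_{i}") for i in range(12)]
    + [("target", f"target_{i}") for i in range(3)]
)

block = build_block(ITEMS, random.Random(2024))
print(len(block))  # 36
```

As described in the text, the type-repetition restriction is applied within each group, while only the identical-item restriction is applied across the boundary between groups.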

| Data analysis
We again conducted preregistered analyses, unless explicitly specified otherwise. We used the probe-minus-irrelevant correct RT mean as dependent variable in (a) an ANOVA with between-subjects factors

| Group-level response time analysis
All means and SDs of individual RT means, for the different stimuli types, in all guilty and informed innocent conditions, are given in Table 3.
We conducted an ANOVA, with between-subjects factors Knowledge (guilty vs. informed innocent) and Version (MP Target-CIT vs. Inducer-CIT), on probe-minus-irrelevant RT mean differences. We  Figure 3).
The one-sided paired sample t tests between the probe RT means and irrelevant RT means within each informed innocent condition (for effect sizes, see Table 3) showed a significant effect in case of MP Target

| Individual classification
Probe-minus-irrelevant differences in RT mean were used as predictor variables to calculate AUCs, which are shown for each condition in Table 3. Diagnostic accuracy was very modest for the MP Target-CIT (almost as low as for the SP Target-CIT in Experiment 1), but notably better for the Inducer-CIT. However, the comparison of two

| Discussion
In this second experiment, we successfully replicated the finding that the Inducer-CIT is resistant to information leakage. Furthermore, as it was clearly shown in case of the E-CIT, but only indicative in case of the SP Target-CIT, we have now shown in the MP Target-CIT that the presence of the target leads to a significant probe-minus-irrelevant difference in case of informed innocents.
We found no significant interaction between the factors Knowledge (guilty vs. informed innocent) and CIT Version (MP Target-CIT vs. Inducer-CIT) when using RT means. This is because the probe-minus-irrelevant differences for guilty participants in case of the MP Target-CIT were large enough to still create a substantial difference between guilty and informed innocent participants, despite the effect of information leakage. This difference was similar to that in case of the Inducer-CIT, which was not susceptible to information leakage, but nonetheless had comparatively low probe-minus-irrelevant differences for guilty participants.
We did, however, find a significant interaction between these same factors when using accuracy rates (see Appendix D). In fact, the effect of information leakage on accuracy rate probe-minus-irrelevant differences in the MP Target-CIT was so large that informed innocents had significantly larger such differences than guilty participants. This may also indicate a speed-accuracy tradeoff (Heitz, 2014). Namely, informed innocent participants may have focused on giving fast responses to the probe, instead of accurate responses; hence, the effect of information leakage is observable primarily in accuracy rates and less in mean correct RTs.
Furthermore, even with the RT measure, despite no significant interaction of group means, we found that the Inducer-CIT had a higher AUC (i.e., it could better discriminate guilty from informed innocent than the MP Target-CIT): This is because, in addition to the in fact slightly larger differences between the group means of guilty and informed innocent individual probe-minus-irrelevant differences (though not statistically significant), the Inducer-CIT also exhibited smaller variance of these individual probe-minus-irrelevant differences, in both guilty and informed innocent groups (see SDs in Table 3). That is, for both groups (guilty and informed innocent), the predictor variables (individual probe-minus-irrelevant differences) were more narrowly distributed and, hence, allowed less overlap between guilty and informed innocent participants. The higher variance in the MP Target-CIT may be due to the presence of the targets' different influences on different persons. In particular, the subjective meaning of the targets to any given examinee may modulate their effects (Suchotzki, De Houwer, Kleinberg, & Verschuere, 2018).
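The role of variance can be illustrated with the pairwise definition of the AUC: the probability that a randomly chosen guilty participant has a larger predictor value than a randomly chosen informed innocent participant, with ties counted as one half. The sketch below uses hypothetical predictor values (not data from the study) in which both scenarios have the same group means (40 vs. 10 ms) but different spreads. The function name is illustrative.

```python
def auc(guilty, innocent):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    wins = 0.0
    for g in guilty:
        for i in innocent:
            if g > i:
                wins += 1.0
            elif g == i:
                wins += 0.5
    return wins / (len(guilty) * len(innocent))


# Hypothetical probe-minus-irrelevant RT differences (ms); group means are 40 and 10 in both cases
wide_guilty, wide_innocent = [-20, 10, 40, 70, 100], [-50, -20, 10, 40, 70]
narrow_guilty, narrow_innocent = [30, 35, 40, 45, 50], [0, 5, 10, 15, 20]

print(auc(wide_guilty, wide_innocent))      # 0.68
print(auc(narrow_guilty, narrow_innocent))  # 1.0
```

With identical mean differences, the narrower distributions overlap less and therefore yield the higher AUC.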
In any case, we have demonstrated our two main points: (a) the significant adverse effect of information leakage on the standard CIT (when the target is present) and (b) the absence of this effect when the target is not present in the CIT task.

| GENERAL DISCUSSION
In the present paper, we have shown that the RT-CIT is sensitive to information leakage and, therefore, cannot be effectively applied in field settings where the probe may be known to innocent suspects. We have also shown a possible remedy in the form of an RT-CIT relying primarily on associations. At the same time, our findings also provide insight into the mechanism of the CIT, supporting previously proposed theories (Lukács, Gula, et al., 2017; Lukács, Kleinberg, & Verschuere, 2017).
We did so by comparisons between three essential designs: (a) the Target

In conclusion, this new version stands as a viable future option in case of informed innocent participants, regardless of whether the information leakage is a known fact or a mere suspected possibility.
Naturally, further research into the topic will be needed, as well as independent replications. Given the multiple facets of this method (several item types, each with its own parameters of categorization, visual display, timing, semantic attributes, etc.), there are abundant opportunities for improvements as well. For example (and this aspect is relevant to the rest of this discussion as well), the semantic dimension of the inducers did not relate directly to the probes. In particular, the self-referring inducers were "familiar," "recognized," and "mine," out of which "mine" refers most closely to the self-related probes (as opposed to irrelevants), whereas, on the other hand, "recognized" may be perceived as referring to items recognized as relevant, with an unnecessary focus on the recognition factor. An improvement, therefore, could consist of using exclusively self-related and/or directly probe-related concepts (e.g., "home country," "birthday," "my favorite," and "my own"). For theoretical purposes, this may be contrasted in an experiment with more strictly recognition-focused inducers ("recognized," "relevant," etc.).
This connects to another important point, namely, the role of the target item. It was shown, in case of informed innocents, that the inclusion of the target in the E-CIT elicits a large probe-minus-irrelevant effect in sharp contrast to the otherwise identical Inducer-CIT. As described in detail in Section 1, we attribute this to the fact that the probe's uniqueness and relevance is shared by the target.
However, one may also consider the target as an item that controls for semantic category: In the Inducer-CIT, participants may recognize the item category before fully processing the relevance of the specific item (in particular, the probe) and, therefore, could categorize it based partly on its category (e.g., always press left key for countries, regardless of whether they are irrelevants or probes), diminishing probe-minus-irrelevant differences to some extent. This could explain the smaller probe-minus-irrelevant differences for Inducer-CIT compared with E-CIT in Experiment 1. This adverse factor in the Inducer-CIT for guilty cases cannot be circumvented, because it is simultaneously the essence of protection against the adverse effect of information leakage in innocent cases.
However, the targets may have an additional role as visual control items as well: In the Inducer-CIT, probe and irrelevant items may be partly discriminated based on visual cues, such as character length or spaces. Consequently, a potential improvement would be to present the inducers in a format more visually similar to that of the current probe and irrelevant items: For example, complementing them with filler characters for corresponding length, inserting spaces, or, in case of dates, appending random numbers to them (e.g., "MINE 19" or "OTHER 07"). In essence, all measures that prevent a fast classification of inducers on the basis of their visual features alone should in turn foster their semantic processing and the resulting conflict between inducers and probes among guilty participants.
As described in the previous paragraphs and as evident from the differences between the E-CIT and Inducer-CIT (Table 2 and in the E-CIT). We attribute this to the increased cognitive load elicited by the involvement of two semantic dimensions if not even tasks (i.e., the discrimination between self-vs. other-relatedness and the discrimination between response meanings of targets and irrelevants as separate instances of items not related to the self) inviting closer attention and, thereby, deeper processing of the stimuli (as argued in more detail, but not tested, by Lukács, Kleinberg, & Verschuere, 2017).
Thus, we have provided evidence for the impact of targets on probe-minus-irrelevant differences and for the influence of semantically associated inducers on the same differences, as two separate components that can work independently, but that can also interact, if combined, producing an effect that surpasses the effect of either component alone, for both of the examined scenarios (guilty and informed innocent cases).
Because it limits the applicability of the method (Podlesny, 2003), in this paper, we have so far only presented this probe-minus-irrelevant effect as unfavorable in case of innocent examinees. However, importantly, there are cases where this may in fact be desirable: For example, a witness (either a bystander or an accomplice) may falsely deny the recognition of a suspect. In that case, the guiltiness of the examinee is not in question, but merely the fact of recognition. The present study demonstrates that the E-CIT is likely to be effective in this situation as well, although such specific scenarios will require further investigations.

| Limitations
Same as in almost all deception research experiments, "guilt" and "innocence" were merely simulated in our study. Although the personal relevance of the presented autobiographical details arguably resembles the relevance of real-life incriminating items, the extent of applicability is yet to be explored. On the one hand, in a specific situation very similar to the one simulated in the present study, authorities may test the true identity of the person, in which case the results may be assumed comparable with those in our study (regarding higher stakes at hand, see Kleinberg & Verschuere, 2016). On the other hand, the relevance of crime-related items (such as a murder weapon), which may be contributed to by the various emotions related to the actually committed crime (guilt, suspense, etc.), would be very difficult to simulate in a controlled experiment and may require field studies in the future.
Relatedly, in our study, autobiographical identity details were the objects of the test, with inducers referring to familiarity and ownership. On the one hand, the same principle may be adapted to other scenarios: for example, in case of a murderer's gun ("my gun") or for a stolen object ("my loot"). On the other hand, this may not always be as straightforward as in case of identity details: For example, a thief might not consider a stolen object as his own property. This could be examined in the future, along with the potential use of action-related expressions as inducers, such as "I stole" and "they stole" (Lukács, Gula, et al., 2017).
Our study did not assess how well people can remember leaked information. Our study concerned the consequences of remembered leaked details. Therefore, we simply excluded informed innocent participants who did not correctly recall the probe items about which they were informed. Because in real-world situations innocents would probably forget some of the leaked information (and because such cases were excluded in the present study), we tested the impact of leaked information under relatively conservative conditions.
All in all, future research should explore the applicability of our findings in differing scenarios, including possibly more realistic simulations of information leakage. This may also include testing the method's potential susceptibility to countermeasures (Hu, Rosenfeld, & Bodenhausen, 2012;Verschuere, Prati, & De Houwer, 2009).
The online setting of our experiment may allow more noise in the data than strictly controlled lab experiments due to (a) potentially suboptimal computer hardware used by participants, (b) varying environmental factors (e.g., different lighting conditions or disruptive noises), and (c) participants' lower motivation to pay close attention and perform the task properly. However, there have been numerous studies that explored the reliability of online psychological research (e.g., Buhrmester, Kwang, & Gosling, 2011; Paolacci & Chandler, 2014), in particular the validity of online RT research (e.g., Germine et al., 2012; Hilbig, 2016; McGraw, Tew, & Williams, 2000; for the HTML5/JavaScript framework as in the present study, see Reimers & Stewart, 2015) and recently even specifically the validity of the RT-CIT in online settings, all of which unanimously conclude that online RT research such as in our study is a sound alternative to conventional lab studies and that results obtained in this environment closely reflect those obtained in strictly controlled laboratory conditions. Note also that, in our experiment, the key dependent variable (probe-minus-irrelevant differences) was obtained, for each participant, via a within-subject comparison, which also serves as a control for external influences. Finally, online research has its advantages as well: in particular, a highly diverse international sample that provides a broad demonstration of generalizability and also more closely reflects the test results of possible criminal suspects than a study involving only university students as in typical lab studies.

ACKNOWLEDGMENTS
We thank Anna Walker and Matthew Pelowski for proofreading.

Dropout rates
In the first E-CIT article (Lukács, Kleinberg, & Verschuere, 2017), due to technical reasons, dropout rates were reported based only on an estimate: the number of participants who successfully completed the first practice phase. Here, we are more precise, reporting the number of all participants who began the test (following the English test and before starting the first practice, where conditions began to differ; Table A1).

APPENDIX B
The full text for the background information was (with example probes

APPENDIX C Caption display in the Response Time-Based Concealed Information Test
In the original study introducing the E-CIT (Lukács, Kleinberg, & Verschuere, 2017), the differences in caption display during the test were noted as a possible minor confound. Namely, the Target-CIT had, as reminders, the following captions displayed throughout the task: "Recognize?" at the top of the screen, "YES = e" on the left side, "NO = i" on the right side (where "e" and "i" refer to the corresponding response keys on a standard keyboard; same as in previous experiments). The E-CIT had slightly modified captions to correspond to the use and the concept of the familiarity-related inducers: "Familiar to you?" at the top, "FAMILIAR = e" on the left, "UNFAMILIAR = i" on the right. Nonetheless, in a supplementary experiment, it was shown that the results do not change even when using the same captions (always familiarity-related) in all versions.
Here, we add that, arguably, having or not having these captions hardly makes any difference in the first place, and to preclude even the suspicion of any related confound, we shall simply omit them altogether. Still, to consider any potential effects, one could say that, although in the beginning the captions may facilitate the understanding of the task, eventually these additional items on the screen may just cause distraction from the critical stimuli presented in the center. Therefore, before we proceeded with the main objectives of our study (information leakage), we ran a smaller experiment testing any potential difference between captions-on and captions-off versions, within-subject, using only one group: guilty participants in the E-CIT, which is the most complex version, and, therefore, assumed to be most susceptible to be influenced by either distractions or facilitated comprehension.
The data were collected the exact same way as in the main exper-  Table B1. The order (starting with captions on or with captions off) had no effect on either measure (p > .4).
We can conclude that no differences were found between the captions-on and captions-off versions for the means of the probe-  Figure D1. This is unsurprising: The saliency refers to the personal importance of the probe (which is higher in case of countries of origins than in case of favorite animals), but, in case of participants only informed of the probe, the probes in any item categories have equal importance in that they are recognized as relevant to the test (but have no further personal importance).

Accuracy rates
We report statistical results for accuracy rates in the same manner as for RTs. All means and SDs of individual accuracy rates, for the different stimuli types, in all guilty and informed innocent conditions, are given in Table D1.
We conducted an ANOVA, with between-subjects factors Knowledge (guilty vs. informed innocent) and Version (Target-CIT, E-CIT, and Inducer-CIT), on probe-minus-irrelevant accuracy rate differences ( Probe-minus-irrelevant differences in accuracy rates were used as predictor variables to calculate AUCs for each condition (Table D1).

FIGURE D1
Means and SEs of individual probe-minus-irrelevant reaction time (RT) mean differences (i.e., correct probe RT means minus correct irrelevant RT means) in Experiment 1. High-salient: item category in which the probe is highly personally important. Low-salient: item category in which the probe is less personally important. Guilty: participants with their own details as probes. Innocent: participants with random details as probe, but informed about it. (In this figure, all three task versions are merged together)

There were no significant differences between the accuracy-based AUCs of any two of the three task versions (p > .4 for all comparisons).
To test the effect of information leakage on each CIT version separately, we performed paired sample t tests between the probe accuracy rates and irrelevant accuracy rates within each informed innocent condition (for corresponding effect sizes, see Table D1).

Saliency
We again examined the effect of saliency and its possible interactions across the CIT versions for probe-minus-irrelevant RT means. In a three-way ANOVA, with Saliency (high-salient countries vs. low-salient animals) as within-subject factor and Version (Target-CIT, Inducer-CIT) and Knowledge (guilty vs. informed innocent) as between-subjects factors, the three-way interaction was not significant, F(1, 208) = 3.79, p = .053, ηp² = .018, 90% CI [.000, .058], BF01 = 1.02. The Saliency main effect, however, was significant in the expected direction, with probe-minus-irrelevant differences larger for high-salient (country) items than for low-salient (animal) items,  Figure D2.

Accuracy rates
All means and SDs of individual accuracy rates are given in Table D2.
Note: Means and SDs (in the format of M ± SD) for individual accuracy rates (percentages of correct responses) for Probe (item presumed to be the participant's own detail), Irrelevant (other details in the same categories as the probe), Target (the designated irrelevant details that require a different response), Self-referring (self-referring inducers), Other-referring (other-referring inducers), P-I (individual probe minus irrelevant values). Dashes indicate inapplicable cases: no inducers in the Target-CIT, and no targets in the Inducer-CIT. Cohen's d effect sizes (with 95% CIs in brackets): d within for probe-minus-irrelevant differences, d between for differences between guilty and informed innocent for each CIT version. AUC: Area under the curve (i.e., classification accuracy between the guilty and informed innocent participants of each CIT version).

We conducted an ANOVA with between-subjects factors Knowledge (guilty vs. informed innocent) and Version (MP Target-CIT vs. Inducer-CIT), on probe-minus-irrelevant accuracy rate differences. We found a significant main effect of Version (larger negative probe- that in case of the MP Target-CIT, the negative probe-minus-irrelevant accuracy rate difference was not larger for guilty than for informed innocents, as it normally happens, whereas it was so, as expected, in case of Inducer-CIT (see Table D2). Follow-up t tests indicate that both these differences were significant (although, unlike for the interaction, with indeterminate BFs): larger differences in case of informed innocent MP Target-CIT than guilty MP Target-CIT, t(90.2) = 2.41, p = .018, BF10 = 2.80, but larger differences in case of guilty Inducer-CIT than informed innocent Inducer-CIT, t(86.3) = −2.19, p = .031, BF10 = 1.67 (for effect sizes, see Table D2).
This means that, surprisingly, in the MP Target-CIT, probe-minus-irrelevant accuracy rate differences predict group membership in the direction opposite to what is usually expected: Informed innocent participants give more incorrect probe responses (as compared with irrelevant responses) than guilty participants.
The one-sided paired sample t tests between the probe accuracy rates and irrelevant accuracy rates within each informed innocent condition (for effect sizes, see Table D2) showed a significant effect for both MP Target

Finally, probe-minus-irrelevant differences in accuracy rates were used as predictor variables to calculate AUCs, which are shown for each condition in Table D2. The AUC for the Inducer-CIT was significantly higher than that of the MP Target-CIT, using a one-sided DeLong's test, D(209.06) = 2.58, p = .005.

Note: Means and SDs (in the format of M ± SD) for individual accuracy rates (percentages of correct responses) for Probe (item presumed to be the participant's own detail), Irrelevant (other details in the same categories as the probe), Target (the designated irrelevant details that require a different response), Self-referring (self-referring inducers), Other-referring (other-referring inducers), and P-I (individual probe minus irrelevant values). Dashes indicate inapplicable cases: no inducers in the Target-CIT and no targets in the Inducer-CIT. Cohen's d effect sizes (with 95% CIs in brackets): d within for probe-minus-irrelevant differences and d between for differences between guilty and informed innocent for each CIT version. AUC: Area under the curve (i.e., classification accuracy between the guilty and informed innocent participants of each CIT version).

FIGURE D2
Means and SEs of individual probe-minus-irrelevant RT mean differences (i.e., correct probe RT means minus correct irrelevant RT means) in Experiment 2. High-salient: item category in which the probe is highly personally important. Low-salient: item category in which the probe is less personally important. Guilty: participants with their own details as probes. Innocent: participants with random details as probe, but informed about it. (In this figure, the two task versions are merged together)