Can pupillometry distinguish accurate from inaccurate familiarity?

Pupillometry, the measurement of pupil diameter, has become increasingly popular as a tool to investigate human memory. It has long been accepted that the pupil is able to distinguish familiar from completely novel items, a phenomenon known as the "pupil old/new effect". Surprisingly, most pupillometric studies on the pupil old/new effect disregard the possibility that the pupillary response to familiarity memory may not be entirely exclusive. Here, we investigated whether the pupillary response to old items correctly judged familiar (hits; accurate familiarity) can be differentiated from the pupillary response to new items wrongly judged familiar (false alarms; inaccurate familiarity). We found no evidence that the two processes could be isolated, as both accurate and inaccurate familiarity showed nearly identical mean and across-time pupillary responses. However, both familiarity hits and false alarms showed pupillary responses unequivocally distinct from those observed during either recollection or novelty detection, which suggests that the pupil measure of familiarity hits and/or false alarms was sufficiently sensitive. The pupillary response to false alarms may have been partially driven by perceptual fluency, since novel objects incorrectly judged to be old (i.e., false alarms) showed a higher degree of similarity to studied images than items correctly judged as novel (i.e., correct rejections). Thus, our results suggest that pupil dilation may not be able to distinguish accurate from inaccurate familiarity using standard recognition memory paradigms, and that the pupillary response during familiarity feelings may partly reflect perceptual fluency.

memory that are characterized by memory for the item alone without its study context and recall memory for the item in its study context, respectively.
Despite copious evidence suggesting a strong link between accurate familiarity (indexed by familiarity hits) and increases in pupil diameter, much less is known about the relationship between accurate familiarity and inaccurate familiarity (indexed by false alarms). Some evidence suggests that recognition false alarms can be induced by enhancing processing fluency (e.g., via masked priming) of items that have not been studied in an earlier phase of an experiment. The unexpected increase in fluency causes participants to endorse those fluent items as old, regardless of their true study status (Jacoby & Whitehouse, 1989; Johnston et al., 1985; Whittlesea, 1993; Whittlesea & Williams, 2000).
Neuroimaging research indicates that accurate familiarity and fluency-driven inaccurate familiarity memory may recruit similar neural mechanisms. Using fMRI, Dew and Cabeza (2013) found that reduced activity in the perirhinal cortex, which, for decades, had been considered a hallmark of familiarity memory, was also observed for unstudied words that were primed during the test phase. These and several other findings (Gomes et al., 2016, 2017, 2019; Wang et al., 2016) suggest that accurate and fluency-based inaccurate recognition memory are likely mediated by a common neural network. Indeed, computational models have indicated that, if certain assumptions hold, a single-memory system is capable of supporting both recognition memory and fluency (Berry et al., 2008, 2012). If enhanced fluency can induce false memories, what effect does increased processing fluency have on pupillary responses? We are aware of only one study (Kuchinke et al., 2009) that examined the relationship between processing fluency and pupil dilation. In this study, participants were shown Cubist paintings that varied in content accessibility (low, medium, and high), and were asked to press a button as soon as they recognized a concrete object in the painting (e.g., the face of a woman). The results showed that the fluently processed paintings were responded to faster, and this effect was accompanied by larger pupil dilation at the point of object recognition. Thus, this study suggests that enhancing processing fluency can lead to increased pupil dilation. The question remains, however, whether the enhanced fluency that triggers a false alarm during a recognition memory test generates a pupillary response that resembles a typical familiarity-based pupillary response.
The idea that accurate and inaccurate familiarity could potentially exhibit similar pupillometric behavior has rarely been considered. A few studies, however, have reported larger pupil dilation for new stimuli judged "old" (i.e., false alarms) than for new stimuli judged "new" (i.e., correct rejections) (Brocher & Graf, 2017; Montefinese et al., 2013; Otero et al., 2011). In addition, these studies also found that false alarms showed less dilation than correct recognition memory (truly old items correctly categorized as old). However, participants only had the option to judge an item as either "old" or "new," rather than recollected, familiar, or new/unstudied, so pupil dilation for "old" judgments might have reflected the combined contribution of recollection and familiarity memory.
Worryingly, most pupillometric studies on human memory interpret pupil dilation differences between familiarity hits and correct rejections as evidence that pupil dilation specifically indicates accurate familiarity. However, if false alarms produce a similar pupillary response to that observed with familiarity hits, then the specificity of pupil dilation to familiarity memory (and perhaps recollection) may need revisiting. Indeed, in a recent study by Mill et al. (2016), participants were given highly predictive cues of the true old/new status of words before making their recognition memory judgments. They found that when participants were specifically cued that the status of the word would be "new", the pupil showed a stronger dilation for "old" than "new" responses, and this was true for both correct (hits > correct rejections) and incorrect (false alarms > misses) judgments. In contrast, for "likely old" cues, no difference between "old" and "new" responses was found. The authors concluded that pupil old/new effects do not signal accurate recognition, but, instead, could reflect participants' subjective sense of recognition.
It is, therefore, critical to understand how specific the pupil dilation that accompanies accurate familiarity really is, by comparing the pupillometric behavior of familiarity hits with that of familiarity-based false alarms. In a previous study, we provided evidence that implicit memory can be dissociated from accurate familiarity at the pupil level (Gomes et al., 2015). In that study, we showed that even though forgotten items (i.e., misses) displayed larger pupils than correctly rejected new items (i.e., correct rejections), this pupil dilation was reduced compared to that observed for familiarity hits (even when appropriately matched for difficulty and familiarity strength). Here, we re-analyzed the same dataset but examined the relationship between familiarity hits (accurate familiarity) and familiarity-based false alarms (inaccurate familiarity) instead. Our familiarity-only procedure focused attention on making familiarity decisions about stimuli studied under incidental conditions, likely reducing recollection memory so that relatively few stimuli were recollected (Mayes et al., 2007; Montaldi et al., 2006). This procedure may have also increased the degree to which fluency processing contributed to familiarity decisions.
If fluency modulates pupil dilation as some research suggests (Kuchinke et al., 2009), then it is plausible to assume that pupillary responses to both familiarity hits and false alarms are equally affected by fluency, and they may show similar pupillary behavior. However, if fluency plays a minimal role, then pupil dilation for familiarity hits and false alarms may well be different. Alternatively, if the brain fails to distinguish accurately whether an item is old or new, false alarms may still produce a brain response that translates into indistinguishable pupillometric behavior from familiarity hits, regardless of how fluently the items were processed.

2 | METHOD

2.1 | Participants, materials, and procedure
Power analysis revealed that at least 45 participants would be required to achieve a power of .80. For this power calculation, we used the effect size of the difference in the pupillary time courses between familiarity hits and misses in Gomes et al. (2015). In the present study, 90 participants were recruited. Twelve participants had to be excluded due to either technical difficulties in the discrimination of the pupil or exaggerated movement during the tasks.
We followed the recommendations given by Goldinger and Papesh (2012) when applying pupillometry to the study of human memory, which included the usage of: (1) colorless, low-contrast stimuli, (2) relatively long stimulus exposure and interstimulus interval, (3) baseline-corrected diameter analyses, and (4) complementary pupillometric metrics. We have described the materials and procedure in detail in Gomes et al. (2015). In short, participants were presented with either 50 (Experiment 1) or 100 (Experiments 2 and 3) object pictures at study, and either counted the number of red dots (either one or two) that were flashed within the image borders (Experiments 1 and 2) or decided if the object was bigger or smaller than a shoebox (Experiment 3). At test, participants saw the studied pictures randomly interspersed with 100 novel pictures (50 in Experiment 1). They were instructed to perform a recognition memory task, in which they decided whether each picture was recollected, familiar, or new. During the test phase, a fixation cross was presented for 3,000 ms, followed by a blank screen for 500 ms (baseline) and finally the picture of an object, which remained on the screen until participants responded. Pupil data were recorded for each individual trial, starting at baseline onset until participants' response.

2.2 | Data cleaning, reduction, and analysis
Blinks and other artifacts were removed from each pupillary trace, and data points in a given trace that deviated from the mean by more than ±2 standard deviations were replaced by linear interpolation. An unweighted 5-point moving average filter was applied to the data. Furthermore, trials with reaction times (RTs) more than 2 standard deviations above or below the mean were also excluded from the analysis.
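The cleaning steps described above can be sketched in a few lines. This is only a minimal illustration, not the analysis code used in the study: the function name `clean_trace`, the use of NaN to mark samples already removed as blinks, and the edge-padding of the moving average are all assumptions made for the example.

```python
import numpy as np

def clean_trace(trace, n_sd=2.0, window=5):
    """Interpolate outlying samples, then smooth with an unweighted moving average.

    Assumption: samples already removed as blinks/artifacts are marked as NaN.
    """
    x = np.asarray(trace, dtype=float)
    valid = ~np.isnan(x)
    mean = x[valid].mean()
    sd = x[valid].std(ddof=1)
    # Keep only valid samples within n_sd standard deviations of the trace mean.
    good = valid & (np.abs(x - mean) <= n_sd * sd)
    idx = np.arange(x.size)
    # Linear interpolation over removed samples (edges take the nearest good value).
    filled = np.interp(idx, idx[good], x[good])
    # Unweighted moving average; edge padding keeps the output the same length.
    kernel = np.ones(window) / window
    padded = np.pad(filled, window // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```

How the edges of a trace are handled during smoothing is not stated in the text; edge padding is just one reasonable choice.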
In all our analyses, the averaged pupil diameter during the baseline period (500 ms prior to stimulus onset) was subtracted from the mean pupillary response. These baseline-corrected pupillary responses were calculated for recollection hits (Rs), familiarity hits (Fs), familiarity false alarms (FAs), correct rejections (CRs), and misses (Ms). In order to reduce the effects of conscious effort and familiarity strength on pupil dilation, we applied an RT-matching procedure whenever any two conditions were contrasted (see Gomes et al., 2015, for details). This procedure consisted of iterating over every trial of a particular category (e.g., FA) in order to find a trial of a different category (e.g., CR) with a similar RT. Positive differences were balanced against negative differences, such that the net mean RT for the categories being matched would be identical (Gomes et al., 2015; Spencer et al., 2009). Only participants with at least 3 valid trials in any given response category were considered.
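As a rough illustration of baseline correction and RT matching, the sketch below subtracts the mean of the baseline samples from the post-stimulus samples and greedily pairs trials across two categories by closest RT. It is a simplified stand-in, not the procedure of Gomes et al. (2015): the tolerance `max_diff_ms` is a hypothetical parameter, and the balancing of positive against negative RT differences is not implemented here.

```python
import numpy as np

def baseline_correct(trace, n_baseline):
    """Subtract the mean of the first n_baseline samples from the remaining samples."""
    x = np.asarray(trace, dtype=float)
    return x[n_baseline:] - x[:n_baseline].mean()

def match_by_rt(rts_a, rts_b, max_diff_ms=100.0):
    """Greedily pair each trial in category A with the closest-RT unused trial in B.

    Returns a list of (index_in_a, index_in_b) pairs. max_diff_ms is a
    hypothetical tolerance; trials without a close enough partner are dropped.
    """
    unused_b = list(range(len(rts_b)))
    pairs = []
    for i, rt_a in enumerate(rts_a):
        if not unused_b:
            break
        j = min(unused_b, key=lambda k: abs(rts_b[k] - rt_a))
        if abs(rts_b[j] - rt_a) <= max_diff_ms:
            pairs.append((i, j))
            unused_b.remove(j)
    return pairs
```

A greedy pass depends on trial order; an optimal assignment would be an alternative if the number of matched trials needs to be maximized.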
Note that the significance, as well as direction of the results, were unchanged when the unmatched data were analyzed instead (see Supplementary Information). However, given that RTs for some of the categories were significantly different from each other (e.g., FAs vs. CRs), we decided to present the matched data in the main text.
There were no effects interacting with the experiment factor in any of the analyses performed (all Fs < 1.90, ps > .16). Thus, to simplify the analyses and leverage the larger sample size, we collapsed the data across the experiment factor.

2.3 | BOLAR index of image similarity
For the analysis of image similarity, we computed the bank of local analyzer responses (BOLAR; Zelinsky, 2003).¹ The BOLAR model isolates the visual similarity component by decomposing a stimulus into featural visual dimensions such as color, orientation, and spatial scale. This is achieved by the use of 108 multiscale Gaussian-derivative filters (GDFs) that code the visual features of the image. Large-scale filters extract low spatial frequency patterns in an image (e.g., overall shape of an object), whereas small-scale filters extract high spatial frequency patterns (e.g., fine details of an object). The 108 filters result from applying 36 GDFs to each of the three color/intensity channels (36 × 3 = 108 filters), thereby capturing chromatic information. Because this is done across all pixels in an image, implementing this procedure results in a BOLAR vector for each individual pixel. Once the BOLAR vectors of any two images are known, the normalized Euclidean distance (E) between those BOLAR vectors can be computed, giving a map of image differences. This difference map is then summarized by squaring and summing each signal, and subsequently taking the square root of that value (Equation 1a). The resulting scores are then normalized in order to bring all values to the range [0, 1], and then subtracted from 1 (such that low values indicate low similarity and high values indicate high similarity; Equation 1b).
¹ Note that we did not employ this algorithm in our previous study (Gomes et al., 2015), because the focus of that study was on long-term priming rather than perceptual fluency. In the present report, however, we examined the possibility that perceptual fluency affected the pupillary responses of FAs. Therefore, we used the BOLAR index here to measure perceptual similarity between images.

$$E = \sqrt{\sum_{i=1}^{n} \left( BV_i - BV'_i \right)^2} \qquad \text{(1a)}$$

Equation 1. (a) Calculation of the Euclidean distance (E) between any two images (Zelinsky, 2003). n represents the dimensionality of the BOLAR representation; BV and BV′ are the two BOLAR vectors being compared. (b) The Es for each paired comparison (x_i) were normalized within their own response category x (e.g., FA), so that values had a score between 0 and 1, with scores closer to 1 representing higher similarity; this is the BOLAR index (BI).
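To make the two steps of Equation 1 concrete, the following sketch computes a Euclidean distance between two feature vectors and converts a set of distances into similarity scores in [0, 1]. It rests on two simplifying assumptions that are not taken from the source: each image is represented by a single flattened vector rather than one BOLAR vector per pixel, and the normalization is a min-max rescaling within the set of distances passed in.

```python
import numpy as np

def euclidean_distance(bv, bv_prime):
    """Square root of the summed squared differences between two vectors."""
    bv = np.asarray(bv, dtype=float)
    bv_prime = np.asarray(bv_prime, dtype=float)
    return float(np.sqrt(np.sum((bv - bv_prime) ** 2)))

def bolar_index(distances):
    """Rescale distances to [0, 1] (min-max, an assumption) and subtract from 1.

    Higher values mean higher similarity.
    """
    e = np.asarray(distances, dtype=float)
    span = e.max() - e.min()
    if span == 0:
        # All distances are equal, so no ordering by similarity is possible.
        return np.ones_like(e)
    return 1.0 - (e - e.min()) / span
```

With min-max rescaling, the most similar pair in a set always receives a score of 1 and the least similar pair a score of 0, so scores are only comparable within the set over which they were normalized.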
Figure 1 (left) shows an example of the results of the BOLAR analysis for a subset of images with varying degrees of similarity of perceptual features.
After RT-matching (see Methods and Gomes et al., 2015), RTs were not significantly different among the different response categories (see Figure 2).
Our main interest in the present study was to determine whether inaccurate familiarity (as indexed by FAs) could be distinguished from accurate familiarity (as indexed by Fs). For that purpose, we initially compared Fs with properly RT-matched FAs. There was no difference between the two categories with respect to mean pupil diameter, F(1,70) = .41, p = .68 (see Figure 3). Even though we did not observe significant interactions with the Experiment factor (see Methods section), we decided to compare Experiments 1 and 2 with Experiment 3, given that they differed in the encoding task: Experiments 1 and 2 used a more perceptual orienting task (dot task), whereas Experiment 3 used a more semantic orienting task (size-judgment task). First, as with the collapsed data, we did not find a statistically significant difference between Fs and FAs in either experiment (ts < 1.00, ps > .33, ds < .21). In addition, there was no pupillary response difference for the Fs versus FAs comparison between Experiment 1/2 and Experiment 3 (t(69) = 1.11, p > .27, d = .29). This was true even though memory performance was substantially better for the size-judgment than the dot task, and RTs for both Fs and FAs were also significantly faster for the size-judgment than the dot task (all ts > 2.94, ps < .005). This result is interesting because it suggests that the pupillary response does not differ between Fs and FAs, even under experimental conditions where F memory is substantially stronger. Therefore, we continued data analyses with the collapsed data only.
FIGURE 1 Left: Graphical depiction of the BOLAR index as a measure of the similarity between pairs of pictures. The same nine exemplar items were selected and arranged along the vertical and horizontal axis of a 9×9 matrix. Colors indicate different BOLAR scores, in that larger values (red) indicate higher similarity. The diagonal of the matrix has a value of 1 since the images being compared are identical. Note how BOLAR accurately captures similarity based on the overall shape (e.g., the shape of the pen and the thermometer), as well as finer details (e.g., the petals of the flower and the windmill sails). Right: Density distribution of BOLAR scores for all possible pairwise comparisons of two images.
FIGURE 2 Mean raw reaction times for recollection hits (Rs), familiarity hits (Fs), false alarms (FAs), and correct rejections (CRs). Color boxplots represent mean reaction times after implementing the matching procedure (see Methods and Gomes et al. (2015) for details). Each color is associated with a specific matching between two categories (e.g., purple boxplots represent the matched reaction times of Rs and Fs, red boxplots represent the matched reaction times of FAs and CRs, etc.). Error bars represent the standard error of the mean. Rs = Recollection hits, Fs = Familiarity hits, FAs = False alarms, CRs = Correct rejections, Ms = Misses.
FIGURE 3 Mean baseline-corrected pupil dilation for RT-matched categories. The categories are RT-matched within each facet (e.g., for the first facet, Rs are RT-matched with Fs). Error bars represent the standard error of the mean. Rs = Recollection hits, Fs = Familiarity hits, FAs = False alarms, CRs = Correct rejections, Ms = Misses.
Bayesian statistics allow the assessment of how much support we have for the null hypothesis in comparison to the alternative hypothesis. The Bayes Factor (BF), in particular, represents the evidence in the data favoring one hypothesis over another. We were interested in comparing the hypothesis of larger pupil dilation for Fs than FAs against the null hypothesis of no difference between the two categories. Conventionally, BF values equal to or larger than 3, or equal to or smaller than 1/3, represent substantial evidence, and anything between 1/3 and 3 is considered inconclusive evidence. In our case, the BF was .3, which would be interpreted as our data supporting the null hypothesis.
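The conventional thresholds for reading a Bayes factor can be written as a small helper. This is only an illustration of the decision rule stated in the text; the function name and the wording of the returned labels are hypothetical.

```python
def interpret_bayes_factor(bf):
    """Classify a Bayes factor (alternative over null) using the 3 and 1/3 conventions."""
    if bf >= 3:
        return "substantial evidence for the alternative hypothesis"
    if bf <= 1 / 3:
        return "substantial evidence for the null hypothesis"
    return "inconclusive evidence"
```

Applied to the value reported above, `interpret_bayes_factor(0.3)` falls in the null-supporting range because .3 is below 1/3.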
Next, we compared the pupillary time courses of Fs versus FAs. For this analysis, the pupil data for each individual trial were split into six bins, each containing an equivalent number of pupil recordings.² This procedure allowed us to standardize different trials that had different RTs (all three experiments were self-paced), and, thus, contained different numbers of pupil recordings. Each bin was then averaged across all trials for a particular category, thus providing a binned (averaged) time course for each participant and response category (Gomes et al., 2015).
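The binning step can be illustrated as follows. This is a minimal sketch under the assumption that each trial is a one-dimensional array of baseline-corrected samples containing at least as many samples as bins; the function names are hypothetical.

```python
import numpy as np

def bin_trace(trace, n_bins=6):
    """Split one trial into n_bins nearly equal-sized chunks and average each chunk."""
    x = np.asarray(trace, dtype=float)
    return np.array([chunk.mean() for chunk in np.array_split(x, n_bins)])

def binned_time_course(trials, n_bins=6):
    """Average the binned traces across all trials of one response category."""
    return np.mean([bin_trace(t, n_bins) for t in trials], axis=0)
```

Because every trial is reduced to the same number of bins, trials of different durations can be averaged bin by bin, with Bin 1 starting at stimulus onset and the last bin ending at the response.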
The time course analysis only revealed a significant main effect of Bin, F(1.44,102.04) = 27.75, p < .001, ηp² = 1, but no critical Bin × Category interaction, F(1.62,114.86) = 1.43, p = .24, ηp² = .27 (see Figure 4). Unsurprisingly, the subsequent post-hoc analysis did not reveal any significant differences between the two categories at any bin level (all ts < 1.25, ps > .21). The BF analysis also found strong evidence favoring the model with the main effect of Bin relative to any other model. In fact, the reduced model with only the main effect of Bin was favored against the full model (which included the Bin × Category interaction) by a factor of more than 200 (BF_main/BF_interaction = 245.04 [±4.76%]).
In order to demonstrate that the lack of a pupillary response difference between Fs and FAs was not the result of insensitivity of either Fs or FAs, we compared these two categories to RT-matched CRs. Both Fs and FAs differed statistically from their corresponding matched CRs (Fs > 5.81, ps < .02, ηp² > .07). In both analyses there was also a significant main effect of Bin (both Fs > 18.49, ps < .001, ηp² > .20), as well as a significant Bin × Category interaction (both Fs > 3.71, ps < .032). Post-hoc tests revealed that both bins 5 and 6 significantly differed between Fs and CRs as well as between FAs and CRs (ts > 2.19, ps < .032). Similarly, the comparison against RT-matched Ms yielded a significant main effect of Bin and a Bin × Category interaction for both Fs versus Ms as well as FAs versus Ms (main effect: Fs > 14.69, ps < .001, ηp² > .17; interaction: Fs > 3.69, ps < .03, ηp² > .05). Post-hoc tests indicated that bins 5 and 6 significantly differed between Fs and Ms, ts < 3.51, ps < .002, whereas only bin 6 differed between FAs and Ms, t = 1.96, p = .05.
The previous analysis revealed that the pupil size for both F and FA responses can be differentiated from the pupil size of CR and M responses, in the sense that the former two response categories showed larger pupil diameter than the latter two response categories. There is also evidence that F responses show a reduction in pupil diameter relative to R responses (Kafkas & Montaldi, 2012; Otero et al., 2011). Thus, we compared the mean pupillary response between (1) RT-matched Rs and Fs, as well as between (2) RT-matched Rs and FAs. The main effect of category was not significant for either contrast 1 or 2 (both Fs < .83, ps > .37, ηp² < .02). However, for both contrasts, a 2 (Category) × 6 (Bin) analysis of the pupillary time course revealed a significant main effect of Bin, Fs > 29.05, ps < .001, ηp² > .38, as well as a more interesting Category × Bin interaction, Fs > 3.56, ps < .03, ηp² > .07. As seen in Figure 3, this interaction is mostly the result of Rs showing a significantly larger increase in pupil dilation than either Fs or FAs in bin 6 (ts > 2.24, ps < .03).
² We performed this analysis using different numbers of bins (up to 10), but the results did not change significantly. Thus, we show here the analysis using six bins to be consistent with Gomes et al. (2015).

FIGURE 4 Mean baseline-corrected pupil dilation for RT-matched categories. The categories are RT-matched within each facet (e.g., for the first facet, Rs are RT-matched with Fs). Each bin represents a standardized time window, ranging from stimulus onset (Bin 1) until participants' response (Bin 6; see text for further details). Note that the first time bin corresponds to the average of the first few milliseconds of pupillary recordings. Because the pupil dilates above baseline in most trials, the first bin reflects this averaged initial dilation after stimulus onset, which is the reason why the waveforms do not start at exactly 0. Error bars represent the standard error of the mean. Rs = Recollection hits, Fs = Familiarity hits, FAs = False alarms, CRs = Correct rejections, Ms = Misses.

GOMES ET AL.
Taken together, the above analyses suggest that FAs behaved much like RT-matched Fs in that (1) these two categories did not differ in terms of either mean pupil size or pupillary time course, and (2) they both differed from their respective RT-matched CRs, Ms, and Rs in a very similar way.
Recent research has shown that fluency-based inaccurate familiarity can trigger a similar brain mechanism to that which produces accurate familiarity (Dew & Cabeza, 2013; Gomes et al., 2019). One possibility could be that participants experienced enhanced perceptual fluency during the processing of certain novel pictures, which was then used as a cue for distinguishing old from new items (Johnston et al., 1985; Whittlesea, 1993). In order to test if perceptual fluency could partly explain the pupillometric similarity between F and FA responses, we computed the BOLAR index (an indicator of image similarity; see Methods) for every FA and CR image and for all participants. We assumed that the larger the BOLAR index between any given new picture and studied pictures was (i.e., the more similar, on average, a new picture was to studied pictures), the more fluently that new picture would be processed, and that this relationship would only be observed with FAs. It follows, then, that if an increase in perceptual fluency caused participants to produce FAs, we should observe (1) an association between pupil size and image similarity for FAs but not CRs, and (2) overall larger similarity between studied and FA pictures than between studied and CR pictures.
First, we computed a BOLAR index between each individual FA/CR image and each studied image. Then, all BOLAR indices for that FA/CR image were averaged. Therefore, every FA/CR image had an associated BOLAR index, which represented its average similarity to all studied images. Finally, we calculated the mean of these averaged BOLAR indices for both FAs and CRs, yielding a mean FA BOLAR index and a mean CR BOLAR index for each participant.
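The two averaging steps above can be sketched as follows. The data layout is an assumption made for the example: `index_matrix` holds one row per new (FA or CR) image and one column per studied image, and `responses` labels each new image as "FA" or "CR" for a single participant.

```python
import numpy as np

def mean_index_per_image(index_matrix):
    """Average each new image's BOLAR indices over all studied images (row means)."""
    return np.asarray(index_matrix, dtype=float).mean(axis=1)

def participant_means(image_scores, responses):
    """Mean of the per-image scores, separately for FA and CR images."""
    scores = np.asarray(image_scores, dtype=float)
    responses = np.asarray(responses)
    return {cat: float(scores[responses == cat].mean()) for cat in ("FA", "CR")}
```

A participant with no trials in one of the two categories would yield an undefined mean here, which is one reason a minimum number of valid trials per category is needed.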
We restricted our analysis to CRs and FAs. Correct Fs were excluded because processes other than fluency-based similarity to studied stimuli may well be involved and affect pupil size, and, thus, would introduce a confound. If so, a significant correlation would probably not reflect the effects of fluency alone. In contrast, both CR and FA images were novel, and so, a comparison between these two categories is not biased by prior study status. We initially performed separate correlational analyses in order to determine if mean pupil diameter and the BOLAR index were positively correlated in either response category. As predicted, the relationship between BOLAR index and pupil diameter was significant for FAs, r = .25, p = .016, whereas it was not for CRs, r = −.11, p = .83, suggesting that the relationship between higher similarity and larger pupil dilation was only observed for FAs (see Figure 5, left). The correlation coefficient for FAs was significantly different from the correlation coefficient for CRs, z = 2.19, p = .01.
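One common way to compare two correlation coefficients is Fisher's r-to-z transformation. The sketch below implements the version for independent samples with a one-tailed p value; whether this is the exact test used here is an assumption, since the source does not state which test produced the reported z, and the sample sizes passed to the function must be supplied by the user.

```python
import math

def compare_independent_correlations(r1, n1, r2, n2):
    """Fisher r-to-z test for the difference between two independent correlations.

    Returns (z, one_tailed_p), where the p value is for the hypothesis r1 > r2.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / se
    # Upper-tail probability of the standard normal distribution.
    p_one_tailed = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p_one_tailed
```

Because the FA and CR correlations come from the same participants, a test for dependent correlations may be more appropriate; the independent-samples version is shown only because it is the simplest form.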
In addition, we found a significantly larger similarity index for FAs than CRs (t = 2.46, p = .016, d = .28; see Figure 5, right), suggesting that, on average, studied pictures were more similar to FA than CR pictures. We decided to explore this effect further by analyzing how the BOLAR index related to making or not making an FA on a trial-by-trial basis. For that aim, we performed a multilevel logistic regression analysis using the lme4 package (Bates et al., 2015) in R (www.r-project.org). We created a model with the response category (CR, FA) as the dependent variable, the BOLAR index for each individual FA/CR image, the mean pupil response, and the interaction between the two as predictors, and participant as a random effect. The idea behind this analysis was to try to predict category membership on the basis of the BOLAR index at each trial. As expected, both mean pupil dilation and BOLAR index were significant predictors of category membership (see Model 1 in Table 1). In other words, the higher the BOLAR index for a given image (i.e., the more similar that image was to studied images), the more likely the model categorized it as an FA. The interaction between pupil dilation and BOLAR index was nonsignificant.
We also built a multilevel regression model with pupil dilation as the dependent variable and response category, BOLAR index and the interaction as predictors, but, similar to Model 1, only the main effects were significant (see Model 2 in Table 1).
Thus, the results of these two analyses suggest that one potential reason participants made FAs was the greater similarity that FA pictures had with the study items.

4 | DISCUSSION
In this study, we examined the relationship between the pupillary response to familiarity hits (accurate familiarity) and familiarity false alarms (inaccurate familiarity). Our results unequivocally showed that the mean pupil diameter and pupillary time courses for Fs and RT-matched FAs were indistinguishable from one another. Since FAs correspond to new items that were incorrectly judged to be "old," and given that both Fs and FAs differed statistically from both RT-matched CRs and Rs, this result suggests that increased pupil dilation during recognition memory tests is not uniquely related to successful recognition memory (see also Mill et al., 2016). Rather, our results imply that accurate and inaccurate familiarity may involve the operation of a common neural system that triggers a pupillary response of the same kind.
Considering that a few studies have found that familiarity hits show different pupillometric behavior from false alarms (Brocher & Graf, 2017; Montefinese et al., 2013; Otero et al., 2011), our finding might seem surprising. In those studies, however, the old/new recognition memory test administered did not strictly encourage the use of familiarity. In fact, it is likely that the pupillary response to recognized items in those studies was the result of the combined contribution of recollection and familiarity memory. Given that recollection produced larger pupil dilation than familiarity, as our and other studies have found (Kafkas & Montaldi, 2011; Otero et al., 2011), it is not surprising that correct recognition in the above studies was associated with an increased pupillary response relative to inaccurate memory (which may be predominantly based on a sense of familiarity). In contrast, our study minimized the influence of recollection processes during the recognition memory test by using the familiarity-only procedure, which is a modification of the remember/know procedure. Thus, the discrepancies between those studies and ours could reflect these differences in methodology.
Another study that measured pupil responses and also used the familiarity-only procedure with object stimuli had a different aim that involved comparing oldness judgments with newness judgments (Kafkas & Montaldi, 2015). Although this study found very few false alarms with oldness judgments, it did, however, find that familiarity hits initially showed larger pupil responses than familiarity misses of newness in the newness judgment task, which might be regarded as equivalent to false alarms in the oldness judgment. In the period leading up to the response, pupil size was the same. Our results suggest that the different procedures are associated with distinct patterns of pupil responses, although this requires further investigation.
We can confidently exclude the possibility that the similar magnitude of pupil dilation between Fs and FAs was the result of different levels of familiarity strength and/or increased cognitive effort applied to FA than F items (or vice versa). We only compared RT-matched trials between Fs and FAs, such that only trials that had similar RTs were included in the analysis. This should have effectively and considerably reduced differences in cognitive effort between contrasting categories. Although we acknowledge that RT-matching may still be insufficient to perfectly match effort and familiarity strength between any two categories, changes in RTs remain a popular proxy for cognitive effort (Beatty, 1982; Gomes & Mayes, 2015a, 2015b; Montefinese et al., 2013; Porter et al., 2007; Võ et al., 2008). Another potential criticism is that the lack of a statistical difference in the pupillary response between Fs and FAs was because either Fs or FAs (or both) simply failed to produce a distinctive pupillary response altogether (i.e., the measures were too noisy). This is, however, inconsistent with both Fs and FAs showing a substantially reduced pupillary response relative to Rs as time elapsed, as well as increased dilation relative to CRs and Ms, both of which are well-documented effects in the literature (Gomes et al., 2015; Kafkas & Montaldi, 2011; Otero et al., 2011).
TABLE 1 Results from the multilevel logistic regression model predicting category membership (FA, CR) from pupillary response and BOLAR index predictors (Model 1), and the multilevel regression model predicting pupil dilation from BOLAR index and response category (Model 2).
Another possibility is that the increase in pupil dilation for FAs reflects a misattribution of "oldness" to the new stimuli. This might occur as the result of participants experiencing an unexpected increase in processing fluency for FAs, which is then interpreted as familiarity (Whittlesea et al., 1990; Whittlesea & Williams, 2000). Old/new decisions are based on whatever cues the participant receives during each trial. For FAs, there is no objective memory signal as there is for truly old items. Thus, when participants experience increased fluency for FA items, this fluency is unexpected and is attributed to prior study.
Indeed, some research has revealed that the neural mechanism responsible for familiarity has a striking resemblance to that of inaccurate memory (Dew & Cabeza, 2013; Gomes et al., 2019). In our experiments, we used very simple line drawings of objects, so unstudied pictures inevitably shared overlapping features with studied pictures. Furthermore, the relatively shallow encoding conditions (at least during the dot task) resulted in memory traces so weak that, without other cues to help differentiate old from new items, processing fluency likely became an appealing basis for recognition memory decisions (Jacoby & Whitehouse, 1989; Whittlesea et al., 1990). Our analysis of image similarity indicated that the similarity between any given new image and studied images was strongly related to increases in pupil size, but only for FAs (Figure 5, left). In addition, the similarity between studied and FA images was overall greater than that between studied and CR images (Figure 5, right), and our logistic regression model was more likely to categorize an image as FA as image similarity increased. In summary, FAs may have led to larger pupils than CRs because FA images shared more perceptual characteristics with study image(s), triggering increased processing fluency during FAs, which was subsequently interpreted as familiarity.
We should stress that we are not arguing that the old/new pupil effect commonly observed in previous studies must reflect this fluency process. In fact, at face value, there appear to be four possibilities. First, accurate and inaccurate familiarity might lead to similar pupillary responses (both in magnitude and across time) due to the operation of a single-system neural mechanism (Berry et al., 2008, 2012). Second, it is also possible that accurate and inaccurate familiarity involve largely distinct neural mechanisms, but with some overlapping mechanisms that trigger a similar pupillary response. Third, if fluency without familiarity can support above-chance performance in some recognition memory tests (Gomes et al., 2017; Leynes & Zish, 2012; Whittlesea & Leboe, 2003), then the observation of increased pupil dilation in response to Fs need not be a hallmark of familiarity at all, but could actually indicate contamination from fluency that does not help produce familiarity (Voss, Lucas, et al., 2012). Finally, and perhaps more likely, pupil dilation may indicate familiarity feelings but, when effort and familiarity memory strength are equated, the brain may simply fail to distinguish between accurate and inaccurate familiarity, producing similar pupil dilations. Fluency may represent an intermediate stage following Jacoby's fluency attribution notion, given that our results suggest that inaccurate familiarity relates to how similar false alarm stimuli are to studied stimuli. This indicates that an underlying fluency process contributes to inaccurate familiarity and triggers pupil dilation. Accurate familiarity may also have a contribution from this same fluency process, the extent of which will require further investigation.
Unfortunately, we cannot disentangle these competing possibilities, as our experiments were not tailored to look specifically at the relationship between fluency and familiarity. Nevertheless, it is clear that pupillometry will not be useful in separating these two processes quantitatively with standard recognition memory paradigms. In order to adequately isolate accurate from inaccurate familiarity memory, future pupillometric studies must be able to differentiate F from FA processing, a contrast seldom reported in the relevant literature. One way to achieve this could be, for example, using meaningless stimuli, such as squiggles, which are unlikely to evoke strong fluency effects, but still trigger typical responses associated with familiarity memory (Voss & Paller, 2007). However, it remains distinctly possible that appropriately matched accurate and inaccurate familiarity are associated with indistinguishable pupillary responses.

| Limitations of the present study
One important limitation of the present study is that measuring similarity using the BOLAR index makes certain assumptions that may be false. When judging similarity, different individuals may select different features, use different algorithms, and rank images differently from the BOLAR index. The BOLAR index used to compute similarity in the present study is a purely data-driven method, which computes distances between BOLAR vectors based on the image pixels (see Methods). It may well be that the resulting difference maps contain features that our brains simply do not use or, just as plausibly, fail to include features that our brains do use. If the overlap is insufficient, then the correlation between objective and subjective similarity may be poor. Averaged subjective similarity ratings between certain stimulus categories have been computed (Frank et al., 2020; Migo et al., 2013) but, unfortunately, subjective estimates of similarity were not acquired in the present study. Future research will need to determine whether the findings reported here with the BOLAR index also occur with subjective similarity ratings.
Another potential concern was the lack of a significant interaction between pupil dilation and BOLAR index when the response category was used as the dependent variable in the logistic multilevel regression analysis. This poses a problem to our account that FAs are driven by perceptual fluency, as we would have expected a significant modulation of fluency (as indexed by the BOLAR values) on pupil dilation for FAs but not CRs. Nevertheless, we did find a modest correlation between the BOLAR index and the pupil dilation for FAs but not CRs at a participant level (see Figure 5, left). Even though the FA effect was admittedly weak, the correlation coefficient for FAs was still significantly different from that of CRs.
Finally, another limitation of the present study is that, for the main pupil results, the data for the three experiments were collapsed despite the fact that two different encoding tasks were used. Collapsing the pupil data had the major advantage of substantially increasing the power of the analyses comparing F and FA pupil dilation. However, collapsing pupil data across different encoding conditions is questionable if they have produced familiarity memory of different strengths, and the results above indicate that they did. Nevertheless, collapsing would be justified if there were no effect of encoding condition on relative F and FA pupil dilation. No such interaction effects were found with the experiment factor on any of the pupil dilation analyses (see Methods). The limitation arises because comparing the pupil measures of the three experiments would mean that at least some of the statistics were underpowered (particularly for Bayes Factor analysis). Nevertheless, there was no trend towards F and FA pupil differences in the data. In the future, adequately powered analyses should use a wider range of conditions so as to systematically vary familiarity memory strength sufficiently to assess whether F versus FA pupil differences emerge when memory strength becomes greater than we observed.

| Implications of this study for lie detection
Our results suggest that it may be difficult to differentiate accurate from inaccurate (familiarity) memory, as increases in pupil dilation were observed both when subjects correctly remembered images and when they wrongly believed the images were old. This raises a germane question: do the results of the present study have implications for the investigation of deception as measured with pupillometry?
Several studies have reported increases in pupil diameter when participants were asked to lie (Dionisio et al., 2001; Heilveil, 1976; Trifiletti et al., 2020; Wang et al., 2010), which has prompted calls for the development of lie detection technology based on pupillometry. Even though these studies showed greater increases in pupil dilation for deceptive than truthful answers, it is unclear if the distinction between deceptive and inaccurate memory during recognition memory tests can be as easily detected with pupillometry. It will, therefore, be important to ascertain if actively lying about memories of particular items or events produces greater pupil dilation than inaccurate memory for those same items or events during recognition. If the pupillary response to lying is found to be no different than when participants wrongly believe they remembered an item, this may limit the utility of pupillometry as a lie detector. Future research may help find a more conclusive answer.