Development and validation of the Eyewitness Metamemory Scale

Summary

Metamemory can be defined as the knowledge about one's memory capabilities and about strategies that can aid memory. In this paper, we describe the development and validation of the Eyewitness Metamemory Scale (EMS), tailored specifically for use in face memory and eyewitness identification settings. Participants (N = 800) completed the EMS and other measures of general metamemory. Results from exploratory and confirmatory factor analysis revealed good factorial validity, internal consistency, and content validity. The EMS items loaded onto three distinct factors: memory contentment, memory discontentment, and memory strategies. The EMS is a brief and easily administrable questionnaire that might be used to assess self-ratings of face recognition capacity and use of strategies to encode faces.


| INTRODUCTION
Metamemory can be defined as the knowledge about one's memory capabilities and strategies that can aid memory (Shimamura, 2008). This construct has been the subject of a substantial amount of research, sparked by developmental studies investigating how the ability to evaluate one's memory processes and mnemonic strategies improved learning during early childhood (Cavanaugh & Perlmutter, 1982). Metamemory research has since expanded to a variety of domains such as cognitive neuropsychology, educational psychology, and cognitive psychology, motivating the development of diverse self-report measures on memory monitoring and control (Pannu & Kaszniak, 2005). However, current psychometric instruments for assessment of metamemory typically focus on broad memory domains (e.g., episodic memory or semantic memory), and there appears to be an absence of self-assessment instruments of memory capacity for faces and person recognition. In this paper, we present the development and initial validation evidence for a metamemory assessment scale tailored specifically to face memory and eyewitness identification settings.
Metamemory research is essential for a comprehensive understanding of how people use and perceive their own memory, providing a theoretical framework that can generate testable hypotheses. For example, in research examining feeling-of-knowing judgements, participants decide whether they have studied some new information sufficiently for future recall. If the subjective memory confidence experienced indicates they have not sufficiently learned the material, they may employ mnemonic strategies or engage in further study to better learn the material (Koriat, 1993). Other important branches of metamemory research include the relation between metacognitive judgements and memory performance (Kelemen, 2000), the use of memory strategies (Guerrero Sastoque et al., 2019), the regulation of retrieval (Goldsmith, Pansky, & Koriat, 2014), and how metamemory changes across the lifespan (Ghetti, Lyons, Lazzarin, & Cornoldi, 2008).
Interest in assessing different aspects of metamemory has stimulated the development of various self-report measures that differ in content and item format. The content may include different aspects of metamemory so that respondents are asked to indicate the frequency of forgetting, the vividness of remembering, contentment with one's memory, and perceived changes or decay in their capabilities.
The item format can also vary so that some instruments focus on the relative frequency of memory issues in relation to others or in relation to one's own performance across a specified period. For example, the Metamemory in Adulthood Questionnaire (Dixon, Hultsch, & Hertzog, 1988) assesses individuals' knowledge of general memory processes and tasks, frequency of memory strategy use, self-rated memory ability, perceptions of memory stability over time, anxiety regarding memory, memory and achievement motivation, and locus of control in memory abilities. The Multifactorial Memory Questionnaire (MMQ; Troyer & Rich, 2002) was developed to assess separate dimensions of memory ratings that are applicable to clinical assessment and intervention. This instrument includes scales of contentment regarding one's memory, self-appraisal of one's memory capabilities, and reported frequency of memory strategy use. Another example is the Squire Subjective Memory Questionnaire (SSMQ; van Bergen, Brands, & Jelicic, 2010; Squire, Wetzel, & Slater, 1979), assessing how one's memory trust has developed over time.
Despite the existence of several self-report memory questionnaires, there seems to be an absence of instruments that focus specifically on self-rated memory capacity for faces and person recognition.
Most of the current measures have a strong focus on clinical assessments or interventions and typically include items concerning self-evaluation of general memory ability or items concerning semantic or episodic memory issues. One notable exception is the newly developed Stirling Face Recognition Scale (SFRS; Bobak, Mileva, & Hancock, 2019). The SFRS was developed to assess face recognition ability, ranging from developmental prosopagnosia (i.e., a neurological disorder characterized by the inability to recognize faces) to superrecognition. It has two components, face processing and face memory, which correlated moderately with objective face matching tests (correlations between r = .28 and r = .34). However, this instrument has not yet been subjected to factor analysis, and the reliability of each SFRS component is unknown. Furthermore, the SFRS does not include items related to other person identification elements that may be relevant in eyewitness settings.
Self-report instruments specifically developed to measure face recognition ability and person identification would have important implications for research and practice. One important issue in the criminal justice system, for instance, is to distinguish accurate from inaccurate eyewitness identifications. Evidence obtained from witnesses of crimes can be very influential in court decisions, but inaccurate witness identifications can impair investigations and in more severe cases contribute to miscarriages of justice. Some postdictors of eyewitness identification accuracy have been identified, such as early statements of confidence (Brewer & Wells, 2006), decision time (Sporer, 1993), and decision process (i.e., absolute vs. relative judgements, Dunning & Stern, 1994). However, under certain circumstances, the predictive value of those factors is undermined, for example, when eyewitnesses are exposed to biased lineups (Charman, Wells, & Joy, 2011) or receive feedback after an identification is made (Semmler, Brewer, & Wells, 2004). This limitation highlights the importance of investigating new estimators of eyewitness accuracy that are less susceptible to such external influences. One such potential estimator is self-efficacy in face recognition, which has been shown to be predictive of eyewitness accuracy (Olsson & Juslin, 1999; Perfect, 2004). However, previous studies on this issue have used single items of unknown reliability and validity, limiting conclusions regarding the relation between self-efficacy and objective memory performance. A reliable and valid metamemory scale tailored specifically to eyewitness settings would improve the inferences in studies investigating the relation between self-ratings of memory ability and objective memory accuracy.
Another important theoretical implication of an eyewitness metamemory scale is that it would help elucidate the relation between self-ratings of memory ability and expressions of confidence. Koriat (1993) has proposed that expressions of memory confidence are partly based on the encoding experience (i.e., characteristics of the stimuli) and on internal cues or beliefs about memory capacity (i.e., "am I good at recognizing this type of stimuli?"). However, general theories of memory confidence have not yet been thoroughly examined in eyewitness contexts. In forensic settings, for example, eyewitness confidence judgements are commonly used for assessing the likelihood that the eyewitness memory is accurate (Wixted & Wells, 2017). The ability to accurately evaluate one's own memory performance is a critical feature of metamemory function, but laboratory manipulations have shown that eyewitness confidence can be inflated by factors such as postidentification feedback (Douglass & Steblay, 2006) and repeated recall (Odinot & Wolters, 2006). It has been suggested that confidence expressed by witnesses is also influenced by internal cues (Leippe & Eisenstadt, 2014), but the extent to which memory accuracy and confidence for faces is related to self-perceived recognition skill is relatively unknown. In one of the few studies on the matter, Olsson and Juslin (1999) found that people who claim to be good face recognizers show slightly higher accuracy and better confidence-accuracy calibration in eyewitness identifications, but that study is limited by the use of single items of unknown reliability and validity. The absence of valid measures of eyewitness face recognition ability impairs the advancement of this theoretical line of research.
With such a measure, it would be possible to better examine the relation between beliefs about memory capacity and expressions of confidence in forensically relevant contexts.
Despite the benefits of self-report tools, it can be argued that memory accuracy could be better estimated by objective tests of memory performance. In fact, it has been proposed that tests of face recognition performance are informative estimators of proclivity to choose and identification accuracy (Baldassari, Kantner, & Lindsay, 2019; Russ, Sauerland, Lee, & Bindemann, 2018). However, in practical terms, objective tests of face recognition are more difficult to implement in applied and research settings. That is because commonly used tests of face recognition or face-matching ability are computerized and include many repeated trials (e.g., Dowsett & Burton, 2015; Russell, Duchaine, & Nakayama, 2009). Ideally, both objective memory tests and self-ratings of memory performance could be deployed to estimate eyewitness identification accuracy, but such an approach may not always be possible due to time and resource constraints. In this scenario, brief self-ratings of memory ability may be a feasible alternative for providing estimates of accuracy in practical settings and in empirical studies, although the relation between eyewitness self-ratings of memory capacity and objective performance has yet to be elucidated (Olsson & Juslin, 1999).
In sum, a metamemory instrument tailored specifically to eyewitness settings would be of considerable value in several lines of research and has the potential to aid end-users in forensic contexts.
Obtaining valid measures of metamemory for eyewitness identification is essential in research investigating the relation between self-efficacy, objective accuracy, and expressions of confidence.
Depending on the results and development of this line of research, self-ratings of memory ability may also be employed to distinguish accurate from inaccurate identifications or to identify individuals with superior face recognition abilities (Russell et al., 2009). In this article, we present the development steps and initial evidence of the psychometric validity of the Eyewitness Metamemory Scale (EMS), a self-report memory instrument tailored specifically to face recognition and eyewitness identification settings. For the purposes of this study, we aimed to develop the instrument and test its factorial structure, while also testing for its convergent and discriminant validity through associations with other metamemory measures.

| Participants and procedure
A total of 1,347 participants proceeded past the informed consent page, although 143 cases were removed for failure to complete the metamemory measures. Several exclusion criteria were adopted to ensure the quality of the data: (a) 38 cases were removed for taking more than 90 min to complete the experiment (excluding outliers, the study took 30 min on average to complete); (b) 145 cases were removed for completing the experiment in under 15 min (i.e., too little time to attentively complete the study); (c) 78 cases were removed for not passing all of the attention checks; and (d) 137 cases were removed due to suspicious bot activity (see Prims & Motyl, 2018). The final sample (N = 800) comprised 62% female participants and had a mean age of M = 29.83 years (SD = 11.89, range 18 to 72). The sample was recruited via Amazon Mechanical Turk (48%), from university students attending U.K. and Dutch institutions (32%), and through social media (20%). Participants recruited via Amazon Mechanical Turk received US$0.50, students received course credits, and participants recruited via social media were entered into a prize draw for two £50 Amazon vouchers.
In an online survey presented via Qualtrics, participants first completed the EMS, followed by the other general metamemory scales. The EMS was always shown first, whereas the other metamemory scales were presented in a random order.¹ Demographic information including gender, age, and level of education was also obtained, and on completion of all tasks, participants were debriefed and thanked for their participation.

| Eyewitness Metamemory Scale
Two qualitative approaches were adopted to develop an initial pool of items for the EMS. First, we closely examined the items of other metamemory measures and, where possible, based our item development on these items. Then a semistructured interview was conducted with a group of legal psychologists and graduate students working in this field of research (N = 14) to obtain additional information regarding memory self-assessment in eyewitness contexts. The initial pool of items consisted of 35 items, including eyewitness specific items and items concerning facial recognition adapted from various metamemory questionnaires. All items were rated on a 7-point Likert scale ranging from 1 (strongly disagree) to 7 (strongly agree). We did not establish specific hypotheses concerning the factorial structure that would emerge from these items but rather used an exploratory approach to establish its factorial structure.

| General metamemory instruments
In addition to the EMS, participants also completed the MMQ (Troyer & Rich, 2002) and the SSMQ (van Bergen, Horselenberg, Merckelbach, Jelicic, & Beckers, 2010; Squire et al., 1979). The MMQ has three subscales: contentment, ability, and strategy. All items are measured on a 5-point Likert scale. The contentment scale has 18 items (e.g., "I am generally pleased with my memory ability"; α = .92) rated from 1 (strongly disagree) to 5 (strongly agree), with higher scores indicating higher memory contentment. The ability scale has 20 items related to experiences with common memory errors over the past 2 weeks (e.g., "how often do you forget an appointment?"; α = .92) rated from 1 (all the time) to 5 (never), with higher scores indicating better self-reported ability. The strategy scale has 19 items concerning the use of memory strategies during the past 2 weeks (e.g., "how often do you use a timer or alarm to remind you when to do something?"; α = .88), rated from 1 (never) to 5 (all the time), with higher scores indicating greater use of memory strategies. The MMQ has shown high test-retest reliability and high internal consistency in the original study by Troyer and Rich (2002) and in adaptations to different countries (e.g., Fort, Adoul, Holl, Kaddour, & Gana, 2004; van der Werf & Vos, 2011). The SSMQ consists of 18 items related to memory trust (e.g., "my ability to recall things when I really try is"; α = .94). Participants rated the items on a 9-point scale ranging from −4 (worse than ever) to 4 (better than ever before).

¹ Participants then took part in an eyewitness paradigm consisting of a mock crime video and two identification tasks with confidence judgements. These data were obtained as part of a larger research project aiming to investigate the relation between metamemory measures and eyewitness memory performance. Due to space and focus, we only report on those measures that are relevant to the development of the EMS.
This instrument has shown good psychometric properties in different studies and correlates meaningfully with age, cognitive failures, and susceptibility to misinformation (van Bergen, Brands, et al., 2010; van Bergen, Horselenberg, et al., 2010). The MMQ and SSMQ differ mainly in response format. Although both instruments tap into self-rated memory ability, the MMQ focuses on present ability (i.e., "I am generally pleased with my memory ability"), whereas the SSMQ focuses on memory development over time ("my memory ability is better than ever before"). These instruments were selected to test the convergent and divergent validity of the EMS because of their good psychometric properties and high content validity in assessing metamemory traits such as self-ratings of memory capacity and memory trust. However, no specific hypotheses were established a priori concerning the relation between each of the SSMQ and MMQ factors and the factors obtained for the EMS, given that the factorial structure of the EMS was unknown prior to our analysis. Therefore, the convergent and divergent analyses in this study were exploratory, and it was generally expected that factors in the EMS would relate meaningfully to factors from the MMQ and SSMQ, given the similarities between these instruments in assessing metamemory.

| Attention checks
Three attention checks were included within the metamemory assessment, in which participants were asked to select a specific response for that item such as "for this question, please select the option 4 (better than ever before)." The attention checks were included as an exclusion criterion (see Section 2.1).

| RESULTS
To examine the validity of the EMS factorial solutions, a within-sample replication strategy was adopted, and the total sample was randomly split in half (Osborne & Fitzpatrick, 2012). The first half was treated as a training dataset for obtaining an initial factorial solution via exploratory factor analysis (EFA). The second half was treated as a test dataset for examining the fit of the initial solutions obtained in the training dataset via confirmatory factor analysis (CFA). All analyses were performed in the statistical software package R (2019). The dataset and data analysis script can be found in Data S1 to S3.
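The within-sample replication strategy amounts to a simple random partition of respondents into two disjoint halves. A minimal sketch (in Python with simulated participant indices; the study's actual analyses were run in R):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical participant indices standing in for the N = 800 respondents.
ids = rng.permutation(800)

# Random half-split: one half is the EFA training set, the other the CFA test set.
efa_half, cfa_half = ids[:400], ids[400:]
```

Because the permutation is drawn once, no participant can contribute to both the exploratory and the confirmatory analysis, which is what makes the CFA an out-of-sample check.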

| Exploratory factor analysis
Prior to the analysis, one item was removed because of a semantic error in the survey. A correlation matrix of the remaining 34 items was screened to identify items that were poorly correlated with the others, or items that were highly correlated and generating multicollinearity issues. Eight items were excluded for showing weak item-total correlations (r < .30). Two other items were excluded for presenting high correlations (r > .65) and redundant content in relation to other items.
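Corrected item-total correlations of the kind used for this screening can be sketched as follows (a Python illustration on simulated data, not the EMS responses; the r < .30 cutoff mirrors the rule above):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data: nine items driven by one common factor, plus one
# pure-noise item (item 9) that should fail the item-total screen.
f = rng.normal(size=(300, 1))
X = 0.7 * f + rng.normal(scale=0.7, size=(300, 10))
X[:, 9] = rng.normal(size=300)

def corrected_item_total(X):
    """Correlate each item with the sum of the remaining items."""
    total = X.sum(axis=1)
    return np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1]
                     for j in range(X.shape[1])])

itc = corrected_item_total(X)
weak_items = np.where(itc < .30)[0]  # candidates for exclusion, as with r < .30
```

Subtracting the item from its own total ("corrected" correlation) avoids inflating the correlation by the item's presence in the sum.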
Diagnostic tests were performed on the remaining 24 items to examine the assumptions for EFA. Data gathering for the metamemory measures was performed in an online setting with forced responses, so there were no missing responses. Graphical inspection and significant Shapiro-Wilk tests for all items indicated univariate nonnormality, with skewness ranging from −0.87 to +0.89 and kurtosis ranging from −1.01 to +0.32. This observation was supported by a statistically significant Mardia's test, indicating that the assumption of multivariate normality was violated. Therefore, a weighted least squares extraction method was used for the EFA, which provides standard errors and tests of model fit that are robust to the nonnormality of the data.
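The univariate diagnostics can be illustrated with the standard moment-based formulas (a Python sketch on a simulated 7-point item; the study itself relied on graphical inspection, Shapiro-Wilk, and Mardia's tests in R):

```python
import numpy as np

def skewness(x):
    """Third standardized moment."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

def excess_kurtosis(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return (z ** 4).mean() - 3.0

rng = np.random.default_rng(7)
# A simulated 7-point Likert item piled up toward "agree": clipping at the
# scale ceiling produces the kind of skew the EMS items showed.
item = np.clip(np.round(rng.normal(5.5, 1.5, size=800)), 1, 7)

sk, ku = skewness(item), excess_kurtosis(item)
```

Marked departures of these moments from 0, as in the ranges reported above, are what motivate a nonnormality-robust extraction method over ordinary maximum likelihood.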
The items showed good factorability (Kaiser-Meyer-Olkin test = 0.89 and significant Bartlett's test) and did not present multicollinearity or singularity issues (determinant >0.00001).
Parallel analysis and scree plots were used as factor retention criteria and suggested the presence of four factors. A four-factor solution was extracted using oblimin rotation to allow for correlations between the factors. This solution revealed four distinguishable factors, but one factor comprised only four items, which appeared to be related to memory development over time. These items presented high cross-loadings with two of the other factors in the solution, indicating that a four-factor solution might not be robust. We therefore proceeded with the extraction of a three-factor solution using oblimin rotation.
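Parallel analysis, one of the retention criteria used here, compares the observed correlation-matrix eigenvalues with eigenvalues obtained from random data of the same dimensions (Horn's method). A compact sketch on simulated two-factor data (Python; illustrative only, not the EMS analysis script):

```python
import numpy as np

def parallel_analysis(X, n_sims=100, seed=0):
    """Retain factors whose observed eigenvalue exceeds the mean
    eigenvalue of random data with the same n and number of items."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    obs = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]
    sims = np.empty((n_sims, p))
    for s in range(n_sims):
        R = np.corrcoef(rng.normal(size=(n, p)), rowvar=False)
        sims[s] = np.sort(np.linalg.eigvalsh(R))[::-1]
    return int((obs > sims.mean(axis=0)).sum())

# Illustrative data: two independent factors, each driving four items.
rng = np.random.default_rng(3)
f = rng.normal(size=(400, 2))
X = np.hstack([0.8 * f[:, [0]] + rng.normal(scale=0.6, size=(400, 4)),
               0.8 * f[:, [1]] + rng.normal(scale=0.6, size=(400, 4))])
n_factors = parallel_analysis(X)
```

The logic is that eigenvalues exceeding what pure sampling noise produces indicate genuine common variance worth retaining as a factor.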
Examining the pattern matrix, we decided to exclude one item from the first factor for high cross-loadings and content that was dissonant with the other items (i.e., "People are generally good at remembering unfamiliar faces"). The same three-factor extraction was then repeated on the remaining 23 items (the pattern matrix for this solution is presented in Table 1). Items had high loadings on their respective factor, with no cross-loadings higher than .30. We termed the three factors memory contentment (10 items explaining 19% of the total scale variance), memory discontentment (eight items explaining 15% of the variance), and memory strategies (five items explaining 10% of the variance). The memory contentment factor combined items related to positive self-perception of memory ability, including keywords such as "satisfied," "confident," and "better." The memory discontentment factor combined items related to negative self-perception of memory ability, including keywords such as "trouble" and "worse." The memory strategies factor combined items related to the use of memory strategies in the context of person identification and could be defined as the extent to which an individual adopts strategies to better recognize someone in the future. Reliability of the factors was examined using omega coefficients instead of alpha, given that the assumptions for alpha are rarely met in psychometric research (Dunn, Baguley, & Brunsden, 2014).
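For a single factor with standardized loadings, omega can be computed directly from the loadings: the variance the common factor contributes to the total score, over that plus the unique variances. A brief sketch (Python; the loadings below are hypothetical, not the EMS estimates):

```python
import numpy as np

def omega_total(loadings):
    """McDonald's omega for one factor: share of total-score variance
    attributable to the common factor, assuming standardized items."""
    loadings = np.asarray(loadings, dtype=float)
    common = loadings.sum() ** 2          # variance due to the common factor
    error = (1.0 - loadings ** 2).sum()   # unique variances of standardized items
    return common / (common + error)

# Hypothetical standardized loadings for a five-item factor.
omega = omega_total([0.7, 0.65, 0.6, 0.72, 0.55])
```

Unlike alpha, this estimate does not assume equal loadings (tau-equivalence), which is the assumption Dunn et al. note is rarely met.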

| Confirmatory factor analysis
The purpose of the subsequent analysis was to confirm the factor structure for the 23-item EMS on a separate subset of our sample.
The results from the EFA indicated that a three-factor solution was the most appropriate to describe the EMS. A two-factor structure was also submitted for analysis as a plausible competing model for comparing fit indices. This two-factor solution was fitted to further examine whether the contentment and discontentment factors in the three-factor solution emerged due to phrasing method rather than the constructs the factors represent (Podsakoff, MacKenzie, & Podsakoff, 2012). Confirmatory factor analyses were conducted to test both models. Goodness of fit was evaluated using the robust root mean square error of approximation (RMSEA) and its 90% confidence interval, the robust comparative fit index (CFI), the robust Tucker-Lewis index (TLI), and the expected cross-validation index. These fit indices provide different types of information (i.e., absolute fit, fit adjusting for model parsimony, and fit relative to a null model), and when combined, they provide a reliable and conservative evaluation of model fit (Schreiber, Nora, Stage, Barlow, & King, 2006). The chi-square test is reported, but not relied upon to evaluate model fit due to its oversensitivity to sample size and the fact that it tests for perfect fit. The evaluation of the models was based on (a) conventional criteria for good model fit (RMSEA < .08, CFI > .90, TLI > .90, smallest expected cross-validation index) and (b) the interpretability of the solution (i.e., the comprehensibility of the factors on a conceptual level).

(Table 1 notes: Factor loadings higher than .40 are presented in bold. Abbreviation: ITC, item-total correlations.)
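The decision rule combining these conventional cutoffs can be made explicit. A small sketch (Python; the index values below are hypothetical placeholders, not the fitted values reported in Figure 1):

```python
# Conventional good-fit criteria as used above: RMSEA < .08, CFI > .90, TLI > .90.
def acceptable_fit(rmsea, cfi, tli):
    return rmsea < .08 and cfi > .90 and tli > .90

# Hypothetical fit indices for two competing models (illustrative values only).
models = {
    "two_factor":   {"rmsea": .11, "cfi": .84, "tli": .82},
    "three_factor": {"rmsea": .07, "cfi": .93, "tli": .92},
}
verdict = {name: acceptable_fit(**fit) for name, fit in models.items()}
```

In practice the rule is applied alongside the expected cross-validation index (smaller is better) and the conceptual interpretability of the factors, not mechanically.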
Diagnostic tests were performed on the 23 items to examine the assumptions for CFA and indicated that the assumption of multivariate normality was violated. Therefore, we estimated the CFA parameters using maximum likelihood estimation with robust standard errors, which provides tests of model fit that are robust to the nonnormality of the data (Finney & DiStefano, 2013). Figure 1 presents the model specification and goodness-of-fit indices for the three-factor and two-factor models. The fit indices suggested that the three-factor solution had a better fit than the two-factor solution. However, the three-factor solution did not fit the data particularly well (e.g., RMSEA > 0.08, CFI < 0.90, and TLI < 0.90). In an exploratory approach, we revised the three-factor model by evaluating its modification indices, adopting only theoretically sound modifications to avoid overspecification of the model. Following this approach, we added correlated errors between Items 6 and 8 and between Items 16 and 18. These modifications were based on the content of the items, which seems closely related to memory development over time (e.g., "As I age, I find my ability to remember faces is getting better"). The revised model resulted in an acceptable fit to the data (see Figure 1). EMS-Contentment and EMS-Discontentment were not strongly related, and a model aggregating both factors into a single memory contentment factor presented poor fit to the data in this study. Furthermore, eyewitness memory contentment was positively related to the use of strategies for person identification (r = .45), but this relation was not observed for eyewitness memory discontentment (r = .04).

| Convergent and discriminant validity
It may be the case that individuals with higher contentment with their own memories seek additional strategies to maintain performance or that adopting strategies to better recognize someone results in higher satisfaction with one's memory capacity (Meinhardt, Persike, & Meinhardt, 2014). These findings indicate that, at least in part, contentment and discontentment with one's own capacity for face and person recognition may represent independent constructs rather than opposite ends of the same spectrum. The EMS-Strategies factor had only small to moderate correlations with self-rated contentment and ability for general memory capacity and memory trust. This divergent correlation pattern seems related to the fact that the EMS-Strategies items focus specifically on the use of memory strategies to encode and remember faces, which appears to be somewhat independent from the use of strategies and self-appraisal for general memory.
Regarding discriminant validity, part of the nonshared variance between the EMS and the other scales may be due to differences in content and item format. The EMS focuses specifically on memory for faces and person identification, whereas the other measures have a broader scope of items related to different memory domains. Contemporary memory models often consider that memory is composed of relatively independent domains (Repovs & Baddeley, 2006), and there is some evidence that people have distinct self-perceived capacity for different memory domains (Tonković & Vranić, 2011). In terms of item format, most items in the EMS are answered with respect to present contentment with memory, whereas items in the SSMQ, for example, are answered with respect to memory development over time (e.g., "better than ever before"). The SSMQ's focus on memory development may be especially appropriate in clinical contexts, where changes in memory perception can indicate the advancement of medical conditions (Mitchell, 2008).
Due to space and focus, we report in the current paper only the development and initial validation of the EMS. For each unit increase in the EMS-Contentment score, the odds of making a correct identification increased by a factor of 1.41 (p < .001), and the odds of making a false identification decreased by a factor of 0.79 (p = .002). In both studies, it was also observed that EMS-Contentment and EMS-Discontentment were significant predictors of identification confidence, suggesting that expressions of confidence are partially influenced by self-ratings of face recognition ability. Taken together, these studies provide initial evidence for the content validity and predictive validity of the EMS.
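Odds ratios of this kind translate into probabilities via the logistic link. A worked sketch (Python; the 1.41 odds ratio is the reported value, while the 50% baseline probability is an arbitrary illustration):

```python
import math

odds_ratio = 1.41                # reported odds ratio per unit of EMS-Contentment
beta = math.log(odds_ratio)      # the corresponding coefficient on the logit scale

def updated_probability(p_baseline, units):
    """Probability after shifting the predictor by `units`,
    given a baseline probability p_baseline."""
    odds = p_baseline / (1 - p_baseline)
    odds *= odds_ratio ** units
    return odds / (1 + odds)

# A hypothetical 50% baseline chance of a correct identification after
# a one-unit increase in EMS-Contentment.
p1 = updated_probability(0.50, 1)
```

Note that the change in probability depends on the baseline: the same odds ratio moves a 50% baseline more than a 10% or 90% one, which is why odds ratios rather than probability differences are reported.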
The EMS fills an important gap in the literature on face recognition and eyewitness testimony and might benefit a variety of research areas. Eyewitnesses in criminal cases, for example, can provide unique evidence that may help solve investigations, point toward primary suspects, or potentially identify a perpetrator (Benton, Ross, Bradshaw, Thomas, & Bradshaw, 2006). However, eyewitness memory is malleable and susceptible to contamination, which may impair investigations or, in more severe cases, contribute to wrongful convictions. Discriminating accurate from inaccurate eyewitnesses is a challenging issue, but some postdictors of eyewitness identification accuracy have been identified, such as decision time during identifications (Sauer, Brewer, & Wells, 2008), self-reported decision-making processes (Sauerland & Sporer, 2007), and early statements of confidence (Wixted & Wells, 2017).
Metamemory judgements and individual differences in face recognition capacity may also relate to eyewitness performance, but this hypothesis has remained relatively unexplored. Some studies have suggested that people have only moderate insight into their face recognition and face perception abilities (Bobak et al., 2019), but expressions of confidence may have a stronger relation to self-perceived memory ability. This is important because confidence statements are often used to discriminate accurate from inaccurate witnesses, yet little is known about whether confidence statements are affected by individual differences (Leippe & Eisenstadt, 2014). Research adopting self-report instruments of face recognition capacity such as the EMS could help clarify the relationship between past experiences with memory and confidence judgements.
Research on prosopagnosia and superrecognizers could also benefit from the use of self-rated measures of face memory capacity. From a theoretical perspective, it is not clear whether prosopagnosia and superior face recognition represent opposite ends of the same continuum of face memory ability (Bobak et al., 2019). Comparing self-reported scores of face recognition capacity of patients with prosopagnosia and superrecognizers could help clarify whether objective memory capacity has a linear relation with subjective memory experience. From an applied perspective, superrecognizers are considered particularly valuable to national security agencies and border control due to their extraordinary ability to match and recognize faces from video footage or lineups (Bobak, Dowsett, & Bate, 2016). Valid face memory self-report questionnaires could be used as screening tools among many participants prior to other behavioural testing, helping to identify individuals with remarkable face memory skill.
The EMS is a brief, easily administrable metamemory questionnaire focusing on face recognition and eyewitness contexts. It was developed with a large and relatively heterogeneous sample and showed good psychometric properties, although further validation evidence is desirable. Self-report assessments of memory add a unique element to the assessment of memory performance that cannot be obtained from objective memory testing alone. Self-report tools allow the measurement of overarching memory issues and experiences rather than artificial laboratory-based memory problems, providing insights into an individual's memory functioning. Consequently, such tools have an important role in research and theory development regarding how memory performance relates to one's beliefs about, and previous experiences with, one's own memory.