Faking on a Situational Judgment Test in a Medical School Selection Setting: Effect of Different Scoring Methods?

We examined the occurrence of faking on a rating situational judgment test (SJT) by comparing SJT scores and response styles of the same individuals across two naturally occurring situations. An SJT for medical school selection was administered twice to the same group of applicants (N = 317) under low-stakes (T1) and high-stakes (T2) circumstances. The SJT was scored using three different methods that were differentially affected by response tendencies. Applicants used significantly more extreme responding on T2 than T1. Faking (higher SJT score on T2) was only observed for scoring methods that controlled for response tendencies. Scoring methods that do not control for response tendencies introduce systematic error into the SJT score, which may lead to inaccurate conclusions about the existence of faking.


| INTRODUCTION
The predictive validity evidence on situational judgment tests (SJTs) in personnel selection stimulated the introduction of SJTs in educational selection settings. SJTs instruct individuals to judge the appropriateness of potential response options to challenging situations (Weekley & Ployhart, 2006). These dilemma-like situations take place in the context of the organization or the educational program for which an individual applies. Generally, SJTs are used to measure noncognitive attributes. SJTs demonstrate sufficient criterion-related validity in personnel selection (McDaniel, Hartman, Whetzel, & Grubb, 2007) and educational admissions (Lievens, Buyse, & Sackett, 2005a). Additionally, SJTs have incremental validity over traditional cognitive predictors such as high-school grade point average (GPA) (Schmitt et al., 2009). Finally, SJT scores show smaller socioeconomic group differences than traditional predictors (Lievens, Patterson, Corstjens, Martin, & Nicholson, 2016).
Parallel to other noncognitive measures, concerns have been raised about faking on SJTs (Weekley & Ployhart, 2006). Faking is defined as conscious response distortion in order to make a favorable impression and to increase the chance of getting hired (Goffin & Boyd, 2009). Concerns about faking on noncognitive measures are a consequence of the use of self-report formats that are prone to faking.

| Faking on personality measures
Faking in high-stakes selection settings has been extensively investigated on personality measures. Considerable research has been devoted to answering the research questions "can people fake?" and "do people fake?" (Cook, 2016). Regarding the first question, studies that instructed respondents to deliberately "fake good" demonstrated that most people can increase their personality scores (McFarland & Ryan, 2000; Viswesvaran & Ones, 1999). Regarding the second question, studies comparing the personality test scores of incumbents and applicants found more desirable scores for applicants, indicating that people do fake in high-stakes settings (Birkeland, Manson, Kisamore, Brannick, & Smith, 2006; Rosse, Stecher, Miller, & Levin, 1998). Both questions have thus been answered affirmatively for personality measures.

| Consequences of faking
Although researchers reached considerable consensus with respect to people's ability and willingness to fake, differing perspectives exist on the influence of faking on the construct and predictive validity of personality measures. One perspective considers the influence of faking on the predictive validity of personality measures to be negligible, calling the concerns about social desirability in the use of personality tests a "red herring" (Ones, Viswesvaran, & Reiss, 1996). Other studies have indicated that faking does not affect the construct validity (Ones & Viswesvaran, 1998) or the factor structure of personality measures (Hogan, Barrett, & Hogan, 2007). Additionally, Ingold, Kleinmann, König, and Melchers (2015) demonstrated a positive relation between faking and job performance and thus proposed that faking should be viewed as socially adequate behavior. In contrast, the other perspective regards faking as detrimental to the use of personality measures for selection purposes, because faking affects the rank order of the applicants and reduces the quality of hiring decisions (Donovan et al., 2014; Griffith, Chmielowski, & Yoshita, 2007). In addition, concerns have been raised about the adverse effect of faking on the construct validity (Rosse et al., 1998) and criterion-related validity (Morgeson et al., 2007; Mueller-Hanson, Heggestad, & Thornton, 2003) of personality test scores. So far, no consensus has been reached regarding the consequences of faking on personality measures.

| Measures against faking
Several studies investigated approaches to deal with faking on personality measures. First, warning respondents about the potential identification and consequences of faking resulted in lower personality scores than not warning respondents. However, warnings may also reduce the convergent validity of a personality measure (Robson, Jones, & Abraham, 2007). Second, faking has been tackled by correcting personality test scores for the score on a faking measure (e.g., a social desirability scale) (Goffin & Christiansen, 2003; Schmitt & Oswald, 2006). The success of this approach is often limited due to the poor construct validity of faking measures, as the variance in faking measures is often not only explained by faking, but also by personality test scores and the criterion (Cook, 2016; Griffith & Peterson, 2008; Schmitt & Oswald, 2006). Finally, another approach to reduce the influence of faking is the use of forced-choice response formats, forcing respondents to choose between equally desirable responses (Jackson, Wroblewski, & Ashton, 2000; O'Neill et al., 2017).
A disadvantage of forced-choice response formats is their ipsative nature, which impedes the comparison of applicants because the total score is equal for each applicant (Heggestad, Morrison, Reeve, & McCloy, 2006). However, one can perform interindividual comparisons through partially ipsative measurement using scoring formats that allow total score variability. To summarize, research on the effectiveness of various approaches to deal with faking on personality measures has yielded mixed results.

| Faking on SJTs
Unlike the extensive research on faking on personality tests, the number of published studies on faking on SJTs is limited (Table 1). As with personality tests, lab studies showed that individuals are able to obtain higher SJT scores if they are instructed to fake (Lievens & Peeters, 2008; Nguyen, Biderman, & McDaniel, 2005; Oostrom, Köbis, Ronay, & Cremers, 2017; Peeters & Lievens, 2005). The size of the faking effects seems to depend on the order in which the fake and honest conditions are presented to the respondent. On a would-do SJT (i.e., which asks respondents what they would actually do), Nguyen et al. (2005) found a larger effect size when respondents received the instructions to respond honestly first (d = 0.34) than when respondents received the faking instructions first (d = 0.15). In contrast, Oostrom et al. (2017) found a larger faking effect on a would-do SJT when faking instructions preceded honest instructions (d = 1.09) than vice versa (d = 0.82).
A faking effect on a should-do SJT (i.e., which asks respondents what they should do) was only found in the fake-first condition (d = 0.45), whereas the reverse (i.e., higher SJT scores under honest instructions than under faking instructions) was found in the honest-first condition (d = −0.34) (Nguyen et al., 2005). Oostrom et al. (2017) found a faking effect in both conditions, but the effect size was much smaller in the honest-first condition (d = 0.11) than in the fake-first condition (d = 1.31). Field studies comparing existing groups of applicants and nonapplicants showed mixed results, with one study reporting better SJT performance for applicants (Ployhart, Weekley, Holtz, & Kemp, 2003) and another study reporting better SJT performance for nonapplicants (Weekley, Ployhart, & Harold, 2004).
Several faking studies on SJTs attempted to reduce faking (Lievens & Peeters, 2008; Oostrom et al., 2017). The most common approach is asking individuals what they should do (i.e., knowledge instructions) as opposed to asking individuals what they would actually do (i.e., behavioral tendency instructions) (Nguyen et al., 2005).

| Present study
This study examined the fakability of an SJT in a medical school selection setting. Prior studies in the medical education domain indicated that applicants showed more response distortion on personality tests than nonapplicants (Anglim, Bozic, Little, & Lievens, 2018; Griffin & Wilson, 2012). The current study investigates whether applicants also distort their responses to an SJT. Prior faking research on SJTs is extended in three ways.
First, unlike the SJT studies mentioned in Table 1, this study used a within-subjects design without different instructional sets (i.e., a field study). Although previous studies have used within-subjects designs in the educational field to examine faking on personality measures (Griffin & Wilson, 2012; Niessen, Meijer, & Tendeiro, 2017), this is one of the first field studies using a within-subjects design to examine faking on an SJT. As mentioned above, the disadvantage of between-subjects designs is the difficulty of determining whether group differences are caused by faking or by existing individual differences (e.g., in job experience), especially in field studies where random assignment to applicant and nonapplicant groups is not possible. Within-subjects designs control for these individual differences. Additionally, lab studies examine whether applicants can fake, but not whether applicants actually do fake in real-life high-stakes selection settings. The present field study investigated the actual occurrence of faking by comparing the SJT scores of the same individuals across two naturally occurring situations (i.e., low-stakes and high-stakes). Although the combination of a within-subjects design and a field study extends previous faking research on SJTs, the real-life setting of the present study did not allow counterbalancing the order of the low-stakes and high-stakes settings. Earlier exposure to an identical or comparable test may cause retest effects (Lievens, Buyse, & Sackett, 2005b). Retest effects may reflect faking, but may, for example, also encompass practice effects due to familiarization with the test format (Hooper, Cullen, & Sackett, 2006). The present study examined retest effects using a between-subjects analysis comparing the SJT scores of first-time test takers to those of second-time test takers (Lievens et al., 2005b).
Second, this study investigated differences in faking between desirable and undesirable response options, because prior research proposed that there might be differences in the extent to which positive traits are exaggerated and unflattering traits are de-emphasized (Goffin & Boyd, 2009; Stemler et al., 2016). A survey regarding faking behaviors during job applications revealed that the proportion of respondents who reported de-emphasizing negative traits was larger than the proportion who reported exaggerating positive characteristics (Donovan, Dwight, & Hurtz, 2003). Accordingly, we hypothesized the following:

Hypothesis 1
The influence of faking on SJT scores will be more pronounced for undesirable than for desirable response options.
Third, the present study examined faking on an SJT that uses a rating format as opposed to a pick-one or pick-two format (e.g., most and least likely to perform). To our knowledge, no prior faking studies have been published on a rating SJT. Scores on a rating SJT may be affected by response tendencies in the use of the rating scale, such as extreme responding. Based on previous findings, we formulated the following hypotheses.

Hypothesis 2b
More extreme responding is related to a larger score difference between low-stakes and high-stakes settings for a scoring method that is more strongly affected by response tendencies (henceforth: a scoring method that does not control for response tendencies).
Finally, as an additional exploratory test, we examined whether a scoring method controlling for response tendencies had stronger construct validity than a scoring method not controlling for response tendencies (Weng et al., 2018). We expect that the systematic error introduced by response tendencies will lower the construct validity of scoring methods not controlling for response tendencies.

| Context and procedure
This study was conducted at a Dutch medical school, where the selection was based on pre-university GPA, extracurricular activities and three cognitive tests on mathematics, logical reasoning, and a video lecture. Three months before the selection testing day, applicants had the opportunity to participate in a selection orientation day, where they received information about the selection procedure.
Participation in the selection orientation day was voluntary and free of charge. The same SJT scenarios were administered twice: on the 2017 selection orientation day (T1) and on the 2017 selection testing day (T2), with a three-month interval. On both occasions, the SJT was administered for research purposes only and participation was voluntary. However, the stakes were higher on T2, because the SJT was administered alongside the admission tests whose scores did determine the selection outcome. Because the selection context was more obviously present on T2, it was expected that applicants would be more motivated to fake on T2. Applicants were informed that their answers would not influence the selection decision, because ethical regulations precluded misleading the applicants about the true purpose of the SJT administration. Applicants were asked to sign an informed consent form before participation. The data in this study were processed confidentially.

| Situational judgment test
The SJT was designed to measure integrity and was developed using a combination of critical incident interviews and two established theoretical models. The first model comprised the honesty-humility dimension of the HEXACO personality inventory. Unlike the well-known Big Five personality dimensions, the HEXACO assumes six dimensions of personality: honesty-humility, emotionality, extraversion, agreeableness, conscientiousness, and openness to experience (Lee & Ashton, 2004). The sixth factor, honesty-humility, is defined as "sincere, fair and unassuming versus sly, greedy and pretentious" (p. 1324) and is positively related to integrity (Lee, Ashton, & de Vries, 2005). The desirable response options of the SJT were written based on three of the four facets of the honesty-humility dimension (i.e., sincerity, fairness, and modesty). The fourth facet, greed avoidance, was not used because this facet was considered less relevant for medical school applicants. The second model comprised the cognitive distortions measured by the How I Think questionnaire (Barriga & Gibbs, 1996). Self-serving cognitive distortions are inaccurate thinking styles that may lead to the violation of social norms (Nas, Brugman, & Koops, 2008); the undesirable response options of the SJT were based on these cognitive distortions. A correlation table shows the intercorrelations between the SJT scores and the other variables collected during the selection procedure.

| Scoring methods
The SJT was scored using three methods that were differently affected by response tendencies in the use of a rating scale. First, the raw consensus scoring method calculated the absolute distance on the rating scale between an applicant's judgment and the average judgment of a group of subject matter experts (SMEs). The SMEs were residents in training to become general practitioners. The size of the SME group ranged between 18 and 23. The characteristics of the SME sample were described in De Leng et al. (2018). Distances were summed across the response options to obtain a scale score. Scale scores based on raw consensus were reverse coded (i.e., subtracted from the maximum possible score), for higher scores to indicate better SJT performance. For raw consensus scoring methods, extreme responding generally relates to lower SJT scores because more extreme ratings result in larger deviations from the scoring key (Weng et al., 2018). Second, the standardized consensus scoring method also calculated the absolute distance between the applicant's judgment and the average SMEs' judgment, but first performed a within-person z standardization such that each respondent has a mean of zero and a standard deviation of one across the items, thereby controlling for response tendencies. Third, the dichotomous consensus scoring method awarded a point when the applicant's judgment fell on the same half of the rating scale as the average SMEs' judgment, and no point otherwise.
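To make the three scoring methods concrete, the sketch below shows one possible implementation in Python. It is a minimal illustration, not the authors' code: the function names, the sign convention of the standardized score, the z-standardization of the SME means, and the midpoint used for the dichotomous method are assumptions chosen for the example.

```python
import numpy as np

def raw_consensus(ratings, sme_mean, scale_max=6):
    """Raw consensus: absolute distances to the SME mean, summed and
    reverse coded so that higher scores indicate better SJT performance."""
    ratings = np.asarray(ratings, dtype=float)
    distances = np.abs(ratings - sme_mean)
    max_total = (scale_max - 1) * ratings.shape[-1]  # largest possible summed distance
    return max_total - distances.sum(axis=-1)

def standardized_consensus(ratings, sme_mean):
    """Standardized consensus: ratings are z-standardized within person before
    computing distances, which removes individual response tendencies such as
    extreme responding (assumed: the SME means are standardized the same way,
    and the summed distance is negated so higher values mean closer agreement)."""
    ratings = np.asarray(ratings, dtype=float)
    z = (ratings - ratings.mean(axis=-1, keepdims=True)) / ratings.std(axis=-1, keepdims=True)
    z_sme = (sme_mean - sme_mean.mean()) / sme_mean.std()
    return -np.abs(z - z_sme).sum(axis=-1)

def dichotomous_consensus(ratings, sme_mean, midpoint=3.5):
    """Dichotomous consensus (assumed implementation): one point per response
    option rated on the same half of the scale as the SME mean."""
    ratings = np.asarray(ratings, dtype=float)
    same_half = (ratings > midpoint) == (sme_mean > midpoint)
    return same_half.sum(axis=-1)

# Hypothetical example: one applicant rating five response options on a 6-point scale
applicant = np.array([6, 1, 5, 2, 6])
sme = np.array([4.8, 2.1, 4.5, 2.6, 5.2])
print(raw_consensus(applicant, sme),
      standardized_consensus(applicant, sme),
      dichotomous_consensus(applicant, sme))
```

Under this sketch, an applicant who shifts toward the scale extremes loses points on the raw consensus score but not necessarily on the standardized or dichotomous scores, which is the distinction the later analyses rely on.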

| Personality
The HEXACO simplified personality inventory (HEXACO-SPI) (De Vries & Born, 2013) was administered online after the selection testing day (T2), but before applicants received the admission decision.
The honesty-humility dimension of the HEXACO-SPI was used to examine the construct validity of the integrity SJT. The honesty-humility subscale consisted of 16 items (e.g., "I find it hard to lie"), which were judged on a five-point rating scale.

| Mean differences
The mean raw consensus SJT score was significantly lower (worse) on T2 than on T1.

| Association between mean differences and extreme responding
The association between the mean score differences and extreme responding was examined by correlating the T1-T2 difference in the percentage of extreme rating scale points (ERS difference) with the T1-T2 difference in SJT scores.
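As an illustration of this analysis, the following Python sketch computes the percentage of extreme scale points per respondent and correlates the T1-T2 change in extreme responding with the T1-T2 change in SJT scores. Variable names, the 6-point scale, and the choice of which categories count as extreme are assumptions for the example, not details taken from the study.

```python
import numpy as np
from scipy.stats import pearsonr

def percent_extreme(ratings, low=1, high=6):
    """Percentage of a respondent's ratings on the extreme scale points
    (assumed here to be the lowest and highest categories of a 6-point scale)."""
    ratings = np.asarray(ratings)
    return 100 * np.mean((ratings == low) | (ratings == high), axis=-1)

def ers_score_association(ratings_t1, ratings_t2, sjt_t1, sjt_t2):
    """Correlate the T1-T2 change in extreme responding with the
    T1-T2 change in SJT scores (any scoring method)."""
    ers_diff = percent_extreme(ratings_t1) - percent_extreme(ratings_t2)
    score_diff = np.asarray(sjt_t1) - np.asarray(sjt_t2)
    return pearsonr(ers_diff, score_diff)  # returns (r, p-value)
```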

| Construct validity
As expected, the correlation of the raw consensus SJT scores with honesty-humility was smaller than the correlation of the standardized and dichotomous SJT scores with honesty-humility (Table 4).
However, the standardized consensus score based on all response options had a significant and positive correlation with honesty-humility on T1 (r = 0.17, p = 0.029) and T2 (r = 0.24, p = 0.001). For the dichotomous consensus scoring method, the overall SJT score was also significantly and positively correlated with honesty-humility, but only on T2 (r = 0.25, p = 0.001; cf. De Leng et al., 2018; Elliott et al., 2011; Stemler et al., 2016).

| Retest effects
Finally, retest effects were investigated as an alternative or complementary explanation for the T1-T2 differences because the real-life setting of the present field study prevented counterbalancing the order of the low and high-stakes settings. Possible retest effects were examined by comparing the T2 SJT score for repeat test takers (i.e., applicants who also participated on T1) and novel test takers (i.e., applicants who did not participate on T1) (cf. Lievens et al., 2005b). For the raw consensus scoring method, repeat test takers did not significantly differ from novel test takers in the SJT score on T2 (Table 5).
For the standardized consensus scoring method, a significant difference was found for the overall SJT score (t(589) = −3.28, p = 0.001). In sum, the between-subjects analysis resulted in no retest effects when using the raw consensus score and in small retest effects when using the standardized and dichotomous consensus scores. Thus, retest effects, whether faking or practice, were only detected for scoring methods that controlled for response tendencies.
The SPSS syntax for the above analyses can be found in Supporting Information 4.
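For readers who do not use SPSS, the between-subjects retest comparison can be approximated with the sketch below. This is an illustrative Python equivalent under simple assumptions (pooled-SD Cohen's d, illustrative function and variable names), not the published syntax.

```python
import numpy as np
from scipy.stats import ttest_ind

def retest_comparison(scores_repeat, scores_novel):
    """Compare T2 SJT scores of repeat test takers (also present on T1)
    with novel test takers: independent-samples t-test plus Cohen's d."""
    a = np.asarray(scores_repeat, dtype=float)
    b = np.asarray(scores_novel, dtype=float)
    t, p = ttest_ind(a, b)
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2))
    d = (a.mean() - b.mean()) / pooled_sd
    return t, p, d
```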

| Faking
Because the standardized and dichotomous SJT scores were not affected by systematic error due to response tendencies, we will focus on these SJT scores in the discussion below. The higher SJT scores on T2 than T1 seem to demonstrate a small faking effect for the standardized (d = −0.25) and dichotomous (d = −0.28) scoring methods, indicating that on the same SJT, the same applicants obtained a higher score in a high-stakes setting than in a low-stakes setting.
The effect size is smaller than most effect sizes reported in Table 1.
Unfortunately, a direct comparison with these published effect sizes is problematic because none of the previous SJT faking studies used a within-subjects design in the field (i.e., not using different instructional sets). Consequently, dissimilar effect sizes are likely caused by differences in study design and study type. Between-subjects designs may produce larger faking effects than within-subjects designs if the compared groups also differ on other variables, for example, job experience (Ployhart et al., 2003; Vasilopoulos, Reilly, & Leaman, 2000). Additionally, lab studies may generate larger effect sizes than field studies, because different instructional sets involve a stronger intervention (Birkeland et al., 2006). Another possible explanation for the smaller faking effect found in this study is that the integrity SJT used knowledge response instructions, whereas most previous SJT faking studies used behavioral tendency instructions. Two SJT faking studies that compared both response instructions (Nguyen et al., 2005; Oostrom et al., 2017) demonstrated that the faking effect is smaller for knowledge than for behavioral tendency instructions. McDaniel et al. (2007) describe SJTs with knowledge instructions as maximal performance tests and SJTs with behavioral tendency instructions as typical performance tests and argue that self-reports of typical behavior are more susceptible to faking than self-report predictors of knowledge. Our findings seem to support the lower susceptibility to faking of SJTs with knowledge instructions, but also indicate that knowledge instructions do not completely cancel out the faking effect, presumably because SJTs are not pure knowledge tests due to their noncognitive content.

| Desirable and undesirable response options
The T1-T2 increase in the SJT score based on desirable response options was significant for the standardized (d = −0.27) and dichotomous scoring methods, whereas the increase for undesirable response options was less pronounced. Contrary to Hypothesis 1, the faking effect was thus stronger for desirable than for undesirable response options. Additionally, the T1-T2 increase in extreme responding was larger for desirable (d = −0.50) than for undesirable response options. This pattern contrasts with the survey findings of Donovan et al. (2003), in which more respondents reported de-emphasizing negative traits than exaggerating positive characteristics. In other words, it is probable that respondents fake, consciously or unconsciously, on a survey regarding faking behaviors.
Socially desirable responding consists of intentional faking and unconscious self-deception (Paulhus & John, 1998). Desirable items are potentially more affected by self-deception than undesirable items.
An interesting avenue for future research is to unravel the influence of faking and self-deception on de-emphasizing negative traits and exaggerating positive traits. Additionally, an explanation for the stronger faking effects for desirable than undesirable response options might be found in self-discrepancy theory (Higgins, Roney, Crowe, & Hymes, 1994). Self-discrepancy theory holds that discrepancies between one's perceived actual self and one's desired self result in negative feelings (Higgins, Shah, & Friedman, 1997). The desired self may be characterized by aspirations and wishes (i.e., the ideal self) or by obligations and responsibilities (i.e., the ought self).
Individuals guided predominantly by the ideal self are more oriented toward approaching a desired end state, whereas individuals guided predominantly by the ought self are more oriented toward avoiding an undesired end state (Higgins et al., 1994). Applicants' responses to the SJT might have been more strongly affected by the ideal self than the ought self, possibly caused by characteristics of the selection context leading to self-enhancement. To our knowledge, no previous faking studies have referred to self-discrepancy theory, so more research is necessary to elucidate the influence of ideal and ought selves on faking positive and negative traits.

| Scoring methods
A rating SJT allowed us to examine faking through extreme responding. Extreme responding is unaffected by the scoring method of the SJT, which is useful because our findings indicate that conclusions about faking heavily depend on how an SJT is scored. Extreme responding occurred more often in a high-stakes than in a low-stakes setting, which is in line with previous faking research on personality measures (Levashina, Weekley, Roulin, & Hauck, 2014; Van Hooft & Born, 2012). For a traditional raw consensus scoring method, extreme responding is related to lower scores, because it creates more distance from the consensus judgment, which is often located near the midpoint of the rating scale (Weng et al., 2018). Consequently, one coaching strategy to improve the score on a rating SJT instructs respondents to avoid the extreme responses on the rating scale (Cullen, Sackett, & Lievens, 2006). Additionally, our results indicate that a raw consensus SJT score may have weaker construct validity than a standardized or dichotomous SJT score, which is in line with previous research demonstrating lower criterion-related validity for scoring methods that do not control for response tendencies (McDaniel et al., 2011; Weng et al., 2018). Response tendencies introduce systematic error in a rating SJT score, which may result in lower construct and criterion-related validity coefficients. In addition, findings indicate that systematic error caused by response tendencies may lead to inaccurate conclusions about faking on an SJT.
Hypothesis 2b, which stated that scoring methods that do not control for response tendencies are more strongly affected by a change in extreme responding than scoring methods that do control for response tendencies, was only confirmed for the dichotomous SJT score based on undesirable response options. Apparently, controlling for response tendencies within one test administration does not reduce the influence of a change in response tendencies across test administrations. Additionally, an explanation for the significant influence of extreme responding on the dichotomous SJT score might be that, for 11 out of 40 response options, the consensus judgment was located near the midpoint of the rating scale (i.e., between 2.5 and 4.5 on a 6-point rating scale). For these ambiguous midrange items, an applicant might be close to but on the "incorrect" side of the rating scale, yielding no points. More extreme responding in the high-stakes setting would shift the applicant's judgment to the "correct" half of the rating scale, producing a higher SJT score. Weng et al. (2018) showed that the dichotomous consensus scoring method is more appropriate for non-midrange items, supporting this potential explanation.
A last notable finding was that, for the standardized and dichotomous SJT scores, the construct validity was stronger on T2 than T1, possibly because applicants were familiarized with the SJT format on T2, which reduces construct-irrelevant variance due to unfamiliarity with the test format (Lievens et al., 2005b). Another explanation concerns the timing of the personality measure, which was administered after the selection testing day (T2) but before applicants received the admission decision. Applicants might have been motivated to fake on the personality measure, because admission was not yet certain. The stronger construct validity on T2 might, therefore, also be a result of overlapping variance caused by faking in both scores (i.e., SJT score on T2 and honesty-humility score). Finally, the larger correlation with honesty-humility on T2 might be caused by the stronger common frame of reference produced by the high-stakes selection context (Ones & Viswesvaran, 1998). Even though the stakes were lower on T1 than T2, some applicants might still have felt a tendency to fake. In contrast, a high-stakes setting may present a stronger frame of reference that is shared by all applicants. Ones and Viswesvaran (1998) emphasize the importance of standardizing the test administration to generate a common frame of reference and to enhance the reliability.
This explanation is supported by higher estimates of internal consistency reliability for the SJT score on T2 than T1. More research is necessary to examine which of these processes give rise to the stronger construct validity on T2 than T1.
Overall, each scoring method has its own pros and cons.

| Faking versus retest effect
Due to the real-life setting of this study, the order of the selection orientation day and selection testing day could not be counterbalanced. We examined the possibility of a retest effect as an alternative explanation by comparing the SJT scores of first-time and second-time test takers (cf. Lievens et al., 2005b). The significantly higher score for second-time test takers (d = 0.27) provides evidence for a small retest effect when using the standardized or dichotomous consensus scoring method, which corresponds to previous research on retest effects on SJTs (Dunlop, Morrison, & Cordery, 2011; Lievens et al., 2005b). Retest effects could represent faking, but could also represent a practice effect or actual improvement in the relevant construct (Hooper et al., 2006).

| Implications for future research and practice
First, we recommend that future investigations of faking or retest effects on rating SJTs use scoring methods that control for response tendencies. Examples of other scoring methods that control for response tendencies are mode consensus and proportion consensus (Weng et al., 2018). Second, research on the consequences of faking for the construct validity of personality measures should take into account the influence of response tendencies (i.e., extreme responding) and scoring methods. Our findings indicate that response tendencies and scoring methods might be contributing factors to the mixed evidence concerning the influence of faking on construct validity. Third, we advise researchers to make a distinction between desirable and undesirable response options, as this may affect the conclusions on SJT faking. The distinction between desirable and undesirable items can be based on empirical data (Stemler et al., 2016) or on a theoretical framework (De Leng et al., 2018).
Practitioners are also advised to use scoring methods that control for response tendencies and to include undesirable response options, because these modifications may increase the construct and criterion-related validity of the SJT.

| Limitations
This study is not without limitations. First, the main limitation of this study is the inability to rule out other possible sources of a retest effect. A between-subjects analysis comparing the SJT scores of novel and repeat test takers indicated a retest effect of similar size as the faking effect. Research on retest effects on noncognitive instruments has primarily interpreted these effects as a result of applicant faking (Van Iddekinge & Arnold, 2017). However, retest effects may have many different causes, such as practice effects, genuine improvement in the construct, reduction in test anxiety, or test familiarization (Lievens et al., 2005b; Van Iddekinge & Arnold, 2017).
Even though the retest effect found in the current study is likely produced by faking, as T1 and T2 were deliberately chosen to have substantial contextual differences, it is not feasible to exclude other potential sources of a retest effect. Future studies should use research designs that allow the separation of these different sources.
Second, the scoring methods used rely on the difference between a respondent's rating and the average rating across a group of SMEs. Difference scores, however, have several limitations, such as low reliability, reduced effect sizes, and loss of information from the separate component scores (Edwards, 2001). The limitations of the raw consensus scoring method were confirmed in the present study, as shown by the obscured faking or retest effect and the weak construct validity. The standardized and dichotomous consensus scoring methods solved some of the problems of the raw consensus scoring method. Nonetheless, future research is advised to examine polynomial regression methods as an alternative approach for scoring SJTs, because these methods provide a more direct solution to the problems of difference scores (Edwards, 2001; Kulas, 2013).
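For readers unfamiliar with the polynomial regression approach, the sketch below illustrates the general idea under simple assumptions: instead of collapsing the applicant rating and the SME mean into a single absolute difference, an outcome (e.g., a criterion or construct measure) is regressed on both components plus their quadratic and interaction terms (cf. Edwards, 2001). The choice of outcome, the unit of analysis, and all names are illustrative; this is not the procedure used in the present study.

```python
import numpy as np
import statsmodels.api as sm

def polynomial_regression_alternative(applicant_rating, sme_mean, outcome):
    """Regress an outcome on the applicant rating (A), the SME mean (S),
    and the higher-order terms A^2, A*S, and S^2, rather than on |A - S|.
    All inputs are 1-D arrays of equal length (one row per observation)."""
    a = np.asarray(applicant_rating, dtype=float)
    s = np.asarray(sme_mean, dtype=float)
    X = np.column_stack([a, s, a ** 2, a * s, s ** 2])
    X = sm.add_constant(X)
    return sm.OLS(np.asarray(outcome, dtype=float), X).fit()
```

Examining the fitted response surface, rather than a single difference coefficient, is what allows this approach to avoid the loss of information that Edwards (2001) attributes to difference scores.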
Third, due to the real-life setting of this study, we investigated faking using only one order of conditions: low-stakes on T1 and high-stakes on T2. Other within-subjects studies on faking mainly use the reversed order, that is, high-stakes among applicants on the first occasion and low-stakes among incumbents on the second occasion. We believe that the low-stakes-first order of the current study has some important benefits. First, because most T1 respondents were also present on T2, it is unlikely that our findings are affected by a restriction of range. Second, because our T2 respondents were not medical school incumbents, it is unlikely that T1-T2 score differences are caused by experience at medical school. The within-subjects field study by Ellingson et al. (2007) examined response distortion on a personality inventory using both orders and found a larger score change for the low-stakes-first condition than for the high-stakes-first condition. In contrast, within-subjects studies using directed-faking instructions demonstrated a faking effect on a should-do SJT for the fake-first condition, but not for the respond-honestly-first condition (Nguyen et al., 2005; Oostrom et al., 2017). A faking effect observed only in the fake-first condition was explained by respondents' tendency to respond deliberately differently during the second condition after they responded to the best of their ability during the first condition. The tendency to respond differently might be less strong in the current study, because no directed faking instructions were used and because of the longer time period between both conditions than in the previous studies. Nonetheless, these contrasting findings require more research on the effect of the order of the low- and high-stakes settings.
Fourth, during both test administrations, the SJT was administered for research purposes only, which might reduce the generalizability of our findings to real selection settings. However, Niessen et al. (2017) found large score differences on several noncognitive measures between a research and an admission context, even though applicants were informed that the noncognitive measures were not used for selection. Additionally, the difference in extreme responding indicated that the applicants responded differently on T2. The selection testing day, therefore, appears to be a sufficient proxy of a high-stakes situation.
Finally, it might be too simplistic to assume that faking is limited to extreme responding (König, Merz, & Trauffer, 2012). Moreover, prior research has demonstrated that response styles differ across individuals (Ziegler, 2015) and cultures (He, Bartram, Inceoglu, & Van de Vijver, 2014). Further research is required to examine how other response styles apart from extreme responding relate to faking.

CONFLICT OF INTEREST
The authors report no conflict of interest.