Determining influence, interaction and causality of contrast and sequence effects in objective structured clinical exams

Abstract

Introduction: Differential rater function over time (DRIFT) and contrast effects (examiners' scores biased away from the standard of preceding performances) both challenge the fairness of scoring in objective structured clinical exams (OSCEs). This is important because, under some circumstances, these effects could alter whether some candidates pass or fail assessments. Benefitting from experimental control, this study investigated the causality, operation and interaction of both effects simultaneously for the first time in an OSCE setting.

Methods: We used secondary analysis of data from an OSCE in which examiners scored embedded videos of student performances interspersed between live students. Embedded video position varied between examiners (early vs. late), whilst the standard of preceding performances varied naturally ('previous high' or 'previous low'). We examined linear relationships suggestive of DRIFT and contrast effects in all within-OSCE data before comparing the influence and interaction of the 'early' versus 'late' and 'previous high' versus 'previous low' conditions on embedded video scores.

Results: Analyses of linear relationships did not support the presence of DRIFT or contrast effects. Embedded videos were scored higher early (19.9 [19.4-20.5]) than late (18.6 [18.1-19.1], p < 0.001), but scores did not differ between the previous high and previous low conditions. The interaction term was non-significant.

Conclusions: In this instance, the small DRIFT effect we observed on embedded videos can be causally attributed to examiner behaviour. Contrast effects appear less ubiquitous than some prior research suggests. Possible mediators of these findings include the OSCE context, the detail of task specification, examiners' cognitive load and the distribution of learners' ability. As the operation of these effects appears to vary across contexts, further research is needed to determine the prevalence and mechanisms of contrast and DRIFT effects, so that assessments may be designed in ways that are likely to avoid their occurrence. Quality assurance should monitor for these contextually variable effects in order to ensure OSCE equivalence.


| INTRODUCTION
Ensuring that assessment scores fairly represent the performance of trainees remains a priority for assessment in health professionals' education. Although different philosophical 1 and epistemological 2 positions can be adopted to account for variability in assessors' judgements, 3 the field of assessor cognition has demonstrated influences that can contribute unhelpful variability or bias to assessment judgements, regardless of the adopted philosophical stance. 4 The influences 'differential rater function over time' (DRIFT) 5 and 'contrast effects' 6 are difficult to ascribe to the notion of 'meaningful difference' in experts' judgement 7 and consequently represent detrimental influences on assessors' judgements. Despite the potential implications of these effects for candidates and trainees, they remain incompletely understood. The purpose of this paper is to extend that understanding and to explore these effects within the context of an objective structured clinical exam (OSCE).

Consideration of these effects should occur in light of what is already known about sources of variance in OSCEs. Variance due to stations is typically the largest systematic source, accounting for approximately 3.5 times the variance due to resident ability (46.7% vs. 13.3%, respectively 8 ) in one study. Variance due to examiners and simulated patients is often nested in (i.e., confounded with) station variance, making it hard to estimate, 9 but available estimates of examiner variance vary substantially across studies, from trivial (0.4% 10 ) to more substantial (13% 11 or 18% 12 ). Notably, contrast effects could erroneously contribute to candidate variance estimates, whereas DRIFT effects would be expected to contribute to the error term. As a result, neither is routinely demonstrated by conventional psychometric analyses.
Contrast effects describe examiners' tendency to be biased away from the standard of preceding performances, that is, to allocate unduly low scores to a candidate who follows a good performance by another candidate, and unduly high scores to one who follows a poor performance. 6 The effect has been demonstrated in three separate experimental studies, all situated within a workplace-based assessment context. 6,13,14 In these studies, contrast effects typically showed a moderate effect size (Cohen's d = 0.6) and accounted for a greater proportion of score variance (24%) than examiners' consistent tendency to give either high or low scores (18%; i.e., their 'Hawkishness' or 'Dovishness'), whilst altering pass/fail decisions for 31% of borderline candidates. 6 Further work demonstrated that assessors' narrative judgements were equally as susceptible to the effect as their scores, 14 whilst other work suggested the effect is likely to operate unconsciously, beyond examiners' awareness. 13 Although these studies were all experimental and focused on Mini-CEX assessments of consultation skills, a further study used observational methods to examine patterns of data from an OSCE and a multiple mini-interview (MMI) for selection. 15 It found patterns of correlations in both contexts that were consistent with contrast effects, albeit explaining a smaller proportion (between 5% and 11%) of score variance across contexts.
As a result, contrast effects appear to be a robust phenomenon with the potential to bias examiners' judgements to a small or moderate extent in any setting where trainees are examined in sequence, and to significantly alter outcomes for candidates near the pass/fail threshold. Despite this, little further research has explored their impact on practice or attempted to mitigate their effect, particularly in high-stakes performance assessments such as the OSCE.
DRIFT is described as a tendency for raters to systematically alter their scoring of successive candidates over the course of a period of examining. 5 McLaughlin et al. 5 showed that examiners became progressively more lenient across a formative 10-station OSCE, with scores increasing by an average of 0.88% per station. Although this effect appears small at a station level, it meant that residents scored an average of 8.8% higher if they took a station at the end of the OSCE rather than at the start. Having discounted warm-up effects by replicating their findings after excluding initial stations, the authors suggested the effect was due to examiner fatigue. By contrast, Hope and Cameron 16 showed the opposite effect: examiners in a broadly focused summative undergraduate OSCE (Year 3 of 5) grew progressively more stringent over time, with a decline of 0.14% per station, accounting for a 3.27% reduction in scores between the first and last groups. Coetzee and Monteiro 17 examined these patterns in a summative OSCE that determined whether international nursing graduates could practise in Ontario, Canada. Although they found no general support for DRIFT in these data, they demonstrated a significant negative relationship for one out of 12 stations, which itself appeared to be attributable to one track (and therefore potentially to one examiner). Candidates in this track failed the station more frequently when examined late rather than early in the sequence. DRIFT effects can therefore be an important influence on both outcomes and scores but are unpredictable, varying in both direction and occurrence between settings. Importantly, all three studies investigating DRIFT effects used observational data, with the authors presuming that the observed effects were due to changes in examiners' behaviour. Without experimental control, they could not exclude the possibility that the observed effects were due to changes in students' behaviour or other unknown factors. For instance, rather than examiners becoming more lenient with time, 6 it may be that students' performance improved over the course of the OSCE. Consequently, it would be useful to determine whether the observed effects are indeed due to examiners.
In summary, contrast effects have predominantly been studied in an experimental context, with less insight into their operation in practice and some suggestion that effects in practice may be smaller. Conversely, DRIFT effects have only been demonstrated observationally, without the ability to causally attribute them to examiner behaviour. The aim of this study was therefore to examine both phenomena simultaneously within the same OSCE, to determine the magnitude of each effect, whether they interact and whether DRIFT effects can be causally attributed to examiner behaviour.

| METHODS

| Assessment context
We used secondary data analysis to address this aim, drawing on data from a recent study by Yeates et al. 18 derived from a summative Year 3 undergraduate OSCE at Keele University Medical School. Students were studying for the MBChB, a 5-year, predominantly undergraduate, course. Year 3 is the first year that students spend predominantly in clinical placements, and by this point they have learned clinical skills appropriate to a broad range of medical, surgical and primary care disciplines. Students had a median age of 22 years (range 20-32 years). The OSCE consisted of 12 × 10-min stations, with each student completing four stations on each of 3 consecutive days. One hundred and thirteen students were examined, distributed across four parallel circuits that were repeated in the morning and afternoon with (predominantly) different examiners, giving eight separate groups of examiners. Scores were allocated using Keele's GeCos marking system, 19 which collects ratings on five domains (each scored 1-4) and a global grade (1-7) for each station, giving a possible score range of 6 to 27 per station. As a consequence of these design features, the OSCE context differed somewhat from the workplace-based assessment context in which the majority of observations of contrast effects have previously occurred 6,13,14 : examiners used domain-based ratings with task-specific prompts rather than generic marking scales; they were supplied with the correct diagnosis for the case rather than having to reach their own; and they were briefed on the scoring format and had previously (several months earlier) undergone generic benchmarking-based training, which involved scoring videos of OSCE performances within a faculty development event and comparing and discussing scores.

| Analysis
Using these data, we first attempted to replicate the patterns of observational relationships shown in prior work that are consistent with contrast and DRIFT effects. As well as aiming to replicate prior work, these analyses made use of all 'live' OSCE scores and the scores given to all embedded videos (i.e., all scores allocated by examiners during the OSCE, hereafter referred to as 'all within-OSCE data'), so they might be expected to be maximally powered. Second, we examined scoring patterns for the embedded videos to determine whether they showed evidence of contrast or DRIFT effects, or an interaction between the two.
To examine linear relationships suggestive of contrast or DRIFT effects in the entire dataset, we organised all data collected during the OSCE (all live scores and embedded video scores) in terms of the sequence of performances seen by each examiner. This created a new variable for each performance, which we termed 'sequence'. It ranged from 1 (the first performance seen by a particular examiner) up to a maximum of 17 (the last performance seen by that examiner within a given session of the exam). The maximum sequence value varied between 12 and 17 across examiners, depending on the arrangement of candidates within the session and whether the examiner opted to score embedded videos. To operationalise contrast effects, based on the methodology in Yeates et al., 15 we calculated the average score given to the three preceding candidates by each examiner. This gave a new continuous variable, which we termed 'previous candidates'. We used this measure rather than simply the score of the single previous performance as it consistently showed stronger relationships in Yeates et al.'s 15 study. Where fewer than three preceding performances were available, again as per the method of Yeates et al., we used the average of all preceding performances (i.e., the first performance in the sequence was excluded; for the second performance, we used the score of the first; and for the third, the average of the first and second). To avoid results being confounded by Simpson's paradox, whereby erroneous conclusions might be drawn from a single aggregated set of data, 20 we modelled the influence of multiple known predictors of OSCE scores (candidate, station and examiner) within a generalised linear model (GLM). 21 These analyses tested the following hypotheses:

Hypothesis 1. Examiners' scores will show a significant linear relationship with the 'sequence' variable.

Hypothesis 2. Examiners' scores will show a significant negative relationship with the 'previous candidates' variable.

Hypothesis 1 would hold true if examiners systematically altered their scoring over the course of an examining session (i.e., DRIFT). To further illustrate, Hypothesis 2 would hold true if examiners were influenced by contrast effects (i.e., their scores were biased away from the standard of preceding performances).
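To make the derivation of the 'previous candidates' variable concrete, the sketch below shows one way it could be computed. This is an illustrative reconstruction in Python/pandas rather than the authors' own code (the analyses were run in SPSS), and the dataframe layout and column names (examiner, sequence, score) are assumptions.

```python
import pandas as pd

# Hypothetical long-format data: one row per performance scored by an examiner,
# ordered by the 'sequence' in which that examiner saw the performances.
df = pd.DataFrame({
    "examiner": ["E1", "E1", "E1", "E1", "E1"],
    "sequence": [1, 2, 3, 4, 5],
    "score":    [20, 17, 23, 19, 21],
})

def previous_candidates(scores: pd.Series) -> pd.Series:
    # Mean of the (up to) three scores the same examiner gave immediately
    # beforehand; the first performance has no predecessors and stays missing.
    return scores.shift(1).rolling(window=3, min_periods=1).mean()

df = df.sort_values(["examiner", "sequence"])
df["previous_candidates"] = df.groupby("examiner")["score"].transform(previous_candidates)
print(df)
# sequence 2 -> 20.0, sequence 3 -> 18.5, sequence 4 -> 20.0, sequence 5 -> 19.67
```

The rolling mean over the shifted scores reproduces the rule described above: the first performance is excluded, the second uses only the first score and the third uses the average of the first two.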
To examine the influence of contrast and DRIFT effects on the scores that examiners gave to the embedded videos, we developed two categorical variables related to sequence and previous candidates. For simplicity, we labelled these variables 'sequence' and 'contrast'. As the median sequence value in the overall data was 8, we denoted performances with low sequence values (1-8) as 'early' and later sequence values (>8) as 'late' to give the sequence variable.
To develop the 'contrast' variable, we categorised scores for each embedded video based on the average scores given to the (up to) three preceding candidates. To do this, we compared the value of the 'previous candidates' variable with the average score given to the embedded video in question by all examiners (i.e., our best measure of the standard of the embedded video). Instances where an examiner scored an embedded video that had been preceded by comparatively weak performances were denoted 'previous low', whereas instances where an embedded video had been preceded by comparatively strong performances were denoted 'previous high'. This categorised embedded video scores relative to the standard of the preceding performances regardless of their absolute level (i.e., an embedded video was categorised as 'previous high' if the preceding performances had been scored more highly than it, regardless of whether those performances were actually 'good'). This relative approach can be justified because Yeates et al. 13 showed that contrast effects operate at multiple levels of performance, rather than just for borderline performances. Instances where the preceding performances received the same average score as an embedded video were omitted.
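A minimal sketch of this two-way categorisation follows, continuing the assumed conventions of the previous sketch. The dataframe name `videos` and the column names (including `sequence_cat` and `contrast_cat`, used here to distinguish the categorical factors from the raw variables) are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical embedded-video scores: one row per examiner-by-video judgement,
# with 'previous_candidates' derived as in the previous sketch.
videos = pd.DataFrame({
    "performance":         ["V1", "V1", "V2", "V2"],
    "examiner":            ["E1", "E2", "E1", "E2"],
    "sequence":            [3, 12, 7, 15],
    "score":               [21, 19, 15, 17],
    "previous_candidates": [22.0, 18.5, 16.0, 17.5],
})

# 'early' vs. 'late', split at the median sequence value of 8 reported above.
videos["sequence_cat"] = np.where(videos["sequence"] <= 8, "early", "late")

# Best estimate of each video's standard: its mean score across all examiners.
video_mean = videos.groupby("performance")["score"].transform("mean")

# Relative categorisation: preceding average above the video's own standard
# -> 'previous_high'; below it -> 'previous_low'.
videos["contrast_cat"] = np.where(
    videos["previous_candidates"] > video_mean, "previous_high", "previous_low"
)

# Ties (preceding average exactly equal to the video's mean) are omitted,
# as described in the text.
videos = videos[videos["previous_candidates"] != video_mean].copy()
```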
Having categorised the data, we used a GLM to examine the influence of performance (factor), examiner (factor), sequence (factor: early or late) and contrast (factor: previous high or previous low) on the dependent variable of score for the embedded video performances. In our first analysis, all factors were compared as main effects without interactions. We then repeated the analysis including a sequence (early or late) × contrast (previous high or previous low) interaction. Both models were estimated using maximum likelihood estimation in IBM SPSS v26. 22 These analyses tested the following hypotheses:

Hypothesis 3. Examiners will allocate higher scores to embedded videos in the 'early' sequence condition than in the 'late' sequence condition.

Hypothesis 4. Examiners will allocate higher scores to embedded videos in the 'previous low' contrast condition than in the 'previous high' contrast condition.
In practical terms, Hypothesis 3 would hold true if examiners scored a given performance more highly when they encountered it early in the sequence than when they encountered the same performance later in the sequence. Arbitrarily, we hypothesised that:

Hypothesis 5. The difference between scores allocated by examiners under the 'previous high' and 'previous low' conditions will be greater in the 'late' condition than in the 'early' condition.
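For concreteness, the two models could be specified as follows. The authors fitted them in IBM SPSS; the sketch below is an assumed equivalent using Python's statsmodels formula interface (Gaussian GLM by default), continuing with the hypothetical `videos` dataframe from the previous sketch (realistic data, rather than the toy frame above, would be needed to actually estimate these models).

```python
import statsmodels.formula.api as smf

# Main-effects model: performance, examiner, sequence and contrast as factors
# (Hypotheses 3 and 4).
main_effects = smf.glm(
    "score ~ C(performance) + C(examiner) + C(sequence_cat) + C(contrast_cat)",
    data=videos,
).fit()

# Second model adding the sequence x contrast interaction (Hypothesis 5);
# '*' expands to both main effects plus their interaction term.
with_interaction = smf.glm(
    "score ~ C(performance) + C(examiner) + C(sequence_cat) * C(contrast_cat)",
    data=videos,
).fit()

print(main_effects.summary())
print(with_interaction.summary())
```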
For inferential statistical tests, we adopted a Type I error rate of 5% but applied the Bonferroni correction to account for our five separate hypotheses, resulting in a significance level of p = 0.05/5 = 0.01. We opted not to perform a post hoc calculation of the apparent power of the study, as such calculations are sample dependent and therefore have the potential to mislead. 23 We could have modelled the power of a sample of this size to detect an arbitrary prespecified difference; however, owing to the complex data structure, this would have required simulation relying on multiple assumptions (likely derived from sample-dependent estimates).
For these reasons, consistent with Colegrave and Ruxton 24 and Levine and Ensom, 25 we assert that the 95% confidence intervals (CIs) of the estimates are the best measure of the precision of the analysis and have reported these.

| RESULTS
| Linear relationships in all within-OSCE data

Included data for the variable 'sequence' ranged from 2 to 17, with a uniform distribution, a median of 9 and an interquartile range (IQR) of 8. Scores for the first performance in each sequence were omitted as they never had corresponding 'previous candidates' data. The variable 'sequence' (denoting examiner DRIFT effects) was non-significant: β = −0.06, SE = 0.03, Wald χ²(1) = 2.8, p = 0.09. As a result, Hypothesis 1 was not supported, and this analysis was not consistent with the presence of DRIFT effects in these data. The variable 'previous candidates' (i.e., the average score given to the up to three preceding candidates, denoting contrast effects) was also non-significant: β = −0.016, SE = 0.04, Wald χ²(1) = 0.2, p = 0.71. As a result, Hypothesis 2 was not supported, and this analysis was not consistent with the existence of contrast effects in these data. Please see Table 1 for regression coefficients.

| Factorial comparisons of embedded video scores
Embedded videos were scored significantly higher when examiners encountered them early in their sequence (19.9, 95% CI 19.4-20.5) than when they encountered them late (18.6, 95% CI 18.1-19.1; p < 0.001), supporting Hypothesis 3. Candidates scoring close to a station's pass mark could therefore potentially have their categorisation (pass/fail or fail/pass) altered by a difference of this magnitude (1.3 points). These data are illustrated in Figure 2. Conversely, there was no significant difference in the scores given to video performances when they were preceded by high-scoring rather than low-scoring performances, so Hypothesis 4 was not supported. The sequence × contrast interaction term was also non-significant, and Hypothesis 5 was therefore not supported.

| DISCUSSION

| Theoretical implications of findings
This study has shown two somewhat contradictory findings: partial support for DRIFT effects and lack of support for contrast effects.
Although prior research has variously shown scores increasing 5 and decreasing 16 over the course of OSCEs, 27,28 this is an issue that has received comparatively little attention despite the well-described cognitive load that examiners experience. 29 If fatigue does mediate this effect, we may expect to see a less dramatic effect when scoring criteria are optimised to reduce cognitive load or when the examining task is simplified in other ways, for example, through station design. 30 Prior findings have also varied between settings and even between stations, 17 and it may be that multiple effects interact at different times to produce different overall effects. Indeed, the muted (embedded videos) and null (all within-OSCE data) effects we observed could have arisen from the overlay of multiple DRIFT effects, some increasing and some decreasing scores over time. As these mechanisms are currently speculative, mechanistic work is required to understand these influences further.
The lack of support for contrast effects in these data contradicts the findings of the majority of prior research on this topic. 6,13,14 Again, it is useful to consider potential reasons why they did not occur in these data. One possibility is that there was comparatively little variation in the standard of the performances that examiners observed; the fact that preceding performances differed substantially from the embedded videos in the 'previous low' and 'previous high' groups on only a minority of occasions is consistent with this explanation. Categorising preceding performances by their absolute level (good or poor) rather than their relative level (better or worse) could also potentially have produced different findings. As explained in the methods, we chose the relative approach because prior research suggests that contrast effects occur at all levels of performance but are greatest where the difference between successive students is large. 14 If the null effect arose due to insufficient variation in students' ability, then we might conclude that contrast effects may only be a significant issue where candidates of very disparate ability are examined together.
Although it is not possible to draw firm conclusions about any of these speculations, two points are salient: first, contrast effects may be less ubiquitous than prior research had suggested; and second, these findings do not exclude the potential for them to occur in other OSCE situations. As a result, ongoing vigilance for their impact is needed.

| Practical implications
Although emphasis on the formative role of OSCEs has justifiably increased, 38 ensuring that OSCEs provide a fair measure of learners' ability remains critical to their justification. 39 Consequently, any undue impact of these effects on assessment decisions in OSCEs could challenge the chain of evidence underpinning their validity. 40

| Limitations
The main limitation of this study emanates from the secondary data upon which the analyses were based. These originated from a single OSCE in a single context and were originally collected for a different purpose. Although the findings may not necessarily generalise to other settings, we believe that the design and participant population were typical of many undergraduate OSCE settings. Our investigation of contrast effects relied on natural variation in the standard of the preceding performances examiners judged, rather than deliberate manipulation. Although this was ecologically valid, it may have constrained the size of the contrast between conditions. In addition, we only examined for the presence of contrast or DRIFT effects in overall scores; we cannot exclude the possibility that contrast or DRIFT effects might have occurred within individual domains of the assessment and could therefore bias the scores within these domains. The importance of any such effect (were it to occur) would depend on the particular usage of domain scores within a given assessment.
Additionally, we were not able to model the potential for overlaid positive and negative DRIFT effects. We cannot exclude the potential that the small/null DRIFT effects we have reported masked more pronounced effects in subsets of data.

| Future research
Given the uncertainty around when, why and how both contrast and DRIFT effects may occur in assessments, future work should seek to establish more thoroughly the prevalence of these effects in OSCEs and to determine the conditions that mediate their presence and/or direction. Depending on the importance of the effects that occur, further work might explore the cost-benefit relationship of measures to mitigate them in practice. As several institutions have recently explored the potential for online OSCEs, these may offer an opportunity to replicate this study whilst blinding examiners to the presence of comparison performances (which could potentially be 'hidden' amongst other on-screen performances). This could overcome one of the limitations mentioned above. Further work could also explore the presence or absence of domain-level effects.

| Conclusions
Our findings suggest that the DRIFT effect we observed had a small influence on students' OSCE scores and can be causally attributed to examiner behaviour. Although the magnitude of DRIFT effects will not always have important consequences, they can reduce the precision of scores and have the potential to exert an unfair influence that should be considered within quality assurance of OSCEs.
Conversely, although contrast effects may importantly bias examiners' scores in some instances, they appear less ubiquitous than previously suggested. Consequently, more research is required to determine the prevalence and mediators of both influences, so that assessment design can avoid or limit their occurrence, or so that mitigating interventions can be developed.
As both effects appear to be contextually variable, their presence or absence should be monitored as part of quality assurance processes to ensure the fairness and validity of assessment outcomes in OSCEs.