We sought to develop a coherent validity argument for the interpretation of test scores derived through use of the script concordance method. Following an approach advocated by Messick,18 we examined published data for five categories of validity evidence: content; response process; internal structure; relations to other variables; and consequences. We found evidence in support of the validity of SCT score interpretations relating to content, internal structure and relations to other variables, although significant evidentiary gaps remained. Conversely, evidence supporting the validity of SCT scores with respect to examinee thought and response processes and educational consequences was weaker and more limited.
A potential limitation of our exercise is that validity argumentation is conventionally undertaken to evaluate inferences from scores on specific instruments developed for specific purposes. However, we have argued that results derived from whole classes (or methods) of assessment lend themselves to certain global interpretations that might help educators decide whether or not to invest in their own versions of a test, which would then require further validity verification. Our study is also limited by the relatively small body of current SCT literature, and by our potential bias, despite our best attempts at objectivity, as investigators who are intimately involved in SCT research and development.
Implications for education and future research
Content evidence has been bolstered by published guidelines for standardising the content and process of SCT construction. The development of suitably ill-defined test items, which describe clinical situations in which there is no single best approach, is important for lending credence to SCT score interpretations. However, the fact that not all experts agree on a single best solution to a given clinical problem does not mean that no such solution exists; more research is required to address this legitimate concern regarding SCT content validity. Careful item development and panel selection are clearly crucial for ensuring that SCT response options reflect a spectrum of acceptable practices, and that the expert panel’s responses reflect good clinical judgement and current clinical practice. Because published work on SCTs has thus far been carried out under research conditions, it remains to be seen how SCTs will perform when developed and implemented by non-experts. Content evidence may also be strengthened by soliciting qualitative or mixed-method data from examinees and panel members about their perceptions of the authenticity of the script concordance assessment experience.
Internal structure evidence is supported by consistently high reliability estimates from published SCTs across a spectrum of medical domains. Evidence in this category could, however, be reinforced by test–retest estimates of reliability and by generalisability studies decomposing the sources of variance in SCT scores (e.g. errors attributable to items and item–examinee interactions versus errors attributable to answer-key generation by the expert panel). With respect to the effects of panel composition on reliability, little research addresses how expert panels containing widely deviant responders (i.e. those with aberrantly low total scores on an SCT, or with outlying responses to particular SCT questions) should be treated; such research might provide important additional evidence in this category.
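As a schematic sketch of what such a decomposition might involve (the design and notation here are illustrative assumptions, not drawn from published SCT generalisability studies), a crossed persons × items design partitions observed-score variance into person, item and residual components, yielding a generalisability coefficient over $n_i$ items:

$$
\sigma^2(X_{pi}) = \sigma^2_{p} + \sigma^2_{i} + \sigma^2_{pi,e}, \qquad
E\rho^2 = \frac{\sigma^2_{p}}{\sigma^2_{p} + \sigma^2_{pi,e}/n_i}.
$$

Extending the design to persons × items × panels would, in principle, allow the error attributable to answer-key generation by the expert panel to be estimated as a separate facet.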
Evidence from the ‘relations to other variables’ category offers some support to the hypothesis that SCTs probe a construct that diverges from the one probed by most MCQ tests. Research thus far has focused on comparisons of SCTs with MCQs in which one answer is identified as clearly and unambiguously better than its alternatives. However, stronger correlations might be expected between scores on SCTs and other types of MCQs that, like SCTs, offer partial credit for answers judged reasonable but not necessarily optimal. Evidence in the ‘relations’ category might be further solidified by data derived from a comparative multitrait–multimethod research approach,52 which would allow investigators to examine patterns of correlation between different methods of assessment (e.g. SCT versus MCQ) and the ‘traits’ (constructs) they purport to measure (e.g. reasoning in contexts of uncertainty versus knowledge fund) in a more rigorous manner.
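To make the logic of that approach concrete (the trait and method labels below are our own illustrative shorthand, not taken from the cited work), a minimal multitrait–multimethod matrix would cross two traits, reasoning under uncertainty ($R$) and knowledge fund ($K$), with two methods, SCT and MCQ:

$$
\begin{array}{l|cccc}
 & R_{\mathrm{SCT}} & K_{\mathrm{SCT}} & R_{\mathrm{MCQ}} & K_{\mathrm{MCQ}} \\
\hline
R_{\mathrm{SCT}} & \text{(rel.)} & & & \\
K_{\mathrm{SCT}} & \text{low} & \text{(rel.)} & & \\
R_{\mathrm{MCQ}} & \text{high} & \text{low} & \text{(rel.)} & \\
K_{\mathrm{MCQ}} & \text{low} & \text{high} & \text{low} & \text{(rel.)} \\
\end{array}
$$

Reliabilities sit on the diagonal; a pattern of high monotrait–heteromethod correlations (convergent validity) alongside low heterotrait correlations (discriminant validity) would support the claim that SCTs and conventional MCQs measure distinct constructs.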
A strategic research agenda for the SCT method should, however, focus on the two categories for which evidence is, to date, the least robust: thought and response processes, and consequences. At present, the evidence that SCT examinees’ thought and response processes align with the intended construct is based largely on theoretical argumentation. Whereas the inherent structure of the SCT’s stimulus and response format clearly precludes assessment of examinees’ ability to generate medical hypotheses or to collect appropriate data, its claim to probe clinical data interpretation as an isolated construct requires more empirical substantiation. Further comparative cognitive research, perhaps employing think-aloud or concept-mapping strategies, may shed light on the types of cognitive strategies examinees employ in approaching SCT questions. A fuller understanding of examinee thought and response processes is critical for helping educators to diagnose and remediate trainees who perform poorly on an SCT.
Evidence relating to consequences, or educational impact, is arguably the most important category of validity evidence.53 However, at present little is known about the consequential aspect of validity of the script concordance method. For example, the script concordance method’s presumed effect on learning (i.e. of steering learners away from the rote memorisation of ‘textbook answers’ towards deeper learning strategies) requires empirical corroboration. Furthermore, its accentuation of the role of uncertainty in clinical data interpretation, intended to simulate the conditions and complexities of real-life medical practice, may be counterintuitive to medical learners, particularly those accustomed to assessment under educational models in which ‘right’ answers tend to be extolled. The potential effects – positive and negative – of an assessment method rooted in uncertainty should be further explored.
The repercussions of the way in which SCTs are scored, such that all panellist responses are considered to have intrinsic merit, are also open to speculation: is the SCT scoring system a tacit endorsement of the implication that ‘experts never err’, or an acknowledgement that practitioners often interpret data differently depending on their varying experiences (scripts) in health care? This scoring system adds complexity, but may have the practical effect of reminding educators to articulate and model comfort with uncertainty when debriefing students after administering an SCT, or during other educational activities surrounding patient care. A study of the incremental value of the SCT’s unique scoring system, weighed against the consequences of the complexity it entails, may therefore be warranted.
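As a concrete sketch of the scoring rule in question (the notation is ours, though the aggregate scoring method itself is as described in the SCT literature), each response option is credited in proportion to its endorsement by the expert panel:

$$
\text{credit}(i, k) = \frac{n_{ik}}{\max_{j} n_{ij}},
$$

where $n_{ik}$ is the number of panellists selecting option $k$ on item $i$. The modal panel response earns full credit, minority panel responses earn partial credit, and responses chosen by no panellist earn none; for example, if six of 10 panellists choose ‘+1’ on an item and four choose ‘0’, an examinee answering ‘+1’ receives $6/6 = 1$ and one answering ‘0’ receives $4/6 \approx 0.67$.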
Although innovative methods for rendering SCT scores more meaningful for students may soon serve as the basis for setting standardised pass or fail scores,48 the consequences of such decisions will undoubtedly raise further questions. What opportunities exist for clinical educators to help remediate learners who demonstrate substandard SCT performance? How can SCT examinees who score poorly improve their clinical data interpretation skills? These and other concerns about the consequences of SCT use should be the primary focus of further investigation.
Finally, emerging paradigms in assessment indicate a shift in emphasis from the evaluation of individual methods or instruments to the evaluation of entire assessment programmes.54 To date, no data exist regarding the contribution of SCTs to the delivery of a varied, competence-based assessment programme as a whole. With its emphasis on the application of knowledge, the SCT assesses trainees’ competence at the ‘knows how’ level of Miller’s pyramid.55 As such, it has the potential to complement other assessments situated at both lower (e.g. MCQs, ‘knows’) and higher (e.g. OSCEs, ‘shows how’; multi-source feedback, ‘does’) levels of Miller’s pyramid. Evidence testifying to the role of the script concordance method – among a measured blend of other methods – within structured assessment programmes would further bolster the validity argument in favour of its adoption.