The first step in any validity evaluation entails an identification of the intended construct.13 The essential purpose of construct identification is to justify a particular interpretation of a test score by explaining the behaviour that the test score summarises.14 According to its originators, SCT scores are meant to reflect ‘a specific skill of clinical competence: the ability to weigh clinical information in light of entertained hypotheses’.1 The ability to appropriately interpret clinical data, particularly under conditions of ambiguity or uncertainty, is an integral part of the clinical reasoning process15 and lies at the heart of what some refer to as ‘clinical judgement’.16,17
Categories of evidence
The next step in a structured validity inquiry is to investigate the extent to which assessment scores can be presumed to reflect the intended construct. To conduct this step, we searched the PubMed, EMBASE and PsycINFO databases for peer-reviewed, English- and French-language articles relating to the theoretical underpinnings, construction procedures and psychometric properties of SCTs. Using the combined search terms ‘script’ and ‘concordance’, we identified 37 relevant articles. We then reviewed these articles to evaluate the construct validity of the script concordance method, following an established approach for analysing validity data from five categories: content, response process, internal structure, relations to other variables, and consequences.18
Content

This first category of validity evidence evaluates ‘the relationship between a test’s content and the construct it is intended to measure’.19 For an SCT score to represent a legitimate measure of clinical data interpretation (CDI) under conditions of uncertainty, the test content must, ex hypothesi, include problems that are ill-defined and authentic.
Fournier et al.20 issued guidelines to help SCT developers prepare test items that are ill-defined (i.e. imbued with a degree of uncertainty, imprecision or incompleteness). The guidelines advocate that relevant factual knowledge should be necessary – but not sufficient – for responding to the test questions. Properly fashioned SCT questions are intended to be unanswerable using formulaic or algorithmic reasoning, or pure recall of factual information. The questions are therefore tailored to probe examinees’ ability to select an appropriate alternative from among several acceptable options, rather than a single correct answer from among several factually incorrect distractors.
Success in developing suitably ill-defined SCT items can, to some extent, be verified. Questions that elicit identical responses from all experts are no different from single-correct-answer or single-best-answer MCQs, and those that obtain too broad a distribution of responses from the expert panel are considered too ambiguous.21 By contrast, optimal SCT questions are those that produce a small range of expert responses clustered around a modal answer. High-quality questions (i.e. those with content that is most consistent with the intended construct) can therefore be easily and objectively recognised.
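By way of illustration, the sketch below (in Python) shows how such a screen might be automated from a record of panel responses to one item; the thresholds and the panel data are invented for demonstration and do not correspond to published cut-offs:

```python
from collections import Counter

def classify_item(panel_responses, min_modal_share=0.4, max_modal_share=0.95):
    """Heuristic quality screen for one SCT item, based on the spread of
    expert panel responses. Thresholds are illustrative, not published values."""
    counts = Counter(panel_responses)
    modal_share = max(counts.values()) / len(panel_responses)
    if modal_share >= max_modal_share:
        return 'too concordant'  # behaves like a single-correct-answer MCQ
    if modal_share < min_modal_share:
        return 'too ambiguous'   # expert responses too widely dispersed
    return 'acceptable'          # responses cluster around a modal answer

# Ten panellists rating one item on a -2..+2 scale
print(classify_item([1, 1, 1, 1, 2, 2, 1, 0, 1, 2]))  # 'acceptable'
```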
The intention behind the script concordance approach is to simulate authentic conditions of medical practice, in which courses of action or lines of thinking about specific clinical problems are seldom indisputable, even among experts.22 Although case vignettes can never reflect the full complexity of real-patient encounters, SCT makers are instructed to generate questions from representative cases seen in daily practice.20 In some instances, audiovisual materials, including video segments, have been used to enhance the authenticity of the test-taking experience.2,23
Conclusion: Published guidelines for standardising the creation of authentic, ill-defined test items serve to ensure that individual SCTs legitimately probe the method’s intended construct (i.e. data interpretation in contexts of clinical uncertainty). As such, the guidelines constitute an important source of content evidence, assuming they are diligently followed during SCT development and pilot testing under non-research conditions.
Response process

The ‘response process’ category of validity evidence entails a search for data elucidating the relationship between an assessment’s intended construct and the thought processes and response actions of its examinees.8 Current evidence for alignment between thought and response processes and the intended construct of the SCT rests on several theoretical assumptions.
The script concordance approach is conceptually linked to a model of clinical reasoning known as the ‘hypothetico-deductive’ (HD) method.12 The HD method suggests that doctors tend to generate a few hypotheses early in a clinical encounter, and subsequently orient data collection towards confirming or rejecting their initial hypotheses.24 Patterned after this model, the SCT features three columns that correspond to the stages of hypothesis generation (‘If you were thinking…’), data collection (‘…and then you find…’) and data interpretation (‘…this hypothesis becomes…’), respectively. For each SCT question, both an initial hypothesis (column 1) and a new piece of clinical information (column 2) are provided, and therefore do not require independent generation by the examinee. What remains, ostensibly, is the stage of data interpretation, in which the examinee is presumed to make a decision regarding the fit of the new data with the given hypothesis. The script concordance method is therefore meant to probe one key signpost along an accepted theoretical pathway of clinical reasoning.
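To make the three-column format concrete, the following sketch represents a hypothetical SCT item as a simple data structure; the clinical content is invented, and the five-point response scale shown (-2 to +2) is the one commonly described in the SCT literature:

```python
# A hypothetical SCT item in the three-column format (content is illustrative).
sct_item = {
    'vignette': 'A 55-year-old man presents with acute chest pain.',
    'hypothesis': 'If you were thinking of pulmonary embolism...',        # column 1
    'new_information': '...and then you find a swollen, tender calf...',  # column 2
    'judgement': '...this hypothesis becomes:',                           # column 3
    'response_scale': {
        -2: 'ruled out or much less likely',
        -1: 'less likely',
        0: 'neither more nor less likely',
        1: 'more likely',
        2: 'much more likely or virtually certain',
    },
}
```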
However, clinical data interpretation is not a skill that can be teased apart from the medical knowledge upon which it relies.25 The script concordance method presumes that for each SCT question, examinees mobilise knowledge structures – ‘illness scripts’26 – from their mental databases that are relevant to the given hypothesis. Script concordance hinges on an inference that examinees with more evolved illness scripts will interpret data and make decisions that increasingly concord with those of experts given the same clinical scenarios. Indeed, SCTs used in various domains of medicine have consistently demonstrated that scores tend to increase with increasing levels of training.4,23,27,28
There are some empirical data to support the claim that the thought processes of SCT examinees include a judgement of fit between new clinical data and activated scripts. In one computer-based study using the script concordance format, subjects were asked to gauge the effects (i.e. more likely, less likely, no effect) of new pieces of information on a series of diagnostic hypotheses.29 Subjects’ response times were significantly faster when they were presented with clinical information that was either typical of or incompatible with the given hypothesis than when they were presented with information that was atypical. Subjects also responded more accurately when provided with typical than with atypical information. The investigators concluded that processing time and accuracy of judgement on script concordance tasks are influenced by the degree of compatibility between new clinical information and relevant activated scripts.
Conclusion: Validity evidence in support of a clear relationship between the intended construct of the script concordance method and the thought and response processes of examinees is largely theoretical and has minimal empirical substantiation.
Internal structure

Whereas content and response process evidence is gathered to ensure that test material legitimately probes an intended construct, internal structure data provide evidence that it does so in a reproducible, or reliable, manner. The internal structure category of evidence addresses key questions related to the reliability of an assessment method.8
Internal structure evidence for the SCT method shows dependably high internal consistency, with alpha coefficients of 0.70–0.90 across an array of medical disciplines.1,2,6,23,27,28,30,31 The method’s tendency to produce high reliability estimates is partly a function of the minimal testing time required per item, which permits the efficient collection of numerous samples of examinee performance. Script concordance tests generally contain 60–90 questions (nested in 20–25 cases for optimal reliability), and can be completed in about 1 hour.32 They are therefore designed to diminish the problem of case-specificity that has bedevilled the interpretations of scores obtained through other methods of assessment, such as patient management problems33 or long-case clinical examinations (CEXs),34 that address CDI over small or single samples of items.
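For reference, the alpha coefficient reported in these studies is Cronbach’s alpha, which can be computed directly from an examinee-by-item matrix of item credits. A minimal sketch, with fabricated data in place of real test results:

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (examinees x items) score matrix:
    alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                     # number of items
    item_vars = scores.var(axis=0, ddof=1)  # per-item sample variance
    total_var = scores.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Fabricated credits for 5 examinees x 4 items (real SCTs use 60-90 items)
demo = [[1.0, 0.8, 1.0, 0.6],
        [0.4, 0.2, 0.6, 0.4],
        [0.8, 1.0, 0.8, 0.8],
        [0.2, 0.4, 0.2, 0.0],
        [0.6, 0.6, 0.4, 0.6]]
print(round(cronbach_alpha(demo), 2))  # 0.93
```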
Another source of internal structure evidence for the script concordance method comes from data pertaining to the composition of the expert panel. Gagnon et al.,35 for example, determined that a panel size of at least 10–15 members is required for acceptable (i.e. α ≥ 0.70) reliability and that up to 20 members may be necessary for high-stakes examinations. Two other studies independently found that the relative ranking of examinee scores was unaffected by whether the reference panel was composed of experts directly involved in training the examinees (although absolute scores were higher when examinee responses were compared with those of their own instructors).36,37
Conclusion: The SCT design has yielded remarkably robust indices of internal consistency across a spectrum of medical domains, supporting the argument that in each case a single common construct is being probed. Research concerning the ideal composition of the expert panel has yielded additional supportive evidence in this category.
Relations to other variables
To the extent that a test’s score represents an underlying construct, it should correlate strongly with other indicators of the same or similar constructs, and weakly with measures of unrelated constructs.8 Validity evidence in this category can be derived by correlating scores obtained by a method of interest with those obtained by other methods of assessment.
Two studies have investigated the correlation between SCT and MCQ test scores. Collard et al.38 used a common-content blueprint to develop a fact-based true/false test and an SCT intended to probe biomedical reasoning. A positive correlation between true/false test and SCT scores was found for students at earlier (Years 3 and 4; r = 0.53, p < 0.0001), but not later (Years 5 and 6; r = 0.07, p = 0.64), stages of training. The authors concluded that ‘the absence of any significant correlation in students in the later years may indicate that a relative independence of factual knowledge and clinical reasoning has developed with experience’.38 In another study, Fournier et al.31 found no significant correlation (r² = 0.0164, p = 0.5905) between scores on a 60-question, ‘type C’ (single best answer with four distractors) MCQ test and a 90-question (nested in 30 cases) SCT administered to a small cohort of residents in emergency medicine.
In a study designed to verify whether SCT scores obtained by medical students could predict ‘clinical reasoning performance’ as residents, Brailovsky et al.39 found moderate correlations between students’ scores on an SCT administered at the end of clerkship and those obtained at the end of residency using two other methods for assessing reasoning in contexts of clinical uncertainty (r = 0.451, p = 0.013; r = 0.447, p = 0.015, respectively). In the same study, the correlation between early SCT scores and later scores on an objective structured clinical examination (OSCE), the focus of which was to assess a somewhat different construct (reasoning during the performance of technical skills), was weaker and fell short of statistical significance (r = 0.352, p = 0.052).
Conclusion: Studies thus far have detected relatively weak correlations between SCT scores and scores obtained on fact-based examinations, offering support to the claim that SCTs, at least to a degree, measure a different construct from tests probing pure recall of propositional knowledge. Note that the evidence here is sparse, relying on results from only a few studies that compared SCTs and single-correct-answer MCQ tests matched globally – but not on an item-by-item basis – for content. Moreover, correlations between SCT and MCQ scores in these studies were not corrected for attenuation and thus may appear falsely low. Evidence that SCT scores early in training predict later scores on tests probing similar constructs exists, but is also scant.
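For context, the standard correction for attenuation from classical test theory (a general formula, not one applied in the cited studies) divides the observed correlation by the geometric mean of the two tests’ reliabilities:

```latex
\[
  \hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\]
```

where r_xy is the observed correlation and r_xx and r_yy are the reliability coefficients of the two tests. As a purely illustrative calculation, an observed r = 0.07 between two tests each with reliability 0.80 would correct to 0.07/0.80 ≈ 0.09, which remains weak.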
Consequences

This category explores evidence relating to the intended or unintended consequences of an assessment method.8 Evidence concerning the effects of a method’s scoring format, its procedure for determining score thresholds (e.g. pass/fail cut scores) and its impact on learning and teaching practices also falls under this category.9
The scoring format of the SCT is a version of the aggregate method that takes into account the variability of experts’ responses to particular clinical situations.40,41 It assumes that, for each question, the answer provided by the greatest number of panel members reflects optimal data interpretation under the given circumstances and that other panel members’ answers reflect a difference of interpretation that is still clinically valuable and merits proportional credit. Under this paradigm, domain experts are considered to represent the reference standard for determining the degree of acceptability of different responses to SCT questions. The use of this type of scoring method in SCT has been justified and has been shown to be a key determinant of its discriminatory power.42,43
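In outline, this means that each response option earns credit in proportion to the number of panellists who chose it, with the modal answer earning full credit. A minimal sketch of that scoring rule, using invented panel data:

```python
from collections import Counter

def sct_item_credits(panel_responses):
    """Aggregate scoring for one SCT item: each option's credit is its panel
    frequency divided by the frequency of the modal answer (full credit)."""
    counts = Counter(panel_responses)
    modal = max(counts.values())
    return {option: n / modal for option, n in counts.items()}

# Ten panellists answer one item on a -2..+2 scale
credits = sct_item_credits([1, 1, 1, 1, 1, 2, 2, 0, 0, -1])
print(credits)             # {1: 1.0, 2: 0.4, 0: 0.4, -1: 0.2}
print(credits.get(-2, 0))  # an option no panellist chose scores 0
```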
However, the SCT’s scoring method has not gone uncontested. Bland et al.,44 for example, showed that several alternative scoring methods – including single-best-answer approaches – reproduced the results obtained using the SCT’s method of aggregate scoring. In general, the literature on differential weighting of item responses has reported only tepid effects on validity. For example, Sabers and White45 reported negligible increments in reliability and validity as a result of weighted scoring. Haladyna46 found that option weighting was labour-intensive and resulted in only slight gains in reliability and validity in a number of testing situations.
With regard to cut scores, the establishment of fair and transparent norm- or criterion-referenced methods47 for determining success or failure on SCTs has not yet been described. Angoff, Ebel and other conventional standard-setting methods for tests with dichotomous scoring systems are not appropriate for establishing SCT cut scores. Charlin et al.48 recently proposed a new statistical method for transforming and reporting scores that offers a common metric for gauging the performance of an SCT examinee relative to those of panel members. This method may, in future, be exploited to investigate standard setting and optimal pass/fail cut scores for SCTs under various testing conditions.
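One simple way to express an examinee’s performance on the panel’s metric is a z-score computed against the distribution of scores that the panel members themselves obtain on the test. The sketch below conveys the general idea only and is not necessarily the exact transformation proposed by Charlin et al.;48 all numbers are invented:

```python
import statistics

def score_on_panel_metric(examinee_score, panel_scores):
    """Express an examinee's total score as a z-score relative to the
    distribution of the panel members' own scores on the same test."""
    mean = statistics.mean(panel_scores)
    sd = statistics.stdev(panel_scores)
    return (examinee_score - mean) / sd

# Invented totals: 15 panel members' own scores, one examinee
panel = [78, 82, 80, 79, 84, 81, 77, 83, 80, 82, 79, 81, 80, 78, 83]
print(round(score_on_panel_metric(74, panel), 2))  # -3.13: about 3 SDs below the panel mean
```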
Several studies have explored the consequences of using the script concordance model for educational purposes during interactive workshops for health professionals.49–51 In each study, script concordance-type questions were used to assess participants’ competence in interpreting clinical data in diagnostic or management dilemmas in an area of concern. The exercise served as the basis for focused educational discussions between non-experts and experts attending the workshops. Pre- and post-workshop assessments (some were self-assessments) in each study suggested that the intervention led to improvements in participants’ knowledge, clinical reasoning skills or practice habits.
Conclusion: Script concordance assessment has, in several published instances, been successfully exploited for its immediate instructional effects, whereby it helps to identify and fill gaps in learners’ knowledge structures. Little is known about the longer-term educational impact of the script concordance method on teaching and learning. Furthermore, no sufficient body of procedural evidence and outcomes data with which to defend the use of tests based on the script concordance method in high-stakes examinations currently exists, and questions remain regarding optimal methods for scoring and standard setting in SCTs.