Medical Education 2011; 45: 741–747
Context Good examinations have a number of characteristics, including validity, reliable scores, educational impact, practicability and acceptability. Scores from the objective structured clinical examination (OSCE) are more reliable than those from the single long case examination, but concerns about the validity of the OSCE have led to modifications and the development of other models, such as the mini-clinical evaluation exercise (mini-CEX) and the objective structured long examination record (OSLER). These retain some of the characteristics of the long case, but feature repeated encounters and more structure. Nevertheless, the practical considerations and costs associated with mounting large-scale examinations remain significant. The lack of metrics handicaps progress. This paper reports a system whereby a sequential design concentrates limited resources where they are most needed in order to maintain the reliability of scores and practicability at the pass/fail interface.
Methods We analysed data from the final examination administered in 2009. In the complete final examination, candidates see eight real patients (the OSLER) and encounter 12 OSCE stations. Candidates whose performance is judged as entirely satisfactory after the first four patients and six OSCE stations are not examined further. The others – about a third of candidates – see the remaining patients and stations and are judged on the complete examination. Reliability was calculated from the scores of all candidates on the first part of the examination using generalisability theory; practicability was assessed in terms of financial resources. The functioning of the sequential system was assessed by the ability of the first part of the examination to predict the final result for the cohort.
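The two-stage routing described in the Methods can be sketched as below. This is an illustration only: the abstract does not define "entirely satisfactory", so the sketch assumes it means that every first-stage encounter was judged satisfactory. The names `StageOneResult` and `needs_second_stage` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass
class StageOneResult:
    # Hypothetical record of first-stage judgements (True = satisfactory).
    osler_satisfactory: list[bool]  # one entry per patient (four in stage one)
    osce_satisfactory: list[bool]   # one entry per station (six in stage one)


def needs_second_stage(result: StageOneResult) -> bool:
    """Return True if the candidate should see the remaining patients and stations.

    ASSUMPTION: "entirely satisfactory" is read as "every stage-one
    encounter judged satisfactory"; the actual criterion is not given.
    """
    entirely_satisfactory = all(result.osler_satisfactory) and all(
        result.osce_satisfactory
    )
    return not entirely_satisfactory


clear_pass = StageOneResult([True] * 4, [True] * 6)
borderline = StageOneResult([True, True, False, True], [True] * 6)
print(needs_second_stage(clear_pass), needs_second_stage(borderline))
```

Under this reading, only candidates with at least one unsatisfactory first-stage judgement go on to the second stage, which is where the additional examiner time is spent.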
Results Generalisability for the OSLER was 0.63 after four patients and 0.77 after eight patients. The OSCE was less reliable (0.38 after six stations and 0.55 after 12). There was only a weak correlation between the OSLER and the OSCE. The first stage was highly predictive of the results of the second stage. Savings facilitated by the sequential design amounted to approximately GBP30 000.
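As a rough check on how the reported coefficients relate to test length, the stage-one values can be projected to the full examination with the Spearman-Brown prophecy formula. This is not the generalisability analysis used in the study; it is a simpler classical approximation that assumes the added patients or stations behave like the existing ones. The function name `spearman_brown` is chosen here for illustration.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project reliability when a test is lengthened by `length_factor`.

    Classical Spearman-Brown prophecy formula; used here as an
    approximation, not as the study's generalisability-theory method.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Reported stage-one coefficients, projected to the full examination length.
osler_8 = spearman_brown(0.63, 8 / 4)   # four patients -> eight patients
osce_12 = spearman_brown(0.38, 12 / 6)  # six stations -> 12 stations

print(round(osler_8, 2), round(osce_12, 2))  # 0.77 0.55
```

Under these assumptions the projections agree with the reported full-length coefficients of 0.77 and 0.55, so doubling the number of encounters accounts for the gain in reliability.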
Conclusions The overall utility of examinations involves compromise. The system described provides good perceived validity with reasonably reliable scores; a sequential design can concentrate resources where they are most needed and still allow wide sampling of tasks.