Objectives Case specificity implies that success on any case is specific to that case. In examining the sources of error variance in performance on case-based examinations, how much error variance results from differences between cases compared with differences between items within cases? What is the optimal number of cases and questions within cases to maximise test reliability given some fixed period of examination time?
Methods G and D generalisability studies were conducted to identify variance components and reliability for each examination analysed, and to optimise reliability for the given test compositions (1, 1.5, 2, 3, 4 and 5 questions per case), using data from 3 key features examinations of the Medical Council of Canada (n = 6342 graduating medical students). Each examination consisted of about 35 written cases, each followed by 1–4 questions addressing specific key elements of data gathering, diagnosis and/or management.
Results The smallest variance component was due to subjects; the variance due to the subject–item interaction was over 5 times that due to the subject–case interaction (on average, 0.1106 compared with 0.0195). Relatively little error variance arose from differences between cases; about 80% was due to variability in performance among items within cases. The D study showed that reliability varied between 0.541 and 0.579; it was lowest with 1 item per case and highest with 2–3 items per case.
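The trade-off reported above can be sketched numerically. In a persons × (items:cases) design, the generalisability coefficient is σ²p / (σ²p + σ²pc/nc + σ²pi:c/(nc·ni)); under a fixed time budget, adding items per case reduces the number of cases that fit. The subject–case (0.0195) and subject–item (0.1106) components below are taken from the Results; the subject variance and the timing constants are hypothetical values chosen only to illustrate the shape of the curve, not the Medical Council of Canada data.

```python
# D-study sketch for a p x (i:c) design (persons crossed with cases,
# items nested within cases) under a fixed testing-time budget.
sigma2_p  = 0.002    # subject (true-score) variance -- HYPOTHETICAL
sigma2_pc = 0.0195   # subject x case interaction (from the Results)
sigma2_pi = 0.1106   # subject x item-within-case (from the Results)

T_TOTAL = 180.0      # total examination time, minutes -- HYPOTHETICAL
T_CASE  = 2.0        # time to read one case stem     -- HYPOTHETICAL
T_ITEM  = 1.0        # time to answer one item        -- HYPOTHETICAL

def reliability(items_per_case):
    """Generalisability coefficient for relative decisions,
    with the number of cases set by the time budget."""
    n_cases = T_TOTAL / (T_CASE + items_per_case * T_ITEM)
    error = sigma2_pc / n_cases + sigma2_pi / (n_cases * items_per_case)
    return sigma2_p / (sigma2_p + error)

for n_i in (1, 2, 3, 4, 5):
    print(n_i, round(reliability(n_i), 3))
```

With these assumed constants the curve reproduces the qualitative pattern in the Results: 1 item per case is worst (each case's reading time buys only one item), and reliability peaks at 2–3 items per case before the growing subject–case error term pulls it back down.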
Conclusions The main source of error variance was items within cases, not the cases themselves; the optimal strategy for enhancing reliability would use cases with 2–3 items per case.