Medical Education 2011: 45: 560–569

Context  Assessment in the workplace is important, but many evaluations have shown that assessor agreement and discrimination are poor. Training discussions suggest that assessors find conventional scales invalid. We evaluate scales constructed to reflect developing clinical sophistication and independence in parallel with conventional scales.

Methods  A valid scale should reduce assessor disagreement and increase assessor discrimination. We compare conventional and construct-aligned scales used in parallel to assess approximately 2000 medical trainees by each of three methods of workplace-based assessment (WBA): the mini-clinical evaluation exercise (mini-CEX), the acute care assessment tool (ACAT), and the case-based discussion (CBD). We estimate the variance components that reflect assessor disagreement (Vj and Vj*p) and assessor discrimination (Vp), and we model reliability using generalisability theory.

Results  In all three cases the conventional scale gave a performance similar to that in previous evaluations, but the construct-aligned scales substantially reduced assessor disagreement and substantially increased assessor discrimination. Reliability modelling shows that, using the new scales, the number of assessors required to achieve a generalisability coefficient ≥ 0.70 fell from six to three for the mini-CEX, from eight to three for the CBD, from 10 to nine for ‘on-take’ ACAT, and from 30 to 12 for ‘post-take’ ACAT.
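
The reliability modelling summarised above can be illustrated with a minimal sketch of a decision study. It assumes (this is not stated in the abstract) that each trainee is rated by different assessors, so that both Vj and Vj*p count as error and are averaged over the number of assessors. The function names and the variance values in the usage section are hypothetical and are not the study's estimates.

```python
def g_coefficient(v_p, v_j, v_jp, n_assessors):
    """Generalisability coefficient for a mean score over n_assessors.

    Assumption: assessors are nested within trainees, so assessor
    stringency (v_j) and assessor-by-trainee interaction (v_jp) both
    contribute to error and shrink as more assessors are averaged.
    """
    if v_p <= 0:
        raise ValueError("v_p must be positive")
    if n_assessors < 1:
        raise ValueError("n_assessors must be at least 1")
    error = (v_j + v_jp) / n_assessors
    return v_p / (v_p + error)


def assessors_needed(v_p, v_j, v_jp, target=0.70, max_assessors=1000):
    """Smallest number of assessors whose mean score reaches the target G."""
    for n in range(1, max_assessors + 1):
        if g_coefficient(v_p, v_j, v_jp, n) >= target:
            return n
    raise ValueError("target not reached within max_assessors")


if __name__ == "__main__":
    # Hypothetical variance components, chosen only for illustration.
    poorly_aligned = dict(v_p=0.30, v_j=0.10, v_jp=0.60)
    well_aligned = dict(v_p=0.45, v_j=0.05, v_jp=0.40)
    print(assessors_needed(**poorly_aligned))
    print(assessors_needed(**well_aligned))
```

Under this sketch, a scale that raises Vp and lowers Vj and Vj*p needs fewer assessors to reach the same coefficient, which is the pattern the Results describe.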

Conclusions  The results indicate that construct-aligned scales have greater utility, both because they are more reliable and because that reliability provides evidence of greater validity. There is also a wider implication: the disappointing reliability of existing WBA methods may reflect not assessors’ differing assessments of performance, but, rather, different interpretations of poorly aligned scales. Scales aligned to the expertise of clinician-assessors and the developing independence of trainees may improve confidence in WBA.