The policy context
The last decade has seen a major expansion in postgraduate assessment within the medical professions. This has been driven by two main factors. Firstly, the education literature has provided growing evidence that assessment and feedback drive learning across the whole continuum of education.1 Secondly, in the modern, regulation-bound world, health services are mandated to demonstrate safe and effective practice to the public.2 In this context, assessment must carry the heavy burden of helping trainee clinicians to achieve competence and then assuring that they have succeeded in doing so.
Good assessment practice
Because so much hangs on assessment, good clinical practice is becoming dependent upon good assessment practice.3 Fortunately, education research has provided a number of important observations about how to assess well.
Firstly, clinical performance is context-specific; a good performance in one case does not necessarily predict a good performance in another.4 Consequently, clinicians should be assessed on a sample of cases.
Secondly, complex performance cannot be reduced to simple checklists; it requires sophisticated judgements that can take account of context.5 Doctors who judge their peers and trainees largely agree on who is performing well and poorly, but they display some individual differences. Consequently, clinicians should be assessed by a sample of suitably experienced judges.3
Thirdly, attempts to standardise assessment by taking doctors out of their real workplaces and into a controlled environment are futile. It is quite possible to assess a doctor in a controlled environment, but competence in such a setting does not predict real workplace performance.6,7 Competent doctors may perform poorly in the workplace for a variety of reasons. Experience in UK performance assessment procedures suggests that those reasons include failure to learn from mistakes; poor mental health; workload-related issues; and family problems.8
In short, to know how they perform in the workplace, clinicians should be assessed regularly in the workplace on an adequate sample of their day-to-day work by other clinicians who understand the work and are able to make judgements. This type of assessment has been called workplace-based assessment (WBA).
The WBA dilemma
The importance of WBA is embedded in key policy documents in the UK9 and across the world. Consequently, there has been an explosion in the use of WBA methods. For example, every specialty in the UK has included several WBA methods in its curriculum for trainees.10
Unfortunately, the implementation of WBA in medicine worldwide has been fraught with difficulty. In the UK, the Academy of Medical Royal Colleges, drawing on the findings of several surveys, summarises the feeling of the medical profession:
‘The profession is rightly suspicious of the use of reductive “tick-box” approaches to assess the complexities of professional behaviour, and widespread confusion exists regarding the standards, methods and goals of individual assessment methods. This has resulted in widespread cynicism about WBA within the profession, which is now increasing.’10
Furthermore, where WBA methods have been psychometrically evaluated, scores have been found to be highly vulnerable to assessor differences, and assessors have tended to be indiscriminately lenient, rating most trainees very positively.11,12 This means that very large numbers of assessors and cases are required to achieve adequate reliability.
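The link between low single-encounter reliability and the large samples required can be illustrated with the Spearman–Brown prophecy formula, which gives the reliability of a score averaged over k encounters. The figures below are purely illustrative, not taken from the cited studies:

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of the mean of k parallel measurements,
    given the reliability r_single of one measurement."""
    return k * r_single / (1 + (k - 1) * r_single)

# If a single observed encounter has an (illustrative) reliability
# of 0.2, averaging over more encounters is the only route upwards:
for k in (1, 4, 8, 16):
    print(k, round(spearman_brown(0.2, k), 2))
# 1 encounter  -> 0.2
# 4 encounters -> 0.5
# 16 encounters are needed to reach 0.8
```

On these assumed numbers, sixteen separately assessed encounters are needed before scores become dependable enough for high-stakes decisions, which is consistent with the practical burden the surveys describe.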
Problems with scales
Assessors who have used WBA in practice highlight a number of problems which may help to explain the widespread cynicism about the method and its disappointing psychometric performance. Some of the most interesting observations have emerged from training discussions in which assessors score performance samples (usually from video) and then discuss the reasons for their scoring differences.13 Frequently, assessors agree over the performance they have seen, but disagree over their interpretation of the essential focus of the assessment (the assessment construct) or the meaning of the points on the scoring scales (the response format).14
Some scales are designed to reflect linear gradations of performance, such as the ‘unsatisfactory’ to ‘superior’ scale employed for the original mini-clinical evaluation exercise (mini-CEX) instrument.15 Typically, assessors have different interpretations of what constitutes, for example, a ‘superior’ performance and, when the scale is accompanied by more detailed descriptions for guidance, assessors do not refer to them. They are also reluctant to make use of categories that sound pejorative, such as ‘unsatisfactory’ or ‘poor’.
Other scales are designed to reflect progress in relation to predetermined stages of training, such as the ‘well below expectation for F1 completion’ to ‘well above expectation for F1 completion’ scale employed by the UK Foundation Programme instruments.12 (F1 refers to the most junior level of trainee in the UK.) Typically, clinician-assessors report significant uncertainty about the standard expected for a given stage of training, a limited knowledge of lengthy curricula, and reluctance to rate a trainee as being below the expected standard when they know that the trainee is approaching the end of a given training period.
Defining a construct
What, then, is the most valid assessment construct for a medical trainee and what is the best scale on which to reflect it? Clearly, this is a complex question because the focus of assessment varies across different domains of performance and for different levels of training. However, Olle ten Cate makes a strong case for establishing a unifying theme to run through all aspects of postgraduate training. He argues that clinical supervisors’ judgements focus on the construct of ‘entrustability’ (‘Do I trust this trainee?’) and that this construct is a helpful weighted and balanced synthesis of many complex factors that no authentic assessment in the workplace should separate.16 In the USA, the Accreditation Council for Graduate Medical Education (ACGME) has taken an alternative approach to defining the development of postgraduate competence by setting out exhaustive descriptions of ‘milestones’ specific to each domain of competence.17 However, an examination of the milestones allows us to discern two key constructs at work; they plot a story of increasing sophistication and independence.
One method of WBA has incorporated the construct of independence in its scale. The UK Intercollegiate Surgical Curriculum Programme has adopted procedure-based assessment (PBA) as an assessment of intraoperative (mainly technical) skill. Following a surgical operation, the PBA global assessment scale asks the assessor whether the trainee was: (i) ‘unable to perform the procedure, or part observed, under supervision’; (ii) ‘able to perform the procedure, or part observed, under supervision’; (iii) ‘able to perform the procedure with minimal supervision (needed occasional help)’; or (iv) ‘competent to perform the procedure unsupervised (could deal with complications that arose)’. A parallel evaluation of PBA and the objective structured assessment of technical skills (OSATS) found PBA to be much more reliable.18 Just two operations (each observed by a different assessor) were required to separate trainees with a reliability of 0.76. The generalisability (G) study shows that this was not because trainees performed particularly consistently from procedure to procedure, but because assessors used more of the scale to discriminate between trainees and assessor variation was much smaller.
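The logic of a G study in this fully nested design (each operation seen by a different assessor, so assessor and case variance are confounded in the residual) can be sketched as follows. The variance components here are hypothetical values chosen for illustration, not those reported in the cited evaluation:

```python
def g_coefficient(var_trainee: float, var_residual: float, n_obs: int) -> float:
    """Generalisability coefficient for the mean of n_obs observations
    in a fully nested design: true trainee variance over trainee
    variance plus averaged residual (case + assessor + error) variance."""
    return var_trainee / (var_trainee + var_residual / n_obs)

# Illustrative components: trainee variance 0.5, residual variance 0.32.
print(round(g_coefficient(0.5, 0.32, 2), 2))  # two observed operations
```

On these assumed components, two observations already yield a coefficient of about 0.76: when the residual variance (which absorbs both case-to-case inconsistency and assessor disagreement) is small relative to genuine between-trainee variance, few observations are needed, which is the pattern the PBA G study describes.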