Good questions, good answers: construct alignment improves the performance of workplace-based assessment scales


Dr Jim Crossley, Academic Unit of Medical Education, University of Sheffield, 85 Wilkinson Street, Sheffield S10 2GJ, UK. Tel: 00 44 114 2225341; Fax: 00 44 114 2225369; E-mail:


Medical Education 2011: 45: 560–569

Context  Assessment in the workplace is important, but many evaluations have shown that assessor agreement and discrimination are poor. Training discussions suggest that assessors find conventional scales invalid. We evaluate scales constructed to reflect developing clinical sophistication and independence in parallel with conventional scales.

Methods  A valid scale should reduce assessor disagreement and increase assessor discrimination. We compare conventional and construct-aligned scales used in parallel to assess approximately 2000 medical trainees by each of three methods of workplace-based assessment (WBA): the mini-clinical evaluation exercise (mini-CEX); the acute care assessment tool (ACAT), and the case-based discussion (CBD). We evaluate how scores reflect assessor disagreement (Vj and Vj*p) and assessor discrimination (Vp), and we model reliability using generalisability theory.

Results  In all three cases the conventional scale gave a performance similar to that in previous evaluations, but the construct-aligned scales substantially reduced assessor disagreement and substantially increased assessor discrimination. Reliability modelling shows that, using the new scales, the number of assessors required to achieve a generalisability coefficient ≥ 0.70 fell from six to three for the mini-CEX, from eight to three for the CBD, from 10 to nine for ‘on-take’ ACAT, and from 30 to 12 for ‘post-take’ ACAT.

Conclusions  The results indicate that construct-aligned scales have greater utility, both because they are more reliable and because that reliability provides evidence of greater validity. There is also a wider implication: the disappointing reliability of existing WBA methods may reflect not assessors’ differing assessments of performance, but, rather, different interpretations of poorly aligned scales. Scales aligned to the expertise of clinician-assessors and the developing independence of trainees may improve confidence in WBA.


The policy context

The last decade has seen a major expansion in postgraduate assessment within the medical professions. This has been driven by two main factors. Firstly, the education literature has provided growing evidence that assessment and feedback drive learning across the whole continuum of education.1 Secondly, in the modern, regulation-bound world, health services are mandated to demonstrate safe and effective practice to the public.2 In this context, assessment must carry the heavy burden of helping trainee clinicians to achieve competence and then assuring that they have succeeded in doing so.

Good assessment practice

In view of the fact that so much hangs on assessment, good clinical practice is becoming dependent upon good assessment practice.3 Fortunately, education research has provided a number of important observations about how to assess well.

Firstly, clinical performance is context-specific; a good performance in one case does not necessarily predict a good performance in another.4 Consequently, clinicians should be assessed on a sample of cases.

Secondly, complex performance cannot be reduced to simple checklists; it requires sophisticated judgements that can take account of context.5 Doctors who judge their peers and trainees largely agree on who is performing well and poorly, but they display some individual differences. Consequently, clinicians should be assessed by a sample of suitably experienced judges.3

Thirdly, attempts to standardise assessment by taking doctors out of their real workplaces and into a controlled environment are futile. It is quite possible to assess a doctor in a controlled environment, but competence in such a setting does not predict real workplace performance.6,7 Competent doctors may perform poorly in the workplace for a variety of reasons. Experience in UK performance assessment procedures suggests that those reasons include: failure to learn from mistakes; poor mental health; workload-related issues, and family problems.8

In short, to know how they perform in the workplace, clinicians should be assessed regularly in the workplace on an adequate sample of their day-to-day work by other clinicians who understand the work and are able to make judgements. This type of assessment has been called workplace-based assessment (WBA).

The WBA dilemma

The importance of WBA is embedded in key policy documents in the UK9 and across the world. Consequently, there has been an explosion in the use of WBA methods. For example, every specialty in the UK has included several WBA methods in its curriculum for trainees.10

Unfortunately, the implementation of WBA in medicine worldwide has been fraught with difficulty. In the UK, the Academy of Medical Royal Colleges summarises the feeling of the medical profession from the findings of several surveys:

‘The profession is rightly suspicious of the use of reductive “tick-box” approaches to assess the complexities of professional behaviour, and widespread confusion exists regarding the standards, methods and goals of individual assessment methods. This has resulted in widespread cynicism about WBA within the profession, which is now increasing.’10

Furthermore, where WBA methods have been psychometrically evaluated, scores have been found to be very vulnerable to assessor differences and assessors have generally been indiscriminate in rating most trainees very positively.11,12 This means that very large numbers of assessors and cases are required to achieve reliability.

Problems with scales

Assessors who have used WBA in practice highlight a number of problems which may help to explain the widespread cynicism about the method and its disappointing psychometric performance. Some of the most interesting observations have emerged from training discussions in which assessors score performance samples (usually from video) and then discuss the reasons for their scoring differences.13 Frequently, assessors agree over the performance they have seen, but disagree over their interpretation of the essential focus of the assessment (the assessment construct) or the meaning of the points on the scoring scales (the response format).14

Some scales are designed to reflect linear gradations of performance, such as the ‘unsatisfactory’ to ‘superior’ scale employed for the original mini-clinical evaluation exercise (mini-CEX) instrument.15 Typically, assessors have different interpretations of what constitutes, for example, a ‘superior’ performance and, when the scale is accompanied by more detailed descriptions for guidance, assessors do not refer to them. They are also reluctant to make use of categories that sound pejorative, such as ‘unsatisfactory’ or ‘poor’.

Other scales are designed to reflect progress in relation to predetermined stages of training, such as the ‘well below expectation for F1 completion’ to ‘well above expectation for F1 completion’ scale employed by the UK Foundation Programme instruments.12 (F1 refers to the most junior level of trainee in the UK.) Typically, clinician-assessors report significant uncertainty about the standard expected for a given stage of training, a limited knowledge of lengthy curricula, and reluctance to rate a trainee as being below the expected standard when they know that the trainee is approaching the end of a given training period.

Defining a construct

What, then, is the most valid assessment construct for a medical trainee and what is the best scale on which to reflect it? Clearly, this is a complex question because the focus of assessment varies across different domains of performance and for different levels of training. However, Olle ten Cate makes a strong case for establishing a unifying theme to run through all aspects of postgraduate training. He argues that clinical supervisors’ judgements focus on the construct of ‘entrustability’ (‘Do I trust this trainee?’) and that this construct is a helpful weighted and balanced synthesis of many complex factors that no authentic assessment in the workplace should separate.16 In the USA, the Accreditation Council for Graduate Medical Education (ACGME) has taken an alternative approach to defining the development of postgraduate competence by setting out exhaustive descriptions of ‘milestones’ specific to each domain of competence.17 However, an examination of the milestones allows us to discern two key constructs at work; they plot a story of increasing sophistication and independence.

One method of WBA has incorporated the construct of independence in its scale. The UK Intercollegiate Surgical Curriculum Programme has adopted procedure-based assessment (PBA) as an assessment of intraoperative (mainly technical) skill. Following a surgical operation, the PBA global assessment scale asks the assessor whether the trainee was: (i) ‘unable to perform the procedure, or part observed, under supervision’; (ii) ‘able to perform the procedure, or part observed, under supervision’; (iii) ‘able to perform the procedure with minimal supervision (needed occasional help)’, or (iv) ‘competent to perform the procedure unsupervised (could deal with complications that arose)’. A parallel evaluation of PBA and the objective structured assessment of technical skills (OSATS) found PBA to be much more reliable.18 Just two operations (each observed by a different assessor) were required to separate trainees with a reliability of 0.76. The generalisability (G) study shows that this was not because trainees performed particularly consistently from procedure to procedure, but because assessors used more of the scale to discriminate between trainees and assessor variation was much smaller.

Present study

This paper reports a study designed to evaluate whether this observation generalises to other methods of WBA. We take three WBA methods and compare the performances of their existing conventional scales with those of new scales aligned to the constructs of developing clinical sophistication and independence, or ‘entrustability’. If the new scale is indeed a more valid reflection of progression through postgraduate training in the eyes of clinician-assessors, then we would expect to find two psychometric consequences:

  1 Assessors should discriminate between trainees more widely (rather than rating them all as superior, or all as meeting the expected standard), and
  2 Assessors should agree with one another more consistently, both in their overall reading of the standard required and in their responses about a particular trainee.


Selecting the instruments

To discover if the apparent benefit of a construct-aligned scale is context-specific, we chose three instruments to cover a range of assessment domains. Each instrument is already in use in the UK as part of the Joint Royal Colleges of Physicians Training Board curricula for medical trainees.19 Existing data on the performance of each instrument allowed us to ensure that the unchanged parts of the instruments performed normally during the evaluation.

The mini-CEX is designed for assessing some or all of multiple, short, real-time clinical encounters in authentic situations. It can be used to concentrate on any of: interviewing; examining; communication; judgement; professionalism, and efficiency. It was developed in the USA from the longer clinical evaluation exercise to allow for the broader sampling of encounters in the workplace.15

The case-based discussion (CBD) is designed to allow the assessor to probe the clinician’s clinical reasoning, decision making and application of medical knowledge in relation to patient care. The discussion is based on a written record which can be proposed by the trainee, but should be selected by the assessor. The assessor then asks the trainee to explain his or her management or records.

Versions of the mini-CEX and CBD feature in the UK Foundation Programme12 and in most UK Royal Colleges’ trainee assessment programmes.10

The acute care assessment tool (ACAT) is newer than the other methods. It was developed as ‘an assessment of a trainee during a period of practising acute medicine considering the trainee’s performance in the management of the take, patient management, and teamworking’.20 Trainee doctors are assessed either by trainee colleagues working with them during the acute duty period (‘on-take’ ACAT), or by the consultant at the handover and post-duty ward round (‘post-take’ ACAT). The instrument is broad and covers: clinical assessment; record keeping; investigations and referrals; managing critical illness; time management; teamworking; leadership, and handover.

All three original instruments in this study used the same scale, ranging from ‘well below expectations for stage of training’ to ‘well above expectations for stage of training’. For this evaluation, the original scale was retained, but a second scale was added. In the new scale, predetermined training level anchors were accompanied by behavioural descriptors aligned to the constructs of developing clinical sophistication and independence. For example, the ACAT anchors included ‘trainee required frequent supervision to assist in almost all clinical management plans and/or time management’ and ‘able to practise independently and provide senior supervision for the acute care period’. The full list of descriptors is presented in Table 1.

Table 1. Construct-aligned scales

(Mini-CEX = mini-clinical evaluation exercise; CBD = case-based discussion; ACAT = acute care assessment tool)

Rating 1: Performed below level expected during Foundation Programme
  Mini-CEX: Demonstrates basic consultation skills, resulting in incomplete history and/or examination findings. Shows limited clinical judgement following encounter
  CBD: Demonstrates little knowledge and lacks ability to evaluate issues, resulting in only a rudimentary contribution to the management plan
  ACAT: Trainee required frequent supervision to assist in almost all clinical management plans and/or time management

Rating 2: Performed at the level expected on completion of Foundation Programme/early Core Training
  Mini-CEX: Demonstrates sound consultation skills, resulting in adequate history and/or examination findings. Shows basic clinical judgement following encounter
  CBD: Demonstrates some knowledge and limited evaluation of issues, resulting in a limited management plan
  ACAT: Trainee required supervision to assist in some clinical management plans and/or time management

Rating 3: Performed at the level expected on completion of Core Training/early higher training
  Mini-CEX: Demonstrates good consultation skills, resulting in a sound history and/or examination findings. Shows solid clinical judgement following encounter, consistent with early higher training
  CBD: Demonstrates satisfactory knowledge and logical evaluation of issues, resulting in an acceptable management plan consistent with early higher training
  ACAT: Supervision and assistance needed for complex cases; competent to run the acute care period with senior support

Rating 4: Performed at level expected during higher training
  Mini-CEX: Demonstrates excellent and timely consultation skills, resulting in a comprehensive history and/or examination findings in a complex or difficult situation. Shows good clinical judgement following encounter
  CBD: Demonstrates detailed knowledge and solid evaluation of issues, resulting in a sound management plan
  ACAT: Very little supervising consultant input needed; competent to run the acute care period with occasional senior support

Rating 5: Performed at level expected on completion of higher training
  Mini-CEX: Demonstrates exemplary consultation skills, resulting in a comprehensive history and/or examination findings in a complex or difficult situation. Shows excellent clinical judgement following encounter, consistent with completion of higher training
  CBD: Demonstrates deep, up-to-date knowledge and comprehensive evaluation of issues, resulting in an excellent management plan consistent with completion of higher training
  ACAT: Able to practise independently and provide senior supervision for the acute care period


The instruments were revised at the end of 2009 as part of a regular quality enhancement process and administered in the same way as the original versions to trainees in medical specialties across all regions of the UK. Assessments are trainee-initiated and are recorded in a web-based, electronic portfolio in keeping with most WBA procedures in the UK. The research team downloaded anonymised scores linked to identifying codes for assessor and trainee into Excel spreadsheets for analysis. All consecutive assessments from December 2009 (the modification date) to March 2010 were included in the analysis.

Data analysis

The main outcomes are assessor discrimination, assessor agreement over standard, and assessor agreement over a particular trainee. Psychometrically, these variables will be reflected in a variance component analysis of the scores as person variance (Vp), judge stringency variance (Vj), and judge subjectivity variance (Vj*p), respectively. In this annotation, V = variance, p = trainee, j = assessor and i = episode (encounter, acute take or discussion).

If Vp rises and Vj and Vj*p fall, then reliability (which rises with Vp and falls with Vj and Vj*p) will also improve. These three outcomes are therefore helpfully summarised by the reliability of scores under the conventional and revised scales. For each scale, reliability is reported as the number of assessments required to achieve a generalisability coefficient (GC) ≥ 0.70.

Put simply, the better the assessors separate trainees, and the more similarly assessors rate a given trainee, the fewer assessments are required to achieve a given level of reliability.

All analyses were conducted in SPSS Version 14 (SPSS, Inc., Chicago, IL, USA) using generalisability theory. In generalisability theory, a variance component analysis (generalisability [G] study) estimates the influence that key assessment variables (such as case-to-case variation or assessor stringency) exert on the scores. These variances are then combined in a decision (D) study to model the reliability or generalisability of a putative assessment in which the numbers of assessors or cases vary.

The ordinal categorical scores were converted to numeric approximations for the analysis, ranging from 1 (lowest rating) to 5 (highest rating). The ‘unable to comment’ responses were treated as missing data. The analysis and reporting aim to meet the recommendations of Crossley et al.21 The sample of trainees and assessors is reported so that the reader can gauge its adequacy. The same data are used for the G study estimates and the D study modelling.

The G study used the minimum norm quadratic unbiased estimator (MINQUE) procedure because the data were naturalistic and unbalanced. Minimum degrees of freedom (d.f.) were reported by re-analysing the data using analysis of variance (ANOVA, Type III sums of squares). The regression model could only estimate the first-order effects of trainee ability (Vp) and assessor stringency (Vj). Assessor and episode are confounded because only one assessor scores each episode, so the effects of assessor subjectivity over trainee (Vj*p) and trainee case-to-case variation (Vi:p) are both included in the error term (Vres).
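To make the G study step concrete, the logic of variance component estimation can be sketched in simulation. This is an illustrative sketch only, under assumptions the study itself did not make: a balanced design (every trainee assessed the same number of times), assessor stringency folded into the residual, hypothetical variance values, and classical one-way random-effects ANOVA estimators rather than MINQUE on unbalanced data.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical true variance components (not values from this study)
vp_true, vres_true = 0.4, 0.8
n_trainees, k = 500, 6  # trainees, episodes per trainee (balanced design)

# Each trainee has a stable ability; each episode adds noise that, in this
# simplified sketch, absorbs assessor stringency, subjectivity and case effects
ability = rng.normal(0.0, np.sqrt(vp_true), n_trainees)
scores = ability[:, None] + rng.normal(0.0, np.sqrt(vres_true), (n_trainees, k))

# Classical expected-mean-squares estimators for a one-way random-effects model
ms_within = scores.var(axis=1, ddof=1).mean()     # estimates Vres
ms_between = k * scores.mean(axis=1).var(ddof=1)  # estimates Vres + k * Vp
vres_hat = ms_within
vp_hat = (ms_between - ms_within) / k
```

With enough trainees, the estimates recover the true components; the real analysis faces the harder problem of unbalanced, naturalistic data, which is why MINQUE was used.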

The D study assumed that each additional assessment episode was performed by a different assessor and thus used the equation: GC = Vp/(Vp + [Vj/Nj] + [Vres/Ni]).
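The D study equation above can be sketched in a few lines. The variance components below are invented for illustration, not taken from this study; because each episode is assumed to be scored by a different assessor, Nj = Ni = n.

```python
def gc(vp, vj, vres, n):
    """Generalisability coefficient for n assessment episodes,
    each scored by a different assessor (Nj = Ni = n)."""
    return vp / (vp + (vj + vres) / n)

def episodes_needed(vp, vj, vres, threshold=0.70, n_max=100):
    """Smallest number of episodes for which GC reaches the threshold."""
    for n in range(1, n_max + 1):
        if gc(vp, vj, vres, n) >= threshold:
            return n
    return None
```

For example, with hypothetical components Vp = 1.0, Vj = 0.5 and Vres = 2.0, six episodes are needed to reach GC ≥ 0.70; doubling Vp (sharper discrimination between trainees) cuts this to three, which is the mechanism by which the construct-aligned scales reduced the assessment burden.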


Recruitment and sampling

More than 2000 medical trainees and 4000 assessors conducting 24 322 assessments contributed data to this evaluation. The size and depth of the sample for each assessment method are presented in Table 2.

Table 2. The sample of trainees and assessors

(Mini-CEX = mini-clinical evaluation exercise; CBD = case-based discussion; ACAT = acute care assessment tool)

Mini-CEX: 3185 assessments; 1834 trainees; 2298 assessors
CBD: 4513 assessments; 2448 trainees; 3067 assessors
ACAT: 4463 assessments; 1864 trainees; 2196 consultant assessors (3235 assessments); 1026 peer assessors (1228 assessments)

G study results

Table 3 presents the G study results, which show how much the raw assessment scores were influenced by the assessors’ ability to discriminate between trainees of different abilities (Vp) and by variable assessor stringency or leniency (Vj). As the analysis section explains, the assessors’ varying views of a given trainee (Vj*p) are included in the residual term (Vres). Two strong themes emerge from comparing the variance components.

Table 3. G study results

The table reports, for each instrument (mini-CEX, CBD, on-take ACAT and post-take ACAT) and for the conventional and new scales, the estimated variance components:

Vp — variable trainee ability (across episodes and assessors)
Vj — variable assessor stringency (across episodes and trainees)
Vres (including Vp*i*j, Vp*j and Vi:p) — residual variation, including assessor subjectivity and trainee episode-to-episode variation

Notes: the degrees of freedom (d.f.) with which the variance components have been estimated were calculated using ANOVA; the MINQUE procedure actually used estimates the variance components with much higher equivalent d.f. because it makes use of much more of the data. Mini-CEX = mini-clinical evaluation exercise; CBD = case-based discussion; ACAT = acute care assessment tool; Conv = conventional.

Across all the instruments, Vp is higher with the new scale. This means that assessors discriminated more widely between high- and low-performing trainees using the new scale than they did using the conventional scale and, when a trainee saw several assessors, those assessors scored the trainee more similarly.

In addition, across all instruments (except the ACAT when used on take), Vj is lower with the new scale. This means that assessors were more consistent with one another in which part of the scale they used under the new scale than under the old one. ‘Hawkish’ and ‘dovelike’ tendencies were reduced.

Because Vres includes several sources of variation, it is not possible to draw conclusions from this effect, which, in any case, does not show a consistent pattern.

Reliability results

Table 4 presents the D study results for a range of assessment sample sizes in which each additional assessment is assumed to be performed by a different assessor. The number of assessments required to reach GC ≥ 0.7 is highlighted. With the construct-aligned scale, this number fell from six to three for the mini-CEX, from eight to three for the CBD, from 10 to nine for the on-take ACAT, and from 30 to 12 for the post-take ACAT.

Table 4. Reliability comparisons

The table shows, for each assessment sample size and for the conventional and new scales of each instrument (mini-CEX, CBD, on-take ACAT and post-take ACAT), the modelled generalisability coefficient. Bold values mark the number of assessments required to reach GC ≥ 0.7.

Mini-CEX = mini-clinical evaluation exercise; CBD = case-based discussion; ACAT = acute care assessment tool; Conv = conventional.


Main findings

This paper reports a study conducted to test the hypothesis that assessment scales aligned to medical trainees’ developing sophistication and independence provide a more valid response construct for clinician-assessors than conventional scales. The hypothesis is upheld across three very different assessment contexts. In all contexts, clinician-assessors made more reliable assessment judgements using modified scales than they did using conventional scales.

This improvement in reliability is primarily important because it provides evidence of the valid alignment of constructs. The G study results indicate that the construct-aligned scales caused assessors both to discriminate more widely between high- and low-performing trainees (Vp) and to come more into line with one another in terms of the expected standard (Vj) (Table 3).

However, it is also of great pragmatic significance. Reliability is a product of both good discrimination and good reproducibility. In three of four contexts, the difference in reliability was large and reduced the number of assessors and episodes required to achieve ‘in training’ levels of reliability very substantially (Table 4).

Strengths of the study

We were able to test the generality of the hypothesis by examining a parallel modification across three different assessment formats. We were able to recruit a large number of participants by introducing the change as a quality enhancement to an existing assessment programme and, consequently, the study suffered no selection bias of trainees or assessors. We had access to existing reliability data for each instrument and each performed broadly normally during the study. Even the large differences between the on-take and post-take ACATs were similar before and after the scale amendment.

Limitations of the study

There are a number of limitations to this study.

Firstly, as is often the case with WBA evaluations, the sampling matrix is complex and disordered, which is why it has not been possible to separate all of the relevant effects. We have addressed this complexity by displaying data on the sample depth so that the reader can judge the sufficiency of the data (Table 2), by displaying the minimum degrees of freedom on which the variance component estimates are based (Table 3), and by using the regression approach best suited to unbalanced data.21 A post hoc analysis was performed on smaller, more complete matrices by removing scores derived from trainees who were scored only once or from assessors who scored only one episode. The results were similar, so we are confident that the effect estimates are dependable.

Secondly, a direct comparison between the conventional scale and a pure independence-aligned scale would have been a better test of the hypothesis. We actually compared the original scale with a hybrid scale in which predetermined training level anchors were accompanied by behavioural descriptors aligned to the constructs of developing clinical sophistication and independence. Furthermore, the behavioural descriptors could probably be improved: they contain some references to training levels and are sometimes an uncomfortable mixture of the separate domains of the assessment form. These compromises arose mainly from the difficulty of writing specific clinical anchors for assessments that could be used in a wide variety of contexts. Although these are valid criticisms of the instruments’ design, they make it less rather than more likely that the study would demonstrate a difference, which makes the size and consistency of the observed differences all the more remarkable.

Thirdly, all the analyses are built on parametric approaches to data. The scores do indeed follow a skewed normal distribution, but it is important to remember that the original data are ordinal categorical responses and that we have had to make assumptions in order to convert them into numeric scores. These assumptions are probably weaker for behavioural descriptor categorical responses than for pseudo-numeric scales.

Finally, it is possible that the base scale (which defines known training levels) may have confounded assessors’ scores by causing them to rate the trainee according to his or her training level rather than according to his or her performance. This would produce reliable scores that are invalid because they reflect a construct that differs from that intended. This seems unlikely, however, for two reasons. Firstly, other instruments that have used a fixed educationally oriented scale (such as the UK Foundation Programme versions of the CBD, the mini-CEX, direct observation of procedural skills [DOPS] and multi-source feedback) have generally reported disappointing reliability.12 Secondly, a good number of trainees at all training stages (including the early years) were scored as performing at the ‘level expected for completion of higher training’, and a moderate number of trainees approaching the completion of higher training were scored as performing at lower levels.


Our findings suggest that clinician-assessors are more likely to discriminate between high- and low-performing doctors, and are more likely to agree with one another when they are using a rating scale aligned with the constructs of developing clinical sophistication and independence. This observation is important in its own right and promises significant benefits for WBA. However, it also has a wider significance because it raises the possibility that the disappointing psychometric performance of WBA to date may stem not from disagreements about the performance observed, but from different interpretations of the questions and the scales. If so, it may be that we can improve the reliability of WBA yet further by improving the design of the instruments.

On reflection, it seems obvious that assessors will interpret abstract anchors such as ‘unsatisfactory’ or ‘superior’ differently from one another, and that many will be unwilling to label a trainee or a colleague in the pejorative way demanded by the lower levels of such scales. Equally, anchors to predetermined training levels such as ‘meets expectations for stage of training’ hang directly on assessor expectations, which are likely to be variable, and many assessors will find it hard to rate their colleagues as performing ‘below expectations’. Nevertheless, it is part of a clinician’s day-to-day business to decide whether another doctor is safe to lead an acute take, run a clinic or perform an operation independently. These decisions integrate many factors that may or may not be easy to articulate, and each needs to be contextualised, weighted and balanced. Despite their complexity, however, the constructs that they represent have real face validity as a measure of readiness to practise. Our data also seem to show that clinician-assessors, if asked in the right way, can make highly reliable judgements about them.


Workplace-based assessment scales should be designed to align to the expertise of the assessor and the trainee’s developing ability in the workplace. In many cases of medical WBA this will require the use of anchors linked to the construct of clinical independence. It is almost certainly better to avoid pejorative anchors and sliding scales linked to expectations for stage of training.

A key part of the field testing of new instruments should include checking what assessors understand by the questions and the scale. Norming groups in which assessors score mock episodes and then discuss their differences provide very useful data for this purpose.

Further work is required to deepen our understanding of the key ingredients of question and scale design so that assessors’ scores can increasingly reflect their view of the performance under assessment and be less obscured by their varied interpretations of the instrument.

Contributors:  JC conceived the research question, conducted the analysis and wrote the first draft of the paper. GJ, JB and WW implemented the new scales, oversaw the data collection and contributed in full to the revision process. All four authors approved the final manuscript before submission.


Acknowledgements:  none.

Funding:  none.

Conflicts of interest:  none.

Ethics approval:  this study is based on a quality enhancement evaluation and thus did not require research ethics committee approval. In line with the Declaration of Helsinki, participating trainees and assessors were given the following information: the Joint Royal Colleges of Physicians Training Board (JRCPTB) and deaneries may use data from the ePortfolio to support their work in quality assurance training. This work will use anonymised, collated data and will make no attempt to investigate the performance of individuals. It is categorised as audit or service evaluation based on routinely collected data rather than research. The JRCPTB thus provides oversight of the use of trainees’ assessment data.