Abstract

Medical Education 2012: 46: 28–37

Context  Historically, assessments have often measured the measurable rather than the important. Over the last 30 years, however, we have witnessed a gradual shift of focus in medical education. We now attempt to teach and assess what matters most. In addition, the component parts of a competence must be marshalled together and integrated to deal with real workplace problems. Workplace-based assessment (WBA) is complex, and has relied on a number of recently developed methods and instruments, of which some involve checklists and others use judgements made on rating scales. Given that judgements are subjective, how can we optimise their validity and reliability?

Methods  This paper gleans psychometric data from a range of evaluations in order to highlight features of judgement-based assessments that are associated with better validity and reliability. It raises issues for discussion and research around WBA. The literature is cited selectively: this is not a systematic review, but it attempts a serious analysis of why certain observations recur in studies of WBA and what we need to do about them.

Results and Discussion  Four general principles emerge: the response scale should be aligned to the reality map of the judges; judgements rather than objective observations should be sought; the assessment should focus on competencies that are central to the activity observed; and the assessors who are best-placed to judge performance should be asked to participate.


Introduction

Historically, assessments have often measured the measurable rather than the important. Over the last 30 years, however, we have witnessed a gradual shift of focus in medical education. We increasingly attempt to teach and assess what matters most.

This reformation has had three main themes:

First, the move from the testing of superficial knowledge towards the testing of understanding, construction and interpretation, reflected, for example, in Biggs and Collis’ SOLO (structure of observed learning outcomes) taxonomy,1 has informed developments in knowledge test design.

Second, the recognition that skills and attitudes can be as important as knowledge, reflected in Bloom’s original taxonomy,2 has contributed to new formats of clinical examination.

Finally, psychometric perspectives, in highlighting assessor subjectivity and the case-specificity of performance,3 have prompted a move towards multiple ‘mini’ test samples across many different assessment formats, such as the objective structured clinical examination (OSCE) and the eponymous mini-clinical evaluation exercise (mini-CEX).4

Many of these developments have deconstructed assessments and, some would argue, consequentially deconstructed learning. That is, breaking the assessed behaviour into subcomponents, or even simply sampling it in that way, has mandated learners to focus less on the big picture and more on elements or underpinning ‘competencies’. Interestingly, however, the competency movement also provided an altogether contrasting direction. It argued that, in practice, all the component parts of a competence must be marshalled together and integrated to deal with real workplace problems.5 Miller’s pyramid6 models this idea well by implying that knowledge is necessary, but not sufficient, for understanding. Understanding is necessary but not sufficient for ability (or competence), and ability is necessary but not sufficient for actual day-to-day performance.6 Each new layer of the pyramid reconstructs what had been separated.

This proves to be highly relevant in assessment because studies have demonstrated that doctors’ abilities when assessed in a controlled (deconstructed) environment do not dependably predict their actual day-to-day performance.7,8 To know how a doctor is actually performing on a daily basis, he or she must be assessed when engaged in normal work. In essence, this represents the case for the importance of workplace-based assessment (WBA).

WBA evaluations show poor engagement and disappointing reliability

Because WBA measures what no other assessments can, it has been rapidly incorporated into postgraduate assessment programmes around the world. In the UK, for example, it features in the programme of every Royal College.9 Nevertheless, it has been relatively unpopular. A report of the UK Academy of Medical Royal Colleges summarised a number of surveys thus:

‘The profession is rightly suspicious of the use of reductive “tick-box” approaches to assess the complexities of professional behaviour, and widespread confusion exists regarding the standards, methods and goals of individual assessment methods… This has resulted in widespread cynicism about WBA within the profession, which is now increasing.’9

Furthermore, where WBA methods have been psychometrically evaluated, highly variable results have been obtained. In a few cases, results show good measurement characteristics (e.g. Nair et al.10), but in many postgraduate training programmes, scores are found to be very vulnerable to assessor differences, and assessors have generally been indiscriminate in rating most trainees very positively.11,12 This means that from a psychometric perspective, very large numbers of assessors and cases are required to discriminate reproducibly among trainees.

Where do we go from here?

This paper offers some suggestions for improving WBA by looking at some basic instrument design issues. The literature includes many constructive suggestions as to how the situation might be improved, most of which focus on better implementation, the provision of more resources and better assessor training. This article, however, considers the design of WBA methods. Industrial design is commonly viewed as both a science, underpinned by ergonomics, metallurgy, electronic engineering and so forth, and an art, in that it puts a new innovative framework through and around an existing object. In this paper, we look at some questions of WBA design, firstly by reviewing some evidence and then by using some creative speculation about the art of the possible.

Methods

As part of our research strategy, we looked first at the variation in outcomes across recent WBA studies, and then for design features of assessments that were associated with better validity or reliability.

The outcomes we examine are psychometric characteristics. A better-designed assessment should reduce assessor disagreement and increase discrimination among those being assessed; together, these outcomes yield more reliable scores and, by implication, more valid tests.
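
To make this logic concrete, the generalisability coefficient can be thought of as the proportion of observed-score variance attributable to genuine differences between trainees. The minimal sketch below uses entirely hypothetical variance components, not data from any study cited here; it simply illustrates how reduced assessor disagreement (smaller error variance) and greater discrimination (larger between-trainee variance) both raise the coefficient.

    # Minimal sketch (hypothetical variance components, not data from any
    # study cited in this paper): the generalisability coefficient as the
    # proportion of observed-score variance attributable to real
    # differences between trainees.

    def g_coefficient(var_trainee: float, var_error: float, n_obs: int = 1) -> float:
        """Reliability of the mean of n_obs observations per trainee."""
        return var_trainee / (var_trainee + var_error / n_obs)

    # Assessors discriminate widely and agree reasonably well
    print(round(g_coefficient(var_trainee=1.0, var_error=1.5, n_obs=6), 2))  # ~0.8

    # Assessors cluster everyone near the top of the scale: between-trainee
    # variance shrinks and the same error now swamps the signal
    print(round(g_coefficient(var_trainee=0.3, var_error=1.5, n_obs=6), 2))  # ~0.55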

Most of the data we examine are relatively new or reflect the reanalysis of data from existing investigations. We use some of our own observations as examples and hope that these will stimulate other investigators to interrogate their own data for similar themes and to test some of the remaining hypotheses.

We start at the endpoint of the process. What do assessors think they are measuring and where do they put their mark? In other words, what is the issue that the assessor is responding to, and what type of box is the tick going in?

Results

What scales work best?

Assessors may agree on performance, but interpret response scales differently

Assessor training, and several standard-setting procedures, frequently include ‘norming’ or ‘calibration’ groups in which assessors independently rate a sample of performance (usually from video) and then discuss any differences.12 Interestingly, they often disagree over their interpretations of the response scale even when they agree about what they have observed. Common points of discussion include the issue of whether a poor ability to relate to patients falls within the ‘communication’ or the ‘professionalism’ domain, and whether a performance that ‘wasn’t very good’ should be deemed as ‘unsatisfactory’.

Table 1 provides some examples of the use of such scales from the mini-CEX,4 case-based discussion (CBD)13 and procedure-based assessment (PBA)14 instruments. Response scales align with one of a variety of constructs: a trait with ordinal levels of performance (degree of ‘merit’); a developmental level of training; and, rarely, a clinician-aligned construct such as ‘readiness for independent practice’.

Table 1.  Examples of three types of workplace-based assessment scale construct from the mini-clinical evaluation exercise (mini-CEX), case-based discussion (CBD) and procedure-based assessment (PBA)

Instrument and items: Mini-CEX, all items (e.g. interview skills, examination skills, management)
Scale: Numeric 1–9 scale with three range anchors: ‘unsatisfactory’ (1–3), ‘satisfactory’ (4–6) or ‘superior’ (7–9)
Construct: Each viewed as a normative trait with ordinal levels of merit

Instrument and items: CBD, all items
Scale: Ordinal categoric 6-point scale with six anchors, from ‘Well below expectations for F1 completion’ to ‘Well above expectations for F1 completion’ [F1 = first year after qualification]
Construct: Developmental level (in this case related to level and timing of training)

Instrument and items: PBA, global summary
Scale: Categoric 4-point scale with four anchors: 1 = unable to perform the procedure observed, or part thereof, under supervision; 2 = able to perform the procedure, or part observed, under supervision; 3 = able to perform the procedure with minimum supervision; 4 = competent to perform the procedure unsupervised
Construct: Clinical independence or readiness for independent practice with ordinal levels

A parallel evaluation of PBA with the non-technical skills for surgeons (NOTTS) and objective structured assessment of technical skills (OSATS) instruments, undertaken as part of a large study of WBA methods,15 suggested that, when using the PBA global summary scale, assessors agreed with one another much more closely and were much more discriminating than they were when using comparable scales on the other instruments. The PBA global scale is unusual in being so well aligned to the expertise and priorities of clinician-assessors.

Prompted by this observation, Crossley et al.16 tested the hypothesis that such assessor-relevant alignment improves the reliability of scores, and whether the effect generalises to other methods of WBA. They took three methods of WBA (the mini-CEX, CBD, and two versions of a measure of acute care performance) and compared the performances of their existing conventional scales with those of new scales specifically aligned to the construct of developing clinical sophistication and independence, a construct that has been identified elsewhere as ‘entrustability’.17 The original scales were normative and developmental. For example, on a mini-CEX a developmental descriptor for a good performance might be ‘Performed at level expected during advanced training’. This type of anchor is very common across WBA methods. The new scales, by contrast, were ‘clinically anchored’. For example, an equivalent clinical anchor for the ‘expected level’ developmental descriptor is: ‘Demonstrates excellent and timely consultation skills, resulting in a comprehensive history and/or examination findings in a complex or difficult situation. Shows good clinical judgement following encounter.’

The study used the two types of scale across more than 2000 medical trainees in a sample of 24 322 assessments conducted by 4000 assessors.16 If the new scale did indeed facilitate a more valid reflection of progression through postgraduate training in the eyes of clinician-assessors, we would expect to find two psychometric consequences:

  1. trainees should be discriminated more widely, by contrast with the findings of previous studies, which demonstrated extensive clustering of good and high performers (e.g. Nair et al.10), and
  2. there should be better agreement among assessors, both in terms of their overall perception of the standard required and in their responses about particular trainees.

This is exactly what the results showed. Simply aligning the scale with the priorities of clinician-assessors substantially increased assessor discrimination and reduced assessor disagreement. In the mini-CEX, CBD, and the ‘on-take’ and ‘post-take’ versions of the Acute Care Assessment Tool (ACAT),18 these changes were sufficiently large to reduce the number of assessments required to achieve good ‘in-training’ reliability (generalisability coefficient of 0.7) from six to three, eight to three, 10 to nine and 30 to 15, respectively, facilitating a saving of approximately 50% in assessor workload.
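
For readers unfamiliar with how such figures are derived, the number of assessments needed to reach a target coefficient follows from the standard decision-study (Spearman–Brown) relationship. The sketch below is illustrative only: the single-assessment values are hypothetical choices broadly consistent with the mini-CEX figures quoted above, not the study’s actual variance components.

    import math

    # Decision-study sketch: how many assessments are needed to reach a
    # target generalisability coefficient, given the reliability of a single
    # assessment. The single-assessment values below are illustrative only,
    # chosen to be broadly consistent with the mini-CEX figures quoted in
    # the text (six assessments versus three).

    def assessments_needed(g_single: float, g_target: float = 0.7) -> int:
        """Smallest n with n*g / (1 + (n - 1)*g) >= g_target (Spearman-Brown)."""
        n = (g_target / (1 - g_target)) * ((1 - g_single) / g_single)
        return math.ceil(n)

    print(assessments_needed(0.30))  # conventional scale      -> 6
    print(assessments_needed(0.45))  # construct-aligned scale -> 3

Because this relationship is non-linear, even modest gains in per-assessment agreement and discrimination can roughly halve the sampling burden, which is the pattern reported above.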

Not only did the reliability improve markedly, but it did so across a wide variety of measurement instruments used in variable contexts. Why did this happen?

Response scales need to reflect cognitive structuring

As long ago as 1980, various observers remarked that the cognitive characteristics of raters have greater influence on the rating process than more genetically or institutionally fixed attributes (e.g. sex, age, race, job). In a landmark review of many studies,19 the authors suggested, for example, that more experienced and more cognitively complex raters were less susceptible to halo effects and also preferred detailed anchors to minimal descriptors. They also found, perhaps unsurprisingly, that the perceived purpose of rating has a substantial effect on the cognitive process of the rater. However, their main conclusion suggested a lesser need to investigate the format of a rating form (and even suggested a moratorium on format-related research) than to understand, and appropriately utilise, the cognitive schema of the raters.

Other evidence affirms the importance of raters’ cognitive frameworks, but shows how the response format might exploit these frameworks by good alignment.20 In an innovative study, student audiologists, after initial training on the rating scales, were asked to rate the quality of voice production using four types of scale, including one with no anchors at all, one anchored solely with textual descriptors, one anchored with naturally generated auditory stimuli, and one in which both text and auditory anchors were present. The use of the auditory anchors resulted in significantly smaller 95% confidence intervals for ratings of mild voice disorders (the most difficult types of disorder to rate), as well as examples of breathy and hoarse voice qualities. Scales with textual anchors showed better inter-rater reliability than scales with no anchors, but were generally not as strong as scales with auditory anchors. The combination of textual and auditory anchors resulted in the greatest degree of inter-rater reliability as assessed via mean correlations and confidence limits around voice quality identification.

In a wider context, as previously identified,19 research on the value of construct alignment in WBA consists predominantly of field studies. The authors stress that, as a result, it is difficult to make comparisons across these studies with respect to the dimensions of performance examined. Each discipline, specialty and profession has a different conception of what may be important in assessing its trainees; consequently, each rating instrument is ultimately unique. Indeed, in the study by Crossley et al.,16 scale anchors represented an ‘uncomfortable mixture’ of separate domains on the various assessment forms because it was difficult for the authors to write specific clinical anchors that could be used across a wide variety of contexts. However, that may be the point: the response scale needs to be aligned to the reality map of the judges. Nevertheless, it is notable that, in nursing, the superiority of clinically anchored scales for WBA rating has been recognised for some time.18,21,22 Thus these results resonate, to some degree, across disciplines.

Given the complexity of clinical training, it would be difficult to replicate the auditory style of an anchor in medicine for ‘clinical competence’, but, clearly, anchors that, literally, resonate with raters’ experiences might be a more profitable avenue of exploration than abstract descriptors such as ‘at expected level’ or ‘satisfactory’. Abstract descriptors feature absolutely no points of reference as to what a rater might be looking for in assigning a trainee to a category. Hence, there may be room in some scales for pictorial anchors (for some skills, such as suturing and some examination skills) similar to the pictures used in recipe books that could be used for peer, supervisor and self-rating. Alternatively, anchors could be developed from research on the concepts (reality maps) with which the supervisors are comfortable (e.g. Bogo et al.23,24).

Ask for judgements rather than objective observations

Different WBA instruments ask about performance from different conceptual starting points. Table 2 provides two examples (item stem and response options) from each of three instruments: the PBA instrument14 for assessing surgical skills; the Sheffield Assessment Instrument for Letters (SAIL)25 for assessing clinical referral correspondence, and the mini-CEX4 for assessing clinical encounters.

Table 2.  Example workplace-based assessment items from the procedure-based assessment (PBA), the Sheffield Assessment Instrument for Letters (SAIL) and the mini-clinical evaluation exercise (mini-CEX)

Item reference: PBA PL4
Item stem: ‘Ensures the operation site is marked where applicable’
Response options: D = development required; S = satisfactory standard

Item reference: PBA global summary
Item stem: ‘Level at which completed elements of the [operation] were performed on this occasion’
Response options: 1 = unable to perform the procedure, or part observed, under supervision; 2 = able to perform the procedure, or part observed, under supervision; 3 = able to perform the procedure with minimum supervision; 4 = competent to perform the procedure unsupervised

Item reference: SAIL 1
Item stem: ‘Is there a medical problem list?’
Response options: 1 = yes; 0 = no

Item reference: SAIL global rating
Item stem: ‘This letter clearly conveys the information I would like to have about the patient if I were the next doctor to see him or her’
Response options: 1–10 (Likert scale)

Item reference: Mini-CEX 2
Item stem: Physical examination skills
Response options: 1 (unsatisfactory) to 9 (superior)

Item reference: Mini-CEX 3
Item stem: Humanistic qualities/Professionalism
Response options: 1 (unsatisfactory) to 9 (superior)

If we apply Donabedian’s taxonomy,26 we see that the mini-CEX seems to address performance at the structural level (the relatively stable characteristics, or traits, of the doctor), ‘PBA PL4’ and ‘SAIL 1’ approach performance at the process level, and the ‘PBA global summary’ and ‘SAIL global rating’ ask about performance at the outcome level. In this rating activity, outcome or structure-level questions require a degree of judgement; it is not simply a matter of establishing whether or not something took place. However, concerns about subjectivity have, over the past few decades, led to a history of instruments focused at the process level in an attempt to increase examiner agreement. For example, ‘made visual contact’, ‘told patient where to put clothes’, and ‘shook hands’ are common performance items from consulting assessments developed during the 1970s and 1980s. The performance score is usually based on the sum of scores on the items.

However, as described in the Introduction, perhaps performance is more than the sum of its parts. In other words, perhaps:

‘shook hands’ (process) + ‘made visual contact’ (process) ≠ ‘establish rapport’ (outcome) or ‘interpersonal skill’ (structural attribute).

Perhaps a doctor with interpersonal skills will implement his or her process behaviours differently depending upon the unique nature of the interaction in order to achieve rapport or trust: he or she might avoid the handshake when the gesture might come across as implying superiority or as overly formal, and might avoid making eye contact when the patient may perceive such contact as representing an unwelcome challenge or level of intimacy.

Assessors judge performance more consistently and discriminatingly when they are not tied to process level observations

If performance is more complex than the sum of its parts, and if a good performance is something upon which appropriately experienced observers agree, we might expect a counter-intuitive observation: subjective judgements about outcome-level performance or structure-level attributes might produce as much assessor agreement as objective responses about what actually took place, if not more, together with greater discrimination among performances.

This is exactly what the literature demonstrates. The phenomenon was described by Regehr et al.,27 who discovered that the global scale that accompanied OSCE items (for standard-setting purposes only) provided more reliable scores than the items themselves. Many other evaluations confirm that the reliability of subjective judgements is commonly at least as good as that of objective checklists. To cite the examples in Table 2, three surgeons observing one case each discriminated among trainees with a reliability of 0.76 based on the sum of their checklist responses (PBA PL4, etc.), but with a reliability of 0.82 based on their simple unstructured outcome judgement at the end of the operation (PBA global summary).15 Similarly, three judges scoring 10 letters each discriminated among trainees with a reliability of 0.72 based on the sum of their checklist responses (SAIL 1, etc.), but with a reliability of 0.74 based on their outcome judgement (SAIL global rating).25

In essence, scraping up the myriad evidential minutiae of the subcomponents of the task does not give as good a picture as standing back and considering the whole. In this situation, the assessor develops an approach to the checklist that involves a kind of instrumental impressionism, whereby he or she makes a judgement that is global but, nevertheless, is vitally dependent on an overall, somewhat merged, perception of the details. In this setting appropriately experienced (and trained) assessors interpret behaviours in context and in combination such that they are able to judge the relatively stable attributes that underpin the behaviours with greater agreement and discrimination than a measure of the sum of those behaviours.

How generic are WBA methods?

Most WBA instruments ask for judgements about all performance domains

This is an interesting feature of WBA instrument design. Although the instruments were developed to assess performance in a very wide range of contexts (clinical encounters, technical procedures, written correspondence, case discussions, emergency care, etc.), they almost all ask about the same domains of performance, such as: clinical method (history taking and examination); clinical judgement (diagnosing and planning); communication; professionalism, and organising or managing the clinical encounter. It is unclear why designers consider that every context provides good data for assessing every domain. This may be a byproduct of attempts to produce generic scales. It may also reflect a desire for efficiency, the influence of the accepted competency domains (knowledge, skills and professional behaviours), or nervousness about brief assessment forms. The latter arises because brief forms evaluated by ‘classical’ internal-consistency statistics typically appear poorly reproducible: such coefficients rise with the number of items, so a short form looks unreliable even when its individual items perform well.

The obvious question then concerns whether every context provides equally valid and reliable data for every domain. If so, we should expect that examiner agreement and discrimination over any particular domain (e.g. organisation) will be the same whether it is observed in a clinical encounter or a handover. In fact, that is not what the data show. When G studies examine domain-level scores, some domain scores display better assessor agreement and discrimination than others. Critically, the relative reliability of domain scores varies across contexts. Table 3 illustrates this by presenting the domains from three diverse instruments,4,13,28 with a pool of data recently collected in a number of different studies of these methods of assessment in the workplace.15,16,29 For each domain–method combination, the predicted reliability of an assessment standardised to 10 observations is given. Many domains in the mini-CEX and CBD tools are reliably assessed, but ‘organisation and efficiency’ is assessed most reliably in the mini-CEX, whereas ‘medical record keeping’ is most reliably assessed in the CBD. In the ACAT, no domain reaches satisfactory reliability, but handover achieves the best result and this element of clinical practice is not sampled anywhere else within these three tools. The data presented here are quite limited, and there are apparent exceptions to the ‘rule’, but arguably those domains of performance that are clearly demonstrated in the context or activity being observed are associated with more reliable judgements. Perhaps this is because they sample the domain construct more effectively in that context. In summary, assessors may make more reliable, and hence more valid, judgements about domains of performance that they can see clearly demonstrated in a particular context or activity.

Table 3.  Workplace-based assessment instruments organised by performance domain, with associated reliabilities* of the appropriate tool, standardised to 10 items per domain

Domain: Clinical method
Mini-CEX items: Medical interview skills (0.75); Physical examination skills (0.75)
CBD items: Clinical assessment (0.73)
ACAT items: Clinical assessment (0.50)

Domain: Clinical judgement
Mini-CEX items: Clinical judgement (0.72)
CBD items: Investigations and referrals (0.69); Treatment (0.70); Follow-up and future planning (0.75)
ACAT items: Investigations and referrals (0.27); Management of the critically ill patient (0.46)

Domain: Communication
Mini-CEX items: Counselling and communication skills (0.76)
CBD items: Medical record keeping (0.77)
ACAT items: Medical record keeping (0.33); Handover (0.55)

Domain: Professionalism
Mini-CEX items: Consideration for the patient/professionalism (0.72)
CBD items: Professionalism (0.77)
ACAT items: Clinical leadership (0.49)

Domain: Organisation
Mini-CEX items: Organisation/efficiency (0.81)
CBD items: (none)
ACAT items: Time management (0.51); Management of take (0.49)

* This table is compiled from a number of sources of data, including studies involving the first author15 and previously published work in the area16,31
Mini-CEX = mini-clinical evaluation exercise; CBD = case-based discussion; ACAT = Acute Care Assessment Tool

Which assessors are best-placed to judge?

Different respondent groups provide discrete perspectives over and above the expected person-to-person variation

Multi-source assessment and feedback (MSF) has largely superseded peer ratings because of the conviction that it is important to gather judgements from several different perspectives. The first rational question, then, is: do the different respondent groups provide different perspectives? If they do, then MSF adds value (and not just numbers) to single-group peer ratings. Different gazes will be reflected in two psychometric outcomes: if some groups are genuinely more stringent than others, an appropriately designed G study will show that a respondent’s group designation accounts for some score variation over and above the baseline variation among individual raters; and if groups have different ‘tastes’ (i.e. rank the subjects differently), the interaction between designation and the doctor being rated will also account for some variation.

This is observed in the data. A number of studies report that raters of different designations rate with different levels of stringency in assessing consultants30 or junior doctors31 and across the full range of medical specialties.32 In each case, junior doctors are the most lenient; progressively more empowered staff groups provide progressively more stringent ratings. Furthermore, using the datasets from the Royal College of Physicians’ evaluation of revalidation assessments reported by Wilkinson et al.,29 and from the NCAS normative data sample reported by Crossley et al.,30 we found that the rater’s designation accounted for 10.0% and 6.2% of all score variation in the two studies, respectively, and the interaction between rater designation and the people being rated accounted for 1.0% and 6.2% of all score variation across the studies, respectively.33 In other words, different respondent designations have different standards and different views of an individual doctor; typically some doctors are preferred by nursing staff and some by their peers.
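
For illustration, the sketch below shows how such percentages are computed from a G-study variance decomposition; the component values are hypothetical and merely echo the pattern reported here, rather than reproducing the estimates from either dataset.

    # Sketch of how the quoted percentages are derived from a G-study
    # variance decomposition. The component values are hypothetical,
    # chosen only to echo the pattern reported in the text.

    variance_components = {
        "doctor": 0.30,                    # genuine differences between doctors
        "rater_designation": 0.10,         # systematic stringency of each staff group
        "designation_x_doctor": 0.06,      # groups ranking individual doctors differently
        "rater_within_designation": 0.24,  # stringency of individual raters
        "residual": 0.30,                  # everything else (unmodelled error)
    }

    total = sum(variance_components.values())
    for facet in ("rater_designation", "designation_x_doctor"):
        share = 100 * variance_components[facet] / total
        print(f"{facet}: {share:.1f}% of all score variation")
    # With these illustrative values: 10.0% and 6.0%, respectively.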

The views of some designations are more valid than those of others

Given that different designations provide different perspectives, it seems rational to ask whose perspective is the most valid. In some cases the answers are self-evident. For example, few clerical staff or patients are likely to be able to comment on a clinician’s judgement. In the study by Mackillop et al.,32 one respondent points out:

‘As I am a secretary I was unable to answer any questions regarding the Dr’s clinical work although from comments from others I believe this to be good. I can only answer what I know.’

This means that response rates are usually low when non-clinicians are asked to judge clinical items. However, even a 50% response rate among clerical staff about ‘diagnostic skill’ and the ‘management of complex clinical problems’ raises questions.32 In other domains it may be less clear, or even surprising, who is the most ‘valid’ or best-placed respondent. However, there is a clear trend in the evaluation data: respondent groups who regularly observe an aspect of performance agree with one another most closely. For example, in the study reported by Mackillop et al.,32 nurses provided around 25% of ratings of trainee doctors’ performance of largely ward-based activities. Nurses agreed in their assessment of trainees to the degree that 15 nurses’ ratings provided scores with a reliability coefficient of 0.81, whereas ratings by 15 allied health professionals (AHPs), 15 doctors and 15 clerical staff achieved reliability coefficients of 0.77, 0.74 and 0.69, respectively. Why should this be? Perhaps it is because nurses, followed by AHPs, see the greatest quantity of trainee doctors’ ward-based activities. If the scrub nurse rarely sees the surgeon’s bedside manner, then the data obtained from such a source are subject to maximal construct-irrelevant variance; they could, for example, amount to ‘hearsay’, which is inadmissible in a court of law. However, for the same reason that no single assessment method can encompass all of clinical competence, it is clear that no single professional group can assess it either. Clinical competence is so broad that no-one sees it all. Each method represents a lens on performance, and different health professionals act as the viewers who look through those lenses. Investigations into the capability of professional groups to assess aspects of practice might start by assessing the scope of their contact and collaboration with other groups. (This might also lead to heightened professional respect.) Ultimately, assessors who have the competence to judge an aspect of performance, and who have had the opportunity to observe it, appear to provide more reliable ratings.

Conclusions

Some of our observations are better evidenced than others. However, the overall picture seems compelling: because high-level assessment is a matter of judgement, it works better if the right questions are asked, in the right way, about the right things, of the right people.

In many respects, the most remarkable observation might be how irrational we have been to date in designing WBA instruments and processes. We have often asked all respondents to comment on all areas of performance, regardless of their expertise or their opportunity to observe. We have often wasted the integrating, contextualising, weighting capacity of appropriate (and expensive) judges by limiting them to certain types of observation. We have often asked judges to comment on domains of performance that they do not observe and can, at best, only infer. We have frequently confronted assessors with self-evidently loose ‘merit-oriented’ or ‘training-oriented’ response scales that include pejorative or determinative statements, and expected them to interpret and use those items meaningfully and consistently. It is reasonable to say that the instruments may have served to obscure assessors’ judgements rather than to illuminate them.

We should move away from these types of response format in WBA. Rather, we recommend that care be taken to align the construct of the response scale with the reality maps of the judges. This requires both forethought and field testing.

Contributors:  both authors conceived the ideas in the article. JC was principally responsible for the analysis and interpretation of the data. JC wrote a first draft and BJ critically revised it for intellectual content. Both authors approved the final manuscript.

Acknowledgments

Acknowledgements:  the authors thank the Royal College of Physicians of London, the Sheffield Surgical Skills Study Research Group, and Dr Helena Davies, Academic Unit of Child Health, University of Sheffield, for their collaboration in many of the studies cited or reanalysed for this paper.

Funding:  none

Conflicts of interest:  none.

Ethical approval:  ethical approval was not required for this study. Approval was sought where appropriate for the original studies.

References

  1. Biggs J, Collis K. Evaluating the Quality of Learning: the SOLO Taxonomy. New York, NY: Academic Press 1982.
  2. Bloom B, ed. Taxonomy of Educational Objectives. The Classification of Educational Goals. Handbook I: Cognitive Domain. New York, NY: McKay 1956.
  3. Crossley J, Humphris G, Jolly B. Assessing health professionals. Med Educ 2002;36 (9):800–4.
  4. Norcini JJ, Blank LL, Duffy FD, Fortna GS. The mini-CEX: a method for assessing clinical skills. Ann Intern Med 2003;138:476–81.
  5. McClelland D. Testing for competence rather than for intelligence. Am Psychol 1973;28:1–14.
  6. Miller G. The assessment of clinical skills/competence/performance. Acad Med 1990;65 (Suppl):63–7.
  7. Rethans J, Sturmans F, Drop R, van der Vleuten C, Hobus P. Does competence of general practitioners predict their performance? Comparison between examination setting and actual practice. BMJ 1991;303 (6814):1377–80.
  8. Kopelow M, Schnabl G, Hassard T, Klass D, Beazley G, Hechter F, Grott M. Assessing practising physicians in two settings using standardised patients. Acad Med 1992;67 (Suppl 10):19–21.
  9. Academy of Medical Royal Colleges. Improving Assessment. London: AoMRC Press 2009.
  10. Nair BR, Alexander H, McGrath B, Parvathy MS, Kilsby E, Wenzel J, Frank I, Pachev G, Page G. The mini-clinical evaluation exercise (mini-CEX) for assessing clinical performance of international medical graduates. Med J Aust 2008;189 (3):159–61.
  11. Weller JM, Jolly B, Misur MP, Merry AF, Jones A, Crossley JG, Pederson K, Smith K. Mini-clinical evaluation exercise in anaesthesia training. Br J Anaesth 2009;102:633–41.
  12. Kogan J, Holmboe E, Hauer K. Tools for direct observation and assessment of clinical skills of medical trainees: a systematic review. JAMA 2009;302:1316–26.
  13. Davies H, Archer J, Southgate L, Norcini J. Initial evaluation of the first year of the Foundation Assessment Programme. Med Educ 2009;43:74–81.
  14. Pitts D, Rowley D, Sher J. Assessment of performance in orthopaedic training. J Bone Joint Surg 2005;87 (9):1187–91.
  15. Beard JB, Marriott J, Purdie H, Crossley J. Assessing the surgical skills of trainees in the operating theatre: a prospective observational study. Health Technology Assessment Programme, National Institute for Health Research. http://www.hta.ac.uk/1626. [Accessed 11 January 2011.]
  16. Crossley J, Johnson G, Booth J, Wade W. Good questions, good answers: construct-aligned scales perform much better than other scales for WBA. Med Educ 2011;45:560–9.
  17. ten Cate O, Snell L, Carraccio C. Medical competence: the interplay between individual ability and the health care environment. Med Teach 2010;32 (8):669–75.
  18. Redfern S, Norman I, Calman L, Watson R, Murrells T. Assessing competence to practise in nursing: a review of the literature. Res Papers Educ 2002;17 (27):51–77.
  19. Landy FJ, Farr JL. Performance rating. Psychol Bull 1980;87:72–107.
  20. Awan S, Lawson LL. The effect of anchor modality on the reliability of vocal severity ratings. J Voice 2009;23:341–52.
  21. Bondy KN. Criterion-referenced definitions for rating scales in clinical evaluation. J Nurs Educ 1983;22:376–82.
  22. Bondy KN, Jenkins K, Seymour L, Lancaster R, Ishee J. The development and testing of a competency-focused psychiatric nursing clinical evaluation instrument. Arch Psychiatr Nurs 1997;11:66–73.
  23. Bogo M, Regehr C, Power R, Hughes J, Woodford M, Regehr G. Toward new approaches for evaluating student field performance: tapping the implicit criteria used by experienced field instructors. J Soc Work Educ 2004;40 (3):417–26.
  24. Bogo M, Regehr C, Hughes J, Power R, Globerman J. Evaluating a measure of student field performance in direct service: testing reliability and validity of explicit criteria. J Soc Work Educ 2002;38 (3):385–401.
  25. Crossley J, Howe A, Newble D, Jolly B, Davies H. Sheffield Assessment Instrument for Letters (SAIL): performance assessment using out-patient letters. Med Educ 2001;35:1115–24.
  26. Donabedian A. Evaluating the quality of medical care. Milbank Q 1966;44:166–206.
  27. Regehr G, MacRae H, Reznick RK, Szalay D. Comparing the psychometric properties of checklists and global rating scales for assessing performance on an OSCE-format examination. Acad Med 1998;73 (9):993–7.
  28. Johnson GJ, Barrett J, Jones M, Wade W. The acute care assessment tool: a workplace-based assessment of the performance of a physician in training on the acute medical take. Clin Teach 2009;6:105–9.
  29. Wilkinson J, Crossley J, Wragg A, Mills P, Cowan G, Wade W. Implementing workplace assessment across the medical specialties in the United Kingdom. Med Educ 2008;42 (4):364–73.
  30. Crossley J, McDonnell J, Cooper C, McAvoy P, Archer J. Can a district hospital assess its doctors for re-licensure? Med Educ 2008;42 (4):359–63.
  31. Bullock A, Hassell A, Markham W, Wall D, Whitehouse A. How ratings vary by staff group in a multi-source feedback assessment of junior doctors. Med Educ 2009;43:516–20.
  32. Mackillop L, Crossley J, Vivekananda-Schmidt P, Wade W, Armitage M. A single generic multi-source feedback tool for revalidation of all UK career-grade doctors: does one size fit all? Med Teach 2011;33:75–83.
  33. Crossley J, Wilkinson J, Davies H, Wade W, Archer J, McAvoy P. Putting the 360 degrees into multi-source feedback. Abstract presented at the Association for the Study of Medical Education Conference on Multi-source Feedback, London, 13 December 2006;5.