Editor’s note: This is the first in a new series in Medical Education entitled ‘Dialogue’. Each publication in the series will be a transcription of an e-mail discussion about a current issue in the field held by two scholars who have approached the issue from different perspectives. The accompanying editorial in this issue gives more details.1

Kevin Eva:

I’ve stolen the title for this first ‘Dialogue’ from a paper you wrote a few years ago in which you used the journey between Scylla and Charybdis, the dangers Odysseus had to negotiate on his voyage, as a metaphor for the difficulty of navigating between ‘excessive examination and naïve reliance on self-assessment’.2 The metaphor seems equally fitting for describing the difficulties we face in navigating between objectification and judgement in assessment. Do we choose to crash up against the rocky shoal of checklists and the atomisation of medicine they promote or to be sucked down into the whirlpool that is subjectivity and the concerns about fairness and defensibility that go with it?

Brian Hodges:

It’s an important issue as the words ‘objectification’ and ‘judgement’ are used all the time in relation to assessment. According to Wikipedia, ‘objectification’ is the process by which an abstract concept is treated as if it is a concrete thing or physical object. ‘Judgement’ is the evaluation of evidence in the making of a decision. These are not neutral terms. They are actually quite loaded. They mean different things in different contexts and they conjure up important ways of understanding how the world works. The ‘valence’ we give to them renders them positive or negative. For example, in relation to objectification, there is a clear difference between the sense of the word in ‘we made the test much more fair by objectifying the passing criteria’ and that in ‘today’s body-conscious society has completely objectified women’. Similarly, judgement sounds considerably more positive in some contexts (e.g. ‘a difficult decision was made using expert judgement’) relative to others (e.g. ‘the questions sounded fair but the way she looked at me – I could just feel the judgement’).

KE:

That dichotomy seems like a perfect representation of both the assessment issue we’ve been grappling with and the state of health professional education in general. Every construct worth advocating for has a downside that needs to be part of the conversation. Anderson explicitly discussed this when he described the conflict between the humanistic elements of health care and biomedical views by stating: ‘…society’s reaction has been to engage in a curious love–hate relationship… whereby we demand ever more of the latter even as we bemoan its dominance, and yearn for the former while remaining wary of its inefficiencies and lack of uniformity.’3 The quote seems equally applicable to the desire for assessment protocols that offer both a systematic and inarguable measurement of one’s ‘actual’ ability (with the downside of atomisation and all that goes along with it) AND more recognition and valuing of the broader things we expect medical practitioners to do that can’t easily be compartmentalised into a simple checkbox or multiple-choice question item (i.e. things that require judgement in a manner that will elicit both positive and negative valence). For some of these issues, it may simply be that we need to treat them as physicists treat light – we can’t understand it solely as a particle or as a wave even though the two perspectives don’t easily mesh, a metaphor that I’ve recently noticed Brian Jolly also put forward.4

BH:

During the last half of the 20th century, ‘subjectivity’ became a bad word. In assessment it was associated with unreliability. In turn, unreliability was associated with unfairness. The classic story is that of oral examinations. In the US, work by McGuire, among others, showed that oral examinations – at that time used for national licensure examinations for all medical graduates – had inter-rater correlations of < 0.25.5 These examinations were unreliable psychometrically: they were hopelessly tainted by subjective judgement. So they were eliminated. This was a rather dramatic excision of what had for decades been a widely used assessment tool. In the US, only written tests were deemed to be sufficiently ‘objective’ to be used. This left the curious situation in which, just as Miller’s pyramid was coming into vogue and an emphasis on competence as performance (and not just knowledge) was taking hold, the only means to assess performance competence was eliminated. A European educator pointed out to me that this was not the route taken in the UK.6 There, reliability was not thought to trump the need for observation. He expressed puzzlement that from the 1960s until 2005, when the Step 2 Clinical Skills objective structured clinical examination (OSCE) was adopted as part of the US Medical Licensing Examination (USMLE), there was no performance-based assessment in the USA. And it was for one reason only – but a very powerful one: the problem of ‘subjectivity’ had to be overcome.

KE:

Yet there are good arguments suggesting that abandoning judgement may equate to throwing out the baby with the bathwater. One of the things that got me (and many others) interested in the topic was your 1999 work illustrating that global judgements in an OSCE context were as reliable as ‘objective’ checklists, but demonstrated better validity in that they aligned more closely with the expectation that greater experience yields better performance.7 Ron Harden tells me that the ‘objective’ in ‘objective structured clinical examination’ was never intended to be the focal point of the technique. He recalls that the OSCE was primarily a way to ensure that students would be observed performing clinical tasks. What led you to this line of inquiry over a decade ago?

BH:

A peculiar thing happened at the end of the 20th century. After the long period I just described during which any sort of rating scale that was ‘subjective’ fell out of favour, research began showing that ‘global’ ratings used in performance-based assessments had quite good psychometric properties. In fact, research that we did in psychiatry,8 that Reznick and his team (Regehr et al.) did in surgery,9 and that they summarised across several other domains10 showed that global ratings actually had better reliability than checklists.

My original objection to checklists was that communication skills, to my mind at least, could never be binary. I was looking for a tool to assess something that I believed to be on a continuum. Discovering that ANY kind of global rating performed as well as or better than a checklist was a surprise because of the tonne of literature indicating poor psychometric properties of global ratings commonly used during in-training assessments. van der Vleuten and Schuwirth argued that the psychometric era simultaneously ushered in standardisation and multiple sampling, but that the latter is much more important than the former.11 With enough sampling, might it be that standardisation isn’t necessary? Norman went so far as to call for proof that standardised atomistic checklists were even defensible,12 heralding an era in which those interested in assessment once again examined the value of holistic judgement – even in something considered as hopeless as the ‘old oral examination’.

KE:

That’s exactly when I experienced my introduction to the field, so the notion that sampling was more important than structure was a highly formative learning point for me. Although I agree with concerns about atomisation, I’m equally concerned with the tendency within the field to conflate objective, numerical, reliable and fair. I sincerely appreciate the need for trustworthy and replicable measurement (although I think a good argument could be made that the psychometric discourse is too dominant in places), but have never understood how that necessarily means standardisation and the seeking of objective (checklist-y) data.

The lesson of the value of sampling (or the wisdom of crowds) was the fundamental piece of logic behind the work we’ve been conducting in the admissions domain.13 The non-academic qualities the profession desires are arguably the least amenable to objective data collection and, pre-matriculation, we’re not only trying to assess, but we’re trying to use those assessments to predict how the individual applicant will perform in school and beyond. Despite that additional challenge, we’ve found that one can extract useful information if one collects a variety of judgements.14 What I’d love to better understand is what the limits are. Human judgement is flawed – there’s no question – but why is it so much better than other information in some domains (when used in some ways) and so much worse in others?

BH:

Schuwirth and van der Vleuten agree and are among a group of psychometricians internationally who are pointing to the limitations of an over-reliance on a classic psychometric approach (discourse). They have described a number of reasons why such approaches are no longer adequate.15 However, they have said that it remains challenging to define how the field of assessment can begin to reintroduce such things as subjectivity, complexity and the aggregation of data over time, place and rater, and retain ‘rigour’. They ask: ‘How can we build non-psychometric rigour into assessment?’ Do you have thoughts about this?

KE:

Actually, the issue for me isn’t simply one of ‘psychometrics are insufficient’ (and, rest assured, I know that’s not what you were suggesting in your ‘maintenance of incompetence’ paper16). In modern psychometric models, reliability is defined as the proportion of variance attributable to whatever you’re interested in differentiating (students/practitioners in this instance). Other variance is defined as error. That error gets diluted mathematically when we aggregate across multiple observations (e.g. if ‘situation’ is the dominant form of ‘error’, one needs to average across observations collected from many situations to yield reliable measurements).
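
To make that arithmetic concrete, here is a minimal sketch (in Python, with invented variance components rather than figures from any real examination) of the variance-ratio logic: reliability is the person variance divided by person variance plus error variance, and averaging over n observations shrinks the error term by a factor of n.

```python
# Minimal sketch (illustrative numbers only): reliability as the proportion of
# score variance attributable to the people we want to differentiate, and how
# averaging over n observations dilutes random 'error' variance.

def reliability(person_var: float, error_var: float, n_obs: int = 1) -> float:
    """Classic variance-ratio reliability; error shrinks as error_var / n_obs."""
    return person_var / (person_var + error_var / n_obs)

person_var = 1.0   # hypothetical variance due to true differences between learners
error_var = 4.0    # hypothetical variance due to situations, raters, occasions, etc.

for n in (1, 5, 10, 40):
    print(f"n = {n:2d} observations -> reliability = {reliability(person_var, error_var, n):.2f}")
# With these made-up numbers: 0.20, 0.56, 0.71, 0.91 - aggregation alone does
# most of the heavy lifting, exactly as the sampling argument suggests.
```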

If we get away from thinking of the score we’re collecting as ‘the truth’, then there’s no reason we can’t collect data that are subjective/complex and still trust them to provide replicable information. In fact, the psychometric model, if applied well, can give us guidance regarding what it’s most important to aggregate across, thus helping us to improve our decisions without worrying that ‘rigour’ is being lost according to psychometric standards. The key way forward from my perspective is to try to better understand what ‘error’ is random (and, hence, improved by aggregation) and what ‘error’ is systematic (thus indicating biases that may need to be overcome). Gingerich et al.17 frame this as the concern that ‘error’ might not actually be error in the lay sense.

BH:

Point well taken! Nevertheless, it seems to me that debating what KIND of psychometric model is best to understand subjectivity may be beside the point. Consider the work of van Zanten et al.18 They present a rigorous study on the simulated clinical encounters used by the Educational Commission for Foreign Medical Graduates (ECFMG) in which they used a psychometric approach in an attempt to understand, partition and finally dismiss the portion of score variance attributable to ethnicity. The conclusion was that there are effects of ethnicity, but that they WASH OUT if enough data are combined.

What Danette McKinley, one of the authors of this study,18 told me is that she has trouble juxtaposing this sort of work with what is known in clinical medicine.6 Danette is African-American. She told me that she knows (and the literature supports it) that when she takes her mother to the emergency room (ER), she is going to get different treatment because she is Black. That different treatment is based on a whole complex set of assumptions, stereotypes and traditions that all start with judgement. I suppose it could be possible to come up with a very complex psychometric model that could capture, quantify and partition the variance components associated with judgement and ethnicity. It would have to be VERY complex because ethnicity is rather hard to define. In fact, just defining the categories to measure raises problems. What is an Asian?

So even if we could find main effects, and predictable mechanisms that lead systematically to judgements (for instance, there is some evidence that concordant pairs behave differently from discordant pairs), the issue that weighs on my mind is this: if we are going to rehabilitate subjectivity in assessment, is psychometric analysis the way we want to understand the nature and effects of subjectivity? Should we use psychometric formulae to understand what happens when a doctor sees a Black patient in the ER, or a Black student in an OSCE station or on the ward and is asked to rate his or her competence?

KE:

Actually, that is a perfect example to show where I suggest the issue lies. Raw aggregation over many samples of performance is perfectly sufficient if the ‘errors’ one is accumulating over are random. When systematic biases (such as racial stereotyping) impact upon a rater’s judgement, then we have another problem entirely as aggregating across many judgements that are all biased in the same direction won’t get us closer to where we want to be with respect to the validity of the assessment provided. That doesn’t address the issue you raise that one can only ever feasibly sample performance across a limited number of variables as the models do become too complex to define, let alone measure, when we try to imagine how to take everything into account.
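
A quick simulation makes the distinction concrete (the numbers are invented for illustration and do not model any real examination): idiosyncratic rater noise washes out as judgements are pooled, whereas a bias shared by every rater survives aggregation untouched.

```python
# Sketch: random rater noise washes out with aggregation; a shared systematic
# bias does not. All numbers are invented for illustration.
import random

random.seed(1)
true_score = 70.0
shared_bias = -5.0        # e.g. a stereotype-driven penalty applied by every rater
rater_noise_sd = 10.0     # idiosyncratic, direction-less disagreement between raters

def mean_rating(n_raters: int, bias: float) -> float:
    ratings = [true_score + bias + random.gauss(0, rater_noise_sd) for _ in range(n_raters)]
    return sum(ratings) / n_raters

for n in (1, 10, 100, 1000):
    unbiased = mean_rating(n, bias=0.0)
    biased = mean_rating(n, bias=shared_bias)
    print(f"n = {n:4d}  random error only: {unbiased:5.1f}   shared bias added: {biased:5.1f}")
# As n grows, the first column converges on 70 while the second converges on 65:
# more sampling cannot rescue validity when every judgement is biased the same way.
```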

That said, the issue Danette raises is, I think, less about the limits of psychometrics per se than it is aligned with the root of much of the debate around evidence-based medicine – the distinction between individuals and populations. Cronbach, one of the architects of the psychometric discourse, convincingly illustrated how efforts to measure group differences are not only unrelated to individual differences, but can actually impede our ability to observe individual differences (and vice versa).19

There is every reason to believe that, on average, Blacks do receive different health care from Whites. Individuals of high socio-economic status (SES) receive different health care from those of low SES and males receive different health care from females.20 Having done some of that research myself, I can tell you that the effects are often small and variable.21 That statement is in no way intended to minimise the problem – it’s real and a major issue at the level of the population. The large amount of overlap between the distributions, however, means that we can’t assume that because there’s an effect overall, every individual instance will reveal the bias. Some standardised patients will make judgements that preferentially rate those of concordant race; some won’t, and some will show preferences for other races. That’s the reason van Zanten et al.18 found a benefit of averaging. As a result, with all due respect to Danette, she can’t ‘know’ that her mother will receive different treatment each time she goes into the ER even though she’s absolutely right to be concerned about such bias and shouldn’t assume it’s not present.
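
As a rough, purely illustrative sense of scale (the effect sizes below are hypothetical, not estimates drawn from the disparities literature): even a genuine average group difference leaves the two distributions overlapping substantially, which is why an overall effect cannot be read as a guarantee about any single encounter.

```python
# Sketch of how much two normal distributions overlap for a small standardised
# group difference (Cohen's d). Illustrative only; not an estimate of any real disparity.
from math import erf, sqrt

def overlap_coefficient(d: float) -> float:
    """Overlap of two equal-variance normal distributions whose means differ by d SDs."""
    phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    return 2 * phi(-abs(d) / 2)

for d in (0.2, 0.5, 0.8):
    print(f"d = {d:.1f} -> distributions overlap by about {overlap_coefficient(d):.0%}")
# d = 0.2 -> ~92%, d = 0.5 -> ~80%, d = 0.8 -> ~69%: a real average effect,
# yet no certainty about what will happen in any individual encounter.
```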

When we’re talking about assessment, we’re trying to make judgements about performance that allow inferences of competence for an individual. ‘Will this drug work for this person?’ is the analogue of ‘Should this individual be deemed fit to practise in this domain?’ I think that the world of human behaviour is probabilistic to the point that no model of any type will enable us to say with absolute certainty (using judgement or more objective means) how an individual will perform in every possible situation. That’s where I think more modern conceptions of psychometrics (or maybe just my interpretation of those conceptions) and constructivist notions align. When I say we can measure something reliably, I’m not making a statement about measuring some fundamental truth or a statement that I can predict with absolute certainty how the person will perform in all situations. It’s simply a statement that, in general, the person has a tendency to perform well or not perform well within the domain being measured. The issue that got us into this conversation was simply that one can make those types of statement at least as well (and better in some domains) by relying on rater judgement as by seeking ‘fairness’ through more objective measures. How would you have us alter the community’s assessment protocols to grapple with these issues?

BH:

Great question. I will take a stab at it, but I would very much welcome discussion on this point as there are many in the field who have given it a great deal of thought. It seems to me there are two, possibly three directions we could take. I think you are spot on when you bring us back to whether we can make ‘statements’ about competence based on expert judgement that we can live with, without the salve of rigid approaches to objectivity. I highlight your use of the word ‘statements’ because if a ‘statement’ of competence is what we want to end up with, perhaps the translation of behaviours into numbers and then numbers back into statements is an unnecessary detour. This is essentially what the growing chorus of qualitative researchers are suggesting. Kuper et al. argued for this in a Medical Education commentary in 200722 and van der Vleuten and colleagues have recently tried to pick this up and operationalise how qualitative methods and criteria for assessment with language-based data would look.23 So, an exploration of the notion that the ‘data’ we collect about competence could be language-based, rather than numerically based, seems like one avenue to pursue.

The other comes from thinking about the degree to which the whole discussion takes place on a bit of a health professional island. In every other domain, particularly where the marker of competence is a PhD (and I know this has strong resonance among present company), it is striking that the discussion of reliability, validity, objectivity and the like has no traction. And I don’t see it gaining any. By contrast, the PhD process, from supervision right up to the defence, is based on synthetic expert judgement. I don’t see a great deal of resistance to this as a form of determining competence. Is it possible that the problem we have in the health professions is that we lack a curriculum (that includes longitudinal supervision and an apprenticeship model) that would allow for assessment with this sort of deep expert judgement, rather than with our assessment tools? Regehr et al.’s work on supervisors using competence ‘templates’ or ‘profiles’ to match their students to categories takes us in this direction.24 They showed that experts can quite effectively engage in pattern recognition in relation to competence. Or, as Jack Boulet put it to me, it may be that the only question we need to ask is: ‘Would you send a family member to this doctor’?6 If we ask it of enough people, and record their narrative responses, would we need to do anything else?

KE:

That’s certainly what most conclusions drawn from workplace-based assessment studies would lead us to believe … ask enough patients or colleagues for a rating (or a dichotomous judgement) and you’ll get reliable information. However, in most studies I’ve seen, it takes a large and sometimes infeasible number of observations (n = 40 for patient satisfaction; n = 10 for colleague judgement) to get data that would make the psychometricians happy. Could one do better by moving towards using words rather than numbers? I don’t know yet, although I agree there is important work being conducted in this area by many people.17,25
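
The arithmetic behind those sample-size figures is essentially the Spearman–Brown projection: given the reliability of a single observation, it tells you how many observations must be pooled to reach a target level. A small sketch follows (the single-observation reliabilities are invented purely to illustrate the shape of the calculation).

```python
# Sketch of the Spearman-Brown logic behind 'how many observations do we need?'
# The single-observation reliabilities below are invented for illustration.
from math import ceil

def n_needed(single_obs_reliability: float, target: float = 0.80) -> int:
    """Observations required to reach the target reliability by aggregation alone."""
    r = single_obs_reliability
    return ceil((target * (1 - r)) / (r * (1 - target)))

for label, r in [("patient satisfaction ratings", 0.092), ("colleague judgements", 0.29)]:
    print(f"{label}: single-observation r = {r:.2f} -> ~{n_needed(r)} observations for 0.80")
# With these illustrative values the arithmetic reproduces the n = 40 and n = 10
# figures quoted above - the lower the per-judgement reliability, the more
# judgements must be pooled before the psychometricians are satisfied.
```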

Your drawing an analogy with PhD training is particularly intriguing and thought-provoking. My first reaction was that most of the people I know who have served as external examiners on PhD dissertations have had the uncomfortable experience of being less than enthralled with the work presented to them and as a result have had to wrestle with whether or not to pass the student despite the ‘expert judgement’ of the supervisor. My second reaction was that accurate assessment in the PhD world doesn’t matter as much because the potential to do harm is not as great. Moving beyond that initial overly simplistic cynicism, though, has led me to think about the distinction between the purpose for which we measure performance in the PhD world and the purpose for which we measure performance in the MD world. I’ll caricaturise a bit as I realise it’s not quite this extreme, but in evaluating a doctoral dissertation or a submitted paper, what I’m generally assessing is whether or not that piece of work adds meaningfully to the literature and meets the standards of rigour set out by the community. Although that undoubtedly leads naturally to some inferences about the individual’s capacity to engage deeply in the scientific process, I don’t need to make an overall judgement about that person because the next time he or she ‘practises’ the craft, that work will be similarly evaluated and, if it’s not up to par, the candidate’s proposal won’t receive funding and any resulting paper won’t be published (again, I’m caricaturising). In the MD world, by contrast, the judgements we’re asked to make are used to determine whether or not that person should be granted a licence to practise medicine and, if he or she chooses to do so, to practise in a way that can be, for all intents and purposes, unmonitored.

In other words, when we grant a PhD, we’re saying, ‘You have demonstrated the capacity to practise science’, with the implicit understanding that the person won’t practise for long if he or she is not up to par. When we grant an MD, we’re saying: ‘You have the capacity to practise medicine.’ Counter-intuitive as it may seem, such ‘trait-based’ statements fly in the face of a lot of evidence that behaviour is very situationally driven.26 Clearly, one can’t practise medicine or science without a particular knowledge base, but the extent to which that knowledge base is applied effectively may be influenced by any number of factors. The whole movement towards competency-based education reinforces the view that a competence is something one can maintain, by contrast with the lessons taught to us by the adaptive expertise folk.27

So maybe it is a curricular issue both in the sense that there could be benefits from moving back to an apprenticeship model and in the sense that we need to reconceptualise what we’re trying to accomplish with the assessment system and think through how the system needs to change to enable a change in focus. This leads me to propose that marrying notions of scholarship (i.e. peer review, public dissemination of findings, grounding data in ongoing conversation) with conceptual models of assessment could lead to significant and meaningful reorientation of the health professional assessment world by using competencies as a guide while simultaneously reducing the notion that we can measure competence full stop. Given that I started this conversation though, you deserve the final word.

BH:

It has been a pleasure engaging in this discussion. It seems to me that we are entering a phase in the study and conduct of health professional assessment that we could call ‘post-psychometric’, in that the domains of expertise that are engaged to analyse our biggest assessment challenges are becoming much more diverse. You and I seem to share the view that subjective judgement does indeed have value in assessment and that perhaps our predecessors moved too far in its demonisation. Nevertheless, a renaissance of subjective judgement in assessment raises interesting questions about what judgement is and how to sensibly aggregate multiple judgements without losing the richness of multiple perspectives. It also challenges us to think carefully about the ‘media’ we will use to capture, analyse and aggregate those subjective judgements – be they expressed in numeric or linguistic form, or some combination of the two.

Even as we emerge from an era that was perhaps overly focused on objectification, I think we might also agree that the pendulum shouldn’t swing too far in the other direction. Although our field has gained a great deal in terms of better assessment tools and greater fairness for students, in the words attributed to Albert Einstein: ‘Not everything that counts can be counted and not everything that can be counted counts.’

Contributors:  this manuscript is a transcription of an original e-mail correspondence that took place between KE and BH.

Acknowledgements:  none.

Conflicts of interest:  none.

Ethical approval:  not applicable.

References
